Image search is one of those tools that you don’t use on a daily basis, but when you need it there is no workaround (good luck asking Google “What is the name of that tall blonde actress in the fancy long silver dress from the photo posted on a fashion blog?” or “Where can I find her silver dress?”). For me, as a data scientist, the most interesting questions here are: “How does an image search engine work?” and, of course, “How could I build one?”. It would be great if you could just snap a picture of a room and be able to find all the objects from it (sofa, table, lamp, that cozy carpet), or the most similar ones, in a database.
Let’s do it
The most obvious challenge when building an image search engine for 3D objects is that any picture loses the spatial information. Shapes and contours depend on the viewpoint, and the object might be only partly visible in the picture or look totally different from another perspective. There is no single solution to these computer vision problems, so I decided to try several approaches. First, a classical model with a bag of visual words (more about that funny name below). Then, a model based on convolutional neural networks (still very much trending in Machine Learning).
Dataset
To play with some data first, I needed a database of images of two kinds: product photos (preferably typical catalog photos – white background, good lighting) as well as pictures of those products “in the wild” – in this case, pictures of room scenes where each object has been nicely placed by a designer. For this purpose, I used images scraped from the IKEA.com website and divided them into two groups – catalog photos for the general database (2193 images) and room pictures for queries (298 images). Some object categories, such as chair, sofa, table or potted plant, appeared in the pictures more often than others. Every room picture has been annotated with the object categories that appear in it as well as the filenames of the product photos of those objects.
Extracting features
When you are looking for a particular object in a database, you want to create a technical description of that query. For example, if you were searching for your grandma’s wall clock, you would describe it as a circular object with characteristic Roman numerals, two pointy hands, etc. For visual search the idea is the same, only the features are a bit different (every image is just a bunch of pixels/numbers, after all)…
SIFT
One way to extract features is the algorithm proposed by D. Lowe back in 2004 – SIFT (Scale-Invariant Feature Transform). I used its implementation from the OpenCV computer vision library. The idea behind the algorithm is to generate multiple octaves of the image at different sizes and with different amounts of Gaussian blur. Then, by subtracting one image from another, candidates for keypoints are generated. The best way to think about them is as some sort of distinctive corners or points in the image. In addition, SIFT calculates the direction and strength of the gradient to determine the keypoint descriptors. This makes the algorithm scale and rotation invariant, that is, it is able to recognise object features even if they are resized or rotated.
In the example image above, keypoint descriptors are marked with different colours: the keypoint sits at the center of each circle and the line marks the direction of the gradient.
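For reference, here is a minimal sketch of how such keypoints and descriptors can be extracted and visualised with OpenCV (the filename is a placeholder, and depending on the OpenCV build SIFT may live in the xfeatures2d module instead):

```python
import cv2

image = cv2.imread("chair.jpg", cv2.IMREAD_GRAYSCALE)

# In recent OpenCV versions SIFT is available directly;
# older builds expose it as cv2.xfeatures2d.SIFT_create()
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# Each descriptor is a 128-dimensional vector; draw the keypoints
# with their size and gradient orientation, as in the figure above
visualisation = cv2.drawKeypoints(
    image, keypoints, None,
    flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("chair_keypoints.jpg", visualisation)
```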
After extracting image features from the query, you could just try brute-force matching them against the features of all images in the database and check which ones get the most matches. That was the first thing I tried and, as expected, it did not work very well. In particular, it was very keen on matching features from totally unrelated objects (as in the picture below): it sees a bunch of clock features in a hole in the door, while the same clock hangs on the wall practically unnoticed.
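A rough sketch of that brute-force matching with OpenCV’s BFMatcher, with Lowe’s ratio test used to discard weak matches (the function name and threshold are my own, not taken from the project):

```python
import cv2

def match_score(query_desc, db_desc, ratio=0.75):
    """Count 'good' SIFT matches between two descriptor arrays
    (as returned by sift.detectAndCompute) using Lowe's ratio test."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # for every query descriptor, find its two nearest neighbours in the database image
    matches = matcher.knnMatch(query_desc, db_desc, k=2)
    # keep a match only if it is clearly better than the second-best candidate
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good)  # more surviving matches -> more similar image
```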
Visual vocabulary
Bag of words is a Machine Learning model used for text analysis, but it has also been applied to computer vision problems with much success. It builds on the idea of creating a vocabulary of visual words and then counting how often a given “word” appears in an image. Combined with SIFT, you extract features from all the images in the database and then group similar features together (each such group becomes a visual word). The grouping is usually done with K-means clustering. What you get as a result (K clusters of extracted features) is called a visual vocabulary.
Once you have a visual vocabulary, you can count, for every picture, how often each word from the vocabulary appears in it and obtain a new image representation in terms of word frequencies.
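A possible sketch of both steps – clustering the descriptors into a vocabulary and turning an image into a word-frequency histogram – using scikit-learn (all names and the vocabulary size are assumptions, not the project’s actual code):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors_per_image, n_words=500):
    """Cluster SIFT descriptors from all database images into n_words visual words."""
    all_descriptors = np.vstack(descriptors_per_image)
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(all_descriptors)

def bovw_histogram(descriptors, vocabulary):
    """Represent a single image as a normalised histogram of visual-word frequencies."""
    words = vocabulary.predict(descriptors)
    hist, _ = np.histogram(words, bins=np.arange(vocabulary.n_clusters + 1))
    return hist / max(hist.sum(), 1)
```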
Deep features
Convolutional neural networks have been very successful in object detection, classification and various other image analysis tasks. Their description is far beyond the scope of this article; for anyone interested in learning more, I highly recommend CS231n (probably the most popular course on this topic on the internet). A neural network is the Data Scientist’s black box that eats a huge amount of data and spits out powerful predictions in return.
I used the Keras library for Python, which is a great starting point for CNNs, and (as a few thousand images is not enough data to train on) based my model on three architectures pretrained on the ImageNet dataset – VGG19, VGG16 and ResNet. I fed the images to the network with its last layer removed and treated the (normalised) output as the extracted features, which could then be compared using Euclidean distance.
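A minimal sketch of this kind of feature extraction, assuming a pretrained VGG16 in Keras with the classification head dropped and the pooled convolutional output used as the feature vector (the exact layer choice and input size are my assumptions):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# classification head removed; global average pooling gives a 512-dim vector
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def deep_features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    features = model.predict(x)[0]
    return features / np.linalg.norm(features)  # normalise before comparing

# similarity = Euclidean distance between the normalised feature vectors
def distance(a, b):
    return np.linalg.norm(a - b)
```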
Below are the initial accuracy results (for the top six retrieved images) compared across the different models. A room picture served as the query and the result was considered correct if it contained a catalog photo of the exact object appearing in the room. It is clear from the table that deep features perform better than the Bag of Visual Words (BOVW) for all object categories.
Object detection
The initial results were fine but not breathtaking yet. What gave them a huge boost in quality was applying object detection first. Which makes sense – if you are looking for a chair similar to the one in the room photo, why give the algorithm a picture of the whole room (and extract features that do not belong to that chair) when you could be feeding it the cropped image instead?
I used the state-of-the-art object detection technique YOLO9000 (You Only Look Once), which is based on the Darknet-19 deep neural network architecture. It is able to predict nine thousand distinct object classes in real time (on a GPU), so for my task of recognising a sofa or a table in the picture it was good enough.
The final search algorithm first looked at the image, detected the classes of interest (table, sofa, chair, potted plant, bed), extracted the bounding boxes for the detected classes and then searched for a match within the subset of catalog photos of the given object class.
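Put together, the pipeline could be sketched roughly like this; detector and extract_features stand in for a YOLO wrapper and the deep-feature extractor above, and all names here are hypothetical rather than the project’s actual code:

```python
import numpy as np

CLASSES_OF_INTEREST = {"table", "sofa", "chair", "pottedplant", "bed"}

def search_room(room_image, detector, extract_features, catalog, top_k=6):
    """detector(image) -> [(class_name, (x, y, w, h)), ...], e.g. a YOLO9000 wrapper;
    extract_features(image) -> normalised feature vector, e.g. the CNN code above;
    catalog: {class_name: {filename: feature vector}} built from the product photos."""
    results = {}
    for class_name, (x, y, w, h) in detector(room_image):
        if class_name not in CLASSES_OF_INTEREST:
            continue
        # feed the cropped bounding box, not the whole room picture
        query = extract_features(room_image[y:y + h, x:x + w])
        # rank only catalog items of the detected class, by Euclidean distance
        candidates = catalog.get(class_name, {})
        ranked = sorted(candidates,
                        key=lambda name: np.linalg.norm(candidates[name] - query))
        results[class_name] = ranked[:top_k]
    return results
```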
Results
Applying object detection to the search algorithm resulted in an accuracy boost of as much as 200–300% for most object categories and architectures.
Apart from being able to find the exact same object from the picture in the database, the other returned results share some significant similarities with the query object (for example, a similar-looking object in a different color was returned).
Summary
What I learnt from this project is that you can build a pretty cool visual search engine that is able to retrieve the exact objects even with so little data. Moreover, it works on a variety of different object classes, so it could easily be extended to other areas of visual search, not just interior design. The next step, assuming a much bigger database of pictures, would be to train the CNN model from scratch and perhaps achieve even better results.
Another interesting idea for improving visual search is to retrieve not only visually similar results but also to account for style similarities (e.g. I am looking for a coffee table that matches my sofa). But that is a topic for another article. Stay tuned.