Why we decided to experiment with object detection
As some of you may know, Tooploox has a Tooploox Labs initiative that allows people to take a short break from commercial projects and gives them a chance to try something new, experimenting with unexplored technologies. I happened to get such an opportunity. We brainstormed possible ideas and settled on a card detection app that would use Android’s new ML Kit.
The reasoning behind that was that Machine Learning and Artificial Intelligence are becoming more significant, and while we have a well-established AI team, the Android team had not experimented with utilizing any ML capabilities in our apps.
Goal of our playing card recognition app
The aim of the project was to aid the user in a game of poker: being able to take a photo of the player’s hand and the cards on the table, and calculate the probabilities of getting specific poker hands. Due to time constraints I only managed to wrap up the recognition part, but a working app can be found here. The full code of the app can be viewed on GitHub.
The article outlines creating a custom model using Tiny YOLO v2, converting it to a tensorflow graph and then to a .tflite file that can be easily used in an Android app.
YOLO (short for You Only Look Once) is an object detection algorithm, first presented by Joseph Redmon in a paper and subsequently improved.
Generating the dataset for our model
To be able to recognize anything of interest, we need to create a dataset on which to train our model. Unless you can find an existing dataset that happens to be annotated in the Darknet-specific annotation format, this requires relatively high effort. For example, when searching for a playing cards dataset, the most I could find was a set that contained only 4 images of each card, which is sadly not enough to train the model to reliably recognize the cards.
Improving Reliability With Card Images
In my case about 15 photos per class proved to be enough to get decent reliability. As with all datasets, more diversity usually means better results in the end. Try to capture the object from a variety of angles and distances, on different backgrounds, and in various lighting conditions. All of those may affect the quality of the results the algorithm will output. To make your dataset more diverse you can use tools like Augmentor.
Annotate The Photos
So if you thought that taking the actual photos would be the most laborious part of the process, you were wrong. There’s still the issue of annotating the photos to create bounding boxes (i.e. the parts of an image that contain the object of a given class) in the right places, so the model will properly tag the desired objects. It’s quite important that those boxes are as accurate as possible, since the overlap between the ground truth and the predictions is an important factor the algorithm considers when evaluating the training progress.
The next difficulty is that I couldn’t find an efficient tool to draw the bounding boxes. I found two workable solutions:
- BBox-Label-Tool, which unfortunately outputs the pascal VOC format and is not a pinnacle of ergonomics, to put it mildly. The upside is that the formats are relatively close, and there’s a script by Guanghan Ning that should convert between the two.
- there’s also a fork that specifically outputs the YOLO format, but it has the disadvantage of lots of hardcoded strings and a weird choice of output directories, so you’d probably need to tweak the code a bit to suit your needs.
The YOLO annotation format is relatively simple. Basically it’s a .txt file that has the same name as the image it annotates and contains 5 parameters per object:
- the class index
- X coordinate of the middle of the bounding box
- Y coordinate of the middle of the bounding box
- Width of the box
- Height of the box
All those coordinates and dimensions are relative to the image size, that is, an annotation of
0 0.5 0.5 0.2 0.2
would describe a box for an object of the first category on the list, centered in the middle of the image, with a width and height equal to 1/5th of the image.
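As a quick illustration of the arithmetic, here’s a small Kotlin helper (the same language the app uses later; the function itself is purely illustrative) that converts an absolute pixel bounding box into a YOLO annotation line:

// Converts a bounding box given in absolute pixel coordinates
// into a single YOLO annotation line with relative values.
// e.g. toYoloAnnotation(0, 166, 166, 250, 250, 416, 416) is roughly "0 0.5 0.5 0.2 0.2"
fun toYoloAnnotation(classIndex: Int, left: Int, top: Int, right: Int, bottom: Int,
                     imageWidth: Int, imageHeight: Int): String {
    val xCenter = (left + right) / 2.0 / imageWidth
    val yCenter = (top + bottom) / 2.0 / imageHeight
    val width = (right - left).toDouble() / imageWidth
    val height = (bottom - top).toDouble() / imageHeight
    return "$classIndex $xCenter $yCenter $width $height"
}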
Prepare dataset metadata for Darknet
To train the model we’ll use Darknet, the official tool made by the creator of yolo, Joseph Redmon. For Darknet to properly parse your files, you need a file that specifies all the info it needs. Let’s call it obj.data. Its contents are as follows:
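For the 52-card setup described here it could look something like this (the paths are only an example and depend on how you organize your files):

classes = 52
train = input/train.txt
valid = input/test.txt
names = input/obj.names
backup = backup/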
Let’s break it down line by line:
- classes is pretty self explanatory: it specifies the number of classes you want to use
- train is the path to a text file that lists the paths to the images that the model will train on. Note: these paths should be relative to the Darknet executable
- valid is also a path to a file, but this time it lists the images that will be used to verify the model output. The images you so diligently prepared should be split in a roughly 3:1 train:test ratio
- names is the path to a file that lists the object categories the model should recognize (see the example below). The number of lines in this file should correspond to the number of classes specified in the first line
- backup is the path to the directory where the intermediate training results are stored.
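To make this concrete: obj.names simply lists one class name per line, and train.txt lists one image path per line. The names and paths below are purely illustrative:

ace_of_spades
two_of_spades
three_of_spades
...

and train.txt:

input/images/img_0001.jpg
input/images/img_0002.jpg
input/images/img_0003.jpg
...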
Training our object detection algorithm
Now that we’ve got our dataset ready to go, let’s put it to some use. A word of warning though: while Darknet can run on the CPU, it does take a lot of time. And I do mean a *lot* of time. To put it into perspective: initially, as a proof of concept, I tried to train a model on a MacBook Pro’s CPU. After 24 hours of training the output was complete gibberish. Fortunately at Tooploox we enjoy having our own data-cruncher, namely the fabulous Balrog. Using the same training parameters on a GeForce 1080 Ti, after 30 minutes of training the output was somewhat usable.
Another disclaimer: since Darknet uses CUDA, if you have a non-Nvidia card you can’t use the official repo. There are OpenCL ports like https://github.com/ganyc717/Darknet-On-OpenCL but I cannot say if they work OK.
First off you need to install Darknet (https://pjreddie.com/darknet/install/). As the page says, if you want to train on the GPU you need to modify the Makefile accordingly. For me the default ARCH setting didn’t work, so chances are you’ll have to modify that too. Luckily there’s a handy list that has settings for some of the most common graphics cards.
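For instance, for the GeForce 1080 Ti mentioned earlier (compute capability 6.1), the top of the Makefile might end up looking roughly like this (treat the exact ARCH flags as an assumption and verify them against the list for your card):

GPU=1
CUDNN=1
OPENCV=0
OPENMP=0
DEBUG=0

ARCH= -gencode arch=compute_61,code=sm_61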
Config File
The /cfg directory of the Darknet repo contains config files for all the types of yolo networks that were created over the past iterations of the algorithm. Ideally we’d want to use Tiny YOLO v3, since it’s the latest of the ‘tiny’ variants that are optimized for speed. Unfortunately, to use the model on Android we need to convert it to a tensorflow graph using darkflow, and as of the time of writing this post, yolo v3 is not yet supported there.
We need to adjust the yolov2-tiny.cfg to match the number of classes we have. Near the end of the file there’s a line classes=80. Change that line to match the declared number of classes.
One more thing that needs to be changed in the file is the number of filters in the last convolutional layer, to fit the new number of classes. The original file has filters=425, but you need to adjust that following the formula (number_of_classes + 5) * 5. So in my case of recognizing playing cards (52 classes) it was (52 + 5) * 5 = 285.
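For reference, after both changes the end of the config should look roughly like this (all other lines are omitted here and left at their defaults):

[convolutional]
size=1
stride=1
pad=1
filters=285
activation=linear

[region]
...
classes=52
...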
Use Precomputed Weights
This step is optional but highly recommended. You could just launch Darknet without initial weights, but in my testing it turned out that starting with precomputed weights increased the output quality significantly.
To start you need to get the weights from the official yolo site and strip the last couple of layers using the darknet partial command:
./darknet partial your_cfg_file yolov2-tiny.weights yolov2-tiny.conv.13 13
This leaves you with a partially trained weights file that you can use in the next step.
Launching Darknet
The command to start training is
./darknet detector train path_to_obj.data path_to_cfg_file path_to_weights_file
so if you followed the naming used before it would be:
./darknet detector train input/obj.data input/config.cfg yolov2-tiny.conv.13
As to when to stop training there is a lot of excellent advice on AlexeyAB’s github page.
Since the intermediate training output is perfectly usable, you can check out the /backup directory where certain training milestones are saved and verify if the weights are good enough for your needs.
The command to test the weights is
./darknet detector test obj.data yolo.cfg yolo.weights testimage.jpg
This will output the results to the console, and additionally create a file named predictions.png in the Darknet directory that will have a visual representation of the recognition result.
Converting the AI model
Using YOLO weights directly is possible but rather cumbersome on Android. However, the tensorflow lite library does a lot of the work for us. The downside is that we can’t use yolo weights directly; we need to convert them to the .tflite format. This is done in several steps.
Convert .Weights To A Frozen Graph
First we need to convert the .weights file to a protobuf file that represents a frozen tensorflow graph. There’s a nice tool to achieve just that: https://github.com/thtrieu/darkflow
Unfortunately there’s a long-standing bug with Tiny YOLO v2. The culprit is this exact line:
self.offset = 16
So after cloning the repo, but before building the project, you need to change the self.offset value from 16 to 20, then build darkflow (as described on the readme page) and you’re good to go.
The command to run the conversion itself is simple and runs pretty quickly by Machine Learning standards:
flow --model cfg/yolo.cfg --load bin/yolo.weights --savepb
This generates a /built_graph directory that has the protobuf file we’re interested in. It also contains a .meta file that has some additional information that might be useful if you’re debugging the model.
Convert Protobuf To Tflite
So there’s only one last step remaining before we’ll have a perfectly mobile-friendly yolo weights file. We need to convert the tensorflow model from a .pb file to a .tflite one.
The tensorflow repo has a tool called toco, which can be used to do just that. First we need to download the tensorflow repository (https://github.com/tensorflow/tensorflow), install a build tool called bazel and use it to build toco.
To build toco:
bazel build tensorflow/contrib/lite/toco:toco
Then check if it installed correctly:
bazel-bin/tensorflow/contrib/lite/toco/toco --help
And convert your model using toco.
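The exact invocation depends on your graph, but something along these lines should do the job (the input and output array names below are the ones darkflow typically produces; you can double-check them in the .meta file):

bazel-bin/tensorflow/contrib/lite/toco/toco \
  --input_file=built_graph/yolo.pb \
  --output_file=yolo.tflite \
  --input_format=TENSORFLOW_GRAPHDEF \
  --output_format=TFLITE \
  --inference_type=FLOAT \
  --input_shape=1,416,416,3 \
  --input_array=input \
  --output_array=output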
And we’re done (with the model at least).
Use The Model On Android
Now all that remains is to plug the model into an Android app. Initially we wanted to make our app use Google’s ML Kit, but it turned out that the file yolo outputs is over the 40 MB size limit, so we relied on the tensorflow-lite library instead.
It’s possible to decrease the size of the model using quantization, but that would lower its accuracy.
ML Kit in its current iteration basically wraps around tensorflow-lite, and only adds a way to upload the models to Firebase to conduct inference off the device. While this may be useful in some cases, processing the output and interpreting the results needs to be done in the same way in either case.
Let’s look at how it’s done.
First we need to create an org.tensorflow.lite.Interpreter and then call tensorflow.run(image, output). But to complicate stuff, we also need to make sure that the input is a ByteBuffer holding 416 * 416 * 3 values, since the algorithm takes a 416 x 416 image and each pixel has its color represented by 3 values (one each for Red, Green, and Blue); for a non-quantized model each value is a 4-byte float. This is done by the convertBitmapToByteBuffer method.
Our output should be a 4-dimensional array shaped to match the last yolo layer; for Tiny YOLO v2 with a 416 x 416 input and 52 classes that is [1][13][13][285] (a 13 x 13 grid with the 285 filters we set earlier). After we do the tensorflow.run() the output array will be filled by, well, the output from the model. But sadly, that still needs some processing, which is done by the processOutput method, directly inspired by a sample from the tensorflow repository. How exactly this is achieved is shown in this gist.
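For illustration, here’s a minimal Kotlin sketch of that flow. The asset name, constants, and normalization are assumptions you should adjust to your own model; the Interpreter calls themselves are the standard tensorflow-lite API:

import android.content.Context
import android.graphics.Bitmap
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.channels.FileChannel

private const val INPUT_SIZE = 416 // yolo input resolution
private const val GRID_SIZE = 13   // 416 / 32
private const val CHANNELS = 285   // (52 classes + 5) * 5

// Memory-map the .tflite model from the app's assets and wrap it in an Interpreter.
fun loadInterpreter(context: Context, assetName: String = "cards.tflite"): Interpreter {
    val fd = context.assets.openFd(assetName)
    val model = FileInputStream(fd.fileDescriptor).channel
        .map(FileChannel.MapMode.READ_ONLY, fd.startOffset, fd.declaredLength)
    return Interpreter(model)
}

// Convert a 416x416 bitmap into the float ByteBuffer the model expects.
fun convertBitmapToByteBuffer(bitmap: Bitmap): ByteBuffer {
    val buffer = ByteBuffer.allocateDirect(4 * INPUT_SIZE * INPUT_SIZE * 3)
        .order(ByteOrder.nativeOrder())
    val pixels = IntArray(INPUT_SIZE * INPUT_SIZE)
    bitmap.getPixels(pixels, 0, INPUT_SIZE, 0, 0, INPUT_SIZE, INPUT_SIZE)
    for (pixel in pixels) {
        // One float per channel, normalized to 0..1 (match your model's preprocessing).
        buffer.putFloat(((pixel shr 16) and 0xFF) / 255f) // R
        buffer.putFloat(((pixel shr 8) and 0xFF) / 255f)  // G
        buffer.putFloat((pixel and 0xFF) / 255f)          // B
    }
    return buffer
}

// Run inference; the output array shape has to match the model's last layer.
fun detect(interpreter: Interpreter, bitmap: Bitmap): Array<Array<Array<FloatArray>>> {
    val output = Array(1) { Array(GRID_SIZE) { Array(GRID_SIZE) { FloatArray(CHANNELS) } } }
    interpreter.run(convertBitmapToByteBuffer(bitmap), output)
    return output // still needs the processOutput step from the gist above
}

One practical note: for the memory-mapping to work the .tflite asset typically must not be compressed by aapt, which is usually handled with aaptOptions { noCompress "tflite" } in the Gradle config.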
After that, we have a list of detected object classes and the probabilities of each of them.
Final object recognition app on Android
The working app can be downloaded here. The code can be viewed on GitHub.
The repository also includes the dataset that was used to train the model and all the intermediate steps between yolo weights and usable .tflite files.
The caveat is that the dataset is based on only one deck of cards; while the model does recognize other decks decently, the recognition probabilities are much lower than when using the exact deck the dataset was created with.
Lessons learned
Machine Learning is a vast field of knowledge, and going head first into it can be confusing and hard at times. The tooling around it can be wonky and often breaks for no apparent reason, leaving you with lots of cryptic error messages.
But in the end, after you make multiple silly mistakes and hopefully fix most of them, it actually works pretty well, all things considered. The dataset I used is not the most diverse, to say the least (I used only one deck of cards, and unsurprisingly the results are best when using that exact deck), and most of the training images have a very plain background, so detection results for cards that are rotated are not too great.
I would probably have had an easier time just using one of the models suggested by ML Kit (MobileNet or Inception), but after discussing it with our ML team we chose YOLO, since it’s potentially a better network architecture for this task. A good follow-up to the experiment would be checking the other models recommended for ML Kit. They might not match the recognition precision of YOLO, but I would expect them to be significantly easier to integrate into an Android app.
Acknowledgements
This whole Tooploox Labs experiment would not have been possible without the help of our AI Research team, whom I repeatedly badgered with stupid questions and who took the time to teach me the basic concepts and tweak the results. Thanks a lot, guys!
Read also about Augmenting AI image recognition with partial evidence