Object tracking with ML - Tooploox at Data Science Summit 2024

Object tracking is easy. Doing so on low-end hardware, without limiting the model to narrow cases, is a whole different story.

Object tracking has applications across many industries. CCTV cameras can help secure industrial installations, while similar cameras can support quality assurance by overseeing the packaging process. Security cameras can search for knives or firearms in public places and, in sports, trackers can follow particular players or the ball itself.

Trackers are an established and popular technology. So do we need another one? The answer is – maybe we don’t need it, but we want one anyway.

Advantages of traditional trackers

Currently, the dominant traditional computer vision-based trackers come with multiple advantages. These include:

Price – traditional computer vision needs fewer resources and less computing power than the ML-based approach. Also, after years of development, there are multiple existing frameworks and solutions to choose from.
Performance – traditional computer vision is less resource-consuming, not only during development but also at runtime. This makes these systems more suitable for embedded hardware.
Reliability – last but not least, traditional computer vision is built from deterministic, hand-written algorithms with no ML components, so its behavior is predictable. That makes the system usable for more critical tasks.

Challenges for optical trackers

Despite the advantages of the traditional approach and multiple tools that are ready to be used, there are some limitations users and companies need to remember.

Occlusions – an occlusion happens when an object is either hidden by an object of the same class (a car hidden behind another car), or is hidden behind an object of another class (car hidden behind a tree).
Light changes – an object’s appearance may change not only due to its movement, but also when the lighting changes. As colors fade or brighten with the light level, the system may treat the same object as several different objects.
Big and fast movements – a followed object may change its position rapidly, or may be hidden by other fast moving objects. For example, a car may be hidden behind a speeding truck.
Regions with plain textures – computer vision tools may be confused by large areas with little to no variation in texture.
No context information, just tracking pixels – last but not least, a classical computer vision system has no notion of what it is tracking. Unlike an AI-based solution, it follows changes in pixels, not a real “object” found among other objects in a video.

Our approach – combining classic computer vision with AI

Overcoming these limitations is one of the key applications of ML-powered computer vision. Yet a fully AI-based computer vision system loses the benefits of classical computer vision.

To overcome that, the Tooploox team has constructed a tracker inspired by the TLD (Tracking Learning Detection) tracker from 2010, but boosted with modern technology.

The original TLD architecture consisted of three parts: an optical tracker, a positive-negative learning scheme that retrained a classifier every time the object’s appearance changed, and a detector in the form of a classifier trained on a mix of hand-crafted features.

We used the same design but with a newer approach for each component. For tracking, we used our own implementation of a median tracker, which is two times faster than the original. To learn the different appearances of the same object, we opted for a few-shot learning approach based on a Siamese network, supported by our own classifier that selects new samples for the network to use as templates of the tracked object. The exact architecture we implemented as the learning module was DaSiamRPN, which also performs detection.
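The internals of the median tracker are not described here, so the following is only a minimal sketch of the general idea, assuming a Median Flow-style step as in the original TLD: points inside the box are followed from one frame to the next (not shown), unreliable points are dropped, and the box moves by the median displacement of the rest. The function names, the error values, and the `max_error` threshold are hypothetical.

```python
from statistics import median

def median_shift(points_prev, points_next, errors, max_error=2.0):
    """Return the median (dx, dy) of reliably tracked points, or None."""
    dx, dy = [], []
    for (x0, y0), (x1, y1), err in zip(points_prev, points_next, errors):
        # Keep only points whose tracking error is small (assumed threshold).
        if err <= max_error:
            dx.append(x1 - x0)
            dy.append(y1 - y0)
    if not dx:
        return None  # no reliable points: the tracker reports a failure
    return median(dx), median(dy)

def move_box(box, shift):
    """Translate an (x, y, width, height) box by the estimated shift."""
    x, y, w, h = box
    return (x + shift[0], y + shift[1], w, h)

# Toy data: five points, one of which was tracked badly (error 9.0).
prev_pts = [(10, 10), (20, 10), (10, 20), (20, 20), (15, 15)]
next_pts = [(13, 11), (23, 11), (13, 21), (40, 5), (18, 16)]
tracking_errors = [0.1, 0.2, 0.1, 9.0, 0.3]

shift = median_shift(prev_pts, next_pts, tracking_errors)
new_box = move_box((8, 8, 14, 14), shift)
```

Using the median rather than the mean means a few badly tracked points do not drag the box away from the object.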

But this approach has challenges of its own.

Challenge – the object may change

The ML model recognizes an object, but the object’s appearance may change. With more complex objects, that appearance can be extremely unstable. A dog may turn or jump. A kid may fall or raise their hands while running. What then? To an ML classifier, these can all look like different objects.

Solution

To overcome this challenge, the team decided to add samples of the object on the fly. Fresh data on the object’s current position and appearance keeps the model up to date, which helps the network find the object even as its appearance changes.
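As a rough illustration of adding samples during tracking, the sketch below keeps the initial template plus a small rolling set of recent ones. The `TemplateBank` class, the confidence rule, and all thresholds are hypothetical; the sample-selection classifier the team actually used is not described here.

```python
from collections import deque

class TemplateBank:
    """Holds the first template plus a bounded set of recent appearances."""

    def __init__(self, first_template, max_size=5, min_confidence=0.8):
        self.first = first_template               # never discarded
        self.recent = deque(maxlen=max_size - 1)  # oldest entries drop out
        self.min_confidence = min_confidence      # assumed acceptance threshold

    def maybe_add(self, sample, confidence):
        """Store a new appearance only if the tracker was confident about it."""
        if confidence >= self.min_confidence:
            self.recent.append(sample)
            return True
        return False

    def templates(self):
        return [self.first, *self.recent]

# Strings stand in for image crops of the tracked object.
bank = TemplateBank("frame0_crop", max_size=3)
bank.maybe_add("dog_turning", 0.90)
bank.maybe_add("blurry_crop", 0.40)   # rejected: low confidence
bank.maybe_add("dog_jumping", 0.95)
bank.maybe_add("dog_sitting", 0.85)   # pushes out "dog_turning"
current = bank.templates()
```

Keeping the first template permanently is one simple way to limit drift if a wrong sample is accepted later.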

Challenge – object moves so fast it confuses the classifier

For some super-fast objects, like sports cars, frightened cats, or objects falling from a shelf, sudden movement can simply be too challenging to process. From the point of view of a computer vision algorithm based on changes in brightness, the object may seem to have disappeared altogether.

Solution

The team decided to use a DaSiamRPN network, which consists of two parts. The first is a Siamese network that estimates the similarity between two images. The assumption is that such a network, trained on a large, general training set, will work similarly on images of objects it has potentially never seen before.
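To make "similarity between two images" concrete, here is a minimal sketch that compares two embedding vectors with cosine similarity. The embeddings are made-up numbers standing in for the output of a shared network, and cosine similarity on single vectors is a simplification chosen for illustration; Siamese trackers typically compare feature maps by correlation instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero embedding carries no information
    return dot / (norm_a * norm_b)

# Hypothetical embeddings produced by the same network for three crops.
template_embedding = [0.9, 0.1, 0.3]
same_object_embedding = [0.8, 0.2, 0.35]
other_object_embedding = [0.1, 0.9, 0.0]

score_same = cosine_similarity(template_embedding, same_object_embedding)
score_other = cosine_similarity(template_embedding, other_object_embedding)
```

Because both crops pass through the same network, the comparison works on any object, which is what lets a generally trained model handle objects it has never seen.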

Effectively comparing two images is one thing, but we still need to find the specific location where this similarity is the greatest. The simplest method is a sliding-window search, in which the system sequentially checks each window of a given size in the frame and measures how similar it is to the template.
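The exhaustive search described above can be sketched in a few lines. This toy version works on a tiny grayscale grid and uses the sum of absolute pixel differences as a stand-in for the learned similarity; the function names and data are illustrative only.

```python
def window_distance(frame, template, top, left):
    """Sum of absolute differences between the template and one frame window."""
    total = 0
    for r, row in enumerate(template):
        for c, value in enumerate(row):
            total += abs(frame[top + r][left + c] - value)
    return total

def sliding_window_search(frame, template):
    """Check every window position and return the best match."""
    th, tw = len(template), len(template[0])
    fh, fw = len(frame), len(frame[0])
    best_pos, best_dist, checked = None, None, 0
    for top in range(fh - th + 1):
        for left in range(fw - tw + 1):
            dist = window_distance(frame, template, top, left)
            checked += 1
            if best_dist is None or dist < best_dist:
                best_pos, best_dist = (top, left), dist
    return best_pos, best_dist, checked

frame = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 8], [7, 9]]

position, distance, windows_checked = sliding_window_search(frame, template)
```

Even on this 4x5 grid the search visits 12 windows to find one match, which hints at how the cost grows on real frames.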

However, this is time-consuming, especially since the relative size of the tracked object might have already changed, so we would have to perform such checks at multiple scales. The solution here is a Region Proposal Network, which, based on embeddings obtained from our Siamese network, generates candidates for comparison and then estimates the probability that each candidate is the object we are looking for along with its location.
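A quick count shows why exhaustive search at multiple scales is expensive. The frame size, window sizes, and stride below are assumptions picked for illustration, not values from the project.

```python
def count_windows(frame_w, frame_h, win_w, win_h, stride=1):
    """Number of window positions a sliding-window search has to evaluate."""
    if win_w > frame_w or win_h > frame_h:
        return 0
    across = (frame_w - win_w) // stride + 1
    down = (frame_h - win_h) // stride + 1
    return across * down

# Assumed 640x480 frame and three square window sizes.
frame_w, frame_h = 640, 480
window_sizes = [48, 64, 96]

single_scale = count_windows(frame_w, frame_h, 64, 64)
all_scales = sum(count_windows(frame_w, frame_h, s, s) for s in window_sizes)
```

Under these assumptions one scale alone means roughly 240,000 comparisons per frame and three scales about 707,000, whereas a proposal network only has to score a limited set of candidates.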

The effect – the solution itself

Our solution was compared with other trackers available in OpenCV. It was tested on 10-15 selected datasets with isolated types of changes (translation, affine transformations, rotation, transformation, occlusion).

A problem arose when we observed that tuning for better results on the datasets made live tracking with a camera worse. Ultimately, the team decided that live camera use was the main application of this tracker. As a result, the benchmark results might appear worse than the state of the art in some cases, but the overall live-tracking performance is better.

The issue is that when evaluating the solution on a live camera stream, it’s difficult to collect quantitative results. So we rely more on a subjective impression that it works well and often positively surprises the user. It might get lost for a few seconds, but thanks to the power of ML, it reinitializes and continues working.

On a MacBook Pro with an M1 processor, the optical tracker is limited only by the camera’s frame rate (30 or 60 fps; we even reached 200 fps on low-resolution recorded sequences). The ML-based tracker runs at about 2-10 frames per second, with varying processing times. When we ran the ML tracker on demand, we sometimes had to catch up by as many as 20 frames.
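The gap between the two speeds can be estimated with simple arithmetic: while the ML tracker processes one frame, the camera keeps delivering new ones. The helper below is a sketch under that reading, and the specific camera and ML rate pairings are assumptions taken from the ranges quoted above.

```python
def frames_arrived_during_ml_pass(camera_fps, ml_fps):
    """Camera frames that arrive while the ML tracker handles a single frame."""
    if ml_fps <= 0:
        raise ValueError("ml_fps must be positive")
    return camera_fps / ml_fps

slow_case = frames_arrived_during_ml_pass(camera_fps=30, ml_fps=2)
fast_camera_case = frames_arrived_during_ml_pass(camera_fps=60, ml_fps=3)
best_case = frames_arrived_during_ml_pass(camera_fps=30, ml_fps=10)
```

With a 30 fps camera and a 2 fps ML pass this gives 15 frames, and with 60 fps and 3 fps it gives 20, which is in line with the backlog mentioned above. The optical tracker is what bridges those frames.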

Summary

The Tooploox tracker was designed as a showcase for the Data Science Summit conference in Warsaw. Tooploox engineers Adam Słucki and Łukasz Ziobroń presented it as a way to show object recognition challenges and solutions in an easy-to-digest form.