For labeling the data there is no simple answer, with the CAPTCHA mechanism being one of the most successful approaches. Within this slightly annoying ritual of proving one is not a robot, a user is presented with two words or an image and needs to spot the correct elements. When it comes to words, one is already verified to be correct, while the other is labeled by the user.
The same goes for the image. With thousands of Internet users tirelessly labeling images in search of traffic lights, road signs, or pedestrians, it has become possible to gather a decent number of labeled images that are further used in training image recognition-based solutions.
On the other hand, a company needs to be extremely careful when gathering training data, so the labels sometimes have to be evaluated to ensure 100% accuracy of the dataset. When the dataset consists of thousands of images, this is a challenge in itself.
Out Of Distribution Problem
There is also a significant challenge in delivering cars for different regions, be they North America, Europe, or Africa. While for a human driver it is not a challenge to generalize a pine and a baobab into a simple “tree” category and move on without a second glance, an autonomous car system can struggle with this problem and become utterly confused in an unfamiliar environment.
Considering the general problem of generalization witnessed in neural networks, a car trained on data gathered in the US can be significantly less capable of driving autonomously in Europe. The narrow streets of Edinburgh’s old town are incomparable to the largely well-planned and wide streets of US towns and cities – and this is a factor that is hard to cope with for a neural network.
Thus, when building a dataset aimed to train a worldwide-capable car, it is crucial to build datasets consisting of images from all parts of the world. And that makes the datasets even bigger – like the whole of Google Street View, but fully labeled.
The prime and most basic task of computer vision algorithms is to recognize an object in a picture. While computers outperform humans in multiple image recognition tasks, there are several challenges which are particularly interesting in the context of autonomous vehicles.
- Object recognition needs to be done in real time. Camera input is sometimes based on a set of scan lines constantly flowing from the sensor, used to refresh an ever-changing image on a screen, rather than on a series of complete and whole frames. Thus, there is a need to recognize objects without ever actually “seeing” them as whole images.
- There are multiple elements in an environment that can be confusing for an autonomous system – a truck trailer in front of a car can be a good example.
The first challenge can be solved by training the model on the raw data delivered by the sensor as its output, effectively shifting the model toward signal analysis rather than image recognition.
The second challenge is an example of a typical problem of AI being unable to generalize and having no prior knowledge on a subject. An interesting solution comes from enriching image recognition with partial evidence – a technique that enables the neural network to use a piece of additional information (for example context) to exclude the least probable outcome.
So if there is a car hovering higher than five meters off the ground, it is probably a billboard and there is no need to decelerate.
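This kind of context-based exclusion can be sketched as a simple post-processing step on a detector’s output. The snippet below is a hypothetical illustration, not any specific production system: the class names, the dictionary format, and the five-meter threshold are all assumptions made for the example.

```python
# Hypothetical sketch of "partial evidence": using scene context to veto
# implausible detections. The labels, fields, and the 5 m threshold are
# illustrative assumptions, not part of a real detector's API.

def filter_by_context(detections, max_vehicle_height_m=5.0):
    """Relabel detections that contradict basic physical context."""
    plausible = []
    for det in detections:
        if det["label"] == "car" and det["height_above_ground_m"] > max_vehicle_height_m:
            # A "car" hovering above the road is almost certainly a billboard.
            det = {**det, "label": "billboard"}
        plausible.append(det)
    return plausible

detections = [
    {"label": "car", "height_above_ground_m": 0.0, "score": 0.94},
    {"label": "car", "height_above_ground_m": 7.5, "score": 0.88},  # an ad, not a car
]
print(filter_by_context(detections))
```

In a real system this logic would live inside the network (as an additional input or loss term) rather than as a hand-written rule, but the intuition is the same: extra context prunes the least probable outcomes.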
Traffic sign recognition is an iconic task for an autonomous vehicle’s neural network. The challenge in traffic sign recognition is in doing it quickly and in a highly volatile environment. A sign can be dirty, covered by leaves, twisted into an odd angle, or modified in one way or another.
Also, it is common to rearrange signs on the road or put some up temporarily, for example informing about a detour or road construction. So the net needs to be swift in processing and vigilant in spotting signs.
An even more significant challenge comes with pedestrians. The machine not only needs to recognize a pedestrian without a moment’s doubt but also needs to estimate his or her pose. If the pedestrian’s motion indicates that he or she is going to cross the road, the vehicle needs to spot that and react quickly.
Semantic Segmentation And Semantic Instance Segmentation
Semantic segmentation and semantic instance segmentation are intuitively similar problems with different sets of challenges.
Semantic segmentation is about detecting multiple entities in one image and providing each one with a separate label. Obviously, there can be a car, a road sign, a biker, and a truck on the road at the same time – and that’s what semantic segmentation is all about.
Semantic instance segmentation is about spotting the difference between each object in the scene. For a car’s system it is not enough that there are simply three cars on the road – it needs to be able to differentiate between them easily in order to track their individual behavior. While semantic segmentation frames each car, each tree, and each pedestrian, semantic instance segmentation labels them as car1, car2, tree1, tree2, etc.
The challenge is to deliver additional information about the number of objects on the road and their positions relative to each other without naming them.
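The difference between the two tasks can be illustrated with a toy example: starting from a semantic “car” mask, connected-component labeling splits it into separate instances. This is only a didactic sketch – a real system would use a trained network such as Mask R-CNN rather than flood fill on a binary grid.

```python
# Toy illustration (not a real segmentation model): turning a semantic
# mask (all pixels labeled "car") into instance labels via 4-connected
# flood fill, so two separate blobs become car1 and car2.

def label_instances(mask):
    """mask: 2D list of 0/1; returns a 2D list where each connected
    blob of 1s gets a distinct instance id (1, 2, ...)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 1 and labels[r][c] == 0:
                next_id += 1
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < rows and 0 <= x < cols \
                            and mask[y][x] == 1 and labels[y][x] == 0:
                        labels[y][x] = next_id
                        # visit the 4 neighbors of this pixel
                        stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return labels

semantic_car_mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
instances = label_instances(semantic_car_mask)
# Two separate "car" blobs -> instance ids 1 and 2 (i.e. car1, car2)
print(instances)
```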
- Performance – as mentioned above, it is challenging to deliver truly real-time object recognition due to the limitations of the sensor itself. The same limitation applies to semantic segmentation and semantic instance segmentation just as it does to object recognition.
- Confusion – while machines are increasingly efficient in their tasks, one needs to remember that there is an ever-present factor of unpredictability in an artificial neural network. Thus the network can be confused by factors like unusual lighting conditions or weather.
Tackling these problems is tied to acquiring and leveraging bigger datasets that provide the neural network with more examples to generalize from. Providing the network with artificial data, either generated manually or via Generative Adversarial Networks, is one of the simplest ways to tackle this challenge.
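Before reaching for GANs, the cheapest way to multiply examples is classical augmentation of existing images. The minimal sketch below assumes a grayscale image stored as nested lists; a real pipeline would use a library such as torchvision or albumentations on actual image tensors.

```python
import random

# Minimal augmentation sketch: random horizontal flip plus brightness
# jitter on a tiny grayscale "image" (list of rows of 0-255 values).
# Illustrative only; real pipelines operate on full image tensors.

def augment(image, rng):
    out = [row[:] for row in image]
    if rng.random() < 0.5:                      # random horizontal flip
        out = [row[::-1] for row in out]
    delta = rng.randint(-20, 20)                # brightness jitter
    out = [[min(255, max(0, px + delta)) for px in row] for row in out]
    return out

rng = random.Random(0)
image = [[10, 200, 30], [40, 50, 60]]
print(augment(image, rng))
```

Each pass over the dataset then yields slightly different training examples, which helps the network generalize to lighting and viewpoint changes it has never literally seen.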
Stereovision And Multi-Camera Vision
Depth estimation is one of the key factors in ensuring the safety of a vehicle and its passengers. While there are multiple tools available, including cameras, radar, and lidar, it is common to support them with multi-camera vision.
Knowing the distance between the camera lenses and the exact location of an object in the images taken by them is the first step toward building a stereo vision system. In theory, the rest is simple – the system uses triangulation to estimate depth.
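For rectified cameras, that triangulation collapses into one formula: depth = focal length × baseline / disparity. The sketch below uses illustrative camera parameters, not values from any specific vehicle.

```python
# Rectified-stereo triangulation: depth = focal_length * baseline / disparity.
# The focal length and baseline below are illustrative example values.

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Return depth in meters for a pixel disparity between the two views."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

focal_px = 1200.0    # focal length expressed in pixels
baseline_m = 0.54    # distance between the two camera lenses
# An object shifted 27 px between the left and right images:
print(depth_from_disparity(27.0, focal_px, baseline_m))  # -> 24.0 (meters)
```

The formula also shows why a wider baseline improves depth estimation: for the same depth, the disparity grows, so a one-pixel measurement error matters less.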
- Camera configuration – the distance between lenses and the sensitivity of the sensor can differ, delivering additional challenges for a depth estimation system.
- Non-parallel representation – the cameras used in autonomous cars can deliver slightly different images, without a pixel-to-pixel correspondence between them. Thus, if there is a hardware-induced shift in the pixel representation of an image, the system can find it more challenging to calculate the distance.
- Perspective distortion – the bigger the distance between the camera lenses, the better the depth estimation. Yet this comes with another challenge – perspective distortion, which needs to be accounted for in the depth estimation.
The Tooploox engineers have overcome this challenge through feature engineering and by combining the stereoscopic data with information from lidar and radar devices. With enough artificial data generated to tweak the algorithm, and real data to validate the effects, the team was able to deliver top-class results.
A good example of multi-objective camera usage in the automotive industry is Light, one of Tooploox’s clients.
Object tracking aims to provide the autonomous vehicle control system with information about the current motion of an object and a prediction of its future motion. While object recognition informs the system that a particular object is a car, a truck, or a tram, this feature delivers information on whether that object is accelerating, decelerating, or maneuvering.
- Risk estimation – the network needs not only to predict movement but also to anticipate behavior, in the same way a human driver is often more careful when driving near cyclists than when driving among other cars: any incident would be more dangerous for the person on the bicycle.
- Volatile background – when tracking an object, the network needs to deal with changes in the background. Other vehicles approach, the road changes color, or there are trees instead of fields behind the object. This is not a problem for a human driver, but it can be utterly confusing for a neural network.
- Confusing objects – objects on a road are fairly repetitive, with dozens of red cars passing by each day. The tracking software can possibly mismatch one red car with another and thus provide the controller network with inaccurate information.
Providing the controller neural network with multimodal data gathered by lidar and radar is a good answer to this challenge. While lidar can struggle to identify the type of a particular object, it delivers the object’s exact position with pinpoint accuracy. Radar provides less accurate data, yet it is independent of the scene and rarely affected by factors like weather conditions.
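A common baseline for associating detections across frames – and thus for avoiding the red-car mix-up described above – is greedy intersection-over-union (IoU) matching. The sketch below is illustrative only; real trackers add motion models (e.g. Kalman filters) and appearance features on top of this.

```python
# Greedy IoU matching between detections in consecutive frames -- a common
# baseline for object tracking. Illustrative sketch; production trackers
# also use motion prediction and appearance cues to avoid identity swaps.

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match(prev_boxes, new_boxes, threshold=0.3):
    """Greedily assign each existing track to the best-overlapping new box."""
    matches, used = {}, set()
    for pid, pbox in prev_boxes.items():
        best_id, best_iou = None, threshold
        for nid, nbox in enumerate(new_boxes):
            if nid not in used and iou(pbox, nbox) > best_iou:
                best_id, best_iou = nid, iou(pbox, nbox)
        if best_id is not None:
            matches[pid] = best_id
            used.add(best_id)
    return matches

prev = {"car1": (0, 0, 10, 10), "car2": (50, 50, 60, 60)}
new = [(51, 50, 61, 60), (1, 0, 11, 10)]
print(match(prev, new))  # car1 -> new box 1, car2 -> new box 0
```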
3D Scene Analysis
Combining all information gathered from the techniques described above, the controlling system should construct a 3D representation of the surrounding world. This can be compared to one’s imagination, where the computations are done, the effects estimated and the outcome produced.
- Accuracy – in 3D scene analysis even the slightest inaccuracies tend to stack into a larger error and result in drift. What appears to be harmless at low speeds can become significant as velocity rises.
- Multiple unpredictable objects – what is fairly straightforward on a highway gets complicated in urban traffic, where the street network grows more complex, as do the intentions of the other actors on the road.
Without the supporting role of lidar and radar, effective and fast 3D scene analysis would be extremely challenging. Tooploox engineers and researchers have approached this challenge by working with point cloud data for object identification.
By this method, the system controlling the autonomous vehicle receives accurate information about objects in the scene from two independent sources – a camera with image recognition capabilities and a lidar system with point cloud data analysis and identification.
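A typical first step when working with lidar point clouds is separating ground points from potential obstacles. The snippet below is a deliberately simplistic sketch using a flat height threshold; real pipelines fit a ground plane (e.g. with RANSAC) and then cluster the remaining points into objects.

```python
# Simplistic point-cloud preprocessing: separating ground returns from
# obstacle returns with a height threshold. Illustrative only -- real
# systems fit a ground plane and cluster the rest into object candidates.

def split_ground(points, ground_z=0.2):
    """points: list of (x, y, z) in meters, z measured from road level."""
    ground = [p for p in points if p[2] <= ground_z]
    obstacles = [p for p in points if p[2] > ground_z]
    return ground, obstacles

cloud = [
    (1.0, 0.0, 0.05), (1.1, 0.1, 0.02),   # road surface returns
    (5.0, 2.0, 1.40), (5.1, 2.0, 1.60),   # something car-sized ahead
]
ground, obstacles = split_ground(cloud)
print(len(ground), len(obstacles))  # -> 2 2
```

The obstacle points can then be handed to the identification step, giving the control system the second, camera-independent source of object information described above.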
Image recognition capabilities are at the foundation of autonomous vehicle control systems. The good news is that the systems can be enriched with multiple sensors as well as provided with various other forms of data. Modern solutions leverage the capabilities delivered by HD maps or GPS systems to get better information to work with.
If you wish to get more information regarding autonomous vehicles or wish to discuss this matter, don’t hesitate to contact us now!