It goes without saying that an autonomous vehicle (AV) must be able to accurately track the movement of pedestrians, animals, bicycles, and other vehicles around it to safely and effectively get from point A to point B. The systems responsible for this depend, among other things, on being fed data from which they are “trained” to spot and react to these obstacles and hazards. 

A technique developed by Carnegie Mellon University (CMU) researchers called “scene flow” may be able to deliver improved results by training systems on larger datasets. Generally speaking, the more data that is available for training tracking systems, the better the results will be. And, according to the CMU researchers, they have found a way to unlock a “mountain” of autonomous driving data for exactly that purpose. 

Scene Flow

Most AVs navigate based on sensor data from light detection and ranging (LiDAR) systems that scan the environment to generate three-dimensional information about the world surrounding the vehicle. 

This information is not an image but rather a cloud of points that AVs make sense of using a technique known as “scene flow.” Scene flow involves calculating the speed and trajectory of each 3D point, with groups of points moving together interpreted by the AV as moving objects such as vehicles and pedestrians. 
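To make the idea concrete, here is a minimal sketch, not the CMU system, of what a scene flow output looks like: given the point cloud at one time step and the next, each point gets a 3D displacement vector. This toy version simply uses the offset to each point's nearest neighbor in the next frame as a stand-in for a learned flow estimate.

```python
import numpy as np

def nearest_neighbor_flow(cloud_t, cloud_t1):
    """Toy scene flow: for each point at time t, estimate a flow vector
    as the offset to its nearest neighbor in the next frame's cloud.
    (A stand-in for a learned scene flow model, not the CMU method.)"""
    # Pairwise squared distances between the two clouds: shape (N, M)
    d2 = ((cloud_t[:, None, :] - cloud_t1[None, :, :]) ** 2).sum(-1)
    nearest = cloud_t1[d2.argmin(axis=1)]  # closest next-frame point
    return nearest - cloud_t               # per-point 3D displacement

# Toy example: three points all translate by (1, 0, 0) between frames
cloud_t = np.array([[0., 0., 0.], [5., 0., 0.], [0., 5., 0.]])
cloud_t1 = cloud_t + np.array([1., 0., 0.])
flow = nearest_neighbor_flow(cloud_t, cloud_t1)
print(flow)  # each row is [1. 0. 0.] — the shared motion of the group
```

Points whose flow vectors agree, as all three do here, would be grouped by the AV and interpreted as a single moving object.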

Previously, state-of-the-art methods for training AV systems have required large labeled sets of sensor data, annotated to track each 3D point over time. However, tagging each dataset is a time-consuming and expensive process, so only a relatively small amount of this data exists. 

As a result, scene flow training is often carried out using simulated data. This is much less effective, even when fine-tuned with the available small amount of real-world data. 

LiDAR point cloud of street scene.

Scene flow can be used to predict the future position of a cyclist by comparing the current LiDAR point cloud of a street scene, in green, with the point cloud from the previous time step in the sequence, shown in red. Image credit: Carnegie Mellon University
 

Taking a Different Approach

Instead, the research team tried out a different approach. They used unlabeled data, which is relatively easy to generate in abundance, to carry out the scene flow training. They developed a way for their system to detect its scene flow errors. 

At each instant, the system predicts where each 3D point will be and how fast it is moving. At the following instant, it measures the distance between the predicted location and the actual location of the nearest 3D point; this is the first type of error. 

The system then reverses the process. Starting with the predicted point location, it works backward to map the point back to where it originated. It then measures the distance between that back-projected position and the actual origin point; this is the second type of error.
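The two error terms can be sketched in a few lines of numpy. This is an illustrative simplification, not the researchers' implementation: it assumes a forward flow prediction has already been made, and represents the backward pass as an explicit backward flow vector.

```python
import numpy as np

def nn_error(predicted, cloud_t1):
    """Error 1: distance from each predicted point to the nearest
    actual point in the next frame's point cloud."""
    d2 = ((predicted[:, None, :] - cloud_t1[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1))

def cycle_error(cloud_t, predicted, backward_flow):
    """Error 2 (cycle consistency): apply a backward flow to the
    predicted points and measure how far the round trip lands
    from each point's actual origin."""
    returned = predicted + backward_flow
    return np.linalg.norm(returned - cloud_t, axis=1)

# Toy scene: two points move +1 along x between frames
cloud_t = np.array([[0., 0., 0.], [2., 0., 0.]])
cloud_t1 = cloud_t + np.array([1., 0., 0.])

flow = np.array([[1., 0., 0.], [1., 0., 0.]])  # a perfect forward prediction
predicted = cloud_t + flow
print(nn_error(predicted, cloud_t1))           # zeros: lands on real points
print(cycle_error(cloud_t, predicted, -flow))  # zeros: round trip closes
```

A flawed prediction inflates one or both errors, so driving both toward zero on unlabeled data pushes the system toward the true motion without any annotations.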

The system then works to correct these errors. “It turns out that to eliminate both of those errors, the system actually needs to learn to do the right thing, without ever being told what the right thing is,” said David Held, assistant professor in CMU’s Robotics Institute.

Using Data to Train AV Systems 

To validate their method, the researchers measured the accuracy of scene flow trained on a set of synthetic data: 25%. When fine-tuned with the available small amount of real-world data, accuracy increased only to 31%. However, it jumped to 46% when they added a large amount of unlabeled data and trained the system using their approach. 

“Our method is much more robust than previous methods because we can train on much larger datasets,” said Himangi Mittal, a research intern.

Source: All About Circuits
