Visual & LiDAR-based Tracking of Traffic Cones

TechLabs Aachen
May 30, 2021


This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Winter Term 2020)

Introduction

The aim is to create a tracker that follows the location of detected static objects over time. The required information can be provided by both cameras and LiDAR. Since the objects are static, the computed temporal translation corresponds to the movement of the ego vehicle, providing an additional source of odometry data. The project is split into two parts: Part 1 covers cone recognition via camera, while Part 2 combines data from the cameras and the LiDAR to provide a more complete picture of the environment.

1. Camera-based Localisation of Traffic Cones

Method

  • Cone Recognition
    There are various algorithms in the domain of object recognition. “You Only Look Once” (YOLO), first introduced by Joseph Redmon, is a state-of-the-art object recognition algorithm in academia. Because of its high real-time efficiency, this CNN-based approach is used in our project. Our test set comes from the application scenario of our project, the Formula Student Racing Team of RWTH Aachen. The training dataset, provided by research groups at MIT and Shanghai Jiaotong University, is available online and has already been annotated by its owners. The image below shows the composition of the training set, which comprises isolated cones as well as samples from real driving situations.

Project Result

  • Cone Recognition
    Based on the available training sets, the training results are shown in the figure below. The training set with real driving scenarios delivers the best result, which is why it is used in the following work. The good performance on all four metrics implies that the trained model can be used for cone detection on the test set.

Tip: Recall and Precision are defined as follows,

Recall = TP / (TP + FN), Precision = TP / (TP + FP),

where TP denotes true positives, FP false positives and FN false negatives.

The GIF shows the performance of the model on the test set. Almost all cones in the field of view are identified correctly by the model with high confidence and few false positives, which implies both high recall and high precision.

Conclusion

For cone recognition, the trained model based on YOLO-v5 shows good performance on the test set, where almost all cones on the track are detected quickly and precisely.

2. LiDAR-based Tracking of Traffic Cones

LiDAR uses infrared laser beams to measure the distance between the sensor and obstacles.

  • Purpose: The objective is to create a perception pipeline in the Robot Operating System (ROS) to detect and localize obstacles in the 3D real world within a region of interest (ROI).
  • Working Principle: LiDAR emits pulsed laser beams at different angles, each pulse lasting only a few nanoseconds, and measures how long it takes for each ray of light to travel back. By scanning these beams across the field of view, it determines the distance to obstacles such as cars, pedestrians, and any other reflecting surface.
  • Applications: LiDAR provides high-resolution, accurate coordinate measurements for autonomous driving, road infrastructure mapping and surveying.
  • Pipeline Steps: Filtering, Segmentation, Clustering
  • Programming Language and Libraries: C++, Python, PCL, ROS Melodic

Advantages and Limitations of Using Various Sensors

[Source: Udacity]

The limitations of individual sensors can be overcome by combining several sensors, which dramatically enhances their capabilities under various environmental conditions and makes them more versatile in their applications.

LiDAR Schematic: HDL-64E

[Source: https://velodyneLiDAR.com/products/hdl-64e/]

Advantages: Higher spatial resolution, long range, depth information
Disadvantages: Size, weather sensitivity, noise

Method of Approach: Late Fusion

[Late fusion schematic ]

In contrast to early fusion, late fusion processes each sensor independently and combines the results afterwards. For example, the 2D bounding boxes obtained from the camera are projected into 3D and then fused with the bounding boxes obtained from the LiDAR. The goal of sensor fusion is to leverage the complementary strengths of the sensing modalities while compensating for their weaknesses.

LiDAR Pipeline in ROS

The LiDAR pipeline is designed to process raw point clouds to perform 3D object localisation.

Steps:

  1. Data Preparation
  2. Filtering
  3. Segmentation
  4. Clustering

Data Preparation

  • Datasets: rosbag
[rostopic list]

The sensor data available in our rosbag is depicted in the image above. The point cloud data from the LiDAR corresponds to the topic pandar40p/points_raw at a frequency of 10 Hz, and the image data from the three cameras is represented by the topics pylon_camera_16_driver/image_rect_color, pylon_camera_17_driver/image_rect_color and pylon_camera_18_driver/image_rect_color, also at 10 Hz.

  • Setting up sensor coordinate frames in ROS
[ROS tf tree]

The ROS tf package keeps track of the multiple coordinate frames of sensors and links over time. It maintains the relationships between coordinate frames in a tree structure buffered in time and provides coordinate transformations between any two frames at any desired point in time.

[Sensor coordinate frames]
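As a small illustration of how such a transform can be queried in code, the following sketch uses the tf2 listener; the frame names are assumptions and may differ from the ones in our tf tree.

// Illustrative sketch: querying the transform between the LiDAR frame and the
// middle camera frame with tf2. The frame names are assumptions.
#include <ros/ros.h>
#include <tf2_ros/transform_listener.h>
#include <geometry_msgs/TransformStamped.h>

int main(int argc, char** argv)
{
    ros::init(argc, argv, "tf_lookup_example");
    ros::NodeHandle nh;

    tf2_ros::Buffer tf_buffer;
    tf2_ros::TransformListener tf_listener(tf_buffer);

    ros::Rate rate(1.0);
    while (ros::ok())
    {
        try
        {
            // Latest available transform from the LiDAR frame into the camera frame
            geometry_msgs::TransformStamped t =
                tf_buffer.lookupTransform("pylon_camera_16", "pandar40p", ros::Time(0));
            ROS_INFO("LiDAR -> camera translation: [%.2f, %.2f, %.2f]",
                     t.transform.translation.x,
                     t.transform.translation.y,
                     t.transform.translation.z);
        }
        catch (const tf2::TransformException& ex)
        {
            ROS_WARN("%s", ex.what());
        }
        rate.sleep();
    }
    return 0;
}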

Filtering

Filtering removes excess data, either by thinning out points inside the region of interest (downsampling) or by discarding points outside it entirely.

Methods:

  1. VoxelGrid Downsampling Filter
  2. Cropbox Filter
  3. ExtractIndices Filter

The original unorganized, high-resolution point cloud is filtered by combining several filtering methods. Analogous to 2D pixels, the VoxelGrid class downsamples the input point cloud over a 3D voxel grid, leaving a single point per voxel, where a voxel is a tiny cube in space. The CropBox class then crops the voxelized point cloud to a region of interest, i.e. keeps all points within a given box, and the indices of individual points can be extracted with the ExtractIndices class.

[Extracting region of interest by performing various filtering techniques]

The image above shows the point cloud of a single frame (the LiDAR runs at 10 Hz). It is evident that the original point cloud is high-resolution, covering a large volume of space with 51,412 data points. After filtering, the initial point cloud is downsampled to just 3,740 points that enclose the region of interest.

Filtering function declaration

#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <pcl_ros/filters/filter.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/filters/crop_box.h>
#include <pcl/filters/extract_indices.h>

// Downsamples the raw point cloud and crops it to the region of interest
pcl::PCLPointCloud2::Ptr Voxelization(const sensor_msgs::PointCloud2ConstPtr& input,
                                      float voxel_size,
                                      Eigen::Vector4f min_dim,
                                      Eigen::Vector4f max_dim);

The Voxelization function takes three arguments in addition to the raw point cloud: the voxel size and the minimum and maximum corner vectors of the crop box. The voxel size is a float value; each voxel incorporates the points that lie within its region, so the larger the voxel size, the lower the resolution of the resulting cloud. The minimum and maximum vectors are given in homogeneous coordinates (i.e., [x, y, z, 1]) and crop the downsampled data to a specified region. The filtered cloud has a much lower resolution than the original cloud. Choosing the region of interest carefully is important, since it determines which obstacles in the vehicle's proximity are considered. Finally, the filtering step drastically reduces the computational effort needed to process the point clouds.
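A possible sketch of the Voxelization function using PCL's VoxelGrid and CropBox filters could look as follows; this is not the project's exact implementation, and the internal variable names are illustrative.

// Sketch: downsample with a voxel grid, then crop to the region of interest.
#include <sensor_msgs/PointCloud2.h>
#include <pcl_conversions/pcl_conversions.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/filters/crop_box.h>

pcl::PCLPointCloud2::Ptr Voxelization(const sensor_msgs::PointCloud2ConstPtr& input,
                                      float voxel_size,
                                      Eigen::Vector4f min_dim,
                                      Eigen::Vector4f max_dim)
{
    // Convert the incoming ROS message into the PCL representation
    pcl::PCLPointCloud2::Ptr cloud(new pcl::PCLPointCloud2);
    pcl_conversions::toPCL(*input, *cloud);

    // VoxelGrid: keep a single point per voxel of edge length voxel_size
    pcl::PCLPointCloud2::Ptr downsampled(new pcl::PCLPointCloud2);
    pcl::VoxelGrid<pcl::PCLPointCloud2> voxel;
    voxel.setInputCloud(cloud);
    voxel.setLeafSize(voxel_size, voxel_size, voxel_size);
    voxel.filter(*downsampled);

    // CropBox: keep only the points inside the box spanned by min_dim and max_dim
    pcl::PCLPointCloud2::Ptr cropped(new pcl::PCLPointCloud2);
    pcl::CropBox<pcl::PCLPointCloud2> crop;
    crop.setInputCloud(downsampled);
    crop.setMin(min_dim);
    crop.setMax(max_dim);
    crop.filter(*cropped);

    return cropped;
}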

Segmentation

Point cloud segmentation is the process of classifying point clouds into multiple homogeneous regions; points in the same region share the same properties.

Method:

  • RANSAC (Random Sample Consensus):
    RANSAC is an iterative method to estimate the parameters of a mathematical model from a set of observed data that contains outliers, where the outliers are to be accorded no influence on the values of the estimates. It is therefore also known as an outlier detection method.

In the image below, each point that fits a straight line within the specified threshold is treated as an inlier, and all remaining points are considered outliers.

[Source: Wikipedia]
[Plane fitting: inliers (ground plane) vs. outliers (traffic cones)]

The image above shows the segmentation of the filtered point cloud via plane fitting: the inliers form the ground plane (road) and the outliers the obstacles (traffic cones), using only points within the filtered region of interest.

Segmentation function declaration

// Splits the filtered cloud into obstacles (outliers) and a ground plane (inliers)
std::pair<pcl::PointCloud<PointT>::Ptr, pcl::PointCloud<PointT>::Ptr>
RANSAC_SegmentPlane(const pcl::PCLPointCloud2ConstPtr& input_cloud,
                    int maxIterations, float distanceThreshold);

The segmentation function takes three arguments: the point cloud data, the maximum number of iterations and the distance threshold. The filtered point cloud obtained in the filtering step is passed as the input point cloud. The maximum number of iterations specifies how many attempts RANSAC makes to find the best-fitting plane, and the distance threshold is a float value that defines how near a point must be to the model to be classified as an inlier. The robust RANSAC method then segments the input points into inliers and outliers.
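A sketch of how this function could be implemented on top of PCL's SACSegmentation is shown below; this is an assumed implementation, with the PointT typedef matching the one used in the callback further down.

// Sketch: fit a plane with RANSAC and split the cloud into obstacles and ground.
#include <utility>
#include <pcl/point_types.h>
#include <pcl/conversions.h>
#include <pcl/ModelCoefficients.h>
#include <pcl/sample_consensus/method_types.h>
#include <pcl/sample_consensus/model_types.h>
#include <pcl/segmentation/sac_segmentation.h>
#include <pcl/filters/extract_indices.h>

typedef pcl::PointXYZ PointT;

std::pair<pcl::PointCloud<PointT>::Ptr, pcl::PointCloud<PointT>::Ptr>
RANSAC_SegmentPlane(const pcl::PCLPointCloud2ConstPtr& input_cloud,
                    int maxIterations, float distanceThreshold)
{
    // Convert to a typed point cloud for segmentation
    pcl::PointCloud<PointT>::Ptr cloud(new pcl::PointCloud<PointT>);
    pcl::fromPCLPointCloud2(*input_cloud, *cloud);

    // Fit a plane model with RANSAC
    pcl::SACSegmentation<PointT> seg;
    seg.setModelType(pcl::SACMODEL_PLANE);
    seg.setMethodType(pcl::SAC_RANSAC);
    seg.setMaxIterations(maxIterations);
    seg.setDistanceThreshold(distanceThreshold);
    seg.setInputCloud(cloud);

    pcl::ModelCoefficients::Ptr coefficients(new pcl::ModelCoefficients);
    pcl::PointIndices::Ptr inlier_indices(new pcl::PointIndices);
    seg.segment(*inlier_indices, *coefficients);

    // Split the cloud: outliers = obstacles (cones), inliers = ground plane
    pcl::PointCloud<PointT>::Ptr obstacles(new pcl::PointCloud<PointT>);
    pcl::PointCloud<PointT>::Ptr ground(new pcl::PointCloud<PointT>);
    pcl::ExtractIndices<PointT> extract;
    extract.setInputCloud(cloud);
    extract.setIndices(inlier_indices);
    extract.setNegative(true);   // keep everything that is NOT on the plane
    extract.filter(*obstacles);
    extract.setNegative(false);  // keep the plane itself
    extract.filter(*ground);

    // first = outliers (obstacles), second = inliers (ground), as used in the callback
    return std::make_pair(obstacles, ground);
}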

Clustering

Clustering identifies similarities between objects, groups them according to the characteristics they have in common, and differentiates them from other groups of objects.

Classification vs Clustering

[Source: https://blog.bismart.com/en/classification-vs.-clustering-a-practical-explanation]

Machine learning employs two types of pattern recognition: classification and clustering. The main difference is that classification is a supervised machine learning method that requires both input data and labels, whereas clustering is an unsupervised machine learning technique that works without labelled data.

Method:

  • Euclidean Clustering with KD-Tree:
    Clustering divides the unorganized point cloud into smaller parts so that the overall processing time is significantly reduced.
    Euclidean clustering essentially groups points that are close together: by choosing a distance threshold, all points within that threshold of one another are treated as one cluster. For an efficient nearest-neighbour search, a KD-tree is used.
  • KD-Tree:
    A k-d tree organizes points in a space with k dimensions. It is a binary search tree with additional constraints imposed on it and is very useful for range and nearest-neighbour searches.
[Source: Wikipedia]
[Clustering in action]

Clustering function declaration

#include <pcl/kdtree/kdtree.h>
#include <pcl/segmentation/extract_clusters.h>
// Groups the obstacle points (outliers) into clusters, one per traffic cone
void EuclideanCluster(pcl::PointCloud<PointT>::Ptr cloud, float clusterTolerance,
                      int minSize, int maxSize);

The clustering function takes the segmented cloud (the outliers) as input, together with a cluster tolerance and a minimum and maximum cluster size. Choosing the cluster tolerance is critical: a wrong value may split a single obstacle into multiple clusters or merge several obstacles into one cluster. The minimum and maximum cluster size bound the number of points that each valid cluster may contain.
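A sketch of the clustering function based on PCL's EuclideanClusterExtraction might look as follows; the per-cluster post-processing is only indicated, and the internal names are illustrative.

// Sketch: group the obstacle points into clusters, one cluster per cone.
#include <vector>
#include <pcl/point_types.h>
#include <pcl/search/kdtree.h>
#include <pcl/segmentation/extract_clusters.h>

typedef pcl::PointXYZ PointT;

void EuclideanCluster(pcl::PointCloud<PointT>::Ptr cloud,
                      float clusterTolerance, int minSize, int maxSize)
{
    // KD-tree over the obstacle cloud for fast nearest-neighbour queries
    pcl::search::KdTree<PointT>::Ptr tree(new pcl::search::KdTree<PointT>);
    tree->setInputCloud(cloud);

    // Extract groups of points that lie within clusterTolerance of each other
    std::vector<pcl::PointIndices> cluster_indices;
    pcl::EuclideanClusterExtraction<PointT> ec;
    ec.setClusterTolerance(clusterTolerance);  // in metres
    ec.setMinClusterSize(minSize);
    ec.setMaxClusterSize(maxSize);
    ec.setSearchMethod(tree);
    ec.setInputCloud(cloud);
    ec.extract(cluster_indices);

    // Each entry in cluster_indices corresponds to one detected cone
    for (const pcl::PointIndices& indices : cluster_indices)
    {
        pcl::PointCloud<PointT>::Ptr cluster(new pcl::PointCloud<PointT>);
        for (int idx : indices.indices)
            cluster->points.push_back(cloud->points[idx]);
        // ... compute a centroid / bounding box and publish it here
    }
}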

Project Result

Callback function example: an overview of the pipeline processing the incoming point cloud data for object localisation.

void callback(const sensor_msgs::PointCloud2ConstPtr& cloud)
{
    typedef pcl::PointXYZ PointT;

    // Filtering: downsample and crop to the region of interest
    pcl::PCLPointCloud2::Ptr filterCloud = Voxelization(cloud, 0.1f,
        Eigen::Vector4f(-7.0f, -15.0f, -1.0f, 1.0f),
        Eigen::Vector4f(7.0f, -1.0f, -0.4f, 1.0f));

    // Segmentation: split into ground plane (inliers) and obstacles (outliers)
    std::pair<pcl::PointCloud<PointT>::Ptr, pcl::PointCloud<PointT>::Ptr> segmentCloud =
        RANSAC_SegmentPlane(filterCloud, 15, 0.03);

    // Clustering: group the obstacle points into individual cones
    EuclideanCluster(segmentCloud.first, 0.50, 3, 15);

    // Publishing the segmented data
    pub_outliers.publish(segmentCloud.first);
    pub_inliers.publish(segmentCloud.second);
}
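For completeness, a minimal node setup that wires this callback to the rosbag topic could look roughly like this; the node name and output topic names are assumptions, while the input topic is the one listed in the data preparation section.

// Illustrative node setup for the callback above (assumed, not the project's exact code).
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <pcl/point_types.h>
#include <pcl_ros/point_cloud.h>   // lets pcl::PointCloud<T> be published directly

ros::Publisher pub_inliers;
ros::Publisher pub_outliers;

void callback(const sensor_msgs::PointCloud2ConstPtr& cloud);  // defined above

int main(int argc, char** argv)
{
    ros::init(argc, argv, "lidar_cone_pipeline");
    ros::NodeHandle nh;

    // Outputs: segmented ground plane (inliers) and obstacle points (outliers)
    pub_inliers  = nh.advertise<pcl::PointCloud<pcl::PointXYZ>>("cones/inliers", 1);
    pub_outliers = nh.advertise<pcl::PointCloud<pcl::PointXYZ>>("cones/outliers", 1);

    // Input: raw point clouds from the Pandar40P LiDAR at 10 Hz
    ros::Subscriber sub = nh.subscribe("pandar40p/points_raw", 1, callback);

    ros::spin();
    return 0;
}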

Projection of clustered point clouds from 3D space onto the image plane:

The projection of 3D point clouds onto images requires intrinsic and extrinsic parameters. Intrinsic parameters refer to the camera matrix, whereas the extrinsic parameters describe the position and orientation of the cameras and the LiDAR with respect to a common reference coordinate system. Each 3D point must be translated and rotated to project the clustered point clouds onto the image plane, and the perspective projection can then be represented as matrices by which the point vector is multiplied. To keep these operations linear, both image and space coordinates are transformed from the n-dimensional Euclidean coordinate system into (n+1)-dimensional homogeneous coordinates, which is achieved simply by appending a 1 to the original Euclidean vector.

The pylon_camera_16_driver/camera_info topic from our rosbag provides the intrinsic parameters of the middle camera mounted on our vehicle. Since the images in the rosbag are already distortion-free, i.e. rectified, the rectifying rotation matrix R_rect is the identity matrix and P is the intrinsic calibration matrix. R and T are the rotation and translation of the LiDAR with respect to the middle camera, which together form the extrinsic parameters of the sensor setup. They are combined into a single homogeneous transformation matrix RT.

[Intrinsic Camera Parameters]

Finally, using homogeneous coordinates, the projection of a 3D LiDAR point X in space to a 2D image point Y on the image plane of the middle camera is given by Y = P · R_rect · RT · X.
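As an illustration, this projection could be computed with Eigen roughly as follows; the numerical values of P, RT and X below are placeholders, not our actual calibration.

// Illustrative sketch of the projection Y = P * R_rect * RT * X using Eigen.
// All numerical values are placeholders, not the real calibration.
#include <Eigen/Dense>
#include <iostream>

int main()
{
    // Intrinsic calibration matrix P (3x4), as provided by camera_info
    Eigen::Matrix<double, 3, 4> P;
    P << 1000.0,    0.0, 640.0, 0.0,
            0.0, 1000.0, 360.0, 0.0,
            0.0,    0.0,   1.0, 0.0;

    // Rectifying rotation (identity for already rectified images)
    Eigen::Matrix4d R_rect = Eigen::Matrix4d::Identity();

    // Homogeneous extrinsic transform RT: LiDAR frame -> camera frame
    Eigen::Matrix4d RT = Eigen::Matrix4d::Identity();

    // A clustered LiDAR point X in homogeneous coordinates [x, y, z, 1]
    Eigen::Vector4d X(5.0, -3.0, -0.5, 1.0);

    // Projection onto the image plane
    Eigen::Vector3d Y = P * R_rect * RT * X;

    // Normalise by the third component to obtain pixel coordinates (u, v)
    std::cout << "u = " << Y(0) / Y(2) << ", v = " << Y(1) / Y(2) << std::endl;
    return 0;
}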

[LiDAR Based Tracking of Traffic Cones in real time]

Conclusion

The C++ pipeline designed for processing point clouds successfully localises the traffic cones in the input point cloud data in real time, demonstrating high performance. Compared to CNN-based object detection on images alone, projecting the clustered point clouds onto the images additionally provides depth information, from which the distance between obstacles and the vehicle can be calculated. Finally, the fusion of camera and LiDAR data yields accurate localisation of the target objects, which helps the vehicle to navigate autonomously.

Outlook

  • Fusion of 3D bounding boxes obtained by projecting the camera’s 2D bounding boxes onto those obtained from the LiDAR dataset processing.
  • Kalman Filtering
  • Odometry Calculation


Team Members:

  • Prajwal Chatralinganadoddi Ramesh, Master’s student at RWTH Aachen University LinkedIn
  • Yongpeng Zhao, Master’s student at RWTH Aachen University LinkedIn
  • Till Beemelmanns, Mentor, Research Scientist, Institute for Automotive Engineering, RWTH Aachen LinkedIn

TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact journey.ac@techlabs.org so that the corresponding item can be removed.
