A Simple and Efficient Multi-task Network for 3D Object Detection and Road Understanding

Published 6 Mar 2021 in cs.CV and cs.RO | (2103.04056v1)

Abstract: Detecting dynamic objects and predicting static road information such as drivable areas and ground heights are crucial for safe autonomous driving. Previous works studied each perception task separately, and lacked a collective quantitative analysis. In this work, we show that it is possible to perform all perception tasks via a simple and efficient multi-task network. Our proposed network, LidarMTL, takes raw LiDAR point cloud as inputs, and predicts six perception outputs for 3D object detection and road understanding. The network is based on an encoder-decoder architecture with 3D sparse convolution and deconvolution operations. Extensive experiments verify the proposed method with competitive accuracies compared to state-of-the-art object detectors and other task-specific networks. LidarMTL is also leveraged for online localization. Code and pre-trained model have been made available at https://github.com/frankfengdi/LidarMTL.


Authors (5)

Citations (28)


Summary

The paper introduces LidarMTL, a simple and efficient multi-task network that uses sparse convolutions on raw LiDAR data to jointly perform six distinct perception tasks, including 3D object detection and road understanding.
Evaluated on the Argoverse Dataset, LidarMTL achieved competitive accuracy in 3D object detection and robust results in point-wise classification tasks like road understanding, often matching or exceeding single-task state-of-the-art models.
LidarMTL demonstrates significant efficiency improvements with reduced memory consumption and enhanced inference speed compared to chains of single-task models, making it highly suitable for real-time autonomous vehicle systems.

Multi-task Network for 3D Object Detection and Road Understanding

In autonomous driving, accurately detecting dynamic objects and understanding the static road scene are both essential. Existing approaches usually address these perception tasks separately. The paper "A Simple and Efficient Multi-task Network for 3D Object Detection and Road Understanding" instead presents a unified framework, LidarMTL, that handles them jointly, improving operational efficiency while maintaining competitive accuracy.

Core Contributions

The researchers introduce a multi-task learning (MTL) architecture for joint 3D object detection and road understanding. LidarMTL takes a raw LiDAR point cloud as input and produces six perception outputs in a single forward pass: 3D object detection, foreground classification, intra-object part location regression, drivable area classification, ground area classification, and ground height estimation.

The network is a 3D UNet-like encoder-decoder built from sparse convolution and deconvolution operations. The encoder extracts high-level features, and the decoder maps them back to a full-resolution spatial representation. Task-specific heads are attached to this shared backbone, so all tasks are solved together without duplicating a separate network for each one.
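The shared-backbone, multi-head layout can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dense linear layer stands in for the sparse encoder-decoder, the head names and output widths in `HEAD_DIMS` are assumptions chosen for illustration, and the 3D object detection head is omitted for brevity. The point it shows is only that shared features are computed once and reused by every head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output width of each point-wise head (assumed, not from the paper).
HEAD_DIMS = {
    "foreground": 1,      # logit: point belongs to an object
    "part_location": 3,   # intra-object (x, y, z) part coordinates
    "drivable_area": 1,   # logit: point lies on a drivable area
    "ground_area": 1,     # logit: point lies on the ground
    "ground_height": 1,   # regressed ground height
}


def init_heads(feature_dim):
    """Create one (weight, bias) pair per task head."""
    return {
        name: (rng.normal(scale=0.1, size=(feature_dim, dim)), np.zeros(dim))
        for name, dim in HEAD_DIMS.items()
    }


def shared_backbone(points, weight):
    """Stand-in for the sparse encoder-decoder: one linear layer plus ReLU."""
    return np.maximum(points @ weight, 0.0)


def forward(points, backbone_weight, heads):
    """Run the backbone once, then apply every task head to the same features."""
    features = shared_backbone(points, backbone_weight)
    return {name: features @ w + b for name, (w, b) in heads.items()}


# Toy input: 1000 points with (x, y, z, intensity) channels.
points = rng.normal(size=(1000, 4))
backbone_weight = rng.normal(scale=0.1, size=(4, 16))
heads = init_heads(feature_dim=16)
outputs = forward(points, backbone_weight, heads)
```

Each entry of `outputs` has one row per input point, so adding a task costs one extra head rather than one extra network.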

Evaluation and Results

In extensive experiments on the Argoverse dataset, LidarMTL matched or exceeded the accuracy of state-of-the-art single-task networks while using less memory and running faster at inference. For object detection, its mean average precision (mAP) is competitive with leading single-task detectors such as PV-RCNN and PointPillars.
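For readers unfamiliar with the metric, the following is a generic sketch of how average precision (AP) and mAP are commonly computed from scored detections. It assumes detections have already been matched to ground truth and flagged as true or false positives; the matching rule, the thresholds, and the exact evaluation protocol used in the paper are not described in this summary, and the function names are hypothetical.

```python
import numpy as np


def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Replace each precision value by the best precision at any higher recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sum precision over the recall increments, starting from recall 0.
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))


def mean_average_precision(per_class):
    """Average the per-class AP values; per_class maps name -> AP arguments."""
    return float(np.mean([average_precision(*args) for args in per_class.values()]))


# Toy example: three detections, two ground-truth objects.
ap = average_precision(
    scores=[0.9, 0.8, 0.7],
    is_true_positive=[1, 0, 1],
    num_ground_truth=2,
)
```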

Beyond object detection, the multi-task network delivered robust results on the point-wise classification tasks, such as foreground classification and road understanding. Its ground classification and ground height estimation results showed only small deviations, indicating consistent performance.

Efficiency and Practical Implications

Efficiency is a distinguishing feature of this work: LidarMTL has a smaller memory footprint and faster inference than a chain of single-task models. Its model size and computational cost make it well suited to real-time autonomous vehicle systems, where both matter because onboard hardware is limited.

In terms of practical applicability, the network's robustness to LiDAR point cloud sparsity extends its utility across different sensor setups and conditions. Moreover, applying LidarMTL to online localization shows that its semantic outputs can be integrated into real-world systems, improving localization accuracy by filtering out dynamic objects and road-specific details.
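The filtering idea can be sketched as a simple mask over per-point predictions. This is an illustrative assumption about how such filtering might look, not the pipeline from the paper: the function name, the 0.5 threshold, and the choice to drop only points with a high predicted foreground probability are all hypothetical.

```python
import numpy as np


def filter_static_points(points, foreground_prob, threshold=0.5):
    """Keep points whose predicted foreground probability is below threshold."""
    points = np.asarray(points, dtype=float)
    keep = np.asarray(foreground_prob, dtype=float) < threshold
    return points[keep]


# Toy example: three (x, y, z) points, the second one predicted to be on an object.
points = np.array([[1.0, 0.0, 0.2], [5.0, 2.0, 0.9], [9.0, -1.0, 0.1]])
foreground_prob = np.array([0.05, 0.92, 0.30])
static_points = filter_static_points(points, foreground_prob)
```

The remaining points would then be passed to the localization step in place of the full scan.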

Future Directions

This work opens avenues for integrating multi-frame LiDAR data for motion prediction and object tracking, which could further refine real-time autonomous vehicle systems. The gains observed in perception accuracy and computational efficiency support the potential of multi-task learning architectures in LiDAR-based perception systems.

In sum, the paper is a step towards holistic perception systems for autonomous driving: it addresses multiple key tasks in one network without sacrificing the performance or scalability required for real-world deployment. Future research could extend the set of tasks to other aspects of road scenes, such as traffic sign detection and intent prediction, further improving the operational safety and reliability of autonomous vehicles.
