In the past few years, we have witnessed rapid development of autonomous
driving. However, achieving full autonomy remains a daunting task due to the
complex and dynamic driving environment. As a result, self-driving cars are
equipped with a suite of sensors to conduct robust and accurate environment
perception. As the number and type of sensors keep increasing, combining them
for better perception is becoming a natural trend. So far, there has been no
indepth review that focuses on multi-sensor fusion based perception. To bridge
this gap and motivate future research, this survey devotes to review recent
fusion-based 3D detection deep learning models that leverage multiple sensor
data sources, especially cameras and LiDARs. In this survey, we first introduce
the background of popular sensors for autonomous cars, including their common
data representations as well as object detection networks developed for each
type of sensor data. Next, we discuss some popular datasets for multi-modal 3D
object detection, with a special focus on the sensor data included in each
dataset. Then we present in-depth reviews of recent multi-modal 3D detection
networks by considering the following three aspects of the fusion: fusion
location, fusion data representation, and fusion granularity. After a detailed
review, we discuss open challenges and point out possible solutions. We hope
that our detailed review can help researchers to embark investigations in the
area of multi-modal 3D object detection