With the prevalence of multimodal learning, camera-LiDAR fusion has gained
popularity in 3D object detection. Although multiple fusion approaches have
been proposed, they can be classified as either sparse-only or dense-only
based on the feature representation used in the fusion module. In this
paper, we analyze them within a common taxonomy and observe two
challenges: 1) sparse-only solutions preserve the 3D geometric prior yet lose
the rich semantic information from the camera, and 2) dense-only alternatives
retain semantic continuity but miss the accurate geometric information from
LiDAR. By analyzing these two formulations, we conclude that the information
loss is inevitable in either design. To compensate for this
loss, we propose Sparse Dense Fusion (SDF), a
complementary framework that incorporates both sparse-fusion and dense-fusion
modules via the Transformer architecture. Such a simple yet effective
sparse-dense fusion structure enriches semantic texture and exploits spatial
structure information simultaneously. Through our SDF strategy, we assemble two
popular methods with moderate performance and outperform the baseline by 4.3% in
mAP and 2.5% in NDS, ranking first on the nuScenes benchmark. Extensive
ablations demonstrate the effectiveness of our method and empirically align
with our analysis.
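
To make the sparse-dense idea concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation) of how object queries might cross-attend first to sparse per-point fused features and then to dense BEV fused features within a Transformer-style block. All module names, feature dimensions, and the two-stage attention ordering here are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class SparseDenseFusionBlock(nn.Module):
    """Hypothetical sparse-dense fusion block (illustrative sketch only).

    Object queries first cross-attend to sparse (point-level) fused
    features, which carry the accurate LiDAR geometric prior, then to
    dense (BEV-grid) fused features, which carry semantic continuity
    from the camera. Dimensions are placeholders, not the paper's.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.sparse_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dense_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, queries, sparse_feats, dense_feats):
        # Cross-attention over sparse fused features (geometric prior).
        q = self.norm1(
            queries + self.sparse_attn(queries, sparse_feats, sparse_feats)[0]
        )
        # Cross-attention over dense fused features (semantic context).
        q = self.norm2(q + self.dense_attn(q, dense_feats, dense_feats)[0])
        # Position-wise feed-forward refinement of the fused queries.
        return self.norm3(q + self.ffn(q))


if __name__ == "__main__":
    block = SparseDenseFusionBlock()
    queries = torch.randn(2, 900, 256)      # object queries
    sparse = torch.randn(2, 4096, 256)      # per-point fused features
    dense = torch.randn(2, 180 * 180, 256)  # flattened BEV fused features
    out = block(queries, sparse, dense)
    print(out.shape)  # torch.Size([2, 900, 256])
```

Under this reading, the complementarity is structural: the same set of queries aggregates geometric evidence from the sparse branch and semantic evidence from the dense branch, so neither representation's loss has to be borne alone.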