Existing top-performing 3D object detectors typically rely on a
multi-modal fusion strategy. This design, however, is fundamentally
restricted: it overlooks useful modality-specific information and
ultimately hampers model performance. To address this limitation, in this
work we introduce a novel modality interaction strategy in which individual
per-modality representations are learned and maintained throughout, enabling
their unique characteristics to be exploited during object detection. To realize this
proposed strategy, we design a DeepInteraction architecture characterized by a
multi-modal representational interaction encoder and a multi-modal predictive
interaction decoder. Experiments on the large-scale nuScenes dataset show that
our proposed method surpasses all prior art, often by a large margin.
Crucially, our method ranks first on the highly competitive
nuScenes object detection leaderboard.

Comment: To appear at NeurIPS 2022. 16 pages, 7 figures.
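The core idea of the abstract, maintaining separate per-modality representations while letting the modalities exchange information, could be sketched with a bidirectional cross-attention layer. This is a minimal illustrative sketch, not the paper's actual DeepInteraction implementation: the function names (`cross_attend`, `interaction_layer`), the single-head attention, and the residual update are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context):
    # Single-head scaled dot-product cross-attention:
    # each query token aggregates information from the context tokens.
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def interaction_layer(lidar_feats, image_feats):
    # Bidirectional interaction: each modality queries the other,
    # while a residual connection preserves its own representation,
    # so neither modality's features are collapsed into a single fused map.
    lidar_out = lidar_feats + cross_attend(lidar_feats, image_feats)
    image_out = image_feats + cross_attend(image_feats, lidar_feats)
    return lidar_out, image_out

rng = np.random.default_rng(0)
lidar = rng.normal(size=(5, 16))   # 5 LiDAR tokens, feature dim 16
image = rng.normal(size=(8, 16))   # 8 image tokens, feature dim 16
l2, i2 = interaction_layer(lidar, image)
print(l2.shape, i2.shape)  # (5, 16) (8, 16)
```

Note that both outputs keep their original token counts and dimensions; the two representational streams stay separate across layers, which is the property the abstract contrasts with conventional fusion.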