Masked Autoencoders (MAE) play a pivotal role in learning potent
representations, delivering outstanding results across various 3D perception
tasks essential for autonomous driving. In real-world driving scenarios, it is
common to deploy multiple sensors for comprehensive environment perception.
While fusing features from these sensors can produce rich and powerful
representations, existing MAE methods rarely address this multi-modal
integration. This research delves into multi-modal Masked
Autoencoders tailored for a unified representation space in autonomous driving,
aiming to pioneer a more efficient fusion of the two distinct modalities. To
marry the semantics inherent in images with the geometric structure of LiDAR
point clouds, we propose UniM2AE, a simple yet effective multi-modal
self-supervised pre-training framework consisting of two main designs. First,
it projects the features from both modalities into a unified 3D volume space,
expanded from the bird's eye view (BEV) to include the height dimension. This
extension makes it possible to back-project the fused multi-modal features
into their native modalities to reconstruct the multiple masked inputs.
Second, a Multi-modal 3D Interactive Module (MMIM) enables efficient
inter-modal interaction during fusion (see the sketches below). Extensive
experiments conducted on the nuScenes
dataset attest to the efficacy of UniM2AE, demonstrating improvements of
1.2\% (NDS) in 3D object detection and 6.5\% (mIoU) in BEV map segmentation,
respectively. Code is available at https://github.com/hollow-503/UniM2AE.
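
A minimal sketch of the first design, fusing both modalities in a shared 3D volume (a BEV grid extended with a height axis). The tensor shapes, module names, and the Conv3d fusion operator are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class UnifiedVolumeFusion(nn.Module):
    """Fuse image and LiDAR features in a shared 3D volume.

    The volume is a BEV grid extended with a height (Z) axis, so both
    modalities can be expressed on the same (Z, Y, X) lattice before fusion.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical fusion operator; the paper's design may differ.
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, img_vol: torch.Tensor, lidar_vol: torch.Tensor):
        # img_vol, lidar_vol: (B, C, Z, Y, X) features already lifted
        # (camera) or voxelized (LiDAR) into the shared volume.
        fused = self.fuse(torch.cat([img_vol, lidar_vol], dim=1))
        # During pre-training, `fused` would be back-projected into each
        # native modality to reconstruct the masked image patches and
        # masked LiDAR voxels; that inverse projection is omitted here.
        return fused

# Usage with toy shapes: batch 2, 64 channels, a 4x32x32 volume.
vol_img = torch.randn(2, 64, 4, 32, 32)
vol_lidar = torch.randn(2, 64, 4, 32, 32)
fused = UnifiedVolumeFusion(64)(vol_img, vol_lidar)
print(fused.shape)  # torch.Size([2, 64, 4, 32, 32])
```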
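And a sketch of the second design, inter-modal interaction in the spirit of MMIM. Bidirectional cross-attention is an assumed stand-in for the module's actual internals:

```python
import torch
import torch.nn as nn

class MMIMSketch(nn.Module):
    """Hypothetical stand-in for the Multi-modal 3D Interactive Module.

    Each modality's volume cells are flattened to tokens, and each
    modality attends to the other (bidirectional cross-attention).
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.img_from_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_lidar = nn.LayerNorm(dim)

    def forward(self, img_tok: torch.Tensor, lidar_tok: torch.Tensor):
        # img_tok, lidar_tok: (B, N, C) tokens from the shared 3D volume.
        img_upd, _ = self.img_from_lidar(self.norm_img(img_tok),
                                         lidar_tok, lidar_tok)
        lidar_upd, _ = self.lidar_from_img(self.norm_lidar(lidar_tok),
                                           img_tok, img_tok)
        # Residual connections keep each modality's original content.
        return img_tok + img_upd, lidar_tok + lidar_upd

# Usage with toy shapes: 2 samples, 4096 volume tokens, 64 channels.
img_tok, lidar_tok = torch.randn(2, 4096, 64), torch.randn(2, 4096, 64)
img_out, lidar_out = MMIMSketch(64)(img_tok, lidar_tok)
```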