Recently, pure camera-based Bird's-Eye-View (BEV) perception has provided a
feasible solution for economical autonomous driving. However, existing
BEV-based multi-view 3D detectors generally transform all image features into
BEV features, ignoring the problem that a large proportion of background
information may submerge the object information. In this paper, we propose
Semantic-Aware BEV Pooling (SA-BEVPool), which filters out background
information according to the semantic segmentation of image features and
transforms them into semantic-aware BEV features.
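A minimal sketch of this idea, assuming per-point semantic scores and
sum-pooling (the function name, the 0.25 threshold, and the pooling choice are
illustrative assumptions, not the exact implementation):

    import torch

    def sa_bev_pool(point_feats, sem_scores, bev_coords, bev_size, score_thresh=0.25):
        # point_feats: (N, C) image features lifted to 3D frustum points
        # sem_scores:  (N,) predicted foreground probability of each point
        # bev_coords:  (N, 2) integer (x, y) BEV-grid cell of each point
        keep = sem_scores > score_thresh                 # drop likely-background points
        feats, coords = point_feats[keep], bev_coords[keep]
        H, W = bev_size
        bev = point_feats.new_zeros(H * W, point_feats.shape[1])
        flat = (coords[:, 1] * W + coords[:, 0]).long()  # flatten (x, y) to one index
        bev.index_add_(0, flat, feats)                   # sum-pool points sharing a cell
        return bev.view(H, W, -1).permute(2, 0, 1)       # semantic-aware BEV feature (C, H, W)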
Accordingly, we propose BEV-Paste, an effective data augmentation strategy
that matches well with semantic-aware BEV features.
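The augmentation admits a simple sketch because semantic-aware BEV features
leave background cells nearly empty, so two samples can be mixed directly in
BEV space; the element-wise max merge below is an assumption, not necessarily
the exact operation used:

    import torch

    def bev_paste(bev_a, boxes_a, bev_b, boxes_b):
        # Mix two training samples in BEV space: merge their semantic-aware
        # BEV features and take the union of their 3D ground-truth boxes.
        mixed_bev = torch.maximum(bev_a, bev_b)
        mixed_boxes = torch.cat([boxes_a, boxes_b], dim=0)
        return mixed_bev, mixed_boxes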
In addition, we design a Multi-Scale Cross-Task (MSCT) head, which combines
task-specific and cross-task information to predict depth distribution and
semantic segmentation more accurately, further improving the quality of
semantic-aware BEV features.
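A heavily simplified single-scale sketch of such a head (the class name,
channel sizes, and fusion by concatenation are assumptions; the multi-scale
aggregation of the actual design is omitted):

    import torch
    import torch.nn as nn

    class MSCTHead(nn.Module):
        # Two task branches exchange intermediate features, so each prediction
        # sees both task-specific and cross-task information.
        def __init__(self, in_ch=256, mid_ch=128, depth_bins=64, num_classes=2):
            super().__init__()
            self.depth_stem = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
            self.seg_stem = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
            self.depth_out = nn.Conv2d(2 * mid_ch, depth_bins, 1)
            self.seg_out = nn.Conv2d(2 * mid_ch, num_classes, 1)

        def forward(self, x):
            d = self.depth_stem(x).relu()
            s = self.seg_stem(x).relu()
            fused = torch.cat([d, s], dim=1)            # cross-task fusion
            depth = self.depth_out(fused).softmax(1)    # per-pixel depth distribution
            seg = self.seg_out(fused)                   # semantic-segmentation logits
            return depth, seg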
Finally, we integrate the above modules into a novel multi-view 3D object
detection framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV
achieves state-of-the-art performance. Code is available at
https://github.com/mengtan00/SA-BEV.git