This paper proposes an efficient multi-camera to Bird's-Eye-View (BEV) view
transformation method for 3D perception, dubbed MatrixVT. Existing view
transformers either suffer from poor transformation efficiency or rely on
device-specific operators, hindering the broad application of BEV models. In
contrast, our method generates BEV features efficiently with only convolutions
and matrix multiplications (MatMul). Specifically, we propose describing the
BEV feature as the MatMul of the image feature and a sparse Feature Transporting
Matrix (FTM). A Prime Extraction module is then introduced to compress the
dimension of image features and reduce the FTM's sparsity. Moreover, we propose the
Ring \& Ray Decomposition to replace the FTM with two matrices and reformulate
our pipeline to further reduce computation. Compared to existing methods,
MatrixVT runs faster and has a smaller memory footprint while remaining
deployment-friendly. Extensive experiments on the nuScenes benchmark demonstrate
that our method is highly efficient yet obtains results on par with the SOTA
method in object detection and map segmentation tasks.
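
To illustrate the core idea of expressing view transformation as a MatMul, the following is a minimal NumPy sketch, not the actual MatrixVT implementation. It assumes a toy setting in which each flattened source feature location is assigned to exactly one BEV cell; the sizes, the random `bev_index` mapping, and all variable names are hypothetical. It builds a sparse binary transporting matrix and checks that multiplying it with the source features matches an explicit scatter-add.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (not taken from the paper).
num_src = 12   # flattened source feature locations
num_bev = 6    # flattened BEV grid cells
channels = 4   # feature channels

# Source features, one row per source location.
src_feat = rng.standard_normal((num_src, channels))

# Assumed mapping: each source location lands in exactly one BEV cell.
# In a real system this would be derived from camera geometry.
bev_index = rng.integers(0, num_bev, size=num_src)

# Transporting matrix: ftm[i, j] = 1 if source location j maps to BEV cell i.
ftm = np.zeros((num_bev, num_src))
ftm[bev_index, np.arange(num_src)] = 1.0

# View transformation as a single matrix multiplication.
bev_matmul = ftm @ src_feat

# Reference result: accumulate the same features with an explicit scatter-add.
bev_scatter = np.zeros((num_bev, channels))
np.add.at(bev_scatter, bev_index, src_feat)

print("MatMul matches scatter-add:", np.allclose(bev_matmul, bev_scatter))
print(f"Fraction of non-zero FTM entries: {ftm.mean():.3f}")
```

In this toy setting only one entry per column of the matrix is non-zero, which hints at why the sparsity of the FTM is worth reducing before the MatMul is carried out.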