FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection
The transformation of features from 2D perspective space to 3D space is
essential to multi-view 3D object detection. Recent approaches focus mainly on
the design of the view transformation itself, either lifting perspective-view
features pixel-wise into 3D space with estimated depth or constructing BEV
features grid-wise via 3D projection, treating all pixels or grids equally.
However, choosing what to transform is equally important but has rarely been
discussed: the pixels of a moving car are more informative than the pixels of the
sky. To fully utilize the information contained in images, the view
transformation should be able to adapt to different image regions according to
their contents. In this paper, we propose a novel framework named
FrustumFormer, which pays more attention to the features in instance regions
via adaptive instance-aware resampling. Specifically, the model obtains
instance frustums on the bird's eye view by leveraging image view object
proposals. An adaptive occupancy mask within the instance frustum is learned to
refine the instance location. Moreover, intersecting instance frustums across
time further reduces the localization uncertainty of objects. Comprehensive
experiments on the nuScenes dataset demonstrate the effectiveness of
FrustumFormer, and we achieve a new state-of-the-art performance on the
benchmark. Code and models will be made available at
https://github.com/Robertwyq/Frustum.
Comment: Accepted to CVPR 2023
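The core geometric idea of an instance frustum can be sketched concretely: a 2D box proposal in the image, swept over a depth range, bounds a trapezoidal footprint on the bird's-eye-view plane. The helper below is a hypothetical illustration of that lifting step only (it is not the paper's implementation, and `instance_frustum_bev`, the box format, and the depth bounds are assumptions); the actual model additionally learns an occupancy mask inside this region.

```python
import numpy as np

def instance_frustum_bev(box2d, K, d_min=1.0, d_max=60.0):
    """Back-project a 2D box proposal into a BEV frustum footprint.

    Hypothetical helper illustrating the instance-frustum idea: the
    left/right edges of an image-plane box, swept over a depth range,
    bound a trapezoid on the ground (bird's eye view) plane.

    box2d: (u1, v1, u2, v2) pixel box; K: 3x3 camera intrinsics.
    Returns 4 BEV corners (x: lateral, z: forward), camera frame.
    """
    u1, _, u2, _ = box2d
    fx, cx = K[0, 0], K[0, 2]
    corners = []
    for d in (d_min, d_max):
        for u in (u1, u2):
            x = (u - cx) / fx * d   # lateral offset at depth d
            corners.append((x, d))  # (x, z) point on the BEV plane
    # order: near-left, near-right, far-left, far-right
    return np.array(corners)
```

Queries falling inside this footprint would then be sampled densely, while the rest of the BEV grid is treated coarsely, which is the "adaptive instance-aware resampling" the abstract describes.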
Fully Sparse 3D Object Detection
As the perception range of LiDAR increases, LiDAR-based 3D object detection
becomes a dominant task in long-range perception for autonomous driving.
Mainstream 3D object detectors usually build dense feature maps in the network
backbone and prediction head. However, the computational and spatial costs of
these dense feature maps grow quadratically with the perception range, which
makes such detectors hard to scale to the long-range setting. To enable efficient
long-range LiDAR-based object detection, we build a fully sparse 3D object
detector (FSD). The computational and spatial cost of FSD is roughly linear in
the number of points and independent of the perception range. FSD is built upon
a general sparse voxel encoder and a novel sparse instance recognition (SIR)
module. SIR first groups the points into instances and then applies
instance-wise feature extraction and prediction. In this way, SIR resolves the
issue of center feature missing, which hinders the design of the fully sparse
architecture for all center-based or anchor-based detectors. Moreover, SIR
avoids the time-consuming neighbor queries in previous point-based methods by
grouping points into instances. We conduct extensive experiments on the
large-scale Waymo Open Dataset to reveal the working mechanism of FSD, and
state-of-the-art performance is reported. To demonstrate the superiority of FSD
in long-range detection, we also conduct experiments on the Argoverse 2
dataset, which has a much larger perception range (200m) than the Waymo Open
Dataset (75m). On such a large perception range, FSD achieves state-of-the-art
performance and is 2.4× faster than its dense counterpart. Code will be
released at https://github.com/TuSimple/SST.
Comment: NeurIPS 2022
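The SIR step — group points into instances, then do instance-wise feature extraction and prediction — can be sketched in a few lines. As a simplifying assumption, the snippet below replaces FSD's learned center voting and connected-component grouping with coarse grid hashing; `sparse_instance_pool` and the cell size are hypothetical names, not the paper's API. What it does preserve is the key property: per-instance pooling gives every point instance-level context without any dense BEV map or neighbor queries.

```python
import numpy as np

def sparse_instance_pool(points, feats, cell=2.0):
    """Minimal sketch of SIR-style instance-wise feature extraction.

    Assumption: instead of FSD's learned center voting + connected
    components, points are grouped by coarse grid hashing on their
    coordinates. Each group is treated as one instance; features are
    max-pooled per instance and broadcast back, so every point sees
    its instance-level context. Cost is linear in the point count.
    """
    keys = np.floor(points[:, :2] / cell).astype(np.int64)
    _, inst_ids = np.unique(keys, axis=0, return_inverse=True)
    n_inst = inst_ids.max() + 1
    pooled = np.full((n_inst, feats.shape[1]), -np.inf)
    np.maximum.at(pooled, inst_ids, feats)  # per-instance max pooling
    return inst_ids, pooled[inst_ids]       # pooled feature per point
```

Because every operation above indexes only the occupied points, there is no center feature to go missing — which is the failure mode the abstract says SIR resolves for fully sparse architectures.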
Echoes Beyond Points: Unleashing the Power of Raw Radar Data in Multi-modality Fusion
Radar is ubiquitous in autonomous driving systems due to its low cost and
good adaptability to bad weather. Nevertheless, radar detection performance is
usually inferior because its point cloud is sparse and inaccurate owing to poor
azimuth and elevation resolution. Moreover, point-cloud generation algorithms
drop weak signals to reduce false targets, which may be suboptimal for deep
fusion. In this paper, we propose a novel method
named EchoFusion that bypasses the existing radar signal-processing pipeline
and instead fuses raw radar data with other sensors. Specifically, we first
generate the Bird's Eye View (BEV) queries and then take corresponding spectrum
features from radar to fuse with other sensors. In this way, our method
exploits both the rich, lossless distance and speed clues from radar echoes
and the rich semantic clues from images, surpassing all existing methods on
the RADIal dataset and approaching the performance of LiDAR. The code will be
released at https://github.com/tusen-ai/EchoFusion.
Comment: Accepted by NeurIPS 2023
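The query step — "generate BEV queries and take corresponding spectrum features from radar" — amounts to a Cartesian-to-polar lookup, since a raw radar spectrum is naturally indexed by range and azimuth. The sketch below is an assumed geometry, not the paper's code: nearest-neighbor indexing stands in for the learned cross-attention, and `sample_radar_spectrum`, `r_max`, and `az_fov` are hypothetical names.

```python
import numpy as np

def sample_radar_spectrum(bev_queries, spectrum, r_max=100.0, az_fov=np.pi):
    """Sketch of BEV-query sampling from a radar spectrum (assumed setup).

    Each BEV query position (x forward, y left) is converted to polar
    (range, azimuth) and used to index a range-azimuth spectrum of
    shape (n_range, n_azimuth, C). Nearest-neighbor lookup stands in
    for the learned attention used in the actual method.
    """
    n_r, n_a = spectrum.shape[:2]
    x, y = bev_queries[:, 0], bev_queries[:, 1]
    rng = np.hypot(x, y)
    az = np.arctan2(y, x)                       # 0 rad = straight ahead
    ri = np.clip((rng / r_max * n_r).astype(int), 0, n_r - 1)
    ai = np.clip(((az / az_fov + 0.5) * n_a).astype(int), 0, n_a - 1)
    return spectrum[ri, ai]                     # one feature row per query
```

Sampling the spectrum directly is what lets the fusion keep the weak returns that a conventional CFAR-style point-cloud pipeline would have discarded.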