MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception
This paper proposes an efficient multi-camera to Bird's-Eye-View (BEV) view
transformation method for 3D perception, dubbed MatrixVT. Existing view
transformers either suffer from poor transformation efficiency or rely on
device-specific operators, hindering the broad application of BEV models. In
contrast, our method generates BEV features efficiently with only convolutions
and matrix multiplications (MatMul). Specifically, we propose describing the
BEV feature as the MatMul of the image feature and a sparse Feature Transporting
Matrix (FTM). A Prime Extraction module is then introduced to compress the
dimension of the image features and reduce the sparsity of the FTM. Moreover, we
propose the Ring & Ray Decomposition to replace the FTM with two matrices and
reformulate our pipeline to further reduce computation. Compared to existing
methods, MatrixVT enjoys faster speed and a smaller memory footprint while
remaining deploy-friendly. Extensive experiments on the nuScenes benchmark
demonstrate that our method is highly efficient yet obtains results on par with
the SOTA methods on the object detection and map segmentation tasks.
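Below is a minimal sketch of the core idea, assuming toy tensor shapes and illustrative variable names (none taken from the paper's code): the BEV feature is produced by a single matrix multiplication between flattened image features and a sparse FTM.

```python
# Minimal sketch of the MatrixVT idea: BEV feature = FTM @ image features.
# Shapes and names below are illustrative assumptions, not the paper's code.
import torch

n_img = 2 * 8 * 22     # flattened image positions (cameras x H x W), toy size
n_bev = 32 * 32        # flattened BEV grid cells, toy size
c = 64                 # feature channels

img_feats = torch.randn(n_img, c)          # per-position image features
# Each BEV cell aggregates from only a few image positions, so the FTM is
# highly sparse; here we fake that sparsity by thresholding random weights.
ftm = torch.rand(n_bev, n_img)
ftm = ftm * (ftm > 0.99).float()

bev_feats = ftm @ img_feats                # BEV feature via one MatMul
print(bev_feats.shape)                     # torch.Size([1024, 64])
```

Per the abstract, the Prime Extraction module shrinks this multiplication by compressing the image-feature dimension, and the Ring & Ray Decomposition then replaces the single FTM with two matrices to cut the computation further.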
MegDet: A Large Mini-Batch Object Detector
The improvements in recent CNN-based object detection works, from R-CNN [11]
and Fast/Faster R-CNN [10, 31] to the recent Mask R-CNN [14] and RetinaNet [24],
mainly come from new networks, new frameworks, or novel loss designs. But
mini-batch size, a key factor in training, has not been well studied. In this
paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable training
with a much larger mini-batch size than before (e.g., from 16 to 256), so that
we can effectively utilize multiple GPUs (up to 128 in our experiments) to
significantly shorten the training time. Technically, we suggest a warmup
learning rate policy and Cross-GPU Batch Normalization, which together allow us
to successfully train a large mini-batch detector in much less time (e.g., from
33 hours to 4 hours) and achieve even better accuracy. MegDet is the backbone
of our submission (mmAP 52.5%) to the COCO 2017 Challenge, where we won 1st
place in the Detection task.
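A hedged sketch of the Cross-GPU Batch Normalization idea follows, assuming an already-initialized torch.distributed process group; the function name and shapes are illustrative, not MegDet's actual implementation. Per-GPU channel statistics are all-reduced so normalization uses the whole (large) mini-batch:

```python
# Sketch of Cross-GPU Batch Normalization: per-GPU channel sums are
# all-reduced so the normalization statistics reflect the full mini-batch
# rather than each GPU's local shard. Assumes torch.distributed is already
# initialized; names are illustrative, not MegDet's code.
import torch
import torch.distributed as dist

def cross_gpu_batch_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (N, C, H, W) local portion of the global mini-batch on this GPU
    n_local = x.numel() / x.size(1)          # elements per channel, local
    s = x.sum(dim=(0, 2, 3))                 # per-channel sum
    ss = (x * x).sum(dim=(0, 2, 3))          # per-channel squared sum
    stats = torch.stack([s, ss])
    dist.all_reduce(stats)                   # aggregate sums across all GPUs
    n_global = n_local * dist.get_world_size()
    mean = stats[0] / n_global
    var = stats[1] / n_global - mean * mean  # E[x^2] - E[x]^2
    mean = mean[None, :, None, None]
    var = var[None, :, None, None]
    return (x - mean) / torch.sqrt(var + eps)
```

With statistics pooled this way, each GPU holds only a small local batch while BN behaves as if the full mini-batch were on one device, which is what makes the large-mini-batch regime trainable.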
EqCo: Equivalent Rules for Self-supervised Contrastive Learning
In this paper, we propose a method named EqCo (Equivalent Rules for
Contrastive Learning) to make self-supervised learning insensitive to the
number of negative samples in InfoNCE-based contrastive learning frameworks.
Inspired by the InfoMax principle, we point out that the margin term in the
contrastive loss needs to be adaptively scaled according to the number of
negative pairs in order to keep the mutual information bound and gradient
magnitude steady. EqCo bridges the performance gap across a wide range of
negative sample sizes, so that we can use only a few negative pairs (e.g., 16
per query) to perform self-supervised contrastive training on large-scale
vision datasets like ImageNet with almost no accuracy drop. This stands in
contrast to the large-batch training or memory-bank mechanisms widely used in
current practice. Equipped with EqCo, our simplified MoCo (SiMo) achieves
accuracy comparable to MoCo v2 on ImageNet (linear evaluation protocol) while
involving only 4 negative pairs per query instead of 65536, suggesting that a
large quantity of negative samples may not be a critical factor in the
InfoNCE loss.
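The following is one way the adaptive scaling could look in code, as a sketch rather than the paper's implementation: with K negatives and a target equivalent count alpha (both names assumed here), the negative term of InfoNCE is rescaled by alpha/K, which is equivalent to adding log(alpha/K) inside the softmax denominator's exponent for each negative.

```python
# Sketch of an EqCo-style InfoNCE loss: the negative term is rescaled by
# alpha / K so the loss behaves the same regardless of how many negatives K
# are actually used. Names and defaults are illustrative assumptions.
import torch
import torch.nn.functional as F

def eqco_infonce(q, k_pos, k_neg, tau=0.2, alpha=256.0):
    # q: (B, D) queries; k_pos: (B, D) positives; k_neg: (K, D) negatives
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    k_neg = F.normalize(k_neg, dim=1)
    K = k_neg.size(0)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True) / tau   # (B, 1)
    l_neg = q @ k_neg.t() / tau                          # (B, K)
    pos = torch.exp(l_pos)
    # Rescaling by alpha / K keeps the effective number of negatives fixed
    # at alpha even when K is small (e.g., 4 or 16 per query).
    neg = torch.exp(l_neg).sum(dim=1, keepdim=True) * (alpha / K)
    return -torch.log(pos / (pos + neg)).mean()
```

Holding alpha fixed keeps the mutual information bound and gradient scale roughly constant as K varies, which is the behavior the abstract credits for letting a 4-negative SiMo match a 65536-negative MoCo v2.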
- …