On the generalized spectrum of bounded linear operators in Banach spaces
Utilizing the stability characterizations of generalized inverses, we investigate the generalized resolvent of linear operators in Banach spaces. We first prove that local analyticity of the generalized resolvent is equivalent to the continuity and local boundedness of generalized inverse functions. We also prove that several properties of the classical spectrum remain true for the generalized one. Finally, we explain why we use the generalized inverse, rather than the Moore-Penrose inverse or the group inverse, to define the generalized resolvent.
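For context, the following is a standard formulation of these notions (our gloss, with assumed notation; it is not quoted from the paper): an operator S is a generalized inverse of T when TST = T and STS = S, and the generalized resolvent set collects the points where a bounded such inverse exists.

```latex
% A gloss in standard notation (assumed, not taken from the paper):
% S is a generalized inverse of the bounded operator T when
T S T = T \quad \text{and} \quad S T S = S .
% The generalized resolvent set and generalized spectrum are then
\rho_g(T) = \{\lambda \in \mathbb{C} :
             \lambda I - T \ \text{admits a bounded generalized inverse}\},
\qquad
\sigma_g(T) = \mathbb{C} \setminus \rho_g(T).
```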
BEVHeight: A Robust Framework for Vision-based Roadside 3D Object Detection
While most recent autonomous driving systems focus on developing perception
methods for ego-vehicle sensors, people tend to overlook an alternative
approach: leveraging intelligent roadside cameras to extend perception beyond
the visual range. We find that state-of-the-art vision-centric bird's eye view
detection methods perform poorly on roadside cameras. This is because these
methods mainly focus on recovering the depth with respect to the camera center,
where the depth difference between the car and the ground quickly shrinks as
the distance increases. In this paper, we propose a simple yet effective
approach, dubbed BEVHeight, to address this issue. In essence, instead of
predicting the pixel-wise depth, we regress the height to the ground to achieve
a distance-agnostic formulation that eases the optimization of camera-only
perception methods. On popular 3D detection benchmarks for roadside cameras,
our method surpasses all previous vision-centric methods by a significant
margin. The code is available at https://github.com/ADLab-AutoDrive/BEVHeight.
Comment: Accepted by CVPR 202
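The geometric observation behind the abstract can be checked with a few lines of arithmetic. The sketch below (illustrative numbers, not the paper's code: an assumed 5 m camera mount and a 1.5 m car roof) shows that the camera-to-ground vs. camera-to-roof depth difference shrinks with distance, while the height above ground stays constant.

```python
import math

# Illustrative assumptions, not values from the paper:
CAM_H = 5.0   # roadside camera mounting height (m)
CAR_H = 1.5   # car roof height above the ground (m)

def depth_gap(d):
    """Depth difference between the ground point and the car-roof point
    at horizontal distance d from the camera."""
    d_ground = math.sqrt(d * d + CAM_H ** 2)            # depth to ground point
    d_roof = math.sqrt(d * d + (CAM_H - CAR_H) ** 2)    # depth to roof point
    return d_ground - d_roof

for d in (10.0, 50.0, 100.0):
    print(f"d={d:5.0f} m  depth gap={depth_gap(d):.3f} m  "
          f"height above ground={CAR_H} m (constant)")
```

A depth-based head must resolve an ever-smaller depth gap as cars move away, whereas a height-based target stays the same at any range, which is the distance-agnostic property the paper exploits.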
BEVHeight++: Toward Robust Visual Centric 3D Object Detection
While most recent autonomous driving systems focus on developing perception
methods for ego-vehicle sensors, people tend to overlook an alternative
approach: leveraging intelligent roadside cameras to extend perception beyond
the visual range. We find that state-of-the-art vision-centric bird's eye view
detection methods perform poorly on roadside cameras. This is because these
methods mainly focus on recovering the depth with respect to the camera center,
where the depth difference between the car and the ground quickly shrinks as
the distance increases. In this paper, we propose a simple yet effective
approach, dubbed BEVHeight++, to address this issue. In essence, we regress the
height to the ground to achieve a distance-agnostic formulation that eases the
optimization of camera-only perception methods. By incorporating both height
and depth encoding techniques, we achieve a more accurate and robust projection
from 2D to BEV space. On popular 3D detection benchmarks for roadside cameras,
our method surpasses all previous vision-centric methods by a significant
margin. In the ego-vehicle scenario, BEVHeight++ is also superior to depth-only
methods. Specifically, it yields a notable improvement of +1.9% NDS and +1.1%
mAP over BEVDepth on the nuScenes validation set. Moreover, on the nuScenes
test set, our method achieves substantial gains of +2.8% NDS and +1.7% mAP.
Comment: arXiv admin note: substantial text overlap with arXiv:2303.0849
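The 2D-to-BEV lifting that a height prediction enables can be sketched as a ray-plane intersection: once a pixel's height above the ground is known, its 3D position is the point where the camera ray meets the horizontal plane at that height. The camera pose, ray, and function name below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lift_with_height(cam_origin, ray_dir, height):
    """Intersect the ray cam_origin + t * ray_dir with the plane z = height."""
    cam_origin = np.asarray(cam_origin, dtype=float)
    ray_dir = np.asarray(ray_dir, dtype=float)
    t = (height - cam_origin[2]) / ray_dir[2]   # ray parameter at the plane
    return cam_origin + t * ray_dir

cam = np.array([0.0, 0.0, 5.0])      # assumed roadside camera, mounted 5 m up
ray = np.array([0.0, 1.0, -0.35])    # an assumed pixel ray looking down the road
point = lift_with_height(cam, ray, 1.5)  # predicted height: car roof at 1.5 m
print(point)  # 3D point (x, y, z) in the camera's world frame
```

Because the height target does not depend on how far the car is from the camera, the same predicted value lifts pixels consistently at any range.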
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up datasets with
image-text pairs collected from the web. However, the intrinsic noise and
unmatched image-text pairs in web data can degrade representation learning. To
address this issue, we first use the OFA model to generate synthetic captions
that focus on the image content. The generated captions contain complementary
information that benefits pre-training. We then propose Adaptive Language-Image
Pre-training (ALIP), a bi-path model that integrates supervision from both raw
text and synthetic captions. As the core components of ALIP, the Language
Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically
adjust the weights of samples and image-text/caption pairs during training.
Meanwhile, the adaptive contrastive loss effectively reduces the impact of
noisy data and improves the efficiency of the pre-training data. We validate
ALIP with experiments on models and pre-training datasets of different scales.
Experimental results show that ALIP achieves state-of-the-art performance on
multiple downstream tasks, including zero-shot image-text retrieval and linear
probing. To facilitate future research, the code and pre-trained models are
released at https://github.com/deepglint/ALIP.
Comment: 15 pages, 10 figures, ICCV202
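A bi-path, sample-weighted contrastive loss of the kind the abstract describes can be sketched as follows. Image features are matched against both raw-text and synthetic-caption features, and per-sample weights rescale each path's contribution; in ALIP those weights come from the LCG/DCG gates, whereas here they are plain inputs. All names and shapes are illustrative assumptions, not the released code.

```python
import numpy as np

def softmax_ce(logits, labels):
    # per-row cross-entropy of the softmax over in-batch similarities
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels]

def bipath_contrastive(img, txt, cap, w_txt, w_cap, tau=0.07):
    """Weighted sum of image-to-text and image-to-caption contrastive losses."""
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    img, txt, cap = norm(img), norm(txt), norm(cap)
    labels = np.arange(img.shape[0])                    # matches on the diagonal
    loss_txt = softmax_ce(img @ txt.T / tau, labels)    # raw-text path
    loss_cap = softmax_ce(img @ cap.T / tau, labels)    # synthetic-caption path
    return float((w_txt * loss_txt + w_cap * loss_cap).mean())

rng = np.random.default_rng(0)
B, D = 8, 32
loss = bipath_contrastive(rng.standard_normal((B, D)),
                          rng.standard_normal((B, D)),
                          rng.standard_normal((B, D)),
                          w_txt=np.ones(B), w_cap=0.5 * np.ones(B))
print(f"bi-path loss: {loss:.3f}")
```

Down-weighting a sample in both paths shrinks its gradient contribution, which is how gating can suppress noisy or mismatched web pairs during training.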