Cross Modal Distillation for Supervision Transfer
In this work we propose a technique that transfers supervision between images
from different modalities. We use learned representations from a large labeled
modality as a supervisory signal for training representations for a new
unlabeled paired modality. Our method enables learning of rich representations
for unlabeled modalities and can be used as a pre-training procedure for new
modalities with limited labeled data. We show experimental results where we
transfer supervision from labeled RGB images to unlabeled depth and optical
flow images and demonstrate large improvements for both these cross modal
supervision transfers. Code, data and pre-trained models are available at
https://github.com/s-gupta/fast-rcnn/tree/distillation
Comment: Updated version (v2) contains additional experiments and results
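As a rough illustration of the supervision transfer described in this abstract, the sketch below trains a student on one modality to reproduce the features that a frozen teacher computes on the paired sample from another modality. Everything in it is an assumption made for illustration: the toy two-dimensional "RGB" and "depth" vectors, the linear student, the squared-error feature-matching loss, and the names `teacher_features`, `student_features`, `make_pairs`, `distillation_loss`, and `train_student`. It is not the authors' implementation, which uses convolutional networks on real image pairs.

```python
import random


def teacher_features(rgb):
    """Hypothetical frozen teacher, standing in for a network trained on labeled RGB."""
    r, g = rgb
    return [0.5 * r + 0.25 * g, g - r]


def student_features(weights, depth):
    """Linear student that only sees the unlabeled modality (a toy depth vector)."""
    return [sum(w * x for w, x in zip(row, depth)) for row in weights]


def make_pairs(n):
    """Toy paired data: each depth sample is a fixed linear view of its RGB sample."""
    pairs = []
    for _ in range(n):
        r, g = random.uniform(-1, 1), random.uniform(-1, 1)
        pairs.append(((r, g), (r + g, r - g)))
    return pairs


def distillation_loss(weights, pairs):
    """Mean squared distance between student and teacher features over all pairs."""
    total = 0.0
    for rgb, depth in pairs:
        target = teacher_features(rgb)
        pred = student_features(weights, depth)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(pairs)


def train_student(pairs, lr=0.1, steps=300):
    """Full-batch gradient descent on the feature-matching loss; no labels are used."""
    weights = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        grads = [[0.0, 0.0], [0.0, 0.0]]
        for rgb, depth in pairs:
            target = teacher_features(rgb)
            pred = student_features(weights, depth)
            for i in range(2):
                err = 2.0 * (pred[i] - target[i]) / len(pairs)
                for j in range(2):
                    grads[i][j] += err * depth[j]
        for i in range(2):
            for j in range(2):
                weights[i][j] -= lr * grads[i][j]
    return weights
```

Calling `train_student(make_pairs(64))` returns student weights whose `distillation_loss` on those pairs is far below that of the zero-initialized student. The only training signal is the teacher's output on the paired sample, which is the sense in which supervision is transferred.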
Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
Closed-set 3D perception models trained on only a pre-defined set of object
categories can be inadequate for safety critical applications such as
autonomous driving where new object types can be encountered after deployment.
In this paper, we present a multi-modal auto labeling pipeline capable of
generating amodal 3D bounding boxes and tracklets for training models on
open-set categories without 3D human labels. Our pipeline exploits motion cues
inherent in point cloud sequences in combination with the freely available 2D
image-text pairs to identify and track all traffic participants. Compared to
the recent studies in this domain, which can only provide class-agnostic auto
labels limited to moving objects, our method can handle both static and moving
objects in the unsupervised manner and is able to output open-vocabulary
semantic labels thanks to the proposed vision-language knowledge distillation.
Experiments on the Waymo Open Dataset show that our approach outperforms the
prior work by significant margins on various unsupervised 3D perception tasks.
Comment: ICCV 202
Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetics
under the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of the cold weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed up techniques, and the recent state of the art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc., and makes an in-depth
analysis of their challenges as well as technical improvements in recent years.
Comment: This work has been submitted to the IEEE TPAMI for possible publication
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
We present F-VLM, a simple open-vocabulary object detection method built upon
Frozen Vision and Language Models. F-VLM simplifies the current multi-stage
training pipeline by eliminating the need for knowledge distillation or
detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1)
retains the locality-sensitive features necessary for detection, and 2) is a
strong region classifier. We finetune only the detector head and combine the
detector and VLM outputs for each region at inference time. F-VLM shows
compelling scaling behavior and achieves a +6.5 mask AP improvement over the
previous state of the art on novel categories of the LVIS open-vocabulary detection
benchmark. In addition, we demonstrate very competitive results on the COCO
open-vocabulary detection benchmark and cross-dataset transfer detection, in
addition to significant training speed-up and compute savings. Code will be
released.
Comment: 19 pages, 6 figures
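The abstract says the detector and VLM outputs are combined for each region at inference time but does not say how. The sketch below shows one plausible fusion rule, a weighted geometric mean of per-class scores. The function name `combine_scores`, the parameter `vlm_weight`, and the choice of geometric mean are assumptions for illustration, not details confirmed by this abstract.

```python
def combine_scores(detector_probs, vlm_probs, vlm_weight=0.5):
    """Fuse per-class scores for one region with a weighted geometric mean.

    detector_probs: class probabilities from the trained detector head.
    vlm_probs: class probabilities from the frozen VLM for the same region.
    vlm_weight: hypothetical knob in [0, 1]; 0 trusts only the detector,
        1 trusts only the VLM.
    """
    if not 0.0 <= vlm_weight <= 1.0:
        raise ValueError("vlm_weight must lie in [0, 1]")
    if len(detector_probs) != len(vlm_probs):
        raise ValueError("score lists must have the same length")
    return [
        (d ** (1.0 - vlm_weight)) * (v ** vlm_weight)
        for d, v in zip(detector_probs, vlm_probs)
    ]
```

A geometric mean only gives a class a high score when both sources agree, and a single weight makes it easy to lean on the VLM for categories the detector head never saw during training.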
Towards Open Vocabulary Learning: A Survey
In the field of visual scene understanding, deep neural networks have made
impressive advancements in various core tasks like segmentation, tracking, and
detection. However, most approaches operate on the closed-set assumption,
meaning that the model can only identify pre-defined categories that are
present in the training set. Recently, open vocabulary settings were proposed
due to the rapid progress of vision language pre-training. These new approaches
seek to locate and recognize categories beyond the annotated label space. The
open vocabulary approach is more general, practical, and effective compared to
weakly supervised and zero-shot settings. This paper provides a thorough review
of open vocabulary learning, summarizing and analyzing recent developments in
the field. In particular, we begin by comparing it to related concepts such as
zero-shot learning, open-set recognition, and out-of-distribution detection.
Then, we review several closely related tasks in the case of segmentation and
detection, including long-tail problems, few-shot, and zero-shot settings. For
the method survey, we first present the basics of closed-set detection and
segmentation as preliminary knowledge. Next, we examine
various scenarios in which open vocabulary learning is used, identifying common
design elements and core ideas. Then, we compare the recent detection and
segmentation approaches in commonly used datasets and benchmarks. Finally, we
conclude with insights, issues, and discussions regarding future research
directions. To our knowledge, this is the first comprehensive literature review
of open vocabulary learning. We continue to track related work at
https://github.com/jianzongwu/Awesome-Open-Vocabulary
Comment: Project page at https://github.com/jianzongwu/Awesome-Open-Vocabulary