DeepBox: Learning Objectness with Convolutional Networks
Existing object proposal approaches use primarily bottom-up cues to rank
proposals, while we believe that objectness is in fact a high level construct.
We argue for a data-driven, semantic approach for ranking object proposals. Our
framework, which we call DeepBox, uses convolutional neural networks (CNNs) to
rerank proposals from a bottom-up method. We use a novel four-layer CNN
architecture that is as good as much larger networks on the task of evaluating
objectness while being much faster. We show that DeepBox significantly improves
over the bottom-up ranking, matching with 500 proposals the recall that
bottom-up methods achieve with 2000. This improvement generalizes to
categories the CNN has never seen before and leads to a 4.5-point gain in
detection mAP. Our implementation achieves this performance while running at
260 ms per image. Comment: ICCV 2015 camera-ready version
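The core operation is lightweight: crop each bottom-up proposal, score it with a small CNN, and reorder the proposals by predicted objectness. A minimal PyTorch sketch under assumptions (the layer sizes and helper names below are illustrative, not the paper's exact four-layer configuration):

```python
import torch
import torch.nn as nn
import torchvision.ops as ops

class SmallObjectnessCNN(nn.Module):
    """Illustrative small CNN mapping a cropped proposal to an objectness score."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.score = nn.Linear(64, 1)  # higher = more object-like

    def forward(self, crops):                    # crops: (N, 3, H, W)
        feats = self.features(crops).flatten(1)
        return self.score(feats).squeeze(1)      # (N,)

def rerank_proposals(image, boxes, model, crop_size=64):
    """Rerank bottom-up proposals (x1, y1, x2, y2) by CNN objectness, highest first."""
    crops = ops.roi_align(image.unsqueeze(0), [boxes], output_size=crop_size)
    with torch.no_grad():
        scores = model(crops)
    order = scores.argsort(descending=True)
    return boxes[order], scores[order]
```

Keeping only the top few hundred reranked boxes is what yields the recall-at-500 behavior described above.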
Expert-level detection of acute intracranial hemorrhage on head computed tomography using deep learning.
Computed tomography (CT) of the head is used worldwide to diagnose neurologic emergencies. However, expertise is required to interpret these scans, and even highly trained experts may miss subtle life-threatening findings. For head CT, a unique challenge is to identify, with perfect or near-perfect sensitivity and very high specificity, often small, subtle abnormalities on a multislice cross-sectional (three-dimensional [3D]) imaging modality characterized by poor soft-tissue contrast, a low signal-to-noise ratio under current low-radiation-dose protocols, and a high incidence of artifacts. We trained a fully convolutional neural network with 4,396 head CT scans performed at the University of California at San Francisco and affiliated hospitals and compared the algorithm's performance to that of 4 American Board of Radiology (ABR) certified radiologists on an independent test set of 200 randomly selected head CT scans. Our algorithm demonstrated the highest accuracy to date for this clinical application, with a receiver operating characteristic (ROC) area under the curve (AUC) of 0.991 ± 0.006 for identification of examinations positive for acute intracranial hemorrhage, and also exceeded the performance of 2 of 4 radiologists. We demonstrate an end-to-end network that performs joint classification and segmentation, with examination-level classification comparable to experts and robust localization of abnormalities, including some that are missed by radiologists, both of which are critically important elements for this application.
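As a rough illustration of how per-slice segmentation output can drive examination-level classification, one simple aggregation is to take the maximum hemorrhage probability over all pixels and slices of an exam. This is a sketch under assumptions; the abstract does not state the paper's actual aggregation rule, and the threshold below is hypothetical:

```python
import torch

def exam_level_score(slice_prob_maps: torch.Tensor) -> torch.Tensor:
    """Aggregate per-pixel hemorrhage probabilities of shape (num_slices, H, W)
    into a single examination-level score; here, the maximum over the exam."""
    return slice_prob_maps.amax()

def classify_exam(slice_prob_maps: torch.Tensor, threshold: float = 0.5) -> bool:
    """Flag the exam as positive for acute intracranial hemorrhage when any
    region exceeds an operating threshold chosen on a validation set."""
    return bool(exam_level_score(slice_prob_maps) >= threshold)
```

Sweeping the threshold over a labeled test set is what produces the ROC curve and AUC reported above.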
Contrastive Feature Masking Open-Vocabulary Vision Transformer
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an
image-text pretraining methodology that achieves simultaneous learning of
image- and region-level representation for open-vocabulary object detection
(OVD). Our approach combines the masked autoencoder (MAE) objective with the
contrastive learning objective to improve the representation for localization
tasks. Unlike standard MAE, we perform reconstruction in the joint image-text
embedding space rather than in pixel space, which leads the model to learn
region-level semantics better.
Moreover, we introduce Positional Embedding Dropout (PED) to address scale
variation between image-text pretraining and detection finetuning by randomly
dropping out the positional embeddings during pretraining. PED improves
detection performance and enables the use of a frozen ViT backbone as a region
classifier, preventing the forgetting of open-vocabulary knowledge during
detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT
achieves a state-of-the-art 33.9 AP, surpassing the best approach by 7.6
points, and achieves better zero-shot detection transfer. Finally, CFM-ViT
acquires a strong image-level representation, outperforming the state of the art
on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks. Comment: Accepted to ICCV 2023
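Positional Embedding Dropout as described amounts to randomly withholding the positional embeddings before they are added to the patch tokens during pretraining. A minimal sketch under assumptions (the module name is illustrative, and whether the drop is applied per forward pass or per token is an implementation detail the abstract does not specify; here it is per forward pass):

```python
import torch
import torch.nn as nn

class PositionalEmbeddingDropout(nn.Module):
    """Randomly drop the positional embeddings during pretraining (PED idea);
    outside training, the embeddings are always added."""
    def __init__(self, drop_prob: float = 0.5):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, tokens: torch.Tensor, pos_embed: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch tokens; pos_embed: (1, N, D)
        if self.training and torch.rand(()) < self.drop_prob:
            return tokens              # positional information withheld
        return tokens + pos_embed
```

Training the backbone to work with and without positional cues is what lets a frozen ViT later serve as a region classifier at detection resolution.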
RECLIP: Resource-efficient CLIP by Training with Small Images
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes
computational resource footprint for CLIP (Contrastive Language Image
Pretraining). Inspired by the notion of coarse-to-fine in computer vision, we
leverage small images to learn from large-scale language supervision
efficiently, and then finetune the model with high-resolution data. Since
the complexity of the vision transformer heavily depends on input image size,
our approach significantly reduces the training resource requirements both in
theory and in practice. Using the same batch size and number of training epochs, RECLIP
achieves highly competitive zero-shot classification and image-text retrieval
accuracy with 6 to 8x less computational resources and 7 to 9x fewer FLOPs than
the baseline. Compared to the state-of-the-art contrastive learning methods,
RECLIP demonstrates 5 to 59x training resource savings while maintaining highly
competitive zero-shot classification and retrieval performance. Finally, RECLIP
matches the state of the art in transfer learning to open-vocabulary detection
tasks, achieving 32 APr on LVIS. We hope this work will pave the way for the
broader research community to explore language-supervised pretraining in
resource-friendly settings. Comment: Published in Transactions on Machine Learning Research
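The coarse-to-fine recipe amounts to two phases: contrastive pretraining on aggressively downsized images, then a short finetuning pass at full resolution. A schematic sketch under assumptions (the resolutions, step counts, and the `contrastive_loss` method are hypothetical placeholders, not RECLIP's exact settings):

```python
import torch
import torch.nn.functional as F

def resize_batch(images: torch.Tensor, size: int) -> torch.Tensor:
    """Downsample a batch of images (B, 3, H, W) to size x size."""
    return F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)

def train_reclip(model, loader, optimizer, small_size=64, full_size=224,
                 pretrain_steps=10_000, finetune_steps=1_000):
    """Phase 1: learn from large-scale language supervision at low resolution.
    Phase 2: brief high-resolution finetuning to close the gap."""
    step = 0
    for images, texts in loader:
        size = small_size if step < pretrain_steps else full_size
        loss = model.contrastive_loss(resize_batch(images, size), texts)  # assumed API
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= pretrain_steps + finetune_steps:
            break
```

Because ViT attention cost grows rapidly with the number of patches, spending most steps at the small resolution is where the claimed FLOP and resource savings come from.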
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
We present F-VLM, a simple open-vocabulary object detection method built upon
Frozen Vision and Language Models. F-VLM simplifies the current multi-stage
training pipeline by eliminating the need for knowledge distillation or
detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1)
retains the locality-sensitive features necessary for detection, and 2) is a
strong region classifier. We finetune only the detector head and combine the
detector and VLM outputs for each region at inference time. F-VLM shows
compelling scaling behavior and achieves a +6.5 mask AP improvement over the
previous state of the art on novel categories of the LVIS open-vocabulary
detection benchmark. In addition, we demonstrate very competitive results on the
COCO open-vocabulary detection benchmark and on cross-dataset transfer detection,
along with significant training speed-ups and compute savings. Code will be
released. Comment: 19 pages, 6 figures
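At inference, each region therefore has two per-class scores: one from the finetuned detector head and one from the frozen VLM acting as a region classifier against text embeddings of the category names. One plausible fusion rule is a weighted geometric mean, sketched below; treat it as an assumption, since the abstract does not spell out the exact combination:

```python
import torch

def fuse_region_scores(det_scores: torch.Tensor,
                       vlm_scores: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Combine per-region, per-class probabilities from the detector head and
    the frozen VLM via a weighted geometric mean.
    det_scores, vlm_scores: (num_regions, num_classes), values in [0, 1]."""
    det = det_scores.clamp_min(1e-6)
    vlm = vlm_scores.clamp_min(1e-6)
    return det.pow(1.0 - alpha) * vlm.pow(alpha)
```

Leaning more on the frozen VLM score (larger alpha) for categories unseen during detector finetuning is the intuition behind using the VLM as the region classifier for novel classes.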
Video Question Answering with Iterative Video-Text Co-Tokenization
Video question answering is a challenging task that requires jointly
understanding the language input, the visual information in individual video
frames, and the temporal information about the events occurring in the video. In
this paper, we propose a novel multi-stream video encoder for video question
answering that uses multiple video inputs and a new video-text iterative
co-tokenization approach to answer a variety of questions related to videos. We
experimentally evaluate the model on several datasets, such as MSRVTT-QA,
MSVD-QA, and IVQA, outperforming the previous state of the art by large margins.
Simultaneously, our model reduces the required GFLOPs from 150-360 to only 67,
producing a highly efficient video question answering model. Comment: ECCV 2022
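The efficiency comes from repeatedly condensing the video and text streams into a small set of fused tokens rather than attending over every frame feature. A very rough sketch of one such co-tokenization step using standard cross-attention follows; all module names, dimensions, and the single-step structure are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class CoTokenizationStep(nn.Module):
    """One illustrative iteration: a small set of learned tokens attends over the
    concatenated video and text features and becomes the fused representation."""
    def __init__(self, dim: int = 256, num_fused_tokens: int = 16, num_heads: int = 4):
        super().__init__()
        self.fused = nn.Parameter(torch.randn(1, num_fused_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T_video, D); text_feats: (B, T_text, D)
        context = torch.cat([video_feats, text_feats], dim=1)
        queries = self.fused.expand(video_feats.size(0), -1, -1)
        fused, _ = self.attn(queries, context, context)
        return self.norm(fused)            # (B, num_fused_tokens, D)
```

Iterating such a step a few times, and decoding the answer from the small fused-token set instead of the full frame sequence, is what keeps the per-question compute low.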