Revisiting knowledge transfer for training object class detectors
We propose to revisit knowledge transfer for training object detectors on
target classes from weakly supervised training images, helped by a set of
source classes with bounding-box annotations. We present a unified knowledge
transfer framework based on training a single neural network multi-class object
detector over all source classes, organized in a semantic hierarchy. This
generates proposals with scores at multiple levels in the hierarchy, which we
use to explore knowledge transfer over a broad range of generality, ranging
from class-specific (bicycle to motorbike) to class-generic (objectness to any
class). Experiments on the 200 object classes in the ILSVRC 2013 detection
dataset show that our technique: (1) leads to much better performance on the
target classes (70.3% CorLoc, 36.9% mAP) than a weakly supervised baseline
which uses manually engineered objectness [11] (50.5% CorLoc, 25.4% mAP). (2)
delivers target object detectors reaching 80% of the mAP of their fully
supervised counterparts. (3) outperforms the best reported transfer learning
results on this dataset (+41% CorLoc and +3% mAP over [18, 46], +16.2% mAP over
[32]). Moreover, we also carry out several across-dataset knowledge transfer
experiments [27, 24, 35] and find that (4) our technique outperforms the weakly
supervised baseline in all dataset pairs by 1.5x-1.9x, establishing its general
applicability.
Comment: CVPR 1
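The hierarchy-based transfer can be pictured as a simple back-off scheme: to score a proposal for a target class, use the most specific source node available in the semantic hierarchy, falling back toward class-generic objectness at the root. The sketch below is a minimal illustration of that idea, not the paper's actual scoring procedure; the hierarchy, class names, and scores are invented placeholders.

```python
# Toy semantic hierarchy, child -> parent. The root plays the role of
# class-generic objectness; the names are invented placeholders.
HIERARCHY = {
    "bicycle": "two-wheeled vehicle",
    "motorbike": "two-wheeled vehicle",
    "two-wheeled vehicle": "vehicle",
    "vehicle": "entity",
}

def ancestors(node):
    """Yield the chain of hierarchy nodes from `node` up to the root."""
    while node in HIERARCHY:
        node = HIERARCHY[node]
        yield node

def transfer_score(target_class, proposal_scores):
    """Score one proposal for a target class by backing off through the
    hierarchy: use the most specific node for which the source detector
    produced a score, ending at class-generic objectness at the root."""
    for node in [target_class, *ancestors(target_class)]:
        if node in proposal_scores:
            return node, proposal_scores[node]
    return None, 0.0  # no source knowledge applies to this proposal

# One proposal, scored by the source detector at two hierarchy levels.
scores = {"two-wheeled vehicle": 0.81, "entity": 0.95}
print(transfer_score("motorbike", scores))  # ('two-wheeled vehicle', 0.81)
```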
Learning Intelligent Dialogs for Bounding Box Annotation
We introduce Intelligent Annotation Dialogs for bounding box annotation. We
train an agent to automatically choose a sequence of actions for a human
annotator to produce a bounding box in a minimal amount of time. Specifically,
we consider two actions: box verification, where the annotator verifies a box
generated by an object detector, and manual box drawing. We explore two kinds
of agents, one based on predicting the probability that a box will be
positively verified, and the other based on reinforcement learning. We
demonstrate that (1) our agents are able to learn efficient annotation
strategies in several scenarios, automatically adapting to the image
difficulty, the desired quality of the boxes, and the detector strength; (2) in
all scenarios the resulting annotation dialogs speed up annotation compared to
manual box drawing alone and box verification alone, while also outperforming
any fixed combination of verification and drawing in most scenarios; (3) in a
realistic scenario where the detector is iteratively re-trained, our agents
evolve a series of strategies that reflect the shifting trade-off between
verification and drawing as the detector grows stronger.
Comment: This paper appeared at CVPR 201
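For the probability-based agent, the core decision reduces to an expected-time comparison: verification is worth attempting only when its expected cost, including the fallback to drawing after a rejection, beats drawing outright. The sketch below illustrates this reasoning; the timings are illustrative assumptions, not the paper's measured values.

```python
def choose_action(p_verified, t_verify=2.0, t_draw=10.0):
    """Pick the next annotation action for one box.

    p_verified: predicted probability that the detector's box passes
    human verification. If the box is rejected, the annotator must
    still draw it, so starting with verification costs
    t_verify + (1 - p_verified) * t_draw in expectation, versus t_draw
    for drawing immediately. Timings here are illustrative only.
    """
    expected_verify_path = t_verify + (1.0 - p_verified) * t_draw
    return "verify" if expected_verify_path < t_draw else "draw"

print(choose_action(0.9))  # verify: 2.0 + 0.1 * 10.0 = 3.0s expected
print(choose_action(0.1))  # draw:   2.0 + 0.9 * 10.0 = 11.0s expected
```

As the abstract notes, this trade-off shifts as the detector grows stronger: higher verification probabilities push the agent toward verification-first dialogs.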
The Devil is in the Decoder: Classification, Regression and GANs
Many machine vision applications, such as semantic segmentation and depth
prediction, require predictions for every pixel of the input image. Models for
such problems usually consist of encoders which decrease spatial resolution
while learning a high-dimensional representation, followed by decoders which
recover the original input resolution and produce low-dimensional
predictions. While encoders have been studied rigorously, relatively few
studies address the decoder side. This paper presents an extensive comparison
of a variety of decoders for a variety of pixel-wise tasks ranging from
classification and regression to synthesis. Our contributions are: (1) Decoders
matter: we observe significant variance in results between different types of
decoders on various problems. (2) We introduce new residual-like connections
for decoders. (3) We introduce a novel decoder: bilinear additive upsampling.
(4) We explore prediction artifacts.
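Bilinear additive upsampling is defined concretely enough to sketch: bilinearly upsample the feature map, then average groups of consecutive channels so the total number of activations stays roughly constant. Below is a minimal PyTorch sketch under that reading; the group size, mean (rather than sum) normalization, and function name are assumptions of this sketch, not the authors' reference code.

```python
import torch
import torch.nn.functional as F

def bilinear_additive_upsample(x, scale=2, group=4):
    """Bilinearly upsample, then average every `group` consecutive
    channels, so the total number of activations stays roughly constant."""
    n, c, h, w = x.shape
    assert c % group == 0, "channel count must be divisible by the group size"
    up = F.interpolate(x, scale_factor=scale, mode="bilinear",
                       align_corners=False)
    # (N, C, H', W') -> (N, C/group, group, H', W'), then reduce the group axis.
    up = up.reshape(n, c // group, group, h * scale, w * scale)
    return up.mean(dim=2)

feat = torch.randn(1, 64, 8, 8)
print(bilinear_additive_upsample(feat).shape)  # torch.Size([1, 16, 16, 16])
```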
COCO-Stuff: Thing and Stuff Classes in Context
Semantic classes can be either things (objects with a well-defined shape,
e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky).
While many classification and detection works focus on thing classes, less
attention has been given to stuff classes. Nonetheless, stuff classes are
important as they help explain key aspects of an image, including (1)
scene type; (2) which thing classes are likely to be present and their location
(through contextual reasoning); (3) physical attributes, material types and
geometric properties of the scene. To understand stuff and things in context we
introduce COCO-Stuff, which augments all 164K images of the COCO 2017 dataset
with pixel-wise annotations for 91 stuff classes. We introduce an efficient
stuff annotation protocol based on superpixels, which leverages the original
thing annotations. We quantify the speed versus quality trade-off of our
protocol and explore the relation between annotation time and boundary
complexity. Furthermore, we use COCO-Stuff to analyze: (a) the importance of
stuff and thing classes in terms of their surface cover and how frequently they
are mentioned in image captions; (b) the spatial relations between stuff and
things, highlighting the rich contextual relations that make our dataset
unique; (c) the performance of a modern semantic segmentation method on stuff
and thing classes, and whether stuff is easier to segment than things.
Comment: CVPR 2018 camera-ready
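The surface-cover analysis in (a) reduces to simple pixel counting once a pixel-wise label map and a stuff/thing partition of the class ids are given. A minimal sketch, with a toy label map and placeholder class ids rather than the real COCO-Stuff ids:

```python
import numpy as np

def surface_cover(label_map, stuff_ids):
    """Fraction of pixels covered by stuff vs. thing classes.

    label_map: 2-D integer array of per-pixel class ids.
    stuff_ids: class ids counted as stuff; all others count as things.
    """
    stuff_fraction = float(np.isin(label_map, list(stuff_ids)).mean())
    return {"stuff": stuff_fraction, "things": 1.0 - stuff_fraction}

# Toy 2x3 label map: ids 1 and 2 are things, ids 91 and 92 are stuff.
labels = np.array([[92, 92, 92],
                   [91,  1,  2]])
print(surface_cover(labels, stuff_ids={91, 92}))  # stuff covers 4/6 of the pixels
```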
How stable are Transferability Metrics evaluations?
Transferability metrics are a maturing field attracting increasing interest;
they aim to provide heuristics for selecting the most suitable source models
to transfer to a given target dataset without fine-tuning them all. However,
existing works rely on custom experimental setups which differ across papers,
leading to inconsistent conclusions about which transferability metrics work
best. In this paper we conduct a large-scale study by systematically
constructing a broad range of 715k experimental setup variations. We discover
that even small variations to an experimental setup lead to different
conclusions about the superiority of a transferability metric over another.
Then we propose better evaluations by aggregating across many experiments,
which enables more stable conclusions. As a result, we reveal the
superiority of LogME at selecting good source datasets to transfer from in a
semantic segmentation scenario, NLEEP at selecting good source architectures in
an image classification scenario, and GBC at determining which target task
benefits most from a given source model. Yet, no single transferability metric
works best in all scenarios.
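The aggregation idea can be made concrete with a simple rank-averaging scheme: rank the transferability metrics within each experimental setup, then average ranks across all setups so that no single setup dictates the conclusion. The sketch below uses an assumed input format and invented scores; it illustrates the aggregation principle, not the paper's exact evaluation protocol.

```python
from collections import defaultdict

def mean_rank(results_per_setup):
    """Rank metrics within each setup (1 = best), then average the
    ranks across setups. Each element of `results_per_setup` maps a
    metric name to its evaluation score in one setup (higher is better)."""
    ranks = defaultdict(list)
    for setup in results_per_setup:
        ordered = sorted(setup, key=setup.get, reverse=True)
        for rank, metric in enumerate(ordered, start=1):
            ranks[metric].append(rank)
    return {metric: sum(r) / len(r) for metric, r in ranks.items()}

# Two toy setups that disagree on the winner; aggregation smooths this out.
setups = [
    {"LogME": 0.71, "NLEEP": 0.69, "GBC": 0.64},
    {"LogME": 0.60, "NLEEP": 0.66, "GBC": 0.58},
]
print(mean_rank(setups))  # {'LogME': 1.5, 'NLEEP': 1.5, 'GBC': 3.0}
```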
Extreme clicking for efficient object annotation
Manually annotating object bounding boxes is central to building computer
vision datasets, and it is very time consuming (annotating ILSVRC [53] took 35s
for one high-quality box [62]). It involves clicking on imaginary corners of a
tight box around the object. This is difficult as these corners are often
outside the actual object and several adjustments are required to obtain a
tight box. We propose extreme clicking instead: we ask the annotator to click
on four physical points on the object: the top, bottom, left- and right-most
points. This task is more natural and these points are easy to find. We
crowd-source extreme point annotations for PASCAL VOC 2007 and 2012 and show
that (1) annotation time is only 7s per box, 5x faster than the traditional way
of drawing boxes [62]; (2) the quality of the boxes is as good as the original
ground-truth drawn in the traditional way; (3) detectors trained on our
annotations are as accurate as those trained on the original ground-truth.
Moreover, our extreme clicking strategy not only yields box coordinates, but
also four accurate boundary points. We show (4) how to incorporate them into
GrabCut to obtain more accurate segmentations than those delivered when
initializing it from bounding boxes; (5) semantic segmentation models trained
on these segmentations outperform those trained on segmentations derived from
bounding boxes.
Comment: ICCV 201
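Deriving a box from the four extreme clicks is a direct computation: the leftmost and rightmost clicks give the x-extent and the top and bottom clicks give the y-extent (in image coordinates, where y grows downward). A minimal sketch with invented click coordinates:

```python
def box_from_extreme_points(top, bottom, left, right):
    """Tightest box through the four extreme clicks.

    Each argument is an (x, y) click on the top-, bottom-, left- and
    right-most physical point of the object; y grows downward.
    """
    xmin, xmax = left[0], right[0]
    ymin, ymax = top[1], bottom[1]
    return xmin, ymin, xmax, ymax

# Invented clicks on a toy object.
print(box_from_extreme_points(top=(40, 10), bottom=(55, 90),
                              left=(20, 50), right=(80, 45)))
# (20, 10, 80, 90)
```

The four clicks themselves carry more information than the box: they are accurate boundary points, which is what allows the GrabCut initialization described above.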
How (not) to ensemble LVLMs for VQA
This paper studies ensembling in the era of Large Vision-Language Models
(LVLMs). Ensembling is a classical method to combine different models to get
increased performance. In the recent work on Encyclopedic-VQA, the authors
examine a wide variety of models to solve their task: from vanilla LVLMs, to
models including the caption as extra context, to models augmented with
Lens-based retrieval of Wikipedia pages. Intuitively these models are highly
complementary, which should make them ideal for ensembling. Indeed, an oracle
experiment shows potential gains from 48.8% accuracy (the best single model)
all the way up to 67% (best possible ensemble). So it is a trivial exercise to
create an ensemble with substantial real gains. Or is it?
Comment: 4th I Can't Believe It's Not Better Workshop (co-located with NeurIPS 2023)
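The oracle experiment mentioned above has a simple operational meaning: a question counts as solved if any model in the pool answers it correctly, which upper-bounds what any real ensemble could achieve. A minimal sketch with invented model names and toy answers:

```python
def oracle_accuracy(predictions, gold):
    """Fraction of questions answered correctly by at least one model.

    predictions: model name -> list of answers, aligned with `gold`.
    This is an upper bound on what any real ensemble could achieve.
    """
    solved = sum(
        any(answers[i] == gold[i] for answers in predictions.values())
        for i in range(len(gold))
    )
    return solved / len(gold)

# Invented model names and toy answers.
preds = {
    "vanilla_lvlm":      ["paris",  "1969", "blue"],
    "lvlm_with_caption": ["london", "1969", "red"],
    "lens_retrieval":    ["paris",  "1970", "blue"],
}
print(oracle_accuracy(preds, ["london", "1969", "blue"]))  # 1.0
```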
Situational Object Boundary Detection
Intuitively, the appearance of true object boundaries varies from image to
image. Hence the usual monolithic approach of training a single boundary
predictor and applying it to all images regardless of their content is bound to
be suboptimal. In this paper we therefore propose situational object boundary
detection: We first define a variety of situations and train a specialized
object boundary detector for each of them using [Dollar and Zitnick 2013]. Then
given a test image, we classify it into these situations using its context,
which we model by global image appearance. We apply the corresponding
situational object boundary detectors, and fuse them based on the
classification probabilities. In experiments on ImageNet, Microsoft COCO, and
Pascal VOC 2012 segmentation we show that our situational object boundary
detection gives significant improvements over a monolithic approach.
Additionally, our method substantially outperforms [Hariharan et al. 2011] on
semantic contour detection on their SBD dataset.
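The fusion step can be sketched as a probability-weighted combination: each specialized detector produces a boundary map, and the maps are summed with weights given by the situation classifier. The code below is a minimal illustration under that reading; the situation names, array shapes, and values are invented.

```python
import numpy as np

def fuse_boundary_maps(situation_probs, boundary_maps):
    """Sum the specialized detectors' boundary maps, each weighted by
    the classifier's probability that the image is in that situation.

    situation_probs: situation name -> classification probability.
    boundary_maps: situation name -> HxW array of boundary strengths.
    """
    fused = None
    for name, prob in situation_probs.items():
        weighted = prob * boundary_maps[name]
        fused = weighted if fused is None else fused + weighted
    return fused

# Two invented situations over a toy 2x2 image.
probs = {"indoor": 0.3, "outdoor": 0.7}
maps = {
    "indoor":  np.array([[0.9, 0.1], [0.2, 0.8]]),
    "outdoor": np.array([[0.1, 0.6], [0.4, 0.2]]),
}
print(fuse_boundary_maps(probs, maps))
```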