Overwriting Pretrained Bias with Finetuning Data
Transfer learning is beneficial because it allows the expressive features of
models pretrained on large-scale datasets to be finetuned for a target task on
smaller, more domain-specific datasets. However, there is a concern that
pretrained models may carry their own biases, which would then propagate into
the finetuned model. In this work, we investigate bias conceptualized both as
spurious correlations between the target task and a sensitive attribute and as
underrepresentation of a particular group in the dataset. Under both notions
of bias, we find that (1) models finetuned on top of pretrained models can
indeed inherit their biases, but (2) this bias can be corrected through
relatively minor interventions to the finetuning dataset, often with
negligible impact on performance. Our findings imply that careful curation of
the finetuning dataset is important for reducing biases on a downstream task,
and doing so can even compensate for bias in the pretrained model.
Comment: ICCV 2023 Oral
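The abstract does not spell out what these interventions look like, but a minimal sketch of one plausible "minor intervention to the finetuning dataset", assuming bias takes the form of a spurious label-attribute correlation, is to subsample the finetuning set so the two are decorrelated. The function and key names below are illustrative, not the paper's actual procedure.

```python
import random
from collections import defaultdict

def rebalance(examples, label_key="label", attr_key="attribute", seed=0):
    """Subsample a finetuning set so every (label, attribute) cell is equally
    represented, breaking any spurious label-attribute correlation.

    `examples` is a list of dicts; the key names are illustrative only.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for ex in examples:
        cells[(ex[label_key], ex[attr_key])].append(ex)
    # The smallest cell caps how many examples every cell may keep.
    n = min(len(group) for group in cells.values())
    balanced = []
    for group in cells.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced
```

Equalizing the (label, attribute) cells removes the spurious correlation by construction and also addresses underrepresentation, at the cost of discarding examples from over-represented cells.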
Crowdsourcing in Computer Vision
Computer vision systems require large amounts of manually annotated data to
properly learn challenging visual concepts. Crowdsourcing platforms offer an
inexpensive method to capture human knowledge and understanding for a vast
number of visual perception tasks. In this survey, we describe the types of
annotations computer vision researchers have collected using crowdsourcing, and
how they have ensured that this data is of high quality while annotation effort
is minimized. We begin by discussing data collection on both classic (e.g.,
object recognition) and recent (e.g., visual story-telling) vision tasks. We
then summarize key design decisions for creating effective data collection
interfaces and workflows, and present strategies for intelligently selecting
the most important data instances to annotate. Finally, we conclude with some
thoughts on the future of crowdsourcing in computer vision.
Comment: A 69-page meta-review of the field; Foundations and Trends in
Computer Graphics and Vision, 2016
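As one concrete instance of "intelligently selecting the most important data instances to annotate", a widely used strategy in this literature is uncertainty sampling; the sketch below is a generic illustration under that assumption, not an example taken from the survey itself.

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Return indices of the k instances a model is least confident about,
    a common active-learning criterion for choosing what to send to annotators.

    probs: (n_instances, n_classes) array of predicted class probabilities.
    """
    confidence = probs.max(axis=1)     # top-1 predicted probability per instance
    return np.argsort(confidence)[:k]  # least confident first
```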
Much Ado About Time: Exhaustive Annotation of Temporal Data
Large-scale annotated datasets allow AI systems to learn from and build upon
the knowledge of the crowd. Many crowdsourcing techniques have been developed
for collecting image annotations. These techniques often implicitly rely on the
fact that a new input image takes a negligible amount of time to perceive. In
contrast, we investigate and determine the most cost-effective way of obtaining
high-quality multi-label annotations for temporal data such as videos. Watching
even a short 30-second video clip requires a significant time investment from a
crowd worker; thus, requesting multiple annotations following a single viewing
is an important cost-saving strategy. But how many questions should we ask per
video? We conclude that the optimal strategy is to ask as many questions as
possible in a HIT (up to 52 binary questions after watching a 30-second video
clip in our experiments). We demonstrate that while workers may not correctly
answer all questions, the cost-benefit analysis nevertheless favors consensus
from multiple such cheap-yet-imperfect iterations over more complex
alternatives. Compared with a one-question-per-video baseline, our method
achieves a 10% improvement in recall (76.7% ours versus 66.7%
baseline) at comparable precision (83.8% ours versus 83.0% baseline) in about
half the annotation time (3.8 minutes ours compared to 7.1 minutes baseline).
We demonstrate the effectiveness of our method by collecting multi-label
annotations of 157 human activities on 1,815 videos.
Comment: HCOMP 2016 Camera Ready
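A minimal sketch of the consensus step described above, assuming binary answers arrive as (video, question, answer) triples and are aggregated by per-question majority vote; the data layout is an assumption, not the paper's actual pipeline.

```python
from collections import Counter

def consensus(annotations):
    """Majority vote per (video, question) over repeated binary answers.

    annotations: iterable of (video_id, question_id, answer) triples with
    boolean answers. Returns {(video_id, question_id): consensus_answer}.
    """
    votes = {}
    for video_id, question_id, answer in annotations:
        votes.setdefault((video_id, question_id), Counter())[answer] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
```

This is the sense in which several cheap, imperfect passes can beat one careful pass: individual workers' errors are washed out by the vote.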
Multimodal Dataset Distillation for Image-Text Retrieval
Dataset distillation methods offer the promise of reducing a large-scale
dataset down to a significantly smaller set of (potentially synthetic) training
examples, which preserve sufficient information for training a new model from
scratch. So far dataset distillation methods have been developed for image
classification. However, with the rise in capabilities of vision-language
models, and especially given the scale of datasets necessary to train these
models, the time is ripe to expand dataset distillation methods beyond image
classification. In this work, we take the first steps towards this goal by
expanding on the idea of trajectory matching to create a distillation method
for vision-language datasets. The key challenge is that vision-language
datasets do not have a set of discrete classes. To overcome this, our proposed
multimodal dataset distillation method jointly distills the images and their
corresponding language descriptions in a contrastive formulation. Since there
are no existing baselines, we compare our approach to three coreset selection
methods (strategic subsampling of the training dataset), which we adapt to the
vision-language setting. We demonstrate significant improvements on the
challenging Flickr30K and COCO retrieval benchmarks: the best coreset selection
method, which selects 1000 image-text pairs for training, achieves only
5.6% image-to-text retrieval accuracy (recall@1); in contrast, our dataset
distillation approach almost doubles that with just 100 (an order of magnitude
fewer) training pairs.
Comment: 28 pages, 11 figures
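For readers unfamiliar with the contrastive formulation mentioned above, the sketch below shows a standard symmetric image-text InfoNCE loss; in a distillation setting the embeddings would be produced from a small set of learnable synthetic image-text pairs, so the loss is differentiable with respect to them. The shapes, names, and the omitted trajectory-matching outer loop are assumptions, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (n, d) tensors where row i of each forms a matched pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (n, n) similarities
    targets = torch.arange(image_emb.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; push them above all mismatched pairs
    # in both the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```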
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
The ability to perform effective planning is crucial for building an
instruction-following agent. When navigating through a new environment, an
agent is challenged with (1) connecting the natural language instructions with
its progressively growing knowledge of the world; and (2) performing long-range
planning and decision making in the form of effective exploration and error
correction. Current methods are still limited on both fronts despite extensive
efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a
model that performs global planning for navigation based on raw sensory input.
The model dynamically constructs a graphical representation, generalizes the
action space to allow for more flexible decision making, and performs efficient
planning on a proxy graph representation. We evaluate our model on a
challenging Vision-and-Language Navigation (VLN) task with photorealistic
images and achieve superior performance compared to previous navigation
architectures. For instance, we achieve a 53% success rate on the test split of
the Room-to-Room navigation task through pure imitation learning, outperforming
previous navigation architectures by up to 5%.
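A minimal sketch of the kind of dynamically constructed graph the abstract alludes to, in which each observed viewpoint becomes a node and unvisited neighbours form the candidate action space; the class and method names are illustrative, not the EGP implementation.

```python
class NavigationGraph:
    """Dynamically grown graph of visited viewpoints and frontier candidates."""

    def __init__(self):
        self.edges = {}      # viewpoint id -> set of neighbouring viewpoint ids
        self.visited = set()

    def observe(self, viewpoint, neighbours):
        """Record the current viewpoint and the neighbours it can reach."""
        self.visited.add(viewpoint)
        self.edges.setdefault(viewpoint, set()).update(neighbours)
        for n in neighbours:
            self.edges.setdefault(n, set()).add(viewpoint)

    def frontier(self):
        """Viewpoints that have been observed but not yet visited; these are
        the graph-level actions a global planner can choose among."""
        seen = {n for nbrs in self.edges.values() for n in nbrs}
        return seen - self.visited
```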
SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition
Understanding the spatial relations between objects in images is a
surprisingly challenging task. A chair may be "behind" a person even if it
appears to the left of the person in the image (depending on which way the
person is facing). Two students who appear close to each other in the image
may not in fact be "next to" each other if there is a third student between
them.
We introduce SpatialSense, a dataset specializing in spatial relation
recognition which captures a broad spectrum of such challenges, allowing for
proper benchmarking of computer vision techniques. SpatialSense is constructed
through adversarial crowdsourcing, in which human annotators are tasked with
finding spatial relations that are difficult to predict using simple cues such
as 2D spatial configuration or language priors. Adversarial crowdsourcing
significantly reduces dataset bias and samples more interesting relations in
the long tail compared to existing datasets. On SpatialSense, state-of-the-art
recognition models perform comparably to simple baselines, suggesting that they
rely on straightforward cues instead of fully reasoning about this complex
task. The SpatialSense benchmark provides a path forward to advancing the
spatial reasoning capabilities of computer vision systems. The dataset and code
are available at https://github.com/princeton-vl/SpatialSense.
Comment: Accepted to ICCV 2019
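To make "simple cues such as 2D spatial configuration" concrete, the sketch below is the sort of naive box-center heuristic that adversarial annotators are asked to defeat; the rules are illustrative and are not the benchmark's official baselines.

```python
def naive_spatial_predict(subj_box, obj_box):
    """Guess a spatial relation from 2D box centers alone, the kind of shortcut
    SpatialSense is built to defeat (e.g. "behind" cannot be read off pixels).

    Boxes are (x1, y1, x2, y2) in image coordinates.
    """
    scx, scy = (subj_box[0] + subj_box[2]) / 2, (subj_box[1] + subj_box[3]) / 2
    ocx, ocy = (obj_box[0] + obj_box[2]) / 2, (obj_box[1] + obj_box[3]) / 2
    dx, dy = scx - ocx, scy - ocy
    if abs(dx) >= abs(dy):
        return "to the right of" if dx > 0 else "to the left of"
    return "below" if dy > 0 else "above"   # image y grows downward
```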