Unshuffling Data for Improved Generalization
Generalization beyond the training distribution is a core challenge in
machine learning. The common practice of mixing and shuffling examples when
training neural networks may not be optimal in this regard. We show that
partitioning the data into well-chosen, non-i.i.d. subsets treated as multiple
training environments can guide the learning of models with better
out-of-distribution generalization. We describe a training procedure to capture
the patterns that are stable across environments while discarding spurious
ones. The method takes a step beyond correlation-based learning: the choice of
the partitioning allows injecting information about the task that cannot be
otherwise recovered from the joint distribution of the training data. We
demonstrate multiple use cases with the task of visual question answering,
which is notorious for dataset biases. We obtain significant improvements on
VQA-CP, using environments built from prior knowledge, existing metadata, or
unsupervised clustering. We also get improvements on GQA using annotations of
"equivalent questions", and on multi-dataset training (VQA v2 / Visual Genome)
by treating them as distinct environments.
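The environment-based training idea is closely related to invariant risk minimization. As a rough illustration only, and not the authors' exact procedure, the PyTorch sketch below assumes a hypothetical list `envs` of per-environment batches and penalizes patterns whose optimal classifier differs across environments.

```python
import torch

def irm_penalty(logits, labels):
    # Gradient of the risk w.r.t. a dummy scale factor: a common proxy for
    # how far the per-environment optimal classifier is from the shared one.
    scale = torch.ones(1, requires_grad=True)
    loss = torch.nn.functional.cross_entropy(logits * scale, labels)
    grad = torch.autograd.grad(loss, scale, create_graph=True)[0]
    return (grad ** 2).sum()

def multi_env_step(model, optimizer, envs, penalty_weight=1.0):
    # envs: list of (inputs, labels) batches, one per training environment
    # (names and the penalty form are illustrative assumptions).
    risk, penalty = 0.0, 0.0
    for x, y in envs:
        logits = model(x)
        risk = risk + torch.nn.functional.cross_entropy(logits, y)
        penalty = penalty + irm_penalty(logits, y)
    loss = (risk + penalty_weight * penalty) / len(envs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design choice is that the loss is accumulated per environment rather than over one shuffled pool, which is what lets the partitioning carry information beyond the joint training distribution.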
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Recognising objects according to a pre-defined, fixed set of class labels has
been well studied in computer vision. In many practical applications, however,
the subjects of interest are not known beforehand, nor so easily delineated.
In many of these cases natural language dialog is a natural way to specify the
subject of interest, and the task of achieving this capability (a.k.a.
Referring Expression Comprehension) has recently attracted attention. To this
end we propose a unified framework, the
ParalleL AttentioN (PLAN) network, to discover the object in an image that is
being referred to in natural language expressions of variable length, from
short phrase queries to long multi-round dialogs. The PLAN network has two
attention mechanisms that relate parts of the expressions to both the global
visual content and also directly to object candidates. Furthermore, the
attention mechanisms are recurrent, making the referring process visualizable
and explainable. The attended information from these dual sources is combined
to reason about the referred object. These two attention mechanisms can be
trained in parallel and we find the combined system outperforms the
state-of-the-art on several benchmark datasets with language inputs of
different lengths, such as RefCOCO, RefCOCO+ and GuessWhat?!.
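As a rough sketch of the two-branch idea, one attention branch can relate the expression to global visual features while a parallel branch attends directly to object candidates, with the two contexts combined to score each candidate. The layer sizes, pooling, and scoring head below are assumptions for illustration, not the published PLAN architecture.

```python
import torch
import torch.nn as nn

class ParallelAttentionSketch(nn.Module):
    """Illustrative two-branch attention; all dimensions and module choices
    are assumptions, not the published PLAN architecture."""
    def __init__(self, dim=256):
        super().__init__()
        self.word_encoder = nn.GRU(dim, dim, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.object_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.scorer = nn.Linear(3 * dim, 1)

    def forward(self, word_feats, image_regions, object_feats):
        # word_feats:    (B, T, D) embedded expression / dialog tokens
        # image_regions: (B, R, D) global visual features (e.g. a feature grid)
        # object_feats:  (B, K, D) features of candidate objects
        query, _ = self.word_encoder(word_feats)                  # contextualised words
        ctx_global, _ = self.global_attn(query, image_regions, image_regions)
        ctx_object, _ = self.object_attn(query, object_feats, object_feats)
        # Pool the dual-source context over the expression, then score each
        # candidate object against the combined evidence.
        pooled = torch.cat([ctx_global.mean(1), ctx_object.mean(1)], dim=-1)  # (B, 2D)
        expanded = pooled.unsqueeze(1).expand(-1, object_feats.size(1), -1)   # (B, K, 2D)
        combined = torch.cat([object_feats, expanded], dim=-1)                # (B, K, 3D)
        return self.scorer(combined).squeeze(-1)                              # (B, K) scores
```

Because the two attention branches take the same recurrent query, their per-word attention maps can be inspected side by side, which is what makes the referring process visualizable in this style of model.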
Contextual Media Retrieval Using Natural Language Queries
The widespread integration of cameras in hand-held and head-worn devices as
well as the ability to share content online enables a large and diverse visual
capture of the world that millions of users build up collectively every day. We
envision these images, together with associated metadata such as GPS
coordinates and timestamps, forming a collective visual memory that can be
queried while automatically taking the ever-changing context of mobile users
into account. As a first step towards this vision, in this work we present
Xplore-M-Ego: a novel media retrieval system that allows users to query a
dynamic database of images and videos using spatio-temporal natural language
queries. We evaluate our system using a new dataset of real user queries as
well as through a usability study. One key finding is that there is a
considerable amount of inter-user variability, for example in the resolution of
spatial relations in natural language utterances. We show that our retrieval
system can cope with this variability using personalisation through an online
learning-based retrieval formulation.
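For intuition about how online personalisation of spatial-relation interpretation might look, the toy sketch below learns per-user weights over simple geometric features from click feedback. The relation set, feature vector, and perceptron-style update are illustrative assumptions, not the Xplore-M-Ego implementation.

```python
from collections import defaultdict

class PersonalisedSpatialResolver:
    """Toy online learner for user-specific spatial relations (illustrative)."""

    def __init__(self, learning_rate=0.1):
        self.lr = learning_rate
        # Per-user, per-relation weights over a small geometric feature vector,
        # e.g. (bearing alignment, inverse distance, recency) for one result.
        self.weights = defaultdict(lambda: defaultdict(lambda: [0.0, 0.0, 0.0]))

    def score(self, user, relation, features):
        # Higher score = this media item better matches the user's reading
        # of e.g. "left of the station" or "near the market".
        w = self.weights[user][relation]
        return sum(wi * fi for wi, fi in zip(w, features))

    def update(self, user, relation, features, clicked):
        # Perceptron-style feedback: reinforce features of results the user
        # selected, down-weight features of results they skipped.
        sign = 1.0 if clicked else -1.0
        w = self.weights[user][relation]
        for i, fi in enumerate(features):
            w[i] += self.lr * sign * fi

# Example: a user clicks a result retrieved for "café left of the bridge".
resolver = PersonalisedSpatialResolver()
resolver.update("user_42", "left_of", (0.9, 0.5, 0.2), clicked=True)
print(resolver.score("user_42", "left_of", (0.8, 0.4, 0.1)))
```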