Programmable Agents
We build deep RL agents that execute declarative programs expressed in formal
language. The agents learn to ground the terms in this language in their
environment, and can generalize their behavior at test time to execute new
programs that refer to objects that were not referenced during training. The
agents develop disentangled interpretable representations that allow them to
generalize to a wide variety of zero-shot semantic tasks.
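To make the idea concrete, below is a minimal sketch of grounding declarative program terms in learned per-object detectors. The Predicate module, the soft-AND semantics, and the toy RED/SPHERE program are all hypothetical illustrations, not the paper's actual architecture.

```python
# Hypothetical sketch: grounding declarative program terms in learned detectors.
import torch
import torch.nn as nn

class Predicate(nn.Module):
    """Learned detector that scores whether an object satisfies a term (e.g. RED)."""
    def __init__(self, obj_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obj_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (num_objects, obj_dim) -> per-object truth scores in [0, 1]
        return torch.sigmoid(self.net(objects)).squeeze(-1)

def execute(program, objects, predicates):
    """Evaluate a tiny declarative program such as AND(RED, SPHERE) per object."""
    op, *args = program
    if op == "AND":  # soft conjunction over sub-programs
        scores = [execute(a, objects, predicates) for a in args]
        return torch.stack(scores).prod(dim=0)
    return predicates[op](objects)  # leaf term: apply its learned grounding

obj_dim = 16
objects = torch.randn(5, obj_dim)  # 5 detected objects (stand-in features)
predicates = {name: Predicate(obj_dim) for name in ["RED", "SPHERE"]}
scores = execute(("AND", ("RED",), ("SPHERE",)), objects, predicates)
print(scores)  # which object best matches the program's referent
```

Because new terms are just new leaf predicates over the same object representation, a program referring to an unseen term combination can be executed without retraining the executor, which is the flavor of zero-shot generalization the abstract describes.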
Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling replacing a fixed set of supported
classes with zero-shot open vocabulary reasoning over (almost arbitrary)
natural language prompts. However, recent works have uncovered a fundamental
weakness of these models: for example, their difficulty in understanding Visual
Language Concepts (VLC) that go 'beyond nouns', such as the meaning of
non-object words (e.g., attributes, actions, relations, states), or their
difficulty in performing compositional reasoning, such as understanding the
significance of word order in a sentence. In this work, we investigate to what
extent purely synthetic data could be leveraged to teach these models to
overcome such shortcomings without compromising their zero-shot capabilities.
We contribute Synthetic Visual Concepts (SyViC) - a million-scale synthetic
dataset and data generation codebase that allows generating additional suitable
data to improve the VLC understanding and compositional reasoning of VL models.
Additionally, we propose a general VL finetuning strategy for
effectively leveraging SyViC towards achieving these improvements. Our
extensive experiments and ablations on VL-Checklist, Winoground, and ARO
benchmarks demonstrate that it is possible to adapt strong pre-trained VL
models with synthetic data, significantly enhancing their VLC understanding
(e.g., by 9.9% on ARO and 4.3% on VL-Checklist) with under a 1% drop in their
zero-shot accuracy.
Comment: Accepted to ICCV 2023. Project page: https://synthetic-vic.github.io
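As a rough illustration of finetuning a VL model on synthetic image-text pairs, the sketch below implements a standard CLIP-style symmetric contrastive step. The random embeddings stand in for encoder outputs; nothing here reproduces the paper's actual SyViC finetuning recipe.

```python
# Illustrative sketch: one contrastive finetuning step on synthetic pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))        # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for image/text encoder outputs on a synthetic batch.
img_emb = torch.randn(32, 512, requires_grad=True)
txt_emb = torch.randn(32, 512, requires_grad=True)
loss = clip_contrastive_loss(img_emb, txt_emb)
loss.backward()  # gradients flow to both towers
```

In practice, preserving zero-shot accuracy while finetuning (the under-1% drop reported above) requires a careful adaptation strategy, which is precisely what the paper's proposed finetuning recipe addresses.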
Attributes2Classname: A discriminative model for attribute-based unsupervised zero-shot learning
We propose a novel approach for unsupervised zero-shot learning (ZSL) of
classes based on their names. Most existing unsupervised ZSL methods aim to
learn a model for directly comparing image features and class names. However,
this proves to be a difficult task due to the dominance of non-visual semantics
in the underlying vector-space embeddings of class names. To address this
issue, we discriminatively learn a word representation such that the similarity
between a class name and a combination of attribute names falls in line with
visual similarity. Unlike traditional zero-shot learning approaches built upon
attribute presence, our approach bypasses the laborious attribute-class
relation annotations for unseen classes. In addition, our proposed approach
renders text-only training possible; hence, training can be augmented without
the need to collect additional image data. The
experimental results show that our method yields state-of-the-art results for
unsupervised ZSL on three benchmark datasets.
Comment: To appear at IEEE Int. Conference on Computer Vision (ICCV) 2017
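The following toy sketch illustrates the kind of text-only objective the abstract describes: trainable word embeddings pushed so that a class name scores higher with its own attribute combination than a confusable class does. The vocabulary, margin, and scoring are illustrative assumptions, not the paper's exact formulation.

```python
# Toy sketch: aligning class-name embeddings with attribute-name combinations.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["zebra", "horse", "striped", "four_legged"]
word_idx = {w: i for i, w in enumerate(vocab)}
embed = nn.Embedding(len(vocab), 32)  # trainable word representations

def class_vs_attributes_score(class_name, attribute_names):
    """Cosine similarity between a class embedding and its attributes' mean."""
    c = embed(torch.tensor(word_idx[class_name]))
    a = embed(torch.tensor([word_idx[w] for w in attribute_names])).mean(dim=0)
    return F.cosine_similarity(c, a, dim=0)

# Margin ranking: "zebra" should match "striped" attributes better than "horse".
pos = class_vs_attributes_score("zebra", ["striped", "four_legged"])
neg = class_vs_attributes_score("horse", ["striped", "four_legged"])
loss = F.relu(0.2 - pos + neg)  # text-only objective: no images needed
loss.backward()
```

Since every quantity in the loss is a word embedding, training data can be expanded with more class/attribute name pairs alone, which is what makes the text-only augmentation described above possible.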
A Short Survey of Systematic Generalization
This survey covers systematic generalization and the history of how machine
learning has addressed it. We aim to summarize and organize information on both
conventional approaches and recent improvements. We first look at the
definition of systematic generalization, then introduce the Classicist and
Connectionist views. We then discuss different types of Connectionism and how
they approach generalization. Two crucial problems, variable binding and
causality, are discussed. We look into systematic generalization in the fields
of language, vision, and VQA, and discuss recent improvements from different
aspects. Systematic generalization has a long history in artificial
intelligence, and we can cover only a small portion of the many contributions.
We hope this paper provides background and is beneficial for discoveries in
future work.
Sherlock: Scalable Fact Learning in Images
We study scalable and uniform understanding of facts in images. Existing
visual recognition systems are typically modeled differently for each fact type
such as objects, actions, and interactions. We propose a setting where all
these facts can be modeled simultaneously with a capacity to understand
an unbounded number of facts in a structured way. The training data comes as
structured facts in images, including (1) objects, (2) attributes, (3) actions,
and (4) interactions. Each fact has a semantic language view and a visual view
(an image with this
fact). We show that learning visual facts in a structured way enables not only
a uniform but also generalizable visual understanding. We propose and
investigate recent and strong approaches from the multiview learning literature
and also introduce two learning representation models as potential baselines.
We apply the investigated methods to several datasets that we augmented with
structured facts, and to a large-scale dataset of more than 202,000 facts and
814,000 images. Our experiments show the advantage of relating facts through
their structure with the proposed models, compared to the designed baselines,
on bidirectional fact retrieval.
Comment: Jan 7 Update
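A small sketch of the setting: structured facts with optional action/object slots, and bidirectional retrieval by cosine similarity in a shared embedding space. The Fact fields, the random embeddings, and the retrieval function are toy assumptions, not the paper's model.

```python
# Toy sketch: structured facts and bidirectional fact/image retrieval.
import torch
import torch.nn.functional as F
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fact:
    subject: str                  # e.g. an object category
    action: Optional[str] = None  # present for action/interaction facts
    obj: Optional[str] = None     # present for interaction facts

def retrieve(query_emb: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """Nearest-neighbour retrieval by cosine similarity; works in both
    directions (language view -> images, or image view -> facts)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(gallery, dim=-1).t()
    return sims.topk(k, dim=-1).indices

facts = [Fact("boy"), Fact("boy", "playing"), Fact("boy", "riding", "horse")]
fact_embs = torch.randn(len(facts), 128)  # language-view embeddings (stand-ins)
image_embs = torch.randn(10, 128)         # visual-view embeddings (stand-ins)
print(retrieve(fact_embs, image_embs, k=3))  # images for each fact
print(retrieve(image_embs, fact_embs, k=1))  # best fact for each image
```

Because every fact type shares one representation and one retrieval mechanism, adding new facts only grows the gallery; this is the sense in which the setting scales to an unbounded number of facts.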