MEWL: Few-shot multimodal word learning with referential uncertainty
Without explicit feedback, humans can rapidly learn the meaning of words.
Children can acquire a new word after just a few passive exposures, a process
known as fast mapping. This word learning capability is believed to be the most
fundamental building block of multimodal understanding and reasoning. Despite
recent advancements in multimodal learning, a systematic and rigorous
evaluation is still missing for human-like word learning in machines. To fill
in this gap, we introduce the MachinE Word Learning (MEWL) benchmark to assess
how machines learn word meaning in grounded visual scenes. MEWL covers humans'
core cognitive toolkits in word learning: cross-situational reasoning,
bootstrapping, and pragmatic learning. Specifically, MEWL is a few-shot
benchmark suite consisting of nine tasks for probing various word learning
capabilities. These tasks are carefully designed to align with children's
core abilities in word learning and echo the theories in the
developmental literature. By evaluating the performance of multimodal and
unimodal agents and comparing it with human performance, we observe a sharp
divergence between human and machine word learning. We further discuss these
differences between humans and machines and call for human-like few-shot word
learning in machines. Comment: Accepted at ICML 202
Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning
The Visual Dialogue task requires an agent to engage in a conversation about
an image with a human. It represents an extension of the Visual Question
Answering task in that the agent needs to answer a question about an image, but
it needs to do so in light of the previous dialogue that has taken place. The
key challenge in Visual Dialogue is thus maintaining a consistent and natural
dialogue while continuing to answer questions correctly. We present a novel
approach that combines Reinforcement Learning and Generative Adversarial
Networks (GANs) to generate more human-like responses to questions. The GAN
helps overcome the relative paucity of training data, and the tendency of the
typical MLE-based approach to generate overly terse answers. Critically, the
GAN is tightly integrated into the attention mechanism that generates
human-interpretable reasons for each answer. This means that the discriminative
model of the GAN has the task of assessing whether a candidate answer is
generated by a human or not, given the provided reason. This is significant
because it drives the generative model to produce high quality answers that are
well supported by the associated reasoning. The method also achieves
state-of-the-art results on the primary benchmark.
GuessWhat?! Visual object discovery through multi-modal dialogue
We introduce GuessWhat?!, a two-player guessing game as a testbed for
research on the interplay of computer vision and dialogue systems. The goal of
the game is to locate an unknown object in a rich image scene by asking a
sequence of questions. Higher-level image understanding, like spatial reasoning
and language grounding, is required to solve the proposed task. Our key
contribution is the collection of a large-scale dataset consisting of 150K
human-played games with a total of 800K visual question-answer pairs on 66K
images. We explain our design decisions in collecting the dataset and introduce
the oracle and questioner tasks that are associated with the two players of the
game. We prototyped deep learning models to establish initial baselines for the
introduced tasks. Comment: 23 pages; CVPR 2017 submission; see https://guesswhat.a
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
A long-standing goal of AI systems is to perform complex multimodal reasoning
like humans. Recently, large language models (LLMs) have made remarkable
strides in such multi-step reasoning on the language modality solely by
leveraging the chain of thought (CoT) to mimic human thinking. However, the
transfer of these advancements to multimodal contexts introduces heightened
challenges, including but not limited to the impractical need for
labor-intensive annotation and the limitations in terms of flexibility,
generalizability, and explainability. To evoke CoT reasoning in multimodality,
this work first conducts an in-depth analysis of these challenges posed by
multimodality and presents two key insights: "keeping critical thinking" and
"letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this
study proposes DDCoT, a novel prompting method that maintains a critical attitude
through negative-space prompting and incorporates multimodality into reasoning
by first dividing the reasoning responsibility of LLMs into reasoning and
recognition and then integrating the visual recognition capability of visual
models into the joint reasoning process. The rationales generated by DDCoT not
only improve the reasoning abilities of both large and small language models in
zero-shot prompting and fine-tuning learning, significantly outperforming
state-of-the-art methods, but also exhibit impressive generalizability and
explainability. Comment: 24 pages, 13 figures, to be published in NeurIPS 202
Reason from Context with Self-supervised Learning
Self-supervised learning (SSL) learns to capture discriminative visual
features useful for knowledge transfer. To better accommodate the
object-centric nature of current downstream tasks such as object recognition
and detection, various methods have been proposed to suppress contextual biases
or disentangle objects from contexts. Nevertheless, these methods may prove
inadequate in situations where object identity needs to be reasoned from
associated context, such as recognizing or inferring tiny or obscured objects.
As an initial effort in the SSL literature, we investigate whether and how
contextual associations can be enhanced for visual reasoning within SSL
regimes, by (a) proposing a new Self-supervised method with external memories
for Context Reasoning (SeCo), and (b) introducing two new downstream tasks,
lift-the-flap and object priming, addressing the problems of "what" and "where"
in context reasoning. In both tasks, SeCo outperformed all state-of-the-art
(SOTA) SSL methods by a significant margin. Our network analysis revealed that
the proposed external memory in SeCo learns to store prior contextual
knowledge, facilitating target identity inference in the lift-the-flap task.
Moreover, we conducted psychophysics experiments and introduced a Human
benchmark in Object Priming dataset (HOP). Our results demonstrate that SeCo
exhibits human-like behaviors