Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat
We propose a grounded dialogue state encoder which addresses a foundational
issue: how to integrate visual grounding with dialogue system components. As
a test-bed, we focus on the GuessWhat?! game, a two-player game where the goal
is to identify an object in a complex visual scene by asking a sequence of
yes/no questions. Our visually-grounded encoder leverages synergies between
guessing and asking questions, as it is trained jointly using multi-task
learning. We further enrich our model via a cooperative learning regime. We
show that the introduction of both the joint architecture and cooperative
learning leads to accuracy improvements over the baseline system. We compare our
approach to an alternative system which extends the baseline with reinforcement
learning. Our in-depth analysis shows that the linguistic skills of the two
models differ dramatically, despite approaching comparable performance levels.
This points to the importance of analyzing the linguistic output of competing
systems beyond numeric comparisons based solely on task success.
Comment: Accepted to NAACL 2019
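To make the joint setup concrete, here is a minimal sketch (not the authors' released code) of a visually grounded dialogue state encoder shared by a question-generation head and a guesser head, trained with a summed multi-task loss. All module names, dimensions, and the fusion scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroundedDialogueStateEncoder(nn.Module):
    """Fuses image features with an encoding of the dialogue history."""
    def __init__(self, img_dim=2048, txt_dim=512, state_dim=512):
        super().__init__()
        self.txt_encoder = nn.LSTM(txt_dim, state_dim, batch_first=True)
        self.fuse = nn.Linear(img_dim + state_dim, state_dim)

    def forward(self, img_feats, dialogue_embeds):
        # dialogue_embeds: (batch, seq_len, txt_dim) embeddings of the history
        _, (h, _) = self.txt_encoder(dialogue_embeds)
        return torch.tanh(self.fuse(torch.cat([img_feats, h[-1]], dim=-1)))

class JointModel(nn.Module):
    """Question generator and guesser share one encoder (multi-task learning)."""
    def __init__(self, vocab_size=5000, state_dim=512, obj_dim=512):
        super().__init__()
        self.encoder = GroundedDialogueStateEncoder(state_dim=state_dim)
        self.qgen_head = nn.Linear(state_dim, vocab_size)  # next-word logits
        self.obj_proj = nn.Linear(obj_dim, state_dim)      # candidate objects

    def forward(self, img_feats, dialogue_embeds, obj_feats):
        state = self.encoder(img_feats, dialogue_embeds)
        qgen_logits = self.qgen_head(state)
        # Guesser: score each candidate object against the dialogue state.
        guess_logits = torch.bmm(self.obj_proj(obj_feats),
                                 state.unsqueeze(-1)).squeeze(-1)
        return qgen_logits, guess_logits

# Joint training: summing both losses lets gradients from guessing and
# asking shape the same grounded dialogue state.
model = JointModel()
img, hist = torch.randn(2, 2048), torch.randn(2, 10, 512)
objs = torch.randn(2, 20, 512)
q_logits, g_logits = model(img, hist, objs)
loss = (nn.functional.cross_entropy(q_logits, torch.randint(0, 5000, (2,)))
        + nn.functional.cross_entropy(g_logits, torch.randint(0, 20, (2,))))
loss.backward()
```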
Visual Dialogue State Tracking for Question Generation
GuessWhat?! is a visual dialogue task between a guesser and an oracle. The
guesser aims to locate an object in an image, secretly chosen by the oracle,
by asking a sequence of Yes/No questions. Asking proper questions as the
dialogue progresses is vital for a successful final guess; the progress of
the dialogue should therefore be properly represented and tracked. Previous
models for question generation pay little attention to the representation
and tracking of dialogue states and are therefore prone to asking
low-quality questions, such as repeated ones. This paper proposes a visual
dialogue state tracking (VDST) based method for question generation. A
visual dialogue state is defined as a distribution over the objects in the
image together with representations of those objects. The object
representations are updated as the distribution over objects changes. An
object-difference-based attention is used to decode new questions, and the
distribution over objects is updated by comparing the question-answer pair
with the objects. Experimental results on the GuessWhat?! dataset show that
our model significantly outperforms existing methods and achieves new
state-of-the-art performance. Notably, our model reduces the rate of
repeated questions from more than 50% to 21.9% compared with previous
state-of-the-art methods.
Comment: 8 pages, 4 figures, accepted (oral) by AAAI-2020
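As a rough illustration of this update cycle, here is a minimal sketch under our own assumptions about names and dimensions (it is not the paper's code): the state is a probability distribution over candidate objects plus per-object representations, an object-difference attention summarizes the objects for the question decoder, and the distribution is renormalized after each question-answer pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VDSTSketch(nn.Module):
    """Tracks a distribution over candidate objects plus their representations."""
    def __init__(self, obj_dim=512, qa_dim=512):
        super().__init__()
        self.qa_encoder = nn.GRU(qa_dim, obj_dim, batch_first=True)
        self.score = nn.Bilinear(obj_dim, obj_dim, 1)  # QA-vs-object comparison

    def object_difference_attention(self, obj_repr, probs):
        # Attend to objects that differ most from the probability-weighted
        # mean object: those are the most informative to ask about next.
        mean = (probs.unsqueeze(-1) * obj_repr).sum(1, keepdim=True)
        attn = F.softmax((obj_repr - mean).norm(dim=-1), dim=-1)
        return (attn.unsqueeze(-1) * obj_repr).sum(1)  # conditions the decoder

    def update_state(self, obj_repr, probs, qa_embeds):
        # Encode the latest question-answer pair ...
        _, h = self.qa_encoder(qa_embeds)
        qa = h[-1].unsqueeze(1).expand_as(obj_repr).contiguous()
        # ... score it against every object and renormalise the distribution.
        logits = self.score(qa, obj_repr).squeeze(-1)
        new_probs = F.softmax(torch.log(probs + 1e-9) + logits, dim=-1)
        # Object representations are re-weighted by the updated distribution.
        return obj_repr * new_probs.unsqueeze(-1), new_probs

tracker = VDSTSketch()
objs = torch.randn(1, 20, 512)          # 20 candidate objects in the image
probs = torch.full((1, 20), 1 / 20)     # uniform prior over the objects
context = tracker.object_difference_attention(objs, probs)
objs, probs = tracker.update_state(objs, probs, torch.randn(1, 8, 512))
```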
Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games
In visual guessing games, a Guesser has to identify a target object in a
scene by asking questions to an Oracle. An effective strategy for the players
is to learn conceptual representations of objects that are both discriminative
and expressive enough to ask questions and guess correctly. However, as shown
by Suglia et al. (2020), existing models fail to learn truly multi-modal
representations, relying instead on gold category labels for objects in the
scene both at training and inference time. This provides an unnatural
performance advantage when categories at inference time match those at training
time, and it causes models to fail in more realistic "zero-shot" scenarios
where out-of-domain object categories are involved. To overcome this issue, we
introduce a novel "imagination" module based on Regularized Auto-Encoders that
learns context-aware and category-aware latent embeddings without relying on
category labels at inference time. Our imagination module outperforms
state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?!
zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and
Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark, when no gold
categories are available at inference time. The imagination module also boosts
reasoning about object properties and attributes.
Comment: Accepted to the International Conference on Computational Linguistics (COLING) 2020
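A minimal sketch of what such an imagination module could look like, assuming a simple L2-regularised auto-encoder over object region features (the paper's exact regulariser, feature pipeline, and integration with the Guesser and Oracle may differ):

```python
import torch
import torch.nn as nn

class ImaginationRAE(nn.Module):
    """Auto-encodes perceptual features into a compact 'imagined' concept."""
    def __init__(self, feat_dim=2048, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, feat_dim))

    def forward(self, visual_feats):
        z = self.encoder(visual_feats)  # latent concept, no category label used
        return z, self.decoder(z)

def rae_loss(recon, target, z, beta=1e-3):
    # Reconstruction plus an L2 penalty on the latent code: the deterministic
    # regulariser that replaces a stochastic VAE posterior in RAEs.
    return nn.functional.mse_loss(recon, target) + beta * z.pow(2).sum(-1).mean()

model = ImaginationRAE()
feats = torch.randn(4, 2048)            # e.g. object region features
z, recon = model(feats)
rae_loss(recon, feats, z).backward()
# At inference, z stands in for the gold category embedding inside the
# Guesser/Oracle, enabling play on unseen ("zero-shot") categories.
```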
Learning to merge - language and vision: A deep evaluation of the encoder, the role of the two modalities, the role of the training task.
Most human language understanding is grounded in perception, and there is thus growing interest in combining information from language and vision. Multiple neural network models have been proposed to merge language and vision information. They share a common backbone: an encoder that learns to merge the two types of representation to perform a specific task. While some models have seemed extremely successful on those tasks, it remains unclear how the reported results should be interpreted and what those models are actually learning. Our contribution is three-fold. We have proposed (a) a new model of Visually Grounded Dialogue; (b) a diagnostic dataset to evaluate the encoder's ability to merge visual and language input; and (c) a method to evaluate the quality of the multimodal representations computed by the encoder as general-purpose representations. We have proposed and analyzed a cognitively plausible architecture in which dialogue system modules are connected through a common \emph{grounded dialogue state encoder}. Our in-depth analysis of the dialogues shows the importance of going beyond task success in the evaluation of Visual Dialogues: the dialogues themselves should play a crucial role in such evaluation.
We have proposed a diagnostic dataset, \emph{FOIL}, which consists of images associated with incorrect captions that the model has to detect and correct. Finally, we have used FOIL to evaluate the quality of the multimodal representations produced by an encoder trained on different multimodal tasks. We have shown how the training task affects the stability of the representations, their transferability, and the model's confidence.
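As an illustration of how a FOIL-style dataset can probe an encoder, here is a minimal sketch with hypothetical module and variable names: a lightweight classification head is trained on top of a frozen multimodal encoder to flag captions in which one word has been swapped ("foiled"). Comparing probe accuracy across encoders trained on different tasks is one way to assess representation quality in the spirit of the evaluation described above.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in multimodal encoder for the sketch: any pretrained encoder
    producing a joint image-caption vector could take its place."""
    def __init__(self, dim=512):
        super().__init__()
        self.img = nn.Linear(2048, dim)
        self.txt = nn.Embedding(5000, dim)

    def forward(self, image_feats, caption_ids):
        return self.img(image_feats) + self.txt(caption_ids).mean(1)

class FoilProbe(nn.Module):
    """Binary probe: 0 = correct caption, 1 = foiled caption."""
    def __init__(self, encoder, enc_dim=512):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(enc_dim, 2)

    def forward(self, image_feats, caption_ids):
        with torch.no_grad():  # keep the encoder frozen: probing, not tuning
            rep = self.encoder(image_feats, caption_ids)
        return self.head(rep)

probe = FoilProbe(ToyEncoder())
images = torch.randn(4, 2048)               # image/region features
captions = torch.randint(0, 5000, (4, 12))  # token ids, some captions foiled
is_foil = torch.tensor([0, 1, 0, 1])        # gold labels
loss = nn.functional.cross_entropy(probe(images, captions), is_foil)
loss.backward()                             # only the probe head gets gradients
```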
Visually Grounded Language Learning: a review of language games, datasets, tasks, and models
In recent years, several machine learning models trained with a language
modelling objective on large-scale text-only data have been proposed. With
such pretraining, they can achieve impressive results on many Natural Language
Understanding and Generation tasks. However, many facets of meaning cannot be
learned by ``listening to the radio'' alone. In the literature, many
Vision+Language (V+L) tasks have been defined with the aim of creating models
that can ground symbols in the visual modality. In this work, we provide a
systematic literature review of several tasks and models proposed in the V+L
field. We rely on Wittgenstein's idea of `language games' to categorise such
tasks into 3 different families: 1) discriminative games, 2) generative games,
and 3) interactive games. Our analysis of the literature provides evidence that
future work should focus on interactive games, where communication in Natural
Language is important to resolve ambiguities about object referents and action
plans, and that physical embodiment is essential to understand the semantics of
situations and events. Overall, these represent key requirements for developing
grounded meanings in neural models.
Comment: Preprint for JAIR before copyediting