'What are you referring to?' Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges
Referential ambiguities arise in dialogue when a referring expression does
not uniquely identify the intended referent for the addressee. Addressees
usually detect such ambiguities immediately and work with the speaker to repair
them using meta-communicative Clarificational Exchanges (CEs): a Clarification
Request (CR) and a response. Here, we argue that the ability to generate and
respond to CRs imposes specific constraints on the architecture and objective
functions of multi-modal, visually grounded dialogue models. We use the SIMMC
2.0 dataset to evaluate the ability of different state-of-the-art model
architectures to process CEs, with a metric that probes the contextual updates
that arise from them in the model. We find that language-based models are able
to encode simple multi-modal semantic information and process some CEs,
excelling with those related to the dialogue history, whilst multi-modal models
can use additional learning objectives to obtain disentangled object
representations, which become crucial to handle complex referential ambiguities
across modalities overall.
Comment: Accepted at SIGDIAL'23 (upcoming). Repository with code and experiments available at https://github.com/JChiyah/what-are-you-referring-t
CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning
Approaches to Grounded Language Learning typically focus on a single
task-based final performance measure that may not depend on desirable
properties of the learned hidden representations, such as their ability to
predict salient attributes or to generalise to unseen situations. To remedy
this, we present GROLLA, an evaluation framework for Grounded Language Learning
with Attributes, comprising three sub-tasks: 1) Goal-oriented evaluation; 2) Object
attribute prediction evaluation; and 3) Zero-shot evaluation. We also propose a
new dataset CompGuessWhat?! as an instance of this framework for evaluating the
quality of learned neural representations, in particular concerning attribute
grounding. To this end, we extend the original GuessWhat?! dataset by including
a semantic layer on top of the perceptual one. Specifically, we enrich the
VisualGenome scene graphs associated with the GuessWhat?! images with abstract
and situated attributes. By using diagnostic classifiers, we show that current
models learn representations that are not expressive enough to encode object
attributes (average F1 of 44.27). In addition, they learn neither strategies nor
representations that are robust enough to perform well when novel scenes or
objects are involved in gameplay (zero-shot best accuracy 50.06%).
Comment: Accepted to the Annual Conference of the Association for Computational Linguistics (ACL) 202
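The diagnostic-classifier methodology mentioned above can be illustrated with a minimal sketch: a low-capacity linear probe is trained on frozen representations to predict a binary attribute and scored with F1. All data, dimensions, and hyperparameters below are synthetic placeholders, not the paper's setup:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 computed from true/false positives and negatives."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def train_probe(X, y, lr=0.1, epochs=500):
    """Logistic-regression probe fitted by gradient descent; the
    representations X stay frozen, only w and b are learned."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
# Synthetic "frozen embeddings" in which the attribute is linearly decodable.
X = rng.normal(size=(200, 16))
y = (X @ rng.normal(size=16) > 0).astype(float)

w, b = train_probe(X, y)
preds = ((X @ w + b) > 0).astype(float)
print(f"probe F1: {f1_score(y, preds):.2f}")
```

A probe this simple deliberately has low capacity, so a high F1 indicates the attribute is linearly decodable from the representation itself rather than computed by the probe.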
Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games
In visual guessing games, a Guesser has to identify a target object in a
scene by asking questions to an Oracle. An effective strategy for the players
is to learn conceptual representations of objects that are both discriminative
and expressive enough to ask questions and guess correctly. However, as shown
by Suglia et al. (2020), existing models fail to learn truly multi-modal
representations, relying instead on gold category labels for objects in the
scene both at training and inference time. This provides an unnatural
performance advantage when categories at inference time match those at training
time, and it causes models to fail in more realistic "zero-shot" scenarios
where out-of-domain object categories are involved. To overcome this issue, we
introduce a novel "imagination" module, based on Regularized Auto-Encoders, which
learns context-aware and category-aware latent embeddings without relying on
category labels at inference time. Our imagination module outperforms
state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?!
zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and
Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark, when no gold
categories are available at inference time. The imagination module also boosts
reasoning about object properties and attributes.
Comment: Accepted to the International Conference on Computational Linguistics (COLING) 202
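As a rough illustration of the auto-encoding idea behind such an imagination module (a toy, not the paper's architecture), here is a linear auto-encoder with an L2 penalty on the latent code, trained by gradient descent on synthetic "object features":

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))            # synthetic stand-in for object features
We = rng.normal(scale=0.1, size=(8, 3))  # encoder weights (8-d input -> 3-d latent)
Wd = rng.normal(scale=0.1, size=(3, 8))  # decoder weights (3-d latent -> 8-d output)
lam, lr = 0.01, 0.05                     # latent L2 weight, learning rate

def loss(We, Wd):
    Z = X @ We                            # latent codes
    R = Z @ Wd                            # reconstruction
    return np.mean((R - X) ** 2) + lam * np.mean(Z ** 2)

first = loss(We, Wd)
for _ in range(300):
    Z = X @ We
    err = 2 * (Z @ Wd - X) / X.size       # gradient of the reconstruction term
    gWd = Z.T @ err
    gWe = X.T @ (err @ Wd.T) + 2 * lam * (X.T @ Z) / Z.size
    We -= lr * gWe
    Wd -= lr * gWd
final = loss(We, Wd)
print(f"loss: {first:.3f} -> {final:.3f}")
```

The latent penalty plays the role of the regulariser here; per the abstract, the actual module additionally makes the latent embeddings context-aware and category-aware so that no gold category labels are needed at inference time.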
An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games
Guessing games are a prototypical instance of the "learning by interacting"
paradigm. This work investigates how well an artificial agent can benefit from
playing guessing games when later asked to perform on novel NLP downstream
tasks such as Visual Question Answering (VQA). We propose two ways to exploit
playing guessing games: 1) a supervised learning scenario in which the agent
learns to mimic successful guessing games and 2) a novel way for an agent to
play by itself, called Self-play via Iterated Experience Learning (SPIEL).
We evaluate the ability of both procedures to generalize: an in-domain
evaluation shows an increased accuracy (+7.79) compared with competitors on the
evaluation suite CompGuessWhat?!; a transfer evaluation shows improved
performance for VQA on the TDIUC dataset in terms of harmonic average accuracy
(+5.31) thanks to more fine-grained object representations learned via SPIEL.
Comment: Accepted paper for the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021)
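The self-play idea can be caricatured in a few lines: the agent plays games against itself, keeps only the successful episodes, and reinforces the behaviour that produced them. Everything below (question types, payoffs, update rule) is an illustrative toy, not the SPIEL training setup:

```python
import random

random.seed(0)
QUESTION_TYPES = ["color", "shape", "size"]   # hypothetical question types

def play_episode(policy):
    """One toy self-play round: sample a question type from the current
    policy; the game is 'won' if it matches the attribute that happens to
    discriminate the target (drawn from a skewed distribution)."""
    target_attr = random.choices(QUESTION_TYPES, weights=[0.6, 0.3, 0.1])[0]
    question = random.choices(QUESTION_TYPES,
                              weights=[policy[q] for q in QUESTION_TYPES])[0]
    return question, question == target_attr

def spiel_loop(rounds=500):
    """Iterated experience learning: keep only successful episodes and
    up-weight the actions that led to them."""
    policy = {q: 1.0 for q in QUESTION_TYPES}
    for _ in range(rounds):
        question, success = play_episode(policy)
        if success:                       # discard failed games entirely
            policy[question] += 0.1
    return policy

policy = spiel_loop()
print(policy)
```

In the actual procedure the policy is a neural Questioner/Guesser pair and, per the abstract, the agent mimics its own successful games, so the retained episodes effectively become supervised training data.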
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks
In this work we propose Ask Me Any Rating (AMAR), a novel content-based recommender system based on deep neural networks that produces top-N recommendations by leveraging user and item embeddings learnt from the textual information describing the items. A comprehensive experimental evaluation conducted on state-of-the-art datasets showed a significant improvement over all the baselines taken into account.
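As a sketch of how embedding-based content recommendation works in general (the vectors and titles below are made up; AMAR learns its embeddings with recurrent networks over item descriptions rather than using fixed vectors):

```python
import numpy as np

# Hypothetical item embeddings, standing in for vectors derived from
# textual item descriptions.
items = {
    "movie_a": np.array([1.0, 0.0, 0.2]),
    "movie_b": np.array([0.9, 0.1, 0.3]),
    "movie_c": np.array([0.0, 1.0, 0.8]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommend(liked, k=1):
    """User embedding = mean of liked-item embeddings; rank the remaining
    items by cosine similarity and return the top-N."""
    user = np.mean([items[i] for i in liked], axis=0)
    candidates = [i for i in items if i not in liked]
    ranked = sorted(candidates, key=lambda i: cosine(user, items[i]), reverse=True)
    return ranked[:k]

print(recommend(["movie_a"]))  # → ['movie_b']
```

The mean-of-liked-items user profile is the simplest possible choice; learned user embeddings, as in AMAR, can capture preferences that a plain average cannot.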
Going for GOAL: A Resource for Grounded Football Commentaries
Recent video+language datasets cover domains where the interaction is highly
structured, such as instructional videos, or where the interaction is scripted,
such as TV shows. Both of these properties can lead to spurious cues to be
exploited by models rather than learning to ground language. In this paper, we
present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or
`soccer') highlights videos with transcribed live commentaries in English. The
course of a game is unpredictable, and so are its commentaries, which makes them a
unique resource for investigating dynamic language grounding. We also provide
state-of-the-art baselines for the following tasks: frame reordering, moment
retrieval, live commentary retrieval and play-by-play live commentary
generation. Results show that SOTA models perform reasonably well in most
tasks. We discuss the implications of these results and suggest new tasks for
which GOAL can be used. Our codebase is available at:
https://gitlab.com/grounded-sport-convai/goal-baselines.
Comment: Preprint formatted using the ACM Multimedia template (8 pages + appendix)
Visually grounded representation learning using language games for embodied AI
The ability to communicate in Natural Language is considered one of the ingredients
that facilitated the development of humans’ remarkable intelligence. Analogously, developing artificial agents that can seamlessly integrate with humans requires them to
understand and use Natural Language, just like we do. Humans use Natural Language to coordinate and communicate relevant information to solve their tasks—they
play so-called “language games”. In this thesis work, we explore computational models
of how meanings can materialise in situated and embodied language games. Meanings
are instantiated when language is used to refer to, and to do, things in the world. In
these activities, such as “guessing an object in an image” or “following instructions to
complete a task”, perceptual experience can be used to derive grounded meaning representations. Considering that different language games favour the development of specific
concepts, we argue it is detrimental to evaluate agents on their ability to solve a single
task. To mitigate this problem, we define GroLLA, a multi-task evaluation framework
for visual guessing games that extends a goal-oriented evaluation with auxiliary tasks
aimed at assessing the quality of the representations as well. By using this framework,
we demonstrate the inability of recent computational models to learn truly multimodal
representations that can generalise to unseen object categories. To overcome this issue,
we propose a representation learning component that derives concept representations
from perceptual experience, obtaining substantial gains over the baselines—especially
when unseen object categories are involved. To demonstrate that guessing games are
a generic procedure for grounded language learning, we present SPIEL, a novel self-play procedure to transfer learned representations to novel multimodal tasks. We show
that models trained in this way can obtain better performance as well as learn better
concept representations than competitors. Thanks to this procedure, artificial agents
can learn from interaction using any image-based dataset. Additionally, learning the
meaning of concepts involves understanding how entities interact with other entities in
the world. For this purpose, we use action-based and event-driven language games to
study how an agent can learn visually grounded conceptual representations from dynamic scenes. We design EmBERT, a generic architecture for an embodied agent able
to learn representations useful to complete language-guided action execution tasks in
a 3D environment. Finally, visually grounded representations can also be learned
by watching others complete a task. Inspired by this idea, we study how to learn
representations from videos that can be used for tackling multimodal tasks such as commentary generation. For this purpose, we define GOAL, a highly multimodal benchmark
based on football commentaries that requires models to learn very fine-grained and rich
representations to be successful. We conclude with some future directions for further
progress in the computational learning of grounded meaning representations.
An Analysis of Visually Grounded Instructions in Embodied AI Tasks
Thanks to Deep Learning models able to learn from Internet-scale corpora, we have observed tremendous advances in both text-only and multi-modal tasks such as question answering and image captioning. However, real-world tasks require agents that are embodied in the environment and can collaborate with humans by following language instructions. In this work, we focus on ALFRED, a large-scale instruction-following dataset proposed to develop artificial agents that can execute both navigation and manipulation actions in 3D simulated environments. We present a new Natural Language Understanding component for Embodied Agents as well as an in-depth error analysis of the model failures for this challenge, going beyond the success-rate performance that has been driving progress on this benchmark. Furthermore, we provide the research community with important directions for future work in this field, which are essential to develop collaborative embodied agents.