    Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat

    We propose a grounded dialogue state encoder that addresses the foundational issue of how to integrate visual grounding with dialogue system components. As a test-bed, we focus on the GuessWhat?! game, a two-player game where the goal is to identify an object in a complex visual scene by asking a sequence of yes/no questions. Our visually-grounded encoder leverages synergies between guessing and asking questions, as it is trained jointly using multi-task learning. We further enrich our model via a cooperative learning regime. We show that introducing both the joint architecture and cooperative learning leads to accuracy improvements over the baseline system. We compare our approach to an alternative system which extends the baseline with reinforcement learning. Our in-depth analysis shows that the linguistic skills of the two models differ dramatically, despite approaching comparable performance levels. This points to the importance of analyzing the linguistic output of competing systems beyond numeric comparison based solely on task success.
    Comment: Accepted to NAACL 201

    Visual Dialogue State Tracking for Question Generation

    GuessWhat?! is a visual dialogue task between a guesser and an oracle. The guesser aims to locate an object that the oracle has in mind in an image by asking a sequence of yes/no questions. Asking appropriate questions as the dialogue progresses is vital for achieving a successful final guess. As a result, the progress of the dialogue should be properly represented and tracked. Previous models for question generation pay little attention to the representation and tracking of dialogue states, and are therefore prone to asking low-quality questions such as repeated questions. This paper proposes a visual dialogue state tracking (VDST) based method for question generation. A visual dialogue state is defined as the distribution over objects in the image together with the representations of those objects. The object representations are updated as the distribution over objects changes. An object-difference based attention is used to decode a new question. The distribution over objects is updated by comparing the question-answer pair with the objects. Experimental results on the GuessWhat?! dataset show that our model significantly outperforms existing methods and achieves new state-of-the-art performance. Notably, our model also reduces the rate of repeated questions from more than 50% to 21.9% compared with previous state-of-the-art methods.
    Comment: 8 pages, 4 figures, Accept-Oral by AAAI-202

    Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games

    In visual guessing games, a Guesser has to identify a target object in a scene by asking an Oracle questions. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene both at training and inference time. This provides an unnatural performance advantage when categories at inference time match those at training time, and it causes models to fail in more realistic "zero-shot" scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel "imagination" module based on Regularized Auto-Encoders that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes.
    Comment: Accepted to the International Conference on Computational Linguistics (COLING) 202

    Learning to merge - language and vision: A deep evaluation of the encoder, the role of the two modalities, the role of the training task.

    Most human language understanding is grounded in perception. There is thus growing interest in combining information from language and vision. Multiple models based on Neural Networks have been proposed to merge language and vision information. All the models share a common backbone consisting of an encoder which learns to merge the two types of representation to perform a specific task. While some models have seemed extremely successful on those tasks, it remains unclear how the reported results should be interpreted and what those models are actually learning. Our contribution is three-fold. We have proposed (a) a new model of Visually Grounded Dialogue; (b) a diagnostic dataset to evaluate the encoder's ability to merge visual and language input; (c) a method to evaluate the quality of the multimodal representations computed by the encoder as general-purpose representations. We have proposed and analyzed a cognitively plausible architecture in which dialogue system modules are connected through a common grounded dialogue state encoder. Our in-depth analysis of the dialogues shows the importance of going beyond task success in the evaluation of Visual Dialogues: the dialogues themselves should play a crucial role in such evaluation. We have proposed a diagnostic dataset, FOIL, which consists of images associated with incorrect captions that the model has to detect and correct. Finally, we have used FOIL to evaluate the quality of the multimodal representations produced by an encoder trained on different multimodal tasks. We have shown how the training task used affects the stability of the representations, their transferability, and the model's confidence.

    Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

    In recent years, several machine learning models have been proposed that are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by "listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of "language games" to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should focus on interactive games, where communication in Natural Language is important to resolve ambiguities about object referents and action plans, and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.
    Comment: Preprint for JAIR before copyediting