Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs
which play a crucial role for the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th International
Conference on Computational Linguistics. Please refer to this version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197
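As a rough illustration of the kind of fusion strategies such a survey compares, the sketch below combines a textual embedding and a visual feature vector by early fusion (concatenation) and by projection into a shared space. The dimensionalities and the random projection matrices are purely illustrative stand-ins for components that would be learned; none of this is taken from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 300-d text embedding and 2048-d visual feature vector for one concept.
text_vec = rng.standard_normal(300)
image_vec = rng.standard_normal(2048)

# Early fusion: concatenate the modality-specific vectors.
concatenated = np.concatenate([text_vec, image_vec])

# Middle fusion: project both modalities into a shared space, then average.
# The random matrices below stand in for projections that would be learned.
W_text = rng.standard_normal((128, 300)) * 0.01
W_image = rng.standard_normal((128, 2048)) * 0.01
fused = 0.5 * (W_text @ text_vec) + 0.5 * (W_image @ image_vec)

print(concatenated.shape, fused.shape)  # (2348,) (128,)
```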
Predicting Perceived Age: Both Language Ability and Appearance are Important
When interacting with robots in a situated spoken dialogue setting, human dialogue partners tend to assign anthropomorphic and social characteristics to those robots. In this paper, we explore the age and educational level that human dialogue partners assign to three different robotic systems, including an un-embodied spoken dialogue system. We found that how a robot speaks is as important to human perceptions as the way the robot looks. Using the data from our experiment, we derived prosodic, emotional, and linguistic features from the participants to train and evaluate a classifier that predicts perceived intelligence, age, and education level.
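A minimal sketch of the kind of classification setup described above, assuming hypothetical per-participant prosodic, emotional, and linguistic features and a random-forest classifier; the paper does not specify this particular model, feature set, or label binning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Hypothetical per-participant features: prosodic (pitch, speaking rate),
# emotional (valence, arousal) and linguistic (lexical diversity, ...) measures.
X = rng.standard_normal((120, 10))   # 120 participants x 10 features (illustrative)
y = rng.integers(0, 3, size=120)     # perceived-age bins: 0=young, 1=middle, 2=older

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```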
Clinical Text Prediction with Numerically Grounded Conditional Language Models
Assisted text input techniques can save time and effort and improve text quality. In this paper, we investigate how grounded and conditional extensions to standard neural language models can bring improvements in the tasks of word prediction and completion. These extensions incorporate a structured knowledge base and numerical values from the text into the context used to predict the next word. Our automated evaluation on a clinical dataset shows that extended models significantly outperform standard models. Our best system uses both conditioning and grounding, because of their orthogonal benefits. For word prediction with a list of 5 suggestions, it improves recall from 25.03% to 71.28%, and for word completion it improves keystroke savings from 34.35% to 44.81%, where the theoretical bound for this dataset is 58.78%. We also perform a qualitative investigation of how models with lower perplexity occasionally fare better at the tasks. We found that, at test time, numbers have more influence at the document level than on individual word probabilities.
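The two evaluation measures quoted above can be illustrated with a small sketch. The helper names and the toy clinical vocabulary are assumptions for illustration, not the authors' code or data.

```python
def recall_at_k(targets, suggestion_lists, k=5):
    """Fraction of target words that appear among the top-k suggestions."""
    hits = sum(t in s[:k] for t, s in zip(targets, suggestion_lists))
    return hits / len(targets)

def keystroke_savings(words, typed_prefix_lengths):
    """Proportion of characters the user did not have to type because a
    completion was accepted after typing the given number of keystrokes."""
    total = sum(len(w) for w in words)
    typed = sum(typed_prefix_lengths)
    return (total - typed) / total

# Toy example with a hypothetical clinical vocabulary.
targets = ["insulin", "glucose"]
suggestions = [["insulin", "infusion", "intake", "injury", "index"],
               ["dose", "glucose", "gauge", "graft", "gain"]]
print(recall_at_k(targets, suggestions))                  # 1.0
print(keystroke_savings(["insulin", "glucose"], [3, 4]))  # 0.5
```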
Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games
In visual guessing games, a Guesser has to identify a target object in a
scene by asking questions to an Oracle. An effective strategy for the players
is to learn conceptual representations of objects that are both discriminative
and expressive enough to ask questions and guess correctly. However, as shown
by Suglia et al. (2020), existing models fail to learn truly multi-modal
representations, relying instead on gold category labels for objects in the
scene both at training and inference time. This provides an unnatural
performance advantage when categories at inference time match those at training
time, and it causes models to fail in more realistic "zero-shot" scenarios
where out-of-domain object categories are involved. To overcome this issue, we
introduce a novel "imagination" module based on Regularized Auto-Encoders, that
learns context-aware and category-aware latent embeddings without relying on
category labels at inference time. Our imagination module outperforms
state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?!
zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and
Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark, when no gold
categories are available at inference time. The imagination module also boosts
reasoning about object properties and attributes.
Comment: Accepted to the International Conference on Computational Linguistics
(COLING) 2020
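A minimal PyTorch sketch of a regularized auto-encoder in the spirit of the described imagination module, assuming visual object features as input and an auxiliary category head as the training-time regularizer; the paper's actual architecture, losses, and hyperparameters may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImaginationAE(nn.Module):
    """Encode visual object features into a latent code; an auxiliary category
    head regularizes the latent space at training time, so no gold category
    labels are needed when only the encoder is used at inference time."""
    def __init__(self, feat_dim=2048, latent_dim=256, n_categories=80):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, feat_dim))
        self.category_head = nn.Linear(latent_dim, n_categories)

    def forward(self, feats):
        z = self.encoder(feats)
        return z, self.decoder(z), self.category_head(z)

def training_loss(model, feats, category_labels, reg_weight=0.1):
    """Reconstruction loss plus a category-prediction regularizer (weight illustrative)."""
    z, recon, logits = model(feats)
    return F.mse_loss(recon, feats) + reg_weight * F.cross_entropy(logits, category_labels)
```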
Vision and Feature Norms: Improving automatic feature norm learning through cross-modal maps
Property norms have the potential to aid a wide range of semantic tasks, provided that they can be obtained for large numbers of concepts. Recent work has focused on text as the main source of information for automatic property extraction. In this paper we examine property norm prediction from visual, rather than textual, data, using cross-modal maps learnt between property norm and visual spaces. We also investigate the importance of having a complete feature norm dataset, for both training and testing. Finally, we evaluate how these datasets and cross-modal maps can be used in an image retrieval task.
LB is supported by an EPSRC Doctoral Training Grant. DK is supported by EPSRC grant EP/I037512/1. SC is supported by ERC Starting Grant DisCoTex (306920) and EPSRC grant EP/I037512/1.
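A hedged sketch of a linear cross-modal map of the kind described, assuming ridge regression from hypothetical visual feature vectors to property-norm vectors; the paper's actual mapping method, datasets, and dimensionalities may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical training data: visual feature vectors (e.g. CNN activations) and
# property-norm vectors (one weight per property such as "has_wheels") for known concepts.
V_train = rng.standard_normal((400, 2048))   # 400 concepts x 2048 visual dims
P_train = rng.random((400, 500))             # 400 concepts x 500 properties

# Learn a linear cross-modal map from the visual space to the property-norm space.
cross_modal_map = Ridge(alpha=1.0).fit(V_train, P_train)

# Predict property norms for an unseen concept from its visual features alone.
V_new = rng.standard_normal((1, 2048))
predicted_properties = cross_modal_map.predict(V_new)   # shape (1, 500)
```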
How direct is the link between words and images?
Current word embedding models, despite their success, still suffer from a
lack of grounding in the real world. In this line of research, Gunther et al.
2022 proposed a behavioral experiment to investigate the relationship between
words and images. In their setup, participants were presented with a target
noun and a pair of images, one chosen by their model and another chosen
randomly. Participants were asked to select the image that best matched the
target noun. In most cases, participants preferred the image selected by the
model. Gunther et al. therefore concluded that there may be a direct link
between words and embodied experience. We took their experiment as a point of
departure and addressed the following questions. 1. Apart from utilizing
visually embodied simulation of given images, what other strategies might
subjects have used to solve this task? To what extent does this setup rely on
visual information from images? Can it be solved using purely textual
representations? 2. Do current visually grounded embeddings explain subjects'
selection behavior better than textual embeddings? 3. Does visual grounding
improve the semantic representations of both concrete and abstract words? To
address these questions, we designed novel experiments by using pre-trained
textual and visually grounded word embeddings. Our experiments reveal that
subjects' selection behavior is explained to a large extent based on purely
text-based embeddings and word-based similarities, suggesting a minor
involvement of active embodied experiences. Visually grounded embeddings
offered modest advantages over textual embeddings only in certain cases. These
findings indicate that the experiment by Gunther et al. may not be well suited
for tapping into the perceptual experience of participants, and therefore the
extent to which it measures visually grounded knowledge is unclear.
Comment: Accepted in the Mental Lexicon Journal: https://benjamins.com/catalog/m
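A minimal sketch of how such a two-alternative choice can be modelled with embedding similarities, assuming pre-trained word vectors for the target noun and for the two candidate images' labels; the cosine-based decision rule is an illustration, not the authors' exact procedure.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_choice(target_vec, model_image_vec, random_image_vec):
    """Return 'model' if the model-chosen image is at least as similar to the
    target noun as the randomly chosen image, under a given embedding space."""
    sim_model = cosine(target_vec, model_image_vec)
    sim_random = cosine(target_vec, random_image_vec)
    return "model" if sim_model >= sim_random else "random"

# Illustrative vectors; in practice these would come from pre-trained textual
# or visually grounded embeddings of the noun and the two images' labels.
rng = np.random.default_rng(1)
target, img_model, img_random = rng.standard_normal((3, 300))
print(predict_choice(target, img_model, img_random))
```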