Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs,
which play a crucial role in the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th International Conference on Computational Linguistics. Please refer to this version for citations: https://www.aclweb.org/anthology/papers/C/C18/C18-1197
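The survey's methodological inventory centres on ways of combining textual and visual representations. As a generic illustration (not code from the survey), the sketch below shows two common fusion strategies on toy vectors: early fusion by concatenation, and a simple weighted combination after projecting both modalities to a shared dimension. All dimensions and data here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a textual and a visual embedding of the same concept
# (dimensions are arbitrary; real models use e.g. 300-d text, 4096-d CNN features).
text_vec = rng.standard_normal(300)
image_vec = rng.standard_normal(128)

def early_fusion(t, v):
    """Concatenate modalities into a single joint vector."""
    return np.concatenate([t, v])

def weighted_fusion(t, v, alpha=0.5, dim=256):
    """Project both modalities to a shared dimension, then mix them."""
    Wt = rng.standard_normal((dim, t.shape[0])) / np.sqrt(t.shape[0])
    Wv = rng.standard_normal((dim, v.shape[0])) / np.sqrt(v.shape[0])
    return alpha * (Wt @ t) + (1 - alpha) * (Wv @ v)

print(early_fusion(text_vec, image_vec).shape)     # (428,)
print(weighted_fusion(text_vec, image_vec).shape)  # (256,)
```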
Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge
Distributional models provide a convenient way to model semantics using dense
embedding spaces derived from unsupervised learning algorithms. However, the
dimensions of dense embedding spaces are not designed to resemble human
semantic knowledge. Moreover, embeddings are often built from a single source
of information (typically text data), even though neurocognitive research
suggests that semantics is deeply linked to both language and perception. In
this paper, we combine multimodal information from both text and image-based
representations derived from state-of-the-art distributional models to produce
sparse, interpretable vectors using Joint Non-Negative Sparse Embedding.
Through in-depth analyses comparing these sparse models to human-derived
behavioural and neuroimaging data, we demonstrate their ability to predict
interpretable linguistic descriptions of human ground-truth semantic knowledge.
Comment: Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 260-270. Brussels, Belgium, October 31 - November 1, 2018. Association for Computational Linguistics.
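The sparse vectors in this paper come from Joint Non-Negative Sparse Embedding (JNNSE). The sketch below is not that algorithm; it is a simplified proxy that applies scikit-learn's NMF to concatenated (and shifted) text and image embeddings to obtain non-negative, sparse latent dimensions of the kind the abstract describes. All data are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Toy stand-ins: dense text and image embeddings for 50 hypothetical words.
n_words, text_dim, img_dim = 50, 300, 128
text_emb = rng.standard_normal((n_words, text_dim))
img_emb = rng.standard_normal((n_words, img_dim))

# Concatenate modalities and shift to be non-negative, since NMF requires
# non-negative input. (JNNSE instead factorizes both modalities jointly with
# a shared word-by-latent matrix; this is only a rough single-matrix proxy.)
X = np.concatenate([text_emb, img_emb], axis=1)
X -= X.min()

# Factorize into non-negative latent dimensions intended to be interpretable.
model = NMF(n_components=20, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(X)   # words x latent dimensions
H = model.components_        # latent dimensions x original features

print(W.shape, H.shape)                # (50, 20) (20, 428)
print("sparsity:", np.mean(W < 1e-6))  # fraction of (near-)zero entries
```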
Learning Multi-Modal Word Representation Grounded in Visual Context
Representing the semantics of words is a long-standing problem for the
natural language processing community. Most methods compute word semantics from
textual context in large corpora. More recently, researchers have attempted to
integrate perceptual and visual features. Most of these works
consider the visual appearance of objects to enhance word representations but
they ignore the visual environment and context in which objects appear. We
propose to unify text-based techniques with vision-based techniques by
simultaneously leveraging textual and visual context to learn multimodal word
embeddings. We explore various choices for what can serve as a visual context
and present an end-to-end method to integrate visual context elements in a
multimodal skip-gram model. We provide experiments and an extensive analysis of
the obtained results.
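The abstract describes integrating visual context elements into a skip-gram objective end to end. As a rough approximation of that idea (not the paper's method), the sketch below appends hypothetical detected-object tokens to each caption so that gensim's skip-gram model also learns to predict co-occurring visual elements as context; the captions and object labels are invented.

```python
from gensim.models import Word2Vec

# Toy captions paired with object labels detected in the corresponding images
# (hypothetical data; a real setup would use a captioned image corpus plus an
# object detector or scene annotations).
samples = [
    (["a", "dog", "chases", "a", "ball", "in", "the", "park"],
     ["obj:dog", "obj:ball", "obj:grass", "obj:tree"]),
    (["a", "cat", "sleeps", "on", "the", "sofa"],
     ["obj:cat", "obj:sofa", "obj:cushion"]),
    (["a", "dog", "sleeps", "on", "the", "grass"],
     ["obj:dog", "obj:grass"]),
]

# Approximate "visual context" by appending object tokens to each caption, so
# the skip-gram objective also predicts co-occurring visual elements.
augmented_sentences = [caption + objects for caption, objects in samples]

model = Word2Vec(augmented_sentences, vector_size=50, window=10,
                 min_count=1, sg=1, epochs=200, seed=0)

# Words now share an embedding space with visual-context tokens.
print(model.wv.most_similar("dog", topn=3))
```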
Combining Language and Vision with a Multimodal Skip-gram Model
We extend the SKIP-GRAM model of Mikolov et al. (2013a) by taking visual
information into account. Like SKIP-GRAM, our multimodal models (MMSKIP-GRAM)
build vector-based word representations by learning to predict linguistic
contexts in text corpora. However, for a restricted set of words, the models
are also exposed to visual representations of the objects they denote
(extracted from natural images), and must predict linguistic and visual
features jointly. The MMSKIP-GRAM models achieve good performance on a variety
of semantic benchmarks. Moreover, since they propagate visual information to
all words, we use them to improve image labeling and retrieval in the zero-shot
setup, where the test concepts are never seen during model training. Finally,
the MMSKIP-GRAM models discover intriguing visual properties of abstract words,
paving the way to realistic implementations of embodied theories of meaning.
Comment: accepted at NAACL 2015, camera-ready version, 11 pages.
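A minimal PyTorch sketch of the general idea behind MMSKIP-GRAM: a negative-sampling skip-gram loss on text plus, for words that have images, a max-margin term that pushes a mapped word vector toward the visual features of its referent (roughly in the spirit of MMSKIP-GRAM-B). Dimensions, data, and hyperparameters here are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim, vis_dim = 1000, 100, 128

# Skip-gram parameters: input ("target") and output ("context") embeddings,
# plus a linear map from the word space into the visual feature space.
in_emb = torch.nn.Embedding(vocab_size, embed_dim)
out_emb = torch.nn.Embedding(vocab_size, embed_dim)
to_visual = torch.nn.Linear(embed_dim, vis_dim, bias=False)

def skipgram_loss(target, context, negatives):
    """Standard negative-sampling skip-gram loss."""
    t = in_emb(target)                                   # (B, D)
    pos = torch.sigmoid((t * out_emb(context)).sum(-1))  # (B,)
    neg = torch.sigmoid(-(t.unsqueeze(1) * out_emb(negatives)).sum(-1))  # (B, K)
    return -(torch.log(pos + 1e-9).mean() + torch.log(neg + 1e-9).mean())

def visual_loss(target, vis_feats, margin=0.5):
    """Max-margin term: mapped word vector should be closer to its own image
    features than to a mismatched (shuffled) image vector."""
    mapped = F.normalize(to_visual(in_emb(target)), dim=-1)
    pos_sim = (mapped * F.normalize(vis_feats, dim=-1)).sum(-1)
    neg_sim = (mapped * F.normalize(vis_feats.roll(1, 0), dim=-1)).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

# One toy training step on random indices/features (stand-ins for real data).
B, K = 32, 5
target = torch.randint(0, vocab_size, (B,))
context = torch.randint(0, vocab_size, (B,))
negatives = torch.randint(0, vocab_size, (B, K))
vis_feats = torch.randn(B, vis_dim)   # CNN features for the targets' images

loss = skipgram_loss(target, context, negatives) + visual_loss(target, vis_feats)
loss.backward()
print(float(loss))
```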
Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning
Distributional word representation methods exploit word co-occurrences to
build compact vector encodings of words. While these representations enjoy
widespread use in modern natural language processing, it is unclear whether
they accurately encode all necessary facets of conceptual meaning. In this
paper, we evaluate how well these representations can predict perceptual and
conceptual features of concrete concepts, drawing on two semantic norm datasets
sourced from human participants. We find that several standard word
representations fail to encode many salient perceptual features of concepts,
and show that these deficits correlate with word-word similarity prediction
errors. Our analyses provide motivation for grounded and embodied language
learning approaches, which may help to remedy these deficits.
Comment: Accepted at RoboNLP 2017.
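The kind of evaluation the abstract describes can be cast as supervised prediction of perceptual feature norms from word vectors. The sketch below, which is not the paper's exact setup, cross-validates a ridge regression from toy embeddings to a single hypothetical perceptual feature; low scores on salient perceptual features would signal the deficits reported. All data are random stand-ins.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy stand-ins: word embeddings and one perceptual feature rating (e.g. how
# strongly "is_round" applies) for 200 hypothetical concrete concepts. A real
# evaluation would use vectors for the concepts in a human norm dataset.
n_concepts, embed_dim = 200, 300
word_vectors = rng.standard_normal((n_concepts, embed_dim))
feature_ratings = rng.random(n_concepts)

# Cross-validated regression from embeddings to the perceptual feature.
scores = cross_val_score(Ridge(alpha=1.0), word_vectors, feature_ratings,
                         cv=5, scoring="r2")
print("mean R^2:", scores.mean())
```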
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. This study aims to leverage visual
information to capture sentence-level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep neural networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence-level semantics. Importantly, this
result shows that prior knowledge of lexical-level semantics is not needed to
model sentence-level semantics. These findings demonstrate the importance of
visual information in modelling semantics.
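A minimal sketch of a two-branch image-caption retrieval model of the general kind the abstract describes: a character-level GRU caption encoder (one way to avoid pretrained word embeddings; an assumption here, not necessarily this paper's architecture) and a linear projection of precomputed CNN image features, trained with a bidirectional hinge loss over in-batch negatives. All data are random stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class CaptionEncoder(nn.Module):
    """Character-level GRU encoder: no pretrained word embeddings assumed."""
    def __init__(self, n_chars=128, char_dim=32, hidden=256, out_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.gru = nn.GRU(char_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, char_ids):                  # (B, T) integer char ids
        _, h = self.gru(self.char_emb(char_ids))  # h: (1, B, hidden)
        return F.normalize(self.proj(h[-1]), dim=-1)

class ImageEncoder(nn.Module):
    """Projects precomputed CNN image features into the shared space."""
    def __init__(self, feat_dim=2048, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, out_dim)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)

def contrastive_loss(img, cap, margin=0.2):
    """Bidirectional hinge loss over in-batch negatives, a generic formulation
    commonly used for image-caption retrieval (not necessarily the paper's)."""
    sims = img @ cap.t()                  # (B, B) cosine similarities
    pos = sims.diag().unsqueeze(1)        # matching pairs on the diagonal
    cost_c = F.relu(margin + sims - pos)      # caption retrieval direction
    cost_i = F.relu(margin + sims - pos.t())  # image retrieval direction
    mask = 1 - torch.eye(sims.size(0))        # ignore the positive pairs
    return ((cost_c + cost_i) * mask).mean()

# One toy step on random stand-in data.
B, T = 16, 40
cap_enc, img_enc = CaptionEncoder(), ImageEncoder()
chars = torch.randint(1, 128, (B, T))   # fake character ids
feats = torch.randn(B, 2048)            # fake CNN image features
loss = contrastive_loss(img_enc(feats), cap_enc(chars))
loss.backward()
print(float(loss))
```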