18,185 research outputs found
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. The current study aims to leverage visual
information in order to capture sentence level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep Neural Networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence level semantics. Importantly, this
result shows that we do not need prior knowledge of lexical level semantics in
order to model sentence level semantics. These findings demonstrate the
importance of visual information in semantics
Detecting Sarcasm in Multimodal Social Platforms
Sarcasm is a peculiar form of sentiment expression, where the surface
sentiment differs from the implied sentiment. The detection of sarcasm in
social media platforms has been applied in the past mainly to textual
utterances where lexical indicators (such as interjections and intensifiers),
linguistic markers, and contextual information (such as user profiles, or past
conversations) were used to detect the sarcastic tone. However, modern social
media platforms allow to create multimodal messages where audiovisual content
is integrated with the text, making the analysis of a mode in isolation
partial. In our work, we first study the relationship between the textual and
visual aspects in multimodal posts from three major social media platforms,
i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to
quantify the extent to which images are perceived as necessary by human
annotators. Moreover, we propose two different computational frameworks to
detect sarcasm that integrate the textual and visual modalities. The first
approach exploits visual semantics trained on an external dataset, and
concatenates the semantics features with state-of-the-art textual features. The
second method adapts a visual neural network initialized with parameters
trained on ImageNet to multimodal sarcastic posts. Results show the positive
effect of combining modalities for the detection of sarcasm across platforms
and methods.Comment: 10 pages, 3 figures, final version published in the Proceedings of
ACM Multimedia 201
Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge
Distributional models provide a convenient way to model semantics using dense
embedding spaces derived from unsupervised learning algorithms. However, the
dimensions of dense embedding spaces are not designed to resemble human
semantic knowledge. Moreover, embeddings are often built from a single source
of information (typically text data), even though neurocognitive research
suggests that semantics is deeply linked to both language and perception. In
this paper, we combine multimodal information from both text and image-based
representations derived from state-of-the-art distributional models to produce
sparse, interpretable vectors using Joint Non-Negative Sparse Embedding.
Through in-depth analyses comparing these sparse models to human-derived
behavioural and neuroimaging data, we demonstrate their ability to predict
interpretable linguistic descriptions of human ground-truth semantic knowledge.Comment: Proceedings of the 22nd Conference on Computational Natural Language
Learning (CoNLL 2018), pages 260-270. Brussels, Belgium, October 31 -
November 1, 2018. Association for Computational Linguistic
Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs
which play a crucial role for the compositional power of language.Comment: The paper has been published in the Proceedings of the 27 Conference
of Computational Linguistics. Please refer to this version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197
- …