Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis
Multimodal language analysis often considers relationships between features
based on text and those based on acoustical and visual properties. Text
features typically outperform non-text features in sentiment analysis or
emotion recognition tasks in part because the text features are derived from
advanced language models or word embeddings trained on massive data sources
while audio and video features are human-engineered and comparatively
underdeveloped. Given that the text, audio, and video are describing the same
utterance in different ways, we hypothesize that multimodal sentiment
analysis and emotion recognition can be improved by learning (hidden)
correlations between features extracted from the outer product of text and
audio (we call this text-based audio) and analogous text-based video. This
paper proposes a novel model, the Interaction Canonical Correlation Network
(ICCN), to learn such multimodal embeddings. ICCN learns correlations between
all three modes via deep canonical correlation analysis (DCCA), and the proposed
embeddings are then tested on several benchmark datasets and against other
state-of-the-art multimodal embedding algorithms. Empirical results and
ablation studies confirm the effectiveness of ICCN in capturing useful
information from all three views.
Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge
Distributional models provide a convenient way to model semantics using dense
embedding spaces derived from unsupervised learning algorithms. However, the
dimensions of dense embedding spaces are not designed to resemble human
semantic knowledge. Moreover, embeddings are often built from a single source
of information (typically text data), even though neurocognitive research
suggests that semantics is deeply linked to both language and perception. In
this paper, we combine multimodal information from both text and image-based
representations derived from state-of-the-art distributional models to produce
sparse, interpretable vectors using Joint Non-Negative Sparse Embedding.
Through in-depth analyses comparing these sparse models to human-derived
behavioural and neuroimaging data, we demonstrate their ability to predict
interpretable linguistic descriptions of human ground-truth semantic knowledge.
Comment: Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 260-270. Brussels, Belgium, October 31 - November 1, 2018. Association for Computational Linguistics.
Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations
Multimodal embeddings aim to enrich the semantic information in neural
representations of language compared to text-only models. While different
embeddings exhibit different applicability and performance on downstream tasks,
little is known about the systematic representation differences attributed to
the visual modality. Our paper compares word embeddings from three
vision-and-language models (CLIP, OpenCLIP and Multilingual CLIP) and three
text-only models, with static (FastText) as well as contextual representations
(multilingual BERT; XLM-RoBERTa). This is the first large-scale study of the
effect of visual grounding on language representations, including 46 semantic
parameters. We identify meaning properties and relations that characterize
words whose embeddings are most affected by the inclusion of visual modality in
the training data; that is, points where visual grounding turns out most
important. We find that the effect of visual modality correlates most with
denotational semantic properties related to concreteness, but is also detected
for several specific semantic classes, as well as for valence, a
sentiment-related connotational property of linguistic expressions.
Comment: Accepted for StarSEM 202
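One standard way to quantify how a grounded space differs from a text-only space, consistent with the comparison this abstract describes, is to correlate their pairwise word-similarity structures (representational similarity analysis). The sketch below uses random matrices as stand-ins for CLIP and FastText word vectors; it illustrates the measurement, not the paper's exact protocol.

```python
# Compare two embedding spaces by correlating their pairwise cosine
# similarity matrices over a shared vocabulary (RSA-style comparison).
import numpy as np

def pairwise_cosine(X):
    """Cosine similarity matrix over the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def representational_similarity(A, B):
    """Pearson correlation of the upper triangles of two similarity matrices."""
    iu = np.triu_indices(A.shape[0], k=1)
    return np.corrcoef(pairwise_cosine(A)[iu], pairwise_cosine(B)[iu])[0, 1]

rng = np.random.default_rng(0)
clip_like = rng.normal(size=(100, 512))      # stand-in for CLIP word vectors
fasttext_like = rng.normal(size=(100, 300))  # stand-in for FastText vectors

r = representational_similarity(clip_like, fasttext_like)
print(f"RSA correlation between the two spaces: {r:.3f}")
```

Computing this per semantic class (e.g., concrete vs. abstract words) is one way to locate the "leverage points" where visual grounding reshapes the similarity structure most.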
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. The current study aims to leverage visual
information in order to capture sentence level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep Neural Networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence level semantics. Importantly, this
result shows that we do not need prior knowledge of lexical level semantics in
order to model sentence level semantics. These findings demonstrate the
importance of visual information in semantics.
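The training objective behind this kind of cross-modal retrieval is commonly a margin-based ranking loss over a shared embedding space. The sketch below is an illustrative stand-in, not the paper's model: the encoders are reduced to precomputed embedding matrices, and the loss pushes matching image-caption pairs above mismatched ones.

```python
# Triplet (margin) ranking loss for image-caption retrieval in a shared
# embedding space: matching pairs sit on the diagonal of the score matrix.
import numpy as np

def l2_normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def triplet_retrieval_loss(img, cap, margin=0.2):
    """Hinge loss penalizing mismatched pairs within `margin` of the match."""
    img, cap = l2_normalize(img), l2_normalize(cap)
    scores = img @ cap.T              # cosine similarities, all pairs
    pos = np.diag(scores)             # matching image-caption pairs
    cost_c = np.maximum(0, margin + scores - pos[:, None])  # wrong captions
    cost_i = np.maximum(0, margin + scores - pos[None, :])  # wrong images
    np.fill_diagonal(cost_c, 0)
    np.fill_diagonal(cost_i, 0)
    return cost_c.sum() + cost_i.sum()

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(32, 128))
cap_emb = img_emb + 0.1 * rng.normal(size=(32, 128))  # well-aligned pairs
print(triplet_retrieval_loss(img_emb, cap_emb))
```

Because the loss never consults word-level supervision, only whether an image and a caption belong together, any sentence-level semantics the embeddings acquire comes from the visual grounding alone, which is the point the abstract argues.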