Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis
Multimodal language analysis often considers relationships between features
based on text and those based on acoustical and visual properties. Text
features typically outperform non-text features in sentiment analysis or
emotion recognition tasks in part because the text features are derived from
advanced language models or word embeddings trained on massive data sources
while audio and video features are human-engineered and comparatively
underdeveloped. Given that text, audio, and video describe the same
utterance in different ways, we hypothesize that multimodal sentiment
analysis and emotion recognition can be improved by learning (hidden)
correlations between features extracted from the outer product of text and
audio (we call this text-based audio) and the analogous text-based video. This
paper proposes a novel model, the Interaction Canonical Correlation Network
(ICCN), to learn such multimodal embeddings. ICCN learns correlations between
all three modes via deep canonical correlation analysis (DCCA). The proposed
embeddings are then tested on several benchmark datasets against other
state-of-the-art multimodal embedding algorithms. Empirical results and
ablation studies confirm the effectiveness of ICCN in capturing useful
information from all three views.
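The core idea in the abstract above (outer products of text with audio and with video, followed by correlation analysis between the two resulting views) can be illustrated with a small sketch. This is not the ICCN authors' implementation: it uses plain linear CCA as a stand-in for DCCA, omits any neural feature extractors, and runs on synthetic data. The function names `outer_product_features` and `canonical_correlations`, the feature dimensions, and the regularization value are all assumptions made for illustration.

```python
import numpy as np

def outer_product_features(text, other):
    """Flattened per-utterance outer product: (n, dt), (n, do) -> (n, dt * do)."""
    return np.einsum("ni,nj->nij", text, other).reshape(text.shape[0], -1)

def canonical_correlations(x, y, reg=1e-3):
    """Canonical correlations of linear CCA between two views (rows = samples).

    Linear CCA is used here only as a simple stand-in for deep CCA.
    `reg` is a hypothetical ridge term that keeps the covariances invertible.
    """
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)

    def inv_sqrt(m):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, v = np.linalg.eigh(m)
        return v @ np.diag(w ** -0.5) @ v.T

    # Singular values of the whitened cross-covariance are the correlations.
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(t, compute_uv=False)

# Synthetic stand-ins for per-utterance features (all dimensions are made up).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))  # shared "utterance" signal
text = latent @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(200, 4))
audio = latent @ rng.normal(size=(2, 3)) + 0.5 * rng.normal(size=(200, 3))
video = latent @ rng.normal(size=(2, 3)) + 0.5 * rng.normal(size=(200, 3))

ta = outer_product_features(text, audio)  # "text-based audio" view
tv = outer_product_features(text, video)  # "text-based video" view
rho = canonical_correlations(ta, tv)
print(rho[:3])  # leading canonical correlations between the two views
```

In a deep variant, each view would first pass through a learned network and the sum of the leading correlations would serve as the training objective; the linear version only shows what quantity is being maximized.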
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles, modality heterogeneity and
interconnections, that have driven subsequent innovations, and propose a
taxonomy of six core technical challenges (representation, alignment, reasoning,
generation, transference, and quantification) covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.