Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where the temporal structures of different data modalities, such as audio and lyrics, should be taken into account. Motivated by the inherently temporal structure of music, we aim to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data in different modalities are converted to the same canonical space, where inter-modal canonical correlation analysis is used as an objective function to calculate the similarity of temporal structures. This is the first study that uses deep architectures for learning the temporal correlation between audio and lyrics. A pre-trained Doc2Vec model followed by fully-connected layers is used to represent lyrics. Two significant contributions are made in the audio branch, as follows: i) We propose an end-to-end network to learn cross-modal correlation between audio and lyrics, where feature extraction and correlation learning are performed simultaneously and the joint representation is learned by considering temporal structures. ii) For feature extraction, we further represent an audio signal by a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structures of music audio. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
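The canonical correlation objective mentioned in this abstract can be illustrated with a minimal sketch of plain linear CCA in NumPy. This is not the authors' network: it assumes the audio and lyrics features have already been extracted as fixed-size vectors, and the function names, feature sizes, and the `reg` regularization term are hypothetical choices made for illustration only.

```python
import numpy as np


def inv_sqrt(m):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(m)
    return v @ np.diag(w ** -0.5) @ v.T


def canonical_correlations(x, y, reg=1e-6):
    """Canonical correlations between two views x (n, p) and y (n, q).

    reg is a small ridge term (an assumption of this sketch) that keeps the
    covariance matrices invertible.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    sxx = x.T @ x / (n - 1) + reg * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + reg * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(t, compute_uv=False)


# Toy data: hypothetical 4-dimensional "audio" and "lyrics" features.
rng = np.random.default_rng(0)
audio = rng.normal(size=(200, 4))
lyrics = 2.0 * audio + 1.0  # a perfectly linearly related second view
total_correlation = canonical_correlations(audio, lyrics).sum()
```

In a deep variant, the sum of canonical correlations computed on the outputs of the two branches would serve as the quantity to maximize during training; here it is only evaluated on fixed toy features.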
Semi-supervised Deep Generative Modelling of Incomplete Multi-Modality Emotional Data
There are three challenges in emotion recognition. First, it is difficult
to recognize human emotional states from a single modality alone.
Second, it is expensive to manually annotate the emotional data. Third,
emotional data often suffers from missing modalities due to unforeseeable
sensor malfunction or configuration issues. In this paper, we address all these
problems under a novel multi-view deep generative framework. Specifically, we
propose to model the statistical relationships of multi-modality emotional data
using multiple modality-specific generative networks with a shared latent
space. By imposing a Gaussian mixture assumption on the posterior approximation
of the shared latent variables, our framework can learn the joint deep
representation from multiple modalities and evaluate the importance of each
modality simultaneously. To solve the labeled-data-scarcity problem, we extend
our multi-view model to the semi-supervised learning scenario by casting the
semi-supervised classification problem as a specialized missing data imputation
task. To address the missing-modality problem, we further extend our
semi-supervised multi-view model to deal with incomplete data, where a missing
view is treated as a latent variable and integrated out during inference. This
way, the proposed overall framework can utilize all available (both labeled and
unlabeled, as well as both complete and incomplete) data to improve its
generalization ability. The experiments conducted on two real multi-modal
emotion datasets demonstrated the superiority of our framework. Comment: arXiv admin note: text overlap with arXiv:1704.07548, 2018 ACM
Multimedia Conference (MM'18)
CentralNet: a Multilayer Approach for Multimodal Fusion
This paper proposes a novel multimodal fusion approach, aiming to produce
the best possible decisions by integrating information coming from multiple media.
While most of the past multimodal approaches either work by projecting the
features of different modalities into the same space, or by coordinating the
representations of each modality through the use of constraints, our approach
borrows from both views. More specifically, assuming each modality can be
processed by a separate deep convolutional network, allowing decisions to be made
independently from each modality, we introduce a central network linking the
modality specific networks. This central network not only provides a common
feature embedding but also regularizes the modality specific networks through
the use of multi-task learning. The proposed approach is validated on 4
different computer vision tasks on which it consistently improves the accuracy
of existing multimodal fusion approaches.
Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis
Multimodal language analysis often considers relationships between features
based on text and those based on acoustical and visual properties. Text
features typically outperform non-text features in sentiment analysis or
emotion recognition tasks in part because the text features are derived from
advanced language models or word embeddings trained on massive data sources
while audio and video features are human-engineered and comparatively
underdeveloped. Given that the text, audio, and video are describing the same
utterance in different ways, we hypothesize that the multimodal sentiment
analysis and emotion recognition can be improved by learning (hidden)
correlations between features extracted from the outer product of text and
audio (we call this text-based audio) and analogous text-based video. This
paper proposes a novel model, the Interaction Canonical Correlation Network
(ICCN), to learn such multimodal embeddings. ICCN learns correlations between
all three modes via deep canonical correlation analysis (DCCA) and the proposed
embeddings are then tested on several benchmark datasets and against other
state-of-the-art multimodal embedding algorithms. Empirical results and
ablation studies confirm the effectiveness of ICCN in capturing useful
information from all three views.
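The "outer product of text and audio" named in this abstract can be sketched as follows. This is only an illustration of the outer-product step under assumed shapes; the feature dimensions, the function name, and the random inputs are hypothetical, and the subsequent correlation learning described in the abstract is not shown.

```python
import numpy as np


def text_based_features(text, other):
    """Per-utterance outer product of text features with another modality.

    text:  array of shape (batch, d_text)
    other: array of shape (batch, d_other), e.g. audio or video features
    Returns an array of shape (batch, d_text, d_other).
    """
    return np.einsum("bi,bj->bij", text, other)


# Hypothetical sizes: 8 utterances, 16-dim text, 5-dim audio, 7-dim video.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
audio = rng.normal(size=(8, 5))
video = rng.normal(size=(8, 7))

text_based_audio = text_based_features(text, audio)  # shape (8, 16, 5)
text_based_video = text_based_features(text, video)  # shape (8, 16, 7)
```

Each entry of the result multiplies one text feature by one audio (or video) feature, so every pairwise interaction between the two modalities is represented explicitly.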
Robust correlated and individual component analysis
Recovering correlated and individual components of two, possibly temporally misaligned, sets of data is a fundamental task in disciplines such as image, vision, and behavior computing, with application to problems such as multi-modal fusion (via correlated components), predictive analysis, and clustering (via the individual ones). Here, we study the extraction of correlated and individual components under real-world conditions, namely i) the presence of gross non-Gaussian noise and ii) temporally misaligned data. In this light, we propose a method for the Robust Correlated and Individual Component Analysis (RCICA) of two sets of data in the presence of gross, sparse errors. We furthermore extend RCICA in order to handle temporal incongruities arising in the data. To this end, two suitable optimization problems are solved. The generality of the proposed methods is demonstrated by applying them to four applications, namely i) heterogeneous face recognition, ii) multi-modal feature fusion for human behavior analysis (i.e., audio-visual prediction of interest and conflict), iii) face clustering, and iv) the temporal alignment of facial expressions. Experimental results on 2 synthetic and 7 real-world datasets indicate the robustness and effectiveness of the proposed methods on these application domains, outperforming other state-of-the-art methods in the field.