4,765 research outputs found
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
Deep cross-modal learning has successfully demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities such as audio and lyrics should be taken into account. Stemming from the characteristic of temporal structures of music in nature, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for audio modality and text modality (lyrics). Data in different modalities are converted to the same canonical space where inter modal canonical correlation analysis is utilized as an objective function to calculate the similarity of temporal structures. This is the first study that uses deep architectures for learning the temporal correlation between audio and lyrics. A pre-trained Doc2Vec model followed by fully-connected layers is used to represent lyrics. Two significant contributions are made in the audio branch, as follows: i) We propose an end-to-end network to learn cross-modal correlation between audio and lyrics, where feature extraction and correlation learning are simultaneously performed and joint representation is learned by considering temporal structures. ii) As for feature extraction, we further represent an audio signal by a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better learns temporal structures of music audio. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval
Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, that is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.Comment: To appear in ECCV 201
Objects that Sound
In this paper our objectives are, first, networks that can embed audio and
visual inputs into a common space that is suitable for cross-modal retrieval;
and second, a network that can localize the object that sounds in an image,
given the audio signal. We achieve both these objectives by training from
unlabelled video using only audio-visual correspondence (AVC) as the objective
function. This is a form of cross-modal self-supervision from video.
To this end, we design new network architectures that can be trained for
cross-modal retrieval and localizing the sound source in an image, by using the
AVC task. We make the following contributions: (i) show that audio and visual
embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and
between-mode retrieval; (ii) explore various architectures for the AVC task,
including those for the visual stream that ingest a single image, or multiple
images, or a single image and multi-frame optical flow; (iii) show that the
semantic object that sounds within an image can be localized (using only the
sound, no motion or flow information); and (iv) give a cautionary tale on how
to avoid undesirable shortcuts in the data preparation.Comment: Appears in: European Conference on Computer Vision (ECCV) 201
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modality of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently from the chosen representation
Audio-Visual Embedding for Cross-Modal MusicVideo Retrieval through Supervised Deep CCA
Deep learning has successfully shown excellent performance in learning joint
representations between different data modalities. Unfortunately, little
research focuses on cross-modal correlation learning where temporal structures
of different data modalities, such as audio and video, should be taken into
account. Music video retrieval by given musical audio is a natural way to
search and interact with music contents. In this work, we study cross-modal
music video retrieval in terms of emotion similarity. Particularly, audio of an
arbitrary length is used to retrieve a longer or full-length music video. To
this end, we propose a novel audio-visual embedding algorithm by Supervised
Deep CanonicalCorrelation Analysis (S-DCCA) that projects audio and video into
a shared space to bridge the semantic gap between audio and video. This also
preserves the similarity between audio and visual contents from different
videos with the same class label and the temporal structure. The contribution
of our approach is mainly manifested in the two aspects: i) We propose to
select top k audio chunks by attention-based Long Short-Term Memory
(LSTM)model, which can represent good audio summarization with local
properties. ii) We propose an end-to-end deep model for cross-modal
audio-visual learning where S-DCCA is trained to learn the semantic correlation
between audio and visual modalities. Due to the lack of music video dataset, we
construct 10K music video dataset from YouTube 8M dataset. Some promising
results such as MAP and precision-recall show that our proposed model can be
applied to music video retrieval.Comment: 8 pages, 9 figures. Accepted by ISM 201
- …