93 research outputs found
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
Deep cross-modal learning has successfully demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities such as audio and lyrics should be taken into account. Stemming from the characteristic of temporal structures of music in nature, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for audio modality and text modality (lyrics). Data in different modalities are converted to the same canonical space where inter-modal canonical correlation analysis is utilized as an objective function to calculate the similarity of temporal structures. This is the first study that uses deep architectures for learning the temporal correlation between audio and lyrics. A pre-trained Doc2Vec model followed by fully-connected layers is used to represent lyrics. Two significant contributions are made in the audio branch, as follows: i) We propose an end-to-end network to learn cross-modal correlation between audio and lyrics, where feature extraction and correlation learning are simultaneously performed and joint representation is learned by considering temporal structures. ii) As for feature extraction, we further represent an audio signal by a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better learns temporal structures of music audio. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
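Once both branches map their inputs into the same space, retrieval in either direction reduces to ranking candidates by similarity to the query. The following is a minimal sketch of that ranking step only, not the authors' network. It assumes cosine similarity as the score (the abstract does not name the measure), and the embeddings and the names `lyrics_a`, `lyrics_b`, `lyrics_c` are hypothetical toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return candidate names ordered from most to least similar to the query."""
    scored = [(cosine(query, vec), name) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical embeddings already projected into the shared space.
audio_query = [0.9, 0.1, 0.0]
lyrics_bank = {
    "lyrics_a": [1.0, 0.0, 0.0],
    "lyrics_b": [0.0, 1.0, 0.0],
    "lyrics_c": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(audio_query, lyrics_bank)
print(ranking)
```

Swapping the roles of the query and the bank gives the lyrics-to-audio direction with the same function.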
Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Deep learning has successfully shown excellent performance in learning joint
representations between different data modalities. Unfortunately, little
research focuses on cross-modal correlation learning where temporal structures
of different data modalities, such as audio and video, should be taken into
account. Music video retrieval by given musical audio is a natural way to
search and interact with music contents. In this work, we study cross-modal
music video retrieval in terms of emotion similarity. Particularly, audio of an
arbitrary length is used to retrieve a longer or full-length music video. To
this end, we propose a novel audio-visual embedding algorithm by Supervised
Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into
a shared space to bridge the semantic gap between audio and video. This also
preserves the similarity between audio and visual contents from different
videos with the same class label and the temporal structure. The contribution
of our approach is mainly manifested in the two aspects: i) We propose to
select the top k audio chunks with an attention-based Long Short-Term Memory
(LSTM) model, which provides a good audio summary that preserves local
properties. ii) We propose an end-to-end deep model for cross-modal
audio-visual learning where S-DCCA is trained to learn the semantic correlation
between audio and visual modalities. Owing to the lack of a music video dataset, we
construct a 10K music video dataset from the YouTube-8M dataset. Promising
results in terms of MAP and precision-recall show that our proposed model can be
applied to music video retrieval. Comment: 8 pages, 9 figures. Accepted by ISM 201
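The abstract reports MAP without defining it. As a hedged illustration, the sketch below computes mean average precision in its common form, where each query yields a ranked list of binary relevance flags. It assumes every relevant item appears somewhere in the ranked list, so average precision is normalized by the number of relevant items found; the relevance lists are made-up examples, not results from the paper.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """Mean of the per-query average precision values."""
    if not relevance_lists:
        return 0.0
    total = sum(average_precision(r) for r in relevance_lists)
    return total / len(relevance_lists)


# Two hypothetical queries: 1 marks a retrieved video with the query's label.
queries = [[1, 0, 1, 0], [0, 1]]
map_score = mean_average_precision(queries)
print(round(map_score, 4))
```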
Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
The heterogeneity gap problem is the main challenge in cross-modal retrieval,
because cross-modal data (e.g., audio-visual) have different distributions and
representations that cannot be directly compared. To bridge the gap between
audio-visual modalities, we learn a common subspace for them by utilizing the
intrinsic correlation in the natural synchronization of audio-visual data with
the aid of annotated labels. TNN-CCCA is the best audio-visual cross-modal
retrieval (AV-CMR) model so far, but the model training is sensitive to hard
negative samples when learning a common subspace by applying triplet loss to
predict the relative distance between inputs. In this paper, to reduce the
interference of hard negative samples in representation learning, we propose a
new AV-CMR model to optimize semantic features by directly predicting labels
and then measuring the intrinsic correlation between audio-visual data using
complete cross-triplet loss. In particular, our model projects audio-visual
features into label space by minimizing the distance between predicted label
features after feature projection and ground label representations. Moreover,
we adopt complete cross-triplet loss to optimize the predicted label features
by leveraging the relationship between all possible similarity and
dissimilarity semantic information across modalities. The extensive
experimental results on two audio-visual double-checked datasets have shown an
improvement of approximately 2.1% in terms of average MAP over the current
state-of-the-art method TNN-CCCA for the AV-CMR task, which indicates the
effectiveness of our proposed model. Comment: 9 pages, 5 figures, 3 tables, accepted by IEEE ISM 202
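For readers unfamiliar with the triplet loss mentioned above, here is a minimal sketch of the standard margin-based form applied in both cross-modal directions. It is not the authors' complete cross-triplet loss, whose exact definition the abstract does not give. The Euclidean distance, the margin of 1.0, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pushes the positive closer than the negative by `margin`."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def bidirectional_triplet_loss(audio, visual, audio_neg, visual_neg, margin=1.0):
    """Sum of audio-anchored and visual-anchored triplet losses for one pair.

    `audio` and `visual` come from the same class; `audio_neg` and
    `visual_neg` come from a different class (hypothetical inputs).
    """
    audio_to_visual = triplet_loss(audio, visual, visual_neg, margin)
    visual_to_audio = triplet_loss(visual, audio, audio_neg, margin)
    return audio_to_visual + visual_to_audio


# Toy features: the matching pair is close, the negatives are far away.
loss_easy = bidirectional_triplet_loss(
    audio=[0.0, 0.0], visual=[0.0, 1.0],
    audio_neg=[3.0, 5.0], visual_neg=[3.0, 4.0],
)
print(loss_easy)
```

A hard negative is one that lies closer to the anchor than the positive does, which makes the hinge term large; this is the sensitivity the abstract refers to.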
Universal EEG Encoder for Learning Diverse Intelligent Tasks
Brain-Computer Interfaces (BCIs) have become very popular, with
electroencephalography (EEG) being one of the most commonly used signal
acquisition techniques. A major challenge in BCI studies is the individualistic
analysis required for each task. Thus, task-specific feature extraction and
classification are performed, which fail to generalize to other tasks with
similar time-series EEG input data. To this end, we design a GRU-based
universal deep encoding architecture to extract meaningful features from
publicly available datasets for five diverse EEG-based classification tasks.
Our network can generate task- and format-independent data representations and
outperforms the state-of-the-art EEGNet architecture on most experiments. We
also compare our results with CNN-based and autoencoder networks, in turn
performing local, spatial, temporal and unsupervised analysis on the data
- …
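As a rough illustration of what a GRU-based encoder does with a time series, the sketch below runs a single scalar GRU unit over a sequence and returns the final hidden state as the encoding. It is not the architecture from the abstract: the scalar input and hidden state, the parameter names, the parameter values, and the gate convention (one common formulation) are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU update for a scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate scales how much history is used.
    n = math.tanh(p["w_n"] * x + p["u_n"] * (r * h_prev) + p["b_n"])
    # Blend the candidate with the previous state using the update gate.
    return (1.0 - z) * n + z * h_prev


def encode(sequence, p, h0=0.0):
    """Run the GRU over the whole sequence and return the last hidden state."""
    h = h0
    for x in sequence:
        h = gru_step(x, h, p)
    return h


# Hypothetical parameters and a short hypothetical signal.
params = {
    "w_z": 0.5, "u_z": 0.1, "b_z": 0.0,
    "w_r": 0.3, "u_r": 0.2, "b_r": 0.0,
    "w_n": 1.0, "u_n": 0.7, "b_n": 0.0,
}
code = encode([0.2, -0.1, 0.4], params)
print(code)
```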