Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from the different modalities are projected into the same canonical space, where inter-modal canonical correlation analysis is used as an objective function to measure the similarity of temporal structures. This is the first study that uses deep architectures to learn the temporal correlation between audio and lyrics. A pre-trained Doc2Vec model followed by fully-connected layers is used to represent lyrics. Two significant contributions are made in the audio branch: i) we propose an end-to-end network to learn the cross-modal correlation between audio and lyrics, where feature extraction and correlation learning are performed simultaneously and a joint representation is learned that accounts for temporal structure; ii) for feature extraction, we represent an audio signal as a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structure of music audio. Experimental results, using audio to retrieve lyrics or lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architecture in cross-modal music retrieval.
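The abstract itself provides no code; the following is a minimal PyTorch sketch of the kind of two-branch design it describes, assuming a 300-dimensional Doc2Vec lyrics vector and a sequence of 512-dimensional VGG16-style audio summaries. The CCA objective is reduced here to a simple per-dimension correlation surrogate; all layer sizes and the loss form are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LyricsBranch(nn.Module):
    """Fully-connected layers on top of a pre-trained Doc2Vec vector (assumed 300-d)."""
    def __init__(self, in_dim=300, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x):               # x: (batch, 300)
        return self.net(x)              # (batch, 64)

class AudioBranch(nn.Module):
    """RNN over a short sequence of local audio summaries (assumed 512-d VGG16 features)."""
    def __init__(self, in_dim=512, hidden=128, out_dim=64):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):               # x: (batch, seq_len, 512)
        _, h = self.rnn(x)              # h: (1, batch, hidden)
        return self.proj(h.squeeze(0))  # (batch, 64)

def correlation_loss(a, b, eps=1e-6):
    """Simplified CCA-style surrogate: maximize the mean per-dimension
    correlation between the two projected views (negated for minimization)."""
    a = (a - a.mean(0)) / (a.std(0) + eps)
    b = (b - b.mean(0)) / (b.std(0) + eps)
    return -(a * b).mean()

# Toy usage with random tensors standing in for real features.
lyrics_net, audio_net = LyricsBranch(), AudioBranch()
lyrics_vec = torch.randn(8, 300)        # batch of Doc2Vec embeddings
audio_seq = torch.randn(8, 10, 512)     # batch of 10-step VGG16 feature sequences
loss = correlation_loss(lyrics_net(lyrics_vec), audio_net(audio_seq))
loss.backward()
```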
Multimodal music information processing and retrieval: survey and future challenges
Towards improving performance in various music information processing tasks, recent studies exploit different modalities able to capture diverse aspects of music. Such modalities include audio recordings, symbolic music scores, mid-level representations, motion and gestural data, video recordings, editorial or cultural tags, lyrics, and album cover art. This paper critically reviews the various approaches adopted in Music Information Processing and Retrieval and highlights how multimodal algorithms can help Music Computing applications. First, we categorize the related literature based on the application it addresses. Subsequently, we analyze existing information fusion approaches, and we conclude with the set of challenges that the Music Information Retrieval and Sound and Music Computing research communities should focus on in the coming years.
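The survey abstract does not prescribe an implementation, but the distinction between the fusion strategies it analyzes can be illustrated with a short sketch. Below is a hypothetical comparison of early (feature-level) and late (decision-level) fusion of audio and lyrics features; the random data, feature dimensions, and logistic-regression classifiers are placeholders, not methods from the survey.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
audio_feat = rng.normal(size=(200, 128))   # placeholder audio descriptors
lyrics_feat = rng.normal(size=(200, 300))  # placeholder lyrics embeddings
labels = rng.integers(0, 2, size=200)      # placeholder binary labels

# Early (feature-level) fusion: concatenate modalities before a single classifier.
early = LogisticRegression(max_iter=1000).fit(
    np.hstack([audio_feat, lyrics_feat]), labels)

# Late (decision-level) fusion: one classifier per modality, then combine scores.
clf_a = LogisticRegression(max_iter=1000).fit(audio_feat, labels)
clf_l = LogisticRegression(max_iter=1000).fit(lyrics_feat, labels)
late_scores = 0.5 * clf_a.predict_proba(audio_feat)[:, 1] \
            + 0.5 * clf_l.predict_proba(lyrics_feat)[:, 1]
```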
Improving Music Genre Classification from Multi-Modal Properties of Music and Genre Correlations Perspective
Music genre classification has been widely studied in the past few years for its various applications in music information retrieval. Previous works tend to perform unsatisfactorily, since those methods either use audio content alone or jointly use audio and lyrics content inefficiently. In addition, as genres normally co-occur in a music track, it is desirable to capture and model genre correlations to improve the performance of multi-label music genre classification. To solve these issues, we present a novel multi-modal method leveraging an audio-lyrics contrastive loss and two symmetric cross-modal attention modules to align and fuse features from audio and lyrics. Furthermore, based on the nature of multi-label classification, a genre correlations extraction module is presented to capture and model potential genre correlations. Extensive experiments demonstrate that our proposed method significantly surpasses other multi-label music genre classification methods and achieves state-of-the-art results on the Music4All dataset. Comment: Accepted by ICASSP 202
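As a rough illustration of the two ingredients named in this abstract, here is a hedged PyTorch sketch of (i) an InfoNCE-style audio-lyrics contrastive loss and (ii) a pair of symmetric cross-modal attention blocks built from `nn.MultiheadAttention`. Embedding sizes, the temperature, and the attention wiring are assumptions for illustration, not the paper's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    """InfoNCE-style loss pulling matched audio/lyrics pairs together (assumed form)."""
    a = F.normalize(audio_emb, dim=-1)
    l = F.normalize(lyrics_emb, dim=-1)
    logits = a @ l.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)   # diagonal entries are matches
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

class SymmetricCrossModalAttention(nn.Module):
    """Two attention blocks: audio attends to lyrics, and lyrics attends to audio."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_seq, lyrics_seq):
        # audio_seq: (batch, Ta, dim), lyrics_seq: (batch, Tl, dim)
        audio_ctx, _ = self.a2l(audio_seq, lyrics_seq, lyrics_seq)
        lyrics_ctx, _ = self.l2a(lyrics_seq, audio_seq, audio_seq)
        # Pool each attended sequence into one fused vector per track.
        return audio_ctx.mean(dim=1), lyrics_ctx.mean(dim=1)

# Toy usage with random features standing in for real audio/lyrics encoders.
xma = SymmetricCrossModalAttention()
audio_vec, lyrics_vec = xma(torch.randn(8, 20, 256), torch.randn(8, 50, 256))
loss = audio_lyrics_contrastive_loss(audio_vec, lyrics_vec)
```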
Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Deep learning has shown excellent performance in learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and video, are taken into account. Music video retrieval from a given musical audio query is a natural way to search for and interact with music content. In this work, we study cross-modal music video retrieval in terms of emotion similarity. In particular, audio of an arbitrary length is used to retrieve a longer or full-length music video. To this end, we propose a novel audio-visual embedding algorithm based on Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a shared space to bridge the semantic gap between them. This also preserves the similarity between audio and visual content from different videos with the same class label, as well as the temporal structure. The contribution of our approach is mainly manifested in two aspects: i) we propose to select the top-k audio chunks with an attention-based Long Short-Term Memory (LSTM) model, which yields a good audio summary with local properties; ii) we propose an end-to-end deep model for cross-modal audio-visual learning in which S-DCCA is trained to learn the semantic correlation between the audio and visual modalities. Due to the lack of a music video dataset, we construct a 10K music video dataset from the YouTube 8M dataset. Promising results, in terms of MAP and precision-recall, show that our proposed model can be applied to music video retrieval. Comment: 8 pages, 9 figures. Accepted by ISM 201
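The listing does not include the authors' code; the sketch below illustrates, under stated assumptions, how attention scores from an LSTM could be used to pick the top-k audio chunks that summarize a track. The chunk count, feature size, and k are arbitrary placeholders, and the additive-attention scoring head is a generic stand-in rather than the paper's exact S-DCCA pipeline.

```python
import torch
import torch.nn as nn

class ChunkAttentionSelector(nn.Module):
    """Score audio chunks with an LSTM plus attention head and keep the top-k."""
    def __init__(self, feat_dim=128, hidden=64, k=3):
        super().__init__()
        self.k = k
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)   # additive-attention style scoring head

    def forward(self, chunks):              # chunks: (batch, n_chunks, feat_dim)
        states, _ = self.lstm(chunks)       # (batch, n_chunks, hidden)
        weights = self.score(states).squeeze(-1)       # (batch, n_chunks)
        topk = weights.topk(self.k, dim=1).indices     # indices of best-scoring chunks
        # Gather the selected chunk features as the audio summary.
        idx = topk.unsqueeze(-1).expand(-1, -1, chunks.size(-1))
        return chunks.gather(1, idx), torch.softmax(weights, dim=1)

# Toy usage: 8 tracks, each split into 12 chunks of 128-d features.
selector = ChunkAttentionSelector()
summary, attn = selector(torch.randn(8, 12, 128))
print(summary.shape)   # torch.Size([8, 3, 128])
```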