93 research outputs found

    Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval

    Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, where the aim is to learn joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we aim to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data in different modalities are converted to the same canonical space, where inter-modal canonical correlation analysis is used as an objective function to calculate the similarity of temporal structures. This is the first study that uses deep architectures for learning the temporal correlation between audio and lyrics. A pre-trained Doc2Vec model followed by fully-connected layers is used to represent lyrics. Two significant contributions are made in the audio branch: i) We propose an end-to-end network to learn cross-modal correlation between audio and lyrics, where feature extraction and correlation learning are performed simultaneously and a joint representation is learned by considering temporal structures. ii) For feature extraction, we further represent an audio signal by a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structures of music audio. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
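
To make the correlation objective in this abstract concrete, here is a minimal sketch rather than the authors' implementation. It assumes each branch outputs a single scalar per sample, in which case the canonical correlation between the two views reduces to the Pearson correlation. The function names and the sample values are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length sequences of 1-D embeddings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("an embedding with zero variance has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def correlation_loss(audio_emb, lyrics_emb):
    """Negative correlation: minimizing it pushes the two views to correlate."""
    return -pearson_correlation(audio_emb, lyrics_emb)


# Hypothetical 1-D outputs of the audio branch and the lyrics branch
# for a batch of four paired songs.
audio_emb = [0.1, 0.4, 0.35, 0.8]
lyrics_emb = [0.2, 0.5, 0.3, 0.9]
loss = correlation_loss(audio_emb, lyrics_emb)
```

With multi-dimensional embeddings the objective would instead sum several canonical correlations, which requires whitening the covariance matrices; that step is omitted here.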

    Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA

    Deep learning has shown excellent performance in learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and video, are taken into account. Retrieving a music video from a given piece of musical audio is a natural way to search and interact with music content. In this work, we study cross-modal music video retrieval in terms of emotion similarity. In particular, audio of arbitrary length is used to retrieve a longer or full-length music video. To this end, we propose a novel audio-visual embedding algorithm based on Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a shared space to bridge the semantic gap between them. The projection also preserves the temporal structure and the similarity between audio and visual contents from different videos with the same class label. The contribution of our approach lies mainly in two aspects: i) We propose to select the top k audio chunks with an attention-based Long Short-Term Memory (LSTM) model, which yields a good audio summarization with local properties. ii) We propose an end-to-end deep model for cross-modal audio-visual learning in which S-DCCA is trained to learn the semantic correlation between the audio and visual modalities. Because of the lack of a music video dataset, we construct a 10K music video dataset from the YouTube-8M dataset. Promising results in terms of MAP and precision-recall show that our proposed model can be applied to music video retrieval. Comment: 8 pages, 9 figures. Accepted by ISM 201
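
The MAP metric mentioned in this abstract can be sketched as follows. This is a generic illustration, not the paper's evaluation code. It assumes each query's ranked list contains all of its relevant items, so average precision is normalized by the number of relevant items found in the list. The function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance is a list of booleans in ranked order, True where
    the retrieved item at that rank is relevant to the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_rankings):
    """Mean of the per-query average precision values."""
    if not all_rankings:
        return 0.0
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


# Two hypothetical audio queries, each with a ranked list of videos.
rankings = [
    [True, False, True],   # relevant videos at ranks 1 and 3
    [False, True],         # relevant video at rank 2
]
map_score = mean_average_precision(rankings)
```

For the two example queries the per-query values are 5/6 and 1/2, so the mean is 2/3.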

    Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval

    The heterogeneity gap is the main challenge in cross-modal retrieval, because cross-modal data (e.g. audio-visual) have different distributions and representations that cannot be directly compared. To bridge the gap between audio and visual modalities, we learn a common subspace for them by exploiting the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels. TNN-CCCA is the best audio-visual cross-modal retrieval (AV-CMR) model so far, but its training is sensitive to hard negative samples when it learns the common subspace by applying a triplet loss to predict the relative distance between inputs. In this paper, to reduce the interference of hard negative samples in representation learning, we propose a new AV-CMR model that optimizes semantic features by directly predicting labels and then measuring the intrinsic correlation between audio-visual data using a complete cross-triplet loss. In particular, our model projects audio-visual features into label space by minimizing the distance between the predicted label features after feature projection and the ground label representations. Moreover, we adopt the complete cross-triplet loss to optimize the predicted label features by leveraging the relationship between all possible similarity and dissimilarity semantic information across modalities. Extensive experimental results on two audio-visual double-checked datasets show an improvement of approximately 2.1% in average MAP over the current state-of-the-art method TNN-CCCA for the AV-CMR task, which indicates the effectiveness of our proposed model. Comment: 9 pages, 5 figures, 3 tables, accepted by IEEE ISM 202
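
For readers unfamiliar with triplet losses, the sketch below shows the standard margin-based triplet loss applied across modalities, with audio anchors and visual positives and negatives chosen by class label. It is not the complete cross-triplet loss proposed in the paper, whose exact form the abstract does not give. The function names, the Euclidean distance, and the margin value are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: the positive should be closer than the
    negative by at least the margin, otherwise a penalty is incurred."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def cross_modal_triplet_loss(audio, visual, labels, margin=1.0):
    """Average triplet loss over every (audio anchor, visual positive,
    visual negative) combination, where audio[i] and visual[i] both
    carry the class label labels[i]."""
    total, count = 0.0, 0
    for i, anchor in enumerate(audio):
        for j, positive in enumerate(visual):
            if labels[j] != labels[i]:
                continue
            for k, negative in enumerate(visual):
                if labels[k] == labels[i]:
                    continue
                total += triplet_loss(anchor, positive, negative, margin)
                count += 1
    return total / count if count else 0.0


# Hypothetical 2-D features for two classes, well separated and aligned.
audio = [[0.0, 0.0], [5.0, 5.0]]
visual = [[0.0, 0.0], [5.0, 5.0]]
labels = [0, 1]
aligned_loss = cross_modal_triplet_loss(audio, visual, labels)
```

A symmetric version would also use visual anchors with audio positives and negatives; the single direction here keeps the example short.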

    Universal EEG Encoder for Learning Diverse Intelligent Tasks

    Brain-Computer Interfaces (BCI) have become very popular, with Electroencephalography (EEG) being one of the most commonly used signal acquisition techniques. A major challenge in BCI studies is the individualized analysis required for each task. As a result, feature extraction and classification are performed in a task-specific way that fails to generalize to other tasks with similar time-series EEG input data. To address this, we design a GRU-based universal deep encoding architecture to extract meaningful features from publicly available datasets for five diverse EEG-based classification tasks. Our network can generate task- and format-independent data representations and outperforms the state-of-the-art EEGNet architecture on most experiments. We also compare our results with CNN-based and autoencoder networks, in turn performing local, spatial, temporal and unsupervised analysis on the data.
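
As a rough illustration of what a GRU-based encoder does with a time series, the sketch below runs the standard GRU update over a sequence with a single input channel and a single hidden unit. It is not the architecture from the paper. The parameter names and the scalar setting are assumptions, and the gate convention used is one of the two common ones.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, p):
    """One GRU update for a scalar input x and scalar hidden state h.

    p is a dict of scalar weights (hypothetical names): w* act on the
    input, u* on the previous hidden state, b* are biases.
    """
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    # Blend the old state and the candidate according to the update gate.
    return (1.0 - z) * h + z * h_cand


def encode(sequence, p, h0=0.0):
    """Run the GRU over the whole sequence and return the final hidden
    state, which serves as a fixed-size summary of the input."""
    h = h0
    for x in sequence:
        h = gru_step(x, h, p)
    return h


# Hypothetical parameters and a short single-channel signal.
params = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
          "wr": 0.3, "ur": 0.2, "br": 0.0,
          "wh": 1.0, "uh": 0.5, "bh": 0.0}
signal = [0.2, -0.1, 0.4, 0.0, 0.3]
feature = encode(signal, params)
```

A practical encoder would use vector-valued inputs and hidden states with learned weight matrices; the scalar case keeps the gating logic visible.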