
    Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders

    Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. The success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multi-resolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multi-resolution features for separating the singing voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing.
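    The abstract describes the approach only at a high level; as a rough, hedged sketch of the multi-resolution idea (not the authors' exact architecture), several 1-D convolution branches with different kernel lengths can be run over the raw stereo waveform and their outputs combined into a source estimate. All layer sizes, kernel lengths, and names below are assumptions for illustration.

        # Minimal PyTorch sketch of a multi-resolution convolutional auto-encoder
        # over raw stereo audio. Hyperparameters are illustrative assumptions,
        # not the configuration reported in the paper.
        import torch
        import torch.nn as nn

        class MultiResAutoEncoder(nn.Module):
            def __init__(self, channels=2, kernel_sizes=(64, 256, 1024), features=32):
                super().__init__()
                # One encoder branch per temporal resolution (kernel length).
                self.encoders = nn.ModuleList(
                    nn.Conv1d(channels, features, k, padding="same") for k in kernel_sizes
                )
                # A 1x1 convolution maps the concatenated features back to stereo audio.
                self.decoder = nn.Conv1d(features * len(kernel_sizes), channels, 1)

            def forward(self, x):  # x: (batch, 2, samples)
                z = torch.cat([torch.relu(enc(x)) for enc in self.encoders], dim=1)
                return self.decoder(z)  # estimated vocal stem, same shape as the input

        # Usage: estimate = MultiResAutoEncoder()(torch.randn(1, 2, 44100))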

    Pop Music Highlighter: Marking the Emotion Keypoints

    The goal of music highlight extraction is to get a short consecutive segment of a piece of music that provides an effective representation of the whole piece. In a previous work, we introduced an attention-based convolutional recurrent neural network that uses music emotion classification as a surrogate task for music highlight extraction, for Pop songs. The rationale behind that approach is that the highlight of a song is usually the most emotional part. This paper extends our previous work in the following two aspects. First, methodology-wise we experiment with a new architecture that does not need any recurrent layers, making the training process faster. Moreover, we compare a late-fusion variant and an early-fusion variant to study which one better exploits the attention mechanism. Second, we conduct and report an extensive set of experiments comparing the proposed attention-based methods against a heuristic energy-based method, a structural repetition-based method, and a few other simple feature-based methods for this task. Due to the lack of public-domain labeled data for highlight extraction, following our previous work we use the RWC POP 100-song data set to evaluate how the detected highlights overlap with any chorus sections of the songs. The experiments demonstrate the effectiveness of our methods over competing methods. For reproducibility, we open source the code and pre-trained model at https://github.com/remyhuang/pop-music-highlighter/.
    Comment: Transactions of the ISMIR, vol. 1, no.
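    For intuition only, the selection step implied by the abstract (the highlight lies where the attention mass concentrates) can be sketched as below. This is a hedged illustration; the function name, hop size, and highlight length are assumptions, and the released code at the URL above is the authoritative implementation.

        # Pick the contiguous window with the largest total attention as the highlight.
        import numpy as np

        def pick_highlight(attention, hop_seconds=1.0, highlight_seconds=30.0):
            # attention: 1-D array of per-chunk attention scores from the emotion model.
            win = int(highlight_seconds / hop_seconds)
            window_scores = np.convolve(attention, np.ones(win), mode="valid")  # sliding sums
            start = int(np.argmax(window_scores))
            return start * hop_seconds, (start + win) * hop_seconds  # (start, end) in seconds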

    How Low Can You Go? Reducing Frequency and Time Resolution in Current CNN Architectures for Music Auto-tagging

    Automatic tagging of music is an important research topic in Music Information Retrieval, and audio analysis algorithms proposed for this task have achieved improvements with advances in deep learning. In particular, many state-of-the-art systems use Convolutional Neural Networks and operate on mel-spectrogram representations of the audio. In this paper, we compare commonly used mel-spectrogram representations and evaluate the model performance that can be achieved by reducing the input size, in terms of both fewer frequency bands and lower time resolution. We use the MagnaTagATune dataset for comprehensive performance comparisons and then compare selected configurations on the larger Million Song Dataset. The results of this study can help researchers and practitioners make trade-off decisions between model accuracy, data storage size, and training and inference times.
    Comment: The 28th European Signal Processing Conference (EUSIPCO
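    As a minimal sketch of the two resolution knobs the paper varies (number of mel bands and time resolution), a reduced-resolution log-mel spectrogram can be computed with librosa as below; the specific sample rate, band count, and hop values are assumptions, not the configurations evaluated in the paper.

        # Compute a log-mel spectrogram with fewer bands and a larger hop.
        import librosa

        def reduced_melspectrogram(path, n_mels=48, hop_seconds=0.046, sr=16000):
            y, _ = librosa.load(path, sr=sr)
            hop = int(hop_seconds * sr)  # larger hop -> coarser time resolution
            mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop)
            return librosa.power_to_db(mel)  # input feature for the tagging CNN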

    Deep Learning for Real-Time Recognition of Instrumental Playing Techniques

    In recent years, deep learning has established itself as the new reference method for audio classification problems, and in particular for instrument recognition. However, these models generally do not address the classification of advanced playing techniques, a question that is nonetheless central to contemporary composition. The few existing studies are limited to an evaluation on a single sound bank, which gives no guarantee of generalization to real-world data. In this article, we extend state-of-the-art methods to the real-time classification of instrumental playing techniques from recordings of soloists. We show that a combination of convolutional (CNN) and recurrent (RNN) networks achieves excellent results on a homogeneous corpus drawn from 5 sound banks. However, their performance drops noticeably on a heterogeneous corpus, which may indicate a limited ability to generalize to real-world data. We propose directions for addressing this issue. Finally, we describe several possible uses of our models in interactive systems.
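    As a rough sketch of the CNN+RNN combination mentioned above (not the authors' exact model), a convolutional front-end over mel-spectrogram frames can feed a GRU so that a playing-technique prediction is available for every frame, which is what a real-time interactive system needs. All layer sizes and the class count below are assumptions.

        # Minimal PyTorch CRNN emitting per-frame playing-technique logits.
        import torch
        import torch.nn as nn

        class PlayingTechniqueCRNN(nn.Module):
            def __init__(self, n_mels=64, n_classes=10, hidden=128):
                super().__init__()
                self.cnn = nn.Sequential(
                    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 1)),
                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 1)),
                )  # pools the frequency axis only, so the frame axis is preserved
                self.rnn = nn.GRU(64 * (n_mels // 16), hidden, batch_first=True)
                self.head = nn.Linear(hidden, n_classes)

            def forward(self, spec):  # spec: (batch, 1, n_mels, frames)
                h = self.cnn(spec)                   # (batch, 64, n_mels // 16, frames)
                h = h.flatten(1, 2).transpose(1, 2)  # (batch, frames, features)
                out, _ = self.rnn(h)
                return self.head(out)                # per-frame class logits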