Learning to rank music tracks using triplet loss
Most music streaming services rely on automatic recommendation algorithms to
exploit their large music catalogs. These algorithms aim at retrieving a ranked
list of music tracks based on their similarity with a target music track. In
this work, we propose a method for direct recommendation based on the audio
content without explicitly tagging the music tracks. To that aim, we propose
several strategies to perform triplet mining from ranked lists. We train a
Convolutional Neural Network to learn the similarity via triplet loss. These
different strategies are compared and validated on a large-scale experiment
against an auto-tagging based approach. The results obtained highlight the
efficiency of our system, especially when associated with an Auto-pooling
layer.
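The abstract above does not spell out the loss or the mining strategies, so the following is only a minimal sketch of the general idea. It assumes a standard margin-based triplet loss on Euclidean distances and one hypothetical mining rule (the top-k entries of a ranked list serve as positives, the remaining entries as negatives). The names `triplet_loss`, `mine_triplets`, `margin` and `top_k` are illustrative and are not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def mine_triplets(anchor_id, ranked_ids, top_k=2):
    """Hypothetical mining rule (an assumption, not the paper's strategy):
    the first `top_k` tracks of the ranked list are treated as positives,
    and every lower-ranked track is treated as a negative.
    """
    positives = ranked_ids[:top_k]
    negatives = ranked_ids[top_k:]
    return [(anchor_id, p, n) for p in positives for n in negatives]


if __name__ == "__main__":
    # Toy 2-D embeddings; a real system would use CNN outputs instead.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
    print(mine_triplets("query", ["t1", "t2", "t3", "t4"], top_k=2))
```

In a trained system the three vectors would be the network's embeddings of the anchor, positive and negative tracks, and the loss would be averaged over a batch of mined triplets.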
A Feature Learning Siamese Model for Intelligent Control of the Dynamic Range Compressor
In this paper, a siamese DNN model is proposed to learn the characteristics
of the audio dynamic range compressor (DRC). This facilitates an intelligent
control system that uses audio examples to configure the DRC, a widely used
non-linear audio signal conditioning technique in the areas of music
production, speech communication and broadcasting. Several alternative siamese
DNN architectures are proposed to learn feature embeddings that can
characterise subtle effects due to dynamic range compression. These models are
compared with each other as well as handcrafted features proposed in previous
work. An evaluation of the relations between the DNN hyperparameters and the
DRC parameters is also provided. The best model is able to produce a universal
feature embedding that is capable of predicting multiple DRC parameters
simultaneously, which is a significant improvement over our previous research.
The feature embedding shows better performance than handcrafted audio features
when predicting DRC parameters for both mono-instrument audio loops and
polyphonic music pieces. Comment: 8 pages, accepted in IJCNN 201
How Low Can You Go? Reducing Frequency and Time Resolution in Current CNN Architectures for Music Auto-tagging
Automatic tagging of music is an important research topic in Music
Information Retrieval, and audio analysis algorithms proposed for this task have
achieved improvements with advances in deep learning. In particular, many
state-of-the-art systems use Convolutional Neural Networks and operate on
mel-spectrogram representations of the audio. In this paper, we compare
commonly used mel-spectrogram representations and evaluate the model performance
that can be achieved when the input size is reduced by using both fewer
frequency bands and larger frame rates. We use the MagnaTagaTune dataset for
comprehensive performance comparisons and then compare selected configurations
on the larger Million Song Dataset. The results of this study can help
researchers and practitioners trade off model accuracy, data storage size, and
training and inference times. Comment: The 28th European Signal Processing Conference (EUSIPCO
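To make the storage side of this trade-off concrete, here is a small arithmetic sketch of how the size of a mel-spectrogram input depends on the number of mel bands and on the time resolution. The framing convention (no padding, one frame per hop) and all numeric settings are assumptions chosen for illustration. They are not the configurations evaluated in the paper.

```python
def spectrogram_shape(n_samples, n_fft, hop_length, n_mels):
    """Return (n_mels, n_frames) for an unpadded short-time analysis.

    Assumed convention: the first window starts at sample 0 and each
    following window is shifted by `hop_length` samples.
    """
    n_frames = 1 + (n_samples - n_fft) // hop_length
    return n_mels, n_frames


def input_size(n_samples, n_fft, hop_length, n_mels):
    """Number of values in the mel-spectrogram fed to the network."""
    bands, frames = spectrogram_shape(n_samples, n_fft, hop_length, n_mels)
    return bands * frames


if __name__ == "__main__":
    one_second = 16000  # illustrative: 1 s of audio at an assumed 16 kHz
    full = input_size(one_second, n_fft=512, hop_length=256, n_mels=128)
    reduced = input_size(one_second, n_fft=512, hop_length=512, n_mels=64)
    print(full, reduced, full / reduced)
```

Under these assumed settings, halving the number of mel bands and doubling the hop length shrinks the input to roughly a quarter of its original size, which is the kind of saving that has to be weighed against any loss in tagging accuracy.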
Classifying Music Genres Using Image Classification Neural Networks
Domain-tailored Convolutional Neural Networks (CNNs) have been applied to music genre classification using spectrograms as a visual representation of audio. It is currently unclear whether domain-tailored CNN architectures are superior to network architectures used in the field of image classification. This question arises because image classification architectures have strongly influenced the design of domain-tailored network architectures. We examine whether CNN architectures transferred from image classification can achieve performance similar to that of domain-tailored CNN architectures used in genre classification. We compare domain-tailored and image classification networks by testing their performance on two different datasets: the frequently used benchmark dataset GTZAN and a newly created, much larger dataset. Our results show that the tested image classification network requires significantly fewer resources and outperforms the domain-specific network in our settings, with the advantage that no expert effort needs to be spent on designing the network.