Multi-Level and Multi-Scale Feature Aggregation Using Pre-trained Convolutional Neural Networks for Music Auto-tagging
Music auto-tagging is often handled in a similar manner to image
classification by regarding the 2D audio spectrogram as image data. However,
music auto-tagging is distinguished from image classification in that the tags are highly diverse and have different levels of abstraction. Considering this issue, we propose a convolutional neural network (CNN)-based architecture that embraces multi-level and multi-scale features. The architecture is trained in
three steps. First, we conduct supervised feature learning to capture local
audio features using a set of CNNs with different input sizes. Second, we
extract audio features from each layer of the pre-trained convolutional networks separately and aggregate them over a long audio clip. Finally, we feed them into fully-connected networks and make the final tag predictions. Our experiments show that the combination of multi-level and multi-scale features is highly effective for music auto-tagging and that the proposed method outperforms previous state-of-the-art results on the MagnaTagATune dataset and the Million Song Dataset. We further show that the proposed architecture is useful for transfer learning.
Comment: 5 pages, 5 figures, 2 tables
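
As a rough illustration of the multi-level aggregation step described above, the following PyTorch sketch pools features from every layer of a small CNN over fixed-length segments of a long clip and feeds the averaged, concatenated vector to a fully-connected tag classifier. The layer sizes, segment length, and 50-tag output are illustrative assumptions, and the multi-scale part (training several CNNs with different input sizes) is omitted; this is not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class SmallMelCNN(nn.Module):
        """Toy local feature extractor over mel-spectrogram segments."""
        def __init__(self):
            super().__init__()
            self.block1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.block3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))

        def forward(self, x):
            # Keep features from every level, not only the last one.
            h1 = self.block1(x)
            h2 = self.block2(h1)
            h3 = self.block3(h2)
            f1 = h1.mean(dim=(2, 3))                 # global-average-pool each level
            f2 = h2.mean(dim=(2, 3))
            f3 = h3.flatten(1)
            return torch.cat([f1, f2, f3], dim=1)    # multi-level feature (32 + 64 + 128)

    def clip_level_features(cnn, clip, segment_frames=128):
        """Aggregate segment-level features over a long clip by averaging."""
        segments = clip.split(segment_frames, dim=-1)
        feats = [cnn(s) for s in segments if s.shape[-1] == segment_frames]
        return torch.stack(feats).mean(dim=0)

    cnn = SmallMelCNN()
    classifier = nn.Sequential(nn.Linear(224, 512), nn.ReLU(), nn.Linear(512, 50), nn.Sigmoid())
    clip = torch.randn(1, 1, 96, 1024)               # (batch, channel, mel bins, frames)
    tags = classifier(clip_level_features(cnn, clip))
    print(tags.shape)                                # torch.Size([1, 50])
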
Raw Waveform-based Audio Classification Using Sample-level CNN Architectures
Music, speech, and acoustic scene sound are often handled separately in the
audio domain because of their different signal characteristics. However, just as the image domain has grown rapidly through versatile image classification models, it is necessary to study similarly extensible classification models in the audio domain. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity. One is a basic model that consists of
convolution and pooling layers. The other is an improved model that
additionally has residual connections, squeeze-and-excitation modules and
multi-level concatenation. We show that the sample-level models reach
state-of-the-art performance levels for the three different categories of
sound. We also visualize the filters along the layers and compare the characteristics of the learned filters.
Comment: NIPS, Machine Learning for Audio Signal Processing Workshop (ML4Audio), 201
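
A minimal PyTorch sketch of a sample-level waveform CNN in the spirit of the improved model above: small (length-3) filters applied directly to raw samples, with a squeeze-and-excitation gate after each block. The number of blocks, channel widths, and class count are illustrative assumptions, and the residual connections and multi-level concatenation are omitted for brevity.

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        """Squeeze-and-excitation: re-weight channels with a learned gate."""
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, x):                        # x: (batch, channels, time)
            w = self.gate(x.mean(dim=2))             # squeeze over time
            return x * w.unsqueeze(-1)               # excite

    def sample_level_block(in_ch, out_ch):
        """Convolution with sample-level granularity (filter length 3) plus pooling."""
        return nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch), nn.ReLU(), nn.MaxPool1d(3))

    class SampleCNN(nn.Module):
        def __init__(self, n_classes=10):
            super().__init__()
            self.stem = nn.Conv1d(1, 64, kernel_size=3, stride=3)   # frames of 3 samples
            self.blocks = nn.Sequential(*[
                nn.Sequential(sample_level_block(64, 64), SEBlock(64)) for _ in range(6)])
            self.head = nn.Linear(64, n_classes)

        def forward(self, wav):                      # wav: (batch, 1, samples)
            h = self.blocks(self.stem(wav))
            return self.head(h.mean(dim=2))

    model = SampleCNN()
    logits = model(torch.randn(2, 1, 3 ** 8))        # ~6561-sample waveform excerpt
    print(logits.shape)                              # torch.Size([2, 10])
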
Music Genre Classification with Paralleling Recurrent Convolutional Neural Network
Deep learning has demonstrated its effectiveness and efficiency in music genre classification. However, existing approaches still have several shortcomings that impair the performance of this classification task. In this paper, we propose a hybrid architecture consisting of parallel CNN and Bi-RNN blocks, which focus on extracting spatial features and temporal frame order, respectively. The two outputs are then fused into one powerful representation of the musical signal and fed into a softmax function for classification. The parallel design ensures that the extracted features are robust enough to represent music. Moreover, the experiments show that our proposed architecture improves music genre classification performance and that the additional Bi-RNN block complements the CNN.
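
A compact PyTorch sketch of the parallel-branch idea: a CNN branch for spatial (spectral) features and a bidirectional recurrent branch for temporal order, fused by concatenation before a softmax. A Bi-GRU stands in for the paper's Bi-RNN block, and all sizes are illustrative assumptions rather than the published architecture.

    import torch
    import torch.nn as nn

    class ParallelCRNN(nn.Module):
        """Parallel CNN (spectral patterns) and Bi-GRU (temporal order) branches
        whose outputs are concatenated before a softmax classifier."""
        def __init__(self, n_mels=128, n_genres=10):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
            self.rnn = nn.GRU(input_size=n_mels, hidden_size=64,
                              batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(64 + 2 * 64, n_genres)

        def forward(self, spec):                     # spec: (batch, 1, mels, frames)
            cnn_feat = self.cnn(spec).flatten(1)      # (batch, 64)
            frames = spec.squeeze(1).transpose(1, 2)  # (batch, frames, mels)
            _, h = self.rnn(frames)                   # h: (2, batch, 64)
            rnn_feat = torch.cat([h[0], h[1]], dim=1) # (batch, 128)
            fused = torch.cat([cnn_feat, rnn_feat], dim=1)
            return self.classifier(fused).softmax(dim=1)

    model = ParallelCRNN()
    probs = model(torch.randn(4, 1, 128, 640))
    print(probs.shape, probs.sum(dim=1))              # (4, 10), each row sums to ~1
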
Masked Conditional Neural Networks for Environmental Sound Classification
The ConditionaL Neural Network (CLNN) exploits the nature of the temporal
sequencing of the sound signal represented in a spectrogram, and its variant
the Masked ConditionaL Neural Network (MCLNN) induces the network to learn in
frequency bands by embedding a filterbank-like sparseness over the network's
links using a binary mask. Additionally, the masking automates the concurrent exploration of different feature combinations, analogous to handcrafting the optimum combination of features for a recognition task. We have evaluated the MCLNN's performance on the Urbansound8k dataset of environmental sounds.
Additionally, we present YorNoise, a collection of manually recorded rail and road traffic sounds, to investigate the confusion rates among machine-generated sounds possessing low-frequency components. On Urbansound8k, the MCLNN achieved competitive results without augmentation while using 12% of the trainable parameters of an equivalent model based on state-of-the-art Convolutional Neural Networks. We extended the Urbansound8k dataset with YorNoise, where experiments have shown that common tonal properties affect the classification performance.
Comment: Conditional Neural Networks, CLNN, Masked Conditional Neural Networks, MCLNN, Restricted Boltzmann Machine, RBM, Conditional Restricted Boltzmann Machine, CRBM, Deep Belief Nets, Environmental Sound Recognition, ESR, YorNoise
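
A small NumPy sketch of the filterbank-like binary mask: each hidden unit is linked only to a band of consecutive spectrogram bins, so a dense weight matrix behaves like a set of band-limited filters. The band placement rule, bandwidth, and overlap values here are illustrative assumptions; the conditional (neighbouring-frame) links of the CLNN/MCLNN are not shown.

    import numpy as np

    def band_mask(n_in, n_hidden, bandwidth, overlap):
        """Binary mask enforcing filterbank-like sparseness over a weight matrix:
        each hidden unit only connects to a band of consecutive input features."""
        mask = np.zeros((n_in, n_hidden))
        step = bandwidth - overlap                       # band shift per hidden unit
        for j in range(n_hidden):
            start = (j * step) % n_in
            idx = (start + np.arange(bandwidth)) % n_in  # wrap around the feature axis
            mask[idx, j] = 1.0
        return mask

    rng = np.random.default_rng(0)
    n_bins, n_hidden = 40, 20                            # e.g. 40 mel bins -> 20 hidden units
    W = rng.normal(size=(n_bins, n_hidden))              # dense weights
    M = band_mask(n_bins, n_hidden, bandwidth=5, overlap=3)
    frame = rng.normal(size=(1, n_bins))                 # one spectrogram frame
    hidden = np.tanh(frame @ (W * M))                    # masked links: learning in bands
    print(hidden.shape, int(M.sum()), "active links out of", W.size)
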
From Visual to Acoustic Question Answering
We introduce the new task of Acoustic Question Answering (AQA) to promote
research in acoustic reasoning. The AQA task consists of analyzing an acoustic scene composed of a combination of elementary sounds and answering questions that relate the positions and properties of these sounds. The relational questions asked require the models to perform non-trivial reasoning in order to answer correctly. Although similar problems have been extensively studied in
the domain of visual reasoning, we are not aware of any previous studies
addressing the problem in the acoustic domain. We propose a method for
generating the acoustic scenes from elementary sounds and a number of relevant
questions for each scene using templates. We also present preliminary results obtained with two models (FiLM and MAC) that have been shown to work for visual reasoning.
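
For intuition only, the following NumPy sketch composes a toy acoustic scene from elementary sounds and instantiates template questions about their positions; the sounds, templates, and answers are hypothetical stand-ins, not the generator used in the paper.

    import numpy as np

    rng = np.random.default_rng(7)
    SR = 16000

    def tone(freq, dur=0.5):
        """Hypothetical 'elementary sound': a short sine tone."""
        t = np.arange(int(SR * dur)) / SR
        return 0.5 * np.sin(2 * np.pi * freq * t)

    # Build an acoustic scene by concatenating elementary sounds in a random order.
    palette = {"low": tone(220.0), "mid": tone(440.0), "high": tone(880.0)}
    order = rng.permutation(list(palette))
    scene = np.concatenate([palette[name] for name in order])

    # Template-based relational questions about the position of each sound.
    templates = [
        ("What is the position of the {} sound?",
         lambda n: int(np.where(order == n)[0][0]) + 1),
        ("Is the {} sound before the last one?",
         lambda n: bool(np.where(order == n)[0][0] < len(order) - 1)),
    ]
    for name in order:
        for text, answer in templates:
            print(text.format(name), "->", answer(name))
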
Automatic Classification of Music Genre using Masked Conditional Neural Networks
Neural network based architectures used for sound recognition are usually
adapted from other application domains such as image recognition, which may not
harness the time-frequency representation of a signal. The ConditionaL Neural
Networks (CLNN) and its extension the Masked ConditionaL Neural Networks
(MCLNN) are designed for multidimensional temporal signal recognition. The CLNN
is trained over a window of frames to preserve the inter-frame relation, and
the MCLNN enforces a systematic sparseness over the network's links that mimics
a filterbank-like behavior. The masking operation induces the network to learn
in frequency bands, which decreases the network susceptibility to
frequency-shifts in time-frequency representations. Additionally, the mask allows the concurrent exploration of a range of feature combinations, analogous to the manual handcrafting of the optimum collection of features for a recognition task. The MCLNN has achieved competitive performance on the Ballroom music dataset compared to several hand-crafted attempts and has outperformed models based on state-of-the-art Convolutional Neural Networks.
Comment: Restricted Boltzmann Machine; RBM; Conditional RBM; CRBM; Deep Belief Net; DBN; Conditional Neural Network; CLNN; Masked Conditional Neural Network; MCLNN; Music Information Retrieval; MIR. IEEE International Conference on Data Mining (ICDM), 201
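
A NumPy sketch of the conditional, window-of-frames computation that the CLNN builds on: the activation at frame t depends on a window of neighbouring frames, each with its own weight matrix (the MCLNN additionally applies a binary band mask to each of these matrices, as sketched earlier). The window order, layer sizes, and tanh nonlinearity are illustrative assumptions.

    import numpy as np

    def clnn_step(frames, t, weights, bias, order=1):
        """ConditionaL Neural Network step: the activation at frame t is conditioned
        on a window of 2*order+1 neighbouring frames, one weight matrix per offset."""
        acc = bias.copy()
        for u in range(-order, order + 1):
            acc += frames[t + u] @ weights[u + order]
        return np.tanh(acc)

    rng = np.random.default_rng(1)
    n_bins, n_hidden, order = 40, 16, 1
    frames = rng.normal(size=(100, n_bins))            # spectrogram: 100 frames x 40 bins
    weights = rng.normal(size=(2 * order + 1, n_bins, n_hidden)) * 0.1
    bias = np.zeros(n_hidden)
    h = np.stack([clnn_step(frames, t, weights, bias, order)
                  for t in range(order, len(frames) - order)])
    print(h.shape)                                     # (98, 16): inter-frame relations preserved
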
Generating Music Medleys via Playing Music Puzzle Games
Generating music medleys is about finding an optimal permutation of a given
set of music clips. Toward this goal, we propose a self-supervised learning
task, called the music puzzle game, to train neural network models to learn the
sequential patterns in music. In essence, such a game requires machines to
correctly sort a few multisecond music fragments. In the training stage, we
learn the model by sampling multiple non-overlapping fragment pairs from the
same songs and seeking to predict whether a given pair is consecutive and is in
the correct chronological order. For testing, we design a number of puzzle
games with different difficulty levels, the most difficult one being music
medley, which requires sorting fragments from different songs. On the basis of a state-of-the-art Siamese convolutional network, we propose an improved architecture that learns to embed frame-level similarity scores computed from the input fragment pairs into a common space, where fragment pairs in the correct order can be more easily identified. Our results show that the resulting model, dubbed the similarity embedding network (SEN), performs better than competing models across different games, including music jigsaw puzzle, music sequencing, and music medley. Example results can be found at our project website, https://remyhuang.github.io/DJnet.
Comment: Accepted at AAAI 201
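
A NumPy sketch of how training pairs for the music puzzle game could be sampled under the self-supervised labelling described above: a pair is positive only if the two non-overlapping fragments are consecutive and in the correct order. The fragment lengths are illustrative assumptions, and the Siamese/SEN model that scores the pairs is not reproduced.

    import numpy as np

    rng = np.random.default_rng(3)

    def sample_pair(song, frag_len, positive=True):
        """Label a fragment pair: positive iff the two non-overlapping fragments
        are consecutive and in the correct chronological order."""
        n_frags = len(song) // frag_len
        frags = song[: n_frags * frag_len].reshape(n_frags, frag_len)
        if positive:
            i = rng.integers(0, n_frags - 1)
            a, b = frags[i], frags[i + 1]              # consecutive, correct order
        else:
            i, j = rng.choice(n_frags, size=2, replace=False)
            if j == i + 1:                             # avoid an accidental positive
                i, j = j, i
            a, b = frags[i], frags[j]
        return a, b, int(positive)

    song = rng.normal(size=16000 * 30)                 # 30 s of fake audio at 16 kHz
    for positive in (True, False):
        a, b, label = sample_pair(song, frag_len=16000 * 5, positive=positive)
        print(a.shape, b.shape, "label:", label)
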
The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification
Convolutional Neural Networks (CNNs) have had great success in many machine
vision as well as machine audition tasks. Many image recognition network
architectures have consequently been adapted for audio processing tasks.
However, despite some successes, the performance of many of these architectures did not translate from the image to the audio domain. For example, very deep
architectures such as ResNet and DenseNet, which significantly outperform VGG
in image recognition, do not perform better in audio processing tasks such as
Acoustic Scene Classification (ASC). In this paper, we investigate the reasons
why such powerful architectures perform worse in ASC compared to simpler models
(e.g., VGG). To this end, we analyse the receptive field (RF) of these CNNs and
demonstrate the importance of the RF to the generalization capability of the
models. Using our receptive field analysis, we adapt both ResNet and DenseNet,
achieving state-of-the-art performance and eventually outperforming the
VGG-based models. We introduce systematic ways of adapting the RF in CNNs, and
present results on three data sets that show how changing the RF over the time
and frequency dimensions affects a model's performance. Our experimental
results show that very small or very large RFs can cause performance
degradation, but deep models can be made to generalize well by carefully
choosing an appropriate RF size within a certain range.
Comment: IEEE EUSIPCO 201
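
The receptive field of a stack of convolution and pooling layers can be computed analytically, which is the quantity the adaptation above is based on. A short Python sketch follows; the layer configurations are illustrative, not the exact networks compared in the paper.

    def receptive_field(layers):
        """Analytical receptive field of a stack of conv/pool layers,
        given as (kernel_size, stride) pairs along one axis (time or frequency)."""
        rf, jump = 1, 1
        for kernel, stride in layers:
            rf += (kernel - 1) * jump
            jump *= stride
        return rf

    # A VGG-like stack vs. a deeper ResNet-like stack (values are illustrative).
    vgg_like = [(3, 1), (3, 1), (2, 2)] * 4
    resnet_like = [(7, 2), (3, 2)] + [(3, 1), (3, 1)] * 16
    print("VGG-like RF:   ", receptive_field(vgg_like))
    print("ResNet-like RF:", receptive_field(resnet_like))
    # Shrinking kernels or strides in one dimension (e.g. frequency) shrinks the RF
    # in that dimension only, which is the kind of knob the paper investigates.
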
Content-Based Music Recommendation using Deep Learning
Music streaming services use recommendation systems to improve the customer experience by generating favorable playlists and by fostering the discovery of new music. State-of-the-art recommendation systems use both collaborative filtering and content-based recommendation methods. Collaborative filtering suffers from the cold-start problem; it can only make recommendations for music for which it has enough user data, so content-based methods are preferred. Most current content-based recommendation systems use convolutional neural networks on the spectrograms of track audio, with architectures commonly borrowed directly from the field of computer vision. This study shows that musically-motivated convolutional neural network architectures outperform architectures that are highly optimized for image-related tasks. A content-based recommendation model is built using musically-motivated deep learning architectures. The model is shown to map an artist onto an artist embedding space in which its nearest neighbors by cosine similarity are related artists, yielding good recommendations. It is also shown that metadata, such as lyrics, artist origin, and year, significantly improve these mappings when combined with raw audio data.
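
A minimal NumPy sketch of the nearest-neighbour step described above: given per-artist embedding vectors, related artists are retrieved by cosine similarity. The embedding dimension, catalogue, and query construction are illustrative assumptions; the musically-motivated CNN that produces the embeddings is not reproduced.

    import numpy as np

    def cosine_neighbors(query, embeddings, names, k=3):
        """Nearest artists to a query vector in an embedding space, by cosine similarity."""
        q = query / np.linalg.norm(query)
        e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        scores = e @ q
        top = np.argsort(-scores)[:k]
        return [(names[i], float(scores[i])) for i in top]

    rng = np.random.default_rng(5)
    names = [f"artist_{i}" for i in range(100)]        # placeholder artist catalogue
    embeddings = rng.normal(size=(100, 64))            # e.g. per-artist averages of CNN outputs
    query = embeddings[0] + 0.1 * rng.normal(size=64)  # a new track mapped into the space
    print(cosine_neighbors(query, embeddings, names))
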
Acoustic Scene Classification Using Bilinear Pooling on Time-liked and Frequency-liked Convolution Neural Network
The current methodology for tackling the Acoustic Scene Classification (ASC) task can be described in two steps: preprocessing the audio waveform into a log-mel spectrogram and then using it as the input representation for a Convolutional Neural Network (CNN). This paradigm shift occurred after DCASE 2016, where this framework achieved state-of-the-art ASC results on the ESC-50 dataset with an accuracy of 64.5%, a 20.5% improvement over the baseline model, and on the DCASE 2016 dataset with accuracies of 90.0% (development) and 86.2% (evaluation), improvements of 6.4% and 9% with respect to the baseline system. In this paper, we explore the use of harmonic-percussive source separation (HPSS), which has gained popularity in the field of music information retrieval (MIR), to split the audio into harmonic and percussive components. Although prior work has used HPSS as the input representation for CNN models in the ASC task, this paper further investigates leveraging the separated harmonic and percussive components by curating two CNNs that learn from the harmonic and percussive audio in their natural form: one specialized in extracting deep features in a time-biased domain and the other in a frequency-biased domain. The deep features extracted from these two CNNs are then combined using bilinear pooling, resulting in a two-stream time and frequency CNN architecture for classifying acoustic scenes. The model is evaluated on the DCASE 2019 subtask 1a dataset and scored an average of 65% on the development dataset and the Kaggle private and public leaderboards.
Comment: inclusion in conference proceedings 2019 IEEE Symposium Series on Computational Intelligence (IEEE SSCI 2019), Xiame
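
A NumPy sketch of the bilinear pooling step that joins the two streams: per-location outer products of the time-biased and frequency-biased feature maps are averaged, then signed-square-root and L2 normalised (a common practice for bilinear features; the paper's exact normalisation is not specified here). The channel counts and number of time-frequency locations are illustrative assumptions.

    import numpy as np

    def bilinear_pool(feat_a, feat_b):
        """Bilinear pooling of two feature maps sharing spatial locations:
        outer product per location, averaged, then signed-sqrt and L2 normalised."""
        assert feat_a.shape[1] == feat_b.shape[1]      # same number of locations
        pooled = feat_a @ feat_b.T / feat_a.shape[1]   # (C_a, C_b) second-order statistics
        vec = pooled.reshape(-1)
        vec = np.sign(vec) * np.sqrt(np.abs(vec))      # signed square-root
        return vec / (np.linalg.norm(vec) + 1e-12)     # L2 normalisation

    rng = np.random.default_rng(9)
    locations = 32 * 8                                 # flattened time-frequency positions
    time_stream = rng.normal(size=(128, locations))    # features from the time-biased CNN
    freq_stream = rng.normal(size=(64, locations))     # features from the frequency-biased CNN
    descriptor = bilinear_pool(time_stream, freq_stream)
    print(descriptor.shape)                            # (8192,) joint descriptor for the classifier
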