Graph Representation Learning for Audio & Music Genre Classification
Music genre is arguably one of the most important and discriminative
attributes of music and audio content. Visual-representation-based approaches
operating on spectrograms have been explored for music genre classification.
However, a lack of quality data and augmentation techniques makes it difficult
to apply deep learning successfully. We discuss the application of graph
neural networks (GNNs) to this task, motivated by their strong inductive bias,
and show that a combination of a CNN and a GNN achieves state-of-the-art
results on the GTZAN and AudioSet (Imbalanced Music) datasets. We also discuss
the role of Siamese neural networks as an analogue of GNNs for learning edge
similarity weights. Furthermore, we perform a visual analysis to understand
the field of view of our model into the spectrogram, based on genre labels.
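The CNN+GNN combination described above can be illustrated with a minimal
numpy sketch: spectrogram patches (stand-ins for CNN feature vectors) become
graph nodes, edges come from pairwise similarity, and one graph-convolution
step propagates information between patches. All shapes, the cosine-similarity
edge rule, and the random features are assumptions for illustration; the
paper's actual architecture may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8 spectrogram patches, each already encoded by a CNN
# into a 16-dim feature vector (random placeholders here).
num_patches, feat_dim = 8, 16
H = rng.standard_normal((num_patches, feat_dim))

# Edge weights from pairwise cosine similarity; a Siamese network could be
# trained to produce these similarity weights instead, as the abstract notes.
unit = H / np.linalg.norm(H, axis=1, keepdims=True)
S = unit @ unit.T
A = (S > 0.0).astype(float)   # keep positively-similar pairs as edges
np.fill_diagonal(A, 1.0)      # add self-loops

# One graph-convolution step: ReLU(D^{-1/2} A D^{-1/2} H W)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
W = rng.standard_normal((feat_dim, feat_dim)) * 0.1
H_next = np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ H @ W, 0.0)

print(H_next.shape)  # (8, 16)
```

Each node's updated feature is now a weighted mix of its neighbors' CNN
features, which is the inductive bias the abstract attributes to GNNs.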
Rethinking CNN Models for Audio Classification
In this paper, we show that ImageNet-pretrained standard deep CNN models can
serve as strong baseline networks for audio classification. Even though audio
spectrograms differ significantly from standard ImageNet image samples, the
assumptions of transfer learning still hold firmly. To understand what enables
ImageNet-pretrained models to learn useful audio representations, we
systematically study how much of the pretrained weights are useful for
learning spectrograms. We show (1) that, for a given standard model, using
pretrained weights is better than using randomly initialized weights, and
(2) qualitative results of what the CNNs learn from spectrograms, obtained by
visualizing the gradients. In addition, we show that even when the pretrained
weights are used for initialization, performance varies across multiple runs
of the same model. This variance stems from the random initialization of the
linear classification layer and from random mini-batch orderings across runs,
and it brings significant diversity that can be used to build stronger
ensemble models with an overall improvement in accuracy. An ensemble of
ImageNet-pretrained DenseNets achieves 92.89% validation accuracy on the
ESC-50 dataset and 87.42% validation accuracy on the UrbanSound8K dataset,
the current state-of-the-art on both.

Comment: 8 pages, 3 figures
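The ensembling idea the abstract describes — exploiting run-to-run variance
from head initialization and mini-batch order — can be sketched by averaging
per-run class probabilities. This is a minimal illustration under assumed
shapes and random logits; the paper does not specify the exact combination
rule, so probability averaging here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 3 independently trained runs of the same pretrained
# model, each producing logits for 5 clips over 4 classes. Only the linear
# head init and mini-batch order differ between runs.
num_runs, num_clips, num_classes = 3, 5, 4
logits = rng.standard_normal((num_runs, num_clips, num_classes))

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

probs = softmax(logits)         # per-run class probabilities
ensemble = probs.mean(axis=0)   # average predictions across runs
preds = ensemble.argmax(axis=1) # final ensemble labels

print(preds.shape)  # (5,)
```

Averaging de-correlates the per-run errors introduced by random head
initialization, which is why the diverse runs combine into a stronger model.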