Sample-level CNN Architectures for Music Auto-tagging Using Raw Waveforms
Recent work has shown that the end-to-end approach using convolutional neural
network (CNN) is effective in various types of machine learning tasks. For
audio signals, the approach takes raw waveforms as input using a 1-D
convolution layer. In this paper, we improve the 1-D CNN architecture for music
auto-tagging by adopting building blocks from state-of-the-art image
classification models, ResNets and SENets, and adding multi-level feature
aggregation to it. We compare different combinations of the modules in building
CNN architectures. The results show that they achieve significant improvements
over previous state-of-the-art models on the MagnaTagATune dataset and
comparable results on Million Song Dataset. Furthermore, we analyze and
visualize our model to show how the 1-D CNN operates.
Comment: Accepted for publication at ICASSP 201
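As a rough illustration of two ingredients named in this abstract, the sketch below shows a strided 1-D convolution applied directly to waveform samples and a squeeze-and-excitation (SE) style channel gating step. It is a minimal sketch, not the authors' architecture: the function names, the omission of bias terms, and the weight shapes are assumptions made for the example.

```python
import math

def conv1d(signal, kernel, stride=1):
    """Valid (no padding) 1-D convolution over raw samples (cross-correlation form)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]

def squeeze_excite_1d(features, w1, w2):
    """SE-style gating on 1-D feature maps.

    features: list of channels, each a list of time steps.
    w1: hidden x channels weight matrix (assumed shape, biases omitted).
    w2: channels x hidden weight matrix (assumed shape, biases omitted).
    """
    # Squeeze: global average pooling over the time axis, one value per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every time step of a channel by that channel's gate.
    return [[g * x for x in ch] for g, ch in zip(gates, features)]
```

With all-zero weights every gate equals sigmoid(0) = 0.5, so each channel is simply halved; trained weights would instead emphasise some channels over others.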
Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio
Despite significant advancements in deep learning for vision and natural
language, unsupervised domain adaptation in audio remains relatively
unexplored. We, in part, attribute this to the lack of an appropriate benchmark
dataset. To address this gap, we present Synthia's melody, a novel audio data
generation framework capable of simulating an infinite variety of 4-second
melodies with user-specified confounding structures characterised by musical
keys, timbre, and loudness. Unlike existing datasets collected under
observational settings, Synthia's melody is free of unobserved biases, ensuring
the reproducibility and comparability of experiments. To showcase its utility,
we generate two types of distribution shifts, domain shift and sample selection
bias, and evaluate the performance of acoustic deep learning models under these
shifts. Our evaluations reveal that Synthia's melody provides a robust testbed
for examining the susceptibility of these models to varying levels of
distribution shift.
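To make the idea of a controllable melody generator concrete, here is a toy sketch that synthesises a 4-second melody whose key, timbre, and loudness are explicit parameters. It is not the Synthia's melody framework or its API: the function name, the major-scale table, the two timbre options, and the 8000 Hz sample rate are all assumptions chosen for illustration.

```python
import math

# Semitone offsets of the major scale degrees (assumed scale for this sketch).
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]

def synth_melody(root_hz, degrees, timbre="sine", loudness=1.0,
                 sample_rate=8000, duration=4.0):
    """Return a list of samples for a melody of equal-length notes.

    root_hz sets the key, timbre selects the waveform shape ("sine" or
    "square"), and loudness is a linear amplitude factor.
    """
    n_total = int(sample_rate * duration)
    n_note = n_total // len(degrees)  # leftover samples are dropped
    samples = []
    for d in degrees:
        # Equal-tempered frequency of the scale degree above the root.
        freq = root_hz * 2 ** (MAJOR_STEPS[d % 7] / 12)
        for i in range(n_note):
            s = math.sin(2 * math.pi * freq * i / sample_rate)
            if timbre == "square":
                s = 1.0 if s >= 0 else -1.0
            samples.append(loudness * s)
    return samples
```

Because every factor is set by the caller, a shift between training and test data can be produced deliberately, for example by generating one set with `timbre="sine"` and the other with `timbre="square"`.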
Low-Resource Music Genre Classification with Advanced Neural Model Reprogramming
Transfer learning (TL) approaches have shown promising results when handling
tasks with limited training data. However, considerable memory and
computational resources are often required for fine-tuning pre-trained neural
networks with target domain data. In this work, we introduce a novel method for
leveraging pre-trained models for low-resource (music) classification based on
the concept of Neural Model Reprogramming (NMR). NMR aims at re-purposing a
pre-trained model from a source domain to a target domain by modifying the
input of a frozen pre-trained model. In addition to the known,
input-independent, reprogramming method, we propose an advanced reprogramming
paradigm: Input-dependent NMR, to increase adaptability to complex input data
such as musical audio. Experimental results suggest that a neural model
pre-trained on large-scale datasets can successfully perform music genre
classification by using this reprogramming method. The two proposed
Input-dependent NMR TL methods outperform fine-tuning-based TL methods on a
small genre classification dataset.
Comment: Submitted to ICASSP 2023. Some experimental results were reduced due
to the space limit. The implementation will be available at
https://github.com/biboamy/music-repr
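The distinction between input-independent and input-dependent reprogramming can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the frozen model is a hypothetical fixed linear scorer, and both reprogramming functions use a simple additive form chosen for illustration. Only `delta` and `scale` would be trained; the model weights stay fixed.

```python
def frozen_model(x):
    """Hypothetical stand-in for a pre-trained network; its weights never change."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * v for w, v in zip(weights, x))

def reprogram_independent(x, delta):
    """Input-independent: the same trainable offset delta is added to every input."""
    return [v + d for v, d in zip(x, delta)]

def reprogram_dependent(x, scale, delta):
    """Input-dependent: the modification is computed from the input itself.

    Here a toy elementwise map (scale * x + delta) is added to x; a small
    trainable network would play this role in practice (assumption).
    """
    return [v + s * v + d for v, s, d in zip(x, scale, delta)]
```

In both cases the target-domain input is transformed first and then passed to `frozen_model`, so adaptation happens entirely on the input side.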
Automated Detection of COVID-19 Cough Sound using Mel-Spectrogram Images and Convolutional Neural Network
COVID-19 is a new disease caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The initial symptoms of the disease commonly include fever (83-98%), fatigue or myalgia, dry cough (76-82%), and shortness of breath (31-55%). Given the prevalence of coughing as a symptom, artificial intelligence has been employed to detect COVID-19 based on cough sounds. This study aims to compare the performance of six different Convolutional Neural Network (CNN) models (VGG-16, VGG-19, LeNet-5, AlexNet, ResNet-50, and ResNet-152) in detecting COVID-19 using mel-spectrogram images derived from cough sounds. The training and validation of these CNN models were conducted using the Virufy dataset, consisting of 121 cough audio recordings with a sample rate of 48,000 Hz and a duration of 1 second for all audio data. Audio data was processed to generate mel-spectrogram images, which were subsequently employed as inputs for the CNN models. This study used accuracy, area under the curve (AUC), precision, recall, and F1 score as evaluation metrics. The AlexNet model, utilizing an input size of 227×227, exhibited the best performance with the highest AUC value of 0.930. This study provides compelling evidence of the efficacy of CNN models in detecting COVID-19 based on cough sounds through mel-spectrogram images. Furthermore, the study underscores the impact of input size on model performance. This research contributes to identifying the CNN model that demonstrates the best performance in COVID-19 detection based on cough sounds. By exploring the effectiveness of CNN models with different mel-spectrogram image sizes, this study offers novel insights into an optimal and fast audio-based method for early detection of COVID-19. Additionally, this study establishes the fundamental groundwork for selecting an appropriate CNN methodology for early detection of COVID-19.
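The evaluation metrics listed in this abstract can be computed as in the sketch below. It uses the standard definitions for binary labels (1 = positive, 0 = negative) and a pairwise-ranking formulation of AUC; the function names and the toy inputs are assumptions for illustration and are not taken from the study.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair. Requires at least one positive
    and one negative example.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy, precision, recall, and F1 depend on a fixed decision threshold, whereas AUC is computed from the raw scores and summarises ranking quality across all thresholds.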
Improving Robustness of Deep Convolutional Neural Networks via Multiresolution Learning
The current learning process of deep learning, regardless of any deep neural
network (DNN) architecture and/or learning algorithm used, is essentially a
single resolution training. We explore multiresolution learning and show that
multiresolution learning can significantly improve robustness of DNN models for
both 1D signal and 2D signal (image) prediction problems. We demonstrate this
improvement in terms of both noise and adversarial robustness as well as with
small training dataset size. Our results also suggest that it may not be
necessary to trade standard accuracy for robustness with multiresolution
learning, which is, interestingly, contrary to the observation obtained from
the traditional single resolution learning setting.
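One way to picture multiresolution training data for a 1-D signal is a stack of progressively coarser views of the same input. The sketch below builds such views by averaging neighbouring samples. The abstract does not specify how resolutions are constructed, so the pairwise-averaging scheme, the coarsest-first ordering, and the function names are assumptions for illustration only.

```python
def coarsen(signal):
    """Halve the resolution by averaging adjacent sample pairs."""
    return [(signal[i] + signal[i + 1]) / 2
            for i in range(0, len(signal) - 1, 2)]

def multiresolution_views(signal, levels):
    """Return views of the signal ordered from coarsest to finest."""
    views = [signal]
    for _ in range(levels):
        views.append(coarsen(views[-1]))
    return views[::-1]
```

A training loop under this assumption would present the coarsest view first and move toward the original signal, in contrast to single resolution training, which only ever sees the last view.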