Sample Mixed-Based Data Augmentation for Domestic Audio Tagging
Audio tagging has attracted increasing attention over the last decade and has various potential applications in many fields. The objective of audio tagging is to predict the labels of an audio clip. Recently, deep learning methods have been applied to audio tagging and have achieved state-of-the-art performance. However, due to the limited size of audio tagging datasets such as the DCASE data, the trained models tend to overfit, which results in poor generalization on new data. Previous data augmentation methods such as pitch shifting, time stretching and adding background noise do not show much improvement in audio tagging. In this paper, we explore sample-mixed data augmentation for the domestic audio tagging task, including mixup, SamplePairing and extrapolation. We apply a convolutional recurrent neural network (CRNN) with an attention module, taking log-scaled mel spectrograms as input, as the baseline system. In our experiments, we achieve a state-of-the-art equal error rate (EER) of 0.10 on the DCASE 2016 Task 4 dataset with the mixup approach, outperforming the baseline system without data augmentation.
Comment: submitted to the workshop on Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018), 19-20 November 2018, Surrey, UK
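The sample-mixed augmentations named above (mixup in particular) are straightforward to reproduce. The sketch below is a minimal, hypothetical illustration of mixup applied to batches of log-mel spectrogram features with multi-hot tag labels; the array names, batch shapes and Beta parameter are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mixup_batch(features, labels, alpha=0.2, rng=None):
    """Mixup augmentation: blend random pairs of examples and their labels.

    features: (batch, time, mel_bins) log-mel spectrograms
    labels:   (batch, num_tags) multi-hot tag vectors
    alpha:    Beta-distribution parameter controlling mixing strength
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient in (0, 1)
    perm = rng.permutation(len(features))   # random partner for each example
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_x, mixed_y

# Toy usage with random data standing in for real audio clips
x = np.random.randn(8, 240, 64).astype(np.float32)           # 8 clips, 240 frames, 64 mel bins
y = np.random.randint(0, 2, size=(8, 7)).astype(np.float32)  # 7 domestic audio tags
x_aug, y_aug = mixup_batch(x, y, alpha=0.2)
```

SamplePairing and extrapolation can be obtained from the same routine by fixing or extending the mixing coefficient (e.g. averaging inputs while keeping one label, or pushing lam outside [0, 1]).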
Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification
In this paper, we propose a domain adaptation framework to address the device mismatch issue in acoustic scene classification, leveraging neural label embedding (NLE) and relational teacher-student learning (RTSL). By taking into account the structural relationships between acoustic scene classes, our proposed framework captures relationships that are intrinsically device-independent. In the training stage, transferable knowledge is condensed into the NLE from the source domain. In the adaptation stage, a novel RTSL strategy is then adopted to learn adapted target models without the paired source-target data often required in conventional teacher-student learning. The proposed framework is evaluated on the DCASE 2018 Task 1b dataset. Experimental results based on AlexNet-L deep classification models confirm the effectiveness of our proposed approach in mismatch situations. NLE-alone adaptation compares favourably with conventional device adaptation and teacher-student based adaptation techniques. NLE with RTSL further improves the classification accuracy.
Comment: Accepted by Interspeech 2020
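The abstract does not spell out the NLE and RTSL objectives, so the following is only a hypothetical sketch of how such terms could look: an embedding-matching term that pulls a target-domain clip's embedding toward its class's source-trained label embedding, plus a relational term that matches pairwise distances, so no paired source-target clips are needed. All names and the exact loss forms are assumptions, not the authors' implementation.

```python
import numpy as np

def nle_relational_losses(student_emb, class_ids, label_emb):
    """Hypothetical sketch of NLE- and relation-based adaptation terms.

    student_emb: (batch, dim) embeddings of target-domain clips
    class_ids:   (batch,) integer scene-class labels for each clip
    label_emb:   (num_classes, dim) source-domain neural label embeddings
    """
    targets = label_emb[class_ids]  # per-clip label embedding
    nle_loss = np.mean(np.sum((student_emb - targets) ** 2, axis=1))

    # Relational term: pairwise distances among student embeddings should
    # mirror pairwise distances among the corresponding label embeddings.
    def pdist(x):
        diff = x[:, None, :] - x[None, :, :]
        return np.sqrt(np.sum(diff ** 2, axis=-1) + 1e-12)

    rel_loss = np.mean((pdist(student_emb) - pdist(targets)) ** 2)
    return nle_loss, rel_loss

# Toy usage with random numbers standing in for real model outputs
emb = np.random.randn(16, 128)
ids = np.random.randint(0, 10, size=16)
nle = np.random.randn(10, 128)
print(nle_relational_losses(emb, ids, nle))
```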
A Two-Stage Approach to Device-Robust Acoustic Scene Classification
To improve device robustness, a key feature of any competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages an ad-hoc score combination of two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains state-of-the-art accuracy on the development set, where our best system, a two-stage fusion of CNN ensembles, delivers an 81.9% average accuracy on the multi-device test data and obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights into the patterns learnt by our models.
Comment: Submitted to ICASSP 2021. Code available: https://github.com/MihawkHu/DCASE2020_task
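The abstract describes an ad-hoc score combination of a three-class (broad) and a ten-class (fine) CNN. Below is a minimal sketch of one plausible fusion rule, weighting each fine-class posterior by the posterior of its parent broad class and renormalising; the class grouping, the fusion rule and all names are assumptions, since the abstract does not specify them (the linked repository contains the authors' actual implementation).

```python
import numpy as np

# Assumed grouping of the ten DCASE 2020 Task 1a scene classes into three broad classes
FINE_CLASSES = ["airport", "shopping_mall", "metro_station",
                "street_pedestrian", "public_square", "street_traffic", "park",
                "tram", "bus", "metro"]
COARSE_OF = {"airport": "indoor", "shopping_mall": "indoor", "metro_station": "indoor",
             "street_pedestrian": "outdoor", "public_square": "outdoor",
             "street_traffic": "outdoor", "park": "outdoor",
             "tram": "transportation", "bus": "transportation", "metro": "transportation"}
COARSE_CLASSES = ["indoor", "outdoor", "transportation"]

def two_stage_scores(coarse_probs, fine_probs):
    """Weight each fine-class probability by its parent broad-class probability,
    then renormalise to obtain the combined scores."""
    weights = np.array([coarse_probs[COARSE_CLASSES.index(COARSE_OF[c])]
                        for c in FINE_CLASSES])
    combined = fine_probs * weights
    return combined / combined.sum()

# Toy usage with made-up posteriors from the two CNNs
coarse = np.array([0.2, 0.7, 0.1])                    # indoor / outdoor / transportation
fine = np.full(len(FINE_CLASSES), 1.0 / len(FINE_CLASSES))
scores = two_stage_scores(coarse, fine)
print(FINE_CLASSES[int(np.argmax(scores))])
```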