Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network
This paper proposes to use low-level spatial features extracted from
multichannel audio for sound event detection. We extend the convolutional
recurrent neural network to handle more than one type of these multichannel
features by learning from each of them separately in the initial stages. We
show that the network learns sound events in multichannel audio better when
the features of each channel are presented as separate layers of a volume
rather than concatenated into a single feature vector. Using the proposed
spatial features over monaural features on the same network gives an absolute
F-score improvement of 6.1% on the publicly available TUT-SED 2016 dataset and
2.7% on the TUT-SED 2009 dataset that is fifteen times larger.
Comment: Accepted for IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP 2017)
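As a rough illustration of the two input layouts contrasted in the abstract
above, the following NumPy sketch arranges per-channel features either as one
concatenated vector per frame or as a volume with one layer per channel. The
array sizes (2 channels, 100 frames, 40 bands) and the function names are
assumptions made for illustration, not values or code from the paper.

```python
import numpy as np

def concat_channels(features):
    """(channels, frames, bands) -> (frames, channels * bands).

    Each frame becomes a single long vector with the channels laid end to end.
    """
    channels, frames, bands = features.shape
    return features.transpose(1, 0, 2).reshape(frames, channels * bands)

def stack_as_volume(features):
    """(channels, frames, bands) -> (frames, bands, channels).

    Each channel becomes a separate layer, like the colour planes of an image,
    so a 2-D convolution sees all channels at the same time-frequency position.
    """
    return features.transpose(1, 2, 0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 100, 40))  # hypothetical sizes
flat = concat_channels(feats)
volume = stack_as_volume(feats)
print(flat.shape, volume.shape)
```

Both layouts hold the same numbers; they differ only in whether a
convolutional front end can treat the channels as aligned layers.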
Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging
Environmental audio tagging is a newly proposed task to predict the presence
or absence of a specific audio event in a chunk. Deep neural network (DNN)
based methods have been successfully adopted for predicting the audio tags in
the domestic audio scene. In this paper, we propose to use a convolutional
neural network (CNN) to extract robust features from mel-filter banks (MFBs),
spectrograms or even raw waveforms for audio tagging. Gated recurrent unit
(GRU) based recurrent neural networks (RNNs) are then cascaded to model the
long-term temporal structure of the audio signal. To complement the input
information, an auxiliary CNN is designed to learn on the spatial features of
stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging)
of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE
2016) challenge. Compared with our recent DNN-based method, the proposed
structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the
development set. The spatial features can further reduce the EER to 0.10. The
performance of the end-to-end learning on raw waveforms is also comparable.
Finally, on the evaluation set, we achieve state-of-the-art performance with
0.12 EER, while the best existing system achieves 0.15 EER.
Comment: Accepted to IJCNN2017, Anchorage, Alaska, US
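The equal error rate (EER) quoted above is the operating point at which the
false positive rate equals the false negative rate. A minimal sketch, assuming
binary labels and real-valued scores for a single tag; the function name and
the simple threshold sweep are illustrative and are not the DCASE evaluation
code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, None
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        fpr = np.sum(predicted & ~labels) / np.sum(~labels)  # false positive rate
        fnr = np.sum(~predicted & labels) / np.sum(labels)   # false negative rate
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (fpr + fnr) / 2.0
    return eer

separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
overlapping = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
print(separable, overlapping)
```

With a finite set of scores the two rates may never be exactly equal, so this
sketch reports their average at the closest threshold.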
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
Large-scale weakly supervised audio classification using gated convolutional neural network
In this paper, we present a gated convolutional neural network and a temporal
attention-based localization method for audio classification, which won the 1st
place in the large-scale weakly supervised sound event detection task of
Detection and Classification of Acoustic Scenes and Events (DCASE) 2017
challenge. The audio clips in this task, which are extracted from YouTube
videos, are manually labeled with one or a few audio tags but without
timestamps of the audio events; such data is called weakly labeled data. Two
sub-tasks are defined in this challenge including audio tagging and sound event
detection using this weakly labeled data. A convolutional recurrent neural
network (CRNN) with learnable gated linear units (GLUs) non-linearity applied
on the log Mel spectrogram is proposed. In addition, a temporal attention
method is proposed along the frames to predict the locations of each audio
event in a chunk from the weakly labeled data. We ranked 1st and 2nd as a team
in these two sub-tasks of the DCASE 2017 challenge, with an F value of 55.6%
and an equal error of 0.73, respectively.
Comment: submitted to ICASSP2018, summary on the 1st place system in DCASE2017
task4 challenge
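The two ingredients named in the abstract above, a gated linear unit and
attention along the frames, can be sketched in NumPy as follows. This is one
common formulation under stated assumptions (the gate splits the feature axis
in half, and attention is a softmax over frames); the abstract does not give
the exact equations, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x):
    """Gated linear unit: one half of the features is gated by the other half."""
    linear, gate = np.split(x, 2, axis=-1)
    return linear * sigmoid(gate)

def attention_pool(frame_probs, attention_logits):
    """Pool frame-level tag probabilities into one clip-level prediction.

    frame_probs, attention_logits: arrays of shape (frames, classes).
    Returns the clip-level probabilities (classes,) and the per-frame weights,
    which indicate where in time each tag is most likely active.
    """
    shifted = attention_logits - attention_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over frames
    return (weights * frame_probs).sum(axis=0), weights

gated = glu(np.array([[2.0, 0.0]]))
frame_probs = np.array([[0.2, 0.8], [0.6, 0.4]])
clip_probs, weights = attention_pool(frame_probs, np.zeros((2, 2)))
print(gated, clip_probs)
```

With uniform attention logits the pooling reduces to a plain average over
frames; learned logits let the clip-level tag depend mostly on the frames
where the event occurs, which is what allows localization from weak labels.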
Sample Mixed-Based Data Augmentation for Domestic Audio Tagging
Audio tagging has attracted increasing attention over the last decade and has
various potential applications in many fields. The objective of audio tagging
is to predict the labels of an audio clip. Recently deep learning methods have
been applied to audio tagging and have achieved state-of-the-art performance.
However, due to the limited size of audio tagging data such as DCASE data, the
trained models tend to overfit, which results in poor generalization ability
on new data. Previous data augmentation methods
such as pitch shifting, time stretching and adding background noise do not show
much improvement in audio tagging. In this paper, we explore sample-mixed
data augmentation for the domestic audio tagging task, including mixup,
SamplePairing and extrapolation. We apply a convolutional recurrent neural
network (CRNN) with an attention module, using the log-scaled mel spectrum as
input, as a baseline system. In our experiments, we achieve a state-of-the-art
equal error rate (EER) of 0.10 on the DCASE 2016 task4 dataset with the mixup
approach, outperforming the baseline system without data augmentation.
Comment: submitted to the workshop of Detection and Classification of Acoustic
Scenes and Events 2018 (DCASE 2018), 19-20 November 2018, Surrey, UK
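Of the augmentation methods listed in the abstract above, mixup is the
simplest to sketch: two training examples and their label vectors are blended
with a weight drawn from a Beta(alpha, alpha) distribution. The value of
alpha, the array shapes, and the function name below are assumptions for
illustration; the abstract does not state the settings used.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples and their labels with a Beta-distributed weight."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # mixing weight in [0, 1]
    x_mixed = lam * x1 + (1.0 - lam) * x2  # blended input features
    y_mixed = lam * y1 + (1.0 - lam) * y2  # blended (soft) tag labels
    return x_mixed, y_mixed, lam

# Hypothetical inputs: two 4-frame, 3-band spectrogram patches with 2 tags.
x_a, y_a = np.ones((4, 3)), np.array([1.0, 0.0])
x_b, y_b = np.zeros((4, 3)), np.array([0.0, 1.0])
x_mixed, y_mixed, lam = mixup(x_a, y_a, x_b, y_b, rng=np.random.default_rng(0))
print(lam, y_mixed)
```

Because the labels are blended with the same weight as the inputs, the model
is trained on soft targets that lie between the original tag vectors.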