Multi-View Networks For Multi-Channel Audio Classification
In this paper we introduce the idea of multi-view networks for sound
classification with multiple sensors. We show how one can build a multi-channel
sound recognition model trained on a fixed number of channels and deploy it in
scenarios with an arbitrary (and potentially dynamically changing) number of input
channels without observing degradation in performance. We demonstrate that at
inference time one can safely provide this model with all available channels, as it
can ignore noisy information and leverage new information better than standard
baseline approaches. The model is evaluated in both an anechoic environment and
in rooms generated by a room acoustics simulator. We demonstrate that this
model can generalize to unseen numbers of channels as well as unseen room
geometries.
Comment: 5 pages, 7 figures, Accepted to ICASSP 201
Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection
This paper studies the detection of bird calls in audio segments using
stacked convolutional and recurrent neural networks. Data augmentation by
blocks mixing and domain adaptation using a novel method of test mixing are
proposed and evaluated for their ability to make the method robust to unseen data.
The contributions of two kinds of acoustic features (dominant frequency and log
mel-band energy) and their combinations are studied in the context of bird
audio detection. Our best AUC is 95.5% over five cross-validations of the
development data and 88.1% on the unseen evaluation data.
Comment: Accepted for European Signal Processing Conference 201
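For readers unfamiliar with the AUC measure reported above, the following is a minimal sketch of how it can be computed for a binary detector such as bird / no bird. It uses the pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive clip receives the higher score. The function name roc_auc and the toy labels and scores are illustrative assumptions, not taken from the paper, and the paper's own evaluation code may differ.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 ground-truth values (1 = positive clip).
    scores: iterable of detector scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four clips, two containing the target sound.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.1, 0.4, 0.6]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The double loop is quadratic in the number of clips, which is fine for illustration; a sort-based rank computation would be the usual choice for large evaluation sets.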
CNN Architectures for Large-Scale Audio Classification
Convolutional Neural Networks (CNNs) have proven very effective in image
classification and show promise for audio. We use various CNN architectures to
classify the soundtracks of a dataset of 70M training videos (5.24 million
hours) with 30,871 video-level labels. We examine fully connected Deep Neural
Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We
investigate varying the size of both training set and label vocabulary, finding
that analogs of the CNNs used in image classification do well on our audio
classification task, and larger training and label sets help up to a point. A
model using embeddings from these classifiers does much better than raw
features on the Audio Set [5] Acoustic Event Detection (AED) classification
task.
Comment: Accepted for publication at ICASSP 2017. Changes: Added definitions of
mAP, AUC, and d-prime. Updated mAP/AUC/d-prime numbers for Audio Set based on
changes of latest Audio Set revision. Changed wording to fit 4 page limit
with new addition
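The comment above mentions d-prime alongside AUC. As a rough illustration, d-prime is commonly derived from AUC as d' = sqrt(2) * Phi^{-1}(AUC), where Phi^{-1} is the inverse CDF of the standard normal distribution. This listing does not reproduce the paper's definition, so the formula and the function name d_prime_from_auc below are assumptions; treat this as a sketch rather than the authors' exact computation.

```python
import math
from statistics import NormalDist


def d_prime_from_auc(auc):
    """Convert an AUC value in (0, 1) to d-prime.

    Assumes the common equal-variance Gaussian relation
    d' = sqrt(2) * Phi^{-1}(AUC); an AUC of 0.5 (chance) maps to d' = 0.
    """
    if not 0.0 < auc < 1.0:
        raise ValueError("AUC must lie strictly between 0 and 1")
    return math.sqrt(2.0) * NormalDist().inv_cdf(auc)


# Chance-level performance corresponds to zero separability.
chance_d_prime = d_prime_from_auc(0.5)
```

Because the mapping is monotonic, ranking models by d-prime gives the same order as ranking them by AUC; d-prime mainly spreads out values that sit close together near an AUC of 1.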