323 research outputs found
Joint Detection and Classification Convolutional Neural Network on Weakly Labelled Bird Audio Detection
Bird audio detection (BAD) aims to detect whether or not a bird call is present in an audio recording. One difficulty of this task is that bird sound datasets are weakly labelled: only the presence or absence of a bird in a recording is known, not when the birds call. We propose to apply a joint detection and classification (JDC) model to the weakly labelled data (WLD) to detect and classify an audio clip at the same time. First, we apply a VGG-like convolutional neural network (CNN) on the mel spectrogram as a baseline. Then we propose a JDC-CNN model with a VGG network as the classifier and a CNN as the detector. We report that denoising methods, including optimally-modified log-spectral amplitude (OM-LSA), median filtering and spectral spectrogram denoising, worsen the classification accuracy, contrary to previous work. The JDC-CNN can predict the time stamps of events from weakly labelled data, and is therefore able to perform sound event detection from WLD. We obtained an area under the curve (AUC) of 95.70% on the development data and 81.36% on the unseen evaluation data, which is nearly comparable to the baseline CNN model.
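To make the weak-label setup concrete, the following is a minimal PyTorch sketch in the spirit of a joint detection-and-classification model: a small CNN scores each frame of a log-mel spectrogram for bird presence, and the clip-level probability compared against the weak label is obtained by pooling the frame scores. The layer sizes, the max-pooling aggregation and the tensor shapes are illustrative assumptions, not the paper's JDC-CNN.

import torch
import torch.nn as nn

class TinyJDC(nn.Module):
    def __init__(self, n_mels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # pool frequency, keep full time resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.detector = nn.Linear(32 * (n_mels // 4), 1)   # frame-wise presence score

    def forward(self, x):                                  # x: (batch, 1, n_mels, time)
        h = self.conv(x)                                   # (batch, 32, n_mels/4, time)
        h = h.permute(0, 3, 1, 2).flatten(2)               # (batch, time, features)
        frame_prob = torch.sigmoid(self.detector(h)).squeeze(-1)   # (batch, time)
        clip_prob = frame_prob.max(dim=1).values           # weak clip-level prediction
        return clip_prob, frame_prob

model = TinyJDC()
mel = torch.randn(4, 1, 64, 200)                           # 4 clips, 64 mel bins, 200 frames
clip_prob, frame_prob = model(mel)
print(clip_prob.shape, frame_prob.shape)                   # torch.Size([4]) torch.Size([4, 200])

Because the detector is applied per frame, frame_prob can be thresholded to obtain approximate time stamps even though only clip-level labels are available for training.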
Weakly Labelled AudioSet Tagging with Attention Neural Networks
Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. AudioSet is weakly labelled, in that only the presence or absence of sound classes is known for each clip, while the onset and offset times are unknown. To address the weakly-labelled audio tagging problem, we propose attention neural networks as a way to attend to the most salient parts of an audio clip. We bridge the connection between attention neural networks and multiple instance learning (MIL) methods, and propose decision-level and feature-level attention neural networks for audio tagging. We investigate attention neural networks modelled by different functions, depths and widths. Experiments on AudioSet show that the feature-level attention neural network achieves a state-of-the-art mean average precision (mAP) of 0.369, outperforming the best multiple instance learning (MIL) method of 0.317 and Google's deep neural network baseline of 0.314. In addition, we discover that the audio tagging performance on AudioSet embedding features has a weak correlation with the number of training samples and the quality of labels of each sound class.
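The decision-level attention described above can be summarised in a few lines. The sketch below is an assumption about shapes and layer choices rather than the authors' exact network: each segment of a clip is scored for all 527 classes, a per-segment attention weight is learned, and the weighted sum gives the clip-level prediction, which is where the connection to multiple instance learning comes from.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feat_dim=128, n_classes=527):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)   # per-segment class probabilities
        self.att = nn.Linear(feat_dim, n_classes)   # per-segment attention logits

    def forward(self, x):                            # x: (batch, segments, feat_dim)
        cla = torch.sigmoid(self.cla(x))             # (batch, segments, classes)
        att = torch.softmax(self.att(x), dim=1)      # normalise attention over segments
        return (att * cla).sum(dim=1)                # clip-level probabilities

pool = AttentionPooling()
embeddings = torch.randn(8, 10, 128)                 # e.g. ten 1-second segment embeddings per clip
clip_prob = pool(embeddings)
print(clip_prob.shape)                               # torch.Size([8, 527])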
Classification of Animal Sound Using Convolutional Neural Network
Recently, labeling of acoustic events has emerged as an active topic covering a wide range of applications. High-level semantic inference can be conducted based on main audio effects to facilitate various content-based applications for analysis, efficient recovery and content management. This paper proposes a flexible convolutional neural network-based framework for animal audio classification. The work takes inspiration from various deep neural networks developed recently for multimedia classification. The model is driven by the idea of identifying the animal sound in an audio file by forcing the network to pay attention to the core audio effects present in its Mel-spectrogram. The designed framework achieves an accuracy of 98% when classifying animal audio on weakly labelled datasets. A further aim of this research is to build a framework that can run on a basic machine and does not necessarily require high-end devices to perform the classification.
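As a point of reference for the front end such frameworks typically rely on, the sketch below computes a log-Mel-spectrogram with librosa from a synthetic signal; the sample rate, FFT size and number of Mel bands are assumed values, not the paper's configuration.

import numpy as np
import librosa

sr = 22050
y = np.random.randn(sr * 5).astype(np.float32)        # stand-in for a 5-second animal recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)        # (64, frames) image fed to the CNN
print(log_mel.shape)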
Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes
In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data. We first describe a convolutional neural network (CNN) based framework for sound event detection and classification using weakly labeled audio data. Our model trains efficiently from audio of variable length; hence, it is well suited for transfer learning. We then propose methods to learn representations using this model which can be effectively used for solving the target task. We study both transductive and inductive transfer learning tasks, showing the effectiveness of our methods for both domain and task adaptation. We show that the representations learned with the proposed CNN model generalize well enough to reach human-level accuracy on the ESC-50 sound events dataset and set state-of-the-art results on this dataset. We further use them for the acoustic scene classification task and once again show that our proposed approaches are well suited to this task. We also show that our methods are helpful in capturing semantic meanings and relations. Moreover, in the process we set state-of-the-art results on the AudioSet dataset, relying only on the balanced training set.
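A minimal sketch of the inductive transfer setting described above: a CNN, here a stand-in rather than the authors' pretrained weak-label model, is frozen, its pooled activations serve as fixed representations, and only a small classifier is trained for a 50-class target task such as ESC-50.

import torch
import torch.nn as nn

feature_extractor = nn.Sequential(                    # placeholder for a CNN pretrained on weak labels
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten()
)
for p in feature_extractor.parameters():
    p.requires_grad = False                           # transfer: keep source weights fixed

classifier = nn.Linear(32, 50)                        # 50 target classes, as in ESC-50
optimiser = torch.optim.Adam(classifier.parameters(), lr=1e-3)

mels = torch.randn(16, 1, 64, 431)                    # a batch of target-task log-mel clips
labels = torch.randint(0, 50, (16,))
with torch.no_grad():
    feats = feature_extractor(mels)                   # (16, 32) transferred representations
loss = nn.functional.cross_entropy(classifier(feats), labels)
loss.backward()
optimiser.step()
print(float(loss))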
Unsupervised classification to improve the quality of a bird song recording dataset
Open audio databases such as Xeno-Canto are widely used to build datasets to explore bird song repertoire or to train models for automatic bird sound classification by deep learning algorithms. However, such databases suffer from the fact that bird sounds are weakly labelled: a species name is attributed to each audio recording without timestamps that provide the temporal localization of the bird song of interest. Manual annotations can solve this issue, but they are time consuming, expert-dependent, and cannot run on large datasets. Another solution consists in using a labelling function that automatically segments audio recordings before assigning a label to each segmented audio sample. Although labelling functions were introduced to expedite strong label assignment, their classification performance remains mostly unknown. To address this issue and reduce label noise (wrong label assignment) in large bird song datasets, we introduce a novel, data-centric labelling function composed of three successive steps: 1) time-frequency sound unit segmentation, 2) feature computation for each sound unit, and 3) classification of each sound unit as bird song or noise with either an unsupervised DBSCAN algorithm or the supervised BirdNET neural network. The labelling function was optimized, validated, and tested on the songs of 44 West-Palearctic common bird species. We first showed that the segmentation of bird songs alone aggregated from 10% to 83% of label noise depending on the species. We also demonstrated that our labelling function was able to significantly reduce the initial label noise present in the dataset by up to a factor of three. Finally, we discuss different opportunities to design suitable labelling functions to build high-quality animal vocalization datasets with minimum expert annotation effort.
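The three-step labelling function lends itself to a compact illustration. The sketch below works on a toy waveform rather than real recordings: it segments candidate sound units with an energy threshold, computes two simple features per unit, and clusters the units with scikit-learn's DBSCAN, whose label -1 marks outliers. The thresholds and features are illustrative assumptions; the paper's segmentation operates on the time-frequency representation.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
sr = 22050
y = rng.normal(scale=0.01, size=sr * 5)               # quiet background noise
y[sr:sr + 2000] += np.sin(2 * np.pi * 4000 * np.arange(2000) / sr)   # one louder "song" unit

# 1) segmentation: frames whose RMS energy exceeds a threshold form candidate sound units
frame = 1024
rms = np.array([np.sqrt(np.mean(y[i:i + frame] ** 2)) for i in range(0, len(y) - frame, frame)])
active = np.where(rms > 2 * np.median(rms))[0]

# 2) feature computation: here simply (energy, relative position) per active frame
feats = np.column_stack([rms[active], active / len(rms)])

# 3) unsupervised classification of the units with DBSCAN (-1 would mark noise/outliers)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(feats)
print(list(zip(active.tolist(), labels.tolist())))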
Automatic detection and classification of bird sounds in low-resource wildlife audio datasets
There are many potential applications of automatic species detection and classification of birds from their sounds (e.g. ecological research, biodiversity monitoring, archival). However, acquiring adequately labelled large-scale and longitudinal data remains a major challenge, especially for species-rich remote areas as well as for taxa that require expert input for identification. So far, monitoring of avian populations has been performed via manual surveying, sometimes even including the help of volunteers due to the challenging scales of the data. In recent decades, there has been an increasing amount of ecological audio datasets that have tags assigned to them to indicate the presence or absence of a specific bird species. However, automated species vocalization detection and identification is a challenging task. There is a high diversity of animal vocalisations, both in the types of the basic syllables and in the way they are combined. Also, there is noise present in most habitats, and many bird communities contain multiple bird species that can potentially have overlapping vocalisations. In recent years, machine learning has experienced strong growth, due to increased dataset sizes and computational power, and to advances in deep learning methods that can learn to make predictions in extremely nonlinear problem settings. However, in training a deep learning system to perform automatic detection and audio tagging of wildlife bird sound scenes, two problems often arise. Firstly, even with the increased amount of audio datasets, most publicly available datasets are weakly labelled, having only a list of events present in each recording without any temporal information for training. Secondly, in practice it is difficult to collect enough samples for most classes of interest. These problems are particularly pressing for wildlife audio but also occur in many other scenarios. In this thesis, we investigate and propose methods to perform audio event detection and classification on wildlife bird sound scenes and other low-resource audio datasets, including methods based on image processing and deep learning. We extend deep learning methods for weakly labelled data in a multi-instance learning and multi-task learning setting. We evaluate these methods for simultaneously detecting and classifying large numbers of sound types in audio recorded in the wild and in other low-resource audio datasets.
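Much of the thesis revolves around the multiple-instance view of weak labels, which can be stated in a few lines: frame-level predictions are pooled into a single clip-level prediction, and the loss is computed only against the clip tag. The sketch below uses max pooling and a stand-in frame scorer; it is not a model from the thesis.

import torch
import torch.nn as nn

frame_scorer = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))  # 10 sound classes
mel_frames = torch.randn(8, 500, 64)                  # 8 clips, 500 frames of 64 mel bins each
clip_tags = torch.randint(0, 2, (8, 10)).float()      # weak labels: class present in the clip or not

frame_logits = frame_scorer(mel_frames)               # (8, 500, 10) per-frame scores
clip_logits = frame_logits.max(dim=1).values          # MIL assumption: clip tag = max over frames
loss = nn.functional.binary_cross_entropy_with_logits(clip_logits, clip_tags)
loss.backward()
print(float(loss))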
CAA-Net: Conditional Atrous CNNs with attention for explainable device-robust acoustic scene classification
Acoustic Scene Classification (ASC) aims to classify the environment in which
the audio signals are recorded. Recently, Convolutional Neural Networks (CNNs)
have been successfully applied to ASC. However, the data distributions of the
audio signals recorded with multiple devices are different. There has been
little research on the training of robust neural networks on acoustic scene
datasets recorded with multiple devices, and on explaining the operation of the
internal layers of the neural networks. In this article, we focus on training
and explaining device-robust CNNs on multi-device acoustic scene data. We
propose conditional atrous CNNs with attention for multi-device ASC. Our
proposed system contains an ASC branch and a device classification branch, both
modelled by CNNs. We visualise and analyse the intermediate layers of the
atrous CNNs. A time-frequency attention mechanism is employed to analyse the
contribution of each time-frequency bin of the feature maps in the CNNs. On the
Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 ASC
dataset, recorded with three devices, our proposed model performs significantly
better than CNNs trained on single-device data.Comment: IEEE Transactions on Multimedi
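The two ingredients named in the abstract, atrous (dilated) convolution and time-frequency attention, can be illustrated with a small PyTorch block. The channel counts and single-layer attention below are assumptions for illustration, not the CAA-Net architecture; the attention map has one weight per time-frequency bin and can be visualised, which is the kind of explanation the article aims at.

import torch
import torch.nn as nn

class AtrousAttentionBlock(nn.Module):
    def __init__(self, in_ch=1, out_ch=16, dilation=2):
        super().__init__()
        self.atrous = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                                padding=dilation, dilation=dilation)   # larger receptive field, no downsampling
        self.attention = nn.Conv2d(out_ch, 1, kernel_size=1)           # one weight per time-frequency bin

    def forward(self, x):                        # x: (batch, in_ch, mel_bins, frames)
        h = torch.relu(self.atrous(x))
        att = torch.sigmoid(self.attention(h))   # (batch, 1, mel_bins, frames), inspectable map
        return h * att, att

block = AtrousAttentionBlock()
mel = torch.randn(2, 1, 64, 400)
out, att = block(mel)
print(out.shape, att.shape)                      # spatial size preserved since padding == dilation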