Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification
In this paper we present a Deep Neural Network architecture for the task of
acoustic scene classification which harnesses information from increasing
temporal resolutions of Mel-spectrogram segments. The architecture is composed of separate parallel Convolutional Neural Networks which learn spectral and
temporal representations for each input resolution. The resolutions are chosen
to cover fine-grained characteristics of a scene's spectral texture as well as
its distribution of acoustic events. The proposed model shows a 3.56% absolute improvement over the best-performing single-resolution model and a 12.49% improvement over the DCASE 2017 Acoustic Scene Classification task baseline.
Comment: In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017
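As a minimal sketch of this parallel-branch idea (assumptions: PyTorch, and placeholder channel counts, kernel sizes, and pooling; not the authors' exact configuration), the model could look like:

```python
import torch
import torch.nn as nn

class MultiResolutionCNN(nn.Module):
    """One small CNN branch per Mel-spectrogram resolution; pooled branch
    features are concatenated and classified jointly."""
    def __init__(self, n_resolutions=3, n_classes=15, branch_dim=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, branch_dim, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),  # global pooling per branch
            )
            for _ in range(n_resolutions)
        ])
        self.classifier = nn.Linear(n_resolutions * branch_dim, n_classes)

    def forward(self, spectrograms):
        # spectrograms: list of (batch, 1, mels, frames) tensors, one per
        # temporal resolution (e.g. different window/hop renderings of a clip)
        feats = [b(x).flatten(1) for b, x in zip(self.branches, spectrograms)]
        return self.classifier(torch.cat(feats, dim=1))
```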
CNNs-based Acoustic Scene Classification using Multi-Spectrogram Fusion and Label Expansions
Spectrograms have been widely used in Convolutional Neural Networks based
schemes for acoustic scene classification, such as the STFT spectrogram and the MFCC spectrogram. These have different time-frequency characteristics,
contributing to their own advantages and disadvantages in recognizing acoustic
scenes. In this letter, a novel multi-spectrogram fusion framework is proposed,
making the spectrograms complement each other. In the framework, a single CNN architecture is applied to multiple spectrograms for feature extraction. The
deep features extracted from multiple spectrograms are then fused to
discriminate the acoustic scenes. Moreover, motivated by the inter-class
similarities in acoustic scene datasets, a label expansion method is further
proposed in which super-class labels are constructed upon the original classes.
With the help of the expanded labels, the CNN models are transformed into a multitask learning form that improves acoustic scene classification by appending the auxiliary task of super-class classification. To verify the effectiveness of the proposed methods, extensive experiments have been
performed on the DCASE2017 and the LITIS Rouen datasets. Experimental results
show that the proposed method can achieve promising accuracies on both
datasets. Specifically, accuracies of 0.9744, 0.8865 and 0.7778 are obtained
for the LITIS Rouen dataset, the DCASE Development set and Evaluation set
respectively.
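A rough rendering of the label-expansion idea (a sketch under assumptions: the backbone, the 15-to-3 class mapping, and the auxiliary loss weight below are all hypothetical, not values from the letter) is a shared trunk with two heads:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskASC(nn.Module):
    """Shared feature extractor with a fine-class head and an auxiliary
    super-class head, trained jointly on expanded labels."""
    def __init__(self, feat_dim=128, n_fine=15, n_super=3):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.fine_head = nn.Linear(feat_dim, n_fine)
        self.super_head = nn.Linear(feat_dim, n_super)

    def forward(self, x):
        h = self.backbone(x)
        return self.fine_head(h), self.super_head(h)

# Hypothetical 15-to-3 mapping from fine classes to super-classes
# (e.g. indoor / outdoor / vehicle); the real grouping is dataset-specific.
FINE_TO_SUPER = torch.tensor([0] * 5 + [1] * 5 + [2] * 5)

def multitask_loss(fine_logits, super_logits, fine_labels, aux_weight=0.3):
    # aux_weight is a made-up trade-off factor, not a value from the letter
    super_labels = FINE_TO_SUPER[fine_labels]
    return (F.cross_entropy(fine_logits, fine_labels)
            + aux_weight * F.cross_entropy(super_logits, super_labels))
```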
Deep Within-Class Covariance Analysis for Robust Audio Representation Learning
Convolutional Neural Networks (CNNs) can learn effective features, though they have been shown to suffer from a performance drop when the data distribution changes from training to test. In this paper we analyze the internal
representations of CNNs and observe that the representations of unseen data in each class spread more (with higher variance) in the embedding space of the CNN than those of the training data. More importantly, this
difference is more extreme if the unseen data comes from a shifted
distribution. Based on this observation, we objectively quantify the variance of each class's representations via eigenvalue decomposition of the within-class covariance of the CNN's internal representations and observe the
same behaviour. This can be problematic, as larger variances might lead to misclassification if a sample crosses the decision boundary of its class. We
apply nearest neighbor classification on the representations and empirically
show that embeddings with high variance indeed have significantly worse KNN classification performance, although this could not be foreseen from
their end-to-end classification results. To tackle this problem, we propose
Deep Within-Class Covariance Analysis (DWCCA), a deep neural network layer that
significantly reduces the within-class covariance of a DNN's representation,
improving performance on unseen test data from a shifted distribution. We
empirically evaluate DWCCA on two datasets for Acoustic Scene Classification
(DCASE2016 and DCASE2017). We demonstrate that not only does DWCCA
significantly improve the network's internal representation, it also increases
the end-to-end classification accuracy, especially when the test set exhibits a
distribution shift. By adding DWCCA to a VGG network, we achieve around 6 percentage points of improvement in the case of a distribution mismatch.
Comment: 11 pages, 3 tables, 4 figures
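The DWCCA layer itself is defined in the paper; the quantity it acts on can be sketched as follows (a diagnostic computation of the within-class covariance and its eigenvalues, not the authors' layer; array shapes and the stand-in data are assumptions):

```python
import numpy as np

def within_class_covariance(embeddings, labels):
    """Average covariance of embeddings around their own class means.
    embeddings: (n_samples, dim); labels: (n_samples,) integer classes."""
    dim = embeddings.shape[1]
    cov = np.zeros((dim, dim))
    for c in np.unique(labels):
        x = embeddings[labels == c]
        centered = x - x.mean(axis=0)
        cov += centered.T @ centered
    return cov / len(embeddings)

# The eigenvalues expose how much each embedding direction spreads within
# classes; large eigenvalues flag the high-variance behaviour the paper
# reports on shifted test data.
emb = np.random.randn(200, 16)              # stand-in embeddings
lab = np.random.randint(0, 4, size=200)     # stand-in class labels
eigvals = np.linalg.eigvalsh(within_class_covariance(emb, lab))
print(eigvals[::-1][:5])                    # five largest within-class variances
```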
ACGAN-based Data Augmentation Integrated with Long-term Scalogram for Acoustic Scene Classification
In acoustic scene classification (ASC), acoustic features play a crucial role
in the extraction of scene information, which can be stored over different time
scales. Moreover, the limited size of the dataset may lead to a biased model with poor performance on recordings from unseen cities and easily confused scene classes. In order to overcome this, we propose a long-term wavelet feature that
requires a lower storage capacity and can be classified faster and more
accurately compared with classic Mel filter bank coefficients (FBank). This
feature can be extracted with predefined wavelet scales similar to the FBank.
Furthermore, a novel data augmentation scheme based on generative adversarial networks with auxiliary classifiers (ACGANs) is adopted to improve the
generalization of the ASC systems. The scheme, which contains ACGANs and a
sample filter, extends the database iteratively by splitting the dataset,
training the ACGANs and subsequently filtering samples. Experiments were
conducted on datasets from the Detection and Classification of Acoustic Scenes
and Events (DCASE) challenges. The results on the DCASE19 dataset demonstrate
the improved performance of the proposed techniques compared with the classic
FBank classifier. Moreover, the proposed fusion system achieved first place in
the DCASE19 competition and surpassed the top accuracies on the DCASE17
dataset.
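A skeleton of the iterative augmentation loop as the abstract describes it (train the ACGANs, generate, filter, extend); the `train_acgan` and `confidence_filter` stand-ins below are hypothetical simplifications, not the authors' implementation:

```python
import numpy as np

def train_acgan(data, labels):
    """Hypothetical stand-in: train an auxiliary-classifier GAN on
    (data, labels) and return a class-conditional sampler."""
    def sample(cls, n):
        subset = data[labels == cls]
        idx = np.random.randint(len(subset), size=n)
        return subset[idx] + 0.01 * np.random.randn(n, data.shape[1])
    return sample

def confidence_filter(samples, cls, classifier, threshold=0.9):
    """The 'sample filter' role: keep generated samples that the current
    classifier assigns to the intended class with high confidence."""
    probs = classifier(samples)             # (n, n_classes) posteriors
    return samples[probs[:, cls] >= threshold]

def augment(data, labels, classifier, n_rounds=3, per_class=100):
    """Iteratively extend the database: train the ACGAN, generate, filter,
    and fold surviving samples back into the training set."""
    for _ in range(n_rounds):
        sampler = train_acgan(data, labels)
        for cls in np.unique(labels):
            gen = confidence_filter(sampler(cls, per_class), cls, classifier)
            data = np.vstack([data, gen])
            labels = np.concatenate([labels, np.full(len(gen), cls)])
    return data, labels
```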
Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation
In recent years, neural network approaches have shown superior performance to conventional hand-crafted features in numerous application areas. In particular,
convolutional neural networks (ConvNets) exploit spatially local correlations
across input data to improve the performance of audio processing tasks, such as
speech recognition, musical chord recognition, and onset detection. Here we
apply ConvNet to acoustic scene classification, and show that the error rate
can be further decreased by using delta features in the frequency domain. We
propose a multiple-width frequency-delta (MWFD) data augmentation method that
uses static mel-spectrogram and frequency-delta features as individual input
examples. In addition, we describe a ConvNet output aggregation method designed
for MWFD augmentation, folded mean aggregation, which combines output
probabilities of static and MWFD features from the same analysis window using
multiplication first, rather than taking an average of all output
probabilities. We report results on the DCASE 2016 challenge dataset, which show that the ConvNet outperforms both the baseline system with hand-crafted features and a deep neural network approach by around 7%. The
performance was further improved (by 5.7%) using the MWFD augmentation together
with folded mean aggregation. The system exhibited a classification accuracy of
0.831 when classifying 15 acoustic scenes.
Comment: 11 pages, 5 figures, submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing on 08-July-2016
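The exact MWFD formulation is given in the paper; a plausible reading of it (regression-style deltas computed along the frequency axis, plus the multiply-then-average "folded mean" fusion; widths and shapes below are assumptions) can be sketched as:

```python
import numpy as np

def frequency_delta(spec, width):
    """Regression-style delta along the frequency axis of a Mel-spectrogram,
    analogous to the usual time-delta formula.
    spec: (n_mels, n_frames); width: half-window of the regression."""
    pad = np.pad(spec, ((width, width), (0, 0)), mode='edge')
    denom = 2 * sum(k * k for k in range(1, width + 1))
    n = spec.shape[0]
    return sum(k * (pad[width + k:width + k + n] - pad[width - k:width - k + n])
               for k in range(1, width + 1)) / denom

def folded_mean(probs):
    """Folded mean aggregation: multiply the posteriors of the static and
    delta variants from the same analysis window first, then average
    across windows. probs: (n_windows, n_variants, n_classes)."""
    per_window = probs.prod(axis=1)   # fold variants by multiplication
    return per_window.mean(axis=0).argmax()

# One static spectrogram plus deltas of several widths, each fed to the
# network as an individual input example.
spec = np.abs(np.random.randn(40, 100))
variants = [spec] + [frequency_delta(spec, w) for w in (1, 2, 3)]
```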
Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features
Motivated by the fact that characteristics of different sound classes are
highly diverse in different temporal scales and hierarchical levels, a novel
deep convolutional neural network (CNN) architecture is proposed for the
environmental sound classification task. This network architecture takes raw
waveforms as input, and a set of separate parallel CNNs is utilized with
different convolutional filter sizes and strides, in order to learn feature
representations with multi-temporal resolutions. On the other hand, the
proposed architecture also aggregates hierarchical features from multi-level
CNN layers for classification using direct connections between convolutional
layers, which is beyond the typical single-level CNN features employed by the
majority of previous studies. This network architecture also improves the flow of information and avoids the vanishing gradient problem. The combination of
multi-level features boosts the classification performance significantly.
Comparative experiments are conducted on two datasets: the environmental sound
classification dataset (ESC-50), and DCASE 2017 audio scene classification
dataset. Results demonstrate that the proposed method is highly effective in
the classification tasks by employing multi-temporal resolution and multi-level
features, and it outperforms previous methods that only account for single-level features.
Comment: Submitted to PCM 2018
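One way to picture the raw-waveform, multi-resolution, multi-level design (a sketch only: the filter sizes, strides, and weight-shared upper blocks below are assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class RawWaveformMultiResCNN(nn.Module):
    """Parallel 1-D conv stems with different filter sizes and strides on the
    raw waveform; features from every conv level are pooled and concatenated
    (multi-level aggregation via direct connections) before classification."""
    def __init__(self, n_classes=50, stems=((9, 1), (81, 4), (243, 16))):
        super().__init__()
        self.stems = nn.ModuleList(
            nn.Conv1d(1, 16, kernel_size=k, stride=s, padding=k // 2)
            for k, s in stems)
        self.block1 = nn.Sequential(nn.Conv1d(16, 32, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv1d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(len(stems) * (16 + 32 + 64), n_classes)

    def forward(self, wav):                       # wav: (batch, 1, samples)
        feats = []
        for stem in self.stems:
            h1 = torch.relu(stem(wav))            # low-level features
            h2 = self.block1(h1)                  # mid-level features
            h3 = self.block2(h2)                  # high-level features
            feats += [self.pool(h).flatten(1) for h in (h1, h2, h3)]
        return self.fc(torch.cat(feats, dim=1))
```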
Sample Dropout for Audio Scene Classification Using Multi-Scale Dense Connected Convolutional Neural Network
Acoustic scene classification is a challenging problem for machines. In this emerging field of research, deep Convolutional Neural Networks (CNNs) achieve convincing results. In this paper, we explore the use of a multi-scale densely connected convolutional neural network (DenseNet) for the classification task, with the goal of improving classification performance, since multi-scale features
can be extracted from the time-frequency representation of the audio signal. On
the other hand, most previous CNN-based audio scene classification approaches aim to improve the classification accuracy by employing different
regularization techniques, such as the dropout of hidden units and data
augmentation, to reduce overfitting. It is widely known that outliers in the
training set have a high negative influence on the trained model, and culling
the outliers may improve the classification performance, yet this has often been under-explored in previous studies. In this paper, inspired by silence removal in speech signal processing, a novel sample dropout approach is
proposed, which aims to remove outliers in the training dataset. Using the
DCASE 2017 audio scene classification datasets, the experimental results demonstrate that the proposed multi-scale DenseNet provides superior performance to the traditional single-scale DenseNet, while the sample dropout method further improves the classification robustness of the multi-scale DenseNet.
Comment: Accepted to 2018 Pacific Rim Knowledge Acquisition Workshop (PKAW)
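The abstract does not state the outlier criterion beyond the silence-removal analogy, so the loss-ranking rule below is an assumption; as a minimal sketch, sample dropout amounts to discarding the worst-fitting training clips:

```python
import numpy as np

def sample_dropout(losses, drop_fraction=0.05):
    """Return indices of training samples to keep, discarding the
    highest-loss fraction as presumed outliers. Both the loss-ranking
    criterion and the 5% fraction are assumptions for illustration."""
    n_keep = len(losses) - int(len(losses) * drop_fraction)
    return np.sort(np.argsort(losses)[:n_keep])

# Usage: score every training clip with a preliminary model, drop the
# worst-fitting clips, then retrain on the surviving subset.
losses = np.random.rand(1000)            # stand-in per-sample losses
keep_idx = sample_dropout(losses)
print(len(keep_idx))                     # 950 samples retained
```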
Acoustic Features Fusion using Attentive Multi-channel Deep Architecture
In this paper, we present a novel deep fusion architecture for audio
classification tasks. The multi-channel model presented is formed using deep
convolution layers where different acoustic features are passed through each
channel. To enable dissemination of information across the channels, we
introduce attention feature maps that aid in the alignment of frames. The
output of each channel is merged using interaction parameters that non-linearly
aggregate the representative features. Finally, we evaluate the performance of
the proposed architecture on three benchmark datasets: DCASE-2016 and LITIS
Rouen (acoustic scene recognition), and CHiME-Home (tagging). Our experimental
results suggest that the architecture presented outperforms the standard
baselines and achieves outstanding performance on the task of acoustic scene
recognition and audio tagging.
Comment: Accepted at CHiME'18 (Interspeech Workshop)
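A toy rendering of the attentive multi-channel idea (two channels only; the per-frame attention and bilinear interaction below are guesses at the described mechanism, with placeholder dimensions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Two feature channels; each channel's per-frame attention scores
    weight the other channel's frames, and a bilinear interaction merges
    the resulting summaries non-linearly."""
    def __init__(self, dim=64, n_classes=15):
        super().__init__()
        self.attn_a = nn.Linear(dim, 1)
        self.attn_b = nn.Linear(dim, 1)
        self.interact = nn.Bilinear(dim, dim, dim)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, a, b):              # a, b: (batch, frames, dim)
        wa = torch.softmax(self.attn_b(b), dim=1)   # b attends over a's frames
        wb = torch.softmax(self.attn_a(a), dim=1)   # a attends over b's frames
        za = (wa * a).sum(dim=1)                    # attended channel summaries
        zb = (wb * b).sum(dim=1)
        return self.fc(torch.tanh(self.interact(za, zb)))
```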
Ensemble Of Deep Neural Networks For Acoustic Scene Classification
Deep neural networks (DNNs) have recently achieved great success in a
multitude of classification tasks. Ensembles of DNNs have been shown to improve performance further. In this paper, we explore recent state-of-the-art DNNs
used for image classification. We modified these DNNs and applied them to the
task of acoustic scene classification. We conducted a number of experiments on
the TUT Acoustic Scenes 2017 dataset to empirically compare these methods.
Finally, we show that the best model improves the baseline score for DCASE-2017
Task 1 by 3.1% on the test set and by 10% on the development set.
Comment: Detection and Classification of Acoustic Scenes and Events 2017
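The ensembling step itself reduces to combining the member networks' posteriors; a minimal sketch (plain averaging is an assumption, the paper may weight or combine members differently):

```python
import torch

def ensemble_predict(models, x):
    """Average the softmax posteriors of independently trained networks
    and pick the scene label from the averaged distribution."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0).argmax(dim=-1)
```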
A Comparison of Deep Learning Methods for Environmental Sound Detection
Environmental sound detection is a challenging application of machine
learning because of the noisy nature of the signal, and the small amount of
(labeled) data that is typically available. This work thus presents a
comparison of several state-of-the-art Deep Learning models on the IEEE Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge task and data, classifying sounds into one of fifteen common
indoor and outdoor acoustic scenes, such as bus, cafe, car, city center, forest
path, library, train, etc. In total, 13 hours of stereo audio recordings are available, making this one of the largest such datasets. We perform
experiments on six sets of features, including standard Mel-frequency cepstral
coefficients (MFCC), binaural MFCC, log Mel-spectrum, and two different large-scale temporal pooling features extracted using OpenSMILE. On these features,
we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN),
Recurrent Neural Network (RNN), Convolutional Deep Neural Network (CNN) and
i-vector. Using the late-fusion approach, we improve on the baseline performance of 72.5% by 15.6% in 4-fold Cross-Validation (CV) average accuracy and by 11% in test accuracy, which matches the best result of the DCASE 2016 challenge.
With large feature sets, deep neural network models outperform traditional
methods and achieve the best performance among all the studied methods.
Consistent with other work, the best performing single model is the
non-temporal DNN model, which we take as evidence that sounds in the DCASE
challenge do not exhibit strong temporal dynamics.
Comment: 5 pages including references
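Late fusion of the per-model class posteriors can be sketched as a small grid search over convex combination weights on held-out data (a generic stand-in; the paper's exact fusion weights and procedure are not given in the abstract):

```python
import numpy as np
from itertools import product

def late_fusion_weights(val_probs, val_labels, grid=np.linspace(0, 1, 5)):
    """Grid-search convex weights over per-model class posteriors on
    held-out data. val_probs: list of (n_samples, n_classes) arrays,
    one per model (e.g. GMM, DNN, RNN, CNN, i-vector); val_labels:
    (n_samples,) integer labels."""
    best_w, best_acc = None, -1.0
    for w in product(grid, repeat=len(val_probs)):
        if sum(w) == 0:
            continue
        fused = sum(wi * p for wi, p in zip(w, val_probs)) / sum(w)
        acc = (fused.argmax(axis=1) == val_labels).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```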