Sample Dropout for Audio Scene Classification Using Multi-Scale Dense Connected Convolutional Neural Network
Acoustic scene classification is an intricate problem for a machine. As an emerging field of research, deep Convolutional Neural Networks (CNN) achieve convincing results. In this paper, we explore the use of a multi-scale densely connected convolutional neural network (DenseNet) for the classification task, with the goal of improving classification performance, since multi-scale features can be extracted from the time-frequency representation of the audio signal. On the other hand, most previous CNN-based audio scene classification approaches aim to improve classification accuracy by employing regularization techniques, such as dropout of hidden units and data augmentation, to reduce overfitting. It is widely known that outliers in the training set have a strong negative influence on the trained model, and that culling them may improve classification performance, yet this remains under-explored in previous studies. In this paper, inspired by silence removal in speech signal processing, a novel sample dropout approach is proposed that aims to remove outliers from the training dataset. Using the DCASE 2017 audio scene classification datasets, the experimental results demonstrate that the proposed multi-scale DenseNet provides superior performance over the traditional single-scale DenseNet, while the sample dropout method further improves the classification robustness of the multi-scale DenseNet.
Comment: Accepted to 2018 Pacific Rim Knowledge Acquisition Workshop (PKAW)
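The abstract does not spell out how outlier training samples are identified. As a rough sketch of one plausible criterion only (the function name, the class-centroid distance rule, and the drop_fraction parameter are assumptions, not the authors' method), outliers could be culled before training like this:

```python
import numpy as np

def sample_dropout(features, labels, drop_fraction=0.05):
    """Return a boolean mask keeping all but the most outlying samples.

    features: (n_samples, n_dims) per-clip feature vectors
    labels:   (n_samples,) integer class labels
    """
    distances = np.empty(len(features))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = features[idx].mean(axis=0)
        distances[idx] = np.linalg.norm(features[idx] - centroid, axis=1)
    # Drop the drop_fraction of samples farthest from their class centroid.
    threshold = np.quantile(distances, 1.0 - drop_fraction)
    return distances <= threshold
```

The returned mask would simply filter the training set before fitting the network.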
Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
We tackle the problem of audiovisual scene analysis for weakly-labeled data.
To this end, we build upon our previous audiovisual representation learning
framework to perform object classification in noisy acoustic environments and
integrate audio source enhancement capability. This is made possible by a novel
use of non-negative matrix factorization for the audio modality. Our approach
is founded on the multiple instance learning paradigm. Its effectiveness is
established through experiments over a challenging dataset of music instrument
performance videos. We also show encouraging visual object localization
results.
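The abstract does not detail its novel use of non-negative matrix factorization (NMF), so the following is only a generic sketch of NMF-based spectrogram decomposition with a Wiener-like soft mask for source enhancement, assuming scikit-learn's NMF; the helper names and the masking scheme are illustrative, not the paper's formulation.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_decompose(V, n_components=16, seed=0):
    """Factor a magnitude spectrogram V (n_freq, n_frames) as V ~= W @ H."""
    model = NMF(n_components=n_components, init="random",
                beta_loss="kullback-leibler", solver="mu",
                max_iter=300, random_state=seed)
    W = model.fit_transform(V)   # spectral templates, (n_freq, n_components)
    H = model.components_        # activations, (n_components, n_frames)
    return W, H

def soft_mask_separate(V, W, H, keep):
    """Estimate a source by keeping only the components listed in `keep`."""
    V_hat = W @ H + 1e-12                  # full reconstruction
    V_target = W[:, keep] @ H[keep, :]     # target components only
    return (V_target / V_hat) * V          # Wiener-like soft mask
```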
Towards joint sound scene and polyphonic sound event recognition
Acoustic Scene Classification (ASC) and Sound Event Detection (SED) are two
separate tasks in the field of computational sound scene analysis. In this
work, we present a new dataset with both sound scene and sound event labels and
use this to demonstrate a novel method for jointly classifying sound scenes and
recognizing sound events. We show that by taking a joint approach, learning is
more efficient, and while improvements are still needed for sound event detection, SED results remain robust in a dataset whose sample distribution is skewed towards sound scenes.
Comment: Accepted to Interspeech 201
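As a hedged illustration of what a joint model might look like (the architecture below is an assumption, not the paper's), a shared encoder can feed two heads: a clip-level scene classifier and a frame-level multi-label event detector.

```python
import torch
import torch.nn as nn

class JointScenesEvents(nn.Module):
    """Shared CNN encoder with two heads: scene classification (one label
    per clip) and polyphonic event detection (multi-label per frame)."""
    def __init__(self, n_scenes=10, n_events=20, n_mels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),  # pool frequency, keep time resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        feat = 64 * (n_mels // 16)
        self.scene_head = nn.Linear(feat, n_scenes)   # softmax over scenes
        self.event_head = nn.Linear(feat, n_events)   # sigmoid per event

    def forward(self, x):                  # x: (batch, 1, n_mels, n_frames)
        h = self.encoder(x)                # (batch, 64, n_mels//16, n_frames)
        h = h.flatten(1, 2).transpose(1, 2)        # (batch, n_frames, feat)
        scene_logits = self.scene_head(h.mean(dim=1))   # clip-level
        event_logits = self.event_head(h)               # frame-level
        return scene_logits, event_logits
```

Training would combine a cross-entropy loss on the scene logits with a binary cross-entropy loss on the event logits, which is one way a joint approach can share representation cost across the two tasks.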
Audio Classification of Bit-Representation Waveform
This study investigated the waveform representation for audio signal
classification. Recently, many studies on audio waveform classification such as
acoustic event detection and music genre classification have been published.
Most studies on audio waveform classification have proposed the use of a deep
learning (neural network) framework. Generally, a frequency analysis method such as the Fourier transform is applied to extract frequency or spectral information from the audio signal before it is input into the neural network, rather than feeding in the raw waveform directly. In contrast to these previous studies, in
this paper, we propose a novel waveform representation method, in which audio
waveforms are represented as a bit sequence, for audio classification. In our
experiment, we compare the proposed bit representation waveform, which is
directly given to a neural network, to other representations of audio waveforms
such as a raw audio waveform and a power spectrum with two classification
tasks: one is an acoustic event classification task and the other is a
sound/music classification task. The experimental results showed that the bit-representation waveform achieved the best classification performance on both tasks.
Comment: Accepted at INTERSPEECH201
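The abstract does not specify the exact bit encoding. A minimal sketch, assuming 16-bit PCM input with the most significant bit of each sample emitted first, could look like this:

```python
import numpy as np

def waveform_to_bits(samples):
    """Represent a 16-bit PCM waveform as a flat 0/1 bit sequence.

    samples: int16 array of shape (n_samples,)
    returns: uint8 array of shape (n_samples * 16,) with values in {0, 1}
    """
    # View each signed 16-bit sample as raw bytes (big-endian, so the
    # sign/most-significant bit comes first), then expand bytes into bits.
    raw = samples.astype(">i2").view(np.uint8)
    return np.unpackbits(raw)
```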
A Compact and Discriminative Feature Based on Auditory Summary Statistics for Acoustic Scene Classification
One of the biggest challenges of acoustic scene classification (ASC) is to
find proper features to better represent and characterize environmental sounds.
Environmental sounds generally involve more sound sources while exhibiting less
structure in temporal spectral representations. However, the background of an
acoustic scene exhibits temporal homogeneity in acoustic properties, suggesting
it could be characterized by distribution statistics rather than temporal
details. In this work, we investigated using auditory summary statistics as the
feature for ASC tasks. The inspiration comes from a recent neuroscience study,
which shows the human auditory system tends to perceive sound textures through
time-averaged statistics. Based on these statistics, we further proposed using linear discriminant analysis to eliminate redundancies among them while keeping the discriminative information, providing an extremely compact representation for acoustic scenes. Experimental results show the outstanding performance of the proposed feature over conventional handcrafted features.
Comment: Accepted as a conference paper of Interspeech 201
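As a rough sketch of the pipeline (the statistics in the paper follow auditory texture models and are richer than the simple moments used here; the function names are illustrative assumptions):

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def summary_statistics(envelopes):
    """Time-averaged statistics of subband envelopes.

    envelopes: (n_bands, n_frames) array, e.g. mel-band magnitudes.
    Returns one fixed-length vector per clip.
    """
    return np.concatenate([envelopes.mean(axis=1),
                           envelopes.std(axis=1),
                           skew(envelopes, axis=1),
                           kurtosis(envelopes, axis=1)])

def compact_features(X, y):
    """Project summary-statistic vectors X (n_clips, n_stats) with labels y
    onto at most (n_classes - 1) discriminative dimensions via LDA."""
    lda = LinearDiscriminantAnalysis()
    return lda.fit_transform(X, y), lda
```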
Ensemble Of Deep Neural Networks For Acoustic Scene Classification
Deep neural networks (DNNs) have recently achieved great success in a
multitude of classification tasks. Ensembles of DNNs have been shown to improve performance further. In this paper, we explore recent state-of-the-art DNNs used for image classification, modify them, and apply them to the task of acoustic scene classification. We conduct a number of experiments on the TUT Acoustic Scenes 2017 dataset to empirically compare these methods. Finally, we show that the best model improves the baseline score for DCASE-2017 Task 1 by 3.1% on the test set and by 10% on the development set.
Comment: Detection and Classification of Acoustic Scenes and Events 201
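The abstract does not state the fusion rule used by the ensemble. Averaging the class probabilities of the member networks is one common choice; a minimal sketch, assuming each trained model exposes a predict_proba method:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average class-probability outputs of several trained models.

    models: list of objects with predict_proba(x) -> (n_examples, n_classes)
    Returns the index of the highest mean probability per example.
    """
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return probs.argmax(axis=1)
```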
Cost-sensitive detection with variational autoencoders for environmental acoustic sensing
Environmental acoustic sensing involves the retrieval and processing of audio
signals to better understand our surroundings. While large-scale acoustic data
make manual analysis infeasible, they provide a suitable playground for machine
learning approaches. Most existing machine learning techniques developed for
environmental acoustic sensing do not provide flexible control of the trade-off
between the false positive rate and the false negative rate. This paper
presents a cost-sensitive classification paradigm, in which the
hyper-parameters of classifiers and the structure of variational autoencoders
are selected in a principled Neyman-Pearson framework. We examine the
performance of the proposed approach using a dataset from the HumBug project
which aims to detect the presence of mosquitoes using sound collected by simple
embedded devices.
Comment: Presented at the NIPS 2017 Workshop on Machine Learning for Audio Signal Processing
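In its simplest thresholding form, the Neyman-Pearson idea constrains the false positive rate and then maximizes detection power under that constraint. A minimal sketch (the paper's model-selection procedure is more involved; the function below is an illustrative assumption):

```python
import numpy as np

def neyman_pearson_threshold(scores_negative, alpha=0.01):
    """Pick a decision threshold so the false positive rate is at most alpha.

    scores_negative: classifier scores on held-out negative (no-mosquito)
    examples. Any example scoring above the returned threshold is flagged.
    """
    return np.quantile(scores_negative, 1.0 - alpha)

# With alpha = 0.01, roughly 1% of negatives exceed the threshold; model and
# hyper-parameter selection can then maximize the detection rate (i.e.
# minimize false negatives) subject to this constraint.
```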
DNN and CNN with Weighted and Multi-task Loss Functions for Audio Event Detection
This report presents our audio event detection system submitted for Task 2,
"Detection of rare sound events", of DCASE 2017 challenge. The proposed system
is based on convolutional neural networks (CNNs) and deep neural networks
(DNNs) coupled with novel weighted and multi-task loss functions and
state-of-the-art phase-aware signal enhancement. The loss functions are
tailored for audio event detection in audio streams. The weighted loss is
designed to tackle the common issue of imbalanced data in background/foreground
classification while the multi-task loss enables the networks to simultaneously
model the class distribution and the temporal structures of the target events
for recognition. Our proposed systems significantly outperform the challenge
baseline, improving F-score from 72.7% to 90.0% and reducing detection error
rate from 0.53 to 0.18 on average on the development data. On the evaluation
data, our submission obtains an average F1-score of 88.3% and an error rate of
0.22 which are significantly better than those obtained by the DCASE baseline
(i.e. an F1-score of 64.1% and an error rate of 0.64).
Comment: DCASE 2017 technical report
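The exact weighting scheme is not given in the abstract; one standard way to realize a weighted loss for imbalanced background/foreground classification, sketched here in PyTorch with an assumed weight value:

```python
import torch
import torch.nn.functional as F

def weighted_bce(logits, targets, pos_weight=10.0):
    """Binary cross-entropy that up-weights the rare foreground class.

    In rare-event detection most frames are background, so errors on the
    few positive (event) frames are weighted more heavily.
    """
    w = torch.ones_like(targets)
    w[targets > 0.5] = pos_weight
    return F.binary_cross_entropy_with_logits(logits, targets, weight=w)

# torch.nn.BCEWithLogitsLoss(pos_weight=...) offers a built-in equivalent.
```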
Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation
In recent years, neural network approaches have shown superior performance to
conventional hand-crafted features in numerous application areas. In particular,
convolutional neural networks (ConvNets) exploit spatially local correlations
across input data to improve the performance of audio processing tasks, such as
speech recognition, musical chord recognition, and onset detection. Here we
apply ConvNet to acoustic scene classification, and show that the error rate
can be further decreased by using delta features in the frequency domain. We
propose a multiple-width frequency-delta (MWFD) data augmentation method that
uses static mel-spectrogram and frequency-delta features as individual input
examples. In addition, we describe a ConvNet output aggregation method designed
for MWFD augmentation, folded mean aggregation, which combines output
probabilities of static and MWFD features from the same analysis window using
multiplication first, rather than taking an average of all output
probabilities. We describe calculation results using the DCASE 2016 challenge
dataset, which shows that ConvNet outperforms both of the baseline system with
hand-crafted features and a deep neural network approach by around 7%. The
performance was further improved (by 5.7%) using the MWFD augmentation together
with folded mean aggregation. The system exhibited a classification accuracy of
0.831 when classifying 15 acoustic scenes.Comment: 11 pages, 5 figures, submitted to IEEE/ACM Transactions on Audio,
Speech, and Language Processing on 08-July-201
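As a hedged sketch of the two ingredients (the regression-style delta and the aggregation details are assumptions based on the abstract, not the paper's verbatim formulation):

```python
import numpy as np

def frequency_delta(mel_spec, width=2):
    """Regression-based delta computed along the frequency axis.

    mel_spec: (n_mels, n_frames); width sets the neighbourhood half-size,
    so several widths yield the multiple-width (MWFD) input variants.
    """
    n = mel_spec.shape[0]
    padded = np.pad(mel_spec, ((width, width), (0, 0)), mode="edge")
    delta = np.zeros_like(mel_spec, dtype=float)
    for k in range(1, width + 1):
        delta += k * (padded[width + k:width + k + n, :]
                      - padded[width - k:width - k + n, :])
    return delta / (2 * sum(k * k for k in range(1, width + 1)))

def folded_mean(prob_sets):
    """Fuse per-window class probabilities of the static input and its
    delta variants: multiply across variants first, then average windows.

    prob_sets: (n_variants, n_windows, n_classes) array of model outputs.
    """
    fused = np.prod(prob_sets, axis=0)   # (n_windows, n_classes)
    return fused.mean(axis=0)            # (n_classes,)
```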
Deep Learning Algorithms with Applications to Video Analytics for A Smart City: A Survey
Deep learning has recently achieved very promising results in a wide range of
areas such as computer vision, speech recognition and natural language
processing. It aims to learn hierarchical representations of data by using deep
architecture models. In a smart city, a lot of data (e.g. videos captured from
many distributed sensors) need to be automatically processed and analyzed. In
this paper, we review deep learning algorithms applied to video analytics in a smart city in terms of different research topics: object detection, object tracking, face recognition, image classification, and scene labeling.
Comment: 8 pages, 18 figures