Reducing Model Complexity for DNN Based Large-Scale Audio Classification
Audio classification is the task of identifying the sound categories that are
associated with a given audio signal. This paper presents an investigation of
large-scale audio classification based on the recently released AudioSet
database. AudioSet comprises 2 million audio samples from YouTube, which
are human-annotated with 527 sound category labels. Audio classification
experiments with the balanced training set and the evaluation set of AudioSet
are carried out by applying different types of neural network models. The
classification performance and the model complexity of these models are
compared and analyzed. While the CNN models show better performance than the
MLP and RNN models, their model complexity is relatively high and undesirable
for practical use. We propose two different strategies that aim at constructing
low-dimensional embedding feature extractors and hence reducing the number of
model parameters. It is shown that the simplified CNN model has only 1/22 of
the parameters of the original model, with only a slight degradation in
performance.
Comment: Accepted by ICASSP 201
Complexity-entropy causality plane: a useful approach for distinguishing songs
Nowadays we are often faced with huge databases resulting from the rapid
growth of data storage technologies. This is particularly true when dealing
with music databases. In this context, it is essential to have techniques and
tools able to discriminate properties of these massive data sets. In this work,
we report on a statistical analysis of more than ten thousand songs, aiming to
obtain a complexity hierarchy. Our approach is based on the estimation of the
permutation entropy combined with an intensive complexity measure, building up
the complexity-entropy causality plane. The results obtained indicate that this
representation space is very promising for discriminating songs, as well as for
allowing relative quantitative comparisons among them. Additionally, we believe
that the method reported here may be applied in practical situations, since it
is simple, robust, and has a fast numerical implementation.
Comment: Accepted for publication in Physica
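The coordinates of the plane can be reproduced with a short sketch using the standard Bandt-Pompe ordinal distribution and the Jensen-Shannon statistical complexity the abstract alludes to; the embedding dimension and the input signal below are illustrative assumptions.

```python
# Minimal sketch of the complexity-entropy causality plane coordinates
# (Bandt-Pompe permutation entropy H, MPR statistical complexity C).
import numpy as np
from itertools import permutations
from math import factorial, log

def ordinal_distribution(x, d=4):
    """Relative frequencies of ordinal patterns of length d (Bandt-Pompe)."""
    counts = {p: 0 for p in permutations(range(d))}
    for i in range(len(x) - d + 1):
        counts[tuple(np.argsort(x[i:i + d]))] += 1
    p = np.array(list(counts.values()), dtype=float)
    return p / p.sum()

def shannon(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def complexity_entropy(x, d=4):
    p = ordinal_distribution(x, d)
    n = factorial(d)
    H = shannon(p) / log(n)             # normalized permutation entropy
    u = np.full(n, 1.0 / n)             # uniform reference distribution
    # Jensen-Shannon divergence between p and the uniform distribution,
    # normalized by its maximum, gives the disequilibrium factor.
    JS = shannon((p + u) / 2) - shannon(p) / 2 - shannon(u) / 2
    JS_max = -0.5 * ((n + 1) / n * log(n + 1) - 2 * log(2 * n) + log(n))
    C = (JS / JS_max) * H               # statistical complexity
    return H, C

H, C = complexity_entropy(np.random.randn(10000))  # white noise: high H, low C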
Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events
In this paper, we propose a new strategy for acoustic scene classification
(ASC), namely recognizing acoustic scenes through identifying distinct sound
events. This differs from existing strategies, which focus on characterizing
global acoustical distributions of audio or the temporal evolution of
short-term audio features, without analysis down to the level of sound events.
To identify distinct sound events for each scene, we formulate ASC in a
multi-instance learning (MIL) framework, where each audio recording is mapped
into a bag-of-instances representation. Here, instances can be seen as
high-level representations of sound events inside a scene. We also propose an
MIL neural network model, which implicitly identifies distinct instances
(i.e., sound events). Furthermore, we propose two specially designed modules
that model the multiple temporal scales and the multi-modal nature of sound
events, respectively. The experiments were conducted on the official
development set of
the DCASE2018 Task1 Subtask B, and our best-performing model improves over the
official baseline by 9.4% (68.3% vs 58.9%) in terms of classification accuracy.
This study indicates that recognizing acoustic scenes by identifying distinct
sound events is effective and paves the way for future studies that combine
this strategy with previous ones.
Comment: code URL typo, code is available at
https://github.com/hackerekcah/distinct-events-asc.gi
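The MIL formulation can be sketched compactly: each recording becomes a bag of instance embeddings, and a pooling step aggregates instance-level evidence into a scene prediction. The layer sizes and the max-pooling choice below are illustrative assumptions, not the paper's exact model.

```python
# Hypothetical sketch of the MIL view of ASC: instances -> bag -> scene logits.
import torch
import torch.nn as nn

class MILSceneClassifier(nn.Module):
    def __init__(self, feat_dim=64, embed_dim=128, n_scenes=10):
        super().__init__()
        # Maps each instance (e.g., a short audio segment) to a high-level
        # representation intended to capture a sound event.
        self.instance_net = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
        )
        self.instance_scorer = nn.Linear(embed_dim, n_scenes)

    def forward(self, bag):               # bag: (n_instances, feat_dim)
        h = self.instance_net(bag)        # instance-level embeddings
        scores = self.instance_scorer(h)  # per-instance scene evidence
        # Max pooling keeps only the most distinctive instance per scene,
        # implicitly identifying the distinct sound events.
        return scores.max(dim=0).values

model = MILSceneClassifier()
logits = model(torch.randn(20, 64))  # a bag of 20 instances -> 10 scene logits
```

Max pooling is one of several MIL aggregators (attention pooling is another common choice); whichever is used, the gradient flows mainly through the instances that carry the scene-discriminative events.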
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
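Of the dominant feature representations the review names, the log-mel spectrogram is the most common CNN/RNN input and takes only a few lines to compute; the parameter values below are common defaults, not prescriptions from the article, and the input file name is a placeholder.

```python
# Sketch of log-mel feature extraction, the dominant input representation
# discussed in the review. Parameter values are typical defaults.
import librosa

y, sr = librosa.load("example.wav", sr=16000)  # hypothetical input file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel)             # log compression
# log_mel has shape (n_mels, n_frames) and feeds directly into a CNN or RNN.
```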
Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
Comment: Under submission as a conference paper
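The training idea, a cross-modal matching objective plus a term enforcing separation between different clips within each modality, can be sketched as below. The exact losses are not specified in the abstract; this uses an InfoNCE-style formulation and an illustrative weighting as stand-ins.

```python
# Hedged sketch: cross-modal matching loss + intra-modal feature separation.
import torch
import torch.nn.functional as F

def cross_modal_loss(audio_emb, visual_emb, temperature=0.07, weight=0.1):
    """audio_emb, visual_emb: (batch, dim); row i of each comes from the
    same clip, so the diagonal pairs are the cross-modal positives."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_emb, dim=1)
    # Cross-modal objective: audio i should match video i among the batch.
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    xmodal = F.cross_entropy(logits, targets)
    # Intra-modal separation: penalize similarity between embeddings of
    # different clips within the audio modality (off-diagonal entries).
    n = a.size(0)
    off_diag = (a @ a.t())[~torch.eye(n, dtype=torch.bool, device=a.device)]
    return xmodal + weight * off_diag.mean()  # weight is an assumption
```

The second term is what makes the embeddings useful for uni-modal downstream tasks such as lip reading and speaker recognition: cross-modal synchrony alone does not stop same-modality clips from collapsing together.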