22,306 research outputs found
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.Comment: 15 pages, 2 pdf figure
Attentive Adversarial Learning for Domain-Invariant Training
Adversarial domain-invariant training (ADIT) proves to be effective in
suppressing the effects of domain variability in acoustic modeling and has led
to improved performance in automatic speech recognition (ASR). In ADIT, an
auxiliary domain classifier takes in equally-weighted deep features from a deep
neural network (DNN) acoustic model and is trained to improve their
domain-invariance by optimizing an adversarial loss function. In this work, we
propose an attentive ADIT (AADIT) in which we advance the domain classifier
with an attention mechanism to automatically weight the input deep features
according to their importance in domain classification. With this attentive
re-weighting, AADIT can focus on the domain normalization of phonetic
components that are more susceptible to domain variability and generates deep
features with improved domain-invariance and senone-discriminativity over ADIT.
Most importantly, the attention block serves only as an external component to
the DNN acoustic model and is not involved in ASR, so AADIT can be used to
improve the acoustic modeling with any DNN architectures. More generally, the
same methodology can improve any adversarial learning system with an auxiliary
discriminator. Evaluated on CHiME-3 dataset, the AADIT achieves 13.6% and 9.3%
relative WER improvements, respectively, over a multi-conditional model and a
strong ADIT baseline.Comment: 5 pages, 1 figure, ICASSP 201
- …