Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
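As a minimal illustration of the log-mel feature representation highlighted above, the sketch below computes a log-mel spectrogram with librosa; the library choice, file name, and parameter values are assumptions for illustration, not taken from the article:

import numpy as np
import librosa

# Load audio (placeholder file name) and compute a log-mel spectrogram,
# one of the dominant input representations reviewed in the article.
y, sr = librosa.load("example.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(mel + 1e-6)  # log compression; small offset avoids log(0)
print(log_mel.shape)          # (n_mels, n_frames)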
A review of differentiable digital signal processing for music and speech synthesis
The term “differentiable digital signal processing” describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music and speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably, which is further supported by a web book containing practical advice on differentiable synthesiser programming (https://intro2ddsp.github.io/). Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research.
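To make the core idea concrete, here is a minimal sketch (assuming PyTorch; this is not the paper's code) of a differentiable additive synthesiser whose harmonic amplitudes are trained by backpropagating a waveform loss through the signal processor itself:

import torch

# Differentiable additive (harmonic) synthesiser: the DSP output is a
# differentiable function of its parameters, so loss gradients flow back to them.
sr, f0, n_harm, n_samp = 16000, 220.0, 8, 4096
t = torch.arange(n_samp) / sr
k = torch.arange(1, n_harm + 1).float()[:, None]        # harmonic numbers
harmonics = torch.sin(2 * torch.pi * f0 * k * t)        # (n_harm, n_samp)

target_amps = torch.tensor([1.0, 0.5, 0.25, 0.1, 0.0, 0.0, 0.0, 0.0])
target = target_amps @ harmonics                        # reference signal to match

amps = torch.zeros(n_harm, requires_grad=True)          # synthesiser parameters
opt = torch.optim.Adam([amps], lr=0.05)
for step in range(500):
    pred = amps @ harmonics                             # differentiable synthesis
    loss = torch.mean((pred - target) ** 2)             # waveform-domain loss
    opt.zero_grad()
    loss.backward()                                     # gradients reach the DSP parameters
    opt.step()
print(loss.item())                                      # approaches zero

Amplitudes are used here because they enter the synthesis linearly and optimise well under a waveform loss; the optimisation pathologies the abstract mentions arise with less well-behaved parameters such as oscillator frequencies.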
Investigating gated recurrent neural networks for speech synthesis
Recently, recurrent neural networks (RNNs), as powerful sequence models, have
re-emerged as a potential acoustic model for statistical parametric speech
synthesis (SPSS). The long short-term memory (LSTM) architecture is
particularly attractive because it addresses the vanishing gradient problem in
standard RNNs, making them easier to train. Although recent studies have
demonstrated that LSTMs can achieve significantly better performance on SPSS
than deep feed-forward neural networks, little is known about why. Here we
attempt to answer two questions: a) why do LSTMs work well as a sequence model
for SPSS; b) which component (e.g., input gate, output gate, forget gate) is
most important. We present a visual analysis alongside a series of experiments,
resulting in a proposal for a simplified architecture. The simplified
architecture has significantly fewer parameters than an LSTM, thus reducing
generation complexity considerably without degrading quality.
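For reference, the gates analysed in the paper appear explicitly in one step of a standard LSTM cell; the sketch below (assuming PyTorch, and a generic textbook formulation rather than the paper's exact model or its simplified architecture) spells them out:

import torch

# One step of a standard LSTM cell, with the gates named explicitly.
# Shapes: x (input_dim,), h_prev/c_prev (hidden,), W (4*hidden, input_dim),
# U (4*hidden, hidden), b (4*hidden,).
def lstm_step(x, h_prev, c_prev, W, U, b):
    hidden = h_prev.shape[0]
    gates = W @ x + U @ h_prev + b
    i = torch.sigmoid(gates[0 * hidden:1 * hidden])  # input gate
    f = torch.sigmoid(gates[1 * hidden:2 * hidden])  # forget gate
    o = torch.sigmoid(gates[2 * hidden:3 * hidden])  # output gate
    g = torch.tanh(gates[3 * hidden:4 * hidden])     # candidate cell update
    c = f * c_prev + i * g                           # new cell state
    h = o * torch.tanh(c)                            # new hidden state
    return h, c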
Improving generalization of vocal tract feature reconstruction: from augmented acoustic inversion to articulatory feature reconstruction without articulatory data
We address the problem of reconstructing articulatory movements, given audio
and/or phonetic labels. The scarce availability of multi-speaker articulatory
data makes it difficult to learn a reconstruction that generalizes to new
speakers and across datasets. We first consider the XRMB dataset where audio,
articulatory measurements and phonetic transcriptions are available. We show
that phonetic labels, used as input to deep recurrent neural networks that
reconstruct articulatory features, are in general more helpful than acoustic
features in both matched and mismatched training-testing conditions. In a
second experiment, we test a novel approach that attempts to build articulatory
features from prior articulatory information extracted from phonetic labels.
This approach recovers vocal tract movements directly from an acoustic-only
dataset without using any articulatory measurement. Results show that
articulatory features generated by this approach achieve a Pearson
product-moment correlation of up to 0.59 with measured articulatory features.
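As a rough illustration of how such a result can be evaluated (the function, array names, and placeholder data below are assumptions, not the paper's code), the per-channel Pearson correlation between reconstructed and measured trajectories can be computed as:

import numpy as np

# Pearson product-moment correlation per articulatory channel, for
# reconstructed (pred) and measured (meas) arrays of shape (n_frames, n_channels).
def pearson_per_channel(pred, meas):
    pred = pred - pred.mean(axis=0)
    meas = meas - meas.mean(axis=0)
    num = (pred * meas).sum(axis=0)
    den = np.sqrt((pred ** 2).sum(axis=0) * (meas ** 2).sum(axis=0))
    return num / den

# Placeholder data standing in for model output and articulatory measurements.
rng = np.random.default_rng(0)
meas = rng.standard_normal((500, 12))
pred = meas + 0.8 * rng.standard_normal((500, 12))   # noisy reconstruction
print(pearson_per_channel(pred, meas).mean())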