Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement
The majority of deep neural network (DNN) based speech enhancement algorithms
rely on the mean-square error (MSE) criterion of short-time spectral amplitudes
(STSA), which has no apparent link to human perception, e.g. speech
intelligibility. Short-Time Objective Intelligibility (STOI), a popular
state-of-the-art speech intelligibility estimator, on the other hand, relies on
linear correlation of speech temporal envelopes. This raises the question of whether a
DNN training criterion based on envelope linear correlation (ELC) can lead to
improved speech intelligibility performance of DNN based speech enhancement
algorithms compared to algorithms based on the STSA-MSE criterion. In this
paper we derive that, under certain general conditions, the STSA-MSE and ELC
criteria are practically equivalent, and we provide empirical data to support
our theoretical results. Furthermore, our experimental findings suggest that
the standard STSA minimum-MSE estimator is near optimal, if the objective is to
enhance noisy speech in a manner which is optimal with respect to the STOI
speech intelligibility estimator.
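
To make the two criteria in the abstract above concrete, here is a minimal sketch that computes a mean-square error and a linear (Pearson) correlation on a pair of toy envelope sequences. It is not the paper's implementation: the function names, the toy values, and the use of plain Python lists in place of per-band temporal envelopes are all assumptions made for illustration.

```python
import math


def mse(target, estimate):
    """Mean-square error between two equal-length sequences."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


def envelope_linear_correlation(target, estimate):
    """Pearson correlation between two equal-length sequences.

    Returns 0.0 when either sequence is constant (zero variance).
    """
    n = len(target)
    mean_t = sum(target) / n
    mean_e = sum(estimate) / n
    dt = [t - mean_t for t in target]
    de = [e - mean_e for e in estimate]
    num = sum(a * b for a, b in zip(dt, de))
    den = math.sqrt(sum(a * a for a in dt) * sum(b * b for b in de))
    return num / den if den > 0 else 0.0


# Hypothetical toy envelopes: "scaled" is the clean envelope times a gain.
clean = [0.2, 0.9, 0.4, 1.3, 0.7]
scaled = [2.0 * x for x in clean]

print("MSE:", mse(clean, scaled))
print("ELC:", envelope_linear_correlation(clean, scaled))
```

In this toy case the correlation ignores the gain while the MSE does not. The equivalence result stated in the abstract holds only under the conditions derived in the paper, which this sketch does not attempt to model.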
Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure
In this paper we propose a Deep Neural Network (DNN) based Speech Enhancement
(SE) system that is designed to maximize an approximation of the Short-Time
Objective Intelligibility (STOI) measure. We formalize an approximate-STOI cost
function and derive analytical expressions for the gradients required for DNN
training and show that these gradients have desirable properties when used
together with gradient based optimization techniques. We show through
simulation experiments that the proposed SE system achieves large improvements
in estimated speech intelligibility, when tested on matched and unmatched
natural noise types, at multiple signal-to-noise ratios. Furthermore, we show
that the SE system, when trained using an approximate-STOI cost function,
performs on par with a system trained with a mean square error cost applied to
short-time temporal envelopes. Finally, we show that the proposed SE system
performs on par with a traditional DNN based Short-Time Spectral Amplitude
(STSA) SE system in terms of estimated speech intelligibility. These results
are important because they suggest that traditional DNN based STSA SE systems
might be optimal in terms of estimated speech intelligibility.
Comment: To appear in ICASSP 201
On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement
Many deep learning-based speech enhancement algorithms are designed to
minimize the mean-square error (MSE) in some transform domain between a
predicted and a target speech signal. However, optimizing for MSE does not
necessarily guarantee high speech quality or intelligibility, which is the
ultimate goal of many speech enhancement algorithms. Additionally, little
is known about the impact of the loss function on the emerging class of
time-domain deep learning-based speech enhancement systems. We study how
popular loss functions influence the performance of deep learning-based speech
enhancement systems. First, we demonstrate that perceptually inspired loss
functions might be advantageous if the receiver is the human auditory system.
Furthermore, we show that the learning rate is a crucial design parameter even
for adaptive gradient-based optimizers, which has been generally overlooked in
the literature. Also, we find that waveform matching performance metrics must
be used with caution, as they can fail completely in certain situations.
Finally, we show that a loss function based on scale-invariant
signal-to-distortion ratio (SI-SDR) achieves good general performance across a
range of popular speech enhancement evaluation metrics, which suggests that
SI-SDR is a good candidate as a general-purpose loss function for speech
enhancement systems.
Comment: Published in the IEEE Transactions on Audio, Speech and Language
Processing
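
For readers unfamiliar with the SI-SDR measure named in the abstract above, the sketch below computes it using the commonly used definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name, the `eps` stabiliser, and the toy signals are assumptions for illustration. The paper's own training setup is not reproduced here.

```python
import math


def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB.

    Both inputs are equal-length sequences of samples. `eps` is a small
    constant (an assumption of this sketch) that avoids division by zero.
    """
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference towards the estimate.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Hypothetical toy signals: a reference and two increasingly distorted estimates.
reference = [1.0, -0.5, 0.25, 0.75]
estimate = [1.1, -0.6, 0.3, 0.75]
noisier = [1.5, -1.0, 0.5, 0.75]

print("SI-SDR (estimate):", si_sdr(reference, estimate))
print("SI-SDR (noisier): ", si_sdr(reference, noisier))
```

When used as a training loss, the negated value is typically minimised. Rescaling the estimate by a non-zero gain leaves the result unchanged up to the effect of `eps`, which is the sense in which the measure is scale-invariant.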