End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization
Supervised learning based on deep neural networks has recently achieved
substantial improvements in speech enhancement. Denoising networks learn a mapping
from noisy speech either directly to clean speech or to a spectrum mask, which is the
ratio between the clean and noisy spectra. In either case, the network is optimized
by minimizing mean square error (MSE) between ground-truth labels and
time-domain or spectrum output. However, existing schemes suffer from either of two
critical issues: spectrum mismatch and metric mismatch. The spectrum mismatch is the
well-known problem that a spectrum modified after the short-time Fourier
transform (STFT) cannot, in general, be fully recovered after the inverse
short-time Fourier transform (ISTFT). The metric mismatch is that the
conventional MSE loss is sub-optimal for maximizing the target metrics,
signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality
(PESQ). This paper presents a new end-to-end denoising framework with the goal
of joint SDR and PESQ optimization. First, the network optimization is
performed on the time-domain signals after ISTFT to avoid spectrum mismatch.
Second, two loss functions which have improved correlations with SDR and PESQ
metrics are proposed to minimize metric mismatch. The experimental results
showed that the proposed denoising scheme significantly improved both SDR and
PESQ performance over the existing methods.
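To make the metric mismatch concrete, the sketch below computes MSE and a plain energy-ratio SDR on time-domain signals. It is only an illustration: the function names `mse` and `sdr_db` and the `eps` stabilizer are assumptions, the SDR here is the simple signal-to-error energy ratio rather than the BSS Eval variant, and neither function is the loss proposed in the paper.

```python
import numpy as np


def mse(clean, estimate):
    """Mean squared error between two time-domain signals."""
    clean = np.asarray(clean, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    return float(np.mean((clean - estimate) ** 2))


def sdr_db(clean, estimate, eps=1e-12):
    """Energy-ratio SDR in dB: clean-signal energy over error energy.

    eps is a hypothetical stabilizer that avoids division by zero.
    """
    clean = np.asarray(clean, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    error = clean - estimate
    ratio = (np.sum(clean ** 2) + eps) / (np.sum(error ** 2) + eps)
    return float(10.0 * np.log10(ratio))


# One second of a 440 Hz tone at 16 kHz, with an estimate that is 10% too loud.
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
estimate = 1.1 * clean

print(mse(clean, estimate))     # absolute error, depends on signal level
print(sdr_db(clean, estimate))  # about 20 dB: the error amplitude is 10% of the signal
```

MSE depends on the absolute signal level, whereas this SDR is a ratio and stays the same when both signals are rescaled together. That difference is one simple way in which minimizing MSE does not directly maximize SDR.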
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
Attention-based Speech Enhancement Using Human Quality Perception Modelling
Perceptually-inspired objective functions, such as the perceptual evaluation
of speech quality (PESQ), signal-to-distortion ratio (SDR), and short-time
objective intelligibility (STOI), have recently been used to optimize
performance of deep-learning-based speech enhancement algorithms. These
objective functions, however, do not always strongly correlate with a
listener's assessment of perceptual quality, so optimizing with these measures
often results in poorer performance in real-world scenarios. In this work, we
propose an attention-based enhancement approach that uses learned speech
embedding vectors from a mean-opinion score (MOS) prediction model and a speech
enhancement module to jointly enhance noisy speech. The MOS prediction model
estimates the perceptual MOS of speech quality, as assessed by human listeners,
directly from the audio signal. The enhancement module also employs a quantized
language model that enforces spectral constraints for better speech realism and
performance. We train the model using real-world noisy speech data that has
been captured in everyday environments and test it using unseen corpora. The
results show that our proposed approach significantly outperforms other
approaches that are optimized with objective measures, where the predicted
quality scores strongly correlate with human judgments.
Comment: 11 pages, 4 figures, 3 tables, submitted in journal TASLP 202
Single-Microphone Speech Enhancement and Separation Using Deep Learning
The cocktail party problem comprises the challenging task of understanding a
speech signal in a complex acoustic environment, where multiple speakers and
background noise signals simultaneously interfere with the speech signal of
interest. A signal processing algorithm that can effectively increase the
speech intelligibility and quality of speech signals in such complicated
acoustic situations is highly desirable, especially for applications involving
mobile communication devices and hearing assistive devices. Due to the
re-emergence of machine learning techniques, today known as deep learning, the
challenges involved with such algorithms might be overcome. In this PhD thesis,
we study and develop deep learning-based techniques for two sub-disciplines of
the cocktail party problem: single-microphone speech enhancement and
single-microphone multi-talker speech separation. Specifically, we conduct
in-depth empirical analysis of the generalization capability of modern deep
learning-based single-microphone speech enhancement algorithms. We show that
performance of such algorithms is closely linked to the training data, and good
generalizability can be achieved with carefully designed training data.
Furthermore, we propose uPIT, a deep learning-based algorithm for
single-microphone speech separation and we report state-of-the-art results on a
speaker-independent multi-talker speech separation task. Additionally, we show
that uPIT works well for joint speech separation and enhancement without
explicit prior knowledge about the noise type or number of speakers. Finally,
we show that deep learning-based speech enhancement algorithms designed to
minimize the classical short-time spectral amplitude mean squared error lead
to enhanced speech signals which are essentially optimal in terms of STOI, a
state-of-the-art speech intelligibility estimator.
Comment: PhD Thesis. 233 page
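The abstract does not spell out how uPIT works. The name stands for utterance-level permutation invariant training, and the core idea can be sketched as follows: score every assignment of network outputs to reference speakers over the whole utterance and train on the best one. The function name `upit_mse`, the array layout `(speakers, frames, bins)`, and the use of plain MSE are assumptions made for illustration, not the thesis implementation.

```python
import itertools

import numpy as np


def upit_mse(estimates, targets):
    """Return (loss, permutation) for the best output-to-speaker assignment.

    estimates, targets: arrays of shape (num_speakers, frames, bins).
    The permutation is chosen once for the whole utterance, so each
    output stream stays tied to one speaker across all frames.
    """
    estimates = np.asarray(estimates, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    num_speakers = targets.shape[0]

    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        # Reorder the outputs so that output perm[i] is compared with target i.
        loss = float(np.mean((estimates[list(perm)] - targets) ** 2))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two speakers whose outputs come back in swapped order.
rng = np.random.default_rng(0)
targets = rng.standard_normal((2, 5, 4))
estimates = targets[::-1]
loss, perm = upit_mse(estimates, targets)
print(loss, perm)
```

The exhaustive search grows factorially with the number of speakers, which is acceptable for the two- or three-talker mixtures typical of this task.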