A Conditional Generative Model for Speech Enhancement
Deep learning based speech enhancement approaches such as Deep Neural Networks (DNN) and Long Short-Term Memory (LSTM) have already demonstrated results superior to classical methods.
However, these methods do not take full advantage of temporal context information. While DNN and LSTM consider temporal context in the noisy source speech, they do not do so for the estimated clean speech. Both DNN and LSTM also have a tendency to over-smooth spectra, which causes the enhanced speech to sound muffled.
This paper proposes a novel architecture to address both issues, which we term a conditional generative model (CGM). By adopting an adversarial training scheme applied to a generator of deep dilated convolutional layers, CGM is designed to model the joint and symmetric conditions of both noisy and estimated clean spectra. We evaluate CGM against both DNN and LSTM in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on TIMIT sentences corrupted by ITU-T P.501 and NOISEX-92 noise in a range of matched and mismatched noise conditions. Results show that both the CGM architecture and the adversarial training mechanism lead to better PESQ and STOI in all tested noise conditions. In addition to yielding significant improvements in PESQ and STOI, CGM and adversarial training both mitigate over-smoothing.
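The abstract above credits deep dilated convolutional layers with capturing wide temporal context. A minimal sketch of why stacking dilated 1-D convolutions expands the receptive field exponentially with depth (the paper's exact kernel sizes and dilation schedule are not given here; kernel size 3 and dilations 1, 2, 4, 8 are illustrative assumptions):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal 1-D convolution with the given dilation (zero-padded on the left)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[i + pad - j * dilation] for j in range(k))
        for i in range(len(x))
    ])

# Stacking layers with dilations 1, 2, 4, 8 grows the receptive field to
# 1 + (k - 1) * (1 + 2 + 4 + 8) frames for kernel size k.
k = 3
dilations = [1, 2, 4, 8]
receptive_field = 1 + (k - 1) * sum(dilations)

# Feed an impulse through the stack: its response spans the receptive field.
x = np.zeros(64)
x[0] = 1.0
y = x
for d in dilations:
    y = dilated_conv1d(y, np.ones(k), d)
nonzero = np.nonzero(y)[0]
print(receptive_field, nonzero.max() - nonzero.min() + 1)  # both equal 31
```

Four layers thus see 31 input frames, whereas four undilated layers of the same kernel size would see only 9, which is the motivation for dilation when modelling temporal context in spectra.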
Unsupervised speech enhancement with diffusion-based generative models
Recently, conditional score-based diffusion models have gained significant
attention in the field of supervised speech enhancement, yielding
state-of-the-art performance. However, these methods may face challenges when
generalising to unseen conditions. To address this issue, we introduce an
alternative approach that operates in an unsupervised manner, leveraging the
generative power of diffusion models. Specifically, in a training phase, a
clean speech prior distribution is learnt in the short-time Fourier transform
(STFT) domain using score-based diffusion models, allowing it to
unconditionally generate clean speech from Gaussian noise. Then, we develop a
posterior sampling methodology for speech enhancement by combining the learnt
clean speech prior with a noise model for speech signal inference. The noise
parameters are simultaneously learnt along with clean speech estimation through
an iterative expectation-maximisation (EM) approach. To the best of our
knowledge, this is the first work exploring diffusion-based generative models
for unsupervised speech enhancement, demonstrating promising results compared
to a recent variational auto-encoder (VAE)-based unsupervised approach and a
state-of-the-art diffusion-based supervised method. It thus opens a new
direction for future research in unsupervised speech enhancement.
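The core idea in the abstract above, combining a learnt clean-speech prior with an unknown noise model and alternating between clean-signal inference and noise-parameter updates, can be illustrated with a toy scalar Gaussian analogue. This is not the paper's STFT-domain diffusion method; it assumes a known Gaussian prior variance as a stand-in for the learnt prior and recovers an unknown noise variance by EM:

```python
import numpy as np

rng = np.random.default_rng(0)
sx2 = 4.0        # known clean-signal prior variance (stand-in for the learnt prior)
sn2_true = 1.0   # unknown noise variance to be recovered
x = rng.normal(0.0, np.sqrt(sx2), 10000)       # "clean" samples
y = x + rng.normal(0.0, np.sqrt(sn2_true), 10000)  # noisy observations

sn2 = 10.0       # deliberately poor initial guess
for _ in range(50):
    # E-step: Gaussian posterior of the clean signal given y and current sn2
    post_var = 1.0 / (1.0 / sx2 + 1.0 / sn2)
    post_mean = post_var * y / sn2
    # M-step: update the noise variance from the expected residual power
    sn2 = float(np.mean((y - post_mean) ** 2 + post_var))

print(round(sn2, 2))  # close to sn2_true
```

Each E-step plays the role of posterior sampling under the prior, and each M-step refines the noise parameters, mirroring the alternation the abstract describes at a much smaller scale.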
Attentive Adversarial Learning for Domain-Invariant Training
Adversarial domain-invariant training (ADIT) proves to be effective in
suppressing the effects of domain variability in acoustic modeling and has led
to improved performance in automatic speech recognition (ASR). In ADIT, an
auxiliary domain classifier takes in equally-weighted deep features from a deep
neural network (DNN) acoustic model and is trained to improve their
domain-invariance by optimizing an adversarial loss function. In this work, we
propose an attentive ADIT (AADIT) in which we advance the domain classifier
with an attention mechanism to automatically weight the input deep features
according to their importance in domain classification. With this attentive
re-weighting, AADIT can focus on the domain normalization of phonetic
components that are more susceptible to domain variability and generates deep
features with improved domain-invariance and senone-discriminativity over ADIT.
Most importantly, the attention block serves only as an external component to
the DNN acoustic model and is not involved in ASR, so AADIT can be used to
improve the acoustic modeling with any DNN architectures. More generally, the
same methodology can improve any adversarial learning system with an auxiliary
discriminator. Evaluated on CHiME-3 dataset, the AADIT achieves 13.6% and 9.3%
relative WER improvements, respectively, over a multi-conditional model and a
strong ADIT baseline.
Comment: 5 pages, 1 figure, ICASSP 201
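The attentive re-weighting the abstract describes can be sketched as softmax attention over deep-feature frames before they are pooled for the domain classifier. The paper's exact attention formulation and dimensions are not given here; the projection vector and sizes below are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, D = 6, 4                     # frames x feature dimension (hypothetical sizes)
feats = rng.normal(size=(T, D)) # deep features from the acoustic model
w_att = rng.normal(size=D)      # hypothetical learnt attention projection

# ADIT: the domain classifier sees equally-weighted deep features.
equal_pooled = feats.mean(axis=0)

# AADIT: attention scores re-weight frames by their importance for
# domain classification before pooling.
scores = feats @ w_att
alpha = softmax(scores)         # non-negative, sums to 1 over the T frames
att_pooled = alpha @ feats

print(att_pooled.shape)         # (4,), same shape as the equal pooling
```

Because the attention block feeds only the auxiliary domain classifier, the pooled vector never enters the recognition path, which is why the abstract notes the method applies to any DNN acoustic model.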