8,160 research outputs found

    A Conditional Generative Model for Speech Enhancement

    Get PDF
    Deep learning based speech enhancement approaches like Deep Neural Networks (DNN) and Long-Short Term Memory (LSTM) have already demonstrated superior results to classical methods. However these methods do not take full advantage of temporal context information. While DNN and LSTM consider temporal context in the noisy source speech, it does not do so for the estimated clean speech. Both DNN and LSTM also have a tendency to over-smooth spectra, which causes the enhanced speech to sound muffled. This paper proposes a novel architecture to address both issues, which we term a conditional generative model (CGM). By adopting an adversarial training scheme applied to a generator of deep dilated convolutional layers, CGM is designed to model the joint and symmetric conditions of both noisy and estimated clean spectra.We evaluate CGM against both DNN and LSTM in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on TIMIT sentences corrupted by ITU-T P.501 and NOISEX-92 noise in a range of matched and mismatched noise conditions. Results show that both the CGM architecture and the adversarial training mechanism lead to better PESQ and STOI in all tested noise conditions. In addition to yielding significant improvements in PESQ and STOI, CGM and adversarial training both mitigate against over-smoothing

    Unsupervised speech enhancement with diffusion-based generative models

    Full text link
    Recently, conditional score-based diffusion models have gained significant attention in the field of supervised speech enhancement, yielding state-of-the-art performance. However, these methods may face challenges when generalising to unseen conditions. To address this issue, we introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models. Specifically, in a training phase, a clean speech prior distribution is learnt in the short-time Fourier transform (STFT) domain using score-based diffusion models, allowing it to unconditionally generate clean speech from Gaussian noise. Then, we develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference. The noise parameters are simultaneously learnt along with clean speech estimation through an iterative expectationmaximisation (EM) approach. To the best of our knowledge, this is the first work exploring diffusion-based generative models for unsupervised speech enhancement, demonstrating promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method. It thus opens a new direction for future research in unsupervised speech enhancement

    Attentive Adversarial Learning for Domain-Invariant Training

    Full text link
    Adversarial domain-invariant training (ADIT) proves to be effective in suppressing the effects of domain variability in acoustic modeling and has led to improved performance in automatic speech recognition (ASR). In ADIT, an auxiliary domain classifier takes in equally-weighted deep features from a deep neural network (DNN) acoustic model and is trained to improve their domain-invariance by optimizing an adversarial loss function. In this work, we propose an attentive ADIT (AADIT) in which we advance the domain classifier with an attention mechanism to automatically weight the input deep features according to their importance in domain classification. With this attentive re-weighting, AADIT can focus on the domain normalization of phonetic components that are more susceptible to domain variability and generates deep features with improved domain-invariance and senone-discriminativity over ADIT. Most importantly, the attention block serves only as an external component to the DNN acoustic model and is not involved in ASR, so AADIT can be used to improve the acoustic modeling with any DNN architectures. More generally, the same methodology can improve any adversarial learning system with an auxiliary discriminator. Evaluated on CHiME-3 dataset, the AADIT achieves 13.6% and 9.3% relative WER improvements, respectively, over a multi-conditional model and a strong ADIT baseline.Comment: 5 pages, 1 figure, ICASSP 201
    • …
    corecore