87 research outputs found
Feature enhancement of reverberant speech by distribution matching and non-negative matrix factorization
This paper describes a novel two-stage dereverberation feature enhancement method for noise-robust automatic speech recognition. In the first stage, an estimate of the dereverberated speech is generated by matching the distribution of the observed reverberant speech to that of clean speech, in a decorrelated transformation domain that has a long temporal context in order to address the effects of reverberation. The second stage uses this dereverberated signal as an initial estimate within a non-negative matrix factorization framework, which jointly estimates a sparse representation of the clean speech signal and an estimate of the convolutional distortion. The proposed feature enhancement method, when used in conjunction with automatic speech recognizer back-end processing, is shown to improve the recognition performance compared to three other state-of-the-art techniques.
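As a concrete illustration of the building block behind the second stage, the sketch below factorizes a magnitude spectrogram V into non-negative spectral bases W and activations H using standard multiplicative updates for the KL divergence. This is a generic NMF routine under assumed shapes and parameter names, not the authors' implementation; the paper additionally estimates a convolutional distortion term, which is omitted here.

    import numpy as np

    def nmf(V, rank=32, n_iter=200, eps=1e-10):
        """Factorize a non-negative magnitude spectrogram V (freq x time)
        into spectral bases W (freq x rank) and activations H (rank x time)
        with multiplicative updates for the KL divergence."""
        rng = np.random.default_rng(0)
        F, T = V.shape
        W = rng.random((F, rank)) + eps
        H = rng.random((rank, T)) + eps
        ones = np.ones_like(V)
        for _ in range(n_iter):
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / (W.T @ ones + eps)   # update activations
            WH = W @ H + eps
            W *= ((V / WH) @ H.T) / (ones @ H.T + eps)   # update bases
        return W, H

In a setup like the one described, W could be learned on clean speech and held fixed, with only the activations (and a distortion estimate) updated on the dereverberated features.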
HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion. This paper introduces HiFi-GAN, a deep learning method to transform recorded speech to sound as though it had been recorded in a studio. We use an end-to-end feed-forward WaveNet architecture, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain. It relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech. The proposed model generalizes well to new speakers, new speech content, and new environments. It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
Comment: Accepted by INTERSPEECH 2020
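The deep feature matching term the abstract mentions can be sketched as an L1 distance between a discriminator's intermediate activations for the clean reference and for the enhanced output, summed over layers. This is a minimal sketch assuming lists of per-layer feature tensors; the paper's exact discriminator layout and loss weighting are not reproduced here.

    import torch

    def feature_matching_loss(feats_real, feats_fake):
        """Sum of per-layer L1 distances between discriminator activations
        for the clean reference (feats_real) and the enhanced output
        (feats_fake); the reference activations are detached so gradients
        only flow into the generator."""
        loss = 0.0
        for real, fake in zip(feats_real, feats_fake):
            loss = loss + torch.mean(torch.abs(real.detach() - fake))
        return loss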
End-to-end non-negative auto-encoders: a deep neural alternative to non-negative audio modeling
Over the last decade, non-negative matrix factorization (NMF) has emerged as one of the most popular approaches to modeling audio signals. NMF allows us to factorize the magnitude spectrogram to learn representative spectral bases that can be used for a wide range of applications. With the recent advances in deep learning, neural networks (NNs) have surpassed NMF in terms of performance. However, these NNs are trained discriminatively and, compared to NMF, lack several key characteristics such as re-usability and robustness.
In this dissertation, we develop and investigate the idea of end-to-end non-negative autoencoders (NAEs) as an updated, deep-learning-based alternative framework for non-negative audio modeling. We show that end-to-end NAEs combine the modeling advantages of non-negative matrix factorization with the generalizability of neural networks while delivering significant improvements in performance.
To this end, we first interpret NMF as an NAE and show that the two approaches are equivalent semantically and in terms of source separation performance. We exploit the availability of sophisticated neural network architectures to propose several extensions to NAEs. We also demonstrate that these modeling improvements significantly boost the performance of NAEs.
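To make the NMF-as-NAE correspondence concrete, here is a minimal single-layer non-negative autoencoder in PyTorch. The architecture, the softplus non-negativity constraint, and the variable names are illustrative assumptions, not the dissertation's exact model: the non-negative decoder weights play the role of the NMF bases W, and the non-negative latent code plays the role of the activations H.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NAE(nn.Module):
        """Single-layer non-negative autoencoder. The reconstruction
        v_hat = H @ W mirrors NMF: non-negative decoder weights act as
        spectral bases, the non-negative code as activations."""
        def __init__(self, n_freq, rank):
            super().__init__()
            self.encoder = nn.Linear(n_freq, rank)
            self.decoder = nn.Linear(rank, n_freq, bias=False)

        def forward(self, v):                       # v: (time, freq) frames
            h = F.softplus(self.encoder(v))         # non-negative code H
            w = F.softplus(self.decoder.weight)     # non-negative bases W
            return h @ w.t(), h                     # reconstruction, code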
In audio processing applications, the short-time Fourier transform (STFT) is used as a universal first step, and we design algorithms and neural networks to operate on the magnitude spectrograms. We interpret the sequence of steps involved in computing the STFT as additional neural network layers. This enables us to propose end-to-end processing pipelines that operate directly on the raw waveforms. In the context of source separation, we show that end-to-end processing gives a significant improvement in performance compared to existing spectrogram-based methods. Furthermore, to train these end-to-end models, we investigate the use of cost functions that are derived from objective evaluation metrics as measured on waveforms. We present subjective listening test results that reveal insights into the performance of these cost functions for end-to-end source separation.
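The idea of folding the STFT into the network can be sketched as a strided 1-D convolution whose kernels are windowed DFT basis functions; making those kernels learnable then yields an adaptive front end. This is a minimal sketch with assumed parameters (n_fft, hop, Hann window), and it keeps the full redundant DFT for simplicity.

    import numpy as np
    import torch
    import torch.nn.functional as F

    def stft_as_conv(wave, n_fft=512, hop=128):
        """Compute an STFT magnitude by expressing windowed DFT analysis
        as a strided 1-D convolution, i.e. as a fixed front-end layer.
        wave: tensor of shape (batch, 1, samples)."""
        n = np.arange(n_fft)
        basis = np.exp(-2j * np.pi * n[:, None] * n / n_fft)  # DFT matrix
        window = np.hanning(n_fft)
        kernels = np.concatenate([basis.real, basis.imag]) * window
        kernels = torch.tensor(kernels, dtype=torch.float32).unsqueeze(1)
        out = F.conv1d(wave, kernels, stride=hop)  # (batch, 2*n_fft, frames)
        real, imag = out[:, :n_fft], out[:, n_fft:]
        return torch.sqrt(real**2 + imag**2 + 1e-10)  # magnitude spectrogram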
Combining the adaptive front-end layers with NAEs, we propose end-to-end NAEs and show how they can be used for end-to-end generative source separation. Our experiments indicate that these models deliver separation performance comparable to that of discriminative NNs, while retaining the modularity of NMF and the modeling flexibility of neural networks. Finally, we present an approach to train these end-to-end NAEs using mixtures only, without access to clean training examples.
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
SkipConvGAN: Monaural Speech Dereverberation using Generative Adversarial Networks via Complex Time-Frequency Masking
With the advancements in deep learning approaches, the performance of speech enhancement systems in the presence of background noise has improved significantly. However, improving a system's robustness against reverberation is still a work in progress, as reverberation tends to cause loss of formant structure due to smearing effects in time and frequency. A wide range of deep learning-based systems either enhance the magnitude response and reuse the distorted phase, or enhance the complex spectrogram using a complex time-frequency mask. Though these approaches have demonstrated satisfactory performance, they do not directly address the lost formant structure caused by reverberation. We believe that retrieving the formant structure can help improve the efficiency of existing systems. In this study, we propose SkipConvGAN, an extension of our prior work SkipConvNet. The proposed system's generator network tries to estimate an efficient complex time-frequency mask, while the discriminator network aids in driving the generator to restore the lost formant structure. We evaluate the performance of our proposed system on simulated and real recordings of reverberant speech from the single-channel task of the REVERB challenge corpus. The proposed system shows a consistent improvement across multiple room configurations over other deep learning-based generative adversarial frameworks.
Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, Volume 30
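Complex time-frequency masking, as used by the generator here, modifies both magnitude and phase by complex multiplication of the estimated mask with the reverberant spectrogram. Below is a minimal sketch of that masking step under an assumed real/imaginary tensor layout; the SkipConvGAN network that predicts the mask is not shown.

    import torch

    def apply_complex_mask(spec_real, spec_imag, mask_real, mask_imag):
        """Apply an estimated complex time-frequency mask to a reverberant
        complex spectrogram via complex multiplication, so that both the
        magnitude and the phase are modified.
        All tensors: (batch, freq, frames)."""
        enh_real = spec_real * mask_real - spec_imag * mask_imag
        enh_imag = spec_real * mask_imag + spec_imag * mask_real
        return enh_real, enh_imag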