On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering
This report focuses on algorithms that perform single-channel speech
enhancement. The author of this report uses modulation-domain Kalman filtering
algorithms for speech enhancement, i.e. noise suppression and dereverberation,
in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be
applied to both noise and late-reverberation suppression; in [1], [2], [3]
and [4], various model-based speech enhancement algorithms that perform
modulation-domain Kalman filtering are designed, implemented and tested. The
model-based enhancement algorithm in [2] estimates and tracks the speech phase.
The short-time-Fourier-transform-based enhancement algorithm in [5] uses the
active speech level estimator presented in [6]. This report describes how
different algorithms perform speech enhancement and is addressed to researchers
interested in monaural speech enhancement. The algorithms are composed of
different processing blocks and techniques [7]; understanding the
implementation choices made during system design is important because it
provides insights that can assist the development of new algorithms.
Index Terms - Speech enhancement, dereverberation, denoising, Kalman filter,
minimum mean squared error estimation.
Comment: 13 pages
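Modulation-domain Kalman filtering tracks the trajectory of the spectral
amplitudes across frames within each frequency bin. As a rough illustrative
sketch of the core recursion only (a scalar linear Kalman filter with an
assumed AR(1) envelope model and fixed noise variances, not the non-linear
algorithms of [1]-[5]):

```python
import numpy as np

def kalman_track(noisy_env, a=0.95, q=0.01, r=0.1):
    """Track a clean modulation envelope from a noisy one with a
    scalar Kalman filter (first-order AR state model).

    noisy_env : observed spectral-amplitude trajectory in one STFT bin
    a         : AR(1) coefficient of the assumed clean-envelope model
    q, r      : process and observation noise variances
    """
    x = noisy_env[0]   # state estimate (clean amplitude)
    p = 1.0            # state error variance
    out = np.empty_like(noisy_env)
    for t, y in enumerate(noisy_env):
        # predict step under the AR(1) model
        x_pred = a * x
        p_pred = a * a * p + q
        # update step with the noisy observation y
        k = p_pred / (p_pred + r)       # Kalman gain
        x = x_pred + k * (y - x_pred)
        p = (1.0 - k) * p_pred
        out[t] = x
    return out
```

In the cited algorithms, the AR coefficients and noise statistics are
estimated from the signal rather than fixed as above.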
Integrated Speech Enhancement Method Based on Weighted Prediction Error and DNN for Dereverberation and Denoising
Both reverberation and additive noises degrade the speech quality and
intelligibility. The weighted prediction error (WPE) method performs well on
dereverberation but has limitations. First, WPE does not consider the influence
of additive noise, which degrades its dereverberation performance. Second, it
relies on a time-consuming iterative process, and there is no guarantee of, nor
a widely accepted criterion for, its convergence. In this paper, we integrate a
deep neural network (DNN) into WPE for dereverberation and denoising. The DNN
is used to suppress the background noise to meet the noise-free assumption of
WPE. Meanwhile, the DNN is applied to directly predict the spectral variance of
the target speech so that WPE works without iteration. The experimental results
show that the proposed method has a significant improvement in speech quality
and runs fast.
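For context, the conventional WPE iteration that the proposed method eliminates
alternates between estimating the target spectral variance and solving a
weighted least-squares problem for a delayed linear-prediction filter. A
minimal single-channel, single-frequency-bin sketch under these assumptions
(parameter values are illustrative):

```python
import numpy as np

def wpe_1ch(y, delay=3, taps=10, iters=3, eps=1e-8):
    """Simplified single-channel WPE for one frequency bin.

    y : complex STFT coefficients over time, shape (T,)
    Iterates between (i) estimating the target spectral variance from
    the current dereverberated signal and (ii) solving the weighted
    least-squares problem for the delayed linear-prediction filter.
    """
    T = len(y)
    # delayed past-frame matrix: Y[t, k] = y[t - delay - k]
    Y = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        d = delay + k
        Y[d:, k] = y[:T - d]
    x = y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(x) ** 2, eps)        # target variance estimate
        Yw = Y / lam[:, None]
        R = Yw.conj().T @ Y                          # weighted correlation matrix
        p = Yw.conj().T @ y                          # weighted cross-correlation
        g = np.linalg.solve(R + eps * np.eye(taps), p)
        x = y - Y @ g                                # subtract predicted late reverb
    return x
```

In the proposed integration, the DNN output replaces the variance estimate
`lam`, so a single pass suffices instead of the iteration above.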
Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence
Most speech enhancement algorithms make use of the short-time Fourier
transform (STFT), which is a simple and flexible time-frequency decomposition
that estimates the short-time spectrum of a signal. However, the duration of
short STFT frames is inherently limited by the nonstationarity of speech
signals. The main contribution of this paper is a demonstration of speech
enhancement and automatic speech recognition in the presence of reverberation
and noise by extending the length of analysis windows. We accomplish this
extension by performing enhancement in the short-time fan-chirp transform
(STFChT) domain, an overcomplete time-frequency representation that is coherent
with speech signals over longer analysis window durations than the STFT. This
extended coherence is gained by using a linear model of fundamental frequency
variation of voiced speech signals. Our approach centers around using a
single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA)
estimator proposed by Habets, which scales coefficients in a time-frequency
domain to suppress noise and reverberation. In the case of multiple
microphones, we preprocess the data with either a minimum variance
distortionless response (MVDR) beamformer, or a delay-and-sum beamformer (DSB).
We evaluate our algorithm on both speech enhancement and recognition tasks for
the REVERB challenge dataset. Compared to the same processing done in the STFT
domain, our approach achieves significant improvement in terms of objective
enhancement metrics (including PESQ---the ITU-T standard measurement for speech
quality). In terms of automatic speech recognition (ASR) performance as
measured by word error rate (WER), our experiments indicate that the STFT with
a long window is more effective for ASR.
Comment: 22 pages
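The MMSE-LSA estimator applies a real-valued gain per time-frequency
coefficient, with a closed form in the a priori SNR ξ and a posteriori SNR γ.
A sketch using SciPy's exponential integral (the surrounding noise-PSD and SNR
estimation, e.g. decision-directed smoothing, is omitted here):

```python
import numpy as np
from scipy.special import exp1

def lsa_gain(xi, gamma):
    """MMSE log-spectral amplitude (Ephraim-Malah) gain:
    G = (xi / (1 + xi)) * exp(0.5 * E1(v)),  v = gamma * xi / (1 + xi),
    where E1 is the exponential integral. The gain multiplies the
    noisy coefficient magnitudes in the chosen time-frequency domain.
    """
    v = gamma * xi / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))
```

In practice the gain is usually lower-bounded by a spectral floor to limit
musical-noise artifacts.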
Speech Dereverberation with Context-aware Recurrent Neural Networks
In this paper, we propose a model to perform speech dereverberation by
estimating the spectral magnitude of the clean speech from its reverberant
counterpart. Our models
are capable of extracting features that take into account both short and
long-term dependencies in the signal through a convolutional encoder (which
extracts features from a short, bounded context of frames) and a recurrent
neural network for extracting long-term information. Our model outperforms a
recently proposed model that uses different context information depending on
the reverberation time, without requiring any sort of additional input,
yielding improvements of up to 0.4 on PESQ, 0.3 on STOI, and 1.0 on POLQA
relative to reverberant speech. We also show our model is able to generalize to
real room impulse responses even when only trained with simulated room impulse
responses, different speakers, and high reverberation times. Lastly, listening
tests show the proposed method outperforming benchmark models in reduction of
perceived reverberation.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language
Processing
Jointly optimal denoising, dereverberation, and source separation
This paper proposes methods that can optimize a Convolutional BeamFormer
(CBF) for jointly performing denoising, dereverberation, and source separation
(DN+DR+SS) in a computationally efficient way. Conventionally, a cascade
configuration composed of a Weighted Prediction Error minimization (WPE)
dereverberation filter followed by a Minimum Variance Distortionless Response
(MVDR) beamformer has been used as the state-of-the-art frontend for far-field
speech recognition; however, the overall optimality of this approach is not
guaranteed. In the blind signal processing area, an approach for jointly
optimizing dereverberation and source separation (DR+SS) has been proposed;
however, this approach requires a huge computing cost and has not been extended
to DN+DR+SS. To overcome the above limitations, this paper develops
new approaches for jointly optimizing DN+DR+SS in a computationally much more
efficient way. To this end, we first present an objective function to optimize
a CBF for performing DN+DR+SS based on the maximum likelihood estimation, on an
assumption that the steering vectors of the target signals are given or can be
estimated, e.g., using a neural network. This paper refers to a CBF optimized
by this objective function as a weighted Minimum-Power Distortionless Response
(wMPDR) CBF. Then, we derive two algorithms for optimizing a wMPDR CBF based on
two different ways of factorizing a CBF into WPE filters and beamformers.
Experiments using noisy reverberant sound mixtures show that the proposed
optimization approaches greatly improve the performance of the speech
enhancement in comparison with the conventional cascade configuration in terms
of the signal distortion measures and ASR performance. It is also shown that
the proposed approaches can greatly reduce the computing cost with improved
estimation accuracy in comparison with the conventional joint optimization
approach.
Comment: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on
12 Feb 2020; accepted on 14 July 2020
A unified convolutional beamformer for simultaneous denoising and dereverberation
This paper proposes a method for estimating a convolutional beamformer that
can perform denoising and dereverberation simultaneously in an optimal way. The
application of dereverberation based on a weighted prediction error (WPE)
method followed by denoising based on a minimum variance distortionless
response (MVDR) beamformer has conventionally been considered a promising
approach; however, the optimality of this approach cannot be guaranteed. To
realize the optimal integration of denoising and dereverberation, we present a
method that unifies the WPE dereverberation method and a variant of the MVDR
beamformer, namely a minimum power distortionless response (MPDR) beamformer,
into a single convolutional beamformer, and we optimize it based on a single
unified optimization criterion. The proposed beamformer is referred to as a
Weighted Power minimization Distortionless response (WPD) beamformer.
Experiments show that the proposed method substantially improves the speech
enhancement performance in terms of both objective speech enhancement measures
and automatic speech recognition (ASR) performance.
Comment: Published in IEEE Signal Processing Letters
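Setting aside the convolutional (dereverberation) taps, the distortionless
spatial part of such a beamformer has, per frequency bin, the familiar closed
form w = R⁻¹v / (vᴴR⁻¹v), where R is a power-weighted spatial covariance and v
the target steering vector. A narrowband sketch; the per-frame power estimate
below is my simplification, not the paper's exact estimator:

```python
import numpy as np

def wmpdr_weights(Y, v, eps=1e-8):
    """Weighted MPDR beamformer weights for one frequency bin.

    Y : observed multichannel STFT frames, shape (T, M)
    v : steering vector of the target, shape (M,)
    The spatial covariance is weighted by the reciprocal of a crude
    per-frame target-power estimate (matched-filter output power),
    which is what distinguishes the weighted criterion from plain MPDR.
    """
    lam = np.maximum(np.abs(Y @ v.conj()) ** 2 / np.linalg.norm(v) ** 4, eps)
    R = (Y / lam[:, None]).conj().T @ Y / len(Y)     # weighted covariance (M, M)
    Rinv_v = np.linalg.solve(R + eps * np.eye(len(v)), v)
    return Rinv_v / (v.conj() @ Rinv_v)              # distortionless: w^H v = 1
```

The beamformer output per frame is then w^H y_t, which preserves the target
direction exactly while minimizing weighted output power.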
Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation is studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised learning problem, where the discriminative patterns of speech,
speakers, and background noise are learned from training data. Over the past
decade, many supervised separation algorithms have been put forward. In
particular, the recent introduction of deep learning to supervised speech
separation has dramatically accelerated progress and boosted separation
performance. This article provides a comprehensive overview of the research on
deep learning based supervised speech separation in the last several years. We
first introduce the background of speech separation and the formulation of
supervised separation. Then we discuss three main components of supervised
separation: learning machines, training targets, and acoustic features. Much
of the overview is devoted to separation algorithms, where we review monaural
methods,
including speech enhancement (speech-nonspeech separation), speaker separation
(multi-talker separation), and speech dereverberation, as well as
multi-microphone techniques. The important issue of generalization, unique to
supervised learning, is discussed. This overview provides a historical
perspective on how advances are made. In addition, we discuss a number of
conceptual issues, including what constitutes the target source.
Comment: 27 pages, 17 figures
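Among the training targets discussed in such overviews, the ideal ratio mask
(IRM) is a standard example; a minimal per-bin definition:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag):
    """Ideal ratio mask (IRM), a common training target in supervised
    speech separation: per time-frequency bin,
    sqrt(speech energy / (speech energy + noise energy)).
    Values lie in [0, 1] and multiply the noisy magnitudes at test time.
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + 1e-12))
```

A network trained to predict this mask from acoustic features recovers an
estimate of the clean magnitude by pointwise multiplication with the mixture.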
Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding
The performance of speech enhancement algorithms in a multi-speaker scenario
depends on correctly identifying the target speaker to be enhanced. Auditory
attention decoding (AAD) methods make it possible to identify, from
single-trial EEG recordings, the target speaker the listener is attending to.
Aiming at enhancing
the target speaker and suppressing interfering speakers, reverberation and
ambient noise, in this paper we propose a cognitive-driven multi-microphone
speech enhancement system, which combines a neural-network-based mask
estimator, weighted minimum power distortionless response convolutional
beamformers and AAD. To control the suppression of the interfering speaker, we
also propose an extension incorporating an interference suppression constraint.
The experimental results show that the proposed system outperforms the
state-of-the-art cognitive-driven speech enhancement systems in challenging
reverberant and noisy conditions.
Speech Dereverberation Using Nonnegative Convolutive Transfer Function and Spectro-Temporal Modeling
This paper presents two single-channel speech dereverberation methods to
enhance the quality of speech signals that have been recorded in an enclosed
space. For both methods, the room acoustics are modeled using a nonnegative
approximation of the convolutive transfer function (NCTF), and to additionally
exploit the spectral properties of the speech signal, such as the low-rank
nature of its spectrogram, the speech spectrogram is modeled using
nonnegative matrix factorization (NMF). Two methods are described to combine
the NCTF and NMF models. In the first method, referred to as the integrated
method, a cost function is constructed by directly integrating the speech NMF
model into the NCTF model, while in the second method, referred to as the
weighted method, the NCTF and NMF based cost functions are weighted and summed.
Efficient update rules are derived to solve both optimization problems. In
addition, an extension of the integrated method is presented, which exploits
the temporal dependencies of the speech signal. Several experiments are
performed on reverberant speech signals with and without background noise,
where the integrated method yields a considerably higher speech quality than
the baseline NCTF method and a state-of-the-art spectral enhancement method.
Moreover, the experimental results indicate that the weighted method can even
lead to a better performance in terms of instrumental quality measures, but
that the optimal weighting parameter depends on the room acoustics and the
utilized NMF model. Modeling the temporal dependencies in the integrated method
was found to be useful only for highly reverberant conditions.
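The NMF building block used by both methods factors the magnitude spectrogram
into a small number of nonnegative spectral bases and activations. A generic
sketch with Lee-Seung multiplicative updates (the Euclidean cost here is a
simplification; the paper couples such an NMF model with the NCTF room model):

```python
import numpy as np

def nmf(V, rank, iters=300, eps=1e-9, seed=0):
    """NMF with multiplicative updates minimizing the squared
    Euclidean distance: V (F x T nonnegative spectrogram) ~ W @ H,
    with W (F x rank) spectral bases and H (rank x T) activations.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # Lee-Seung update for H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # Lee-Seung update for W
    return W, H
```

The multiplicative form guarantees that W and H stay nonnegative throughout,
which is what lets the factorization act as a low-rank speech model.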
Robust coherence-based spectral enhancement for speech recognition in adverse real-world environments
Speech recognition in adverse real-world environments is highly affected by
reverberation and nonstationary background noise. A well-known strategy to
reduce such undesired signal components in multi-microphone scenarios is
spatial filtering of the microphone signals. In this article, we demonstrate
that an additional coherence-based postfilter, which is applied to the
beamformer output signal to remove diffuse interference components from the
latter, is an effective means to further improve the recognition accuracy of
modern deep learning speech recognition systems. To this end, the recently
updated 3rd CHiME Speech Separation and Recognition Challenge (CHiME-3)
baseline speech recognition system is extended by a coherence-based postfilter
and the postfilter's impact on the word error rates is investigated for the
noisy environments provided by CHiME-3. To determine the time- and
frequency-dependent postfilter gains, we use a Direction-of-Arrival
(DOA)-dependent and a DOA-independent estimator of the coherent-to-diffuse
power ratio as an approximation of the short-time signal-to-noise ratio. Our
experiments show that incorporating coherence-based postfiltering into the
CHiME-3 baseline speech recognition system leads to a significant reduction of
the word error rate scores for the noisy and reverberant environments provided
as part of CHiME-3.
Comment: 21 pages, 5 figures. arXiv admin note: substantial text overlap with
arXiv:1509.06882. Elsevier Computer Speech & Language (CSL), 201
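The postfilter operates per time-frequency bin: the measured inter-microphone
coherence is compared with the ideal diffuse-field coherence to estimate the
coherent-to-diffuse power ratio (CDR), and a Wiener-like gain is applied. A
sketch of the two model pieces; the CDR estimator itself is omitted, and the
gain mapping and floor are common choices rather than the article's exact ones:

```python
import numpy as np

def diffuse_coherence(f, mic_dist, c=343.0):
    """Spatial coherence of an ideal (spherically isotropic) diffuse
    noise field between two omnidirectional microphones at distance
    mic_dist (metres), at frequency f (Hz): sinc(2 f d / c)."""
    return np.sinc(2.0 * f * mic_dist / c)  # np.sinc(x) = sin(pi x)/(pi x)

def postfilter_gain(cdr, floor=0.1):
    """Wiener-like postfilter gain from an estimated
    coherent-to-diffuse power ratio: G = CDR / (CDR + 1),
    lower-bounded by a spectral floor to limit artifacts."""
    return np.maximum(cdr / (cdr + 1.0), floor)
```

Multiplying the beamformer output spectrum by this gain suppresses bins
dominated by diffuse interference while leaving coherent (direct-path) speech
largely intact.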