On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering
This report focuses on algorithms that perform single-channel speech
enhancement. The author of this report uses modulation-domain Kalman filtering
algorithms for speech enhancement, i.e. noise suppression and dereverberation,
in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be
applied to both noise and late-reverberation suppression; in [1], [2], [3]
and [4], various model-based speech enhancement algorithms that perform
modulation-domain Kalman filtering are designed, implemented and tested. The
model-based enhancement algorithm in [2] estimates and tracks the speech phase.
The short-time-Fourier-transform-based enhancement algorithm in [5] uses the
active speech level estimator presented in [6]. This report describes how
different algorithms perform speech enhancement and is addressed to researchers
interested in monaural speech enhancement. The algorithms are composed of
different processing blocks and techniques [7]; understanding the
implementation choices made during system design is important because it
provides insights that can assist the development of new algorithms.
Index Terms - Speech enhancement, dereverberation, denoising, Kalman filter,
minimum mean squared error estimation.
Comment: 13 pages
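Modulation-domain Kalman filtering tracks the trajectory of the spectral
amplitudes across frames within each frequency bin. As a rough illustrative
sketch of the core recursion only (a scalar linear Kalman filter with an
assumed AR(1) envelope model and fixed noise variances, not the non-linear
algorithms of [1]-[5]):

```python
import numpy as np

def kalman_track(noisy_env, a=0.95, q=0.01, r=0.1):
    """Track a clean modulation envelope from a noisy one with a
    scalar Kalman filter (first-order AR state model).

    noisy_env : observed spectral-amplitude trajectory in one STFT bin
    a         : AR(1) coefficient of the assumed clean-envelope model
    q, r      : process and observation noise variances
    """
    x = noisy_env[0]   # state estimate (clean amplitude)
    p = 1.0            # state error variance
    out = np.empty_like(noisy_env)
    for t, y in enumerate(noisy_env):
        # predict step under the AR(1) model
        x_pred = a * x
        p_pred = a * a * p + q
        # update step with the noisy observation y
        k = p_pred / (p_pred + r)       # Kalman gain
        x = x_pred + k * (y - x_pred)
        p = (1.0 - k) * p_pred
        out[t] = x
    return out
```

In the cited algorithms, the AR coefficients and noise statistics are
estimated from the signal rather than fixed as above.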
Integrated Speech Enhancement Method Based on Weighted Prediction Error and DNN for Dereverberation and Denoising
Both reverberation and additive noises degrade the speech quality and
intelligibility. The weighted prediction error (WPE) method performs well on
dereverberation but has limitations. First, WPE does not consider the influence
of additive noise, which degrades its dereverberation performance. Second, it
relies on a time-consuming iterative process, and there is no guarantee of, nor
a widely accepted criterion for, its convergence. In this paper, we integrate a
deep neural network (DNN) into WPE for dereverberation and denoising. The DNN
is used to suppress the background noise to meet the noise-free assumption of
WPE. Meanwhile, the DNN is applied to directly predict the spectral variance of
the target speech so that WPE works without iteration. The experimental results
show that the proposed method has a significant improvement in speech quality
and runs fast.
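For context, the conventional WPE iteration that the proposed method eliminates
alternates between estimating the target spectral variance and solving a
weighted least-squares problem for a delayed linear-prediction filter. A
minimal single-channel, single-frequency-bin sketch under these assumptions
(parameter values are illustrative):

```python
import numpy as np

def wpe_1ch(y, delay=3, taps=10, iters=3, eps=1e-8):
    """Simplified single-channel WPE for one frequency bin.

    y : complex STFT coefficients over time, shape (T,)
    Iterates between (i) estimating the target spectral variance from
    the current dereverberated signal and (ii) solving the weighted
    least-squares problem for the delayed linear-prediction filter.
    """
    T = len(y)
    # delayed past-frame matrix: Y[t, k] = y[t - delay - k]
    Y = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        d = delay + k
        Y[d:, k] = y[:T - d]
    x = y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(x) ** 2, eps)        # target variance estimate
        Yw = Y / lam[:, None]
        R = Yw.conj().T @ Y                          # weighted correlation matrix
        p = Yw.conj().T @ y                          # weighted cross-correlation
        g = np.linalg.solve(R + eps * np.eye(taps), p)
        x = y - Y @ g                                # subtract predicted late reverb
    return x
```

In the proposed integration, the DNN output replaces the variance estimate
`lam`, so a single pass suffices instead of the iteration above.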
Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence
Most speech enhancement algorithms make use of the short-time Fourier
transform (STFT), which is a simple and flexible time-frequency decomposition
that estimates the short-time spectrum of a signal. However, the duration of
short STFT frames is inherently limited by the nonstationarity of speech
signals. The main contribution of this paper is a demonstration of speech
enhancement and automatic speech recognition in the presence of reverberation
and noise by extending the length of analysis windows. We accomplish this
extension by performing enhancement in the short-time fan-chirp transform
(STFChT) domain, an overcomplete time-frequency representation that is coherent
with speech signals over longer analysis window durations than the STFT. This
extended coherence is gained by using a linear model of fundamental frequency
variation of voiced speech signals. Our approach centers around using a
single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA)
estimator proposed by Habets, which scales coefficients in a time-frequency
domain to suppress noise and reverberation. In the case of multiple
microphones, we preprocess the data with either a minimum variance
distortionless response (MVDR) beamformer, or a delay-and-sum beamformer (DSB).
We evaluate our algorithm on both speech enhancement and recognition tasks for
the REVERB challenge dataset. Compared to the same processing done in the STFT
domain, our approach achieves significant improvement in terms of objective
enhancement metrics (including PESQ---the ITU-T standard measurement for speech
quality). In terms of automatic speech recognition (ASR) performance as
measured by word error rate (WER), our experiments indicate that the STFT with
a long window is more effective for ASR.
Comment: 22 pages
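The MMSE-LSA estimator applies a real-valued gain per time-frequency
coefficient, with a closed form in the a priori SNR ξ and a posteriori SNR γ.
A sketch using SciPy's exponential integral (the surrounding noise-PSD and SNR
estimation, e.g. decision-directed smoothing, is omitted here):

```python
import numpy as np
from scipy.special import exp1

def lsa_gain(xi, gamma):
    """MMSE log-spectral amplitude (Ephraim-Malah) gain:
    G = (xi / (1 + xi)) * exp(0.5 * E1(v)),  v = gamma * xi / (1 + xi),
    where E1 is the exponential integral. The gain multiplies the
    noisy coefficient magnitudes in the chosen time-frequency domain.
    """
    v = gamma * xi / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))
```

In practice the gain is usually lower-bounded by a spectral floor to limit
musical-noise artifacts.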
Speech Dereverberation with Context-aware Recurrent Neural Networks
In this paper, we propose a model to perform speech dereverberation by
estimating the spectral magnitude of the clean speech from its reverberant
counterpart. Our models
are capable of extracting features that take into account both short and
long-term dependencies in the signal through a convolutional encoder (which
extracts features from a short, bounded context of frames) and a recurrent
neural network for extracting long-term information. Our model outperforms a
recently proposed model that uses different context information depending on
the reverberation time, without requiring any sort of additional input,
yielding improvements of up to 0.4 on PESQ, 0.3 on STOI, and 1.0 on POLQA
relative to reverberant speech. We also show our model is able to generalize to
real room impulse responses even when only trained with simulated room impulse
responses, different speakers, and high reverberation times. Lastly, listening
tests show the proposed method outperforming benchmark models in reduction of
perceived reverberation.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language
Processing
Jointly optimal denoising, dereverberation, and source separation
This paper proposes methods that can optimize a Convolutional BeamFormer
(CBF) for jointly performing denoising, dereverberation, and source separation
(DN+DR+SS) in a computationally efficient way. Conventionally, a cascade
configuration composed of a Weighted Prediction Error minimization (WPE)
dereverberation filter followed by a Minimum Variance Distortionless Response
(MVDR) beamformer has been used as the state-of-the-art frontend for far-field
speech recognition; however, the overall optimality of this approach is not
guaranteed. In the blind signal processing area, an approach for jointly
optimizing dereverberation and source separation (DR+SS) has been proposed;
however, this approach requires a huge computing cost and has not been extended
to DN+DR+SS. To overcome the above limitations, this paper develops
new approaches for jointly optimizing DN+DR+SS in a computationally much more
efficient way. To this end, we first present an objective function to optimize
a CBF for performing DN+DR+SS based on the maximum likelihood estimation, on an
assumption that the steering vectors of the target signals are given or can be
estimated, e.g., using a neural network. This paper refers to a CBF optimized
by this objective function as a weighted Minimum-Power Distortionless Response
(wMPDR) CBF. Then, we derive two algorithms for optimizing a wMPDR CBF based on
two different ways of factorizing a CBF into WPE filters and beamformers.
Experiments using noisy reverberant sound mixtures show that the proposed
optimization approaches greatly improve the performance of the speech
enhancement in comparison with the conventional cascade configuration in terms
of the signal distortion measures and ASR performance. It is also shown that
the proposed approaches can greatly reduce the computing cost with improved
estimation accuracy in comparison with the conventional joint optimization
approach.
Comment: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on
12 Feb 2020; accepted on 14 July 2020
A unified convolutional beamformer for simultaneous denoising and dereverberation
This paper proposes a method for estimating a convolutional beamformer that
can perform denoising and dereverberation simultaneously in an optimal way. The
application of dereverberation based on a weighted prediction error (WPE)
method followed by denoising based on a minimum variance distortionless
response (MVDR) beamformer has conventionally been considered a promising
approach; however, the optimality of this approach cannot be guaranteed. To
realize the optimal integration of denoising and dereverberation, we present a
method that unifies the WPE dereverberation method and a variant of the MVDR
beamformer, namely a minimum power distortionless response (MPDR) beamformer,
into a single convolutional beamformer, and we optimize it based on a single
unified optimization criterion. The proposed beamformer is referred to as a
Weighted Power minimization Distortionless response (WPD) beamformer.
Experiments show that the proposed method substantially improves the speech
enhancement performance in terms of both objective speech enhancement measures
and automatic speech recognition (ASR) performance.
Comment: Published in IEEE Signal Processing Letters
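Setting aside the convolutional (dereverberation) taps, the distortionless
spatial part of such a beamformer has, per frequency bin, the familiar closed
form w = R⁻¹v / (vᴴR⁻¹v), where R is a power-weighted spatial covariance and v
the target steering vector. A narrowband sketch; the per-frame power estimate
below is my simplification, not the paper's exact estimator:

```python
import numpy as np

def wmpdr_weights(Y, v, eps=1e-8):
    """Weighted MPDR beamformer weights for one frequency bin.

    Y : observed multichannel STFT frames, shape (T, M)
    v : steering vector of the target, shape (M,)
    The spatial covariance is weighted by the reciprocal of a crude
    per-frame target-power estimate (matched-filter output power),
    which is what distinguishes the weighted criterion from plain MPDR.
    """
    lam = np.maximum(np.abs(Y @ v.conj()) ** 2 / np.linalg.norm(v) ** 4, eps)
    R = (Y / lam[:, None]).conj().T @ Y / len(Y)     # weighted covariance (M, M)
    Rinv_v = np.linalg.solve(R + eps * np.eye(len(v)), v)
    return Rinv_v / (v.conj() @ Rinv_v)              # distortionless: w^H v = 1
```

The beamformer output per frame is then w^H y_t, which preserves the target
direction exactly while minimizing weighted output power.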
Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation is studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised learning problem, where the discriminative patterns of speech,
speakers, and background noise are learned from training data. Over the past
decade, many supervised separation algorithms have been put forward. In
particular, the recent introduction of deep learning to supervised speech
separation has dramatically accelerated progress and boosted separation
performance. This article provides a comprehensive overview of the research on
deep learning based supervised speech separation in the last several years. We
first introduce the background of speech separation and the formulation of
supervised separation. Then we discuss three main components of supervised
separation: learning machines, training targets, and acoustic features. Much
of the overview is devoted to separation algorithms, where we review monaural
methods,
including speech enhancement (speech-nonspeech separation), speaker separation
(multi-talker separation), and speech dereverberation, as well as
multi-microphone techniques. The important issue of generalization, unique to
supervised learning, is discussed. This overview provides a historical
perspective on how advances are made. In addition, we discuss a number of
conceptual issues, including what constitutes the target source.
Comment: 27 pages, 17 figures
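Among the training targets discussed in such overviews, the ideal ratio mask
(IRM) is a standard example; a minimal per-bin definition:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag):
    """Ideal ratio mask (IRM), a common training target in supervised
    speech separation: per time-frequency bin,
    sqrt(speech energy / (speech energy + noise energy)).
    Values lie in [0, 1] and multiply the noisy magnitudes at test time.
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + 1e-12))
```

A network trained to predict this mask from acoustic features recovers an
estimate of the clean magnitude by pointwise multiplication with the mixture.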
Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding
The performance of speech enhancement algorithms in a multi-speaker scenario
depends on correctly identifying the target speaker to be enhanced. Auditory
attention decoding (AAD) methods make it possible to identify, from
single-trial EEG recordings, the target speaker the listener is attending to.
Aiming at enhancing
the target speaker and suppressing interfering speakers, reverberation and
ambient noise, in this paper we propose a cognitive-driven multi-microphone
speech enhancement system, which combines a neural-network-based mask
estimator, weighted minimum power distortionless response convolutional
beamformers and AAD. To control the suppression of the interfering speaker, we
also propose an extension incorporating an interference suppression constraint.
The experimental results show that the proposed system outperforms the
state-of-the-art cognitive-driven speech enhancement systems in challenging
reverberant and noisy conditions.
Speech Dereverberation Using Nonnegative Convolutive Transfer Function and Spectro-Temporal Modeling
This paper presents two single-channel speech dereverberation methods to
enhance the quality of speech signals that have been recorded in an enclosed
space. For both methods, the room acoustics are modeled using a nonnegative
approximation of the convolutive transfer function (NCTF), and to additionally
exploit the spectral properties of the speech signal, such as the low-rank
nature of its spectrogram, the speech spectrogram is modeled using
nonnegative matrix factorization (NMF). Two methods are described to combine
the NCTF and NMF models. In the first method, referred to as the integrated
method, a cost function is constructed by directly integrating the speech NMF
model into the NCTF model, while in the second method, referred to as the
weighted method, the NCTF and NMF based cost functions are weighted and summed.
Efficient update rules are derived to solve both optimization problems. In
addition, an extension of the integrated method is presented, which exploits
the temporal dependencies of the speech signal. Several experiments are
performed on reverberant speech signals with and without background noise,
where the integrated method yields a considerably higher speech quality than
the baseline NCTF method and a state-of-the-art spectral enhancement method.
Moreover, the experimental results indicate that the weighted method can even
lead to a better performance in terms of instrumental quality measures, but
that the optimal weighting parameter depends on the room acoustics and the
utilized NMF model. Modeling the temporal dependencies in the integrated method
was found to be useful only for highly reverberant conditions.
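The NMF building block used by both methods factors the magnitude spectrogram
into a small number of nonnegative spectral bases and activations. A generic
sketch with Lee-Seung multiplicative updates (the Euclidean cost here is a
simplification; the paper couples such an NMF model with the NCTF room model):

```python
import numpy as np

def nmf(V, rank, iters=300, eps=1e-9, seed=0):
    """NMF with multiplicative updates minimizing the squared
    Euclidean distance: V (F x T nonnegative spectrogram) ~ W @ H,
    with W (F x rank) spectral bases and H (rank x T) activations.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # Lee-Seung update for H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # Lee-Seung update for W
    return W, H
```

The multiplicative form guarantees that W and H stay nonnegative throughout,
which is what lets the factorization act as a low-rank speech model.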
Robust coherence-based spectral enhancement for speech recognition in adverse real-world environments
Speech recognition in adverse real-world environments is highly affected by
reverberation and nonstationary background noise. A well-known strategy to
reduce such undesired signal components in multi-microphone scenarios is
spatial filtering of the microphone signals. In this article, we demonstrate
that an additional coherence-based postfilter, which is applied to the
beamformer output signal to remove diffuse interference components from the
latter, is an effective means to further improve the recognition accuracy of
modern deep learning speech recognition systems. To this end, the recently
updated 3rd CHiME Speech Separation and Recognition Challenge (CHiME-3)
baseline speech recognition system is extended by a coherence-based postfilter
and the postfilter's impact on the word error rates is investigated for the
noisy environments provided by CHiME-3. To determine the time- and
frequency-dependent postfilter gains, we use a Direction-of-Arrival
(DOA)-dependent and a DOA-independent estimator of the coherent-to-diffuse
power ratio as an approximation of the short-time signal-to-noise ratio. Our
experiments show that incorporating coherence-based postfiltering into the
CHiME-3 baseline speech recognition system leads to a significant reduction of
the word error rate scores for the noisy and reverberant environments provided
as part of CHiME-3.
Comment: 21 pages, 5 figures. arXiv admin note: substantial text overlap with
arXiv:1509.06882. Elsevier Computer Speech & Language (CSL), 201
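The postfilter operates per time-frequency bin: the measured inter-microphone
coherence is compared with the ideal diffuse-field coherence to estimate the
coherent-to-diffuse power ratio (CDR), and a Wiener-like gain is applied. A
sketch of the two model pieces; the CDR estimator itself is omitted, and the
gain mapping and floor are common choices rather than the article's exact ones:

```python
import numpy as np

def diffuse_coherence(f, mic_dist, c=343.0):
    """Spatial coherence of an ideal (spherically isotropic) diffuse
    noise field between two omnidirectional microphones at distance
    mic_dist (metres), at frequency f (Hz): sinc(2 f d / c)."""
    return np.sinc(2.0 * f * mic_dist / c)  # np.sinc(x) = sin(pi x)/(pi x)

def postfilter_gain(cdr, floor=0.1):
    """Wiener-like postfilter gain from an estimated
    coherent-to-diffuse power ratio: G = CDR / (CDR + 1),
    lower-bounded by a spectral floor to limit artifacts."""
    return np.maximum(cdr / (cdr + 1.0), floor)
```

Multiplying the beamformer output spectrum by this gain suppresses bins
dominated by diffuse interference while leaving coherent (direct-path) speech
largely intact.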