Jointly optimal denoising, dereverberation, and source separation
This paper proposes methods that can optimize a Convolutional BeamFormer
(CBF) for jointly performing denoising, dereverberation, and source separation
(DN+DR+SS) in a computationally efficient way. Conventionally, a cascade
configuration composed of a Weighted Prediction Error minimization (WPE)
dereverberation filter followed by a Minimum Variance Distortionless Response
(MVDR) beamformer has been used as the state-of-the-art frontend for far-field
speech recognition; however, the overall optimality of this approach is not guaranteed. In
the blind signal processing area, an approach for jointly optimizing
dereverberation and source separation (DR+SS) has been proposed; however, this
approach incurs a very high computing cost and has not been extended to
DN+DR+SS. To overcome these limitations, this paper develops
new approaches for jointly optimizing DN+DR+SS in a computationally much more
efficient way. To this end, we first present an objective function to optimize
a CBF for performing DN+DR+SS based on maximum likelihood estimation, under the
assumption that the steering vectors of the target signals are given or can be
estimated, e.g., using a neural network. This paper refers to a CBF optimized
by this objective function as a weighted Minimum-Power Distortionless Response
(wMPDR) CBF. Then, we derive two algorithms for optimizing a wMPDR CBF based on
two different ways of factorizing a CBF into WPE filters and beamformers.
Experiments using noisy reverberant sound mixtures show that the proposed
optimization approaches greatly improve speech enhancement performance in
comparison with the conventional cascade configuration, in terms of both signal
distortion measures and ASR performance. It is also shown that the proposed
approaches greatly reduce the computing cost, with improved estimation
accuracy, in comparison with the conventional joint optimization
approach.
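As a rough illustration of the beamforming component, the Python sketch below computes a minimum-power distortionless response beamformer with a time-varying, inverse-power weighting of the spatial covariance, for a single frequency bin. It assumes the steering vector and the per-frame target power estimates are given (as the paper assumes, e.g., from a neural network); the convolutional (dereverberation) taps of the full CBF are omitted, and all names are illustrative.

```python
import numpy as np

def wmpdr_beamformer(X, d, lam, eps=1e-8):
    """Weighted MPDR beamformer for one frequency bin.

    X   : (M, T) complex STFT of M microphones over T frames
    d   : (M,)   steering vector of the target (assumed given)
    lam : (T,)   estimated time-varying power of the target
    Returns the (T,) enhanced STFT coefficients.
    """
    M, T = X.shape
    # Spatial covariance weighted by the inverse target power,
    # as in maximum-likelihood MPDR formulations.
    w = 1.0 / np.maximum(lam, eps)
    R = (X * w) @ X.conj().T / T
    R += eps * np.eye(M)                      # regularize
    Rinv_d = np.linalg.solve(R, d)
    h = Rinv_d / (d.conj() @ Rinv_d)          # distortionless: h^H d = 1
    return h.conj() @ X
```

The inverse-power weighting of the covariance is what distinguishes this weighted criterion from a plain MPDR beamformer, which would use the unweighted sample covariance.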
Speech Dereverberation Using Nonnegative Convolutive Transfer Function and Spectro-Temporal Modeling
This paper presents two single-channel speech dereverberation methods to
enhance the quality of speech signals that have been recorded in an enclosed
space. For both methods, the room acoustics are modeled using a nonnegative
approximation of the convolutive transfer function (NCTF), and, to additionally
exploit spectral properties of the speech signal such as the low-rank nature of
its spectrogram, the speech spectrogram is modeled using
nonnegative matrix factorization (NMF). Two methods are described to combine
the NCTF and NMF models. In the first method, referred to as the integrated
method, a cost function is constructed by directly integrating the speech NMF
model into the NCTF model, while in the second method, referred to as the
weighted method, the NCTF- and NMF-based cost functions are weighted and summed.
Efficient update rules are derived to solve both optimization problems. In
addition, an extension of the integrated method is presented, which exploits
the temporal dependencies of the speech signal. Several experiments are
performed on reverberant speech signals with and without background noise,
where the integrated method yields a considerably higher speech quality than
the baseline NCTF method and a state-of-the-art spectral enhancement method.
Moreover, the experimental results indicate that the weighted method can lead
to even better performance in terms of instrumental quality measures, but that
the optimal weighting parameter depends on the room acoustics and the utilized
NMF model. Modeling the temporal dependencies in the integrated method was
found to be useful only for highly reverberant conditions.
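For reference, the NMF building block that both methods share can be sketched in a few lines of Python; this is plain Euclidean-cost NMF with multiplicative updates, not the papers' coupled NCTF-NMF optimization.

```python
import numpy as np

def nmf(V, rank, n_iter=100, eps=1e-10):
    """Factorize a nonnegative spectrogram V (F x T) as W @ H.

    W (F x rank) holds spectral basis vectors, H (rank x T) their
    activations; multiplicative updates for the Euclidean cost.
    """
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    return W, H
```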
Speech Dereverberation in the STFT Domain
Reverberation is damaging to both the quality and the intelligibility of a
speech signal. We propose a novel single-channel method of dereverberation
based on a linear filter in the Short-Time Fourier Transform (STFT) domain. Each
enhanced frame is constructed from a linear sum of nearby frames based on the
channel impulse response. The results show that, given knowledge of the impulse
response, the method can restore any reverberant signal to a non-reverberant
one.
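A minimal sketch of the underlying idea, assuming the channel is approximated per frequency bin by a short convolutive transfer function (an assumption on our part; the paper's exact filter construction may differ):

```python
import numpy as np

def stft_deconvolve(X, H, eps=1e-8):
    """Recover the dry STFT from a reverberant one, one frequency bin.

    Assumes the known convolutive transfer function H (length-K complex
    taps for this bin) satisfies X[t] ~= sum_k H[k] * S[t-k], and
    inverts the convolution recursively over frames.
    """
    T, K = len(X), len(H)
    S = np.zeros(T, dtype=complex)
    for t in range(T):
        acc = X[t]
        for k in range(1, min(K, t + 1)):
            acc -= H[k] * S[t - k]         # subtract reverberant tail
        S[t] = acc / (H[0] + eps)          # undo the direct-path tap
    return S
```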
On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering
This report focuses on algorithms that perform single-channel speech
enhancement. The author of this report uses modulation-domain Kalman filtering
algorithms for speech enhancement, i.e. noise suppression and dereverberation,
in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be
applied to both noise and late-reverberation suppression, and in [2], [1], [3]
and [4], various model-based speech enhancement algorithms that perform
modulation-domain Kalman filtering are designed, implemented and tested. The
model-based enhancement algorithm in [2] estimates and tracks the speech phase.
The short-time-Fourier-transform-based enhancement algorithm in [5] uses the
active speech level estimator presented in [6]. This report describes how
different algorithms perform speech enhancement; it is addressed to researchers
interested in monaural speech enhancement. The algorithms are composed of
different processing blocks and techniques [7]; understanding the
implementation choices made during system design is important because this
provides insights that can assist the development of new algorithms. Index
Terms: speech enhancement, dereverberation, denoising, Kalman filter, minimum
mean squared error estimation.
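As a minimal illustration of the Kalman-filtering machinery involved, the sketch below tracks the spectral amplitude of a single frequency bin across frames with a first-order model; the cited algorithms use richer (e.g., autoregressive) models of the modulation envelope, so this is only the skeleton of the approach.

```python
import numpy as np

def kalman_track(y, a=0.98, q=1e-3, r=1e-1):
    """Scalar Kalman filter over noisy spectral amplitudes y[t]
    in one frequency bin (the modulation domain).

    State model:  x[t] = a * x[t-1] + w,  w ~ N(0, q)
    Observation:  y[t] = x[t] + v,        v ~ N(0, r)
    """
    x, p = y[0], 1.0
    out = np.empty_like(y)
    for t, obs in enumerate(y):
        x, p = a * x, a * a * p + q   # predict
        k = p / (p + r)               # Kalman gain
        x = x + k * (obs - x)         # update with the observation
        p = (1.0 - k) * p
        out[t] = x
    return out
```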
Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation is studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised learning problem, where the discriminative patterns of speech,
speakers, and background noise are learned from training data. Over the past
decade, many supervised separation algorithms have been put forward. In
particular, the recent introduction of deep learning to supervised speech
separation has dramatically accelerated progress and boosted separation
performance. This article provides a comprehensive overview of the research on
deep learning based supervised speech separation in the last several years. We
first introduce the background of speech separation and the formulation of
supervised separation. Then we discuss three main components of supervised
separation: learning machines, training targets, and acoustic features. Much of
the overview is on separation algorithms, where we review monaural methods,
including speech enhancement (speech-nonspeech separation), speaker separation
(multi-talker separation), and speech dereverberation, as well as
multi-microphone techniques. The important issue of generalization, unique to
supervised learning, is discussed. This overview provides a historical
perspective on how advances are made. In addition, we discuss a number of
conceptual issues, including what constitutes the target source.
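As a concrete example of one widely used training target covered by such overviews, the ideal ratio mask (IRM) can be computed from parallel clean-speech and noise magnitude spectrograms:

```python
import numpy as np

def ideal_ratio_mask(S, N, beta=0.5):
    """Ideal ratio mask from clean-speech and noise magnitude
    spectrograms S and N (same shape); beta=0.5 is a common choice."""
    return (S**2 / (S**2 + N**2 + 1e-10)) ** beta

# A supervised model is then trained on pairs of
# (features of the S+N mixture, ideal_ratio_mask(S, N)).
```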
Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence
Most speech enhancement algorithms make use of the short-time Fourier
transform (STFT), which is a simple and flexible time-frequency decomposition
that estimates the short-time spectrum of a signal. However, the duration of
short STFT frames is inherently limited by the nonstationarity of speech
signals. The main contribution of this paper is a demonstration of speech
enhancement and automatic speech recognition in the presence of reverberation
and noise by extending the length of analysis windows. We accomplish this
extension by performing enhancement in the short-time fan-chirp transform
(STFChT) domain, an overcomplete time-frequency representation that is coherent
with speech signals over longer analysis window durations than the STFT. This
extended coherence is gained by using a linear model of fundamental frequency
variation of voiced speech signals. Our approach centers on using a
single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA)
estimator proposed by Habets, which scales coefficients in a time-frequency
domain to suppress noise and reverberation. In the case of multiple
microphones, we preprocess the data with either a minimum variance
distortionless response (MVDR) beamformer, or a delay-and-sum beamformer (DSB).
We evaluate our algorithm on both speech enhancement and recognition tasks for
the REVERB challenge dataset. Compared to the same processing done in the STFT
domain, our approach achieves significant improvement in terms of objective
enhancement metrics (including PESQ---the ITU-T standard measurement for speech
quality). In terms of automatic speech recognition (ASR) performance as
measured by word error rate (WER), our experiments indicate that the STFT with
a long window is more effective for ASR.
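At the core of this approach is the classical MMSE-LSA gain of Ephraim and Malah, which can be written compactly as below; Habets' variant additionally folds late-reverberation power into the interference estimate, which is not shown here.

```python
import numpy as np
from scipy.special import exp1

def mmse_lsa_gain(xi, gamma):
    """MMSE log-spectral amplitude gain (Ephraim-Malah).

    xi    : a priori SNR per time-frequency bin
    gamma : a posteriori SNR per time-frequency bin
    """
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(np.maximum(v, 1e-10)))

# Enhanced spectrum: S_hat = mmse_lsa_gain(xi, gamma) * Y
```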
Integrated Speech Enhancement Method Based on Weighted Prediction Error and DNN for Dereverberation and Denoising
Both reverberation and additive noise degrade speech quality and
intelligibility. The weighted prediction error (WPE) method performs well for
dereverberation but has limitations. First, WPE does not consider the influence
of additive noise, which degrades its dereverberation performance.
Second, it relies on a time-consuming iterative process, and there is no
guarantee of, or widely accepted criterion for, its convergence. In this paper, we
integrate a deep neural network (DNN) into WPE for dereverberation and denoising.
The DNN is used to suppress the background noise to meet the noise-free
assumption of WPE. Meanwhile, the DNN directly predicts the spectral variance
of the target speech so that WPE can work without iteration. The experimental
results show that the proposed method significantly improves speech quality
and runs fast.
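A minimal sketch of the iteration-free WPE step for one frequency bin, assuming the per-frame target power lam comes from a DNN (variable names and tap/delay settings are illustrative):

```python
import numpy as np

def wpe_one_bin(x, lam, taps=10, delay=3, eps=1e-8):
    """Single-channel WPE for one frequency bin, without iterations:
    the target power lam[t] is assumed given (e.g., DNN-predicted)
    rather than re-estimated.

    x   : (T,) complex STFT coefficients of the reverberant signal
    lam : (T,) estimated power of the desired (early) speech
    """
    T = len(x)
    # Delayed observation matrix: row t holds x[t-delay .. t-delay-taps+1].
    X_bar = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        d = delay + k
        X_bar[d:, k] = x[: T - d]
    w = 1.0 / np.maximum(lam, eps)
    A = (X_bar.conj().T * w) @ X_bar + eps * np.eye(taps)
    b = (X_bar.conj().T * w) @ x
    g = np.linalg.solve(A, b)     # linear prediction filter
    return x - X_bar @ g          # dereverberated output
```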
Speech Dereverberation with Context-aware Recurrent Neural Networks
In this paper, we propose a model that performs speech dereverberation by
estimating the clean spectral magnitude from its reverberant counterpart. Our models
are capable of extracting features that take into account both short- and
long-term dependencies in the signal through a convolutional encoder (which
extracts features from a short, bounded context of frames) and a recurrent
neural network for extracting long-term information. Our model outperforms a
recently proposed model that uses different context information depending on
the reverberation time, without requiring any sort of additional input,
yielding improvements of up to 0.4 on PESQ, 0.3 on STOI, and 1.0 on POLQA
relative to reverberant speech. We also show our model is able to generalize to
real room impulse responses even when only trained with simulated room impulse
responses, different speakers, and high reverberation times. Lastly, listening
tests show the proposed method outperforming benchmark models in reduction of
perceived reverberation.
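A skeletal PyTorch version of the described encoder-plus-recurrent architecture might look as follows; layer sizes and the exact topology are our assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class ConvRNNDereverb(nn.Module):
    """Sketch of the general architecture: a convolutional encoder over
    a short, bounded frame context, followed by a recurrent layer that
    captures long-term information. Sizes are illustrative."""

    def __init__(self, n_freq=257, channels=64, hidden=256):
        super().__init__()
        # 1-D convolution across time: each output frame sees a
        # short, bounded context of input frames.
        self.encoder = nn.Conv1d(n_freq, channels, kernel_size=5, padding=2)
        self.rnn = nn.GRU(channels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, mag):              # mag: (batch, time, n_freq)
        z = self.encoder(mag.transpose(1, 2)).transpose(1, 2)
        z, _ = self.rnn(torch.relu(z))
        return torch.relu(self.out(z))   # enhanced magnitude estimate
```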
A robust DOA estimation method for a linear microphone array under reverberant and noisy environments
A robust method for a linear microphone array is proposed to address the
difficulty of direction-of-arrival (DOA) estimation in reverberant and noisy
environments. A
direct-path dominance test based on onset detection is utilized to extract
time-frequency bins containing the direct propagation of the speech. The
influence of transient noise, which severely contaminates the onset test,
is mitigated by a proper transient-noise determination scheme. Then, for voice
features, a two-stage procedure is designed based on the extracted bins and an
effective dereverberation method, with robust but possibly biased estimation
from middle-frequency bins followed by further refinement in higher-frequency
bins. The proposed method effectively alleviates the estimation bias caused by
the linear arrangement of microphones, and has stable performance under noisy
and reverberant environments. Experimental evaluation using a 4-element
microphone array demonstrates the efficacy of the proposed method.
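The flavor of onset-gated DOA estimation can be sketched for a single microphone pair as below; the paper's transient-noise handling and two-stage refinement are not reproduced, and the threshold and geometry parameters are illustrative.

```python
import numpy as np

def doa_two_mics(X1, X2, freqs, d=0.05, c=343.0, onset_db=6.0):
    """Onset-gated DOA estimate for one pair of a linear array.

    X1, X2 : (F, T) STFTs of the two microphones
    freqs  : (F,) bin center frequencies in Hz
    d      : microphone spacing in metres
    """
    power = np.abs(X1) ** 2
    # Onset test: keep bins whose energy jumps sharply over the
    # previous frame, a proxy for direct-path dominance.
    rise = 10 * np.log10((power[:, 1:] + 1e-12) / (power[:, :-1] + 1e-12))
    f_idx, t_idx = np.nonzero(rise > onset_db)
    t_idx = t_idx + 1
    # Inter-microphone phase difference -> sin(DOA) per selected bin.
    phase = np.angle(X1[f_idx, t_idx] * np.conj(X2[f_idx, t_idx]))
    sin_theta = phase * c / (2 * np.pi * freqs[f_idx] * d + 1e-12)
    sin_theta = sin_theta[np.abs(sin_theta) <= 1.0]  # drop aliased bins
    return np.degrees(np.arcsin(np.median(sin_theta)))
```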
Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding
The performance of speech enhancement algorithms in a multi-speaker scenario
depends on correctly identifying the target speaker to be enhanced. Auditory
attention decoding (AAD) methods make it possible to identify, from
single-trial EEG recordings, the target speaker the listener is attending
to. Aiming at enhancing
the target speaker and suppressing interfering speakers, reverberation and
ambient noise, in this paper we propose a cognitive-driven multi-microphone
speech enhancement system, which combines a neural-network-based mask
estimator, weighted minimum power distortionless response convolutional
beamformers and AAD. To control the suppression of the interfering speaker, we
also propose an extension incorporating an interference suppression constraint.
The experimental results show that the proposed system outperforms the
state-of-the-art cognitive-driven speech enhancement systems in challenging
reverberant and noisy conditions.
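The AAD component typically reduces to correlating a linearly reconstructed envelope with each speaker's envelope; a minimal sketch, assuming pre-trained backward-model decoder weights:

```python
import numpy as np

def decode_attention(eeg, env_a, env_b, decoder):
    """Correlation-based auditory attention decoding (the standard
    backward-model approach; decoder weights are assumed pre-trained).

    eeg     : (T, C) EEG samples over C channels
    env_a/b : (T,) speech envelopes of the two competing speakers
    decoder : (C,) linear backward-model weights
    Returns 0 if speaker A appears attended, else 1.
    """
    recon = eeg @ decoder          # reconstructed attended envelope
    corr = lambda u, v: np.corrcoef(u, v)[0, 1]
    return 0 if corr(recon, env_a) >= corr(recon, env_b) else 1
```

The decoded speaker index then selects which output of the mask-estimation and beamforming stages to enhance.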