On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering
This report focuses on algorithms that perform single-channel speech
enhancement. The author of this report uses modulation-domain Kalman filtering
algorithms for speech enhancement, i.e. noise suppression and dereverberation,
in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be applied to both noise and late reverberation suppression; in [2], [1], [3] and [4], various model-based speech enhancement algorithms that perform modulation-domain Kalman filtering are designed, implemented and tested. The
model-based enhancement algorithm in [2] estimates and tracks the speech phase.
The short-time-Fourier-transform-based enhancement algorithm in [5] uses the
active speech level estimator presented in [6]. This report describes how different algorithms perform speech enhancement; it is addressed to researchers interested in monaural speech enhancement. The algorithms are composed of different processing blocks and techniques [7], and understanding the implementation choices made during system design is important because it provides insights that can assist the development of new algorithms. Index Terms - Speech enhancement, dereverberation, denoising, Kalman filter, minimum mean squared error estimation.
Comment: 13 pages
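As a rough illustration of the modulation-domain idea, the sketch below tracks the temporal envelope of a single spectral band with a scalar Kalman filter built on an AR(1) speech-modulation model. All parameters (AR coefficient, process and observation noise variances) are illustrative assumptions, not the configuration used in [1]-[5].

```python
import numpy as np

def modulation_kalman_track(noisy_env, a=0.95, q=0.01, r=0.1):
    """Track a clean-speech modulation envelope with a scalar Kalman filter.

    noisy_env : noisy temporal envelope of one frequency bin across frames
    a         : AR(1) coefficient of the assumed speech-modulation model
    q         : process (model) noise variance
    r         : observation (additive noise) variance
    """
    x_hat, p = 0.0, 1.0                    # state estimate and its variance
    out = np.empty_like(noisy_env)
    for t, y in enumerate(noisy_env):
        x_pred = a * x_hat                 # predict with the AR(1) model
        p_pred = a * a * p + q
        k = p_pred / (p_pred + r)          # Kalman gain
        x_hat = x_pred + k * (y - x_pred)  # correct with the noisy observation
        p = (1.0 - k) * p_pred
        out[t] = x_hat
    return out

# Toy usage: a 4 Hz modulation envelope buried in noise.
rng = np.random.default_rng(0)
clean = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * np.linspace(0, 1, 100))
noisy = clean + rng.normal(scale=0.3, size=clean.size)
enhanced = modulation_kalman_track(noisy)
```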
Evaluating the Non-Intrusive Room Acoustics Algorithm with the ACE Challenge
We present a single channel data driven method for non-intrusive estimation
of full-band reverberation time and full-band direct-to-reverberant ratio. The
method extracts a number of features from reverberant speech and builds a model
using a recurrent neural network to estimate the reverberant acoustic
parameters. We explore three configurations by including different data and
also by combining the recurrent neural network estimates using a support vector
machine. Our best method to estimate DRR provides a Root Mean Square Deviation (RMSD) of 3.84 dB, and an RMSD of 43.19% for T60 estimation.
Comment: In Proceedings of the ACE Challenge Workshop - a satellite event of IEEE-WASPAA 2015 (arXiv:1510.00383)
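A minimal sketch of such a data-driven estimator, assuming generic per-frame features, an LSTM regressor, and mean pooling over time; the feature extraction front-end and the SVM combination stage described above are omitted:

```python
import torch
import torch.nn as nn

class AcousticParamEstimator(nn.Module):
    """Recurrent regressor mapping per-frame speech features to a joint
    (T60, DRR) estimate; all dimensions are illustrative."""
    def __init__(self, n_feats=40, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)      # -> [T60 (s), DRR (dB)]

    def forward(self, feats):                 # feats: (batch, frames, n_feats)
        seq, _ = self.rnn(feats)
        pooled = seq.mean(dim=1)              # integrate frame estimates over time
        return self.head(pooled)

model = AcousticParamEstimator()
dummy_feats = torch.randn(8, 200, 40)         # 8 utterances, 200 frames each
t60_drr = model(dummy_feats)                  # shape (8, 2)
```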
Speech Dereverberation with Context-aware Recurrent Neural Networks
In this paper, we propose a model to perform speech dereverberation by
estimating its spectral magnitude from the reverberant counterpart. Our models
are capable of extracting features that take into account both short and
long-term dependencies in the signal through a convolutional encoder (which
extracts features from a short, bounded context of frames) and a recurrent
neural network for extracting long-term information. Our model outperforms a
recently proposed model that uses different context information depending on
the reverberation time, without requiring any sort of additional input,
yielding improvements of up to 0.4 on PESQ, 0.3 on STOI, and 1.0 on POLQA
relative to reverberant speech. We also show our model is able to generalize to
real room impulse responses even when only trained with simulated room impulse
responses, different speakers, and high reverberation times. Lastly, listening
tests show the proposed method outperforming benchmark models in reduction of
perceived reverberation.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
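A minimal sketch of the described encoder/recurrent structure, with all layer sizes, the kernel width, and the bounded frame context assumed for illustration:

```python
import torch
import torch.nn as nn

class ContextAwareDereverb(nn.Module):
    """Convolutional encoder over a short, bounded frame context followed
    by an LSTM for long-term dependencies, mapping a reverberant magnitude
    spectrogram to an estimate of the clean one."""
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        # 11-frame kernel = the short, bounded context of the encoder.
        self.encoder = nn.Conv1d(n_bins, hidden, kernel_size=11, padding=5)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_bins)

    def forward(self, mag):                     # mag: (batch, frames, n_bins)
        h = torch.relu(self.encoder(mag.transpose(1, 2))).transpose(1, 2)
        h, _ = self.rnn(h)                      # long-term information
        return torch.relu(self.decoder(h))      # non-negative magnitude estimate

net = ContextAwareDereverb()
estimate = net(torch.rand(2, 100, 257))          # (2, 100, 257)
```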
Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation is studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised learning problem, where the discriminative patterns of speech,
speakers, and background noise are learned from training data. Over the past
decade, many supervised separation algorithms have been put forward. In
particular, the recent introduction of deep learning to supervised speech
separation has dramatically accelerated progress and boosted separation
performance. This article provides a comprehensive overview of the research on
deep learning based supervised speech separation in the last several years. We
first introduce the background of speech separation and the formulation of
supervised separation. Then we discuss three main components of supervised
separation: learning machines, training targets, and acoustic features. Much of
the overview is on separation algorithms where we review monaural methods,
including speech enhancement (speech-nonspeech separation), speaker separation
(multi-talker separation), and speech dereverberation, as well as
multi-microphone techniques. The important issue of generalization, unique to
supervised learning, is discussed. This overview provides a historical
perspective on how advances are made. In addition, we discuss a number of
conceptual issues, including what constitutes the target source.
Comment: 27 pages, 17 figures
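As one concrete example of a training target discussed in such overviews, the sketch below computes the ideal ratio mask (IRM) from toy speech and noise magnitude spectrograms; the compression exponent and the mixture approximation are common illustrative choices, not prescriptions from this article:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Ideal ratio mask: per time-frequency unit, the (compressed) ratio
    of speech energy to speech-plus-noise energy."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

# Toy usage: build the training target from known speech and noise,
# then apply it to an (approximate) mixture magnitude.
rng = np.random.default_rng(1)
S = rng.rayleigh(size=(100, 257))              # |speech| spectrogram
N = rng.rayleigh(size=(100, 257))              # |noise| spectrogram
irm = ideal_ratio_mask(S, N)
enhanced = irm * np.sqrt(S ** 2 + N ** 2)      # masked mixture estimate
```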
Joint Estimation of Reverberation Time and Direct-to-Reverberation Ratio from Speech using Auditory-Inspired Features
Blind estimation of acoustic room parameters such as the reverberation time (T60) and the direct-to-reverberant ratio (DRR) is still a challenging task, especially in the case of blind estimation from reverberant speech signals. In this work, a novel approach is proposed for joint estimation of T60 and DRR from wideband speech in noisy conditions. 2D Gabor filters arranged in a filterbank are exploited for extracting features, which are then used as input to a multi-layer perceptron (MLP). The MLP output neurons correspond to specific pairs of (T60, DRR) estimates; the output is integrated over time, and a simple decision rule results in the final estimate. The approach is applied to single-microphone fullband speech signals provided by the Acoustic Characterization of Environments (ACE) Challenge. Our approach outperforms the baseline systems with median errors of close-to-zero and -1.5 dB for the T60 and DRR estimates, respectively, while the calculation of estimates is 5.8 times faster compared to the baseline.
Comment: In Proceedings of the ACE Challenge Workshop - a satellite event of IEEE-WASPAA 2015 (arXiv:1510.00383)
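A minimal sketch of the decision stage, assuming a hypothetical grid of (T60, DRR) classes and random stand-in MLP posteriors; the 2D Gabor feature extraction and the MLP itself are omitted:

```python
import numpy as np

# Hypothetical grid of (T60 [s], DRR [dB]) classes, one MLP output neuron each.
t60_grid = np.arange(0.2, 1.21, 0.2)           # 0.2 ... 1.2 s
drr_grid = np.arange(-6, 19, 6)                # -6 ... 18 dB
pairs = [(t, d) for t in t60_grid for d in drr_grid]

def decide(frame_posteriors):
    """Integrate per-frame MLP posteriors over time, then apply a simple
    argmax decision rule to pick the jointly estimated (T60, DRR) pair."""
    integrated = frame_posteriors.mean(axis=0)  # average over frames
    return pairs[int(np.argmax(integrated))]

# Stand-in posteriors for 300 frames (would come from the trained MLP).
rng = np.random.default_rng(2)
posteriors = rng.dirichlet(np.ones(len(pairs)), size=300)
t60_est, drr_est = decide(posteriors)
```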
Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence
Most speech enhancement algorithms make use of the short-time Fourier
transform (STFT), which is a simple and flexible time-frequency decomposition
that estimates the short-time spectrum of a signal. However, the duration of short STFT frames is inherently limited by the nonstationarity of speech
signals. The main contribution of this paper is a demonstration of speech
enhancement and automatic speech recognition in the presence of reverberation
and noise by extending the length of analysis windows. We accomplish this
extension by performing enhancement in the short-time fan-chirp transform
(STFChT) domain, an overcomplete time-frequency representation that is coherent
with speech signals over longer analysis window durations than the STFT. This
extended coherence is gained by using a linear model of fundamental frequency
variation of voiced speech signals. Our approach centers around using a
single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA)
estimator proposed by Habets, which scales coefficients in a time-frequency
domain to suppress noise and reverberation. In the case of multiple
microphones, we preprocess the data with either a minimum variance
distortionless response (MVDR) beamformer, or a delay-and-sum beamformer (DSB).
We evaluate our algorithm on both speech enhancement and recognition tasks for
the REVERB challenge dataset. Compared to the same processing done in the STFT
domain, our approach achieves significant improvement in terms of objective
enhancement metrics (including PESQ, the ITU-T standard measure of speech quality). In terms of automatic speech recognition (ASR) performance as
measured by word error rate (WER), our experiments indicate that the STFT with
a long window is more effective for ASR.
Comment: 22 pages
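For reference, a sketch of the generic MMSE-LSA gain rule (Ephraim and Malah) applied per time-frequency coefficient; the a priori and a posteriori SNR values here are toy placeholders, and the paper's specific estimator and the STFChT domain are not reproduced:

```python
import numpy as np
from scipy.special import exp1

def mmse_lsa_gain(xi, gamma):
    """MMSE log-spectral amplitude gain for one time-frequency coefficient.

    xi    : a priori SNR estimate
    gamma : a posteriori SNR estimate
    """
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))

# Toy usage: scale noisy coefficients (here random placeholders) by the gain.
rng = np.random.default_rng(3)
Y = rng.normal(size=(100, 257)) + 1j * rng.normal(size=(100, 257))
noise_psd = 1.0                         # placeholder noise PSD
gamma = np.abs(Y) ** 2 / noise_psd      # a posteriori SNR
xi = np.maximum(gamma - 1.0, 1e-3)      # crude ML a priori SNR estimate
S_hat = mmse_lsa_gain(xi, gamma) * Y    # suppressed noisy coefficients
```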
Speech Enhancement with Wide Residual Networks in Reverberant Environments
This paper proposes a speech enhancement method which exploits the high potential of residual connections in a Wide Residual Network architecture. The network builds on one-dimensional convolutions computed along the time axis, a powerful approach for processing temporally correlated representations such as speech feature sequences.
We find the residual mechanism extremely useful for the enhancement task since
the signal always has a linear shortcut and the non-linear path enhances it in
several steps by adding or subtracting corrections. The enhancement capability
of the proposal is assessed by objective quality metrics evaluated with
simulated and real samples of reverberated speech signals. Results show that
the proposal outperforms the state-of-the-art method called WPE, which is known
to effectively reduce reverberation and greatly enhance the signal. The
proposed model, trained with artificially synthesized reverberation data, was able to generalize to real room impulse responses for a variety of conditions (e.g. different room sizes, T60 values, near & far field). Furthermore, it achieves accurate enhancement for real speech with reverberation from two different datasets.
Comment: 5 pages, 4 figures. arXiv admin note: text overlap with arXiv:1901.00660, arXiv:1904.0451
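A minimal sketch of the central ingredient, a residual block of one-dimensional convolutions along time in which the identity shortcut carries the signal and the non-linear path adds a correction; channel count and kernel size are assumptions:

```python
import torch
import torch.nn as nn

class Residual1DBlock(nn.Module):
    """Residual block of 1-D convolutions along the time axis: the identity
    shortcut carries the signal, the non-linear path learns a correction."""
    def __init__(self, channels=64, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad),
        )

    def forward(self, x):                # x: (batch, channels, frames)
        return x + self.body(x)          # linear shortcut + learned correction

block = Residual1DBlock()
out = block(torch.randn(4, 64, 200))     # shape preserved: (4, 64, 200)
```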
Progressive Speech Enhancement with Residual Connections
This paper studies speech enhancement based on deep neural networks. The proposed architecture gradually follows the signal transformation during enhancement by means of a visualization probe at each network block. Throughout the process, the enhancement performance is visually inspected and evaluated in terms of regression cost. This progressive scheme is based on Residual Networks. We investigate a residual connection with a constant number of channels, the inclusion of internal state between blocks, and the addition of progressive supervision. The insights provided by interpreting the network's enhancement process lead us to design an improved architecture for the enhancement task. Following this strategy, we are able to obtain speech enhancement results beyond the state of the art, achieving a favorable trade-off between dereverberation and the amount of spectral distortion.
Comment: 5 pages, 5 figures
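A minimal sketch of the progressive scheme under stated assumptions: a stack of residual stages with a probe after each block, and a regression cost accumulated over all probes; sizes and the loss choice are illustrative:

```python
import torch
import torch.nn as nn

def res_stage(channels=64):
    # One residual stage: conv -> ReLU -> conv, added to its input below.
    return nn.Sequential(
        nn.Conv1d(channels, channels, 3, padding=1),
        nn.ReLU(),
        nn.Conv1d(channels, channels, 3, padding=1),
    )

class ProgressiveEnhancer(nn.Module):
    """Residual stages with a probe after each block, so the gradual
    signal transformation can be inspected and supervised."""
    def __init__(self, channels=64, n_blocks=4):
        super().__init__()
        self.stages = nn.ModuleList(res_stage(channels) for _ in range(n_blocks))

    def forward(self, x):                    # x: (batch, channels, frames)
        probes = []
        for stage in self.stages:
            x = x + stage(x)                 # constant channel count throughout
            probes.append(x)                 # visualization / supervision probe
        return x, probes

net = ProgressiveEnhancer()
x, target = torch.randn(2, 64, 100), torch.randn(2, 64, 100)
out, probes = net(x)
# Progressive supervision: accumulate the regression cost at every probe.
loss = sum(nn.functional.mse_loss(p, target) for p in probes)
```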
LSTM based AE-DNN constraint for better late reverb suppression in multi-channel LP formulation
Prediction of the late reverberation component using multi-channel linear
prediction (MCLP) in short-time Fourier transform (STFT) domain is an effective
means to enhance reverberant speech. Traditionally, a speech power spectral
density (PSD) weighted prediction error (WPE) minimization approach is used to
estimate the prediction filters. The method is sensitive to the estimate of the
desired signal PSD. In this paper, we propose a deep neural network (DNN) based
non-linear estimate of the desired signal PSD. An autoencoder trained on
clean speech STFT coefficients is used as the desired signal prior. We explore
two different architectures based on (i) fully-connected (FC) feed-forward, and
(ii) recurrent long short-term memory (LSTM) layers. Experiments using real
room impulse responses show that the LSTM-DNN based PSD estimate performs
better than the traditional methods for late reverb suppression.
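A single-channel, one-frequency-bin sketch of the WPE-style MCLP iteration; here the desired-signal PSD is the classical running power estimate rather than the paper's autoencoder/LSTM prior, and the filter order, prediction delay, and iteration count are assumptions:

```python
import numpy as np

def wpe_one_bin(y, order=10, delay=3, iters=3, eps=1e-8):
    """Weighted-prediction-error dereverberation for one STFT bin: predict
    the late reverberation from delayed past frames, weighting the error by
    the desired signal's PSD, then subtract the prediction."""
    T = len(y)
    d = y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(d) ** 2, eps)   # desired-signal PSD estimate
        # Regression matrix of delayed frames (T x order).
        Y = np.zeros((T, order), dtype=complex)
        for k in range(order):
            shift = delay + k
            Y[shift:, k] = y[:T - shift]
        # PSD-weighted least squares for the prediction filter g.
        W = Y / lam[:, None]
        R = W.conj().T @ Y + eps * np.eye(order)
        g = np.linalg.solve(R, W.conj().T @ y)
        d = y - Y @ g                           # remove predicted late reverb
    return d

rng = np.random.default_rng(4)
y = rng.normal(size=500) + 1j * rng.normal(size=500)   # toy STFT bin
enhanced = wpe_one_bin(y)
```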
Robust coherence-based spectral enhancement for distant speech recognition
In this contribution to the 3rd CHiME Speech Separation and Recognition
Challenge (CHiME-3) we extend the acoustic front-end of the CHiME-3 baseline
speech recognition system by a coherence-based Wiener filter which is applied
to the output signal of the baseline beamformer. To compute the time- and
frequency-dependent postfilter gains the ratio between direct and diffuse
signal components at the output of the baseline beamformer is estimated and
used as an approximation of the short-time signal-to-noise ratio. The proposed
spectral enhancement technique is evaluated with respect to word error rates of
the CHiME-3 challenge baseline speech recognition system using real speech
recorded in public environments. Results confirm the effectiveness of the
coherence-based postfilter when integrated into the front-end signal
enhancement.
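A minimal sketch of such a coherence-based postfilter, assuming a per-bin coherent-to-diffuse ratio (CDR) estimate is already available and is used as a stand-in for the short-time SNR in a Wiener gain rule; the gain floor is an assumption:

```python
import numpy as np

def coherence_postfilter_gain(cdr, g_min=0.1):
    """Wiener-like postfilter gain from a coherent-to-diffuse ratio (CDR),
    used here as a stand-in for the short-time SNR."""
    gain = cdr / (cdr + 1.0)           # Wiener rule: SNR / (SNR + 1)
    return np.maximum(gain, g_min)     # assumed gain floor

# Toy usage: apply the gain per time-frequency bin to a beamformer output.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 257)) + 1j * rng.normal(size=(100, 257))
cdr = rng.uniform(0.0, 10.0, size=X.shape)   # placeholder CDR estimates
X_post = coherence_postfilter_gain(cdr) * X
```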