7,715 research outputs found
DNN-Based Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement
Multi-frame approaches for single-microphone speech enhancement, e.g., the
multi-frame minimum-variance-distortionless-response (MVDR) filter, are able to
exploit speech correlations across neighboring time frames. In contrast to
single-frame approaches such as the Wiener gain, it has been shown that
multi-frame approaches achieve a substantial noise reduction with hardly any
speech distortion, provided that an accurate estimate of the correlation
matrices and especially the speech interframe correlation vector is available.
Typical estimation procedures of the correlation matrices and the speech
interframe correlation (IFC) vector require an estimate of the speech presence
probability (SPP) in each time-frequency bin. In this paper, we propose to use
a bi-directional long short-term memory deep neural network (DNN) to estimate a
speech mask and a noise mask for each time-frequency bin, using which two
different SPP estimates are derived. Aiming at achieving a robust performance,
the DNN is trained for various noise types and signal-to-noise ratios.
Experimental results show that the multi-frame MVDR in combination with the
proposed data-driven SPP estimator yields an increased speech quality compared
to a state-of-the-art model-based estimator
EMD-based filtering (EMDF) of low-frequency noise for speech enhancement
An Empirical Mode Decomposition based filtering (EMDF) approach is presented as a post-processing stage for speech enhancement. This method is particularly effective in low frequency noise environments. Unlike previous EMD based denoising methods, this approach does not make the assumption that the contaminating noise signal is fractional Gaussian Noise. An adaptive method is developed to select the IMF index for separating the noise components from the speech based on the second-order IMF statistics. The low frequency noise components are then separated by a partial reconstruction from the IMFs. It is shown that the proposed EMDF technique is able to suppress residual noise from speech signals that were enhanced by the conventional optimallymodified log-spectral amplitude approach which uses a minimum statistics based noise estimate. A comparative performance study is included that demonstrates the effectiveness of the EMDF system in various noise environments, such as car interior noise, military vehicle noise and babble noise. In particular, improvements up to 10 dB are obtained in car noise environments. Listening tests were performed that confirm the results
A Bayesian Network View on Acoustic Model-Based Techniques for Robust Speech Recognition
This article provides a unifying Bayesian network view on various approaches
for acoustic model adaptation, missing feature, and uncertainty decoding that
are well-known in the literature of robust automatic speech recognition. The
representatives of these classes can often be deduced from a Bayesian network
that extends the conventional hidden Markov models used in speech recognition.
These extensions, in turn, can in many cases be motivated from an underlying
observation model that relates clean and distorted feature vectors. By
converting the observation models into a Bayesian network representation, we
formulate the corresponding compensation rules leading to a unified view on
known derivations as well as to new formulations for certain approaches. The
generic Bayesian perspective provided in this contribution thus highlights
structural differences and similarities between the analyzed approaches
Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates
This work addresses the problem of block-online processing for multi-channel
speech enhancement. Such processing is vital in scenarios with moving speakers
and/or when very short utterances are processed, e.g., in voice assistant
scenarios. We consider several variants of a system that performs beamforming
supported by DNN-based voice activity detection (VAD) followed by
post-filtering. The speaker is targeted through estimating relative transfer
functions between microphones. Each block of the input signals is processed
independently in order to make the method applicable in highly dynamic
environments. Owing to the short length of the processed block, the statistics
required by the beamformer are estimated less precisely. The influence of this
inaccuracy is studied and compared to the processing regime when recordings are
treated as one block (batch processing). The experimental evaluation of the
proposed method is performed on large datasets of CHiME-4 and on another
dataset featuring moving target speaker. The experiments are evaluated in terms
of objective and perceptual criteria (such as signal-to-interference ratio
(SIR) or perceptual evaluation of speech quality (PESQ), respectively).
Moreover, word error rate (WER) achieved by a baseline automatic speech
recognition system is evaluated, for which the enhancement method serves as a
front-end solution. The results indicate that the proposed method is robust
with respect to short length of the processed block. Significant improvements
in terms of the criteria and WER are observed even for the block length of 250
ms.Comment: 10 pages, 8 figures, 4 tables. Modified version of the article
accepted for publication in IET Signal Processing journal. Original results
unchanged, additional experiments presented, refined discussion and
conclusion
- …