Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments
Multichannel linear filters, such as the Multichannel Wiener Filter (MWF) and
the Generalized Eigenvalue (GEV) beamformer, are popular signal processing
techniques which can improve speech recognition performance. In this paper, we
present an experimental study on these linear filters in a specific speech
recognition task, namely the CHiME-4 challenge, which features real recordings
in multiple noisy environments. Specifically, the rank-1 MWF is employed for
noise reduction and a new constant residual noise power constraint is derived
which enhances the recognition performance. To fulfill the underlying rank-1
assumption, the speech covariance matrix is reconstructed based on eigenvectors
or generalized eigenvectors. Then the rank-1 constrained MWF is evaluated with
alternative multichannel linear filters under the same framework, which
involves a Bidirectional Long Short-Term Memory (BLSTM) network for mask
estimation. The proposed filter outperforms alternative ones, leading to a 40%
relative Word Error Rate (WER) reduction compared with the baseline Weighted
Delay and Sum (WDAS) beamformer on the real test set, and a 15% relative WER
reduction compared with the GEV-BAN method. The results also suggest that the
speech recognition accuracy correlates more with the Mel-frequency cepstral
coefficients (MFCC) feature variance than with the noise reduction or the
speech distortion level.
Comment: for Computer Speech and Language
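The rank-1 reconstruction of the speech covariance matrix from (generalized) eigenvectors can be sketched in a few lines. This is a minimal NumPy illustration under common assumptions (noise-covariance whitening to solve the generalized eigenvalue problem, principal-eigenvalue scaling), not the authors' implementation:

```python
import numpy as np

def rank1_speech_covariance(phi_s, phi_n):
    """Reconstruct a rank-1 speech covariance from the principal generalized
    eigenvector of the matrix pencil (phi_s, phi_n).

    phi_s, phi_n: (M, M) Hermitian speech / noise spatial covariance estimates.
    """
    # Solve phi_s v = lambda * phi_n v by whitening with the noise covariance.
    L = np.linalg.cholesky(phi_n)
    L_inv = np.linalg.inv(L)
    whitened = L_inv @ phi_s @ L_inv.conj().T
    eigvals, eigvecs = np.linalg.eigh(whitened)   # ascending order
    lam, u = eigvals[-1], eigvecs[:, -1]          # principal eigenpair
    v = L_inv.conj().T @ u                        # de-whiten
    # Rank-1 reconstruction, scaled by the principal eigenvalue (an assumed
    # scaling choice for this sketch).
    return lam * np.outer(v, v.conj())
```

With white noise (phi_n = I) and an exactly rank-1 speech covariance, the function returns that covariance unchanged, which is a quick sanity check of the reconstruction.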
Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates
This work addresses the problem of block-online processing for multi-channel
speech enhancement. Such processing is vital in scenarios with moving speakers
and/or when very short utterances are processed, e.g., in voice assistant
scenarios. We consider several variants of a system that performs beamforming
supported by DNN-based voice activity detection (VAD) followed by
post-filtering. The speaker is targeted through estimating relative transfer
functions between microphones. Each block of the input signals is processed
independently in order to make the method applicable in highly dynamic
environments. Owing to the short length of the processed block, the statistics
required by the beamformer are estimated less precisely. The influence of this
inaccuracy is studied and compared to the processing regime when recordings are
treated as one block (batch processing). The experimental evaluation of the
proposed method is performed on the large CHiME-4 datasets and on another
dataset featuring a moving target speaker. The experiments are evaluated in terms
of objective and perceptual criteria (such as signal-to-interference ratio
(SIR) or perceptual evaluation of speech quality (PESQ), respectively).
Moreover, word error rate (WER) achieved by a baseline automatic speech
recognition system is evaluated, for which the enhancement method serves as a
front-end solution. The results indicate that the proposed method is robust
with respect to the short length of the processed block. Significant improvements
in terms of the criteria and WER are observed even for the block length of 250
ms.
Comment: 10 pages, 8 figures, 4 tables. Modified version of the article
accepted for publication in IET Signal Processing journal. Original results
unchanged, additional experiments presented, refined discussion and
conclusion
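The block-online regime described above can be sketched simply: the multichannel signal is cut into independent blocks (e.g. 250 ms), and the statistics a beamformer needs are estimated from each block alone. The function below is an assumed illustration of that per-block covariance estimation, not the paper's code:

```python
import numpy as np

def block_online_covariances(x, fs, block_ms=250):
    """Split a multichannel signal into independent blocks and estimate one
    spatial covariance matrix per block.

    x: (M, T) multichannel time-domain signal; fs: sampling rate in Hz.
    Returns a list of (M, M) covariance estimates, one per full block.
    """
    block_len = int(fs * block_ms / 1000)
    covs = []
    # Each block is processed independently, as in the block-online regime.
    for start in range(0, x.shape[1] - block_len + 1, block_len):
        block = x[:, start:start + block_len]
        covs.append(block @ block.conj().T / block_len)
    return covs
```

Shorter blocks mean fewer samples per covariance estimate, which is exactly the precision trade-off the abstract studies against batch processing.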
DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays
Deep neural network (DNN)-based speech enhancement algorithms in microphone
arrays have now proven to be efficient solutions to speech understanding and
speech recognition in noisy environments. However, in the context of ad-hoc
microphone arrays, many challenges remain and raise the need for distributed
processing. In this paper, we propose to extend a previously introduced
distributed DNN-based time-frequency mask estimation scheme that can
efficiently use spatial information in the form of so-called compressed signals,
which are pre-filtered target estimates. We study the performance of this
algorithm under realistic acoustic conditions and investigate practical aspects
of its optimal application. We show that the nodes in the microphone array
cooperate by taking advantage of their spatial coverage in the room. We also
propose to use the compressed signals not only to convey the target estimation
but also the noise estimation in order to exploit the acoustic diversity
recorded throughout the microphone array.
Comment: Submitted to TASL
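A common way a DNN-estimated time-frequency mask feeds spatial processing is by weighting frames when accumulating per-frequency spatial covariances. The sketch below shows that standard mask-weighted accumulation step only; it is an assumed illustration, not the distributed scheme itself:

```python
import numpy as np

def masked_spatial_covariance(stft, mask):
    """Accumulate a mask-weighted spatial covariance per frequency bin.

    stft: (F, T, M) complex STFT of the M-channel signal.
    mask: (F, T) values in [0, 1], e.g. a DNN-estimated target-presence mask.
    Returns: (F, M, M) normalized spatial covariance estimates.
    """
    # sum over frames t of mask[f, t] * y[f, t, :] y[f, t, :]^H
    num = np.einsum('ft,ftm,ftn->fmn', mask, stft, stft.conj())
    den = mask.sum(axis=1)[:, None, None] + 1e-10   # avoid division by zero
    return num / den
```

Using a target mask here yields a speech covariance estimate and (1 - mask) a noise covariance estimate, which is how such masks typically drive downstream beamforming.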
Structured Sparsity Models for Multiparty Speech Recovery from Reverberant Recordings
We tackle the multi-party speech recovery problem by modeling the
acoustics of reverberant chambers. Our approach exploits structured sparsity
models to perform room modeling and speech recovery. We propose a scheme for
characterizing the room acoustics from the unknown competing speech sources,
relying on localization of the early images of the speakers via sparse
approximation of the spatial spectra of the virtual sources in a free-space
model. The images are then clustered exploiting the low-rank structure of the
spectro-temporal components belonging to each source. This enables us to
identify the early support of the room impulse response function and its unique
map to the room geometry. To further tackle the ambiguity of the reflection
ratios, we propose a novel formulation of the reverberation model and estimate
the absorption coefficients through a convex optimization exploiting a joint
sparsity model formulated on the spatio-spectral sparsity of the concurrent speech
representation. The acoustic parameters are then incorporated for separating
individual speech signals through either structured sparse recovery or inverse
filtering of the acoustic channels. The experiments conducted on real data
recordings demonstrate the effectiveness of the proposed approach for
multi-party speech recovery and recognition.
Comment: 31 pages
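The sparse-approximation step used to localize the early images of the speakers can be illustrated with a generic greedy solver. Orthogonal matching pursuit below is a stand-in for whatever sparse solver the paper actually uses; the dictionary A would hold free-space steering/propagation atoms:

```python
import numpy as np

def omp(A, y, k):
    """Greedy sparse approximation y ~= A @ x with at most k nonzeros
    (orthogonal matching pursuit).

    A: (m, n) dictionary with atoms as columns; y: (m,) observation.
    """
    residual = y.astype(float).copy()
    support = []
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(A.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares refit of the coefficients on the selected support.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x
```

A sparse spatial spectrum recovered this way has one strong coefficient per early image, which is what makes clustering into per-source supports possible.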
NoisyILRMA: Diffuse-Noise-Aware Independent Low-Rank Matrix Analysis for Fast Blind Source Extraction
In this paper, we address the multichannel blind source extraction (BSE) of a
single source in diffuse noise environments. To solve this problem even faster
than fast multichannel nonnegative matrix factorization (FastMNMF) and its
variant, we propose a BSE method called NoisyILRMA, which is a modification of
independent low-rank matrix analysis (ILRMA) to account for diffuse noise.
NoisyILRMA achieves considerably faster BSE by incorporating an algorithm
developed for independent vector extraction. In addition, to improve the BSE
performance of NoisyILRMA, we propose a mechanism to switch the source model
from ILRMA-like nonnegative matrix factorization to a more expressive source
model during optimization. In the experiment, we show that NoisyILRMA runs
faster than a FastMNMF algorithm while maintaining the BSE performance. We also
confirm that the switching mechanism improves the BSE performance of
NoisyILRMA.
Comment: 5 pages, 3 figures, accepted for European Signal Processing Conference 2023 (EUSIPCO 2023)
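The ILRMA-like source model factorizes each source's power spectrogram as a low-rank product W @ H. A minimal sketch with multiplicative updates follows; for brevity it uses the Euclidean cost, whereas ILRMA itself is built on the Itakura-Saito divergence, so treat this strictly as an illustration of the low-rank modeling idea:

```python
import numpy as np

def nmf_power_spectrogram(P, rank, iters=100, eps=1e-10):
    """Approximate a nonnegative power spectrogram P (F x T) by W @ H
    using multiplicative updates (Euclidean cost in this sketch).
    """
    rng = np.random.default_rng(0)
    F, T = P.shape
    W = rng.random((F, rank)) + eps   # spectral bases
    H = rng.random((rank, T)) + eps   # temporal activations
    for _ in range(iters):
        # Lee-Seung multiplicative updates keep W and H nonnegative.
        H *= (W.T @ P) / (W.T @ W @ H + eps)
        W *= (P @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Replacing this low-rank model with a more expressive one during optimization is exactly the switching mechanism the abstract proposes.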