Structured Sparsity Models for Multiparty Speech Recovery from Reverberant Recordings
We tackle the multi-party speech recovery problem by modeling the
acoustics of reverberant chambers. Our approach exploits structured sparsity
models to perform room modeling and speech recovery. We propose a scheme for
characterizing the room acoustics from the unknown competing speech sources
relying on localization of the early images of the speakers by sparse
approximation of the spatial spectra of the virtual sources in a free-space
model. The images are then clustered exploiting the low-rank structure of the
spectro-temporal components belonging to each source. This enables us to
identify the early support of the room impulse response function and its unique
map to the room geometry. To further tackle the ambiguity of the reflection
ratios, we propose a novel formulation of the reverberation model and estimate
the absorption coefficients through a convex optimization exploiting a joint
sparsity model formulated upon the spatio-spectral sparsity of the concurrent
speech representation. The acoustic parameters are then incorporated for separating
individual speech signals through either structured sparse recovery or inverse
filtering of the acoustic channels. The experiments conducted on real data
recordings demonstrate the effectiveness of the proposed approach for
multi-party speech recovery and recognition.
Comment: 31 page
Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning
In this work, we investigated the teacher-student training paradigm to train
a fully learnable multi-channel acoustic model for far-field automatic speech
recognition (ASR). Using a large offline teacher model trained on beamformed
audio, we trained a simpler multi-channel student acoustic model used in the
speech recognition system. For the student, both multi-channel feature
extraction layers and the higher classification layers were jointly trained
using the logits from the teacher model. In our experiments, compared to a
baseline model trained on about 600 hours of transcribed data, a relative
word-error rate (WER) reduction of about 27.3% was achieved when using an
additional 1800 hours of untranscribed data. We also investigated the benefit
of pre-training the multi-channel front end to output the beamformed log-mel
filter bank energies (LFBE) using L2 loss. We find that pre-training improves
the word error rate by 10.7% when compared to a multi-channel model directly
initialized with a beamformer and mel-filter bank coefficients for the front
end. Finally, combining pre-training and teacher-student training produces a
WER reduction of 31% compared to our baseline.
Comment: To appear in ICASSP 202
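The teacher-student step described in this abstract, where the student is trained on the teacher's logits rather than on transcripts, can be illustrated with a small sketch. It assumes the common formulation in which the student minimizes the cross-entropy between its own posteriors and the teacher's softmax posteriors, frame by frame. The function names, the `temperature` parameter, and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one frame of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - peak - log_norm for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Average per-frame cross-entropy of the student against teacher soft targets.

    teacher_logits and student_logits are lists of frames; each frame is a
    list of unnormalized class scores (an assumed layout, for illustration).
    """
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        # Teacher posteriors act as soft labels, so no transcript is needed.
        targets = [math.exp(lp) for lp in log_softmax(t_frame, temperature)]
        student_log_probs = log_softmax(s_frame, temperature)
        total -= sum(p * lq for p, lq in zip(targets, student_log_probs))
    return total / len(teacher_logits)


# Toy example: one frame, three output classes.
teacher = [[2.0, 0.0, -1.0]]
matching_student = [[2.0, 0.0, -1.0]]
mismatched_student = [[0.0, 2.0, -1.0]]
loss_match = distillation_loss(teacher, matching_student)
loss_mismatch = distillation_loss(teacher, mismatched_student)
```

Because the targets come from the teacher alone, a loss of this form can be computed on untranscribed audio, which is what lets the additional unlabeled data contribute to training.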
Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition
Far-field speech recognition in noisy and reverberant conditions remains a
challenging problem despite recent deep learning breakthroughs. This problem is
commonly addressed by acquiring a speech signal from multiple microphones and
performing beamforming over them. In this paper, we propose to use a recurrent
neural network with long short-term memory (LSTM) architecture to adaptively
estimate real-time beamforming filter coefficients to cope with non-stationary
environmental noise and the dynamic nature of source and microphone positions,
which results in a set of time-varying room impulse responses. The LSTM adaptive
beamformer is jointly trained with a deep LSTM acoustic model to predict senone
labels. Further, we use hidden units in the deep LSTM acoustic model to assist
in predicting the beamforming filter coefficients. The proposed system achieves
a 7.97% absolute gain over baseline systems with no beamforming on the CHiME-3 real
evaluation set.
Comment: in 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP
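To make the role of the time-varying beamforming filter coefficients concrete, here is a minimal filter-and-sum sketch. It assumes the filters are applied per frame to complex short-time spectra at a single frequency bin, which the abstract does not state. The weights are set by hand here, whereas in the paper they are predicted by the LSTM; the function name and the toy values are hypothetical.

```python
def filter_and_sum(frames, weights):
    """Apply per-frame beamforming weights at a single frequency bin.

    frames[t][m]  : complex spectrum value at microphone m for frame t
    weights[t][m] : complex filter coefficient for frame t and microphone m
    Returns y[t] = sum over m of conj(weights[t][m]) * frames[t][m].
    """
    return [
        sum(w.conjugate() * x for w, x in zip(w_t, x_t))
        for w_t, x_t in zip(weights, frames)
    ]


# Toy example: two microphones; the second channel is rotated by 90 degrees.
frames = [[1 + 0j, 1j]]
# Hand-picked weights that undo the rotation and average the two channels.
weights = [[0.5 + 0j, 0.5j]]
output = filter_and_sum(frames, weights)
```

Since the weights are indexed by frame, they can change from one frame to the next, which is the property needed to follow moving sources and non-stationary noise.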