LEAP Submission to CHiME-6 ASR Challenge
This paper reports the LEAP submission to the CHiME-6 challenge. The CHiME-6
Automatic Speech Recognition (ASR) challenge Track 1 involved the recognition
of speech in noisy and reverberant acoustic conditions in home environments
with multiple-party interactions. For the challenge submission, the LEAP system
used extensive data augmentation and a factorized time-delay neural network
(TDNN) architecture. We also explored a neural architecture that interleaved
the TDNN layers with LSTM layers. The submitted system improved over the Kaldi
recipe by 2% in terms of relative word error rate.
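As an illustration of the interleaved TDNN/LSTM idea mentioned above, a minimal PyTorch-style sketch follows. Layer sizes, dilations, and the bottleneck dimension are illustrative assumptions, and the semi-orthogonal constraint of Kaldi's factorized TDNN is omitted, so this is not the LEAP system's actual configuration.

```python
import torch
import torch.nn as nn

class FactorizedTDNNLayer(nn.Module):
    """TDNN-F-style layer: a temporal convolution factorized into a low-rank
    bottleneck convolution followed by a 1x1 expansion (constraint omitted)."""
    def __init__(self, dim, bottleneck, context=3, dilation=1):
        super().__init__()
        self.down = nn.Conv1d(dim, bottleneck, context, dilation=dilation,
                              padding=(context // 2) * dilation)
        self.up = nn.Conv1d(bottleneck, dim, 1)
        self.norm = nn.BatchNorm1d(dim)

    def forward(self, x):                      # x: (batch, dim, time)
        return torch.relu(self.norm(self.up(self.down(x))))

class TDNNLSTM(nn.Module):
    """Acoustic model that interleaves TDNN-F blocks with an LSTM layer."""
    def __init__(self, feat_dim=40, dim=512, bottleneck=128, out_dim=3000):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, dim, 1)
        self.tdnn1 = FactorizedTDNNLayer(dim, bottleneck, dilation=1)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.tdnn2 = FactorizedTDNNLayer(dim, bottleneck, dilation=3)
        self.out = nn.Conv1d(dim, out_dim, 1)

    def forward(self, feats):                  # feats: (batch, feat_dim, time)
        x = self.tdnn1(self.proj(feats))
        x, _ = self.lstm(x.transpose(1, 2))    # LSTM expects (batch, time, dim)
        x = self.tdnn2(x.transpose(1, 2))
        return self.out(x)                     # frame-level senone logits
```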
End-to-End Multi-Look Keyword Spotting
The performance of keyword spotting (KWS), measured in false alarms and false
rejects, degrades significantly under far-field and noisy conditions. In this
paper, we propose a multi-look neural network model for speech enhancement
that simultaneously steers toward multiple sampled look directions. The
multi-look enhancement is then jointly trained with KWS to form
an end-to-end KWS model which integrates the enhanced signals from multiple
look directions and leverages an attention mechanism to dynamically tune the
model's attention to the reliable sources. We demonstrate, on our large noisy
and far-field evaluation sets, that the proposed approach significantly
improves the KWS performance over both the baseline KWS system and a recent
beamformer-based multi-beam KWS system.
Comment: Submitted to Interspeech202
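A minimal numpy sketch of the attention-based fusion over look directions described above; the feature shapes and the linear scoring function are assumptions for illustration, not the paper's model.

```python
import numpy as np

def attention_fuse(look_features, w_score):
    """Fuse per-direction enhanced features with a soft attention weight.

    look_features: (num_looks, time, feat_dim) features from each fixed
                   look direction (e.g. outputs of per-direction enhancers).
    w_score:       (feat_dim,) scoring vector (learned in practice).
    Returns the attention-weighted feature stream, shape (time, feat_dim).
    """
    scores = look_features @ w_score                     # (num_looks, time)
    scores = scores - scores.max(axis=0, keepdims=True)  # stable softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return np.einsum('lt,ltf->tf', weights, look_features)

# toy usage: 4 look directions, 100 frames, 40-dim features
looks = np.random.randn(4, 100, 40)
fused = attention_fuse(looks, np.random.randn(40))
print(fused.shape)   # (100, 40)
```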
Improved Speaker-Dependent Separation for CHiME-5 Challenge
This paper summarizes several follow-up contributions for improving our
submitted NWPU speaker-dependent system for the CHiME-5 challenge, which aims to
solve the problem of multi-channel, highly-overlapped conversational speech
recognition in a dinner-party scenario with reverberation and non-stationary
noise. We adopt a speaker-aware training method that uses an i-vector as the
target speaker information for multi-talker speech separation. With only one
unified separation model for all speakers, we achieve a 10% absolute
improvement in terms of word error rate (WER) over the previous baseline of
80.28% on the development set by leveraging our newly proposed data processing
techniques and beamforming approach. With our improved back-end acoustic model,
we further reduce the WER to 60.15%, which surpasses the result of our
submitted CHiME-5 challenge system without applying any fusion techniques.
Comment: Submitted to Interspeech 2019, Graz, Austria.
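A hedged PyTorch sketch of speaker-aware mask estimation in this spirit: the target speaker's i-vector is tiled over time and concatenated with the mixture features, so a single model serves all speakers. The dimensions and the BLSTM/mask architecture are assumptions, not the NWPU system's configuration.

```python
import torch
import torch.nn as nn

class SpeakerAwareMaskNet(nn.Module):
    """Mask estimator conditioned on an i-vector of the target speaker."""
    def __init__(self, n_bins=257, ivec_dim=100, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(n_bins + ivec_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_bins)

    def forward(self, mix_spec, ivector):
        # mix_spec: (batch, time, n_bins), ivector: (batch, ivec_dim)
        ivec = ivector.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        h, _ = self.blstm(torch.cat([mix_spec, ivec], dim=-1))
        return torch.sigmoid(self.mask(h))   # TF mask for the target speaker
```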
Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement
This work introduces sequential neural beamforming, which alternates between
neural network based spectral separation and beamforming based spatial
separation. Our neural networks for separation use an advanced convolutional
architecture trained with a novel stabilized signal-to-noise ratio loss
function. For beamforming, we explore multiple ways of computing time-varying
covariance matrices, including factorizing the spatial covariance into a
time-varying amplitude component and a time-invariant spatial component, as
well as using block-based techniques. In addition, we introduce a multi-frame
beamforming method which improves the results significantly by adding
contextual frames to the beamforming formulations. We extensively evaluate and
analyze the effects of window size, block size, and multi-frame context size
for these methods. Our best method utilizes a sequence of three neural
separation and multi-frame time-invariant spatial beamforming stages, and
demonstrates an average improvement of 2.75 dB in scale-invariant
signal-to-noise ratio and 14.2% absolute reduction in a comparative speech
recognition metric across four challenging reverberant speech enhancement and
separation tasks. We also use our three-speaker separation model to separate
real recordings in the LibriCSS evaluation set into non-overlapping tracks, and
achieve a better word error rate than a baseline mask-based beamformer.
Comment: 7 pages, 7 figures, IEEE SLT 2021 (slt2020.org).
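One plausible numpy form of the covariance factorization mentioned above, estimating a time-invariant spatial component and a time-varying amplitude from a TF mask; the exact estimator, normalization, and mask source used in the paper may differ.

```python
import numpy as np

def factorized_covariance(mask, stft):
    """Time-varying covariance factorized as (time-varying amplitude) x
    (time-invariant spatial covariance), estimated from a TF mask.

    stft: (channels, time, freq) complex STFT of the mixture.
    mask: (time, freq) estimated source mask in [0, 1].
    Returns amp (time, freq) and R (freq, channels, channels).
    """
    masked = mask[None] * stft                            # (C, T, F)
    # time-invariant spatial component: mask-weighted average outer product
    R = np.einsum('ctf,dtf->fcd', masked, masked.conj())
    R /= np.maximum(mask.sum(axis=0), 1e-8)[:, None, None]
    # time-varying amplitude: per-frame masked power averaged over channels
    amp = (np.abs(masked) ** 2).mean(axis=0)              # (T, F)
    return amp, R

# the time-varying covariance at frame t and bin f is then amp[t, f] * R[f]
```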
Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks
Spatial clustering techniques can achieve significant multi-channel noise
reduction across relatively arbitrary microphone configurations, but have
difficulty incorporating a detailed speech/noise model. In contrast, LSTM
neural networks have successfully been trained to distinguish speech from noise
on single-channel inputs, but have difficulty taking full advantage of the
information in multi-channel recordings. This paper integrates these two
approaches, training LSTM speech models to clean the masks generated by the
Model-based EM Source Separation and Localization (MESSL) spatial clustering
method. By doing so, it attains both the spatial separation performance and
generality of multi-channel spatial clustering and the signal modeling
performance of multiple parallel single-channel LSTM speech enhancers. Our
experiments show that when our system is applied to the CHiME-3 dataset of
noisy tablet recordings, it increases speech quality as measured by the
Perceptual Evaluation of Speech Quality (PESQ) algorithm and reduces the word
error rate of the baseline CHiME-3 speech recognizer, as compared to the
default BeamformIt beamformer.
Comment: arXiv admin note: substantial text overlap with arXiv:2012.0157
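For context, a numpy sketch of the standard mask-driven MVDR formulation that such systems build on (reference-channel form with diagonal loading). It is not the authors' implementation; the masks are assumed to come from a source such as MESSL after LSTM cleaning.

```python
import numpy as np

def mvdr_from_masks(stft, speech_mask, noise_mask, ref_mic=0):
    """MVDR beamformer driven by (cleaned) speech and noise TF masks.

    stft:        (channels, time, freq) complex mixture STFT.
    speech_mask, noise_mask: (time, freq) masks in [0, 1].
    Returns the beamformed STFT, shape (time, freq).
    """
    C, T, F = stft.shape
    # mask-weighted spatial covariance matrices per frequency bin
    phi_s = np.einsum('tf,ctf,dtf->fcd', speech_mask, stft, stft.conj())
    phi_n = np.einsum('tf,ctf,dtf->fcd', noise_mask, stft, stft.conj())
    phi_n += 1e-6 * np.eye(C)[None]               # diagonal loading
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        num = np.linalg.solve(phi_n[f], phi_s[f])           # Phi_n^{-1} Phi_s
        w = num[:, ref_mic] / np.maximum(np.trace(num).real, 1e-8)
        out[:, f] = w.conj() @ stft[:, :, f]                 # w^H x per frame
    return out
```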
Student-Teacher Learning for BLSTM Mask-based Speech Enhancement
Spectral mask estimation using bidirectional long short-term memory (BLSTM)
neural networks has been widely used in various speech enhancement
applications, and it has achieved great success when it is applied to
multichannel enhancement techniques with a mask-based beamformer. However, when
these masks are used for single-channel speech enhancement, they severely
distort the speech signal, making it unsuitable for speech recognition. This
paper proposes a student-teacher learning paradigm for single-channel speech
enhancement. The beamformed signal from multichannel enhancement is given as
input to the teacher network to obtain soft masks. An additional cross-entropy
loss term with the soft mask target is combined with the original loss, so that
the student network with single-channel input is trained to mimic the soft mask
obtained with multichannel input through beamforming. Experiments with the
CHiME-4 challenge single-channel track data show an improvement in ASR
performance.
Comment: Submitted for Interspeech 201
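A short PyTorch sketch of the kind of combined objective described: a cross-entropy term against the teacher's soft mask is added to an original mask loss. The choice of binary cross-entropy for the original term and the weighting alpha are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def student_teacher_mask_loss(student_mask, ideal_mask, teacher_soft_mask,
                              alpha=0.5):
    """Combine an original mask loss with a cross-entropy term against the
    teacher's soft mask (estimated from the beamformed multichannel signal).

    All masks: (batch, time, freq) values in [0, 1].  alpha weights the
    two terms; its value here is an illustrative assumption.
    """
    original = F.binary_cross_entropy(student_mask, ideal_mask)
    distill = F.binary_cross_entropy(student_mask, teacher_soft_mask)
    return (1 - alpha) * original + alpha * distill
```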
Fast and Robust 3-D Sound Source Localization with DSVD-PHAT
This paper introduces a variant of the Singular Value Decomposition with
Phase Transform (SVD-PHAT), named Difference SVD-PHAT (DSVD-PHAT), to achieve
robust Sound Source Localization (SSL) in noisy conditions. Experiments are
performed on a Baxter robot with a four-microphone planar array mounted on its
head. Results show that this method offers similar robustness to noise as the
state-of-the-art Multiple Signal Classification based on Generalized Singular
Value Decomposition (GSVD-MUSIC) method, and considerably reduces the
computational load by a factor of 250. This performance gain thus makes
DSVD-PHAT appealing for real-time application on robots with limited on-board
computing power.
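SVD-PHAT and DSVD-PHAT accelerate the classical SRP-PHAT search; a minimal numpy sketch of that underlying PHAT-weighted steered scoring is shown below, with the array geometry, the TDOA table, and the sign conventions treated as assumptions.

```python
import numpy as np

def srp_phat(stft, pair_delays, fs, n_fft):
    """Classical SRP-PHAT scoring over a grid of candidate directions.

    stft:        (mics, bins) complex spectrum of one frame.
    pair_delays: (directions, pairs) expected TDOA in seconds for each
                 candidate direction and microphone pair (i, j), i < j.
    Returns one steered-power score per candidate direction.
    """
    mics = stft.shape[0]
    pairs = [(i, j) for i in range(mics) for j in range(i + 1, mics)]
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)             # (bins,)
    scores = np.zeros(pair_delays.shape[0])
    for p, (i, j) in enumerate(pairs):
        cross = stft[i] * np.conj(stft[j])
        cross /= np.maximum(np.abs(cross), 1e-12)          # PHAT weighting
        # steer toward each candidate direction and accumulate coherence
        steer = np.exp(2j * np.pi * freqs[None, :] * pair_delays[:, p:p + 1])
        scores += np.real(steer * cross[None, :]).sum(axis=1)
    return scores   # argmax over directions gives the DoA estimate
```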
Meeting Transcription Using Virtual Microphone Arrays
We describe a system that generates speaker-annotated transcripts of meetings
by using a virtual microphone array, a set of spatially distributed
asynchronous recording devices such as laptops and mobile phones. The system is
composed of continuous audio stream alignment, blind beamforming, speech
recognition, speaker diarization using prior speaker information, and system
combination. When utilizing seven input audio streams, our system achieves a
word error rate (WER) of 22.3% and comes within 3% of the close-talking
microphone WER on the non-overlapping speech segments. The speaker-attributed
WER (SAWER) is 26.7%. The relative gains in SAWER over the single-device system
are 14.8%, 20.3%, and 22.4% for three, five, and seven microphones,
respectively. The presented system achieves a 13.6% diarization error rate when
10% of the speech duration contains more than one speaker. The contribution of
each component to the overall performance is also investigated, and we validate
the system with experiments on the NIST RT-07 conference meeting test set.
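As a toy illustration of the problem the first stage addresses, a numpy sketch that estimates the offset between two asynchronous recordings by FFT-based cross-correlation; the system's continuous audio stream alignment is considerably more involved than this.

```python
import numpy as np

def estimate_offset(ref, other, max_lag):
    """Estimate the sample offset between two asynchronous recordings of
    the same meeting via FFT-based cross-correlation.

    Returns the number of samples by which `other` trails `ref`,
    searched within +/- max_lag samples.
    """
    n = len(ref) + len(other) - 1
    n_fft = 1 << (n - 1).bit_length()                      # next power of two
    spec = np.conj(np.fft.rfft(ref, n_fft)) * np.fft.rfft(other, n_fft)
    corr = np.fft.irfft(spec, n_fft)
    corr = np.concatenate([corr[-max_lag:], corr[:max_lag + 1]])
    return int(np.argmax(corr)) - max_lag
```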
Deep Bayesian Unsupervised Source Separation Based on a Complex Gaussian Mixture Model
This paper presents an unsupervised method that trains neural source
separation by using only multichannel mixture signals. Conventional neural
separation methods require a large amount of supervised data to achieve excellent
performance. Although multichannel methods based on spatial information can
work without such training data, they are often sensitive to parameter
initialization and degrade when the sources are located close to each other. The
proposed method uses a cost function based on a spatial model called a complex
Gaussian mixture model (cGMM). This model has the time-frequency (TF) masks and
directions of arrival (DoAs) of sources as latent variables and is used for
training separation and localization networks that respectively estimate these
variables. This joint training solves the frequency permutation ambiguity of
the spatial model in a unified deep Bayesian framework. In addition, the
pre-trained network can be used not only for conducting monaural separation but
also for efficiently initializing a multichannel separation algorithm.
Experimental results with simulated speech mixtures showed that our method
outperformed a conventional initialization method.
Comment: 6 pages, 2 figures, accepted for publication in 2019 IEEE
International Workshop on Machine Learning for Signal Processing (MLSP).
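A simplified numpy sketch of the E-step of a complex Gaussian mixture model, whose per-class posteriors act as TF masks. Mixture weights and the per-frame power scaling of the full cGMM are omitted, so this is illustrative rather than the paper's exact model.

```python
import numpy as np

def cgmm_posteriors(stft, covs, eps=1e-10):
    """Posterior probability of each source class at every TF bin.

    stft: (channels, time, freq) complex mixture STFT.
    covs: (classes, freq, channels, channels) class spatial covariances.
    Returns masks of shape (classes, time, freq).
    """
    C = stft.shape[0]
    x = np.moveaxis(stft, 0, -1)                       # (T, F, C)
    log_lik = []
    for R in covs:                                     # R: (F, C, C)
        R = R + eps * np.eye(C)[None]                  # regularize
        R_inv = np.linalg.inv(R)
        quad = np.einsum('tfc,fcd,tfd->tf', x.conj(), R_inv, x).real
        _, logdet = np.linalg.slogdet(R)               # (F,)
        log_lik.append(-quad - logdet[None, :])        # complex Gaussian log-lik
    log_lik = np.stack(log_lik)                        # (classes, T, F)
    log_lik -= log_lik.max(axis=0, keepdims=True)      # stable softmax
    post = np.exp(log_lik)
    return post / post.sum(axis=0, keepdims=True)
```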
Speaker Adapted Beamforming for Multi-Channel Automatic Speech Recognition
This paper presents, in the context of multi-channel ASR, a method to adapt a
mask-based, statistically optimal beamforming approach to a speaker of
interest. The beamforming vector of the statistically optimal beamformer is
computed by utilizing speech and noise masks, which are estimated by a neural
network. The proposed adaptation approach is based on the integration of the
beamformer, which includes the mask estimation network, and the acoustic model
of the ASR system. This allows for the propagation of the training error, from
the acoustic modeling cost function, all the way through the beamforming
operation and through the mask estimation network. By using the results of a
first-pass recognition and keeping all other parameters fixed, the mask
estimation network can be fine-tuned by retraining. Utterances of a
speaker of interest can thus be used in a two-pass approach to optimize the
beamforming for the speech characteristics of that specific speaker. It is
shown that this approach improves the ASR performance of a state-of-the-art
multi-channel ASR system on the CHiME-4 data. Furthermore, the effect of the
adaptation on the estimated speech masks is discussed.
Comment: Submitted to IEEE SLT 201
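A hedged PyTorch-style sketch of such a second-pass adaptation step, with the module interfaces (mask_net, beamformer, acoustic_model) as hypothetical stand-ins: the ASR loss is back-propagated through the differentiable beamforming operation, and only the mask estimation network is updated, using first-pass labels as targets.

```python
import torch

def adaptation_step(mask_net, beamformer, acoustic_model, optimizer,
                    mixture_stft, first_pass_targets):
    """One fine-tuning step: the ASR loss flows through the beamformer into
    the mask estimation network, while the acoustic model stays frozen."""
    for p in acoustic_model.parameters():
        p.requires_grad_(False)                 # acoustic model is fixed
    masks = mask_net(mixture_stft)              # speech / noise masks
    enhanced = beamformer(mixture_stft, masks)  # statistically optimal BF
    logits = acoustic_model(enhanced)
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        first_pass_targets.reshape(-1))         # first-pass labels as targets
    optimizer.zero_grad()                       # optimizer holds only the
    loss.backward()                             # mask_net parameters
    optimizer.step()
    return loss.item()
```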