A Purely End-to-end System for Multi-speaker Speech Recognition
Recently, there has been growing interest in multi-speaker speech
recognition, where the utterances of multiple speakers are recognized from
their mixture. Promising techniques have been proposed for this task, but
earlier works have required additional training data such as isolated source
signals or senone alignments for effective learning. In this paper, we propose
a new sequence-to-sequence framework to directly decode multiple label
sequences from a single speech sequence by unifying source separation and
speech recognition functions in an end-to-end manner. We further propose a new
objective function to improve the contrast between the hidden vectors to avoid
generating similar hypotheses. Experimental results show that the model is
directly able to learn a mapping from a speech mixture to multiple label
sequences, achieving an 83.1% relative improvement compared to a model trained
without the proposed objective. Interestingly, the results are comparable to
those produced by previous end-to-end works featuring explicit separation and
recognition modules.
Comment: ACL 201
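The abstract does not spell out the form of the contrast objective, so the following is only a plausible illustration: one simple way to "improve the contrast between the hidden vectors" is to penalize the similarity of the two speakers' hidden-vector sequences, which pushes the decoder toward distinct hypotheses. A minimal sketch, assuming a cosine-similarity penalty (hypothetical; the paper's exact objective may differ):

```python
import torch
import torch.nn.functional as F

def contrast_loss(h1, h2):
    # Penalize cosine similarity between the per-speaker hidden-vector
    # sequences so the model is discouraged from emitting similar hypotheses.
    # h1, h2: (T, D) hidden vectors for speaker 1 and speaker 2.
    # Hypothetical formulation used only for illustration.
    return F.cosine_similarity(h1, h2, dim=-1).mean()

# The total training loss would then be the usual ASR loss plus
# lambda * contrast_loss(h1, h2) for a small tuning weight lambda.
```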
Neural Spatio-Temporal Beamformer for Target Speech Separation
Purely neural network (NN) based speech separation and enhancement methods,
although they can achieve good objective scores, inevitably cause nonlinear speech
distortions that are harmful to automatic speech recognition (ASR). On the
other hand, the minimum variance distortionless response (MVDR) beamformer with
NN-predicted masks, although it can significantly reduce speech distortions, has
limited noise reduction capability. In this paper, we propose a multi-tap MVDR
beamformer with complex-valued masks for speech separation and enhancement.
Compared to the state-of-the-art NN-mask based MVDR beamformer, the multi-tap
MVDR beamformer exploits the inter-frame correlation in addition to the
inter-microphone correlation that is already utilized in prior art. Further
improvements include the replacement of the real-valued masks with the
complex-valued masks and the joint training of the complex-mask NN. The
evaluation on our multi-modal multi-channel target speech separation and
enhancement platform demonstrates that our proposed multi-tap MVDR beamformer
improves both ASR accuracy and perceptual speech quality over prior art.
Comment: accepted to Interspeech 2020, Demo:
https://yongxuustc.github.io/mtmvdr
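The paper builds on the conventional NN-mask MVDR baseline it compares against; a minimal single-frequency-bin sketch of that standard formulation is below. The multi-tap extension described in the abstract would stack several adjacent STFT frames into the observation vector before the same steps; that stacking is omitted here, and this is a generic sketch rather than the authors' code.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, ref_ch=0):
    # Y:           (C, T) complex STFT of the mixture for one frequency bin.
    # speech_mask: (T,) mask (real or complex) for the target speech.
    # noise_mask:  (T,) mask for noise/interference.
    S = speech_mask * Y                               # masked speech observations
    N = noise_mask * Y                                # masked noise observations
    Phi_s = S @ S.conj().T / Y.shape[1]               # speech spatial covariance (C, C)
    Phi_n = N @ N.conj().T / Y.shape[1] + 1e-6 * np.eye(Y.shape[0])
    # MVDR weights: w = (Phi_n^{-1} Phi_s / trace(Phi_n^{-1} Phi_s)) u_ref
    numer = np.linalg.solve(Phi_n, Phi_s)
    w = numer[:, ref_ch] / (np.trace(numer) + 1e-9)
    return w.conj() @ Y                               # (T,) beamformed output for this bin
```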
End-to-End Monaural Multi-speaker ASR System without Pretraining
Recently, end-to-end models have become a popular approach as an alternative
to traditional hybrid models in automatic speech recognition (ASR). The
multi-speaker speech separation and recognition task is a central task in the
cocktail party problem. In this paper, we present a state-of-the-art monaural
multi-speaker end-to-end automatic speech recognition model. In contrast to
previous studies on the monaural multi-speaker speech recognition, this
end-to-end framework is trained to recognize multiple label sequences
completely from scratch. The system only requires the speech mixture and
corresponding label sequences, without needing any intermediate supervision
obtained from non-mixture speech or its corresponding labels/alignments. Moreover,
we exploit an individual attention module for each separated speaker
and scheduled sampling to further improve performance. Finally, we
evaluate the proposed model on the 2-speaker mixed speech generated from the
WSJ corpus and the wsj0-2mix dataset, which is a speech separation and
recognition benchmark. The experiments demonstrate that the proposed methods
can improve the performance of the end-to-end model in separating the
overlapping speech and recognizing the separated streams. The
proposed model yields roughly 10.0% relative performance gains in terms of both CER and
WER.
Comment: submitted to ICASSP201
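Scheduled sampling, mentioned above as one of the training refinements, is a generic seq2seq technique: during training the decoder is sometimes fed its own previous prediction instead of the ground-truth token, narrowing the train/test mismatch. A minimal sketch, where `decoder.step` is a hypothetical one-step decoding interface, not an API from this paper:

```python
import torch

def decode_with_scheduled_sampling(decoder, enc_out, targets, sampling_prob=0.2):
    # targets: (B, L) token ids, targets[:, 0] assumed to be <sos>.
    # With probability `sampling_prob`, feed back the predicted token rather
    # than the ground-truth token at each decoding step.
    prev_token, state, logits_all = targets[:, 0], None, []
    for t in range(1, targets.size(1)):
        logits, state = decoder.step(prev_token, state, enc_out)  # hypothetical interface
        logits_all.append(logits)
        use_pred = torch.rand(1).item() < sampling_prob
        prev_token = logits.argmax(dim=-1) if use_pred else targets[:, t]
    return torch.stack(logits_all, dim=1)  # (B, L-1, vocab) for the CE loss
```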
End-to-End Multi-Channel Speech Separation
The end-to-end approach for single-channel speech separation has been studied
recently and has shown promising results. This paper extends the previous approach
and proposes a new end-to-end model for multi-channel speech separation. The
primary contributions of this work are 1) an integrated waveform-in,
waveform-out separation system in a single neural network architecture; 2) a
reformulation of the traditional short-time Fourier transform (STFT) and
inter-channel phase difference (IPD) as time-domain convolutions
with special kernels; and 3) relaxing those fixed kernels to be
learnable, so that the entire architecture becomes purely data-driven and can
be trained end to end. We demonstrate on the WSJ0 far-field speech
separation task that, with the benefit of learnable spatial features, our
proposed end-to-end multi-channel model significantly improves upon the
previous end-to-end single-channel method and traditional multi-channel
methods.
Comment: submitted to interspeech 201
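The STFT-as-convolution reformulation in contribution 2) can be illustrated directly: framing the waveform and multiplying by fixed cosine/sine analysis kernels is equivalent to a strided 1-D convolution, and making those kernels trainable (e.g., a Conv1d layer initialized from them) yields the learnable front-end described. A minimal sketch under the standard STFT definition, not the authors' code:

```python
import numpy as np

def stft_via_fixed_kernels(x, win_len=512, hop=128):
    # Build windowed cosine/sine kernels for the positive frequencies.
    n = np.arange(win_len)
    freqs = np.arange(win_len // 2 + 1)[:, None]
    window = np.hanning(win_len)
    cos_k = window * np.cos(2 * np.pi * freqs * n / win_len)     # (F, win_len)
    sin_k = -window * np.sin(2 * np.pi * freqs * n / win_len)    # (F, win_len)
    # Frame the signal; the frame-wise matrix product below is equivalent to a
    # strided 1-D convolution of x with each kernel row.
    frames = np.stack([x[s:s + win_len]
                       for s in range(0, len(x) - win_len + 1, hop)])  # (T, win_len)
    return frames @ cos_k.T + 1j * (frames @ sin_k.T)            # (T, F) complex STFT

# The inter-channel phase difference (IPD) between two microphones is then
# np.angle(stft_ch1 * np.conj(stft_ch2)) computed on the per-channel outputs.
```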
Discriminatively Re-trained i-vector Extractor for Speaker Recognition
In this work we revisit discriminative training of the i-vector extractor
component in the standard speaker verification (SV) system. The motivation of
our research lies in the robustness and stability of this large generative
model, which we want to preserve while focusing its power on the intended SV
task. We show that after generative initialization of the i-vector extractor,
we can further refine it with discriminative training and obtain i-vectors that
lead to better performance on various benchmarks representing different
acoustic domains.
Comment: 5 pages, 1 figure, submitted to ICASSP 201
Improved Speaker-Dependent Separation for CHiME-5 Challenge
This paper summarizes several follow-up contributions for improving our
submitted NWPU speaker-dependent system for the CHiME-5 challenge, which aims to
solve the problem of multi-channel, highly overlapped conversational speech
recognition in a dinner-party scenario with reverberation and non-stationary
noise. We adopt a speaker-aware training method by using the i-vector as the
target speaker information for multi-talker speech separation. With only one
unified separation model for all speakers, we achieve a 10% absolute
improvement in terms of word error rate (WER) over the previous baseline of
80.28% on the development set by leveraging our newly proposed data processing
techniques and beamforming approach. With our improved back-end acoustic model,
we further reduce the WER to 60.15%, which surpasses the result of our submitted
CHiME-5 challenge system without applying any fusion techniques.
Comment: Submitted to Interspeech 2019, Graz, Austri
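The abstract only states that the i-vector serves as the target speaker information for a single unified separation model. One common way to realize such speaker-aware conditioning is to tile the target speaker's i-vector over time and concatenate it with the mixture features before a recurrent mask estimator; the sketch below illustrates that general idea under this assumption and does not reproduce the NWPU system's actual architecture.

```python
import torch
import torch.nn as nn

class SpeakerAwareMaskNet(nn.Module):
    # Illustrative speaker-aware separator: the target speaker's i-vector is
    # appended to every frame of the mixture features, and a BLSTM predicts a
    # time-frequency mask for that speaker only. Hypothetical design.
    def __init__(self, feat_dim=257, ivec_dim=100, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim + ivec_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, feat_dim)

    def forward(self, feats, ivec):
        # feats: (B, T, F) mixture features; ivec: (B, ivec_dim) target speaker.
        ivec_tiled = ivec.unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.blstm(torch.cat([feats, ivec_tiled], dim=-1))
        return torch.sigmoid(self.mask(h))  # per-T-F mask in [0, 1]
```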
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
Although great progress has been made in automatic speech recognition
(ASR), significant performance degradation is still observed when recognizing
multi-talker mixed speech. In this paper, we propose and evaluate several
architectures to address this problem under the assumption that only a single
channel of mixed signal is available. Our technique extends permutation
invariant training (PIT) by introducing a front-end feature separation module
with the minimum mean square error (MSE) criterion and a back-end recognition
module with the minimum cross entropy (CE) criterion. More specifically, during
training we compute the average MSE or CE over the whole utterance for each
possible utterance-level output-target assignment, pick the one with the
minimum MSE or CE, and optimize for that assignment. This strategy elegantly
solves the label permutation problem observed in the deep learning based
multi-talker mixed speech separation and recognition systems. The proposed
architectures are evaluated and compared on an artificially mixed AMI dataset
with both two- and three-talker mixed speech. The experimental results indicate
that our proposed architectures can cut the word error rate (WER) by 45.0% and
25.0% relative to the state-of-the-art single-talker speech recognition
system across all speakers when their energies are comparable, for two- and
three-talker mixed speech, respectively. To our knowledge, this is the first
work on multi-talker mixed speech recognition on the challenging
speaker-independent spontaneous large-vocabulary continuous speech task.
Comment: 11 pages, 6 figures, Submitted to IEEE/ACM Transactions on Audio,
Speech and Language Processing. arXiv admin note: text overlap with
arXiv:1704.0198
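The utterance-level assignment rule described in the abstract (compute the loss for every output-target permutation over the whole utterance, pick the minimum, and optimize for it) maps directly onto a short loss function. A minimal sketch of that utterance-level PIT loss, here shown with the MSE criterion for the separation front-end:

```python
import itertools
import torch
import torch.nn.functional as F

def utterance_level_pit_loss(outputs, targets, loss_fn=F.mse_loss):
    # outputs, targets: lists of S per-speaker tensors, each (T, F).
    # For every possible output-target assignment, average the loss over the
    # whole utterance, then keep (and back-propagate) the minimum assignment.
    S = len(outputs)
    best = None
    for perm in itertools.permutations(range(S)):
        loss = sum(loss_fn(outputs[i], targets[p]) for i, p in enumerate(perm)) / S
        best = loss if best is None else torch.minimum(best, loss)
    return best
```

For the back-end recognition module, the same permutation search would be run with the cross-entropy criterion over the label sequences instead of MSE over features.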
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
The CHiME challenge series aims to advance robust automatic speech
recognition (ASR) technology by promoting research at the interface of speech
and language processing, signal processing, and machine learning. This paper
introduces the 5th CHiME Challenge, which considers the task of distant
multi-microphone conversational ASR in real home environments. Speech material
was elicited using a dinner party scenario with efforts taken to capture data
that is representative of natural conversational speech and recorded by 6
Kinect microphone arrays and 4 binaural microphone pairs. The challenge
features a single-array track and a multiple-array track and, for each track,
distinct rankings will be produced for systems focusing on robustness with
respect to distant-microphone capture vs. systems attempting to address all
aspects of the task including conversational language modeling. We discuss the
rationale for the challenge and provide a detailed description of the data
collection procedure, the task, and the baseline systems for array
synchronization, speech enhancement, and conventional and end-to-end ASR.
Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning
Domain generalization remains a critical problem for speaker recognition,
even with the state-of-the-art architectures based on deep neural nets. For
example, a model trained on read speech may largely fail when applied to
singing or movie scenarios. In this paper, we propose a domain-invariant
projection to improve the generalizability of speaker vectors. This projection
is a simple neural net and is trained following the Model-Agnostic
Meta-Learning (MAML) principle, where the objective is to classify speakers
in one domain after the model has been updated with speech data from another domain. We
tested the proposed method on CNCeleb, a new dataset consisting of
single-speaker multi-condition (SSMC) data. The results demonstrated that the
MAML-based domain-invariant projection can produce more generalizable speaker
vectors, and effectively improve the performance in unseen domains.
Comment: submitted to INTERSPEECH 202
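The MAML objective described above (classify speakers in one domain after updating on another) corresponds to a standard inner/outer meta-learning step. A minimal sketch, where `loss_fn(params, batch)` is a hypothetical functional loss that runs the projection net with the given parameter list and returns the speaker-classification loss; this is a generic MAML step, not the paper's exact training loop:

```python
import torch

def maml_meta_step(proj, loss_fn, domain_a_batch, domain_b_batch, inner_lr=0.01):
    # Inner step: adapt the projection net's parameters on domain A.
    params = list(proj.parameters())
    inner_loss = loss_fn(params, domain_a_batch)
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(params, grads)]
    # Outer step: evaluate the adapted parameters on domain B; back-propagating
    # this meta-loss favours parameters that generalize across domains.
    meta_loss = loss_fn(adapted, domain_b_batch)
    return meta_loss  # caller calls meta_loss.backward() and steps the meta-optimizer
```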
Unsupervised training of a deep clustering model for multichannel blind source separation
We propose a training scheme to train neural network-based source separation
algorithms from scratch when parallel clean data is unavailable. In particular,
we demonstrate that an unsupervised spatial clustering algorithm is sufficient
to guide the training of a deep clustering system. We argue that previous work
on deep clustering requires strong supervision and elaborate on why this is a
limitation. We demonstrate that (a) the single-channel deep clustering system
trained according to the proposed scheme alone is able to achieve performance
similar to the multi-channel teacher in terms of word error rate, and (b)
initializing the spatial clustering approach with the deep clustering result
yields a relative word error rate reduction of 26% over the unsupervised
teacher.
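The key idea is to replace oracle time-frequency labels with the source assignments produced by the unsupervised spatial clustering teacher. A minimal sketch of the standard deep clustering objective with such teacher-derived targets (generic formulation, not the authors' exact code):

```python
import torch

def deep_clustering_loss(embeddings, teacher_assignments):
    # embeddings:          (N, D) unit-norm bin embeddings V, N = T*F bins.
    # teacher_assignments: (N, S) one-hot source labels Y obtained from the
    #                      unsupervised spatial clustering teacher.
    # Minimizes ||V V^T - Y Y^T||_F^2 using the usual low-rank expansion.
    V, Y = embeddings, teacher_assignments
    return (torch.norm(V.T @ V) ** 2
            - 2 * torch.norm(V.T @ Y) ** 2
            + torch.norm(Y.T @ Y) ** 2)
```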