Robust Speech Recognition Using Generative Adversarial Networks
This paper describes a general, scalable, end-to-end framework that uses the
generative adversarial network (GAN) objective to enable robust speech
recognition. Encoders trained with the proposed approach enjoy improved
invariance by learning to map noisy audio to the same embedding space as that
of clean audio. Unlike previous methods, the new framework does not rely on
domain expertise or simplifying assumptions that are often needed in signal
processing, and directly encourages robustness in a data-driven way. We show
that the new approach improves simulated far-field speech recognition of
vanilla sequence-to-sequence models without specialized front-ends or
preprocessing.
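To make the idea concrete, here is a minimal training-step sketch of the adversarial invariance objective described above, assuming a PyTorch setup; Encoder, Decoder, and Discriminator are placeholder modules, and the loss weight lam and the overall layout are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn.functional as F

def training_step(encoder, decoder, discriminator, clean, noisy, targets, lam=1.0):
    # Encode the clean and noisy views of the same utterance.
    h_clean = encoder(clean)   # (batch, time, dim)
    h_noisy = encoder(noisy)

    # Standard sequence-to-sequence recognition loss on the noisy branch.
    logits = decoder(h_noisy, targets)               # (batch, time, vocab)
    asr_loss = F.cross_entropy(logits.transpose(1, 2), targets)

    # The discriminator learns to tell clean embeddings from noisy ones.
    d_clean = discriminator(h_clean.detach())
    d_noisy = discriminator(h_noisy.detach())
    disc_loss = (
        F.binary_cross_entropy_with_logits(d_clean, torch.ones_like(d_clean))
        + F.binary_cross_entropy_with_logits(d_noisy, torch.zeros_like(d_noisy))
    )

    # Adversarial term for the encoder: make noisy embeddings look "clean",
    # which is what pushes noisy audio into the clean embedding space.
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(h_noisy), torch.ones_like(d_noisy)
    )

    return asr_loss + lam * adv_loss, disc_loss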
Speaker Adaptation for Attention-Based End-to-End Speech Recognition
We propose three regularization-based speaker adaptation approaches to adapt
the attention-based encoder-decoder (AED) model with very limited adaptation
data from target speakers for end-to-end automatic speech recognition. The
first method is Kullback-Leibler divergence (KLD) regularization, in which the
output distribution of a speaker-dependent (SD) AED is forced to be close to
that of the speaker-independent (SI) model by adding a KLD regularization to
the adaptation criterion. To compensate for the asymmetric deficiency in KLD
regularization, an adversarial speaker adaptation (ASA) method is proposed to
regularize the deep-feature distribution of the SD AED through the adversarial
learning of an auxiliary discriminator and the SD AED. The third approach is
multi-task learning (MTL), in which an SD AED is trained to jointly perform the
primary task of predicting a large number of output units and an auxiliary task
of predicting a small number of output units to alleviate the target sparsity
issue. Evaluated on a Microsoft short message dictation task, all three methods
are highly effective in adapting the AED model, achieving up to 12.2% and 3.0%
word error rate improvements over an SI AED trained on 3400 hours of data for
supervised and unsupervised adaptation, respectively.
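As a rough illustration of the first (KLD) criterion, the sketch below assumes PyTorch and per-frame logits from a frozen SI model and the SD model being adapted; the function name, the rho weighting, and the tensor shapes are assumptions made for exposition, not the paper's code.

import torch.nn.functional as F

def kld_adaptation_loss(sd_logits, si_logits, targets, rho=0.5):
    # Cross-entropy of the speaker-dependent (SD) AED on the adaptation data.
    ce = F.cross_entropy(sd_logits, targets)
    # KL divergence that keeps the SD output distribution close to the
    # distribution produced by the frozen speaker-independent (SI) model.
    kld = F.kl_div(
        F.log_softmax(sd_logits, dim=-1),
        F.softmax(si_logits, dim=-1),
        reduction="batchmean",
    )
    # Interpolate the two terms; rho controls how strongly the SD model
    # is pulled back toward the SI model.
    return (1.0 - rho) * ce + rho * kld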
Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments
In real-world applications, users often require both translations and
transcriptions of speech to enhance their comprehension, particularly in
streaming scenarios where incremental generation is necessary. This paper
introduces a streaming Transformer-Transducer that jointly generates automatic
speech recognition (ASR) and speech translation (ST) outputs using a single
decoder. To produce ASR and ST content effectively with minimal latency, we
propose a joint token-level serialized output training method that interleaves
source and target words by leveraging an off-the-shelf textual aligner.
Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings
demonstrate that our approach achieves the best quality-latency balance. With
an average ASR latency of 1 s and an ST latency of 1.3 s, our model shows no
degradation in output quality compared to separate ASR and ST models, and even
improves on them, yielding average gains of 1.1 WER points and 0.4 BLEU points
in the multilingual case.
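As a toy illustration of the token-level serialization step, the snippet below interleaves source (ASR) words with the target (ST) words an external aligner maps to them; the <asr>/<st> tag tokens and the alignment format are illustrative assumptions, not the paper's exact scheme.

from collections import defaultdict

def serialize(src_words, tgt_words, alignment):
    # alignment: list of (src_idx, tgt_idx) pairs from an off-the-shelf word aligner.
    aligned = defaultdict(list)
    for s, t in alignment:
        aligned[s].append(t)

    serialized = []
    for i, word in enumerate(src_words):
        serialized.append("<asr> " + word)             # emit the transcript word first,
        for t in sorted(aligned[i]):
            serialized.append("<st> " + tgt_words[t])  # then its aligned translation words
    return serialized

# Example: Italian "ciao mondo" aligned monotonically to English "hello world".
print(serialize(["ciao", "mondo"], ["hello", "world"], [(0, 0), (1, 1)]))
# ['<asr> ciao', '<st> hello', '<asr> mondo', '<st> world']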