25 research outputs found

    Robust Speech Recognition Using Generative Adversarial Networks

    Full text link
    This paper describes a general, scalable, end-to-end framework that uses the generative adversarial network (GAN) objective to enable robust speech recognition. Encoders trained with the proposed approach enjoy improved invariance by learning to map noisy audio to the same embedding space as that of clean audio. Unlike previous methods, the new framework does not rely on domain expertise or simplifying assumptions as are often needed in signal processing, and directly encourages robustness in a data-driven way. We show the new approach improves simulated far-field speech recognition of vanilla sequence-to-sequence models without specialized front-ends or preprocessing

    Speaker Adaptation for Attention-Based End-to-End Speech Recognition

    Full text link
    We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of a speaker-dependent (SD) AED is forced to be close to that of the speaker-independent (SI) model by adding a KLD regularization to the adaptation criterion. To compensate for the asymmetric deficiency in KLD regularization, an adversarial speaker adaptation (ASA) method is proposed to regularize the deep-feature distribution of the SD AED through the adversarial learning of an auxiliary discriminator and the SD AED. The third approach is the multi-task learning, in which an SD AED is trained to jointly perform the primary task of predicting a large number of output units and an auxiliary task of predicting a small number of output units to alleviate the target sparsity issue. Evaluated on a Microsoft short message dictation task, all three methods are highly effective in adapting the AED model, achieving up to 12.2% and 3.0% word error rate improvement over an SI AED trained from 3400 hours data for supervised and unsupervised adaptation, respectively.Comment: 5 pages, 3 figures, Interspeech 201

    Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

    Full text link
    In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual (\{de,es,it\}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation or even improves output quality compared to separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case
    corecore