Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
Sequence-to-sequence automatic speech recognition (ASR) models require large
quantities of data to attain high performance. For this reason, there has been
a recent surge in interest for unsupervised and semi-supervised training in
such models. This work builds upon recent results showing notable improvements
in semi-supervised training using cycle-consistency and related techniques.
Such techniques derive training procedures and losses able to leverage unpaired
speech and/or text data by combining ASR with Text-to-Speech (TTS) models. In
particular, this work proposes a new semi-supervised loss combining an
end-to-end differentiable ASR→TTS loss with TTS→ASR
loss. The method is able to leverage both unpaired speech and text data to
outperform recently proposed related techniques in terms of %WER. We provide
extensive results analyzing the impact of data quantity and speech and text
modalities and show consistent gains across WSJ and Librispeech corpora. Our
code is provided in ESPnet to reproduce the experiments.
Comment: INTERSPEECH 2019
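To make the cycle structure described above concrete, here is a minimal Python sketch of the two unpaired-data losses (ASR→TTS on unpaired speech, TTS→ASR on unpaired text) and their weighted combination with the supervised loss. The function names, the generic ASR/TTS callables, and the interpolation weights are illustrative assumptions, not the paper's or ESPnet's actual API.

```python
from typing import Callable, Sequence

def asr_to_tts_loss(speech: Sequence[float],
                    asr: Callable[[Sequence[float]], str],
                    tts: Callable[[str], Sequence[float]]) -> float:
    """Unpaired speech: recognize it, resynthesize it, compare to the original features."""
    hypothesis = asr(speech)                # ASR hypothesis text (hypothetical callable)
    reconstruction = tts(hypothesis)        # TTS reconstruction of the features
    # L1 reconstruction error between original and resynthesized features
    return sum(abs(a - b) for a, b in zip(speech, reconstruction)) / len(speech)

def tts_to_asr_loss(text: str,
                    tts: Callable[[str], Sequence[float]],
                    asr_nll: Callable[[Sequence[float], str], float]) -> float:
    """Unpaired text: synthesize speech, then score the original text under the ASR model."""
    synthetic_speech = tts(text)
    return asr_nll(synthetic_speech, text)  # negative log-likelihood of the original text

def semi_supervised_loss(paired: float, speech_cycle: float, text_cycle: float,
                         alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted combination of the supervised loss and the two cycle losses."""
    return paired + alpha * speech_cycle + beta * text_cycle

# Toy usage with stand-in models (constant ASR, fixed TTS output):
toy_asr = lambda feats: "hello"
toy_tts = lambda text: [0.1, 0.2, 0.3]
print(asr_to_tts_loss([0.1, 0.2, 0.3], toy_asr, toy_tts))  # 0.0
```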
Integrating Source-channel and Attention-based Sequence-to-sequence Models for Speech Recognition
This paper proposes a novel automatic speech recognition (ASR) framework
called Integrated Source-Channel and Attention (ISCA) that combines the
advantages of traditional systems based on the noisy source-channel model (SC)
and end-to-end style systems using attention-based sequence-to-sequence models.
The traditional SC system framework includes hidden Markov models and
connectionist temporal classification (CTC) based acoustic models, language
models (LMs), and a decoding procedure based on a lexicon, whereas the
end-to-end style attention-based system jointly models the whole process with a
single model. By rescoring the hypotheses produced by traditional systems using
end-to-end style systems based on an extended noisy source-channel model, ISCA
allows structured knowledge to be easily incorporated via the SC-based model
while exploiting the complementarity of the attention-based model. Experiments
on the AMI meeting corpus show that ISCA is able to give a relative word error
rate reduction of up to 21% over an individual system, and by 13% over an
alternative method which also involves combining CTC and attention-based
models.
Comment: To appear in Proc. ASRU 2019, December 14-18, 2019, Sentosa, Singapore
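As an illustration of the rescoring step described above, the following minimal Python sketch re-ranks an N-best list produced by a source-channel (SC) system using scores from an attention-based model under a simple log-linear combination. The function name, the interpolation weight lam, and the toy hypotheses and scores are assumptions for illustration, not the exact ISCA formulation.

```python
from typing import Dict, List, Tuple

def rescore_nbest(nbest: List[Tuple[str, float]],
                  attention_score: Dict[str, float],
                  lam: float = 0.5) -> List[Tuple[str, float]]:
    """Combine SC (CTC/HMM + LM) log scores with attention-model log scores and re-rank.

    nbest: (hypothesis, log-score) pairs from the traditional SC decoder.
    attention_score: hypothesis -> log-probability under the attention-based model.
    lam: interpolation weight between the two log scores.
    """
    rescored = [
        (hyp, (1.0 - lam) * sc_score + lam * attention_score.get(hyp, float("-inf")))
        for hyp, sc_score in nbest
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Example: two hypotheses from the SC decoder, rescored with attention-model scores.
nbest = [("the meeting starts now", -12.4), ("the meeting stars now", -12.1)]
att = {"the meeting starts now": -3.2, "the meeting stars now": -5.7}
print(rescore_nbest(nbest, att)[0][0])  # best hypothesis after combination
```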