223,951 research outputs found
Towards Language-Universal End-to-End Speech Recognition
Building speech recognizers in multiple languages typically involves
replicating a monolingual training recipe for each language, or utilizing a
multi-task learning approach where models for different languages have separate
output labels but share some internal parameters. In this work, we exploit
recent progress in end-to-end speech recognition to create a single
multilingual speech recognition system capable of recognizing any of the
languages seen in training. To do so, we propose the use of a universal
character set that is shared among all languages. We also create a
language-specific gating mechanism within the network that can modulate the
network's internal representations in a language-specific way. We evaluate our
proposed approach on the Microsoft Cortana task across three languages and show
that our system outperforms both the individual monolingual systems and systems
built with a multi-task learning approach. We also show that this model can be
used to initialize a monolingual speech recognizer, and can be used to create a
bilingual model for use in code-switching scenarios.Comment: submitted to ICASSP 201
Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model
Sequence-to-sequence models provide a simple and elegant solution for
building speech recognition systems by folding separate components of a typical
system, namely acoustic (AM), pronunciation (PM) and language (LM) models into
a single neural network. In this work, we look at one such sequence-to-sequence
model, namely listen, attend and spell (LAS), and explore the possibility of
training a single model to serve different English dialects, which simplifies
the process of training multi-dialect systems without the need for separate AM,
PM and LMs for each dialect. We show that simply pooling the data from all
dialects into one LAS model falls behind the performance of a model fine-tuned
on each dialect. We then look at incorporating dialect-specific information
into the model, both by modifying the training targets by inserting the dialect
symbol at the end of the original grapheme sequence and also feeding a 1-hot
representation of the dialect information into all layers of the model.
Experimental results on seven English dialects show that our proposed system is
effective in modeling dialect variations within a single LAS model,
outperforming a LAS model trained individually on each of the seven dialects by
3.1 ~ 16.5% relative.Comment: submitted to ICASSP 201
Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contribution (LHUC) is investigated to perform
language adaptive training. In addition, dropout during cross-lingual
adaptation is also studied and tested in order to mitigate the overfitting
problem.
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC and it is extensible to new phonemes
during cross-lingual adaptation. Updating all the parameters shows consistent
improvement on limited data. Applying dropout during adaptation can further
improve the system and achieve competitive performance with Deep Neural Network
/ Hidden Markov Model (DNN/HMM) systems on limited data
Transfer learning of language-independent end-to-end ASR with language model fusion
This work explores better adaptation methods to low-resource languages using
an external language model (LM) under the framework of transfer learning. We
first build a language-independent ASR system in a unified sequence-to-sequence
(S2S) architecture with a shared vocabulary among all languages. During
adaptation, we perform LM fusion transfer, where an external LM is integrated
into the decoder network of the attention-based S2S model in the whole
adaptation stage, to effectively incorporate linguistic context of the target
language. We also investigate various seed models for transfer learning.
Experimental evaluations using the IARPA BABEL data set show that LM fusion
transfer improves performances on all target five languages compared with
simple transfer learning when the external text data is available. Our final
system drastically reduces the performance gap from the hybrid systems.Comment: Accepted at ICASSP201
- …