Multilingual Speech Recognition With A Single End-To-End Model
Training a conventional automatic speech recognition (ASR) system to support
multiple languages is challenging because the sub-word unit, lexicon and word
inventories are typically language specific. In contrast, sequence-to-sequence
models are well suited for multilingual ASR because they encapsulate an
acoustic, pronunciation and language model jointly in a single network. In this
work we present a single sequence-to-sequence ASR model trained on 9 different
Indian languages, which have very little overlap in their scripts.
Specifically, we take a union of language-specific grapheme sets and train a
grapheme-based sequence-to-sequence model jointly on data from all languages.
We find that this model, which is not explicitly given any information about
language identity, improves recognition performance by 21% relative compared to
analogous sequence-to-sequence models trained on each language individually. By
modifying the model to accept a language identifier as an additional input
feature, we further improve performance by an additional 7% relative and
eliminate confusion between different languages.
Comment: Accepted in ICASSP 201
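The language-identifier variant can be pictured as appending a one-hot language vector to every acoustic frame before the encoder. A minimal PyTorch sketch, with module names and dimensions that are illustrative assumptions rather than the paper's code:

```python
import torch
import torch.nn as nn

class LangConditionedEncoder(nn.Module):
    """Encoder that receives a one-hot language vector as an extra
    input feature on every frame (a sketch; the paper's exact
    conditioning mechanism may differ)."""

    def __init__(self, feat_dim: int, num_langs: int, hidden: int = 512):
        super().__init__()
        self.num_langs = num_langs
        self.rnn = nn.LSTM(feat_dim + num_langs, hidden, batch_first=True)

    def forward(self, feats: torch.Tensor, lang_id: torch.Tensor):
        # feats: (batch, time, feat_dim); lang_id: (batch,) integer ids
        onehot = nn.functional.one_hot(lang_id, self.num_langs).float()
        onehot = onehot.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.rnn(torch.cat([feats, onehot], dim=-1))
```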
Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages
Sequence-to-sequence attention-based models integrate an acoustic,
pronunciation and language model into a single neural network, which makes them
very suitable for multilingual automatic speech recognition (ASR). In this
paper, we are concerned with multilingual speech recognition on low-resource
languages by a single Transformer, one of sequence-to-sequence attention-based
models. Sub-words are employed as the multilingual modeling unit without using
any pronunciation lexicon. First, we show that a single multilingual ASR
Transformer performs well on low-resource languages despite some language
confusion. We then look at incorporating language information into the model by
inserting the language symbol at the beginning or at the end of the original
sub-words sequence under the condition of language information being known
during training. Experiments on CALLHOME datasets demonstrate that the
multilingual ASR Transformer with the language symbol at the end performs
better, obtaining a relative 10.5% average word error rate (WER) reduction
compared to SHL-MLSTM with residual learning. We go on to show that, when the
language information is known during both training and testing, a relative
12.4% average WER reduction over SHL-MLSTM with residual learning can be
obtained by giving the language symbol as the sentence-start token.
Comment: arXiv admin note: text overlap with arXiv:1805.0623
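The language-symbol trick amounts to a one-line change to target preprocessing. A sketch in plain Python (token naming is illustrative):

```python
def add_lang_symbol(subwords, lang_symbol, position="end"):
    """Insert a language token into the target sub-word sequence; the
    paper compares placing it at the beginning vs. the end."""
    if position == "end":
        return subwords + [lang_symbol]
    return [lang_symbol] + subwords

# e.g. add_lang_symbol(["▁hel", "lo"], "<en>") -> ["▁hel", "lo", "<en>"]
```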
One-To-Many Multilingual End-to-end Speech Translation
Nowadays, training end-to-end neural models for spoken language translation
(SLT) must still confront extreme data scarcity. The existing
SLT parallel corpora are indeed orders of magnitude smaller than those
available for the closely related tasks of automatic speech recognition (ASR)
and machine translation (MT), which usually comprise tens of millions of
instances. To cope with data paucity, in this paper we explore the
effectiveness of transfer learning in end-to-end SLT by presenting a
multilingual approach to the task. Multilingual solutions are widely studied in
MT and usually rely on "target forcing", in which multilingual
parallel data are combined to train a single model by prepending to the input
sequences a language token that specifies the target language. However, when
tested in speech translation, our experiments show that MT-like target
forcing, used as is, is not effective in discriminating among the target
languages. Thus, we propose a variant that uses target-language embeddings to
shift the input representations in different portions of the space according to
the language, so as to better support the production of output in the desired
target language. Our experiments on end-to-end SLT from English into six
languages show important improvements when translating into similar languages,
especially when these are supported by scarce data. Further improvements are
obtained when using English ASR data as an additional language (up to
BLEU points).
Comment: 8 pages, one figure, version accepted at ASRU 201
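The proposed variant can be read as adding a learned target-language embedding to every input frame rather than prepending a token. A hedged PyTorch sketch (names and dimensions are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class LanguageShift(nn.Module):
    """Shifts encoder inputs by a learned target-language embedding so
    that representations for different target languages occupy
    different portions of the space."""

    def __init__(self, num_langs: int, feat_dim: int):
        super().__init__()
        self.lang_emb = nn.Embedding(num_langs, feat_dim)

    def forward(self, feats: torch.Tensor, tgt_lang: torch.Tensor):
        # feats: (batch, time, feat_dim); tgt_lang: (batch,) language ids
        return feats + self.lang_emb(tgt_lang).unsqueeze(1)
```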
Massively Multilingual Adversarial Speech Recognition
We report on adaptation of multilingual end-to-end speech recognition models
trained on as many as 100 languages. Our findings shed light on the relative
importance of similarity between the target and pretraining languages along the
dimensions of phonetics, phonology, language family, geographical location, and
orthography. In this context, experiments demonstrate the effectiveness of two
additional pretraining objectives in encouraging language-independent encoder
representations: a context-independent phoneme objective paired with a
language-adversarial classification objective.
Comment: Accepted at NAACL-HLT 201
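A language-adversarial classification objective is commonly implemented with a gradient-reversal layer; the sketch below shows that generic pattern (illustrative names, not necessarily the paper's implementation):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) the gradient
    on the backward pass, so minimizing the classifier loss pushes the
    encoder toward language-independent representations."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class LanguageAdversary(nn.Module):
    def __init__(self, enc_dim: int, num_langs: int, lam: float = 0.1):
        super().__init__()
        self.clf = nn.Linear(enc_dim, num_langs)
        self.lam = lam

    def forward(self, enc_states: torch.Tensor):
        # enc_states: (batch, time, enc_dim) -> per-frame language logits
        return self.clf(GradReverse.apply(enc_states, self.lam))
```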
Transfer learning of language-independent end-to-end ASR with language model fusion
This work explores better adaptation methods to low-resource languages using
an external language model (LM) under the framework of transfer learning. We
first build a language-independent ASR system in a unified sequence-to-sequence
(S2S) architecture with a shared vocabulary among all languages. During
adaptation, we perform LM fusion transfer, where an external LM is integrated
into the decoder network of the attention-based S2S model in the whole
adaptation stage, to effectively incorporate linguistic context of the target
language. We also investigate various seed models for transfer learning.
Experimental evaluations using the IARPA BABEL data set show that LM fusion
transfer improves performance on all five target languages compared with
simple transfer learning when the external text data is available. Our final
system drastically reduces the performance gap to the hybrid systems.
Comment: Accepted at ICASSP 201
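For intuition, the simplest form of LM fusion is shallow fusion at decode time, which interpolates the S2S and LM scores; the paper's LM fusion transfer instead integrates the LM into the decoder throughout adaptation, but the scoring idea is similar. A generic sketch (the weight is an illustrative assumption):

```python
import torch

def shallow_fusion_log_probs(asr_logits, lm_logits, lm_weight=0.3):
    """Combine the S2S decoder's next-token distribution with an
    external LM's (generic shallow fusion, not the paper's method)."""
    asr_lp = torch.log_softmax(asr_logits, dim=-1)
    lm_lp = torch.log_softmax(lm_logits, dim=-1)
    return asr_lp + lm_weight * lm_lp  # beam search scores over this
```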
Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition
It is important to transcribe and archive speech data of endangered languages
for preserving heritages of verbal culture and automatic speech recognition
(ASR) is a powerful tool to facilitate this process. However, since endangered
languages do not generally have large corpora with many speakers, the
performance of ASR models trained on them is generally poor.
Nevertheless, we are often left with a lot of recordings of spontaneous speech
data that have to be transcribed. In this work, for mitigating this speaker
sparsity problem, we propose to convert the whole training speech data and make
it sound like the test speaker in order to develop a highly accurate ASR system
for this speaker. For this purpose, we utilize a CycleGAN-based non-parallel
voice conversion technology to forge labeled training data that is close to
the test speaker's speech. We evaluated this speaker adaptation approach on two
low-resource corpora, namely, Ainu and Mboshi. We obtained 35-60% relative
improvement in phone error rate on the Ainu corpus, and 40% relative
improvement was attained on the Mboshi corpus. This approach outperformed two
conventional methods namely unsupervised adaptation and multilingual training
with these two corpora.
Comment: Accepted for Interspeech 202
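The data-adaptation loop itself is simple once a voice-conversion model is trained; a sketch where `convert_to_test_speaker` stands in for the trained CycleGAN converter and is an assumption here:

```python
def adapt_training_data(train_utts, convert_to_test_speaker):
    """Convert every training utterance so it sounds like the test
    speaker while keeping the original transcripts, then train the
    ASR model on the forged corpus."""
    return [(convert_to_test_speaker(wav), transcript)
            for wav, transcript in train_utts]
```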
Analysis of Multilingual Sequence-to-Sequence speech recognition systems
This paper investigates the applications of various multilingual approaches
developed in conventional hidden Markov model (HMM) systems to
sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set
composed of Babel data, we first show the effectiveness of multi-lingual
training with stacked bottle-neck (SBN) features. Then we explore various
architectures and training strategies of multi-lingual seq2seq models based on
CTC-attention networks including combinations of output layer, CTC and/or
attention component re-training. We also investigate the effectiveness of
language-transfer learning in a very low resource scenario when the target
language is not included in the original multi-lingual training data.
Interestingly, we found multilingual features superior to multilingual models,
and this finding suggests that we can efficiently combine the benefits of the
HMM system with the seq2seq system through these multilingual feature
techniques.
Comment: arXiv admin note: text overlap with arXiv:1810.0345
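Stacked bottleneck (SBN) features come from two cascaded bottleneck networks, the second consuming context-spliced outputs of the first. A rough PyTorch sketch with illustrative dimensions, not the paper's exact topology:

```python
import torch
import torch.nn as nn

class StackedBottleneck(nn.Module):
    """First network maps frames to low-dimensional bottleneck outputs;
    these are spliced over +/- `context` frames and fed to a second
    bottleneck network (a sketch under assumed dimensions)."""

    def __init__(self, feat_dim=40, bn_dim=80, context=5):
        super().__init__()
        self.context = context
        self.stage1 = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                                    nn.Linear(1024, bn_dim))
        self.stage2 = nn.Sequential(nn.Linear(bn_dim * (2 * context + 1), 1024),
                                    nn.ReLU(), nn.Linear(1024, bn_dim))

    def forward(self, feats):                # feats: (time, feat_dim)
        bn = self.stage1(feats)              # (time, bn_dim)
        pad = nn.functional.pad(bn, (0, 0, self.context, self.context))
        spliced = torch.cat([pad[i:i + bn.size(0)]
                             for i in range(2 * self.context + 1)], dim=-1)
        return self.stage2(spliced)          # SBN features per frame
```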
Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contribution (LHUC) is investigated to perform
language adaptive training. In addition, dropout during cross-lingual
adaptation is also studied and tested in order to mitigate the overfitting
problem.
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC and it is extensible to new phonemes
during cross-lingual adaptation. Updating all the parameters shows consistent
improvement on limited data. Applying dropout during adaptation can further
improve the system and achieve competitive performance with Deep Neural Network
/ Hidden Markov Model (DNN/HMM) systems on limited data.
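LHUC itself is a small per-language re-scaling of hidden units; a minimal sketch of the standard formulation, where each unit's amplitude 2*sigmoid(r) lies in (0, 2):

```python
import torch
import torch.nn as nn

class LHUC(nn.Module):
    """Learning Hidden Unit Contributions: a per-language vector r
    rescales each hidden unit; during language-adaptive training only
    r (one vector per language) is typically updated."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h * (2.0 * torch.sigmoid(self.r))
```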
Multilingual Language Processing From Bytes
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads
text as bytes and outputs span annotations of the form [start, length, label]
where start positions, lengths, and labels are separate entries in our
vocabulary. Because we operate directly on unicode bytes rather than
language-specific words or characters, we can analyze text in many languages
with a single model. Due to the small vocabulary size, these multilingual
models are very compact, but produce results similar to or better than the
state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that
use only the provided training datasets (no external data sources). Our models
are learning "from scratch" in that they do not rely on any elements of the
standard pipeline in Natural Language Processing (including tokenization), and
thus can run in a standalone fashion on raw text.
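The input/output contract of BTS can be illustrated in a few lines; the token naming below is illustrative, not the paper's vocabulary:

```python
def bts_example(text, spans):
    """Encode the input as raw UTF-8 bytes and the target as
    interleaved [start, length, label] tokens (byte offsets)."""
    byte_input = list(text.encode("utf-8"))
    target = []
    for start, length, label in spans:
        target += [f"START_{start}", f"LEN_{length}", f"LABEL_{label}"]
    return byte_input, target

# e.g. bts_example("Paris is big", [(0, 5, "LOC")])
#  -> ([80, 97, ...], ["START_0", "LEN_5", "LABEL_LOC"])
```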
Structure-Level Knowledge Distillation For Multilingual Sequence Labeling
Multilingual sequence labeling is a task of predicting label sequences using
a single unified model for multiple languages. Compared with relying on
multiple monolingual models, using a multilingual model has the benefit of a
smaller model size, easier online serving, and generalizability to
low-resource languages. However, current multilingual models still underperform
individual monolingual models significantly due to model capacity limitations.
In this paper, we propose to reduce the gap between monolingual models and the
unified multilingual model by distilling the structural knowledge of several
monolingual models (teachers) to the unified multilingual model (student). We
propose two novel KD methods based on structure-level information: (1) one
that approximately minimizes the distance between the student's and the
teachers' structure-level probability distributions, and (2) one that
aggregates the structure-level knowledge into local distributions and
minimizes the distance between the two local probability distributions.
probability distributions. Our experiments on 4 multilingual tasks with 25
datasets show that our approaches outperform several strong baselines and have
stronger zero-shot generalizability than both the baseline model and teacher
models.
Comment: Accepted to ACL 2020, camera-ready. 14 page
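The second method's local distillation objective reduces, per position, to a KL divergence between teacher and student label distributions; a generic sketch (the temperature and reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def local_kd_loss(student_logits, teacher_logits, T=1.0):
    """KL between the teachers' aggregated local label distributions
    and the student's, per token position; logits have shape
    (batch, seq_len, num_labels)."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```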