Analysis of Multilingual Sequence-to-Sequence speech recognition systems
This paper investigates the applications of various multilingual approaches
developed in conventional hidden Markov model (HMM) systems to
sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set
composed of Babel data, we first show the effectiveness of multi-lingual
training with stacked bottle-neck (SBN) features. Then we explore various
architectures and training strategies for multilingual seq2seq models based on
CTC-attention networks, including combinations of re-training the output layer,
the CTC component, and/or the attention component. We also investigate the effectiveness of
language-transfer learning in a very low resource scenario when the target
language is not included in the original multi-lingual training data.
Interestingly, we found multilingual features superior to multilingual models,
and this finding suggests that we can efficiently combine the benefits of the
HMM system with the seq2seq system through these multilingual feature
techniques.
Comment: arXiv admin note: text overlap with arXiv:1810.0345
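The hybrid CTC-attention training used in such systems is commonly formulated as a weighted interpolation of the two losses. A minimal sketch, assuming an interpolation weight `lam` (the weight value and loss inputs below are illustrative, not taken from the paper):

```python
def hybrid_ctc_attention_loss(ctc_loss: float, attention_loss: float,
                              lam: float = 0.3) -> float:
    """Interpolate CTC and attention losses for multi-task seq2seq training.

    lam weights the CTC branch; (1 - lam) weights the attention decoder.
    """
    assert 0.0 <= lam <= 1.0
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# Usage with made-up per-batch loss values:
combined = hybrid_ctc_attention_loss(2.0, 1.0)
```

In practice the same interpolation is typically reused at decoding time to combine CTC and attention scores.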
Fantastic 4 system for NIST 2015 Language Recognition Evaluation
This article describes the systems jointly submitted by the Institute for
Infocomm Research (I2R), the Laboratoire d'Informatique de l'Universit\'e du Maine
(LIUM), Nanyang Technological University (NTU) and the University of Eastern
Finland (UEF) for the 2015 NIST Language Recognition Evaluation (LRE). The
submitted system is a fusion of nine sub-systems based on i-vectors extracted
from different types of features. Given the i-vectors, several classifiers are
adopted for the language detection task including support vector machines
(SVM), multi-class logistic regression (MCLR), Probabilistic Linear
Discriminant Analysis (PLDA) and Deep Neural Networks (DNN).Comment: Technical report for NIST LRE 2015 Worksho
Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition
It is important to transcribe and archive speech data of endangered languages
for preserving heritages of verbal culture and automatic speech recognition
(ASR) is a powerful tool to facilitate this process. However, since endangered
languages do not generally have large corpora with many speakers, the
performance of ASR models trained on them is generally poor. Nevertheless, we
are often left with many recordings of spontaneous speech data that have to be
transcribed. In this work, to mitigate this speaker
sparsity problem, we propose to convert the whole training speech data and make
it sound like the test speaker in order to develop a highly accurate ASR system
for this speaker. For this purpose, we utilize a CycleGAN-based non-parallel
voice conversion technique to forge labeled training data that is close to
the test speaker's speech. We evaluated this speaker adaptation approach on two
low-resource corpora, namely, Ainu and Mboshi. We obtained 35-60% relative
improvement in phone error rate on the Ainu corpus, and 40% relative
improvement was attained on the Mboshi corpus. This approach outperformed two
conventional methods namely unsupervised adaptation and multilingual training
with these two corpora.
Comment: Accepted for Interspeech 202
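Non-parallel conversion of the kind described above relies on a cycle-consistency loss: mapping source-speaker features to the target speaker and back should reconstruct the input. A toy sketch with scalar "features" and invented linear "generators" (the real model applies neural networks to spectral features):

```python
def cycle_consistency_loss(x_batch, g_src_to_tgt, g_tgt_to_src):
    """Mean L1 cycle loss: |G_ts(G_st(x)) - x| averaged over the batch."""
    total = 0.0
    for x in x_batch:
        total += abs(g_tgt_to_src(g_src_to_tgt(x)) - x)
    return total / len(x_batch)

# Invented toy generators: exact inverses give zero cycle loss.
to_tgt = lambda x: 2.0 * x + 1.0
to_src = lambda x: (x - 1.0) / 2.0
loss = cycle_consistency_loss([0.0, 1.0, -2.0], to_tgt, to_src)
```

Minimizing this term, alongside the adversarial losses, is what lets CycleGAN learn a conversion without parallel utterances from the two speakers.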
Greedy, Joint Syntactic-Semantic Parsing with Stack LSTMs
We present a transition-based parser that jointly produces syntactic and
semantic dependencies. It learns a representation of the entire algorithm
state, using stack long short-term memories (stack LSTMs). Our greedy inference
algorithm runs in linear time, including feature extraction. On the CoNLL 2008-9 English shared
tasks, we obtain the best published parsing performance among models that
jointly learn syntax and semantics.
Comment: Proceedings of CoNLL 2016; 13 pages, 5 figures
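A transition-based parser of this kind maintains a stack and a buffer and greedily applies one action per step, which is what yields linear-time inference. A minimal arc-standard-style sketch (the action set and its semantics here are a simplified stand-in, not the paper's exact joint syntactic-semantic transition system):

```python
def parse(words, actions):
    """Apply SHIFT / LEFT-ARC / RIGHT-ARC actions; return (head, dependent) arcs."""
    stack, buffer, arcs = [], list(words), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":    # second-from-top becomes dependent of top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT-ARC":   # top becomes dependent of second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# One action per step over a 3-word sentence:
arcs = parse(["she", "ate", "fish"],
             ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"])
```

In the paper's model, stack LSTMs summarize the stack, buffer, and action history to score the next action greedily.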
Emergent Translation in Multi-Agent Communication
While most machine translation systems to date are trained on large parallel
corpora, humans learn language in a different way: by being grounded in an
environment and interacting with other humans. In this work, we propose a
communication game where two agents, native speakers of their own respective
languages, jointly learn to solve a visual referential task. We find that the
ability to understand and translate a foreign language emerges as a means to
achieve shared goals. The emergent translation is interactive and multimodal,
and crucially does not require parallel corpora, but only monolingual,
independent text and corresponding images. Our proposed translation model
achieves this by grounding the source and target languages into a shared visual
modality, and outperforms several baselines on both word-level and
sentence-level translation tasks. Furthermore, we show that agents in a
multilingual community learn to translate better and faster than in a bilingual
communication setting.
Comment: Accepted to ICLR 201
Massively Multilingual Adversarial Speech Recognition
We report on adaptation of multilingual end-to-end speech recognition models
trained on as many as 100 languages. Our findings shed light on the relative
importance of similarity between the target and pretraining languages along the
dimensions of phonetics, phonology, language family, geographical location, and
orthography. In this context, experiments demonstrate the effectiveness of two
additional pretraining objectives in encouraging language-independent encoder
representations: a context-independent phoneme objective paired with a
language-adversarial classification objective.
Comment: Accepted at NAACL-HLT 201
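A language-adversarial objective trains the encoder to perform well on the phoneme task while making language identification hard, which is often written as the phoneme loss minus a weighted language-classifier loss (implemented in practice with a gradient reversal layer). A sketch of the combined encoder objective (the weight value is an assumption):

```python
def adversarial_encoder_loss(phoneme_loss: float, language_loss: float,
                             lam: float = 0.5) -> float:
    """Encoder objective: minimize phoneme loss while *maximizing* language-ID loss.

    The subtracted term is what a gradient reversal layer realizes during
    backpropagation; the language classifier itself still minimizes its loss.
    """
    return phoneme_loss - lam * language_loss

# Made-up loss values: a higher language-ID loss *lowers* the encoder objective.
obj = adversarial_encoder_loss(2.0, 1.0)
```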
Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning
We introduce polyglot language models, recurrent neural network models
trained to predict symbol sequences in many different languages using shared
representations of symbols and conditioning on typological information about
the language to be predicted. We apply these to the problem of modeling phone
sequences---a domain in which universal symbol inventories and
cross-linguistically shared feature representations are a natural fit.
Intrinsic evaluation on held-out perplexity, qualitative analysis of the
learned representations, and extrinsic evaluation in two downstream
applications that make use of phonetic features show (i) that polyglot models
better generalize to held-out data than comparable monolingual models and (ii)
that polyglot phonetic feature representations are of higher quality than those
learned monolingually.
Comment: Proceedings of NAACL 2016; 10 pages
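Conditioning on typological information amounts to feeding the model, at each step, the symbol embedding concatenated with a per-language typology feature vector. A minimal sketch with made-up embeddings and feature values (real typology vectors come from databases such as typological feature tables):

```python
def conditioned_input(symbol, language, symbol_embeddings, typology_vectors):
    """Concatenate a symbol embedding with the language's typology vector.

    The result is the per-step input to the shared recurrent model, so the
    same parameters can specialize their predictions per language.
    """
    return symbol_embeddings[symbol] + typology_vectors[language]  # list concat

# Invented toy tables: 2-d symbol embedding, 3-d typology vector.
emb = {"a": [0.1, 0.2]}
typo = {"fin": [1.0, 0.0, 0.0]}
x = conditioned_input("a", "fin", emb, typo)
```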
AP17-OLR Challenge: Data, Plan, and Baseline
We present the data profile and the evaluation plan of the second oriental
language recognition (OLR) challenge AP17-OLR. Compared to the event last year
(AP16-OLR), the new challenge involves more languages and focuses more on short
utterances. The data is offered by SpeechOcean and the NSFC M2ASR project. Two
types of baselines are constructed to assist the participants, one is based on
the i-vector model and the other is based on various neural networks. We report
the baseline results evaluated with various metrics defined by the AP17-OLR
evaluation plan and demonstrate that the combined database is a reasonable data
resource for multilingual research. All the data is free for participants, and
the Kaldi recipes for the baselines have been published online.
Comment: Submitted to APSIPA ASC 2017. arXiv admin note: text overlap with arXiv:1609.0844
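Language recognition evaluations of this kind typically score systems with an average detection cost over target languages. A simplified sketch with unit costs and a 0.5 target prior (an assumption; the exact constants and pair weighting are defined in the challenge's evaluation plan):

```python
def cavg(p_miss, p_fa, p_target=0.5):
    """Simplified average detection cost over target languages.

    p_miss: dict lang -> miss probability for that target language.
    p_fa:   dict (target_lang, nontarget_lang) -> false-alarm probability.
    """
    langs = sorted(p_miss)
    n = len(langs)
    total = 0.0
    for l in langs:
        fa = sum(p_fa[(l, o)] for o in langs if o != l) / (n - 1)
        total += p_target * p_miss[l] + (1 - p_target) * fa
    return total / n

# Toy two-language example with invented error rates.
cost = cavg({"a": 0.2, "b": 0.0},
            {("a", "b"): 0.1, ("b", "a"): 0.3})
```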
End-to-End Slot Alignment and Recognition for Cross-Lingual NLU
Natural language understanding (NLU) in the context of goal-oriented dialog
systems typically includes intent classification and slot labeling tasks.
Existing methods to expand an NLU system to new languages use machine
translation with slot label projection from source to the translated
utterances, and thus are sensitive to projection errors. In this work, we
propose a novel end-to-end model that learns to align and predict target slot
labels jointly for cross-lingual transfer. We introduce MultiATIS++, a new
multilingual NLU corpus that extends the Multilingual ATIS corpus to nine
languages across four language families, and evaluate our method using the
corpus. Results show that our method outperforms a simple label projection
method using fast-align on most languages, and achieves competitive performance
to the more complex, state-of-the-art projection method with only half of the
training time. We release our MultiATIS++ corpus to the community to continue
future research on cross-lingual NLU.
Comment: Accepted at EMNLP 202
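The label-projection baseline mentioned above copies each source token's slot label through a word alignment onto the translated utterance; fast-align only supplies the alignment pairs. A minimal sketch (function and parameter names are mine):

```python
def project_slot_labels(src_labels, alignment, tgt_len, default="O"):
    """Copy slot labels from source tokens to aligned target tokens.

    alignment: list of (src_index, tgt_index) pairs, e.g. from fast-align.
    Unaligned target tokens get the 'outside' label. Any alignment error
    propagates directly into the target labels, which is the sensitivity
    the end-to-end model is designed to avoid.
    """
    tgt_labels = [default] * tgt_len
    for s, t in alignment:
        tgt_labels[t] = src_labels[s]
    return tgt_labels

# Toy example: source has 2 tokens, translation has 3, words reordered.
projected = project_slot_labels(["O", "B-city"], [(0, 1), (1, 0)], 3)
```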