Search CORE

97 research outputs found

Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model

Author: Bourlard Hervé
Garner Philip N.
Tong Sibo
Publication venue
Publication date: 23/01/2018
Field of study

Multilingual models for Automatic Speech Recognition (ASR) are attractive as they have been shown to benefit from more training data, and better lend themselves to adaptation to under-resourced languages. However, initialisation from monolingual context-dependent models leads to an explosion of context-dependent states. Connectionist Temporal Classification (CTC) is a potential solution to this as it performs well with monophone labels. We investigate multilingual CTC in the context of adaptation and regularisation techniques that have been shown to be beneficial in more conventional contexts. The multilingual model is trained to model a universal International Phonetic Alphabet (IPA)-based phone set using the CTC loss function. Learning Hidden Unit Contribution (LHUC) is investigated to perform language adaptive training. In addition, dropout during cross-lingual adaptation is also studied and tested in order to mitigate the overfitting problem. Experiments show that the performance of the universal phoneme-based CTC system can be improved by applying LHUC and it is extensible to new phonemes during cross-lingual adaptation. Updating all the parameters shows consistent improvement on limited data. Applying dropout during adaptation can further improve the system and achieve competitive performance with Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

Author: Vu Ngoc Thang
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2014
Field of study

This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages, which lack resources for speech and language processing. We focus on finding approaches which allow using data from multiple languages to improve the performance for those languages on different levels, such as feature extraction, acoustic modeling and language modeling. Under application aspects, this thesis also includes research work on non-native and Code-Switching speech

KITopen

Recommended from our members

Investigation of multilingual deep neural networks for spoken term detection

Author: Gales MJF
Knill KM
Rath SP
Woodland PC
Zhang C
Zhang SX
Publication venue: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2013 - Proceedings
Publication date: 01/01/2013
Field of study

The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (∼10 hours/language) with 4 languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language independent acoustic model test on the target language showed that retraining or adapting of the acoustic models to the target language is currently minimally needed to achieve reasonable performance. © 2013 IEEE

Apollo (Cambridge)

Multilingual representations for low resource speech recognition and keyword search

Author: Audhkhasi K
Cui J
Cui X
Gales MJF
Golik P
Kingsbury B
Kislal E
Knill KM
Mangu L
Ney H
Nussbaum-Thom M
Picheny M
Ragni A
Ramabhadran B
Schluter R
Sethy A
Tüske Z
Wang H
Woodland P
Publication venue: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
Publication date: 01/01/2015
Field of study

© 2015 IEEE. This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights on how these representations are derived and used. First, we present a data sampling strategy that can speed up the training of multilingual representations without appreciable loss in ASR performance. Second, we show that fusion of diverse multilingual representations developed at different LORELEI sites yields substantial ASR and KWS gains. Speaker adaptation and data augmentation of these representations improves both ASR and KWS performance (up to 8.7% relative). Third, incorporating un-transcribed data through semi-supervised learning, improves WER and KWS performance. Finally, we show that these multilingual representations significantly improve ASR and KWS performance (relative 9% for WER and 5% for MTWV) even when forty hours of transcribed audio in the target language is available. Multilingual representations significantly contributed to the LORELEI KWS systems winning the OpenKWS15 evaluation

Publikationsserver der RWTH Aachen University

Apollo (Cambridge)

White Rose Research Online

CUED - Cambridge University Engineering Department

Multilingual and Unsupervised Subword Modelingfor Zero-Resource Languages

Author: Alumäe
Badino
Badino
Carlin
Chen
Chen
Cui
De Vries
Dunbar
Dunbar
Gales
Grézl
Heck
Heck
Heck
Heck
Hermann
Huijbregts
Jansen
Jansen
Kamper
Kamper
Kamper
Lang
Lee
Levin
Menon
Paul
Peddinti
Pitt
Povey
Renshaw
Riad
Saon
Schatz
Schatz
Schultz
Shibata
Swietojanski
Synnaeve
Synnaeve
Thomas
Trmal
Tsuchiya
Versteegh
Veselý
Vu
Waibel
Walter
Yuan
Yuan
Yuan
Zeghidour
Zeiler
Zhang
Publication venue: 'Elsevier BV'
Publication date: 07/04/2020
Field of study

Subword modeling for zero-resource languages aims to learn low-level representations of speech audio without using transcriptions or other resources from the target language (such as text corpora or pronunciation dictionaries). A good representation should capture phonetic content and abstract away from other types of variability, such as speaker differences and channel noise. Previous work in this area has primarily focused unsupervised learning from target language data only, and has been evaluated only intrinsically. Here we directly compare multiple methods, including some that use only target language speech data and some that use transcribed speech from other (non-target) languages, and we evaluate using two intrinsic measures as well as on a downstream unsupervised word segmentation and clustering task. We find that combining two existing target-language-only methods yields better features than either method alone. Nevertheless, even better results are obtained by extracting target language bottleneck features using a model trained on other languages. Cross-lingual training using just one other language is enough to provide this benefit, but multilingual training helps even more. In addition to these results, which hold across both intrinsic measures and the extrinsic task, we discuss the qualitative differences between the different types of learned features.Comment: 17 pages, 6 figures, 7 tables. Accepted for publication in Computer Speech and Language. arXiv admin note: text overlap with arXiv:1803.0886

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Edinburgh Research Explorer

An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation

Author: Bourlard Hervé
Garner Philip N.
Tong Sibo
Publication venue
Publication date: 19/06/2017
Field of study

Different training and adaptation techniques for multilingual Automatic Speech Recognition (ASR) are explored in the context of hybrid systems, exploiting Deep Neural Networks (DNN) and Hidden Markov Models (HMM). In multilingual DNN training, the hidden layers (possibly extracting bottleneck features) are usually shared across languages, and the output layer can either model multiple sets of language-specific senones or one single universal IPA-based multilingual senone set. Both architectures are investigated, exploiting and comparing different language adaptive training (LAT) techniques originating from successful DNN-based speaker-adaptation. More specifically, speaker adaptive training methods such as Cluster Adaptive Training (CAT) and Learning Hidden Unit Contribution (LHUC) are considered. In addition, a language adaptive output architecture for IPA-based universal DNN is also studied and tested. Experiments show that LAT improves the performance and adaptation on the top layer further improves the accuracy. By combining state-level minimum Bayes risk (sMBR) sequence training with LAT, we show that a language adaptively trained IPA-based universal DNN outperforms a monolingually sequence trained model

Infoscience - École polytechnique fédérale de Lausanne

Crossref