Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling
This study addresses unsupervised subword modeling, i.e., learning feature
representations that can distinguish subword units of a language. The proposed
approach adopts a two-stage bottleneck feature (BNF) learning framework,
consisting of autoregressive predictive coding (APC) as a front-end and a
DNN-BNF model as a back-end. APC pretrained features are set as input features
to a DNN-BNF model. A language-mismatched ASR system is used to provide
cross-lingual phone labels for DNN-BNF model training. Finally, BNFs are
extracted as the subword-discriminative feature representation. A second aim of
this work is to investigate how robust our approach is to varying amounts of
training data. Results on Libri-light and the
ZeroSpeech 2017 databases show that APC is effective in front-end feature
pretraining. Our whole system outperforms the state of the art on both
databases. Cross-lingual phone labels for English data produced by a Dutch ASR
outperform those produced by a Mandarin ASR, possibly because Dutch is more
similar to English than Mandarin is. Our system is less sensitive to the amount
of training data once it exceeds 50 hours. APC pretraining
leads to a reduction of needed training material from over 5,000 hours to
around 200 hours with little performance degradation.

Comment: 5 pages, 3 figures. Accepted for publication in INTERSPEECH 2020, Shanghai, China.
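The two-stage pipeline above can be illustrated with a minimal sketch. All shapes, weights, and function names here are illustrative assumptions, not the paper's implementation: a stand-in for APC hidden states feeds a small DNN with a narrow bottleneck layer trained toward cross-lingual phone labels, and the bottleneck activations are extracted as the subword-discriminative features.

```python
import numpy as np

# Hypothetical sketch of the two-stage BNF framework (assumed shapes):
# Stage 1: an APC-style front end yields pretrained frame features.
# Stage 2: a DNN-BNF back end maps them through a narrow bottleneck
# toward cross-lingual phone labels; the bottleneck output is the BNF.

rng = np.random.default_rng(0)

def apc_features(frames, proj):
    # Stand-in for APC hidden states: a fixed projection of the input.
    # A real APC model is an autoregressive network trained to predict
    # future frames; here we only illustrate the data flow.
    return np.tanh(frames @ proj)

def dnn_bnf_forward(feats, w1, w_bn, w_out):
    h = np.tanh(feats @ w1)      # hidden layer
    bnf = np.tanh(h @ w_bn)      # narrow bottleneck layer
    logits = bnf @ w_out         # cross-lingual phone logits
    return bnf, logits

# Toy dimensions: 40-dim acoustic frames, 32-dim APC features,
# 64-dim hidden layer, 8-dim bottleneck, 30 cross-lingual phones.
frames = rng.normal(size=(100, 40))        # 100 frames of input speech
proj   = rng.normal(size=(40, 32))
w1     = rng.normal(size=(32, 64)) * 0.1
w_bn   = rng.normal(size=(64, 8)) * 0.1
w_out  = rng.normal(size=(8, 30)) * 0.1

feats = apc_features(frames, proj)
bnf, logits = dnn_bnf_forward(feats, w1, w_bn, w_out)
print(bnf.shape)  # one 8-dim bottleneck vector per frame
```

After training the back end on the cross-lingual phone labels, only the bottleneck activations would be kept as the learned representation.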
Improving multilingual speech recognition systems
End-to-end trainable deep neural networks have become the state-of-the-art architecture for automatic speech recognition (ASR), provided that the network is trained with a sufficiently large dataset. However, many existing languages are too sparsely resourced for deep learning networks to achieve as high accuracy as their resource-abundant counterparts.
Multilingual recognition systems mitigate data-sparsity issues by training models on data from multiple languages to learn a speech-to-text or speech-to-phone model universal to all of them. The resulting multilingual ASR models usually achieve better recognition accuracy than models trained on the individual datasets.
In this work, we identify two limitations of multilingual systems and show that resolving them improves recognition accuracy: (1) Existing corpora vary considerably in form (spontaneous or read speech), size, noise level, and phoneme distribution, so ASR models trained on the joint multilingual dataset show large performance disparities across languages. We present an optimizable loss function, the equal accuracy ratio (EAR), that measures the sequence-level performance disparity between different user groups, and we show that explicitly optimizing this objective reduces the performance gap and improves multilingual recognition accuracy. (2) While accurate on the training languages they have seen, multilingual systems do not generalize well to unseen test languages, a setting we refer to as cross-lingual recognition. We introduce language embeddings derived from external linguistic typologies and show that they significantly increase both multilingual and cross-lingual accuracy. We illustrate the effectiveness of the proposed methods with experiments on multilingual, multi-user, and multi-dialect corpora.
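The abstract does not give the exact form of the EAR objective, but the idea of penalizing performance disparity between groups can be sketched as follows. The function names and the specific penalty form are assumptions for illustration only: per-group mean losses are compared, and a term that equals 1.0 under equal performance (and grows as groups diverge) is added to the average loss.

```python
# Hedged sketch of a disparity-aware loss (the exact EAR definition in
# the paper may differ): penalize the ratio between the worst and best
# per-group mean losses on top of the ordinary average loss.

def group_losses(losses, groups):
    # Mean loss per language/user group.
    out = {}
    for g in set(groups):
        vals = [l for l, gg in zip(losses, groups) if gg == g]
        out[g] = sum(vals) / len(vals)
    return out

def ear_penalty(per_group):
    # Disparity term: ratio of worst to best group loss; equals 1.0
    # when all groups perform equally, larger when they diverge.
    vals = list(per_group.values())
    return max(vals) / min(vals)

losses = [0.8, 1.2, 0.9, 2.0, 1.8]       # per-utterance losses (toy values)
groups = ["en", "en", "en", "nl", "nl"]  # group label of each utterance
per_group = group_losses(losses, groups)
total = sum(losses) / len(losses) + 0.1 * (ear_penalty(per_group) - 1.0)
```

Minimizing such a combined objective pushes the model to close the gap between the best- and worst-served groups rather than only lowering the average.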
Transducer-based language embedding for spoken language identification
Acoustic and linguistic features are important cues for the spoken language
identification (LID) task. Recent advanced LID systems rely mainly on acoustic
features and lack an explicit encoding of linguistic features.
In this paper, we propose a novel transducer-based language embedding approach
for LID tasks by integrating an RNN transducer model into a language embedding
framework. Benefiting from the advantages of the RNN transducer's linguistic
representation capability, the proposed method can exploit both
phonetically-aware acoustic features and explicit linguistic features for LID
tasks. Experiments were carried out on the large-scale multilingual LibriSpeech
and VoxLingua107 datasets. Experimental results showed the proposed method
significantly improves the performance on LID tasks with 12% to 59% and 16% to
24% relative improvement on in-domain and cross-domain datasets, respectively.Comment: This paper was submitted to Interspeech 202
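The data flow of a transducer-based language embedding can be sketched roughly as follows. Every name and shape here is an illustrative assumption: frame-level representations from an RNN-T encoder (phonetically aware because the transducer was trained on transcripts) are pooled over time into a fixed-size utterance embedding, which a classifier maps to a language.

```python
import numpy as np

# Hedged sketch of transducer-based language embedding for LID
# (illustrative shapes; a real system would reuse a pretrained RNN-T).

rng = np.random.default_rng(0)

def rnnt_encoder(frames, w):
    # Stand-in for the RNN-T acoustic encoder.
    return np.tanh(frames @ w)

def language_embedding(enc):
    # Statistics pooling: concatenate mean and std over time, a common
    # way to turn a variable-length sequence into one fixed vector.
    return np.concatenate([enc.mean(axis=0), enc.std(axis=0)])

def lid_logits(emb, w_cls):
    return emb @ w_cls

frames = rng.normal(size=(200, 40))         # one utterance, 200 frames
w_enc  = rng.normal(size=(40, 64)) * 0.1
w_cls  = rng.normal(size=(128, 107)) * 0.1  # e.g. 107 VoxLingua107 languages

emb = language_embedding(rnnt_encoder(frames, w_enc))
pred = int(np.argmax(lid_logits(emb, w_cls)))
print(emb.shape)  # fixed-size embedding regardless of utterance length
```

Because the encoder was trained against transcripts, its pooled states carry linguistic information that purely acoustic embeddings lack, which is the intuition the abstract describes.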