Non-linear Cross-lingual Mappings for Low-resource Speech Recognition

Abstract

Multilingual speech recognition systems usually benefit low-resource languages, but for several languages they perform worse than their monolingual counterparts. To improve speech recognition performance for a target language, closely related languages are chosen to build the multilingual system, and the number of phonemes shared among the languages is usually taken as an estimate of their closeness. However, various close languages, such as English, German and Dutch, with significant phoneme overlap yield higher error rates from multilingual speech recognition systems than from their monolingual counterparts. Limited attention has been paid to investigating the performance trends of multilingual speech recognition systems and their relation to acoustic-phonetic similarities across languages. The objective of this research is to estimate cross-lingual acoustic-phonetic similarities and their impact on multilingual speech recognition systems. To that end, a novel data-driven approach is proposed to analyse the outputs of several monolingual acoustic models given a target language speech signal. The technique measures the similarities between the posterior distributions produced by different monolingual acoustic models for the same target language speech. Neural network-based 'mapping models' are trained that transform the posterior distributions from hybrid DNN-HMM acoustic models of different languages into a directly comparable form. The analysis shows that the 'closeness' among languages cannot be reliably estimated by the size of the shared phoneme set. Entropy analysis of the proposed mapping models shows that a language with less phoneme overlap can be more amenable to cross-lingual transfer, and hence more beneficial in a multilingual setup.

The proposed mapping models are then exploited to improve low-resource speech recognition through a novel approach to hybrid DNN-HMM acoustic model fusion in a multilingual setup. Posterior distributions from different monolingual acoustic models, given a target language speech signal, are fused. Mapping models are trained for source-target language pairs to transform posteriors from a source acoustic model into the target language's phoneme space; these models require limited data compared with acoustic model training. Multilingual model fusion yields relative average gains of 4.56% and 3.77% for selected languages from the Babel data set over multilingual and monolingual baselines respectively. Cross-lingual model fusion further shows that comparable results can be achieved without using posteriors from the language-dependent ASR system.

The substantial phoneme overlap across languages and the relatively small size of the universal phoneme set are expected to make this mapping task less challenging than mapping for end-to-end ASR systems, where tokens are usually graphemes or sub-word units. Moreover, end-to-end speech recognition systems have come to dominate over hybrid DNN-HMM models. The concept of learnable cross-lingual mappings is therefore extended to end-to-end speech recognition, to study whether such mappings can also be learnt in that setting. Mapping models are further employed to transliterate the source languages into the target language without using parallel data; the source audio and its transliteration are then used as data augmentation to retrain the target language ASR system.
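To make the mapping idea concrete, the following is a minimal sketch, not the thesis's exact architecture, of a neural mapping model that transforms posteriors from a source-language acoustic model into the target language's phoneme space, trained with a KL-divergence objective. The dimensions N_SRC and N_TGT, the hidden size, and the fusion weights are illustrative assumptions.

import torch
import torch.nn as nn

N_SRC, N_TGT = 96, 42   # hypothetical source/target posterior dimensions

class MappingModel(nn.Module):
    """Non-linear map from source-language posteriors to target-language posteriors."""
    def __init__(self, n_src: int, n_tgt: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_src, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_tgt),            # logits over target phonemes
        )

    def forward(self, src_post: torch.Tensor) -> torch.Tensor:
        # Return log-probabilities over the target phoneme set.
        return self.net(src_post).log_softmax(dim=-1)

# Train by matching the target acoustic model's posteriors for the same frames.
model = MappingModel(N_SRC, N_TGT)
kl = nn.KLDivLoss(reduction="batchmean")
src = torch.rand(8, N_SRC).softmax(dim=-1)   # stand-in source AM outputs
tgt = torch.rand(8, N_TGT).softmax(dim=-1)   # stand-in target AM outputs
loss = kl(model(src), tgt)                   # input: log-probs, target: probs
loss.backward()

# Fusion sketch: combine target posteriors with mapped source posteriors.
fused = 0.5 * tgt + 0.5 * model(src).exp().detach()   # illustrative equal weights

A model of this size is far smaller than an acoustic model, which is consistent with the observation above that mapping models require limited training data.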
The retrained speech recognition system with this data augmentation achieves a relative gain of up to 5% over the baseline monolingual speech recognition system for the selected languages from the Babel data set.

Student-teacher learning, or knowledge distillation, has previously been used to address data scarcity in the training of speech recognition systems. However, a limitation of knowledge distillation training is that the student model's classes must be a subset of the teacher model's classes, which prevents distillation even from acoustically similar languages if their character sets differ. In this work, that limitation is addressed by proposing a novel MUltilingual Student-Teacher (MUST) learning approach that exploits the mapping models proposed earlier. A pre-trained mapping model is used to map posteriors from a teacher language ASR system into the student language's token space, and these mapped posteriors are used as soft labels for knowledge distillation. Various teacher ensembling schemes are explored to train ASR systems for low-resource languages. A model trained with MUST learning reduces the relative character error rate by up to 9.5% compared with a baseline monolingual ASR system.
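As a hedged illustration of the distillation step described above, mapped teacher posteriors can serve as soft labels alongside the usual hard-label loss. The function name, the interpolation weight alpha, and the frozen mapping model interface below are assumptions for the sketch, not details taken from the thesis.

import torch
import torch.nn.functional as F

def must_loss(student_logits, teacher_src_posteriors, mapping_model,
              hard_labels, alpha: float = 0.5):
    # Map teacher-language posteriors into the student's token space.
    with torch.no_grad():
        soft_labels = mapping_model(teacher_src_posteriors).exp()
    log_probs = F.log_softmax(student_logits, dim=-1)
    kd = F.kl_div(log_probs, soft_labels, reduction="batchmean")  # soft-label term
    ce = F.cross_entropy(student_logits, hard_labels)             # hard-label term
    return alpha * kd + (1.0 - alpha) * ce

A teacher ensemble could be handled in the same way, for example by averaging the mapped posteriors from several teacher languages before computing the distillation term.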

This thesis was published in White Rose E-theses Online.
