207 research outputs found

    Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks

    Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental. Acoustic landmarks are perceptually salient, even in a language one doesn't speak, and it has been demonstrated that non-speakers of the language can identify features such as the primary articulator of the landmark. These factors suggest a strategy for developing language-independent automatic speech recognition: landmarks can potentially be learned once from a suitably labeled corpus and rapidly applied to many other languages. This paper proposes enhancing the cross-lingual portability of a neural network by using landmarks as the secondary task in multi-task learning (MTL). The network is trained in a well-resourced source language with both phone and landmark labels (English), then adapted to an under-resourced target language with only word labels (Iban). Landmark-tasked MTL reduces source-language phone error rate by 2.9% relative, and reduces target-language word error rate by 1.9%-5.9% depending on the amount of target-language training data. These results suggest that landmark-tasked MTL causes the DNN to learn hidden-node features that are useful for cross-lingual adaptation. Comment: Submitted in Interspeech201

    Robust Open-Set Spoken Language Identification and the CU MultiLang Dataset

    Most state-of-the-art spoken language identification models are closed-set; in other words, they can only output a language label from the set of classes they were trained on. Open-set spoken language identification systems, however, gain the ability to detect when an input exhibits none of the original languages. In this paper, we implement a novel approach to open-set spoken language identification that uses MFCC and pitch features, a TDNN model to extract meaningful feature embeddings, confidence thresholding on softmax outputs, and LDA and pLDA for learning to classify new unknown languages. We present a spoken language identification system that achieves 91.76% accuracy on trained languages and has the capability to adapt to unknown languages on the fly. To that end, we also built the CU MultiLang Dataset, a large and diverse multilingual speech corpus which was used to train and evaluate our system. Comment: 6 pages, 1 table, 6 figures
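
    As a rough illustration of the confidence-thresholding step mentioned in this abstract, the sketch below rejects an input as "unknown" when the highest softmax probability falls under a cutoff. The function name `classify_open_set`, the language labels, and the 0.7 threshold are assumptions made for illustration; the paper's TDNN embeddings, its actual threshold, and the LDA/pLDA stage are not reproduced here.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_set(logits, labels, threshold=0.7):
    """Return (label, confidence); label is "unknown" if confidence < threshold.

    `threshold` is a hypothetical value chosen for this example only.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return labels[best], probs[best]


if __name__ == "__main__":
    labels = ["en", "fr", "de"]  # illustrative closed-set languages
    print(classify_open_set([5.0, 0.1, 0.1], labels))  # one class dominates
    print(classify_open_set([1.0, 1.0, 1.0], labels))  # flat scores, rejected
```

    A single global cutoff is the simplest possible rejection rule; in practice the threshold would have to be tuned on held-out data that includes out-of-set languages.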

    Towards Weakly Supervised Acoustic Subword Unit Discovery and Lexicon Development Using Hidden Markov Models

    Developing a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be available, particularly for under-resourced languages. An alternative to development of a phonetic lexicon is to automatically derive subword units using acoustic information and generate associated pronunciations. In the literature, this has been mostly studied from the pronunciation variation modeling perspective. In this article, we investigate automatic subword unit derivation from the under-resourced language point of view. Towards that, we present a novel hidden Markov model (HMM) formalism for automatic derivation of subword units and pronunciation generation using only transcribed speech data. In this approach, the subword units are derived from the clustered context-dependent units in a grapheme-based system using the maximum-likelihood criterion. The subword unit based pronunciations are then generated either by deterministic or probabilistic learning of the relationship between the graphemes and the acoustic subword units (ASWUs). In this article, we first establish the proposed framework on a well-resourced language by comparing it against related approaches in the literature and investigating the transferability of the derived subword units to other domains. We then show the scalability of the proposed approach on real under-resourced scenarios by conducting studies on Scottish Gaelic, a genuinely minority and endangered language, and comparing the approach against state-of-the-art grapheme-based approaches in under-resourced scenarios. Our experimental studies on English show that the derived subword units can not only lead to better ASR systems compared to graphemes, but can also be exploited to build out-of-domain ASR systems. The experimental studies on Scottish Gaelic show that the proposed ASWU-based lexicon development approach retains its dominance over the grapheme-based lexicon. Alternately, the proposed approach yields significant gains in ASR performance, even when multilingual resources from resource-rich languages are exploited in the development of ASR systems.

    Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling

    Automatic speech recognition (ASR) systems incorporate linguistic expertise through a phone pronunciation lexicon (or dictionary), in which each word is associated with a sequence of phones. Creating a phone pronunciation lexicon for a new language or domain is costly, as it requires linguistic expertise, time, and money. In this thesis, we focus on effective building of ASR systems in the absence of linguistic expertise for a new domain or language. Particularly, we consider graphemes as alternate subword units for speech recognition. In a grapheme lexicon, the pronunciation of a word is derived from its orthography. However, modeling graphemes for speech recognition is a challenging task for two reasons. Firstly, the grapheme-to-phoneme (G2P) relationship can be ambiguous, as languages continue to evolve after their spelling has been standardized. Secondly, as elucidated in this thesis, ASR systems typically model the relationship between graphemes and acoustic features directly, while the acoustic features depict the envelope of speech, which is related to phones. In this thesis, a grapheme-based ASR approach is proposed where the modeling of the relationship between graphemes and acoustic features is factored through a latent variable into two models, namely, an acoustic model and a lexical model. In the acoustic model the relationship between latent variables and acoustic features is modeled, while in the lexical model a probabilistic relationship between latent variables and graphemes is modeled. We refer to the proposed approach as probabilistic lexical modeling based ASR. In the thesis we show that the latent variables can be phones or multilingual phones or clustered context-dependent subword units; and an acoustic model can be trained on domain-independent or language-independent resources. The lexical model is trained on transcribed speech data from the target domain or language. In doing so, the parameters of the lexical model capture a probabilistic relationship between graphemes and phones. In the proposed grapheme-based ASR approach, lexicon learning is implicitly integrated as a phase in ASR system training, as opposed to the conventional approach where a phone pronunciation lexicon is first developed and a phone-based ASR system is then trained. The potential and the efficacy of the proposed approach are demonstrated through experiments and comparisons with other standard approaches on ASR for resource-rich languages, nonnative and accented speech, under-resourced languages, and minority languages. The studies revealed that the proposed framework is particularly suitable when the task is challenged by the lack of both linguistic expertise and transcribed data. Furthermore, our investigations also showed that standard ASR approaches in which the lexical model is deterministic are more suitable for phones than graphemes, while the probabilistic lexical model based ASR approach is suitable for both. Finally, we show that the captured grapheme-to-phoneme relationship can be exploited to perform acoustic data-driven G2P conversion.

    Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning

    Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that pooling the small amounts of data available in multilingual end-to-end models and pre-training on unsupervised data can help improve speech recognition quality for many African languages.

    TheanoLM - An Extensible Toolkit for Neural Network Language Modeling

    We present a new tool for training neural network language models (NNLMs), scoring sentences, and generating text. The tool has been written using the Python library Theano, which allows researchers to easily extend it and tune any aspect of the training process. Despite this flexibility, Theano is able to generate extremely fast native code that can utilize a GPU or multiple CPU cores in order to parallelize the heavy numerical computations. The tool has been evaluated in difficult Finnish and English conversational speech recognition tasks, and significant improvement was obtained over our best back-off n-gram models. The results that we obtained in the Finnish task were compared to those from the existing RNNLM and RWTHLM toolkits, and found to be as good or better, while training times were an order of magnitude shorter.