
    Semi-supervised and Active-learning Scenarios: Efficient Acoustic Model Refinement for a Low Resource Indian Language

    We address the problem of efficient acoustic-model refinement (continuous retraining) using semi-supervised and active learning for a low-resource Indian language, where the low-resource constraints are i) a small labeled corpus from which to train a baseline `seed' acoustic model and ii) a large training corpus without orthographic labeling, from which data can be selected for manual labeling at low cost. The proposed semi-supervised learning decodes the large unlabeled training corpus with the seed model and, through various protocols, selects the reliably decoded utterances using confidence levels (which correlate with the WER of the decoded utterances) and iterative bootstrapping. The proposed active-learning protocol uses a confidence-level-based metric to select decoded utterances from the large unlabeled corpus for further manual labeling. The semi-supervised learning protocols can recover, from a poorly trained seed model, as much as 50% of the best WER reduction that would be realizable if the entire large corpus were labeled and used for acoustic-model training. The active-learning protocols require only 60% of the training corpus to be manually labeled to reach the same performance as training on the fully labeled data.
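
    A minimal, hypothetical sketch of the two selection steps described above: confidence-based filtering (semi-supervised) and least-confidence selection (active learning). The function names, data layout, and the 0.9 threshold are illustrative assumptions, not taken from the paper.

        import heapq

        def select_reliable(decoded, threshold=0.9):
            # Semi-supervised protocol (sketch): `decoded` holds
            # (utterance_id, hypothesis, confidence) triples produced by
            # decoding the unlabeled corpus with the seed model. Keep the
            # hypotheses whose confidence, assumed to correlate with WER,
            # exceeds the threshold; retraining on the survivors and
            # re-decoding gives the iterative bootstrapping loop.
            return [(uid, hyp) for uid, hyp, conf in decoded if conf >= threshold]

        def select_for_labeling(decoded, budget):
            # Active-learning protocol (sketch): send the least-confident
            # utterances, i.e. those the seed model handles worst, for
            # manual transcription within a fixed labeling budget.
            return heapq.nsmallest(budget, decoded, key=lambda t: t[2])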

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance at several levels of the system, such as feature extraction, acoustic modeling, and language modeling. On the application side, this thesis also includes research on non-native and code-switching speech.

    Towards Rapid Language Portability of Speech Processing Systems

    Rapid Generation of Pronunciation Dictionaries for new Domains and Languages

    This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Solutions are proposed and developed for a range of conditions, from the straightforward scenario, in which the target language is present in written form on the Internet and the mapping between speech and written language is close, to the difficult scenario, in which no written form of the target language exists.
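
    For the straightforward end of that spectrum, where the mapping between speech and written language is close, a dictionary can be seeded from simple substitution rules. The sketch below is a hypothetical illustration of such a rule-based grapheme-to-phoneme seed, not the dissertation's method; the rule table is invented.

        # Hypothetical G2P rules for a language with a close
        # orthography-to-sound mapping; "ch" shows a two-character cluster.
        G2P_RULES = {"a": "a", "ch": "tS", "e": "e", "k": "k",
                     "l": "l", "o": "o", "s": "s", "t": "t"}

        def rule_based_g2p(word):
            # Greedy longest-match transcription of a word into phones.
            phones, i = [], 0
            while i < len(word):
                for span in (2, 1):  # prefer two-character clusters
                    chunk = word[i:i + span]
                    if chunk in G2P_RULES:
                        phones.append(G2P_RULES[chunk])
                        i += span
                        break
                else:
                    i += 1  # no rule for this grapheme: skip it
            return phones

        print(rule_based_g2p("kachel"))  # -> ['k', 'a', 'tS', 'e', 'l']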

    How usable are digital collections for endangered languages? A review

    Here, we report on pilot research on the extent to which language collections in digital linguistic archives are discoverable, accessible, and usable for linguistic research. Using a test case of common tasks in phonetic and phonological documentation, we evaluate a small random sample of collections and find substantial, striking problems in all domains. Of the original 20 collections, only six had digitized audio files with associated transcripts (preferably phrase-aligned). That is, only 30% of the collections in our sample were even potentially suitable for any type of phonetic work (regardless of recording quality). Information about the contents of a collection was usually discoverable, though there was variation in the types of information that could be easily searched for in the collection. Though three collections were eventually aligned, only one was successfully force-aligned from the archival materials without substantial intervention. We close with recommendations for archive depositors to facilitate the discoverability, accessibility, and functionality of language collections. Consistency and accuracy in file naming, data descriptions, and transcription practices are imperative. Providing a collection guide also helps. Including useful search terms about collection contents makes the materials more findable. Researchers need to be aware of the changes to collection structure that may result from archival uploads. Depositors need to consider how their metadata is included in collections and how items in a collection may be matched to each other and to metadata categories. Finally, if our random sample is indicative, linguistic documentation practices need to change rapidly if future phonetic work from archival collections is to be possible.
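
    As a concrete illustration of the usability check described above, this sketch pairs archival audio with transcripts by shared filename stem, under the assumption (which the review's recommendations encourage) that depositors name matching files consistently. The directory layout and file extensions are assumptions for illustration only.

        from pathlib import Path

        AUDIO_EXT = {".wav", ".mp3", ".flac"}
        TRANSCRIPT_EXT = {".eaf", ".txt", ".TextGrid"}

        def pair_audio_with_transcripts(collection_dir):
            # Match audio files to transcripts that share a filename stem;
            # unmatched audio is unusable for alignment-based phonetic work.
            files = [f for f in Path(collection_dir).rglob("*") if f.is_file()]
            audio = {f.stem: f for f in files if f.suffix in AUDIO_EXT}
            texts = {f.stem: f for f in files if f.suffix in TRANSCRIPT_EXT}
            paired = sorted(set(audio) & set(texts))
            coverage = len(paired) / len(audio) if audio else 0.0
            return paired, coverage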

    Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling

    Automatic speech recognition (ASR) systems incorporate expert linguistic knowledge through the use of a phone pronunciation lexicon (or dictionary), where each word is associated with a sequence of phones. The creation of a phone pronunciation lexicon for a new language or domain is costly, as it requires linguistic expertise as well as time and money. In this thesis, we focus on effectively building ASR systems in the absence of linguistic expertise for a new domain or language. In particular, we consider graphemes as alternative subword units for speech recognition. In a grapheme lexicon, the pronunciation of a word is derived from its orthography. However, modeling graphemes for speech recognition is challenging for two reasons. Firstly, the grapheme-to-phoneme (G2P) relationship can be ambiguous, as languages continue to evolve after their spelling has been standardized. Secondly, as elucidated in this thesis, ASR systems typically model the relationship between graphemes and acoustic features directly, yet the acoustic features depict the envelope of speech, which is related to phones. In this thesis, a grapheme-based ASR approach is proposed in which the relationship between graphemes and acoustic features is factored through a latent variable into two models, namely an acoustic model and a lexical model. The acoustic model captures the relationship between latent variables and acoustic features, while the lexical model captures a probabilistic relationship between latent variables and graphemes. We refer to the proposed approach as probabilistic lexical modeling based ASR. In the thesis we show that the latent variables can be phones, multilingual phones, or clustered context-dependent subword units, and that the acoustic model can be trained on domain-independent or language-independent resources. The lexical model is trained on transcribed speech data from the target domain or language. In doing so, the parameters of the lexical model capture a probabilistic relationship between graphemes and phones. In the proposed grapheme-based ASR approach, lexicon learning is implicitly integrated as a phase of ASR system training, as opposed to the conventional approach, in which a phone pronunciation lexicon is first developed and a phone-based ASR system is then trained. The potential and efficacy of the proposed approach are demonstrated through experiments and comparisons with other standard approaches on ASR for resource-rich languages, non-native and accented speech, under-resourced languages, and minority languages. The studies reveal that the proposed framework is particularly suitable when the task is challenged by a lack of both linguistic expertise and transcribed data. Furthermore, our investigations show that standard ASR approaches, in which the lexical model is deterministic, are more suitable for phones than for graphemes, while the probabilistic lexical modeling based ASR approach is suitable for both. Finally, we show that the captured grapheme-to-phoneme relationship can be exploited to perform acoustic data-driven G2P conversion.
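
    The factorization described above can be written compactly; in generic notation (not necessarily the thesis's own symbols), with acoustic feature vector x_t, grapheme state g, and latent subword unit l:

        % probabilistic lexical modeling: the grapheme-to-acoustics relation
        % is factored through latent units l (phones, multilingual phones,
        % or clustered context-dependent units)
        p(x_t \mid g) = \sum_{l} \underbrace{p(x_t \mid l)}_{\text{acoustic model}}
                        \, \underbrace{P(l \mid g)}_{\text{lexical model}}

    The acoustic model p(x_t | l) can then be trained on domain- or language-independent resources, while the lexical model P(l | g) is estimated from transcribed speech in the target domain or language.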

    Acoustic Modelling for Under-Resourced Languages

    Automatic speech recognition systems have so far been developed for only a very few of the world's 4,000-7,000 languages. In this thesis we examine methods to rapidly create acoustic models for new, possibly under-resourced languages in a time- and cost-effective manner. To this end we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.

    Towards a corpus of Indian South African English (ISAE) : an investigation of lexical and syntactic features in a spoken corpus of contemporary ISAE

    There is consensus among scholars that there is not just one English language but a family of “World Englishes”. The umbrella term “World Englishes” provides a conceptual framework to accommodate the different varieties of English that have evolved as a result of the linguistic cross-fertilization attendant upon colonization, migration, trade and transplantation of the original “strain” or variety. Various theoretical models have emerged in an attempt to understand and classify the extant and emerging varieties of this global language. The hierarchically based model of English, which classifies world English as “First Language”, “Second Language” and “Foreign Language”, has been challenged by more equitably conceived models which refer to the emerging varieties as New Englishes. The situation in a country such as multilingual South Africa is a complex one: there are 11 official languages, one of which is English. However, the English used in South Africa (or “South African English”) is not a homogeneous variety, since its speakers include those for whom it is a first language, those for whom it is an additional language and those for whom it is a replacement language. The Indian population in South Africa is among the latter group, as theirs is a case where English has ousted the traditional Indian languages and become a de facto first language that has retained strong community resonances. This study used the methodology of corpus linguistics to initiate the creation of a repository of linguistic evidence (or corpus) of Indian South African English, a sub-variety of South African English (Mesthrie 1992b, 1996, 2002). Although small (approximately 60 000 words) and representing a narrow age band of young adults, the resulting corpus of spoken data confirmed the existence of robust features identified in prior research into the sub-variety. These features include the use of ‘y’all’ as a second person plural pronoun, the use of ‘but’ in sentence-final position, and ‘lakker’ /ˈlʌkə/ as a pronunciation variant of ‘lekker’ (meaning ‘good’, ‘nice’ or ‘great’). An examination of lexical frequency lists revealed examples of general South African English such as the colloquially pervasive ‘ja’, ‘bladdy’ (for ‘bloody’) and ‘jol(ling)’ (for partying or enjoying oneself), together with neologisms such as ‘eish’, the latter previously associated with speakers of Black South African English. The frequency lists facilitated cross-corpora comparisons with data from the British National Corpus and the Corpus of London Teenage Language, and similarities and differences were noted and discussed. The study also used discourse analysis frameworks to investigate the role of high-frequency lexical items such as ‘like’ in the data. In recent times ‘like’ has emerged globally as a lexicalized discourse marker, and its appearance in the corpus of Indian South African English confirms this trend. The corpus built as part of this study is intended as the first building block towards a full corpus of Indian South African English, which could serve as a standard for referencing research into the sub-variety. Ultimately, it is argued that the establishment of similar corpora of other known sub-varieties of South African English could contribute towards the creation of a truly representative large corpus of South African English and a more nuanced understanding and definition of this important variety of World English.
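
    A minimal sketch of the frequency-list step behind the cross-corpora comparison; the tokenizer and the marker list are illustrative assumptions, not the study's actual tooling.

        import re
        from collections import Counter

        # Features noted above; 'like' is tracked as a discourse marker.
        MARKERS = ["y'all", "ja", "bladdy", "eish", "like"]

        def frequency_list(text):
            # Lowercased token counts; the apostrophe is kept in the token
            # pattern so that forms such as y'all survive tokenization.
            return Counter(re.findall(r"[a-z']+", text.lower()))

        def per_million(counts, word):
            # Normalized frequency, comparable across corpora of different
            # sizes (e.g. against the BNC or COLT).
            total = sum(counts.values())
            return 1_000_000 * counts[word] / total if total else 0.0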

    Incorporating Weak Statistics for Low-Resource Language Modeling

    Automatic speech recognition (ASR) requires a strong language model to guide the acoustic model and favor likely utterances. While many tasks enjoy billions of language model training tokens, many domains that require ASR do not have readily available electronic corpora. The only source of useful language modeling data is expensive and time-consuming human transcription of in-domain audio. This dissertation seeks to quickly and inexpensively improve low-resource language modeling for use in automatic speech recognition. It first considers the efficient use of non-professional human labor to best improve system performance, and demonstrates that it is better to collect more data, despite a higher transcription error rate, than to transcribe data redundantly to improve quality. In the process of developing procedures to collect such data, this work also presents an efficient rating scheme to detect poor transcribers without gold-standard data. As an alternative to this process, automatic transcripts are generated with an ASR system, and we explore efficiently combining these low-quality transcripts with a small amount of high-quality transcripts. Standard n-gram language models are sensitive to the quality of the highest-order n-gram and are unable to exploit accurate weaker statistics. Instead, a log-linear language model is introduced, which elegantly incorporates a variety of background models through MAP adaptation. This work introduces marginal class constraints, which effectively capture knowledge of transcriber error and improve performance over n-gram features. Finally, this work constrains the language modeling task to keyword search for words unseen in the training text. While overall system performance is good, these words suffer the most due to their low probability in the language model. Semi-supervised learning effectively extracts likely n-grams containing these new keywords from a large corpus of audio. By using a search metric that favors recall over precision, this method captures over 80% of the potential gain.
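
    A generic form of the model family described above, sketched in standard notation rather than the dissertation's exact features and constraints: a log-linear language model whose weights are MAP-adapted toward a background model through a Gaussian prior.

        % log-linear LM over word w given history h, with features f_i
        P(w \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_i \lambda_i f_i(w, h) \Big)
        % MAP adaptation: fit the in-domain data D while penalizing deviation
        % from the background weights \lambda_0 (Gaussian prior, variance \sigma^2)
        \hat{\lambda} = \arg\max_{\lambda} \; \log P(\mathcal{D} \mid \lambda)
                        - \frac{\lVert \lambda - \lambda_0 \rVert^2}{2\sigma^2}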