
    Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

    In this paper, we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and Magahi -- using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (roughly 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out during the COVID-19 pandemic, with one aim being to generate additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments with automatic speech recognition systems for these languages.
    Comment: Speech for Social Good Workshop, 2022, Interspeech 202
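    Baseline ASR results such as those mentioned above are conventionally reported as word error rate (WER). A minimal sketch of WER computation via Levenshtein distance over word sequences (a standard metric, not code from the paper):

    ```python
    def wer(ref: str, hyp: str) -> float:
        """Word error rate: edit distance between word sequences, divided
        by the number of reference words."""
        r, h = ref.split(), hyp.split()
        # d[i][j] = edit distance between first i ref words and first j hyp words
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost)  # substitution/match
        return d[len(r)][len(h)] / max(len(r), 1)
    ```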

    Targeted Subset Selection for Limited-data ASR Accent Adaptation

    We study the task of adapting an existing ASR model to a non-native accent while being constrained by a transcription budget on the duration of utterances selected from a large unlabeled corpus. We propose a subset selection approach using the recently proposed submodular mutual information functions, in which we identify a diverse set of utterances that match the target accent. The target is specified through a few example utterances, and the selection is achieved by modelling the relationship between the target and the candidate subsets using these functions. The model adapts to the accent through fine-tuning on the utterances selected and transcribed from the unlabeled corpus. We also use an accent classifier to learn accent-aware feature representations. Our method can further exploit samples from other accents to perform out-of-domain selection for low-resource accents that are not available in these corpora. We show that the targeted subset selection approach improves significantly upon random sampling, by around 5% to 10% (absolute) in most cases, and is around 10x more label-efficient. We also compare with an oracle method that picks specifically from the target accent; our method is comparable to the oracle in both its selections and WER performance.
    Comment: Under review (INTERSPEECH 2022)
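    The paper's specific submodular mutual information functions are not reproduced here, but the general idea can be sketched with one common instantiation: a facility-location-style mutual information between the selected subset and a small target (query) set, maximized greedily under a budget. The function names and the similarity matrix below are illustrative assumptions, not the paper's implementation:

    ```python
    import numpy as np

    def fl_mi(selected, targets, sim):
        # Facility-location-style mutual information: each target utterance
        # is credited with its best similarity to any selected utterance,
        # so the objective rewards subsets that cover the target accent.
        if not selected:
            return 0.0
        return sum(max(sim[q, a] for a in selected) for q in targets)

    def greedy_select(candidates, targets, sim, budget):
        # Standard greedy maximization: repeatedly add the candidate with
        # the largest marginal gain until the budget is exhausted.
        selected, remaining = [], list(candidates)
        while len(selected) < budget and remaining:
            best = max(remaining, key=lambda c: fl_mi(selected + [c], targets, sim))
            selected.append(best)
            remaining.remove(best)
        return selected
    ```

    In practice the similarity matrix would come from accent-aware feature representations (e.g., the accent classifier's embeddings), and the budget would be expressed in utterance duration rather than count.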

    The Tai languages of Assam--a grammar and texts


    Transfer learning of language-independent end-to-end ASR with language model fusion

    This work explores better adaptation methods for low-resource languages using an external language model (LM) under the framework of transfer learning. We first build a language-independent ASR system in a unified sequence-to-sequence (S2S) architecture with a shared vocabulary among all languages. During adaptation, we perform LM fusion transfer, where an external LM is integrated into the decoder network of the attention-based S2S model throughout the adaptation stage, to effectively incorporate linguistic context of the target language. We also investigate various seed models for transfer learning. Experimental evaluations on the IARPA BABEL data set show that, when external text data is available, LM fusion transfer improves performance on all five target languages compared with simple transfer learning. Our final system drastically reduces the performance gap from the hybrid systems.
    Comment: Accepted at ICASSP201
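    LM fusion comes in several variants; the simplest, shallow fusion, interpolates decoder and LM log-probabilities at each decoding step. A minimal sketch of that idea (the fusion weight `lam` and the dict-based interface are assumptions for illustration; the paper integrates the LM into the decoder network itself rather than only at scoring time):

    ```python
    def fuse_step(asr_logprobs, lm_logprobs, lam=0.3):
        # Shallow fusion: per-token score = decoder log-prob + lam * LM log-prob.
        # Tokens absent from the LM's distribution get a large penalty.
        return {tok: lp + lam * lm_logprobs.get(tok, -1e9)
                for tok, lp in asr_logprobs.items()}

    def best_token(asr_logprobs, lm_logprobs, lam=0.3):
        # Pick the highest-scoring token after fusion (greedy one-step decode).
        fused = fuse_step(asr_logprobs, lm_logprobs, lam)
        return max(fused, key=fused.get)
    ```

    With a weight of zero this reduces to plain S2S decoding; raising the weight lets target-language text data steer decoding toward linguistically plausible hypotheses.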

    Combining tandem and hybrid systems for improved speech recognition and keyword spotting on low resource languages

    Copyright © 2014 ISCA. In recent years there has been significant interest in Automatic Speech Recognition (ASR) and KeyWord Spotting (KWS) systems for low-resource languages. One of the driving forces for this research direction is the IARPA Babel project. This paper examines the performance gains that can be obtained by combining two forms of deep neural network ASR system, Tandem and Hybrid, for both ASR and KWS using data released under the Babel project. Baseline systems are described for the five option period 1 languages: Assamese, Bengali, Haitian Creole, Lao and Zulu. All the ASR systems share common attributes, for example deep neural network configurations, and decision trees based on rich phonetic questions and state-position root nodes. The baseline ASR and KWS performance of the Hybrid and Tandem systems is compared for both the "full" (approximately 80 hours of training data) and "limited" (approximately 10 hours of training data) language packs. By combining the two systems, consistent performance gains can be obtained for KWS in all configurations.
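    Score-level combination of two KWS systems' keyword hits can be sketched as a weighted sum of posteriors. This is a deliberate simplification (real KWS combination also aligns hit timestamps and renormalizes scores per keyword), and the interface below is hypothetical rather than taken from the paper:

    ```python
    def combine_hits(hits_tandem, hits_hybrid, w=0.5):
        # Weighted sum of per-keyword posterior scores from the two systems.
        # A keyword found by only one system keeps its (down-weighted) score,
        # which is one reason combination tends to improve KWS recall.
        combined = {kw: w * s for kw, s in hits_tandem.items()}
        for kw, s in hits_hybrid.items():
            combined[kw] = combined.get(kw, 0.0) + (1 - w) * s
        return combined
    ```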

    North East Indian Linguistics 8 (NEIL 8)

    This is the eighth volume of North East Indian Linguistics, a series of volumes publishing current research on the languages of North East India, the first volume of which was published in 2008. The papers in this volume were presented at the 9th conference of the North East Indian Linguistics Society (NEILS), held at Tezpur University in February 2016. The papers for this anniversary volume continue the NEILS tradition of research by both local and international scholars on a wide range of languages and topics. This eighth volume includes papers on small community languages and large regional languages from across North East India, and presents detailed phonological, semantic and morphosyntactic studies of structures that are characteristic of particular languages or language groups, alongside sociolinguistic studies that explore language attitudes in contexts of language shift.