Annotated Speech Corpus for Low Resource Indian Languages: Awadhi,
  Bhojpuri, Braj and Magahi

Bali, Kalika; Kumar, Ritesh; Lahiri, Bornini; Ojha, Atul Kr.; Raj, Mohit; Ratan, Shyam; Seshadri, Vivek; Singh, Siddharth; Sinha, Sonal

Authors: Kalika Bali, Ritesh Kumar, Bornini Lahiri, Atul Kr. Ojha, Mohit Raj, Shyam Ratan, Vivek Seshadri, Siddharth Singh, Sonal Sinha
Publication date: 26 June 2022

Abstract

In this paper we discuss ongoing work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (about 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments with automatic speech recognition systems for these languages.

Comment: Speech for Social Good Workshop, 2022, Interspeech 202

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2206.12931

Last updated on 28/09/2022