Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi
In this paper we discuss in-progress work on the development of a speech
corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and
Magahi using the field methods of linguistic data collection. The total size of
the corpus currently stands at approximately 18 hours (approx. 4-5 hours per
language) and it is transcribed and annotated with grammatical information such
as part-of-speech tags, morphological features and Universal dependency
relationships. We discuss our methodology for data collection in these
languages, most of which was done in the middle of the COVID-19 pandemic, with
one of the aims being to generate some additional income for low-income groups
speaking these languages. In the paper, we also discuss the results of the
baseline experiments for automatic speech recognition systems in these
languages. Comment: Speech for Social Good Workshop, 2022, Interspeech 202
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign
We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks: two task re-runs, Arabic Dialect Identification (ADI) and German Dialect Identification (GDI); and three new tasks, Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report. Non peer reviewe
Review on Optical Character Recognition of Devanagari Script Using Neural Network
Over the last few decades, a great deal of research has been done in the field of character recognition for various scripts in various languages. In India, people commonly speak the national language, Hindi, which is spoken by more than 500 million people. Many languages in India, such as Hindi, Marathi and Sanskrit, use Devanagari as their base script. Compared to English characters, Indian script (Devanagari) characters are complicated to recognize. Devanagari script is the basis for many Indian languages, including Hindi, Sanskrit, Marathi, Kashmiri, and so on. In this paper we present a review of research work that has been done in the field of character recognition in Devanagari script in past