Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

Author: Bali Kalika
Kumar Ritesh
Lahiri Bornini
Ojha Atul Kr.
Raj Mohit
Ratan Shyam
Seshadri Vivek
Singh Siddharth
Sinha Sonal
Publication date: 26/06/2022

In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (approx. 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments for automatic speech recognition systems in these languages.

Comment: Speech for Social Good Workshop, 2022, Interspeech 202

arXiv.org e-Print Archive

Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign

Author: Ali Ahmed
Glass James
Grondelaers Stefan
Jain Mayank
Kumar Ritesh
Lahiri Bornini
Ljubešić Nikola
Malmasi Shervin
Nakov Preslav
Oostdijk Nelleke
Samardžić Tanja
Scherrer Yves
Shon Suwon
Speelman Dirk
Tiedemann Jörg
van den Bosch Antal
van der Lee Chris
Zampieri Marcos
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2018

We present the results and findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, co-located with COLING’2018. This year, the campaign included five shared tasks. Two were re-runs: Arabic Dialect Identification (ADI) and German Dialect Identification (GDI). Three were new: Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

Non peer reviewed

Radboud Repository

Helsingin yliopiston digitaalinen arkisto

Tilburg University Repository

Iterative Language Model Adaptation for Indo-Aryan Language Identification

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 01/08/2018

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Review on Optical Character Recognition of Devanagari Script Using Neural Network

Author: Ms. Smita Ashokrao Bhopi, Mr. Manu Pratap Singh
Publication venue: Auricle Global Society of Education and Research
Publication date: 31/03/2018

Over the last few decades, a lot of research has been done on character recognition for various scripts in various languages. In India, people commonly speak Hindi, the national language, which is spoken by more than 500 million people. Many languages in India, such as Hindi, Marathi and Sanskrit, use Devanagari as their base script. Compared with English characters, Devanagari characters are complicated to recognize. Devanagari is the basis for many Indian languages, including Hindi, Sanskrit, Marathi, Kashmiri, and so on. In this paper we present a review of research work that has been done on character recognition in Devanagari script in the past

International Journal on Future Revolution in Computer Science & Communication Engineering