82 research outputs found

    Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

    In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (approx. 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments for automatic speech recognition systems in these languages. Comment: Speech for Social Good Workshop, 2022, Interspeech 202

    Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign

    We present the results and findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, co-located with COLING 2018. This year, the campaign included five shared tasks: two task re-runs, Arabic Dialect Identification (ADI) and German Dialect Identification (GDI), and three new tasks, Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report. Non peer reviewe

    Review on Optical Character Recognition of Devanagari Script Using Neural Network

    Over the last few decades, a great deal of research has been done on character recognition for various scripts and languages. In India, Hindi is spoken by more than 500 million people. Many Indian languages, such as Hindi, Marathi and Sanskrit, use Devanagari as their base script. Compared with English characters, Devanagari characters are more difficult to recognize. Devanagari is the basis for the writing of many Indian languages, including Hindi, Sanskrit, Marathi, Kashmiri, and so on. In this paper we present a review of research work that has been done in the field of character recognition in Devanagari script in past