Part of Speech Tagging of Marathi Text Using Trigram Method
In this paper we present a part-of-speech (POS) tagger for Marathi, a morphologically rich language spoken by the native people of Maharashtra. The tagger is developed with a statistical approach based on the trigram method. The main idea of the trigram method is to find the most likely POS tag for a token given the previous two tags, computing probabilities to determine the best tag sequence. We describe the development of the tagger and also report the evaluation done
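The trigram idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the tag set, the add-one smoothing, and the greedy left-to-right decoding are all assumptions made for illustration (a full tagger would typically search over whole tag sequences, for example with Viterbi decoding).

```python
from collections import Counter

# Hypothetical toy corpus of tagged sentences; NOT data from the paper.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
START = "<s>"  # padding tag for the two positions before a sentence


def train(sentences):
    """Count tag trigrams, their two-tag contexts, and word emissions."""
    trigrams, contexts = Counter(), Counter()
    emissions, tag_counts = Counter(), Counter()
    for sent in sentences:
        prev2, prev1 = START, START
        for word, tag in sent:
            trigrams[(prev2, prev1, tag)] += 1
            contexts[(prev2, prev1)] += 1
            emissions[(tag, word)] += 1
            tag_counts[tag] += 1
            prev2, prev1 = prev1, tag
    vocab = {word for sent in sentences for word, _ in sent}
    return trigrams, contexts, emissions, tag_counts, vocab


def tag_sentence(words, model):
    """Greedy decoding: pick the best tag for each token given the two
    previously chosen tags (a simplification of full sequence search)."""
    trigrams, contexts, emissions, tag_counts, vocab = model
    tags = sorted(tag_counts)
    prev2, prev1 = START, START
    output = []
    for word in words:
        def score(tag):
            # Add-one smoothed P(tag | prev2, prev1).
            p_trans = (trigrams[(prev2, prev1, tag)] + 1) / (
                contexts[(prev2, prev1)] + len(tags))
            # Add-one smoothed P(word | tag); the extra 1 covers unseen words.
            p_emit = (emissions[(tag, word)] + 1) / (
                tag_counts[tag] + len(vocab) + 1)
            return p_trans * p_emit

        best = max(tags, key=score)
        output.append(best)
        prev2, prev1 = prev1, best
    return output


model = train(TRAIN)
print(tag_sentence(["a", "cat", "barks"], model))
```

On this toy corpus the sketch tags an unseen combination of known words by relying on the learned DET, NOUN, VERB tag pattern. Greedy decoding keeps the example short but can commit to an early mistake.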
IndicXTREME: A Multi-Task Benchmark For Evaluating Indic Languages
In this work, we introduce IndicXTREME, a benchmark consisting of nine
diverse tasks covering 18 languages from the Indic sub-continent belonging to
four different families. Across languages and tasks, IndicXTREME contains a
total of 103 evaluation sets, of which 51 are new contributions to the
literature. To maintain high quality, we only use human annotators to curate or
translate our datasets. To the best of our knowledge, this is the first effort
toward creating a standard benchmark for Indic languages that aims to test the
zero-shot capabilities of pretrained language models. We also release IndicCorp
v2, an updated and much larger version of IndicCorp that contains 20.9 billion
tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate
it on IndicXTREME to show that it outperforms existing multilingual language
models such as XLM-R and MuRIL.
Bridging Language Gaps in Health Information Access: Konkani-English CLIR System for Medical Knowledge
This paper addresses the challenges that linguistic
diversity poses for access to medical information by
introducing a Cross-Language Information Retrieval
System attuned to the needs of Konkani language
information seekers. The proposed system takes
Konkani queries entered by the user, translates them to
English, and retrieves the documents using a thesaurus-
based approach. Various strategies have also been
considered to address the challenges posed by the source
language, Konkani, which is a minority language spoken
in the Indian subcontinent. The proposed approach
showcases the potential of combining language
technology, information retrieval, and medical domain
expertise to bridge linguistic barriers. As healthcare
information remains a critical societal need, this work
holds promise in facilitating equitable access to medical
knowledge.
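The pipeline outlined in this abstract (translate the query to English, then retrieve with a thesaurus) might look roughly like the sketch below. Everything in it is hypothetical: the dictionary, the thesaurus entries, the documents, and the overlap-count ranking are illustrative assumptions, and the `src_…` tokens are placeholders, not real Konkani words.

```python
# Placeholder source-language tokens; these are NOT real Konkani words.
BILINGUAL_DICT = {"src_fever": "fever", "src_medicine": "medicine"}

# Hypothetical English thesaurus used for query expansion.
THESAURUS = {
    "fever": {"pyrexia", "temperature"},
    "medicine": {"drug", "medication"},
}

# Hypothetical English document collection.
DOCS = {
    "d1": "pyrexia is often treated with medication and rest",
    "d2": "a balanced diet supports recovery",
    "d3": "fever can be a sign of infection",
}


def translate(query_tokens):
    """Dictionary lookup; tokens without an entry are dropped."""
    return [BILINGUAL_DICT[t] for t in query_tokens if t in BILINGUAL_DICT]


def expand(terms):
    """Add thesaurus synonyms to the translated query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= THESAURUS.get(term, set())
    return expanded


def retrieve(query_tokens):
    """Rank documents by how many expanded query terms they contain."""
    terms = expand(translate(query_tokens))
    scores = {}
    for doc_id, text in DOCS.items():
        overlap = terms & set(text.lower().split())
        if overlap:
            scores[doc_id] = len(overlap)
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(retrieve(["src_fever", "src_medicine"]))
```

In this toy setup, document `d1` is found only through the synonyms, which is the point of thesaurus-based expansion: it can match documents that never use the literal translated term.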
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
We present Samanantar, the largest publicly available parallel corpora
collection for Indic languages. The collection contains a total of 49.7 million
sentence pairs between English and 11 Indic languages (from two language
families). Specifically, we compile 12.4 million sentence pairs from existing,
publicly-available parallel corpora, and additionally mine 37.4 million
sentence pairs from the web, resulting in a 4x increase. We mine the parallel
sentences from the web by combining many corpora, tools, and methods: (a)
web-crawled monolingual corpora, (b) document OCR for extracting sentences from
scanned documents, (c) multilingual representation models for aligning
sentences, and (d) approximate nearest neighbor search for searching in a large
collection of sentences. Human evaluation of samples from the newly mined
corpora validates the high quality of the parallel sentences across 11
languages. Further, we extract 83.4 million sentence pairs between all 55 Indic
language pairs from the English-centric parallel corpus using English as the
pivot language. We trained multilingual NMT models spanning all these languages
on Samanantar, which outperform existing models and baselines on publicly
available benchmarks, such as FLORES, establishing the utility of Samanantar.
Our data and models are available publicly at
https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance
research in NMT and multilingual NLP for Indic languages.
Comment: Accepted to the Transactions of the Association for Computational
Linguistics (TACL)
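The pivoting step mentioned in this abstract, extracting Indic-to-Indic pairs from an English-centric corpus, can be sketched as a join on the shared English sentence. This is an illustrative assumption about the general technique, not the authors' pipeline; the corpus contents and the `pivot_pairs` helper are hypothetical, and the translations are placeholder strings.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy English-centric corpus:
# language code -> list of (english_sentence, translation) pairs.
EN_CENTRIC = {
    "hi": [("hello", "HI_hello"), ("thank you", "HI_thanks")],
    "ta": [("hello", "TA_hello"), ("good night", "TA_night")],
    "bn": [("thank you", "BN_thanks"), ("hello", "BN_hello")],
}


def pivot_pairs(en_centric):
    """Pair translations from two languages that share an English sentence."""
    index = {}
    for lang, pairs in en_centric.items():
        by_english = defaultdict(list)
        for english, translation in pairs:
            by_english[english].append(translation)
        index[lang] = by_english

    mined = {}
    for a, b in combinations(sorted(index), 2):
        shared = sorted(index[a].keys() & index[b].keys())
        mined[(a, b)] = [
            (x, y)
            for english in shared
            for x in index[a][english]
            for y in index[b][english]
        ]
    return mined


for lang_pair, sentence_pairs in pivot_pairs(EN_CENTRIC).items():
    print(lang_pair, sentence_pairs)

# Number of unordered language pairs among 11 languages.
print(len(list(combinations(range(11), 2))))
```

The final line counts the unordered pairs among 11 languages, which gives the 55 language pairs quoted in the abstract.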
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text
This paper describes the development of a multilingual, manually annotated
dataset for three under-resourced Dravidian languages generated from social
media comments. The dataset was annotated for sentiment analysis and offensive
language identification for a total of more than 60,000 YouTube comments. The
dataset consists of around 44,000 comments in Tamil-English, around 7,000
comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and has high
inter-annotator agreement, as measured by Krippendorff's alpha. The dataset contains all
types of code-mixing phenomena since it comprises user-generated content from a
multilingual country. We also present baseline experiments to establish
benchmarks on the dataset using machine learning methods. The dataset is
available on GitHub
(https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo
(https://zenodo.org/record/4750858#.YJtw0SYo_0M).
Comment: 36 pages