Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

Author: Doddapaneni Sumanth
Kedia Harshit
Khapra Mitesh M.
Kumar Pratyush
Kunchukuttan Anoop
Mhaske Arnav
Murthy V Rudra
Publication date: 28/05/2023

We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language translation. We also create manually annotated test sets for 9 languages. We demonstrate the utility of the obtained dataset on the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of more than 80 for 7 out of 9 test languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam. Comment: ACL 202
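
The projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a token-level word alignment between the English sentence and its translation is already available; the abstract does not say how the alignment is obtained, so the `alignment` mapping, the function name `project_tags`, and the example tokens are all hypothetical and not taken from the paper.

```python
from typing import Dict, List


def project_tags(src_tags: List[str], alignment: Dict[int, int], tgt_len: int) -> List[str]:
    """Copy entity types from tagged source tokens to aligned target tokens.

    src_tags  : BIO tags for the source (English) tokens.
    alignment : assumed mapping source token index -> target token index.
    tgt_len   : number of tokens in the target sentence.
    """
    # Step 1: transfer the bare entity type (PER, LOC, ORG) along each alignment link.
    types = ["O"] * tgt_len
    for src_idx, tgt_idx in alignment.items():
        tag = src_tags[src_idx]
        if tag != "O":
            types[tgt_idx] = tag.split("-", 1)[1]

    # Step 2: rebuild BIO prefixes in target word order.
    # Simplification: adjacent target tokens of the same type are merged into one span.
    projected: List[str] = []
    previous = "O"
    for entity_type in types:
        if entity_type == "O":
            projected.append("O")
        elif entity_type == previous:
            projected.append("I-" + entity_type)
        else:
            projected.append("B-" + entity_type)
        previous = entity_type
    return projected


# Illustrative example (transliterated target tokens, hand-written alignment).
english = ["Sachin", "Tendulkar", "lives", "in", "Mumbai"]
english_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
target = ["Sachin", "Tendulkar", "Mumbai", "mein", "rehte", "hain"]
alignment = {0: 0, 1: 1, 2: 4, 3: 3, 4: 2}

target_tags = project_tags(english_tags, alignment, len(target))
print(list(zip(target, target_tags)))
```

A real pipeline would also need to handle unaligned tokens, many-to-one links, and noisy tagger output, none of which this sketch attempts.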

arXiv.org e-Print Archive

Building a Text Collection for Urdu Information Retrieval

Author: Banka Haider
Khan Hamaid M.
Rasheed Imran
Publication venue: 'Wiley'
Publication date: 01/01/2021

Urdu is a widely spoken language in the Indian subcontinent with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared to those in other European and Asian languages. Therefore, by following Text Retrieval Conference standards, we attempted to construct an extensive text collection of 85,304 documents from diverse categories covering over 52 topics, with relevance judgment sets at a pool depth of 100. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
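
For readers unfamiliar with pooling, the following is a minimal sketch of how a judgment pool at a given depth is typically formed in TREC-style evaluation: the top-ranked documents from each participating system are merged, and only the merged set is sent to assessors. The system names, document IDs, and the function `build_pool` are hypothetical; the abstract does not describe which retrieval systems contributed to the pool.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Return the union of the top-`depth` documents from each ranked run for one topic."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Hypothetical ranked lists from three retrieval systems for a single topic.
runs = {
    "system_a": ["d3", "d1", "d7", "d2"],
    "system_b": ["d1", "d4", "d3", "d9"],
    "system_c": ["d7", "d8", "d1", "d5"],
}

# A depth of 2 keeps the example small; the collection described above uses 100.
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Documents outside the pool are conventionally treated as non-relevant, which is why a deeper pool gives more complete relevance judgments at a higher assessment cost.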

DSpace@FSM Vakif University