Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages
We present Naamapadam, the largest publicly available Named Entity
Recognition (NER) dataset for the 11 major Indian languages from two language
families. The dataset contains more than 400k sentences annotated with a total
of at least 100k entities from three standard entity categories (Person,
Location, and Organization) for 9 out of the 11 languages. The training
dataset has been automatically created from the Samanantar parallel corpus by
projecting automatically tagged entities from an English sentence to the
corresponding Indian language translation. We also create manually annotated
test sets for 9 languages. We demonstrate the utility of the obtained dataset on
the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT
model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of
more than 80 for 7 out of 9 test languages. The dataset and models are
available under open-source licences at
https://ai4bharat.iitm.ac.in/naamapadam.
Comment: ACL 202
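The projection step described in the abstract can be illustrated with a minimal sketch. It assumes that word alignments between the English sentence and its translation are already available as (source index, target index) pairs and that tags use the BIO scheme. The abstract does not specify the actual projection procedure, so the function names `extract_spans` and `project_entities` and the overlap-skipping rule are hypothetical illustrations, not the authors' method.

```python
from typing import List, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for each BIO entity span."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def project_entities(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens to target tokens.

    Assumption: each source entity maps to the smallest contiguous target
    span that covers all of its aligned target tokens.
    """
    tgt_tags = ["O"] * tgt_len
    for start, end, label in extract_spans(src_tags):
        tgt_idx = sorted(t for s, t in alignment if start <= s < end)
        if not tgt_idx:
            continue  # entity has no aligned target token
        lo, hi = tgt_idx[0], tgt_idx[-1]
        if any(tag != "O" for tag in tgt_tags[lo:hi + 1]):
            continue  # skip projections that would overlap an earlier one
        tgt_tags[lo] = "B-" + label
        for j in range(lo + 1, hi + 1):
            tgt_tags[j] = "I-" + label
    return tgt_tags


# English: "Sachin Tendulkar lives in Mumbai"
src_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
# Hypothetical alignment to a 6-token translation with a different word order
alignment = [(0, 0), (1, 1), (4, 2), (3, 3), (2, 4)]
print(project_entities(src_tags, alignment, tgt_len=6))
```

In practice, alignments produced by an automatic aligner are noisy, so a real pipeline would likely need additional filtering beyond the simple overlap check used here.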
Building a Text Collection for Urdu Information Retrieval
Urdu is a widely spoken language in the Indian subcontinent with over 300 million
speakers worldwide. However, linguistic advancements in Urdu are rare compared to
those in other European and Asian languages. Therefore, following Text Retrieval
Conference standards, we constructed an extensive text collection of
85,304 documents from diverse categories, covering over 52 topics, with relevance
judgment sets at a pool depth of 100. We also present several applications to demonstrate
the effectiveness of our collection. Although this collection is primarily intended
for text retrieval, it can also be used for named entity recognition, text summarization,
and other linguistic applications with suitable modifications. Ours is the most
extensive existing collection for the Urdu language, and it will be freely available for
future research and academic education.
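For readers unfamiliar with pooled relevance judgments, the sketch below shows how a judgment pool is commonly formed for one topic: the top-ranked documents from each retrieval run, up to the pool depth, are merged and then sent to assessors. The abstract states only the pool depth of 100, so the run names, document identifiers, and the function `build_pool` are hypothetical.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Union of the top-`depth` documents from every run for one topic.

    `runs` maps a (hypothetical) system name to its ranked list of
    document IDs, best first.
    """
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Toy example with two hypothetical systems and a pool depth of 2
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d1"],
}
print(sorted(build_pool(runs, depth=2)))
```

Documents outside the pool are typically treated as non-relevant during evaluation, which is why a deeper pool gives more complete judgments at a higher assessment cost.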