Challenge dataset of cognates and false friend pairs from Indian languages
Cognates are present in multiple variants of the same text across different
languages (e.g., "hund" in German and "hound" in English both mean "dog").
They pose a challenge to various Natural Language Processing (NLP) applications
such as Machine Translation, Cross-lingual Sense Disambiguation, Computational
Phylogenetics, and Information Retrieval. A possible solution to address this
challenge is to identify cognates across language pairs. In this paper, we
describe the creation of two cognate datasets for twelve Indian languages,
namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu,
Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an
Indian language cognate dictionary and utilize linked Indian language Wordnets
to generate cognate sets. Additionally, we use the Wordnet data to create a
false friends dataset for eleven language pairs. We evaluate the efficacy
of our dataset using previously available baseline cognate detection
approaches. We also perform a manual evaluation with the help of lexicographers
and release the curated gold-standard dataset with this paper.
Comment: Published at LREC 202
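
The abstract does not say which baseline cognate detection approaches were used. As an illustration only, a common orthographic baseline scores a word pair by normalized edit distance and flags pairs above a threshold. The sketch below assumes both words are already in a common script (Indian-language words would first need transliteration), and the function names and the 0.6 threshold are hypothetical choices, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_cognate_candidate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag a pair as a cognate candidate (threshold is an assumed value)."""
    return normalized_similarity(a, b) >= threshold


# The abstract's own example pair: one insertion separates the two words.
print(levenshtein("hund", "hound"))
print(is_cognate_candidate("hund", "hound"))
```

A purely orthographic score like this cannot separate cognates from false friends, since both look alike on the surface; that distinction needs meaning information such as the Wordnet links the abstract mentions.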