3 research outputs found
Cross Script Hindi English NER Corpus from Wikipedia
The text generated on social media platforms is essentially mixed-lingual. Mixing
languages in any form introduces considerable difficulty for language processing
systems. Moreover, advances in language processing research depend on the
availability of standard corpora. The development of mixed-lingual Indian Named
Entity Recognition (NER) systems is hindered by the unavailability of standard
evaluation corpora. Such corpora may be mixed-lingual in nature, with text
written in multiple languages but predominantly in a single script. The
motivation of our work is to emphasize the automatic generation of such corpora
in order to encourage mixed-lingual Indian NER. The paper presents the
preparation of a cross-script Hindi-English corpus from Wikipedia category
pages. The corpus is annotated using the standard CoNLL-2003 categories PER,
LOC, ORG, and MISC. It is evaluated with a variety of machine learning
algorithms, and favorable results are achieved.
Comment: International Conference on Intelligent Data Communication
Technologies and Internet of Things (ICICI-2018)
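The CoNLL-2003 annotation scheme mentioned above can be sketched in a few lines; the code-mixed tokens and helper names below are illustrative assumptions, not taken from the corpus itself.

```python
# Minimal sketch of parsing CoNLL-2003-style "token tag" annotations
# (PER, LOC, ORG, MISC in a BIO scheme); the sample data is hypothetical.
def parse_conll(lines):
    """Parse 'token tag' lines into sentences of (token, tag) pairs;
    blank lines separate sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit(" ", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

def entity_spans(sentence):
    """Collect contiguous non-O labels into (type, [tokens]) entities."""
    spans, tag, toks = [], None, []
    for token, label in sentence:
        if label.startswith("B-") or (label.startswith("I-") and label[2:] != tag):
            if toks:
                spans.append((tag, toks))
            tag, toks = label[2:], [token]
        elif label.startswith("I-"):
            toks.append(token)
        else:  # "O" ends any open entity
            if toks:
                spans.append((tag, toks))
            tag, toks = None, []
    if toks:
        spans.append((tag, toks))
    return spans
```

For a code-mixed sentence such as "Sachin Tendulkar ne Dilli mein khela", the tags B-PER/I-PER and B-LOC would yield the entities ("PER", ["Sachin", "Tendulkar"]) and ("LOC", ["Dilli"]).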
Context based Analysis of Lexical Semantics for Hindi Language
A word having multiple senses in a text gives rise to the lexical semantic task
of determining which particular sense is appropriate in the given context. One
such task is word sense disambiguation (WSD), which refers to the identification
of the most appropriate meaning of a polysemous word in a given context using
computational algorithms. Language processing research in Hindi, the official
language of India, and in other Indian languages is restricted by the
unavailability of standard corpora; for Hindi word sense disambiguation in
particular, no large corpus is available. In this work, we prepared text
containing new senses of certain words, enriching the sense-tagged Hindi corpus
of sixty polysemous words. Furthermore, we analyzed two novel lexical
associations for Hindi word sense disambiguation based on the contextual
features of the polysemous word. These methods are evaluated over learning
algorithms, and favorable results are achieved.
Comment: Accepted in NGCT-201
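The general idea of choosing a sense from the polysemous word's surrounding context can be illustrated with a simplified Lesk-style overlap; the sense inventory below is a hypothetical English stand-in, not the paper's sixty-word Hindi corpus or its two lexical-association methods.

```python
# Sketch of context-based sense selection: pick the sense whose gloss
# shares the most words with the ambiguous word's context.
# The sense glosses here are hypothetical illustrations.
def disambiguate(context_words, sense_glosses):
    """Return the key in sense_glosses whose gloss overlaps the context most."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(g.lower() for g in gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "finance": "money deposit account institution",
    "river": "river water edge land",
}
# Context mentioning "money" and "account" selects the finance sense.
sense = disambiguate("he deposited money in the account".split(), senses)
```

Supervised approaches replace the gloss overlap with classifiers trained on such contextual features, which is the setting evaluated in the paper.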
Feature Selection on Noisy Twitter Short Text Messages for Language Identification
The task of written language identification typically involves detecting the
languages present in a sample of text. A sequence of text, moreover, may not
belong to a single language but may instead be a mixture of text written in
multiple languages. Such text is generated in large volumes on social media
platforms owing to their flexible and user-friendly environment. It contains a
very large number of features, which are essential for the development of
statistical, probabilistic, and other kinds of language models. This large
feature set includes rich as well as irrelevant and redundant features, which
have a diverse effect on the performance of the learning model. Feature
selection methods are therefore significant in choosing the features that are
most relevant for an efficient model. In this article, we consider the
Hindi-English language identification task, as Hindi and English are two of the
most widely spoken languages of India. We apply different feature selection
algorithms across various learning algorithms in order to analyze the effect of
the algorithm, as well as the number of features, on the performance of the
task. The methodology focuses on word-level language identification using a
novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles
are examined with different feature selection algorithms over many classifiers.
Finally, an exhaustive comparative analysis is put forward with respect to the
overall experiments conducted for the task.
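The word-level pipeline described above can be sketched as follows: extract character n-gram features per word, then keep only the most discriminative ones. The frequency-difference score below is a deliberately simple stand-in for the chi-square or information-gain selectors typically used, and the labeled words are hypothetical examples, not the 6903-tweet dataset.

```python
from collections import Counter

# Sketch of n-gram feature extraction and selection for word-level
# Hindi-English language identification (hypothetical data).
def char_ngrams(word, n=2):
    """Character n-grams with boundary markers, e.g. 'kha' -> <k, kh, ha, a>."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def select_features(labeled_words, k=10, n=2):
    """Keep the k n-grams whose counts differ most between the two languages."""
    counts = {"hi": Counter(), "en": Counter()}
    for word, lang in labeled_words:
        counts[lang].update(char_ngrams(word, n))
    score = {g: abs(counts["hi"][g] - counts["en"][g])
             for g in set(counts["hi"]) | set(counts["en"])}
    return [g for g, _ in sorted(score.items(), key=lambda x: -x[1])[:k]]
```

Varying k here corresponds to the article's analysis of how the number of selected features affects classifier performance.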