Basic Linguistic Resources and Baselines for Bhojpuri, Magahi and Maithili for Natural Language Processing
Corpus preparation for low-resource languages and for development of human
language technology to analyze or computationally process them is a laborious
task, primarily due to the unavailability of expert linguists who are native
speakers of these languages and also due to the time and resources required.
Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in
the north-eastern parts), are low-resource languages belonging to the
Indo-Aryan (or Indic) family. They are closely related to Hindi, a relatively
high-resource language, which is why we make our comparisons with Hindi. We
collected corpora for these three languages from various sources and
cleaned them to the extent possible, without changing the data in them. The
text belongs to different domains and genres. We calculated some basic
statistical measures for these corpora at character, word, syllable, and
morpheme levels. These corpora were also annotated with parts-of-speech (POS)
and chunk tags. The basic statistical measures were both absolute and relative
and were meant to give an indication of linguistic properties such as
morphological, lexical, phonological, and syntactic complexities (or richness).
The results were compared with a standard Hindi corpus. For most of the
measures, we tried to keep the size of the corpus the same across the languages
so as to avoid the effect of corpus size, but in some cases it turned out that
using the full corpus was better, even if sizes were very different. Although
the results are not very clear, we try to draw some conclusions about the
languages and the corpora. For POS tagging and chunking, the BIS tagset was
used to manually annotate the data. The sizes of the POS-tagged data are
16,067, 14,669, and 12,310 sentences for Bhojpuri, Magahi, and Maithili,
respectively. The sizes for chunking are 9,695 and 1,954 sentences for Bhojpuri
and Maithili, respectively.
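One of the relative measures of the kind described above, the type-token ratio, is a common proxy for lexical richness. A minimal sketch follows; the sample text and the particular measures chosen (token count, type count, type-token ratio, mean word length) are illustrative assumptions, not the paper's exact metric set:

```python
from collections import Counter

def basic_stats(text):
    """Compute simple corpus-level measures of the kind compared
    across languages: token count, type count, type-token ratio,
    and mean word length."""
    words = text.split()
    types = Counter(words)
    return {
        "tokens": len(words),
        "types": len(types),
        "type_token_ratio": len(types) / len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }

# Toy example; real comparisons would use equal-sized corpus samples
# to avoid the corpus-size effect noted above.
stats = basic_stats("the cat sat on the mat the cat slept")
print(stats["tokens"], stats["types"])  # → 9 6
```

Comparing such figures across Bhojpuri, Magahi, Maithili, and Hindi on equally sized samples is what motivates holding corpus size constant where possible.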
Enrichment of OntoSenseNet: Adding a Sense-annotated Telugu lexicon
The paper describes the enrichment of OntoSenseNet - a verb-centric lexical
resource for Indian Languages. This resource contains a newly developed
Telugu-Telugu dictionary. This is important because native speakers can better
annotate senses when both the word and its meaning are in Telugu; hence,
efforts were made to develop a soft copy of the Telugu dictionary. Our resource
also has a manually annotated gold-standard corpus consisting of 8,483 verbs,
253 adverbs, and 1,673 adjectives. Annotations were done by native speakers
according to
defined annotation guidelines. In this paper, we provide an overview of the
annotation procedure and present the validation of our resource through
inter-annotator agreement. Concepts of sense-class and sense-type are
discussed. Additionally, we discuss the potential of lexical sense-annotated
corpora in improving word sense disambiguation (WSD) tasks. Telugu WordNet is
crowd-sourced for annotation of individual words in synsets and is compared
with the developed sense-annotated lexicon (OntoSenseNet) to examine the
improvement. Also, we present a special categorization (spatio-temporal
classification) of adjectives.
Comment: Accepted Long Paper at the 19th International Conference on
Computational Linguistics and Intelligent Text Processing, March 2018, Hanoi,
Vietnam
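Inter-annotator agreement for a resource like this is typically reported as a chance-corrected coefficient. The abstract does not name the coefficient, so as an assumption the sketch below uses Cohen's kappa, with invented sense-type labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences:
    observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n)                  # chance agreement
              for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Toy sense-type labels from two hypothetical annotators.
ann1 = ["Know", "Move", "Move", "Do", "Know", "Do"]
ann2 = ["Know", "Move", "Do",   "Do", "Know", "Move"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.5
```

Values near 1 indicate agreement well above chance; values near 0 indicate agreement no better than chance.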
Prepositional Attachment Disambiguation Using Bilingual Parsing and Alignments
In this paper, we attempt to solve the problem of Prepositional Phrase (PP)
attachments in English. The motivation for the work comes from NLP applications
like Machine Translation, for which, getting the correct attachment of
prepositions is very crucial. The idea is to correct the PP-attachments for a
sentence with the help of alignments from parallel data in another language.
The novelty of our work lies in the formulation of the problem into a dual
decomposition based algorithm that enforces agreement between the parse trees
from two languages as a constraint. Experiments were performed on the
English-Hindi language pair and the performance improved by 10% over the
baseline, where the baseline is the attachment predicted by the MSTParser model
trained for English.
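The dual-decomposition formulation described above can be sketched on a toy agreement problem: two models score the same candidate attachment sites, and subgradient updates on Lagrange multipliers push their argmax choices toward agreement. The scores and step-size schedule below are invented for illustration, not the paper's actual parsers:

```python
def dual_decompose(scores_a, scores_b, iters=100, step=0.5):
    """Toy dual decomposition: each sub-problem is solved independently
    with a Lagrangian penalty u, and u is updated by a decaying
    subgradient step until the two choices agree."""
    n = len(scores_a)
    u = [0.0] * n
    y_b = 0
    for t in range(iters):
        y_a = max(range(n), key=lambda i: scores_a[i] + u[i])
        y_b = max(range(n), key=lambda i: scores_b[i] - u[i])
        if y_a == y_b:            # agreement: certificate of optimality
            return y_a
        rate = step / (t + 1)     # decaying subgradient step size
        u[y_a] -= rate
        u[y_b] += rate
    return y_b                    # fall back to one model if no agreement

# Candidate attachment heads for a PP, scored by a (hypothetical)
# English parser and by scores projected through word alignments.
english = [0.9, 0.4, 0.1]    # prefers candidate 0
projected = [0.2, 0.8, 0.1]  # prefers candidate 1, more decisively
print(dual_decompose(english, projected))  # → 1
```

In the full method the sub-problems are entire parse trees in the two languages rather than single attachment decisions, but the agreement constraint is enforced the same way.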
Context based Analysis of Lexical Semantics for Hindi Language
A word having multiple senses in a text raises the lexical semantic task of
finding out which particular sense is appropriate in the given context. One
such task is word sense disambiguation (WSD), which refers to the
identification of the most appropriate meaning of a polysemous word in a given
context using computational algorithms. Language processing research in Hindi,
the official language of India, and in other Indian languages is restricted by
the unavailability of standard corpora; no large corpus is available for Hindi
WSD either. In this work, we prepared text containing new senses of certain
words, enriching the sense-tagged Hindi corpus of sixty polysemous words.
Furthermore, we analyzed two novel lexical associations for Hindi WSD based on
the contextual features of the polysemous word. These methods were evaluated
with learning algorithms, and favorable results were achieved.
Comment: Accepted in NGCT-201
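The two lexical associations are not spelled out in the abstract, so a generic contextual-overlap (Lesk-style) association is used here to illustrate the general idea of scoring senses by contextual features; the English sense signatures and example are hypothetical stand-ins for the Hindi data:

```python
def disambiguate(context, sense_signatures):
    """Pick the sense of a polysemous word whose associated words
    overlap most with the observed context (a simplified Lesk-style
    association; not the paper's exact measures)."""
    ctx = set(context)
    return max(sense_signatures,
               key=lambda s: len(ctx & sense_signatures[s]))

# Hypothetical signatures for two senses of "bank".
senses = {
    "river_bank": {"water", "river", "shore", "flood"},
    "money_bank": {"loan", "account", "deposit", "cash"},
}
context = ["she", "opened", "an", "account", "to", "deposit", "cash"]
print(disambiguate(context, senses))  # → money_bank
```

Supervised variants replace the hand-built signatures with features learned from the sense-tagged corpus.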
Coping with Construals in Broad-Coverage Semantic Annotation of Adpositions
We consider the semantics of prepositions, revisiting a broad-coverage
annotation scheme used for annotating all 4,250 preposition tokens in a
55,000-word corpus of English. Attempts to apply the scheme to adpositions and
case
markers in other languages, as well as some problematic cases in English, have
led us to reconsider the assumption that a preposition's lexical contribution
is equivalent to the role/relation that it mediates. Our proposal is to embrace
the potential for construal in adposition use, expressing such phenomena
directly at the token level to manage complexity and avoid sense proliferation.
We suggest a framework to represent both the scene role and the adposition's
lexical function so they can be annotated at scale---supporting automatic,
statistical processing of domain-general language---and sketch how this
representation would inform a constructional analysis.
Comment: Presentation at Construction Grammar and NLU AAAI Spring Symposium,
Stanford, March 27-29, 2017; 9 pages including references; 1 figure
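The proposed two-part representation can be sketched as a token-level record pairing the scene role with the adposition's own lexical function, so a construal is simply a token where the two differ. The labels below are illustrative, not the scheme's full inventory:

```python
from dataclasses import dataclass

@dataclass
class AdpositionAnnotation:
    """Token-level annotation pairing the scene role with the
    adposition's lexical function, so construal (role != function)
    is captured without multiplying dictionary senses."""
    token: str
    scene_role: str
    function: str

    @property
    def is_construal(self):
        # A construal arises when the adposition's own contribution
        # differs from the role it mediates in the scene.
        return self.scene_role != self.function

# "I care about you": the scene participant is a Stimulus, while
# "about" itself contributes a Topic-like function.
ann = AdpositionAnnotation("about", scene_role="Stimulus", function="Topic")
print(ann.is_construal)  # → True
```

Annotating both slots at the token level keeps the label set small while still recording how the adposition construes the scene.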
LERIL : Collaborative Effort for Creating Lexical Resources
The paper reports on efforts taken to create lexical resources pertaining to
Indian languages, using the collaborative model. The lexical resources being
developed are: (1) transfer lexicon and grammar from English to several Indian
languages; (2) a dependency treebank of annotated corpora for several Indian
languages, with the dependency trees based on the Paninian model; and (3) a
bilingual dictionary of 'core meanings'.
Comment: [To appear in Proceedings of Workshop on Language Resources in Asia,
along with NLPRS-2001, Tokyo, 27-30 November 2001]
Towards Automation of Sense-type Identification of Verbs in OntoSenseNet(Telugu)
In this paper, we discuss the enrichment of a manually developed resource of
Telugu lexicon, OntoSenseNet. OntoSenseNet is an ontological sense-annotated
lexicon that marks each verb of Telugu with a primary and a secondary sense.
The area of research is relatively recent but has large scope for development.
We provide introductory work to enrich OntoSenseNet and to promote further
research in Telugu. Classifiers are adopted to learn the sense-relevant
features of the words in the resource and to automate the tagging of
sense-types for verbs. We perform a comparative analysis of different
classifiers applied to OntoSenseNet. The results of the experiment show that
automated enrichment of the resource is effective using SVM classifiers and an
AdaBoost ensemble.
Comment: Accepted Long Oral Paper at the 6th International Workshop on Natural
Language Processing for Social Media (SocialNLP) at the 56th Annual Meeting of
the Association for Computational Linguistics, ACL
Word sense disambiguation: a survey
In this paper, we present a survey of Word Sense Disambiguation (WSD).
Research in WSD has been conducted to varying extents in nearly all major
languages around the world. We survey the different approaches adopted in
different research works, the state of the art in performance in this domain,
recent works in different Indian languages, and finally work in the Bengali
language. We also survey the different competitions in this field and the
benchmark results obtained from those competitions.
Comment: International Journal of Control Theory and Computer Modeling
(IJCTCM) Vol.5, No.3, July 201
An Empirical Evaluation of Text Representation Schemes on Multilingual Social Web to Filter the Textual Aggression
This paper attempts to study the effectiveness of text representation schemes
on two tasks, namely user aggression detection and fact detection from social
media contents. In user aggression detection, the aim is to identify the level
of aggression in content generated on social media and written in English,
Devanagari Hindi, and Romanized Hindi. Aggression levels are categorized into
three predefined classes, namely `Non-aggressive`, `Overtly Aggressive`, and
`Covertly Aggressive`. During disaster-related incidents, social media
platforms such as Twitter are flooded with millions of posts. In such
emergency situations, identification of factual posts is important for
organizations involved in relief operations. We formulate this problem as a
combination of classification and ranking. This paper presents a comparison of
various text representation schemes based on BoW techniques, distributed
word/sentence representations, and transfer learning, across classifiers.
Weighted F1-score is used as the primary evaluation metric. Results show that
text representation using BoW performs better than word embeddings on machine
learning classifiers, while pre-trained word embedding techniques perform
better with classifiers based on deep neural networks. Recent transfer
learning models such as ELMo and ULMFiT are fine-tuned for the aggression
classification task; however, the results are not at par with pre-trained word
embedding models. Overall, word embeddings using fastText produce a better
weighted F1-score than Word2Vec and GloVe. Results are further improved using
pre-trained vector models. Statistical significance tests are employed to
ensure the significance of the classification results. In the case of a test
dataset lexically different from the training dataset, deep neural models are
more robust and perform substantially better than machine learning
classifiers.
Comment: 21 Pages, 2 Figures
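The weighted F1-score used as the primary metric averages per-class F1 with weights proportional to each class's support. A from-scratch sketch follows; the three aggression labels come from the abstract, while the toy gold/predicted sequences are invented:

```python
from collections import Counter

def weighted_f1(gold, pred):
    """Weighted F1: per-class F1 averaged with weights proportional
    to each class's support in the gold labels."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, count in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (count / total) * f1
    return score

# NAG/OAG/CAG = Non-aggressive / Overtly / Covertly Aggressive.
gold = ["NAG", "NAG", "OAG", "CAG", "OAG"]
pred = ["NAG", "OAG", "OAG", "CAG", "CAG"]
print(round(weighted_f1(gold, pred), 3))  # → 0.6
```

Weighting by support makes the metric robust to the class imbalance typical of aggression datasets, where non-aggressive posts dominate.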
BCSAT : A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations
The presented work aims at generating a systematically annotated corpus that
can support the enhancement of sentiment analysis tasks in Telugu using
word-level sentiment annotations. From OntoSenseNet, we extracted 11,000
adjectives, 253 adverbs, and 8,483 verbs, and sentiment annotation was done by
language experts. We discuss the methodology followed for the polarity
annotations and validate the developed resource. This work aims at developing
a benchmark corpus, as an extension to SentiWordNet, and a baseline accuracy
for a model where lexeme annotations are applied for sentiment prediction. The
fundamental aim of this paper is to validate and study the possibility of
utilizing machine learning algorithms and word-level sentiment annotations in
the task of automated sentiment identification. Furthermore, accuracy is
improved by annotating the bi-grams extracted from the target corpus.
Comment: Accepted as Long Paper at the Student Research Workshop at the 56th
Annual Meeting of the Association for Computational Linguistics, ACL-2018
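A lexeme-annotation model of the kind described can be sketched as summing word-level polarities over a sentence, with bi-gram entries (when available) overriding their unigram parts, mirroring the bi-gram-based improvement mentioned above. The English entries are hypothetical stand-ins for the Telugu lexicon:

```python
def sentence_polarity(tokens, lexicon, bigram_lexicon=None):
    """Score a sentence by summing word-level polarity annotations;
    annotated bi-grams take precedence over their unigram parts."""
    bigram_lexicon = bigram_lexicon or {}
    score, i = 0, 0
    while i < len(tokens):
        bigram = tuple(tokens[i:i + 2])
        if bigram in bigram_lexicon:
            score += bigram_lexicon[bigram]
            i += 2                       # consume both tokens
        else:
            score += lexicon.get(tokens[i], 0)
            i += 1
    return score

# Hypothetical word-level polarity entries (+1/-1) and one bi-gram
# whose annotation corrects the unigram reading of "good".
lex = {"good": 1, "bad": -1, "happy": 1}
bi = {("not", "good"): -1}
print(sentence_polarity(["the", "food", "was", "not", "good"], lex, bi))  # → -1
```

The bi-gram override is one simple way such annotations can flip polarity that unigram summation alone would get wrong.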