A comparative study of classifier combination applied to NLP tasks
This paper presents a comparative study of classifier combination methods, which have been successfully
applied to many tasks, including Natural Language Processing (NLP) tasks. There is a variety of classifier
combination techniques, and the major difficulty is choosing the one best suited to a particular
task. In our study we explored the performance of a number of combination methods, such as voting,
Bayesian merging, behavior knowledge space, bagging, stacking, feature sub-spacing, and cascading, on
the part-of-speech tagging task using nine corpora in five languages. The results show that some
currently less popular methods can perform substantially better. In addition, we examined
how corpus size and quality influence the performance of the combination methods. We also report the
results of applying the classifier combination methods to other NLP tasks, such as named entity recognition
and chunking. We believe our study is the most exhaustive comparison to date of combination
methods applied to NLP tasks.
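To make two of the schemes above concrete, the following is a minimal, hypothetical scikit-learn sketch of majority voting and stacking, not the paper's actual implementation; X_train and y_train stand in for per-token feature vectors and their POS tags, produced by a feature-extraction step that is not shown.

```python
# Minimal sketch of majority voting and stacking with scikit-learn.
# Assumes X_train (per-token feature vectors) and y_train (POS tags)
# come from an upstream feature-extraction step that is not shown here.
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

base_taggers = [
    ("nb", MultinomialNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier()),
]

# Voting: each base classifier casts one vote per token; the majority label wins.
voter = VotingClassifier(estimators=base_taggers, voting="hard")

# Stacking: a meta-classifier learns how much to trust each base prediction.
stacker = StackingClassifier(
    estimators=base_taggers,
    final_estimator=LogisticRegression(max_iter=1000),
)

# voter.fit(X_train, y_train); stacker.fit(X_train, y_train)
```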
Disjoint Semi-supervised Spanish Verb Sense Disambiguation Using Word Embeddings
This work explores the use of word embeddings (word vectors) trained on Spanish corpora as features for Spanish verb sense disambiguation (VSD).
This type of learning technique is called disjoint semi-supervised learning [1]: an unsupervised algorithm is first trained on unlabeled data separately, and then its results (i.e., the word embeddings) are fed to a supervised classifier. Throughout this paper we test two hypotheses: (i) representations of training instances based on word embeddings improve the performance of supervised models for VSD compared to more standard feature engineering techniques based on information taken from the training data; (ii) word embeddings trained on a specific domain, in this case the same domain the labeled data is gathered from, have a positive impact on the model's performance compared to general-domain word embeddings. The performance of a model is measured not only with standard metrics (e.g., accuracy or precision/recall) but also by analyzing the learning curve to assess the model's tendency to overfit the available data. Measuring this overfitting tendency is important because only a small amount of labeled data is available, so we need models that generalize well on the VSD problem. For the task we use SenSem [2], a corpus and lexicon of Spanish and Catalan disambiguated verbs, as our base resource for experimentation.
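As an illustration of the disjoint pipeline (a sketch under assumptions, not the paper's code), the snippet below trains word2vec embeddings on unlabeled text with gensim and then feeds averaged context vectors to a supervised classifier; unlabeled_sentences (tokenized sentences) and instances (pairs of context tokens and sense labels) are assumed inputs.

```python
# Step 1 (unsupervised): learn word embeddings from unlabeled, tokenized text.
# Step 2 (supervised): represent each verb occurrence by the mean of its
# context word vectors and train a classifier on the labeled senses.
# `unlabeled_sentences` and `instances` are assumed, not defined here.
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

w2v = Word2Vec(sentences=unlabeled_sentences, vector_size=100,
               window=5, min_count=2)

def instance_vector(context_tokens):
    """Mean of the context word vectors; zeros if none are in the vocabulary."""
    vecs = [w2v.wv[t] for t in context_tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([instance_vector(ctx) for ctx, _ in instances])
y = [sense for _, sense in instances]
LinearSVC().fit(X, y)
```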
Combining Knowledge- and Corpus-based Word-Sense-Disambiguation Methods
In this paper we concentrate on the resolution of the lexical ambiguity that
arises when a given word has several different meanings. This specific task is
commonly referred to as word sense disambiguation (WSD). The task of WSD
consists of assigning the correct sense to words using an electronic dictionary
as the source of word definitions. We present two WSD methods based on two main
methodological approaches in this research area: a knowledge-based method and a
corpus-based method. Our hypothesis is that word-sense disambiguation requires
several knowledge sources in order to solve the semantic ambiguity of the
words. These sources can be of different kinds: for example, syntagmatic,
paradigmatic, or statistical information. Our approach combines various sources
of knowledge through combinations of the two WSD methods mentioned above.
The paper concentrates mainly on how to combine these methods and sources of
information in order to achieve good disambiguation results. Finally,
it presents a comprehensive study and experimental evaluation
of the methods and their combinations.
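One simple way to realize such a combination (an illustrative sketch, not the paper's exact system) is a back-off scheme: trust a corpus-based classifier when it is confident, and otherwise fall back to a knowledge-based method such as NLTK's gloss-overlap Lesk implementation. Here featurize is a hypothetical feature extractor, and corpus_model is any scikit-learn-style classifier trained on sense-tagged examples.

```python
# Back-off combination of a corpus-based and a knowledge-based WSD method.
# `featurize` is a hypothetical feature extractor; `corpus_model` is assumed
# to be a trained classifier exposing predict_proba over sense labels.
from nltk.wsd import lesk

def combined_wsd(tokens, target, corpus_model, threshold=0.6):
    probs = corpus_model.predict_proba([featurize(tokens, target)])[0]
    if probs.max() >= threshold:
        # The corpus-based model is confident enough: use its prediction.
        return corpus_model.classes_[probs.argmax()]
    # Otherwise fall back to dictionary-gloss overlap (knowledge-based).
    synset = lesk(tokens, target)
    return synset.name() if synset else None
```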
Cross-domain polarity classification using a knowledge-enhanced meta-classifier
Current approaches to single- and cross-domain polarity classification usually rely on bag-of-words, n-gram,
or lexical-resource-based classifiers. In this paper, we propose the use of meta-learning to combine and
enrich those approaches by also adding other knowledge-based features. In addition to the aforementioned
classical approaches, our system uses the BabelNet multilingual semantic network to generate features
derived from word sense disambiguation and vocabulary expansion. Experimental results show
state-of-the-art performance on single- and cross-domain polarity classification. Unlike other
approaches, ours is generic: these results were obtained without any domain adaptation technique.
Moreover, the use of meta-learning allows our approach to obtain the most stable results across domains.
Finally, our empirical analysis provides interesting insights on the use of semantic network-based
features.
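A rough sketch of the meta-classifier idea (under assumptions; this is not the authors' system) is stacking base classifiers trained on heterogeneous views of a review: one on word n-grams, one on knowledge-derived features. lookup_senses below is a hypothetical stand-in for the BabelNet-derived feature extraction.

```python
# Stacking base classifiers over two views of each text: surface n-grams
# and knowledge-based features. `lookup_senses` is a hypothetical function
# returning a feature vector of, e.g., disambiguated sense counts per text.
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

def sense_features(texts):
    return np.array([lookup_senses(t) for t in texts])  # hypothetical helper

ngram_view = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
knowledge_view = make_pipeline(FunctionTransformer(sense_features),
                               LogisticRegression(max_iter=1000))

meta = StackingClassifier(
    estimators=[("ngrams", ngram_view), ("senses", knowledge_view)],
    final_estimator=LogisticRegression(max_iter=1000),
)
# meta.fit(train_texts, train_polarity)  # raw texts in, polarity labels out
```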
A Comprehensive Review of Sentiment Analysis on Indian Regional Languages: Techniques, Challenges, and Trends
Sentiment analysis (SA) is the process of understanding the emotion within a text. It helps identify the opinion, attitude, and tone of a text, categorizing it as positive, negative, or neutral. SA is widely used today, as the advent of social media gives more and more people a chance to publish their thoughts. Sentiment analysis benefits industries around the globe, such as finance, advertising, marketing, travel, and hospitality. Although the majority of work in this field targets global languages like English, in recent years the importance of SA in local languages has also been widely recognized, leading to considerable research on Indian regional languages. This paper comprehensively reviews SA in the following major Indian regional languages: Marathi, Hindi, Tamil, Telugu, Malayalam, Bengali, Gujarati, and Urdu. Furthermore, it presents techniques, challenges, findings, recent research trends, and future directions for improving the accuracy of results.