
    Naive Bayes and Exemplar-Based approaches to Word Sense Disambiguation Revisited

    This paper describes an experimental comparison between two standard supervised learning methods, namely Naive Bayes and Exemplar-based classification, on the Word Sense Disambiguation (WSD) problem. The aim of the work is twofold. Firstly, it attempts to clarify some confusing information about the comparison between the two methods that appears in the related literature. To that end, several directions have been explored, including testing several modifications of the basic learning algorithms and varying the feature space. Secondly, an improvement of both algorithms is proposed for dealing with large attribute sets. This modification, which basically consists in using only the positive information appearing in the examples, greatly improves the efficiency of the methods with no loss in accuracy. The experiments have been performed on the largest available sense-tagged corpus containing the most frequent and ambiguous English words. Results show that the Exemplar-based approach to WSD is generally superior to the Bayesian approach, especially when a specific metric for dealing with symbolic attributes is used.
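
    The "positive information" idea lends itself to a compact implementation. Below is a minimal, illustrative Python sketch (not the authors' code) of a Naive Bayes scorer in which only the features present in a test example contribute to the score, so classification cost scales with the size of the example rather than with the full attribute set; all class and method names here are assumptions.

```python
import math
from collections import Counter, defaultdict

class PositiveNaiveBayes:
    """Sketch of Naive Bayes scoring that uses only the features
    present in an example, ignoring absent attributes."""

    def fit(self, examples, senses, alpha=1.0):
        # examples: list of feature sets; senses: parallel list of sense labels
        self.sense_counts = Counter(senses)
        self.feature_counts = defaultdict(Counter)  # sense -> feature -> count
        self.vocab = set()
        for feats, sense in zip(examples, senses):
            self.feature_counts[sense].update(feats)
            self.vocab.update(feats)
        self.alpha = alpha
        self.total = sum(self.sense_counts.values())
        return self

    def predict(self, feats):
        def score(sense):
            # log prior plus log likelihood of the *present* features only;
            # absent features never enter the loop, which is the efficiency gain
            counts = self.feature_counts[sense]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            s = math.log(self.sense_counts[sense] / self.total)
            for f in feats:
                s += math.log((counts[f] + self.alpha) / denom)
            return s
        return max(self.sense_counts, key=score)
```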

    Sometimes less is more: Romanian word sense disambiguation revisited

    Recent approaches to Word Sense Disambiguation (WSD) generally fall into two classes: (1) information-intensive approaches and (2) information-poor approaches. Our hypothesis is that for memory-based learning (MBL), a reduced amount of data is more beneficial than the full range of features used in the past. Our experiments show that MBL combined with a restricted feature set and a feature selection method that minimizes the feature set leads to competitive results, outperforming all systems that participated in the SENSEVAL-3 competition on the Romanian data. Thus, with this specific method, a tightly controlled feature set improves the accuracy of the classifier, reaching 74.0% in the fine-grained and 78.7% in the coarse-grained evaluation.
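
    As a rough illustration of how a selection method can keep a memory-based learner's feature set minimal, the Python sketch below performs greedy forward selection scored by leave-one-out accuracy of a simple 1-nearest-neighbour classifier with the overlap metric. It is a simplified stand-in for the paper's setup; the stopping rule and all names are assumptions.

```python
def overlap(a, b, feats):
    # overlap metric: number of selected features on which two instances agree
    return sum(a[i] == b[i] for i in feats)

def loo_accuracy(data, labels, feats):
    # leave-one-out accuracy of 1-NN restricted to the given feature subset
    correct = 0
    for i, x in enumerate(data):
        j = max((k for k in range(len(data)) if k != i),
                key=lambda k: overlap(x, data[k], feats))
        correct += labels[j] == labels[i]
    return correct / len(data)

def select_features(data, labels, n_features):
    chosen, best = [], 0.0
    candidates = set(range(n_features))
    while candidates:
        f = max(candidates, key=lambda c: loo_accuracy(data, labels, chosen + [c]))
        acc = loo_accuracy(data, labels, chosen + [f])
        if acc <= best:   # stop as soon as adding features no longer helps,
            break         # keeping the feature set minimal
        chosen.append(f)
        candidates.remove(f)
        best = acc
    return chosen
```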

    High WSD Accuracy Using Naive Bayesian Classifier with Rich Features

    Word Sense Disambiguation (WSD) is the task of choosing the right sense of an ambiguous word given a context. Naive Bayesian (NB) classifiers are known as one of the best supervised approaches to WSD (Mooney, 1996; Pedersen, 2000), and this model usually uses only a topical context represented by unordered words in a large window. In this paper, we show that by adding richer knowledge, represented by ordered words in a local context and collocations, the NB classifier can achieve higher accuracy than the best previously published results. The features were chosen using a forward sequential selection algorithm. Our experiments obtained 92.3% accuracy for four common test words (interest, line, hard, serve). We also tested on a large dataset, the DSO corpus, and obtained accuracies of 66.4% for verbs and 72.7% for nouns.
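
    The richer feature representation described above, topical words in a wide window plus position-sensitive local words and collocations, can be sketched in Python as follows. The window sizes and feature-naming scheme are illustrative assumptions, not the paper's exact configuration.

```python
def extract_features(tokens, target_index, topic_window=50, local_window=2):
    """Build a feature set for the target word at tokens[target_index]."""
    feats = set()
    # topical context: unordered words in a large window around the target
    lo = max(0, target_index - topic_window)
    hi = min(len(tokens), target_index + topic_window + 1)
    for w in tokens[lo:hi]:
        feats.add(f"topic={w.lower()}")
    # local context: nearby words with their relative position preserved
    for offset in range(-local_window, local_window + 1):
        i = target_index + offset
        if offset != 0 and 0 <= i < len(tokens):
            feats.add(f"local[{offset}]={tokens[i].lower()}")
    # collocations: adjacent bigrams spanning the target word
    if target_index > 0:
        feats.add(f"coll={tokens[target_index - 1].lower()}_{tokens[target_index].lower()}")
    if target_index + 1 < len(tokens):
        feats.add(f"coll={tokens[target_index].lower()}_{tokens[target_index + 1].lower()}")
    return feats
```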

    Training a Naive Bayes Classifier via the EM Algorithm with a Class Distribution Constraint

    Combining a naive Bayes classifier with the EM algorithm is one of the promising approaches to making use of unlabeled data in disambiguation tasks that rely on local context features, such as word sense disambiguation and spelling correction. However, using unlabeled data via the basic EM algorithm often causes disastrous performance degradation instead of improving classification performance, resulting in poor classification performance on average. In this study, we introduce a class distribution constraint into the iteration process of the EM algorithm. This constraint keeps the class distribution of the unlabeled data consistent with the class distribution estimated from the labeled data, preventing the EM algorithm from converging to an undesirable state. Experimental results using 26 confusion sets and a large amount of unlabeled data show that our proposed method considerably improves classification performance when the amount of labeled data is small.
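
    One plausible way to implement such a constraint is to rescale the E-step posteriors on the unlabeled data so that their aggregate class distribution matches the distribution estimated from the labeled data, as in the following Python sketch. The iterative rescaling scheme shown here is an assumption, not necessarily the paper's exact procedure.

```python
import numpy as np

def constrain_class_distribution(posteriors, labeled_class_dist, n_iter=10):
    """Rescale soft labels so their aggregate class distribution matches
    the one estimated from labeled data.

    posteriors: (n_unlabeled x n_classes) array of P(c|x) from the E-step.
    labeled_class_dist: length-n_classes array summing to one.
    """
    p = posteriors.copy()
    for _ in range(n_iter):
        # current aggregate class mass over the unlabeled set
        current = p.sum(axis=0) / p.sum()
        # rescale each class column toward the labeled-data distribution
        p *= labeled_class_dist / np.maximum(current, 1e-12)
        # renormalize each example's posterior to sum to one
        p /= p.sum(axis=1, keepdims=True)
    return p
```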

    Improving the performance of dictionary-based approaches in protein name recognition

    Dictionary-based protein name recognition is often a first step in extracting information from biomedical documents because it can provide ID information on recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition, mainly caused by short names; (2) low recall due to spelling variations. In this paper, we tackle the former problem using machine learning to filter out false positives, and we present two alternative methods for alleviating the latter problem of spelling variations. The first uses approximate string searching, and the second expands the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results on the GENIA corpus revealed that filtering with a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure, and dictionary expansion with the variant generator gave a further 1.6% improvement, reaching an F-measure of 66.6%.
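
    The two remedies described above can be sketched in Python: approximate string matching against the dictionary to tolerate spelling variants, followed by a classifier-based filter to remove false positives. The similarity measure, threshold, and filter interface are illustrative assumptions, not the paper's actual components.

```python
from difflib import SequenceMatcher

def approximate_lookup(candidate, dictionary, threshold=0.9):
    """Return (id, name, score) for dictionary entries whose string
    similarity to `candidate` exceeds the threshold."""
    hits = []
    for name, entry_id in dictionary.items():
        ratio = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
        if ratio >= threshold:
            hits.append((entry_id, name, ratio))
    return sorted(hits, key=lambda h: -h[2])

def recognize(candidate, dictionary, false_positive_filter):
    hits = approximate_lookup(candidate, dictionary)
    # keep only matches the classifier judges to be genuine protein names,
    # targeting the short-name false positives mentioned in the abstract
    return [h for h in hits if false_positive_filter(candidate, h)]
```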

    A hybrid architecture for robust parsing of German

    This paper provides an overview of current research on a hybrid, robust parsing architecture for the morphological, syntactic, and semantic annotation of German text corpora. The novel contribution of this research lies not in the individual parsing modules, each of which relies on state-of-the-art algorithms and techniques, but in the combination of these modules into a single architecture. This combination provides a means to significantly optimize the performance of each component, resulting in increased annotation accuracy.

    Unsupervised Word Sense Disambiguation Using Neighborhood Knowledge
