    Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning

    This paper describes an experimental comparison of seven different learning algorithms on the problem of learning to disambiguate the meaning of a word from context. The algorithms tested include statistical, neural-network, decision-tree, rule-based, and case-based classification techniques. The specific problem tested involves disambiguating six senses of the word "line" using the words in the current and preceding sentence as context. The statistical and neural-network methods perform the best on this particular problem, and we discuss a potential reason for this observed difference. We also discuss the role of bias in machine learning and its importance in explaining performance differences observed on specific problems. Comment: 10 pages

    Naive Bayes and Exemplar-Based approaches to Word Sense Disambiguation Revisited

    This paper describes an experimental comparison between two standard supervised learning methods, namely Naive Bayes and Exemplar-based classification, on the Word Sense Disambiguation (WSD) problem. The aim of the work is twofold. Firstly, it attempts to clarify some confusing information in the related literature about the comparison between the two methods. To that end, several directions have been explored, including testing several modifications of the basic learning algorithms and varying the feature space. Secondly, an improvement of both algorithms is proposed in order to deal with large attribute sets. This modification, which consists of using only the positive information appearing in the examples, greatly improves the efficiency of the methods with no loss in accuracy. The experiments have been performed on the largest sense-tagged corpus available, containing the most frequent and ambiguous English words. Results show that the Exemplar-based approach to WSD is generally superior to the Bayesian approach, especially when a specific metric for dealing with symbolic attributes is used. Comment: 5 pages
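
    The "positive information only" idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the add-one smoothing, and the toy training examples are all assumptions made for illustration. The only point it shows is that a Naive Bayes score can be accumulated over the features present in an example, rather than over the whole attribute set.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count senses and, per sense, how many examples contain each feature.

    examples: list of (set_of_context_words, sense) pairs (hypothetical format).
    """
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, sense in examples:
        sense_counts[sense] += 1
        for f in set(features):
            feature_counts[sense][f] += 1
    return sense_counts, feature_counts

def classify(features, sense_counts, feature_counts):
    """Score each sense using only the features present in the example."""
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior
        for f in set(features):
            # Add-one smoothing for a binary feature (assumed, not from the paper).
            score += math.log((feature_counts[sense][f] + 1) / (n + 2))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy data, invented for illustration only.
examples = [
    ({"telephone", "call", "busy"}, "phone"),
    ({"phone", "call", "dead"}, "phone"),
    ({"wait", "queue", "long"}, "queue"),
    ({"stand", "queue", "ticket"}, "queue"),
]
model = train(examples)
predicted_phone = classify({"call", "busy"}, *model)
predicted_queue = classify({"queue", "ticket"}, *model)
```

    Because the inner loop runs only over the words in the example being classified, its cost depends on the size of the example rather than on the size of the vocabulary.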

    Boosting Applied to Word Sense Disambiguation

    In this paper, Schapire and Singer's AdaBoost.MH boosting algorithm is applied to the Word Sense Disambiguation (WSD) problem. Initial experiments on a set of 15 selected polysemous words show that the boosting approach surpasses Naive Bayes and Exemplar-based approaches, which represent state-of-the-art accuracy on supervised WSD. In order to make boosting practical for a real learning domain of thousands of words, several ways of accelerating the algorithm by reducing the feature space are studied. The best variant, which we call LazyBoosting, is tested on the largest sense-tagged corpus available, containing 192,800 examples of the 191 most frequent and ambiguous English words. Again, boosting compares favourably to the other benchmark algorithms. Comment: 12 pages

    Thumbs up? Sentiment Classification using Machine Learning Techniques

    We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging. Comment: To appear in EMNLP-200

    Sometimes less is more: Romanian word sense disambiguation revisited

    Recent approaches to Word Sense Disambiguation (WSD) generally fall into two classes: (1) information-intensive approaches and (2) information-poor approaches. Our hypothesis is that for memory-based learning (MBL), a reduced amount of data is more beneficial than the full range of features used in the past. Our experiments show that MBL combined with a restricted set of features and a feature selection method that minimizes the feature set leads to competitive results, outperforming all systems that participated in the SENSEVAL-3 competition on the Romanian data. Thus, with this specific method, a tightly controlled feature set improves the accuracy of the classifier, reaching 74.0% in the fine-grained and 78.7% in the coarse-grained evaluation.

    Structure-semantics interplay in complex networks and its effects on the predictability of similarity in texts

    There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts are those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between the various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of the quality of machine-translated texts, and authorship recognition. We show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Nevertheless, the Katz similarity, which involves both semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, the topological features were again relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well. Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies.
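
    As a rough illustration of what "topological features of the networks representing texts" can mean, the sketch below builds a word-adjacency network and computes two standard node metrics, degree and local clustering coefficient. The abstract does not specify how the networks are built or which metrics are used, so the adjacency rule (linking each word to its successor), the function names, and the sample sentence are all assumptions.

```python
from collections import defaultdict

def build_network(text):
    """Undirected word-adjacency network: link each word to the word after it."""
    words = text.lower().split()
    neighbours = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops from repeated words
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours

def degree(neighbours, node):
    """Number of distinct words adjacent to `node` somewhere in the text."""
    return len(neighbours[node])

def clustering(neighbours, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    nbrs = list(neighbours[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbours[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))

# Sample sentence invented for illustration.
net = build_network("the cat saw the dog and the cat saw the bird")
the_degree = degree(net, "the")
cat_clustering = clustering(net, "cat")
```

    Per-node values like these could be averaged over a text to give a feature vector for a classifier; which aggregates are informative for a given task is an empirical question that this sketch does not address.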

    Named Entity Extraction and Disambiguation: The Reinforcement Effect.

    Named entity extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the semantic web. Although these topics are highly dependent, almost no existing works examine this dependency. The aim of this paper is to examine the dependency and show how each affects the other. We conducted experiments with a set of descriptions of holiday homes with the aim of extracting and disambiguating toponyms as a representative example of named entities. We experimented with three approaches to disambiguation with the purpose of inferring the country of the holiday home. We examined how the effectiveness of extraction influences the effectiveness of disambiguation and, reciprocally, how filtering out ambiguous names (an activity that depends on the disambiguation process) improves the effectiveness of extraction. Since this, in turn, may improve the effectiveness of disambiguation again, it shows that extraction and disambiguation may reinforce each other.

    Weka: A machine learning workbench for data mining

    The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.