High-performance Word Sense Disambiguation with Less Manual Effort

Abstract

Supervised learning is a widely used paradigm in Natural Language Processing. This paradigm involves learning a classifier from annotated examples and applying it to unseen data. We cast word sense disambiguation, our task of interest, as a supervised learning problem and formulate the end goal of this dissertation: to develop a series of methods aimed at achieving the highest possible word sense disambiguation performance with the least reliance on manual effort.

We begin by implementing a word sense disambiguation system that utilizes rich linguistic features to better represent the contexts of ambiguous words. Our state-of-the-art system captures three types of linguistic features: lexical, syntactic, and semantic. Traditionally, semantic features are extracted with the help of expensive hand-crafted lexical resources. We propose a novel unsupervised approach to extracting a similar type of semantic information from unlabeled corpora, and we show that incorporating this information into a classification framework leads to performance improvements. The result is a system that outperforms traditional methods while eliminating the reliance on manual effort for extracting semantic data.

We then attack the problem of reducing manual effort from a different direction. Supervised word sense disambiguation relies on annotated data for learning sense classifiers, but annotation is expensive since it requires a large time investment from expert labelers. We examine various annotation practices and propose several approaches for making them more efficient. We evaluate the proposed approaches, compare them to existing ones, and show that the annotation effort can often be reduced significantly without sacrificing the performance of the models trained on the annotated data.