High-performance Word Sense Disambiguation with Less Manual Effort
Supervised learning is a widely used paradigm in Natural Language Processing. This paradigm involves learning a classifier from annotated examples and applying it to unseen data. We cast word sense disambiguation, our task of interest, as a supervised learning problem. We then formulate the end goal of this dissertation: to develop a series of methods aimed at achieving the highest possible word sense disambiguation performance with the least reliance on manual effort.
We begin by implementing a word sense disambiguation system, which utilizes rich linguistic features to better represent the contexts of ambiguous words. Our state-of-the-art system captures three types of linguistic features: lexical, syntactic, and semantic. Traditionally, semantic features are extracted with the help of expensive hand-crafted lexical resources. We propose a novel unsupervised approach to extracting a similar type of semantic information from unlabeled corpora. We show that incorporating this information into a classification framework leads to performance improvements. The result is a system that outperforms traditional methods while eliminating the reliance on manual effort for extracting semantic data.
We then attack the problem of reducing manual effort from a different direction. Supervised word sense disambiguation relies on annotated data for learning sense classifiers. However, annotation is expensive since it requires a large time investment from expert labelers. We examine various annotation practices and propose several approaches for making them more efficient. We evaluate the proposed approaches and compare them to the existing ones. We show that the annotation effort can often be reduced significantly without sacrificing the performance of the models trained on the annotated data.
Detecting Errors in Corpora Using Support Vector Machines
While corpus-based research relies on human-annotated corpora, it is often said that a non-negligible number of errors remain even in frequently used corpora such as the Penn Treebank. Detecting errors in annotated corpora is therefore important for corpus-based natural language processing. In this paper, we propose a method to detect errors in corpora using support vector machines (SVMs). The method is based on the idea of extracting exceptional elements that violate consistency: we use SVMs to assign a weight to each element and use these weights to find errors in a POS-tagged corpus. We apply the method to English and Japanese POS-tagged corpora and achieve high precision in detecting errors.
Stochastic chaos and thermodynamic phase transitions : theory and Bayesian estimation algorithms
Thesis (M.Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 177-200). The chaotic behavior of dynamical systems underlies the foundations of statistical mechanics through ergodic theory. This putative connection is made more concrete in Part I of this thesis, where we show how to quantify certain chaotic properties of a system that are of relevance to statistical mechanics and kinetic theory. We consider the motion of a particle trapped in a double-well potential coupled to a noisy environment. By use of the classic Langevin and Fokker-Planck equations, we investigate Kramers' escape rate problem. We show that there is a deep analogy between kinetic rate theory and stochastic chaos, for which we propose a novel definition. In Part II, we develop techniques based on Volterra series modeling and Bayesian non-linear filtering to distinguish between dynamic noise and measurement noise. We quantify how much of the system's ergodic behavior can be attributed to intrinsic deterministic dynamical properties vis-a-vis inevitable extrinsic noise perturbations. By Zhi-De Deng. M.Eng. and S.B.