
Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities

By Peter D. Turney

Abstract

This paper describes the National Research Council (NRC) Word Sense Disambiguation (WSD) system, as applied to the English Lexical Sample (ELS) task in Senseval-3. The NRC system approaches WSD as a classical supervised machine learning problem, using familiar tools such as the Weka machine learning software and Brill's rule-based part-of-speech tagger. Head words are represented as feature vectors with several hundred features. Approximately half of the features are syntactic and the other half are semantic. The main novelty in the system is the method for generating the semantic features, based on word co-occurrence probabilities. The probabilities are estimated using the Waterloo MultiText System with a corpus of about one terabyte of unlabeled text, collected by a web crawler.
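As a rough illustration of the kind of semantic feature described above (not the NRC system's exact formula, and not the Waterloo MultiText query interface, which the abstract does not detail), the sketch below derives pointwise mutual information scores for a head word against a small set of context words from hypothetical corpus hit counts. All counts, the corpus size, and the context words are made-up values for illustration only.

```python
import math

def pmi(count_xy, count_x, count_y, corpus_size):
    """Pointwise mutual information between words x and y,
    estimated from raw co-occurrence counts."""
    if count_xy == 0:
        return float("-inf")  # the words never co-occur in the corpus
    p_xy = count_xy / corpus_size
    p_x = count_x / corpus_size
    p_y = count_y / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical hit counts for the head word "bank" and three context words.
# In a system like the one described, such counts would come from queries
# against a large unlabeled web corpus.
N = 10**9                       # assumed number of text windows in the corpus
head_count = 5_000_000          # windows containing "bank"
context_counts = {"river": 2_000_000, "loan": 800_000, "violin": 600_000}
cooc_counts    = {"river": 40_000,    "loan": 120_000, "violin": 500}

# One semantic feature per context word: higher PMI means a stronger
# association between the head word and that context word.
features = {w: round(pmi(cooc_counts[w], head_count, context_counts[w], N), 2)
            for w in context_counts}
print(features)
```

Such scores, computed for a fixed list of context words, could then be assembled into the semantic half of a head word's feature vector and passed to a standard supervised classifier.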

Topics: Language, Computational Linguistics, Semantics, Machine Learning
Year: 2004
OAI identifier: oai:cogprints.org:3732


Citations

  1. (1995). An algebra for structured text search and a framework for its implementation.
  2. (2003). Coherent keyphrase extraction via Web mining.
  3. (2003). Frequency estimates for statistical word similarity measures.
  4. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL.
  5. (2000). Shortest substring retrieval and ranking.
  6. (1994). Some advances in transformation-based part-of-speech tagging.
  7. (2001). The Johns Hopkins SENSEVAL-2 system descriptions.