Coherent Keyphrase Extraction via Web Mining

By Peter Turney

Abstract

Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).
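
The abstract does not say which association measure is used, so the following is only a minimal sketch of the general idea, assuming pointwise mutual information (PMI) computed from web page counts. The function names `pmi` and `coherence_scores`, the averaging scheme, and all counts are hypothetical illustrations, not the paper's method or data. In practice the counts would come from search engine queries; here they are hard-coded.

```python
import math


def pmi(hits_a, hits_b, hits_ab, total_pages):
    """Pointwise mutual information (base 2) estimated from page counts.

    hits_a, hits_b: pages containing each phrase; hits_ab: pages containing both.
    Returns -inf when any count is zero (no evidence of co-occurrence).
    """
    if min(hits_a, hits_b, hits_ab) <= 0:
        return float("-inf")
    p_a = hits_a / total_pages
    p_b = hits_b / total_pages
    p_ab = hits_ab / total_pages
    return math.log2(p_ab / (p_a * p_b))


def coherence_scores(candidates, hits, pair_hits, total_pages):
    """Average PMI of each candidate with every other candidate.

    A low average suggests the phrase is an outlier relative to the rest.
    This averaging is an assumption made for illustration only.
    """
    scores = {}
    for c in candidates:
        values = [
            pmi(hits[c], hits[o], pair_hits.get(frozenset((c, o)), 0), total_pages)
            for o in candidates
            if o != c
        ]
        scores[c] = sum(values) / len(values) if values else 0.0
    return scores


# Made-up counts for three candidate phrases (not real search results).
candidates = ["keyphrase extraction", "machine learning", "banana"]
hits = {"keyphrase extraction": 2000, "machine learning": 50000, "banana": 40000}
pair_hits = {
    frozenset(("keyphrase extraction", "machine learning")): 800,
    frozenset(("keyphrase extraction", "banana")): 1,
    frozenset(("machine learning", "banana")): 20,
}
scores = coherence_scores(candidates, hits, pair_hits, total_pages=10_000_000)
least_coherent = min(scores, key=scores.get)
print(least_coherent, scores)
```

With these invented counts, the phrase that rarely co-occurs with the others receives the lowest average score, which is the kind of signal that could be used as evidence against an outlier candidate.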

Topics: Statistical Models, Language, Machine Learning
Year: 2003
OAI identifier: oai:cogprints.org:3122
