Keyphrases are useful for a variety of purposes, including summarizing,
indexing, labeling, categorizing, clustering, highlighting, browsing, and
searching. The task of automatic keyphrase extraction is to select keyphrases
from within the text of a given document. Automatic keyphrase extraction makes
it feasible to generate keyphrases for the huge number of documents that do not
have manually assigned keyphrases. A limitation of previous keyphrase
extraction algorithms is that the selected keyphrases are occasionally
incoherent. That is, the majority of the output keyphrases may fit together
well, but there may be a minority that appear to be outliers, with no clear
semantic relation to the majority or to each other. This paper presents
enhancements to the Kea keyphrase extraction algorithm that are designed to
increase the coherence of the extracted keyphrases. The approach is to use the
degree of statistical association among candidate keyphrases as evidence that
they may be semantically related. The statistical association is measured using
web mining. Experiments demonstrate that the enhancements improve the quality
of the extracted keyphrases. Furthermore, the enhancements are not
domain-specific: the algorithm generalizes well when it is trained on one
domain (computer science documents) and tested on another (physics documents).

Comment: 6 pages, related work available at http://purl.org/peter.turney
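
The idea of treating statistical association among candidate keyphrases as evidence of semantic relatedness can be illustrated with a minimal sketch. The abstract does not name the association measure or the query mechanism, so everything below is an assumption made for illustration: pointwise mutual information (PMI) stands in for the measure, page counts are passed in as plain dictionaries rather than fetched from a search engine, and the names `pmi` and `coherence_scores` are hypothetical. It is not the paper's implementation.

```python
import math


def pmi(hits_a, hits_b, hits_ab, total_pages):
    """PMI estimated from page counts (assumed measure, not taken from the paper).

    hits_a, hits_b: pages containing each phrase on its own.
    hits_ab: pages containing both phrases.
    total_pages: assumed size of the indexed collection.
    """
    if min(hits_a, hits_b, hits_ab) <= 0:
        # No co-occurrence evidence at all: treat as maximally unassociated.
        return float("-inf")
    return math.log2((hits_ab * total_pages) / (hits_a * hits_b))


def coherence_scores(candidates, hits, pair_hits, total_pages):
    """Average PMI of each candidate with every other candidate.

    hits: dict mapping phrase -> page count (hypothetical data).
    pair_hits: dict mapping frozenset({a, b}) -> joint page count.
    """
    scores = {}
    for c in candidates:
        others = [o for o in candidates if o != c]
        values = [
            pmi(hits[c], hits[o], pair_hits.get(frozenset((c, o)), 0), total_pages)
            for o in others
        ]
        scores[c] = sum(values) / len(values) if values else 0.0
    return scores


# Toy counts, invented purely for illustration.
toy_candidates = ["a", "b", "c"]
toy_hits = {"a": 100, "b": 100, "c": 100}
toy_pair_hits = {
    frozenset(("a", "b")): 50,
    frozenset(("a", "c")): 1,
    frozenset(("b", "c")): 1,
}
toy_scores = coherence_scores(toy_candidates, toy_hits, toy_pair_hits, 10_000)
```

In this toy data, candidate "c" co-occurs with the others only at the rate expected by chance, so it receives a lower average score than "a" or "b". A score of this kind could serve as one signal for flagging outlier keyphrases; how the paper's enhancements to Kea actually use the association evidence is not specified in the abstract.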