
Text Classification by Bootstrapping with Keywords, EM and Shrinkage

By Andrew McCallum

Abstract

When applying text classification to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents an alternative approach to text classification that requires no labeled documents; instead, it uses a small set of keywords per class, a class hierarchy, and a large quantity of easily obtained unlabeled documents. The keywords are used to assign approximate labels to the unlabeled documents by term matching. These preliminary labels become the starting point for a bootstrapping process that learns a naive Bayes classifier using Expectation-Maximization and hierarchical shrinkage. When classifying a complex data set of computer science research papers into a 70-leaf topic hierarchy, the keywords alone provide 45% accuracy. The classifier learned by bootstrapping reaches 66% accuracy, a level close to human agreement.
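The bootstrapping loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it omits hierarchical shrinkage and the class hierarchy entirely, and the function names (`keyword_label`, `train_nb`, `posterior`, `bootstrap`), the Laplace smoothing, the first-match keyword rule, and the choice to leave unmatched documents out of the first training pass are all assumptions made for illustration. Documents are assumed to be pre-tokenized lists of words.

```python
import math
from collections import Counter

def keyword_label(doc, keywords):
    """Return the first class with a keyword in the document, else None (assumed rule)."""
    for cls, words in keywords.items():
        if any(w in doc for w in words):
            return cls
    return None

def train_nb(docs, weights, classes, vocab):
    """M-step: estimate class priors and word probabilities from soft labels,
    using Laplace smoothing (an assumption of this sketch)."""
    total_weight = sum(sum(w.values()) for w in weights)
    prior, word_prob = {}, {}
    for c in classes:
        class_weight = sum(w[c] for w in weights)
        prior[c] = (1 + class_weight) / (len(classes) + total_weight)
        counts = Counter()
        for doc, w in zip(docs, weights):
            for token in doc:
                counts[token] += w[c]
        total = sum(counts.values())
        word_prob[c] = {t: (1 + counts[t]) / (len(vocab) + total) for t in vocab}
    return prior, word_prob

def posterior(doc, prior, word_prob, classes):
    """E-step for one document: P(class | doc) under naive Bayes, in log space."""
    logp = {c: math.log(prior[c]) + sum(math.log(word_prob[c][t]) for t in doc)
            for c in classes}
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return {c: math.exp(logp[c] - m) / z for c in classes}

def bootstrap(docs, keywords, iterations=5):
    """Keyword matching gives preliminary labels; EM then refines them."""
    classes = list(keywords)
    vocab = {t for d in docs for t in d}
    weights = []
    for d in docs:
        label = keyword_label(d, keywords)
        # Unmatched documents get zero weight, so they are ignored in the first M-step.
        weights.append({c: 1.0 if c == label else 0.0 for c in classes})
    for _ in range(iterations):
        prior, word_prob = train_nb(docs, weights, classes, vocab)
        weights = [posterior(d, prior, word_prob, classes) for d in docs]
    return prior, word_prob, weights
```

Calling `bootstrap(docs, keywords)` with `keywords` as a dict mapping class names to keyword lists returns the learned priors, the per-class word probabilities, and a probabilistic label for every document, including those that matched no keyword. The sketch only scores words seen in `docs`; classifying new text would need handling for unseen words.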

Topics: Shrinkage
Year: 1999
OAI identifier: oai:CiteSeerX.psu:10.1.1.134.3656
Provided by: CiteSeerX
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://www.kamalnigam.com/pape... (external link)