When applying text classification to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents an alternative approach to text classification that requires no labeled documents; instead, it uses a small set of keywords per class, a class hierarchy, and a large quantity of easily obtained unlabeled documents. The keywords are used to assign approximate labels to the unlabeled documents by term-matching. These preliminary labels become the starting point for a bootstrapping process that learns a naive Bayes classifier using Expectation-Maximization and hierarchical shrinkage. When classifying a complex data set of computer science research papers into a 70-leaf topic hierarchy, the keywords alone provide 45% accuracy. The classifier learned by bootstrapping reaches 66% accuracy, a level close to human agreement.
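The bootstrapping idea can be illustrated with a minimal sketch: keywords assign preliminary labels by term-matching, and a naive Bayes model is then refined with Expectation-Maximization over all documents. This is not the paper's implementation. It omits hierarchical shrinkage entirely and uses a flat set of classes. The function names (`keyword_label`, `train_nb`, `posterior`, `bootstrap`, `classify`), the Laplace smoothing, the uniform initial labels for unmatched documents, and the fixed iteration count are all assumptions made for illustration.

```python
import math
from collections import Counter


def keyword_label(tokens, keywords):
    """Return the first class with a keyword in the document, else None."""
    for cls, words in keywords.items():
        if any(w in tokens for w in words):
            return cls
    return None


def train_nb(docs, posts, classes, vocab):
    """M-step: Laplace-smoothed priors and word probabilities from soft labels."""
    prior, word_prob = {}, {}
    for c in classes:
        weight = sum(p[c] for p in posts)
        prior[c] = (1.0 + weight) / (len(classes) + len(docs))
        counts = Counter()
        for tokens, p in zip(docs, posts):
            for t in tokens:
                counts[t] += p[c]
        total = sum(counts.values())
        word_prob[c] = {w: (1.0 + counts[w]) / (len(vocab) + total) for w in vocab}
    return prior, word_prob


def posterior(tokens, prior, word_prob, classes):
    """E-step for one document: normalized class posteriors under naive Bayes."""
    log_scores = {}
    for c in classes:
        score = math.log(prior[c])
        for t in tokens:
            if t in word_prob[c]:  # words outside the vocabulary are skipped
                score += math.log(word_prob[c][t])
        log_scores[c] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def bootstrap(docs, keywords, iterations=5):
    """Keyword labels seed naive Bayes; EM then re-estimates labels and model."""
    classes = list(keywords)
    vocab = sorted({t for d in docs for t in d})
    posts = []
    for d in docs:
        label = keyword_label(d, keywords)
        if label is None:
            # Assumption: unmatched documents start with uniform class weights.
            posts.append({c: 1.0 / len(classes) for c in classes})
        else:
            posts.append({c: 1.0 if c == label else 0.0 for c in classes})
    for _ in range(iterations):
        prior, word_prob = train_nb(docs, posts, classes, vocab)              # M-step
        posts = [posterior(d, prior, word_prob, classes) for d in docs]       # E-step
    prior, word_prob = train_nb(docs, posts, classes, vocab)
    return prior, word_prob, classes


def classify(tokens, model):
    """Return the most probable class for a tokenized document."""
    prior, word_prob, classes = model
    post = posterior(tokens, prior, word_prob, classes)
    return max(classes, key=lambda c: post[c])


# Toy corpus (hypothetical data): only two documents contain a keyword.
toy_docs = [
    ["neural", "network", "training"],
    ["network", "gradient", "training"],
    ["gradient", "backprop"],
    ["database", "query", "index"],
    ["query", "index", "join"],
    ["join", "transaction"],
]
toy_keywords = {"ml": ["neural"], "db": ["database"]}
toy_model = bootstrap(toy_docs, toy_keywords)
```

In this toy corpus, documents with no keyword still pick up a class through words they share with keyword-matched documents, which is the effect the bootstrapping process relies on. Calling `classify(["gradient", "backprop"], toy_model)` shows the label such a document receives after EM.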