Supervised Feature Extraction for Text Categorization

By J. J. Verbeek

Abstract

This paper concerns finding the "optimal" number of word groups for text classification. We present a method that uses a set of pre-classified texts to select which words to cluster into word groups and how many such groups to use. The method performs a greedy search through the space of possible word groups. Words are grouped according to the Jensen-Shannon divergence between their corresponding distributions over the classes. The criterion for deciding how many word groups to use is based on Rissanen's MDL (minimum description length) principle. We present empirical results indicating that the proposed method performs well. Furthermore, the proposed method outperforms cross-validation in the sense that it selects far fewer word groups while prediction accuracy is only slightly worse. For the experiments we used a subset of the "20 Newsgroups" dataset [10].
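
As a rough illustration of the grouping step described in the abstract, the sketch below computes a weighted Jensen-Shannon divergence between the class distributions of two word groups and merges the closest pair. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`js_divergence`, `merge_closest`), the count-proportional weights, and the toy data are all hypothetical, and the MDL-based choice of the number of groups is not shown.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q, wp=0.5, wq=0.5):
    """Weighted Jensen-Shannon divergence between distributions p and q.

    wp and wq are mixture weights that should sum to 1 (assumed, not checked).
    """
    m = [wp * pi + wq * qi for pi, qi in zip(p, q)]
    return wp * kl(p, m) + wq * kl(q, m)

def merge_closest(groups):
    """One greedy step: merge the two groups with the smallest JS divergence.

    groups is a list of (words, count, class_distribution) tuples with count > 0.
    Weighting each group by its relative count is an assumption made for this
    sketch; the paper may weight groups differently.
    """
    best = None
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            _, ni, pi = groups[i]
            _, nj, pj = groups[j]
            w = ni / (ni + nj)
            d = js_divergence(pi, pj, w, 1 - w)
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    words_i, ni, pi = groups[i]
    words_j, nj, pj = groups[j]
    w = ni / (ni + nj)
    # The merged group's class distribution is the count-weighted mixture.
    merged = (words_i + words_j, ni + nj,
              [w * a + (1 - w) * b for a, b in zip(pi, pj)])
    rest = [g for k, g in enumerate(groups) if k not in (i, j)]
    return rest + [merged]

# Toy data (made up): three words with distributions over two classes.
groups = [
    (["ball"], 10, [0.9, 0.1]),
    (["game"], 10, [0.8, 0.2]),
    (["faith"], 10, [0.1, 0.9]),
]
groups = merge_closest(groups)
print([g[0] for g in groups])
```

Repeating `merge_closest` until the desired number of groups remains gives a simple agglomerative search; a model-selection criterion such as MDL would then decide where to stop.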

Topics: clustering, feature extraction, MDL, naive Bayes, supervised, text categorization
Year: 2007
OAI identifier: oai:CiteSeerX.psu:10.1.1.19.8021
Provided by: CiteSeerX