A Document Space Model for Automated Text Classification based on Frequency Distribution across Categories

Abstract

The distribution of words on a 2D plane, with the document frequencies of each word as axes, is an ideal space in which to define a weighting metric for terms in the categories. In this paper, a completely data-driven approach is presented to compute a weight surface over this plane, taking as axes the document frequency of the word in the category and its document frequency in the rest of the categories. The statistical distribution of words across the categories is taken as the basis for building a document space model that represents the category vectors. This paper discusses how the model encompasses heuristics such as the Inverse Document Frequency, and how it captures the semantics of the given categories. It also presents an evaluation of the model through various experiments.
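As an illustration of the plane described above, the following minimal sketch places each word at a point whose coordinates are its document frequency inside one category and its document frequency in the remaining categories. It covers only the coordinates, not the weight surface, whose construction the abstract does not specify. The function name `df_coordinates`, the whitespace tokenizer, the toy corpus, and the normalization of counts to fractions are all assumptions made for the example, not details taken from the paper.

```python
from collections import Counter


def df_coordinates(docs, category):
    """Map each word to (df_in, df_rest) for the given category.

    docs: list of (category_label, text) pairs.
    df_in:   fraction of documents in `category` that contain the word.
    df_rest: fraction of documents in all other categories that contain it.
    Normalizing to fractions is an assumption of this sketch.
    """
    # Each document becomes a set of words, so a word counts once per document.
    in_docs = [set(text.lower().split()) for label, text in docs if label == category]
    rest_docs = [set(text.lower().split()) for label, text in docs if label != category]

    in_counts = Counter(w for d in in_docs for w in d)
    rest_counts = Counter(w for d in rest_docs for w in d)

    # Guard against division by zero when one side has no documents.
    n_in = max(len(in_docs), 1)
    n_rest = max(len(rest_docs), 1)

    vocab = set(in_counts) | set(rest_counts)
    return {w: (in_counts[w] / n_in, rest_counts[w] / n_rest) for w in vocab}


# Hypothetical toy corpus, for illustration only.
docs = [
    ("sports", "the match ended"),
    ("sports", "the team won the match"),
    ("finance", "the market fell"),
]
coords = df_coordinates(docs, "sports")
```

In this toy corpus a word such as "the" appears in every document on both sides and so sits in the corner where both frequencies are high, which is the region an Inverse Document Frequency style heuristic would down-weight, while "match" is frequent in the category and absent elsewhere.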
