Text Document Classification: An Approach Based on Indexing

Abstract

ABSTRACT In this paper we propose a new method of classifying text documents. Unlike conventional vector space models, the proposed method preserves the sequence of term occurrence in a document. The term sequence is effectively preserved with the help of a novel datastructure called ‘Status Matrix’. Further the corresponding classification technique has been proposed for efficient classification of text documents. In addition, in order to avoid sequential matching during classification, we propose to index the terms in Btree, an efficient index scheme. Each term in B-tree is associated with a list of class labels of those documents which contain the term. Further the corresponding classification technique has been proposed. To corroborate the efficacy of the proposed representation and status matrix based classification, we have conducted extensive experiments on various datasets. Original Source URL : http://aircconline.com/ijdkp/V2N1/2112ijdkp04.pdf For more details : http://airccse.org/journal/ijdkp/vol2.htm

    Similar works