Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification

Abstract

A text retrieval system must return documents that are highly relevant to a user's request. Term weighting is one of the important stages of text representation. Term Frequency (TF) considers the number of times a word occurs in each document, while Inverse Document Frequency (IDF) considers how widely a word is distributed across the document collection. However, TF.IDF weighting cannot represent how words are distributed across documents that belong to many classes or categories. The more unequal the distribution of a word across categories, the more important that word should be as a feature. This study develops a new term weighting method in which weights are based on the frequency of term occurrence in each class, integrated with a centroid-based term distribution that can minimize intra-cluster variance and maximize inter-cluster variance. The resulting ICF.TDCB term weighting method gave the best results when applied to SVM modeling on a dataset of 931 online news documents: the SVM model reached an accuracy of 0.723, outperforming other term weightings such as TF.IDF, ICF, and TDCB.
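To illustrate the contrast between document-level and class-level weighting described above, the following is a minimal sketch rather than the study's implementation. It assumes ICF denotes inverse class frequency and uses one smoothed variant, log(1 + C / cf), chosen here for illustration; the centroid-based TDCB component is not specified in the abstract and is not reproduced. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency: log(N / df), df = documents containing the term."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def icf(term, docs):
    """Assumed inverse class frequency: log(1 + C / cf), cf = classes containing the term."""
    classes = {label for _, label in docs}
    cf = len({label for tokens, label in docs if term in tokens})
    return math.log(1 + len(classes) / cf) if cf else 0.0

def weights(doc_tokens, docs, scheme):
    """Weight each term of one document as TF times the chosen collection-level factor."""
    tf = Counter(doc_tokens)
    return {term: count * scheme(term, docs) for term, count in tf.items()}

# Hypothetical toy corpus: (tokens, class label) pairs.
docs = [
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "goal"], "sport"),
    (["vote", "party", "team"], "politics"),
    (["vote", "election"], "politics"),
]

tf_idf = weights(docs[0][0], docs, idf)
tf_icf = weights(docs[0][0], docs, icf)
```

In this toy corpus "goal" occurs in only one class while "team" occurs in both, so the class-level factor gives "goal" the larger weight, which matches the abstract's point that unevenly distributed words should count for more.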
