Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification

Sri Kusuma Aditya, Christian; Sumadi, Fauzi Dwi Setiawan

Authors: Christian Sri Kusuma Aditya; Fauzi Dwi Setiawan Sumadi
Publication date: 24 November 2023
Publisher: Universitas Muhammadiyah Malang

Abstract

A text retrieval system needs a method that returns highly relevant documents in response to user requests. Term weighting is one of the most important stages of text representation. Term Frequency (TF) counts the occurrences of a word in each document, while Inverse Document Frequency (IDF) accounts for how widely the word is distributed across the document collection. However, TF.IDF weighting cannot represent how words are distributed across documents that belong to many classes or categories. The more unevenly a word is distributed across categories, the more important that word should be as a feature. This study develops a new term weighting method in which weights are based on the frequency of occurrence of terms in each class, integrated with a centroid-based term distribution that can minimize intra-cluster similarity and maximize inter-cluster variance. The resulting ICF.TDCB term weighting method gave the best results when applied to SVM modeling on a dataset of 931 online news documents: the SVM model reached an accuracy of 0.723, outperforming other term weightings such as TF.IDF, ICF, and TDCB.
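To make the contrast between document-level and class-level weighting concrete, the following minimal sketch computes a textbook IDF factor and a commonly used inverse class frequency factor on a toy corpus. It is not the ICF.TDCB method from the paper, whose formula the abstract does not give. The reading of ICF as "inverse class frequency", the formulas `log(N / df)` and `log(C / cf)`, the function names, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf_and_icf(docs, labels):
    """Return (idf, icf) dictionaries for tokenized docs with class labels.

    Assumed textbook forms (not taken from the paper):
      idf(t) = log(N / df(t)),  df(t) = number of documents containing t
      icf(t) = log(C / cf(t)),  cf(t) = number of classes containing t
    """
    n_docs = len(docs)
    n_classes = len(set(labels))
    df = Counter()        # document frequency per term
    term_classes = {}     # set of classes in which each term occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            term_classes.setdefault(term, set()).add(label)
    idf = {t: math.log(n_docs / df[t]) for t in df}
    icf = {t: math.log(n_classes / len(term_classes[t])) for t in df}
    return idf, icf

def weight(tokens, factor):
    """Multiply raw term frequency by a per-term factor (IDF or ICF)."""
    tf = Counter(tokens)
    return {t: tf[t] * factor[t] for t in tf}

# Hypothetical toy corpus: two classes, four short documents.
docs = [
    ["goal", "match", "team"],
    ["match", "goal", "goal"],
    ["election", "vote", "team"],
    ["vote", "party", "match"],
]
labels = ["sport", "sport", "politics", "politics"]

idf, icf = idf_and_icf(docs, labels)
tf_icf_doc1 = weight(docs[1], icf)
```

In this toy corpus "goal" and "team" each occur in two of the four documents, so they receive the same IDF. "goal" occurs only in the sport class, while "team" occurs in both classes, so the class-level factor keeps "goal" and reduces "team" to zero. This is the kind of uneven class distribution that the abstract argues TF.IDF alone cannot capture.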

Available Versions

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control

oai:ojs.localhost:article/1793

Last time updated on 10/12/2023