A Modified Hierarchical Agglomerative Approach for Efficient Document Clustering System

Abstract

The growing volume of text documents makes it difficult to organize them effectively and efficiently, which has created strong demand for tools that turn data into useful knowledge. Document clustering is one technique that can help meet this demand. Its main function is to group documents automatically so that documents within a cluster are very similar to one another but dissimilar to documents in other clusters. This research proposes a Modified Agglomerative Hierarchical Clustering (MAHC) algorithm based on the hierarchical method. Many traditional systems build the data representation matrix from term frequencies. The modified algorithm instead builds the data representation matrix based only on the occurrence of items, not on their frequency. The proposed algorithm can increase clustering quality because it merges related or similar documents into the same cluster efficiently. Moreover, it can reduce processing time compared with existing methods. In this paper, the clustering performance of the proposed algorithm and the original clustering algorithm is compared and evaluated using the F-measure.
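
To make the distinction between a frequency-based and an occurrence-based representation concrete, the following is a minimal sketch in Python. It is not the MAHC algorithm itself, whose details are not given in this abstract. The whitespace tokenizer, the Jaccard similarity, the average-link merge rule, the sample documents, and all function names (`build_matrices`, `jaccard`, `agglomerate`) are assumptions chosen only for illustration.

```python
from itertools import combinations

def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()

def build_matrices(docs):
    """Return (vocabulary, term-frequency matrix, binary occurrence matrix)."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    index = {t: i for i, t in enumerate(vocab)}
    freq = [[0] * len(vocab) for _ in docs]
    for row, d in zip(freq, docs):
        for t in tokenize(d):
            row[index[t]] += 1
    # Occurrence matrix: 1 if the term appears at all, regardless of count.
    occ = [[1 if c > 0 else 0 for c in row] for row in freq]
    return vocab, freq, occ

def jaccard(a, b):
    # Similarity of two binary rows: shared terms / terms in either row.
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def agglomerate(rows, k):
    """Naive average-link agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        def link(pair):
            a, b = pair
            sims = [jaccard(rows[i], rows[j])
                    for i in clusters[a] for j in clusters[b]]
            return sum(sims) / len(sims)
        # Merge the most similar pair of clusters (a < b always holds here).
        a, b = max(combinations(range(len(clusters)), 2), key=link)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical sample documents, used only to exercise the sketch.
docs = [
    "apple banana apple",
    "banana apple fruit",
    "car engine car car",
    "engine car road",
]
vocab, freq, occ = build_matrices(docs)
clusters = agglomerate(occ, 2)
```

In this sketch, the repeated word "car" in the third document contributes a count of 3 to the frequency matrix but only a 1 to the occurrence matrix, so similarity depends on which terms two documents share rather than on how often each term is repeated.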
