Significance-Based Categorical Data Clustering

He, Zengyou; Hu, Lianyu; Jiang, Mudi; Liu, Yan

Publication date: 7 November 2022

Abstract

Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this void, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical p-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the cluster number. Extensive experimental studies suggest that our method is able to achieve comparable performance to state-of-the-art categorical data clustering algorithms. Moreover, the effectiveness of such a significance-based formulation on statistical cluster validation and cluster number estimation is demonstrated through comprehensive empirical results.

Comment: 36 pages, 6 figures
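The workflow summarized in the abstract (a likelihood-ratio objective, a Monte Carlo search, and an empirical p-value) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the exact test statistic, search moves, or null model, so everything below is an assumption made for illustration. The statistic is the standard G-statistic for independence between cluster label and attribute value, summed over attributes; the search is a simple random single-object reassignment that keeps non-worsening moves; the null data are produced by shuffling each attribute column independently. The function names `g_statistic`, `monte_carlo_search`, and `empirical_p_value` are hypothetical.

```python
import math
import random
from collections import Counter


def g_statistic(data, labels):
    """Assumed objective: sum over attributes of the likelihood ratio (G)
    statistic for independence between cluster label and attribute value."""
    n = len(data)
    cluster_sizes = Counter(labels)
    total = 0.0
    for j in range(len(data[0])):
        value_counts = Counter(row[j] for row in data)
        joint = Counter((lab, row[j]) for lab, row in zip(labels, data))
        for (lab, val), observed in joint.items():
            expected = cluster_sizes[lab] * value_counts[val] / n
            total += 2.0 * observed * math.log(observed / expected)
    return total


def monte_carlo_search(data, k, iters, rng):
    """Assumed search: start from random labels, propose random
    single-object reassignments, keep a move if the statistic does not drop."""
    labels = [rng.randrange(k) for _ in data]
    best = g_statistic(data, labels)
    for _ in range(iters):
        i = rng.randrange(len(data))
        old = labels[i]
        labels[i] = rng.randrange(k)
        score = g_statistic(data, labels)
        if score >= best:
            best = score
        else:
            labels[i] = old  # revert a worsening move
    return labels, best


def empirical_p_value(data, k, iters, n_null, rng):
    """Compare the optimized statistic on the real data with the optimized
    statistic on null data sets (each column shuffled independently)."""
    _, observed = monte_carlo_search(data, k, iters, rng)
    exceed = 0
    for _ in range(n_null):
        columns = [list(col) for col in zip(*data)]
        for col in columns:
            rng.shuffle(col)
        null_data = list(zip(*columns))
        _, null_stat = monte_carlo_search(null_data, k, iters, rng)
        if null_stat >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_null + 1)
```

A call such as `empirical_p_value(data, k=2, iters=200, n_null=99, rng=random.Random(0))` returns a value in (0, 1]; smaller values suggest that the clusters found are harder to explain by chance under the assumed null model. The same null-data comparison is the general idea behind gap-statistic style estimates of the cluster number, although the paper's improved gap statistic is not reproduced here.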

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2211.03956

Last time updated on 12/12/2022