Empirical Comparative Analysis of 1-of-K Coding and K-Prototypes in Categorical Clustering

Franco, Hector; Pugh, John; Ross, Robert; Wang, Fei

Publication date: 20 September 2016
Publisher: Dublin Institute of Technology

Abstract

Clustering is a fundamental machine learning task that partitions data into homogeneous groups. K-means and its variants are the most widely used class of clustering algorithms today. However, the original k-means algorithm can only be applied to numeric data. Categorical data must first be converted into numeric form through 1-of-K coding, which itself causes many problems. K-prototypes, another clustering algorithm derived from k-means, can handle categorical data directly by adopting a different notion of distance. In this paper, we systematically compare these two methods through an experimental analysis. Our analysis shows that K-prototypes is better suited to large datasets, while the performance of k-means with 1-of-K coding is more stable. We believe these are useful heuristics for clustering methods working with highly categorical data.
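To make the two notions of distance in the abstract concrete, the following is a minimal sketch and not the authors' experimental code. It assumes purely categorical records, uses squared Euclidean distance on the 1-of-K vectors (the distance k-means minimises), and uses a simple mismatch count for the categorical side of K-prototypes. The toy records and the helper names `one_of_k`, `squared_euclidean` and `matching_dissimilarity` are hypothetical and chosen only for illustration.

```python
def one_of_k(records):
    """Encode categorical records as 0/1 vectors (1-of-K coding)."""
    n_attrs = len(records[0])
    # Sorted distinct values per attribute give a stable column order.
    levels = [sorted({r[j] for r in records}) for j in range(n_attrs)]
    encoded = []
    for r in records:
        row = []
        for j, value in enumerate(r):
            # One column per level; exactly one column is 1 per attribute.
            row.extend(1.0 if value == level else 0.0 for level in levels[j])
        encoded.append(row)
    return encoded


def squared_euclidean(a, b):
    """Distance used by k-means on the encoded numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ
    (assumed here as the categorical distance in K-prototypes)."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data: (colour, size)
records = [("red", "small"), ("red", "large"), ("blue", "large")]
encoded = one_of_k(records)

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        print(records[i], records[j],
              squared_euclidean(encoded[i], encoded[j]),
              matching_dissimilarity(records[i], records[j]))
```

On data like this, each mismatched attribute contributes exactly 2 to the squared Euclidean distance between the encoded vectors, so the two measures agree up to a constant factor for pairs of records. The methods still differ in how cluster centres are represented: k-means averages the 0/1 columns, whereas K-prototypes keeps a categorical value per attribute.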
