Kernel Metric Learning for Clustering Mixed-type Data

Ghashti, Jesse S.; Thompson, John R. J.

Authors: Jesse S. Ghashti, John R. J. Thompson
Publication date: 2 June 2023

Abstract

Distance-based clustering and classification are widely used across many fields to group mixed numeric and categorical data. A predefined distance measure is used to cluster data points based on their dissimilarity. While numerous distance measures exist for data with purely numerical attributes, along with several metrics for ordered and unordered categorical attributes, finding an optimal distance for mixed-type data remains an open problem. Many metrics convert numerical attributes to categorical ones or vice versa. They either handle the data points as a single attribute type or calculate a distance for each attribute separately and sum the results. We propose a metric that uses mixed kernels to measure dissimilarity, with cross-validated optimal kernel bandwidths. Our approach improves clustering accuracy when used with existing distance-based clustering algorithms on simulated and real-world datasets containing purely continuous, categorical, and mixed-type data.

Comment: 23 pages, 5 tables, 2 figures
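
To make the idea of a mixed-kernel dissimilarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes a product kernel with a Gaussian kernel on continuous attributes and an Aitchison-Aitken kernel on unordered categorical attributes, and it assumes the kernel-induced form d(x, y) = k(x, x) + k(y, y) - 2 k(x, y). The function names, the fixed bandwidths `h` and `lam`, and the integer coding of categories are all illustrative assumptions; the abstract states that bandwidths are chosen by cross-validation, which is not shown here.

```python
import numpy as np

def mixed_kernel(xc, xd, yc, yd, h, lam, levels):
    """Product kernel over mixed attributes (illustrative assumption).

    xc, yc : continuous parts of the two points
    xd, yd : categorical parts, coded as integers
    h      : bandwidths for continuous attributes (> 0)
    lam    : bandwidths for categorical attributes, 0 <= lam <= (c - 1) / c
    levels : number of categories c for each categorical attribute
    """
    # Gaussian kernel on each continuous attribute (unnormalised).
    cont = np.exp(-0.5 * ((xc - yc) / h) ** 2)
    # Aitchison-Aitken kernel: 1 - lam on a match, lam / (c - 1) otherwise.
    cat = np.where(xd == yd, 1.0 - lam, lam / (levels - 1))
    return float(np.prod(cont) * np.prod(cat))

def kernel_dissimilarity(xc, xd, yc, yd, h, lam, levels):
    """Kernel-induced dissimilarity k(x,x) + k(y,y) - 2 k(x,y) (assumed form)."""
    kxx = mixed_kernel(xc, xd, xc, xd, h, lam, levels)
    kyy = mixed_kernel(yc, yd, yc, yd, h, lam, levels)
    kxy = mixed_kernel(xc, xd, yc, yd, h, lam, levels)
    return kxx + kyy - 2.0 * kxy

# Two points with two continuous and two categorical attributes each.
xc, xd = np.array([0.0, 1.0]), np.array([0, 2])
yc, yd = np.array([0.5, 1.0]), np.array([1, 2])
h = np.array([1.0, 1.0])        # hypothetical fixed bandwidths
lam = np.array([0.2, 0.2])      # hypothetical fixed bandwidths
levels = np.array([3, 3])       # three categories per categorical attribute

d_xy = kernel_dissimilarity(xc, xd, yc, yd, h, lam, levels)
print(d_xy)
```

A pairwise matrix of such dissimilarities could then be passed to any distance-based clustering algorithm that accepts precomputed distances. Larger bandwidths flatten the corresponding kernel, so that attribute contributes less to the dissimilarity.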

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2306.01890

Last time updated on 08/06/2023