Distance-based clustering and classification are widely used in various
fields to group mixed numeric and categorical data. A predefined distance
measure is used to cluster data points based on their dissimilarity. While
numerous distance measures exist for purely numerical data, along with several
metrics for ordered and unordered categorical data, finding an optimal
distance for mixed-type data remains an open problem. Many existing metrics
either convert numerical attributes to categorical ones (or vice versa) and
treat the data as a single attribute type, or compute a distance for each
attribute separately and sum the results. We propose a metric that uses mixed kernels to measure
dissimilarity, with optimal kernel bandwidths chosen by cross-validation. Our
approach improves clustering accuracy when used with existing distance-based
clustering algorithms on simulated and real-world datasets containing purely
continuous, categorical, and mixed-type data.

Comment: 23 pages, 5 tables, 2 figures
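
As a rough illustration of what a kernel-based dissimilarity for mixed-type data can look like, the Python sketch below sums, over attributes, the kernel-induced quantity k(a, a) + k(b, b) - 2 k(a, b). The Gaussian kernel for continuous attributes, the Aitchison-Aitken kernel for unordered categorical attributes, the summation across attributes, and all function and parameter names are assumptions made for this example; it is not the metric proposed in the abstract, and the bandwidths are fixed by hand rather than chosen by cross-validation.

```python
import math


def gaussian_kernel(a, b, h):
    """Gaussian kernel for a continuous attribute, bandwidth h > 0."""
    u = (a - b) / h
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def aitchison_aitken_kernel(a, b, lam, n_levels):
    """Aitchison-Aitken kernel for an unordered categorical attribute.

    lam is the bandwidth, expected in [0, (n_levels - 1) / n_levels].
    """
    return 1.0 - lam if a == b else lam / (n_levels - 1)


def attribute_kernel(a, b, kind, bandwidth, n_levels):
    """Dispatch to the kernel matching the attribute type (hypothetical helper)."""
    if kind == "num":
        return gaussian_kernel(a, b, bandwidth)
    return aitchison_aitken_kernel(a, b, bandwidth, n_levels)


def kernel_dissimilarity(x, y, kinds, bandwidths, levels):
    """Sum of per-attribute kernel-induced dissimilarities (illustrative only)."""
    total = 0.0
    for a, b, kind, h, c in zip(x, y, kinds, bandwidths, levels):
        k_aa = attribute_kernel(a, a, kind, h, c)
        k_bb = attribute_kernel(b, b, kind, h, c)
        k_ab = attribute_kernel(a, b, kind, h, c)
        total += k_aa + k_bb - 2.0 * k_ab
    return total


# Toy records with one continuous and one 3-level categorical attribute.
kinds = ("num", "cat")
bandwidths = (1.0, 0.2)   # hand-picked for the example, not cross-validated
levels = (None, 3)        # number of categories; unused for numeric attributes

x = (1.0, "red")
y = (1.5, "blue")
z = (4.0, "blue")

d_xy = kernel_dissimilarity(x, y, kinds, bandwidths, levels)
d_xz = kernel_dissimilarity(x, z, kinds, bandwidths, levels)
```

A pairwise matrix of such values could then be passed to any clustering routine that accepts precomputed dissimilarities; in a fuller treatment the bandwidths would be chosen from the data rather than set by hand.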