Truecluster matching
Cluster matching by permuting cluster labels is important in many clustering
contexts such as cluster validation and cluster ensemble techniques. The
classic approach is to minimize the Euclidean distance between two cluster
solutions, which induces inappropriate stability in certain settings. Therefore,
we present the truematch algorithm that introduces two improvements best
explained in the crisp case. First, instead of maximizing the trace of the
cluster crosstable, we propose to maximize a chi-square transformation of this
crosstable. Thus, the trace will not be dominated by the cells with the largest
counts but by the cells with the most non-random observations, taking into
account the marginals. Second, we suggest a probabilistic component in order to
break ties and to make the matching algorithm truly random on random data. The
truematch algorithm is designed as a building block of the truecluster
framework and scales in polynomial time. First simulation results confirm that
the truematch algorithm gives more consistent truecluster results for unequal
cluster sizes. Free R software is available.
Comment: 15 pages, 2 figures. Details the matching needed for "Truecluster:
robust scalable clustering with model selection" but can also be used in
different contexts.
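The difference between the two matching criteria in the crisp case can be sketched in a few lines of Python. This is an illustration only, not the authors' R implementation: it assumes that the chi-square transformation is the Pearson residual (observed - expected) / sqrt(expected), it searches all permutations by brute force (fine for a handful of clusters, but not the polynomial-time procedure the abstract refers to), and the function names `crosstable`, `pearson_residuals`, and `best_matching` are hypothetical.

```python
import itertools
import math
import random


def crosstable(labels_a, labels_b, k):
    """Count how often label i in solution A co-occurs with label j in solution B."""
    table = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        table[a][b] += 1
    return table


def pearson_residuals(table):
    """Assumed chi-square transformation: (observed - expected) / sqrt(expected)."""
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    k = len(table)
    res = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            expected = row[i] * col[j] / n
            if expected > 0:
                res[i][j] = (table[i][j] - expected) / math.sqrt(expected)
    return res


def best_matching(score, rng=random):
    """Return a permutation p maximizing sum(score[i][p[i]]).

    Brute force over all permutations; ties are broken uniformly at random,
    mirroring the probabilistic component described in the abstract.
    """
    k = len(score)
    best, best_val = [], -math.inf
    for perm in itertools.permutations(range(k)):
        val = sum(score[i][perm[i]] for i in range(k))
        if val > best_val + 1e-12:
            best, best_val = [perm], val
        elif abs(val - best_val) <= 1e-12:
            best.append(perm)
    return rng.choice(best)


# Hypothetical crosstable with very unequal cluster sizes.
table = [[10, 80],
         [0, 10]]

raw_match = best_matching(table)                     # maximizes the raw trace
chi_match = best_matching(pearson_residuals(table))  # maximizes the residual trace
print(raw_match, chi_match)
```

In this made-up table the raw trace favors swapping the labels, because the cell with 80 counts dominates (80 + 0 against 10 + 10). Once the marginals are taken into account, that cell holds slightly fewer observations than expected under independence (81), so the residual-based criterion keeps the labels as they are.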
Truecluster: robust scalable clustering with model selection
Data-based classification is fundamental to most branches of science. While
recent years have brought enormous progress in various areas of statistical
computing and clustering, some general challenges in clustering remain: model
selection, robustness, and scalability to large datasets. We consider the
important problem of deciding on the optimal number of clusters, given an
arbitrary definition of space and clusteriness. We show how to construct a
cluster information criterion that allows objective model selection. Differing
from other approaches, our truecluster method does not require specific
assumptions about underlying distributions, dissimilarity definitions or
cluster models. Truecluster puts arbitrary clustering algorithms into a generic
unified (sampling-based) statistical framework. It is scalable to big datasets
and provides robust cluster assignments and case-wise diagnostics. Truecluster
will make clustering more objective, allow for automation, and save time
and costs. Free R software is available.
Comment: Article (10 figures). Changes in 2nd version: dropped supplements in
favor of better integrated presentation, better literature coverage, put into
proper English. Author's website available via http://www.truecluster.co