Resampling approach for cluster model selection

Authors: Avros R., Barzily Z., Toledano-Kitai D., Volkovich Z., Weber Gerhard Wilhelm
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/10/2011

In cluster analysis, selecting the number of clusters is an "ill-posed" problem of crucial importance. In this paper we propose a resampling method for assessing cluster stability. Our model suggests that, in the case of the "true" number of clusters, samples' occurrences in clusters can be considered realizations of the same random variable. Thus, similarity between different cluster solutions is measured by means of compound and simple probability metrics. Compound criteria result in validation rules employing the stability content of clusters. Simple probability metrics, in particular those based on kernels, provide more flexible geometrical criteria. We analyze several applications of probability metrics combined with methods intended to simulate cluster occurrences. Numerical experiments are provided to demonstrate and compare the different metrics and simulation approaches.
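As an illustration of a kernel-based probability metric between two samples' occurrences in a cluster, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) with a Gaussian kernel. This is an assumption made for illustration: the abstract does not say which kernel metric the authors use, and the names `gaussian_kernel`, `mmd_squared`, `worst_cluster_distance` and the `bandwidth` parameter are hypothetical.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as coordinate tuples."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two point sets.

    Values near 0 suggest the two sets look like draws from one distribution.
    """
    def mean_k(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

def worst_cluster_distance(sample1, labels1, sample2, labels2, bandwidth=1.0):
    """Largest per-cluster MMD between two clustered samples.

    labels1/labels2 are cluster ids assigned to the points of each sample
    by some clustering algorithm (not implemented here).
    """
    worst = 0.0
    for c in set(labels1) | set(labels2):
        xs = [p for p, l in zip(sample1, labels1) if l == c]
        ys = [p for p, l in zip(sample2, labels2) if l == c]
        if not xs or not ys:
            return math.inf  # a cluster missing from one sample: maximally unstable
        worst = max(worst, mmd_squared(xs, ys, bandwidth))
    return worst
```

In a resampling loop one would draw pairs of samples, cluster them for each candidate number of clusters, and compare the resulting distances across candidates; smaller values would point to a more stable partition under this assumed metric.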

OpenMETU (Middle East Technical University)

An application of the minimal spanning tree approach to the cluster stability problem

Authors: Avros R., Barzily Z., Toledano-Kitai D., Volkovich Z., Weber Gerhard Wilhelm
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/03/2012

Among the areas of data and text mining employed today in OR, science, economy and technology, clustering theory serves as a preprocessing step in data analysis. An important component of clustering theory is the determination of the true number of clusters. This problem has not been satisfactorily solved. In our paper, this problem is addressed by the cluster stability approach. For several possible numbers of clusters, we estimate the stability of the partitions obtained from clustering of samples. Partitions are considered consistent if their clusters are stable. Cluster validity is measured by the total number of edges, in the clusters' minimal spanning trees, connecting points from different samples. Specifically, we use the Friedman and Rafsky two-sample test statistic. The homogeneity hypothesis of well-mingled samples within the clusters leads to an asymptotically normal distribution of the considered statistic. Resting upon this fact, the standard score of the mentioned edge count is computed, and the partition quality is represented by the worst cluster, corresponding to the minimal standard score value. It is natural to expect that the true number of clusters can be characterized by the empirical distribution having the shortest left tail. The proposed methodology sequentially creates the described distribution and estimates its left-asymmetry. Several numerical experiments are presented that demonstrate the ability of the approach to detect the true number of clusters.
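The edge-counting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it builds a Euclidean minimal spanning tree with Prim's algorithm, counts the edges joining points from different samples, and standardizes the count using the permutation mean 2mn/N and the conditional variance formula usually attributed to Friedman and Rafsky (1979). The function names and the 0/1 encoding of sample membership are assumptions.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def cross_edge_score(points, labels):
    """Return (cross-edge count, standard score) for one cluster.

    labels[i] is 0 or 1: which of the two samples point i came from.
    """
    N = len(points)
    m = sum(1 for s in labels if s == 0)
    n = N - m
    if N < 4 or m == 0 or n == 0:
        raise ValueError("need at least 4 points and both samples present")
    edges = mst_edges(points)
    R = sum(1 for i, j in edges if labels[i] != labels[j])
    degree = [0] * N
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    C = sum(d * (d - 1) // 2 for d in degree)  # edge pairs sharing a node
    mean = 2.0 * m * n / N
    var = (2.0 * m * n / (N * (N - 1))) * (
        (2.0 * m * n - N) / N
        + (C - N + 2) / ((N - 2) * (N - 3)) * (N * (N - 1) - 4.0 * m * n + 2)
    )
    return R, (R - mean) / math.sqrt(var)

def partition_quality(clusters):
    """Worst-cluster rule: minimal standard score over (points, labels) pairs."""
    return min(cross_edge_score(pts, lab)[1] for pts, lab in clusters)
```

Repeating this over many sample pairs for each candidate number of clusters yields the empirical distribution of worst-cluster scores whose left tail the abstract refers to.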

OpenMETU (Middle East Technical University)