Efficient Clustering via Kernel Principal Component Analysis and Optimal One-dimensional Thresholding
Several techniques are used for clustering high-dimensional data. Traditionally, such data are first reduced in dimensionality and then clustered in the lower-dimensional space with a classical algorithm such as k-means. However, k-means does not guarantee optimality, and its result depends strongly on the initialization of the cluster centers, so it is not repeatable. To overcome these drawbacks, an optimal one-dimensional clustering approach based on dimensionality reduction is proposed. The one-dimensional representation of the high-dimensional data is obtained using Kernel Principal Component Analysis. This representation is then clustered optimally with a dynamic programming algorithm that runs in polynomial time, minimizing the within-class variance while maximizing the between-class variance. The advantage of the proposed approach over standard k-means, in terms of both optimality and repeatability, is demonstrated on synthetic and real-life datasets.
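A minimal sketch of the pipeline this abstract describes, assuming an RBF kernel with scikit-learn's KernelPCA; the dynamic program is the classic O(kn²) optimal 1-D k-means (the Ckmeans.1d.dp idea), not the authors' own code:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def optimal_1d_kmeans(x, k):
    """Optimal 1-D k-means by dynamic programming on the sorted points."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # prefix sums give O(1) within-cluster cost of any interval x[i..j]
    s1 = np.concatenate([[0.0], np.cumsum(x)])
    s2 = np.concatenate([[0.0], np.cumsum(x * x)])

    def cost(i, j):  # sum of squared deviations of x[i..j] (inclusive)
        n_ij = j - i + 1
        mu = (s1[j + 1] - s1[i]) / n_ij
        return (s2[j + 1] - s2[i]) - n_ij * mu * mu

    D = np.full((k + 1, n + 1), np.inf)  # D[m, j]: best cost, first j points, m clusters
    B = np.zeros((k + 1, n + 1), dtype=int)
    D[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # last cluster covers sorted points x[i..j-1]
                c = D[m - 1, i] + cost(i, j - 1)
                if c < D[m, j]:
                    D[m, j], B[m, j] = c, i
    # backtrack the optimal cluster boundaries (start, end) in the sorted order
    bounds, j = [], n
    for m in range(k, 0, -1):
        bounds.append((B[m, j], j))
        j = B[m, j]
    return D[k, n], bounds[::-1], x

# usage: reduce to one dimension, then cluster that dimension optimally
X = np.random.rand(200, 50)                       # high-dimensional data
z = KernelPCA(n_components=1, kernel="rbf").fit_transform(X).ravel()
best_cost, boundaries, z_sorted = optimal_1d_kmeans(z, k=3)
```

Because the dynamic program is deterministic and exact in one dimension, repeated runs on the same KPCA embedding give identical clusters, which is the repeatability advantage claimed over k-means.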
Relax, no need to round: integrality of clustering formulations
We study exact recovery conditions for convex relaxations of point cloud clustering problems, focusing on two of the most common optimization problems for unsupervised clustering: k-means and k-median clustering. Motivations for focusing on convex relaxations are: (a) they come with a certificate of optimality, and (b) they are generic tools which are relatively parameter-free, not tailored to specific assumptions on the input. More precisely, we consider the distributional setting where there are k clusters in R^m and the data from each cluster consist of n points sampled from a symmetric distribution within a ball of unit radius. We ask: what is the minimal separation distance between cluster centers needed for convex relaxations to exactly recover these k clusters as the optimal integral solution? For the k-median linear programming relaxation we show a tight bound: exact recovery is obtained given arbitrarily small pairwise separation ε > 0 between the balls. In other words, the pairwise center separation is Δ > 2 + ε. Under the same distributional model, the k-means LP relaxation fails to recover such clusters at separation as large as Δ = 4. Yet, if we enforce PSD constraints on the k-means LP, we get exact cluster recovery at center separation Δ > 2√2(1 + √(1/m)). In contrast, common heuristics such as Lloyd's algorithm (a.k.a. the k-means algorithm) can fail to recover clusters in this setting; even with arbitrarily large cluster separation, k-means++ with overseeding by any constant factor fails with high probability at exact cluster recovery. To complement the theoretical analysis, we provide an experimental study of the recovery guarantees for these various methods, and discuss several open problems which these experiments suggest.

Comment: 30 pages, ITCS 2015
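The "k-means LP with PSD constraints" the abstract refers to is the standard Peng-Wei SDP relaxation. A hedged sketch of that program, assuming cvxpy with an SDP-capable solver such as SCS (the LP variant is the same program without the PSD constraint):

```python
import numpy as np
import cvxpy as cp

def kmeans_sdp(X, k):
    """Peng-Wei SDP relaxation of k-means on the rows of X."""
    n = X.shape[0]
    # squared Euclidean distance matrix
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T

    Z = cp.Variable((n, n), PSD=True)          # drop PSD=True for the weaker LP
    constraints = [
        Z >= 0,                                # entrywise nonnegativity
        cp.sum(Z, axis=1) == np.ones(n),       # each row sums to one
        cp.trace(Z) == k,                      # encodes "k clusters"
    ]
    prob = cp.Problem(cp.Minimize(cp.trace(D @ Z) / 2), constraints)
    prob.solve()
    return Z.value

# Exact recovery means the optimal Z is, up to permutation, block diagonal
# with one (1/n_a) * ones(n_a, n_a) block per cluster of size n_a; the SDP
# optimum then certifies the clustering, matching motivation (a) above.
```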
Model Assisted Variable Clustering: Minimax-optimal Recovery and Algorithms
Model-based clustering defines population level clusters relative to a model
that embeds notions of similarity. Algorithms tailored to such models yield
estimated clusters with a clear statistical interpretation. We take this view
here and introduce the class of G-block covariance models as a background model
for variable clustering. In such models, two variables in a cluster are deemed
similar if they have similar associations with all other variables. This can
arise, for instance, when groups of variables are noise corrupted versions of
the same latent factor. We quantify the difficulty of clustering data generated
from a G-block covariance model in terms of cluster proximity, measured with
respect to two related, but different, cluster separation metrics. We derive
minimax cluster separation thresholds, which are the metric values below which
no algorithm can recover the model-defined clusters exactly, and show that they
are different for the two metrics. We therefore develop two algorithms, COD and
PECOK, tailored to G-block covariance models, and study their
minimax-optimality with respect to each metric. Of independent interest is the
fact that the analysis of the PECOK algorithm, which is based on a corrected
convex relaxation of the popular K-means algorithm, provides the first
statistical analysis of such algorithms for variable clustering. Additionally,
we contrast our methods with another popular clustering method, spectral
clustering, specialized to variable clustering, and show that ensuring exact
cluster recovery via this method requires clusters to have a higher separation,
relative to the minimax threshold. Extensive simulation studies, as well as our
data analyses, confirm the applicability of our approach.

Comment: Main text: 38 pages; supplementary information: 37 pages
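A hedged sketch of the COD-style dissimilarity this abstract builds on: under a G-block covariance model, two variables a and b belong to the same cluster when their covariances with every third variable agree, so the maximal disagreement serves as a cluster-separation score. This follows my reading of the abstract, not the authors' released code:

```python
import numpy as np

def cod_dissimilarity(S):
    """COD(a, b) = max over c != a, b of |S[a, c] - S[b, c]|."""
    p = S.shape[0]
    D = np.zeros((p, p))
    for a in range(p):
        for b in range(a + 1, p):
            mask = np.ones(p, dtype=bool)
            mask[[a, b]] = False               # compare against all third variables
            D[a, b] = D[b, a] = np.max(np.abs(S[a, mask] - S[b, mask]))
    return D

# usage: compute the dissimilarity on the sample covariance matrix and
# group variables whose pairwise COD falls below a threshold
X = np.random.randn(500, 20)                   # n samples of p = 20 variables
S = np.cov(X, rowvar=False)
D = cod_dissimilarity(S)
```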
Estimating the number of clusters using diversity
It is an important and challenging problem in unsupervised learning to estimate the number of clusters in a dataset. Knowing the number of clusters is a prerequisite for many commonly used clustering algorithms such as k-means. In this paper, we propose a novel diversity-based approach to this problem. Specifically, we show that the difference between the global diversity of clusters and the sum of each cluster's local diversity over its members can be used as an effective indicator of the optimality of the number of clusters, where diversity is measured by Rao's quadratic entropy. A notable advantage of our proposed method is that it encourages balanced clustering by taking into account both the sizes of clusters and the distances between clusters. In other words, it is less prone to very small "outlier" clusters than existing methods. Our extensive experiments on both synthetic and real-world datasets (with known ground-truth clustering) have demonstrated that our proposed method is robust for clusters of different sizes, variances, and shapes, and that it is more accurate than existing methods (including elbow, Calinski-Harabasz, silhouette, and gap-statistic) at finding the optimal number of clusters.
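A hedged sketch of the diversity indicator as described in the abstract, assuming k-means for the candidate partitions; the precise weighting and the rule for selecting k are my reading of the abstract, not the paper's code:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def rao_q(points, weights):
    """Rao's quadratic entropy: sum_ij p_i p_j d(i, j)."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    d = cdist(points, points)
    return float(p @ d @ p)

def diversity_indicator(X, k):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    sizes = np.bincount(km.labels_, minlength=k)
    # global diversity: the clusters as units, weighted by their sizes
    global_div = rao_q(km.cluster_centers_, sizes)
    # local diversity: the members within each cluster, uniformly weighted
    local_div = sum(
        rao_q(X[km.labels_ == c], np.ones(sizes[c])) for c in range(k)
    )
    return global_div - local_div

# usage: scan candidate k and take the indicator's peak as the estimate
X = np.random.randn(300, 2)
scores = {k: diversity_indicator(X, k) for k in range(2, 9)}
k_best = max(scores, key=scores.get)
```

Weighting the global term by cluster sizes is what penalizes tiny "outlier" clusters: a near-empty cluster contributes almost nothing to the global diversity while still paying its local cost.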