
    Clustering with Spectral Norm and the k-means Algorithm

    There has been much progress on efficient algorithms for clustering data points generated by a mixture of k probability distributions under the assumption that the means of the distributions are well-separated, i.e., the distance between the means of any two distributions is at least Ω(k) standard deviations. These results generally make heavy use of the generative model and particular properties of the distributions. In this paper, we show that a simple clustering algorithm works without assuming any generative (probabilistic) model. Our only assumption is what we call a "proximity condition": the projection of any data point onto the line joining its cluster center to any other cluster center is Ω(k) standard deviations closer to its own center than the other center. Here the notion of standard deviations is based on the spectral norm of the matrix whose rows represent the difference between a point and the mean of the cluster to which it belongs. We show that in the generative models studied, our proximity condition is satisfied and so we are able to derive most known results for generative models as corollaries of our main result. We also prove some new results for generative models - e.g., we can cluster all but a small fraction of points only assuming a bound on the variance. Our algorithm relies on the well known k-means algorithm, and along the way, we prove a result of independent interest -- that the k-means algorithm converges to the "true centers" even in the presence of spurious points provided the initial (estimated) centers are close enough to the corresponding actual centers and all but a small fraction of the points satisfy the proximity condition. Finally, we present a new technique for boosting the ratio of inter-center separation to standard deviation.
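
    The proximity condition admits a direct computational check. Below is a minimal NumPy sketch, assuming a data matrix X (rows are points), current assignments `labels`, and estimated `centers`; the constant c and the exact form of the separation term are illustrative simplifications rather than the paper's precise constants, and the Lloyd routine is the generic k-means iteration the paper builds on, not the authors' full algorithm.

        import numpy as np

        def proximity_ok(X, labels, centers, c=1.0):
            """Check that each point, projected onto the line joining its own center
            to any other center, is closer to its own center by a margin proportional
            to the spectral norm of X - C (rows = point minus assigned center)."""
            k = centers.shape[0]
            sigma = np.linalg.norm(X - centers[labels], 2)      # spectral norm of A - C
            sizes = np.bincount(labels, minlength=k)
            for i, r in enumerate(labels):
                for s in range(k):
                    if s == r:
                        continue
                    d = centers[s] - centers[r]
                    u = d / np.linalg.norm(d)                   # unit vector from center r to center s
                    margin = c * k * (1 / np.sqrt(sizes[r]) + 1 / np.sqrt(sizes[s])) * sigma
                    proj = (X[i] - (centers[r] + centers[s]) / 2) @ u   # signed offset from the midpoint
                    if proj > -margin / 2:                      # not sufficiently closer to its own center
                        return False
            return True

        def lloyd(X, centers, iters=50):
            """Plain Lloyd (k-means) iterations from given initial centers;
            assumes no cluster becomes empty along the way."""
            for _ in range(iters):
                labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
                centers = np.vstack([X[labels == j].mean(0) for j in range(len(centers))])
            return labels, centers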

    Improved K-means clustering algorithms : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University, New Zealand

    The K-means clustering algorithm is designed to divide the samples into subsets with the goal of maximizing the intra-subset similarity and the inter-subset dissimilarity, where the similarity measures the relationship between two samples. As an unsupervised learning technique, K-means is considered one of the most widely used clustering algorithms and has been applied in a variety of areas such as artificial intelligence, data mining, biology, psychology, marketing, and medicine. However, K-means is not robust, and its clustering result depends on the initialization, the similarity measure, and the predefined cluster number. Previous research has addressed some of these issues individually, but not in a unified framework, and fixing only one of them does not guarantee the best performance. Improving K-means, one of the most famous and widely used clustering algorithms, by solving these issues simultaneously is therefore both challenging and significant. This thesis conducts extensive research on the K-means clustering algorithm with the aim of improving it. First, we propose the Initialization-Similarity (IS) clustering algorithm to address the initialization and the similarity measure of K-means in a unified way. Specifically, we fix the initialization of the clustering by using sum-of-norms (SON) regularization, which outputs a new representation of the original samples, and we learn the similarity matrix based on the data distribution. The derived new representation is then used to conduct K-means clustering. Second, we propose a Joint Feature Selection with Dynamic Spectral (FSDS) clustering algorithm to address the cluster number determination, the similarity measure, and the robustness of the clustering by selecting effective features and reducing the influence of outliers simultaneously. Specifically, we learn the similarity matrix based on the data distribution and add a rank constraint on the Laplacian matrix of the learned similarity matrix to output the cluster number automatically. Furthermore, the proposed algorithm employs the L2,1-norm as a sparsity constraint on the regularization term and the loss function, to remove redundant features and to reduce the influence of outliers, respectively. Third, we propose a Joint Robust Multi-view (JRM) spectral clustering algorithm that clusters multi-view data while addressing the initialization, the cluster number determination, the similarity measure learning, the removal of redundant features, and the reduction of outlier influence in a unified way. Finally, the proposed algorithms outperform state-of-the-art clustering algorithms on real data sets, and we theoretically prove the convergence of the proposed optimization methods for the proposed objective functions.
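
    A small demonstration of the initialization sensitivity that motivates this work (plain scikit-learn k-means, not the IS, FSDS, or JRM algorithms proposed in the thesis): single-init k-means started from different random seeds can end up in clusterings of noticeably different quality.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        # Synthetic data with 5 overlapping clusters.
        X, _ = make_blobs(n_samples=500, centers=5, cluster_std=2.0, random_state=0)

        inertias = []
        for seed in range(10):
            km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
            inertias.append(km.inertia_)

        print("best inertia :", min(inertias))
        print("worst inertia:", max(inertias))   # often noticeably worse than the best run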

    Clustering with feature selection using alternating minimization. Application to computational biology

    This paper deals with unsupervised clustering with feature selection. The problem is to estimate both the labels and a sparse projection matrix of weights. To address this combinatorial non-convex problem while maintaining strict control on the sparsity of the matrix of weights, we propose an alternating minimization of a Frobenius-norm criterion. We provide a new efficient algorithm named K-sparse which alternates k-means with projection-gradient minimization. The projection-gradient step is a splitting-type method with exact projection onto the ℓ1 ball to promote sparsity. The convergence of the projection-gradient step is addressed, and a preliminary analysis of the alternating minimization is given. The Frobenius-norm criterion converges as the number of iterates in Algorithm K-sparse goes to infinity. Experiments on single-cell RNA sequencing datasets show that our method significantly improves on PCA k-means, spectral clustering, SIMLR, and Sparcl. The complexity of K-sparse is linear in the number of samples (cells), so the method scales up to large datasets. Finally, we extend K-sparse to supervised classification.
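
    The exact projection onto the ℓ1 ball mentioned above is a standard, self-contained building block. The sketch below implements the usual sort-based projection and shows, in a comment, how a single projection-gradient step would use it; the step size, radius, and the gradient callback grad_f are hypothetical placeholders, and this is not the authors' K-sparse code.

        import numpy as np

        def project_l1_ball(v, z=1.0):
            """Exact Euclidean projection of v onto the l1 ball {x : ||x||_1 <= z}."""
            if np.abs(v).sum() <= z:
                return v.copy()
            u = np.sort(np.abs(v))[::-1]                        # magnitudes, descending
            css = np.cumsum(u)
            rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - z))[0][-1]
            theta = (css[rho] - z) / (rho + 1.0)
            return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

        # One projection-gradient step on a weight vector w for a smooth criterion
        # with gradient grad_f (placeholder): take a gradient step, then project.
        # w = project_l1_ball(w - step * grad_f(w), z=5.0)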

    Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

    We show how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained k-rank approximation problems to within (1+ε) error. Importantly, this class of problems includes k-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just O(k) dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For k-means dimensionality reduction, we provide (1+ε) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for A, but can be used directly to compute this subspace. Finally, for k-means clustering, we show how to achieve a (9+ε) approximation by Johnson-Lindenstrauss projecting data points to just O(log k / ε²) dimensions. This gives the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.
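
    A minimal sketch of the Johnson-Lindenstrauss route described in the last result: project the points down to a dimension on the order of log(k)/ε², then run k-means in the sketched space. The target dimension below is an illustrative choice rather than the paper's exact constant, and scikit-learn's KMeans stands in for whichever k-means solver one prefers.

        import numpy as np
        from sklearn.cluster import KMeans

        def jl_kmeans(X, k, eps=0.5, random_state=0):
            """Cluster X (rows are points) after a Gaussian JL projection."""
            rng = np.random.default_rng(random_state)
            d = max(1, int(np.ceil(np.log(k) / eps ** 2)))       # target dimension ~ log(k) / eps^2
            S = rng.normal(size=(X.shape[1], d)) / np.sqrt(d)    # Gaussian JL sketch matrix
            labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X @ S)
            return labels   # the k-means cost of these labels is then evaluated in the original space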

    How to Round Subspaces: A New Spectral Clustering Algorithm

    A basic problem in spectral clustering is the following. If a solution obtained from the spectral relaxation is close to an integral solution, is it possible to find this integral solution even though they might be expressed in completely different bases? In this paper, we propose a new spectral clustering algorithm. It can recover a k-partition such that the subspace corresponding to the span of its indicator vectors is O(√opt) close to the original subspace in spectral norm, with opt being the minimum possible (opt ≤ 1 always). Moreover, our algorithm does not impose any restriction on the cluster sizes. Previously, no algorithm was known which could find a k-partition closer than o(k·opt). We present two applications of our algorithm. The first finds a disjoint union of bounded-degree expanders which approximate a given graph in spectral norm. The second approximates the sparsest k-partition in a graph in which each cluster has expansion at most φ_k, provided φ_k ≤ O(λ_{k+1}), where λ_{k+1} is the (k+1)-st eigenvalue of the Laplacian matrix. This significantly improves upon previous algorithms, which required φ_k ≤ O(λ_{k+1}/k). Comment: Appeared in SODA 201
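
    For context, the sketch below is the classical spectral-clustering pipeline that this line of work refines: embed the graph with the bottom-k eigenvectors of the normalized Laplacian and round the embedding with k-means. It is only the standard rounding step, not the new subspace-rounding algorithm of the paper.

        import numpy as np
        from scipy.sparse.csgraph import laplacian
        from sklearn.cluster import KMeans

        def spectral_partition(W, k):
            """W: symmetric (dense) affinity matrix; returns a k-way labelling."""
            L = laplacian(W, normed=True)                        # normalized graph Laplacian
            vals, vecs = np.linalg.eigh(L)                       # eigenvalues in ascending order
            U = vecs[:, :k]                                      # bottom-k eigenvectors span the relaxed solution
            U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)  # row-normalize
            return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)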

    Sketch-based subspace clustering of hyperspectral images

    Sparse subspace clustering (SSC) techniques provide the state of the art in clustering of hyperspectral images (HSIs). However, their computational complexity hinders their applicability to large-scale HSIs. In this paper, we propose a large-scale SSC-based method which can effectively process large HSIs while also achieving improved clustering accuracy compared with current SSC methods. We build our approach on the emerging concept of sketched subspace clustering, which, to our knowledge, has not yet been explored in hyperspectral imaging; moreover, results on large-scale SSC approaches for HSI are scarce. We show that a direct application of sketched SSC does not provide satisfactory performance on HSIs, but it does provide an excellent basis for an effective and elegant method, which we build by extending this approach with a spatial prior and deriving the corresponding solver. In particular, a random matrix constructed by the Johnson-Lindenstrauss transform is first used to sketch the self-representation dictionary into a compact dictionary, which significantly reduces the number of sparse coefficients to be solved for, thereby reducing the overall complexity. To alleviate the effect of noise and within-class spectral variations of HSIs, we employ a total variation constraint on the coefficient matrix, which accounts for the spatial dependencies among neighbouring pixels. We derive an efficient solver for the resulting optimization problem and theoretically prove its convergence under mild conditions. Experimental results on real HSIs show a notable improvement over traditional SSC-based methods and state-of-the-art methods for clustering of large-scale images.
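
    The dictionary-sketching step described above can be illustrated in a few lines. The sketch below omits the paper's total-variation spatial prior and its dedicated solver, and uses scikit-learn's Lasso as a stand-in sparse coder, so it is only a simplified illustration of the idea; sketch_size and alpha are hypothetical parameters.

        import numpy as np
        from sklearn.linear_model import Lasso

        def sketched_self_representation(X, sketch_size=200, alpha=1e-3, seed=0):
            """X: (n_bands, n_pixels) matrix of pixel spectra.  Returns the
            (sketch_size, n_pixels) coefficient matrix w.r.t. a sketched dictionary."""
            rng = np.random.default_rng(seed)
            n_pixels = X.shape[1]
            R = rng.normal(size=(n_pixels, sketch_size)) / np.sqrt(sketch_size)  # JL sketch matrix
            D = X @ R                                            # compact dictionary, n_bands x sketch_size
            coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
            B = np.column_stack([coder.fit(D, X[:, j]).coef_ for j in range(n_pixels)])
            return B   # an affinity built from B would then feed a spectral-clustering step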