Clustering with Spectral Norm and the k-means Algorithm
There has been much progress on efficient algorithms for clustering data
points generated by a mixture of probability distributions under the
assumption that the means of the distributions are well-separated, i.e., the
distance between the means of any two distributions is at least some prescribed number of
standard deviations. These results generally make heavy use of the generative
model and particular properties of the distributions. In this paper, we show
that a simple clustering algorithm works without assuming any generative
(probabilistic) model. Our only assumption is what we call a "proximity
condition": the projection of any data point onto the line joining its cluster
center to any other cluster center is a sufficient number of standard deviations closer to
its own center than the other center. Here the notion of standard deviations is
based on the spectral norm of the matrix whose rows represent the difference
between a point and the mean of the cluster to which it belongs. We show that
in the generative models studied, our proximity condition is satisfied and so
we are able to derive most known results for generative models as corollaries
of our main result. We also prove some new results for generative models -
e.g., we can cluster all but a small fraction of points only assuming a bound
on the variance. Our algorithm relies on the well-known k-means algorithm,
and along the way, we prove a result of independent interest: that the
k-means algorithm converges to the "true centers" even in the presence of
spurious points provided the initial (estimated) centers are close enough to
the corresponding actual centers and all but a small fraction of the points
satisfy the proximity condition. Finally, we present a new technique for
boosting the ratio of inter-center separation to standard deviation.
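As a rough, non-authoritative illustration of the kind of quantity involved, the sketch below (plain NumPy; the 1/sqrt(n) scaling of the spectral norm and the 0-indexed integer labels are my own assumptions, not necessarily the paper's exact definitions) measures, for each point and each competing center, how many spectral-norm "standard deviations" closer the point's projection lies to its own center.

```python
import numpy as np

def proximity_margins(X, labels):
    """Illustrative check inspired by the proximity condition.

    X: (n, d) data matrix; labels: integer cluster ids in 0..k-1.
    For every point and every other cluster, returns how much closer
    (in units of a spectral-norm-based 'standard deviation') the point's
    projection onto the line joining its own center to the other center
    is to its own center.  Positive margins mean the point is on the
    correct side for that pair.
    """
    k = labels.max() + 1
    centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    # Spectral 'standard deviation': operator norm of the centered matrix,
    # scaled by 1/sqrt(n) here (the exact normalization is an assumption).
    sigma = np.linalg.norm(X - centers[labels], ord=2) / np.sqrt(len(X))
    margins = []
    for i, x in enumerate(X):
        ci = labels[i]
        for cj in range(k):
            if cj == ci:
                continue
            u = centers[cj] - centers[ci]
            u /= np.linalg.norm(u)
            proj = np.dot(x, u)
            # Distance of the projected point to each projected center.
            d_own = abs(proj - np.dot(centers[ci], u))
            d_other = abs(proj - np.dot(centers[cj], u))
            margins.append((d_other - d_own) / sigma)
    return np.array(margins)
```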
Improved K-means clustering algorithms: a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University, New Zealand
The K-means clustering algorithm is designed to divide samples into subsets so as to maximize intra-subset similarity and inter-subset dissimilarity, where similarity measures the relationship between two samples. As an unsupervised learning technique, the K-means clustering algorithm is one of the most widely used clustering algorithms and has been applied in a variety of areas such as artificial intelligence, data mining, biology, psychology, marketing, and medicine.
However, the K-means clustering algorithm is not robust, and its clustering result depends on the initialization, the similarity measure, and the predefined cluster number. Previous research has addressed some of these issues, but not all of them in a unified framework, and fixing only one issue does not guarantee the best performance. Improving the K-means clustering algorithm, one of the most famous and widely used clustering algorithms, by solving these issues simultaneously is therefore both challenging and significant.
This thesis conducts extensive research on the K-means clustering algorithm with the aim of improving it.
First, we propose the Initialization-Similarity (IS) clustering algorithm to solve the initialization and similarity-measure issues of the K-means clustering algorithm in a unified way. Specifically, we fix the initialization of the clustering by using sum-of-norms (SON) regularization, which outputs a new representation of the original samples, and we learn the similarity matrix based on the data distribution. The derived representation is then used to conduct K-means clustering.
Second, we propose a Joint Feature Selection with Dynamic Spectral (FSDS) clustering algorithm to address cluster-number determination, the similarity measure, and the robustness of the clustering by selecting effective features and reducing the influence of outliers simultaneously. Specifically, we learn the similarity matrix based on the data distribution and add a rank constraint on the Laplacian matrix of the learned similarity matrix so that the cluster number is output automatically. Furthermore, the proposed algorithm employs the L2,1-norm as a sparsity constraint on the regularization term and the loss function, to remove redundant features and reduce the influence of outliers, respectively.
Third, we propose a Joint Robust Multi-view (JRM) spectral clustering algorithm that conducts clustering for multi-view data while solving the initialization issue, the cluster number determination, the similarity measure learning, the removal of the redundant features, and the reduction of outlier influence in a unified way.
Finally, the proposed algorithms outperformed state-of-the-art clustering algorithms on real data sets. Moreover, we theoretically prove the convergence of the proposed optimization methods for the proposed objective functions.
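A minimal sketch of the sum-of-norms idea mentioned above, written from the abstract alone rather than from the thesis itself: it fuses sample representations with a plain subgradient loop and then hands the fused representation to k-means. The step size, regularization weight, and random toy data are all hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def son_representation(X, lam=0.5, steps=200, lr=0.05):
    """Small subgradient sketch of sum-of-norms (convex) clustering:
    minimize 0.5*sum_i ||u_i - x_i||^2 + lam * sum_{i<j} ||u_i - u_j||_2.
    Returns the fused representations U (same shape as X)."""
    U = X.astype(float).copy()
    for _ in range(steps):
        grad = U - X
        diffs = U[:, None, :] - U[None, :, :]      # (n, n, d) pairwise differences
        norms = np.linalg.norm(diffs, axis=2, keepdims=True)
        norms[norms < 1e-8] = 1.0                  # avoid division by zero at fused points
        grad += lam * (diffs / norms).sum(axis=1)
        U -= lr * grad
    return U

# Hypothetical usage: cluster the fused representation with k-means.
X = np.random.randn(60, 2)
U = son_representation(X)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(U)
```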
Clustering with feature selection using alternating minimization. Application to computational biology
This paper deals with unsupervised clustering with feature selection. The problem is to estimate both labels and a sparse projection matrix of weights. To address this combinatorial non-convex problem while maintaining strict control on the sparsity of the matrix of weights, we propose an alternating minimization of the Frobenius norm criterion. We provide a new efficient algorithm named K-sparse which alternates k-means with projection-gradient minimization. The projection-gradient step is a method of splitting type, with exact projection onto the ℓ1 ball to promote sparsity. The convergence of the projection-gradient step is addressed, and a preliminary analysis of the alternating minimization is made. The Frobenius norm criterion converges as the number of iterations in Algorithm K-sparse goes to infinity. Experiments on single-cell RNA sequencing datasets show that our method significantly improves the results of PCA k-means, spectral clustering, SIMLR, and Sparcl methods. The complexity of K-sparse is linear in the number of samples (cells), so the method scales up to large datasets. Finally, we extend K-sparse to supervised classification.
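The following is a loose sketch of the alternating scheme described here, assuming NumPy and scikit-learn; a diagonal feature-weight vector stands in for the paper's sparse projection matrix, and all parameter values are placeholders rather than the authors' choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def project_l1_ball(v, z=1.0):
    """Euclidean projection of a vector onto the l1 ball of radius z
    (standard sorting-based algorithm)."""
    if np.abs(v).sum() <= z:
        return v
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u) - z
    rho = np.nonzero(u > cssv / np.arange(1, len(u) + 1))[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def k_sparse_like(X, k=3, radius=5.0, outer=10, inner=20, lr=1e-3):
    """Rough alternating sketch in the spirit of the abstract: alternate
    k-means on weighted features with projected-gradient updates of a
    feature-weight vector constrained to the l1 ball.  A diagonal weight
    vector replaces the sparse projection matrix (a simplification)."""
    n, d = X.shape
    w = np.ones(d) / d
    for _ in range(outer):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X * w)
        centers = np.array([(X * w)[labels == c].mean(axis=0) for c in range(k)])
        for _ in range(inner):
            resid = X * w - centers[labels]        # residuals of the weighted data
            grad = (resid * X).sum(axis=0)         # gradient of the Frobenius criterion w.r.t. w
            w = project_l1_ball(w - lr * grad, radius)
    return labels, w
```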
Dimensionality Reduction for k-Means Clustering and Low Rank Approximation
We show how to approximate a data matrix with a much smaller
sketch that can be used to solve a general class of
constrained k-rank approximation problems to within small relative error.
Importantly, this class of problems includes k-means clustering and
unconstrained low rank approximation (i.e., principal component analysis). By
reducing data points to a small number of dimensions, our methods generically
accelerate any exact, approximate, or heuristic algorithm for these ubiquitous
problems.
For k-means dimensionality reduction, we provide relative
error results for many common sketching techniques, including random row
projection, column selection, and approximate SVD. For approximate principal
component analysis, we give a simple alternative to known algorithms that has
applications in the streaming setting. Additionally, we extend recent work on
column-based matrix reconstruction, giving column subsets that not only 'cover'
a good subspace for the data matrix A, but can be used directly to compute this
subspace.
Finally, for k-means clustering, we show how to achieve a constant-factor
approximation by Johnson-Lindenstrauss projecting data points to a small number of
dimensions. This gives the first result that leverages the specific structure of
k-means to achieve dimension independent of input size and sublinear in k.
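A small sketch of the sketch-then-cluster workflow this abstract describes, using a Gaussian random projection from scikit-learn; the target dimension, cluster count, and synthetic data are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

# Project the data to a small number of dimensions with a
# Johnson-Lindenstrauss-style random matrix, run k-means in the sketch
# space, and evaluate the induced clustering on the original data.
rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 400))                  # toy data matrix

sketch = GaussianRandomProjection(n_components=20, random_state=0).fit_transform(A)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(sketch)

# k-means cost of that labeling, measured back in the original space.
centers = np.array([A[labels == c].mean(axis=0) for c in range(10)])
cost = ((A - centers[labels]) ** 2).sum()
print(cost)
```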
How to Round Subspaces: A New Spectral Clustering Algorithm
A basic problem in spectral clustering is the following. If a solution
obtained from the spectral relaxation is close to an integral solution, is it
possible to find this integral solution even though the two might be expressed in
completely different bases? In this paper, we propose a new spectral clustering algorithm.
It can recover a k-partition such that the subspace spanned by its indicator
vectors is close in spectral norm to the original subspace, with the error
bounded in terms of the minimum possible distance.
Moreover, our algorithm does not impose any restriction on the cluster sizes.
Previously, no algorithm was known that could find a k-partition with a
comparably strong spectral-norm guarantee.
We present two applications of our algorithm. The first finds a disjoint
union of bounded-degree expanders that approximates a given graph in spectral
norm. The second approximates the sparsest k-partition in a graph in which every
cluster has small expansion, provided the corresponding eigenvalue of the
Laplacian matrix is sufficiently large. This significantly improves upon previous
algorithms, which required a substantially stronger eigenvalue condition.
Comment: Appeared in SODA 201
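For orientation only, the snippet below shows the standard baseline rounding step that this line of work improves upon (embed with the bottom-k eigenvectors of the normalized Laplacian, then run k-means on the rows); it is not the authors' new algorithm, and it assumes a dense symmetric affinity matrix.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def round_spectral_embedding(W, k):
    """Baseline rounding step, not the paper's algorithm: take the k eigenvectors
    of the normalized graph Laplacian with smallest eigenvalues, treat their rows
    as an embedding, and round to a k-partition with k-means.

    W: dense symmetric (n, n) affinity/adjacency matrix."""
    L = laplacian(W, normed=True)
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                # span close to the indicator-vector subspace
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```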
Sketch-based subspace clustering of hyperspectral images
Sparse subspace clustering (SSC) techniques provide the state of the art in clustering of hyperspectral images (HSIs). However, their computational complexity hinders their applicability to large-scale HSIs. In this paper, we propose a large-scale SSC-based method which can effectively process large HSIs while also achieving improved clustering accuracy compared to current SSC methods. We build our approach on the emerging concept of sketched subspace clustering, which, to our knowledge, had not previously been explored in hyperspectral imaging, and results on large-scale SSC approaches for HSI in general are scarce. We show that a direct application of sketched SSC does not provide satisfactory performance on HSIs, but it does provide an excellent basis for an effective and elegant method that we build by extending this approach with a spatial prior and deriving the corresponding solver. In particular, a random matrix constructed by the Johnson-Lindenstrauss transform is first used to sketch the self-representation dictionary as a compact dictionary, which significantly reduces the number of sparse coefficients to be solved, thereby reducing the overall complexity. In order to alleviate the effect of noise and within-class spectral variations of HSIs, we employ a total variation constraint on the coefficient matrix, which accounts for the spatial dependencies among the neighbouring pixels. We derive an efficient solver for the resulting optimization problem and theoretically prove its convergence under mild conditions. The experimental results on real HSIs show a notable improvement in comparison with the traditional SSC-based methods and the state-of-the-art methods for clustering of large-scale images.
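A bare-bones sketch of sketched sparse subspace clustering without the paper's total-variation spatial prior, assuming NumPy and scikit-learn; the sketch dimension, Lasso penalty, and the |C Cᵀ| affinity construction are simplifications made for illustration rather than the method described above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def sketched_ssc(X, n_clusters, sketch_dim=50, alpha=1e-3, seed=0):
    """Plain sketched sparse subspace clustering (no spatial prior).

    X: (n, d) matrix of n pixels with d spectral bands, e.g. a flattened HSI cube.
    Sketch the self-representation dictionary with a random JL-style matrix,
    solve a sparse coding problem per pixel, then spectrally cluster a simple
    affinity built from the codes."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(n, sketch_dim)) / np.sqrt(sketch_dim)
    D = X.T @ R                                    # (d, sketch_dim) compact dictionary
    C = np.zeros((n, sketch_dim))
    for i in range(n):
        C[i] = Lasso(alpha=alpha, max_iter=2000).fit(D, X[i]).coef_
    affinity = np.abs(C @ C.T)                     # crude symmetric affinity from the codes
    np.fill_diagonal(affinity, 0.0)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(affinity)
```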