Strong Consistency of Reduced K-means Clustering
Reduced k-means clustering is a method for clustering objects in a
low-dimensional subspace. The advantage of this method is that both the
clustering of objects and the low-dimensional subspace reflecting the cluster
structure are obtained simultaneously. In this paper, the relationship between
conventional k-means clustering and reduced k-means clustering is discussed.
Conditions ensuring the almost sure convergence of the reduced k-means
estimator as the sample size increases unboundedly are presented. The results
are proved for a more general model that encompasses both conventional k-means
clustering and reduced k-means clustering. Moreover, a new criterion and its
consistent estimator are proposed to determine the optimal dimension of the
subspace, given the number of clusters.
Comment: A revised version of this was accepted in Scandinavian Journal of
Statistics. Please refer to the accepted version.
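The alternating scheme behind reduced k-means is compact enough to sketch. Below is a minimal illustrative implementation, assuming centered data X and using scikit-learn's plain k-means for the partition step; the function name and parameters are placeholders, not the paper's notation.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduced_kmeans(X, k, q, n_iter=20, seed=0):
    """Alternate k-means on projected scores with a subspace refit (sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A, _ = np.linalg.qr(rng.standard_normal((p, q)))  # random orthonormal loadings
    for _ in range(n_iter):
        # (1) ordinary k-means on the q-dimensional scores X A
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X @ A)
        # (2) refit A: top-q eigenvectors of the size-weighted
        #     between-cluster scatter of the full-space cluster means
        counts = np.bincount(labels, minlength=k)
        M = np.vstack([X[labels == j].mean(axis=0) if counts[j] else np.zeros(p)
                       for j in range(k)])
        S = (M * counts[:, None]).T @ M
        A = np.linalg.eigh(S)[1][:, -q:]  # eigenvectors of the q largest eigenvalues
    return labels, A
```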
Scalable Deep k-Subspace Clustering
Subspace clustering algorithms are notorious for their scalability issues
because building and processing large affinity matrices are demanding. In this
paper, we introduce a method that simultaneously learns an embedding space
along with the subspaces within it so as to minimize a notion of reconstruction
error, thus addressing the problem of subspace clustering in an end-to-end
learning paradigm. To achieve our goal, we propose a scheme to update subspaces
within a deep neural network. This in turn frees us from the need to build an
affinity matrix to perform clustering. Unlike previous attempts, our method can easily
scale up to large datasets, making it unique in the context of unsupervised
learning with deep architectures. Our experiments show that our method
significantly improves clustering accuracy while enjoying a smaller memory
footprint.
Comment: To appear in ACCV 2018.
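The geometric core that replaces the affinity matrix, assigning points to the nearest subspace by reconstruction residual and refitting each subspace by a rank-r SVD, can be sketched outside any network. This is only the k-subspaces ingredient under my own names and assumed shapes, not the paper's deep architecture.

```python
import numpy as np

def assign_to_subspaces(Z, bases):
    """Z: (n, d) embedded points; bases: list of k orthonormal (d, r) matrices."""
    # residual of each point against each subspace span(U)
    res = np.stack([np.linalg.norm(Z - (Z @ U) @ U.T, axis=1) for U in bases],
                   axis=1)
    return res.argmin(axis=1)

def update_subspaces(Z, labels, k, r, rng):
    """Refit each subspace with a rank-r PCA of its assigned points."""
    bases = []
    for j in range(k):
        Zj = Z[labels == j]
        if len(Zj) < r:  # degenerate cluster: fall back to a random basis
            bases.append(np.linalg.qr(rng.standard_normal((Z.shape[1], r)))[0])
        else:  # top-r right singular vectors span the best-fit subspace
            bases.append(np.linalg.svd(Zj, full_matrices=False)[2][:r].T)
    return bases
```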
An H-K Clustering Algorithm for High-Dimensional Data Using Ensemble Learning
Advances made to traditional clustering algorithms solve various problems
such as the curse of dimensionality and the sparsity of data with many
attributes. The traditional H-K clustering algorithm removes the randomness
and the a priori choice of the initial centers in the K-means clustering
algorithm, but when applied to high-dimensional data it suffers from the
dimensional-disaster problem owing to its high computational complexity.
Advanced clustering algorithms such as subspace and ensemble clustering
improve the performance of clustering high-dimensional datasets from different
aspects and to different extents, yet each of them improves performance from
only a single perspective. The objective of the proposed model is to improve
the performance of traditional H-K clustering and to overcome its limitations,
namely high computational complexity and poor accuracy on high-dimensional
data, by combining three different clustering approaches: subspace clustering
and ensemble clustering with H-K clustering.
Comment: 9 pages, 1 table, 2 figures, International Journal of Information
Technology Convergence and Services (IJITCS) Vol.4, No.5/6, December 2014
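For context, the H-K baseline being extended, hierarchical clustering used to seed K-means so the result no longer depends on a random start, fits in a few lines. A minimal sketch (not the proposed ensemble model; names are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def hk_clustering(X, k):
    """Hierarchical stage supplies deterministic initial centers for k-means."""
    tree = linkage(X, method="ward")                 # agglomerative (Ward) tree
    init_labels = fcluster(tree, t=k, criterion="maxclust")
    centers = np.vstack([X[init_labels == j].mean(axis=0)
                         for j in range(1, k + 1)])
    # k-means stage: refine from the hierarchy-derived centers
    return KMeans(n_clusters=k, init=centers, n_init=1).fit_predict(X)
```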
Fast Subspace Clustering Based on the Kronecker Product
Subspace clustering is a useful technique for many computer vision
applications in which the intrinsic dimension of high-dimensional data is often
smaller than the ambient dimension. Spectral clustering, as one of the main
approaches to subspace clustering, often adopts a sparse representation or a
low-rank representation to learn a block diagonal self-representation matrix
for subspace generation. However, existing methods require solving a
large-scale convex optimization problem over a large set of data, whose
computational complexity reaches O(N^3) for N data points. Therefore, the
efficiency and scalability of traditional spectral clustering methods cannot
be guaranteed
for large scale datasets. In this paper, we propose a subspace clustering model
based on the Kronecker product. Due to the property that the Kronecker product
of a block diagonal matrix with any other matrix is still a block diagonal
matrix, we can efficiently learn the representation matrix which is formed by
the Kronecker product of k smaller matrices. By doing so, our model
significantly reduces the computational complexity to O(kN^{3/k}). Furthermore,
our model is general in nature, and can be adapted to different regularization
based subspace clustering methods. Experimental results on two public datasets
show that our model significantly improves the efficiency compared with several
state-of-the-art methods. Moreover, we have conducted experiments on synthetic
data to verify the scalability of our model on large-scale datasets.
Comment: 16 pages, 2 figures
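The algebraic fact the model rests on is easy to verify numerically: the Kronecker product of a block diagonal matrix with any other matrix is again block diagonal, so the big representation matrix can be kept as k small factors. A quick illustrative check:

```python
import numpy as np
from scipy.linalg import block_diag

C1 = block_diag(np.ones((2, 2)), np.ones((3, 3)))  # block diagonal, two blocks
M = np.arange(1.0, 10.0).reshape(3, 3)             # arbitrary dense factor
C = np.kron(C1, M)                                 # 15 x 15 product
# the cross blocks between the two groups remain exactly zero
print(np.allclose(C[:6, 6:], 0) and np.allclose(C[6:, :6], 0))  # True
```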
Fast Landmark Subspace Clustering
Kernel methods obtain superb performance in terms of accuracy for various
machine learning tasks since they can effectively extract nonlinear relations.
However, their time complexity can be rather large, especially for clustering
tasks. In this paper we define a general class of kernels that can be easily
approximated by randomization. These kernels appear in various applications, in
particular, traditional spectral clustering, landmark-based spectral clustering
and landmark-based subspace clustering. We show that for data points from
clusters with landmarks, the randomization procedure results in an
algorithm of complexity . Furthermore, we bound the error between the
original clustering scheme and its randomization. To illustrate the power of
this framework, we propose a new fast landmark subspace (FLS) clustering
algorithm. Experiments over synthetic and real datasets demonstrate the
superior performance of FLS in accelerating subspace clustering with marginal
sacrifice of accuracy.
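To make the landmark idea concrete, here is a rough sketch of landmark-based spectral clustering, the family FLS accelerates: affinities are computed only against m << n sampled landmarks, and clustering happens on the singular vectors of the resulting n x m matrix. The Gaussian kernel, bandwidth, and normalization are placeholder assumptions, not the FLS algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans

def landmark_spectral(X, k, m=100, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-gamma * d2)               # n x m affinities to landmarks only
    W /= W.sum(axis=1, keepdims=True)     # row-normalize
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U[:, :k])
```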
Strong Consistency of Factorial K-means Clustering
Factorial k-means (FKM) clustering is a method for clustering objects in a
low-dimensional subspace. The advantage of this method is that the partition of
objects and the low-dimensional subspace reflecting the cluster structure are
obtained simultaneously. Conditions that ensure the almost sure convergence of
the estimator of FKM clustering as the sample size increases unboundedly are
derived. The result is proved for a more general model including FKM
clustering.
Comment: A revised version of this was accepted in Annals of the Institute of
Statistical Mathematics. Please refer to the accepted version. In the accepted
version, I describe a new interesting fact: there exist cases in which reduced
k-means clustering becomes equivalent to FKM clustering as n goes to infinity.
I also provide a rough large deviation inequality for FKM clustering.
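For contrast with the reduced k-means sketch above, the FKM alternation differs only in the subspace refit: FKM picks the directions with the least within-cluster scatter (the smallest eigenvectors of the pooled within-cluster scatter), since its loss lives inside the subspace. A minimal sketch under the same assumptions (centered data, placeholder names):

```python
import numpy as np
from sklearn.cluster import KMeans

def factorial_kmeans(X, k, q, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A, _ = np.linalg.qr(rng.standard_normal((p, q)))
    for _ in range(n_iter):
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X @ A)
        # pooled within-cluster scatter in the full space
        W = np.zeros((p, p))
        for j in range(k):
            Xj = X[labels == j]
            if len(Xj):
                D = Xj - Xj.mean(axis=0)
                W += D.T @ D
        A = np.linalg.eigh(W)[1][:, :q]   # directions of least within-cluster spread
    return labels, A
```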
Subspace Clustering using Ensembles of K-Subspaces
Subspace clustering is the unsupervised grouping of points lying near a union
of low-dimensional linear subspaces. Algorithms based directly on geometric
properties of such data tend to either provide poor empirical performance, lack
theoretical guarantees, or depend heavily on their initialization. We present a
novel geometric approach to the subspace clustering problem that leverages
ensembles of the K-subspaces (KSS) algorithm via the evidence accumulation
clustering framework. Our algorithm, referred to as ensemble K-subspaces
(EKSS), forms a co-association matrix whose (i,j)th entry is the number of
times points i and j are clustered together by several runs of KSS with random
initializations. We prove general recovery guarantees for any algorithm that
forms an affinity matrix with entries close to a monotonic transformation of
pairwise absolute inner products. We then show that a specific instance of EKSS
results in an affinity matrix with entries of this form, and hence our proposed
algorithm can provably recover subspaces under similar conditions to
state-of-the-art algorithms. The finding is, to the best of our knowledge, the
first recovery guarantee for evidence accumulation clustering and for KSS
variants. We show on synthetic data that our method performs well in the
traditionally challenging settings of subspaces with large intersection,
subspaces with small principal angles, and noisy data. Finally, we evaluate our
algorithm on six common benchmark datasets and show that, unlike existing
methods, EKSS achieves excellent empirical performance with both small and
large numbers of points per subspace.
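The co-association step of EKSS is simple enough to state directly. A sketch, assuming some routine kss(X, seed) that returns one label vector per randomly initialized KSS run (for instance, along the lines of the k-subspaces sketch earlier); spectral clustering on the resulting matrix then yields the final partition.

```python
import numpy as np

def ekss_affinity(X, kss, B=50, seed=0):
    """Average co-association over B randomly initialized KSS runs (sketch)."""
    n = len(X)
    A = np.zeros((n, n))
    for b in range(B):
        labels = kss(X, seed=seed + b)            # one KSS run
        A += labels[:, None] == labels[None, :]   # co-clustering indicator
    return A / B   # entry (i, j): fraction of runs that cluster i with j
```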
Improved Distributed Principal Component Analysis
We study the distributed computing setting in which there are multiple
servers, each holding a set of points, who wish to compute functions on the
union of their point sets. A key task in this setting is Principal Component
Analysis (PCA), in which the servers would like to compute a low dimensional
subspace capturing as much of the variance of the union of their point sets as
possible. Given a procedure for approximate PCA, one can use it to
approximately solve ℓ2-error fitting problems such as k-means
clustering and subspace clustering. The essential properties of an approximate
distributed PCA algorithm are its communication cost and computational
efficiency for a given desired accuracy in downstream applications. We give new
algorithms and analyses for distributed PCA which lead to improved
communication and computational costs for k-means clustering and related
problems. Our empirical study on real world data shows a speedup of orders of
magnitude, preserving communication with only a negligible degradation in
solution quality. Some of these techniques we develop, such as a general
transformation from a constant success probability subspace embedding to a high
success probability subspace embedding with a dimension and sparsity
independent of the success probability, may be of independent interest.
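A bare-bones version of the distributed setting helps fix ideas: each server ships a small local sketch (here, its top-t right singular vectors scaled by singular values) and the coordinator runs PCA on the stacked sketches, so communication is t x d numbers per server. This is a simplified textbook-style protocol for illustration, not the paper's improved algorithm.

```python
import numpy as np

def local_sketch(X_s, t):
    """One server's t x d summary of its local point set X_s."""
    _, S, Vt = np.linalg.svd(X_s, full_matrices=False)
    return S[:t, None] * Vt[:t]

def distributed_pca(sketches, k):
    """Coordinator: approximate global top-k subspace from stacked sketches."""
    stacked = np.vstack(sketches)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[:k]
```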
Large-Scale Subspace Clustering via k-Factorization
Subspace clustering (SC) aims to cluster data lying in a union of
low-dimensional subspaces. Usually, SC learns an affinity matrix and then
performs spectral clustering. Both steps suffer from high time and space
complexity, which leads to difficulty in clustering large datasets. This paper
presents a method called k-Factorization Subspace Clustering (k-FSC) for
large-scale subspace clustering. K-FSC directly factorizes the data into k
groups via pursuing structured sparsity in the matrix factorization model.
Thus, k-FSC avoids learning an affinity matrix and performing eigenvalue
decomposition, and has low (linear) time and space complexity on large
datasets. This paper proves the effectiveness of the k-FSC model theoretically.
An efficient algorithm with a convergence guarantee is proposed to solve the
optimization of k-FSC. In addition, k-FSC is able to handle sparse noise,
outliers, and missing data, which are pervasive in real applications. This
paper also provides online extension and out-of-sample extension for k-FSC to
handle streaming data and cluster arbitrarily large datasets. Extensive
experiments on large-scale real datasets show that k-FSC and its extensions
outperform state-of-the-art methods of subspace clustering.
Comment: Accepted to KDD'21
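Stripped of the structured-sparsity regularizer and the provably convergent solver, the factorization picture of k-FSC reduces to assigning each data column to whichever block dictionary reconstructs it best and refitting the blocks. The greedy alternation below is only a caricature for intuition, with invented names; the actual k-FSC model and algorithm are as described above.

```python
import numpy as np

def kfsc_sketch(X, k, r, n_iter=20, seed=0):
    """X: (d, n) data columns; k block dictionaries of rank r (illustrative)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    D = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(k)]
    for _ in range(n_iter):
        # assign each column to the block dictionary that fits it best
        res = np.stack([np.linalg.norm(X - Dj @ (Dj.T @ X), axis=0) for Dj in D])
        labels = res.argmin(axis=0)
        # refit each block dictionary from its assigned columns
        for j in range(k):
            Xj = X[:, labels == j]
            if Xj.shape[1] >= r:
                D[j] = np.linalg.svd(Xj, full_matrices=False)[0][:, :r]
    return labels
```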
Turning Big Data into Tiny Data: Constant-Size Coresets for k-means, PCA and Projective Clustering
We develop and analyze a method to reduce the size of a very large set of
data points in a high-dimensional Euclidean space R^d to a small set of
weighted points such that the result of a predetermined data analysis task on
the reduced set is approximately the same as that for the original point set.
For example, computing the first k principal components of the reduced set will
return approximately the first k principal components of the original set or
computing the centers of a k-means clustering on the reduced set will return an
approximation for the original set. Such a reduced set is also known as a
coreset. The main new feature of our construction is that the cardinality of
the reduced set is independent of the dimension d of the input space and that
the sets are mergable. The latter property means that the union of two reduced
sets is a reduced set for the union of the two original sets (this property has
recently also been called composability; see Indyk et al., PODS 2014). It
allows us to turn our methods into streaming or distributed algorithms using
standard approaches. For problems such as k-means and subspace approximation
the coreset sizes are also independent of the number of input points. Our
method is based on projecting the points on a low dimensional subspace and
reducing the cardinality of the points inside this subspace using known
methods. The proposed approach works for a wide range of data analysis
techniques including k-means clustering, principal component analysis and
subspace clustering. The main conceptual contribution is a new coreset
definition that allows us to charge costs that appear for every solution to an
additive constant.
Comment: The conference version of this work appeared at SODA 2013
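The two-step shape of the construction, project to a low-dimensional subspace, then shrink cardinality inside it, can be mimicked in a toy form: PCA projection followed by a simple sensitivity-style weighted sample around a rough k-means solution. This stand-in conveys the mechanics but carries none of the paper's constant-size or mergability guarantees; all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def tiny_coreset(X, k, j, m, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: project onto the top-j principal directions
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:j].T
    # step 2: sample m points with probability ~ cost under a rough solution
    km = KMeans(n_clusters=k, n_init=2, random_state=seed).fit(P)
    cost = ((P - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1) + 1e-12
    prob = cost / cost.sum()
    idx = rng.choice(len(P), size=m, replace=False, p=prob)
    return P[idx], 1.0 / (m * prob[idx])   # weighted points of the reduced set
```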