
    Strong Consistency of Reduced K-means Clustering

    Reduced k-means clustering is a method for clustering objects in a low-dimensional subspace. Its advantage is that the clustering of objects and the low-dimensional subspace reflecting the cluster structure are obtained simultaneously. This paper discusses the relationship between conventional k-means clustering and reduced k-means clustering, and presents conditions that ensure almost sure convergence of the reduced k-means estimator as the sample size increases unboundedly. The results are proved for a more general model covering both conventional k-means clustering and reduced k-means clustering. Moreover, a new criterion and its consistent estimator are proposed to determine the optimal dimension of the subspace, given the number of clusters. Comment: A revised version of this was accepted in Scandinavian Journal of Statistics. Please refer to the accepted version.
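
    The abstract treats reduced k-means purely theoretically, but the estimator it studies is easy to sketch. Below is a minimal, hypothetical NumPy/scikit-learn sketch of the usual alternating scheme (cluster the projected data, then update the column-orthonormal loading matrix by orthogonal Procrustes); the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduced_kmeans(X, n_clusters, n_components, n_iter=50, seed=0):
    """Alternating minimization of ||X - U F A^T||_F^2 with A^T A = I.

    X: (n, p) data; n_clusters: k; n_components: q (subspace dimension).
    Returns cluster labels and the loading matrix A of shape (p, q).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Initialize A with a random column-orthonormal matrix.
    A, _ = np.linalg.qr(rng.standard_normal((p, n_components)))
    labels = None
    for _ in range(n_iter):
        # Step 1: cluster the projected data Y = X A with ordinary k-means.
        Y = X @ A
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(Y)
        labels = km.labels_
        M = km.cluster_centers_[labels]          # (n, q) low-dimensional fits U F
        # Step 2: update A by orthogonal Procrustes: maximize tr(A^T X^T M).
        P, _, Qt = np.linalg.svd(X.T @ M, full_matrices=False)
        A = P @ Qt
    return labels, A

# Toy usage: three clusters living in a 2-D subspace of R^10.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    centers = rng.standard_normal((3, 2)) * 5
    Y = np.vstack([c + rng.standard_normal((100, 2)) for c in centers])
    B, _ = np.linalg.qr(rng.standard_normal((10, 2)))
    X = Y @ B.T + 0.1 * rng.standard_normal((300, 10))
    labels, A = reduced_kmeans(X, n_clusters=3, n_components=2)
    print(labels[:10], A.shape)
```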

    Scalable Deep k-Subspace Clustering

    Subspace clustering algorithms are notorious for their scalability issues because building and processing large affinity matrices is demanding. In this paper, we introduce a method that simultaneously learns an embedding space and the subspaces within it so as to minimize a notion of reconstruction error, thus addressing the problem of subspace clustering in an end-to-end learning paradigm. To achieve our goal, we propose a scheme to update subspaces within a deep neural network. This in turn frees us from the need for an affinity matrix to perform clustering. Unlike previous attempts, our method can easily scale up to large datasets, making it unique in the context of unsupervised learning with deep architectures. Our experiments show that our method significantly improves the clustering accuracy while enjoying a cheaper memory footprint. Comment: To appear in ACCV 201

    A H-K Clustering Algorithm For High Dimensional Data Using Ensemble Learning

    Advances to traditional clustering algorithms address problems such as the curse of dimensionality and the sparsity of data over multiple attributes. The traditional H-K clustering algorithm removes the randomness and the a priori choice of the initial centers in the k-means clustering algorithm, but when applied to high-dimensional data it suffers from the dimensional disaster problem because of its high computational complexity. Advanced clustering algorithms such as subspace and ensemble clustering improve the performance of clustering high-dimensional datasets from different aspects and to different extents, yet each of them improves performance from only a single perspective. The objective of the proposed model is to improve the performance of traditional H-K clustering and to overcome its limitations, namely high computational complexity and poor accuracy on high-dimensional data, by combining subspace clustering and ensemble clustering with the H-K clustering algorithm. Comment: 9 pages, 1 table, 2 figures, International Journal of Information Technology Convergence and Services (IJITCS) Vol.4, No.5/6, December 201
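
    If H-K here denotes hierarchical-initialized k-means (a hierarchical step chooses deterministic initial centers that k-means then refines), which matches the description of removing the randomness of the initial centers, a minimal illustrative scikit-learn sketch could look like the following; the subsampling and parameter names are assumptions, not the paper's method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def hk_clustering(X, n_clusters, sample_size=2000, seed=0):
    """Hierarchical step picks deterministic initial centers; k-means refines them."""
    rng = np.random.default_rng(seed)
    # Agglomerative clustering is O(n^2), so run it on a subsample for large n.
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    hier = AgglomerativeClustering(n_clusters=n_clusters).fit(X[idx])
    init = np.vstack([X[idx][hier.labels_ == c].mean(axis=0)
                      for c in range(n_clusters)])
    # k-means started from the hierarchical centers; no random restarts needed.
    return KMeans(n_clusters=n_clusters, init=init, n_init=1,
                  random_state=seed).fit_predict(X)
```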

    Fast Subspace Clustering Based on the Kronecker Product

    Subspace clustering is a useful technique for many computer vision applications in which the intrinsic dimension of high-dimensional data is often smaller than the ambient dimension. Spectral clustering, as one of the main approaches to subspace clustering, often adopts a sparse representation or a low-rank representation to learn a block diagonal self-representation matrix for subspace generation. However, existing methods require solving a large-scale convex optimization problem over the whole dataset, with a computational complexity that reaches O(N^3) for N data points, so the efficiency and scalability of traditional spectral clustering methods cannot be guaranteed for large-scale datasets. In this paper, we propose a subspace clustering model based on the Kronecker product. Because the Kronecker product of a block diagonal matrix with any other matrix is still a block diagonal matrix, we can efficiently learn a representation matrix formed by the Kronecker product of k smaller matrices. By doing so, our model significantly reduces the computational complexity to O(kN^{3/k}). Furthermore, our model is general in nature and can be adapted to different regularization-based subspace clustering methods. Experimental results on two public datasets show that our model significantly improves efficiency compared with several state-of-the-art methods. Moreover, we have conducted experiments on synthetic data to verify the scalability of our model for large-scale datasets. Comment: 16 pages, 2 figure
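
    The algebraic fact the abstract leans on (the Kronecker product of a block diagonal matrix with any other matrix is again block diagonal) is easy to check numerically. A small, hypothetical NumPy demo, not code from the paper:

```python
import numpy as np
from scipy.linalg import block_diag

# A block diagonal matrix B with a 2x2 block and a 3x3 block.
B = block_diag(np.ones((2, 2)), 2 * np.ones((3, 3)))
C = np.arange(4, dtype=float).reshape(2, 2)

K = np.kron(B, C)  # 10x10 Kronecker product

# Each zero block of B expands into a zero block of K, so K stays block
# diagonal: its nonzero pattern is B's pattern blown up by the shape of C.
print(np.allclose(K[:4, 4:], 0), np.allclose(K[4:, :4], 0))  # True True
```

    Representing the N x N self-representation matrix as a Kronecker product of k factors of size roughly N^{1/k} x N^{1/k} is what brings the optimization cost down to the O(kN^{3/k}) figure quoted in the abstract.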

    Fast Landmark Subspace Clustering

    Kernel methods obtain superb performance in terms of accuracy for various machine learning tasks since they can effectively extract nonlinear relations. However, their time complexity can be rather large, especially for clustering tasks. In this paper we define a general class of kernels that can be easily approximated by randomization. These kernels appear in various applications, in particular, traditional spectral clustering, landmark-based spectral clustering and landmark-based subspace clustering. We show that for n data points from K clusters with D landmarks, the randomization procedure results in an algorithm of complexity O(KnD). Furthermore, we bound the error between the original clustering scheme and its randomization. To illustrate the power of this framework, we propose a new fast landmark subspace (FLS) clustering algorithm. Experiments over synthetic and real datasets demonstrate the superior performance of FLS in accelerating subspace clustering with a marginal sacrifice of accuracy.
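
    A minimal sketch of the landmark idea the abstract builds on: instead of an n x n kernel matrix, compute an n x D matrix of similarities to D sampled landmarks and cluster from that, which is what keeps the cost linear in n. This is an illustrative NumPy/scikit-learn sketch of generic landmark-based spectral clustering under assumed parameter choices, not the FLS algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def landmark_spectral_clustering(X, n_clusters, n_landmarks=50, gamma=1.0, seed=0):
    """Cluster n points via an (n x D) landmark similarity matrix, D << n."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    landmarks = X[rng.choice(n, size=n_landmarks, replace=False)]
    # Similarities to landmarks only: O(n * D) entries instead of O(n^2).
    Z = rbf_kernel(X, landmarks, gamma=gamma)
    Z = Z / Z.sum(axis=1, keepdims=True)                 # row-normalize
    # Spectral embedding from the thin SVD of Z (left singular vectors).
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    emb = U[:, :n_clusters]
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(emb)
```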

    Strong Consistency of Factorial K-means Clustering

    Factorial k-means (FKM) clustering is a method for clustering objects in a low-dimensional subspace. The advantage of this method is that the partition of objects and the low-dimensional subspace reflecting the cluster structure are obtained simultaneously. Conditions that ensure the almost sure convergence of the estimator of FKM clustering as the sample size increases unboundedly are derived. The result is proved for a more general model including FKM clustering. Comment: A revised version of this was accepted in Annals of the Institute of Statistical Mathematics. Please refer to the accepted version. In the accepted version, I describe a new interesting fact that there exist some cases in which reduced k-means clustering becomes equivalent to FKM clustering as n goes to infinity, and provide a rough large deviation inequality for FKM clustering.
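
    For orientation, the standard formulations found in the literature (stated here for context, not quoted from this abstract) cast both reduced k-means and factorial k-means as constrained least-squares problems over a membership matrix U, a centroid matrix F and column-orthonormal loadings A:

```latex
% Reduced k-means fits the data in the full space through the subspace:
\min_{U,\,F,\,A:\;A^\top A = I_q}\ \bigl\lVert X - U F A^\top \bigr\rVert_F^2
% Factorial k-means fits only the projected data:
\min_{U,\,F,\,A:\;A^\top A = I_q}\ \bigl\lVert X A - U F \bigr\rVert_F^2
```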

    Subspace Clustering using Ensembles of K-Subspaces

    Subspace clustering is the unsupervised grouping of points lying near a union of low-dimensional linear subspaces. Algorithms based directly on geometric properties of such data tend to either provide poor empirical performance, lack theoretical guarantees, or depend heavily on their initialization. We present a novel geometric approach to the subspace clustering problem that leverages ensembles of the K-subspaces (KSS) algorithm via the evidence accumulation clustering framework. Our algorithm, referred to as ensemble K-subspaces (EKSS), forms a co-association matrix whose (i,j)th entry is the number of times points i and j are clustered together by several runs of KSS with random initializations. We prove general recovery guarantees for any algorithm that forms an affinity matrix with entries close to a monotonic transformation of pairwise absolute inner products. We then show that a specific instance of EKSS results in an affinity matrix with entries of this form, and hence our proposed algorithm can provably recover subspaces under conditions similar to those of state-of-the-art algorithms. To the best of our knowledge, this is the first recovery guarantee for evidence accumulation clustering and for KSS variants. We show on synthetic data that our method performs well in the traditionally challenging settings of subspaces with large intersection, subspaces with small principal angles, and noisy data. Finally, we evaluate our algorithm on six common benchmark datasets and show that, unlike existing methods, EKSS achieves excellent empirical performance for both small and large numbers of points per subspace.
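
    A minimal sketch of the co-association construction described in the abstract, assuming a simple K-subspaces (KSS) routine; the function names, parameters and the final spectral-clustering step are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def kss(X, n_subspaces, dim, n_iter=30, rng=None):
    """One run of K-subspaces with random initialization; returns labels."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    # Random orthonormal basis for each subspace.
    bases = [np.linalg.qr(rng.standard_normal((p, dim)))[0]
             for _ in range(n_subspaces)]
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assign each point to the subspace with the largest projection norm
        # (equivalently, the smallest residual).
        scores = np.stack([np.linalg.norm(X @ B, axis=1) for B in bases], axis=1)
        labels = scores.argmax(axis=1)
        # Re-fit each subspace as the top-dim principal directions of its points.
        for k in range(n_subspaces):
            pts = X[labels == k]
            if len(pts) >= dim:
                _, _, Vt = np.linalg.svd(pts, full_matrices=False)
                bases[k] = Vt[:dim].T
    return labels

def ekss(X, n_subspaces, dim, n_runs=50, seed=0):
    """Ensemble K-subspaces: co-association matrix over many KSS runs."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = np.zeros((n, n))
    for _ in range(n_runs):
        labels = kss(X, n_subspaces, dim, rng=rng)
        A += (labels[:, None] == labels[None, :])      # co-clustered indicator
    A /= n_runs
    # Use the co-association matrix as an affinity for spectral clustering.
    return SpectralClustering(n_clusters=n_subspaces, affinity="precomputed",
                              random_state=seed).fit_predict(A)
```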

    Improved Distributed Principal Component Analysis

    We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low-dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve ℓ2-error fitting problems such as k-means clustering and subspace clustering. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for k-means clustering and related problems. Our empirical study on real-world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of the techniques we develop, such as a general transformation from a constant success probability subspace embedding to a high success probability subspace embedding with a dimension and sparsity independent of the success probability, may be of independent interest.
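
    A minimal, hypothetical sketch of the generic distributed-PCA pattern the abstract refers to: each server summarizes its local points with a truncated SVD and ships only that small summary to a coordinator, which computes an approximate global subspace. This illustrates the communication pattern only, not the paper's specific algorithm or its guarantees; names and the oversampling choice are assumptions.

```python
import numpy as np

def local_summary(X_local, t):
    """Each server keeps only its top-t right singular directions, scaled by
    the singular values, as a (t x d) summary of its point set."""
    _, S, Vt = np.linalg.svd(X_local, full_matrices=False)
    return np.diag(S[:t]) @ Vt[:t]

def distributed_pca(parts, k, t=None):
    """Coordinator stacks the small summaries and extracts a rank-k subspace."""
    t = t or 2 * k                       # oversample a little for accuracy
    stacked = np.vstack([local_summary(Xi, t) for Xi in parts])
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[:k]                        # (k x d) approximate principal subspace

# Usage: three servers, each holding a share of the data.
rng = np.random.default_rng(0)
full = rng.standard_normal((3000, 50)) @ rng.standard_normal((50, 50))
parts = np.array_split(full, 3)
V = distributed_pca(parts, k=5)
print(V.shape)  # (5, 50); total communication is 3 * t * d numbers, not 3000 * d
```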

    Large-Scale Subspace Clustering via k-Factorization

    Subspace clustering (SC) aims to cluster data lying in a union of low-dimensional subspaces. Usually, SC learns an affinity matrix and then performs spectral clustering. Both steps suffer from high time and space complexity, which makes it difficult to cluster large datasets. This paper presents a method called k-Factorization Subspace Clustering (k-FSC) for large-scale subspace clustering. k-FSC directly factorizes the data into k groups by pursuing structured sparsity in a matrix factorization model. Thus, k-FSC avoids learning an affinity matrix and performing eigenvalue decomposition, and has low (linear) time and space complexity on large datasets. This paper proves the effectiveness of the k-FSC model theoretically, and an efficient algorithm with a convergence guarantee is proposed to solve its optimization. In addition, k-FSC is able to handle sparse noise, outliers, and missing data, which are pervasive in real applications. This paper also provides an online extension and an out-of-sample extension of k-FSC to handle streaming data and to cluster arbitrarily large datasets. Extensive experiments on large-scale real datasets show that k-FSC and its extensions outperform state-of-the-art subspace clustering methods. Comment: Accepted to KDD'2

    Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

    We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space R^d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first k principal components of the reduced set will return approximately the first k principal components of the original set, and computing the centers of a k-means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension d of the input space and that the sets are mergeable. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets (this property has recently also been called composability, see Indyk et al., PODS 2014). It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation the coreset sizes are also independent of the number of input points. Our method is based on projecting the points onto a low-dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including k-means clustering, principal component analysis and subspace clustering. The main conceptual contribution is a new coreset definition that allows us to charge costs that appear for every solution to an additive constant. Comment: The conference version of this work appeared at SODA 201
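
    A minimal, hypothetical sketch of the two-stage recipe the abstract describes: project the points onto a low-dimensional subspace, then shrink the projected set with a simple cost-proportional importance-sampling coreset for k-means (a common simplification of sensitivity sampling). The sizes, weights and helper names are illustrative; the paper's construction and guarantees are considerably more refined.

```python
import numpy as np
from sklearn.cluster import KMeans

def project_top_components(X, m):
    """Project the (centered) points onto their top-m principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T

def kmeans_coreset(X, k, size, seed=0):
    """Importance-sampling coreset for k-means: sample points proportionally
    to their cost under a rough solution, reweight by inverse probability."""
    rng = np.random.default_rng(seed)
    rough = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(X)
    cost = ((X - rough.cluster_centers_[rough.labels_]) ** 2).sum(axis=1) + 1e-12
    prob = cost / cost.sum()
    idx = rng.choice(len(X), size=size, replace=True, p=prob)
    weights = 1.0 / (size * prob[idx])
    return X[idx], weights

# Usage: reduce 100k points in R^50 to a weighted set of 1,000 points in R^5.
rng = np.random.default_rng(1)
X = rng.standard_normal((100_000, 50))
Xp = project_top_components(X, m=5)
C, w = kmeans_coreset(Xp, k=10, size=1_000)
print(C.shape, w.shape)
```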