
    Randomized Dimensionality Reduction for k-means Clustering

    We study the topic of dimensionality reduction for $k$-means clustering. Dimensionality reduction encompasses the union of two approaches: \emph{feature selection} and \emph{feature extraction}. A feature selection based algorithm for $k$-means clustering selects a small subset of the input features and then applies $k$-means clustering on the selected features. A feature extraction based algorithm for $k$-means clustering constructs a small set of new artificial features and then applies $k$-means clustering on the constructed features. Despite the significance of $k$-means clustering as well as the wealth of heuristic methods addressing it, provably accurate feature selection methods for $k$-means clustering are not known. On the other hand, two provably accurate feature extraction methods for $k$-means clustering are known in the literature; one is based on random projections and the other is based on the singular value decomposition (SVD). This paper makes further progress towards a better understanding of dimensionality reduction for $k$-means clustering. Namely, we present the first provably accurate feature selection method for $k$-means clustering and, in addition, we present two feature extraction methods. The first feature extraction method is based on random projections and it improves upon the existing results in terms of time complexity and number of features needed to be extracted. The second feature extraction method is based on fast approximate SVD factorizations and it also improves upon the existing results in terms of time complexity. The proposed algorithms are randomized and provide constant-factor approximation guarantees with respect to the optimal $k$-means objective value. Comment: IEEE Transactions on Information Theory, to appear

    Spatio-Temporal Surrogates for Interaction of a Jet with High Explosives: Part II -- Clustering Extremely High-Dimensional Grid-Based Data

    Building an accurate surrogate model for the spatio-temporal outputs of a computer simulation is a challenging task. A simple approach to improve the accuracy of the surrogate is to cluster the outputs based on similarity and build a separate surrogate model for each cluster. This clustering is relatively straightforward when the output at each time step is of moderate size. However, when the spatial domain is represented by a large number of grid points, numbering in the millions, the clustering of the data becomes more challenging. In this report, we consider output data from simulations of a jet interacting with high explosives. These data are available on spatial domains of different sizes, at grid points that vary in their spatial coordinates, and in a format that distributes the output across multiple files at each time step of the simulation. We first describe how we bring these data into a consistent format prior to clustering. Borrowing the idea of random projections from data mining, we reduce the dimension of our data by a factor of a thousand, making it possible to use the iterative k-means method for clustering. We show how we can use the randomness of both the random projections and the choice of initial centroids in k-means clustering to determine the number of clusters in our data set. Our approach makes clustering of extremely high-dimensional data tractable, generating meaningful cluster assignments for our problem, despite the approximation introduced in the random projections.
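
The pipeline this abstract describes (random projection followed by iterative k-means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the synthetic data, the Gaussian projection matrix, the farthest-first initialization, and the names `random_projection` and `kmeans` are all hypothetical, and the reduction here is a factor of 100 rather than the factor of a thousand the report mentions, to keep the example small.

```python
import numpy as np

def random_projection(X, target_dim, rng):
    """Project the rows of X onto target_dim Gaussian random directions."""
    d = X.shape[1]
    # Scaling by 1/sqrt(target_dim) keeps squared norms unchanged in expectation.
    R = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
    return X @ R

def kmeans(X, k, rng, n_iter=50):
    """Lloyd iterations with a farthest-first initialization (an assumption)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in for high-dimensional simulation output (hypothetical data).
rng = np.random.default_rng(0)
k, d, per_cluster = 3, 5000, 100
means = 10.0 * rng.standard_normal((k, d))
truth = np.repeat(np.arange(k), per_cluster)
X = means[truth] + rng.standard_normal((k * per_cluster, d))

Z = random_projection(X, target_dim=50, rng=rng)   # 5000 -> 50 dimensions
labels, centers = kmeans(Z, k, rng)
```

Rerunning with different seeds varies both the projection and the initial centroid, which is the kind of randomness the abstract says can be exploited when choosing the number of clusters.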

    A cheap feature selection approach for the K-means algorithm

    The increase in the number of features that need to be analyzed in a wide variety of areas, such as genome sequencing, computer vision or sensor networks, represents a challenge for the K-means algorithm. In this regard, different dimensionality reduction approaches for the K-means algorithm have been designed recently, leading to algorithms that have proved to generate competitive clusterings. Unfortunately, most of these techniques tend to have fairly high computational costs and/or might not be easy to parallelize. In this work, we propose a fully parallelizable feature selection technique intended for the K-means algorithm. The proposal is based on a novel feature relevance measure that is closely related to the K-means error of a given clustering. Given a disjoint partition of the features, the technique consists of obtaining a clustering for each subset of features and selecting the m features with the highest relevance measure. The computational cost of this approach is just O(m · max{n · K, log m}) per subset of features. We additionally provide a theoretical analysis of the quality of the solution obtained via our proposal, and empirically analyze its performance with respect to well-known feature selection and feature extraction techniques. Such an analysis shows that our proposal consistently obtains results with lower K-means error than all the considered feature selection techniques (Laplacian scores, maximum variance, multi-cluster feature selection and random selection), while also requiring similar or lower computational times than these approaches. Moreover, when compared to feature extraction techniques, such as Random Projections, the proposed approach also shows a noticeable improvement in both error and computational time. BERC 2014-201

    Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

    Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the \emph{tightest} lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translates into more accurate search, learning and mining operations \textit{directly} in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions, and leverage the theoretical analysis to develop a fast algorithm to obtain an \emph{exact} solution to the problem. The suggested solution provides the tightest estimation of the $L_2$-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme. Comment: 25 pages, 20 figures, accepted in VLD

    Quantized Compressive K-Means

    The recent framework of compressive statistical learning aims at designing tractable learning algorithms that use only a heavily compressed representation, or sketch, of massive datasets. Compressive K-Means (CKM) is such a method: it estimates the centroids of data clusters from pooled, non-linear, random signatures of the learning examples. While this approach significantly reduces computational time on very large datasets, its digital implementation wastes acquisition resources because the learning examples are compressed only after the sensing stage. The present work generalizes the sketching procedure initially defined in Compressive K-Means to a large class of periodic nonlinearities including hardware-friendly implementations that compressively acquire entire datasets. This idea is exemplified in a Quantized Compressive K-Means procedure, a variant of CKM that leverages 1-bit universal quantization (i.e. retaining the least significant bit of a standard uniform quantizer) as the periodic sketch nonlinearity. Trading for this resource-efficient signature (standard in most acquisition schemes) has almost no impact on the clustering performance, as illustrated by numerical experiments.
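
The 1-bit universal quantizer named in this abstract (the least significant bit of a uniform quantizer, used as a periodic nonlinearity) can be illustrated with a short sketch. Everything beyond that one idea is an assumption made for illustration: the step size `delta`, the random frequencies `Omega`, the dither `xi`, the mapping of bits to {-1, +1}, and the function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def universal_quantize(t, delta=1.0):
    """Least significant bit of a uniform quantizer with step delta, as +/-1."""
    lsb = np.floor(np.asarray(t) / delta).astype(np.int64) % 2
    return 1 - 2 * lsb          # bit 0 -> +1, bit 1 -> -1

def quantized_sketch(X, Omega, xi, delta=1.0):
    """Pool (average) the 1-bit signatures q(Omega^T x + xi) over all examples."""
    return universal_quantize(X @ Omega + xi, delta).mean(axis=0)

rng = np.random.default_rng(0)
n, d, m = 1000, 5, 64                    # examples, dimension, sketch size
X = rng.standard_normal((n, d))          # hypothetical dataset
Omega = rng.standard_normal((d, m))      # assumed random projection frequencies
xi = rng.uniform(0.0, 2.0, size=m)       # assumed dither over one period
sketch = quantized_sketch(X, Omega, xi)

# The nonlinearity is a square wave with period 2 * delta.
t = np.array([-1.75, -0.5, 0.25, 1.5, 2.75])
same_after_period = np.array_equal(universal_quantize(t), universal_quantize(t + 2.0))
```

The whole dataset is summarized by `m` numbers regardless of `n`; recovering centroids from such a sketch is a separate step that this example does not attempt.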

    Approximation and Streaming Algorithms for Projective Clustering via Random Projections

    Let $P$ be a set of $n$ points in $\mathbb{R}^d$. In the projective clustering problem, given $k, q$ and norm $\rho \in [1,\infty]$, we have to compute a set $\mathcal{F}$ of $k$ $q$-dimensional flats such that $(\sum_{p\in P}d(p, \mathcal{F})^\rho)^{1/\rho}$ is minimized; here $d(p, \mathcal{F})$ represents the (Euclidean) distance of $p$ to the closest flat in $\mathcal{F}$. We let $f_k^q(P,\rho)$ denote the minimal value and interpret $f_k^q(P,\infty)$ to be $\max_{r\in P}d(r, \mathcal{F})$. When $\rho=1,2$ and $\infty$ and $q=0$, the problem corresponds to the $k$-median, $k$-mean and the $k$-center clustering problems respectively. For every $0 < \epsilon < 1$, $S\subset P$ and $\rho \ge 1$, we show that the orthogonal projection of $P$ onto a randomly chosen flat of dimension $O(((q+1)^2\log(1/\epsilon)/\epsilon^3) \log n)$ will $\epsilon$-approximate $f_1^q(S,\rho)$. This result combines the concepts of geometric coresets and subspace embeddings based on the Johnson-Lindenstrauss Lemma. As a consequence, an orthogonal projection of $P$ to an $O(((q+1)^2 \log ((q+1)/\epsilon)/\epsilon^3) \log n)$ dimensional randomly chosen subspace $\epsilon$-approximates projective clusterings for every $k$ and $\rho$ simultaneously. Note that the dimension of this subspace is independent of the number of clusters $k$. Using this dimension reduction result, we obtain new approximation and streaming algorithms for projective clustering problems. For example, given a stream of $n$ points, we show how to compute an $\epsilon$-approximate projective clustering for every $k$ and $\rho$ simultaneously using only $O((n+d)((q+1)^2\log ((q+1)/\epsilon))/\epsilon^3 \log n)$ space. Compared to standard streaming algorithms with $\Omega(kd)$ space requirement, our approach is a significant improvement when the number of input points and their dimensions are of the same order of magnitude. Comment: Canadian Conference on Computational Geometry (CCCG 2015)
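
The objective defined in this abstract can be evaluated directly in the special case $q=0$, where each flat is a single point. The sketch below is only a worked instance of that formula on hypothetical toy data; the function name `projective_cost` is an assumption, and no optimization over $\mathcal{F}$ is performed.

```python
import numpy as np

def projective_cost(P, centers, rho):
    """(sum_p d(p, F)^rho)^(1/rho) for q = 0, where every flat is a point."""
    # Distance from each point to its closest center.
    dists = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    if np.isinf(rho):
        return dists.max()               # k-center objective
    return (dists ** rho).sum() ** (1.0 / rho)

# Toy instance: three points in the plane, two candidate centers.
P = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
F = np.array([[0.0, 0.0], [10.0, 3.0]])

cost_median = projective_cost(P, F, 1)        # distances 0, 5, 3 sum to 8
cost_mean = projective_cost(P, F, 2)          # sqrt(0 + 25 + 9) = sqrt(34)
cost_center = projective_cost(P, F, np.inf)   # largest distance is 5
```

For $\rho=2$ this value is the square root of the usual sum-of-squares $k$-means cost, so both are minimized by the same set of centers.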

    Efficient Clustering on Riemannian Manifolds: A Kernelised Random Projection Approach

    Reformulating computer vision problems over Riemannian manifolds has demonstrated superior performance in various computer vision applications. This is because visual data often forms a special structure lying on a lower dimensional space embedded in a higher dimensional space. However, since these manifolds belong to non-Euclidean topological spaces, exploiting their structures is computationally expensive, especially when one considers the clustering analysis of massive amounts of data. To this end, we propose an efficient framework to address the clustering problem on Riemannian manifolds. This framework implements random projections for manifold points via kernel space, which can preserve the geometric structure of the original space while remaining computationally efficient. Here, we introduce three methods that follow our framework. We then validate our framework on several computer vision applications by comparing against popular clustering methods on Riemannian manifolds. Experimental results demonstrate that our framework maintains the performance of the clustering whilst massively reducing computational complexity by over two orders of magnitude in some cases.

    Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

    We show how to approximate a data matrix $\mathbf{A}$ with a much smaller sketch $\mathbf{\tilde A}$ that can be used to solve a general class of constrained k-rank approximation problems to within $(1+\epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover' a good subspace for $\mathbf{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
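
To make the idea of a sketch $\mathbf{\tilde A}$ concrete, the example below projects a matrix onto its top $k$ right singular vectors and compares the $k$-means cost of one fixed partition before and after. This is a minimal sketch under stated assumptions: it uses an exact SVD rather than the approximate SVD the abstract mentions, random data, and an arbitrary partition, and it only checks the elementary fact that the cost drops by at most the squared Frobenius norm of the discarded part. It does not reproduce the paper's $(1+\epsilon)$ guarantees; the name `kmeans_cost` is hypothetical.

```python
import numpy as np

def kmeans_cost(A, labels):
    """Sum of squared distances from each row to the mean of its cluster."""
    cost = 0.0
    for j in np.unique(labels):
        block = A[labels == j]
        cost += ((block - block.mean(axis=0)) ** 2).sum()
    return cost

rng = np.random.default_rng(0)
n, d, k = 200, 60, 4
A = rng.standard_normal((n, d))          # hypothetical data matrix
labels = rng.integers(k, size=n)         # an arbitrary partition into k clusters

# Sketch: coordinates of each row in the span of the top-k right singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_sketch = A @ Vt[:k].T                  # n x k instead of n x d
tail = (s[k:] ** 2).sum()                # ||A - A_k||_F^2, the discarded energy

full_cost = kmeans_cost(A, labels)
sketch_cost = kmeans_cost(A_sketch, labels)
gap = full_cost - sketch_cost            # lies between 0 and tail
```

Because the bound on `gap` holds for every partition at once, a clustering algorithm can be run on `A_sketch` and its cost on `A` is then known up to that additive term.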