    Approximation and Streaming Algorithms for Projective Clustering via Random Projections

    Let $P$ be a set of $n$ points in $\mathbb{R}^d$. In the projective clustering problem, given $k$, $q$ and a norm $\rho \in [1,\infty]$, we have to compute a set $\mathcal{F}$ of $k$ $q$-dimensional flats such that $(\sum_{p\in P} d(p, \mathcal{F})^\rho)^{1/\rho}$ is minimized; here $d(p, \mathcal{F})$ represents the (Euclidean) distance of $p$ to the closest flat in $\mathcal{F}$. We let $f_k^q(P,\rho)$ denote the minimal value and interpret $f_k^q(P,\infty)$ to be $\max_{r\in P} d(r, \mathcal{F})$. When $\rho = 1, 2$ and $\infty$ and $q = 0$, the problem corresponds to the $k$-median, $k$-means and $k$-center clustering problems respectively. For every $0 < \epsilon < 1$, $S \subset P$ and $\rho \ge 1$, we show that the orthogonal projection of $P$ onto a randomly chosen flat of dimension $O(((q+1)^2\log(1/\epsilon)/\epsilon^3) \log n)$ will $\epsilon$-approximate $f_1^q(S,\rho)$. This result combines the concepts of geometric coresets and subspace embeddings based on the Johnson-Lindenstrauss Lemma. As a consequence, an orthogonal projection of $P$ to an $O(((q+1)^2 \log((q+1)/\epsilon)/\epsilon^3) \log n)$ dimensional randomly chosen subspace $\epsilon$-approximates projective clusterings for every $k$ and $\rho$ simultaneously. Note that the dimension of this subspace is independent of the number of clusters $k$. Using this dimension reduction result, we obtain new approximation and streaming algorithms for projective clustering problems. For example, given a stream of $n$ points, we show how to compute an $\epsilon$-approximate projective clustering for every $k$ and $\rho$ simultaneously using only $O((n+d)((q+1)^2\log((q+1)/\epsilon))/\epsilon^3 \log n)$ space. Compared to standard streaming algorithms with $\Omega(kd)$ space requirement, our approach is a significant improvement when the number of input points and their dimensions are of the same order of magnitude. Comment: Canadian Conference on Computational Geometry (CCCG 2015)
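
The cost function defined in this abstract can be illustrated for the special case $q = 0$, where each flat is a single point (a center). The sketch below is not the paper's algorithm: it only evaluates $(\sum_{p} d(p,\mathcal{F})^\rho)^{1/\rho}$ for a given set of centers and applies a scaled Gaussian random map as a common stand-in for an orthogonal projection onto a random subspace. The function names, the Gaussian construction, and the choice of target dimension are all assumptions made for illustration.

```python
import math
import random


def dist_to_nearest_center(p, centers):
    """Euclidean distance from point p to the closest center (a 0-dimensional flat)."""
    return min(math.dist(p, c) for c in centers)


def clustering_cost(points, centers, rho):
    """(sum_p d(p, F)^rho)^(1/rho); rho = math.inf gives the k-center (max) cost."""
    dists = [dist_to_nearest_center(p, centers) for p in points]
    if math.isinf(rho):
        return max(dists)
    return sum(d ** rho for d in dists) ** (1.0 / rho)


def random_projection(points, target_dim, seed=0):
    """Map points to target_dim dimensions with a scaled Gaussian matrix.

    Assumption: this Johnson-Lindenstrauss-style map stands in for the
    orthogonal projection onto a random flat analysed in the abstract.
    """
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    G = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(target_dim)]
    return [tuple(sum(g * x for g, x in zip(row, p)) for row in G) for p in points]


if __name__ == "__main__":
    rng = random.Random(1)
    pts = [tuple(rng.gauss(0.0, 1.0) for _ in range(200)) for _ in range(50)]
    ctrs = pts[:3]  # hypothetical centers, chosen only for the demonstration
    low = random_projection(pts, 40)
    for rho in (1, 2, math.inf):
        print(rho, clustering_cost(pts, ctrs, rho), clustering_cost(low, low[:3], rho))
```

Running the script prints the cost in the original and in the reduced space for $\rho = 1, 2, \infty$; how close the two values are depends on the target dimension and the random seed.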

    Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

    Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translates into more accurate search, learning and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions, and leverage the theoretical analysis to develop a fast algorithm to obtain an exact solution to the problem. The suggested solution provides the tightest estimation of the $L_2$-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme. Comment: 25 pages, 20 figures, accepted in VLD
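
To make the idea of lower/upper distance bounds in the compressed domain concrete, here is a minimal sketch of a simple baseline, not the optimal algorithm the abstract describes. It assumes both objects keep the same coefficient positions in an orthonormal basis and additionally store the norm of the discarded coefficients. By Parseval's identity the squared distance splits into a kept part and a residual part, and the triangle inequality bounds the residual part between $(e_x - e_y)^2$ and $(e_x + e_y)^2$. The function names and the storage layout are hypothetical.

```python
import math


def compress(coeffs, keep):
    """Keep the coefficients at positions `keep`; store the norm of the rest."""
    kept = {i: coeffs[i] for i in keep}
    residual_energy = sum(c * c for i, c in enumerate(coeffs) if i not in kept)
    return kept, math.sqrt(residual_energy)


def distance_bounds(cx, cy):
    """Lower and upper bounds on the Euclidean distance of two compressed objects.

    Assumption: both objects kept the same coefficient positions. The paper
    addresses the harder case of different positions per object.
    """
    (kept_x, ex), (kept_y, ey) = cx, cy
    if kept_x.keys() != kept_y.keys():
        raise ValueError("this baseline needs identical kept positions")
    d2 = sum((kept_x[i] - kept_y[i]) ** 2 for i in kept_x)
    lower = math.sqrt(d2 + (ex - ey) ** 2)
    upper = math.sqrt(d2 + (ex + ey) ** 2)
    return lower, upper


if __name__ == "__main__":
    # Coefficients in some orthonormal basis (here simply the standard basis).
    x = [3.0, 4.0, 0.0, 1.0]
    y = [1.0, 0.0, 2.0, 2.0]
    lo, hi = distance_bounds(compress(x, [0, 1]), compress(y, [0, 1]))
    print(lo, math.dist(x, y), hi)
```

The true distance always lies between the two printed bounds; keeping more coefficients shrinks the residual norms and tightens the interval.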

    Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

    We show how to approximate a data matrix $\mathbf{A}$ with a much smaller sketch $\tilde{\mathbf{A}}$ that can be used to solve a general class of constrained $k$-rank approximation problems to within $(1+\epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e., principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for $\mathbf{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
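
The statement that $k$-means belongs to the class of constrained low rank approximation problems can be checked numerically on a toy example. The sketch below relies on the standard reformulation in which a clustering is encoded by a normalized indicator matrix $\mathbf{X}$ (entry $1/\sqrt{|C_j|}$ when point $i$ is in cluster $C_j$, zero otherwise), so that the $k$-means cost equals $\|\mathbf{A} - \mathbf{X}\mathbf{X}^T\mathbf{A}\|_F^2$. It does not implement any of the sketching techniques from the abstract; the helper names are hypothetical and every cluster is assumed to be non-empty.

```python
import math


def kmeans_cost(A, labels):
    """Sum of squared distances from each row of A to its cluster centroid."""
    k, d = max(labels) + 1, len(A[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for row, j in zip(A, labels):
        counts[j] += 1
        for t, v in enumerate(row):
            sums[j][t] += v
    # Assumes every cluster index 0..k-1 occurs at least once in labels.
    centroids = [[s / counts[j] for s in sums[j]] for j in range(k)]
    return sum(math.dist(row, centroids[j]) ** 2 for row, j in zip(A, labels))


def indicator_matrix(labels):
    """Normalized cluster indicator matrix X with orthonormal columns."""
    k = max(labels) + 1
    counts = [labels.count(j) for j in range(k)]
    return [[1.0 / math.sqrt(counts[j]) if lab == j else 0.0 for j in range(k)]
            for lab in labels]


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def low_rank_cost(A, labels):
    """Squared Frobenius norm of A - X X^T A for the clustering given by labels."""
    X = indicator_matrix(labels)
    Xt = [list(col) for col in zip(*X)]
    PA = matmul(matmul(X, Xt), A)  # each row of A replaced by its cluster centroid
    return sum((a - b) ** 2 for ra, rb in zip(A, PA) for a, b in zip(ra, rb))


if __name__ == "__main__":
    A = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
    labels = [0, 0, 1]
    print(kmeans_cost(A, labels), low_rank_cost(A, labels))
```

Both functions return the same value up to floating-point rounding, which is why a sketch that preserves the low rank approximation cost for every rank-$k$ projection also preserves the cost of every $k$-means clustering.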