27,298 research outputs found
Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means
We analyze a compression scheme for large data sets that randomly keeps a
small percentage of the components of each data sample. The benefit is that the
output is a sparse matrix and therefore subsequent processing, such as PCA or
K-means, is significantly faster, especially in a distributed-data setting.
Furthermore, the sampling is single-pass and applicable to streaming data. The
sampling mechanism is a variant of previous methods proposed in the literature
combined with a randomized preconditioning to smooth the data. We provide
guarantees for PCA in terms of the covariance matrix, and guarantees for
K-means in terms of the error in the center estimators at a given step. We
present numerical evidence to show both that our bounds are nearly tight and
that our algorithms provide a real benefit when applied to standard test data
sets, as well as providing certain benefits over related sampling approaches.Comment: 28 pages, 10 figure
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
The kernel -means is an effective method for data clustering which extends
the commonly-used -means algorithm to work on a similarity matrix over
complex data structures. The kernel -means algorithm is however
computationally very complex as it requires the complete data matrix to be
calculated and stored. Further, the kernelized nature of the kernel -means
algorithm hinders the parallelization of its computations on modern
infrastructures for distributed computing. In this paper, we are defining a
family of kernel-based low-dimensional embeddings that allows for scaling
kernel -means on MapReduce via an efficient and unified parallelization
strategy. Afterwards, we propose two methods for low-dimensional embedding that
adhere to our definition of the embedding family. Exploiting the proposed
parallelization strategy, we present two scalable MapReduce algorithms for
kernel -means. We demonstrate the effectiveness and efficiency of the
proposed algorithms through an empirical evaluation on benchmark data sets.Comment: Appears in Proceedings of the SIAM International Conference on Data
Mining (SDM), 201
- …