
    A Sparse Johnson--Lindenstrauss Transform

    Dimension reduction is a key algorithmic tool with many applications, including nearest-neighbor search, compressed sensing, and linear algebra in the streaming model. In this work we obtain a sparse version of the fundamental tool in dimension reduction, the Johnson--Lindenstrauss transform. Using hashing and local densification, we construct a sparse projection matrix with just $\tilde{O}(\frac{1}{\epsilon})$ non-zero entries per column. We also show a matching lower bound on the sparsity for a large class of projection matrices. Our bounds are somewhat surprising, given the known lower bounds of $\Omega(\frac{1}{\epsilon^2})$ both on the number of rows of any projection matrix and on the sparsity of projection matrices generated by natural constructions. Using this, we achieve an $\tilde{O}(\frac{1}{\epsilon})$ update time per non-zero element for a $(1\pm\epsilon)$-approximate projection, thereby substantially outperforming the $\tilde{O}(\frac{1}{\epsilon^2})$ update time required by prior approaches. A variant of our method offers the same guarantees for sparse vectors, yet its $\tilde{O}(d)$ worst case running time matches the best approach of Ailon and Liberty. Comment: 10 pages, conference version
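
    To make the idea concrete, here is a minimal sketch (not the paper's exact construction) of a projection matrix with only a few non-zero entries per column: each column is hashed to a small set of rows and filled with random signs scaled by $1/\sqrt{s}$. All names and parameter values below are illustrative.

import numpy as np

def sparse_jl_matrix(d, m, s, seed=None):
    """Build an m x d projection matrix with s non-zero entries per column."""
    rng = np.random.default_rng(seed)
    P = np.zeros((m, d))
    for j in range(d):
        rows = rng.choice(m, size=s, replace=False)                 # hash column j to s rows
        P[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)   # random signs, scaled
    return P

# Usage: project a 10,000-dimensional vector down to 256 dimensions.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
P = sparse_jl_matrix(d=10_000, m=256, s=8, seed=1)
print(np.linalg.norm(P @ x) / np.linalg.norm(x))  # should be close to 1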

    Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

    We show how to approximate a data matrix $\mathbf{A}$ with a much smaller sketch $\mathbf{\tilde A}$ that can be used to solve a general class of constrained $k$-rank approximation problems to within $(1+\epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover' a good subspace for $\mathbf{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
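
    A minimal sketch of the kind of pipeline the last result suggests, assuming a Gaussian random projection as the Johnson-Lindenstrauss map (the dataset, dimensions, and use of scikit-learn are illustrative, not the paper's implementation):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
A = rng.standard_normal((5_000, 1_000))   # n = 5,000 points in 1,000 dimensions
k = 10

# Sketch the points down to a small number of dimensions, then cluster the sketch.
sketch = GaussianRandomProjection(n_components=50, random_state=0).fit_transform(A)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sketch)
# The assignment found on the sketch is then evaluated against the original
# points' k-means cost.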

    Random forests with random projections of the output space for high dimensional multi-label classification

    We adapt the idea of random projections, applied to the output space, to enhance tree-based ensemble methods in the context of multi-label classification. We show how learning time complexity can be reduced without affecting computational complexity and accuracy of predictions. We also show that random output space projections may be used to reach different bias-variance tradeoffs over a broad panel of benchmark problems, and that this may lead to improved accuracy while significantly reducing the computational burden of the learning stage.
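
    As an illustration of the general idea (a simplified sketch, not the authors' exact algorithm), one can project the label matrix into a lower-dimensional output space, train a multi-output forest on the projected targets, and decode predictions with the projection's pseudo-inverse; all sizes and thresholds below are assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d, q, m = 2_000, 50, 200, 20                  # samples, features, labels, projected dim
X = rng.standard_normal((n, d))
Y = (rng.random((n, q)) < 0.05).astype(float)    # sparse binary multi-label targets

P = rng.standard_normal((q, m)) / np.sqrt(m)     # random projection of the output space
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, Y @ P)
Y_hat = forest.predict(X) @ np.linalg.pinv(P) > 0.5   # decode back to label space, threshold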

    Random Projections For Large-Scale Regression

    Fitting linear regression models can be computationally very expensive in large-scale data analysis tasks if the sample size and the number of variables are very large. Random projections are extensively used as a dimension reduction tool in machine learning and statistics. We discuss the applications of random projections in linear regression problems, developed to decrease computational costs, and give an overview of the theoretical guarantees for the generalization error. It can be shown that the combination of random projections with least squares regression leads to similar recovery as ridge regression and principal component regression. We also discuss possible improvements when averaging over multiple random projections, an approach that lends itself easily to parallel implementation. Comment: 13 pages, 3 figures
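
    A minimal sketch of compressed least squares with averaging over several projections (illustrative only; the names, sizes, and Gaussian projection are assumptions, not the paper's specific estimator):

import numpy as np

rng = np.random.default_rng(0)
n, d, m, reps = 10_000, 2_000, 100, 5
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d) / np.sqrt(d)
y = X @ beta + 0.1 * rng.standard_normal(n)

preds = []
for _ in range(reps):
    S = rng.standard_normal((d, m)) / np.sqrt(m)       # random projection of the features
    coef, *_ = np.linalg.lstsq(X @ S, y, rcond=None)   # ordinary least squares in projected space
    preds.append(X @ S @ coef)
fit = np.mean(preds, axis=0)                           # average over the projections
print(np.linalg.norm(fit - X @ beta) / np.linalg.norm(X @ beta))  # relative in-sample error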

    Four lectures on probabilistic methods for data science

    Methods of high-dimensional probability play a central role in applications for statistics, signal processing, theoretical computer science, and related fields. These lectures present a sample of particularly useful tools of high-dimensional probability, focusing on the classical and matrix Bernstein inequalities and the uniform matrix deviation inequality. We illustrate these tools with applications to dimension reduction, network analysis, covariance estimation, matrix completion, and sparse signal recovery. The lectures are geared towards beginning graduate students who have taken a rigorous course in probability but may not have any experience in data science applications. Comment: Lectures given at the 2016 PCMI Graduate Summer School in Mathematics of Data. Some typos and inaccuracies fixed.
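
    As a small illustration of one application mentioned above (a toy sketch, not taken from the lectures): estimate a covariance matrix from samples and measure the operator-norm error, the quantity that matrix concentration inequalities such as matrix Bernstein control.

import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 5_000
Sigma = np.diag(np.linspace(1.0, 5.0, d))                 # true covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # n i.i.d. samples
Sigma_hat = X.T @ X / n                                    # sample covariance
print(np.linalg.norm(Sigma_hat - Sigma, ord=2))            # spectral-norm error; shrinks roughly like sqrt(d/n)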