    An Improved Approximation Algorithm for the Column Subset Selection Problem

    We consider the problem of selecting the best subset of exactly $k$ columns from an $m \times n$ matrix $A$. We present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2, m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first (randomized) stage, the algorithm randomly selects $\Theta(k \log k)$ columns according to a judiciously chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second (deterministic) stage, the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the best rank-$k$ approximation to the matrix $A$. Then, we prove that, with probability at least 0.8, $\|A - P_C A\|_F \leq \Theta(k \log^{1/2} k)\,\|A - A_k\|_F$. This Frobenius norm bound is only a factor of $\sqrt{k \log k}$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result for the Frobenius norm version of this Column Subset Selection Problem (CSSP). We also prove that, with probability at least 0.8, $\|A - P_C A\|_2 \leq \Theta(k \log^{1/2} k)\,\|A - A_k\|_2 + \Theta(k^{3/4} \log^{1/4} k)\,\|A - A_k\|_F$. This spectral norm bound is not directly comparable to the best previously existing bounds for the spectral norm version of this CSSP. Our bound depends on $\|A - A_k\|_F$, whereas previous results depend on $\sqrt{n-k}\,\|A - A_k\|_2$; if these two quantities are comparable, then our bound is asymptotically worse by a $(k \log k)^{1/4}$ factor.
    Comment: 17 pages; corrected a bug in the spectral norm bound of the previous version.
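
    The two-stage procedure described above is concrete enough to sketch in code. The following Python illustration is not the authors' implementation: it assumes NumPy/SciPy, samples columns without replacement for simplicity, and uses a pivoted QR as the deterministic second stage, which is one standard choice the abstract does not pin down; the function name two_stage_cssp and the oversampling factor are hypothetical.

        import numpy as np
        from scipy.linalg import qr

        def two_stage_cssp(A, k, oversample=4, seed=0):
            """Illustrative two-stage column subset selection.

            Stage 1 (randomized): sample Theta(k log k) column indices with
            probabilities proportional to leverage scores computed from the
            top-k right singular subspace of A.
            Stage 2 (deterministic): reduce the sampled set to exactly k
            columns with a pivoted QR on the corresponding block of V_k.
            """
            rng = np.random.default_rng(seed)
            m, n = A.shape

            # Top-k right singular subspace V_k (k x n).
            _, _, Vt = np.linalg.svd(A, full_matrices=False)
            Vk = Vt[:k, :]

            # Leverage scores are the squared column norms of V_k; they sum
            # to k, so normalizing gives a distribution over the n columns.
            probs = (Vk ** 2).sum(axis=0)
            probs = probs / probs.sum()

            # Stage 1: draw c = Theta(k log k) distinct column indices
            # (sampling without replacement is a simplification here).
            c = min(n, oversample * int(np.ceil(k * np.log(max(k, 2)))))
            sampled = rng.choice(n, size=c, replace=False, p=probs)

            # Stage 2: pivoted QR on V_k restricted to the sampled columns
            # deterministically picks k of them whose k x k submatrix of
            # V_k is well conditioned.
            _, _, piv = qr(Vk[:, sampled], pivoting=True)
            chosen = sampled[piv[:k]]

            return A[:, chosen], np.sort(chosen)

    As a sanity check, one can project $A$ onto the span of the returned columns (e.g., via a QR factorization of $C$) and compare $\|A - P_C A\|_F$ against $\|A - A_k\|_F$ from a truncated SVD.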

    Provable Deterministic Leverage Score Sampling

    We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate." To obtain provable guarantees, previous work requires randomized sampling of the columns with probabilities proportional to their leverage scores. In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such deterministic sampling can be provably as accurate as its randomized counterparts if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which often matches or outperforms state-of-the-art techniques.
    Comment: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
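
    The deterministic rule analyzed here is simple to state in code. Below is a minimal sketch, assuming NumPy; the function name deterministic_leverage_selection and the choice to keep exactly $k$ columns are illustrative rather than taken from the paper.

        import numpy as np

        def deterministic_leverage_selection(A, k):
            """Keep the k columns of A with the largest rank-k leverage scores."""
            # Leverage scores relative to the top-k right singular subspace:
            # squared column norms of the first k rows of V^T.
            _, _, Vt = np.linalg.svd(A, full_matrices=False)
            scores = (Vt[:k, :] ** 2).sum(axis=0)
            # Deterministic rule: take the k largest scores, no randomness.
            chosen = np.argsort(scores)[::-1][:k]
            return A[:, chosen], np.sort(chosen)

    Note that the paper's guarantee applies only when the leverage scores exhibit a sufficiently steep power-law decay, so in practice one would inspect the sorted scores before relying on the deterministic rule.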