An Improved Approximation Algorithm for the Column Subset Selection Problem
We consider the problem of selecting the best subset of exactly $k$ columns from an $m \times n$ matrix $A$. We present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2, m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first (randomized) stage, the algorithm randomly selects $\Theta(k \log k)$ columns according to a judiciously-chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second (deterministic) stage, the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the best rank-$k$ approximation to the matrix $A$. Then, we prove that, with probability at least 0.8, $\|A - P_C A\|_F \leq \Theta(k \log^{1/2} k)\,\|A - A_k\|_F$. This Frobenius norm bound is only a factor of $\Theta(\sqrt{k \log k})$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result for the Frobenius norm version of this Column Subset Selection Problem (CSSP). We also prove that, with probability at least 0.8, $\|A - P_C A\|_2 \leq \Theta(k \log^{1/2} k)\,\|A - A_k\|_2 + \Theta(k^{3/4} \log^{1/4} k)\,\|A - A_k\|_F$. This spectral norm bound is not directly comparable to the best previously existing bounds for the spectral norm version of this CSSP. Our bound depends on $\|A - A_k\|_F$, whereas previous results depend on $\sqrt{n-k}\,\|A - A_k\|_2$; if these two quantities are comparable, then our bound is asymptotically worse by a $(k \log k)^{1/4}$ factor.
Comment: 17 pages; corrected a bug in the spectral norm bound of the previous version
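The two-stage procedure described in the abstract can be sketched as follows. This is an illustrative reading, not the authors' implementation: the oversampling count `c`, the use of SciPy's column-pivoted QR in place of the paper's deterministic rank-revealing-QR step, and the helper name `two_stage_cssp` are all assumptions.

```python
import numpy as np
from scipy.linalg import qr

def two_stage_cssp(A, k, c=None, rng=None):
    """Sketch of two-stage column subset selection.

    Stage 1 (randomized): sample c = Theta(k log k) columns with
    probabilities proportional to leverage scores in the top-k right
    singular subspace of A.
    Stage 2 (deterministic): pick exactly k of the sampled columns via
    column-pivoted QR (a stand-in for the paper's RRQR procedure).
    """
    rng = np.random.default_rng(rng)
    m, n = A.shape
    if c is None:
        # Oversampling on the order of k log k (constant 4 is arbitrary).
        c = min(n, int(np.ceil(4 * k * max(np.log(k), 1.0))))
    # Top-k right singular subspace of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                              # k x n
    # Leverage scores: squared column norms of Vk; they sum to k.
    p = (Vk ** 2).sum(axis=0) / k
    p /= p.sum()                                # guard against round-off
    # Stage 1: sample c distinct columns according to p.
    S = rng.choice(n, size=c, replace=False, p=p)
    # Stage 2: column-pivoted QR on the sampled block of Vk selects k columns.
    _, _, piv = qr(Vk[:, S], pivoting=True)
    cols = np.sort(S[piv[:k]])
    return A[:, cols], cols
```

By construction the residual $\|A - P_C A\|_F$ of any such $C$ can never beat the best rank-$k$ error $\|A - A_k\|_F$; the theorem bounds how far above it the two-stage choice can land.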
Provable Deterministic Leverage Score Sampling
We explain theoretically a curious empirical phenomenon: "Approximating a
matrix by deterministically selecting a subset of its columns with the
corresponding largest leverage scores results in a good low-rank matrix
surrogate". To obtain provable guarantees, previous work requires randomized
sampling of the columns with probabilities proportional to their leverage
scores.
In this work, we provide a novel theoretical analysis of deterministic
leverage score sampling. We show that such deterministic sampling can be
provably as accurate as its randomized counterparts, if the leverage scores
follow a moderately steep power-law decay. We support this power-law assumption
by providing empirical evidence that such decay laws are abundant in real-world
data sets. We then demonstrate empirically the performance of deterministic
leverage score sampling, which often matches or outperforms the
state-of-the-art techniques.
Comment: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
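The deterministic rule the abstract analyzes, keeping the columns with the largest leverage scores rather than sampling them randomly, is simple to state in code. A minimal sketch, assuming rank-$k$ leverage scores computed from a full SVD (the function name and the parameter `c` for the number of kept columns are illustrative):

```python
import numpy as np

def deterministic_leverage_select(A, k, c):
    """Keep the c columns of A with the largest rank-k leverage scores.

    The leverage score of column j is the squared norm of the j-th
    column of V_k^T, where V_k spans the top-k right singular subspace.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = (Vt[:k] ** 2).sum(axis=0)          # one score per column
    cols = np.sort(np.argsort(scores)[::-1][:c])  # indices of c largest
    return A[:, cols], cols
```

Per the abstract, this deterministic choice is provably competitive with randomized leverage score sampling only when the scores decay like a moderately steep power law; with near-uniform scores the top-$c$ set carries no special guarantee.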