    An Improved Approximation Algorithm for the Column Subset Selection Problem

    We consider the problem of selecting the best subset of exactly $k$ columns from an $m \times n$ matrix $A$. We present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2, m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first (randomized) stage, the algorithm randomly selects $\Theta(k \log k)$ columns according to a judiciously chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second (deterministic) stage, the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the best rank-$k$ approximation to the matrix $A$. Then, we prove that, with probability at least 0.8, $\|A - P_C A\|_F \leq \Theta(k \log^{1/2} k)\,\|A - A_k\|_F$. This Frobenius norm bound is only a factor of $\sqrt{k \log k}$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result for the Frobenius norm version of this Column Subset Selection Problem (CSSP). We also prove that, with probability at least 0.8, $\|A - P_C A\|_2 \leq \Theta(k \log^{1/2} k)\,\|A - A_k\|_2 + \Theta(k^{3/4} \log^{1/4} k)\,\|A - A_k\|_F$. This spectral norm bound is not directly comparable to the best previously existing bounds for the spectral norm version of this CSSP. Our bound depends on $\|A - A_k\|_F$, whereas previous results depend on $\sqrt{n-k}\,\|A - A_k\|_2$; if these two quantities are comparable, then our bound is asymptotically worse by a $(k \log k)^{1/4}$ factor.
    Comment: 17 pages; corrected a bug in the spectral norm bound of the previous version.
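
    The two-stage procedure described above is concrete enough to sketch in code. The following Python illustration is not the authors' implementation: it assumes NumPy/SciPy, samples columns without replacement for simplicity, and uses a pivoted QR as the deterministic second stage, which is one standard choice the abstract does not pin down; the function name two_stage_cssp and the oversampling factor are hypothetical.

        import numpy as np
        from scipy.linalg import qr

        def two_stage_cssp(A, k, oversample=4, seed=0):
            """Illustrative two-stage column subset selection.

            Stage 1 (randomized): sample Theta(k log k) column indices with
            probabilities proportional to leverage scores computed from the
            top-k right singular subspace of A.
            Stage 2 (deterministic): reduce the sampled set to exactly k
            columns with a pivoted QR on the corresponding block of V_k.
            """
            rng = np.random.default_rng(seed)
            m, n = A.shape

            # Top-k right singular subspace V_k (k x n).
            _, _, Vt = np.linalg.svd(A, full_matrices=False)
            Vk = Vt[:k, :]

            # Leverage scores are the squared column norms of V_k; they sum
            # to k, so normalizing gives a distribution over the n columns.
            probs = (Vk ** 2).sum(axis=0)
            probs = probs / probs.sum()

            # Stage 1: draw c = Theta(k log k) distinct column indices
            # (sampling without replacement is a simplification here).
            c = min(n, oversample * int(np.ceil(k * np.log(max(k, 2)))))
            sampled = rng.choice(n, size=c, replace=False, p=probs)

            # Stage 2: pivoted QR on V_k restricted to the sampled columns
            # deterministically picks k of them whose k x k submatrix of
            # V_k is well conditioned.
            _, _, piv = qr(Vk[:, sampled], pivoting=True)
            chosen = sampled[piv[:k]]

            return A[:, chosen], np.sort(chosen)

    As a sanity check, one can project $A$ onto the span of the returned columns (e.g., via a QR factorization of $C$) and compare $\|A - P_C A\|_F$ against $\|A - A_k\|_F$ from a truncated SVD.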

    Provable Deterministic Leverage Score Sampling

    We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate." To obtain provable guarantees, previous work requires randomized sampling of the columns with probabilities proportional to their leverage scores. In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such deterministic sampling can be provably as accurate as its randomized counterparts if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which often matches or outperforms state-of-the-art techniques.
    Comment: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
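
    The deterministic rule analyzed here is simple to state in code. Below is a minimal sketch, assuming NumPy; the function name deterministic_leverage_selection and the choice to keep exactly $k$ columns are illustrative rather than taken from the paper.

        import numpy as np

        def deterministic_leverage_selection(A, k):
            """Keep the k columns of A with the largest rank-k leverage scores."""
            # Leverage scores relative to the top-k right singular subspace:
            # squared column norms of the first k rows of V^T.
            _, _, Vt = np.linalg.svd(A, full_matrices=False)
            scores = (Vt[:k, :] ** 2).sum(axis=0)
            # Deterministic rule: take the k largest scores, no randomness.
            chosen = np.argsort(scores)[::-1][:k]
            return A[:, chosen], np.sort(chosen)

    Note that the paper's guarantee applies only when the leverage scores exhibit a sufficiently steep power-law decay, so in practice one would inspect the sorted scores before relying on the deterministic rule.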