
    Provable Deterministic Leverage Score Sampling

    We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate". To obtain provable guarantees, previous work requires randomized sampling of the columns with probabilities proportional to their leverage scores. In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such deterministic sampling can be provably as accurate as its randomized counterparts, if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which often matches or outperforms the state-of-the-art techniques. Comment: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
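
    A minimal sketch of the idea, assuming rank-k leverage scores computed from the top-k right singular vectors and a deterministic pick of the c highest-scoring columns (illustrative only, not the paper's analysis; the function name and the synthetic data are hypothetical):

```python
import numpy as np

def deterministic_leverage_sampling(X, k, c):
    """Keep the c columns of X with the largest rank-k leverage scores."""
    # Rank-k leverage score of column j = squared norm of the j-th row
    # of the top-k right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    scores = np.sum(Vt[:k].T ** 2, axis=1)
    idx = np.argsort(scores)[::-1][:c]   # c largest scores, deterministically
    return X[:, idx], idx

# Usage: project X onto the span of the selected columns to get the surrogate.
X = np.random.randn(100, 50) @ np.random.randn(50, 50)
C, idx = deterministic_leverage_sampling(X, k=5, c=10)
X_hat = C @ np.linalg.pinv(C) @ X   # low-rank matrix surrogate built from C
```

    The paper's guarantees concern how close such a surrogate gets to the best rank-k approximation when the leverage scores decay like a power law.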

    Dynamic tree-structured sparse RPCA via column subset selection for background modeling and foreground detection

    Video analysis often begins with background subtraction, which consists of creating a background model that allows distinguishing foreground pixels. Recent evaluation of background subtraction techniques demonstrated that there are still considerable challenges facing these methods. Processing the background on a per-pixel basis is not only time-consuming but can also dramatically affect foreground region detection if region cohesion and contiguity are not considered in the model. We present a new method in which we regard the image sequence as the sum of a low-rank background matrix and a dynamic tree-structured sparse matrix, and solve the decomposition using our approximated Robust Principal Component Analysis method extended to handle camera motion. Furthermore, to reduce the curse of dimensionality and scale, we introduce low-rank background modeling via Column Subset Selection that reduces the order of complexity, decreases computation time, and eliminates the huge storage need for large videos.
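
    A rough illustration of the low-rank-plus-sparse background model on a pixels-by-frames matrix (a simplified sketch under stated assumptions, not the paper's tree-structured sparse RPCA or its column selection criterion; here the frame subset is chosen uniformly at random and the foreground is a plain residual threshold):

```python
import numpy as np

def background_foreground(M, k=1, n_cols=30, thresh=25.0, seed=0):
    """M: pixels x frames matrix, each column a vectorised frame.
    Returns a rank-k background estimate and a binary foreground mask."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(M.shape[1], size=min(n_cols, M.shape[1]), replace=False)
    C = M[:, cols]                                    # selected frames only
    U, _, _ = np.linalg.svd(C, full_matrices=False)   # basis of their span
    B = U[:, :k] @ (U[:, :k].T @ M)                   # low-rank background
    S = M - B                                         # residual
    return B, np.abs(S) > thresh                      # sparse foreground mask
```

    Working from a small column (frame) subset is what keeps the low-rank step cheap for long, high-resolution videos.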

    Improved Subsampled Randomized Hadamard Transform for Linear SVM

    Subsampled Randomized Hadamard Transform (SRHT), a popular random projection method that can efficiently project d-dimensional data into an r-dimensional space (r ≪ d) in O(d log d) time, has been widely used to address the challenge of high dimensionality in machine learning. SRHT works by rotating the input data matrix X ∈ R^{n×d} with the Randomized Walsh-Hadamard Transform, followed by uniform column sampling on the rotated matrix. Despite its advantages, one limitation of SRHT is that it generates the new low-dimensional embedding without considering any specific properties of a given dataset. Therefore, this data-independent random projection method may result in inferior and unstable performance when used for a particular machine learning task, e.g., classification. To overcome this limitation, we analyze the effect of using SRHT for random projection in the context of linear SVM classification. Based on our analysis, we propose importance sampling and deterministic top-r sampling, instead of uniform sampling, to produce an effective low-dimensional embedding. In addition, we also propose a new supervised non-uniform sampling method. Our experimental results demonstrate that the proposed methods achieve higher classification accuracy than SRHT and other random projection methods on six real-life datasets. Comment: AAAI-2
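
    A hedged sketch contrasting plain SRHT (uniform column sampling after the randomized Walsh-Hadamard rotation) with a deterministic top-r rule that keeps the largest-norm rotated columns, loosely in the spirit of the paper; the exact importance-sampling and supervised schemes are not reproduced, and all names are hypothetical:

```python
import numpy as np

def fwht(A):
    """Fast Walsh-Hadamard transform applied to each row of A
    (the number of columns must be a power of two)."""
    A = A.copy()
    d = A.shape[1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = A[:, i:i + h].copy()
            b = A[:, i + h:i + 2 * h].copy()
            A[:, i:i + h] = a + b
            A[:, i + h:i + 2 * h] = a - b
        h *= 2
    return A / np.sqrt(d)

def srht(X, r, rule="uniform", rng=None):
    """Project the rows of X from d to r dimensions via SRHT."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    d2 = 1 << (d - 1).bit_length()             # pad d up to a power of two
    Xp = np.zeros((n, d2))
    Xp[:, :d] = X
    signs = rng.choice([-1.0, 1.0], size=d2)   # random diagonal sign flip D
    Xr = fwht(Xp * signs)                      # rotated data (X D) H
    if rule == "uniform":                      # plain SRHT
        idx = rng.choice(d2, size=r, replace=False)
    else:                                      # deterministic top-r variant
        idx = np.argsort(np.linalg.norm(Xr, axis=0))[::-1][:r]
    return Xr[:, idx] * np.sqrt(d2 / r)        # rescaled r-dimensional embedding
```

    The rotation spreads information across all columns, which is what makes column sampling afterwards, whether uniform or data-dependent, preserve the geometry of the data.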

    Fair Column Subset Selection

    We consider the problem of fair column subset selection. In particular, we assume that two groups are present in the data, and the chosen column subset must provide a good approximation for both, relative to their respective best rank-k approximations. We show that this fair setting introduces significant challenges: in order to extend known results, one cannot do better than the trivial solution of simply picking twice as many columns as the original methods. We adopt a known approach based on deterministic leverage-score sampling, and show that merely sampling a subset of appropriate size becomes NP-hard in the presence of two groups. While finding a subset of twice the desired size is trivial, we provide an efficient algorithm that achieves the same guarantees with essentially 1.5 times that size. We validate our methods through an extensive set of experiments on real-world data.
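
    To make the "trivial solution" concrete, here is a small sketch of that baseline under the assumption that the two groups are row subsets of the data sharing the same columns: take the top-c leverage-score columns for each group and return their union, which can contain up to 2c columns (the paper's improved roughly-1.5c algorithm is not reproduced here; names are hypothetical):

```python
import numpy as np

def rank_k_leverage_scores(X, k):
    """Rank-k leverage score of each column of X (squared row norms of
    the top-k right singular vectors)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return np.sum(Vt[:k].T ** 2, axis=1)

def fair_css_baseline(A, B, k, c):
    """Trivial fair baseline: union of each group's top-c columns."""
    top_a = np.argsort(rank_k_leverage_scores(A, k))[::-1][:c]
    top_b = np.argsort(rank_k_leverage_scores(B, k))[::-1][:c]
    return np.union1d(top_a, top_b)   # at most 2c column indices
```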