Provable Deterministic Leverage Score Sampling
We explain theoretically a curious empirical phenomenon: "Approximating a
matrix by deterministically selecting a subset of its columns with the
corresponding largest leverage scores results in a good low-rank matrix
surrogate". To obtain provable guarantees, previous work requires randomized
sampling of the columns with probabilities proportional to their leverage
scores.
In this work, we provide a novel theoretical analysis of deterministic
leverage score sampling. We show that such deterministic sampling can be
provably as accurate as its randomized counterparts, if the leverage scores
follow a moderately steep power-law decay. We support this power-law assumption
by providing empirical evidence that such decay laws are abundant in real-world
data sets. We then demonstrate empirically the performance of deterministic
leverage score sampling, which often matches or outperforms state-of-the-art techniques.
Comment: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
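As a concrete illustration of the selection rule analyzed above, here is a minimal Python sketch of deterministic leverage score sampling. The function name and the use of a full SVD to obtain the scores are our own choices for exposition, not details from the paper:

```python
import numpy as np

def deterministic_leverage_sampling(A, k, c):
    """Select the c columns of A with the largest rank-k leverage scores.

    The rank-k leverage score of column j is the squared norm of the
    j-th column of V_k^T, where V_k holds the top-k right singular
    vectors of A.
    """
    # Top right singular vectors of A (rows of Vt).
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0)   # one score per column of A
    # Deterministic rule: keep the c columns with the largest scores.
    idx = np.argsort(scores)[::-1][:c]
    return A[:, idx], idx
```

When the scores follow a moderately steep power-law decay, this top-c rule is the deterministic scheme the paper shows to be competitive with randomized leverage score sampling.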
Dynamic tree-structured sparse RPCA via column subset selection for background modeling and foreground detection
Video analysis often begins with background subtraction, which consists of creating a background model that makes it possible to distinguish foreground pixels. Recent evaluations of background subtraction techniques have demonstrated that there are still considerable challenges facing these methods. Processing the background on a per-pixel basis is not only time-consuming but can also dramatically degrade foreground region detection if region cohesion and contiguity are not considered in the model. We present a new method in which we regard the image sequence as the sum of a low-rank background matrix and a dynamic tree-structured sparse matrix, and solve the decomposition using our approximated Robust Principal Component Analysis method extended to handle camera motion. Furthermore, to reduce the curse of dimensionality and scale, we introduce low-rank background modeling via Column Subset Selection, which reduces the order of complexity, decreases computation time, and eliminates the huge storage need for large videos.
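To make the decomposition concrete, the following is a simplified sketch of a low-rank-plus-residual split via column subset selection. It deliberately omits the dynamic tree-structured sparsity and camera-motion handling described above, and all names are illustrative rather than taken from the paper:

```python
import numpy as np

def css_background_model(frames, k, c):
    """Illustrative sketch: low-rank background via column subset selection.

    frames: (num_pixels, num_frames) matrix, one vectorized frame per column.
    Selects c representative frames by rank-k leverage scores and models the
    background as the projection of every frame onto their span.
    """
    _, _, Vt = np.linalg.svd(frames, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0)        # leverage score per frame
    idx = np.argsort(scores)[::-1][:c]             # c highest-scoring frames
    C = frames[:, idx]                             # selected column subset
    background = C @ (np.linalg.pinv(C) @ frames)  # project onto span(C)
    foreground = frames - background               # residual: moving objects
    return background, foreground
```

Working with the c selected columns instead of a full low-rank factorization is what saves computation and storage on long videos.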
Improved Subsampled Randomized Hadamard Transform for Linear SVM
Subsampled Randomized Hadamard Transform (SRHT), a popular random projection
method that can efficiently project d-dimensional data into an r-dimensional
space (r ≪ d) in O(nd log d) time, has been widely used to address the
challenge of high-dimensionality in machine learning. SRHT works by rotating
the input data matrix by the Randomized Walsh-Hadamard Transform, followed by
a subsequent uniform column sampling on the rotated matrix. Despite its
advantages, one limitation of SRHT is
that it generates the new low-dimensional embedding without considering any
specific properties of a given dataset. Therefore, this data-independent random
projection method may result in inferior and unstable performance when used for
a particular machine learning task, e.g., classification. To overcome this
limitation, we analyze the effect of using SRHT for random projection in the
context of linear SVM classification. Based on our analysis, we propose
importance sampling and deterministic top-r sampling to produce effective
low-dimensional embeddings instead of the uniform sampling in SRHT. In
addition, we also propose a new supervised non-uniform sampling method. Our
experimental results demonstrate that our proposed methods achieve higher
classification accuracies than SRHT and other random projection methods on six
real-life datasets.
Comment: AAAI-20
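Below is a minimal sketch of SRHT together with a deterministic top-r variant in the spirit of the one discussed above. It assumes the feature dimension is a power of two (required by the dense Hadamard matrix used here) and uses a plain matrix product rather than the fast O(nd log d) transform; the `top_r` scoring rule is a simple stand-in, not the paper's exact scheme:

```python
import numpy as np
from scipy.linalg import hadamard

def srht(X, r, top_r=False, rng=None):
    """Sketch of SRHT on an (n, d) data matrix X, with d a power of two.

    top_r=False: classic SRHT with uniform column sampling.
    top_r=True:  deterministic top-r sampling by column norm, a simple
                 stand-in for the data-dependent variants in the paper.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    D = rng.choice([-1.0, 1.0], size=d)          # random sign flipping
    H = hadamard(d) / np.sqrt(d)                 # normalized Hadamard matrix
    rotated = (X * D) @ H                        # randomized rotation
    if top_r:
        # Keep the r columns carrying the most energy after rotation.
        idx = np.argsort(np.sum(rotated ** 2, axis=0))[::-1][:r]
    else:
        # Classic SRHT: sample r columns uniformly at random.
        idx = rng.choice(d, size=r, replace=False)
    return rotated[:, idx] * np.sqrt(d / r)      # rescaled r-dim embedding
```

The paper's importance sampling and supervised non-uniform sampling replace the uniform draw with data- and label-dependent choices; the norm-based rule above only illustrates where that substitution happens.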
Fair Column Subset Selection
We consider the problem of fair column subset selection. In particular, we
assume that two groups are present in the data, and the chosen column subset
must provide a good approximation for both, relative to their respective best
rank-k approximations. We show that this fair setting introduces significant
challenges: in order to extend known results, one cannot do better than the
trivial solution of simply picking twice as many columns as the original
methods. We adopt a known approach based on deterministic leverage-score
sampling, and show that merely sampling a subset of appropriate size becomes
NP-hard in the presence of two groups. Whereas finding a subset of twice the
desired size is trivial, we provide an efficient algorithm that achieves the
same guarantees with essentially 1.5 times that size. We validate our methods
through an extensive set of experiments on real-world data.
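As a rough illustration of the setting only (not the paper's algorithm), one could score each column by the smaller of its two group-wise leverage scores and keep the best columns; the max-min rule below is our own simplification:

```python
import numpy as np

def fair_leverage_selection(A, group_mask, k, c):
    """Illustrative max-min heuristic for fair column subset selection.

    group_mask: boolean mask over the rows of A marking group 1 vs group 2.
    Scores each column by the minimum of its two group-wise rank-k
    leverage scores, then keeps the c best columns.
    """
    def leverage(M):
        _, _, Vt = np.linalg.svd(M, full_matrices=False)
        return np.sum(Vt[:k, :] ** 2, axis=0)

    s1 = leverage(A[group_mask])      # scores w.r.t. group 1 rows
    s2 = leverage(A[~group_mask])     # scores w.r.t. group 2 rows
    idx = np.argsort(np.minimum(s1, s2))[::-1][:c]
    return A[:, idx], idx
```

This conveys the tension the paper studies: a column that is highly informative for one group may be useless for the other, which is what makes selecting a subset of the right size NP-hard.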