Subspace clustering of dimensionality-reduced data
Subspace clustering refers to the problem of clustering unlabeled
high-dimensional data points into a union of low-dimensional linear subspaces,
assumed unknown. In practice one may have access to dimensionality-reduced
observations of the data only, resulting, e.g., from "undersampling" due to
complexity and speed constraints on the acquisition device. More pertinently,
even if one has access to the high-dimensional data set it is often desirable
to first project the data points into a lower-dimensional space and to perform
the clustering task there; this reduces storage requirements and computational
cost. The purpose of this paper is to quantify the impact of
dimensionality reduction through random projection on the performance of the
sparse subspace clustering (SSC) and the thresholding-based subspace clustering
(TSC) algorithms. We find that for both algorithms dimensionality reduction
down to the order of the subspace dimensions is possible without incurring
significant performance degradation. The mathematical engine behind our
theorems is a result quantifying how the affinities between subspaces change
under random dimensionality-reducing projections.
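To make the role of the affinity concrete, here is a minimal numpy sketch (not from the paper) that estimates the affinity between two random subspaces before and after a random Gaussian projection. The dimensions and the affinity definition ||U^T V||_F / sqrt(dim), standard in this literature, are illustrative assumptions.

```python
import numpy as np

def affinity(U, V):
    """Affinity between subspaces with orthonormal bases U and V:
    ||U^T V||_F / sqrt(min dim); 0 = orthogonal, 1 = one contains the other."""
    d = min(U.shape[1], V.shape[1])
    return np.linalg.norm(U.T @ V, "fro") / np.sqrt(d)

def orth(X):
    """Orthonormal basis for the column span of X (thin QR)."""
    Q, _ = np.linalg.qr(X)
    return Q

rng = np.random.default_rng(0)
n, d, p = 200, 5, 20                     # illustrative: ambient dim, subspace dim, projected dim
U = orth(rng.standard_normal((n, d)))    # basis of subspace 1
V = orth(rng.standard_normal((n, d)))    # basis of subspace 2

Phi = rng.standard_normal((p, n)) / np.sqrt(p)   # random Gaussian projection

# Affinity before vs. after projecting both subspaces down to p dimensions;
# for p on the order of d, the two values should be close.
print("before:", affinity(U, V))
print("after: ", affinity(orth(Phi @ U), orth(Phi @ V)))
```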
Sketching via hashing: from heavy hitters to compressed sensing to sparse Fourier transform
Sketching via hashing is a popular and useful method for processing large data sets. Its basic idea is as follows. Suppose that we have a large multi-set of elements S=[formula], and we would like to identify the elements that occur "frequently" in S. The algorithm starts by selecting a hash function h that maps the elements into an array c[1…m]. The array entries are initialized to 0. Then, for each element a ∈ S, the algorithm increments c[h(a)]. At the end of the process, each array entry c[j] contains the count of all data elements a ∈ S mapped to j.
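As a toy illustration of the counting step just described, here is a short Python sketch. The hash construction and the example multi-set are illustrative assumptions; a real heavy-hitters sketch such as Count-Min keeps several such arrays under independent hash functions and returns the minimum of the per-array estimates.

```python
import random

def build_sketch(S, m, seed=0):
    """One round of sketching via hashing: hash each element of the
    multi-set S into an array c[0..m-1] of counters and increment."""
    salt = random.Random(seed).getrandbits(64)  # illustrative stand-in for choosing h
    c = [0] * m
    for a in S:
        c[hash((salt, a)) % m] += 1
    return c, salt

def estimate(c, salt, m, a):
    """Estimated frequency of a; an overestimate whenever other elements
    collide with a's bucket (Count-Min takes the min over several rows)."""
    return c[hash((salt, a)) % m]

S = ["x"] * 50 + ["y"] * 30 + list("abcdefghij")  # "x" and "y" are the heavy hitters
m = 16
c, salt = build_sketch(S, m)
print({a: estimate(c, salt, m, a) for a in ("x", "y", "a")})
```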
Isometric sketching of any set via the Restricted Isometry Property
In this paper we show that, for the purposes of dimensionality reduction, a
certain class of structured random matrices behaves similarly to random Gaussian
matrices. This class includes several matrices for which the matrix-vector multiply
can be computed in log-linear time, providing efficient dimensionality
reduction of general sets. In particular, we show that using such matrices any
set in high dimensions can be embedded into lower dimensions with near-optimal
distortion. We obtain our results by connecting dimensionality reduction of any
set to dimensionality reduction of sparse vectors via a chaining argument.
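For a concrete instance of such a structured matrix, the following is a minimal Python sketch of a subsampled randomized Hadamard embedding, one standard example of a fast-multiply RIP-type matrix. The naive FWHT and the dimension choices are illustrative assumptions, not the paper's specific construction.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform in O(n log n) operations;
    len(x) must be a power of 2."""
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i + h], x[i + h:i + 2 * h]
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def make_srht(n, m, rng):
    """Return an m x n subsampled randomized Hadamard embedding as a function.
    Applying it costs O(n log n), versus O(m n) for a dense Gaussian matrix."""
    signs = rng.choice([-1.0, 1.0], size=n)       # random diagonal sign flip
    rows = rng.choice(n, size=m, replace=False)   # random subset of output coordinates
    return lambda x: np.sqrt(n / m) * fwht(signs * x)[rows] / np.sqrt(n)

rng = np.random.default_rng(0)
n, m = 1024, 64                     # illustrative dimensions
embed = make_srht(n, m, rng)
x = rng.standard_normal(n)
print(np.linalg.norm(x), np.linalg.norm(embed(x)))  # norms should roughly agree
```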
Compressed Sensing with Coherent and Redundant Dictionaries
This article presents novel results concerning the recovery of signals from
undersampled data in the common situation where such signals are not sparse in
an orthonormal basis or incoherent dictionary, but in a truly redundant
dictionary. This work thus bridges a gap in the literature and shows not only
that compressed sensing is viable in this context, but also that accurate
recovery is possible via an $\ell_1$-analysis optimization problem. We introduce a
condition on the measurement/sensing matrix, which is a natural generalization
of the now well-known restricted isometry property, and which guarantees
accurate recovery of signals that are nearly sparse in (possibly) highly
overcomplete and coherent dictionaries. This condition imposes no incoherence
restriction on the dictionary and our results may be the first of this kind. We
discuss practical examples and the implications of our results for those
applications, and complement our study by demonstrating the potential of
$\ell_1$-analysis for such problems.
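To make the $\ell_1$-analysis formulation concrete, here is a minimal sketch using the cvxpy modeling library on a toy Gaussian setup; the dictionary, sensing matrix, and noise-level parameter are illustrative assumptions, not the paper's experiments.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, m = 64, 128, 32               # signal dim, dictionary atoms, measurements

# Overcomplete (and possibly coherent) dictionary with unit-norm columns.
D = rng.standard_normal((n, d))
D /= np.linalg.norm(D, axis=0)

# A signal that is sparse in D: few nonzero synthesis coefficients.
coef = np.zeros(d)
support = rng.choice(d, size=6, replace=False)
coef[support] = rng.standard_normal(6)
x = D @ coef

A = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sensing matrix
sigma = 1e-3
y = A @ x + sigma * rng.standard_normal(m)

# l1-analysis: among signals consistent with the measurements, pick the one
# whose analysis coefficients D^T z have the smallest l1 norm.
z = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(D.T @ z, 1)),
                  [cp.norm(A @ z - y, 2) <= 2 * sigma * np.sqrt(m)])
prob.solve()
print("relative error:", np.linalg.norm(z.value - x) / np.linalg.norm(x))
```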
Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression
Low-distortion embeddings are critical building blocks for developing random
sampling and random projection algorithms for linear algebra problems. We show
that, given a matrix $A \in \R^{n \times d}$ with $n \gg d$ and a $p \in [1, 2)$,
with a constant probability, we can construct a low-distortion embedding matrix
$\Pi \in \R^{O(\poly(d)) \times n}$ that embeds $\A_p$, the $\ell_p$ subspace
spanned by $A$'s columns, into $(\R^{O(\poly(d))}, \| \cdot \|_p)$; the
distortion of our embeddings is only $O(\poly(d))$, and we can compute $\Pi A$
in $O(\nnz(A))$ time, i.e., input-sparsity time. Our result generalizes the
input-sparsity time $\ell_2$ subspace embedding by Clarkson and Woodruff
[STOC'13]; and for completeness, we present a simpler and improved analysis of
their construction for $\ell_2$. These input-sparsity time $\ell_p$ embeddings
are optimal, up to constants, in terms of their running time; and the improved
running time propagates to applications such as $(1 \pm \epsilon)$-distortion
$\ell_p$ subspace embedding and relative-error $\ell_p$ regression. For
$\ell_2$, we show that a $(1+\epsilon)$-approximate solution to the $\ell_2$
regression problem specified by the matrix $A$ and a vector $b \in \R^n$ can be
computed in $O(\nnz(A) + d^3 \log(d/\epsilon)/\epsilon^2)$ time; and for
$\ell_p$, via a subspace-preserving sampling procedure, we show that a
$(1 \pm \epsilon)$-distortion embedding of $\A_p$ into $\R^{O(\poly(d))}$ can be
computed in $O(\nnz(A) \cdot \log n)$ time, and we also show that a
$(1+\epsilon)$-approximate solution to the $\ell_p$ regression problem
$\min_{x \in \R^d} \|Ax - b\|_p$ can be computed in
$O(\nnz(A) \cdot \log n + \poly(d) \log(1/\epsilon)/\epsilon^2)$ time. Moreover,
we can improve the embedding dimension, or equivalently the sample size, without
increasing the complexity.
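The Clarkson-Woodruff $\ell_2$ construction referenced above hashes each row of $A$ to a single output row with a random sign, which is what makes the $O(\nnz(A))$ running time possible. Below is a minimal dense-numpy sketch of that idea; a production version would operate directly on a sparse representation of $A$.

```python
import numpy as np

def sparse_embed(A, m, rng):
    """Clarkson-Woodruff style sparse embedding: each row of A is hashed to
    one of m output rows and added with a random sign, so computing Pi @ A
    touches every nonzero of A exactly once, i.e., O(nnz(A)) time."""
    n = A.shape[0]
    h = rng.integers(0, m, size=n)           # target output row per input row
    s = rng.choice([-1.0, 1.0], size=n)      # random sign per input row
    PA = np.zeros((m, A.shape[1]))
    np.add.at(PA, h, s[:, None] * A)         # scatter-add signed rows
    return PA

rng = np.random.default_rng(0)
n, d, m = 10000, 10, 500                     # illustrative sizes (m grows with d^2)
A = rng.standard_normal((n, d))
PA = sparse_embed(A, m, rng)

# The embedding approximately preserves norms over the column span of A.
x = rng.standard_normal(d)
print(np.linalg.norm(A @ x), np.linalg.norm(PA @ x))
```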
Bridging Dense and Sparse Maximum Inner Product Search
Maximum inner product search (MIPS) over dense and sparse vectors has
progressed independently in a bifurcated literature for decades; the latter is
better known as top-k retrieval in Information Retrieval. This duality exists
because sparse and dense vectors serve different end goals, despite the
fact that they are manifestations of the same mathematical problem. In this
work, we ask if algorithms for dense vectors could be applied effectively to
sparse vectors, particularly those that violate the assumptions underlying
top-k retrieval methods. We study IVF-based retrieval, where vectors are
partitioned into clusters and only a fraction of clusters are searched during
retrieval. We conduct a comprehensive analysis of dimensionality reduction for
sparse vectors, and examine standard and spherical KMeans for partitioning. Our
experiments demonstrate that IVF serves as an efficient solution for sparse
MIPS. As byproducts, we identify two research opportunities and demonstrate
their potential. First, we cast the IVF paradigm as a dynamic pruning technique
and turn that insight into a novel organization of the inverted index for
approximate MIPS for general sparse vectors. Second, we offer a unified regime
for MIPS over vectors that have dense and sparse subspaces, and show its
robustness to query distributions.
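A minimal sketch of the IVF recipe described above, using scikit-learn's standard KMeans; the corpus, cluster count, and probe setting are illustrative assumptions, and dense vectors stand in for the sparse case for simplicity.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, n_clusters, n_probe, k = 5000, 64, 64, 8, 10   # illustrative settings

X = rng.standard_normal((n, d)).astype(np.float32)    # corpus vectors
q = rng.standard_normal(d).astype(np.float32)         # query vector

# Index: partition the corpus into clusters and keep an inverted list per cluster.
km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(X)
lists = [np.where(km.labels_ == c)[0] for c in range(n_clusters)]

# Search: rank clusters by the inner product between query and centroid,
# scan only the n_probe best lists, then take the exact top-k among candidates.
order = np.argsort(-(km.cluster_centers_ @ q))[:n_probe]
cand = np.concatenate([lists[c] for c in order])
top = cand[np.argsort(-(X[cand] @ q))[:k]]

# Compare against exhaustive search to gauge recall at this probe setting.
exact = np.argsort(-(X @ q))[:k]
print("recall@10:", len(set(top) & set(exact)) / k)
```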