
    Subspace clustering of dimensionality-reduced data

    Subspace clustering refers to the problem of clustering unlabeled high-dimensional data points into a union of low-dimensional linear subspaces, assumed unknown. In practice one may have access to dimensionality-reduced observations of the data only, resulting, e.g., from "undersampling" due to complexity and speed constraints on the acquisition device. More pertinently, even if one has access to the high-dimensional data set it is often desirable to first project the data points into a lower-dimensional space and to perform the clustering task there; this reduces storage requirements and computational cost. The purpose of this paper is to quantify the impact of dimensionality reduction through random projection on the performance of the sparse subspace clustering (SSC) and thresholding-based subspace clustering (TSC) algorithms. We find that for both algorithms dimensionality reduction down to the order of the subspace dimensions is possible without incurring significant performance degradation. The mathematical engine behind our theorems is a result quantifying how the affinities between subspaces change under random dimensionality-reducing projections. Comment: ISIT 201
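    A minimal sketch of the dimensionality-reduction step discussed above, assuming NumPy: data drawn from two low-dimensional subspaces is projected by a Gaussian random matrix down to a dimension on the order of the subspace dimension. The SSC/TSC clustering itself is not implemented here, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: columns drawn from two low-dimensional subspaces of R^D.
D, d, n_per = 100, 3, 50                         # ambient dim, subspace dim, points per subspace
bases = [np.linalg.qr(rng.standard_normal((D, d)))[0] for _ in range(2)]
X = np.hstack([B @ rng.standard_normal((d, n_per)) for B in bases])   # D x N data matrix

# Random projection down to p on the order of the subspace dimension d.
p = 4 * d
Phi = rng.standard_normal((p, D)) / np.sqrt(p)   # Gaussian dimensionality-reducing projection
X_low = Phi @ X                                  # p x N reduced observations

# SSC or TSC would now be run on the columns of X_low instead of X.
print("original shape:", X.shape, "-> reduced shape:", X_low.shape)
```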

    Sketching via hashing: from heavy hitters to compressed sensing to sparse Fourier transform

    Sketching via hashing is a popular and useful method for processing large data sets. Its basic idea is as follows. Suppose that we have a large multi-set of elements $S = \{a_1, \ldots, a_n\}$, and we would like to identify the elements that occur “frequently” in S. The algorithm starts by selecting a hash function h that maps the elements into an array c[1…m]. The array entries are initialized to 0. Then, for each element a ∈ S, the algorithm increments c[h(a)]. At the end of the process, each array entry c[j] contains the count of all data elements a ∈ S mapped to j.
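    A minimal sketch of the hashing-based counting scheme described above, in Python. The hash function, array size, and threshold logic are illustrative assumptions, not the paper's exact construction.

```python
import random
from collections import Counter

def build_sketch(stream, m, seed=0):
    """Hash-based sketch: a single array of m counters, as described above."""
    salt = random.Random(seed).getrandbits(32)
    c = [0] * m
    h = lambda a: hash((salt, a)) % m    # hash function h mapping elements into c[0..m-1]
    for a in stream:                     # for each element a in S, increment c[h(a)]
        c[h(a)] += 1
    return c, h

# Toy stream in which a few elements occur frequently.
S = ["x"] * 500 + ["y"] * 300 + [f"noise{i}" for i in range(200)]
c, h = build_sketch(S, m=64)

# Counts are never under-estimated (collisions only add), so frequent
# elements can be flagged by thresholding their bucket.
for elem in ("x", "y", "noise0"):
    print(elem, "bucket count:", c[h(elem)], "true count:", Counter(S)[elem])
```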

    Isometric sketching of any set via the Restricted Isometry Property

    In this paper we show that, for the purposes of dimensionality reduction, a certain class of structured random matrices behaves similarly to random Gaussian matrices. This class includes several matrices for which the matrix-vector multiply can be computed in log-linear time, providing efficient dimensionality reduction of general sets. In particular, we show that using such matrices any set from high dimensions can be embedded into lower dimensions with near-optimal distortion. We obtain our results by connecting dimensionality reduction of any set to dimensionality reduction of sparse vectors via a chaining argument. Comment: 17 page
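    As one concrete (assumed) example of such a structured matrix, the sketch below uses a subsampled randomized Fourier transform with random sign flips, whose matrix-vector multiply runs in log-linear time via the FFT. This is an illustration under those assumptions, not the paper's general construction; the output is complex-valued.

```python
import numpy as np

def srft_sketch(X, m, rng):
    """Subsampled randomized Fourier transform: sign-flip, FFT, subsample, rescale."""
    n = X.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)        # random diagonal sign matrix D
    Y = np.fft.fft(signs[:, None] * X, axis=0)     # fast transform F D X (F/sqrt(n) is unitary)
    rows = rng.choice(n, size=m, replace=False)    # random row subsampling
    return np.sqrt(n / m) * Y[rows] / np.sqrt(n)   # rescale so norms are preserved on average

rng = np.random.default_rng(1)
n, k = 4096, 8
# A set of vectors living in a k-dimensional subspace of R^n.
X = np.linalg.qr(rng.standard_normal((n, k)))[0] @ rng.standard_normal((k, 100))
Y = srft_sketch(X, m=256, rng=rng)

orig = np.linalg.norm(X, axis=0)
sketched = np.linalg.norm(Y, axis=0)
print("max relative norm distortion:", np.max(np.abs(sketched - orig) / orig))
```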

    Compressed Sensing with Coherent and Redundant Dictionaries

    This article presents novel results concerning the recovery of signals from undersampled data in the common situation where such signals are not sparse in an orthonormal basis or incoherent dictionary, but in a truly redundant dictionary. This work thus bridges a gap in the literature and shows not only that compressed sensing is viable in this context, but also that accurate recovery is possible via an L1-analysis optimization problem. We introduce a condition on the measurement/sensing matrix, which is a natural generalization of the now well-known restricted isometry property, and which guarantees accurate recovery of signals that are nearly sparse in (possibly) highly overcomplete and coherent dictionaries. This condition imposes no incoherence restriction on the dictionary and our results may be the first of this kind. We discuss practical examples and the implications of our results on those applications, and complement our study by demonstrating the potential of L1-analysis for such problems
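    A minimal sketch of the L1-analysis recovery problem described above, assuming CVXPY as the convex solver; the dictionary, measurement matrix, noise level, and constraint radius are synthetic assumptions chosen for illustration.

```python
import numpy as np
import cvxpy as cp   # assumes CVXPY is installed; any convex solver would do

rng = np.random.default_rng(2)
n, d, m = 64, 128, 32              # signal dim, redundant dictionary size, measurements

# Overcomplete, coherent dictionary D and a signal that is sparse in D.
D = rng.standard_normal((n, d))
D /= np.linalg.norm(D, axis=0)
coef = np.zeros(d)
coef[rng.choice(d, 5, replace=False)] = rng.standard_normal(5)
x_true = D @ coef

# Undersampled, noisy measurements b = A x + noise.
A = rng.standard_normal((m, n)) / np.sqrt(m)
sigma = 0.01
b = A @ x_true + sigma * rng.standard_normal(m)

# L1-analysis: minimize ||D^* x||_1 subject to a data-fidelity constraint.
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(D.T @ x)),
                  [cp.norm(A @ x - b, 2) <= 2 * sigma * np.sqrt(m)])
prob.solve()
print("relative recovery error:", np.linalg.norm(x.value - x_true) / np.linalg.norm(x_true))
```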

    Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression

    Low-distortion embeddings are critical building blocks for developing random sampling and random projection algorithms for linear algebra problems. We show that, given a matrix $A \in \mathbb{R}^{n \times d}$ with $n \gg d$ and a $p \in [1, 2)$, with a constant probability, we can construct a low-distortion embedding matrix $\Pi \in \mathbb{R}^{O(\mathrm{poly}(d)) \times n}$ that embeds $\mathcal{A}_p$, the $\ell_p$ subspace spanned by $A$'s columns, into $(\mathbb{R}^{O(\mathrm{poly}(d))}, \|\cdot\|_p)$; the distortion of our embeddings is only $O(\mathrm{poly}(d))$, and we can compute $\Pi A$ in $O(\mathrm{nnz}(A))$ time, i.e., input-sparsity time. Our result generalizes the input-sparsity time $\ell_2$ subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, we present a simpler and improved analysis of their construction for $\ell_2$. These input-sparsity time $\ell_p$ embeddings are optimal, up to constants, in terms of their running time; and the improved running time propagates to applications such as $(1 \pm \epsilon)$-distortion $\ell_p$ subspace embedding and relative-error $\ell_p$ regression. For $\ell_2$, we show that a $(1+\epsilon)$-approximate solution to the $\ell_2$ regression problem specified by the matrix $A$ and a vector $b \in \mathbb{R}^n$ can be computed in $O(\mathrm{nnz}(A) + d^3 \log(d/\epsilon)/\epsilon^2)$ time; and for $\ell_p$, via a subspace-preserving sampling procedure, we show that a $(1 \pm \epsilon)$-distortion embedding of $\mathcal{A}_p$ into $\mathbb{R}^{O(\mathrm{poly}(d))}$ can be computed in $O(\mathrm{nnz}(A) \cdot \log n)$ time, and we also show that a $(1+\epsilon)$-approximate solution to the $\ell_p$ regression problem $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p$ can be computed in $O(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d) \log(1/\epsilon)/\epsilon^2)$ time. Moreover, we can improve the embedding dimension, or equivalently the sample size, to $O(d^{3+p/2} \log(1/\epsilon)/\epsilon^2)$ without increasing the complexity. Comment: 22 page
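    For concreteness, the sketch below implements a CountSketch-style sparse embedding in the spirit of the Clarkson–Woodruff $\ell_2$ construction referenced above, where computing $\Pi A$ touches each nonzero of $A$ once. The row count and problem sizes are illustrative assumptions, and the paper's $\ell_p$ constructions are not reproduced.

```python
import numpy as np
from scipy import sparse

def sparse_embedding(n, k, rng):
    """CountSketch-style embedding Pi (k x n): one random sign placed in a random
    row of each column, so Pi @ A costs O(nnz(A)) -- input-sparsity time."""
    rows = rng.integers(0, k, size=n)            # h: column index -> bucket
    signs = rng.choice([-1.0, 1.0], size=n)      # s: column index -> sign
    return sparse.csr_matrix((signs, (rows, np.arange(n))), shape=(k, n))

def column_norms(M):
    return np.sqrt(np.asarray(M.multiply(M).sum(axis=0)).ravel())

rng = np.random.default_rng(3)
n, d = 100_000, 20
A = sparse.random(n, d, density=1e-2, random_state=3, format="csr")

k = 10 * d * d                                   # O(d^2) rows for an ell_2 subspace embedding
Pi = sparse_embedding(n, k, rng)
PA = Pi @ A                                      # the embedded (much shorter) matrix

# Column norms are preserved in expectation and, here, up to small distortion.
orig, sk = column_norms(A), column_norms(PA)
print("max relative distortion on columns:", np.max(np.abs(sk - orig) / orig))
```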

    Bridging Dense and Sparse Maximum Inner Product Search

    Maximum inner product search (MIPS) over dense and sparse vectors has progressed independently in a bifurcated literature for decades; the latter is better known as top-$k$ retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals, despite being manifestations of the same mathematical problem. In this work, we ask if algorithms for dense vectors could be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top-$k$ retrieval methods. We study IVF-based retrieval where vectors are partitioned into clusters and only a fraction of the clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical KMeans for partitioning. Our experiments demonstrate that IVF serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and demonstrate their potential. First, we cast the IVF paradigm as a dynamic pruning technique and turn that insight into a novel organization of the inverted index for approximate MIPS over general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, and show its robustness to query distributions.
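    A minimal sketch of the IVF-style retrieval setup described above, assuming NumPy and scikit-learn's KMeans: vectors are partitioned into clusters, and only the top-scoring clusters are scanned for an approximate maximum inner product. The data, sizes, and probe count are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.cluster import KMeans   # standard KMeans; the paper also studies a spherical variant

rng = np.random.default_rng(4)
n, dim, n_clusters, n_probe = 5000, 256, 64, 8   # illustrative sizes

# Toy "sparse" vectors, stored densely for simplicity (~5% nonzeros).
X = rng.random((n, dim)) * (rng.random((n, dim)) < 0.05)

# Index: partition the vectors into clusters (the inverted file, IVF).
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
inverted_lists = [np.where(km.labels_ == c)[0] for c in range(n_clusters)]

def ivf_mips(q, n_probe):
    """Approximate MIPS: rank clusters by <q, centroid>, scan only the top n_probe."""
    top = np.argsort(-(km.cluster_centers_ @ q))[:n_probe]
    cand = np.concatenate([inverted_lists[c] for c in top])
    return cand[np.argmax(X[cand] @ q)]

q = rng.random(dim) * (rng.random(dim) < 0.05)
approx, exact = ivf_mips(q, n_probe), np.argmax(X @ q)
print("approximate:", approx, "exact:", exact, "agree:", approx == exact)
```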