
    Indexability, concentration, and VC theory

    The degrading performance of indexing schemes for exact similarity search in high dimensions has long been linked to the histograms of distances and other 1-Lipschitz functions becoming concentrated. We discuss this observation in the framework of the phenomenon of concentration of measure on high-dimensional structures and the Vapnik-Chervonenkis theory of statistical learning. Comment: 17 pages; final submission to J. Discrete Algorithms (an expanded, improved and corrected version of the SISAP'2010 invited paper; v3 of this e-print).
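    The concentration effect referred to here is easy to see empirically. A minimal sketch (not from the paper, assuming uniform random points and Euclidean distance): as the dimension grows, the histogram of pairwise distances narrows sharply around its mean, which is what defeats distance-based pruning.

```python
# A quick empirical illustration (not from the paper): as the dimension grows,
# pairwise distances between i.i.d. random points concentrate sharply around
# their mean -- the effect the abstract links to degrading indexability.
import numpy as np

rng = np.random.default_rng(0)
n = 500

for d in (2, 20, 200, 2000):
    X = rng.random((n, d))                           # n points uniform in [0, 1]^d
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    dists = np.sqrt(d2)[np.triu_indices(n, k=1)]     # distinct pairs only
    print(f"d={d:4d}  mean distance={dists.mean():7.3f}  std/mean={dists.std() / dists.mean():.4f}")
```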

    A directed isoperimetric inequality with application to Bregman near neighbor lower bounds

    Bregman divergences $D_\phi$ are a class of divergences parametrized by a convex function $\phi$ and include well-known distance functions like $\ell_2^2$ and the Kullback-Leibler divergence. There has been extensive research on algorithms for problems like clustering and near neighbor search with respect to Bregman divergences; in all cases, the algorithms depend not just on the data size $n$ and dimensionality $d$, but also on a structure constant $\mu \ge 1$ that depends solely on $\phi$ and can grow without bound independently. In this paper, we provide the first evidence that this dependence on $\mu$ might be intrinsic. We focus on the problem of approximate near neighbor search for Bregman divergences. We show that under the cell probe model, any non-adaptive data structure (like locality-sensitive hashing) for $c$-approximate near-neighbor search that admits $r$ probes must use space $\Omega(n^{1 + \frac{\mu}{c r}})$. In contrast, for LSH under $\ell_1$ the best bound is $\Omega(n^{1 + \frac{1}{cr}})$. Our new tool is a directed variant of the standard boolean noise operator. We show that a generalization of the Bonami-Beckner hypercontractivity inequality exists "in expectation" or upon restriction to certain subsets of the Hamming cube, and that this is sufficient to prove the desired isoperimetric inequality that we use in our data structure lower bound. We also present a structural result reducing the Hamming cube to a Bregman cube. This structure allows us to obtain lower bounds for problems under Bregman divergences from their $\ell_1$ analog. In particular, we get a (weaker) lower bound for approximate near neighbor search of the form $\Omega(n^{1 + \frac{1}{cr}})$ for an $r$-query non-adaptive data structure, and new cell probe lower bounds for a number of other near neighbor questions in Bregman space. Comment: 27 pages.
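    For readers unfamiliar with the notation, here is a minimal sketch of the standard Bregman divergence definition $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle$, instantiated for the two examples named in the abstract; the helper names are illustrative and not from the paper.

```python
# A minimal sketch of the standard Bregman divergence definition
#   D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>,
# instantiated for the two examples named in the abstract. Function names are
# illustrative, not from the paper.
import numpy as np

def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

def squared_euclidean(x, y):
    # phi(v) = ||v||^2 yields the ell_2^2 distance.
    return bregman(lambda v: np.dot(v, v), lambda v: 2.0 * v, x, y)

def kl_divergence(p, q):
    # phi(v) = sum_i v_i log v_i (negative entropy) yields the KL divergence
    # when p and q are probability distributions.
    return bregman(lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1.0, p, q)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(squared_euclidean(p, q))   # equals ||p - q||^2
print(kl_divergence(p, q))       # equals sum_i p_i log(p_i / q_i)
```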

    Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors

    [See the paper for the full abstract.] We show tight upper and lower bounds for time-space trade-offs for the $c$-Approximate Near Neighbor Search problem. For the $d$-dimensional Euclidean space and $n$-point datasets, we develop a data structure with space $n^{1 + \rho_u + o(1)} + O(dn)$ and query time $n^{\rho_q + o(1)} + d n^{o(1)}$ for every $\rho_u, \rho_q \geq 0$ such that: \begin{equation} c^2 \sqrt{\rho_q} + (c^2 - 1) \sqrt{\rho_u} = \sqrt{2c^2 - 1}. \end{equation} This is the first data structure that achieves sublinear query time and near-linear space for every approximation factor $c > 1$, improving upon [Kapralov, PODS 2015]. The data structure is a culmination of a long line of work on the problem for all space regimes; it builds on Spherical Locality-Sensitive Filtering [Becker, Ducas, Gama, Laarhoven, SODA 2016] and data-dependent hashing [Andoni, Indyk, Nguyen, Razenshteyn, SODA 2014] [Andoni, Razenshteyn, STOC 2015]. Our matching lower bounds are of two types: conditional and unconditional. First, we prove tightness of the whole above trade-off in a restricted model of computation, which captures all known hashing-based approaches. We then show unconditional cell-probe lower bounds for one and two probes that match the above trade-off for $\rho_q = 0$, improving upon the best known lower bounds from [Panigrahy, Talwar, Wieder, FOCS 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than the one-probe bound. To show the result for two probes, we establish and exploit a connection to locally-decodable codes. Comment: 62 pages, 5 figures; a merger of arXiv:1511.07527 [cs.DS] and arXiv:1605.02701 [cs.DS], which subsumes both of the preprints. New version contains more elaborated proofs and fixes some typos.
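    To make the trade-off concrete, here is a small numeric sketch (the helper name is illustrative, not from the paper) that solves the displayed equation for the query exponent $\rho_q$ given $c$ and a chosen space exponent $\rho_u$; space is then $n^{1 + \rho_u + o(1)}$ and query time $n^{\rho_q + o(1)}$. Setting $\rho_u = 0$ gives the near-linear-space endpoint, and the balanced point $\rho_u = \rho_q = 1/(2c^2 - 1)$ recovers the known data-dependent hashing exponent.

```python
# A small numeric sketch (helper name is illustrative, not from the paper) of the
# trade-off curve c^2 sqrt(rho_q) + (c^2 - 1) sqrt(rho_u) = sqrt(2 c^2 - 1):
# given c and a space exponent rho_u, return the query exponent rho_q on the
# curve, i.e. space n^(1 + rho_u + o(1)) and query time n^(rho_q + o(1)).
import math

def query_exponent(c: float, rho_u: float) -> float:
    rhs = math.sqrt(2 * c * c - 1) - (c * c - 1) * math.sqrt(rho_u)
    return max(rhs, 0.0) ** 2 / c ** 4

c = 2.0
print(query_exponent(c, 0.0))        # near-linear space endpoint: rho_q = (2c^2 - 1) / c^4
rho_star = 1.0 / (2 * c * c - 1)     # balanced point rho_u = rho_q = 1 / (2c^2 - 1)
print(query_exponent(c, rho_star))   # equals rho_star
```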

    Lower Bounds for Oblivious Near-Neighbor Search

    We prove an $\Omega(d \lg n / (\lg\lg n)^2)$ lower bound on the dynamic cell-probe complexity of statistically $\mathit{oblivious}$ approximate-near-neighbor search ($\mathsf{ANN}$) over the $d$-dimensional Hamming cube. For the natural setting of $d = \Theta(\log n)$, our result implies an $\tilde{\Omega}(\lg^2 n)$ lower bound, which is a quadratic improvement over the highest (non-oblivious) cell-probe lower bound for $\mathsf{ANN}$. This is the first super-logarithmic $\mathit{unconditional}$ lower bound for $\mathsf{ANN}$ against general (non-black-box) data structures. We also show that any oblivious $\mathit{static}$ data structure for decomposable search problems (like $\mathsf{ANN}$) can be obliviously dynamized with $O(\log n)$ overhead in update and query time, strengthening a classic result of Bentley and Saxe (Algorithmica, 1980). Comment: 28 pages.
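    The Bentley-Saxe result being strengthened is the classic logarithmic method for decomposable search problems. Below is a minimal non-oblivious sketch of that classic construction (class and parameter names are illustrative, not from the paper): static structures of geometrically growing sizes are rebuilt like carries in binary addition, and a query combines the $O(\log n)$ partial answers.

```python
# A minimal (non-oblivious) sketch of the classic Bentley-Saxe logarithmic method
# for decomposable search problems, which the abstract's result dynamizes
# obliviously. Names are illustrative, not from the paper.
class LogarithmicMethod:
    def __init__(self, build_static, query_static, combine):
        self.build = build_static    # list of items -> static structure
        self.query = query_static    # (structure, query) -> partial answer
        self.combine = combine       # merges partial answers (decomposability)
        self.levels = []             # level i holds None or a structure on 2^i items

    def insert(self, item):
        carry = [item]
        for i, level in enumerate(self.levels):
            if level is None:
                self.levels[i] = (self.build(carry), carry)
                return
            _, items = level
            carry = carry + items    # merge, like a carry in binary addition
            self.levels[i] = None
        self.levels.append((self.build(carry), carry))

    def search(self, q):
        answers = [self.query(s, q) for s, _ in (lv for lv in self.levels if lv)]
        return self.combine(answers)

# Example: dynamizing a static "distance to the closest stored number" structure.
import numpy as np
dyn = LogarithmicMethod(build_static=np.array,
                        query_static=lambda arr, q: float(np.abs(arr - q).min()),
                        combine=min)
for x in [5.0, 1.0, 9.0, 4.0]:
    dyn.insert(x)
print(dyn.search(3.5))               # 0.5, since 4.0 is the closest stored point
```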

    Probabilistic Polynomials and Hamming Nearest Neighbors

    We show how to compute any symmetric Boolean function on nn variables over any field (as well as the integers) with a probabilistic polynomial of degree O(nlog(1/ϵ))O(\sqrt{n \log(1/\epsilon)}) and error at most ϵ\epsilon. The degree dependence on nn and ϵ\epsilon is optimal, matching a lower bound of Razborov (1987) and Smolensky (1987) for the MAJORITY function. The proof is constructive: a low-degree polynomial can be efficiently sampled from the distribution. This polynomial construction is combined with other algebraic ideas to give the first subquadratic time algorithm for computing a (worst-case) batch of Hamming distances in superlogarithmic dimensions, exactly. To illustrate, let c(n):NNc(n) : \mathbb{N} \rightarrow \mathbb{N}. Suppose we are given a database DD of nn vectors in {0,1}c(n)logn\{0,1\}^{c(n) \log n} and a collection of nn query vectors QQ in the same dimension. For all uQu \in Q, we wish to compute a vDv \in D with minimum Hamming distance from uu. We solve this problem in n21/O(c(n)log2c(n))n^{2-1/O(c(n) \log^2 c(n))} randomized time. Hence, the problem is in "truly subquadratic" time for O(logn)O(\log n) dimensions, and in subquadratic time for d=o((log2n)/(loglogn)2)d = o((\log^2 n)/(\log \log n)^2). We apply the algorithm to computing pairs with maximum inner product, closest pair in 1\ell_1 for vectors with bounded integer entries, and pairs with maximum Jaccard coefficients.Comment: 16 pages. To appear in 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015
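    For context, the naive baseline for this batch problem takes quadratic time; the sketch below (illustrative parameters, not the paper's algorithm) only makes the problem statement concrete, using the inner-product identity for Hamming distance on 0/1 vectors.

```python
# Naive quadratic baseline for the batch Hamming nearest neighbor problem above,
# via the identity Hamming(u, v) = |u| + |v| - 2 <u, v> for 0/1 vectors.
# Parameters are illustrative; the paper's algorithm is subquadratic, this is not.
import numpy as np

rng = np.random.default_rng(1)
n, dim = 1024, 64                          # dim plays the role of c(n) * log n

D = rng.integers(0, 2, size=(n, dim), dtype=np.int32)   # database vectors
Q = rng.integers(0, 2, size=(n, dim), dtype=np.int32)   # query vectors

ham = Q.sum(1)[:, None] + D.sum(1)[None, :] - 2 * (Q @ D.T)   # all n x n Hamming distances
nearest = ham.argmin(axis=1)               # for each query, index of its closest database vector
print(nearest[:5], ham[np.arange(5), nearest[:5]])
```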

    Doctor of Philosophy in Computing

    In the last two decades, an increasingly large amount of data has become available. Massive collections of videos, astronomical observations, social networking posts, network routing information, mobile location history and so forth are examples of real world data requiring processing for applications ranging from classification to prediction. Computational resources grow at a far more constrained rate, hence the need for efficient algorithms that scale well. Over the past twenty years, high quality theoretical algorithms have been developed for two central problems: nearest neighbor search and dimensionality reduction over Euclidean distances in worst case distributions. These two tasks are interesting in their own right. Nearest neighbor search corresponds to a database query lookup, while dimensionality reduction is a form of compression of massive data. Moreover, both are subroutines in algorithms ranging from clustering to classification. However, many highly relevant settings and distance measures have not received attention comparable to that given to worst case point sets in Euclidean space. The Bregman divergences include the information theoretic distances, such as entropy, of most relevance in many machine learning applications, and yet prior to this dissertation they lacked efficient dimensionality reductions, nearest neighbor algorithms, or even lower bounds on what could be possible. Furthermore, even in the Euclidean setting, theoretical algorithms do not leverage the fact that almost all real world datasets have significant low-dimensional substructure. In this dissertation, we explore different models and techniques for similarity search and dimensionality reduction. What upper bounds can be obtained for nearest neighbor search under Bregman divergences? What upper bounds can be achieved for dimensionality reduction for information theoretic measures? Are these problems indeed intrinsically of harder computational complexity than in the Euclidean setting? Can we improve the state of the art nearest neighbor algorithms for real world datasets in Euclidean space? These are the questions we investigate in this dissertation, and on which we shed new insight. In the first part of the dissertation, we focus on Bregman divergences. We exhibit nearest neighbor algorithms, contingent on a distributional constraint on the datasets. We next show lower bounds suggesting that this dependence is in some sense inherent to the problem complexity. After this we explore dimensionality reduction techniques for the Jensen-Shannon and Hellinger distances, two popular information theoretic measures. In the second part, we show that even for the more well-studied Euclidean case, worst case nearest neighbor algorithms can be improved upon sharply for real world datasets with spectral structure.
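    As a point of reference for the two measures named above, here is a plain implementation of the Jensen-Shannon and Hellinger distances between discrete distributions; this is only the definition, not the dissertation's dimensionality-reduction machinery.

```python
# Reference implementations of the two information-theoretic measures named
# above, for discrete distributions with full support (definitions only, not
# the dissertation's dimensionality-reduction techniques).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

def jensen_shannon(p, q):
    # JS divergence; its square root is the Jensen-Shannon distance.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(jensen_shannon(p, q), hellinger(p, q))
```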

    Lower Bounds on Time-Space Trade-Offs for Approximate Near Neighbors

    We show tight lower bounds for the entire trade-off between space and query time for the Approximate Near Neighbor search problem. Our lower bounds hold in a restricted model of computation, which captures all hashing-based approaches. In particular, our lower bound matches the upper bound recently shown in [Laarhoven 2015] for the random instance on a Euclidean sphere (which we show in fact extends to the entire space $\mathbb{R}^d$ using the techniques from [Andoni, Razenshteyn 2015]). We also show tight, unconditional cell-probe lower bounds for one and two probes, improving upon the best known bounds from [Panigrahy, Talwar, Wieder 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than for one probe. To show the result for two probes, we establish and exploit a connection to locally-decodable codes. Comment: 47 pages, 2 figures; v2: substantially revised introduction, lots of small corrections; subsumed by arXiv:1608.03580 [cs.DS] (along with arXiv:1511.07527 [cs.DS]).
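    For intuition about the "hashing-based approaches" captured by the restricted model, here is a minimal bit-sampling LSH sketch for the Hamming cube (the classic Indyk-Motwani style of scheme, not the data-dependent constructions the lower bound is matched against); parameters and names are illustrative.

```python
# A minimal bit-sampling LSH sketch for the Hamming cube, as an example of the
# kind of hashing-based scheme the restricted model captures. Parameters and
# names are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(2)
d, k, L = 128, 16, 8                       # dimension, bits per hash, number of tables

data = rng.integers(0, 2, size=(10_000, d), dtype=np.uint8)
tables, projections = [], []
for _ in range(L):
    coords = rng.choice(d, size=k, replace=False)         # hash = k sampled coordinates
    buckets = {}
    for idx, v in enumerate(data):
        buckets.setdefault(v[coords].tobytes(), []).append(idx)
    projections.append(coords)
    tables.append(buckets)

def candidates(q):
    """Union of buckets the query falls into; a search would then scan these."""
    out = set()
    for coords, buckets in zip(projections, tables):
        out.update(buckets.get(q[coords].tobytes(), []))
    return out

q = data[0] ^ (rng.random(d) < 0.05).astype(np.uint8)     # a point close to data[0]
print(len(candidates(q)), 0 in candidates(q))
```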