Approximate Nearest Neighbor Search for Low Dimensional Queries

Authors: Sariel Har-Peled, Nirman Kumar
Publication date: 01/01/2010

We study the Approximate Nearest Neighbor problem for metric spaces where the query points are constrained to lie on a subspace of low doubling dimension, while the data is high-dimensional. We show that this problem can be solved efficiently despite the high dimensionality of the data. Comment: 25 pages.

arXiv.org e-Print Archive

University of Memphis Digital Commons

CiteSeerX

Crossref

Analysis of approximate nearest neighbor searching with clustered point sets

Authors: Songrit Maneewongvatana, David M. Mount
Publication date: 01/01/1999

We present an empirical analysis of data structures for approximate nearest neighbor searching. We compare the well-known optimized kd-tree splitting method against two alternative splitting methods. The first, called the sliding-midpoint method, attempts to balance the goals of producing subdivision cells of bounded aspect ratio while not producing any empty cells. The second, called the minimum-ambiguity method, is a query-based approach. In addition to the data points, it is also given a training set of query points for preprocessing. It employs a simple greedy algorithm to select the splitting plane that minimizes the average amount of ambiguity in the choice of the nearest neighbor for the training points. We provide an empirical analysis comparing these two methods against the optimized kd-tree construction for a number of synthetically generated data and query sets. We demonstrate that for clustered data and query sets, these algorithms can provide significant improvements over the standard kd-tree construction for approximate nearest neighbor searching. Comment: 20 pages, 8 figures. Presented at ALENEX '99, Baltimore, MD, Jan 15-16, 1999.
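
The sliding-midpoint idea described above can be illustrated with a single split step. This is a minimal sketch based only on the abstract's description (split the cell at its midpoint, and slide the plane toward the points if one side would be empty); the function name `sliding_midpoint_split`, the choice of the longest cell side as the split axis, and the tie-breaking are assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sliding_midpoint_split(
    points: List[Point], lo: Sequence[float], hi: Sequence[float]
) -> Tuple[int, float, List[Point], List[Point]]:
    """Split the points of the cell [lo, hi] once; return (axis, cut, left, right).

    Assumption: the split axis is the longest side of the cell.
    """
    dim = len(lo)
    # Cut orthogonally to the longest side, which keeps the aspect ratio bounded.
    axis = max(range(dim), key=lambda i: hi[i] - lo[i])
    cut = (lo[axis] + hi[axis]) / 2.0
    left = [p for p in points if p[axis] < cut]
    right = [p for p in points if p[axis] >= cut]

    if points and not left:
        # Every point is on the right: slide the plane up to the nearest point.
        cut = min(p[axis] for p in points)
        left = [p for p in points if p[axis] <= cut]
        right = [p for p in points if p[axis] > cut]
    elif points and not right:
        # Every point is on the left: slide the plane down to the nearest point.
        cut = max(p[axis] for p in points)
        left = [p for p in points if p[axis] < cut]
        right = [p for p in points if p[axis] >= cut]
    # Caveat: if all points share one coordinate on `axis`, a side stays empty.
    return axis, cut, left, right
```

For example, two points clustered in one corner of the unit square would leave the right half empty under a plain midpoint split; the slide moves the cut to the larger coordinate so that each side holds one point.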

arXiv.org e-Print Archive

CiteSeerX

Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space

Authors: Sariel Har-Peled, Nirman Kumar
Publication date: 01/12/2012

For a set of $n$ points in $\mathbb{R}^d$, and parameters $k$ and $\varepsilon$, we present a data structure that answers $(1+\varepsilon,k)$-ANN queries in logarithmic time. Surprisingly, the space used by the data structure is $\tilde{O}(n/k)$; that is, the space used is sublinear in the input size if $k$ is sufficiently large. Our approach provides a novel way to summarize geometric data, such that meaningful proximity queries on the data can be carried out using this sketch. Using this, we provide a sublinear-space data structure that can estimate the density of a point set under various measures, including: (i) the sum of distances of the $k$ closest points to the query point, and (ii) the sum of squared distances of the $k$ closest points to the query point. Our approach generalizes to other distance-based estimations of densities of similar flavor. We also study the problem of approximating some of these quantities when using sampling. In particular, we show that a sample of size $\tilde{O}(n/k)$ is sufficient, in some restricted cases, to estimate the above quantities. Remarkably, the sample size has only linear dependency on the dimension.
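
To make the two density measures above concrete, the following sketch computes them exactly by brute force over all points. It is only a reference definition of the quantities being estimated, not the sublinear-space data structure from the abstract; the function name `knn_density_sums` is hypothetical.

```python
import heapq
import math
from typing import Sequence, Tuple


def knn_density_sums(
    points: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> Tuple[float, float]:
    """Return (sum of distances, sum of squared distances) to the k closest points.

    Exact brute-force baseline: scans every point, so it uses linear space and time.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Squared Euclidean distance from the query to every point.
    squared = [sum((a - b) ** 2 for a, b in zip(p, query)) for p in points]
    # The k smallest squared distances belong to the k nearest neighbors.
    nearest_squared = heapq.nsmallest(k, squared)
    return sum(math.sqrt(s) for s in nearest_squared), sum(nearest_squared)
```

A small value of either sum indicates that the query lies in a dense region of the point set, which is why these quantities serve as density estimates.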

arXiv.org e-Print Archive

University of Memphis Digital Commons

CiteSeerX

Bolt: Accelerated Data Mining with Fast Vector Compression

Authors: Davis W. Blalock, John V. Guttag
Publication venue: Association for Computing Machinery (ACM)
Publication date: 30/06/2017

Vectors of data are at the heart of machine learning and data mining. Recently, vector quantization methods have shown great promise in reducing both the time and space costs of operating on vectors. We introduce a vector quantization algorithm that can compress vectors over 12x faster than existing techniques while also accelerating approximate vector operations such as distance and dot product computations by up to 10x. Because it can encode over 2GB of vectors per second, it makes vector quantization cheap enough to employ in many more circumstances. For example, using our technique to compute approximate dot products in a nested loop can multiply matrices faster than a state-of-the-art BLAS implementation, even when our algorithm must first compress the matrices. In addition to showing the above speedups, we demonstrate that our approach can accelerate nearest neighbor search and maximum inner product search by over 100x compared to floating point operations and up to 10x compared to other vector quantization methods. Our approximate Euclidean distance and dot product computations are not only faster than those of related algorithms with slower encodings, but also faster than Hamming distance computations, which have direct hardware support on the tested platforms. We also assess the errors of our algorithm's approximate distances and dot products, and find that it is competitive with existing, slower vector quantization algorithms. Comment: Research track paper at KDD 2017.
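
As background for the abstract above, here is a minimal sketch of generic product quantization with per-query lookup tables, the family of techniques that vector quantization methods of this kind build on. It is not the Bolt algorithm itself: the codebooks are passed in ready-made (no training step), all function names are hypothetical, and the vector length is assumed to be divisible by the number of codebooks.

```python
from typing import List, Sequence

Codebook = List[List[float]]  # the centroids of one subspace


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _split(vec: Sequence[float], num_subspaces: int) -> List[Sequence[float]]:
    # Assumption: len(vec) is divisible by num_subspaces.
    step = len(vec) // num_subspaces
    return [vec[i * step:(i + 1) * step] for i in range(num_subspaces)]


def encode(vec: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Replace each sub-vector by the index of its nearest centroid."""
    parts = _split(vec, len(codebooks))
    return [
        min(range(len(cb)), key=lambda j: _sq_dist(part, cb[j]))
        for part, cb in zip(parts, codebooks)
    ]


def build_tables(query: Sequence[float], codebooks: List[Codebook]) -> List[List[float]]:
    """Precompute squared distances from each query sub-vector to every centroid."""
    parts = _split(query, len(codebooks))
    return [[_sq_dist(part, c) for c in cb] for part, cb in zip(parts, codebooks)]


def approx_sq_dist(code: List[int], tables: List[List[float]]) -> float:
    """Approximate squared distance: one table lookup per subspace."""
    return sum(table[idx] for idx, table in zip(code, tables))
```

Once the tables for a query are built, the distance to each encoded vector costs only one lookup and one addition per subspace, which is where the speedup over full floating-point distance computations comes from.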

arXiv.org e-Print Archive

Crossref

Robust Proximity Search for Balls using Sublinear Space

Authors: Sariel Har-Peled, Nirman Kumar
Publication date: 01/01/2014

Given a set of $n$ disjoint balls $b_1, \ldots, b_n$ in $\mathbb{R}^d$, we provide a data structure, of near-linear size, that can answer $(1 \pm \epsilon)$-approximate $k$th-nearest neighbor queries in $O(\log n + 1/\epsilon^d)$ time, where $k$ and $\epsilon$ are provided at query time. If $k$ and $\epsilon$ are provided in advance, we provide a data structure to answer such queries that requires (roughly) $O(n/k)$ space; that is, the data structure has a sublinear space requirement if $k$ is sufficiently large.

arXiv.org e-Print Archive

University of Memphis Digital Commons

Dagstuhl Research Online Publication Server

Probabilistic Polynomials and Hamming Nearest Neighbors

Authors: Josh Alman, Ryan Williams
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 17/07/2015

We show how to compute any symmetric Boolean function on $n$ variables over any field (as well as the integers) with a probabilistic polynomial of degree $O(\sqrt{n \log(1/\epsilon)})$ and error at most $\epsilon$. The degree dependence on $n$ and $\epsilon$ is optimal, matching a lower bound of Razborov (1987) and Smolensky (1987) for the MAJORITY function. The proof is constructive: a low-degree polynomial can be efficiently sampled from the distribution. This polynomial construction is combined with other algebraic ideas to give the first subquadratic time algorithm for computing a (worst-case) batch of Hamming distances in superlogarithmic dimensions, exactly. To illustrate, let $c(n) : \mathbb{N} \rightarrow \mathbb{N}$. Suppose we are given a database $D$ of $n$ vectors in $\{0,1\}^{c(n) \log n}$ and a collection of $n$ query vectors $Q$ in the same dimension. For all $u \in Q$, we wish to compute a $v \in D$ with minimum Hamming distance from $u$. We solve this problem in $n^{2-1/O(c(n) \log^2 c(n))}$ randomized time. Hence, the problem is in "truly subquadratic" time for $O(\log n)$ dimensions, and in subquadratic time for $d = o((\log^2 n)/(\log \log n)^2)$. We apply the algorithm to computing pairs with maximum inner product, closest pair in $\ell_1$ for vectors with bounded integer entries, and pairs with maximum Jaccard coefficients. Comment: 16 pages. To appear in 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015).
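
The batch problem stated above is easy to specify with a naive quadratic-time baseline, sketched below. This is the straightforward all-pairs approach that the abstract's algorithm improves on, not the probabilistic-polynomial method; the function names are hypothetical and vectors are assumed to be bit-packed into Python integers.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors packed into integers."""
    return bin(a ^ b).count("1")


def batch_hamming_nn(database: List[int], queries: List[int]) -> List[int]:
    """For every query u, return a database vector v minimizing hamming(u, v).

    Naive baseline: compares every query against every database vector,
    so it performs len(queries) * len(database) distance computations.
    Ties are broken in favor of the earliest database vector.
    """
    return [min(database, key=lambda v: hamming(u, v)) for u in queries]
```

With $n$ queries and $n$ database vectors this baseline performs $n^2$ distance computations, which is the quadratic cost the abstract refers to.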

arXiv.org e-Print Archive

Crossref

Space Exploration via Proximity Search

Authors: Sariel Har-Peled, Nirman Kumar, David M. Mount, Benjamin Raichel
Publication date: 03/12/2014

We investigate what computational tasks can be performed on a point set in $\mathbb{R}^d$, if we are only given black-box access to it via nearest-neighbor search. This is a reasonable assumption if the underlying point set is either provided implicitly, or it is stored in a data structure that can answer such queries. In particular, we show the following: (A) One can compute an approximate bi-criteria $k$-center clustering of the point set, and more generally compute a greedy permutation of the point set. (B) One can decide if a query point is (approximately) inside the convex hull of the point set. We also investigate the problem of clustering the given point set, such that meaningful proximity queries can be carried out on the centers of the clusters, instead of the whole point set.
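
For readers unfamiliar with the term, a greedy permutation (farthest-first traversal) orders the points so that each next point is the one farthest from all points chosen so far. The sketch below computes it with direct access to all points and distances, which is an assumption made for illustration; the abstract's setting allows only black-box nearest-neighbor queries, and that construction is not reproduced here. The function name and the choice of the first point as the start are hypothetical.

```python
import math
from typing import List, Sequence


def greedy_permutation(points: Sequence[Sequence[float]]) -> List[int]:
    """Return point indices in farthest-first order, starting from index 0."""
    n = len(points)
    if n == 0:
        return []
    order = [0]
    chosen = [False] * n
    chosen[0] = True
    # dist[i] = distance from point i to its nearest already-chosen point.
    dist = [math.dist(p, points[0]) for p in points]
    for _ in range(n - 1):
        # Pick the unchosen point that is farthest from the chosen set.
        candidates = [i for i in range(n) if not chosen[i]]
        nxt = max(candidates, key=lambda i: dist[i])
        order.append(nxt)
        chosen[nxt] = True
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return order
```

The first $k$ points of this ordering are the centers picked by the classic farthest-first heuristic for $k$-center clustering, which is the link between the two tasks named in item (A).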

arXiv.org e-Print Archive

University of Memphis Digital Commons

Dagstuhl Research Online Publication Server