18,583 research outputs found

    Approximate Nearest Neighbor Search for Low Dimensional Queries

    We study the Approximate Nearest Neighbor problem for metric spaces where the query points are constrained to lie on a subspace of low doubling dimension, while the data is high-dimensional. We show that this problem can be solved efficiently despite the high dimensionality of the data. Comment: 25 pages

    Analysis of approximate nearest neighbor searching with clustered point sets

    We present an empirical analysis of data structures for approximate nearest neighbor searching. We compare the well-known optimized kd-tree splitting method against two alternative splitting methods. The first, called the sliding-midpoint method, attempts to balance the goals of producing subdivision cells of bounded aspect ratio while not producing any empty cells. The second, called the minimum-ambiguity method, is a query-based approach. In addition to the data points, it is also given a training set of query points for preprocessing. It employs a simple greedy algorithm to select the splitting plane that minimizes the average amount of ambiguity in the choice of the nearest neighbor for the training points. We provide an empirical analysis comparing these two methods against the optimized kd-tree construction for a number of synthetically generated data and query sets. We demonstrate that for clustered data and query sets, these algorithms can provide significant improvements over the standard kd-tree construction for approximate nearest neighbor searching. Comment: 20 pages, 8 figures. Presented at ALENEX '99, Baltimore, MD, Jan 15-16, 1999
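
    The sliding-midpoint rule described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_midpoint_split`, the tuple-based point representation, and the tie-breaking conventions are assumptions made for the example, and it always cuts the longest side of the cell without any further refinement.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sliding_midpoint_split(
    points: List[Point], lo: Point, hi: Point
) -> Tuple[int, float, List[Point], List[Point]]:
    """Split the cell [lo, hi] containing `points` (hypothetical helper).

    The cut is first placed at the midpoint of the longest side of the cell.
    If every point falls on one side, the cut slides toward the points until
    it touches the nearest one, so that neither child cell is empty.
    Assumes the points take at least two distinct values along the cut axis;
    otherwise one side is unavoidably empty.
    """
    # Cut orthogonally to the longest side, which keeps the aspect ratio bounded.
    axis = max(range(len(lo)), key=lambda i: hi[i] - lo[i])
    cut = (lo[axis] + hi[axis]) / 2.0

    left = [p for p in points if p[axis] < cut]
    right = [p for p in points if p[axis] >= cut]

    if not left:
        # All points lie above the midpoint: slide up to the smallest coordinate.
        cut = min(p[axis] for p in points)
        left = [p for p in points if p[axis] <= cut]
        right = [p for p in points if p[axis] > cut]
    elif not right:
        # All points lie below the midpoint: slide down to the largest coordinate.
        cut = max(p[axis] for p in points)
        left = [p for p in points if p[axis] < cut]
        right = [p for p in points if p[axis] >= cut]

    return axis, cut, left, right
```

    For a long, thin cell whose points are clustered at one end, the plain midpoint cut would leave one child empty; the slide moves the cut to the outermost point of the cluster instead.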

    Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space

    For a set of $n$ points in $\mathbb{R}^d$, and parameters $k$ and $\epsilon$, we present a data structure that answers $(1+\epsilon, k)$-ANN queries in logarithmic time. Surprisingly, the space used by the data structure is $\tilde{O}(n/k)$; that is, the space used is sublinear in the input size if $k$ is sufficiently large. Our approach provides a novel way to summarize geometric data, such that meaningful proximity queries on the data can be carried out using this sketch. Using this, we provide a sublinear space data structure that can estimate the density of a point set under various measures, including: (i) sum of distances of the $k$ closest points to the query point, and (ii) sum of squared distances of the $k$ closest points to the query point. Our approach generalizes to other distance-based estimation of densities of similar flavor. We also study the problem of approximating some of these quantities when using sampling. In particular, we show that a sample of size $\tilde{O}(n/k)$ is sufficient, in some restricted cases, to estimate the above quantities. Remarkably, the sample size has only linear dependency on the dimension.
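
    To make the two density measures in this abstract concrete, here is a brute-force reference computation. It is only a sketch of the quantities being estimated, using linear space and a full scan per query; it is not the sublinear-space data structure the paper describes. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def k_nearest_distances(points: List[Point], query: Point, k: int) -> List[float]:
    """Distances from `query` to its k closest points, in increasing order."""
    return sorted(math.dist(p, query) for p in points)[:k]


def density_sum(points: List[Point], query: Point, k: int) -> float:
    """Measure (i): sum of distances of the k closest points to the query."""
    return sum(k_nearest_distances(points, query, k))


def density_sum_squared(points: List[Point], query: Point, k: int) -> float:
    """Measure (ii): sum of squared distances of the k closest points."""
    return sum(d * d for d in k_nearest_distances(points, query, k))
```

    An exact baseline of this kind is mainly useful for checking the error of an approximate or sampled estimator on small inputs.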

    Bolt: Accelerated Data Mining with Fast Vector Compression

    Vectors of data are at the heart of machine learning and data mining. Recently, vector quantization methods have shown great promise in reducing both the time and space costs of operating on vectors. We introduce a vector quantization algorithm that can compress vectors over 12x faster than existing techniques while also accelerating approximate vector operations such as distance and dot product computations by up to 10x. Because it can encode over 2GB of vectors per second, it makes vector quantization cheap enough to employ in many more circumstances. For example, using our technique to compute approximate dot products in a nested loop can multiply matrices faster than a state-of-the-art BLAS implementation, even when our algorithm must first compress the matrices. In addition to showing the above speedups, we demonstrate that our approach can accelerate nearest neighbor search and maximum inner product search by over 100x compared to floating point operations and up to 10x compared to other vector quantization methods. Our approximate Euclidean distance and dot product computations are not only faster than those of related algorithms with slower encodings, but also faster than Hamming distance computations, which have direct hardware support on the tested platforms. We also assess the errors of our algorithm's approximate distances and dot products, and find that it is competitive with existing, slower vector quantization algorithms. Comment: Research track paper at KDD 201

    Robust Proximity Search for Balls using Sublinear Space

    Given a set of $n$ disjoint balls $b_1, \ldots, b_n$ in $\mathbb{R}^d$, we provide a data structure, of near linear size, that can answer $(1 \pm \epsilon)$-approximate $k$th-nearest neighbor queries in $O(\log n + 1/\epsilon^d)$ time, where $k$ and $\epsilon$ are provided at query time. If $k$ and $\epsilon$ are provided in advance, we provide a data structure to answer such queries that requires (roughly) $O(n/k)$ space; that is, the data structure has sublinear space requirement if $k$ is sufficiently large.

    Probabilistic Polynomials and Hamming Nearest Neighbors

    We show how to compute any symmetric Boolean function on $n$ variables over any field (as well as the integers) with a probabilistic polynomial of degree $O(\sqrt{n \log(1/\epsilon)})$ and error at most $\epsilon$. The degree dependence on $n$ and $\epsilon$ is optimal, matching a lower bound of Razborov (1987) and Smolensky (1987) for the MAJORITY function. The proof is constructive: a low-degree polynomial can be efficiently sampled from the distribution. This polynomial construction is combined with other algebraic ideas to give the first subquadratic time algorithm for computing a (worst-case) batch of Hamming distances in superlogarithmic dimensions, exactly. To illustrate, let $c(n) : \mathbb{N} \rightarrow \mathbb{N}$. Suppose we are given a database $D$ of $n$ vectors in $\{0,1\}^{c(n) \log n}$ and a collection of $n$ query vectors $Q$ in the same dimension. For all $u \in Q$, we wish to compute a $v \in D$ with minimum Hamming distance from $u$. We solve this problem in $n^{2-1/O(c(n) \log^2 c(n))}$ randomized time. Hence, the problem is in "truly subquadratic" time for $O(\log n)$ dimensions, and in subquadratic time for $d = o((\log^2 n)/(\log \log n)^2)$. We apply the algorithm to computing pairs with maximum inner product, closest pair in $\ell_1$ for vectors with bounded integer entries, and pairs with maximum Jaccard coefficients. Comment: 16 pages. To appear in 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015)
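
    For orientation, the batch problem stated in this abstract has an obvious quadratic-time baseline, sketched below. This is the brute-force method that a subquadratic algorithm would improve on, not the probabilistic-polynomial algorithm itself. Packing each 0/1 vector into a Python integer and the function names are assumptions made for the example.

```python
from typing import List


def hamming(u: int, v: int) -> int:
    """Hamming distance between two bit vectors packed into integers."""
    return bin(u ^ v).count("1")


def batch_hamming_nn(database: List[int], queries: List[int]) -> List[int]:
    """For each query u, return a database vector v minimizing hamming(u, v).

    Runs in time proportional to len(database) * len(queries); ties are
    broken in favor of the earliest database entry.
    """
    return [min(database, key=lambda v, u=u: hamming(u, v)) for u in queries]
```

    With $n$ database vectors and $n$ queries this performs $n^2$ distance evaluations, which is the cost the abstract's running time is measured against.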

    Space Exploration via Proximity Search

    We investigate what computational tasks can be performed on a point set in $\mathbb{R}^d$, if we are only given black-box access to it via nearest-neighbor search. This is a reasonable assumption if the underlying point set is either provided implicitly, or it is stored in a data structure that can answer such queries. In particular, we show the following: (A) One can compute an approximate bi-criteria $k$-center clustering of the point set, and more generally compute a greedy permutation of the point set. (B) One can decide if a query point is (approximately) inside the convex hull of the point set. We also investigate the problem of clustering the given point set, such that meaningful proximity queries can be carried out on the centers of the clusters, instead of the whole point set.
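
    The greedy permutation mentioned in item (A) is the ordering produced by farthest-first traversal. The sketch below computes it with direct access to all pairwise distances, which is exactly the access the paper does not assume; it is shown only to illustrate the object being computed, and the function name and `start` parameter are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def greedy_permutation(points: List[Point], start: int = 0) -> List[int]:
    """Return indices of `points` in farthest-first (greedy) order.

    Each next index is the unchosen point farthest from all points chosen so
    far. The first k indices are the centers picked by the classical greedy
    heuristic for k-center clustering.
    """
    n = len(points)
    if n == 0:
        return []
    order = [start]
    chosen = {start}
    # dist[i] = distance from point i to its nearest chosen point.
    dist = [math.dist(p, points[start]) for p in points]
    while len(order) < n:
        nxt = max((i for i in range(n) if i not in chosen), key=lambda i: dist[i])
        order.append(nxt)
        chosen.add(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return order
```

    This direct version uses on the order of $n^2$ distance computations; the abstract's point is that comparable output can be obtained through nearest-neighbor queries alone.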