27,168 research outputs found

    HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

    Full text link
    Nearest neighbor searching of large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. A flavor of approximation is, therefore, necessary to practically solve the problem of nearest neighbor search. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to triangular inequality, we also use Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion scale) high-dimensional (up to 1000+) datasets show that HD-Index is effective, efficient, and scalable.Comment: PVLDB 11(8):906-919, 201

    Efficient Large-scale Approximate Nearest Neighbor Search on the GPU

    Full text link
    We present a new approach for efficient approximate nearest neighbor (ANN) search in high dimensional spaces, extending the idea of Product Quantization. We propose a two-level product and vector quantization tree that reduces the number of vector comparisons required during tree traversal. Our approach also includes a novel highly parallelizable re-ranking method for candidate vectors by efficiently reusing already computed intermediate values. Due to its small memory footprint during traversal, the method lends itself to an efficient, parallel GPU implementation. This Product Quantization Tree (PQT) approach significantly outperforms recent state of the art methods for high dimensional nearest neighbor queries on standard reference datasets. Ours is the first work that demonstrates GPU performance superior to CPU performance on high dimensional, large scale ANN problems in time-critical real-world applications, like loop-closing in videos

    Effect of Neighborhood Approximation on Downstream Analytics

    Get PDF
    Nearest neighbor search algorithms have been successful in finding practically useful solutions to computationally difficult problems. In the nearest neighbor search problem, the brute force approach is often more efficient than other algorithms for high-dimensional spaces. A special case exists for objects represented as sparse vectors, where algorithms take advantage of the fact that an object has a zero value for most features. In general, since exact nearest neighbor search methods suffer from the “curse of dimensionality,” many practitioners use approximate nearest neighbor search algorithms when faced with high dimensionality or large datasets. To a reasonable degree, it is known that relying on approximate nearest neighbors leads to some error in the solutions to the underlying data mining problems the neighbors are used to solve. However, no one has attempted to quantify this error or provide practitioners with guidance in choosing appropriate search methods for their task. In this thesis, we conduct several experiments on recommender systems with a goal to find the degree to which approximate nearest neighbor algorithms are subject to these types of error propagation problems. Additionally, we provide persuasive evidence on the trade-off between search performance and analytics effectiveness. Our experimental evaluation demonstrates that a state-of-the-art approximate nearest neighbor search method (L2KNNGApprox) is not an effective solution in most cases. When tuned to achieve high search recall (80% or higher), it provides a fairly competitive recommendation performance compared to an efficient exact search method but offers no advantage in terms of efficiency (0.1x—1.5x speedup). Low search recall (\u3c60%) leads to poor recommendation performance. Finally, medium recall values (60%—80%) lead to reasonable recommendation performance but are hard to achieve and offer only a modest gain in efficiency (1.5x—2.3x)

    Efficient k-NN search on vertically decomposed data

    Get PDF
    Applications like multimedia retrieval require efficient support for similarity search on large data collections. Yet, nearest neighbor search is a difficult problem in high dimensional spaces, rendering efficient applications hard to realize: index structures degrade rapidly with increasing dimensionality, while sequential search is not an attractive solution for repositories with millions of objects. This paper approaches the problem from a different angle. A solution is sought in an unconventional storage scheme, that opens up a new range of techniques for processing k-NN queries, especially suited for high dimensional spaces. The suggested (physical) database design accommodates well a novel variant of branch-and-bound search, t

    Balancing clusters to reduce response time variability in large scale image search

    Get PDF
    Many algorithms for approximate nearest neighbor search in high-dimensional spaces partition the data into clusters. At query time, in order to avoid exhaustive search, an index selects the few (or a single) clusters nearest to the query point. Clusters are often produced by the well-known kk-means approach since it has several desirable properties. On the downside, it tends to produce clusters having quite different cardinalities. Imbalanced clusters negatively impact both the variance and the expectation of query response times. This paper proposes to modify kk-means centroids to produce clusters with more comparable sizes without sacrificing the desirable properties. Experiments with a large scale collection of image descriptors show that our algorithm significantly reduces the variance of response times without seriously impacting the search quality
    • …
    corecore