13,345 research outputs found
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to triangular inequality, we also use Ptolemaic inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+) datasets show that HD-Index is effective,
efficient, and scalable.Comment: PVLDB 11(8):906-919, 201
Efficient Document Indexing Using Pivot Tree
We present a novel method for efficiently searching top-k neighbors for
documents represented in high dimensional space of terms based on the cosine
similarity. Mostly, documents are stored as bag-of-words tf-idf representation.
One of the most used ways of computing similarity between a pair of documents
is cosine similarity between the vector representations, but cosine similarity
is not a metric distance measure as it doesn't follow triangle inequality,
therefore most metric searching methods can not be applied directly. We propose
an efficient method for indexing documents using a pivot tree that leads to
efficient retrieval. We also study the relation between precision and
efficiency for the proposed method and compare it with a state of the art in
the area of document searching based on inner product.Comment: 6 Pages, 2 Figure
Angle Tree: Nearest Neighbor Search in High Dimensions with Low Intrinsic Dimensionality
We propose an extension of tree-based space-partitioning indexing structures
for data with low intrinsic dimensionality embedded in a high dimensional
space. We call this extension an Angle Tree. Our extension can be applied to
both classical kd-trees as well as the more recent rp-trees. The key idea of
our approach is to store the angle (the "dihedral angle") between the data
region (which is a low dimensional manifold) and the random hyperplane that
splits the region (the "splitter"). We show that the dihedral angle can be used
to obtain a tight lower bound on the distance between the query point and any
point on the opposite side of the splitter. This in turn can be used to
efficiently prune the search space. We introduce a novel randomized strategy to
efficiently calculate the dihedral angle with a high degree of accuracy.
Experiments and analysis on real and synthetic data sets shows that the Angle
Tree is the most efficient known indexing structure for nearest neighbor
queries in terms of preprocessing and space usage while achieving high accuracy
and fast search time.Comment: To be submitted to IEEE Transactions on Pattern Analysis and Machine
Intelligenc
- …