16,493 research outputs found
Angle Tree: Nearest Neighbor Search in High Dimensions with Low Intrinsic Dimensionality
We propose an extension of tree-based space-partitioning indexing structures
for data with low intrinsic dimensionality embedded in a high dimensional
space. We call this extension an Angle Tree. Our extension can be applied to
both classical kd-trees as well as the more recent rp-trees. The key idea of
our approach is to store the angle (the "dihedral angle") between the data
region (which is a low dimensional manifold) and the random hyperplane that
splits the region (the "splitter"). We show that the dihedral angle can be used
to obtain a tight lower bound on the distance between the query point and any
point on the opposite side of the splitter. This in turn can be used to
efficiently prune the search space. We introduce a novel randomized strategy to
efficiently calculate the dihedral angle with a high degree of accuracy.
Experiments and analysis on real and synthetic data sets shows that the Angle
Tree is the most efficient known indexing structure for nearest neighbor
queries in terms of preprocessing and space usage while achieving high accuracy
and fast search time.Comment: To be submitted to IEEE Transactions on Pattern Analysis and Machine
Intelligenc
High-dimensional approximate nearest neighbor: k-d Generalized Randomized Forests
We propose a new data-structure, the generalized randomized kd forest, or
kgeraf, for approximate nearest neighbor searching in high dimensions. In
particular, we introduce new randomization techniques to specify a set of
independently constructed trees where search is performed simultaneously, hence
increasing accuracy. We omit backtracking, and we optimize distance
computations, thus accelerating queries. We release public domain software
geraf and we compare it to existing implementations of state-of-the-art methods
including BBD-trees, Locality Sensitive Hashing, randomized kd forests, and
product quantization. Experimental results indicate that our method would be
the method of choice in dimensions around 1,000, and probably up to 10,000, and
pointsets of cardinality up to a few hundred thousands or even one million;
this range of inputs is encountered in many critical applications today. For
instance, we handle a real dataset of images represented in 960
dimensions with a query time of less than sec on average and 90\% responses
being true nearest neighbors
- …