20,489 research outputs found
Nearest neighbor search with multiple random projection trees : core method and improvements
Nearest neighbor search is a crucial tool in computer science and a part of many machine learning algorithms, the most obvious example being the venerable k-NN classifier. More generally, nearest neighbors have usages in numerous fields such as classification, regression, computer vision, recommendation systems, robotics and compression to name just a few examples. In general, nearest neighbor problems cannot be answered in sublinear time – to identify the actual nearest data points, clearly all objects have to be accessed at least once. However, in the class of applications where nearest neighbor searches are repeatedly made within a fixed data set that is available upfront, such as recommendation systems (Spotify, e-commerce, etc.), we can do better. In a computationally expensive offline phase the data set is indexed with a data structure, and in the online phase the index is used to answer nearest neighbor queries at a superior rate. The cost of indexing is usually much larger than that of performing a single query, but with a high number of queries the initial indexing cost gets eventually compensated.
The urge for efficient index structures for nearest neighbors search has sparked a lot of research and hundreds of papers have been published to date. We look into the class of structures called binary space partitioning trees, specifically the random projection tree. Random projection trees have favorable properties especially when working with data sets with low intrinsic dimensionality. However, they have rarely been used in real-life nearest neighbor solutions due to limiting factors such as the relatively high cost of projection computations in high dimensional spaces. We present a new index structure for approximate nearest neighbor search that consists of multiple random projection trees, and several variants of algorithms to use it for efficient nearest neighbor search.
We start by specifying our variant of the random projection tree and show how to construct an index of multiple random projection trees (MRPT), along with a simple query that combines the results from independent random projection trees to achieve much higher query accuracy with faster query times. This is followed by discussion of further methods to optimize accuracy and storage. The focus will be on algorithmic details, accompanied by a thorough analysis of memory and time complexity. Finally we will show experimentally that a real-life implementation of these ideas leads to an algorithm that achieves faster query times than the currently available open source libraries for high-recall approximate nearest neighbor search
A Sparse Johnson--Lindenstrauss Transform
Dimension reduction is a key algorithmic tool with many applications
including nearest-neighbor search, compressed sensing and linear algebra in the
streaming model. In this work we obtain a {\em sparse} version of the
fundamental tool in dimension reduction --- the Johnson--Lindenstrauss
transform. Using hashing and local densification, we construct a sparse
projection matrix with just non-zero entries
per column. We also show a matching lower bound on the sparsity for a large
class of projection matrices. Our bounds are somewhat surprising, given the
known lower bounds of both on the number of rows
of any projection matrix and on the sparsity of projection matrices generated
by natural constructions.
Using this, we achieve an update time per
non-zero element for a -approximate projection, thereby
substantially outperforming the update time
required by prior approaches. A variant of our method offers the same
guarantees for sparse vectors, yet its worst case running time
matches the best approach of Ailon and Liberty.Comment: 10 pages, conference version
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is a problem of pursuing the data
items whose distances to a query item are the smallest from a large database.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work locality sensitive hashing. We divide the hashing
algorithms two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution and learning to hash, which
learns hash functions according the data distribution, and review them from
various aspects, including hash function design and distance measure and search
scheme in the hash coding space
K-nearest Neighbor Search by Random Projection Forests
K-nearest neighbor (kNN) search has wide applications in many areas,
including data mining, machine learning, statistics and many applied domains.
Inspired by the success of ensemble methods and the flexibility of tree-based
methodology, we propose random projection forests (rpForests), for kNN search.
rpForests finds kNNs by aggregating results from an ensemble of random
projection trees with each constructed recursively through a series of
carefully chosen random projections. rpForests achieves a remarkable accuracy
in terms of fast decay in the missing rate of kNNs and that of discrepancy in
the kNN distances. rpForests has a very low computational complexity. The
ensemble nature of rpForests makes it easily run in parallel on multicore or
clustered computers; the running time is expected to be nearly inversely
proportional to the number of cores or machines. We give theoretical insights
by showing the exponential decay of the probability that neighboring points
would be separated by ensemble random projection trees when the ensemble size
increases. Our theory can be used to refine the choice of random projections in
the growth of trees, and experiments show that the effect is remarkable.Comment: 15 pages, 4 figures, 2018 IEEE Big Data Conferenc
- …