Nearest neighbor search with multiple random projection trees : core method and improvements

Author: Pitkänen Teemu
Publication venue: Helsingin yliopisto
Publication date: 01/01/2017

Nearest neighbor search is a crucial tool in computer science and a part of many machine learning algorithms, the most obvious example being the venerable k-NN classifier. More generally, nearest neighbor methods are used in numerous fields such as classification, regression, computer vision, recommendation systems, robotics and compression, to name just a few. In general, nearest neighbor problems cannot be answered in sublinear time: to identify the actual nearest data points, all objects clearly have to be accessed at least once. However, in the class of applications where nearest neighbor searches are repeatedly made within a fixed data set that is available upfront, such as recommendation systems (Spotify, e-commerce, etc.), we can do better. In a computationally expensive offline phase the data set is indexed with a data structure, and in the online phase the index is used to answer nearest neighbor queries much faster. The cost of indexing is usually much larger than that of performing a single query, but with a high number of queries the initial indexing cost is eventually amortized. The need for efficient index structures for nearest neighbor search has sparked a lot of research, and hundreds of papers have been published to date. We look into the class of structures called binary space partitioning trees, specifically the random projection tree. Random projection trees have favorable properties, especially when working with data sets with low intrinsic dimensionality. However, they have rarely been used in real-life nearest neighbor solutions due to limiting factors such as the relatively high cost of projection computations in high-dimensional spaces. We present a new index structure for approximate nearest neighbor search that consists of multiple random projection trees, and several variants of algorithms to use it for efficient nearest neighbor search.
We start by specifying our variant of the random projection tree and show how to construct an index of multiple random projection trees (MRPT), along with a simple query that combines the results from independent random projection trees to achieve much higher query accuracy with faster query times. This is followed by a discussion of further methods to optimize accuracy and storage. The focus will be on algorithmic details, accompanied by a thorough analysis of memory and time complexity. Finally, we will show experimentally that a real-life implementation of these ideas leads to an algorithm that achieves faster query times than the currently available open source libraries for high-recall approximate nearest neighbor search.
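The idea of indexing with several random projection trees and combining their results at query time can be illustrated with a minimal sketch. This is not the thesis's implementation: the median split along a Gaussian random direction, the dictionary-based tree layout, and the names `build_tree`, `leaf_for` and `mrpt_query` are all assumptions made for illustration.

```python
import random


def build_tree(points, ids, leaf_size, rng):
    """Recursively split `ids` at the median of a random projection.

    Assumes leaf_size >= 1. The median split is an illustrative choice.
    """
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    dim = len(points[0])
    # One random direction per node, drawn from a standard Gaussian.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    proj = {i: sum(d * x for d, x in zip(direction, points[i])) for i in ids}
    ordered = sorted(ids, key=proj.get)
    mid = len(ordered) // 2
    return {
        "direction": direction,
        "split": proj[ordered[mid]],
        "left": build_tree(points, ordered[:mid], leaf_size, rng),
        "right": build_tree(points, ordered[mid:], leaf_size, rng),
    }


def leaf_for(tree, q):
    """Route the query down one tree and return the ids stored in its leaf."""
    while "leaf" not in tree:
        p = sum(d * x for d, x in zip(tree["direction"], q))
        tree = tree["left"] if p < tree["split"] else tree["right"]
    return tree["leaf"]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mrpt_query(points, trees, q, k):
    """Pool the leaf candidates of all trees, then rank them exactly."""
    candidates = set()
    for tree in trees:
        candidates.update(leaf_for(tree, q))
    return sorted(candidates, key=lambda i: sq_dist(points[i], q))[:k]
```

Only the pooled candidates are compared against the query exactly, so adding trees enlarges the candidate set and raises the chance that the true neighbors are in it, at the cost of more distance computations.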

Helsingin yliopiston digitaalinen arkisto

A Sparse Johnson--Lindenstrauss Transform

Author: Dasgupta Anirban
Kumar Ravi
Sarlós Tamás
Publication date: 01/01/2010

Dimension reduction is a key algorithmic tool with many applications including nearest-neighbor search, compressed sensing and linear algebra in the streaming model. In this work we obtain a sparse version of the fundamental tool in dimension reduction, the Johnson--Lindenstrauss transform. Using hashing and local densification, we construct a sparse projection matrix with just $\tilde{O}(\frac{1}{\epsilon})$ non-zero entries per column. We also show a matching lower bound on the sparsity for a large class of projection matrices. Our bounds are somewhat surprising, given the known lower bounds of $\Omega(\frac{1}{\epsilon^2})$ both on the number of rows of any projection matrix and on the sparsity of projection matrices generated by natural constructions. Using this, we achieve an $\tilde{O}(\frac{1}{\epsilon})$ update time per non-zero element for a $(1\pm\epsilon)$-approximate projection, thereby substantially outperforming the $\tilde{O}(\frac{1}{\epsilon^2})$ update time required by prior approaches. A variant of our method offers the same guarantees for sparse vectors, yet its $\tilde{O}(d)$ worst case running time matches the best approach of Ailon and Liberty.

Comment: 10 pages, conference version
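To make the link between column sparsity and update time concrete, here is a minimal sketch of a generic sparse sign projection. It is a simpler, count-sketch-style scheme swapped in for illustration, not the hashing and local densification construction of the paper; the names `make_sparse_projection` and `project` and the parameters `d`, `k` and `s` are assumptions.

```python
import math
import random


def make_sparse_projection(d, k, s, seed=0):
    """For each of the d input coordinates, pick s distinct target rows
    (out of k) and a random sign for each. Requires s <= k."""
    rng = random.Random(seed)
    cols = []
    for _ in range(d):
        rows = rng.sample(range(k), s)
        signs = [rng.choice((-1.0, 1.0)) for _ in range(s)]
        cols.append(list(zip(rows, signs)))
    return cols


def project(cols, k, x):
    """Map x from d to k dimensions.

    Each non-zero input coordinate touches exactly s output entries,
    so the work per non-zero element is proportional to s.
    """
    s = len(cols[0])
    scale = 1.0 / math.sqrt(s)
    y = [0.0] * k
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue  # zero coordinates cost nothing
        for row, sign in cols[j]:
            y[row] += sign * xj * scale
    return y
```

With `s` non-zero entries per column, projecting a vector costs `s` updates per non-zero coordinate, which is the quantity the abstract's sparsity bound controls.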

arXiv.org e-Print Archive

CiteSeerX

Hashing for Similarity Search: A Survey

Author: Ji Jianqiu
Shen Heng Tao
Song Jingkuan
Wang Jingdong
Publication date: 13/08/2014

Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
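As a small illustration of the data-independent category, the sketch below uses random hyperplane hashing, one well-known locality sensitive hashing family for angular similarity. It is chosen here only as an example and is not taken from the survey's text; the function names and parameters are assumptions.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals, without looking at any data."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(planes, x):
    """One bit per hyperplane: which side of the hyperplane x falls on."""
    return tuple(
        1 if sum(w * v for w, v in zip(plane, x)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of differing bits between two codes."""
    return sum(u != v for u, v in zip(a, b))
```

Vectors separated by a small angle tend to agree on most bits, so Hamming distance between codes serves as a cheap proxy for angular distance when searching in the hash coding space.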

arXiv.org e-Print Archive

CiteSeerX

K-nearest Neighbor Search by Random Projection Forests

Author: Li Zhenpeng
Wang Honggang
Wang Jin
Wang Yingjie
Yan Donghui
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/12/2018

K-nearest neighbor (kNN) search has wide applications in many areas, including data mining, machine learning, statistics and many applied domains. Inspired by the success of ensemble methods and the flexibility of tree-based methodology, we propose random projection forests (rpForests) for kNN search. rpForests finds kNNs by aggregating results from an ensemble of random projection trees, each constructed recursively through a series of carefully chosen random projections. rpForests achieves remarkable accuracy, in terms of fast decay in both the missing rate of kNNs and the discrepancy in the kNN distances. rpForests has a very low computational complexity. The ensemble nature of rpForests makes it easy to run in parallel on multicore or clustered computers; the running time is expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insights by showing the exponential decay, as the ensemble size increases, of the probability that neighboring points are separated by the ensemble of random projection trees. Our theory can be used to refine the choice of random projections in the growth of trees, and experiments show that the effect is remarkable.

Comment: 15 pages, 4 figures, 2018 IEEE Big Data Conference
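A back-of-the-envelope version of the exponential decay claim, under a simplifying assumption that is not stated in the abstract: each of the $T$ trees separates a given pair of neighboring points independently, with the same probability $p < 1$. The symbols $T$ and $p$ are introduced here for illustration only.

```latex
% Assumption (not from the abstract): trees separate the pair independently,
% each with the same probability p < 1.
\Pr[\text{pair separated in all } T \text{ trees}] = p^{T} = e^{-T \ln(1/p)}
```

For example, with $p = 0.5$ and $T = 10$ this gives $0.5^{10} \approx 0.001$, so a neighbor missed by half of the individual trees is still found by at least one tree about 99.9% of the time.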

arXiv.org e-Print Archive

Crossref