92 research outputs found
Graph-Based Time-Space Trade-Offs for Approximate Near Neighbors
We take a first step towards a rigorous asymptotic analysis of graph-based methods for finding (approximate) nearest neighbors in high-dimensional spaces, by analyzing the complexity of randomized greedy walks on the approximate nearest neighbor graph. For random data sets of size n = 2^{o(d)} on the d-dimensional Euclidean unit sphere, using near neighbor graphs we can provably solve the approximate nearest neighbor problem with approximation factor c > 1 in query time n^{rho_{q} + o(1)} and space n^{1 + rho_{s} + o(1)}, for arbitrary rho_{q}, rho_{s} >= 0 satisfying (2c^2 - 1) rho_{q} + 2 c^2 (c^2 - 1) sqrt{rho_{s} (1 - rho_{s})} >= c^4. Graph-based near neighbor searching is especially competitive with hash-based methods for small c and near-linear memory, and in this regime the asymptotic scaling of a greedy graph-based search matches optimal hash-based trade-offs of Andoni-Laarhoven-Razenshteyn-Waingarten [Andoni et al., 2017]. We further study how the trade-offs scale when the data set is of size n = 2^{Theta(d)}, and analyze asymptotic complexities when applying these results to lattice sieving
Graph-based time-space trade-offs for approximate near neighbors
We take a first step towards a rigorous asymptotic analysis of graph-based approaches for finding (approximate) nearest neighbors in high-dimensional spaces, by analyzing the complexity of (randomized) greedy walks on the approximate near neighbor graph. For random data sets of size on the -dimensional Euclidean unit sphere, using near neighbor graphs we can provably solve the approximate nearest neighbor problem with approximation factor c > 1 in query time and space , for arbitrary satisfying \begin{align} (2c^2 - 1) \rho_q + 2 c^2 (c^2 - 1) \sqrt{\rho_s (1 - \rho_s)} \geq c^4. \end{align} Graph-based near neighbor searching is especially competitive with hash-based methods for small and near-linear memory, and in this regime the asymptotic scaling of a greedy graph-based search matches the recent optimal hash-based trade-offs of Andoni-Laarhoven-Razenshteyn-Waingarten [SODA'17]. We further study how the trade-offs scale when the data set is of size , and analyze asymptotic complexities when applying these results to lattice sieving
Angle Tree: Nearest Neighbor Search in High Dimensions with Low Intrinsic Dimensionality
We propose an extension of tree-based space-partitioning indexing structures
for data with low intrinsic dimensionality embedded in a high dimensional
space. We call this extension an Angle Tree. Our extension can be applied to
both classical kd-trees as well as the more recent rp-trees. The key idea of
our approach is to store the angle (the "dihedral angle") between the data
region (which is a low dimensional manifold) and the random hyperplane that
splits the region (the "splitter"). We show that the dihedral angle can be used
to obtain a tight lower bound on the distance between the query point and any
point on the opposite side of the splitter. This in turn can be used to
efficiently prune the search space. We introduce a novel randomized strategy to
efficiently calculate the dihedral angle with a high degree of accuracy.
Experiments and analysis on real and synthetic data sets shows that the Angle
Tree is the most efficient known indexing structure for nearest neighbor
queries in terms of preprocessing and space usage while achieving high accuracy
and fast search time.Comment: To be submitted to IEEE Transactions on Pattern Analysis and Machine
Intelligenc
Optimal lower bounds for locality sensitive hashing (except when q is tiny)
We study lower bounds for Locality Sensitive Hashing (LSH) in the strongest
setting: point sets in {0,1}^d under the Hamming distance. Recall that here H
is said to be an (r, cr, p, q)-sensitive hash family if all pairs x, y in
{0,1}^d with dist(x,y) at most r have probability at least p of collision under
a randomly chosen h in H, whereas all pairs x, y in {0,1}^d with dist(x,y) at
least cr have probability at most q of collision. Typically, one considers d
tending to infinity, with c fixed and q bounded away from 0.
For its applications to approximate nearest neighbor search in high
dimensions, the quality of an LSH family H is governed by how small its "rho
parameter" rho = ln(1/p)/ln(1/q) is as a function of the parameter c. The
seminal paper of Indyk and Motwani showed that for each c, the extremely simple
family H = {x -> x_i : i in d} achieves rho at most 1/c. The only known lower
bound, due to Motwani, Naor, and Panigrahy, is that rho must be at least .46/c
(minus o_d(1)).
In this paper we show an optimal lower bound: rho must be at least 1/c (minus
o_d(1)). This lower bound for Hamming space yields a lower bound of 1/c^2 for
Euclidean space (or the unit sphere) and 1/c for the Jaccard distance on sets;
both of these match known upper bounds. Our proof is simple; the essence is
that the noise stability of a boolean function at e^{-t} is a log-convex
function of t.Comment: 9 pages + abstract and reference
Random projection to preserve patient privacy
With the availability of accessible and widely used cloud services, it is natural that large components of healthcare systems migrate to them; for example, patient databases can be stored and processed in the cloud. Such cloud services provide enhanced flexibility and additional gains, such as availability, ease of data share, and so on. This trend poses serious threats regarding the privacy of the patients and the trust that an individual must put into the healthcare system itself. Thus, there is a strong need of privacy preservation, achieved through a variety of different approaches. In this paper, we study the application of a random projection-based approach to patient data as a means to achieve two goals: (1) provably mask the identity of users under some adversarial-attack settings, (2) preserve enough information to allow for aggregate data analysis and application of machine-learning techniques. As far as we know, such approaches have not been applied and tested on medical data. We analyze the tradeoff between the loss of accuracy on the outcome of machine-learning algorithms and the resilience against an adversary. We show that random projections proved to be strong against known input/output attacks while offering high quality data, as long as the projected space is smaller than the original space, and as long as the amount of leaked data available to the adversary is limited
Fast -NNG construction with GPU-based quick multi-select
In this paper we describe a new brute force algorithm for building the
-Nearest Neighbor Graph (-NNG). The -NNG algorithm has many
applications in areas such as machine learning, bio-informatics, and clustering
analysis. While there are very efficient algorithms for data of low dimensions,
for high dimensional data the brute force search is the best algorithm. There
are two main parts to the algorithm: the first part is finding the distances
between the input vectors which may be formulated as a matrix multiplication
problem. The second is the selection of the -NNs for each of the query
vectors. For the second part, we describe a novel graphics processing unit
(GPU) -based multi-select algorithm based on quick sort. Our optimization makes
clever use of warp voting functions available on the latest GPUs along with
use-controlled cache. Benchmarks show significant improvement over
state-of-the-art implementations of the -NN search on GPUs
- …