4,801 research outputs found
Approximate Near Neighbors for General Symmetric Norms
We show that every symmetric normed space admits an efficient nearest
neighbor search data structure with doubly-logarithmic approximation.
Specifically, for every , , and every -dimensional
symmetric norm , there exists a data structure for
-approximate nearest neighbor search over
for -point datasets achieving query time and
space. The main technical ingredient of the algorithm is a
low-distortion embedding of a symmetric norm into a low-dimensional iterated
product of top- norms.
We also show that our techniques cannot be extended to general norms.Comment: 27 pages, 1 figur
A directed isoperimetric inequality with application to Bregman near neighbor lower bounds
Bregman divergences are a class of divergences parametrized by a
convex function and include well known distance functions like
and the Kullback-Leibler divergence. There has been extensive
research on algorithms for problems like clustering and near neighbor search
with respect to Bregman divergences, in all cases, the algorithms depend not
just on the data size and dimensionality , but also on a structure
constant that depends solely on and can grow without bound
independently.
In this paper, we provide the first evidence that this dependence on
might be intrinsic. We focus on the problem of approximate near neighbor search
for Bregman divergences. We show that under the cell probe model, any
non-adaptive data structure (like locality-sensitive hashing) for
-approximate near-neighbor search that admits probes must use space
. In contrast, for LSH under the best
bound is .
Our new tool is a directed variant of the standard boolean noise operator. We
show that a generalization of the Bonami-Beckner hypercontractivity inequality
exists "in expectation" or upon restriction to certain subsets of the Hamming
cube, and that this is sufficient to prove the desired isoperimetric inequality
that we use in our data structure lower bound.
We also present a structural result reducing the Hamming cube to a Bregman
cube. This structure allows us to obtain lower bounds for problems under
Bregman divergences from their analog. In particular, we get a
(weaker) lower bound for approximate near neighbor search of the form
for an -query non-adaptive data structure,
and new cell probe lower bounds for a number of other near neighbor questions
in Bregman space.Comment: 27 page
Near-Neighbor Preserving Dimension Reduction for Doubling Subsets of l_1
Randomized dimensionality reduction has been recognized as one of the fundamental techniques in handling high-dimensional data. Starting with the celebrated Johnson-Lindenstrauss Lemma, such reductions have been studied in depth for the Euclidean (l_2) metric, but much less for the Manhattan (l_1) metric. Our primary motivation is the approximate nearest neighbor problem in l_1. We exploit its reduction to the decision-with-witness version, called approximate near neighbor, which incurs a roughly logarithmic overhead. In 2007, Indyk and Naor, in the context of approximate nearest neighbors, introduced the notion of nearest neighbor-preserving embeddings. These are randomized embeddings between two metric spaces with guaranteed bounded distortion only for the distances between a query point and a point set. Such embeddings are known to exist for both l_2 and l_1 metrics, as well as for doubling subsets of l_2. The case that remained open were doubling subsets of l_1. In this paper, we propose a dimension reduction by means of a near neighbor-preserving embedding for doubling subsets of l_1. Our approach is to represent the pointset with a carefully chosen covering set, then randomly project the latter. We study two types of covering sets: c-approximate r-nets and randomly shifted grids, and we discuss the tradeoff between them in terms of preprocessing time and target dimension. We employ Cauchy variables: certain concentration bounds derived should be of independent interest
Ptolemaic Indexing
This paper discusses a new family of bounds for use in similarity search,
related to those used in metric indexing, but based on Ptolemy's inequality,
rather than the metric axioms. Ptolemy's inequality holds for the well-known
Euclidean distance, but is also shown here to hold for quadratic form metrics
in general, with Mahalanobis distance as an important special case. The
inequality is examined empirically on both synthetic and real-world data sets
and is also found to hold approximately, with a very low degree of error, for
important distances such as the angular pseudometric and several Lp norms.
Indexing experiments demonstrate a highly increased filtering power compared to
existing, triangular methods. It is also shown that combining the Ptolemaic and
triangular filtering can lead to better results than using either approach on
its own
- …