
    Approximate Near Neighbors for General Symmetric Norms

    We show that every symmetric normed space admits an efficient nearest neighbor search data structure with doubly-logarithmic approximation. Specifically, for every $n$, $d = n^{o(1)}$, and every $d$-dimensional symmetric norm $\|\cdot\|$, there exists a data structure for $\mathrm{poly}(\log \log n)$-approximate nearest neighbor search over $\|\cdot\|$ for $n$-point datasets achieving $n^{o(1)}$ query time and $n^{1+o(1)}$ space. The main technical ingredient of the algorithm is a low-distortion embedding of a symmetric norm into a low-dimensional iterated product of top-$k$ norms. We also show that our techniques cannot be extended to general norms.
    Comment: 27 pages, 1 figure
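    The building block of the embedding above is the top-$k$ norm: the sum of the $k$ largest absolute coordinates of a vector, so $k = 1$ gives $\ell_\infty$ and $k = d$ gives $\ell_1$. As a quick illustration, here is a minimal NumPy sketch of evaluating this norm; the function name is ours, not the paper's.

```python
import numpy as np

def top_k_norm(x, k):
    """Top-k norm: the sum of the k largest absolute coordinates of x.

    k = 1 recovers the l_inf norm; k = len(x) recovers the l_1 norm.
    """
    a = np.abs(np.asarray(x, dtype=float))
    # Partition so the k largest entries occupy the tail, then sum that tail.
    return np.partition(a, len(a) - k)[len(a) - k:].sum()

# The 2 largest absolute coordinates of (3, -5, 1, 2) are 5 and 3.
print(top_k_norm([3, -5, 1, 2], k=2))  # 8.0
```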

    A directed isoperimetric inequality with application to Bregman near neighbor lower bounds

    Bregman divergences $D_\phi$ are a class of divergences parametrized by a convex function $\phi$ and include well-known distance functions like $\ell_2^2$ and the Kullback-Leibler divergence. There has been extensive research on algorithms for problems like clustering and near neighbor search with respect to Bregman divergences; in all cases, the algorithms depend not just on the data size $n$ and dimensionality $d$, but also on a structure constant $\mu \ge 1$ that depends solely on $\phi$ and can grow without bound independently of $n$ and $d$. In this paper, we provide the first evidence that this dependence on $\mu$ might be intrinsic. We focus on the problem of approximate near neighbor search for Bregman divergences. We show that under the cell probe model, any non-adaptive data structure (like locality-sensitive hashing) for $c$-approximate near-neighbor search that admits $r$ probes must use space $\Omega(n^{1 + \frac{\mu}{cr}})$. In contrast, for LSH under $\ell_1$ the best bound is $\Omega(n^{1 + \frac{1}{cr}})$. Our new tool is a directed variant of the standard Boolean noise operator. We show that a generalization of the Bonami-Beckner hypercontractivity inequality exists "in expectation" or upon restriction to certain subsets of the Hamming cube, and that this is sufficient to prove the desired isoperimetric inequality that we use in our data structure lower bound. We also present a structural result reducing the Hamming cube to a Bregman cube. This structure allows us to obtain lower bounds for problems under Bregman divergences from their $\ell_1$ analogs. In particular, we get a (weaker) lower bound for approximate near neighbor search of the form $\Omega(n^{1 + \frac{1}{cr}})$ for an $r$-query non-adaptive data structure, and new cell probe lower bounds for a number of other near neighbor questions in Bregman space.
    Comment: 27 pages
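    For concreteness, a Bregman divergence is $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle$. The sketch below is a hypothetical helper, not code from the paper: it evaluates this definition directly and checks the two instances the abstract names, $\phi(x) = \|x\|_2^2$ (squared Euclidean distance) and negative entropy (generalized Kullback-Leibler divergence).

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||_2^2 yields the squared Euclidean distance:
sq = lambda v: np.dot(v, v)
print(bregman(sq, lambda v: 2 * v, [1.0, 0.0], [0.0, 1.0]))  # 2.0 = ||x - y||^2

# Negative entropy yields the (generalized) Kullback-Leibler divergence:
negent = lambda v: np.sum(v * np.log(v))
print(bregman(negent, lambda v: np.log(v) + 1, [0.5, 0.5], [0.9, 0.1]))  # ~0.51
```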

    Near-Neighbor Preserving Dimension Reduction for Doubling Subsets of l_1

    Randomized dimensionality reduction has been recognized as one of the fundamental techniques in handling high-dimensional data. Starting with the celebrated Johnson-Lindenstrauss Lemma, such reductions have been studied in depth for the Euclidean (l_2) metric, but much less for the Manhattan (l_1) metric. Our primary motivation is the approximate nearest neighbor problem in l_1. We exploit its reduction to the decision-with-witness version, called approximate near neighbor, which incurs a roughly logarithmic overhead. In 2007, Indyk and Naor, in the context of approximate nearest neighbors, introduced the notion of nearest neighbor-preserving embeddings. These are randomized embeddings between two metric spaces with bounded distortion guaranteed only for the distances between a query point and a point set. Such embeddings are known to exist for both the l_2 and l_1 metrics, as well as for doubling subsets of l_2. The case that remained open was that of doubling subsets of l_1. In this paper, we propose a dimension reduction by means of a near neighbor-preserving embedding for doubling subsets of l_1. Our approach is to represent the point set with a carefully chosen covering set and then randomly project the latter. We study two types of covering sets, c-approximate r-nets and randomly shifted grids, and discuss the tradeoff between them in terms of preprocessing time and target dimension. We employ Cauchy variables, and certain concentration bounds we derive should be of independent interest.
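    The Cauchy variables mentioned in the last sentence enter through 1-stability: if a projection matrix has i.i.d. standard Cauchy entries, each coordinate of the difference of two projected points is Cauchy-distributed with scale equal to the original l_1 distance, so a median over coordinates estimates that distance. Below is a minimal sketch of this ingredient alone; note the paper projects a carefully chosen covering set rather than the raw points, and all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def cauchy_project(points, target_dim):
    """Project an (n x d) point set with i.i.d. standard Cauchy entries."""
    d = points.shape[1]
    return points @ rng.standard_cauchy(size=(d, target_dim))

def l1_estimate(px, py):
    """Median-of-coordinates estimator of the original l_1 distance.

    By 1-stability, each coordinate of px - py is the l_1 distance times a
    standard Cauchy variable, whose median absolute value is 1.
    """
    return np.median(np.abs(px - py))

X = rng.random((100, 50))
P = cauchy_project(X, target_dim=30)
print(np.abs(X[0] - X[1]).sum(), l1_estimate(P[0], P[1]))  # true vs. estimated
```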

    Ptolemaic Indexing

    This paper discusses a new family of bounds for use in similarity search, related to those used in metric indexing but based on Ptolemy's inequality rather than the metric axioms. Ptolemy's inequality holds for the well-known Euclidean distance and is shown here to hold for quadratic form metrics in general, with the Mahalanobis distance as an important special case. The inequality is examined empirically on both synthetic and real-world data sets and is found to hold approximately, with a very low degree of error, for important distances such as the angular pseudometric and several $L_p$ norms. Indexing experiments demonstrate a highly increased filtering power compared to existing, triangular methods. It is also shown that combining Ptolemaic and triangular filtering can lead to better results than using either approach on its own.
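    Ptolemy's inequality says that for any four points $u, v, p_1, p_2$, $d(u,v)\,d(p_1,p_2) \le d(u,p_1)\,d(v,p_2) + d(u,p_2)\,d(v,p_1)$. Rearranged around two pivots $p_1, p_2$, it yields a lower bound on a query-to-candidate distance, $d(q,x) \ge |d(q,p_1)\,d(x,p_2) - d(q,p_2)\,d(x,p_1)| \,/\, d(p_1,p_2)$, so a candidate can be discarded whenever the bound exceeds the query radius. A small sketch under the Euclidean distance (the function names are ours):

```python
import math

def ptolemaic_lower_bound(q, x, p1, p2, dist=math.dist):
    """Pivot-based lower bound on dist(q, x) via Ptolemy's inequality:

    dist(q, x) >= |dist(q, p1) * dist(x, p2) - dist(q, p2) * dist(x, p1)|
                  / dist(p1, p2)

    Pivot-to-candidate distances can be precomputed at indexing time, so a
    candidate x is filtered out without evaluating dist(q, x) whenever this
    bound exceeds the query radius.
    """
    return abs(dist(q, p1) * dist(x, p2) - dist(q, p2) * dist(x, p1)) / dist(p1, p2)

q, x = (0.0, 0.0), (3.0, 4.0)
p1, p2 = (1.0, 0.0), (0.0, 1.0)
print(ptolemaic_lower_bound(q, x, p1, p2), math.dist(q, x))  # bound <= 5.0
```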