986 research outputs found

    Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space

    For a set of $n$ points in $\Re^d$, and parameters $k$ and $\varepsilon$, we present a data structure that answers $(1+\varepsilon, k)$-ANN queries in logarithmic time. Surprisingly, the space used by the data structure is $\widetilde{O}(n/k)$; that is, the space used is sublinear in the input size if $k$ is sufficiently large. Our approach provides a novel way to summarize geometric data, such that meaningful proximity queries on the data can be carried out using this sketch. Using this, we provide a sublinear-space data structure that can estimate the density of a point set under various measures, including: (i) the sum of distances of the $k$ closest points to the query point, and (ii) the sum of squared distances of the $k$ closest points to the query point. Our approach generalizes to other distance-based density estimates of a similar flavor. We also study the problem of approximating some of these quantities using sampling. In particular, we show that a sample of size $\widetilde{O}(n/k)$ is sufficient, in some restricted cases, to estimate the above quantities. Remarkably, the sample size has only linear dependency on the dimension.
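
    The two density measures are simple to state exactly. Below is a minimal brute-force reference in Python (the function name and NumPy setup are illustrative assumptions, not from the paper) that computes quantities (i) and (ii) in $O(nd)$ time, i.e., the exact values the paper's sublinear-space sketch is designed to approximate.

```python
import numpy as np

def knn_density_measures(points: np.ndarray, query: np.ndarray, k: int):
    """Brute-force reference for the two density measures above:
    (i) sum of distances, and (ii) sum of squared distances,
    of the k points closest to the query."""
    sq_dists = np.sum((points - query) ** 2, axis=1)  # squared Euclidean distances
    nearest = np.partition(sq_dists, k - 1)[:k]       # k smallest squared distances
    sum_dists = np.sqrt(nearest).sum()                # measure (i)
    sum_sq_dists = nearest.sum()                      # measure (ii)
    return sum_dists, sum_sq_dists

# Example: 10,000 random points in R^8, k = 50.
rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 8))
q = rng.normal(size=8)
print(knn_density_measures(P, q, k=50))
```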

    Dynamic User-Defined Similarity Searching in Semi-Structured Text Retrieval

    Modern text retrieval systems often provide a similarity search utility that allows the user to efficiently find a fixed number $k$ of documents in the data set that are most similar to a given query (here a query is either a simple sequence of keywords or the identifier of a full document, found in previous searches, that is considered of interest). We consider the case of a textual database made of semi-structured documents. For example, in a corpus of bibliographic records, any record may be structured into three fields: title, authors and abstract, where each field is an unstructured free text. Each field, in turn, is modelled with a specific vector space. The problem becomes more complex when we also allow each such vector space to have an associated user-defined dynamic weight that influences its contribution to the overall dynamic aggregated and weighted similarity. This dynamic problem was tackled in a recent paper by Singitham et al. in VLDB 2004. Their proposed solution, which we take as a baseline, is a variant of the cluster-pruning technique that has the potential to scale to very large corpora of documents, and is far more efficient than the naive exhaustive search. We devise an alternative way of embedding weights in the data structure, coupled with a non-trivial application of a clustering algorithm based on the furthest-point-first heuristic for the metric $k$-center problem. The validity of our approach is demonstrated experimentally: we significantly improve the tradeoffs between query time and output quality with respect to the baseline method of VLDB 2004, and also with respect to a novel method by Chierichetti et al., to appear in ACM PODS 2007. We also speed up the pre-processing time by a factor of at least thirty.
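
    As a concrete illustration of the aggregated and weighted similarity, here is a hedged Python sketch; the field names, the cosine measure, and all identifiers are assumptions for illustration. Each field contributes its cosine similarity scaled by a user-chosen weight, and the exhaustive scan shown is the naive baseline that cluster-pruning indexes are designed to beat.

```python
import numpy as np

def weighted_score(query_fields, doc_fields, weights):
    """Aggregated similarity: a weighted sum of per-field cosine similarities."""
    score = 0.0
    for field, w in weights.items():
        q, d = query_fields[field], doc_fields[field]
        denom = np.linalg.norm(q) * np.linalg.norm(d)
        if denom > 0:
            score += w * float(q @ d) / denom
    return score

def top_k(query_fields, corpus, weights, k):
    """Naive exhaustive search over the corpus (a list of field -> vector dicts)."""
    scored = [(weighted_score(query_fields, doc, weights), i)
              for i, doc in enumerate(corpus)]
    return sorted(scored, reverse=True)[:k]

# The weights are supplied dynamically at query time, e.g.:
# weights = {"title": 0.5, "authors": 0.2, "abstract": 0.3}
```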

    Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations

    Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search queries with constant approximation ratio and poly-logarithmic query time, on data sets with bounded "intrinsic" dimension. For the other data structure variants studied, including DiskANN with "fast preprocessing", HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in the instance size. For example, for DiskANN, we show that the query procedure can take at least $0.1n$ steps on instances of size $n$ before it encounters any of the 5 nearest neighbors of the query. Comment: Accepted by NeurIPS 2023.
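
    For readers unfamiliar with how such query procedures "take steps", below is a minimal sketch of greedy graph routing, the primitive underlying the query phase of HNSW, NSG and DiskANN. Real implementations use beam search with a candidate queue; this single-path version and its identifiers are simplifications, not the paper's construction.

```python
import numpy as np

def greedy_search(graph, points, query, start):
    """Greedy routing on a proximity graph: hop to the neighbor closest to
    the query until no neighbor improves. `graph` maps a node id to a list
    of neighbor ids; each loop iteration is one "step" as counted above."""
    dist = lambda i: np.linalg.norm(points[i] - query)
    cur, steps = start, 0
    while True:
        best = min(graph[cur], key=dist, default=cur)
        if dist(best) >= dist(cur):
            return cur, steps  # local minimum of the graph w.r.t. the query
        cur, steps = best, steps + 1
```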

    Faster Clustering via Preprocessing

    We examine the efficiency of clustering a set of points when the encompassing metric space may be preprocessed in advance. In computational problems of this genre, there is a first stage of preprocessing, whose input is a collection of points $M$; the next stage receives as input a query set $Q \subset M$, and should report a clustering of $Q$ according to some objective, such as 1-median, in which case the answer is a point $a \in M$ minimizing $\sum_{q\in Q} d_M(a,q)$. We design fast algorithms that approximately solve such problems under standard clustering objectives like $p$-center and $p$-median, when the metric $M$ has low doubling dimension. By leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the query size $n=|Q|$, and is (almost) independent of the total number of points $m=|M|$. Comment: 24 pages.
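
    For reference, the 1-median objective above has a one-line brute-force solution. The sketch below (identifiers and the Euclidean default are illustrative) spends $O(|M|\cdot|Q|)$ distance evaluations, which is exactly the dependence on $m=|M|$ that the paper's preprocessing removes.

```python
import math

def one_median(M, Q, dist=math.dist):
    """Brute-force 1-median: the point a in M minimizing sum over q in Q of
    dist(a, q). Takes O(|M|*|Q|) distance evaluations; the algorithms above
    answer such queries in time near-linear in |Q| alone."""
    return min(M, key=lambda a: sum(dist(a, q) for q in Q))

# Tiny example: M is the preprocessed point set, Q a small query subset.
M = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
Q = [(0.2, 0.1), (0.9, -0.1)]
print(one_median(M, Q))  # -> (1.0, 0.0), the cheapest center for Q
```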

    Locality-Sensitive Hashing of Curves

    We study data structures for storing a set of polygonal curves in ${\rm R}^d$ such that, given a query curve, we can efficiently retrieve similar curves from the set, where similarity is measured using the discrete Fréchet distance or the dynamic time warping distance. To this end we devise the first locality-sensitive hashing schemes for these distance measures. A major challenge is posed by the fact that these distance measures internally optimize the alignment between the curves. We give solutions for different types of alignments, including constrained and unconstrained versions. For unconstrained alignments, we improve over a result by Indyk from 2002 for short curves. Let $n$ be the number of input curves and let $m$ be the maximum complexity of a curve in the input. In the particular case where $m \leq \frac{\alpha}{4d} \log n$, for some fixed $\alpha>0$, our solutions imply an approximate near-neighbor data structure for the discrete Fréchet distance that uses space in $O(n^{1+\alpha}\log n)$ and achieves query time in $O(n^{\alpha}\log^2 n)$ and constant approximation factor. Furthermore, our solutions provide a trade-off between approximation quality and computational performance: for any parameter $k \in [m]$, we can give a data structure that uses space in $O(2^{2k} m^{k-1} n \log n + nm)$, answers queries in $O(2^{2k} m^{k} \log n)$ time and achieves approximation factor in $O(m/k)$. Comment: Proc. of 33rd International Symposium on Computational Geometry (SoCG), 2017.
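
    One simple scheme in the spirit of locality-sensitive hashing for the discrete Fréchet distance snaps each curve to a randomly shifted grid and hashes the resulting sequence of cells. The sketch below is an assumed illustration of that idea only; parameters, guarantees and all identifiers are not taken from the paper.

```python
import random

def snap_hash(curve, delta, shift, table_size=2**61 - 1):
    """One hash from a grid-snapping family for curves: snap each vertex to
    a randomly shifted grid of side delta, drop consecutive duplicates, and
    hash the resulting sequence of grid cells."""
    snapped = []
    for point in curve:
        cell = tuple(int((x + s) // delta) for x, s in zip(point, shift))
        if not snapped or snapped[-1] != cell:
            snapped.append(cell)
    return hash(tuple(snapped)) % table_size

# One random shift per hash table; curves at small Frechet distance tend to
# snap to the same cell sequence and hence collide in at least one table.
d, delta = 2, 1.0
shift = [random.uniform(0, delta) for _ in range(d)]
curve = [(0.1, 0.2), (0.9, 1.1), (2.2, 2.0)]
print(snap_hash(curve, delta, shift))
```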