986 research outputs found

    Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space

    For a set of $n$ points in $\Re^d$, and parameters $k$ and $\varepsilon$, we present a data structure that answers $(1+\varepsilon, k)$-ANN queries in logarithmic time. Surprisingly, the space used by the data structure is $\widetilde{O}(n/k)$; that is, the space used is sublinear in the input size if $k$ is sufficiently large. Our approach provides a novel way to summarize geometric data, such that meaningful proximity queries on the data can be carried out using this sketch. Using this, we provide a sublinear-space data structure that can estimate the density of a point set under various measures, including: (i) the sum of distances of the $k$ closest points to the query point, and (ii) the sum of squared distances of the $k$ closest points to the query point. Our approach generalizes to other distance-based density estimates of a similar flavor. We also study the problem of approximating some of these quantities using sampling. In particular, we show that a sample of size $\widetilde{O}(n/k)$ is sufficient, in some restricted cases, to estimate the above quantities. Remarkably, the sample size has only linear dependency on the dimension.
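
    The two density measures are simple to state exactly. Below is a minimal brute-force reference in Python (the function name and NumPy setup are illustrative assumptions, not from the paper) that computes quantities (i) and (ii) in $O(nd)$ time, i.e., the exact values the paper's sublinear-space sketch is designed to approximate.

```python
import numpy as np

def knn_density_measures(points: np.ndarray, query: np.ndarray, k: int):
    """Brute-force reference for the two density measures above:
    (i) sum of distances, and (ii) sum of squared distances,
    of the k points closest to the query."""
    sq_dists = np.sum((points - query) ** 2, axis=1)  # squared Euclidean distances
    nearest = np.partition(sq_dists, k - 1)[:k]       # k smallest squared distances
    sum_dists = np.sqrt(nearest).sum()                # measure (i)
    sum_sq_dists = nearest.sum()                      # measure (ii)
    return sum_dists, sum_sq_dists

# Example: 10,000 random points in R^8, k = 50.
rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 8))
q = rng.normal(size=8)
print(knn_density_measures(P, q, k=50))
```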

    Dynamic User-Defined Similarity Searching in Semi-Structured Text Retrieval

    Modern text retrieval systems often provide a similarity search utility that allows the user to efficiently find a fixed number $k$ of documents in the data set that are most similar to a given query (here a query is either a simple sequence of keywords or the identifier of a full document, found in previous searches, that is considered of interest). We consider the case of a textual database made of semi-structured documents. For example, in a corpus of bibliographic records, any record may be structured into three fields: title, authors and abstract, where each field is an unstructured free text. Each field, in turn, is modelled with a specific vector space. The problem becomes more complex when we also allow each such vector space to have an associated user-defined dynamic weight that influences its contribution to the overall dynamic aggregated and weighted similarity. This dynamic problem was tackled in a recent paper by Singitham et al. in VLDB 2004. Their proposed solution, which we take as a baseline, is a variant of the cluster-pruning technique that has the potential to scale to very large corpora of documents, and is far more efficient than the naive exhaustive search. We devise an alternative way of embedding weights in the data structure, coupled with a non-trivial application of a clustering algorithm based on the furthest-point-first heuristic for the metric $k$-center problem. The validity of our approach is demonstrated experimentally: we significantly improve the tradeoffs between query time and output quality with respect to the baseline method of VLDB 2004, and also with respect to a novel method by Chierichetti et al., to appear in ACM PODS 2007. We also speed up the pre-processing time by a factor of at least thirty.
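
    As a concrete illustration of the aggregated and weighted similarity, here is a hedged Python sketch; the field names, the cosine measure, and all identifiers are assumptions for illustration. Each field contributes its cosine similarity scaled by a user-chosen weight, and the exhaustive scan shown is the naive baseline that cluster-pruning indexes are designed to beat.

```python
import numpy as np

def weighted_score(query_fields, doc_fields, weights):
    """Aggregated similarity: a weighted sum of per-field cosine similarities."""
    score = 0.0
    for field, w in weights.items():
        q, d = query_fields[field], doc_fields[field]
        denom = np.linalg.norm(q) * np.linalg.norm(d)
        if denom > 0:
            score += w * float(q @ d) / denom
    return score

def top_k(query_fields, corpus, weights, k):
    """Naive exhaustive search over the corpus (a list of field -> vector dicts)."""
    scored = [(weighted_score(query_fields, doc, weights), i)
              for i, doc in enumerate(corpus)]
    return sorted(scored, reverse=True)[:k]

# The weights are supplied dynamically at query time, e.g.:
# weights = {"title": 0.5, "authors": 0.2, "abstract": 0.3}
```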

    Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations

    Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search queries with constant approximation ratio and poly-logarithmic query time, on data sets with bounded "intrinsic" dimension. For the other data structure variants studied, including DiskANN with "fast preprocessing", HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in the instance size. For example, for DiskANN, we show that the query procedure can take at least $0.1n$ steps on instances of size $n$ before it encounters any of the 5 nearest neighbors of the query. Comment: Accepted by NeurIPS 2023.
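
    For readers unfamiliar with how such query procedures "take steps", below is a minimal sketch of greedy graph routing, the primitive underlying the query phase of HNSW, NSG and DiskANN. Real implementations use beam search with a candidate queue; this single-path version and its identifiers are simplifications, not the paper's construction.

```python
import numpy as np

def greedy_search(graph, points, query, start):
    """Greedy routing on a proximity graph: hop to the neighbor closest to
    the query until no neighbor improves. `graph` maps a node id to a list
    of neighbor ids; each loop iteration is one "step" as counted above."""
    dist = lambda i: np.linalg.norm(points[i] - query)
    cur, steps = start, 0
    while True:
        best = min(graph[cur], key=dist, default=cur)
        if dist(best) >= dist(cur):
            return cur, steps  # local minimum of the graph w.r.t. the query
        cur, steps = best, steps + 1
```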

    Faster Clustering via Preprocessing

    We examine the efficiency of clustering a set of points when the encompassing metric space may be preprocessed in advance. In computational problems of this genre, there is a first stage of preprocessing, whose input is a collection of points $M$; the next stage receives as input a query set $Q \subset M$, and should report a clustering of $Q$ according to some objective, such as 1-median, in which case the answer is a point $a \in M$ minimizing $\sum_{q\in Q} d_M(a,q)$. We design fast algorithms that approximately solve such problems under standard clustering objectives like $p$-center and $p$-median, when the metric $M$ has low doubling dimension. By leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the query size $n=|Q|$, and is (almost) independent of the total number of points $m=|M|$. Comment: 24 pages.
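
    For reference, the 1-median objective above has a one-line brute-force solution. The sketch below (identifiers and the Euclidean default are illustrative) spends $O(|M|\cdot|Q|)$ distance evaluations, which is exactly the dependence on $m=|M|$ that the paper's preprocessing removes.

```python
import math

def one_median(M, Q, dist=math.dist):
    """Brute-force 1-median: the point a in M minimizing sum over q in Q of
    dist(a, q). Takes O(|M|*|Q|) distance evaluations; the algorithms above
    answer such queries in time near-linear in |Q| alone."""
    return min(M, key=lambda a: sum(dist(a, q) for q in Q))

# Tiny example: M is the preprocessed point set, Q a small query subset.
M = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
Q = [(0.2, 0.1), (0.9, -0.1)]
print(one_median(M, Q))  # -> (1.0, 0.0), the cheapest center for Q
```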

    Locality-Sensitive Hashing of Curves

    We study data structures for storing a set of polygonal curves in ${\rm R}^d$ such that, given a query curve, we can efficiently retrieve similar curves from the set, where similarity is measured using the discrete Fréchet distance or the dynamic time warping distance. To this end we devise the first locality-sensitive hashing schemes for these distance measures. A major challenge is posed by the fact that these distance measures internally optimize the alignment between the curves. We give solutions for different types of alignments, including constrained and unconstrained versions. For unconstrained alignments, we improve over a result by Indyk from 2002 for short curves. Let $n$ be the number of input curves and let $m$ be the maximum complexity of a curve in the input. In the particular case where $m \leq \frac{\alpha}{4d} \log n$, for some fixed $\alpha>0$, our solutions imply an approximate near-neighbor data structure for the discrete Fréchet distance that uses space in $O(n^{1+\alpha}\log n)$ and achieves query time in $O(n^{\alpha}\log^2 n)$ and constant approximation factor. Furthermore, our solutions provide a trade-off between approximation quality and computational performance: for any parameter $k \in [m]$, we can give a data structure that uses space in $O(2^{2k} m^{k-1} n \log n + nm)$, answers queries in $O(2^{2k} m^{k} \log n)$ time and achieves approximation factor in $O(m/k)$. Comment: Proc. of 33rd International Symposium on Computational Geometry (SoCG), 2017.
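
    One simple scheme in the spirit of locality-sensitive hashing for the discrete Fréchet distance snaps each curve to a randomly shifted grid and hashes the resulting sequence of cells. The sketch below is an assumed illustration of that idea only; parameters, guarantees and all identifiers are not taken from the paper.

```python
import random

def snap_hash(curve, delta, shift, table_size=2**61 - 1):
    """One hash from a grid-snapping family for curves: snap each vertex to
    a randomly shifted grid of side delta, drop consecutive duplicates, and
    hash the resulting sequence of grid cells."""
    snapped = []
    for point in curve:
        cell = tuple(int((x + s) // delta) for x, s in zip(point, shift))
        if not snapped or snapped[-1] != cell:
            snapped.append(cell)
    return hash(tuple(snapped)) % table_size

# One random shift per hash table; curves at small Frechet distance tend to
# snap to the same cell sequence and hence collide in at least one table.
d, delta = 2, 1.0
shift = [random.uniform(0, delta) for _ in range(d)]
curve = [(0.1, 0.2), (0.9, 1.1), (2.2, 2.0)]
print(snap_hash(curve, delta, shift))
```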