Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space
For a set of $n$ points in $\mathbb{R}^d$, and parameters $k$ and $\varepsilon$, we present
a data structure that answers $(1+\varepsilon,k)$-ANN queries in logarithmic time.
Surprisingly, the space used by the data structure is $\tilde{O}(n/k)$; that
is, the space used is sublinear in the input size if $k$ is sufficiently large.
Our approach provides a novel way to summarize geometric data, such that
meaningful proximity queries on the data can be carried out using this sketch.
Using this, we provide a sublinear space data structure that can estimate the
density of a point set under various measures, including:
(i) the sum of distances of the $k$ closest points to the query point, and
(ii) the sum of squared distances of the $k$ closest points to the query point.
Our approach generalizes to other distance based estimation of densities of
similar flavor. We also study the problem of approximating some of these
quantities when using sampling. In particular, we show that a sample of size
$\tilde{O}(n/k)$ is sufficient, in some restricted cases, to estimate the above
quantities. Remarkably, the sample size has only linear dependency on the
dimension.
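To make the two density measures in the abstract above concrete, here is a minimal Python sketch of the exact brute-force baseline: it scans every point and sums the (squared) distances to the k closest ones. This is not the sublinear-space structure the abstract describes; the function name `knn_distance_sums` and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def knn_distance_sums(points, query, k):
    """Exact baseline: sums over the k points closest to `query`.

    Returns (sum of distances, sum of squared distances).
    Uses linear space and O(n log n) time; illustrative only.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Sort all distances and keep the k smallest.
    nearest = sorted(math.dist(p, query) for p in points)[:k]
    return sum(nearest), sum(d * d for d in nearest)


if __name__ == "__main__":
    pts = [(0, 0), (3, 4), (6, 8), (1, 0)]
    print(knn_distance_sums(pts, (0, 0), 3))
```

A small value of either sum indicates that the query lies in a dense region of the point set, which is the sense in which these quantities act as density estimates.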
Dynamic User-Defined Similarity Searching in Semi-Structured Text Retrieval
Modern text retrieval systems often provide a similarity search utility that allows the user to efficiently find a fixed number k of documents in the data set that are most similar to a given query (here a query is either a simple sequence of keywords or the identifier of a full document, found in previous searches, that is considered of interest). We consider the case of a textual database made of semi-structured documents. For example, in a corpus of bibliographic records, any record may be structured into three fields: title, authors and abstract, where each field is unstructured free text. Each field, in turn, is modelled with a specific vector space. The problem is more complex when we also allow each such vector space to have an associated user-defined dynamic weight that influences its contribution to the overall dynamic aggregated and weighted similarity. This dynamic problem has been tackled in a recent paper by Singitham et al. in VLDB 2004. Their proposed solution, which we take as baseline, is a variant of the cluster-pruning technique that has the potential for scaling to very large corpora of documents, and is far more efficient than the naive exhaustive search. We devise an alternative way of embedding weights in the data structure, coupled with a non-trivial application of a clustering algorithm based on the furthest point first heuristic for the metric k-center problem. The validity of our approach is demonstrated experimentally by showing significant performance improvements over the scheme proposed in VLDB 2004. We significantly improve the tradeoffs between query time and output quality with respect to the baseline method in VLDB 2004, and also with respect to a novel method by Chierichetti et al. to appear in ACM PODS 2007. We also speed up the pre-processing time by a factor of at least thirty.
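The furthest point first heuristic for metric k-center mentioned in the abstract above can be sketched as follows. This is the generic textbook greedy procedure, not the weighted, cluster-pruning variant the authors build on top of it; the function name, the choice of the first point as the initial center, and the Euclidean default metric are assumptions for illustration.

```python
import math


def furthest_point_first(points, k, dist=math.dist):
    """Greedy k-center: repeatedly add the point furthest from the chosen centers.

    Returns (indices of the chosen centers, resulting covering radius).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [0]  # assumption: start from the first point
    # closest[i] = distance from points[i] to its nearest chosen center
    closest = [dist(p, points[0]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=closest.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            d = dist(p, points[nxt])
            if d < closest[i]:
                closest[i] = d
    return centers, max(closest)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
    print(furthest_point_first(pts, 2))
```

Each round costs one pass over the data, so the whole procedure uses about n·k distance computations. The greedy choice is known to give a 2-approximation of the optimal k-center radius in any metric space.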
Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations
Graph-based approaches to nearest neighbor search are popular and powerful
tools for handling large datasets in practice, but they have limited
theoretical guarantees. We study the worst-case performance of recent
graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG
and DiskANN. For DiskANN, we show that its "slow preprocessing" version
provably supports approximate nearest neighbor search query with constant
approximation ratio and poly-logarithmic query time, on data sets with bounded
"intrinsic" dimension. For the other data structure variants studied, including
DiskANN with "fast preprocessing", HNSW and NSG, we present a family of
instances on which the empirical query time required to achieve a "reasonable"
accuracy is linear in instance size. For example, for DiskANN, we show that the
query procedure can take at least steps on instances of size before
it encounters any of the nearest neighbors of the query. Comment: Accepted by NeurIPS 202
Faster Clustering via Preprocessing
We examine the efficiency of clustering a set of points, when the
encompassing metric space may be preprocessed in advance. In computational
problems of this genre, there is a first stage of preprocessing, whose input is
a collection of points $M$; the next stage receives as input a query set
$Q \subset M$, and should report a clustering of $Q$ according to some
objective, such as 1-median, in which case the answer is a point $a \in M$
minimizing $\sum_{q \in Q} d_M(a,q)$.
We design fast algorithms that approximately solve such problems under
standard clustering objectives like $p$-center and $p$-median, when the metric
$M$ has low doubling dimension. By leveraging the preprocessing stage, our
algorithms achieve query time that is near-linear in the query size $n=|Q|$,
and is (almost) independent of the total number of points $m=|M|$. Comment: 24 page
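For reference, the 1-median objective named in the abstract above asks for a candidate point that minimizes the sum of its distances to the query set. The sketch below is the naive evaluation without any preprocessing; the function name `one_median`, its parameter names, and the Euclidean default metric are assumptions made for illustration.

```python
import math


def one_median(candidates, query_set, dist=math.dist):
    """Naive 1-median: the candidate minimizing the sum of distances to query_set.

    Returns (best candidate, its cost). Evaluates every candidate against
    every query point, so the running time grows with both set sizes.
    """
    if not candidates or not query_set:
        raise ValueError("both candidates and query_set must be non-empty")

    def cost(a):
        return sum(dist(a, q) for q in query_set)

    best = min(candidates, key=cost)
    return best, cost(best)


if __name__ == "__main__":
    all_points = [(0, 0), (2, 0), (5, 0)]
    query = [(0, 0), (2, 0), (4, 0)]
    print(one_median(all_points, query))
```

The product of the two set sizes in this baseline is exactly the dependence on the total number of points that the preprocessing stage is meant to remove.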
Locality-Sensitive Hashing of Curves
We study data structures for storing a set of polygonal curves in $\mathbb{R}^d$
such that, given a query curve, we can efficiently retrieve similar curves from
the set, where similarity is measured using the discrete Fréchet distance or
the dynamic time warping distance. To this end we devise the first
locality-sensitive hashing schemes for these distance measures. A major
challenge is posed by the fact that these distance measures internally optimize
the alignment between the curves. We give solutions for different types of
alignments including constrained and unconstrained versions. For unconstrained
alignments, we improve over a result by Indyk from 2002 for short curves. Let
$n$ be the number of input curves and let $m$ be the maximum complexity of a
curve in the input. In the particular case where , for some fixed , our solutions imply an approximate near-neighbor
data structure for the discrete Fréchet distance that uses space in
and achieves query time in and
constant approximation factor. Furthermore, our solutions provide a trade-off
between approximation quality and computational performance: for any parameter
, we can give a data structure that uses space in , answers queries in time and achieves
approximation factor in . Comment: Proc. of 33rd International Symposium on Computational Geometry
(SoCG), 201
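The abstract above notes that the discrete Fréchet distance internally optimizes the alignment between the two curves. A minimal sketch of the standard dynamic program that computes it exactly is given below; this is the classical quadratic-time algorithm, not the hashing scheme of the paper, and the function name and the Euclidean default metric are assumptions for illustration.

```python
import math


def discrete_frechet(p, q, dist=math.dist):
    """Discrete Fréchet distance between vertex sequences p and q.

    table[i][j] holds the smallest possible maximum pairwise distance over
    all monotone alignments of p[:i+1] with q[:j+1].
    """
    if not p or not q:
        raise ValueError("both curves must have at least one vertex")
    n, m = len(p), len(q)
    table = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                table[i][j] = d
            elif i == 0:
                table[i][j] = max(table[0][j - 1], d)
            elif j == 0:
                table[i][j] = max(table[i - 1][0], d)
            else:
                # Advance on p, on q, or on both; keep the best predecessor.
                best_prev = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
                table[i][j] = max(best_prev, d)
    return table[-1][-1]


if __name__ == "__main__":
    a = [(0, 0), (1, 0), (2, 0)]
    b = [(0, 1), (1, 1), (2, 1)]
    print(discrete_frechet(a, b))
```

Replacing the outer `max` with a sum of the matched distances turns the same recurrence into dynamic time warping, the other similarity measure the abstract considers.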