Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while far fewer
studies concern variety. A metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric space,
a collection of indexing techniques for metric data has been proposed.
However, existing surveys each offer only narrow coverage, and no
comprehensive empirical study of those techniques exists. We offer a survey of
all the existing metric indexes that can support exact similarity search, by i)
summarizing all the existing partitioning, pruning and validation techniques
used for metric indexes, ii) providing time and storage complexity analyses
of index construction, and iii) reporting on a comprehensive empirical
comparison of their similarity query processing performance. Empirical
comparisons are used to evaluate index performance during search because
differences are hard to see from a complexity analysis of similarity query
processing, and because query performance depends on the pruning and validation
abilities, which in turn depend on the data distribution. This article aims to reveal
the strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting and to direct future research on metric indexes.
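To make the pruning and validation ideas in the abstract above concrete, here is a minimal Python sketch of a pivot-based range query. It is not taken from the survey: the function names, the flat pivot table, and the use of Euclidean distance via `math.dist` are assumptions chosen for illustration. Pruning uses the triangle-inequality lower bound |d(q, p) - d(o, p)|, and validation uses the upper bound d(q, p) + d(o, p).

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Point, Point], float]


def build_pivot_table(objects: Sequence[Point], pivots: Sequence[Point],
                      dist: Distance) -> List[List[float]]:
    """Precompute d(o, p) for every object o and every pivot p."""
    return [[dist(o, p) for p in pivots] for o in objects]


def pivot_range_query(query: Point, radius: float, objects: Sequence[Point],
                      pivots: Sequence[Point], table: List[List[float]],
                      dist: Distance) -> Tuple[List[Point], int]:
    """Return the objects within `radius` of `query` and the number of
    query-to-object distances that had to be computed.

    Assumes at least one pivot and that `dist` satisfies the triangle inequality.
    """
    q_to_pivots = [dist(query, p) for p in pivots]
    results: List[Point] = []
    verified = 0
    for obj, row in zip(objects, table):
        # Pruning: d(q, o) >= |d(q, p) - d(o, p)| for every pivot p.
        lower = max(abs(dq - dp) for dq, dp in zip(q_to_pivots, row))
        if lower > radius:
            continue  # cannot be a result, no distance computation needed
        # Validation: d(q, o) <= d(q, p) + d(o, p) for every pivot p.
        upper = min(dq + dp for dq, dp in zip(q_to_pivots, row))
        if upper <= radius:
            results.append(obj)  # certainly a result, no computation needed
            continue
        # Neither bound decides, so compute the actual distance.
        verified += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, verified
```

Calling `pivot_range_query` with a table built by `build_pivot_table` returns the same answer as a linear scan under these assumptions; the second return value shows how many objects the pivots failed to decide.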
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to the triangle inequality, we also use the Ptolemaic inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+ dimensions) datasets show that HD-Index is effective,
efficient, and scalable.
Comment: PVLDB 11(8):906-919, 201
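The two distance filters named in the abstract above can be sketched as follows. This is not the HD-Index implementation: it is the standard pivot-pair form of the Ptolemaic lower bound, d(q, o) >= |d(q, p) d(o, s) - d(q, s) d(o, p)| / d(p, s), which holds in Ptolemaic metric spaces such as Euclidean space. The function names and the use of `math.dist` are assumptions made for illustration.

```python
import math
from typing import Sequence


def triangle_lower_bound(dq_p: float, do_p: float) -> float:
    """Lower bound on d(q, o) from one reference object p."""
    return abs(dq_p - do_p)


def ptolemaic_lower_bound(dq_p: float, dq_s: float, do_p: float,
                          do_s: float, dp_s: float) -> float:
    """Lower bound on d(q, o) from two distinct reference objects p and s.

    Valid only when the distance is Ptolemaic (for example, Euclidean).
    """
    if dp_s == 0:
        raise ValueError("reference objects p and s must be distinct")
    return abs(dq_p * do_s - dq_s * do_p) / dp_s


def combined_lower_bound(query: Sequence[float], obj: Sequence[float],
                         p: Sequence[float], s: Sequence[float]) -> float:
    """Take the tightest of the triangle and Ptolemaic bounds.

    In an index, d(o, p), d(o, s) and d(p, s) would be precomputed;
    they are recomputed here only to keep the sketch self-contained.
    """
    dq_p, dq_s = math.dist(query, p), math.dist(query, s)
    do_p, do_s = math.dist(obj, p), math.dist(obj, s)
    dp_s = math.dist(p, s)
    return max(
        triangle_lower_bound(dq_p, do_p),
        triangle_lower_bound(dq_s, do_s),
        ptolemaic_lower_bound(dq_p, dq_s, do_p, do_s, dp_s),
    )
```

An object can be filtered out whenever the combined bound exceeds the current search radius. Which of the two bounds is tighter depends on where the reference objects lie relative to the query and the object.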
Data sensitive approximate query approaches in metric spaces
Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2011.
Thesis (Master's) -- Bilkent University, 2011.
Includes bibliographical references (leaves 56-59).
Similarity searching is the task of retrieval of relevant information from datasets.
We are particularly interested in datasets that contain complex and unstructured
data such as images, videos, audio recordings, protein and DNA sequences. The
relevant information is typically defined using one of two common query types: a
range query retrieves all objects within a specified distance of the
query object, whereas a k-nearest neighbor query obtains the k closest
database objects to the query object. A variety of index structures based on the
notion of metric spaces have been offered to process these two query types.
The query performance of the proposed index structures has not been satisfactory,
particularly for high-dimensional datasets. As a solution, various approximate
similarity search methods offering users a quality/time trade-off
have been proposed. The rationale is that users might be willing to tolerate
reduced query precision in order to retrieve results faster. The proposed approximate
searching schemes are usually tightly coupled to their underlying data
structures, which makes it difficult to compare the quality of their core ideas.
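The two query types defined above can be written as plain linear scans, which serve as the exact baseline that any index or approximation is measured against. This is a minimal sketch, not code from the thesis; the function names are hypothetical and Euclidean distance via `math.dist` stands in for an arbitrary metric.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Point, Point], float]


def range_query_scan(query: Point, radius: float, objects: Sequence[Point],
                     dist: Distance = math.dist) -> List[Point]:
    """Range query: every object within `radius` of `query`."""
    return [o for o in objects if dist(query, o) <= radius]


def knn_query_scan(query: Point, k: int, objects: Sequence[Point],
                   dist: Distance = math.dist) -> List[Point]:
    """k-nearest neighbor query: the k objects closest to `query`,
    ordered from nearest to farthest."""
    return heapq.nsmallest(k, objects, key=lambda o: dist(query, o))
```

Both functions compute one distance per database object, which is the cost that metric indexes try to reduce.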
In this thesis we investigate various approximation approaches to decrease the
response time of similarity queries. These approaches use a variety of statistics
about the dataset in order to obtain dynamic (at the time of querying) and specific
guidance on the approximation for each query object individually. The experiments
are performed on top of a simple underlying pivot-based index structure
to minimize the effects of the index on our approximation schemes. The results
show that it is possible to improve the performance/precision of the approximation
based on data and query object sensitive guidance.
Dilek, Merve
M.S.