Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while far fewer
studies address variety. A metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric space,
a collection of indexing techniques for metric data have been proposed.
However, each existing survey offers only narrow coverage, and no
comprehensive empirical study of those techniques exists. We offer a survey of
all the existing metric indexes that can support exact similarity search, by i)
summarizing all the existing partitioning, pruning and validation techniques
used for metric indexes, ii) providing the time and storage complexity analysis
on the index construction, and iii) reporting on a comprehensive empirical
comparison of their similarity query processing performance. Empirical
comparisons are used to evaluate index performance during search because
complexity analysis alone reveals little difference in similarity query
processing, and query performance depends on pruning and validation abilities
that in turn depend on the data distribution. This article aims at revealing
different strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting, and to direct future research on metric indexes.
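The pruning and validation steps mentioned in this abstract can be illustrated with a minimal sketch of pivot-based filtering for a range query. It is not taken from the survey: the function names (`build_pivot_table`, `range_query`) and the choice of Euclidean distance are assumptions made for illustration, and real metric indexes organize the precomputed distances in far more elaborate structures.

```python
import math

def build_pivot_table(objects, pivots, dist):
    """Precompute the distance from every object to every pivot."""
    return [[dist(o, p) for p in pivots] for o in objects]

def range_query(query, radius, objects, pivots, table, dist):
    """Return the objects within `radius` of `query` and the number of
    object distances that actually had to be computed."""
    q_to_pivots = [dist(query, p) for p in pivots]
    results = []
    computed = 0
    for obj, row in zip(objects, table):
        # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o) for any pivot p.
        lower = max(abs(qp - op) for qp, op in zip(q_to_pivots, row))
        if lower > radius:
            continue  # pruned without computing d(q, o)
        computed += 1
        if dist(query, obj) <= radius:  # validation by an exact distance
            results.append(obj)
    return results, computed

# Hypothetical toy data; math.dist is the Euclidean distance.
objects = [(0, 0), (1, 0), (5, 5), (10, 10)]
pivots = [(0, 0), (10, 0)]
table = build_pivot_table(objects, pivots, math.dist)
results, computed = range_query((0.5, 0), 1.0, objects, pivots, table, math.dist)
```

In this toy run only two of the four objects need an exact distance computation; the other two are discarded by the lower bound alone.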
Efficient Nearest Neighbor Search on Metric Time Series
While Deep-Learning approaches beat Nearest-Neighbor classifiers in an increasing number of areas, searching existing uncertain data remains an exclusive task for similarity search.
Numerous specific solutions exist for different types of data and queries.
This thesis aims at finding fast and general solutions for searching and indexing arbitrarily typed time series.
A time series is considered a sequence of elements where the elements' order matters but not their actual time stamps.
Since this thesis focuses on measuring distances between time series, the metric space is the most appropriate concept where the time series' elements come from.
Hence, this thesis mainly considers metric time series as data type.
Simple examples include time series in Euclidean vector spaces or graphs.
For general similarity search solutions in time series, two primitive comparison semantics need to be distinguished, the first of which compares the time series' trajectories ignoring time warping.
A ubiquitous example of such a distance function is the Dynamic Time Warping distance (DTW) developed in the area of speech recognition.
The Dog Keeper distance (DK) is another time-warping distance that, unlike DTW, is truly invariant under time warping and yields a metric space.
After canonically extending DTW to accept multi-dimensional time series, this thesis contributes a new algorithm computing DK that outperforms DTW on time series in high-dimensional vector spaces by more than one order of magnitude.
An analytical study of both distance functions reveals the reasons for the superiority of DK over DTW in high-dimensional spaces.
The second comparison semantic compares time series in Euclidean vector spaces regardless of their position or orientation.
This thesis proposes the Congruence distance that is the Euclidean distance minimized under all isometric transformations; thus, it is invariant under translation, rotation, and reflection of the time series and therefore disregards the position or orientation of the time series.
A proof contributed in this thesis shows that there can be no efficient algorithm computing this distance function (unless P=NP).
Therefore, this thesis contributes the Delta distance, a metric distance function serving as a lower bound for the Congruence distance.
While the Delta distance has quadratic time complexity, the provided evaluation shows a speedup of more than two orders of magnitude against the Congruence distance.
Furthermore, the Delta distance is shown to be tight on random time series, although the tightness can be arbitrarily bad in corner-case situations.
Orthogonally to the previously mentioned comparison semantics, similarity search on time series consists of two different types of queries: whole sequence matching and subsequence search.
Metric index structures (e.g., the M-Tree) natively support only whole sequence matching queries.
This thesis contributes the concept of metric subset spaces and the SuperM-Tree for indexing metric subset spaces as a generic solution for subsequence search.
Examples for metric subset spaces include subsequence search regarding the distance functions from the comparison semantics mentioned above.
The provided evaluation shows that the SuperM-Tree outperforms a linear search by multiple orders of magnitude.
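To make the two time-warping distances in this abstract concrete, the following sketch uses the textbook quadratic dynamic program. It does not reproduce the faster DK algorithm the thesis contributes. Treating DK as the variant that takes the maximum rather than the sum along the warping path (as in the discrete Fréchet distance) is an assumption here, and the names `warping_distance`, `dtw` and `dog_keeper` are hypothetical.

```python
import math

def warping_distance(s, t, dist, combine):
    """Generic O(len(s) * len(t)) dynamic program over all warping paths.
    `combine(d, prev)` merges the current element distance with the best
    cost of the path prefix (sum for DTW, max for the DK-style variant)."""
    n, m = len(s), len(t)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = combine(dist(s[i - 1], t[j - 1]), best_prev)
    return cost[n][m]

def dtw(s, t, dist=math.dist):
    """Dynamic Time Warping: sum of element distances along the best path."""
    return warping_distance(s, t, dist, lambda d, prev: d + prev)

def dog_keeper(s, t, dist=math.dist):
    """Assumed DK-style variant: largest element distance along the best path."""
    return warping_distance(s, t, dist, max)

# A series and a time-warped copy of it (the first element is repeated).
a = [(0, 0), (1, 1), (2, 2)]
b = [(0, 0), (0, 0), (1, 1), (2, 2)]
```

Both functions accept any element distance `dist`, so the elements need not be Euclidean vectors; `math.dist` is only the default used in this sketch.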
Ptolemaic Indexing
This paper discusses a new family of bounds for use in similarity search,
related to those used in metric indexing, but based on Ptolemy's inequality,
rather than the metric axioms. Ptolemy's inequality holds for the well-known
Euclidean distance, but is also shown here to hold for quadratic form metrics
in general, with Mahalanobis distance as an important special case. The
inequality is examined empirically on both synthetic and real-world data sets
and is also found to hold approximately, with a very low degree of error, for
important distances such as the angular pseudometric and several Lp norms.
Indexing experiments demonstrate a highly increased filtering power compared to
existing, triangular methods. It is also shown that combining the Ptolemaic and
triangular filtering can lead to better results than using either approach on
its own.
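The difference between triangular and Ptolemaic filtering can be sketched as follows. For a query q, an object o and two pivots p and s, Ptolemy's inequality gives d(q,p)·d(o,s) <= d(q,o)·d(p,s) + d(q,s)·d(o,p), which rearranges into a lower bound on d(q,o). The function names and the brute-force loop over pivot pairs are assumptions for illustration, not the paper's index structure.

```python
import math
from itertools import combinations

def triangle_lower_bound(q_to_pivots, o_to_pivots):
    """max over pivots p of |d(q, p) - d(o, p)|."""
    return max(abs(qp - op) for qp, op in zip(q_to_pivots, o_to_pivots))

def ptolemaic_lower_bound(q_to_pivots, o_to_pivots, pivot_dists):
    """max over pivot pairs (p, s) of
    |d(q, p) * d(o, s) - d(q, s) * d(o, p)| / d(p, s)."""
    best = 0.0
    for i, j in combinations(range(len(q_to_pivots)), 2):
        d_ps = pivot_dists[i][j]
        if d_ps == 0:
            continue  # identical pivots give no usable bound
        num = abs(q_to_pivots[i] * o_to_pivots[j] - q_to_pivots[j] * o_to_pivots[i])
        best = max(best, num / d_ps)
    return best

# Hypothetical example in the Euclidean plane.
pivots = [(0, 0), (10, 0)]
q, o = (0, 5), (10, 5)
pivot_dists = [[math.dist(a, b) for b in pivots] for a in pivots]
q_to_pivots = [math.dist(q, p) for p in pivots]
o_to_pivots = [math.dist(o, p) for p in pivots]
tri = triangle_lower_bound(q_to_pivots, o_to_pivots)
pto = ptolemaic_lower_bound(q_to_pivots, o_to_pivots, pivot_dists)
```

In this example the true distance is 10, the triangular bound is about 6.18, and the Ptolemaic bound reaches 10 up to rounding. A filter that takes the larger of the two bounds corresponds to the combined approach the abstract mentions.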
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to the triangle inequality, we also use Ptolemy's inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+) datasets show that HD-Index is effective,
efficient, and scalable.
Comment: PVLDB 11(8):906-919, 201
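As a rough illustration of what a Hilbert key is, the sketch below computes the key of a cell in a two-dimensional n x n grid (n a power of two) with the standard bit-manipulation algorithm. HD-Index works with high-dimensional data and its RDB-trees are not reproduced here. The restriction to two dimensions and the name `hilbert_key` are assumptions made to keep the example short.

```python
def hilbert_key(n, x, y):
    """Position of cell (x, y) along the Hilbert curve that fills an
    n x n grid, where n is a power of two and 0 <= x, y < n."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sorting grid cells by key yields a one-dimensional order that a tree
# can index, with consecutive keys always belonging to neighbouring cells.
points = [(3, 0), (0, 0), (1, 1), (0, 2)]
ordered = sorted(points, key=lambda p: hilbert_key(4, p[0], p[1]))
```

The locality of this ordering is what makes such keys attractive as a basis for a hierarchical index: objects that are close in key order tend to be close in space, although the converse does not always hold.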