Maximum Inner-Product Search using Tree Data-structures
The problem of efficiently finding the best match for a query in a
given set with respect to the Euclidean distance or the cosine similarity has
been extensively studied in the literature. However, a closely related problem of
efficiently finding the best match with respect to the inner product has never
been explored in the general setting to the best of our knowledge. In this
paper we consider this general problem and contrast it with the existing
best-match algorithms. First, we propose a general branch-and-bound algorithm
using a tree data structure. Subsequently, we present a dual-tree algorithm for
the case where there are multiple queries. Finally, we present a new data
structure for increasing the efficiency of the dual-tree algorithm. These
branch-and-bound algorithms involve novel bounds suited for the purpose of
best-matching with inner products. We evaluate our proposed algorithms on a
variety of data sets from various applications, and exhibit up to five orders
of magnitude improvement in query time over the naive search technique.
Comment: Under submission in KDD 201
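
The abstract above does not spell out its bounds, so the following is only a minimal sketch of the branch-and-bound idea, assuming a standard Cauchy-Schwarz bound: for every point p inside a ball with center c and radius R, the inner product <q, p> is at most <q, c> + R * ||q||. The function names (`ball_upper_bound`, `make_ball`, `mips_with_pruning`) are hypothetical, and a flat list of groups stands in for the tree described in the paper.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm."""
    return math.sqrt(dot(a, a))


def ball_upper_bound(query, center, radius):
    """Upper bound on <query, p> for any p within `radius` of `center`.

    Follows from <q, p> = <q, c> + <q, p - c> <= <q, c> + ||q|| * ||p - c||.
    """
    return dot(query, center) + radius * norm(query)


def make_ball(points):
    """Enclose points in a ball centered at their mean (illustrative choice)."""
    dim = len(points[0])
    center = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    radius = max(norm([p[i] - center[i] for i in range(dim)]) for p in points)
    return center, radius


def mips_with_pruning(query, groups):
    """Exact maximum inner-product search over groups of points.

    A group is skipped when its bound cannot beat the best value so far.
    Returns (best point, best value, number of groups skipped).
    """
    best_val, best_pt, skipped = -math.inf, None, 0
    for points in groups:
        center, radius = make_ball(points)
        if ball_upper_bound(query, center, radius) <= best_val:
            skipped += 1  # no point in this group can improve the answer
            continue
        for p in points:
            value = dot(query, p)
            if value > best_val:
                best_val, best_pt = value, p
    return best_pt, best_val, skipped
```

In a real index the balls would be computed once at build time and nested in a tree, so that a single bound evaluation can discard a whole subtree. Unlike Euclidean nearest-neighbor search, the bound depends on the query norm and on the position of the center, not only on a distance to the center.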
Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while far fewer
studies concern variety. A metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric space,
a collection of indexing techniques for metric data has been proposed.
However, existing surveys each offer only narrow coverage, and no
comprehensive empirical study of those techniques exists. We offer a survey of
all the existing metric indexes that can support exact similarity search, by i)
summarizing all the existing partitioning, pruning and validation techniques
used for metric indexes, ii) providing the time and storage complexity analysis
of index construction, and iii) reporting on a comprehensive empirical
comparison of their similarity query processing performance. Empirical
comparisons are used to evaluate index performance during search because
complexity analysis reveals few differences in similarity query processing,
and because query performance depends on pruning and validation abilities,
which in turn depend on the data distribution. This article aims to reveal
the strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting, and to direct future research on metric indexes.
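
To make the role of the triangle inequality concrete, here is a minimal sketch of pivot-based pruning and validation for a range query. It is not taken from the survey: the single-pivot layout and the names `edit_distance` and `range_query` are assumptions chosen for illustration. Edit distance on strings is used as the metric to show that the idea does not require vector data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (a metric on strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def range_query(query, radius, objects, pivot, dist):
    """Return objects within `radius` of `query`, using one pivot.

    Pruning:    |d(q, p) - d(o, p)| > r  implies  d(q, o) > r.
    Validation: d(q, p) + d(o, p) <= r   implies  d(q, o) <= r.
    Both follow from the triangle inequality. Also returns how many
    query-to-object distances were actually computed.
    """
    d_qp = dist(query, pivot)
    # In a real index these distances are precomputed at build time.
    pivot_dists = [dist(o, pivot) for o in objects]
    results, computed = [], 0
    for obj, d_op in zip(objects, pivot_dists):
        if abs(d_qp - d_op) > radius:
            continue                  # pruned without computing d(q, o)
        if d_qp + d_op <= radius:
            results.append(obj)       # validated without computing d(q, o)
            continue
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed


if __name__ == "__main__":
    words = ["cat", "cart", "card", "dog", "elephant"]
    print(range_query("car", 1, words, pivot="cat", dist=edit_distance))
```

How many objects are pruned or validated depends on where the pivot sits relative to the data and the query, which is one reason query performance is tied to the data distribution.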
Indexability, concentration, and VC theory
Degrading performance of indexing schemes for exact similarity search in high
dimensions has long been linked to histograms of distributions of
distances and other 1-Lipschitz functions getting concentrated. We discuss this
observation in the framework of the phenomenon of concentration of measure on
the structures of high dimension and the Vapnik-Chervonenkis theory of
statistical learning.
Comment: 17 pages, final submission to J. Discrete Algorithms (an expanded,
improved and corrected version of the SISAP'2010 invited paper, this e-print,
v3
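
The concentration of distances mentioned in this abstract can be observed in a small simulation. The sketch below is an illustration under assumptions that are not from the paper: points drawn uniformly from the unit cube, Euclidean distance, and relative contrast (max minus min, divided by min) as the summary statistic. The function name `relative_contrast` is hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min over distances from one random query to random points.

    Points are uniform in the unit cube [0, 1]^dim. A small value means
    the nearest and farthest points are almost equally far from the query.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

As the dimension grows, the printed contrast shrinks, so distance-based bounds such as those derived from the triangle inequality exclude fewer and fewer candidates.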