2,505 research outputs found
Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while much fewer
studies concern the variety. Metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric space,
a collection of indexing techniques for metric data have been proposed.
However, existing surveys each offers only a narrow coverage, and no
comprehensive empirical study of those techniques exists. We offer a survey of
all the existing metric indexes that can support exact similarity search, by i)
summarizing all the existing partitioning, pruning and validation techniques
used for metric indexes, ii) providing the time and storage complexity analysis
on the index construction, and iii) report on a comprehensive empirical
comparison of their similarity query processing performance. Here, empirical
comparisons are used to evaluate the index performance during search as it is
hard to see the complexity analysis differences on the similarity query
processing and the query performance depends on the pruning and validation
abilities related to the data distribution. This article aims at revealing
different strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting, and directing the future research for metric indexes
Dynamic selection of suitable pivots for similarity search in metric spaces
This paper presents a data structure based on Sparse Spatial Selection (SSS) for similarity searching. An algorithm that tries periodically to adjust pivots to the use of database index is presented. This index is dynamic. In this way, it is possible to improve the amount of discriminations done by the pivots.
So, the primary objective of indexes is achieved: to reduce the number of distance function evaluations, as it is showed in the experimentationVI Workshop Bases de Datos y Minería de Datos (WBD)Red de Universidades con Carreras en Informática (RedUNCI
Ptolemaic Indexing
This paper discusses a new family of bounds for use in similarity search,
related to those used in metric indexing, but based on Ptolemy's inequality,
rather than the metric axioms. Ptolemy's inequality holds for the well-known
Euclidean distance, but is also shown here to hold for quadratic form metrics
in general, with Mahalanobis distance as an important special case. The
inequality is examined empirically on both synthetic and real-world data sets
and is also found to hold approximately, with a very low degree of error, for
important distances such as the angular pseudometric and several Lp norms.
Indexing experiments demonstrate a highly increased filtering power compared to
existing, triangular methods. It is also shown that combining the Ptolemaic and
triangular filtering can lead to better results than using either approach on
its own
Using parallel pivot vs. clustering-based techniques for web engines
Web Engines are a useful tool for searching information in the Web. But a great part of this
information is non-textual and for that case a metric space is used. A metric space is a set where a
notion of distance (called a metric) between elements of the set is defined. In this paper we present
an efficient parallelization of a pivot-based method devised for this purpose which is called the
Sparse Spatial Selection (SSS) strategy and we compare it with a clustering-based method, a
parallel implementation of the Spatial Approximation Tree (SAT). We show that SAT compares
favourably against the pivot data structures SSS. The experimental results were obtained on a highperformance
cluster and using several metric spaces, that shows load balance parallel strategies
for the SAT. The implementations are built upon the BSP parallel computing model, which shows
efficient performance for this application domain and allows a precise evaluation of algorithms.VIII Workshop de Procesamiento Distribuido y ParaleloRed de Universidades con Carreras en Informática (RedUNCI
Using parallel pivot vs. clustering-based techniques for web engines
Web Engines are a useful tool for searching information in the Web. But a great part of this
information is non-textual and for that case a metric space is used. A metric space is a set where a
notion of distance (called a metric) between elements of the set is defined. In this paper we present
an efficient parallelization of a pivot-based method devised for this purpose which is called the
Sparse Spatial Selection (SSS) strategy and we compare it with a clustering-based method, a
parallel implementation of the Spatial Approximation Tree (SAT). We show that SAT compares
favourably against the pivot data structures SSS. The experimental results were obtained on a highperformance
cluster and using several metric spaces, that shows load balance parallel strategies
for the SAT. The implementations are built upon the BSP parallel computing model, which shows
efficient performance for this application domain and allows a precise evaluation of algorithms.VIII Workshop de Procesamiento Distribuido y ParaleloRed de Universidades con Carreras en Informática (RedUNCI
Modelling Efficient Novelty-based Search Result Diversification in Metric Spaces
Novelty-based diversification provides a way to tackle ambiguous queries by re-ranking a set of retrieved documents. Current approaches are typically greedy, requiring O(n2) document–document comparisons in order to diversify a ranking of n documents. In this article, we introduce a new approach for novelty-based search result diversification to reduce the overhead incurred by document–document comparisons. To this end, we model novelty promotion as a similarity search in a metric space, exploiting the properties of this space to efficiently identify novel documents. We investigate three different approaches: pivoting-based, clustering-based, and permutation-based. In the first two, a novel document is one that lies outside the range of a pivot or outside a cluster. In the latter, a novel document is one that has a different signature (i.e., the documentʼs relative distance to a distinguished set of fixed objects called permutants) compared to previously selected documents. Thorough experiments using two TREC test collections for diversity evaluation, as well as a large sample of the query stream of a commercial search engine show that our approaches perform at least as effectively as well-known novelty-based diversification approaches in the literature, while dramatically improving their efficiency.Fil: Gil Costa, Graciela Verónica. Yahoo; México. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico San Luis; ArgentinaFil: Santos, Rodrygo L. T.. University Of Glasgow; Reino UnidoFil: Macdonald, Craig. University Of Glasgow; Reino UnidoFil: Ounis, Iadh. University Of Glasgow; Reino Unid
Indexability, concentration, and VC theory
Degrading performance of indexing schemes for exact similarity search in high
dimensions has long since been linked to histograms of distributions of
distances and other 1-Lipschitz functions getting concentrated. We discuss this
observation in the framework of the phenomenon of concentration of measure on
the structures of high dimension and the Vapnik-Chervonenkis theory of
statistical learning.Comment: 17 pages, final submission to J. Discrete Algorithms (an expanded,
improved and corrected version of the SISAP'2010 invited paper, this e-print,
v3
- …