Indexability, concentration, and VC theory
Degrading performance of indexing schemes for exact similarity search in high
dimensions has long since been linked to histograms of distributions of
distances and other 1-Lipschitz functions getting concentrated. We discuss this
observation in the framework of the phenomenon of concentration of measure on
the structures of high dimension and the Vapnik-Chervonenkis theory of
statistical learning.
Comment: 17 pages; final submission to J. Discrete Algorithms (an expanded, improved and corrected version of the SISAP'2010 invited paper; this e-print is v3).
Whittle index based Q-learning for restless bandits with average reward
A novel reinforcement learning algorithm is introduced for multi-armed restless bandits with average reward, using the paradigms of Q-learning and the Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. A rigorous convergence analysis is provided, supported by numerical experiments, which show excellent empirical performance of the proposed scheme.
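The structural leverage described above can be illustrated with a minimal sketch: a per-arm tabular Q-update plus an activation rule that plays the arms with the largest current Whittle-index estimates. All names and the index values are hypothetical; the paper's actual algorithm learns the indices jointly with the Q-values.

```python
import numpy as np

def select_arms(whittle_idx, m):
    """Activate the m arms with the largest current Whittle-index estimates."""
    return set(np.argsort(whittle_idx)[::-1][:m].tolist())

def q_update(q_table, s, a, r, s_next, lr=0.1):
    """One tabular Q-learning step for a single arm (average-reward
    variants subtract a reward-rate estimate; omitted here for brevity)."""
    q_table[s, a] += lr * (r + q_table[s_next].max() - q_table[s, a])

# hypothetical index estimates for 5 arms; activate the top 2
idx = np.array([0.3, 1.2, -0.5, 0.9, 0.1])
print(select_arms(idx, 2))
```

Restricting the policy class to index policies is what shrinks the search space: the controller never considers arbitrary subsets of arms, only those ranked top-m by the index.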
Investigating binary partition power in metric query
It is generally understood that, as dimensionality increases, the minimum cost of metric query tends from O(log n) to O(n) in both space and time, where n is the size of the data set. With low dimensionality, the former is easy to achieve; with very high dimensionality, the latter is inevitable. We previously described BitPart as a novel mechanism suitable for performing exact metric search in “high(er)” dimensions. The essential tradeoff of BitPart is that its space cost is linear with respect to the size of the data, but the actual space required for each object may be as small as log₂ n bits, which allows even very large data sets to be queried using only main memory. Potentially the time cost still scales with O(log n). Together these attributes give exact search which outperforms indexing structures if dimensionality is within a certain range. In this article, we reiterate the design of BitPart in this context. The novel contribution is an in-depth examination of what the notion of “high(er)” means in practical terms. To do this we introduce the notion of exclusion power, and show its application to some generated data sets across different dimensions.
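The ball-partition bitmap idea behind this style of index can be sketched minimally: each object stores one bit per pivot ball, and a query radius either confirms or excludes an entire bit column via the triangle inequality. The function names and the single-radius setup are illustrative assumptions, not BitPart's actual layout.

```python
import numpy as np

def build_bitmap(data, pivots, radii, dist):
    # bits[i, j] = True if object i lies inside ball(pivots[j], radii[j])
    return np.array([[dist(o, p) <= r for p, r in zip(pivots, radii)]
                     for o in data], dtype=bool)

def range_query(q, t, data, pivots, radii, bits, dist):
    alive = np.ones(len(data), dtype=bool)
    for j, (p, r) in enumerate(zip(pivots, radii)):
        dqp = dist(q, p)
        if dqp > r + t:          # everything inside the ball is too far
            alive &= ~bits[:, j]
        elif dqp < r - t:        # everything outside the ball is too far
            alive &= bits[:, j]
    # verify the surviving candidates with the real metric
    return [i for i in np.nonzero(alive)[0] if dist(q, data[i]) <= t]
```

The bitmap itself is the only per-object state touched during filtering, which is how the main-memory property described above arises.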
Re-ranking Permutation-Based Candidate Sets with the n-Simplex Projection
In the realm of metric search, permutation-based approaches have shown very good performance in indexing and supporting approximate search on large databases. These methods embed the metric objects into a permutation space where candidate results to a given query can be efficiently identified. Typically, to achieve high effectiveness, the permutation-based result set is refined by directly comparing each candidate object to the query. Therefore, one drawback of these approaches is that the original dataset needs to be stored and then accessed during the refining step. We propose a refining approach based on a metric embedding, called the n-Simplex projection, that can be used on metric spaces meeting the n-point property. The n-Simplex projection provides upper and lower bounds on the actual distance, derived using the distances between the data objects and a finite set of pivots. We propose to reuse the distances computed for building the data permutations to derive these bounds, and we show how to use them to improve the permutation-based results. Our approach is particularly advantageous in all the cases in which the traditional refining step is too costly, e.g., very large datasets or very expensive metric functions.
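The bound-driven refinement step can be sketched as follows. For simplicity this sketch uses the classical single-pivot triangle-inequality bounds as a stand-in for the tighter n-Simplex bounds; the control flow (sort candidates by lower bound, compute real distances lazily, stop when no lower bound can beat the current k-th best) is the same. All names are illustrative.

```python
def pivot_bounds(dq, do):
    """Classical pivot bounds on d(q,o) from distances to shared pivots."""
    lo = max(abs(a - b) for a, b in zip(dq, do))
    hi = min(a + b for a, b in zip(dq, do))
    return lo, hi

def refine(q, candidates, pivot_dists, dq, dist, k):
    # sort candidates by lower bound; compute the real distance lazily
    order = sorted(candidates, key=lambda o: pivot_bounds(dq, pivot_dists[o])[0])
    best = []  # (distance, object) pairs, kept sorted, at most k entries
    for o in order:
        lo, _ = pivot_bounds(dq, pivot_dists[o])
        if len(best) == k and lo >= best[-1][0]:
            break  # no remaining candidate can improve the result
        d = dist(q, o)
        best.append((d, o))
        best.sort()
        best = best[:k]
    return best
```

Because the pivot distances were already computed when building the permutations, the bounds come for free; the real metric is only evaluated for candidates whose lower bound leaves them in contention.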
Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an explosion in available data, commonly referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while far fewer address variety. The metric space model is ideal for addressing variety because it can accommodate any type of data as long as its associated distance notion satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search by i) summarizing all the existing partitioning, pruning, and validation techniques used for metric indexes, ii) providing time and storage complexity analyses of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparisons are used to evaluate index performance during search because the differences are hard to see from complexity analysis alone, and query performance depends on pruning and validation abilities that are tied to the data distribution. This article aims to reveal the strengths and weaknesses of different indexing techniques, in order to offer guidance on selecting an appropriate indexing technique for a given setting, and to direct future research on metric indexes.
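As a concrete illustration of the variety point above, any distance satisfying the triangle inequality can be indexed, including string edit distance. Below is a minimal sketch of the pivot-based exclusion shared by the surveyed indexes, applied to Levenshtein distance; the single-pivot setup and function names are illustrative simplifications.

```python
def lev(a, b):
    # classic dynamic-programming edit distance, a valid metric on strings
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[n]

def range_search(q, t, data, pivot, d_pivot):
    # d_pivot[i] = lev(data[i], pivot), precomputed at build time
    dq = lev(q, pivot)
    out = []
    for o, dp in zip(data, d_pivot):
        if abs(dq - dp) > t:      # triangle-inequality exclusion: skip o
            continue
        if lev(q, o) <= t:        # verify survivors with the real metric
            out.append(o)
    return out
```

The exclusion test costs one subtraction per object, while each avoided `lev` call saves a quadratic-time computation; this is the pruning/validation split the survey analyzes.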
Learning to Prune in Metric and Non-Metric Spaces
Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-to-prune approaches: density estimation through sampling and "stretching" of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against recently proposed state-of-the-art approaches: the bbtree, multi-probe locality-sensitive hashing (LSH), and permutation methods. Our method was competitive with these approaches and, in most cases, was more efficient for the same rank approximation quality.
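The "stretching" idea can be sketched compactly: a plain VP-tree whose search shrinks the pruning radius by a factor alpha > 1, visiting fewer branches at the risk of missing the true nearest neighbor. The structure below is a textbook VP-tree, not the paper's implementation, and the deterministic vantage-point choice is a simplification.

```python
class Node:
    def __init__(self, p, mu, near, far):
        self.p, self.mu, self.near, self.far = p, mu, near, far

def build(points, dist):
    if not points:
        return None
    p, rest = points[0], points[1:]       # simplistic vantage-point choice
    if not rest:
        return Node(p, 0.0, None, None)
    ds = sorted(dist(p, o) for o in rest)
    mu = ds[len(ds) // 2]                 # median splitting radius
    near = [o for o in rest if dist(p, o) <= mu]
    far = [o for o in rest if dist(p, o) > mu]
    return Node(p, mu, build(near, dist), build(far, dist))

def search(node, q, dist, best, alpha=1.0):
    # best = [radius, point]; alpha > 1 "stretches" the triangle
    # inequality, pruning more branches; alpha = 1 is exact pruning
    if node is None:
        return
    d = dist(q, node.p)
    if d < best[0]:
        best[0], best[1] = d, node.p
    tau = best[0] / alpha
    if d < node.mu:
        search(node.near, q, dist, best, alpha)
        if d + tau >= node.mu:
            search(node.far, q, dist, best, alpha)
    else:
        search(node.far, q, dist, best, alpha)
        if d - tau <= node.mu:
            search(node.near, q, dist, best, alpha)
```

The learning step in the paper amounts to choosing pruning parameters such as alpha from data rather than fixing them a priori.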
Robust And Scalable Learning Of Complex Dataset Topologies Via ElPiGraph
Large datasets represented by multidimensional data point clouds often
possess non-trivial distributions with branching trajectories and excluded
regions, with the recent single-cell transcriptomic studies of developing
embryo being notable examples. Reducing the complexity and producing compact
and interpretable representations of such data remains a challenging task. Most
of the existing computational methods are based on exploring the local data
point neighbourhood relations, a step that can perform poorly in the case of
multidimensional and noisy data. Here we present ElPiGraph, a scalable and
robust method for approximation of datasets with complex structures which does
not require computing the complete data distance matrix or the data point
neighbourhood graph. This method is able to withstand high levels of noise and
is capable of approximating complex topologies via principal graph ensembles
that can be combined into a consensus principal graph. ElPiGraph deals
efficiently with large and complex datasets in various fields from biology,
where it can be used to infer gene dynamics from single-cell RNA-Seq, to
astronomy, where it can be used to explore complex structures in the
distribution of galaxies.
Comment: 32 pages, 14 figures.
On Geometric Range Searching, Approximate Counting and Depth Problems
In this thesis we deal with problems connected to range searching,
which is one of the central areas of computational geometry.
The dominant problems in this area are
halfspace range searching, simplex range searching and orthogonal range searching and
research into these problems has spanned decades.
For many range searching problems, the best possible
data structures cannot offer fast (i.e., polylogarithmic) query
times if we limit ourselves to near linear storage.
Even worse, it is conjectured (and proved in some cases)
that only very small improvements to these might be possible.
This inefficiency has encouraged many researchers to seek alternatives through approximations.
In this thesis we continue this line of research and focus on
relative approximation of range counting problems.
One important problem where it is possible to achieve significant speedup
through approximation is halfspace range counting in 3D.
Here we continue this previous line of research
and obtain the first optimal data structure for approximate halfspace range counting in 3D.
Our data structure has the slight advantage of being Las Vegas (the result is always correct) in contrast
to the previous methods that were Monte Carlo (the correctness holds with high probability).
Another series of problems where approximation can provide us with
substantial speedup comes from robust statistics.
We recognize three problems here:
approximate Tukey depth, regression depth and simplicial depth queries.
In 2D, we obtain an optimal data structure capable of approximating
the regression depth of a query hyperplane.
We also offer a linear space data structure which can answer approximate
Tukey depth queries efficiently in 3D.
These data structures are obtained by applying our ideas for the
approximate halfspace counting problem.
Approximating the simplicial depth turns out to be much more
difficult, however.
Computing the simplicial depth of a given point is more computationally
challenging than most other definitions of data depth.
In 2D we obtain the first data structure which uses near linear space
and can answer approximate simplicial depth queries in polylogarithmic time.
As applications of this result, we provide two non-trivial methods to
approximate the simplicial depth of a given point in higher dimension.
Along the way, we establish a tight combinatorial relationship between
the Tukey depth of any given point and its simplicial depth.
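The quantity being approximated here has a direct brute-force definition: the simplicial depth of a query point in 2D is the number of triangles spanned by data points that contain it. A minimal O(n^3) sketch (using the closed-triangle convention, so boundary points count):

```python
from itertools import combinations

def orient(a, b, c):
    # signed area test: > 0 left turn, < 0 right turn, 0 collinear
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_triangle(q, a, b, c):
    d1, d2, d3 = orient(a, b, q), orient(b, c, q), orient(c, a, q)
    return (d1 >= 0 and d2 >= 0 and d3 >= 0) or \
           (d1 <= 0 and d2 <= 0 and d3 <= 0)

def simplicial_depth(q, pts):
    # count the triangles over all point triples that contain q
    return sum(in_triangle(q, *t) for t in combinations(pts, 3))
```

The cubic cost of this definition is exactly why near-linear-space structures with polylogarithmic approximate query time, as obtained in the thesis, are non-trivial.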
Another problem investigated in this thesis is the dominance reporting problem,
an important special case of orthogonal range reporting.
In three dimensions, we solve this
problem in the pointer machine model and the external memory model
by offering the first optimal data structures in these models of computation.
Also, in the RAM model and for points from
an integer grid we reduce the space complexity of the fastest
known data structure to optimal.
Using known techniques from the literature, we can extend our
results to obtain solutions for the orthogonal range searching problem as well.
The query complexities offered by our orthogonal range reporting data structures
match the most efficient known in the literature, but our space bounds are lower
than those of previous methods in the external memory model and in the RAM model
where the input is a subset of an integer grid.
The results also yield improved orthogonal range searching in
higher dimensions (which shows the significance
of the dominance reporting problem).
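The reduction underlying this significance can be sketched concretely: a 2-d orthogonal range count follows from four dominance counts by inclusion-exclusion over the corners of the query rectangle. The sketch below uses a brute-force `dom_count` where the thesis substitutes its optimal dominance structures; the epsilon trick for making corners half-open assumes coordinates are separated by more than eps.

```python
def dom_count(q, pts):
    # number of points dominated by q (<= q in every coordinate)
    return sum(all(pi <= qi for pi, qi in zip(p, q)) for p in pts)

def range_count(lo, hi, pts):
    # 2-d orthogonal range counting of the closed box lo..hi by
    # inclusion-exclusion over four dominance queries
    (a, c), (b, d) = lo, hi
    eps = 1e-9  # shave the lower-left boundary open (assumed separation)
    return (dom_count((b, d), pts)
            - dom_count((a - eps, d), pts)
            - dom_count((b, c - eps), pts)
            + dom_count((a - eps, c - eps), pts))
```

Any improvement to dominance counting or reporting therefore propagates directly to orthogonal range searching, which is the sense in which the dominance problem is a core special case.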
Intersection searching is a generalization of range searching where
we deal with more complicated geometric objects instead of points.
We investigate the rectilinear disjoint polygon counting problem
which is a specialized intersection counting problem.
We provide a linear-size data structure capable of counting
the number of disjoint rectilinear polygons
intersecting any rectilinear polygon of constant size.
The query time (as well as some other properties of our data structure) resembles
that of the classical simplex range searching data structures.