15 research outputs found

    Indexability, concentration, and VC theory

    Get PDF
    Degrading performance of indexing schemes for exact similarity search in high dimensions has long since been linked to histograms of distributions of distances and other 1-Lipschitz functions getting concentrated. We discuss this observation in the framework of the phenomenon of concentration of measure on the structures of high dimension and the Vapnik-Chervonenkis theory of statistical learning.Comment: 17 pages, final submission to J. Discrete Algorithms (an expanded, improved and corrected version of the SISAP'2010 invited paper, this e-print, v3

    Whittle index based Q-learning for restless bandits with average reward

    Get PDF
    International audienceA novel reinforcement learning algorithm is introduced for multiarmed restless bandits with average reward, using the paradigms of Q-learning and Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provided, supported by numerical experiments. The numerical experiments show excellent empirical performance of the proposed scheme

    Investigating binary partition power in metric query

    Get PDF
    It is generally understood that, as dimensionality increases, the minimum cost of metric query tends from (log ) to () in both space and time, where is the size of the data set. With low dimensionality, the former is easy to achieve; with very high dimensionality, the latter is inevitable. We previously described BitPart as a novel mechanism suitable for performing exact metric search in “high(er)” dimensions. The essential tradeoff of BitPart is that its space cost is linear with respect to the size of the data, but the actual space required for each object may be small as log2 bits, which allows even very large data sets to be queried using only main memory. Potentially the time cost still scales with (log ). Together these attributes give exact search which outperforms indexing structures if dimensionality is within a certain range. In this article, we reiterate the design of BitPart in this context. The novel contribution is an in-depth examination of what the notion of “high(er)” means in practical terms. To do this we introduce the notion of exclusion power, and show its application to some generated data sets across different dimensions.Publisher PD

    Re-ranking Permutation-Based Candidate Sets with the n-Simplex Projection

    Get PDF
    In the realm of metric search, the permutation-based approaches have shown very good performance in indexing and supporting approximate search on large databases. These methods embed the metric objects into a permutation space where candidate results to a given query can be efficiently identified. Typically, to achieve high effectiveness, the permutation-based result set is refined by directly comparing each candidate object to the query one. Therefore, one drawback of these approaches is that the original dataset needs to be stored and then accessed during the refining step. We propose a refining approach based on a metric embedding, called n-Simplex projection, that can be used on metric spaces meeting the n-point property. The n-Simplex projection provides upper- and lower-bounds of the actual distance, derived using the distances between the data objects and a finite set of pivots. We propose to reuse the distances computed for building the data permutations to derive these bounds and we show how to use them to improve the permutation-based results. Our approach is particularly advantageous for all the cases in which the traditional refining step is too costly, e.g. very large dataset or very expensive metric function

    Indexing Metric Spaces for Exact Similarity Search

    Full text link
    With the continued digitalization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity and variety. Many studies address volume or velocity, while much fewer studies concern the variety. Metric space is ideal for addressing variety because it can accommodate any type of data as long as its associated distance notion satisfies the triangle inequality. To accelerate search in metric space, a collection of indexing techniques for metric data have been proposed. However, existing surveys each offers only a narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search, by i) summarizing all the existing partitioning, pruning and validation techniques used for metric indexes, ii) providing the time and storage complexity analysis on the index construction, and iii) report on a comprehensive empirical comparison of their similarity query processing performance. Here, empirical comparisons are used to evaluate the index performance during search as it is hard to see the complexity analysis differences on the similarity query processing and the query performance depends on the pruning and validation abilities related to the data distribution. This article aims at revealing different strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting, and directing the future research for metric indexes

    Learning to Prune in Metric and Non-Metric Spaces

    Get PDF
    Abstract Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-toprune approaches: density estimation through sampling and "stretching" of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality

    Robust And Scalable Learning Of Complex Dataset Topologies Via Elpigraph

    Full text link
    Large datasets represented by multidimensional data point clouds often possess non-trivial distributions with branching trajectories and excluded regions, with the recent single-cell transcriptomic studies of developing embryo being notable examples. Reducing the complexity and producing compact and interpretable representations of such data remains a challenging task. Most of the existing computational methods are based on exploring the local data point neighbourhood relations, a step that can perform poorly in the case of multidimensional and noisy data. Here we present ElPiGraph, a scalable and robust method for approximation of datasets with complex structures which does not require computing the complete data distance matrix or the data point neighbourhood graph. This method is able to withstand high levels of noise and is capable of approximating complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently with large and complex datasets in various fields from biology, where it can be used to infer gene dynamics from single-cell RNA-Seq, to astronomy, where it can be used to explore complex structures in the distribution of galaxies.Comment: 32 pages, 14 figure

    On Geometric Range Searching, Approximate Counting and Depth Problems

    Get PDF
    In this thesis we deal with problems connected to range searching, which is one of the central areas of computational geometry. The dominant problems in this area are halfspace range searching, simplex range searching and orthogonal range searching and research into these problems has spanned decades. For many range searching problems, the best possible data structures cannot offer fast (i.e., polylogarithmic) query times if we limit ourselves to near linear storage. Even worse, it is conjectured (and proved in some cases) that only very small improvements to these might be possible. This inefficiency has encouraged many researchers to seek alternatives through approximations. In this thesis we continue this line of research and focus on relative approximation of range counting problems. One important problem where it is possible to achieve significant speedup through approximation is halfspace range counting in 3D. Here we continue the previous research done and obtain the first optimal data structure for approximate halfspace range counting in 3D. Our data structure has the slight advantage of being Las Vegas (the result is always correct) in contrast to the previous methods that were Monte Carlo (the correctness holds with high probability). Another series of problems where approximation can provide us with substantial speedup comes from robust statistics. We recognize three problems here: approximate Tukey depth, regression depth and simplicial depth queries. In 2D, we obtain an optimal data structure capable of approximating the regression depth of a query hyperplane. We also offer a linear space data structure which can answer approximate Tukey depth queries efficiently in 3D. These data structures are obtained by applying our ideas for the approximate halfspace counting problem. Approximating the simplicial depth turns out to be much more difficult, however. Computing the simplicial depth of a given point is more computationally challenging than most other definitions of data depth. In 2D we obtain the first data structure which uses near linear space and can answer approximate simplicial depth queries in polylogarithmic time. As applications of this result, we provide two non-trivial methods to approximate the simplicial depth of a given point in higher dimension. Along the way, we establish a tight combinatorial relationship between the Tukey depth of any given point and its simplicial depth. Another problem investigated in this thesis is the dominance reporting problem, an important special case of orthogonal range reporting. In three dimensions, we solve this problem in the pointer machine model and the external memory model by offering the first optimal data structures in these models of computation. Also, in the RAM model and for points from an integer grid we reduce the space complexity of the fastest known data structure to optimal. Using known techniques in the literature, we can use our results to obtain solutions for the orthogonal range searching problem as well. The query complexity offered by our orthogonal range reporting data structures match the most efficient query complexities known in the literature but our space bounds are lower than the previous methods in the external memory model and RAM model where the input is a subset of an integer grid. The results also yield improved orthogonal range searching in higher dimensions (which shows the significance of the dominance reporting problem). Intersection searching is a generalization of range searching where we deal with more complicated geometric objects instead of points. We investigate the rectilinear disjoint polygon counting problem which is a specialized intersection counting problem. We provide a linear-size data structure capable of counting the number of disjoint rectilinear polygons intersecting any rectilinear polygon of constant size. The query time (as well as some other properties of our data structure) resembles the classical simplex range searching data structures
    corecore