
    Indexability, concentration, and VC theory

    The degrading performance of indexing schemes for exact similarity search in high dimensions has long been linked to the concentration of the histograms of distances and of other 1-Lipschitz functions. We discuss this observation in the framework of the phenomenon of concentration of measure on high-dimensional structures and of the Vapnik-Chervonenkis theory of statistical learning.
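
    A minimal numerical sketch of the concentration effect referred to above (the data distribution, dimensions and sample sizes are arbitrary illustrative choices): as dimension grows, pairwise distances between random points crowd around their mean, so their relative spread shrinks.

```python
# Sketch: concentration of pairwise distances in high dimension.
# Uniform data on the unit cube is an arbitrary choice; the effect is
# qualitatively similar for other dimensionally homogeneous distributions.
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n_points=1000):
    """Relative spread (std/mean) of pairwise Euclidean distances."""
    x = rng.uniform(size=(n_points, dim))
    # Pairwise distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (x * x).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d = np.sqrt(np.clip(d2, 0.0, None))
    iu = np.triu_indices(n_points, k=1)   # each pair once, no self-distances
    pairwise = d[iu]
    return pairwise.std() / pairwise.mean()

for dim in (2, 8, 32, 128, 512):
    print(f"dim={dim:4d}  relative spread of distances = {distance_spread(dim):.3f}")
```

    As the relative spread approaches zero, a ball of any useful query radius captures either almost nothing or almost everything, which is the indexability problem the paper connects to concentration of measure and to VC theory.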

    Investigating binary partition power in metric query

    It is generally understood that, as dimensionality increases, the minimum cost of metric query tends from O(log n) to O(n) in both space and time, where n is the size of the data set. With low dimensionality, the former is easy to achieve; with very high dimensionality, the latter is inevitable. We previously described BitPart as a novel mechanism suitable for performing exact metric search in “high(er)” dimensions. The essential tradeoff of BitPart is that its space cost is linear with respect to the size of the data, but the actual space required for each object may be as small as log2 n bits, which allows even very large data sets to be queried using only main memory. Potentially the time cost still scales with O(log n). Together these attributes give an exact search mechanism which outperforms indexing structures if dimensionality is within a certain range. In this article, we reiterate the design of BitPart in this context. The novel contribution is an in-depth examination of what the notion of “high(er)” means in practical terms. To do this we introduce the notion of exclusion power, and show its application to some generated data sets across different dimensions.
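
    The abstract does not restate the mechanism, so the following is only a hedged sketch of bit-encoded ball partitions and of measuring how much data each partition excludes; the pivot selection, radii and the exclusion measure below are illustrative assumptions, not BitPart's exact design.

```python
# Hedged sketch of bit-encoded ball partitions and per-partition exclusion,
# in the spirit of BitPart-style exact search; pivots and radii are naive choices.
import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(size=(5000, 20))                  # toy data set
pivots = data[rng.choice(len(data), 16, replace=False)]

def dists(points, p):
    return np.linalg.norm(points - p, axis=1)

# One bit per (pivot, radius) ball partition: is the object inside the ball?
radii = np.array([np.median(dists(data, p)) for p in pivots])
bits = np.stack([dists(data, p) <= r for p, r in zip(pivots, radii)], axis=1)

def range_query(q, t):
    """Exact range query: triangle-inequality exclusions first, verification last."""
    candidates = np.ones(len(data), dtype=bool)
    dq = dists(pivots, q)
    for j, r in enumerate(radii):
        if dq[j] + t <= r:        # query ball inside the partition ball: keep inside objects only
            candidates &= bits[:, j]
        elif dq[j] - t > r:       # query ball outside the partition ball: keep outside objects only
            candidates &= ~bits[:, j]
    idx = np.flatnonzero(candidates)
    exact = idx[dists(data[idx], q) <= t]            # final verification with true distances
    return exact, 1.0 - candidates.mean()            # results and fraction excluded bitwise

results, excluded = range_query(rng.uniform(size=20), t=0.8)
print(f"{len(results)} results; {excluded:.1%} of the data excluded without distance computations")
```

    The bitmaps cost one bit per object per partition, which is where the small per-object space cost mentioned above comes from; exclusion power, in this illustrative sense, is the fraction of the data eliminated by the bitwise tests before any real distance computations.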

    Re-ranking Permutation-Based Candidate Sets with the n-Simplex Projection

    In the realm of metric search, permutation-based approaches have shown very good performance in indexing and supporting approximate search on large databases. These methods embed the metric objects into a permutation space where candidate results to a given query can be efficiently identified. Typically, to achieve high effectiveness, the permutation-based result set is refined by directly comparing each candidate object to the query object. One drawback of these approaches is therefore that the original dataset needs to be stored and accessed during the refining step. We propose a refining approach based on a metric embedding, called the n-Simplex projection, that can be used on metric spaces meeting the n-point property. The n-Simplex projection provides upper and lower bounds on the actual distance, derived from the distances between the data objects and a finite set of pivots. We propose to reuse the distances computed for building the data permutations to derive these bounds, and we show how to use them to improve the permutation-based results. Our approach is particularly advantageous in all the cases in which the traditional refining step is too costly, e.g. very large datasets or very expensive metric functions.
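
    A simplified, hedged sketch of the reuse pattern described above, with classical one-pivot triangle-inequality lower bounds standing in for the n-Simplex bounds (the full construction additionally requires the n-point property and builds a Euclidean simplex embedding): the pivot distances computed for the permutations are reused to postpone, and often avoid, direct comparisons with the query.

```python
# Sketch: permutation-based candidates refined with pivot-derived lower bounds.
# Plain triangle-inequality bounds replace the n-Simplex bounds here; the point
# illustrated is that pivot distances are computed once and reused for filtering.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(20000, 32))
pivots = data[rng.choice(len(data), 24, replace=False)]

def pivot_dists(x):
    return np.linalg.norm(pivots - x, axis=1)

# Precomputed once per object: distances to pivots and the induced permutation.
obj_pd = np.array([pivot_dists(x) for x in data])
obj_perm = np.argsort(obj_pd, axis=1)

def knn(q, k=10, candidate_pool=500):
    q_pd = pivot_dists(q)
    # Candidate selection: Spearman footrule between pivot permutations.
    pos_q = np.argsort(np.argsort(q_pd))
    pos_o = np.argsort(obj_perm, axis=1)          # rank of each pivot per object
    footrule = np.abs(pos_o - pos_q).sum(axis=1)
    cand = np.argpartition(footrule, candidate_pool)[:candidate_pool]

    # Refinement: lower bounds from the already-computed pivot distances.
    lower = np.abs(obj_pd[cand] - q_pd).max(axis=1)
    order = np.argsort(lower)
    results = []                                  # (true distance, index), kept sorted
    for pos in order:
        if len(results) == k and lower[pos] >= results[-1][0]:
            break                                 # no remaining candidate can improve the result
        d = np.linalg.norm(data[cand[pos]] - q)   # true distance only when needed
        results.append((d, int(cand[pos])))
        results.sort()
        results = results[:k]
    return results

print(knn(rng.normal(size=32))[:3])
```

    The upper bounds that the n-Simplex projection also provides would allow further early decisions that the one-pivot bounds used here cannot.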

    Learning to Prune in Metric and Non-Metric Spaces

    Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-to-prune approaches: density estimation through sampling and "stretching" of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against recently proposed state-of-the-art approaches: the bbtree, multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against these state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality.
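
    A minimal sketch of the second of the two ideas mentioned, "stretching" the triangle inequality in a VP-tree. The single constant stretch factor alpha below is an illustrative simplification (the paper learns a more flexible pruning decision from sampled data): alpha = 1 gives the usual exact pruning test, while alpha > 1 prunes more aggressively and makes the search approximate.

```python
# Sketch: VP-tree k-NN search with a "stretched" triangle-inequality pruning rule.
import heapq
import numpy as np

rng = np.random.default_rng(3)

class VPNode:
    __slots__ = ("point", "radius", "inside", "outside")
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius, self.inside, self.outside = point, radius, inside, outside

def build(points):
    """Standard VP-tree: split the remaining points at the median distance to the vantage point."""
    if len(points) == 0:
        return None
    vp, rest = points[0], points[1:]
    if len(rest) == 0:
        return VPNode(vp, 0.0, None, None)
    d = np.linalg.norm(rest - vp, axis=1)
    r = np.median(d)
    return VPNode(vp, r, build(rest[d <= r]), build(rest[d > r]))

def search(node, q, k, alpha, heap):
    """heap holds (-distance, tiebreak) for the current k best; alpha stretches the pruning test."""
    if node is None:
        return
    d = np.linalg.norm(node.point - q)
    if len(heap) < k:
        heapq.heappush(heap, (-d, id(node)))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, id(node)))
    near, far = (node.inside, node.outside) if d <= node.radius else (node.outside, node.inside)
    search(near, q, k, alpha, heap)
    tau = -heap[0][0] if len(heap) == k else np.inf
    if alpha * abs(d - node.radius) <= tau:     # stretched triangle-inequality pruning test
        search(far, q, k, alpha, heap)

data = rng.normal(size=(2000, 16))
root = build(data)
heap = []
search(root, rng.normal(size=16), k=5, alpha=1.5, heap=heap)
print(sorted(-negd for negd, _ in heap))        # approximate 5-NN distances
```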

    Indexing Metric Spaces for Exact Similarity Search

    With the continued digitalization of societal processes, we are seeing an explosion in available data, commonly referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity and variety. Many studies address volume or velocity, while far fewer concern variety. The metric space model is well suited to addressing variety because it can accommodate any type of data as long as the associated distance notion satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of these techniques exists. We offer a survey of all existing metric indexes that support exact similarity search, by i) summarizing all the existing partitioning, pruning and validation techniques used in metric indexes, ii) providing time and storage complexity analyses of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparison is used to evaluate search performance because complexity analysis alone barely separates the techniques at query time, where performance depends on pruning and validation abilities that are in turn tied to the data distribution. This article aims at revealing the different strengths and weaknesses of different indexing techniques, in order to offer guidance on selecting an appropriate indexing technique for a given setting and to direct future research on metric indexes.
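
    The partitioning/pruning/validation vocabulary can be made concrete with the single pivot-filtering rule that most metric indexes build on. The sketch below is only illustrative (one pivot, a toy word set, Levenshtein distance chosen as an example of a non-vector metric): the triangle inequality gives a lower bound used for pruning and an upper bound used for validation, so only the undecided objects need an actual distance computation.

```python
# Sketch: pivot-based pruning and validation for a range query d(q, x) <= t,
# using Levenshtein distance as an example of a non-vector metric.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

words = ["metric", "matrix", "meter", "mantra", "metrics", "metal", "petric", "central"]
pivot = "metric"                                     # one pivot; real indexes use many
d_pivot = {w: levenshtein(w, pivot) for w in words}  # precomputed at build time

def range_query(q, t):
    dq = levenshtein(q, pivot)                       # one distance to the pivot per query
    results, computed = [], 0
    for w in words:
        lower = abs(dq - d_pivot[w])                 # triangle inequality: lower bound on d(q, w)
        upper = dq + d_pivot[w]                      # triangle inequality: upper bound on d(q, w)
        if lower > t:                                # pruning: cannot be a result
            continue
        if upper <= t:                               # validation: must be a result, no computation
            results.append(w)
            continue
        computed += 1
        if levenshtein(q, w) <= t:                   # only undecided objects are verified
            results.append(w)
    return results, computed

res, n = range_query("metre", 2)
print(res, f"({n} of {len(words)} distances computed)")
```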

    Robust And Scalable Learning Of Complex Dataset Topologies Via Elpigraph

    Large datasets represented by multidimensional data point clouds often possess non-trivial distributions with branching trajectories and excluded regions, recent single-cell transcriptomic studies of the developing embryo being notable examples. Reducing the complexity of such data and producing compact, interpretable representations remains a challenging task. Most existing computational methods are based on exploring local data point neighbourhood relations, a step that can perform poorly on multidimensional and noisy data. Here we present ElPiGraph, a scalable and robust method for approximation of datasets with complex structures which does not require computing the complete data distance matrix or the data point neighbourhood graph. The method is able to withstand high levels of noise and is capable of approximating complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently with large and complex datasets in various fields, from biology, where it can be used to infer gene dynamics from single-cell RNA-Seq, to astronomy, where it can be used to explore complex structures in the distribution of galaxies.
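
    ElPiGraph itself fits a graph to the data by optimizing an elastic energy; as a rough, hedged illustration of what a principal-graph-style skeleton of a branching point cloud looks like (and explicitly not ElPiGraph's algorithm), the sketch below places nodes with k-means and connects them with a minimum spanning tree, a common simplified stand-in.

```python
# Rough stand-in for a principal-graph approximation of branching data:
# k-means nodes plus a minimum spanning tree over the nodes.  This is NOT the
# ElPiGraph elastic-energy optimization, only an illustration of the kind of
# compact skeleton such methods produce.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Toy Y-shaped (branching) point cloud with noise.
t = rng.uniform(0, 1, 600)
branch = rng.integers(0, 3, 600)
dirs = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])
points = t[:, None] * dirs[branch] + rng.normal(scale=0.05, size=(600, 2))

# Skeleton nodes and edges.
nodes = KMeans(n_clusters=15, n_init=10, random_state=0).fit(points).cluster_centers_
mst = minimum_spanning_tree(cdist(nodes, nodes)).tocoo()
edges = list(zip(mst.row.tolist(), mst.col.tolist()))

print(f"{len(nodes)} nodes and {len(edges)} edges approximating {len(points)} points")
```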

    3D oceanographic data compression using 3D-ODETLAP

    This paper describes a 3D environmental data compression technique for oceanographic datasets. With proper point selection, our method approximates uncompressed marine data using an over-determined system of linear equations based on, but essentially different from, the Laplacian partial differential equation. This approximation is then refined via an error metric, and the two steps alternate until a predefined approximation quality is reached. Using several different datasets and metrics, we demonstrate that our method achieves an excellent compression ratio. To further evaluate our method, we compare it with 3D-SPIHT: 3D-ODETLAP averages 20% better compression than 3D-SPIHT on our eight test datasets from World Ocean Atlas 2005, and provides up to approximately six times better compression on datasets with relatively small variance. Meanwhile, at the same approximate mean error, we demonstrate a significantly smaller maximum error compared to 3D-SPIHT, and provide a feature to keep the maximum error under a user-defined limit.
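
    A hedged 2D miniature of the approach described above (the real method is 3D and tuned for oceanographic data; the weights, grid size, batch size and stopping tolerance below are arbitrary illustrative choices): Laplacian-like smoothness equations and weighted known-value equations form one over-determined sparse system, which is re-solved after greedily adding the worst-approximated points.

```python
# 2D miniature of an ODETLAP-style compression loop: an over-determined sparse
# system of Laplacian-like smoothness equations plus weighted known-value
# equations, refined by re-adding the worst-approximated points.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

n = 40                                                # n x n toy field
yy, xx = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
field = np.sin(xx / 6.0) * np.cos(yy / 9.0)           # synthetic smooth data

def idx(i, j):
    return i * n + j

def solve(selected, weight=10.0):
    """Least-squares solve of smoothness + weighted known-point equations."""
    rows, cols, vals, rhs = [], [], [], []
    eq = 0
    # Smoothness: 4*z[i,j] - sum(neighbours) ~ 0 at interior cells.
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            rows += [eq] * 5
            cols += [idx(i, j), idx(i - 1, j), idx(i + 1, j), idx(i, j - 1), idx(i, j + 1)]
            vals += [4.0, -1.0, -1.0, -1.0, -1.0]
            rhs.append(0.0)
            eq += 1
    # Known values at the selected points, weighted more heavily.
    for (i, j) in selected:
        rows.append(eq); cols.append(idx(i, j)); vals.append(weight)
        rhs.append(weight * field[i, j]); eq += 1
    A = sp.coo_matrix((vals, (rows, cols)), shape=(eq, n * n)).tocsr()
    return lsqr(A, np.array(rhs))[0].reshape(n, n)

# Start from a coarse regular sample, then keep adding the worst-error points.
selected = {(i, j) for i in range(0, n, 8) for j in range(0, n, 8)}
for _ in range(5):
    err = np.abs(solve(selected) - field)
    if err.max() < 0.02:
        break
    worst = np.column_stack(np.unravel_index(np.argsort(err, axis=None)[-20:], err.shape))
    selected |= {tuple(map(int, p)) for p in worst}

print(f"stored {len(selected)} of {n * n} values, max error {np.abs(solve(selected) - field).max():.4f}")
```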

    Approaches to Quantifying EEG Features for Design Protocol Analysis

    Recently, physiological signals such as eye-tracking and gesture analysis, galvanic skin response (GSR), electrocardiograms (ECG) and electroencephalograms (EEG) have been used by design researchers to extract significant information describing the conceptual design process. We study a set of video-based design protocols recorded from subjects performing design tasks on a sketchpad while having their EEG monitored. The conceptual design process is rich with information on how designers design. Many methods exist to analyze the conceptual design process, the most popular being concurrent verbal protocols. A recurring problem in design protocol analysis is segmenting and coding protocol data into logical and semantic units. This is usually a manual step, and little work has been done on fully automated segmentation techniques. Verbal protocols are also known to fail in some circumstances, such as when dealing with creativity, insight (e.g. the Aha! experience, gestalt), concurrent, nonverbalizable (e.g. facial recognition) and nonconscious processes. We propose different approaches to studying the conceptual design process using electroencephalograms (EEG). More specifically, we use spatio-temporal and frequency-domain features. Our research is based on machine learning techniques applied to EEG signals (functional microstate analysis), on source localization (LORETA), and on a novel method of segmentation for design protocols based on EEG features. Using these techniques, we measure mental effort, fatigue and concentration in the conceptual design process, in addition to creativity and insight/nonverbalizable processing. We discuss the strengths and weaknesses of such approaches.
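
    As a hedged illustration of the frequency-domain features mentioned above (the sampling rate, band limits and the beta/alpha ratio used as a rough effort proxy are common conventions, not this work's specific pipeline, which also involves microstate analysis and LORETA), per-segment band powers can be computed from a Welch periodogram:

```python
# Sketch: frequency-domain EEG features for one design-protocol segment.
import numpy as np
from scipy.signal import welch

FS = 256                                          # sampling rate in Hz (assumed)
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(segment):
    """segment: array of shape (n_channels, n_samples) for one protocol segment."""
    freqs, psd = welch(segment, fs=FS, nperseg=FS * 2, axis=-1)
    df = freqs[1] - freqs[0]
    feats = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        feats[name] = psd[:, mask].sum(axis=-1) * df   # band power per channel
    # Rough single-number proxy for mental effort: beta/alpha ratio over channels.
    feats["effort_proxy"] = feats["beta"].mean() / feats["alpha"].mean()
    return feats

# Synthetic data standing in for a 10-second, 8-channel EEG segment.
rng = np.random.default_rng(5)
features = band_powers(rng.normal(size=(8, FS * 10)))
print({k: np.round(v, 4) for k, v in features.items()})
```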