
    Indexability, concentration, and VC theory

    The degrading performance of indexing schemes for exact similarity search in high dimensions has long been linked to the concentration of the histograms of distances and of other 1-Lipschitz functions. We discuss this observation in the framework of the phenomenon of concentration of measure on high-dimensional structures and of the Vapnik-Chervonenkis theory of statistical learning.
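
    A minimal numerical sketch of the concentration effect referred to above (the data distribution, dimensions and sample sizes are arbitrary illustrative choices): as dimension grows, pairwise distances between random points crowd around their mean, so their relative spread shrinks.

```python
# Sketch: concentration of pairwise distances in high dimension.
# Uniform data on the unit cube is an arbitrary choice; the effect is
# qualitatively similar for other dimensionally homogeneous distributions.
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n_points=1000):
    """Relative spread (std/mean) of pairwise Euclidean distances."""
    x = rng.uniform(size=(n_points, dim))
    # Pairwise distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (x * x).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d = np.sqrt(np.clip(d2, 0.0, None))
    iu = np.triu_indices(n_points, k=1)   # each pair once, no self-distances
    pairwise = d[iu]
    return pairwise.std() / pairwise.mean()

for dim in (2, 8, 32, 128, 512):
    print(f"dim={dim:4d}  relative spread of distances = {distance_spread(dim):.3f}")
```

    As the relative spread approaches zero, a ball of any useful query radius captures either almost nothing or almost everything, which is the indexability problem the paper connects to concentration of measure and to VC theory.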

    Investigating binary partition power in metric query

    It is generally understood that, as dimensionality increases, the minimum cost of metric query tends from O(log n) to O(n) in both space and time, where n is the size of the data set. With low dimensionality, the former is easy to achieve; with very high dimensionality, the latter is inevitable. We previously described BitPart as a novel mechanism suitable for performing exact metric search in “high(er)” dimensions. The essential tradeoff of BitPart is that its space cost is linear with respect to the size of the data, but the actual space required for each object may be as small as log2 n bits, which allows even very large data sets to be queried using only main memory. Potentially the time cost still scales with O(log n). Together these attributes give an exact search mechanism which outperforms indexing structures if dimensionality is within a certain range. In this article, we reiterate the design of BitPart in this context. The novel contribution is an in-depth examination of what the notion of “high(er)” means in practical terms. To do this we introduce the notion of exclusion power, and show its application to some generated data sets across different dimensions.
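
    The abstract does not restate the mechanism, so the following is only a hedged sketch of bit-encoded ball partitions and of measuring how much data each partition excludes; the pivot selection, radii and the exclusion measure below are illustrative assumptions, not BitPart's exact design.

```python
# Hedged sketch of bit-encoded ball partitions and per-partition exclusion,
# in the spirit of BitPart-style exact search; pivots and radii are naive choices.
import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(size=(5000, 20))                  # toy data set
pivots = data[rng.choice(len(data), 16, replace=False)]

def dists(points, p):
    return np.linalg.norm(points - p, axis=1)

# One bit per (pivot, radius) ball partition: is the object inside the ball?
radii = np.array([np.median(dists(data, p)) for p in pivots])
bits = np.stack([dists(data, p) <= r for p, r in zip(pivots, radii)], axis=1)

def range_query(q, t):
    """Exact range query: triangle-inequality exclusions first, verification last."""
    candidates = np.ones(len(data), dtype=bool)
    dq = dists(pivots, q)
    for j, r in enumerate(radii):
        if dq[j] + t <= r:        # query ball inside the partition ball: keep inside objects only
            candidates &= bits[:, j]
        elif dq[j] - t > r:       # query ball outside the partition ball: keep outside objects only
            candidates &= ~bits[:, j]
    idx = np.flatnonzero(candidates)
    exact = idx[dists(data[idx], q) <= t]            # final verification with true distances
    return exact, 1.0 - candidates.mean()            # results and fraction excluded bitwise

results, excluded = range_query(rng.uniform(size=20), t=0.8)
print(f"{len(results)} results; {excluded:.1%} of the data excluded without distance computations")
```

    The bitmaps cost one bit per object per partition, which is where the small per-object space cost mentioned above comes from; exclusion power, in this illustrative sense, is the fraction of the data eliminated by the bitwise tests before any real distance computations.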

    Re-ranking Permutation-Based Candidate Sets with the n-Simplex Projection

    In the realm of metric search, permutation-based approaches have shown very good performance in indexing and supporting approximate search on large databases. These methods embed the metric objects into a permutation space where candidate results to a given query can be efficiently identified. Typically, to achieve high effectiveness, the permutation-based result set is refined by directly comparing each candidate object to the query object. One drawback of these approaches is therefore that the original dataset needs to be stored and accessed during the refining step. We propose a refining approach based on a metric embedding, called the n-Simplex projection, that can be used on metric spaces meeting the n-point property. The n-Simplex projection provides upper and lower bounds on the actual distance, derived from the distances between the data objects and a finite set of pivots. We propose to reuse the distances computed for building the data permutations to derive these bounds, and we show how to use them to improve the permutation-based results. Our approach is particularly advantageous in all the cases in which the traditional refining step is too costly, e.g. very large datasets or very expensive metric functions.
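
    A simplified, hedged sketch of the reuse pattern described above, with classical one-pivot triangle-inequality lower bounds standing in for the n-Simplex bounds (the full construction additionally requires the n-point property and builds a Euclidean simplex embedding): the pivot distances computed for the permutations are reused to postpone, and often avoid, direct comparisons with the query.

```python
# Sketch: permutation-based candidates refined with pivot-derived lower bounds.
# Plain triangle-inequality bounds replace the n-Simplex bounds here; the point
# illustrated is that pivot distances are computed once and reused for filtering.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(20000, 32))
pivots = data[rng.choice(len(data), 24, replace=False)]

def pivot_dists(x):
    return np.linalg.norm(pivots - x, axis=1)

# Precomputed once per object: distances to pivots and the induced permutation.
obj_pd = np.array([pivot_dists(x) for x in data])
obj_perm = np.argsort(obj_pd, axis=1)

def knn(q, k=10, candidate_pool=500):
    q_pd = pivot_dists(q)
    # Candidate selection: Spearman footrule between pivot permutations.
    pos_q = np.argsort(np.argsort(q_pd))
    pos_o = np.argsort(obj_perm, axis=1)          # rank of each pivot per object
    footrule = np.abs(pos_o - pos_q).sum(axis=1)
    cand = np.argpartition(footrule, candidate_pool)[:candidate_pool]

    # Refinement: lower bounds from the already-computed pivot distances.
    lower = np.abs(obj_pd[cand] - q_pd).max(axis=1)
    order = np.argsort(lower)
    results = []                                  # (true distance, index), kept sorted
    for pos in order:
        if len(results) == k and lower[pos] >= results[-1][0]:
            break                                 # no remaining candidate can improve the result
        d = np.linalg.norm(data[cand[pos]] - q)   # true distance only when needed
        results.append((d, int(cand[pos])))
        results.sort()
        results = results[:k]
    return results

print(knn(rng.normal(size=32))[:3])
```

    The upper bounds that the n-Simplex projection also provides would allow further early decisions that the one-pivot bounds used here cannot.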

    Learning to Prune in Metric and Non-Metric Spaces

    Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-to-prune approaches: density estimation through sampling and "stretching" of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against recently proposed state-of-the-art approaches: the bbtree, multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against these state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality.
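
    A minimal sketch of the second of the two ideas mentioned, "stretching" the triangle inequality in a VP-tree. The single constant stretch factor alpha below is an illustrative simplification (the paper learns a more flexible pruning decision from sampled data): alpha = 1 gives the usual exact pruning test, while alpha > 1 prunes more aggressively and makes the search approximate.

```python
# Sketch: VP-tree k-NN search with a "stretched" triangle-inequality pruning rule.
import heapq
import numpy as np

rng = np.random.default_rng(3)

class VPNode:
    __slots__ = ("point", "radius", "inside", "outside")
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius, self.inside, self.outside = point, radius, inside, outside

def build(points):
    """Standard VP-tree: split the remaining points at the median distance to the vantage point."""
    if len(points) == 0:
        return None
    vp, rest = points[0], points[1:]
    if len(rest) == 0:
        return VPNode(vp, 0.0, None, None)
    d = np.linalg.norm(rest - vp, axis=1)
    r = np.median(d)
    return VPNode(vp, r, build(rest[d <= r]), build(rest[d > r]))

def search(node, q, k, alpha, heap):
    """heap holds (-distance, tiebreak) for the current k best; alpha stretches the pruning test."""
    if node is None:
        return
    d = np.linalg.norm(node.point - q)
    if len(heap) < k:
        heapq.heappush(heap, (-d, id(node)))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, id(node)))
    near, far = (node.inside, node.outside) if d <= node.radius else (node.outside, node.inside)
    search(near, q, k, alpha, heap)
    tau = -heap[0][0] if len(heap) == k else np.inf
    if alpha * abs(d - node.radius) <= tau:     # stretched triangle-inequality pruning test
        search(far, q, k, alpha, heap)

data = rng.normal(size=(2000, 16))
root = build(data)
heap = []
search(root, rng.normal(size=16), k=5, alpha=1.5, heap=heap)
print(sorted(-negd for negd, _ in heap))        # approximate 5-NN distances
```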

    Indexing Metric Spaces for Exact Similarity Search

    With the continued digitalization of societal processes, we are seeing an explosion in available data, commonly referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity and variety. Many studies address volume or velocity, while far fewer concern variety. The metric space model is well suited to addressing variety because it can accommodate any type of data as long as the associated distance notion satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of these techniques exists. We offer a survey of all existing metric indexes that support exact similarity search, by i) summarizing all the existing partitioning, pruning and validation techniques used in metric indexes, ii) providing time and storage complexity analyses of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparison is used to evaluate search performance because complexity analysis alone barely separates the techniques at query time, where performance depends on pruning and validation abilities that are in turn tied to the data distribution. This article aims at revealing the different strengths and weaknesses of different indexing techniques, in order to offer guidance on selecting an appropriate indexing technique for a given setting and to direct future research on metric indexes.
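
    The partitioning/pruning/validation vocabulary can be made concrete with the single pivot-filtering rule that most metric indexes build on. The sketch below is only illustrative (one pivot, a toy word set, Levenshtein distance chosen as an example of a non-vector metric): the triangle inequality gives a lower bound used for pruning and an upper bound used for validation, so only the undecided objects need an actual distance computation.

```python
# Sketch: pivot-based pruning and validation for a range query d(q, x) <= t,
# using Levenshtein distance as an example of a non-vector metric.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

words = ["metric", "matrix", "meter", "mantra", "metrics", "metal", "petric", "central"]
pivot = "metric"                                     # one pivot; real indexes use many
d_pivot = {w: levenshtein(w, pivot) for w in words}  # precomputed at build time

def range_query(q, t):
    dq = levenshtein(q, pivot)                       # one distance to the pivot per query
    results, computed = [], 0
    for w in words:
        lower = abs(dq - d_pivot[w])                 # triangle inequality: lower bound on d(q, w)
        upper = dq + d_pivot[w]                      # triangle inequality: upper bound on d(q, w)
        if lower > t:                                # pruning: cannot be a result
            continue
        if upper <= t:                               # validation: must be a result, no computation
            results.append(w)
            continue
        computed += 1
        if levenshtein(q, w) <= t:                   # only undecided objects are verified
            results.append(w)
    return results, computed

res, n = range_query("metre", 2)
print(res, f"({n} of {len(words)} distances computed)")
```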

    Robust And Scalable Learning Of Complex Dataset Topologies Via Elpigraph

    Large datasets represented by multidimensional data point clouds often possess non-trivial distributions with branching trajectories and excluded regions, recent single-cell transcriptomic studies of the developing embryo being notable examples. Reducing the complexity of such data and producing compact, interpretable representations remains a challenging task. Most existing computational methods are based on exploring local data point neighbourhood relations, a step that can perform poorly on multidimensional and noisy data. Here we present ElPiGraph, a scalable and robust method for approximation of datasets with complex structures which does not require computing the complete data distance matrix or the data point neighbourhood graph. The method is able to withstand high levels of noise and is capable of approximating complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently with large and complex datasets in various fields, from biology, where it can be used to infer gene dynamics from single-cell RNA-Seq, to astronomy, where it can be used to explore complex structures in the distribution of galaxies.
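
    ElPiGraph itself fits a graph to the data by optimizing an elastic energy; as a rough, hedged illustration of what a principal-graph-style skeleton of a branching point cloud looks like (and explicitly not ElPiGraph's algorithm), the sketch below places nodes with k-means and connects them with a minimum spanning tree, a common simplified stand-in.

```python
# Rough stand-in for a principal-graph approximation of branching data:
# k-means nodes plus a minimum spanning tree over the nodes.  This is NOT the
# ElPiGraph elastic-energy optimization, only an illustration of the kind of
# compact skeleton such methods produce.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Toy Y-shaped (branching) point cloud with noise.
t = rng.uniform(0, 1, 600)
branch = rng.integers(0, 3, 600)
dirs = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])
points = t[:, None] * dirs[branch] + rng.normal(scale=0.05, size=(600, 2))

# Skeleton nodes and edges.
nodes = KMeans(n_clusters=15, n_init=10, random_state=0).fit(points).cluster_centers_
mst = minimum_spanning_tree(cdist(nodes, nodes)).tocoo()
edges = list(zip(mst.row.tolist(), mst.col.tolist()))

print(f"{len(nodes)} nodes and {len(edges)} edges approximating {len(points)} points")
```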

    3D oceanographic data compression using 3D-ODETLAP

    This paper describes a 3D environmental data compression technique for oceanographic datasets. With proper point selection, our method approximates uncompressed marine data using an over-determined system of linear equations based on, but essentially different from, the Laplacian partial differential equation. This approximation is then refined via an error metric, and the two steps alternate until a predefined approximation quality is reached. Using several different datasets and metrics, we demonstrate that our method achieves an excellent compression ratio. To further evaluate our method, we compare it with 3D-SPIHT: 3D-ODETLAP averages 20% better compression than 3D-SPIHT on our eight test datasets from World Ocean Atlas 2005, and provides up to approximately six times better compression on datasets with relatively small variance. Meanwhile, at the same approximate mean error, we demonstrate a significantly smaller maximum error compared to 3D-SPIHT, and provide a feature to keep the maximum error under a user-defined limit.
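
    A hedged 2D miniature of the approach described above (the real method is 3D and tuned for oceanographic data; the weights, grid size, batch size and stopping tolerance below are arbitrary illustrative choices): Laplacian-like smoothness equations and weighted known-value equations form one over-determined sparse system, which is re-solved after greedily adding the worst-approximated points.

```python
# 2D miniature of an ODETLAP-style compression loop: an over-determined sparse
# system of Laplacian-like smoothness equations plus weighted known-value
# equations, refined by re-adding the worst-approximated points.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

n = 40                                                # n x n toy field
yy, xx = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
field = np.sin(xx / 6.0) * np.cos(yy / 9.0)           # synthetic smooth data

def idx(i, j):
    return i * n + j

def solve(selected, weight=10.0):
    """Least-squares solve of smoothness + weighted known-point equations."""
    rows, cols, vals, rhs = [], [], [], []
    eq = 0
    # Smoothness: 4*z[i,j] - sum(neighbours) ~ 0 at interior cells.
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            rows += [eq] * 5
            cols += [idx(i, j), idx(i - 1, j), idx(i + 1, j), idx(i, j - 1), idx(i, j + 1)]
            vals += [4.0, -1.0, -1.0, -1.0, -1.0]
            rhs.append(0.0)
            eq += 1
    # Known values at the selected points, weighted more heavily.
    for (i, j) in selected:
        rows.append(eq); cols.append(idx(i, j)); vals.append(weight)
        rhs.append(weight * field[i, j]); eq += 1
    A = sp.coo_matrix((vals, (rows, cols)), shape=(eq, n * n)).tocsr()
    return lsqr(A, np.array(rhs))[0].reshape(n, n)

# Start from a coarse regular sample, then keep adding the worst-error points.
selected = {(i, j) for i in range(0, n, 8) for j in range(0, n, 8)}
for _ in range(5):
    err = np.abs(solve(selected) - field)
    if err.max() < 0.02:
        break
    worst = np.column_stack(np.unravel_index(np.argsort(err, axis=None)[-20:], err.shape))
    selected |= {tuple(map(int, p)) for p in worst}

print(f"stored {len(selected)} of {n * n} values, max error {np.abs(solve(selected) - field).max():.4f}")
```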

    Approaches to Quantifying EEG Features for Design Protocol Analysis

    Recently, physiological signals such as eye-tracking and gesture analysis, galvanic skin response (GSR), electrocardiograms (ECG) and electroencephalograms (EEG) have been used by design researchers to extract significant information describing the conceptual design process. We study a set of video-based design protocols recorded from subjects performing design tasks on a sketchpad while having their EEG monitored. The conceptual design process is rich with information on how designers design. Many methods exist to analyze the conceptual design process, the most popular being concurrent verbal protocols. A recurring problem in design protocol analysis is segmenting and coding protocol data into logical and semantic units. This is usually a manual step, and little work has been done on fully automated segmentation techniques. Verbal protocols are also known to fail in some circumstances, such as when dealing with creativity, insight (e.g. the Aha! experience, gestalt), concurrent, nonverbalizable (e.g. facial recognition) and nonconscious processes. We propose different approaches to studying the conceptual design process using electroencephalograms (EEG). More specifically, we use spatio-temporal and frequency-domain features. Our research is based on machine learning techniques applied to EEG signals (functional microstate analysis), on source localization (LORETA), and on a novel method of segmentation for design protocols based on EEG features. Using these techniques, we measure mental effort, fatigue and concentration in the conceptual design process, in addition to creativity and insight/nonverbalizable processing. We discuss the strengths and weaknesses of such approaches.
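
    As a hedged illustration of the frequency-domain features mentioned above (the sampling rate, band limits and the beta/alpha ratio used as a rough effort proxy are common conventions, not this work's specific pipeline, which also involves microstate analysis and LORETA), per-segment band powers can be computed from a Welch periodogram:

```python
# Sketch: frequency-domain EEG features for one design-protocol segment.
import numpy as np
from scipy.signal import welch

FS = 256                                          # sampling rate in Hz (assumed)
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(segment):
    """segment: array of shape (n_channels, n_samples) for one protocol segment."""
    freqs, psd = welch(segment, fs=FS, nperseg=FS * 2, axis=-1)
    df = freqs[1] - freqs[0]
    feats = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        feats[name] = psd[:, mask].sum(axis=-1) * df   # band power per channel
    # Rough single-number proxy for mental effort: beta/alpha ratio over channels.
    feats["effort_proxy"] = feats["beta"].mean() / feats["alpha"].mean()
    return feats

# Synthetic data standing in for a 10-second, 8-channel EEG segment.
rng = np.random.default_rng(5)
features = band_powers(rng.normal(size=(8, FS * 10)))
print({k: np.round(v, 4) for k, v in features.items()})
```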