12 research outputs found

A geometric framework for modelling similarity search

Author: Pestov Vladimir
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1999

The aim of this paper is to propose a geometric framework for modelling similarity search in large and multidimensional data spaces of general nature, which seems to be flexible enough to address such issues as analysis of complexity, indexability, and the 'curse of dimensionality.' Such a framework is provided by the concept of the so-called similarity workload, which is a probability metric space $\Omega$ (query domain) with a distinguished finite subspace $X$ (dataset), together with an assembly of concepts, techniques, and results from metric geometry. They include such notions as metric transform, $\varepsilon$-entropy, and the phenomenon of concentration of measure on high-dimensional structures. In particular, we discuss the relevance of the latter to understanding the curse of dimensionality. As some of those concepts and techniques are being currently reinvented by the database community, it seems desirable to try and bridge the gap between database research and the relevant work already done in geometry and analysis.

Comment: 11 pages, LaTeX 2.

arXiv.org e-Print Archive

Crossref

Indexability, concentration, and VC theory

Author: Pestov Vladimir
Publication venue: 'Elsevier BV'
Publication date: 21/05/2011

Degrading performance of indexing schemes for exact similarity search in high dimensions has long since been linked to histograms of distributions of distances and other 1-Lipschitz functions getting concentrated. We discuss this observation in the framework of the phenomenon of concentration of measure on the structures of high dimension and the Vapnik-Chervonenkis theory of statistical learning.

Comment: 17 pages, final submission to J. Discrete Algorithms (an expanded, improved and corrected version of the SISAP'2010 invited paper, this e-print, v3
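
The concentration of distance histograms mentioned in this abstract can be seen in a small experiment. The sketch below is not taken from the paper: it assumes uniformly random points in the unit cube and the Euclidean distance, and the function name `relative_spread` is made up for illustration. It reports the ratio of the standard deviation to the mean of all pairwise distances, which shrinks as the dimension grows.

```python
import math
import random
import statistics


def relative_spread(dim, n_points=200, seed=0):
    """Return std/mean of pairwise Euclidean distances.

    Points are drawn uniformly from the unit cube [0, 1]^dim
    (an assumption made for this illustration only).
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    distances = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return statistics.pstdev(distances) / statistics.fmean(distances)


if __name__ == "__main__":
    # Print the relative spread for a few dimensions.
    for dim in (2, 20, 200):
        print(dim, round(relative_spread(dim), 3))
```

A small ratio means that almost all points lie at nearly the same distance from each other, which leaves a distance-based index little to prune with.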

arXiv.org e-Print Archive

Elsevier - Publisher Connector

Entropy-scaling search of massive biological data

Author: Berger Bonnie; Daniels Noah M.; Danko David Christian; Yu Y. William
Publication venue: 'Elsevier BV'
Publication date: 01/06/2015

Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.

Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
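
One way to read "number of covering hyperspheres" is as the size of a cover of the dataset by balls of a fixed radius. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses Euclidean points, a simple greedy cover, and hypothetical function names (`greedy_cover`, `coarse_to_fine_search`). A query is first compared with the ball centers, and only balls that could contain a hit are scanned.

```python
import math


def greedy_cover(points, radius):
    """Greedily pick centers so every point is within `radius` of a center.

    Returns a list of (center, members) pairs; its length is a rough
    stand-in for the number of covering balls at this radius.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if math.dist(p, center) <= radius:
                members.append(p)
                break
        else:
            # No existing ball covers p, so p starts a new ball.
            clusters.append((p, [p]))
    return clusters


def coarse_to_fine_search(clusters, cluster_radius, query, query_radius):
    """Return all stored points within `query_radius` of `query`."""
    hits = []
    for center, members in clusters:
        # Triangle inequality: a ball whose center is farther away than
        # cluster_radius + query_radius cannot contain a hit.
        if math.dist(query, center) > cluster_radius + query_radius:
            continue
        hits.extend(m for m in members if math.dist(query, m) <= query_radius)
    return hits
```

The coarse step costs one distance computation per ball, so the fewer balls are needed to cover the data, the cheaper the search becomes.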

arXiv.org e-Print Archive

Elsevier - Publisher Connector

DSpace@MIT

Crossref

PubMed Central

Parametrización local de espacios métricos (Local parametrization of metric spaces)

Author: Chávez Edgar; Herrera Norma Edith
Publication date: 29/10/2012

Many computing applications aim to search a database for objects that are similar to a given one. All of these applications can be treated abstractly with the formalism of metric spaces. This approach encapsulates the properties of the database objects and makes it possible to build generic indexes. Many index construction techniques exist for proximity search, and all of them have parameters that depend on the geometry of the space. These parameters balance construction time, search time, and the memory used by the index. In this work we present a local parametrization method that segments the database so that suitable parameters can be chosen optimally for each segment. We illustrate the technique on an index that is particularly difficult to parametrize, the GNAT. For this purpose we select the metric space of word strings under the edit distance. The database is divided into two segments, which are indexed separately. To answer a query, both indexes are searched. This operation turns out to be more efficient than searching the original index.

Track: Databases. Red de Universidades con Carreras en Informática (RedUNCI
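
The query procedure described in this abstract (split the database into two segments, index each one separately, search both) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: a toy one-pivot index stands in for GNAT, the split by word length is an arbitrary choice, and all names (`PivotIndex`, `build_segmented`, `segmented_query`) are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class PivotIndex:
    """Toy one-pivot index (a stand-in for GNAT, not GNAT itself)."""

    def __init__(self, words):
        self.words = list(words)
        self.pivot = self.words[0] if self.words else None
        # Precompute d(pivot, x) for every word x in the segment.
        self.to_pivot = [edit_distance(self.pivot, w) for w in self.words]

    def range_query(self, query, radius):
        if self.pivot is None:
            return []
        dq = edit_distance(query, self.pivot)
        hits = []
        for word, dp in zip(self.words, self.to_pivot):
            # Triangle inequality: |d(q,p) - d(p,x)| <= d(q,x),
            # so words failing this bound can be skipped safely.
            if abs(dq - dp) > radius:
                continue
            if edit_distance(query, word) <= radius:
                hits.append(word)
        return hits


def build_segmented(words, length_cutoff):
    """Split by word length (an assumed criterion) and index each segment."""
    short = [w for w in words if len(w) <= length_cutoff]
    long_ = [w for w in words if len(w) > length_cutoff]
    return PivotIndex(short), PivotIndex(long_)


def segmented_query(indexes, query, radius):
    """Answer a range query by searching every segment index."""
    hits = []
    for index in indexes:
        hits.extend(index.range_query(query, radius))
    return sorted(hits)
```

Because each segment carries its own index, its parameters (here only the pivot) can be chosen per segment, which is the idea the abstract refers to as local parametrization.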

Servicio de Difusión de la Creación Intelectual