12 research outputs found

    A geometric framework for modelling similarity search

    The aim of this paper is to propose a geometric framework for modelling similarity search in large and multidimensional data spaces of general nature, which seems to be flexible enough to address such issues as analysis of complexity, indexability, and the "curse of dimensionality." Such a framework is provided by the concept of the so-called similarity workload, which is a probability metric space Ω (query domain) with a distinguished finite subspace X (dataset), together with an assembly of concepts, techniques, and results from metric geometry. They include such notions as metric transform, ε-entropy, and the phenomenon of concentration of measure on high-dimensional structures. In particular, we discuss the relevance of the latter to understanding the curse of dimensionality. As some of those concepts and techniques are being currently reinvented by the database community, it seems desirable to try and bridge the gap between database research and the relevant work already done in geometry and analysis. Comment: 11 pages, LaTeX 2.

    Indexability, concentration, and VC theory

    Degrading performance of indexing schemes for exact similarity search in high dimensions has long been linked to histograms of distributions of distances and other 1-Lipschitz functions getting concentrated. We discuss this observation in the framework of the phenomenon of concentration of measure on the structures of high dimension and the Vapnik-Chervonenkis theory of statistical learning. Comment: 17 pages, final submission to J. Discrete Algorithms (an expanded, improved and corrected version of the SISAP'2010 invited paper, this e-print, v3
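
    The concentration of distance histograms mentioned in this abstract can be observed numerically. The following is a minimal sketch, not taken from the paper: it assumes uniformly random points in the unit cube and Euclidean distance, and the function name `relative_spread` is hypothetical. It reports the ratio of standard deviation to mean of the distances from one query to a sample, which shrinks as the dimension grows.

```python
import math
import random
import statistics


def relative_spread(dim, n_points=200, seed=0):
    """Return std/mean of distances from one random query to random points.

    Assumption for illustration: points are uniform in [0, 1]^dim and the
    metric is Euclidean. A small ratio means the distance histogram is
    concentrated around its mean.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return statistics.pstdev(dists) / statistics.mean(dists)


if __name__ == "__main__":
    # The printed ratio should decrease as the dimension increases.
    for dim in (2, 20, 200, 2000):
        print(dim, round(relative_spread(dim), 4))
```

    When the ratio is small, almost all points lie at nearly the same distance from the query, so distance-based pruning rules in an index discard very few candidates.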

    Entropy-scaling search of massive biological data

    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology. Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
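
    The idea of searching over covering hyperspheres can be illustrated with a generic two-stage range search. This is a sketch under stated assumptions, not the authors' tools: it uses a simple greedy cover, Euclidean distance, and hypothetical function names `build_cover` and `range_search`. The number of clusters produced plays the role of the count of covering hyperspheres at the chosen radius.

```python
import math
import random


def build_cover(points, radius, dist=math.dist):
    """Greedily pick centers so every point lies within `radius` of its center.

    Returns a list of (center, members) pairs. This greedy cover is an
    illustrative stand-in; it is not claimed to be minimal.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if dist(center, p) <= radius:
                members.append(p)
                break
        else:
            # No existing center is close enough, so p starts a new cluster.
            clusters.append((p, [p]))
    return clusters


def range_search(clusters, query, query_radius, cluster_radius, dist=math.dist):
    """Return all points within `query_radius` of `query`.

    Coarse stage: by the triangle inequality, a cluster can contain a hit
    only if its center is within query_radius + cluster_radius of the query.
    Fine stage: scan the members of the surviving clusters.
    """
    hits = []
    for center, members in clusters:
        if dist(center, query) <= query_radius + cluster_radius:
            hits.extend(m for m in members if dist(m, query) <= query_radius)
    return hits


if __name__ == "__main__":
    rng = random.Random(1)
    pts = [(rng.random(), rng.random()) for _ in range(300)]
    cover = build_cover(pts, 0.1)
    found = range_search(cover, (0.5, 0.5), 0.15, 0.1)
    print(len(cover), "clusters;", len(found), "hits")
```

    The coarse stage costs one distance computation per cluster, so the fewer hyperspheres needed to cover the data, the cheaper it is; the fine stage stays small when few clusters fall near the query.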

    Local parametrization of metric spaces (Parametrización local de espacios métricos)

    Many computing applications aim to search a database for objects similar to a given one. All of these applications can be treated abstractly using the formalism of metric spaces. This approach encapsulates the properties of the database objects and makes it possible to build generic indexes. Many index construction techniques exist for proximity search, and all of them have parameters that depend on the geometry of the space. These parameters balance construction time, search time, and the memory used by the index. In this work we present a local parametrization method that segments the database so that suitable parameters can be selected optimally for each segment. We illustrate the technique on an index that is particularly difficult to parametrize, the GNAT. For this purpose we selected the metric space of word strings under the edit distance. The database is divided into two segments, which are indexed separately. To answer a query, both indexes are searched. This operation is more efficient than searching the original index. Track: Databases. Red de Universidades con Carreras en Informática (RedUNCI
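
    The query procedure described in this abstract (split the database into two segments, index each with its own parameters, search both) can be sketched as follows. This is an illustration under loud assumptions, not the paper's method: a simple pivot table stands in for GNAT, the split by word length is a hypothetical segmentation rule, and the names `PivotIndex`, `build_segmented`, and `search_segmented` are invented for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class PivotIndex:
    """Stand-in index (not GNAT): stores distances from each word to pivots.

    `n_pivots` is the per-segment parameter that would be tuned locally.
    """

    def __init__(self, words, n_pivots):
        self.words = list(words)
        self.pivots = self.words[:n_pivots]
        self.table = [[edit_distance(w, p) for p in self.pivots]
                      for w in self.words]

    def range_query(self, query, radius):
        q_to_pivots = [edit_distance(query, p) for p in self.pivots]
        out = []
        for word, row in zip(self.words, self.table):
            # Triangle inequality: |d(q, p) - d(w, p)| > radius excludes w.
            if any(abs(dq - dw) > radius for dq, dw in zip(q_to_pivots, row)):
                continue
            if edit_distance(query, word) <= radius:
                out.append(word)
        return out


def build_segmented(words, length_cut, pivots_short, pivots_long):
    """Split by word length (hypothetical rule) and index each segment."""
    short = [w for w in words if len(w) <= length_cut]
    long_ = [w for w in words if len(w) > length_cut]
    return [PivotIndex(short, pivots_short), PivotIndex(long_, pivots_long)]


def search_segmented(indexes, query, radius):
    """Answer a range query by searching every segment index and merging."""
    results = []
    for index in indexes:
        results.extend(index.range_query(query, radius))
    return results


if __name__ == "__main__":
    vocab = ["kitten", "sitting", "mitten", "bitten", "cat", "cot", "international"]
    indexes = build_segmented(vocab, length_cut=5, pivots_short=1, pivots_long=2)
    print(search_segmented(indexes, "kitten", 2))
```

    Because the segments partition the database, the union of the per-segment answers equals the answer over the whole database; any gain comes from each segment being searched with parameters suited to its own geometry.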