A geometric framework for modelling similarity search
The aim of this paper is to propose a geometric framework for modelling
similarity search in large and multidimensional data spaces of general nature,
which seems to be flexible enough to address such issues as analysis of
complexity, indexability, and the `curse of dimensionality.' Such a framework
is provided by the concept of the so-called similarity workload, which is a
probability metric space (query domain) with a distinguished finite
subspace (dataset), together with an assembly of concepts, techniques, and
results from metric geometry. They include such notions as metric transform,
ε-entropy, and the phenomenon of concentration of measure on
high-dimensional structures. In particular, we discuss the relevance of the
latter to understanding the curse of dimensionality. As some of those concepts
and techniques are being currently reinvented by the database community, it
seems desirable to try and bridge the gap between database research and the
relevant work already done in geometry and analysis.
Comment: 11 pages, LaTeX 2.
Indexability, concentration, and VC theory
The degrading performance of indexing schemes for exact similarity search in
high dimensions has long been linked to the concentration of histograms of
distance distributions and of other 1-Lipschitz functions. We discuss this
observation in the framework of the phenomenon of concentration of measure on
the structures of high dimension and the Vapnik-Chervonenkis theory of
statistical learning.
Comment: 17 pages, final submission to J. Discrete Algorithms (an expanded,
improved and corrected version of the SISAP'2010 invited paper, this e-print,
v3
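The concentration effect described in this abstract can be observed numerically. The following is a minimal sketch, not taken from the paper: it assumes points drawn uniformly from the unit cube and the Euclidean distance, and the helper name `relative_spread` is hypothetical. It reports the ratio of the standard deviation to the mean of the distances from one query to a random sample, which shrinks as the dimension grows.

```python
import math
import random
import statistics


def relative_spread(dim, n_points=200, seed=0):
    """Return std/mean of the distances from one random query to random
    points drawn uniformly from the unit cube [0, 1]^dim (an assumed
    data model, chosen only for illustration)."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return statistics.pstdev(dists) / statistics.fmean(dists)


if __name__ == "__main__":
    # The histogram of distances narrows relative to its mean as dim grows.
    for dim in (2, 20, 200):
        print(dim, round(relative_spread(dim), 3))
```

When the relative spread is small, nearly all data points lie at almost the same distance from the query, so distance-based pruning rules discard few candidates.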
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
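The idea of searching through a cover of the dataset by spheres can be sketched as a two-stage range search. This is a simplified illustration under stated assumptions, not the implementation used in Ammolite, MICA, or esFragBag: points are Euclidean tuples, the cover is built greedily, and the names `build_cover` and `range_search` are hypothetical. The number of clusters plays the role of the metric entropy, since the coarse stage touches each cluster center once.

```python
import math
import random


def build_cover(points, cluster_radius):
    """Greedily pick centers so that every point lies within
    cluster_radius of the center it is assigned to.
    Returns a list of (center, members) pairs."""
    clusters = []
    for p in points:
        for center, members in clusters:
            if math.dist(center, p) <= cluster_radius:
                members.append(p)
                break
        else:
            # No existing center is close enough: p starts a new cluster.
            clusters.append((p, [p]))
    return clusters


def range_search(clusters, cluster_radius, query, radius):
    """Return all indexed points within `radius` of `query`."""
    hits = []
    for center, members in clusters:
        # Coarse step: by the triangle inequality, a cluster can hold a
        # hit only if its center is within radius + cluster_radius.
        if math.dist(center, query) > radius + cluster_radius:
            continue
        # Fine step: scan the members of the surviving cluster.
        hits.extend(p for p in members if math.dist(p, query) <= radius)
    return hits


if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.random(), rng.random()) for _ in range(300)]
    cover = build_cover(data, 0.1)
    print(len(cover), "clusters;",
          len(range_search(cover, 0.1, (0.5, 0.5), 0.15)), "hits")
```

Because the coarse step only discards clusters that cannot contain a hit, the result matches a brute-force scan. The saving comes from the clusters that are skipped whole.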
Local parametrization of metric spaces
Many computing applications aim to search a database for objects similar to a given one. All of these applications can be treated abstractly using the metric space formalism. This approach encapsulates the properties of the database objects and makes it possible to build generic indexes.
Many index construction techniques exist for proximity searching, and all of them have parameters that depend on the geometry of the space. These parameters balance construction time, search time, and the memory used by the index.
In this work we present a local parametrization method that segments the database so that suitable parameters can be selected optimally for each segment.
We illustrate the technique on an index that is particularly difficult to parametrize, the GNAT.
For this purpose we chose the metric space of word strings under the edit distance. The database is divided into two segments, which are indexed separately. To answer a query, both indexes are searched. This operation turns out to be more efficient than searching the original index.
Track: Databases
Red de Universidades con Carreras en Informática (RedUNCI
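The segment-and-search-both scheme can be sketched as follows. This is an illustration under several assumptions, not the authors' method: a simple pivot table stands in for the GNAT, the database is split by word length (the abstract does not state the splitting rule), and all class, function, and parameter names are hypothetical. Each segment gets its own parameter, here the number of pivots.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class PivotIndex:
    """Stand-in index (not a GNAT): stores distances to a few pivots and
    uses the triangle inequality to skip distance computations."""

    def __init__(self, words, n_pivots):
        self.words = list(words)
        self.pivots = self.words[:n_pivots]
        self.table = [[edit_distance(w, p) for p in self.pivots]
                      for w in self.words]

    def search(self, query, radius):
        q_dists = [edit_distance(query, p) for p in self.pivots]
        hits = []
        for word, row in zip(self.words, self.table):
            # |d(q, p) - d(w, p)| is a lower bound on d(q, w).
            if any(abs(qd - wd) > radius for qd, wd in zip(q_dists, row)):
                continue
            if edit_distance(query, word) <= radius:
                hits.append(word)
        return hits


def build_segmented(words, split_length, pivots_short, pivots_long):
    """Split by word length (an assumed rule) and index each segment
    with its own number of pivots."""
    short = [w for w in words if len(w) <= split_length]
    long_ = [w for w in words if len(w) > split_length]
    return [PivotIndex(short, pivots_short), PivotIndex(long_, pivots_long)]


def search_segmented(indexes, query, radius):
    """Answer a query by searching every segment and merging the hits."""
    return [w for index in indexes for w in index.search(query, radius)]


if __name__ == "__main__":
    vocabulary = ["casa", "cosa", "caso", "mesa", "computadora", "computador"]
    indexes = build_segmented(vocabulary, split_length=5,
                              pivots_short=1, pivots_long=2)
    print(search_segmented(indexes, "cara", 1))
```

Searching both segments returns the same answers as one index over the whole database. Any gain comes from tuning each segment's parameters to its local geometry.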