Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection
operations is an active research topic. Most compression schemes rely on
encoding differences between consecutive positions with techniques that favor
small numbers. In this paper we explore a completely different alternative: We
use Re-Pair compression of those differences. While Re-Pair by itself offers
fast decompression at arbitrary positions in main and secondary memory, we
introduce variants that in addition speed up the operations required for
inverted list intersection. We compare the resulting data structures with
several recent proposals under various list intersection algorithms, to
conclude that our Re-Pair variants offer an interesting time/space tradeoff for
this problem, yet further improvements are required for them to surpass the
state of the art.
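The gap-plus-Re-Pair idea can be sketched as follows (a minimal Python illustration, not the authors' implementation: real Re-Pair uses priority queues to run in linear time, and the variants in the paper add extra structures for fast intersection):

```python
# Sketch: gap-encode a posting list, then apply Re-Pair-style replacement
# of the most frequent adjacent pair until no pair repeats.

def gap_encode(postings):
    """Store differences between consecutive document ids."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def repair(seq):
    """Replace the most frequent adjacent pair with a fresh symbol, repeatedly.

    Returns the compressed sequence and the grammar rules (symbol -> pair).
    """
    rules = {}
    next_sym = max(seq) + 1
    while True:
        counts = {}
        for pair in zip(seq, seq[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        best, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break
        rules[next_sym] = best
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules

def expand(sym, rules):
    """Decompress one symbol back into its run of gaps."""
    if sym not in rules:
        return [sym]
    a, b = rules[sym]
    return expand(a, rules) + expand(b, rules)

postings = [3, 4, 7, 8, 11, 12, 15, 16]
gaps = gap_encode(postings)          # [3, 1, 3, 1, 3, 1, 3, 1]
compressed, rules = repair(gaps)     # the regular gap pattern collapses
restored = [g for s in compressed for g in expand(s, rules)]
assert restored == gaps
```

Regular gap patterns, common in real posting lists, become short grammar symbols; decompression expands each symbol independently, which is what enables random access during intersection.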
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.

Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
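The run-length approach described above exploits a simple effect, sketched here (an illustrative toy encoding, not the paper's exact scheme): when documents are near-copies, a term present in one version is usually present in all of them, so the differential (gap-encoded) posting list contains long runs of gap 1 that run-length encoding collapses.

```python
# Toy illustration: run-length encoding of differential posting lists
# in a versioned (near-copy) collection.

def gaps(postings):
    """Differences between consecutive document ids."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(seq):
    """Encode as (value, repetitions) pairs."""
    out = []
    for x in seq:
        if out and out[-1][0] == x:
            out[-1] = (x, out[-1][1] + 1)
        else:
            out.append((x, 1))
    return out

# Documents 10..99 are versions of one document, all containing the term:
# 90 postings shrink to two (value, count) pairs.
postings = [10] + list(range(11, 100))
g = gaps(postings)                    # [10, 1, 1, ..., 1]
rle = run_length(g)
assert rle == [(10, 1), (1, 89)]
```

Classical gap encoding would still spend one small integer per posting here; the run-length representation spends O(1) integers per run, independent of the number of versions.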
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space
Indexing highly repetitive texts - such as genomic databases, software
repositories and versioned text collections - has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms
(BWTs). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. In this paper we close this long-standing problem, showing how to extend the
Run-Length FM-index so that it can locate the occ occurrences efficiently
within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m
+ occ), within O(r log log_w(σ + n/r)) space, for a text of length n
over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits.
Within that space, our index can also count in optimal
time, O(m). Multiplying the space by O(w / log σ), we support count and
locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which
is optimal in the packed setting and had not been obtained before in compressed
space. We also describe a structure using O(r log(n/r)) space that replaces the
text and extracts any text substring of length ℓ in almost-optimal time
O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct
access to suffix array, inverse suffix array, and longest common prefix array
cells, and extend these capabilities to full suffix tree functionality,
typically in O(log(n/r)) time per operation.

Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + σ))
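The measure r can be computed naively for intuition (a toy sketch: sorting all rotations takes quadratic space, so practical tools derive the BWT from a suffix array instead):

```python
# Sketch: build the BWT of a short string and count its runs r.

def bwt(text):
    """Last column of the sorted rotations of text + terminator '$'."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def runs(s):
    """Number of maximal runs of equal consecutive symbols."""
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)

# A repetitive text yields a BWT with few runs relative to its length:
# here the transform has length 13 but only r = 4 runs.
b = bwt("abcabcabcabc")
assert b == "cccc$aaaabbbb"
assert runs(b) == 4
```

The sorting groups symbols by their right context, so repetitive texts produce long runs of equal symbols in the last column; run-length indexes store only one entry per such run.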
Optimal-Time Text Indexing in BWT-runs Bounded Space
Indexing highly repetitive texts --- such as genomic databases, software
repositories and versioned text collections --- has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transform
(BWT). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. Since then, a number of other indexes with space bounded by other measures
of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size
of the smallest grammar generating the text, the size of the smallest automaton
recognizing the text factors --- have been proposed for efficiently locating,
but not directly counting, the occurrences of a pattern. In this paper we close
this long-standing problem, showing how to extend the Run-Length FM-index so
that it can locate the occ occurrences efficiently within O(r) space (in
loglogarithmic time each), and reaching optimal time, O(m + occ), within
O(r log(n/r)) space, on a RAM machine of w = Ω(log n) bits. Within
O(r log(n/r)) space, our index can also count in optimal time, O(m).
Raising the space to O(r w log_σ(n/r)), we support count and locate in
O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the
packed setting and had not been obtained before in compressed space. We also
describe a structure using O(r log(n/r)) space that replaces the text and
extracts any text substring of length ℓ in almost-optimal time
O(log(n/r) + ℓ log(σ)/w). (...continues...)
Statistical and repetition-based compressed data structures
[Abstract]
In this thesis we present several practical compressed data structures that address
open problems related to statistically-compressible and highly repetitive databases.
In the first part, we focus on statistically-based compressed data structures,
targeting the problem of managing large alphabets. This problem arises when
typical sequence-based compression is used as a basis for compressed data structures
representing more general structures like grids and graphs. Concretely, (a) we
provide space-efficient solutions to represent prefix-free codes when the alphabet
is large; (b) we also present a new wavelet-tree based data structure to solve rank
and select queries that obtains zero-order compression and outperforms previous
wavelet tree implementations on large alphabets.
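For context on (b), a textbook pointer-based wavelet tree answering rank looks as follows (a didactic sketch, not the thesis's engineered variant, which replaces the plain bit lists with compressed rank/select dictionaries and flattens the pointer structure for large alphabets):

```python
# Sketch of a classical wavelet tree: each node splits its alphabet
# range in half and stores one bit per symbol of its subsequence.

class WaveletTree:
    def __init__(self, seq, lo=None, hi=None):
        # seq must be nonempty at the root.
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        if self.lo == self.hi:                     # leaf: single symbol
            self.bits, self.left, self.right = [], None, None
            return
        mid = (self.lo + self.hi) // 2
        self.bits = [0 if x <= mid else 1 for x in seq]
        self.left = WaveletTree([x for x in seq if x <= mid], self.lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, self.hi)

    def rank(self, c, i):
        """Occurrences of symbol c among the first i symbols."""
        if self.lo == self.hi:
            return i if c == self.lo else 0
        mid = (self.lo + self.hi) // 2
        ones = sum(self.bits[:i])        # with o(n)-bit rank structures: O(1)
        if c <= mid:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)

seq = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
wt = WaveletTree(seq)
assert wt.rank(1, 4) == 2    # two 1s among the first four symbols
assert wt.rank(5, 10) == 2   # two 5s in the whole sequence
```

Each query descends one level per bit of the symbol, so rank and select cost O(log σ) node visits; the engineering question the thesis addresses is how to make each visit cheap when σ is large.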
In the second part of this thesis, we focus on highly repetitive datasets. We
present (c) a very space efficient grammar-based compressed data structure to solve
rank and select on these scenarios; (d) the first LZ77-space bounded compressed
data structure that solves rank and select queries in O(1) time and is in practice
almost as fast as statistically-compressed structures; and (e) the first practical
version of grammar-compressed tree topologies, obtaining unprecedented results in
the representation of repetitive trees.
Additionally, we apply our new solutions to several problems of interest: point
grids, inverted indexes, self-indexes, XPath systems, and compressed suffix trees of
highly repetitive inputs, displaying various space-time tradeoffs of interest.
Data Structures for Efficient String Algorithms
This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the two-dimensional variant of the RMQ-problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption of a factor of log(n).
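As a baseline for comparison (a classical sparse-table RMQ, which uses O(n log n) words rather than the 2n+o(n) bits achieved in the thesis, but shares the constant-time query property):

```python
# Sparse-table RMQ: precompute minimum positions for all ranges whose
# length is a power of two; answer any query from two overlapping blocks.

def preprocess(a):
    n = len(a)
    logs = [0] * (n + 1)
    for i in range(2, n + 1):
        logs[i] = logs[i // 2] + 1
    table = [list(range(n))]             # level 0: ranges of length 1
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        row = []
        for i in range(n - (1 << j) + 1):
            l, r = prev[i], prev[i + (1 << (j - 1))]
            row.append(l if a[l] <= a[r] else r)
        table.append(row)
        j += 1
    return table, logs

def rmq(a, table, logs, i, j):
    """Position of a minimum in a[i..j], inclusive, in O(1) time."""
    k = logs[j - i + 1]
    l, r = table[k][i], table[k][j - (1 << k) + 1]
    return l if a[l] <= a[r] else r

a = [4, 2, 7, 1, 8, 1, 3]
table, logs = preprocess(a)
assert rmq(a, table, logs, 0, 6) == 3
assert rmq(a, table, logs, 4, 6) == 5
```

The two power-of-two blocks cover the query range with overlap, which is harmless for minima; the thesis's contribution is reaching the same O(1) query time in asymptotically optimal 2n+o(n) bits via Cartesian trees.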
It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ-problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space and time consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix and LCP arrays, 2n+o(n) bits of additional space (coming from our RMQ scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern.
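For intuition on the role the suffix array plays in these bounds, plain binary search already locates all occurrences in O(m log n) time (a didactic sketch; the enhanced structures described above reach O(m*s+occ) by simulating suffix tree traversal with RMQs over the LCP array):

```python
# Sketch: locating a pattern with a plain suffix array.

def suffix_array(text):
    """Naive O(n^2 log n) construction, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def locate(text, sa, pat):
    """All starting positions of pat in text, via two binary searches."""
    n, m = len(sa), len(pat)
    lo, hi = 0, n
    while lo < hi:                  # leftmost suffix whose m-prefix >= pat
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = n
    while lo < hi:                  # leftmost suffix whose m-prefix > pat
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pat:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "mississippi"
sa = suffix_array(text)
assert locate(text, sa, "si") == [3, 6]
assert locate(text, sa, "iss") == [1, 4]
```

All occurrences of a pattern occupy a contiguous interval of the sorted suffix order, which is why a single pair of binary searches suffices; LCP information removes the redundant character comparisons those searches repeat.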
Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact with empirical tests and comparisons on real-world problem instances. In most cases our algorithms outperform previous approaches.
Effective reorganization and self-indexing of big semantic data
In this thesis we analyzed the structural redundancy of RDF graphs and proposed a preprocessing technique, RDF-Tr, which groups, reorganizes, and recodes the triples, addressing two sources of structural redundancy underlying the nature of the RDF scheme. We integrated RDF-Tr into HDT and k2-triples, reducing the size obtained by the original compressors and surpassing the most prominent state-of-the-art techniques. We call HDT++ and k2-triples++ the results of applying RDF-Tr in each compressor.
In the field of RDF compression, compact data structures are used to build RDF self-indexes, which provide efficient access to the data without decompressing them. HDT-FoQ is used to publish and consume large RDF datasets. We extended HDT++, calling the result iHDT++, to resolve SPARQL triple patterns, consuming less memory than HDT-FoQ while speeding up the resolution of most queries and improving the space-time tradeoff of the other self-indexes.
Doctor of Philosophy dissertation
Gene expression data repositories provide large and ever-increasing data for secondary use by translational informatics methods. For example, Gene Expression Omnibus (GEO) houses over 37,000 experiments with the goal of supporting further research. To use these published results in a larger meta-analysis, consolidation of the data is needed; however, the data are largely unstructured, thus hindering data integration efforts. Here, I propose the use of a novel pipeline, Ontology Based Data Integration (OBDI), which uses an ontological approach to combine the samples across multiple GEO experiments. The OBDI pipeline uses machine learning algorithms that permit researchers to consolidate and analyze data across GEO experiments. Here, I demonstrate how using an ontological approach to integrate samples across experiments can be used to explore the immune response at a molecular level. As part of this process, a Web Ontology Language (OWL) ontology was developed for each data platform used. OWL serves as a core component in successfully processing different sample types. Immunological experiments from GEO were consolidated to evaluate this methodology. The experiments included samples analyzed on expression arrays, BeadChips, and sequencing technologies. The integration of a complex biological system and the incorporation of different biological data types will validate the potential of OBDI. The nature of biological data is highly dimensional. OBDI incorporates tools and techniques that can handle the analysis of various biological data. The machine learning analysis performed within the OBDI pipeline successfully evaluated the newly annotated experiments and provides insights that can be further explored. The OBDI pipeline can help researchers annotate experiments using ontologies and analyze the annotated experiments.
To successfully build the pipeline, ontologies served as the backbone of integrating samples from GEO Series records into machine learning experiments using ML-Flex. By using the OBDI pipeline, researchers can access the uncurated experiments from GEO (GEO Data Series) and annotate the data using the terms in the ontologies. This mechanism allows for the organization of data sets in relation to new experiments, independent of GEO's GDS curation process. The OBDI system allows ontologies to grow organically around a cluster of experiments. These experiments are then further analyzed in ML-Flex using machine learning algorithms. The curated experiments are analyzed in silico and the computational analyses are supported by the OBDI ontological system.