
    Re-Pair Compression of Inverted Lists

    Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: we use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that in addition speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, and conclude that our Re-Pair variants offer an interesting time/space tradeoff for this problem, yet further improvements are required for them to improve upon the state of the art.
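
To make the idea of Re-Pair over differences concrete, here is a minimal sketch. It is a naive quadratic-time illustration, not the paper's data structure: the function names `to_gaps`, `repair` and `expand` are hypothetical, and the intersection-oriented variants mentioned in the abstract are not modeled.

```python
from collections import Counter

def to_gaps(postings):
    """Differences between consecutive positions (first gap = first position)."""
    return [p - q for p, q in zip(postings, [0] + postings[:-1])]

def repair(seq):
    """Naive Re-Pair: repeatedly replace the most frequent adjacent pair
    with a fresh symbol until no pair is counted twice.
    Returns (compressed sequence, rules) with rules[symbol] = (left, right)."""
    seq = list(seq)
    rules = {}
    next_sym = max(seq, default=0) + 1  # fresh symbols lie above all gap values
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):  # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules

def expand(seq, rules):
    """Decompress by expanding rule symbols with an explicit stack."""
    out, stack = [], list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return out

postings = [3, 5, 7, 9, 12, 14, 16, 18]
gaps = to_gaps(postings)
compressed, rules = repair(gaps)
```

Regular gap patterns, as in the toy list above, collapse into a few grammar symbols; the original gaps are recovered with `expand(compressed, rules)`.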

    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time, $O(m + occ)$, within $O(r \log\log_w(\sigma + n/r))$ space, for a text of length $n$ over an alphabet of size $\sigma$ on a RAM machine with words of $w = \Omega(\log n)$ bits. Within that space, our index can also count in optimal time, $O(m)$. Multiplying the space by $O(w/\log\sigma)$, we support count and locate in $O(\lceil m\log(\sigma)/w \rceil)$ and $O(\lceil m\log(\sigma)/w \rceil + occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r) + \ell\log(\sigma)/w)$. Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in $O(\log(n/r))$ time per operation. Comment: submitted version; optimal count and locate in smaller space: $O(r \log\log_w(n/r + \sigma))$
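
The measure r from the abstract can be illustrated with a short sketch. It builds the BWT by sorting all rotations, which is only practical for toy inputs and is not how the index in the paper is constructed. The function names and the `$` sentinel (assumed absent from the text and smaller than every text character) are assumptions made for the example.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (toy construction)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s):
    """Number of maximal runs of equal consecutive symbols."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

transformed = bwt("banana")    # "annb$aa"
r = count_runs(transformed)    # 5 runs: a, nn, b, $, aa
```

On highly repetitive inputs r tends to be much smaller than the text length, which is what makes space bounds expressed in terms of r attractive.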

    Optimal-Time Text Indexing in BWT-runs Bounded Space

    Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits. Within $O(r\log(n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_\sigma(n/r))$, we support count and locate in $O(m\log(\sigma)/w)$ and $O(m\log(\sigma)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(\sigma)/w)$. (...continues...)

    Statistical and repetition-based compressed data structures

    In this thesis we present several practical compressed data structures that address open problems related to statistically-compressible and highly repetitive databases. In the first part, we focus on statistical-based compressed data structures, targeting the problem of managing large alphabets. This problem arises when typical sequence-based compression is used as a basis for compressed data structures representing more general structures like grids and graphs. Concretely, (a) we provide space-efficient solutions to represent prefix-free codes when the alphabet is large; (b) we also present a new wavelet-tree based data structure to solve rank and select queries that obtains zero-order compression and outperforms previous wavelet tree implementations on large alphabets. In the second part of this thesis, we focus on highly repetitive datasets. We present (c) a very space-efficient grammar-based compressed data structure to solve rank and select in these scenarios; (d) the first LZ77-space bounded compressed data structure that solves rank and select queries in O(1) time and is in practice almost as fast as statistically-compressed structures; and (e) the first practical version of grammar-compressed tree topologies, obtaining unprecedented results in the representation of repetitive trees. Additionally, we apply our new solutions to several problems of interest: point grids, inverted indexes, self-indexes, XPath systems, and compressed suffix trees of highly repetitive inputs, displaying various space-time tradeoffs of interest.
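
For readers unfamiliar with the rank and select queries mentioned above, the following sketch shows their semantics on a plain bit vector with sampled prefix counts. It is an uncompressed illustration only and does not correspond to any of the thesis's structures; the class name `BitVector` and the `block` parameter are hypothetical.

```python
class BitVector:
    """Plain bit vector with sampled prefix counts (uncompressed toy version)."""

    def __init__(self, bits, block=4):
        self.bits = list(bits)
        self.block = block
        # prefix[b] = number of 1s in bits[0 : b * block]
        self.prefix = [0]
        for start in range(0, len(self.bits), block):
            self.prefix.append(self.prefix[-1] + sum(self.bits[start:start + block]))

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        b = i // self.block
        return self.prefix[b] + sum(self.bits[b * self.block:i])

    def select1(self, k):
        """Position of the k-th 1 (k >= 1), found by binary search on rank1."""
        if k < 1 or k > self.rank1(len(self.bits)):
            raise ValueError("k out of range")
        lo, hi = 0, len(self.bits) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo

bv = BitVector([1, 0, 1, 1, 0, 0, 1, 0, 1])
```

Here `bv.rank1(4)` counts the ones among the first four bits, and `bv.select1(3)` returns the position of the third one; the two operations are inverses in the sense that `bv.rank1(bv.select1(k) + 1) == k`.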

    Data Structures for Efficient String Algorithms

    This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the two-dimensional variant of the RMQ-problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption of a factor of log(n). It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ-problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space- and time-consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix- and LCP-array 2n+o(n) bits of additional space (coming from our RMQ-scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. 
This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern. Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-world problem instances. In most cases our algorithms outperform previous approaches on all measures.
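
As a point of reference for the RMQ problem described above, here is a sketch of the classical sparse-table solution. It is a swapped-in, simpler technique, not the thesis's scheme: it answers queries in constant time but uses O(n log n) words rather than 2n+o(n) bits. The function names are hypothetical.

```python
def build_sparse_table(a):
    """table[j][i] = position of the leftmost minimum in a[i : i + 2**j]."""
    n = len(a)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev = table[-1]
        half = 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        j += 1
    return table

def rmq(a, table, lo, hi):
    """Position of the leftmost minimum in a[lo..hi] (inclusive bounds)."""
    k = (hi - lo + 1).bit_length() - 1  # largest power of two fitting the range
    left = table[k][lo]
    right = table[k][hi - (1 << k) + 1]
    return left if a[left] <= a[right] else right

values = [3, 1, 4, 1, 5, 9, 2, 6]
table = build_sparse_table(values)
```

A query covers the interval with two overlapping power-of-two blocks, so `rmq(values, table, 2, 5)` needs only two table lookups and one comparison.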

    Effective reorganization and self-indexing of big semantic data

    In this thesis we have analyzed the structural redundancy present in RDF graphs and proposed a preprocessing technique, RDF-Tr, that groups, reorganizes and recodes the triples, addressing two sources of structural redundancy underlying the nature of the RDF schema. We have integrated RDF-Tr into HDT and k2-triples, reducing the size obtained by the original compressors and outperforming the most prominent state-of-the-art techniques. We call the results of applying RDF-Tr to each compressor HDT++ and k2-triples++. In the field of RDF compression, compact data structures are used to build RDF self-indexes, which provide efficient access to the data without decompressing them. HDT-FoQ is used to publish and consume large RDF data collections. We have extended HDT++, calling the result iHDT++, to resolve SPARQL patterns, consuming less memory than HDT-FoQ while speeding up the resolution of most queries and improving the space-time tradeoff of the remaining self-indexes. Department of Computer Science (Computer Architecture and Technology, Computer Science and Artificial Intelligence, Computer Languages and Systems). Doctorate in Computer Science

    Doctor of Philosophy

    Gene expression data repositories provide large and ever-increasing data for secondary use by translational informatics methods. For example, Gene Expression Omnibus (GEO) houses over 37,000 experiments with the goal of supporting further research. To use these published results in a larger meta-analysis, the data need to be consolidated; however, the data are largely unstructured, which hinders data integration efforts. Here, I propose the use of a novel pipeline, Ontology Based Data Integration (OBDI), which uses an ontological approach to combine the samples across multiple GEO experiments. The OBDI pipeline uses machine learning algorithms that permit researchers to consolidate and analyze data across GEO experiments. Here, I demonstrate how using an ontological approach to integrate samples across experiments can be used to explore the immune response at a molecular level. As part of this process, an ontology in the Web Ontology Language (OWL) was developed for each data platform used. OWL serves as a core component in successfully processing different sample types. Immunological experiments from GEO were consolidated to evaluate this methodology. The experiments included samples analyzed on expression arrays, BeadChips, and sequencing technologies. The integration of a complex biological system and the incorporation of different biological data types will validate the potential of OBDI. The nature of biological data is highly dimensional. OBDI incorporates tools and techniques that can handle the analysis of various biological data. The machine learning analysis performed within the OBDI pipeline successfully evaluated the newly annotated experiments and provides insights that can be further explored. The OBDI pipeline can help researchers annotate experiments using ontologies and analyze the annotated experiments. 
To successfully build the pipeline, ontologies served as the backbone of integrating samples from GEO Series records into machine learning experiments using ML-Flex. By using the OBDI pipeline, researchers can access the uncurated experiments from GEO (GEO Data Series) and annotate the data using the terms in the ontologies. This mechanism allows for the organization of data sets in relationship to new experiments independent of GEO's GDS curation process. The OBDI system allows ontologies to grow organically around a cluster of experiments. These experiments are then further analyzed in ML-Flex using machine learning algorithms. The curated experiments are analyzed in silico and the computational analyses are supported by the OBDI ontological system.