Compressed Representations of Permutations, and Applications
We explore various techniques to compress a permutation $\pi$ over $n$
integers, taking advantage of ordered subsequences in $\pi$, while supporting
its application $\pi(i)$ and the application of its inverse $\pi^{-1}(i)$ in
small time. Our compression schemes yield several interesting byproducts, in
many cases matching, improving or extending the best existing results on
applications such as the encoding of a permutation in order to support iterated
applications $\pi^k(i)$ of it, of integer functions, and of inverted lists and
suffix arrays.
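As a rough illustration of what "ordered subsequences" can look like, the sketch below splits a permutation into maximal ascending runs of consecutive positions and computes the entropy of the run lengths. Few, long runs give a low value and suggest a more compressible permutation. The function names are hypothetical, and this measure is only one possible reading of the abstract, not necessarily the exact one used in the paper.

```python
import math

def ascending_runs(perm):
    """Split perm into maximal ascending runs of consecutive positions."""
    runs = []
    start = 0
    for i in range(1, len(perm) + 1):
        # A run ends at the end of the input or where the next value drops.
        if i == len(perm) or perm[i] < perm[i - 1]:
            runs.append(perm[start:i])
            start = i
    return runs

def run_length_entropy(perm):
    """Entropy (in bits) of the distribution of run lengths of perm."""
    n = len(perm)
    return sum((len(r) / n) * math.log2(n / len(r))
               for r in ascending_runs(perm))
```

An already sorted permutation forms a single run and has entropy 0, while a permutation made of many short runs approaches the entropy of an incompressible one.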
LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations
LRM-Trees are an elegant way to partition a sequence of values into sorted
consecutive blocks, and to express the relative position of the first element
of each block within a previous block. They were used to encode ordinal trees
and to index integer arrays in order to support range minimum queries on them.
We describe how they yield many other convenient results in a variety of areas,
from data structures to algorithms: some compressed succinct indices for range
minimum queries; a new adaptive sorting algorithm; and a compressed succinct
data structure for permutations supporting direct and indirect application in
time that is shorter the more compressible the permutation is.
Comment: 13 pages, 1 figure
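The sketch below shows one way to compute the parent array of an LRM-tree. It assumes the common definition in which the parent of each position is the nearest earlier position holding a strictly smaller value, with a virtual root (written as -1 here) for positions that have none. The function name and the -1 convention are illustrative choices.

```python
def lrm_parents(values):
    """Return parents[i] = nearest j < i with values[j] < values[i], else -1."""
    parents = []
    stack = []  # indices whose values strictly increase from bottom to top
    for i, v in enumerate(values):
        # Discard earlier positions that are not strictly smaller than v.
        while stack and values[stack[-1]] >= v:
            stack.pop()
        parents.append(stack[-1] if stack else -1)
        stack.append(i)
    return parents
```

Each index is pushed and popped at most once, so the whole array is computed in linear time.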
Compact Binary Relation Representations with Rich Functionality
Binary relations are an important abstraction arising in many data
representation problems. The data structures proposed so far to represent them
support just a few basic operations required to fit one particular application.
We identify many of those operations arising in applications and generalize
them into a wide set of desirable queries for a binary relation representation.
We also identify reductions among those operations. We then introduce several
novel binary relation representations, some simple and some quite
sophisticated, that not only are space-efficient but also efficiently support a
large subset of the desired queries.
Comment: 32 pages
Efficient Fully-Compressed Sequence Representations
We present a data structure that stores a sequence $s[1..n]$ over alphabet
$[1..\sigma]$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits, where $H_0(s)$ is the
zero-order entropy of $s$. This structure supports the queries access, rank
and select, which are fundamental building blocks for many other compressed
data structures, in worst-case time $O(\lg\lg\sigma)$ and average time
$O(\lg H_0(s))$. The worst-case complexity matches the best previous results,
yet these had been achieved with data structures using $nH_0(s)+o(n\lg\sigma)$
bits. On highly compressible sequences the $o(n\lg\sigma)$ bits of the
redundancy may be significant compared to the $nH_0(s)$ bits that encode
the data. Our representation, instead, compresses the redundancy as well.
Moreover, our average-case complexity is unprecedented. Our technique is based
on partitioning the alphabet into characters of similar frequency. The
subsequence corresponding to each group can then be encoded using fast
uncompressed representations without harming the overall compression ratios,
even in the redundancy. The result also improves upon the best current
compressed representations of several other data structures. For example, we
achieve (i) compressed redundancy, retaining the best time complexities, for
the smallest existing full-text self-indexes; (ii) compressed permutations
$\pi$ with times for $\pi()$ and $\pi^{-1}()$ improved to loglogarithmic; and
(iii) the first compressed representation of dynamic collections of disjoint
sets. We also point out various applications to inverted indexes, suffix
arrays, binary relations, and data compressors.
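The alphabet-partitioning idea can be sketched as follows. Symbols are grouped by frequency band, the sequence of group identifiers is kept, and each group gets its own subsequence. The grouping rule used here, floor(log2(n / count)), is an assumption made for illustration and may differ from the rule in the paper. The function names are hypothetical, and the rank inside `access` is computed naively where a real structure would use a fast rank index.

```python
import math
from collections import Counter

def partition_alphabet(seq):
    """Group symbols of similar frequency; return (group ids, subsequences)."""
    n = len(seq)
    counts = Counter(seq)
    # Assumed rule: symbols in the same power-of-two frequency band share a group.
    group_of = {c: int(math.log2(n / k)) for c, k in counts.items()}
    classes = [group_of[c] for c in seq]
    subseqs = {}
    for c in seq:
        subseqs.setdefault(group_of[c], []).append(c)
    return classes, subseqs

def access(classes, subseqs, i):
    """Recover the i-th symbol (0-based) of the original sequence."""
    g = classes[i]
    # Naive rank: number of earlier positions belonging to the same group.
    r = classes[:i].count(g)
    return subseqs[g][r]
```

On "abracadabra", for instance, the frequent symbol "a" ends up alone in its group, while the rare symbols "c" and "d" share one, so each subsequence only has to distinguish symbols of comparable frequency.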
Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS'09)
The Symposium on Theoretical Aspects of Computer Science (STACS) is held alternately in France and in Germany. The conference of February 26-28, 2009, held in Freiburg, is the 26th in this series. Previous meetings took place in Paris (1984), Saarbrücken (1985), Orsay (1986), Passau (1987), Bordeaux (1988), Paderborn (1989), Rouen (1990), Hamburg (1991), Cachan (1992), Würzburg (1993), Caen (1994), München (1995), Grenoble (1996), Lübeck (1997), Paris (1998), Trier (1999), Lille (2000), Dresden (2001), Antibes (2002), Berlin (2003), Montpellier (2004), Stuttgart (2005), Marseille (2006), Aachen (2007), and Bordeaux (2008).
Space-Efficient Data Structures for Information Retrieval
The amount of data that people and companies store has grown exponentially over the last few years. Storing this information alone is not enough, because in order to make it useful we need to be able to efficiently search inside it.
Furthermore, it is highly valuable to keep the historic data of each document stored, making it possible not only to access and search the newest version but also to search the whole history of the documents.
Grammar-based compression has proven to be very effective for repetitive data, which is the case for versioned documents. In this thesis we present several results on representing textual information and searching in it. In particular, we present text indexes for grammar-based compressed text that support searching for a pattern and extracting substrings of the input text. These are the first general indexes for grammar-based compressed text that support searching in sublinear time.
In order to build our indexes, we present new results on representing binary relations in a space-efficient manner, and construction algorithms that use little space to achieve their goal. These two results have a wide range of applications. In particular, the representations for binary relations can be used as a building block for several structures in computer science, such as graphs, inverted indexes, etc.
Finally, we present a new index, based on grammar-based compression, to solve the document listing problem. This problem deals with representing a collection of texts and searching for the documents that contain a given pattern. Although it resembles the classical text indexing problem, it has proven to be a challenge when we want to pay time proportional to the size of the result rather than to the number of occurrences. Our proposal is designed particularly for versioned text, allowing the storage of a collection of documents with all their historic versions in little space. This is currently the smallest structure for such a purpose in practice.
Statistical and repetition-based compressed data structures
[Abstract]
In this thesis we present several practical compressed data structures that address
open problems related to statistically-compressible and highly repetitive databases.
In the first part, we focus on statistical-based compressed data structures,
targeting the problem of managing large alphabets. This problem arises when
typical sequence-based compression is used as a basis for compressed data structures
representing more general structures like grids and graphs. Concretely, (a) we
provide space-efficient solutions to represent prefix-free codes when the alphabet
is large; (b) we also present a new wavelet-tree based data structure to solve rank
and select queries that obtains zero-order compression and outperforms previous
wavelet tree implementations on large alphabets.
In the second part of this thesis, we focus on highly repetitive datasets. We
present (c) a very space-efficient grammar-based compressed data structure to solve
rank and select in these scenarios; (d) the first LZ77-space bounded compressed
data structure that solves rank and select queries in O(1) time and is in practice
almost as fast as statistically-compressed structures; and (e) the first practical
version of grammar-compressed tree topologies, obtaining unprecedented results in
the representation of repetitive trees.
Additionally, we apply our new solutions to several problems of interest: point
grids, inverted indexes, self-indexes, XPath systems, and compressed suffix trees of
highly repetitive inputs, displaying various space-time tradeoffs of interest.
[Resumen]
In this thesis we present several practical compressed data structures, focused
on open problems related to statistically compressible databases and databases
whose content is highly repetitive.
In the first part, we focus on compressed data structures for statistically
compressible databases, more specifically on problems related to handling large
alphabets. Such problems arise when classical statistical compression techniques
are used in compressed data structures for sequences, and these in turn are
applied to problems such as the representation of point grids or graphs.
Concretely, (a) we present highly space-efficient solutions to represent
prefix-free codes when the alphabet is large; and (b) we also present a new
compressed data structure based on wavelet trees to solve rank and select
queries, which obtains zero-order compression and improves on previous wavelet
tree implementations on large alphabets.
In the second part of this thesis, we focus on highly repetitive databases. We
present (c) a grammar-based compressed data structure that solves rank and
select queries in this kind of setting and uses very little space; (d) the first
compressed data structure that obtains space proportional to that of an LZ77
compressor and solves rank and select queries in O(1) time, being in practice
almost as fast as data structures based on statistical compression; and (e) the
first practical data structure that uses grammars to compress tree topologies,
obtaining unprecedented results for the representation of repetitive trees.
Additionally, we show several applications in which the data structures proposed
throughout the thesis prove useful, ranging from representations of point grids,
inverted indexes, self-indexes and XPath systems to compressed suffix trees for
highly repetitive collections, showing various results of interest in terms of
both time and space.
[Resumo]
In this thesis we present several practical compressed data structures, focused
on open problems in the area of statistically compressible databases and highly
repetitive databases.
In the first part of the thesis, we focus on compressed data structures for
statistically compressible databases, more specifically on problems related to
handling large alphabets. Such problems arise when statistical compression
techniques are used in compressed data structures for sequences, and these in
turn are used for applications such as the representation of point grids or the
representation of graphs. Concretely, (a) we present solutions that are very
space-efficient for representing prefix-free codes when the alphabet is large;
and (b) we also present a new compressed data structure based on wavelet trees
to solve rank and select queries, which obtains zero-order compression and
improves on previous wavelet tree implementations for large alphabets.
In the second part of the thesis, we focus on databases with highly repetitive
content. We present (c) a grammar-based compressed data structure that uses very
little space and efficiently solves rank and select queries in this kind of
repetitive setting; (d) the first compressed data structure that obtains space
proportional to that obtained by an LZ77 compressor and solves rank and select
queries in O(1) time, being in practice as fast as data structures based on
statistical compression; and (e) the first practical data structure that uses
grammars to compress tree topologies, obtaining unprecedented results for the
representation of repetitive trees.
Additionally, we show several applications in which the data structures proposed
throughout the thesis prove useful: representations of point grids, inverted
indexes, self-indexes, XPath systems and compressed suffix trees for highly
repetitive collections, showing various results of interest in terms of both
space and time.