Space-Efficient Re-Pair Compression
Re-Pair is an effective grammar-based compression scheme achieving strong
compression rates in practice. Let $n$, $\sigma$, and $d$ be the text length,
alphabet size, and dictionary size of the final grammar, respectively. In their
original paper, the authors show how to compute the Re-Pair grammar in expected
linear time and $5n + 4\sigma^2 + 4d + \sqrt{n}$ words of working space on top
of the text. In this work, we propose two algorithms improving on the space of
their original solution. Our model assumes a memory word of $\lceil \log_2 n \rceil$ bits and a re-writable input text composed of $n$ such words. Our
first algorithm runs in expected $O(n/\epsilon)$ time and uses
$(1+\epsilon)n + \sqrt{n}$ words of space on top of the text for any parameter
$0 < \epsilon \leq 1$ chosen in advance. Our second algorithm runs in expected
$O(n \log n)$ time and improves the space to $n + \sqrt{n}$ words.
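To make the Re-Pair scheme concrete, the following is a minimal sketch of the basic idea: repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal. It is a naive quadratic-time illustration, not the linear-time algorithm of the original paper nor the space-efficient algorithms proposed here; the function names `repair` and `expand` and the `("R", k)` encoding of nonterminals are assumptions made for this example.

```python
from collections import Counter


def repair(text):
    """Naive Re-Pair sketch: returns (final sequence, grammar rules)."""
    seq = list(text)
    rules = {}      # nonterminal -> (left symbol, right symbol)
    next_id = 0
    while True:
        # Count adjacent pairs. Overlapping occurrences (e.g. in "aaa")
        # are counted too, which a careful implementation would avoid.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = max(pairs.items(), key=lambda kv: kv[1])
        if freq < 2:
            break
        sym = ("R", next_id)   # hypothetical encoding of a new nonterminal
        next_id += 1
        rules[sym] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Decompress by expanding nonterminals with an explicit stack."""
    out = []
    stack = list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)
```

Calling `repair("abracadabra")` yields a shorter sequence plus a small rule dictionary, and `expand` restores the original text from them.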
A Grammar Compression Algorithm based on Induced Suffix Sorting
We introduce GCIS, a grammar compression algorithm based on the induced
suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution
builds on the factorization performed by SAIS during suffix sorting. We
construct a context-free grammar on the input string which can be further
reduced into a shorter string by substituting each substring with its
corresponding factor. The resulting grammar is encoded by exploiting some
redundancies, such as common prefixes between suffix rules, which are sorted
according to the SAIS framework. When compared to well-known compression tools such
as Re-Pair and 7-zip, our algorithm is competitive and very effective at
handling repetitive strings in terms of compression ratio and compression and
decompression running time.
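As a rough illustration of the kind of factorization SAIS performs, the sketch below classifies suffixes as S-type or L-type, cuts the text at the leftmost S-type (LMS) positions, and replaces each factor by a name to obtain a shorter string. It is a simplified assumption of how one grammar level could be built, not the GCIS implementation: the factors here do not overlap, names are assigned by plain lexicographic order, and the function names are hypothetical.

```python
def lms_positions(text):
    """Return the LMS positions of text (a virtual sentinel ends the text)."""
    n = len(text)
    if n == 0:
        return []
    # is_s[i] is True when suffix i is S-type (smaller than suffix i + 1).
    is_s = [False] * n   # the last character is L-type because of the sentinel
    for i in range(n - 2, -1, -1):
        if text[i] < text[i + 1]:
            is_s[i] = True
        elif text[i] == text[i + 1]:
            is_s[i] = is_s[i + 1]
    # An LMS position is an S-type position preceded by an L-type one.
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]


def factorize(text):
    """Cut text at LMS positions; return (reduced string, rules)."""
    cuts = [0] + lms_positions(text) + [len(text)]
    factors = [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]
    # Equal factors share one name (simplified: lexicographic order).
    names = {f: k for k, f in enumerate(sorted(set(factors)))}
    reduced = [names[f] for f in factors]
    rules = {k: f for f, k in names.items()}
    return reduced, rules


def expand(reduced, rules):
    """Rebuild the text by concatenating the factor behind each name."""
    return "".join(rules[k] for k in reduced)
```

On a repetitive input such as `"abcabcabc"` the reduced string consists of a single repeated name, and the same step could in principle be applied again to the reduced string.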
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
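The sketch below illustrates why differential inverted lists compress well in a versioned collection: a term that persists across consecutive versions produces long runs of gap 1, which run-length encoding collapses. This is only a toy version of the run-length variant under the assumption of sorted integer document identifiers; the function names are hypothetical and the Lempel-Ziv and grammar-based variants are not shown.

```python
def to_gaps(postings):
    """Differential (gap) encoding of a sorted list of document ids."""
    gaps, prev = [], 0
    for p in postings:
        gaps.append(p - prev)
        prev = p
    return gaps


def run_length(gaps):
    """Collapse repeated gaps into (gap, count) runs."""
    runs = []
    for g in gaps:
        if runs and runs[-1][0] == g:
            runs[-1][1] += 1
        else:
            runs.append([g, 1])
    return [tuple(r) for r in runs]


def decode(runs):
    """Recover the original postings from the (gap, count) runs."""
    postings, cur = [], 0
    for gap, count in runs:
        for _ in range(count):
            cur += gap
            postings.append(cur)
    return postings
```

For instance, the postings `[3, 4, 5, 6, 7, 20, 21, 22]` become the gaps `[3, 1, 1, 1, 1, 13, 1, 1]`, which in turn reduce to four runs.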
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection
operations is an active research topic. Most compression schemes rely on
encoding differences between consecutive positions with techniques that favor
small numbers. In this paper we explore a completely different alternative: We
use Re-Pair compression of those differences. While Re-Pair by itself offers
fast decompression at arbitrary positions in main and secondary memory, we
introduce variants that in addition speed up the operations required for
inverted list intersection. We compare the resulting data structures with
several recent proposals under various list intersection algorithms, to
conclude that our Re-Pair variants offer an interesting time/space tradeoff for
this problem, yet further improvements are required for it to improve upon the
state of the art.
GraCT: A Grammar based Compressed representation of Trajectories
We present a compressed data structure to store free trajectories of moving
objects (ships over the sea, for example) allowing spatio-temporal queries. Our
method, GraCT, uses a $k^2$-tree to store the absolute positions of all objects
at regular time intervals (snapshots), whereas the positions between snapshots
are represented as logs of relative movements compressed with Re-Pair. Our
experimental evaluation shows important savings in space and time with respect
to a fair baseline.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
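The snapshot-plus-log layout described for GraCT can be pictured with the following sketch for a single object: absolute positions are kept at regular intervals, and a position at any other instant is recovered by replaying relative movements from the preceding snapshot. Plain Python containers stand in for the compressed structures, so neither the tree used for snapshots nor the Re-Pair compression of the logs is modeled; the names `build` and `position_at` and the `period` parameter are assumptions for illustration.

```python
def build(trajectory, period):
    """Split a list of (x, y) positions into snapshots and relative moves."""
    # Absolute position every `period` time instants.
    snapshots = {t: trajectory[t] for t in range(0, len(trajectory), period)}
    # deltas[i] is the movement from instant i to instant i + 1.
    deltas = [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])
    ]
    return snapshots, deltas


def position_at(snapshots, deltas, period, t):
    """Recover the position at instant t from the preceding snapshot."""
    base = (t // period) * period
    x, y = snapshots[base]
    for dx, dy in deltas[base:t]:
        x += dx
        y += dy
    return (x, y)
```

A larger `period` stores fewer snapshots but replays more movements per query, which is the space/time tradeoff this kind of layout exposes.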
Occam's Quantum Strop: Synchronizing and Compressing Classical Cryptic Processes via a Quantum Channel
A stochastic process's statistical complexity stands out as a fundamental
property: the minimum information required to synchronize one process generator
to another. How much information is required, though, when synchronizing over a
quantum channel? Recent work demonstrated that representing causal similarity
as quantum state-indistinguishability provides a quantum advantage. We
generalize this to synchronization and offer a sequence of constructions that
exploit extended causal structures, finding substantial increase of the quantum
advantage. We demonstrate that maximum compression is determined by the
process's cryptic order---a classical, topological property closely allied to
Markov order, itself a measure of historical dependence. We introduce an
efficient algorithm that computes the quantum advantage and close by noting that
the advantage comes at a cost---one trades off prediction for generation
complexity.
Comment: 10 pages, 6 figures;
http://csc.ucdavis.edu/~cmg/compmech/pubs/oqs.ht