8 research outputs found
Simple Worst-Case Optimal Adaptive Prefix-Free Coding
Gagie and Nekrich (2009) gave an algorithm for adaptive prefix-free coding
that, given a string s over the alphabet {1, ..., σ} with σ suitably
bounded in terms of |s|, encodes s in at most (H + 1)|s| + o(|s|) bits,
where H is the empirical entropy of s, such that encoding and decoding take
O(1) worst-case time per character. They also proved their bound on the
encoding length is optimal, even when the empirical entropy is high. Their
algorithm is impractical, however, because it uses complicated data
structures. In this paper we give an algorithm with the same bounds, under a
somewhat different restriction on σ, that uses no data structures more
complicated than a lookup table. Moreover, when Gagie and Nekrich's algorithm
is used for optimal adaptive alphabetic coding its decoding is no longer
constant-time, but ours still takes O(1) worst-case time per character.
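To make the setting concrete, here is a minimal Python sketch of adaptive Shannon coding, my own illustration rather than the algorithm from either paper: before each character the encoder derives Shannon codeword lengths from the current counts, builds a canonical prefix-free code, emits the codeword, and then updates the counts. It rebuilds the code from scratch at every step and makes no attempt at the O(1) worst-case time or the lookup-table machinery that is these papers' actual contribution.

```python
# Minimal sketch of adaptive Shannon coding (illustrative only; NOT the
# paper's O(1)-time, lookup-table algorithm). Shannon lengths
# ceil(log2(total/count)) always satisfy Kraft's inequality, so a
# prefix-free canonical code with those lengths exists.
import math
from collections import Counter

def canonical_code(lengths):
    """Assign canonical codewords to a {symbol: length} map, shortest first."""
    code, next_code, prev_len = {}, 0, 0
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        next_code <<= ln - prev_len   # standard canonical-code assignment
        code[sym] = format(next_code, f"0{ln}b")
        next_code += 1
        prev_len = ln
    return code

def adaptive_shannon_encode(s, alphabet):
    counts = Counter({a: 1 for a in alphabet})   # pseudo-count 1 per symbol
    out = []
    for ch in s:
        total = sum(counts.values())
        lengths = {a: math.ceil(math.log2(total / counts[a])) for a in alphabet}
        out.append(canonical_code(lengths)[ch])  # emit current codeword for ch
        counts[ch] += 1                          # adapt: update after encoding
    return "".join(out)

print(adaptive_shannon_encode("abracadabra", "abcdr"))
```

Since the encoder updates its counts only after emitting each codeword, a decoder that applies the same update rule always rebuilds the same code, which is what makes the scheme self-delimiting and one-pass.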
New Algorithms and Lower Bounds for Sequential-Access Data Compression
This thesis concerns sequential-access data compression, i.e., compression by
algorithms that read the input one or more times from beginning to end. In one chapter we
consider adaptive prefix coding, for which we must read the input character by
character, outputting each character's self-delimiting codeword before reading
the next one. We show how to encode and decode each character in constant
worst-case time while producing an encoding whose length is worst-case optimal.
In another chapter we consider one-pass compression with memory bounded in
terms of the alphabet size and context length, and prove a nearly tight
tradeoff between the amount of memory we can use and the quality of the
compression we can achieve. In a third chapter we consider compression in the
read/write streams model, which allows a number of passes and an amount of memory both
polylogarithmic in the size of the input. We first show how to achieve
universal compression using only one pass over one stream. We then show that
one stream is not sufficient for achieving good grammar-based compression.
Finally, we show that two streams are necessary and sufficient for achieving
entropy-only bounds.
Comment: draft of PhD thesis
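The memory-bounded one-pass setting of the second chapter can be illustrated with a small sketch, again my own toy rather than the thesis's construction: an order-k adaptive model keeps one frequency table per context of length k, so its memory is bounded by a function of the alphabet size σ and the context length k (at most σ^k tables of σ counters). The sketch reports the idealized code length, the sum of -log2 of the predicted probabilities, rather than actual coded bits.

```python
# Hypothetical illustration of one-pass compression with memory bounded in
# terms of alphabet size and context length: an order-k adaptive model using
# at most sigma^k tables of sigma counters. Returns the idealized code
# length sum(-log2 p) instead of actual coded output.
import math
from collections import defaultdict

def one_pass_code_length(s, alphabet, k):
    tables = defaultdict(lambda: dict.fromkeys(alphabet, 1))  # per-context counts
    bits, ctx = 0.0, ""
    for ch in s:
        table = tables[ctx]                    # model conditioned on last k chars
        total = sum(table.values())
        bits += -math.log2(table[ch] / total)  # ideal cost of coding ch
        table[ch] += 1                         # one-pass adaptive update
        ctx = (ctx + ch)[-k:]                  # memory bounded by context length
    return bits

print(round(one_pass_code_length("mississippi", "imps", 1), 2))
```

The thesis's tradeoff concerns what any such bounded-memory one-pass algorithm can achieve; the toy only shows where the dependence on σ and k enters.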
Efficient Fully-Compressed Sequence Representations
We present a data structure that stores a sequence s[1..n] over alphabet
[1..σ] in nH_0(s) + o(n)(H_0(s) + 1) bits, where H_0(s) is the zero-order
entropy of s. This structure supports the queries access, rank and select,
which are fundamental building blocks for many other compressed data
structures, in worst-case time O(lg lg σ) and average time O(lg H_0(s)). The
worst-case complexity matches the best previous results, yet these had been
achieved with data structures using nH_0(s) + o(n lg σ) bits. On highly
compressible sequences the o(n lg σ) bits of the redundancy may be
significant compared to the nH_0(s) bits that encode the data. Our
representation, instead, compresses the redundancy as well. Moreover, our
average-case complexity is unprecedented. Our technique is based on
partitioning the alphabet into characters of similar frequency. The
subsequence corresponding to each group can then be encoded using fast
uncompressed representations without harming the overall compression ratios,
even in the redundancy. The result also improves upon the best current
compressed representations of several other data structures. For example, we
achieve compressed redundancy, retaining the best time complexities, for the
smallest existing full-text self-indexes; compressed permutations π with the
times for π() and π^{-1}() improved to loglogarithmic; and the first
compressed representation of dynamic collections of disjoint sets. We also
point out various applications to inverted indexes, suffix arrays, binary
relations, and data compressors.
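The partitioning idea admits a tiny, deliberately slow Python sketch, an illustrative toy and not the paper's structure: symbols are grouped into classes of similar frequency, the sequence is split into a class sequence plus one subsequence per class, and access and rank on s reduce to the same operations on those parts. The paper's contribution is representing each part so the reduction runs in O(lg lg σ) worst-case time within compressed space; here plain linear scans stand in for the succinct components.

```python
# Toy illustration of alphabet partitioning (linear scans stand in for the
# paper's succinct rank/select components). Symbols are grouped into classes
# of similar frequency; a query on s reduces to a query on the class
# sequence followed by a query inside one class's subsequence.
import math
from collections import Counter

class PartitionedSequence:
    def __init__(self, s):
        freq = Counter(s)
        order = sorted(freq, key=freq.get, reverse=True)
        # Class of a symbol = floor(log2(1 + rank by frequency)), so each
        # class holds symbols of similar frequency.
        self.cls = {a: int(math.log2(i + 1)) for i, a in enumerate(order)}
        self.class_seq = [self.cls[c] for c in s]        # compressible part
        self.subseqs = [[] for _ in range(max(self.class_seq) + 1)]
        for c in s:                                      # plain, "fast" parts
            self.subseqs[self.cls[c]].append(c)

    def access(self, i):                      # s[i]
        k = self.class_seq[i]
        j = self.class_seq[:i].count(k)       # rank of class k before i
        return self.subseqs[k][j]

    def rank(self, a, i):                     # occurrences of a in s[:i]
        k = self.cls[a]
        j = self.class_seq[:i].count(k)
        return self.subseqs[k][:j].count(a)

ps = PartitionedSequence("abracadabra")
assert ps.access(3) == "a" and ps.rank("a", 4) == 2
```

Because frequent symbols land in small classes, the class sequence carries most of the entropy while each subsequence has a small effective alphabet, which is why the subsequences can afford fast uncompressed representations without hurting the overall space bound.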
Worst-Case Optimal Adaptive Prefix Coding
A common complaint about adaptive prefix coding is that it is much slower than static prefix coding. Karpinski and Nekrich recently took an important step towards resolving this: they gave an adaptive Shannon coding algorithm that encodes each character in O(1) amortized time and decodes it in O(log H) amortized time, where H is the empirical entropy of the input string s. For comparison, Gagie's adaptive Shannon coder and both Knuth's and Vitter's adaptive Huffman coders all use Θ(H) amortized time for each character. In this paper we give an adaptive Shannon coder that both encodes and decodes each character in O(1) worst-case time. As with both previous adaptive Shannon coders, we store s in at most (H + 1)|s| + o(|s|) bits. We also show that this encoding length is worst-case optimal up to the lower-order term.
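The symmetry between encoder and decoder is what makes adaptive decoding possible at all: the decoder applies the same count update after each decoded character, so it rebuilds exactly the code the encoder used. Here is a matching toy decoder, reusing canonical_code and adaptive_shannon_encode from the sketch after the first abstract above, and again without the constant-worst-case-time machinery that is this paper's actual contribution.

```python
# Toy decoder matching the adaptive Shannon encoder sketched earlier
# (reuses canonical_code and adaptive_shannon_encode from that sketch).
# Encoder and decoder keep identical counts, so both rebuild the same
# canonical code before each character.
import math
from collections import Counter

def adaptive_shannon_decode(bits, n, alphabet):
    counts = Counter({a: 1 for a in alphabet})
    out, pos = [], 0
    for _ in range(n):
        total = sum(counts.values())
        lengths = {a: math.ceil(math.log2(total / counts[a])) for a in alphabet}
        decode = {v: k for k, v in canonical_code(lengths).items()}
        word = ""
        while word not in decode:     # prefix-freeness: exactly one match
            word += bits[pos]
            pos += 1
        out.append(decode[word])
        counts[decode[word]] += 1     # same update rule as the encoder
    return "".join(out)

coded = adaptive_shannon_encode("abracadabra", "abcdr")
assert adaptive_shannon_decode(coded, 11, "abcdr") == "abracadabra"
```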
LIPIcs, Volume 244, ESA 2022, Complete Volume