New Algorithms and Lower Bounds for Sequential-Access Data Compression
This thesis concerns sequential-access data compression, i.e., compression by algorithms that read the input one or more times from beginning to end. In one chapter we
consider adaptive prefix coding, for which we must read the input character by
character, outputting each character's self-delimiting codeword before reading
the next one. We show how to encode and decode each character in constant
worst-case time while producing an encoding whose length is worst-case optimal.
In another chapter we consider one-pass compression with memory bounded in
terms of the alphabet size and context length, and prove a nearly tight
tradeoff between the amount of memory we can use and the quality of the
compression we can achieve. In a third chapter we consider compression in the
read/write streams model, which allows a number of passes and an amount of memory that are both polylogarithmic in the size of the input. We first show how to achieve
universal compression using only one pass over one stream. We then show that
one stream is not sufficient for achieving good grammar-based compression.
Finally, we show that two streams are necessary and sufficient for achieving
entropy-only bounds.
Comment: draft of PhD thesis.
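
To make the adaptive prefix coding setting concrete, here is a naive Python sketch. It is not the constant-worst-case-time algorithm of the thesis; the function names, the toy alphabet, and the uniform pseudo-count initialization are illustrative assumptions. Encoder and decoder keep identical symbol counts and rebuild a Huffman code before every character, so each self-delimiting codeword is output before the next character is read.

# A naive sketch of adaptive prefix coding: encoder and decoder keep
# identical symbol counts and rebuild a Huffman code before every
# character, so each codeword is emitted before the next character is
# read.  (Not the constant-time scheme of the thesis.)
import heapq
from itertools import count

def build_code(freqs):
    """Return {symbol: bitstring} for a Huffman code (needs >= 2 symbols)."""
    tiebreak = count()
    heap = [(f, next(tiebreak), s, None, None) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, s1, l1, r1 = heapq.heappop(heap)
        f2, _, s2, l2, r2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), None,
                              (f1, s1, l1, r1), (f2, s2, l2, r2)))
    code = {}
    def walk(node, prefix):
        _, s, left, right = node
        if s is not None:
            code[s] = prefix
        else:
            walk(left, prefix + "0")
            walk(right, prefix + "1")
    f, _, s, l, r = heap[0]
    walk((f, s, l, r), "")
    return code

def encode(text, alphabet):
    freqs = {c: 1 for c in alphabet}          # uniform pseudo-counts to start
    bits = []
    for c in text:
        bits.append(build_code(freqs)[c])     # emit the codeword, then update
        freqs[c] += 1
    return "".join(bits)

def decode(bits, n, alphabet):
    freqs = {c: 1 for c in alphabet}          # mirror the encoder's state
    out, i = [], 0
    for _ in range(n):
        inv = {v: k for k, v in build_code(freqs).items()}
        j = i + 1
        while bits[i:j] not in inv:           # prefix-free, so the first hit is right
            j += 1
        c = inv[bits[i:j]]
        out.append(c)
        freqs[c] += 1
        i = j
    return "".join(out)

if __name__ == "__main__":
    msg, alphabet = "abracadabra", "abcdr"
    assert decode(encode(msg, alphabet), len(msg), alphabet) == msg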
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
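
To make the analyzed pipeline concrete, the following quadratic-time Python sketch is illustrative only: the function names and sentinel handling are assumptions, and practical BWT compressors use suffix-array construction followed by an entropy coder. It applies the BWT and then move-to-front coding, whose small output values expose the structure that a simple zero-order code can compress well.

# A quadratic-time illustration of the BWT pipeline: transform, then
# move-to-front, whose small output values a zero-order coder handles well.

def bwt(s, eof="\0"):
    """Burrows-Wheeler transform via sorted rotations; eof must not occur in s."""
    s += eof
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last, eof="\0"):
    """Invert the BWT by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(eof))[:-1]

def move_to_front(text, alphabet):
    """Replace each symbol by its current rank; recent symbols get small ranks."""
    symbols = list(alphabet)
    out = []
    for c in text:
        i = symbols.index(c)
        out.append(i)
        symbols.insert(0, symbols.pop(i))
    return out

if __name__ == "__main__":
    s = "banana_bandana"
    t = bwt(s)
    assert inverse_bwt(t) == s
    # Runs of equal characters in the BWT output become runs of zeros here.
    print(t, move_to_front(t, sorted(set(t))))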
Reordering Rows for Better Compression: Beyond the Lexicographic Order
Sorting database tables before compressing them improves the compression
rate. Can we do better than the lexicographical order? For minimizing the
number of runs in a run-length encoding compression scheme, the best approaches
to row-ordering are derived from traveling salesman heuristics, although there
is a significant trade-off between running time and compression. A new
heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades
off compression for a major running-time speedup, is a good option for very
large tables. However, for some compression schemes, it is more important to
generate long runs rather than few runs. For this case, another novel
heuristic, Vortex, is promising. We find that we can improve run-length
encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%:
these gains are on top of the gains due to lexicographically sorting the table.
In a few cases, we prove that the new row reordering is within 10% of optimal at minimizing the runs of identical values within columns.
Comment: to appear in ACM TODS.
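
As a toy illustration of why row order matters for run-length encoding, the Python sketch below is not the Multiple Lists or Vortex heuristic from the paper; the example table, the names, and the greedy Hamming-distance ordering are assumptions. It counts column-wise runs and compares the lexicographic order against a simple nearest-neighbor ordering.

# A toy comparison of row orderings for run-length encoding: total runs
# per column under the lexicographic order versus a greedy
# nearest-neighbor ordering that keeps similar rows adjacent.

def count_runs(table):
    """Total number of runs across all columns of a row-major table."""
    return sum(1 + sum(a != b for a, b in zip(col, col[1:]))
               for col in zip(*table))

def hamming(r1, r2):
    """Number of columns in which two rows differ."""
    return sum(a != b for a, b in zip(r1, r2))

def nearest_neighbor_order(rows):
    """Greedily append the remaining row that differs least from the last one."""
    remaining = list(rows)
    ordered = [remaining.pop(0)]
    while remaining:
        best = min(range(len(remaining)),
                   key=lambda i: hamming(ordered[-1], remaining[i]))
        ordered.append(remaining.pop(best))
    return ordered

if __name__ == "__main__":
    table = [("ca", "red", 1), ("us", "red", 1), ("ca", "blue", 2),
             ("us", "blue", 2), ("ca", "red", 2), ("us", "red", 2)]
    print("lexicographic order:   ", count_runs(sorted(table)))
    print("nearest-neighbor order:", count_runs(nearest_neighbor_order(sorted(table))))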
Revisiting nested group testing procedures: new results, comparisons, and robustness
Group testing has its origin in the identification of syphilis in the US army
during World War II. Much of the theoretical framework of group testing was
developed starting in the late 1950s, with continued work into the 1990s.
Recently, with the advent of new laboratory and genetic technologies, there has
been an increasing interest in group testing designs for cost saving purposes.
In this paper, we compare different nested designs, including Dorfman, Sterrett
and an optimal nested procedure obtained through dynamic programming. To
elucidate these comparisons, we develop closed-form expressions for the optimal
Sterrett procedure and provide a concise review of the prior literature for
other commonly used procedures. We consider designs where the prevalence of
disease is known, and we investigate the robustness of these procedures when
the prevalence is misspecified. This article provides a technical presentation
that will be of interest to researchers as well as being useful from a
pedagogical perspective.
Supplementary material for this article is available online.
Comment: Submitted for publication on May 3, 2016. Revised version.
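
As a small illustration of the kind of cost comparison involved, the sketch below is not code from the paper and the function names are assumptions; it uses the standard i.i.d. prevalence model. Under Dorfman's two-stage design with pool size k and prevalence p, the expected number of tests per person is 1/k + 1 - (1 - p)^k, which can be minimized over k.

# Expected tests per person under Dorfman's two-stage design: one pooled
# test per group of size k, plus k individual retests whenever the pool
# is positive (i.i.d. prevalence p).

def dorfman_tests_per_person(p, k):
    """1/k for the pooled test, plus P(pool positive) retests per person."""
    return 1 / k + 1 - (1 - p) ** k

def best_pool_size(p, max_k=100):
    """Pool size minimizing the expected number of tests per person."""
    return min(range(2, max_k + 1), key=lambda k: dorfman_tests_per_person(p, k))

if __name__ == "__main__":
    for p in (0.01, 0.05, 0.10):
        k = best_pool_size(p)
        print(f"prevalence {p:.2f}: pool size {k}, "
              f"{dorfman_tests_per_person(p, k):.3f} tests per person")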
Perfect Hash Function Generation on the GPU with RecSplit
Minimal perfect hash functions (MPHFs) map a static set S of arbitrary keys bijectively onto the set of the first |S| natural numbers, i.e., every hash value is used exactly once. They are useful in many applications, for example for implementing hash tables with guaranteed constant access time. MPHFs can be very compact: fewer than 2 bits per key are possible. On the other hand, MPHFs cannot decide whether a given key belongs to S. Currently, RecSplit is the most space-efficient MPHF. RecSplit offers various trade-offs between space consumption, construction time, and query time. For example, RecSplit can construct an MPHF with 1.56 bits per key in less than 2 ms per key. However, this is too slow for large inputs. This work presents new RecSplit implementations that exploit multithreading, SIMD, and the power of GPUs to improve construction time. Together with our new bijection-rotation approach, we achieve speedups by factors of up to 333 for our SIMD implementation on an 8-core CPU and up to 1873 for our GPU implementation compared with the original sequential RecSplit implementation. This allows us to construct MPHFs with 1.56 bits per key in less than 1.5 µs per key.
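
To illustrate the brute-force seed search that underlies such constructions, the following Python sketch is in the spirit of a leaf-level splitting search, not the RecSplit implementation itself; the bucket sizes, names, and hash function are assumptions, and no compact encoding of the seeds is attempted. It builds a minimal perfect hash by searching, per bucket, for a seed that maps the bucket's keys bijectively onto its slots.

# A toy minimal perfect hash: split the keys into small buckets, then
# brute-force a per-bucket seed whose hash maps the bucket's keys
# bijectively onto its slots (no compact seed encoding is attempted).
import hashlib

def h(key, seed, mod):
    """Hash a string key with an integer seed into range(mod)."""
    digest = hashlib.blake2b(key.encode(), salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % mod

def build_mphf(keys):
    bucket_count = max(1, len(keys) // 2)       # small buckets keep the search cheap
    buckets = [[] for _ in range(bucket_count)]
    for key in keys:
        buckets[h(key, 0, bucket_count)].append(key)
    seeds, offsets = [], [0]
    for bucket in buckets:
        seed = 1
        while len({h(k, seed, len(bucket)) for k in bucket}) != len(bucket):
            seed += 1                            # retry until the bucket is collision-free
        seeds.append(seed)
        offsets.append(offsets[-1] + len(bucket))
    return bucket_count, seeds, offsets

def query(mphf, key):
    """Only defined for keys that were in the construction set."""
    bucket_count, seeds, offsets = mphf
    b = h(key, 0, bucket_count)
    return offsets[b] + h(key, seeds[b], offsets[b + 1] - offsets[b])

if __name__ == "__main__":
    keys = [f"key{i}" for i in range(200)]
    mphf = build_mphf(keys)
    assert {query(mphf, k) for k in keys} == set(range(len(keys)))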