292 research outputs found
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory source
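To make the "reversible sequence transformation" above concrete, here is a minimal sketch of the textbook BWT and its inverse. It is not taken from the paper: the function names and the `$` end-of-string sentinel are assumptions made for illustration, and the quadratic-time rotation sort is chosen for readability rather than efficiency.

```python
def bwt(text: str, eos: str = "$") -> str:
    """Naive Burrows-Wheeler transform (illustrative only).

    `eos` is a hypothetical sentinel that must not occur in `text`;
    it makes the transform invertible without storing an index.
    """
    if eos in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + eos
    # Sort all cyclic rotations and keep the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last_column: str, eos: str = "$") -> str:
    """Invert `bwt` by repeatedly prepending the last column and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

The output groups equal symbols into runs (here `nn` and `aa`), which is the ordering that a subsequent lossless code can try to exploit.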
A Preadapted Universal Switch Distribution for Testing Hilberg's Conjecture
Hilberg's conjecture about natural language states that the mutual
information between two adjacent long blocks of text grows like a power of the
block length. The exponent in this statement can be upper bounded using the
pointwise mutual information estimate computed for a carefully chosen code. The
bound is the better, the lower the compression rate is but there is a
requirement that the code be universal. So as to improve a received upper bound
for Hilberg's exponent, in this paper, we introduce two novel universal codes,
called the plain switch distribution and the preadapted switch distribution.
Generally speaking, switch distributions are certain mixtures of adaptive
Markov chains of varying orders with some additional communication to avoid so
called catch-up phenomenon. The advantage of these distributions is that they
both achieve a low compression rate and are guaranteed to be universal. Using
the switch distributions we obtain that a sample of a text in English is
non-Markovian with Hilberg's exponent being , which improves over the
previous bound obtained using the Lempel-Ziv code. Comment: 17 pages, 3 figure
Universal Codes as a Basis for Time Series Testing
We suggest a new approach to hypothesis testing for ergodic and stationary
processes. In contrast to standard methods, the suggested approach gives a
possibility to make tests, based on any lossless data compression method even
if the distribution law of the codeword lengths is not known. We apply this
approach to the following four problems: goodness-of-fit testing (or identity
testing), testing for independence, testing of serial independence and
homogeneity testing and suggest nonparametric statistical tests for these
problems. It is important to note that practically used so-called archivers can
be used for suggested testing. Comment: accepted for "Statistical Methodology" (Elsevier
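As an illustration of how an off-the-shelf archiver can drive a goodness-of-fit test, the sketch below rejects the null hypothesis when the compressor beats the null model's codelength by more than log2(1/alpha) bits. This construction is an assumption made here for illustration, since the abstract does not spell out the test statistic. For a prefix-free code, the Kraft inequality bounds the Type I error of such a rule by alpha. The null hypothesis (i.i.d. uniform bytes), the use of `zlib`, and all function names are likewise illustrative choices.

```python
import math
import random
import zlib


def codelength_bits(data: bytes) -> int:
    """Length in bits of the zlib codeword for `data` (stand-in for any archiver)."""
    return 8 * len(zlib.compress(data, 9))


def reject_uniform_bytes(data: bytes, alpha: float = 0.01) -> bool:
    """Test H0: bytes are i.i.d. uniform on 0..255.

    Under H0 every sequence of n bytes has probability 2**(-8n), so
    -log2 P(data) = 8n bits. H0 is rejected when the compressor saves
    more than log2(1/alpha) bits relative to that.
    """
    null_bits = 8 * len(data)
    return null_bits - codelength_bits(data) > math.log2(1.0 / alpha)


rng = random.Random(0)  # fixed seed so the example is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(2000))
periodic = b"ab" * 1000

print(reject_uniform_bytes(periodic))  # highly compressible input
print(reject_uniform_bytes(noise))     # pseudo-random input
```

Any other lossless compressor could replace `zlib` in `codelength_bits`; a better compressor makes the test more powerful without changing the bound on the error level.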
Mismatched codebooks and the role of entropy-coding in lossy data compression
We introduce a universal quantization scheme based on random coding, and we
analyze its performance. This scheme consists of a source-independent random
codebook (typically _mismatched_ to the source distribution), followed by
optimal entropy-coding that is _matched_ to the quantized codeword distribution.
A single-letter formula is derived for the rate achieved by this scheme at a
given distortion, in the limit of large codebook dimension. The rate reduction
due to entropy-coding is quantified, and it is shown that it can be arbitrarily
large. In the special case of "almost uniform" codebooks (e.g., an i.i.d.
Gaussian codebook with large variance) and difference distortion measures, a
novel connection is drawn between the compression achieved by the present
scheme and the performance of "universal" entropy-coded dithered lattice
quantizers. This connection generalizes the "half-a-bit" bound on the
redundancy of dithered lattice quantizers. Moreover, it demonstrates a strong
notion of universality where a single "almost uniform" codebook is near-optimal
for _any_ source and _any_ difference distortion measure. Comment: 35 pages, 37 references, no figures. Submitted to IEEE Transactions
on Information Theor
Unequal Message Protection: Asymptotic and Non-Asymptotic Tradeoffs
We study a form of unequal error protection that we term "unequal message
protection" (UMP). The message set of a UMP code is a union of disjoint
message classes. Each class has its own error protection requirement, with some
classes needing better error protection than others. We analyze the tradeoff
between rates of message classes and the levels of error protection of these
codes. We demonstrate that there is a clear performance loss compared to
homogeneous (classical) codes with equivalent parameters. This is in sharp
contrast to previous literature that considers UMP codes. To obtain our results
we generalize finite block length achievability and converse bounds due to
Polyanskiy-Poor-Verdú. We evaluate our bounds for the binary symmetric and
binary erasure channels, and analyze the asymptotic characteristic of the
bounds in the fixed error and moderate deviations regimes. In addition, we
consider two questions related to the practical construction of UMP codes.
First, we study a "header" construction that prefixes the message class into a
header followed by data protection using a standard homogeneous code. We show
that, in general, this construction is not optimal at finite block lengths. We
further demonstrate that our main UMP achievability bound can be obtained using
coset codes, which suggests a path to implementation of tractable UMP codes
An fpga-based loco-ans implementation for lossless and near-lossless image compression using high-level synthesis
In this work, we present and evaluate a hardware architecture for the LOCO-ANS (Low Complexity Lossless Compression with Asymmetric Numeral Systems) lossless and near-lossless image compressor, which is based on JPEG-LS standard. The design is implemented in two FPGA generations, evaluating its performance for different codec configurations. The tests show that the design is capable of up to 40.5 MPixels/s and 124 MPixels/s per lane for Zynq 7020 and UltraScale+ FPGAs, respectively. Compared to the single thread LOCO-ANS software implementation running in a 1.2 GHz Raspberry Pi 3B, each hardware lane achieves 6.5 times higher throughput, even when implemented in an older and cost-optimized chip like the Zynq 7020. Results are also presented for a lossless only version, which achieves a lower footprint and approximately 50% higher performance than the version that supports both lossless and near-lossless. Interestingly, these great results were obtained applying High-Level Synthesis, describing the coder with C++ code, which tends to establish a trade-off between design time and quality of results. These results show that the algorithm is very suitable for hardware implementation. Moreover, the implemented system is faster and achieves higher compression than the best previously available near-lossless JPEG-LS hardware implementation. This research was funded in part by the Spanish Research Agency under the project AgileMon (AEI PID2019-104451RB-C21
Multiple Description Quantization via Gram-Schmidt Orthogonalization
The multiple description (MD) problem has received considerable attention as
a model of information transmission over unreliable channels. A general
framework for designing efficient multiple description quantization schemes is
proposed in this paper. We provide a systematic treatment of the El Gamal-Cover
(EGC) achievable MD rate-distortion region, and show that any point in the EGC
region can be achieved via a successive quantization scheme along with
quantization splitting. For the quadratic Gaussian case, the proposed scheme
has an intrinsic connection with the Gram-Schmidt orthogonalization, which
implies that the whole Gaussian MD rate-distortion region is achievable with a
sequential dithered lattice-based quantization scheme as the dimension of the
(optimal) lattice quantizers becomes large. Moreover, this scheme is shown to
be universal for all i.i.d. smooth sources with performance no worse than that
for an i.i.d. Gaussian source with the same variance and asymptotically optimal
at high resolution. A class of low-complexity MD scalar quantizers in the
proposed general framework also is constructed and is illustrated
geometrically; the performance is analyzed in the high resolution regime, which
exhibits a noticeable improvement over the existing MD scalar quantization
schemes. Comment: 48 pages; submitted to IEEE Transactions on Information Theor
Techniques of design optimisation for algorithms implemented in software
The overarching objective of this thesis was to develop tools for parallelising, optimising,
and implementing algorithms on parallel architectures, in particular General Purpose
Graphics Processors (GPGPUs). Two projects were chosen from different application areas
in which GPGPUs are used: a defence application involving image compression, and a
modelling application in bioinformatics (computational immunology). Each project had its
own specific objectives, as well as supporting the overall research goal.
The defence / image compression project was carried out in collaboration with the Jet
Propulsion Laboratory. The specific questions were: to what extent an algorithm designed
for bit-serial hardware implementation of lossless hyperspectral image compression on board unmanned
vehicles (UAVs) could be parallelised, whether GPGPUs could be used to
implement that algorithm, and whether a software implementation with or without GPGPU
acceleration could match the throughput of a dedicated hardware (FPGA) implementation.
The dependencies within the algorithm were analysed, and the algorithm parallelised. The
algorithm was implemented in software for GPGPU, and optimised. During the optimisation
process, profiling revealed less than optimal device utilisation, but no further optimisations
resulted in an improvement in speed. The design had hit a local-maximum of performance.
Analysis of the arithmetic intensity and data-flow exposed flaws in the standard optimisation
metric of kernel occupancy used for GPU optimisation. Redesigning the implementation
with revised criteria (fused kernels, lower occupancy, and greater data locality) led to a new
implementation with 10x higher throughput. GPGPUs were shown to be viable for on-board
implementation of the CCSDS lossless hyperspectral image compression algorithm,
exceeding the performance of the hardware reference implementation, and providing
sufficient throughput for the next generation of image sensor as well.
The second project was carried out in collaboration with biologists at the University of
Arizona and involved modelling a complex biological system: VDJ recombination involved
in the formation of T-cell receptors (TCRs). Generation of immune receptors (T cell receptor
and antibodies) by VDJ recombination is an enormously complex process, which can
theoretically synthesize greater than 10^18 variants. Originally thought to be a random
process, the underlying mechanisms clearly have a non-random nature that preferentially
creates a small subset of immune receptors in many individuals. Understanding this bias is a
longstanding problem in the field of immunology. Modelling the process of VDJ
recombination to determine the number of ways each immune receptor can be synthesized,
previously thought to be untenable, is a key first step in determining how this special
population is made. The computational tools developed in this thesis have allowed
immunologists for the first time to comprehensively test and invalidate a longstanding theory
(convergent recombination) for how this special population is created, while generating the
data needed to develop novel hypothesis
High throughput image compression and decompression on GPUs
This work investigates possibilities to create a high throughput, GPU-friendly, intra-only, wavelet-based video compression algorithm optimized for visually lossless applications. Addressing the key observation that JPEG 2000's entropy coder is a bottleneck and might be overly complex for a high bit rate scenario, various algorithmic alterations are proposed. First, JPEG 2000's Selective Arithmetic Coding mode is realized on the GPU, but the gains in terms of an increased throughput are shown to be limited. Instead, two independent alterations not compliant with the standard are proposed, that (1) give up the concept of intra-bit plane truncation points and (2) introduce a true raw-coding mode that is fully parallelizable and does not require any context modeling. Next, an alternative block coder from the literature, the Bitplane Coder with Parallel Coefficient Processing (BPC-PaCo), is evaluated. Since it trades signal adaptiveness for increased parallelism, it is shown here how a stationary probability model averaged from a set of test sequences yields competitive compression efficiency. A combination of BPC-PaCo with the single-pass mode is proposed and shown to increase the speedup with respect to the original JPEG 2000 entropy coder from 2.15x (BPC-PaCo with two passes) to 2.6x (proposed BPC-PaCo with single-pass mode) at the marginal cost of increasing the PSNR penalty by 0.3 dB to at most 1 dB.
Furthermore, a parallel algorithm is presented that determines the optimal code block bit stream truncation points (given an available bit rate budget) and builds the entire code stream on the GPU, reducing the amount of data that has to be transferred back into host memory to a minimum. A theoretical runtime model is formulated that allows the runtime of a kernel on one GPU to be predicted from benchmarking results obtained on another GPU. Lastly, the first ever JPEG XS GPU-decoder realization is presented. JPEG XS was designed to be a low complexity codec, and its call for proposals was the first to explicitly demand GPU-friendliness. Starting at bit rates above 1 bpp, the decoder is around 2x faster compared to the original JPEG 2000 and 1.5x faster compared to JPEG 2000 with the fastest evaluated entropy coder (BPC-PaCo with single-pass mode). With a GeForce GTX 1080, a decoding throughput of around 200 fps is achieved for a UHD 4:4:4 sequence