
On Statistical Data Compression

Author: Mattern Christopher
Publication date: 17/02/2016

As modern technology continues to advance, the amount of data to be processed grows. This data has to be managed, transmitted and stored, and data compression is indispensable for that. Measured by empirical compression rates, statistical data compression algorithms rank among the best. These algorithms process an input text letter by letter, handling each letter in two phases: modeling and coding. During modeling, a model estimates a probability distribution for the next letter based on the text seen so far. An encoder turns the distribution and the letter into a codeword. Conversely, the decoder recovers the coded letter from the distribution and the codeword. The choice of model determines the statistical compression algorithm, so the model is of central importance. A model typically mixes many simple probability estimators. In statistical data compression, theory and practice drift apart. Theoreticians value models that admit mathematical analysis but neglect running time, memory requirements and empirical improvements; practitioners take the opposite approach. The PAQ algorithms have demonstrated the superiority of the practical approach. This thesis aims to bring theory and practice closer together. To that end, the theoretician's tool, code length analysis, is applied to the practitioner's algorithms. Probability estimators based on aged relative frequencies and on exponentially smoothed probabilities are analyzed. Further analyses cover methods that mix distributions by weighted arithmetic and geometric averaging and determine the weights by gradient descent. The analyses show that the methods considered perform about as well as idealized reference methods. This thesis extends methods from PAQ and gives them a theoretical basis. Experiments support the analytical results. A further contribution of this thesis is Context Tree Mixing (CTM), a generalization of Context Tree Weighting (CTW). Combining CTM with methods from PAQ yields a theoretically well-founded compression algorithm that compresses better than CTW in experiments.

The ongoing evolution of hardware leads to a steady increase in the amount of data that is processed, transmitted and stored. Data compression is an essential tool to keep the amount of data manageable. In terms of empirical performance, statistical data compression algorithms rank among the best. A statistical data compressor processes an input text letter by letter and compresses in two stages: modeling and coding. During modeling, a model estimates a probability distribution on the next letter based on the past input. During coding, an encoder translates this distribution and the next letter into a codeword. Decoding reverses this process. The model is exchangeable, and its choice determines a statistical data compression algorithm. All major models use a mixer to combine multiple simple probability estimators, so-called elementary models. In statistical data compression there is a gap between theory and practice. On the one hand, theoreticians put emphasis on models that allow for a mathematical analysis, but neglect running time and space considerations and empirical improvements. On the other hand, practitioners focus on the exact opposite. The family of PAQ statistical compressors demonstrated the superiority of the practitioner's approach in terms of empirical compression. With this thesis we attempt to bridge the aforementioned gap between theory and practice, with special focus on PAQ. To achieve this we apply the theoretician's tools to the practitioner's approaches: we provide a code length analysis for several practical modeling and mixing techniques. The analysis covers modeling by relative frequencies with frequency discount and modeling by exponential smoothing of probabilities. For mixing we consider linear and geometric weighted averaging of probabilities with Online Gradient Descent for weight estimation. Our results show that the models and mixers we consider perform nearly as well as idealized competitors. Experiments support our analysis. Moreover, our results add a theoretical basis to modeling and mixing from PAQ and generalize methods from PAQ. Ultimately, we propose and analyze Context Tree Mixing (CTM), a generalization of Context Tree Weighting (CTW). We couple CTM with modeling and mixing techniques from PAQ and obtain a theoretically sound compression algorithm that improves over CTW, as shown in experiments.
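The two-stage scheme described in this abstract can be illustrated with a minimal sketch for a binary alphabet. It is not the thesis's algorithm: the class and function names, the smoothing rates, the step size, and the lack of any projection step for the weights are all assumptions chosen for illustration. Two elementary models estimate the probability of the next bit by exponential smoothing, a mixer combines them by geometric weighted averaging (which for bits reduces to a logistic function of weighted log-odds), the weights follow online gradient descent on the code length, and the coder is replaced by its ideal cost of -log2(p) bits per bit.

```python
import math


def stretch(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def squash(x):
    """Inverse of stretch (logistic function)."""
    return 1.0 / (1.0 + math.exp(-x))


def clamp(p, eps=1e-6):
    """Keep probabilities away from 0 and 1 so code lengths stay finite."""
    return min(max(p, eps), 1.0 - eps)


class SmoothedBitModel:
    """Elementary model: exponentially smoothed estimate of P(next bit = 1)."""

    def __init__(self, rate):
        self.rate = rate  # hypothetical smoothing rate, not from the source
        self.p1 = 0.5

    def predict(self):
        return clamp(self.p1)

    def update(self, bit):
        self.p1 += self.rate * (bit - self.p1)


def ideal_code_length(bits, rates=(0.02, 0.2), step=0.02):
    """Total ideal code length in bits when coding `bits` with the mixed model."""
    models = [SmoothedBitModel(r) for r in rates]
    weights = [1.0 / len(models)] * len(models)
    total = 0.0
    for bit in bits:
        # Modeling: geometric mixing of the elementary predictions.
        xs = [stretch(m.predict()) for m in models]
        p1 = clamp(squash(sum(w * x for w, x in zip(weights, xs))))
        # Coding: an ideal coder spends -log2(probability of the actual bit).
        total += -math.log2(p1 if bit == 1 else 1.0 - p1)
        # Online gradient descent: d(code length in nats)/dw_i = (p1 - bit) * x_i.
        err = bit - p1
        weights = [w + step * err * x for w, x in zip(weights, xs)]
        for m in models:
            m.update(bit)
    return total


if __name__ == "__main__":
    skewed = ([1] * 9 + [0]) * 100  # 1000 bits, 90% ones
    print(f"{ideal_code_length(skewed):.1f} bits for {len(skewed)} input bits")
```

On a skewed input like the one in the usage lines, the total should come out well below one bit per input bit, which is the sense in which a better model gives better compression.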

Digitale Bibliothek Thüringen

Data Streams from the Low Frequency Instrument On-Board the Planck Satellite: Statistical Analysis and Compression Efficiency

Authors: Burigana C., Maino D., Maris M., Pasian F.
Publication venue: 'EDP Sciences'
Publication date: 05/05/2000

The expected data rate produced by the Low Frequency Instrument (LFI), planned to fly on the ESA Planck mission in 2007, is more than a factor of 8 larger than the bandwidth allowed by the spacecraft transmission system to download the LFI data. We discuss the application of lossless compression to Planck/LFI data streams in order to reduce the overall data flow. We perform both theoretical analysis and experimental tests using realistically simulated data streams in order to determine the statistical properties of the signal and the maximal compression rate allowed by several lossless compression algorithms. We study the influence of signal composition and of acquisition parameters on the compression rate Cr and develop a semiempirical formalism to account for it. The best performing compressor tested up to now is arithmetic compression of order 1, designed for optimizing the compression of white-noise-like signals, which allows an overall compression rate of 2.65 +/- 0.02. We find that this result is not improved by other lossless compressors, because the signal is almost entirely dominated by white noise. Lossless compression algorithms alone will not solve the bandwidth problem but need to be combined with other techniques.

Comment: May 3, 2000 release, 61 pages, 6 figures coded as eps, 9 tables (4 included as eps), LaTeX 2.09 + assms4.sty, style file included, submitted for publication in PASP May 3, 200
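One way to see why a white-noise-dominated stream caps the achievable lossless compression rate is to compare the stored word size with the empirical entropy of the samples. The sketch below is not the paper's semiempirical formalism and does not implement order-1 arithmetic coding; it only computes an order-0 entropy limit on simulated data. The noise level (10 quantization steps), the 16-bit word size and the function names are hypothetical values chosen for illustration.

```python
import math
import random
from collections import Counter


def empirical_entropy_bits(samples):
    """Order-0 empirical entropy of the samples, in bits per sample."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def ideal_compression_rate(samples, bits_per_sample):
    """Stored bits per sample divided by the entropy-limited bits per sample."""
    return bits_per_sample / empirical_entropy_bits(samples)


# Hypothetical acquisition parameters, not taken from the paper.
SIGMA_IN_QUANTIZATION_STEPS = 10.0
BITS_PER_SAMPLE = 16

rng = random.Random(0)
stream = [round(rng.gauss(0.0, SIGMA_IN_QUANTIZATION_STEPS)) for _ in range(20000)]
cr = ideal_compression_rate(stream, BITS_PER_SAMPLE)
print(f"entropy-limited compression rate: {cr:.2f}")
```

For quantized Gaussian noise the entropy grows roughly as the logarithm of the noise level in quantization steps, so under these assumptions the limit depends on acquisition parameters rather than on the choice of compressor.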

arXiv.org e-Print Archive

Crossref

EDP Sciences OAI-PMH repository (1.2.0)

CERN Document Server

Compression and Conditional Emulation of Climate Model Output

Authors: Guinness Joseph, Hammerling Dorit
Publication venue: 'Informa UK Limited'
Publication date: 30/10/2017

Numerical climate model simulations run at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important to develop methods for representing the full datasets by smaller compressed versions. We propose a statistical compression and decompression algorithm based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full dataset given the summary statistics. The statistical model can be used to generate realizations representing the full dataset, along with characterizations of the uncertainties in the generated data. Thus, the methods are capable of both compression and conditional emulation of the climate models. Considerable attention is paid to accurately modeling the original dataset (one year of daily mean temperature data), particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured, while allowing for fast decompression and conditional emulation on modest computers.
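The idea of storing summary statistics and then drawing realizations conditional on them can be sketched in a deliberately simplified form. This is an assumption-laden toy, not the authors' model: it treats a one-dimensional series, uses block means as the stored statistics, and assumes independent Gaussian variation within each block, whereas the paper models spatially nonstationary global fields. The function names and the block length are hypothetical.

```python
import random
import statistics


def compress(values, block):
    """Store block means plus one pooled within-block residual spread."""
    blocks = [values[i:i + block] for i in range(0, len(values), block)]
    means = [statistics.fmean(b) for b in blocks]
    residuals = [v - m for b, m in zip(blocks, means) for v in b]
    return {
        "block": block,
        "n": len(values),
        "means": means,
        "sd": statistics.pstdev(residuals),
    }


def emulate(summary, rng):
    """Draw one realization whose block means equal the stored means."""
    out = []
    remaining = summary["n"]
    for mean in summary["means"]:
        size = min(summary["block"], remaining)
        noise = [rng.gauss(0.0, summary["sd"]) for _ in range(size)]
        center = statistics.fmean(noise)
        # Recentering the noise conditions the draw on the stored block mean.
        out.extend(mean + z - center for z in noise)
        remaining -= size
    return out


rng = random.Random(1)
daily_temps = [rng.gauss(15.0, 3.0) for _ in range(365)]  # synthetic stand-in data
summary = compress(daily_temps, block=30)
realization = emulate(summary, rng)
print(len(summary["means"]), "stored means for", len(realization), "emulated values")
```

Repeated calls to `emulate` give different realizations that all agree with the stored statistics, and their spread is a simple stand-in for the uncertainty characterization the abstract mentions.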

arXiv.org e-Print Archive

FigShare