On Statistical Data Compression
The ongoing evolution of hardware leads to a steady increase in the amount
of data that is processed, transmitted and stored. Data compression is an
essential tool for keeping the amount of data manageable. In terms of empirical
performance, statistical data compression algorithms rank among the best. A
statistical data compressor processes an input text letter by letter and
compresses in two stages: modeling and coding. During modeling, a model
estimates a probability distribution on the next letter based on the past
input. During coding, an encoder translates this distribution and the next
letter into a codeword. Decoding reverses this process. The model is
exchangeable, and its choice determines a statistical data compression
algorithm. All major models use a mixer to combine multiple simple
probability estimators, so-called elementary models. In statistical data
compression there is a gap between theory and practice. On the one hand,
theoreticians emphasize models that allow for a mathematical
analysis but neglect running time, space considerations and empirical
improvements. On the other hand, practitioners focus on the very reverse. The
PAQ family of statistical compressors demonstrated the superiority of the
practitioner's approach in terms of empirical compression. With this thesis
we attempt to bridge this gap between theory and practice,
with special focus on PAQ. To achieve this, we apply the theoretician's tools
to the practitioner's approaches: we provide a code length analysis for several
practical modeling and mixing techniques. The analysis covers modeling by
relative frequencies with frequency discount and modeling by exponential
smoothing of probabilities. For mixing, we consider linearly and geometrically
weighted averaging of probabilities with Online Gradient Descent for weight
estimation. Our results show that the models and mixers we consider perform
nearly as well as idealized competitors. Experiments support our
analysis. Moreover, our results add a theoretical basis to modeling and
mixing from PAQ and generalize methods from PAQ. Finally, we propose and
analyze Context Tree Mixing (CTM), a generalization of Context Tree
Weighting (CTW). We couple CTM with modeling and mixing techniques from PAQ
and obtain a theoretically sound compression algorithm that improves over
CTW, as shown in experiments.
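The modeling stage described in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes a binary alphabet, two exponential-smoothing estimators, and geometric (logistic) mixing with a plain Online Gradient Descent weight update, without any projection of the weights. The names `code_length`, `rates`, `step` and the clamping constant are hypothetical choices made for illustration. The function returns the ideal code length, i.e. the sum of -log2 of the probability assigned to each bit that actually occurred, which is what an arithmetic coder would approximately achieve.

```python
import math


def stretch(p):
    """Logit of a probability."""
    return math.log(p / (1.0 - p))


def squash(x):
    """Inverse of stretch (logistic function)."""
    return 1.0 / (1.0 + math.exp(-x))


def clamp(p, eps=1e-4):
    """Keep probabilities away from 0 and 1 (eps is an assumed constant)."""
    return min(max(p, eps), 1.0 - eps)


def code_length(bits, rates=(0.02, 0.2), step=0.02):
    """Ideal code length in bits of a 0/1 sequence under a mixed model."""
    probs = [0.5] * len(rates)                  # each model's P(next bit = 1)
    weights = [1.0 / len(rates)] * len(rates)   # mixing weights
    total = 0.0
    for bit in bits:
        # Modeling: mix the elementary estimates in the logit domain.
        inputs = [stretch(clamp(p)) for p in probs]
        p_mix = clamp(squash(sum(w * x for w, x in zip(weights, inputs))))
        # Coding: charge the ideal cost of the bit that actually occurred.
        total += -math.log2(p_mix if bit == 1 else 1.0 - p_mix)
        # Online Gradient Descent on the coding cost (no projection here).
        err = bit - p_mix
        weights = [w + step * err * x for w, x in zip(weights, inputs)]
        # Exponential smoothing update of every elementary model.
        probs = [p + r * (bit - p) for p, r in zip(probs, rates)]
    return total
```

Calling `code_length([1] * 2000)` gives a cost far below 2000 bits, since both estimators quickly learn the constant input; less predictable sequences cost more.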
Data Streams from the Low Frequency Instrument On-Board the Planck Satellite: Statistical Analysis and Compression Efficiency
The expected data rate produced by the Low Frequency Instrument (LFI), planned
to fly on the ESA Planck mission in 2007, is more than a factor of 8 larger than the
bandwidth allowed by the spacecraft transmission system to download the LFI
data. We discuss the application of lossless compression to Planck/LFI data
streams in order to reduce the overall data flow. We perform both theoretical
analysis and experimental tests using realistically simulated data streams in
order to establish the statistical properties of the signal and the maximal
compression rate allowed by several lossless compression algorithms. We study
the influence of signal composition and of acquisition parameters on the
compression rate Cr and develop a semiempirical formalism to account for it.
The best-performing compressor tested up to now is order-1 arithmetic
compression, designed to optimize the compression of white-noise-like
signals, which allows an overall compression rate of Cr = 2.65 +/- 0.02. We find
that this result is not improved by other lossless compressors, because the
signal is almost entirely dominated by white noise. Lossless compression algorithms alone will
not solve the bandwidth problem but need to be combined with other techniques.
Comment: May 3, 2000 release, 61 pages, 6 figures coded as eps, 9 tables (4
included as eps), LaTeX 2.09 + assms4.sty, style file included, submitted for
publication in PASP May 3, 200
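The link between white-noise-dominated data and an achievable compression rate can be sketched with a small simulation. This is not the paper's semiempirical formalism: it assumes Gaussian white noise rounded to integer quantization steps, a hypothetical 16-bit sample word, and the zeroth-order empirical entropy as the lossless limit, which need not match the order-1 arithmetic coder tested by the authors. The names `empirical_entropy` and `simulated_compression_rate`, and all parameter values, are illustrative assumptions.

```python
import math
import random
from collections import Counter


def empirical_entropy(samples):
    """Zeroth-order empirical entropy in bits per sample."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def simulated_compression_rate(sigma=10.0, bits_per_sample=16,
                               n=50_000, seed=0):
    """Entropy-limited compression rate for quantized Gaussian white noise.

    sigma is the noise r.m.s. in units of the quantization step, and
    bits_per_sample is the assumed raw word length; both are hypothetical.
    """
    rng = random.Random(seed)
    samples = [round(rng.gauss(0.0, sigma)) for _ in range(n)]
    return bits_per_sample / empirical_entropy(samples)
```

For a noise r.m.s. well above one quantization step, the entropy is close to log2(sigma * sqrt(2 * pi * e)) bits per sample, so under these assumptions the rate depends only on the word length and on the ratio of noise r.m.s. to quantization step.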
Compression and Conditional Emulation of Climate Model Output
Numerical climate model simulations run at high spatial and temporal
resolutions generate massive quantities of data. As our computing capabilities
continue to increase, storing all of the data is not sustainable, and thus it
is important to develop methods for representing the full datasets by smaller
compressed versions. We propose a statistical compression and decompression
algorithm based on storing a set of summary statistics as well as a statistical
model describing the conditional distribution of the full dataset given the
summary statistics. The statistical model can be used to generate realizations
representing the full dataset, along with characterizations of the
uncertainties in the generated data. Thus, the methods are capable of both
compression and conditional emulation of the climate models. Considerable
attention is paid to accurately modeling the original dataset (one year of
daily mean temperature data), particularly with regard to the inherent spatial
nonstationarity in global fields, and to determining the statistics to be
stored, so that the variation in the original data can be closely captured,
while allowing for fast decompression and conditional emulation on modest
computers.