Compression Methods for Structured Floating-Point Data and their Application in Climate Research
The use of new technologies, such as GPU boosters, has led to a dramatic increase in the computing power of High-Performance Computing (HPC) centres. This development, coupled with new climate models that can better utilise this computing power thanks to software development and internal design, has moved the bottleneck from solving the differential equations describing Earth's atmospheric interactions to actually storing the variables. The current approach to solving the storage problem is inadequate: either the number of variables to be stored is limited or the temporal resolution of the output is reduced. If it is subsequently determined that another variable is required which has not been saved, the simulation must be run again. This thesis deals with the development of novel compression algorithms for structured floating-point data, such as climate data, so that they can be stored in full resolution.
Compression is performed by decorrelation and subsequent coding of the data. The decorrelation step eliminates redundant information in the data. During coding, the actual compression takes place and the data is written to disk. A lossy compression algorithm additionally has an approximation step that unifies the data for better coding: it reduces the complexity of the data for the subsequent coding step, e.g. by quantization. This work makes a new scientific contribution to each of the three steps described above.
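As a toy illustration of this three-step structure, the following sketch (not the thesis's actual algorithms; the function names and the quantization step are invented for illustration) quantizes, delta-decorrelates, and byte-codes a one-dimensional field:

```python
import zlib
import numpy as np

def compress(values, step=0.01):
    """Toy lossy pipeline: approximate, decorrelate, code."""
    # Approximation: uniform quantization bounds the absolute error to step/2.
    q = np.round(np.asarray(values, dtype=np.float64) / step).astype(np.int64)
    # Decorrelation: delta coding removes redundancy between neighbouring values.
    residuals = np.diff(q, prepend=0)
    # Coding: a general-purpose coder performs the actual compression to bytes.
    return zlib.compress(residuals.tobytes())

def decompress(blob, step=0.01):
    residuals = np.frombuffer(zlib.decompress(blob), dtype=np.int64)
    # Invert the decorrelation (prefix sum), then the approximation (rescaling).
    return np.cumsum(residuals) * step
```

Smooth fields yield small, repetitive residuals that the final coder shrinks well; a poor decorrelator leaves the coder little to exploit.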
This thesis presents a novel lossy compression method for time-series data using an Auto Regressive Integrated Moving Average (ARIMA) model to decorrelate the data. In addition, the concept of information spaces and contexts is presented to use information across dimensions for decorrelation. Furthermore, a new coding scheme is described which reduces the weaknesses of the eXclusive-OR (XOR) difference calculation and achieves a better compression factor than current lossless compression methods for floating-point numbers. Finally, a modular framework is introduced that allows the creation of user-defined compression algorithms.
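A minimal sketch of the ARIMA decorrelation idea, using statsmodels (the model order here is an arbitrary assumption, not the one used in the thesis):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_decorrelate(series, order=(1, 1, 1)):
    """Fit an ARIMA model and return its one-step-ahead residuals.

    A well-fitting model leaves small, near-zero-centred residuals that
    code far better than the raw series; the fitted parameters plus the
    (possibly quantized) residuals suffice to reconstruct the data.
    """
    fit = ARIMA(np.asarray(series, dtype=float), order=order).fit()
    return fit.resid, fit.params
```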
The experiments presented in this thesis show that it is possible to increase the information content of lossily compressed time-series data by applying an adaptive compression technique which preserves selected data with higher precision. Lossless compression of these time series proved unsuccessful. However, the lossy ARIMA compression model proposed here is able to capture all relevant information. The reconstructed data can reproduce the time series to such an extent that statistically relevant information for the description of climate dynamics is preserved.
Experiments indicate that there is a significant dependence of the compression factor on the selected traversal sequence and the underlying data model. The influence of these structural dependencies on prediction-based compression methods is investigated in this thesis. For this purpose, the concept of Information Spaces (IS) is introduced. IS improves the predictions of the individual predictors by nearly 10% on average. Perhaps more importantly, the standard deviation of the compression results is on average 20% lower. Using IS therefore provides better predictions and more consistent compression results.
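Conceptually, an information space lets a predictor draw on already-decoded neighbours along every dimension rather than only along the traversal direction. The following sketch is a plain averaging illustration only; the thesis's actual IS construction and weighting are not reproduced here:

```python
import numpy as np

def cross_dimensional_prediction(data, idx):
    """Predict data[idx] from the preceding neighbour along each axis."""
    predictions = []
    for axis in range(data.ndim):
        if idx[axis] > 0:
            neighbour = list(idx)
            neighbour[axis] -= 1  # last decoded value along this dimension
            predictions.append(data[tuple(neighbour)])
    return float(np.mean(predictions)) if predictions else 0.0
```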
Furthermore, it is shown that shifting the prediction and the true value leads to a better compression factor at minimal additional computational cost. This allows more resource-efficient prediction algorithms to achieve the same or a better compression factor, or a higher throughput during compression and decompression. The coding scheme proposed here achieves a better compression factor than current state-of-the-art methods.
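The weakness being addressed is that XOR residuals reflect bit patterns, not numerical distance: an excellent prediction that straddles a power of two shares almost no bits with the true value. The sketch below (illustrative only; the thesis's actual coding scheme is not reproduced) shows how shifting both values into the same binade exposes leading zeros:

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a double as its raw 64-bit IEEE-754 pattern."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def xor_residual(true_value: float, prediction: float, shift: float = 0.0) -> int:
    """XOR residual of two doubles, optionally shifting both values first."""
    return float_bits(true_value + shift) ^ float_bits(prediction + shift)

def leading_zeros(residual: int) -> int:
    return 64 - residual.bit_length()

# 1.9999999 vs. 2.0: numerically close, but the values straddle a power of
# two, so the plain XOR shares only the sign bit. Shifting both by 2.5
# moves them into the same binade and aligns the sign and exponent bits.
print(leading_zeros(xor_residual(1.9999999, 2.0)))             # 1
print(leading_zeros(xor_residual(1.9999999, 2.0, shift=2.5)))  # 14
```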
Finally, this thesis presents a modular framework for the development of compression algorithms. The framework supports the creation of user-defined predictors and offers functionalities such as the execution of benchmarks, the random subdivision of n-dimensional data, the quality evaluation of predictors, the creation of ensemble predictors, and the execution of validity tests for sequential and parallel compression algorithms.
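The interface of such a framework might look roughly like the sketch below (class and method names are invented for illustration, not the framework's actual API):

```python
from abc import ABC, abstractmethod
import numpy as np

class Predictor(ABC):
    """A user-defined predictor pluggable into the compression pipeline."""

    @abstractmethod
    def predict(self, history: np.ndarray) -> float:
        """Predict the next value from the already-decoded values."""

class LastValuePredictor(Predictor):
    def predict(self, history):
        return float(history[-1]) if len(history) else 0.0

class EnsemblePredictor(Predictor):
    """Combine several user-defined predictors, here by plain averaging."""

    def __init__(self, members):
        self.members = members

    def predict(self, history):
        return float(np.mean([m.predict(history) for m in self.members]))
```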
This research was initiated by the needs of climate science, but the application of its contributions is not limited to that field. The results of this thesis are of major benefit for developing and improving any compression algorithm for structured floating-point data.
Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP
With ever-increasing volumes of scientific data produced by HPC applications, significantly reducing data size is critical because of the limited capacity of storage space and potential bottlenecks on I/O or networks when writing/reading or transferring data. SZ and ZFP are the two leading lossy compressors
available to compress scientific data sets. However, their performance is not
consistent across different data sets and across different fields of some data
sets: for some fields SZ provides better compression performance, while other
fields are better compressed with ZFP. This situation raises the need for an
automatic online (during compression) selection between SZ and ZFP, with a
minimal overhead. In this paper, the automatic selection optimizes the
rate-distortion, an important statistical quality metric based on the
signal-to-noise ratio. To optimize for rate-distortion, we investigate the
principles of SZ and ZFP. We then propose an efficient online, low-overhead
selection algorithm that predicts the compression quality accurately for two
compressors in early processing stages and selects the best-fit compressor for
each data field. We implement the selection algorithm in an open-source library, and we evaluate the effectiveness of our proposed solution against
plain SZ and ZFP in a parallel environment with 1,024 cores. Evaluation results
on three data sets representing about 100 fields show that our selection algorithm improves the compression ratio by up to 70% at the same level of data distortion, owing to very accurate selection (around 99%) of the best-fit compressor, with little overhead (less than 7% in the experiments).
Comment: 14 pages, 9 figures, first revision
Compression and Conditional Emulation of Climate Model Output
Numerical climate model simulations run at high spatial and temporal
resolutions generate massive quantities of data. As our computing capabilities
continue to increase, storing all of the data is not sustainable, and thus it
is important to develop methods for representing the full datasets by smaller
compressed versions. We propose a statistical compression and decompression
algorithm based on storing a set of summary statistics as well as a statistical
model describing the conditional distribution of the full dataset given the
summary statistics. The statistical model can be used to generate realizations
representing the full dataset, along with characterizations of the
uncertainties in the generated data. Thus, the methods are capable of both
compression and conditional emulation of the climate models. Considerable attention is paid to accurately modeling the original dataset (one year of daily mean temperature data), particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured while allowing for fast decompression and conditional emulation on modest computers.
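A drastically simplified sketch of the store-statistics-then-emulate idea (the paper's statistical model treats spatial nonstationarity far more carefully; the PCA truncation and single residual scale below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def compress_field(data, k=10):
    """Summarize a (days x gridpoints) field by its leading patterns."""
    mean = data.mean(axis=0)
    centred = data - mean
    # SVD yields the dominant spatial patterns and their per-day scores.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    patterns = vt[:k]
    scores = u[:, :k] * s[:k]
    # One number summarizing the variability that the truncation discarded.
    resid_sd = (centred - scores @ patterns).std()
    return mean, patterns, scores, resid_sd

def emulate_field(mean, patterns, scores, resid_sd):
    """Draw one realization from the conditional model given the statistics."""
    field = mean + scores @ patterns
    return field + rng.normal(scale=resid_sd, size=field.shape)
```

Each call to emulate_field yields a different realization consistent with the stored statistics, which is what allows characterizing the uncertainty in the generated data.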
Toward decoupling the selection of compression algorithms from quality constraints
Data-intense scientific domains use data compression to reduce the storage space needed. Lossless data compression preserves the original information accurately, but on climate data it usually yields a compression factor of only 2:1. Lossy data compression can achieve much higher compression rates, depending on the tolerable error/precision. The field of lossy compression is therefore still subject to active research. From a scientist's perspective, the particular compression algorithm does not matter; what matters is qualitative information about the implied loss of precision in the data.
With the Scientific Compression Library (SCIL), we are developing a meta-compressor that allows users to set various quantities that define the acceptable error and the expected performance behavior. The ongoing work is a preliminary stage in the design of an automatic compression-algorithm selector. The task of this missing key component is the construction of appropriate chains of algorithms that meet the user's requirements. This approach is a crucial step towards a scientifically safe use of much-needed lossy data compression, because it disentangles the task of determining scientifically grounded characteristics of tolerable noise from the task of determining an optimal compression strategy given target noise levels and constraints. Once integrated into SCIL, future algorithms can be used without changes to the application code.
In this paper, we describe the user interfaces and quantities as well as two compression algorithms, and we evaluate SCIL's ability to compress climate data. We show that the novel algorithms are competitive with the state-of-the-art compressors ZFP and SZ, and illustrate that the best algorithm depends on user settings and data properties.
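The flavour of this decoupling can be sketched as follows (function and quantity names are illustrative, not SCIL's actual interface): the user states tolerances, and the selector keeps only chains whose reconstruction satisfies them:

```python
import numpy as np

def satisfies(original, reconstructed, abs_tol=None, rel_tol=None):
    """Check a reconstruction against user-defined quality quantities."""
    diff = np.abs(original - reconstructed)
    if abs_tol is not None and diff.max() > abs_tol:
        return False
    if rel_tol is not None:
        scale = np.maximum(np.abs(original), np.finfo(float).tiny)
        if (diff / scale).max() > rel_tol:
            return False
    return True

def select_chain(data, chains, **quantities):
    """Among candidate (compress, decompress) chains, return the one that
    meets every quantity with the smallest compressed output."""
    best_blob, best_chain = None, None
    for compress, decompress in chains:
        blob = compress(data)
        if satisfies(data, decompress(blob), **quantities):
            if best_blob is None or len(blob) < len(best_blob):
                best_blob, best_chain = blob, (compress, decompress)
    return best_chain, best_blob
```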
AIMES: advanced computation and I/O methods for earth-system simulations
Dealing with extreme scale Earth-system models is challenging from the computer science perspective, as the required computing power and storage capacity are steadily increasing.
Scientists perform runs with growing resolution or aggregate results from many similar smaller-scale runs with slightly different initial conditions (the so-called ensemble runs).
In the fifth Coupled Model Intercomparison Project (CMIP5), the produced datasets require more than three petabytes of storage, and the compute and storage requirements are increasing significantly for CMIP6.
Climate scientists across the globe are developing next-generation models based on improved numerical formulation leading to grids that are discretized in alternative forms such as an icosahedral (geodesic) grid.
The developers of these models face similar problems in scaling, maintaining and optimizing code.
Performance portability and the maintainability of code are key concerns of scientists as, compared to industry projects, model code is continuously revised and extended to incorporate further levels of detail.
This leads to a rapidly growing code base that is rarely refactored.
However, code modernization is important to maintain the productivity of the scientists working with the code and to utilize the performance provided by modern and future architectures.
The need for performance optimization is motivated by the evolution of the parallel architecture landscape from homogeneous flat machines to heterogeneous combinations of processors with deep memory hierarchies.
Notably, the rise of many-core, throughput-oriented accelerators, such as GPUs, requires non-trivial code changes at minimum and, even worse, may necessitate a substantial rewrite of the existing codebase.
At the same time, the code complexity increases the difficulty for computer scientists and vendors to understand and optimize the code for a given system.
Storing the products of climate predictions requires a large storage and archival system, which is expensive. Often, scientists restrict the number of scientific variables and the write interval to keep the costs balanced. Compression algorithms can reduce these costs significantly and can also increase the scientific yield of simulation runs.
In the AIMES project, we addressed the key issues of programmability, computational efficiency and I/O limitations that are common in next-generation icosahedral earth-system models.
The project focused on the separation of concerns between domain scientists, computational scientists, and computer scientists.