Compression Methods for Structured Floating-Point Data and their Application in Climate Research
The use of new technologies, such as GPU boosters, has led to a dramatic
increase in the computing power of High-Performance Computing (HPC) centres.
This development, coupled with new climate models that can better utilise
this computing power thanks to improved software design, has moved the
bottleneck from solving the differential equations describing Earth's
atmospheric interactions to actually storing the variables.
The current approach to solving the storage problem is inadequate: either
the number of variables to be stored is limited or the temporal resolution
of the output is reduced. If it is subsequently determined that another
variable is required which has not been saved, the simulation must be run
again. This thesis deals with the development of novel compression
algorithms for structured floating-point data, such as climate data, so that
they can be stored in full resolution.
Compression is performed by decorrelation and subsequent coding of the
data. The decorrelation step eliminates redundant information in the data.
During coding, the actual compression takes place and the data is written to
disk. A lossy compression algorithm additionally has an approximation step
that unifies the data for better coding, reducing the complexity of the data
for the subsequent coding step, e.g. by quantization. This work makes a new
scientific contribution to each of the three steps described above.
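The three steps can be illustrated with a minimal sketch, assuming a simple delta predictor for decorrelation, uniform quantization for approximation, and a general-purpose entropy coder for the coding step (all illustrative stand-ins, not the algorithms developed in the thesis):

```python
import zlib
import numpy as np

def compress(data, step=0.01):
    # Approximation step: uniform quantization (lossy) simplifies the values.
    q = np.round(data / step).astype(np.int64)
    # Decorrelation step: delta prediction removes redundancy between neighbours.
    residuals = np.diff(q, prepend=0)
    # Coding step: the residual stream is entropy-coded and written out.
    return zlib.compress(residuals.tobytes())

def decompress(blob, step=0.01):
    residuals = np.frombuffer(zlib.decompress(blob), dtype=np.int64)
    return np.cumsum(residuals) * step

x = np.cumsum(np.random.default_rng(0).normal(size=1000))
y = decompress(compress(x))
assert np.max(np.abs(x - y)) <= 0.005 + 1e-9  # error bounded by half the step
```

The reconstruction error stays within half the quantization step, while the decorrelated residuals compress far better than the raw bytes would.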
This thesis presents a novel lossy compression method for time-series data
that uses an Auto Regressive Integrated Moving Average (ARIMA) model to
decorrelate the data. In addition, the concept of information spaces and
contexts is presented to use information across dimensions for
decorrelation. Furthermore, a new coding scheme is described which mitigates
the weaknesses of the eXclusive-OR (XOR) difference calculation and achieves
a better compression factor than current lossless compression methods for
floating-point numbers. Finally, a modular framework is introduced that
allows the creation of user-defined compression algorithms.
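As a rough illustration of prediction-based decorrelation, the following sketch uses an AR(1) predictor, a minimal stand-in for the full ARIMA model of the thesis, and shows that the residual stream carries far less energy than the original series, making it easier to code:

```python
import numpy as np

# AR(1) stand-in for the thesis's ARIMA decorrelation: fit one lag
# coefficient, then store the seed value plus the (small) residuals.
def ar1_residuals(x):
    phi = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])  # least-squares fit
    return x[0], phi, x[1:] - phi * x[:-1]

def ar1_reconstruct(seed, phi, residuals):
    out = [seed]
    for r in residuals:
        out.append(phi * out[-1] + r)
    return np.array(out)

rng = np.random.default_rng(1)
x = np.empty(500)
x[0] = 0.0
for t in range(1, 500):
    x[t] = 0.9 * x[t - 1] + rng.normal(scale=0.1)

seed, phi, res = ar1_residuals(x)
# Reconstruction is exact up to floating-point rounding; a real codec
# would additionally quantize and entropy-code the residuals.
assert np.allclose(ar1_reconstruct(seed, phi, res), x)
assert np.var(res) < np.var(x[1:])  # residuals carry far less energy
```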
The experiments presented in this thesis show that it is possible to
increase the information content of lossily compressed time-series data by
applying an adaptive compression technique which preserves selected data
with higher precision. Lossless compression of these time series proved
unsuccessful. However, the lossy ARIMA compression model proposed here is
able to capture all relevant information: the reconstructed data reproduce
the time series to such an extent that statistically relevant information
for the description of climate dynamics is preserved.
Experiments indicate a significant dependence of the compression factor on
the selected traversal sequence and the underlying data model. The influence
of these structural dependencies on prediction-based compression methods is
investigated in this thesis. For this purpose, the concept of Information
Spaces (IS) is introduced. IS improves the predictions of the individual
predictors by nearly 10% on average. Perhaps more importantly, the standard
deviation of the compression results is on average 20% lower. Using IS thus
yields better predictions and more consistent compression results.
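A toy illustration of the idea behind using information across dimensions (the IS concept itself is more general): predicting each cell of a smooth 2-D field from both its left and upper neighbours beats a single-direction predictor tied to one traversal sequence:

```python
import numpy as np

def predict_1d(a):
    # Single traversal direction: predict each cell from its left neighbour.
    pred = np.zeros_like(a)
    pred[:, 1:] = a[:, :-1]
    return pred

def predict_2d(a):
    # Cross-dimensional context: average the left and upper neighbours.
    pred = np.zeros_like(a)
    pred[:, 1:] += 0.5 * a[:, :-1]
    pred[1:, :] += 0.5 * a[:-1, :]
    return pred

x, y = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
field = np.sin(3 * x) * np.cos(2 * y)  # smooth in both dimensions
err1 = np.abs(field - predict_1d(field))[1:, 1:].mean()
err2 = np.abs(field - predict_2d(field))[1:, 1:].mean()
assert err2 < err1  # using both dimensions improves the prediction
```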
Furthermore, it is shown that shifting the prediction and the true value
leads to a better compression factor at minimal additional computational
cost. This allows more resource-efficient prediction algorithms to achieve
the same or a better compression factor, or higher throughput during
compression and decompression. The coding scheme proposed here achieves a
better compression factor than current state-of-the-art methods.
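The XOR difference underlying such coders can be sketched as follows (illustrative code, not the thesis's scheme). The final assertion demonstrates the weakness addressed above: two numerically close values that straddle a power of two share almost no bits under plain XOR.

```python
import struct

def float_bits(x: float) -> int:
    # Reinterpret a 64-bit float as its raw bit pattern.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def xor_residual(pred: float, true: float) -> int:
    # A close prediction yields a long run of leading zero bits,
    # which the subsequent coder exploits.
    return float_bits(pred) ^ float_bits(true)

def leading_zeros(v: int, width: int = 64) -> int:
    return width - v.bit_length()

good = xor_residual(3.14159, 3.14160)   # close prediction
bad = xor_residual(3.14159, 2.71828)    # poor prediction
assert leading_zeros(good) > leading_zeros(bad)

# The XOR weakness: values straddling a power of two are numerically
# close yet share almost no bits, hurting plain XOR coding.
close_but_bad = xor_residual(1.9999999999999998, 2.0)
assert leading_zeros(close_but_bad) < leading_zeros(bad)
```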
Finally, this thesis presents a modular framework for the development of
compression algorithms. The framework supports the creation of user-defined
predictors and offers functionalities such as the execution of benchmarks,
the random subdivision of n-dimensional data, the quality evaluation of
predictors, the creation of ensemble predictors, and the execution of
validity tests for sequential and parallel compression algorithms.
This research was initiated by the needs of climate science, but the
application of its contributions is not limited to it. The results of this
thesis are of major benefit for developing and improving any compression
algorithm for structured floating-point data.
Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP
With ever-increasing volumes of scientific data produced by HPC applications,
significantly reducing data size is critical because of limited capacity of
storage space and potential bottlenecks on I/O or networks in writing/reading
or transferring data. SZ and ZFP are the two leading lossy compressors
available to compress scientific data sets. However, their performance is not
consistent across different data sets and across different fields of some data
sets: for some fields SZ provides better compression performance, while other
fields are better compressed with ZFP. This situation raises the need for an
automatic online (during compression) selection between SZ and ZFP, with a
minimal overhead. In this paper, the automatic selection optimizes the
rate-distortion, an important statistical quality metric based on the
signal-to-noise ratio. To optimize for rate-distortion, we investigate the
principles of SZ and ZFP. We then propose an efficient online, low-overhead
selection algorithm that predicts the compression quality accurately for two
compressors in early processing stages and selects the best-fit compressor for
each data field. We implement the selection algorithm into an open-source
library, and we evaluate the effectiveness of our proposed solution against
plain SZ and ZFP in a parallel environment with 1,024 cores. Evaluation results
on three data sets representing about 100 fields show that our selection
algorithm improves the compression ratio up to 70% with the same level of data
distortion because of very accurate selection (around 99%) of the best-fit
compressor, with little overhead (less than 7% in the experiments).
Comment: 14 pages, 9 figures, first revision
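The selection logic can be sketched with stand-in codecs (the real SZ and ZFP are not invoked here, and only distortion at a fixed setting is compared rather than full rate-distortion):

```python
import numpy as np

def psnr(orig, recon):
    # Peak signal-to-noise ratio over the field's value range.
    mse = np.mean((orig - recon) ** 2)
    vrange = orig.max() - orig.min()
    return 10 * np.log10(vrange ** 2 / mse)

# Stand-in lossy codecs (NOT the real SZ and ZFP): two quantizers with
# different error behaviour, just to exercise the selection loop.
def codec_fine(x):
    return np.round(x, 2)

def codec_coarse(x):
    return np.round(x, 1)

def select_best(field, codecs):
    # Predict quality per compressor from a small sample in an early
    # processing stage, then pick the best fit for this field.
    sample = field[::10]
    scores = {name: psnr(sample, c(sample)) for name, c in codecs.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(2)
field = rng.normal(size=10_000)
best = select_best(field, {"fine": codec_fine, "coarse": codec_coarse})
assert best == "fine"
```

The sampling step mirrors the paper's low-overhead idea: quality is estimated cheaply per field before committing to one compressor.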
SAR data compression: Application, requirements, and designs
The feasibility of reducing data volume and data rate is evaluated for the Earth Observing System (EOS) Synthetic Aperture Radar (SAR). All elements of the data stream, from the sensor downlink to the electronic delivery of browse data products, are explored. The factors influencing the design of a data compression system are analyzed, including the signal data characteristics, the image quality requirements, and the throughput requirements. The conclusion is that little or no reduction can be achieved in the raw signal data using traditional data compression techniques (e.g., vector quantization, adaptive discrete cosine transform) because of the phase errors they induce in the output image. However, after image formation, a number of techniques are effective for data compression.
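As a generic illustration of the transform-coding family mentioned above (not the EOS design itself), an 8x8 image block can be DCT-transformed, coarsely quantized, and reconstructed with a bounded error:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix.
    k = np.arange(n)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

D = dct_matrix(8)
rng = np.random.default_rng(3)
block = rng.random((8, 8))          # stand-in 8x8 image block
coeffs = D @ block @ D.T            # 2-D DCT of the block
q = 0.1
recon = D.T @ (np.round(coeffs / q) * q) @ D  # dequantize and invert
assert np.max(np.abs(block - recon)) <= 4 * q  # orthonormality bounds the error
```

Because the transform is orthonormal, the quantization error in the coefficients translates into a bounded reconstruction error in the image domain; it is exactly this kind of amplitude error, acceptable for images, that becomes a fatal phase error when applied to raw SAR signal data.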
Improving Performance of Iterative Methods by Lossy Checkpointing
Iterative methods are commonly used approaches to solve large, sparse linear
systems, which are fundamental operations for many modern scientific
simulations. When the large-scale iterative methods are running with a large
number of ranks in parallel, they have to checkpoint the dynamic variables
periodically in case of unavoidable fail-stop errors, requiring fast I/O
systems and large storage space. To this end, significantly reducing the
checkpointing overhead is critical to improving the overall performance of
iterative methods. Our contribution is fourfold. (1) We propose a novel lossy
checkpointing scheme that can significantly improve the checkpointing
performance of iterative methods by leveraging lossy compressors. (2) We
formulate a lossy checkpointing performance model and derive theoretically an
upper bound for the extra number of iterations caused by the distortion of data
in lossy checkpoints, in order to guarantee the performance improvement under
the lossy checkpointing scheme. (3) We analyze the impact of lossy
checkpointing (i.e., extra number of iterations caused by lossy checkpointing
files) for multiple types of iterative methods. (4) We evaluate the lossy
checkpointing scheme with optimal checkpointing intervals on a high-performance
computing environment with 2,048 cores, using a well-known scientific
computation package PETSc and a state-of-the-art checkpoint/restart toolkit.
Experiments show that our optimized lossy checkpointing scheme can
significantly reduce the fault tolerance overhead for iterative methods by
23%~70% compared with traditional checkpointing and 20%~58% compared with
lossless-compressed checkpointing, in the presence of system failures.
Comment: 14 pages, 10 figures, HPDC'1
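A toy amortized-overhead model (illustrative, not the paper's exact formulation) conveys the trade-off: lossy checkpoints cost less to write but may add a bounded number of extra iterations after each restart due to compression error.

```python
# Toy model: total fault-tolerance overhead as the fraction of time spent
# writing checkpoints plus the amortized cost of extra iterations per failure.
def fault_tolerance_overhead(ckpt_cost, interval, mtbf, iter_time, extra_iters):
    checkpoint_fraction = ckpt_cost / interval           # time writing checkpoints
    restart_penalty = (extra_iters * iter_time) / mtbf   # amortized recomputation
    return checkpoint_fraction + restart_penalty

lossless = fault_tolerance_overhead(ckpt_cost=10.0, interval=100.0,
                                    mtbf=10_000.0, iter_time=1.0, extra_iters=0)
lossy = fault_tolerance_overhead(ckpt_cost=2.0, interval=100.0,
                                 mtbf=10_000.0, iter_time=1.0, extra_iters=5)
assert lossy < lossless  # cheap checkpoints outweigh the bounded penalty
```

All parameter values here are invented for illustration; the paper derives a rigorous upper bound on the extra iterations to guarantee that the inequality holds in practice.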
Near Lossless Time Series Data Compression Methods using Statistics and Deviation
The last two decades have seen tremendous growth in data collections, driven
by recent technologies including the Internet of Things (IoT), E-Health,
industrial IoT 4.0, autonomous vehicles, etc. The challenge of data
transmission and storage can be handled by utilizing state-of-the-art data
compression methods. Recent data compression methods based on deep learning
perform better than conventional methods but require a lot of data and
resources for training. Furthermore, it is difficult to deploy these deep
learning-based solutions on IoT devices due to their resource-constrained
nature. In this paper, we propose lightweight data compression methods based
on data statistics and deviation. The proposed method performs better than
the deep learning method in
terms of compression ratio (CR). We simulate and compare the proposed data
compression methods for various time series signals, e.g., accelerometer, gas
sensor, gyroscope, electrical power consumption, etc. In particular, it is
observed that the proposed method achieves 250.8%, 94.3%, and 205% higher CR
than the deep learning method for the GYS, Gactive, and ACM datasets,
respectively. The code and data are available at
https://github.com/vidhi0206/data-compression
Comment: 6 pages, 2 figures and 9 tables are included
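A minimal deviation-based encoder in this spirit (illustrative only, not the paper's exact method) stores a sample only when it deviates from the last stored value by more than a bound eps, so the decoder repeats the previous value in between:

```python
import numpy as np

def encode(x, eps):
    # Keep a sample only when it deviates from the last kept value by > eps.
    kept_idx, kept_val = [0], [float(x[0])]
    for i in range(1, len(x)):
        if abs(x[i] - kept_val[-1]) > eps:
            kept_idx.append(i)
            kept_val.append(float(x[i]))
    return kept_idx, kept_val, len(x)

def decode(kept_idx, kept_val, n):
    # Fill each gap with the most recent kept value.
    out = np.empty(n)
    bounds = kept_idx[1:] + [n]
    for i, v, j in zip(kept_idx, kept_val, bounds):
        out[i:j] = v
    return out

x = np.sin(np.linspace(0.0, 3.0, 200))  # stand-in sensor time series
idx, val, n = encode(x, eps=0.05)
y = decode(idx, val, n)
assert np.max(np.abs(x - y)) <= 0.05  # near-lossless error bound
assert len(val) < len(x) / 2          # far fewer samples stored
```

The error bound eps is exactly what makes such a scheme "near lossless": reconstruction error is guaranteed while smooth stretches of the signal collapse to a handful of stored values.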