
    Compression Methods for Structured Floating-Point Data and their Application in Climate Research

    The use of new technologies, such as GPU boosters, has led to a dramatic increase in the computing power of High-Performance Computing (HPC) centres. This development, coupled with new climate models that can better utilise this computing power thanks to software development and internal design, has moved the bottleneck from solving the differential equations describing Earth’s atmospheric interactions to actually storing the variables. The current approach to solving the storage problem is inadequate: either the number of variables to be stored is limited or the temporal resolution of the output is reduced. If it is subsequently determined that another variable is required which has not been saved, the simulation must be run again. This thesis deals with the development of novel compression algorithms for structured floating-point data such as climate data so that they can be stored in full resolution. Compression is performed by decorrelation and subsequent coding of the data. The decorrelation step eliminates redundant information in the data. During coding, the actual compression takes place and the data is written to disk. A lossy compression algorithm additionally has an approximation step, which reduces the complexity of the data for the subsequent coding, e.g. by using quantization. This work makes a new scientific contribution to each of the three steps described above. This thesis presents a novel lossy compression method for time-series data using an Auto Regressive Integrated Moving Average (ARIMA) model to decorrelate the data. In addition, the concept of information spaces and contexts is presented to use information across dimensions for decorrelation. Furthermore, a new coding scheme is described which reduces the weaknesses of the eXclusive-OR (XOR) difference calculation and achieves a better compression factor than current lossless compression methods for floating-point numbers. Finally, a modular framework is introduced that allows the creation of user-defined compression algorithms. The experiments presented in this thesis show that it is possible to increase the information content of lossily compressed time-series data by applying an adaptive compression technique which preserves selected data with higher precision. An analysis of lossless compression of these time-series showed no success. However, the lossy ARIMA compression model proposed here is able to capture all relevant information. The reconstructed data can reproduce the time-series to such an extent that statistically relevant information for the description of climate dynamics is preserved. Experiments indicate that there is a significant dependence of the compression factor on the selected traversal sequence and the underlying data model. The influence of these structural dependencies on prediction-based compression methods is investigated in this thesis. For this purpose, the concept of Information Spaces (IS) is introduced. IS contributes to improving the predictions of the individual predictors by nearly 10% on average. Perhaps more importantly, the standard deviation of compression results is on average 20% lower. Using IS provides better predictions and consistent compression results. Furthermore, it is shown that shifting the prediction and true value leads to a better compression factor with minimal additional computational costs. This allows the use of more resource-efficient prediction algorithms to achieve the same or better compression factor or higher throughput during compression or decompression. The coding scheme proposed here achieves a better compression factor than current state-of-the-art methods. Finally, this thesis presents a modular framework for the development of compression algorithms. The framework supports the creation of user-defined predictors and offers functionalities such as the execution of benchmarks, the random subdivision of n-dimensional data, the quality evaluation of predictors, the creation of ensemble predictors and the execution of validity tests for sequential and parallel compression algorithms. This research was initiated because of the needs of climate science, but the application of its contributions is not limited to it. The results of this thesis are of major benefit for developing and improving any compression algorithm for structured floating-point data.
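
    The XOR difference calculation mentioned in the abstract above can be illustrated with a minimal sketch. It shows only the generic baseline idea (XOR the bit patterns of a predicted and a true value, so that a good prediction yields many leading zero bits), not the thesis's improved coding scheme. The function names and the last-value predictor are assumptions chosen for illustration.

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a 64-bit IEEE 754 double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def bits_to_float(b: int) -> float:
    """Inverse of float_to_bits."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]

def xor_residual(prediction: float, truth: float) -> int:
    """XOR of the two bit patterns; zero when the prediction is exact."""
    return float_to_bits(prediction) ^ float_to_bits(truth)

def leading_zeros(residual: int, width: int = 64) -> int:
    """Leading zero bits of the residual; these need not be stored."""
    return width - residual.bit_length()

def reconstruct(prediction: float, residual: int) -> float:
    """Recover the true value losslessly from prediction and residual."""
    return bits_to_float(float_to_bits(prediction) ^ residual)

# Hypothetical temperature series; each value is predicted by its predecessor.
series = [288.15, 288.17, 288.16, 288.20]
residuals = [xor_residual(prev, cur) for prev, cur in zip(series, series[1:])]
decoded = [series[0]]
for r in residuals:
    decoded.append(reconstruct(decoded[-1], r))

print([leading_zeros(r) for r in residuals])
print(decoded == series)
```

    A real coder would store each residual as a leading-zero count followed by the remaining bits. The sketch stops at the residual because the encoding details of the thesis are not given in the abstract.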

    Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP

    With ever-increasing volumes of scientific data produced by HPC applications, significantly reducing data size is critical because of the limited capacity of storage space and potential bottlenecks on I/O or networks in writing/reading or transferring data. SZ and ZFP are the two leading lossy compressors available to compress scientific data sets. However, their performance is not consistent across different data sets and across different fields of some data sets: for some fields SZ provides better compression performance, while other fields are better compressed with ZFP. This situation raises the need for an automatic online (during compression) selection between SZ and ZFP, with minimal overhead. In this paper, the automatic selection optimizes the rate-distortion, an important statistical quality metric based on the signal-to-noise ratio. To optimize for rate-distortion, we investigate the principles of SZ and ZFP. We then propose an efficient online, low-overhead selection algorithm that predicts the compression quality accurately for the two compressors in early processing stages and selects the best-fit compressor for each data field. We implement the selection algorithm in an open-source library, and we evaluate the effectiveness of our proposed solution against plain SZ and ZFP in a parallel environment with 1,024 cores. Evaluation results on three data sets representing about 100 fields show that our selection algorithm improves the compression ratio by up to 70% with the same level of data distortion because of very accurate selection (around 99%) of the best-fit compressor, with little overhead (less than 7% in the experiments). Comment: 14 pages, 9 figures, first revisio
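
    The rate-distortion view described above pairs a distortion metric with a storage cost. The following is a minimal sketch of the two commonly used quantities, peak signal-to-noise ratio (PSNR) and bit rate, plus a trivial selection step. It does not reproduce the paper's online estimator: the per-compressor bit-rate estimates passed to `select_compressor` are hypothetical inputs, and all function names are assumptions.

```python
import math

def psnr(original, reconstructed):
    """PSNR in dB, using the value range of the original field as the peak."""
    value_range = max(original) - min(original)
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # lossless reconstruction
    if value_range == 0:
        raise ValueError("PSNR is undefined for a constant field with nonzero error")
    return 20 * math.log10(value_range) - 10 * math.log10(mse)

def bit_rate(compressed_bytes, n_values):
    """Average number of bits stored per data value."""
    return 8 * compressed_bytes / n_values

def select_compressor(estimated_bit_rates):
    """Pick the compressor with the lowest estimated bit rate.

    Assumes the estimates were made at the same target PSNR, so the
    lower bit rate is the better rate-distortion trade-off.
    """
    return min(estimated_bit_rates, key=estimated_bit_rates.get)

field = [0.0, 10.0]
lossy = [1.0, 9.0]
print(psnr(field, lossy))
print(bit_rate(compressed_bytes=1000, n_values=4000))
print(select_compressor({"SZ": 2.1, "ZFP": 3.0}))
```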

    SAR data compression: Application, requirements, and designs

    The feasibility of reducing data volume and data rate is evaluated for the Earth Observing System (EOS) Synthetic Aperture Radar (SAR). All elements of the data stream, from the sensor downlink to electronic delivery of browse data products, are explored. The factors influencing the design of a data compression system are analyzed, including the signal data characteristics, the image quality requirements, and the throughput requirements. The conclusion is that little or no reduction can be achieved in the raw signal data using traditional data compression techniques (e.g., vector quantization, adaptive discrete cosine transform) due to the induced phase errors in the output image. However, after image formation, a number of techniques are effective for data compression.

    Improving Performance of Iterative Methods by Lossy Checkpointing

    Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When large-scale iterative methods run with a large number of ranks in parallel, they have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. Significantly reducing the checkpointing overhead is therefore critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and theoretically derive an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee the performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., the extra number of iterations caused by lossy checkpointing files) for multiple types of iterative methods. (4) We evaluate the lossy checkpointing scheme with optimal checkpointing intervals in a high-performance computing environment with 2,048 cores, using the well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead for iterative methods by 23%~70% compared with traditional checkpointing and 20%~58% compared with lossless-compressed checkpointing, in the presence of system failures. Comment: 14 pages, 10 figures, HPDC'1
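
    The idea of restarting an iterative solver from a distorted checkpoint can be sketched on a toy problem. This is an illustration under stated assumptions, not the paper's scheme: the solver is a plain Jacobi iteration on a small diagonally dominant system, and the "lossy compressor" is replaced by uniform quantization with an absolute error bound. All names and the 3x3 system are hypothetical.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep for the linear system A x = b."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def residual_norm(A, b, x):
    """Maximum absolute entry of A x - b."""
    n = len(b)
    return max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))

def iterate(A, b, x, tol=1e-10, max_iter=10_000):
    """Run Jacobi sweeps until the residual drops below tol."""
    iters = 0
    while residual_norm(A, b, x) > tol and iters < max_iter:
        x = jacobi_step(A, b, x)
        iters += 1
    return x, iters

def lossy_checkpoint(x, error_bound):
    """Stand-in for a lossy compressor: snap values to a grid of width
    2 * error_bound, so each value moves by at most error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) * step for v in x]

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]

# Run ten sweeps, then simulate a failure and compare both restart paths.
x = [0.0, 0.0, 0.0]
for _ in range(10):
    x = jacobi_step(A, b, x)
saved = lossy_checkpoint(x, error_bound=1e-3)

x_exact, iters_exact = iterate(A, b, x)      # restart from an exact checkpoint
x_lossy, iters_lossy = iterate(A, b, saved)  # restart from the lossy checkpoint
print(iters_exact, iters_lossy)
```

    Both restarts converge to the same tolerance. The difference between the two printed iteration counts is the kind of extra-iteration cost that the abstract's performance model bounds.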

    Near Lossless Time Series Data Compression Methods using Statistics and Deviation

    The last two decades have seen tremendous growth in data collections because of the realization of recent technologies, including the internet of things (IoT), E-Health, industrial IoT 4.0, autonomous vehicles, etc. The challenge of data transmission and storage can be handled by utilizing state-of-the-art data compression methods. Recent data compression methods based on deep learning perform better than conventional methods. However, these methods require a lot of data and resources for training. Furthermore, it is difficult to deploy these deep learning-based solutions on IoT devices because such devices are resource-constrained. In this paper, we propose lightweight data compression methods based on data statistics and deviation. The proposed method performs better than the deep learning method in terms of compression ratio (CR). We simulate and compare the proposed data compression methods for various time series signals, e.g., accelerometer, gas sensor, gyroscope, electrical power consumption, etc. In particular, it is observed that the proposed method achieves 250.8%, 94.3%, and 205% higher CR than the deep learning method for the GYS, Gactive, and ACM datasets, respectively. The code and data are available at https://github.com/vidhi0206/data-compression. Comment: 6 pages, 2 figures and 9 tables are include
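
    The abstract does not spell out the method, so the following is only a hypothetical illustration of what a statistics-and-deviation scheme could look like, together with the usual definition of compression ratio (original size divided by compressed size). It stores the signal mean once and each sample as a small integer multiple of a quantization step. The function names, the fixed-width code layout, and the sample signal are all assumptions.

```python
import statistics

def encode(signal, step):
    """Store the mean once and each sample's deviation as an integer code."""
    mean = statistics.fmean(signal)
    codes = [round((v - mean) / step) for v in signal]
    return mean, codes

def decode(mean, codes, step):
    """Near-lossless reconstruction; error is about step / 2 at most."""
    return [mean + c * step for c in codes]

def compression_ratio(codes, bits_per_value=32, mean_bits=64):
    """Original bits divided by compressed bits for fixed-width signed codes."""
    width = max(abs(c) for c in codes).bit_length() + 1  # +1 for the sign bit
    compressed_bits = mean_bits + len(codes) * width
    return len(codes) * bits_per_value / compressed_bits

# Hypothetical sensor readings that vary little around their mean.
signal = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 20.0, 19.8] * 16
step = 0.1
mean, codes = encode(signal, step)
restored = decode(mean, codes, step)

print(max(abs(a - b) for a, b in zip(signal, restored)))
print(compression_ratio(codes))
```

    The step size trades reconstruction error against code width: a larger step gives smaller integer codes and a higher CR, at the cost of a larger maximum deviation from the original samples.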