36 research outputs found

    Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture


    CEAZ: Accelerating Parallel I/O via Hardware-Algorithm Co-Design of Efficient and Adaptive Lossy Compression

    As supercomputers continue to grow toward exascale, the amount of data that needs to be saved or transmitted is exploding. To this end, many previous works have studied using error-bounded lossy compressors to reduce the data size and improve I/O performance. However, little work has been done on effectively offloading lossy compression onto FPGA-based SmartNICs to reduce the compression overhead. In this paper, we propose a hardware-algorithm co-design of an efficient and adaptive lossy compressor for scientific data on FPGAs (called CEAZ) to accelerate parallel I/O. Our contribution is fourfold: (1) We propose an efficient Huffman coding approach that can adaptively update Huffman codewords online based on codewords generated offline (from a variety of representative scientific datasets). (2) We derive a theoretical analysis to support precise control of the compression ratio under an error-bounded compression mode, enabling accurate offline Huffman codeword generation. This also helps us create a fixed-ratio compression mode for consistent throughput. (3) We develop an efficient compression pipeline by adapting cuSZ's dual-quantization algorithm to our hardware use case. (4) We evaluate CEAZ on five real-world datasets with both a single FPGA board and 128 nodes of the Bridges-2 supercomputer. Experiments show that CEAZ outperforms the second-best FPGA-based lossy compressor by 2X in throughput and 9.6X in compression ratio. It also improves MPI_File_write and MPI_Gather throughputs by up to 25.8X and 24.8X, respectively.
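    To make the dual-quantization step concrete, here is a minimal sketch of the idea CEAZ adapts from cuSZ, reduced to one dimension and plain NumPy. The function names, the previous-neighbor (Lorenzo) predictor, and the omission of outlier handling and the FPGA/Huffman stages are simplifications of mine, not the paper's implementation.

```python
import numpy as np

def dual_quantize_1d(data, error_bound):
    """Illustrative sketch of the dual-quantization idea (simplified, 1-D,
    no outlier handling): prequantize each value to an integer multiple of
    2*error_bound, then take previous-neighbor (Lorenzo) prediction residuals
    in the integer domain so they are exact."""
    # Step 1: prequantization -- every value becomes an integer code.
    codes = np.round(data / (2.0 * error_bound)).astype(np.int64)
    # Step 2: Lorenzo prediction in the integer domain; residuals are exact.
    residuals = np.empty_like(codes)
    residuals[0] = codes[0]
    residuals[1:] = codes[1:] - codes[:-1]
    return residuals  # small integers suitable for an entropy (Huffman) encoder

def dual_dequantize_1d(residuals, error_bound):
    """Invert the sketch above; reconstruction error stays within error_bound."""
    codes = np.cumsum(residuals)
    return codes * (2.0 * error_bound)
```

    Prequantizing before prediction keeps the residuals exact integers, which is what makes them directly suitable for the Huffman encoding stage described in the abstract.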

    Ultrafast Error-Bounded Lossy Compression for Scientific Datasets

    Today's scientific high-performance computing applications and advanced instruments are producing vast volumes of data across a wide range of domains, which imposes a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community because it not only significantly reduces data volumes but also strictly controls the data distortion based on a user-specified error bound. Existing lossy compressors, however, cannot offer ultrafast compression speed, which is highly demanded by numerous applications or use cases (such as in-memory compression and online instrument data compression). In this paper, we propose a novel ultrafast error-bounded lossy compressor that can obtain fairly high compression performance on both CPUs and GPUs with reasonably high compression ratios. The key contributions are threefold. (1) We propose a generic error-bounded lossy compression framework, called SZx, that achieves ultrafast performance through its novel design comprising only lightweight operations such as bitwise and addition/subtraction operations, while still keeping a high compression ratio. (2) We implement SZx on both CPUs and GPUs and optimize the performance according to their architectures. (3) We perform a comprehensive evaluation with six real-world production-level scientific datasets on both CPUs and GPUs. Experiments show that SZx is 2x to 16x faster than the second-fastest existing error-bounded lossy compressor (either SZ or ZFP) on CPUs and GPUs, with respect to both compression and decompression.
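    As an illustration of what "only lightweight operations" can mean in practice, the sketch below shows one such building block: detecting blocks whose values all lie within the error bound of a representative value, so that the whole block can be stored as that single value. This is a hand-written simplification in the spirit of the abstract, not the authors' actual SZx algorithm, and the function name and block size are placeholders of mine.

```python
import numpy as np

def detect_constant_blocks(data, error_bound, block_size=128):
    """Illustrative sketch (not the authors' exact algorithm): scan fixed-size
    blocks of a 1-D array and mark those whose values all fall within the error
    bound of the block's first value; such blocks can be stored as a single
    representative, using only subtraction and comparisons."""
    flags = []
    representatives = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        rep = block[0]
        if np.all(np.abs(block - rep) <= error_bound):
            flags.append(True)    # "constant" block: store only the representative
        else:
            flags.append(False)   # non-constant block: fall back to a denser encoding
        representatives.append(rep)
    return np.array(flags), np.array(representatives)
```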

    Approachable Error Bounded Lossy Compression

    Compression is commonly used in HPC applications to move and store data. Traditional lossless compression, however, does not provide adequate compression of the floating-point data often found in scientific codes. Recently, researchers and scientists have turned to lossy compression techniques that approximate the original data rather than reproduce it in order to achieve the desired levels of compression. Typical lossy compressors do not bound the errors introduced into the data, leading to the development of error-bounded lossy compressors (EBLC). These tools provide the desired levels of compression as well as mathematical guarantees on the errors introduced. However, the current state of EBLC leaves much to be desired. Existing EBLCs all have different interfaces, requiring codes to be changed to adopt new techniques; EBLCs have many more configuration options than their predecessors, making them more difficult to use; and EBLCs typically bound quantities such as pointwise errors rather than the higher-level metrics, such as spectra, p-values, or test statistics, that scientists typically use. My dissertation aims to provide a uniform interface to compression and to develop tools that allow application scientists to understand and apply EBLC. This dissertation proposal presents three groups of work: LibPressio, a standard interface for compression and analysis; FRaZ/LibPressio-Opt, frameworks for the automated configuration of compressors using LibPressio; and tools for analyzing errors in particular domains.
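    The value of a uniform interface is easiest to see with a small sketch. The class and method names below are hypothetical stand-ins, not LibPressio's actual API; the point is only that once every EBLC sits behind the same options/compress/decompress surface, an application can swap compressors or sweep configurations without touching its own code.

```python
import numpy as np

class Compressor:
    """Hypothetical uniform interface in the spirit of the abstract; these
    names are illustrative and do not reflect LibPressio's real C/C++ API."""
    def set_options(self, **options):
        raise NotImplementedError          # e.g. absolute error bound
    def compress(self, array: np.ndarray) -> bytes:
        raise NotImplementedError
    def decompress(self, blob: bytes, like: np.ndarray) -> np.ndarray:
        raise NotImplementedError

def evaluate(compressor: Compressor, array: np.ndarray, error_bound: float):
    """Any EBLC behind the same interface can be evaluated identically:
    report the compression ratio and the maximum pointwise error."""
    compressor.set_options(abs_error_bound=error_bound)
    blob = compressor.compress(array)
    restored = compressor.decompress(blob, like=array)
    ratio = array.nbytes / len(blob)
    max_err = float(np.max(np.abs(array - restored)))
    return ratio, max_err
```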

    Integration of Rosenbrock-type solvers into CAM4-Chem and evaluation of its performance in the perspectives of science and computation

    In this study, the perennial problem of overestimation of ozone concentration in the global chemistry-climate model CAM4-Chem [Community Earth System Model with chemistry activated] is investigated from the perspectives of numerics and computation. High-order Rosenbrock-type solvers are implemented in CAM4-Chem, motivated by their higher-order accuracy and better computational efficiency. The results are evaluated against observational data, and the ROS-2 [second-order Rosenbrock] solver reduces the positive bias in ozone concentration, both horizontally and vertically, in most regions. The largest reduction occurs at the mid-latitudes of the northern hemisphere, where the bias is generally high, and during summertime, when photochemical reactions are most active. In addition, the ROS-2 solver achieves a ~2x speed-up compared to the original IMP [first-order implicit] solver. This improvement is mainly due to the reuse of the Jacobian matrix and its LU [lower-upper] factorization during the two-stage calculation. To gain further speed-up, we port the ROS-2 solver to the GPU [graphics processing unit] and compare its performance with the CPU. The speed-up of the GPU version with the optimized configuration reaches a factor of ~11.7x for the computation alone and ~3.82x when the data movement between CPU and GPU is included. The computational time of the GPU version grows more slowly than that of the CPU version as a function of the number of loop iterations, which makes the GPU version more attractive for massive computations. Moreover, under stochastic perturbation of the initial input, we find that the ROS-3 [third-order Rosenbrock] solver yields better convergence properties than the ROS-2 and IMP solvers. However, the ROS-3 solver generally produces a further overestimation of ozone concentration when it is implemented in CAM4-Chem. This is because the ROS-3 solver involves more frequent time-step refinements, which also make it less computationally efficient than the IMP and ROS-2 solvers. We also investigate the effect of grid resolution; the fine resolution provides relatively better pattern correlation than the coarse resolution, given the same chemical solver.
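    The speed-up mechanism described above, reusing one Jacobian evaluation and one LU factorization across both stages, can be sketched as follows for a generic ODE system y' = f(y). The coefficients are the commonly used second-order Rosenbrock choice with gamma = 1 + 1/sqrt(2); the function and variable names are illustrative, and this is not the CAM4-Chem implementation.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

GAMMA = 1.0 + 1.0 / np.sqrt(2.0)  # standard coefficient for the two-stage ROS-2 scheme

def ros2_step(f, jac, y, h):
    """One two-stage second-order Rosenbrock (ROS-2) step for y' = f(y).
    The Jacobian is evaluated once and its LU factorization is shared by
    both stages, which is the source of the speed-up noted in the abstract."""
    n = y.size
    J = jac(y)                                       # Jacobian evaluated once per step
    lu, piv = lu_factor(np.eye(n) - GAMMA * h * J)   # one LU factorization ...
    k1 = lu_solve((lu, piv), f(y))                   # ... reused by stage 1
    k2 = lu_solve((lu, piv), f(y + h * k1) - 2.0 * k1)  # ... and by stage 2
    return y + 1.5 * h * k1 + 0.5 * h * k2
```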