Parallelization of Variable Rate Decompression for GPU Acceleration

Abstract

Data movement has long been identified as the biggest challenge facing modern computer systems designers. To tackle this challenge, many novel data compression algorithms have been developed. These compression algorithms can be embedded into bandwidth-bound applications to reduce their memory traffic volume. As a result, data decompression is, in many instances, on the critical path of the application's execution, while the compression itself can happen offline or outside of the critical path. Fast data decompression is therefore of utmost importance. However, most existing parallel decompression schemes adopt a single parallelization strategy suited to one particular hardware platform. Such an approach fails to harness the parallelism found in diverse modern hardware architectures. To this end, we propose multiple parallelization strategies for variable rate data decompression that aim to utilize parallel architectures efficiently. Our strategies are based on generating extra information during the encoding phase and passing this information to the decoder in a side channel. The decoder can then use that extra information to speed up the decoding process tremendously. To demonstrate the effectiveness of our strategies, we implement them in a state-of-the-art compression algorithm called ZFP and apply it to a real-life industrial application from ASML: a feed-forward control model for controlling wafer heat in EUV lithography machines. The application is dominated by matrix-vector multiplication (which is bandwidth-bound) and is executed on GPUs. We show that the parallelization strategies suited to multicore CPUs are different from the ones suited to GPUs. On a CPU, we achieve a near-optimal speedup and a size overhead that is consistently less than 0.04% of the compressed data size. On a GPU, we achieve a decoding throughput of more than 130 GiB/s, which allows us to execute the ASML application within the given time budget. Our implementation is publicly available on GitHub.
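
The side-channel idea can be illustrated with a minimal Python sketch. It is not ZFP's bitstream format and not the authors' implementation: the varint coding, the `BLOCK` size, the `blocks_per_chunk` parameter, and all function names are assumptions made for illustration. The encoder writes variable-length blocks and records the byte offset at which every chunk of blocks starts; the decoder uses those offsets to decode chunks independently instead of scanning the stream from the beginning. Python threads are used here only to show that the chunks are independent, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4  # values per block (hypothetical; chosen only for this sketch)


def put_varint(out, v):
    """Append a non-negative integer as a variable-length (LEB128-style) code."""
    while v >= 0x80:
        out.append((v & 0x7F) | 0x80)
        v >>= 7
    out.append(v)


def get_varint(buf, pos):
    """Read one varint starting at byte pos; return (value, next position)."""
    v = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        v |= (b & 0x7F) << shift
        if b < 0x80:
            return v, pos
        shift += 7


def encode(values, blocks_per_chunk):
    """Encode values and build the side channel: one byte offset per chunk."""
    stream, offsets = bytearray(), []
    for i in range(0, len(values), BLOCK):
        if (i // BLOCK) % blocks_per_chunk == 0:
            offsets.append(len(stream))  # where this chunk starts in the stream
        for v in values[i:i + BLOCK]:
            put_varint(stream, v)
    return bytes(stream), offsets


def decode_chunk(stream, start, count):
    """Decode count values beginning at byte offset start."""
    out, pos = [], start
    for _ in range(count):
        v, pos = get_varint(stream, pos)
        out.append(v)
    return out


def decode_parallel(stream, offsets, n_values, blocks_per_chunk, workers=4):
    """Decode every chunk independently using the recorded offsets."""
    per_chunk = BLOCK * blocks_per_chunk
    counts = [min(per_chunk, n_values - k * per_chunk)
              for k in range(len(offsets))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(decode_chunk, [stream] * len(offsets), offsets, counts)
    return [v for part in parts for v in part]
```

In this sketch, a larger `blocks_per_chunk` shrinks the offset table but exposes fewer independent chunks, which is one way to picture why the best granularity may differ between a multicore CPU and a GPU.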
