Stencil codes on a vector length agnostic architecture
Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is susceptible to code complexity, and usually limited due to data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is Vector-Length Agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length.
In this paper we leverage the main characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2,048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 56.6% for 2,048-bit vectors. In addition, we show that certain optimizations can hurt performance due to a reduction in arithmetic intensity, and provide insight useful for compiler optimizers.
This work has been partially supported by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR-1414). The Mont-Blanc project receives funding from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreements no. 671697 and no. 779877. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number FJCI-2015-24753.
Peer Reviewed. Postprint (author's final draft)
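As a rough illustration of the vector-length-agnostic idea described in the abstract above, the following C sketch emulates a VLA loop for a 1D three-point stencil. It is not the paper's code and uses no SVE intrinsics: the function names, the averaging stencil, and the `vl` parameter (a lane count supplied only at run time) are assumptions made for illustration. The inner loop stands in for a single predicated vector operation, and `active` plays the role of the predicate that masks off lanes past the end of the data.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: each interior point becomes the mean of itself
 * and its two neighbours. Boundary points are left untouched. */
void stencil3_scalar(const double *in, double *out, size_t n) {
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

/* VLA-style version (emulated in plain C, hypothetical helper).
 * `vl` is the number of lanes per vector and is only known at run time,
 * so the same compiled code works for any vector length. */
void stencil3_vla(const double *in, double *out, size_t n, size_t vl) {
    assert(vl > 0);
    for (size_t i = 1; i + 1 < n; i += vl) {
        /* Emulated predicate: only lanes that map to interior points
         * are active, so no separate scalar tail loop is needed. */
        size_t remaining = n - 1 - i;
        size_t active = remaining < vl ? remaining : vl;
        for (size_t lane = 0; lane < active; lane++) {
            size_t j = i + lane;
            out[j] = (in[j - 1] + in[j] + in[j + 1]) / 3.0;
        }
    }
}
```

With 64-bit elements, the 128-bit to 2,048-bit range mentioned in the abstract corresponds to `vl` values from 2 to 32; the sketch produces the same output as the scalar reference for any positive `vl`.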
ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code
Automatic code optimization is a complex process that typically involves the
application of multiple discrete algorithms that modify the program structure
irreversibly. However, the design of these algorithms is often monolithic, and
they require repetitive implementation to perform similar analyses due to the
lack of cooperation. To address this issue, modern optimization techniques,
such as equality saturation, allow for exhaustive term rewriting at various
levels of inputs, thereby simplifying compiler design.
In this paper, we propose equality saturation to optimize sequential codes
utilized in directive-based programming for GPUs. Our approach simultaneously
realizes less computation, less memory access, and high memory throughput. Our
fully-automated framework constructs single-assignment forms from inputs to be
entirely rewritten while keeping dependencies and extracts optimal cases.
Through practical benchmarks, we demonstrate a significant performance
improvement on several compilers. Furthermore, we highlight the advantages of
computational reordering and emphasize the significance of memory-access order
for modern GPUs.
- …