2 research outputs found

    A domain-extensible compiler with controllable automation of optimisations

    Get PDF
    In high performance domains like image processing, physics simulation or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation. This thesis shows the potential of the Shine compiler to achieve domain-extensibility, controllable automation, and generate high performance code. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely: synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with competitive runtime compared to established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine achieves 6 well-known image processing optimisations, 2 of them not being supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation. The final research contribution is to introduce sketch-guided equality saturation, a semiautomated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations of matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation
    corecore