Towards an Achievable Performance for the Loop Nests
Numerous code optimization techniques, including loop nest optimizations,
have been developed over the last four decades. Loop optimization techniques
transform loop nests to improve the performance of the code on a target
architecture, including exposing parallelism. Finding and evaluating an
optimal, semantic-preserving sequence of transformations is a complex problem.
The search is guided by heuristics and/or analytical models, and there is
no way of knowing how close the result comes to optimal performance or whether any
headroom for improvement remains. This paper makes two contributions. First, it uses a
comparative analysis of loop optimizations/transformations across multiple
compilers to determine how much headroom may exist for each compiler. Second,
it presents an approach that characterizes loop nests by their hardware
performance counter values and uses Machine Learning to predict which compiler
will generate the fastest code for a given loop nest. The
prediction is made for both auto-vectorized, serial compilation and for
auto-parallelization. The results show that the headroom for state-of-the-art
compilers ranges from 1.10x to 1.42x for the serial code and from 1.30x to
1.71x for the auto-parallelized code; these results are based on the Machine
Learning predictions.

Comment: Accepted at the 31st International Workshop on Languages and
Compilers for Parallel Computing (LCPC 2018).
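Below is a minimal sketch, in scikit-learn, of the second contribution as the abstract describes it: characterize each loop nest by its hardware performance counter values and train a classifier that predicts which compiler produces the fastest code. The specific counters, compiler labels, and choice of model are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch (not the authors' code) of predicting the best compiler
# for a loop nest from its hardware performance counter values.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row holds counters measured for one loop nest, e.g. instructions
# retired, L1 misses, LLC misses, branch mispredictions (assumed features).
X = np.array([
    [1.2e9, 3.1e6, 4.0e5, 2.2e4],
    [8.9e8, 1.0e7, 9.5e5, 1.1e4],
    [2.3e9, 5.5e5, 1.2e5, 7.8e3],
])
# Label: which compiler produced the fastest binary for that loop nest
# (0, 1, 2 standing in for three different compilers).
y = np.array([0, 2, 1])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:1]))  # predicted best compiler for a new loop nest
```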
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM-based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.Comment: Published at 36th International Conference on Machine Learning (ICML)
201
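The following is a hedged PyTorch sketch of the hierarchical design the abstract describes: a token-level LSTM summarizes the opcode and operand tokens of each instruction, and an instruction-level LSTM consumes those summaries to regress the block's throughput. Layer sizes, the tokenization, and the regression head are assumptions, not the released Ithemal implementation.

```python
# A minimal hierarchical-LSTM throughput predictor (illustrative, assumed).
import torch
import torch.nn as nn

class HierarchicalThroughputModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.token_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.instr_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # predicted steady-state cycles

    def forward(self, block):
        # block: list of LongTensors, one per instruction, holding token ids
        # for that instruction's opcode and operands.
        summaries = []
        for tokens in block:
            emb = self.embed(tokens).unsqueeze(0)  # (1, n_tokens, E)
            _, (h, _) = self.token_lstm(emb)       # final hidden state
            summaries.append(h[-1])                # (1, H) per instruction
        seq = torch.stack(summaries, dim=1)        # (1, n_instr, H)
        _, (h, _) = self.instr_lstm(seq)
        return self.head(h[-1]).squeeze()          # scalar throughput

# Toy usage: a two-instruction basic block encoded as made-up token ids.
model = HierarchicalThroughputModel(vocab_size=100)
block = [torch.tensor([3, 7, 8]), torch.tensor([4, 9, 7])]
print(model(block))
```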
Performance Improvement in Kernels by Guiding Compiler Auto-Vectorization Heuristics
Vectorization support in hardware continues to expand and grow even as we continue to rely on superscalar architectures. Unfortunately, compilers are not always able to generate optimal code for the hardware; detecting and generating vectorized code is extremely complex. Programmers can use a number of tools to aid in development and tuning, but most of these tools require expert or domain-specific knowledge to use. In this work we aim to provide techniques for determining the best way to optimize certain codes, with an end goal of guiding the compiler into generating optimized code without requiring expert knowledge from the developer. Initially, we study how to combine vectorization reports with iterative compilation and code generation, and we summarize our insights and patterns on how the compiler vectorizes code. Our utilities for iterative compilation and code generation can be further used by non-experts in the generation and analysis of programs. Finally, we leverage the obtained knowledge to design a Support Vector Machine classifier that predicts the speedup of a program given a sequence of optimizations, with 82% of predictions accurate to within 15% of over- or under-prediction.
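As a rough illustration of the final step, the sketch below fits a support vector model that predicts speedup from features one might extract from vectorization reports and iterative compilation runs. The feature encoding is invented for this example, and the regression variant (SVR) is used for simplicity where the paper describes an SVM classifier.

```python
# An illustrative sketch: predict the speedup of a kernel under a given
# optimization sequence from hand-picked features (assumed, not the paper's).
import numpy as np
from sklearn.svm import SVR

# Each row: trip count, access stride, whether the vectorization report
# marked the loop as vectorized, and a flag for the optimization applied;
# the target is the measured speedup over the baseline build.
X = np.array([
    [1024, 1, 1, 0],
    [1024, 4, 0, 1],
    [ 256, 1, 1, 1],
    [ 512, 2, 0, 0],
])
y = np.array([3.8, 1.1, 2.4, 1.0])

model = SVR(kernel="rbf", C=10.0)
model.fit(X, y)
print(model.predict([[2048, 1, 1, 1]]))  # predicted speedup
```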
The Janus triad: Exploiting parallelism through dynamic binary modification
We present a unified approach for exploiting thread-level, data-level, and memory-level parallelism through a same-ISA dynamic binary modifier guided by static binary analysis. A static binary analyser first examines an executable and determines the operations required to extract parallelism at runtime, encoding them as a series of rewrite rules that a dynamic binary modifier uses to perform binary transformation. We demonstrate this framework by exploiting three different kinds of parallelism to perform automatic vectorisation, software prefetching, and automatic parallelisation together on legacy application binaries. Software prefetch insertion alone achieves an average speedup of 1.2×, comparing favourably with an automatic compiler pass. Automatic vectorisation brings speedups of 2.7× on the TSVC benchmarks, significantly beating a compiler approach for some workloads. Finally, combining prefetching, vectorisation, and parallelisation realises a speedup of 3.8× on a representative application loop.
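Here is a conceptual sketch of the division of labour the abstract describes: static analysis emits rewrite rules keyed by code address, and the dynamic modifier consults them before translating each basic block. The rule fields, names, and API are illustrative assumptions, not the Janus implementation.

```python
# A conceptual sketch only: rewrite rules from static analysis drive a
# dynamic binary modifier. All names and fields here are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class RewriteRule:
    address: int   # location in the binary the rule applies to
    kind: str      # e.g. "prefetch", "vectorize", "parallelize"
    payload: dict  # transformation parameters decided statically

def on_basic_block(addr: int, rules: Dict[int, RewriteRule],
                   transform: Callable[[RewriteRule], None]) -> None:
    """Hook the dynamic modifier would call before translating a block."""
    rule = rules.get(addr)
    if rule is not None:
        transform(rule)  # rewrite the code-cache copy of the block

# Example: a statically derived rule requesting a software prefetch 64 bytes
# ahead of a delinquent load at a hypothetical address.
rules = {0x401A30: RewriteRule(0x401A30, "prefetch", {"distance": 64})}
on_basic_block(0x401A30, rules, lambda r: print("rewriting with", r))
```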
Optimization and scalability of tiled code generation
PIPS is a source-to-source compiler developed by CRI ParisTech that performs loop tiling to enforce locality and parallelism. In this work we designed a new PIPS phase performing invariant code optimization and parallel directive selection on the generated tiled code. We obtained scalable tiled code and minimized the parallel directive overhead. The current PIPS-generated code outperforms the previous one and achieves results comparable to other state-of-the-art code optimizers in terms of speed-up.
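To make the transformation concrete, here is a minimal Python illustration of loop tiling (illustrative only, not PIPS output): the iteration space is blocked so each tile's working set stays cache-resident, the tile loops become natural targets for parallel directives, and hoisting the accumulator out of the innermost loop mirrors the invariant code optimization mentioned above.

```python
# Loop tiling sketch, using matrix multiplication as the loop nest.
import numpy as np

def matmul_tiled(A, B, tile=16):
    n = A.shape[0]
    C = np.zeros((n, n))
    for ii in range(0, n, tile):          # tile loops: coarse-grained
        for jj in range(0, n, tile):      # iterations, natural targets
            for kk in range(0, n, tile):  # for parallel directives
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        s = C[i, j]       # hoisted: invariant in the k loop
                        for k in range(kk, min(kk + tile, n)):
                            s += A[i, k] * B[k, j]
                        C[i, j] = s
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(matmul_tiled(A, B), A @ B)
```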