Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
This paper introduces Tiramisu, a polyhedral framework designed to generate
high performance code for multiple platforms including multicores, GPUs, and
distributed machines. Tiramisu introduces a scheduling language with novel
extensions to explicitly manage the complexities that arise when targeting
these systems. The framework is designed for the areas of image processing,
stencils, linear algebra and deep learning. Tiramisu has two main features: it
relies on a flexible representation based on the polyhedral model and it has a
rich scheduling language allowing fine-grained control of optimizations.
Tiramisu uses a four-level intermediate representation that allows full
separation between the algorithms, loop transformations, data layouts, and
communication. This separation simplifies targeting multiple hardware
architectures with the same algorithm. We evaluate Tiramisu by writing a set of
image processing, deep learning, and linear algebra benchmarks and compare them
with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu
matches or outperforms existing compilers and libraries on different hardware
architectures, including multicore CPUs, GPUs, and distributed machines.
Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041
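The separation between algorithm and loop transformations described in this abstract can be illustrated with a minimal Python sketch. This is not Tiramisu's API: the names `algorithm`, `row_major`, `tiled`, and `run` are hypothetical, and loop tiling stands in for the much richer set of scheduling commands the paper describes. The point is only that the computation is written once while the iteration order is chosen separately.

```python
def algorithm(i, j):
    """What to compute: a pure function of the iteration point (i, j)."""
    return 3 * i + j


def row_major(n, m):
    """Schedule 1: visit the n x m iteration space row by row."""
    for i in range(n):
        for j in range(m):
            yield i, j


def tiled(n, m, tile=2):
    """Schedule 2: visit the same space in tile x tile blocks."""
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    yield i, j


def run(schedule, n, m):
    """Execute the algorithm in the order dictated by the schedule."""
    out = [[None] * m for _ in range(n)]
    for i, j in schedule(n, m):
        out[i][j] = algorithm(i, j)
    return out
```

Both schedules visit every point exactly once, so `run(row_major, 4, 5)` and `run(tiled, 4, 5)` produce the same result even though the traversal order differs.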
Towards an Achievable Performance for the Loop Nests
Numerous code optimization techniques, including loop nest optimizations,
have been developed over the last four decades. Loop optimization techniques
transform loop nests to improve the performance of the code on a target
architecture, including exposing parallelism. Finding and evaluating an
optimal, semantic-preserving sequence of transformations is a complex problem.
The sequence is guided using heuristics and/or analytical models and there is
no way of knowing how close it gets to optimal performance or if there is any
headroom for improvement. This paper makes two contributions. First, it uses a
comparative analysis of loop optimizations/transformations across multiple
compilers to determine how much headroom may exist for each compiler. And
second, it presents an approach to characterize the loop nests based on their
hardware performance counter values and a Machine Learning approach that
predicts which compiler will generate the fastest code for a loop nest. The
prediction is made for both auto-vectorized, serial compilation and for
auto-parallelization. The results show that the headroom for state-of-the-art
compilers ranges from 1.10x to 1.42x for the serial code and from 1.30x to
1.71x for the auto-parallelized code. These results are based on the Machine
Learning predictions.
Comment: Accepted at the 31st International Workshop on Languages and
Compilers for Parallel Computing (LCPC 2018)
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning
based approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grain classification among different
approaches and finally, the influential papers of the field.
Comment: version 5.0 (updated September 2018). Preprint version of our
accepted journal article at ACM CSUR 2018 (42 pages). This survey will be
updated quarterly here (send me your newly published papers to be added in the
subsequent version). History: Received November 2016; Revised August 2017;
Revised February 2018; Accepted March 2018
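To make the two problems named in this abstract concrete, the following Python sketch enumerates both search spaces for a tiny set of passes. The pass names are illustrative labels only and do not refer to any particular compiler's flags. Optimization selection is modeled as choosing a subset of passes, and phase-ordering as choosing a permutation of them.

```python
from itertools import combinations, permutations
from math import factorial

# Illustrative pass labels (assumed for this sketch, not real compiler flags).
PASSES = ["inline", "licm", "unroll", "vectorize"]


def selection_space(passes):
    """Optimization selection: every on/off choice, i.e. every subset."""
    return [
        subset
        for size in range(len(passes) + 1)
        for subset in combinations(passes, size)
    ]


def phase_orderings(passes):
    """Phase-ordering: every order in which all passes can be applied once."""
    return list(permutations(passes))


if __name__ == "__main__":
    n = len(PASSES)
    print(len(selection_space(PASSES)), "selections, expected", 2 ** n)
    print(len(phase_orderings(PASSES)), "orderings, expected", factorial(n))
```

With n passes there are 2^n selections and n! orderings of the full set, which is why exhaustive search quickly becomes impractical as the number of passes grows.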
TTC: A Tensor Transposition Compiler for Multiple Architectures
We consider the problem of transposing tensors of arbitrary dimension and
describe TTC, an open source domain-specific parallel compiler. TTC generates
optimized parallel C++/CUDA C code that achieves a significant fraction of the
system's peak memory bandwidth. TTC exhibits high performance across multiple
architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD
Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such
as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a
meaningful baseline implementation generated by external C++ compilers; the
results suggest that a domain-specific compiler can outperform its general
purpose counterpart significantly: For instance, comparing with Intel's latest
C++ compiler on the Haswell and Knights Corner architecture, TTC yields
speedups of up to and , respectively. We also showcase
TTC's support for multiple leading dimensions, making it a suitable candidate
for the generation of performance-critical packing functions that are at the
core of the ubiquitous BLAS 3 routines.
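For readers unfamiliar with the operation itself, here is a plain Python reference for transposing a tensor of arbitrary dimension. It is a naive, unoptimized sketch and is not code generated by TTC. The helper names `strides` and `transpose`, and the assumption of a flat row-major layout, are choices made for this example.

```python
from itertools import product


def strides(shape):
    """Row-major strides: the last dimension is contiguous."""
    result = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        result[i] = result[i + 1] * shape[i + 1]
    return result


def transpose(flat, shape, perm):
    """Permute the dimensions of a flat row-major tensor.

    Output dimension i is input dimension perm[i].
    Returns the permuted flat data and the new shape.
    """
    out_shape = [shape[p] for p in perm]
    in_strides = strides(shape)
    out_strides = strides(out_shape)
    out = [None] * len(flat)
    for idx in product(*(range(n) for n in out_shape)):
        src = sum(idx[i] * in_strides[perm[i]] for i in range(len(perm)))
        dst = sum(idx[i] * out_strides[i] for i in range(len(perm)))
        out[dst] = flat[src]
    return out, out_shape
```

For a 2 x 3 matrix stored as `[0, 1, 2, 3, 4, 5]`, `transpose(data, [2, 3], [1, 0])` returns `([0, 3, 1, 4, 2, 5], [3, 2])`. A generator such as TTC targets the same semantics but with blocking, vectorization, and parallelism chosen for the memory system.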
Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library
We present an analysis on optimizing performance of a single C++11 source
code using the Alpaka hardware abstraction library. For this we use the general
matrix multiplication (GEMM) algorithm in order to show that compilers can
optimize Alpaka code effectively when tuning key parameters of the algorithm.
We do not intend to rival existing, highly optimized DGEMM versions, but merely
choose this example to prove that Alpaka allows for platform-specific tuning
with a single source code. In addition we analyze the optimization potential
available with vendor-specific compilers when confronted with the heavily
templated abstractions of Alpaka. We specifically test the code for bleeding
edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL)
and Haswell architecture as well as IBM's Power8 system. On some of these we
are able to reach almost 50% of the peak floating point operation performance
using the aforementioned means. When adding compiler-specific #pragmas we are
able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
Comment: Accepted paper for the P^3MA workshop at the ISC 2017 in Frankfurt
Lost in translation: Exposing hidden compiler optimization opportunities
Existing iterative compilation and machine-learning-based optimization
techniques have been proven very successful in achieving better optimizations
than the standard optimization levels of a compiler. However, they were not
engineered to support the tuning of a compiler's optimizer as part of the
compiler's daily development cycle. In this paper, we first establish the
required properties which a technique must exhibit to enable such tuning. We
then introduce an enhancement to the classic nightly routine testing of
compilers which exhibits all the required properties, and thus, is capable of
driving the improvement and tuning of the compiler's common optimizer. This is
achieved by leveraging resource usage and compilation information collected
while systematically exploiting prefixes of the transformations applied at
standard optimization levels. Experimental evaluation using the LLVM v6.0.1
compiler demonstrated that the new approach was able to reveal hidden
cross-architecture and architecture-dependent potential optimizations on two
popular processors: the Intel i5-6300U and the Arm Cortex-A53-based Broadcom
BCM2837 used in the Raspberry Pi 3B+. As a case study, we demonstrate how the
insights from our approach enabled us to identify and remove a significant
shortcoming of the CFG simplification pass of the LLVM v6.0.1 compiler.
Comment: 31 pages, 7 figures, 2 tables. arXiv admin note: text overlap with
arXiv:1802.0984
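The idea of examining prefixes of an optimization pipeline can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's technique: the pass names are placeholders, the costs are synthetic numbers invented for the example, and `measure` stands in for whatever resource-usage measurement a real setup would collect.

```python
def prefixes(pipeline):
    """Yield every non-empty prefix of an ordered pass pipeline, shortest first."""
    for k in range(1, len(pipeline) + 1):
        yield pipeline[:k]


def first_regression(pipeline, measure):
    """Return the first prefix whose cost exceeds that of the prefix before it.

    `measure` maps a list of passes to a cost (lower is better).
    Returns None when no prefix makes the cost worse.
    """
    previous = measure([])
    for prefix in prefixes(pipeline):
        cost = measure(prefix)
        if cost > previous:
            return prefix
        previous = cost
    return None


# Placeholder pipeline and synthetic costs, indexed by prefix length (assumed).
PIPELINE = ["pass_a", "pass_b", "pass_c", "pass_d"]
SYNTHETIC_COST = {0: 100, 1: 90, 2: 95, 3: 80, 4: 80}


def synthetic_measure(prefix):
    """Stand-in for a real measurement such as run time or energy."""
    return SYNTHETIC_COST[len(prefix)]
```

With the synthetic costs above, `first_regression(PIPELINE, synthetic_measure)` returns `["pass_a", "pass_b"]`, flagging `pass_b` as the point where the measured cost first gets worse.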