732 research outputs found
Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
This paper introduces Tiramisu, a polyhedral framework designed to generate
high performance code for multiple platforms including multicores, GPUs, and
distributed machines. Tiramisu introduces a scheduling language with novel
extensions to explicitly manage the complexities that arise when targeting
these systems. The framework is designed for the areas of image processing,
stencils, linear algebra and deep learning. Tiramisu has two main features: it
relies on a flexible representation based on the polyhedral model and it has a
rich scheduling language allowing fine-grained control of optimizations.
Tiramisu uses a four-level intermediate representation that allows full
separation between the algorithms, loop transformations, data layouts, and
communication. This separation simplifies targeting multiple hardware
architectures with the same algorithm. We evaluate Tiramisu by writing a set of
image processing, deep learning, and linear algebra benchmarks and compare them
with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu
matches or outperforms existing compilers and libraries on different hardware
architectures, including multicore CPUs, GPUs, and distributed machines.
Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041
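As a rough illustration of the separation between algorithm and schedule that the abstract describes, the following Python sketch keeps one computation fixed and swaps the loop order that executes it. It does not use Tiramisu's API. The names `blur_point`, `row_major`, `column_major`, and `run` are hypothetical and exist only for this example.

```python
from itertools import product


def blur_point(img, i, j):
    """Algorithm: what is computed at point (i, j), a 1x3 horizontal average."""
    return (img[i][j - 1] + img[i][j] + img[i][j + 1]) / 3.0


def row_major(rows, cols):
    """Schedule A (hypothetical): visit points row by row."""
    return product(rows, cols)


def column_major(rows, cols):
    """Schedule B (hypothetical): visit points column by column (loop interchange)."""
    return ((i, j) for j, i in product(cols, rows))


def run(img, schedule):
    """Apply the algorithm to every interior column in the order the schedule gives."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for i, j in schedule(range(height), range(1, width - 1)):
        out[i][j] = blur_point(img, i, j)
    return out


image = [[float(r * 4 + c) for c in range(4)] for r in range(3)]
result_a = run(image, row_major)
result_b = run(image, column_major)
```

Because each output point reads only from the input image, either visiting order is legal and both runs produce the same output. Changing the schedule never requires touching `blur_point`.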
Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation
The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP.
In this thesis, we propose an accelerator model for annotated C/C++ code together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like number of threads per block) and machine parameters (like shared memory size) are treated as unknown symbols. Hence, these parameters need not be known at code-generation time: both machine parameters and program parameters can be determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile time in the form of a case discussion, where cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread-blocks). Generating parametric CUDA kernels requires dealing with non-linear polynomial expressions during the dependence analysis and tiling phases. To carry out these algebraic calculations, we take advantage of techniques from computer algebra, in particular the RegularChains library of Maple. Various illustrative examples are provided together with performance evaluation.
Source-to-source compilation of loop programs for manycore processors
It is widely accepted today that the end of microprocessor performance growth
based on increasing clock speeds and instruction-level parallelism (ILP)
demands new ways of exploiting transistor densities.
Manycore processors (most commonly known as
GPGPUs or simply GPUs) provide a viable solution to this performance
scaling bottleneck through large numbers of lightweight compute cores
and memory hierarchies that rely primarily on software for their
efficient utilization. The widespread proliferation of this class of
architectures today is a clear indication that exposing and managing
parallelism on a large scale as well as efficiently orchestrating
on-chip data movement is becoming an increasingly critical concern for
high-performance software development. In such a computing landscape,
performance portability -- the ability to exploit the power of a variety
of manycore chips while minimizing the impact on software development
and productivity -- is perhaps one of the most important and challenging
objectives for our research community.
This thesis is about
performance portability for manycore processors and how source-to-source
compilation can help us achieve it. In particular, we show that for an
important set of loop-programs, performance portability is
attainable at low cost through compile-time polyhedral analysis and optimization
and parametric tiling for run-time performance
tuning. In other words, we propose and evaluate a source-to-source
compilation path that takes affine loop-programs as input and
produces parametrically tiled parallel code amenable to run-time tuning
across different manycore platforms and devices -- a very useful
and powerful property if we seek performance portability because it
decouples the compiler from the performance tuning process. The produced
code relies on a platform-independent run-time environment, called Avelas,
that allows us to formulate a robust and portable code generation algorithm.
Our experimental evaluation shows that Avelas induces low run-time overhead
and even substantial speed-ups for wavefront-parallel programs compared to a state-of-the-art
compile-time scheme with no run-time support. We also claim that the low overhead of Avelas is a strong
indication that it can also be effective as a general-purpose programming model
for manycore processors, as we demonstrate for a set of ParBoil benchmarks.
Open Access
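To make the idea of parametric tiling concrete, here is a minimal Python sketch in which the tile sizes are ordinary run-time arguments rather than constants fixed when the loop nest is written. It is not the Avelas run-time and it does not model wavefront parallelism. The names `tiled_points`, `transpose`, and `candidates` are hypothetical.

```python
def tiled_points(n, m, tile_i, tile_j):
    """Yield every (i, j) of an n x m iteration space, one tile at a time.

    tile_i and tile_j are run-time parameters; min() clips partial edge tiles.
    """
    for ti in range(0, n, tile_i):
        for tj in range(0, m, tile_j):
            for i in range(ti, min(ti + tile_i, n)):
                for j in range(tj, min(tj + tile_j, m)):
                    yield i, j


def transpose(a, tile_i, tile_j):
    """Transpose a matrix, visiting its elements in tiled order."""
    n, m = len(a), len(a[0])
    out = [[None] * n for _ in range(m)]
    for i, j in tiled_points(n, m, tile_i, tile_j):
        out[j][i] = a[i][j]
    return out


matrix = [[r * 5 + c for c in range(5)] for r in range(3)]
# Candidate tile sizes that a tuner might try at run time (assumed values).
candidates = [(1, 1), (2, 3), (4, 4)]
results = [transpose(matrix, ti, tj) for ti, tj in candidates]
```

Every candidate yields the same transposed matrix, so a tuner is free to pick tile sizes per device without regenerating the loop nest. That decoupling is the property the abstract refers to.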
LLOV: A Fast Static Data-Race Checker for OpenMP Programs
In the era of Exascale computing, writing efficient parallel programs is
indispensable and at the same time, writing sound parallel programs is very
difficult. Specifying parallelism with frameworks such as OpenMP is relatively
easy, but data races in these programs are an important source of bugs. In this
paper, we propose LLOV, a fast, lightweight, language agnostic, and static data
race checker for OpenMP programs based on the LLVM compiler framework. We
compare LLOV with other state-of-the-art data race checkers on a variety of
well-established benchmarks. We show that the precision, accuracy, and F1
score of LLOV are comparable to those of other checkers, while LLOV is orders of magnitude
faster. To the best of our knowledge, LLOV is the only tool among the
state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program
to be data race free.
Comment: Accepted in ACM TACO, August 202
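For readers unfamiliar with the kind of bug involved, the following Python sketch checks whether a parallelized loop of the form `a[i + write_offset] = f(a[i + read_offset])` has two different iterations touching the same element, with one of them writing. It is a brute-force check over a concrete trip count and is not LLOV's analysis. The function `has_race` and its parameters are hypothetical.

```python
def has_race(n, write_offset, read_offset):
    """Return True if, for i in range(n), some iteration reads an element
    that a *different* iteration writes.

    Models a loop body a[i + write_offset] = f(a[i + read_offset]) whose
    iterations all run concurrently. Each iteration writes a distinct
    element, so only write/read conflicts are possible.
    """
    # Map each written index to the iteration that writes it.
    writer_of = {i + write_offset: i for i in range(n)}
    for i in range(n):
        writer = writer_of.get(i + read_offset)
        if writer is not None and writer != i:
            return True
    return False


independent = has_race(8, 0, 0)    # a[i] = f(a[i])
carried = has_race(8, 0, -1)       # a[i] = f(a[i - 1])
disjoint = has_race(8, 0, 8)       # a[i] = f(a[i + 8])
```

Only the second loop carries a dependence between iterations, so it is the only one of the three that would race if its iterations ran in parallel.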
Easing parallel programming on heterogeneous systems
The most common way to solve HPC (High Performance Computing) applications in reasonable execution times and in a scalable manner is to use parallel computing systems. The current trend in HPC systems is to include several computing devices, of different types and architectures, in the same machine.
However, their use poses specific challenges for the programmer. A programmer must be an expert in the existing tools and abstractions for distributed memory, the programming models for shared-memory systems, and the specific programming models for each type of co-processor, in order to create hybrid programs that can efficiently exploit all the capabilities of the machine.
Currently, all of these problems must be solved by the programmer, which makes programming a heterogeneous machine a real challenge.
This thesis addresses several of the main problems related to the parallel programming of highly heterogeneous and distributed systems. It presents proposals that solve problems ranging from the creation of codes that are portable across different types of devices, accelerators, and architectures while achieving maximum efficiency, to the problems that arise in distributed-memory systems in relation to communications and the partitioning of data structures.
Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos)
Doctorado en Informática