377 research outputs found

    Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

    Full text link
    This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and compare them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041

    ACOTES project: Advanced compiler technologies for embedded streaming

    Get PDF
    Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

    High-performance generalized tensor operations: A compiler-oriented approach

    Full text link
    The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. By making fast linear algebra kernels available to everyone, we expect productivity increases when optimized libraries are not available. © 2018 Association for Computing Machinery

    Automatic Performance Optimization of Stencil Codes

    Get PDF
    A widely used class of codes are stencil codes. Their general structure is very simple: data points in a large grid are repeatedly recomputed from neighboring values. This predefined neighborhood is the so-called stencil. Despite their very simple structure, stencil codes are hard to optimize since only few computations are performed while a comparatively large number of values have to be accessed, i.e., stencil codes usually have a very low computational intensity. Moreover, the set of optimizations and their parameters also depend on the hardware on which the code is executed. To cut a long story short, current production compilers are not able to fully optimize this class of codes and optimizing each application by hand is not practical. As a remedy, we propose a set of optimizations and describe how they can be applied automatically by a code generator for the domain of stencil codes. A combination of a space and time tiling is able to increase the data locality, which significantly reduces the memory-bandwidth requirements: a standard three-dimensional 7-point Jacobi stencil can be accelerated by a factor of 3. This optimization can target basically any stencil code, while others are more specialized. E.g., support for arbitrary linear data layout transformations is especially beneficial for colored kernels, such as a Red-Black Gauss-Seidel smoother. On the one hand, an optimized data layout for such kernels reduces the bandwidth requirements while, on the other hand, it simplifies an explicit vectorization. Other noticeable optimizations described in detail are redundancy elimination techniques to eliminate common subexpressions both in a sequence of statements and across loop boundaries, arithmetic simplifications and normalizations, and the vectorization mentioned previously. In combination, these optimizations are able to increase the performance not only of the model problem given by Poisson’s equation, but also of real-world applications: an optical flow simulation and the simulation of a non-isothermal and non-Newtonian fluid flow

    Abstraction Raising in General-Purpose Compilers

    Get PDF

    Transformations Déclaratives dans le ModÚle Polyédrique

    Get PDF
    Despite the availability of sophisticated automatic optimizers, performance-critical code sections are in practice still tuned by human experts. Pragma-based languages such as OpenMP or OpenACC are the standard interface to apply such transformations to large code bases and loop transformation pragmas would be a straightforward extension to provide fine-grained control over a compilers loop optimizer. However, the manual optimization of programs via explicit sequences of directives is unlikely to fully solve this problem as expressing complex optimization sequences explicitly results in difficult to read and non-performance-portable code. We address this problem by presenting a novel framework of composable program transformations based on the internal tree-like program representation of a polyhedral compiler. Based on a set of tree matchers and transformers, we describe an embedded transformation language which provides the foundation for the development of program optimization tactics. Using this language, we express core building blocks such as loop tiling, fusion, or data-layout-transformations, and compose them to higher-level transformations expressing algorithm-specific optimization strategies for stencils, dense linear-algebra, etc. We expect our approach to simplify the development of polyhedral optimizers and integration of polyhedral and syntactic approaches.MalgrĂ© l’existence d’outils sophistiquĂ©s d’optimisation automatique, les parties des programmes dont la performance est cruciale sont toujoursoptimisĂ©es manuellement par des humains experts. Les langages basĂ©s sur des directives “pragma”, tels que OpenMP ou OpenACC, sont une interface typique pour exprimer les transformations sur des grandes bases de code source. Telles directives pour transformer des nids de boucles seraient une extension naturelle permettant de contrĂŽler l’optimiseur de boucles d’une maniĂšre prĂ©cise. Pourtant l’optimisation manuelle des programmes Ă  travers les sĂ©quences des directives de transformation n’est pas toujours souhaitable car ces sĂ©quences longues et complexes produisent des programmes peu lisibles et ne bĂ©nĂ©ficiant pas de la portabilitĂ© de performance entre les diffĂ©rentes architectures matĂ©rielles. Nous proposons une nouvelle approche pour dĂ©finir les transformations composables des programmes basĂ©e sur la reprĂ©sentation interne d’un compilateur polyĂ©drique sous forme de l’arbre. GrĂące Ă  un ensemble des “motifs” et “transformateurs” des arbres, nous dĂ©crivons un langage de transformation sur lequel nous basons le dĂ©veloppement des tactiques d’optimisation. Avec ce langage, il est possible d’exprimer les transformations basiques, telles que le tuilage, la fusion ou la transposition de donnĂ©es, ainsi que la composition de ces transformations afin de dĂ©finir une stratĂ©gie d’optimisation pour les grandes classes des programmes, telles que les pochoirs, les contractions de tenseurs, etc. Notre approche pourrait simplifier le dĂ©veloppement des optimiseurs polyĂ©driques et l’intĂ©gration des transformations polyĂ©driques et syntaxique

    AUTOMATING DATA-LAYOUT DECISIONS IN DOMAIN-SPECIFIC LANGUAGES

    Get PDF
    A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and is an order of magnitude better than existing production DSLs.Doctor of Philosoph
    • 

    corecore