
    Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

    This paper introduces Tiramisu, a polyhedral framework designed to generate high-performance code for multiple platforms, including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra, and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model, and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and compare them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines. (arXiv admin note: substantial text overlap with arXiv:1803.0041)
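
    A minimal sketch of the algorithm/schedule separation the abstract describes, written in plain C with OpenMP rather than the Tiramisu API: the loop nest below is roughly what a "tile by 32 and parallelize the outer tiles" schedule could produce for a simple 3-point row blur. The tile size and all names are illustrative assumptions, not taken from the paper.

        /* Illustrative only: the effect of a "tile + parallelize" schedule on a
         * simple 2D blur, shown as the loop nest a polyhedral compiler could emit.
         * This is not the Tiramisu API; names and the tile size are made up. */
        #include <omp.h>
        #define N 1024
        #define T 32                    /* tile size chosen by the schedule */

        void blur_scheduled(const float in[N][N], float out[N][N]) {
            /* algorithm: out(i,j) averages a 3-point row neighborhood of in   */
            /* schedule : tile i and j by T, run the tiles in parallel         */
            #pragma omp parallel for collapse(2)
            for (int ti = 0; ti < N; ti += T)
                for (int tj = 0; tj < N; tj += T)
                    for (int i = ti; i < ti + T; i++)
                        for (int j = tj; j < tj + T; j++) {
                            if (j == 0 || j == N - 1) { out[i][j] = in[i][j]; continue; }
                            out[i][j] = (in[i][j-1] + in[i][j] + in[i][j+1]) / 3.0f;
                        }
        }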

    Data locality and parallelism optimization using a constraint-based approach

    Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In the context of data-intensive embedded applications, there have been two complementary approaches to enhancing application behavior, namely data locality optimizations and improving loop-level parallelism. Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require a fresh look for chip multiprocessors, as interprocessor communication is much cheaper than off-chip memory accesses; a compiler therefore needs to minimize the number of off-chip memory accesses, which can be achieved by considering multiple loop nests simultaneously. Although compilers address these two problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously, so an integrated approach that combines the two can generate much better results than either individual approach. Based on these observations, this paper proposes a constraint network (CN)-based formulation for data locality optimization and code parallelization. The paper also presents experimental evidence demonstrating the success of the proposed approach and compares our results with those obtained through previously proposed approaches. The experiments from our implementation indicate that the proposed approach is very effective in enhancing data locality and parallelization. © 2010 Elsevier Inc. All rights reserved.
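
    As a concrete, hand-worked instance of the locality/parallelism interplay discussed above (my example, not the paper's constraint-network solver), the C fragment below fixes a loop order for a row-major matrix-vector product that gives both unit-stride accesses and an outer loop free of cross-iteration dependences, so it can be parallelized. The CN formulation is what would search for such combinations automatically.

        /* One manually chosen point in the locality-vs-parallelism search space:
         * i outermost gives unit-stride walks over each row of a AND independent
         * outer iterations, so the loop can be both cache-friendly and parallel. */
        #include <omp.h>
        #define N 2048

        void matvec(const double a[N][N], const double x[N], double y[N]) {
            #pragma omp parallel for            /* i iterations are independent   */
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++)     /* unit-stride accesses to a[i][j] */
                    sum += a[i][j] * x[j];
                y[i] = sum;
            }
        }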

    PIPS Is not (just) Polyhedral Software: Adding GPU Code Generation in PIPS

    Parallel and heterogeneous computing are growing in audience thanks to the increased performance brought by ubiquitous manycores and GPUs. However, available programming models, like OpenCL or CUDA, are far from being straightforward to use. As a consequence, several automated or semi-automated approaches have been proposed to automatically generate hardware-level codes from high-level sequential sources. Polyhedral models are becoming more popular because of their combination of expressiveness, compactness, and accurate abstraction of the data-parallel behaviour of programs. These models provide automatic or semi-automatic parallelization and code transformation capabilities that target such modern parallel architectures. PIPS is a quarter-century-old source-to-source transformation framework that initially targeted parallel machines but then evolved to include other targets. PIPS uses abstract interpretation on an integer polyhedral lattice to represent program code, allowing linear relation analysis on integer variables in an interprocedural way. The same representation is used for the dependence test and the convex array region analysis. The polyhedral model is also more classically used to schedule code from linear constraints. In this paper, we illustrate the features of this compiler infrastructure on a hypothetical input code, demonstrating the combination of polyhedral and non-polyhedral transformations. PIPS interprocedural polyhedral analyses are used to generate data transfers and are combined with non-polyhedral transformations to achieve efficient CUDA code generation.
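
    The sketch below illustrates, under assumptions of mine rather than actual PIPS output, how a convex array region analysis can drive the data transfers mentioned at the end of the abstract: the read and written regions of the arrays bound exactly what must be copied to and from the accelerator. The helper functions are hypothetical placeholders (declared but not implemented) standing in for the memory copies and kernel launch a compiler would emit.

        /* Hypothetical helpers standing in for generated transfers and a kernel
         * launch; a real compiler would emit device copies and a CUDA kernel. */
        void copy_to_accel(const float *host, int first, int count);
        void copy_from_accel(float *host, int first, int count);
        void accel_stencil(const float *a, float *b, int n);  /* b[i] = a[i-1]+a[i]+a[i+1] */

        void offload_stencil(float *a, float *b, int n) {
            copy_to_accel(a, 0, n);          /* read region of a over the loop: [0, n-1] */
            accel_stencil(a, b, n);          /* computes b[i] for i in [1, n-2]          */
            copy_from_accel(b, 1, n - 2);    /* write region of b: [1, n-2]              */
        }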

    Automatic Runtime Calculation of Communications for Data-Parallel Expressions with Periodic Conditions

    Many real-world applications feature data accesses on periodic domains. Manually implementing the synchronizations and communications associated with the data dependences in each case is cumbersome and error-prone. It is increasingly interesting to support these applications in high-level parallel programming languages or parallelizing compilers. In this paper, we present a technique that, for distributed-memory systems, calculates the specific communications derived from data-parallel codes with or without periodic boundary conditions on affine access expressions. It makes the management of aggregated communications for the chosen data partition transparent to the programmer. Our technique moves to runtime part of the compile-time analysis typically used to generate the communication code for affine expressions, introducing a completely new technique that also supports periodic boundary conditions. We present an experimental study to evaluate our proposal using several case studies. Our experimental results show that our approach can automatically obtain communication codes as efficient as those found in MPI reference codes, reducing the development effort. Supported by MICINN (Spain) and the ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 Network (TIN2016-81840-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS); and by the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
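
    A minimal sketch (not the paper's implementation) of the communication such a technique has to derive in the simplest periodic case: a 1D block partition where each process reads a[i-1] and a[i+1], so the first and last ranks exchange data through the periodic wraparound. MPI is used here only to make the derived pattern concrete; the buffer layout and tags are my assumptions.

        /* Each rank stores its 'local' owned elements in a[1..local] plus one
         * ghost cell on each side (a[0] and a[local+1]); left/right neighbours
         * wrap around modulo the number of ranks (periodic boundary). */
        #include <mpi.h>

        void periodic_halo_exchange(double *a, int local, MPI_Comm comm) {
            int rank, size;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);
            int left  = (rank - 1 + size) % size;      /* periodic wraparound */
            int right = (rank + 1) % size;

            /* send my last owned element right, receive my left ghost cell */
            MPI_Sendrecv(&a[local],     1, MPI_DOUBLE, right, 0,
                         &a[0],         1, MPI_DOUBLE, left,  0,
                         comm, MPI_STATUS_IGNORE);
            /* send my first owned element left, receive my right ghost cell */
            MPI_Sendrecv(&a[1],         1, MPI_DOUBLE, left,  1,
                         &a[local + 1], 1, MPI_DOUBLE, right, 1,
                         comm, MPI_STATUS_IGNORE);
        }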

    A Technique to Automatically Determine Ad-hoc Communication Patterns at Runtime

    Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory and the exploitation of shared-memory models. In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way. The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities. Our experimental results show that, despite our runtime calculation, our approach can automatically produce efficient programs compared with MPI reference codes and with codes generated by auto-parallelizing compilers. Supported by MICINN (Spain) and the ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 (TIN2016-81840-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS); and by the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
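
    To make the idea of an exact runtime communication calculation concrete, the sketch below (my own illustration, not Trasgo code) intersects, at runtime, the block another process will read with the block the calling process owns, yielding the exact slice to send for a uniform access a[i+shift] under a block partition. All helper and variable names are illustrative.

        /* Interval arithmetic only: owned_block computes the block partition,
         * slice_to_send intersects "what dest needs" with "what I own". */
        typedef struct { long lo, hi; } interval;   /* inclusive; empty if lo > hi */

        static interval owned_block(long n, int p, int rank) {
            long base = n / p, rem = n % p;
            long lo = rank * base + (rank < rem ? rank : rem);
            long hi = lo + base + (rank < rem ? 1 : 0) - 1;
            return (interval){ lo, hi };
        }

        interval slice_to_send(long n, int p, int me, int dest, long shift) {
            interval mine   = owned_block(n, p, me);
            interval theirs = owned_block(n, p, dest);
            /* dest reads a[i + shift] for i in its block: shift its block ...   */
            interval needed = { theirs.lo + shift, theirs.hi + shift };
            /* ... and intersect with what I own to get the exact slice to send. */
            interval send = { mine.lo > needed.lo ? mine.lo : needed.lo,
                              mine.hi < needed.hi ? mine.hi : needed.hi };
            return send;                             /* empty if send.lo > send.hi */
        }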

    Formalized Derivation of Communication Operations for Parallel Granular Algorithms

    Algorithms designed for implementation on parallel computers with distributed memory consist of computational macro operations (calculation grains) and communication operations that specify the exchange of data arrays between computing nodes. The major difficulty is usually finding an efficient way to organize the data exchange. To solve this problem, it is first necessary to identify the information dependences between macro operations and then to generate the communication operations caused by these dependences. To automate and simplify code generation, the communication operations must be formalized. Such a formalization is known for the case of homogeneous information dependences; it represents dependences between calculation grains by vectors of global dependences. There is also a known approach that makes it possible to obtain the exchanged data arrays, but it requires tools for manipulating polyhedra and does not formalize the communication operations. This article presents a method for formalizing communication operations (receiving and sending data arrays) and for including them in the structure of a parallel algorithm with affine dependences. Using functions that determine the relationships between macro operations makes it possible to obtain explicit representations of the communication operations. This work generalizes an earlier formalization of data-sending operations in parallel algorithms whose operations are not divided into macro operations, as well as some aspects of an existing method for deriving communication operations.
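
    A short formal sketch, in notation of my own choosing rather than the article's, of the kind of explicit communication set such a formalization yields: given an affine dependence function and a mapping of macro operations to nodes, the send set between two nodes follows directly.

        % Notation mine (a sketch, not the article's formulas): let
        % \Phi(j) = A j + b be the affine dependence mapping a consumer
        % macro operation j \in \mathcal{J} to the producer macro operation
        % whose result it reads, and let \pi assign macro operations to nodes.
        % The data node p must send to node q is then explicit:
        \[
          \mathrm{Send}(p \to q) \;=\;
          \bigl\{\, \Phi(j) \;\bigm|\; j \in \mathcal{J},\;
                    \pi(\Phi(j)) = p,\; \pi(j) = q \,\bigr\}.
        \]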

    Compiler Optimization Techniques for Scheduling and Reducing Overhead

    Exploiting parallelism in loops is an important factor in realizing the potential performance of today's processors. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach to optimizing the address generation of these problems that results in: (i) elimination of redundant arithmetic computation by recognizing and exploiting common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which leads to reduced execution time, lower address arithmetic overhead, and access to data in registers rather than caches. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, but little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding the minimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization in nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space; a characterization of this non-uniformity is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach that uses the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays; this not only improves data locality but also significantly reduces communication overhead.
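
    As a worked illustration of the Chapter 2 idea (my example, not code from the dissertation), the C loop below rotates three scalars through a 3-point stencil so that each array element is loaded only once, removing the redundant loads and most of the address arithmetic while keeping the working values in registers.

        /* Cross-iteration common sub-expression elimination by scalar rotation:
         * the value loaded as a[i+1] in one iteration is re-read as a[i] and
         * a[i-1] in the next two, so we keep it in scalars instead. */
        void smooth(const double *a, double *b, int n) {
            if (n < 3) return;                       /* nothing to smooth */
            double prev = a[0], cur = a[1];
            for (int i = 1; i < n - 1; i++) {
                double next = a[i + 1];              /* the only new load per iteration */
                b[i] = (prev + cur + next) / 3.0;
                prev = cur;                          /* rotate scalars instead of reloading */
                cur  = next;
            }
        }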