Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
This paper introduces Tiramisu, a polyhedral framework designed to generate
high performance code for multiple platforms including multicores, GPUs, and
distributed machines. Tiramisu introduces a scheduling language with novel
extensions to explicitly manage the complexities that arise when targeting
these systems. The framework is designed for the areas of image processing,
stencils, linear algebra and deep learning. Tiramisu has two main features: it
relies on a flexible representation based on the polyhedral model and it has a
rich scheduling language allowing fine-grained control of optimizations.
Tiramisu uses a four-level intermediate representation that allows full
separation between the algorithms, loop transformations, data layouts, and
communication. This separation simplifies targeting multiple hardware
architectures with the same algorithm. We evaluate Tiramisu by writing a set of
image processing, deep learning, and linear algebra benchmarks and compare them
with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu
matches or outperforms existing compilers and libraries on different hardware
architectures, including multicore CPUs, GPUs, and distributed machines.
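To make the idea of separating the algorithm from its schedule concrete, the following sketch (plain C++ with OpenMP, deliberately not the actual Tiramisu API) shows the same trivial kernel before and after the kind of transformation a scheduling command would request: the arithmetic stays identical, while the iteration order is tiled and the outer tile loop is parallelized.

```cpp
// Conceptual sketch only: it does NOT use the real Tiramisu API; it just
// illustrates the kind of transformation (tiling + outer-loop parallelism)
// that a scheduling language lets you request without rewriting the algorithm.
#include <algorithm>
#include <cstddef>
#include <vector>

// Algorithm: out(i, j) = in(i, j) * 2  (a stand-in for a real kernel).
void kernel_naive(const std::vector<float>& in, std::vector<float>& out,
                  std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            out[i * n + j] = in[i * n + j] * 2.0f;
}

// Same algorithm after a "schedule": tile both loops by 32 and parallelize the
// outer tile loop. The arithmetic is untouched; only the iteration order and
// the parallelism change, which is exactly what a polyhedral schedule encodes.
void kernel_tiled_parallel(const std::vector<float>& in, std::vector<float>& out,
                           std::size_t n) {
    const std::size_t T = 32;  // tile size (a tunable scheduling parameter)
    #pragma omp parallel for
    for (std::size_t ii = 0; ii < n; ii += T)
        for (std::size_t jj = 0; jj < n; jj += T)
            for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + T, n); ++j)
                    out[i * n + j] = in[i * n + j] * 2.0f;
}
```

In Tiramisu, the second version is not written by hand: the scheduling language requests the tiling and parallelization on the unchanged algorithm, and the four-level representation keeps the loop transformations separate from the algorithm itself.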
Data locality and parallelism optimization using a constraint-based approach
Embedded applications are becoming increasingly complex and process ever-increasing datasets. In
the context of data-intensive embedded applications, there have been two complementary approaches to
enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism.
Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher
levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require
a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip
memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This
can be achieved by considering multiple loop nests simultaneously. Although compilers address these two
problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously.
Therefore, an integrated approach that combines these two can generate much better results than each
individual approach. Based on these observations, this paper proposes a constraint network (CN)-based
formulation for data locality optimization and code parallelization. The paper also presents experimental
evidence, demonstrating the success of the proposed approach, and compares our results with those
obtained through previously proposed approaches. The experiments from our implementation indicate
that the proposed approach is very effective in enhancing data locality and parallelization.
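As a concrete illustration of the kind of joint decision such an approach searches over (a sketch of the general idea only, not the paper's constraint-network encoding), the snippet below shows how interchanging a loop nest improves data locality for a row-major array while the outer loop remains free of cross-iteration dependences, so it can still be parallelized.

```cpp
// Illustrative only: not the constraint-network formulation, just the kind of
// locality/parallelism trade-off it optimizes.
#include <cstddef>
#include <vector>

void poor_locality(std::vector<double>& a, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j)       // stride-n accesses to a[i*n + j]
        for (std::size_t i = 0; i < n; ++i)
            a[i * n + j] += 1.0;
}

void good_locality_parallel(std::vector<double>& a, std::size_t n) {
    #pragma omp parallel for                   // i-iterations are independent
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)   // unit-stride, cache-friendly accesses
            a[i * n + j] += 1.0;
}
```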
PIPS Is not (just) Polyhedral Software: Adding GPU Code Generation in PIPS
Parallel and heterogeneous computing are growing in audience thanks to the increased performance brought by ubiquitous manycores and GPUs. However, available programming models, like OpenCL or CUDA, are far from being straightforward to use. As a consequence, several automated or semi-automated approaches have been proposed to generate hardware-level code from high-level sequential sources. Polyhedral models are becoming more popular because of their combination of expressiveness, compactness, and accurate abstraction of the data-parallel behaviour of programs. These models provide automatic or semi-automatic parallelization and code transformation capabilities that target such modern parallel architectures. PIPS is a quarter-century-old source-to-source transformation framework that initially targeted parallel machines but has since evolved to include other targets. PIPS uses abstract interpretation on an integer polyhedral lattice to represent program code, allowing linear relation analysis on integer variables in an interprocedural way. The same representation is used for the dependence test and the convex array region analysis. The polyhedral model is also used, more classically, to schedule code from linear constraints. In this paper, we illustrate the features of this compiler infrastructure on a hypothetical input code, demonstrating the combination of polyhedral and non-polyhedral transformations. PIPS interprocedural polyhedral analyses are used to generate data transfers and are combined with non-polyhedral transformations to achieve efficient CUDA code generation.
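The convex array region analysis mentioned above is what tells a compiler exactly which slice of an array a loop nest touches, and therefore which data must be transferred to the accelerator. The following small sketch (illustrative C++, not PIPS code) computes that region for a single affine reference.

```cpp
// Hedged sketch: the essence of a convex array-region computation for one
// affine reference A[a*i + b] inside "for (i = lo; i < hi; ++i)". The region
// of A that the loop reads is a single interval; a code generator can then
// transfer exactly that interval to the device before launching the kernel.
#include <algorithm>
#include <cstdio>

struct Region { long lb, ub; };  // inclusive bounds of the touched elements

// Region accessed by A[a*i + b] for i in [lo, hi).
Region affine_region(long a, long b, long lo, long hi) {
    long first = a * lo + b;
    long last  = a * (hi - 1) + b;
    return { std::min(first, last), std::max(first, last) };
}

int main() {
    // e.g. the reference A[2*i + 1] in a loop over i in [0, 100)
    Region r = affine_region(2, 1, 0, 100);
    // A generated host-to-device transfer would cover A[1 .. 199]:
    std::printf("copy A[%ld .. %ld] to the device\n", r.lb, r.ub);
}
```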
Automatic Runtime Calculation of Communications for Data-Parallel Expressions with Periodic Conditions
Many real-world applications feature data accesses on periodic domains. Manually implementing the synchronizations and communications associated with the data dependences in each case is cumbersome and error-prone. It is increasingly interesting to support these applications in high-level parallel programming languages or parallelizing compilers. In this paper, we present a technique that, for distributed-memory systems, calculates the specific communications derived from data-parallel codes with or without periodic boundary conditions on affine access expressions. It makes the management of aggregated communications for the chosen data partition transparent to the programmer. Our technique moves to runtime part of the compile-time analysis typically used to generate the communication code for affine expressions, introducing a completely new technique that also supports periodic boundary conditions. We present an experimental study that evaluates our proposal on several case studies. The results show that our approach can automatically obtain communication codes as efficient as those found in MPI reference codes, while reducing the development effort. Funding: MICINN (Spain) and the ERDF program of the European Union (HomProg-HetSys project, TIN2014-58876-P; CAPAP-H6 Network, TIN2016-81840-REDT; COST Action IC1305, Network for Sustainable Ultrascale Computing, NESUS), and the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF); CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
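A minimal sketch of the runtime side of such a calculation, under assumed names and a simple 1D block distribution (an illustration, not the paper's implementation): for an access a[i + d] with periodic wrap-around, each process determines at run time which remote blocks own the referenced elements and aggregates them into per-owner ranges, which is what the generated communication code would then exchange.

```cpp
// Hedged illustration: block distribution of N elements over P processes, one
// affine access with a periodic boundary, idx = (i + d) mod N. Aggregating the
// owners over the local block yields the coarse-grained receive set.
#include <algorithm>
#include <cstdio>
#include <map>
#include <utility>

long block_size(long N, long P) { return (N + P - 1) / P; }
long owner(long idx, long N, long P) { return idx / block_size(N, P); }
long periodic(long idx, long N) { return ((idx % N) + N) % N; }

int main() {
    const long N = 16, P = 4, d = -1;           // 16 elements, 4 processes, access a[i-1]
    const long me = 0;                           // rank whose receives we compute
    const long b = block_size(N, P);
    std::map<long, std::pair<long, long>> recv;  // owner -> [min, max] remote index needed

    for (long i = me * b; i < std::min((me + 1) * b, N); ++i) {
        long idx = periodic(i + d, N);           // periodic wrap of the access
        long own = owner(idx, N, P);
        if (own == me) continue;                 // locally owned, no communication
        auto it = recv.find(own);
        if (it == recv.end()) recv[own] = { idx, idx };
        else { it->second.first  = std::min(it->second.first, idx);
               it->second.second = std::max(it->second.second, idx); }
    }
    for (auto& [src, rng] : recv)
        std::printf("rank %ld receives elements %ld..%ld from rank %ld\n",
                    me, rng.first, rng.second, src);
}
```

For rank 0 in this toy configuration, the only remote element is the wrapped access a[-1] = a[15], so a single aggregated message from rank 3 is derived, regardless of the block size.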
A Technique to Automatically Determine Ad-hoc Communication Patterns at Runtime
Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory and the exploitation of shared-memory models.
In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way. The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. Applying the technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities. Our experimental results show that, despite the runtime calculation, our approach can automatically produce efficient programs compared with MPI reference codes and with codes generated by auto-parallelizing compilers. Funding: MICINN (Spain) and the ERDF program of the European Union (HomProg-HetSys project, TIN2014-58876-P; CAPAP-H6 Network, TIN2016-81840-REDT; COST Action IC1305, Network for Sustainable Ultrascale Computing, NESUS), and the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF); CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
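The following minimal sketch (hypothetical helper names, not the Trasgo API) illustrates the runtime calculation for one uniform access expression: given whatever tile a process was assigned at run time, it derives the exact, aggregated slice it must receive from a neighbour, independently of the tile size chosen for each process.

```cpp
// Hedged sketch: exact coarse-grained receive set for a uniform access a[i + k]
// over a runtime-chosen tile [lo, hi) of a global array of size N.
#include <algorithm>
#include <cstdio>

struct Range { long lo, hi; };  // half-open [lo, hi)

// Part of the footprint of a[i + k] that lies outside the locally owned block.
// For a uniform offset the remainder is a single contiguous range.
Range needed_remote(Range local, long k, long N) {
    long f_lo = std::max(0L, local.lo + k);   // footprint, clamped to [0, N)
    long f_hi = std::min(N,  local.hi + k);
    if (k >= 0) return { std::max(f_lo, local.hi), f_hi };   // from the right neighbour
    else        return { f_lo, std::min(f_hi, local.lo) };   // from the left neighbour
}

int main() {
    const long N = 1000, k = 2;      // global size and uniform offset a[i + 2]
    Range mine = { 250, 500 };       // tile assigned to this process at run time
    Range r = needed_remote(mine, k, N);
    if (r.lo < r.hi)
        std::printf("receive a[%ld .. %ld) from the neighbouring process\n", r.lo, r.hi);
}
```

Because the calculation runs after the tile sizes are known, the same pre-compiled routine produces the correct aggregated message for any partition, which is the property the paper exploits.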
Formalized Derivation of the Communication Operations of Parallel Grained Algorithms
Algorithms designed for implementation on parallel computers with distributed memory consist of computational macro operations (calculation grains) and communication operations that specify the exchange of data arrays between computing nodes. The major difficulty is finding an efficient way to organize the data exchange. To solve this problem, it is first necessary to identify the information dependences between macro operations and then to generate the communication operations caused by these dependences. To automate and simplify code generation, the communication operations need to be formalized. Such a formalization is known for the case of homogeneous information dependences; it represents the dependences between calculation grains by vectors of global dependences. There is also an approach that makes it possible to obtain the exchanged data arrays, but it requires tools for working with polyhedra and does not formalize the communication operations. This article presents a method for formalizing communication operations (receiving and sending data arrays) and including them in the algorithm structure, for the case of a parallel algorithm with affine dependences. Using functions that determine the relationships between macro operations makes it possible to obtain explicit representations of the communication operations. This work generalizes the formalization of data-sending operations in a parallel algorithm whose operations are not divided into macro operations, as well as some aspects of a known method for obtaining the communication operations determined by homogeneous information dependences.
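As a toy illustration of what an explicit send operation looks like (an assumption-laden sketch, not the article's formalism): with grains distributed block-wise over nodes and a single global dependence vector d, meaning grain i's result is consumed by grain i + d, the grains whose results must leave their node can be enumerated directly.

```cpp
// Hedged sketch: explicit send operations induced by one global dependence
// vector d over a block-wise grain-to-node mapping.
#include <cstdio>

int main() {
    const long G = 12, nodes = 3, d = 2;     // 12 grains, 3 nodes, dependence i -> i + d
    const long b = (G + nodes - 1) / nodes;  // block size of the grain distribution
    for (long node = 0; node < nodes; ++node) {
        for (long i = node * b; i < (node + 1) * b && i < G; ++i) {
            long consumer = i + d;
            if (consumer >= G) continue;     // no consumer past the last grain
            long dest = consumer / b;        // node that executes the consumer grain
            if (dest != node)
                std::printf("node %ld sends result of grain %ld to node %ld\n",
                            node, i, dest);
        }
    }
}
```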
Compiler Optimization Techniques for Scheduling and Reducing Overhead
Exploiting parallelism in loops in programs is an important factor in realizing the potential performance of processors today. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops on processors. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach to optimizing the address generation of these problems that results in the following: (i) elimination of redundant arithmetic computation by recognizing and exploiting common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which leads to reduced execution time, decreased address arithmetic overhead, and access to data in registers rather than caches. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, but little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding the minimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization for nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach that uses the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves data locality but also significantly reduces communication overhead.
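A small sketch of the Chapter 2 idea (illustrative C++, not the dissertation's compiler): in a 1-D stencil, values loaded in one iteration are reused by the next, so recognizing common sub-expressions across iterations and replacing the array references with scalars lets each element be loaded from memory exactly once and kept in a register.

```cpp
// Cross-iteration reuse plus scalar replacement for a 3-point stencil.
#include <cstddef>
#include <vector>

// Naive: every iteration reads b[i-1], b[i], b[i+1]; adjacent iterations
// re-load two of the three values. (a is assumed to have the same size as b.)
void stencil_naive(const std::vector<double>& b, std::vector<double>& a) {
    for (std::size_t i = 1; i + 1 < b.size(); ++i)
        a[i] = b[i - 1] + b[i] + b[i + 1];
}

// Scalar-replaced: the three live values rotate through scalars, so each
// element of b is loaded from memory exactly once.
void stencil_scalarized(const std::vector<double>& b, std::vector<double>& a) {
    if (b.size() < 3) return;
    double left = b[0], mid = b[1];
    for (std::size_t i = 1; i + 1 < b.size(); ++i) {
        double right = b[i + 1];   // the only new load in this iteration
        a[i] = left + mid + right;
        left = mid;                // rotate the register "window"
        mid  = right;
    }
}
```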