Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts
Over the past few years, multicore systems have become increasingly powerful and, thereby, very useful in high-performance computing. However, many applications, such as some linear algebra algorithms, still cannot take full advantage of these systems. This is mainly due to the shortage of optimization techniques that handle irregular control structures. In particular, the well-known polyhedral model fails to optimize loop nests whose bounds and/or array references are not affine functions. This is more likely to occur when handling sparse matrices in their packed formats. In this paper, we propose to use 2d-packed layouts and simple affine transformations to enable the optimization of triangular and banded matrix operations. The benefit of our proposal is shown through an experimental study over a set of linear algebra benchmarks.
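To make the idea concrete, here is a minimal C sketch of one possible affine 2d-packing of a lower-triangular matrix (a mapping chosen for this illustration; the paper's exact layout may differ). The classical 1D packed address i*(i+1)/2 + j is quadratic and thus outside the polyhedral model, whereas folding row i onto row N-1-i of a half-height rectangular array keeps every index expression affine:

```c
#include <stdio.h>

/* Hypothetical illustration of an affine 2d-packed layout for an
 * N x N lower-triangular matrix (entries with j <= i): row i is
 * folded onto row N-1-i of a half-height rectangular array, so
 * every index expression stays affine. N is assumed even here. */
#define N 8

double packed[N / 2][N + 1];   /* (N/2) * (N+1) = N*(N+1)/2 slots */

/* Affine address of triangular entry (i, j), 0 <= j <= i < N. */
double *tri(int i, int j)
{
    if (i < N / 2)
        return &packed[i][j];             /* top rows: stored directly */
    return &packed[N - 1 - i][N - j];     /* bottom rows: folded       */
}

int main(void)
{
    int count = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++) {
            *tri(i, j) = 1.0;
            count++;
        }
    printf("%d triangular entries in %zu slots\n",
           count, sizeof packed / sizeof(double));
    return 0;
}
```

Each branch of tri() is a plain affine function of (i, j), which is what makes loop nests over this layout amenable to polyhedral analysis and transformation.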
Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels
The polyhedral model makes it possible to automatically improve data locality and enable parallelism in regular linear algebra kernels. In previous work, we proposed a new data structure, the 2d-packed layout, to store only the non-zero elements of regular sparse (triangular and banded) matrices dynamically allocated for different basic linear algebra operations, and used Pluto to parallelize and optimize them. To our surprise, there were huge discrepancies in our measurements of these kernels' execution times that were due to the allocation mode: as statically declared arrays or as dynamically allocated arrays of pointers. In this paper, we compare the performance of various linear algebra kernels, including some from the PolyBench suite, using different array allocation modes. We present our detailed investigation of the possible reasons for the performance variations on two different architectures: a dual 12-core AMD (Magny-Cours) and a dual 10-core Intel Xeon (Haswell-EP). We conclude that static or dynamic memory allocation has an impact on performance in many cases, and that the processor architecture and the gcc compiler's decisions can provoke significant and sometimes surprising variations, in favor of one or the other allocation mode.
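The two allocation modes being compared can be sketched as follows (a minimal C illustration, not the paper's benchmark code). With a static declaration the compiler sees one contiguous block and a purely affine address computation; with an array of pointers every access goes through an extra load, and rows need not be contiguous:

```c
#include <stdlib.h>

#define N 1024

/* Allocation mode 1: statically declared array. The compiler sees a
 * single contiguous block, and A_static[i][j] is one affine address
 * computation. */
static double A_static[N][N];

/* Allocation mode 2: dynamically allocated array of pointers. Each
 * access a[i][j] first loads the row pointer a[i], and the rows are
 * not guaranteed to be contiguous in memory. */
double **alloc_dynamic(int n)
{
    double **a = malloc(n * sizeof *a);
    for (int i = 0; i < n; i++)
        a[i] = malloc(n * sizeof **a);
    return a;
}

int main(void)   /* frees omitted in this sketch */
{
    double **A_dyn = alloc_dynamic(N);
    A_static[0][0] = A_dyn[0][0] = 1.0;
    return 0;
}
```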
Pipelined Multithreading Generation in a Polyhedral Compiler
State-of-the-art automatic polyhedral parallelizers extract and express parallelism as isolated parallel loops. For example, the Pluto high-level compiler generates and annotates loops with "#pragma omp parallel for" directives. Our goal is to take advantage of pipelined multithreading, a parallelization strategy that addresses a wider class of codes, currently not handled by automatic parallelizers. Pipelined multithreading requires interlacing the iterations of some loops in a controlled way that enables the parallel execution of these iterations. We achieve this using OpenMP clauses such as ordered and nowait. The sketch of our method is to: (1) schedule a SCoP using traditional techniques such as Pluto's algorithm; (2) detect potential pipelines in groups of sequential loops; (3) fine-tune the schedule; and (4) generate the resulting code. The fully automatic generation is ongoing work, yet we show on a small set of experiments how pipelined multithreading makes it possible to parallelize programs that would otherwise not be parallelized.
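As a hand-written illustration of the kind of interlacing the ordered clause enables (a minimal sketch, not the tool's generated code, which also relies on nowait to pipeline across distinct loops): the sequential recurrence below is confined to an ordered region, so the heavy, independent part of each iteration overlaps across threads while the dependent part still executes in loop order:

```c
#include <stdio.h>
#include <math.h>

#define N 1000

double a[N], b[N];

int main(void)
{
    a[0] = 1.0;
    /* The "ordered" region carries the sequential dependence
     * a[i-1] -> a[i]; the expensive, independent computation of
     * b[i] overlaps across threads, forming a pipeline. */
    #pragma omp parallel for ordered schedule(static, 1)
    for (int i = 0; i < N; i++) {
        double heavy = 0.0;            /* parallel pipeline stage   */
        for (int k = 0; k < 10000; k++)
            heavy += sin(i + k * 1e-3);
        b[i] = heavy;

        #pragma omp ordered            /* sequential pipeline stage */
        {
            if (i > 0)
                a[i] = a[i - 1] + b[i] * 1e-6;
        }
    }
    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```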
Splitting Polyhedra to Generate More Efficient Code: Efficient Code Generation in the Polyhedral Model is Harder Than We Thought
Code generation in the polyhedral model takes as input a union of Z-polyhedra and produces code scanning all of them. Modern code generation tools rely heavily on polyhedral operations to perform this task. However, these operations are typically provided by general-purpose polyhedral libraries that are not specifically designed to address the code generation problem. In particular, (unions of) polyhedra may be represented in various mathematically equivalent ways which may have different properties with respect to code generation. In this paper, we investigate this problem and try to find the best representation of polyhedra to generate efficient code. We present two contributions. First, we demonstrate that this problem has been largely underestimated, showing significant control overhead deviations when using different representations of the same polyhedra. Second, we propose an improvement to the main algorithm of the state-of-the-art code generation tool CLooG. It generates code with fewer tests in the inner loops, and aims to reduce control overhead and to simplify vectorization for the compiler, at the cost of a larger code size. It is based on a smart splitting of the union of polyhedra while recursing on the dimensions. We implemented our algorithm in CLooG/PolyLib, and compared the performance and size of the generated code to the CLooG/isl version.
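The effect of splitting on control overhead can be illustrated by hand (a toy example, not CLooG output). Scanning the union of P1 = {(i,j) : 0 <= i < n, 0 <= j < m} and P2 = {(i,j) : 0 <= i < n, 0 <= j <= i} with a single loop nest puts two tests in the innermost loop; splitting the j dimension removes them at the cost of more loops:

```c
#include <stdio.h>

/* Toy statements standing for the polyhedra's statement bodies. */
#define S1(i, j) printf("S1 %d %d\n", (i), (j))
#define S2(i, j) printf("S2 %d %d\n", (i), (j))

/* One merged loop nest: two tests execute on every iteration. */
void scan_merged(int n, int m)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < (m > i + 1 ? m : i + 1); j++) {
            if (j < m)  S1(i, j);
            if (j <= i) S2(i, j);
        }
}

/* Split version: no tests in the inner loops, but larger code. */
void scan_split(int n, int m)
{
    for (int i = 0; i < n; i++) {
        int lo = (m < i + 1 ? m : i + 1);         /* min(m, i+1) */
        int hi = (m > i + 1 ? m : i + 1);         /* max(m, i+1) */
        for (int j = 0; j < lo; j++) { S1(i, j); S2(i, j); }
        if (m < i + 1)
            for (int j = lo; j < hi; j++) S2(i, j);   /* P2 only */
        else
            for (int j = lo; j < hi; j++) S1(i, j);   /* P1 only */
    }
}

int main(void) { scan_merged(3, 2); scan_split(3, 2); return 0; }
```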
Integer Affine Transformations of Parametric Z-polytopes and Applications to Loop Nest Optimization
The polyhedral model is a well-known compiler optimization framework for the analysis and transformation of affine loop nests. We present a new method for a difficult geometric operation raised by this model: the integer affine transformation of parametric Z-polytopes. The result of such a transformation is given by a worst-case exponential union of Z-polytopes. We also propose a polynomial algorithm (for a fixed dimension) to count points in arbitrary unions of a fixed number of parametric Z-polytopes. We implemented these algorithms and compared them to other existing algorithms on a set of applications to loop nest analysis and optimization.
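In standard notation (a sketch of the definitions involved, not verbatim from the paper), a parametric Z-polytope is a parametric polytope intersected with an affine integer lattice, and its exact image under an integer affine map is in general a union of Z-polytopes:

```latex
\[
  \mathcal{Z}(p) \;=\; \{\, x \in \mathbb{Q}^n \mid A x \ge B p + c \,\}
  \;\cap\; \left( L\,\mathbb{Z}^n + \ell \right)
\]
\[
  T\bigl(\mathcal{Z}(p)\bigr) \;=\; \bigcup_{k=1}^{K} \mathcal{Z}_k(p)
  \qquad \text{with } T(x) = T_1\, x + T_2\, p + t
\]
% K is exponential in the worst case. For a fixed number of
% Z-polytopes and a fixed dimension, the number of integer points
% in the union can still be counted in polynomial time, e.g. via
% inclusion-exclusion over the (quasi-polynomial) counts of the
% pairwise intersections.
```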
Adaptive Runtime Selection of Parallel Schedules in the Polytope Model
There is often no unique version of a program that provides the best performance in all circumstances. Compilers should rely on an adaptive runtime decision to choose which optimizing and parallelizing transformations will lead to the best performance in any execution context. We present a new adaptive framework solving two drawbacks of existing methods: it is effective from the very first execution, and it handles slight variations of input data shape and size. In our proposal, different code versions of parallel loop nests are statically generated by the compiler. At install time, each version is profiled in different execution contexts. At runtime, the execution time of each code version is predicted using the profiling results, the current input data shape and the number of available processor cores. The predicted best version is then run. Our framework handles several versions of possibly tiled parallel loops, using the polytope model for both the profiling and the dynamic selection phases. We show on several benchmark programs that our runtime system selects one of the most efficient versions with a very low runtime overhead. This quick and efficient selection leads to speedups compared to using a single version in every execution context.
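The selection step can be sketched as follows (a hypothetical C illustration with invented names; the actual framework derives its predictions from the polytope model and install-time profiling):

```c
#include <stddef.h>

/* Each code version carries a cost model fitted at install time:
 * predict(n, threads) returns the expected execution time for the
 * current data size and number of available cores. */
typedef void (*kernel_fn)(double *data, long n);

typedef struct {
    kernel_fn  run;
    double   (*predict)(long n, int threads);
} version_t;

void run_best(version_t *versions, size_t nversions,
              double *data, long n, int threads)
{
    size_t best = 0;
    double best_time = versions[0].predict(n, threads);
    for (size_t v = 1; v < nversions; v++) {
        double t = versions[v].predict(n, threads);
        if (t < best_time) { best_time = t; best = v; }
    }
    versions[best].run(data, n);   /* run the predicted best version */
}
```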
Numerical simulation for the MHD system in 2D using OpenCL
In this work, we solve the MHD equations with divergence cleaning on GPU. The method is based on the finite volume approach and Strang dimensional splitting. The simplicity of the approach makes it a good candidate for a GPU implementation with OpenCL. With adequate memory access optimizations, we achieve very high speedups compared to a classical sequential implementation.
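Strang dimensional splitting, as commonly formulated for 2D finite-volume schemes (standard textbook form, not copied from the paper), advances each direction separately in a symmetric arrangement that preserves second-order accuracy in time:

```latex
\[
  U^{n+1} \;=\; \mathcal{S}_x^{\Delta t/2}\;
                \mathcal{S}_y^{\Delta t}\;
                \mathcal{S}_x^{\Delta t/2}\; U^{n}
\]
% where $\mathcal{S}_x^{\tau}$ (resp. $\mathcal{S}_y^{\tau}$) advances
% the 1D finite-volume scheme in the $x$ (resp. $y$) direction by a
% time $\tau$. Each 1D sweep updates independent rows or columns of
% the grid, which is what makes the method map naturally to a GPU
% with OpenCL.
```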
Handling Multi-Versioning in LLVM: Code Tracking and Cloning
Instrumentation by sampling, adaptive computing and dynamic optimization can be efficiently implemented using multiple versions of a code region. Ideally, compilers should automatically handle the generation of such multiple versions. In this work, we discuss the problem of multi-versioning in the situation where each version requires a different intermediate representation. We expose the limits of today's compilers regarding these aspects and provide our solutions to overcome them, using the LLVM compiler as our research platform. The paper focuses on three main aspects: tracking code in LLVM IR, cloning, and communication between low-level and high-level representations. Aiming at performance and minimal impact on the behavior of the original code, we describe our strategies to guide the interaction between the newly inserted code and the optimization passes, from annotating code using metadata to inlining assembly code in LLVM IR. Our target is performing code instrumentation and optimization, with an interest in loops. We build a version in x86_64 assembly code to acquire low-level information, and multiple versions in LLVM IR for performing high-level code transformations. The selection mechanism consists of callbacks to a generic runtime system. Preliminary results on the SPEC CPU 2006 and the Pointer Intensive benchmark suites show that our framework has a negligible overhead in most cases when instrumenting the most time-consuming loop nests.
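The dispatch side of such multi-versioning can be sketched in plain C (a hypothetical illustration with invented names; the paper's selection happens through callbacks from the versioned region into a generic runtime system):

```c
/* Each instrumented loop nest exists in several versions; a callback
 * into the runtime system decides which one runs next, e.g. the
 * instrumented version during a sampling phase, then an optimized
 * version afterwards. The policy below is a toy stand-in. */
enum version { VERSION_INSTRUMENTED, VERSION_OPTIMIZED };

static int samples_left = 100;            /* toy sampling policy  */

enum version runtime_select(void)         /* callback into the VM */
{
    return samples_left-- > 0 ? VERSION_INSTRUMENTED
                              : VERSION_OPTIMIZED;
}

void loop_instrumented(long n) { (void)n; /* collects low-level info */ }
void loop_optimized(long n)    { (void)n; /* transformed version     */ }

void loop_nest(long n)
{
    switch (runtime_select()) {
    case VERSION_INSTRUMENTED: loop_instrumented(n); break;
    case VERSION_OPTIMIZED:    loop_optimized(n);    break;
    }
}

int main(void)
{
    for (int i = 0; i < 200; i++)
        loop_nest(1000);
    return 0;
}
```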
VMAD: a Virtual Machine for Advanced Dynamic Analysis of Programs
In this paper, we present a virtual machine, VMAD (Virtual Machine for Advanced Dynamic analysis), enabling an efficient implementation of advanced profiling and analysis of programs. VMAD is organized as a sequence of basic operations where external modules associated with specific profiling strategies are dynamically loaded when required. The program binary files handled by VMAD are previously instrumented at compile time to include the necessary data, instrumentation instructions and callbacks to the VM. Dynamic information, such as the memory locations of launched modules, is patched at startup in the binary file. The LLVM compiler has been extended to automatically instrument programs according to both VMAD and the handled profiling strategies. VMAD's potential is illustrated by presenting a profiling strategy dedicated to loop nests. It collects all memory addresses that are accessed during a selected number of successive iterations of each loop. The collected addresses are consumed by an analysis process that tries to interpolate the addresses successively accessed through each memory reference as a linear function of some virtual loop indices. This profiling strategy using VMAD has been run on some of the SPEC2006 and Pointer Intensive benchmark suites, showing a very low time overhead in most cases.
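The interpolation step can be sketched as follows (a simplified, hypothetical C version restricted to a single virtual loop index; the analysis described above handles several):

```c
#include <stdint.h>
#include <stdbool.h>

/* Given the addresses a memory reference touched on successive
 * iterations 0..n-1, check whether they fit an affine function
 * addr(i) = base + i * stride. */
bool interpolate_linear(const uintptr_t *addr, int n,
                        uintptr_t *base, intptr_t *stride)
{
    if (n < 2)
        return false;
    *base   = addr[0];
    *stride = (intptr_t)(addr[1] - addr[0]);
    for (int i = 2; i < n; i++)
        if ((intptr_t)(addr[i] - addr[i - 1]) != *stride)
            return false;          /* reference is not affine in i */
    return true;
}

int main(void)
{
    double a[8];
    uintptr_t trace[8];
    for (int i = 0; i < 8; i++)
        trace[i] = (uintptr_t)&a[i];   /* stride = sizeof(double) */
    uintptr_t base; intptr_t stride;
    return interpolate_linear(trace, 8, &base, &stride) ? 0 : 1;
}
```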
Emergence of Distributed Leadership in Building a Course
During the first year of the Bachelor's degree in Mathematics and Computer Science at the University of Strasbourg, students discover computer science through a class named "Algorithmics and Programming". The teaching team is large and composed of teachers with varied backgrounds. In this paper, we present the collective work this team carried out to completely rework the course, switching to a new programming language and changing the teaching method to flipped classroom. We analyse the organization of this collective work, characterized by collaborative decisions, cooperative achievements and a distributed leadership. We discuss the efficiency of this organization and the effects observed over several years.