48 research outputs found

    Parametric Multi-Level Tiling of Imperfectly Nested Loops

    Get PDF
    International audienceTiling is a crucial loop transformation for generating high perfor- mance code on modern architectures. Efficient generation of multi- level tiled code is essential for maximizing data reuse in systems with deep memory hierarchies. Tiled loops with parametric tile sizes (not compile-time constants) facilitate runtime feedback and dynamic optimizations used in iterative compilation and automatic tuning. Previous parametric multi-level tiling approaches have been restricted to perfectly nested loops, where all assignment state- ments are contained inside the innermost loop of a loop nest. Pre- vious solutions to tiling for imperfect loop nests have only handled fixed tile sizes. In this paper, we present an approach to paramet- ric multi-level tiling of imperfectly nested loops. The tiling tech- nique generates loops that iterate over full rectangular tiles, making them amenable to compiler optimizations such as register tiling. Experimental results using a number of computational benchmarks demonstrate the effectiveness of the developed tiling approach

    Beyond shared memory loop parallelism in the polyhedral model

    Get PDF
    2013 Spring.Includes bibliographical references.With the introduction of multi-core processors, motivated by power and energy concerns, parallel processing has become main-stream. Parallel programming is much more difficult due to its non-deterministic nature, and because of parallel programming bugs that arise from non-determinacy. One solution is automatic parallelization, where it is entirely up to the compiler to efficiently parallelize sequential programs. However, automatic parallelization is very difficult, and only a handful of successful techniques are available, even after decades of research. Automatic parallelization for distributed memory architectures is even more problematic in that it requires explicit handling of data partitioning and communication. Since data must be partitioned among multiple nodes that do not share memory, the original memory allocation of sequential programs cannot be directly used. One of the main contributions of this dissertation is the development of techniques for generating distributed memory parallel code with parametric tiling. Our approach builds on important contributions to the polyhedral model, a mathematical framework for reasoning about program transformations. We show that many affine control programs can be uniformized only with simple techniques. Being able to assume uniform dependences significantly simplifies distributed memory code generation, and also enables parametric tiling. Our approach implemented in the AlphaZ system, a system for prototyping analyses, transformations, and code generators in the polyhedral model. The key features of AlphaZ are memory re-allocation, and explicit representation of reductions. We evaluate our approach on a collection of polyhedral kernels from the PolyBench suite, and show that our approach scales as well as PLuTo, a state-of-the-art shared memory automatic parallelizer using the polyhedral model. Automatic parallelization is only one approach to dealing with the non-deterministic nature of parallel programming that leaves the difficulty entirely to the compiler. Another approach is to develop novel parallel programming languages. These languages, such as X10, aim to provide highly productive parallel programming environment by including parallelism into the language design. However, even in these languages, parallel bugs remain to be an important issue that hinders programmer productivity. Another contribution of this dissertation is to extend the array dataflow analysis to handle a subset of X10 programs. We apply the result of dataflow analysis to statically guarantee determinism. Providing static guarantees can significantly increase programmer productivity by catching questionable implementations at compile-time, or even while programming

    Dynamic Tile Free Scheduling for Code with Acyclic Inter-Tile Dependence Graphs

    Get PDF
    Free scheduling is a task ordering technique under which instructions are executedas soon as their operands become available. Coarsening the grain ofcomputations under the free schedule, by means of using groups of loop neststatement instances (tiles) in place of single statement instances, increases thelocality of data accesses and reduces the number of synchronization events, andas a consequence improves program performance. The paper presents an approachfor code generation allowing for the free schedule for tiles of arbitrarilynested affine loops at run-time. The scope of the applicability of the introducedalgorithms is limited to tiled loop nests whose inter-tile dependence graphs arecycle-free. The approach is based on the Polyhedral Model. Results of experimentswith the PolyBench benchmark suite, demonstrating significant tiledcode speed-up, are discussed

    On Characterizing the Data Access Complexity of Programs

    Full text link
    Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of developing lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower bounds analysis for the red/blue pebble game are very limited in effectiveness when applied to CDAGs of real programs, with computations comprised of multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs, as a function of the problem size and cache size

    On polynomial Code Generation

    Get PDF
    International audienceIn static analysis, one often has to deal with polynomials in the program control variables, either native to the source code or created by enabling analyses. We have explained elsewhere how to compute dependences in such situations and use them for building polynomial schedules. It remains to explain how to generate polynomial code. The present proposal is to target new parallel programming languages of the async/finish family, like X10 or Habanero, which are "polynomial friendly" and for which efficient compilers exists. Both these languages have barrier-like constructs-clocks for X10 and phasers for Habanero-which may be used to synchronize activities. To understand the behaviour of a clocked program, one has to count the number of clock advance operations since the creation of each activity. Advances with equal counts are synchronized , and these counts may be polynomials. The trick is therefore to insure that before executing an operation, its activity has executed as many advances as the current value of its schedule. This can be obtained by inserting auxilliary loops for executing the necessary advances. This scheme fails if the schedule is not monotone increasing with respect to the execution order in each activity. This problem may be solved by reordering the activities-which is possible since the real execution order is given by the schedule-or in extreme cases by index set splitting
    corecore