18 research outputs found

    An Efficient Code Generation Technique for Tiled Iteration Spaces

    No full text
    This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multi-level memory hierarchies, as well as to efficiently execute loops onto parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler work, especially when non-rectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and second sweep all points within each tile. For the first subproblem, 1 we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a non-unimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a non-unimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code

    Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces

    No full text
    In this paper we propose several alternative methods for the compile time scheduling of Tiled Nested Loops onto a fixed size parallel architecture. We investigate the distribution of tiles among processors, provided that we have chosen either a non-overlapping communication mode, which involves successive computation and communication steps, or an overlapping communication mode, which supposes a pipelined, concurrent execution of communication and computations. In order to utilize the available processors as efficiently as possible, we can either adopt a cyclic assignment schedule, or assign neighboring tiles to the same CPU, or adapt the size and shape of tiles, so that the required number of processors is exactly equal to the number of the available ones. We theoretically and experimentally compare the proposed schedules, so as to design one which achieves the minimum total execution time, depending on the cluster configuration, (i.e. number and type of nodes, interconnect bandwidth, etc) the internal characteristics of the underlying architecture (i.e. NIC and DMA latencies, etc) and the iteration space size and shape. 1

    Code Generation Methods for Tiling Transformations

    No full text
    Tiling or supernode transformation has been widely used to improve locality in multi-level memory hierarchies, as well as to efficiently execute loops onto parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler work due to non-rectangular tile shapes and arbitrary iteration space bounds. In this paper, we first survey code generation methods for nested loops which are transformed using non-unimodular transformations. All methods are based on Fourier-Motzkin (FM) elimination. Next, we consider and enhance previous work on rewriting tiled loops by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, all methods first enumerate the tiles containing points within the iteration space, and second, sweep the points within each tile. For the first, we extend previous work in order to access all tile origins correctly, while for the latter, we propose the transformation of the initial parallelepiped tile iteration space into a rectangular one, so as to generate code efficiently with the aid of a non-unimodular transformation matrix and its Hermite Normal Form (HNF). The resulting systems of inequalities are much simpler than those appeared in bibliography; thus their solutions are more efficiently determined using the FM elimination. Experimental results which compare all presented approaches, show that the proposed method for generating tiled code is significantly accelerated, thus rewriting any Ò-D tiled loop in a much more efficient and direct way. Keywords: loop tiling, supernodes, non-unimodular transformations, Fourier- Motzkin elimination, code generation

    Automatic Code Generation for Executing Tiled Nested Loops Onto Parallel Architectures

    No full text
    This paper presents a novel approach for the problem of generating tiled code for nested for-loops using a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multi-level memory hierarchies as well as to efficiently execute loops onto non-uniform memory access architectures. However, automatic code generation for tiled loops can be a very complex compiler work due to non-rectangular tile shapes and iteration space bounds. Our method considerably enhances previous work on rewriting tiled loops by considering parallelepiped tiles and arbitrary iteration space shapes. The complexity of code generation for tiling transformation is now reduced to the complexity of code generation for any linear transformation. Experimental results which compare all so far presented approaches, show that the proposed approach for generating tiled code is significantly accelerated

    A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping

    No full text
    This paper proposes a novel approach for the parallel execution of tiled Iteration Spaces onto a cluster of SMP PC nodes. Each SMP node has multiple CPUs and a single memory mapped PCI-SCI Network Interface Card. We apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. In this way, intranode (intragroup) communication is annihilated. Groups are atomically executed inside each node. Nodes exchange data between successive group computations. We schedule groups much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive atomic group executions. The applied non-blocking schedule resembles a pipelined datapath where group computation phases are overlapped with communication ones, instead of being interleaved with them. Our experimental results illustrate that the proposed method outperforms previous approaches involving blocking communication or conventional grouping schemes

    Compiling Tiled Iteration Spaces for Clusters

    No full text
    This paper presents a complete end-to-end framework to generate automatic message-passing code for tiled iteration spaces. It considers general parallelepiped tiling transformations and general convex iteration spaces. We aim to address all problems concerning data parallel code generation efficiently by transforming the initial non-rectangular tile to a rectangular one. In this way, data distribution and communication become simple and straightforward. We have implemented our parallelizing techniques in a tool which automatically generates MPI code and run several experiments on a cluster of PCs. Our experimental results show the merit of general parallelepiped tiling transformations, and confirm previous theoretical work on schedulingoptimal tile shapes

    Automatic parallel code generation for tiled nested loops

    No full text
    This paper presents an overview of our work, concerning a complete end-to-end framework for automatically generating message passing parallel code for tiled nested for-loops. It considers general parallelepiped tiling transformations and general convex iteration spaces. We address all problems regarding both the generation of sequential tiled code and its parallelization. We have implemented our techniques in a tool which automatically generates MPI parallel code and conducted several series of experiments, concerning the compilation time of our tool, the efficiency of the generated code and the speedup attained on a cluster of PCs. Apart from confirming the value of our techniques, our experimental results show the merit of general parallelepiped tiling transformations and verify previous theoretical work on scheduling-optimal tile shapes
    corecore