26 research outputs found

    A Theoretical Approach Involving Recurrence Resolution, Dependence Cycle Statement Ordering and Subroutine Transformation for the Exploitation of Parallelism in Sequential Code.

    Get PDF
    To exploit parallelism in Fortran code, this dissertation consists of a study of the following three issues: (1) recurrence resolution in Do-loops for vector processing, (2) dependence cycle statement ordering in Do-loops for parallel processing, and (3) sub-routine parallelization. For recurrence resolution, the major findings include: (1) the node splitting algorithm cannot be used directly to break an essential antidependence link, of which the source variable that results in antidependence is itself the sink variable of another true dependence so a correction method is proposed, (2) a sink variable renaming technique is capable of breaking an antidependence and/or output-dependence link, (3) for recurrences formed by only true dependences, a dynamic dependence concept and the derived technique are powerful, and (4) by integrating related techniques, an algorithm for resolving a general multistatement recurrence is developed. The performance of a parallel loop is determined by the level of parallelism and the time delay due to interprocessor communication and synchronization. For a dependence cycle of a single parallel loop executed in a general synchronization mode, the parallelism exposed varies with the alignment of statements. Statements are reordered on the basis of execution-time of the loop as estimated at compile-time. An improved timing formula and a derived statement ordering algorithm are proposed. Further extension of this algorithm to multiple perfectly nested Do-loops with simple global dependence cycle is also presented. The subroutine is a potential source for parallel processing. Several problems must be solved for subroutine parallelization: (1) the precedence of parallel executions of subroutines, (2) identification of the optimum execution mode for each subroutine and (3) the restructuring of a serial program. A five-step approach to parallelize called subroutines for a calling subroutine is proposed: (1) computation of control dependence, (2) approximation of the global effects of subroutines, (3) analysis of data dependence, (4) identification of execution mode, and (5) restructuring of calling and called subroutines. Application of these five steps in a recursive manner to different levels of calling subroutines in a program addresses the parallelization of subroutines

    Parallel machine architecture and compiler design facilities

    Get PDF
    The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of Delta project (which objective is to provide a facility to allow rapid prototyping of parallelized compilers that can target toward different machine architectures) is summarized. Included are the surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role

    Automatic Parallelization of Tiled Stencil Loop Nests on GPUs

    Full text link
    This thesis attempts to design and implement a compiler framework based on the polyhedral model. The compiler automatically parallelizes loop nests; especially stencil kernels, into efficient GPU code by loop tiling transformations which the polyhedral model describes. To enhance parallel performance, we introduce three practically efficient techniques to process different types of loop nests. The experimental results of our compiler framework have demonstrated that these advanced techniques can outperform previous approaches. Firstly, we aim to find efficient tiling transformations without violating data dependences. How to select a tile's shape and size is an open issue that is performance-critical and influenced by GPU's hardware constraints. We propose an approach to determine the tile shapes out of consideration for improving two-level parallelism of GPUs. The new approach finds appropriate tiling hyperplanes by embedding parallelism-enhancing constraints into the polyhedral model to maximize intra-tile, i.e., intra-SM parallelism. This improves the load balance among the streaming processors (SPs), which execute a wavefront of loop iterations within a tile. We eliminate parallelism-hindering false dependences to optimize inter-tile, i.e., inter-SM parallelism. This improves the load balance among the streaming multiprocessors (SMs), which execute a wavefront of tiles. Furthermore, to avoid combinatorial explosion of tile size's configurations, we present a model-driven approach to automating tile size selection that is performance-critical for loop tiling transformations, especially for DOACROSS loop nests. Our tile size selection model accurately estimates the execution times of tiled loop nests running on GPUs. The selected tile sizes lead to the performance results that are close to the best observed for a range of problem sizes tested. Finally, to address the difficulty and low-performance of parallelizing widely used SOR stencil loop nests, we present a new tiled parallel SOR method, called MLSOR, which admits more efficient data-parallel SIMD execution on GPUs. Unlike the previous two approaches that are dependence-preserving, the basic idea is to algorithmically restructure a stencil kernel based on a non-dependence-preserving parallelization scheme to avoid pipelining for higher parallelism. The new approach can be implemented in compilers through a pattern matching pass to optimize SOR-like DOACROSS loop nests on GPUs

    Compiler Optimization Techniques for Scheduling and Reducing Overhead

    Get PDF
    Exploiting parallelism in loops in programs is an important factor in realizing the potential performance of processors today. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops on processors. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach of optimizing the address generation of these problems that results in the following: (i) elimination of redundant arithmetic computation by recognizing and exploiting the presence of common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which leads to reduced execution time, decrease in address arithmetic overhead, access to data in registers as opposed to caches, etc. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge to optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, little work has been done in the area of applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding theminimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive due to its high cost. Chapter 4 presents a method for eliminating redundant synchronization for nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving high level of performance on a parallel machine. Chapter 5 presents an approach using the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves the data locality but significantly reduces communication overhead

    Optimization within a Unified Transformation Framework

    Get PDF
    Programmers typically want to write scientific programs in a high level language with semantics based on a sequential execution model. To execute efficiently on a parallel machine, however, a program typically needs to contain explicit parallelism and possibly explicit communication and synchronization. So, we need compilers to convert programs from the first of these forms to the second. There are two basic choices to be made when parallelizing a program. First, the computations of the program need to be distributed amongst the set of available processors. Second, the computations on each processor need to be ordered. My contribution has been the development of simple mathematical abstractions for representing these choices and the development of new algorithms for making these choices. I have developed a new framework that achieves good performance by minimizing communication between processors, minimizing the time processors spend waiting for messages from other processors, and ordering data accesses so as to exploit the memory hierarchy. This framework can be used by optimizing compilers, as well as by interactive transformation tools. The state of the art for vectorizing compilers is already quite good, but much work remains to bring parallelizing compilers up to the same standard. The main contribution of my work can be summarized as improving this situation by replacing existing ad hoc parallelization techniques with a sound underlying foundation on which future work can be built. (Also cross-referenced as UMIACS-TR-96-93

    UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models

    Full text link
    The complexity of heterogeneous computing architectures, as well as the demand for productive and portable parallel application development, have driven the evolution of parallel programming models to become more comprehensive and complex than before. Enhancing the conventional compilation technologies and software infrastructure to be parallelism-aware has become one of the main goals of recent compiler development. In this paper, we propose the design of unified parallel intermediate representation (UPIR) for multiple parallel programming models and for enabling unified compiler transformation for the models. UPIR specifies three commonly used parallelism patterns (SPMD, data and task parallelism), data attributes and explicit data movement and memory management, and synchronization operations used in parallel programming. We demonstrate UPIR via a prototype implementation in the ROSE compiler for unifying IR for both OpenMP and OpenACC and in both C/C++ and Fortran, for unifying the transformation that lowers both OpenMP and OpenACC code to LLVM runtime, and for exporting UPIR to LLVM MLIR dialect.Comment: Typos corrected. Format update
    corecore