
    TRACO: Source-to-Source Parallelizing Compiler

    The paper presents TRACO, a source-to-source compiler for automatic extraction of both coarse- and fine-grained parallelism available in C/C++ loops. The parallelization techniques implemented in TRACO are based on the transitive closure of a relation describing all the dependences in a loop. Coarse- and fine-grained parallelism are represented with synchronization-free slices (space partitions) and a legal loop statement instance schedule (time partitions), respectively. TRACO also enables applying scalar and array variable privatization as well as parallel reduction. On its output, TRACO produces compilable parallel OpenMP C/C++ and/or OpenACC C/C++ code. The effectiveness of TRACO, the efficiency of the parallel code it produces, and the time of parallel code production are evaluated by means of the NAS Parallel Benchmark and Polyhedral Benchmark suites, and these features are compared with those of closely related compilers such as ICC, Pluto, Par4All, and Cetus. Future work is outlined.
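
    To make the notion of a synchronization-free slice concrete, here is a minimal hand-written sketch (not TRACO output; the loop and the array name are invented for illustration): the dependences in the nest below run only along j, so each value of i spans an independent slice that an OpenMP pragma can map to a thread with no synchronization.

        /* Hypothetical example; compile with: cc -fopenmp slices.c */
        #include <stdio.h>
        #define N 1024

        double a[N][N];

        int main(void) {
            /* a[i][j] depends only on a[i][j-1] (same i), so the rows
             * i = 0..N-1 are synchronization-free slices. */
            #pragma omp parallel for      /* one slice per i, no sync */
            for (int i = 0; i < N; i++)
                for (int j = 1; j < N; j++)
                    a[i][j] = a[i][j - 1] + 1.0;

            printf("%f\n", a[N - 1][N - 1]);
            return 0;
        }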

    Parallel Tiled Code Generation with Loop Permutation within Tiles

    An approach to generating tiled code with an arbitrary order of loops within tiles is presented. It is based on the transitive closure of the program dependence graph and is derived via a combination of the Polyhedral and Iteration Space Slicing frameworks. The approach is explained by means of a working example, and details of its implementation in the TRACO compiler are outlined. The increase in tiled program performance due to loop permutation within tiles is illustrated on real-life programs from the NAS Parallel Benchmark suite, and an analysis of the speed-up and scalability of parallel tiled code with loop permutation is presented.
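
    The shape of such code can be illustrated with a small hand-written sketch (the array and tile size are assumed; this is not code generated by TRACO): the inter-tile loops ii and jj enumerate tiles in their original order, while the intra-tile loops have been permuted from (i, j) to (j, i), which is legal here because the statement carries no dependences.

        #define N 2048
        #define B   64                               /* tile size */

        double x[N][N];

        void tiled_permuted(void) {
            for (int ii = 0; ii < N; ii += B)        /* inter-tile loops */
                for (int jj = 0; jj < N; jj += B)
                    /* intra-tile loops permuted: j is now outermost */
                    for (int j = jj; j < jj + B && j < N; j++)
                        for (int i = ii; i < ii + B && i < N; i++)
                            x[i][j] = 2.0 * x[i][j];
        }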

    An Introduction to Slice-Based Cohesion and Coupling Metrics

    This report provides an overview of slice-based software metrics. It brings together information about the development of the metrics, from Weiser’s original idea that program slices may be used in the measurement of program complexity, together with alternative slice-based measures proposed by other researchers. In particular, it details two aspects of slice-based metric calculation not covered elsewhere in the literature: output variables and worked examples of the calculations. First, output variables are explained, their use is explored, and standard reference terms and usage are proposed. Calculating slice-based metrics requires a clear understanding of ‘output variables’ because they form the basis for extracting the program slices on which the calculations depend. The report includes a survey of the variation in the definitions of output variables used by different research groups and suggests standard terms of reference for these variables. Our study identifies four elements which are combined in the definition of output variables: the function return value, modified global variables, modified reference parameters, and variables printed or otherwise output by the module. Second, slice-based metric calculations are explained with the aid of worked examples, to assist newcomers to the field. Step-by-step calculations of slice-based cohesion and coupling metrics, based on the vertices output by the static analysis tool CodeSurfer, are presented and compared with line-based calculations.
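
    A tiny worked sketch of the arithmetic (a hypothetical module and hypothetical slice sets, not taken from the report or from CodeSurfer): for a module of 8 statements with two output variables, say the return value and one modified reference parameter, the end slices are encoded below as bitmasks over statement indices 0..7, and the Ott/Thuss-style cohesion metrics Tightness, Coverage, and Overlap are computed from them.

        #include <stdio.h>

        #define LEN 8                       /* module length in statements */

        static int popcount(unsigned x) {
            int c = 0;
            for (; x; x >>= 1) c += x & 1;
            return c;
        }

        int main(void) {
            unsigned s1 = 0xEF;     /* slice for the return value: {0,1,2,3,5,6,7} */
            unsigned s2 = 0x3B;     /* slice for the reference parameter: {0,1,3,4,5} */
            unsigned common = s1 & s2;              /* intersection of all slices */

            double tightness = (double)popcount(common) / LEN;
            double coverage  = (popcount(s1) / (double)LEN
                              + popcount(s2) / (double)LEN) / 2.0;
            double overlap   = ((double)popcount(common) / popcount(s1)
                              + (double)popcount(common) / popcount(s2)) / 2.0;

            printf("tightness=%.3f coverage=%.3f overlap=%.3f\n",
                   tightness, coverage, overlap);
            return 0;
        }

    Here the intersection is {0,1,3,5}, giving Tightness 0.5, Coverage 0.75, and Overlap about 0.69.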

    On Characterizing the Data Access Complexity of Programs

    Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Computational complexity will therefore no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of developing lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower-bound analysis for the red/blue pebble game are very limited in effectiveness when applied to the CDAGs of real programs, whose computations comprise multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs as a function of the problem size and cache size.
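
    As a point of reference for what such bounds look like, the classical result in Hong & Kung's model (a standard textbook example, not the composition technique of this paper) states that any schedule of conventional n-by-n matrix multiplication on a machine with a fast memory of size S must perform

        Q(n, S) \;=\; \Omega\!\left( \frac{n^{3}}{\sqrt{S}} \right)

    data movements between slow and fast memory; blocked matrix multiplication with tile size \Theta(\sqrt{S}) attains this bound asymptotically.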

    Compiler Optimization Techniques for Scheduling and Reducing Overhead

    Exploiting parallelism in loops is an important factor in realizing the potential performance of today's processors. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach to optimizing the address generation of these problems that results in: (i) elimination of redundant arithmetic computation by recognizing and exploiting common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which reduces execution time and address arithmetic overhead and allows data to be accessed in registers rather than caches. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for optimizing compilers. While fine-grain scheduling of inner loops has received a lot of attention, little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops that formulates the problem of finding the minimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization in nested loops, where a dependence may be redundant in only a portion of the iteration space; a characterization of the non-uniformity of this redundancy is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach using the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays; this not only improves data locality but also significantly reduces communication overhead.
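
    The Chapter 2 transformation can be pictured with a small hand-written stencil sketch (illustrative only; the array names and the 3-point stencil are assumed, and this is not the dissertation's algorithm): the naive loop reloads a[i-1] and a[i] that were already read on the previous iteration, while the scalar-replaced version rotates those values through scalars so that reused data stays in registers and only one new load is issued per iteration.

        #define N 1000000
        double a[N], b[N];

        void stencil_naive(void) {
            for (int i = 1; i < N - 1; i++)
                b[i] = a[i - 1] + a[i] + a[i + 1];   /* 3 loads per iteration */
        }

        void stencil_scalar_replaced(void) {
            double am1 = a[0], a0 = a[1];            /* window kept in registers */
            for (int i = 1; i < N - 1; i++) {
                double ap1 = a[i + 1];               /* the only new load */
                b[i] = am1 + a0 + ap1;
                am1 = a0;                            /* rotate the window */
                a0  = ap1;
            }
        }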