
    Run-time parallelization and scheduling of loops

    The class of problems that can be effectively compiled by parallelizing compilers is discussed. This is accomplished with the doconsider construct, which would allow these compilers to parallelize many problems in which substantial loop-level parallelism is available but cannot be detected by standard compile-time analysis. We describe and experimentally analyze mechanisms used to parallelize the work required for these types of loops. In each of these methods, a new loop structure is produced by modifying the loop to be parallelized. We also present the rules by which these loop transformations may be automated so that they can be included in language compilers. The main application area of the research involves problems in scientific computation and engineering. The workload used in our experiments includes a mixture of real problems and synthetically generated inputs. From our extensive tests on the Encore Multimax/320, we conclude that, for the types of workloads we have investigated, self-execution almost always performs better than pre-scheduling. Further, the performance improvement gained from global topological sorting of indices, as opposed to the less expensive local sorting, is not very significant in the case of self-execution.
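The global topological sorting of indices mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `wavefronts` and the `deps` input format are assumptions made for illustration. It assumes each iteration depends only on earlier iterations, as in a sequential loop, and groups iterations into levels whose members are mutually independent and could run in parallel.

```python
def wavefronts(deps):
    """Group loop iterations into wavefronts (hypothetical helper).

    deps[i] lists the earlier iterations (indices < i) whose results
    iteration i needs. This format is an assumption of the sketch.
    """
    level = [0] * len(deps)
    for i, preds in enumerate(deps):
        # Iteration i can run one step after its latest predecessor;
        # an iteration with no predecessors lands in level 0.
        level[i] = 1 + max((level[p] for p in preds), default=-1)

    fronts = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        fronts[lv].append(i)
    return fronts


# Iterations 1 and 4 depend on 0; iteration 3 depends on 1 and 2.
print(wavefronts([[], [0], [], [1, 2], [0]]))
```

An executor would then process the wavefronts in order, distributing the iterations of each one across processors. The sketch leaves out the choice the abstract evaluates, namely whether that distribution is fixed ahead of time (pre-scheduling) or resolved as iterations become ready (self-execution).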

    Improving programmability and performance for scientific applications

    With modern advancements in hardware and software technology scaling towards new limits, our compute machines are reaching new potential to tackle more challenging problems. While the size and complexity of both the problems and solutions increase, the programming methodologies must remain at a level that can be understood by programmers and scientists alike. In our work, this problem is encountered when developing an optimized framework to best exploit the semantic properties of a finite-element solver. In an effort to address this problem, we explore programming and runtime models that decouple algorithmic complexity, parallelism concerns, and hardware mapping. We build upon these frameworks to exploit domain-specific semantics using high-level transformations and modifications to obtain performance through algorithmic and runtime optimizations. We first discuss optimizations performed on a computational mechanics solver using a novel coupling technique for multi-time scale methods for discrete finite element domains. We exploit domain semantics using a high-level dynamic runtime scheme to reorder and balance workloads to greatly improve runtime performance. The framework presented automatically chooses a near-optimal coupling solution and runs a work-stealing parallel executor to run effectively on multi-core systems. In my latter work, I focus on the parallel programming model Concurrent Collections (CnC) to seamlessly bridge the gap between performance and programmability. Because challenging problems in various domains, not limited to computational mechanics, require both domain expertise and programming prowess, there is a need for ways to separate those concerns. This thesis describes methods and techniques to obtain scalable performance using CnC programming while limiting the burden of programming. These high-level techniques are presented for two high-performance applications corresponding to hydrodynamics and multi-grid solvers.

    Acceleration of a Full-scale Industrial CFD Application with OP2


    Quantitative Performance Analysis of the SPEC OMPM2001 Benchmarks


    Compiler Optimization Techniques for Scheduling and Reducing Overhead

    Exploiting parallelism in loops in programs is an important factor in realizing the potential performance of processors today. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops on processors. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach to optimizing address generation for these problems that results in the following: (i) elimination of redundant arithmetic computation by recognizing and exploiting the presence of common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which leads to reduced execution time, decreased address arithmetic overhead, access to data in registers as opposed to caches, etc. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge to optimizing compilers. While fine-grain scheduling of inner loops has received a lot of attention, little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding the minimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization for nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach using the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves the data locality but also significantly reduces communication overhead.
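The conversion of array references to scalar accesses described for Chapter 2 can be illustrated with a minimal sketch. It is not taken from the dissertation: the three-point stencil and both function names are assumptions chosen for illustration. The naive version reads three array elements per iteration, while the second version carries two of them over from the previous iteration in scalars, so each iteration performs only one new array read.

```python
def stencil_naive(a):
    # Three array reads per iteration: a[i-1], a[i], a[i+1].
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]


def stencil_scalar_replaced(a):
    # Hypothetical hand-applied version of scalar replacement.
    out = []
    if len(a) < 3:
        return out
    left, mid = a[0], a[1]
    for i in range(1, len(a) - 1):
        right = a[i + 1]          # the only array read in this iteration
        out.append(left + mid + right)
        left, mid = mid, right    # rotate values for the next iteration
    return out


data = [1, 2, 3, 4, 5]
print(stencil_naive(data) == stencil_scalar_replaced(data))
```

In a compiled language the rotating scalars would typically be held in registers, which is where the reduction in address arithmetic and cache accesses would come from. In Python the sketch only demonstrates that the transformation preserves the result.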