112 research outputs found

    Efficient optimization of memory accesses in parallel programs

    Get PDF
    The power, frequency, and memory wall problems have caused a major shift in mainstream computing by introducing processors that contain multiple low power cores. As multi-core processors are becoming ubiquitous, software trends in both parallel programming languages and dynamic compilation have added new challenges to program compilation for multi-core processors. This thesis proposes a combination of high-level and low-level compiler optimizations to address these challenges. The high-level optimizations introduced in this thesis include new approaches to May-Happen-in-Parallel analysis and Side-Effect analysis for parallel programs and a novel parallelism-aware Scalar Replacement for Load Elimination transformation. A new Isolation Consistency (IC) memory model is described that permits several scalar replacement transformation opportunities compared to many existing memory models. The low-level optimizations include a novel approach to register allocation that retains the compile time and space efficiency of Linear Scan, while delivering runtime performance superior to both Linear Scan and Graph Coloring. The allocation phase is modeled as an optimization problem on a Bipartite Liveness Graph (BLG) data structure. The assignment phase focuses on reducing the number of spill instructions by using register-to-register move and exchange instructions wherever possible. Experimental evaluations of our scalar replacement for load elimination transformation in the Jikes RVM dynamic compiler show decreases in dynamic counts for getfield operations of up to 99.99%, and performance improvements of up to 1.76x on 1 core, and 1.39x on 16 cores, when compared with the load elimination algorithm available in Jikes RVM. A prototype implementation of our BLG register allocator in Jikes RVM demonstrates runtime performance improvements of up to 3.52x relative to Linear Scan on an x86 processor. When compared to Graph Coloring register allocator in the GCC compiler framework, our allocator resulted in an execution time improvement of up to 5.8%, with an average improvement of 2.3% on a POWER5 processor. With the experimental evaluations combined with the foundations presented in this thesis, we believe that the proposed high-level and low-level optimizations are useful in addressing some of the new challenges emerging in the optimization of parallel programs for multi-core architectures

    SVM Support in the Vienna Fortran Compilation System

    Get PDF
    Vienna Fortran, a machine-independent language extension to Fortran which allows the user to write programs for distributed-memory systems using global addresses, provides the forall-loop construct for specifying irregular computations that do not cause inter-iteration dependences. Compilers for distributed-memory systems generate code that is based on runtime analysis techniques and is only efficient if, in addition, aggressive compile-time optimizations are applied. Since these optimizations are difficult to perform we propose to generate shared virtual memory code instead that can benefit from appropriate operating system or hardware support. This paper presents the shared virtual memory code generation, compares both approaches and gives first performance results

    A Concurrent IFDS Dataflow Analysis Algorithm Using Actors

    Get PDF
    There has recently been a resurgence in interest in techniques for effective programming of multi-core computers. Most programmers find general-purpose concurrent programming to be extremely difficult. This difficulty severely limits the number of applications that currently benefit from multi-core computers. There already exist many concurrent solutions for the class of regular applications, which include various algorithms for linear algebra. For the class of irregular applications, which operate on dynamic and pointer- and graph-based structures, efficient concurrent solutions have so far remained elusive. Dataflow analysis applications, which are often found in compilers and program analysis tools, have received particularly little attention with regard to execution on multi-core machines. Operating on the theory that the Actor model, which structures computations as systems of asynchronously-communicating entities, is a more appropriate method for representing irregular algorithms than the shared-memory model, this work presents a concurrent Actor-based formulation of the IFDS, or Interprocedural Finite Distributive Subset, dataflow analysis algorithm. The implementation of this algorithm is done using the Scala language and its Actors library. This algorithm achieves significant speedup on multi-core machines without using any optimistic execution. This work contributes to Actor research by showing how the Actor model can be practically applied to a dataflow analysis problem. This work contributes to static analysis research by showing how a dataflow analysis algorithm can effectively make use of multi-core machines, allowing the possibility of faster and more precise analyses
    • …
    corecore