341 research outputs found

    Compiler Optimization Techniques for Scheduling and Reducing Overhead

    Get PDF
    Exploiting parallelism in loops in programs is an important factor in realizing the potential performance of processors today. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops on processors. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach of optimizing the address generation of these problems that results in the following: (i) elimination of redundant arithmetic computation by recognizing and exploiting the presence of common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which leads to reduced execution time, decrease in address arithmetic overhead, access to data in registers as opposed to caches, etc. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge to optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, little work has been done in the area of applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding theminimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive due to its high cost. Chapter 4 presents a method for eliminating redundant synchronization for nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving high level of performance on a parallel machine. Chapter 5 presents an approach using the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves the data locality but significantly reduces communication overhead

    Exploiting cache locality at run-time

    Get PDF
    With the increasing gap between the speeds of the processor and memory system, memory access has become a major performance bottleneck in modern computer systems. Recently, Symmetric Multi-Processor (SMP) systems have emerged as a major class of high-performance platforms. Improving the memory performance of Parallel applications with dynamic memory-access patterns on Symmetric Multi-Processors (SMP) is a hard problem. The solution to this problem is critical to the successful use of the SMP systems because dynamic memory-access patterns occur in many real-world applications. This dissertation is aimed at solving this problem.;Based on a rigorous analysis of cache-locality optimization, we propose a memory-layout oriented run-time technique to exploit the cache locality of parallel loops. Our technique have been implemented in a run-time system. Using simulation and measurement, we have shown our run-time approach can achieve comparable performance with compiler optimizations for those regular applications, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our approach was shown to improve significantly the memory performance for applications with dynamic memory-access patterns. Such applications are usually hard to optimize with static compiler optimizations.;Several contributions are made in this dissertation. We present models to characterize the complexity and present a solution framework for optimizing cache locality. We present an effective estimation technique for memory-access patterns to support efficient locality optimizations and information integration. We present a memory-layout oriented run-time technique for locality optimization. We present efficient scheduling algorithms to trade off locality and load imbalance. We provide a detailed performance evaluation of the run-time technique

    CYCLIC: A Locality-Preserving Load-Balancing Algorithm for PDES on Shared Memory Multiprocessors

    Get PDF
    This paper presents a new load-balancing algorithm for shared memory multiprocessors that is currently being applied to the parallel simulation of logic circuits, specifically VHDL simulations. The main idea of this load-balancing algorithm is based on the exploitation of the usual characteristics of these simulations, that is, cyclicity and predictability, to obtain a good load balance while preserving the locality of references. This algorithm is useful not only in the area of logic circuit simulation but also in systems presenting a cyclic execution pattern, that is, repetition over time, making the future behavior of the tasks predictable. An example of this is Parallel Discrete Event Simulation (PDES), where several tasks are repeatedly executed in response to certain events. A comparison between the proposed algorithm and other load-balancing algorithms found in the literature reveals consistently better execution times with improvements in both load-balancing and locality of references that can be of help on current multicore desktop computers

    The impact of operating system scheduling policies and synchronization methods of performance of parallel applications

    Full text link

    A load-sharing architecture for high performance optimistic simulations on multi-core machines

    Get PDF
    In Parallel Discrete Event Simulation (PDES), the simulation model is partitioned into a set of distinct Logical Processes (LPs) which are allowed to concurrently execute simulation events. In this work we present an innovative approach to load-sharing on multi-core/multiprocessor machines, targeted at the optimistic PDES paradigm, where LPs are speculatively allowed to process simulation events with no preventive verification of causal consistency, and actual consistency violations (if any) are recovered via rollback techniques. In our approach, each simulation kernel instance, in charge of hosting and executing a specific set of LPs, runs a set of worker threads, which can be dynamically activated/deactivated on the basis of a distributed algorithm. The latter relies in turn on an analytical model that provides indications on how to reassign processor/core usage across the kernels in order to handle the simulation workload as efficiently as possible. We also present a real implementation of our load-sharing architecture within the ROme OpTimistic Simulator (ROOT-Sim), namely an open-source C-based simulation platform implemented according to the PDES paradigm and the optimistic synchronization approach. Experimental results for an assessment of the validity of our proposal are presented as well

    Scalability of microkernel-based systems

    Get PDF

    Generic design of Chinese remaindering schemes

    Get PDF
    We propose a generic design for Chinese remainder algorithms. A Chinese remainder computation consists in reconstructing an integer value from its residues modulo non coprime integers. We also propose an efficient linear data structure, a radix ladder, for the intermediate storage and computations. Our design is structured into three main modules: a black box residue computation in charge of computing each residue; a Chinese remaindering controller in charge of launching the computation and of the termination decision; an integer builder in charge of the reconstruction computation. We then show that this design enables many different forms of Chinese remaindering (e.g. deterministic, early terminated, distributed, etc.), easy comparisons between these forms and e.g. user-transparent parallelism at different parallel grains
    • …
    corecore