53 research outputs found

    Efficient Machine-Independent Programming of High-Performance Multiprocessors

    Get PDF
    Parallel computing is regarded by most computer scientists as the most likely approach for significantly improving computing power for scientists and engineers. Advances in programming languages and parallelizing compilers are making parallel computers easier to use by providing a high-level portable programming model that protects software investment. However, experience has shown that simply finding parallelism is not always sufficient for obtaining good performance from today's multiprocessors. The goal of this project is to develop advanced compiler analysis of data and computation decompositions, thread placement, communication, synchronization, and memory system effects needed in order to take advantage of performance-critical elements in modern parallel architectures

    Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor

    Get PDF
    Recent proposals for Chip Multiprocessors (CMPs) advocate speculative, or implicit, threading in which the hardware employs prediction to peel off instruction sequences (i.e., implicit threads) from the sequential execution stream and speculatively executes them in parallel on multiple processor cores. These proposals augment a conventional multiprocessor, which employs explicit threading, with the ability to handle implicit threads. Current proposals focus on only implicitly-threaded code sections. This paper identifies, for the first time, the issues in combining explicit and implicit threading. We present the Multiplex architecture to combine the two threading models. Multiplex exploits the similarities between implicit and explicit threading, and provides a unified support for the two threading models without additional hardware. Multiplex groups a subset of protocol states in an implicitly-threaded CMP to provide a write-invalidate protocol for explicit threads. Using a fully-integrated compiler inf rastructure for automatic generation of Multiplex code, this paper presents a detailed performance analysis for entire benchmarks, instead of just implicitly- threaded sections, as done in previous papers. We show that neither threading models alone performs consistently better than the other across the benchmarks. A CMP with four dual-issue CPUs achieves a speedup of 1.48 and 2.17 over one dual-issue CPU, using implicit-only and explicit-only threading, respectively. Multiplex matches or outperforms the better of the two threading models for every benchmark, and a four-CPU Multiplex achieves a speedup of 2.63. Our detailed analysis indicates that the dominant overheads in an implicitly-threaded CMP are speculation state overflow due to limited L1 cache capacity, and load imbalance and data dependences in fine-grain threads

    Recursos anchos: una técnica de bajo coste para explotar paralelismo agresivo en códigos numéricos

    Get PDF
    Els bucles son la part que més temps consumeix en les aplicacions numèriques. El rendiment dels bucles està limitat tant pels recursos oferts per l'arquitectura com per les recurrències del bucle en la computació. Per executar més operacions per cicle, els processadors actuals es dissenyen amb graus creixents de replicació de recursos (tècnica de replicació) para ports de memòria i unitats funcionals. En canvi, el gran cost en termes d'àrea i temps de cicle d'aquesta tècnica limita tenir alts graus de replicació: alts valors en temps de cicle contraresten els guanys deguts al decrement en el nombre de cicles, mentre que alts valors en l'àrea requerida poden portar a configuracions impossibles d'implementar. Una alternativa a la replicació de recursos, és fer los més amples (tècnica que anomenem "widening"), i que ha estat usada en alguns dissenys recents. Amb aquesta tècnica, l'amplitud dels recursos s'amplia, fent una mateixa operació sobre múltiples dades. Per altra banda, alguns microprocessadors escalars de propòsit general han estat implementats amb unitats de coma flotants que implementen la instrucció sumar i multiplicar unificada (tècnica de fusió), el que redueix la latència de la operació combinada, tanmateix com el nombre de recursos utilitzats. A aquest treball s'avaluen un ampli conjunt d'alternatives de disseny de processadors VLIW que combinen les tres tècniques. S'efectua una projecció tecnològica de les noves generacions de processadors per predir les possibles alternatives implementables. Com a conclusió, demostrem que tenint en compte el cost, combinar certs graus de replicació i "widening" als recursos hardware és més efectiu que aplicar únicament replicació. Així mateix, confirmem que fer servir unitats que fusionen multiplicació i suma pot tenir un impacte molt significatiu en l'increment de rendiment en futures arquitectures de processadors a un cost molt raonable.Loops are the main time-consuming part of numerical applications. The performance of the loops is limited either by the resources offered by the architecture or by recurrences in the computation. To execute more operations per cycle, current processors are designed with growing degrees of resource replication (replication technique) for memory ports and functional units. However, the high cost in terms of area and cycle time of this technique precludes the use of high degrees of replication. High values for the cycle time may clearly offset any gain in terms of number of execution cycles. High values for the area may lead to an unimplementable configuration. An alternative to resource replication is resource widening (widening technique), which has also been used in some recent designs in which the width of the resources is increased (i.e., a single operation is performed over multiple data). Moreover, several general-purpose superscalar microprocessors have been implemented with multiply-add fused floating point units (fusion technique), which reduces the latency of the combined operation and the number of resources used. On this thesis, we evaluate a broad set of VLIW processor design alternatives that combine the three techniques. We perform a technological projection for the next processor generations in order to foresee the possible implementable alternatives. From this study, we conclude that if the cost is taken into account, combining certain degrees of replication and widening in the hardware resources is more effective than applying only replication. Also, we confirm that multiply-add fused units will have a significant impact in raising the performance of future processor architectures with a reasonable increase in cost

    Run-time optimization of adaptive irregular applications

    Get PDF
    Compared to traditional compile-time optimization, run-time optimization could offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underneath execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data, In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identified a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Specifically, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses effcient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems

    Memory Consistency and Cache Coherency in Network-on-Chip Based Multi-Core Systems

    Get PDF
    The complexity of modern Systems-on-Chips (SoC) is increasing with technology innovations. Designers of such systems are devoting significant attention not only to computation attributes, but increasingly more and more on communications characteristics. Having in mind scalability challenges, Networks-on-Chip (NoC) are already de facto standard for the communication backbone of SoC systems. As such, those systems are targeting more and more parallel execution of user defined, real-time applications, but the computer engineering society aims at hiding underlying platform specific characteristics and providing user with platform-independent services. Shared memory services are quite often a needed crucial property of such systems, therefore providing a coherent view, ensuring memory consistency, and still achieving the desired performance system characteristics is a huge challenge for scientists nowadays. With the invention of 3D integration, and opportunities of stacking memory modules on top of it, the concept of scalable shared memory will be one of the main memory access concepts besides message passing. In this thesis, the concept of a scalable coherency protocol which dynamically adopts to inputs of system and shared resources, is presented. Protocol ingredients, structure and internal modules interaction are described in detail. The conceptual idea of this protocol, influenced by widely accepted best practices in bus based systems as well of other NoC systems, is implemented for one particular type of NoC platform - XhiNoC (extendable Hierarchical Network-on Chip). The feasibility of the presented concept for distributed shared memory (DSM) coherency within NoC-based SoC architectures is confirmed by simulation-based experimental results.The complexity of modern Systems-on-Chips (SoC) is increasing with technology innovations. Designers of such systems are devoting significant attention not only to computation attributes, but increasingly more and more on communications characteristics. Having in mind scalability challenges, Networks-on-Chip (NoC) are already de facto standard for the communication backbone of SoC systems. As such, those systems are targeting more and more parallel execution of user defined, real-time applications, but the computer engineering society aims at hiding underlying platform specific characteristics and providing user with platform-independent services. Shared memory services are quite often a needed crucial property of such systems, therefore providing a coherent view, ensuring memory consistency, and still achieving the desired performance system characteristics is a huge challenge for scientists nowadays. With the invention of 3D integration, and opportunities of stacking memory modules on top of it, the concept of scalable shared memory will be one of the main memory access concepts besides message passing. In this thesis, the concept of a scalable coherency protocol which dynamically adopts to inputs of system and shared resources, is presented. Protocol ingredients, structure and internal modules interaction are described in detail. The conceptual idea of this protocol, influenced by widely accepted best practices in bus based systems as well of other NoC systems, is implemented for one particular type of NoC platform - XhiNoC (extendable Hierarchical Network-on Chip). The feasibility of the presented concept for distributed shared memory (DSM) coherency within NoC-based SoC architectures is confirmed by simulation-based experimental results

    Logical Inference Techniques for Loop Parallelization

    Get PDF
    This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop’s memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S = ∅, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially-high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ⇒ S = ∅). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient-independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECT-CLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers

    Performance Aspects of Synthesizable Computing Systems

    Get PDF

    A compiler cost model for speculative multithreading chip-multiprocessor architectures

    Get PDF

    Improving X10 Program Performances by Clock Removal

    Get PDF
    International audienceX10 is a promising recent parallel language designed specifically to address the challenges of productively programming a wide variety of target platforms. The sequential core of X10 is an object-oriented language in the Java family. This core is augmented by a few parallel constructs that create activities as a generalization of the well known fork/join model. Clocks are a generalization of the familiar barriers. Synchronization on a clock is specified by the advance() method call. Activities that execute \emph{advances} stall until all existent activities have done the same, and then are released at the same (logical) time. This naturally raises the following question: are clocks strictly necessary for X10 programs? Surprisingly enough, the answer is no, at least for sufficiently regular programs. One assigns a date to each operation, denoting the number of advances that the activity has executed before the operation. Operations with the same date constitute a \emph{front}, fronts are executed sequentially in order of increasing dates, while operations in a front are executed in parallel if possible. Depending on the nature of the program, this may entail some overhead, which can be reduced to zero for polyhedral programs. We show by experiments that, at least for the current X10 runtime, this transformation usually improves the performance of our benchmarks. Besides its theoretical interest, this transformation may be of interest for simplifying a compiler or runtime library

    Hybrid analysis of memory references and its application to automatic parallelization

    Get PDF
    Executing sequential code in parallel on a multithreaded machine has been an elusive goal of the academic and industrial research communities for many years. It has recently become more important due to the widespread introduction of multicores in PCs. Automatic multithreading has not been achieved because classic, static compiler analysis was not powerful enough and program behavior was found to be, in many cases, input dependent. Speculative thread level parallelization was a welcome avenue for advancing parallelization coverage but its performance was not always optimal due to the sometimes unnecessary overhead of checking every dynamic memory reference. In this dissertation we introduce a novel analysis technique, Hybrid Analysis, which unifies static and dynamic memory reference techniques into a seamless compiler framework which extracts almost maximum available parallelism from scientific codes and incurs close to the minimum necessary run time overhead. We present how to extract maximum information from the quantities that could not be sufficiently analyzed through static compiler methods, and how to generate sufficient conditions which, when evaluated dynamically, can validate optimizations. Our techniques have been fully implemented in the Polaris compiler and resulted in whole program speedups on a large number of industry standard benchmark applications
    corecore