
    Pipelined Multithreading Transformations and Support Mechanisms

    Even though chip multiprocessors have emerged as the predominant organization for future microprocessors, the multiple on-chip cores do not directly result in improved application performance (especially for legacy applications, which are predominantly sequential C/C++ codes). Consequently, parallelizing applications to execute on multiple cores is essential to their success. Independent multithreading techniques, like DOALL extraction, create partially or fully independent threads, which communicate rarely, if at all. While such strategies keep high inter-thread communication costs from impacting program performance, they cannot be applied to parallelize general-purpose applications, which are characterized by difficult-to-break recurrences. Cyclic multithreading techniques, such as DOACROSS, are more widely applicable, but the cyclic inter-thread dependences they create leave them with very low tolerance for rising inter-core latencies.

    To address these problems, this work introduces a pipelined multithreading (PMT) transformation called Decoupled Software Pipelining (DSWP). DSWP in particular, and PMT techniques in general, are able to tolerate inter-core latencies while still handling codes with complex recurrences. They achieve this by enforcing an acyclic communication discipline amongst threads, which allows threads to communicate in a pipelined fashion through inter-thread queues. This dissertation demonstrates that DSWPed codes not only tolerate inter-core communication costs, but also tolerate variable-latency stalls in applications more effectively than single-threaded execution on both in-order and out-of-order issue processors with comparable resources. It then performs a thorough analysis of the performance scalability of automatically generated DSWPed codes and identifies the conditions necessary to achieve peak PMT performance.

    Next, the dissertation shows that even though PMT applications tolerate inter-core latencies well, the high frequency of inter-thread communication in these codes (once every 5 to 20 dynamic instructions) makes them very sensitive to the intra-thread overhead imposed by communication operations. To understand the issues surrounding inter-thread communication for PMT applications, this dissertation undertakes a methodical exploration of the design space of communication support options for PMT. Three new communication mechanisms with varying cost-performance tradeoffs are introduced and are shown to perform 38% to 200% better than the state of the art.
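
    A minimal C++ sketch of the acyclic communication discipline described above, assuming a simple mutex-and-condition-variable bounded queue; the queue class, stage names, and the toy recurrence are illustrative only and are not the dissertation's actual communication mechanisms. The point is the structure: the stage that owns the recurrence only ever sends values downstream, so inter-thread queue latency overlaps with useful work instead of being paid on every iteration as in DOACROSS.

    #include <condition_variable>
    #include <cstddef>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Blocking bounded queue: the only channel between the two pipeline stages.
    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t cap) : cap_(cap) {}
        void push(T v) {
            std::unique_lock<std::mutex> lk(m_);
            not_full_.wait(lk, [&] { return q_.size() < cap_; });
            q_.push(std::move(v));
            not_empty_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            not_empty_.wait(lk, [&] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop();
            not_full_.notify_one();
            return v;
        }
    private:
        std::mutex m_;
        std::condition_variable not_full_, not_empty_;
        std::queue<T> q_;
        std::size_t cap_;
    };

    int main() {
        BoundedQueue<long> q(64);

        // Stage 1 keeps the loop-carried recurrence (a toy linear recurrence here)
        // entirely local; it only ever sends values downstream.
        std::thread stage1([&] {
            long x = 1;
            for (int i = 0; i < 10000; ++i) {
                x = (3 * x + 1) % 1000003;   // recurrence stays in this thread
                q.push(x);
            }
            q.push(-1);                      // sentinel marks end of stream
        });

        // Stage 2 only consumes; no data ever flows back to stage 1, so the
        // inter-thread dependence graph stays acyclic.
        std::thread stage2([&] {
            long long sum = 0;
            for (long v; (v = q.pop()) != -1;) sum += v;
            std::printf("sum = %lld\n", sum);
        });

        stage1.join();
        stage2.join();
        return 0;
    }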

    Amortizing Software Queue Overhead for Pipelined Inter-Thread Communication

    Future chip multiprocessors are expected to contain multiple on-die processing cores. Increased memory-system contention and wire delays will result in high inter-core latencies in these processors. Thus, parallelizing applications to execute efficiently on multiple contexts is key to achieving continued performance improvements. Recently proposed pipelined multithreading (PMT) techniques have shown significant promise for both manual and automatic parallelization. They tolerate increasing inter-thread communication delays by enforcing acyclic dependences amongst communicating threads and pipelining communication.

    However, the lack of efficient communication support for such programs hinders related language and compiler research. While researchers have proposed dedicated interconnects and storage for inter-core communication, such mechanisms are not cost-effective, consume extra power, demand chip redesign effort, and necessitate complex operating system modifications. Software implementations of shared-memory queues avoid these problems, but they tend to impose heavy overhead on each communication operation, causing them to negate parallelization benefits and, worse still, to perform slower than the original single-threaded codes. In this paper, we present a simple compiler analysis that coalesces synchronization and queue pointer updates for select communication operations, minimizing the intra-thread overhead of software queue implementations. A preliminary comparison of static schedule heights shows a considerable performance improvement over existing software queue implementations.
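
    A rough C++ sketch of the amortization idea on a hand-written single-producer/single-consumer ring buffer: the shared head and tail indices are published once per block of operations rather than on every produce or consume. The constants, names, and spin-wait implementation are assumptions made for the sketch; the paper's actual contribution is a compiler analysis that decides where coalescing synchronization and queue pointer updates is legal.

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <thread>

    constexpr std::size_t QSIZE = 1024;  // ring-buffer capacity (power of two)
    constexpr std::size_t BLOCK = 32;    // publish shared indices once per BLOCK ops

    std::array<long, QSIZE> buf;
    std::atomic<std::size_t> head{0};    // consumer's published position
    std::atomic<std::size_t> tail{0};    // producer's published position

    void producer(std::size_t n) {
        std::size_t local_tail = 0;
        for (std::size_t i = 0; i < n; ++i) {
            // Check for space against the consumer's *published* head; the slack
            // of BLOCK slots keeps this conservative check safe.
            while (local_tail - head.load(std::memory_order_acquire) >= QSIZE - BLOCK)
                ;  // spin
            buf[local_tail % QSIZE] = static_cast<long>(i);
            ++local_tail;
            if (local_tail % BLOCK == 0)
                tail.store(local_tail, std::memory_order_release);  // coalesced update
        }
        tail.store(local_tail, std::memory_order_release);          // flush remainder
    }

    void consumer(std::size_t n) {
        std::size_t local_head = 0;
        long long sum = 0;
        while (local_head < n) {
            std::size_t avail;
            while ((avail = tail.load(std::memory_order_acquire)) == local_head)
                ;  // spin until the producer publishes another block
            for (; local_head < avail; ++local_head)
                sum += buf[local_head % QSIZE];
            head.store(local_head, std::memory_order_release);      // coalesced update
        }
        std::printf("sum = %lld\n", sum);
    }

    int main() {
        const std::size_t N = 1u << 20;
        std::thread p(producer, N), c(consumer, N);
        p.join();
        c.join();
        return 0;
    }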

    Automatic thread extraction with decoupled software pipelining

    Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.
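
    A conceptual C++ sketch, under simplifying assumptions, of the loop partition DSWP produces for a pointer-chasing loop: the traversal recurrence stays in one thread, the per-node work moves to a second thread, and a one-way channel connects them. The Channel type here is a plain mutex-based stand-in rather than the communication support used in the paper, and all names are illustrative.

    #include <condition_variable>
    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Node { long value; Node* next; };

    // Original single-threaded loop: the variable-latency pointer-chasing load
    // (p = p->next) and the per-node work are serialized in one instruction stream.
    long long sum_list(Node* head) {
        long long sum = 0;
        for (Node* p = head; p != nullptr; p = p->next)
            sum += p->value;
        return sum;
    }

    // Minimal blocking channel standing in for the inter-thread queue.
    struct Channel {
        std::mutex m;
        std::condition_variable cv;
        std::deque<Node*> q;
        void produce(Node* p) {
            { std::lock_guard<std::mutex> lk(m); q.push_back(p); }
            cv.notify_one();
        }
        Node* consume() {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty(); });
            Node* p = q.front();
            q.pop_front();
            return p;
        }
    };

    int main() {
        // Build a small linked list backed by a vector.
        std::vector<Node> nodes(1000);
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            nodes[i].value = static_cast<long>(i);
            nodes[i].next = (i + 1 < nodes.size()) ? &nodes[i + 1] : nullptr;
        }

        Channel ch;
        // Thread 1: only the traversal recurrence; it produces node pointers.
        std::thread traverse([&] {
            for (Node* p = &nodes[0]; p != nullptr; p = p->next) ch.produce(p);
            ch.produce(nullptr);                 // end-of-stream marker
        });
        // Thread 2: only the per-node work, now overlapped with the traversal.
        long long pipelined_sum = 0;
        std::thread work([&] {
            for (Node* p; (p = ch.consume()) != nullptr;) pipelined_sum += p->value;
        });
        traverse.join();
        work.join();

        std::printf("sequential=%lld pipelined=%lld\n", sum_list(&nodes[0]), pipelined_sum);
        return 0;
    }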

    Decoupled Software Pipelining with the Synchronization Array

    Despite the success of instruction-level parallelism (ILP) optimizations in increasing the performance of microprocessors, certain codes remain elusive. In particular, codes containing recursive data structure (RDS) traversal loops have been largely immune to ILP optimizations, due to the fundamental serialization and variable latency of the loop-carried dependence through a pointer-chasing load. To address these and other situations, we introduce decoupled software pipelining (DSWP), a technique that statically splits a single-threaded sequential loop into multiple non-speculative threads, each of which performs useful computation essential for overall program correctness. The resulting threads execute on thread-parallel architectures such as simultaneous multithreaded (SMT) cores or chip multiprocessors (CMP), expose additional instruction level parallelism, and tolerate latency better than the original single-threaded RDS loop. To reduce overhead, these threads communicate using a synchronization array, a dedicated hardware structure for pipelined inter-thread communication. DSWP used in conjunction with the synchronization array achieves an 11% to 76% speedup in the optimized functions on both statically and dynamically scheduled processors.
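
    A software stand-in, for illustration only, of the kind of structure the synchronization array provides: a fixed set of dependence "lanes", each behaving as a small single-producer/single-consumer buffer with blocking produce and consume operations. In the paper this is a dedicated hardware structure; the lane count, depth, and function names below are assumptions made for the sketch.

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <thread>

    constexpr std::size_t LANES = 4;   // one lane per forwarded dependence
    constexpr std::size_t DEPTH = 32;  // entries buffered per lane

    struct Lane {
        std::array<long, DEPTH> slot;
        std::atomic<std::size_t> head{0}, tail{0};
    };

    std::array<Lane, LANES> sync_array;

    void produce(std::size_t dep, long value) {
        Lane& l = sync_array[dep];
        std::size_t t = l.tail.load(std::memory_order_relaxed);
        while (t - l.head.load(std::memory_order_acquire) >= DEPTH)
            ;  // lane full: block the producer
        l.slot[t % DEPTH] = value;
        l.tail.store(t + 1, std::memory_order_release);
    }

    long consume(std::size_t dep) {
        Lane& l = sync_array[dep];
        std::size_t h = l.head.load(std::memory_order_relaxed);
        while (l.tail.load(std::memory_order_acquire) == h)
            ;  // lane empty: block the consumer
        long v = l.slot[h % DEPTH];
        l.head.store(h + 1, std::memory_order_release);
        return v;
    }

    int main() {
        const long N = 100000;
        // Upstream DSWP thread: forwards two values per iteration on lanes 0 and 1.
        std::thread upstream([&] {
            for (long i = 0; i < N; ++i) {
                produce(0, i);          // e.g. the loaded value for this iteration
                produce(1, i * 2);      // e.g. a second forwarded dependence
            }
        });
        // Downstream DSWP thread: consumes the same dependences in order.
        std::thread downstream([&] {
            long long sum = 0;
            for (long i = 0; i < N; ++i)
                sum += consume(0) + consume(1);
            std::printf("sum = %lld\n", sum);
        });
        upstream.join();
        downstream.join();
        return 0;
    }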