4 research outputs found

    Learning Assisted Decoupled Software Pipelining (LA-DSWP)

    In this thesis, I introduce and implement an extension to the Decoupled Software Pipelining (DSWP) algorithm proposed by Rangan et al. The extension is named Learning Assisted Decoupled Software Pipelining (LA-DSWP) because it applies reinforcement learning to the partitioning problem found within DSWP. Through experimentation, the viability of DSWP and LA-DSWP as optimizations that produce significant program speedup is tested and measured. As computer architects strive to keep up with public expectations of processor performance growth, they are increasingly turning to designs that place multiple independent cores on a single chip. Unlike most prior hardware advances, these multicore designs require programs to be written or compiled with multiple threads in mind before they can benefit. Automatic thread extraction using Decoupled Software Pipelining seeks to extract multiple threads from a single-threaded program (Ottoni et al., MICRO 2005). It does so by allowing loops within the program to execute simultaneously on multiple cores of a single processor chip without programmer intervention. DSWP focuses on splitting loops that traverse large recursive data structures into multiple threads in an attempt to increase overall program performance. Unlike prior implementations of DSWP, this research uses a hardware- and language-independent implementation built on the LLVM framework. Rather than relying on custom hardware to facilitate communication between program threads, this implementation uses Intel's Threading Building Blocks (TBB) library to create queues in the memory shared by the on-chip processor cores. As this thesis shows, the design relies heavily on the memory subsystem of the target processors, and its performance is strongly shaped by how that subsystem is organized. Another novel addition to DSWP explored in this thesis is the application of machine learning to the partitioning process. Instead of partitioning the nodes of a loop's program dependence graph using predefined heuristics, this thesis applies reinforcement learning so that the DSWP agent can make more informed decisions when optimizing a given loop. The agent collects and analyzes data about each node of a loop in order to partition the loop on a node-by-node basis; this addition constitutes LA-DSWP. Through experiments on modern Intel processors, this thesis tests the feasibility of LA-DSWP on current hardware. Multiple kernel programs were written to search for program patterns that can achieve performance increases under DSWP partitioning, and experiments were run using the partitioning methods discussed in earlier papers alongside the proposed machine-learning method.
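
    To make the queue-based decoupling concrete, the sketch below (plain C++ using TBB's concurrent_bounded_queue) splits a pointer-chasing loop into a traversal stage and a work stage that communicate through a shared-memory queue, the same pattern the abstract describes. It is only an illustration under stated assumptions: Node, heavy_work, dswp_pipeline, and the queue capacity are invented placeholders, not code from the thesis.

        // Illustrative DSWP-style split of a linked-list loop into two stages.
        // Stage 1 chases pointers; stage 2 does the per-node work; a bounded
        // TBB queue in shared memory carries nodes between the two threads.
        #include <thread>
        #include <tbb/concurrent_queue.h>

        struct Node { int value; Node* next; };          // placeholder list node

        int heavy_work(int v) { return v * v; }          // stands in for the loop body

        void dswp_pipeline(Node* head, long& sum) {
            tbb::concurrent_bounded_queue<Node*> q;
            q.set_capacity(64);                          // bound the number of in-flight nodes

            // Traversal stage: the pointer-chasing recurrence stays on one core.
            std::thread producer([&] {
                for (Node* n = head; n != nullptr; n = n->next)
                    q.push(n);                           // hand each node to the next stage
                q.push(nullptr);                         // sentinel: end of list
            });

            // Work stage: per-node computation overlaps with the traversal.
            std::thread consumer([&] {
                long local = 0;
                Node* n;
                for (q.pop(n); n != nullptr; q.pop(n))   // blocking pop from the shared queue
                    local += heavy_work(n->value);
                sum = local;
            });

            producer.join();
            consumer.join();
        }

    Because every node crosses the queue, the profitability of this pattern depends on how cheaply the shared-memory queue operations move through the cache hierarchy, which is consistent with the abstract's observation that results hinge on the processor's memory subsystem.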

    Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution


    Beating in-Order Stalls With "Flea-Flicker" Two-Pass Pipelining

    Accommodating the uncertain latency of load instructions is one of the most vexing problems in in-order microarchitecture design and compiler development. Compilers can generate schedules with a high degree of instruction-level parallelism but cannot effectively accommodate unanticipated latencies; incorporating traditional out-of-order execution into the microarchitecture hides some of this latency but redundantly performs work done by the compiler and adds additional pipeline stages. Although effective techniques, such as prefetching and threading, have been proposed to deal with anticipable, long-latency misses, the shorter, more diffuse stalls due to difficult-to-anticipate first- or second-level misses are less easily hidden on in-order architectures. This paper addresses that problem by proposing a microarchitectural technique, referred to as two-pass pipelining, in which the program executes on two in-order back-end pipelines coupled by a queue. The "advance" pipeline executes instructions greedily, without stalling on unanticipated latency dependences, executing independent instructions while deferring those that would otherwise block. The "backup" pipeline allows concurrent resolution of the instructions deferred in the other pipeline, resulting in the absorption of shorter misses and the overlap of longer ones. This paper argues that this design is both achievable and a good use of transistor resources, and shows results indicating that it can deliver significant speedups for in-order processor designs.
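
    As a rough, purely illustrative software model of the idea (not the paper's hardware), the sketch below shows an advance pass that never stalls, deferring a missed load and anything that depends on it into a coupling queue, and a backup pass that resolves the deferred instructions once the missing value returns; the toy instruction format and the value 42 are invented for the example.

        // Toy model of two-pass pipelining: the advance pass defers not-ready
        // work into a queue, and the backup pass resolves it after the miss returns.
        #include <cstdio>
        #include <deque>
        #include <vector>

        struct Instr {
            int dest, src;   // register numbers in a toy one-operand ISA
            bool is_load;    // in this toy model, every load is assumed to miss
        };

        int main() {
            std::vector<int> regs(8, 0);
            std::vector<bool> ready(8, true);

            std::vector<Instr> prog = {
                {1, 0, true},   // load r1      -- assumed cache miss
                {2, 1, false},  // r2 = r1 + 1  -- depends on the miss, will be deferred
                {3, 0, false},  // r3 = r0 + 1  -- independent, executes in the advance pass
            };

            std::deque<Instr> deferred;   // queue coupling the two pipelines

            // Advance pass: never stall -- the missed load and anything with an
            // unready input are deferred; independent work executes immediately.
            for (const Instr& i : prog) {
                if (i.is_load || !ready[i.src]) {
                    ready[i.dest] = false;
                    deferred.push_back(i);
                } else {
                    regs[i.dest] = regs[i.src] + 1;
                }
            }

            // The miss has returned; backup pass resolves deferred instructions in order.
            for (const Instr& i : deferred) {
                regs[i.dest] = i.is_load ? 42 : regs[i.src] + 1;
                ready[i.dest] = true;
            }

            std::printf("r1=%d r2=%d r3=%d\n", regs[1], regs[2], regs[3]);
            return 0;
        }

    The point of the model is only the division of labor: independent work (r3) completes in the advance pass instead of stalling behind the miss, while the dependent chain is absorbed later by the backup pass.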