
    Automatically Harnessing Sparse Acceleration

    Sparse linear algebra is central to many scientific programs, yet compilers fail to optimize it well. High-performance libraries are available, but adoption costs are significant. Moreover, libraries tie programs into vendor-specific software and hardware ecosystems, creating non-portable code. In this paper, we develop a new approach based on our specification Language for implementers of Linear Algebra Computations (LiLAC). Rather than requiring the application developer to (re)write every program for a given library, the burden is shifted to a one-off description by the library implementer. The LiLAC-enabled compiler uses this to insert appropriate library routines without source code changes. LiLAC provides automatic data marshaling, maintaining state between calls and minimizing data transfers. Appropriate places for library insertion are detected in compiler intermediate representation, independent of source languages. We evaluated LiLAC on large-scale scientific applications written in FORTRAN; standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across heterogeneous platforms, applications, and data sets, we show speedups of 1.1× to over 10× without user intervention. Comment: Accepted to CC 202
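    The core idea above — recognise a sparse kernel written as a plain loop and substitute a tuned library routine with identical semantics — can be sketched as follows. This is an illustrative toy, not LiLAC's actual specification language or API; the function names and the "library" stand-in are assumptions.

```python
# Hypothetical illustration of the idiom-replacement idea: an application
# writes a CSR sparse matrix-vector product as a plain loop, and a tuned
# routine with the same semantics is substituted for it. Both functions
# here are illustrative stand-ins, not LiLAC's API.

def naive_spmv(values, col_idx, row_ptr, x):
    """CSR sparse matrix-vector product as an application might write it."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

def library_spmv(values, col_idx, row_ptr, x):
    """Stand-in for the optimized vendor routine a compiler would insert."""
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(len(row_ptr) - 1)
    ]

# 2x2 matrix [[1, 2], [0, 3]] in CSR form
values, col_idx, row_ptr = [1.0, 2.0, 3.0], [0, 1, 1], [0, 2, 3]
x = [1.0, 1.0]
# The substitution is only valid because results are identical.
assert naive_spmv(values, col_idx, row_ptr, x) == library_spmv(values, col_idx, row_ptr, x) == [3.0, 3.0]
```

    The point of detecting the idiom at the intermediate-representation level, as the abstract describes, is that the same match works whether the loop came from FORTRAN, C, or C++ source.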

    Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors

    We present a new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack (dead) time and hardware resources not used for the main computation. The scheme introduces a new hardware construct, the Program Progress Graph (PPG), as a simple extension to the Branch Target Buffer (BTB). We use the PPG to implement a fast pre-program counter (pre-PC) that travels only through memory reference instructions, rather than scanning all the instructions sequentially. In a single clock cycle the pre-PC extracts all the predicted memory references in some future block of instructions, to obtain early data prefetching. In addition, the PPG can be used to implement a pre-processor and for instruction prefetching. The prefetch requests are scheduled to "tango" with the core requests from the data cache, by using only free time slots on the existing..
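    The pre-PC mechanism described above can be modelled in a few lines: instead of scanning every instruction, it follows precomputed links from one memory-reference instruction to the next. This is a software toy of the hardware idea, assuming instructions are already tagged as memory references; all names and the link-table representation of the PPG are illustrative.

```python
# Toy model of a pre-PC hopping between memory-reference instructions.
# The "PPG" is modelled as a per-position link to the next memory
# reference, so the pre-PC skips non-memory instructions entirely.

def build_ppg_links(program):
    """For each position, the index of the next memory-reference instruction."""
    links = [None] * len(program)
    nxt = None
    for i in range(len(program) - 1, -1, -1):
        links[i] = nxt
        if program[i]["is_mem"]:
            nxt = i
    return links

def prefetch_addresses(program, links, pc, window):
    """Addresses the pre-PC extracts within the next `window` instructions."""
    addrs = []
    i = pc if program[pc]["is_mem"] else links[pc]
    while i is not None and i < pc + window:
        addrs.append(program[i]["addr"])
        i = links[i]
    return addrs

program = [
    {"is_mem": False, "addr": None},   # add
    {"is_mem": True,  "addr": 0x100},  # load
    {"is_mem": False, "addr": None},   # branch
    {"is_mem": True,  "addr": 0x200},  # store
    {"is_mem": True,  "addr": 0x300},  # load
]
links = build_ppg_links(program)
print(prefetch_addresses(program, links, pc=0, window=4))  # [256, 512]
```

    In hardware, following such a link takes one cycle regardless of how many non-memory instructions it skips, which is what lets the pre-PC run ahead of the main PC and issue prefetches early.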

    Partitioning and Scheduling to Counteract Overhead

    No full text
    We introduce a scheduling model, inspired by data flow computers, in which the overhead incurred in a system, as well as computation time, are described explicitly. Using this model, we provide algorithms for partitioning programs so as to minimize their completion time. In the traditional data flow paradigm, every instruction is considered a "task" and is scheduled for execution as early as possible. Implementations of this scheme, however, involve overheads that affect the running time of the programs. We propose to partition the program into larger grains, each containing one or more instructions, such that scheduling those grains minimizes the completion time. Our model accounts for both the overhead incurred when executing a program and the actual execution time of its instructions. Within this framework, we derive lower and upper bounds on the execution time of programs represented as trees and DAGs. We provide algorithms for optimally partitioning such program..
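    The grain-size trade-off the abstract describes can be illustrated with a back-of-the-envelope model: each scheduled grain pays a fixed overhead on top of its instructions' execution time. The formula and constants below are illustrative assumptions, not taken from the paper.

```python
# Toy cost model: a sequential chain of n instructions, each taking time t,
# grouped into grains of size g; every grain pays a fixed scheduling
# overhead c. Completion time is then ceil(n/g) * (c + g*t).
import math

def chain_completion_time(n, g, t, c):
    grains = math.ceil(n / g)
    return grains * (c + g * t)

n, t, c = 100, 1.0, 5.0
times = {g: chain_completion_time(n, g, t, c) for g in (1, 5, 10, 100)}
# g=1 pays the overhead on every instruction; g=100 is one big grain.
print(times)  # {1: 600.0, 5: 200.0, 10: 150.0, 100: 105.0}
```

    On a pure chain, larger grains always win because there is no parallelism to lose; for trees and DAGs, coarsening grains also serializes work that could run concurrently, which is the tension the paper's lower and upper bounds capture.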