18 research outputs found
Automatically Harnessing Sparse Acceleration
Sparse linear algebra is central to many scientific programs, yet compilers
fail to optimize it well. High-performance libraries are available, but
adoption costs are significant. Moreover, libraries tie programs into
vendor-specific software and hardware ecosystems, creating non-portable code.
In this paper, we develop a new approach based on our specification Language
for implementers of Linear Algebra Computations (LiLAC). Rather than requiring
the application developer to (re)write every program for a given library, the
burden is shifted to a one-off description by the library implementer. The
LiLAC-enabled compiler uses this to insert appropriate library routines without
source code changes.
LiLAC provides automatic data marshaling, maintaining state between calls and
minimizing data transfers. Appropriate places for library insertion are
detected in compiler intermediate representation, independent of source
languages.
We evaluated on large-scale scientific applications written in FORTRAN;
standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across
heterogeneous platforms, applications and data sets we show speedups of
1.1 to over 10 without user intervention.Comment: Accepted to CC 202
Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors
We present a new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack (dead) time and hardware resources not used for the main computation. The scheme suggests a new hardware construct, the Program Progress Graph (PPG), as a simple extension to the Branch Target Buffer (BTB). We use the PPG for implementing a fast pre-program counter, pre-PC, that travels only through memory reference instructions (rather than scanning all the instructions sequentially). In a single clock cycle the pre-PC extracts all the predicted memory references in some future block of instructions, to obtain early data prefetching. In addition, the PPG can be used to implement a pre-processor and for instruction prefetching. The prefetch requests are scheduled to "tango" with the core requests from the data cache, by using only free time slots on the existing..
Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors
We present a new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack time and hardware resources not used for the main computation. The scheme suggests a new hardware construct, the Program Progress Graph (PPG), as a simple extension to the Branch Target Buffer (BTB). We use the PPG for implementing a fast pre-program counter, pre-PC, that travels only through memory reference instructions (rather than scanning all the instructions sequentially). In a single clock cycle the pre-PC extracts all the predicted memory references in some future block of instructions, to obtain early data prefetching. In addition, the PPG can be used for implementing a pre-processor and for instruction prefetching. The prefetch requests are scheduled to "tango" with the core requests from the data cache, by using only free time slots on the existing da..
Partitioning and Scheduling to Counteract Overhead
We introduce a scheduling model, inspired by data flow computers, in which the overhead incurred in a system as well as computation time are described explicitly. Using this model, we provide algorithms for partitioning programs so as to minimize their completion time. In the traditional data flow paradigm, every instruction is considered a "task", and it is scheduled for execution as early as possible. Implementations of this scheme, however, involve overheads which affect the running time of the programs. We propose to partition the program into larger grains, each containing one or more instructions, such that scheduling those grains would result in minimizing the completion time. Our model accounts for both the overhead incurred when executing a program and the actual execution time of its instructions. Within this framework we derive lower and upper bounds on the execution time of programs represented as trees and DAGs. We provide algorithms for optimally partitioning such program..