POET: Parameterized Optimization for Empirical Tuning
The excessive complexity of both machine architectures and applications has made it difficult for compilers to statically model and predict application behavior. This observation motivates the recent interest in performance tuning using empirical techniques. We present a new embedded scripting language, POET (Parameterized Optimization for Empirical Tuning), for parameterizing complex code transformations so that they can be empirically tuned. The POET language aims to significantly improve the generality, flexibility, and efficiency of existing empirical tuning systems. We have used the language to parameterize and empirically tune three loop optimizations (interchange, blocking, and unrolling) for two linear algebra kernels. We show experimentally that the time required to tune these optimizations using POET, which does not require any program analysis, is significantly shorter than the time required when using a full compiler-based source-code optimizer that performs sophisticated program analysis and optimizations.
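POET is its own scripting language, and the sketch below does not reproduce its syntax. It is a minimal Python illustration of the general idea of empirical tuning over a parameterized transformation, assuming a matrix-multiply kernel written in i-k-j order (one possible outcome of loop interchange) with the blocking factor exposed as a tunable parameter. `blocked_matmul`, `empirical_tune`, and `wall_clock` are hypothetical names introduced for illustration; they are not part of POET.

```python
import itertools
import time


def blocked_matmul(a, b, block):
    """Return C = A * B for square lists of lists, tiled by `block` (loop blocking)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Process one block x block tile; loops run in i-k-j order.
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    row_a = a[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def empirical_tune(kernel, space, measure):
    """Try every parameter combination in `space`; return the cheapest one and its cost."""
    best_params, best_cost = None, float("inf")
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        cost = measure(kernel, params)
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


def wall_clock(kernel, params, n=64):
    """Hypothetical cost function: wall-clock seconds for one run on n x n inputs."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    kernel(a, b, **params)
    return time.perf_counter() - start
```

For example, `empirical_tune(blocked_matmul, {"block": [8, 16, 32, 64]}, wall_clock)` returns whichever block size ran fastest on the current machine. Unrolling factors or alternative loop orders would add more keys to the search space, which is why exhaustive search grows expensive quickly. In pure Python, interpreter overhead dominates, so the timings here illustrate the search mechanics rather than real cache behavior.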
Efficient Generation of Sequences of Dense Linear Algebra through Auto-Tuning
It is rare for a programmer to solve a numerical problem with a single library call; most problems require a sequence of calls. In the case of linear algebra, programmers chain a series of Basic Linear Algebra Subprograms (BLAS) library calls to achieve the desired result. When a sequence of BLAS calls is memory bound, considerable performance is lost because no optimization takes place across library routine boundaries. It is not practical to create a library with every required sequence of linear algebra operations, yet it is difficult for programmers to write their own high-performance implementations. One solution is for programmers to use an auto-tuning tool capable of optimizing the exact sequence of operations they need. This thesis presents a matrix representation and type system that describes basic linear algebra operations, the loops required to implement those operations, and the legality of key optimizations. This is demonstrated in an auto-tuning tool that generates loops and applies data parallelism and loop fusion. Results show that this approach can match or exceed the performance of vendor-tuned BLAS libraries, general-purpose optimizing compilers, and hand-written code. Further, this approach is shown to be portable and to work with a range of dense matrix storage formats. All of this is achieved with search times ranging from several minutes to a few hours.
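To make the cost of a memory-bound call sequence concrete, the sketch below compares two library-style passes with a single fused loop. It assumes an AXPYDOT-style sequence (z = w - alpha * v, then r = z · u), chosen here purely for illustration; the function names are hypothetical, and the sketch is hand-fused rather than generated by the thesis's tool or type system.

```python
def axpydot_unfused(alpha, w, v, u):
    # Call 1 (AXPY-like): z = w - alpha * v, one full pass over w and v.
    z = [wi - alpha * vi for wi, vi in zip(w, v)]
    # Call 2 (DOT-like): r = z . u, a second full pass that re-reads all of z.
    r = sum(zi * ui for zi, ui in zip(z, u))
    return z, r


def axpydot_fused(alpha, w, v, u):
    # Fused loop: each z[i] feeds the dot product immediately,
    # so z is written once and never streamed back in a second pass.
    z = [0.0] * len(w)
    r = 0.0
    for i in range(len(w)):
        zi = w[i] - alpha * v[i]
        z[i] = zi
        r += zi * u[i]
    return z, r
```

Both versions perform the same arithmetic in the same order, so they return identical results; for example, `axpydot_fused(2.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [1.0, 0.0, 1.0])` returns `([-1.0, 0.0, 1.0], 0.0)`. The payoff of fusion is halving the passes over the vectors, which matters in compiled code where the unfused version is limited by memory bandwidth; fusion is legal here because iteration `i` of the dot product depends only on iteration `i` of the update.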