1,498 research outputs found
Compiler Optimization Techniques for Scheduling and Reducing Overhead
Exploiting parallelism in loops in programs is an important factor in realizing the potential performance of processors today. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops on processors. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach of optimizing the address generation of these problems that results in the following: (i) elimination of redundant arithmetic computation by recognizing and exploiting the presence of common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which leads to reduced execution time, decrease in address arithmetic overhead, access to data in registers as opposed to caches, etc. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge to optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, little work has been done in the area of applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding theminimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive due to its high cost. Chapter 4 presents a method for eliminating redundant synchronization for nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving high level of performance on a parallel machine. Chapter 5 presents an approach using the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves the data locality but significantly reduces communication overhead
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
Scheduling Transformation and Dependence Tests for Recursive Programs
Scheduling transformations reorder the execution of operations in a program to improve locality and/or parallelism. The polyhedral model provides a general framework for performing instance-wise scheduling transformations for regular programs, reordering the iterations of loops that operate over dense arrays through transformations like tiling. There is no analogous framework for recursive programs—despite recent interest in optimizations like tiling and fusion for recursive applications. This paper presents PolyRec, the first general framework for applying scheduling transformations—like inlining, interchange, and code motion—to nested recursive programs and reasoning about their correctness. We describe the phases of PolyRec—representing dynamic instances, applying transformations, reasoning about correctness—and show that PolyRec is able to apply sophisticated, composed transformations to complex, nested recursive programs and improve performance through enhanced locality
Search-based Model-driven Loop Optimizations for Tensor Contractions
Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. The Tensor Contraction Engine (TCE) is a high-level program synthesis system that facilitates the generation of high-performance parallel programs from tensor contraction equations. We are developing a new software infrastructure for the TCE that is designed to allow experimentation with optimization algorithms for modern computing platforms, including for heterogeneous architectures employing general-purpose graphics processing units (GPGPUs). In this dissertation, we present improvements and extensions to the loop fusion optimization algorithm, which can be used with cost models, e.g., for minimizing memory usage or for minimizing data movement costs under a memory constraint. We show that our data structure and pruning improvements to the loop fusion algorithm result in significant performance improvements that enable complex cost models being use for large input equations. We also present an algorithm for optimizing the fused loop structure of handwritten code. It determines the regions in handwritten code that are safe to be optimized and then runs the loop fusion algorithm on the dependency graph of the code. Finally, we develop an optimization framework for generating GPGPU code consisting of loop fusion optimization with a novel cost model, tiling optimization, and layout optimization. Depending on the memory available on the GPGPU and the sizes of the tensors, our framework decides which processor (CPU or GPGPU) should perform an operation and where the result should be moved. We present extensive measurements for tuning the loop fusion algorithm, for validating our optimization framework, and for measuring the performance characteristics of GPGPUs. Our measurements demonstrate that our optimization framework outperforms existing general-purpose optimization approaches both on multi-core CPUs and on GPGPUs
Automatic Parallelization of Affine Loops using Dependence and Cache analysis in a Binary Rewriter
Today, nearly all general-purpose computers are parallel, but nearly all software running on them is serial. Bridging this disconnect by manually rewriting source code in parallel is prohibitively expensive. Automatic parallelization technology is therefore an attractive alternative.
We present a method to perform automatic parallelization in a binary rewriter. The input to the binary rewriter is the serial binary executable program and the output is a parallel binary executable. The advantages of parallelization in a binary rewriter versus a compiler include (i) compatibility with all compilers and languages; (ii) high economic feasibility from avoiding repeated compiler implementation; (iii) applicability to legacy binaries; and (iv) applicability to assembly-language programs.
Adapting existing parallelizing compiler methods that work on source code to work on binary programs instead is a significant challenge. This is primarily because symbolic and array index information used in existing compiler parallelizers is not available in a binary. We show how to adapt existing parallelization methods to achieve equivalent parallelization from a binary without such information. We have also designed a affine cache reuse model that works inside a binary rewriter building on the parallelization techniques. It quantifies cache reuse in terms of the number of cache lines that will be required when a loop dimension is considered for the innermost position in a loop nest. This cache metric can be used to reason about affine code that results when affine code is transformed using affine transformations. Hence, it can be used to evaluate candidate transformation sequences to improve run-time directly from a binary.
Results using our x86 binary rewriter called SecondWrite on a suite of dense- matrix regular programs from Polybench suite of benchmarks shows an geomean speedup of 6.81X from binary and 8.9X from source with 8 threads compared to the input serial binary on a x86 Xeon E5530 machine; and 8.31X from binary and 9.86X from source with 24 threads compared to the input serial binary on a x86 E7450 machine. Such regular loops are an important component of scientific and multi- media workloads, and are even present to a limited extent in otherwise non-regular programs.
Further in this thesis we present a novel algorithm that enhances the past techniques significantly for loops with unknown loop bounds by guessing the loop bounds using only the memory expressions present in a loop. It then inserts run-time checks to see if these guesses were indeed correct and if correct executes the parallel version of the loop, else the serial version executes. These techniques are applied to the large affine benchmarks in SPEC2006 and OMP2001 and unlike previous
methods the speedups from binary are as good as from source. We also present results on the number of loops parallelized directly from a binary with and without this algorithm. Among the 8 affine benchmarks among these suites, the best existing binary parallelization method achieves an geo-mean speedup of 1.33X, whereas our method achieves a speedup of 2.96X. This is close to the speedup from source code of 2.8X
- …