7 research outputs found
Compiler Optimization Techniques for Scheduling and Reducing Overhead
Exploiting parallelism in loops is an important factor in realizing the potential performance of today's processors. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach to optimizing the address generation of these problems that results in: (i) elimination of redundant arithmetic computation, by recognizing and exploiting common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references as possible to scalar accesses, which reduces execution time, decreases address-arithmetic overhead, and allows data to be accessed in registers rather than caches. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for optimizing compilers. While fine-grain scheduling of inner loops has received a lot of attention, little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops: the problem of finding the minimum iteration initiation interval is formulated as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization in nested loops, where a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of this redundancy is developed in terms of the relation between the dependences and the shape and size of the iteration space.
Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach that uses the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves data locality but also significantly reduces communication overhead.
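The scalar-replacement idea behind Chapter 2 can be sketched with a toy example (an illustration of the general technique, not the dissertation's algorithm): in a 1-D three-point stencil, the values shared by consecutive iterations are rotated through scalars, so each iteration performs one array read instead of three.

```python
def stencil_naive(a):
    """Three-point average; each iteration issues three array reads."""
    out = [0.0] * len(a)
    for i in range(1, len(a) - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out

def stencil_scalarized(a):
    """Same stencil after scalar replacement: the values reused by the
    next iteration live in scalars (left, mid, right), so each iteration
    performs only one new array read and no redundant address arithmetic."""
    out = [0.0] * len(a)
    if len(a) < 3:
        return out
    left, mid = a[0], a[1]
    for i in range(1, len(a) - 1):
        right = a[i + 1]            # the only array read per iteration
        out[i] = (left + mid + right) / 3.0
        left, mid = mid, right      # rotate the reuse window
    return out
```

On a real machine the scalars map to registers, which is where the reported savings in address arithmetic and cache traffic come from.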
Establishing a base of trust with performance counters for enterprise workloads
Understanding the performance of large, complex enterprise-class applications is an important, yet nontrivial task. Methods using hardware performance counters, such as profiling through event-based sampling, are often favored over instrumentation for analyzing such large codes, but rarely provide good accuracy at the instruction level. This work evaluates the accuracy of multiple event-based sampling techniques and quantifies the impact of a range of improvements suggested in recent years. The evaluation is performed on instances of three modern CPU architectures, using designated kernels and full applications. We conclude that precisely distributed events considerably improve accuracy, with further improvements possible when using Last Branch Records. We also present practical recommendations for hardware architects, tool developers and performance engineers, aimed at improving the quality of results.
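Why imprecise sampling hurts instruction-level attribution can be illustrated with a small simulation (a toy model, not the paper's methodology): samples drawn in proportion to each instruction's true cost are attributed a few instructions late, as a skidding counter interrupt would do, and the resulting profile error is compared against skid-free ("precise") sampling.

```python
import random

def sample_profile(costs, n_samples, skid, seed=0):
    """Simulate event-based sampling: draw instructions proportionally to
    their true cost, then attribute each sample 'skid' instructions later,
    mimicking an imprecise performance-counter interrupt."""
    rng = random.Random(seed)
    n = len(costs)
    attributed = [0] * n
    population = [i for i, c in enumerate(costs) for _ in range(c)]
    for _ in range(n_samples):
        i = rng.choice(population)
        attributed[(i + skid) % n] += 1
    return attributed

def attribution_error(costs, attributed):
    """L1 distance between the true and the sampled cost distributions."""
    total_c, total_a = sum(costs), sum(attributed)
    return sum(abs(c / total_c - a / total_a)
               for c, a in zip(costs, attributed))
```

With a hot instruction dominating the profile, even a skid of two instructions moves most of the attributed cost onto the wrong line, which is the effect precisely distributed events eliminate.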
Datapath and memory co-optimization for FPGA-based computation
With the large resource densities available on modern FPGAs it is often the available
memory bandwidth that limits the parallelism (and therefore performance) that can be
achieved. For this reason the focus of this thesis is the development of an integrated
scheduling and memory optimisation methodology to allow high levels of parallelism to be
exploited in FPGA-based designs.
A manual translation from C to hardware is first investigated as a case study,
exposing a number of potential optimisation techniques that have not been exploited in
existing work. An existing outer loop pipelining approach, originally developed for VLIW
processors, is extended and adapted for application to FPGAs. The outer loop pipelining
methodology is first developed to use a fixed memory subsystem design and then extended
to automate the optimisation of the memory subsystem. This approach allocates arrays
to physical memories and selects the set of data reuse structures to implement to match
the available and required memory bandwidths as the pipelining search progresses. The
final extension to this work is to include the partitioning of data from a single array across
multiple physical memories, increasing the number of memory ports through which data
may be accessed. The facility for loop unrolling is also added to increase the potential for
parallelism and exploit the additional bandwidth that partitioning can provide.
We describe our approach based on formal methodologies and present the results
achieved when these methods are applied to a number of benchmarks. These results show
the advantages of both extending pipelining to levels above the innermost loop and the
co-optimisation of the datapath and memory subsystem.
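The array-partitioning step can be sketched as follows (a minimal illustration assuming cyclic partitioning; the thesis's own allocation search is more general): elements are distributed round-robin across physical memories, so up to one element per bank can be read through independent ports in the same cycle.

```python
def partition_cyclic(array, n_banks):
    """Distribute array elements across n_banks physical memories
    cyclically: element i goes to bank i % n_banks. With one port per
    bank, n_banks consecutive elements can be fetched per cycle."""
    banks = [[] for _ in range(n_banks)]
    for i, v in enumerate(array):
        banks[i % n_banks].append(v)
    return banks

def read(banks, i):
    """Address translation for a cyclically partitioned array:
    bank index is i % n_banks, the offset within the bank is i // n_banks."""
    n_banks = len(banks)
    return banks[i % n_banks][i // n_banks]
```

The extra index arithmetic is the hardware cost of partitioning; the benefit is the multiplied memory bandwidth that the unrolled datapath can then consume.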
Parallélisme des nids de boucles pour l'optimisation du temps d'exécution et de la taille du code (Nested-Loop Parallelism for Optimizing Execution Time and Code Size)
Real-time implementations of algorithms frequently include nested loops with long execution times. Several nested-loop parallelization techniques have therefore been proposed with the aim of decreasing these execution times. They can be classified by granularity: iteration-level parallelism and instruction-level parallelism. At the instruction level, existing techniques aim to achieve full parallelism. However, loop-carried dependences force instructions to be shifted across the nested loops, causing code growth proportional to the degree of parallelism. Consequently, these techniques produce implementations with non-optimal execution times and large code sizes, which are limiting factors on embedded real-time systems. In this work, we are interested in enhancing parallelization strategies for nested loops. The first contribution is a novel instruction-level parallelism technique, called "delayed multidimensional retiming", which schedules nested loops with the minimal cycle period without requiring full parallelism. The second contribution employs delayed multidimensional retiming when implementing nested loops on real-time embedded systems, the aim being to respect an execution-time constraint while using minimal code size. In this context, we first propose an approach that selects the minimal instruction-parallelism level that still respects the execution-time constraint. A second approach combines instruction-level and iteration-level parallelism, using delayed multidimensional retiming together with "loop striping".
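The basic effect of retiming an instruction across iterations can be sketched with a one-dimensional toy example (illustrating ordinary retiming, not the thesis's delayed multidimensional variant): shifting the producer statement one iteration ahead breaks the intra-iteration dependence, at the cost of a prologue and epilogue that enlarge the code.

```python
def original(n):
    """Two statements with an intra-iteration dependence:
    S2 must wait for S1 within the same iteration (long cycle period)."""
    a = [0] * n
    b = [0] * n
    for i in range(n):
        a[i] = i * i                      # S1
        b[i] = a[i] + 1                   # S2 reads S1 of the SAME iteration
    return b

def retimed(n):
    """After retiming S1 by one iteration: in the steady-state loop, S2
    reads a value produced in the PREVIOUS iteration, so the S1 and S2
    of one loop body are independent and can be issued in parallel."""
    a = [0] * n
    b = [0] * n
    if n == 0:
        return b
    a[0] = 0                              # prologue: S1 for iteration 0
    for i in range(n - 1):
        b[i] = a[i] + 1                   # S2 for iteration i
        a[i + 1] = (i + 1) * (i + 1)      # S1 shifted one iteration ahead
    b[n - 1] = a[n - 1] + 1               # epilogue: S2 for the last iteration
    return b
```

In a multidimensional nest each retiming step adds such prologue/epilogue code along a loop dimension, which is exactly the code-size growth the delayed variant tries to limit.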
Compilation Techniques for High-Performance Embedded Systems with Multiple Processors
Institute for Computing Systems Architecture
Despite the progress made in developing more advanced compilers for embedded systems,
programming of embedded high-performance computing systems based on Digital
Signal Processors (DSPs) is still a highly skilled manual task. This is true for
single-processor systems, and even more for embedded systems based on multiple
DSPs. Compilers often fail to optimise existing DSP codes written in C due to the
employed programming style. Parallelisation is hampered by the complex multiple address
space memory architecture, which can be found in most commercial multi-DSP
configurations.
This thesis develops an integrated optimisation and parallelisation strategy that can
deal with low-level C codes and produces optimised parallel code for a homogeneous
multi-DSP architecture with distributed physical memory and multiple logical address
spaces. In a first step, low-level programming idioms are identified and recovered. This
enables the application of high-level code and data transformations well-known in the
field of scientific computing. Iterative feedback-driven search for "good" transformation
sequences is being investigated. A novel approach to parallelisation based on a
unified data and loop transformation framework is presented and evaluated. Performance
optimisation is achieved through exploitation of data locality on the one hand,
and utilisation of DSP-specific architectural features such as Direct Memory Access
(DMA) transfers on the other hand.
The proposed methodology is evaluated against two benchmark suites (DSPstone
& UTDSP) and four different high-performance DSPs, one of which is part of a commercial
four processor multi-DSP board also used for evaluation. Experiments confirm
the effectiveness of the program recovery techniques as enablers of high-level transformations
and automatic parallelisation. Source-to-source transformations of DSP
codes yield an average speedup of 2.21 across four different DSP architectures. The
parallelisation scheme is, in conjunction with a set of locality optimisations, able to
produce linear and even super-linear speedups on a number of relevant DSP kernels
and applications.
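Why the locality optimisations pay off can be illustrated with a small cache model (a toy simulation, not the thesis's analysis): traversing a row-major array column-by-column defeats a small cache, while the row-by-row order that a locality-improving loop transformation would restore reuses each fetched line fully.

```python
from collections import OrderedDict

def count_line_misses(order, n, line=8, capacity=4):
    """Simulate a tiny fully associative LRU cache of 'capacity' lines
    while traversing an n x n row-major array one element at a time."""
    cache = OrderedDict()
    misses = 0
    for i, j in order(n):
        ln = (i * n + j) // line          # cache line of element (i, j)
        if ln in cache:
            cache.move_to_end(ln)         # LRU hit: refresh recency
        else:
            misses += 1
            cache[ln] = True
            if len(cache) > capacity:
                cache.popitem(last=False) # evict least recently used line
    return misses

def row_major(n):
    return ((i, j) for i in range(n) for j in range(n))

def col_major(n):
    return ((i, j) for j in range(n) for i in range(n))
```

With an 8-element line, row-major traversal misses once per line, whereas the strided column-major order misses on every access; the same asymmetry is what data transformations and DMA prefetching exploit on a multi-DSP memory system.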
Software Pipelining of Nested Loops
This paper presents an approach to software pipelining of nested loops. While several papers have addressed software pipelining of inner loops, little work has been done in the area of extending it to nested loops. This paper solves the problem of finding the minimum iteration initiation interval (in the absence of resource constraints) for each level of a nested loop. The problem is formulated as one of finding a rational quasi-affine schedule for each statement in the body of a perfectly nested loop which is then solved using linear programming. This allows us to treat iteration-dependent statement reordering and multidimensional loop unrolling in the same framework. Unlike most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling. Thus, the schedules derived allow for instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops, in t..
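The legality condition behind such affine schedules can be sketched as follows (a toy stand-in for the paper's linear program: an exhaustive search over small rational coefficients, with the sum of coefficients used as a rough proxy for the initiation interval):

```python
from fractions import Fraction
from itertools import product

def valid_schedule(coeffs, distance_vectors):
    """An affine schedule theta(I) = sum(c_k * i_k) is legal if every
    dependence with distance vector d is delayed by at least one step:
    theta(I + d) - theta(I) = sum(c_k * d_k) >= 1."""
    return all(sum(c * d for c, d in zip(coeffs, dv)) >= 1
               for dv in distance_vectors)

def find_min_rational_schedule(distance_vectors, denom=4, bound=2):
    """Exhaustively try rational coefficients c_k = p/denom in [0, bound]
    and return (cost, coeffs) for the cheapest legal schedule found.
    A real solver would obtain this directly via linear programming."""
    best = None
    dims = len(distance_vectors[0])
    for nums in product(range(0, bound * denom + 1), repeat=dims):
        coeffs = [Fraction(n, denom) for n in nums]
        if valid_schedule(coeffs, distance_vectors):
            cost = sum(coeffs)
            if best is None or cost < best[0]:
                best = (cost, coeffs)
    return best
```

For dependences (1,0) and (0,1) both coefficients must reach 1, while a single diagonal dependence (1,1) admits a cheaper schedule; allowing rational coefficients is what lets the schedule express initiation intervals below one iteration per statement.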