Partitioning loops with variable dependence distances
A new technique to parallelize loops with variable distance vectors is presented. The method extends previous methods in two ways. First, it allows array subscripts to be any linear combination of all loop indices; the solutions to the linear dependence equations established from such array subscripts are characterized by a pseudo distance matrix (PDM). Second, it allows loop parallelism to be exploited from the PDM by applying unimodular and partitioning transformations that preserve the lexicographical order of the dependent iterations. Algorithms to derive the PDM, to find a suitable loop transformation, and to generate parallel code are described, showing that a wider range of loops can be parallelized automatically.
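As a hedged illustration (the loop and array names below are invented, not taken from the paper), the following C nest has subscripts that are linear combinations of both loop indices, so the dependence distances vary over the iteration space rather than being constant vectors:

```c
/* Illustrative loop nest (invented): subscripts are linear combinations
 * of both loop indices, so the dependence distances are not constant.
 * The write A[2*i + j] and the read A[i + 2*j + 1] are dependent for
 * iteration pairs (i1, j1) -> (i2, j2) that satisfy the linear
 * dependence equation 2*i1 + j1 == i2 + 2*j2 + 1. */
#define N 100
static double A[4 * N];

void variable_distance_example(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[2 * i + j] = A[i + 2 * j + 1] + 1.0;
}
```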
High performance computing with FPGAs
Field-programmable gate arrays (FPGAs) consist of an array of logic units that can be organized in a highly parallel or pipelined fashion to implement an algorithm in hardware. The flexibility of this medium creates new challenges in finding the right processing paradigm, one that takes into account the natural constraints of FPGAs: clock frequency, memory footprint and communication bandwidth. In this paper, the use of FPGAs as a multiprocessor on a chip is first compared with their use as a highly functional coprocessor, and the programming tools for hardware/software codesign are discussed. Next, a number of techniques are presented to maximize parallelism and optimize data locality in nested loops, including unimodular transformations, data-locality-improving loop transformations and the use of smart buffers. Finally, the use of these techniques is demonstrated on a number of examples.
The results in the paper and in the literature show that, with the proper programming tool set, FPGAs can speed up computation kernels significantly with respect to traditional processors.
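As a hedged sketch of one of the locality-improving loop transformations mentioned above (loop interchange, a unimodular transformation), the generic kernel below is not taken from the paper; it shows how interchanging the loops turns a large-stride access pattern into stride-1 accesses, shrinking the on-chip buffer or cache footprint:

```c
/* Generic example, not from the paper.  B is stored row-major in C, so
 * iterating j in the inner loop gives stride-1 accesses. */
#define N 512
static float B[N][N];

/* before: inner loop strides through memory with step N */
void column_walk(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            B[i][j] += 1.0f;
}

/* after interchange (a unimodular transformation): stride-1 inner loop */
void row_walk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[i][j] += 1.0f;
}
```

The interchange is legal here because the loop body carries no cross-iteration dependences.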
Non-uniform dependences partitioned by recurrence chains
Non-uniform distance loop dependences are a known obstacle to finding parallel iterations. To find the outermost loop parallelism in these "irregular" loops, a novel method based on recurrence chains is presented. The scheme organizes non-uniformly dependent iterations into lexicographically ordered monotonic chains. While the initial and final iterations of the monotonic chains form two parallel sets, the remaining iterations form an intermediate set that can be partitioned further. When there is only one pair of coupled array references, the non-uniform dependences are represented by a single recurrence equation; in that case the chains in the intermediate set do not bifurcate, and each can be executed as a WHILE loop. The independent iterations and the initial iterations of the monotonic dependence chains constitute the outermost parallelism. The proposed approach compares favorably with other treatments of non-uniform dependences in the literature. When there are multiple recurrence equations, a dataflow parallel execution can be scheduled, using the technique extensively to find maximum loop parallelism.
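The following C sketch is only an illustration of the chain idea under an invented, assumed dependence pattern (a single recurrence linking iteration i to iteration 2i), not the paper's algorithm: the chain heads are mutually independent, and each chain can then be followed sequentially as a WHILE loop.

```c
/* Invented example, not the paper's algorithm.  Iteration i writes
 * A[2*i] and iteration 2*i reads that element, so the (non-uniform)
 * flow dependences link iterations into chains i -> 2i -> 4i -> ...
 * Chain heads (i == 0 and every odd i) are mutually independent and
 * provide the outermost parallelism. */
#define N 1024
static int A[2 * N];

/* original sequential loop with non-uniform dependence distances */
void original(void)
{
    for (int i = 0; i < N; i++)
        A[2 * i] = A[i] + 1;
}

/* chains executed independently; iterations inside a chain stay ordered */
void chained(void)
{
    for (int head = 0; head < N; head = (head == 0) ? 1 : head + 2) {
        int i = head;             /* the head iterations could run in parallel */
        while (i < N) {           /* follow the chain i -> 2i sequentially */
            A[2 * i] = A[i] + 1;
            if (i == 0) break;    /* 2*0 == 0: the chain of iteration 0 has length 1 */
            i = 2 * i;
        }
    }
}
```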
Compiler Optimization Techniques for Scheduling and Reducing Overhead
Exploiting parallelism in loops is an important factor in realizing the potential performance of today's processors. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach to optimizing the address generation of these problems that results in: (i) elimination of redundant arithmetic computation by recognizing and exploiting common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references as possible to scalar accesses, which leads to reduced execution time, lower address-arithmetic overhead, and access to data in registers rather than caches. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, but little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops that formulates the problem of finding the minimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization in nested loops, where a dependence may be redundant in only a portion of the iteration space; the non-uniformity of this redundancy is characterized in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach that uses the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays; this not only improves data locality but also significantly reduces communication overhead.
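As a hedged sketch of the cross-iteration reuse targeted in Chapter 2 (the kernel is a generic 3-point stencil, not code from the dissertation), scalar replacement carries reused values across iterations in scalars and removes redundant loads and address arithmetic:

```c
/* Generic 3-point stencil: the values loaded as a[i] and a[i+1] in one
 * iteration are reloaded as a[i-1] and a[i] in the next. */
#define N 1000

void stencil_naive(const double *a, double *b)
{
    for (int i = 1; i < N - 1; i++)
        b[i] = a[i - 1] + a[i] + a[i + 1];   /* 3 loads per iteration */
}

void stencil_scalarized(const double *a, double *b)
{
    double left = a[0], mid = a[1];          /* values carried across iterations */
    for (int i = 1; i < N - 1; i++) {
        double right = a[i + 1];             /* only 1 new load per iteration */
        b[i] = left + mid + right;
        left = mid;                          /* rotate scalars for the next iteration */
        mid = right;
    }
}
```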
On Characterizing the Data Access Complexity of Programs
Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of developing lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower bounds analysis for the red/blue pebble game are very limited in effectiveness when applied to CDAGs of real programs, with computations comprised of multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs, as a function of the problem size and cache size.
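As a concrete and well-known instance of such a bound (illustrative here, not a result specific to this paper), Hong & Kung's analysis of standard n x n matrix multiplication with a fast memory of size S gives the data-movement lower bound

```latex
Q(n, S) \;=\; \Omega\!\left(\frac{n^{3}}{\sqrt{S}}\right)
```

and a decomposition-based analysis aims to obtain comparable bounds when a program's CDAG is composed of several sub-computations with differing structure.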
Software refactoring guided by multiple soft-goals
Software refactoring is intended to enhance the quality of software by improving its understandability, performance, and other quality attributes. We adopt the modelling framework of [14] in order to analyze software qualities and to determine which refactoring transformations are most appropriate. In addition, we use software metrics to evaluate software quality quantitatively. Our framework adopts and extends work reported in [15].
A theoretical foundation for program transformations to reduce cache thrashing due to true data sharing
Cache thrashing due to true data sharing can degrade the performance of parallel programs significantly. Our previous work showed that parallel task alignment via program transformations can be quite effective in reducing such cache thrashing. In this paper, we present a theoretical foundation for these program transformations. Based on linear algebra and the theory of numbers, our work analyzes the data dependences among the tasks created by a fork-join parallel program and determines at compile time how these tasks should be assigned to processors in order to reduce cache thrashing due to true data sharing. Our analysis and program transformations can be easily performed by compilers for parallel computers.
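The C sketch below is only a hedged illustration of the alignment idea (the partitioning, array names and processor count are assumptions, not the paper's construction): when two fork-join phases partition their iterations consistently, the processor that writes an element of the shared array also reads it, so the corresponding cache line does not migrate between processors.

```c
/* Invented illustration of task alignment (not the paper's construction).
 * Two fork-join phases share array a; worker thread p executes phase1(p)
 * followed by one of the phase2 variants. */
#define N 4096
#define P 4                           /* assumed number of processors */
static double a[N], b[N];

/* phase 1: processor p writes the block of a that it owns */
void phase1(int p)
{
    for (int i = p * (N / P); i < (p + 1) * (N / P); i++)
        a[i] = 0.5 * i;
}

/* aligned phase 2: processor p reads exactly the block it wrote, so no
 * cache line holding a[] has to move between processors */
void phase2_aligned(int p)
{
    for (int i = p * (N / P); i < (p + 1) * (N / P); i++)
        b[i] = 2.0 * a[i];
}

/* misaligned phase 2: a cyclic partition means almost every element of a
 * is read on a different processor than the one that wrote it, and the
 * corresponding cache lines migrate (true-sharing thrashing) */
void phase2_misaligned(int p)
{
    for (int i = p; i < N; i += P)
        b[i] = 2.0 * a[i];
}
```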
- …