Software pipelining can generate efficient schedules for loops by overlapping the execution of operations from different iterations in order to exploit maximum Instruction Level Parallelism (ILP). Code optimization can decrease the total number of calculations and memory-related operations. As a result, instruction schedulers can use the freed resources to construct shorter schedules. In particular, when the data is not present in the cache, performance is significantly degraded by memory references. Therefore, eliminating redundant load-store operations is highly important for improving overall performance. This paper introduces a method for integrating software pipelining and load-store elimination techniques. Moreover, we demonstrate that the integrated algorithm is more effective than other methods.
Keywords: software pipelining, load-store optimization, code generation, Instruction Level Parallelism.
Introduction
Software pipelining can generate efficient schedules for loops by overlapping the execution of operations from different iterations. With sufficient overlap, the software pipelined schedules can make maximum use of the machine's resources. Modulo scheduling is a software pipelining technique which generates an arrangement of operations in a loop such that there are no resource conflicts or data dependence violations.
Traditionally, modulo scheduling restricts the space of software pipelined schedules by initiating consecutive loop iterations at a constant rate, i.e. the Initiation Interval (II). The initiation interval is bounded by the minimum initiation interval (MII) [1, 2, 3], which is a lower bound on the smallest feasible value of II. In general, MII is constrained by either resource constraints (MIIres) or cyclic dependence constraints (MIIdep).
Code optimization, such as load-store elimination, is highly important in the software pipelining approach. Many approaches to load/store elimination have been proposed. Callahan et al. [4] first introduced the concept of register pipelining, in which array references that access the same array elements are allocated to a register pipeline in order to minimize memory traffic. Duesterwald [5, 6] and Bodik [7] extend Callahan's method by using data flow analysis to detect and eliminate redundant load-store operations. Duesterwald's method is based on reference variable analysis. Her method treats each array reference independently and computes the length of the array live range to determine the number of shift operations. Bodik proposes Congruent Class of References Analysis, which analyzes the value of array references within a congruent class.
This method uses a minimal number of memory-related operations and a minimal number of register-to-register copy operations in loops. But his method is only suitable for small dependence distances. Furthermore, considering only the reduction of memory traffic may not significantly improve the overall performance, since the number of load operations is reduced at the cost of increasing the number of shift operations. The generated schedule cannot exploit more ILP when the transformed code needs more resources. In order to exploit more ILP, we developed a method [8] which minimizes load-store operations after performing software pipelining. This method also has to introduce shift operations to replace load operations. In this paper, we integrate software pipelining and load-store optimization techniques in order to avoid shift operations and exploit maximal ILP in loops.
Background

Constraints affecting loop scheduling
To generate a software pipelined schedule, we first require an estimate of the initiation interval (II), which is the length of the new pipelined loop. Choosing the minimum feasible II achieves the highest possible steady-state performance. To determine the minimum initiation interval (MII), both dependence constraints and resource constraints should be taken into account [9].
Resource usage imposes a lower bound on the initiation interval (MIIres), which is computed by accounting for all the resources consumed by each operation of an iteration. For example, if a resource is used u times per iteration on a machine with v identical copies of that resource, MII cannot be smaller than u/v, because of the modulo scheduling constraint.
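The resource bound can be illustrated with a minimal sketch. The function name `res_mii`, the resource names, and the usage counts below are assumptions made for illustration and do not come from the paper; the sketch also rounds u/v up, since II is a whole number of cycles.

```python
def res_mii(usage, copies):
    """Resource-constrained lower bound on the initiation interval.

    usage[r]  : number of times one iteration uses resource r (u)
    copies[r] : number of identical copies of r on the machine (v)
    """
    # Round u/v up for every resource and keep the most constrained one.
    return max((u + copies[r] - 1) // copies[r] for r, u in usage.items())


# Hypothetical loop body: 5 memory operations, 3 additions, 1 multiplication
usage = {"mem": 5, "alu": 3, "mul": 1}
copies = {"mem": 2, "alu": 2, "mul": 1}
print(res_mii(usage, copies))
```

In this hypothetical configuration the memory ports are the bottleneck, which is the situation in which removing loads and stores can lower the resource bound.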
Another factor used to estimate the lower bound on the initiation interval is cyclic dependence. To characterize the dependences, each dependence edge is annotated with a pair $(\lambda, \delta)$: $\lambda$ indicates the number of iterations the dependence spans, and $\delta$ indicates the time that must elapse between the issue of the first operation and the issue of the second. For a cyclic dependence, a cycle $\theta$ contains a series of edges; we use $\delta_\theta$ to denote the sum of the $\delta$ and $\lambda_\theta$ to denote the sum of the $\lambda$ on $\theta$. The dependence-constrained lower bound, MIIdep, is determined by accounting for all the elementary cycles in the data dependence graph that are created by recurrences. In general, a cyclic path must satisfy the dependence constraint inequality $\delta_\theta - II \cdot \lambda_\theta \le 0$. The minimum initiation interval can be found by solving this inequality for the minimum value of II. Therefore, MIIdep is computed by taking the maximum value of $\lceil \delta_\theta / \lambda_\theta \rceil$ over all cycles.
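$$\mathrm{MII}_{dep} = \max_{\theta} \left\lceil \frac{\delta_\theta}{\lambda_\theta} \right\rceil \qquad (4)$$
Here $\delta_\theta$ and $\lambda_\theta$ are the sums of the delays and of the iteration distances on cycle $\theta$. Any cyclic path whose value $\lceil \delta_\theta / \lambda_\theta \rceil$ is equal to MII is termed a critical cycle.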
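The dependence bound can be sketched by enumerating the elementary cycles of a small graph directly. The edge format `(src, dst, lam, delta)`, the function name `dep_mii`, and the example graph are assumptions made for illustration; nodes are assumed to be integers so that each cycle is visited once, from its smallest node.

```python
def dep_mii(edges):
    """Dependence-constrained lower bound on II.

    edges: list of (src, dst, lam, delta), where lam is the iteration
    distance and delta the delay of the dependence.
    """
    succ = {}
    for src, dst, lam, delta in edges:
        succ.setdefault(src, []).append((dst, lam, delta))

    best = 0

    def walk(start, node, visited, lam_sum, delta_sum):
        nonlocal best
        for dst, lam, delta in succ.get(node, []):
            l, d = lam_sum + lam, delta_sum + delta
            if dst == start:
                # Closed an elementary cycle: record ceil(delta / lambda).
                if l == 0:
                    raise ValueError("cycle with zero iteration distance")
                best = max(best, -(-d // l))
            elif dst > start and dst not in visited:
                walk(start, dst, visited | {dst}, l, d)

    for start in sorted(succ):
        walk(start, start, {start}, 0, 0)
    return best


# Hypothetical graph with two recurrences through nodes 1, 2 and 3
edges = [(1, 2, 0, 1), (2, 3, 0, 1), (3, 1, 1, 1), (3, 2, 2, 1)]
print(dep_mii(edges))
```

Exhaustive enumeration is only practical for small graphs; it is used here because it follows the definition term by term.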
2.2 Longest path - critical cycle
In loop scheduling, we would like to compute the minimum distance between two nodes. By Definition 2.1, the schedule distance of a cyclic path is always zero or negative. Therefore, by reversing the sense of the inequalities, the longest path is equivalent to the smallest schedule distance. For the example of Figure 2(c), we can compute MIIdep for the cyclic dependences. Supposing each operation can be issued in one unit cycle, Table 1 gives the results of the computation for some cyclic paths, obtained from Equation 3.
Table 1: Computing II for the cyclic dependence graph
The distances corresponding to each cyclic path are presented in the $D_\theta$ column. We find that the path (1, 3, 6, 7, 1) is the longest path, which dominates MIIdep. The others can be ignored. Therefore, computing cyclic dependence constraints is equivalent to finding the longest path in the dependence graph.
Definition
Let $G = (N, E, \lambda, \delta)$ be a loop dependence graph. If there exists a set of cyclic paths $\theta_i$, the minimal initiation interval due to dependence constraints is determined by the longest of these paths (the critical path). The other cycles are redundant for computing MIIdep.
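The longest-path view can be sketched as a feasibility test: a candidate II is acceptable when no cycle has a positive schedule distance, taking each edge weight as its delay minus II times its iteration distance. The sketch below uses a Bellman-Ford style relaxation with reversed inequalities, which is a standard technique substituted here for illustration and not necessarily the procedure used in the paper; the names `feasible`, `dep_mii_by_search` and `max_ii` are hypothetical.

```python
def feasible(nodes, edges, ii):
    """True if no cycle has a positive schedule distance for this II.

    edges: list of (src, dst, lam, delta); the weight of an edge is
    delta - ii * lam.
    """
    dist = {n: 0 for n in nodes}          # longest distance found so far
    for _ in range(len(nodes)):
        changed = False
        for src, dst, lam, delta in edges:
            w = delta - ii * lam
            if dist[src] + w > dist[dst]:
                dist[dst] = dist[src] + w
                changed = True
        if not changed:
            return True                   # distances settled: no positive cycle
    return False                          # still growing: a positive cycle exists


def dep_mii_by_search(nodes, edges, max_ii=64):
    """Smallest II in 1..max_ii for which every cycle is satisfied."""
    for ii in range(1, max_ii + 1):
        if feasible(nodes, edges, ii):
            return ii
    raise ValueError("no feasible II up to max_ii")


# Hypothetical graph: the cycle 1 -> 2 -> 3 -> 1 is the critical one
edges = [(1, 2, 0, 1), (2, 3, 0, 1), (3, 1, 1, 1), (3, 2, 2, 1)]
print(dep_mii_by_search([1, 2, 3], edges))
```

Only the critical cycle decides the answer; the second cycle through nodes 2 and 3 is already satisfied at a smaller II, which is the sense in which the other cycles are redundant.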
3 Load-store elimination
Eliminating redundant loads
To determine whether a load operation can be eliminated, we can use data dependence information. In the data dependence graph, if a load operation has a predecessor, the loaded value is available in that predecessor. An undesirable load can therefore be precisely characterized by its predecessors in the dependence graph. Furthermore, by eliminating a redundant operation, instruction schedulers can use the freed hardware resources to construct a shorter schedule. Therefore, if both the load and the store are on the critical path, the length of the pipelined schedule can be substantially reduced.
Algorithms
This section presents the application of integrating software pipelining and memory operation optimization techniques. First, the optimization algorithms attempt to minimize the load-store operations in loops, and consist of two steps: eliminating loads for true and input dependences, and eliminating stores for output dependences. Then, code generation steps realize the loop transformations.
Before describing the optimization algorithms, we should first introduce some definitions. 
Eliminating loads
In order to avoid a redundant load, we implement the rules of true dependence elimination and input dependence elimination. We refer to a point at which an operation is initialized as a load point; our approach treats only those load points. If the load point has at least one predecessor in the true and input dependence graph, we call this point an elimination point. No load is needed at this point, because the loaded value is already available in a register.
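The rule above can be sketched as a single pass over the dependence edges. The representation of operations as integer ids, the edge format `(src, dst, kind)`, and the example loop are assumptions made for illustration only.

```python
def elimination_points(loads, dep_edges):
    """Load points that have a predecessor through a true or input dependence.

    loads     : set of operation ids that are loads
    dep_edges : iterable of (src, dst, kind), kind in
                {"true", "input", "output", "anti"}
    """
    redundant = set()
    for src, dst, kind in dep_edges:
        if dst in loads and kind in ("true", "input"):
            redundant.add(dst)   # the value is already held by src
    return redundant


# Hypothetical loop body:  A[I] = A[I-1] + B[I];  C[I] = A[I-1] * 2
#   1: load A[I-1]   2: load B[I]   4: store A[I]   5: load A[I-1] (again)
loads = {1, 2, 5}
dep_edges = [
    (4, 1, "true"),    # store A[I] feeds load A[I-1] of the next iteration
    (1, 5, "input"),   # both loads read the same element
]
print(sorted(elimination_points(loads, dep_edges)))
```

The load of B[I] has no such predecessor, so it stays in the loop.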
Eliminating stores
Store elimination finds a store point, called a killing point, at which the store operation is killed by a later store. While the elimination of loads requires searching backward from elimination points, the elimination of stores requires searching forward from killing points. The above algorithms are summarized in Algorithm 3.1. Generally, after eliminating redundant load-store operations, many methods can be applied directly to generate a pipelined schedule [3, 2, 8].
Unfortunately, these existing methods do not consider the initial value of each array reference. Therefore, before applying software pipelining, we should decide which value an array reference reads. In the example of Figure 2(d), there exists a dependence from node 6 to node 3 with distance 1 and a condition c(a,b) : if (I > 3). That is, if (I > 3), the value read by TS[I-1] is computed in the preceding iteration; otherwise the value read is not defined in the loop and should be initialized in the prologue. So in our approach, we first treat all conditions of each dependence edge and then parallelize the kernel loop. Our code generation algorithm consists of the following steps.
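The forward search for killing points can be sketched in the same style as the load case. The edge format `(src, dst, kind)` and the example are assumptions made for illustration; the sketch only marks candidate stores and does not model the last store instances that must still be performed after the loop.

```python
def killing_points(stores, dep_edges):
    """Store points whose value is overwritten by a later store.

    stores    : set of operation ids that are stores
    dep_edges : iterable of (src, dst, kind); an "output" edge means that
                dst later writes the location written by src
    """
    return {
        src
        for src, dst, kind in dep_edges
        if kind == "output" and src in stores and dst in stores
    }


# Hypothetical loop body:  A[I+1] = ...;  A[I] = ...
#   1: store A[I+1]   2: store A[I] (overwrites it one iteration later)
print(sorted(killing_points({1, 2}, [(1, 2, "output")])))
```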
4.2.1 Initializing array references in prologue and epilogue
The first step consists of determining the initialization values for array references. It is well known that software pipelining is based on the modulo scheduling method. Modulo scheduling selects a schedule for one iteration of the loop such that, when that schedule is repeated, no resource or dependence constraints are violated. Once a pipelined schedule has been found for the loop body, we need to add a prologue and an epilogue to complete the entire loop (more details in [8]). Initializing the prologue code requires precise data dependence information to determine the initialization values for all array references. We have shown that data flow analysis can give us this precise information, as well as the last write reference for a read operation. Algorithm 3.2 determines the initial values of array references in the prologue and inserts the last store instances in the epilogue by using the precise data dependence information.
4.2.2 Unrolling number of the software pipelined loop
In order to obtain full parallelism, the kernel loop would be unrolled as many times as the maximum live range; however, this increases the register pressure. Considering the number of registers in the target machine, we should choose a suitable unrolling number. If the unrolling number is smaller than the live range, we should insert shift operations to keep the values in registers for reuse.
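The trade-off between shift operations and unrolling can be illustrated on a small recurrence. The sketch below is an assumption-laden illustration and not the paper's example: it uses the hypothetical loop a[i] = a[i-1] + a[i-2], models registers as Python variables, and unrolls three times so that the registers can rotate roles instead of being copied.

```python
def with_shifts(a):
    """Reuse a[i-1] and a[i-2] through registers, shifting every iteration."""
    a = list(a)
    r1, r2 = a[1], a[0]          # r1 holds a[i-1], r2 holds a[i-2]
    shifts = 0
    for i in range(2, len(a)):
        r0 = r1 + r2             # no loads of a[i-1] or a[i-2]
        a[i] = r0
        r2, r1 = r1, r0          # two register-to-register shifts
        shifts += 2
    return a, shifts


def unrolled(a):
    """Same recurrence with the kernel unrolled three times, no shifts."""
    a = list(a)
    r1, r2 = a[1], a[0]
    i, n = 2, len(a)
    while i + 2 < n:
        r0 = r1 + r2             # old a[i-2] in r2 is dead after this
        a[i] = r0
        r2 = r0 + r1             # old a[i-1] in r1 is dead after this
        a[i + 1] = r2
        r1 = r2 + r0             # registers are back in their entry roles
        a[i + 2] = r1
        i += 3
    for j in range(i, n):        # leftover iterations, using plain loads
        a[j] = a[j - 1] + a[j - 2]
    return a


seed = [1, 1] + [0] * 8
print(with_shifts(seed))
print(unrolled(seed))
```

Both versions avoid reloading the two previous elements, but only the unrolled kernel does so without copy operations, at the cost of a longer loop body and one more live register name.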
In our approach, we use the DESP algorithm to find a pipelined schedule. The position of each operation is fixed in a matrix, which represents an operation by its row number (rn) and column number (cn). The live range of each variable is determined by checking the row number and column number. For example, suppose a variable is created by operation op1, scheduled at cn(op1) and rn(op1), and the last use of this variable is located at cn(op2) and rn(op2). The live range (LR) of the variable is then computed from these two positions.
The unrolling number is determined by the maximum live range of the variables.
Register allocation
The next step in code generation is to create scalar variables that replace temporary array references in order to perform data reuse. An approach to register allocation for subscripted variables was developed by Callahan et al. [4]. This approach, which is used in our technique, begins by calculating the number of scalar variables needed to hold each value generated by a temporary array reference; this number is determined by the maximum dependence distance d and is equal to max(d) + 1. For example, in Figure 2(e), the temporary array reference needs three variables, which are called R10, R11 and R12.
The final result for our example is shown in Figure 2(f). Here, the ";" delimiter separates operations that are executed in parallel.
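The count max(d) + 1 can be sketched as follows. The temporary names and dependence distances below are hypothetical and are not taken from Figure 2(e); the function name `scalars_needed` is likewise an assumption.

```python
def scalars_needed(distances_by_temp):
    """Number of scalar variables per temporary array reference.

    distances_by_temp: maps a temporary's name to the dependence
    distances of the references that reuse its values.
    """
    return {name: max(ds) + 1 for name, ds in distances_by_temp.items()}


# Hypothetical temporaries: "T1" is reused 1 and 2 iterations later,
# "T2" only 1 iteration later.
print(scalars_needed({"T1": [1, 2], "T2": [1]}))
```

A value reused two iterations later needs three names: one for the value being produced and one for each of the two earlier values that are still live.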
Experimental Results
In this section, we report the experimental results obtained with our optimization method. The source Fortran files are from the Livermore benchmark suite. We give experimental results that demonstrate the effectiveness of redundant load-store elimination.
Table 2: Load number before and after optimization
In Table 2, we give the dynamic count of loads before and after optimization, where TNL is the total number of loads and P is the percentage eliminated. To compare our method, ELSBSP (Elimination of Load and Store Before Software Pipeline), with [7], we used the same iteration counts. The number of loads due to array references was determined by examining the assembly code output of the compiler. We tested only a subset of the Livermore loops: many loops either offered no opportunities for pipelining or contained inner-loop conditional control flow, which we do not handle. As can be seen, our method eliminates the same number of loads as Bodik's method, but it is more effective than other methods at improving performance. Existing methods eliminate all redundant load operations but introduce shift operations to keep the values for reuse. In contrast, our method first eliminates redundant loads, then uses software pipelining to generate a pipelined schedule and unrolls the pipelined kernel loop. Therefore no shift operations are needed.
Figure 3(a) shows the performance of the tested loops. Each loop is compiled and executed at the -O2 optimization level of the Fortran compiler on an Ultra-SPARC processor. Performance is represented by the execution time.
Figure 3(b) presents the execution time of loop 23 on a DEC-Alpha processor for different loop scheduling and code optimization approaches. From left to right, the bars represent the execution time of the source program, Bodik's method, the software pipelining heuristic, the ELSASP heuristic (Elimination of Load and Store After Software Pipelining [8]), and the ELSBSP heuristic. We can see that the ELSBSP heuristic reduces the execution time significantly. Compared with the execution time of the other methods, at the -O2 or -O3 optimization level of the Fortran compiler, the implementation of ELSBSP obtains the best performance improvement. In this example, the ELSBSP heuristic improves performance by up to 30%.
These results show that load-store elimination is more successful when it is integrated with software pipelining.
Conclusion
In this paper, we introduce a new method for eliminating redundant loads and stores by using precise data dependence information. Eliminating redundant loads can improve both MIIdep and MIIres and thereby reduce the length of pipelined schedules. The effect of these optimizations is to expose new opportunities for minimizing memory accesses. Therefore, by eliminating redundant load-store operations, we can improve the overall performance of an application. The experimental results also show that integrating software pipelining and load-store elimination is more efficient than other optimization techniques.
