Abstract-This paper presents reuse-aware modulo scheduling for maximizing stream reuse and improving concurrency in stream-level loops running on stream processors. The novelty lies in a new representation of an unrolled and software-pipelined stream-level loop as a set of reuse equations, which allows two performance objectives for the loop, reuse and concurrency, to be optimized simultaneously in a unified framework. We have implemented this work in the compiler developed for our 64-bit FT64 stream processor. Experimental results obtained on FT64 and by simulation using nine representative stream applications demonstrate the effectiveness of the proposed approach.
I. INTRODUCTION
Stream processors, such as Imagine [6], Cell [9], AMD FireStream and Merrimac [2], represent a promising alternative for achieving high performance in media applications [6] and some scientific applications [2], [9]. In [12], we introduced the design and fabrication of FT64, the first 64-bit stream processor for scientific computing. Like Imagine [6], Cell [9] and Merrimac [2], FT64 is well mapped to the SVM (stream virtual machine) architecture [4].
Under a stream programming model, such as Brook [1] or StreamC/KernelC [3], an application is divided into a stream-level program running on the host and a kernel-level program running on the stream processor. The former specifies the flow of streams between kernels and initiates the execution of kernels. The latter executes these kernels, compiled to VLIW microcode, on clusters of ALUs, one at a time.
Stream processors incorporate a software-managed on-chip memory, called the SRF (Stream Register File), as the nexus of the ALUs and off-chip memory to store streams. It is performance-critical to fully utilize the SRF to capture the abundant stream reuse in stream applications so as to minimize the expensive off-chip memory traffic. In addition, a stream processor typically supports a few concurrent stream I/O operations that can be overlapped with kernel execution. This enables the SRF to hide memory latency if prefetching loads and delayed stores are scheduled judiciously.
This work is concerned with the development of compiler techniques for optimizing stream-level programs (or loops) by maximizing the reuse between streams and improving the concurrency through overlapping kernel execution and stream I/O on SVM-like stream processors (i.e., those that are well mapped to the stream virtual machine architecture [4]). Its novelty lies in formulating the underlying optimization problem via a set of reuse equations and performing both unrolling and software pipelining in a unified framework.
Loop unrolling and software pipelining [7] can expose stream reuse and concurrency, respectively, but they aim to achieve different goals. When applied to stream-level loops separately, either reuse or concurrency or both may be inadequately exploited (as discussed in Section II).
This paper makes the following contributions:
• We characterize an unrolled and software-pipelined stream-level loop using a set of reuse equations, by which both reuse and concurrency are made explicit (Section III).
• We propose a reuse-aware modulo scheduling technique to simultaneously maximize stream reuse and improve concurrency for a stream-level loop based on the notion of reuse equations (Section IV).
• We demonstrate the effectiveness of our compiler technique by running nine representative benchmarks on FT64 and by simulation (Section V).
II. A MOTIVATING EXAMPLE
We use an example to motivate the development of Reuse-Aware Modulo Scheduling (RAMS). In FT64, there are three resources: two memory units, Mem 1 and Mem 2, for performing stream transfers between off-chip memory and the SRF, and one stream processor, Exe, for executing kernels (one at a time). To focus on the basic idea, we assume that all loads, stores and kernels are each executed in one cycle.
Modulo scheduling [7], the most popular technique for software pipelining, aims to minimize the II (the Initiation Interval between successive iterations) of a loop. The algorithm starts with the Minimum II, i.e., MII = max(ResII, RecII). ResII is the largest N/R, where N is the number of units of a particular resource required by the loop and R is the number of such units available (per cycle). RecII is the largest delay(c)/dist(c) among all recurrence cycles c in the DDG (Data Dependence Graph) of the loop, where delay(c) is the sum of the latencies of all operations in c and dist(c) is the sum of the dependence distances of all edges in c.
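As a minimal sketch of the MII bound just described, the Python below computes ResII and RecII from resource counts and recurrence cycles. The function names, the data layouts and the sample numbers are hypothetical, and rounding each ratio up to an integer is an assumption (the text only states "the largest N/R" and "the largest delay(c)/dist(c)").

```python
from math import ceil

def res_ii(uses, units):
    """Largest ceil(N / R) over all resource types (rounding up is an assumption).

    uses[r]  = N, units of resource r required by one loop iteration
    units[r] = R, units of resource r available per cycle
    """
    return max(ceil(uses[r] / units[r]) for r in uses)

def rec_ii(cycles):
    """Largest ceil(delay(c) / dist(c)) over recurrence cycles c.

    Each cycle is a (delay, dist) pair; with no recurrence the bound is 0.
    """
    return max((ceil(delay / dist) for delay, dist in cycles), default=0)

def mii(uses, units, cycles):
    """MII = max(ResII, RecII)."""
    return max(res_ii(uses, units), rec_ii(cycles))

# Hypothetical loop: 3 memory transfers on 2 memory units, 1 kernel on 1 Exe
# unit, and one recurrence cycle with total latency 2 and total distance 1.
uses = {"Mem": 3, "Exe": 1}
units = {"Mem": 2, "Exe": 1}
print(mii(uses, units, [(2, 1)]))  # ResII = 2, RecII = 2, so MII = 2
```

Keeping ResII and RecII as separate functions makes it easy to see which of the two bounds limits the II for a given loop.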
Consider a stream-level loop given in Figure 1. In line 1, two arrays are declared. In line 2, three streams of size N are declared. In line 3, dataInit is called to initialize A residing in off-chip memory. In lines 5 and 6, the two data sections in A are gathered into streams a and b, resulting in the two data sections being loaded from off-chip memory into the SRF. In line 7, the kernel named foo is called to perform its computation on the stream processor. As shown, a and b are input streams and o is an output stream. In line 8, o is stored from the SRF back into O in off-chip memory. In line 10, dataSave is called to save the data of O into a file. There is a reuse from the second load in line 6 to the first load in line 5: the first load at iteration k accesses, i.e., reuses, the same stream as the second load at iteration k − 2.
If we apply modulo scheduling to this loop without unrolling it first, a redundant load, i.e., operation 5, is executed in every iteration since the reuse is not exploited. If we perform loop unrolling first, we must know the best unroll factor. Figure 2(a) shows the unrolled code with an unroll factor F = 3. The reuse can now be exploited. Applying modulo scheduling to the DDG in Figure 2(b) yields the schedule in Figure 3. In this case, II = MII = 6, meaning that each iteration of the original loop takes 6/3 = 2 cycles to complete. Although the reuse is fully exploited, concurrency is poor: kernel execution and memory transfers are not overlapped. This example demonstrates that the best unroll factor, and consequently the optimal performance of a stream-level loop, is determined not only by the amount of reuse but also by the amount of concurrency exploited. One ad hoc approach is to try a limited number of unroll factors and settle on the best one [5]. To avoid this, RAMS systematically generates an unrolled and modulo-scheduled loop from a stream-level loop by simultaneously maximizing stream reuse and improving concurrency, as shown in Figure 4, based on a new concept of reuse equations introduced below.
III. REUSE EQUATIONS
The key insight behind RAMS is the realization that every unrolled and software-pipelined loop of a stream-level loop L can be characterized by a set of reuse equations for its steady state. Then the problem of finding a desirable schedule for L is simply reduced to one of building from L a set of reuse equations, from which the schedule is derived.
There are two types of equations: DR (Data Reuse) and SR (Space Reuse) equations. Every reuse equation specifies a reuse between two streams at a fixed reuse distance. For a DR equation, there is a reuse of both the data and space/name between the two streams. For an SR equation, only the space/name is reused. The set of DR equations dictates the amount of reuse to be realized while the set of SR equations constrains the amount of concurrency in a schedule of L. Both determine the minimal unroll factor used to maximize reuse and concurrency in a schedule of L.
A. Establishing Reuse Equations
Let s(i, j) be the i-th stream parameter of the j-th kernel in L. Every reuse equation (or relation) has the form:
which indicates the existence of a reuse flowing from 
where the reuse distance d
For the loop in Figure 1, there are two reuse groups: R 1 = {s(1, 1), s(2, 1)} and R 2 = {s(3, 1)}. We obtain the first two equations for R 1 and the last one for R 2 below:
which can be visualized in Figure 5 . Since it is unnecessary to consider transitive reuse edges, the reuse inherent in a reuse group always manifests itself as a reuse cycle graphically.
From now on, the edges on all DR paths are called DR edges and the edges corresponding to all SR equations SR edges. Every reuse cycle has only one SR edge. 
B. Understanding Reuse Equations
The set of DR equations is determined by a stream reuse analysis in L [13] . Let us consider the DR path P r of length
r associated with R r whose DR equations are given in (2) 
To expose simultaneously all reuse inherent in all m reuse groups, R 1 , . . . , R m , in the unrolled and modulo-pipelined loop of L, the unroll factor is chosen to be:
where L r is the length of the reuse cycle for R r (1 ≤ r ≤ m).
For our example discussed earlier with respect to (4), 
The reuse distance d r 1 in the SR equation (3), i.e., the weight of its SR edge, which is found by RAMS (Section IV), specifies the amount of renaming allowed for the stream S r represented by the first node on the path P r . By renaming S r for d 
IV. REUSE-AWARE MODULO SCHEDULING
RAMS takes as input a stream-level loop L characterized by a set of reuse equations and produces as output an unrolled and modulo-scheduled loop for stream processors. RAMS is reuse-aware since it fully realizes the reuse specified by the reuse equations when looking for a schedule to improve concurrency in L. In addition, an unroll factor is automatically derived for realizing the reuse and concurrency exposed.
A. Modulo Scheduling for Concurrency
The input to ConMax in Algorithm 1 is a loop L together with a set of reuse equations for the loop, where the weights of DR edges in (2) are known and the weights of SR edges in (3) are not. However, the weight of every SR edge is initialized by (5) as its minimum value. The output is a modulo schedule for L with the weights of all SR edges being finalized.
The execution time of a load or store for a stream of size s is estimated linearly as a startup + b byte cost × s, where s is the
In ConMax, we apply iterative modulo scheduling to find a schedule for L. However, the presence of SR edges leads to a fundamental difference. In ConMax, we judiciously increase the weight of a false dependence in the DDG that is uniquely related to the SR edge for a reuse group R r to reduce delay(c)/dist(c) for the corresponding recurrence cycle c that contains the false dependence. By increasing the weight of the SR edge (indirectly this way), the producer stream parameter S r in R r is renamed so that the corresponding load (store) for S r is turned into a prefetching load (delayed store) in the generated schedule, resulting in improved concurrency.
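The linear estimate of load/store time stated above can be sketched as follows. The function name and the numeric startup and per-byte costs are hypothetical values chosen only for illustration, not measured FT64 parameters, and the stream size is assumed to be expressed in bytes.

```python
def transfer_time(size, startup, byte_cost):
    """Estimated time of one stream load or store: startup + byte_cost * size."""
    return startup + byte_cost * size

# Hypothetical costs: 100 cycles of startup and 0.5 cycles per byte.
for size in (1024, 4096):
    print(size, transfer_time(size, startup=100, byte_cost=0.5))
```

Because the scheduler works from such estimates rather than measured times, any error in the two coefficients carries over directly into the latencies used when building the schedule.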
Starting with all m reuse groups in line 9, we modify L in line 10 by deleting the loads corresponding to all the consumer stream parameters in every reuse group R r and then obtain in line 11 a simplified DDG for the new L. The SR edge d r 1 for R r given in (3) is defined in the DDG of the original loop L. In line 12, we identify its unique representative edge e in the simplified DDG, defined as rep(d r 1) = e. After all redundant loads in R r are removed (line 10), the producer stream S r is involved in a recurrence cycle c with two nodes: a load/store op for S r and a kernel call with S r as a parameter. We refer to e as an SR-rep edge and to c as an SR-rep cycle. If S r is an input stream, then op is a load that fetches S r from off-chip memory into the SRF to be used in the kernel. If S r is an output stream, then op is a store that writes S r, produced by the kernel, from the SRF back into off-chip memory.
Given a recurrence cycle c, the classic scheduling constraint delay(c) ≤ dist(c) × II must be satisfied. This implies that a dependence path cannot span more than dist(c) + 1 stages. When there are m resources available, it is generally sufficient to have m concurrent stages executing in parallel. In FT64, m = 3, so in our experience a maximum of three stages suffices to maximize concurrency. Since dist(c) = 1 initially for an SR-rep cycle c in the DDG, it suffices to increase the weight of the SR-rep edge in the cycle by 1 to create concurrency. As renaming may increase SRF pressure, ConMax introduces it only where necessary. In line 13, we calculate ResII as usual. In line 14, we calculate RecII also as usual, except that delay(c)/(dist(c) + 1) rather than delay(c)/dist(c) is used for an SR-rep cycle, since its dist(c) can be increased by 1 to improve concurrency.
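The relaxed RecII computation and the recurrence constraint described above can be illustrated with a small Python sketch. The function names and the tuple layout are hypothetical, and rounding the ratio up to an integer is an assumption. The sample cycles model two-node SR-rep cycles (a load or store plus a kernel call) under the one-cycle latencies assumed in Section II.

```python
from math import ceil

def recurrence_ok(delay, dist, ii):
    """Classic constraint for one recurrence cycle: delay(c) <= dist(c) * II."""
    return delay <= dist * ii

def rec_ii(cycles, relax_sr_rep=True):
    """RecII over cycles given as (delay, dist, is_sr_rep) tuples.

    For an SR-rep cycle, dist + 1 is used when relax_sr_rep is set, because
    renaming may later raise the weight of its SR-rep edge by 1.
    """
    bound = 0
    for delay, dist, is_sr_rep in cycles:
        if is_sr_rep and relax_sr_rep:
            dist += 1
        bound = max(bound, ceil(delay / dist))
    return bound

# Two SR-rep cycles, each with delay(c) = 2 and dist(c) = 1.
cycles = [(2, 1, True), (2, 1, True)]
print(rec_ii(cycles, relax_sr_rep=False))  # 2: the traditional bound
print(rec_ii(cycles))                      # 1: the relaxed bound
print(recurrence_ok(2, 1, 1), recurrence_ok(2, 2, 1))  # False True
```

The last line shows why the relaxed bound is only a starting point: with II = 1 the constraint holds for such a cycle only once the weight of its SR-rep edge has actually been raised to 2.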
Let us examine the rest of ConMax. Each SR-rep edge e is initially marked as unvisited in line 12 to prevent its weight from being increased twice. In line 16, we call Increase SRRep Edge Weight to introduce renaming into every SR-rep cycle that prevents a feasible schedule from being found with the current II. In lines 17-25, we apply iterative modulo scheduling, starting with the II initialized in line 15. Whenever a schedule cannot be found for the current II, we try to reduce delay(c)/dist(c) in lines 18-20 by increasing the weight of an unvisited SR-rep edge e in an SR-rep cycle c by 1. Note that a schedule with the current II may then be found due to the relaxation of dependences in line 20 (even though delay(c)/dist(c) ≤ II already held for the corresponding SR-rep cycle just before line 20). Otherwise, we search for a feasible schedule at the next larger II (line 22) with the minimal amount of renaming introduced in line 23. Finally, in lines 26-28, any weight increase in an SR-rep edge is reflected in the corresponding SR edge. In line 29, the schedule α found is returned.
Let us apply ConMax to our example loop. We reproduce in Figure 6(a) its DDG given earlier in Figure 1. Recall that this example has two reuse groups, R 1 = {s(1, 1), s(2, 1)} and R 2 = {s(3, 1)}. Its reuse equations are given in (4). After the load corresponding to s(1, 1) is deleted in line 10, the simplified DDG found in line 11 is given in Figure 6(b). As illustrated in Figure 5, there are two SR edges, d 1 and d 2, associated with R 1 and R 2, respectively. They are initialized to d 1 = d 2 = 1 by (5) in line 7. We know that s(2, 1) denotes b and s(3, 1) denotes o. In line 12, the algorithm finds that rep(d 1) = (7, 6) and rep(d 2) = (8, 7), with both marked as unvisited, i.e., (7, 6).visited = (8, 7).visited = false.
Given the simplified DDG in Figure 6(b), we obtain ResII = 1 in line 13 and RecII = 1 in line 14 (in contrast to RecII = 2 obtained traditionally). Thus, ConMax starts to schedule the loop with II = MII = 1 initialized in line 15. In line 16, the weights of the two SR-rep edges, (7, 6) and (8, 7), are increased from 1 to 2. Given the modified DDG in Figure 6(c), ConMax finds an optimal schedule with II = 1, shown in Figure 4. Since (7, 6).visited = (8, 7).visited = true, the weights of d 1 and d 2 are updated accordingly (lines 26-28). The final DDG is shown in Figure 6(c). Correspondingly, the reuse equations in (4) are:
If renaming is not used, one will start and end with II = MII = 2. The schedule has poor concurrency since kernel execution cannot be overlapped with the load/store operations.
B. Code Generation
Given a modulo schedule, we apply our code generation algorithm (not shown) to turn it into an unrolled and modulo-pipelined loop. The kernel for our example, with the prologue and epilogue omitted, is given in Figure 4.
V. EXPERIMENTAL RESULTS
We evaluate our work using nine representative scientific kernels listed in Table I 
A. Performance on FT64
In Table I, Swim-calc1 and Swim-calc2 are from SPEC 2000, EP and MG from NPB, GEMM from BLAS and Laplace from NCSA. NLAG-5 is a nonlinear algebra solver for 2D nonlinear diffusion of hydrodynamics. These benchmarks were initially available in FORTRAN and were later transformed into stream programs. Column "Reuse" shows the inherent reuse in these programs: C stands for cross-iteration reuse, N for no reuse and I for intra-iteration reuse. For each program, three unroll factors, denoted by R, RC and RCM, are listed. RC is directly generated by RAMS. A smaller unroll factor, denoted RCM, may be generated by redefining (6) as
This may cause more SRF pressure due to a higher amount of renaming generated. For RC and RCM, the same amount of concurrency is achieved. The column MD shows whether RAMS has changed any reuse distances for SR equations. Finally, R is generated if ConMax is not allowed to modify the weights of SR-rep edges. As a result, only reuse is exploited; no attempt is made to improve concurrency via renaming.
Fig. 7: Speedups of optimized over sequential loops.
Figure 7 gives the performance on FT64 under a few different unroll factors, which include those found by RAMS as a subset. By comparing the bars corresponding to the R unroll factors given in Table I and those with smaller unroll factors (if any), the importance of exploiting stream reuse is validated. This is particularly pronounced in Swim-calc2, FFT and Laplace. By comparing the bars corresponding to the RC and RCM unroll factors in Table I and those no smaller than R, we also observe the importance of exploiting concurrency. This is particularly significant for Swim-calc2, GEMM and Jacobi and noticeable for Swim-calc1, NLAG-5 and MG. The RC and RCM performance bars also show that RAMS is successful in simultaneously exploiting reuse and concurrency. Laplace is the only one that does not follow the trend.
Figure 8 demonstrates the effectiveness of RAMS in reducing the number of loads and stores due to improved reuse. For FFT, the data are kept in the SRF during computation due to reuse exploitation without incurring any off-chip memory transfers. Thus, RAMS achieves the largest speedup for FFT. The execution trace when F = 2 is shown in Figure 9. The top band, "Mem Ops", represents the busy/idle status of the two memory units combined, and the bottom band, "Ker Ops", represents the busy/idle status of the stream processor. In each case, colored regions represent busy cycles. Clearly, memory transfers happen only at both ends.
Laplace is the only one that contradicts the performance trend expected by RAMS. This is attributed to the inaccuracy in estimating the execution times of its loads/stores. 
B. Performance on the Imagine Stream Processor
We have also evaluated our algorithm using Imagine [6], another SVM-like stream processor. We have implemented a source-to-source translator, which takes a loop as input, applies RAMS and produces an optimized loop, which is then compiled by the Imagine compiler [3] to binary code. We compare RAMS with the Imagine compiler's doUnroll() and doSoftwarePipeline() using the cycle-accurate simulator for Imagine. These two techniques can only be applied separately. In addition, loop unrolling is not automated and must be used with a programmer-supplied unroll factor. Figure 10 shows the performance improvements of the nine programs achieved by RAMS over the Imagine compiler. To apply doUnroll(), the same unroll factor discovered by RAMS is used. As for doSoftwarePipeline(), no unrolling is applied. Better speedups are observed in all programs except EP and GEMM. For EP, no stream reuse exists and, due to strong loop-carried dependences, memory operations can hardly be overlapped with kernel execution, so neither loop unrolling nor software pipelining is beneficial. As for GEMM, only intra-iteration reuse exists and it can be exploited equally well by both the Imagine compiler and ours.
VI. RELATED WORK
The compiler developed for the Imagine stream processor [3] applies loop unrolling and software pipelining to improve the performance of stream programs. However, it does not recognize and exploit loop-carried stream reuse. Nor does it support automatic loop unrolling or software pipelining. Furthermore, the two cannot be applied together to the same loop.
The work of [13] performs a data-flow analysis to recognize loop-dependent stream reuse. Our earlier work of [8] focuses on SRF allocation by using graph coloring. In particular, the work of [8] improves reuse and concurrency by requiring the programmer to experiment with different unroll factors. Our earlier work also does not address cross-iteration parallelism.
This work also differs from the prior work on improving array reuse [10], [11] for general-purpose processors in several ways. First, array reuse is exposed for a hardware-managed cache while stream reuse is for a software-managed SRF. Second, the former research examines the reuse between individual array elements while the latter considers the reuse between streams (with often thousands of elements). Third, stream copy operations are more costly than register copy operations, particularly for stream processors that do not support inter-lane communication in the SRF [13], and should thus be minimized. Finally, concurrency exploitation in stream processors requires special attention since their memory transfers and computation are decoupled.
VII. CONCLUSION
This paper characterizes an unrolled and software-pipelined loop of a stream-level loop using reuse equations. Based on this new notion, we have developed a new compiler approach, RAMS, that can simultaneously unroll and pipeline a stream-level loop to maximize reuse and improve concurrency in the loop. We have implemented RAMS in a stream compiler developed for our 64-bit FT64 stream processor. Validation by running nine benchmarks on FT64 as well as Imagine [6] and by simulation shows the effectiveness of the proposed approach. Our technique should also be useful to other SVM-like stream processors, such as Cell [9] and Merrimac [2].
