We 
Introduction
An m th -order linear recurrence is a set of equations of the form: The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. government. Min possible te for our implementation on the Y-MP. r1
Asymptotic megaflops rate = (1000nf )=(6te). where and are binary associative operators, and distributes over . Such linear recurrences frequently appear in scientific applications [20] , are very useful in the design of parallel algorithms [19, 1] , and can be used to solve a much broader class of recurrences [18, 11] .
R1

X(I) = X(I-1) + A(I)
Researchers have been studying parallel and vector algorithms to solve linear recurrences since the 1960's [17, 3, 18, 31, 6, 16, 29, 22, 11, 35, 12, 23] , and considerable effort has gone into producing fast implementations of these algorithms on parallel and vector machines [24, 21, 33, 32, 13, 26, 30] . Some supercomputer manufacturers have considered the solution of linear recurrences important enough to warrant the addition of special hardware to execute recurrences efficiently [34] . Three major classes of algorithms have emerged out of the research on algorithms to solve linear recurrences: recursive doubling [18, 31] , cyclic reduction [14, 3, 15] , and the partition method [5, 11, 35] . These methods take advantage of the associativity of and . With a variation of cyclic reduction, Chen and Kuck showed how to solve an m th -order linear recurrence in O(lg n lg m) time [6] . This algorithm is impractical, however, since it requires nm 2 processors. It has been shown that both with a variation of cyclic reduction [12] and with the partition method [5] , the total number of operations can be reduced to nm(m + 2) total calls to and nm(m + 1) total calls to . When the number of processors p is significantly less than n, or the vector half-performance length n 1=2 on a vector computer is significantly less than n, the partition method has been shown generally to perform better in practice [13] , and has been shown to be numerically stable for a large class of problems [33] . This paper discusses a variation of the partition method based on a loop restructuring technique we call loop raking. Although our variation requires the same number of arithmetic operations as the standard implementation of the partition method, our variation reduces the memory traffic giving better actual running times, and is better suited for a multiprocessor implementation because of its shorter vector half-performance length.
We have implemented our method on multiple processors of the CRAY Y-MP, and our timing results are summarized in Table 1 (times on varying lengths are given in Figures 1 and 2 ). On one processor of the Y-MP our recurrences are up to 4 times faster than the times of recurrences supplied as part of SCILIB [7] (an optimized scientific library supplied by Cray Research). On multiple processors we achieve an additional almost linear speedup (the SCILIB recurrence routines do not make use of multiple processors). In our implementation we concentrated on the asymptotic performance rather than reducing the halfperformance length. We expect that the performance on short vectors could be significantly improved with an optimized implementation, especially for the multiprocessor implementation. On vector processors with only a single load/store pipe, such as the Convex 3800 or IBM 3090/VF, the performance gains should be even greater because of the reduction in the number of memory accesses. To process an entire array of length N , we move the rake down one step on each of s iterations.
In addition to presenting performance numbers for first and second order recurrences, we show how some other loops can be converted into linear recurrences and give timings for them. The loops we discuss are loops 5 and 19 out of the 24 Livermore Loops [25] , and the PACK operation. The PACK operation takes a set of data and a set of flags and packs the data where the flags are true (1) into contiguous locations. The PACK operation is one of the Fortran 90 intrinsics and can be implemented on the CRAY Y-MP using the compress-index instruction. Our implementation is based on a first-order recurrence (see Section 5) and does not use the compress-index instruction, but achieves a better asymptotic performance than the equivalent Fortran code, which is compiled to use the compress-index instruction (see figure 3) .
The paper is organized as follows. Section 2 explains the notion of loop raking. Section 3 reviews the partition method for solving linear recurrences and introduces our variation based on loop raking. Section 4 describes our implementation of linear recurrence solvers on the CRAY Y-MP. Section 5 describes how some other loops can be converted to linear recurrences. Conclusions and future directions are presented in Section 6.
Loop Raking
We assume a vector multiprocessor in which each processor has a set of vector registers each of length VL. Loop raking uses a constant stride (s) of N=L (where L VL) to access a set of elements that are evenly distributed across the input vector-as if a rake was placed over the vector and shifted on each vector load. On each load a total of L elements are loaded into a vector register. See Figure 4 . Loop raking can be viewed as the dual of strip mining [27] . Both are loop transformations that define orders in which to execute the iterations of a loop, VL elements at a time. Strip mining can be viewed as mapping the loop iterations to a VL (N/VL) matrix and operating on each row with a single vector operation; loop-raking, on the other hand, can be viewed as organizing the iterations as an (N/VL) VL matrix and operating on each column with a vector operation.
It should be clear that for loops without loop-carried data dependences [27] , the results of strip mining and loop raking are identical. They are just different orders of processing the elements. However, the methods are quite different in the presence of data dependences. In this paper we demonstrate how loop raking can be used to vectorize linear recurrences, which cannot be vectorized with strip mining. Elsewhere it has been shown how loop raking can be used to ensure stability in a step of a radix sort [36] .
Choosing parameters It is important to choose the parameters of a raked loop so as to avoid memory conflicts. For the purposes of loop raking, we consider a vector of n elements as being organized in a rectangular fashion, characterized by three parameters l, s, and f, as shown in Figure 5 . These three parameters characterize the vector completely and are called its shape factors [4] . for the purposes of determining the shape factors given n. We use the following additional constraints for choosing the parameters:
s is chosen to minimize bank conflicts; it is not allowed to be a multiple of BANKS, a constant that is determined Such an organization of the data makes all memory accesses constant-stride and avoids bank conflicts.
Extensions to vector multiprocessors To extend loop
raking to a vector multiprocessor with P processors, a loop can be partitioned into L P parts. Each processor P i (0 i < P) therefore loads with a constant stride of N=(L P)
starting at position N P P i . We can view this as increasing the rake size from L to L P. The code for each processor is identical-only the offset and possibly the f of the shape factor in the last processor is different. However, since each region is independent, it is not necessary that the processors run synchronously. Synchronization is only required at the end of the raked loop.
Raking data instead of loops For some machines (such as the IBM 3090/VF), strided memory accesses are much slower than unit-stride accesses. In such a case, it may be more effective to store the data in raked order and use unit-stride memory operations. By this, we mean that the i th element of the vector is stored at location s (i mod s)+ (i=s) from the start of the vector, where s is determined as above. This does impose some storage overhead, since the shape factors must be kept as part of the storage descriptor for the vector. The same idea can be used for machines with caches. Since cache lines usually store multiple words, they can be used as an analog of vector registers. Raking data is clearly the correct approach in this case.
Raking the data only makes sense in a context where the data can be kept in raked order. Since standard elementwise vectorization can always run in raked order, it is reasonable to consider always keeping certain data in raked order.
Overview of the Recurrence Algorithm
The partition method for solving linear recurrences is based on partitioning the input vectors into k contiguous blocks. The standard implementation views the n iterations as a k n=k array in row-major order [15] . The algorithm consists of two passes, the first across the columns and the second down the rows. This is shown in Figure 6 . Our variation of the partition method is based on two raked loops with an intermediate step. The first loop is similar to first pass of the standard technique since each addition is operating on elements separated by a distance N=k. In loop raking, however, we always select k based on machine parameters. In particular, we pick k so that the elements fit into a vector register. An example of the loop raking method is shown in Figure 7 and works as follows:
First raked loop: Accumulate partial sums, one per processor. This only requires loading and accumulating. Only a single partial sum per vector element (partition) needs to be kept.
Intermediate step: Sweep across the partial sums to generate starting offset of each partition. This is the only communication that is required between the partitions, and can be achieved in various different ways.
Second raked loop:
Generate final results from the offsets. This requires loading the original values, and writing the final results.
This variation has two potential advantages over the standard implementation, which are discussed below.
Memory Use:
The raked-loop version reduces memory traffic since it can keep the partial sums in registers. It should be clear that for the case of a first-order recurrence with no constant (a scan), only one vector load is needed per iteration of the first loop. The second loop requires one vector load and one vector store per iteration. In the standard technique, if n > VL 2 then either the number of columns or the number of rows has to be greater than VL. This means that when passing over the dimension that is greater than VL, the partial sums have to be written out of the registers on each iteration and then read back on the next iteration. For the more general case of an m thorder recurrence, Table 2 gives the total number of loads and stores required by each version along with number of arithmetic operations, which is the same for both versions. Vector Half-Performance Length: The standard model for the performance of a vector operation on a vector of length n is that the operation will take time:
T(n) = t e (n + n 1=2 )
where t e is the per-element incremental cost, and n 1=2 is the vector half-performance length (the number of elements at which half the peak performance 1=t e is reached). Based on this model, the time taken by the standard partition method is:
T(n) = (t e;1 (k + n 1=2 ))n=k + (t e;2 (n=k + n 1=2 ))k
where t e;1 is the per-element cost of the first pass and t e;2 is the per-element cost of the second pass. This can be minimized to pick the best k, which gives approximately k = (nt e;1 =t e;2 )
1=2 . We can then generate an effective half performance length for the standard method, which gives
We therefore see that the effective half-performance length of the standard partition method goes up as the square of the vector half-performance length. When going to multiprocessors n 1=2 goes up at least linearly with the number of processors P, and therefore r 1=2 goes up as the square of P.
On the CRAY Y-MP, for example, n 1=2 25 on one processor, giving r 1=2 600. However, if we go to 8 processors, n 1=2 500, giving r 1=2 250; 000. This shows that on 8 processors of the Y-MP even for large n the partition method will not achieve half of its peak performance.
Using our variation, the intermediate step can be executed hierarchically on multiple processors giving an effective half performance length that only grows linearly with the number of processors.
Implementation on CRAY Y-MP
We have implemented first-and second-order recurrences on multiple processors of the CRAY Y-MP. We have also implemented several other loops that can be converted into linear recurrences. Table 1 summarizes the running times and compares them with the library versions supplied by Cray Research as part of the scientific library (SCILIB). Times for the recurrences over a range of sizes are given in Figures 1 and 2 .
In this section, we present pseudocode for our implementations of first-and second-order linear recurrences. This pseudocode should be applicable to any register-based vector machine. To implement the recurrences we used our own macro assembler, written in LISP, to generate Cray Assembly Language (CAL) [9] . We could not achieve the desired speedup with the FORTRAN compiler, and the CRAY macro and opdefs package [8] is not flexible enough-it does not supply macros needed for the raked loops. Programming directly in CAL would have been tedious and would have produced code difficult to maintain.
In the psuedocode for the algorithms, we use the convention that names of vector registers begin with V, and names of scalar registers with S. In the code we will assume f = s for purposes of clarity. If f < s, the algorithm needs to be modified to decrement the contents of the Vector Length register by 1 for the last (s ? f) iterations. In our implementation on the CRAY Y-MP, we unroll the loops and rename registers to avoid locking vector registers. We only show the code for computing all the elements; the first two phases of the corresponding algorithm can be used if only the final value is required.
Algorithm for R1-C
In the first phase of the first-order recurrence R1-C (x i] = a i]*x i-1] + b i]), we must compute both a sum term and product term for each column. This is done with the following raked loop. The second phase fixes up the starting values for the final phase. This is done with a scalar loop that computes the appropriate combination of the sum and product terms resulting from the first phase with the initial value.
The final phase evaluates the appropriate terms and writes them out to memory. The following raked loop accomplishes this. 
Algorithm for R2
We now consider the second-order linear recurrence The second phase fixes up the starting values for the final phase. This is done with a scalar loop that computes the successive products of the 2 2 matrices resulting from the first phase with the 2-vector of initial values.
The final phase evaluates the appropriate terms and writes them out to memory, using the following raked loop. 
Using multiple processors
The above algorithms require changes only to the scalar code (intermediate step) in order to use multiple processors of the CRAY Y-MP. To use P processors, we first partition the vectors into P pieces and assign each piece to one processor. Each processor executes the first phase of the single processor recurrence algorithm. At the end of the first phase, each processor saves the contents of the vectors it has accumulated by writing them out to memory, and also writes out the reduction of all the elements in its partition. (The processor reduction is a pair of values for R1-C and four values for R2.) Then one of the processors computes all the terms of a recurrence on these P values. Each processor then restores state by reading in the vectors that it wrote out at the end of the first phase. It also reads in the updated processor reduction value and uses it as the initial value for performing a serial recurrence on the accumulation vectors. This gives the correct starting offsets for the final phase, which can then proceed as in the single processor algorithm. Note that the algorithm for the scalar phase is hierarchical. This reduces the serial overhead, as we perform the serial scan on a vector of length VL, not of length VL P.
Our multiprocessor implementation uses the $MDO Cray Assembly Language macro for generating microtasked loops [8] . In our prototype implementation, there is an overhead of approximately 2000 cycles for starting each raked loop. We expect that this overhead could be significantly reduced with further tuning and optimization. This section shows how three loops can be reduced to recurrences using a combination of standard Fortran code restructuring techniques and pattern recognition (to recognize recurrences). These are Livermore loops 5 and 19, and the PACK operation. Feo previously described how loops 5 and 19 can be implemented with recursive doubling [10] . Here we transform the loops directly into linear recurrences. In the code that follows, function names conform to those used in Table 1 , and Fortran 90 constructs are used where appropriate. This discussion is not intended as a comprehensive formula for converting loops to recurrences, but rather as a set of case studies. Work on identifying scans, reductions, and recurrences in loop code is currently sparse [32, 28] .
LL5
The inner loop of Livermore loop 5 is as follows. By multiplying out the quantities on the right-hand side of the assignment, and some precomputation, this loop turns into a recurrence, as shown in the following code. LL19 Livermore loop 19 consists of two loops, each containing a pair of coupled recurrences. The recurrences run in the forward direction in the first loop and backwards in the second loop. The original code is as follows.
First, we promote the scalar STB5 to a vector called TMP. After induction variable recognition and loop iteration peeling, we obtain the following code. As all values of B5 written in the first loop are overwritten in the second, there is no need to update B5 in the first loop. Therefore we can forward-substitute B5 into TMP in the first loop. We can also rewrite the second loop to compute B5 in terms of TMP. As we make frequent use of the quantity SB(k) -1.0, we can precompute it in a vector called SB1. Also, we only require the last value of the recurrence in the first loop. This gives the following code for LL19. which is a first-order linear recurrence with no constant. The second loop can be vectorized using a select based on the vector mask and a scatter operation, and can be fused with the second raked loop of the recurrence. The PACK operation can therefore be executed in two raked loops. The loops have the following form: where Select selects elements from either V sum or V scratch based on whether V F is true or false. The vector register V scratch is set to point to some scratch memory. The instruction Scatter writes elements in V A to the locations in V I. The Select instruction is required because the Y-MP does not have a masked scatter: a scatter in which elements would only be scattered in positions where the vector-mask register is 1.
Unfortunately, it is unlikely that any existing compiler will get through the necessary transformations, but it is interesting that we can achieve almost the same performance as what can be achieved with the compress-index instruction without using any special hardware. If the Y-MP had a masked scatter operation, then it is likely that this method would outperform the compress-index instruction for long vectors.
Conclusions
We have described a variation of the partition method for solving m th -order linear recurrences that is well-suited to vector multiprocessors such as the CRAY Y-MP. The algorithm fully utilizes the vector and multiprocessor capabilities of the machine, and reduces the number of memory accesses as compared to the more commonly used version of the partition method. This method uses a general loop restructuring technique called loop raking. On a single processor of the Y-MP our implementation runs between 1.5 and 4 times faster than the corresponding library routines in SCILIB. Performance increases almost linearly on multiple processors, resulting in a speedup of 3.7 on 4 processors.
Loop raking is a general loop restructuring technique that can be used to vectorize many problems where more standard execution orders such as strip mining fail. Further work is needed before it can become a standard compiler optimization. The treatment of raked loops must be expanded to handle nested loops. Also, work is needed to characterize loops that can profit from loop raking. We hope that the benefits of loop raking will stimulate further study of this compiler optimization.
