Many large-scale scientic and engineering computations, e.g., some of the Grand Challenge problems[1], spend a major portion of execution time in their core loops computing band linear recurrences(BLR's). Conventional compiler parallelization techniques[4] cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences(LCD's) in programs and there is a limited amount of parallelism in a BLR with respect to LCD's. For many applications, using library routines to replace the core BLR requires the separation of BLR from its dependent computation, which usually incurs signicant overhead. Our paper presents a new scalable algorithm, called the Regular Schedule, for parallel evaluation of BLR's. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput in implementing the schedule on vector supercomputers. We also illustrate our approach, based on matrix chain multiplication formulation of BLR, to parallelizing programs containing BLR and other kinds of code. Signicant improvements in CPU performance over a range of programs containing BLR implemented using the Regular Schedule in C over the same programs implemented using highly-optimized coded-in-assembly BLAS routines [11] are demonstrated on Convex C240. Our approach can be used both at the user level in parallel programming code containing BLR's, and in compiler parallelization of such programs combined with recurrence recognition techniques for both vector supercomputers and massively parallel computers.
Introduction
Systems of linear equations arise in many large-scale scientic and engineering computations such as some of the Grand Challenge problems [1, 16] , e.g., computational uid dynamics, weather forecasting, structural analysis, plasma physics, etc. Frequently, the discretization of partial dierential equations in these computations result in 3 This work was supported in part by NSF grant CCR8704367 and ONR grant N0001486K0215. band linear systems involving BLR's. These band linear systems consume the major portion of computing time in these computations [1] . Furthermore, an LCD-preserving parallel program containing BLR's is just not scalable: the speedup obtained with such a program over its sequential counterpart is limited by a constant(usually very small) regardless of how many processors are used. This is disadvantageous: not only solving these band linear systems is time-consuming but also a large number of resources, e.g., vector processors or processing elements(PE's) in massively parallel computers are wasted.
Compiler parallelization techniques that respect LCD's [4] cannot turn the programs containing BLR's into scalable parallel ones. To compute these programs on parallel computers, special parallel algorithms are devised for some of them, e.g., recursive-doubling [18] for 1st-order LR's, modied cyclic reduction for 1st and 2nd-order LR's, cyclic reduction [21] for diagonally-dominant tridiagonal linear solver, \burn-at-both-ends" [12] for symmetric positive denite tridiagonal linear solvers. The most frequently used programs for solving band linear systems are written into library routines using the dierent methods mentioned above, e.g., BLAS routines in Vector Library on Convex and Scientic Library on Cray. However, three problems arise with using these library routines. First, many programs mix band linear systems and other code in the same main loops.
Transforming these loops to facilitate invoking library routines will make the loops perform signicantly worse than parallel programming these loops directly. The main reason is that separation of band linear code from the other code in the loop severely decreases locality of reference. Another reason is that the library routines conventionally destroy the input arrays to store intermediate and nal results, which incurs the overhead of keeping multiple copies of input arrays for possible later use. Secondly, many other kinds of frequently used band linear programs, both standard and hybrid with other codes, are not covered by the library routines, and the kinds of such programs increase as more kinds of large-scale computations are created. It would be inecient, (probably infeasible), to create a library routine for each kind of application. Thirdly, none of those parallel programming methods is optimal(in terms of execution time) with resource constraints, e.g., with a xed number of processors independent of the problem size. That is, there may exist a faster and more general parallel-programming method than the problem-specic methods for the band linear systems.
Our paper presents a new scalable algorithm, called the Regular Schedule, for parallel evaluation of general BLR's(i.e., mth-order LR's for m 1). The schedule's regular organization of the computation for BLR makes it extremely well suited for vector supercomputers and massively parallel computers. The recent results [26] on numerical properties of parallel algorithms for LR's provide evidence for the numerical stability of our Regular Schedule. We describe our implementation of the Regular Schedules and discuss how to obtain maximum memory throughput in implementing the schedule using analytical and experimental methods on vector supercomputers(our experiments were performed on Convex C240). We also illustrate our approach, based on matrix chain multiplication formulation of BLR, to parallelizing programs containing BLR and other kinds of code. Signicant improvements in CPU performance over a range of programs containing BLR implemented using the Regular Schedule in C over the same programs implemented using highly-optimized coded-in-assembly BLAS routines [11] are demonstrated on Convex C240. Our approach can be used both in parallel programming codes containing BLR's by users and in compiler parallelization of such programs combined with recurrence
for (j=i-m; j<i; j++)
Figure 1: (a) rst-order linear recurrence code and (b) mth-order linear recurrence code.
recognition techniques for both vector supercomputers and massively parallel computers.
Our paper is organized as follows. Section 2 presents the Regular Schedule. Section 3 describes the implementation of the Regular Schedule and discusses how to obtain the maximum memory throughput on a vector supercomputer. Section 4 describes our approach based on the Regular Schedule to parallelizing programs containing BLR's intermixed with additional code. Section 5 reports the timing results of several programs containing BLR implemented using the Regular Schedule with the same programs implemented using BLAS routines of VECLIB on Convex C240. Section 6 briey reviews the related work. Section 7 concludes this paper. The sequential program for the rst and higher order linear recurrence can be written as shown in Figure   1 . Because of the loop-carried dependences, the two loops above cannot be parallelized into scalable parallel programs with LCD-preserving parallelization techniques.
Given a BLR as shown in Figure 1 the idea of a scalable algorithm is to formulate BLR in terms of matrix chain multiplication. The associativity of matrix chain multiplication enables us to treat the problem as parallel prex [21, 22] Final value x k is the rst element of the resulting vectorx k . 2
Note that in Denition 2.2,
T k x k01 x k02 . . . The matrix multiplication chain in Denition 2.2 allows us to treat the parallel computation of RhN; mi in similar fashion to parallel prex computation [22] . Algorithm 2.1 describes the generic Regular Schedule in terms of matrix chain multiplication for parallel computing RhN; mi. It is also called Square Schedule as each iteration of its outermost loop produces p 2 nal results. To facilitate presentation and analysis, Algorithm 2.1 can be understood(described) on a concurrent-read-exclusive-write(CREW) parallel machine model(PRAM).
Issues of mapping the algorithm onto real vector supercomputers will be addressed in Section 3. A CREW PRAM machine is assumed to have p parallel processors interconnected with a global memory. Each operation is assumed to take a unit time. A memory location can be read simultaneously, i.e., broadcasting of data is allowed, but it can only be written by a single processor at a time. (1)
The computation tree for a period is shown in Figure 2 X3   X4  X5  X6  X7  X8  X9  X10  X11  X12  X13  X14  X15  X16  X17  X18   M18  M17  M16  M15   R16   R17   R18   M14  M13  M12  M11 2
The regular schedule, based on matrix chain multiplication, was derived using the Harmonic Scheduling technique [27] . Gajski's algorithm [15] achieves the same execution time as the Regular Schedule, under concurrent-read-and-exclusive-write(CREW) parallel random access machine(PRAM) model, but uses a dierent formulation and organizes the computation dierently. When mapped onto real machines, the Regular Schedule will perform better since it has better utilization of locality of reference. The algorithm for rst-order LR in [7] reduces to the regular schedule only for rst-order LR, but the blocked rst-order formulation for mth-order LR(m 2) described in [7] diers from the regular schedule and has longer execution time. The
Regular Schedule is not time-optimal under a CREW PRAM model, and in fact both its execution time and program-space eciency are inferior to many other schedules derived using Harmonic Scheduling [27] . However, its regular organization of the computation makes it extremely well suited for vector supercomputers and massively parallel computers.
Implementing the Regular Schedule on a Vector Computer
Although the regular schedule is scalable, the real machine constraints(e.g., dierent styles of parallelism exploitation, memory access latency, inter-processor communications and operating system of the machines) may greatly aect the performance. This section presents our implementation of the regular schedule on a vector supercomputer and discusses how to obtain maximum memory throughput using both analytical and experimental methods. Before discussing programming the regular schedule, we give an overview of relevant portion of Convex C240
architecture [9] in Figure 3 . Implementing the regular schedule(Algorithm 2.1) on a vector computer with the architecture above involves mapping the processors in Algorithm 2.1 onto the processing elements of the vector computer and placing the data across the memory banks so as to obtain maximum memory throughput. In our mapping, a processor in Algorithm 2.1 corresponds to a vector register element being operated on in the vector processor. This is because the pipelined functional units, the major mechanism to exploit parallelism in programs in a vector CPU, depend on the vector registers supplying data to be kept fully utilized and the number of data elements to be fetched into a vector register(i.e., the vector length) can be specied and controlled in the source code. Thus the number of processing elements in a vector computer is equal to the number of vector CPU's multiplied by vector register length. Since we have available only a single vector CPU for our experiment, we shall map Algorithm 2.1 to up to 128 processing elements. We can easily map our algorithm to multiple vector CPU's in the same fashion.
However, this mapping does not mean that speedup on the order of ((# of CPU 0 s) 2 (vector register length)) can be obtained. Also, we are concerned in this paper with parallel programming in high-level languages and performance tuning in source code. Issues such as low level tuning, e.g., to improve utilization of chaining will not be addressed in this paper
With this mapping scheme, one would normally expect that using full vector register length 128, i.e., mapping our algorithm to 128 processing elements, would yield the best performance. It is in fact not so, because of a special property of our algorithm and memory interleaving, as the question of choosing the optimal register length will be addressed in the remainder of this section.
As an example, the portion of the source code in Convex C of the regular schedule for 2nd-order LR is given below. The labels corresponding to the labels in Algorithm 2.1 are added for purpose of exposition. The coecient matrices are stored in sparse format. As commented in our C code, loop L2.1, L3 and L6 can be vectorized by the Convex C parallelizing compiler.
The symbolic constant RTH in these three loops will serve as the vector length. In Loop L2.1 and L3, RTH will be used as the vector stride measured in long words of 8 bytes on Convex C200 Series(the stride is 82RTH measured in bytes as in Convex assembly code). That is because consecutive iterations in these two loops operate on data that reside in the global memory (RTH (mod 32)) banks apart(recall that each iteration does a coecient matrix multiplication and all RTH iterations work on RTH redundant trees in parallel). Hence, RTH as the vector stride will determine how memory accesses are interleaved among the 32 memory banks, which is a key factor aecting the performance. The following claim provides an analytical method for choosing an optimal vector stride based on the architectural features given. All time intervals are measured in number of machine clock cycles. Our architecture can achieve a rate of one vector element per cycle with full memory interleaving during pipelined access stage(this is true for Convex C200 Series and other vector computers). This claim is also applicable to vector store instructions.
Proof: The idea of Formula 3 is self-explanatory from its rst line. 
Programming Based on Regular Schedule
For standard BLR's, the results in Section 2 and 3 can be applied straightforwardly to generate parallel programs for vector supercomputers. There are many application programs in which BLR's are mixed with other computations. The mixture of recurrence with other code make it less obvious how to apply the Regular Schedule to it than to the standard BLR. Sometimes even recognition of the core recurrence becomes non-trivial.
However, it is in this kind of situation that parallelizing the code comprehensively will yield better performance than separating the core recurrence from other code and replacing it with some prescribed library routine. In this section, we outline our approach, based on matrix chain multiplication formulation of the Regular Schedule, for parallelizing such programs in this section. Our approach can be used both for parallel programming programs with BLR's by programmers and for compiler parallelization of such programs combined with recurrence recognition techniques.
Our approach for parallelizing programs with BLR's utilizing the Regular Schedule consists of 3 steps:
1. Recognizing the BLR's and the dependence of other computations on the core BLR structure in the program given;
2. Construct the matrix chain multiplication for the BLR's and their dependent computation;
3. Applying the Regular Schedule.
There exist several methods for compiler recognition of recurrences in loops(since this is not the topic of this paper, we briey overview these works). Banerjee et al [3] showed how the data dependence graph can be used to isolate recurrences so that pattern matching techniques can be applied. Ammarguellat and Harrison [2] put forth a method for recognizing recurrence relations automatically. Pinter and Pinter [23] gave a graph-rewriting technique for recognizing recurrences by unfolding loops. Tanaka [25] provided a framework for both recognition and vectorization of 1st-order linear recurrence. Callahan [5] gave an algebraic approach to combining bounded recurrences and generating parallel code.
We illustrate our approach by going through it with Livermore Kernel 19| general linear recurrence equations as shown below. The rst loop can be rewritten into an equivalent loop as follows:
The scalar stb5 in the original loop has been privatized [13] into array stb5[N+1], which enables us to recognize that the core recurrence is on the second statement in the new loop. We then construct a matrix chain multiplication for the core recurrence as follows: Once we constructed the matrix chain multiplication for the core recurrence, we can easily apply our Regular Schedule to the chain for computing stb5 [k] . We organize the core recurrence with the computation around it according to locality of reference. The resulting loop is given as follows. Similarly, the second loop can be rewritten into an equivalent loop and the matrix chain multiplication for the new loop can be constructed, except that the induction variable in the second loop decrements.
In our experiment, all 3 benchmarks containing BLR intermixed with other code (tridiagonal elimination below diagonal, general linear recurrence equations and tridiagonal linear system solver) were programmed using our approach as illustrated in this section and good performance gains were obtained for all of them. We have preliminary evidence that the matrix chain multiplication mechanism of our Regular Schedule can facilitate parallel programming code containing BLR intermixed with other code in an integrated fashion(rather than separation). Such a comprehensive treatment, where locality of reference is better observed, will gain us more performance than treating BLR and its dependent code in a loop separately, as evidenced by our experiment and reasoning.
Results
We show in Table 1 the CPU performance of benchmark programs containing BLR implemented using the BLAS routines of the Convex C240 VECLIB [11] in comparison with the same programs implemented using our regular schedule. The CPU performance was measured by the user CPU time in seconds on a single processor of Convex C240. Programs using Regular Schedule were implemented in Convex C. The BLAS routines [11] of the Convex C240 VECLIB are coded in assembly and highly optimized to the architecture details, and are the fastest linear algebra routines installed at UCI. In comparison, the corresponding routines in the Convex The rst column lists the names of the programs containing BLR. Each program was tested for problem sizes of 1 through 4 million(which covers the typical size of some large scale computations [1] ). Double precision(long word of 8 bytes) oating-point numbers were used for all implementations. For each problem size, a program was run with 1 thread, and with no restriction on the number of threads(Convex C240 allows a maximum of 4 threads). The column titled \BLAS routine used" lists the BLAS routines used in coding the benchmark programs. The columns titled \user CPU time of programs using BLAS" and \user CPU time of programs using regular schedule" give the user CPU time of programs using BLAS routines and using our regular schedule respectively. The last column calculates the improvements in performance of programs using the regular schedule over the same programs using BLAS. To ensure a fair comparison, each pair of implementations(one using BLAS and one using the Regular Schedule) for a benchmark were submitted to a single batch job in order to run them in the same system load. A reported user CPU time for a program is the average of 100 runs. The vector length 121 for vector register was used in all our implementations.
Our benchmarks are among the most frequently used code(or code segments) containing BLR in scientic and engineering computations(3 Livermore Kernels are also found to be such kind of code). [Results for more benchmarks will be included in the nal version of the paper]. Note that, for benchmark Livermore Kernel 11|rst sum(i.e., parallel prex), we compare the program implemented using BLAS routine \dr1c " with the program based on Regular Schedule, where \dr1c " computes 1st-order constant coecient LR(which involves multiplication whereas parallel prex does not). This is a fair comparison since the performance dierence between involving multiplications and no multiplications is negligible because the multiplications and additions in \dr1c " are well chained.
Restricting execution to a single thread on Convex C240, programs implemented using Regular Schedule in C demonstrated improvements in performance of 17% to 72% over those using the assembly-coded and highly optimized BLAS routines. We can observe the following two phenomena. First, when the order of LR increases, the performance improvements of Regular Schedule over the BLAS routine increase. Since BLAS does not provide routines beyond 2nd-order LR, higher order LR's have to be adapted to use BLAS routines for 1st-order or 2nd-order LR. Thus the Regular Schedule will gain even more performance improvements for these higher order recurrences. Second, for benchmarks containing BLR and dependent code such as Livermore Kernel 5 and Livermore Kernel 19, the performance improvements of Regular-Schedule-based over BLAS-based implementation are higher than for the standard BLR.
To examine the the synchronization overhead in preparation for future experiments on the multiple processors of the Convex(which we currently don't have available), we also ran both sets of parallelized benchmarks under no restriction on the number of threads on a single vector CPU. The performance improvements by the Regular Schedule over the BLAS routines were considerably larger in the presence of synchronization than the improvements made in the absence of synchronization, which is encouraging.
We expect to have at least 5% to 10% more improvement in performance than reported here, if programs using Regular Schedule are coded in assembly.
Related Work
Due to the importance of the problem, parallel evaluation of linear recurrences has been studied in various forms by some of the pioneers in the eld for the past twenty years. Kogge and Stone [18] order LR's) with resource constraints and in particular found the strict time-optimal schedules for computing 1st and 2nd-order LR's.
Parallel prex is the simplest 1st-order LR |-all coecients a ij equal 1, i.e., there is no multiplication in the recurrence. It was studied by Ladner and Fisher [20] , Fich [14] , Snir [24] , and Nicolau and Wang [22] . The strict time optimal schedule for computing parallel prex with a xed number of processors(independent of problem size) on CREW PRAM model was found by Nicolau and Wang [22] .
Conclusion
We have presented a new scalable algorithm, called the Regular Schedule, for parallel evaluation of general BLR's(i.e., mth-order BLR's for m 1). The schedule's regular organization of the computation for BLR makes it extremely well suited for vector supercomputers and massively parallel computers. We described our implementation of the Regular Schedules and discussed how to obtain maximum memory throughput in implementing the schedule using analytical and experimental methods on vector supercomputers(our experiments were performed on Convex C240 vector supercomputer). We also illustrated our approach, based on matrix chain multiplication formulation of BLR, to parallelizing programs containing BLR and other kinds of code.
Our approach enables a comprehensive parallelization of programs containing core BLR and dependent code, and thus allows a better utilization of locality of reference in programs than separate treatment of BLR code and its dependent code. Signicant improvements in CPU performance of a range of programs containing BLR implemented using the Regular Schedule in C over the same programs implemented using highly-optimized coded-in-assembly BLAS routines [11] are demonstrated on Convex C240. The Regular Schedule can be used directly to parallelizing code containing BLR's on massively parallel computers. Future work includes extending Regular Schedule to multiple vector processors and experimenting with our approach on massively parallel computers.
