We consider the problem of designing optimal and efficient algorithms for solving tridiagonal linear systems with multiple right-hand side vectors on two-dimensional mesh interconnection networks. We derive asymptotic upper and lower bounds for these solvers using odd-even cyclic reduction. We present various important lower bounds on execution time for solving these systems including general lower bounds which are independent of initial data assignment, and lower bounds based on classifications of initial data assignments which classify assignments via the proportion of initial data assigned amongst processors. Finally, different algorithms are designed in order to achieve running times that are within a small constant factor of the lower bounds provided.
Introduction
In this paper, we consider the problem of designing algorithms for solving tridiagonal linear systems. Such algorithms are referred to as tridiagonal solvers. A method for solving these systems, in which there is particular interest by designers, is the well-known odd-even cyclic reduction method which is a direct method. Due to this interest, we chose to focus our attention on odd-even cyclic reduction. Thus, our results will be applicable to tridiagonal solvers designed on a two-dimensional mesh utilizing cyclic reduction. Much research has been spent exploring this problem which deals with designing and analyzing algorithms that solve these systems on specific types of interconnection networks 1, 5, 7, 8, 9, 10, 11] such as hypercube or butterfly. However, very little has been done on determining lower bounds for solving tridiagonal linear systems 4, 6, 10, 11] on any type of interconnection network or on specific general parallel models 9]. Moreover, few of these papers consider the case of multiple righthand side (RHS) vectors and the authors know of no papers which derive lower bounds on parallel run-time for odd-even cyclic solvers with multiple RHS vectors.
The main objective of this paper is to present asymptotic upper and lower bounds on the running time for solving tridiagonal systems with multiple right hand sides which utilize odd-even cyclic reduction on 2-dimensional meshes. Our decision to work with meshes is based on the simple fact that they are very common and frequently used interconnection networks. The results obtained in this paper will provide not only a means for measuring efficiency of existing algorithms but also provide a means of pinpointing algorithm design issues, data layouts and communication patterns in order to achieve optimal or near-optimal running times.
While we have already derived results for tridiagonal solvers with only a single vector on a 2-D mesh 10], the results were not easily adaptable to multiple vectors. In fact, multiple vectors produced several layers of complexity towards analysis of this problem. Some of the interesting results we shall show include the following: The skewness in the proportion of data will significantly affect running time. Another interesting result is the proof on the threshold for Research partially supported by an NSF CAREER Grant.
processor utilization. For some problem sizes, using common data layouts and straightforward communication patterns do not result in significantly higher complexities than assuming that all processors have access to all data items regardless of communication pattern. In many cases, common data layouts can be used to obtain optimal running times. However, there is no one algorithm, data layout or communication pattern which will provide optimal run-times for all cases. In fact, we present more than 6 algorithms/subalgorithms which are needed in order to show how to achieve optimal or nearoptimal running times for all cases.
The paper is divided as follows. Section 2 contains a description of the 2-D mesh topology. In Section 3 we discuss the odd-even cyclic reduction method for solving tridiagonal systems. In Section 4, we derive various important lower bounds on execution. First, we derive general lower bounds for solving tridiagonal systems, i.e. the bounds hold regardless of data assignment. We follow this by deriving lower bounds which rely on categorizing data layouts via the proportion of data assigned amongst processors. Furthermore, we describe commonlyused data layouts designers utilize for this problem. Lastly, we present a variety of algorithms and subalgorithms which will produce running times within a small constant factor of the lower bounds derived. Section 6 gives the conclusion and summary of results.
2-Dimensional Mesh Interconnection Network
A mesh network is a parallel model on P processors in which processors are grouped as two types: border processors and interior processors. Each interior processor is linked with exactly four neighbors. Communication between neighbor processors require exactly 1 time step. By this, we mean that if a processor transmits a message to its neighbor processor at the beginning of time x, its neighbor processor will receive the message at the beginning of time x + 1. Moreover, we assume that processors can be receiving transmitting] a message while performing a local operation. We assume that b 1 b 2 b R can be stored in one matrix B of size N R where the s th column of B represents vector b s . We make an analogous assumption for x 1 x 2 x R . Moreover, we assume for the sake of simplicity that
NR where k is an integer and R = 2 r . An algorithm is simply a set of arithmetic operations such that each processor is assigned a sequential list of these operations. An initial assignment of data to the processors is called a data layout. A list of message transmissions and receptions between processors is called a communication pattern. These three components (algorithm, data layout, communication pattern) are needed in order to determine running time.
Odd-even cyclic reduction 2, 3, 7] is a recursive method for solving tridiagonal systems of size N = 2 n ; 1. This method is divided into two parts: reduction and back substitution.
The first step of reduction is to remove each odd-indexed x si and create a tridiagonal system of size 2 n;1 ; 1. We then do the same to this new system and continue in the same manner until we are left with a system of size 1. This requires n phases. We refer to the tridiagonal matrix of phase j as M j and the vector as b In this paper, for simplicity, we assume that if an algorithm employs odd-even cyclic reduction, we assume that a processor computed items of whole rows of a matrix (i.e. the three non-zero data items).
The back substitution phase is initiated after the system of one equation has been determined. We recursively determine the values of the x si 's.
The first operation is x s 2 n;1 Returning back to Figure 1 , we see that the back substitution is represented by the top half of the graph. In essence, the entire tridiagonal system's dependency graph is R + 1 copies of the graph in Figure 1 (one for the M matrix, and the remaining for the R vectors) with some additional edges. Clearly, for any two vectors of b, their respective dependency graphs will not have any edges between them. In fact the only edges to be added are from the sub-dependency graph of M to every vector of b. In other words, for all vectors b s , each node in the M sub-dependency graph will have an edge to the "mirror" node in the sub-dependency graph of b s .
The serial complexity of this method is denoted by S(N R P) where P = 1. R = 1 is a special case, i.e.
S(N 1 1) = 19N ; 14 log(N + 1) ; 4:
In Section 4, we derive several important lower bounds on running time for odd-even cyclic reduction algorithms on 2-D mesh networks. Specifically, we derive general lower bounds independent of data layout, lower bounds which are based on categorization of data layouts, and lower bounds on running time for common data layouts for this problem. In Section 5.6 we present optimal algorithm running times on 2-d mesh topologies.
Lower bounds for Odd-Even Cyclic Reduction
Definition 4.
The class of all communication patterns is denoted by C. The class of all data layouts is denoted by D. A data layout D is said to be a single-item layout if each nonzero matrix-item is initially assigned to a unique processor.
In the following sections we shall provide lower bounds based on certain types of data layouts as well as general lower bounds. The lower bounds hold regardless of the choice of communication pattern.
A General Lower Bound for Odd-Even Cyclic Reduction
In this subsection we assume the data layout is the one in which each processor has a copy of every non-zero entry of M 0 and b 0 s . We denote this layout byD. SinceD is the best data layout possible, a lower bound based on this data layout will provide a general lower bound. 
Remark 4.1. We will denote the (parallel) computation lower bound for odd-even cyclic reduction of size N and R and utilizing P processors by S(N R P).

Remark 4.2. Let A be an odd-even cyclic reduction algorithm. A computation lower bound for A, given P processors and assuming none of the P processors are totally idle is S(N R P)
Proof. We will begin by assuming that all P processors will be used in computation and/or communication. Clearly, S(N R P) is a lower bound for the problem. Furthermore, we see that there are groups of processors which work together in order to solve the items in the dependency graph of M or in any of the b s 's graphs. Denote these (not necessarily disjoint) groups of processors by
. Since these processors must ultimately produce only one row of M in the back substitution phase, it is clear there is some type of "all-to-one broadcast" of P 0 processors in order to complete the back-substitution phase of M and/or b s . Thus, on a two-dimensional mesh, this will require communication of at least p P 0 . Moreover, this will also require computation of at least N P 0 . Therefore, a lower bound, assuming all P processors are used, is max(S(N R P)
This leads to the fact that P 0 = max(1 P R ) in order to minimize the lower bound. Furthermore, we see that using more than Ω(RN 2 3 ) processors, will reduce efficiency; i.e. the threshold for procesor utilization is Ω(RN In Section 5.6 we will provide optimal algorithms, i.e. the running times are within a small constant factor of the lower bounds. Therefore, we note that our results show that when P is sufficiently large, i.e. P = Ω(RN 
Lower Bounds for O on c P -data layouts
Many algorithms designed for solving tridiagonal linear systems assume that the data layout is single-item and that each processor is assigned roughly 1 P th of the rows of M 0 and the items of b where P is the number of processors available. However, in order to determine whether skewness of proportion of data has an effect on running time and if so, exactly how much, we classify data layouts by initial assignment proportions to each processor. In other words, in this section, we consider single-item data layouts in which each processor is assigned at most a fraction 
A data layout D on P processors is said to be a c P -data layout if (a) D is single-item, (b) no processor is assigned more than a fraction
Proof. The computational lower bound must also be a lower bound for the problem regardless of communication. Thus S(N R P) is a lower bound. Moreover, since at least one processor is assigned NRc P pieces of data, this processor (denoted as p 0 ) can either choose to use that data item in a binary arithmetic operation or send out the item to another processor. Therefore, it is clear that NRc 6P must also be a lower bound. We will now describe the argument for why p P 8 must also be a lower bound. We will use a contradiction argument.
Consider any two data items from M and/or B. These two items are said to be "related" if they are both required by some computation in order to provide final solutions. Clearly, any initial data item of M, i.e. M 0 will be related to any other initial item of M. Suppose the p P 8 is not a lower bound, this implies that any two processors which are both assigned initial items of M cannot be more than or equal to a distance of p P 4 from each other on the 2-D mesh. Furthermore, consider a processor p which computes M n;1 . Processor p will require data from all of the items of M 0 . Clearly, only processors which are within at most a distance of
all the processors that are assigned data items from not only M 0 but also items of B 0 which are related to any item in M 0 . Every processor is assigned at least one initial piece of data and the layout is single-item. Moreover, every item in B 0 is related to at least one item of M 0 . Thus, every processor assigned an item of B 0 or M 0 must be inside this region. However, the region does not include all the processors in the mesh. Therefore, this is a contradiction.
Analyzing the result given in the above theorem, we see that 1 P -data layouts will result in the best lower bounds for this class of data layouts, i.e. : 
Comparing results against the general lower bound, we see that for sufficiently large N (i.e. N >> P, or more precisely P R(N)
the lower bound for 1 P -data layouts is precisely equal to the general lower bound. The complexity of any algorithm A 2 O using a 1 P -data layout is Ω(log N + NR P + p P). In Section 5.6, we present algorithms that are within a small constant factor of the bounds presented in this and previous subsections.
Optimal Algorithms, Data Layouts, and Communication Schedules
In this section, we design the algorithms which will produce running times which are within a small constant factor of the lower bounds derived in the last section. We begin by discussing the data layouts we plan to use. We follow this with a discussion of some of the communication schedules (mostly dealing with distribution/broadcast of the data to processors that require this information for computation).
Lastly, we present the optimal algorithms along with their running times. We will discuss under which circumstances one algorithm should be used over another in order to achieve optimal or near-optimal running times.
Definitions of Data Layouts
In the previous sections, we have derived various lower bounds on running time for tridiagonal solvers which utilize odd-even cyclic reduction on a 2-dimensional mesh interconnection topology. In order to design algorithms which are within a constant factor of these lower bounds, we must determine which types of data layouts we will use. In this section, we present definitions on data layouts in order to use these layouts in the next subsections in our algorithm designs.
We would like to point out that while we can easily use the best data layoutD to obtain the general lower bounds, in this section we will present data layouts which are not so powerful but which still achieve the lower bounds stated. In fact, in the cases where N >> P, no processor is assigned more than O( NR P ) data items. We begin by presenting partial layouts and in each subsection, we discuss a data layout we will utilize in our optimal algorithms. This data layout will be utilized in order to achieve the general lower bounds when R P.
We begin by partitioning the processors into R groups of P R = 2 2k;r processors. For the sake of simplicity, we will assume that r is even. These R groups will be referred to as P i j where 1 i j p R. 
Data Layout 2
This data layout will be utilized in order to achieve the general lower bounds when R > P.
We begin by partitioning the assignment of the b s vectors to the P processors. Each processor will be assigned R P whole vectors. In particular p i j will be assigned the items in vectors b l where (i ;1)
Moreover, every processor will be assigned the items in matrix M.
Data Layout 3
This is a single-item data layout and will be utilized in order to achieve the single-item data layout lower bounds when P < N.
We will create a new notation for the mesh processors in order to utilize the partial data layouts defined in this section. The processors will be denoted by p 1 p 2 p P where p i j is denoted by -p (i; 1) p
We will allocate the items of M using data lay-
M P . Moreover, the processors will be allocated each vector b s using data layout D 0 bs P . Clearly, this is a c P -data layout where c is constant.
Data Layout 4
This is a single-item data layout and will be utilized in order to achieve the single-item data layout lower bounds when P > N and R P.
We begin by partitioning the processors into R groups of P R = 2 2k;r processors. These R groups will be referred to as P i where 1 i R.
If R p P then the processors assigned to group P l are p i j where (l ; 1)2 k;r + 1 i l2 k;r and 1 j p P.
We will create a new notation for the mesh processors in order to utilize the partial data layouts defined in this section. Consider the processors in group P l , these processors will be denoted Regardless of the size of R, for group P 1 , we will allocate the items of M using data layout
. Moreover, each group P l will be allocated vector b l using data layout D 0
this is a c P -data layout where c is constant.
Data Layout 5
This is a single-item data layout and will be utilized in order to achieve the single-item data layout lower bounds when R > P > N.
We begin by partitioning the assignment of the b s vectors to the P processors. Each processor will be assigned R P full vectors. In particular p i j will be assigned the items in vectors b l where (i ;1)
Moreover, processor p 1 1 will be assigned the items in matrix M. Clearly, this is a c P -data layout where c is constant.
The data layouts presented in this subsection will be used to design optimal or near-optimal algorithms for solving tridiagonal systems on a 2-dimensional mesh. We again note that while we could have simply assumed the best data layout for designing our algorithms to achieve the general lower bounds, we instead have listed data layouts in which no processor has been assigned more than O( NR P ) initial data items. Furthermore, we have presented several single item data layouts which we will be using in our optimal algorithm designs in later subsections.
In the following subsection, we present some communication schedules which we will use in our algorithms designs. These schedules are used to address data redistribution.
Communication Schedules
For some of the algorithms we will design in the next subsection, it is important that we are able to redistribute the data from one layout to another. In this section, we present communication schedules which do the needed redistributions.
Schedule 1
In this communication schedule, we are redistributing data layout 3 into the layout in which the matrix M is distributed using D M P and each b s is distributed using D bs P . Clearly, only neighbor processors (as ordered in the data layout discussion for 3) need to communicate. This would be accomplished by running the following:
For all j P ; 2, processors p j do in parallel send all initially assigned data items to processor p j+1
For all j 2, processor p j do in parallel send all initially assigned data items to processor p j;1
Running Time: O( RN P
).
Schedule 2
In this communication schedule, we are redistributing data layout 4 into the layout in which the matrix M is distributed to each group using layout D M . Two slightly different schedules for this redistribution are needed: one for R p P and one for R > p P.
We will discuss the schedule for R p P. The redistribution schedule for the other case is very similar and, for brevity, we will omit its discussion.
Consider R p P. We begin by redistributing so that each group is assigned the matrix M us-
. Clearly, only processors above and below (on the mesh) a processor need the same data to communicate. Collecting the data items of M in each column to the first processor in the column and then sending out the information to every other processor in the column would effectively accomplish the first step of the redistribution. The remaining redistributions can be easily taken care of using Schedule 1 above. This would be accomplished by running the following: 
Schedule 3
In this communication schedule, we are redistributing data layout 5 into the layout in which the matrix M is distributed to every processor. This is simply a "one-to-all broadcast" of 3(N ; 1) items. Clearly, this can be accomplished in O(N ; 1 + p P) steps. Due to the assumptions made in layout 5, the running tims is O(
The Optimal Algorithms
In the previous subsections, we have discussed the data layouts and some communication redistribution schedules we plan to use in our algorithms. In this section, we present the algorithms which will produce running times within at most a small constant factor of the lower bounds presented in the previous section.
SubAlgorithm 1 for Best Data Layout
This algorithm solves a tridiagonal system by sequentially solving values based on values it has already computed and not sent from another processor. This Subalgorithm will be a building block for the full algorithms.
Assume y of the vectors of b are assigned to this group of P 0 = 2 q processors. Moreover, assume the processors are ordered using the ordering given in data layout 3. The items of row i of level j are considered to be l (of all y vectors). Lastly, we will utilize only P 0 ; 1 processors.
SubAlgorithm 1 for Data LayoutD
Every processor p i (1 i < 2 q ) do in parallel :
for s = 1 to n ; q do for z = 2 n;q;s (i ; This data layout is used when R P. This algorithm is simply solving the tridiagonal system with one vector. We have partitioned the processors into R groups. Since the best data layout actually is not needed for subalgorithm 1 and in fact our data layout is sufficient, we simply call SubAlgorithm 1 for each group in parallel with P 0 = P R , y = 1. Therefore the total running time is O(
As we stated in the subsection defining Data Layout 1, if P RN 
Algorithm 2 for Data Layout 2
This data layout is used when R > P. This algorithm is simply sequentially solving the tridiagonal system with R P right-hand side vectors. We have partitioned the vectors into P groups. Each processor in parallel, serially solves their assigned system. Therefore the total running time is O(
Algorithm 3 for Data Layout 3
This algorithm first calls schedule 1, then it simply solves the tridiagonal system with R vectors. We consider the case P < N log N . Since the best data layout is actually not needed for subalgorithm 1 and in fact our data layout is sufficient (after schedule 1), we simply call SubAlgorithm This algorithm first calls schedule 2 then it is simply solving the tridiagonal system with one vector. We have partitioned the processors into R groups. Since the best data layout is actually not needed for subalgorithm 1 and in fact our data layout is sufficient (after schedule 2), we simply call SubAlgorithm 1 for each group in parallel with P 0 = P R , y = 1. Therefore the total running time is O(
Algorithm 5 for Data Layout 5
This algorithm first calls schedule 3, then it is simply sequentially solving the tridiagonal system with R P right-hand side vectors. We have partitioned the vectors into P groups. Each processor in parallel, serially solves their assigned system. Therefore the total running time is O(
In this section, we designed algorithms for each of the data layouts which cover all problem sizes. Analyzing the running times of our algorithms, we see that they are all within a small constant factor of the lower bounds provided. In particular, the algorithms for Data Layouts 1 and 2 satisfy the general lower bounds. The algorithms for the remaining Data Layouts all satisfy the single-item bounds. Moreover, analyzing the design of the algorithms, we see that to achieve the single-item bounds simply requires redistribution of a single-item layout to the layouts used to achieve the general lower bounds. Furthermore, no layout assigned any processor more than O( NR P ) initial data items.
Conclusion
In this paper we examined the problem of designing optimal tridiagonal solvers on 2-dimensional mesh interconnection networks with multiple right hand sides using odd-even cyclic reduction. This is the first paper to have tackled and solved this problem asymptotically.
We were able to derive asymptotic lower bounds on the execution time. Moreover, we were able to provide algorithms whose running times differ by only a small constant factor of these lower bounds.
Specifically, we proved that the complexity for solving tridiagonal linear systems regardless of data layout (i.e. a general parallel lower bound on a 2-dimensional mesh) is Ω(N This shows that utilizing more than Ω(RN processors will not result in any substantial runtime improvements. In fact, utilizing more processors may result in much slower run-times.
When we added the realistic assumption that the data layouts are single-item and the number of data items assigned to a processor is bounded, we derived lower bounds for classes of data layouts. The asymptotic lower bound is Ω(
NR P
+ log N + p P). We provided three types of data layouts which provided optimal running times. In other words, we showed that the best types of layouts in general are those in which either (a) each processor is assigned an equal number of rows of M 0 and b, (b) the processors are grouped into R groups in which every group is assigned P R of the vectors, or (c) each processor is assigned R P whole vectors of b. Therefore, we see that the skewness of proportion of data will significantly affect performance. Comparing the lower bounds for the c P -layouts with the general lower bounds, we see that restricting the proportion of data items assigned to a processor to NR P does not result in a significantly higher complexity than assuming all processors have all the data items for sufficiently large N (i.e. P RN Lastly, we show that there are algorithms, data layouts, and communication patterns whose running times are within a constant factor of the lower bounds provided. This provides us with the Θ-bounds stated above.
To achieve the general lower bound, i.e. the complexity for these methods regardless of data layout, while we could have usedD, the best data layout, i.e. the data layout in which every processor is assigned all the data items, we did not. We presented two data layouts in which no processor was assigned more than O( NR P ) data items. Two types of algorithms produce optimal running times: (a) algorithms in which processors are partitioned into R sets such that each set is in essence an independent tridiagonal system, or (b) algorithms in which each processor sequentially solves R P of the vectors. As stated previously, three different data layouts were provided in order to design optimal algorithms for single-item layouts. The three types of algorithms are: (a) algorithms which in essence utilize block layouts and solve the tridiagonal system R times. (b) algorithms in which processors are partitioned into R sets such that each set is in essence an independent tridiagonal system, or (c) algorithms in which each processor sequentially solves R P of the vectors. For (b)-(c), the algorithms provided are very similar to those to achieve the optimal running times (regardless of data layout). In fact, the difference in run-time is primarily the need to do redistributions of data in order to run the optimal algorithms. Finally, it is worthwhile to observe that for sufficiently large N, the single-item lower bounds are asymptotic to the general lower bound.
