Abstract-A new algorithm for the solution of linear recurrence systems on parallel or pipelined computers is described. Time bounds, speed-up, and efficiency are obtained for SIMD and MIMD computers with a fixed number of arithmetic elements (AE's), as well as for pipelined computers with a fixed number of stages per operation. The model of each computer is discussed in detail to explain the better performance of the pipelined model. A simple modification in the design of AE's for parallel computers makes the parallel model superior.
I. INTRODUCTION
Many algorithms try to obtain high-speed execution by exploiting the inherent parallelism of the given problem.
On the other hand, there are many machines designed for high performance. For example, two typical architectures that have emerged are parallel and pipelined organizations. Generally speaking, a parallel machine tries to achieve higher speed by replication of processing units while a pipelined machine approaches the same goal by dividing the processing unit into several segments and by having several different sets of operands at different stages of execution at the same time.
However, difficulty arises when someone tries to implement algorithms on existing or hypothetical machines. It is usually discovered that the algorithm was developed for an oversimplified machine model. One frequently used model, for example, assumes only a fixed number of arithmetic elements (AE's) in the machine. Each AE is capable of performing any binary arithmetic operation (addition, multiplication, division, ...) in one unit of time. The performance or speed of an algorithm is then obtained by dividing the total number of operations required by the algorithm by the number of AE's. This machine model takes into account neither the time and cost of operand transmission to and from the AE's nor the organization of data in the memory.
Manuscript received December 20, 1978; revised February 29, 1980. This work was supported in part by the National Science Foundation under Grant US NSF MCS77-27910. The author is with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801.
Therefore, it would be highly desirable to have a method for designing algorithms that would explicitly reveal 1) the structure of the processing units, 2) the structure of the interconnections, and 3) the organization of the data in memory.
In other words, we would like to define two different structures for each algorithm. The global structure specifies the number of processing units and memories, and their interconnections. The local structure specifies the organization of individual processing units, as well as the storage scheme of data in each memory.
With the above goal in mind, we shall introduce a technique based on the suffix problem for semigroups and then apply it to linear recurrence systems. An algorithm for solving the suffix problem was developed by Ladner and Fisher [1] in an attempt to synthesize high-speed binary adders. Much earlier, Kogge and Stone [2], [3] presented an algorithm based on the recursive-doubling technique for recurrence systems in the real domain. They developed the recursive-doubling algorithm for the first-order linear recurrence system and defined a set of properties that guarantees applicability of their technique to a broader class of problems.
0018-9340/81/0300-0190$00.75 © 1981 IEEE
GAJSKI: SOLVING LINEAR RECURRENCE SYSTEMS
Both the Ladner-Fisher and Kogge-Stone models assume an unlimited amount of processing power; that is, there is no limit on the number of operations they can use at any time. However, Ladner and Fisher give a bound on the total number of operations needed to solve the suffix problem. Their approach seems justified since they modeled logical nets, which tend to be small and combinatorial in nature. Kogge and Stone [2] do not give any bounds on the size of their solution. Kogge later gave a time bound for the solution of recurrence problems [3] and showed how to pipeline it to obtain the maximal computational rate [4]. Simply stated, both Ladner and Fisher, and Kogge and Stone, proposed an algorithm for the solution of recurrence systems and showed how to minimize the total number of operations, or how to design hardware for efficient execution of their algorithms.
In this paper we will address the same problem under completely opposite assumptions. We would like to construct efficient and fast algorithms for computation of recurrence systems on real machines with limited processing power, such as a parallel processor with a fixed number of processing units and memories and a fixed interconnection network among them, or a pipelined machine with a fixed number of stages per arithmetic operation and a fixed connection between arithmetic units. We are considering recurrence systems whose size is much larger than the number of AE's available at any single moment during the computation.
When the number of processors $p$ is less than the recurrence size $n$, Kogge and Stone [2] proposed to use their algorithm $\lceil n/p \rceil$ times to calculate $p$ elements of the series at a time. In this paper we present a more efficient and faster algorithm.
The parallel evaluation of linear recurrence systems has been studied by a number of people. Chen and Kuck [5], using a machine model with no restriction imposed on the number of AE's used at any time, give bounds on time and the number of AE's. They showed several different techniques to cut down the number of processors when their algorithm is mapped onto a limited number of processors, but they did not give the actual bounds. Later, Chen and Sameh [6] reported a time bound of $k(m^2 n/p) \log p \log m + O(\log p)$ for a linear recurrence system of order $m$ and size $n$ using only $p$ processors at any time. Chen, Kuck, and Sameh [7] have developed an algorithm that achieves a time bound of $(n/p)(2m^2 + 3m) + O(n^2 \log (p/m))$.
In this paper we show that for $p < \sqrt{n}$ this bound can be further lowered to less than $(n/p)(2m^2 + 3m)$. Furthermore, a straight application of our algorithm will yield a bound of $(n/p)(2m^2 + m) + 2p$ in cases when only the last $m$ components of the solution are required. When the order of the linear recurrence system is $m = 1$, this is the equivalent of evaluating the Horner expression $(\cdots((a_1 b_1 + a_2) b_2 + \cdots + a_n) b_n + a_{n+1}$. Then our algorithm achieves a speed-up which is very close to the best possible speed-up of $\frac{2}{3}p + \frac{1}{3}$, established by Hyafil and Kung [8]. In addition to Chen and Kuck [5], Heller [9] and Sameh and Brent [10] were interested in computing the minimum number of processors required to compute linear recurrence systems in the shortest possible time. A mapping of these algorithms onto real machines with a fixed number of processors, memories, and switches turned out to be highly difficult. This is understandable because the only goal of these algorithms is the minimum number of arithmetic operations per algorithm.
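To make the Horner connection concrete, the following minimal Python sketch treats each step $x \mapsto x b_i + a_{i+1}$ as an affine map and combines the maps pairwise in a tree, which is the property that any parallel scheme for the first-order case relies on. The function names are illustrative assumptions, and the tree reduction shown is a generic associative combination, not the algorithm developed in this paper.

```python
def horner_serial(a, b):
    """Evaluate (...((a1*b1 + a2)*b2 + ...)*bn + a_{n+1} step by step.

    a holds a_1..a_{n+1}, b holds b_1..b_n.
    """
    x = a[0]
    for bi, ai in zip(b, a[1:]):
        x = x * bi + ai
    return x


def compose(f, g):
    """Return the affine map f after g; a map x -> m*x + c is stored as (m, c)."""
    return (f[0] * g[0], f[0] * g[1] + f[1])


def tree_reduce(maps):
    """Combine maps (listed in application order) pairwise, level by level.

    All compositions on one level are independent of each other.
    """
    while len(maps) > 1:
        nxt = [compose(maps[i + 1], maps[i]) for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2:
            nxt.append(maps[-1])  # odd element is carried to the next level
        maps = nxt
    return maps[0]


def horner_tree(a, b):
    """Same value as horner_serial, obtained from one composed affine map."""
    m, c = tree_reduce([(bi, ai) for bi, ai in zip(b, a[1:])])
    return m * a[0] + c
```

Both functions return the same value; the tree version performs more multiplications in total, which is the kind of redundancy traded for parallelism throughout the paper.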
Furthermore, the works [5]-[10] quoted above use matrix notation for the description of algorithms and linear algebra tools for their development, which do not lend themselves to natural decomposition into processors, memories, and interconnections. In a previously reported work [11], an algorithm based on the semigroup technique was easily converted into an array of processors connected with multiple shuffle networks, yet the algorithm yielded the best-known processor bound, $m^2 n/2 + O(mn)$, previously reported by Sameh and Brent [10].
In the next section the suffix problem for semigroups is introduced with a technique for solving the general class of recurrence systems. The previous work is discussed in relation to this more general technique. In the rest of the paper our technique is then applied to the smaller subset of recurrence systems, namely, linear recurrences over the field of real numbers. The definitions and notations used throughout the rest of the paper are presented in Section III. Furthermore, the Main Theorem (Theorem 1), which forms the basis of all our algorithms, is proved.
In Section IV the algorithm for SIMD and MIMD parallel computers using $p$ AE's is presented. We use a simple and well-known model to facilitate comparison with previous work and also to make formal treatment more tractable.
The model of a pipelined computer is described in Section V. The model is made simple enough to cover a wide spectrum of existing machines (from the Floating Point Systems AP-120B to the TI ASC, CDC STAR-100, and CRAY-1). It consists of two pipelined arithmetic units, a multiplier and an adder, chained together into a multifunction pipe in a static configuration. Each arithmetic unit in the model has $s$ stages and requires one time step for a single operation, although it can deliver $s$ results in one time step in the vector mode.
For the performance comparison of different machines, which may be parallel, pipelined, or simply sequential, the speed-up, efficiency, and utilization must be defined using attributes that characterize all machines. Such an attribute is the computational rate $r_M$ (in operations/second) which can be sustained by the machine $M$ in an ideal computation. On the other hand, an algorithm $A$ is characterized by $O(A)$, the number of operations needed to execute $A$. The computational time of an algorithm $A$ on a machine $M$ is denoted by $T_M(A)$. Note that $T_M(A) \ge O(A)/r_M$. Therefore, for any two machines $M_i$ and $M_j$ and any two algorithms $A_k$ and $A_l$, we can define speed-up $S = T_{M_i}(A_l)/T_{M_j}(A_k)$, efficiency $E = (r_{M_i} T_{M_i}(A_l))/(r_{M_j} T_{M_j}(A_k))$, utilization $U = O(A_k)/(r_{M_j} T_{M_j}(A_k))$, and redundancy $R = O(A_k)/O(A_l)$. Section VI will conclude the paper with a comparison of results for pipelined and parallel machines using the above-defined measures.
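The four measures can be written down directly. The sketch below is only an illustration: it assumes that machine $M_i$ runs algorithm $A_l$ and machine $M_j$ runs algorithm $A_k$, and the function and parameter names, as well as the numbers in the usage example, are hypothetical.

```python
def speedup(t_i_l, t_j_k):
    """S = T_Mi(A_l) / T_Mj(A_k)."""
    return t_i_l / t_j_k


def efficiency(r_i, t_i_l, r_j, t_j_k):
    """E = (r_Mi * T_Mi(A_l)) / (r_Mj * T_Mj(A_k))."""
    return (r_i * t_i_l) / (r_j * t_j_k)


def utilization(ops_k, r_j, t_j_k):
    """U = O(A_k) / (r_Mj * T_Mj(A_k)); at most 1 because T >= O/r."""
    return ops_k / (r_j * t_j_k)


def redundancy(ops_k, ops_l):
    """R = O(A_k) / O(A_l)."""
    return ops_k / ops_l


# Hypothetical numbers: a uniprocessor (rate 1) runs a 100-operation
# algorithm in 100 steps; a 4-AE machine (rate 4) runs a 160-operation
# parallel algorithm in 50 steps.
S = speedup(100, 50)
E = efficiency(1, 100, 4, 50)
U = utilization(160, 4, 50)
R = redundancy(160, 100)
```

When the reference machine runs at its ideal rate, these definitions give $E = U/R$, so redundant operations lower efficiency even when the machine is well utilized.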
II. BASIC IDEA
We will describe in this section a technique for designing algorithms for recurrence systems. The suffix problem for semigroups will be defined first, followed by a detailed example of a simple Boolean recurrence system. The five-step procedure for solving recurrence systems will be given in conclusion.
Let $(S, *)$ be a semigroup, where $S$ is a set and $*$ is a closed and associative operation defined on $S$. The suffix problem is defined as the computation of all products $s_k * s_{k-1} * \cdots * s_2 * s_1$ ($k = n, n-1, \ldots, 2, 1$), for given $s_n, s_{n-1}, \ldots, s_2, s_1 \in S$. The integer $n$ is called the length of the suffix problem.
The suffix problem can be solved in many different ways, which are determined by the order in which the semigroup operation is applied to the operands (elements of $S$). Let a solution of the suffix problem be characterized by its size, defined as the number of operations $*$ needed to compute the solution, and by its depth, defined as the maximum number of operations on any direct path from an input to an output. Then a solution of the minimum possible size $n - 1$ also has depth $n - 1$. Ladner and Fisher [1] showed that the minimum depth of $\lceil \log n \rceil$ requires only a size less than $4n$. Furthermore, they showed that the size can be decreased substantially by permitting the depth to increase by an additive constant. For example, with a depth of $\lceil \log n \rceil + 1$ a size of $3n$ is obtainable, while a depth of $2 \log n$ requires only size $2n$. In what follows, we shall consider the semigroup arising from a general recurrence system.
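The trade-off between size and depth can be seen in a small Python sketch. It contrasts the serial solution with a doubling scheme in the style of recursive doubling; the doubling scheme is an assumption chosen for brevity and is not the Ladner-Fisher construction, so its size grows faster than the $4n$ bound quoted above. List index $k-1$ holds $s_k$, and all names are illustrative.

```python
def suffix_serial(s, op):
    """Size n-1, depth n-1: out[k-1] = s_k * s_{k-1} * ... * s_1."""
    out = [s[0]]
    for x in s[1:]:
        out.append(op(x, out[-1]))
    return out


def suffix_doubling(s, op):
    """Doubling scheme: depth ceil(log2 n), at the price of a larger size.

    Returns (products, depth, size). Within one level every application
    of op is independent of the others.
    """
    out = list(s)
    n = len(out)
    d, depth, size = 1, 0, 0
    while d < n:
        nxt = list(out)
        for k in range(d, n):
            # out[k] covers s_k..s_{k-d+1}; out[k-d] covers the next d factors
            nxt[k] = op(out[k], out[k - d])
            size += 1
        out = nxt
        d *= 2
        depth += 1
    return out, depth, size


# String concatenation is associative but not commutative, so it
# exposes any mistake in the order of the factors.
concat = lambda x, y: x + y
serial = suffix_serial(list("abcde"), concat)
doubled, depth, size = suffix_doubling(list("abcde"), concat)
```

For $n = 5$ the serial solution uses 4 operations at depth 4, while the doubling scheme uses 8 operations at depth 3.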
A recurrence system of the first order $R(1)$ is a quadruple $(K, X, x_0, F)$, where $K = K_1 \times K_2 \times \cdots \times K_s$ is the Cartesian product of sets of coefficients, $X = X_1 \times X_2 \times \cdots \times X_t$ is the Cartesian product of sets of variables, $x_0 \in X$ is the initial value, and $F = \{f_k : X \to X \mid k \in K\}$. Usually the set $F$ is given in a compound form as the recurrence expression. For example, if $A$, $B$, $C$, $D$, and $X$ are sets of real numbers, then $R_1(1) = (A \times B \times C \times D, X, x_0, F_1)$ may have
$F_1$: $x_i = (a_i + b_i x_{i-1})/(c_i + d_i x_{i-1})$ (1)
with $+$ and juxtaposition denoting addition and multiplication of real numbers. Another example may be $R_2(1) = (A \times B, X, x_0, F_2)$, where $A = B = X = \{0, 1\}$ and
$F_2$: $x_i = a_i + b_i \bar{x}_{i-1}$ (2)
In this case, $+$, juxtaposition, and the overbar denote the Boolean operators OR, AND, and NOT.
The functions in the set $F$ can be extended to sequences of coefficients, so that for all $k_i, k_{i-1}, \ldots, k_1 \in K$, $f_{k_i, k_{i-1}, \ldots, k_1}(x) = f_{k_i}(f_{k_{i-1}, \ldots, k_1}(x))$. Then the solution of the recurrence system $R(1)$ of length $n$, denoted by $R(n, 1)$, is the sequence $x_n = f_{k_n, k_{n-1}, \ldots, k_1}(x_0)$, $x_{n-1} = f_{k_{n-1}, \ldots, k_1}(x_0)$, $\ldots$, $x_1 = f_{k_1}(x_0)$ for given $k_n, k_{n-1}, \ldots, k_1 \in K$ and $x_0 \in X$. Furthermore, for all $i$, $1 \le i \le n$, $x_i = f_{k_i, k_{i-1}, \ldots, k_1}(x_0) = f_{k_i}(f_{k_{i-1}}(\cdots f_{k_1}(x_0) \cdots))$. Therefore, the solution of every recurrence system can be decomposed into two subproblems:
1) the suffix problem for its semigroup; that is, the computation of the functional composition $f_{k_i, \ldots, k_1} = f_{k_i} \circ f_{k_{i-1}} \circ \cdots \circ f_{k_2} \circ f_{k_1}$ for all $i$, $1 \le i \le n$; and
2) the functional evaluation $x_i = f_{k_i, \ldots, k_1}(x_0)$ for all $i$, $1 \le i \le n$.
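The two subproblems can be illustrated on the fractional recurrence (1). In the hedged sketch below, each $f_k(x) = (a + bx)/(c + dx)$ is stored as the $2 \times 2$ array $((b, a), (d, c))$, so that functional composition becomes a matrix product; this encoding and all function names are assumptions made for the example. The suffix problem is solved serially here to keep the code short.

```python
from fractions import Fraction


def compose(f, g):
    """Functional composition f o g as the matrix product M_f * M_g."""
    (b1, a1), (d1, c1) = f
    (b2, a2), (d2, c2) = g
    return ((b1 * b2 + a1 * d2, b1 * a2 + a1 * c2),
            (d1 * b2 + c1 * d2, d1 * a2 + c1 * c2))


def evaluate(f, x):
    """Functional evaluation f(x) = (a + b*x) / (c + d*x)."""
    (b, a), (d, c) = f
    return (a + b * x) / (c + d * x)


def solve_direct(ks, x0):
    """Iterate recurrence (1) directly; ks is a list of (a, b, c, d)."""
    xs, x = [], x0
    for a, b, c, d in ks:
        x = (a + b * x) / (c + d * x)
        xs.append(x)
    return xs


def solve_two_step(ks, x0):
    """Step 1: all compositions f_ki o ... o f_k1. Step 2: evaluate at x0."""
    maps = [((b, a), (d, c)) for a, b, c, d in ks]
    suffix = [maps[0]]
    for f in maps[1:]:
        suffix.append(compose(f, suffix[-1]))
    return [evaluate(f, x0) for f in suffix]


ks = [(1, 2, 3, 1), (2, 1, 1, 1), (1, 1, 2, 3)]
xs = solve_two_step(ks, Fraction(1))
```

Exact rational arithmetic is used so that the two ways of computing the solution can be compared for equality.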
Each of the above subproblems can be solved using only one type of arithmetic unit or cell. [...] Furthermore, if the cost of an FEC is less than the cost of an FCC, then the implementation in Fig. 1(e) is the best possible.
The above implementation was obtained by partitioning $R(8, 1)$ into $R^{(1)}(4, 1)$, $R^{(2)}(2, 1)$, and $R^{(3)}(2, 1)$. While $R^{(1)}(4, 1)$ has a serial implementation, $R^{(2)}(2, 1)$ and $R^{(3)}(2, 1)$ have parallel implementations. $R(8, 1)$ was then obtained by serial connection of $R^{(1)}$, $R^{(2)}$, and $R^{(3)}$. The same strategy will be used in Sections IV and V to obtain algorithms for parallel and pipelined machines.
The technique for solving recurrence systems based on the suffix problem of their semigroups allows for natural decomposition into two levels. The number of FCC's, FEC's, and their interconnections determine the global structure, while the content of an FCC or FEC represents the local structure. For a machine with a limited amount of processing power, the mapping of several FCC's and FEC's into one processing unit is necessary. The capability of that processing unit is defined by the operations required by the FCC and FEC specifications and the time necessary to execute them. On the other hand, the interconnection for communicating data between processors, the organization of data in one or more memory modules, and the movement of data to processor units and back are determined by the algorithm used for solving the suffix problem. Ladner and Fisher [1] showed how to solve the suffix problem, but have never shown its connection to the semigroups of recurrences. Furthermore, they used the suffix problem to obtain an upper bound on binary additions, but never adapted it for recurrence systems in the real domain.
On the contrary, Kogge and Stone [2], [3] presented an algorithm based on the recursive-doubling technique for recurrence systems in the real domain. They developed the recursive-doubling algorithm for the first-order linear recurrence system and defined a set of properties that guarantees applicability of their technique to a broader class of problems. Formally, if a recurrence system is of the form
where bi and ai are arbitrary constants and f and g are index-independent functions such that 1) 
3) $g(x, g(y, z)) = g(h(x, y), z)$
then recursive doubling is applicable. It is easy to prove that the above three conditions imply the existence of the semigroup of the recurrence system. However, the converse is not true, since the system $R_2(1) = (A \times B, X, x_0, F_2)$ does not satisfy the Kogge-Stone conditions, yet it has a parallel solution based on its semigroup. Intuitively, the Kogge-Stone approach can be characterized as a bottom-up approach, where an algorithm was generalized by forcing the broader class of problems to conform to the restrictions of the algorithm. The semigroup technique, on the other hand, resembles a top-down approach, where a general technique is developed first and then applied to different classes of problems.
To determine the local structure (that is, the content of an FCC or an FEC), we must define functional composition and functional evaluation in terms of the algebraic operations used to specify the recurrence system. The recurrence system R2(1), for example, was specified over the standard Boolean algebra and, therefore, the operations OR, AND, and NOT will be used. The specification of the local structure can be divided into two tasks as follows:
1) generation of the semigroup $(F^+, \circ)$;
2) encoding of $F^+$ to simplify the complexity of the FCC.
For the recurrence system $R_2(1)$, the set $F_2$ was specified by the recurrence expression $x_i = a_i + b_i \bar{x}_{i-1}$. Since $A = B = X = \{0, 1\}$, the set $F_2$ contains only three functions:
[Function tables for $f_A$, $f_B$, and $f_C$ over $\{0, 1\}$; entries not recoverable.]
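The count of three functions, and the semigroup they generate, can be checked by enumeration. The sketch below assumes that recurrence (2) reads $x_i = a_i + b_i \bar{x}_{i-1}$ with OR, AND, and NOT; each function is stored as the pair $(f(0), f(1))$, and the names `table`, `compose`, and `closure` are illustrative. It makes no claim about which table belongs to which of $f_A$, $f_B$, $f_C$.

```python
from itertools import product


def table(a, b):
    """Truth table (f(0), f(1)) of f(x) = a OR (b AND NOT x)."""
    return tuple(a | (b & (1 - x)) for x in (0, 1))


def compose(f, g):
    """Truth table of f o g, that is, x -> f(g(x))."""
    return tuple(f[g[x]] for x in (0, 1))


def closure(gens):
    """Smallest set containing gens that is closed under composition."""
    closed = set(gens)
    while True:
        new = {compose(f, g) for f in closed for g in closed} - closed
        if not new:
            return closed
        closed |= new


# The four coefficient pairs (a, b) give only three distinct functions:
# constant 0, constant 1, and NOT.
F2 = {table(a, b) for a, b in product((0, 1), repeat=2)}
F2_plus = closure(F2)
```

Under this reading, composing NOT with itself adds the identity function, so the generated semigroup has four elements while $F_2$ has three.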
F2 is represented by the function table in Fig. 2 1(e) shows both encodings for all possible functions of F2. Thus, the structure of the IEC is specified by the following set of equations:
(6)
The parallel implementation of $R_2(8, 1)$ is shown in Fig. 3. Since $t_e = t_c$, this is the fastest possible implementation. It takes $3 + 2\lceil \log_2 n \rceil$ gate delays for any length $n$.
The semigroup $(F_2^+, \circ)$ of $R_2(1)$ [...] Therefore, the expression $x_i = a_i + b_i \bar{x}_{i-1}$ can be written in the form
$x_i = (a_i + b_i)\bar{x}_{i-1} + a_i x_{i-1} = c_i \bar{x}_{i-1} + d_i x_{i-1}$. (7)
The structure of the IEC ($c_i = a_i + b_i$, $d_i = a_i$) is obtained by equating the coefficients in (7). Furthermore, the complement of $x_i$ is of the form
$\bar{x}_i = \bar{c}_i \bar{x}_{i-1} + \bar{d}_i x_{i-1}$. (8)
The form of (7) and (8) suggests the use of linear algebra notation for compactness and efficiency. Using $\mathbf{x}_i = (\bar{x}_i, x_i)$, $\mathbf{c}_i = (c_i, d_i)$, and $\bar{\mathbf{c}}_i = (\bar{c}_i, \bar{d}_i)$, we can write
$x_i = \mathbf{c}_i * \mathbf{x}_{i-1}$ (9)
where $*$ denotes matrix multiplication over Boolean algebra. Furthermore, using (7) and (8),
$x_i = \mathbf{c}_i * C_{i-1} * \mathbf{x}_{i-2}$
where for any $j$, $C_j$ denotes the $2 \times 2$ matrix
$C_j = \begin{pmatrix} \bar{c}_j & \bar{d}_j \\ c_j & d_j \end{pmatrix}$.
In general, for any $f_i, f_j \in F_2^+$, $f_i \circ f_j = f_{\mathbf{c}_i * C_j}$. In other words, the expression $\mathbf{c}_i * C_j$ determines the structure of the FCC already given by (3) and (4). The structure of the FCC is not unique, however; it depends on the encoding of $F_2^+$. For example, the encoding shown in Fig. 2(f) is obtained by the following manipulation of the original expression: [...]
The above procedure will be used in the rest of this paper for solving linear recurrences. Steps 1, 2, 3, and 4 will be dealt with in Section III, and the suffix problem will be solved in Sections IV and V for the constraints imposed by the models of parallel and pipelined machines.
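The Boolean matrix form can be verified exhaustively. The following sketch assumes the reading $x_i = a_i + b_i \bar{x}_{i-1}$ of the recurrence, the encoding $c_i = a_i + b_i$, $d_i = a_i$, and the vector ordering $(\bar{x}, x)$; the helper names are hypothetical. It checks one step against (7) and (8) and two steps against the composed row vector $\mathbf{c}_2 * C_1$.

```python
from itertools import product


def step(a, b, x):
    """One step of the recurrence: a OR (b AND NOT x)."""
    return a | (b & (1 - x))


def encode(a, b):
    """Input encoding: c = a OR b, d = a."""
    return (a | b, a)


def row_times_vec(c, v):
    """Boolean inner product (OR of ANDs)."""
    return (c[0] & v[0]) | (c[1] & v[1])


def big_c(c):
    """2 x 2 matrix with rows (NOT c, NOT d) and (c, d)."""
    return ((1 - c[0], 1 - c[1]), c)


def row_times_mat(c, m):
    """Boolean product of a row vector and a 2 x 2 matrix."""
    return tuple((c[0] & m[0][j]) | (c[1] & m[1][j]) for j in (0, 1))


def check_all():
    """Return True if the matrix form agrees with the recurrence everywhere."""
    for a1, b1, a2, b2, x0 in product((0, 1), repeat=5):
        x1 = step(a1, b1, x0)
        x2 = step(a2, b2, x1)
        c1, c2 = encode(a1, b1), encode(a2, b2)
        v0 = (1 - x0, x0)
        m1 = big_c(c1)
        one_step = row_times_vec(c1, v0) == x1
        complement = tuple(row_times_vec(r, v0) for r in m1) == (1 - x1, x1)
        two_steps = row_times_vec(row_times_mat(c2, m1), v0) == x2
        if not (one_step and complement and two_steps):
            return False
    return True
```

The two-step check is the functional composition performed by one cell: the row vector $\mathbf{c}_2 * C_1$ is computed without knowing $x_0$ and is evaluated only at the end.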
III. THE MAIN THEOREM
This section defines linear recurrence systems over the field of real numbers. Since $F = F^+$ for linear recurrences, steps 1, 2, and 3 of Procedure 1 can be omitted. The encoding of $F^+$ is an identity.
The only theorem of this section defines the functional composition, or in other words, the structure of the FCC. The two corollaries determine the complexity of the FCC in terms of additions and multiplications of real numbers.
The definition of a recurrence system in the rest of the paper has been slightly modified in an effort to accommodate the linear-algebra notation. The results in this section are basically a restatement of results attributable to Kogge and Stone.
[...] anymore. It seems that nothing has been accomplished, since the computation of $b_i$, $5 \le i \le 8$, must be executed serially, and it requires more time than the original computation of all $x_i$, $1 \le i \le 9$. However, the generation of $b_i$, $5 \le i \le 8$, is independent of any other subset of variables whose intersection with $\{x_i \mid 5 \le i \le 8\}$ is empty, and therefore the generation of $b_i$'s for any two disjoint subsets of variables can be performed concurrently.
The model of a parallel computer (Fig. 4) that we will consider consists of a parallel memory (PM), a parallel processor (PP), and a control unit (CU). PP has $p$ arithmetic elements (AE's). Each AE can perform any arithmetic operation in one time step (clock). All AE's are assumed to perform the same operation on each time step (SIMD computer). It is assumed further that no time is required to communicate data between processors and memories, and that the storage arrangement of data is irrelevant. These assumptions are proved to be correct if an independent shift unit (SU) is attached to the second port of PM (as shown in dotted lines in Fig. 4 [...]
Furthermore, the time $T''$ needed to compute the loop in line 9 using $p$ AE's is given by Corollary 1.
$T'' = 2m$.
Therefore, the total number of time steps $T_p$ needed to compute [...]
Algorithm 1 is given for the case when $n > p^2$. It is easy to extend the above algorithm to all linear recurrence systems with $n > 2p$. However, the speed-up obtained is paid for by decreased efficiency [12].
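The partitioning strategy behind the parallel algorithm can be sketched for the simplest case. The Python code below is an illustration under stated assumptions, not Algorithm 1 itself: it uses a first-order recurrence $x_i = a_i x_{i-1} + c_i$ (the symbol $c_i$ and all function names are introduced here for the example), splits it into blocks of $p$ variables, computes the coefficients of each block independently, and then propagates the boundary value from block to block.

```python
def solve_serial(a, c, x0):
    """Reference solution of x_i = a_i * x_{i-1} + c_i."""
    xs, x = [], x0
    for ai, ci in zip(a, c):
        x = ai * x + ci
        xs.append(x)
    return xs


def solve_blocked(a, c, x0, p):
    """Blocked solution with blocks of p variables.

    Phase 1: for every position k in a block, find (A, C) such that
    x_k = A * x_start + C, where x_start is the value just before the
    block. Blocks do not depend on each other, so this phase could run
    on separate arithmetic elements.
    Phase 2: pass the boundary value from block to block; the
    evaluations inside one block are mutually independent.
    """
    n = len(a)
    blocks = []
    for lo in range(0, n, p):
        big_a, big_c, coeffs = 1, 0, []
        for ai, ci in zip(a[lo:lo + p], c[lo:lo + p]):
            big_a, big_c = ai * big_a, ai * big_c + ci
            coeffs.append((big_a, big_c))
        blocks.append(coeffs)

    xs, start = [], x0
    for coeffs in blocks:
        vals = [big_a * start + big_c for big_a, big_c in coeffs]
        xs.extend(vals)
        start = vals[-1]
    return xs
```

The blocked version performs more arithmetic than the serial one, since every coefficient pair costs extra multiplications; this is the redundancy that the time bounds account for.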
The computational time $T_p$ given by Theorem 2 can be improved if the restriction of the SIMD model is removed; that is, the assumption that all AE's perform the same operation at the same time. In the MIMD model that we shall consider, $(p - 1)$ AE's still perform the same function at the same time, while the one remaining AE may not. The idea can be easily explained using Fig. 2. The system $R^{(1)}(4, 2)$ does not require computation of the parallel coefficients $b_1$, $b_2$, $b_3$, and $b_4$. Since the initial vector $x_0$ is known, the variables $x_1$, $x_2$, $x_3$, and $x_4$ can be computed directly from $x_0$; that is, $x_i = a_i * x_{i-1}$, $1 \le i \le 4$. Furthermore, the number of time steps $2pm = 16$ is less than the time needed to compute all parallel coefficients $b_i$, $1 \le i \le 4$, for any $R^{(i)}(p, m)$, $2 \le i \le 4$. Since the time to compute the parallel coefficients is $(p - \frac{3}{2})(2m^2 + m) = 25$, the system $R^{(1)}(4, 2)$ could be enlarged to $R^{(1)}(6, 2)$ and still be solved in 24 time steps using only one AE. This is shown in Fig. 3. The same trick can be employed after every $p(p - 1) + r$ variables, where $r$ is the size of the first subsystem $R^{(1)}(r, m)$. At that moment the values of $x_{18}$ and $x_{17}$ are known and there is no need to compute the parallel coefficients $b_{19}$, $b_{20}$, $b_{21}$, $b_{22}$, $b_{23}$, and $b_{24}$.
[...]
7. for j := 2 until p do
8. begin
9. $B_0^{(j)}$ := identity matrix;
10. for k := 1 until p do $b_k^{(j)}$ := $a_k^{(j)}$ * [...]
After all parallel coefficients for all subsystems have been computed, all $p$ processors are used to compute $x_k^{(2)} = b_k^{(2)} * x_p^{(1)}$, and consecutively $x_k^{(j)} = b_k^{(j)} * x_p^{(j-1)}$ for all $j$, $3 \le j \le p$. This can be accomplished in $2m(p - 1)$ time steps. Therefore, [...]
A pipelined-processor model consists of Main Memory (MM), Pipelined Processor (PP), and Control Unit (CU). PP may consist of several pipelined functional units. In our model, we shall assume only two functional units: a multiplication pipe and an addition pipe. Each of them has $s$ stages. These two functional units are connected serially, with the multiplication pipe feeding the add pipe through register R1.
The results from the add pipe are either stored in MM or fed back into the add pipe through register R2 (Fig. 7) .
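Why a pipelined unit needs several independent computations in flight can be seen from a small clock-level simulation. The sketch below is a deliberately simplified assumption: it models a single pipe with $s$ stages (not the chained multiplier and adder of the model above), issues at most one operation per clock, and lets each operation of a chain wait for the result of the previous one in the same chain. The function name and the first-ready issue policy are hypothetical.

```python
def clocks_needed(num_chains, chain_len, s):
    """Clock at which the last result leaves an s-stage pipe.

    Each chain is a sequence of chain_len operations in which every
    operation needs the result of the previous one. Different chains
    are independent and may overlap in the pipe.
    """
    ready = [0] * num_chains  # clock at which each chain can issue again
    done = [0] * num_chains   # operations already issued per chain
    remaining = num_chains * chain_len
    clock = 0
    finish = 0
    while remaining:
        # issue at most one operation per clock, to the first ready chain
        for ch in range(num_chains):
            if done[ch] < chain_len and ready[ch] <= clock:
                ready[ch] = clock + s  # result emerges s clocks later
                done[ch] += 1
                remaining -= 1
                finish = max(finish, clock + s)
                break
        clock += 1
    return finish
```

With a single chain of 4 operations and $s = 3$ the pipe is mostly idle and 12 clocks are needed; interleaving 3 independent chains of the same length finishes 12 operations in 14 clocks. This is the effect exploited when several subsystems occupy different stages of the pipelined units at the same time.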
Algorithm 3: Given a linear recurrence system $R(n, m)$, [...]
Lines 2, 3, and 4 in Algorithm 3 are used to initialize boundary coefficients which are not included in $R(n, m)$, although they are referenced by the inner loop in lines 5-9. An application of Algorithm 3 to the system $R(28, 2)$ is shown in Fig. 8.
The order of fetching from MM and computing coefficients and variables in PP is indicated by the dashed line. For example, the computation of $b_{14}$ requires $a_{14}$, $b_{13}$, and $e_1$ to be fetched from MM and entered into PP. It is impossible to proceed with the computation of $b_{15}$, since $b_{14}$ will become available only after a certain number of time steps. Therefore, in order to keep the pipelined functional units busy, Algorithm 3 proceeds by computing $b_{11}$, $b_8$, $x_4$, $x_3$, $x_2$, $x_1$, and $b_{21}$, $b_{18}$. At that moment, $b_{14}$ should be available at the output of PP to be used in the computation of $b_{15}$. The exact sequence of statements to be executed in this case is given below: [...]
Proof: The proof is based on Algorithm 3. The system $R(n, m)$ is divided into subsystems $R^{(j)}(p, m)$, $1 \le j \le \lceil n/p \rceil$. To keep the pipelined units busy almost all the time, $p + 1$ subsystems are being solved concurrently; that is, different subsystems are using different stages of the pipelined units.
Each $R^{(j)}(p, m)$ system is solved in two parts: $b_k^{(j)} = a_k^{(j)} * B_{k-1}^{(j)}$, $1 \le k \le p$, is computed first, and $x_{p-k+1}^{(j)} = b_{p-k+1}^{(j)} * x_p^{(j-1)}$, $1 \le k \le p$, is evaluated afterwards. With $p + 1$ systems computed concurrently, the order of computation is as follows: [...]
[...] Fig. 7, is given in Fig. 9. The memory output ports MEM 1 and MEM 2 deliver two operands on each clock. An empty entry in Fig. 6 [...]
The comparison of speed-ups in Fig. 10 shows that the parallel algorithms are superior for all $s > s_0$, and that "row sweep" is better for $1 \le s \le s_0$. However, the cutoff point $s_0$ may be too large to be practical for banded triangular systems with medium and large bands. For example, if a practical $s = 15$ is assumed, then the "row sweep" algorithm has the advantage for all $m > 4$.
Note that the "row sweep" algorithm has almost the best possible speed-up of 2s for all s < m/2. For all s > m/2, the "row sweep" algorithm has approximately a constant speed-up slowly approaching the value of 2s. This behavior is not surprising since the computation time T' is limited at the beginning by the number of stages and later by the operation time, which is kept constant with respect to the number of stages per operation.
The speed-up of the parallel algorithms is much smaller than $2s$. This can be explained by the increased redundancy of $A_s$ with respect to $A$. The redundancy function (Fig. 11) is a step function with steps corresponding to the set of all positive integers; that is, the first step corresponds to the parallel algorithm with $p = 1$, the second step to $p = 2$, and so on. When $p$ changes from, say, $k$ to $k + 1$, there is a substantial increase in the number of redundant operations, which is compensated eventually by the increased number of stages $s$. The breakpoints in the speed-up functions in Fig. 10 are easily explained with the redundancy functions. For each step there is an optimal number of stages $s$ for which the speed-up is maximal. These optimal values of $s$ are indicated by heavy dots in Fig. 10. For any $s$ between the two optimal values $s_1$ and $s_2$ corresponding to the parallel algorithms $A_{s_1}$ and $A_{s_2}$, the algorithm with the better speed-up is used. As the number of stages increases from an optimal value $s_1$, the algorithm $A_{s_1}$ generates the same speed-up, since the number of operations stays the same and the operands do not flow faster through the pipelined arithmetic units. Since the number of stages is greater than $s_1$, some of the stages are idle, waiting for the operands to become available from the bottom of the pipe. The utilization of the machine decreases (Fig. 11). The algorithm $A_{s_2}$ is used when the increased number of stages is capable of compensating for the increased redundancy of the algorithm. Since $s$ is smaller than $s_2$, the number of operations that can be performed independently is greater than the number of stages. There are no idle stages and the speed-up is limited to $s$. As $s$ increases, the speed-up increases linearly with $s$ until it reaches its maximum at $s_2$. The utilization is constant in intervals of the speed-up's linear increase. Another interesting result is a lower-than-expected speed-up for banded systems with $m = 1$.
Although the redundancy for m = 1 is low, the utilization is low, too, resulting in overall speedup that is below those for m = 2 and m = 4. The low utilization is the result of a very simple model which requires a multiplication by 1 as well as addition to 0 to be considered as operations, while they were not counted as operations in computation of redundancy. Since the percentage of these operations is high for m = 1, the utilization is very low.
VI. CONCLUSION
A parallel algorithm for solving banded triangular systems on parallel or pipelined machines was developed. The algorithm uses extra redundant operations to allow parallel computation on a simple model of a parallel processor with $p$ arithmetic elements and $p$ memories. It was shown that no exchange of data between processors is needed if the parallel memory is diagonally addressable with the shifting capability between modules. The algorithm is applicable to all linear recurrence systems with size greater than $p^2$. The time bounds are given for the algorithms running on SIMD and MIMD models of machines. Interestingly, the same algorithm is applicable to machines with pipelined arithmetic units with $s$ stages per operation. The number of stages is large enough to compensate for the introduced redundancy and to achieve overall speed-up with respect to a uniprocessor machine using the natural "row sweep" algorithm. When the "row sweep" algorithm is microprogrammed on the pipelined machine, the comparison shows that the parallel algorithm is superior whenever the number of stages $s$ is greater than $s_0$. The break-even point $s_0$ depends on the size of the band $m$, and increases with $m$. Even for medium $m$ the break-even point may be too large to be practically implementable, since floating-point operations have a fixed number of gates and, therefore, can be pipelined only to a certain number of stages.
The final comparison that remains to be made is between pipelined and parallel machines. In other words, what machine organization should we adopt if the only measure of performance is banded triangular system solvers? The pipelined machine with s stages per operation is compared to a parallel machine with p arithmetic units. As before, two algorithms-the "row sweep" algorithm and the parallel algorithm running on the pipeline machine-are compared to a parallel algorithm on a parallel machine.
The speed-ups $(T_p/T_s)(1 + \epsilon_1)$ and $(T_p/T')(1 + \epsilon_2)$ are plotted in Fig. 12. The pipelined machine with the "row sweep" algorithm is inferior to the parallel machine for $m = 1$ and $m = 2$. For all $m > 2$, the parallel machine eventually outperforms the pipelined machine when the number of processors is large enough. When the parallel algorithm is used on the pipelined machine, the situation is quite different.
