Parallel computation has been of great interest in recent years. A parallel machine consists of a number of processors and an interconnection network to tie them together. This work examines a specific parallel processing problem on a specific architecture that allows the study of the integration of communication and computation. While these two issues are often studied separately, a combined study is rare.
The situation to be considered involves a linear daisy chain of processors, as illustrated in Fig. 1. A single "problem" (or job) is solved on the network at one time. It takes time $w_i T_{cp}$ to solve the entire problem on processor $i$. Here $w_i$ is inversely proportional to the speed of the $i$th processor and $T_{cp}$ is the normalized solution time when $w_i = 1$.
It takes time $z_i T_{cm}$ to transmit the entire problem representation (data) over the $i$th link. Here $z_i$ is inversely proportional to the channel speed of the $i$th link and $T_{cm}$ is the normalized transmission time when $z_i = 1$. It is assumed that the problem representation can be divided amongst the processors. Thus the problem representation is said to be "divisible". That is, fraction $\alpha_i$ of the total problem is assigned to the $i$th processor so that its computing time becomes $\alpha_i w_i T_{cp}$.
It is desired to determine the optimal values of the $\alpha_i$'s so that the problem is solved in the minimum amount of time. The situation is nontrivial as there are communication delays incurred in transmitting fractional parts of the problem representation to each processor from the originating processor.
There is a good deal of literature on scheduling and load sharing in multiprocessors [5-7, 14-21].
However, most work to date assumes that a job can be assigned to at most one processor. Only recently has there been interest in multiprocessor scheduling with jobs that need to be assigned to more than one processor [22-24]. In this work, one has a single job that can be arbitrarily partitioned among a number of processors. The framework being described is particularly germane to processing involving large data files (so that communication delay is nonnegligible), such as sensor data processing, signal processing, image processing, and Kalman filtering, where the data can be divided among multiple processors.
Two cases are considered: processors that have front end communications subprocessors for communications off-loading so that communication and computation may proceed simultaneously, and processors without front end communications subprocessors so that communication and computation must be performed at separate times.
A timing diagram for a linear daisy chain network of four processors with front-end communications subprocessors (as in Fig. 1) is illustrated in Fig. 2. There is one graph for each processor. The horizontal axis is time. The upper half of each graph indicates communication time and the lower half indicates computation time. It is assumed that the problem (load) originates at the left-most processor. At time 0, processor 1 can start working on its fraction, $\alpha_1$, of the problem, which takes time $\alpha_1 w_1 T_{cp}$. It also simultaneously communicates the remaining fraction of the problem to processor 2 in time $(\alpha_2 + \alpha_3 + \alpha_4) z_1 T_{cm}$. Processor 2 can then begin computation on its fraction of the problem (in time $\alpha_2 w_2 T_{cp}$) and communicates the remaining load to processor 3 in time $(\alpha_3 + \alpha_4) z_2 T_{cm}$. The process continues until all processors are working on the problem.
A similar, but not identical, situation for a linear daisy chain network with processors that do not have front-end communication subprocessors is illustrated in Fig. 3. Here each processor must communicate the remaining load to its right neighbor before it can begin computation on its own fraction.
In [1] recursive expressions for calculating the optimal $\alpha_i$'s were presented. These are based on the simplifying premise that for an optimal allocation of load, all processors must stop processing at the same time. Intuitively this is because otherwise some processors would be idle while others were still busy. Analogous solutions have been developed for tree networks [2] and bus networks [3, 4]. The equivalence of first distributing load either to the left or to the right from a point in the interior of a linear daisy chain is demonstrated in [10]. Optimal sequences of load distribution in tree networks are described in [8, 9, 11]. Closed form solutions for homogeneous bus and tree networks appear in [13].
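The equal-stopping-time premise can be illustrated with a minimal sketch for the front-end case. It is built only from the timing description above, not from the recursive expressions of [1]; the function names and the default values $T_{cp} = T_{cm} = 1$ are assumptions made for illustration.

```python
def finish_times_fe(alpha, w, z, Tcp=1.0, Tcm=1.0):
    """Finish time of each processor (front-end case, load starts at processor 1)."""
    n = len(alpha)
    times = []
    start = 0.0  # moment at which processor i has received its load
    for i in range(n):
        times.append(start + alpha[i] * w[i] * Tcp)
        if i < n - 1:
            remaining = sum(alpha[i + 1:])   # load forwarded over link i
            start += remaining * z[i] * Tcm
    return times


def equal_finish_allocation_fe(w, z, Tcp=1.0, Tcm=1.0):
    """Fractions for which all processors stop at the same time (front-end case)."""
    n = len(w)
    alpha = [0.0] * n
    alpha[-1] = 1.0   # unnormalized value; everything is rescaled at the end
    suffix = 1.0      # alpha[i+1] + ... + alpha[n-1]
    for i in range(n - 2, -1, -1):
        # Processor i finishes together with processor i+1.
        alpha[i] = (suffix * z[i] * Tcm + alpha[i + 1] * w[i + 1] * Tcp) / (w[i] * Tcp)
        suffix += alpha[i]
    total = sum(alpha)
    return [a / total for a in alpha]


if __name__ == "__main__":
    w, z = [1.0] * 4, [0.5] * 3          # hypothetical speed constants
    alpha = equal_finish_allocation_fe(w, z)
    print(alpha, finish_times_fe(alpha, w, z))
```

Under these assumed values, comparing `finish_times_fe` for the computed fractions against an even split of 0.25 per processor shows the even split finishing later, which is the intuition stated above.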
The concept of collapsing two or more processors and associated links into a single processor with equivalent processing speed is presented here. This allows a complete proof (an abridged one appears in [1]) that for the optimal, minimal time solution all processors must stop at the same time. Moreover, for the case without front end communications subprocessors, it allows a simple algorithm, described in Section III, to determine when it is economical to distribute load amongst multiple processors. Finally, in Section IV, the notion of equivalent processors enables the derivation of simple closed-form expressions for the equivalent speed of a linear daisy chain network containing an infinite number of processors. This provides a limiting value for the performance of this network architecture and load distribution sequence [11].
II. EQUIVALENT PROCESSORS
Consider a linear daisy chain network of N processors as in Fig. 1 . Two adjacent processors may be combined into a single "equivalent" processor that presents operating characteristics to the rest of the network that are identical to those of the original two processors. Two cases, processors with and without front-end communication subprocessors, are considered.
In both cases it is assumed that the load originates at the left-most processor (processor 1). If the load originates at an interior processor one can use the same methodology to collapse the processors to the left and the right of the originating processor into equivalent processors and then collapse the remaining three processors into a single equivalent processor.
A. Front End Communications Subprocessors
We start with the $(N-1)$st and $N$th processors, as illustrated in Fig. 4. The figure begins at the moment when the load has finished being transmitted to the $(N-1)$st processor from the $(N-2)$nd processor. As in [1], the time each processor is active can be read from the figure. Thus one can recursively show that for a network of $N$ processors the optimal solution occurs when all processors stop at the same time. The two processors with front-end (fe) communications subprocessors can then be replaced by a single equivalent processor.
B. No Front-End Communications Subprocessor
Here $\hat{\alpha}_{N-1}$ is given by (3) with equality. The solution time is divided by the normalized computation time to yield the equivalent speed constant. Thus, starting with the $(N-1)$st and $N$th processors, the entire linear chain of processors can be collapsed, two at a time, into a single equivalent processor.
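For the front-end case, collapsing two processors at a time can be sketched as follows. The formulas are derived here from the equal-stopping-time condition and the timing description of Fig. 2, not copied from the numbered expressions of the text, and the function names are hypothetical.

```python
def collapse_pair_fe(w_left, w_right, z, Tcp=1.0, Tcm=1.0):
    """Collapse two adjacent front-end processors into one equivalent processor.

    Returns (alpha_hat_left, w_eq), where alpha_hat_left is the share of the
    pair's load kept by the left processor when both stop at the same time.
    """
    right_cost = z * Tcm + w_right * Tcp   # time per unit of load sent to the right
    alpha_hat_left = right_cost / (w_left * Tcp + right_cost)
    w_eq = alpha_hat_left * w_left         # solution time for unit load, divided by Tcp
    return alpha_hat_left, w_eq


def collapse_chain_fe(w, z, Tcp=1.0, Tcm=1.0):
    """Collapse a whole chain, two processors at a time, from right to left."""
    w_eq = w[-1]
    for i in range(len(w) - 2, -1, -1):
        _, w_eq = collapse_pair_fe(w[i], w_eq, z[i], Tcp, Tcm)
    return w_eq


if __name__ == "__main__":
    # Hypothetical chain of four identical processors and three identical links.
    print(collapse_chain_fe([1.0] * 4, [0.5] * 3))
```

The returned constant plays the role of $w_i$ for the collapsed chain: multiplying it by $T_{cp}$ gives the time in which the chain would finish one unit of load under these assumptions.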
Once again, to prove that the minimal time solution requires both processors to stop at the same time, the cases $T_{N-1} \ge T_N$ and $T_{N-1} \le T_N$ can be considered.
For $T_{N-1} \ge T_N$, simple algebra results in an inequality, with equality occurring when both processors stop at the same time. From (6) the solution time can be rewritten as

$$T_{sol} = \cdots + \hat{\alpha}_{N-1}(\alpha_{N-1} + \alpha_N)(w_{N-1}T_{cp} - z_{N-1}T_{cm}).$$

On the other hand, if $(w_{N-1}T_{cp} - z_{N-1}T_{cm})$ is negative, then minimizing $T_{sol}$ is equivalent to maximizing $\hat{\alpha}_{N-1}$ at $\hat{\alpha}_{N-1} = 1$. That is, communication speeds are slow relative to computation speed so that it is more economical for processor $N-1$ to process the entire load itself rather than to distribute part of it to processor $N$.
The case where $T_{N-1} \le T_N$ proceeds along similar lines. Again, the ability to collapse processors into equivalent processors allows one to extend the proof that two processors must stop at the same time for a minimal time solution to $N$ processors.
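The two-processor argument without front-end subprocessors can be checked numerically with a small sketch. It assumes a unit load on the pair, that the left processor must finish transmitting before it computes (as in Fig. 3), and hypothetical function names; it is an illustration of the reasoning rather than the expressions of the text.

```python
def pair_solution_time_nfe(alpha_hat_left, w_left, w_right, z, Tcp=1.0, Tcm=1.0):
    """Time to finish one unit of load on two processors without front ends."""
    alpha_hat_right = 1.0 - alpha_hat_left
    comm = alpha_hat_right * z * Tcm               # left sends before it computes
    t_left = comm + alpha_hat_left * w_left * Tcp
    t_right = comm + alpha_hat_right * w_right * Tcp
    return max(t_left, t_right)


def best_split_nfe(w_left, w_right, z, Tcp=1.0, Tcm=1.0):
    """Return (alpha_hat_left, solution_time) for the pair."""
    if w_left * Tcp <= z * Tcm:
        # Sending load costs at least as much as computing it locally.
        return 1.0, w_left * Tcp
    alpha_hat_left = w_right / (w_left + w_right)  # both processors stop together
    t = pair_solution_time_nfe(alpha_hat_left, w_left, w_right, z, Tcp, Tcm)
    return alpha_hat_left, t


if __name__ == "__main__":
    print(best_split_nfe(1.0, 1.0, 0.5))   # fast link: load is shared
    print(best_split_nfe(1.0, 1.0, 2.0))   # slow link: left processor keeps it all
```

Scanning `pair_solution_time_nfe` over a grid of split values is one way to confirm that no other split beats the one returned by `best_split_nfe` for a given set of assumed constants.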
III. WHEN TO DISTRIBUTE LOAD
A practical problem for the case without front-end communications subprocessors is to compute the equivalent computation speed of a linear daisy chain network when, in fact, the optimal solution may not make use of all processors because communication speeds are too slow. Again, if the load originates at the left-most processor, this can be done by collapsing the processors, two at a time, from right to left. Note that in (13) factors of $(\alpha_{i-1} + \alpha_i)$ cancel in the numerator and denominator.
By keeping track of which of (10) and (11) is smaller, it is possible to determine which processors to remove from the final network.
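A minimal sketch of this right-to-left procedure is given below. It assumes that the two quantities being compared at each step are the equivalent speed constant when load is passed on and the speed constant of the processor working alone; the function name and the way dropped processors are reported are illustrative choices, not taken from the text.

```python
def collapse_chain_nfe(w, z, Tcp=1.0, Tcm=1.0):
    """Collapse a chain without front ends from right to left.

    Returns (w_eq, used), where used is the number of left-most processors
    that take part in the solution; processors beyond that are dropped.
    """
    n = len(w)
    w_eq = w[-1]
    used = n
    for i in range(n - 2, -1, -1):
        alpha_hat = w_eq / (w[i] + w_eq)   # share kept by processor i (equal stop times)
        distribute = ((1.0 - alpha_hat) * z[i] * Tcm + alpha_hat * w[i] * Tcp) / Tcp
        alone = w[i]
        if distribute < alone:
            w_eq = distribute
        else:
            w_eq = alone
            used = i + 1                   # everything to the right of i is removed
    return w_eq, used


if __name__ == "__main__":
    # Hypothetical chain whose second link is too slow to be worth using.
    print(collapse_chain_nfe([1.0, 1.0, 1.0], [0.5, 2.0]))
```

In the example the third processor is dropped because its link is slow, while the first two still share the load.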
Note that the above procedure can also be applied to the situation when the load originates at a processor which is located in the interior of the network. The parts of the network to the left and to the right of the originating processor can be collapsed into equivalent processors, following the previous procedure. The remaining three processors (left, originating, right) can then be further collapsed into a single equivalent processor. Naturally, it must be checked whether the inclusion of the left and/or right equivalent processor leads to a faster solution.
IV. INFINITE NUMBER OF PROCESSORS
A difficulty with the linear daisy chain network architecture is that as more and more processors are added to the network, the improvement in the equivalent speed of the network approaches a saturation limit. Intuitively, this is because of the overhead in communicating the problem representation down the linear daisy chain in what is essentially a store and forward mode of operation. This section develops expressions for the equivalent processing speed of an infinite number of homogeneous processors and links. These provide a limiting value on the performance of this architecture. The technique is similar to that used for infinitely sized electrical networks to determine equivalent impedance.
Let the load originate at a processor at the left boundary of the network (processor 1). The basic idea is to write an expression for the speed of the single equivalent processor for processors $1, 2, \ldots, \infty$. This is a function of the speed of the single equivalent processor for processors $2, 3, \ldots, \infty$. However, these two speeds should be equal since both involve an infinite number of processors. One can simply solve for this speed.
Consider, first, the case where each processor has a front-end communication subprocessor. Let $w_i = w$ and $z_i = z$. Let the network consist of $P_1$ and an equivalent processor for processors $2, 3, \ldots, \infty$. Then the speed of the overall equivalent processor can be written in terms of itself and solved.

It is also possible to use the above results to calculate the limiting performance of an infinite sized daisy chain when the load originates at a processor in the interior of the network (with the network having infinite extent to the left and the right). Expressions (17) or (18) can be used to construct equivalent processors for the parts of the network to the left and right of the originating processor. The resulting three processor system can then be simply solved [1, 12].
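For the homogeneous front-end case with the load originating at the left boundary, the self-consistency argument can be sketched numerically. The closed form below is obtained here by setting the equivalent speed constant of the pair ($P_1$ plus the rest) equal to that of the rest under the equal-stopping-time model; it is a derivation under that assumption rather than a quotation of the numbered expressions, and the function names are hypothetical.

```python
import math


def w_eq_infinite_fe(w, z, Tcp=1.0, Tcm=1.0):
    """Fixed point of the front-end pair collapse for identical processors and links.

    Solves Tcp * x**2 + z*Tcm * x - w*z*Tcm = 0 for the positive root x.
    """
    a = z * Tcm
    return (-a + math.sqrt(a * a + 4.0 * w * a * Tcp)) / (2.0 * Tcp)


def w_eq_finite_fe(n, w, z, Tcp=1.0, Tcm=1.0):
    """Equivalent speed constant of n identical front-end processors, by repeated collapse."""
    w_eq = w
    for _ in range(n - 1):
        right_cost = z * Tcm + w_eq * Tcp          # time per unit of load sent right
        w_eq = w * right_cost / (w * Tcp + right_cost)
    return w_eq


if __name__ == "__main__":
    w, z = 1.0, 0.5                                # hypothetical constants
    for n in (1, 2, 4, 8, 16):
        print(n, w_eq_finite_fe(n, w, z))
    print("limit", w_eq_infinite_fe(w, z))
```

Printing the finite values next to the limit shows how quickly additional processors stop paying off for the assumed constants, which is the saturation effect described at the start of this section.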
The concept of collapsing two or more processors into an equivalent processor has been shown to be useful in examining a variety of aspects related to these linear daisy chain networks of load sharing processors. Expressions for the performance of infinite chains of processors are particularly useful: if one can construct a finite-sized daisy chain that approaches the performance of a hypothetical infinite system, one can be confident that performance cannot be improved further for this particular architecture and load distribution sequence [11].

Cheng, Y.-C., and Robertazzi, T. G. (1990)
Bataineh, S., and Robertazzi, T. G. (1991)
Leung, J. Y.-T., and Young, G. H. (1989) Minimizing schedule length subject to minimum flow time. SIAM Journal on

