Much work has been done on the problem of synthesizing a processor array from a system of recurrence equations. Some researchers limit communication to nearest neighbors in the array; others use broadcast. In many cases, neither approach results in an optimal execution time.
Introduction
A system of uniform recurrence equations, as defined by Karp, Miller, and Winograd [23, 22], maps especially well onto a systolic/wavefront array. Many researchers have either linearly mapped systems of uniform recurrence equations into space-time, or translated them to systolic/wavefront arrays [36, 6, 37, 8, 40, 35, 15, 12, 43, 38, 42, 11, 26, 49]. Another approach in parallel processing is to allow broadcasting. This approach has been considered by several authors (e.g., see [53, 5, 1, 34]). In broadcasting, a single processor can broadcast data to all processors simultaneously. A variation of this technique is to broadcast only in columns and rows, such that every processor can broadcast data to all other processors in the same row or column of a processor mesh [25, 50]. When broadcasting, the signal propagation time depends on the number of processors in the multiprocessor system.
In this paper we suggest a compromise between global broadcast and nearest neighbor communication. This technique, called bounded broadcast, allows a processor to send data to all processors on the same row/column that are within some bounded distance from it. This technique enhances the timing results of many algorithms that can use broadcast. We apply this technique to the transitive closure and all-pairs shortest distance problems [55, 14]. These problems have been considered by many researchers (e.g., see [18], [54, ch. 5], [43, page 289], [27, 47, 46, 48, 31, 20, 29, 28]). At present the best systolic array execution time reported for an N × N input matrix is 5N − 4 [28], where the unit of time is the time it takes to make one computational step plus the time to transmit the result from one processor to its neighbor. By using a bounded broadcast, this time can be reduced to between N and 4N (ignoring low-order terms), depending on the design. Bounded broadcast can be used in other problems such as the general Algebraic Path Problem (APP) [30, 58, 16, 4, 10, 32, 41] and many matrix computations. (An execution time of about 4N is achieved by Delosme [10]. He uses a non-classical algorithm, avoiding broadcast altogether, and a larger number of processors.)
A handful of architectures that implement reconfigurable buses have already been presented [3, 52, 34, 33]. These architectures are well suited to implement the bounded broadcast described here, after the algorithm has been mapped to space and time using the procedure described in this paper. An earlier version of this work can be found in [56]. A similar approach has recently been developed independently by Risset and Robert [45, 44].
Using Bounded Broadcast to Reduce Latency

Definitions

Example 1
The system of recurrence equations (SRE) below finds the transitive closure, or all-pairs shortest distance, for a given input matrix A_{N×N}. This algorithm is the implementation of the Warshall-Floyd algorithm suggested by Kung et al. [28]. We have changed the names used there as follows: x → a_1, r → a_2, and c → a_3. Also, the input and output parts of the SRE are not given here.
The operation ⊗ models binary AND in the transitive closure case and + in the all-pairs shortest distance case. The operation ⊕ models binary OR in the transitive closure case and the MINIMUM operation in the all-pairs shortest distance case. The recurrence equations above are used to illustrate some of the following definitions, which relate to an SRE.
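For orientation, here is a minimal sequential sketch (ours, not from the original) of the Warshall-Floyd recurrence that the SRE parallelizes; the function name and the oplus/otimes parameters are our notation for the ⊕ and ⊗ operations named above.

```python
# A sequential sketch of the Warshall-Floyd recurrence. oplus/otimes are
# the semiring operations: (OR, AND) for transitive closure, (min, +) for
# all-pairs shortest distance.
def warshall_floyd(A, oplus, otimes):
    N = len(A)
    a = [row[:] for row in A]  # working copy; plays the role of array a_1
    for k in range(N):         # pivot index
        for i in range(N):
            for j in range(N):
                # In the systolic SRE, a_2 and a_3 propagate the pivot
                # column/row values a[i][k] and a[k][j] across the array.
                a[i][j] = oplus(a[i][j], otimes(a[i][k], a[k][j]))
    return a

# Transitive closure over {0, 1}: oplus = OR, otimes = AND.
tc = warshall_floyd([[1, 1, 0], [0, 1, 1], [0, 0, 1]],
                    lambda x, y: x | y, lambda x, y: x & y)
# All-pairs shortest distance: oplus = min, otimes = +.
INF = float('inf')
sd = warshall_floyd([[0, 3, INF], [INF, 0, 1], [2, INF, 0]],
                    min, lambda x, y: x + y)
```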
Index set: The set of points where an array is computed or used.
Domain of computation: The set of points C_i where an array a_i is computed.

Dependence map: A function δ_{ij} from the domain of computation of array a_j (or part of it) to the index set of a_i, on which the computation of a_j depends.

System of quasi-uniform recurrence equations: An SRE that is uniform except for boundary points (for a more precise definition, see [57]); e.g., the SRE given in Ex. 1 is quasi-uniform.
Latency with respect to function f: The time T_f separating the arrival of the first input from the departure of the last output, for one computation of f.
Given a system of recurrence equations (SRE), one of the design outputs is a vector π ∈ Z^n and a set of constants {c_i ∈ Z}, one for each array, such that for all problem sizes, if a variable a_j(x_2) depends on a variable a_i(x_1) (not necessarily directly), then

π^T x_2 + c_j > π^T x_1 + c_i.

This condition ensures that there exists a valid execution ordering. Array variable a_i(x) is computed at time π^T x + c_i. To simplify our presentation, we use a less general formulation. We assume that:
• there is only one π.
The vector π in this case is referred to as a linear schedule vector, and the above mapping is referred to as a linear schedule.
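As a small worked instance (the numbers are ours, for illustration only): take n = 3, π = (1, 1, 1)^T, c_i = c_j = 0, and suppose a_j(x_2) depends directly on a_i(x_1) with x_2 − x_1 = (0, 0, 1)^T. Then

$$\pi^T x_2 + c_j - (\pi^T x_1 + c_i) = \pi^T (x_2 - x_1) = 1 > 0,$$

so the dependence is respected: a_i(x_1) is computed one time unit before a_j(x_2) needs it.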
Using bounded broadcast to reduce latency
The latency (sometimes referred to as delay time) comprises three components: 1) input, 2) execution, and 3) output. Sometimes it is possible to overlap the execution with the input/output. The execution time is the time at which the last computation is done minus the time at which the first computation is done. In the case of a linear schedule, it is max_{p,q∈C} π^T(p − q), where π is the linear schedule vector and C = ∪_i C_i (i.e., the union of the domains of computation of the arrays in the SRE); a numeric sketch of this formula follows the list below. Much research has been done on the problem of minimizing the latency for an integral π (e.g., [21, 43, 12, 17, 39, 9, 11]). Shang and Fortes [51] mention the possibility of a rational π, but do not explain the physical meaning of such a schedule. Here we give physical meaning to such a schedule. The idea is to distinguish between two kinds of dependences:
1. dependence of one computation step on the result of a previous computation step;
2. dependence of a computation step on a 'propagating variable' (a variable that is transmitted unchanged from one processor to the next).
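The execution-time formula above can be checked mechanically. Below is a small numeric sketch (ours); the cubic domain and the schedule vectors are hypothetical examples.

```python
import itertools

# Execution time of a linear schedule pi over domain C:
# max over p, q in C of pi . (p - q), i.e., the spread of pi . p over C.
def execution_time(pi, C):
    times = [sum(w * x for w, x in zip(pi, p)) for p in C]
    return max(times) - min(times)

# Hypothetical cubic domain of side N; a rational entry (say, along a
# propagating direction) reduces the execution time.
N = 5
C = list(itertools.product(range(N), repeat=3))
print(execution_time((1, 1, 1), C))      # 3*(N-1) = 12
print(execution_time((1, 1, 0.5), C))    # 2.5*(N-1) = 10.0
```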
An array of only propagating variables is called a propagating array. Some researchers assume that the time taken to propagate a variable from a processor to its neighbor is the same as the execution time for one computation step.
Other researchers assume that values can be broadcast such that all processors get the propagated value at the same time.
Both of these assumptions lead to a sub-optimal latency. In the first case, propagation of the variable is delayed until the computation step is finished; in the second case, the computation step is delayed until the propagated variable is received by all processors.
In this paper, we assume that a variable can be propagated through a bounded number of neighbor connections. The idea is to broadcast a propagating variable through as many neighbors as possible, such that the time to do so balances the time to execute one computation step and communicate the results to the nearest neighbor.
In this paper, K denotes the number of neighbor processors that a variable can propagate through, in the same time that it takes to 1) execute one computation step, and 2) communicate the results to a neighbor processor. K grows as the time complexity of the computation step grows. (The time complexity of a computation step is measured by the depth of the circuit that implements it.) K also depends on the method by which the variable is propagated. The two principal methods are to:
1. use a bus that goes through K processors (with possible repeaters);
2. insert a latch stage in every processor.
Again, if a variable is propagated to either fewer than K processors or more than K processors, then the overall latency is not minimized: in the former case, a propagating variable waits unnecessarily for a computation to complete; in the latter, a computation waits unnecessarily for a propagating variable to complete its propagation. By realizing this K-broadcast, the inserted latches thus decrease the overall latency. The two variable propagation methods are shown schematically in Fig. 2 and Fig. 1, respectively. The value of K clearly decreases if latches are inserted. K also depends on the technology. Another factor that may affect the value of K is the time for clock distribution (if this exceeds the time for one computation step) [13, 7]. We ignore this factor, since clock distribution usually can be reduced by appropriate design [24, 19]. For present MOS technologies and the simplest computation (1-3 gates), K ≈ 1. Similarly, the input and output steps may take less time than the computation step; in this case, the input and/or output steps may be synchronized by a faster clock.
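To make the definition of K concrete, here is a hypothetical back-of-the-envelope estimate (the symbols t_step, t_hop, t_latch and the timings are ours, not from the original): with latched propagation, K is roughly the number of per-processor propagation delays that fit in one compute-plus-communicate time unit:

$$K = \left\lfloor \frac{t_{\text{step}} + t_{\text{hop}}}{t_{\text{latch}}} \right\rfloor = \left\lfloor \frac{10\,\text{ns} + 2\,\text{ns}}{3\,\text{ns}} \right\rfloor = 4.$$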
The discussion above indicates that we can choose π with fractional entries. In what follows, we assume that all entries of π are multiples of 1/K; all the results generalize to any rational π. Minimizing the latency comprises solving an integer linear program: if every entry of π is a multiple of 1/K, then multiplying π by K yields an integer vector, giving an integer linear program. One is still faced with two problems: 1) finding the constraints on π in the case of bounded broadcast; 2) solving the integer linear program, which is an NP-complete problem. Our distinction between propagating variables and other variables enables us to find these constraints. Following [43, page 129], we define the iteration vector to be a vector u ∈ Z^n that satisfies:
1. u is perpendicular to space.
2. The greatest common divisor of the components of u is 1.
3. The first non-zero component of u is greater than zero.
From the above definition, the length of u is the minimum distance between two index points in that direction.
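Conditions 2 and 3 amount to normalizing a direction vector; a minimal sketch follows (the function name is ours, and condition 1 depends on the chosen space projection, so it is taken as given here).

```python
from math import gcd
from functools import reduce

# Normalize a nonzero integer direction vector into an iteration vector:
# divide the components by their gcd (condition 2) and make the first
# nonzero component positive (condition 3).
def iteration_vector(v):
    g = reduce(gcd, (abs(x) for x in v))
    u = [x // g for x in v]
    first = next(x for x in u if x != 0)
    if first < 0:
        u = [-x for x in u]
    return u

print(iteration_vector([0, -2, -4]))  # [0, 1, 2]
```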
Consider first the case where variables are propagated using latches (see Fig. 1). The choice of the vector π in this case is subject to the following constraints:

1. |π^T u| ≥ 1, where u is the iteration vector.

This constraint guarantees that two consecutive computations done in the same processor are done no less than one time step apart. (If the computation step is pipelined in each processor, this restriction changes: instead of 1, the right-hand side of the inequality becomes 1/L, where L is the number of pipeline stages.)
2. For every dependence map of a propagating array (i.e., a dependence map for an array that depends on a propagating array), the corresponding translation part d satisfies π^T d ≥ 1/K.

This constraint guarantees that sufficient time exists for a propagating variable to be communicated to the location where it is used.
3. For every dependence map of a non-propagating array (i.e., a dependence map for an array that depends on a non-propagating array), the corresponding translation part d satisfies π^T d ≥ 1.

This constraint guarantees that the computation is complete before its result is required.
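These three constraints can be checked mechanically for a candidate π. Below is a minimal sketch (the function and its argument split are ours, and it assumes the bounds as stated above); the caller supplies the iteration vector u and the translation parts, split into dependences on propagating arrays (prop_ds) and on non-propagating arrays (nonprop_ds).

```python
# Check a candidate linear schedule vector pi against the latch-case
# constraints; L is the number of pipeline stages (1 if unpipelined).
# All entries of pi are assumed to be multiples of 1/K.
def valid_latched_schedule(pi, u, prop_ds, nonprop_ds, K, L=1):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return (abs(dot(pi, u)) >= 1 / L                        # constraint 1
            and all(dot(pi, d) >= 1 / K for d in prop_ds)   # constraint 2
            and all(dot(pi, d) >= 1 for d in nonprop_ds))   # constraint 3

# Hypothetical example with K = 4: pi mixes integral and 1/K entries.
print(valid_latched_schedule(pi=(1, 1, 0.25), u=(1, 0, 0),
                             prop_ds=[(0, 0, 1)],
                             nonprop_ds=[(1, 0, 0)], K=4))   # True
```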
One jointly determines the spatial embedding and the linear schedule. The choice of rational entries in π implies that there is a faster basic clock in the array (e.g., if the minimum entry is 1/K, as assumed here, this clock is K times faster than it would be with a minimum entry of 1). A computation at index point p is executed at time step π^T p; the period of each time step is 1/K. Latches thus must be present in each processor, as shown in Fig. 1.
If the computation is very simple, it may be impractical to pass a propagating variable through a latch in each processor before transmitting it to the next processor. In this case, the basic clock remains the same, and the propagating variable is transmitted via a bus to K processors before being latched, as shown in Fig. 2. The time step for a computation at index point p in this case is ⌈π^T p⌉. In order to ensure valid timing in this case, we add one more constraint:
4. For every dependence map of a propagating array on a non-propagating array, the corresponding translation part d satisfies π^T d ≥ 1.

This constraint guarantees that there is enough time for a propagating variable to first be computed, and then broadcast to K processors.
For example, suppose we choose π such that π^T p = J + (1/K) for some integer J and index point p at which a propagating variable is computed. The computation at p is done at time step J + 1. Suppose this propagating variable is used at computation index points q_1, q_2, ..., q_K, each of which satisfies π^T q_i = J + 1 + (i/K). All the computations at these index points are done at time step J + 2. The constraint above ensures that there is enough time to compute the variable and also propagate its value to all the above index points.
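A quick numeric check of this example (ours, with the hypothetical values K = 4 and J = 10) follows.

```python
import math

# In the bus case, a computation at index point p executes at time step
# ceil(pi . p). A value computed where pi.p = J + 1/K executes at step
# J + 1, and every use at pi.q_i = J + 1 + i/K executes at step J + 2.
K, J = 4, 10
print(math.ceil(J + 1 / K))                                  # 11, i.e., J + 1
print([math.ceil(J + 1 + i / K) for i in range(1, K + 1)])   # [12, 12, 12, 12]
```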
Applying bounded broadcast to the Warshall-Floyd algorithm
The following example illustrates the advantage of using bounded broadcast.
Example 1A
The SRE below, given input matrix A_{N×N}, computes its transitive closure (or all-pairs shortest distance) A⁺, which is the SRE's output. This SRE is an implementation of the Warshall-Floyd algorithm, derived from Ex. 1 by 1) eliminating global connections as explained in [28], 2) adding input statements (for A) and output statements (for A⁺), and 3) eliminating zero translation vectors (in the dependence maps) by substitution. In this SRE, g represents the value 0 for the all-pairs shortest distance problem, and 1 for the transitive closure problem.
The array a_1 holds the transitive closure (or shortest distance) as it is being computed; a_2 and a_3 are propagating arrays in the i and j directions, respectively.
The distinct translation parts of the dependence maps in this SRE are:
The first four translation parts above correspond to propagating arrays (i.e., their dependence maps are for arrays that depend on a propagating array); the last translation part corresponds to a non-propagating array (i.e., it is the translation part of a dependence map for an array that depends on a_1, a non-propagating array). This observation is used in picking a valid schedule vector π that satisfies the constraints of § 2.2.
We consider 5 design cases for the above SRE. The first, presented in [28], is referred to as Design K. The other 4 designs, denoted A-D, use bounded broadcast. In what follows, we define the unit of time as the time to execute one computation step and communicate the results to a neighbor processor. We denote by K the number of processors to which a variable can be propagated in one time unit. Designs A and B use latches in the propagation path, as shown in Fig. 1; designs C and D use a bus-like structure, as shown in Fig. 2. Table 1 contains the design parameters and the times. In designs A and C, input [output] is done before [after] the computation, taking N/M time steps, where M = (unit of time)/(time for a processor to communicate a variable to its neighbor). In order to further reduce the period (down to the execution time), one can incorporate special hardware so that input and output overlap execution.
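As a hypothetical numeric illustration of M (the timings are ours, not from the original): if one compute-plus-communicate time unit takes 8 ns and a neighbor communication alone takes 2 ns, then

$$M = \frac{8\,\text{ns}}{2\,\text{ns}} = 4, \qquad T_{\text{input}} = \frac{N}{M} = \frac{N}{4}\ \text{time steps}.$$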
In designs K, A, and B, the computation at index point p is executed in time step π^T p. In designs C and D, the computation at index point p is executed in time step ⌈π^T p⌉. All these designs use N² processors. All of them are valid, because they satisfy the constraints on π. The latency of design K is claimed in [28] to be optimal. It is indeed optimal when π is restricted to be integral. However, as can be seen from Table 1, it is not optimal when this restriction is removed.
If K ≥ 4 when latching the propagating variable in every processor, then designs A and B are better than designs C and D; K = 3 is a boundary case; if K < 3, then designs C and D are better than A and B. Designs A and B thus are suited to a complex computation step (such as in the all-pairs shortest distance problem); designs C and D are better suited to a simpler one (such as in the transitive closure problem).
Designs A and C are preferable when at least one of the following holds:
• matrix A does not need to be input, and matrix A + does not need to be output;
• input/output can be done fast in comparison with execution (i.e., M > 1) and the latency is more important than the period;
• hardware is used to overlap input/output with execution.
Otherwise designs B and D are preferable.
We have assumed thus far that the processors are not pipelined. If the computation is complex, then we may employ internal pipelining. This is advantageous when the first restriction, |π^T u| ≥ 1, is the bottleneck. For example, consider design B. Suppose we have L pipelining stages in each processor, and L ≤ K. Then a linear schedule vector can be chosen that satisfies all the restrictions. The time step now is larger, since latches are added for pipelining. Let τ denote one time unit, and let h(L) = (τ using L stages)/(τ using 1 stage). The times for the pipelined version of design B are shown in Table 2. In this case the period is less than N. The 'penalty' here is the hardware added to pipeline each processor.
Conclusions
We presented bounded broadcasting, an architectural feature that can improve the performance of systolic arrays. We suggested a distinction between propagating variables and other variables. Using this distinction, we identified several conditions on the linear schedule vector for a system of recurrence equations, which are sufficient to implement the SRE with bounded broadcast.
We then illustrated bounded broadcast on the problem of transitive closure/all-pairs shortest distance. The asymptotic latency, period, and execution times of the design of Kung et al. [28] (which are optimal for designs that do not use bounded broadcast) are 5N, N, and 5N, respectively. These same measures are 3N, 2N, N for design A; 3N, N, 3N for design B; 4N, 3N, 2N for design C; and 4N, N, 4N for design D. (These times can be reduced further by pipelining the computation step, and by providing hardware that overlaps input/output with execution.) The MP/C [3], the CHiP computer [52], and the meshes with reconfigurable buses described by Miller et al. [34] all are well suited to implement bounded broadcast. Unlike an SIMD broadcast step, the time to perform a bounded broadcast does not grow as the array grows. Systolic computing systems should make use of bounded broadcast for the following reasons:
• Essentially all systolic algorithms have propagating variables, and hence can benefit from bounded broadcast.
• All other things being equal, a systolic computing system that implements an algorithm using bounded broadcast will perform significantly better than one that does not.
