Using a directed acyclic graph (dag) model of algorithms, the paper focuses on time-minimal multiprocessor schedules that use as few processors as possible. Such a processor-time-minimal scheduling of an algorithm's dag first is illustrated using a triangular shaped 2D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n × n × n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing matrix product as well as many other algorithms. Completion of the cubical mesh requires 3n − 2 steps.
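A valid multiprocessor schedule for a dag must satisfy 2 constraints:
1. A node can be computed only after its children have been computed.
2. No processor can compute 2 different nodes during the same time step.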
Definition: A multiprocessor schedule for a dag is time-minimal when the number of steps in the schedule equals the number of nodes in a longest directed path in the dag.
This is a machine-independent measure of the maximum parallelism in the dag. We now turn our attention to a specific problem: matrix product. Starting with Kung and Leiserson's seminal paper [26], there has been a steady stream of successful research on systolic arrays, especially for computing a matrix product. Kung and Leiserson were the first to present a systolic array for banded matrix product [26]. This soon was followed by Weiser and Davis's systolic array [48], which uses 1/3 as many steps. This was followed by a time-minimal design for full matrix product that completes in 3n − 2 steps, for n × n matrices, using n^2 processors [5].
All these systolic arrays share two things:
1. They all use the same dependence dag, an n × n × n directed mesh.
2. They all map this cubical mesh into processor-time with a transformation of the indices that is linear. That is, if we are given the computation
c_ij = Σ_{k=1}^{n} a_ik · b_kj,
then each term of the summation (each inner product step) corresponds to an index vector, [i j k]^T. The time step and processor location of each of these inner-product steps is given by:
and Silberschatz [42]. Linear maps are explicit in the research of Moldovan [32, 33], Cappello and Steiglitz [2, 4, 5], Quinton [37, 38], and Gachet et al. [16]. Such investigations are surveyed by Fortes, Fu, and Wah [13], and Quinton [39]. Work on enumerating, or otherwise exploring, the various linear maps associated with an iterative dependence dag has been reported by Moldovan [34], Miranker and Winkler [31], Danielsson [8], Moldovan and Fortes [14], Rao [43], Delosme and Ipsen [10], Rajopadhye et al. [41, 40], and Lee and Kedem [27].
There has been a great deal of work on optimizing systolic arrays. The work pursued by Li and Wah [21], Fortes and Parisi-Presicce [15], Rao [43], Delosme and Ipsen [9], Chen [6, 7], Lee and Kedem [27], Shang and Fortes [45], and most recently by Wong and Delosme [49, 50], all contribute to methods for optimizing systolic arrays. These efforts constrain the processor-time mapping to be a linear or affine transformation of the problem's index set. The first reason for this constraint is that it yields systolic arrays that are both intuitively appealing and practical to implement. The question nonetheless arises as to whether relaxing the linearity constraint results in an even more efficient use of time and space. This question leads to the second reason that extant optimization efforts constrain the processor-time mapping to be linear or affine: the general problem of precedence constrained scheduling onto a set of processors is NP-complete (see [17] for references to a variety of such problems). Problems that are NP-complete may be dealt with in several ways. One way is to isolate a fundamental, indexed family of problem instances, and find an optimal parameterized solution for that family. This idea is illustrated below. First, the particular optimum under consideration is defined.
Definition: A multiprocessor schedule for a dag is processor-time-minimal when it uses as few processors as any time-minimal schedule for the dag.
Although only one of many performance measures, processor-time-minimality is useful because it indicates the minimum number of processing elements that are sufficient to extract the maximum parallelism from a dag. Being machine-independent, it is a more fundamental measure than those that depend on a particular architecture.
A simple example
We now illustrate a processor-time-minimal multiprocessor schedule for a simple family of dags: a triangular shaped directed mesh. This family represents the standard algorithm for solving a triangular system of linear equations. For this family of dags, the instance index (or size parameter) is n; the parameterized time-minimal schedule uses 2n − 1 steps; we will see that the parameterized processor-time-minimal multiprocessor schedule is a systolic array that uses ⌈n/2⌉ processing elements. Figure 1[a] depicts a dag of processes for solving a triangular system of linear equations by forward substitution. Cappello and Laub [3] note that the multiprocessor schedule, depicted in
Why is this map processor-time-minimal? Using Figure 1[b], we can visualize a partition of the node set into layers according to the step for which the node is scheduled. If we name these layers with their step, the names would be 1, 2, . . . , 11. Notice that every node has an arc to some node in the next higher layer (except the sink node, labeled '6 6' in layer 11): every node is on a longest path from the source to the sink. We now use this fact to argue that this layering is unique. Let v be some node depicted in Figure 1[b] that is in layer i (2 ≤ i ≤ 11). We cannot reschedule v for an earlier layer without violating Constraint 1. Similarly, if we reschedule v for a later layer, then we must either move all nodes that come later in v's path to later layers (violating time-minimality) or leave them in their present layers (violating Constraint 1). The layering thus is unique. In particular, consider layer 5, which consists of the nodes labeled '3 3', '2 4', and '1 5'. From the foregoing, we conclude that, in any time-minimal schedule, these nodes all must be scheduled for the same step.
By Constraint 2, these 3 nodes must be scheduled onto 3 distinct processors.
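The layering argument can be checked mechanically for small n. The following Python sketch is illustrative only: it assumes the triangular mesh has nodes (i, j) with 1 ≤ i ≤ j ≤ n, arcs (i, j) → (i, j + 1) and (i, j) → (i + 1, j), and that node (i, j) belongs to layer i + j − 1 (consistent with the labels '3 3', '2 4', '1 5' in layer 5 and '6 6' in layer 11). The function names are hypothetical.

```python
from collections import defaultdict

def triangular_layers(n):
    """Group nodes (i, j), 1 <= i <= j <= n, by their assumed step i + j - 1."""
    layers = defaultdict(list)
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            layers[i + j - 1].append((i, j))
    return dict(layers)

def successors(i, j, n):
    """Assumed arcs of the triangular mesh: one step right or one step down."""
    out = []
    if j < n:
        out.append((i, j + 1))
    if i + 1 <= j:
        out.append((i + 1, j))
    return out

def every_node_feeds_next_layer(n):
    """True when every node except the sink has an arc into the next layer."""
    return all(
        any(a + b - 1 == i + j for a, b in successors(i, j, n))
        for i in range(1, n + 1)
        for j in range(i, n + 1)
        if (i, j) != (n, n)
    )

layers = triangular_layers(6)
widest = max(len(nodes) for nodes in layers.values())
print(len(layers), widest)      # number of steps and widest layer for n = 6
print(sorted(layers[5]))        # the nodes of layer 5
```

Under these assumptions, the widest layer gives the processor lower bound for the triangular mesh, and the number of layers gives the number of steps.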
The example above is a very simple illustration of an optimal parameterized solution for an indexed family of problem instances.
2 A processor-time-minimal systolic array for the cubical mesh
We now consider a fundamental family of dags: the n × n × n directed mesh. Beyond matrix product, Ibarra and Palis [19] point out that the cubical mesh is the dependence dag of a variety of recurrences over 3 variables (e.g., finding the longest common subsequence among 3 strings). Other computations include L-U factorization, a 3-pass transitive closure [18], matrix triangulation, matrix inversion, and 2-dimensional tuple comparison [21]. This cubical mesh can be defined as follows.
G_n = (N, A), where
N = {(i, j, k) | 1 ≤ i, j, k ≤ n},
A = {[(i, j, k), (i', j', k')] | (i, j, k), (i', j', k') ∈ N, where exactly 1 of the following conditions holds:
1. i' = i + 1, j' = j, k' = k;
2. i' = i, j' = j + 1, k' = k;
3. i' = i, j' = j, k' = k + 1}.
Time-minimal schedule: 3n − 2 steps. It is clear that the longest directed path in this dag has 3n − 2 nodes.
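The 3n − 2 figure can be confirmed by a short dynamic program. This Python sketch assumes that each arc of the mesh increases exactly one of the coordinates i, j, k by 1; the function name is hypothetical.

```python
from itertools import product

def longest_path_depths(n):
    """Map each node (i, j, k) to the number of nodes on a longest path ending there."""
    depth = {}
    # Lexicographic order is a topological order here, because every
    # assumed arc increases exactly one coordinate by 1.
    for i, j, k in product(range(1, n + 1), repeat=3):
        preds = [(i - 1, j, k), (i, j - 1, k), (i, j, k - 1)]
        depth[(i, j, k)] = 1 + max(
            (depth[p] for p in preds if p in depth), default=0
        )
    return depth

depth = longest_path_depths(6)
print(max(depth.values()))   # nodes on a longest path of the 6 x 6 x 6 mesh
```

Under this assumption the depth of node (i, j, k) works out to i + j + k − 2, so the sink (n, n, n) has depth 3n − 2.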
The processor-time lower bound
Definition: Let G = (N, A) be a dag. We uniquely label each node v ∈ N with a number:
• i, when v is the ith node on a longest directed path in G;
• 0, otherwise.
This labeling partitions N into equivalence classes:
Q_i = {v ∈ N | v is the ith node on a longest directed path in G}.
We refer to each nonzero equivalence class as a concurrent set of nodes.
Using this definition, we state a simple but useful theorem.
Theorem 1: Let Q_i be a concurrent set of nodes of a dag G. Every time-minimal multiprocessor schedule for G uses at least |Q_i| processors.
Proof. Since Q_i is a concurrent set of nodes, it is a nonzero equivalence class of the labeling: every node in Q_i is the ith node on a longest directed path in G. Let v ∈ Q_i. By multiprocessor scheduling Constraint 1, v cannot be scheduled for processing before step i. Consider all nodes that come later in a longest path that goes through v. If we move v to a later layer, then all these nodes must also move to later layers (or else Constraint 1 is violated). But then the schedule is no longer time-minimal. Therefore, every node in Q_i must be scheduled for execution during step i in order for the schedule to be time-minimal. By Constraint 2, each node scheduled for step i must be processed by a distinct processor.
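The labeling in the definition above can be computed for any dag with two longest-path passes. The Python sketch below is one possible way to do it, not a procedure taken from the paper; the function name and the node/arc representation are assumptions.

```python
from collections import defaultdict

def concurrent_sets(nodes, arcs):
    """Return {i: set of nodes that are the ith node on some longest path}."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)

    # Kahn's algorithm: the list grows while it is being scanned.
    indeg = {v: len(pred[v]) for v in nodes}
    order = [v for v in nodes if indeg[v] == 0]
    for u in order:
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    down = {}  # nodes on a longest path ending at v
    for v in order:
        down[v] = 1 + max((down[u] for u in pred[v]), default=0)
    up = {}    # nodes on a longest path starting at v
    for v in reversed(order):
        up[v] = 1 + max((up[w] for w in succ[v]), default=0)

    longest = max(down.values())
    sets = defaultdict(set)
    for v in nodes:
        if down[v] + up[v] - 1 == longest:   # v lies on a longest path
            sets[down[v]].add(v)
    return dict(sets)

# Diamond dag: a -> b -> d and a -> c -> d.
print(concurrent_sets("abcd", [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```

The size of the largest set returned is then a lower bound on the number of processors in any time-minimal schedule, by the argument in the proof.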
So, the number of nodes in a concurrent set contained in the cubical mesh is a lower bound on the number of processors used in any time-minimal multiprocessor schedule for it. The cubical mesh, it turns out, contains a concurrent set of size ⌈(3/4)n^2⌉. Fig. 2 depicts a 6 × 6 × 6 mesh. We argue this lower bound as follows. Each node in this dag is on some longest path. Consequently, by the argument given in the proof of Theorem 1, the concurrent sets of this dag are unique. In Fig. 2 each node is labeled with its step in a time-minimal schedule. By inspection, we can see that a maximum concurrent set corresponds to the set of nodes scheduled for step 8. In general, this is the midpoint of the computation: step ⌈(3n − 2)/2⌉. The ceiling notation is used in case n is odd.
Since the dag of Fig. 2 contains 27 nodes scheduled for step 8, according to Theorem 1, we need at least 27 processors to schedule this dag for completion in 16 steps. General expressions for the number of processors needed for the midpoint step follow:
even n: (3/4)n^2
odd n: (3n^2 + 1)/4 = ⌈(3/4)n^2⌉
Thus, according to Theorem 1, we need at least ⌈(3/4)n^2⌉ processors to complete this computation dag in 3n − 2 steps.
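The size of the midpoint concurrent set can be tabulated directly for small n. This Python sketch assumes that node (i, j, k) is labeled with step i + j + k − 2, which agrees with a longest path of 3n − 2 nodes from (1, 1, 1) to (n, n, n); the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def layer_sizes(n):
    """Count the nodes of the n x n x n mesh at each assumed step i + j + k - 2."""
    return Counter(
        i + j + k - 2 for i, j, k in product(range(1, n + 1), repeat=3)
    )

def midpoint_step(n):
    """ceil((3n - 2) / 2) in integer arithmetic."""
    return (3 * n - 1) // 2

def processor_bound(n):
    """ceil(3 n^2 / 4) in integer arithmetic."""
    return (3 * n * n + 3) // 4

for n in range(1, 8):
    sizes = layer_sizes(n)
    print(n, midpoint_step(n), sizes[midpoint_step(n)], processor_bound(n))
```

For n = 6 the row reads 6, 8, 27, 27, matching the 27 nodes scheduled for step 8 in Fig. 2.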
The processor-time upper bound
The question we pursue now is whether there is a systolic array that achieves this lower bound on processors. To answer it, we introduce a more succinct representation of the cubical mesh. All the nodes in a vertical column of the mesh in Fig. 2 are represented by a single node (in Fig. 3) that is labeled with an interval of steps. These are the steps used by the n nodes in the vertical column.
The strategy for mapping the set of nodes onto the processor array is as follows. There is a processor for every node labeled with step 8 in Fig. 2. A distinct processor is assigned to each vertical column of nodes (in Fig. 2) that contains a node labeled with step 8 (in general, every column with a node labeled with step ⌈(3n − 2)/2⌉). These processors correspond to the circular nodes in Fig. 3. The processor array developed so far (i.e., the array of circular nodes in Fig. 3) is shaped hexagonally.
To complete the mapping, we need to assign the 9 remaining (square) columns in Fig. 3 for example, that the processor named d must communicate directly to the processor named c.
In general, these boundary processors communicate directly with the processors on the opposite (parallel) boundary. To bring these directly communicating boundary processors into proximity, we map the hexagonally shaped array onto the surface of a cylinder. One simply wraps the n × n mesh
This processor-time map generalizes to any even n. To visualize this map:
1. construct a transparency from Fig. 5;
2. wrap it into a cylinder, superimposing the appropriate nodes.
The cylindrical wrapping is the same when n is odd, except that the boundary connectivity is skewed slightly. This connectivity is illustrated, for n = 5, in Fig. 6. The 'transparency method' mentioned above is an especially useful way to visualize this skewed wrapping.
The step function is linear whereas the location function is not. The first component of the location function, π_1, gives rise to the cylindrical wrapping. The second component of the location function, π_2, decomposes into cases. These cases distinguish whether n is even (a cylindrical mesh) or odd (a skewed cylindrical mesh).
Three lemmas now are presented which, taken together, prove that this map results in a processor-time-minimal systolic array.
Lemma 1:
Applying map m to G n results in a valid multiprocessor schedule.
Proof. In order for a schedule to be valid, 2 constraints must be satisfied:
1. A node is computed only after its children have been computed:
The children of node (i, j, k) are (i − 1, j, k), (i, j − 1, k), and (i, j, k − 1). The schedule honors all precedences because τ(i, j, k) = i + j + k − 2 is exactly 1 greater than the step of each child.
2. No processor computes 2 different nodes during the same time step:
Since the spatial components of the map (i.e., π_1 and π_2) do not depend on k, a processor is the image of all of a k-column or none of it. The n nodes in k-column (i, j) are executed at n different steps, depending on the node's k value. The only case we need consider, then, is when a processor is the image of more than 1 column. Let column (i_1, j_1) and column (i_2, j_2) both be mapped to the same processor. Then
Lemma 2: Applying map m to G_n results in a processor-time-minimal multiprocessor schedule.
Proof.
Processor-time-minimality decomposes into 2 claims:
1. The schedule is time-minimal:
The node that maps to the least point in time is node (1, 1, 1). The node that maps to the greatest point in time is node (n, n, n). Their time coordinates are respectively τ(1, 1, 1) = 1 and τ(n, n, n) = 3n − 2. Since the number of steps used, 3n − 2, equals the number of nodes in a longest path, the schedule is time-minimal.
2. The schedule uses as few processors as any that is time-minimal:
We know that there are ⌈(3/4)n^2⌉ nodes that are labeled with step ⌈(3n − 2)/2⌉. We now show that every column of nodes that does not contain a node labeled with step ⌈(3n − 2)/2⌉ is mapped to a processor that also is the image of a column that does contain a node labeled with step ⌈(3n − 2)/2⌉. This enables us to place an upper bound of ⌈(3/4)n^2⌉ on the number of processors. If column (i, j) does not contain a node labeled with step ⌈(3n − 2)/2⌉, then either τ(i, j, n) < ⌈(3n − 2)/2⌉ or τ(i, j, 1) > ⌈(3n − 2)/2⌉.
We consider each of these cases:
This case decomposes into 2, depending on whether n is even or odd:
even n: We will show that:
(a) Column (i, j) maps to the same processor as column (i + (n/2), j + (n/2)).
(b) Column (i + (n/2), j + (n/2)) contains a node with time label (3n − 2)/2. This is sufficient because it means that column (i, j) maps to one of at most (3/4)n^2 processors.
First, we show part (a) by substituting directly into the definition of π_1 and π_2, the spatial components of the map.
We now show part (b): column (i + (n/2), j + (n/2)) must contain a node whose step
The first inequality is established as follows: Since τ (i, j, n) < (3n − 2)/2 , we have
For the second inequality,
odd n: We will show that:
(a) Column (i, j) maps to the same processor as column (i + n/2 , j + n/2 ).
(b) Column (i + n/2 , j + n/2 ) contains a node with time label (3n − 2)/2 . This is sufficient because it means that column (i, j) maps to a processor that already was allocated (see (a) above).
In order to use the definition of π 2 , we first establish that i + j < n/2 + 1. This
On the other hand,
The first inequality is clear; the second inequality holds since given τ (i, j, n) < (3n − 2)/2 , then i + j ≤ n/2 . Therefore,
Summarily,
We now show part (b): column (i + n/2 , j + n/2 ) must contain a node with time
The first inequality is established as follows: Again, since τ (i, j, n) < (3n − 2)/2 , we have that i + j ≤ n/2 . Therefore,
The details of this case are handled similarly.
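The column-sharing argument for even n can be exercised with a brute-force check. The Python sketch below is not the map m = (τ, π_1, π_2) itself: it names each processor by a representative column in the hexagonal band, pairs columns as in part (a) for even n, and assumes τ(i, j, k) = i + j + k − 2. It verifies only the processor count, the step count, and the absence of conflicts; it does not check locality of communication. The function names are hypothetical.

```python
from itertools import product

def fold(i, j, n):
    """Representative band column that shares a processor with column (i, j).

    Assumes n is even. Columns that finish before the midpoint step are
    shifted by +n/2 in both coordinates; columns that start after it are
    shifted by -n/2.
    """
    h = n // 2
    if i + j <= h:          # tau(i, j, n) is below the midpoint step
        return (i + h, j + h)
    if i + j > 3 * h:       # tau(i, j, 1) is above the midpoint step
        return (i - h, j - h)
    return (i, j)

def check(n):
    """Return (processors used, last step); raise if two nodes collide."""
    if n % 2 != 0:
        raise ValueError("this sketch covers even n only")
    busy = set()
    for i, j, k in product(range(1, n + 1), repeat=3):
        slot = (fold(i, j, n), i + j + k - 2)   # (processor, step)
        if slot in busy:
            raise AssertionError(f"conflict at {slot}")
        busy.add(slot)
    processors = {p for p, _ in busy}
    return len(processors), max(t for _, t in busy)

print(check(6))   # processors and steps for the 6 x 6 x 6 mesh
```

For n = 6 the check reports 27 processors and 16 steps, which agrees with the bounds derived for Fig. 2.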
We now show that the multiprocessor schedule defined by m results in a systolic array.
Lemma 3:
Applying map m to G n results in a multiprocessor schedule for a systolic array.
Proof.
We show that communication is local in both time and space. Direct communication is represented by arcs in the graph. We may assume that an arc from (i, j, k) to (i', j', k') is either an i arc or a j arc: i' = i + 1, or j' = j + 1, but not both. (Recall that if 2 nodes differ only in coordinate k, then they are mapped to the same processor.) For these 2 cases, we now show that the spatial components of the processor-time map, π_1 and π_2, preserve locality.
The difference in their first coordinate,
The difference in their second coordinate clearly is 1 when n is even. When n is odd, the map into the second coordinate, π_2, decomposes into 3 cases, depending on which interval the sum i + j falls into:
• n/2 + 1 > i + j;
• n/2 + 1 ≤ i + j ≤ 3n/2;
• 3n/2 < i + j.
When n is odd, the difference in the nodes' second coordinate depends on whether both i + j and i' + j' fall into the same interval (in which case their difference is 1), or they fall into different intervals (in which case the difference in the nodes' second coordinate is 0):
This case is similar to the one above. The difference in the nodes' first coordinate,
The difference in the nodes' second coordinate,
In Fig. 6 , the skewed wrap around connections (for odd n) correspond to the analysis above: an arc that wraps around also drops down 2 rows (i.e., the difference in rows between the destination of the arc and the source of the arc is −2).
From Lemmas 1-3, we have the following.
Theorem 2: Applying the map m to G n results in a processor-time-minimal systolic array.
Array Layout
Cylindrically connected meshes are directly implementable in current PCB technology. They also are easy to embed in more densely connected systems (e.g., in a hypercube, using a suitable Gray code). Processors along the left and bottom edges of the square array depicted in Fig. 3 receive the input matrices. In the cylindrical processor-time-minimal array, these left [bottom] processors form a line that is inscribed on the surface of the cylinder. For these processors, the input schedule is given by the rule: a(i, j) and b(j, i) are input during step i + j − 1. The output (i.e., the product matrix) is held in place by the processors.
There are several simple ways to embed this cylinder in the Euclidean plane: fold it along the line drawn in Fig. 7, resulting in a trapezoidally shaped array. When n is even, the trapezoid encompasses exactly 3n^2/4 processors. When n is odd, this fold results in an array that is similar (see Fig. 8). Connectivity differences occur along the right boundary processors. When n is even and n/2 is odd, there is another natural way to embed the hexagonally shaped, cylindrically connected mesh in the Euclidean plane. As Fig. 9 makes plain, the array can be folded into a rectangular array whose dimensions are n/2 × 3n/2.
In the embeddings illustrated, the 3n^2/4 processors are placed compactly, and routed with short wires. All of these organizations are feasible, for example, on the CHiP [46].
The cylindrically connected mesh presented here constructively disproves this conjecture (Melkemi and Tchuente may have meant to exclude cylindrically connected meshes).
Investigating fundamental algorithms with respect to processor-time-minimality may increase our basic understanding of their limits and potential for concurrent realization.
