The paper uses a directed acyclic graph (dag) model of algorithms. For a given dag, the paper focuses on processor-time minimal multiprocessor schedules: time-minimal multiprocessor schedules that use as few processors as possible. The Kung, Lo, and Lewis (KLL) algorithm for computing the transitive closure of a relation over a set of $n$ elements requires at least $5n - 4$ steps. As originally reported, their systolic array comprises $n^2$ processing elements. In this paper, it is first shown that any multiprocessor that achieves this $5n - 4$ time bound needs at least $\lceil n^2/3 \rceil$ processing elements. Then, a processor-time minimal systolic array realizing the KLL algorithm's dag is constructed. Its $\lceil n^2/3 \rceil$ processing elements are organized as a cylindrically connected 2D mesh when $n \equiv 0 \pmod 3$. When $n \not\equiv 0 \pmod 3$, the 2D mesh is connected as a twisted torus.
Introduction
There has been a great deal of work on minimizing schedules for systolic arrays, including the work pursued by Li and others. These efforts constrain the processor-time mapping to be a linear or affine transformation of the problem's index set. The first reason for this constraint is that it yields systolic arrays that are both intuitively appealing and practical to implement. The question nonetheless arises as to whether relaxing the linearity constraint results in an even more efficient use of time and processors. This question leads to the second reason that extant optimization efforts constrain the processor-time mapping to be linear or affine: the general problem of precedence-constrained scheduling onto a set of processors is NP-complete.
Definition: A systolic array for a dag is processor-time minimal when it uses as few processors as any systolic array that has a time-minimal schedule for the dag.
Although only one of many performance measures, processor-time minimality is useful because it tells us how many processing elements are needed (and sufficient) to extract the maximum amount of parallelism from an algorithm.
Transitive closure is a fundamental computation, justifying research into processor-time minimal multiprocessor schedules. Perhaps the best known parallel algorithm is by Guibas et al.
In this paper, we illustrate this machine-independent analysis using the KLL algorithm. We establish a precise lower bound of $\lceil n^2/3 \rceil$ processors for time-minimal completion ($5n - 4$ steps). For example, using the KLL algorithm to compute the transitive closure of a relation over a set of 30 elements requires at least 146 time steps. Any multiprocessor that achieves this time bound needs at least 300 processors. (The original realization of the KLL algorithm [14] would use 900 processors.) We then show that the lower bound is tight by constructing a systolic 2D mesh, which we prove realizes this processor-time lower bound. In fact, we construct 3 systolic 2D meshes, depending on whether $n$ is congruent to 0, 1, or 2 mod 3. When $n \equiv 0 \pmod 3$, the mesh is cylindrically connected. Otherwise, the mesh is connected as a twisted torus.
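The figures quoted for 30 elements follow directly from the three closed forms in the text. The short sketch below reproduces that arithmetic; the function name `kll_bounds` is a hypothetical helper introduced only for illustration.

```python
def kll_bounds(n):
    """Return (time steps, processor lower bound, processors in the original array).

    Uses only the closed forms stated in the text:
    5n - 4 steps, ceil(n^2 / 3) processors, and n^2 processors.
    """
    time_steps = 5 * n - 4
    min_processors = (n * n + 2) // 3   # integer form of ceil(n^2 / 3)
    original_processors = n * n
    return time_steps, min_processors, original_processors


print(kll_bounds(30))  # (146, 300, 900)
```

For n = 30 this gives 146 steps, 300 processors, and 900 processors, matching the example above.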
The systolic array of Benaini, Robert, and Tourancheau [1] also is time-minimal, but uses more processors: $\lceil n^2/2 \rceil$. Their systolic array, however, is connected as a (simple) torus.
For the transitive closure algorithm that we analyze, the use of a dag with unweighted nodes and arcs is realistic; for other algorithms, dags may need to be weighted to represent differences in the constituent computational and/or communication tasks.
A processor-time lower bound for the KLL dag
The KLL dependence dag for computing the transitive closure, illustrated in 
Time-minimal schedule: $5n - 4$ steps. The longest directed path in this dag has $5n - 4$ nodes.
Definition: Let $G = (N, A)$ be a dag. We label each node $u \in N$ with the number:
$i$, when $u$ is the $i$th node in some longest directed path;
0, otherwise.
This labeling partitions N . We refer to each nonzero equivalence class as a concurrent set of nodes.
Using this definition, we state without proof a simple but useful theorem.
Theorem 1: Let $G = (N, A)$ be a dag, $Q \subseteq N$ be a concurrent set of nodes, and $P$ be the set of processors implementing a time-minimal schedule. Then $|Q| \le |P|$.
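The labeling and the bound of Theorem 1 can be tried on a small dag. The sketch below is illustrative only: the function `concurrent_sets` and the 3 x 3 mesh example are assumptions introduced here, not the KLL dag. It labels a node with its position on a longest path by combining the longest path ending at the node with the longest path starting at it.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def concurrent_sets(nodes, arcs):
    """Return (length of a longest path in nodes, {label: set of nodes}).

    A node is labeled i when it is the i-th node of some longest directed
    path; nodes on no longest path get label 0 and are left out.
    """
    preds = {u: set() for u in nodes}
    succs = {u: set() for u in nodes}
    for u, v in arcs:
        preds[v].add(u)
        succs[u].add(v)
    order = list(TopologicalSorter(preds).static_order())
    depth = {}   # nodes on a longest path ending at u
    for u in order:
        depth[u] = 1 + max((depth[p] for p in preds[u]), default=0)
    height = {}  # nodes on a longest path starting at u
    for u in reversed(order):
        height[u] = 1 + max((height[s] for s in succs[u]), default=0)
    longest = max(depth.values())
    sets = defaultdict(set)
    for u in nodes:
        if depth[u] + height[u] - 1 == longest:  # u lies on some longest path
            sets[depth[u]].add(u)
    return longest, dict(sets)


# Hypothetical example: a 3 x 3 mesh dag with arcs to the right and downward.
nodes = [(i, j) for i in range(3) for j in range(3)]
arcs = [((i, j), (i + 1, j)) for i in range(2) for j in range(3)]
arcs += [((i, j), (i, j + 1)) for i in range(3) for j in range(2)]
longest, sets = concurrent_sets(nodes, arcs)
print(longest)                                 # 5
print(max(len(q) for q in sets.values()))      # 3
```

In this toy mesh the largest concurrent set is the middle anti-diagonal, so by Theorem 1 any schedule finishing in 5 steps needs at least 3 processors.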
Lemma 1: The $G_{tc}(n)$ mesh dag contains a concurrent set of size $\lceil n^2/3 \rceil$.
Proof: Each node in this dag is on some longest path. Fig. 1 depicts $G_{tc}(4)$, an augmented $4 \times 4 \times 4$ mesh. Each node is labeled with its time step in a time-minimal schedule. Viewed orthogonally along the $k$ axis, as in Fig. 2, it can be seen that any time step consists of diagonal bands, at most one band per $k$ plane, separated from one another by 3 units of $i$ and $j$ vectors. This occurs because each point in one $k$ plane is 3 units ahead of the corresponding point in the $k - 1$ plane.
Intuitively, the maximum number of points any particular time step can cover is $\lceil n^2/3 \rceil$, since approximately 1/3 of the $n \times n$ grid (of $i$-$j$ space) is covered. To derive the exact number, there are 3 cases to consider, depending on which of the 3 sets of diagonal lines is used. In each case, it is evident that the set that includes the main diagonal of the $i$-$j$ grid is the largest:
For the case where $n \equiv 0 \pmod 3$, $2 \left( \sum_{i=1}^{n/3} 3i \right) - n = n^2/3$.
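The counting step of this argument can be checked numerically. The sketch below assumes that the three sets of diagonal lines are the classes of grid points $(i, j)$ with equal $(i - j) \bmod 3$, so that class 0 contains the main diagonal; the function name `diagonal_set_sizes` is hypothetical. It checks only the count of grid points, not the timing of the dag itself.

```python
def diagonal_set_sizes(n):
    """Count the points of the n x n grid in each class of (i - j) mod 3."""
    sizes = [0, 0, 0]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            sizes[(i - j) % 3] += 1
    return sizes


for n in range(1, 31):
    sizes = diagonal_set_sizes(n)
    # The class containing the main diagonal is a largest one,
    # and its size is ceil(n^2 / 3).
    assert max(sizes) == sizes[0] == (n * n + 2) // 3
    if n % 3 == 0:
        # Closed form from the text: 2 * (3 + 6 + ... + n) - n = n^2 / 3.
        closed_form = 2 * sum(3 * i for i in range(1, n // 3 + 1)) - n
        assert closed_form == n * n // 3

print(diagonal_set_sizes(4))  # [6, 5, 5]
```

For n = 4 the class containing the main diagonal has 6 points, which equals $\lceil 16/3 \rceil$.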
A processor-time upper bound for the KLL dag
The question we now pursue is whether there is a systolic array that achieves this exact lower bound on processors. The strategy for mapping the set of nodes onto the processor array is as follows. We assign a distinct processor to each horizontal column of Fig. 1 ($j$, $k$ are fixed, $i$ varies for a particular row) that contains a node labeled with $t_c$, a particular processor-maximal time step. In Fig. 1, steps 7 and 10 are processor maximal. The particular time step we choose depends on whether $n$ is even or odd:
even $n$: $t(n, n, n)/2 - 1 = (5n - 6)/2$;
odd $n$: $\lceil t(n, n, n)/2 \rceil = (5n - 3)/2$.
This choice is made so that, whatever the value of $n \bmod 3$, the chosen step is processor maximal. The processor-maximal steps are:
$n \bmod 3 = 0$: every step in the interval $[2n - 3, 3n]$;
$n \bmod 3 = 2$: every third step in the interval $[2n - 2, 3n - 1]$;
$n \bmod 3 = 1$: every third step in the interval $[2n - 1, 3n - 2]$.
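The claim that the chosen step is always processor maximal can be checked mechanically. The sketch below is an illustration under stated assumptions: it takes the final time step to be $t(n, n, n) = 5n - 4$, reads all three intervals as closed, and starts "every third step" at the left endpoint. The function names are hypothetical.

```python
def chosen_step(n):
    """Chosen time step: t/2 - 1 for even n, ceil(t/2) for odd n, with t = 5n - 4."""
    t_last = 5 * n - 4
    if n % 2 == 0:
        return t_last // 2 - 1
    return (t_last + 1) // 2  # ceil(t_last / 2), since t_last is odd here


def processor_maximal_steps(n):
    """Processor-maximal steps, with the intervals read as closed."""
    r = n % 3
    if r == 0:
        return list(range(2 * n - 3, 3 * n + 1))
    if r == 2:
        return list(range(2 * n - 2, 3 * n, 3))
    return list(range(2 * n - 1, 3 * n - 1, 3))


for n in range(1, 201):
    assert chosen_step(n) in processor_maximal_steps(n)

print(chosen_step(4), processor_maximal_steps(4))  # 7 [7, 10]
print(chosen_step(12))                             # 27
```

The two printed cases agree with the values the text gives for n = 4 and n = 12.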
For $n = 4$, $t_c = 7$. Therefore, as the dag of Fig. 1 is collapsed along the $i$ axis, each horizontal column of the dag that has a node labeled 7 collapses onto a point that represents a real processor.
The collapsed graph for $n = 12$ is shown in Fig. 3, where the $k$ axis now runs horizontally and the $j$ axis vertically, and each collapsed point represents a column in $i$-space. Since $t_c = 27$, each column containing a time step of 27 is circled. To complete the mapping, we must assign the 96 (in general, $n^2 - \lceil n^2/3 \rceil$) remaining columns. The way this is done depends on $n$. There are three cases to consider, depending on the value of $n \bmod 3$.
Case n mod 3 = 0
The mapping is done with a simple mod function applied along the $k$ axis. There are $n/3$ real processors for each $k$-row. For a particular row, the remaining $2n/3$ columns map to the $n/3$ real processors such that $k_{\text{real}} \bmod (n/3) = k_{\text{remaining}} \bmod (n/3)$. Thus, each real processor handles 3 columns: its first column finishes execution just before its second column begins execution, and its second column finishes execution just before its third column begins execution (i.e., scheduling constraints are met). For the example in Fig. 3, we use a mod $12/3$ function so that each of the remaining $2 \cdot 12/3$ columns per row is mapped to a processor. The connectivity implied by this mapping requires, for example, that the processor assigned to column A communicate directly with the processor assigned to column D. To realize these boundary connections, we map the array of Fig. 3 onto the surface of a cylinder.
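The column-to-processor grouping described above can be sketched in a few lines. This is an illustration only, under the assumption that a $k$-row is the set of columns with a fixed $j$; it ignores the offset and the timing, and the function name `columns_per_processor` is hypothetical.

```python
from collections import defaultdict


def columns_per_processor(n):
    """Group the n*n columns (j, k) by (row j, k mod n/3).

    Only the case n mod 3 = 0 is covered by this sketch.
    """
    if n % 3 != 0:
        raise ValueError("this sketch covers only the case n mod 3 = 0")
    m = n // 3
    groups = defaultdict(list)
    for j in range(1, n + 1):
        for k in range(1, n + 1):
            groups[(j, k % m)].append((j, k))
    return groups


groups = columns_per_processor(12)
print(len(groups))                              # 48
print({len(cols) for cols in groups.values()})  # {3}
```

For n = 12 this yields 48 groups of 3 columns each, consistent with 48 real processors and 96 remaining columns.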
International Conference on Application Specific Array Processors
The map $m_0 : N \mapsto \mathbb{Z}^3$ (i.e., from nodes to space-time) can be defined formally as follows.
For this mapping, $t(i, j, k)$ is found by examining the dag of Fig. 1. It is the earliest time a particular node can be processed. $s_1(j, k)$ is the mod $n/3$ function, with an offset to assure that the first processor of the top row is located in the middle
We show that a node is computed only after its children have been computed, and that no processor computes 2 different nodes during the same time step.
Lemma 1.2: Applying map $m_0$ to graph $G_{tc}(n)$, where $n \bmod 3 = 0$, results in a multiprocessor that is processor-time minimal.
We show that the schedule uses $5n - 4$ time steps and exactly $n^2/3$ processors.
Lemma 1.3: Applying map $m_0$ to graph $G_{tc}(n)$, where $n \bmod 3 = 0$, results in a systolic array.
Case n mod 3 = 2
This mapping is much more complicated than the case when $n \equiv 0 \pmod 3$. In Fig. 4, the column at position $(j, k)$ finishes just before the column at position $(j - 1, k + 5)$ starts ($\lceil n/3 \rceil = 5$). Our goal is to map these columns to the same processor. If the column is on the top row, it wraps around to the bottom row, row $n$ (in Fig. 4, to position $(14, k + 1)$). On the wraparound, the first column finishes 3 time units before the next column begins. These columns also are mapped to the same processor. We achieve local communication by mapping the array onto the surface of a twisted torus. Each of the circled processors handles 3 or 4 of the $n^2$ columns of Fig. 4.
We now formally define the map. This is done in 3 stages. We first change basis, transforming coordinate $(j, k)$ to $(\alpha', \beta')$. Next Fig. 4.
For example, in Fig. 4, column $(j, k) = (1, 7)$ is mapped to $(\alpha, \beta) = (0, 7)$, and then mapped back to $(1, 7)$. Column $(2, 2)$ is mapped to $(0, 2)$ and then mapped back to $(1, 7)$.
The map

