Processor arrays are frequently used to deliver high performance in applications with computationally intensive operations. This paper presents the General Parameter Method (GPM), a systematic parameter-based approach for synthesizing such algorithm-specific architectures. GPM can synthesize processor arrays of any lower dimension from a uniform-recurrence description of the algorithm. The design objective is a general non-linear and non-monotonic user-specified function, and depends on attributes such as computation time of the recurrence on the processor array, completion time, load time, and drain time. In addition, bounds on some or all of these attributes can be specified. GPM performs an efficient search of polynomial complexity to find the optimal design satisfying the user-specified design constraints. As an illustration, we show how GPM can be used to find linear processor arrays for the problem of finding transitive closure. We consider design objectives that minimize computation time, or processor count, or completion time (including load and drain times), and user-specified constraints on the number of processing elements and/or computation/completion times. We show that GPM can be used to obtain optimal designs that trade between number of processing elements and completion time, thereby allowing the designer to choose a design that best meets the specified design objectives. We also show the equivalence between the model assumed in GPM and that in the popular dependence-based methods [4, 5]. Consequently, GPM can be used to find optimal designs for both models.
Introduction
Many applications of digital signal processing, scientific computing, medical imaging, digital communications, and control are characterized by repeated execution of a small number of computationally intensive operations. In order to meet performance requirements of these applications, it is often necessary to dedicate hardware with parallel processing capabilities to these specialized operations. Processor arrays (or systolic arrays), due to their structural regularity and consequent suitability for VLSI implementation, are frequently used for this purpose. This paper discusses systematic ways of mapping these algorithms into specialized processor arrays.
The fundamental concept behind a processor-array architecture is that the von Neumann bottleneck is greatly alleviated by repeated use of a fetched data item in a physically distributed array of processing elements [6]. The regularity of these arrays leads to inexpensive and dense VLSI implementations, which imply high performance and low cost. Application-specific processor arrays fit naturally into the concept of a hardware library, where functional units are in relation to the host computer as subroutines from a software library are to production code.
Initial designs of processor arrays were ad hoc, and relied heavily on designers' skill and intuition. Since every algorithm needs a specialized design customized to its communication patterns, a systematic technique for generating processor arrays from the algorithm description is necessary. Therefore, a great deal of effort has been devoted by numerous researchers to generating processor arrays systematically. An overview of the different methods can be found in reference [7].
The techniques discussed here apply to algorithms described as recurrences, either by mathematical expressions or by high-level-language programs. Section 1.1 provides a precise characterization of the class of algorithms for which our results are valid. The techniques are illustrated by examples involving linear arrays of processors (1-dimensional processor arrays); however, unless otherwise stated, the results can be extended to processor arrays of arbitrary dimensions. We choose to study linear arrays because they are easier to build and program than arrays of higher dimension.
The general notation used in this paper is as follows. Vectors are in lower case with arrows on top, and matrices are in upper-case bold font. The transpose of vector $\vec{v}$ and matrix $M$ are denoted by $\vec{v}^t$ and $M^t$, respectively. The absolute value of vector $\vec{v}$ is denoted by $|\vec{v}|$, and the notation $\vec{v} \ge \vec{u}$ means that every component of $\vec{v}$ is greater than or equal to the corresponding component of $\vec{u}$. Vector $\vec{0}$ denotes a row or column vector whose entries are all zeroes. The dimensions of vector $\vec{0}$, and whether it denotes a row or column vector, are implied by the context in which it is used. The scalar product of two vectors $\vec{v}_1$ and $\vec{v}_2$, and the product of a vector $\vec{v}$ and matrix $M$, are written without transposes as $\vec{v}_1 \cdot \vec{v}_2$ and $\vec{v} \cdot M$ or $M \cdot \vec{v}$, respectively. The product of two matrices $M_1$ and $M_2$, and the product of a scalar $s$ and a vector $\vec{v}$, are simply written as $M_1 M_2$ and $s\vec{v}$ without any dot symbol.
Algorithm Model
Affine dependence algorithms can be used to model a large number of computation-intensive applications in image processing, digital signal processing, and other scientific applications. Such algorithms can be described as nested DO loops as follows.
    DO (j_1 = l_1, u_1; j_2 = l_2, u_2; ...; j_n = l_n, u_n)
        H_1(J);
        H_2(J);
        ...
        H_t(J);
    END
The column vector $\vec{J} = (j_1, j_2, \ldots, j_n)^t$ is the index vector (or index point). A dependence between two iterations is described by a dependence vector, which is the vector difference of the index vectors of these two iterations. The dependencies in the algorithm can be shown by a dependence graph (DG) over an $n$-dimensional ($n$-D) domain integer lattice, where nodes are labeled by index vectors corresponding to the operations in the innermost loop body, and arcs correspond to the loop-carried dependencies between two instances of the loop body. Hence, the loop body for scheduling is the set of statements in loop nests enclosing all the branch statements.
Uniform dependence algorithms, or uniform recurrence equations (UREs), form a sub-class of affine recurrence equations (AREs). There exist "uniformization" techniques for transforming AREs to UREs; see, for example, reference [8]. The basic idea is to select a few basic integral vectors (which are the uniform dependencies) such that all affine dependencies of the ARE can be expressed as non-negative integer linear combinations of the basis vectors. This uniformization also removes the undesirable broadcasts of data in a VLSI processor array.
In this paper, we focus on algorithms that can be modeled as uniform recurrences and affine recurrences that can be uniformized. Hence, the starting point of our mapping assumes a convex polyhedral domain and a set of constant dependence vectors collected into a matrix called the dependence matrix $D$.

$$C(i,j,k) = C(i,j,k-1) + A(i,k)\,B(k,j), \qquad 1 \le i, j, k \le N. \qquad (3)$$
The index set consists of all the integer points within a cube of side $N$. Input $A(i,k)$ (resp., $B(k,j)$) is used in several computations to generate $C(i,j,k)$ for all values of $j$ (resp., $i$), and the recurrence can be written as

$$A(i,j,k) = A(i,0,k), \qquad B(i,j,k) = B(0,j,k), \qquad C(i,j,k) = C(i,j,k-1) + A(i,j,k)\,B(i,j,k), \qquad (4)$$

where $A(i,0,k) = A(i,k)$ and $B(0,j,k) = B(k,j)$. The affine dependencies are $(0, j, 0)^t$ and $(i, 0, 0)^t$. This example is used as a running example throughout this paper.
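The running example can be checked with a short sketch. The first function below evaluates recurrence (3) directly. The second uses a pipelined (uniformized) form in which A and B are passed along unit dependence vectors instead of being broadcast; this particular uniformized form is the usual textbook choice and is an assumption here, not something the text states. All function names are illustrative.

```python
def matmul_recurrence(A, B):
    """Evaluate C(i,j,k) = C(i,j,k-1) + A(i,k) * B(k,j), with C(i,j,0) = 0."""
    n = len(A)
    C = [[0] * n for _ in range(n)]  # C[i][j] holds C(i,j,k) for the latest k
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] = C[i][j] + A[i][k] * B[k][j]
    return C


def matmul_uniformized(A, B):
    """Same result, but A and B are pipelined (assumed uniformized form).

    a(i,j,k) = a(i,j-1,k), a(i,0,k) = A(i,k)   -> dependence (0,1,0)^t
    b(i,j,k) = b(i-1,j,k), b(0,j,k) = B(k,j)   -> dependence (1,0,0)^t
    c(i,j,k) = c(i,j,k-1) + a(i,j,k)*b(i,j,k)  -> dependence (0,0,1)^t
    """
    n = len(A)
    a, b, c = {}, {}, {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                a[i, j, k] = A[i - 1][k - 1] if j == 1 else a[i, j - 1, k]
                b[i, j, k] = B[k - 1][j - 1] if i == 1 else b[i - 1, j, k]
                prev = 0 if k == 1 else c[i, j, k - 1]
                c[i, j, k] = prev + a[i, j, k] * b[i, j, k]
    return [[c[i, j, n] for j in range(1, n + 1)] for i in range(1, n + 1)]
```

Under this assumed uniformization every dependence is a constant unit vector, so the dependence matrix is the 3 x 3 identity.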
Previous Work
There has been a lot of research in developing design methods to map uniform dependence algorithms to processor arrays. Most of these methods are based on or derived from the dependency method (DM) [4, 5]. In DM, the problem of mapping an algorithm to a processor array is characterized by a linear mapping matrix $T = \begin{bmatrix} \vec{\Pi} \\ S \end{bmatrix}$, where $\vec{\Pi}$ is the schedule vector and $S$ is the allocation matrix. The design of the array is then equivalent to determining the elements of $T$. This general representation of a feasible design as a particular mapping matrix allows DM to be applied to uniform as well as non-uniform recurrences. However, in DM, the generality in representation leads to large search spaces for optimal designs, as the problem of finding optimal designs is posed as an integer programming problem [9, 10]. In contrast, the method presented in this paper, the General Parameter Method (GPM), is restricted to uniform recurrences, but can be used to generate optimal designs for user-specified objectives (including non-monotonic and non-linear ones) using efficient search techniques of polynomial complexity.
There have been several earlier attempts to map algorithms onto lower-dimensional arrays [10, 11, 12]. Important steps towards a formal solution were first made by Lee and Kedem [11]. They presented the concept of data-link collisions (two data tokens contending for the same link simultaneously) and conditions to avoid them. They also presented a method that analyzes all computations in the domain of the recurrence in order to detect computational conflicts (two computations scheduled to execute simultaneously in the same processor). To identify feasible designs, they provided necessary and sufficient conditions for designs that avoid computational and data-link conflicts. However, they did not present any systematic procedure for finding optimal designs. Subsequently, Shang and Fortes [9] developed closed-form conditions for a mapping to be free of computational conflicts. These closed-form conditions also eliminate data-link conflicts for active data³ participating in the computations.
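To make the notion of a computational conflict concrete, the following sketch enumerates every index point of a cube of side N and reports pairs that a linear mapping sends to the same processor at the same time. It is a brute-force check for illustration only, not the closed-form conditions of Shang and Fortes; the function names and the sample schedule and allocation values are assumptions.

```python
from itertools import product


def dot(u, v):
    """Inner product of two equal-length integer sequences."""
    return sum(x * y for x, y in zip(u, v))


def computational_conflicts(schedule, allocation, n):
    """Return pairs of index points mapped to the same (time, processor).

    schedule   -- schedule vector, one entry per index dimension
    allocation -- allocation matrix as a list of rows
    n          -- side of the cubic index set, points range over 1..n
    """
    seen = {}
    conflicts = []
    for point in product(range(1, n + 1), repeat=len(schedule)):
        key = (dot(schedule, point),
               tuple(dot(row, point) for row in allocation))
        if key in seen:
            conflicts.append((seen[key], point))
        else:
            seen[key] = point
    return conflicts
```

For a 3-D index set, the schedule (1, 1, 1) with the 2-D allocation rows (1, 0, 0) and (0, 1, 0) yields no conflicts, whereas the same schedule with the single allocation row (1, 1, 0) maps points such as (1, 2, 1) and (2, 1, 1) to the same processor at the same time.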
In general, in DM, feasible designs are found heuristically by first specifying a "good" allocation matrix $S$, and then determining the schedule vector $\vec{\Pi}$ that minimizes the computation time. Note that the number of choices for matrix $S$ could be very large or even infinite, making it difficult or impossible to enumerate over them.
Initial work on parameter-based methods was done by Li and Wah [13] for a restricted set of uniform recurrences. They considered specifically 3-D and 2-D recurrences and mapped them to 2-D and 1-D processor arrays, respectively. The structure of the recurrence was such that the dependence vectors were unit vectors and the dependency matrix was an identity matrix. This paper generalizes the above initial work into a powerful and efficient array-synthesis technique called the General Parameter Method (GPM) by making three important and non-trivial extensions.
(a) We consider the recurrence model as a general $n$-D recurrence with arbitrary constant dependence vectors instead of a specific 3-D one. The target processor arrays are also allowed to be of any lower dimension $m$, where $1 \le m < n$. We provide new necessary conditions to guarantee the correctness of systolic processing in mapping high-dimensional recurrences to lower-dimensional processor arrays. These conditions define a search space polynomial in complexity with respect to the size of the recurrence to be mapped. In contrast, previous methods for finding optimal designs are based on integer linear programming with a search space of exponential complexity.
(b) We extend our search method to handle general non-linear objectives that may vary non-monotonically with the parameters, and introduce new pruning strategies to prune suboptimal designs in the search space so that optimal designs can be found efficiently. We show (i) optimal designs that include load and drain times in the objective (which introduce non-linearity in the objective function and constraints), and (ii) optimal designs with constraints on the number of allowable processing elements and/or completion time. Such designs cannot be found by previous methods.
(c) We show the equivalence between DM and GPM by providing necessary equations to transform parameters used in DM to those used in GPM, and vice versa. DM can be considered as a mapping problem in the Cartesian coordinate system with unit vectors as basis vectors, whereas GPM can be considered as mapping in a possibly non-orthogonal coordinate system with dependence vectors as basis vectors. The equivalence allows designers familiar with DM to utilize the efficiency of GPM to find optimal designs.
The potential simplicity of GPM over DM described in (c) is explained by observing that in mapping an $n$-D algorithm to an $m$-D processor array, the number of variables to be determined in DM is $(m+1)n$, whereas the number of parameters in GPM is $(m+1)g$, where $g = \mathrm{rank}(D)$.
³ A data token has an active phase, where it is involved in its chain of computations, and a passive phase, where the token is moving from the input peripheral processor to become active, or is moving to an output peripheral processor after its active phase.

Figure 1: Application of GPM to find optimal designs in DM.
Since $g \le n$, as $D$ is an $n \times r$ matrix, the number of variables in GPM is often less than that in DM, and is at worst equal to the number of variables in DM. Hence, there is a potential reduction in complexity by performing the transformation, especially if there are only a few dependence vectors in a high-dimensional space.
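The variable counts above can be reproduced with a small sketch that computes the rank of a dependence matrix by exact Gaussian elimination and then evaluates (m+1)n and (m+1)g. The function names are illustrative, and the second dependence matrix used in the usage note is a made-up example, not one from this paper.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of an integer matrix (list of rows) by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def variable_counts(D, m):
    """Return ((m+1)*n, (m+1)*g) for an n x r dependence matrix D."""
    n = len(D)   # rows of D = dimension of the recurrence
    g = rank(D)  # g = rank(D) <= n
    return (m + 1) * n, (m + 1) * g
```

For the five transitive-closure dependence vectors used later in the paper (n = 3, rank 3) and a linear array (m = 1), both counts are 6. For a hypothetical 4-D recurrence with only two independent dependence vectors, the counts are 8 for DM and 4 for GPM.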
Our transformation between GPM and DM extends the work of O'Keefe, Fortes and Wah [14], who showed the equivalence between DM and GPM for 2-D and 3-D uniform recurrences. Our transformation also allows efficient search strategies developed in GPM to be used to find optimal designs in DM. Consequently, designers familiar with DM can obtain better or optimal array designs using GPM. Referring to Figure 1, after defining the objective (possibly non-linear and non-monotonic) in terms of the representation chosen (i.e., $\vec{\Pi}$ and $S$), the designer converts the objective in terms of the parameters of GPM using the equivalence given in Eq's 10 and 12, to be discussed in the next section. Once the objective and variables have been converted, GPM is used to generate optimal arrays efficiently. The solutions obtained by GPM are then converted to $\vec{\Pi}$ and $S$ in DM using Eq's 10 and 12 again. This step involves solving two sets of simultaneous equations for $\vec{\Pi}$ and $S$ from the periods and displacements in GPM, and has a worst-case complexity of $O(n^3)$.
The next three sections describe the parameters used in GPM, the constraints that must be satisfied for correct operation, the specification of the objective function, and the search strategy. We assume that processing elements are equally spaced in $m$ dimensions with unit distance between directly connected processing elements, and that buffers between directly connected processing elements, if any, are equally spaced along the link.
General Parameter Method: Parameters
The intuition behind GPM is as follows. It is known that the semantics of processor arrays can be formally described by uniform recurrence equations; i.e., processor arrays are "isomorphic" to uniform recurrences. This implies that as long as the computations defined by the UREs are well-formed, there is a direct mapping from the recurrence to the processor array. In fact, this mapping is equivalent to a linear transformation of the index set. Hence, for a linear mapping, the time (resp., the distance) is constant between execution of any two points $\vec{I}_1$ and $\vec{I}_2$ in the index set separated by a dependence vector $\vec{d}$, where $\vec{I}_1 = \vec{I}_2 + \vec{d}$. This constant is equal to $\vec{\Pi} \cdot \vec{d}$ (resp., $S\vec{d}$), independent of index points $\vec{I}_1$ and $\vec{I}_2$. For recurrences with uniform indexing functions (i.e., UREs and uniformized AREs), the dependences are constant vectors and homogeneous (i.e., the set of dependence vectors at any point in the index set is the same as at any other point in the index set). Thus, the computation of the recurrence on the processor array is periodic in time and space along dependence vectors in the index space. This periodicity is succinctly captured and exploited in GPM, which considers the mapping problem in a possibly non-orthogonal coordinate system with dependence vectors as basis vectors. In other words, GPM uses a representation that captures the above periodicity, which allows the optimal target array to be found efficiently.
In GPM, the characterization of the behavior, correctness, and performance of a processor array is defined in terms of a set of scalar and vector parameters. When a uniform recurrence is executed on a processor array, the computations are periodic and equally spaced in the processor array. GPM captures this periodicity by a minimal set of parameters defined as follows.
Parameter 1: Periods. These capture the time between execution of the source and sink index points of a dependence vector. Suppose the time at which an index point $\vec{I}$ (defined for the uniform recurrence equation) is executed is given by function $\tau_c(\vec{I})$; the period of computation $t_j$ along dependence vector $\vec{d}_j$ is defined as

$$t_j = \tau_c(\vec{I} + \vec{d}_j) - \tau_c(\vec{I}), \qquad j = 1, 2, \ldots, r. \qquad (9)$$

The number of periods defined is equal to $r$, the number of dependencies in the algorithm. In terms of DM, period $t_j$ is related to $\vec{\Pi}$, the schedule vector in DM, by the following equation [3]:

$$t_j = \vec{\Pi} \cdot \vec{d}_j. \qquad (10)$$

Parameter 2: Velocity. $\vec{V}_j$, the velocity of a datum along dependence vector $\vec{d}_j$, $j = 1, 2, \ldots, r$, is defined as the directional distance passed during a clock cycle. Since PEs are at unit distance from their neighbors, and buffers (if present) must be equally spaced between PEs, the magnitude of the velocity must be a rational number of the form $x/y$, where $x$ and $y$ are integers and $x \le y$ (to prevent broadcasting). This implies that in $y$ clock cycles, a datum propagates through $x$ PEs and $y - x$ buffers. All tokens of the same variable have the same velocity (both in speed and direction), which is constant during the execution in the processor array. The total number of velocity parameters is $r$ (one for each dependence vector), with each velocity an $m$-element vector, where $m$ is the dimension of the processor array. Hence, velocity $\vec{V}_j$ is given by

$$\vec{V}_j = \frac{\vec{k}_j}{t_j}, \qquad j = 1, 2, \ldots, r, \qquad (11)$$

where $\vec{k}_j$ is the vector distance between the execution locations of the source and sink index points of $\vec{d}_j$. In the notation of DM, $S$, the allocation matrix, is related to $\vec{k}_j$ and $\vec{d}_j$ as follows:

$$\vec{k}_j = S\,\vec{d}_j. \qquad (12)$$
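The relations between DM and GPM quantities can be illustrated with a short sketch that converts a schedule vector and allocation matrix into periods, displacements, and velocities. It assumes the period is the schedule vector applied to the dependence vector (Eq. 10), the displacement is the allocation matrix applied to the dependence vector (the constant distance described in the intuition above), and the velocity is displacement divided by period (Eq. 11). The function name and the sample values are illustrative only.

```python
from fractions import Fraction


def dot(u, v):
    """Inner product of two equal-length integer sequences."""
    return sum(x * y for x, y in zip(u, v))


def gpm_parameters(schedule, allocation, dependences):
    """Return a list of (period, displacement, velocity), one per dependence.

    schedule    -- schedule vector of DM
    allocation  -- allocation matrix of DM as a list of rows (m rows)
    dependences -- list of dependence vectors d_j
    """
    params = []
    for d in dependences:
        t = dot(schedule, d)                     # period t_j
        if t < 1:
            raise ValueError("precedence violated: period must be >= 1")
        k = [dot(row, d) for row in allocation]  # displacement k_j
        v = [Fraction(x, t) for x in k]          # velocity V_j = k_j / t_j
        params.append((t, k, v))
    return params
```

With the schedule (2, 1, 3), the single allocation row (1, 0, -1), and unit dependence vectors, the periods are 2, 1, 3, the displacements are 1, 0, -1, and the velocities are 1/2, 0, -1/3; the zero velocity corresponds to a stationary variable.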
General Parameter Method: Constraint Equations
In Section 2, a set of $r^2 + r$ parameters was introduced to define a mapping on the target processor array. Assignment of values to the parameters defines a specific processor array with a particular number of processors, buffers, and data-input pattern. It is also easy to see that all processor arrays that solve a given algorithm (or uniform recurrence) correspond to some assignment of values to the parameters. Hence, choosing different values for these parameters leads to different array configurations with different performance, and the problem of array design has been reduced to that of choosing appropriate parameter values. The values of the $r^2 + r$ parameters are not independent of each other. In this section, constraint equations relating the parameters are given such that the set of values for the parameters is meaningful and defines a valid processor array. Theorems 1 and 2 provide the fundamental space-time relationship that must be satisfied by the parameters for correct systolic processing. Computational and data-link conflicts are avoided by enforcing the condition in Theorem 3.
The following notation is introduced to simplify the presentation of the theorems. Let $\vec{T} = (t_1, t_2, \ldots, t_r)^t$ be the vector of periods.
Constraints for Correct Systolic Processing of URE
The following theorem relates the parameters defined in GPM in the necessary conditions for correct systolic processing.
Theorem 1. The parameters (velocities, spacings, and periods) must satisfy the following constraint equations for correct systolic processing of the uniform recurrence equation:

$$\vec{V}_i t_i = \vec{V}_j t_i + \vec{S}_{j,i}, \qquad i, j = 1, 2, \ldots, r. \qquad (14)$$

Proof. See Appendix A.1.
These constraints ensure that in computing an index point $\vec{I}$ at any processor in the array, all the participating data tokens are present at the processor at the same time, moving from their respective processors where they were used earlier. A total of $r^2$ vector constraints are obtained from Theorem 1.
Constraints for Linearly Dependent Dependence Vectors
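Let $S = \langle \vec{S}_{i,j} \rangle$, $i, j = 1, 2, \ldots, r$, be an $r \times r$ "matrix" (actually, a matrix of vectors) of spacings such that the $(i,j)$-th element of the matrix is $\vec{S}_{i,j}$. Note by definition that $\vec{S}_{i,i} = \vec{0}$. Let $S_i$ be the $i$-th "row" of $S$; i.e., $S_i = \langle \vec{S}_{i,1}\ \vec{S}_{i,2}\ \cdots\ \vec{S}_{i,r} \rangle$, where $S_i$ is an $m \times r$ matrix. Since $\vec{S}_{i,j} = \vec{V}_j t_j - \vec{V}_i t_j = \vec{k}_j - \vec{V}_i t_j$ from Theorem 1, $S_i$ can be written in matrix form as

$$S_i = K - \vec{V}_i \otimes \vec{T},$$

where $K = [\vec{k}_1\ \vec{k}_2\ \cdots\ \vec{k}_r]$ is the $m \times r$ matrix of displacements, $\vec{T}$ is a vector composed of periods, and $\otimes$ is the outer product (or tensor product); i.e., $\vec{a} \otimes \vec{b} = \vec{a}\,\vec{b}^t = [a_i b_j]$.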
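As a worked illustration for a linear array, where every parameter is a scalar, the sketch below builds the full spacing matrix from periods and displacements using the element-wise relation that follows from Theorem 1, namely that the (i, j)-th spacing equals k_j minus V_i t_j with V_i = k_i / t_i. The function name and sample values are illustrative assumptions.

```python
from fractions import Fraction


def spacing_matrix(t, k):
    """Spacings for a linear array: S[i][j] = k[j] - V[i] * t[j].

    t -- list of periods (positive integers)
    k -- list of scalar displacements (integers)
    """
    r = len(t)
    v = [Fraction(k[i], t[i]) for i in range(r)]  # velocities V_i = k_i / t_i
    return [[k[j] - v[i] * t[j] for j in range(r)] for i in range(r)]


def satisfies_theorem_1(t, k, s):
    """Check V_i t_i = V_j t_i + S[j][i] for all i, j (scalar case)."""
    r = len(t)
    v = [Fraction(k[i], t[i]) for i in range(r)]
    return all(v[i] * t[i] == v[j] * t[i] + s[j][i]
               for i in range(r) for j in range(r))
```

The diagonal of the result is always zero, since k_i - (k_i / t_i) t_i = 0, which matches the definition that the spacing of a variable with respect to itself vanishes.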
The next theorem characterizes the constraints on the periods and displacements if the dependence vectors in the recurrence are not linearly independent.
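Let $g$ be the rank of dependency matrix $D$. Since $D$ has $r$ columns, a basis of its null space has $r - g$ vectors. Let $N = [\vec{\alpha}_1\ \vec{\alpha}_2\ \cdots\ \vec{\alpha}_{r-g}]$ be an $r \times (r-g)$ matrix, where $\vec{\alpha}_i$, $i = 1, 2, \ldots, r-g$, are basis vectors of the null space of $D$.

Theorem 2, therefore, provides a total of $2(r-g)$ constraints: $r - g$ scalar constraints and $r - g$ vector constraints.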
The following corollary shows the constraints on spacings that follow from Theorem 2. In fact, these constraints can be shown to be equivalent to those in Theorem 2. The implication of this corollary is that, of the $r$ spacing parameters for each variable, only $g - 1$ of them are independent, one of them is zero, and the rest can be expressed as linear combinations of the $g - 1$ independent ones. In this example, there are a total of 27 vector constraints and 2 scalar constraints.
To summarize, a total of $r^2 + r$ vector parameters and $r$ scalar parameters have been defined whose values have to be determined. Theorems 1 and 2 give a total of $r^2 + r - g$ vector constraints and $r - g$ scalar constraints. Hence, $g$ of the scalar parameters (periods) and $g$ of the vector parameters have to be chosen such that the other $r - g$ scalar parameters and $r^2 + r - g$ vector parameter values can be determined from the chosen scalar and vector parameters. Since the performance of the design can naturally be expressed in terms of the periods and displacements, our strategy is to choose the $g$ periods and $g$ displacements to optimize a given performance criterion.
The remaining $r - g$ periods, $r - g$ displacements, and all of the spacings can be determined from Theorems 1 and 2. All the vector equations are solved in $m$-D space in order to obtain $m$-D vector parameters.
Constraints to Govern Valid Space-Time Mappings
The validity of a space-time mapping is governed by the following fundamental necessary and sufficient conditions. Having established the parameters and the basic relationship among them in Theorems 1 and 2, we show how the fundamental conditions for valid space-time mappings are satisfied in GPM.
By definition, periods denote the time difference between the source and sink of dependencies. Hence, the precedence constraint is satisfied by simply enforcing $t_i \ge 1$, $i = 1, \ldots, r$. In the array model, all tokens of the same variable move with the same velocity. Hence, data-link conflicts can exist if and only if two tokens of a variable are input at the same time into the same processor and travel together contending for links. This condition is called a data-input conflict in GPM, as two data tokens are in the same physical location and conflict with each other as they move through the processors together.
It is important to note that in GPM, computational conflicts can exist if and only if data-input conflicts occur. This can be seen by the following simple argument. If two index points are evaluated in the same processor at the same time, then for each variable, at least two distinct tokens exist together in the same processor. Hence, if there is at least one non-stationary variable, then there are data-input conflicts for the tokens of that variable. Otherwise, all variables are stationary, and the entire computation is executed in one processor; i.e., there is no processor array. Hence, by enforcing that no data-input conflicts exist, both computational and data-link conflicts are avoided. Theorem 3 below presents conditions under which data-input conflicts can be eliminated.
Constraints in Preloaded Data
If the velocity of a variable is zero, then the data corresponding to the variable have to be preloaded in the processors before computation begins. This problem involves designing a schedule that can overlap as much as possible the preloading of data with the systolic computations without delaying these computations. A general approach is to decide when a particular stationary datum needs to be used in its first computation, and to develop a preloading schedule so that the bandwidth constraint of the processor array is satisfied and the first computation can begin with the minimum delay. We would like to point out (a) that data do not have to be preloaded in any order governed by a dependence relation (as in systolic processing) as long as they do not conflict in using the inter-processor links, and the bandwidth of the input ports is not exceeded; (b) that the optimal preloading schedule may depend on the velocities and data distributions of the moving data; and (c) that preloading data may result in problem-size-dependent memory in each processor (a design alternative often disallowed in systolic arrays). We discuss in Section 5 the effect of preloading data on computation completion time for the transitive-closure problem.
Design Method
Formulation of the Search Problem
The design of a feasible processor array is equivalent to choosing an appropriate set of parameters that satisfy the constraints imposed by dependency and application requirements for a specific uniform recurrence equation and a specific problem size $N$. The search for the "best" design can be represented by the following optimization problem.

The objective function $b$ defined in Eq. 23 is expressed in terms of attributes such as $T_{comp}$, the computation time of the algorithm; $T_{load}$, the load time for the initial inputs; $T_{drain}$, the drain time for the final results; and #PE, the number of processing elements in the design. Note that the completion time of evaluating a recurrence is

$$T_c = T_{comp} + T_{load} + T_{drain}. \qquad (25)$$
All the attributes are then expressed in terms of the parameters defined in GPM. The first two constraints in Eq. 24 follow directly from the definition of the parameters in GPM. Since the target array is systolic, displacement $k_i$ should not exceed period $t_i$ in order to prevent data broadcasting (velocities should not exceed one). In addition, the constraints $t_i \ge 1$, $i = 1, 2, \ldots, r$, mean that precedence constraints are satisfied.
The third constraint indicates that the recurrence is evaluated correctly by the processor array, satisfying dependency requirements (Theorems 1 and 2) and being free of data-link and computational conflicts (Theorem 3).
The fourth constraint indicates bounds on $T_c$ and #PE imposed on the design to be obtained. For instance, the following are two possible formulations of the optimization problem:

Minimize $T_c$ for a design with a maximum bound on #PE, $\#PE^{UB}$;

Minimize #PE for a design with a maximum bound on $T_c$, $T_c^{UB}$.
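The two formulations can be illustrated on a small table of candidate designs. The sketch below assumes each candidate has already been reduced to a pair of numbers (processing-element count and completion time); the `Design` type, the function names, and the sample values in the usage note are hypothetical and do not come from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class Design(NamedTuple):
    pe: int    # number of processing elements
    t_c: int   # completion time T_c = T_comp + T_load + T_drain


def min_time_given_pe(designs: Sequence[Design], pe_ub: int) -> Optional[Design]:
    """Minimize T_c subject to an upper bound on the number of PEs."""
    feasible = [d for d in designs if d.pe <= pe_ub]
    return min(feasible, key=lambda d: (d.t_c, d.pe), default=None)


def min_pe_given_time(designs: Sequence[Design], t_ub: int) -> Optional[Design]:
    """Minimize the number of PEs subject to an upper bound on T_c."""
    feasible = [d for d in designs if d.t_c <= t_ub]
    return min(feasible, key=lambda d: (d.pe, d.t_c), default=None)
```

For the made-up candidates (4 PEs, time 40), (8 PEs, time 22), and (16 PEs, time 15), bounding the PE count by 8 selects the second design, and so does bounding the completion time by 30; a time bound of 10 admits no design at all.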
Both of these formulations represent trade-offs between $T_c$ and #PE. The optimal design for the formulation given by Eq's 23 and 24 is found by a search algorithm. Since, in general, the objective function is non-linear, involving functions such as ceiling, floor, and maximum/minimum of a set of terms, it is difficult to describe a comprehensive algorithm that covers all possible cases. In the rest of this section, we first describe a pruning strategy used in our search algorithm, followed by a discussion on searches with objectives that are functions of $T_c$, $T_{comp}$, $T_{drain}$, and #PE. We then present the search algorithm and show its application for special cases of optimizing $T_c$ and #PE.

In the decomposition of the objective function, $N$ is not represented explicitly since it is a constant in the optimization. The decomposition is done in such a way that $e$ is a monotonic function of its variables, which may be a subset of $t_1, \ldots, t_r, k_1, \ldots, k_r$. The intuition behind this decomposition is as follows.
Pruning Strategy
If the objective function $b(t_1, \ldots, t_r, k_1, \ldots, k_r)$ is a monotonic function of its variables, then the optimal value of the parameters can be found by enumerating combinations of values of variables from their smallest permissible values (given by Eq. 24) until a feasible design that satisfies Theorems 1, 2 and 3 is found. Since $b$ is monotonic, the first feasible design obtained is also the optimal design.
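The enumeration described above can be sketched as a best-first search over integer parameter tuples. Because the objective is assumed to be non-decreasing in every variable, a priority queue that expands a tuple by incrementing one coordinate at a time pops tuples in non-decreasing objective order, so the first feasible tuple is optimal. The function name, the upper limit on each variable, and the objective and feasibility test used in the usage note are illustrative stand-ins; in particular, the feasibility test is not the condition of Theorems 1 to 3.

```python
import heapq


def first_feasible(objective, feasible, r, upper):
    """Return (value, tuple) of the first feasible design in objective order.

    objective -- function of an r-tuple, non-decreasing in every coordinate
    feasible  -- predicate standing in for the correctness conditions
    r         -- number of free integer parameters, each starting at 1
    upper     -- largest value tried for any parameter (keeps the search finite)
    """
    start = (1,) * r
    heap = [(objective(start), start)]
    seen = {start}
    while heap:
        value, t = heapq.heappop(heap)
        if feasible(t):
            return value, t  # monotonicity makes the first hit optimal
        for i in range(r):
            if t[i] < upper:
                nxt = t[:i] + (t[i] + 1,) + t[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (objective(nxt), nxt))
    return None  # no feasible design within the limit
```

As a usage example with three periods, the hypothetical objective 3(t1 + t2 + t3) + 1, and the hypothetical feasibility test "t1 differs from t2 and t3 is at least 2", the search returns the value 16.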
The above idea of enumerating values of a monotonic function can be extended to the general case of non-monotonic objective functions. This is done by first identifying $e$, a monotonic component of the objective that can be enumerated efficiently. The search proceeds by enumerating designs so that values of $e$ grow monotonically. $T_{comp}^{UB}$ is refined continuously as new incumbent designs are found in the search. The search stops when there is no combination of $t_i$, $i = 1, \ldots, r$, that satisfies $T_{comp} \le T_{comp}^{UB}$. A special case of the optimization is to find a design with the minimum computation time $T_{comp}$ (not including load and drain times). This was done in our earlier work [1, 2]. Here, $T_{comp}^{UB} = B^{inc} = T_{comp}^{inc}$, and the first feasible design is the optimal design that minimizes $T_{comp}$.
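A minimal sketch of this pruning idea follows, under a simplifying assumption that is not taken from the paper: the objective splits as b = e + rest, where e is non-decreasing in every parameter and rest is non-negative but otherwise arbitrary. Designs are visited in non-decreasing order of e, the incumbent value acts as the upper bound, and the search stops as soon as e alone reaches the incumbent. All names and the sample functions in the usage note are hypothetical.

```python
import heapq


def search_with_pruning(e, rest, feasible, r, upper):
    """Minimize b(t) = e(t) + rest(t) over feasible integer r-tuples.

    e        -- monotone (non-decreasing) component of the objective
    rest     -- remaining component, assumed to satisfy rest(t) >= 0
    feasible -- predicate standing in for the correctness conditions
    upper    -- largest value tried for any parameter
    """
    start = (1,) * r
    heap = [(e(start), start)]
    seen = {start}
    best_value, best_t = None, None
    while heap:
        e_val, t = heapq.heappop(heap)
        # Every unvisited design has e >= e_val, hence b >= e_val.
        if best_value is not None and e_val >= best_value:
            break
        if feasible(t):
            b_val = e_val + rest(t)
            if best_value is None or b_val < best_value:
                best_value, best_t = b_val, t  # new incumbent tightens the bound
        for i in range(r):
            if t[i] < upper:
                nxt = t[:i] + (t[i] + 1,) + t[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (e(nxt), nxt))
    return best_value, best_t
```

With the hypothetical choices e(t) = 3(t1 + t2) + 1, rest(t) = 4|t1 - 3|, and the feasibility test "t1 differs from t2", the first feasible design found is not optimal; the incumbent improves from 18 to 14 to 13 before the bound terminates the search at the design (3, 1).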
Search Procedure
In this section, we present our search procedure for minimizing $b(\#PE, T_c) = b(T_{comp}, T_{load}, T_{drain}, \#PE)$ (Eq. 28), where $T_{comp}$ is a function of $t_1, \ldots, t_r$; $T_{load}$ and $T_{drain}$ are functions of $t_1, \ldots, t_r, k_1, \ldots, k_r$; and #PE is a function of $k_1, \ldots, k_r$. The procedure has 11 steps.
1. Choose $g$ periods and $g$ displacements to be unconstrained parameters. Without loss of generality, let these periods and displacements be $t_i$ and $\vec{k}_i$, $1 \le i \le g$, respectively.
2. Initialize $T_{comp}^{UB}$ to be $T_{comp}^{seq}$, the computation time required to evaluate the recurrence sequentially.
3. Set the values of all the $g$ unconstrained periods $t_i$, $i = 1, \ldots, g$, to be unity.

For a design that minimizes #PE, the search procedure described above needs to be changed. In this case, $e$ should be defined as a function of $k_1, \ldots, k_r$, and the search should start iterating with the smallest combinations of $k_1, \ldots, k_g$.
Applications: Transitive Closure
Path-finding problems belong to an important class of optimization problems. Typical examples include computing the transitive closure and the shortest paths of a graph. 2-D processor arrays for finding transitive closures have been presented before [16, 17]. In this section we synthesize a one-pass linear processor array for the transitive-closure problem using the Warshall-Floyd path-finding algorithm.
The transitive-closure problem is defined as follows. Given an $N$-node directed graph with an $N \times N$ Boolean adjacency matrix $C(i,j)$, $1 \le i, j \le N$, the transitive closure $C^+(i,j) = 1$ if there exists a path from node $i$ to node $j$, where $C(i,j) = 1$ if there is an edge from node $i$ to node $j$ or $i = j$, and $C(i,j) = 0$ otherwise. That is, for $k, i, j = 1, \ldots, N$,

$$C(i,j) = C(i,j) + C(i,k) \cdot C(k,j), \qquad (35)$$

where $+$ and $\cdot$ denote Boolean OR and AND, respectively.
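For reference, the recurrence in Eq. 35 corresponds to the sequential Warshall algorithm, which can be sketched as follows. This is the plain sequential form with k as the outermost loop, not the reindexed form used for the array mapping; the function name is illustrative.

```python
def transitive_closure(adj):
    """Warshall's algorithm on a 0/1 adjacency matrix (list of lists).

    Follows the convention in the text: C(i, j) = 1 if there is an edge
    from i to j or i == j.
    """
    n = len(adj)
    c = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):          # intermediate node, outermost loop
        for i in range(n):
            for j in range(n):
                # Boolean form of C(i,j) = C(i,j) + C(i,k) . C(k,j)
                c[i][j] = c[i][j] or (c[i][k] and c[k][j])
    return [[int(x) for x in row] for row in c]
```

For the three-node chain 0 -> 1 -> 2, the closure adds the path from node 0 to node 2.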
The dependence structure of a general dynamic-programming formulation of the transitive-closure problem is irregular and difficult to map onto a regularly connected planar processor array. To cope with this mapping problem, S. Y. Kung et al. have converted the transitive-closure algorithm into a reindexed form and have mapped it to 2-D spiral and orthogonal arrays [16]. Based on their algorithm, we obtain the following five dependence vectors after pipelining the variables.

$$\begin{aligned}
\vec{d}_1 &= (0, 0, 1)^t && \text{for } (k,i,j)^t \leftarrow (k,i,j-1)^t, && 2 \le j \le N; \\
\vec{d}_2 &= (0, 1, 0)^t && \text{for } (k,i,j)^t \leftarrow (k,i-1,j)^t, && 2 \le i \le N; \\
\vec{d}_3 &= (1, -1, -1)^t && \text{for } (k,i,j)^t \leftarrow (k-1,i+1,j+1)^t, && 2 \le k \le N,\ 1 \le i, j \le N-1; \\
\vec{d}_4 &= (1, -1, 0)^t && \text{for } (k,i,N)^t \leftarrow (k-1,i+1,N)^t, && 2 \le k \le N,\ 1 \le i \le N-1; \\
\vec{d}_5 &= (1, 0, -1)^t && \text{for } (k,N,j)^t \leftarrow (k-1,N,j+1)^t, && 2 \le k \le N,\ 1 \le j \le N-1;
\end{aligned} \qquad (36)$$

where $\vec{I}_1 \leftarrow \vec{I}_2$ means that the data at point $\vec{I}_2$ is used at point $\vec{I}_1$. For nodes on the boundary of the dependence graph, where $j = N$ (resp., $i = N$), dependence $\vec{d}_4$ (resp., $\vec{d}_5$) is present instead of dependence $\vec{d}_3$. For other interior points, only the three dependencies $\vec{d}_1$, $\vec{d}_2$, $\vec{d}_3$ exist.
The key observation is as follows. Matrix $C$, whose transitive closure is to be found, is input along dependence vector $\vec{d}_3$. Inputs along the other dependence vectors $\vec{d}_1$, $\vec{d}_2$, $\vec{d}_4$, $\vec{d}_5$ are non-existent; i.e., they are never sent into the array from the external host. Hence, there are no data-input conflicts along these dependence directions. As a result, we need to consider data-input conflicts only along direction $\vec{d}_3$. Since dependencies $\vec{d}_3$, $\vec{d}_4$ and $\vec{d}_5$ never co-exist, there are only two spacings for data along direction $\vec{d}_3$, namely, $\vec{S}_{3,1}$ and $\vec{S}_{3,2}$.
A total of eight relevant parameters are defined for the transitive-closure problem: three periods $t_1, t_2, t_3$; three displacements $\vec{k}_1, \vec{k}_2, \vec{k}_3$; and two spacings $\vec{S}_{3,1}$ and $\vec{S}_{3,2}$. For a linear processor array, all the parameters are scalars. As derived in Example 4, the periods and velocities along directions $\vec{d}_4$ and $\vec{d}_5$ are given as $t_4$ [...]

We illustrate in the rest of this section the following formulations of the optimization of linear processor arrays: (i) $T_{comp}$-optimal designs without bound on #PE, (ii) $T_c$-optimal designs without bound on #PE, (iii) #PE-optimal designs without bound on $T_c$ or $T_{comp}$, (iv) optimal designs with specific bounds on $T_{comp}$ or #PE, and (v) optimal designs with specific bounds on $T_c$ or #PE.
Performance Attributes and Constraints
Before optimal designs can be found, we need to express the performance attributes in the objective function in terms of the parameters in GPM. The attributes we are interested in are $T_{comp}$, $T_{load}$, $T_{drain}$, #PE, and $T_c$, where $T_c = T_{load} + T_{comp} + T_{drain}$. In this section, we show three lemmas that express these performance attributes in terms of the parameters defined. We also show two constraints that refine the constraints defined in Theorem 3.

Proof. See Appendix A.3.

Proof. See Appendix A.4.
Lemma 3 does not cover the case when the input matrix is stationary. As pointed out in Section 3.4, stationary inputs need to be preloaded in the processor array before computation begins. Since there is only one input matrix $C$, we assume that preloading takes a lower-bound time computed as the floor of the number of elements to be preloaded divided by the maximum number of input ports. A similar assumption is made when the final stationary results need to be drained. Even with this optimistic assumption, we did not find any design with stationary inputs/outputs that outperforms designs with moving inputs. Although this observation is not true in general, we would like to point out that a schedule to preload data in the processor array may not be governed by the data-dependence relations, and that a general preloading schedule may depend on specific design parameters, such as the values of the GPM parameters, and on architecture constraints, such as bandwidth and memory.
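The optimistic preloading bound described above amounts to a single integer division. The sketch below states it explicitly; the function name `preload_lower_bound` and the port counts in the example are hypothetical.

```python
def preload_lower_bound(num_elements, max_input_ports):
    """Optimistic preload time: floor(elements / ports), as assumed in the text."""
    if max_input_ports <= 0:
        raise ValueError("at least one input port is required")
    return num_elements // max_input_ports


# Hypothetical example: a 3 x 3 input matrix and an array with 3 input ports.
N = 3
bound = preload_lower_bound(N * N, 3)
```

Because the floor is taken, the bound can undercount by up to one time step when the ports do not divide the element count evenly, which is consistent with calling the assumption optimistic.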
For linear-array synthesis, since the spacings are scalars, let $s_{3,1}$ be $\vec{S}_{3,1}$ and $s_{3,2}$ be $\vec{S}_{3,2}$. In addition, the condition for data-input conflicts (Theorem 3) can be refined as follows.

Proof. See Appendix A.5.

[...] According to Theorem 4, data-input conflicts are present, and the solution is not feasible.

Time-Optimal and Processor-Optimal Linear-Array Designs

Table 1 shows the optimal linear-array designs found by the search procedure of GPM (see Section 4.3), in which the objective is to minimize $T_{comp}$ (computation time, not including load and drain times) without bounds on #PE. In finding these designs, $t_3$ is incremented before $t_1$ or $t_2$ in Step 10 of the search procedure. This is done because it increases $T_{comp}$ by the smallest amount. Among all the designs that have the minimum $T_{comp}$, we found designs that require the minimum #PE, followed by designs that require the minimum $T_{load}$ and $T_{drain}$. We list $T_{load}$, $T_{comp}$, $T_{drain}$, the #PEs needed, and the CPU time used by the search procedure running on a Sun SPARCstation 10/30. We also list the equivalent values of the schedule vector and the allocation matrix $S$ of DM, obtained by solving Eqs. 10 and 12.
In a similar way, we find designs that optimize $T_c$ (completion time, including load and drain times) without bounds on #PE (see Table 2). Note that these designs have shorter total completion times and use more PEs than the corresponding designs in Table 1. For instance, for $N = 300$, the [...]

Our results in Tables 1 and 2 demonstrate that GPM, based on the equivalence between GPM and DM shown in Eqs. 10 and 12, can serve as a powerful tool for finding optimal designs in DM.
It is important to point out that the objective used (whether to minimize $T_{comp}$ or $T_c$) depends on the application. If the linear processor array is used to evaluate the transitive closure of one matrix, then minimizing $T_c$ will be important. On the other hand, if the processor array is used for pipelined evaluation of transitive closures of multiple matrices, then minimizing $T_{comp}$ may be important.
If the objective is to minimize #PE in the linear processor array, then Theorem 5 characterizes the #PE-optimal design.
Theorem 5. The combinations of parameters $(t_1, t_2, t_3) = (1, 1, N-1)$ and $(\vec{k}_1, \vec{k}_2, \vec{k}_3) = (0, \pm 1, \mp 1)$ or $(\pm 1, 0, \mp 1)$ yield linear processor arrays that are optimal under the primary objective of minimizing the number of PEs and the secondary objective of minimizing the computation time.
Proof. See Appendix A.6.

Table 3 shows the #PE-optimal designs obtained by GPM as well as those obtained by Lee and Kedem (LK) [10] and Shang and Fortes (SF) [9]. In this table, we show the load and drain times, computation times, and #PEs for designs derived by these three methods. The schedule vector, $S$, and the corresponding parameters in GPM are summarized as follows.

Schedule vector: $(N+1, 1, 1)^t$; $S$: $(0, 0, -1)^t$; $(t_1, t_2, t_3)$: $(1, 1, N-1)$; $(\vec{k}_1, \vec{k}_2, \vec{k}_3)$: $(-1, 0, 1)$.

Table 3 shows that both the SF and GPM designs require the minimum number of PEs. The SF designs, however, were developed based on different assumptions. According to Lemma 1 and the table above, the SF designs have a computation time $T_{comp} = (N-1)(N+2) + 1$.
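The summary above pairs a schedule vector and an allocation $S$ with GPM periods and displacements. A minimal sketch of that correspondence follows, assuming that each period is the inner product of the schedule vector with a dependence vector and each displacement is the inner product of $S$ with the same vector. This form is an assumption standing in for Eqs. 10 and 12, which are not reproduced in this section; the function names are illustrative.

```python
# Interior dependence vectors d1, d2, d3 in (k, i, j) order.
DEPENDENCES = [(0, 0, 1), (0, 1, 0), (1, -1, -1)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gpm_parameters(schedule, allocation):
    """Map a (schedule vector, allocation) pair to (periods, displacements).

    ASSUMPTION: t_i = schedule . d_i and k_i = allocation . d_i.
    """
    periods = tuple(dot(schedule, d) for d in DEPENDENCES)
    displacements = tuple(dot(allocation, d) for d in DEPENDENCES)
    return periods, displacements


N = 5  # hypothetical problem size
periods, displacements = gpm_parameters((N + 1, 1, 1), (0, 0, -1))
```

With the schedule vector $(N+1, 1, 1)^t$ and $S = (0, 0, -1)^t$, the sketch returns periods $(1, 1, N-1)$ and displacements $(-1, 0, 1)$, which agrees with the values listed in the summary.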
This computation time is lower than that of the GPM designs characterized by Theorem 5. The difference is attributed to the fact that Shang and Fortes assumed that contention must be avoided only after the first use of a variable and before its last use or generation. This is a valid assumption for systems with fast I/O, for systems where each PE has its own I/O, or for cases where inputs are preloaded and outputs need not be drained or are post-drained. In GPM, we consider contentions both in computations and in data links. Excluding designs that have computational and data-link conflicts results in designs that require slightly longer load, drain, and computation times.
To illustrate the point above, we compute, using Eq. 37, the spacings used in the SF design [9]: $s_{3,1} = -(N-1)/(N-2)$ and $s_{3,2} = -1/(N-2)$. These values of the spacings result in data-input conflicts between tokens $C_{1,j}$ and $C_{N,j-1}$, $j = 2, 3, \ldots, N$, of input matrix $C$ (Theorem 4).
The space-time diagrams of two linear processor arrays, one optimizing $T_{comp}$ and the other optimizing $T_c$, for $N = 3$ are shown in Figures 2 and 3, respectively.
The design in Figure 2 optimizes $T_{comp}$ and has parameters $(t_1, t_2, t_3) = (1, 1, 2)$ and $(\vec{k}_1, \vec{k}_2, \vec{k}_3) = (0, 1, -1)$. This design minimizes both $T_{comp}$ and #PE and, therefore, minimizes any objective of the form $\#PE^{x}\, T_{comp}^{y}$ for $x, y \ge 1$. The space-time diagram shows the execution [...]

From the definition of the periods, the time difference between the execution of these two index points is $t_1 + t_2 + t_3 = 1 + 1 + 2 = 4$. Similarly, the displacement between the PEs executing the two index points is given by $\vec{k}_1 + \vec{k}_2 + \vec{k}_3 = 0 + 1 + (-1) = 0$. Hence, in Figure 2, they are executed by the same processor, PE 1, at times 1 and 5, respectively. In a similar fashion, the entire space-time diagram can be derived mechanically from a knowledge of the periods and displacements.

Figure 2. Linear processor array for finding the transitive closure of a $3 \times 3$ matrix using parameters $(t_1, t_2, t_3) = (1, 1, 2)$ and $(\vec{k}_1, \vec{k}_2, \vec{k}_3) = (0, 1, -1)$. The array is optimal for minimum $T_{comp}$, minimum #PE, and minimum $\#PE^{x}\, T_{comp}^{y}$, $x, y \ge 1$. The #PE used is the same as in Lee and Kedem's design. (Annotation in the figure: index (2,3,2) executes with inputs C13, C12, C23.)
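The mechanical derivation of a space-time diagram from periods and displacements can be sketched as follows for the Figure 2 design. The sketch assumes that execution time and PE position are affine functions of the index point, and that index $(1,1,1)$ executes at time 1 on PE 1; both the normalization and the names `affine_coefficients` and `place` are assumptions made for illustration.

```python
from itertools import product

N = 3
PERIODS = (1, 1, 2)         # (t1, t2, t3) of the Figure 2 design
DISPLACEMENTS = (0, 1, -1)  # (k1, k2, k3) of the Figure 2 design


def affine_coefficients(params):
    """Coefficients for index (k, i, j), assuming the mapping is linear.

    With d1 = (0,0,1), d2 = (0,1,0), d3 = (1,-1,-1), a linear map whose
    values on d1, d2, d3 are p1, p2, p3 has coefficients (p1+p2+p3, p2, p1).
    """
    p1, p2, p3 = params
    return (p1 + p2 + p3, p2, p1)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


TIME_COEFF = affine_coefficients(PERIODS)
PE_COEFF = affine_coefficients(DISPLACEMENTS)
ORIGIN = (1, 1, 1)  # ASSUMED to execute at time 1 on PE 1


def place(point):
    """Return (time, PE) for an index point under the assumed normalization."""
    time = 1 + dot(TIME_COEFF, point) - dot(TIME_COEFF, ORIGIN)
    pe = 1 + dot(PE_COEFF, point) - dot(PE_COEFF, ORIGIN)
    return time, pe


# One (time, PE) slot for every index point of the 3 x 3 x 3 index space.
schedule = {p: place(p) for p in product(range(1, N + 1), repeat=3)}
```

Under these assumptions, two index points that differ by $\vec{d}_1 + \vec{d}_2 + \vec{d}_3 = (1, 0, 0)$, for example $(1,1,1)$ and $(2,1,1)$, land on PE 1 at times 1 and 5. No two index points share a (time, PE) slot, and three PEs are used.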
The design in Figure 3 has parameters $(t_1, t_2, t_3) = (1, 2, 1)$ and $(\vec{k}_1, \vec{k}_2, \vec{k}_3) = (0, -1, 1)$. It has shorter load and drain times (3 units each), but its computation time $T_{comp}$ is higher than that in Figure 2. It minimizes both $T_c$ and #PE and, therefore, minimizes any objective of the form $\#PE^{x}\, T_{c}^{y}$ for $x, y \ge 1$. Note that the load and drain times are not shown in these diagrams.
Further, for correct execution of the Floyd-Warshall algorithm, control signals are needed to govern the index-dependent assignments performed by the PEs in the array. These assignments are given in Tables I and II of [11].
Processor-Time Trade-offs
Comparing the results in Tables 2 and 3, we found, for instance, that for a problem of size 200, the $T_c$-optimal design is 13.35 times faster than the #PE-optimal design in terms of completion time and uses 17.9 times as many PEs. The $T_c$-optimal design for $N = 200$ requires 8958 time units and 3583 PEs, whereas the #PE-optimal design requires 119602 time units and 200 PEs. A designer might be unwilling to settle for either the large number of PEs required in the minimum-time design or the long completion time of the minimum-processor design.
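The two ratios quoted above follow directly from the four figures given for $N = 200$. The short check below reproduces the arithmetic; the dictionary layout and variable names are illustrative only.

```python
# Figures quoted in the text for N = 200.
tc_optimal = {"completion_time": 8958, "pes": 3583}
pe_optimal = {"completion_time": 119602, "pes": 200}

# How much faster the T_c-optimal design completes.
speedup = pe_optimal["completion_time"] / tc_optimal["completion_time"]
# How many times as many PEs it needs.
pe_ratio = tc_optimal["pes"] / pe_optimal["pes"]

print(f"{speedup:.2f} times faster, {pe_ratio:.1f} times as many PEs")
```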
In realistic design situations, there may be bounds on the number of processors, on the completion time, or on both. Hence, a possible objective could be to have as few processors as possible, so long as the time is within a preset upper limit, $T^{ub}_{c}$ or $T^{ub}_{comp}$, or to minimize $T_c$ or $T_{comp}$ with #PE less than a given upper bound $\#PE^{ub}$.
In the following discussion, let $T^{min}_{comp}$ and $\#PE^{max}$ be, respectively, the computation time and #PE of the minimum-$T_{comp}$ design. Designs with $\#PE > \#PE^{max}$ would not be useful, as their computation times have to be at least $T^{min}_{comp}$. On the other hand, let $T^{max}_{comp}$ and $\#PE^{min}$ be, respectively, the computation time and #PE of the minimum-processor design. Designs with computation time greater than $T^{max}_{comp}$ would likewise not be useful, as the number of PEs cannot be reduced below $\#PE^{min}$. In this case, we are interested in finding designs with computation time greater than $T^{min}_{comp}$ and #PE less than $\#PE^{max}$.

Figure 4 shows how #PE varies with $T_{comp}$ for three different problem sizes: $N = 100$, 200, and 300. The y-axis (#PE) is normalized by $\#PE^{max}$, and the x-axis ($T_{comp}$) is scaled by $T^{max}_{comp}$. This lets us compare the different problem sizes uniformly on the same scale. The stepped curves are obtained by bounding $T_{comp}$ and finding the #PE-optimal designs for specific recurrence sizes. These curves are stepped because there are only a small and finite number of processor-array configurations that can satisfy the given time constraints. If the goal is to find the #PE-optimal designs, then we will have a small number of array configurations; for each configuration, we will select the one with the minimum computation time.
Given the bound $T^{ub}_{comp}$ (resp., $\#PE^{ub}$), the designer can use Figure 4 to find the minimum #PE (resp., $T_{comp}$) required and decide, possibly from a cost perspective, whether it is acceptable. Again, the designer can exploit the initial steep decline in the plots to choose an alternative design that trades performance for cost. For instance, the minimum #PE for $N = 200$ drops by 43% for only a 19% increase in computation time.
If both $T_{comp}$ and #PE are bounded from above, then the design with the minimum #PE for a given time bound is determined using Figure 4. First, a horizontal line is drawn across the graph at the desired bound on #PE. The intersection between this line and the stepped curve represents the minimum $T_{comp}$ needed for any feasible design. If this minimum $T_{comp}$ is less than the desired $T_{comp}$, then a feasible design can be obtained by the procedure discussed in Section 4.3. This now represents the best design under both time and processor constraints.

Another observation from Figure 4 is that the plots for larger $N$ decrease more rapidly than those for smaller $N$. Hence, for larger $N$, there is a substantial reduction in #PE (resp., $T_{comp}$) for a relatively small increase in the computation time (resp., #PE) from the optimum. Hence, for large $N$, there are more attractive alternatives than the time-optimal or #PE-optimal designs.

Figure 5 shows a plot similar to that in Figure 4, except that it depicts the difference between trade-offs obtained on $T_c$ and #PE versus trade-offs obtained on $T_{comp}$ and #PE. Two sets of curves are shown, one for designs that minimize $T_{comp}$ and the other for designs that minimize $T_c$, for $N$ equal to 100 and 200, respectively. The y-axis of these curves is normalized with respect to #PE when $T_c$ is minimum (since these designs require more PEs and less $T_c$), and the x-axis is normalized with respect to $T_c$ when $T_{comp} = T^{max}_{comp}$. These graphs show the difference between designs obtained under different objectives. Given a bound $T^{ub}_{c}$, we can see that the number of processors obtained by minimizing $T_c$ is less than or equal to the number of processors obtained by minimizing $T_{comp}$.
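The graphical selection on the stepped trade-off curve (bounding the time, the PE count, or both) can be mimicked on a list of candidate designs. The sketch below is illustrative only: the (time, #PE) pairs in `DESIGNS` are made-up values, not results from the tables, and the function names are hypothetical.

```python
def min_pes_within_time(designs, t_ub):
    """Design with the fewest PEs among those with time <= t_ub, or None."""
    feasible = [d for d in designs if d[0] <= t_ub]
    return min(feasible, key=lambda d: (d[1], d[0])) if feasible else None


def min_time_within_pes(designs, pe_ub):
    """Design with the shortest time among those with #PE <= pe_ub, or None."""
    feasible = [d for d in designs if d[1] <= pe_ub]
    return min(feasible, key=lambda d: (d[0], d[1])) if feasible else None


def best_under_both(designs, t_ub, pe_ub):
    """Mirror the graphical procedure: find the minimum time allowed by the
    PE bound, then accept it only if it also meets the time bound."""
    candidate = min_time_within_pes(designs, pe_ub)
    if candidate is None or candidate[0] > t_ub:
        return None
    return candidate


# HYPOTHETICAL points on a stepped trade-off curve, as (time, #PE) pairs.
DESIGNS = [(100, 40), (120, 25), (150, 18), (220, 10)]

fewest_pes = min_pes_within_time(DESIGNS, 130)
fastest = min_time_within_pes(DESIGNS, 20)
both = best_under_both(DESIGNS, 160, 20)
```

When the two bounds cannot be met together, `best_under_both` returns `None`, which corresponds to the case where the minimum time at the PE bound exceeds the desired time.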
Final Remarks
Algorithm-specific parallel processing with processor arrays can be systematically accomplished with the help of the general parameter-based approach, GPM, discussed in this paper. The techniques discussed in this paper are ideally suited to loop nests described as uniform recurrences or as affine recurrences that can be uniformized.
In GPM, the behavior of the target array is captured by a set of parameters, and the design problem is formulated as an optimization problem with an objective and a set of constraints specified in terms of the parameters. We show that the parameters in GPM can be expressed in terms of the processor-allocation matrix $S$ and the time schedule vector in dependency-based methods (DMs), thereby allowing GPM to be used in DMs to find optimal designs. We present an efficient search procedure for finding $T_c$-optimal or $T_{comp}$-optimal (resp., #PE-optimal) designs for specified bounds on #PE (resp., $T_c$ or $T_{comp}$), as well as optimal designs with general objective functions. The distinct features of GPM are its ability to search systematically for optimal designs with specific design requirements on $T_c$ or $T_{comp}$ and #PE, and its ability to include constraints on data-link and computational conflicts in the optimization procedure.
In conclusion, Table 4 summarizes the unique features of GPM and DM.

Table 4. Features of DM and GPM.

DM: Non-uniform in the general case by specifying a general processor-allocation matrix; processor arrays derived may have, in the general case, arbitrary speed/direction changes for data tokens and have aperiodic computations.
GPM: Uniform controls throughout the processor array, resulting in constant velocities and periodic computations.

Design objective and constraints
DM: Computation-time optimal designs or processor-optimal designs with linear objective function and linear constraints.
GPM: General non-linear objective function and constraints with certain monotonicity properties on the objective function; new constraints have been developed that avoid data-link conflicts.

Search methods for finding processor array designs
DM: Choose the processor-allocation matrix heuristically, and find a schedule vector satisfying the processor-allocation constraints; methods for finding designs are based on linear integer programming or intelligent searches.
GPM: Search method is systematic enumeration and pruning on a search space polynomial in complexity with respect to problem size.

Designs obtained
DM: Designs found are optimal in computation time with respect to a given choice of processor-allocation matrix; possible allocation matrices chosen are those that minimize the number of processing elements.
GPM: Trade-offs between number of processors and computation time, or between number of processors and completion time (including load and drain times), for a specific problem instance can be obtained.

Summary
The two methods are equivalent representations for synthesizing uniform recurrences. The formulation of the design optimization problem and the search techniques developed are equally applicable in both representations.

[...] BD is needed to prove the theorem. The key observation is that token $Z_2(\vec{I} - p\vec{d}_2)$ refers to the same element of variable 2 for all $p$. This is true because variable 2 is pipelined along $\vec{d}_2$ in the index space and propagates through the array between the execution of indices differing by $\vec{d}_2$. Hence, irrespective of the value of $p$, $BD = \vec{S}_{2,1}$. Again, by vector composition, the theorem is proved.

The time elapsed between the execution of the index point at PE P and the corresponding index point at PE Q must be the same along paths 1 and 2. Therefore, [...] Similarly, by considering the displacement between P and Q along paths 1 and 2, we get K~ i = 0.

The distance from $C_{1,1}$ to PE C is equal to the number of elements between $C_{1,1}$ and PE C before any input element is sent into the array. Since the data-input pattern is dictated by $\vec{S}_{3,1}$ and $\vec{S}_{3,2}$, distance $l_2$ from $C_{1,1}$ to PE C again depends on the relative signs of $\vec{S}_{3,1}$ and $\vec{S}_{3,2}$ with respect to $\vec{k}_3$. If $\vec{S}_{3,1}$ and $\vec{S}_{3,2}$ are in the same direction as $\vec{k}_3$, then $C_{1,1}$ is the first element of the input, and $l_2 = 1$. Similarly, if $\vec{S}_{3,1}$ and $\vec{S}_{3,2}$ are in the opposite direction to $\vec{k}_3$, then [...], and the time to get to PE C is equal to $l_2$. Hence, $T_{load}$ is given by Eq. 40.
By symmetry, we can easily verify that $T_{drain}$, the time to drain the outputs from the array, is equal to $T_{load}$.

[...] Therefore, the minimum computation time for the minimum-processor designs occurs for Cases 2 and 3 above.
