ABSTRACT
1. INTRODUCTION
Array processors are well suited to efficiently implementing a major class of signal processing algorithms because of their parallelism and regular data flow [KUN88]. A widely used approach for mapping algorithms to array processors is the Dependence Graph (DG) methodology. In this methodology, an algorithm is first developed in Single Assignment Code (SAC), where each variable is only allowed to have a single value. Then the algorithm is represented in graphical form by a DG [KUN88]. The nodes of the DG are then mapped onto an array processor. In the literature, several techniques and software packages have been reported for the automation of the mapping (see e.g. [QUISI], [MOLW], [RAO88], [ANN88], and [JAY91a]). Except for [JAY91a], only the mapping of regular DG's has been fully automated. The DG's for large and complex problems are in general not regular and are very difficult to make regular by adding dummy operations.
In an earlier paper [MOE92], based on work done in [JAY91b], we presented an Integer Linear Programming (ILP) formulation for mapping (semi-)regular DG's to array processors. In this paper, we use a branch-and-bound technique to obtain the set of optimal solutions.
2. HIERARCHICAL DG REPRESENTATION
The best way to manage the complexity of large systems is to adopt a hierarchically structured design. In the literature, a hierarchical design environment has been treated in [KUN84], [ANN88], [THI88] and [JAY91a]. Here we adopt the hierarchical form of the SAC proposed in [JAY91a]. This form is referred to as Structured Single Assignment Code (S2AC). The graphical representation of the S2AC description is called the Structured Dependence Graph (SDG). The canonical forms of the S2AC and SDG are used for the construction of the DG with local-dependence edges in a minimum-dimension Euclidean space to keep projection simple.
An index point in a DG can, in general, contain a set of variables whose computations depend on variables from neighboring index points as well as from the same index point (multi-variable DG's). Single-variable 
3. ALTERNATIVE NODE SCHEDULES
For mapping regular iterative algorithms, the systolic schedule is used, represented by the schedule vector $s$. A systolic schedule implies that there is at least one delay on each edge of the resulting array processor. In semi-regular arrays, however, the best schedule does not necessarily lie along a linear path. Therefore a more efficient approach has to be derived.
We define a DG as a directed graph $G = \{V, E\}$, where $V$ is the set of nodes and $E$ is the set of directed edges. The set of nodes $V = I \cup N \cup O$ contains input nodes $I$, output nodes $O$, and intermediate nodes $N$. The feasibility of a schedule is determined by the partial ordering and the process assignment scheme. A node should have valid data on all its input edges before it can be scheduled. Given the earliest schedule of all nodes $n_j \in I$, the earliest schedule time of any node $n_j \in N \cup O$ can be found. The latest schedule time of nodes $n_j \in O$ is also known, since the system must meet a set of deadlines that require the output to be available before a specific time. A semi-regular DG contains a set of connected sub-DG's. These sub-DG's are regular. Keeping a uniform delay distribution in the sub-DG's simplifies the design.
Let $R_i$ be the set of all edges along a linear path and $E_j$ be the set containing all edges on a selected number of linear paths belonging to several $R_i$ with parallel edges. We partition the set of edges in the DG into a number of sets $E_i$ such that $E_i \cap E_j = \emptyset$ for all $i \neq j$ and $\bigcup_i E_i = E$. A set $\{\langle E_i, d_i \rangle\}$ specifies for each $E_i$ a delay $d_i$ (i.e. all the edges in $E_i$ have the same delay $d_i$). Once the schedule time for an edge along a linear path in $E_i$ is chosen, the delay $d_i$ is fixed for all edges in $E_i$. This is done for all sets $E_i$.
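The earliest-schedule computation described above can be illustrated with a minimal sketch. It assumes the DG is acyclic and that every edge has already been assigned to a class $E_i$ with a fixed delay $d_i$. The function name earliest_schedule, the dictionary layout and the 2 x 2 example graph are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict, deque

def earliest_schedule(edges, edge_class, class_delay, input_times):
    """Earliest schedule time of every node in an acyclic DG.

    edges:       list of (u, v) directed edges
    edge_class:  {(u, v): i}, the edge set E_i that the edge belongs to
    class_delay: {i: d_i}, one uniform delay per edge set
    input_times: {input node: earliest schedule time}
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set(input_times)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    time = dict(input_times)
    ready = deque(n for n in nodes if indeg[n] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        time.setdefault(u, 0)  # source nodes without a given time start at 0
        for v in succ[u]:
            candidate = time[u] + class_delay[edge_class[(u, v)]]
            # A node needs valid data on all input edges: take the latest arrival.
            time[v] = max(time[v], candidate) if v in time else candidate
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(nodes):
        raise ValueError("dependence graph contains a cycle")
    return time

# Hypothetical 2 x 2 DG: horizontal edges form class "h", vertical edges class "v".
example_edges = [((0, 0), (0, 1)), ((1, 0), (1, 1)),
                 ((0, 0), (1, 0)), ((0, 1), (1, 1))]
example_class = {example_edges[0]: "h", example_edges[1]: "h",
                 example_edges[2]: "v", example_edges[3]: "v"}
example_times = earliest_schedule(example_edges, example_class,
                                  {"h": 1, "v": 2}, {(0, 0): 0})
```

Because the delay is stored per edge class rather than per edge, changing one entry of class_delay reschedules every edge of that class at once, which mirrors the uniform delay distribution within a sub-DG.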
Further, for mapping from an M-dimensional DG to a K-dimensional array processor, any node can be connected to a maximum of $3^K - 1$ nodes scheduled in the same time slot. We now define the set of constraints for scheduling as follows. We choose the edge detection problem as an example. The DG of the Edge Detector is three-dimensional. Figure 1 shows one part of the algorithm. The other part is identical.
Readers are referred to [JAY91a] for the derivation. Because of the huge processing power requirement, parallel processing is needed. The black nodes on the far right add the results of both parts. White nodes are convolution functions and dark grey nodes are row-to-column translation functions. It is clear that the DG for this problem is inhomogeneous. We now apply the ANS algorithm for a mask width of 4 and image widths of 4, 8 and 16. The schedule range for all nodes is 2. A bounding constraint is added for each edge direction in all subgraphs. The behavior for the three different image sizes is compared in the plot of Figure 2. In all three cases, the final set of possible schedules contains five optimal solutions, regardless of the DG size.
4. ALTERNATIVE NODE PROJECTION
We construct an algorithm to find all valid linear (nonlinear) projections. Linear mapping involves projection along a straight line, whereas nonlinear mapping means that multiple nodes, not necessarily along a straight line, map to the same PE.
Definition 4.2 Projection constraints: The enumeration tree for projection solutions must be pruned for each node $n_i$ if any of the following conditions is violated:
• A node $n_i$ can only be projected onto a position $p_j$ that lies within the polyrec $A$.
• Two nodes with the same schedule cannot be projected to the same PE.
• The number of PE's should not exceed some upper bound.
• The maximum number of communication links, $3^M - 1$, must be preserved for each PE.
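As an illustration of how these pruning tests might be checked on a (possibly partial) projection, the following sketch tests a candidate node-to-PE assignment. It is a simplification under stated assumptions: the allowed region is modeled as an axis-aligned box rather than a general polyrec, the link limit is passed in as a parameter, and the function name and data layout are hypothetical.

```python
from collections import defaultdict

def violates_projection_constraints(assignment, schedule, edges,
                                    bounds, max_pes, max_links):
    """Return True if the assignment must be pruned.

    assignment: {node: PE position tuple}, may cover only some nodes
    schedule:   {node: time slot}
    edges:      list of (u, v) DG edges
    bounds:     per-axis (low, high) inclusive box (assumed stand-in for the polyrec)
    max_pes:    upper bound on the number of PEs
    max_links:  maximum number of distinct neighbor PEs per PE
    """
    occupied = set()
    for node, pe in assignment.items():
        # Position must lie inside the allowed region.
        if any(not (low <= x <= high) for x, (low, high) in zip(pe, bounds)):
            return True
        # Two nodes with the same schedule cannot share a PE.
        slot = (pe, schedule[node])
        if slot in occupied:
            return True
        occupied.add(slot)

    # The number of PEs must stay within the upper bound.
    if len(set(assignment.values())) > max_pes:
        return True

    # Count distinct neighbor PEs implied by the already projected edges.
    links = defaultdict(set)
    for u, v in edges:
        if u in assignment and v in assignment:
            pu, pv = assignment[u], assignment[v]
            if pu != pv:
                links[pu].add(pv)
                links[pv].add(pu)
    return any(len(neighbors) > max_links for neighbors in links.values())

# Hypothetical 3 x 3 DG with schedule t = i + j.
grid = [(i, j) for i in range(3) for j in range(3)]
grid_schedule = {(i, j): i + j for i, j in grid}
grid_edges = ([((i, j), (i, j + 1)) for i in range(3) for j in range(2)] +
              [((i, j), (i + 1, j)) for i in range(2) for j in range(3)])

along_j = {(i, j): (i,) for i, j in grid}          # project along the j axis
along_diag = {(i, j): (i + j,) for i, j in grid}   # nodes with equal i + j collide

ok_case = violates_projection_constraints(
    along_j, grid_schedule, grid_edges, [(0, 2)], 3, 2)
bad_case = violates_projection_constraints(
    along_diag, grid_schedule, grid_edges, [(0, 4)], 5, 2)
```

In this example the projection along the j axis passes all four tests, while the second assignment is rejected because nodes (0, 1) and (1, 0) share both a time slot and a PE.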
Up to now, factors such as the complexity of the resulting PE and the non-uniform distribution of I/O nodes on the boundary have not been taken into account. The ultimate performance goal of an array processor system is a computation rate that balances the available I/O bandwidth with the host. To achieve this, we have to guarantee that the I/O nodes are uniformly distributed and match the interface to the outside world. An additional set of constraints is therefore needed.
Definition 4.3 Additional projection constraints:
• Input/Output nodes should remain on the boundary.
• Prevent the mapping of nodes with different functionality onto the same PE (optional).
• Remove equivalent and similar solutions.
The ANP algorithm finds the set of all possible mappings $\{GP_i\}$ under the constraints in Definitions 4.2 and 4.3. It maps an M-dimensional DG to a K-dimensional array processor and finds the set of all possible linear and non-linear projections. No solution is found if a node violates the set of constraints for all intermediate solutions.
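The idea of walking an enumeration tree and pruning it as soon as a constraint is violated can be sketched as a depth-first search. This is not the ANP algorithm itself: it applies only two of the constraints (no two nodes with the same schedule on one PE, and an upper bound on the number of PEs), and the function name, arguments and the 2 x 2 example are assumptions made for illustration.

```python
def enumerate_projections(nodes, schedule, pe_positions, max_pes):
    """Return every node -> PE assignment that survives the pruning tests."""
    solutions = []
    assignment = {}
    occupied = set()  # (PE, time slot) pairs already in use

    def extend(k):
        if k == len(nodes):
            solutions.append(dict(assignment))
            return
        node = nodes[k]
        used = set(assignment.values())
        for pe in pe_positions:
            slot = (pe, schedule[node])
            if slot in occupied:
                continue  # prune: same schedule on the same PE
            if pe not in used and len(used) >= max_pes:
                continue  # prune: would exceed the PE upper bound
            assignment[node] = pe
            occupied.add(slot)
            extend(k + 1)
            occupied.remove(slot)
            del assignment[node]

    extend(0)
    return solutions

# Hypothetical 2 x 2 DG with schedule t = i + j, mapped onto PEs 0 and 1.
small_nodes = [(0, 0), (0, 1), (1, 0), (1, 1)]
small_schedule = {n: n[0] + n[1] for n in small_nodes}
two_pe_solutions = enumerate_projections(small_nodes, small_schedule, [0, 1], 2)
one_pe_solutions = enumerate_projections(small_nodes, small_schedule, [0, 1], 1)
```

With a single PE allowed, no mapping survives, because nodes (0, 1) and (1, 0) share a time slot; this corresponds to the case in which no solution is found. Boundary, functionality and equivalence constraints such as those of Definition 4.3 would add further pruning tests inside the loop.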
Linear projection has been thoroughly studied in the literature. Yet in certain circumstances a non-linear mapping may offer some unique flexibility and advantage. To extract the optimal linear and non-linear solutions $\{GP_i\}$ in terms of array processor characteristics and given constraints, we need to define an extra set of constraints, which we call bounding rules. Let us define a bounding rule for [...]
Definition 4.4 [...]: For set $R_i$ on the linear path joining two vertices of the DG-polytope and lying on the boundary, all edges in $R_i$ have to follow the same rule of projection, i.e. edge directions after projection are identical to each other.
Definition 4.4 guarantees that I/O nodes are mapped uniformly. This can be generalized to include the set $E_i$.
The enhanced ANP algorithm uses set $E_i$ to add bounding constraints. An example of the projection set $\{GP_i\}$ for a 3 x 3 matrix-vector multiplication using Definition 4.4 is given in Figure 3. This set contains linear as well as semi-linear mappings. Whether the DG is 3 x 3 or n x n, the set $\{GP_i\}$ contains 5 alternative solutions, which are equivalent for all sizes of the DG. In the case of an m x n DG where $m \neq n$, only solutions (a), (b) and (e) are possible. This means that, given a regular array and using Definition 4.4, the set $\{GP_i\}$ depends on the topology of the DG but is independent of its size.
The above discussion assumes that the DG boundaries lie on hyperplanes orthogonal to each other. This is not always the case, e.g. in the sorting problem. We therefore define a general bounding constraint for regular DG's.
Definition 4.5 General bounding rule for a regular M-dimensional DG: For set $E_i$ of the linear path joining two vertices of the DG-polytope and lying on the boundary of the DG, all edges in $E_i$ have to follow the same projection rule, i.e. edge directions after mapping are identical to each other. If the path consists of floating nodes, create a set $E_i$ of all edges parallel to each other.
5. COMPLEXITY ISSUES AND SCALING
The time bound for both algorithms is limited by $|V| - 1$ stages, where $V$ is the set of nodes. The average number of computations per stage is proportional to the number of edges incident to a node. It is apparent that all computations along the hyperplane orthogonal to the flow of data have no mutual dependency. Therefore they can be executed simultaneously. In general there is always a certain degree of dependency, which dictates the sequence of the computation. The choice of the order in which nodes are placed in each step influences the computation time but has no effect on the end result.
Take the image detector example with an image length of 3 and a mask width of 3 and find the set $\{GP_i\}$ for the constraints as given in Figure 4. The local maxima of the peaks increase as the computation gradually proceeds because, within the search path, the internal nodes have a higher degree of freedom than the boundary nodes. For a small DG this is not a problem, but as the size of the DG increases these peaks grow exponentially. This may cause the algorithm to run out of memory before reaching a solution. There are two ways to solve this problem. One is to add additional internal constraints concentric to the boundary constraint. This reduces the internal peaks and speeds up the calculation while still guaranteeing that the set of optimal solutions $\{GP_i\}$ is the same. Since the result is invariant to the size according to Theorem 4.1, another way is to solve for small arrays and then scale up the result to the required size. The complexity of the scaling algorithm is $O(W + E)$, where $W$ is the number of nodes in DG and $E$ is the number of edges in RDG.
6. MAPPING TO A FIXED SIZE ARRAY
A major area of research on systematic design methods is dedicated to the general problem of mapping classes of algorithms onto regular array processors with a limited number of processing elements, communication links or memory size. Systematic design of processor arrays with a given dimension and a given number of PE's is called partitioning. Existing approaches to the partitioning problem, however, only partially treat problems such as mapping directly from an M-dimensional to a K-dimensional space, where M > K. Another point is that these approaches are bound to special structures. A unified approach to the solution of the partitioning problem that realizes all known partitioning schemes [TEI93] and covers both linear and nonlinear mapping is not available. The algorithm presented in this paper can be used to map onto arrays with limited resources. Either an upper limit on the number of PE's can be used or a boundary representation (b-reps) can be defined.
7. CONCLUSIONS
A systematic approach is presented for mapping algorithms onto array processors. This approach uses the branch-and-bound technique to find the set of all optimal solutions. The power of this approach lies in the ability to generate the set of possible mapping alternatives using mixed linear and non-linear mapping. It has also been shown that the resulting set is limited and independent of the problem size. This is especially interesting for modeling large and complex problems. Further, mapping from M-dimensional space to K-dimensional space, where M > K, is done in one step.
For mapping to fixed size arrays, it has been shown that different partitioning techniques can be modeled in the algorithms using regularized Boolean set operations for the design of 2- and 3-dimensional array processors.
