In this paper, we present some novel algorithms for scheduling hierarchical signal flow graphs in the domain of high-level synthesis. There are several key contributions of this paper. First, we develop a novel extension of the force-directed scheduling problem which naturally handles loops and conditionals by coming up with a scheme of scheduling hierarchical signal flow graphs. Second, we develop three new parallel algorithms for the scheduling problem. Third, our parallel algorithms are portable across a wide range of parallel platforms. We report results on a set of high-level synthesis benchmarks on an 8-processor SGI Challenge and a network of 4 SUN SPARCstation 5 workstations. Finally, while some parallel algorithms for VLSI CAD reported by earlier researchers have shown a loss in the quality of results, our parallel algorithms produce exactly the same results as the sequential algorithms on which they are based.
Scheduling is a major step in high level synthesis. It assigns each node in the graph to a specific time step. Numerous algorithms have been proposed for scheduling [3], [5], [6], [7]. Force directed scheduling [3], introduced by Paulin and Knight, tries to minimize the number of resources for a given latency by smoothing out the resource requirement across the time steps within that latency.
The work described in this paper is part of an ongoing project called ProperCAD [11], which is aimed at developing an integrated suite of parallel applications for VLSI CAD that run on a variety of parallel platforms. In the past, various tools have been developed to solve problems at lower levels in the VLSI CAD hierarchy, including placement [12], logic synthesis [14], test generation [16], fault simulation [13] and behavioral simulation [15]. In this paper, we develop a novel extension of the force-directed scheduling problem which naturally handles loops and conditionals by coming up with a scheme of scheduling hierarchical signal flow graphs. We assume that the overall latency of the signal flow graph is given, and attempt to minimize the resource requirement. We also describe parallel algorithms which give the same quality of results as the sequential algorithm.
In a related work, some researchers [10] have proposed a distributed version of force directed list scheduling (FDLS) in which each processor executes the FDLS algorithm on a separate module set, in contrast to our approach.
Basic force directed scheduling
The basic force directed scheduling algorithm is an iterative algorithm which schedules one operation in each iteration. The operation to be scheduled is selected based on a quantity termed force, defined for each step. It is a measure of concurrency at that step. Force directed scheduling tends to balance the concurrency at each step without lengthening the execution time. First, it calculates the time frame of each operation, namely the time interval from the earliest start time to the latest start time for that operation. This is done by calculating the as soon as possible (ASAP) and as late as possible (ALAP) schedules of the graph. At each iteration, the algorithm determines the force of assigning each node to each step in its time frame. It then selects the node with the least force and schedules it to the corresponding step. The force consists of two components, self and predecessor-successor (ps) forces.
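The time frame computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes single-cycle operations, time steps numbered from 1, and a graph given as a node list plus edge list; the function name `time_frames` is hypothetical.

```python
from collections import defaultdict

def time_frames(nodes, edges, latency):
    """Return {node: (asap, alap)} assuming unit-delay operations (an assumption)."""
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Topological order via Kahn's algorithm.
    indeg = {n: len(preds[n]) for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # ASAP: one step after the latest predecessor.
    asap = {}
    for n in order:
        asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)
    # ALAP: one step before the earliest successor, bounded by the latency.
    alap = {}
    for n in reversed(order):
        alap[n] = min((alap[s] for s in succs[n]), default=latency + 1) - 1
    return {n: (asap[n], alap[n]) for n in nodes}

# A chain a -> b -> c plus an independent node d, with a latency of 4 steps.
frames = time_frames(["a", "b", "c", "d"], [("a", "b"), ("b", "c")], latency=4)
```

Nodes on the critical path get narrow time frames, while the unconstrained node `d` may be placed at any of the four steps.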
The concurrency of operations is captured by a distribution graph (DG) for each operation type. For each operation type k, the DG in step i is given by
$$DG_k(i) = \sum_{op} Prob(op, i),$$
the sum being taken over all operations of type k, where $Prob(op, i)$ is the probability that operation $op$ is assigned to step i. The self force associated with the assignment of an operation with time frame from t to b to time step j ($t \le j \le b$) is given by
$$SelfForce(j) = \sum_{i=t}^{b} DG(i)\, x(i),$$
where $x(i)$ is the change in the probability of the operation at step i as a result of the assignment. The predecessor force for the assignment of a node to a step j is the sum of the forces of the predecessors of the node arising out of the change in their time frames and the resulting change in concurrency. For a particular predecessor, it is quantified as
$$PredForce = \frac{\sum_{i=nt}^{nb} DG(i)}{nb - nt + 1} - \frac{\sum_{i=t}^{b} DG(i)}{b - t + 1},$$
where the interval from t to b is the old time frame and that from nt to nb is the new time frame. The successor force is defined analogously.
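The distribution graph and the forces described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: a single operation type, single-cycle operations, and the uniform operation probability 1/(b - t + 1) over a time frame, as in the standard formulation of [3]. The function names are hypothetical.

```python
def distribution_graph(frames, latency):
    """DG[i] = sum of uniform operation probabilities at step i (one op type)."""
    dg = [0.0] * (latency + 2)          # index 0 unused; steps are 1..latency
    for t, b in frames.values():
        for i in range(t, b + 1):
            dg[i] += 1.0 / (b - t + 1)
    return dg

def frame_change_force(dg, old, new):
    """Force caused by changing a time frame from old=(t, b) to new=(nt, nb)."""
    (t, b), (nt, nb) = old, new
    new_avg = sum(dg[nt:nb + 1]) / (nb - nt + 1)
    old_avg = sum(dg[t:b + 1]) / (b - t + 1)
    return new_avg - old_avg

def self_force(dg, frame, j):
    """Self force of fixing an operation at step j: its frame collapses to [j, j]."""
    return frame_change_force(dg, frame, (j, j))

# Operation "a" may go to step 1 or 2; operation "b" is fixed at step 1.
frames = {"a": (1, 2), "b": (1, 1)}
dg = distribution_graph(frames, latency=2)
force_a_step1 = self_force(dg, frames["a"], 1)
force_a_step2 = self_force(dg, frames["a"], 2)
```

Step 1 is already crowded by `b`, so placing `a` there yields a positive force and placing it at step 2 yields a negative one; the algorithm would prefer step 2.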
Scheduling of conditionals and loops in the basic force directed scheduling scheme
The alternates of a conditional are mutually exclusive. Thus, for a step in which mutually exclusive operations intersect, only the probability of the operation with the highest probability is added to the corresponding DG. When a loop is part of the behavioral description, the user has to specify a constraint on the loop iteration time or, alternatively, a constraint on the number of structural units available.
Further details of the basic force directed scheduling can be found in 
Parallel force directed scheduling on nonhierarchical graphs
In this section, we look at the force directed scheduling problem for simple acyclic graphs. We describe algorithms for graphs with conditionals and loops in Sections 4 and 5. For the parallel implementation of basic force directed scheduling, we have taken two problem partitioning approaches: node-based and step-based.
Node based problem partitioning
The nodes are partitioned among the processors such that each processor gets an approximately equal amount of work. The details of the cost function and the partitioning strategy appear in [2]. We can define a local DG for each processor, which is the DG calculated using only the nodes owned by that processor. The global DG is obtained by combining the local DGs of all the processors.
At each iteration, each processor calculates the forces of each owned node to each feasible time step, and sends the information to a master processor. The master determines the node with the least force and schedules it to the step with the least force. It then broadcasts that information to all the processors. Each processor updates the time frames of the other nodes and recalculates its local DG. The processors then perform a global reduction operation by which the local DGs are combined to get the global DG. Figure 1 shows an example non-hierarchical graph and a sample node-based decomposition among four processors. We report experimental results on this algorithm in Section 6.
Figure 1. Node-based decomposition for a graph for 4 processors
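The relationship between local and global DGs in the node-based approach can be illustrated with a small single-process simulation. This is only a sketch: the cyclic assignment of nodes to processors is an assumption made for brevity (the actual partitioning uses the cost function of [2]), and the element-wise sum stands in for a message-passing reduction such as `MPI_Allreduce`.

```python
def local_dg(owned_frames, latency):
    """DG computed only from the time frames of the nodes one processor owns."""
    dg = [0.0] * (latency + 1)          # index 0 unused; steps are 1..latency
    for t, b in owned_frames:
        for i in range(t, b + 1):
            dg[i] += 1.0 / (b - t + 1)
    return dg

def global_dg(local_dgs):
    """Element-wise sum of the local DGs, standing in for a global reduction."""
    return [sum(column) for column in zip(*local_dgs)]

frames = [(1, 2), (1, 1), (2, 3), (1, 3)]      # time frames of four nodes
num_procs = 2
# Cyclic ownership is an assumption of this sketch, not the paper's strategy.
owned = [frames[p::num_procs] for p in range(num_procs)]
combined = global_dg([local_dg(o, latency=3) for o in owned])
```

Because the DG is a sum over nodes, the reduced result matches the DG a single processor would compute from all nodes, which is why the parallel schedule can agree with the sequential one.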
Step based problem partitioning
In this approach, the time steps are partitioned among the processors rather than the nodes, such that each processor gets an approximately equal amount of work. Let T be the set of time steps owned by processor p, and let S be the set of nodes whose time frames intersect T. Processor p calculates the forces of each node in S to each time step in T that its time frame intersects. The force calculation and scheduling proceed as in the case of node-based partitioning, except that this approach does not have to perform a global reduction of DGs, since each processor calculates the overall DGs of the steps it owns.
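A small sketch of the step-based decomposition follows. It assumes contiguous, equally sized blocks of time steps per processor, which is a simplification (the text balances by work, not by step count), and a single operation type with uniform probabilities. All function names are hypothetical.

```python
def step_partition(latency, num_procs):
    """Split steps 1..latency into contiguous blocks, one per processor."""
    base, extra = divmod(latency, num_procs)
    blocks, start = [], 1
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def nodes_for(block, frames):
    """The set S: nodes whose time frame intersects the block T of owned steps."""
    if len(block) == 0:
        return set()
    lo, hi = block[0], block[-1]
    return {n for n, (t, b) in frames.items() if t <= hi and b >= lo}

def owned_dg(block, frames):
    """Overall DG restricted to the owned steps; no cross-processor reduction."""
    return {i: sum(1.0 / (b - t + 1) for t, b in frames.values() if t <= i <= b)
            for i in block}

frames = {"a": (1, 2), "b": (3, 4), "c": (2, 3)}
blocks = step_partition(latency=4, num_procs=2)
s0 = nodes_for(blocks[0], frames)
s1 = nodes_for(blocks[1], frames)
dg0 = owned_dg(blocks[0], frames)
```

Node `c` straddles the block boundary, so both processors evaluate it, each for the steps it owns. This duplication is the price paid for avoiding the DG reduction.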
Hierarchical force directed scheduling
Previous force directed scheduling approaches for hierarchical signal flow graphs require the user to specify the latency of each loop. We extend the basic force directed scheduling scheme to naturally handle hierarchical signal flow graphs, so that the latencies of loops are selected automatically to minimize the total resource requirement.
A hierarchical graph is a signal flow graph in which each node can be a block, which is a set of nodes enclosed by a source and a sink; an atomic operation (e.g., an addition or a multiplication); a loop, which encloses a block; or a conditional structure, which encloses one or more alternates, each of which is a block. Figure 3 shows an example hierarchical signal flow graph.
Figure 3. A hierarchical signal flow graph
Distribution graph of a hierarchical entity
For each hierarchical entity h, we can associate a corresponding subDG for each operation type k, given by the tuple $subDG_k = \langle P, lb, ub, k \rangle$, where lb and ub are the lower and upper limits for the earliest and latest start times, respectively, for all nodes enclosed by h; P is a vector such that $P(i)$, $lb \le i \le ub$, is the sum of the probabilities over all nodes $n \in h$ that n will be executed during time step i; and k is the type of the operation (e.g., adder or multiplier). The subDG of a hierarchical entity depends on the type of the entity. The calculation of subDGs for the various types of hierarchical entities is explained below. In the following, $subDG_k(i)$ means the same as $P(i)$, the ith element of the member P of the tuple $subDG_k$.
The subDG for an atomic node a of type k is given by
$$subDG_k(i) = \frac{1}{ub - lb + 1}, \quad lb \le i \le ub,$$
that is, the node is equally likely to be scheduled at any step of its time frame.
For a loop structure, we determine the maximum and minimum latencies possible for the loop body, recursively calculate the subDG of the loop body for each start time of the loop source and for each corresponding latency, and compute their weighted sum. The weights are the probabilities that the loop body is assigned that latency and the loop source has that particular start time. We assume that all the possible start times of the source, and all latencies for a particular start time, are equally probable.
For a conditional structure, we recursively calculate the subDGs of each alternate of the conditional. The overall subDG is the DG formed by taking the stepwise maximum of these subDGs at each step. Let the conditional structure have m alternates with subDGs $subDG_k^{1}, \ldots, subDG_k^{m}$. Then
$$subDG_k(i) = \max_{1 \le j \le m} subDG_k^{j}(i).$$
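The subDG rules for atomic nodes, blocks, and conditionals can be sketched as below. This is an illustration under assumptions, not the paper's code: subDGs are represented as dictionaries from time step to probability for a single operation type, atomic operations take one cycle with uniform probability over their time frame, and all function names are hypothetical.

```python
def atomic_subdg(lb, ub):
    """Uniform probability over the time frame [lb, ub] of a single-cycle operation."""
    p = 1.0 / (ub - lb + 1)
    return {i: p for i in range(lb, ub + 1)}

def block_subdg(children):
    """Stepwise sum of the subDGs of the nodes enclosed by a block."""
    out = {}
    for child in children:
        for i, p in child.items():
            out[i] = out.get(i, 0.0) + p
    return out

def conditional_subdg(alternates):
    """Stepwise maximum over the subDGs of mutually exclusive alternates."""
    steps = set().union(*(alt.keys() for alt in alternates))
    return {i: max(alt.get(i, 0.0) for alt in alternates) for i in sorted(steps)}

# "then" branch: two additions; "else" branch: one addition fixed at step 1.
then_branch = block_subdg([atomic_subdg(1, 2), atomic_subdg(2, 2)])
else_branch = atomic_subdg(1, 1)
cond = conditional_subdg([then_branch, else_branch])
```

Only one alternate executes at run time, so the conditional needs at each step only as many units as its most demanding alternate, which the stepwise maximum captures.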
The calculation of subDGs can be sped up considerably based on the observation that subDG depends only on the relative positions of the source and the sink, not the absolute values of their earliest or latest start times. Thus during the calculation of subDG for a hierarchical entity, the subDGs of its components can be calculated once for each latency and reused for different values of start times.
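The loop rule and the reuse of per-latency results can be sketched together. The sketch assumes, as the text does, that start times and the latencies available for a given start time are equally probable. The representation of a loop body as a function from latency to a tuple of probabilities at offsets 0..latency-1 is an assumption of this example, as are all names.

```python
from functools import lru_cache

def loop_subdg(body_profile, starts, latencies_for):
    """Average the body's profile over equally likely (start, latency) pairs.

    body_profile(latency) -> tuple of probabilities at offsets 0..latency-1
    starts                -> feasible start steps of the loop source
    latencies_for(start)  -> feasible body latencies for that start
    """
    cached = lru_cache(maxsize=None)(body_profile)   # one evaluation per latency
    out = {}
    for s in starts:
        lats = latencies_for(s)
        weight = 1.0 / (len(starts) * len(lats))
        for lat in lats:
            # The profile depends only on the latency; the start time shifts it.
            for offset, p in enumerate(cached(lat)):
                out[s + offset] = out.get(s + offset, 0.0) + weight * p
    return out

calls = []                                   # records which latencies were evaluated

def one_op_body(latency):
    """Hypothetical loop body holding one operation, uniform over the body latency."""
    calls.append(latency)
    return tuple([1.0 / latency] * latency)

dg = loop_subdg(one_op_body, starts=[1, 2],
                latencies_for=lambda s: [1, 2] if s == 1 else [1])
```

Latency 1 is feasible for both start times, yet the body profile for it is computed only once and then shifted, which is the saving the paragraph above describes.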
The scheduling step
The hierarchical force directed algorithm proceeds level by level, scheduling the nodes at each higher level before proceeding to the next lower level. Consider a node n. Let Succ(n, k) be the set of successors of n of type k and let Pred(n, k) be the set of predecessors of n of type k which are in the same level as n in the signal flow graph. Define SuccDG(n, k, c) to be the aggregate DG of all nodes of type k in Succ(n, k) when n is scheduled to step c. Similarly, PredDG(n, k, c) is the aggregate DG of all nodes in Pred(n, k) when n is scheduled to step c. The self force of a control node is defined to be zero. The self force for a non-control node is the same as that in the case of the basic force directed scheduling algorithm.
The predecessor force of n for time step c is computed from PredDG(n, k, c) and is weighted by a look-ahead factor. The successor force is defined analogously, with SuccDG(n, k, c) used instead of PredDG(n, k, c). The total force is given by the sum of the self, predecessor and successor forces. The total force of each node to each feasible time step is determined, and the node with the least force is scheduled. Figure 4 shows the syntax tree of the example hierarchical signal flow graph of Figure 3, and the levels of the various nodes.
Parallel hierarchical force directed scheduling
In this section, we describe a parallel algorithm for the hierarchical scheduling approach explained in the previous section. The processors are assigned to the hierarchical nodes according to a cost function (specified in the next section) that depends on the number of nodes enclosed by the node. We call the set of processors assigned to a node n the process group of n. For nodes enclosed by a hierarchical node n, processors in the process group of n are assigned based on the cost function. For the atomic nodes, the processors in the process group of the immediately enclosing hierarchical node are assigned cyclically. For the graph in Figure 4, the process group of node 9 is {0, 1}.
The processors in the process group of each hierarchical node collectively calculate the subDGs of that node. This is done in a bottom-up manner, since the calculation of the subDG for a hierarchical entity requires the subDGs of the nodes it encloses. The processors calculating the subDGs of nodes at a particular level communicate them to the processors in the process group of the hierarchical node enclosing those nodes by a group-level gather operation. Once this process completes with the calculation of the subDGs of the root node, all processors have, in their local memories, all the subDGs needed to calculate forces for the nodes assigned to them.
Since the subDGs of all nodes at a level are known, the calculation of forces at any given level closely resembles that in the case of the basic force directed scheduling algorithm.
The parallel algorithm has three major steps: the node partitioning step, the pre-processing step, and the force calculation and scheduling step.
Node partitioning step
In the node partitioning step, the nodes are distributed among the processors. Associated with each hierarchical entity is a height, which denotes the depth of the syntax tree corresponding to the nodes in that entity. Consider, for example, the nodes in level 1. This level contains hierarchical entities having possibly different heights. Consider the set of nodes with a given height (say h). The processors are partitioned across all the nodes in that set so that the number of processors assigned to node i is proportional to a cost function C(i), which is an approximation of the work involved for that node.
If i is a hierarchical node, C(i) is computed from N(i), the set of nodes enclosed by i; otherwise, C(i) = ALAP(i) - ASAP(i) + 1.
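The proportional assignment of processors can be sketched as follows. Two points here are assumptions of this example rather than statements from the text: the cost of a hierarchical node is taken to be the sum of the costs of its enclosed nodes, and fractional shares are resolved by largest-remainder rounding. The dictionary-based node representation and the function names are hypothetical.

```python
def cost(node):
    """Leaf: ALAP - ASAP + 1 (from the text). Hierarchical: sum over children (assumed)."""
    if "children" in node:
        return sum(cost(c) for c in node["children"])
    return node["alap"] - node["asap"] + 1

def allocate(costs, num_procs):
    """Share num_procs among nodes in proportion to cost (largest-remainder rounding)."""
    total = sum(costs)
    exact = [c * num_procs / total for c in costs]
    counts = [int(x) for x in exact]
    # Hand the processors lost to truncation to the largest fractional parts.
    leftovers = sorted(range(len(costs)), key=lambda i: exact[i] - counts[i],
                       reverse=True)
    for i in leftovers[:num_procs - sum(counts)]:
        counts[i] += 1
    return counts

loop = {"children": [{"asap": 1, "alap": 3}, {"asap": 2, "alap": 4}]}
cond = {"children": [{"asap": 1, "alap": 3}]}
block = {"children": [{"asap": 2, "alap": 2}, {"asap": 3, "alap": 4}]}
costs = [cost(loop), cost(cond), cost(block)]
shares = allocate(costs, num_procs=8)
```

A node with a very small cost can receive zero processors under this rounding; a fuller implementation would have such nodes share a processor, which this sketch does not model.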
5.2 Pre-processing step
In the pre-processing step, the subDGs for each hierarchical entity for all feasible latencies for that entity that arise during the force calculation and scheduling step are calculated and stored.
5.3 Force calculation and scheduling
After the pre-processing step, each processor calculates the forces of its nodes in the current level to each feasible step and sends its best node to the master processor. The master selects the node with the least force and schedules it to the corresponding step. It then broadcasts the information to all processors. As a result of this scheduling, the pre-calculated subDGs of the hierarchical node enclosing the scheduled node, and of all hierarchical nodes on the path from that node to the root of the hierarchy, are no longer valid. Thus, they are recalculated. The recalculation can be performed efficiently, as explained in [2].
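The selection and invalidation logic of this step can be sketched without any message passing. In the sketch, each "processor" is just a dictionary of forces, the master is a plain function, and the hierarchy is a child-to-parent map; these representations and names are assumptions made for illustration.

```python
def best_local(forces):
    """forces: {(node, step): force} for the nodes owned by one processor."""
    (node, step), force = min(forces.items(), key=lambda kv: kv[1])
    return force, node, step

def master_select(candidates):
    """Pick the globally least force; ties break on node name, then step."""
    return min(candidates)

def stale_entities(node, parent):
    """Hierarchical nodes on the path from the scheduled node up to the root."""
    path = []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

proc0 = {("a", 1): 0.5, ("a", 2): -0.5}
proc1 = {("b", 1): 0.1}
candidates = [best_local(proc0), best_local(proc1)]
force, node, step = master_select(candidates)

parent = {"a": "loop1", "b": "root", "loop1": "root"}
to_recompute = stale_entities(node, parent)
```

A deterministic tie-break in the master is one simple way to keep the parallel schedule identical to the sequential one regardless of how nodes are distributed. Only the entities in `to_recompute` need their subDGs refreshed after the broadcast.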
An example illustrating the algorithm appears in [2] .
The parallel hierarchical force directed scheduling algorithm was implemented using the portable Message Passing Interface (MPI) [4], which has been ported to a variety of parallel machines. We have tested the algorithm on a set of high level synthesis benchmarks. The resulting schedules were optimal for most cases and very close to optimal for the others. The quality of the results (in terms of the hardware resources used) for each latency case is identical in the sequential and parallel executions. The run times for some benchmarks for different latencies are shown below. The results are shown for an 8-processor SGI Power Challenge shared memory multiprocessor and a network of SUN SPARCstation 5 workstations. The results for the Intel Paragon multicomputer appear in [2]. Tables 1 and 2 show the runtimes and speedups for the node-based approach for non-hierarchical graphs, and Tables 3 and 4 show the runtimes and speedups for the step-based approach for non-hierarchical graphs. Tables 5 and 6 show the runtimes and speedups for hierarchical graphs.
We observe that when the latency is close to the minimum, the time frames of the nodes are small, in which case the total work involved in calculating the forces is relatively small. This results in low overall running time and poor speedup in those cases.
In this paper, we presented some novel algorithms for scheduling hierarchical signal flow graphs in the domain of high-level synthesis. In some related work, we are developing parallel algorithms for behavioral simulation [15].
In the future, we will pursue parallel algorithms for other tasks in high-level synthesis, such as allocation, and in hardware/software codesign and cosimulation. We will integrate all these tools with the parallel tools developed for solving the CAD problems at the lower levels, such as placement, routing, logic synthesis, and testing. All these tools will run in a portable manner on a large variety of parallel and distributed platforms.
Table 2. Results for node-based approach for non-hierarchical graphs on a network of workstations
Table 3. Results for step-based approach for non-hierarchical graphs on an SGI Power Challenge multiprocessor
Table 4. Results for step-based approach for non-hierarchical graphs on a network of workstations
