Parallel Algorithms for Force Directed Scheduling of Flattened and Hierarchical Signal Flow Graphs by Prabhakaran, Pradeep & Banerjee, Prithviraj
March 1996 UILU-ENG-96-2209
CRHC-96-05
Center for Reliable and High Performance Computing
Parallel Algorithms for Force 
Directed Scheduling of Flattened 
and Hierarchical Signal Flow Graphs
Pradeep Prabhakaran and Pritiiviraj Banerjee
Coordinated Science Laboratory 
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE
REPORT DOCUMENTATION PAGE
la. REPORT SECURITY CLASSIFICATION
Unclassified
1b. RESTRICTIVE MARKINGS 
None
2a. SECURITY CLASSIFICATION AUTHORITY
2b. DECLASSIFICATION /DOWNGRADING SCHEDULE
3. DISTRIBUTION/AVAILABILITY OF REPORT 
Approved for public release; 
distribution unlimited
4. PERFORMING ORGANIZATION REPORT NUMBER(S) 5. MONITORING ORGANIZATION REPORT NUMBER(S)
UILU-ENG-96-2209
CRHC-96-05
6a. NAME OF PERFORMING ORGANIZATION 
Coordinated Science Lab 
University of Illinois
6b. OFFICE SYMBOL 
(If applicable)
N/A
7a. NAME OF MONITORING ORGANIZATION
Semiconductor Research Corporation
6c ADDRESS (G'ty, State, and ZIP Code)
1101 W. Springfield Avenue 
Urbana, IL 61801
7b. ADDRESS (City, State, and ZIP Code)
Research Triangle Park, NC 27709
8a. NAME OF FUNDING/SPONSORING 
ORGANIZATION
7a
8b. OFFICE SYMBOL 
(If applicable)
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER
8c ADDRESS (City, State, and ZIP Code) 10. SOURCE OF FUNDING NUMBERS
PRO G RAM PROJECT TASK W O R K  UNIT
7b
ELEMENT NO. NO. NO. ACCESSION NO.
11. TITLE (Include Security Classification)
Parallel Algorithms for Force Directed Scheduling of Flattened and Hierarchical Signal Flow
12. PERSONAL AUTHOR(S)
Pradeep Prabhakaran and Prithviraj Banerjee
13a. TYPE OF REPORT
Technical
13b. TIME COVERED 
FROM TO
14. DATE OF REPORT (Year, Month, Day)March 1996 |15. PAGE COUNT29
16. SUPPLEMENTARY NOTATION
17. COSATI CODES 18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number)
FIELD GROUP SUB-GROUP high-level synthesis, force-directed scheduling,
hierarchical graphs, parallel algorithms, multiprocessors,
networks of workstations
19. ABSTRACT (Continue on reverse if necessary and identify by block number)
In this paper, we present some novel algorithms for scheduling hierarchical signal flow graphs in the domain of 
high-level synthesis. With complex chips that need to be designed in the future, it is expected that the runtimes of these 
scheduling algorithms will be quite large. There are several key contributions of this paper. First, we develop a novel 
extension of the force-directed scheduling problem which naturally handles loops and conditionals by coming up with 
a scheme of scheduling hierarchical signal flow graphs. Second, we develop three new parallel algorithms for the 
scheduling problem. Third, our parallel algorithms are portable across a wide range of parallel platforms. We report 
results on a set of high-level synthesis benchmarks on 8-processor SGI Challenge, a 16 processor Intel Paragon, and a 
network of 4 SUN SPARCstation5 work stations. Finally, while some parallel algorithms for VLSI CAD reported by 
earlier researchers have reported a loss of qualities of results, our parallel algorithms produce exactly the same results 
as the sequential algorithms on which they are based.
20. DISTRIBUTION/AVAILABILITY OF ABSTRACT
SUNCLASSIFIEDAJNUMITED □  SAME AS RPT. □  DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION 
Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL 22b. TELEPHONE (Include Area Code) 22c. OFFICE SYMBOL
DD FORM 1473,84 MAR 83 APR edition may be used until exhausted. 
All other editions are obsolete.
SECURITY CLASSIFICATION OF THIS PAGE
UNCLASSIFIED
Parallel Algorithms for Force Directed Scheduling of Flattened and
Hierarchical Signal Flow Graphs
Pradeep Prabhakaran Prithviraj Banerjee
Center for Reliable and High-Performance Computing 
Coordinated Science Laboratory 
University of Illinois 
1308 West Main St.




In this paper, we present some novel algorithms for scheduling hierarchical signal flow graphs in the domain of 
high-level synthesis. With complex chips that need to be designed in the future, it is expected that the runtimes of these 
scheduling algorithms will be quite large. There are several key contributions of this paper. First, we develop a novel 
extension of the force-directed scheduling problem which naturally handles loops and conditionals by coming up with 
a scheme of scheduling hierarchical signal flow graphs. Second, we develop three new parallel algorithms for the 
scheduling problem. Third, our parallel algorithms are portable across a wide range of parallel platforms. We report 
results on a set of high-level synthesis benchmarks on 8-processor SGI Challenge, a 16 processor Intel Paragon, and a 
network of 4 SUN SPARCstation5 work stations. Finally, while some parallel algorithms for VLSI CAD reported by 
earlier researchers have reported a loss of qualities of results, our parallel algorithms produce exactly the same results 
as the sequential algorithms on which they are based.
KEYWORDS: High-level synthesis, force-directed scheduling, hierarchical graphs, parallel algorithms, multipro­
cessors, networks of workstations.
This research was supported in part by the National Science Foundation under grant M1P-9320854, the Semiconductor Research Corporation under 
grant SRC 95-D P-109, and the Advanced Research Projects Agency under contract DAA-H04-94-G-0273 administered by the Army Research Office. 
We would also like to thank Intel Corporation for the donation of an Intel Paragon to the University of Illinois.
1 Introduction
With the rapid improvement in VLSI technology, circuit design is becoming extremely complex and is placing increas­
ing demands on CAD tools. Parallel processing is fast becoming an attractive solution to reduce the inordinate amount 
of time spent in VLSI circuit design [1], Many workstation vendors have announced products based on multiproces­
sors. Furthermore, most CAD environements have large networks of workstations that can be used as parallel machines. 
It is clear that many CAD sofware vendors will soon be announcing products that will exploit the parallelism in the 
hardware. This has been recognized by several researchers in VLSI CAD as is evident in the recent literature for cell 
placement [16], [19], [20], circuit extraction [18], test generation [21], [22] and logic synthesis [24], [30], [31].
In this paper, we present some novel algorithms for scheduling hierarchical signal flow graphs in the domain of 
high-level synthesis. High level synthesis translates behavioral description of a circuit, specified as a signal flow graph 
to register transfer level description. With the increasing complexity of systems being designed, automated high level 
synthesis is becoming increasingly important.
Scheduling is a major step in high level synthesis. It assigns each node in the graph to a specific time step. Numer­
ous algorithms have been proposed for scheduling [2], [4]r [5], [7], [8]. An excellent overview of existing scheduling 
algorithms appears in [6]. An integer linear programming approach to scheduling is described in [7] and [8]. Inte­
ger linear programming is suitable for small systems, but may result in very large run times for larger problems. List 
scheduling [4], [5] is a fast algorithm which tries to find a minimal schedule, given the resource constraints. Force di­
rected scheduling [2], introduced by Paulin and Knight tries to minimize the number of resources for a given latency by 
trying to smooth out the resource requirement for different time steps within a given latency. Force directed scheduling 
have been found to be extremely effective and has gained interest since its introduction among other research groups 
[9], [10], [11], [12], [13]. Force directed list scheduling (FDLS) [2],which is a variation of force directed scheduling 
in which the user specifies the resources available and the signal flow graph is scheduled with minimum latency given 
the resource constraints.
Scheduling algorithms for data flow graphs corresponding to small designs containing about 50-100 nodes can take 
several minutes to execute on conventional workstations. With complex chips that need to be designed in the future 
containing thousands of nodes, it is expected that the runtimes of these algorithms will reach hours. Designers of high
1
level synthesis tools wish to explore a very large design space of solutions with various architectural alternatives (e.g. 
number of adders, multipliers, registers, buses, memory ports,etc.) for various values of expected latencies. The run­
times for exploring all the different choices can easily take several hours. This task is very interactive in nature, and 
a designer typically would want to have the solutions in less than a minute,and not hours. In addition, researchers are 
proposing more complex cost functions to produce designs with higher speeds, lower power and increased testability 
during the scheduling process. Hence, it is important to investigate efficient parallel algorithms for the problem which 
can reduce the runtimes from hours to minutes. The work described in this paper is part of an ongoing project called 
ProperCAD [15] which is aimed at developing an integrated suite of parallel applications for VLSI CAD that run on a 
variety of parallel platforms. In the past, various tools have been developed to solve the problems at lower levels in the 
VLSI CAD hierarchy, namely, placement [16], routing [17], circuit extraction [18], design rule checking [26], logic 
synthesis [27], [27], test generation [25], [29], fault simulation [28], and behavioral simulation [23]. In this paper, 
we address the problem of scheduling in high-level synthesis.
There are several key contributions of this paper. First, we develop a novel extension of the force-directed schedul­
ing problem which naturally handles loops and conditionals by coming up with a scheme of scheduling hierarchical 
signal flow graphs. Second, we develop three novel parallel algorithms for the scheduling problem, one based on node- 
based partitioning, another based on time-step partitioning, and a final one based on hierarchical partitioning. Third, 
our parallel algorithms are portable across a wide range of parallel platforms which includes shared memory multi­
processors distributed memory message-passing multicomputers and networks of workstations. We report results on 
an 8-processor SGI Challenge shared memory multiprocessor, a 16 processor Intel Paragon distributed memory mul­
ticomputer, and a network of 4 SUN SPARCstation5 workstations. Fourth, we report actual implementations of these 
algorithms on real parallel machines, and report results of qualities and runtimes for various benchmark circuits on a 
variety of parallel machines. We show in the paper that we have obtained speedups of about 8 to 10 on 16 processors of 
an Intel Paragon. Finally, while some parallel algorithms for VLSI CAD reported by earlier researchers have reported 
on loss of qualities of results, our parallel algorithms produce exactly the same results as the sequential algorithms on 
which they are based.
In some related work, some researchers [14] have proposed a parallel system for distributed high level synthesis
2
which uses coarse-grained parallelism to explore and evaluate many alternative VLSI designs efficiently. They propose 
a distributed version of force directed list scheduling in which each processor executes the FDLS algorithm on a separate 
module set. Force directed scheduling algorithm for minimizing resource requirement for a given latency assumes that 
the user specifies the latency for each loop body. In comparison to that work, we propose an extension of force directed 
scheduling algorithm which naturally handles loops and conditionals, and which automatically selects the latencies of 
the loop bodies. We assume that the overall latency of the signal flow graph is given, and attempt to minimize the 
resource requirement.
In Section 2 the basic force directed scheduling algorithm is summarized. In Section 3 we propose two approaches 
to implementing parallel force directed scheduling algorithm for non-hierarchical graphs. In Section 4 we propose a 
sequential force directed scheduling algorithm for hierarchical graphs. A parallel implementation of the above appears 
in Section 5. In Section 6 we give some experimental results, and conclude in Section 7.
2 Basic force directed scheduling
The basic force directed scheduling algorithm is an iterative algorithm which schedules one operation each iteration. 
The operation to be scheduled is selected based on a quantity termed force  defined for each step. It is a measure of 
concurrency at that step. The force directed scheduling tends to balance the concurrency at each step without lengthen­
ing the execution time. First, it calculates the time frame of each operation, namely the time interval from the earliest 
start time to the latest start time for that operation. It is done by calculating the as soon as possible(ASAP) and as late 
as possible(ALAP) schedule of the graph. At each step, the algorithm determines the force of each node to each step 
in its time frame. It then selects the node with the least force and schedules it to the corresponding step. The force con­
sists of two components, s e lf  and predecessorsuccessor{ps) forces. Self force of a node n to a step s is a measure 
of the increase in concurrency due to scheduling that node to that step. Predecessor and successor forces are measures 
of increase in concurrency of the predecessors and successors, respectively as a result of scheduling n to s.
The concurrency of operations can be captured by a distribution graph (DG ) for each operation type. For each
3
operation type k, the DG  in step i is given by
DG(i) = Prob(op, i )
k
the sum being taken over all operations of type fc.The mechanical analog for DG is the spring constant.
The self force associated with the assignment of an operation with time frame from t to b to time step j  (t < j  < b) 
is given by
b
s e l f  - fo r  ce(j) =  £ P G ( 0  * *«)]
i—t
where x{i) is the change in operation probability (the mechanical analog being the displacement of spring), given by
x{j) = (h -  1 ) /h
and
x(i) =  —1/h
where h = (b — t + 1).
The predecessor force for the assignment of a node to a step j  is the sum of the forces of the predecessors of the 
node arising out of change in time frames and the resulting change in concurrency. For a particular predecessor, it is 
quantified as:
nb b
Force(nt,nb ) =  £  [DG(i)/(nb - n t +  1)] -  £ [O G (« ) /(6  -  t  + 1)] (1)
i—nt i=t
where the interval from t to b is the old time frame and that from nt to nb is the new time frame. The successor force 
is defined analogously.
4
The basic force directed scheduling algorithm is summarized below:
F o r c e - D i r e c t e d ()
1 repeat
2 Evaluate tim e fra m es  :
3 2.1.F ind  A S A P  schedule
4 2.2.F ind  A L A P  schedule
5 Update distribution graphs
6 calculate se lf, predecessor and successor forces,
1 Add them  to get the total force fo r  each feasible control step
8 Schedule operation w ith loruest fo rce ;
9 Set its  tim e fra m e  equal to the selected step.
10 until all operations scheduled
2.1 Scheduling of conditionals and loops in the basic force directed scheduling scheme
The alternates of a conditional are mutually exclusive. Thus, for a step in which mutually exclusive operations inter­
sect, the probability of only the operation with the highest probability is added to the corresponding DG. The force 
calculation and scheduling proceeds exactly the same way as described earlier for an acyclic graph without conditional 
structures.
When a loop is part of the behavioral description, the user have to specify a constraint on the loop iteration time or, 
alternately a constraint of the number of structural units available. For multiple embedded loops, the operations of the 
innermost loop are scheduled first, relative to the local timing constraint. After that, the entire loop is taken as a single 
operation with execution time equal to the loop’s local time constraint.
Further details of the basic force directed scheduling can be found in [2].
5
3 Parallel force directed scheduling on non-hierarchical graphs
In this section, we will look at the force directed scheduling problem of simple acyclic graphs, corresponding to straight 
line code sequences in a behavioral description. We describe algorithms for graphs with conditionals and loops in sec­
tions 4 and 5. For parallel implementation of the basic force directed scheduling, we have taken two problem partition­
ing approaches, node-based and step-based. In the node based method, we partition the nodes among the processors, 
and in step-based method, we partition the time steps among the processors.
3.1 Node based problem partitioning
The nodes are partitioned among the processors.The partitioning is done by first calculating the total available work, 
which is approximated by
n u m n o d e s
W  =  {A LA P (i) -  A S A P {i) +  1)
i= 0
and work per processor, given by W /num procs. Each processor selects a set of nodes S  such that J2ses(A L A P (s)  — 
A S A P (s))  is approximately equal to the work per processor.
We can define a local JDG for each processor which is the DG  calculated using only the nodes owned by that 
processor.The global JDG is obtained by combining the local JDGs of all the processors.
At each step, each node calculates the forces of each owned node to each feasible time step, and sends the infor­
mation to a m aster  processor. The m aster  determines the node with the least force and schedules it to the step with 
the least force. It then broadcasts that information to all the processors. Each processor, upon receiving this informa­
tion, updates the time frames of other nodes, and recalculates the local-DG. The processors then perform a global 
reduction operation by which the locaLDGs are combined together to get the global-DG. Since the time frames of 
the nodes could change as a result of the schedule operation, the load function is re-computed, and nodes re-assigned 
to processors, to minimize load imbalance.
The node based parallel force directed scheduling algorithm is as follows: 
P a r a l l e l - F o r c e - D i r e c t e d - N o d e ()
1 repeat
2 Compute the work per processor
6
3 D eterm ine the list o f  nodes on current processor
4 Evaluate tim e fra m es  :
5 4.1.F ind  A S A P  schedule
6 4.2.F ind  A L A P  schedule
7 Compute local distribution graph fo r  the assigned nodes
8 P er fo rm  global reduction to get the overall distribution graph
9 Calculate s e l f , predecessor and successor forces fo r  the assigned nodes,
10 Add them  to get the total force fo r  each feasible control step
11 Send the operation w ith loiuest force to the m aster
12 Receive the global best node and step ;
13 Set its tim e fra m e  equal to the selected step.
























x <- a + b; 
y <- c * d; 
tl <- x + c; 
t2 <- x * y; 
t3 <- y + d; 
t4 < -tl * t2; 
t5 < -12 + t3; 
o 1 <- a + t4; 
t6 <- b * t5; 





Figure 1: Node based'decomposition for a non-hierarchical graph for 4 processors
Figure 1 shows an example showing an non-hiearchical graph and a sample node-based decomposition among four 
processors. We report experimental results on this algorithm in Section 6.
3.2 Step based problem partitioning
In this approach, the time steps are partitioned among the processors rather than the nodes. The partitioning is done by 
first calculating the total available work, which, as in the node based scheme, is approximated by
numnodes
W =  (A L A P (i) -  A S A P {i) +  1)
i=0
and work per processor, given by W /num procs. Each processor selects a set of steps T  such that ^~2s(A L A P (s) — 
A S A P (s)  +  1), where the time frame of s intersects T, is approximately equal to the work per processor. Let S  be 
the set of nodes whose timeframes intersect the set T  of processor p.It calculates the forces of each node in S  to each 
time step in T  it intersects. It then sends the information to a m aster processor which selects the least force node.
8
The master then schedules that node to the step with the minimum force. It then broadcasts the information. All other
processors, on receiving the information, updates the time frames of other nodes, and recalculates the DGs of the steps 
in T. In contrast to the node-based approach, the step based approach does not have to perform a global reduction 
of DGs since it calculats the overall DGs of the steps it owns. As in the case of node-based approach, the steps‘are 
reassigned to processors after each iteration so that they get approximately equal amount of work.
The step based parallel force directed scheduling algorithm is as follows:
Pa r a l l e l - F o r c e - D i r e c t e d - S t e p ()
1 repeat
2 Compute work — per — processor
3 D eterm ine the list o f  steps fo r  the current processor
4 E valuate tim e fra m es  :
5 4.1.F ind  A S A P  schedule
6 4.2.F ind  A L A P  schedule
7 Compute distribution graph fo r  the assigned steps
8 Calculate s e l f , predecessor and successor forces fo r  the
9 nodes which intersect the assigned steps,
10 Add them  to get the total force fo r  each feasible control step
11 Send the operation w ith loivest force to the m aster
12 Receive the global best node and step ; set its  tim e fra m e
13 equal to the selected step.
14 until all operations scheduled
9
x <- a + h: 
y < -c  * d; 
tl <- x + c; 
t2 <- x * y;
13 <- y + d;
14 <- il * |2;
15 < - 12 + l.V,
(il <- a + t4;
Ifi <- b * t5; 
i)2 < - 14 * 16;
inputs: a,h,c,d
outputs : o l,o 2
Figure 2: Step-based decomposition for a non-hierarchical graph for 4 processors
Figure 2 shows an example showing an non-hiearchical graph and a sample step-based decomposition among four 
processors. We show experimental results of this parallel algorithm in Section 6.
4 Hierarchical force directed scheduling
Previous approaches for force directed scheduling work for hierarchical signal flow graphs assume that the user has to 
specify the latency of each loop. We extend the basic force directed scheduling scheme to naturally handle hierarchical 
signal flow graphs, so that the latencies of loops are selected automatically to minimize the total resource requirement.
Definition 1 A hierarchical graph is a signal flow graph in which each node can be
• a block, which is a set o f nodes enclosed by a source and a sink
• atomic operation (eg an addition or multiplication) or
• a loop which encloses a block
• a conditional structure which encloses one or more alternates each o f which are blocks.
il (x > y) then 
{
lor i := I lo ti do 
a |i | := e + d; 
h |i | :* a ♦ h: 
end lor:
l'or i := I lo n do 
x |i | := a |i | + le; 





il := t  *d ;
01 := il + a;
02 := Il + h;
)
Figure 3: A hierarchical signal flow graph
Figure 3 shows an example hierarchical signal flow graph. Node 0 is the source and node 4 is the sink. Node 1 
is a conditional. Nodes 5 and 7 are the source nodes of the alternates for the conditional. Node 9 and 11 are source 
nodes for loops. Nodes 9 and 10 are the source and the sink, respectively of the loop body of the loop whose source 
is 5. The nodes of a hierarchical graph can be broadly classified as control nodes and non-control nodes. Control 
nodes include source and sink  nodes which enclose hierarchical entities, loopsource  and loopsink  which enclose a 
loop, and i f  source  and i f  s i n k  which enclose a conditional. Control nodes do not require resource allocation. The 
loopsource  node contains information about the loop such as the number of iterations.The non — control nodes are the 
regular nodes (eg adder) which require resource allocation. We assign a level number to each node. The level number 
of a node corresponds to the depth of nesting of that node in the signal flow graph. The nodes enclosed by the outermost 
source — sink  pair are assigned a level number 0. The next inner level nodes (those enclosed by hierarchical nodes 





O  Conimi no
4.1 Distribution graph of a hierarchical entity
For each hierarchical entity h, we can associate a corresponding sub-DG, given by the tuple sub-DG£ =-< P, lb, ub, k y  
for each operation type k where lb and ub are the lower and upper limits for the earliest and latest start times respec­
tively for all nodes enclosed by h, P  is a vector such that P (i), lb < i < ub is the sum of the probabilities over all 
nodes n G h that n will be executed during time step i, and k is the type of the operation (eg adder, multiplier). The 
sub-DG  of a hierarchical entity depends on the type of the entity. The calculation of sub-DGs for the various types of 
hierarchical DGs are explained below. In the following, sub-DGkn{i) means the same as P (i), the i th element of the 
member P  of the tuple sub-DG^.
4.1.1 Calculation of sub-DG  for an atomic node
The sub-DG  for an atomic node a of type k is given by sub-DGkn A D G a,lbaiUba, k y ,  where lbn — A SA P (a), 
ubn =  A LA P (a) +  lk and
ADGnii) (A L A P (a) -  A S A P  (a) +  1)
where lk is the latency of operation of type k, and
a ( i , j )=  1 if j < i < j  + l
0 otherwise.
4.1.2 Calculation of sub-DG  for a block
For a straight line block of code B, we calculate the sub-DG  by recursively calculating the sub-DGs of each component 
node and adding them up for each step. Let -< P j jb j7ubjt k y  be the sub-DG  for node j ,  where j  G B. The overall 
sub-DG  of the block for each operation type k is given by sub.D GkB =-< B D G B J b B ,ubB ,k  y  where
lbB = min Ibjj £ B
ubB = ma xubj j € B
12
and
B D G b (ì ) =  Y , P j ( ')
jeB
4.1.3 Calculation of sub-DG  for a loop
For a loop structure, we determine the maximum and minimum latencies of the loop body, recursively calculate the
weighted sum, the weights being the probability that the loop body is assigned that latency and the loop source has that 
particular start time. We assume that all the possible start times of the source and all latencies for that particular start 
time are equally probable.
Let L be a loop structure which executes for n iterations.Let esrc and lsrc be the earliest and the latest start time of 
the source node of L and esnk and lsnk be the earliest and latest start times of the sink of L  respectively. For a particular 
start time s, s < (lsn k —n *  (esnk -  esrc)) for the loop source, the maximum latency of the loop body is given by
Let -< B D G lB ,lbB, ubB, k >- be the sub-DG  for the body of the loop for a given latency l and for operation type 
k. Then the sub-DG  for L, denoted by sub.DGs£l for start time s for the loop source and latency l and is given by
where L D G s£l {i) = B D G B (i mod l) , lba£l = s and ub*£l = s + n x  l. The overall sub-DG  for L, is given by the 
tuple sub-DGl =-< LD G L,lbL,ubL,k  >-, where lbL = esrc, =  lsnk and
sub-DG  of the loop body for each start time of the loop source and for each corresponding latency and compute their
m ax — {Gnk s)/fl
and the minimum latency by
I m in  —  {&snk ^ s r c ) / { B  1)
sub-DGs£l =-< LD G s£l ,lbs£l ,ubs£ l , k y
13
4.1.4 Calculation of sub-DG  for conditional structure
For a conditional structure C, we recursively calculate the sub-DGs of each alternate for the conditional. The overall 
sub-DG  is the DG  formed by taking the stepwise maximum of the above sub-DGs at each step. Let the conditional 
structure have m  alternates. Let -< P j,lb j,iib j,k  y  be the sub-DG  for the j th alternate. Then, the overall DG  of the 
conditional structure is given by sub-DGc =-< C D G c ,lb c ,u b c ,k  y  where
Ibc = min lb;
j=i
and
for each operation type k.
ubc — mâxubj 
j = i
C D G c(i) = max Pj(i)
4.2 The scheduling step
The hierarchical force directed algorithm proceeds level by level, scheduling the nodes at each higher level before 
proceeding to the next lower level. Consider a node n. Let Succ(n , k) be the set of successors of n of type k and 
let Prec(n, k ) be the set of predecessors of n  of type k which are in the same level as n  in the signal flow graph. 
Define Succ-DG(n, k , s ) to be the aggregate DG  of all nodes of type k in Succ(n, k ) when n  is scheduled to step 
c. Similarly, P red-D G (n, k , s) is the aggregate DG  of all nodes in Pred(n, k) when n is scheduled to step c. Let 
-< sub-DGj, lbj,ubj, kj y  be the sub-DG  of node j .  Then, Succ-DG(n, k, c) is given by
Succ-D G (n ,k ,c ) =-< S D G n,lbsn,ubsn,k  y
where lb*n = m inie5ucc(n)A:) lbt , ubsn = max t€Succ(n,k) ubt , and SD G n{j) =  T,teSucc(n,k) sub-DGt (j).
sub-DG t(j)  is assumed to be zero if j  > ubt or j  < lbt . 
Similarly,
P red-D G (n,k,c) =< P D G n,lbn;Ubpn,k n y
14
where lbsn = m int^ Pred{n,k) , ubsn =  max t€Pred(n,k) ubt , and P D G n {j) =  Y.tePred{n,k) sub.D G t {j).
The self force of a control node is defined to be zero. The self force for a non.control node is the same as that in 
the case of the basic force directed scheduling algorithm.
The predecessor force of n for time step c is given by
numtypes ub
E  B ' t  i\ + DG[k][i]/3) xDG[k]\i] (2)
A ~ 1  i=lb
and successor force by
numtypes ub
E  ElC-Pti] +  £>G[fc][t]/3) x £?G[*][i] (3)
A: — 1 i=lb
where the lb,ub and P  in the first and second formula equals the corresponding lb,ub and P  fields of P red.D G (n, k, c) 
and Succ.D G (n , k , c) respectively. The term DG[k][i\/3 is the look-ahead factor. The total force is given by the sum 
of self, predecessor and successor forces. The total force of all nodes to all feasible time steps are determined and the 
node with the least force is scheduled.
4.3 Speeding up the DG calculation by reuse of sub-DGs
For hierarchical entities, it can be observed that the subJDG depends only on the relative positions of the source and 
the sink, not the absolute values of their earliest or latest start times, as long as the scheduled status of the interior nodes 
doesnt change. Thus, the sub.DGs are invariant under translation. Since the calculation of subJDG of a hierarchical 
entity usually involves calculation of the sub-DGs of its components repeatedly with different values for the earliest 
start time of the source or sink, but with the same latency, the subJDG for those components can be calculated once, and 
then reused with suitable translation. This results in an order of magnitude improvement in run time. In practice, we 
maintain a table which stores the sub-DGs of hierarchical entities for each latency encountered so far. Each table entry 
has a valid  bit which indicates whether the corresponding subJDG is correct. Initially, all the entries in the table are 
invalid. When the algorithm needs the sub-DG  for a node for a given latency, it first checks the table. If it encounters 
a valid entry, it can directly use the stored result. If the table entry is invalid, it calculates the subJDG, stores it in the 
table and sets the valid bit. As long as none of the nodes with in the hierarchical node is scheduled, the table entry 
remains valid, and can be reused. After a schedule step, all entries in the table corresponding to the hierarchical entities
15
enclosing the nodes scheduled are invalidated. 
The overall sequential algorithm is:
F o r c e - D i r e c t e d - H i e r a r c h y ()
1 for level <— 0 to number o f  levels
2 do repeat
3 Evaluate tim e fra m es  :
4 3.1.F ind  A S A P  schedule
5 3.2.F ind  A L A P  schedule
6 Calculate se lf, predecessor and successor forces fo r  the nodes
I  Add them  to get the total force fo r  each feasible control step
8 D eterm ine the best node and step ; set its tim e fra m e  equal
9 to the selected step.
10 Invalidate the stored sub_DGs on the path fro m  the scheduled node to the root
II Update the overall DG
12 until all operations scheduled
16
Level
Figure 4: Syntax tree of example hierarchical graph with processor assignments 
Figure 4 shows the syntax tree of example hierarchical signal How graph of Figure 3, and the levels of various nodes.
5 Parallel hierarchical force directed scheduling
In this section, we describe a parallel algorithm for the hierarchical scheduling approach explained in the previous sec­
tion. As in the sequential case, the scheduling is done one level at a time starting from level 0, scheduling all nodes at 
a particular level before proceeding to the next. The processors are assigned to the hierarchical nodes based on a cost 
function (specified in the next section) based on the number of nodes enclosed by the node. Call the set of processors 
assigned to a node n the process group of n. For nodes enclosed by a hierarchical node n, processors in the process 
group of n are assigned based on the cost function. For nodes at the lowest level(the atomic nodes), the processors in 
the process group of the immediately enclosing hierarchical node are assigned cyclically. For the graph in Figure 4, the 
process group of node 9 is {0,1}. As in the case of the sequential algorithm, we store subJDGs in a table and reuse 
them as long as they are valid. However, since the sub-DGs calculated by a particular processor are stored in its local 
memory, it is not immediately accessible to other processors.
The processors in the process group of each hierarchical node collectively calculates the sub-DGs of that node. 
It is done in a bottom up manner, since calculation of sub-DG  for a hierarchical entity requires the subJDGs of the
17
nodes it enclose. The processors calculating the sub-DGs of nodes at a particular level communicates them with the 
processors in the process group of the hierarchical node enclosing those nodes by a group-level gather operation. Once 
this process completes with the calculation of sub-DGs of the root node, all processors have, in their local memories 
all the sub-DGs needed to calculate forces for the nodes assigned to them. This is due to the fact that scheduling is 
done in a top down manner. At a particular level l, each processor needs only the sub-DGs of the nodes enclosed by 
the hierarchical node immediately enclosing it (which is at level l -  1). Since all nodes at level / -  1 has been already 
scheduled, changes in time frames of the children of a particular node h in level l -  1 does not affect the time-frames 
of the children of another node t i  at level l -  1, and hence does not have to be included in the list of predecessors or 
successors used in calculating the ps force (Equations 2, 3 ) for a child of h. Thus, a processor calculating the force 
for a child of h needs only the sub-DGs of other children of h, which have been calculated and distributed among the 
processors in the process group of h.
Consider for example node 15 in Figure 3. When its force is calculated, node 11 has already been scheduled, since it 
is at a lower level. Thus, scheduling node 15 does not affect the time-frames of nodes 13 or 14, and hence those nodes 
does not contribute to the predecessor force of 15. Only the sub-DGs of node 16 is required which was calculated 
earlier by processors 2 and 3. Only those processors assigned to a particular hierarchical node performs the calculation 
of forces and sub-DGs for the nodes enclosed by the hierarchical node. One advantage of this static division of work 
is that we can exploit locality in reusing the sub-DGs stored in each processor’s local memory.
For each hierarchical entity, there is possibly more than one assigned processor. One of them is assigned ownership 
for that node. The role of the owner of a node is to update the sub-DG  of the node during scheduling when one of 
the nodes enclosed by the node is scheduled. Since the sub-DGs of all nodes at a level are known, the calculation of 
forces at any given level closely resembles that in the case of the basic force directed scheduling algorithm. The parallel 
algorithm has three major steps:
• Node partitioning step
• The pre-processing step, and
• The force calculation and scheduling step
18
5.1 Node partitioning step
In the node partitioning step, the nodes are distributed among the processors. Associated with each hierarchical entity 
is a height, which denotes the depth of the syntax tree corresponding to the nodes in that entity. Consider for example, 
the nodes in level 0. It contains hierarchical entities having possibly different heights. Consider the set of nodes with 
a given height (say h). The processors are partitioned across all the nodes in that set so that the number of processors 
assigned to node i is proportional to a cost funtion which is an approximation of the work involved for that node, given 
by:
C{i) = A L A P (i)  -  A S A P (i)  +  1 + 5 1  C O')
j£N(i)
if i is a hierarchical node and
C{i) = A L A P (i) -  A S A P (i) +  1
otherwise, where N (i)  is the set of nodes enclosed by i. A particular processor among the processors assigned to each 
hierarchical node is assigned ownership for that node. In level 1, the processors assigned to the parent hierarchical 
entity are distributed among the nodes contained in that entity and so on until we reach the leaves.
5.2 Pre-processing step
In the pre-processing step, the sub-DGs for each hierarchical entity for all feasible latencies for that entity that arise 
during the force calculation and scheduling step are calculated and stored. This is achieved by setting the latency of the 
immediately enclosing hierarchical entity to its maximum value within the limits of its A S A P  and A L A P  schedule, 
(since that results in the maximum possible flexibility for the nodes within). The subJDGs are calculated level by level, 
starting at the deepest. As the computation proceeds from one level to the next, the subJDGs of each node in the current 
level for the feasible latencies calculated in the current step are distributed to all processors in the process group of the 
node immediately enclosing the node in the hierarchy, by a gather operation. Once the pre-processing step is complete, 
all the processors have the subJDGs required for calculating the forces.
19
5.3 Force Calculation and Scheduling
After the pre-processing step, the processors calculate the forces of the nodes to each feasible step and sends the best 
node to a master processor. The master selects the node with the least force and schedules it to the corresponding 
step. It then broadcasts the information to all processors. Due to this scheduling, the pre-calculated sub-DG  of the 
hierarchical nodes enclosing the scheduled node (and all hierarchical nodes in the path from the node to the root node 
in the hierarchy) are no longer valid. Thus, they are recalculated. The sub-DG  of a node is recalculated by the processor 
owning the node and sent to the processor owning the parent of the node, starting at the level in which the nodes are 
being scheduled currently. Since the scheduling proceeds from the lowest to the highest level, all the control nodes 
enclosing the scheduled node have been scheduled, and the new sub-DG  doesnt have to be distributed to all processors 
in the group, since the sub-DGs are needed only for force calculation. However, the sub-DG  of the root node (which 
is the overall DG  of the graph) is required for calculation of forces and hence is broadcast to all processors by the 
owner processor of the root node. Consider for example the situation when node 15 is scheduled. The sub-DGs of 
nodes 11,5,1 and 0 become invalid. Processor 2, the owner of node 11 recalculates the sub-DG  of 11 and sends it to 
processor 0. Processor 0 receives that and recalculates the sub-DGs of nodes 5,1 and 0 since it owns all those nodes. 
Once processor 0 calculates the sub-DG  of node 0 (which is the overall DG  of the graph), it broadcasts it to other 
processors.
5.4 An Example
We consider the hierarchical graph in Figure 3 in the previous section. At level 0, it has a conditional node (node 1) and 
an add operation(node 3). The conditional operation has two alternatives whose sources are nodes 5 and 7 respectively. 
Nodes 5 and 7 are at level 1. The second alternative has a multiply and two add operations, which are at level 2. As­
suming that we have 4 processors,they are assigned to the various nodes as shown in the figure. Assume that we have 
to schedule the graph for latency 10. Assume, for simplicity that each loop executes once. Thus, the conditional state­
ment can have latencies ranging from 3 to 9. The first alternative can have latencies ranging from 2 to 9, and the second, 
from 3 to 9. Consider the node 9.1n the preprocessing step, processors 0 and 1 recursively calculate the sub-DGs for 
node 9 for latencies 1 to 4, and processors 2 and 3 calculate the sub-DGs for node 11 for latencies 1 to 4. After that,
20
processors 0,1,2,and 3 perform an allgcitlier operation (which is defined in MPI [32], after which processors 0,1,2 and 
3 have the sub.DGs for nodes 9 and 11 for feasible latencies. However, processors 0 and 1 does not have the sub.DGs 
of nodes 15 or 16. But, they are not required for the DG  calculation for node 0 or for the calculation of forces for nodes 
9 and 10. In the next step, processors 0,1,2 and 3 calculate and store the sub.DGs for nodes 5,6, 7 and 8 after which 
they perform an allgather operation which results in processors 0,1,2 and 3 having the sub.D Gs for feasible latencies 
for nodes 5,6,7 and 8.
Consider now the force calculation phase. Processors 0,1,2 and 3 calculate the forces for nodes 1,2 and 3 and sched­
ules them. The sub-DGs of node 0 is invalidated and should be recalculated. However, the sub.DG  of node 1 can still 
be reused, since the sub.DG  of a hierarchical node depends only on the latency of the node, provided that none of the 
interior nodes are scheduled. After 1 and 2 are scheduled, processors 0,1,2 and 3 calculate the forces for nodes 5,6,7 
and 8 and schedules them. After they are scheduled, processors 0 and 1 calculate the forces for nodes 9 and 10 and 
processors 2 and 3 calculate the forces of nodes 11 and 12.
• 5.5 Load Balancing
As explained above, each hierarchical entity has a list of processors assigned to it in the first step. Due to the way the 
sub-DGs were calculated in the pre-processing step, each of the processor associated with a hierarchical node has the 
sub-DGs for all feasible latencies of the component nodes of the hierarchial entity in their local memories, and hence 
can calculate the forces of any of those nodes. The force calculation of a particular node to a particular step represents 
a unit of work. The total work involved in calculating the forces of the component nodes of a hierarchical entity is 
distributed among the processors in the process group of that entity. This distribution is done cyclically among the 
processors with the grain size ranging from 1 to total .work/(nprocs). In the figure, during the force calculation for 




Assume that the given hierarchical graph has N  nodes and has depth d. Let L be the overall latency. In the worst case, 
each level has one hierarchical node except one of them, having N  — d +  1 nodes. Thus, all p processors could be 
performing the all-gather operation for d — 1 levels in the worst case. In the last stage, each node could be allocated k 
processors, where k ranges from 1 to p. All-gather operation is equivalent to all nodes sending some message to one 
processor (gather) followed by that node broadcasting all the received messages to all the processors. Thus is takes at 
most the amount of communication equivalent to a gather operation followed by a broadcast. The gather operation and 
broadcast can be implemented in 2 t s{s/ p -  1) + twm i ( p -  1) time for a 2 -  D mesh. Thus, the all-gather operation can 
be perfomed in less than 4fs(v/p -  1) + 2twm i(p  -  1) units of time, where t s is the start-up time, tw is the per-word 
transfer time.
The computation required for the pre-processing step for a particular level t for each hierarchical node at that level 
is at most ciL s where s is the number of nodes contained within the hierarchical node at level t +  1 and Ci is some 
constant. Thus, the overall computation required for the pre-processing stage is J2t=l ci L N  which is 0 { L N 2). Thus 
in the worst case, the time for the pre-processing step is 0 ( L N 2) +  4dts(y/p -  1) +  2twm i(p  -  1), where mi is the 
maximum length of a message whose upper bound is C2 N L 2 where 0 2 is some constant.
6.2 The force-calculation and scheduling phase
In this phase, after each scheduling step, each processor sends its best node, along with the force to a master processor, 
and the master processor selects the globally best node and broadcasts it to all processors. Each of the above can be 
performed in time 2(ts +  twm 3)logp + 2th(s/p  -  1) where m 3 is the length of the messages which is a constant (since 
the message contains only the node with the least force, the corresponding time step, and the force). There are at most 
N  schedule steps (since at a given step, more than one node could be scheduled). Each force calculation step takes 
0 ( L N 2 /p) time assuming good load balance. Thus, the force calculation for all steps takes 0 ( L N 3 /p) since there are 
at most N  steps. Thus the overall time for the force calculation and scheduling is 0 ( L N 3/p ) +  2(ts +  twm 3)logp +
2th( y /p -  1).
22
7 Experimental Results
The parallel hierarchical force directed scheduling algorithm was implemented using the portable Message Passing 
Interface(MPJ) [32], which has been ported on a variety of parallel machines. We have tested the algorithm on a 
set of high level synthesis benchmarks.The resulting schedules were optimal for most cases and very close to optimal 
for others.The run times for some benchmarks for different latencies are as shown below. The results are shown for 
a 16-processor Intel Paragon distributed memory multicomputer, 8-processor S G I  Power Challenge shared memory 
multiprocessor and a network of SUN Sparc station 5 work-stations.
7.1 Experimental results for non-hierarchical graphs
Tables 1,2,3 shows the runtimes and speedups for node-based approach for non-hierarchical graphs and Tables 4,5 and 
6 show the runtimes and speedups for step-based approach for non-hierarchical graphs. From the results we can make 
the following observations. The qualities of the results (in terms of the hardware resources used) for each latency case 
are identical in the sequential and parallel executions. If the latency is close to the minimum, then the time frames of 
the nodes are small, in which case the total work involved in calculating the forces is relatively small. This results in 
low overall running time and poor speedup. For the node based partitioning scheme, we are getting speedups of 5 to 9 
on 16 processors on an Intel Paragon. The speedups on the Intel Paragon are comparable to that of the network of work 
stations, which has significant costs for communication. Clearly, the problem has been parallelized such that the grain 
size of computation between communications is quite large. Hence, we have effectively parallelized the problem.
Both partitioning schemes are comparable in their runtimes and speedups for a given parallel environment. Clearly 
we have been successful in balancing the load equally in both cases.
The results on the S G I  Challenge shared memory multiprocessor are worse than that of the Intel Paragon which is a 
distributed memory multicomputer. The result is counter-intuitive since most people think that it is possible to get better 
speedups on shared memory multiprocessors. We attribute the result due to cache coherence effects, contention and 
multitasking. In addition, for M P I ,  it is possible to get a large communication bandwidth on message passing machines 
during global operations such as allgatherQ  which creates more bus traffic in a shared memory multiprocessor.
23
Table 1: Run times(seconds) and speedups for node-based approach for non-hierarchical graphs on a network of work 
stations
time(speedup)
circuit latency 1 proc 2 proc 4 proc
Random(60) 60 336.1(1.0) 245.9(1.37) 177.3(1.89)
Random(60) 80 586.1(1.0) 336.4(1.74) 281.1(2.08)
elliptic 17 1.725(1.0) 0.998(1.73) 0.595(2.89)
elliptic 27 5.01(1.0) 2.78(1.80) 1.498(3.34)
Table 2: Run times(seconds) and speedups for node-based approach for non-hierarchical graphs on an SGI Power Chal­
lenge multiprocessor
time(speedup)
circuit latency 1 proc 2 proc 4 proc 8 proc
Random(60) 60 384(1.0) 305(1.26) 242.5(1.58) 148.1(2.59)
Random(60) 80 704(1.0) 713.8(0.98) 548.8(1.28) 358.0(1.96)
elliptic 17 1.37(1.0) 0.96(1.43) 0.71(1.93) 1.80(0.76)
elliptic 27 5.12(1.0) 2.69(1.9) 2.18(2.35) 3.75(1.36)
Table 3: Run times (seconds) and speedups for node-based approach for non-hierarchical graphs on Intel Paragon mul­
ticomputer
time(speedup)
circuit latency 1 proc 2 proc 4 proc 8 proc 16 proc
Random(60) 60 666.5(1.0) 472.1(1.4) 332.5(2.0) 203.8(3.27) 118.0(5.64)
Random(60) 80 1252(1.0) 706.1(1.77) 441.8(2.83) 286.2(4.37) 188.2(6.65)
elliptic 17 3.125(1.0) 1.843(1.69) 1.218(2.57) 0.8437(3.7) 0.703(4.44)
elliptic 27 11.20(1.0) 6.17(1.81) 3.39(3.3) 1.89(5.92) 1.25(8.96)
Table 4: Run times(seconds) and speedups for step-based approach for non-hierarchical graphs on a network of work 
stations
time(speedup)
circuit latency 1 proc 2 proc 4 proc
Random(60) 60 336.1(1.0) 334.9(1.0) 173.3(1.94)
Random(60) 80 586.1(1.0) 337.5(1.74) 228.1(2.57)
elliptic 17 1.86(1.0) 1.73(1.07) 0.996(1.87)
elliptic 27 4.81(1.0) 2.60(1.85) 1.392(3.45)
24
Table 5: Run times(seconds) and speedups for step-based approach for non-hierarchical graphs on an SGI Power Chal­
lenge multiprocessor
time(speedup)
circuit latency 1 proc 2 proc 4 proc 8 proc
Random(60) 60 384(1.0) 319.2(1.20) 251.1(1.53) 139.8(2.75)
Random(60) 80 704(1.0) 308.7(2.28) 229.2(3.07) 128.9(5.46)
elliptic 17 1.43(1.0) 0.96(1.49) 0.83(1.72) 2.01(0.71)
elliptic 27 4.83(1.0) 2.63(1.84) 2.47(1.95) 2.71(1.78)
Table 6: Run times for step-based approach for non-hierarchical graphs on Intel Paragon multicomputer
time(speedup)
circuit latency 1 proc 2 proc 4 proc 8 proc 16 proc
Random(60) 60 666.5(1.0) 458.9(1.45) 316.9(2.1) 195.1(3.42) 110.0(6.06)
Random(60) 80 1252(1.0) 702.5(1.78) 444.3(2.82) 293.2(4.27) 193.5(6.47)
elliptic 17 3.125(1.0) 1.797(1.74) 1.172(2.67) 0.828(3.77) 0.672(4.65)
elliptic 27 11.20(1.0) 6.03(1.86) 3.29(3.4) 1.90(5.89) 1.27(8.82)
7.2 Experimental results for hierarchical graphs
Tables 7,8 and 9 show the runtimes and speedups for hierarchical graphs. For the parallel algorithms for hierarchical 
graphs, the speedup results are similar to that of non-hierarchical graphs. Again the qualities of the results (in terms of 
the hardware resources used) for each latency case are identical in the sequential and parallel executions. As in the case 
of non-hierarchical graphs, if the latency is close to the minimum, then the time frames of the nodes are small, resulting 
in low overall running time and poor speedup. For Kalman, there are only few nodes in each level which results in low 
parallelism. The higher latencies in the above table correspond to the latencies which results in close to the minimum 
resource requirement for the corresponding circuit.
Table 7: Runtime(seconds) and speedups for parallel algorithm on a network of workstations for hierarchical graphs
time(speedup)
circuit latency 1 proc 2 proc 4 proc
kalman 3700 205.56(1.0) 133.36(1.54) 92.80(2.2)
elliptic 17 6.85(1.0) 3.32(2.1) 2.61(2.62)
elliptic 27 20.21(1.0) 10.87(1.85) 6.65(3.03)
synth 100 98.61(1.0) 77.78(1.26) 44.60(2.21)
synth 200 533.9(1.0) 419.9(1.27) 175.3(3.04)
25
Table 8: Runtime(seconds) and speedup of parallel algorithm on Intel Paragon message passing multicomputer for 
hierarchical graphs
Runtime(speedup)
circuit latency 1 proc 2 proc 4 proc 8 proc 16 proc
kalman 3700 107.89(1.0) 69.14(1.56) 38.51(2.8) 23.76(4.6) 17.22(6.3)
elliptic 17 3.23(1.0) 1.95(1.6) 1.45(2.3) 1.03(3.1) 0.804(4.01)
elliptic 27 11.35(1.0) 6.14(1.8) 3.45(3.3) 2.10(5.4) 1.45(7.8)
synth 100 49.67(1.0) 37.01(1.34) 21.01(2.4) 12.69(3.9) 8.57(5.8)
synth 200 253.47(1.0) 191.00(1.3) 82.06(3.1) 43.43(5.8) 24.68(10.3)
Table 9: Runtime(seconds) and speedups for parallel algorithm on SGI Power Challenge multiprocessor for hierarchical 
graphs
time(speedup)
circuit latency 1 proc 2 proc 4 proc
kalman 3700 76.09(1.0) 47.0(1.62) 28.97(2.63)
elliptic 17 1.634(1.0) 0.914(1.78) 0.629(2.59)
elliptic 27 5.064(1.0) 2.669(1.89) 1.459(3.47)
synth 100 27.94(1.0) 20.12(1.39) 13.62(2.05)
synth 200 158.2(1.0) 121.3(1.30) 55.25(2.86)
8 Conclusions
In this paper, we presented some novel algorithms for scheduling hierarchical signal flow graphs in the domain of high- 
level synthesis. There were several key contributions of this paper. First, we developed a novel extension of the force- 
directed scheduling problem which naturally handles loops and conditionals by coming up with a scheme of schedul­
ing hierarchical signal flow graphs. Second, we developed three novel parallel algorithms for the scheduling prob­
lem. Third, our parallel algorithms are portable across a wide range of parallel platforms. We reported results on an 8- 
processor SGI Challenge, a 16 processor Intel Paragon, and a network of 4 SUN SPARCstation5 workstations. Fourth, 
we reported actual implementations of these algorithms in real parallel machines, and report results of qualities and 
runtimes for various benchmark circuits on a variety of parallel machines. Finally, while some parallel algorithms for 
VLSI CAD reported by earlier researchers have reported on loss of qualities of results, our parallel algorithms produced 
exactly the same results as the sequential algorithms on which they are based.
In some related work, we are developing parallel algorithms for behavioral simulation [23]. In the future, we will 
pursue parallel algorithms for other tasks in high-level synthesis, such as allocation, and in hardware/software codesign
26
and cosimulation. We will integrate all these tools with the parallel tools developed for solving the CAD problems at 
the lower level such as placement, routing, logic synthesis, and testing. All these tools will run in a portable manner in 
a large variety of parallel and distributed platforms.
References
[1] P. Banerjee, “Parallel Algorithms for VLSI Computer-aided Design Applications,” Englewood-Clijfs, NJ:Prentice Hall, 1994.
[2] P. G. Paulin and J. P. Knight, “Force-directed scheduling for the behavioral synthesis of ASICS,” IEEE Trans. Computer-Aided 
Design, vol. 8, no. 6, pp. 661-679, 1989.
[3] W. Gropp, E. Lusk, and A. Skellum, “Using MPI: Portable Parallel Programming with the Message Passing Interface”, MIT 
Press, 1994.
[4] R. Haupt, “A survey of priority-rule based scheduling,” OR Spektrum, vol. 11, no. 1, pp. 3-16, 1989.
[5] M. C. McFarland, “Using bottom-up design techniques in the synthesis of digital hardware from abstract behavioral descrip­
tions,” Proc. 23rd DAC, Las Vegas, NV, June 1986, pp. 474-480.
[6] M. C. McFarland, A. C. Parker, and R. Camposano, “The high-level synthesis of digital systems,” Proc. IEEE, vol. 78, no. 2, 
pp. 301-318, 1990.
[7] C. H. Gebotys and M. I. Elmasry, “A global optimization approach for architectural synthesis,” in Proc. ICCAD, Santa Clara, 
CA, Nov. 1990, pp. 258-261.
[8] C. T. Hwang, T. H. Lee, and Y. C. Hsu, “A formal approach to the scheduling problem in high level synthesis,” IEEE Trans. 
CAD, vol. 10, no. 4, pp. 464-475, 1991.
[9] R. J. Cloutier and D. E. Thomas, “The combination of scheduling, allocation, and mapping in a single algorithm,” Carnegie 
Mellon Univ Res. Rep CMUCAD-90-17, May 1990.
[10] T. Kim, J. W. S. Liu, and C. L. Liu, “A scheduling algorithm for conditional resource sharing,” in Proc ICCAD, Santa Clara, 
CA, Nov 1991, pp. 84-87.
[11] A. Olah, S. H. Gerez, and S. M. Heemstra de Groot, ’’Scheduling and allocation for the high-level synthesis of DSP algorithms 
by exploitation of transfer mobility”, in Proc.CompEuro, Delft, The Netherlands, 1992, pp. 145-150.
[12] L. Stok, “Architectural synthesis and optimization of digital systems,” Ph. D dissertation, Eindenhoven Univ. of Technol., Ein- 
denhoven, The Netherlands.Apr. 1990
[13] Wim F. J. Verhaegh, Paul E. R. Lippens, Emile H. L. Aarts, Jan H. M. Korst, Jef L. van Meerbergen, Albert van der Werf, “Im­
proved force-directed scheduling in high-throughput digital signal processing”, in IEEE trans on CAD, vol. 14, no. 8, August 
1995.
[14] J. Roy, N. Kumar, R. Dutta, R. Vemuri,“DSS: A distributed high-level synthesis system”, in IEEE design and test o f computers, 
June 1992.
[15] B. Ramkumar and P. Banerjee, “ProperCAD: A portable object-oriented parallel environment for VLSI CAD,” IEEE Trans. 
Computer-Aided Design, vol. 13, pp. 829-842, July 1994.
[16] S. Kim, B. Ramkumar, J. Chandy, S. Parkes, and P. Banerjee,“ProperPLACE: A portable parallel algorithm for standard cell 
placement. Proc. 8th Int. Parallel Processing Symp. (IPPS-94), Apr. 1994.
[17] R. Brouwer and P. Banerjee, “PHIGURE: A parallel hierarchical global router,” in Proc. 27th Design Automation Conference., 
pp. 591-597, June 1990.
[18] K. P. Belkhale and P. Banerjee. “Parallel algorithms for VLSI circuit extraction,” IEEETrans. CAD, 10(2):604-618, May 1991.
[19] J. S. Rose, W. M. Snelgrove, and Z. G. Vranesic, “Parallel cell placement algorithms with quality equivalent to simulated 
annealing,” IEEE Trans. Computer-Aided Design, vol. 7, pp. 387-396, Mar. 1988.
[20] A. Casotto, F. Romeo, and A. Sangiovanni-Vincentelli, “A Parallel Simulated Annealing Algorithm for the Placement of 
Macro-Cells”, in IEEE Trans. Computer-aided Design Integrated Circuits Systems, pages 838-847, Sept. 1987.
A. Casotto and A. Sangiovanni-Vincentelli, “Placement of standard cells using simulated annealing on the Connection Ma­
chine,” in Digest of Papers, International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 350-353, Nov. 1987.
27
[21] S. Patil, P. Banerjee, and T. Patel,“ Parallel Test Generation for Sequential Circuits on General Purpose Multiprocessors,” in 
Proc. 28th Design Automation Conf (DAC-91), June. 1991.
[22] P. Agrawal, V. D. Agrawal, J. Villodo, “Sequential Circuit Test Generation on a Distributed System,” in Proc 30th Design 
Automation Conf, (DAC-93), June. 1993.
[23] V. Krishnaswamy, P. Banerjee, “Actor basec parallel VHDL simulation using Time Warp,” in Proc. Int. Conf. on VLSI Design, 
1996.
[24] K. De, B. Ramkumar and P. Banerjee, "ProperS YN: A portable parallel algorithm for logic synthesis,” in IEEE Trans. Computer 
Aided Design( ICC AD-92), Nov. 1992.
[25] B. Ramkumar and P. Banerjee, “ProperTEST: A portable parallel test generator for sequential circuits,” to appear in IEEE 
Trans, on CAD, 1996.
[26] K. MacPherson, “Parallel algorithms for layout verification,” MS thesis, Department o f Electrical and Computer Engineer­
ing,University o f Illinois, Urbana-Champaign, Technical report, 1995.
[27] K. De, J. A. Chandy,S. Roy, S. Parkes and P Banerjee, “Parallel algorithms for logic synthesis using the MIS approach,” Pro­
ceedings of IPPS, Santa Barbara, April 1995.
[28] S. Parkes, P. Banerjee, and J. Patel, “A parallel algorithm for fault simulation based on PROOFS,” Proceedings of the ICCD, 
Austin, TX, Oct. 1995.
[29] S. Parkes, P. Banerjee, and J. Patel “ProperHITEC: A portable, parallel, object-oriented approach to sequential test generation,” 
in Proc. 31st Design Automation Conf, June 1994.
[30] R. Galivanche and S. M. Reddy, “A Parallel PLA Minimization Program,” in Proc. Design Automation Conf, pages 600-607, 
1987.
[31] H. D. Hachtel and P. H. Moceyunas, “Parallel Algorithms for Boolean Tautology Checking,” in Proc. Int. Conf. Computer- 
aided Design, pp. 422-425, Nov. 1987.
[32] W. Gropp, E. Lusk, and A. Skellum, “Using MPI: Portable Parallel Programming with the Message Passing Interface”, MIT 
Press, 1994.
28
