Partitioning-based algorithm for pipelined scheduling and module assignment by Wu, Allen C.H. et al.
UC Irvine
ICS Technical Reports
Title
Partitioning-based algorithm for pipelined scheduling and module assignment
Permalink
https://escholarship.org/uc/item/1hd8h9m8
Authors
Wu, Allen C.H.
Lis, Joseph
Gajski, Daniel D.
Publication Date
1991-04-09
 
Partitioning-Based Algorithm for 
Pipelined Scheduling and Module Assignment 
Allen C-H. Wu 
Joseph Lis 
Daniel D. Gajski 
Technical Report #91-32 
April 9, 1991 
Dept. of Information and Computer Science 
University of California, Irvine 
Irvine, CA 92717 
(714) 856-8059 
Abstract 
We propose partitioning-based algorithms for pipeline scheduling, module assignment, 
and interconnect sharing. A novel hypergraph model is used to perform module 
assignment; it facilitates the identification of sharable resources and the calculation 
of interconnect costs. The algorithms use clustering and interchange improvement 
techniques to maximize interconnect sharing. The results show significant 
improvement over other published results. 

TABLE OF CONTENTS 
1. Introduction 
2. Notation and definition 
3. Pipeline scheduling 
3.1 Determination of the disjoint partitions 
3.2 Partitioning-based scheduling 
4. Function unit assignment and interconnect sharing 
4.1 Formulation 
4.1.1 The hypergraph 
4.1.2 Interconnect cost in the hypergraph model 
4.2 The algorithm 
4.2.1 Initial assignment 
4.2.2 Improvement by interchanging 
5. Experimental results 
6. Conclusions 
7. References 

LIST OF FIGURES 
Figure 1. Pipeline example (a) Latency=1 and (b) Latency=3. 
Figure 2. (a) Freedom calculations and schedule and (b) Disjoint sets. 
Figure 3. Hyperedge merging and interconnect sharing. 
Figure 4. (a) A data flow graph example and (b) Hypergraph of (a). 
Figure 5. Feasible hyperedges. 
Figure 6. "Lock" hyperedges. 
Figure 7. The schedule and structural netlist of a FIR filter with latency=2. 
Figure 8. The schedule and structural netlist of a FIR filter with latency=3. 
Figure 9. The schedule and structural netlist of a FIR filter with latency=4. 
Figure 10. The data flow graph of an elliptic filter example. 
Figure 11. The (a) schedule, (b) operation assignment, and (c) structural netlist of an elliptic filter with latency=8. 
Figure 12. The (a) schedule, (b) operation assignment, and (c) structural netlist of an elliptic filter with latency=9. 

LIST OF TABLES 
Table 1. Results and Comparisons of a FIR filter example. 
Table 2. Results and Comparisons of an elliptic filter example. 

1. Introduction 
Data path synthesis of digital systems has been an active research topic in 
recent years [1,3,5,6,8,9]. Two data path models can be identified. The first assumes that 
the original description is a sequence of assignment statements, if statements, and loop 
statements. The problem is to map this description onto a connection of functional and 
storage units. Some or all of the functional units can be pipelined. The goal of this 
mapping is to minimize the total execution time or the number of functional units, storage 
units, and busses. Minimal execution time is achieved by exploiting maximal 
parallelism in the program, while minimal cost is obtained by maximal sharing of 
hardware units. The second model assumes that the given description is a single loop 
operating on an infinite stream of data. In this model, loop iterations form the pipeline and 
units are not shared inside a single iteration. We describe a solution for the second 
problem in this paper. 
The key parameter in determining the performance of a pipeline is the latency, 
which is the number of time steps separating two task initiations. This parameter 
influences the overall computation time as well as the hardware cost associated with the 
pipeline design. Consider a task that is partitioned into 5 subtasks. The pipeline 
schedules of such a task with latency=1 and latency=3 are shown in Figure 1. Let $c$ be 
the clock period, $t$ the number of control steps of a task, and $l$ the latency. The total 
computation time for $n$ tasks is $tc + (n-1)lc$. For example, if $l=1$, $n=100$, and $t=5$, 
the total computation time is $104c$. On the other hand, if $l=3$, a computation time of 
$302c$ is required to execute 100 tasks. Thus, the lower the latency, the shorter the 
computation time. 
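The arithmetic above can be checked with a short sketch. The function name `total_time` is ours, not the report's, and the result is expressed in multiples of the clock period c.

```python
def total_time(t: int, n: int, latency: int) -> int:
    """Total computation time, in clock periods, for n pipelined tasks.

    t is the number of control steps of one task; a new task is
    initiated every `latency` steps, so the last of n tasks starts
    after (n - 1) * latency steps and needs t more steps to finish.
    """
    return t + (n - 1) * latency


print(total_time(t=5, n=100, latency=1))  # 104
print(total_time(t=5, n=100, latency=3))  # 302
```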
Figure 1. Pipeline example (a) Latency=1 and (b) Latency=3. 
An adverse effect of a small latency is a correspondingly higher hardware cost due 
to the increased number of parallel operations in each state. Consider the example in 
Figure 1(a). Five subtasks are executed in the same clock cycle; therefore, at least 
five function units are required to carry out the pipeline design with latency=1. 
However, in Figure 1(b), only two function units are needed to carry out the pipeline 
design with latency=3. 
The pipelining technique has been incorporated into the scheduling, allocation, 
and resource binding tasks of high-level synthesis. Sehwa [8] first introduced a set of 
techniques for the synthesis of pipelined data paths. Park and Kurdahi [7] use a 
constrained clique partitioning approach with the goal of maximizing interconnect 
sharing to perform module assignment for pipelined data path designs. Hwang and 
Casavant [4] have developed a scheduling and hardware sharing algorithm using the 
force-directed scheduling approach [6] to synthesize both pipelined and non-pipelined 
designs. 
In this paper, we present a partition-based formulation for pipeline scheduling, 
function unit allocation, and interconnect sharing. A novel hypergraph model is used 
to perform module assignment which facilitates the identification of sharable resources 
and the calculation of interconnect costs. Clustering and interchange improvement 
techniques are used to maximize interconnect sharing. Results have demonstrated 
significant improvement over other published results. 
The remainder of this paper is organized as follows: Section 2 defines the notation 
and terms used throughout the paper. Section 3 describes the partition-based 
scheduling technique. Section 4 discusses the function unit assignment and 
design is $\mathrm{Max}(\lceil n_{op}/r_{op} \rceil)$. Furthermore, if a fixed latency $l$ is used, a function 
unit can be used in at most $l$ clock cycles, whose indices modulo $l$ are $(0, 1, \ldots, l-1)$. 
Based on these facts, we can state the following: 
Fact 1: For a fixed latency $l$, there exists a set of disjoint partitions $S=\{s_k \mid k=0, \ldots, l-1\}$ 
such that no two partitions are executed in the same clock cycle. 
Consequently, we can formulate pipeline scheduling as a partitioning problem 
in which operation nodes are assigned to disjoint partitions while maximizing 
function unit sharing. The scheduling algorithm first determines the necessary and 
sufficient number of function units $R=\{r_{op}\}=\lceil n_{op}/l \rceil$ for a given latency $l$, or determines 
the minimal latency $l$ that can be achieved with the given resources. For example, consider a 
data flow graph consisting of 15 addition operation nodes and 8 multiply operation 
nodes. To satisfy the performance requirement of latency=3, 5 adders ($\lceil 15/3 \rceil$) and 3 
multipliers ($\lceil 8/3 \rceil$) are necessary and sufficient to carry out the pipeline design. As a 
result, there are three disjoint partitions, and initially 5 adders and 3 multipliers are 
allocated to each partition. 
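As a minimal sketch of this resource calculation, the two directions (units for a given latency, and minimal latency for given units) can be written as below. The function names and the dictionary layout are illustrative assumptions, not part of the report.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def units_for_latency(op_counts: dict, latency: int) -> dict:
    """Number of function units of each type needed for a fixed latency."""
    return {op: ceil_div(n, latency) for op, n in op_counts.items()}


def min_latency(op_counts: dict, units: dict) -> int:
    """Smallest latency achievable with the given number of units per type."""
    return max(ceil_div(n, units[op]) for op, n in op_counts.items())


counts = {"add": 15, "mul": 8}
print(units_for_latency(counts, 3))               # {'add': 5, 'mul': 3}
print(min_latency(counts, {"add": 5, "mul": 3}))  # 3
```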
3.2. Partitioning-based scheduling 
The partitioning-based scheduling task consists of three steps: (i) Freedom 
calculation, (ii) Disjoint set partitioning, and (iii) Time step assignment. Algorithm I 
shows the pseudo code for this procedure. 
Freedom calculation. Based on ASAP (as soon as possible) and ALAP (as late as 
possible) scheduling, the algorithm first calculates the freedom of each node. For a 
node $v_i$, $freedom(v_i)=f_{ALAP}(v_i)-f_{ASAP}(v_i)$. Furthermore, a chaining strategy [5] is 
implemented for the freedom calculation. The freedom calculations for the FIR filter 
are shown in Figure 2(a), where the delay of an adder is 40 ns, the delay of a multiplier 
is 80 ns, and the clock cycle is 100 ns. 
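The freedom calculation can be illustrated with a simplified sketch. It assumes every operation takes exactly one control step and ignores the chaining strategy used in the report; the function name, the graph encoding, and the small example graph are all hypothetical.

```python
def freedom(preds: dict, order: list, last_step: int):
    """Return (asap, alap, freedom) for a DAG listed in topological order.

    preds maps each node to its predecessor list. Unit delays are
    assumed and operation chaining is NOT modelled.
    """
    succs = {v: [] for v in order}
    for v in order:
        for p in preds[v]:
            succs[p].append(v)
    asap = {}
    for v in order:  # earliest step: one after the latest predecessor
        asap[v] = 1 + max((asap[p] for p in preds[v]), default=0)
    alap = {}
    for v in reversed(order):  # latest step: one before the earliest successor
        alap[v] = min((alap[s] for s in succs[v]), default=last_step + 1) - 1
    return asap, alap, {v: alap[v] - asap[v] for v in order}


preds = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": ["c", "d"]}
asap, alap, free = freedom(preds, ["a", "b", "c", "d", "e"], last_step=3)
print(free)  # {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 0}
```

Only node d has slack here: it can run in step 1 or step 2 without delaying e.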
Disjoint set partitioning. In the second step, the algorithm assigns nodes to $l$ 
disjoint partitions such that resource sharing is maximized. The algorithm first 
assigns nodes with zero freedom to their corresponding partitions. Then, it 
assigns the remaining nodes to partitions in increasing order of freedom. For a node $v_i$ 
with a freedom interval from $f_{ASAP}(v_i)$ to $f_{ALAP}(v_i)$, the algorithm can assign $v_i$ to a partition 
$s_k$ between $k=(f_{ASAP}(v_i) \bmod l)$ and $k=(f_{ALAP}(v_i) \bmod l)$. The node assignment of $v_i$ is 
performed using the following steps: 
(1) From $k=(f_{ASAP}(v_i) \bmod l)$ to $(f_{ALAP}(v_i) \bmod l)$, the algorithm locates the first 
partition $s_k$ with an available function unit for node $v_i$. 
(2) If $s_k$ contains a predecessor node $v_j$ of node $v_i$ and $v_i$ cannot be chained with $v_j$, 
then locate the next available partition set; otherwise, assign node $v_i$ to $s_k$. 
(3) If $s_k$ contains a successor node $v_j$ of node $v_i$ and $v_j$ can be chained with $v_i$, then 
assign node $v_i$ to $s_k$; otherwise, assign node $v_i$ to $s_k$ and propagate node $v_i$ to the 
next available partition set. 
The final partition configuration of the FIR filter with fixed latency=3 is shown in 
Figure 2(b). In the first scheduling phase, the algorithm allocates only the 
necessary resources for each disjoint partition; it adds more resources when 
needed to satisfy the required performance. 
Control step assignment. Finally, the algorithm assigns a control step to each 
operation node, proceeding from inputs to outputs according to the data flow and its 
dependencies. Visiting the disjoint sets from $s_0$ to $s_{l-1}$, the algorithm assigns the control 
step to the nodes without any data dependency edges and then deletes those nodes together 
with the data dependency edges to their successors. For example, in Figure 2(b), there are 
no data dependency edges for nodes +1, +2, and +3 in $s_0$. The algorithm assigns step 1 to 
these three nodes, deletes them, and then deletes the arcs between 
nodes +1, +2, +3 and nodes *1, *2, *3. The algorithm repeats until all of 
the nodes are assigned to a time step. The final schedule of the FIR filter with fixed 
latency=3 is shown in Figure 2(a). 
Figure 2. (a) Freedom calculations and schedule and (b) Disjoint sets. 
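The control step assignment can be sketched as a round-robin sweep over the partitions. This is an illustration under stated assumptions rather than the report's implementation: chaining is ignored, numbering starts at step 1 as in the example above, and the function name, node labels, and sample partitions are hypothetical.

```python
def assign_control_steps(partitions: list, preds: dict, first_step: int = 1) -> dict:
    """Visit s_0, s_1, ..., s_{l-1}, s_0, ... and give the current control
    step to every node of the visited partition whose predecessors have
    all been scheduled already (i.e. no dependency edges remain)."""
    remaining = [set(s) for s in partitions]
    unscheduled = set().union(*remaining)
    steps, k, p, idle = {}, 0, first_step, 0
    while unscheduled:
        ready = {v for v in remaining[k] if not (set(preds[v]) & unscheduled)}
        for v in ready:
            steps[v] = p
        remaining[k] -= ready      # delete the scheduled nodes
        unscheduled -= ready       # which also removes their outgoing edges
        idle = 0 if ready else idle + 1
        if idle >= len(partitions):
            raise ValueError("no schedulable node: dependency cycle?")
        k = (k + 1) % len(partitions)
        p += 1
    return steps


# Hypothetical 3-partition example (latency = 3).
parts = [{"a1", "a2", "a4"}, {"m1", "m2"}, {"a3"}]
deps = {"a1": [], "a2": [], "m1": ["a1"], "m2": ["a2"],
        "a3": ["m1", "m2"], "a4": ["a3"]}
print(assign_control_steps(parts, deps))
```

In this example a4 sits in the same partition as a1 and a2 but is only scheduled on the second visit to that partition, at step 4, so nodes sharing a partition end up a multiple of the latency apart.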
Algorithm I Partitioning-Based Scheduling 
Partitioning_Based_Scheduling(G){ 
    determine_partition_set(G,R,l); 
    freedom_calculation(G); 
    sort_freedom_list(G); 
    /** assign operation nodes to partitions **/ 
    for (i = 1 to n){ 
        done = FALSE; 
        step = f_ASAP(v_i); 
        while (done == FALSE && step <= f_ALAP(v_i)){ 
            k = step mod l; 
            if (op(i) ∈ R(s_k)){ 
                if ({v_j ∈ s_k | v_j is v_i's predecessor}){ 
                    if (v_i can be chained with v_j){ 
                        s_k = s_k ∪ {v_i}; 
                        R(s_k) = R(s_k) - op(i); 
                        done = TRUE; 
                    } 
                    else 
                        step = step + 1; 
                } 
                else if ({v_j ∈ s_k | v_j is v_i's successor}){ 
                    if (v_j can be chained with v_i){ 
                        s_k = s_k ∪ {v_i}; 
                        R(s_k) = R(s_k) - op(i); 
                        done = TRUE; 
                    } 
                    else{ 
                        s_k = s_k ∪ {v_i}; 
                        R(s_k) = R(s_k) - op(i); 
                        step = step + 1; 
                        i = j; 
                    } 
                } 
                else{ 
                    s_k = s_k ∪ {v_i}; 
                    R(s_k) = R(s_k) - op(i); 
                    done = TRUE; 
                } 
            } 
            else 
                step = step + 1;    /* no free unit in s_k, as implied by step (1) */ 
            /** no partition in the freedom interval fits: add resources, retry **/ 
            if (step > f_ALAP(v_i) && done == FALSE){ 
                for (k = 0 to l-1) 
                    R(s_k) = R(s_k) ∪ op(i); 
                step = f_ASAP(v_i); 
            } 
        } 
    } 
    /** control step assignment **/ 
    p = 0; 
    k = 0; 
    while (S ≠ ∅){ 
        k = k mod l; 
        t_p = t_p ∪ {v_i ∈ s_k | v_i has no dependency edges}; 
        delete_node_edge(s_k, {v_i ∈ s_k | v_i has no dependency edges}); 
        k = k + 1; 
        p = p + 1; 
    } 
} 
4. Function unit assignment and interconnect sharing 
The objective of function unit assignment and interconnect sharing is to assign 
operation nodes to function units such that the interconnect cost is minimized. We 
formulate the function unit assignment and interconnect sharing problem in terms of 
hypergraph merging. In this section, we first describe the hypergraph formation and 
the interconnect cost in the hypergraph model. Then, we describe the function unit 
assignment algorithm, which minimizes the interconnect cost by clustering the operation 
nodes into function units exploiting data dependency similarity. 
4.1. Formulation 
4.1.1. The hypergraph 
The algorithm first transforms the data flow graph G into a hypergraph H in which 
there are two types of hypernodes: (i) input/output and (ii) operation. Input/output 
hypernodes denote input/output ports. Each operation hypernode denotes a particular 
single-function unit such as an adder, multiplier, or shifter. Each operation hypernode 
contains a set of nodes in the data flow graph. Hypernodes are connected by one or 
more hyperedges. Each hyperedge denotes the physical connections between two 
function units; the weight of a hyperedge is the number of dependency edges assigned 
to it. 
At the time of hypergraph formation, each dependency edge is transformed into 
one hyperedge. After that, our algorithm performs hyperedge merging to reduce the cost of 
interconnections. Before merging, each hyperedge is labeled as a right-data-input or a 
left-data-input. Two hyperedges can be merged if and only if: 
(1) They have the same source and destination hypernodes. 
(2) The operation nodes connected by the hyperedges have the same data dependency 
elapse time. 
(3) They have the same label (right or left). 
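The three conditions translate directly into a predicate. In the sketch below the `Hyperedge` record and its field names are illustrative assumptions; the report does not define a concrete data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    src: str     # source hypernode
    dst: str     # destination hypernode
    elapse: int  # data dependency elapse time of the connected operation nodes
    port: str    # "left" or "right" data input of the destination


def can_merge(e1: Hyperedge, e2: Hyperedge) -> bool:
    """True iff conditions (1), (2), and (3) all hold."""
    same_endpoints = (e1.src, e1.dst) == (e2.src, e2.dst)  # condition (1)
    same_elapse = e1.elapse == e2.elapse                   # condition (2)
    same_port = e1.port == e2.port                         # condition (3)
    return same_endpoints and same_elapse and same_port


print(can_merge(Hyperedge("a", "b", 1, "left"), Hyperedge("a", "b", 1, "left")))   # True
print(can_merge(Hyperedge("a", "b", 2, "left"), Hyperedge("a", "b", 1, "left")))   # False
print(can_merge(Hyperedge("a", "b", 1, "left"), Hyperedge("a", "b", 1, "right")))  # False
```

For a commutative operation, one way to model input swapping is to relabel the port of one hyperedge before calling `can_merge`.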
An example of case (2) is shown in Figure 3(a). The data dependency elapse time 
between two operation nodes $v_1$ and $v_3$ is defined as $t(v_3)-t(v_1)$. When 
$t(v_3)-t(v_1)=2$ and $t(v_4)-t(v_2)=1$, $e_{a1,b3}$ needs one latch while $e_{a2,b4}$ needs two latches, 
so one extra multiplexer input is required for each hyperedge ($e_{a1,b3}$ and $e_{a2,b4}$), as shown in 
Figure 3(a). On the other hand, the two hyperedges $e_{a1,b3}$ and $e_{a2,b4}$ in Figure 3(b) can be 
merged since their elapse times are the same: $t(v_3)-t(v_1)=t(v_4)-t(v_2)=1$. Furthermore, 
consider the case in Figure 3(c), in which $e_{a1,b3}$ enters the left input of $v_b$, $e_{a2,b4}$ enters the 
right input of $v_b$, and the two inputs are not commutable. Here $e_{a1,b3}$ and $e_{a2,b4}$ 
cannot be merged, since one multiplexer input is required for each hyperedge. However, if 
the two inputs $e_{a1,b3}$ and $e_{a2,b4}$ are commutable, they can be commuted first and then 
merged into one hyperedge, as shown in Figure 3(d). Another example is shown in 
Figure 3(e): if two hyperedges from the same source are connected to different inputs of a 
function unit, they cannot be merged. 
Figure 4(b) shows an example of hypergraph formulation from the DFG shown in 
Figure 4(a). There is a set of input and output hypernodes $V_{io}=\{v_a,v_b,\ldots,v_p\}$. In 
addition, there are 4 operation hypernodes $V_{op}=\{v_1,v_2,v_3,v_4\}$, where $type(v_1)$ and $type(v_2)$ 
represent adders and $type(v_3)$ and $type(v_4)$ represent multipliers. Each hypernode contains two 
operation nodes: $v_1=\{v(+1),v(+3)\}$, $v_2=\{v(+2),v(+4)\}$, $v_3=\{v(*1),v(*3)\}$, $v_4=\{v(*2),v(*4)\}$; 
therefore, $q(v_1)=2$, $q(v_2)=2$, $q(v_3)=2$, and $q(v_4)=2$. Since an adder and a multiplier 
have 2 inputs and 1 output each, $p(v_1)=p(v_2)=p(v_3)=p(v_4)=2$. There are two 
dependency edges from $v_1$ to $v_3$; thus $w(e_{13})=2$. The hyperedge direction is based 
on the flow of data between the hypernodes. For example, $e_{13}$ is connected from the 
output of $v_1$ to the input of $v_3$; therefore, $e_{13}$ is viewed as an outgoing hyperedge of 
$v_1$ and an incoming hyperedge of $v_3$. 
Figure 3. Hyperedge merging and interconnect sharing. 
Figure 4. (a) A data flow graph example and (b) Hypergraph of (a). 
assignment step, the algorithm assigns the operation nodes to hypernodes based on 
the closeness of data dependency elapse times among the operation nodes. In the 
interchanging improvement step, the algorithm takes into account the data dependency 
similarities between hypernodes, and maximizes interconnect sharing by 
interchanging operation nodes between different hypernodes. 
4.2.1. Initial assignment 
In the scheduling step, the algorithm allocates the function units and partitions 
the operation nodes into disjoint sets $S=\{s_k \mid k=0, \ldots, l-1\}$. To satisfy the latency 
requirement, a function unit can only be shared by operation nodes in 
different sets. The task of the initial assignment is to cluster the operation nodes from 
different sets into hypernodes (function units) such that the closeness of data 
dependency elapse times in each cluster is maximized. 
The data dependency elapse time of each operation node is calculated according to 
the final schedule. Each operation node has two sets of data dependency elapse 
times: (i) the input elapse times between the node and its predecessor nodes and (ii) 
the output elapse times between the node and its successor nodes. For the example in 
Figure 2, consider operation node +9, which is scheduled at time step 2 ($t(+9)=2$). 
Node +9 has two predecessor nodes, *1 and *2, which are scheduled at time step 1 
($t(*1)=t(*2)=1$), and one successor node, +10, which is scheduled at time step 2 
($t(+10)=2$). Thus, $t_{in\_elapse}(+9)=\{1,1\}$, since the input elapse times of node +9 are 
$t(+9)-t(*1)=1$ and $t(+9)-t(*2)=1$, and $t_{out\_elapse}(+9)=\{0\}$, since the output elapse 
time of node +9 is $t(+9)-t(+10)=0$. 
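The elapse-time bookkeeping for node +9 can be reproduced with a few lines. The function name and dictionaries are hypothetical, and the output elapse time is computed here as successor time minus node time, following the earlier definition $t(v_3)-t(v_1)$; this is an assumption that coincides with the example above only because the difference is zero.

```python
def elapse_times(node: str, t: dict, preds: dict, succs: dict):
    """Return (input elapse times, output elapse times) of one operation node."""
    t_in = [t[node] - t[p] for p in preds[node]]   # node time minus predecessor time
    t_out = [t[s] - t[node] for s in succs[node]]  # successor time minus node time
    return t_in, t_out


t = {"*1": 1, "*2": 1, "+9": 2, "+10": 2}
preds = {"+9": ["*1", "*2"]}
succs = {"+9": ["+10"]}
print(elapse_times("+9", t, preds, succs))  # ([1, 1], [0])
```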
The algorithm first calculates the elapse times for each operation node. During the 
assignment process, the algorithm then calculates the closeness of operation nodes and 
available function units when selecting the best suited unit for each operation node. 
For an operation node $v_i$, if $v_i$ can be performed in a unit $v_c$, then $Closeness(v_i,v_c)$ is 
calculated as follows: 
for (v_j ∈ v_c){ 
    if (t ∈ t_in_elapse(v_i) and t ∈ t_in_elapse(v_j)) 
        Closeness(v_i, v_c) = Closeness(v_i, v_c) + 1; 
    if (t ∈ t_out_elapse(v_i) and t ∈ t_out_elapse(v_j)) 
        Closeness(v_i, v_c) = Closeness(v_i, v_c) + 1; 
} 
Since a function unit cannot be shared by operation nodes in the same set, 
the algorithm assigns the operation nodes to the function units one set at a time. 
The algorithm calculates the closeness between each operation node and the available 
units, and assigns each operation node to the unit with the maximum closeness. 
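A minimal sketch of the closeness measure and the set-by-set assignment follows. It rests on several assumptions of ours: the closeness test is read as "the two nodes share at least one elapse time", ties go to the first candidate unit, and the names `closeness`, `assign_set`, and the sample data are hypothetical.

```python
def closeness(node, unit_nodes, t_in, t_out):
    """Count input and output elapse-time matches between node and a unit's nodes."""
    score = 0
    for other in unit_nodes:
        if set(t_in[node]) & set(t_in[other]):    # a common input elapse time
            score += 1
        if set(t_out[node]) & set(t_out[other]):  # a common output elapse time
            score += 1
    return score


def assign_set(nodes, units, op_type, unit_type, t_in, t_out):
    """Assign the nodes of ONE disjoint set; no unit takes two nodes of the set."""
    used = set()
    for v in nodes:
        candidates = [u for u in units
                      if u not in used and unit_type[u] == op_type[v]]
        # max() raises ValueError if the set needs more units than exist.
        best = max(candidates, key=lambda u: closeness(v, units[u], t_in, t_out))
        units[best].append(v)
        used.add(best)


units = {"A1": ["+1"], "A2": ["+2"]}  # nodes already placed from earlier sets
unit_type = {"A1": "add", "A2": "add"}
op_type = {"+3": "add", "+4": "add"}
t_in = {"+1": [1], "+2": [2], "+3": [2], "+4": [1]}
t_out = {"+1": [1], "+2": [1], "+3": [0], "+4": [1]}
assign_set(["+3", "+4"], units, op_type, unit_type, t_in, t_out)
print(units)  # {'A1': ['+1', '+4'], 'A2': ['+2', '+3']}
```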
4.2.2. Improvement by interchanging 
In the interchanging improvement step, the algorithm takes into account the data 
similarity among function units. The algorithm minimizes the multiplexer cost by 
merging the hyperedges. We first describe how to find a feasible merging solution that 
allows two hyperedges to be merged. Finding a feasible merging solution consists of two 
parts: 
Finding a pair of feasible hyperedges. For any hypernode, there are two possible 
ways to merge the hyperedges: (i) merging of the incoming hyperedges, and (ii) merging 
of the outgoing hyperedges. An example of case (i) is shown in Figure 5(a). The 
hypernode $v_c$ has two incoming hyperedges $e_{a1,c5}$ and $e_{b3,c6}$ from $v_a$ and $v_b$ respectively. 
An example of case (ii) is shown in Figure 5(b). The hypernode $v_a$ has two outgoing 
hyperedges $e_{a1,b3}$ and $e_{a2,c6}$ entering $v_b$ and $v_c$ respectively. In both cases, if the two 
hyperedges (1) have the same elapse time and (2) exit or enter hypernodes of the same type 
($type(v_a)=type(v_b)$ in case (i) and $type(v_b)=type(v_c)$ in case (ii)), then they 
can possibly be merged by rearranging the operation nodes in $v_a$ and $v_b$ (Figure 5(a)) or 
in $v_b$ and $v_c$ (Figure 5(b)). Therefore, a pair of feasible hyperedges can be defined as 
two hyperedges that (i) have the same elapse time and (ii) either exit from 
hypernodes of the same type and enter the same destination hypernode, or exit from 
the same hypernode and enter destination hypernodes of the 
same type. 
Figure 5. Feasible hyperedges. 
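The definition of a feasible pair can be written as a predicate. The `Hyperedge` tuple, the `node_type` mapping, and the hypernode names below are illustrative assumptions, not the report's data structures.

```python
from typing import NamedTuple


class Hyperedge(NamedTuple):
    src: str     # source hypernode
    dst: str     # destination hypernode
    elapse: int  # data dependency elapse time


def is_feasible_pair(e1: Hyperedge, e2: Hyperedge, node_type: dict) -> bool:
    """Same elapse time, and either case (i) or case (ii) holds."""
    if e1.elapse != e2.elapse:
        return False
    # Case (i): same destination, different sources of the same type.
    incoming = (e1.dst == e2.dst and e1.src != e2.src
                and node_type[e1.src] == node_type[e2.src])
    # Case (ii): same source, different destinations of the same type.
    outgoing = (e1.src == e2.src and e1.dst != e2.dst
                and node_type[e1.dst] == node_type[e2.dst])
    return incoming or outgoing


types = {"va": "add", "vb": "add", "vc": "mul"}
print(is_feasible_pair(Hyperedge("va", "vc", 1), Hyperedge("vb", "vc", 1), types))  # True
print(is_feasible_pair(Hyperedge("va", "vb", 1), Hyperedge("va", "vc", 1), types))  # False
```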
Finding a feasible rearrangement of the operation nodes in a pair of hypernodes. 
After locating a pair of feasible hyperedges, a pair of feasible hypernodes can be 
located. For example, in Figure 5(a), $e_{a1,c5}$ and $e_{b3,c6}$ are the feasible hyperedges; $v_a$ and 
$v_b$ are the feasible hypernodes. There are two possible ways to merge $e_{a1,c5}$ and $e_{b3,c6}$: (i) 
relocating $v_1$ from $v_a$ to $v_b$ or (ii) relocating $v_3$ from $v_b$ to $v_a$. The rearrangement of 
operation nodes must not violate the function unit sharing rule as described in the 
previous section. For example, in case (i), if $v_3$ and $v_1$ cannot share the same function 
unit ($v_3$ and $v_2$ are in the same disjoint set), then the algorithm has to interchange $v_3$ 
and $v_2$ rather than moving $v_3$ to $v_a$. However, if $v_3$ and $v_1$ cannot be assigned to the 
same operation unit, then a feasible rearrangement of operation nodes does not exist, 
since $e_{a1,c5}$ and $e_{b3,c6}$ cannot be merged by interchanging $v_3$ and $v_1$. 
Figure 6. "Lock" hyperedges. 
Using a bucket structure [2], the algorithm first sorts the hypernodes in decreasing 
order of their number of feasible hyperedges. After finding a 
feasible merging solution, the algorithm calculates the total multiplexer cost. If a 
smaller multiplexer cost is obtained, then the algorithm merges the hyperedges; 
otherwise, the algorithm continues to find the next feasible merging solution. After 
merging, if the number of operation nodes in a hypernode is equal to the weight of its 
incoming or outgoing hyperedge, then this hypernode achieves the maximum 
interconnect sharing. Hence, the algorithm will lock this hypernode, i.e. no more 
interchange for this hypernode is possible. For example, in Figure 6(a), 
$q(v_a)=w(e_{a123,b567})=3$. In Figure 6(b), $q(v_b)=w(e_{a23,b45})=2$. Both hypernodes will be 
locked. The algorithm runs repeatedly until no more hyperedges can be merged. 
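The locking test is a simple comparison, sketched below under the assumption that a hypernode is described by its node count q and the weights of its incoming and outgoing hyperedges. The function name and argument layout are ours.

```python
def is_locked(q: int, incoming_weights: list, outgoing_weights: list) -> bool:
    """A hypernode with q operation nodes is locked once some incoming or
    outgoing hyperedge carries all q of its dependency edges."""
    return q in incoming_weights or q in outgoing_weights


print(is_locked(3, [], [3]))      # True: like Figure 6(a), q = w = 3
print(is_locked(3, [2, 1], [1]))  # False: no hyperedge has weight 3
```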
Algorithm II Function Unit Assignment 
Let F be a set of feasible merging solutions; 
Function_Unit_Assignment(G,S,T,R){ 
    /* initial assignment */ 
    calculate_data_dependency_elapse_time(G,T); 
    V = build_hypernode(R); 
    for (k = 0 to l-1){ 
        for (v_i ∈ s_k) 
            closeness_calculation(v_i,V); 
        function_unit_assignment(s_k,V); 
    } 
    H = build_hypergraph(V,G); 
    /* interchanging improvement */ 
    mux_cost = mux_cost_calculation(H,∅); 
    no_more_merge = FALSE; 
    while (no_more_merge == FALSE){ 
        F = find_feasible_solution(H); 
        if (F == ∅) 
            no_more_merge = TRUE; 
        else{ 
            mux_cost_merge = mux_cost_calculation(H,F); 
            if (mux_cost_merge < mux_cost){ 
                merge_hyperedge(H,F); 
                mux_cost = mux_cost_merge; 
            } 
        } 
    } 
} 
5. Experimental results 
The algorithms are written in the C language, and the prototype implementation 
currently runs on SUN 4 workstations under the UNIX operating system. 
We have applied our algorithms to two examples: a FIR filter [8] (Figure 2(a)) and 
an elliptic filter [6] (Figure 10). For the FIR filter example, we tested latencies 
from 1 to 6. The schedules and structural netlists for 
latency=2, 3, and 4 are shown in Figures 7, 8, and 9 respectively. Table 1 compares 
our results with the results in [4], which is the only published paper 
documenting a complete set of results for the FIR filter and the elliptic filter examples. 
The results show that the number of multiplexer inputs was reduced by up to 34% and the 
number of latches by up to 19% using our algorithms. 
For the elliptic filter example, we tested latencies from 1 
to 9. The schedules, operation assignments, and structural netlists for 
latency=8 and 9 are shown in Figures 11 and 12 respectively. The results in Table 2 
show that the number of multiplexer inputs and the number of latches were reduced by up 
to 30% compared to the results in [4]. However, in the case of latency=4, our algorithm 
used 8 adders whereas [4] used 7 adders. 
6. Conclusions 
We presented partitioning-based algorithms for pipeline scheduling, module 
assignment, and interconnect sharing. Based on a hypergraph model, the algorithms 
use clustering and interchange improvement techniques to maximize interconnect 
sharing. The results have shown significant improvement over other published results. 
This research has demonstrated that the hypergraph model facilitates the identification 
of sharable resources and the calculation of interconnect costs. Furthermore, this approach 
produces these results quickly: each FIR filter run reported in Table 1 took at most 
0.5 seconds of CPU time. 
7. Acknowledgements 
This work was supported by NSF grant #MIP-8922851, California MICRO grant 
#90-046, and contributions from Rockwell International, Western Digital, and Silicon 
Systems Inc. We are grateful for their support. The authors would also like to thank Tedd 
Hadley, L. Ramachandran, Viraphol Chaiyakul, and Nels Vander Zanden for their 
useful discussions. 
Figure 7. The schedule and structural netlist of a FIR filter with latency=2. 
Figure 8. The schedule and structural netlist of a FIR filter with latency=3. 
Figure 9. The schedule and structural netlist of a FIR filter with latency=4. 
          Multipliers        Adders             Size of multiplexers   Number of registers    CPU time (sec) 
Latency   ours  [4]   %      ours  [4]   %      ours  [4]   %          ours  [4]   %          ours  [4]   % 
1         8     8     0      15    15    0      0     0     0          52    57    -8.9       0.3   9     - 
2         4     4     0      8     8     0      32    40    -20.0      34    42    -19.0      0.5   138   - 
3         3     3     0      5     5     0      30    42    -28.6      37    43    -14.0      0.3   144   - 
4         2     2     0      4     4     0      27    41    -34.1      50    50    0          0.3   142   - 
5         2     2     0      4     4     0      33    39    -15.4      43    46    -6.5       0.4   143   - 
6         2     2     0      3     3     0      34    37    -8.1       37    43    -14.0      0.3   147   - 
ours: our results. 
[4]: the results in [4]. 
Table 1. Results and Comparisons of a FIR filter example. 
Figure 10. The data flow graph of an elliptic filter example. 
(a) Schedule 
step 1: +1,+3 
step 2: +2,+4,+5 
step 3: *1,*2 
step 4: +6,+7,+8,+9 
step 5: +10,+11,*3,*4 
step 6: +12,+13,+14,+15,+16,+19 
step 7: +17,+18,*5,*6 
step 8: +20,+22,+25,*7,*8 
step 9: +21,+23,+24,+26 
(b) Operation assignment 
FU   operations 
*1   *1,*3,*5,*8 
*2   *2,*4,*6,*7 
+1   +1,+2,+14 
+2   +3,+5,+8,+11,+16 
+3   +4,+6,+10,+12,+20,+21 
+4   +15,+22,+23 
+5   +7,+13,+18,+24,+25 
+6   +9,+17,+19,+26 
Figure 11. The (a) schedule, (b) operation assignment, and (c) 
structural netlist of an elliptic filter with latency=8. 
(a) Schedule 
step 1: +1,+2,+3 
step 2: +4,+5 
step 3: *1,*2 
step 4: +6,+7,+8,+9 
step 5: +10,+11,*3,*4 
step 6: +12,+13,+16,+19 
step 7: +14,+15,+17,+18,*5,*6 
step 8: +20,+22,+25,*7,*8 
step 9: +21,+23,+24,+26 
(b) Operation assignment 
FU   operations 
*1   *1,*3,*5,*8 
*2   *2,*4,*6,*7 
+1   +2,+6,+12,+14,+21,+25 
+2   +1,+4,+7,+10,+13,+15,+20,+24 
+3   +8,+16,+17,+23 
+4   +3,+5,+9,+11,+18,+19,+22,+26 
Figure 12. The (a) schedule, (b) operation assignment, and (c) 
structural netlist of an elliptic filter with latency=9. 
