PACE: A dynamic programming algorithm for hardware/software partitioning by Knudsen, Peter Voigt & Madsen, Jan
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
PACE: A dynamic programming algorithm for hardware/software partitioning
Knudsen, Peter Voigt; Madsen, Jan
Published in:
Proceedings. Fourth International Workshop on Hardware/Software Co-Design (Code/CASHE `96)
Link to article, DOI:
10.1109/HCS.1996.492230
Publication date:
1996
Document Version
Publisher's PDF, also known as Version of record
Link back to DTU Orbit
Citation (APA):
Knudsen, P. V., & Madsen, J. (1996). PACE: A dynamic programming algorithm for hardware/software
partitioning. In Proceedings. Fourth International Workshop on Hardware/Software Co-Design (Code/CASHE
`96) (pp. 85-92). IEEE. DOI: 10.1109/HCS.1996.492230
PACE: A Dynamic Programming Algorithm for Hardware/Software 
Partitioning 
Peter Voigt Knudsen and Jan Madsen 
Department of Computer Science, Technical University of Denmark, DK-2800 Lyngby, Denmark
Abstract. 
This paper presents the PACE partitioning algorithm which is used in the
LYCOS co-synthesis system for partitioning control/dataflow graphs into
hardware and software parts. The algorithm is a dynamic programming
algorithm which solves both the problem of minimizing system execution
time with a hardware area constraint and the problem of minimizing
hardware area with a system execution time constraint. The target
architecture consists of a single microprocessor and a single hardware
chip (ASIC, FPGA, etc.) which are connected by a communication channel.
The algorithm incorporates a realistic communication model and thus
attempts to minimize communication overhead. The time complexity of the
algorithm is $O(n^2 \cdot A)$ and the space complexity is $O(n \cdot A)$,
where $A$ is the total area of the hardware chip and $n$ is the number of
code fragments which may be placed in either hardware or software.
1 Introduction 
The hardware/software partitioning of a system specification onto a
target architecture consisting of a single CPU and a single ASIC has been
investigated by a number of research groups [2, 5, 1, 7, 8, 11]. This
target architecture is relevant in many areas where the performance
requirements cannot be met by general-purpose microprocessors, and where
a complete ASIC solution is too costly. Such areas may be found in DSP
design, construction of embedded systems, software execution acceleration
and hardware emulation and prototyping [10].
One of the major differences among partitioning approaches is in the way
communication between hardware and software is taken into account during
partitioning. Henkel, Ernst et al. [2, 1, 6] present a simulated
annealing algorithm which moves chunks of software code (in the following
called blocks) to hardware until timing constraints are met. The
algorithm accounts for communication and only variables which need to be
transferred are actually taken into account, i.e., the possibility of
local store is exploited. Gupta and De Micheli [5] present a partitioning
approach which starts from an all-hardware solution. Their algorithm
takes communication into account and is able to reduce communication when
neighboring vertices are placed together in either software or hardware.
The system model presented by Jantsch et al. [7] ignores communication.
They present a dynamic programming algorithm based on the Knapsack
algorithm which solves the partitioning problem for the case where some
blocks include other blocks and are therefore mutually exclusive. The
algorithm has exponential memory requirements which makes it impractical
to use for large applications. To solve this problem they propose a
pre-selection scheme which only selects blocks which induce a speedup
greater than 10%. However, this pre-selection scheme may fail to produce
good results as communication overhead is ignored. Kalavade and Lee [8]
present a partitioning algorithm which does take communication into
account by attributing a fixed communication time to each pair of blocks.
This approach may overestimate the communication overhead as more
variables than actually needed are transferred.
In this paper we present a dynamic programming 
algorithm called PACE [9] which solves the hard- 
ware/software partitioning problem taking communi- 
cation overhead into account. 
2 System Model 
This section presents the system model used by the 
partitioning algorithm, and describes how it is obtained 
from the functional specification. 
2.1 The CDFG Format
The functional specification, which is currently de- 
scribed in VHDL or C, is internally represented as a 
control/data flow graph (CDFG) which can be defined 
as follows: 
Definition 1 A CDFG is a set of nodes and directed edges $(N, E)$ where
an edge $e_{i,j} = (n_i, n_j)$ from $n_i \in N$ to $n_j \in N$,
$i \neq j$, indicates that $n_j$ depends on $n_i$ because of data
dependencies and/or control dependencies.
Definition 2 A node $n_i \in N$ is recursively defined as
    n_i     = DFG | Cond | Loop | FU | Wait
    Cond    = (Branch1, Branch2)
    Loop    = (Test, Body)
    Branch1 = CDFG
    Branch2 = CDFG
    Test    = CDFG
    Body    = CDFG
    FU      = CDFG
where a DFG is a pure dataflow graph without control structures, FU
represents a function call, Wait is used for synchronization with the
environment, Branch1 and Branch2 are the CDFGs to be executed in the
"true" and "false" branch case of a conditional Cond respectively, and
Test and Body are the test and body CDFGs of a Loop.
2.2 Derivation of Basic Scheduling Blocks
In order to be able to partition the CDFG, it must first be divided into
fragments, in the following called Basic Scheduling Blocks or BSBs. For
each node $n_i$ in the CDFG, a BSB is created as shown in figure 1. Each
BSB can have child BSBs which are shown indented under the BSB. In this
way a BSB hierarchy which reflects the hierarchy of the CDFG is obtained.
A BSB stores information which is used by the partitioning algorithm to
determine whether it should be placed in hardware or software:
Definition 3 A BSB, $B_i$, is defined as a six-tuple
$B_i = \langle a_{s,i}, t_{s,i}, a_{h,i}, t_{h,i}, r_i, w_i \rangle$
where $a_{s,i}$ and $t_{s,i}$ are the area and execution time of $B_i$
when placed in software, $a_{h,i}$ and $t_{h,i}$ are the area and
execution time of $B_i$ when placed in hardware, and $r_i$ and $w_i$
contain the read-set and write-set variables of $B_i$.
[Figure 1: The BSB hierarchy and its correspondence with the CDFG.]
In order to be able to control the number and sizes of the BSBs which are
considered by the partitioning algorithm, parent BSBs can be collapsed so
as to appear as single BSBs instead of the child BSBs they are composed
of. This is illustrated in figure 2. The partitioning algorithm only
considers leaf BSBs, which are BSBs that have no children. The leaf BSBs
are marked with a dot in the figure. When BSBs are collapsed, the number
of leaf BSBs decreases. As the execution time of the PACE algorithm
depends quadratically on the number of BSBs, it is relevant to be able to
control the number of BSBs in this way.
[Figure 2: Adjusting BSB granularity by hierarchical collapsing. Panels:
original hierarchy (seven leaf BSBs); Cond BSB collapsed (six leaf BSBs);
Test and Body BSBs collapsed (five leaf BSBs).]
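The effect of hierarchical collapsing on the number of leaf BSBs can be
illustrated with a small Python sketch. The class `Node`, the functions
`leaves` and `collapse`, and the exact shape of the example tree are
assumptions made for illustration; the tree is only chosen so that it
reproduces the leaf counts quoted for figure 2 (seven, six and five).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One BSB in the hierarchy (hypothetical representation)."""
    kind: str
    children: list = field(default_factory=list)


def leaves(node):
    """Return the leaf BSBs below `node`, in left-to-right order."""
    if not node.children:
        return [node]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


def collapse(node, kinds):
    """Copy the hierarchy, turning every BSB whose kind is in `kinds` into a leaf."""
    if node.kind in kinds:
        return Node(node.kind)
    return Node(node.kind, [collapse(c, kinds) for c in node.children])


# Assumed example hierarchy with seven leaf BSBs.
tree = Node("CDFG", [
    Node("DFG"),
    Node("Loop", [
        Node("Test", [Node("FU")]),
        Node("Body", [Node("DFG"), Node("DFG")]),
    ]),
    Node("Cond", [
        Node("Branch1", [Node("DFG")]),
        Node("Branch2", [Node("DFG")]),
    ]),
    Node("DFG"),
])

original = len(leaves(tree))
cond_collapsed = len(leaves(collapse(tree, {"Cond"})))
all_collapsed = len(leaves(collapse(tree, {"Cond", "Test", "Body"})))
```

Since the running time grows quadratically with the number of leaf BSBs,
going from seven to five leaves roughly halves the number of sequences
that have to be considered (28 versus 15).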
As all leaf BSBs together make up the total system 
functionality, we can now define the system specifica- 
tion in terms of leaf BSBs: 
Definition 4 A system specification $S$ is described as an ordered list
of $n$ leaf BSBs, i.e. $\{B_1, B_2, \ldots, B_n\}$ where $B_i$ denotes
BSB number $i$.
In order to estimate performance, it is necessary to 
know how many times each BSB is executed for typical 
input data. This information is obtained from profiling. 
It is convenient to define two global functions which 
return profiling information for individual BSBs and 
individual variables: 
Definition 5 The function $pc : B_i \in S \to \mathit{Nat}$ returns the
number of times $B_i$ has been executed in a profiling run.$^1$
Definition 6 Let $V$ denote the set of all variables from the read-sets
and write-sets of the BSBs in $S$. Then, for a given variable $v$ from
the read-set or write-set of $B_i$, the function
$ac : v \in V \to \mathit{Nat}$ returns the number of times the variable
is accessed by $B_i$:$^2$
$(v \in r_i) \lor (v \in w_i) \Rightarrow ac(v) = pc(B_i)$
Footnote 1: "pc" is short for "profiling count".
Footnote 2: "ac" is short for "access count".
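As an illustration of Definitions 3, 5 and 6, the BSB six-tuple and the
two profiling functions might be represented as in the following Python
sketch. The field names, the dictionary used for profiling counts and the
numeric values are all hypothetical, and returning 0 for a variable that
the BSB does not touch is an assumption, since Definition 6 only covers
variables in the read-set or write-set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BSB:
    """Six-tuple of Definition 3; field names are illustrative only."""
    a_s: int            # area when placed in software
    t_s: int            # execution time when placed in software
    a_h: int            # area when placed in hardware
    t_h: int            # execution time when placed in hardware
    reads: frozenset    # read-set r_i
    writes: frozenset   # write-set w_i


def ac(var, bsb, pc):
    """Access count of `var` by `bsb`: equals pc(B_i) if B_i reads or writes it.

    Returning 0 for untouched variables is an assumption of this sketch.
    """
    if var in bsb.reads or var in bsb.writes:
        return pc[bsb]
    return 0


# Hypothetical BSB and profiling count.
b1 = BSB(a_s=0, t_s=40, a_h=3, t_h=10,
         reads=frozenset({"x"}), writes=frozenset({"y"}))
pc = {b1: 25}

count_x = ac("x", b1, pc)   # x is read by b1
count_z = ac("z", b1, pc)   # z is not accessed by b1
```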
3 Problem formulation 
The partitioning model which the PACE algorithm uses is illustrated in
figure 3. In this model hardware BSBs and software BSBs cannot execute in
parallel. Furthermore, adjacent hardware BSBs are assumed to be able to
communicate the read/write variables they have in common directly between
them without involving the software side. As illustrated in the figure, a
given hardware/software partition can be thought of as composed of
sequences of adjacent BSBs which only communicate their effective read-
and write-sets from/to the software side. The following definitions
formalize these assumptions.
[Figure 3: Partitioning model used by PACE: a) Example of actual
data-dependencies between hardware and software BSBs, b) How
data-dependencies between adjacent hardware BSBs and software BSBs are
interpreted in the model.]
Definition 7 $S_{i,j}$, $j \ge i$, denotes the sequence of BSBs
$\{B_i, B_{i+1}, \ldots, B_j\}$.
Definition 8 The effective read-set and the effective write-set of a
sequence $S_{i,j}$ are denoted $r_{i,j}$ and $w_{i,j}$ respectively and
are defined as
$r_{i,j} = (r_i \cup r_{i+1} \cup \ldots \cup r_j) \setminus (w_i \cup w_{i+1} \cup \ldots \cup w_j)$
$w_{i,j} = (w_i \cup w_{i+1} \cup \ldots \cup w_j) \setminus (r_i \cup r_{i+1} \cup \ldots \cup r_j)$
Using these definitions and the BSB definitions given in section 2.2 we
can now compute the speedup induced by moving a sequence of BSBs from
software to hardware:
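Definition 9 The total (possibly negative) speedup induced by moving a
BSB sequence $S_{i,j}$ to hardware is denoted $s_{i,j}$ and is computed as
$$s_{i,j} = \sum_{k=i}^{j} pc(B_k)\,(t_{s,k} - t_{h,k}) - \left( \sum_{v \in r_{i,j}} ac(v)\, t_{s \to h} + \sum_{v \in w_{i,j}} ac(v)\, t_{h \to s} \right)$$
where $t_{s \to h}$ and $t_{h \to s}$ denote the software-to-hardware and
hardware-to-software communication times for a single variable,
respectively.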
Definition 10 The area penalty $a_{i,j}$ of moving $S_{i,j}$ to hardware
is computed as the sum of the individual BSB areas:
$$a_{i,j} = \sum_{k=i}^{j} a_{h,k}$$
In section 4.2 we discuss how the effect of hardware 
sharing is taken into account. Note that in calculating 
the speedup and area of a sequence it is not considered 
that hardware synthesis may synthesize the sequence 
as a whole which would probably reduce both sequence 
area and execution time as compared to just summing 
the individual area- and execution time components 
as described above. Incorporating such sequence op- 
timizations in the estimations will be fairly straight- 
forward but has not been carried out yet. Note, how- 
ever, that the improvement in speedup induced by all 
BSBs within the sequence being able to communicate 
directly with each other is taken into account. 
The partitioning problem can now be formulated 
as that of finding the combination of non-overlapping 
hardware sequences which yields the best speedup 
while having a total area penalty less than or equal 
to the available hardware area A. 
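The quantities of Definitions 8 to 10 can be sketched in Python as
follows. This is a minimal sketch under stated assumptions: the speedup
is taken to be the profiled execution-time gain of the BSBs minus the
communication time for the effective read- and write-sets, the access
counts are supplied by the caller as a dictionary `ac`, and all names
(`BSB`, `effective_sets`, `speedup`, `area`) and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BSB:
    t_s: int            # software execution time per execution
    t_h: int            # hardware execution time per execution
    a_h: int            # hardware area
    reads: frozenset    # read-set r_i
    writes: frozenset   # write-set w_i
    pc: int = 1         # profiling count


def effective_sets(seq):
    """Effective read- and write-set of a sequence, following Definition 8."""
    reads = frozenset().union(*(b.reads for b in seq))
    writes = frozenset().union(*(b.writes for b in seq))
    return reads - writes, writes - reads


def speedup(seq, ac, t_sh, t_hs):
    """Assumed form: execution-time gain minus communication overhead."""
    r, w = effective_sets(seq)
    gain = sum(b.pc * (b.t_s - b.t_h) for b in seq)
    comm = sum(ac[v] for v in r) * t_sh + sum(ac[v] for v in w) * t_hs
    return gain - comm


def area(seq):
    """Area penalty of Definition 10: sum of the individual hardware areas."""
    return sum(b.a_h for b in seq)


# Two adjacent BSBs: b1 produces y, which b2 consumes.
b1 = BSB(t_s=10, t_h=4, a_h=1, reads=frozenset({"x"}), writes=frozenset({"y"}))
b2 = BSB(t_s=10, t_h=4, a_h=1, reads=frozenset({"y"}), writes=frozenset({"z"}))
ac = {"x": 1, "y": 1, "z": 1}

s1 = speedup([b1], ac, t_sh=1, t_hs=1)        # 6 - 2
s2 = speedup([b2], ac, t_sh=1, t_hs=1)        # 6 - 2
s12 = speedup([b1, b2], ac, t_sh=1, t_hs=1)   # 12 - 2, y stays in hardware
```

In this example the sequence speedup (10) exceeds the sum of the two
individual speedups (8) because the variable y no longer crosses the
hardware/software boundary.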
4 Software, Hardware and Communication Estimation
This section describes how hardware area- and exe- 
cution time, software execution time, and communica- 
tion time are estimated. 
4.1 Software Estimation
Software execution time for a pure DFG (i.e., no 
controlflow) is estimated by performing a topological 
sort (linearization) of the nodes in the DFG. The nodes 
are then translated into a generic instruction set with 
the addressing modes of the instructions determined 
by data-dependencies and a greedy register allocation 
scheme. The execution times of the generic instructions 
are then determined from a technology file correspond- 
ing to  the target microprocessor. This is similar to the 
approach described in [3], where good estimation re- 
sults are reported, and the same technology files for 
the 8086, 80286, 68000 and 68020 microprocessors are 
used. The execution time of the DFG is obtained by 
summing the execution times of the generic instructions 
and multiplying the sum with the profiling count for the 
DFG. Execution times for higher level constructs such 
as loop BSBs and branch BSBs are obtained on basis 
of the execution times of their child BSBs. 
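The software estimate for a single DFG amounts to a table lookup and a
multiplication, which the following Python sketch illustrates. The
technology file contents, the instruction names and the function name
`dfg_software_time` are hypothetical; the cycle counts are placeholders
and not real timings for any of the processors mentioned above.

```python
# Hypothetical technology file: generic instruction -> execution time (cycles).
TECH_FILE = {"MOV": 2, "ADD": 3, "MUL": 20}


def dfg_software_time(instructions, tech_file, profiling_count):
    """Sum the generic instruction times and scale by the profiling count."""
    per_run = sum(tech_file[op] for op in instructions)
    return per_run * profiling_count


# A linearized DFG of four generic instructions, executed 10 times.
example_time = dfg_software_time(["MOV", "MUL", "ADD", "MOV"], TECH_FILE,
                                 profiling_count=10)
```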
4.2 Hardware area estimation
A common way of estimating the hardware area of 
a BSB is to estimate how much area a full hardware 
implementation of the BSB will occupy. This includes 
hardware to  execute the calculations of the BSB and 
hardware to control the sequencing of these calcula- 
tions. If the total chip area is divided into a datapath 
area and a controller area, each BSB moved to hard- 
ware may be viewed as occupying a part of the datap- 
ath and a part of the controller. Figure 4a shows this 
model when one BSB has been moved to hardware. 
[Figure 4: BSB area estimation which accounts for hardware sharing:
a) Controller and datapath area for a single BSB,
$Area = A_{C1} + A_{D1}$; b) When sharing hardware, the total area for
multiple BSBs is less than the summation of the individual areas,
$Area \le (A_{C1} + A_{C2}) + (A_{D1} + A_{D2})$; c) Variable controller
area and fixed datapath area for multiple BSBs with hardware sharing,
$Area = (A_{C1} + A_{C2}) + A_D$.]
When several BSBs are moved to hardware they may 
share hardware modules as they execute in mutual ex- 
clusion. Hence, an approach which estimates area as 
the summation of datapath and control areas for all 
hardware BSBs will probably overestimate the total 
area. This problem is depicted in figure 4b where the 
area of the datapath is not  equal to the sum of the 
individual BSB datapaths. 
In our approach the datapath area is the area of 
a set of preallocated hardware modules in the datap- 
ath as illustrated in figure 4c. The BSBs share these 
modules and the controller area is the area left for the 
BSB controllers. The hardware area of a BSB is then 
estimated as the hardware area of the corresponding 
controller, and will depend on the number of timesteps 
required for executing the BSB. The hardware area of 
a BSB therefore depends on its execution time. 
4.3 Hardware execution time estimation
The hardware execution time for a DFG is deter- 
mined by dynamic list based scheduling [4] which at- 
tempts to utilize the hardware modules in the given al- 
location in order to maximize parallelism and thereby 
minimize execution time. The execution time obtained 
in this way is multiplied with the profiling count for the 
DFG. The execution time for higher level constructs is 
obtained as in the software case. 
4.4 Communication estimation
Communication is currently assumed to  be memory 
mapped I/O. The transfer of k variables from software 
to hardware is assumed to require k generic MOV mi- 
croprocessor instructions and k Import operations as 
defined in the hardware library. Communication from 
hardware to software is estimated in the same way, just 
using the hardware Export operation instead. 
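Under this memory mapped I/O model the communication time grows linearly
with the number of transferred variables. A minimal Python sketch,
assuming hypothetical per-operation times `t_mov`, `t_import` and
`t_export` (the values below are placeholders, not library data):

```python
def transfer_time(k, t_mov, t_port):
    """Time to move k variables: k MOV instructions plus k port operations."""
    return k * (t_mov + t_port)


# Placeholder timings for illustration only.
t_mov, t_import, t_export = 2, 1, 1

sw_to_hw = transfer_time(3, t_mov, t_import)   # three variables to hardware
hw_to_sw = transfer_time(2, t_mov, t_export)   # two variables back to software
```

With k = 1 the same function gives the single-variable communication
times used when computing the speedup of a sequence.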
5 The PACE algorithm 
The idea behind the PACE algorithm is best illus- 
trated by an example. Figure 5 shows four BSBs which 
must be partitioned as to reach the largest speedup on 
the available area A=3. The speedup and area penalty 
for a single BSB which is moved to hardware is shown 
below each BSB. The numbers between two BSBs de- 
note the extra speedup which is incurred because of 
the BSBs being able to communicate directly with each 
other when they are both placed in hardware. 
[Figure 5: Example of partitioning problem with communication cost
considerations. Four BSBs A, B, C and D with area penalties
$a_A = a_B = a_C = a_D = 1$ and inherent speedups $s_A = 5$, $s_B = 10$,
$s_C = 2$, $s_D = 10$.]
Obviously B and D should be placed in hardware as they have large
inherent speedups. This leaves room for one more BSB. Should it be A or
C? The answer to this is not obvious as A induces a large inherent
speedup but a small communication speedup when placed together with B in
hardware, whereas C induces a smaller inherent speedup but on the other
hand induces a large communication speedup when placed together with B
and D in hardware. The following paragraphs explain how the PACE
algorithm solves this problem.
The algorithm utilizes the previously mentioned 
fact, that any possible partition can be thought of as 
composed of sequences of BSBs. If A, C and D are chosen for hardware, it
corresponds to choosing the sequences $S_{A,A}$ and $S_{C,D}$. The
speedup of sequence $S_{C,D}$ is larger than the sum of speedups of its
components C and D due to the extra communication speedup induced by both
blocks being chosen for hardware. So a natural approach will be to
calculate the areas and speedups of all sequences of BSBs, and choose the
combination of sequences that induces the largest speedup. The areas and
speedups of all sequences are calculated and shown in table 1. The
ordering and grouping of BSBs is explained below.
Sequence     Elements   Area   Speedup
Group A: All sequences ending with A
$S_{A,A}$    A          1      5
Group B: All sequences ending with B
$S_{A,B}$    AB         2      17
$S_{B,B}$    B          1      10
Group C: All sequences ending with C
$S_{A,C}$    ABC        3      21
$S_{B,C}$    BC         2      14
$S_{C,C}$    C          1      2
Group D: All sequences ending with D
$S_{A,D}$    ABCD       4      35
$S_{B,D}$    BCD        3      28
$S_{C,D}$    CD         2      16
$S_{D,D}$    D          1      10
Table 1: Grouping of sequences.
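The entries of table 1 can be regenerated with a few lines of Python,
which also makes the grouping by last element explicit. The inherent
speedups and areas are those of the example; the extra communication
speedups between neighbours (2 for A-B, 2 for B-C and 4 for C-D) are not
stated in the text and are derived here from the table itself, for
instance 17 - (5 + 10) = 2 for A-B. The function name `sequence_table` is
hypothetical.

```python
NAMES = "ABCD"
AREA = {"A": 1, "B": 1, "C": 1, "D": 1}
INHERENT = {"A": 5, "B": 10, "C": 2, "D": 10}
# Extra speedup when neighbours k and k+1 are both in hardware
# (derived from the table: A-B, B-C, C-D).
EXTRA = [2, 2, 4]


def sequence_table():
    """Return {group: [(elements, area, speedup), ...]} for all sequences."""
    groups = {}
    for hi in range(len(NAMES)):
        rows = []
        for lo in range(hi + 1):
            elems = NAMES[lo:hi + 1]
            a = sum(AREA[x] for x in elems)
            # Inherent speedups plus the communication gain of every
            # adjacent pair inside the sequence.
            s = sum(INHERENT[x] for x in elems) + sum(EXTRA[lo:hi])
            rows.append((elems, a, s))
        groups[NAMES[hi]] = rows
    return groups


table = sequence_table()
```

Each group contains exactly the sequences ending with the same BSB, so a
system of n BSBs yields n(n+1)/2 sequences, ten in this example.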
The problem is to find the combination of non-overlapping sequences which
fits the available area A and whose speedup sum is as large as possible.
This problem cannot be solved with an ordinary Knapsack Stuffing
algorithm as some of the sequences are mutually exclusive (because they
contain identical BSBs) and therefore cannot be moved to hardware at the
same time. But if the sequences are ordered and grouped as shown in the
table, a dynamic programming algorithm can be constructed which does not
attempt to choose mutually exclusive sequences for hardware at the same
time.
The algorithm works as follows. Assume first that for each group up to
and including group C the best (maximum speedup) combination of sequences
has been found and stored for each (integer) area $a$ from zero up to the
available area $A$. Assume then that for instance sequence $S_{C,D}$ with
area $a_{C,D}$ is selected for hardware at the available area $a$. How is
the optimal combination of sequences on the remaining area
$a - a_{C,D}$ then found? As C and D have been chosen for hardware, only
A and B remain. So the best solution on the remaining area must be found
in group B which contains the best combination of sequences for all BSBs
from the set {A,B}. Similarly, if the "sequence" $S_{D,D}$ is chosen for
hardware, the best combination on the remaining area is found in group C.
The optimal combination is always found in the group whose letter in the
alphabet comes immediately before the letter of the first index in the
chosen sequence. The important thing to note is that when a sequence from
group X has been chosen, the optimal combination of sequences on the
remaining area can be found in one of the groups A to pred(X), and, when
sequences are selected as above, no mutually exclusive BSBs are selected
simultaneously. In this way the best solutions for a given group can
always be determined on basis of the best solutions found for the
previous groups.
[Figure 6: The PACE algorithm employed for a simple example.]
Figure 6 shows how the best combination of sequences can be found using
three matrices: Speedup[1..ns, 0..A], BestSpeedup[1..n, 1..A] and
BestChoice[1..n, 0..A]. ns is the number of sequences, n is the number of
BSBs and A is the available area. Zero entries are not shown. Arrows
indicate where values are copied from, but arrows are not shown for all
entries in order to make the figure more readable.
The Speedup matrix contains for each sequence and each available area the
best speedup that can be achieved if that sequence is first moved to
hardware and then sequences from the previous groups are moved to
hardware. In the figure, Speedup[$S_{B,C}$, 3] is 19 and is found as the
inherent speedup of $S_{B,C}$ which is 14 plus the best obtainable
speedup 5 on area $3 - a_{B,C} = 3 - 2 = 1$ in group A (as B and C have
been chosen). The BestSpeedup matrix contains for each group (of which
there are n) and each area the best speedup that can be achieved by first
selecting a sequence from that group or one of the previous groups. It
can be calculated as
$$BestSpeedup[g, a] = \max\left(\max_{S \in g}(Speedup[S, a]),\; BestSpeedup[pred(g), a]\right)$$
The BestChoice[g, a] matrix identifies the choice of sequence that gave
this maximum value. The last two matrices are interleaved and typeset
with bold letters in the figure.
In the example, BestSpeedup[C, 3] is 21 as this is the maximum speedup
that can be found in group C with available area 3 and it is larger than
the largest speedup that could be found in the previous groups, namely
17. The corresponding choice of sequence is $S_{A,C}$. In contrast,
BestSpeedup[C, 1] and BestChoice[D, 1] are copied from the corresponding
entries of the previous group. For group C this is because all Speedup
entries in that group for area 1 are smaller than the best speedup 10
achieved with only sequences from groups up to and including B. For group
D, Speedup[$S_{D,D}$, 1] is also 10, so the choice of best sequence for
this group is arbitrary.
The solution to the posed problem is found in the BestChoice[D, 3] and
BestSpeedup[D, 3] entries. The best initial choice is sequence $S_{B,D}$
with the corresponding total speedup of 28. As the area of this sequence
is 3, no other sequences were taken, and need thus not be found by
backtracking. This shows that it was best to choose C for hardware
instead of A. The area 4 was included in the figure to show that the
algorithm correctly chooses all four BSBs for hardware when there is room
for them. This can be seen from the [D, 4] entries.
Once the BestSpeedup and BestChoice lines have been calculated for each
group, the Speedup values are no longer needed. Actually, the Speedup
matrix is not needed at all, as it can be replaced by the BestSpeedup
matrix whose maximum values can be calculated "on the run". This is
because we are only interested in maximum values and corresponding choice
of sequences for each group. This means that instead of the memory
requirements being proportional to the number of sequences ns, they are
now proportional to n, as only the BestSpeedup and BestChoice matrices
are needed. The PACE algorithm is shown as algorithm 1.
After the algorithm has been run, the best speedup that can be obtained
is found in the entry BestSpeedup[NumBSBs, AvailableArea]. But as for the
simple Knapsack algorithm, reconstruction of the chosen sequences and
thereby of the chosen BSBs is necessary.
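The recurrence and the backtracking described above can be sketched in
Python as follows. This is an illustrative sketch and not the LYCOS
implementation: the function names `pace` and `reconstruct`, the
callables `area` and `speedup`, and the sentinel row 0 (standing for "no
BSBs", which takes the place of a special case for sequences starting at
the first BSB) are assumptions introduced here. The data are the ten
sequences of the four-BSB example.

```python
def pace(area, speedup, n, A):
    """Fill BestSpeedup (BS) and BestChoice (BC) for groups 1..n, areas 0..A.

    area(lo, hi) and speedup(lo, hi) describe sequence S_{lo,hi} (1-based).
    """
    BS = [[0] * (A + 1) for _ in range(n + 1)]     # row 0: no BSBs chosen
    BC = [[None] * (A + 1) for _ in range(n + 1)]
    for g in range(1, n + 1):
        for lo in range(1, g + 1):                 # all sequences ending in g
            sa, ss = area(lo, g), speedup(lo, g)
            for a in range(sa, A + 1):
                # Best total: this sequence plus the best of groups < lo.
                cand = ss + BS[lo - 1][a - sa]
                if cand > BS[g][a]:
                    BS[g][a], BC[g][a] = cand, (lo, g)
        for a in range(A + 1):                     # inherit from previous group
            if BS[g - 1][a] > BS[g][a]:
                BS[g][a], BC[g][a] = BS[g - 1][a], BC[g - 1][a]
    return BS, BC


def reconstruct(BS, BC, area, n, A):
    """Backtrack through BC and return the BSB numbers placed in hardware."""
    a = min(x for x in range(A + 1) if BS[n][x] == BS[n][A])
    g, chosen = n, []
    while g > 0 and BC[g][a] is not None:
        lo, hi = BC[g][a]
        chosen.extend(range(lo, hi + 1))
        a -= area(lo, hi)
        g = lo - 1
    return sorted(chosen)


# Speedups of the example sequences, indexed by (first BSB, last BSB).
SPEEDUP = {(1, 1): 5, (1, 2): 17, (2, 2): 10, (1, 3): 21, (2, 3): 14,
           (3, 3): 2, (1, 4): 35, (2, 4): 28, (3, 4): 16, (4, 4): 10}


def seq_area(lo, hi):
    return hi - lo + 1          # every BSB in the example has area 1


def seq_speedup(lo, hi):
    return SPEEDUP[(lo, hi)]


BS, BC = pace(seq_area, seq_speedup, n=4, A=3)
hw = reconstruct(BS, BC, seq_area, n=4, A=3)       # BSBs 2, 3, 4 = B, C, D
```

For the available area 3 the sketch selects B, C and D with a total
speedup of 28, and rerunning it with A = 4 selects all four BSBs with a
speedup of 35, in agreement with the entries discussed for the example.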
5.1 Algorithm Analysis
Direct inspection of the PACE algorithm shows that the time complexity is
$O(n^2 \cdot A)$ and the space complexity is $O(n \cdot A)$ (the
PACE-reconstruct algorithm obviously has smaller time and space
complexity and can hence be disregarded). Note that areas must be
expressed as integral values. $A$ can be reduced (at the expense of
partitioning quality) by using a larger "area granularity", for example
by expressing BSB sizes in
PACE (n, A)
{
  forall groups g = 1 to n do
    forall areas a = 0 to A do {
      BS[g, a] ← 0;
      BC[g, a] ← {};
    }
  forall groups g = 1 to n do {
    HighBSB ← g;
    for LowBSB = 1 to HighBSB do {
      SeqArea ← area(S_{LowBSB,HighBSB});
      SeqSpeedup ← speedup(S_{LowBSB,HighBSB});
      forall areas a = SeqArea to A do {
        if (LowBSB = 1) then {
          if SeqSpeedup > BS[g, a] then {
            BS[g, a] ← SeqSpeedup;
            BC[g, a] ← S_{LowBSB,HighBSB};
          }
        } else {
          if SeqSpeedup + BS[LowBSB-1, a-SeqArea] > BS[g, a] then {
            BS[g, a] ← SeqSpeedup + BS[LowBSB-1, a-SeqArea];
            BC[g, a] ← S_{LowBSB,HighBSB};
          }
        }
      }
    }
    if (HighBSB > 1)
      forall areas a = 0 to A do
        if BS[g-1, a] > BS[g, a] then {
          BS[g, a] ← BS[g-1, a];
          BC[g, a] ← BC[g-1, a];
        }
  }
}
PACE-reconstruct (n, A, BS[], BC[]) =
{
  HwBSBList ← {};
  AStart ← 0;
  Found ← false;
  while (AStart <= A) and not Found do
    if BS[n, AStart] = BS[n, A] then
      Found ← true
    else
      AStart ← AStart + 1;
  a ← AStart;
  g ← n;
  repeat {
    Seq ← BC[g, a];
    if Seq <> {} then {
      LowBSB ← first index of Seq;
      HighBSB ← second index of Seq;
      for BSB = LowBSB to HighBSB do
        add BSB to HwBSBList;
      a ← a - area(Seq);
      g ← LowBSB - 1;
    }
  } until (a < 0) or (Seq = {}) or (g = 0);
  return HwBSBList;
}
Algorithm 1: PACE - A Partitioning Algorithm with Communication Emphasis.
terms of RTL modules. The $n^2$ term can be reduced by enlarging BSB
granularity by hierarchical collapsing or by only considering BSB
sequences which induce a speedup greater than zero (or greater than some
given percentage). Also, there is no need to pre-calculate areas and
speedups for sequences whose total area will be greater than $A$. Another
optimization could be to only consider sequences of length $\le m$ where
$m$ is an arbitrary value selected prior to execution of the algorithm.
As for the simple Knapsack problem, the dual problem of minimizing area
with a fixed time-constraint can be solved by swapping the area- and
speedup entries calculated for each sequence. Another approach could be
to scan the bottom line of the BestSpeedup matrix from the left (see
figure 6) until an entry is found which satisfies the time-constraint, as
the PACE algorithm in effect calculates the best speedup for all areas.
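The scanning approach to the dual problem can be sketched in Python as
below. The sketch assumes that the time constraint has been translated
into a required speedup (all-software execution time minus the allowed
execution time); the function name `min_area_for_speedup` is hypothetical,
and the example row holds the BestSpeedup values of the last group for
areas 0 to 4 as obtained by applying the recurrence to the four-BSB
example.

```python
def min_area_for_speedup(bottom_row, required_speedup):
    """Return the smallest area whose best speedup meets the requirement.

    bottom_row[a] is the best speedup achievable on area a; None is
    returned if no area in the row is sufficient.
    """
    for area, best in enumerate(bottom_row):
        if best >= required_speedup:
            return area
    return None


# Best speedups of the last group for areas 0..4 in the example.
bottom_row = [0, 10, 20, 28, 35]

needed = min_area_for_speedup(bottom_row, required_speedup=25)
impossible = min_area_for_speedup(bottom_row, required_speedup=40)
```

Because the best speedup never decreases with growing area, the first
entry that meets the requirement gives the minimum area.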
Note that the areas and speedups of all sequences must be pre-calculated
before the algorithm can execute. This operation basically has time
complexity $O(n^2)$ but this can also be reduced.
6 Experiments 
A series of experiments which demonstrate the capabilities of the PACE
algorithm have been carried out. The experiments were carried out in
LYCOS, the LYngby CO-Synthesis system, an experimental co-synthesis
system currently being developed at our institute.
The sample application used for the experiments is a VHDL behavioral
description, taken from an image processing application in optical-flow.
The application calculates eigenvectors which are used to obtain local
orientation estimates for the cloud movements in a sequence of Meteosat
thermal images. The application consists of 448 lines of behavioral VHDL.
The corresponding CDFG contains 1511 nodes and 1520 edges.
BSB software execution-time is estimated for an 8086 processor and a
hardware library for an Actel ACT 3 FPGA is used to estimate hardware
datapath and controller area. Partitioning is performed for a sequence of
total hardware areas ranging from 1000 to 2000 in steps of 20, where an
area unit equals the area of a logic/sequential module in the FPGA.
Table 2 summarizes the characteristics of the most important modules.
[Table 2: Area and execution-time estimates for hardware modules and
operations. Hardware modules used for the experiments.]
Figure 7 shows the results of partitioning the sample application using
three different partitioning models: ignoring communication, simple
communication where the read- and write-sets of a BSB are always
transferred regardless of other BSBs placed in hardware, and adjacent
block communication which is the one used by the PACE algorithm. All
three approaches are assumed to be implemented with local hardware store,
and are thus evaluated in the model domain according to the adjacent
block communication model.
[Figure 7: The PACE algorithm compared with Knapsack algorithms which
ignore communication or do not account for adjacent hardware block
communication optimization (allocation A). Horizontal axis: total chip
area.]
For chip areas less than the allocated area for the
datapath (in the figure 760), no speedup is obtained as 
no control area is available. As soon as control area 
is available the approach which ignores communication 
starts to move BSBs to hardware. For the approaches 
taking communication into account, moving BSBs to 
hardware is not beneficial before the total area reaches 
around 1040. It can be seen that as the chip area in- 
creases, more and more BSBs are moved to hardware. 
From the figure it is clear that the approach using the 
simple communication model does not move as many 
BSBs to  hardware as the approach which takes adja- 
cent block communication into account. This is mainly 
due to the fact that many of the BSBs have a communi- 
cation overhead which is larger than the speedup they 
induce. In any case the best results are obtained by 
partitioning according to  the adjacent block communi- 
cation model. 
Datapath allocations
Allocation    A      B      C
Area          760    1148   427
Table 3: Modules and corresponding area for each of the three
allocations.
[Figure 8: The PACE algorithm employed for different allocations.
Horizontal axis: total chip area.]
Figure 8 shows the results of partitioning with the 
PACE algorithm for three different allocations; A, B 
and C, all listed in table 3. 
Widely different results are obtained for given
available areas, and no single allocation is optimal
for all areas. Allocation C has the smallest datapath
area, which means that for relatively small areas the
partitioning algorithm is able to move BSBs to
hardware and thus obtain the best partition. Around
area 1500 this changes: allocation A becomes more
attractive because the larger datapath allocation can
benefit from the inherent parallelism of the sample
application, i.e., larger speedups may be achieved for
the individual BSBs. The figure also illustrates the
problem of allocating too much datapath area:
allocation B, which has the largest datapath area,
never manages to give the best partitioning, even for
large chip areas.
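
The trade-off between datapath area and control area can be illustrated with a small Python sketch. It assumes that the area left for moving BSBs to hardware is simply the total chip area minus the datapath area of the chosen allocation, and that the three area values of Table 3 belong to allocations A, B and C in that order; the function names are hypothetical.

```python
# Datapath areas from Table 3 (the assignment to A, B, C is an assumption
# consistent with C being the smallest and B the largest allocation).
DATAPATH_AREA = {"A": 760, "B": 1148, "C": 427}

def control_area(total_chip_area, allocation):
    """Area left for BSB controllers once the datapath has been allocated."""
    return max(0, total_chip_area - DATAPATH_AREA[allocation])

def usable_allocations(total_chip_area):
    """Allocations that leave any control area at all for this chip area."""
    return sorted(name for name in DATAPATH_AREA
                  if control_area(total_chip_area, name) > 0)

small_chip = usable_allocations(1000)   # B leaves no control area here
budget_a = control_area(1500, "A")      # 740
budget_c = control_area(1500, "C")      # 1073
```

At a total chip area of 1500, allocation C leaves more control area than allocation A, so the advantage of A reported above must come from larger per-BSB speedups rather than from a larger control budget.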
7 Conclusion 
This paper has presented the PACE hardware/software
partitioning algorithm which, within the presented
partitioning model, gives an exact solution for the
problem of minimizing total execution time with a
hardware area constraint as well as for the problem
of minimizing hardware area with a global execution
time constraint. The partitioning model recognizes 1)
that sequences of hardware BSBs only need to
communicate their effective read- and write-sets
from/to software and 2) that both area and execution
time can be calculated on a BSB sequence basis and
therefore may be smaller than the sum of the
individual areas and execution times of the BSBs in
the sequence. The algorithm has quadratic time
complexity, but as shown, this may be reduced.
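
The first of the two problems, maximizing speedup under an area constraint when area and speedup are defined per BSB sequence, can be sketched as a knapsack-style dynamic program. The Python code below is a simplified illustration under stated assumptions and is not the PACE algorithm itself: `seq_area` and `seq_speedup` are hypothetical callbacks, areas are integers, and two hardware sequences chosen back to back are costed separately rather than merged.

```python
def best_speedup(n, seq_area, seq_speedup, total_area):
    """Maximum total speedup for BSBs 0..n-1 within total_area (integer units).

    seq_area(i, j) and seq_speedup(i, j) are hypothetical callbacks giving the
    area and speedup of moving the sequence of BSBs i..j (inclusive) to hardware.
    """
    # best[j][a]: best speedup using only the first j BSBs and area at most a.
    best = [[0] * (total_area + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for a in range(total_area + 1):
            value = best[j - 1][a]          # BSB j-1 stays in software
            for i in range(j):              # sequence i..j-1 goes to hardware
                cost = seq_area(i, j - 1)
                if cost <= a:
                    value = max(value, best[i][a - cost] + seq_speedup(i, j - 1))
            best[j][a] = value
    return best[n][total_area]

# Hypothetical data: per-BSB areas and gains, plus a fixed communication
# cost of 4 per hardware sequence, so longer sequences amortize it.
areas, gains, COMM = [2, 3, 2], [3, 3, 5], 4
area_of = lambda i, j: sum(areas[i:j + 1])
speedup_of = lambda i, j: sum(gains[i:j + 1]) - COMM

result_small = best_speedup(3, area_of, speedup_of, 2)   # 1: last BSB only
result_mid = best_speedup(3, area_of, speedup_of, 5)     # 4: BSBs 1..2
result_large = best_speedup(3, area_of, speedup_of, 7)   # 7: all three BSBs
```

For each area value the inner loops consider every sequence of adjacent BSBs, which is where the quadratic dependence on the number of BSBs comes from in this sketch.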
Experiments have been carried out which show that
the algorithm is superior to algorithms which ignore
communication or do not recognize adjacent block
communication optimization. Also, it has been
demonstrated how the algorithm can be used with a BSB
hardware area model which assumes hardware sharing
among BSBs.
Acknowledgements 
This work is supported by the Danish Technical Re- 
search Council under the “Codesign” program. 
References 
[1] R. Ernst, J. Henkel, and T. Benner. Hardware/software
co-synthesis of microcontrollers. Design and Test of
Computers, pages 64-75, December 1992.
[2] Rolf Ernst, Wei Ye, Thomas Benner, and Jorg Henkel.
Fast timing analysis for hardware/software co-design.
In ICCD '93, 1993.
[3] Jie Gong, Daniel D. Gajski, and Sanjiv Narayan.
Software estimation from executable specifications.
Technical Report ICS-93-5, Dept. of Information and
Computer Science, University of California, Irvine,
Irvine, CA 92717-3425, March 8 1993.
[4] Jesper Grode. Scheduling of control flow dominated
data-flow graphs. Master's thesis, Technical University
of Denmark, 1995.
[5] Rajesh K. Gupta and Giovanni De Micheli. System
synthesis via hardware-software co-design. Technical
Report CSL-TR-92-548, Computer Systems Laboratory,
Stanford University, October 1992.
[6] D. Herrmann, J. Henkel, and R. Ernst. An approach
to the adaptation of estimated cost parameters in the
cosyma system. In CODES '94, 1994.
[7] Axel Jantsch, Peeter Ellervee, Johnny Oberg, Ahmed
Hermani, and Hannu Tenhunen. Hardware/software
partitioning and minimizing memory interface traffic.
In EURO-DAC '94, 1994.
[8] Asawaree Kalavade and Edward A. Lee. A global
criticality/local phase driven algorithm for the
constrained hardware/software partitioning problem. In
Third International Workshop on Hardware/Software
Codesign, pages 42-48, September 1994.
[9] Peter V. Knudsen. Fine-grain partitioning in
codesign. Master's thesis, Technical University of
Denmark, 1995.
[10] Giovanni De Micheli. Computer-aided hardware-
software codesign. IEEE Micro, 14(4):10-16, August
1994.
[11] Frank Vahid, Jie Gong, and Daniel D. Gajski. A
binary-constraint search algorithm for minimizing
hardware during hardware/software partitioning. In
EURO-DAC '94, pages 214-219, 1994.
