






















Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners 
and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. 
 
• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. 
• You may not further distribute the material or use it for any profit-making activity or commercial gain 
• You may freely distribute the URL identifying the publication in the public portal  
 
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately 
and investigate your claim. 
   
 
Downloaded from orbit.dtu.dk on: Dec 17, 2017
Aspects of system modelling in Hardware/Software partitioning
Knudsen, Peter Voigt; Madsen, Jan
Published in:
Proceedings. Seventh IEEE International Workshop on Rapid System Prototyping





Publisher's PDF, also known as Version of record
Link back to DTU Orbit
Citation (APA):
Knudsen, P. V., & Madsen, J. (1996). Aspects of system modelling in Hardware/Software partitioning. In
Proceedings. Seventh IEEE International Workshop on Rapid System Prototyping (pp. 18-23). IEEE. DOI:
10.1109/IWRSP.1996.506721
Aspects of System Modelling in Hardware/Software Partitioning 
Peter Voigt Knudsen and Jan Madsen 
Department of Information Technology, Technical University of Denmark 
pvke i t . d t U. dk , j an@i t . dt U.  dk 
Abstract 
Thi s  paper addresses fundamen ta l  aspects of s y s t em 
modelling and parti t ioning algorithms an the  area of  
Hardware/Software Codesign. Three basic sy s t em mod- 
els for parti t ioning are presented and the  consequences 
of parti t ioning according t o  each of these  are analyzed. 
T h e  analysis shows the  importance of making  a clear 
distinction between the  model used for parti t ioning and 
the  model used fo r  evaluation. It also illustrates the  i m -  
portance of having a realistic hardware model such  tha t  
hardware sharing can be taken  in to  account. Finally,  
the  importance of integrating scheduling and allocation 
wi th  parti t ioning i s  demonstrated. 
1 Introduction 
Hardwarelsoftware partitioning is often viewed as 
the synthesis of an architecture consisting of a single 
CPU and a single dedicated hardware component (full 
custom, FPGA, etc.) from an initial system specifica- 
tion, e.g., [I]. 
The aim of this paper is to emphasize the impor- 
tance of clearly defining and reporting the partition- 
ing model assumed by a partitioning algorithm and to 
demonstrate the kind of errors that may occur if the 
results produced by a partitioning algorithm which as- 
sumes a simplistic partitioning model are not evaluated 
in accordance with a realistic implementation oriented 
model. This will make it easier to evaluate particular 
partitioning approaches and to compare strong sides 
and weaknesses of different approaches. 
Even though the single CPU, single ASIC architec- 
ture is a special and limited example of a distributed 
system, the architecture is relevant in many areas such 
as DSP design, construction of embedded systems, soft- 
ware execution acceleration and hardware emulation 
and prototyping [8]. Further, it is the most com- 
monly used target architecture for automatic hard- 
warelsoftware partitioning approaches. 
The hardware/software partitioning of a system 
specification onto a single CPU, single ASIC architec- 
ture has been investigated by a number of research 
groups [l, 2, 3,  5, 6, 91 which have employed widely 
different input languages (C, C", VHDL, etc.) and sys- 
tem models, and have had different optimization goals 
and constraints in mind. This makes it very difficult to 
compare the approaches. The different approaches will 
be described and discussed in section 3 and 4. 
2 Modeling and Evaluation Aspects 
In order to solve a problem by means of computer 
tools, it is necessary to build a model of the real world 
problem, i.e., to transform the problem from the phys- 
ical domain into the model domain. In doing so, a 
number of details are disregarded as to  keep the model 
simple, however, the main characteristics of the phys- 
ical problem should still be present. When a suitable 
model has been found, the problem is solved within the 
model domain. Finally, the model domain result has 
to be realized in the physical domain. In figure 1 this 
is depicted for the partitioning problem. 
Model Domain 
Realization - - - - 
Solution 
Physical Domain 
Figure 1: D o m a i n  t rans format ions .  
It is important to realize, that results which are opti- 
mal within the model domain may not be optimal when 
transformed to the physical domain. Clearly, it is im- 
portant to model the real world as closely as possible 
and it is equally important to always carry out a series 
of tests in the real world domain when the performance 
and quality of an algorithm is evaluated. However, ac- 
tually implementing the partitioned system in order to  
evaluate it is costly and time consuming, thus, it is im- 
portant to be able to evaluate the different solutions 
in the model domain, before selecting the one to be 
implemented. 
In general, the system model used by the partition- 
ing algorithm may deviate from how the system is im- 
plemented in reality. In the following, the phrase par- 
t i t ioning model is used to denote the model used by 
the partitioning algorithm, and the phrase realization 
model is used to denote the model of the target archi- 
tecture, i.e., the way in which a partition is assumed to  
be implemented. 
Figure 2 shows two ways of evaluating a partitioning 
algorithm within the model domain: 
0-8186-7603-5/96 $5.00 0 1996 IEEE 
18 




Figure 2: Evaluating the  parti t ioning result in the  
model  domain;  I )  .Evaluating according t o  the  parti-  
t ioning model,  2) Ehaluating according t o  the realiza- 
t i o n  model.  
1. Evaluation according to the partitioning model. 
2. Evaluation according to  the realization model, 
A reason for choosing a simple partitioning model 
is that some partitioning algorithms can then solve the 
constrained partitioning problem faster and/or more 
elegantly. But, as this paper will emphasize, thle qual- 
ity of the algorithm must be evaluated according to  
the realization model which is actually chosen for im- 
plementation in the physical domain. 
3 Partitioning Models 
In order to  be able to  partition a system, it has to 
be divided into fragments, in the following called Basic 
Scheduling Blocks or BSBs. The BSBs define the gran- 
ularity of the Partitioning, and are the chunks (of code 
which have to  be mapped to software or hardw,are. 
In this section we present three of the most com- 
monly used partitioning models. Note that many other 
and more advanced partitioning models than the ones 
presented here can be thought of (models including 
global communication, function calls, hardwaxe and 
software processes executing in parallel, global opti- 
mizations, etc.). The purpose of listing the following 
partitioning models is, however, not to present a gen- 
eral classification scheme for partitioning approaches 
but rather to be able to  reach some general conclusions 
which will be valid no matter which model a particular 
partitioning algorithm is based on. 
The presented models are arranged by increasing 
complexity and all models are based on a system spec- 
ification which is represented as a sequence of I3SBs: 
Definition 1 A sgistem specification, 
s = (Bl,BZ,...,Bn) 
i s  a n  ordered list of n BSBs, where B, denotes BSB 
number  i and BSBs are ordered according t o  th(> = execu- 
tion sequence. 
It is assumed that all function calls have been flat- 
tened prior to BSB extraction. Also, it is assun-ted that 
BSBs execute under mutual exclusion, i.e., hardware 
and software BSBs are not allowed to execute simulta- 
neously. 
Definition 2 A BSB, Bi, i s  defined by the  six-tuple, 
Bi = (u,,i,t,,z,ah,i,th,i,ri, Wi) 
where as,i and ts,i are the  urea and execution t i m e  of 
Bi when  placed in software, ah,i and th,i  are the area 
and execution t i m e  of Bi when  placed in hardware, ri 
and wi are the  read-set and write-set  variables of Bi, 
respectively. 
The software area as,i, which indicates the software 
object code size, is not considered in the following dis- 
cussions (infinite software side capacity is assumed) and 
is only included for completeness. It is assumed that 
as, i ,  ts,i, ah,i and th, i  are all fixed values calculated prior 
to partitioning, i.e., each BSB has a fixed hardware 
area ah,i and execution time th,i independent of which 
other BSBs are implemented in hardware, and likewise 
for the software implementation. In section 6 these as- 
sumptions will be elaborated upon. We will use the 
notation t ( r i )  and t (w;)  to denote the communication 
time needed to transfer the variables of the read-set and 
write-set to/from Bi respectively. Area contributions 
associated with communication primitives are ignored 
in this paper. 
Finally, let a partition be defined as, 
Definition 3 A parti t ioning of a s y s t e m  specifica- 
t ion,  S = (B1, B2,. . . , B,) i s  a mapping P : 
(B1 I B2, . . . , B,) -+ {SW, H W } .  
Model I :  Instantaneous Communication 
The simplest possible model assumes instantaneous 
communication, i.e., t (r i )  = t (w i )  = 0. Figure 3 shows 
the initial all-software solution and a solution after par- 
titioning. 
HW 
Figure 3: Partit ioning model wi th  instantaneous com- 
municat ion.  
In this model each BSB, Bi, receives a constant (pos- 
sibly negative) speedup1, si = ts , i  - th,!, when imple- 
mented in hardware, independent of which other BSBs 
‘In this paper we will use the term “speedup” to denote the 
time saved by moving functionality to hardware. 
19 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 12:37:42 UTC from IEEE Xplore.  Restrictions apply. 
are implemented in hardware. Hence, the total speedup 
obtained by the partition may be defined as 
Model 2: Sample Communication 
A simple way of taking communication into account 
is depicted in figure 4a. 
(1: 84 a ~ a4 b 
Figure 4: Partitioning model with simple communica- 
tion model. 
In this model, if a BSB is placed in hardware, its 
read-set variables are always transferred from software 
to  hardware prior to execution of the BSB, and its 
write-set variables are always transferred back to soft- 
ware when it has finished its execution, regardless of 
how other BSBs are placed. This model allows us to 
associate a fixed hardware execution time, including 
communication time, with each BSB. Hence, the to- 
tal speedup obtained by the partition may be defined 
as 
B ; E H W  
where t(7-i) and t(wi) are the communication times 
needed to transfer the variables of the read-set and the 
write-set to/from Bi . 
Figure 4b illustrates the main problem of the simple 
communication model. It is evident that there is an 
unnecessary communication overhead associated with 
the communication from B2 to B3 which results in an 
implementation which is not optimal. However, in this 
simple model where we have to be able to associate a 
fixed hardware execution time with each BSB, that is 
inevitable. 
Model 3: Adjacent Block Communication 
Considering figure 4b it would be better if Bz could 
send its write set variables directly to B3 by storing 
them in local hardware memory. This is the main idea 
of model 3 and is depicted in figure 5a. 
II- 84 
a 
Figure 5: Partitioning model with adjacent block com- 
munication. 
The adjacent block communication gives a more re- 
alistic model of the partitioned system. However, it is 
no longer possible to associate a fixed execution time 
with each BSB, as the execution time now depends on 
whether the previous and following blocks are placed in 
hardware. I.e., each time a BSB is moved to  hardware, 
an analysis of the influence on the read/write sets has 
to be performed. This also makes the calculation of 
the speedup more difficult, as it now depends on the 
sequence of BSBs placed in hardware and the commu- 
nication through their effective read- and write-sets. 
Figure 5b illustrates how the model can lead to  a 
better partitioning result than the previous models. B3 
is a BSB which receives no speedup when transferred 
to  hardware. Within model 1 and 2 it will therefore 
be placed in software. But within model 3 placing it 
in hardware will result in a speedup as t,,2 and tr,4 
disappear, so this may be done unless the area penalty 
is too large compared with the speedup. 
4 Partitioning Approaches 
From the preceding discussions we may define two 
constrained partitioning problems: 
1. Optimize for speed with an area constraint. 
2. Optimize for area with a speedup constraint 
(which equals the total software execution time 
minus an execution speed constraint). 
Approaches Assuming Instantaneous Communica- 
tion 
The system model presented by Jantsch et al. [5] 
corresponds to partition model 1 as communication is 
ignored and BSBs cannot execute in parallel. 
The presented results are model domain results 
based on a realization model equal to model 1, i.e., no 
attempts to evaluate according to a realistic realization 
model have been made. Thus, the reported results may 
be overly optimistic. 
20 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 12:37:42 UTC from IEEE Xplore.  Restrictions apply. 
Approaches Assuming Adjacent Block Communi- 
cation 
The COSYMA system by Henkel, Ernst et al. [1, 
2, 41 uses a simulated annealing algorithm which indi- 
rectly attempts to minimize hardware (as no explicit 
hardware area costs are calculated) given several local 
execution-time constraints. The algorithm accounts for 
adjacent block communication optimization, so the sys- 
tem model corresponds to partitioning model 31. The 
results which are reported are physical domain results 
measured on a system implementation which is also 
based on model 3. 
In [7] we present i5 dynamic programming algorithm 
called PACE which takes adjacent block communi- 
cation into account during partitioning. As for the 
COSYMA system this approach assumes that hard- 
ware and software BISBs cannot execute in parallel. 
Other Approaches 
Some approaches do not present their partitioning 
model together with the presentation of the algorithm 
and results. This makes it difficult to judge uplon the 
quality of the algorithm. Vahid et al. [9] is an ex,zmple. 
Other approaches use a partitioning model which 
does not fit nicely into the simple models we have pre- 
sented. For instance, Kalavade and Lee [6] and Gupta 
and De Micheli [3] both present a partitioning approach 
which uses a model that resembles partitioning model 
3 but differs in that hardware and software BSEk may 
execute in parallel. [ 31 presents physical domain evalu- 
ations while [6] uses model domain evaluations. 
5 Experiment,s with Different F’arti- 
tioning and Evaluation Models 
A series of experiments which demonstrate the ef- 
fects of optimizing according to one model and eval- 
uating according to another model have been carried 
out in order to  be able to discuss and compare the par- 
titioning approaches presented in the previous section. 
The sample application used for the experiments is a 
VHDL behavioral description consisting of 372 lines 
code. The VHDL program is translated into a con- 
trol/dataflow graph (CDFG) and divided into a, num- 
ber of BSBs. The CDFG contains 1043 vertices, 1143 
edges, 8 loops (nested up to three levels) and lone if- 
then-else construct. 
For the experiments, software execution-time esti- 
mates are based on a M68000 microprocesso~r, and 
hardware execution- time estimates are obtained from 
a global schedule of the CDFG onto an Actel ACT 3 
FPGA. 
The investigated partitioning problem is to optimize 
for speed with a hardware area constraint. A Knapsack 
algorithm is used to solve the partitioning problem for 
partitioning models 1 and 2. To solve the partitioning 
problem for partitioning model 3 the PACE [7] algo- 
rithm is used. For all tests, a fixed datapath (alloca- 
tion area of 1148 is assumed and the area of hardware 
BSBs is the estimated area of their hardware controller 
based on an Actel ACT 3 FPGA. Each logic/sequential 
module in the FPGA is assigned the area 1. Memory 
mapped 1/0 is assumed for hardware/software com- 
munication. Partitioning is performed for a sequence 
of available hardware areas ranging from 1000 to 2000 
in steps of 20. 
In the following figures, the names of the individual 
graphs have the form “Az:Ey”. This means that the 
graph shows the result of employing a partitioning algo- 
rithm which optimizes according to partitioning model 
z and evaluating the resulting partition according to  
realization model y (the models are numbered as in 
section 3). Only model domain evaluations have been 
carried out. 
Figure 6:  Partitioning and evaluation according to the 
same model. 
Figure 6 illustrates the results of partitioning and 
evaluating according to the same model. As discussed 
in section 4 this is how most approaches are presented 
and related to each other. 
For chip areas less that the allocated area for the 
datapath, no speedup is obtained as no control area 
is available. As soon as control area is available, the 
A1:El approach starts to  move BSBs to  hardware. For 
the approaches taking communication into account, 
moving BSBs to hardware is not beneficial until the 
total area reaches around 1370. It can be seen that 
as the chip area increases, more and more BSBs are 
moved to hardware, thus, for large chip areas there 
is less communication between hardware and software, 
hence the A3:E3 and A1:El approaches become com- 
parable. Also, it is clear that the A2:E2 approach does 
not move as many BSBs to hardware as the two other 
approaches. This is mainly due to the fact that many 
of the BSBs have a communication overhead which is 
larger than the speedup they induce. 
Just comparing the curves of figure 6 may lead to the 
conclusion that A1:El is the best approach. However, 
taking the partitions of figure 6 and just evaluating 
them according to the most realistic model (model 3) 
results in the curves shown in figure 7. 
The first thing that is noted is, that even though 
the A1:El approach of figure 6 “thinks” it is achieving 
a large speedup for areas in the range 1200 to 1700, 
21 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 12:37:42 UTC from IEEE Xplore.  Restrictions apply. 
Figure 7: Comparison  of the  results obtained by opti-  
miz ing  according t o  each of the three parti t ioning mod-  
els and implement ing  according to  the  m o s t  realistic re- 
alization model 3 which optimizes adjacent block com- 
municat ion .  
it is actually producing worse than the all-software so- 
lution (the A1:E3 curve). Thus, using a partitioning 
approach which assumes instantaneous communication 
when evaluated, as in [5], the reported results may be 
wrong; in the experiments performed in our paper the 
approach was wrong by a factor of 2. It should also 
be noted that as the A1:El approach assumes instan- 
taneous communication, it may, by pure chance, select 
BSBs which actually give a good partitioning, however, 
this will not be decidable. 
6 Hardware Modeling and its Relation 
to  Partitioning 
In order to  have realistic hardware area estimates it 
is important to  also have a realistic hardware model. In 
the preceding discussion it was presumed that the hard- 
ware area of BSBs remained fixed independent of the 
Partitioning. The hardware area of a BSB (B,) may be 
viewed as containing two parts; a datapath area (adp,,) 
in which all computation (and storage) is performed, 
and a control unit area (acu,%) which controls the exe- 
cution of the datapath, i.e., ah,% = acu,z + adp,z. 
There is a strong relationship between the amount 
(and type) of allocated hardware resources in the data- 
path ( a d p , % )  and the hardware execution time, as more 
hardware resources allows for exploiting the inherent 
parallelism of the application. When an allocation has 
been selected, the execution time of a BSB is found by 
scheduling the operations in the BSB. The scheduling 
will also determine the area of the control unit (acu,z). 
Now, a fixed hardware area (ah,,) of a BSB (B2) can 
be obtained in one of the following ways. 
1. Hardware resources are allocated and scheduled 
separately for each BSB before partitioning. The 
total hardware area may be expressed as 
ah = acu,i + adp,; 
In this way hardware sharing among BSBs is not 
considered. 
2. Global allocation of hardware resources and indi- 
vidual scheduling of the BSBs is performed before 
partitioning. The fixed hardware area ah,i of the 
individual BSBs is then the area of their corre- 
sponding hardware controller plus the increase in 
multiplexer and interconnect area’. The available 
area for BSBs is equal to  the total hardware area 
minus the area of the allocated hardware resources. 
In this case the total hardware area of a single BSB 
may be expressed as 
B ; € H W  B i E H W  
B ; € H W  
In the next section we will demonstrate the impor- 
tance of being able to  consider scheduling and alloca- 
tion during partitioning. As the first approach is clearly 
inefficient, we will only consider the second method. 
7 Experiments With Different Alloca- 
tions 
A series of experiments to  demonstrate the effect of 
having different allocations (and thus different sched- 
ules) have been carried out. 
5MX40 
0 
4W 6m 8W 1Ma 1200 14W 16W 1SW 2 M a  
Total ohmema 
Figure 8: Part i t ions with three dif ferent allocations. 
Figure 8 shows the results of partitioning the sample 
application of section 5 with the PACE algorithm using 
the three allocations A, B and C, listed in table 1. 
Both partitioning and evaluation is done according 
to  model 3, i.e., taking adjacent block communication 
into account, hence figure 8 shows A 3 3 3  curves. 
interconnect areas are not considered. 
ZFor the results presented in this paper, the multiplexer and 
22 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 12:37:42 UTC from IEEE Xplore.  Restrictions apply. 
[ D a t a p a t h  allocations I 
=- 
Table 1: Modules and corresponding area for each of 
the  three allocations. 
Hardware modules used for the allocation exper in lent1  
Table 2: Area and execut ion-t ime est imates  for hard- 
ware modules and operations. 
Table 2 summarizes the characteristics of the most 
important of the allocated hardware modules. All fig- 
ures are estimates based on the Actel ACT 3 FPGA, 
and have been used to  calculate the allocation areas 
shown in table 1. 
As figure 8 ShoW:j, widely different results are ob- 
tained for given available areas, and a specific alloca- 
tion which is optimal for all areas cannot be found. 
Number of combinatorial mullinters (mul-comb) 
Figure 9: Controller/datapath area trade-offs when par- 
t i t ioning under  a hardware area constraint. 
Figure 9 illustrates this controller/datapath area 
trade-off when partitioning under hardware area con- 
straints. The figure shows the all-hardware execution 
times for three different allocations, containing 1, 2 and 
3 combinatorial multipliers respectively, and the corre- 
sponding obtained slpeedup. Going from 1 to 2 multi- 
pliers results in a significant speedup which would be 
expected as the all-hardware execution time is almost 
halved. However, the expected speedup when adding 
an extra multiplier (reducing the all-hardware id 3xecu- 
tion time further) is not seen, instead the speedup is 
reduced to  around the same level as for a single multi- 
plier. The reason for this is of course that 3 multipliers 
allocate a large fraction of the available hardware area, 
leaving only a small area for the controller, and hence 
only a few BSBs can be moved to hardware. 
8 Conclusion 
Three basic partitioning models have been analyzed 
and discussed in relation to  how they have been used 
by various research groups. We have shown the impor- 
tance of making a clear distinction between partition- 
ing models and evaluation models, i.e., the importance 
of making a realistic evaluation of results produced by 
algorithms operating on simple models. Thus, a clear 
distinction between partitioning model and realization 
model is a key issue for obtaining a basis for comparing 
different approaches. 
Finally, we have shown the importance of integrating 
allocation and scheduling with partitioning and that 
this is particular important when partitioning under 
hardware area constraints, such as for FPGAs. 
Acknowledgments 
This work is supported by the Danish Technical Re- 
search Council under the "Codesign" program. 
References 
R. Ernst, J. Henkel, and T. Benner. Hardware/software 
co-synthesis of microcontrollers. Design and Test of 
Computers, pages 64-75, December 1992. 
Rolf Ernst, Wei Ye, Thomas Benner, and Jorg Henkel. 
Fast timing analysis for hardware/software co-design. 
In ICCD '93, 1993. 
Rajesh K. Gupta and Giovanni De Micheli. System syn- 
thesis via hardware-software co-design. Technical Re- 
port CSL-TR-92-548, Computer Systems Laboratory, 
Stanford University, October 1992. 
D. Herrmann, J. Henkel, and R. Ernst. An approach 
to the adaptation of estimated cost parameters in the 
cosyma system. In CODES '94, 1994. 
Axel Jantsch, Peeter Ellervee, Johnny Oberg, Ahmed 
Hermani, and Hannu Tenhunen. Hardware/software 
partitioning and minimizing memory interface traffic. 
In EURO-DAC '94, 1994. 
Asawaree Kalavade and Edward A. Lee. A global 
criticality/local phase driven algorithm for the con- 
strained hardware/software partitioning problem. In 
Codes/CASHE'94, pages 42-48, September 1994. 
Peter V. Knudsen and Jan Madsen. PACE: A dy- 
namic programming algorithm for hardware/software 
partitioning. In Codes/CASHE'96, March 1996. 
Giovanni De Micheli. Computer-aided hardware- 
software codesign. IEEE Micro, 14(4):10-16, August 
1994. 
Frank Vahid, Jie Gong, and Daniel D. Gajski. A binary- 
constraint search algorithm for minimizing hardware 
during hardware/software Partitioning. In EURO-DAC 
'94, pages 214-219, 1994. 
23 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 12:37:42 UTC from IEEE Xplore.  Restrictions apply. 
