






















Communication Estimation for Hardware/Software Codesign
Knudsen, Peter Voigt; Madsen, Jan
Published in:
Hardware/Software Codesign, 1998. (CODES/CASHE '98) Proceedings of the Sixth International Workshop on





Publisher's PDF, also known as Version of record
Citation (APA):
Knudsen, P. V., & Madsen, J. (1998). Communication Estimation for Hardware/Software Codesign. In
Hardware/Software Codesign, 1998. (CODES/CASHE '98) Proceedings of the Sixth International Workshop on
(pp. 55-59). IEEE Computer Society Press. DOI: 10.1109/HSC.1998.666238
Communication Estimation for Hardware/Software Codesign
Peter Voigt Knudsen and Jan Madsen 
Department of Information Technology, Technical University of Denmark 
pvk@it.dtu.dk,jan@it.dtu.dk 
Abstract 
This paper presents a general high level estimation model of
communication throughput for the implementation of a given
communication protocol. The model, which is part of a larger model
that includes component price, software driver object code size and
hardware driver area, is intended to be general enough to be able to
capture the characteristics of a wide range of communication protocols
and yet to be sufficiently detailed as to allow the designer or design
tool to efficiently explore tradeoffs between throughput, bus widths,
burst/non-burst transfers and data packing strategies. Thus it provides
a basis for decision making with respect to communication
protocols/components and communication driver design in the initial
design space exploration phase of a co-synthesis process where a large
number of possibilities must be examined and where fast estimators are
therefore necessary. The full model allows for additional (money) cost,
software code size and hardware area tradeoffs to be examined.
1. Introduction 
This paper presents the underlying estimation model for a communication
estimation tool which extends the current communication estimation
capabilities of the LYCOS [2] co-synthesis system. The model is the
basis of a high level communication library that for each supported
processing unit and for each supported protocol captures
performance/area/price and other characteristics of the necessary
drivers and of the communication channel. Our aim is to utilize this
library in a communication estimation tool that will work together with
the other estimation/partitioning tools in LYCOS as part of the design
space exploration/co-synthesis cycle. Most current approaches to
co-synthesis consider communication synthesis to be a final step in the
co-synthesis trajectory [1][3][4]. For instance, [1] presents
communication synthesis as an allocation problem to be solved after
system-level partitioning whereas we integrate communication synthesis
with design space exploration and system level partitioning. For
example, we wish to be able to trade off a fast and expensive
communication protocol for a slow but cheaper protocol and a faster
co-processor, if that is feasible. This should not be done after system
level partitioning as the level of communication overhead between
system components influences what the best partition is. For this we
need fast estimators of the kind presented in this paper. [6] models
communication at various levels of abstraction which enables
multi-level system simulation to verify correct behavior given the
selected communication components/protocols, but the question of how to
select the best combination of communication components/protocols still
needs to be addressed. Our communication model in combination with the
estimation tool helps the designer/design tool answer this question.
2. The communication model 
[Figure 1: SW - SW Driver - Channel - HW Driver - HW]
Figure 1. Communication model overview

Figure 1 shows our model of point to point communication. The figure
shows communication in a processor/coprocessor target architecture, but
the model is not limited to this architecture; it can be used to model
and estimate communication overhead in any architecture where a
connection between two processing elements has been established. The
time overhead of establishing such a connection (arbitration, etc.) is
currently not modeled/estimated.
Note that, in contrast to prior work, we consider the possible
performance degradation imposed by the hardware/software drivers, and
not only the characteristics of the channel.
For simplicity, we consider communication in one di- 
rection only in this paper. In general, some of the model 
parameters will depend on the transmission direction. For 
instance, a PCI bus master read is slower than a write, so 
the parameters that model channel transmission delay exist 
in both a “read” version and a “write” version in the full 
model. 
2.1. Driver transmission delay model 
Figure 2 defines the parameters which are used to estimate driver
transmission delay.

[Figure 2: SW/HW - Transmitting Driver - Channel, annotated with the driver parameters]
Figure 2. Driver transmission parameters

The driver receives $n_t$ input words for transmission and produces
$n_c$ channel words. In order to do so, it may have to pack or split
driver input words to fit the channel bit width $w_c$, and it may have
to perform other kinds of data processing. The packing granularity
$w_g$ influences the transmission processing delay and is defined in
section 2.6. Given the clock frequency of the transmitting processor,
$f_t$, the number of cycles, $c_{tc}$, it requires to call the driver
for transmission (transfer arguments to, transfer execution flow to,
etc.) and the number of transmission processing (packing/splitting/etc.)
cycles per driver input word, $c_{tp}$, we can write the total driver
transmission delay as

$$t_{td} = (c_{tc} + c_{tp} n_t)/f_t \tag{1}$$
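
As a quick numerical check of equation (1), a minimal Python sketch follows. The function name `driver_tx_delay` and its parameter names are illustrative choices and not part of the LYCOS tools; the sample values are those of the first configuration in Example 3 (section 2.7).

```python
def driver_tx_delay(c_tc, c_tp, n_t, f_t):
    """Equation (1): driver transmission delay in seconds.

    c_tc: cycles needed to call the driver
    c_tp: processing (packing/splitting) cycles per driver input word
    n_t:  number of driver input words
    f_t:  clock frequency of the transmitting processor in Hz
    """
    return (c_tc + c_tp * n_t) / f_t


# First configuration of Example 3: no call overhead, 7 cycles per word,
# 1000 input words, 100 MHz transmitting processor.
t_td = driver_tx_delay(c_tc=0, c_tp=7, n_t=1000, f_t=100e6)
print(f"t_td = {t_td * 1e6:.2f} us")
```

The result corresponds to the 70 µs transmitting driver delay quoted in Example 3.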
2.2. Channel transmission delay model 
[Figure 3: Driver 1 - Channel - Driver 2]
Figure 3. Channel transmission parameters

Assume that the number of transmitted channel words $n_c$ and the
number of required synchronization cycles $c_{cs}$ are known (formulas
for these will be derived in sections 2.5 and 2.6). Given the clock
frequency of the channel $f_c$ and the number of transmission cycles per
channel word $c_{ct}$, the total channel transmission delay is then
calculated as

$$t_{cd} = (c_{ct} n_c + c_{cs})/f_c \tag{2}$$
where we have assumed that a connection has already been established.

2.3. Driver reception delay model

[Figure 4: Channel - Receiving Driver - SW/HW]
Figure 4. Driver reception parameters

We assume that the receiving driver in addition to the parameter $n_c$
also receives the parameters $w_t$ and $w_g$ so that it knows how data
was packed by the transmitting driver¹. We will also assume that
$w_r \ge w_t$ and that each unpacked/unsplit word of size $w_t$ is put
on a single output word of bit width $w_r$. Given the clock frequency of
the receiving processor $f_r$, the number of driver call cycles for
reception $c_{rc}$ and the number of reception processing
(unpacking/unsplitting/etc.) cycles per transmission driver input word,
$c_{rp}$, the formula for driver reception delay simply becomes

$$t_{rd} = (c_{rc} + c_{rp} n_t)/f_r \tag{3}$$

¹ ... in the drivers.

2.4. Total transmission delay

We assume that the driver production of channel words, channel
transmission and driver reception of channel words occur in parallel in
a pipelined fashion, which means that it is the slowest part that
determines the total transmission delay $t_t$. We set the maximum delay
to

$$t_m = \max(t_{td}, t_{cd}, t_{rd}) \tag{4}$$

and calculate the total transmission delay as

$$t_t = t_m + 2\,(t_m/n_t) \tag{5}$$

where the last term is an approximation of the pipeline
startup/completion delay².

² As the number of channel words may differ from the number of
transmission/reception words, the pipeline startup/completion delay is
not modeled accurately by the given term. An exact derivation is outside
the scope of this paper; however, it is important to include an estimate
of the delay as it may have significance for small transfers.

2.5. Burst mode modelling - $n_c$ equation

The preceding sections have assumed that $n_c$ and $c_{cs}$ were known.
This section and section 2.6 give a detailed derivation of these
figures.

In order to be able to handle burst mode transfers, we model $n_c$ to
consist of $(n_b - 1)$ bursts of size $s_b$ and a remainder burst of
size $s_r$, $0 < s_r \le s_b$:

$$n_c = (n_b - 1) s_b + s_r \tag{6}$$

The burst elements all have bit width $w_c$. We let the variable $b_t$
denote one of three supported burst transfer types, fixed (each burst
has a fixed size), max (there is a maximum on the burst size, but
smaller bursts are allowed) and inf (there is no limit on the burst
size). We can now calculate $n_b$ and $s_r$ as follows:³

$$n_b = \begin{cases} 1 & \text{if } b_t = \text{inf} \\ \lceil n_{cd}/s_b \rceil & \text{if } b_t = \text{fixed, max} \end{cases} \tag{7}$$

$$s_r = \begin{cases} s_b & \text{if } b_t = \text{fixed} \\ n_{cd} - (n_b - 1) s_b & \text{if } b_t = \text{max, inf} \end{cases} \tag{8}$$

where $n_{cd}$ is the number of actual channel data values corresponding
to the $n_t$ driver input words of bit width $w_t$ which have been
packed/split to fit the channel width $w_c$. An equation for $n_{cd}$ is
derived in section 2.6.
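
The burst decomposition of equations (6) to (8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `burst_layout` and the string encoding of the burst type $b_t$ are hypothetical, and the sample values ($n_{cd} = 1000$, $s_b = 32$) are borrowed from Example 1.

```python
def burst_layout(n_cd, s_b, b_t):
    """Return (n_b, s_r, n_c) following equations (6)-(8).

    n_cd: number of channel data words
    s_b:  burst size (not used to size the single burst when b_t == "inf")
    b_t:  burst type, one of "fixed", "max", "inf"
    """
    if b_t == "inf":
        n_b = 1                          # (7): one unlimited burst
    elif b_t in ("fixed", "max"):
        n_b = -(-n_cd // s_b)            # (7): integer ceiling of n_cd / s_b
    else:
        raise ValueError(f"unknown burst type: {b_t!r}")

    if b_t == "fixed":
        s_r = s_b                        # (8): last burst is padded to s_b
    else:
        s_r = n_cd - (n_b - 1) * s_b     # (8): last burst holds the remainder

    n_c = (n_b - 1) * s_b + s_r          # (6)
    return n_b, s_r, n_c


for b_t in ("fixed", "max", "inf"):
    print(b_t, burst_layout(n_cd=1000, s_b=32, b_t=b_t))
```

For the fixed type the number of transmitted channel words (1024) exceeds the number of data words because the last burst is padded; for max and inf exactly 1000 words are transmitted.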
Given the number of synchronization cycles per burst $c_{sb}$ (possibly
a fraction) and the number of synchronization cycles per transfer
session $c_{ss}$, we can now write the number of channel synchronization
cycles $c_{cs}$ as

$$c_{cs} = \lceil n_b c_{sb} \rceil + c_{ss} \tag{9}$$

With these definitions, the equations for $n_c$ and $c_{cs}$ model the
following four variants of burst transfers:
1. Non-burst mode. This is modeled by setting $b_t$ = fixed (or max) and
the burst size $s_b$ to 1 which results in $n_b = n_{cd}$, $s_r = 1$ and
$n_c = n_{cd}$. The number of synchronization cycles becomes
$c_{cs} = \lceil n_{cd} c_{sb} \rceil + c_{ss}$ where $c_{sb}$ should
now be interpreted as the number of synchronization cycles per channel
data word.
2. Burst mode with fixed burst size $s_b$. This is modeled by setting
$b_t$ = fixed which forces the last burst to be of size $s_b$ regardless
of how many values in that burst are actual data values.
3. Burst mode with maximum burst size $s_b$. This is modeled by setting
$b_t$ = max. The last burst (if any) has size $s_r \le s_b$, but it will
still require the same number of burst synchronization cycles as the
preceding bursts.
4. Burst mode with unlimited burst size. This is modeled by setting
$b_t$ = inf and $s_b = 0$. Then $n_b$ becomes 1 indicating a single
burst, $s_r = n_{cd}$, indicating that $n_{cd}$ data values are to be
transferred in the single burst, and $n_c = n_{cd}$. The number of
synchronization cycles becomes $c_{cs} = \lceil c_{sb} \rceil + c_{ss}$
so we only spend time on a single set of burst synchronization cycles.
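
The four variants above can be compared numerically with a short Python sketch. It assumes the synchronization cycle count has the form $c_{cs} = \lceil n_b c_{sb} \rceil + c_{ss}$, which is what the variants reduce to; the function name `sync_cycles` is hypothetical, and the sample values ($n_{cd} = 1000$, $s_b = 32$, $c_{sb} = 8$) are borrowed from Example 1.

```python
import math


def sync_cycles(n_cd, s_b, b_t, c_sb, c_ss=0):
    """Channel synchronization cycles c_cs = ceil(n_b * c_sb) + c_ss.

    n_b is a single burst for b_t == "inf" and ceil(n_cd / s_b) bursts
    for b_t == "fixed" or "max". c_sb may be fractional.
    """
    if b_t == "inf":
        n_b = 1
    elif b_t in ("fixed", "max"):
        n_b = -(-n_cd // s_b)            # integer ceiling of n_cd / s_b
    else:
        raise ValueError(f"unknown burst type: {b_t!r}")
    return math.ceil(n_b * c_sb) + c_ss


variants = [
    ("1. non-burst",            dict(s_b=1,  b_t="fixed")),
    ("2. fixed burst size",     dict(s_b=32, b_t="fixed")),
    ("3. maximum burst size",   dict(s_b=32, b_t="max")),
    ("4. unlimited burst size", dict(s_b=0,  b_t="inf")),
]
for name, kwargs in variants:
    print(f"{name}: c_cs = {sync_cycles(n_cd=1000, c_sb=8, **kwargs)}")
```

With these values the non-burst variant spends 8000 cycles on synchronization, the two burst variants with $s_b = 32$ spend 256, and the unlimited burst spends 8.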
³ In the following, $\lceil x \rceil$ denotes the smallest integer
larger than or equal to $x$ (truncating upwards). $\lfloor x \rfloor$
denotes the largest integer smaller than or equal to $x$ (truncating
downwards).

Example 1: (PCI burst mode modelling). Consider a PCI bus [5] master
read transaction of $n_{cd} = 1000$ words of width $w_c = 32$ on a 32
bit wide, 33 MHz PCI bus. The PCI bus supports burst transfers with
maximum, fixed as well as unlimited size.
We assume a maximum size ($b_t$ = max) burst transfer of size
$s_b = 32$. This ensures a low bus latency that allows other, higher
priority, units on the bus to interrupt the transfer. We assume that the
bus arbitration latency is 2 clock cycles and that the bus is initially
IDLE so that the bus acquisition latency is 0 clock cycles. We set slave
device select (DevSel) delay to 1 clock cycle. As the address bus and
data bus are multiplexed, the PCI burst transfer consists of an address
transfer followed by the (up to) 32 data transfers. For a read
transaction, a turnaround cycle is required between the address transfer
and the data transfers in order to avoid bus contention. After
completion of the burst, an additional IDLE cycle is required. The
address transfer and the data transfers each last one clock cycle
(assuming zero wait state transfers), except for the first data transfer
which lasts 4 clock cycles. We see that the number of synchronization
cycles per burst is $c_{sb}$ = 2 + 0 + 1 (DevSel cycle) + 1 (turnaround
cycle) + 3 (extra cycles for first data transfer) + 1 (IDLE cycle) = 8.
Using (7) and (8), we can now calculate
$n_b = \lceil n_{cd}/s_b \rceil = \lceil 1000/32 \rceil = 32$ and
$s_r = n_{cd} - (n_b - 1) s_b = 1000 - 31 \cdot 32 = 8$. As we set the
number of synchronization cycles per session, $c_{ss}$, to zero, we can
now use (6) to calculate the number of actually transmitted channel
words, $n_c$, and (9) to calculate the number of channel synchronization
cycles $c_{cs}$:

$$n_c = (32 - 1) \cdot 32 + 8 = 1000$$
$$c_{cs} = \lceil 32 \cdot 8 \rceil + 0 = 256$$

As the number of transmission cycles per channel word is $c_{ct} = 1$,
we now use (2) to calculate the channel transmission delay to

$$t_{cd} = (0 + 1 \cdot 1000 + 256 + 0)/(33 \cdot 10^6) = 38\,\mu\text{s}$$
□
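
The arithmetic of Example 1 can be replayed with the following Python sketch. It assumes that the channel delay has the form $t_{cd} = (c_{ct} n_c + c_{cs})/f_c$, which is the form the examples in this paper use; the variable names simply mirror the symbols of the model.

```python
import math

# Parameters of Example 1 (PCI master read, maximum burst size).
n_cd = 1000        # channel data words
s_b = 32           # maximum burst size
f_c = 33e6         # channel clock frequency in Hz
c_ct = 1           # transmission cycles per channel word
c_ss = 0           # synchronization cycles per session

# Synchronization cycles per burst: arbitration, bus acquisition, DevSel,
# turnaround, extra cycles of the first data transfer, trailing IDLE cycle.
c_sb = 2 + 0 + 1 + 1 + 3 + 1

n_b = math.ceil(n_cd / s_b)            # (7) with b_t = max
s_r = n_cd - (n_b - 1) * s_b           # (8) with b_t = max
n_c = (n_b - 1) * s_b + s_r            # (6)
c_cs = math.ceil(n_b * c_sb) + c_ss    # (9)
t_cd = (c_ct * n_c + c_cs) / f_c       # channel transmission delay

print(f"n_b = {n_b}, s_r = {s_r}, n_c = {n_c}, c_cs = {c_cs}")
print(f"t_cd = {t_cd * 1e6:.1f} us")
```

The sketch reproduces $n_b = 32$, $s_r = 8$, $n_c = 1000$ and $c_{cs} = 256$, and a channel delay of roughly 38 µs.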
2.6. Data packing/splitting
In this section we show how the number of channel data words $n_{cd}$
is determined for various packing/splitting schemes.
2.6.1 $n_{cd}$ equation: packing ($w_t \le w_c$)
We generalize the process of packing the $n_t$ smaller driver input
words of width $w_t$ into the $n_{cd}$ larger channel data words of
width $w_c$ to be a two-step process:
1. First split the input words into $n_1$ fragments of bit width $w_g$,
$w_g \le w_c$. If $w_g > w_t$, one input word is put on each fragment as
shown in figure 5.C.
2. Then pack as many as possible ($n_2$) of these fragments onto each
channel word.
The reason for introducing the intermediate first step is that we can
then model optimal as well as fast packing with the same equation, as
shown below.
Each driver input word occupies $\lceil w_t/w_g \rceil$ fragments of
width $w_g$, so we need to pack a total of
$n_1 = n_t \lceil w_t/w_g \rceil$ fragments. Each channel word can hold
$n_2 = \lfloor w_c/w_g \rfloor$ fragments. The number of required
channel words is thus $\lceil n_1/n_2 \rceil$ which expands to

$$n_{cd} = \left\lceil \frac{n_t \lceil w_t/w_g \rceil}{\lfloor w_c/w_g \rfloor} \right\rceil \tag{10}$$

Figure 5 gives an example of data packing for three different packing
granularities.

[Figure 5: three panels, A) Optimal Packing, B) Medium Packing, C) Fast Packing]
Figure 5. Packing with different granularities.
Optimal packing ($w_g = 1$). Optimal packing is achieved by packing the
driver input words in a bit-wise manner. This corresponds to setting the
packing granularity $w_g$ to 1. Slack is only possible in the last
channel word. (10) reduces to

$$n_{cd} = \lceil (n_t w_t)/w_c \rceil \tag{11}$$

Medium fast packing ($w_g = w_t$). Medium fast packing is achieved by
packing the driver input words in a per input-word manner, i.e. only as
many whole input words as can fit in a channel word are put on each
channel word. This corresponds to setting the packing granularity $w_g$
equal to $w_t$. Slack can now occur in each channel word. (10) reduces
to

$$n_{cd} = \lceil n_t / \lfloor w_c/w_t \rfloor \rceil \tag{12}$$

Fast packing ($w_g = w_c$). Fast packing is achieved by packing each
input word onto a single channel word. This corresponds to setting the
packing granularity $w_g$ equal to $w_c$. Slack will occur in every
channel word if $w_c > w_t$. Naturally (10) reduces to

$$n_{cd} = n_t \tag{13}$$
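
The two-step packing rule ($n_1 = n_t \lceil w_t/w_g \rceil$ fragments in total, $n_2 = \lfloor w_c/w_g \rfloor$ fragments per channel word) is easy to check with a small Python sketch. The function name `channel_data_words` is hypothetical, and the sample widths (1000 words of 12 bits on a 32 bit channel) are made-up values chosen so that the three granularities give three different results.

```python
def channel_data_words(n_t, w_t, w_c, w_g):
    """Number of channel data words n_cd = ceil(n_1 / n_2).

    n_t: number of driver input words
    w_t: bit width of a driver input word
    w_c: bit width of a channel word
    w_g: packing granularity in bits, 1 <= w_g <= w_c
    """
    if not 1 <= w_g <= w_c:
        raise ValueError("granularity must satisfy 1 <= w_g <= w_c")
    fragments_per_input = -(-w_t // w_g)    # ceil(w_t / w_g)
    n_1 = n_t * fragments_per_input         # total number of fragments
    n_2 = w_c // w_g                        # floor(w_c / w_g) per channel word
    return -(-n_1 // n_2)                   # ceil(n_1 / n_2)


n_t, w_t, w_c = 1000, 12, 32                # illustrative values only
print("optimal (w_g = 1):      ", channel_data_words(n_t, w_t, w_c, 1))
print("medium fast (w_g = w_t):", channel_data_words(n_t, w_t, w_c, w_t))
print("fast (w_g = w_c):       ", channel_data_words(n_t, w_t, w_c, w_c))
```

For these values optimal packing needs 375 channel words, medium fast packing 500 (two 12 bit words per channel word, 8 bits of slack each) and fast packing 1000, matching the reduced forms (11), (12) and (13).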
2.6.2 $n_{cd}$ equation: splitting ($w_c \le w_t$)
Figure 6 gives an example of data splitting for two different values of
$w_g$.
Figure 6. Splitting with different granularities.
We follow the same two-step approach as in the packing phase and find
that the equation for $n_{cd}$ becomes identical to (10). This means
that (10) covers packing as well as splitting with the only requirement
that $w_g \le w_c$.
This implies that the equation for optimal splitting ($w_g = 1$) is
identical to (11) and the equation for medium fast splitting
($w_g = w_c$) is identical to (12). There is no "fast splitting"
($w_g = w_t$) case as we cannot in general fit a whole driver data word
into the smaller channel words (only when $w_t = w_c$).
2.6.3 Resulting $n_{cd}$ equation
The final equation for $n_{cd}$ which covers both packing and splitting
can now be written as

$$n_{cd} = \left\lceil \frac{n_t \lceil w_t/w_g \rceil}{\lfloor w_c/w_g \rfloor} \right\rceil, \quad w_g \le w_c \tag{14}$$

This equation models both fast, medium fast and optimal
packing/splitting, depending on the parameter $w_g$. The
packing/splitting time in general depends on $w_g$, so the transmission
processing delay $c_{tp}$ in (1) and the reception processing delay
$c_{rp}$ in (3) are not actually constants but functions of $w_g$:

$$c_{tp} = F_{tp}(w_g), \qquad c_{rp} = F_{rp}(w_g)$$

The communication model library should provide separate values of
$c_{tp}$ and $c_{rp}$ for each supported value of $w_g$ or provide the
functions $F_{tp}$ and $F_{rp}$ as expressions.
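
One way to picture the two options for the library (tabulated values per granularity, or a function of the granularity) is the Python sketch below. Everything about this layout is an assumption made for illustration: the paper does not describe the library format, the names `CycleModel`, `cycles_per_word` and `f_tp` are hypothetical, and only the two $(w_g, c_{tp})$ pairs in the table are taken from the paper (Example 3).

```python
from typing import Callable, Dict, Union

# A cycle model is either a table {w_g: cycles} or a function F(w_g) -> cycles.
CycleModel = Union[Dict[int, int], Callable[[int], int]]


def cycles_per_word(model: CycleModel, w_g: int) -> int:
    """Return c_tp (or c_rp) for packing granularity w_g."""
    if callable(model):
        return model(w_g)
    if w_g not in model:
        raise ValueError(f"granularity w_g = {w_g} is not in the library entry")
    return model[w_g]


# Tabulated entry: the two (w_g, c_tp) pairs used in Example 3.
c_tp_table = {16: 7, 32: 3}


# Functional entry: a made-up F_tp, chosen only so that it agrees with the
# table at w_g = 16 and w_g = 32. It is not a measured cost model.
def f_tp(w_g: int) -> int:
    return 3 if w_g >= 32 else 7


for w_g in (16, 32):
    print(w_g, cycles_per_word(c_tp_table, w_g), cycles_per_word(f_tp, w_g))
```

A table restricts the estimator to the granularities that were characterized, whereas a function lets it sweep $w_g$ freely at the cost of requiring a fitted expression.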
Example 2: (Bit level serial communication modelling). We consider
serial RS-232 communication using a serial communications controller,
for instance a Zilog Z8530 SCC [7] which is configured to perform 8-bit
asynchronous communication using 1 stop bit and 1 parity bit. We set the
baud rate to 19600, and assume that we wish to write $n_t = 1000$ words
of bit width $w_t = 32$. We consider each channel data element to be a
single bit, so $w_c = w_g = 1$. (14) gives us the number of channel data
words, $n_{cd}$:

$$n_{cd} = \lceil (1000 \lceil 32/1 \rceil)/\lfloor 1/1 \rfloor \rceil = 32000$$

We model the channel transfers to consist of bursts of size $s_b = 8$
and set $b_t$ = fixed. There will only be three synchronization cycles
per burst (for the implicit start bit and the stop and parity bits) as
there is no need to reconfigure the SCC for a write operation each time
we transfer a byte and there is no delay between burst (byte) transfers
as we can reload the write register while the previous byte is being
transferred, so $c_{sb} = 3$. Equations (7) and (8) give us
$n_b = \lceil n_{cd}/8 \rceil = 4000$ and $s_r = s_b = 8$. We assume
that the SCC is already properly configured and set $c_{ss} = 0$. (6)
now gives us $n_c = (4000 - 1) \cdot 8 + 8 = 32000$ and (9) gives us
$c_{cs} = \lceil 4000 \cdot 3 \rceil + 0 = 12000$. Each data element
(bit) transfer lasts $c_{ct} = 1$ clock cycle and the channel clock
frequency is $f_c = 19600$. We can now use (2) to calculate the channel
transmission delay to

$$t_{cd} = (1 \cdot 32000 + 12000)/19600 = 2.2\,\text{s}$$
□
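
The numbers of Example 2 can be reproduced with the following Python sketch. As before, the channel delay is assumed to have the form $t_{cd} = (c_{ct} n_c + c_{cs})/f_c$ used in the example; the helper `ceil_div` is an illustrative name, and the variable names mirror the symbols of the model.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


# Parameters of Example 2.
n_t, w_t = 1000, 32      # 1000 words of 32 bits
w_c = w_g = 1            # each channel data element is a single bit
s_b = 8                  # one byte per fixed-size burst
c_sb, c_ss = 3, 0        # start, stop and parity bit per byte; no session cost
c_ct = 1                 # one clock cycle per bit
f_c = 19600              # channel clock frequency in Hz, as stated

n_cd = ceil_div(n_t * ceil_div(w_t, w_g), w_c // w_g)   # (14)
n_b = ceil_div(n_cd, s_b)                               # (7) with b_t = fixed
s_r = s_b                                               # (8) with b_t = fixed
n_c = (n_b - 1) * s_b + s_r                             # (6)
c_cs = n_b * c_sb + c_ss                                # (9), integers only
t_cd = (c_ct * n_c + c_cs) / f_c                        # channel delay

print(f"n_cd = {n_cd}, n_b = {n_b}, n_c = {n_c}, c_cs = {c_cs}")
print(f"t_cd = {t_cd:.2f} s")
```

More than a quarter of the 44000 channel cycles are synchronization overhead, which is why the per-burst cost $c_{sb}$ matters for byte-oriented serial links.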
2.7. Design space exploration 
The preceding examples have focused on demonstrating the modelling
capabilities of the communication model. We here give an example of how
the model can be used in the design space exploration phase of system
level co-synthesis.
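
Such an exploration step amounts to evaluating the model for each candidate configuration and comparing the resulting total delays. The Python sketch below does this for the two configurations of Example 3 in this section. It is an illustration under stated assumptions, not the LYCOS estimator: the `Config` class and `total_delay` function are hypothetical names, the channel is fixed to the non-burst case of the example, and the pipeline startup/completion term is taken as $2\,t_m/n_t$, the form applied in Example 3.

```python
from dataclasses import dataclass


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


@dataclass
class Config:
    name: str
    f_t: float   # transmitting processor clock frequency (Hz)
    w_g: int     # packing granularity (bits)
    c_tp: int    # transmit processing cycles per input word
    c_rp: int    # receive processing cycles per input word


def total_delay(cfg, n_t=1000, w_t=16, w_c=32, f_c=32e6, f_r=200e6,
                s_b=1, c_sb=1, c_ct=1, c_tc=0, c_rc=0, c_ss=0):
    """Total transmission delay t_t in seconds for b_t = fixed bursts."""
    n_cd = ceil_div(n_t * ceil_div(w_t, cfg.w_g), w_c // cfg.w_g)
    n_b = ceil_div(n_cd, s_b)
    n_c = (n_b - 1) * s_b + s_b              # fixed bursts: s_r = s_b
    c_cs = n_b * c_sb + c_ss                 # integer c_sb, no rounding needed
    t_td = (c_tc + cfg.c_tp * n_t) / cfg.f_t # transmitting driver delay
    t_cd = (c_ct * n_c + c_cs) / f_c         # channel delay
    t_rd = (c_rc + cfg.c_rp * n_t) / f_r     # receiving driver delay
    t_m = max(t_td, t_cd, t_rd)              # slowest pipeline stage
    return t_m + 2 * t_m / n_t               # plus startup/completion estimate


configs = [
    Config("fast, expensive processor", f_t=100e6, w_g=16, c_tp=7, c_rp=7),
    Config("slow, cheap processor", f_t=50e6, w_g=32, c_tp=3, c_rp=3),
]
for cfg in configs:
    print(f"{cfg.name}: t_t = {total_delay(cfg) * 1e6:.2f} us")

best = min(configs, key=total_delay)
print("best:", best.name)
```

Because each evaluation is a handful of integer and floating point operations, a design tool can afford to sweep many processor, protocol and granularity combinations in this way.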
Example 3: Consider communication of ($n_t = 1000$) 16 bit words
($w_t = 16$) via an $f_c = 32$ MHz channel of width $w_c = 32$ using a
proprietary protocol with no burst mode ($b_t$ = fixed, $s_b = 1$) and a
multiplexed address/data bus with one address transfer per data transfer
($c_{sb} = 1$) and one clock cycle per transfer ($c_{ct} = 1$). The
receiving processor operates at clock frequency $f_r = 200$ MHz.
First we consider a configuration where we use a fast (and expensive)
$f_t = 100$ MHz transmitting processor that can pack two 16 bit values
on each 32 bit channel word using seven processing cycles per
transmission word ($w_g = 16$, $c_{tp} = 7$). The receiving processor
also uses $c_{rp} = 7$ cycles per transmission word to unpack the
received channel words. All other parameters are set to zero. For this
configuration, we find that $n_{cd} = 500$, $n_b = 500$, $s_r = 1$,
$c_{cs} = 500$ and $n_c = 500$ and can now calculate the transmitting
driver delay, channel delay and receiving driver delay to
($t_{td} = 70\,\mu\text{s}$, $t_{cd} = 31.25\,\mu\text{s}$,
$t_{rd} = 35\,\mu\text{s}$). We see that the transmitting driver is the
communication bottleneck ($t_m = t_{td} = 70\,\mu\text{s}$) and find,
using (5), the resulting transmission delay to be
$t_t = 70\,\mu\text{s} + 2 \cdot (70\,\mu\text{s}/1000) = 70.14\,\mu\text{s}$.
We now consider a configuration where we use a slow (and cheaper)
$f_t = 50$ MHz transmitting processor that only packs one 16 bit value
on each 32 bit channel word (i.e. fast packing) thus using only three
processing cycles per transmission word ($w_g = 32$, $c_{tp} = 3$). The
receiving processor also uses $c_{rp} = 3$ unpacking cycles per
transmission word. All other parameters are the same as in the previous
configuration. We now find that $n_{cd} = 1000$, $n_b = 1000$,
$s_r = 1$, $c_{cs} = 1000$ and $n_c = 1000$ which results in
($t_{td} = 60\,\mu\text{s}$, $t_{cd} = 62.5\,\mu\text{s}$,
$t_{rd} = 15\,\mu\text{s}$). Here, $t_m = 62.5\,\mu\text{s}$ which
results in a total transmission delay of
$t_t = 62.5\,\mu\text{s} + 2 \cdot (62.5\,\mu\text{s}/1000) = 62.6\,\mu\text{s}$.
We can conclude that in this case the best choice of transmission
processor is the cheap and slow processor, even though it does not
utilize the full bus bandwidth and channel transmission time is larger
than before. The fact that it spends less time on packing data makes it
the better choice. Though being artificial, the example demonstrates
that the performance of the drivers has to be balanced with the
performance of the channel in order to find the best system
configuration. □
3. Conclusion
We have presented a high level communication estimation model suitable
for design space exploration in co-synthesis and have demonstrated its
modelling capabilities and intended use. Future work will focus on
extending the model to include bus arbitration/acquisition delay in case
of buses with multiple drivers and to integrate the communication
estimator with partitioning and design space exploration in the LYCOS
system.
References
[1] J.-M. Daveau, T. B. Ismail, and A. A. Jerraya. Synthesis of
System-Level Communication by an Allocation-Based Approach. In Eighth
International Symposium on System Synthesis, pages 150-155, September
1995.
[2] J. Madsen, J. Grode, P. V. Knudsen, M. E. Petersen, and
A. Haxthausen. LYCOS: the Lyngby Co-Synthesis System. Design Automation
for Embedded Systems, 2(2):195-235, 1997.
[3] J. Madsen and B. Hald. An Approach to Interface Synthesis. In
Proceedings of the Eighth International Symposium on System Synthesis,
pages 16-21, 1995.
[4] S. Narayan and D. D. Gajski. Protocol Generation for Communication
Channels. In Proceedings of the 31st DAC, pages 547-548, 1994.
[5] PCI Special Interest Group. PCI Local Bus Specification, Revision
2.1, June 1995.
[6] J. Zhu, R. Domer, and D. D. Gajski. Syntax and Semantics of the
SpecC Language. In Proceedings of the SASIMI Workshop, pages 75-82,
1997.
[7] Zilog, Inc. SCC/ESCC and ISCC Family of Products User's Manual,
1997.
