Abstract
for a slow but cheaper protocol and a faster co-processor, if that is feasible. This should not be done after system level partitioning as the level of communication overhead between system components influences what the best partition is. For this we need fast estimators of the kind presented in this paper.
[6] models communication at various levels of abstraction, which enables multi-level system simulation to verify correct behavior given the selected communication components/protocols, but the question of how to select the best combination of communication components/protocols still needs to be addressed. Our communication model, in combination with the estimation tool, helps the designer/design tool answer this question.

Figure 1 shows our model of point-to-point communication. The figure shows communication in a processor/coprocessor target architecture, but the model is not limited to this architecture: it can be used to model and estimate communication overhead in any architecture where a connection between two processing elements has been established. The time overhead of establishing such a connection (arbitration, etc.) is currently not modeled/estimated. Note that, in contrast to prior work, we consider the possible performance degradation imposed by the hardware/software drivers, and not only the characteristics of the channel.
The communication model
For simplicity, we consider communication in one direction only in this paper. In general, some of the model parameters will depend on the transmission direction. For instance, a PCI bus master read is slower than a write, so the parameters that model channel transmission delay exist in both a "read" version and a "write" version in the full model.

1092-6100/98 $10.00 © 1998 IEEE

Driver transmission delay model

The transmitting driver accepts $n_t$ driver input words for transmission and produces $n_c$ channel words. In order to do so, it may have to pack or split driver input words to fit the channel bit width $w_c$, and it may have to perform other kinds of data processing. The packing granularity $w_g$ influences the transmission processing delay and is defined in section 2.6. Given the clock frequency of the transmitting processor, $f_t$, the number of cycles, $c_{tc}$, it requires to call the driver for transmission (transfer arguments to, transfer execution flow to, etc.) and the number of transmission processing (packing/splitting/etc.) cycles per driver input word, $c_{tp}$, we can write the total driver transmission delay as

$$t_{td} = \frac{c_{tc} + c_{tp}\, n_t}{f_t} \tag{1}$$
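As a quick numerical check of (1), the following Python sketch evaluates the driver transmission delay. The function name and argument order are illustrative choices, not part of the model. The sample values are those of the slower transmitting processor in the design space exploration example ($f_t$ = 50 MHz, $c_{tp}$ = 3, $n_t$ = 1000), with the call overhead $c_{tc}$ taken as zero as in that example.

```python
def driver_transmission_delay(c_tc, c_tp, n_t, f_t):
    """Equation (1): driver transmission delay in seconds.

    c_tc: driver call cycles, c_tp: processing cycles per input word,
    n_t: number of driver input words, f_t: processor clock in Hz.
    """
    return (c_tc + c_tp * n_t) / f_t


t_td = driver_transmission_delay(c_tc=0, c_tp=3, n_t=1000, f_t=50e6)
print(f"t_td = {t_td * 1e6:.1f} us")  # prints: t_td = 60.0 us
```

The result matches the 60 µs quoted for that configuration.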
Channel transmission delay model
Figure 3. Channel transmission parameters
Assume that the number of transmitted channel words $n_c$ and the number of required synchronization cycles $c_{cs}$ are known (formulas for these will be derived in sections 2.5 and 2.6). Given the clock frequency of the channel $f_c$ and the number of transmission cycles per channel word $c_{ct}$, the total channel transmission delay is then calculated as

$$t_{cd} = \frac{c_{ct}\, n_c + c_{cs}}{f_c} \tag{2}$$

where we have assumed that a connection has already been established.
Driver reception delay model

Figure 4. Driver reception parameters

We assume that the receiving driver, in addition to the parameter $n_t$, also receives the parameters $w_t$ and $w_g$ so that it knows how data was packed by the transmitting driver. We will also assume that $w_r \ge w_t$ and that each unpacked/unsplit word of size $w_t$ is put on a single output word of bit width $w_r$. Given the clock frequency of the receiving processor $f_r$, the number of driver call cycles for reception $c_{rc}$ and the number of reception processing (unpacking/unsplitting/etc.) cycles per transmission driver input word, $c_{rp}$, the formula for driver reception delay simply becomes

$$t_{rd} = \frac{c_{rc} + c_{rp}\, n_t}{f_r} \tag{3}$$

Total transmission delay

We assume that the driver production of channel words, channel transmission and driver reception of channel words occur in parallel in a pipelined fashion, which means that it is the slowest part that determines the total transmission delay $t_t$. We set the maximum delay to

$$t_m = \max(t_{td},\, t_{cd},\, t_{rd}) \tag{4}$$

and calculate the total transmission delay as

$$t_t = t_m + 2\,\frac{t_m}{n_t} \tag{5}$$

where the last term is an approximation of the pipeline startup/completion delay. As the number of channel words may differ from the number of transmission/reception words, the pipeline startup/completion delay is not modeled accurately by the given term. An exact derivation is outside the scope of this paper; however, it is important to include an estimate of the delay as it may have significance for small transfers.

Burst mode modelling - $n_c$ equation

The preceding sections have assumed that $n_c$ and $c_{cs}$ were known. This section and section 2.6 give a detailed derivation of these figures.

In order to be able to handle burst mode transfers, we model $n_c$ to consist of $(n_b - 1)$ bursts of size $s_b$ and a remainder burst of size $s_r$, $0 < s_r \le s_b$:

$$n_c = (n_b - 1)\, s_b + s_r \tag{6}$$

The burst elements all have bit width $w_c$. We let the variable $b_t$ denote one of three supported burst transfer types, fixed in the drivers: fixed (each burst has a fixed size), max (there is a maximum on the burst size, but smaller bursts are allowed) and inf (there is no limit on the burst size). We can now calculate $n_b$ and $s_r$ as follows:

$$n_b = \begin{cases} \lceil n_{cd}/s_b \rceil & b_t \in \{\text{fixed}, \text{max}\} \\ 1 & b_t = \text{inf} \end{cases} \tag{7}$$

$$s_r = \begin{cases} s_b & b_t = \text{fixed} \\ n_{cd} - (n_b - 1)\, s_b & b_t = \text{max} \\ n_{cd} & b_t = \text{inf} \end{cases} \tag{8}$$
where $n_{cd}$ is the number of actual channel data values corresponding to the $n_t$ driver input words of bit width $w_t$ which have been packed/split to fit the channel width $w_c$. An equation for $n_{cd}$ is derived in section 2.6.
Given the number of synchronization cycles per burst $c_{cb}$ (possibly a fraction) and the number of synchronization cycles per transfer session $c_{ss}$, we can now write the number of channel synchronization cycles $c_{cs}$ as

$$c_{cs} = c_{cb}\, n_b + c_{ss} \tag{9}$$

With these definitions, the equations for $n_c$ and $c_{cs}$ model four variants of burst transfers.

We assume a maximum size ($b_t$ = max) burst transfer of size $s_b = 32$. This ensures a low bus latency that allows other, higher priority, units on the bus to interrupt the transfer. We assume that the bus arbitration latency is 2 clock cycles and that the bus is initially IDLE so that the bus acquisition latency is 0 clock cycles. We set the slave device select (DevSel) delay to 1 clock cycle. As the address bus and data bus are multiplexed, the PCI burst transfer consists of an address transfer followed by the (up to) 32 data transfers. For a read transaction, a turnaround cycle is required between the address transfer and the data transfers in order to avoid bus contention. After completion of the burst, an additional IDLE cycle is required. The address transfer and the data transfers each last one clock cycle (assuming zero wait state transfers), except for the first data transfer which lasts 4 clock cycles. We see that the number of synchronization cycles per burst is $c_{cb}$ = 2 (arbitration) + 0 (acquisition) + 1 (DevSel cycle) + 1 (turnaround cycle) + 3 (extra cycles for first data transfer) + 1 (IDLE cycle) = 8.

Using (7) and (8), we can now calculate $n_b = \lceil n_{cd}/s_b \rceil = \lceil 1000/32 \rceil = 32$ and $s_r = n_{cd} - (n_b - 1)\, s_b = 1000 - 31 \cdot 32 = 8$. As we set the number of synchronization cycles per session, $c_{ss}$, to zero, we can now use (6) to calculate the number of actually transmitted channel words, $n_c$, and (9) to calculate the number of channel synchronization cycles $c_{cs}$:

$$n_c = (32 - 1) \cdot 32 + 8 = 1000, \qquad c_{cs} = 8 \cdot 32 + 0 = 256$$

As the number of transmission cycles per channel word is $c_{ct} = 1$, we now use (2) to calculate the channel transmission delay $t_{cd}$.
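The burst bookkeeping of (6) to (9) can be sketched in a few lines of Python. The function names are hypothetical, and the case split is reconstructed from the two worked examples in this paper: the max case from the PCI example above, the fixed case from the serial example. The inf case (a single burst holding all data) is an assumption, since no example exercises it. At least one channel data word is assumed.

```python
def burst_layout(n_cd, s_b, b_t):
    """Return (n_b, s_r): number of bursts and size of the last burst."""
    if b_t == "inf":
        # Assumption: with no size limit, all data travels in one burst.
        return 1, n_cd
    n_b = -(-n_cd // s_b)  # integer ceiling of n_cd / s_b
    if b_t == "fixed":
        return n_b, s_b  # the last burst is padded to the full burst size
    if b_t == "max":
        return n_b, n_cd - (n_b - 1) * s_b
    raise ValueError(f"unknown burst type: {b_t!r}")


def channel_words_and_sync(n_cd, s_b, b_t, c_cb, c_ss=0):
    """Return (n_c, c_cs) following equations (6) and (9)."""
    n_b, s_r = burst_layout(n_cd, s_b, b_t)
    # s_b is only needed when there is more than one burst.
    full_bursts = (n_b - 1) * s_b if n_b > 1 else 0
    n_c = full_bursts + s_r
    c_cs = c_cb * n_b + c_ss
    return n_c, c_cs


# PCI example: n_cd = 1000, s_b = 32, b_t = max, c_cb = 8, c_ss = 0
print(burst_layout(1000, 32, "max"))               # (32, 8)
print(channel_words_and_sync(1000, 32, "max", 8))  # (1000, 256)
```

For the fixed type, $n_c$ can exceed $n_{cd}$ because the last burst is padded, which is why the model distinguishes transmitted channel words from channel data words.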
Data packing/splitting

In this section we show how the number of channel data words $n_{cd}$ is determined for various packing/splitting schemes.
2.6.1
We generalize the process of packing the $n_t$ smaller driver input words of width $w_t$ into the $n_{cd}$ larger channel data words of width $w_c$ to be a two-step process:

1. First split each driver input word into fragments of width $w_g$.
2. Then pack as many as possible ($n_2$) of these fragments onto each channel word.

The reason for introducing the intermediate first step is that we can then model optimal as well as fast packing with the same equation, as shown below. Each driver input word occupies $\lceil w_t/w_g \rceil$ fragments of width $w_g$, so we need to pack a total of $n_1 = n_t \lceil w_t/w_g \rceil$ fragments. Each channel word can hold $n_2 = \lfloor w_c/w_g \rfloor$ fragments. The number of required channel words is thus $\lceil n_1/n_2 \rceil$, which expands to

$$n_{cd} = \left\lceil \frac{n_t \lceil w_t/w_g \rceil}{\lfloor w_c/w_g \rfloor} \right\rceil \tag{10}$$

Figure 5 gives an example of data packing for three different packing granularities.

Optimal packing ($w_g = 1$). Optimal packing is achieved by packing the driver input words in a bit-wise manner. This corresponds to setting the packing granularity $w_g$ to 1. (10) reduces to

$$n_{cd} = \lceil n_t\, w_t / w_c \rceil \tag{11}$$

Medium fast packing ($w_g = w_t$). Medium fast packing is achieved by packing the driver input words in a per input-word manner, i.e. only as many whole input words as can fit in a channel word are put on each channel word. This corresponds to setting the packing granularity $w_g$ equal to $w_t$. Slack can now occur in each channel word. (10) reduces to

$$n_{cd} = \left\lceil \frac{n_t}{\lfloor w_c/w_t \rfloor} \right\rceil \tag{12}$$

Fast packing ($w_g = w_c$). Fast packing is achieved by packing each input word onto a single channel word. This corresponds to setting the packing granularity $w_g$ equal to $w_c$.

This implies that the equation for optimal splitting ($w_g = 1$) is identical to (11) and the equation for medium fast splitting ($w_g = w_c$) is identical to (12). There is no "fast splitting" ($w_g = w_t$) case, as we cannot in general fit a whole driver data word into the smaller channel words (only when $w_t = w_c$).
Resulting $n_{cd}$ equation

The final equation (14) for $n_{cd}$, which covers both packing and splitting, models fast, medium fast and optimal packing/splitting, depending on the parameter $w_g$. The packing/splitting time in general depends on $w_g$, so the transmission processing delay $c_{tp}$ in (1) and the reception processing delay $c_{rp}$ in (3) are not actually constants but functions of $w_g$:

$$c_{tp} = F_{tp}(w_g), \qquad c_{rp} = F_{rp}(w_g)$$

The communication model library should provide separate values of $c_{tp}$ and $c_{rp}$ for each supported value of $w_g$ or provide the functions $F_{tp}$ and $F_{rp}$ as expressions.
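The two-step counting argument of section 2.6 (fragments per driver input word, then fragments per channel word) translates directly into integer arithmetic. The Python sketch below is an illustration under that reading. The function name is hypothetical, and it is an assumption that the final equation coincides with this form; it does reproduce every $n_{cd}$ value quoted in the examples of this paper.

```python
def channel_data_words(n_t, w_t, w_c, w_g):
    """Number of channel data words n_cd for packing granularity w_g.

    n_t: driver input words, w_t: input word width in bits,
    w_c: channel word width in bits, w_g: fragment width in bits.
    """
    fragments_per_input_word = -(-w_t // w_g)  # ceil(w_t / w_g)
    fragments_per_channel_word = w_c // w_g    # floor(w_c / w_g)
    if fragments_per_channel_word == 0:
        raise ValueError("w_g must not exceed the channel width w_c")
    n_1 = n_t * fragments_per_input_word
    return -(-n_1 // fragments_per_channel_word)  # ceil(n_1 / n_2)


# 1000 words of 16 bits on a 32 bit channel
print(channel_data_words(1000, 16, 32, 1))   # optimal packing: 500
print(channel_data_words(1000, 16, 32, 16))  # medium fast packing: 500
print(channel_data_words(1000, 16, 32, 32))  # fast packing: 1000
```

The same function handles splitting: 1000 words of 32 bits on a 1 bit channel with $w_g = 1$ give 32000 channel data words.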
Example 2: (Bit level serial communication modelling).
We consider serial RS-232 communication using a serial communications controller, for instance a Zilog Z8530 SCC [7], which is configured to perform 8-bit asynchronous communication using 1 stop bit and 1 parity bit. We set the baud rate to 19600, and assume that we wish to write $n_t = 1000$ words of bit width $w_t = 32$. We consider each channel data element to be a single bit, so $w_c = w_g = 1$. (14) gives us the number of channel data words, $n_{cd} = 1000 \cdot 32 = 32000$.

We model the channel transfers to consist of bursts of size $s_b = 8$ and set $b_t$ = fixed. There will only be three synchronization cycles per burst (for the implicit start bit and the stop and parity bits), as there is no need to reconfigure the SCC for a write operation each time we transfer a byte and there is no delay between burst (byte) transfers, since we can reload the write register while the previous byte is being transferred, so $c_{cb} = 3$. Equations (7) and (8) give us $n_b = \lceil n_{cd}/8 \rceil = 4000$ and $s_r = s_b = 8$. We assume that the SCC is already properly configured and set $c_{ss} = 0$. (6) now gives us $n_c = (4000 - 1) \cdot 8 + 8 = 32000$ and (9) gives us $c_{cs} = 4000 \cdot 3 + 0 = 12000$. Each data element (bit) transfer lasts $c_{ct} = 1$ clock cycle and the channel clock frequency is $f_c = 19600$ Hz. We can now use (2) to calculate the channel transmission delay $t_{cd}$.
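The arithmetic of this example can be replayed end to end in Python. The sketch assumes that (2) has the form $t_{cd} = (c_{ct}\, n_c + c_{cs})/f_c$ and that $f_c$ is given in Hz. Variable names mirror the model parameters; nothing here comes from a real SCC driver.

```python
# Parameters of the serial example
n_t, w_t = 1000, 32        # driver input words and their bit width
w_c = w_g = 1              # one bit per channel data element
s_b, c_cb, c_ss = 8, 3, 0  # fixed bursts of 8 bits, 3 sync cycles each
c_ct, f_c = 1, 19600       # cycles per channel word, channel clock in Hz

# Channel data words: fragments per input word, then fragments per channel word
fragments = n_t * -(-w_t // w_g)
n_cd = -(-fragments // (w_c // w_g))

# Fixed burst type: every burst, including the last, has size s_b
n_b = -(-n_cd // s_b)
s_r = s_b
n_c = (n_b - 1) * s_b + s_r
c_cs = c_cb * n_b + c_ss

# Assumed form of (2)
t_cd = (c_ct * n_c + c_cs) / f_c

print(n_cd, n_b, n_c, c_cs)    # 32000 4000 32000 12000
print(f"t_cd = {t_cd:.2f} s")  # t_cd = 2.24 s
```

Under these assumptions, the 12000 synchronization cycles account for roughly a quarter of the 44000 channel cycles.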
Design space exploration
The preceding examples have focused on demonstrating the modelling capabilities of the communication model. We here give an example of how the model can be used in the design space exploration phase of system level co-synthesis.

[...] cycles per transmission word to unpack the received channel words. All other parameters are set to zero. For this configuration, we find that $n_{cd} = 500$, $n_b = 500$, $s_r = 1$, $c_{cs} = 500$ and $n_c = 500$ and can now calculate the transmitting driver delay, channel delay and receiving driver delay to ($t_{td}$ = 70 µs, $t_{cd}$ = 31.25 µs, $t_{rd}$ = 35 µs). We see that the transmitting driver is the communication bottleneck ($t_m = t_{td}$ = 70 µs) and find, using (5), the resulting transmission delay to be $t_t$ = 70 µs + 2 · (70 µs/1000) = 70.14 µs.

We now consider a configuration where we use a slow (and cheaper) $f_t$ = 50 MHz transmitting processor that only packs one 16 bit value on each 32 bit channel word (i.e. fast packing), thus using only three processing cycles per transmission word ($w_g = 32$, $c_{tp} = 3$). The receiving processor also uses $c_{rp} = 3$ unpacking cycles per transmission word. All other parameters are the same as in the previous configuration. We now find that $n_{cd} = 1000$, $n_b = 1000$, $s_r = 1$, $c_{cs} = 1000$ and $n_c = 1000$, which results in ($t_{td}$ = 60 µs, $t_{cd}$ = 62.5 µs, $t_{rd}$ = 15 µs). Here, $t_m$ = 62.5 µs, which results in a total transmission delay of $t_t$ = 62.5 µs + 2 · (62.5 µs/1000) = 62.6 µs.

We can conclude that in this case the best choice of transmission processor is the cheap and slow processor, even though it does not utilize the full bus bandwidth and channel transmission time is larger than before. The fact that it spends less time on packing data makes it the better choice. Though artificial, the example demonstrates that the performance of the drivers has to be balanced with the performance of the channel in order to find the best system configuration.
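The comparison can be reproduced with a short Python sketch that takes the three component delays quoted above as inputs. It assumes that (4) is the maximum of the three delays and that (5) adds twice the per-word share of that maximum, which is how the arithmetic in the example reads. The function name and the configuration labels are illustrative.

```python
def total_delay(t_td, t_cd, t_rd, n_t):
    """Pipelined total delay: slowest stage plus a startup/completion term."""
    t_m = max(t_td, t_cd, t_rd)  # assumed form of (4)
    return t_m + 2 * t_m / n_t   # assumed form of (5)


# (t_td, t_cd, t_rd) in microseconds, as quoted in the text
configs = {
    "first configuration": (70.0, 31.25, 35.0),
    "second configuration": (60.0, 62.5, 15.0),
}

for name, delays in configs.items():
    t_t = total_delay(*delays, n_t=1000)
    print(f"{name}: t_t = {t_t:.3f} us")
# first configuration: t_t = 70.140 us
# second configuration: t_t = 62.625 us
```

The bottleneck moves from the transmitting driver in the first configuration to the channel in the second, yet the second total is lower.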
Conclusion
We have presented a high level communication estimation model suitable for design space exploration in co-synthesis and have demonstrated its modelling capabilities and intended use. Future work will focus on extending the model to include bus arbitration/acquisition delay in case of buses with multiple drivers and to integrate the communication estimator with partitioning and design space exploration in the LYCOS system.
References

