Integrating communication protocol selection with partitioning in hardware/software codesign by Knudsen, Peter Voigt & Madsen, Jan
Published in: Proceedings of the 11th International Symposium on System Synthesis
DOI: 10.1109/ISSS.1998.730610
Publication date: 1998
Document version: Publisher's PDF, also known as Version of record
Citation (APA):
Knudsen, P. V., & Madsen, J. (1998). Integrating communication protocol selection with partitioning in
hardware/software codesign. In Proceedings of the 11th International Symposium on System Synthesis
(pp. 111-116). IEEE. DOI: 10.1109/ISSS.1998.730610
Integrating Communication Protocol Selection with Partitioning in
Hardware/Software Codesign
Peter Voigt Knudsen and Jan Madsen
Department of Information Technology, Technical University of Denmark
pvk@it.dtu.dk, jan@it.dtu.dk
Abstract
This paper presents a codesign approach which incorpo-
rates communication protocol selection as a design param-
eter within hardware/software partitioning. The presented
approach takes into account data transfer rates depending
on communication protocol types and configurations, and
different operating frequencies of system components, i.e.
CPUs, ASICs, and busses. It also takes into account the tim-
ing and area influences of drivers and driver calls needed
to perform the communication. The approach is illustrated
by a number of design space exploration experiments which
use models of the PCI and USB communication protocols.
1. Introduction
This paper presents an approach to hardware/software
codesign which integrates communication protocol selec-
tion with hardware/software partitioning. The approach
has been implemented as an extension to the LYCOS[6] co-
synthesis system.
Most current approaches to co-synthesis consider com-
munication synthesis to be a final step in the co-synthesis
trajectory [2][3][7][8]. For instance, [2] presents commu-
nication synthesis as an allocation problem to be solved
after system-level partitioning. However, as the level of
communication overhead between system components in-
fluences what the best partition is, communication synthesis
has to be integrated with design space exploration and sys-
tem level partitioning. For example, we wish to be able to
trade off a fast and expensive communication protocol for a
slow but cheaper protocol and a faster co-processor, if that
is feasible.
[Figure: a software processor (8086, 68000, etc.) with a SW driver and a hardware processor (Xilinx, full custom, etc.) with a HW driver, connected by a channel. Drivers and channel can each be memory mapped, PCI or USB.]
Figure 1. Communication model overview.
In this paper we explore how communication protocol
selection influences the partitioning process. In particular,
we examine a design space extended with protocol selec-
tion, area of drivers, and operating frequencies of system
components. The approach is based on the communication
estimation model that we proposed in [5]. This model repre-
sents a high level communication library that for each sup-
ported processing unit and for each supported protocol cap-
tures performance/area/price and other characteristics of the
necessary drivers and of the communication channel. The
structure of the model is illustrated in figure 1. The model
allows for fast estimation which is important when being
part of an iterative synthesis strategy exploring a very large
design space.
Communication models which allow for functional ver-
ification have been proposed. For instance, [10] models
communication at various levels of abstraction which en-
ables multi-level system simulation to verify correct be-
havior given the selected communication components/pro-
tocols. However, the question of how to select the best com-
bination of communication components/protocols together
with hardware and software components still needs to be
addressed. The aim of this paper is to present an approach
which addresses this question.
2. The communication model
This section gives a shortened description of the commu-
nication model that was introduced in [5]. In addition, the
model is extended with a communication driver area model
which is given in section 2.4.
In this paper, we use the model to model communica-
tion in a processor/coprocessor target architecture as shown
in figure 1, but it is not limited to this architecture – it can
be used to model and estimate communication overhead in
any architecture where a connection between two process-
ing elements has been established. The time overhead of es-
tablishing such a connection (arbitration, etc.) and the time
overhead caused by bus collisions are currently not mod-
eled/estimated. Finally note that, in contrast to prior work,
we consider the possible performance degradation imposed
by the hardware/software drivers, and not only the charac-
teristics of the channel.
2.1. Channel model
Figure 2 shows the channel model. The channel is assumed to receive $n_{cd}$ words of width $w_c$ from the transmitting driver¹. These words are then transmitted in $(n_b - 1)$ bursts of size $s_b$ and a last (remainder) burst of size $s_r$. The number of synchronization cycles between bursts is denoted $c_{sb}$ and the number of initial synchronization cycles $c_{ss}$. Each data element transfer lasts $c_{ct}$ cycles and the channel frequency is $f_c$. Using these terms, the number of transmitted channel words is calculated as

$$n_c = (n_b - 1) \cdot s_b + s_r \qquad (1)$$

and the total number of channel synchronization cycles as²

$$c_{cs} = n_b \cdot c_{sb} + c_{ss} \qquad (2)$$

The total channel transmission delay $t_c$ is now calculated as

$$t_c = \frac{n_c \cdot c_{ct} + c_{cs}}{f_c} \qquad (3)$$

[Figure: example with $w_c = 8$, $f_c = 16$ MHz, $c_{ss} = 4$, $c_{sb} = 3$ and $c_{ct} = 2$. $n_{cd} = 12$ channel data words are sent in $n_b = 3$ bursts of size $s_b = 5$ with remainder burst $s_r = 5$ (2 words of actual data), giving $n_c = (3-1) \cdot 5 + 5 = 15$ channel words and $c_{cs} = 3 \cdot 3 + 4 = 13$.]
Figure 2. Channel model.
The number of bursts $n_b$ and the size of the remainder burst $s_r$ depend on the channel burst mode. We model four burst modes as shown in figure 3.

[Figure: the burst mode (No, Fixed, Max or Inf) determines $n_b$ and $s_r$ on the basis of $n_{cd}$ and $s_b$.]
Figure 3. Burst mode modelling.

The “No” burst mode is used when synchronization is required between each data element. The “Fixed” burst mode is used when each burst is required to have a fixed size. The last burst may contain slack (unused places) but requires the same transmission time as the other bursts. The “Max” burst mode only has a maximum on the burst size, so here slack will not occur in the last burst which can then be smaller and hence communicated faster than the other bursts. The “Inf” burst mode is used when there is no limit on the burst size so that all $n_{cd}$ words can be transmitted in one large burst. Please refer to [5] for the equations for $n_b$ and $s_r$.
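The burst mode descriptions above, together with equations (1) to (3), can be sketched in a few lines of Python. This is only an illustration: the closed forms used for the number of bursts and the remainder burst are our reading of the prose (the paper's own equations are in [5]), and the function and parameter names are hypothetical.

```python
def burst_params(mode, n_cd, s_b=1):
    """Return (n_b, effective burst size, s_r); formulas assumed from the prose."""
    if n_cd < 1:
        raise ValueError("need at least one channel data word")
    if mode == "no":       # synchronization between every data element
        return n_cd, 1, 1
    if mode == "inf":      # one burst of unlimited size
        return 1, n_cd, n_cd
    n_b = -(-n_cd // s_b)  # integer ceiling of n_cd / s_b
    if mode == "fixed":    # last burst is padded with slack to full size
        return n_b, s_b, s_b
    if mode == "max":      # last burst carries only the remaining words
        return n_b, s_b, n_cd - (n_b - 1) * s_b
    raise ValueError(f"unknown burst mode: {mode}")


def channel_delay(mode, n_cd, s_b, c_sb, c_ss, c_ct, f_c):
    """Channel transmission delay in seconds."""
    n_b, size, s_r = burst_params(mode, n_cd, s_b)
    n_c = (n_b - 1) * size + s_r      # equation (1): transmitted channel words
    c_cs = n_b * c_sb + c_ss          # equation (2): synchronization cycles
    return (n_c * c_ct + c_cs) / f_c  # equation (3)


# Values of the channel model example (figure 2), fixed burst mode.
t_c = channel_delay("fixed", n_cd=12, s_b=5, c_sb=3, c_ss=4, c_ct=2, f_c=16e6)
```

With the figure 2 values in fixed mode this gives 15 channel words and 13 synchronization cycles, i.e. 43 cycles at 16 MHz. In max mode the remainder burst shrinks to 2 words, which is the "actual" value shown in the figure.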
2.2. Driver transmission/reception delay model
The transmission driver and reception driver models are shown in figure 4 where we assume that the software part is the transmitter and the hardware part the receiver.

[Figure: SW transmitter (program and driver) with $n_t = 6$, $c_{tp} = 2$, $w_t = 16$, $c_{tc} = 5$ and $f_t = 32$ MHz; channel with $w_c = 8$, $f_c = 16$ MHz, $n_{cd} = 12$ and $n_c = 15$; HW receiver (driver and program) with $n_t = 6$, $c_{rp} = 1$, $w_r = 16$, $c_{rc} = 3$ and $f_r = 66$ MHz.]
Figure 4. Transmission/reception driver model.

The parameters in the two models are similar, so only the parameters for the transmission driver model will be explained. The driver receives $n_t$ transmission words from the program which spends $c_{tp}$ cycles on the call. The transmission words of width $w_t$ are then converted to channel words which requires $c_{tc}$ cycles for each of the $n_t$ words. Using these terms, the transmission delay $t_t$ is calculated as

$$t_t = \frac{c_{tp} + n_t \cdot c_{tc}}{f_t} \qquad (4)$$

and, similarly, the reception delay $t_r$ as

$$t_r = \frac{c_{rp} + n_t \cdot c_{rc}}{f_r} \qquad (5)$$

Note that the receiving driver processing delay $c_{rc}$ is the number of processing cycles per transmission driver input word which is why it is multiplied with $n_t$.

¹ Throughout the paper we adopt the convention that subscripting in figures is represented by an underscore. For example, n_b in a figure corresponds to $n_b$ in the text.
² $\lceil x \rceil$ denotes the smallest integer larger than or equal to x. $\lfloor x \rfloor$ denotes the largest integer smaller than or equal to x.
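As a small numeric illustration of equations (4) and (5), the following Python sketch evaluates both driver delays for the parameter values shown in figure 4. It assumes that the call overhead is paid once per transfer and the processing cycles once per transmission word, which is our reading of the parameter definitions. The helper name `driver_delay` is hypothetical.

```python
def driver_delay(n_t, c_call, c_word, f):
    """Driver delay in seconds: one call overhead plus per-word processing.

    Assumed form: (c_call + n_t * c_word) / f.
    """
    return (c_call + n_t * c_word) / f


# Software transmitter of figure 4: c_tp = 2, c_tc = 5, f_t = 32 MHz.
t_t = driver_delay(n_t=6, c_call=2, c_word=5, f=32e6)

# Hardware receiver of figure 4: c_rp = 1, c_rc = 3, f_r = 66 MHz.
t_r = driver_delay(n_t=6, c_call=1, c_word=3, f=66e6)
```

Under these assumptions the transmitter needs 32 cycles at 32 MHz and the receiver 19 cycles at 66 MHz, so the slower clocked software driver is also the slower of the two drivers in this example.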
[Figure: three ways of packing transmission words of width $w_t = 2$ into channel words of width $w_c = 5$. A) $w_g = 1$ ($n_{cd} = 2$). B) $w_g = 2$ ($n_{cd} = 3$). C) $w_g = 5$ ($n_{cd} = 5$).]
Figure 5. Different packing granularities. A) Optimal Packing. B) Medium Packing. C) Fast Packing.

The driver processing delays $c_{tc}$ and $c_{rc}$ depend on the amount of processing the driver needs to perform per driver transmission word. In [5] we show how different ways of packing/splitting data on the channel, in cases where the channel data width is different from the transmitting/receiving processor data width, may influence these numbers. As shown in figure 5 for the packing case (with $w_t < w_c$), we use a parameter $w_g$ (the packing granularity) to describe the degree of packing. More optimized packing will in general require more processing in the drivers so in our models, the values for $c_{tc}$ and $c_{rc}$ are larger for smaller values of $w_g$. As apparent from figure 5, the packing granularity will also influence the number of channel data words, $n_{cd}$, that the transmitting driver produces. Please refer to [5] for a full treatment of packing/splitting and for an equation for $n_{cd}$.
2.3. Total transmission delay
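We assume that the driver production of channel words, channel transmission and driver reception of channel words occur in parallel in a pipelined fashion which means that it is the slowest part that determines the total transmission delay $t_{tot}$. We set the maximum of the three delays to

$$t_{max} = \max(t_t, t_c, t_r) \qquad (6)$$

where $t_t$, $t_c$ and $t_r$ are the transmission driver, channel and reception driver delays of equations (4), (3) and (5), and calculate the total transmission delay as

$$t_{tot} = t_{max} + \ldots \qquad (7)$$

where the last term is an approximation of the pipeline startup/completion delay.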
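To make the "slowest part" rule of equation (6) concrete, the Python sketch below identifies the bottleneck stage for the example values of figures 2 and 4. The per-stage cycle counts follow the assumed forms of equations (3) to (5), and the startup/completion term of equation (7) is deliberately left out, so this is only the dominant part of the total delay.

```python
# Stage delays in seconds for the example values of figures 2 and 4.
delays = {
    "transmit driver": (2 + 6 * 5) / 32e6,    # (c_tp + n_t * c_tc) / f_t
    "channel":         (15 * 2 + 13) / 16e6,  # (n_c * c_ct + c_cs) / f_c
    "receive driver":  (1 + 6 * 3) / 66e6,    # (c_rp + n_t * c_rc) / f_r
}

# Equation (6): the slowest pipeline stage dominates the transfer.
bottleneck = max(delays, key=delays.get)
t_max = delays[bottleneck]
```

In this example the channel is the bottleneck (43 cycles at 16 MHz), even though the software driver spends more cycles per word than the hardware driver.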
2.4. Area model
[Figure: a program (HW or SW) calling a driver, with $n_{cl} = 2$ driver calls, $a_c = 9$ and $a_d = 250$.]
Figure 6. Area model.

Figure 6 illustrates the communication area model which is an addition to the communication model in [5]. $a_c$ is the area overhead associated with each driver call, $n_{cl}$ the number of such calls and $a_d$ the area of the driver itself. The area unit is “bytes” if the considered processor is a software processor and an area unit decided by the communication library designer when the processor is a hardware processor. The total hardware area/software object code size associated with all communication to/from the considered processor is now calculated as

$$a_{tot} = n_{cl} \cdot a_c + a_d \qquad (8)$$

This equation captures both of the following cases:
1. Non-inlined communication. $a_c$ is set to the number of bytes or the hardware area required to call the software/hardware driver and $a_d$ is the area of the driver. This is the normal case.
2. Inlined communication. The driver is not actually called, but inline expanded. This also implies that $c_{tp}$ and $c_{rp}$ are zero. $a_c$ will be set to the area of the driver and $a_d$ will be zero.
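The two cases of equation (8) can be compared with a short Python sketch. The non-inlined numbers are those of figure 6. For the inlined case we assume, purely for illustration, that the inline expanded driver occupies the same 250 units at every call site. The function name `comm_area` is hypothetical.

```python
def comm_area(n_cl, a_c, a_d):
    """Equation (8): total area = number of calls * per-call area + driver area."""
    return n_cl * a_c + a_d


# Non-inlined: two calls of 9 units each share one driver of 250 units.
non_inlined = comm_area(n_cl=2, a_c=9, a_d=250)

# Inlined (illustrative assumption): every call site holds a full driver copy
# and there is no shared driver.
inlined = comm_area(n_cl=2, a_c=250, a_d=0)
```

With only two call sites, inlining already roughly doubles the communication area (500 versus 268 units), so the call cycles saved have to be weighed against the area available for partitioning.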
2.5. Modelling Examples
Figure 7 shows how the PCI and USB channels can be
modeled with our approach.
The PCI bus [9] is modeled as a 32 bit channel where
each transfer takes 1 cycle. The PCI bus supports burst
transfers with maximum, fixed as well as unlimited size.
We assume a maximum size (max) burst transfer of size
$s_b = 32$. This ensures a low bus latency that allows other,
higher priority, units on the bus to interrupt the transfer. We
assume that the bus arbitration latency is 2 clock cycles and
that the bus is initially IDLE so that the bus acquisition
latency is 0 clock cycles. We set the slave device select
(DevSel) delay to 1 clock cycle. As the address bus and data
bus are multiplexed, the PCI burst transfer consists of an
address transfer followed by the (up to) 32 data transfers.
For a read transaction, a turnaround cycle is required be-
tween the address transfer and the data transfers in order to
avoid bus contention. After completion of the burst, an ad-
ditional IDLE cycle is required. The address transfer and the
data transfers each last one clock cycle (assuming zero wait
state transfers), except for the first data transfer which lasts
4 clock cycles.
Parameter                   PCI Model    USB Model
Burst mode                  Max          Max
Channel width w_c           32 bit       8 bit
Burst size s_b              32           1023
Cycles per transfer c_ct    1            8
Burst Sync. c_sb            9 cycles     80 cycles
Session Sync. c_ss          0 cycles     0 cycles

PCI burst synchronization (c_sb = 9 cycles) consists of: bus arbitration latency (2 cycles), address transfer cycle, DevSel cycle, turnaround cycle (read transaction), 3 extra cycles for the first data transfer, and IDLE cycle.
Figure 7. PCI and USB modelling.
The USB [1] (Universal Serial Bus) is a new personal computer interconnect that can support simultaneous attachment of multiple devices. Transactions on USB are always framed into 1 ms quanta which correspond to 1500 bytes at full speed, i.e. 12 MHz. Though USB is a serial bus, we model it as an 8 bit wide channel, i.e. as being able to transfer 1 byte at a time, where each transfer then takes 8 cycles. Approximately 10 bytes (which translates to 80 cycles) are used for protocol overhead when using isochronous transfers. The maximum data payload size is 1023 bytes, so we set the burst size to 1023 and use the max burst mode as bursts are allowed to be smaller than 1023 bytes.
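The synchronization parameters of the two models follow directly from the numbers quoted in this section. As a quick check (the grouping of the terms is ours, the values are taken from the text):

```latex
\begin{align*}
c_{sb}^{\mathrm{PCI}} &= \underbrace{2}_{\text{arbitration}}
  + \underbrace{1}_{\text{address}}
  + \underbrace{1}_{\text{DevSel}}
  + \underbrace{1}_{\text{turnaround}}
  + \underbrace{3}_{\text{first data, extra}}
  + \underbrace{1}_{\text{IDLE}} = 9 \\
c_{sb}^{\mathrm{USB}} &= 10~\text{bytes} \cdot 8~\text{cycles/byte} = 80 \\
\text{USB frame} &= \frac{12 \cdot 10^{6}~\text{bit/s} \cdot 1~\text{ms}}{8~\text{bit/byte}} = 1500~\text{bytes}
\end{align*}
```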
3. Design space exploration experiments
In the following, we first describe how the partitioning algorithm PACE [4] in the PALACE partitioning and design space exploration tool in LYCOS has been extended to utilize the communication model. Then follow three experiments that demonstrate how communication protocol tradeoffs can influence partitioning and design space exploration results.
3.1. Extension of the PACE partitioning algorithm
PACE is a dynamic programming algorithm for hard-
ware/software partitioning in the binary case, i.e. where we
wish to partition an application onto a target architecture
consisting of a software processor and a hardware copro-
cessor. The communication model can be used for more
general target architectures but this simple architecture is
sufficient for demonstrating the kinds of tradeoffs that are
involved when protocol selection becomes part of the search
space.
[Figure: the blocks B1 to B5 distributed between SW and HW, with adjacent hardware blocks grouped into one sequence.]
Figure 8. Deriving effective R/W-sets for communication estimation in PACE.
Simply put, PACE models the application as consisting of a sequence of basic scheduling blocks (BSBs) to be placed in either hardware or software³. The algorithm is characterized by being able to recognize that multiple adjacent blocks when put in hardware only need to communicate the effective read/write-sets of variables from/to software, as illustrated in figure 8. The variables communicated between adjacent hardware blocks are not considered to contribute to the communication overhead.
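One plausible way to derive effective read/write-sets for a sequence of adjacent hardware blocks is sketched below in Python. This is not PACE's actual implementation (see [4] for that). It assumes that each BSB is described by the sets of variables it reads and writes, and that the set of variables still needed after the sequence (`live_after`, a hypothetical input) is known.

```python
def effective_rw_sets(blocks, live_after):
    """Effective read/write-sets of a sequence of adjacent hardware BSBs.

    blocks:     list of (reads, writes) pairs of variable-name sets, in
                execution order.
    live_after: variables still needed once the sequence has finished.
    """
    produced = set()    # variables already written inside the sequence
    eff_reads = set()   # variables that must be transferred from software
    for reads, writes in blocks:
        # A value produced by an earlier hardware block stays in hardware
        # and does not have to be communicated.
        eff_reads |= reads - produced
        produced |= writes
    # Only results that are used later have to be transferred back.
    eff_writes = produced & live_after
    return eff_reads, eff_writes


# Three adjacent blocks; c and e are purely internal to the sequence.
sequence = [({"a", "b"}, {"c"}), ({"c", "d"}, {"e"}), ({"e"}, {"f"})]
reads, writes = effective_rw_sets(sequence, live_after={"f"})
```

In this example only a, b and d are read from software and only f is written back, whereas treating the three blocks separately would count the transfers of c and e as well.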
As for communication overhead calculation, PACE has
been extended to utilize the new communication estimation
routines when calculating the reception/transmission delays
for each sequence of adjacent hardware blocks.
As for speedup calculation, the algorithm has been ex-
tended to account for the hardware and software operat-
ing frequencies, that the communication model also takes
into account, so that it recognizes that better speedups are
achieved when the operating frequency of the hardware part
is increased.
[Figure: the hardware chip area divided into pre-allocated area (datapath area and driver area), dynamically allocated area (BSB controller areas and driver call areas) and unused area.]
Figure 9. Extended PACE area model.
Finally, the algorithm has been extended to take account of the hardware area occupied by the hardware communication drivers and by each call to the drivers, as expressed in equation 8. Figure 9 shows the hardware model that PACE now incorporates. The full box represents the total area of the hardware chip. The right parts of the box represent pre-allocated area for the datapath (multipliers, adders, etc.) and for the communication driver. The left parts of the box represent the area that is available for partitioning. For each sequence of BSBs that is put in hardware, there will be a contribution associated with implementing the corresponding hardware controller and two contributions associated with the transfer of variables to/from the driver, respectively. The two last ($a_c$) contributions are visualized in figure 6 as well. Note that register file area, interconnect area, etc. are currently ignored in the area model.

³ Please refer to [4] for a more thorough description.
3.2. Experiment details
In all the experiments, we performed partitioning on the
same application as in [4]. The application calculates eigen-
vectors which are used to obtain local orientation estimates
for cloud movements in a sequence of Meteosat thermal im-
ages and consists of 448 lines of behavioral VHDL which
have been converted to a Control/Dataflow Graph (CDFG)
containing 1511 nodes and 1520 edges. This CDFG was au-
tomatically divided into 167 BSBs. For the software proces-
sor, we used a Motorola 68000 model and for the hardware
coprocessor, we used an Actel ACT 3 FPGA library to esti-
mate hardware datapath and controller area. The total hard-
ware area was set to 2000 logic/sequential modules. The
hardware datapath contained, among other modules, three
adders/subtracters, three dividers and three multipliers and
occupied a data path area of 1189. All variables are 16 bit
wide. For the communication channels, we used three mod-
els:
pci-fastp which models the 32 bit wide 33 MHz PCI pro-
tocol and whose drivers incorporate “fast” packing as
shown in figure 5C. The driver processing (packing)
delay per transmission word was set to .
The driver area is set to and the driver call
area to .
pci-optmp which models the 32 bit wide 33 MHz PCI pro-
tocol and whose drivers incorporate “optimal” packing
as shown in figure 5A. The driver processing (pack-
ing/unpacking) delay per transmission word was set to
. The driver area is set to and
the driver call area to .
usb which models the 12 MHz USB protocol and whose
drivers split/unsplit each 16 bit transmission word
into/from two 8 bit channel words using
cycles. The driver area is assumed to be half as big as
the PCI driver and is set to and the driver
call area to .
3.3. Experiment 1
In the first experiment, the channel frequency was fixed
at the “native” frequency of the chosen channel (33 MHz
for PCI and 12 MHz for USB). The software and hardware
frequencies were set equal to each other and set to the fol-
lowing frequencies in turn: 12 MHz, 33 MHz, 66 MHz and
99 MHz. In this way, the channel (at least for the higher
hardware/software frequencies) can be expected to become
a larger and larger bottleneck as it is the slowest part that de-
termines the communication throughput, according to equa-
tions 6 and 7. For each of the three channels and each of the
chosen software/hardware frequencies, we performed hard-
ware/software partitioning and noted the total resulting sys-
tem execution time as reported back from the partitioning
tool. The result is shown in figure 10.
[Plot: system execution time (us) versus $f_r = f_t$ (MHz) for PCI32 Fast Packing at $f_c = 33$ MHz, PCI32 Optimal Packing at $f_c = 33$ MHz and USB at $f_c = 12$ MHz.]
Figure 10. System execution times for different protocols and varying but equal SW/HW frequencies.

First we note that the USB protocol is the best choice (resulting in lowest system execution time and therefore in best speedup) for frequencies up to about 66 MHz. This could be somewhat surprising as it is the protocol with the lowest throughput. The reason is that its driver area is smaller than the driver areas of the PCI protocols. This results in more available hardware chip area so that more BSBs can be moved to hardware and benefit from hardware speedup. However, as the hardware/software frequency is increased to above 66 MHz, the low throughput USB channel becomes too large a bottleneck, and the PCI channels become better choices, though the available BSB area is smaller for these.
For the PCI channels, we see that the pci-fastp protocol is slightly better than the pci-optmp protocol for all frequencies. This could also be somewhat surprising as you would expect pci-optmp, which packs data on the channel and therefore has better channel throughput than the pci-fastp protocol, to perform better at higher frequencies where the fact that it spends more cycles on packing data is counterbalanced by the fact that the processors run at a higher speed. But this is only true if the channel is the communication bottleneck. For the PCI driver experiments, throughput data output from the partitioning tool showed that the channel was only the communication bottleneck in the pci-fastp 99 MHz experiment and the effect of this was negligible. For the other frequencies, the hardware and software drivers were the communication bottlenecks, and here the pci-fastp driver had the advantage of spending fewer cycles on packing than the pci-optmp driver.
Clearly, a tool that can help the designer analyze these
tradeoffs is needed.
3.4. Experiment 2
In the second experiment, we examined the relation be-
tween driver/channel throughputs and the system execution
time more closely. The experiment, which consisted of two
parts, was performed for the pci-fastp protocol. In the
first part, the channel frequency was set to 16 MHz and in
the second part to 33 MHz. In both parts, the software fre-
quency was fixed to 66 MHz and the hardware frequency
was varied between 15 and 80 MHz in steps of 5 MHz.
[Plot: throughput (KB/s) versus HW frequency (MHz) with the curves HW Throughput, SW Throughput, CH Throughput ($f_c = 16$ MHz) and CH Throughput ($f_c = 33$ MHz).]
Figure 11. SW, HW and channel throughputs.
Figure 11 shows the resulting software, channel and hardware throughputs for these experiments. These numbers were calculated by the partitioning tool on the basis of equations (3), (4) and (5) (a fictitious transfer of 200 MB of data was used for the calculations). The software and hardware throughput graphs are the same for the two parts of the experiment as only the channel frequency is changed. As expected, the hardware driver throughput rate increases linearly with the hardware frequency. For the 16 MHz channel frequency case, we get a constant channel throughput of 24976 KB/s and for the 33 MHz case, a constant channel throughput of 53333 KB/s.
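The 16 MHz channel throughput can be retraced from the PCI model of section 2.5. The Python sketch below is a rough reconstruction rather than the partitioning tool's code. It assumes the channel equations in the form $n_c = (n_b - 1) s_b + s_r$, $c_{cs} = n_b c_{sb} + c_{ss}$ and $t_c = (n_c c_{ct} + c_{cs}) / f_c$, the max burst mode, fast packing with one 16 bit variable per 32 bit channel word, 200 MB taken as $200 \cdot 10^6$ bytes and 1 KB taken as 1000 bytes. The function name is hypothetical, and only the 16 MHz case is covered.

```python
def pci_fastp_channel_throughput(f_c, payload_bytes=200_000_000):
    """Channel throughput in KB/s for the pci-fastp model (assumptions above)."""
    s_b, c_sb, c_ss, c_ct = 32, 9, 0, 1   # PCI model parameters (figure 7)
    n_cd = payload_bytes // 2             # one 16 bit variable per channel word
    n_b = -(-n_cd // s_b)                 # max burst mode: ceil(n_cd / s_b)
    s_r = n_cd - (n_b - 1) * s_b          # remainder burst without slack
    n_c = (n_b - 1) * s_b + s_r           # transmitted channel words
    c_cs = n_b * c_sb + c_ss              # synchronization cycles
    t_c = (n_c * c_ct + c_cs) / f_c       # channel delay in seconds
    return payload_bytes / t_c / 1000


slow_channel = pci_fastp_channel_throughput(16e6)
```

Under these assumptions each burst moves 64 bytes of payload in 41 cycles, which at 16 MHz rounds to the 24976 KB/s reported above.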
As a consequence of equation 7, the effective commu-
nication throughput is approximately equal to the minimum
of the software driver, channel and hardware driver through-
puts. Therefore we have for the 16 MHz PCI channel fre-
quency case that the effective throughput is limited by the
“HW Throughput” graph and the “CH Throughput” graph
with a cutoff at 37.5 MHz. In the 33 MHz PCI channel fre-
quency case, the effective throughput is limited by the “HW
Throughput” graph and the “SW Throughput” graph with a
cutoff at 66 MHz.
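The cutoff behaviour can be reproduced with a minimal Python sketch of the "minimum of the three throughputs" rule. The channel throughputs are the values reported above. The hardware and software driver throughputs are not reported numerically in the text, so they are back-calculated here from the two stated cutoff frequencies under the assumption that the hardware throughput is proportional to the hardware frequency. All names are hypothetical.

```python
CH_SLOW = 24976   # KB/s, channel at 16 MHz (reported)
CH_FAST = 53333   # KB/s, channel at 33 MHz (reported)

# Back-calculated, not reported: the HW curve meets the slow channel at
# 37.5 MHz and meets the SW curve at 66 MHz.
HW_PER_MHZ = CH_SLOW / 37.5
SW = HW_PER_MHZ * 66


def effective_throughput(f_hw_mhz, channel):
    """Approximate effective throughput in KB/s: the slowest stage wins."""
    return min(SW, channel, HW_PER_MHZ * f_hw_mhz)


for f in (30, 50, 80):
    print(f, effective_throughput(f, CH_SLOW), effective_throughput(f, CH_FAST))
```

Below 37.5 MHz both channels give the same effective throughput, between the cutoffs only the fast channel follows the hardware driver, and above 66 MHz the fast channel case is capped by the software driver.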
The implication of this nonlinear effective throughput with respect to system performance after partitioning is seen in figure 12 which shows the resulting system execution time as a function of the hardware frequency. Note that both graphs in the figure flatten out at their respective cutoff frequency where the communication overhead becomes dominant.
We see that for hardware frequencies below the first cut-
off frequency of 37.5 MHz, the slow PCI channel and the
fast PCI channel perform equally well. The reason for this
is that the effective throughput is solely determined by the
hardware driver throughput in this frequency range. So
for this frequency range, the best choice of communication
channel is the slow PCI channel (if price and performance
are the only components of the cost function).
For hardware frequencies between 37.5 MHz and 66 MHz, the fast PCI channel is now the best choice, as the choice of that channel means that the increasing hardware driver throughput can be utilized. Choosing the slow PCI channel would limit the effective throughput to that of the PCI channel itself.

[Plot: system execution time (us) versus HW frequency (MHz) for $f_c = 33$ MHz and $f_c = 16$ MHz, with the cutoff frequencies 37.5 MHz and 66 MHz marked.]
Figure 12. System execution times.

Above 66 MHz, the channel throughput becomes the communication bottleneck if the slow PCI channel is chosen and the software driver throughput becomes the bottleneck if the fast PCI channel is chosen. As the software driver has a higher throughput than that of the slow PCI channel, choosing the fast PCI channel will also result in the best performance in this hardware frequency range.
3.5. Experiment 3
The last experiment was a simple experiment aimed at
investigating the tradeoff between internal and external im-
plementation of the hardware drivers. The previous exper-
iments have assumed that the drivers occupied area on the
hardware chip, but the designer may also choose to imple-
ment the communication drivers on a separate chip (thus ob-
taining a larger available area for partitioning), if the price
of this is not too high compared with the performance gain.
                   pci-fastp    pci-optmp    USB
Internal Driver    119.20%      114.75%      182.22%
External Driver    787.35%      646.58%      279.82%
Table 1. Speedups for internal versus external hardware driver implementation.
In this experiment, the software and hardware frequen-
cies were fixed to 33 MHz and the channel frequency to
the native frequency of the channel. Table 1 shows the re-
sults of partitioning sessions performed with internal versus
external implementation of the hardware drivers. For the
external implementation, the hardware driver areas were set
to zero but the hardware driver call areas were of course still
added to the BSB sequence areas.
We see that for internal implementation of drivers, we
obtain the best speedup for the USB protocol. This is be-
cause the USB drivers are smaller, leaving more available
area for partitioning, as we also saw in experiment 1. For
external implementation of the drivers, the pci-fastp
protocol is now the best choice as the available area for par-
titioning will be the same regardless of protocol choice and
the pci-fastp protocol is the one with the best perfor-
mance.
Experiments like the above can be used to guide the designer to the choice of protocol and driver implementation that gives the best tradeoff between price and performance.
4. Conclusion
We have presented a communication model which includes both modeling of the channel and the drivers. The model has been integrated with hardware/software partitioning so as to extend the design space. Design space exploration experiments have shown the importance of including a detailed communication model as well as the importance of integrating communication protocol selection with selection of hardware and software components.
In particular, the experiments have shown the impor-
tant effect of incorporating the influence of drivers on area
and performance. Also, we have seen that driver perfor-
mance may be the actual throughput bottleneck rather than
the channel. We have seen that the adjustment of the oper-
ating frequency of a single system component alone influ-
ences what the best choice of communication protocol is, in
a way that it requires careful analysis to determine. Operat-
ing frequencies of system components are clearly important
parts of the design space, and the impact of these must be
analyzed for both system components and communication
channels and utilized in the design space exploration phase
to determine the best system configuration.
References
[1] http://www.teleport.com/~usb/docs.htm.
[2] J.-M. Daveau, T. B. Ismail, and A. A. Jerraya. Synthesis of System-Level Communication by an Allocation-Based Approach. In Proceedings of the 8th ISSS, pages 150-155, 1995.
[3] M. Eisenring and J. Teich. Domain-specific interface generation from dataflow specifications. In Proceedings of the 6th Codes/CASHE, pages 43-47, 1998.
[4] P. V. Knudsen and J. Madsen. PACE: A Dynamic Programming Algorithm for Hardware/Software Partitioning. In Proceedings of the 4th Codes/CASHE, pages 85-92, 1996.
[5] P. V. Knudsen and J. Madsen. Communication Estimation for Hardware/Software Codesign. In Proceedings of the 6th Codes/CASHE, pages 55-59, 1998.
[6] J. Madsen, J. Grode, P. V. Knudsen, M. E. Petersen, and A. Haxthausen. LYCOS: the Lyngby Co-Synthesis System. Design Automation for Embedded Systems, 2(2):195-235, 1997.
[7] J. Madsen and B. Hald. An Approach to Interface Synthesis. In Proceedings of the 8th ISSS, pages 16-21, 1995.
[8] S. Narayan and D. D. Gajski. Protocol Generation for Communication Channels. In Proceedings of the 31st DAC, pages 547-548, 1994.
[9] PCI Special Interest Group. PCI Local Bus Specification, Revision 2.1, June 1995.
[10] J. Zhu, R. Dömer, and D. D. Gajski. Syntax and Semantics of the SpecC Language. In Proceedings of the SASIMI Workshop, pages 75-82, 1997.
