An Algorithm for Hardware/Software Partitioning Using Mixed Integer Linear by Marwedel, Peter & Niemann, Ralf
© Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
An Algorithm for Hardware/Software Partitioning
Using Mixed Integer Linear Programming
RALF NIEMANN AND PETER MARWEDEL
DEPT. OF COMPUTER SCIENCE XII, UNIVERSITY OF DORTMUND, D-44221 DORTMUND, GERMANY
niemann@ls12.informatik.uni-dortmund.de, marwedel@ls12.informatik.uni-dortmund.de
Received May 1, 1991
Abstract. One of the key problems in hardware/software codesign is hardware/software partitioning. This paper describes a new approach to hardware/software partitioning using integer programming (IP). The advantage of using IP is that optimal results are calculated for a chosen objective function. The partitioning approach works fully automatically and supports multi-processor systems, interfacing and hardware sharing. In contrast to other approaches where special estimators are used, we use compilation and synthesis tools for cost estimation. The increased time for calculating values for the cost metrics is compensated by an improved quality of the values. Therefore, fewer iteration steps for partitioning are needed. The paper presents an algorithm using integer programming for solving the hardware/software partitioning problem, leading to promising results.
Keywords: hardware/software codesign, hardware/software partitioning, embedded systems,
mixed integer linear programming
1. Introduction
Embedded systems typically consist of application-specific hardware parts and programmable parts, i.e., processors like DSPs, core processors or ASIPs. In comparison to the hardware parts, the software parts can be developed and modified much more easily. Thus, software is less expensive in terms of costs and development time. Hardware, however, provides better performance. For this reason, a system designer's goal is to design a system fulfilling all performance constraints and using a minimum amount of hardware.
Hardware/software codesign deals with the problem of designing embedded systems, where automatic partitioning is one key issue. This paper describes a new, fully automatic approach to hardware/software partitioning for multi-processor systems. The approach is based on integer programming (IP) to solve the partitioning problem. A formulation of the IP-model will be introduced in detail. The drawback of solving IP-models is often a high computation time. To reduce the computation time, an algorithm using IP has been developed which splits the partitioning approach into two phases. In a first phase, a mapping of nodes to hardware or software is calculated by estimating the schedule times for each node with heuristics. During the second phase, a correct schedule is calculated for the resulting HW/SW-mapping of the first phase. It will be shown that this heuristic scheduling approach strongly reduces the computation time while the results are nearly optimal for the chosen objective function.
Another new feature of our approach is the cost estimation technique. The cost model is not calculated by estimators like in other approaches, because the quality of estimations is often poor and estimators do not consider compiler effects. In our approach, a compiler and a high-level synthesis tool are used instead of special estimators. The disadvantage of an increased runtime for calculating values for the cost metrics is compensated by a higher precision of these values. A higher precision leads to fewer partitioning iterations.
The outline of the paper is as follows: Section 2 gives an overview of related work in the field of hardware/software partitioning. Our system specification method is introduced in section 3. In section 4 our own approach to partitioning is presented. A formulation of the hardware/software partitioning problem follows in section 5. Section 6 describes the IP-model of the problem. Experimental results of solving these IP-models are presented in section 7 and a conclusion is given in section 8.
2. Related Work
There are many approaches to hardware/software partitioning. One of these is the COSYMA system [3], where hardware/software partitioning is based on simulated annealing using estimated costs. The partitioning algorithm is software-oriented, because it starts with a first non-feasible solution consisting only of software components. In an inner loop partitioning (ILP) software parts of the system are iteratively realized in hardware until all timing constraints are fulfilled. To handle discrepancies between estimated and real execution time, an outer loop partitioning (OLP) restarts the ILP with adapted costs [8]. The OLP is repeated until all performance constraints are fulfilled.
Another hardware/software partitioning approach is realized in the VULCAN system [5]. This approach is hardware-oriented. It starts with a complete hardware solution and iteratively moves parts of the system to the software as long as the performance constraints are fulfilled. In this approach performance satisfiability is not part of the cost function. For this reason, the algorithm can easily be trapped in a local minimum.
The approach of Vahid [16] uses a relaxed cost function to satisfy performance in
an inner partitioning loop and to handle hardware minimization in an outer loop.
The cost function consists of a very heavily weighted term for performance and
a second term for minimizing hardware. The authors present a binary-constraint
search algorithm which determines the smallest size constraint (by binary search) for
which a performance satisfying solution can be found. The partitioning algorithm
minimizes hardware, but not execution time.
Kalavade and Lee [11] present an algorithm (GCLP) that iteratively determines for each node the mapping to hardware or software. The GCLP algorithm does not use a hardwired objective function, but selects an appropriate objective according to a global time-criticality measure and another measure for local optimum. The results are close to optimal and the runtime grows quadratically with the number of nodes. This approach has been extended to solve the extended partitioning problem [12] including the implementation selection problem.
Eles [4] presents a two-stage partitioning approach, where in the first step a VHDL system specification is partitioned into two sets of candidates for hardware and software using profiling and user-interaction. In the second step a process graph is constructed and partitioned into hardware and software parts using a simulated-annealing algorithm [15].
Jantsch [10] presents a partitioning approach where hardware candidates are preselected using profiling. All of these selected hardware candidates realize a system speedup of greater than 1. The goal is to speed up a system by incorporating hardware. A key feature is a memory allocation method which minimizes the interface traffic between hardware and software. The disadvantage of this approach is that hard timing constraints cannot be guaranteed because the cost model is based on profiling.
3. System Specification
One of the key problems in hardware/software codesign is the specification of large systems. Many system specification languages have been developed in the last years. One of the most frequently used ones is VHDL, because many CAD tools supporting VHDL exist. In our approach, we also specify systems in VHDL. In [2], [7] the advantages and disadvantages of several system specification languages have been compared and the results for VHDL are promising.
To specify a system that has to be partitioned, the designer has to define the following:
1. The target technology has to be specified by defining the set of processors for the software parts and the component library for synthesizing the hardware parts of the embedded system.
2. The system has to be defined in VHDL as a set of interconnected instances of components (behavioural VHDL-entities).
3. The design constraints have to be defined, including performance constraints (timing) and resource constraints (area, memory).
In our approach the target technology, the system, and the design constraints are specified by using the specification tool COSYS¹ which is part of the codesign tool COOL². COSYS is a graphical VHDL-based interface for hierarchical system specification. In the following, specifying systems with COSYS will be illustrated by an MPEG audio system.
First, the designer defines behavioural entities using VHDL source code. These behavioural entities, called components, are instantiated to form structural entities. This is done by connecting instances of these components by wires (VHDL signals). Structural entities may also be instantiated. Thus, COSYS allows the designer to describe the system hierarchically. In figure 1 the specification of a hierarchical system is illustrated.
Example 1:
The system mpeg_audio depicted in figure 1 realizes an MPEG audio encoder and decoder. The upper part of the system represents the encoder that encodes incoming PCM audio samples. The result is an encoded bit-stream of the MPEG audio format. The lower part of the system realizes the decoder that decodes MPEG audio bit-streams into PCM samples. The structural entity quantizer_coding is instantiated in this hierarchical specification. Quantizer_coding is defined with the help of two behavioural entities. It contains an instance of component quantizer and another instance of component coding, which have been specified in VHDL code.
The partitioning approach for these specified systems will be described in the following sections.
[Figure 1 shows the hierarchical system 'mpeg_audio' with the inputs pcm_samples_in and encoded_bits_in, the outputs encoded_bits_out and pcm_samples_out, and the blocks mapping, psycho_model, quantizer_coding, frame_packing, frame_unpacking, reconstruction and inverse_mapping; the structural system 'quantizer_coding', which instantiates the behavioural entities quantizer and coding; and the VHDL entity/architecture skeletons of quantizer and coding.]
Figure 1. Hierarchical system specification of an MPEG audio system
4. Hardware/Software Partitioning Approach
[Figure 2 is a flow chart with the following boxes: VHDL system specification, target technology definition, design constraints; syntax graph model; C code generation and VHDL code generation; (retargetable) compilation and high-level synthesis; SW costs and HW costs; partitioning graph; solving ILP model; if a solution exists, then Valid_Partitioning := Partitioning, cluster SW nodes, (retargetable) compilation, SW costs, refine partitioning graph; else Result := Valid_Partitioning.]
Figure 2. Hardware/Software Partitioning
After the system has been specified with COSYS, the VHDL specification is compiled into an internal syntax graph model. For each component (behavioural VHDL-entity) of this model, software source code (C or DFL) and hardware source code (VHDL) is generated. The software parts are compiled and the hardware parts are synthesized by a high-level synthesis tool (OSCAR [13]). The results are values for software cost metrics (software execution time, memory usage) and values for hardware cost metrics (hardware execution time, area) for the components. The disadvantage of an increased runtime for calculating the cost metrics by running compilers and synthesizers is compensated by a better quality. Moreover, a higher precision of the cost values leads to fewer partitioning iterations. After the compilation/synthesis phase, a partitioning graph is generated in two steps. First, the hierarchy of the system is flattened. Then, a partitioning graph is created in which each node of the graph represents an instance of a component in the flattened system. Edges of the partitioning graph represent the wires between these instances. In figure 3 the partitioning graph is calculated for a hierarchical system.
Example 2:
[Figure 3 shows three panels: the hierarchical system, the flattened system with the instances v1 to v5, and the resulting partitioning graph with the nodes v1 to v5.]
Figure 3. Calculation of the partitioning graph
In the first step, the structural entities are flattened, resulting in a set of instances of components. Then for each instance ($v_1, \dots, v_5$) a node is added to the partitioning graph. The edges between the nodes represent the wires between the instances.
Nodes are weighted with hardware and software costs, edges are weighted with interface costs which reflect the cost of hardware/software interfaces. Interface costs are approximated by the number and type of data flowing between both nodes. User-defined design constraints are also attached to the graph. Thus, the partitioning graph includes all information needed for partitioning.
The partitioning graph is then transformed into an IP-model, which is the key issue of this paper. Afterwards, the model is solved by an IP-solver. The calculated design is optimal for the chosen objective function using the generated cost model. Nevertheless, it is possible to improve the design: although sharing effects between different instances of the same component are considered, sharing effects between different components are not. This limitation can be removed by an iterative partitioning approach. We use a software-oriented approach, because compilation is faster than synthesis and software-oriented approaches seem to be superior to hardware-oriented approaches (see [16]).
Sets of nodes which have been mapped to the same processor are clustered. For each cluster, a new cost metric is calculated by compiling all nodes of the cluster together. Then, the partitioning graph is transformed by replacing each cluster by a new node with the new cost metric attached. Finally, the redefined graph is repartitioned. This iteration is repeated until no solution is found. The last valid partitioning represents the resulting design. The clustering technique is illustrated in figure 4.
Example 3:
[Figure 4 shows the partitioning graph with the nodes v1 to v12 after the 1st, 2nd and 3rd partitioning; the clustered nodes are drawn as single nodes labelled "v3,v6", "v10,v11" and "v3,v6,v7,v9,v10,v11".]
Figure 4. Partitioning refinement
The first partitioning iteration results in 4 software nodes ($v_3$, $v_6$, $v_{10}$, $v_{11}$). The nodes $v_3, v_6$ and $v_{10}, v_{11}$ are clustered. After the second iteration it is now possible to execute $v_7$, $v_9$ on the processor, so the new cluster contains $v_3$, $v_6$, $v_7$, $v_9$, $v_{10}$, $v_{11}$. In the third iteration no more nodes can be moved from hardware to software.
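The refinement loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the COOL implementation: the names `refine_partitioning`, `cluster_software_nodes` and the `solve` callback are hypothetical, `solve` stands in for "recompute cluster costs, build the IP-model, run the IP-solver" and returns `None` when no solution exists, and all nodes on one processor are merged into a single cluster. The extra stop condition (clustering changes nothing) is our own addition to guarantee termination with an arbitrary solver.

```python
from typing import Callable, Dict, FrozenSet, Optional, Set

Node = FrozenSet[str]        # a node is a set of original instance names
Mapping = Dict[Node, str]    # node -> target component, e.g. "p1" or "h1"
Solver = Callable[[Set[Node]], Optional[Mapping]]  # hypothetical IP-solver hook


def cluster_software_nodes(mapping: Mapping, processors: Set[str]) -> Set[Node]:
    """Merge all nodes mapped to the same processor into a single node."""
    per_processor: Dict[str, Set[str]] = {}
    clustered: Set[Node] = set()
    for node, target in mapping.items():
        if target in processors:
            per_processor.setdefault(target, set()).update(node)
        else:
            clustered.add(node)  # hardware nodes stay as they are
    clustered.update(frozenset(members) for members in per_processor.values())
    return clustered


def refine_partitioning(nodes: Set[Node], processors: Set[str],
                        solve: Solver) -> Optional[Mapping]:
    """Repartition until the solver fails; return the last valid mapping."""
    valid: Optional[Mapping] = None
    while True:
        mapping = solve(nodes)
        if mapping is None:          # no solution: keep the last valid one
            return valid
        valid = mapping
        clustered = cluster_software_nodes(mapping, processors)
        if clustered == nodes:       # assumption: stop when nothing changes
            return valid
        nodes = clustered
```

Representing a clustered node as a frozen set of instance names keeps the original instances recoverable after any number of iterations.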
5. Formulation of the HW/SW Partitioning Problem
This section introduces a formulation of the hardware/software partitioning problem. This formulation is necessary to simplify the description of the problem with the help of an IP-model. We have to define the system which has to be partitioned and the target technology used to implement the system.
5.1. Target Technology and System Specification
Denition 1 A target technology T is dened as a tuple
T = (V; E); V = H [P [M; E  PS(V) n

fvg j v 2 V
	
containing all target technology components and interconnections. The target tech-
nology components V are dened as a set of hardware components (ASICs) H =
fh
1
; : : : ; h
n
H
g, processors P = fp
1
; : : : ; p
n
P
g and memories M = fm
1
; : : : ;m
n
M
g.
The target technology interconnections E are dened as a set of busses E = fe
1
; : : : ; e
n
E
g
connecting these components (at least 2) where PS(V) represents the power set of
V.
Example 4:
[Figure 5 shows a processor p1, a memory m1 and a hardware component h1 connected by a bus b1.]
Figure 5. Target technology
In figure 5 an example for a target technology is given. It contains a processor $p_1$, a hardware component $h_1$, external memory $m_1$ and a bus $b_1$ connecting $p_1$, $h_1$ and $m_1$.
A system that has to be mapped to the target technology consists of several instances of different system components and interconnections between them. The formal definition is as follows:
Definition 2 A system $S$ is defined as a 4-tuple
$$S = (C, V, E, I)$$
with the following definitions:
$C = \{c_1, \dots, c_{n_C}\}$: set of system components,
$V = \{v_1, \dots, v_{n_V}\}$: set of nodes, representing instances of system components,
$E \subseteq V \times V$: set of edges, representing interconnections between nodes,
$I : V \to C$: $I(v_i) = c_l$ defines that $v_i$ is an instance of component $c_l$.
Example 5:
[Figure 6 shows the system components FIR, + and *, and the system specification with the nodes v1 to v11.]
Figure 6. System specification
In figure 6 a 4-band-equalizer is specified. It consists of 3 system components: an FIR-filter, a multiplier and an adder. The equalizer is specified by using 4 instances of the FIR-filter ($v_1, \dots, v_4$), 4 instances of the multiplier ($v_5, \dots, v_8$) and 3 instances of the adder ($v_9, \dots, v_{11}$). This 4-band-equalizer is a well-suited example to demonstrate the scheduling, hardware sharing and interfacing problem. Therefore, it will be used in the rest of the paper as a demonstrator example.
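Definition 2 maps directly onto a small data structure. The following Python sketch encodes the 4-band-equalizer as a system S = (C, V, E, I); the class name `System` and the helper `four_band_equalizer` are hypothetical. The FIR-to-multiplier edges follow the edges e1 to e4 named in Example 8, whereas the wiring of the three adders as a tree is an assumption, since the text does not list those edges.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class System:
    """S = (C, V, E, I) as in Definition 2."""
    components: Set[str]            # C
    nodes: Set[str]                 # V
    edges: Set[Tuple[str, str]]     # E, a subset of V x V
    instance_of: Dict[str, str]     # I : V -> C

    def __post_init__(self) -> None:
        # I must be defined for every node and map into C;
        # every edge must connect two known nodes.
        assert set(self.instance_of) == self.nodes
        assert set(self.instance_of.values()) <= self.components
        assert all(a in self.nodes and b in self.nodes for a, b in self.edges)

    def instances_of(self, component: str) -> Set[str]:
        return {v for v, c in self.instance_of.items() if c == component}


def four_band_equalizer() -> System:
    instance_of = {f"v{i}": "FIR" for i in range(1, 5)}
    instance_of.update({f"v{i}": "MUL" for i in range(5, 9)})
    instance_of.update({f"v{i}": "ADD" for i in range(9, 12)})
    # e1..e4: each FIR-filter feeds one multiplier.
    edges = {(f"v{i}", f"v{i + 4}") for i in range(1, 5)}
    # Assumed adder tree (not stated explicitly in the text).
    edges |= {("v5", "v9"), ("v6", "v9"), ("v7", "v10"), ("v8", "v10"),
              ("v9", "v11"), ("v10", "v11")}
    return System({"FIR", "MUL", "ADD"}, set(instance_of), edges, instance_of)
```

Calling `four_band_equalizer().instances_of("FIR")` yields the four FIR nodes `v1` to `v4`.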
5.2. Hardware and Software Implementation
Parts of the system may be implemented in hardware or in software. The main difference between implementing system instances $v_{i_1}, v_{i_2}$ on a processor or on a hardware component is that $v_{i_1}, v_{i_2}$ cannot be executed in parallel on a processor. On the software side, a system component $c_l$ is implemented as a function on a processor. Each system instance $v_i$ of $c_l$ which is mapped to the processor uses a corresponding function call for this function. On the hardware side, however, two instances $v_{i_1}$ and $v_{i_2}$ of $c_l$ may be executed in parallel on a hardware component. Therefore, it is possible that $v_{i_1}$ and $v_{i_2}$ are mapped to different hardware instances of $c_l$. The following definition describes the different implementation possibilities.
Definition 3
Let $S = (C, V, E, I)$ be a system.
Let $T = (\mathcal{V}, \mathcal{E})$ be a target technology with $\mathcal{V} = \mathcal{H} \cup \mathcal{P} \cup \mathcal{M}$.
Let $p_k \in \mathcal{V}$ be a processor and $h_k \in \mathcal{V}$ a hardware component.
The sets of possible hardware implementations $\mathit{Impl}_{hw}(c_l, h_k)$ and software implementations $\mathit{Impl}_{sw}(c_l, p_k)$ for a system component $c_l$ are defined as:
$$\mathit{Impl}_{hw}(c_l, h_k) = \{h_{l,1,k}, \dots, h_{l,N,k} \mid N = |\{v_i \mid I(v_i) = c_l\}|\}$$
$$\mathit{Impl}_{sw}(c_l, p_k) = \{p_{l,k}\}$$
The sets of possible hardware implementations $\mathit{Impl}_{hw}(S, T)$ and software implementations $\mathit{Impl}_{sw}(S, T)$ for $S$ on $T$ are then defined as:
$$\mathit{Impl}_{hw}(S, T) = \bigcup_{c_l \in C,\; h_k \in \mathcal{H}} \mathit{Impl}_{hw}(c_l, h_k)$$
$$\mathit{Impl}_{sw}(S, T) = \bigcup_{c_l \in C,\; p_k \in \mathcal{P}} \mathit{Impl}_{sw}(c_l, p_k)$$
Example 6:
[Figure 7 shows the nodes v1 to v11, a hardware component h1 with the hardware instances FIR1 to FIR4, MUL1 to MUL4 and ADD1 to ADD3, a processor p1 with the functions FIR(...), MUL(...) and ADD(...), and a bus b1.]
Figure 7. Hardware/software implementations
The possible hardware/software implementations for the 4-band-equalizer are depicted in figure 7. Three functions may be implemented in software, one for an FIR-filter (FIR), one for multiplying (MUL) and one function for adding (ADD). On the hardware side, 4 hardware instances of an FIR-filter may be needed. In such a case, the highest speed can be reached, because all 4 hardware instances are able to work in parallel. Finally, the hardware may contain a maximum of 4 hardware multipliers and 3 hardware adders.
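The counts in Example 6 follow mechanically from Definition 3. The sketch below enumerates the implementation sets in Python; the function names and the tuple encoding (component, instance index j, hardware component) for $h_{l,j,k}$ and (component, processor) for $p_{l,k}$ are illustrative choices, not notation from the paper.

```python
from typing import Dict, Iterable, List, Tuple

HwImpl = Tuple[str, int, str]   # (component c_l, instance index j, hardware h_k)
SwImpl = Tuple[str, str]        # (component c_l, processor p_k)


def impl_hw(component: str, hw: str, instance_of: Dict[str, str]) -> List[HwImpl]:
    """One possible hardware instance per system instance of the component."""
    n = sum(1 for c in instance_of.values() if c == component)
    return [(component, j, hw) for j in range(1, n + 1)]


def impl_sw(component: str, proc: str) -> List[SwImpl]:
    """A component needs at most one function per processor."""
    return [(component, proc)]


def all_impls(instance_of: Dict[str, str], hw_components: Iterable[str],
              processors: Iterable[str]) -> Tuple[List[HwImpl], List[SwImpl]]:
    components = sorted(set(instance_of.values()))
    hw = [h for c in components for k in sorted(hw_components)
          for h in impl_hw(c, k, instance_of)]
    sw = [p for c in components for k in sorted(processors)
          for p in impl_sw(c, k)]
    return hw, sw


# 4-band-equalizer: 4 FIR-filters, 4 multipliers, 3 adders.
EQUALIZER = {**{f"v{i}": "FIR" for i in range(1, 5)},
             **{f"v{i}": "MUL" for i in range(5, 9)},
             **{f"v{i}": "ADD" for i in range(9, 12)}}
HW_IMPLS, SW_IMPLS = all_impls(EQUALIZER, ["h1"], ["p1"])
```

For one hardware component and one processor this gives 4 + 4 + 3 hardware instances and 3 software functions, matching figure 7.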
5.3. Cost Model
Hardware/software partitioning algorithms need cost metrics for the nodes and the edges of the system to evaluate different partitionings. The values for these cost metrics are calculated for the possible hardware and software implementations of system components. This is done during the compilation and synthesis phase, described in section 4. In our approach, we partition systems based on the following cost metrics:
Denition 4
Let S = (C; V;E; I) be a system and T = (V; E) a target technology.
Let c 2 C be a system component and e 2 E an edge of system S.
Let p 2 V be a processor, h 2 V a hardware component and b 2 E a bus of T .
The cost metrics are dened as follows:
c
dm
(c; p) represents the software data memory required by c on p,
c
pm
(c; p) the software program memory required by c on p,
c
ts
(c; p) the software execution time required by c on p,
c
a
(c; h) the hardware area required by c on h,
c
th
(c; h) the hardware execution time required by c on h,
ci
a
(e,b) the additional interface hardware area required by e on b and
ci
t
(e; b) the additional interface communication time required by e on b.
These costs are also defined for instances $v$ of a system component $c$. They are denoted by $c^{dm}(v, p)$, $c^{pm}(v, p)$, $c^{ts}(v, p)$, $c^{a}(v, h)$ and $c^{th}(v, h)$. The costs for different instances of the same system component are obviously equal:
Let $T = (\mathcal{V}, \mathcal{E})$ be a target technology.
Let $p \in \mathcal{V}$ be a processor and $h \in \mathcal{V}$ a hardware component of $T$.
$$\forall p \in \mathcal{V} : I(v_{i_1}) = I(v_{i_2}) \Rightarrow c^{ts}(v_{i_1}, p) = c^{ts}(v_{i_2}, p) \wedge c^{dm}(v_{i_1}, p) = c^{dm}(v_{i_2}, p) \wedge c^{pm}(v_{i_1}, p) = c^{pm}(v_{i_2}, p) \qquad (1)$$
$$\forall h \in \mathcal{V} : I(v_{i_1}) = I(v_{i_2}) \Rightarrow c^{th}(v_{i_1}, h) = c^{th}(v_{i_2}, h) \wedge c^{a}(v_{i_1}, h) = c^{a}(v_{i_2}, h) \qquad (2)$$
According to the cost metrics definition for system components, we can define the resource costs for each target technology component. The resource costs of a target technology component $t_k$ represent the sum of cost metrics used by the nodes mapped to $t_k$.
Definition 5
Let $S = (C, V, E, I)$ be a system and $T = (\mathcal{V}, \mathcal{E})$ a target technology.
Let $p \in \mathcal{V}$ be a processor, $h \in \mathcal{V}$ a hardware component and $b \in \mathcal{E}$ a bus of $T$.
The resource costs required for implementing $S$ on target technology components are defined as follows:
$C^{dm}(p)$ represents the software data memory required on $p$,
$C^{pm}(p)$ the software program memory required on $p$,
$C^{a}(h)$ the hardware area required on $h$ and
$CI^{a}(b)$ the additional interface hardware area required for $b$.
For each of these resource costs, maximum values can be defined. These values are called resource constraints for the target technology, e.g. the maximal number of CLBs of an FPGA or the amount of internal memory of a processor. They are denoted by $MAX^{dm}(p)$, $MAX^{pm}(p)$ and $MAX^{a}(h)$.
A design represents the realization of a system $S$ on a target technology $T$. The design quality can be expressed by evaluating the resource costs. The results are the following design costs:
Definition 6 The design costs of a system $S$ are defined as follows:
$C^{dm}(S)$ represents the used software data memory,
$C^{pm}(S)$ the used software program memory,
$C^{a}(S)$ the hardware area and
$C^{t}(S)$ the total execution time.
The design costs may also be constrained, e.g. the total execution time of a system has to fulfill a timing constraint to guarantee real-time conditions. These design constraints are denoted by $MAX^{dm}(S)$, $MAX^{pm}(S)$, $MAX^{a}(S)$ and $MAX^{t}(S)$.
With the help of this cost model, the hardware/software partitioning problem can be defined as follows.
5.4. Hardware/Software Partitioning
The task of the hardware/software partitioning problem is to map nodes to target technology components and edges (if communication is necessary) to busses of the target technology (see figure 8). The goal of partitioning algorithms is to minimize the design costs, while meeting all requirements. The design costs are calculated with the help of a given cost model.
Denition 7
Let S = (C; V;E; I) be a system.
Let T = (V; E) be a target technology. V = H [P [M
Let Impl
hw
(S; T ) be the set of possible hardware implementations of S on T .
Let Impl
sw
(S; T ) be the set of possible software implementations of S on T .
The hardware/software partitioning problem is dened as the problem of
nding a mapping from S to T given by two mapping functions:
mv : V ! Impl V  Impl
hw
(S; T ) [ Impl
sw
(S; T )
me : E ! Impl E  E
14
such that
mv(v
i
) =
8
>
<
>
>
:
p
l;k
2 Impl
sw
(c
l
; p
k
) ; if v
i
is implemented by function
p
l;k
on p
k
calculating c
l
= I(v
i
)
h
l;j;k
2 Impl
hw
(c
l
; h
k
) ; if v
i
is implemented by the j th
instance h
l;j;k
of c
l
= I(v
i
) on h
k
:
me(e
i
) =

b
k
2 E ; if e
i
is needed to realize an interface on bus b
k
; ; if no interface is needed for e
i
and design costs are minimized and resource and design constraints are met.
The following example illustrates this complex denition of the problem.
Example 7:
[Figure 8 shows the nodes v1 to v11 and the edges e1 to e4 mapped to a hardware component h1 with the hardware instances FIR1 to FIR4, MUL1 to MUL4 and ADD1 to ADD3, a processor p1 with the functions FIR(...), MUL(...) and ADD(...), and a bus b1.]
Figure 8. Hardware/software partitioning
The hardware/software partitioning problem is illustrated in figure 8. Two FIR-filters ($v_1, v_2$) are mapped to a first hardware instance ($FIR_1$) on $h_1$. Two other FIR-filters ($v_3, v_4$) are mapped to a second hardware instance ($FIR_2$). The multipliers ($v_5, \dots, v_8$) are implemented as a function MUL on processor $p_1$. The function ADD on $p_1$ implements the adders ($v_9, \dots, v_{11}$). The results of the FIR-filters are calculated by $h_1$ and have to be transported to $p_1$. Therefore the edges $e_1, \dots, e_4$ realize the interfaces on bus $b_1$. In summary, two instances of an FIR-filter are implemented by hardware instances on $h_1$, and two functions (MUL, ADD) are implemented in software on $p_1$.
6. The IP-Model
Many optimization problems can be solved optimally by using integer programming (IP). This paper will show that our IP-model allows us to solve the hardware/software partitioning problem with the following characteristics:
- optimal solution for an objective function,
- support for multiprocessor and multi-ASIC target technologies,
- timing constraints are guaranteed by scheduling the nodes,
- bus conflicts are prevented by scheduling communication events on edges,
- interface costs are considered,
- instances of the same system component can share their implementation on hardware,
- interactive support for user-defined constraints.
The following paragraphs describe the IP-model for performing hardware/software partitioning with these characteristics. To simplify the description of the IP-model, the following notations are used:
Denition 8 Sets of nodes and edges
Let S = (C; V;E; I) be a system.
pred nodes(v 2 V ) = fw j 9p : p = (w; : : : ; v)g
succ nodes(v 2 V ) = fw j 9p : p = (v; : : : ; w)g
pred edges(v 2 V ) = fe j e = (x; y) [ y 2 pred nodes(v)g
succ edges(v 2 V ) = fe j e = (x; y) [ x 2 succ nodes(v)g
pred edges(e 2 E) = ff j e = (v;w) [ f 2 pred edges(v)g
succ edges(e 2 E) = ff j e = (v;w) [ f 2 succ edges(w)g
instances of(c 2 C) = fv j I(v) = cg
share nodes(v 2 V ) = fw j w 6= v ^ I(v) = I(w)g
schedule nodes(v 2 V ) = fw j w 6= v ^w =2 fpred nodes(v)[ succ nodes(v)gg
schedule edges(e 2 E) = ff j f 6= e ^ f =2 fpred edges(e) [ succ edges(e)gg
path nodes(v
1
; v
2
2 V ) = fw j w 2 succ nodes(v
1
) ^w 2 pred nodes(v
2
)g
path edges(v
1
; v
2
2 V ) = fe j e 2 succ edges(v
1
) ^ e 2 pred edges(v
2
)g
dominator nodes(v 2 V ) = fw j 8p : p = (s; : : : ; v) ^ pred nodes(s) = ; : w 2 pg
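Several of these sets reduce to graph reachability. The following Python sketch computes a few of them for the 4-band-equalizer. It assumes that a path contains at least one edge, so a node is not its own predecessor or successor in an acyclic graph, and it reuses the assumed adder-tree wiring; only the FIR-to-multiplier edges e1 to e4 are given in the text. The function names mirror the definition but the implementation is our own.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def succ_nodes(v: str, edges: Set[Edge]) -> Set[str]:
    """All nodes reachable from v along a directed path with >= 1 edge."""
    seen: Set[str] = set()
    stack = [v]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if a == cur and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def pred_nodes(v: str, edges: Set[Edge]) -> Set[str]:
    """Predecessors are the successors in the reversed graph."""
    return succ_nodes(v, {(b, a) for a, b in edges})


def share_nodes(v: str, instance_of: Dict[str, str]) -> Set[str]:
    """Other instances of the same system component as v."""
    return {w for w, c in instance_of.items() if w != v and c == instance_of[v]}


def schedule_nodes(v: str, nodes: Set[str], edges: Set[Edge]) -> Set[str]:
    """Nodes with no path to or from v; only these need explicit ordering."""
    return nodes - {v} - pred_nodes(v, edges) - succ_nodes(v, edges)


INSTANCE_OF = {**{f"v{i}": "FIR" for i in range(1, 5)},
               **{f"v{i}": "MUL" for i in range(5, 9)},
               **{f"v{i}": "ADD" for i in range(9, 12)}}
EDGES = {(f"v{i}", f"v{i + 4}") for i in range(1, 5)} | {
    # assumed adder tree
    ("v5", "v9"), ("v6", "v9"), ("v7", "v10"), ("v8", "v10"),
    ("v9", "v11"), ("v10", "v11")}
```

Under these assumptions, `schedule_nodes("v5", set(INSTANCE_OF), EDGES)` contains exactly the nodes that are neither upstream nor downstream of $v_5$, which are the candidates for resource conflicts with it.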
Furthermore, the following indices and variables are used:
Definition 9 Indices
$L = \{1, \dots, n_C\}$: indices for system components $c_l \in C$,
$I = \{1, \dots, n_V\}$: indices for nodes $v_i \in V$,
$J \in \mathbb{N}_0^{+}$: indices for hardware instances of system components,
$KH = \{1, \dots, n_H\}$: indices for hardware components $h_k \in \mathcal{H}$,
$KP = \{1, \dots, n_P\}$: indices for processors $p_k \in \mathcal{P}$,
$KB = \{1, \dots, n_E\}$: indices for busses $b_k \in \mathcal{E}$.
Definition 10 Variables for costs and constraints
Let $S = (C, V, E, I)$ be a system and $T = (\mathcal{V}, \mathcal{E})$ a target technology.
Let $c_l \in C$ be a system component, $v_i \in V$ a node and $e \in E$ an edge of system $S$.
Let $p_k \in \mathcal{V}$ be a processor, $h_k \in \mathcal{V}$ a hardware component and $b_k \in \mathcal{E}$ a bus of $T$.
$c^{ts}_{l,k}$, $c^{dm}_{l,k}$, $c^{pm}_{l,k}$: cost metrics $c^{ts}(c_l, p_k)$, $c^{dm}(c_l, p_k)$, $c^{pm}(c_l, p_k)$,
$c^{th}_{l,k}$, $c^{a}_{l,k}$: cost metrics $c^{th}(c_l, h_k)$, $c^{a}(c_l, h_k)$,
$c^{ts}_{i,k}$, $c^{dm}_{i,k}$, $c^{pm}_{i,k}$: cost metrics $c^{ts}(v_i, p_k)$, $c^{dm}(v_i, p_k)$, $c^{pm}(v_i, p_k)$,
$c^{th}_{i,k}$, $c^{a}_{i,k}$: cost metrics $c^{th}(v_i, h_k)$, $c^{a}(v_i, h_k)$,
$ci^{t}_{i_1,i_2,k}$, $ci^{a}_{i_1,i_2,k}$: cost metrics $ci^{t}(e, b_k)$, $ci^{a}(e, b_k)$ for $e = (v_{i_1}, v_{i_2})$,
$C^{dm}_{k}$, $C^{pm}_{k}$: resource costs $C^{dm}(p_k)$, $C^{pm}(p_k)$,
$C^{a}_{k}$: resource costs $C^{a}(h_k)$,
$CI^{a}_{k}$: resource costs $CI^{a}(b_k)$,
$MAX^{dm}_{k}$, $MAX^{pm}_{k}$: resource constraints $MAX^{dm}(p_k)$, $MAX^{pm}(p_k)$,
$MAX^{a}_{k}$: resource constraints $MAX^{a}(h_k)$,
$C^{t}$, $C^{dm}$, $C^{pm}$, $C^{a}$: design metrics $C^{t}(S)$, $C^{dm}(S)$, $C^{pm}(S)$, $C^{a}(S)$,
$MAX^{t}$, $MAX^{a}$: design constraints $MAX^{t}(S)$, $MAX^{a}(S)$,
$MAX^{dm}$, $MAX^{pm}$: design constraints $MAX^{dm}(S)$, $MAX^{pm}(S)$,
$T^{S}_{i}$: starting time of node $v_i$,
$T^{D}_{i}$: execution time of node $v_i$,
$T^{E}_{i}$: ending time of node $v_i$,
$TI^{S}_{i_1,i_2}$: starting time of edge $e = (v_{i_1}, v_{i_2})$,
$TI^{D}_{i_1,i_2}$: execution time of edge $e = (v_{i_1}, v_{i_2})$,
$TI^{E}_{i_1,i_2}$: ending time of edge $e = (v_{i_1}, v_{i_2})$.
6.1. The Decision Variables
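The IP-model needs decision variables for defining mapping, scheduling, sharing and interfacing constraints. Thus, the solution of the IP-model is driven by the following variables:
Definition 11
$$x_{i,j,k} = \begin{cases} 1 & : v_i \text{ is mapped to the } j\text{-th hardware instance of } c = I(v_i) \text{ on hardware component } h_k, \\ 0 & : \text{otherwise.} \end{cases}$$
$$X_{i,k} = \begin{cases} 1 & : v_i \text{ is mapped to hardware component } h_k, \\ 0 & : \text{otherwise.} \end{cases}$$
$$Y_{i,k} = \begin{cases} 1 & : v_i \text{ is mapped to processor } p_k, \\ 0 & : \text{otherwise.} \end{cases}$$
$$Z_{i_1,i_2} = \begin{cases} 1 & : \text{an interface is needed between } v_{i_1} \text{ and } v_{i_2}, \\ 0 & : \text{otherwise.} \end{cases}$$
$$z_{i_1,i_2,k} = \begin{cases} 1 & : \text{communication between } v_{i_1}, v_{i_2} \text{ is realized on bus } b_k, \\ 0 & : \text{otherwise.} \end{cases}$$
$$nx_{l,j,k} = \begin{cases} 1 & : \text{at least 1 instance of } c_l \text{ is mapped to the } j\text{-th hardware instance of } c_l \text{ on } h_k, \\ 0 & : \text{otherwise.} \end{cases}$$
$$NY_{l,k} = \begin{cases} 1 & : \text{at least 1 instance of } c_l \text{ is mapped to processor } p_k, \\ 0 & : \text{otherwise.} \end{cases}$$
$NX_{l,k}$: number of hardware instances of $c_l$ realized on $h_k$.
$$b_{i_1,i_2} = \begin{cases} 1 & : v_{i_1} \text{ ends before } v_{i_2} \text{ starts}, \\ 0 & : \text{otherwise.} \end{cases}$$
$$bi_{i_1,i_2} = \begin{cases} 1 & : \text{communication time for } e_{i_1} \text{ ends before } e_{i_2} \text{ starts}, \\ 0 & : \text{otherwise.} \end{cases}$$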
Example 8:
[Figure 9 shows the mapping of v1 to v4 to the hardware instances FIR1 and FIR2 on h1, of v5 to v11 to processor p1 and of e1 to e4 to bus b1, together with a timing diagram over t with rows for FIR1 on h1, FIR2 on h1, b1 and p1.]
Figure 9. Hardware sharing
To visualize usage of these variables, figure 9 shows a partitioning of the 4-band-equalizer. $v_1, \dots, v_4$ are mapped to hardware component $h_1$ ($X_{1,1} = \dots = X_{4,1} = 1$). All other nodes $v_5, \dots, v_{11}$ are mapped to processor $p_1$ ($Y_{5,1} = \dots = Y_{11,1} = 1$). This mapping forces interfaces for the edges $e_1 = (v_1, v_5), \dots, e_4 = (v_4, v_8)$ ($Z_{1,5} = \dots = Z_{4,8} = 1$), because data has to be transported from $h_1$ to $p_1$. Therefore, $e_1, \dots, e_4$ are mapped to bus $b_1$ ($z_{1,5,1} = \dots = z_{4,8,1} = 1$). To reduce the amount of hardware area, $v_1$ and $v_2$ share the same hardware instance of an FIR-filter on $h_1$ ($x_{1,1,1} = x_{2,1,1} = 1$). $v_3$ and $v_4$ share the second one ($x_{3,2,1} = x_{4,2,1} = 1$).
The variables get the following values: $nx_{1,1,1} = nx_{1,2,1} = 1$, $nx_{1,3,1} = nx_{1,4,1} = 0$, because only the first two hardware instances of four possible FIR-filters ($c_1$) are required on $h_1$ ($NX_{1,1} = 2$). All multipliers ($c_2$) and adders ($c_3$) are realized on processor $p_1$. Therefore, one function is needed for multiplying ($NY_{2,1} = 1$) and another function is needed for adding ($NY_{3,1} = 1$) incoming values. The timing diagram shows a possible schedule for this partitioning. In the depicted case $b_{5,6} = 1$, because $v_5$ is executed before $v_6$ on $p_1$; $bi_{1,2} = 1$, because the transfer for $e_1$ is executed before the transfer for $e_2$.
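The values in Example 8 can be reproduced by deriving the aggregate variables from the node mapping. The Python sketch below is illustrative only: the function `derive_variables` and the dictionary encoding are hypothetical, components are named FIR/MUL/ADD instead of $c_1, c_2, c_3$, the adder-tree edges are assumed, and the rule "an interface is needed when both ends of an edge lie on different target components" is our reading of the example rather than a constraint quoted from the paper.

```python
from typing import Dict, List, Set, Tuple

INSTANCE_OF = {**{f"v{i}": "FIR" for i in range(1, 5)},
               **{f"v{i}": "MUL" for i in range(5, 9)},
               **{f"v{i}": "ADD" for i in range(9, 12)}}
EDGES: List[Tuple[str, str]] = [
    ("v1", "v5"), ("v2", "v6"), ("v3", "v7"), ("v4", "v8"),   # e1..e4
    # assumed adder tree
    ("v5", "v9"), ("v6", "v9"), ("v7", "v10"), ("v8", "v10"),
    ("v9", "v11"), ("v10", "v11")]

# Mapping of Example 8: x[v] = (j, h_k) for hardware nodes,
# y[v] = p_k for software nodes.
x = {"v1": (1, "h1"), "v2": (1, "h1"), "v3": (2, "h1"), "v4": (2, "h1")}
y = {f"v{i}": "p1" for i in range(5, 12)}


def derive_variables(x, y, instance_of, edges):
    target: Dict[str, str] = {v: k for v, (_, k) in x.items()}
    target.update(y)
    # nx: hardware instances (c_l, j, h_k) that are actually used.
    nx: Set[Tuple[str, int, str]] = {(instance_of[v], j, k)
                                     for v, (j, k) in x.items()}
    # NX: number of used hardware instances per (c_l, h_k).
    NX: Dict[Tuple[str, str], int] = {}
    for c, _, k in nx:
        NX[(c, k)] = NX.get((c, k), 0) + 1
    # NY: functions (c_l, p_k) needed on processors.
    NY: Set[Tuple[str, str]] = {(instance_of[v], k) for v, k in y.items()}
    # Z: interface needed iff the edge crosses target components (assumption).
    Z = {(a, b): int(target[a] != target[b]) for a, b in edges}
    return nx, NX, NY, Z


nx, NX, NY, Z = derive_variables(x, y, INSTANCE_OF, EDGES)
```

With hardware sharing, `NX[("FIR", "h1")]` is 2 although four nodes are mapped to $h_1$, which is exactly the saving in area the example points out.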
18
6.2. The Constraints
The following constraints have to be fulfilled:
1. General Constraints: Each node $v_i$ is executed on exactly one target technology component $t_k$, a processor or a hardware component (eq. 5). If a system component $c_l$ has been realized on a processor $p_k$, then it is not necessary to implement it more than once (eq. 6), because it can be implemented as one function and several function calls (see definition 3). Therefore, the number $NY_{l,k}$ is calculated by equations 7 and 8. In contrast to the binary variable $NY_{l,k}$, the number of hardware instances $NX_{l,k}$ of a system component $c_l$ realized on a hardware component $h_k$ may be greater than one. If no hardware sharing is considered, then $NX_{l,k}$ is equal to the sum of system instances of $c_l$ that have been mapped to $h_k$ (eq. 9).
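$$\forall i \in I: \forall k \in KH: \quad X_{i,k} \le 1 \tag{3}$$
$$\forall i \in I: \forall k \in KP: \quad Y_{i,k} \le 1 \tag{4}$$
$$\forall i \in I: \quad \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1 \tag{5}$$
$$\forall l \in L: \forall k \in KP: \quad NY_{l,k} \le 1 \tag{6}$$
$$\forall l \in L, \forall i: I(v_i) = c_l, \forall k \in KP: \quad NY_{l,k} \ge Y_{i,k} \tag{7}$$
$$\forall l \in L, \forall k \in KP: \quad NY_{l,k} \le \sum_{i: I(v_i) = c_l} Y_{i,k} \tag{8}$$
$$\forall l \in L, \forall k \in KH: \quad NX_{l,k} = \sum_{i: I(v_i) = c_l} X_{i,k} \tag{9}$$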
2. Resource Constraints: The area $C^a_k$ (eq. 10) used on a hardware component $h_k$ is calculated by accumulating the costs for all hardware instances of system components realized on $h_k$. The amount of used memory (eq. 11, 12) on a processor $p_k$ is calculated by summing up the costs for implementing these system components as functions. The resource costs may not violate their resource constraints.
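$$\forall k \in KH: \quad C^a_k = \sum_{l \in L} NX_{l,k} \cdot c^a_{l,k} \le MAX^a_k \tag{10}$$
$$\forall k \in KP: \quad C^{dm}_k = \sum_{l \in L} NY_{l,k} \cdot c^{dm}_{l,k} \le MAX^{dm}_k \tag{11}$$
$$\forall k \in KP: \quad C^{pm}_k = \sum_{l \in L} NY_{l,k} \cdot c^{pm}_{l,k} \le MAX^{pm}_k \tag{12}$$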
3. Design Constraints: The design costs for the complete system are calculated by accumulating the resource costs required by the components of the target technology (eq. 13-15). These design costs may not exceed their given design constraints. The required hardware area includes additional hardware $CI^a_k$ used for interfaces (eq. 13). If interfacing is not considered, $CI^a_k = 0$ for all busses $b_k$. The design costs $C^t$ will be described separately.
$$C^a = \sum_{k \in KH} C^a_k + \sum_{k \in KB} CI^a_k \le MAX^a \tag{13}$$
$$C^{dm} = \sum_{k \in KP} C^{dm}_k \le MAX^{dm} \tag{14}$$
$$C^{pm} = \sum_{k \in KP} C^{pm}_k \le MAX^{pm} \tag{15}$$
4. Timing Constraints: The timing costs cannot be calculated by accumulating the execution time of the nodes, because two nodes $v_1, v_2$ can be executed in parallel if they do not share the same resources and if there is no path from $v_1$ to $v_2$ and vice versa. To determine the starting time and ending time for each node, scheduling has to be performed. The execution time $T^D_i$ (eq. 16) of $v_i$ is either a hardware or a software execution time. The ending time $T^E_i$ (eq. 17) of $v_i$ is the sum of starting time $T^S_i$ and execution time $T^D_i$. The system execution time $C^t$ (eq. 18) is the maximum of the ending times of all nodes $v_i$ and may not violate the global design timing constraint. Data dependencies (eq. 19) have to be considered for all edges $e = (v_{i_1}, v_{i_2})$ including the interface communication time $TI^D_{i_1,i_2}$ of equation 35. If interfacing is not considered, $TI^D_{i_1,i_2} = 0$. The starting times $T^S_i$ (eq. 20) of nodes have to be in their ASAP/ALAP-range which can be calculated in a preprocessing step.
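$$\forall i \in I: \quad T^D_i = \sum_{k \in KH} X_{i,k} \cdot c^{th}_{i,k} + \sum_{k \in KP} Y_{i,k} \cdot c^{ts}_{i,k} \tag{16}$$
$$\forall i \in I: \quad T^E_i = T^S_i + T^D_i \tag{17}$$
$$\forall i \in I: \quad T^E_i \le C^t \le MAX^t \tag{18}$$
$$\forall e = (v_{i_1}, v_{i_2}) \in E: \quad T^S_{i_2} \ge T^E_{i_1} + TI^D_{i_1,i_2} \tag{19}$$
$$\forall i \in I: \quad ASAP(v_i) \le T^S_i \le ALAP(v_i) \tag{20}$$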
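The interplay of the mapping, area and timing constraints can be illustrated with a minimal Python sketch. It is not the IP model of this paper: exhaustive enumeration stands in for the IP solver, hardware sharing and interfacing are ignored, and the three-node chain with all its cost values is a hypothetical toy instance. The names `evaluate` and `best_mapping` are likewise assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy instance; all numbers are invented for illustration.
NODES = ["v1", "v2", "v3"]               # listed in topological order
EDGES = [("v1", "v2"), ("v2", "v3")]     # a simple chain
HW, SW = ["h1"], ["p1"]                  # index sets KH and KP
T_HW = {"v1": 2, "v2": 3, "v3": 1}       # hardware execution times on h1
T_SW = {"v1": 8, "v2": 9, "v3": 4}       # software execution times on p1
AREA = {"v1": 5, "v2": 7, "v3": 3}       # area of each node on h1 (no sharing)


def evaluate(mapping):
    """Return (hardware area, system execution time) for one mapping."""
    # Eq. 9 and 10 without sharing: one hardware instance per mapped node.
    area = sum(AREA[n] for n in NODES if mapping[n] in HW)
    end = {}
    for n in NODES:
        # Eq. 19 without interface delay: start after all predecessors end.
        start = max((end[a] for a, b in EDGES if b == n), default=0)
        # Eq. 16: hardware or software execution time.
        duration = T_HW[n] if mapping[n] in HW else T_SW[n]
        end[n] = start + duration        # eq. 17
    return area, max(end.values())       # eq. 18: C^t is the latest ending time


def best_mapping(max_time, max_area):
    """Enumerate every mapping (eq. 5) and keep the feasible one with least area."""
    best = None
    for targets in product(HW + SW, repeat=len(NODES)):
        mapping = dict(zip(NODES, targets))
        area, time = evaluate(mapping)
        if time <= max_time and area <= max_area:
            if best is None or (area, time) < best[:2]:
                best = (area, time, mapping)
    return best


print(best_mapping(max_time=15, max_area=100))
```

Because the toy graph is a chain, the earliest start times are also a valid processor schedule; a graph with parallel branches would additionally need the sequencing constraints of section 6.5.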
6.3. Hardware Sharing
If hardware sharing is considered, then it is not sufficient to model bindings between nodes $v_i$ and hardware components $h_k$ with help of the binary variable $X_{i,k}$. In order to consider hardware sharing, the binding of $v_i$ to the $j$-th hardware instance (of system component $c_l = I(v_i)$) contained in $h_k$ has to be modelled. This binding is modelled using the binary binding variable $x_{i,j,k}$ (eq. 21).
Example 9:
Figure 10. Hardware sharing (processor p1, bus b1 and hardware component h1 with instances FIR1 and FIR2; timing diagram for v1, v2 on FIR1 and v3, v4 on FIR2; $x_{1,1,1} = x_{2,1,1} = 1$, $x_{3,2,1} = x_{4,2,1} = 1$)
In figure 10 an example is given for sharing hardware resources to minimize the amount of hardware area. The system instances $v_1, v_2$ of an FIR-filter are mapped to the first hardware instance $FIR_1$ of an FIR-filter on hardware component $h_k$ ($x_{1,1,1} = x_{2,1,1} = 1$). $v_3$ and $v_4$ share the second hardware instance ($x_{3,2,1} = x_{4,2,1} = 1$). Therefore, 4 system instances are realized by two hardware instances of FIR-filters on $h_k$. The timing diagram shows that $v_1, v_2$ and also $v_3, v_4$ have to be scheduled. But both hardware instances $FIR_1$ and $FIR_2$ are able to work in parallel on $h_1$.
A node $v_i$ is realized on $h_k$, if $v_i$ is bound to one hardware instance of system component $c_l = I(v_i)$ on $h_k$ (eq. 22). If at least one instance $v_i$ of system component $c_l$ is bound to the $j$-th hardware instance of $c_l$ on $h_k$, then $nx_{l,j,k} = 1$ (eq. 23, 24). The number $NX_{l,k}$ of used hardware instances of $c_l$ on $h_k$ is calculated by accumulating the variables $nx_{l,j,k}$ (eq. 25).
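$$\forall k \in KH: \forall l \in L: \quad N = |\mathit{instances\_of}(c_l)|:$$
$$\forall i: I(v_i) = c_l: \forall j \in \{1, \ldots, N\}: \quad x_{i,j,k} \le 1 \tag{21}$$
$$\forall i: I(v_i) = c_l: \quad X_{i,k} = \sum_{j=1}^{N} x_{i,j,k} \tag{22}$$
$$\forall i: I(v_i) = c_l: \forall j \in \{1, \ldots, N\}: \quad nx_{l,j,k} \ge x_{i,j,k} \tag{23}$$
$$\forall j \in \{1, \ldots, N\}: \quad nx_{l,j,k} \le \sum_{i: I(v_i) = c_l} x_{i,j,k} \tag{24}$$
$$NX_{l,k} = \sum_{j=1}^{N} nx_{l,j,k} \tag{25}$$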
If hardware sharing is not considered, then equation 9 is used instead of equations
21-25.
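As a small illustration of equations 22-25, the following Python sketch derives $X_{i,k}$, $nx_{l,j,k}$ and $NX_{l,k}$ from a given binding for one system component $c_l$ on one hardware component $h_k$. The function name `hardware_instances` and the dictionary encoding of the binding are assumptions for this sketch; the example data reproduce the binding of figure 10.

```python
def hardware_instances(binding, n):
    """Derive X, nx and NX for one system component on one hardware component.

    binding maps each node to the index j (1..n) of the hardware instance it is
    bound to, or to None if the node is not realized on this hardware component.
    """
    instances = range(1, n + 1)
    # Binary binding variables x_{i,j,k} (eq. 21).
    x = {(i, j): int(binding[i] == j) for i in binding for j in instances}
    # Eq. 22: a node is on h_k iff it is bound to one of the instances.
    X = {i: sum(x[i, j] for j in instances) for i in binding}
    # Eq. 23 and 24 together: nx = 1 iff at least one node uses instance j.
    nx = {j: int(any(x[i, j] for i in binding)) for j in instances}
    # Eq. 25: number of hardware instances actually used.
    NX = sum(nx.values())
    return X, nx, NX


# Binding of figure 10: v1, v2 share FIR1 and v3, v4 share FIR2.
X, nx, NX = hardware_instances({"v1": 1, "v2": 1, "v3": 2, "v4": 2}, n=4)
print(nx, NX)  # {1: 1, 2: 1, 3: 0, 4: 0} 2
```

The result agrees with the values given for the 4-band-equalizer: two of the four possible FIR-filter instances are used, so $NX_{1,1} = 2$.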
6.4. Interfacing
An interface has to be realized for an edge $e = (v_{i_1}, v_{i_2})$, if $v_{i_1}$ and $v_{i_2}$ are realized on different target technology components.
Example 10:
Figure 11. Interface (processor p1, memory m1 and hardware component h1 connected by bus b1; timing diagram of v4 on h1, e4 on b1 and v8 on p1; $X_{4,1} = 1$, $Y_{8,1} = 1$, $z_{4,8,1} = Z_{4,8} = 1$)
In figure 11 an example is given for a required interface. $v_4$ has been mapped to hardware component $h_1$ ($X_{4,1} = 1$) and $v_8$ to processor $p_1$ ($Y_{8,1} = 1$). Therefore, the output data of $v_4$, calculated on $h_1$, has to be moved to $v_8$, implemented on $p_1$. Thus, an interface is needed for $e_4$, indicated by $Z_{4,8} = 1$. For this reason, edge $e_4 = (v_4, v_8)$ is mapped to bus $b_1$ ($z_{4,8,1} = 1$) to realize the data transfer. The timing diagram shows that $v_8$ starts after the data has been transferred from $v_4$ using $b_1$.
An interface is needed between two nodes $v_{i_1}, v_{i_2}$, indicated by $Z_{i_1,i_2} = 1$, if they are mapped to different target technology components (eq. 27-31). With help of the interface binding variable $z_{i_1,i_2,k}$, a bus is selected realizing the data transfer from $v_{i_1}$ to $v_{i_2}$ (eq. 33). The additional amount of hardware area $CI^a_k$ and the communication delay $TI^D_{i_1,i_2}$ used for an interface are calculated in equations 34-35. Additional constraints for the starting and ending time of communication are added in equations 36-39.
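$$\forall e = (v_{i_1}, v_{i_2}) \in E:$$
$$Z_{i_1,i_2} \le 1 \tag{26}$$
$$\forall k \in KH: \quad Z_{i_1,i_2} \ge X_{i_1,k} - X_{i_2,k} \tag{27}$$
$$\forall k \in KH: \quad Z_{i_1,i_2} \ge X_{i_2,k} - X_{i_1,k} \tag{28}$$
$$\forall k \in KP: \quad Z_{i_1,i_2} \ge Y_{i_1,k} - Y_{i_2,k} \tag{29}$$
$$\forall k \in KP: \quad Z_{i_1,i_2} \ge Y_{i_2,k} - Y_{i_1,k} \tag{30}$$
$$Z_{i_1,i_2} \rightarrow \text{minimize} \tag{31}$$
$$\forall k \in KB: \quad z_{i_1,i_2,k} \le 1 \tag{32}$$
$$Z_{i_1,i_2} = \sum_{k \in KB} z_{i_1,i_2,k} \tag{33}$$
$$\forall e = (v_{i_1}, v_{i_2}) \in E:$$
$$\forall k \in KB: \quad CI^a_k = \sum_{e \in E} z_{i_1,i_2,k} \cdot ci^a_{i_1,i_2,k} \tag{34}$$
$$TI^D_{i_1,i_2} = \sum_{k \in KB} z_{i_1,i_2,k} \cdot ci^t_{i_1,i_2,k} \tag{35}$$
$$TI^E_{i_1,i_2} = TI^S_{i_1,i_2} + TI^D_{i_1,i_2} \tag{36}$$
$$TI^S_{i_1,i_2} \ge T^E_{i_1} \tag{37}$$
$$TI^E_{i_1,i_2} \le T^S_{i_2} \tag{38}$$
$$ASAP(e_{i_1}) \le TI^S_{i_1,i_2} \le ALAP(e_{i_1}) \tag{39}$$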
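The effect of equations 27-31 and 35 can be sketched in a few lines of Python. Constraints 27-30 are lower bounds on $Z_{i_1,i_2}$ and equation 31 pushes it to the smallest admissible value, so for a fixed mapping $Z$ equals the largest of these bounds. The function names, the nested-dictionary encoding of $X$ and $Y$, and the bus delay of 4 time units are assumptions for this sketch; the mapping mirrors figure 11.

```python
def interface_needed(i1, i2, X, Y):
    """Smallest binary Z that satisfies eq. 27-30 (eq. 31 minimizes Z)."""
    lower_bounds = [abs(X[i1][k] - X[i2][k]) for k in X[i1]]   # eq. 27, 28
    lower_bounds += [abs(Y[i1][k] - Y[i2][k]) for k in Y[i1]]  # eq. 29, 30
    return max(lower_bounds, default=0)


def communication_delay(z, ci_t):
    """Eq. 35: delay of the bus selected by the binding variables z."""
    return sum(z[k] * ci_t[k] for k in z)


# Mapping of figure 11: v4 on h1, v7 and v8 on p1.
X = {"v4": {"h1": 1}, "v7": {"h1": 0}, "v8": {"h1": 0}}
Y = {"v4": {"p1": 0}, "v7": {"p1": 1}, "v8": {"p1": 1}}

print(interface_needed("v4", "v8", X, Y))  # 1: hardware to software
print(interface_needed("v7", "v8", X, Y))  # 0: both nodes on p1
print(communication_delay({"b1": 1}, {"b1": 4}))  # 4 (hypothetical bus delay)
```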
6.5. Scheduling
Two nodes $v_{i_1}, v_{i_2}$ which can be executed in parallel have to be sequentialized, if
• $v_{i_1}$ and $v_{i_2}$ are executed on the same processor or
• $v_{i_1}$ and $v_{i_2}$ share the same hardware instance on the same hardware component.
To sequentialize two nodes $v_{i_1}, v_{i_2}$, the binary decision variable $b_{i_1,i_2}$ is used. Two edges $e_{i_1}, e_{i_2}$ have to be sequentialized, if $e_{i_1}$ and $e_{i_2}$ represent interfaces and both edges use the same bus to realize the communication. In this case, the binary decision variable $bi_{i_1,i_2}$ is used to schedule $e_{i_1}$ and $e_{i_2}$. The following example will illustrate all situations where scheduling constraints are required.
Example 11:
In figure 12 all three possibilities are depicted when scheduling constraints are required.
1. $v_7$ and $v_8$ are mapped to the same processor $p_1$: Then, $v_7$ has to be executed before $v_8$ ($b_{7,8} = 1$) or $v_8$ before $v_7$ ($b_{8,7} = 1$).
2. $v_3$ and $v_4$ are mapped to the same hardware instance of an FIR-filter on $h_1$: Therefore $v_3$ and $v_4$ have to be scheduled.
Figure 12. Scheduling (timing diagrams for processor p1: $b_{7,8} = 1$ or $b_{8,7} = 1$; for FIR2 on h1: $b_{3,4} = 1$ or $b_{4,3} = 1$; for bus b1: $bi_{3,4} = 1$ or $bi_{4,3} = 1$)
3. $e_3$ realizes an interface for transferring data from $v_3$ (on $h_1$) to $v_7$ (on $p_1$). $e_4$ realizes an interface between $v_4$ and $v_8$. Both edges have been mapped to bus $b_1$ to transfer the data. For this reason, the communication times of $e_3$ and $e_4$ have to be scheduled. If $e_3$ is scheduled before $e_4$, then the schedule variable $bi_{3,4} = 1$, otherwise $bi_{4,3} = 1$.
The following constraints are necessary, to schedule nodes and edges:
$$\forall k \in KP, \forall v_{i_1}, v_{i_2} \in V: v_{i_1} \in \mathit{schedule\_nodes}(v_{i_2}):$$
$$T^E_{i_1} \le T^S_{i_2} + (3 - b_{i_1,i_2} - Y_{i_1,k} - Y_{i_2,k}) \cdot C_1 \tag{40}$$
$$T^E_{i_2} \le T^S_{i_1} + (2 + b_{i_1,i_2} - Y_{i_1,k} - Y_{i_2,k}) \cdot C_2 \tag{41}$$
$$\forall l \in L: |\mathit{instances\_of}(c_l)| \ge 2:$$
$$\forall k \in KH, \forall v_{i_1}, v_{i_2} \in V: v_{i_1}, v_{i_2} \in \mathit{instances\_of}(c_l):$$
$$T^E_{i_1} \le T^S_{i_2} + (3 - b_{i_1,i_2} - x_{i_1,j,k} - x_{i_2,j,k}) \cdot C_3 \tag{42}$$
$$T^E_{i_2} \le T^S_{i_1} + (2 + b_{i_1,i_2} - x_{i_1,j,k} - x_{i_2,j,k}) \cdot C_4 \tag{43}$$
$$\forall k \in KB, \forall e_{i_1} = (v_{i_{11}}, v_{i_{12}}), e_{i_2} = (v_{i_{21}}, v_{i_{22}}) \in E: e_{i_1} \in \mathit{schedule\_edges}(e_{i_2}):$$
$$TI^E_{i_{11},i_{12}} \le TI^S_{i_{21},i_{22}} + (3 - bi_{i_1,i_2} - z_{i_{11},i_{12},k} - z_{i_{21},i_{22},k}) \cdot C_5 \tag{44}$$
$$TI^E_{i_{21},i_{22}} \le TI^S_{i_{11},i_{12}} + (2 + bi_{i_1,i_2} - z_{i_{11},i_{12},k} - z_{i_{21},i_{22},k}) \cdot C_6 \tag{45}$$
The idea of these constraints is equivalent in all three cases. For this reason, only the constraints 40 and 41 for scheduling nodes $v_{i_1}$ and $v_{i_2}$ using the same processor $p_k$ are described in the following. If $v_{i_1}$ and $v_{i_2}$ have to be scheduled ($Y_{i_1,k} = Y_{i_2,k} = 1$), then one of the following conditions has to be fulfilled:
1. $v_{i_1}$ is executed before $v_{i_2}$ ($b_{i_1,i_2} = 1$) $\Rightarrow T^E_{i_1} \le T^S_{i_2}$, or
2. $v_{i_2}$ is executed before $v_{i_1}$ ($b_{i_1,i_2} = 0$) $\Rightarrow T^E_{i_2} \le T^S_{i_1}$.
This fact is modelled by the constraints defined in equations 40 and 41:

| $Y_{i_1,k} = Y_{i_2,k} = 1$ | $b_{i_1,i_2}$ | equation 40 | equation 41 |
|---|---|---|---|
| yes | 0 | $T^E_{i_1} \le T^S_{i_2} + C_1$ | $T^E_{i_2} \le T^S_{i_1}$ |
| yes | 1 | $T^E_{i_1} \le T^S_{i_2}$ | $T^E_{i_2} \le T^S_{i_1} + C_2$ |
| no | 0, 1 | $T^E_{i_1} \le T^S_{i_2} + n_1 C_1,\ n_1 \ge 1$ | $T^E_{i_2} \le T^S_{i_1} + n_2 C_2,\ n_2 \ge 1$ |
If $v_{i_1}$ and $v_{i_2}$ have to be scheduled ($Y_{i_1,k} = Y_{i_2,k} = 1$), only one of equations 40 and 41 results in a hard constraint. If $b_{i_1,i_2} = 0$, equation 40 has no effect, and if $b_{i_1,i_2} = 1$, equation 41 can be ignored. If either $Y_{i_1,k} = 0$ or $Y_{i_2,k} = 0$, both constraints have no effect, provided that $C_1$ and $C_2$ are dimensioned correctly. It can be shown that $C_1$ and $C_2$ have the following lower bounds:
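Let $MaximalExecutionTime(v_i) = \max\left(\{c^{ts}_{i,k_1} \mid k_1 \in KP\} \cup \{c^{th}_{i,k_2} \mid k_2 \in KH\}\right)$.
1. $C_1 = \lceil ALAP(v_{i_1}) + MaximalExecutionTime(v_{i_1}) - ASAP(v_{i_2}) \rceil$, because with this value the relaxed constraint $T^E_{i_1} \le T^S_{i_2} + C_1$ is always satisfied:
$$T^S_{i_2} + C_1 \ge T^S_{i_2} + ALAP(v_{i_1}) + MaximalExecutionTime(v_{i_1}) - ASAP(v_{i_2}) \ge ALAP(v_{i_1}) + MaximalExecutionTime(v_{i_1}) \ge T^E_{i_1} \quad \Box$$
The second step uses $T^S_{i_2} \ge ASAP(v_{i_2})$ and the last step uses $T^S_{i_1} \le ALAP(v_{i_1})$ (eq. 20).
2. $C_2 = \lceil ALAP(v_{i_2}) + MaximalExecutionTime(v_{i_2}) - ASAP(v_{i_1}) \rceil \ \ldots \ \Box$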
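The case table for equations 40 and 41 can be checked numerically with a short Python sketch. It evaluates both constraints for fixed start and end times and reports which values of $b_{i_1,i_2}$ remain feasible. The function names and all time values (including the ASAP, ALAP and execution times used to dimension $C_1$ and $C_2$) are hypothetical.

```python
import math


def big_m(alap_a, max_exec_a, asap_b):
    """Constant C as dimensioned above: ceil(ALAP(a) + MaxExec(a) - ASAP(b))."""
    return math.ceil(alap_a + max_exec_a - asap_b)


def eq40_holds(te1, ts2, b, y1, y2, c1):
    return te1 <= ts2 + (3 - b - y1 - y2) * c1


def eq41_holds(te2, ts1, b, y1, y2, c2):
    return te2 <= ts1 + (2 + b - y1 - y2) * c2


def feasible_orders(ts1, te1, ts2, te2, y1, y2, c1, c2):
    """Values of b for which both eq. 40 and eq. 41 are satisfied."""
    return [b for b in (0, 1)
            if eq40_holds(te1, ts2, b, y1, y2, c1)
            and eq41_holds(te2, ts1, b, y1, y2, c2)]


# Hypothetical ranges: ALAP(v_i1) = 6, ALAP(v_i2) = 10, both ASAP values 0.
C1 = big_m(6, 5, 0)    # 11
C2 = big_m(10, 4, 0)   # 14

# Same processor, v_i1 = [0, 5] before v_i2 = [5, 9]: only b = 1 is possible.
print(feasible_orders(0, 5, 5, 9, 1, 1, C1, C2))  # [1]
# Same processor, overlapping executions: no value of b is feasible.
print(feasible_orders(0, 5, 3, 7, 1, 1, C1, C2))  # []
# Different processors (Y_{i2,k} = 0): the overlap is allowed for any b.
print(feasible_orders(0, 5, 3, 7, 1, 0, C1, C2))  # [0, 1]
```

The three calls correspond to the rows of the table: with both nodes on the same processor exactly one ordering survives for a valid schedule, and the constraints become inactive as soon as one of the nodes is mapped elsewhere.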
6.6. Heuristic Scheduling
Resource constrained scheduling is an NP-complete problem [6]. Therefore, solving the scheduling problem optimally cannot be done efficiently. For this reason, we have developed an algorithm using integer programming that solves the partitioning problem while iterating the following steps:
1. Solve an IP-model for the hardware/software mapping with help of approximated time values.
2. Solve an IP-model for calculating a valid schedule with nodes mapped to hardware or software.
3. If the resulting total time violates the timing constraint, repeat the first two steps with a timing constraint that is tighter than the approximated total time of step 1 (see figure 13).
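The three steps above can be sketched as a simple loop in Python. The two IP models are represented by placeholder callables (`solve_mapping` and `solve_schedule`), and the parameters `step` and `max_iter` as well as the three candidate designs are assumptions made for illustration; the paper does not specify by how much the constraint is tightened.

```python
def heuristic_partitioning(solve_mapping, solve_schedule, max_time,
                           step=1, max_iter=20):
    """Iterate mapping and scheduling until the exact time meets max_time.

    solve_mapping(bound)   -> (mapping, approx_time), or None if infeasible
    solve_schedule(mapping) -> exact system execution time
    Both callables are placeholders for the two IP models.
    """
    bound = max_time
    for _ in range(max_iter):
        result = solve_mapping(bound)            # step 1: approximated times
        if result is None:
            return None                          # no mapping meets the bound
        mapping, approx_time = result
        exact_time = solve_schedule(mapping)     # step 2: valid schedule
        if exact_time <= max_time:               # original constraint is met
            return mapping, exact_time
        bound = approx_time - step               # step 3: tighten and repeat
    return None


# Hypothetical designs: (name, approximated time, exact time, hardware area).
DESIGNS = [("mostly-sw", 90, 120, 10), ("mixed", 70, 95, 40),
           ("mostly-hw", 30, 35, 90)]


def toy_mapping(bound):
    """Stand-in for the mapping IP: cheapest design whose estimate fits."""
    ok = [d for d in DESIGNS if d[1] <= bound]
    if not ok:
        return None
    name, approx, _, _ = min(ok, key=lambda d: d[3])
    return name, approx


def toy_schedule(name):
    """Stand-in for the scheduling IP: look up the exact time."""
    return next(d[2] for d in DESIGNS if d[0] == name)


print(heuristic_partitioning(toy_mapping, toy_schedule, max_time=100))
# ('mixed', 95)
```

In this toy run the first iteration selects the cheapest design, whose exact time of 120 violates the constraint of 100; the bound is tightened to 89 and the second iteration returns the mixed design.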
Example 12:
Figure 13. Heuristic scheduling (time axes for the 1st and 2nd iteration, each showing the approximated and the exact execution time relative to the original constraint and the new constraint)
The first partitioning results in an approximated execution time which fulfills the given timing constraint. However, the exact execution time violates this constraint. For this reason, a second partitioning with a new timing constraint is executed. This new constraint is tighter than the approximation of the first partitioning. The second partitioning results in a decreased approximated execution time. The exact execution time of the second partitioning fulfills the original timing constraint. Therefore, the second partitioning represents the solution.
The following constraints are used in addition to the equations 16-20 to approximate time values:
• Predecessor nodes:
A node $v$ is ready to start, if all its predecessors have finished their execution. The effect of being forced to schedule some of these predecessor nodes can be exploited to estimate the starting time of $v$.
Example 13:
Figure 14. Using predecessor nodes in heuristic scheduling (nodes v1 to v5 mapped to processor p1 and hardware component h1, connected by bus b1)
In figure 14 the starting time of $v_5$ is at least the sum of execution times of $v_1, v_2, v_3$, because all 3 nodes have been mapped to $p_1$ and have to be scheduled.
$\Rightarrow T^S_5 \ge c^{ts}_{1,1} + c^{ts}_{2,1} + c^{ts}_{3,1}$.
The starting time of a node $v_{i_1}$ is greater than or equal to the accumulated software execution times of all its predecessor nodes that are mapped to a processor $p_k$ (eq. 46). Similar constraints can be added if hardware sharing (eq. 47) and/or interfacing (eq. 48) are considered.
$$\forall i_1 \in I: \forall l \in L: \quad N = |\mathit{instances\_of}(c_l)|:$$
$$\text{Let } LV = \mathit{pred\_nodes}(v_{i_1}) \text{ and } LE = \mathit{pred\_edges}(v_{i_1}):$$
$$\forall k \in KP: \quad T^S_{i_1} \ge \sum_{v_{i_2} \in LV} Y_{i_2,k} \cdot c^{ts}_{i_2,k} \tag{46}$$
$$\forall j \in \{1, \ldots, N\}: \forall k \in KH: \quad T^S_{i_1} \ge \sum_{v_{i_2} \in LV,\ I(v_{i_2}) = c_l} x_{i_2,j,k} \cdot c^{th}_{l,k} \tag{47}$$
$$\forall k \in KB: \quad T^S_{i_1} \ge \sum_{e = (v_{i_2}, v_{i_3}) \in LE} z_{i_2,i_3,k} \cdot ci^t_{i_2,i_3,k} \tag{48}$$
• Dominator nodes:
Another possibility to estimate the starting time of a node $v$ is to look at dominator nodes. A dominator node $w$ of $v$ is a node such that each path to $v$ contains $w$ (see definition 8). $v$ is able to start if dominator $w$ and all nodes between $w$ and $v$ have been executed.
Example 14:
Figure 15. Using dominator nodes in heuristic scheduling (nodes v0 to v8 mapped to processor p1 and hardware component h1, connected by bus b1; the dominator of v7 is marked)
In figure 15 the starting time of $v_7$ is at least the sum of the ending time of $v_2$ and the execution times of $v_4, v_5$, because $v_4, v_5$ have to be scheduled after executing $v_2$.
$\Rightarrow T^S_7 \ge T^E_2 + c^{ts}_{4,1} + c^{ts}_{5,1}$.
The starting time of a node $v_{i_1}$ is greater than or equal to the sum of the ending time of the dominator node $v_{i_0}$ of $v_{i_1}$ and the software execution times on processor $p_k$ of all nodes on the paths between $v_{i_0}$ and $v_{i_1}$ (eq. 49). Equation 50 defines the same constraint for the hardware execution times of all shared nodes on the paths between $v_{i_0}$ and $v_{i_1}$. Equation 51 defines the equivalent constraint considering communication times of required interfaces.
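$$\forall i_0, i_1 \in I: v_{i_0} \in \mathit{dominator\_nodes}(v_{i_1}): \forall l \in L: \quad N = |\mathit{instances\_of}(c_l)|:$$
$$\text{Let } LV = \mathit{path\_nodes}(v_{i_0}, v_{i_1}) \text{ and } LE = \mathit{path\_edges}(v_{i_0}, v_{i_1}):$$
$$\forall k \in KP: \quad T^S_{i_1} \ge T^E_{i_0} + \sum_{v_{i_2} \in LV} Y_{i_2,k} \cdot c^{ts}_{i_2,k} \tag{49}$$
$$\forall j \in \{1, \ldots, N\}: \forall k \in KH: \quad T^S_{i_1} \ge T^E_{i_0} + \sum_{v_{i_2} \in LV,\ I(v_{i_2}) = c_l} x_{i_2,j,k} \cdot c^{th}_{l,k} \tag{50}$$
$$\forall k \in KB: \quad T^S_{i_1} \ge T^E_{i_0} + \sum_{e = (v_{i_2}, v_{i_3}) \in LE} z_{i_2,i_3,k} \cdot ci^t_{i_2,i_3,k} \tag{51}$$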
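The right-hand sides of equations 46 and 49 are plain sums over the nodes that share a processor, which the following Python sketch computes for a fixed mapping. The function names, the software execution times, the dominator ending time and the placement of $v_4$ on $h_1$ in the first example are hypothetical; only the structure (three predecessors on $p_1$ in figure 14, two path nodes on $p_1$ in figure 15) follows the examples above.

```python
def predecessor_bound(pred_nodes, mapping, sw_time, processor):
    """Right-hand side of eq. 46 for one processor p_k.

    Only predecessors mapped to the processor (Y = 1) contribute their
    software execution time.
    """
    return sum(sw_time[v] for v in pred_nodes if mapping[v] == processor)


def dominator_bound(dominator_end, path_nodes, mapping, sw_time, processor):
    """Right-hand side of eq. 49: ending time of the dominator plus the
    software times of all path nodes mapped to the processor."""
    return dominator_end + predecessor_bound(path_nodes, mapping,
                                             sw_time, processor)


# Figure 14 (times invented): v1, v2, v3 on p1; v4 assumed to be on h1.
mapping14 = {"v1": "p1", "v2": "p1", "v3": "p1", "v4": "h1"}
sw14 = {"v1": 4, "v2": 6, "v3": 5, "v4": 7}
print(predecessor_bound(["v1", "v2", "v3", "v4"], mapping14, sw14, "p1"))  # 15

# Figure 15 (times invented): dominator v2 ends at 10; v4, v5 on p1.
mapping15 = {"v4": "p1", "v5": "p1"}
sw15 = {"v4": 3, "v5": 2}
print(dominator_bound(10, ["v4", "v5"], mapping15, sw15, "p1"))  # 15
```

Both values are lower bounds on the starting time only; the exact starting time is still determined by the scheduling step.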
7. Results
To evaluate the quality of our partitioning approach we have done an application study in the area of audio algorithms. We have implemented systems with between 6 and 29 nodes (between 6 and 37 edges) which have to be partitioned. The partitioning results in this section have been calculated for the following systems:
• n-band-equalizers with $n \le 7$,
• a system called audiolab including a mixer, a fader, an echo, an equalizer and a balance ruler, and
• an MPEG audio encoder (layer II).
These systems were mapped to a target architecture (see figure 16) containing a SPARC processor, an ASIC manufactured in a 1 µ CMOS technology (COMPASS library) and external memory. A bus connects these components.
Figure 16. Target architecture (SPARC processor, external memory, ASIC (Compass, 1.0µ))
All calculated partitionings consider interface costs and hardware sharing effects between nodes. The IP-models were solved by using the IP-solver package OSL (see note 3) from IBM. The computation times of the examples represent CPU seconds on an RS6000. The heuristic partitioning approach can be evaluated by examining
• the quality and
• the computation time
compared to optimal solutions.
The quality of the heuristic approach can be derived from the deviation between the exact and the approximated solutions. Two equalizers, a 2-band- and a 3-band-equalizer, were partitioned with the optimal and the heuristic approach. For each system, solutions were calculated for a set of 8 timing constraints, resulting in a set of designs ranging from a complete software to a complete hardware solution.
The hardware area deviation is zero for both benchmarks. The total system execution time differs between both approaches (see figure 17). The optimal approach first calculates a mapping using a minimal amount of hardware area, and then minimizes the system execution time. The heuristic approach tries to minimize the hardware area and finds a valid schedule in a second step. For this reason, the resulting system execution times may differ. The deviation in our experiments is not greater than 6.4%. The average deviation is smaller than 1%.
Figure 17. Deviation System Execution Time (exact/heuristic approach) (x-axis: Design 1 to 8; y-axis: Deviation (Time) [%], 0 to 7; series: 2-band-eq., 3-band-eq.)
The main difference between both approaches is the computation time (see figure 18). The computation time for both benchmarks is below one second for all designs using the heuristic approach. The computation time for calculating the optimal solution is 1773 seconds in the worst case. It becomes clear that solving the hardware/software partitioning problem optimally is not applicable to systems with a larger number of instances.
Figure 18. Computation Time (exact/heuristic approach) (x-axis: Design 1 to 8; y-axis: Solution Time [s], 0 to 1800; series: 2-band-eq. and 3-band-eq., each heuristic and optimal)
In figure 19 an overview of partitioning all benchmarks using the heuristic approach is given.
Figure 19. Computation Time (heuristic approach) (x-axis: Design 1 to 8; y-axis: Solution Time [s], 0 to 18; series: 2-band-eq. to 7-band-eq., audiolab, mpeg encoder)
The largest computation time is 16.9 seconds for partitioning the mpeg system
(containing 29 nodes and 37 edges).
Clearly, the heuristic approach is more practical than the optimal approach, because the results are always nearly optimal and the computation times are significantly lower.
Finally, the trade-off between hardware area and system execution time is demonstrated for the audiolab system (containing 25 nodes and 31 edges). 8 different partitionings (see figure 20) have been calculated for 8 different timing constraints.
Figure 20. Area/Time Curve for the Audiolab System (x-axis: Time [ns] with the values 3060, 12470, 18600, 28510, 41420, 48080, 60410, 72900; y-axis: Area [$10^6 \lambda^2$], 0 to 500)
A pure software realization of the audiolab system would result in a system execution time of 72900 ns, but this solution is too slow. The fastest realization would have a system execution time of 3060 ns, but it would be a complete hardware realization using $457.9 \cdot 10^6 \lambda^2$ chip area. This solution is too expensive. The best solution is a hardware/software solution which fulfills the timing constraint of 22675 ns (44.1 kHz sample frequency) with a minimal amount of hardware. The calculated solution has a system execution time of 18600 ns and would require $78.4 \cdot 10^6 \lambda^2$ chip area.
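As a worked check of the timing constraint, one sample period at a sample frequency of 44.1 kHz is (the symbols $T_{max}$ and $f_s$ are introduced here only for this calculation):

$$T_{max} = \frac{1}{f_s} = \frac{1}{44100\ \mathrm{Hz}} \approx 22.6757\ \mu\mathrm{s} \approx 22675.7\ \mathrm{ns}$$

This is consistent with the constraint of 22675 ns stated above. Among the time values on the axis of figure 20, 18600 ns is the largest one that does not exceed this period, while the next value, 28510 ns, already violates it.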
8. Conclusion
This paper presents a new approach to fully automated hardware/software partitioning supporting multi-processor systems, interfacing and hardware sharing. An algorithm has been developed which is capable of solving the hardware/software partitioning problem using integer programming and which leads to (nearly) optimal results. In contrast to other approaches, where hardware and software costs are estimated, our approach follows the idea of 'using the tools' for cost estimation. The disadvantage of an increased calculation time is compensated by better metrics and therefore fewer iteration steps. The presented results are very promising, because nearly optimal results are calculated in a short time. Future work will deal with further refinement of the IP-model for target architecture selection and design studies of other system level examples.
Notes
1. COSYS: (Codesign System specification tool)
2. COOL: (Codesign Tool)
3. OSL : Optimal Subroutine Library
References
1. A. Bender. Design of an Optimal Loosely Coupled Heterogeneous Multiprocessor System. European Design & Test Conference (ED&TC), pages 275-281, 1996.
2. W. Ecker. Using VHDL for HW/SW Co-Specification. International Conference on Computer-Aided Design (ICCAD), pages 500-505, 1993.
3. R. Ernst, J. Henkel, and T. Benner. Hardware-software Cosynthesis for Microcontrollers. IEEE Design & Test, Vol. 12, pages 64-75, 1993.
4. P. Eles, Z. Peng, and A. Doboli. VHDL System-level Specification and Partitioning in a Hardware/Software Co-Synthesis Environment. Third International Workshop on Hardware/Software Codesign, Grenoble, pages 49-55, 1994.
5. R.K. Gupta, C. Coelho, and G. De Micheli. Synthesis and Simulation of Digital Systems Containing Interacting Hardware and Software Components. 29th ACM, IEEE Design Automation Conference, pages 225-230, 1992.
6. M.R. Garey and D.S. Johnson. Complexity Results for Multiprocessor Scheduling under Resource Constraints. SIAM J. Comput., pages 397-411, 1975.
7. D. Gajski, F. Vahid, S. Narayan, and J. Gong. Specification and Design of Embedded Systems. Prentice-Hall, 1994.
8. D. Henkel, J. Herrmann, and R. Ernst. An Approach to the Adaption of Estimated Cost Parameters in the COSYMA System. Third International Workshop on Hardware/Software Codesign, Grenoble, pages 100-107, 1994.
9. J. Henkel, R. Ernst, W. Ye, M. Trawny, and T. Benner. COSYMA: Ein System zur Hardware/Software Co-Synthese. GME Fachbericht Nr. 15 Mikroelektronik, pages 167-172, 1995.
10. A. Jantsch, P. Ellervee, J. Öberg, A. Hemani, and H. Tenhunen. Hardware/Software Partitioning and Minimizing Memory Interface Traffic. European Design Automation Conference (EURO-DAC), pages 226-231, 1994.
11. A. Kalavade and E.A. Lee. A Global Criticality/Local Phase Driven Algorithm for the Constrained Hardware/Software Partitioning Problem. Third International Workshop on Hardware/Software Codesign, Grenoble, pages 42-48, 1994.
12. A. Kalavade and E.A. Lee. The Extended Partitioning Problem: Hardware/Software Mapping and Implementation-Bin Selection. Proceedings of the 6th International Workshop on Rapid Systems Prototyping, 1995.
13. B. Landwehr, P. Marwedel, and R. Domer. OSCAR: Optimum Simultaneous Scheduling, Allocation and Resource Binding Based on Integer Programming. Proceedings of the EURO-DAC, pages 90-95, 1994.
14. R. Niemann and P. Marwedel. Hardware/Software Partitioning using Integer Programming. European Design & Test Conference (ED&TC), pages 473-479, 1996.
15. Z. Peng and K. Kuchcinski. An Algorithm for Partitioning of Application Specific Systems. Proceedings of the European Conference on Design Automation (EDAC), pages 316-321, 1993.
16. F. Vahid, J. Gong, and D. Gajski. A Binary-Constraint Search Algorithm for Minimizing Hardware during Hardware/Software Partitioning. European Design Automation Conference (EURO-DAC), pages 214-219, 1994.
