Abstract
Introduction
Embedded systems typically consist of application-specific hardware parts and programmable parts, i.e., processors such as DSPs, core processors, or ASIPs. Compared to the hardware parts, the software parts can be developed and modified much more easily; software is therefore less expensive in terms of cost and development time. Hardware, however, provides better performance. For this reason, a system designer aims for a system that fulfills all performance constraints while using as little hardware as possible. Hardware/software codesign deals with the problem of designing embedded systems, and automatic partitioning is one of its key issues. This paper describes a new, fully automatic approach to hardware/software partitioning for multiprocessor systems. The approach is based on integer programming (IP) to solve the partitioning problem optimally, and a formulation of the IP-model is introduced in detail. A common drawback of solving IP-models is a high computation time. To reduce it, a second approach has been developed that splits partitioning into two phases. In the first phase, a mapping of nodes to hardware or software is calculated by estimating the schedule times of each node with heuristics. In the second phase, a correct schedule is calculated for the HW/SW-mapping resulting from the first phase. It will be shown that this heuristic scheduling approach strongly reduces the computation time while the results are nearly optimal for the chosen objective function.
Another new feature of our approach is the cost estimation technique. Unlike in other approaches, the cost model is not calculated by estimators, because the quality of estimates is often poor and estimators do not take compiler effects into account. Instead of special estimators, our approach uses the actual tools: a compiler for the software parts and a high-level synthesis tool for the hardware parts. The disadvantage of an increased runtime for calculating the cost metrics is compensated by the better quality of the cost metrics compared to the results of estimators. Furthermore, better cost metrics lead to fewer partitioning iterations.
The outline of the paper is as follows: Section 2 gives an overview of related work in the field of hardware/software partitioning. Section 3 presents our own approach to partitioning. A formulation of the hardware/software partitioning problem follows in Section 4. Section 5 describes the problem by an IP-model. After experimental results of solving these IP-models are presented in Section 6, a conclusion is given in Section 7.
Related Work
Only a few approaches address hardware/software partitioning.
One of these is the COSYMA system [EHB93], where hardware/software partitioning is based on simulated annealing using estimated costs.
Hardware/Software Partitioning Approach
Our hardware/software partitioning approach is depicted in figure 1. The designer has to specify the target architecture by defining the set of processors for the software parts and the component library used to synthesize the hardware parts. The system has to be defined in VHDL as a set of interconnected instances of VHDL-entities. Moreover, the designer has to determine the design constraints, comprising performance constraints (timing) and resource constraints (area, memory). The VHDL specification is then compiled into an internal syntax graph model. For each entity of this model, software source code (C or DFL) and hardware source code (VHDL) are generated. The software parts are compiled, and the hardware parts are synthesized by a high-level synthesis tool (OSCAR [LMD94]). The results are software cost metrics (software execution time, memory usage) and hardware cost metrics (hardware execution time, area) for the entities. The disadvantage of an increased runtime for calculating the cost metrics is compensated by a better quality compared to the results of estimators. Moreover, better cost metrics lead to fewer partitioning iterations. After the compilation/synthesis phase, a partitioning graph is generated. Nodes represent the instances of VHDL-entities of the system, and edges represent the interconnections between them. The nodes are weighted with the hardware and software costs; the edges are weighted with the interface costs that occur if an interface is needed between the two nodes of the edge. The interface costs are approximated from the number and type of data flowing between both nodes. The user-defined design constraints are also attached to the graph, so the partitioning graph includes all information needed for partitioning. The partitioning graph is then transformed into an IP-model, which is the key issue of this paper. Afterwards, the model is solved by an IP-solver.
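To make the partitioning graph concrete, here is a minimal Python sketch of it as a data structure. All class, field, and component names (`Node`, `Edge`, `PartitioningGraph`, `"asic"`, `"dsp"`) are hypothetical and not taken from our tool. The helper `unshared_costs` only sums node weights for a given mapping and deliberately ignores sharing and interfaces, which the IP-model handles.

```python
from __future__ import annotations
from dataclasses import dataclass, field

ASIC = "asic"  # hypothetical name of the single hardware component ta_0


@dataclass
class Node:
    """An instance of a VHDL-entity, weighted with its cost metrics."""
    name: str
    entity: str
    hw_time: float               # execution time if synthesized on the ASIC
    hw_area: float               # area if synthesized on the ASIC
    sw_time: dict[str, float]    # processor -> software execution time
    sw_memory: dict[str, int]    # processor -> memory usage


@dataclass
class Edge:
    """An interconnection, weighted with the cost of an interface."""
    src: str
    dst: str
    if_time: float               # incurred only if src and dst are on different components
    if_area: float


@dataclass
class PartitioningGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def add(self, node: Node) -> None:
        self.nodes[node.name] = node

    def connect(self, src: str, dst: str, if_time: float, if_area: float) -> None:
        self.edges.append(Edge(src, dst, if_time, if_area))

    def unshared_costs(self, mapping: dict[str, str]) -> tuple[float, dict[str, float]]:
        """Return (ASIC area, software time per processor) for a node -> component
        mapping, ignoring sharing and interfaces (a deliberate simplification)."""
        hw_area = 0.0
        sw_time: dict[str, float] = {}
        for name, comp in mapping.items():
            node = self.nodes[name]
            if comp == ASIC:
                hw_area += node.hw_area
            else:
                sw_time[comp] = sw_time.get(comp, 0.0) + node.sw_time[comp]
        return hw_area, sw_time
```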
The calculated design is optimal for the generated cost model. Nevertheless, the design can still be improved: sharing between different instances of the same entity is considered, but sharing effects between different entities are not. This drawback can be removed by an iterative partitioning approach. We use a software-oriented approach, because compilation is faster than synthesis and software-oriented approaches seem to be superior to hardware-oriented ones (see [VGG94]). Sets of nodes that have been mapped to the same processor are clustered. For each cluster, a new cost metric is calculated by compiling all nodes of the cluster together. Then the partitioning graph is transformed by replacing each cluster with a new node carrying the new cost metric. Finally, the redefined graph is repartitioned. This iteration is repeated until no valid solution is found; the last valid partitioning represents the resulting design. The clustering technique is illustrated in figure 2.

We now introduce a formulation of the hardware/software partitioning problem. This formulation simplifies the description of the problem by an IP-model. It requires defining both the target architecture used to realize the system and the system itself, which has to be partitioned.
The definitions will be used in the following example.
Example 1:
In figure 3, a system is specified consisting of 2 entities $en_1$ (circle) and $en_2$ (box) and 7 instances $o_1, \dots, o_7$ of these entities. This system is to be partitioned for a target architecture containing one ASIC, one DSP, memory, and a bus connecting these components.
The IP-Model
Linear optimization problems can be solved optimally by using integer programming (IP). This paper 
$\dots, n_p\}$ the indices of elements $ta_k \in TA$ and $L = \{1, \dots, n_s\}$ the indices of elements $en_l \in E$. Let $c^x_{l,k}$ be the cost metric $c^x(en_l)$ for entity $en_l$ and $c^x_{j,k}$ the cost metric $c^x(v_j)$ for node $v_j$ on target architecture component $ta_k$.

Let $C^x_k$ be the system cost $c^x(s)$ of system $s$ on $ta_k$ and $MAX^x_k$ the corresponding maximum of $C^x_k$.
Let $T^S_j$ be the execution starting time of node $v_j$.
Let $T_j$ be the execution time of node $v_j$.
Let $T^E_j$ be the execution ending time of node $v_j$.
The Decision Variables
Our IP-model uses the following 0/1-variables:
Definition 5.2 Let the following 0/1-variables be defined as:
1 if $v_j$ is not shared on $ta_0$, 0 otherwise.
The Constraints
The following constraints have to be fulfilled:
1. General Constraints: Each node $v_j$ is executed on exactly one target architecture component $ta_k$.
2. Resource Constraints:
The used data memory (eq. 3) and program memory (eq. 4) on each processor $ta_k$ may not exceed a given maximum. The used hardware area on $ta_0$ (eq. 5) is the sum of the hardware area of unshared instances, of shared entities, and of all interfaces (eq. 14); it may not exceed a given maximum.
3. Timing Constraints:
The timing costs cannot be calculated by simply accumulating the execution times of the nodes, because nodes that are not shared on the ASIC can be executed in parallel. To determine the starting time and ending time of each node, scheduling has to be performed. The execution time q p (eq.
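Why accumulating execution times is insufficient can be seen with a small list-scheduling sketch in Python. It assumes a fixed mapping, serializes the nodes on each processor in topological order, lets nodes on the ASIC (assumed unshared) run in parallel, and ignores interface times. This is a simple heuristic for illustration only, not the IP scheduling formulation; the function and parameter names are hypothetical.

```python
from __future__ import annotations
from graphlib import TopologicalSorter


def list_schedule(exec_time: dict[str, float], mapping: dict[str, str],
                  preds: dict[str, list[str]], asic: str = "asic"):
    """Greedy list schedule for a fixed mapping (illustrative heuristic).

    Nodes on the same processor run one after another in topological order;
    nodes on the ASIC are assumed unshared and may run in parallel.
    Interface times are ignored. Returns (start, end) dictionaries,
    i.e. the starting and ending times T^S_j and T^E_j of each node.
    """
    busy_until: dict[str, float] = {}   # processor -> time at which it becomes free
    start: dict[str, float] = {}
    end: dict[str, float] = {}
    for v in TopologicalSorter(preds).static_order():
        # earliest time at which all data inputs of v are available
        ready = max((end[u] for u in preds.get(v, [])), default=0.0)
        comp = mapping[v]
        if comp != asic:
            ready = max(ready, busy_until.get(comp, 0.0))  # wait for the processor
        start[v] = ready
        end[v] = ready + exec_time[v]
        if comp != asic:
            busy_until[comp] = end[v]
    return start, end
```

For instance, with two ASIC nodes of execution time 2 and 3 feeding a DSP node of execution time 1, the sketch yields an ending time of 4 rather than the accumulated 6; if all three nodes are mapped to the DSP, the ending time becomes 6.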
Interfacing
An interface has to be realized for an edge $e = (v_{j_1}, v_{j_2})$ if $v_{j_1}$ and $v_{j_2}$ are realized on different target architecture components. This condition is formulated with the help of additional constraints on the interface 0/1-variable $i_{j_1,j_2}$ (see [NM95]). The following interface costs can then be calculated: the interface execution time $T^I_{j_1,j_2}$ (eq. 12), the interface hardware area $A^I_{j_1,j_2}$ (eq. 13), and the area of all interfaces $C^I_0$ (eq. 14).
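$$T^{I}_{j_1,j_2} = i_{j_1,j_2} \cdot c^{T}_{j_1,j_2} \tag{12}$$
$$A^{I}_{j_1,j_2} = i_{j_1,j_2} \cdot c^{A}_{j_1,j_2} \tag{13}$$
$$C^{I}_{0} = \sum_{e = (v_{j_1}, v_{j_2}) \in E} A^{I}_{j_1,j_2} \tag{14}$$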
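Equations (12) to (14) can be evaluated directly once a mapping is fixed, as the following Python sketch does. In the IP-model itself, $i_{j_1,j_2}$ is not computed but forced by linear constraints (see [NM95]). The edge tuple layout, the function name, and the component names are hypothetical.

```python
from __future__ import annotations


def interface_costs(edges: list[tuple[str, str, float, float]],
                    mapping: dict[str, str]):
    """edges: (j1, j2, c_T, c_A) holding the interface time/area cost of each edge.

    Returns ({(j1, j2): (i, T_I, A_I)}, C_I) for a fixed node -> component mapping.
    """
    per_edge = {}
    total_area = 0.0
    for j1, j2, c_t, c_a in edges:
        i = 1 if mapping[j1] != mapping[j2] else 0   # interface 0/1-variable i_{j1,j2}
        t_i = i * c_t                                # eq. (12)
        a_i = i * c_a                                # eq. (13)
        per_edge[(j1, j2)] = (i, t_i, a_i)
        total_area += a_i                            # eq. (14): sum over all edges
    return per_edge, total_area
```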
Sharing
An entity $en_l$ is shared on the hardware component $ta_0$ (eq. 15) if at least two nodes $v_{j_1}, v_{j_2}$ that are instances of entity $en_l$ are executed shared on $ta_0$. An entity $en_l$ is shared on processor $ta_k$ (eq. 16) if at least one instance of entity $en_l$ is executed on $ta_k$.
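The two sharing conditions (eqs. 15 and 16) can be illustrated for a fixed mapping with the Python sketch below. Which nodes are executed shared on the ASIC is passed in explicitly, because in the IP-model this is itself a decision; all names are hypothetical.

```python
from __future__ import annotations
from collections import Counter


def shared_entities(entity_of: dict[str, str], mapping: dict[str, str],
                    shared_on_asic: set[str], asic: str = "asic"):
    """Return (entities shared on the ASIC, {processor: entities shared on it})."""
    # eq. (15): at least two instances of en_l are executed shared on the ASIC
    counts = Counter(entity_of[v] for v in shared_on_asic if mapping[v] == asic)
    hw_shared = {en for en, n in counts.items() if n >= 2}
    # eq. (16): at least one instance of en_l is executed on processor ta_k
    sw_shared: dict[str, set[str]] = {}
    for v, comp in mapping.items():
        if comp != asic:
            sw_shared.setdefault(comp, set()).add(entity_of[v])
    return hw_shared, sw_shared
```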
Scheduling
If two nodes $v_{j_1}, v_{j_2}$ are to be sequentialized, the scheduling variables $b_{j_1,j_2,k}$ and $b_{j_2,j_1,k}$ must be different; otherwise, both must be 1. The additional constraints for $b_{j_1,j_2,k}$ and $b_{j_2,j_1,k}$ are defined in [NM95]. With $b_{j_1,j_2,k}$ and $b_{j_2,j_1,k}$, nodes can be sequentialized ...
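The meaning of the pair $b_{j_1,j_2,k}$, $b_{j_2,j_1,k}$ can be illustrated by a small Python checker. Beyond what is stated here, it assumes that $b_{j_1,j_2,k} = 1, b_{j_2,j_1,k} = 0$ means that $v_{j_1}$ finishes before $v_{j_2}$ starts; the exact constraints are those in [NM95], and the function names are hypothetical.

```python
def order_from_b(b12: int, b21: int):
    """Interpret b_{j1,j2,k} (b12) and b_{j2,j1,k} (b21).

    Returns None if the nodes are not sequentialized (both 1), otherwise the
    pair (first, second). ASSUMPTION: b12 = 1, b21 = 0 means v_j1 runs first.
    """
    if b12 == 1 and b21 == 1:
        return None
    if b12 == 1 and b21 == 0:
        return ("j1", "j2")
    if b12 == 0 and b21 == 1:
        return ("j2", "j1")
    raise ValueError("b12 = b21 = 0 is not a valid assignment")


def respects_order(b12: int, b21: int, start: dict, end: dict) -> bool:
    """Check that start/end times keyed by "j1" and "j2" obey the b-variables."""
    order = order_from_b(b12, b21)
    if order is None:
        return True  # no ordering is imposed by the b-variables
    first, second = order
    return end[first] <= start[second]
```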
Heuristic Scheduling
Optimal scheduling of the nodes is a complex problem. To reduce the computation time, partitioning is performed by iterating the following steps:
1. Solve an IP-model for the hardware/software mapping with the help of approximated time values.
2. Solve an IP-model for calculating an exact schedule with the nodes mapped to hardware or software.
3. If the resulting total time violates the timing constraint, repeat the first two steps with a timing constraint that is tighter than the approximated total time of step 1 (see figure 5). The correct constraints can be found in [NM95].
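The three-step loop can be sketched in Python as follows. The two IP solves are passed in as hypothetical callables (`solve_mapping`, `solve_schedule`), and the tightening rule (subtracting a small `delta` from the approximated total time of step 1) is an assumption for illustration; the constraints actually used are those in [NM95].

```python
from __future__ import annotations
from typing import Callable, Optional

NodeMapping = dict[str, str]   # node -> target architecture component


def heuristic_partition(
    solve_mapping: Callable[[float], Optional[tuple[NodeMapping, float]]],
    solve_schedule: Callable[[NodeMapping], float],
    time_limit: float,
    delta: float = 1e-6,
    max_iter: int = 50,
) -> Optional[tuple[NodeMapping, float]]:
    """Iterate step 1 (mapping with approximated times) and step 2 (exact schedule)."""
    constraint = time_limit
    for _ in range(max_iter):
        result = solve_mapping(constraint)       # step 1: IP-model with approximated times
        if result is None:
            return None                          # no mapping meets the tightened constraint
        mapping, approx_time = result
        exact_time = solve_schedule(mapping)     # step 2: exact schedule for this mapping
        if exact_time <= time_limit:
            return mapping, exact_time           # timing constraint is met
        constraint = approx_time - delta         # step 3: tighten below the approximated time
    return None
```

With stub solvers, a first mapping whose approximated time (9) fits a limit of 10 but whose exact time (11) does not is rejected; the tightened constraint then admits only a second mapping whose exact time (9.5) satisfies the limit.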
Results
The key parameter for partitioning is the number $n$ of nodes that have to be partitioned. For this reason, we have developed several examples containing many instances of small VHDL-entities. The target architecture for all examples consists of a processor, an ASIC, memory, and a bus connecting all components. All calculated partitionings take interface costs and sharing effects between nodes into account. The computation times of the examples are given in CPU seconds on a Sun SPARCstation 20. The heuristic partitioning approach can be evaluated by examining the quality and the computation time compared to the optimal result.
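The deviation figures reported in this section can be computed as relative deviations from the exact values. The following Python sketch assumes that the deviation is measured relative to the exact execution time and expressed in percent; the function name is hypothetical.

```python
from __future__ import annotations


def deviation_stats(exact: list[float], approx: list[float]) -> tuple[float, float]:
    """Return (maximum, average) relative deviation in percent of approximated
    execution times from exact ones (assumed: measured relative to the exact value)."""
    if len(exact) != len(approx) or not exact:
        raise ValueError("need two non-empty lists of equal length")
    devs = [abs(a - e) * 100.0 / e for e, a in zip(exact, approx)]
    return max(devs), sum(devs) / len(devs)
```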
The quality of the heuristic approach can be evaluated by determining the deviation between the exact and the approximated solution. If the heuristic partitioning approach does not consider interfacing, the results are always exact and therefore optimal. If interfacing is considered, however, the approximated system execution time may differ from the exact value. Therefore, we have partitioned 6 different systems (see figures 6, 8) with both the optimal and the heuristic approach. For each system, solutions have been calculated for a set of constraints. Figure 6 shows that the approximated execution time is equal or very close to the exact value. The maximal deviation between the exact and the approximated execution time is 5.13%; the average deviation is smaller than 1% for all examined systems.

Figure 6: System execution time (exact/heuristic approach)

In contrast to the partitioning quality, the computation times differ greatly. Figure 7 depicts the computation times of both approaches for a system consisting of 7 nodes. This system has been partitioned for 8 different system execution time constraints. The maximal computation time is 246 seconds for the optimal partitioning approach and 2 seconds for the heuristic one, i.e., the computation time is drastically reduced. In figure 8, the computation time of the heuristic approach is depicted for 6 different systems. For all of these systems several different
Conclusion
This paper presents a new approach to fully automated hardware/software partitioning that supports multiprocessor systems, interfacing, and hardware sharing. The partitioning approach itself is based on integer programming, leading to optimal results. In contrast to other approaches, where hardware and software costs are estimated, our approach follows the idea of 'using the tools' for cost estimation. The disadvantage of an increased calculation time is compensated by better metrics and therefore fewer iteration steps. The presented results are very promising, because nearly optimal results are calculated in a short time. Future work will deal with design studies of real system-level examples.
