Abstract-Hardware/software partitioning is a key issue in the design of embedded systems when performance constraints have to be met and chip area and/or power dissipation are critical. For that reason, diverse approaches to automatic hardware/software partitioning have been proposed since the early 1990s. In all approaches so far, the granularity during partitioning is fixed, i.e., either small system parts (e.g., base blocks) or large system parts (e.g., whole functions/processes) can be swapped at once during partitioning in order to find the best hardware/software tradeoff. Since the deployment of a fixed granularity is likely to result in suboptimal solutions, we present the first approach that features a flexible granularity during hardware/software partitioning. Our approach is comprehensive insofar as the estimation techniques that control partitioning (of which our multigranularity performance estimation technique is described here in detail) are adapted to the flexible partitioning granularity. In addition, our multilevel objective function is described. It allows us to trade off various design constraints/goals (performance/hardware area) against each other. As a result, our approach is applicable to a wider range of applications than approaches with a fixed granularity. We also show that our approach is fast and that the obtained hardware/software partitions are much more efficient (in terms of hardware effort, for example) than in cases where a fixed granularity is deployed.
Which decision is finally taken depends on a variety of design constraints/goals like performance, hardware effort, power dissipation, etc. Therefore, the process of hardware/software partitioning is a complex optimization problem. In addition, hardware/software partitioning is crucial for the distinctiveness of a product and therefore for its success on the market.
This trend has been recognized since the early 1990s, when the first approaches for automatic hardware/software partitioning were proposed. Also, first industrial projects have been reported where hardware/software partitioning techniques led to a tremendously improved system design [4]. However, these approaches are limited in the sense that they provide a fixed granularity for partitioning only. The granularity determines the size of system parts that can be implemented either as a software program or as a synthesized piece of hardware. The proposed granularities range from base-block level to function/process level. Each of these granularities works well with particular applications only. Furthermore, the relevance of adequate estimation methodologies that control hardware/software partitioning has often been neglected. Adapting granularity and estimation methodologies to each other would help to detect more efficient partitions.
In this paper, which is based on the experience and limitations of previous approaches, we present the first approach for hardware/software partitioning with a flexible granularity, i.e., depending on the peculiarities of a specific application and imposed design constraints (performance, hardware effort, etc.), the actually selected granularity can span the wide range from base blocks to whole functions/processes. Additionally, we provide estimation methodologies that are adapted to the different levels of granularity (or the other way around) and help to determine the granularity that is finally used. As a third benefit, our approach features a multidimensional objective function that trades off diverse design constraints/goals in a sophisticated manner.
The remainder of this paper is organized as follows. Section II gives an overview of related work. In Section III we discuss problems in partitioning and estimation in order to sensitize the reader to the basic ideas of our approach. Section IV gives an overview of our approach. The target architecture that has been deployed in our experiments, as well as communication estimation, is described in Section V, and Section VI describes our partitioning model in detail. Afterwards, in Section VII, our hardware performance estimation approach as well as its adaptation to partitioning are discussed. The partitioning procedure itself as well as the deployed multilevel objective function are explained in Section VIII. Results are presented in Section IX, and finally Section X gives a conclusion.
II. RELATED WORK
Automated approaches to hardware/software partitioning have started to emerge since the early 1990s. Since then, numerous approaches have been proposed, varying the deployed optimization algorithm in order to cope with the large solution space, tailored to specific application areas (e.g., data dominated, control dominated), using a fine-grained or coarse-grained partitioning, pursuing a software-oriented or hardware-oriented approach, etc. A software-oriented approach means that the initial implementation of the whole application is supposed to be a software solution. During partitioning, system parts are migrated to hardware until constraints are met (like timing constraints, for example). The other way around, a hardware-oriented approach starts with a complete hardware solution and swaps parts to software until constraints are violated (assuming that an application specific hardware is more efficient in terms of the formulated constraints). The granularity of a partitioning approach determines the size of a part of the whole application that might be considered for implementation either as hardware or as software during partitioning. In this sense, a fine-grained granularity indicates system parts as small as an instruction whereas a coarse-grained granularity indicates whole functions or processes. 2 Granularities between these two extremities include base blocks 3 and control blocks, i.e., one or more base blocks that together form a loop, several nested loops, if-then-else constructs or a combination of those. Possible granularities are depicted in Fig. 1. In the following, we will introduce some key papers in more detail and we will especially pay attention to the specific granularity since we are focusing in this paper on the question of granularity during partitioning. This is because through our studies we found out that the granularity during partitioning has the largest impact on the quality of the final implementation.
2 Please note that there is no uniform definition of the terms fine-grained and coarse-grained partitioning approach. Rather, these definitions are used within this paper and they are based on widely recognized approaches to hardware/software partitioning.
3 A base block is defined in the same way as in [2].
Beginning with a fine-grained, i.e., instruction-level granularity, Athanas/Silverman [5] presented an approach where a short sequence or even a single computation-intensive instruction is implemented in the form of an accelerator hardware. Increasing performance is achieved in a similar manner as it is in a computing system where a general purpose processor delegates floating point instructions to an FPU. However, this approach is not limited to a specific class of instructions. Barros/Rosenstiel deploy the language UNITY to cluster instructions with respect to parallelism, mutual exclusion and data dependencies. Because of the relatively high communication overhead that is imposed by instruction-level partitioning, many approaches focus on base-block-level partitioning: Jantsch et al. [7] use profiling to pre-select possible hardware candidates of base blocks. Dynamic Programming is finally used to determine the best candidates out of those. Knudsen et al. [8] conduct partitioning using one or more base blocks, i.e., control constructs. The approach of Parameswaran et al. [9] is solely profiling driven and restricted to base blocks. Our first approach [10] partitioned at base-block level too and used Simulated Annealing to cope with the complexity of the design space.
Function/process level approaches use the granularity that is determined by the programmer of an application. So, selecting candidates for partitioning is straightforward and, due to the nature of functions, communication overhead is in general lower than it would be when using (randomly selected) base blocks. A hardware-oriented approach to partitioning is provided by Gupta et al. [24]. A greedy algorithm is used for partitioning and the software part features the possibility of multiple threads. Niemann et al. [25] present another hardware-oriented approach deploying VHDL entities as a partitioning granularity. A software-oriented approach for functional-level partitioning is proposed by Vahid et al. [13], [14]. Their work contains a technique called "procedure exlining." It is able to reveal similar code sequences within the behavioral specification such that those sequences can be implemented by one procedure that is called at the former location of those sequences. Another function/process level approach is provided by Wolf et al. [23].
They aim at heterogeneous distributed systems. A "sensitivity" determines the degree of performance improvement every time a process is allocated to a hardware processing element (PE) or a software PE. Further function/process level approaches are: Edwards et al. [11] with their profile-driven approach, Peng et al. [12] using Petri Nets and clustering techniques, Adams et al. [15] using code motions between tasks, Chou et al. [16] focusing on hardware/software interface synthesis, D'Ambrosio et al. [17] using a complex metric to evaluate the feasibility of a possible partition and finally applying a branch-and-bound technique, Carreras et al. [18] using an approach based on the LOTOS language, Ismail et al. [19] with an interactive partitioning tool named Partif, Kalavade et al. [20] using a two-phase objective function (the "GCLD" approach), Sciuto et al. [21] deploying a greedy algorithm that partitions Occam II applications, and Teich et al. [22] with an evolutionary approach. The codesign approaches [29]-[31] do not focus on automated partitioning but concentrate on cosimulation and/or rapid prototyping issues.
The approaches presented so far are mainly performance-driven with, in parts, algorithms for minimizing the additional hardware effort. To complete this overview, we want to mention that cosynthesis approaches under power constraints have recently been proposed [26]-[28], though their focus is not primarily partitioning.
All presented automated hardware/software partitioning approaches have in common that they are based on a fixed granularity, i.e., the granularity is either fixed or it is determined by the user in advance (i.e., before the automated partitioning process starts).
We want to emphasize that our approach presented here is the first approach that provides a flexible granularity that can change automatically during the partitioning process in order to achieve more efficient results.
III. PROBLEMS IN PARTITIONING AND ESTIMATION
In this section we describe four scenarios dealing with the deployed partitioning granularity (Scenarios 1 and 2), estimation issues (Scenario 3), and multiple design constraints (Scenario 4). In a subsequent conclusion we extract the characteristics that a hardware/software partitioning approach should have. Those characteristics represent the basic ideas of our approach that is presented in the following sections.
Scenario 1-Deploying a Coarse-Grained Granularity: Given is the specification of a system comprising the two functions/processes $f_1$ and $f_2$. Furthermore, it is assumed that the target architecture consists of one off-the-shelf processor (running a software program; hence it is denoted SW for software) and an application specific hardware (HW). Obviously, there are the following four mappings for partitioning (the arrow "$\rightarrow$" has the meaning "is mapped on"):
1) $f_1 \rightarrow$ SW, $f_2 \rightarrow$ SW
2) $f_1 \rightarrow$ SW, $f_2 \rightarrow$ HW
3) $f_1 \rightarrow$ HW, $f_2 \rightarrow$ SW
4) $f_1 \rightarrow$ HW, $f_2 \rightarrow$ HW.
Since cases 1) and 4) are the trivial cases, hardware/software systems are usually represented by cases 2) and 3). Let us assume that the given real-time constraints can only be met by implementing one computation-intensive loop as hardware (since it might be faster than a software implementation); assume further that this loop is part of one of the two functions/processes and represents only a small fraction of it. Through a designer's approach of functional partitioning, one whole function/process would be implemented as an application specific hardware, although only a small fraction of it would suffice to meet the constraints. As a result, the system costs are much larger than they could be with a partitioning finer than function/process level.
Scenario 2-Deploying a Fine-Grained Granularity: The same target architecture and the same design constraint as in the previous scenario apply. But now a fine-grained base-block granularity is deployed. The small loop could now be implemented in hardware cheaply (without unnecessary hardware overhead) by selecting the corresponding base blocks.
The drawback is that the design space for hardware/software partitioning is exponentially large ($2^n$ different hardware/software partitions with $n$ the number of base blocks, for example). Therefore, a partitioning algorithm might not find the best solution if $n$ is large.
Scenario 3-Estimation Issues: Estimating performance, hardware effort, etc., is necessary in order to judge the quality of a possible hardware/software partition. Let us assume the run-time of a base block should be estimated for the case that it is implemented as a piece of software running on a pipelined processor (we can conduct estimation using an instruction set simulator, ISS for short). Assume furthermore that the number of pipeline stages is equal to or even larger than the number of cycles in which the base block could possibly be executed. If we now estimate the execution time of the base block solely using the ISS, we are likely to overestimate the execution time since some cycles are necessary to fill the pipeline. This estimation overhead obviously becomes more serious the smaller the piece of code within the base block is.
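The effect described in Scenario 3 can be made concrete with a deliberately simple model. The sketch below assumes an idealized pipeline with `stages` stages that completes one instruction per cycle once it is full, so that simulating a block of `n` instructions in isolation reports `n + stages - 1` cycles, whereas the same block embedded in a running program costs about `n` cycles. The function names and the model itself are illustrative assumptions and not the estimation technique used in this paper.

```python
def isolated_cycles(n_instructions: int, stages: int) -> int:
    """Cycles reported when the block is simulated alone (pipeline starts empty)."""
    return n_instructions + stages - 1


def relative_overestimate(n_instructions: int, stages: int) -> float:
    """Pipeline-fill overhead relative to the steady-state cost of the block."""
    fill_cycles = isolated_cycles(n_instructions, stages) - n_instructions
    return fill_cycles / n_instructions


# The shorter the block, the larger the relative error of the isolated estimate.
for n in (2, 5, 50, 500):
    print(n, isolated_cycles(n, 5), f"{relative_overestimate(n, 5):.1%}")
```

Under this assumed model the absolute error is constant (`stages - 1` cycles), which is why it dominates for very small base blocks and becomes negligible for larger partitioning objects.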
Scenario 4-Multiple Design Constraints: Imposed system constraints are manifold, like performance constraints, chip area limitations, maximum power dissipation, etc. These constraints may be mutually dependent or even contradictory: a high-performance design may result in high power dissipation, but the design constraint may be to achieve high performance at the lowest possible power dissipation, for example. An additional problem is that those constraints have different physical units (time and energy in the example).
Our conclusions are as follows: Scenario 4 implies the need for a multidimensional objective function with a mechanism for a flexible tradeoff between these dimensions. Scenario 3 suggests an adaptation of the estimation techniques (that control partitioning) to the granularity at which the partitioning is carried out. Finally, Scenarios 1 and 2 suggest a model for a flexible (i.e., nonfixed) granularity.
For all depicted problems we propose solutions that are introduced in detail within the remaining sections of this paper.
IV. OUR APPROACH AT A GLANCE
The purpose of this section is to give an overview of the main steps and basic ideas of our approach. While the following sections provide a detailed insight into each of these steps, here we give a hint where those steps will be explained in detail. One of the main features of our approach is a dynamically determined granularity, i.e., a granularity that can actually vary somewhere between the extremities of a single base block on the one side and a whole function/process on the other side. The granularity that is actually chosen, however, depends on the peculiarities of the application and the constraints imposed (like performance, hardware effort, etc.). We can roughly divide our approach into the three steps that are depicted in Fig. 2.
In Step I, a behavioral description (written in C) is parsed and translated into a flow graph where each node represents a base block and each directed edge denotes the direction of the control flow. In addition, during this step a structural classification of each part of the graph is performed in order to ease the following steps. This step is explained in more detail in Section VI-A.
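As an illustration of the data structure that Step I produces, the following minimal sketch stores a flow graph whose nodes are base blocks and whose directed edges are control-flow transitions. The class `BaseBlock`, the helper `build_flow_graph`, and the example function with its block names are hypothetical; the sketch does not parse C and only shows one possible in-memory representation.

```python
from dataclasses import dataclass, field


@dataclass
class BaseBlock:
    """One node of the flow graph: a base block and its control-flow successors."""
    name: str
    successors: list = field(default_factory=list)  # names of blocks control may flow to


def build_flow_graph(edges):
    """Build {name: BaseBlock} from (source, target) control-flow edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, BaseBlock(src)).successors.append(dst)
        graph.setdefault(dst, BaseBlock(dst))
    return graph


# Hypothetical function: entry block, a loop (head and body), and an exit block.
example = build_flow_graph([
    ("entry", "loop_head"),
    ("loop_head", "loop_body"),
    ("loop_body", "loop_head"),  # back edge closing the loop
    ("loop_head", "exit"),
])
```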
Step II is dedicated to deriving the so-called partitioning objects. A partitioning object is a piece of the whole graph consisting of at least one base block and at most all the base blocks within the treated function/process. The algorithm that derives all possible partitioning objects is described in Section VI-B. As we will see, solely structural information is used to derive the partitioning objects. The whole intelligence of our approach, i.e., the ability to take a small or large partitioning object in order to actually perform partitioning, is the task of Step III.
The idea behind defining partitioning objects is the complexity implied by performing partitioning. Assume we have a number $n$ of parts of an application and each of these parts may be implemented in hardware or in software. Then, the total number of different partitions amounts to $2^n$. Though we cannot reduce the complexity, we can reduce the computation time. An example: let us assume that, due to design constraints, 90% of all system parts have to be implemented in hardware. Assume furthermore that we start with an all-software solution and that we can migrate one of the $n$ parts at a time. This implies a large number of different hardware/software partitions, each of which has to be evaluated, i.e., tested as to whether it fulfills the constraints and what its costs are. Depending on the size of the application, this might not be feasible because it is too computation intensive. For this reason we have partitioning objects of different sizes that might even overlap, i.e., they might cover, in parts, the same part of the graph. An example is shown in the sketch of Fig. 2, Step II, where several partitioning objects (POs) overlap with one another. Now, a sophisticated optimization algorithm for partitioning could select a large PO instead of many small POs, thereby minimizing the number of different partitions to evaluate. This large advantage comes at the price of an increased number of POs: compared with nonoverlapping POs such as single base blocks, the number of POs grew by a factor of about four throughout all applications we conducted our experiments with. In summary, Step II eases the computation effort tremendously through building POs.
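The benefit of overlapping partitioning objects of different sizes can be illustrated by counting moves. The sketch below is a simplification under stated assumptions: POs are modeled as sets of base-block indices, the goal is to migrate a given target set of blocks to hardware, and a greedy rule (always take the largest PO that fits) stands in for the actual optimization procedure, which is driven by the objective function. The function `moves_needed` and the example POs are hypothetical.

```python
def moves_needed(target, partitioning_objects):
    """Greedy count of moves: repeatedly migrate the largest PO that lies
    completely inside the part of the target set not yet migrated."""
    remaining = set(target)
    moves = 0
    while remaining:
        candidates = [po for po in partitioning_objects if po and po <= remaining]
        if not candidates:
            raise ValueError("target cannot be covered by the given POs")
        remaining -= max(candidates, key=len)
        moves += 1
    return moves


# Ten base blocks; 90% of them (blocks 1..9) are to be moved to hardware.
single_blocks = [frozenset({i}) for i in range(10)]
overlapping = single_blocks + [
    frozenset({2, 3, 4}),        # e.g. an inner loop
    frozenset(range(1, 6)),      # e.g. the enclosing control construct
    frozenset(range(10)),        # the whole function
]
target = set(range(1, 10))

print(moves_needed(target, single_blocks))  # base-block granularity only
print(moves_needed(target, overlapping))    # with larger, overlapping POs
```

With larger POs available, fewer moves (and therefore fewer intermediate partitions to evaluate) are needed to reach the same partition, at the price of a larger set of POs to choose from.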
Finally, Step III puts it all together. The so-called macro instructions are built. Macro instructions are those application parts that are actually implemented in hardware.
Definition: A macro instruction consists of one or more adjacent partitioning objects, i.e., a macro instruction presents a contiguous piece of the application and it is solely executed on an application specific hardware.
The way a macro instruction is composed of one or more POs depends on the optimization procedure during partitioning and on the constitution of the objective function. Assume, for example, we have a hard time constraint and at the same time we want to minimize the hardware effort. In case a single PO cannot meet the time constraints at reasonable hardware costs, a combination of two or even more POs might. So-called "synergetic effects" play a key role here: for example, two POs might increase the performance by a factor of two (as opposed to one PO) but at much less than twice the hardware costs of one PO, since they can share hardware resources. Similar examples can be given for lowering the total communication overhead between hardware and software, etc. Please also note that a single PO cannot cover all possible hardware/software partitions. This is another reason why a single PO does not necessarily equal a macro instruction. So actually, in Step III an implicit clustering is performed. It depends on the peculiarities of the application itself, the imposed design constraints, and the constitution of the objective function. As a result, the granularity in our partitioning approach is widely flexible, i.e., a macro instruction can be a single PO (which can be a single base block) or a contiguous set of POs that can represent a whole function/process. Section VIII introduces our multilevel objective function and the optimization procedure.
In order to evaluate a partition according to the imposed constraints, the impact of implementing a PO or a set of POs as hardware or as software has to be estimated. We have a wide range of high-level estimation algorithms (hardware performance, software performance, hardware/software communication effort, hardware effort) and present one as an example of how estimation and the flexible granularity are adapted to each other. In Section VII-A our path-based estimation technique for hardware performance is introduced.
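The contiguity requirement in the definition of a macro instruction can be checked mechanically. The following sketch assumes that "adjacent" means that the base blocks covered by the chosen POs form a connected piece of the flow graph when edge directions are ignored; this interpretation, the function `is_contiguous`, and the example graph are assumptions made for illustration only.

```python
def is_contiguous(partitioning_objects, edges):
    """True if the union of the POs is connected in the (undirected) flow graph."""
    nodes = set().union(*partitioning_objects)
    if not nodes:
        return False
    # Keep only edges whose endpoints both lie inside the candidate macro instruction.
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Depth-first search from an arbitrary covered block.
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbours[node] - seen)
    return seen == nodes


# Hypothetical straight-line flow graph b0 -> b1 -> b2 -> b3 -> b4.
chain = [("b0", "b1"), ("b1", "b2"), ("b2", "b3"), ("b3", "b4")]
print(is_contiguous([{"b1"}, {"b2", "b3"}], chain))  # adjacent POs
print(is_contiguous([{"b0"}, {"b3"}], chain))        # POs with a gap between them
```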
V. MODEL OF TARGET ARCHITECTURE AND OVERVIEW OF COMMUNICATION MODEL
In this section we introduce the model of the target architecture on which the results of our experiments are based, as well as the communication model. We want to emphasize that the algorithms for partitioning and estimation are independent of our somewhat limited target architecture unless otherwise mentioned.
Model of Target Architecture: The state diagrams 6 of the coprocessor "hardware" and standard processor "software" are given in Fig. 3 on the left side and the right side, respectively.
6 Each node may represent one or more states.
Beginning with the coprocessor, the first state is a test of a status word. It tells the coprocessor whether to start executing or to wait. Please note that this is due to a limited model where either the coprocessor or the standard processor is executing (i.e., mutually exclusive execution). Once the coprocessor is supposed to execute, it is provided with an ID of the macro instruction it should execute (technically, this is a function call on the software side with one of the parameters denoting the ID). Each of the grey-shaded branches represents the states of one macro instruction (please note that at this level of abstraction the borders of POs are invisible). At the end of each macro instruction, and actually as part of it, data that might be used by the software side afterwards is saved to the shared memory. After execution of the dedicated macro instruction, access to the memory is handed back so that the memory can subsequently be accessed by the standard processor. Finally, the status word is changed, signaling that the standard processor can continue execution.
The standard processor (right state diagram of Fig. 3) continues execution of the program (test of status word at the bottom of the diagram) as long as no hardware call ("HW Call") occurs. In case a hardware call occurs, necessary data and parameters are saved into the shared memory, the status word is set accordingly, and memory access is passed. The standard processor then waits until the coprocessor has completed execution. Please note that those states marked with an " " are implemented by a small hardware that is neither part of the coprocessor nor part of the standard processor. Furthermore, all grey-shaded states are subject to hardware or software synthesis in the case of the coprocessor and the standard processor, respectively. All other states represent the generic parts, i.e., those parts that are independent of an application.
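The mutually exclusive handshake between the two state diagrams can be sketched as a small simulation. All names below (`Status`, `SharedMemory`, `hardware_call`, `coprocessor_step`, and the example macro instruction) are hypothetical, and the busy waiting of the real processors is replaced by a direct function call; the sketch only mirrors the order of events described above: save data, set the status word, execute the macro instruction selected by its ID, save results, and reset the status word.

```python
from enum import Enum


class Status(Enum):
    SOFTWARE_RUNS = 0   # standard processor executes, coprocessor waits
    HARDWARE_RUNS = 1   # coprocessor executes, standard processor waits


class SharedMemory:
    def __init__(self):
        self.status = Status.SOFTWARE_RUNS
        self.macro_id = None
        self.data = {}


def coprocessor_step(mem, macro_instructions):
    """Hardware side: test the status word, run one macro instruction, hand back."""
    if mem.status is not Status.HARDWARE_RUNS:
        return                                   # keep waiting
    results = macro_instructions[mem.macro_id](mem.data)
    mem.data.update(results)                     # save data needed by software
    mem.status = Status.SOFTWARE_RUNS            # software may continue


def hardware_call(mem, macro_id, arguments, macro_instructions):
    """Software side: save data, set the status word, wait for the coprocessor."""
    mem.data.update(arguments)
    mem.macro_id = macro_id
    mem.status = Status.HARDWARE_RUNS            # memory access is passed to hardware
    coprocessor_step(mem, macro_instructions)    # stands in for the waiting period
    return mem.data


# Hypothetical macro instruction with ID 0 that adds two values.
macros = {0: lambda data: {"sum": data["a"] + data["b"]}}
print(hardware_call(SharedMemory(), 0, {"a": 2, "b": 3}, macros))
```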
Overview of Communication Model: Based on this target architecture, we are giving a rough overview on how communication estimation between hardware and software is taken care of. For that reason, Fig. 4 shows the model where each white box represents a partitioning object, a set of white boxes encircled by dotted lines denotes a macro instruction, solid arcs show the direction of the control flow, and data transfers are represented by dotted arcs.
On the right-hand side of the figure we have listed the scheme of the time shares for the example of the communication overhead implied by a call from software to hardware. The first share ("I.") stems from loading the registers in order to preserve the contents of all registers for a continued execution when control is returned to software once hardware has completed execution. Note that this loading might be necessary since, due to the peculiarities of the deployed compiler, register contents might actually reside in the main memory.
Time share "II." represents the time for data transfers from the processor to the shared memory for all data that is needed for execution by the hardware (in the case of arrays, etc., addresses are deposited). Shares "I." and "II." are represented by one arc in the figure. Share "III." is due to the hardware function call. Share "IV." is not visible in this communication model since it stems from the reaction time of the hardware, i.e., the time from the completion of the hardware call until the execution of the dedicated macro instruction can begin (this time component refers to the top-level state of the coprocessor state diagram in Fig. 3). Finally, share "V." denotes the time necessary to retrieve data from the shared memory before the macro instruction can be executed.
A similar sequence of transactions is necessary when control is returned from hardware to software.
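The five time shares of a software-to-hardware call simply add up, which the following sketch makes explicit. The class `CallOverhead`, the helper `overhead_fraction`, and all cycle counts are hypothetical; the sketch covers only the call direction and is not the communication estimation algorithm itself, which is not given in this paper.

```python
from dataclasses import dataclass


@dataclass
class CallOverhead:
    """Time shares (in clock cycles) of one software-to-hardware call."""
    register_load: int      # share I
    data_to_memory: int     # share II
    hardware_call: int      # share III
    reaction_time: int      # share IV
    data_from_memory: int   # share V

    def total(self) -> int:
        return (self.register_load + self.data_to_memory + self.hardware_call
                + self.reaction_time + self.data_from_memory)


def overhead_fraction(overhead: CallOverhead, macro_cycles: int) -> float:
    """Share of the total time of one call that is spent on communication."""
    return overhead.total() / (overhead.total() + macro_cycles)


# Hypothetical numbers: 20 cycles of overhead for an 80-cycle macro instruction.
example_call = CallOverhead(4, 6, 2, 3, 5)
print(example_call.total(), overhead_fraction(example_call, 80))
```

Because the overhead is paid per call, a macro instruction that merges several POs pays it once instead of once per PO, which is one way to read the remark that the composition of macro instructions is chosen so that the total communication overhead is minimized.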
We would like to emphasize that communication estimation is part of the performance estimation and a substantial element within the time component of our objective function. However, since we do not focus on communication estimation within this paper, we do not provide the corresponding algorithms here. The number of arcs in Fig. 4 alone gives the impression that the communication overhead has a significant impact on the overall execution time. And indeed, it has. Even the composition of a macro instruction from POs is chosen in such a manner (driven by the objective function) that the total communication overhead is minimized.
VI. MODEL FOR MULTIGRANULARITY PARTITIONING APPROACH

A. Defining a Base Granularity
This step refers to the first box shown in the overview of Fig. 2. Step I comprises the transformation of a behavioral description (for example in C or C++) into a control flow graph. We define three different types of nodes:
• a node that contains straightforward code (no control constructs);
As can be seen, at the beginning of a procedure/function the nesting level is one. It increases by one if a new control construct like a loop, an if-then-else construct, etc., is entered. It decreases by one if a control construct is exited. Different nodes are provided with values of nesting level in such a way that the algorithm for building partitioning objects (in Section VI-B) always clusters complete control constructs (head and body of a loop; all branches of an if-then-else, etc.).
The above model will be used in the algorithm in the subsequent section.
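The nesting-level rule (start at one, plus one on entering a control construct, minus one on leaving it) can be sketched as follows. The input format, a flat list of block names interleaved with the markers "enter" and "exit", is an assumption chosen for brevity, as is the function name `nesting_levels`.

```python
def nesting_levels(events):
    """Assign a nesting level to every base block in a linear event stream.

    `events` mixes block names with the markers "enter" and "exit", which
    open and close a control construct (loop, if-then-else, ...).
    """
    level = 1                       # level at the beginning of a function
    levels = {}
    for event in events:
        if event == "enter":
            level += 1
        elif event == "exit":
            level -= 1
            if level < 1:
                raise ValueError("unbalanced 'exit'")
        else:
            levels[event] = level
    return levels


# Hypothetical function: b1 and b3 belong to an outer loop, b2 to a loop nested in it.
print(nesting_levels(["b0", "enter", "b1", "enter", "b2", "exit", "b3", "exit", "b4"]))
```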
B. Generating the Partitioning Objects
This step refers to the second box shown in the overview of Fig. 2 .
Before describing the algorithm for generating the partitioning objects, the requirements are enlisted as follows.
a) It is desirable to put small parts like base blocks or instructions into a single partitioning object.
b) Larger partitioning objects should contain whole control constructs (e.g., nested loops) or possibly functions/procedures.
c) By means of a few moves only (a move is the action that puts a partitioning object from software to hardware or vice versa) a "good" partitioning should be achievable.
Some definitions: we call a partitioning object. is a set that contains nodes ; hence, is a set of sets. In addition, a temporary set is deployed. It contains those partitioning objects that have already been generated during a preceding iteration step through the algorithm. The following relation holds:
. In order to simplify the algorithm the term indicates that a new partitioning object is created by copying the contents of and to . Therein represent partitioning objects. Finally, the algorithm for generating the partitioning objects is formulated as follows: The algorithm is performed as long as the condition of Step 4) is false. If it becomes true, that means we have reached the largest possible granularity (i.e., a partitioning object represents a whole function/procedure already) and no further partitioning objects have to be created.
Note that only one of the conditions in 1), 2a), or 2b) can become true at one time for a newly generated partitioning object.
The result of performing the algorithm on the example graph of Fig. 5 is depicted in Fig. 6. Each of the graphs in Fig. 6 represents the state after a single iteration through the algorithm. The resulting partitioning objects are set off by the grey color. Note that the complete set contains all partitioning objects generated so far (in the example: to ).
At this point we have defined a graph representation and an algorithm that generates partitioning objects out of the basic granularity (that is expressed by the set ). However, it is not yet clear which particular partitioning object is actually used during a step in hardware/software partitioning. As mentioned earlier, this might depend on the deployed estimation techniques (see Scenario 3 in Section III) and/or on the evaluation of a partitioning through the objective function. Sections VII and VIII are dedicated to answering this question.
VII. ESTIMATION TECHNIQUES
In our approach to partitioning we use high-level estimation techniques in order to evaluate the quality of a possible partitioning. In this sense, high-level estimation techniques control the partitioning process. Those estimation techniques comprise software performance, hardware/software communication effort, hardware effort and power estimation. We have proposed appropriate techniques in [32]-[35], respectively. Here we will describe our methodology for estimating hardware performance. We will see in which ways the granularity in partitioning and estimation can be adapted, i.e., in which way our approach to flexible granularity is controlled by the peculiarities of estimation techniques. Though we just demonstrate the adaptation for our hardware performance estimation technique, similar considerations apply to our other estimation techniques as well.
A. Estimation Technique for Hardware Performance
Our requirements to an estimation technique for hardware performance as part of a partitioning approach are:
• high accuracy (compared to the actual schedule that is carried out by high-level synthesis after partitioning);
• adaptation to the granularity of the partitioning process (flexible granularity);
• the possibility of a tradeoff between computation effort and quality (in order to conduct fast "what-if-analysis").
For these reasons we eventually developed a modified version of Path-Based Scheduling [36]. The main feature of our technique is a sophisticated approach to control the problem of path explosion 7 while offering the possibility to adapt the partitioning granularity and the estimation granularity to each other. The problem of path explosion in Path-Based Scheduling has been addressed before by [37]-[39]. However, in contrast to those, our approach allows a cost-function-driven decomposition of the graph. Hence, we can trade off quality against computation time and we can conduct an adaptation to any required granularity.
Path-based scheduling consists of the following passes: I) transforming a CDFG into a directed acyclic graph; II) collecting ALL paths; III) scheduling all paths As-Fast-As-Possible (AFAP, see [36]); IV) overlapping all paths. The computation-time-intensive steps are III and IV, as a consequence of Step II. This is because the number of paths grows exponentially with the number of nodes $N$.
7 The number of paths is $2^N$ in the worst case, with $N$ the number of nodes in the graph that is scheduled.
Fig. 7. Calculating the number of paths with different cut points set.
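The path explosion behind pass II can be reproduced with a short sketch that counts all paths of a directed acyclic graph. The functions `count_paths` and `diamond_chain` and the example graph (a sequence of `k` consecutive if-then-else constructs) are assumptions for illustration; the sketch only counts paths and does not perform path-based scheduling.

```python
from functools import lru_cache


def count_paths(successors, source, sink):
    """Number of distinct source-to-sink paths in a directed acyclic graph."""
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == sink:
            return 1
        return sum(paths_from(nxt) for nxt in successors.get(node, ()))
    return paths_from(source)


def diamond_chain(k):
    """DAG of k consecutive if-then-else constructs (3k + 1 nodes)."""
    successors = {}
    for i in range(k):
        successors[f"j{i}"] = (f"t{i}", f"e{i}")   # branch node
        successors[f"t{i}"] = (f"j{i + 1}",)        # then-part
        successors[f"e{i}"] = (f"j{i + 1}",)        # else-part
    return successors


# Each additional if-then-else construct doubles the number of paths.
for k in (1, 2, 5, 10):
    print(k, count_paths(diamond_chain(k), "j0", f"j{k}"))
```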
Our technique to circumvent the path explosion is to decompose the whole graph and to schedule each subgraph by itself. It should be noted that each subgraph represents a partitioning object as defined in Section VI. Rather than describing the whole scheduling algorithm we focus only on the decomposition problem and the adaptation to partitioning (Steps I, III, and IV are the same as used in [36] ).
Decomposition is conducted by setting so-called cut points. The same graph representation as defined in Section VI is used.
We assume that each node has two attributes: the nesting level ST (see Section VI) and the iteration time it. The iteration time it is obtained by running a profiler on the application before we apply our scheduling estimation. As a result of profiling we obtain the number of times each node of the application is executed for a typical set of input stimuli patterns.
Before we describe our method for finding a good location for setting cut points, we give three simple examples in order to motivate the rules and the algorithm presented afterwards.
Example 1: Calculating all paths in the graph representation given by Fig. 7 leads to a number of paths. Each path contains a number of operations.
Example 2: Now assume the graph has been split into two subgraphs through cut point CP1. Determining all paths for each subgraph leads to a total number of possible paths.
Example 3: Instead of CP1, cut point CP2 is set and all possible paths are calculated again.
Examples 2 and 3 are expected to achieve a faster scheduling result than Example 1 because a cstep (control step) ends at each cut point and a new cstep starts right after a cut point.
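The effect of a cut point on the number of paths can be sketched as follows. For subgraphs that follow each other in series, every path of the first combines with every path of the second, so the counts multiply. Once a cut point separates them and each subgraph is scheduled by itself, the counts only add up. The path counts used here are made-up values, not those of Fig. 7.

```python
from math import prod

def paths_uncut(subgraph_paths):
    """Subgraphs in series, scheduled as one graph: path counts multiply."""
    return prod(subgraph_paths)

def paths_cut(subgraph_paths):
    """Each subgraph scheduled by itself: path counts add up."""
    return sum(subgraph_paths)

segments = [6, 8]   # hypothetical path counts on either side of one cut point
print(paths_uncut(segments), paths_cut(segments))
```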
The loss in quality (measured in terms of an additional number of csteps) depends on the data dependencies of operations before and after a cut point. Assuming that the operations have already been optimally ordered, there is no way to influence this effect.
Another aspect is the hardware constraint (number of available hardware resources like functional units). The larger the number of resources, the larger is the additional number of control steps since a potential high parallelism is prevented (see example in Fig. 8) .
A good measure for the loss of quality is the increase of execution time (measured in clock cycles) implied by the schedule, rather than the number of additional csteps. Let $it_{cs}$ be the number of times an operation scheduled in control step $cs \in CS$ ($CS$ is the set of all control steps) is executed. Then

$$T = \sum_{cs \in CS} it_{cs} \qquad (1)$$

gives the total execution time of a program whose graph representation has been cut into subgraphs. In spite of the fact that Examples 2 and 3 lead to almost the same (reduced) number of paths, Example 2 is expected to achieve a smaller execution time (only four iterations at cut point CP1 compared to ten iterations at CP2). These considerations lead to the formulation of Rule 1.
Rule 1: Locate the cut points at those locations in the graph that impose a minimum number of additional clock cycles compared to a noncut graph

$$\min \left( T_{cut} - T_{noncut} \right) \qquad (2)$$

Therein, $T_{cut}$ and $T_{noncut}$ are the execution times with and without cut points set. Preferred candidates are those with the lowest it numbers (iterations).
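As a minimal sketch of Rule 1, the following code sums iteration counts over control steps and compares the two cut point candidates from Examples 2 and 3. It assumes that one cut adds exactly one control step, executed as often as the code at the cut location. That penalty model, the function names, and the profile of the uncut schedule are assumptions for illustration only.

```python
def execution_time(cstep_iterations):
    """Total clock cycles: each control step costs one cycle per execution."""
    return sum(cstep_iterations)

def added_cycles(cut_iterations, extra_csteps=1):
    """Assumed penalty of one cut: extra control steps, each executed `it` times."""
    return extra_csteps * cut_iterations

candidates = {"CP1": 4, "CP2": 10}               # iteration counts quoted for Examples 2 and 3
uncut = execution_time([1, 4, 4, 10, 10, 1])     # hypothetical profile of the uncut schedule

for cp, it in candidates.items():
    print(cp, uncut + added_cycles(it))          # execution time with this cut point set

best = min(candidates, key=lambda cp: added_cycles(candidates[cp]))
print("preferred:", best)
```

Under this assumption the candidate with the lowest it value is always preferred, which matches the last sentence of Rule 1.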
Please note that possible locations for setting cut points must have been determined before Rule 1 can be applied.
The total number of possible cut points should be reduced to those cut points which will actually reduce the complexity (i.e., prevent path explosion).
Rule 2:
A possible cut point CP is located after a node if is a join point and if there exists a path where is the first node and where
ST ST
Note that ST is the nesting level of node . The path ends if a node is encountered with ST ST .
Fig. 9(a) shows the result after applying Rule 2 for a small example. There, only the location after node 5 fulfills all conditions for setting a cut point. It is obvious that cut points right after nodes 9 and 13 would not contribute to reducing the number of paths.
Now assume both Rules 1 and 2 have already been applied but the number of cut points is too large (see footnote 9). This is equivalent to a granularity that is too small (see the definition of granularity in Section VI), which can lead to effects like the one described in Scenario 3 in Section III.
In that case, a reduction of cut points can be achieved by applying Rule 3.
Footnote 9: The user determines the maximum number of paths.
Rule 3: Find paths that contain more than one cut point located at the same nesting level ST. Search for the cut point that would partition the path into two approximately equally sized pieces (the metric is the number of comprised operations). Mark this cut point as "already visited" and define it as the root of a binary tree. Afterwards, treat the two resulting paths in the same way. An edge is drawn from the previous cut point to each of the two new cut points, thus building a binary tree. The procedure is repeated for all resulting smaller paths until no more cut points are found.
Cut points can now be reduced hierarchically, starting with the cut points at the leaves of the constructed binary tree. Through this strategy a granularity is maintained that splits the whole graph into equally sized pieces. Hence, the quality of the following schedule is improved.
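The construction in Rule 3 can be sketched as a recursive midpoint split. The code below assumes that cut points on one path are given as operation indices, builds the binary tree by always choosing the cut point nearest the middle of the current segment, and then lists the cut points deepest level first as one possible reading of "starting at the leaves". The data layout, the names, and the positions are hypothetical.

```python
def build_cut_tree(positions, lo, hi):
    """Recursively pick the cut point nearest the middle of the segment (lo, hi).

    positions: operation indices of cut points on one path (same nesting level).
    Returns a nested tuple (cut, left_subtree, right_subtree) or None.
    """
    inside = [p for p in positions if lo < p < hi]
    if not inside:
        return None
    mid = (lo + hi) / 2
    root = min(inside, key=lambda p: abs(p - mid))
    return (root, build_cut_tree(inside, lo, root), build_cut_tree(inside, root, hi))

def deletion_order(tree):
    """Cut points ordered deepest tree level first, so leaves are removed first."""
    levels = []
    def walk(node, depth):
        if node is None:
            return
        if depth == len(levels):
            levels.append([])
        levels[depth].append(node[0])
        walk(node[1], depth + 1)
        walk(node[2], depth + 1)
    walk(tree, 0)
    return [cp for level in reversed(levels) for cp in level]

cuts = [10, 25, 48, 60, 90]          # hypothetical cut positions on a 100-operation path
tree = build_cut_tree(cuts, 0, 100)
print(deletion_order(tree))
```

Removing cut points in this order keeps the ones that split the path most evenly for as long as possible.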
Note that the user determines the total number of cut points that are actually set. The algorithm above gives a precedence only, i.e., it determines which cut point to delete next. An example is given in Fig. 9.
Fig. 10 shows in which way Rules 1 to 3 are applied within our performance estimation technique (we call it the path-based estimation technique). It starts by initializing the set of cut points with the empty set. Afterwards, profiling data (it values) is generated by simulation and written to the graph representation. In the following step the graph is transformed into a directed acyclic graph (DAG) and the number of all paths, #paths, is computed (lines 2-5). If this number of paths would lead to an excessive computation time (Table I indicates which number of paths still leads to an acceptable computation time of only a few minutes), cut points are calculated (l. 7, l. 15ff, see below). Then the DAG is split at the corresponding locations (l. 8) and a path-based scheduling is executed for each of the resulting DAGs: first all paths are calculated and an AFAP schedule is performed for each path (l. 9-13). Then the set of all constraints is taken in order to superimpose the constraints of all paths (l. 14).
Cut points are computed in the function compute cutpoints. It starts by scanning all nodes of the DAG (l. 17). If a node fulfills the conditions formulated by Rule 1, the node is inserted in the list of potential cut points, CP list (l. 18, 19). The cut points are then sorted such that those located at less often executed parts of the graph (i.e., with a low iteration time it) have the highest priority. At this point the user can determine the quality/computation time tradeoff by choosing the number of cut points to apply (l. 21). If the number of potential cut points found exceeds this number (l. 22), a selection is necessary: if there is more than one cut point in the list that would lead to the same loss of quality (assuming only Rules 1 and 2 have been applied), Rule 3 decides (l. 23, 24) which of them is deleted from the list in order to meet exactly the user-defined number of cut points.
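The selection step at the end of this description can be sketched in a few lines. The code is not the listing of Fig. 10. It assumes that every potential cut point carries its it value, and that an optional rank derived from Rule 3 (lower rank means "delete later") breaks ties. All names and numbers are hypothetical.

```python
def select_cut_points(candidates, max_cuts, tie_rank=None):
    """Keep the max_cuts cut points with the lowest iteration counts.

    candidates: {cut_point: it}; tie_rank: {cut_point: rank} used only when
    two cut points have the same it value (assumed to come from Rule 3).
    """
    tie_rank = tie_rank or {}
    ordered = sorted(candidates, key=lambda cp: (candidates[cp], tie_rank.get(cp, 0)))
    return ordered[:max_cuts]

potential = {"a": 4, "b": 10, "c": 4, "d": 1}    # made-up it values
print(select_cut_points(potential, 2, tie_rank={"a": 1, "c": 0}))
```

Raising `max_cuts` trades schedule quality for computation time, which is the user-controlled tradeoff described above.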
B. Adaptation of Estimation and Granularity
As we already mentioned briefly, the granularity for partitioning depends on the results of the estimation. This dependency is explained in the following.
The point is that we cannot consider those partitioning objects (POs) for which we cannot provide an appropriate estimation. Consider, for instance, Example 2 from Section VII-A, where cut point number 1 is set. Assume furthermore that our algorithm from Section VI-B proposes a PO that comprises four nodes. We would then need a path that solely comprises nodes that are also part of this PO in order to provide it with an estimation value. Obviously, there is no such path in the whole path set, so we have no estimation value for the PO. It does not help either that some paths overlap with the partitioning object. In a similar discussion we could construct an example where a large PO (meaning it comprises many nodes) would require too many (since small) paths in order to be estimated. The estimation would then have to be accomplished through "patching", which is not acceptable since it is too inaccurate. This latter case can happen if we choose many cut points, resulting in many small paths. In conclusion, in both examples we cannot estimate the corresponding POs. As a result those POs will not be considered, i.e., they are deleted from the set of partitioning objects (see Section VI-B). Hence, the estimation method controls the possible granularity: if we choose many cut points, many small paths are obtained so that small POs are finally selected. If we choose only a few cut points, correspondingly large POs are preferred. Or, more generally: the quality we prefer for the estimation directly influences the possible granularity.
Table I shows the results achieved by our performance estimation technique. For diverse applications, the number of paths (#pth), the number of cut points (#cpt), the scheduling result, and the computation time are given. Note that the user determines the number of cut points to be set. Thus, the quality/computation time tradeoff is influenced: if only a few cut points are set, the number of paths is higher and the computation time increases, but the quality of the schedule is improved. Conversely, many cut points lead to fewer paths and a decreased computation time, but the quality of the schedule suffers. More telling than these pure numbers, Figs. 11 and 12 reveal a behavior that we can exploit to optimize the quality/computation time tradeoff: Fig. 11 shows that the computation time decreases steeply with the first few cut points set but that, beyond a certain point, additional cut points lead to only a marginal improvement in computation time (those points are marked by arcs).
C. Results Achieved by Deploying the Hardware Performance Estimation Technique
An accuracy of the schedule of about 15% (given in Fig. 12 as a deviation in percent compared to the best possible schedule) can already be achieved with a small fraction of all paths. Or we can argue the other way around: if we considered more paths in order to perform the schedule, we would significantly increase the computation time while the accuracy would increase only insubstantially. Our conclusion is that the best compromise between quality and computation time is to consider only a fraction of all paths. This is possible through our method that sets cut points accordingly.
Fig. 12. Tradeoff between quality (deviation from according best schedule) and complexity (number of paths).
VIII. OPTIMIZATION THROUGH A DYNAMICALLY WEIGHTED OBJECTIVE FUNCTION
The task of the optimization algorithm is to find the best hardware/software partition according to an objective function. As mentioned earlier, our definition of a macro function is the actual implementation of a piece of hardware in form of an application specific hardware (ASIC). A macro function will typically consist of a set of consecutive partitioning objects. A large variety of different granularities (differently sized partitioning objects) supports the optimization algorithm in finding a good partition. Or in other words, if a large macro instruction would lead to a good partition but only small partitioning objects are available, many of them would be needed. Thus optimization becomes more complex and finding that specific good solution is more unlikely. Hence, a flexible granularity enhances the prospect to obtain good results, i.e., partitions.
This step refers to the third box shown in the overview of Fig. 2 .
As an optimization algorithm we have chosen the simulated annealing algorithm [40] . The reasons are as follows.
• It is mathematically well investigated.
• It offers the possibility of a quality/computation time tradeoff.
• It is a general-purpose optimization algorithm (that means it is easy to adapt to our partitioning model).
Our implementation uses the annealing schedule described in [41] since it offers one of the best quality/computation time tradeoffs. The interface between the annealing schedule and our specific problem formulation of partitioning basically consists of three functions:
• generate() When this function is called, the generation of a new state is requested, i.e., in hardware/software partitioning a new partition has to be generated. This is done by moving a partitioning object from software to hardware or vice versa. A characteristic property of each partition, out of the set of all possible partitions, is its cost.
• accept() If the annealing algorithm has decided to accept a move then accept() returns the value "TRUE".
• reject() If the annealing algorithm has decided not to accept this move then reject() returns the value "TRUE".
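How the three interface functions fit into an annealing loop can be sketched as follows. This is a generic simulated annealing skeleton with a simple geometric cooling schedule standing in for the schedule of [41], which is not reproduced here. The partition encoding (a list of booleans, True meaning hardware), the toy cost function, and all numbers are assumptions for illustration.

```python
import math
import random

def anneal(partition, cost, steps=2000, t_start=1.0, alpha=0.995, seed=0):
    """Generic annealing loop over a list of booleans (True = hardware)."""
    rng = random.Random(seed)
    current = cost(partition)
    best, best_cost = list(partition), current
    temp = t_start
    for _ in range(steps):
        idx = rng.randrange(len(partition))       # generate(): move one partitioning object
        partition[idx] = not partition[idx]
        delta = cost(partition) - current
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current += delta                      # accept(): keep the move
            if current < best_cost:
                best, best_cost = list(partition), current
        else:
            partition[idx] = not partition[idx]   # reject(): undo the move
        temp *= alpha                             # assumed geometric cooling
    return best, best_cost

# Made-up per-object software times, hardware times, and hardware areas.
sw, hw, area = [5, 3, 8, 2], [1, 1, 2, 1], [4, 3, 9, 1]

def toy_cost(p, limit=10):
    """Hypothetical cost: area plus a heavy penalty for exceeding the time limit."""
    t = sum(h if in_hw else s for s, h, in_hw in zip(sw, hw, p))
    a = sum(ar for ar, in_hw in zip(area, p) if in_hw)
    return a + 100 * max(0, t - limit)

best, c = anneal([False] * 4, toy_cost)
print(best, c)
```

Exactly one object changes sides per iteration, mirroring the single-object move that generate() performs.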
A. Selection of a Move
The following prerequisites led to the definition of the selection algorithm for a new move:
• Exactly one partitioning object is moved from software to hardware or vice versa during one call of the generate() function.
• The algorithm assumes that hardware and software parts execute in mutual exclusion.
• The algorithm should select a new move dependent upon the current partition.
Within the following algorithm, the move notation has the meaning that a partitioning object is moved from software to hardware using the respective move. The algorithm is defined as follows:
5) If none of the conditions in steps 2), 3), 4) is fulfilled, then proceed with 2) above; else ready, since a valid move is found.
If a valid move is found, exactly one of the cases Case 1, Case 2, or Case 3 will be applied. All possible base configurations and moves are depicted in Fig. 13 . The assignments of cases to figures are as follows:
• Case 1 "TRUE" corresponds to Fig. 13(a);
• Case 2 a) "FALSE" corresponds to Fig. 13(b);
• Case 2 a) "TRUE" corresponds to Fig. 13(c);
• Case 3 "TRUE" corresponds to Fig. 13(d).
A situation like the one in Fig. 13(e) cannot occur since no move can generate such an initial configuration (it would violate the conventions on the target architecture, i.e., mutually exclusive execution of software and hardware). Based on these base configurations and the corresponding moves, every valid codesign can be reached.
B. Partitioning and Granularity
In Section VII-B we have shown in which way the estimation algorithm influences the possible granularity. Here, we explain in which way a flexible granularity finally leads to (small or large) macro instructions.
During each move of the optimization algorithm, one partitioning object is selected and supposed to be implemented as hardware or as software (see the algorithm in the previous section). Let us assume, for example, that the final implementation is likely to have many hardware parts and a few software parts (this might be the case when severe timing constraints are imposed that can only be met when the hardware part is large). In this case it is obviously more advantageous if the algorithm selects larger partitioning objects rather than smaller ones. Since a partitioning object is selected randomly (we use simulated annealing), we cannot influence the selection itself. But once one of the large partitioning objects has been selected randomly, it may significantly improve the cost function. Conversely, a small partitioning object that is selected to be removed from the hardware side will aggravate the cost function less significantly than a large one would. As a consequence, the optimization algorithm will implicitly tend to collect a few large partitioning objects for hardware implementation instead of many small ones. This leads to a smaller number of moves. Other examples can be constructed where a large partitioning object would be too expensive and a few small ones can achieve a better result.
Thus, the flexible granularity helps to:
• achieve better hardware/software designs;
• reduce the computation time during partitioning.
All adjacent partitioning objects (those that provide a contiguous part in the control flow) that have finally been selected for a hardware implementation form one macro instruction. Typically, more than one macro instruction is implemented.
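The grouping of adjacent hardware partitioning objects into macro instructions can be sketched as follows. The code assumes that the partitioning objects are listed in control-flow order and that each one is simply flagged as hardware or software. This linear layout and the function name are simplifying assumptions.

```python
from itertools import groupby

def macro_instructions(in_hardware):
    """Group maximal runs of consecutive hardware objects (control-flow order)."""
    macros, pos = [], 0
    for is_hw, run in groupby(in_hardware):
        length = len(list(run))
        if is_hw:
            macros.append(list(range(pos, pos + length)))
        pos += length
    return macros

# Objects 1 and 2 are adjacent and form one macro instruction; object 4 forms another.
print(macro_instructions([False, True, True, False, True]))
```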
C. Multilevel Objective Function
The task of the objective function is to measure the quality of the partition after each move that has been proposed by the optimization algorithm. The problem in defining an objective function is to find an appropriate way to trade off different design goals/constraints against each other. Design constraints/goals are performance, hardware effort, power dissipation, etc. Though the ideas of our objective function are general, we demonstrate its usefulness by means of one design constraint (performance) and one optimization goal (hardware effort).
The objective function is defined as follows:
We will discuss the individual components in the following, especially the weighting factor, which varies dynamically depending on how closely the time constraint is met. A static weighting using only the two constant factors is not sufficient for our purpose of meeting a real-time constraint and minimizing the hardware effort at the same time. But first, let us discuss another problem: since time and area have different physical units, we have to normalize our components. As for the hardware component, we calculate the area of a piece of hardware (using our estimation method presented in Section VII) and divide it by the average area of all other pieces of hardware. Consequently, we get a relative number that is on average close to 1.
The time component is normalized using the real-time constraint and the current execution time of the system. Since our aim is to meet the constraint, we define the deviation between the two as a cost and normalize it in order to get a unit-less number that can be combined with the area component. The constant 1 is added for the case where the deviation vanishes, meaning that in that case the area component would totally dominate unless we add 1.
So far, the cost function has been built in a straightforward manner. However, we would never reach our aim of meeting a real-time constraint and minimizing the hardware effort unless we took care of the weighting factor. A desirable behavior of the weighting factor is that it is zero when we are far away from meeting the real-time constraint. This means that in such cases the timing behavior is optimized exclusively. But when the execution time is close to the real-time constraint, the area component should gain more and more influence until the execution time is almost equal to the constraint, where the weighting factor should be at its maximum. This is a dynamic weighting, meaning that the weighting factor depends on the timing component. Our definition of the weighting factor fulfills this desirable behavior; the value of its parameter is obtained by experience.
The weighting factor is shown more clearly in Fig. 14, which gives three possible examples of its shape depending on the parameter. For a small value of the parameter the outermost flanks apply. For large values (indicated by the direction the arcs are heading to) the innermost flanks apply. The flanks in between belong to a parameter value that is somewhere between. The weighting factor is zero as long as the system timing is far away from the real-time constraint. If the execution time is close to the constraint, the weighting factor starts to increase until it becomes maximum (i.e., 1). Now the area component dominates the time component. This is desirable since the time constraint has almost been fulfilled at this stage. So, the dynamic weighting offers the possibility of separating area and time dependent on the current state, without switching to another objective function (which might be very computation intensive). By means of the two constant factors in (3) it is possible to control the extent to which the area component should dominate during the end phase of the optimization procedure.
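The qualitative behavior of the dynamic weighting can be illustrated with a small sketch. The linear ramp used here, the parameter name `k`, and the use of the relative deviation are assumptions chosen only to reproduce the described shape (zero far from the constraint, rising to 1 close to it, with narrower flanks for larger parameter values). They are not the definition used in the objective function (3).

```python
def dynamic_weight(t_exec, t_constraint, k=5.0):
    """Assumed ramp: 0 far from the constraint, rising to 1 as t_exec nears it."""
    deviation = abs(t_exec - t_constraint) / t_constraint
    return max(0.0, 1.0 - k * deviation)

t_c = 100.0                                   # hypothetical real-time constraint
for t in (200.0, 120.0, 110.0, 100.0):
    # A larger k gives a narrower ramp around the constraint.
    print(t, dynamic_weight(t, t_c, k=5.0), dynamic_weight(t, t_c, k=10.0))
```

With such a weight, the area term has no influence while the timing is far off and takes over gradually instead of through an abrupt switch between two objective functions.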
We want to emphasize that our objective function is not the first one to comprise performance and area. In fact, Vahid/Gajski [13] propose an objective function by formulating the violations in terms of timing and area. But the weighting between these violations is fixed a priori. An adaptation of the weighting between the violations, according to the current state of the partitioning process, is not provided. This is where we see the advantage of our approach, which does provide this flexibility.
Kalavade/Lee [20] propose another approach to meet timing constraints and minimize hardware during hardware/software partitioning. They divide their optimization algorithm into two phases. The first is solely for meeting the timing constraints. Once this is accomplished they switch to a second phase where they take care of area minimization. In our approach we do not have to switch between two different optimization phases. Instead, the weighting between timing component and area component varies continuously. Thus we avoid a possibly nondetermined behavior of the optimization algorithm due to abrupt switching.
As we will show in the following section, our flexible approach achieves significantly better results than our previously deployed approach that featured an objective function with fixed weighting factors.
IX. EXPERIMENTS AND RESULTS
Experiments have been conducted in order to show the superiority of our hardware/software partitioning approach with a flexible granularity compared to an approach with a fixed granularity. We will also show the efficiency of our multilevel objective function. As for the performance estimation technique, we already presented results in Section VII.
Most of the applications stem from the domain of digital signal processing and some others are control dominated. The partitioning approach with flexible granularity is applicable to both domains, as the results will show. The size of the applications varies from about 50 lines of C code to about 600 lines of C code and the number of paths ranges from 28 to 18.5 million. One of the applications is part of a real industrial project. Through this diversity of sizes the partitioning approach can show that a flexible granularity leads to fast and high-quality results, independent of the size of the applications. It should be noted that we used the same applications in the experiments in Section VII as we did for partitioning here in this section. But in Section VII we used single functions of the particular application whereas we used whole applications consisting of one or more functions here.
The results have been achieved by a fully automated partitioning algorithm and estimation techniques.
The quality of our approach is evaluated by means of the achieved speedup, the hardware effort of the application-specific hardware, and the computation time of our algorithm.
A. Speedup
The speedup is defined as

$$spu = \frac{T_{SW}}{T_{HW/SW}} \geq spu_{c}$$

where $T_{SW}$ is the execution time of an application for an all-software solution, $T_{HW/SW}$ is the execution time of the same application but for a hardware/software implementation (after partitioning), and $spu_{c}$ is the imposed constraint speedup. Fig. 15 shows the behavior of the obtained speedup and the speedup constraint for different design points of an application. The speedup graph of the partitioner gives the result obtained by the partitioner. It can be seen that the constraint of the above equation is met in all cases. Table II shows some more results. It also shows the real speedup, which has been obtained as follows: after hardware/software partitioning the software part has been mapped to a standard processor core (SPARC) and the hardware part (including interfaces) has been synthesized using high-level synthesis. Afterwards we have taken the output (slif netlist) of the high-level synthesis and optimized it using the Synopsys Design Compiler. Simulating the software and hardware parts then yielded the real speedup.
The table shows that in all cases the speedup values are very close to the constraint. The partitioner meets the given speedup constraint in all cases. Because the estimation tools (hardware run-time, software run-time, communication time) perform estimation at a high level of abstraction, there are some small deviations between the real synthesis results and the speedup constraint. That these deviations are small reflects the good accuracy of the estimation tools.
An arc in the table has the meaning that the according design point is the same as the one the arc points to. Other applications cannot be sped up more than a maximum value due to the peculiarities of an application and the deployed hardware resources.
B. Hardware Effort
Since embedded systems on a chip are limited in size, minimizing the hardware effort is an important goal. The attribute "geq" (gate equivalents) gives the hardware cost for each design point. That large speedups of around 10.0 come at small hardware costs (application-specific hardware of the hardware/software system) of only 30 000 gate equivalents or fewer is due to our multilevel objective function. The hardware effort values are obtained by using the Synopsys Design Compiler. Fig. 16 shows the percentage of hardware that could be saved in comparison to a simple objective function (i.e., one without a dynamically weighted hardware component). As a result, savings of up to 50% have been achieved in some cases. Table III gives the exact numbers in terms of gate equivalents; in addition, slightly different performance values are shown in some cases.
C. Computation Time
Due to the variable size of the partitioning objects, which can cover the whole application or only a single instruction, the computation time could be kept almost independent of the size of the benchmark. Of course, the number of all possible partitioning objects is larger than it would be when using a fixed-size granularity (we measured about four times more partitioning objects than there would have been if the granularity had been fixed to base block level; but there is no exponential growth). As a result, the computation times have in most cases been within a couple of seconds.
X. CONCLUSION
We presented the first approach to automated hardware/software partitioning that features a flexible granularity. That means that rather than fixing the granularity in advance (i.e., before automated partitioning starts), our granularity is determined during partitioning as a result of the interplay between estimation techniques and imposed design constraints. In the estimation section we introduced our path-based technique. We showed that we can easily trade off the quality of a schedule against the computation time necessary to accomplish it. The main result from the hardware run-time estimation part is that all of the investigated applications could be estimated at a sufficient accuracy (approximately 15% is sufficient in this sense since we use the estimation results solely to guide partitioning rather than for actual performance prediction) while we only considered a fraction of all possible paths. Though we cannot say that a certain percentage in terms of path coverage always guarantees these 15%, we can certainly say that all applications showed a turning point. This is a point in the graph (see Fig. 11) indicating that an increasing number of paths does not significantly contribute to improving the quality of the schedule any further. We deployed data-dominated as well as control-dominated applications and both classes show the same behavior. But since control-dominated applications in general feature more paths (at a comparable size in terms of C-code lines, for example), the path-based estimation technique reduces the computation time of those applications much more significantly than that of data-dominated applications.
Furthermore, we introduced our dynamically weighted cost function. This cost function has the ability to balance different physical components (time, hardware effort) such that the optimization algorithm first concentrates on meeting the timing constraints and afterwards, when the timing constraints are almost met, weights the area component increasingly higher. As a result, we achieved area savings between 0% and about 50% compared to the formerly deployed cost function without dynamic weighting.
Furthermore, we showed that the granularity floats somewhere between a single basic block and a whole function/process, thus speeding up the optimization procedure. This becomes possible since, on demand, large partitioning objects can be moved from software to hardware (or vice versa) rather than many small ones, for example. This depends on the strength of the imposed constraints.
The overall partitioning results show relatively high speedups (compared to an all-software solution) at a reasonable amount of application-specific hardware. Most of the results could be obtained within a few seconds due to the flexible granularity. This comes at the cost of a slightly increased computation time for precalculation of estimation values (hardware
Dr. Ernst is a member of ACM and the German GI.
