Abstract
Introduction
A successful solution of the constrained hardware/software partitioning problem depends on adequate estimates of the performance characteristics and the implementation cost (the complexity) of the relevant HW/SW system parts at all stages of the partitioning. To reduce the HW/SW codesign space and to control the partitioning process, one can use an appropriate cost function that accounts for the performance-complexity requirements.
Both the HW-oriented [1, 2] and the SW-oriented [3, 4] approaches allow fine-grain automatic partitioning. Among the related work, the authors of [5] investigate the partitioning problem starting from a cospecification.
Despite the similarity of the results for different initial conditions [4], the efficiency of HW/SW partitioning in these approaches depends on the initial solution in the codesign space, and the cost function must be adapted automatically.
A clustering approach [6, 7] that uses closeness criteria [8] to control the partitioning process takes the design space properties into account. However, the user decides on the clustering and partitions the operations. In addition, the highly nonmonotonic design space makes it difficult to introduce a metric (a distance function) [7]. In [9], an approach is described that uses a relaxed cost function, which enables the partitioning algorithm to focus on satisfying performance and to handle the HW minimization. A parameterized architecture model is proposed in [10] that makes it possible to consider the number of buses, memory ports, and connection styles affecting machine parallelism.
Main goals and features
The main objectives of performance-complexity analysis are to estimate the marginal satisfiability of the performance requirements at every stage of HW/SW partitioning and to determine the direction of the partitioning process in the HW/SW codesign space that minimizes the cost function. The proposed approach has several distinctive features.
• First, starting from the system specification as a C program (as in the software-oriented approach [3, La]), it makes it possible to extract the Pareto optimal set of system alternatives in the HW size - system performance codesign space, to estimate the two extreme implementations, HW [1, 2] or SW [3, 4], and to choose an optimal HW/SW one.
• Second, by profiling the C program and using a special graph for the internal representation, a metaoperator net (M-net) [13], this approach makes it possible to estimate the software complexity on the object code level and even on the assembly language level with the rapid performance estimation system. This is important because using assembly code based on the details of the selected processor lets us reduce the redundancy introduced by different compilers in SW timing estimation, and the estimation is fast due to the special implementation of the C program profiler.
• Third, using generalized performance-complexity estimates and the codesign space properties (the Pareto subsets), it is possible to control the partitioning process as in [6, 7], but, in contrast with those works, this approach enables fine-grain automatic partitioning and minimization of the communication overhead.
The experimental results discussed in Section 6 are promising and indicate that the proposed approach is relatively insensitive to the initial solution.
Performance-complexity analysis overview
This section addresses the inner loop of performance-complexity analysis. After HW/SW partitioning, assembly (for SW) and VHDL (for HW) code generation, and high-level synthesis, a stage of global run time analysis is necessary (the outer analysis loop). The major steps of the inner loop analysis are the following.
1) Preliminary profiling. The GSSS system [13] was used as a platform for the performance-complexity investigation in HW/SW partitioning. The SW complexity is investigated in two stages: on the level of C functions, basic blocks, and statements, and on the assembly code level by building the SW execution trace. This trace can be built using trace interrupts (for example, interrupt 01 in the BIOS of the IBM PC) and the frequency counters method. This method consists of automatic clustering of short operations and instructions, gathering statistics, and using special tables to calculate instruction execution times for different processors.
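The frequency counters method described in step 1 can be illustrated with a minimal sketch. It is not the GSSS implementation: the instruction classes, the cycle tables, the processor names, and the function name estimate_cycles are all hypothetical, and the cycle values are illustrative rather than measured.

```python
from collections import Counter

# Hypothetical cycle tables: cycles per instruction class for each processor.
# The values are illustrative assumptions, not measured data.
CYCLE_TABLES = {
    "cpu_a": {"alu": 1, "load": 2, "store": 2, "branch": 3},
    "cpu_b": {"alu": 1, "load": 3, "store": 3, "branch": 2},
}


def estimate_cycles(trace, processor):
    """Estimate the number of processor cycles for an execution trace.

    trace is an iterable of instruction-class names, one entry per executed
    instruction. Instructions are clustered by class, counted, and weighted
    by the per-class cycle cost of the chosen processor.
    """
    table = CYCLE_TABLES[processor]
    counts = Counter(trace)  # one frequency counter per instruction class
    return sum(table[cls] * n for cls, n in counts.items())


trace = ["load", "alu", "alu", "store", "branch", "load", "alu"]
cycles_a = estimate_cycles(trace, "cpu_a")
cycles_b = estimate_cycles(trace, "cpu_b")
```

Keeping the counters separate from the cycle tables means the same trace statistics can be reused to compare several candidate processors without rerunning the program.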
In the presence of nondeterministic operations in the system specification (data-dependent operations, loops and waiting for external events [2]) we use stochastic estimates for the SW complexity (the number of processor cycles) and the maximal CPU cycle time.
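One simple way to obtain such a stochastic estimate is to summarize the cycle counts observed over repeated profiling runs. The sketch below is an assumption about how this could be done, not the estimator used in the paper: it reports the sample mean and an empirical bound that a given share of the runs does not exceed. The function name and the sample values are hypothetical.

```python
import math
import statistics


def complexity_estimate(cycle_samples, confidence=0.95):
    """Return (mean, bound) for a sample of observed cycle counts.

    bound is the smallest observed cycle count that is not exceeded by at
    least a `confidence` share of the samples (an empirical quantile).
    """
    ordered = sorted(cycle_samples)
    # 1-based rank of the order statistic that covers the requested share.
    rank = max(1, math.ceil(confidence * len(ordered)))
    return statistics.mean(ordered), ordered[rank - 1]


# Hypothetical cycle counts from ten profiling runs of one code segment.
samples = [1000, 1200, 1100, 1500, 1300, 1250, 1150, 1400, 1050, 1350]
mean_cycles, bound_cycles = complexity_estimate(samples)
```

With only ten samples the 0.95 bound coincides with the largest observation; a larger sample is needed before the bound separates from the maximum.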
2) SW run modelling. The SW (code) segments in which timing constraints are violated are selected for moving to HW. For multiprocessor systems, the partitioning task is complicated by global scheduling and allocation. The selected code segments belong to the critical path, and the partitioning task is solved for these segments. For the code segments that are not critical, the D.R. Fulkerson problem is solved, that is, an optimal delay distribution is found for a cost function.
The cost function is minimized under minimum/maximum timing constraints.
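The selection of critical-path segments in step 2 can be sketched with a standard forward/backward pass over a dependency graph of code segments. The sketch covers only the identification of critical segments and their slack; it does not solve the Fulkerson delay distribution problem. The segment names, durations, and the function name are hypothetical.

```python
from graphlib import TopologicalSorter  # available in Python 3.9+


def critical_segments(duration, preds):
    """Return (makespan, set of segments with zero slack).

    duration: {segment: execution time}
    preds:    {segment: set of predecessor segments}
    """
    order = list(TopologicalSorter(preds).static_order())
    # Forward pass: earliest finish time of every segment.
    finish = {}
    for seg in order:
        start = max((finish[p] for p in preds.get(seg, ())), default=0)
        finish[seg] = start + duration[seg]
    makespan = max(finish.values())
    # Backward pass: latest finish time that does not delay the whole run.
    latest = {seg: makespan for seg in order}
    for seg in reversed(order):
        for p in preds.get(seg, ()):
            latest[p] = min(latest[p], latest[seg] - duration[seg])
    # A segment is critical when it has no slack at all.
    critical = {seg for seg in order if latest[seg] == finish[seg]}
    return makespan, critical


# Hypothetical segment run times (in arbitrary time units) and dependencies.
duration = {"read": 4, "filter": 10, "log": 2, "write": 3}
preds = {"read": set(), "filter": {"read"}, "log": {"read"},
         "write": {"filter", "log"}}
makespan, critical = critical_segments(duration, preds)
```

If the makespan exceeds the timing constraint, the zero-slack segments are the natural candidates for moving to HW, while the remaining segments (here "log") have slack that a delay distribution can exploit.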
3) Renovation of the SW-segment candidate list. After the HW evaluation of the selected code segment, transformations of the internal representation may be possible (for example, concurrent operations in SW and HW). As a consequence of these transformations, the timing constraints are likely to be updated, and the candidate list may be reduced significantly. In the multiprocessor case, only those transformations are possible that do not violate the minimum/maximum timing constraints for noncritical-path SW segments. The above properties define the proposed approach as an adaptive one. After the Pareto optimal variant extraction and the systematic HW/SW codesign space exploration, the constrained partition optimization is carried out.
Performance-complexity estimates in HW/SW partitioning

Processing model
In this section, the software running model for a general-purpose processor is discussed. The main goal is to use it in performance-complexity analysis. We are given an input data block consisting of d bits and the time constraint T'(d) for processing the d bits by software S,(d), which requires no more than |S,(d)| processor cycles under a given value of I;.
The formal definition of the constrained partition optimization problem
As mentioned in Section 4.1, the CPU model captures more than just the processor, so the CPU hardware must not be neglected.
The experimental results presented in Section 6 are based on target architectures consisting of the following functional units: CPU with the HW size p , a,, a,, Am 2900 processor family was used in all examples for CPU building. The SW complexity varied from 10^3 to 10^5 processor cycles with a minimal CPU cycle time of 200 ns. The maximal data block was 512 bytes, with a hypergeometric distribution of data arrivals from 10 μs to 230 μs and a confidence probability of 0.95. The maximal delay coefficient for the CPU with DMA logic was not more than 1.04.
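To give a feeling for the magnitudes involved, the figures above can be combined into a rough run time range. This is a back-of-the-envelope sketch under a stated assumption that is not confirmed by the text: the delay coefficient is taken to scale the SW run time multiplicatively. The function name is hypothetical.

```python
def run_time_us(cycles, cycle_ns=200.0, delay_coeff=1.04):
    """Run time in microseconds for a given number of processor cycles.

    Assumption: the DMA delay coefficient multiplies the undisturbed run
    time (cycles * cycle time).
    """
    return cycles * cycle_ns * delay_coeff / 1000.0


# 10^3 and 10^5 cycles are the bounds of the SW complexity range quoted above.
shortest = run_time_us(10**3)  # about 208 microseconds
longest = run_time_us(10**5)   # about 20.8 milliseconds
```

Under this assumption, even the smallest SW complexity at the 200 ns cycle time takes about 208 μs, which is close to the upper end of the 10 μs to 230 μs arrival range.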
The HW/SW codesign space exploration
The codesign space exploration is the first stage of performance-complexity estimation in HW/SW partitioning.
During this stage, the Pareto optimal sets of system alternatives in the hardware size (H) - system performance (T) space are extracted. The next step is clustering of the variants according to the timing constraint T'. One can assume that for every feasible value of the SW complexity there are one or more variants (the Pareto optimal set) of the designed system in the H-T space.
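The extraction of a Pareto optimal set and the subsequent selection by timing constraint can be sketched as follows. This is a minimal illustration under the assumption that both the hardware size and the run time are to be minimized. The variant names, the numeric values, and the function names are hypothetical, and the clustering step is reduced here to a simple feasibility filter.

```python
def pareto_front(variants):
    """Return the variants not dominated in (hw_size, run_time).

    variants: list of (name, hw_size, run_time) tuples; both hw_size and
    run_time are minimized.
    """
    front = []
    best_time = float("inf")
    # Sort by hw_size, then run_time. A variant is kept only if it is
    # strictly faster than every variant of smaller or equal HW size.
    for name, hw, t in sorted(variants, key=lambda v: (v[1], v[2])):
        if t < best_time:
            front.append((name, hw, t))
            best_time = t
    return front


def feasible(front, time_limit):
    """Keep only the variants that satisfy the timing constraint."""
    return [v for v in front if v[2] <= time_limit]


# Hypothetical design variants: (name, HW size, run time).
variants = [
    ("all_sw", 10, 90),
    ("mix_1", 25, 60),
    ("mix_2", 30, 70),  # dominated by mix_1: larger and slower
    ("mix_3", 40, 35),
    ("all_hw", 80, 20),
]
front = pareto_front(variants)
candidates = feasible(front, time_limit=50)
```

Because the front is ordered by increasing HW size, the first feasible entry ("mix_3" in this example) is the smallest implementation that meets the timing constraint.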
The determination of the partitioning direction
This is the second stage of performance-complexity estimation. In the experiments we assumed the weights ap = aF = a, = aM = 1 in order to show explicitly how the SW complexity and the performance vary during HW/SW partitioning. As Figure 2 shows, under the fixed time constraint q, if the SW complexity increases, the HW size must be increased to keep the time constraint satisfiable. The coefficients ap , aF, a,, aM must be adapted during this stage to minimize the CF in (6).
As with any acceptable partitioning, the proposed approach minimizes the HW-SW communication (the CU portion decreases as the software run time portion increases).
Conclusions and future work
The major result of this work is a method of performance-complexity analysis in HW/SW partitioning for real-time systems under timing constraints.
The distinctive features of the method are (a) rapid performance-complexity estimation for SW, based on the set of introduced stochastic characteristics and the experimental investigation of the SW; and (b) exploration of the HW/SW codesign space by extracting the Pareto optimal sets of system variants, which makes it possible to define the direction of the partitioning process for the cost function minimization. These features define the adaptive HW/SW partitioning.
The proposed approach will be extended to include RISC processors (Intel i860, Motorola M88000, Sun SPARC) and to use the DLX RISC core for generalizing the processing model. The GSSS system is now integrated with Vantage Optium™, version .5.100, containing Styx, for adequate performance analysis of the total execution time that accounts for real HW delays.
