HW-SW partitioning is an important problem in the HW-SW Codesign of embedded systems. We establish a HW-SW partitioning model based on the system's Basic Scheduling Block (BSB) graph and propose a Modified Genetic Partitioning Algorithm (MGPA). By adopting an adaptive fitness function definition and a novel evolution strategy, we enhance the stability, efficiency and result quality of the partitioning algorithm. Experimental results show the algorithm's effectiveness in solving the HW-SW partitioning problem.
I. INTRODUCTION
With the development of IC technology, modern embedded systems are becoming larger in scale and more complicated in function. SOCs (Systems on Chip) made up of processors, coprocessors, memories, A/D, D/A and IP cores are now used frequently. Meanwhile, customers and markets place stricter demands on system properties such as energy consumption, cost and time-to-market. Traditionally, hardware and software are designed independently and the partitioning into HW and SW subsystems depends on the designer's experience, so design problems usually surface only at the last minute.
The traditional design methods therefore no longer meet the design requirements of modern embedded systems. Because of these disadvantages, HW-SW Codesign is becoming a promising solution for modern embedded system design [1, 2, 4].
HW-SW Codesign replaces conventional manual HW-SW subsystem partitioning with an automatic, optimized exploration of the system architecture. It comprises three steps: allocation of resources, assignment and scheduling. Allocation determines the kinds and quantities of system resources (processors, dedicated hardware, buses and so on). Assignment determines which resource realizes each system function. Scheduling arranges the order of execution on each resource. The result of scheduling reflects the performance of the system and can guide the adjustment of the allocation and assignment steps. Through HW-SW Codesign, the designer can develop rapid prototypes and estimate the performance of the system at a high-level stage, and thus optimize the system's performance, cost, energy consumption and so on. In this method, the processor-coprocessor system architecture is primarily adopted. The optimization goal is to maximize the system performance under a given cost restriction by selecting appropriate BSBs to be realized in hardware. This method is also called HW-SW Partitioning. This paper focuses on the HW-SW Partitioning problem and proposes an efficient partitioning algorithm.
II. RELATED WORKS
At present, few algorithms address HW-SW partitioning. Ernst presented a software-oriented approach [9]. In his approach, all functions are first implemented in software, and a Simulated Annealing (SA) algorithm then selects the functions to be implemented in hardware. The sum of all hardware costs in the system forms the system cost. When the algorithm terminates, it yields the combination of HW-realized and SW-realized functions that gives the maximum performance under the given system cost constraint. However, the run time was long, and the SA search parameters (initial temperature and annealing speed), which were hard to select, greatly influenced the result quality.
Gupta proposed a hardware-oriented approach [SI. All functions were first implemented in hardware. His algorithm then gradually selected parts to be realized in software to reduce the system cost, until the system performance failed to meet the given performance requirement. In this way the minimum-cost system realization could be achieved. However, obtaining the initial all-hardware solution that the algorithm requires was difficult and impractical because of the complexity of system functions.
0-7803-55 15-2/04/$20.00 ©2004 IEEE
Knudsen presented the PACE approach [SI. Like Ernst's method, his algorithm was software-oriented: all functions were initially implemented in software, and a dynamic programming scheme was then used to perform the HW-SW partitioning. His algorithm required the system cost constraint to be given as an integer, and its complexity was directly proportional to the square of the number of BSBs and to the cost constraint. When running, the algorithm needed to hold a large programming table and placed high demands on RAM and CPU power.
To reduce this complexity, the partitioning of BSBs was limited to neighboring sequences of BSBs, but this reduced the optimization potential.
Kalavade put forward the MIBS algorithm [11]. During partitioning, the algorithm examined the BSBs one by one and dynamically adjusted the threshold for assigning each BSB to HW or SW by considering both the global effect and the local characteristics of the BSB. The algorithm supports selecting among multiple HW implementations of one BSB. MIBS is a heuristic algorithm, however, and its heuristic rules do not suit all possible situations. Moreover, the order in which the BSBs are processed greatly influences the result: if a previously partitioned BSB occupied a large portion of the system resources, the following BSBs could not be partitioned as MIBS's adjustment intended, which often occurred and worsened the result quality. Besides, MIBS ignored the overhead of data exchange between hardware and software, which sometimes plays a big role in realistic systems.
HW-SW partitioning is a combinatorial optimization problem, and the model and the efficiency of the algorithm are key to solving it. Moreover, the data exchange between the HW and SW subsystems must be taken into account. In this paper we establish a model of HW-SW partitioning based on the BSB graph and propose a partitioning algorithm based on a modified genetic algorithm.
The remainder of this paper is organized as follows: Section III focuses on the description of the system function as a BSB graph. Section IV gives a model for HW-SW partitioning. Section V proposes a HW-SW partitioning algorithm based on a modified genetic algorithm. Section VI contains the experimental results with analysis. Finally, Section VII concludes the paper.
III. DESCRIPTION OF SYSTEM FUNCTION
Generally, the goal of a HW-SW partitioning algorithm is to optimize the performance of the system under the cost restrictions. Selecting an all-software implementation as the starting point is feasible, and C/C++ is usually employed to realize the whole system function. We can then describe the system function by converting the high-level language (C/C++) description to a control/data flow graph (CDFG) that comprises a node set $N$ and a directed edge set $E$. For $n_i, n_j \in N$, the edge $e_{i,j} = (n_i, n_j) \in E$ represents the control relation or data flow direction between the nodes $n_i$ and $n_j$.
A node $n_i \in N$ can take one of the following forms.
FU = CDFG
DFG represents a data flow graph and has no control structure; cond and loop represent the conditional branch and the loop control structure, respectively; branch1 and branch2 denote the different conditional branches; test and body represent the loop-termination test and the loop body; FU represents a subroutine or function call; wait is used to synchronize with the running environment.
BSBs can easily be obtained from the CDFG. Through the collapse operation, which merges neighbouring BSBs, we can reduce the number of BSBs and simplify the exploration of the solution space.
A simple example written in C is presented: i=1, j=0, r=0. Figure 2 gives the corresponding CDFG description and Figure 3 contains the corresponding BSBs, where each dot represents a BSB.
The BSB graph is composed of a node set $B$ and a directed edge set $E$. In the BSB graph, the sequence number of every node $B_i$ corresponds to the execution order of the BSBs. For $B_i, B_j \in B$, the edge $e_{i,j} = (B_i, B_j) \in E$ denotes a data exchange relationship between the two BSBs, and the property $v(e_{i,j})$ denotes the data exchange volume. Every BSB node can get data from its predecessor nodes and output data to its successor nodes. A typical BSB system is demonstrated in Figure 4.
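The BSB graph described above can be held in a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `BSBGraph`, its methods, the 0-based node indices and the example volumes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BSBGraph:
    """Directed BSB graph; node i stands for B_i (0-based here)."""
    num_nodes: int
    # volume[(i, j)] holds v(e_ij), the data volume sent from B_i to B_j
    volume: dict = field(default_factory=dict)

    def add_edge(self, i: int, j: int, v: int) -> None:
        """Record that B_i passes v units of data to B_j."""
        for node in (i, j):
            if not 0 <= node < self.num_nodes:
                raise ValueError(f"unknown BSB index {node}")
        self.volume[(i, j)] = v

    def predecessors(self, j: int) -> list:
        """BSBs that B_j reads data from."""
        return sorted(i for (i, k) in self.volume if k == j)

    def successors(self, i: int) -> list:
        """BSBs that B_i writes data to."""
        return sorted(k for (j, k) in self.volume if j == i)


# Hypothetical four-node example; the volumes are made up for illustration.
graph = BSBGraph(num_nodes=4)
graph.add_edge(0, 1, 8)
graph.add_edge(0, 2, 4)
graph.add_edge(1, 3, 2)
graph.add_edge(2, 3, 6)
```

Storing the volumes in a dictionary keyed by edge keeps the lookup of $v(e_{i,j})$ direct, which is convenient when only the edges crossing the HW-SW boundary matter.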
Every BSB node $B_i$ can be defined by a tuple.
In Figure 5, we can see that the Primary Processor is a general-purpose processor and the Coprocessor is dedicated hardware such as an ASIC or FPGA. The Primary Processor and the Coprocessor are coupled through shared memory used for data exchange. The aim of our algorithm is to minimize the run time $T$ of the target system while meeting the cost constraint $CostReq$, which is given before design.
The partitioning model is based on the BSB graph. Its parameters include the time in which variables are read from shared memory to SW, the time in which variables are written from HW to memory, and the time in which variables are read from memory to HW. $pc_i(v)$ denotes the number of times the variable $v$ of $B_i$ is accessed in a single run, which is equal to $pc_i$, the number of times that $B_i$ is accessed.
The cost of the target system can be defined as the sum of the costs of all the BSBs realized in hardware:
$Cost = \sum_{\forall B_i \in HW} cost(B_i)$
The goal of the partitioning can be defined as
Minimize $T$ subject to $Cost \le CostReq$
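To make the optimization goal concrete, the sketch below evaluates one candidate partition. It is only an illustration under a simplified timing model that is an assumption, not the model of this paper: each BSB contributes its SW or HW execution time, and every edge crossing the HW-SW boundary adds a transfer time proportional to its data volume. The names `evaluate_partition`, `is_feasible` and `comm_time_per_unit`, and all numbers, are hypothetical.

```python
def evaluate_partition(in_hw, sw_time, hw_time, hw_cost, volume, comm_time_per_unit):
    """Return (run_time, cost) for one HW/SW assignment.

    in_hw[i] is True when BSB i is realized in hardware.
    ASSUMED model: per-BSB execution times are summed, and each edge that
    crosses the HW/SW boundary adds volume * comm_time_per_unit.
    """
    run_time = 0.0
    cost = 0.0
    for i, hw in enumerate(in_hw):
        run_time += hw_time[i] if hw else sw_time[i]
        if hw:
            cost += hw_cost[i]  # only hardware-realized BSBs add cost
    for (i, j), v in volume.items():
        if in_hw[i] != in_hw[j]:  # data must pass through shared memory
            run_time += v * comm_time_per_unit
    return run_time, cost


def is_feasible(cost, cost_req):
    """Constraint of the partitioning goal: Cost <= CostReq."""
    return cost <= cost_req


# Made-up data for three BSBs; only BSB 1 is moved to hardware.
sw_time = [10, 20, 15]
hw_time = [2, 5, 3]
hw_cost = [4, 6, 5]
volume = {(0, 1): 3, (1, 2): 2}
run_time, cost = evaluate_partition(
    [False, True, False], sw_time, hw_time, hw_cost, volume, 1.0
)
```

In this example the all-software run time of 45 drops to 35 at a hardware cost of 6, which shows how communication overhead offsets part of the hardware speedup.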

V. PARTITIONING ALGORITHM BASED ON GENETIC ALGORITHM
Exploration efficiency, result quality and robustness are the main concerns of a partitioning algorithm. We propose a genetic-algorithm-based HW/SW partitioning algorithm. In accordance with the characteristics of the partitioning problem, we present an adaptive fitness function definition and evolution strategy.
A. Encoding, creation of initial population, population scale and termination condition
The goal of the algorithm is to determine the implementation pattern (SW or HW) of each BSB. Therefore, we employ binary encoding. The chromosome is defined as $(k_1, k_2, k_3, \ldots, k_n)$, $k_i \in \{1, 0\}$, $i \in \{1, \ldots, n\}$, where $n$ is the number of BSBs. If $k_i = 1$, the corresponding BSB is realized in software; if $k_i = 0$, it is realized in hardware. The initial population is created randomly. To maintain a diversified population and keep the algorithm efficient, we use a population size of $2N$ ($N$ is the number of BSBs) in the experiments and keep this size throughout the evolution. We terminate the algorithm after $3N$ generations.
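The encoding and the population settings above translate directly into code. This is a minimal sketch; the helper names `random_population` and `decode`, the fixed seed and the 0-based BSB indices are assumptions made for illustration.

```python
import random


def random_population(n, rng):
    """Create 2N random chromosomes of n bits (1 = software, 0 = hardware)."""
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(2 * n)]


def decode(chromosome):
    """Return (software_bsbs, hardware_bsbs) as lists of 0-based BSB indices."""
    sw = [i for i, k in enumerate(chromosome) if k == 1]
    hw = [i for i, k in enumerate(chromosome) if k == 0]
    return sw, hw


n = 5                                    # number of BSBs in this toy example
population = random_population(n, random.Random(0))
max_generations = 3 * n                  # termination after 3N generations
```

For example, `decode([1, 0, 0, 1, 0])` places BSBs 0 and 3 in software and BSBs 1, 2 and 4 in hardware.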
B. Fitness Function
We construct a generalized objective function based on the run time of the target system and the system cost constraint; the fitness function is derived from this objective function. In the objective function, the cost restriction is included as a penalty item. The penalty item is designed so that in the early stage of evolution the diversity of the population is guaranteed, while in the later stage only individuals satisfying the restriction have an advantageous fitness value. At the same time, the objective function should encourage individuals to fully utilize the given system cost to get a better result. Because the run time and the cost items usually have different units, we normalize the two items to balance their respective impact on the objective function value.
We introduce two normalization factors, one for the cost item and one for the run-time item.
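The behaviour required of the penalty item can be illustrated with a small sketch. It is not the objective function of this paper. It assumes that run time is normalized by the all-software run time, that cost is normalized by $CostReq$, that the penalty weight grows linearly with the generation number, and that a small slack term rewards use of the cost budget. The names `objective`, `fitness`, `max_penalty` and `slack_weight` and their default values are hypothetical.

```python
def objective(run_time, cost, cost_req, all_sw_time, generation, max_generations,
              max_penalty=10.0, slack_weight=0.1):
    """Smaller is better. All weights are illustrative assumptions."""
    time_term = run_time / all_sw_time               # normalized run time
    overrun = max(0.0, cost - cost_req) / cost_req   # normalized violation
    slack = max(0.0, cost_req - cost) / cost_req     # unused share of the budget
    # The penalty weight grows linearly from 0 to max_penalty, so violations
    # are tolerated early (diversity) and punished late (feasibility).
    penalty_weight = max_penalty * generation / max_generations
    return time_term + penalty_weight * overrun + slack_weight * slack


def fitness(run_time, cost, cost_req, all_sw_time, generation, max_generations):
    """Map the objective to a positive fitness value; larger is better."""
    return 1.0 / (1.0 + objective(run_time, cost, cost_req, all_sw_time,
                                  generation, max_generations))


# A fast but over-budget individual versus a slower feasible one.
early_infeasible = fitness(40, 15, 10, 100, generation=0, max_generations=30)
early_feasible = fitness(60, 10, 10, 100, generation=0, max_generations=30)
late_infeasible = fitness(40, 15, 10, 100, generation=30, max_generations=30)
late_feasible = fitness(60, 10, 10, 100, generation=30, max_generations=30)
```

With these assumed weights the over-budget individual ranks higher in generation 0 and lower in the final generation, which is the qualitative behaviour described above.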
C. Principle of selection and strategy of evolution
Selection is based on roulette selection and elite reserving. We adopt a double elite reserving tactic: we preserve the individual with the maximum fitness value, which may violate the cost restriction, along with the individual with the maximum fitness value among those that do not violate the cost restriction.
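The two selection mechanisms can be sketched as follows. This is an illustration under assumptions: the function names `pick_elites` and `roulette_select` are hypothetical, feasibility is passed in as a precomputed list of booleans, and fitness values are assumed to be non-negative with at least one positive value.

```python
import random


def pick_elites(population, fitness_values, feasible):
    """Return the elites kept by the double elite reserving tactic."""
    ids = range(len(population))
    best_overall = max(ids, key=lambda i: fitness_values[i])
    elites = [population[best_overall]]
    feasible_ids = [i for i in ids if feasible[i]]
    if feasible_ids:
        best_feasible = max(feasible_ids, key=lambda i: fitness_values[i])
        if best_feasible != best_overall:  # avoid keeping the same one twice
            elites.append(population[best_feasible])
    return elites


def roulette_select(population, fitness_values, rng):
    """Pick one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitness_values, k=1)[0]


# Toy population: the fittest individual violates the cost restriction.
population = [[1, 1, 0], [0, 0, 0], [1, 0, 1]]
fitness_values = [0.5, 0.9, 0.7]
feasible = [True, False, True]
elites = pick_elites(population, fitness_values, feasible)
parent = roulette_select(population, fitness_values, random.Random(1))
```

Keeping the best infeasible individual preserves good building blocks near the constraint boundary, while the best feasible individual guarantees that a valid solution is never lost.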
The crossover and mutation operations strongly influence the efficiency of the algorithm. In our view, crossover searches the result space broadly, while mutation searches the neighborhood of an individual. It therefore makes sense to use more crossover at the beginning of the algorithm and more mutation toward the end. With this strategy, we obtain a high search efficiency while keeping the population's diversity.
In the algorithm, $P\%$ of the individuals of the next generation are obtained by crossover, and the probability of the crossover operation is $P_c$ (in our experiments $P_c = 0.9$).
The other $(1-P)\%$ are obtained by mutation, which is applied to the individuals with the top $(1-P)\%$ fitness values. The number of bits flipped in a mutation is $m$. As the evolution proceeds, we gradually decrease the number of individuals generated by crossover and increase the number generated by mutation, that is, $P = P_0 + i \cdot dp$, where $i$ is the generation index and $dp$ is the change in proportion between two successive generations; $dp$ is negative because $P$ decreases.
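The schedule $P = P_0 + i \cdot dp$ and the $m$-bit mutation can be sketched as below. This is a minimal illustration with assumptions: $P$ is treated as a percentage clamped to the range 0 to 100 so that the mutation share is $100 - P$, the values $P_0 = 90$ and $dp = -1$ are made up, and the helper names are hypothetical.

```python
import random


def crossover_share(p0, dp, generation):
    """P = P0 + i * dp, clamped to the range [0, 100] percent."""
    return min(100.0, max(0.0, p0 + generation * dp))


def offspring_counts(pop_size, p):
    """Split the next generation into (by_crossover, by_mutation)."""
    by_crossover = round(pop_size * p / 100.0)
    return by_crossover, pop_size - by_crossover


def mutate(chromosome, m, rng):
    """Return a copy with m distinct, randomly chosen bits flipped."""
    child = list(chromosome)
    for pos in rng.sample(range(len(child)), m):
        child[pos] ^= 1
    return child


# Illustrative values: P starts at 90% and drops by one point per generation.
p_start = crossover_share(90.0, -1.0, 0)
p_later = crossover_share(90.0, -1.0, 30)
counts_later = offspring_counts(20, p_later)

parent = [1, 0, 1, 0, 1, 1]
child = mutate(parent, 2, random.Random(0))
```

With a population of 20, generation 30 under these values produces 12 individuals by crossover and 8 by mutation, so the search gradually shifts from broad exploration to local refinement.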
VI. EXPERIMENTS AND ANALYSIS
Because no original data could be obtained from the literature, we randomly created BSB data for the experiments, taking care that the data are reasonable. The SA algorithm presented in [9] is used for comparison with our algorithm. Table 1 shows the data of a 10-node system. Table 2 lists the data exchange relationships between BSB nodes. Table 3 lists partial experimental results. From Table 3, we can see that our algorithm is more efficient than the SA algorithm while obtaining the same or better system speedup.
The curve in Figure 6 shows the convergence of the averaged objective function for a 50-node system over 20 stochastic runs. The curve in Figure 7 shows the performance results of a 50-node system over 20 stochastic runs and demonstrates the stability of our algorithm.
... partitioning result. In the future, our work will focus on several aspects: enhancing the efficiency of the HW-SW partitioning algorithm, and improving the efficiency of converting the system description from a high-level language to BSB and CDFG and of acquiring the BSBs' property values. Optimizing the system's cost, performance and energy consumption simultaneously, which requires multi-objective optimization, is also a problem deserving further research.
