Abstract. Automated hardware-software partitioning of an RTOS in a SoC (SoC-RTOS partitioning) is a crucial step in SoC hardware-software co-design. First, this paper introduces a new model for SoC-RTOS partitioning, which helps clarify the essence of the problem. Second, it proposes a discrete Hopfield neural network approach to SoC-RTOS partitioning, in which a novel energy function is defined and the operating equation and network coefficients are redefined. Third, simulations compare the proposed method with a genetic algorithm and an ant algorithm in terms of partitioning performance and search time. Experimental results demonstrate the feasibility and effectiveness of the proposed method.
Introduction
As a new type of embedded system, a SoC (System-on-a-Chip) implements almost all the functionality of a complete computer system on a single IC (integrated circuit). In general, the embedded software in a SoC consists of an RTOS (Real-Time Operating System) and embedded application software; the RTOS in a SoC is called SoC-RTOS for short. SoCs have recently become increasingly popular in the market, and their architecture means that SoC-RTOS functionality need not be implemented solely in software: some SoC-RTOS functions can be implemented in hardware, which can greatly improve SoC-RTOS performance. Thus, hard-

In [9], Mooney III presents the δ hardware-software generation framework for SoCs, which provides automatic hardware-software configurability of the RTOS among a few pre-designed partitions, e.g., SFR (Special Function Register), RTU (Real-Time Unit), LC (Lock Cache), DDU (Deadlock Detection Unit) and DMMU (Dynamic Memory Management Unit). These hardware facilities, which implement the task management and IPC (inter-process communication) of the RTOS, remarkably improve the performance of the multi-task SoC-RTOS. Similarly, [10] designs a SystemWeaver module for a multi-core SoC architecture that acts as a hardware task controller implementing the task list management and IPC of the RTOS. In fact, many modern CISC and RISC microprocessors, such as Intel's Pentium and ARM's ARM cores, also provide special control registers and corresponding hardware circuits to support process switching and IPC for RTOSs and for general-purpose OSs (operating systems) such as Windows and Linux.
However, the SoC-RTOS partitioning solutions in [9] and [10] are based on experience, and it is difficult to guarantee that they are optimal. To further advance RTOS, SoC and microprocessor design, it is imperative to explore the theoretical aspects of SoC-RTOS partitioning.
Owing to the characteristics of the SoC-RTOS, SoC-RTOS partitioning is quite different from the partitioning of general embedded systems and SoCs, and the usual hardware-software partitioning methods are inadequate for it in many respects. Combining hardware and software elements in the SoC-RTOS raises new problems, such as modeling the SoC-RTOS, refining constraints and multi-objective conditions, designing an appropriate optimization algorithm, evaluating partitioning results, and addressing system architecture issues. In this paper, we focus on developing an optimization algorithm for SoC-RTOS partitioning [11].
Description of the SoC-RTOS Partitioning Problem
SoC-RTOS partitioning is an NP-complete problem whose main objective is to optimally allocate the functional behavior of the RTOS to the hardware and software of the SoC under given constraints. Some of the literature also treats SoC-RTOS partitioning as part of SoC-RTOS hardware-software co-synthesis. The functional behavior of the SoC-RTOS can be modeled by a task graph. In software, a task is a set of coarse-grained operations with a well-defined interface, such as an algorithmic procedure, an object or a component; in hardware, a task is a specific IP (Intellectual Property) module with clear functions, interfaces and constraints [11, 12].
To formulate our problem, the following notations are used in this paper:
G: a directed acyclic graph (DAG), which also denotes the task graph of a SoC-RTOS, $G = (V, E)$.
V: the set of task nodes to be partitioned, $V = \{v_1, v_2, \ldots, v_N\}$.
E: the set of directed edges representing the control or data dependencies and communication relationships between pairs of nodes.
The total performance of P.
Definition 1 (k-way partitioning). For a given $G = (V, E)$, P is called a k-way partitioning if there exists a cluster set $P = \{p_1, p_2, \ldots, p_k\}$ such that $\bigcup_{i=1}^{k} p_i = V$ and $p_i \cap p_j = \emptyset$ for all $i \neq j$.
When $k = 2$, P is called a bi-partitioning, which means that only one software context (e.g., one general-purpose processor) and one hardware context (e.g., one ASIC or FPGA) are considered in the target system; when $k > 2$, P is called a multi-way partitioning, which means that multiple software contexts and multiple hardware contexts are considered. According to the architecture of the target system, SoC-RTOS partitioning can be categorized into bi-partitioning and multi-way partitioning. Bi-partitioning is the foundation of multi-way partitioning and is widely used in practice; hence, unless stated otherwise, "partitioning" in this paper refers to bi-partitioning.
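Definition 1 can be checked mechanically. The following Python sketch tests whether a list of clusters forms a k-way partitioning of a node set, assuming the standard conditions that the clusters are pairwise disjoint and together cover V. The function name is hypothetical, and empty clusters are allowed here (so an all-software bi-partitioning still counts).

```python
from typing import Hashable, Iterable, Sequence, Set


def is_k_way_partition(nodes: Iterable[Hashable],
                       clusters: Sequence[Set[Hashable]], k: int) -> bool:
    """Return True if `clusters` is a k-way partitioning of `nodes`.

    Assumed conditions: exactly k clusters, pairwise disjoint,
    and their union equal to the full node set V.
    """
    node_set = set(nodes)
    if len(clusters) != k:
        return False
    seen: Set[Hashable] = set()
    for cluster in clusters:
        if seen & cluster:      # node already placed in an earlier cluster
            return False
        seen |= cluster
    return seen == node_set     # union must cover V exactly


# Bi-partitioning (k = 2): one software context and one hardware context.
V = range(6)
software = {0, 2, 3}
hardware = {1, 4, 5}
print(is_k_way_partition(V, [software, hardware], k=2))  # True
print(is_k_way_partition(V, [software, {1, 2}], k=2))    # False (node 2 appears twice)
```

The disjointness check is done incrementally so that an overlap is reported as soon as it appears.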
Definition 2 (SoC-RTOS partitioning). For a given $G = (V, E)$, SoC-RTOS partitioning is formulated as the following constrained optimization problem:
A Novel Discrete Hopfield Neural Network Approach
Discrete Hopfield neural networks have been successfully applied to signal and image processing, pattern recognition and optimization. In this paper, we employ this type of neural network to solve the SoC-RTOS partitioning optimization problem, and refer to the resulting method as the discrete Hopfield neural network approach (DHNNA).
Neuron expression
A neural network with N neurons is used, one neuron for each of the N nodes in graph G. The i-th neuron indicates the subset to which node i belongs, and has input $U_i$ and output $V_i$. The output of the neuron is given by:
where the neuron output
To avoid local optima caused by the initial conditions, the neuron input values should be restricted to a certain range.
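The neuron output equation itself is not reproduced here, so the following is only a sketch of a conventional discrete Hopfield neuron: a binary threshold unit whose input is clipped to a fixed range. The threshold at zero, the encoding (1 for hardware, 0 for software) and the bounds `u_min`/`u_max` are assumptions for illustration, not values taken from this paper.

```python
from typing import List


def neuron_output(u: float) -> int:
    """Binary step output; 1 = hardware context, 0 = software context (assumed encoding)."""
    return 1 if u >= 0.0 else 0


def clip_input(u: float, u_min: float, u_max: float) -> float:
    """Restrict a neuron input to [u_min, u_max] (hypothetical bounds)."""
    return max(u_min, min(u_max, u))


inputs: List[float] = [-3.2, 0.5, 12.0]
clipped = [clip_input(u, -5.0, 5.0) for u in inputs]
outputs = [neuron_output(u) for u in clipped]
print(clipped)  # [-3.2, 0.5, 5.0]
print(outputs)  # [0, 1, 1]
```

Clipping keeps every input within a bounded distance of the threshold, so no single neuron becomes so saturated that later updates cannot flip it.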
Energy function
To reflect the constraint and the objective of SoC-RTOS partitioning, an energy function consisting of the following two terms is defined:
where A and B are two positive coefficients specified in Subsection 3.4 below; α is the system architecture speedup ratio, i.e., the performance of the hardware-software partitioned SoC-RTOS relative to that of a purely software SoC-RTOS; and $\beta_i$ is the hardware task speedup ratio, which takes different values for different tasks and gives the performance of the hardware implementation of task node i relative to its software implementation.
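To make the two speedup ratios concrete, the sketch below treats performance as the inverse of running time, so that $\beta_i$ is the software running time of node i divided by its hardware running time, and α is the running time of the purely software SoC-RTOS divided by that of the partitioned one. This reading of the performance comparison, the function names and all numbers are assumptions for illustration.

```python
from typing import Sequence


def task_speedup(sw_time: float, hw_time: float) -> float:
    """beta_i: software running time / hardware running time of the same task node."""
    return sw_time / hw_time


def architecture_speedup(sw_times: Sequence[float], partitioned_time: float) -> float:
    """alpha: purely software running time / partitioned running time.

    With every node in software there is no cross-context communication,
    so the software-only time is just the sum of the node times.
    """
    return sum(sw_times) / partitioned_time


sw_times = [40.0, 25.0, 10.0]
hw_times = [8.0, 5.0, 10.0]
betas = [task_speedup(s, h) for s, h in zip(sw_times, hw_times)]
print(betas)  # [5.0, 5.0, 1.0]

# Hypothetical partition: nodes 0 and 1 in hardware, node 2 in software,
# plus 2.0 time units of communication between the two contexts.
partitioned = 8.0 + 5.0 + 10.0 + 2.0
print(architecture_speedup(sw_times, partitioned))  # 3.0
```

Note that a node with $\beta_i = 1$ (node 2 here) gains nothing from hardware, so moving it would only consume area.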
The function used in Eq. (7) is given by:
Operating equation
The operating equation for the i-th neuron is governed by:
To escape local optima and obtain a high-quality solution within a limited computation time, a noise term D, given by Eq. (11), is added to the operating equation (10); that is,
If the noise term D were kept in the updating rule throughout, the state would change excessively and not even a local optimum might be reached. Hence, the term D is discarded from the operating equation
where $\lfloor \cdot \rceil$ is a round-off operator, i.e., it returns the integer closest to its argument, T is a positive coefficient, and $T_{max}$ is the maximum number of iterations.
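Because Eq. (11) is not reproduced here, the following sketch uses a hypothetical noise schedule in the spirit of the description: a rounded random perturbation scaled by the positive coefficient T, decaying over the iterations and dropped entirely once the step count reaches `t_max` (standing in for $T_{max}$). The linear decay, the uniform distribution and the function names are assumptions, not the paper's Eq. (11).

```python
import random


def noise_term(t: int, T: float, t_max: int, rng: random.Random) -> int:
    """Hypothetical noise term D(t) for iteration t.

    The amplitude T decays linearly to zero, and the term is discarded
    (returns 0) once t reaches t_max. Python's round() gives the nearest
    integer (exact .5 ties go to the even integer).
    """
    if t >= t_max:
        return 0  # noise no longer added to the operating equation
    amplitude = T * (1.0 - t / t_max)
    return round(amplitude * rng.uniform(-1.0, 1.0))


def update_input(u: float, delta_u: float, t: int, T: float,
                 t_max: int, rng: random.Random) -> float:
    """One step: deterministic increment delta_u (from Eq. (10), not shown) plus noise."""
    return u + delta_u + noise_term(t, T, t_max, rng)


rng = random.Random(0)
schedule = [noise_term(t, T=4.0, t_max=5, rng=rng) for t in range(8)]
print(schedule)  # the last three entries are always 0
```

Decaying the noise lets early iterations jump between basins while later iterations settle deterministically, which matches the stated reason for discarding D.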
Setting of coefficients for the operating equation
The coefficient A depends on the average value ω of the task node costs; that is,
The coefficient B depends on $U_{avg}$. In this study, we set
Performance Evaluation by Simulation
To verify the feasibility and effectiveness of the proposed method, we employed simulation methods similar to those used in [6], [14] and [15], and carried out a comparative study with a genetic algorithm (GA) and an ant algorithm (AA). This study targets the bi-partitioning problem, so the target system contains one processor and one programmable hardware component (e.g., an FPGA). We use the Xilinx Spartan-3 S1000 chip as our FPGA model, which can contain at most 4 processor cores and 17,280 programmable logic blocks (PLBs).
Target system architecture
In our experiments, we use only one processor and 15,452 PLBs. The target system architecture is shown in Fig. 1. Since software is stored in memory and executed by the MPU, the ARM core and memory represent the software context, while the FPGA PLBs, indicated by the bold box in Fig. 1, represent the hardware context.
Simulation conditions
To date, no standard benchmarks or test cases are available for this topic. The method commonly adopted in the literature is to generate random DAGs and assign attributes to their nodes and edges.
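As a sketch of this common practice, the generator below builds a random DAG by adding each forward edge (i, j) with i < j independently with probability `rho`, which guarantees acyclicity, and attaches software time, hardware time, hardware area and communication cost attributes. Interpreting the edge generation ratio ρ as a per-pair edge probability, the `TaskGraph` layout and the cost ranges are all assumptions; the experiments described here use GVF and MediaBench-derived costs instead.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TaskGraph:
    sw_time: List[float]   # running time of each node in software
    hw_time: List[float]   # running time of each node in hardware
    hw_area: List[int]     # PLBs occupied if the node is placed in hardware
    comm: Dict[Tuple[int, int], float] = field(default_factory=dict)  # edge -> cost


def random_task_graph(n: int, rho: float, seed: int = 0) -> TaskGraph:
    """Random DAG with n nodes; each pair i < j becomes an edge with probability rho.

    Edges only run from lower to higher index, so no cycle can form.
    Cost ranges are arbitrary placeholders.
    """
    rng = random.Random(seed)
    sw = [rng.uniform(10.0, 100.0) for _ in range(n)]
    hw = [s / rng.uniform(1.5, 10.0) for s in sw]  # hardware assumed faster
    area = [rng.randint(20, 400) for _ in range(n)]
    comm = {(i, j): rng.uniform(1.0, 5.0)
            for i in range(n) for j in range(i + 1, n) if rng.random() < rho}
    return TaskGraph(sw, hw, area, comm)


g = random_task_graph(n=50, rho=0.1, seed=42)
print(len(g.sw_time), len(g.comm))
```

Seeding a private `random.Random` instance makes each sample graph reproducible, which helps when averaging results over a group of graphs.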
To simplify our simulations, the following assumptions are made:
(1) The costs (e.g., running time and occupied hardware area) of task implementation on the processor and PLBs are static and can be calculated in advance.
(2) The costs of communication between the nodes in different contexts are constant during the execution time.
(3) To compare with the software implementation under equal conditions, the parallelism of the hardware implementation is neglected.
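Under assumptions (1) to (3), the cost of a bi-partitioning reduces to simple sums. The sketch below, with hypothetical names and numbers, adds the running time of each node in its assigned context, adds the communication cost of every edge whose endpoints lie in different contexts, and totals the PLB area of the hardware nodes so that the area constraint can be checked.

```python
from typing import Dict, Sequence, Tuple


def evaluate_partition(
    in_hw: Sequence[bool],
    sw_time: Sequence[float],
    hw_time: Sequence[float],
    hw_area: Sequence[int],
    comm: Dict[Tuple[int, int], float],
) -> Tuple[float, int]:
    """Return (total running time, occupied hardware area) of a bi-partitioning.

    Static node costs, constant cross-context communication costs and
    no hardware parallelism mean all times simply add up.
    """
    total_time = sum(h if hw else s for hw, s, h in zip(in_hw, sw_time, hw_time))
    # Only edges crossing the software/hardware boundary incur communication cost.
    total_time += sum(c for (i, j), c in comm.items() if in_hw[i] != in_hw[j])
    area = sum(a for hw, a in zip(in_hw, hw_area) if hw)
    return total_time, area


def is_feasible(area: int, area_limit: int) -> bool:
    """Area constraint: occupied PLBs must not exceed the available PLBs."""
    return area <= area_limit


result = evaluate_partition(
    in_hw=[True, True, False],
    sw_time=[40.0, 25.0, 10.0],
    hw_time=[8.0, 5.0, 10.0],
    hw_area=[300, 200, 100],
    comm={(0, 1): 1.5, (1, 2): 2.0},
)
print(result)                          # (25.0, 500)
print(is_feasible(result[1], 15452))   # True
```

Edge (0, 1) costs nothing here because both endpoints are in hardware; only edge (1, 2) crosses contexts.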
The simulation settings are as follows:
(1) We use the GVF (Graph Visualization Framework) software package to generate 5 groups of random DAGs as our task graphs. To obtain exact results within a limited time, the numbers of task nodes (N) in the groups are set to 50, 100, 300, 800 and 1200, respectively. Each group has 30 sample graphs, each with a different edge generation ratio (ρ). The average over the 30 samples in each group is taken as the final performance result of that group [6].
(2) For the costs of task nodes and the communication costs of edges, each task node is associated with two functions, a hardware function and a software function, while each edge is associated with one function. The output of each function is taken as the cost of the corresponding task node or edge. The cost functions for the task nodes and edges are chosen from the MediaBench benchmark suite [6].
(3) As the initial partitioning, $N/2$ task nodes are assigned to each subset.
(4) The simulation environment was an Intel Celeron 2.6 GHz processor with 512 MB SDRAM, running Linux 9.0 and the KDevelop 3.2 IDE.
Simulation results and analysis
Table 1 shows the search time and performance produced by the DHNNA, GA and AA for different numbers of nodes. Fig. 2 shows the relationship between search time and the number of task nodes for the three algorithms. The search time of the DHNNA is shorter than that of the GA but slightly longer than that of the AA. However, the overall performance achieved by the DHNNA is better than that of the other two algorithms; in particular, as the number of nodes increases, the DHNNA remarkably outperforms them.
In the target system architecture shown in Fig. 1, the aim of this experiment is to optimize the running time of the SoC-RTOS under a constraint on occupied hardware area. With some modification, the DHNNA can also be applied to hardware-software partitioning of embedded systems and SoCs, taking into account other constraints and optimization objectives such as energy consumption, hard real-time requirements and multiprocessor targets. As the number of nodes increases, the value of
Conclusions
In this paper, we developed a discrete Hopfield neural network approach to the SoC-RTOS partitioning problem. Based on the characteristics of SoC-RTOS partitioning, we defined a new energy function for the Hopfield neural network and gave some practical considerations for the state updating rule. Simulation results demonstrate that our method achieves better overall partitioning performance than the genetic algorithm and the ant algorithm, with a shorter search time than the genetic algorithm. The robustness of the solution with respect to the initial state and model parameters remains to be investigated in future work.
