ABSTRACT Complex embedded systems with multi-processing units are important platforms for running complex tasks. In the development of complex embedded systems, hardware/software partitioning plays an important role. In practical application, there are many dynamic tasks which require the hardware/software partitioning to be done in real time. It is necessary to design efficient algorithms to do this. In this paper, the shuffled frog leaping algorithm (SFLA) and the greedy algorithm (GRA) are used to generate a hybrid algorithm named SFLA-GRA. On the basis of the SFLA, the SFLA-GRA uses the greedy idea to terminate invalid iterations and adjust the search step size. By these greedy strategies, the algorithm can be effectively accelerated. Experimental results show that compared with the other swarm intelligence (SI) algorithms, the efficiency of the proposed algorithm has been improved.
I. INTRODUCTION
With the development of science and technology, the scale and complexity of tasks are increasing. To ensure the normal operation of complex tasks, the processing performance of embedded systems must be improved. Increasing the number and types of processing units can effectively improve that performance, so complex embedded systems with multi-processing units have become the main platforms for running complex tasks. Hardware/software partitioning is an important part of the development of such systems. A complex task consists of many subtasks, and hardware/software partitioning assigns these subtasks to software units or hardware units. Different task assignment schemes lead to different execution results, and finding the optimal scheme is the goal of hardware/software partitioning.
Hardware/software partitioning has two important parts: the partitioning models and the partitioning algorithms. To build a model, the architecture of the system and the objectives of the partitioning problem should be considered. With respect to the architecture, some works [1], [2] are based on an architecture consisting of a single software unit and a single hardware unit, while others [3], [4] are based on an architecture consisting of different types and quantities of software and hardware units. The system to be partitioned is generally given in the form of a task graph or a set of task graphs [5], and the parallelism and the communication between two processing units usually should be considered. With respect to the objectives, a hardware/software partitioning task may have one [6] or multiple [7] objectives. Common objectives include minimizing the execution time of tasks on the system, the hardware area of the system, the power consumption of the system, and the communication overhead of the system. The models differ between codesign environments [5].
Algorithms used to solve hardware/software partitioning problems can be divided into two categories: exact algorithms and heuristic algorithms. Exact algorithms [7]-[9] can find the best solution of the problem. However, hardware/software partitioning is an NP-hard problem [10], so exact algorithms become difficult and time-consuming to apply when the problem scale is large. Because heuristic algorithms can obtain optimal or near-optimal solutions within a reasonable time [11], they have become the main way to solve hardware/software partitioning problems [12]-[15].
As an important class of heuristic algorithms, SI algorithms [16]-[21] are flexible and highly adaptable [22]. These algorithms have been shown to be well suited to solving complex problems [6].
In recent years, heuristic algorithms have been increasingly applied to hardware/software partitioning. Building on the original heuristic algorithms, many methods have been proposed to improve the solution quality of hardware/software partitioning, such as the software-oriented and hardware-oriented greedy heuristic algorithms [23], the supervised shuffled frog leaping algorithm [24], and the position disturbed particle swarm optimization with invasive weed optimization [25]. These algorithms effectively improve the search ability and are able to obtain better solutions. In practical applications, most tasks are dynamic, so the hardware/software partitioning usually should be done in real time. Therefore, improving the efficiency of hardware/software partitioning without decreasing the solution quality is another important research topic. Several studies have aimed to reduce the running time of hardware/software partitioning. Using model reduction, work [26] proposes graph reduction techniques to reduce the design space for hardware/software partitioning. Using hardware acceleration, work [27] designs a parallel algorithm that can be accelerated by GPUs. In addition to these two methods, the most common way to reduce the running time is to improve the performance of the algorithm itself.
This paper mainly studies hardware/software partitioning algorithms and uses SI algorithms to solve hardware/software partitioning problems. SI algorithms are inexact algorithms, so it is difficult to judge whether the solutions they obtain are the best solutions. They search for optimal solutions through multiple iterations. These iterations are composed of valid iterations, which improve the quality of solutions, and invalid iterations, which do not and therefore reduce the efficiency of the algorithms. Increasing the total number of iterations can improve the solution quality, but it also increases the running time of the algorithms. Therefore, when SI algorithms are used in hardware/software partitioning, they should be improved to reduce the running time. There are two common ways to improve SI algorithms: generating new algorithms by the fusion of multiple algorithms, and designing an adaptive search strategy for the algorithms. Based on these two ways, some improved algorithms have been proposed in recent years [28]-[31]. Compared with the original algorithms, these improved algorithms can effectively improve the quality of the output solutions. However, because most improved algorithms mainly focus on improving the solution quality, their computational efficiency sometimes is not improved effectively.
Based on the above analysis, this paper focuses on studying efficient algorithms which can be used in hardware/software partitioning. We first select SFLA which has good performance when applied in hardware/software partitioning as the basic algorithm. Then, based on the idea of the fusion of multiple algorithms, SFLA is hybridized with GRA [32] to generate an improved algorithm named SFLA-GRA. Compared with SFLA, the hybrid algorithm can effectively reduce the number of invalid iterations and improve the algorithm efficiency.
The rest of this paper is organized as follows. Section II presents the related knowledge of hardware/software partitioning. Section III proposes the hybrid algorithm SFLA-GRA. Section IV gives the experimental results, and Section V concludes our work.
II. HARDWARE/SOFTWARE PARTITIONING PROBLEM
The models differ between codesign environments. Most hardware/software partitioning methods work well within their own codesign environments, but they are difficult to compare because of the large differences between those environments and the lack of benchmarks [5]. Compared with exact algorithms, SI algorithms have stronger adaptability and can be more easily used to solve different models. In this paper, we mainly focus on studying efficient algorithms for hardware/software partitioning. To facilitate the research of the algorithms, our work is based on a simplified model.
In this model, the architecture is made up of one software unit and multiple hardware units of the same type. In this architecture, tasks assigned to the hardware units can be executed concurrently, while tasks assigned to the software unit must be executed sequentially. When there is communication between the software unit and a hardware unit or between two hardware units, communication time is incurred.
A Directed Acyclic Graph (DAG) is usually used to illustrate hardware/software partitioning. The graphs before and after partitioning are shown in Figure 1. As Figure 1(a) shows, node i has three attributes: software execution time $s_i$, hardware execution time $h_i$, and hardware area $a_i$. $c_{i,j}$ is the communication time between nodes i and j. Based on these attributes, tasks are assigned to the software unit or to hardware units. Figure 1(b) shows a result after partitioning, where nodes 3, 4, 6 and 7 are assigned to software and the other nodes are assigned to hardware.
The partitioning scheme can be encoded as an N-dimensional vector $X = (x_1, x_2, \ldots, x_N)$, where N is the number of task nodes and $x_i \in \{0, 1\}$: $x_i = 0$ means that task i is assigned to software and $x_i = 1$ means that task i is assigned to hardware.
The optimization objective is to minimize the critical path, that is, the longest path. The critical path determines the time required to execute the tasks on the embedded platform. The hardware area is set as a constraint. The optimization problem can be expressed by:
$$\min \; \max_{1 \le k \le M} TE(k) \quad \text{s.t.} \quad \sum_{i=1}^{N} a_i x_i \le A_{limit}$$
where TE(k) represents the completion time of the kth path, M is the number of paths, and A_limit denotes the constraint value of the hardware area.
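To make the model concrete, the following is a minimal Python sketch of how a partition vector could be evaluated. It is not the authors' implementation: the function names are hypothetical, the serialization of tasks on the single software unit is ignored, and an edge is assumed to cost $c_{i,j}$ unless both of its endpoints are in software.

```python
def critical_path_time(x, s, h, comm, order):
    """Longest-path completion time of partition x (0 = software, 1 = hardware).

    Assumptions (not from the paper): paths are timed independently, so
    sequential execution on the software unit is ignored; edge (i, j) costs
    comm[(i, j)] unless both nodes are in software. `order` must be a
    topological order of the task nodes.
    """
    preds = {j: [] for j in range(len(x))}
    for (i, j) in comm:
        preds[j].append(i)
    finish = [0.0] * len(x)
    for j in order:
        start = 0.0
        for i in preds[j]:
            delay = 0.0 if x[i] == 0 and x[j] == 0 else comm[(i, j)]
            start = max(start, finish[i] + delay)
        # Node cost is the hardware time if x[j] == 1, else the software time.
        finish[j] = start + (h[j] if x[j] == 1 else s[j])
    return max(finish)


def hardware_area(x, a):
    """Total area of the tasks assigned to hardware."""
    return sum(ai for xi, ai in zip(x, a) if xi == 1)


def is_feasible(x, a, a_limit):
    """Area constraint: the hardware area must not exceed a_limit."""
    return hardware_area(x, a) <= a_limit


# Toy diamond-shaped task graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
s = [4, 6, 5, 3]            # software execution times
h = [2, 2, 1, 1]            # hardware execution times
a = [3, 4, 2, 3]            # hardware areas
comm = {(0, 1): 1, (0, 2): 1, (1, 3): 1, (2, 3): 1}
order = [0, 1, 2, 3]
print(critical_path_time([0, 1, 1, 0], s, h, comm, order))  # 11.0
print(is_feasible([0, 1, 1, 0], a, a_limit=6))              # True
```

In this toy instance, moving nodes 1 and 2 to hardware shortens the longest path from 13 to 11 time units while using exactly half of the total area.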
III. SFLA-GRA
A. SFLA AND GRA
1) SFLA
SI algorithms usually share similar characteristics, such as ease of implementation and strong global optimization ability. For different problems, however, different SI algorithms usually perform differently. Our work is based on the original SI algorithms. In our previous works [6], [33], we used different original SI algorithms to solve hardware/software partitioning problems and found that SFLA has the highest efficiency. Therefore, SFLA is selected as the basis of the new algorithm. SFLA is inspired by the foraging behavior of a frog population. In this algorithm, the position of a frog represents a solution, and the search for optimal solutions is based on multiple iterations. Before the iterations, some solutions are generated randomly to form the initial population. There are three steps in each iteration: grouping, updating and shuffling. Frogs are divided into several groups in the grouping step and mixed together again in the shuffling step. Updating is the most important part of SFLA. In this step, the worst solution of each group moves toward better solutions to update itself.
To further analyze the process of SFLA, we use SFLA to solve a hardware/software partitioning problem with 500 task nodes. The maximum number of iterations is set to 1500, and the fitness curve of the solutions is shown in Figure 2.
As Figure 2 shows, invalid iterations account for a large proportion of the total iterations, and these invalid iterations clearly reduce the running speed of the algorithm. In addition, when there is a large number of successive invalid iterations, the solutions usually have reached a relatively high quality. Terminating the algorithm when the number of successive invalid iterations reaches a threshold is a common method to reduce the running time. However, valid iterations that improve the solution quality may still occur after a large number of successive invalid iterations, so this termination condition may make the algorithm miss higher quality solutions. Based on these observations, if valid iterations can be made to appear after only a small number of successive invalid iterations and a suitable termination condition is set, the running time of the algorithm can be effectively reduced.
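The three steps of an SFLA iteration can be sketched in Python as follows. This is an illustrative sketch only, not the implementation used in the experiments: the function names are hypothetical, the round-robin grouping follows the usual SFLA convention, and the binary move rule (copy each bit from the better frog with probability 0.5) is an assumption made for the 0/1 encoding.

```python
import random


def move_toward(bad, better, rng):
    """Assumed binary move: take each bit from `better` with probability 0.5."""
    return [b if rng.random() < 0.5 else w for w, b in zip(bad, better)]


def sfla_iteration(population, fitness, n_groups, rng):
    """One grouping / updating / shuffling pass; lower fitness is better."""
    ranked = sorted(population, key=fitness)
    # Grouping: deal the ranked frogs round-robin, so every group stays sorted.
    groups = [ranked[g::n_groups] for g in range(n_groups)]
    global_best = ranked[0]
    # Updating: only the worst frog of each group is moved.
    for group in groups:
        worst, best = group[-1], group[0]
        for target in (best, global_best):
            candidate = move_toward(worst, target, rng)
            if fitness(candidate) < fitness(worst):
                group[-1] = candidate
                break
        else:
            # Neither move helped: replace the worst frog by a random one.
            group[-1] = [rng.randint(0, 1) for _ in worst]
    # Shuffling: merge all groups back into one population.
    merged = [frog for group in groups for frog in group]
    rng.shuffle(merged)
    return merged


rng = random.Random(0)
frogs = [[rng.randint(0, 1) for _ in range(16)] for _ in range(12)]
frogs = sfla_iteration(frogs, fitness=sum, n_groups=3, rng=rng)
```

Here `sum` (the number of ones) serves as a toy fitness. An iteration in which no group improves the best solution of the population corresponds to what the text calls an invalid iteration.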
2) GRA
GRA is designed by the greedy idea. It is one of the simplest heuristic algorithms. GRA usually starts from an initial solution which is generated randomly and then updates the solution iteratively. At each iteration, some alternative solutions would be generated by the moving of the current solution, and the distances between different alternative solutions and the current solution are the same. The algorithm would choose the optimal alternative solution to replace the current one based on their profits. Because the choice of each iteration is made just based on the current profits, the output solution of GRA is usually a local optimal solution.
To further analyze the process of GRA, we use GRA to solve the same hardware/software partitioning problem that was solved by SFLA in the previous part. The algorithm is terminated when no better alternative solution can be chosen to replace the current one. The fitness curve of the solutions is shown in Figure 3. As Figure 3 shows, GRA is terminated after 26 iterations, and the quality of the solution obtained at that point is clearly better than that of the initial solution. GRA obtains a better solution in each iteration and keeps a fast descending speed, which shows that GRA has the advantage of high efficiency. However, when the algorithm is terminated, the quality of the output solution is much worse than that of SFLA, which indicates that GRA easily falls into a local optimum.
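The greedy procedure can be sketched in Python as below. The sketch assumes that the alternative solutions are all single-bit flips of the current partition vector, which keeps every alternative at the same distance from the current solution; the function name and the toy fitness functions are hypothetical.

```python
def greedy_search(x, fitness, max_iters=1000):
    """Best-improvement bit-flip search; returns (solution, iterations used)."""
    current = list(x)
    current_fit = fitness(current)
    for it in range(max_iters):
        best_i, best_fit = None, current_fit
        # Evaluate every neighbor at Hamming distance 1 from the current solution.
        for i in range(len(current)):
            current[i] ^= 1
            f = fitness(current)
            current[i] ^= 1
            if f < best_fit:
                best_i, best_fit = i, f
        if best_i is None:
            # No alternative is better: the current solution is a local optimum.
            return current, it
        current[best_i] ^= 1
        current_fit = best_fit
    return current, max_iters


def deceptive(x):
    """Toy fitness: the global optimum is all ones (0), but fewer ones looks better."""
    return 0 if all(x) else 1 + sum(x)


print(greedy_search([1, 0, 1, 1], sum))        # ([0, 0, 0, 0], 3)
print(greedy_search([1, 1, 0, 0], deceptive))  # ([0, 0, 0, 0], 2)
```

With the `deceptive` fitness, the search stops at the all-zero vector with fitness 1 although the all-one vector has fitness 0. This mirrors the behavior described above: each iteration improves the solution quickly, but the final solution may only be a local optimum.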
B. THE HYBRID ALGORITHM
To ensure the quality of the solutions, SFLA with strong global optimization ability should be used in hardware/software partitioning problems. Although SFLA has higher efficiency compared with some SI algorithms, its efficiency still should be further improved. To achieve this goal, there are three methods: 1) Reducing the invalid iterations. Invalid iterations account for a large proportion of the total iterations, so reducing the invalid iterations can obviously reduce the number of iterations. 2) Terminating the algorithms with the effective termination conditions. If the algorithm is terminated too early, the quality of the solutions would be poor, while if the algorithm is terminated too late, the running time of the algorithm would be long. So it is important to find the effective termination conditions. 3) Accelerating the search efficiency of each iteration. When the search efficiency of each iteration is improved, the total running time would be reduced.
Based on the above analysis, the greedy idea of GRA is introduced into SFLA to generate a new algorithm SFLA-GRA. SFLA has strong global optimization ability and GRA has high efficiency. Therefore, the fusion of the two algorithms can obtain their respective advantages. The three most important parts of the hybrid algorithm are shown as follows:
1) TERMINATING INVALID ITERATIONS OF SFLA WITH GRA.
When the algorithm starts, SFLA-GRA runs in accordance with the steps of SFLA. During this process, the number of successive invalid iterations is counted, and if it exceeds the threshold Inv_Limit, the GRA function is run. Because GRA is able to obtain a better solution in a single iteration, it helps the algorithm break out of the run of invalid iterations. When GRA can no longer find a better solution or reaches its maximum number of iterations, it is stopped and SFLA is continued. The pseudo code of the GRA function is shown in Table 1.
2) TERMINATING THE ALGORITHM WITH A NEW TERMINATION CONDITION.
For SFLA, when a large number of successive invalid iterations appears, it is likely that the obtained solution is near the best solution. For GRA, when a better solution cannot be obtained, the current solution is at least a local optimum. Therefore, if neither SFLA (within a certain number of iterations) nor GRA can find a better solution, the probability that the current solution is equal to or close to the best solution in the solution space is higher. Accordingly, after a certain number of successive invalid iterations, the GRA function is run, and if the GRA function still cannot find a better solution, the algorithm is terminated. That is the termination condition of SFLA-GRA.
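The interaction between the invalid-iteration counter, the GRA function and the termination condition can be sketched as the following control loop. This is a simplified illustration rather than the pseudo code of the paper: `step` stands in for one SFLA iteration, `greedy` stands in for the GRA function, and both are passed in as hypothetical callables that operate on a single solution.

```python
def run_hybrid(step, greedy, fitness, x0, inv_limit, max_iters):
    """Run `step` until a stall that `greedy` cannot resolve; lower fitness is better.

    step(x)   -> candidate solution from the global search (stand-in for SFLA)
    greedy(x) -> locally refined solution (stand-in for the GRA function)
    Returns (best solution, number of iterations used).
    """
    best, best_fit = x0, fitness(x0)
    invalid = 0  # number of successive invalid iterations
    for it in range(1, max_iters + 1):
        cand = step(best)
        cand_fit = fitness(cand)
        if cand_fit < best_fit:
            best, best_fit, invalid = cand, cand_fit, 0   # valid iteration
            continue
        invalid += 1
        if invalid > inv_limit:
            refined = greedy(best)
            refined_fit = fitness(refined)
            if refined_fit < best_fit:
                # The greedy search ended the run of invalid iterations.
                best, best_fit, invalid = refined, refined_fit, 0
            else:
                # Neither search can improve the solution: terminate.
                return best, it
    return best, max_iters


# Toy example on integers: the global step never improves, the greedy step
# decrements toward zero, and the fitness is the absolute value.
result = run_hybrid(step=lambda x: x,
                    greedy=lambda x: max(x - 1, 0),
                    fitness=abs, x0=3, inv_limit=2, max_iters=100)
print(result)  # (0, 12)
```

In the toy run, every third iteration triggers the greedy stand-in, and the loop stops at iteration 12, the first time the greedy stand-in also fails to improve the solution, well before the limit of 100 iterations.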
3) ACCELERATING THE SEARCH WITH GREEDY STEP SIZE.
In SFLA, a bad solution moves toward a better one to find a new solution. The generation process of a new solution is shown as follows:
where $X_{new}$, $X_{bad}$ and $X_{better}$ represent the new, the bad and the better solutions, respectively. N is the number of task nodes. Rand is an N-dimensional vector composed of random values between 0 and 1.
Step is the search step size of SFLA.
To accelerate the algorithm, a greedy step size is proposed in SFLA-GRA. When $x^{i}_{bad}$ is 0 and $x^{i}_{better}$ is 1, the greedy step size is denoted by $g^{i}_{0,1}$; when $x^{i}_{bad}$ is 1 and $x^{i}_{better}$ is 0, the greedy step size is denoted by $g^{i}_{1,0}$. $g^{i}_{0,1}$ and $g^{i}_{1,0}$ can be calculated by:
where $C^{i}_{0,1}$ is the profit when $x^{i}_{bad}$ is changed from 0 to 1. $s_i$, $h_i$ and $a_i$ are the software execution time, the hardware execution time and the hardware area of task node i. mas and mis are the maximum and the minimum search step sizes. The generation of a new solution based on the greedy step size is shown as:
The pseudo code of SFLA-GRA is shown in Table 2, where $X_{i,worst}$, $X_{i,best}$ and $X_{BEST}$ are the worst solution in group i, the best solution in group i and the best solution in the population, respectively. U_bond and L_bond are the upper and lower bounds of the solution space. To further compare the three algorithms, the flow charts of SFLA, GRA and SFLA-GRA are shown in Figure 4.
IV. SIMULATION RESULTS
The proposed algorithm SFLA-GRA is simulated in C++ on an Intel Core i5-6400 2.70 GHz CPU with 8.00 GB of RAM, running the Microsoft Windows 10 operating system. Random instances are generated by the TGFF tool to test the performance of the algorithms. These instances include five task sets of different scales (200, 300, 500, 700 and 1000 task nodes). The constraint value is set to 1/2 of the maximum area (the total hardware area when all tasks are assigned to hardware units). To test the performance of SFLA-GRA, it is compared with the two original algorithms, SFLA and GRA; four original SI algorithms, Artificial Bee Colony Algorithm (ABC), Artificial Fish Swarm Algorithm (ASFA), Genetic Algorithm (GA) and Particle Swarm Optimization Algorithm (PSO); and four improved SI algorithms proposed in recent years, Mnemonic Shuffled Frog Leaping Algorithm (MSFLA) [34], Improved Particle Swarm Optimization (IPSO) [35], a novel Artificial Bee Colony Algorithm (called APABC) [36] and Improved Genetic Algorithm (IGA) [37]. SFLA-GRA is terminated based on the termination condition introduced in Section III. GRA is terminated when no better alternative solution can be found. The other algorithms are terminated by two common termination conditions. The first condition is that there are 150 successive invalid iterations. The second condition is that the number of iterations reaches 1500. The simulation results are averaged over 10 runs.
A. COMPARISON BASED ON THE SOLUTION QUALITY AND RUNNING TIME
Table 3 and Table 4 show the comparison results of the 11 algorithms in terms of solution quality and running time. In Table 3, SFLA and the other SI algorithms used for comparison are terminated by the first termination condition (150 successive invalid iterations). In Table 4, SFLA and the other SI algorithms used for comparison are terminated by the second termination condition (1500 iterations). Q and R are calculated by:
where $fit_{comp}$ and $fit_{sfla-gra}$ are the fitness values of the solutions obtained by the comparison algorithm and SFLA-GRA, respectively, and $run_{comp}$ and $run_{sfla-gra}$ are the running times of the comparison algorithm and SFLA-GRA, respectively. It can be seen from formula 5 that when Q and R are positive, the solution quality of SFLA-GRA is better than that of the comparison algorithm and the running time of SFLA-GRA is shorter than that of the comparison algorithm.
As shown in Table 3 and Table 4, GRA has the shortest running time among all the algorithms, but its solution quality is poor, which indicates that GRA easily falls into a local optimum. Under the first termination condition, when the numbers of task nodes are 500, 700 and 1000, the running time of SFLA is shorter than that of SFLA-GRA, but its solution quality is also worse than that of SFLA-GRA. That is because SFLA is terminated too early to obtain a high quality solution. Under the second condition, both the solution quality and the running time of SFLA are worse than those of SFLA-GRA, which indicates that SFLA-GRA effectively improves the efficiency while maintaining the quality of solutions.
Compared with the four other original SI algorithms and the four improved algorithms, SFLA-GRA obtains higher quality solutions within a shorter running time in most cases. Under the first termination condition, the running time of ABC and PSO is sometimes shorter than that of SFLA-GRA, but their short running time comes with poor solution quality. In some cases, ASFA and IGA obtain solutions whose quality is the same as that of SFLA-GRA, but ASFA and IGA are the two most time-consuming algorithms among the 8 comparison algorithms and their efficiency is much lower than that of SFLA-GRA.
Compared with the first termination condition, when the SI algorithms are terminated under the second termination condition, their running time is longer but their solution quality is higher. That is because 150 successive invalid iterations easily appear before the total number of iterations reaches 1500, and the quality of the solutions obtained at that point usually has room for further improvement. This result also shows that it is important to reduce the invalid iterations and to set suitable termination conditions.
B. COMPARISON BASED ON THE FITNESS CURVES
The fitness curves of the 11 algorithms when they are used to solve the instances of 300 nodes and 700 nodes are shown in Figure 5 .
As Figure 5(a) shows, GRA keeps the fastest descending speed and is terminated after a small number of iterations, but the solution quality of GRA is the worst. This further indicates that GRA has high efficiency but easily falls into a local optimum. In the early stage, the descending speeds of SFLA-GRA and SFLA are similar. However, as the number of iterations increases, the large number of invalid iterations reduces the descending speed of SFLA, while SFLA-GRA still keeps a fast speed. This indicates that introducing GRA into SFLA can effectively terminate the invalid iterations.
As Figure 5(b) and Figure 5(c) show, as the number of iterations increases, the descending speeds of most algorithm curves become slower and slower, whereas the fitness curve of SFLA-GRA keeps a fast descending speed until the algorithm is terminated. In the early stage, some algorithms may descend faster than SFLA-GRA, but as their invalid iterations increase, their descending speeds are exceeded by that of SFLA-GRA. It should also be noted that the time required for these algorithms (especially ASFA and IGA) to complete one iteration is much longer than that of SFLA-GRA. These curves also show that SFLA-GRA is usually terminated in fewer than 700 iterations, yet the quality of its output solution is the highest among all the algorithms. This supports the validity of the termination condition of SFLA-GRA.
V. CONCLUSION
In this paper, we first analyze the importance of improving the efficiency of hardware/software partitioning. Then, based on the idea of the fusion of multiple algorithms, SFLA and GRA are hybridized to generate a hybrid algorithm SFLA-GRA. On the basis of SFLA, the new algorithm uses the GRA function to terminate the invalid iterations and sets a greedy search step size to further accelerate the algorithm. Experimental results show that the proposed algorithm SFLA-GRA outperforms the comparison algorithms in most cases, especially in terms of algorithm efficiency.
There are some suggestions for future research:
1) It is found during our research that the profit function is an important factor affecting the performance of the GRA function, so it should be further studied.
2) Our proposed algorithm is based on the original SFLA, but there are many improved heuristic algorithms. Hybridizing GRA with these improved heuristic algorithms may further improve the efficiency.
3) The methods of model simplification and hardware acceleration can also be studied to accelerate the proposed algorithm.
