Abstract-In this paper, a List Simulated Annealing (LSA) algorithm is proposed for scheduling DAG tasks on a Network-on-Chip (NoC) to simultaneously optimize makespan, load balance and average link load. A task list is first created for the DAG tasks, and the task-to-processor assignment is performed using the Best Fit rule. The generated schedule is then further optimized using LSA. In LSA, the task execution order is determined by the task list, and the task mapping is optimized using simulated annealing. The performance of the proposal is validated through a series of simulations. Compared with the list Best Fit and list random mapping algorithms, LSA achieves 9% and 25.4% shorter makespan, 31.1% and 79.4% better load balance, and 18% smaller average link load.
I. INTRODUCTION
The task scheduling problem on multiprocessors has been proved to be NP-hard [1]. To solve this complex problem, many heuristics have been proposed, such as list scheduling heuristics [2] and meta-heuristics [3]. List scheduling maintains a task list that is created from the task priorities, and assigns tasks to processors using rules such as First Fit, Best Fit and Next Fit [4]. Meta-heuristics, by contrast, search the solution space randomly for the optimal schedule; examples are Genetic Algorithm [5], Particle Swarm Optimization [6], and Simulated Annealing (SA) [7].
Simulated annealing is a generic probabilistic algorithm for global optimization. It is inspired by the metallurgical process of heating a material and cooling it down so that its atoms progress to the equilibrium state. SA simulates this process to explore the search space.
In this paper, we combine list scheduling heuristics with the SA meta-heuristic, and propose a List Simulated Annealing (LSA) algorithm for scheduling DAG tasks on the Network-on-Chip. Our proposal draws on the advantages of both heuristics, and simultaneously optimizes makespan, load balance and average link load.
The rest of this paper is organized as follows: Section II summarizes the related work; the problem is formulated in Section III; our proposal is elaborated in Section IV; Section V gives the comparative simulation results; and Section VI concludes the paper.
II. RELATED WORKS
List heuristics and meta-heuristics for task scheduling on multicore systems have been widely researched. For list scheduling, [8] proposes a communication-aware list scheduling algorithm for the NoC-based MPSoC. In [9], a contention-aware list scheduling algorithm is proposed for the dynamically reconfigurable NoC system. Reference [10] implements a Best Fit Decreasing heuristic for task scheduling in NoC. List scheduling is popular for its straightforward implementation and relatively low computational requirement. However, in most situations the result of list scheduling can be further optimized.
For the meta-heuristic methods, a genetic algorithm is proposed in [5] for task scheduling in multiprocessor systems. Reference [6] proposes a modified particle swarm optimization with load balancing to schedule heterogeneous tasks onto heterogeneous processors. In [7], a simulated annealing task scheduling algorithm is proposed for NoC-based MPSoCs with Voltage-Frequency islands. Meta-heuristics can effectively explore the solution space for the optimum; however, if they are combined with list scheduling, both the search efficiency and the probability of finding a better solution increase.
In this paper, we combine list scheduling and simulated annealing, and propose a list simulated annealing scheduling algorithm to simultaneously optimize makespan, load balance and average link load.
III. PROBLEM FORMULATION

A. Task Model
In this paper, tasks are modeled using Directed Acyclic Graphs (DAGs 
B. Network-on-Chip Hardware
The target hardware is a 2D mesh NoC-based MPSoC, as illustrated in Fig. 2. Each PE is connected to a router, and routers are interconnected through bidirectional links. Data is transferred through the NoC in the form of packets.
PEs are homogeneous processor cores with local data caches. If a task and its predecessor are scheduled to the same PE, the successor task reads the predecessor's data directly from the data cache of the PE without routing through the NoC. The microarchitecture of a NoC router is shown in Fig. 3. The router has five Inports and five Outports, corresponding to the five directions East, West, North, South and Local. The decoder in each Inport scans the first flit in the FIFO for any incoming packet. If the decoder detects the head flit of a packet, it performs the XY routing algorithm and sends a request signal to the arbiter of the corresponding Outport. If the arbiter receives multiple request signals, the contention is resolved using Round-Robin arbitration. The granted Inport then forwards the packet to the downstream router. Wormhole routing is adopted to minimize the buffer requirement as well as the packet latency [12]. The back pressure mechanism is also employed to further reduce end-to-end delay [13].
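The XY routing decision described above can be illustrated with a minimal sketch. It assumes routers are addressed by hypothetical (x, y) mesh coordinates and models only the path selection (travel along X first, then along Y), not flits, arbitration or back pressure.

```python
def xy_route(src, dst):
    """Return the router coordinates visited from src to dst under XY routing.

    Assumption: routers are identified by (x, y) positions in the 2D mesh.
    The packet first moves along the X dimension until the column matches,
    then along the Y dimension.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


def links_on_route(src, dst):
    """Return the directed links (pairs of adjacent routers) used by the route."""
    path = xy_route(src, dst)
    return list(zip(path, path[1:]))
```

A packet whose source and destination PEs coincide uses no link, which matches the case where a successor task reads its data from the local cache.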
IV. PROPOSED ALGORITHM
In this section, a list simulated annealing algorithm is proposed for scheduling DAG tasks on the NoC. A task list is first created, and a schedule solution is obtained by applying the Best Fit (BF) algorithm. The generated schedule is then employed as the initial solution of Simulated Annealing (SA), and optimized by the cooling down process.
A. List schedule
The target DAG is processed by a list schedule algorithm to generate an initial schedule solution for the SA. The execution sequence of tasks is defined by a task list which is produced according to the top distance of tasks, and the task-to-processor assignment is performed using BF.
Task list generation:
The top distance of each task in the target DAG is calculated using the method mentioned in Section III. The tasks are then sorted by their top distance values and inserted into the task list, with the task of smallest top distance at the head of the list. For the example DAG in Fig. 1, the task list is: task 0, task 2, task 1, task 3, task 4, task 6, task 5, and task 7.
A task list generated by top distance ensures that the precedence constraints of the tasks are met. For any task i with predecessor task j and successor task k, the top distance of task i is greater than that of task j and smaller than that of task k. In the generated task list, task j is therefore scheduled before task i, and task k is scheduled only after task i.
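The task list construction can be sketched as follows. Because the definition from Section III is not reproduced here, the sketch assumes that the top distance of a task is the length of the longest path from an entry task to that task, counting only the execution times of the tasks before it; communication costs are ignored and ties are broken by task id, both of which are illustrative choices. The names `exec_time` and `preds` are hypothetical.

```python
from graphlib import TopologicalSorter


def top_distances(exec_time, preds):
    """Longest-path distance from any entry task to each task.

    exec_time: {task: execution time in cycles}
    preds:     {task: list of predecessor tasks}
    Assumption: the distance excludes the task's own execution time and
    ignores communication cost.
    """
    graph = {t: preds.get(t, ()) for t in exec_time}
    dist = {}
    # static_order() yields every task after all of its predecessors.
    for t in TopologicalSorter(graph).static_order():
        dist[t] = max((dist[p] + exec_time[p] for p in graph[t]), default=0)
    return dist


def build_task_list(exec_time, preds):
    """Sort tasks by ascending top distance (ties broken by task id)."""
    dist = top_distances(exec_time, preds)
    return sorted(exec_time, key=lambda t: (dist[t], t))
```

With strictly positive execution times, every predecessor has a strictly smaller top distance than its successors, so the sorted list respects the precedence constraints.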
Best Fit scheduling:
Tasks read from the list are then allocated to processors using the BF rule. For a to-be-scheduled task i, the BF algorithm calculates the Estimated Finish Time (EFT) of each PE, which is the time at which that PE finishes the task currently running on it. The EFT is regarded as the earliest start time that the PE can offer the target task. BF chooses the PE with the best (earliest) EFT and schedules task i to that PE.
Although BF scheduling focuses solely on the makespan, its output schedule provides a good starting point for further optimization. The pseudocode of the list scheduling is shown in Fig. 4.
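As an illustration of the BF rule (not a transcription of Fig. 4), the sketch below picks, for each task in list order, the PE with the earliest EFT. It assumes that a task also waits for its predecessors to finish and that communication delay is zero; the function and parameter names are hypothetical.

```python
def list_best_fit(task_list, exec_time, preds, n_pe):
    """Assign tasks to PEs in list order using the Best Fit rule.

    Returns (mapping, makespan), where mapping is {task: PE index}.
    Assumptions: PEs are indexed from 0, communication delay is ignored,
    and ties between PEs go to the lowest index.
    """
    pe_eft = [0] * n_pe   # estimated finish time of each PE
    finish = {}           # finish time of each scheduled task
    mapping = {}
    for t in task_list:
        pe = min(range(n_pe), key=lambda k: pe_eft[k])   # earliest EFT
        ready = max((finish[p] for p in preds.get(t, ())), default=0)
        start = max(pe_eft[pe], ready)
        finish[t] = start + exec_time[t]
        pe_eft[pe] = finish[t]
        mapping[t] = pe
    return mapping, max(finish.values())
```

For a four-task diamond DAG (task 0 feeding tasks 1 and 2, which both feed task 3) with execution times 10, 30, 20 and 10 cycles on two PEs, this sketch maps tasks 0, 2 and 3 to PE 0 and task 1 to PE 1, giving a makespan of 50 cycles.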
B. List Simulated Annealing Scheduling
In this subsection, a list simulated annealing algorithm is proposed to further optimize the schedule generated by the list Best Fit algorithm presented above. As in the previous subsection, the execution order of tasks is determined by the task list, which guarantees the precedence constraints of the tasks, and SA is used to search for the optimal task-to-processor allocation.
Solution representation:
In the SA, a schedule solution is represented by a symbol S. For a scheduling problem in which n_tsk tasks are scheduled to n_pe PEs, a symbol is expressed as:
$$S = (s_1, s_2, \ldots, s_{n\_tsk})$$
where an element $s_i = k$, $k \in \{1, \ldots, n\_pe\}$, denotes that the i-th task is scheduled to the k-th PE.
Evaluation of schedule solution:
In our proposal, three metrics are monitored to evaluate the performance of a symbol (schedule), and they are: Makespan (M), Load Balance (B) [14] and Average Link Load (L) [15] .
The Makespan, also called the schedule length, is the time span for the NoC to finish all the tasks in a DAG.
The Load Balance metric is the inverse coefficient of variation of the total workload on each processor, that is, the mean per-processor workload divided by its standard deviation. A larger B value indicates a better-balanced schedule.
The last metric, Average Link Load, monitors the traffic load on each link, and is defined to be the average value of traffic loads on all links. A better schedule is supposed to minimize the L metric.
The evaluation result of a symbol i is given in (4), which is based on the global weighted sum method in multi-objective optimization [16]. All three metrics are normalized to a reference symbol, which in our proposal is the schedule generated by list scheduling. The final evaluation result is the weighted sum of the improvements in the three metrics. When an evaluation is needed, the SA symbol is sent along with the task list to the evaluation process, which assesses the metrics and returns the evaluation score.
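The evaluation step can be sketched as below. The exact form of (4) and the weights are not recoverable from the text, so this is only one plausible reading: each metric's improvement is taken relative to the reference schedule, signed so that a higher score is better, and the three improvements are combined with hypothetical equal weights. The use of the population standard deviation for B is also an assumption.

```python
from statistics import mean, pstdev


def load_balance(pe_workloads):
    """Inverse coefficient of variation of per-PE workload (mean / std dev).

    Assumption: population standard deviation; a perfectly balanced
    schedule (zero deviation) is reported as infinity.
    """
    sigma = pstdev(pe_workloads)
    return float("inf") if sigma == 0 else mean(pe_workloads) / sigma


def average_link_load(link_loads):
    """Average traffic load over all links."""
    return mean(link_loads)


def score(metrics, reference, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of relative improvements over the reference schedule.

    metrics, reference: (makespan M, load balance B, average link load L).
    Assumption: M and L improve when they shrink, B improves when it grows;
    the reference values must be finite and non-zero.
    """
    m, b, l = metrics
    m0, b0, l0 = reference
    w_m, w_b, w_l = weights
    return w_m * (m0 - m) / m0 + w_b * (b - b0) / b0 + w_l * (l0 - l) / l0
```

Under this reading the reference schedule itself scores zero, and any schedule that improves all three metrics scores above zero.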
Cooling down process:
SA starts at a high temperature level and gradually cools down to a lower temperature level. At each level, SA repeats a certain number of neighbor searches, in each of which a new symbol is generated by randomly changing the value of a random element of the original symbol.
Better moves found by the neighbor search are always accepted, and the original symbol is updated. A worse move is also accepted with a certain probability, to prevent the algorithm from being trapped in local optima. The probability of accepting a worse move decreases as the temperature decreases.
The temperature cooling function is given in (5) 
The worse move acceptance function is defined as (6), where the function random() returns a random number in the interval (0,1), and $E_0$ is the performance score of the reference symbol $S_0$.
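The cooling down process can be sketched as follows. Since (5) and (6) are not recoverable from the text, the sketch substitutes a geometric cooling schedule (temperature multiplied by a factor q at each level) and the standard Metropolis acceptance rule; these are stand-ins, not the paper's exact functions. The parameters `t0`, `q` and `moves_per_level` are hypothetical values, PEs are indexed from 0, and a higher `evaluate` score is assumed to be better.

```python
import math
import random


def neighbor(symbol, n_pe, rng):
    """Copy the symbol and remap one random task to a different PE.

    Assumption: the new PE always differs from the old one (needs n_pe >= 2).
    """
    new = list(symbol)
    i = rng.randrange(len(new))
    new[i] = rng.choice([k for k in range(n_pe) if k != new[i]])
    return new


def anneal(initial, n_pe, evaluate, t0=1.0, q=0.9, levels=50,
           moves_per_level=100, seed=0):
    """Simulated annealing over task-to-PE mappings.

    evaluate(symbol) returns a score where higher is better.
    Returns the best symbol found and its score.
    """
    rng = random.Random(seed)
    current, current_score = list(initial), evaluate(initial)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(levels):
        for _ in range(moves_per_level):
            candidate = neighbor(current, n_pe, rng)
            candidate_score = evaluate(candidate)
            delta = candidate_score - current_score
            # Better moves are always accepted; worse moves are accepted
            # with a probability that shrinks as the temperature drops.
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = current, current_score
        temperature *= q   # assumed geometric cooling
    return best, best_score
```

In the setting of this paper, `initial` would be the mapping produced by list Best Fit and `evaluate` the weighted score of makespan, load balance and average link load.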
A. Simulation Setup
To evaluate the performance of our proposal, we implement the list SA (LSA) scheduling algorithm in C++, and simulate the produced schedules on a SystemC-based cycle-accurate NoC simulator modified from the work in [18]. The list BF (LBF) and list random mapping (LRM) scheduling algorithms are also implemented and simulated as references.
The NoC is a 4×4 2D mesh with the XY routing algorithm. Wormhole switching is adopted, and Round-Robin arbitration is used to resolve contention.
For the LSA, the algorithm terminates after 50 temperature levels, and the cooling factor q=0. 
B. Task Generation
In our simulation, both random DAGs and real-world application DAGs are used to evaluate the performance of our scheduling algorithm.
The 20 random DAGs are generated using TGFF 3.1 [19]. Detailed information on each DAG is shown in Table I. The task number (n_tsk) varies from 52 to 101, which covers a variety of DAG sizes. Each task's computation time is randomly generated in the range of 40 to 160 cycles. CCR is set to 3, 2, 1.5 and 1.2 to simulate light, medium, heavy and extremely heavy communication load. The series_w and series_l parameters are required by the new algorithm of TGFF 3.1. Each is given as a tuple (average, multiplier), and they set the width and length of the series chains.
Two real-world applications, solving the Laplace Equation (LE) using the Gauss-Seidel algorithm in [20] and the Molecular Dynamics Code (MDC) in [21], are also simulated in our experiment.
C. Simulation Results
Table I (partial)
Task Graph  tg1  tg2  tg3  tg4  tg5  tg6  tg7  tg8  tg9  tg10
tsk_num     63   58   59   52   75   71   67   71   80

The simulation results of the List Simulated Annealing (LSA), List Best Fit (LBF) and List Random Mapping (LRM) scheduling algorithms under our NoC simulator are illustrated in Fig. 6.
Fig. 6 (a) shows the makespan results. The first thing to notice is that the makespan does not grow regularly with the number of tasks. This is because the makespan is more sensitive to the length of the critical path, which is the longest path in the DAG, than to the number of tasks. For example, tg1 has 63 tasks and tg17 has 101 tasks, yet the makespan of tg1 is 52% larger than that of tg17, because the critical path of tg1 (1557 cycles) is longer than that of tg17 (1226 cycles). From Fig. 6 (a) we observe that the makespan of LSA is always smaller than that of LBF and LRM. On average, the makespan of LSA is 91% of that of LBF and 74.6% of that of LRM. In some cases the makespan of LSA is very close to that of LBF, because the schedule produced by LBF is already near-optimal.
Fig. 6 (b) shows the load balance results, normalized to the LSA result. In some cases the load balance of LSA is identical to that of LBF. The explanation is that two or more processors swap all their tasks during the cooling down process of LSA, which leaves the set of per-processor workloads unchanged. Apart from these cases, LSA outperforms LBF by 33.1% and LRM by 79.4%.
The average link load results are illustrated in Fig. 6 (c). LSA markedly reduces the average link load on the NoC: the average link load of LSA is 57.9% and 56.3% smaller than that of LBF and LRM, respectively.
Moreover, we measure the end-to-end routing delay of each packet, and the overall average end-to-end delay of the three scheduling algorithms is shown in Fig. 7. Although end-to-end delay is not an optimization goal of our proposal, the routing delay of LSA is 18% shorter than that of LBF and LRM, whose average end-to-end delays are at the same level. This is because both makespan and average link load optimization favor schedules with shorter routing delay.
VI. CONCLUSION
A list simulated annealing scheduling algorithm is proposed in this paper for scheduling DAG tasks on the NoC. The proposal combines list scheduling and simulated annealing to optimize makespan, load balance and average link load. The performance of the proposal is validated through a series of simulations.
