Continuous improvements in application domains such as telecommunications, aerospace and multimedia lead to additional design constraints on power budget and architecture scalability. The heterogeneous CPU/FPGA (Central Processing Unit/Field-Programmable Gate Array) architecture is one of the most promising solutions in this context, leading to high-performance reconfigurable computing. In such systems, multi-core processors (CPUs) provide high computation rates, while the reconfigurable logic (FPGA) offers high performance and adaptability to the real-time constraints of the application. However, there is a lack of CAD (Computer-Aided Design) tools able to support the development of applications on such heterogeneous systems. This research investigates the optimisation of static and run-time task mapping on a real-time CPU/FPGA computing system used to implement tightly coupled hardware and software models.
INTRODUCTION
A promising approach to cover the needs of future systems with real-time constraints is a heterogeneous architecture that combines general-purpose CPUs (Central Processing Units) with the reconfigurable fabric provided by standard FPGA (Field-Programmable Gate Array) circuits. In such systems, the multicore CPUs provide raw computation rates and ease of programming, while the reconfigurable logic offers massively parallel computing power and adaptability at the circuit level. For highly parallel applications, FPGA technology can offer better performance than CPUs (up to ten times, at lower clock frequencies; see Asano et al. (2009)). An FPGA is a chip containing a large number of logic blocks. These blocks are connected by a configurable routing matrix, which allows the functionality of the component to be reconfigured as desired.
In order to obtain the maximum benefit from each part of the system (CPU and FPGA), we must offer efficient methods and tools that help software designers to map the application tasks onto the available resources while respecting real-time constraints. This challenge involves both static and runtime task mapping. As highlighted in Jong-Kook et al. (2007), an important research challenge is how to assign tasks to the available resources in order to maximize some performance criterion of the heterogeneous architecture.
In this research, we explore the problem of task mapping in terms of minimizing the makespan, i.e. the schedule length of the project. We assume that task execution is non-preemptive, and we take into account the communication cost between each pair of tasks linked by precedence. We formulate an exact method and a heuristic approach in order to deal, respectively, with the static initial mapping and with the runtime mapping of newly arriving tasks. The effort to develop an exact method was motivated by its importance in the static initial phase, especially in view of the absence of exact approaches in the literature. The heuristic that we introduce is a modified version of the well-known HEFT (Heterogeneous Earliest-Finish Time) heuristic. Indeed, experiments in many prior contributions have shown that HEFT is one of the most effective heuristics for heterogeneous architectures. This paper is structured as follows. Section 2 reviews some representative related contributions. Section 3 describes the problem statement and the context of this study. The exact method and the MHEFT Heuristic are presented in detail in Sections 4 and 5, respectively.
RELATED WORKS
Several works on task mapping in computing systems have been proposed, such as Koulamas and Kyparisis (2007), Chen and Lin (1998) and Ulrich Heiss (1992), which target identical parallel machines. Recently, research efforts have increasingly turned towards the mapping problem on heterogeneous computing systems. The mapping problem consists of two parts: matching, which assigns tasks to the available resources, and scheduling, which determines the execution sequence of the tasks. In the literature, most existing works use the term scheduling to mean mapping. For example, Ibarra and Kim (1977) present several heuristic algorithms for scheduling independent tasks on non-identical processors. As early as 1988, Casavant and Kuhl (1988) presented a taxonomy of scheduling in general-purpose distributed computing systems. At the highest level, they distinguish between local scheduling, which we call scheduling, and global scheduling, which we call matching, as shown in Figure 1. At the lower level, they differentiate between two categories of mapping, static and dynamic; see Figure 1. A number of papers propose approaches to the static mapping problem; see Shoukat et al. (2002) and Braun et al. (2001), in which eleven static heuristics are compared for mapping a class of independent tasks onto heterogeneous distributed computing systems. Several papers have also investigated the dynamic mapping problem; see Jong-Kook et al. (2007), Fazlali et al. (2010) and Lin et al. (1997). The approaches explored include the greedy heuristics in Luo et al. (2007), in which the authors evaluate and compare 20 greedy heuristics for mapping a class of independent tasks onto heterogeneous computing systems. In Alaoui et al. (1999), the authors propose a genetic algorithm for the mapping problem; finally, a hybrid heuristic is presented in Chen and Lin (1998).
To summarize, most of the approaches in the literature fall into two main categories: heuristic-based algorithms and guided-random-search-based algorithms.
The first category, heuristic-based algorithms, may be classified into three groups.
The second category, guided-random-search-based algorithms, has been explored in several papers. Genetic algorithms are the most widely used techniques, as in Hou et al. (1994) and Lee et al. (1997); they achieve a good minimization of the schedule length. However, their performance remains unsatisfactory with regard to their run-time requirements.
The main observation we made while reviewing the literature is the lack of approaches that yield an exact solution for mapping on heterogeneous architectures. Our work differs from those cited above by focusing specifically on mapping tasks onto heterogeneous CPU/FPGA systems in the static and runtime phases, for non-preemptive execution of the tasks. In this work, we exploit the heterogeneous CPU/FPGA architecture by using several options for assigning tasks.
PROBLEM STATEMENT
We consider the challenge of mapping a simulation project containing several tasks onto a dynamically reconfigurable CPU/FPGA architecture at the initial phase. We denote by C the set of cores and by F the set of FPGAs, as illustrated in Figure 2 (left). We denote γ = |C| and φ = |F|, where |x| indicates the cardinality of x. In this work, C contains four cores, C = {C1, C2, C3, C4}, and F contains only one element, F = {F1}; hence γ = 4 and φ = 1.
As shown in the example of Figure 2 (right), we mapped the project onto an architecture with a 4-core CPU and one FPGA. In this example, the tasks of project P1 are executed sequentially, respecting the precedence constraints of the task graph.
A project consists of several tasks with data dependencies among them. Each edge between two tasks i and k represents a precedence constraint, while the edge 'weight' represents the average communication cost between the two tasks. A task without any parent is called an "initial task" and is denoted "ent" (task 1 in the example of Figure 2 (right)), and a task without any child is called a "terminal task" and is denoted "exit" (task 12 in the example of Figure 2 (right)). A task can be executed on a given processor only when all data from its parents become available to that processor.
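The task graph described above can be sketched as a small data structure. The following Python fragment is only an illustration under assumed data: the four-task graph, its edge weights and the helper names `parents` and `children` are hypothetical and are not taken from Figure 2.

```python
from typing import Dict, List, Tuple

# Hypothetical example graph: (parent, child) -> average communication cost.
edges: Dict[Tuple[int, int], float] = {
    (1, 2): 4.0,
    (1, 3): 2.0,
    (2, 4): 3.0,
    (3, 4): 1.0,
}
tasks: List[int] = [1, 2, 3, 4]


def parents(task: int) -> List[int]:
    """Tasks whose data must be available before `task` can start."""
    return sorted(p for (p, c) in edges if c == task)


def children(task: int) -> List[int]:
    """Tasks that depend on the output of `task`."""
    return sorted(c for (p, c) in edges if p == task)


# "ent": tasks without any parent; "exit": tasks without any child.
entry_tasks = [t for t in tasks if not parents(t)]
exit_tasks = [t for t in tasks if not children(t)]

# Topological numbering: every parent has a lower index than its child.
is_topological = all(p < c for (p, c) in edges)
```

With this representation, the initial and terminal tasks are found directly from the edge set, and the topological numbering assumed in this paper reduces to a one-line check.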
The following assumptions are made:
• The tasks are numbered topologically; hence all the predecessors of any task have a lower index than that task.
• Task execution of a given application is non-preemptive.
• A task could have three types of implementation:
· Type 1: only a software version on a CPU.
· Type 2: only a hardware version on an FPGA.
· Type 3: both hardware and software versions.
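The three implementation types restrict the processors on which a task may be placed. A minimal Python sketch of this restriction is given below, assuming the platform of this work (four cores and one FPGA); the names `TaskType` and `eligible_processors` are hypothetical and introduced only for illustration.

```python
from enum import Enum
from typing import List

CORES = ["C1", "C2", "C3", "C4"]  # the set C of this work
FPGAS = ["F1"]                    # the set F of this work


class TaskType(Enum):
    SOFTWARE_ONLY = 1  # Type 1: software version on a CPU core
    HARDWARE_ONLY = 2  # Type 2: hardware version on an FPGA
    BOTH = 3           # Type 3: hardware and software versions


def eligible_processors(task_type: TaskType) -> List[str]:
    """Return the processors on which a task of the given type may run."""
    if task_type is TaskType.SOFTWARE_ONLY:
        return list(CORES)
    if task_type is TaskType.HARDWARE_ONLY:
        return list(FPGAS)
    return CORES + FPGAS
```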
In this section, we first present an exact method to obtain an optimal mapping of a given project; we then present a heuristic approach in order to reduce the computation time. Both approaches share the following notation:
Symbol      Definition
n           number of tasks
m           number of processors
C           the set of cores
F           the set of FPGAs
γ = |C|     the cardinality of C
φ = |F|     the cardinality of F
$t_{ij}$    the execution time of task i on processor j
$c_{ik}$    the average communication cost between task i and task k
• n and m are, respectively, the number of tasks and the number of processors.
• $t_{ij}$: the execution time of task i on processor j. Note that j indexes both the CPU cores and the FPGA.
• Finally, preemption is not allowed:
If two tasks, say i and k, are assigned to the same processor, then they must be separated by at least the processing time of the earlier one, whether or not the two tasks are related by precedence (in the case of precedence, the condition is implicitly handled by the second constraint). Since we do not know beforehand which task will precede the other on a given processor, let $a_{ik}$ be a binary variable, as presented in Figure 4. The no-preemption condition is then given by the following two inequalities, in which W1 and W2 are two very large numbers, for all i, k ∈ {1, …, n} with i < k and all j ∈ {1, …, m}:
Fig. 4. Definition of the $a_{ik}$ variables (the two possible orderings: Ti before Tk, and Tk before Ti).
Note that the non-preemption constraint is usually formulated as follows:
With the formulation given in equation 7, we reduce the number of variables and constraints.
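As an illustration of what the no-preemption condition enforces, the following Python sketch checks a finished schedule: two tasks placed on the same processor must not overlap in time. It does not reproduce the inequalities of the model or the constants W1 and W2; the `Schedule` layout and the function name `respects_no_overlap` are assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, Tuple

# Hypothetical schedule layout: task -> (processor, start time, finish time).
Schedule = Dict[int, Tuple[str, float, float]]


def respects_no_overlap(schedule: Schedule) -> bool:
    """Return True if no two tasks overlap in time on the same processor."""
    for (pi, si, fi), (pk, sk, fk) in combinations(schedule.values(), 2):
        if pi != pk:
            continue  # tasks on different processors may run concurrently
        # Either the first task finishes before the second starts, or vice versa.
        if not (fi <= sk or fk <= si):
            return False
    return True
```

The disjunction in the inner test is exactly the choice that a binary ordering variable encodes for each pair of tasks.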
Objective function
We wish to obtain the fastest execution of each project, which amounts to minimizing the makespan $C_{\max}$. In our context, minimizing $C_{\max}$ is equivalent to minimizing the finish time of the "terminal task". This is achieved by adopting the following objective function:
Note that, as explained before equation (2), the expression between brackets in equation (9) is equivalent to the finish time of the terminal task.
THE MHEFT HEURISTIC
The proposed algorithm is strongly inspired by the HEFT (Heterogeneous Earliest Finish Time) algorithm. As described in Topcuoglu and Wu (2002), HEFT is an application scheduling algorithm for a bounded number of heterogeneous processors, which has two main phases:
• Compute the priorities of all tasks;
• Select the tasks according to their priorities, then schedule each task on its "best" processor that minimizes the task finish time.
The HEFT algorithm aims to minimize the makespan, which is also our objective in this study.
In this section, we are given the following information: 
Where avail[j] is the earliest time at which processor j is ready for task execution. For example, if Tk is the last task assigned to processor j, then avail[j] is the time at which processor j completes the execution of task k and is ready to execute another task. AFT(k) is the actual finish time of task k. We deduce that:
The schedule length of a project is the AFT of its terminal task. The schedule length, also called the makespan, is therefore defined as follows:
$$\text{makespan} = AFT(\text{exit})$$
Given a project with n tasks and m processors, the Average Computation Cost (ACC) of a task i is computed by dividing the sum of the computation costs of the task on each processor by the number of available processors:
$$ACC(i) = \frac{1}{m} \sum_{j=1}^{m} t_{ij}$$
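In order to establish the priority of task allocation, we introduce a parameter called the upward rank, $\mathrm{rank}_u$, which is defined recursively as follows, where succ(i) denotes the set of immediate successors of task i:
$$\mathrm{rank}_u(i) = ACC(i) + \max_{k \in \mathrm{succ}(i)} \left( c_{ik} + \mathrm{rank}_u(k) \right)$$
The rank is computed recursively by traversing the task graph upward. For the terminal task "exit", the upward rank value is equal to
$$\mathrm{rank}_u(\text{exit}) = ACC(\text{exit})$$
$\mathrm{rank}_u(i)$ is the length of the critical path from task Ti to the terminal task, including the computation cost of task Ti.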
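The average computation cost and the upward rank can be sketched in a few lines of Python. The data below (four tasks, two processors, and the dictionaries `exec_time` and `comm`) are hypothetical, and every task is assumed to have an execution time on every processor.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

# Hypothetical data: exec_time[i][j] plays the role of t_ij, comm[(i, k)] of c_ik.
exec_time: Dict[int, List[float]] = {
    1: [4.0, 6.0],
    2: [3.0, 5.0],
    3: [2.0, 2.0],
    4: [5.0, 3.0],
}
comm: Dict[Tuple[int, int], float] = {
    (1, 2): 2.0, (1, 3): 1.0, (2, 4): 3.0, (3, 4): 1.0,
}


def acc(i: int) -> float:
    """Average computation cost of task i over all processors."""
    times = exec_time[i]
    return sum(times) / len(times)


@lru_cache(maxsize=None)
def rank_u(i: int) -> float:
    """Upward rank: critical-path length from task i to the terminal task."""
    successors = [k for (p, k) in comm if p == i]
    if not successors:  # terminal task
        return acc(i)
    return acc(i) + max(comm[(i, k)] + rank_u(k) for k in successors)


# Priority list: tasks sorted by decreasing upward rank.
priority_list = sorted(exec_time, key=rank_u, reverse=True)
```

On this example the ranks are 18, 11, 7 and 4 for tasks 1 to 4, so the priority list is simply 1, 2, 3, 4. Because every computation cost is positive, a parent always has a strictly larger rank than its children, so the list respects the precedence constraints.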
The two steps of the Modified HEFT algorithm are then, as follows:
• Compute the priorities of all tasks: the priority is established using the upward rank value $\mathrm{rank}_u$. The priority list is generated by sorting the tasks in decreasing order of $\mathrm{rank}_u$.
• Best processor selection phase:
The HEFT algorithm has an insertion-based policy, which considers the possible insertion of a task in the earliest idle time slot between two already-scheduled tasks on a processor. The length of an idle time slot, i.e. the difference between the finish time of one task and the start time of the next task scheduled on the same processor, must be at least equal to the computation cost of the task to be scheduled. Scheduling in this idle time slot must also preserve the precedence constraints.
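The insertion-based policy can be sketched as a search over the busy intervals of one processor. This Python fragment is a minimal illustration, not the authors' implementation; the function name `earliest_start` and its parameters are hypothetical. The argument `ready` stands for the time at which all data from the parents are available on the processor, which is how the precedence constraints are preserved.

```python
from typing import List, Tuple


def earliest_start(busy: List[Tuple[float, float]], ready: float,
                   duration: float) -> float:
    """Earliest start time >= ready for a task of length `duration`.

    `busy` holds the (start, finish) intervals already scheduled on the
    processor, sorted by start time and non-overlapping.
    """
    candidate = ready
    for start, finish in busy:
        # The idle slot before this interval is long enough: insert here.
        if candidate + duration <= start:
            return candidate
        # Otherwise the task cannot start before this interval ends.
        candidate = max(candidate, finish)
    # No suitable idle slot: start after the last scheduled task.
    return candidate
```

For example, with busy intervals (0, 2) and (5, 8), a task of length 3 that is ready at time 0 fits in the idle slot and starts at time 2, whereas a task of length 4 must wait until time 8.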
Algorithm 1 Algorithm MHEFT
  Set the computation costs of tasks and the communication costs between tasks.
  Compute $\mathrm{rank}_u$ for all tasks by traversing the graph upward, starting from the terminal task.
  Sort the tasks in a scheduling list by nonincreasing order of $\mathrm{rank}_u$ values.
  while there are unscheduled tasks in the list do
    Select the first task Ti from the list for scheduling.
    if Ti ∈ Type 1 then
      for each core j in the core set (j ∈ C) do
        Compute the EFT(i, j) value using the insertion-based scheduling policy.
      end for
      Assign task Ti to the core j that minimizes the EFT of task Ti.
    else if Ti ∈ Type 2 then
      for each FPGA j in the FPGA set (j ∈ F) do
        Compute the EFT(i, j) value using the insertion-based scheduling policy.
      end for
      Assign task Ti to the FPGA j that minimizes the EFT of task Ti.
    else if Ti ∈ Type 3 then
      for each processor j in the processor set (j ∈ C ∪ F) do
        Compute the EFT(i, j) value using the insertion-based scheduling policy.
      end for
      Assign task Ti to the processor j that minimizes the EFT of task Ti.
    end if
  end while
For the definitions of Types 1, 2 and 3, see the problem statement in Section 3.
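Algorithm 1 can be sketched in Python as follows. This is an illustrative reading under several assumptions, not the authors' implementation: the platform (two cores, one FPGA) and the project data are hypothetical; the type of a task is encoded by the processors that appear in its `exec_time` entry; the average computation cost is taken over those processors only; and, as in the original HEFT, the communication cost is assumed to be zero when a parent and its child run on the same processor.

```python
from typing import Dict, List, Tuple

# Hypothetical project: exec_time[i][p] is the execution time of task i on p.
# A processor missing from the entry means "no implementation there".
exec_time: Dict[int, Dict[str, float]] = {
    1: {"C1": 4.0, "C2": 4.0},             # Type 1: software only
    2: {"F1": 2.0},                        # Type 2: hardware only
    3: {"C1": 6.0, "C2": 6.0, "F1": 3.0},  # Type 3: both versions
    4: {"C1": 2.0, "C2": 2.0},             # Type 1
}
comm: Dict[Tuple[int, int], float] = {
    (1, 2): 1.0, (1, 3): 2.0, (2, 4): 1.0, (3, 4): 1.0,
}
PROCS = ["C1", "C2", "F1"]


def succ(i: int) -> List[int]:
    return [k for (p, k) in comm if p == i]


def pred(i: int) -> List[int]:
    return [p for (p, k) in comm if k == i]


def acc(i: int) -> float:
    times = exec_time[i].values()
    return sum(times) / len(times)


def rank_u(i: int) -> float:
    return acc(i) + max((comm[(i, k)] + rank_u(k) for k in succ(i)), default=0.0)


def earliest_start(busy: List[Tuple[float, float]], ready: float,
                   duration: float) -> float:
    """Insertion-based policy: first idle slot of sufficient length."""
    candidate = ready
    for start, finish in busy:
        if candidate + duration <= start:
            return candidate
        candidate = max(candidate, finish)
    return candidate


def mheft() -> Dict[int, Tuple[str, float, float]]:
    """Return task -> (processor, start, finish)."""
    order = sorted(exec_time, key=rank_u, reverse=True)
    busy: Dict[str, List[Tuple[float, float]]] = {p: [] for p in PROCS}
    placed: Dict[int, Tuple[str, float, float]] = {}
    for i in order:
        best = None
        for p, duration in exec_time[i].items():  # eligible processors only
            # Time at which all parent data are available on processor p.
            ready = max(
                (placed[k][2] + (0.0 if placed[k][0] == p else comm[(k, i)])
                 for k in pred(i)),
                default=0.0,
            )
            start = earliest_start(busy[p], ready, duration)
            if best is None or start + duration < best[2]:
                best = (p, start, start + duration)
        placed[i] = best
        busy[best[0]].append((best[1], best[2]))
        busy[best[0]].sort()
    return placed


schedule = mheft()
makespan = max(finish for (_, _, finish) in schedule.values())
```

Restricting the inner loop to the processors listed for each task is the only change with respect to a plain HEFT loop; it replaces the three branches of Algorithm 1 by a single lookup.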
THE MHEFT HEURISTIC VS THE EXACT METHOD
In this section we present a comparison between the MHEFT Heuristic and the Exact method. In order to compare the performance of the two methods, we generated several graphs containing from 2 to 40 tasks. The comparison metrics are the makespan, the relative gap of the makespan and the mapping execution time. The relative gap of the makespan is calculated by the following ratio:
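A minimal sketch of one common definition of such a ratio is given below, assumed here for illustration: the excess of the heuristic makespan over the exact makespan, divided by the exact makespan and expressed in percent. The function name `relative_gap` is hypothetical.

```python
def relative_gap(heuristic_makespan: float, exact_makespan: float) -> float:
    """Relative gap, in percent, of a heuristic makespan to the exact one.

    Assumed definition: 100 * (heuristic - exact) / exact.
    """
    if exact_makespan <= 0:
        raise ValueError("exact_makespan must be positive")
    return 100.0 * (heuristic_makespan - exact_makespan) / exact_makespan
```

Under this definition, a heuristic makespan of 110 against an exact makespan of 100 gives a relative gap of 10%.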
Analysing the makespan obtained by the MHEFT Heuristic and by the Exact method, we observe that the MHEFT Heuristic performs well: its makespan is close or equal to the makespan obtained with the Exact method. However, there are some problematic cases where the relative gap is greater than 10%. A preliminary analysis shows that this is due to several parameters, such as the ratio of the number of precedence links to the number of tasks. In future work, we want to improve the MHEFT Heuristic in order to keep the relative gap at or below 10%. For the mapping execution time, the MHEFT Heuristic took less than 0.5 ms on all instances. Table 1 gives the execution times obtained with the exact method. As expected, the MHEFT Heuristic is much faster than the exact method.
CONCLUSION
In this paper we studied the problem of mapping a project of several tasks onto the heterogeneous CPU/FPGA architecture. This problem is proven to be NP-complete; see Fernandez-Baca (1989) and Garey and Johnson (1990). In spite of this complexity, a mathematical model solved by an exact method has been proposed and evaluated. In practice, experience shows that an exact method remains very important in the context of real-time constraints. Indeed, for a simulation test that runs for several hours, the several minutes that the exact method may take are negligible compared with an overload that may cause the loss of the entire test simulation. A good initial mapping helps to avoid such overload risks.
In this study, average communication costs were considered between each pair of tasks linked by precedence, and execution on the processors was considered non-preemptive. We first presented an exact method, which we will use for the static initial mapping; it also serves as a reference for evaluating the optimality of heuristic approaches. The second approach proposed in this work is a modified version of the HEFT heuristic. The changes restrict the processor options considered for each task, since some tasks are executable only on certain kinds of processors. The MHEFT Heuristic gives solutions close to the optimum and will serve to assign newly arriving tasks in the runtime phase. The MHEFT result also provides an upper bound on the optimal makespan, which may help to speed up the exact method.
In future research, we will extend this work by evaluating the performance of our approaches through further experiments. We also intend to consider the exact communication costs instead of the average costs in order to improve optimality. Another possible extension of the setting studied in this paper, for the sake of realism, is to allow the execution time of tasks to increase during the execution of a project. In order to speed up the exact method, additional weights for the selection of the processors may be an efficient solution to explore; indeed, the mathematical model presents a large number of symmetries, since we consider several identical cores. We plan to extend our algorithms to deal with the multi-objective problem of makespan minimization and load balancing. Finally, we intend to investigate dynamic mapping in order to deal with the re-allocation of tasks so as to anticipate the onset of overloads.
