Abstract: Energy management for multimode software defined radio systems remains a daunting challenge. This brief develops a high level framework that generates a multiprocessor systems on chip architecture from a library of heterogeneous processing resources that can be reconfigured to support various modes of operation. The framework proposes joint task and core mapping with system level floorplanning. With the objective of minimizing energy, we develop an analytical probabilistic model that considers static, dynamic, configuration, and communication energy components for multiple applications characterized by probabilities of execution. Finally, a fast energy aware joint task and core mapping heuristic is proposed and performance is demonstrated on realistic benchmarks.
Index Terms: Core mapping, multiprocessor systems on chip (MPSoC), synthesis, task mapping.
I. INTRODUCTION
The rich space of applications and configurations required of modern wireless systems creates a wide range of scenarios that mandates an energy efficient reconfigurable platform. Recently, multiprocessor systems on chip (MPSoC) architectures have evolved rapidly in the race toward flexible high-performance embedded computing. This is particularly true for multimode communication systems that require high flexibility and low power simultaneously. Toward that end, numerous techniques have been developed to optimize energy consumption. A key realization that energy aware techniques exploit is that efficiency is highly related to the nature of the application. Usually, the set of target applications can be characterized stochastically, where each type of application is represented by a certain execution probability. These probabilities can be obtained from statistical information collected regularly from each user. Considering these execution probabilities affects system performance and energy efficiency. For example, a mode with a high execution probability should be mapped to lower power processors that satisfy its execution requirements to obtain an energy efficient system. Energy optimization techniques presented in prior work consider processing energy, communication energy, or both, but neglect reconfiguration energy. It is important to consider the reconfiguration energy required to switch the processing units (PUs) between different tasks, which would be the expected mode of operation for a multireconfigurable platform that supports different radio access technologies. The reconfiguration cost is highly dependent on the architecture and structure of the platform. For example, if a reconfigurable fabric is used, then there is a reconfiguration energy cost associated with changing the configuration bits of the configurable logic blocks (CLBs) and the connections represented in the switching matrix [1].
In the case of multiprocessor-based systems, there is an associated cost in switching context and reloading programs from internal or external memories. Numerous algorithms have been proposed for task mapping and scheduling [2]-[5], as well as processor mapping to network-on-chip (NoC) tiles in energy efficient multiprocessor-based systems [6]-[8]. Custom NoC synthesis is presented in [9] to minimize power consumption while satisfying performance constraints. Comprehensive surveys of task mapping and application mapping onto NoCs are presented in [10] and [11].
The work in [12] and [13] considered joint task and core mapping for irregular and custom NoCs assuming a given NoC architecture. Demirbas et al. [14] considered NoC topology generation with task and core mapping to minimize the application latency. In this brief, we consider joint task and core mapping with system level floorplanning to minimize the energy consumption. As an extension to our previous work in [15], we develop a framework that generates a reconfigurable MPSoC architecture from a library of heterogeneous PUs based on a probabilistic model that considers static, dynamic, reconfiguration, and communication energy components for multiple applications characterized by certain execution probabilities.
II. SYSTEM MODEL
The system is assumed to run a set of probabilistic applications on a heterogeneous platform that comprises different types of PUs connected in a 2-D NoC-based MPSoC. The utilized model is similar to that originally proposed in [4]. In addition to leakage and dynamic processing energy, our proposed model incorporates both communication and reconfiguration energy. Simulation results confirm that exploring the expanded design space leads to design points that are more efficient than the reference baseline technique. To maintain consistency, we adopt the same parameter notations as [4]. In general, the model assumes a set of scenarios $S$, where each scenario $S_m \in S$ has an execution probability $\chi_m$ and includes a set of tasks represented by a directed task graph. Each scenario $S_m \in S$ is executed for an average time $\tau_m$. Although information about scenario execution times may not be available in advance, it can be collected statistically from different users. The set of tasks among all scenarios is defined by $T$, and $T_m$ is the set of tasks executed in scenario $S_m$.
1063-8210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The system comprises a set of PU types $P$, where the $j$th PU type, $P_j$, can have multiple instances such that $\hat{p}_{j,k}$ is the $k$th instance of $P_j$. The maximum possible number of instances $F_j$ of the $j$th PU type is determined at design time based on the task requirements. The static power of $P_j$ is $\sigma_j$. There is a reconfiguration cost associated with switching from one scenario $S_a$ to another scenario $S_m$. This cost represents the cost of reconfiguring the PUs to run the tasks that are in $S_m$ but not in $S_a$. The dynamic and reconfiguration powers of the $i$th task, $t_i$, on any instance of $P_j$ are $\delta_{i,j}$ and $\eta_{i,j}$, while the corresponding execution and reconfiguration times are $\tau_{i,j}$ and $\tau^{r}_{i,j}$. Each task $t_i$ has an execution probability $\psi_i$, an average number of executions $\kappa^{av}_i$, and a reconfiguration probability $r_i$. The value of $r_i$ depends on the transition probabilities among the different execution scenarios. The feasibility indicator $f_{i,j}$ identifies whether it is feasible to map a task $t_i$ on $P_j$, and the utilization of running $t_i$ on $P_j$ is defined by $u_{i,j}$. The communication cost between $t_i$ mapped to $\hat{p}_{j,k}$ and $t_l$ mapped to $\hat{p}_{q,v}$ depends on the communication volume and the distance between the cores. Since the cores are potentially heterogeneous, it is necessary to consider custom NoC architectures. At an early design stage, the communication energy can be abstracted by point-to-point physical links [9]. Therefore, the communication energy between two cores can be represented by the Manhattan distance between them. For a specific PU instance $\hat{p}_{j,k}$ with width $w_{j,k}$ and height $h_{j,k}$, the core location is defined by the x-y coordinates $(X^{\min}_{j,k}, Y^{\min}_{j,k})$.
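The Manhattan-distance abstraction of the communication cost can be illustrated with a minimal Python sketch. The `Core` class, the choice of measuring distance between core centers, and the `energy_per_unit` scaling factor are assumptions made for illustration only; the brief does not state which reference point of a core is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    """A placed PU instance (hypothetical helper type)."""
    x_min: float   # x coordinate of the lower-left corner
    y_min: float   # y coordinate of the lower-left corner
    width: float
    height: float

    def center(self):
        # Assumption: distances are measured between core centers.
        return (self.x_min + self.width / 2.0,
                self.y_min + self.height / 2.0)


def manhattan_distance(a, b):
    """Manhattan (L1) distance between the centers of two cores."""
    ax, ay = a.center()
    bx, by = b.center()
    return abs(ax - bx) + abs(ay - by)


def communication_cost(volume, a, b, energy_per_unit=1.0):
    """Cost proportional to communication volume times distance.

    energy_per_unit is an assumed technology-dependent constant.
    """
    return energy_per_unit * volume * manhattan_distance(a, b)


core_a = Core(0.0, 0.0, 2.0, 2.0)
core_b = Core(2.0, 0.0, 2.0, 2.0)
cost_ab = communication_cost(10.0, core_a, core_b)
```

With two adjacent 2-by-2 cores, the centers are two units apart, so a volume of 10 yields a cost of 20 under the assumed unit link energy.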
III. PROBLEM FORMULATION
This brief considers the problem of building an NoC-based MPSoC architecture from a library of different PU types, followed by joint task mapping to the different PU instances and core mapping with custom NoC floorplanning at design time. The main goal is to minimize the energy consumption, including static, dynamic, communication, and reconfiguration energy, while maintaining performance constraints. A task $t_i$ is related to the PU instance $\hat{p}_{j,k}$ through the task utilization $u_{i,j}$. Since one PU instance can run more than one task simultaneously, the overall utilization constraint of each instance must be satisfied.
The variable $M_{i,j,k}$ defines the mapping between $t_i$ and $\hat{p}_{j,k}$ such that $M_{i,j,k} = 1$ if $t_i$ is mapped on $\hat{p}_{j,k}$, and 0 otherwise. In addition, the variable $Z_{m,j,k}$ defines the mapping between scenarios and PU instances such that $Z_{m,j,k} = 1$ when any $t_i \in T_m$ is mapped to $\hat{p}_{j,k}$, and 0 otherwise. The average energy consumption among all scenarios is derived in (1). Its communication term depends on the communication volume between $t_i$ and $t_l$ in $S_m$, and $\omega_{i,l,m}$ is a weighting factor representing the latency constraint in hops [9] between $t_i$ and $t_l$ in $S_m$. The target is to find the optimal task mapping $M_{i,j,k}$ as well as the optimal core mapping in terms of the x-y coordinates $(X^{\min}_{j,k}, Y^{\min}_{j,k})$ to minimize the average energy consumption.
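To make the four energy components concrete, the following Python sketch combines static, dynamic, reconfiguration, and communication energy for a given task mapping. It is a simplified illustration and should not be read as the expression in (1): the way the probabilities weight each term, the omission of the latency weighting factor, the dictionary layout, and every name in the code are assumptions.

```python
def average_energy(scenarios, static_power, task_params, task_costs,
                   mapping, comm_volume, distance, link_energy=1.0):
    """Illustrative average energy of a mapping (assumed model).

    scenarios:    {m: {"prob": chi_m, "time": tau_m, "tasks": [task, ...]}}
    static_power: {instance: sigma}
    task_params:  {task: {"psi": ..., "kappa": ..., "r": ...}}
    task_costs:   {(task, instance): {"dyn_power", "exec_time",
                                      "rec_power", "rec_time"}}
    mapping:      {task: instance}
    comm_volume:  {(m, task_i, task_l): volume}
    distance:     callable(instance_a, instance_b) -> distance
    """
    static = 0.0
    comm = 0.0
    for m, sc in scenarios.items():
        # Static energy of every instance that hosts a task of scenario m.
        used = {mapping[t] for t in sc["tasks"]}
        static += sc["prob"] * sc["time"] * sum(static_power[p] for p in used)
        # Communication energy: volume times distance, weighted by chi_m.
        for (mm, ti, tl), volume in comm_volume.items():
            if mm == m:
                comm += (sc["prob"] * link_energy * volume
                         * distance(mapping[ti], mapping[tl]))

    dynamic = 0.0
    reconfig = 0.0
    for task, inst in mapping.items():
        par = task_params[task]
        cost = task_costs[(task, inst)]
        dynamic += (par["psi"] * par["kappa"]
                    * cost["dyn_power"] * cost["exec_time"])
        reconfig += par["r"] * cost["rec_power"] * cost["rec_time"]
    return static + dynamic + reconfig + comm


# Hypothetical two-scenario, two-task example.
scenarios = {"m1": {"prob": 0.5, "time": 10.0, "tasks": ["t1", "t2"]},
             "m2": {"prob": 0.5, "time": 4.0, "tasks": ["t1"]}}
static_power = {"p1": 2.0, "p2": 1.0}
task_params = {"t1": {"psi": 1.0, "kappa": 2.0, "r": 0.0},
               "t2": {"psi": 0.5, "kappa": 1.0, "r": 0.5}}
task_costs = {("t1", "p1"): {"dyn_power": 3.0, "exec_time": 1.0,
                             "rec_power": 1.0, "rec_time": 1.0},
              ("t2", "p2"): {"dyn_power": 4.0, "exec_time": 2.0,
                             "rec_power": 2.0, "rec_time": 1.0}}
mapping = {"t1": "p1", "t2": "p2"}
comm_volume = {("m1", "t1", "t2"): 5.0}
total = average_energy(scenarios, static_power, task_params, task_costs,
                       mapping, comm_volume,
                       distance=lambda a, b: 0.0 if a == b else 3.0)
```

In this toy instance the static, dynamic, reconfiguration, and communication contributions are 19, 10, 1, and 7.5, so `total` evaluates to 37.5.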
The mixed integer linear programming (MILP) formulation of the optimization problem is given in (2). The constraint in (2b) guarantees that each task is mapped to one and only one feasible PU instance. The constraint in (2c) ensures that the utilization condition is satisfied for each instance in each scenario. The constraint in (2d) guarantees that no two different PU instances overlap. In addition, (2e) comprises four linearization inequalities.
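The roles of constraints (2b)-(2d) can be illustrated by a small Python checker that validates a candidate solution. This is a sketch of the constraint semantics described in the text, not the MILP itself; the data layout, the function names, and the utilization limit of 1.0 are assumptions.

```python
def each_task_mapped_to_feasible_instance(mapping, tasks, feasible_types):
    """Analogue of (2b): every task sits on exactly one feasible instance.

    mapping: {task: (pu_type, k)}; a dict already allows only one
    instance per task, so feasibility and completeness remain to check.
    """
    return all(t in mapping and mapping[t][0] in feasible_types[t]
               for t in tasks)


def utilization_ok(mapping, utilization, scenario_tasks, limit=1.0):
    """Analogue of (2c): per-scenario load of each instance within limit.

    utilization: {(task, pu_type): u}; limit=1.0 is an assumption.
    """
    for tasks in scenario_tasks.values():
        load = {}
        for t in tasks:
            inst = mapping[t]
            load[inst] = load.get(inst, 0.0) + utilization[(t, inst[0])]
        if any(u > limit + 1e-9 for u in load.values()):
            return False
    return True


def rectangles_overlap(a, b):
    """True if two (x_min, y_min, width, height) rectangles overlap.

    Rectangles that only share an edge are not considered overlapping.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def no_overlap(placement):
    """Analogue of (2d): no two placed instances overlap."""
    rects = list(placement.values())
    return not any(rectangles_overlap(rects[a], rects[b])
                   for a in range(len(rects))
                   for b in range(a + 1, len(rects)))
```

A checker of this kind can serve as a sanity test on the output of a heuristic before its energy is compared with a solver result.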
According to [4], the task mapping problem that considers only static and dynamic power for one scenario and one PU type reduces to the bin packing problem, which is known to be NP-hard. Consequently, the problem of joint task and core mapping for multiple scenarios and PU types considering static, dynamic, configuration, and communication energy components is NP-hard in the strong sense.
IV. PROPOSED DESIGN TIME HEURISTIC ALGORITHM
We propose an iterative heuristic solution to minimize the overall energy consumption. The proposed solution has initial and iterative mapping algorithms. Each one consists of task mapping followed by core (re)mapping, which performs the system level floorplanning of the utilized heterogeneous PU instances.
A. Initial Static Energy Aware Task Mapping
The initial algorithm, shown in Algorithm 1, performs core selection and preliminary task mapping based on a cost function that considers static, dynamic, and reconfiguration energy costs. No communication cost is available before the initial mapping. The set of tasks with the minimum cost function on $P_j$ is included in $T_j$, and all the tasks in $T_j$ are arranged in descending order of utilization. This heuristic starts mapping tasks to the PUs with lower static power first. This increases the chance of packing more tasks into the instances with lower static power consumption, and contributes accordingly to the total energy saving. As shown in step 9, the PU types are sorted in ascending order of static power consumption. Subsequently, the new energy cost is computed for each $t_i \in T_j$ on each PU instance $\hat{p}_{j,k}$. The new energy cost on each instance consists of static, dynamic, and reconfiguration energy, computed in steps 15-17. In step 20, the task $t_i$ is mapped to the instance $\hat{p}_{\check{j},\check{k}}$ with the minimum total estimated cost. After mapping $t_i$ to $\hat{p}_{\check{j},\check{k}}$, the total utilization of $\hat{p}_{\check{j},\check{k}}$ for each scenario $S_m$ that hosts $t_i$, denoted by $U_{m,\check{j},\check{k}}$, is calculated in step 21, and the mapping variable is updated in step 22. Finally, an initial core mapping takes place by running Algorithm 2, as depicted by step 25.
The initial solution determines the number of utilized PU instances for each PU type, and the initial core mapping for each utilized PU instance. Accordingly, a preliminary architecture is generated at design time. This architecture is used in an iterative solution seeking further energy reduction.
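The flavor of the initial mapping can be conveyed by a simplified greedy sketch in Python. It is not Algorithm 1: it assumes a single scenario, charges an assumed static energy once per opened instance, uses a utilization capacity of 1.0, and picks for each task the instance with the minimum incremental cost, breaking ties in favor of the PU type with lower static energy. All names and the data layout are hypothetical.

```python
def initial_mapping(tasks, static_energy, capacity=1.0):
    """Greedy minimum-incremental-cost mapping (simplified sketch).

    tasks:         {task: {pu_type: (utilization, task_energy)}}
                   listing feasible PU types only; task_energy stands
                   for dynamic plus reconfiguration energy.
    static_energy: {pu_type: energy charged when an instance is opened}
    Returns (mapping, load) with instances keyed as (pu_type, k).
    """
    mapping = {}
    load = {}                                   # (pu_type, k) -> utilization
    opened = {j: 0 for j in static_energy}      # instances opened per type

    # Visit tasks in descending order of their largest utilization.
    order = sorted(tasks,
                   key=lambda t: max(u for u, _ in tasks[t].values()),
                   reverse=True)
    for task in order:
        best = None                             # (cost, instance, utilization)
        # Lower static energy first, so ties favor cheaper PU types.
        for pu_type in sorted(tasks[task], key=lambda j: static_energy[j]):
            u, e = tasks[task][pu_type]
            # Already opened instances with enough spare capacity.
            candidates = [((pu_type, k), e)
                          for k in range(opened[pu_type])
                          if load[(pu_type, k)] + u <= capacity + 1e-9]
            # Opening a new instance also pays its static energy.
            candidates.append(((pu_type, opened[pu_type]),
                               e + static_energy[pu_type]))
            for inst, cost in candidates:
                if best is None or cost < best[0]:
                    best = (cost, inst, u)
        _, inst, u = best
        if inst not in load:
            load[inst] = 0.0
            opened[inst[0]] += 1
        load[inst] += u
        mapping[task] = inst
    return mapping, load
```

Because reusing an opened instance avoids paying static energy again, the sketch tends to pack tasks onto few low-static-power instances, which is the behavior the text describes.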
B. Iterative Remapping Solution
After the initial task and core mapping, an iterative solution is used to remap tasks to reduce the total energy consumption. It is highly similar to the initial mapping; however, it also considers the communication energy based on the most recent core mapping. At each iteration, this solution seeks, for each task $t_i$, the PU instance that maximizes the energy saving. The utilized PU instances are sorted in ascending order of their maximum utilization among the different scenarios. Afterward, a new energy cost is calculated for each task $t_i$ on each $\hat{p}_{j,k}$ based on static, dynamic, reconfiguration, and communication cost, while satisfying the utilization constraint. The resulting average energy consumption decreases as the number of iterations increases, and performance approaches that of the optimal solution.
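The remapping loop can be sketched in Python as a local search that moves one task at a time to the instance giving the largest energy reduction. This is an illustrative simplification: `energy_fn` and `feasible_fn` are hypothetical callables standing in for the full energy model and the utilization constraint, the core remapping between iterations is omitted, and the instance ordering by maximum utilization is not modeled.

```python
def iterative_remap(mapping, instances, energy_fn, feasible_fn, max_iters=10):
    """Move tasks between utilized instances while energy decreases.

    mapping:     {task: instance} initial solution
    instances:   iterable of utilized instances
    energy_fn:   callable(mapping) -> total energy (assumed model)
    feasible_fn: callable(mapping) -> bool (e.g. utilization check)
    """
    mapping = dict(mapping)
    instances = list(instances)
    best = energy_fn(mapping)
    for _ in range(max_iters):
        improved = False
        for task in sorted(mapping):
            best_inst, best_energy = mapping[task], best
            for inst in instances:
                if inst == mapping[task]:
                    continue
                trial = {**mapping, task: inst}
                if not feasible_fn(trial):
                    continue
                energy = energy_fn(trial)
                # Keep the move with the largest saving for this task.
                if energy < best_energy - 1e-12:
                    best_inst, best_energy = inst, energy
            if best_inst != mapping[task]:
                mapping[task] = best_inst
                best = best_energy
                improved = True
        if not improved:
            break                      # local minimum reached
    return mapping, best
```

Since a move is accepted only when it strictly lowers the energy, the energy is non-increasing across iterations, although such a local search may stop at a local minimum rather than at the optimum.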
C. Core (Re)Mapping Solution
After each mapping iteration, the core mapping heuristic in Algorithm 2 is used to place each utilized PU instance in a specific NoC position such that the communication energy is minimized. This core mapping is based on the most recent task mapping. The set of utilized instances that host tasks is defined by $P_o$. The sets of mapped and unmapped PU instances are defined by $P_p$ and $P_u$, respectively. The core mapping is determined based on the average number of transactions between the different instances. The algorithm tends to map the instances with a larger number of transactions as close to each other as possible. This heuristic creates system level floorplanning to build a custom NoC architecture that is well suited for the target heterogeneous platform. The heuristic considers the X-Y plane as a feasible area represented by $R$, as in step 3. As presented in step 5, the core mapping heuristic chooses the PU instance with the highest average number of two-way transactions with other instances as the first one to be placed. It is denoted by $\hat{p}_{j_m,k_m}$, and it is positioned by its lower-left corner $(X^{\min}_{j_m,k_m}, Y^{\min}_{j_m,k_m})$. The area it covers, defined by $r$ in step 7, is occupied and removed from the set of feasible positions $R$ as in step 8. Then, the other instances are mapped one by one in descending order of their average cost with the currently mapped instances. The position of the selected instance is determined such that it minimizes the total communication cost with the mapped instances, as indicated in step 11. The algorithm repeats this process until all instances are mapped. Whenever a new instance is mapped, it is inserted in the set of mapped instances $P_p$ and removed from the set of unmapped instances $P_u$, as presented by step 13. This can be considered as a 2-D bin packing problem. However, for the scope of this brief, we impose no restriction on the area or aspect ratio of the chip.
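The greedy placement idea can be sketched in Python as follows. The sketch makes a strong simplifying assumption that differs from the brief: every core occupies one unit cell of a grid instead of a rectangle of its own width and height, and candidate positions are the free cells adjacent to already placed cores. The function name and the traffic dictionary layout are hypothetical.

```python
def place_cores(cores, traffic):
    """Greedy placement on a unit grid (simplified sketch).

    cores:   list of instance names
    traffic: {frozenset({a, b}): average two-way transactions}
    Returns {core: (x, y)}.
    """
    def t(a, b):
        return traffic.get(frozenset((a, b)), 0.0)

    # The core with the highest total traffic is placed first, at the origin.
    first = max(cores, key=lambda c: sum(t(c, o) for o in cores if o != c))
    placed = {first: (0, 0)}
    unplaced = [c for c in cores if c != first]

    while unplaced:
        # Next: the core with the most traffic toward already placed cores.
        nxt = max(unplaced, key=lambda c: sum(t(c, p) for p in placed))
        occupied = set(placed.values())
        # Candidate cells: free neighbors of occupied cells.
        frontier = sorted({(x + dx, y + dy)
                           for (x, y) in occupied
                           for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))}
                          - occupied)

        def cost(pos):
            # Traffic-weighted Manhattan distance to the placed cores.
            return sum(t(nxt, p) * (abs(pos[0] - q[0]) + abs(pos[1] - q[1]))
                       for p, q in placed.items())

        placed[nxt] = min(frontier, key=cost)
        unplaced.remove(nxt)
    return placed
```

Extending the sketch to heterogeneous rectangles would require tracking occupied regions rather than cells, in the spirit of removing the region $r$ from the feasible area $R$.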
The algorithm complexity depends on the number of tasks ($N_t$), the number of PU types ($N_p$), the number of resulting PU instances ($N_i$), and the number of scenarios ($N_s$). The complexity of the initial mapping is $O(N_p N_t N_i N_s)$, while that of the iterative mapping is $O(N_i^2 N_t N_s)$.
V. PERFORMANCE RESULTS
This section demonstrates the performance of the proposed heuristic for a set of realistic benchmarks that are widely used in current smartphones, namely WCDMA, denoted by $A_1$; LTE, denoted by $A_2$; a JPEG encoder, denoted by $A_3$; and MP3 decoding, denoted by $A_4$. These applications are profiled and divided into individual tasks as presented by the task graphs in Fig. 1. The figure also illustrates the communication volumes.
A total of 11 execution scenarios are assumed, where each one comprises a single application or multiple applications with equal latency requirements and different throughputs. The different scenarios and their corresponding execution probabilities are shown in Table I. In addition, we assume a library of six different PU types, where each PU type can have multiple instances. The chosen PUs are: 1) an OpenRISC core; 2) an ARM-Cortex-A9; 3) an Ultra-SPARC-T1; 4) a large reconfigurable FPGA fabric based on the Xilinx Virtex-II Pro CLBs with a total of 5824 CLBs; 5) a small reconfigurable fabric with 728 CLBs of the same type; and 6) a turbo decoder coprocessor. Table II presents the specifications of the used PU types.
TABLE I: DESCRIPTION OF REALISTIC BENCHMARKS
TABLE II: DESCRIPTION OF THE USED PU TYPES
For the set of given tasks, the task execution times, utilization, and different costs associated with PU types are obtained from [1] and [16]-[22]. The utilization of the different tasks on OpenRISC, ARM-Cortex-A9, and Ultra-SPARC-T1 is calculated as the ratio between the task execution time and the task deadline based on the application throughput. The utilization of the FPGA fabrics is calculated as the ratio between the task requirements and the available resources in terms of the number of CLBs.
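The two utilization definitions can be written down directly, as in the short Python sketch below. The function names and the example numbers for execution time, deadline, and required CLBs are hypothetical; only the fabric size of 728 CLBs is taken from the PU library described above.

```python
def processor_utilization(execution_time, deadline):
    """Utilization on a processor core: execution time over deadline."""
    if deadline <= 0:
        raise ValueError("deadline must be positive")
    return execution_time / deadline


def fabric_utilization(required_clbs, available_clbs):
    """Utilization on a reconfigurable fabric: required over available CLBs."""
    if available_clbs <= 0:
        raise ValueError("available_clbs must be positive")
    return required_clbs / available_clbs


# Hypothetical task: 2 ms of execution against an 8 ms deadline.
u_cpu = processor_utilization(2.0, 8.0)
# Hypothetical task needing 364 CLBs on the 728-CLB fabric.
u_fabric = fabric_utilization(364, 728)
```

Under these definitions, a set of tasks fits on one instance in a scenario when their utilizations sum to no more than the instance capacity, taken here as 1.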
The proposed heuristic is applied to the set of the aforementioned scenarios on the hypothetical platform, and its performance is evaluated. The heuristic is compared with the CPLEX solver solution, the simulated annealing (SA)-based heuristic [23], and a baseline heuristic that considers the static and dynamic powers only [4]. Table III presents the energy consumption and the execution time of the different algorithms. As shown in the table, the proposed heuristic achieves a 23.33% energy saving with respect to the heuristic that considers only static and dynamic power. It achieves a 65.11% (51.43%) energy saving with respect to the SA approach running for 100 (1000) iterations. Table III also compares the heuristic with the CPLEX solution for different CPLEX time limits. The proposed heuristic outperforms the CPLEX optimizer when CPLEX runs for 1 min or 5 min. CPLEX outperforms the heuristic after running for 1 h; in that case, the results show that the overhead of the heuristic is 7.1% relative to the CPLEX solution. The CPLEX optimizer was not able to handle the same problem with a larger number of PU types due to its extensive memory requirements. This confirms that the proposed heuristic is necessary, especially for larger problem sizes where using the CPLEX optimizer becomes infeasible. The compiled architecture consists of seven PU instances divided as follows: 1) two OpenRISC cores; 2) one ARM-Cortex-A9 core; 3) one coarse reconfigurable fabric; 4) two fine reconfigurable fabrics; and 5) one turbo decoder coprocessor core. The utilized PU instances are mapped to build the NoC-based MPSoC. Fig. 2 shows the generated architecture with the corresponding task mapping and floorplanning.
VI. CONCLUSION
This brief proposes a framework that generates an energy aware MPSoC platform with corresponding task and core mappings. It presents an MILP model for energy aware joint task and core mapping with system level floorplanning. The developed framework performs core selection and then applies an energy aware joint task and core mapping heuristic with system level floorplanning for custom NoC generation. The proposed framework can handle large problem sizes for which it becomes infeasible to use optimal solvers. The framework has been applied to a realistic test case, and comparative performance results have been presented.
