An efficient technique for mapping application tasks to heterogeneous processing elements (PEs) on a Network-on-Chip (NoC) platform, operating at multiple voltage levels, is presented in this paper. The goal of the mapping is to minimize energy consumption subject to performance constraints. Such a mapping involves solving several subproblems. Most research efforts in this area address these subproblems sequentially, or address only a subset of them. We take a unified approach to the problem without compromising the solution time and provide techniques for optimal and heuristic solutions. We prove that the voltage assignment component of the problem is itself NP-hard and inapproximable within any constant factor. Our optimal solution utilizes a Mixed Integer Linear Program (MILP) formulation of the problem. The heuristic utilizes MILP relaxation and randomized rounding. Experimental results based on E3S benchmark applications and a few real applications show that our heuristic produces near-optimal solutions in a fraction of the time needed to find the optimal solution.
Introduction
In recent years, System-on-Chip (SoC) design has become extremely challenging due to the increasing complexity of processor and semiconductor technologies. Multicore SoC-based embedded systems may contain either homogeneous generic processing cores only, or a varying number of heterogeneous PEs. These heterogeneous PEs may be programmable general-purpose cores, task-specific co-processors, hardware accelerators, etc. Network-on-Chip (NoC) architectures provide an alternative to the bus-based communication mechanism that can meet the challenging requirements of performance, scalability and flexibility [1, 2]. As the number of PEs on an SoC and the data traffic between them continue to grow, energy minimization subject to performance constraints has become one of the most important objectives.
The problem of minimizing energy consumption during application execution while satisfying the performance constraints can be divided into four main subproblems: (i) mapping of the application tasks to the PEs, (ii) mapping of the PEs to the routers of the NoC architecture, (iii) assigning operating voltages to the PEs (in case they can operate at multiple voltages) and (iv) routing of data paths, i.e., traffic movement on the NoC architecture. As considering all four subproblems simultaneously increases the complexity of the problem, most research efforts in this domain [4, 6, 8, 9] either solve subproblems (i), (ii), (iii) and (iv) sequentially, or solve only a subset of them.
To find an energy-efficient application mapping, all four subproblems (i)-(iv) have to be solved. There are two options: solve them sequentially or solve them in a unified way. The sequential approach has several disadvantages. Firstly, a decision taken in an early phase may turn out to be expensive later. Secondly, earlier decisions may lead to a violation of constraints in a later phase, so that all the steps have to be re-executed multiple times, involving an enormous amount of computation.
We show with a motivating example here, and later with extensive experimental results, that the sequential approach may lead to a sub-optimal solution. To the best of our knowledge, our proposed technique is the first that unifies all four subproblems under a single problem formulation and develops optimal and heuristic solutions for the problem. Although scaling down the voltage levels of PEs is favorable for reducing energy consumption, an excessive number of voltage islands may be detrimental from the perspective of physical design, as it fragments the chip into voltage islands and increases the complexity of the power delivery network. Therefore, the number of voltage islands on the chip should be bounded from above. In the literature [7, 8], the constraint on the maximum number of voltage islands either has not been captured properly or involves a huge amount of computation.

The motivation behind our unified approach comes from the following example. Fig. 1(a) shows an example application task graph consisting of four tasks $T_0$, $T_1$, $T_2$ and $T_3$. The edges represent the task dependencies and the labels on the edges represent the inter-task communication volume in number of bytes. Tables 1 and 2 show the execution time and power consumption, respectively, of these four tasks on the four available PEs $P_0$, $P_1$, $P_2$ and $P_3$.
Following a sequential approach, the resultant mapping is as shown in Fig. 1(b), with a computation energy consumption of 3.8124 µJ and a communication energy consumption of 0.952 µJ. Thus, the overall energy consumption for the application is 3.8124 + 0.952 = 4.7644 µJ. With such a mapping, the execution finish-time of the application is 86.104 µs, well within the specified deadline of 122 µs. For our proposed unified mapping, shown in Fig. 1(c), the total energy consumption is 3.8575 µJ + 0.0159 µJ = 3.8734 µJ, an energy saving of 18.7% compared with the sequential approach. With this mapping, the execution finish-time of the application is 84.852 µs, still within the deadline of 122 µs.
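The arithmetic behind the quoted saving can be checked directly. The short Python sketch below only reuses the four energy figures from the example; the variable names are illustrative, and the split of the unified total into a computation and a communication term is assumed by analogy with the sequential case.

```python
# Energy figures (in microjoules) quoted in the motivating example.
seq_computation, seq_communication = 3.8124, 0.952
# Assumed to be the computation and communication terms, in that order.
uni_computation, uni_communication = 3.8575, 0.0159

seq_total = seq_computation + seq_communication   # sequential mapping, Fig. 1(b)
uni_total = uni_computation + uni_communication   # unified mapping, Fig. 1(c)

# Relative saving of the unified mapping with respect to the sequential one.
saving_percent = 100.0 * (seq_total - uni_total) / seq_total

print(f"sequential: {seq_total:.4f} uJ, unified: {uni_total:.4f} uJ")
print(f"saving: {saving_percent:.1f}%")
```

Note that the unified mapping spends slightly more on computation but far less on communication, which is where the overall saving comes from.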
Energy consumption can be further reduced if we take advantage of operating the PEs of the NoC at multiple voltage levels. When the PEs are allowed to operate at different voltages, a certain amount of energy is consumed by the level shifters connecting adjacent PEs that operate at different voltages.

From the example above, it is clear that a unified approach to mapping applications onto NoC PEs operating at multiple voltages leads to better energy utilization than a sequential approach. The energy consumption model considered in this paper has three components: (i) energy consumption due to computation, (ii) energy consumption due to communication and (iii) voltage transition energy between adjacent PEs operating at different voltages.
In this paper, we assume a regular mesh architecture as the communication infrastructure, where each router has 5 ports. One of the ports connects the router to a PE and the other four connect it to the neighboring routers. The algorithms proposed in this paper can be used with any other NoC topology as well. The communication power consumption parameter values used for the evaluation of our techniques are taken from [9].
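As a small illustration of this topology, the following Python sketch builds the routers and links of an $n \times n$ mesh and counts the ports each router actually uses. The function names and the (row, column) router identifiers are assumptions made for this sketch only.

```python
def build_mesh(n):
    """Return the routers and undirected links of an n x n mesh.

    Routers are identified by (row, column); each link is a pair of
    neighbouring routers. These names are illustrative, not the paper's.
    """
    routers = [(r, c) for r in range(n) for c in range(n)]
    links = []
    for r, c in routers:
        if c + 1 < n:                      # link to the east neighbour
            links.append(((r, c), (r, c + 1)))
        if r + 1 < n:                      # link to the south neighbour
            links.append(((r, c), (r + 1, c)))
    return routers, links


def ports_in_use(router, links):
    """Ports used by a router: one local PE port plus one per mesh link."""
    return 1 + sum(router in link for link in links)


routers, links = build_mesh(3)
print(len(routers), len(links))            # 9 routers, 12 links
print(ports_in_use((1, 1), links))         # centre router uses all 5 ports
print(ports_in_use((0, 0), links))         # corner router uses 3 ports
```

Only interior routers use all five ports; routers on the border of the mesh leave one or two of the neighbor ports unconnected.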
Problem Formulation
In this section, we provide a formal definition of the application mapping problem. The input instance of this problem is explained in Table 3, and the output of the problem is shown in Table 4. The objective is to minimize the overall energy consumption, such that:
1. All the application tasks finish their execution before deadline $D$.
2. The traffic on each link of the mesh does not exceed the link bandwidth.
3. The number of voltage islands does not exceed the maximum allowed value $\kappa$.
Table 3. Input instance of the application mapping problem.
$(V_T, E_T)$: Application task graph. $V_T = \{t_1, t_2, \ldots, t_m\}$ is the set of tasks; a directed edge $e_{ij} = (t_i, t_j) \in E_T$ represents the dependency of task $t_j$ on task $t_i$.
$w_{ij}$: For edge $e_{ij} \in E_T$, the data communication volume (in Mbps) between tasks $t_i$ and $t_j$.
$D$: Application deadline, by which all the tasks need to be completed.
$\kappa$: Maximum number of voltage islands created.
$P$: A set of processing elements $\{p_1, p_2, \ldots, p_k\}$.
$P_t \subseteq P$: For each task $t \in V_T$, the subset of PEs that are able to execute task $t$.
$V_p$: For each PE $p \in P$, the allowable set of voltage levels $\{v_1, v_2, \ldots, v_{n_p}\}$ at which this PE can operate, where $n_p$ denotes the number of distinct voltage levels at which it can operate.
$\tau(t, p, v_p)$: Execution time of task $t \in V_T$ on PE $p \in P_t$, when $p$ is operating at voltage $v_p \in V_p$.
$\epsilon(t, p, v_p)$: Computation energy consumption of task $t \in V_T$ on PE $p \in P_t$, when $p$ is operating at voltage $v_p \in V_p$.
$(V_R, E_R)$: An undirected $n \times n$ mesh architecture graph, where each node $r \in V_R$ denotes a router in the NoC architecture and each edge $e_{ij} = (r_i, r_j) \in E_R$ represents a link between those two routers.

Table 4. Output of the application mapping problem.
Task to PE mapping function $M_1$: $M_1(t_i) = p_j$ means that task $t_i$ has been mapped to PE $p_j$; $t_i \in V_T$ and $p_j \in P_{t_i} \subseteq P$.
PE to router mapping function $M_2$: $M_2(p_i) = r_j$ means that PE $p_i$ has been mapped to NoC router $r_j$; $p_i \in P$ and $r_j \in V_R$.
PE to voltage assignment function $M_3$: $M_3(p_i) = v_j$ means that PE $p_i$ has been assigned voltage $v_j$; $p_i \in P$ and $v_j \in V_{p_i}$.
Task edge to path in mesh mapping function $M_4$: $M_4(e_{ij})$ is the set of mesh links $e_{xy} \in E_R$ forming the path that carries the traffic of task edge $e_{ij} \in E_T$.

Computational Complexity
In this section we define the voltage assignment problem, subproblem (iii) of the application mapping problem, and prove it to be NP-hard and inapproximable within any constant factor. If $M_1(t_j) = p_i$, then for each $p_i$ we can find an allowable set of voltage levels $L_i = \{v_{i1}, \ldots, v_{in_i}\}$ such that the task $t_j$ can meet its deadline $d_j$ (obtained using the deadline $D$ and the communication latencies) whenever $p_i$ is operating at one of the voltage levels in $L_i$. We consider two components for energy consumption:
(1) Energy consumption by PE $p_i$ when operating at voltage $v \in L_i$. We denote this component by $\eta^{v}_{p_i}$.
(2) Energy consumption by the level shifter connecting two adjacent PEs $p_i$ and $p_j$, where $p_i$ is operating at voltage $x$ and $p_j$ is operating at voltage $y$, denoted by $\alpha_{xy}$.

Voltage Assignment Problem: Given an $n \times n = N$ grid, where each node represents a router in the NoC architecture with a PE attached to it, and a list $L_i$ of allowable operating voltages associated with each PE $p_i$, the problem is to assign a voltage $v(p_i) \in L_i$ to every PE $p_i$ such that the following energy consumption expression is minimized:

$$\sum_{i=1}^{N} \eta^{v(p_i)}_{p_i} \;+\; \sum_{\{p_i, p_j\}\ \text{adjacent}} \alpha_{v(p_i)\, v(p_j)}$$
The voltage assignment problem is in one-to-one correspondence with the minimum weight grid coloring problem, defined as follows.

Minimum Weight Grid Coloring Problem: We are given a grid graph $G = (V, E)$ of dimension $a \times b = M$. Each node $u \in V$ has an associated set of colors $C_u = \{1, \ldots, K_u\}$, each with a certain color-cost $c(k) \in \mathbb{R}^{+}$, $k \in C_u$. Let $u, w \in V$ be neighbors in the grid graph, i.e., $\{u, w\} \in E$. We are also given combination-costs $c(k, l) \in \mathbb{R}^{+}$ for each color combination $k \in C_u$ and $l \in C_w$. The goal of the minimum weight grid coloring problem is to find a coloring, i.e., a color $k_u \in C_u$ for each node $u \in V$, which minimizes the following objective function:

$$\sum_{u \in V} c(k_u) \;+\; \sum_{\{u, w\} \in E} c(k_u, k_w)$$
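To make the problem statement concrete, the following Python sketch solves a tiny instance by exhaustive search. It is only an illustration under simplifying assumptions: the function names are hypothetical, the combination cost is taken to be the same on every edge, and the cost values are made up. Exhaustive search is exponential in the number of nodes, so this is not a practical solver.

```python
from itertools import product


def grid_edges(a, b):
    """Undirected edges of an a x b grid graph; nodes are (row, col) pairs."""
    edges = []
    for r in range(a):
        for c in range(b):
            if c + 1 < b:
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < a:
                edges.append(((r, c), (r + 1, c)))
    return edges


def min_weight_coloring(a, b, colors, color_cost, combination_cost):
    """Exhaustively minimise color costs plus combination costs.

    colors[u]               -> allowed colors of node u
    color_cost(u, k)        -> cost of giving color k to node u
    combination_cost(k, l)  -> cost when two neighbours get colors k and l
    """
    nodes = [(r, c) for r in range(a) for c in range(b)]
    edges = grid_edges(a, b)
    best_cost, best = float("inf"), None
    for choice in product(*(colors[u] for u in nodes)):
        assign = dict(zip(nodes, choice))
        cost = sum(color_cost(u, assign[u]) for u in nodes)
        cost += sum(combination_cost(assign[u], assign[w]) for u, w in edges)
        if cost < best_cost:
            best_cost, best = cost, assign
    return best_cost, best


# Toy 2 x 2 instance: two "voltages" per node; level 2 costs more, and a
# transition cost of 3 is paid on every link whose endpoints differ.
colors = {(r, c): (1, 2) for r in range(2) for c in range(2)}
colors[(0, 0)] = (2,)  # this node is only allowed the higher level
cost, assignment = min_weight_coloring(
    2, 2, colors,
    color_cost=lambda u, k: 1 if k == 1 else 2,
    combination_cost=lambda k, l: 0 if k == l else 3,
)
print(cost, assignment)
```

In this toy instance the single constrained node pulls all its neighbors to the higher level, because paying the transition cost would be more expensive. This coupling between neighboring choices is what makes the problem hard in general.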
We now prove the decision version of this problem, called MINCOL, to be NP-hard by reducing an instance of the known NP-hard problem GRAPH 3-COLORABILITY on planar graphs with no vertex degree exceeding 4 [5] (page 85) to an instance of MINCOL.

Theorem 1. Minimum weight grid coloring is NP-hard and even inapproximable within any factor.
Proof. Let $G' = (V', E')$ denote the given planar graph. From this graph we construct an instance of the minimum weight grid coloring problem as follows.

We embed $G'$ in a grid graph $G = (V, E)$ whose size is polynomial in $n$, where $n$ is the number of nodes in $G'$, using a polynomial-time algorithm for computing an orthogonal representation. Let $W \subseteq V$ denote the subset of nodes in $G$ that corresponds to the embedded node set $V'$ of the given planar graph. For each edge $e' = \{u', w'\} \in E'$ we denote by $p_{e'} = u_1, e_1, u_2, \ldots, e_{j-1}, u_j$ the path corresponding to the embedding of $e'$. $P = \bigcup_{e' \in E'} p_{e'}$ denotes the set of all such paths. Fig. 2 gives an example of such an embedding for a simple graph.

With each node $u \in V$ of $G$ we associate 3 colors $C_u = \{1, 2, 3\}$. We set all color-costs $c(k) = 0$ for $k \in C_u$ and $u \in V$. Similarly, all combination-costs of edges $\{u, w\} \in E$ contained in none of the paths, i.e., $\{u, w\} \notin P$, are set to zero as well: $c(k, l) = 0$ for $k \in C_u$ and $l \in C_w$. For each path $p \in P$ we set the combination-costs as follows:
i) For $e_1 = \{u_1, u_2\}$ we set $c(k, l) = 0$ for $k \in C_{u_1}$, $l \in C_{u_2}$ and $k \neq l$. For the remaining combinations with $k = l$ we set $c(k, l) = 1$. In other words, if $u_1$ and $u_2$ are assigned different colors, we have cost 0, otherwise cost 1.
ii) For $e_i$ with $i \in \{2, \ldots, j-1\}$ we set $c(k, l) = 1$ for $k \in C_{u_i}$, $l \in C_{u_{i+1}}$ and $k \neq l$. For the remaining combinations with $k = l$ we set $c(k, l) = 0$. In other words, if $u_i$ and $u_{i+1}$ are assigned the same color, we have cost 0, otherwise cost 1.
If we aim for a total cost of 0, the path $p$ propagates the color chosen for $u_j$ all the way to $u_2$. For the cost to remain at zero, $u_1$ and $u_2$ (and therefore $u_j$) must be colored with different colors. Therefore, the given planar graph $G'$ is 3-colorable if and only if there is a solution to the constructed minimum weight grid coloring problem of total cost zero. This proves the problem to be NP-hard.

The inapproximability within any factor follows, since any approximation algorithm with multiplicative approximation ratio $\rho$ and additive factor $\rho'$ could be used to decide whether $G'$ is 3-colorable as well: simply multiply all combination costs by $\rho' + 1$. If $G'$ is 3-colorable, the optimal solution has cost 0 and therefore the approximation algorithm must find a solution with cost $\le \rho \cdot 0 + \rho' = \rho'$. Otherwise, if $G'$ is not 3-colorable, any solution has cost $\ge \rho' + 1$. Hence, the approximation algorithm can distinguish between the two cases.
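The behavior of a single path gadget in this reduction can be checked by enumeration. The Python sketch below is a sanity check and not part of the proof; the function names are illustrative. It lists all zero-cost colorings of a path $u_1, \ldots, u_j$ under the combination-costs described above: cost 0 on the first edge when the colors differ, and cost 0 on every later edge when the colors agree.

```python
from itertools import product

COLORS = (1, 2, 3)


def differ_cost(k, l):
    """Cost on the first path edge e_1: 0 if the colors differ, 1 if equal."""
    return 0 if k != l else 1


def equal_cost(k, l):
    """Cost on the later path edges e_2..e_{j-1}: 0 if equal, 1 otherwise."""
    return 0 if k == l else 1


def path_cost(coloring):
    """Total combination cost of a path u_1, ..., u_j under a coloring."""
    cost = differ_cost(coloring[0], coloring[1])
    for a, b in zip(coloring[1:], coloring[2:]):
        cost += equal_cost(a, b)
    return cost


def zero_cost_colorings(j):
    """All colorings of a j-node path with total combination cost 0."""
    return [c for c in product(COLORS, repeat=j) if path_cost(c) == 0]


solutions = zero_cost_colorings(4)
print(len(solutions))
# In every zero-cost coloring the two endpoints carry different colors.
print(all(c[0] != c[-1] for c in solutions))
```

Every zero-cost coloring gives the two endpoints of the path different colors, so the path acts like the original edge of the planar graph, as the argument requires.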
Optimal Solution for Application Mapping
In this section we use mathematical programming techniques to solve the application mapping problem. We formulate the problem as a Mixed Integer Linear Program (MILP). In Table 5 we define a few of the variables used in the MILP formulation and also declare their types. The parameters used for the calculation of energy consumption and execution time are as follows:
$\tau_{tpv}$ = execution time of task $t \in V_T$ on PE $p \in P_t$ at voltage level $v \in V_p$ (in sec)
$\epsilon_{tpv}$ = computation energy consumption for task $t \in V_T$ on PE $p \in P_t$ at voltage level $v \in V_p$ (in Joules)
$\alpha_{v_1 v_2}$ = voltage transition energy consumption parameter for two adjacent nodes operating at voltage levels $v_1$ and $v_2$, respectively (in Joules)

Table 5. Variables used in the MILP formulation.
Variable indexed by $(t, p, v)$: Binary. 1, if $t \in V_T$, $p \in P_t \subseteq P$, $v \in V_p$, $M_1(t) = p$ and $M_3(p) = v$; 0, otherwise.
$\zeta_{rv}$: Binary. 1, if $r \in V_R$ is operating at voltage $v$, i.e., $M_2(p) = r$ and $M_3(p) = v$ for some $p \in P$; 0, otherwise.
$f^{ij}_{xy}$: Binary. 1, if $e_{ij} \in E_T$, $e_{xy} \in E_R$ and $e_{xy} \in M_4(e_{ij})$; 0, otherwise.
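The objective of the application mapping problem is to minimize the energy consumption of the system subject to the application deadline constraint, the mesh interconnection link bandwidth constraint and the constraint on the maximum allowed number of voltage islands, i.e.,

$$\text{minimize} \quad E_c + E_r + E_l + E_{vt}$$

where $E_c$ is the computation energy consumption, $E_r$ and $E_l$ are the communication energy consumed at the router ports and links, respectively, and $E_{vt}$ is the voltage transition energy consumption, calculated as:

$$E_{vt} = \sum_{e_{xy} \in E_R} \sum_{v_1} \sum_{v_2} \alpha_{v_1 v_2}\, \zeta_{x v_1}\, \zeta_{y v_2}$$

In order to eliminate the quadratic term in the expression of $E_{vt}$, we define the following decision variable:

$$\theta^{v_1 v_2}_{xy} = \begin{cases} 1, & \text{if } \zeta_{x v_1} = 1 \text{ and } \zeta_{y v_2} = 1 \\ 0, & \text{otherwise} \end{cases}$$

Hence,

$$E_{vt} = \sum_{e_{xy} \in E_R} \sum_{v_1} \sum_{v_2} \alpha_{v_1 v_2}\, \theta^{v_1 v_2}_{xy}$$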
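For the definition of $\theta^{v_1 v_2}_{xy}$ to hold inside a linear program, the variable has to be tied to $\zeta_{x v_1}$ and $\zeta_{y v_2}$ by linear constraints. The text does not list these constraints; a standard way to linearize the product of two binary variables, given here as an assumption rather than as the formulation actually used, is:

```latex
\begin{align}
\theta^{v_1 v_2}_{xy} &\le \zeta_{x v_1}, \\
\theta^{v_1 v_2}_{xy} &\le \zeta_{y v_2}, \\
\theta^{v_1 v_2}_{xy} &\ge \zeta_{x v_1} + \zeta_{y v_2} - 1, \\
\theta^{v_1 v_2}_{xy} &\in \{0, 1\}.
\end{align}
```

With these constraints, $\theta^{v_1 v_2}_{xy} = 1$ exactly when both $\zeta_{x v_1} = 1$ and $\zeta_{y v_2} = 1$. If all $\alpha_{v_1 v_2}$ are nonnegative and the objective is minimized, the third inequality alone already suffices, since the solver has no incentive to set $\theta^{v_1 v_2}_{xy}$ higher than necessary.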
Constraints: Due to space limitations, we omit the mathematical details of the constraint formulations here.
(1) Each task executes on exactly one PE, at exactly one voltage, connected to exactly one router node.
(2) Each router operates at exactly one voltage level and is attached to at most one PE.
(3) Each task needs to finish within the specified application deadline, and the task dependencies need to be respected.
(4) Bandwidth constraint on each router link: the total flow passing through a link does not exceed its bandwidth.
(5) The number of voltage islands is less than or equal to the specified maximum allowed value $\kappa$.
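Constraint (5) can be checked on a candidate solution outside the MILP. The Python sketch below counts voltage islands on a mesh by flood fill, under the assumption (not stated explicitly in the text) that a voltage island is a maximal connected group of neighboring routers operating at the same voltage. The function name, the layout and the voltage values are hypothetical.

```python
def count_voltage_islands(voltage):
    """Count connected groups of mesh routers that share a voltage level.

    `voltage` is a square list of lists; voltage[r][c] is the level of the
    router in row r, column c. Two routers belong to the same island when
    they are mesh neighbours (north/south/east/west) at the same level.
    """
    n = len(voltage)
    seen = set()
    islands = 0
    for r in range(n):
        for c in range(n):
            if (r, c) in seen:
                continue
            islands += 1                    # a new, unvisited island starts here
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < n and 0 <= nx < n
                            and (ny, nx) not in seen
                            and voltage[ny][nx] == voltage[y][x]):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return islands


layout = [
    [1.0, 1.0, 1.2],
    [1.0, 1.2, 1.2],
    [0.8, 0.8, 1.0],
]
kappa = 4
islands = count_voltage_islands(layout)
print(islands, islands <= kappa)
```

In this layout the router in the bottom-right corner runs at the same level as the top-left group but is not adjacent to it, so it counts as a separate island. This is the kind of fragmentation that the bound $\kappa$ is meant to limit.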
Heuristic Solution for Application Mapping
In this section we describe our proposed heuristic, which is based on MILP relaxation and randomized rounding. The algorithm takes as input the parameters specified in Section 2 and the corresponding MILP formulation. The output of the algorithm consists of the mappings $M_1$, $M_2$, $M_3$ and $M_4$ defined in Table 4. The heuristic is explained below.
    ...
    repeat
        Round $\gamma_{tr}$ to 1 with probability equal to its value
    until ($\gamma_{tr} = 1$ for some $r$)
    Round the $\zeta_{rv}$ variables following the correlation constraints
    Round the $\sigma_{rp}$ variables following the correlation constraints
    ...
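The repeat-until rounding step can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the experiments: the function name, the router labels and the fractional values are hypothetical, and choosing the first router that is rounded to 1 within a pass is a tie-breaking assumption of this sketch.

```python
import random


def round_task_to_router(gamma_t, rng):
    """Pick one router for a task from fractional LP values.

    gamma_t maps router -> fractional value in [0, 1] (the relaxed gamma_tr
    for a fixed task t). Each pass rounds the variables to 1 one by one with
    probability equal to their value; passes repeat until some variable is
    rounded to 1. Taking the first such router in a pass is a tie-breaking
    assumption of this sketch.
    """
    if not any(value > 0 for value in gamma_t.values()):
        raise ValueError("at least one fractional value must be positive")
    while True:
        for router, value in gamma_t.items():
            if rng.random() < value:
                return router


rng = random.Random(0)                       # fixed seed for repeatability
gamma_t = {"r0": 0.7, "r1": 0.3, "r2": 0.0}  # hypothetical LP solution
picks = [round_task_to_router(gamma_t, rng) for _ in range(1000)]
print({r: picks.count(r) for r in gamma_t})
```

A router whose fractional value is 0 is never selected, and routers with larger values are selected more often. Because this sketch scans the routers in a fixed order, it favors earlier routers somewhat more than their fractional values alone would suggest; sampling a router with probability proportional to its value would avoid that bias.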
Experimental Results
In this section we present the experimental results used to evaluate our proposed approach. The experiments fall into the following three categories, according to their evaluation goal:
i) The optimal solution for the variable voltage setup is compared with the optimal solution for the fixed voltage setup.
ii) The optimal solution of our proposed unified approach is compared with that of the sequential approach.
iii) The quality of the heuristic solution is evaluated by comparing it with the optimal solution.
The experiments are performed using applications (auto-industry, consumer, networking and office-automation) from the E3S benchmark suite [3] and three real applications (MPEG4, MWD and OPD).

On the other hand, for high voltage setups feasible solutions were found with higher energy consumption values. The heuristic is implemented in C++. The MILP for obtaining the optimal solution and the corresponding relaxed LP used by the heuristic were executed on the same machine using ILOG CPLEX 10.0 Concert Technology.

Experiments with E3S Benchmark Applications: A mesh dimension of 4 × 4 was used for the auto-industry application and 3 × 3 for the other three applications. The maximum number of voltage islands $\kappa$ was set to 4 and 3, respectively, for these two mesh sizes. Fig. 3 compares the optimal solution using variable voltage (VV) levels with the optimal solution at a fixed voltage (FV) level, i.e., the optimum for Case I with the optima for all other cases. On average over these four applications, the variable voltage setup saves 18% energy compared with the fixed voltage setup. We compare the solution quality of our proposed unified approach with that of the sequential approach in Fig. 4. For these four applications, on average, the unified approach saves 16% energy. Fig. 5 shows the quality of our proposed heuristic compared with the optimal solution, i.e., the optimal and heuristic solutions for Case I. In all the cases, the heuristic is able to find near-optimal solutions. Fig. 6 shows that the time taken to obtain the heuristic solution is negligible compared to that required to solve the MILP for the optimal solution.

Experiments with Real Applications: A mesh dimension of 3 × 3 was used for the MPEG4 and MWD applications and 4 × 4 for the OPD application. The maximum number of voltage islands $\kappa$ was set to 3 and 4, respectively, for these two mesh sizes. Fig. 7 compares the optimal solution using variable voltage (VV) levels with the optimal solution at a fixed voltage (FV) level, i.e., the optimum for Case I with the optima for all other cases.
On average over these three applications, the variable voltage setup saves 11% energy compared with the fixed voltage setup. Fig. 8 shows the quality of our proposed heuristic compared with the optimal solution, i.e., the optimal and heuristic solutions for Case I. In all the cases, the heuristic is able to find near-optimal solutions within seconds.
Thus, the experimental results support our claim that the heuristic achieves near-optimal solutions. They also show that the optimal solution with flexible voltage levels is more energy-efficient than the optimal solution at a fixed voltage level. Moreover, they show that the unified approach yields optimal solutions, whereas the sequential approach may yield only sub-optimal ones.
Conclusion
In this paper, we have proposed a unified approach to solve the application mapping problem on a heterogeneous NoC platform for energy minimization. The voltage assignment problem is proven to be NP-hard. Our solution techniques are evaluated using the E3S benchmark suite [3] and three real applications. The experimental results demonstrate the effectiveness of our heuristic, the superiority of the unified approach over the sequential approach and the advantage of operating the PEs at multiple voltage levels for energy minimization.
