Reliable energy-aware application mapping, task scheduling, and voltage-frequency island partitioning that minimizes energy consumption while preserving the required bandwidth and latency is a challenging problem in the design of Multi-Processor Systems-on-Chip. To achieve modular design and low power consumption, the Globally Asynchronous Locally Synchronous (GALS) design paradigm is a promising approach that fits very well with the voltage-frequency island concept. In this paper, we formulate the problem of mapping a real-time application with stochastic execution times onto a multicore system, scheduling tasks on processors, and assigning voltage-frequency levels to Processing Elements (PEs) as a Mixed Integer Linear Program (MILP) for GALS-based Network-on-Chip. Furthermore, owing to the importance of reliability, we address the effects of transient faults in the proposed MILP formulation such that the reliability of the whole system, which incorporates several heterogeneous PEs, is guaranteed to be better than a given threshold. Because the problem is NP-hard, a rounding by sampling-based heuristic algorithm is provided. Experimental results based on the E3S benchmark suite and some real applications show the effectiveness of the proposed heuristic in achieving a near-optimal solution in a small fraction of the time needed to find the optimal solution. The experimental results also show that our formulation preserves the required reliability and increases the energy consumption by 70% in some cases.
Introduction
Due to the increasing demand for high computational capabilities, Network-on-Chip (NoC) has emerged as a promising interconnect solution for high-performance systems [2]. Such chips consist of regular tiles and contain a varying number of PEs. The PEs can be DSP cores, programmable general-purpose cores, high-bandwidth Input/Output (IO) devices, or task-specific coprocessors, which introduce heterogeneity within the chip. The NoC provides more scalability, flexibility, and performance than bus-based communication mechanisms [3]. Developing a design methodology for NoC-based Multi-Processor System-on-Chip (MPSoC) that optimizes the total system energy consumption under bandwidth, latency, and reliability constraints poses novel and exciting challenges to the research community [4]. The energy-aware application mapping, task scheduling, and Voltage-Frequency Island (VFI) partitioning problem has been addressed in recent studies on NoC-based MPSoC platform design [5] [6] [7] [8] [9]. Owing to the heterogeneity of PEs, assigning a specific application task to different PEs results in very different computational energy consumption. For each task assignment, the diversity of inter-task communication volumes and the existence of different paths between source and destination can lead to completely different communication energy consumption. Furthermore, PEs can operate at multiple voltage-frequency levels to achieve desirable performance and energy trade-offs; consequently, the solution to the energy-aware application mapping, task scheduling, and VFI partitioning problem has a significant impact on the total energy consumption.
Assigning a single supply/threshold voltage and operating clock speed to all PEs is not energy efficient and requires distributing a single global clock signal throughout the chip. Asynchronous techniques can be used to overcome the challenges of a synchronous global clock. However, fully asynchronous design lacks the well-established design techniques available for synchronous systems. The gap between synchronous and asynchronous design techniques is bridged by the globally asynchronous and locally synchronous technique [9]. The GALS design technique partitions the global clock tree into local trees. Hence, it can reduce clock tree size, spread, and clock buffering requirements [10] [11] [12] [13] [14]. It also inherits the advantages of both synchronous and asynchronous design and can be well adapted to the VFI concept [15].
Multi-processor systems that are designed and implemented with the GALS technique are partitioned into several VFIs. Although scaling down the voltage-frequency level of each PE reduces the computational energy consumption, the extra communication energy consumption and the overhead of designing an excessive number of independent voltage-frequency islands may diminish the energy saving within the chip, as depicted in Fig. 1. Because each VFI requires its own power grid, clock tree, mixed-clock/mixed-voltage FIFOs and voltage level converters in order to communicate with other VFIs [9], the number of VFIs must be limited to achieve fine-grain system-level power management in a single chip [1].
Several recent studies have addressed the energy-aware application mapping, task scheduling, and voltage-frequency island partitioning problem [1, 5-7, 16, 17]. Aiming at reducing energy consumption while ensuring that tasks do not miss their deadlines, these studies apply a wide range of techniques such as branch-and-bound algorithms, linear programming, randomized algorithms and greedy heuristics [6, 8, 16-19]. These studies optimize energy under the assumption that all tasks run for their worst-case execution times. Nevertheless, in many cases tasks finish before their Worst-Case Execution Time (WCET) [20]. In this study, we consider tasks with worst-case execution times and probabilistic execution times. For simplicity, it is further assumed that a probability distribution of the execution times of each task is provided.
Optimizing the energy consumption of running a target application on a NoC-based MPSoC architecture while satisfying bandwidth, latency, reliability and real-time constraints can be divided into five subproblems, as depicted in Fig. 2: (a) assigning tasks to PEs, (b) mapping PEs onto the routers of the NoC architecture, (c) assigning voltage-frequency levels to PEs, (d) assigning task communications to routing paths among PEs, and (e) scheduling tasks onto PEs. To achieve the optimal overall energy consumption, all five subproblems have to be solved simultaneously, even though doing so is a complicated task.
In addition to the complexities imposed by the task constraints (e.g., task deadline, throughput and bandwidth requirements) on mapping and scheduling, reliability has emerged as a significant application constraint in embedded systems [21, 22]. As feature sizes continue to scale down in deep submicron technology, it is possible to integrate a large number of transistors into a single chip. This integration leads to a considerable rate of transient faults caused by errors that arise from capacitive crosstalk, power supply noise, and neutron and alpha radiation [23] [24] [25] [26]. It is worth mentioning that the effect of voltage scaling on transient faults [27, 28] makes the reliability issues more complicated.
In this paper, we formulate the problem of energy-aware application mapping, task scheduling, and voltage-frequency island partitioning as a Mixed Integer Linear Program and solve all subproblems (a) to (e) mentioned above simultaneously in a unified manner. To the best of our knowledge, our proposed MILP formulation is the first that formulates reliability constraints and considers the stochastic behavior of task execution times. We also address different constraints such as bandwidth, latency, task deadlines and the maximum number of VFIs in the proposed MILP formulation. The effect of transient faults on the reliability of the system is also formulated within the MILP, which guarantees the minimum required reliability of the system. The remainder of this paper is organized as follows. Related work is addressed in Section 2. The power, energy, latency, reliability and application models are described in Section 3. Section 4 presents the formulation of the energy-aware application mapping, task scheduling, and VFI partitioning problem, while a novel rounding by sampling-based heuristic is provided in Section 5. Experimental results are provided in Section 6. Finally, Section 7 concludes the paper.
Related work
The authors in [1] developed a design methodology which partitions NoC architecture into multiple VFIs and assigns supply and threshold voltage levels to each VFI. Their methodology minimizes the overall system energy consumption, under performance constraints. In [29] , Jang et al. construct the framework for energy-efficient VFI-aware partitioning, mapping and routing. The authors in [30] and [8] consider the mapping and routing problem in the presence of bandwidth as well as latency constraints as an application requirement.
In [6], Ghosh et al. take a unified approach to minimizing the energy consumption while mapping the application tasks to the PEs, mapping the PEs to the routers, assigning operating voltages to the PEs, and routing data paths subject to performance constraints. They propose an MILP formulation to find the optimal solution and then provide a randomized rounding-based heuristic that exploits the MILP relaxation. Considering a regular mesh NoC topology, the authors in [19] formulate the mapping and voltage islanding problem as an optimization problem in a unified way and present both an optimal solution, obtained by solving the MILP, and a heuristic solution based on random greedy selection.
The authors in [7, 16, 31] minimize the total communication energy consumption in the mapping problem under different performance or bandwidth constraints. The authors in [5] propose a mapping technique which allocates available PEs to the incoming real-time application tasks in order to minimize communication energy, given some deadline constraints. In [18] , the authors present an algorithm to map cores under bandwidth constraints, minimizing the average communication delay. The authors in [17] formulate the inter-tile network contention using ILP formulation and solve the contention-aware application mapping problem by a mapping heuristic.
Preliminaries and problem definition
In this section we review the necessary background required for the rest of the paper. 
Application characteristics
In order to formulate the energy-aware application mapping, task scheduling, and voltage-frequency islands partitioning in a formal way, the following definitions are provided which clarify the characteristics of the application:
Definition 1. An Architecture Characterization Graph (ACG) with node set V R and arc set A R is a directed graph, where each node r i ∈ V R represents a router in the tile-based NoC architecture, in which each tile consists of a processing or storage element (referred to as PE) and a router, and each arc a ij ∈ A R represents a link between two adjacent routers r i and r j [32]. For an N × N mesh, the number of nodes is n = N^2 and the number of arcs is m = 4N(N − 1). The bandwidth of each physical data channel a ij is represented by BW(a ij ). Fig. 3(a) depicts an ACG for a 3 × 3 mesh-based NoC architecture.
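As a quick check of the node and arc counts above, the following minimal Python sketch builds the directed arc set of an N × N mesh. The function name `build_mesh_acg` and the (row, column) router labels are illustrative assumptions and are not notation from the paper.

```python
def build_mesh_acg(n_side):
    """Return (routers, arcs) of an n_side x n_side mesh with directed arcs."""
    routers = [(row, col) for row in range(n_side) for col in range(n_side)]
    arcs = []
    for row, col in routers:
        # Link each router to its right and lower neighbour, in both directions.
        for d_row, d_col in ((0, 1), (1, 0)):
            neighbour = (row + d_row, col + d_col)
            if neighbour[0] < n_side and neighbour[1] < n_side:
                arcs.append(((row, col), neighbour))
                arcs.append((neighbour, (row, col)))
    return routers, arcs


routers, arcs = build_mesh_acg(3)
# For N = 3 the formulas give n = 9 routers and m = 4 * 3 * 2 = 24 arcs.
print(len(routers), len(arcs))
```

Each pair of adjacent routers contributes two arcs, one per direction, which is why the arc count is twice the number of physical neighbour pairs.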
Definition 2. Let P = {p 1 , . . . , p n } be the set of PEs, which can operate at any frequency in a set L of available frequency levels. We use normalized voltages and frequencies, which implies f max = 1.0. The power consumption of each processing element p j operating at frequency level f k ∈ L is represented by ρ k j . Also, β kl is the energy overhead of connecting two adjacent tiles whose PEs operate at frequency levels f k and f l , respectively.
Definition 3. A Communication Task Graph (CTG) with task set V T and edge set E T is a directed task graph, where each edge c ij ∈ E T represents the dependency of task t j on task t i [32]. Fig. 3(b) shows a sample CTG with 6 tasks and 8 edges. Each c ij has the following properties:
• v(c ij ): The data communication volume (in bits) between t i and t j .
• bw(c ij ): The minimum bandwidth (bits per second) required by c ij that should be satisfied for each task communications.
• σ (c ij ): The maximum tolerable latency of c ij which is given in the number of hop counts instead of an exact number of clock cycles [30] and also represents the performance constraint in the mapping problem.
Furthermore, each t i ∈ V T has the following properties:
• deadline i : The deadline of t i .
• Two vectors, X ij = {x 
Definition 4. The minimum reliability R 0 indicates the minimum tolerable reliability of the real-time system, which must be satisfied while mapping a real-time application and scheduling the real-time tasks [49].
Power, energy and delay models
In a general system-level power model, the power consumption of a computing system is given by [28] as:
$$P = P_s + h\,(P_{ind} + P_d) = P_s + h\,(P_{ind} + C_{ef} V^2 f)$$
where P s is the static power, which is used to keep the clock running and to maintain basic circuits. This power can only be removed by turning off the whole system. P ind is the frequency-independent active power, which is the constant power consumed by off-chip and external devices. Putting the system components in sleep mode efficiently removes the frequency-independent active power. Finally, P d represents the frequency-dependent active power, which consists of the processor dynamic power and any power that depends on the processing frequency and the system supply voltage. If the system operates in sleep mode, h equals 0; otherwise h = 1. Here, C ef , V and f represent the switched capacitance, supply voltage and frequency of the system, respectively. Switching between on and off modes incurs a significant time and energy overhead [33]. In the sequel, we assume an always "on" system. Since P s is then always consumed, we focus below on the frequency-dependent and frequency-independent active power.
For frequency scaling alone, when the frequency-independent active power is greater than zero, the energy increases if the frequency of the application decreases while the supply voltage is kept constant [28]. However, the frequency-dependent active power term decreases as the frequency of the application decreases [34]. The voltage scaling approach reduces the supply voltage whenever it reduces the frequency [35]. Owing to the almost linear relation between circuit delay and V −1 [36, 37], the authors in [28] indicate that, for systems to operate properly, the operating frequency needs to scale down linearly when the supply voltage is decreased. They use normalized frequencies and voltages and assume f max = V max = 1, which implies that for frequency f the corresponding supply voltage is V = f · V max = f . We make the same assumptions. Thus, the dynamic energy consumption of executing a task at frequency f and supply voltage V can be modeled as:
where x i is the execution time of task t i when it is executed at f max .
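The model can be illustrated with a short Python sketch. It assumes that only the frequency-dependent power C ef V^2 f contributes to the dynamic energy, that V = f in normalized units, and that a task needing x i at f max runs for x i / f at frequency f. The second helper weights several possible execution times by their probabilities, with the times assumed to be given at f max. The function names and sample numbers are hypothetical.

```python
def dynamic_energy(c_ef, f, x_at_fmax):
    """Dynamic energy of one task at normalized frequency f, assuming V = f."""
    if not 0.0 < f <= 1.0:
        raise ValueError("normalized frequency must lie in (0, 1]")
    power = c_ef * f ** 3        # C_ef * V^2 * f with V = f (assumption)
    duration = x_at_fmax / f     # the task runs longer at a lower frequency
    return power * duration      # simplifies to C_ef * f^2 * x


def expected_dynamic_energy(c_ef, f, times_at_fmax, probabilities):
    """Probability-weighted energy over the possible execution times."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * dynamic_energy(c_ef, f, x)
               for x, p in zip(times_at_fmax, probabilities))


# Halving the frequency cuts the dynamic energy of the same task to a quarter.
print(dynamic_energy(1.0, 1.0, 10.0), dynamic_energy(1.0, 0.5, 10.0))
```

Under these assumptions the energy scales with f^2, which is the saving that voltage-frequency scaling trades against deadlines and reliability.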
Considering the probabilistic execution times for each task, the expected computational energy consumption E f i can be expressed as:
where x k i represents the kth possible execution time of t i when it is executed at frequency f and Pr k i denotes its corresponding probability. Here, X i represents the set of all possible execution times for task t i and intuitively,
The dynamic energy consumed by one bit of data sending through the router can be defined by bit energy E bit metric as [38] :
where E B bit , E L bit and E S bit represent the communication energy consumption of the buffer, link and switch fabric, respectively. Assuming the bit energy values are measured at maximum supply voltage and frequency, the dynamic part of communication energy consumption of transmitting one bit from the source router to the destination router can be computed as [39] :
where P is the set of routers on the path from the source router to the destination router and f r i represents the normalized frequency assigned to router r i .
Partitioning the chip into VFIs imposes extra energy overhead. This energy is required to connect nodes in one VFI with nodes in other VFIs. The overhead of connecting two different voltage-frequency islands is modeled as [39]:
$$E_{VFI} = E_{ClkGen} + E_{Vconv} + E_{MixClkFifo} \qquad (2)$$
where E ClkGen , E Vconv and E MixClkFifo represent the energy consumed by generating additional clock signals [40], by the voltage level converters [41] and by the mixed-clock/mixed-voltage FIFOs [42], respectively. The power delivery network is a complicated structure, and its implementation overheads prohibit increasing the number of VFIs without careful provisioning [6, 9, 39]. As discussed in [1], the energy consumption of the level shifters used to connect voltage-frequency islands is not negligible and could be considerably high. This energy is derived from the level converter circuit designed in [41], where the energy overhead during the voltage transition between two levels is estimated to be proportional to the difference of the squares of the voltage levels.
Finally, we assume that routers in each VFI are locally synchronous and that routers in two different VFIs communicate with each other through mixed-clock/mixed-voltage FIFOs. Therefore, the communication latency between source and destination routers, while sending a volume of data equal to vol(src, dest) bits under non-blocking traffic, is expressed as [39]:
where P , W , and D S represent the set of routers on the path from the source router to the destination router, the physical data channel width, and the routing, switching, and propagation delay across wires between two routers at maximum frequency, respectively. Here, f i is the operating frequency assigned to r i and t fifo is the delay of the FIFO buffers.
Fault and reliability models
Focusing on transient faults, the authors in [43] 
where λ 0 represents the average fault rate at frequency f max and voltage V max , that is, g( f max , V max ) = 1. Based on the results presented in [44] and [45], the fault rate increases exponentially when the supply voltage decreases, both in the Alpha processor family and in memory devices of different technologies. This is because reducing the supply voltage leads to a smaller critical charge, which is responsible for the exponentially increased fault rate [46, 47]. Furthermore, with a smaller critical charge, lower-energy particles can also cause an error, and there are many more lower-energy particles than higher-energy ones [48].
As discussed in Section 3.2, we assume that the supply voltage is reduced for lower operating frequencies [36]. Hence, the effect of voltage scaling on the fault rate while the system runs at frequency f and the corresponding supply voltage can be formulated as [28]:
$$\lambda(f) = \lambda_0 \, g(f) = \lambda_0 \, 10^{\frac{d\,(1 - f)}{1 - f_{min}}}$$
where d > 0 is a constant which represents the sensitivity of the fault rate to voltage and frequency scaling, and f min is the lowest normalized operating frequency.
The reliability of a task is defined in [49] as the probability of completing the task successfully, i.e., without encountering errors triggered by transient faults. Since we assume that transient faults arrive according to a Poisson process whose arrival rate depends on the operating frequency, the probability of having no transient fault during one execution of task t i at frequency f can be computed as:
$$R_i^f = e^{-\lambda(f)\, x_i / f}$$
where x i is the execution time of task t i when it is executed at f max , so that its execution time at f < f max is x i / f .
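A minimal Python sketch of the task reliability model follows. It assumes the exponential fault-rate model commonly attributed to [28], in which the rate grows by a factor of 10^d between f max = 1 and the lowest frequency f min, and a Poisson fault process over an execution time of x i / f. The function names and the sample parameter values are hypothetical.

```python
import math


def fault_rate(f, lambda0, d, f_min):
    """Assumed model: lambda(f) = lambda0 * 10 ** (d * (1 - f) / (1 - f_min))."""
    return lambda0 * 10.0 ** (d * (1.0 - f) / (1.0 - f_min))


def task_reliability(f, x_at_fmax, lambda0, d, f_min):
    """Probability of no transient fault while the task runs for x / f."""
    return math.exp(-fault_rate(f, lambda0, d, f_min) * x_at_fmax / f)


# Lowering the frequency raises the fault rate and lengthens the exposure time,
# so the reliability drops on both counts.
for level in (1.0, 0.5, 0.25):
    print(level, task_reliability(level, 10.0, 1e-6, 2.0, 0.25))
```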
We consider different possible execution times for each task to express more realistic behavior of a real-time system. The expected value of R f i can be expressed as:
We are more interested in the Worst-Case Reliability (WCR) of each task, which occurs at its WCET (i.e., $x_i^{WCET}$). The WCR is an underestimation of task reliability since ∀k, $x_i^k \le x_i^{WCET}$. We also consider the WCR as a constraint during the mapping process, which can guarantee the reliability of a restricted real-time system. Considering the definition of the reliability of each task and the fact that tasks are executed on PEs independently, if the reliability of task t i at frequency f is R f i , then the worst-case reliability of the whole system at frequency f is calculated as the product of the reliabilities of its tasks and is given by:
$$R^f = \prod_{t_i \in V_T} R_i^f$$
As discussed above, reducing the supply voltage at lower operating frequencies leads to exponentially increased fault rates, which results in lower reliability. Hence, there is a trade-off between computational energy consumption and reliability in terms of supply voltage and operating frequency. Increasing the chip operating frequency and supply voltage together increases reliability at the expense of more energy consumption. We are interested in minimizing the energy consumption without violating the minimum reliability requirement of the system.
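The trade-off can be made concrete with a small Python sketch that multiplies the worst-case task reliabilities and then looks for the lowest frequency level that still meets a reliability threshold. It is only an illustration under simplifying assumptions: all tasks share one frequency, the fault rate follows the exponential model λ(f) = λ 0 · 10^(d(1 − f)/(1 − f min)), and the names `system_wcr` and `lowest_reliable_level` as well as the sample numbers are hypothetical.

```python
import math


def system_wcr(f, wcets_at_fmax, lambda0, d, f_min):
    """Product of per-task worst-case reliabilities at a shared frequency f."""
    rate = lambda0 * 10.0 ** (d * (1.0 - f) / (1.0 - f_min))
    return math.prod(math.exp(-rate * w / f) for w in wcets_at_fmax)


def lowest_reliable_level(levels, wcets_at_fmax, r0, lambda0, d, f_min):
    """Return the smallest level whose system WCR reaches r0, or None."""
    for f in sorted(levels):
        if system_wcr(f, wcets_at_fmax, lambda0, d, f_min) >= r0:
            return f
    return None


levels = [0.25, 0.5, 0.75, 1.0]
wcets = [10.0, 20.0, 30.0]
# The lowest level is the most energy efficient, but it may miss the threshold.
print(lowest_reliable_level(levels, wcets, 0.999, 1e-6, 3.0, 0.25))
```

Because dynamic energy falls with frequency while reliability rises with it, the lowest feasible level is the energy-optimal choice in this simplified single-frequency setting; the MILP makes the same trade-off per PE.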
Problem definition
We formally define the energy-aware application mapping, task scheduling, and voltage-frequency island partitioning problem as:
$$\text{Minimize } (E_{comp} + E_{comm} + E_{VFIs}) \qquad (4)$$
where E comp , E comm , and E VFIs represent the computational energy consumption, the communication energy consumption, and the VFI overhead energy consumption, respectively.
Constraints:
• Each task is assigned to exactly one PE.
• Each processor operates at exactly one frequency and voltage level.
• Each PE is mapped onto exactly one router and exactly one PE is assigned to each router.
• Execution deadlines are met for each task and all task dependencies are considered.
• The reliability of the whole system is above a given threshold.
• The number of VFIs is less than or equal to a given upper bound.
• The link bandwidth constraint and communication latency constraint are met for router links and task communications, respectively.
Proposed MILP formulation
In this section, we develop a formal definition of energy-aware application mapping, task scheduling, and voltage-frequency island partitioning, subject to bandwidth, latency, and deadline constraints and the maximum number of permissible VFIs. There are four functions, as depicted in Table 1: M 1 , the task to processing element assignment function; M 2 , the PE to voltage-frequency level assignment function; M 3 , the processing element to router mapping function; and M 4 , the task communication to path mapping function. We formulate these assignment and mapping functions step by step, and we model the scheduling problem and all constraints in an MILP formulation.
We first formulate E comp , E comm , and E VFIs as an MILP formulation according to the models presented in Section 3.2 and formulate the first three constraints simultaneously in Sections 4.1, 4.2, and 4.3. Then we formulate the rest of the constraints in Section 4.4.
Computational energy consumption
The computational energy consumption plays a significant role in total energy consumption and below we formulate this problem as an MILP formulation. Before doing so, we need to model all related parameters. Computational energy consumption of each PE while running a set of tasks depends on two factors: the tasks execution times and PE's frequency level. At first, we consider M 1 (t i ) = p j which states the assignment of t i to p j and can be modeled by an indicator variable as:
Each task must be assigned to exactly one PE, so the following constraint is provided as:
Secondly, frequency level assignment (e.g., M 2 (p j ) = f k ) can be modeled by another indicator variable as:
The following constraint is used to ensure that exactly one frequency level has to be assigned to each PE as:
Considering these two assignment functions simultaneously, we can introduce a new decision variable as:
This variable can be expressed in a quadratic form as 
Accordingly, the following constraints provide conditions in the definition of γ as:
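The constraints themselves are not reproduced in this text. A standard way to linearize the product of two binary variables, given here only as an assumed illustration, uses three inequalities: γ ≤ x, γ ≤ y, and γ ≥ x + y − 1. The Python sketch below enumerates all binary cases and confirms that these inequalities leave the product x · y as the only feasible value of γ. The helper name `allowed_gamma` is hypothetical.

```python
from itertools import product


def allowed_gamma(x, y):
    """Binary values of gamma that satisfy the three linear inequalities."""
    return [g for g in (0, 1)
            if g <= x and g <= y and g >= x + y - 1]


# Enumerate every combination of the two binary assignment indicators.
for x, y in product((0, 1), repeat=2):
    print(x, y, allowed_gamma(x, y))
```

In each of the four cases exactly one value survives and it equals x · y, so the quadratic term can be replaced by a linear variable without changing the feasible set.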
The expected computational energy consumption now can be formulated as:
where the elements in the set N ij = {1, . . . ,n ij } correspond to elements in X ij .
VFIs overhead energy consumption
Mapping PEs onto routers can be modeled as:
Similarly, the following constraints are used in order to map each PE to exactly one router and assign exactly one processor to each router, respectively.
Mapping p j with assigned frequency f k onto a router r m is modeled by the following indicator variable as:
These conditions can be modeled by the following constraints:
The following indicator variable indicates assigning a frequency level to a router:
The definition of θ mk can be formulated as:
We define another indicator variable to model two adjacent routers with their assigned frequencies as:
This conditional definition can be written as the following constraints:
Now the overhead energy consumption of connecting two VFIs can be modeled as:
Communication energy consumption
There may be multiple paths between a source router and a destination router in a given network. Each path results in different throughput, latency and energy consumption. Let us define M 4 (c ij ) as the task communication to path mapping function, which means:
where Γ is the set of links which are involved in delivering data communication volume between t i and t j , t j ∈ V T . In order to specify the routing path for a given task communication, say c ij , it is necessary to know routers which t i and t j are mapped onto. We define, ϕ im indicator variable as:
Modeling this variable requires a new indicator variable, which is defined as follows:
Considering the definitions of ϕ and ϑ , now we can write the following equation:
It is clear that each task needs to be mapped onto exactly one PE and each PE must be assigned to exactly one router, hence the following constraint is given:
If M 4 (c ij ) = Γ and |Γ | ≥ 1 (i.e., t i and t j are mapped to different routers), then Γ needs to satisfy three main constraints.
• Loop avoidance constraint: There must be no loop in the path from M 3 (M 1 (t i )) to M 3 (M 1 (t j )).
• Source and destination constraints: The routing path must be from M 3 (M 1 (t i )) to M 3 (M 1 (t j )). These routers, therefore, must be visited once through the routing path, which implies that ∃l 1 = a x 1 y 1 ∈ Γ such that ϕ ix 1 = 1 and ∃l 2 = a x 2 y 2 ∈ Γ such that ϕ jy 2 = 1. It can also be inferred that ∄l 3 = a x 3 y 3 ∈ Γ such that ϕ iy 3 = 1 and ∄l 4 = a x 4 y 4 ∈ Γ such that ϕ jx 4 = 1.
• Path existence constraint: Links selected for the routing path must construct a single connected path. Hence, ∀l 1 ∈ Γ if l 1 = a x 1 y 1 and ϕ jy 1 = 0 then ∃l 2 = a x 2 y 2 ∈ Γ , such that x 2 = y 1 . 
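The three routing constraints above can be checked procedurally, as in the following Python sketch. It is a plain validity checker and not the MILP encoding: a link a xy is represented as a pair (x, y) directed from router x to router y, and `src` and `dst` stand for the routers that host t i and t j. The function name and this representation are assumptions made for illustration.

```python
def is_valid_route(links, src, dst):
    """Check loop avoidance, source/destination and path existence for a route."""
    links = set(links)
    if src == dst:
        return not links  # tasks on the same router need no links
    starts = [x for x, _ in links]
    ends = [y for _, y in links]
    # Loop avoidance: no router is left or entered more than once.
    if len(set(starts)) != len(starts) or len(set(ends)) != len(ends):
        return False
    # Source and destination: the path only leaves src and only enters dst.
    if src not in starts or src in ends or dst not in ends or dst in starts:
        return False
    # Path existence: walking from src must use every link and stop at dst.
    next_hop = dict(links)
    current, used = src, 0
    while current in next_hop:
        current = next_hop[current]
        used += 1
    return current == dst and used == len(links)


print(is_valid_route({(0, 1), (1, 2)}, 0, 2))
print(is_valid_route({(0, 1), (1, 2), (3, 4), (4, 3)}, 0, 2))
```

The second call fails because the extra links form a detached cycle, which per-router degree conditions alone would not rule out.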
Clearly, Γ is an empty set if t i and t j are mapped onto the same router. This could be written as:
Putting together all the elements, now the routing path constraints could be expressed formally in our MILP formulation.
It can be inferred from the loop avoidance constraint that if M 4 (c ij ) = Γ , then for each router r x ∈ V R there is at most one link in Γ in which r x appears as the head and at most one link in Γ in which it appears as the tail. Additionally, the source and destination of the path must be visited exactly once. All these constraints can be modeled as:
Furthermore, other constraints in the source and destination constraints need to be satisfied. There must be a link where t i is mapped onto its head and also there must be a link where t j is mapped onto its tail. Thus, the path existence constraint can be stated as:
Considering routing paths, it is now possible to model the Communication Energy Consumption as:
where v ij and ε xk represent v(c ij ) and (
, respectively. Note that (9) is in Quadratic Programming (QP) form and can be reduced to an Integer Linear Program (ILP) using the inequalities presented in (5).
Constraints
(1) Real-time application deadline: A valid schedule satisfies two kinds of constraints: edge constraints and in-processor constraints [50]. The edge constraint induced by a ij preserves the order between each t i and its successor t j . In addition, any two tasks that are assigned to the same PE must not overlap in their execution time. This is achieved by imposing an in-processor constraint on the tasks assigned to a single PE. In order to express these constraints, sep ij , which was introduced in (8), is used. Moreover, the worst-case execution time of each task can be calculated as:
Edge constraint, therefore, can be formulated as: 
where τ k represents the routing, switching and propagation delay across wires between two routers at frequency f k . We can linearize (10) using the technique introduced in (5). Note that, due to the tight scheduling constraints and the need for predictability in real-time systems, the task WCET is used in the scheduling constraints to check that tasks are able to meet their deadlines. In-processor constraints can be expressed by the following inequalities:
where b ij equals 1 if t i is executed before t j ; otherwise, it equals 0 as stated in (11) . Lastly, the following constraints represent start time constraint and deadline constraint for each task.
∀t i ∈ V T :
(2) Worst-case reliability: As mentioned in Section 3.3, WCR of each task t i when it is executing on p j at frequency f can be written as:
Now we can write the WCR of whole system as a product of all task reliabilities:
The WCR constraint can be expressed as R > R 0 , that is:
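The inequality itself is not reproduced in this text. One common way to bring a product constraint such as R > R 0 into linear form, shown here as an assumption rather than as the authors' exact formulation, is to take logarithms: since each task reliability is an exponential, ln R becomes a sum of terms −λ(f) · x / f, each of which is a constant once the task, PE and frequency level are fixed. The Python sketch below checks this equivalence numerically; the function names, the exponential fault-rate model and the sample values are hypothetical.

```python
import math


def ln_task_wcr(wcet_at_fmax, f, lambda0, d, f_min):
    """ln of the worst-case task reliability: -lambda(f) * wcet / f."""
    rate = lambda0 * 10.0 ** (d * (1.0 - f) / (1.0 - f_min))
    return -rate * wcet_at_fmax / f


def wcr_constraint_holds(assignments, r0, lambda0, d, f_min):
    """assignments: list of (wcet_at_fmax, chosen_frequency) pairs."""
    total = sum(ln_task_wcr(w, f, lambda0, d, f_min) for w, f in assignments)
    return total > math.log(r0)  # same test as R > R0, but additive


fast = [(10.0, 1.0), (20.0, 1.0), (30.0, 1.0)]
slow = [(10.0, 0.5), (20.0, 0.5), (30.0, 0.5)]
print(wcr_constraint_holds(fast, 0.9999, 1e-6, 3.0, 0.25))
print(wcr_constraint_holds(slow, 0.9999, 1e-6, 3.0, 0.25))
```

In an MILP, each constant term would be multiplied by the binary variable that selects the corresponding task, PE and frequency level, so the left-hand side stays linear.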
(3) Number of VFIs: Voltage-frequency level assignment can be viewed as coloring the vertexes of the grid graph. Let the routers be represented by vertexes and the voltage-frequency levels be represented by colors. Construct a new graph by defining new edges between every two adjacent vertexes with the same color. The problem of finding the number of voltage-frequency islands is then the same as finding the number of connected components in this newly constructed graph, as in (12). The variable κ is used to represent the newly constructed graph.
which means:
The number of connected components in a graph equals the number of vertexes minus the number of edges in a spanning forest. The spanning forest is a set of spanning trees, one for each connected component. Since there are two edges between every two adjacent routers r x and r y in our architecture characterization graph (i.e., a xy and a yx ), we use rooted spanning trees instead of spanning trees, as in Fig. 4. A rooted tree is a directed acyclic graph in which all edges point away from the root. The root vertex has no parent (i.e., it is a vertex with no entering edges) and all other vertexes have exactly one parent. In order to represent the spanning directed forest, we introduce a new variable ι xy , which is 1 if κ xy is used in the spanning directed forest and 0 otherwise.
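The counting identity (components = vertexes − spanning forest edges) can be verified with a short Python sketch. It uses a union-find structure to grow a spanning forest over adjacent routers that share a level; this is a procedural check, not the MILP encoding with ι xy. The function name `count_vfis`, the (row, column) router labels and the sample level assignment are assumptions made for illustration.

```python
def count_vfis(level_of):
    """Count islands of adjacent routers sharing a level on a mesh.

    level_of maps (row, col) -> integer voltage-frequency level index.
    """
    parent = {router: router for router in level_of}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    forest_edges = 0
    for (row, col), level in level_of.items():
        for neighbour in ((row, col + 1), (row + 1, col)):
            if level_of.get(neighbour) == level:
                root_a, root_b = find((row, col)), find(neighbour)
                if root_a != root_b:      # edge joins two trees: keep it
                    parent[root_a] = root_b
                    forest_edges += 1
    # Components = vertexes minus edges of the spanning forest.
    return len(level_of) - forest_edges


grid = [[0, 0, 1],
        [0, 1, 1],
        [2, 2, 0]]
levels = {(r, c): grid[r][c] for r in range(3) for c in range(3)}
print(count_vfis(levels))
```

Note that the router in the bottom-right corner uses level 0 but is not adjacent to the other level-0 routers, so it forms an island of its own.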
As discussed before, ι xy must be 0 when κ xy equals 0. To enforce this constraint, following inequality needs to be satisfied:
Additionally, each vertex must be a tail of at most one edge. This constraint can be ensured by the following condition:
Furthermore, there must be a constraint to prevent loop formation. The number of edges in any given set of vertexes chosen from a spanning forest is at most one less than the number of vertexes in the given set. So, loops can be prevented by:
where S represents a non-empty subset of V R and |S| is the size of S. Finally, for a large constant C , the following constraint ensures that ι is spanning and covers all vertexes in the connected components of the newly constructed graph:
Note that C can be set here to 2 × |V R |, due to the fact that each node has at most |V R | − 1 neighbors, which implies that [4] .
The link bandwidth constraint states that the total traffic passing through a network link does not exceed the link capacity. The link bandwidth requirements must be satisfied for each router link, which can be expressed as the following inequality:
Hence, the inequality given in (13) ensures that the latency constraint is satisfied after the mapping:
Heuristic solution
Since the problem captured by the MILP formulation above is NP-hard, we now provide a randomized sampling-based heuristic. The proposed heuristic provides a near-optimal solution in significantly less computation time than the MILP-based approach.
Algorithm 1 Rounding by sampling-based heuristic
Mapping, scheduling, and VFI partitioning GALS-based approach
Input: The parameters and corresponding MILP formulation specified in Section 4.
Output: Functions M 1 , M 2 , M 3 , and M 4 , and the starting times for tasks t i ∈ V T specified in Table 1.
1: Relax all the integer constraints in the MILP formulation.
2: Solve the LP using CPLEX.
7: Independently round x ij to 1 with probability equal to its value.
9: end for
10: for all p j ∈ P do
11: repeat
12: Independently round y jk to 1 with probability equal to its value.
14: end for
15: Round z variables using the rounding by sampling technique presented in [51].

Our heuristic takes as input the parameters specified in Section 4 and the corresponding MILP formulation, and produces as output the functions M 1 , M 2 , M 3 , and M 4 specified in Table 1 together with the starting times for tasks t i ∈ V T . In the first step, we relax all the integer constraints in the MILP formulation and solve the relaxed linear program (LP) using CPLEX [54] . The algorithm then runs iteratively. At each iteration, the solution found in that iteration is compared with the best solution found so far, and the best solution is updated accordingly. Increasing the number of iterations raises the probability of finding a better solution, but it also increases the execution time of the heuristic algorithm.
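The independent rounding and best-solution bookkeeping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the fractional values `x_frac` stand in for an LP solution, the stopping rule of the repeat loop (exactly one selected entry per row) is an assumption because it is not given in the text, `cost` and `is_feasible` are hypothetical callbacks, and the rounding by sampling of the z variables from [51] is not implemented.

```python
import random

def round_one_hot(fractions, rng, max_tries=10_000):
    """Round each value to 1 with probability equal to its value.

    Repeats until exactly one entry is 1 (assumed stopping rule).
    """
    for _ in range(max_tries):
        rounded = [1 if rng.random() < p else 0 for p in fractions]
        if sum(rounded) == 1:
            return rounded
    raise RuntimeError("no one-hot rounding found; check the fractional values")

def rounding_heuristic(x_frac, cost, is_feasible, iterations=100, seed=0):
    """Keep the cheapest feasible integral solution over several iterations."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(iterations):
        candidate = [round_one_hot(row, rng) for row in x_frac]
        if not is_feasible(candidate):
            continue                      # discard roundings that violate constraints
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:    # update the best solution found so far
            best, best_cost = candidate, candidate_cost
    return best, best_cost

# Hypothetical example: two tasks, two PEs, made-up energy values.
x_frac = [[0.7, 0.3], [0.2, 0.8]]
energy = [[4.0, 6.0], [5.0, 3.0]]

def total_energy(x):
    return sum(energy[i][j] * x[i][j] for i in range(2) for j in range(2))

best, best_cost = rounding_heuristic(x_frac, total_energy, lambda x: True)
```

Because each entry is rounded to 1 with probability equal to its fractional value, entries with larger LP values are selected more often, and more iterations give more chances to hit a low-cost feasible assignment.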
A feasible solution of the LP, unlike a solution of the MILP, is not guaranteed to be integral. Hence a rounding scheme is needed to transform the linear program solution into an integral one. The rounded solution must satisfy all constraints specified in Section 3.4. The goal is therefore a sufficiently good approximate solution to the original problem.
The algorithm presented above uses both the independent randomized rounding and the rounding by sampling techniques introduced in [51] . The rounding by sampling technique is based on sampling from a maximum entropy distribution over the combinatorial structures hidden in the feasible solutions. The main implication of this technique is that it keeps the combinatorial structures intact, while aiming at preserving the solutions quantitatively. The technique transforms a feasible solution of the LP into an integral one while preserving the marginal probabilities imposed by the fractional values obtained from solving the LP. In [51] , the author provides an intuitive combinatorial algorithm that samples a fractional matching from the maximum entropy distribution while using the least biased Bayesian update rule. Interested readers are referred to [51] for the implementation details of the method. Algorithm 1 applies these techniques in order to achieve a near-optimal solution of the main problem.
Experimental results
In this section we present experimental results that demonstrate the effectiveness of our proposed approach in minimizing the energy consumption. The results are gathered from benchmark applications (office-automation, consumer, networking, and auto-industry) collected from the Embedded System Synthesis Benchmarks Suite (E3S) [52] and from three real applications: MPEG4 [53] , Multi-Window Display (MWD) [53] , and Object Plane Decoder (OPD) [18] . The E3S benchmark suite was designed for use in embedded systems synthesis research, in particular for automated system-level allocation, assignment, and scheduling studies. It contains 17 processors which are characterized based on the measured execution times of 47 tasks, power quantities derived from processor datasheets, and additional information. In addition, E3S contains communication resources that model a number of different buses. There is one task set for each of the four application suites: office-automation, consumer, networking, and auto-industry. The number of tasks, the communication edges between them, and the mesh-based network sizes are listed in Table 2 .
The following discrete voltage levels are used for voltage level assignment: V 0 = 1.9 V, V 1 = 2.3 V, V 2 = 2.5 V, V 3 = 3.3 V, and V 4 = 3.6 V. For the E3S benchmark applications, the estimated power consumption of each task on the PEs is provided in the benchmark suite. We use the same approach as [19] for estimating the power consumption of the tasks of the real applications. Tasks in the E3S benchmark suite can be classified as computing-intensive, I/O read-write intensive, or memory read-write intensive. Applying the same classification to the tasks of the real applications, we used the power consumption values given in the E3S benchmark suite for tasks of the same class as estimates for the tasks in the real applications.
The BCET of each task is assumed to be 50% of its WCET. We suppose that there are 10 different execution times between the BCET and the WCET of a task. We consider the same two probability distributions as [49] for the execution times of tasks. The first one is the uniform distribution, for which the probability that a task takes any one of its execution times is the same and equal to 0.1. The second one is the modified discrete normal distribution. We consider three cases with averages of BCET + (WCET − BCET)/4, BCET + (WCET − BCET)/2, and BCET + 3(WCET − BCET)/4, which are denoted Norm25, Norm50, and Norm75, respectively (see Fig. 5 ).
All the experiments were performed on an Intel ® Core™ i7 CPU 860 2.80 GHz PC with 4.00 GB RAM. The rounding by sampling-based heuristic is implemented in C++. The optimal MILP solution and the corresponding relaxed LP solution used by the heuristic are obtained using ILOG CPLEX 11.1 Concert technology. As depicted in Fig. 6 , the proposed heuristic is able to find a near-optimal solution within a few seconds. This time is negligible compared to the time required to solve the MILP for the optimal solution. The quality of our proposed heuristic is depicted in Fig. 5 . In all cases our heuristic finds a near-optimal solution. On average over all benchmark applications, our heuristic solution saves 56%, 45%, and 28% of the energy consumed by the optimal solution with the voltage fixed at V 4 , V 3 , and V 2 , respectively. Note that operating at lower voltage-frequency levels would cause task deadlines to be missed because of the slower execution.
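The execution-time setup above can be sketched as follows. This is only an illustration under stated assumptions: the 10 execution times are assumed to be evenly spaced and to include both BCET and WCET, and the "modified discrete normal" distribution of [49] is approximated by normalized Gaussian weights with a hypothetical standard deviation `sigma`, since its exact definition is not given here. All function names are hypothetical.

```python
import math

def execution_times(wcet, levels=10, bcet_ratio=0.5):
    """Evenly spaced execution times from BCET to WCET (assumed spacing)."""
    bcet = bcet_ratio * wcet
    step = (wcet - bcet) / (levels - 1)
    return [bcet + i * step for i in range(levels)]

def uniform_probabilities(levels=10):
    """Each execution time is equally likely (0.1 for 10 levels)."""
    return [1.0 / levels] * levels

def norm_mean(wcet, quarters, bcet_ratio=0.5):
    """Mean BCET + quarters*(WCET - BCET)/4; quarters = 1, 2, 3 for Norm25/50/75."""
    bcet = bcet_ratio * wcet
    return bcet + quarters * (wcet - bcet) / 4

def discrete_normal_probabilities(times, mean, sigma):
    """Normalized Gaussian weights over the discrete times (assumed form)."""
    weights = [math.exp(-0.5 * ((t - mean) / sigma) ** 2) for t in times]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical task with WCET = 100 time units.
times = execution_times(100.0)
uniform = uniform_probabilities()
sigma = (100.0 - 50.0) / 6          # assumed spread, not taken from the paper
norm25 = discrete_normal_probabilities(times, norm_mean(100.0, 1), sigma)
norm50 = discrete_normal_probabilities(times, norm_mean(100.0, 2), sigma)
norm75 = discrete_normal_probabilities(times, norm_mean(100.0, 3), sigma)
```

With these assumptions, the three normal cases shift the probability mass toward the BCET (Norm25), the midpoint (Norm50), or the WCET (Norm75).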
We have assumed that transient faults obey the Poisson distribution [43] and that the average fault rate at the maximum voltage-frequency level, λ 0 , is equal to $10^{-6}$. This rate corresponds to 100,000 FITs (failures in time, in terms of errors per billion hours of use) per megabit, which is a reasonable fault rate as reported in [46] . Taking the effects of voltage-frequency scaling on transient fault rates into account, the exponent in the exponential fault model (see Section 3.3) is assumed to be d = 6 (or 7). These assumptions have been previously presented in [49] . Fig. 7 verifies that the applications consume more energy as they require higher worst-case reliability. For most benchmark applications, the normalized energy consumption grows gradually from the case with no constraint on worst-case reliability up to a required worst-case reliability of 70%. It then increases dramatically between 70% and 95% required worst-case reliability. When the applications request 99% worst-case reliability, they consume 1.7 times the energy of the case with no constraint on reliability.
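The effect of voltage-frequency scaling on reliability can be illustrated with a small sketch. Section 3.3 is not reproduced here, so the sketch assumes the exponential fault-rate model commonly used in the DVFS reliability literature, λ(f) = λ 0 · 10^(d(1 − f)/(1 − f min )), with frequencies normalized to the maximum frequency, and a Poisson fault arrival process. The function names, the value of `f_min`, and the example execution time are hypothetical.

```python
import math

def fault_rate(freq, lambda0=1e-6, d=6, f_min=0.1):
    """Assumed exponential model: rate grows by 10**d from f = 1 down to f = f_min.

    freq is normalized so that 1.0 is the maximum frequency.
    """
    return lambda0 * 10 ** (d * (1.0 - freq) / (1.0 - f_min))

def task_reliability(freq, exec_time_at_fmax, lambda0=1e-6, d=6, f_min=0.1):
    """Probability of no transient fault during the task (Poisson arrivals).

    The execution time is assumed to scale as 1/freq.
    """
    duration = exec_time_at_fmax / freq
    return math.exp(-fault_rate(freq, lambda0, d, f_min) * duration)

# Hypothetical task that takes 10 time units at the maximum frequency.
r_full = task_reliability(1.0, 10.0)
r_half = task_reliability(0.5, 10.0)
```

Lowering the frequency hurts reliability twice under this model: the fault rate rises and the task runs longer. This is why a tighter reliability threshold pushes tasks toward higher voltage-frequency levels and therefore higher energy consumption.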
Conclusion
In this paper we formulated the energy-aware problem of mapping a real-time application with stochastic execution times onto multi-core systems, scheduling tasks on processors, and assigning voltage-frequency levels to Processing Elements (PEs) as a Mixed Integer Linear Programming (MILP) problem in GALS-based Network-on-Chip (NoC). Because the energy-aware application mapping, task scheduling, and voltage-frequency island partitioning problem is NP-hard, we presented a novel rounding by sampling-based heuristic algorithm that achieves a near-optimal solution to the main problem. Experimental results based on the E3S benchmark suite and some real applications demonstrate that, with our proposed heuristic, using multiple voltage-frequency levels is more energy-efficient than using a fixed voltage-frequency level. Experimental results also show that preserving the required worst-case reliability of the system could increase the energy consumption by 70% in some scenarios.
