The interlinked processing units in modern Cyber-Physical Systems (CPS) create a large network of connected embedded computing systems. The Network-on-Chip (NoC)-based Multiprocessor System-on-Chip (MPSoC) architecture is becoming a de facto computing platform for real-time applications due to its higher performance and Quality-of-Service (QoS). Because the number of processors in the multiprocessor systems of CPS has increased significantly, the Voltage Frequency Island (VFI) has recently been adopted as an effective energy management mechanism in large-scale multiprocessor chip designs. In this article, we investigated energy-efficient and contention-aware static scheduling for tasks with precedence and deadline constraints on intelligent edge devices deploying heterogeneous VFI-based NoC-MPSoCs (VFI-NoC-HMPSoC) with DVFS-enabled processors. Unlike the existing population-based optimization algorithms, our novel population-based algorithm, called ARSH-FATI, can dynamically switch between explorative and exploitative search modes at run-time. Our static scheduler ARSH-FATI collectively performs task mapping, scheduling, and voltage scaling. Consequently, its performance is superior to the existing state-of-the-art approach proposed for homogeneous VFI-based NoC-MPSoCs. We also developed a communication contention-aware Earliest Edge Consistent Deadline First (EECDF) scheduling algorithm and a gradient descent-inspired voltage scaling algorithm called Energy Gradient Descent (EGD). We introduced a notion of Energy Gradient (EG) that guides EGD in its search for island voltage settings that minimize the total energy consumption.
Static task scheduling is a mechanism that allocates tasks to processors before the embedded system runs while fulfilling certain obligations such as energy consumption and/or performance [11, 24]. A proper scheduling approach can drastically increase the reliability of an embedded system [17]. Static task schedulers can be used in applications such as surveillance, human recognition, person tracking, gait analysis, advanced healthcare, and crowd and traffic monitoring [1, 42]. Meeting constraints such as deadlines, performance, and QoS plays a critical role in the battery-constrained multiprocessor systems of CPS applications [49]. For example, missing a deadline for an application can reduce QoS and performance [15]. Moreover, energy-aware computing is a critical challenge in modern embedded systems, because higher energy consumption not only causes increased carbon dioxide (CO2) emission but also limits the lifetime of these systems [19, 41, 49]. Dynamic Voltage and Frequency Scaling (DVFS) is a technique applied to MPSoC architectures to reduce the power overhead while maintaining the desired performance. In DVFS, energy efficiency is achieved by reducing the supply voltage and the processor's clock frequency [6].
In this article, we investigated, for the first time, the problem of energy-efficient and contention-aware static scheduling on edge computing devices using a heterogeneous VFI-based NoC-MPSoC (VFI-NoC-HMPSoC) system with DVFS-enabled processors for a set of tasks with precedence constraints and a deadline. Our main contributions and innovations are as follows:
(1) We performed task mapping, ordering, and voltage scaling in an integrated way using a novel search-based meta-heuristic called ARSH-FATI. Our static scheduler also considers the energy performance profiles of the processors, the voltage levels within each processor, contention at the NoC links, and inter-VFI communications during task scheduling.
(2) Our meta-heuristic ARSH-FATI can dynamically switch between different search modes to achieve a satisfactory trade-off between explorative and exploitative search during run-time. Moreover, we presented a new contention-aware Earliest Edge Consistent Deadline First (EECDF) scheduling algorithm and a gradient descent-inspired Energy Gradient Descent (EGD) voltage scaling technique.
(3) We compared the energy performance of our static scheduler ARSH-FATI with the state-of-the-art CA-TMES-Search and CA-TMES-Quick [16] energy management approaches using eight real benchmarks adopted from the E3S benchmark suite. Our meta-heuristic-based static task scheduler achieved average energy savings of ∼24% and ∼30% over CA-TMES-Search and CA-TMES-Quick, respectively.
We organize the rest of the article as follows: Section 2 reviews the existing search-based algorithms and state-of-the-art energy optimization approaches. Section 3 presents the application, computing platform, and power models. In Section 4, we discuss our static contention-aware energy optimization scheme. The simulation results on different benchmarks are discussed in Section 5. In Section 6, we conclude this article.
LITERATURE REVIEW
Task mapping and scheduling on multiprocessor architectures is an NP-hard problem, and different heuristics have been proposed based on mathematical formulations such as Integer Linear Programming (ILP), Non-Linear Programming (NLP), Linear Programming (LP), and Mixed Integer Linear Programming (MILP). Similarly, search-based heuristic algorithms using selection, crossover, mutation, and elitism are also widely deployed. Popular examples of these search-based algorithms are Ant Colony Optimization (ACO), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO). Among these algorithms, GA is widely adopted for task mapping and scheduling [2, 34]. These evolutionary algorithms belong to the class of stochastic generate-and-test algorithms, which are based on (1) exploration of the search space and (2) exploitation of the promising information already found. Exploration describes the ability of an algorithm to discover unseen regions, while exploitation describes its capability to proceed in the desired direction for improvement. For example, in GAs, mutation and crossover are hypothetically considered to perform exploration and exploitation, respectively [10, 47]. However, there is strong criticism that crossover does not possess a competitive advantage over mutation [47]. Moreover, these search-based algorithms fail to efficiently exploit the available chunks of information, i.e., schemata. Exploration and exploitation are two opposing forces, and a well-found balance between them determines the success of a search-based algorithm.
Multiprocessor computing systems are widely adopted for edge devices in CPS due to their high performance and reliability. To reduce the energy consumption of edge devices deploying multiprocessor embedded systems, researchers have investigated task mapping and scheduling techniques. One of the earliest works in scheduling is a scheme developed by Olafsson to efficiently distribute the tasks, i.e., the workload, on a heterogeneous multiprocessor system [38]. Aydin et al. [4] provided a DVFS-based energy-efficient scheduling algorithm with $O(n^2 \log n)$ complexity for independent real-time tasks with different power consumption characteristics on multiprocessor systems. They formulated the scheduling problem as an NLP and assigned a constant speed to the tasks while maintaining optimality. Other energy management studies also used the DVFS technique for energy optimization. For example, Zhang et al. [50] presented a meta-heuristic scheduling algorithm called Shuffled Frog Leaping Algorithm (SFLA) that integrates the gains of PSO and Memetic algorithms, and compared the energy efficiency of SFLA with GA. Kumar and Vidyarth [25] integrated task mapping and voltage assignment in a single optimization loop of GA. They used the DVFS technique to assign voltages to the tasks such that the dynamic energy consumption is reduced with an acceptable performance trade-off. Wang et al. scheduled preemptive periodic independent tasks using Discrete Event System (DES) supervisory control [48]. Liu and Qi mapped the tasks using the Weighted Earliest Finish Time (WEFT) algorithm and executed the tasks with the lowest possible earliest completion time [30]. These investigations reduced the energy consumption of independent tasks running on MPSoC architectures without explicitly considering precedence constraints.
Huang et al. [18] used an extended ILP formulation for energy optimization on heterogeneous NoC-based MPSoC systems and developed a heuristic called Simulated Annealing with Timing Adjustment (SA-TA). Fundamentally, SA-TA optimizes energy consumption by approaching the global optimum under timing constraints. Gammoudi et al. scheduled real-time periodic tasks on a homogeneous NoC-based MPSoC to meet deadline, energy, and communication constraints using a heuristic driven by a deterministic strategy [15]. Ali et al. performed integrated task mapping, scheduling, and voltage assignment on NoC-based heterogeneous MPSoC (HMPSoC) systems using a heuristic called EIMSVS to reduce processing and communication energy [3]. Ishak et al. investigated non-preemptive scheduling for tasks with precedence constraints and individual deadlines. They used NLP and ILP to assign optimal voltages to the tasks and communications on NoC links [20]. A similar ILP-based approach was followed by Tariq et al. [46], who used the Iterative Offline Energy-aware Task and Communication Scheduling (IOETCS) algorithm to reduce total energy consumption. Ali et al. developed an energy-efficient task scheduling approach using the Contention-aware Integrated Task Mapping and Voltage Assignment (CITM-VA) meta-heuristic algorithm. CITM-VA integrated DVFS and DPM to achieve maximum energy savings by reducing both static and dynamic power consumption while considering the contention at NoC links [2]. Ding et al. presented a Hybrid Heuristic Genetic Algorithm with Adaptive Parameter (HGAAP) for energy-aware task mapping on heterogeneous multiprocessor architectures [13]. These studies consider MPSoC systems for tasks with precedence constraints, but they perform mapping and scheduling on a single processor per VFI.
Ninomiya et al. [36] developed a task scheduling scheme for VFI-based MPSoC architectures using an SA-based algorithm for energy consumption reduction and generated an optimal schedule for a set of tasks under deadline constraints. Pagani et al. [39] presented a scheduling scheme called Single Frequency Approximation (SFA) to map the tasks and assign optimal voltage and frequency levels to each VFI. Liu and Guo [29] developed a Voltage Island Largest Capacity First (VILCF) algorithm that maps the tasks on an active VFI first to fully utilize it before activating other inactive VFIs. Shin et al. [43] studied a communication-aware VFI partitioning approach and developed a task mapping and voltage assignment algorithm for reducing inter-VFI communications. The investigations in References [29, 36, 39, 43] deploy bus-based VFI-MPSoC systems for mapping and scheduling independent tasks. Other researchers considered NoC-based VFI-MPSoC systems; for instance, Jang and Pan [21] performed energy-aware scheduling for dependent tasks by reducing the power overheads of VFIs. Digalwar et al. [12] presented a scheduling algorithm to optimize the total energy consumption for periodic tasks with hard deadlines. Han et al. [16] developed a contention-aware static mapping and scheduling scheme for a set of tasks with precedence constraints to minimize the make-span and inter-VFI communications. They developed contention and energy-aware task mapping and edge scheduling (CA-TMES) heuristics to assign tasks to processors while scheduling the edges on the NoC. Two approaches, CA-TMES-Quick and CA-TMES-Search, were developed to select, for each task, the processor on which it can start earliest among all processors. CA-TMES-Quick first performs task assignment and then determines routes for the communications that are sent to this task. CA-TMES-Search calculates the start time for each task while considering communication contention, and selects the processor offering the earliest start time for the task. CA-TMES-Search performs better than CA-TMES-Quick because it coordinates the task mapping in an exhaustive way; therefore, the make-span reduces significantly. We use the CA-TMES-Quick and CA-TMES-Search energy management schemes as baselines to determine the performance of our static task scheduling approach.
Although these state-of-the-art studies [12, 16, 21, 36, 43] addressed energy efficiency on VFI-based NoC-MPSoC systems, none of them investigated a heterogeneous computing platform while considering processor energy performance profiles to achieve higher energy savings. Specifically, to the best of our knowledge, no prior work has focused on contention-aware and energy-efficient task scheduling on VFI-NoC-HMPSoC using the DVFS technique for dependent tasks with precedence constraints and a common deadline.
PRELIMINARIES
In this section, we first present the relevant application model; second, we discuss our VFI-NoC-HMPSoC architecture; finally, we explain the energy model. Moreover, in this article, we use the terms tile and processor interchangeably.
Application Model
We characterize a real-time workload or application by a Directed Acyclic Graph (DAG) $G(V, E, X)$, shown in Figure 1, where $V = \{v_1, v_2, \ldots, v_n\}$ represents a set of tasks, $E \subseteq V \times V$ is the set of directed edges, and each edge $(v_i, v_j) \in E$ denotes a data dependency between two tasks. For example, if we have an edge from task $v_i$ to task $v_j$, then $v_i$ is the predecessor of $v_j$ and outputs data to $v_j$, while $v_j$ is the successor of $v_i$ and accepts input data from $v_i$. Moreover, $X$ is the set of directed edge weights, where $\chi(i, j)$ is the weight of edge $(v_i, v_j)$ and gives the volume of data (in bits) sent from $v_i$ to $v_j$. We assume all tasks in the application have a common deadline, $D$.
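As an illustration of this application model, the following minimal Python sketch stores $G(V, E, X)$ with a common deadline $D$ and derives a topological order. The class name `TaskGraph`, its field names, and the example tasks and data volumes are hypothetical choices made for this sketch and do not come from the article.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """DAG G(V, E, X) with a common deadline D (illustrative names only)."""
    tasks: list                                  # V: task identifiers
    edges: dict = field(default_factory=dict)    # (v_i, v_j) -> chi(i, j) in bits
    deadline: float = 0.0                        # D: common deadline

    def successors(self, v):
        return [j for (i, j) in self.edges if i == v]

    def predecessors(self, v):
        return [i for (i, j) in self.edges if j == v]

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph is not acyclic."""
        indegree = {v: len(self.predecessors(v)) for v in self.tasks}
        ready = [v for v in self.tasks if indegree[v] == 0]
        order = []
        while ready:
            v = ready.pop(0)
            order.append(v)
            for s in self.successors(v):
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)
        if len(order) != len(self.tasks):
            raise ValueError("graph contains a cycle")
        return order


# Hypothetical four-task application with data volumes in bits.
g = TaskGraph(
    tasks=["v1", "v2", "v3", "v4"],
    edges={("v1", "v2"): 512, ("v1", "v3"): 256,
           ("v2", "v4"): 128, ("v3", "v4"): 64},
    deadline=10.0,
)
```

Storing the edge weights in a dictionary keyed by `(v_i, v_j)` keeps $E$ and $X$ in one structure, which is convenient when communications are later scheduled as separate nodes.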
System Architecture
We consider a VFI-based NoC-MPSoC architecture with $M$ processors $P = (pe_1, pe_2, pe_3, \ldots, pe_M)$, shown in Figure 2. Each tile consists of a processor, local memory, and a network interface card. The processors of the target architecture are partitioned into a set $C = \{c_1, c_2, c_3, \ldots, c_m\}$ of $m$ heterogeneous VFIs, where each VFI $c_i \in C$ consists of a set of $k$ homogeneous processors. We assume that processors within an island (VFI) are of the same type. Processors across different VFIs may be of different types, i.e., inter-VFI processors may be heterogeneous. Each VFI can operate independently at a set $\{(V_{dd_1}, f_1), (V_{dd_2}, f_2), (V_{dd_3}, f_3), \ldots, (V_{dd_n}, f_n)\}$ of $n$ discrete voltage and frequency levels, while a common supply voltage is shared by intra-VFI processors and routers.
Communication Model
We assume a 2D-mesh topology NoC as the communication architecture of the VFI-based heterogeneous NoC-MPSoC (VFI-NoC-HMPSoC), shown in Figure 3. Each tile of the computing system is associated with a router to communicate with other processors. In the NoC, buffers are used in routers to hold incoming flits when immediate transfer to the next processor and/or Intellectual Property (IP) is not possible because of congestion. The NoC mesh consists of $N_R$ rows and $N_C$ columns; therefore, the number of processors in the VFI-NoC-HMPSoC is equal to $N_R \times N_C$. Each router has five ports: four ports are used to communicate with the neighboring routers and one is dedicated to communicating with the processor. A link connects two routers and/or a router with a processor. We assume that all links are identical, full duplex, and have the same bandwidth, $b_w$.
Switching Technique.
Virtual cut-through (VCT) and wormhole (WH) are the two most popular packet switching techniques for NoC interconnects. In WH, each packet is split into small pieces known as flits. When a packet traverses the network, WH immediately determines its next hop and forwards it, and the subsequent flits then worm their way through the network. In VCT, the buffer size is large and the entire packet is sent to the next node. Thus, VCT has lower latency, higher link utilization, and a lower packet blocking probability. Although WH switching is simple and offers more efficient flow control than VCT, when congestion occurs a stalled packet can block all the links and produce low link utilization. Therefore, we consider the VCT packet switching technique in this article.
Routing Technique.
Routing technique in a network decides the path of a packet from source to the destination router. We use a well-known XY deterministic routing on NoC that is the most suitable option for 2D-mesh topology networks. Moreover, XY routing is a simple yet effective approach. Additionally, one of the major advantages is that a deadlock does not occur in it. In XY routing, the packets at the routers are routed in X-direction first and later on in the Y-direction.
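To make the XY rule concrete, the sketch below computes the ordered list of router-to-router links that a packet follows on a 2D mesh. It is a minimal illustration: the function name `xy_route` and the `(column, row)` tile addressing are assumptions of this sketch, and the local processor-to-router links are omitted.

```python
def xy_route(src, dst):
    """Return the ordered list of directed links from src to dst.

    Tiles are addressed as (column, row). The packet first travels along
    the X dimension until the column matches, then along the Y dimension.
    Each link is represented as a pair (from_tile, to_tile).
    """
    (x, y), (x_dst, y_dst) = src, dst
    links = []
    while x != x_dst:                       # X direction first
        next_x = x + (1 if x_dst > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    while y != y_dst:                       # then Y direction
        next_y = y + (1 if y_dst > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


route = xy_route((0, 0), (2, 1))
```

Because the route depends only on the source and destination tiles, it can be precomputed once per communication before any scheduling decision is made.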
Energy Model
We adopt the energy model described in Reference [37]. The total energy consumption of an application is the sum of the processing and communication energy consumption, $E_p$ and $E_c$, respectively. $E_p$ is the energy consumed in the execution of tasks on the processors, whereas the communication energy $E_c$ is consumed in the transmission of communications on the network, which includes the switch fabric, links, and buffers. $E_p$ and $E_c$ are discussed in detail in Reference [37].
The total energy $E$ consumed by an application is given as follows:

$$E = E_p + E_c$$
In summary, we consider DAG applications and a heterogeneous VFI-NoC-HMPSoC architecture with VCT switching, XY routing, and energy consumption due to processors and communications.
STATIC CONTENTION-AWARE ENERGY OPTIMIZATION APPROACH
VFI-NoC-HMPSoCs consist of processors with different power-performance profiles. These processors operate at distinct frequency and voltage levels, i.e., speed levels. Furthermore, the precedence constraints and the deadline of the tasks must be observed. Consequently, the execution order of the tasks and communications can significantly affect the total energy consumption. A substantial amount of energy can be saved by assigning priorities to the tasks with shorter deadlines, because DVFS can then efficiently utilize the available idle slack by assigning a lower speed to the tasks. Therefore, the quality of the obtained solution is influenced by three factors: (1) task mapping, (2) ordering, and (3) voltage assignment. The state-of-the-art approach by Reference [27] performs task ordering and voltage scaling in an integrated manner and performs task mapping separately. However, we think task and communication ordering and voltage scaling can be helpful in steering the task mapping optimization process towards a more energy-efficient solution. This is one of the major factors that we consider in the design of our energy-aware integrated mapping, scheduling, and voltage scaling (ARSH-FATI) algorithm. ARSH-FATI considers mapping, scheduling, and voltage scaling in an integrated manner. The details of ARSH-FATI are given in Algorithm 1.

ALGORITHM 1: ARSH-FATI
Input: a DAG $G$, task deadlines, an MPSoC, total number of iterations $\Omega$, and population size $\mu$
Output: task-to-processor mapping map and island voltage levels vol
Construct two matrices $\Pi$ and $\Psi$ of zeros having dimensions $\mu \times |V|$ and a vector $f$ of zeros having dimension $\mu \times 1$;
Find the best solution $\pi_b$ and the worst solution $\pi_w$;
Construct an extended graph $G_e$ given a mapping;
Set map and vol to the mapping and island voltage settings, respectively, with the highest fitness in the population.

Before we start explaining our algorithms, we first define an extended graph $G_e$, illustrated in Figure 4. In an application, there are two kinds of events: communications and tasks. To schedule communications using traditional DAG-based scheduling approaches, we transform a DAG $G$ into an extended graph $G_e$. Given a task-to-processor mapping, an extended graph $G_e$ is constructed by inserting an additional node $v_s$ into graph $G$ for each edge
The additional inserted nodes are called communication nodes. The extended graph is represented by $G_e(V, V^{*}, E)$, where $V$ is the set of task nodes, $V^{*}$ is the set of communication nodes, and $E$ is the set of edges.
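The construction of the extended graph can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes that one communication node is inserted for every edge of $G$, and it does not model any special treatment of edges whose endpoints share a processor. The function name and the tuple encoding of communication nodes are hypothetical.

```python
def build_extended_graph(tasks, edges):
    """Split every edge (v_i, v_j) of G into v_i -> v_s -> v_j.

    tasks: list of task identifiers (V)
    edges: iterable of (v_i, v_j) pairs (E of the original DAG)
    Returns (task_nodes, comm_nodes, extended_edges).
    Assumption of this sketch: one communication node per edge.
    """
    comm_nodes = []
    extended_edges = []
    for (vi, vj) in edges:
        vs = ("comm", vi, vj)            # hypothetical id for the inserted node
        comm_nodes.append(vs)
        extended_edges.append((vi, vs))  # task -> communication
        extended_edges.append((vs, vj))  # communication -> task
    return list(tasks), comm_nodes, extended_edges


V, V_star, E_ext = build_extended_graph(
    ["a", "b", "c"], [("a", "b"), ("a", "c")]
)
```

With this representation, a list scheduler can treat task nodes and communication nodes uniformly and only needs to distinguish them when choosing which scheduling rules to apply.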
ARSH-FATI is a population-based algorithm in which only the best and worst solutions of the previous population are used to generate $\mu$ candidate solutions for the current population. Selection algorithms of this kind are commonly referred to in the literature as $(1 + \mu)$ selection algorithms.
The robustness of the ARSH-FATI algorithm lies in updating the parameter dimensional rate (DR) at run-time during the search process. In this way, our algorithm attains a satisfactory trade-off between the exploitation and exploration attributes of the search process. We define the parameter DR as the percentage of tasks that are re-mapped probabilistically to generate a new solution (mapping) from the current (best and worst) solutions. The need to re-map only a percentage of the tasks, and not all of them, stems from the sensitivity of energy consumption to task mapping in this energy optimization problem. In other words, re-mapping even a small subset of tasks may generate a schedule whose energy consumption differs significantly from that of the schedule generated by the original mapping. Hence, the role of DR is to adjust at run-time the exploitation and exploration features of the ARSH-FATI algorithm, as we explain in the following:
Step 1. Initial population generation: First, we generate a matrix $\Pi$ of dimensions $\mu \times |V|$, where each row represents a task-to-processor mapping. Each row in matrix $\Pi$ is generated by randomly mapping tasks to processors.
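Step 1 can be sketched in a few lines of Python. The function name, the zero-based processor indices, and the list-of-lists representation of $\Pi$ are assumptions of this sketch; the article only states that each row is a random task-to-processor mapping.

```python
import random


def initial_population(mu, num_tasks, num_procs, rng=random):
    """Build the mu x |V| matrix Pi as a list of rows.

    Entry Pi[p][i] is the (zero-based) index of the processor to which
    task i is mapped in population member p, chosen uniformly at random.
    """
    return [
        [rng.randrange(num_procs) for _ in range(num_tasks)]
        for _ in range(mu)
    ]


# Example with mu = 5 members, 8 tasks, and 4 processors (seeded for repeatability).
Pi = initial_population(mu=5, num_tasks=8, num_procs=4, rng=random.Random(1))
```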
Step 2. Evaluation: We define the following fitness function to gauge the quality of each member of the population:
We define the following two terms:
(1) Best solution: is a member of the population that has the highest fitness value.
(2) Worst solution: is a member of the population that has the lowest fitness value.
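Step 3. Setting parameter DR: We set the value of DR to 0.3 for the initial population. This value was determined empirically after extensive experiments. The DR value for the subsequent populations generated during the optimization process is determined as follows:

$$DR \leftarrow \begin{cases} DR / \lambda, & \text{if the best solution improved in the previous iteration}, \\ \lambda \cdot DR, & \text{otherwise}. \end{cases} \qquad (2)$$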
According to Equation (2), if the best solution found so far is improved in the previous iteration, then the value of DR is increased by dividing it by 0 < λ < 1; otherwise, we decrease DR by multiplying it with λ. We refer to λ as the dimensional rate adaption parameter, as it determines the new value of DR during the optimization process. The larger the value of DR, the more explorative the search is, as this enables the moves in the search space by re-mapping many tasks at the same time, thereby leading to large and unconstrained step sizes. Compared to this, a small value of DR motivates a more exploitative search by allowing small and conservative steps in the search space. The motivation behind Equation (2) is to encourage the re-mapping of more tasks and thus support more explorative search if the energy consumed by the schedule generated by the mapping in the previous iteration reduces. However, if the energy does not reduce, then the explorative search is rather restricted and ARSH-FATI takes small steps near the current mapping.
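The adaptation rule described above can be sketched as a small function. The divide-or-multiply behavior follows the text; the clamping of DR to the interval [0, 1] is an assumption added here so that DR remains a valid fraction of tasks, and the function and parameter names are hypothetical.

```python
def update_dr(dr, improved, lam=0.9, lower=0.0, upper=1.0):
    """Return the dimensional rate for the next population.

    improved: True if the best solution found so far was improved in the
              previous iteration.
    lam:      dimensional rate adaptation parameter, 0 < lam < 1.
    The [lower, upper] clamp is an assumption of this sketch.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    new_dr = dr / lam if improved else dr * lam
    return min(upper, max(lower, new_dr))


dr_after_success = update_dr(0.3, improved=True)    # grows: more exploration
dr_after_failure = update_dr(0.3, improved=False)   # shrinks: more exploitation
```

Because the same factor is used in both directions, one success followed by one failure returns DR to its previous value, so the step size only drifts when improvements and failures are unbalanced.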
It is worth noting that ARSH-FATI also reduces communication contention. The energy function has two components: the communication energy and the processing energy. The most effective mechanism for minimizing communication energy is to reduce the traffic over the network. As the traffic over the network reduces, so does the communication contention. In scenarios where the communication energy dominates the total energy, ARSH-FATI will choose the solution that minimizes the traffic over the network and, consequently, reduces contention among the communications. The prime objective of ARSH-FATI, however, is to minimize the total energy. Therefore, in scenarios where the total energy is dominated by processing energy, it may choose a solution that generates high contention between communications but minimizes the total energy consumption. The steps involved in the working principle of ARSH-FATI are given as follows:
New population: In every iteration, for each candidate solution, we remap only a subset of tasks. These tasks are selected based on the value of DR. We remap the selected task $v_i$ as follows:
where $\theta$ is the processor where $v_i$ is currently mapped, $r$ is a random number, and $\pi_b[i]$ and $\pi_w[i]$ are the processors where $v_i$ is mapped in the best and worst solutions, respectively. The function
is defined as follows:
where $r_1$ and $r_2$ are random numbers. The term $r_1(\pi_b[i] - \theta)$ reflects the likelihood of the solution to move closer to the best solution in the population, and the term $r_2(\pi_w[i] - \theta)$ reflects the likelihood of the solution to avoid the worst solution.
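The following sketch illustrates one possible form of the remapping step. It is not the article's update rule: it assumes that each task is selected independently with probability DR, that the two terms are combined as $\theta + r_1(\pi_b[i] - \theta) - r_2(\pi_w[i] - \theta)$, and that the result is rounded and wrapped onto a valid processor index. All names are hypothetical.

```python
import random


def remap(current, best, worst, dr, num_procs, rng=random):
    """Generate a new mapping from the current, best, and worst mappings.

    current, best, worst: lists of zero-based processor indices per task.
    dr: probability with which each task is selected for re-mapping
        (assumption: independent selection per task).
    The combination of the attraction and repulsion terms and the
    round-and-wrap step are assumptions of this sketch.
    """
    new_mapping = list(current)
    for i, theta in enumerate(current):
        if rng.random() >= dr:
            continue                                   # task i keeps its processor
        r1, r2 = rng.random(), rng.random()
        step = r1 * (best[i] - theta) - r2 * (worst[i] - theta)
        new_mapping[i] = int(round(theta + step)) % num_procs
    return new_mapping


candidate = remap([0, 1, 2, 3], best=[1, 1, 1, 1], worst=[3, 3, 3, 3],
                  dr=0.5, num_procs=4, rng=random.Random(0))
```

With `dr` close to 0 almost every task keeps its processor, which corresponds to the small, conservative steps of exploitative search; larger values change many tasks at once.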
Earliest Edge Consistent Deadline First (EECDF) Algorithm.
Before we describe EECDF, given in Algorithm 2, we define some notation. The worst-case execution time of a task node $v_i$ mapped on processor $pe_k$ operating at frequency $f_j$ is denoted by $et$ and is obtained from the worst-case clock cycles of $v_i$ on processor $pe_k$. The start and finish times of a task node $v_i$ are denoted by $\rho(v_i)$ and $\zeta(v_i)$, respectively. Similarly, for a communication node $v_j$ (corresponding to edge $(v_a, v_b)$), the transmission time on a link $L$ between processors $pe_s$ and $pe_d$ is denoted by $et$ and depends on the link width $b_w$ and on the frequencies $f_s$ and $f_d$ of $pe_s$ and $pe_d$, respectively. The start and finish times of $v_j$ on link $L$ are denoted by $\rho(v_j, L)$ and $\zeta(v_j, L)$, respectively.

ALGORITHM 2: EECDF
Calculate the ECD of each task $v_i \in G$;
Create a ready queue $R$ and insert all the source nodes of $G_e$ into $R$;
while $R$ is not empty do
    Find the node $v_i$ with the minimum ECD in $R$;
    if $v_i$ is a task node then
        Schedule $v_i$ subject to rules R1, R2, and R3;
    end
    else
        Schedule $v_i$ subject to rules R4, R5, R6, and R7;
    end
    Delete $v_i$ from $R$;
    Insert all ready nodes in $R$;
end
Calculate the energy $e$ and make-span $m$ of the schedule.
Scheduling, in general, is an NP-hard problem. Hence, in this work, we propose an earliest edge consistent deadline first (EECDF ) heuristic algorithm. EECDF is a static list scheduler that prioritizes nodes with shorter edge consistent deadline (ECD) over nodes with longer ECD. The motivation behind this is to allow the DVFS algorithm to efficiently utilize the available slack.
Given a task-to-processor mapping, the operating frequencies of the processors, and a DAG $G$, we calculate the ECD with the following dynamic programming algorithm.
Traverse the DAG $G$ in reverse topological order. If the task $v_i$ is a sink node, then its ECD, $d_i$, is equal to its pre-assigned deadline $d_i$; otherwise:
where $ISucc_i$ is the set of immediate successors of $v_i$. The ECD, $d_j$, of a communication node is the same as that of its parent (task) node. The EECDF algorithm is described in Algorithm 2. We perform four major steps.
(1) Calculate the ECD of each task $v_i \in G$ (Line 1).
(2) Create a ready queue $R$ and insert all the source nodes in $G_e$ into $R$ (Line 2).
(3) Find a node $v_i$ that has the minimum ECD in $R$ and schedule it. Then delete $v_i$ from $R$ and insert all the ready nodes in $G_e$ into $R$. Repeat this until $R$ is empty (Lines 3-10).
(4) Calculate the energy $E$ and make-span $m$ of the schedule.
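The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the ECD recurrence used here is the common backward rule in which the ECD of a task is the minimum over its immediate successors of the successor's ECD minus the successor's execution time, which may differ from the article's exact formula. Communication nodes and the scheduling rules R1 to R7 are not modeled; the loop only returns the order in which nodes leave the ready queue. All function and variable names are hypothetical.

```python
import heapq


def edge_consistent_deadlines(succ, exec_time, sink_deadline):
    """Compute an ECD per task by memoized backward traversal.

    succ:          dict task -> list of immediate successors
    exec_time:     dict task -> worst-case execution time
    sink_deadline: dict sink task -> pre-assigned deadline
    Assumed recurrence: ecd[v] = min(ecd[s] - exec_time[s]) over successors s.
    """
    ecd = {}

    def visit(v):
        if v not in ecd:
            if not succ[v]:
                ecd[v] = sink_deadline[v]
            else:
                ecd[v] = min(visit(s) - exec_time[s] for s in succ[v])
        return ecd[v]

    for v in succ:
        visit(v)
    return ecd


def eecdf_order(succ, ecd):
    """Return nodes in the order a ready queue keyed by ECD releases them."""
    pending = {v: 0 for v in succ}
    for v in succ:
        for s in succ[v]:
            pending[s] += 1                     # number of unscheduled predecessors
    ready = [(ecd[v], v) for v in succ if pending[v] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, v = heapq.heappop(ready)             # node with the minimum ECD
        order.append(v)                         # placeholder for rules R1-R7
        for s in succ[v]:
            pending[s] -= 1
            if pending[s] == 0:
                heapq.heappush(ready, (ecd[s], s))
    return order


succ = {"a": ["b", "c"], "b": ["e"], "c": ["d"], "d": ["e"], "e": []}
et = {"a": 1, "b": 2, "c": 2, "d": 5, "e": 3}
ecd = edge_consistent_deadlines(succ, et, {"e": 20})
order = eecdf_order(succ, ecd)
```

In the example, task `c` is released before `b` because the long task `d` behind it tightens its ECD, which is the behavior that leaves slack for voltage scaling later.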
We define seven rules to schedule the highest priority node $v_i \in R$. The first three rules deal with the schedule of a task node and the remaining four deal with the schedule of a communication node.
Task scheduling rules: The schedule of a task node $v_i$ is obtained by applying the following rules collectively in order:
R3 enforces EECDF rule on the schedule of task nodes. Under these rules, task nodes with shorter ECD have higher priority than task nodes with longer ECD. High-priority tasks are scheduled earlier in time than low-priority tasks. Communication scheduling rules: In communication scheduling, network resources such as links are treated as processors in a way that each communication can only use one resource at a time. Hence, communication nodes are scheduled on the links for the time they occupy them.
Consider a communication node $v_j$ whose source is mapped on $pe_{src}$ and whose destination is mapped on $pe_{dst}$. The routing algorithm used by the network generates the route $R_j$ from $pe_{src}$ to $pe_{dst}$. The route $R_j = \langle L_1, L_2, \ldots, L_l \rangle$ is an ordered list of links, where $L_1$ is the first link and $L_l$ is the last link on the route.
Note that the route depends only on the source and destination of the communication, because in our network model, we assume deterministic (XY ) routing. Furthermore, the entire communication must be transmitted on the established route, because in the network model, we suppose circuit switching. A communication node utilizing this route must be scheduled on all the links (of this route). The data traverses these links in the order they appear in the route vector.
Link causality constraints: The schedule of a communication node v j on links of route R j must abide by the link causality constraints defined as follows:
The causality constraints impose bounds on the schedule of $v_j$ on the links of $R_j$. The finish time of $v_j$ on link $L_k$ must not be earlier than its finish time on the predecessor link $L_{k-1}$.
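The finish-time bound stated above can be checked with a one-line predicate. The sketch covers only that bound; any further causality conditions (for example on start times) are not modeled, and the function name is hypothetical.

```python
def satisfies_finish_causality(finish_times):
    """Check the finish-time bound along a route.

    finish_times: finish times of one communication node on the links
                  L_1, ..., L_l of its route, in route order.
    Returns True if the finish time never decreases from a link to the
    next link on the route.
    """
    return all(
        later >= earlier
        for earlier, later in zip(finish_times, finish_times[1:])
    )


ok = satisfies_finish_causality([3.0, 3.0, 5.0])
violated = satisfies_finish_causality([3.0, 2.0])
```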
Given a communication node $v_j$ whose parent node is $v_a$ and child node is $v_b$, the schedule of $v_j$ on $R_j = \langle L_1, L_2, \ldots, L_l \rangle$ is obtained by applying the following rules collectively in order:

Energy Gradient Descent (EGD)
EGD, given in Algorithm 3, is inspired by gradient descent. Given a task mapping and the initial operating voltages of the islands, EGD explores the solution space to find voltage settings for the islands such that the total energy consumption is minimized and the resulting schedule under these settings is feasible.
Before we describe EGD, we define two terms: an extensible island and an island energy gradient. An island $c_j \in C$ is extensible if, after reducing its operating voltage, the resulting schedule under the new voltage settings is feasible.
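EGD is guided by the energy gradient in its search for the island voltage setting that minimizes energy consumption. Given the operating voltage $V_{dd}$ of an island $c_j$, and the energy consumption $E$ and make-span $m$ of the schedule, the energy gradient of $c_j$ is defined as:

$$EG(c_j, V'_{dd}) = \begin{cases} \gamma \, (E - E'), & \text{if } m' \le m, \\ \dfrac{E - E'}{m' - m}, & \text{if } m' > m, \end{cases}$$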
where $\gamma$ is a large number, and $E'$ and $m'$ are the energy consumption and make-span of the schedule, respectively, when $c_j$ operates at $V'_{dd}$, a voltage level lower than $V_{dd}$. EGD repeats the following two steps until there are no extensible islands:
Step 1: First find the set of extensible islands. Then, for each extensible island $c_j$, do the following:
• Find a set $\{V_{dd}^{min}, \ldots, V_{dd}^{L}\}$ of operating voltages, where $V_{dd}^{L}$ is the maximum operating voltage of $c_j$ under which the energy consumption of the schedule reduces.
• Tentatively adjust the operating voltage of $c_j$ to each voltage level in the set $\{V_{dd}^{min}, \ldots, V_{dd}^{L}\}$, call EECDF to calculate the make-span and the energy consumption of the schedule under the new voltage settings, and calculate the EG.
Step 2: Find the island $c_j$ and its operating voltage $V_{dd}$ that maximize the energy gradient, and adjust the operating voltage of $c_j$ to $V_{dd}$.
EGD may repeat the above-mentioned two steps several times before it converges. In each iteration, EGD can find many extensible islands and can adjust their operating voltages to many different levels. Each of these island-voltage pairs may lead to some reduction in energy consumption. EGD chooses the pair that maximizes the EG. For each island-voltage pair, the energy consumption of the schedule under the new voltage settings reduces either without or with an increase in the make-span of the schedule. Both of these cases are reflected in the EG function. The first case is the ideal one, because energy is reduced without any reduction in the available slack. Hence, the EG gives more weight to island-voltage pairs that fall in the first case by multiplying the energy difference by a large number $\gamma$. In the second case, energy reduces but with an increase in the schedule make-span. In this case, EG is the ratio between the energy difference and the make-span difference. The higher the ratio, the better the island-voltage pair. A large value of this ratio indicates a large numerator and a small denominator. A large numerator reflects a big energy difference. This is desirable, because it indicates that the schedule under the new voltage settings reduces energy significantly. A small denominator reflects a small make-span difference. This is also desirable, as it indicates that more slack will be available for the nodes in the subsequent iterations.
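The two cases of the energy gradient and the selection in Step 2 can be sketched as follows. The piecewise form follows the verbal description above; the value chosen for $\gamma$, the treatment of a non-increasing make-span as the first case, and all names are assumptions of this sketch. The energies and make-spans of the candidates would come from re-running the scheduler, which is not modeled here.

```python
def energy_gradient(energy, makespan, new_energy, new_makespan, gamma=1e6):
    """Energy gradient of one tentative island-voltage pair.

    Case 1 (make-span does not increase): gamma * energy difference.
    Case 2 (make-span increases): energy difference / make-span difference.
    gamma is an assumed 'large number'.
    """
    energy_diff = energy - new_energy
    makespan_diff = new_makespan - makespan
    if makespan_diff <= 0:
        return gamma * energy_diff
    return energy_diff / makespan_diff


def pick_best_pair(energy, makespan, candidates, gamma=1e6):
    """Step 2: choose the (island, voltage) pair with the largest EG.

    candidates: list of (island, voltage, new_energy, new_makespan).
    """
    best = max(
        candidates,
        key=lambda c: energy_gradient(energy, makespan, c[2], c[3], gamma),
    )
    return best[0], best[1]


# Hypothetical candidates: c1 saves more energy but lengthens the schedule,
# c2 saves less energy but keeps the make-span unchanged.
choice = pick_best_pair(
    energy=10.0, makespan=5.0,
    candidates=[("c1", 0.9, 6.0, 7.0), ("c2", 1.0, 9.0, 5.0)],
)
```

The example shows why the large factor matters: the pair that preserves the make-span wins even though its energy saving is smaller, because it keeps slack available for later iterations.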
EXPERIMENTAL SETUP AND RESULTS
In this section, we explain the experimental setup used for simulation. We also generate energy consumption values for different real benchmarks and discuss the results.
Experimental Setup
We use eight real benchmarks in our experimental analysis on the VFI-NoC-MPSoC computing architecture and generate results for different scenarios. The real benchmarks are adopted from the Embedded System Synthesis Benchmarks Suite (E3S), which is a widely used benchmark suite in task mapping and scheduling research [9]. The Automatic Target Recognition (ATR) benchmark is a real-time streaming application used for pattern recognition. The MP3-decoder benchmark performs Huffman decoding and the Inverse Discrete Cosine Transform (IDCT). JPEG-encoder contains tasks for Huffman encoding and the Discrete Cosine Transform (DCT). The Office benchmark contains tasks for text processing, image rotation, and gray-scale to binary conversion. Auto-industry represents an embedded application that includes tasks such as Fast Fourier Transform (FFT), finite/infinite impulse response filters, IDCT, Inverse Fast Fourier Transform (IFFT), matrix arithmetic, table lookup, road speed calculation, and interpolation. The Consumer-1 and Consumer-2 benchmarks perform JPEG compression and/or decompression and color-space conversions such as RGB to CMYK and RGB to YIQ.
We use the Samsung Exynos 5422 chip power and energy model from Reference [28] for our simulations, with two types of processors, i.e., type 1: high-performance Cortex A15 (big) and type 2: low-power Cortex A7 (little). The Cortex A7 consumes ∼6−12 times less power than the Cortex A15 [32]. The operating frequencies and relative power consumption of both types are given in Table 1. Moreover, we adopt the 70nm processor technology parameters from Ali et al. [2], listed in Table 2. We built the simulation environment in Matlab version R2016a and conducted the experiments on a hardware platform of Intel (R) Xeon (R), i5-3570 CPU with a clock frequency of 3.50GHz, 16GB memory, and 10MB cache. We also use the intlinprog solver for programming ILP problems.
We first select real-world and then synthetic benchmarks and report on the energy-efficiency evaluation of our ARSH-FATI meta-heuristic. Figure 5 shows the impact of DR on ARSH-FATI performance. We initially set DR = 0.3, though it can take values 0.1 ≤ DR ≤ 0.5 with only a small impact on the total energy performance for static task scheduling. The results indicate that the energy performance of ARSH-FATI slightly decreases when DR = 0.1 and DR = 0.5. Although our algorithm automatically adjusts the DR value to maximize energy-efficiency, an initial setting of DR = 0.1 means that ARSH-FATI performs insufficient exploration, while DR = 0.5 leads to excessive exploration. Thus, DR = 0.3 is the nominal initial value for our meta-heuristic. ARSH-FATI converges, i.e., the DR value relatively stabilizes, at 200 iterations (NI); a minute variation occurs up to NI = 500, and no variation occurs when NI > 500. Therefore, we consider NI = 500, μ = 5, and λ = 0.9 for our experiments.
Results
We generate results for four scenarios considering different metrics such as homogeneous MP-SoC platform, heterogeneous multiprocessing computing system, PPI, deadline, and CCR. In this section, we refer to different parameters listed in Table 3 .
Scenario 1.
We set the default parameters NVFI = 4, PPI = 2 × 2, M = 16, DR = 0.30, and perform experiments on eight real benchmarks deploying both homogeneous and heterogeneous VFI-NoC-MPSoC computing architectures.
We compare the energy performance of ARSH-FATI with the state-of-the-art CA-TMES-Search and CA-TMES-Quick [16]. First, we consider a homogeneous VFI-NoC-MPSoC system where all the processors are of type 1. We set the operating frequencies of the processors to their maximum (f_max = 2.0GHz). Second, we use a VFI-NoC-HMPSoC deploying both type 1 and type 2 processors without any voltage scaling technique. We randomly select the type of processor for each VFI to generate a heterogeneous computing platform and ensure unbiased experimentation. Third, we consider a VFI-NoC-HMPSoC computing architecture and use EGD to efficiently exploit the slack in the processors. Table 4 summarizes the energy consumption values for these three cases on eight real benchmarks. Figure 6 shows the energy performance of our static task scheduler ARSH-FATI compared to CA-TMES-Search and CA-TMES-Quick. The x-axis denotes the real benchmarks, while the y-axis represents energy consumption in joules (J). Not surprisingly, when all the processors are of type 1, (ARSH-FATI)_homogeneous consumes less energy, because our population-based meta-heuristic explores the solution space better during task mapping and subsequently reduces communication energy. In other words, (ARSH-FATI)_homogeneous schedules dependent tasks closer to each other to avoid the energy dissipated by links, switches, and buffers during communication. Specifically, (ARSH-FATI)_homogeneous achieves an average energy-efficiency improvement of ∼15% over CA-TMES-Quick and ∼8% over CA-TMES-Search.
The energy savings further increase when both type 1 and type 2 processors are deployed to form a VFI-NoC-HMPSoC system. The task scheduler (ARSH-FATI)_heterogeneous attains an average energy efficiency of ∼13% and ∼20% compared to CA-TMES-Search and CA-TMES-Quick, respectively. Unlike the CA-TMES-Quick and CA-TMES-Search energy management approaches, our static scheduler (ARSH-FATI)_heterogeneous is aware of the energy performance profiles and generates a task schedule in which tasks with higher energy consumption are mapped onto low-performance, highly energy-efficient processors.
Our static scheduler ARSH-FATI, when integrated with the voltage scaling algorithm EGD, i.e., (ARSH-FATI)_heterogeneous+EGD, achieves the highest energy-efficiency. It produces average energy savings of ∼24% and ∼30% over CA-TMES-Search and CA-TMES-Quick, respectively. EGD tends to find the voltage settings for islands such that energy consumption is minimized and the deadline constraints are satisfied. In other words, EGD reduces the computation energy consumption by intelligently exploiting the available slack in the processors.
Summarizing the observations from the experiments in scenario 1, ARSH-FATI reduces both the communication and computation energy consumption without violating the constraints. Our approach ARSH-FATI for static task scheduling on the VFI-NoC-MPSoC architecture outperforms both CA-TMES-Search and CA-TMES-Quick.
Scenario 2.
Next, we examine the impact of PPI on energy consumption and assess the ability of ARSH-FATI to utilize the available resources, experimenting on nine real benchmarks. We set NVFI = 4 on a heterogeneous computing system and systematically increase PPI = 2 × 2, 4 × 2, 4 × 3, i.e., M = 16, 32, 64. Figure 7 illustrates that the energy consumption of the benchmarks MP3-decoder, JPEG-encoder, and Robot decreases with the gradual increase in PPI. This energy reduction is due to the availability of more processors on the MPSoC computing platform. Moreover, MP3-decoder, JPEG-encoder, and Robot contain a relatively higher number of task nodes and a higher degree of parallelism, so they efficiently utilize the available processors on the MPSoC. MP3-decoder consumes 1.1627J at PPI = 2 × 2, which decreases to 1.0936J (ΔE = 0.0691J) at PPI = 4 × 2 and to 1.0728J (ΔE = 0.0899J) at PPI = 4 × 3. Similarly, JPEG-encoder consumes 1.2396J at PPI = 2 × 2, which reduces to 1.1722J (ΔE = 0.0674J) and 1.1517J (ΔE = 0.0879J) at PPI = 4 × 2 and 4 × 3, respectively. We also evaluate the performance of our static scheduler ARSH-FATI on a more complex real benchmark, Robot, which contains 88 tasks. Compared to PPI = 2 × 2 (M = 16), ARSH-FATI achieves energy savings of ∼13% and ∼18% for Robot when PPI = 4 × 2 (M = 32) and PPI = 4 × 3 (M = 64), respectively. These results demonstrate that our meta-heuristic ARSH-FATI can efficiently utilize the resources and the degree of parallelism in the benchmarks to reduce the total energy consumption.
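The ΔE values quoted for MP3-decoder and JPEG-encoder can be reproduced from the reported energies with a few lines of arithmetic. The sketch below uses only the numbers stated in this scenario; the helper name savings and the dictionary layout are illustrative assumptions. Computed this way, the relative savings for these two benchmarks fall roughly between 5% and 8% of the PPI = 2 × 2 energy.

```python
# Energies in joules reported in this scenario, ordered as PPI = 2x2, 4x2, 4x3.
reported = {
    "MP3-decoder": (1.1627, 1.0936, 1.0728),
    "JPEG-encoder": (1.2396, 1.1722, 1.1517),
}


def savings(baseline: float, value: float) -> tuple:
    """Return (absolute saving in J, relative saving in %) against the baseline."""
    delta = baseline - value
    return delta, 100.0 * delta / baseline


if __name__ == "__main__":
    for name, (base, *larger) in reported.items():
        for ppi, energy in zip(("4x2", "4x3"), larger):
            delta, pct = savings(base, energy)
            print(f"{name} PPI={ppi}: dE={delta:.4f} J ({pct:.2f}%)")
```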
Scenario 3.
We now conduct experiments to analyze the robustness of ARSH-FATI under deadline variations and compare its performance with CA-TMES-Search. We consider a voltage-scalable heterogeneous computing architecture with NVFI = 4, PPI = 1 × 2, and M = 8. We set the baseline deadline for each benchmark in set 1 and set 2 (described in Table 2) to the makespan of the schedule generated by CA-TMES-Search with all VFIs operating at their maximum frequencies. Figure 8 and Figure 9 show the energy consumption of ARSH-FATI and CA-TMES-Search for set 1 and set 2, respectively. MULT represents the factor by which the baseline deadline is multiplied. For example, MULT = 1.00 on the horizontal axis in Figures 8 and 9 indicates that the deadline of each benchmark is set to 1.00 × baseline-deadline. The dotted lines represent our ARSH-FATI while the solid lines show CA-TMES-Quick. The condition MULT < 1 indicates a strict deadline, while MULT > 1 indicates a relaxed deadline. The energy-efficiency of ARSH-FATI gradually reduces as MULT decreases from 1 to 0.95. This increase in energy consumption occurs due to the reduction in slack. Though energy consumption slightly increases under the strict deadline conditions (MULT < 1), ARSH-FATI can still generate a feasible schedule. Moreover, as the deadline decreases, ARSH-FATI tends to schedule more tasks on high-performance processors. These processors reduce task execution time at the cost of higher energy consumption, which is a second reason for the increase in energy consumption alongside the reduction in slack. The same is not true for CA-TMES-Search, because it does not consider the energy performance profiles of the processors during the task mapping phase. EECDF prioritizes nodes with shorter ECD, thereby increasing the chance of generating a feasible schedule. This is because the ECD of a node depends on the pre-assigned deadline. As the deadline varies, so does the ECD; consequently, the relative urgency of nodes may change. This additional information reflected by the ECD can be exploited by EECDF. On the contrary, CA-TMES-Search uses b-level to reflect the relative urgency of tasks. The b-level metric is independent of the application deadlines; hence, CA-TMES-Search is unaware of the deadline variations and is unable to respond accordingly.
Under the condition MULT > 1, the energy-efficiency of ARSH-FATI rapidly increases. ARSH-FATI, being aware of processor energy performance profiles, tends to map more tasks on lower-performance but energy-efficient processors. In contrast, CA-TMES-Search cannot exploit the energy performance profiles; consequently, it maps more tasks on high-performance, less energy-efficient processors. ARSH-FATI schedules nodes in EECDF manner; hence, EGD can efficiently utilize the slack, because nodes with longer ECD are not blocked by nodes with shorter ECD. The same is not true for CA-TMES-Search. Furthermore, the uniform voltage scaling used by CA-TMES-Search is an inefficient technique for a heterogeneous system. Thus, ARSH-FATI maintains its remarkable energy performance, robustness, and QoS for real benchmarks at 0.95 ≤ MULT ≤ 1.05.
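The role of the deadline factor MULT in this scenario can be made concrete with a small sketch. It assumes only what the text states, namely that the deadline equals MULT times the baseline deadline; the function names and the numeric values (a baseline of 100 time units and a candidate make-span of 96) are hypothetical and do not come from the experiments.

```python
def scaled_deadline(baseline_deadline: float, mult: float) -> float:
    """Deadline used in the experiment: MULT x baseline-deadline."""
    return mult * baseline_deadline


def slack(make_span: float, baseline_deadline: float, mult: float) -> float:
    """Time left for voltage scaling; a negative value means a deadline miss."""
    return scaled_deadline(baseline_deadline, mult) - make_span


def is_feasible(make_span: float, baseline_deadline: float, mult: float) -> bool:
    """A schedule is feasible when its make-span does not exceed the scaled deadline."""
    return slack(make_span, baseline_deadline, mult) >= 0


if __name__ == "__main__":
    baseline = 100.0   # hypothetical baseline deadline (time units)
    candidate = 96.0   # hypothetical make-span of a candidate schedule
    for mult in (0.95, 1.00, 1.05):
        print(mult, slack(candidate, baseline, mult),
              is_feasible(candidate, baseline, mult))
```

A strict setting (MULT < 1) shrinks or removes the slack, so a schedule must lean on faster processors to remain feasible, whereas a relaxed setting (MULT > 1) leaves room that a voltage scaling step can convert into energy savings.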
Scenario 4.
Now, we evaluate the energy performance of ARSH-FATI at NVFI = 4, PPI = 2 × 2, M = 16, and 0.2 ≤ CCR ≤ 3.0. Figure 10 illustrates the impact of CCR on the energy performance of ARSH-FATI, while CA-TMES-Quick (represented by the blue line) is used as a baseline. Evidently, the ARSH-FATI static scheduler consumes less energy than CA-TMES-Search because it performs task mapping, scheduling, and voltage scaling in an integrated manner. With the increase in communication volume, the energy consumption of ARSH-FATI reduces and reaches its minimum value at CCR = 1.0. ARSH-FATI maps the dependent tasks (parent and child nodes) on the same processor when 0.2 ≤ CCR ≤ 1.0 to decrease the communication energy. At CCR > 1, ARSH-FATI tends to map all the dependent tasks on the closest possible processors, which leads to a slight increase in energy consumption. Our static scheduler ARSH-FATI performs relatively better when 0.5 ≤ CCR ≤ 2.0, i.e., when network contention is medium, and it outperforms CA-TMES-Search in terms of energy-efficiency over the whole range 0.2 ≤ CCR ≤ 3.0.
Table 5 summarizes the energy performance of ARSH-FATI compared to the baseline state-of-the-art CA-TMES-Search and CA-TMES-Quick energy management approaches when NVFI = 4 and PPI = 2 × 2 in the multiprocessor computing system. The energy consumption of the dependent DAG tasks decreases when the computing platform is changed from homogeneous to heterogeneous. The energy-efficiency further improves when the voltage scaling technique EGD is deployed. In short, ARSH-FATI outperforms CA-TMES-Search and CA-TMES-Quick in terms of energy savings while maintaining higher robustness.
CONCLUSION
Cyber-Physical Systems (CPS) integrate computation with physical processes using battery-constrained intelligent edge devices. The computational complexity of real-time applications in CPS is rapidly increasing. Consequently, Network-on-Chip (NoC)-based Voltage Frequency Islands (VFIs) and Globally Asynchronous Locally Synchronous (GALS) designs are widely adopted in large-scale multiprocessor chip designs due to their higher performance, simple architecture, and energy-efficiency. Unlike other scheduling techniques [16, 35, 38, 48], we investigated a harder scheduling problem, i.e., contention-aware and energy-efficient scheduling of DAG tasks on a heterogeneous VFI-based NoC-MPSoC (VFI-NoC-HMPSoC) computing architecture with DVFS-enabled processors. We proposed a novel static task scheduler, ARSH-FATI, which performs task mapping, scheduling, and voltage scaling in an integrated manner while considering the energy performance profiles of the processors and contention at the NoC links. Our meta-heuristic ARSH-FATI can intelligently switch at run-time between explorative and exploitative search modes for performance trade-off. We also integrated the communication contention-aware Earliest Edge Consistent Deadline First (EECDF) scheduling approach and the Energy Gradient Decent (EGD) algorithm for voltage scaling in ARSH-FATI to reduce the computation energy consumption. We performed experiments on eight real benchmarks considering different scenarios. Our static scheduler outperformed the state-of-the-art CA-TMES-Search and CA-TMES-Quick [16] energy management approaches and achieved average energy savings of ∼24% and ∼30%, respectively.
