Abstract: Per-core Dynamic Voltage and Frequency (V/F) Scaling (DVFS) is a well-known methodology for achieving energy efficiency in multicore systems. Heuristic DVFS techniques provide fast, suboptimal V/F predictions, while Dynamic Programming (DP) methods solve smaller sub-problems iteratively and use their outcomes to evaluate V/F levels globally, but at the cost of overhead delays. We propose an efficient DP framework using the Viterbi algorithm, which uses the Energy-Delay Product (EDP) as an objective function to predict the best V/F levels using applications' profiled information, to minimize energy consumption and execution time. Experimental results show that our framework outperforms heuristics using the EDP criterion and provides near-optimal solutions when maximizing energy saving is as important as, or more important than, minimizing execution time penalty. In fact, across several benchmarks, our proposed algorithm provides a 12 to 75 percent improvement in EDP compared to heuristic methods. Furthermore, using a Pareto frontier to evaluate solutions of the algorithms under study, we demonstrate that our framework's energy-time solution is on average only 9 percent worse than the optimal solution. In addition, we show that our dynamic programming solution is 3 to 18 percent closer to a theoretical lower-bound when compared to the studied heuristic methods.
INTRODUCTION
Modern large-scale computing systems, such as data centers and High Performance Computing (HPC) clusters, massively integrate multicore chips in their system design. Such systems are severely constrained by power and cooling costs to meet computing needs of emerging extreme-scale (or exascale) applications. The power efficiency and timing constraints of high-performance multicore chips become more critical as the number of cores on a single chip increases.
Dynamic Voltage/Frequency (V/F) Scaling (DVFS) remains one of the most effective methods for adjusting processing cores' V/F levels to match an application's performance/energy-efficiency goals: increasing cores' V/F levels during compute-intensive phases of the application and lowering the V/F levels when the cores are less utilized. Effective DVFS techniques rely on predictions of future application or system behavior (e.g., prediction of future busy/idle cycles) to make correct adjustments to the V/F levels now.
Existing heuristics-based core-level DVFS methods perform V/F level predictions based on a current system state without examining all possible solutions. Although greedy techniques are fast, they do not in general produce globally optimal solutions. Dynamic Programming (DP) techniques address this limitation by solving sub-problems sequentially using previously solved smaller sub-problems. At each stage, the sub-problems' solution values are stored to avoid re-solving the same sub-problems in the next stage. At the final stage, a DP technique moves backward, through the sub-problems' solutions, stage by stage to obtain the optimal solution [23].
This paper proposes a fast DP framework based on the Viterbi algorithm [30] to find optimal per-core V/F levels over applications' execution phases at runtime. This paper targets application-specific kernel codes or benchmarks that are periodically executed as embedded applications or as core segments of other larger applications. To do this, the workloads of these applications are extensively analyzed by optimization methods at compile-time to obtain more accurate system status at runtime as mentioned above. The proposed technique uses the Energy-Delay Product (EDP) metric [1] as its cost function to determine energy consumption and execution time tradeoff at each application execution phase. Our Viterbi-based DP (VDP) algorithm has polynomial time complexity and low implementation overhead, particularly when the number of V/F levels is less than the number of application's execution phases.
This paper presents the following:
A Viterbi-based Dynamic Programming algorithm is proposed to perform per-core DVFS, making a tradeoff between the performance and energy consumption goals of applications. The VDP algorithm uses EDP as its objective function to predict the best V/F levels that maximize the cores' energy savings while minimizing application execution time.
The VDP algorithm states are defined using the applications' profiled energy and time data.
The proposed VDP algorithm's performance is compared to a greedy algorithm, an ondemand governor, and a feedback controller method. Using EDP as the metric to evaluate these algorithms, experimental results demonstrate that VDP outperforms the heuristics under study by an average of 12 to 75 percent.
Since comparing the VDP performance to many other state-of-the-art DVFS algorithms is not practical, the optimality of the VDP and other algorithms under study is evaluated using a Pareto frontier. We will demonstrate that the VDP algorithm has the closest performance to its corresponding solution on the Pareto frontier curve (on average 9 percent across the experimented benchmarks). Furthermore, we will show that the VDP algorithm's solutions are on average 3 to 18 percent closer to a theoretical lower-bound energy-time solution (best timing and best energy efficiency) than the other studied heuristics.
This paper is structured as follows. Section 2 gives an overview of algorithms related to the ones discussed in this paper. Section 3 presents a model for executing the experimented applications. It also defines an application interval as well as tasks executed per interval in this execution model. Section 4 explains a strategy for profiling data from the executed applications. Based on the profiled data, this section defines 1) problem objectives and 2) system states used by the algorithms to perform DVFS. Section 5 describes VDP in detail, along with the heuristics against which VDP is evaluated. Section 6 analyzes the runtime and the amount of memory consumed by the algorithms. Section 7 explains the setups of the states, experimented applications, and simulation toolsets. Section 8 compares VDP to the heuristics and discusses the optimality of the various solutions obtained. Concluding remarks and future work are discussed in Section 9.
RELATED WORK
Application of Dynamic Power Management (DPM) techniques has been extensively studied in multicore systems that aim to scale cores' V/F levels during an application run to trade reduced static/dynamic energy consumption for an acceptable execution time performance. To achieve this goal, a wide range of system characteristics has been studied and evaluated, including, but not limited to, memory access delays [2], [3], slack reclamation [4], [5], [37], task characterization and scheduling [6], [10], application phase detection [8], [9], compiler-based instruction set optimizations [10], [11], and communication bandwidth [38].
Using these parameters, various optimization algorithms have been proposed to balance the power/energy consumption and execution time, such as linear programming [12], simulated annealing [13], genetic algorithms [14], game theory-based techniques [15], and machine learning techniques [16]. Among the large body of literature presented in the DPM/DVFS arena, we mainly overview predictive, feedback control-based, and dynamic programming techniques, three of which are implemented and compared in this paper.
Predictive techniques predict future system state either greedily based on current system state or by considering history of system states. Ioannou et al., [17] devise a DVFS methodology at multiple core domains granularity that tracks Message Passing Interface-based recurrent application phases. According to this methodology when an application phase is re-encountered at runtime, cores' V/F levels are scaled up or down depending on performance degradation. If the performance degradation, caused by running this phase with a lower V/F level at previous occurrence does not exceed a threshold, the V/F levels are decreased; otherwise, they are scaled up.
Spiliopoulos et al., [2] predict cores' execution times based on measuring memory latency cycles (due to last-level cache misses) that scale with the cores' frequencies. Using the memory cycles to measure the application performance loss, [2] evaluates the cores' energy efficiency, using EDP, under multiple V/F levels. Lai et al, [18] propose a profile-guided DVFS based on the CPU-intensiveness of application phases measured by instruction per cycle and bus utilization. This profile is used by an analytically derived power-performance model, which finds suitable V/F levels at domain-level granularity that minimize EDP. Isci et al., [9] design a look up table that performs phase predictions at runtime. The phase prediction table, which guides DVFS, is constructed based on categorizing memory bus transaction to retired instruction ratio (phase pattern) into multiple levels. A history window of previous patterns and corresponding predictions are stored in the table to be used when consecutive similar patterns are found at runtime.
The feedback control approaches adjust the cores' voltage and frequency based on the difference between current system performance and a target point at every control interval. David et al., [19] develop a feedback-based algorithm that controls the multicore tiles' V/F levels in the Intel Single-chip Cloud Computer (SCC) platform. Their feedback controller adjusts the tiles' V/F levels considering a reference queue occupancy level and inter-VFI communication workload, which is measured by message arrival/ service rates. Leva et al., [20] integrate a thermal controller with a power/performance controller to regulate per-core temperature. This event-driven controller is invoked to improve system performance and reduce power consumption overhead only when either a temperature limit or a timeout is exceeded.
Kim et al., [21] dynamically tune a Voltage/Frequency Island-based (VFI) system's V/F levels with a feedback controller. Control variable defined in this work is a metric that captures average core and link utilizations within the VFIs. This metric, whose value falls in one of pre-defined ranges and mapped to a V/F level in a lookup table, is augmented by a tracking error (between predicted V/F level for the next time interval and observed metric for that interval) to increase the controller's prediction accuracy. Lukefahr et al., [22] propose a controller operating between two cores, one as high performance and the other as energy efficient core combined in a heterogeneous dual-core processor to maximize the energy saving within a performance loss constraint. The control decision to switch between these cores is based on an estimate of performance loss vs. energy waste tradeoff in the next time interval.
Dynamic Programming (DP) techniques overcome local decision-making limitations whether the predictions are made greedily or are compared to expected outcomes to enhance future predictions. To approximate optimal solutions, DP techniques use principles of overlapping subproblems, optimal substructures, and memoization [23]. Niu et al., [24] propose a DP algorithm for the task voltage assignment problem on a Directed Acyclic Graph (DAG). The task voltage assignments are determined by traversing the DAG bottom-up, where a feasible subgraph solution is defined as the one that probabilistically minimizes subgraph energy consumption and satisfies an execution time constraint.
Zhong et al., [25] formulate a voltage scheduling problem that minimizes the system energy consumption and solve it with an exact DP in polynomial time. For each V/F level, they define a state to be the tasks' cumulative utilizations and energy consumptions on a partial path up to the current task. For any two states, the one with higher energy consumption and utilization is pruned by the dominance relation. The DP's backtracking phase constructs optimal V/F level sequences starting from a state with maximum cumulative utilization.
Jung et al., [26] apply a value iteration algorithm as a DP that uses the Bellman equation to minimize a system state (power-delay product) cost. This cost is defined as a summation of instantaneous energy consumption, caused by incurring an action (V/F level change) on the current system state, and the discounted next-state energy consumption given state transition probabilities. Kang et al., [27] transform the cores' V/F level decision problem into finding the optimal number of cache banks per core to maximize performance without violating temperature constraints. They use a DP whose sub-problem solutions correspond to cache bank-to-core allocations that maximize total system throughput (instructions per second).
WORKLOAD EXECUTION MODEL

Assumptions
Workloads, which will be explained in more detail in Section 7.3, consist of multithreaded benchmarks. Based on modern memory system architecture, we assume a shared memory (e.g., L2 cache or main memory) is accessed by cores to facilitate inter-core data traffic at runtime [28]. All execution runs reported in this paper are performed on the benchmarks' parallel section, known as Region Of Interest (ROI).
Impact of Communication on Tasks Execution Times
The benchmarks used in this paper follow a data-parallel parallelization model. In this model, benchmark input data is distributed among cores where each core, in parallel with others, executes program instructions to process a different subset of data (task). For communicating data among cores during benchmark execution, the underlying architecture provides shared memory as mentioned in Section 3.1. Accessing the data through the shared memory causes memory access delays. These delays, which are considered as communication delays, are embedded in the execution time of the tasks as will be explained in Section 3.3. Our previous work [39] extensively studied the impact of the communication delays on system energy efficiency, so that impact is not discussed here. However, it should be noted that such delays affect the execution lengths of the benchmark tasks, which consist of computation and communication (memory access) periods. This paper utilizes the varied execution lengths of the tasks per application interval to slow down cores running small tasks and speed up the ones running larger tasks, thereby optimizing the energy efficiency for the entire application runtime.
Execution Phase Definition
The benchmark ROI consists of a set of execution intervals (or phases), where in each interval a set of parallel threads run simultaneously on multiple cores. All threads in an interval synchronize using synchronization operations (e.g., locks and barriers), as demonstrated in Fig. 1, which shows a three-phase execution of the Radix-2 sort application during its ROI execution. Note that execution times of threads running within the same interval/phase are different, depending on (1) the input data and size and (2) the memory access delays at runtime. This is shown in Fig. 2, where, for example, during interval $t_1$, task $t_{2,1}$ takes the longest time to complete while $t_{1,1}$ and $t_{r,1}$ complete their executions earlier during that interval. Thus, the interval time or length is defined by the execution time of the slowest thread during that interval. It should be noted that the interval lengths vary from each other, caused by the nature of the computations and communications during that interval. For example, computing the first key rank, shown as the second phase in RADIX (Fig. 1), requires more inter-core data transfers (memory reads/writes) than the other two phases, which have more computations and local data transfers.
Task Set Definition
By defining the execution run of a thread in an interval as a task, the benchmark's parallel computational phases can be modeled by a task set $T = \{t_j \mid 1 \le j \le p\}$ as shown in Fig. 2, where $t_j = \{t_{i,j} \mid 1 \le i \le r\}$ denotes a set of subtasks, $t_{i,j}$, which execute on core $c_i$, $1 \le i \le r$, at interval $t_j$, and whose execution times may include memory access delays for data exchange among the subtasks through the shared memory. In Fig. 2, during an interval, gray portions show computation periods of cores executing tasks and black portions show the core's idle periods representing overheads caused by inter-core synchronizations at the end of each interval. In our algorithm, we take advantage of these variations in idle periods to improve energy efficiency and performance by slowing down cores that execute tasks with longer idle periods and speeding up cores with tasks that have shorter idle periods within any given interval. A characteristic of this approach is that dependencies among the tasks are automatically preserved by the synchronization constructs. The amount of variation among the idle periods in an application depends on the cores' computational workloads in each of the application's execution intervals, which is partially impacted by memory access (i.e., inter-core communication) delays.
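The relationship between task times, interval length, and idle periods can be illustrated with a minimal sketch. The function names and the sample numbers below are hypothetical and are not taken from the paper; the only assumption carried over from the text is that an interval ends when its slowest task reaches the barrier.

```python
def interval_lengths(task_times):
    """task_times[j][i] is the execution time of the task on core i in interval j.
    The interval length is set by the slowest task in that interval."""
    return [max(row) for row in task_times]


def idle_periods(task_times):
    """Idle time of each core per interval: interval length minus its task time."""
    return [[max(row) - t for t in row] for row in task_times]


# Hypothetical profiled times (arbitrary units): 2 intervals, 3 cores.
times = [[4.0, 6.0, 3.0],
         [5.0, 2.0, 5.0]]

lengths = interval_lengths(times)   # per-interval length
idle = idle_periods(times)          # slack available to each core
makespan = sum(lengths)             # total ROI execution time
```

Cores with large entries in `idle` are the candidates for a lower V/F level, since slowing them down need not lengthen the interval.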
PRELIMINARIES
Application Profiling
The energy-efficiency algorithms studied in this work rely on three application parameters, i.e., tasks' execution times, tasks' energy consumption, and cores' utilization. A task execution time during an application phase is the time that a core executes the task before reaching a barrier (shaded boxes in Fig. 2). Energy consumption of a task refers to the core's power usage accumulated over the task's execution time. Busy utilization during an application phase is defined as the ratio of a core's busy cycles while executing the task to the total cycles (sum of busy and idle cycles) in the execution phase.
The above three parameters are obtained by profiling the application. The profiling of a benchmark/application collects the execution time, energy consumption, and busy utilization parameters for each application phase in the benchmark for each possible V/F level. It is noted that even though profiling is a time-consuming process (large overhead), it is carried out only once per application and only at compile-time (prior to the actual application execution). Therefore, this overhead is not considered in the complexity of the energy-efficiency algorithms.
Compile-Time vs. Runtime V/F Tuning
Our proposed optimization framework follows an optimize-once-execute-many-times realization model. This means that the framework is a good fit for kernel code or embedded applications whose V/F levels can be optimized once at compile-time, and then applied many times over. Thus, for these applications, the following steps are taken to realize DVFS performed by all the energy-efficiency algorithms studied in this paper:
1. At compile-time
   i. Profile the application and obtain the execution time, energy consumption, and busy utilization parameters.
   ii. Apply an energy-efficiency algorithm (e.g., VDP, feedback, or ondemand governor) using the profiled parameters to predict the best V/F levels for each application interval for each core. Some of these algorithms use local information (e.g., ondemand governor) and some like VDP use global information for performing static (compile-time) optimization. Once the best predicted V/F levels (per core and per interval) are obtained, they are stored in a look-up table to be used at run-time.
2. At runtime
   At each application interval, the system (e.g., OS or microkernels) will consult the look-up table obtained in step 1 to fetch each core's V/F level and issue it to the corresponding voltage regulators with minimal runtime overhead.
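The runtime step amounts to a table lookup keyed by core and interval. The following is a minimal sketch under that reading; the table layout, the function names, and the sample V/F levels are assumptions made for illustration and are not the paper's data structures.

```python
def build_lookup_table(levels_per_core):
    """Flatten per-core V/F level sequences (produced at compile-time)
    into a {(core, interval): level} table."""
    return {(core, interval): level
            for core, levels in enumerate(levels_per_core)
            for interval, level in enumerate(levels)}


def vf_level_for(table, core, interval):
    """Runtime step: fetch the precomputed V/F level for a core and interval."""
    return table[(core, interval)]


# Hypothetical compile-time result: 2 cores, 3 intervals, levels in 1..L.
predicted = [[3, 1, 2],
             [2, 2, 3]]
table = build_lookup_table(predicted)
level = vf_level_for(table, core=0, interval=1)
```

Because every decision is precomputed, the per-interval runtime cost is a single dictionary access regardless of which algorithm filled the table.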
Problem Definition
This paper aims at determining the cores' V/F levels per application phase/interval to minimize the application's execution time (makespan) and total energy consumption, i.e., formulations (1) and (2), respectively:
Where, $d_{i,j,l}$ and $e_{i,j,l}$ in (1) and (2) denote the execution time (delay) and energy consumption of executing subtask $t_{i,j}$ under V/F level $l$ in interval $t_j$, respectively. $x_{i,j,l}$ is a decision variable, which indicates whether the subtask $t_{i,j}$ is executed under V/F level $l$ during interval $t_j$. The $\max(\cdot)$ function in (1) determines the interval's length based on the slowest core's execution time among all cores in that interval. Of note, $VFL$ is a set of discrete V/F levels, i.e., $VFL = [1, L]$ where $L$ is the maximum V/F level.
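For reference, one formulation consistent with these definitions is sketched below. It is a reconstruction from the surrounding text and not necessarily the exact form of (1) and (2); in particular, the assignment constraint that each subtask runs at exactly one V/F level is an assumption.

```latex
\min \; \sum_{j=1}^{p} \; \max_{1 \le i \le r} \; \sum_{l \in VFL} d_{i,j,l} \, x_{i,j,l}
\qquad \text{(execution time, cf. (1))}

\min \; \sum_{j=1}^{p} \sum_{i=1}^{r} \sum_{l \in VFL} e_{i,j,l} \, x_{i,j,l}
\qquad \text{(energy consumption, cf. (2))}

\text{subject to (assumed)} \quad \sum_{l \in VFL} x_{i,j,l} = 1, \quad x_{i,j,l} \in \{0, 1\}
\quad \forall i, j
```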
State Definition
To evaluate V/F level predictions for future benchmark intervals, we use two state definitions. The first state definition is based on per-task energy consumption and execution time measured over multiple V/F levels. This state definition is used by the proposed VDP and greedy algorithms. The second state definition is based on per-task computational workload (or per-core utilization in an interval), which is used by the ondemand governor and feedback controller. The algorithmic implementations of feedback controller and ondemand governor presented in this paper leverage cores' utilizations in the past to predict future states. Therefore, the core utilization-based state definition was a reasonable choice to consider for these heuristics.
As previously mentioned, each core's V/F levels are determined independent of the V/F levels of the other cores. Hence, for the rest of this paper, index i, which corresponds to core c i , is not shown in equations and algorithmic descriptions.
Energy/Time-Based State Definition: The state set is defined as follows:
$s_j = \{(e_{j,l}, d_{j,l}) \mid \forall t_{i,j}, \forall l\}$, (4)
Where, $S$ in (3) is the system state set. $s_j$ in (4) denotes a state defined by energy consumption and execution time of task $t_{i,j}$ in interval $t_j$ under V/F level $l$. Intuitively, for each interval $j$, state $s_j$ has multiple instances, where each instance corresponds to a $(e_{j,l}, d_{j,l})$ pair obtained at a particular V/F level $l$. Thus, the number of states (instances) for each interval $j$ depends on the number of V/F levels in $VFL$. Since the cores execute tasks, in an interval, with unique computational characteristics, each core has different states, i.e., $(e_{j,l}, d_{j,l})$ pairs, than the other cores within and over the application's intervals.
Utilization-Based State Definition: This state definition is based on per-core utilizations over all the intervals measured under multiple V/F levels.
Where, in (6), $s_l$, i.e., the state defined for V/F level $l$, is a set of average core utilizations $u_j$. For each interval $t_j$, an average of core utilization at multiple V/F levels $l$, $u_j$, is computed as shown in (7). Equation (8) indicates that the states are non-overlapping.
The boundaries of states defined in (6) are determined by K-means clustering [29] . K-means is a well-known learning algorithm that partitions a data set into multiple clusters such that each cluster contains a subset of similar data by considering a distortion metric. This paper uses the K-means clustering to form clusters of cores' utilization by executing the following steps iteratively:
Where, in (9), $u_j$ is assigned to a cluster (state) $s_l$ whose centroid (cluster average value), $\mu_{s'}$, has the least squared Euclidean distance to $u_j$ compared to the centroids of other states $s'_l$. This step is followed by (10), which computes an average of all assigned $u_j$ determined in the previous step (9). It should be noted that the produced clusters do not overlap and each cluster is associated with a V/F level, where clusters with minimum and maximum centroids are associated with the lowest and highest V/F levels ($l = 1$ and $l = L$ in $VFL$), respectively.
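The two K-means steps described above (assignment to the nearest centroid, then centroid update) can be sketched for one-dimensional utilization values as follows. This is an illustrative sketch only: the evenly spaced initial centroids, the function names, and the sample utilizations are assumptions, not the paper's setup.

```python
def kmeans_1d(values, num_levels, iterations=100):
    """Cluster scalar utilizations into num_levels states.
    Returns (centroids, clusters), both ordered from lowest to highest centroid."""
    lo, hi = min(values), max(values)
    # Assumed initialisation: centroids spread evenly over the observed range.
    centroids = [lo + (hi - lo) * k / max(num_levels - 1, 1)
                 for k in range(num_levels)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for u in values:
            nearest = min(range(len(centroids)),
                          key=lambda k: (u - centroids[k]) ** 2)
            clusters[nearest].append(u)
        # Update step: each centroid becomes the mean of its assigned values.
        updated = [sum(c) / len(c) if c else centroids[k]
                   for k, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def state_for(u, centroids):
    """Map a utilization to a 1-based V/F level via the nearest centroid."""
    return 1 + min(range(len(centroids)), key=lambda k: (u - centroids[k]) ** 2)


# Hypothetical average utilizations over seven intervals, clustered into L = 3 states.
utilizations = [0.10, 0.15, 0.20, 0.50, 0.55, 0.90, 0.95]
centroids, clusters = kmeans_1d(utilizations, num_levels=3)
```

With sorted centroids, index `k` corresponds to V/F level `k + 1`, so the cluster with the smallest centroid maps to the lowest level and the largest to the highest.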
ALGORITHM DESCRIPTIONS
This section describes the VDP algorithm and the three heuristic algorithms that solve the per-core DVFS problem discussed in Section 4.3, formulations (1) and (2). It should be noted that the pseudocode presented here is executed per core. Therefore, the benchmarks' total energy consumption and execution time will be obtained after all the cores complete the execution of tasks assigned to them.
Viterbi-Based Dynamic Programming (VDP)
VDP is based on the Viterbi algorithm [30], which is a dynamic programming technique that operates on a trellis diagram (Fig. 3) spanning $P$ steps (intervals), where in each step $j$ the system state $s_j$ is in one of the $L$ states.
Viterbi operates on the trellis in two phases: forward and backward. In the forward phase, in every step j and for every state l, the objective function value (e.g., EDP in our study) is computed for all state sequences starting from a state in the first step and ending in the l-th state at the j-th step where the corresponding objective value is stored. After reaching the last step (P), Viterbi uses these locally stored objective values to backtrack the trellis and find a state sequence, among the other sequences, that provides the best global solution.
Given the core's state $s_{j-1}$ in the current interval $j-1$ and possible states $s_j$ in the next interval $j$, the cost ($C$) of the path that consists of a state sequence starting from $s_1$ and leading to $s_j$ through $s_{j-1}$ is computed as follows:
Where, symbol $\rightarrow$ indicates the state sequence from $s_1$ to $s_{j-1}$ on the trellis. $E_{j-1,l'}$ and $D_{j-1,l'}$ are cumulative energy consumption and execution time (corresponding to the state sequence) up to state $s_{j-1}$ at V/F level $l'$. Symbol $\Delta$ shows instantaneous transition energy and time costs between states $s_{j-1}$ and $s_j$ at V/F levels $l'$ and $l$, respectively. $f(\cdot)$ is used to add up and normalize the energy and time variables mentioned above. Integer parameters $m$ and $n$ weigh the energy consumption to execution time portions of the cost function (11).
Having computed the path costs that cross through all the states $s_{j-1}$ and $s_j$ in intervals $j-1$ and $j$, respectively, the state $s_{j-1}$ on the path with the minimum cost is recorded for each state $s_j$:
Where, $s^*_{j-1}$ is the state, among all states $s_{j-1}$ (corresponding to V/F levels $l'$), that minimizes the cost function of a state sequence up to interval $j$. The best state sequence (V/F levels) is obtained after determining state $s_P$, in the $P$-th interval (last interval), with minimum cost and moving backward on the trellis: Where (11) being the minimum among the other states on a path from $t_1$ to $t_{j-1}$.
The VDP algorithm's pseudocode is described in Algorithm 1. The algorithm's variables are explained in Lines 1-4. The states' cumulative energy consumption and execution time are initialized (Line 5). For the remaining intervals, the algorithm computes for each state the cumulative energy consumption and execution time up to the current interval (Lines 6-9). The algorithm records the state in the previous interval that led to the minimum cost (weighted EDP) for the state in the current interval (Line 10) and updates the cumulative energy consumption and execution time for that state (Line 11). Finally, the algorithm determines the state, in the last interval, with the minimum weighted EDP (Line 14) and backtracks to find the optimal state sequence (Lines 15-17).
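The forward and backward phases described above can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: the weighted EDP is assumed to take the form $E^m \cdot D^n$, the transition costs $\Delta$ are assumed to be zero, V/F levels are 0-based indices, and the function name and sample data are hypothetical.

```python
def vdp(energy, delay, m=1, n=1):
    """energy[j][l] and delay[j][l] are the profiled energy and time of
    interval j at V/F level index l. Returns (levels, total_energy, total_delay)."""
    num_intervals, num_levels = len(energy), len(energy[0])
    # Initialisation: cumulative energy/time of each state in the first interval.
    cum_e, cum_d = list(energy[0]), list(delay[0])
    back = [[0] * num_levels for _ in range(num_intervals)]

    # Forward phase: for each state, keep the predecessor with minimum weighted EDP.
    for j in range(1, num_intervals):
        new_e, new_d = [0.0] * num_levels, [0.0] * num_levels
        for l in range(num_levels):
            best_prev = min(
                range(num_levels),
                key=lambda lp: (cum_e[lp] + energy[j][l]) ** m
                               * (cum_d[lp] + delay[j][l]) ** n)
            back[j][l] = best_prev
            new_e[l] = cum_e[best_prev] + energy[j][l]
            new_d[l] = cum_d[best_prev] + delay[j][l]
        cum_e, cum_d = new_e, new_d

    # Backward phase: start from the cheapest final state and follow back-pointers.
    last = min(range(num_levels), key=lambda l: cum_e[l] ** m * cum_d[l] ** n)
    path = [last]
    for j in range(num_intervals - 1, 0, -1):
        path.append(back[j][path[-1]])
    path.reverse()
    return path, cum_e[last], cum_d[last]


# Hypothetical profile for one core: 2 intervals, 2 V/F levels (index 0 = low, 1 = high).
energy = [[1.0, 4.0], [1.0, 2.0]]
delay = [[2.0, 1.9], [10.0, 2.0]]
levels, total_e, total_d = vdp(energy, delay)
```

In this example the high level barely shortens the first interval but greatly shortens the second, so the selected sequence is low then high. Raising `m` relative to `n` biases the choice toward energy savings, and the reverse biases it toward execution time.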
Metric for VDP Cost Function
Generally speaking, any DVFS algorithm, including VDP proposed in this paper, aims at scaling the V/F levels of cores at runtime to optimize desired objectives. Typical DVFS algorithms, whether they are used as prototypes for empirical studies or implemented in real embedded systems (i.e., Linux governors), use cores workloads to predict V/F levels. To use cores workloads as a metric for V/F level decisions, one has to carefully define workload thresholds for selecting suitable V/F levels while avoiding unnecessary frequent V/F level switching. This paper demonstrated this limitation when constructing the lookup table (Section 4.4) for the feedback and ondemand algorithms. To address this limitation, VDP uses our objective parameters, execution time and energy consumption ((1), (2)), as V/F selection criteria for its cost function. To account for the simultaneous impact of energy over time and vice versa, these parameters are combined into a well-known metric, Energy-Delay Product (EDP) [1] . EDP is used to measure tradeoffs between energy reduction and performance loss on the time-energy Pareto frontier.
It should be noted that utilizing energy and time as the criteria for V/F level decisions provides sensible outcomes only when their values are accurately estimated. As such, this paper uses the profiling strategy, as discussed in Section 4.1, for estimating energy usage and execution time of running tasks by cores under all V/F levels per application interval.
Greedy
This algorithm's policy is based on a simplified version of the Viterbi algorithm, i.e., it only considers cores' current states to make future state predictions (with no backtracking or global optimization). The state representation defined in this algorithm is based on (4) and EDP is used to evaluate the energy consumption and execution time tradeoff over all the states.
Algorithm 2 describes the proposed greedy algorithm. The variables and procedure in Algorithm 2 are similar to the ones in Algorithm 1 except that after the initialization (Algorithm 1, Line 5), for each interval $t_j$, the algorithm greedily chooses the state $s^*_j$ whose EDP is minimum among all states $s_j$ in interval $j$ (Line 6). Therefore, the cost of the existing path (state sequence) is only updated by the energy and time of the predicted states $s^*_j$ (Line 7).
Algorithm 2. Greedy pseudocode
1. for $t_j$ from $t_2$ to $t_p$ do:
2. for each $s_j$ do:
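A runnable sketch of the greedy policy is given below for illustration. It follows the description in the text (one pass, no back-pointers, cost evaluated against the single path selected so far), but several details are assumptions: the weighted EDP form $E^m \cdot D^n$, the choice of the first interval by the same rule, the 0-based level indices, and the function name and data.

```python
def greedy(energy, delay, m=1, n=1):
    """energy[j][l] and delay[j][l] are the profiled energy and time of
    interval j at V/F level index l. Returns (levels, total_energy, total_delay)."""
    levels, cum_e, cum_d = [], 0.0, 0.0
    for e_row, d_row in zip(energy, delay):
        # Pick the level whose addition to the current path gives the lowest EDP.
        best = min(range(len(e_row)),
                   key=lambda l: (cum_e + e_row[l]) ** m * (cum_d + d_row[l]) ** n)
        levels.append(best)
        # Only the chosen state updates the path cost; alternatives are discarded.
        cum_e += e_row[best]
        cum_d += d_row[best]
    return levels, cum_e, cum_d


# Hypothetical profile for one core: 2 intervals, 2 V/F levels (index 0 = low, 1 = high).
energy = [[1.0, 4.0], [1.0, 2.0]]
delay = [[2.0, 1.9], [10.0, 2.0]]
levels, total_e, total_d = greedy(energy, delay)
```

Each interval costs one EDP evaluation per V/F level, which matches the linear per-interval work attributed to the greedy algorithm, at the price of never revisiting an earlier choice.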
Feedback Controller
In this algorithm, the prediction error is incorporated in the objective function to prevent unnecessary V/F level changes caused by short-term workload variations at runtime. The core utilization-based state definition (explained in Section 4.4) is used to construct a time-invariant lookup table of state and unique V/F level pairs. We deploy the well-known Exponential Weighted Moving Average (EWMA) algorithm [31], which implements the feedback controller to predict core utilization in the next interval based on the core utilization in the current interval while accounting for the prediction error.
Algorithm 3 describes the feedback controller's pseudocode. Lines 1-6 explain the algorithm's variables. Line 7 performs the core utilization's prediction for the second interval. Using a weighted sum of the actual core utilization in the current interval, which is obtained based on the state (and corresponding V/F level) predicted in the previous interval, as well as the core utilization prediction history, the core utilization for the next interval is predicted (Line 9). After that, the state in which the predicted core utilization falls is determined and its corresponding V/F level is obtained from the lookup table (Line 10). In case the predicted core utilization does not fall into any of the states, the state whose centroid is the closest to the core utilization is selected for the next interval (Lines 11-13).
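The EWMA-based prediction loop can be sketched as follows. This is an illustrative approximation, not Algorithm 3 itself: the smoothing weight `alpha`, the start at the highest V/F level, the seeding of the first prediction, and the use of the nearest centroid for every state lookup are all assumptions, and the names and data are hypothetical.

```python
def nearest_state(u, centroids):
    """1-based V/F level of the state whose centroid is closest to utilization u."""
    return 1 + min(range(len(centroids)), key=lambda k: abs(u - centroids[k]))


def feedback_controller(actual_util, centroids, alpha=0.5):
    """actual_util[j][l-1] is the profiled utilization of interval j at V/F level l.
    Returns the list of V/F levels chosen for every interval."""
    top = len(centroids)
    levels = [top]                       # assumed: first interval runs at the highest level
    predicted = actual_util[0][top - 1]  # assumed seed for the prediction history
    for j in range(1, len(actual_util)):
        # Utilization actually observed in the previous interval at its chosen level.
        observed = actual_util[j - 1][levels[-1] - 1]
        # EWMA update: correct the old prediction by a fraction of its error.
        predicted = predicted + alpha * (observed - predicted)
        levels.append(nearest_state(predicted, centroids))
    return levels


# Hypothetical state centroids (L = 3) and profiled utilizations for 4 intervals.
centroids = [0.2, 0.5, 0.9]
actual_util = [[0.95, 0.90, 0.80],
               [0.60, 0.50, 0.40],
               [0.30, 0.20, 0.10],
               [0.20, 0.15, 0.10]]
levels = feedback_controller(actual_util, centroids)
```

A smaller `alpha` weights the prediction history more heavily, which damps V/F switching when utilization fluctuates briefly.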
Ondemand Governor
This algorithm is based on the ondemand governor used in the Linux kernel [40], which adjusts the V/F levels of a core according to the current core workload. To account for the variation of the core workload across the application intervals, the governor used in this paper also utilizes workloads in the previous intervals for adjusting V/F levels. Technically, this algorithm predicts the core's utilization for the next interval based on the average of the core's utilizations of the current and previous intervals. The lookup table constructed in 5.3 is also used for the ondemand governor to find the correct state and select the corresponding V/F level for the next interval.
$u_k$: actual core utilization for interval $k$.
4. $U$ (cumulative cores utilization): stores core utilizations for $hi$ intervals.
5. $hl(t_1) = L$
6. for $t_j$ from $t_2$ to $t_p$ do:
if no such $s_l$ exists then:
end if
13. Find V/F level $l$ that corresponds to $s^*_l$ and add it to $hl(t_k)$.
14. end for
Pseudocode in Algorithm 4 describes the ondemand governor algorithm. The algorithm's variables are explained in Lines 1-4. Line 5 initializes the history level array, which is used by the algorithm to obtain the actual cores' utilization based on the previously predicted V/F levels. Lines 7-12 predict the core utilization (and its corresponding V/F level) for the next interval based on the average of actual core utilizations, in the previous intervals, whose values depend on the predicted V/F levels in those intervals. Line 13 adds the corresponding V/F level to the history level array to be used for the next V/F prediction.
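The averaging policy of the governor can be illustrated with a short sketch. It is a simplified reading of the description above, not the Linux ondemand governor or Algorithm 4 verbatim: the history length, the nearest-centroid state lookup, and all names and data are assumptions.

```python
def nearest_state(u, centroids):
    """1-based V/F level of the state whose centroid is closest to utilization u."""
    return 1 + min(range(len(centroids)), key=lambda k: abs(u - centroids[k]))


def ondemand_governor(actual_util, centroids, history=2):
    """actual_util[j][l-1] is the profiled utilization of interval j at V/F level l.
    history plays the role of hi: the number of past intervals that are averaged."""
    top = len(centroids)
    levels = [top]   # hl(t_1) = L: the first interval runs at the highest level
    window = []      # U: utilizations observed at the levels actually chosen
    for j in range(1, len(actual_util)):
        window.append(actual_util[j - 1][levels[-1] - 1])
        window = window[-history:]               # keep only the last `history` entries
        predicted = sum(window) / len(window)    # average of recent utilizations
        levels.append(nearest_state(predicted, centroids))
    return levels


# Hypothetical state centroids (L = 3) and profiled utilizations for 4 intervals.
centroids = [0.2, 0.5, 0.9]
actual_util = [[0.95, 0.90, 0.80],
               [0.60, 0.50, 0.40],
               [0.30, 0.20, 0.10],
               [0.20, 0.15, 0.10]]
levels = ondemand_governor(actual_util, centroids)
```

Because the window has a fixed length, the memory used by the governor does not grow with the number of intervals.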
ALGORITHMS TIME/SPACE COMPLEXITY
As mentioned in Section 5.1, the VDP algorithm consists of two phases. In the first phase (forward phase), shown in (11), the costs of state sequences on the trellis are computed; in the second phase (backward phase), shown in (13), the trellis is backtracked to produce the optimal state sequence. The backward phase takes constant time per interval because backpointers are used to track the states with optimum cost in each interval. The VDP algorithm's forward phase computes the EDP $|VFL|^2$ times in every interval. Thus, for $N$ intervals the time complexity is $O(|VFL|^2 \cdot N)$. Since determining the minimal-cost path requires $|VFL|^2$ comparison pairs and $|VFL|$ comparison outcomes, $b(s_j)$, are stored for states in every interval, the VDP's space complexity is
The greedy algorithm is a simpler version of VDP in which the EDP is computed $|VFL|$ times in every interval with respect to the state selected in the previous interval; hence its time complexity is $O(|VFL| \cdot N)$. The greedy algorithm's space complexity is $O(|VFL|)$ since only $|VFL|$ comparison pairs are evaluated in an interval.
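To make the difference in evaluation counts concrete, the sketch below contrasts a generic Viterbi-style minimum-cost search with a greedy search over the same trellis. It is an illustration only and not the formulation in (11) and (13): the additive per-interval cost function, the function names, and the toy cost table are all assumptions.

```python
def viterbi_min_cost(cost, n_levels, n_intervals):
    """Forward pass with backpointers, then backtracking.

    `cost(t, prev, cur)` is a hypothetical additive per-interval cost;
    `prev` is None in the first interval.
    """
    best = [cost(0, None, s) for s in range(n_levels)]
    evaluations = n_levels
    back = []
    for t in range(1, n_intervals):
        new_best, pointers = [], []
        for cur in range(n_levels):
            cands = [best[p] + cost(t, p, cur) for p in range(n_levels)]
            evaluations += n_levels
            p_min = min(range(n_levels), key=cands.__getitem__)
            new_best.append(cands[p_min])
            pointers.append(p_min)
        best = new_best
        back.append(pointers)
    state = min(range(n_levels), key=best.__getitem__)
    total = best[state]
    path = [state]
    for pointers in reversed(back):  # one backpointer lookup per interval
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, total, evaluations


def greedy_min_cost(cost, n_levels, n_intervals):
    """Pick the cheapest level in each interval given the previous choice."""
    path, total, prev, evaluations = [], 0, None, 0
    for t in range(n_intervals):
        cands = [cost(t, prev, cur) for cur in range(n_levels)]
        evaluations += n_levels
        prev = min(range(n_levels), key=cands.__getitem__)
        total += cands[prev]
        path.append(prev)
    return path, total, evaluations


# Toy example: per-interval base costs plus a penalty for switching levels.
BASE = [[1, 2], [5, 1], [5, 1]]


def toy_cost(t, prev, cur):
    penalty = 3 if prev is not None and prev != cur else 0
    return BASE[t][cur] + penalty


v_path, v_total, v_evals = viterbi_min_cost(toy_cost, 2, 3)
g_path, g_total, g_evals = greedy_min_cost(toy_cost, 2, 3)
```

In this toy case the greedy search commits to the locally cheapest first level and pays for a switch later, while the trellis search finds the cheaper overall sequence at the price of more cost evaluations.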
The time complexities of both the feedback controller and the ondemand governor are linear in the number of intervals, and their space complexities are constant. The history interval (hi) in the ondemand governor remains fixed during the algorithm's runtime, so it does not affect the space complexity.
VDP's Implementation Efficiency
The VDP framework is a static (pre-runtime) V/F level assignment algorithm: the V/F levels are predicted by VDP at compile time and stored in a lookup table for access at run time. The previous section explained that VDP's time complexity is polynomial and scales with the number of V/F levels and the number of intervals. Nevertheless, the algorithm is efficiently implemented in MATLAB, with a worst-case execution time of less than 10 seconds.
The pseudocode of the algorithms is presented in Section 5. The source code is available at https://gitlab.eecs.wsu.edu/Shervin/DVFS_Algorithms
EXPERIMENTAL SETUP
States Configuration
The following six V/F levels, VFL = [1, 6], are associated with the states defined based on the cores' energy/time or utilization: (0.5 V, 1.25 GHz), (0.6 V, 1.5 GHz), (0.7 V, 1.75 GHz), (0.8 V, 2.0 GHz), (0.9 V, 2.25 GHz), (1.0 V, 2.5 GHz). These equally spaced V/F levels lie within the nominal V/F range used by our power and performance simulators, for which power and performance values have a linear relationship. Using a frequency higher than 2.5 GHz provides only marginal improvements in execution time in our study and is therefore not considered. Voltages lower than 0.5 V are not supported by our power simulator's configuration because they increase the leakage power consumption.
We use [32] to compute the V/F level switching time and energy overheads, which are on the order of 0.5 microseconds and 1 microjoule, respectively. These overheads are very small and often negligible, but they are still accounted for in the evaluation of our algorithms. For the heuristics explained in Section 5, the state values are ordered such that the first and last states, which correspond to the energy and time pairs or lookup table entries, are associated with the lowest and highest V/F levels, respectively.
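One simple way to account for the switching overheads is to charge a fixed time and energy cost each time the V/F level changes between consecutive intervals. The sketch below does that with the six levels listed above; the constants use the order-of-magnitude figures quoted in the text, and the function name and the example schedule are hypothetical.

```python
# Approximate per-switch overheads (order of magnitude only, per the text).
SWITCH_TIME_S = 0.5e-6
SWITCH_ENERGY_J = 1.0e-6

# The six V/F levels, indexed 1..6 as (volts, GHz).
VF_LEVELS = {
    1: (0.5, 1.25), 2: (0.6, 1.5), 3: (0.7, 1.75),
    4: (0.8, 2.0), 5: (0.9, 2.25), 6: (1.0, 2.5),
}


def totals_with_switching(levels, energies_j, times_s):
    """Sum per-interval energy and time, adding overhead for each level change."""
    switches = sum(1 for prev, cur in zip(levels, levels[1:]) if prev != cur)
    total_energy = sum(energies_j) + switches * SWITCH_ENERGY_J
    total_time = sum(times_s) + switches * SWITCH_TIME_S
    return total_energy, total_time, switches


# Hypothetical five-interval schedule with two level changes (6 -> 3, 3 -> 5).
energy, time, n_switches = totals_with_switching(
    levels=[6, 6, 3, 3, 5],
    energies_j=[1.0, 1.0, 1.0, 1.0, 1.0],
    times_s=[0.1, 0.1, 0.1, 0.1, 0.1],
)
```

With interval times and energies in the range reported later for the benchmarks (tenths of a second, several joules), these overheads change the totals only in the sixth decimal place.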
Emulation Setup
GEM5 [33], an industry- and academic-standard emulation environment for fine-grain computer system evaluation, is used to assess and compare the performance of the proposed VDP and greedy methods against the ondemand and feedback control-based algorithms. Using the 65 nm technology node, GEM5 simulates a full system of 64 homogeneous Alpha cores arranged in an 8 × 8 mesh topology. Ruby [33] is used as the memory model to provide private 64 KB L1 instruction and data caches and a shared 8 MB (128 KB per core) L2 cache.
McPAT [34] is used to obtain per-core energy consumption in each interval.
Benchmarks
Our heuristics' energy efficiencies are measured by running five workloads chosen from the SPLASH-2 and PARSEC benchmark suites [35], [36]. These suites consist of realistic applications whose runtime characteristics range from floating-point, computation-bound (e.g., FFT) to data-sharing, memory-bound (e.g., CANNEAL). Hence, the experimental results shown here are likely generalizable to applications with varying computational intensity. All benchmarks, whose number of execution phases (intervals) varies between 6 and 16, are run with large input sizes to better evaluate the effects of the considered V/F levels on per-core energy consumption and execution time across the intervals. Table 1 lists the problem size and application domain of these benchmarks.
PERFORMANCE EVALUATION
This section evaluates the energy efficiency of the algorithms discussed in Section 5 in terms of the tradeoff between system energy consumption and execution time. First, the algorithms' energy/time efficiencies are evaluated and compared using EDP as the performance measure. Then, we measure how close the algorithms' average energy-time solutions are to the corresponding optimal solutions, defined on the benchmarks' Pareto frontiers, and to a theoretical lower-bound optimal solution.
Energy-Delay Product (EDP) Comparison
Fig. 4 shows the algorithms' EDP outcomes. The EDPs are normalized to the no-DVFS EDP, for which the task sets' energy consumptions and execution times are obtained at the highest V/F level, (1.0 V, 2.5 GHz). Thus, smaller normalized EDP values reflect better overall performance.
For each algorithm in Fig. 4, four bars are presented. Three of them correspond to solving our optimization problem with different input configurations, which are discussed below. The fourth bar is the simple average of the EDP values of the other three configurations.
For the VDP and greedy algorithms we use $E^n \cdot D^m$ as the objective function for optimizing the energy and time parameters. The integer exponents n and m assign more weight to energy or time, respectively. For example, m > n indicates that minimizing the execution time, perhaps at the cost of higher energy consumption, is more important to the user. The degree of importance of one parameter over the other is determined by the integer values assigned to m and n. Since m and n are exponents, large values are unnecessary: the effect of larger values quickly diminishes beyond 2. Thus, in our studies m and n are each set to either 1 or 2.
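The effect of the exponents can be seen with a small worked example. The sketch below evaluates the objective for two candidate energy-time solutions; the function name and the candidate values are hypothetical and chosen only to show how the preferred solution changes with n and m.

```python
def weighted_edp(energy, delay, n=1, m=1):
    """Objective E^n * D^m; a larger n stresses energy, a larger m stresses delay."""
    return (energy ** n) * (delay ** m)


# Hypothetical candidates as (energy in joules, delay in seconds).
CANDIDATES = {
    "low_energy": (10.0, 0.30),
    "low_delay": (14.0, 0.22),
}


def preferred(n, m):
    """Name of the candidate with the smallest weighted EDP."""
    return min(CANDIDATES, key=lambda k: weighted_edp(*CANDIDATES[k], n=n, m=m))


choice_equal = preferred(n=1, m=1)    # 3.00 vs 3.08
choice_time = preferred(n=1, m=2)     # 0.90 vs about 0.68
choice_energy = preferred(n=2, m=1)   # 30.0 vs about 43.1
```

With equal weights the two candidates are nearly tied; raising m tips the choice toward the faster solution, while raising n favors the more frugal one.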
For the feedback controller, the input configuration w specifies how the prediction history over the past intervals is weighted against the core utilization in the current interval when predicting the core utilization for the next interval. For example, w < 0.5 means that the prediction history is weighted more heavily than the current core utilization.
The input configuration of the ondemand governor determines the number of past intervals (history interval) considered when predicting the next interval's core utilization (i.e., hi = 2 means predicting the core utilization based on the past two intervals).
Observations. The following are some observations about the bar charts in Fig. 4 .
VDP and Greedy. With equal weights for energy and time (m = n = 1), VDP provides a lower (better) EDP than Greedy across all the benchmarks. Among the configurations, m < n (energy reduction is more important than time improvement) provides the lowest EDP for both VDP and Greedy. In terms of the average EDP, VDP outperforms all the heuristics, which justifies the additional computation that VDP performs relative to the other heuristics (Section 6).
Feedback controller. The feedback controller generally achieves a lower EDP when the prediction history outweighs the current core utilization (w < 0.5). Interestingly, for this configuration, the feedback controller's performance is comparable to the "average case" EDPs of VDP and Greedy. This indicates that the accumulated trend in the prediction history is more effective than the current core utilization for predicting the core utilization (or V/F level) in the next interval. For w < 0.5 and w > 0.5 configurations,
Ondemand governor. Fig. 4 shows that the ondemand governor's performance is steady across the different history intervals (hi = 1, 2, and 3). This indicates that the most recent core utilization (hi = 1) provides sufficient information, with a smaller memory footprint, for predicting the next V/F levels. Of note, using hi = 1 in our ondemand governor is analogous to the real implementation of this governor in the Linux kernel. The governor performs worse than the other algorithms because 1) it uses a simple predictive model and 2) the variations in core utilization from one interval to the next make core utilization a poor V/F level selection metric.
VDP vs. heuristics. To quantify the performance of VDP relative to the other three heuristics, Table 2 reports the ratio of the "average case" EDP obtained by each heuristic to that of VDP. As shown in Table 2, VDP performs on average 12, 25, and 75 percent better than Greedy, Feedback, and the Ondemand governor, respectively.
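The averages quoted above can be checked directly against the Table 2 entries. In the sketch below, the per-benchmark percentages are the five values reported for each heuristic in Table 2 (the benchmark order is not identified here), and the helper that turns an EDP ratio into a percentage is an assumption about how the table entries were derived.

```python
# Per-benchmark percentages from Table 2, one list per heuristic.
TABLE2 = {
    "Greedy": [12, 7, 12, 22, 5],
    "Feedback": [36, 25, 35, 25, 5],
    "Ondemand": [120, 38, 70, 89, 60],
}


def percent_worse(edp_heuristic, edp_vdp):
    """Assumed mapping from an EDP ratio to a percentage: (ratio - 1) * 100."""
    return (edp_heuristic / edp_vdp - 1.0) * 100.0


# Mean across the five benchmarks, rounded to the nearest integer.
averages = {name: round(sum(vals) / len(vals)) for name, vals in TABLE2.items()}
```

The rounded means reproduce the 12, 25, and 75 percent figures reported for Greedy, Feedback, and Ondemand.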
Optimality of the Outcomes
The previous section demonstrated the algorithms' relative performances under different configurations. Comparing our heuristics against all state-of-the-art algorithms is not practical. Furthermore, a state-of-the-art algorithm may or may not improve the EDP on the same scale as others, which causes inconsistencies when comparing the performance of our heuristics to that of state-of-the-art algorithms. Instead, we measure how close our heuristics' performances are to the optimal solutions, which provides a reliable way of evaluating our heuristics independent of the optimality of other algorithms. This section presents a two-fold comparative study of the energy consumption and execution time solutions of our algorithms against optimal solutions. First, the EDPs of the average energy-time solutions, discussed in Section 8.1, are compared to equivalent optimal solutions on the Pareto frontier. Second, to assess the overall performance of these algorithms, the distances of their solutions from a theoretical lower-bound solution in the energy-time search space are computed.
Fig. 5 demonstrates an example of our approach for comparing the average energy-time solutions, over the configurations discussed earlier in this section, provided by our DVFS algorithms with their corresponding optimal solutions. In this figure, the black markers represent the energy-time solutions of our heuristics. For a given algorithm, for example VDP, the corresponding best solution is the point on the no-DVFS Pareto frontier that has the same performance degradation as VDP but saves the most energy (square white marker in Fig. 5).
TABLE 2 (values in percent; the last column is the average over the five benchmarks)
Greedy      12    7   12   22    5   12
Feedback    36   25   35   25    5   25
Ondemand   120   38   70   89   60   75

In this figure, the energy-time solution of the ondemand governor is shown for hi = 1 (i.e., the current core utilization is used for predicting the V/F levels), similar to the algorithm design of the ondemand governor used in the Linux kernel. Of note, similar comparisons between the algorithms and optimal energy-time solutions can be performed when the optimal solutions are defined to be the points on the Pareto frontier that have the best performance (fastest execution times) subject to consuming the same energy as the algorithms under consideration.
Fig. 6 shows the algorithms' energy efficiencies when their average EDPs are normalized to their respective optimal solutions on the Pareto frontier curve for a given time constraint. For example, considering VDP in Fig. 5, for an execution delay of 0.253 seconds, VDP consumes 11.69 joules. For the same execution time (0.253 seconds), the lowest energy consumption we can possibly obtain is around 8.89 joules (on the Pareto curve). Thus, for LU, the EDP of the VDP solution is 18.97 percent worse than the optimal. Fig. 6 shows that, on average, VDP comes closest to its optimal solution (1.09x compared to optimal) among the algorithms studied here. This indicates that, for the same time penalty as the optimal solution, VDP obtains the best energy saving. In contrast, the ondemand governor shows the worst performance, degrading its optimal solution's EDP by almost 2x, which corroborates the ondemand governor's high EDP outcomes shown in Fig. 4. Fig. 6 suggests that for the LU benchmark, VDP has a clear dominance over the other algorithms. The reason is that LU, compared to the other benchmarks, exhibits more variation among the cores' utilizations.
The dynamic programming nature of VDP leverages such variations to efficiently scale the cores' V/F levels across the intervals with respect to their workloads.
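The comparison against the Pareto frontier can be sketched as follows. This is a minimal illustration with hypothetical energy-time points, not the data behind Figs. 5 and 6. The frontier is represented by discrete points, so the reference for an algorithm is taken as the lowest-energy frontier point whose execution time does not exceed the algorithm's, which is an assumption standing in for reading the value off a continuous curve.

```python
def pareto_frontier(points):
    """Return the non-dominated (time, energy) points, sorted by time.

    A point is kept only if its energy is strictly lower than that of every
    point with a smaller or equal execution time.
    """
    frontier = []
    best_energy = float("inf")
    for t, e in sorted(points):
        if e < best_energy:
            frontier.append((t, e))
            best_energy = e
    return frontier


def reference_energy(frontier, time_budget):
    """Lowest frontier energy achievable within `time_budget`, or None."""
    feasible = [e for t, e in frontier if t <= time_budget]
    return min(feasible) if feasible else None


# Hypothetical (seconds, joules) solutions obtained from static V/F settings.
POINTS = [(0.16, 12.0), (0.20, 9.0), (0.25, 8.0),
          (0.22, 10.0), (0.30, 6.5), (0.28, 9.5)]

frontier = pareto_frontier(POINTS)

# Hypothetical algorithm solution: 0.26 s at 10.0 J.
alg_time, alg_energy = 0.26, 10.0
opt_energy = reference_energy(frontier, alg_time)
# At equal execution time, the EDP ratio reduces to the energy ratio.
normalized_edp = alg_energy / opt_energy
```

A normalized value of 1.0 would mean the algorithm's solution lies on the frontier; larger values indicate how much energy is lost relative to the best solution with the same time penalty.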
To quantify how close the algorithms' energy-time solutions are to optimal, we use a measure that indicates the percentage of similarity of our algorithms' EDPs to the EDP of the theoretically best (ideal) solution, i.e., when we assume the application can theoretically run at the highest speed with the lowest energy consumption. The theoretical ideal solution is obtained by taking the least execution time and the least energy consumption from the two solutions that lie on the upper-left and lower-right corners of the Pareto frontier, respectively. For example, in Fig. 5 for LU, the best theoretical solution results in 0.162 seconds of execution time at the cost of 6.452 joules. The similarity measure, (14) and (15), is defined based on the Euclidean distance between the EDPs of the algorithms and the EDP of the corresponding best theoretical solution. Here, $S_{alg,best} \in [0,1]$ is the degree of similarity between the EDPs of an algorithm (alg) and the best solution (best). $D(EDP_{alg}, EDP_{best})$ denotes the Euclidean distance of the best solution's EDP ($EDP_{best}$) from the EDP of the algorithm ($EDP_{alg}^{c}$), whose energy-time solution is obtained by using an input configuration c (e.g., w < 0.5, w = 0.5, and w > 0.5 for the feedback controller). As seen in (14), the similarity measure is inversely proportional to the Euclidean distance, indicating that an algorithm's energy efficiency is more similar (with a higher percentage) to the best solution when the EDPs of the corresponding solutions are closer. In contrast, the algorithm has a lower percentage of similarity to the best solution when the EDPs of the corresponding solutions are farther apart. Table 3 shows the similarities of our algorithms' EDPs to the best solution's EDP computed by (14). On average, VDP is 3, 7, and 18 percent more similar to the best solution than the greedy, feedback controller, and ondemand heuristics, respectively.
Furthermore, Table 3 suggests that, among the heuristics, Greedy's similarity outcomes are the closest to those of VDP. The degrees of similarity of the algorithms to the best solution, shown in Table 3, match the relative energy efficiencies (or EDPs) of these algorithms shown in Figs. 4 and 6.
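The construction of the theoretical ideal solution and the distance it is compared against can be sketched with the LU values quoted above. The sketch is an illustration under assumptions: the function names are hypothetical, the two frontier corner points use the quoted minimum time and minimum energy together with made-up values for their other coordinates, and only the distance is computed, since the mapping from distance to the similarity degree follows (14) and is not reproduced here.

```python
import math


def ideal_point(frontier):
    """Combine the least time and the least energy found on the frontier."""
    best_time = min(t for t, _ in frontier)
    best_energy = min(e for _, e in frontier)
    return best_time, best_energy


def edp_distance(alg_edps, best_edp):
    """Euclidean distance between an algorithm's per-configuration EDPs and
    the ideal EDP (with one configuration this is an absolute difference)."""
    return math.sqrt(sum((edp - best_edp) ** 2 for edp in alg_edps))


# Corner points of the LU frontier: 0.162 s and 6.452 J are from the text,
# the paired energy (13.0 J) and time (0.40 s) are hypothetical.
lu_corners = [(0.162, 13.0), (0.40, 6.452)]

t_best, e_best = ideal_point(lu_corners)
edp_best = t_best * e_best

# VDP's average LU solution from the text: 0.253 s at 11.69 J.
edp_vdp = 0.253 * 11.69
distance = edp_distance([edp_vdp], edp_best)
```

Because the ideal point combines coordinates from two different solutions, it generally lies below the frontier and cannot be reached by any schedule; it serves only as a common lower-bound reference for all algorithms.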
CONCLUSION AND FUTURE WORK
This paper proposes a Dynamic Programming (DP) framework, based on the Viterbi algorithm, to achieve fine-grain, per-core energy-time tradeoff analysis. This technique globally optimizes per-core V/F levels by minimizing the cores' energy consumptions and execution times. The performance of the framework is compared to a faster version of the VDP algorithm (Greedy) and two other heuristic algorithms (Feedback and Ondemand). The EDP results show that, on average, our proposed VDP algorithm outperforms Greedy by 12 percent, Feedback by 25 percent, and Ondemand by 75 percent.
Furthermore, the results show that the VDP algorithm performs best when maximizing energy saving is as important as, or more important than, minimizing the execution time penalty. Considering the optimal solution corresponding to each algorithm, the results show that VDP comes closer to its corresponding optimal solution than the other heuristics do.
As discussed in Section 4.2, the algorithms presented in this paper perform compile-time DVFS and are usable for applications with a specific execution model (Fig. 1). Furthermore, these algorithms are applied to a system with a moderate number of homogeneous cores, as explained in Section 7.2. To extend these algorithms, future work will address the following points: 1) studying applications with different execution models, 2) analyzing the scalability and energy efficiency of the algorithms with respect to larger system sizes, 3) integrating the algorithms with OS kernels to perform runtime DVFS, and 4) evaluating the performance of the algorithms on systems with heterogeneous cores.
