Abstract-Applications that are run on multicore systems without performance targets can waste significant energy. This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache (LLC) partitions and the per-core voltage-frequency settings to save energy while respecting the QoS requirements of individual applications in multiprogrammed workloads run on multi-core systems. It does so by exploring the configuration space of LLC partition sizes and DVFS settings at runtime with negligible overhead. We show that the proposed coordinated RMA saves, on average, 20% energy, compared to 15% for DVFS alone and 7% for cache partitioning alone, when the performance target is set to 70% of the baseline system performance.
I. INTRODUCTION
Resource management at the microarchitectural level aims at maximizing multi-core system performance or energy efficiency. However, if applications are not associated with any performance targets in terms of Quality-of-Service (QoS) constraints, the energy expenditure can be excessive. In contrast, if applications have clearly defined QoS constraints, resources can be throttled down to deliver enough performance with a greatly reduced energy cost.
Core Voltage-Frequency (VF) and the per-core share of the Last-Level Cache (LLC) are two popular resources to control both performance and energy-efficiency of applications running on a multi-core system. The reason is that the former is most effective for compute-intensive application phases whereas the latter can be more valuable for memory-intensive phases.
We envision a resource management system where all applications in a multiprogrammed workload, on a multicore system, have QoS constraints that can be met by a baseline allocation of resources; e.g., partitioning of LLC resources evenly across cores at a given VF setting. The objective of our envisioned resource manager is to maximize energy efficiency by dynamically distributing resources at run-time across cores.
The literature describes several DVFS and LLC partitioning resource management schemes. Many of these schemes, such as [1]-[6], do not consider QoS constraints. QoS constraints are a critical requirement to reduce system energy without degrading user experience. Other approaches focus on optimizing the system when only a single application has QoS constraints [7]-[11] and are hence not applicable to multi-core systems with per-application QoS targets. A common scenario for QoS-constrained workloads is to share the system between one latency-critical job and other best-effort batch jobs [12]-[21]. In this scenario, the optimization is performed on the best-effort jobs by utilizing the system resources only when the latency-critical job is not using them. In our work, we consider the more general and difficult scenario in which all the applications in a workload have strict performance constraints. Instead of maximizing the aggregate performance or system utilization, our target is to minimize the system energy without sacrificing the performance of any application. The solution to this problem also works for a less strict scenario where a bounded reduction in performance can be tolerated on any subset of the applications. Hence, it is a general approach.
There are only a few prior works that assume performance constraints on all the applications in the workload [22], [23]. In these works, only the core DVFS controllers target performance constraints, while the LLC controllers attempt to minimize the overall cache misses. Since minimizing the global LLC miss rate can conflict with meeting individual performance targets, these approaches can lead to QoS violations and are thus not acceptable solutions. Furthermore, they cannot optimize system energy because the partitioning controller does not take the DVFS effect into account. A smaller allocation of cache to an application can result in an increase in the core voltage-frequency, which has a quadratic effect on core energy consumption.
This paper proposes, for the first time, an integrated resource manager that controls, in a single centralized algorithm and in a coordinated manner, the LLC partitioning and core DVFS of all the applications sharing multicore system resources. The goal is to minimize system energy without violating performance constraints on any application. To this end, the Resource Management Algorithm (RMA) performs a configuration-space exploration, at each program phase change, to identify the best allocation of resources. The challenge is to perform this search in a complex multi-dimensional configuration space with negligible run-time overhead, even as the size of the system increases. In this paper we propose a multi-layer pruning heuristic to perform this operation in polynomial time. The proposed method does not require any profiling, training, or prior knowledge about the runtime behavior of applications.
Our experimental results show that the proposed scheme using combined DVFS and LLC partitioning is more effective in saving energy compared to isolated DVFS and cache partitioning. In addition, the overhead of invoking the RMA at phase changes has a negligible impact on the energy savings. When the performance target is the same as the baseline system, the savings are as high as 12%. When the performance target is reduced to 70% of the baseline system, the proposed RMA is capable of saving on average 20% energy as compared to 15% for DVFS alone and 7% for cache partitioning alone.
The main contributions of this work are threefold. First, we contribute an online resource management scheme that controls per-core DVFS settings and LLC partitioning in a coordinated manner to maximize system-level energy efficiency while respecting the QoS constraints of all applications in a multiprogrammed workload at a low overhead. Second, we contribute a heuristic algorithm that finds an optimal configuration in polynomial time. Finally, we evaluate, via a novel simulation framework, the effect of different resource management algorithms on full executions of benchmark applications in a multi-core system.
The rest of the paper is organized as follows. Section II provides the motivation for this work. The proposed scheme is described in Section III. The simulation methodology and experimental results are presented and discussed in Sections IV and V, respectively. Section VI discusses related work. Finally, we conclude in Section VII.
II. BACKGROUND AND MOTIVATION
This section first presents the baseline architecture framework and the basic assumptions. It then provides motivational data for the potential of saving energy when both LLC and DVFS are managed in coordination.
We consider a multi-core system where each core runs a single-threaded program. In the baseline system, the LLC capacity is evenly partitioned among the cores and all the cores run at a base frequency. It should be noted that this equal distribution of resources is just an example baseline. Furthermore, per-core DVFS can be controlled individually.
The QoS target for each application is expressed as a fraction of the instructions-per-second (IPS) rate on the baseline setting in each execution interval. This is described in detail in Section III-E. A cloud provider could, for example, sell computing cycles cheaper if, say, the performance target can be reduced to 70% of the baseline configuration. This form of QoS requirement can support mixed workloads with different performance targets. The RMA attempts to dynamically select a system configuration, in terms of a per-core LLC partition size and VF for each individual application, that minimizes the system energy while still meeting the QoS constraint expressed as a performance target.
The hypothesis is that the energy savings of an isolated DVFS or LLC partitioning strategy are limited, and that with a global and coordinated control of both resources it becomes possible to find a more efficient set of configurations. To investigate this, we conduct an experiment on different 4-core workloads. Cache Sensitivity is determined by the amount of variation in MPKI when changing from a smaller partition size to a larger one, relative to the per-core LLC size of the baseline system. Three RMAs are considered in this experiment: 1) DVFS only, 2) LLC partitioning only, and 3) Combined, i.e., coordinated control of DVFS and LLC partitioning. All three RMAs use an ideal model of performance and energy to select a configuration that minimizes energy while meeting the performance target. The RMA is invoked whenever any program experiences a phase change. Figure 1 shows the energy saving results compared to the baseline with a strict IPS target (top) and a target relaxed by 30% (bottom). The top figure shows that an energy saving of more than 10% is possible without a performance degradation on any of the cores. Of course, the DVFS controller has no option in this case to save energy without lowering the performance. The LLC partitioning controller has limited options. The combined controller, however, can cancel the performance degradation on the core with a lowered cache share by increasing its VF, while the performance boost on the core that receives a larger cache allows a reduction of VF on that core to save energy. Thus, a more efficient configuration that reduces the sum of core and memory access energies with the same level of performance can be found. Figure 1 shows that a 30% relaxation of the IPS targets opens up further possibilities to save over 25% of energy with the combined coordinated controller for several applications.
III. THE PROPOSED SCHEME
This section presents the proposed resource management scheme. Figure 2 shows an overview of the system. On each core, a monitoring mechanism periodically collects information from hardware performance counters. The RMA, which is part of a light-weight power management software handler, is invoked at a phase change. It uses data collected from performance counters and Auxiliary Tag Directories (ATD) [1] to do configuration-space exploration of the performance and energy across all different LLC and frequency configurations. Once the new optimal configuration is found, it is applied to the DVFS and LLC partitioning controllers.
The rest of this section reviews the required hardware support and the necessary software components including the SW integration, phase-change detection, the performance and energy models, and the RMA.
A. Hardware Support and SW Integration
In order to support per-core DVFS, we assume that the chip has as many voltage regulators as the number of cores. This has been implemented, for example, in [24]-[26].
The proposed scheme requires hardware support for partitioning the LLC and predicting the miss counts at different allocations with minimum runtime overhead. We assume a partitioning of LLC ways that is for example implemented in Intel [27] and Qualcomm [28] products. This technique has two advantages. First, the overhead of changing the partitions is limited to re-writing a bit-mask while the actual data movement is automatically performed by the replacement policy during execution. Second, it allows the use of the Auxiliary Tag Directory (ATD) [1] that provides the necessary data for the analytical models with negligible timing overhead.
Furthermore, certain statistics from performance counters including computation time, memory access time, number of executed instructions, and number of memory write backs are needed. Our technique also assumes statistics to model the effect of memory-level parallelism on performance. This will be described in Section III-C. Finally, for implementing a low overhead energy model, as discussed in Section III-D, the system must support measuring of core energy consumption during an execution interval.
In order to avoid interference on the memory bandwidth, we assume that the memory controller provides a fair allocation of the available bandwidth among contending applications.
The proposed method is invoked at phase changes, which are checked at a granularity of 100 million instructions. This granularity allows it to be implemented as a light-weight power-management software handler with negligible overhead. The manager's operation consists of two steps. In the first step, it collects the performance statistics by reading the registers that capture the performance counter values. In the second step, based on these statistics, it determines the VF of each core and controls the LLC partitioning by writing to the corresponding registers for allocation bit masks.
B. Phase-Change Detection
Applications typically exhibit phase behavior [29], [30] in which the instruction execution rate, measured in Instructions-Per-Clock (IPC) for example, is almost constant. We propose to invoke the RMA only when there is a phase change. We adopt a simple phase-change detection method presented in [30]. It is based on the coefficient of variation of the average IPC over consecutive intervals. If the resulting value is above a threshold, the RMA is invoked. The threshold value determines the sensitivity to changes. A smaller threshold activates the RMA more frequently, which increases the overhead, while a larger threshold may miss some of the smaller changes and thus opportunities for energy saving. A threshold of 25% was found in [30] to be a good tradeoff.
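The detection rule can be illustrated with a short Python sketch. It assumes a window of per-interval IPC samples and uses the population standard deviation; the function name `phase_changed`, the window contents, and the exact statistic are illustrative assumptions, and the formulation in [30] may differ in detail.

```python
from statistics import mean, pstdev


def phase_changed(ipc_window, threshold=0.25):
    """Return True when the coefficient of variation (std / mean) of the
    IPC samples in the window exceeds the threshold (25% by default)."""
    if len(ipc_window) < 2:
        return False  # not enough samples to measure variation
    avg = mean(ipc_window)
    if avg == 0:
        return False  # avoid division by zero on an idle core
    return pstdev(ipc_window) / avg > threshold


# Hypothetical IPC samples over consecutive 100M-instruction intervals.
stable = [1.00, 1.02, 0.98]
shifted = [1.00, 2.00]
print(phase_changed(stable), phase_changed(shifted))
```

Lowering `threshold` makes the check fire on smaller IPC swings, which mirrors the overhead versus sensitivity tradeoff discussed above.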
C. Performance Model
In order to predict the impact of resource allocations on the performance, we consider the following simple IPS model as a function of LLC allocation w and core frequency f:
$$\mathrm{IPS}(w, f) = \frac{IC}{\dfrac{C_{base}}{f} + M(w) \cdot AMAT} \qquad (1)$$
where $IC$ is the instruction count over the monitoring interval, $C_{base}$ is the number of active CPU cycles excluding the memory access stalls, $AMAT$ is the average memory access time of LLC misses, and $M(w)$ is the LLC miss count as a function of $w$. $C_{base}$ is derived from performance counters and we assume that it is independent of $w$. $M(w)$ is derived from the ATD.
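Read together with these definitions, the model divides the instruction count by the sum of compute time and memory stall time. The following Python sketch shows that reading; the function name, the dictionary used for the miss curve, and all numbers are illustrative assumptions rather than values from the evaluation.

```python
def predict_ips(ic, c_base, miss_curve, amat, w, f):
    """IPS model: instructions divided by compute time plus memory stall time.

    ic         -- instructions in the monitoring interval
    c_base     -- active CPU cycles excluding memory stalls
    miss_curve -- mapping from LLC ways w to predicted LLC misses M(w)
    amat       -- average memory access time per LLC miss, in seconds
    w          -- LLC ways allocated to the core
    f          -- core frequency in Hz
    """
    exec_time = c_base / f + miss_curve[w] * amat
    return ic / exec_time


# Hypothetical numbers for one 100M-instruction interval.
miss_curve = {2: 300_000, 4: 100_000, 8: 50_000}
ips_small = predict_ips(100e6, 80e6, miss_curve, 100e-9, w=2, f=2.0e9)
ips_large = predict_ips(100e6, 80e6, miss_curve, 100e-9, w=8, f=2.0e9)
print(ips_small, ips_large)
```

In this toy setting a larger allocation removes stall time, so the same IPS could be reached at a lower frequency, which is the tradeoff the RMA exploits.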
It is well known that AMAT is sensitive to Memory Level Parallelism (MLP). To model the MLP effect, we use the approach proposed by Karkhanis and Smith [31] based on probability functions. If $P_{ov}(i)$ denotes the probability of having $i$ overlapping LLC misses during an interval and $ML$ is the memory access latency for an isolated DRAM access, AMAT can be calculated as follows:
We use this formula in our simulations by collecting the MLP histogram statistics during an interval. This can be captured by performance counters similar to those available in some modern processors (such as Intel's L1D_PEND_MISS.PENDING counter). We then use this AMAT value in (1) to estimate the performance of different configurations. In Section V we analyze the accuracy of the model.
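As a rough illustration of how an MLP histogram could feed the AMAT estimate, the sketch below assumes that a miss overlapping with i - 1 other misses is charged ML / i, so that AMAT is ML times the sum of P_ov(i) / i. This weighting is an assumption made for the example; the exact formula in [31] and in this work may differ. The function name and histogram layout are hypothetical.

```python
def amat_from_mlp(overlap_counts, ml):
    """Estimate AMAT from an MLP histogram.

    overlap_counts -- mapping {i: samples observed with i overlapping misses},
                      with i >= 1
    ml             -- latency of an isolated DRAM access, in seconds
    Assumption: a miss that overlaps with i - 1 others costs ml / i.
    """
    total = sum(overlap_counts.values())
    if total == 0:
        return ml  # no MLP information: fall back to the isolated latency
    # n / total is the empirical probability P_ov(i).
    return ml * sum((n / total) / i for i, n in overlap_counts.items() if i > 0)


# Hypothetical histogram: half of the samples see two overlapping misses.
print(amat_from_mlp({1: 50, 2: 50}, ml=100e-9))
```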
D. Energy Model
The RMA must only model the energy consumption of the components that are affected by its decisions. That includes the energy of core and memory accesses. This low overhead energy model uses the statistics collected over a monitoring interval with fixed number of instructions (IC). Hence the Energy-Per-Instruction (EPI) is calculated for each core i as follows:
In this model, $T$ is the time to execute $IC$ instructions. This is derived from the performance model. $E_{c,dyn}$ represents the dynamic energy consumed by different core events. In our configuration space, this parameter is only affected by the core voltage, which is determined by the core frequency. $P_{c,static}$ is the constant static power consumption of the core, which is also dependent on the core voltage. The value of the core static power can be evaluated offline for each frequency setting and stored in a small table. Core dynamic energy is derived by subtracting the static energy during an interval from the core energy measurements of that interval. To estimate the dynamic energy at other frequencies, this value is scaled by the square of the core voltage. $E_{mem}$ is the energy consumed by memory accesses. This parameter is dependent on both the number of cache misses and write-backs to the main memory. The cache misses are estimated from the ATD, and the write-backs are measured from the performance counters. We make the simplifying assumption that the number of write-backs does not change with cache size. The accuracy of the model is studied in Section V.
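The steps described above can be sketched in Python as follows. The sketch assumes that EPI is the sum of scaled dynamic energy, static energy, and memory energy divided by the instruction count, which is consistent with the description but is not taken verbatim from the paper; all parameter names and numbers are hypothetical.

```python
def energy_per_instruction(e_core_meas, p_static_meas, t_meas, v_meas,
                           v_new, p_static_new, t_new, e_mem_new, ic):
    """Estimate EPI at a candidate configuration from one measured interval.

    e_core_meas   -- measured core energy in the interval (J)
    p_static_meas -- static power at the measured voltage (W), from a table
    t_meas, t_new -- execution time at the measured / candidate setting (s)
    v_meas, v_new -- core voltage at the measured / candidate setting (V)
    p_static_new  -- static power at the candidate voltage (W), from a table
    e_mem_new     -- memory energy at the candidate setting (J)
    ic            -- instructions in the interval
    """
    # Dynamic energy = measured core energy minus static energy of the interval.
    e_dyn_meas = e_core_meas - p_static_meas * t_meas
    # Scale dynamic energy by the square of the voltage ratio.
    e_dyn_new = e_dyn_meas * (v_new / v_meas) ** 2
    return (e_dyn_new + p_static_new * t_new + e_mem_new) / ic


# Hypothetical interval: 100M instructions, candidate runs at 0.9x voltage.
epi = energy_per_instruction(0.045, 0.5, 0.05, 1.0,
                             0.9, 0.4, 0.06, 0.001, 100e6)
print(epi)
```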
E. Resource Management Algorithm (RMA)
An overview of the proposed RMA is shown in Figure 3a. After the detection of a new phase in a core, LLC miss values are collected from the ATD. The performance model uses these values to predict the IPS for different system configurations. The configuration space for each core has two dimensions: cache allocation w and frequency f. Considering all the possible combinations among all the cores creates a search space that is too complex for online resource management.
To address this problem, we prune the search space on each core and reduce it to a single dimension as follows. For each possible allocation of cache to each core, a minimum frequency can be found that meets the IPS constraint. This is depicted by the yellow bars in the hypothetical graphs in Figure 3a. We can safely ignore the other configurations because any lower frequency violates the constraints while higher frequencies (and voltages) increase the energy consumption.
If $w_b$ and $f_b$ represent the baseline system configuration, the minimum frequency is derived from the following equations:
The parameter α in (4) is used for relaxing the performance target. In case of a strict target, its value is set to 1.0; otherwise a smaller value is used. If no $f_{min}$ is found for some smaller values of $w$, those values are discarded from the minimum frequency set.
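The pruning step can be sketched as a search for the lowest frequency whose predicted IPS reaches α times the baseline IPS, dropping allocations for which no frequency qualifies. The sketch below is a minimal illustration under that reading; `min_frequency_set`, the toy IPS model, and the frequency list are hypothetical.

```python
def min_frequency_set(ips_of, ways, freqs, w_base, f_base, alpha=1.0):
    """For each allocation w, return the lowest frequency that meets
    alpha * IPS(w_base, f_base); infeasible allocations are discarded."""
    target = alpha * ips_of(w_base, f_base)
    f_min = {}
    for w in ways:
        feasible = [f for f in freqs if ips_of(w, f) >= target]
        if feasible:
            f_min[w] = min(feasible)
    return f_min


# Toy IPS model with hypothetical numbers (100M instructions per interval).
MISSES = {1: 300_000, 2: 100_000, 3: 50_000}


def toy_ips(w, f):
    return 100e6 / (80e6 / f + MISSES[w] * 100e-9)


FREQS = [1.0e9, 1.6e9, 1.8e9, 2.0e9, 2.4e9]
strict = min_frequency_set(toy_ips, MISSES, FREQS, w_base=2, f_base=2.0e9)
relaxed = min_frequency_set(toy_ips, MISSES, FREQS, 2, 2.0e9, alpha=0.7)
print(strict)
print(relaxed)
```

With the strict target, the one-way allocation drops out because no available frequency compensates for its extra misses; relaxing the target to 70% brings it back and lowers the required frequency for the larger allocations.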
In the next step, the energy model transforms the minimum frequency set into an EPI-set using (3) and (5):
This process happens at each phase change on every core. At this point the new EPI-set is passed to the optimization algorithm that already contains the EPI-sets of other cores. This algorithm finds the new optimum setting that minimizes the sum of EPI values for all the cores.
F. Final Optimization
After pruning the configuration space of each core to a set of EPI values for each possible allocation of LLC, we need to find the best combination of allocations with a sum equal to the LLC size. We define a vector $V = \{w_1, w_2, \ldots, w_N\}$ as an LLC allocation over $N$ cores, $A$ as the total number of available LLC ways, and $W_{max}$ as an upper bound for LLC allocation to each core. We then define the optimization problem as follows:
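Written out with the symbols defined above, and assuming that $e_i(w_i)$ denotes the EPI of core $i$ at its minimum frequency for allocation $w_i$ (the notation used for the energy curves later in this section), the problem can be stated as below. This is a sketch derived from the surrounding definitions, so details such as a lower bound on $w_i$ are left open.

```latex
\begin{aligned}
\min_{V}\quad & \sum_{i=1}^{N} e_i(w_i) \\
\text{subject to}\quad & \sum_{i=1}^{N} w_i = A, \\
& w_i \le W_{max}, \quad i = 1, \dots, N
\end{aligned}
```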
To solve this problem in polynomial time, we leverage the idea presented in [15] to design our optimization algorithm. The pseudocode is shown in Algorithm 1. It starts from N energy curves, one for each core (lines 13-16). Each pair of curves is then reduced to a single curve that gives the lowest energy for a given allocation to the pair (lines 18-20). This leads to N/2 remaining curves. By repeating the same process, in $\log_2 N$ levels of reduction, the minimum energy configuration is found (lines 22-23).
The reduction process works as follows. Let us assume a pair of cores $i$ and $j$ and a maximum way allocation $W_{ij}$ to the pair, while $V_{ij} = \{w_i, w_j\}$ denotes a specific allocation to these cores. A $V^*_{ij}$ can easily be found that minimizes the combined energy $e_i(w_i) + e_j(w_j)$ of the pair.
Hence, the two curves $e_i(w_i)$ and $e_j(w_j)$ are reduced to a single curve $E^*_{ij}$ and a corresponding allocation vector $V^*_{ij}$, both as a function of the total allocation $W_{ij}$ (lines 28-42).
One of the advantages of this algorithm is that during a reduction level, each reduction function is independent of the others. Therefore, if only one core experiences a phase change at the invocation of the RMA, only the reductions that are affected by that core need to be executed. As depicted in Figure 3b, only $\log_2 N$ reductions are required in a system with $N$ cores. This significantly improves the scalability of the algorithm.
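The pairwise reduction can be sketched in Python as follows. This is an illustrative reading of the idea, not a transcription of Algorithm 1: curves are dictionaries, the per-core upper bound is expressed by the keys present in each EPI-set, and the incremental re-reduction after a single-core phase change is not shown. All names and energy values are hypothetical.

```python
def reduce_pair(curve_a, curve_b, max_ways):
    """Merge two curves {ways: (energy, allocation)} into one curve that keeps,
    for every combined way count, the lowest-energy split between the two."""
    merged = {}
    for w_a, (e_a, alloc_a) in curve_a.items():
        for w_b, (e_b, alloc_b) in curve_b.items():
            total = w_a + w_b
            if total > max_ways:
                continue
            energy = e_a + e_b
            if total not in merged or energy < merged[total][0]:
                merged[total] = (energy, alloc_a + alloc_b)
    return merged


def minimize_energy(epi_sets, total_ways):
    """epi_sets: one dict {ways: EPI} per core. Returns (energy, allocation)
    that uses exactly total_ways, or None if no combination is feasible."""
    level = [{w: (e, (w,)) for w, e in s.items()} for s in epi_sets]
    while len(level) > 1:
        nxt = [reduce_pair(level[k], level[k + 1], total_ways)
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired curve moves up unchanged
        level = nxt
    return level[0].get(total_ways)


# Hypothetical EPI-sets for four cores sharing eight LLC ways.
epi_sets = [
    {1: 5.0, 2: 3.0, 3: 2.0},
    {1: 4.0, 2: 3.5, 3: 3.4},
    {2: 2.0, 3: 1.0},
    {1: 1.0, 2: 0.9, 3: 0.9},
]
best = minimize_energy(epi_sets, total_ways=8)
print(best)
```

Because each merged curve stores the best split for every combined way count, a later level never needs to revisit the individual cores, which is what keeps the search polynomial.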
IV. EXPERIMENTAL METHODOLOGY
We analyze the proposed RMA using a simulation method based on SimPoint analysis [32] and Sniper [33] plus McPAT [34] simulations. In Section IV-A, we present the default architecture model used to derive the experimental results. Then, in Section IV-B, we present the simulation methodology. Finally, Section IV-C introduces the workloads used in the simulations.
A. Base Configuration
We use the default Nehalem microarchitecture model in Sniper with three levels of cache. In order to have a more accurate simulation, especially for studying the impact of MLP for performance modeling, we use the rob core model in Sniper. Table I summarizes the architectural parameters used in our simulations. The processor model is a 4-way out-of-order core. A more aggressive core would shift the workloads to be more memory intensive. This would make cache partitioning alone and our proposed combined scheme more effective as it would give more headroom for trading a smaller cache partition size for a higher frequency. The baseline system consists of four cores. We will however present results also for eight-core systems.
B. Simulation Method
In order to reduce the simulation time, we base our simulations on the SimPoint methodology. However, to accurately 
Algorithm 1 (excerpt, lines 18-23)
18: for each pair Z = {i, j} in ArrayT do
19:     T_Z ← Reduce(T_i, T_j)
20:     Replace {T_i, T_j} with T_Z
21: until length(ArrayT) = 1
22: T_Final = ArrayT[0]
23: Return
We have adopted a method based on the idea presented by Van Biesbrouck et al. in [35] as follows. A phase trace of each benchmark program is created using SimPoint analysis. The phase trace consists of the sequences of phases that a program will visit, given that the program execution is divided into instruction sequences, denoted slices, of a fixed length. We make the simplifying assumption that the program behavior in all the slices of a phase is exactly the same as that of the representative slice of that phase selected by SimPoint. Hence, the phase trace aims at mimicking the phase changes of each benchmark program.
Figure 4 shows an overview of the simulation steps. The SPEC CPU2006 whole-program Pinballs from the Sniper website [36] are used as input to the process. SimPoint then generates the representative program slices. In the next step, these slices are simulated with Sniper and McPAT at different configurations. The result is a simulation database for all the phases, all the voltage-frequency levels, and all the LLC sizes for each benchmark application. We use up to 10 phases for each application with a slice size of 100M instructions plus a prior cache warm-up region of another 100M instructions.
In the second part of the simulation process, the RMA Simulator regenerates the execution of a multiprogrammed workload in a multi-core scenario with an RMA using the program phase traces and the simulation database for each program in the workload mix. Figure 5 shows an example scenario of a simulation. In this figure, the numbered boxes represent program slices with fixed instruction count while the patterned colors correspond to a program phase. The system starts with the baseline configuration. Looking at the phase trace of each application, the IPS rate of that phase at the base system configuration is collected from the simulation database. The time of the first global phase change is then calculated ($t_1$). At this point, when App2 enters a new phase from slice 3 and App1 is in the middle of slice 2, the RMA is invoked. First, the time and energy overheads are added to the simulation results. Then, the new IPS setting is derived from the simulation database according to the configuration selected by the RMA. The same process repeats at the next global phase change at $t_2$. During this repetitive process other statistics such as energy consumption are also collected. Unlike the energy model described in Section III-D, the simulator collects both the core and un-core energy components plus the dynamic energy of the main memory. The described process continues until the end of the simulation.
C. Workloads
We use SPEC CPU2006 for our experiments. We categorize the applications based on two important criteria: Memory Intensity and Cache Sensitivity. We define these criteria by looking into the MPKI curve of each application for different LLC partition sizes around the per-core baseline partition size. We count an application with a high base MPKI (more than 5) as memory intensive. On the other hand, if the MPKI variation between 50% and 150% of the base partition size is larger than a threshold (20%) with a large enough base MPKI (more than 0.2), we count it as cache sensitive. Hence, we have the following workload types: A) memory intensive and cache sensitive, B) memory intensive and cache insensitive, C) not memory intensive but cache sensitive, and D) neither memory intensive nor cache sensitive.
Table II shows the applications that belong to each type. For five of the SPEC CPU2006 applications (calculix, milc, sjeng, tonto, zeusmp), the SimPoint tool is unable to correctly create sufficiently long (100M instructions) warm-up phases and detailed slices. For this reason, we exclude these applications from our experiments. However, the remaining set contains many benchmarks from each category.
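The two criteria can be expressed as a small Python function. The sketch assumes that the MPKI variation is the drop in MPKI between 50% and 150% of the baseline partition size, normalized by the base MPKI; this normalization and the function name `classify` are assumptions made for the example, and the MPKI values are hypothetical.

```python
def classify(mpki_small, mpki_base, mpki_large,
             intensity_threshold=5.0, min_base_mpki=0.2,
             variation_threshold=0.20):
    """Return (memory_intensive, cache_sensitive) for one application.

    mpki_small / mpki_large -- MPKI with 50% / 150% of the baseline partition
    mpki_base               -- MPKI with the baseline per-core partition
    """
    memory_intensive = mpki_base > intensity_threshold
    cache_sensitive = False
    if mpki_base > min_base_mpki:
        # Assumed definition: MPKI drop across the range, relative to the base.
        variation = (mpki_small - mpki_large) / mpki_base
        cache_sensitive = variation > variation_threshold
    return memory_intensive, cache_sensitive


# Hypothetical MPKI triples (50%, 100%, 150% of the baseline partition).
print(classify(12.0, 8.0, 6.0))
print(classify(10.0, 10.0, 9.5))
```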
We create a list of different combinations of application types to model a wide range of 4 and 8 core workload mixes. We then use the Python function random.choice() to select benchmarks from each category for each workload. The result is listed in Table III. In addition to the benchmarks in each workload, the number of execution rounds on the baseline system is reported for each benchmark within parentheses. More details about the simulation length of each workload are provided in Section V-A.
Table III (excerpt)
Workload | Benchmarks (baseline execution rounds)
         | bwaves (1), astar (7), leslie3d (1), libquantum (1)
CCCC1    | gcc (3), h264ref (1), hmmer (1), gobmk (2)
CCCC2    | gobmk (2), bzip2 (1), hmmer (1), h264ref (1)
DDDD1    | gamess (2), perlbench (2), povray (2), namd (1)
DDDD2    | namd (1), cactusADM (1), perlbench (4), dealII (1)
AABB1    | soplex (6), sphinx3 (1), GemsFDTD (1), astar (7)
AABB2    | soplex (4)
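The selection step can be sketched with Python's standard `random` module. The pools below are illustrative only: they reuse benchmark names that appear in the CCCC and DDDD rows of Table III, and they do not reproduce Table II. The helper name `build_mix` and the seed are hypothetical. Note that `choice` samples with replacement, so a real generator may need an extra check if duplicates within a mix are unwanted.

```python
import random

# Illustrative pools only; not a reproduction of Table II.
POOLS = {
    "C": ["gcc", "h264ref", "hmmer", "gobmk", "bzip2"],
    "D": ["gamess", "perlbench", "povray", "namd", "cactusADM", "dealII"],
}


def build_mix(pattern, pools, rng):
    """Pick one benchmark per letter in the pattern, e.g. "CCDD"."""
    return [rng.choice(pools[category]) for category in pattern]


rng = random.Random(2024)  # seeded generator so the mix is reproducible
mix = build_mix("CCDD", POOLS, rng)
print(mix)
```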
V. RESULTS
We start by describing the evaluation metrics in Section V-A. The energy saving results for different workload mixes and scenarios are then presented and discussed in Section V-B. Finally, the accuracy of the models and the overhead imposed by the RMA is analyzed and discussed in Sections V-C and V-D, respectively.
A. Evaluation Metrics
Each experiment starts by running all the applications in the workload mix to completion while keeping the baseline system configuration. In order to keep all cores busy during a simulation interval, each application restarts after completion until all of them finish at least one round of execution. For the next simulations with different RMAs, we make sure that each application completes at least the same number of rounds as on the baseline system.
The performance results are evaluated by comparing the average execution time of each round of application run, with that of the baseline system. To evaluate the energy savings for a fixed amount of work, we sum the energy values corresponding to the execution of the same number of rounds as on the baseline, for all applications using the target RMA. The resulting value is then compared with that of the baseline system.
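The energy comparison for a fixed amount of work can be sketched as below. The data layout (per-application lists of per-round energies) and the function name are assumptions made for illustration; only the rounds that were also completed on the baseline are counted.

```python
def energy_saving(baseline_energy, rma_round_energy, baseline_rounds):
    """Fractional energy saving for the same amount of work as the baseline.

    baseline_energy  -- total energy of the baseline run (J)
    rma_round_energy -- {app: [energy of round 1, round 2, ...]} under the RMA
    baseline_rounds  -- {app: rounds completed on the baseline system}
    """
    rma_energy = sum(sum(rounds[:baseline_rounds[app]])
                     for app, rounds in rma_round_energy.items())
    return 1.0 - rma_energy / baseline_energy


# Hypothetical two-application workload: app "a" ran 2 baseline rounds.
saving = energy_saving(
    100.0,
    {"a": [30.0, 30.0, 30.0], "b": [20.0]},
    {"a": 2, "b": 1},
)
print(saving)
```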
We consider two different scenarios. In one scenario (Strict IPS target), the QoS target is fixed to the IPS value on the baseline configuration and no performance degradation is allowed. In this scenario, we consider two RMAs: i) cache partitioning only, called Partitioning, and ii) the proposed RMA with coordinated control of DVFS and cache partitioning, called Combined. The Partitioning RMA controls only the LLC partitioning without affecting the core frequencies. Its goal is to minimize system energy without violating the performance constraints. It uses the same optimization algorithm described in Section III-F. DVFS-only RMAs are not relevant in this scenario because they cannot save energy without lowering performance below a strict IPS target.
In the second scenario (Relaxed IPS target), we allow a 30% reduction of IPS. Hence, in addition to the two RMAs evaluated under the first scenario, we add two DVFS-only RMAs: i) dynamic DVFS and ii) static DVFS. First, dynamic DVFS finds the minimum frequency that satisfies the QoS target during execution. Intuitively, a 30% reduction of the frequency would meet the target if the workload is compute bound. A memory-bound workload would experience an even smaller IPS reduction. As a comparison, we also consider a naive RMA that statically reduces the frequency by 30%, called static DVFS.
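The intuition about compute-bound and memory-bound workloads can be checked with a small worked example. It assumes, as in the simple IPS model of Section III-C, that memory stall time does not scale with core frequency; the function names and numbers are hypothetical.

```python
def ips(c_base, stall_time, f, ic=100e6):
    """Instructions per second: IC / (compute time + memory stall time)."""
    return ic / (c_base / f + stall_time)


def ips_ratio_after_scaling(c_base, stall_time, f_base, scale):
    """IPS at frequency scale * f_base relative to IPS at f_base."""
    return ips(c_base, stall_time, f_base * scale) / ips(c_base, stall_time, f_base)


# 30% lower frequency on a compute-bound and on a memory-bound interval.
compute_bound = ips_ratio_after_scaling(80e6, 0.0, 2.0e9, 0.7)
memory_bound = ips_ratio_after_scaling(80e6, 0.04, 2.0e9, 0.7)
print(compute_bound, memory_bound)
```

In this toy setting the compute-bound interval keeps about 70% of its IPS, while the memory-bound one keeps more than 80%, which is why a dynamic scheme can lower the frequency further than the static 30% cut on memory-bound phases.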
B. Energy Savings
We perform the experiments on 4 and 8 core workloads both with strict IPS targets and 30% relaxed targets. This means that in the former case, no performance degradation is allowed on any application while the latter allows a bounded reduction of 30% of the IPS for all the applications.
Two sets of simulations are done for each RMA: one with idealistic assumptions, to show the potential with oracle performance and energy modeling while neglecting the RMA overheads, and the other for a realistic system that uses the analytical performance and energy models described in Section III with overheads added to the simulations. The overheads will be analyzed in detail later.
1) Strict IPS target: Figure 6 shows the energy savings of the two RMAs relative to the baseline. For each workload mix, we show four bars corresponding to, from left to right, Partitioning with oracle models, Partitioning with analytical models and overheads, Combined with oracle models, and Combined with analytical models and overheads. Overall, we can see that the Combined scheme has a large potential. Compared to Part.-Oracle, Combined-Oracle manages to save substantially more energy. On a four-core system the saving is, on average, 6.6% versus 2.5%, whereas on an eight-core system it is 5.7% versus 2.6%. On a realistic system, with the performance and energy models proposed, the energy saving differences are similar: 4.4% versus 2.4% on a four-core system and 4.7% versus 2.7% on an eight-core system.
We now analyze the findings in more detail. First, the effect of Partitioning is limited to workloads that are a mix of cache sensitive (A or C) and cache insensitive (B or D) applications. This is not the case for the Combined RMA because it has a secondary dimension of flexibility, frequency variation. In the workloads that are all cache sensitive i.e. mixes of A or C, the Combined scheme shows a significant advantage over the Partitioning RMA. However, there is an exception in 4A4C where even the Combined method cannot find an efficient configuration.
Second, in the workloads that are all cache insensitive, i.e., mixes of B or D, none of the RMAs is very effective since a re-distribution of cache resources improves neither the performance nor the energy of any application. In fact, with limited modeling accuracy and considering the overheads, this may even lead to a small increase in the energy consumption. This could be avoided by disabling the RMA when such workload mixes are detected.
Third, the effect of modeling error is usually a reduction in energy saving and rarely a miss of the performance target. This is because the simple models used in this work are more likely to underestimate the performance. Hence, the RMA may miss some of the energy saving potentials to meet the QoS targets. This will be analyzed in detail later in this section.
Finally, workloads AAAA1, 8A and 8B show an interesting phenomenon. In these cases, modeling inaccuracy leads to a lower energy consumption while the performance is slightly higher for some applications. At some points during the execution, the RMA with a realistic model increases the speed of some applications compared to the ideal RMA. This affects the order of program phases that execute at the same time. This reordering creates new situations in which more efficient configurations can be found. These configurations are especially important in a workload containing only memory-intensive applications. The proposed RMA does not see this potential because, for generality, it does not use any temporal information such as offline profiles.
In summary, by providing an RMA that can allow coordinated control of voltage-frequency and cache partitioning across a multiprogrammed workload, significant energy savings are enabled without any performance cost.
2) Relaxed IPS target: Even though there are energy saving possibilities by optimizing the resource trade-offs between applications, the total amount of saving is limited without trading off performance. If the user can tolerate a bounded reduction of performance, further energy savings become possible. Figure 7 shows the energy savings for 4-core (top) and 8-core (bottom) workload mixes when the (IPS) performance target is relaxed by 30% for all the applications. We now also consider the DVFS RMAs because it is now possible to lower the core frequencies without violating the relaxed QoS requirements.
Looking at the energy saving values in Figure 7, we can make the following observations. First, by using DVFS it is possible to save around 15% of energy. The static DVFS on average saves a few percent less than the dynamic DVFS technique. This reduction comes from the core-energy consumption. Second, in some cases, Partitioning has the potential to save up to 16% of energy. This corresponds to the energy consumption of memory accesses. However, it is not very effective in cache-insensitive workloads. Even in some mixes of cache-sensitive applications, like 4A4C or 8C, its effect is limited. Third, the Combined RMA outperforms the other three in all the cases since it can save energy in both the core and the memory. It has the potential to save up to 30% of energy, with an average around 20%. In summary, by relaxing the performance target, the energy savings can be increased. For partitioning alone and DVFS only, an average energy saving of 7% and 15% can be achieved, respectively. With coordinated control of LLC Partitioning and DVFS, as proposed in this paper, 20% of system energy can be saved on average.
C. Sensitivity to Modeling Accuracy
The low-overhead simplified performance and energy model used in our resource management scheme has limited accuracy. A performance model error may lead to a violation of QoS targets. However, that happens only if the model overestimates the IPS at a certain configuration, and that configuration passes all the layers of pruning described in Section III and is selected as the best configuration. Even in that case, if a configuration with underestimated IPS is selected during the following phases, it may cancel the effect of the prior IPS target violation. The simulations with strict performance targets include 160 application executions for 4- and 8-core workloads. Out of these cases, fewer than ten under-perform compared to the baseline, and only by a few percent. The worst case is libquantum with 5.1% and 6.2% longer execution time in BBBB2 and 4A4B, respectively. The 160 application executions with the relaxed target had even better results, with only a few cases that violate the execution time target, each by less than 3%. This shows the resilience of the proposed scheme to modeling inaccuracy.
The energy model inaccuracy only affects the amount of energy saving achieved by the RMA. This effect is already analyzed in Figures 6 and 7 when comparing the ideal results using oracle models with realistic results using the proposed energy model. Further details of the model accuracy evaluation are not presented here because of space limitation.
D. Impact of Overheads
The discussed resource management schemes add overheads to the system in three steps: i) collecting the required statistics, ii) finding the optimal configuration and iii) enforcing the new configuration.
Reading the performance counter values has negligible overhead in a monitoring interval of 100M instructions. However, the additional instructions that need to be executed for each RMA impose timing and energy overheads. The exact values of these overheads depend on the system configuration at each point in time. Therefore, we evaluate the overhead as the number of instructions executed by the RMA relative to the instruction count of the program execution. We pessimistically assume that the RMA is invoked at every monitoring interval.
In order to evaluate the instruction count overheads, we implement the proposed RMA as presented in Section III in the C programming language along with the two other RMAs. We then compile and execute each RMA software implementation and measure the number of executed instructions. The results are summarized in Table IV for different numbers of cores and values of the W_max parameter. As explained earlier, this parameter indicates an upper bound on the LLC allocation to each core. We assume systems with eight LLC cache ways per core. For example, the associativity of a 4-core system is 32 and each core can get up to 29 ways (32 minus the number of other cores). In the simulation results presented in Section V-B, this parameter is limited to 32 even for the 8-core system, which is a pessimistic assumption for energy-saving possibilities.
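The bound on the per-core allocation can be illustrated with a short Python sketch. It is not the authors' C implementation; the function name and the optional cap argument are hypothetical, and it assumes, as the "32 minus the number of other cores" example suggests, that every other core keeps at least one way.

```python
from typing import Optional


def max_ways_per_core(num_cores: int,
                      ways_per_core: int = 8,
                      w_max_cap: Optional[int] = None) -> int:
    """Upper bound on the LLC ways a single core can be allocated.

    Assumption (illustrative): every other core keeps at least one way,
    so the bound is the total associativity minus the other cores.
    """
    associativity = num_cores * ways_per_core
    upper_bound = associativity - (num_cores - 1)
    # An explicit cap models limiting W_max below the structural bound.
    if w_max_cap is not None:
        upper_bound = min(upper_bound, w_max_cap)
    return upper_bound


four_core_bound = max_ways_per_core(4)                 # 32 - 3 other cores
eight_core_capped = max_ways_per_core(8, w_max_cap=32)  # capped as in Section V-B
print(four_core_bound, eight_core_capped)
```

Under these assumptions the 4-core case reproduces the 29 ways quoted in the text, while the 8-core case would allow 57 ways structurally and is reduced to 32 by the cap.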
We make the following observations from Table IV. First, the overhead of the algorithm is mostly affected by W_max and not by the number of cores because it determines the amount of computations in the reduction function described in Algorithm 1. This shows that the proposed scheme is scalable with the number of cores. Second, the overhead of the Combined RMA is very close to that of the Partitioning RMA. This shows that the proposed scheme manages to keep the overhead component bounded despite adding another dimension to the configuration space. The overheads of the RMA are added to the simulation results for the core that experiences a phase change.
Finally, when the RMA decides to change the system configuration, there is the overhead of performing DVFS and re-partitioning the LLC. For the DVFS overhead, we assume 15 μs and 3 μJ as reported in [37] for the Samsung Exynos 4210. The impact of the DVFS overhead is minimal. For example, if the clock frequency is set to 2 GHz and the average IPC is 2, a 100M instruction interval takes 25 ms. In that case, even if the frequency is scaled at every interval, it will add 0.06% to the timing overhead. Both the timing and energy overheads of DVFS are added to the simulation results whenever the RMA chooses a new frequency for each core. Re-partitioning of LLC is limited to modifying a few bit-masks for each core and has negligible overhead.
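The arithmetic behind the DVFS timing estimate can be checked with a few lines of Python. This is a minimal sketch using only the numbers quoted above (2 GHz, IPC of 2, 100M-instruction interval, 15 μs per transition); the function names are illustrative.

```python
def interval_time_s(instructions: float, ipc: float, freq_hz: float) -> float:
    """Wall-clock time of one monitoring interval: cycles divided by frequency."""
    cycles = instructions / ipc
    return cycles / freq_hz


def dvfs_timing_overhead(transition_s: float, instructions: float,
                         ipc: float, freq_hz: float) -> float:
    """Fraction of the interval spent on one DVFS transition."""
    return transition_s / interval_time_s(instructions, ipc, freq_hz)


interval = interval_time_s(100e6, ipc=2.0, freq_hz=2e9)
overhead = dvfs_timing_overhead(15e-6, 100e6, ipc=2.0, freq_hz=2e9)
print(f"interval = {interval * 1e3:.1f} ms, overhead = {overhead * 100:.2f} %")
```

With one transition per interval, the 15 μs transition is divided by the 25 ms interval, which gives the 0.06% figure. A lower IPC or clock frequency lengthens the interval and makes the relative overhead smaller still.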
After re-partitioning, the data movement in LLC happens according to the memory access patterns of applications. The application that receives an additional cache way will gradually replace the data of the previous owner during execution. In our case study, each LLC way contains 256 KB, which consists of 4K cache blocks (see Table I). Assuming that all of these blocks will be filled with new data by the new owner over an interval of 100M instructions, it will cause an additional MPKI of 0.04, which is negligible compared to the MPKI of memory-intensive applications. Many of these misses may overlap with other misses and do not cause a timing overhead. In reality, a re-configuration happens only after several intervals, when the program experiences a phase change, which further diminishes these overheads.
VI. RELATED WORK
Previous attempts to control on-chip resources to enforce QoS constraints on applications include a wide range of types of resources and configuration methods. Adding QoS requirements for the applications has a profound impact on the resource-management approach compared to works that do not take QoS into account [3]-[6]. A common QoS workload usually consists of a mix of one latency critical (LC) application with strong performance constraints and other best effort (BE) applications [12]-[16], [19]. In such cases, the focus is typically to improve the BE applications' execution while providing guaranteed minimum resources for the LC applications. Therefore, the number of LC applications that can run on such a system is very limited and resource optimization is fundamentally dependent on the BE applications. On the other hand, when using DVFS to enforce QoS, energy efficiency can be improved for the LC application [7]-[11], [20]. However, this prior art does not consider cache partitioning among multiple applications as we do in this paper.
Intel Speed-Shift technology [38] is an example of recent DVFS techniques implemented in the Skylake architecture. Compared to the previous SpeedStep technology, which is managed in software, Speed-Shift is managed by the processor, which enables fast and fine-grained control over its voltage-frequency states. Unlike our work, Speed-Shift is oblivious to QoS requirements, taking into account only processor utilization. Adding QoS to Speed-Shift is an interesting direction to be considered in future work.
In [22], [23], cache partitioning is used in the proposed solutions, but only to minimize the cache misses without taking the application QoS constraints into account. The DVFS controller is responsible for enforcing QoS constraints for workloads where all applications have QoS constraints. Such an approach is sub-optimal and may even lead to QoS violations, since the LLC partitioning controller evaluates the effect of its decisions on neither the QoS constraints nor the energy consumption of each application.
A centralized controller to explore a multi-dimensional design space of different resources is necessary to find the most efficient system configuration. However, the complexity and overhead of such a controller are a serious challenge for online resource management. Many of the previous proposals avoid this issue by breaking the control mechanism into independent controllers for different resources [19], [22] or different applications [4], [5], [15]. The work in [23] proposes independent controllers for different resources, applications, and even objectives. However, such methods cannot be as efficient as a centralized controller managing several resources because the configuration space of each local controller is limited. There have been attempts to come up with coordinated management of multiple resources based on machine learning [3], [6]. The downside of such methods is that they do not provide enough accuracy when applications enter new computation phases. Furthermore, they depend on expensive online learning processes that are not fast enough to react to frequent application phase changes in multiple concurrently executing applications.
In contrast, in this work, we present a centralized controller that manages multiple resources, different objectives, and different applications in a coordinated fashion to maximize efficiency. We significantly reduce the complexity by intelligently pruning the sections of the design space that lead to inferior results. This method uses statistics from HW performance counters and ATD to model a wide range of resource allocations in a single interval. Such an approach is fast enough to deal with frequent phase changes of applications and provides sufficient accuracy at the new phases.
VII. CONCLUDING REMARKS
This paper presents, for the first time, an online Resource Management Algorithm (RMA) that finds the most efficient configuration, at each program phase, to minimize two important processor energy components, namely core energy-per-instruction and DRAM memory access, using a coordinated controller for DVFS and Last-Level Cache Partitioning. It uses a model-predictive approach to establish the effect of different configurations on both performance and energy by collecting statistics from hardware performance counters with no need for any profiling, training or prior knowledge about the detailed run-time behavior of programs. The RMA is implemented in software with appropriate hardware support and invoked when a program phase change is detected. To keep the run-time overhead negligible, the RMA uses a heuristic algorithm that performs configuration-space exploration in polynomial time.
Our experimental evaluation shows that our combined approach, using coordinated DVFS and cache partitioning, is more effective in saving energy than independent DVFS and cache partitioning RMAs. In addition, the overhead of invoking the RMA at phase changes has a negligible impact on the energy savings. The energy savings, when the performance target is the same as the baseline system, can potentially be as high as 12% and on average 4%. However, when the performance target is 70% of the baseline system, the proposed RMA is capable of saving, on average, 20% energy as compared to 15% for DVFS alone and 7% for cache partitioning alone.
