Abstract: Shared resource contention has been a major performance issue for CMPs. In this paper, we tackle the power contention problem in power constrained CMPs by considering and treating power as a first-class shared resource. Power contention occurs when multiple processes compete for power, and leads to degraded system performance. In order to solve this problem, we develop a shared resource contention-aware scheduling algorithm that mitigates the contention for power and the shared memory subsystem at the same time. The proposed scheduler improves system performance by balancing the shared resource usage among scheduling groups. Evaluation results across a variety of multiprogrammed workloads show performance improvements over a state-of-the-art scheduling policy which only considers memory subsystem contention.
INTRODUCTION
Shared resource contention is known as one of the major performance issues in chip multiprocessors (CMPs). Since performance degrades when multiple processes compete for the shared hardware resources of CMPs, various attempts have been made in both hardware [1], [2] and software [3], [4], [5] to mitigate its performance impact.
In this paper, we argue that, with the emergence of power capping mechanisms [6], power should also be treated as a first-class shared resource for CMPs. Power capping is realized by the power management system via power-performance knobs such as DVFS (dynamic voltage and frequency scaling). Therefore, executing a program (or programs) with relatively high power consumption on power-capped CMPs can cause "power contention", which causes the power management system to throttle the processor and results in degraded system performance.
We enhance the power management system in the context of scheduling: processes with mutually exclusive shared resource needs (including power) are co-scheduled such that shared resource utilization is balanced, which mitigates contention and hence improves system performance. A recent study also takes power into account to provide QoS for latency-sensitive services in datacenters [7]. The major difference in how we deal with power is that we leverage scheduling to address both CPU and DRAM power contention, whereas the related work addresses only CPU power using DVFS. Fig. 1 shows the performance when six copies of sjeng (from SPEC CPU2006 [8]) and six copies of the CPU power-hungry microbenchmark CPU-hog are executed on a six-core Intel Sandy Bridge processor. The unbalanced scheduler executes six sjeng after six CPU-hog, whereas the balanced scheduler pairs three sjeng with three CPU-hog and executes the resulting groups one after the other. The figure highlights two important points: (1) power contention noticeably degrades performance and (2) power contention-aware scheduling has the potential to mitigate this degradation.
The left figure shows the performance without a power cap. Since both sjeng and CPU-hog are CPU-intensive, there is no contention at the memory subsystem (or "cache-memory contention") regardless of the scheduling policy. Therefore, the performance of the two policies is the same. When a CPU power cap is applied, the scheduling policy matters. Under the unbalanced policy, the performance of CPU-hog degrades by 19.9 percent due to power contention while that of sjeng remains constant, because CPU-hog consumes more power than sjeng. The balanced policy succeeds in mitigating the power contention: both sjeng and CPU-hog show a slowdown of only 1.0 percent, which results in a 10.6 percent total performance improvement. This simple experiment shows the need for power contention-aware scheduling, which we further explore in this paper.
In the rest of the paper, we develop and evaluate a scheduling algorithm that mitigates the performance issue of resource contention by balancing the utilization of the shared resources. The key idea of the proposed approach is to treat both CPU and DRAM power consumption as first-class shared resources, just like the memory subsystem.
IMPACT OF POWER CONTENTION
This section explores the performance of various multiprogrammed workloads in order to understand the impact due to power contention.
Experimental Setup
Platform: We perform experiments on an Intel Xeon E5-2620 Sandy Bridge 2.0 GHz 6-core CMP with 16 GB of main memory. Simultaneous multithreading (SMT; also known as Hyper-Threading) and dynamic overclocking (Turbo Boost) are disabled. The Linux governor is set to 'performance' to avoid non-deterministic performance variance.
Workloads: We use SPEC CPU2006 benchmark suite with reference inputs for our study. The number of co-scheduled copies of the same instance of a benchmark is varied from 1 to 6 and the average performance results are reported, unless otherwise specified. In order to explore the performance impact due to power contention at the CPU or DRAM, we apply different levels of power caps to either domain.
Results
Figs. 2, 3 and 4 show performance characteristics due to power contention (and also inevitably, cache-memory contention) for some SPEC benchmarks (not all of them are shown due to limited space). Results of CPU-intensive benchmarks while varying the CPU power cap are presented in caps as expected. However, we can see different degrees of performance degradation because the power consumption of each program is different.
For the CPU power contention (Fig. 3), we see up to almost 70 percent performance degradation under tight constraints for programs with higher cache-memory contention. Fig. 4 shows different trends from Figs. 2 and 3. Some benchmarks (perlbench and gcc) show no or only modest performance degradation with tighter DRAM power caps, whereas others (bwaves, GemsFDTD, libquantum and lbm) show significant performance degradation, as we have also seen in the CPU power contention scenario. This is because a benchmark being memory-intensive from a performance perspective does not necessarily mean that its DRAM power contention is severe.
We conclude that CPU and DRAM power contention can be major performance bottlenecks when reasonable power caps are applied to the system. In addition, we see that the performance impact of power contention is hard to predict, especially when cache-memory contention and power contention occur simultaneously, which can happen in a realistic scenario where the system needs to schedule a mix of different workloads. This motivates us to develop a scheduling algorithm that mitigates both cache-memory contention and power contention.
CONTENTION-AWARE CO-SCHEDULING
Basic Idea
The basic idea of the proposed algorithm is to combine programs stressing mutually exclusive shared resources in order to avoid severe contention. The scheduler organizes the tasks in CPU-local run queues, divides time into epochs, and executes each scheduling group, composed of tasks with different characteristics, in a round-robin manner to ensure fairness among scheduling groups. Since the degree of resource contention depends on the execution phases of each program, the combination of programs is dynamically controlled.
Estimating Shared Resource Usage
In order to balance the shared resource utilization among scheduling groups, the scheduler needs to be capable of estimating the shared resource usage of each program. We collect statistics from hardware performance monitoring units (PMUs) to estimate the shared resource usage, represent them as an activity vector [9] and utilize them to make the co-scheduling decisions.
Usage of memory subsystems: Many previous works have shown that the amount of memory pressure programs generate is highly correlated with last level cache (LLC) miss statistics [2], [10], [11]. We use the number of LLC misses per second as a proxy to estimate the amount of memory pressure each program generates.
Usage of power: For power usage we estimate the actual CPU and DRAM power consumption of each program using a statistical model. We trained the model using the power consumption values measured via RAPL counters as inputs. The training data is obtained by running SPEC CPU2006 benchmarks without a power cap. We collected the PMU data every 400 ms, and each benchmark is executed six times with number of co-scheduled instances varied from 1 to 6.
We model the CPU and DRAM power consumption of the platform with multiple input variables. We use five PMUs in our study: instructions committed, cycles, branch prediction misses, L1 cache accesses, and LLC misses. LLC misses is also used to estimate memory subsystem usage. The power consumption P is expressed as:
$$P = \sum_{i} \left( f \times E_i \times W_i \right) + C$$

where $f$ is the average clock frequency within an epoch. Note that $f$ is only applied for CPU power estimation and not for DRAM. $W_i$ is the coefficient of each component and $E_i$ is its event count (e.g., the number of LLC misses). The term $f \times E_i \times W_i$ represents the dynamic power consumption that component $i$ contributes, and $C$ represents a constant value which is mostly static power consumption.
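The model above can be evaluated with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function name `estimate_power`, the event names, and every numeric coefficient and count are made-up placeholders chosen only to make the arithmetic easy to follow.

```python
# Illustrative only: all names and numbers here are hypothetical
# placeholders, not coefficients fitted in the paper.

EVENTS = ("instructions", "cycles", "branch_misses", "l1_accesses", "llc_misses")


def estimate_power(event_counts, weights, constant, freq_ghz=None):
    """Evaluate P = sum_i(f * E_i * W_i) + C for one epoch.

    freq_ghz is the average clock frequency f. Pass None for the DRAM
    model, where f is not applied (treated as a factor of 1).
    """
    f = 1.0 if freq_ghz is None else freq_ghz
    dynamic = sum(f * event_counts[name] * weights[name] for name in EVENTS)
    return dynamic + constant


# Hypothetical per-epoch event counts E_i (billions of events).
counts = {"instructions": 1.2, "cycles": 0.8, "branch_misses": 0.01,
          "l1_accesses": 0.5, "llc_misses": 0.002}
# Hypothetical coefficients W_i.
cpu_weights = {"instructions": 2.0, "cycles": 1.5, "branch_misses": 10.0,
               "l1_accesses": 1.0, "llc_misses": 50.0}

# CPU estimate with f = 2.0 GHz and a hypothetical static term C = 12 W.
cpu_power = estimate_power(counts, cpu_weights, constant=12.0, freq_ghz=2.0)
```

A DRAM estimate would use the same function with its own coefficients and `freq_ghz=None`.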
Proposed Co-Scheduling Algorithm
An activity vector (which is per process and one-dimensional) represents to what degree a running process utilizes the shared resources we are interested in. The size of the vector equals the number of resources we consider, which is three: (1) memory subsystem, (2) CPU power and (3) DRAM power. The elements of the vector are normalized to the maximum utilization the respective resource can consume. The maximum usage amount of the memory subsystem is experimentally obtained using a manually written microbenchmark. For the power consumption, the elements are normalized to the CPU and DRAM power caps.
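As a rough illustration of the normalization described above, the sketch below builds a three-element activity vector for one process. The type `ActivityVector`, the function `make_activity_vector`, and all numeric values (maximum miss rate, power caps, measurements) are assumptions made for this example only.

```python
from typing import NamedTuple


class ActivityVector(NamedTuple):
    memory: float      # LLC misses/s relative to the measured maximum
    cpu_power: float   # estimated CPU power relative to the CPU power cap
    dram_power: float  # estimated DRAM power relative to the DRAM power cap


def make_activity_vector(llc_misses_per_s, cpu_watts, dram_watts,
                         max_llc_misses_per_s, cpu_cap_watts, dram_cap_watts):
    """Normalize each raw usage figure to the maximum of its resource."""
    return ActivityVector(
        memory=llc_misses_per_s / max_llc_misses_per_s,
        cpu_power=cpu_watts / cpu_cap_watts,
        dram_power=dram_watts / dram_cap_watts,
    )


# Hypothetical measurements and limits for a single process.
vec = make_activity_vector(
    llc_misses_per_s=2.0e7, cpu_watts=6.0, dram_watts=1.5,
    max_llc_misses_per_s=8.0e7, cpu_cap_watts=40.0, dram_cap_watts=10.0,
)
```

Because every element is a ratio to its own maximum, the three resources become comparable even though their raw units differ.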
The goal of our scheduling algorithm is to balance the usage of each shared resource among scheduling groups, which can be achieved by having programs with diverse characteristics in each group (recall the balanced co-scheduling example in Fig. 1). We use an algorithm similar to vector balancing [3]: at the end of each epoch, the scheduler calculates varsum for each scheduling group, which is the sum of the variances of all vector elements; it then searches for a co-schedule that increases the minimum varsum over all the scheduling groups by calculating varsum for all possible co-schedules. Varsum is defined as:
$$\mathit{varsum} = \sum_{i=1}^{N} \operatorname{Var}\left(E_{i1}, E_{i2}, \ldots, E_{iM}\right)$$

where $E_{ij}$ is the $i$th element (either memory pressure, CPU power consumption or DRAM power consumption) of the $j$th program, $N$ is the number of shared resources, which is three in our study, and $M$ is the number of programs in a scheduling group, which is six on our platform.
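The search over co-schedules can be sketched as follows. This is one possible reading of the description, not the authors' code: it assumes population variance, assumes the number of programs is a multiple of the group size, and interprets "increases the minimum varsum" as picking the grouping whose smallest varsum is largest. The helper names and the four example vectors are hypothetical.

```python
from itertools import combinations
from statistics import pvariance


def varsum(group):
    """Sum, over vector elements, of the variance across the group's programs."""
    n_resources = len(group[0])
    return sum(pvariance([vec[i] for vec in group]) for i in range(n_resources))


def partitions(items, group_size):
    """Yield every split of items into unordered groups of group_size."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for mates in combinations(rest, group_size - 1):
        remaining = [x for x in rest if x not in mates]
        for tail in partitions(remaining, group_size):
            yield [(first,) + mates] + tail


def best_coschedule(vectors, group_size):
    """Return the grouping of program names that maximizes the minimum varsum."""
    names = sorted(vectors)
    return max(
        partitions(names, group_size),
        key=lambda groups: min(varsum([vectors[n] for n in g]) for g in groups),
    )


# Hypothetical activity vectors: (memory, CPU power, DRAM power).
vectors = {
    "cpu_hog_a": (0.05, 0.90, 0.10),
    "cpu_hog_b": (0.05, 0.85, 0.10),
    "mem_a": (0.80, 0.30, 0.70),
    "mem_b": (0.75, 0.35, 0.65),
}
best = best_coschedule(vectors, group_size=2)
```

In this toy input the search pairs each CPU-bound program with a memory-bound one, mirroring the balanced policy of Fig. 1. For 12 programs in groups of six there are only 462 distinct groupings, so exhaustive enumeration stays cheap at this scale.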
EVALUATION

Scheduler Implementation
We have implemented a prototype of the proposed scheduler as user-level software. The evaluation system runs Linux kernel 2.6.32, and the LIKWID toolset version 3.0 [12] is modified to allow periodic access to performance and power related PMUs. The Linux sched_setaffinity(2) API is used to control the CPU affinity of processes to realize the proposed algorithm. The scheduling epoch per scheduling group is set to 400 ms. For the CPU and DRAM power models, we have applied multivariate linear regression to periodic statistics obtained via executions of a single copy of all the evaluated SPEC benchmarks. We have confirmed that the model is accurate enough (adjusted R-squared > 0.9) for our purpose.
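To illustrate how a user-level scheduler can place the processes of a scheduling group on cores, the sketch below uses Python's `os.sched_setaffinity`, which wraps the same Linux system call. It is Linux-only, the helper `pin_group` is a hypothetical name, and it shows only the pinning step; how the prototype handles groups that are not currently running is not described here.

```python
import os


def pin_group(pids, cpus):
    """Pin the i-th process of a scheduling group to the i-th CPU.

    A pid of 0 refers to the calling process, as in sched_setaffinity(2).
    """
    for pid, cpu in zip(pids, cpus):
        os.sched_setaffinity(pid, {cpu})


if __name__ == "__main__":
    # Demonstration on the current process only: pin it to one of the
    # CPUs it may already use, then restore the original affinity mask.
    original = os.sched_getaffinity(0)
    target = min(original)
    pin_group([0], [target])
    print("pinned to", os.sched_getaffinity(0))
    os.sched_setaffinity(0, original)
```

Pinning one process per core keeps the number of runnable tasks per core at one, matching the six-programs-per-group setup on the six-core platform.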
Workloads and Evaluation Scenarios
Since it is projected that computer systems will continue to be power limited, we conduct evaluations under a relatively tight power constraint with CL3 and DL3. We use SPEC CPU2006 benchmarks and microbenchmarks that stress the CPU and memory to evaluate the proposed scheduler, which considers both cache-memory contention and power contention, against a state-of-the-art scheduler which only considers cache-memory contention. The counterpart is implemented in the same way as the proposed scheduler except that it calculates varsum using only the memory pressure (it neither estimates nor uses the CPU and DRAM power consumption). We also evaluate two schedulers for reference: a scheduler that only considers power contention, and a random scheduler. We report the harmonic speedup [13] to discuss performance.
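For readers unfamiliar with the metric, the sketch below computes a harmonic speedup under the commonly used definition: the harmonic mean of per-program speedups, where each speedup is the solo execution time divided by the co-scheduled execution time. That definition, the function name, and the timings are assumptions for illustration; the baseline used in the evaluation is not restated in this section.

```python
def harmonic_speedup(alone_times, shared_times):
    """Harmonic mean of per-program speedups (T_alone / T_shared)."""
    if len(alone_times) != len(shared_times) or not alone_times:
        raise ValueError("need one solo and one shared time per program")
    # The reciprocal of each speedup is the program's slowdown.
    slowdowns = [shared / alone
                 for alone, shared in zip(alone_times, shared_times)]
    return len(slowdowns) / sum(slowdowns)


# Hypothetical execution times in seconds for two programs.
hs = harmonic_speedup(alone_times=[100.0, 200.0], shared_times=[125.0, 400.0])
```

The harmonic mean penalizes workloads in which one program is slowed down much more than the others, so it reflects both throughput and fairness.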
Evaluating with all the possible combinations of 12 benchmarks is not feasible due to the vast exploration space. Therefore, we use 30 workloads composed of 12 randomly-selected benchmarks. Workloads are sorted in increasing order of their performance improvements of the proposed scheduler as we will see later.
The evaluation methodology we use is similar to the one originally proposed for SMT scheduling [14], which a number of studies have since adopted [15]. To account for the variation in execution times among programs, we immediately restart a program when it finishes execution, until every process has run to completion at least once. This means that we keep the number of threads constant during the whole evaluation.
Evaluation Results
WL11 shows a corner-case scenario, where the power contention-aware scheduler performs 6.1 percent better than the proposed scheduler. This workload shows relatively stable behavior with high CPU power contention and modest cache-memory contention. Here the proposed scheduler balances not only the power contention but also the cache-memory contention, which is less of a bottleneck. By doing so, the CPU power utilization becomes slightly unbalanced compared to that of the power contention-aware scheduler, which results in degraded performance. Nonetheless, the proposed scheduler still performs better than the cache-memory contention-aware scheduler.
Another interesting observation concerns the two rightmost workloads (WL29 and WL30), where the proposed scheduler greatly outperforms the cache-memory contention-aware scheduler. For these two workloads, the power contention-aware scheduler also shows more than 15 percent improvement, which clearly indicates that most of the performance boost comes from the mitigated power contention.
CONCLUSION
As the ability to cap the power consumption of CMPs becomes increasingly critical in modern computer system designs, power contention, which can lead to significant performance degradation, becomes a critical problem. As we have seen in Section 2, power contention is a serious issue, and we believe it opens up new research opportunities in this area. Our work tackles the problem by leveraging scheduling as an optimization knob. We have shown that our scheduling algorithm, which considers power as a first-class shared resource and co-schedules programs with different characteristics, improves system performance over a state-of-the-art cache-memory contention-aware scheduler.
