When two or more programs are co-scheduled on the same multicore computer they might experience a slowdown due to the limited off-chip memory bandwidth. According to our measurements, this slowdown does not depend on the total bandwidth use in a simple way. In particular, we observe that a higher memory bandwidth usage does not always lead to a larger slowdown. This means that relying on bandwidth usage as input to a job scheduler might cause suboptimal scheduling of processes on multicore nodes in clusters, clouds, and grids. To guide scheduling decisions, we instead propose a slowdown based characterization approach. Real slowdowns are costly to measure due to the exponential number of experiments needed. Thus, we present a novel method for estimating the slowdown programs will experience when co-scheduled on the same computer. We evaluate the method by comparing its predictions with real slowdown data and with the commonly used memory bandwidth based method. This study shows that a scheduler relying on slowdown based categorization makes fewer incorrect co-scheduling choices, and that the negative impact on program execution times is smaller than when using a bandwidth based categorization method.
Introduction
A common challenge shared by cluster, cloud, and grid computing services is the limited off-chip memory bandwidth of modern multicore processors. The memory bus can easily become a bottleneck even at a low level of utilization, which makes it a key factor for achieving high performance. The consequences of the memory bottleneck are increased execution times and less than optimal throughput. This in turn leads to, for example, suboptimal resource usage and higher energy consumption. The problem is expected to grow, as the number of cores currently increases faster than the pin count for chipset packaging.
One way to overcome, or at least minimize, the impact of the limited off-chip bandwidth is to use memory contention aware job scheduling [1] [2] [3] [4]. While a traditional cluster or cloud job scheduler tries to ensure a fair distribution of CPU time between processes, a memory aware job scheduler allocates cores to jobs so as to minimize the co-schedule degradations (slowdowns) that occur from memory bandwidth contention. Tang et al. [2] have observed benefits of up to 40% when using memory aware scheduling. They also estimate that a performance improvement of even 1% can result in millions of dollars saved.
Successful job scheduling depends on two components: a scheduling algorithm that decides which jobs are to be co-scheduled, and a program characterization method that provides the input to the scheduling algorithm. Previous work within the area typically aims to co-schedule programs based on one of two main policies: (1) to distribute the memory bandwidth usage as evenly as possible between all available nodes [5], or (2) to fill each node as close to its peak memory bandwidth as possible [3, 6].
Regardless of the policy or algorithm used, the co-scheduling decisions are only as good as the program characterization method used. Therefore, in this paper we first highlight the drawbacks associated with using a memory bandwidth based method which, to our knowledge, is currently the most widely proposed method. We then propose a new slowdown based characterization method. Our proposed method combines the use of hardware counters [7] and memory traffic generation [8] in order to predict the actual slowdown (performance degradation) a program will experience when co-scheduled with another program.
To verify the accuracy of the proposed characterization method we compare its estimated performance degradations both with a memory bandwidth characterization method and with real world slowdown measurements. Although our focus is on memory bus contention, we use two different scenarios, one where the cores have separate last-level caches (LLC) and one where they share the last-level cache, to also assess the impact of cache memory contention. We find that when using the slowdown based characterization method only 2 out of 20 estimates deviate 10% or more when using separate caches, while 4 out of 20 predictions deviate 10% or more when using a shared cache. In general we find that, for both methods, the predictions are less accurate when co-scheduling programs using a shared cache.
In the next section, Section 2, the background of memory bandwidth aware scheduling is presented before we evaluate the correlation between memory bandwidth use and actual slowdowns in Section 3. The new slowdown based characterization method is described in Section 4. Section 5 describes the method used when comparing the two characterization methods as well as the results of the comparison. Finally the paper is concluded in Section 6.
Memory Bandwidth Aware Scheduling
Job scheduling is a highly active research and development area within the cluster and cloud communities. One of the larger scheduling challenges today is to minimize the impact of the limited memory bus. The problem of determining which programs to co-schedule is non-trivial, and with more than two co-scheduled programs on each node the problem becomes NP-complete [9]. Today's industrial clusters and grids use nodes with 4 to 128 cores and may have 25 or more different programs running from almost as many software vendors; i.e., many programs are closed-source black-box software. Even worse, a typical cloud environment has at least an order of magnitude more programs, thousands of different customers and unknown software vendors. Current state-of-the-art memory-aware job scheduling [1, 2] tries to minimize memory contention based on a program's memory usage. A job scheduler consists of two parts: a characterization method for identifying the characteristics of the programs, and a scheduling algorithm which allocates nodes/cores to the different programs based on said characterization, in an effort to minimize execution times and maximize throughput. Scheduling policies that try to minimize performance degradation due to memory contention can be found in [1, 4, 5, 10].
A commonly used characterization method is based on the program's memory bandwidth usage, which is often obtained by simulation or by using hardware counters to record the last-level cache misses or memory bus transactions during a solo execution of the program. Other methods that have been suggested primarily characterize the cache usage as an indirect measure of the memory bus contention and the resulting slowdowns. These characterization methods are based on stack distance profiles (FOE, SDC, PAIN) [11] or a combination of last-level cache accesses, misses and the number of ways used (animal classes) [5]. However, most cache contention methods found in the literature concern the characterization of programs as input to an operating system scheduler [5, 10] and not the macro level for co-scheduling in a cluster, grid or cloud environment. Since our target architecture has cores which operate with both separate and shared last-level caches, this paper focuses on the memory bandwidth methods and disregards the cache based metrics. Using cache contention based characterization methods is clearly misleading when co-scheduling processes on cores with separate cache hierarchies.
The most commonly cited way of obtaining the memory bus usage of a program is to record the number of LLC misses. This does not give an exact figure for the memory bandwidth usage, since it does not include all memory bus transactions, such as prefetching or read before write for cache line evictions. Because of these drawbacks of counting LLC misses, we use the bus_trans_mem performance counter, which is the other widely used and more direct method for determining the memory bandwidth usage of the programs. The bus_trans_mem.self counter returns the number of 64-byte (L2 cache line sized) transfers performed. This includes burst transactions, partial reads and writes, invalid transactions and cache line fills due to prefetching. Hence the program's total bandwidth usage can easily be calculated.
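As a rough illustration of that calculation, the following Python sketch converts a transfer count into an average bandwidth. It assumes only what is stated above, namely that each counted transfer moves 64 bytes; the function name, the sample count and the run time are made up for illustration and are not measurements from this study.

```python
CACHE_LINE_BYTES = 64  # size of one counted bus transfer (one L2 cache line)


def bandwidth_mb_per_s(transfers: int, elapsed_seconds: float) -> float:
    """Average memory bandwidth in MB/s (1 MB = 10**6 bytes) over a run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return transfers * CACHE_LINE_BYTES / elapsed_seconds / 1e6


# Hypothetical solo run: 1.5e9 transfers counted over 100 seconds.
example = bandwidth_mb_per_s(1_500_000_000, 100.0)
print(f"{example:.0f} MB/s")
```

Dividing the result by the maximum sustainable bandwidth of the machine gives the fraction of the bus that the program occupies.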
Correlation between Memory Bandwidth and Slowdown
In this section we examine the correlation between measured memory bandwidths and real slowdowns using a set of benchmarks (Table 1) . Memory access performance is tightly coupled to a processor's cache structure. The cache structure, or sharing thereof, most definitely affects the outcome of almost all memory bus measurements. We used a processor with four cores and a 2-way split second-level (L2) cache architecture (see Figure 1) , namely, the Intel Core 2 Quad processor Q9550. In the Q9550 two cores (pair 0) share one L2 cache and the other two cores (pair 1) share the second L2 cache. The split cache architecture of the Q9550 allowed us to perform comparable measurements of two competing programs both with separate and with shared caches. The evaluation is split into two scenarios:
Scenario 1: Separate last-level (L2) caches with no cache sharing. Only the memory bus is shared.
Scenario 2: Shared last-level (L2) cache with L2 cache sharing and memory bus sharing.
The specifications of the computer used in our experiments are shown in Table 2. It is equipped with 8 GB of RAM, which is accessed through a shared front-side bus architecture. All four cores in the Q9550 are equipped with local first-level (L1) instruction and data caches, and each pair of cores shares one of the two 6 MB second-level (L2) caches. The GNU/Linux CentOS 5.5 x86_64 distribution is installed on the system and everything is compiled with the GNU C Compiler (gcc) version 4.1.2.
The theoretical memory bus bandwidth of the system is ∼10 GB/s, but none of the benchmarks nor STREAM [12] managed to sustain a memory bandwidth over ∼6.9 GB/s. Hence, in this study we will refer to 6.9 GB/s as the maximum sustainable bandwidth.
The comparison was performed using four benchmarks from the NAS Parallel Benchmark (NPB) suite [13], namely CG, EP, FT and IS. The benchmarks and their characteristics are described in Table 1. According to Koukis and Koziris [4], CG puts a heavy load on the memory subsystem while FT and LU place a medium load and EP generates a low load. We used PAPI [7] and the bus_trans_mem.self performance counter to evaluate the memory bandwidth usage of the benchmarks.
The real slowdown was determined experimentally by co-scheduling the programs two by two on different cores, while recording the execution times. This was done for all possible combinations, including the co-scheduling of two instances of the same program. We ran the interfering process repeatedly to make sure that the co-scheduling was active during the full run of the measured process.
Scenario 1: Separate Last-Level (L2) Caches
The measurements in Scenario 1 were performed by co-scheduling two programs, locking them to cores that do not share the L2 cache. The first program executed on a core connected to the first L2 cache and the second program executed on a core connected to the second L2 cache. Hence, the programs have their own private L1 and L2 caches and the only resources that are shared are the memory bus and the memory controller. Table 3 contains the execution times of the programs when executed alone as well as when co-scheduled. Each row in the table represents a program and the execution time it experienced when co-scheduled with the program listed at the top of the column. As can be seen in Table 3, when two CG processes are co-scheduled they both experience a 12% slowdown. In this case the memory bus is almost full: 6.8 GB/s (2*3.4 GB/s) of 6.9 GB/s is utilized. However, when two FT processes are co-scheduled, utilizing only 2 GB/s, they experience a 6% slowdown. This illustrates the non-uniform relation between bandwidth usage and actual slowdown. Another example is the case when a CG and an FT process share the memory bus. The CG process experiences a slowdown of 4% while the slowdown for FT is 8%, and in this case the bus usage is approximately 4.5 GB/s. The average slowdown is thus 6%, which is the same as in the case of the two co-scheduled FT processes. However, the aggregated memory bus usage is 125% (2.5 GB/s) higher for CG,FT than for FT,FT.
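The arithmetic behind this comparison can be retraced with a few lines of Python. The sketch below uses only the aggregate bandwidths and per-program slowdowns quoted in the paragraph above (not the full Table 3); the dictionary layout and function names are illustrative choices.

```python
# Aggregate bandwidth (GB/s) and per-program slowdowns (%) as quoted in the text.
pairs = {
    ("CG", "CG"): {"bandwidth": 6.8, "slowdowns": (12, 12)},
    ("FT", "FT"): {"bandwidth": 2.0, "slowdowns": (6, 6)},
    ("CG", "FT"): {"bandwidth": 4.5, "slowdowns": (4, 8)},
}


def average_slowdown(pair):
    """Mean slowdown (%) over the programs in a co-scheduled pair."""
    s = pairs[pair]["slowdowns"]
    return sum(s) / len(s)


def relative_bandwidth_increase(pair, baseline):
    """Percent by which `pair` uses more aggregate bandwidth than `baseline`."""
    base = pairs[baseline]["bandwidth"]
    return 100.0 * (pairs[pair]["bandwidth"] - base) / base


for pair, data in pairs.items():
    print(pair, f"{data['bandwidth']} GB/s", f"{average_slowdown(pair):.0f}%")
print(relative_bandwidth_increase(("CG", "FT"), ("FT", "FT")))
```

CG,FT and FT,FT end up with the same average slowdown even though their aggregate bandwidths differ by more than a factor of two, which is the mismatch this section points out.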
We also find that some processes actually execute slightly faster when co-scheduled with another process than when run alone. For example, EP co-scheduled with FT executes for 102 seconds, which is less than the 103 seconds it takes alone. We have no good explanation for this behavior, but one reason might be the use of common shared libraries.
Scenario 2: Shared Last-Level (L2) Cache
The execution time results for Scenario 2, where the co-scheduled programs share the last-level cache as well as the memory bus, can be found in Table 4. The overall slowdown in Scenario 2 is higher than in Scenario 1. This is due to the sharing of the last-level cache. The slowdown for two co-scheduled CG processes has now increased from 12% to 27%. However, an important observation is that in both scenarios the memory bandwidth usage was measured as 3.4 GB/s per process, and the aggregated memory bus usage is calculated as 6.8 GB/s both with separate and with shared caches. Thus, in a heterogeneous environment, only looking at memory bandwidths does not differentiate between nodes with separate and shared caches.
When the co-scheduled programs share the last-level cache, the correlation between memory bandwidth usage and slowdown is even weaker than with separate caches. This indicates not only that cache contention leads to increased slowdowns, but also that the magnitude of the slowdown is affected by factors other than a program's bandwidth usage. Chandra et al. [10] show that the number of last-level cache misses typically increases for a co-scheduled program due to the sharing of the cache capacity, thus contributing to the slowdown.
Predicting Slowdown due to Memory Contention
The previous section showed that memory bandwidth usage and actual slowdown are poorly correlated for co-scheduled programs. The true slowdown a program experiences due to co-scheduling can only be determined by executing and timing all possible co-scheduling combinations. However, this brute-force method suffers from an exponential complexity in terms of the number of applications involved. We now introduce an alternative method, with lower complexity, to predict memory traffic slowdown effects. This approach avoids running all combinations of applications against each other. Instead, it characterizes each application one by one, running it a predetermined number of times. The proposed method is based on a suggestion in [8], where de Blanche et al. suggest that hardware counters in combination with a traffic generator can be used to measure a program's sensitivity to memory bus interference. This section further develops these ideas and presents a walkthrough example of how the methodology works.
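To give a feeling for the difference in measurement effort, the Python sketch below counts experiments under two simplifying assumptions that are ours rather than the paper's: a brute-force campaign times every distinct co-schedule of a given group size (the same program may appear more than once in a group), and per-program profiling needs one solo run plus one run per interference level. Both function names are hypothetical.

```python
from math import comb


def brute_force_runs(num_programs: int, group_size: int) -> int:
    """Number of distinct co-schedules of `group_size` programs,
    allowing several instances of the same program (multisets)."""
    return comb(num_programs + group_size - 1, group_size)


def profiling_runs(num_programs: int, load_levels: int) -> int:
    """One solo run plus one run per interference level for each program."""
    return num_programs * (1 + load_levels)


# Four programs co-scheduled in pairs, profiled at four load levels.
print(brute_force_runs(4, 2), profiling_runs(4, 4))

# 25 programs on four-core nodes: brute force grows quickly, profiling stays linear.
print(brute_force_runs(25, 4), profiling_runs(25, 4))
```

For four programs in pairs the two counts are of the same order, but the brute-force count grows rapidly with both the number of programs and the number of programs per node, while the profiling count grows only linearly with the number of programs.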
The Proposed Method for Slowdown based Characterization
The procedure to estimate and predict slowdowns consists of three steps:
1. Determine the program's memory bandwidth usage (for example, using hardware counters).
2. Profile the program against different levels of interfering memory bus traffic.
3. Perform slowdown prediction based on interpolation of data from steps 1 and 2.
The first step (1) in the slowdown estimation method is to use hardware counters to determine the execution time and memory bandwidth usage of the program. These measurements were done using PAPI and the bus_trans_mem.self performance counter (see Table 1).
In the second step (2) the program slowdown is measured by executing the program a pre-determined number of times while different levels of interfering memory traffic are generated. In this study four different levels of memory load were generated: 25%, 50%, 75% and 100% of the maximum memory bandwidth. The Memgen tool, described in [8], was used for traffic generation and runtime data collection. The recorded slowdowns, as measured by Memgen, are shown in Table 5 and Figure 2. We then create the slowdown function of program $P$, $S_P(b)$, where $b$ is the fraction of the total memory bandwidth used by the co-scheduled program. $S_P(b)$ relies on linear interpolation between the measured bandwidth points in order to estimate the slowdown occurring at a co-scheduled bandwidth usage of $b$%. The $S_P(b)$ functions are shown in Figure 2 together with the measurement points.
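A slowdown function of this kind can be sketched in Python as a piecewise-linear interpolation over the profiled points. This is a minimal sketch, not the Memgen implementation: the sample profile is hypothetical (Table 5 is not reproduced here), the added point at 0% interference with 0% slowdown is our assumption, and values outside the measured range are simply clamped to the nearest endpoint.

```python
from bisect import bisect_right


def make_slowdown_function(points):
    """Build S_P(b) from (bandwidth %, slowdown %) measurements.

    `points` must contain at least two pairs. Values between measurement
    points are linearly interpolated; values outside the measured range
    are clamped to the nearest endpoint.
    """
    pts = sorted(points)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]

    def slowdown(b):
        if b <= xs[0]:
            return ys[0]
        if b >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, b)  # xs[i - 1] <= b < xs[i]
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (b - x0) / (x1 - x0)

    return slowdown


# Hypothetical profile: no slowdown without interference, then four load levels.
s_p = make_slowdown_function([(0, 0.0), (25, 2.0), (50, 5.0), (75, 9.0), (100, 14.0)])
print(s_p(16))  # interpolated between the 0% and 25% points
```

Because the function only needs the handful of profiled points per program, it is cheap to store and evaluate.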
The third step (3) is to predict the slowdown of a program, $P_1$, when it shares the memory bus with another program, $P_2$. The prediction is performed by taking the memory bandwidth usage value of $P_2$, $b_{P_2}$, as measured by PAPI and the bus_trans_mem.self performance counter (Table 1). We then estimate the slowdown for $P_1$ using $S_{P_1}(b_{P_2})$, which returns the slowdown prediction for $P_1$ when co-scheduled with $P_2$.
For example, to predict the slowdown experienced by LU when co-scheduled with FT, with no cache sharing, we first use PAPI to determine that the memory bandwidth usage of FT is 1099 MB/s. This corresponds to 16% of the maximum memory bus capacity, i.e., $b_{FT} = 16\%$. We then turn to Figure 2a and look up $S_{LU}(b_{FT})$; the predicted slowdown is thus 3%. This lookup could easily be performed by an online job scheduler.
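The numbers in this example can be retraced with a short Python sketch. The 1099 MB/s, the 6.9 GB/s maximum sustainable bandwidth and the 3% looked-up slowdown are taken from the text; the solo execution time of 200 seconds is a hypothetical placeholder, and the conversion from a percentage slowdown to a predicted execution time is our assumption about how the two relate.

```python
MAX_SUSTAINABLE_MB_S = 6900.0  # 6.9 GB/s, the maximum sustained bandwidth in this study


def bandwidth_fraction_percent(bandwidth_mb_s, max_mb_s=MAX_SUSTAINABLE_MB_S):
    """Share of the sustainable memory bandwidth used by a program, in percent."""
    return 100.0 * bandwidth_mb_s / max_mb_s


def predicted_time(solo_seconds, slowdown_percent):
    """Assumption: a slowdown of s% lengthens the solo execution time by s%."""
    return solo_seconds * (1.0 + slowdown_percent / 100.0)


b_ft = bandwidth_fraction_percent(1099.0)  # FT's measured bandwidth from the text
print(round(b_ft))                         # 16

# The 3% comes from the text's lookup in Figure 2a; 200 s is a made-up solo time.
print(predicted_time(200.0, 3.0))
```

In an online scheduler the looked-up value would come from evaluating the stored slowdown function of LU at `b_ft` rather than from reading a figure.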
Accuracy of Memgen Slowdown Predictions
To evaluate the accuracy of the slowdown based co-schedule predictions we define the accuracy as the relative absolute difference between the predicted and real execution times, i.e., as the absolute value of the difference between the predicted execution time (Table 6) and the real measured execution time (Table 3), divided by the real measured execution time. A comparison of the Memgen predictions and the real execution times can be seen in Table 7.

Table 7. Accuracy of the slowdown predictions for CG, EP, FT, and LU. The accuracy is defined as the relative difference between predicted and real co-scheduling slowdown.
FT 11% 0% -1% 7%
LU 13% -2% -1% -4%

For measurements with separate last-level caches, the predictions are accurate to within 0-5% for a majority, 9 out of 16, of the possible co-schedule combinations. One single co-schedule overestimates the slowdown by as much as 12%, namely when two instances of CG are co-scheduled. It is not surprising that the accuracy of the shared cache results is lower than for the separate caches, since the behavior of two resources, both the cache and the memory bus, has to be accounted for. Still, the predictions are correct (within 5%) for 9 of the 16 combinations. In the shared last-level cache scenario, three predictions fall into the 11%-16% range. They all occur when programs are co-scheduled with CG. There seems to be a pattern in the overestimations for the shared cache. The overestimations are highest when competing with CG and second highest when competing with LU and FT. The measurements for EP are not overestimated at all. This is most likely an effect of the traffic generator's memory access pattern. It reads and writes large arrays, and it is likely that this behavior invalidates the cache faster than most real programs would. Hence, the predictions tend to be worst-case estimates.
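The accuracy measure can be written down as a small Python sketch. The signed variant is included because the table entries carry a sign; treating a positive value as an overestimated slowdown is our reading of the text. The list of errors contains only the two table rows that are legible in this copy (FT and LU), so the resulting count is not the paper's 9 out of 16.

```python
def relative_error_percent(predicted_seconds, measured_seconds):
    """Signed relative difference; positive means the slowdown was overestimated."""
    return 100.0 * (predicted_seconds - measured_seconds) / measured_seconds


def accuracy_percent(predicted_seconds, measured_seconds):
    """Relative absolute difference between predicted and measured time."""
    return abs(relative_error_percent(predicted_seconds, measured_seconds))


def count_within(errors, tolerance_percent=5.0):
    """How many signed errors lie within +/- tolerance_percent."""
    return sum(1 for e in errors if abs(e) <= tolerance_percent)


# Signed errors (%) from the FT and LU rows of Table 7 as they survive here.
surviving_rows = [11, 0, -1, 7, 13, -2, -1, -4]
print(count_within(surviving_rows), "of", len(surviving_rows), "within 5%")
```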
Job Scheduling Implications
For the successful implementation of memory aware job scheduling we need a characterization method that provides information about the scale of the slowdown from co-scheduling two programs, and a scheduling algorithm that uses the characterization information to minimize the overall slowdown. In this section we exemplify how inaccurate characterizations might impact job scheduling negatively and compare the results for memory bandwidth based and slowdown based co-scheduling.
The rationale behind bandwidth based co-scheduling is that the overall bandwidth usage is a good indicator of the slowdown a program experiences. More importantly, when comparing the bandwidth usage of two different co-scheduled program pairs, the pair with the lower total bandwidth usage should experience less slowdown.
For a program pair, we define their slowdown $s$ as the sum of each program's slowdown divided by the number of cores (the average slowdown across programs running co-scheduled on different cores). The bandwidth use of a pair is the sum of each program's bandwidth use. With the bandwidth characterization method, pairs will be correctly ordered if their slowdown, $s(x)$, for a certain bandwidth $x$, is a monotonically increasing function. This means that if we pick two bandwidths $x_1$ and $x_2$ such that $x_1 \le x_2$, then the corresponding slowdowns will satisfy $s(x_1) \le s(x_2)$. Hence, if this criterion is satisfied, a scheduler ordering co-scheduling program pairs based on their bandwidths will order pairs in the same way as if they were ordered by their real slowdowns. On the other hand, if $s(x_1) > s(x_2)$ for some $x_1 \le x_2$, there is a risk of wrongly ordering pairs: the job scheduler will favor the pair with the higher slowdown (and lower bandwidth) over the pair with the lower slowdown (and higher bandwidth).
For a slowdown based characterization method, $x$ is instead defined as the predicted slowdown, and the same reasoning about ordering applies. To assess the job scheduling impact of inaccurate characterization, we focus on the risk of wrongly ordering the co-scheduling pairs. The magnitude of this risk is defined as the difference $s(x_2) - s(x_1)$.
To evaluate a complete characterization we look at all adjacent bandwidth measurements (or predicted slowdowns), $x_i$ and $x_{i+1}$, and calculate the sum of negative slowdowns:
\[ \sum_i \min(0, \Delta s_i), \qquad \Delta s_i = s(x_{i+1}) - s(x_i) \]
If $s(x)$ is monotonically increasing, then $\Delta s_i$ will be non-negative for all $i$ and will not contribute to the sum. A negative $\Delta s_i$ indicates that the characterization method has ordered the possible co-schedules incorrectly, and it is thus included in the sum. For a perfect characterization method the sum of negative slowdowns will always be 0. When comparing different characterization methods, results with a sum closer to 0 will have less negative impact on the final co-schedule.
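The metric can be sketched in a few lines of Python, taking the difference for each adjacent pair to be the measured slowdown of the later co-schedule minus that of the earlier one after sorting by the characterization value $x$. The function name and the second data set are hypothetical; the first data set reuses the three pairs quoted earlier for separate caches (aggregate bandwidth in GB/s, average slowdown in %).

```python
def sum_of_negative_slowdowns(pairs):
    """Sum of the negative differences in measured slowdown between adjacent
    co-schedules when these are ordered by the characterization value.

    `pairs` is an iterable of (x, s) tuples, where x is the characterization
    value (aggregate bandwidth or predicted slowdown) and s is the measured
    slowdown of that co-schedule.
    """
    ordered = sorted(pairs, key=lambda p: p[0])
    deltas = [nxt[1] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return sum(d for d in deltas if d < 0)


# FT,FT / CG,FT / CG,CG as quoted for separate caches: no inversion among these three.
print(sum_of_negative_slowdowns([(2.0, 6.0), (4.5, 6.0), (6.8, 12.0)]))

# Hypothetical characterization with one inversion (slowdown drops from 6 to 4).
print(sum_of_negative_slowdowns([(1.0, 2.0), (2.0, 6.0), (3.0, 4.0), (4.0, 9.0)]))
```

The result is never positive, so a characterization method is better the closer its sum is to 0.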
Job Scheduling Implication Comparison
We used a standard bandwidth based method (Section 3) and our new slowdown based method (Section 4) to characterize four benchmark programs and evaluate all possible co-schedules.
As can be seen in Figure 3a, measured slowdown as a function of bandwidth utilization is not monotonically increasing. This might cause the scheduler to wrongly order the pairs, since the scheduler treats co-scheduling pairs with a lower bandwidth utilization as preferable to any pairs with a higher bandwidth usage. While the scheduler prefers FT,FT over FT,LU, which in turn is preferred over CG,EP, the slowdown actually decreases from FT,FT, with a bandwidth/slowdown ratio of [...]. From a scheduler standpoint this means that 6 of the 10 co-schedulable pairs are placed in an incorrect order and thus the scheduler will in some cases make a suboptimal choice.
Turning to the slowdown based method, we find that it incorrectly orders one co-schedule (one negative $\Delta s_i$) when using separate caches (Fig. 3c) and two co-schedules when the programs share the LLC (Fig. 3d). In both scenarios FT,FT is incorrectly preferred over (ordered before) FT,LU. Furthermore, for shared caches, the slowdown based method orders LU,LU before CG,LU, which is incorrect. The errors made by the slowdown based method are also made by the bandwidth method. However, the bandwidth based method incorrectly ordered one additional co-schedule in the separate cache scenario and three additional co-schedules in the shared cache scenario.
Table 8 shows the sum of negative slowdowns for the separate and shared cache scenarios. The slowdown based characterization performs an almost perfect job on the separate caches. The only non-optimal selection a scheduler would make is when both the FT,FT and FT,LU co-schedules are included. This would give a performance degradation of a mere 3.2% of the FT,FT execution time. The worst case found in this study was for the memory bandwidth based characterization on a shared cache system; a scheduler would choose LU,LU over CG,FT, which would result in a performance degradation equal to 28% of the LU,LU execution time.
The experiment shows that the sum of negative slowdowns for the memory bandwidth based characterization method is approximately 3.4 times larger than that of the slowdown based characterization in both scenarios. Thus, a job scheduler will make better co-scheduling decisions when it relies on the slowdown based characterization than when it relies on the memory bandwidth based one.
Conclusion
In this paper we proposed a new slowdown based characterization method aimed at estimating and predicting a program's slowdown due to memory bus contention when co-scheduled with another program. The slowdown based characterization method is verified against real co-scheduling slowdown measurements and compared against the current standard characterization method which is based on memory bandwidth usage. We used two experiment setups. In the first one the two co-scheduled programs used separate L1 and L2 (last-level) caches, and in the second setup, they shared the L2 cache.
The results were evaluated using two different criteria: (1) the number of co-schedules (program pairs) the characterization methods could order correctly, and (2) the impact that incorrect ordering (if any) might have on the job scheduling results. For the impact evaluation we use a new metric based on the sum of negative slowdowns.
To conclude, in this study, a scheduler relying on slowdown based characterizations makes fewer incorrect choices and has a lower impact on program execution times when compared to a scheduler basing its decisions on data from a bandwidth based characterization method.
