Highly parallel, high-performance scientific applications must maximize performance inside of a power envelope while maintaining scalability. Emergent parallel and distributed systems offer a growing number of operating modes that provide unprecedented control of processor speed, memory latency, and memory bandwidth. Optimizing these systems for performance and power requires an understanding of the combined effects of these modes and thread concurrency on execution time. In this paper, we describe how an analytical performance model that separates pure computation time (C) and pure stall time (S) from computation-memory overlap time (O) can accurately capture these combined effects. We apply the COS model to predict the performance of thread and power mode combinations to within 7% and 17% for parallel applications (e.g., LULESH) on Intel x86 and IBM BG/Q architectures, respectively.
INTRODUCTION
Future high-performance, scientific applications will be highly parallel and designed to run in environments of enormous scale but limited power. Efficiency will be key to achieving the promise of exascale. Emergent systems will have large numbers of configurable operating modes that provide unprecedented control of processor speed and memory frequency and bandwidth. Unfortunately, very little is known about the combined effects of these operating modes and thread concurrency on execution time and efficiency.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
The performance effects of various operating modes have been studied mostly in isolation. Dynamic voltage and frequency scaling (DVFS), the automated adjustment of processor power and speed settings, has been explored extensively [6, 17, 19, 34]. More recently, analogous research on the effects of dynamic memory voltage and frequency throttling (DMT), the automated adjustment of DRAM power and speed settings, has surfaced [10, 13, 29]. Other memory power modes such as dynamic bandwidth throttling¹ (DBT), where one or more idle clock cycles are inserted between memory accesses to lower peak bandwidth, are emergent. Dynamic concurrency throttling (DCT), the automated adjustment of thread concurrency, has also received widespread attention for some time [8].
While some have attempted to study the combined effects of two types of operating modes (e.g., CPU and memory scaling [10, 13], CPU scaling and concurrency throttling [8]), to the best of our knowledge, no one has accurately modeled the combined effects of CPU throttling, memory throttling, and concurrency throttling.
Modeling the combined effects of these three operating modes is incredibly challenging. Capturing the interactive performance effects of a highly configurable problem space could be intractable in highly parallel, high-performance environments. Furthermore, the interactive effects of these modes are likely to be non-linear, complicating efforts to identify simple but useful analytical models of performance.
In this paper, we present the COS Model of parallel performance for dynamic variations in processor speed, memory speed, and thread concurrency. To the best of our knowledge, this is the first model to accurately capture the simultaneous, combined effects of these three operating modes.
The COS model is based on a simple observation. Past models of operating mode performance tend to combine the overlap of compute and memory performance into either compute time or memory stall time. However, we have observed that the behavior of overlap when these operating modes change is so complex that it must be modeled independently of these other times. This observation leads to the formulation of a Compute-Overlap-Stall (COS) Model where each term can be modeled independently of the others.
In addition to presenting the COS model, we demonstrate how to capture these important (and independent) parameters on both Intel servers and the IBM BG/Q system. We also show how the COS model can be used to classify the best available models. We validate our modeling efforts on 19 HPC kernels and perform extensive sensitivity analyses to identify weaknesses. Our COS Model offers more functionality than previously available models, and its accuracy is as good as or better than that of the best available operating mode models, with prediction errors as low as 7% on Intel systems and 17% on the IBM BG/Q system.
COMPUTE-OVERLAP-STALL MODEL
2.1 COS Model Parameters
The Compute-Overlap-Stall (COS) model estimates parallel execution time as the sum of pure compute time ($T_c$), overlap time ($T_o$), and pure stall time ($T_s$). More generally,

\[ T = T_c + T_o + T_s \qquad (1) \]

where $T$ is the total time for a running application. Figure 1 shows an example execution time profile for a simple, single-threaded application. A single core executes some computation that triggers two separate, non-blocking memory operations. As the code executes, portions of time are spent exclusively on on-chip, in-cache computations; exclusively on off-chip memory operations; and on some form of overlap between computation and memory accesses. Figure 1 provides context for defining the terms of the COS model more precisely. $T_c$ is the sum of the execution times of an application spent exclusively on computation², or the pure compute time. In this example, $T_c$ is the sum of the pure compute times identified at the start and end of the application's execution. $T_o$ is the sum of the execution times of an application spent overlapping computation and memory operations, or the overlap time. In this example, $T_o$ is the sum of the overlap times; there are three of these stage occurrences over the application's execution. $T_s$ is the sum of the execution times of an application spent exclusively on memory stalls, or the pure stall time. In this example, $T_s$ is the sum of the pure stall times.
The COS Trace
The ordered summation of the terms of the COS model constitutes a simplified trace of the application. We call this a COS Trace. More precisely, the 8 stage occurrences for the example in Figure 1 are expressed in the following COS Trace:
Analogously, we propose a general COS Trace as follows:

\[ T = \sum_{i=1}^{cP} T_c(i) + \sum_{j=1}^{oP} T_o(j) + \sum_{k=1}^{sP} T_s(k) \qquad (3) \]

where $cP$, $oP$, and $sP$ are the number of stages corresponding to the three types of time in the COS trace: the pure compute time ($T_c$), the overlap time ($T_o$), and the pure stall time ($T_s$). For Figure 1, $cP = 2$, $oP = 3$, and $sP = 3$. Predicting parallel execution time using the COS model involves estimating the effect of a system or application change on the COS Trace³.
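As a minimal sketch of the bookkeeping described above, a COS Trace can be held as an ordered list of stages from which the three totals and the stage counts are derived. The stage order and the durations below are made-up illustration values, not measurements from Figure 1; only the counts (2 compute, 3 overlap, 3 stall stages) follow the text.

```python
from collections import Counter

# A COS Trace as an ordered list of (kind, duration) pairs.
# Kinds: "c" = pure compute, "o" = overlap, "s" = pure stall.
# Hypothetical order and durations, chosen only to give cP=2, oP=3, sP=3.
trace = [
    ("c", 2.0), ("o", 1.0), ("s", 0.5), ("o", 1.5),
    ("s", 0.5), ("o", 1.0), ("s", 1.0), ("c", 2.5),
]


def cos_totals(trace):
    """Return (T_c, T_o, T_s): the summed duration of each stage kind."""
    totals = {"c": 0.0, "o": 0.0, "s": 0.0}
    for kind, duration in trace:
        totals[kind] += duration
    return totals["c"], totals["o"], totals["s"]


def stage_counts(trace):
    """Return (cP, oP, sP): the number of stages of each kind."""
    counts = Counter(kind for kind, _ in trace)
    return counts["c"], counts["o"], counts["s"]


t_c, t_o, t_s = cos_totals(trace)
total_time = t_c + t_o + t_s          # T = T_c + T_o + T_s
cP, oP, sP = stage_counts(trace)
print(total_time, (cP, oP, sP))
```

Keeping the stages in order, rather than only the three totals, is what lets a later change in operating mode be expressed as a change in the number and length of stages.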
COS Model Notations
In succeeding discussions we will use $f_c$ and $f'_c$ to refer to a starting CPU frequency and the changed CPU frequency, respectively. Moreover, $\Delta f_c$ denotes the change from $f_c$ to $f'_c$. We can define $\Delta f_m$ and $\Delta t$ analogously. We use the shorthand ($\Delta f_c$, $\Delta f_m$, $\Delta t$) to denote changes to DVFS, DMT/DBT, and thread count, respectively. For example, ($f_c$, $\Delta f_m$, $\Delta t$) refers to simultaneous memory throttling and changes to thread counts.
The Importance of Isolating Overlap
Many existing models of parallel performance ignore overlap [3, 16, 22, 31, 33, 36]. When overlap is considered, the effects are captured in either the compute time ($T_c$) or the memory stall time ($T_s$) parameters. If overlap is included in $T_c$, then the model assumes $\Delta f_c$ effects apply equally to the overlap portion. If overlap is included in $T_s$, then the model assumes $\Delta f_m$ effects apply equally to the overlap portion. Figure 2 shows the stall time (y-axis) for a code region (R1) of the LULESH OpenMP application kernel [1]. The CPU voltage/frequency increases from left to right (x-axis). The figure shows the measured stall time and the predicted stall time for two best-available performance prediction approaches (stall-based and leading-load-based [16, 22, 33]). Notice that both approaches consistently under-predict stall time. Furthermore, in another code region (R2) of the LULESH OpenMP application (not shown), the same prediction techniques over-predict stall time.
³ The power-performance operating modes studied include CPU Dynamic Voltage and Frequency Scaling (DVFS) and DRAM Dynamic Memory Frequency Throttling (DMT) on Intel architectures; Dynamic Memory Bandwidth Throttling (DBT) on BG/Q architectures; and Dynamic Concurrency Throttling (DCT) on both architectures.
When stall time dominates, these mis-predictions lead to significant inaccuracies in execution time prediction. The effects are exacerbated by the complex computation and memory overlap scenarios that affect stall and compute time and are more common in mixed operating modes (DVFS, DMT, and DCT).
Figure 2: Stall time (y-axis) for varying CPU voltage/frequency settings (x-axis) for the LULESH benchmark on an x86 system.
The Challenge of Isolating Overlap
Figure 3 shows a simplified example for three CPU frequencies ($f_{c1} < f_{c2} < f_{c3}$) increasing from left to right. In each subfigure, core activity and memory activity are shown separately as a thread progresses in time (x-axis) from left to right. The COS trace is provided for each subfigure. In the first subfigure, at the lowest CPU frequency $f_{c1}$, there are 4 distinct compute and memory overlap phases in the COS trace. This indicates regular memory accesses where the CPU is busy with work during the memory stall time. More precisely, the 9 stage occurrences for $f_{c1}$ in Figure 3 are expressed in the following COS Trace:

\[ T = T_c(1) + T_o(1) + T_c(2) + T_o(2) + T_c(3) + T_o(3) + T_c(4) + T_o(4) + T_c(5) \qquad (4) \]
The change from $f_{c1}$ to $f_{c2}$ (Figure 3) alters the COS trace as follows:

\[ T = T_c(1) + T_o(1) + T_c(2) \qquad (5) \]
This reflects a dependency between the resulting COS trace and CPU frequency. In this case, there is a change in the arrival rate of the memory requests due to the CPU frequency changes. In the new configuration, there are no pure compute gaps between memory references, leading to a change in the number and length of overlap stages.
Increasing the frequency a second time in this example ($f_{c3}$ in Figure 3) alters the COS trace again, resulting in:

\[ T = T_c(1) + T_o(1) + T_s(1) + T_o(2) + T_c(2) \qquad (6) \]
This demonstrates the creation of a pure stall stage that did not exist in the previous two COS traces ($f_{c1}$ and $f_{c2}$) in Figure 3. We observe similar behaviors for memory throttling changes.
Figure 3: Overlap time and pure stall time are related to computation intensity. The three panels show the COS trace stages for $f_{c1} < f_{c2} < f_{c3}$.
The Role of Computational Intensity
Estimating the COS terms for simultaneous changes in operating modes such as ($\Delta f_c$, $\Delta f_m$, $\Delta t$) is even more challenging than the single CPU speed change described in Section 2.5. In theory, each term of the COS trace is affected by all operating mode changes ($\Delta f_c$, $\Delta f_m$, and $\Delta t$). In practice, it depends on the system and application design. The initial focus of our work is on shared memory systems running multithreaded OpenMP applications where parallel threads are mostly homogeneous and synchronized and the programs use a bulk-synchronous programming model. This focus leads to some simplifying assumptions while still covering a large set of parallel applications of interest to a broad community of scientists [4, 5, 15, 21, 23, 26, 32]. Table 1 shows the application of these assumptions to reduce the set of interactions we need to consider for accurate predictions between system reconfigurations and model parameters. In addition to the system configuration changes ($\Delta f_c$, $\Delta f_m$, and $\Delta t$), Table 1 lists CI as a consideration for both overlap ($T'_o$) and pure stall ($T'_s$) times. CI here stands for Computational Intensity, or the percentage of memory stall time that is overlapped with useful work on the CPU. CI determines how much stall time is affected by CPU speed ($\Delta f_c$) and how much is affected by memory throttling ($\Delta f_m$).
We conducted statistical analyses to identify a correlation between stall time and measurable hardware counters available on most x86 architectures [2] .
Through exhaustive experimentation for all available configurations ($\Delta f_c$, $\Delta f_m$, $\Delta t$), we found that two widely available counters, last-level cache misses (LLCM) and time-per-instruction (TPI), effectively captured the stall time effects of $\Delta f_c$, $\Delta f_m$, and $\Delta t$. This finding is key to the COS model's effectiveness since it enables us to use linear approximation methods to separate pure stall time from overlap stall time. For completeness, we studied the effects of CI on compute overlap, but we found that compute time was dominated by effects from CPU speed and thread count and not affected significantly by CI.
Table 1: Effects on COS model parameters of any starting configuration ($f_c$, $f_m$, $t$) to any other operating mode configuration (each row) for changes in processor speed, memory throttling, and number of threads ($\Delta f_c$, $\Delta f_m$, and $\Delta t$). For some configurations, we have additionally identified CI (Computational Intensity) as having significant influence over COS model parameters.
Practical Estimation of COS Parameters
We can use the COS trace of Equation 3 to predict the parallel execution time ($T'$) of another system configuration for any combination of $\Delta f_c$, $\Delta f_m$, and $\Delta t$. Since the variables may not be directly measurable, the challenge is to collect accurate approximations without requiring system design changes or reverting to simulation. In this section we describe one method for predicting $T'$ using direct measurements readily available on most x86 systems. Several of the parameters of Equation 3 are directly measurable. Both total time $T$ and the pure stall time $T_s$ are directly measurable using the CPU hardware counters available on most modern platforms [2].
We have also observed in our experimental work that overlap consists of a portion affected by CPU speed changes (related to compute time and denoted as $T_{oc}$) and another portion affected by memory throttling (related to stall time and denoted as $T_{os}$). The portions vary according to the computational intensity CI of the application (see Section 2.6).
Under these measurements and observations, the operation mode
where
Multiplying by the ratio of the CPU speed $f_c$ and the new CPU speed $f'_c$ follows the dependencies ($\Delta f_c$, $\Delta t$) listed in Table 1 for $T'_c$. We will discuss how thread changes affect predictions in Section 2.8. Table 1 shows that time for the (
Combining our approximations for both sets of terms in Equation 7, our approximation of $T'$ for an operating mode configuration
In the next section, we describe how we use training sets and linear regression to identify the alpha parameters in this equation to develop a general model for each application in our set of 19.
Offline Training and Online Prediction
We use a training set measured offline to predict online a larger set of $\Delta f_c$, $\Delta f_m$, and $\Delta t$ configurations.
Figure 4 (panels): offline training (observed execution time) and online prediction (predicted execution time).
The astute reader will notice that Equation 9 contains no term for the number of threads despite our claim to predict for dynamic concurrency changes. The impact of threads is captured in a set of linear approximations for Equation 9 applied to our training sets. What follows is an explanation of the algorithm we use to predict the simultaneous effects of $\Delta f_c$, $\Delta f_m$, and $\Delta t$ configurations. Figure 4 and Algorithm 1 describe our sampling techniques in detail. We gather a set of data for a given application and take samples at various configurations for $\Delta f_c$, $\Delta f_m$, and $\Delta t$. We use this data to conduct linear regression on Equation 9 to determine the values of the four α parameters. For each measurement, we simultaneously gather execution time ($T$), stall time ($T_s$), LLCM values, and TPI values.
We designed Algorithm 1 to formally describe the process illustrated by Figure 4. We define $f_c^{min}$, $f_m^{min}$, and $t^{min}$ as the minimum speed setting for the CPU, the minimum throttling setting for memory, and the smallest number of threads, respectively, for a training set.
Algorithm 1 Train the COS Model for any application 
In Algorithm 1, thread behavior is captured by the training set. By reapplying Equation 9 to different thread configurations (steps 2 and 3 in Algorithm 1) we are able to capture the effects of threads on the COS model parameters using a combination of direct measurements and linear regression.
These effects are incorporated in both $[T - T_s]$ and the LLCM and TPI terms of Equation 9. Thread effects are implicitly captured in the algorithmic application of Equation 9 and thus do not appear explicitly in the formulation.
For $a$ memory modes, $b$ CPU modes, and $c$ thread settings, we require $a \times b \times 2$ measurements for step 2 in Algorithm 1 and $a \times c$ measurements for step 3 in Algorithm 1. These measurements are captured visually by the hashed squares on the left side of Figure 4.
This is compared to our ability to predict $a \times b \times c$ combinations using a single training set (see the darker squares on the right side of Figure 4). We have also determined that of the 19 applications studied, only 4-6 models are needed to accurately predict $T'$ for all 19 applications. In future work, we are attempting to reduce the training sets further for online usage. In the remainder of this paper, we compare our predictions with direct measurements and use the resulting COS model for analysis for 19 applications on Intel x86 and IBM BG/Q systems.
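The measurement counts above can be checked with a few lines of arithmetic. The sketch below assumes a = 3 memory modes and b = 16 CPU modes (the counts reported for the Intel test machine) and, as an illustrative assumption, c = 7 thread settings; the function names are hypothetical. The two training steps are counted independently, so any configurations shared between them are not deduplicated.

```python
def training_measurements(a, b, c):
    """Measurements for steps 2 and 3 of training: a*b*2 + a*c."""
    return a * b * 2 + a * c


def predictable_combinations(a, b, c):
    """Size of the full configuration space: a*b*c."""
    return a * b * c


# a, b from the machine description; c = 7 is an assumption for illustration.
a, b, c = 3, 16, 7
print(training_measurements(a, b, c))     # 117
print(predictable_combinations(a, b, c))  # 336
```

The training cost grows with a*b and a*c separately, while the predicted space grows with the product a*b*c, which is why the gap widens as more thread settings are added.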
EMPIRICAL MODEL VALIDATION
In this section, we validate the COS model on a multi-core machine using several application benchmarks with different computational characteristics. We measure the accuracy of the model by comparing the model's predictions with observed values measured on real hardware.
Machine Characteristics
We validate the COS model on a cluster composed of Dell PowerEdge R430 servers. Each node has two Intel Xeon E5-2623 v3 (Haswell) processors and 32 GB of DDR4 memory. Each processor has four cores and each core supports two hardware threads. The Haswell processor supports 16 CPU frequencies ranging from 1.2 to 3.0 GHz. The memory system supports three bus frequencies: 1.333, 1.600, and 1.866 GHz.
Application Benchmarks
We employ a set of benchmarks and kernels that represent diverse computational characteristics appearing in high-performance, parallel, scientific applications.
The application benchmarks include the following codes:
• LULESH (CORAL benchmark suite⁴)
LULESH is an explicit hydrodynamics proxy application that contains data access patterns and computational characteristics of larger hydrodynamics codes at LLNL [1]. We use five code regions within an OpenMP version of LULESH that represent different phases of the application and consume over 90% of the runtime [27]. These five code regions (R1 to R5) were selected in collaboration with domain scientists to isolate code regions with a diverse set of computational intensity characteristics.
AMGmk includes three compute-intensive kernels from AMG, an algebraic multigrid benchmark application derived directly from the BoomerAMG solver in the Hypre linear solvers library [18]. This code is used broadly in a number of applications [26] of interest to the multi-physics community. The default Laplace-type problem is built from an unstructured grid with various jumps and anisotropy in one part. We label these kernels K1 to K3.
Rodinia is a benchmark suite for heterogeneous computing [7]. We use six OpenMP codes from the domains of data mining, graph algorithms, physics simulation, molecular dynamics, and linear algebra: Kmeans, k-Nearest Neighbors (kNN), Breadth-First Search (BFS), HotSpot, LavaMD, and LU Decomposition (LUD). Components of this application suite such as HotSpot are of high interest to domain scientists for use in structured grid applications [4, 5, 21]. There is also high demand for optimized linear algebra solvers [32, 37] such as kNN, Kmeans, and LUD that are used regularly in many high-performance applications and systems.
pF3D is a massively parallel application that simulates laser-plasma interactions at the National Ignition Facility at LLNL [24]. This simulator aids scientists in tuning plasma and laser beam experiments crucial to experimental physics [23]. The pF3D kernels derive from the functions that consume the most time during a typical pF3D run and are written in OpenMP. We use the following kernels: Absorbdt, Acadv K1, Acadv K2, APCPFT, and Advance.
In total we used 5 + 3 + 6 + 5 = 19 code regions and application kernels to evaluate the proposed model. For simplicity, we refer to these as codes or applications although they are application benchmarks.
Performance Prediction Accuracy
We compare the execution time predicted by the model with the execution time observed by running the codes. First, for each code, we train its model offline (see Section 2.8) using a sample of $\Delta f_c$, $\Delta f_m$, and $\Delta t$ as shown in Table 2. With these configurations we derive the model coefficients. At this point, we can use the model to predict the execution time of any given configuration. Second, we run the code under the configurations not in the training set, for a total of 225 configurations. Each of these is run 20 times to smooth out system noise effects and the average execution time is calculated. Third, we do this for all 19 codes. Figure 5 shows the model prediction accuracy for all of the codes. Figure 5a shows the average prediction error of each code across the entire configuration space not in the training set. The prediction error is calculated as follows (also shown in Figure 4):

\[ \text{prediction error} = \frac{|T_{measure} - T_{predict}|}{T_{measure}} \]

Figure 5a: Average prediction error.
Figure 5 shows that the average prediction error per code is low, varying from 1.4% to no more than 7%. Most of the codes have an error lower than 4%. This demonstrates that the proposed model is highly accurate for a broad range of applications. We also measure the standard deviation of the prediction error as shown in Figure 5b.
The standard deviation for every code is within 4.5%. Our proposed model is accurate across the three-dimensional configuration space for all 19 applications.
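The error metric above (the absolute difference between measured and predicted time, divided by the measured time) and its per-code summary can be sketched as follows. The sample times are made up for illustration, and the use of the population standard deviation is an assumption; the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def prediction_error(t_measure, t_predict):
    """Relative prediction error: |T_measure - T_predict| / T_measure."""
    return abs(t_measure - t_predict) / t_measure


def summarize(pairs):
    """Return (average error, standard deviation) over (measured, predicted) pairs."""
    errors = [prediction_error(m, p) for m, p in pairs]
    return mean(errors), pstdev(errors)  # pstdev is an assumed choice


# Hypothetical (measured, predicted) execution times in seconds for one code.
samples = [(10.0, 9.5), (20.0, 21.0), (8.0, 8.0)]
avg_error, std_error = summarize(samples)
print(f"average error: {avg_error:.2%}, std dev: {std_error:.2%}")
```

Normalizing by the measured time keeps the errors comparable across configurations whose absolute run times differ widely.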
To verify that the tested codes include a wide range of different computational characteristics, we measured the sensitivity of a subset of our codes to certain parameters such as processor speed. To capture an application's sensitivity, we focus on pressure on the memory system, measured as last-level cache misses per second. We expect, for example, low memory pressure for compute-intense applications (see Section 2.6) and high pressure for memory bandwidth-intense applications. Figure 6 shows last-level cache misses (LLCM) per second as a function of different processor and memory speeds and thread concurrency. We employ two processor speeds (1.2 and 3.0 GHz), two memory speeds (1.333 and 1.866 GHz), and two thread counts (4 and 16). Each configuration is represented as a tuple of the following form:
(C: cpu frequency, M: memory frequency, T: num threads)
First, we focus on one configuration: (C1.2, M1.333, T4). Codes including kNN, AMG K1 and K2, and LULESH R4 show low memory bandwidth pressure. This matches our expectation since AMG K1 and K2 are compute-intense kernels, as is LULESH R4 [28]. While LULESH R1 and R3 are among the codes with the highest usage, Kmeans, BFS, and LULESH R2 exercise higher memory bandwidth utilization. These last two have been shown to be memory bandwidth intensive [27].
Second, we observe that some codes are significantly affected by different parameters such as memory speed and processor speed. LULESH R1, for example, shows increased memory pressure with increases in processor speed and also with increases in memory speed. R1 has a high number of instructions per cycle (IPC) that benefit from the increased processor speed, shifting the pressure to the memory system. Except for R3, other regions of LULESH show a similar pattern but at different scales. Kmeans, AMG K1, and AMG K2 show significant sensitivity to processor speed because of their linear algebra computations.
Third, there are codes that show low sensitivity to different configurations. kNN is a clear example of this. BFS is not affected by increases in processor performance since there is little computation during the graph traversals, but it does show sensitivity to memory performance as a result of the operations fetching undiscovered graph nodes from memory. LULESH R3 is an interesting case since there are small changes with either processor or memory speed. This is the result of the code being almost exclusively memory bandwidth bound. Thus, Figure 6 shows that the codes studied in this work capture a diverse set of computational characteristics. Furthermore, the resources in the critical path for some of these codes can change significantly with different configurations. For example, AMG K2 run with a 3.0 GHz processor becomes significantly dependent on the memory system when increasing memory frequency from 1.333 to 1.866 GHz.
The low prediction error of the proposed COS model shows that we can capture the effect of these complex interactions accurately.
COS on Intel Sensitivity Analysis
3.4.1 Memory Prefetching. During development, we noticed the COS model accuracy was sensitive to prefetch settings. The effects of hardware and software prefetching on performance are captured in changes to the COS Trace described by Equation 3. For example, a successful prefetch could increase overlap by preemptively importing data from main memory to cache. An incorrect prefetch, however, causes cache pollution and could lead to more overlap stages and stall stages.
To better understand these effects, we ran LULESH with hardware prefetching enabled and hardware prefetching disabled and analyzed the results using COS. Figure 7 shows these results using 4 Intel prefetchers: DCU streamer prefetcher (load data to L1 data cache triggered by an ascending access of recently loaded data), DCU IP prefetcher (load data to L1 data cache based on load instruction and its detected regular stride), adjacent line prefetcher (fetch cache line to L2 and last level cache with the pair line), and hardware (streamer) prefetcher (fetch cache lines to L2 and last level cache based on detection of forward or backward stream of requests from L1).
After enabling all the hardware prefetchers, the accuracy of our predictor worsens as expected. For LULESH R1, the changes in average prediction error (Figure 7a) and standard deviation (Figure 7b) are both minimal. In contrast, LULESH R4 has the largest differential (4x) in accuracy when prefetching is enabled. This is likely due to a large increase in overlap when prefetching is enabled, since the R4 region is dominated by compute when prefetching is disabled.
When prefetching is disabled, we get excellent accuracy using an extrapolation technique to predict configurations not observed directly in the training set. To improve the accuracy of COS for prefetching, we switched to an interpolation technique using 4- and 16-thread configurations to predict 6-, 8-, 10-, and 12-thread configurations.
The results validate that the COS model based predictor can successfully capture the impact of DVFS, DMT, and DCT simultaneously with prefetching but at the expense of predictor flexibility.
The COS approximation techniques implemented in the Intel systems could be extended to better capture the effects of prefetching overlap on performance using approaches similar to those used for power-performance modes.
3.4.2 ROB and MSHR. The sizes of the reorder buffer (ROB) and miss status holding register (MSHR) increase with each generation in CPU design. The ROB reorders instructions to increase instruction-level parallelism and the MSHR increases the number of loads that can be handled under a previous miss. These techniques have the potential to increase overlap and can impact the accuracy of the COS predictor on Intel systems.
We picked nine applications to ascertain the sensitivity of COS to the ROB and MSHR. The Intel hyperthreading design enables us to indirectly control the size of the ROB and MSHR. For a single thread per core, the ROB and MSHR are fixed in size. However, if we overload a core with multiple threads, the ROB and MSHR resources are divided among the threads. We exploit this indirect control in our experiments.
In our experimental setup, we identify two basic configurations: 1) 4, 6, and 8 threads, where at most one OpenMP thread is mapped to a core, and 2) 10, 12, 14, and 16 threads, where at least two cores run two OpenMP threads. As mentioned, these Intel machines have 8 cores, each with two hardware threads per core.
For these experiments we disable prefetching and use our extrapolation approach with 4 and 6 threads for training. Figure 8 shows that both the average error and the standard deviation for 8 threads are much better than for all the other thread configurations.
There appears to be a correlation between the least accurate of our earlier experiments, ABSORBDT (Figure 5), and the ROB and MSHR results. Though further experimentation is needed, it is likely that ABSORBDT is sensitive to ROB and MSHR sizes and we could consider improvements to our approximations of the COS model that incorporate these characteristics.
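The two thread configurations above can be made concrete with a small placement sketch. It assumes a "one thread per core first, then double up" policy on 8 cores with two hardware threads each; the policy and the function names are assumptions for illustration, since the text does not describe the actual OpenMP affinity settings.

```python
CORES = 8                 # cores per node, as stated in the text
HW_THREADS_PER_CORE = 2   # hardware threads per core, as stated in the text


def threads_per_core(num_threads, cores=CORES, smt=HW_THREADS_PER_CORE):
    """Spread threads across cores, filling each core once before doubling up."""
    if not 0 < num_threads <= cores * smt:
        raise ValueError("thread count exceeds available hardware threads")
    base, extra = divmod(num_threads, cores)
    return [base + 1 if i < extra else base for i in range(cores)]


def shared_cores(num_threads):
    """Number of cores whose ROB/MSHR resources are split between two threads."""
    return sum(1 for n in threads_per_core(num_threads) if n > 1)


for t in (4, 6, 8, 10, 12, 14, 16):
    print(t, threads_per_core(t), shared_cores(t))
```

Under this assumed policy, no core is shared up to 8 threads, and each additional pair of threads beyond 8 splits the resources of two more cores.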
CROSS-ARCHITECTURE VALIDATION USING IBM'S BLUE GENE/Q
To demonstrate the portability and scalability of our model we validate COS on IBM's Blue Gene/Q (BG/Q) architecture. BG/Q is a scalable, energy-efficient, high-performance system. The BG/Q architecture is capable of dynamic memory bandwidth throttling (DBT), where memory bandwidth is dynamically controlled through insertion of a configurable number of memory idle cycles between each DDR memory request.
BG/Q's DBT is different from the dynamic memory frequency throttling (DMT) common to Intel systems. While memory frequency throttling changes the latency of each main memory access, bandwidth throttling reduces the effective bandwidth by inserting idle cycles (or no-ops or bubbles) in the instruction pipeline.
The number of memory idle cycles inserted is called the throttling threshold and ranges between 0 and 126. Studies have shown this parameter can affect the performance of applications as well as their power consumption [29].
The throttling threshold affects those memory accesses that occur within the threshold window. For instance, if the time between two dependent memory requests at the memory controller is larger than the throttling threshold, the latency of these memory requests is not affected. When the time between two memory requests is smaller than the threshold, the latency of the second memory request would increase by the configurable number of memory idle cycles.
Unlike the Intel system, BG/Q is not capable of CPU frequency scaling and thus we limit our validation of COS to variations in memory bandwidth and thread concurrency. BG/Q has only two levels of cache, L1 and L2, compared to 3 levels of cache on our x86 experimental system.
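The threshold rule above can be sketched as a small function. The names and cycle values are hypothetical, and the case where the gap exactly equals the threshold is not specified in the text; this sketch treats it as unaffected.

```python
MAX_THRESHOLD = 126  # upper bound of the throttling threshold, per the text


def throttled_latency(gap_cycles, base_latency, threshold):
    """Latency (in cycles) of the second of two memory requests under DBT.

    gap_cycles:   time between the two requests at the memory controller
    base_latency: latency of the request without throttling (made-up input)
    threshold:    number of memory idle cycles inserted (0..126)
    """
    if not 0 <= threshold <= MAX_THRESHOLD:
        raise ValueError("throttling threshold must be between 0 and 126")
    if gap_cycles >= threshold:
        # Requests are far enough apart: latency is not affected.
        # (Equality is an assumption; the text covers only < and >.)
        return base_latency
    # Requests fall inside the window: add the configured idle cycles.
    return base_latency + threshold


print(throttled_latency(gap_cycles=200, base_latency=100, threshold=64))
print(throttled_latency(gap_cycles=10, base_latency=100, threshold=64))
```

The sketch shows why sparse memory accesses are insensitive to throttling while dense ones pay the full idle-cycle penalty.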
Approximating the COS Trace
Assume we change memory speed from $f_m$ to $f'_m$ ($\Delta f_m$) using DBT. To illustrate the effect on performance, Figure 9 shows the execution time in cycles (y-axis) for different throttling thresholds represented by number of memory idle cycles (x-axis) for all regions of the LULESH application. For each region (R1, R2, R3, R4, R5) of LULESH, two different phases can be distinguished: 1) a nearly flat or constant segment in the function at low thresholds and 2) a linearly increasing function at a threshold that appears to be different for each region. This forms a hockey-stick-shaped function for each region with a different inflection point. In a way, this is an example of Amdahl's law applied to an architectural enhancement. A portion of the code (phase 1 in this example) is not affected by the enhancement (e.g., insertion of memory idle cycles) while a portion of the code (phase 2 in this example) is affected by the enhancement. While this is an oversimplification in some ways, it implies that we can potentially use a piecewise function to approximate the performance for these codes if we can identify the inflection point (i.e., the number of memory idle cycles) where performance loss begins.
Following a series of experiments, we determined that the inflection points correlate with characteristics of a region's memory access behavior. We approximate the COS Trace expressed by Equation 3 using a piecewise function of performance:

$$T = \begin{cases} t_0, & \Delta f_m \le a \\ b\,\Delta f_m + t_0, & \Delta f_m > a \end{cases} \tag{10}$$

where $t_0$ is the performance with no memory throttling, $a$ is the threshold of the inflection point, and $b$ is the slope of the linear function.
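The two-segment form (constant at $t_0$ up to the inflection point $a$, then linear with slope $b$) can be sketched as a small Python function. The function name and the values of `t0`, `a`, and `b` below are hypothetical, and treating a threshold exactly equal to `a` as part of the constant segment is an assumption.

```python
def cos_piecewise_time(threshold, t0, a, b):
    """Execution time for a throttling threshold, using the two-segment form.

    threshold: number of inserted memory idle cycles.
    t0: execution time with no memory throttling.
    a: threshold at the inflection point.
    b: slope of the linear segment.
    """
    # Constant segment: throttling has no visible effect.
    # Assumption: threshold == a belongs to the constant segment.
    if threshold <= a:
        return t0
    # Linear segment, written as stated in the text: T = b * threshold + t0.
    return b * threshold + t0


# Hypothetical region: t0 = 1000 cycles, inflection at 40 idle cycles, slope 5.
for threshold in (0, 20, 40, 60, 126):
    print(threshold, cos_piecewise_time(threshold, t0=1000, a=40, b=5))
```

With these made-up values the output is flat through a threshold of 40 and grows linearly beyond it, which reproduces the hockey-stick shape described for each LULESH region.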
For the constant function ($T = t_0$), memory throttling has little impact on the COS Trace: performance does not change when memory idle cycles are inserted. This can be explained by the following two cases. First, the gap between most of the application's memory accesses is larger than the throttling threshold. The number of inserted memory idle cycles is too small to delay memory accesses ($T_s$ does not change) and thus leaves total execution time unchanged. In this case, inserting idle cycles does not change $T_c$, $T_o$, or $T_s$. Second, the gap between memory accesses is smaller than the throttling threshold, but the memory accesses overlap with processor computation. Inserting idle cycles can delay the issue of new memory requests but does not change the length of any of the three stages, $T_c$, $T_o$, and $T_s$.
The inflection point $a$ in Equation 10 depends on the memory access patterns of applications. A correlation analysis among some critical compute- and memory-related hardware events (e.g., floating point operations, L2 cache misses per second, etc.) shows that its value is highly related to memory intensity: L2 cache misses (L2M) per instruction (INST). By applying linear regression, we can approximate the value of $a$ with the following:

$$a = \alpha \times \frac{\mathrm{L2M}}{\mathrm{INST}} + c_1$$

where $\alpha$ is a coefficient and $c_1$ is a constant, both determined using linear regression.
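As a rough illustration of this regression step, the sketch below fits a straight line relating memory intensity (L2M per INST) to observed inflection points using ordinary least squares. The helper `fit_line` and all sample values are invented for illustration; they are not measurements from the paper, and the paper does not state which regression tool was used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training samples: memory intensity and the inflection
# point (in idle cycles) observed for each sample.
memory_intensity = [0.01, 0.02, 0.04]
inflection_points = [90.0, 80.0, 60.0]

alpha, c1 = fit_line(memory_intensity, inflection_points)
print(f"alpha = {alpha:.1f}, c1 = {c1:.1f}")
```

The same helper could be reused for any other single-variable linear fit of this kind.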
The impact of memory throttling on the second segment is linear ($T = b\,\Delta f_m + t_0$). This can be explained with the COS Trace as follows. For a sufficiently large number of idle cycles, application memory accesses cannot overlap with computation. In this case, the length of the pure compute stages would not change with memory throttling; the length of the overlap stages would be zero; and the length of the pure stall stages would change linearly with the number of memory idle cycles inserted. Approximating the number of memory accesses with L2 misses, the impact on the COS Trace can be expressed as follows:
where $\Delta T_c$, $\Delta T_o$, and $\Delta T_s$ are the resulting changes to execution time for each respective phase. Thus, we can approximate the value of $b$ as follows: $b = \beta \times \mathrm{L2M} + c_2$, where $\beta$ is a coefficient and $c_2$ is a constant, both determined using linear regression.
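One way to read this argument is worked out below. It is a sketch under the stated assumptions, not necessarily the authors' exact expression. The symbol $k$ is hypothetical: it stands for the average time added per L2 miss for each inserted idle cycle, and the changes are taken within the linear segment, where overlap is already zero.

```latex
\begin{align*}
\Delta T   &= \Delta T_c + \Delta T_o + \Delta T_s \\
\Delta T_c &\approx 0
  && \text{(pure compute is unchanged by throttling)} \\
\Delta T_o &\approx 0
  && \text{(no overlap remains in this segment)} \\
\Delta T_s &\approx k \cdot \mathrm{L2M} \cdot \Delta f_m
  && \text{(each miss stalls longer per idle cycle)} \\
\Rightarrow\quad
\frac{\Delta T}{\Delta f_m} &\approx k \cdot \mathrm{L2M}
\end{align*}
```

Under this reading the slope grows with the number of L2 misses, which is consistent with fitting $b$ as a linear function of L2M.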
Based on the equations above, we can predict performance using Equation 10 as follows:
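Putting the pieces together, a prediction for one code region might look like the sketch below: the two regressions supply $a$ and $b$ from hardware counter values, and the piecewise form supplies the predicted time. The `RegionCoefficients` class, the `predict_time` function, and every numeric value are assumptions made for illustration; real coefficients would come from offline training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionCoefficients:
    """Per-region regression coefficients (hypothetical container)."""
    alpha: float  # coefficient on L2M / INST when estimating a
    c1: float     # constant when estimating a
    beta: float   # coefficient on L2M when estimating b
    c2: float     # constant when estimating b


def predict_time(threshold, t0, l2_misses, instructions, coef):
    """Predict execution time for a throttling threshold."""
    # Inflection point from memory intensity (L2 misses per instruction).
    a = coef.alpha * (l2_misses / instructions) + coef.c1
    # Slope of the linear segment from the L2 miss count.
    b = coef.beta * l2_misses + coef.c2
    # Assumption: threshold == a falls in the constant segment.
    if threshold <= a:
        return t0
    return b * threshold + t0


# Hypothetical coefficients and counter values for a single region.
coef = RegionCoefficients(alpha=-1000.0, c1=100.0, beta=1e-6, c2=0.0)
for threshold in (64, 100):
    t = predict_time(threshold, t0=5000.0, l2_misses=2_000_000,
                     instructions=100_000_000, coef=coef)
    print(threshold, round(t, 3))
```

Keeping the coefficients in a per-region object mirrors the statement that each region has its own trained coefficients.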
Offline Training and Online Prediction
We apply linear regression to approximate the model coefficients of Equation 11. The configuration space includes two parameters: the throttling threshold ($\Delta f_m$) and the number of threads ($\Delta t$). The threshold ranges from 0 to 126 idle cycles and the number of threads from 4 to 64 in steps of 4. The details of the training configurations and the overall configuration space are given in Table 3.
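The size of this configuration space is easy to enumerate. The sketch below assumes the space is the full cross product of the two ranges. The training subset shown is purely hypothetical, since the actual training configurations are those listed in Table 3.

```python
from itertools import product

thresholds = range(0, 127)        # 0 to 126 idle cycles, inclusive
thread_counts = range(4, 65, 4)   # 4 to 64 threads in steps of 4

# Assumption: every (threshold, threads) pair is a valid configuration.
config_space = list(product(thresholds, thread_counts))
print(len(config_space))  # 127 thresholds * 16 thread counts = 2032

# Hypothetical training subset, used only to illustrate the split.
training = set(product((0, 32, 64, 96, 126), (8, 32, 64)))
held_out = [cfg for cfg in config_space if cfg not in training]
print(len(training), len(held_out))
```

The held-out list corresponds to the configurations on which predictions would be compared against measurements.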
We use the five code regions of LULESH to train the model. Each region has its own trained coefficients. We use the model to predict the performance of the five code regions of LULESH for those configurations in the configuration space that are not in the training set. To measure the accuracy of the model, we compare these predicted values with the performance measured by running the same configurations on the machine. Figure 10 shows the average error and standard deviation of our model. Four of the five code regions show a reasonable average error, around 10% or less. Region 2, however, shows a large average error of 17%.
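A minimal sketch of how an average error and standard deviation could be computed from predicted and measured times follows. The text does not define the exact error metric, so the use of absolute relative error and of the population standard deviation are both assumptions, as are the sample values.

```python
from statistics import mean, pstdev


def relative_errors(predicted, measured):
    """Absolute relative error of each prediction against its measurement."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    return [abs(p - m) / m for p, m in zip(predicted, measured)]


def summarize_errors(predicted, measured):
    """Return (average error, standard deviation) over all configurations."""
    errors = relative_errors(predicted, measured)
    # Assumption: population standard deviation over the evaluated set.
    return mean(errors), pstdev(errors)


# Hypothetical predicted and measured times for three configurations.
avg, sd = summarize_errors([110.0, 95.0, 100.0], [100.0, 100.0, 100.0])
print(f"average error = {avg:.1%}, standard deviation = {sd:.1%}")
```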
There are several factors that affect the accuracy of the BG/Q implementation of the COS model. First, our experiments on BG/Q included prefetching, which makes the overlap time $T_o$ more complex, as observed on the Intel system. On BG/Q we used L2 cache misses to represent memory pressure, as on the Intel system. We could relax this requirement and use the number of load and store operations on BG/Q for our approximations. This may improve accuracy.
Second, in some cases the value of $a$ may not be strictly constant but a linear function with a small slope. The number of idle cycles that can be inserted (an integer) does not provide fine-enough granularity to estimate $a$ more accurately.
Third, we used a small number of sample configurations for training $a$ because we limited our configuration space to only a subset of the available number of threads. We expect that a larger space using all 64 configurations (from 1 to 64 threads) along the $\Delta t$ dimension would have resulted in better accuracy.
LIMITATIONS AND DISCUSSION
We have demonstrated the use and accuracy of the COS model for predicting the performance of a set of DVFS, DMT/DBT, and DCT configurations. The model can be used in a software or hardware implementation to allocate or deallocate resources to the working threads in a parallel application. This is an advantage to a parallel scheduler or runtime system.
As mentioned, the key concept of the COS model is the isolation of overlap. While in an abstract sense this is straightforward, we show in Sections 2.4-2.6 that empirically isolating overlap is fraught with challenges. We resolved a number of these challenges by assuming regular parallel applications, where threads are mostly homogeneous and computation proceeds in a bulk synchronous way with no other dependencies among threads.
This leaves a number of limitations to the model that must be addressed to consider irregular parallel codes (e.g., heterogeneous threads, asynchronous execution, cross-thread dependencies).
Overlap types. In our earlier discussions, we simplified the definition of overlap into computation overlap and memory overlap. In general, overlap can also occur between multiple threads on a single core/CPU or across multiple cores/CPUs accessing the same memory. Under our assumptions, these do not affect the COS Trace much, but they must be considered for irregular parallel codes.
Computational intensity. Computational intensity (CI) has an impact on the overlap, as discussed earlier. A key insight gained from this work is that CI can be used to predict the impact of simultaneous configuration changes in CPU, memory, and threads on overlap. More overlap types, combined with irregular codes, are likely to make accurate prediction more challenging. There could also be non-CI effects that we have not accounted for in parallel irregular applications.
Role of co-design. These challenges could be alleviated somewhat by improvements in our ability to directly measure the overlap of parallel codes. This could be accomplished in software, but would be most effective when co-designed with hardware. Our work indicates that there are meaningful representative hardware counters that give insight into overlap and computational intensity, but they are indirect at best. Furthermore, this data is usually limited to an individual thread with no context for other concurrent threads on the same core or CPU. Mechanisms for tracking this type of information could vastly improve our understanding of overlap as well as our ability to optimize parallel applications and systems.
RELATED WORK
To the best of our knowledge, this work is the first to propose an analytical performance model that captures the simultaneous effects of DVFS, DCT, and DMT/DBT on the performance of multithreaded applications on real systems. Table 4 provides a synopsis of work most closely related to ours. There has been extensive work focused on modeling the effects of CPU DVFS on performance using stall-based approaches [22]; leading loads [16, 22, 33]; CRIT-BW, a leading load derivative [31]; and DEP-BURST, a CRIT-BW derivative [3]. While all of these consider the effects of out-of-order execution and non-blocking caches, only DEP-BURST considers multithreading, using a critical path analysis to determine which core to boost. Table 4 shows how the resulting CPU DVFS performance models capture the characteristics of the COS model: $T_c$, $T_o$, and $T_s$ from Equation 1. Stall-based approaches assume that CPU DVFS affects overlap time in the same way it affects pure compute time and thus combine $T_c$ and $T_o$. They also assume that pure stall time ($T_s$) is constant with changes in CPU frequency; this is in direct contrast to our findings that stall time is affected by CPU frequency (see Figure 2). The leading load model and its derivatives combine overlap time ($T_o$) with pure stall time ($T_s$) and assume the combined value is constant while $T_c$ scales with CPU frequency. This assumption leads to inaccuracies since the impact of CPU frequency on $T_o$ can be quite different from its impact on $T_c$ and $T_s$, as we discussed in Sections 2.4-2.7.
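The difference between these two families of assumptions can be made concrete with a small sketch. It assumes that any frequency-dependent component scales inversely with CPU frequency, which is a common simplification introduced here rather than a formula from the cited papers. The function names and input values are hypothetical.

```python
def scale_with_frequency(time, f_old, f_new):
    """Assumption: frequency-dependent time scales inversely with frequency."""
    return time * f_old / f_new


def stall_based_prediction(t_c, t_o, t_s, f_old, f_new):
    """Stall-based view: T_c and T_o scale together, T_s stays constant."""
    return scale_with_frequency(t_c + t_o, f_old, f_new) + t_s


def leading_load_prediction(t_c, t_o, t_s, f_old, f_new):
    """Leading-load view: T_c scales, the combined T_o + T_s stays constant."""
    return scale_with_frequency(t_c, f_old, f_new) + t_o + t_s


# Hypothetical breakdown (seconds) at 2.0 GHz, predicted for 1.0 GHz.
t_c, t_o, t_s = 4.0, 2.0, 2.0
print(stall_based_prediction(t_c, t_o, t_s, f_old=2.0, f_new=1.0))
print(leading_load_prediction(t_c, t_o, t_s, f_old=2.0, f_new=1.0))
```

The two predictions differ only in how the overlap term is grouped, so the gap between them widens as overlap takes up a larger share of execution time.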
Su et al. [35] is the only work we know of that implements the leading load model on real systems. This is the most accurate model available for a real system, but it only models DVFS on AMD architectures.
The COS model implementations in this paper are as accurate as or better than this model, and they capture the combined effects of DVFS, DMT/DBT, and DCT across multiple architectures. Su et al. also showed that the leading load approach is less accurate for memory-intensive applications and that the accuracy of the leading load model is highly dependent on the level of memory boundedness; these match our findings as well. Table 4 also shows a comparison with memory power performance modeling tools. Deng et al. [12, 13] presented a performance model for memory frequency scaling (MemScale and MultiScale) of single-threaded applications on in-order processors. They made assumptions similar to those in the CPU DVFS models, namely that the overlap time ($T_o$) is combined with pure compute time ($T_c$). Deng et al. [11] created CoScale to extend MemScale to consider DVFS. The accuracy is very good for single-threaded applications on in-order processors, but the limiting combination of $T_o$ and $T_c$ remains.
Sundriyal and Sosonkina [36] proposed the "Joint" performance model that considers the simultaneous effects of CPU DVFS and DMT. However, the model estimates $T_o$ as a constant for all applications on a single system. This contradicts our findings that overlap is affected by CPU frequency (see Figure 2).
Less directly related work relevant to our discussion includes the following: David et al. [10] investigated the impact of memory frequency scaling on power and performance and proposed a model for real systems; Li et al. studied the throttling interface on IBM BG/Q systems and demonstrated its ability to optimize system efficiency [29]; Ercan et al. [14] presented a heuristic runtime solution for coordinating CPU and memory frequencies to improve energy efficiency; Curtis-Maury et al. created heuristic models that manage DVFS and DCT simultaneously for multi-threaded applications [8, 9, 20, 25, 30].
CONCLUSIONS AND FUTURE WORK
In this paper, we propose the COS model of parallel performance to accurately capture the combined effects of DVFS, DMT/DBT, and thread concurrency on real systems. We applied the COS model to both Intel and IBM architectures, predicting performance to within 7% and 17%, respectively, for a set of 19 important applications. The key insight of the COS model is the separation of memory and compute overlap from pure compute and pure memory stalls. This separation enables more accurate approximations and a straightforward methodology that is capable of modeling the complexity introduced with concurrency. A key limitation of the model is its focus on structured parallel codes, which, while representative of many important applications, precludes accurate use on irregular parallel codes for now. Despite the limitations, we provide strong evidence that the fundamental focus on overlap in the COS model will be key to steering future high-performance systems and applications to maximize their efficiencies. In future work, we plan to explore extending the COS model to irregular parallel applications in both OpenMP and MPI. We also plan to adapt the techniques described for use in runtime systems.
