Graphics processing units (GPUs) support dynamic voltage and frequency scaling to balance computational performance and energy consumption. However, simple and accurate performance estimation for a given GPU kernel under different frequency settings is still lacking for real hardware, which is important to decide the best frequency configuration for energy saving. We reveal a fine-grained analytical model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Over a 2× range of both core and memory frequencies among 20 GPU kernels, our model achieves accurate results (4.83% error on average) with real hardware. Compared to the cycle-level simulators, our model only needs simple microbenchmarks to extract a set of hardware parameters and kernel performance counters to produce such high accuracy without kernel source analysis.
I. INTRODUCTION
Recently, graphics processing units (GPUs) have become widely used from deep learning (DL) workstations to highperformance supercomputing centers. In particular, the most popular DL toolkits [1] , [2] , [3] , [4] heavily rely on the remarkable computational power of GPUs. However, due to the rapidly increasing computation requirements of both DL toolkits and other GPU applications on large amounts of data, the total energy consumption can be very high. This, not only results in high electricity budgets, but also violates green computing. For example, the supercomputer Titan accelerated with the NVIDIA K20× requires a power supply of 8.21 million watts with an annual electricity cost of approximately 23 million dollars [5] . Decreasing 5% of the power consumption can reduce electricity costs by up to 1 million dollars. Effective energy-saving techniques are emergent for GPUs.
Energy conservation techniques in modern computers are generally based on dynamic voltage and frequency scaling (DVFS). GPUs usually support simple automatic voltage and frequency adjustment to save power and protect the hardware. Nevertheless, GPUs hardly gain the best energy efficiency under the default voltage and frequency settings [6] , [7] and still have potentials of energy conservation. To find the most energy-efficient DVFS configurations, energy consumption under different DVFS settings should be predicted, which requires modeling both the performance and runtime power of GPUs under various voltage and frequency settings. In this paper, we address the performance modeling problem.
There are two main challenges to GPU performance prediction under different core and memory frequencies. First, compared to traditional CPUs, GPUs have a much more complex memory hierarchy of which GPU vendors reveal few details. Second, GPUs have two independent frequencies belonging to core and memory respectively, which affect different GPU components. They may have different performance effects on different GPU applications.
Some state-of-the-art work has revealed the analytical pipeline GPU performance model [8] , [9] , [10] , [11] , which emphasizes the relationship between compute cycles and memory latency. However, there still exist some opportunities to reinforce the model. First, the L2 cache becomes larger and larger over the evolutionary GPU generations. Larger cache can generally increase the cache hit rate, hence the model should explicitly consider the impact of L2 cache on data access latency and throughput. Second, most of the previous models only work under default GPU frequency settings. Kernel behavior may change significantly when the core and memory frequencies are adjusted.
Simulation methods [12] , [13] , [14] expose sufficient details to facilitate an understanding of the execution of GPU kernels. The best available simulator to date [14] combines performance counters and specific hardware information to estimate the kernel execution time with high accuracy. However, compared to the fast-evolving GPU generations, such simulators still stand for the earlier GPU architecture, such as Fermi, which does not meet the great changes of newer hardware. Furthermore, these simulators usually consume much longer times than real hardware, complicating application to real-time power-performance optimization.
Recent GPU performance models [15] , [16] , [17] , [18] , [19] have also witnessed machine learning method trends, such as K-means, multiple linear regression and neural networks, and have obtained considerable accuracy. However, few of them have introduced frequency scaling as an impact factor. Related work has strongly relied on training data, such as specific performance counters and kernel settings. Although they can reveal some correlations between the input parameters and execution time, how the parameters interact and contribute to the final time prediction requires further exploration.
We believe that a fast and accurate GPU performance model is a key ingredient in energy conservation with the DVFS technique and that it should be applicable to real hardware. We first attempt to model the memory system of a GPU with a simplified first-come-first-served queue in which the service rate depends on the memory frequency. Based on this, we propose a GPGPU performance estimation model that considers both core and memory frequency scaling. Notably, this is the first performance modeling paper that explicitly considers both core and memory frequency scaling. We make the following contributions: 1) We model the memory system of a GPU with a simplified queuing model related to frequency. 2) We establish an analytical GPU performance model with both core and memory frequency scaling. 3) On real GPU hardware, our performance model achieves a mean absolute percentage error (MAPE) of 4.83% across 36 frequency settings with up to 2× scaling among 20 kernels. We also achieve MAPEs of 0.73% to 11.2% for each single kernel, indicating the great accuracy and low variance of our performance model. The rest of this paper is organized as follows. Section II lists some related works which study GPU performance modeling and the effects brought by DVFS techniques. Section III details our memory queuing model for a GPU with frequency scaling. Section IV proposes our GPGPU performance estimation model with both core and memory frequency scaling. Section V describes our experimental setup and presents the experimental results. Finally, Section VI states our conclusion and addresses future work.
II. RELATED WORK
To derive an accurate performance model of GPUs, it is quite important to understanding its complex memory hierarchy. Henry Wong et al. [20] and develop a micro-benchmark suite and measure some characteristics such as cache structure and latency of various memory types, TLB parameter, latency and throughput of arithmetic and logic operations. Meltzer [21] extended similar work on Tesla C2070. In addition, Xinxin Mei et al. [22] , [23] also conduct similar dissection but address more on memory hierarchy. They propose a fine-grained Pchase method to explore the cache parameters with uncommon structure and replacement policy which appears in the latest generation of GPU (Kepler and Maxwell). However, such methods usually test a single kernel with only few threads executing one type of instructions. When thousands of threads access memory simultaneously, which generally happens in GPU applications, the memory bandwidth might not satisfy the demands and some operations would be stalled in the queue of memory controller (MC). Such cases lead to high variance in memory access latency.
Hong et al. [8] , [24] proposed an analytical model by estimating different degree of memory parallelism and computation parallelism with some offline information of the kernel program. Furthermore, Sim et al. [25] improves the above MWP-CWP model by considering cache effects, SFU characteristics and instruction throughput. However, their methods ignore the effects of shared memory latency and DRAM memory latency divergence, which may bring some significant biases in some memory-bounded application. Song et al. [17] extend their models and address more on different types of memory access by collecting some simple counters. However, the model averages the cache effects among all the warps and potentially ignores memory latency divergence in some asymmetric applications.
Nath et al. [9] present CRISP model which analyze the performance in the face of different frequencies of compute cores. They pointed out that DVFS on GPU is different from that on CPU since computation operations and memory transactions from different threads can overlap in most of the time. Based on the characteristics of GPU performance with varying frequencies found from experiments, they classify different execution stages in the kernel program and compute them with various frequencies. However, CRISP only works for the case of either scaling down the core frequency or scaling up the memory frequency. Also the model may be more complicated if considering the memory frequency. Joao et al. [26] proposed a GPU power estimation model with both core and memory frequency scaling. They designed well crafted microbenchmarks to extract the model parameters of each GPU components under the default frequency setting. Then they attempted to predict the power consumption of an application under over a wide range of frequency scaling.
Recent state-of-art works witness the advantages of machine learning methods on GPU performance and power modeling. Gene Wu et al. [15] built a performance model based on different patterns of scaling with various core frequency and memory frequency. He firstly adopted K-means to cluster different patterns of scaling behavior among 37 kernels and then explored the relationship between performance counters and clustering patterns with ANN modeling. With the model trained with large amount of data, one can predict the performance of one kernel under any setting of core frequency and memory frequency with the predicted scaling pattern. Wang et al. [27] adopted SVR algorithm to estimate GPU power consumption considering both core and memory frequency scaling.
III. MEMORY MODELING WITH FREQUENCY SCALING
In the previous performance modeling work, memory latency is often characterized as a constant parameter obtained by micro-benchmarking. However, the memory latency of each thread may vary if the memory system is saturated. Memory latency can also change with different frequency settings. To start our performance modeling work, we first estimate the total execution time of pure DRAM access (T Gm ) with a simplified queueing model. We do not adopt a classical M/D/1 queueing model for two reasons. First, the number of memory transactions of one GPU kernel is finite. In this case, the service rate (μ) is allowed to be smaller than the arrival rate (λ). Second, instead of modeling the expected waiting time and service time of each coming request, we study how the total execution time of finishing all of the requests changes with different memory frequencies.
A. DRAM latency
The SMs in GPUs execute threads in groups of 32 threads called warps [28] . When a thread warp launches a memory request to the global memory, it usually takes hundreds of cycles to go through DRAM if the data are not cached. This minimum latency happens when the memory system is idle and only contains the overhead of path traveling between SMs and DRAM, and data access in DRAM. Fig. 1 shows this case. The inter-arrival intervals between two consequent memory requests are longer than the time of loading data from DRAM. Thus, each memory request only costs the minimum latency. To compute T Gm in this case, we only need to focus on how many memory requests are executed by a warp. As shown in Fig. 1 , T Gm can be calculated using Eq. (1). λ denotes the arrival rate of the coming memory requests. N W denotes the total warp number. L Dm denotes the minimum memory latency with no memory contention. N GT denotes the number of global memory transactions of one warp. 
If the memory system is saturated due to intensive memory requests, the minimum latency can hardly be achieved. Most of the requests must wait in the queue until the previous ones have been finished. In Fig. 2 , each memory request is launched with a very short interval so that each memory request takes both the minimum latency and the queuing delay, which is the waiting time in the queue. Thus, intensive memory access demands can lead to diverse memory latency. In this case, we can calculate the total time consumption using Eq. (2). D Dm denotes the service time of one memory transaction.
We design the following experiments to validate our above analysis. The pseudocode is shown in Listing 1. We launch a GPU kernel that only loads data from DRAM. First, the kernel records the SM index and the warp index of each thread. Each thread then executes several rounds of DRAM loading operations with coalescing access. We define two global variables for different experimental settings. STRIDE defines the number of DRAM loadings of each thread. RECNUM defines the time records of each thread. We exclude the codes for storing the results to global memory due to space limitation. To avoid data pre-fetching [23] and L1/L2 cache hit, we allocate the maximum number of threads for each SM and assign a proper STRIDE to generate a large enough data array. We first adopt the codes in Listing 1 to determine how global memory latency changes with increasing warp number. To reduce the overhead of clock() and time recording in the global memory as much as possible, we set RECNUM to 1 and STRIDE to 4. Fig. 3 shows our experimental results. We can make two inferences from the results. First, memory latency can be diverse due to intensive requests. Second, memory latency is somehow linearly correlated with warp numbers, which is consistent with the model in Fig. 2 We then explore how frequency scaling affects L Dm and D Dm . We again utilize the code in Listing 1 to measure the L Dm under different memory frequencies. We find that it can be fitted by Eq. (3a) with 0.9959 R-squared.
As for DRAM delay measurement, we comment out the clock() function in Listing 1 and set STRIDE to 8 to achieve as high of a DRAM bandwidth as possible. To calculate D Dm in cycles, we transfer the total execution time into cycles by multiplying the time and the core frequency and calculate the number of memory transactions of all of the warps. We can then determine D Dm using Eq. (2) with the obtained L Dm . Similar with memory latency fitting, we observe that the DRAM delay has strong correlation with memory frequency and can be fitted with using Eq. (3b) with 0.999 R-squared. 1 # d e f i n e STRIDE 4 2 # d e f i n e BLOCK SIZE 256 3 
B. L2 cache
With the development of GPU generations, L2 cache has become increasingly larger (e.g., from 512 KB for Fermi GTX560Ti to 2 MB for Maxwell GTX980) to reduce the pressure on DRAM. As mentioned before, different cache hit rates may bring different sensitivity to frequency scaling. Similar to the DRAM measurement experiments, we use the same global latency code to obtain the L2 cache latency (L l2 ) under different frequency settings. We observe that the latency is always within 220 to 224 cycles. This is reasonable, as the L2 cache is affected by core frequency. Thus, we take the average 222 cycles for L2 minimum latency. We also set the value of D l2 to 1 due to the fact that the L2 cache can return a memory request per core cycle.
C. Adjustment with Frequency Scaling
For simplicity, our performance model defines a baseline frequency setting in which the ratio of memory frequency to core frequency is 1. We measure the basic latency and throughput of all of the components including core computation, shared memory access, constant memory, L2 cache and DRAM under baseline settings. We can use the standard average memory access time [29] model to obtain the average global memory access latency (L Gm avg ) and the average queueing delay (D Gm avg ) of all of the global memory transactions occurring during kernel execution with Eqs. (4a) and (4b). γ denotes the L2 cache hit rate of the kernel. f SM and f MEM denote the core and memory frequencies, respectively.
As we calculate the execution time in the scope of core frequency, there is no extra adjustment to the latency and throughput in SMs.
IV. GRAPHICS PROCESSING UNIT PERFORMANCE MODELING WITH FREQUENCY SCALING
Generally, a GPU can launch a large number of warps throughout the kernel execution period. However, due to the hardware resource limitation, one SM can only execute a certain number of warps concurrently. These warps are called active warps and the number of active warps per SM is denoted by N W act . Once we obtain the time consumption of one round of active warps (T act ), the total execution time of a kernel can be estimated using Eq. (5) . N B denotes the total number of thread blocks. N W pb denotes the number of warps per block. N SM denotes the number of SMs.
A GPU kernel can be divided into several segments. Although some segments do not access shared memory, some do. As they are influenced by different frequencies and usually have different working patterns, we classify the GPU kernel segments into two categories by whether shared memory transactions occur during kernel execution. Some kernels also utilize texture cache for further performance improvement and may affect the accuracy of our model. We leave this for future improvement work on our current model.
A. Performance Modeling without Shared Memory
The first case of our model is that the GPU kernel does not utilize shared memory. In this case, the kernel only contains computation in the SM and global memory transactions. As described in Section III, we can estimate the time consumption of global memory transactions. For the computation, we simply assume that the same computation time occurs before each global memory transaction, as shown in Fig. 4 . We divide the total number of compute instructions (N comp ) by the total number of global memory transactions (N GT ) to obtain the average compute instructions per global memory transaction (N comp avg ). The average computation time (T comp avg ) before each global memory transaction can be estimated using Eq. (6) .
If the kernel launches many compute instructions and few memory requests that do not saturate the memory bandwidth, the computation period will dominate the total kernel execution time. However, if the memory access latency is somehow much longer than the computation period due to intensive memory requests, the computation period can be hidden by memory operations. The first case is regarded as the computedominated or compute-bound kernel, whereas the second is the memory-dominated or memory-bound kernel.
Compute-dominated kernel: When there are enough compute instructions to be issued and the memory requests are not too intensive due to long computation periods, the global memory latency can be hidden, as illustrated by the upper part of Fig. 4 . We can estimate T act using Eq. (7):
where T comp avg ≥ D Gm avg and T comp avg × (N W act − 1) ≥ L Gm avg . R o denotes the repeat times of a pair of computation period and global memory transaction period.
Memory-dominated kernel: When the memory bandwidth is saturated or there are not enough warps to issue compute instructions, one memory request must wait until all of the outstanding requests have been finished. The bottom part of Fig. 4 demonstrates this case. Similar with the case in Fig. 2 , we can regard the compute cycles as the inter-arrival time of two consequent memory requests. We can estimate T act using Eq. (8) by focusing on the D Gm avg of each warp:
where upper part of Fig. 5 . We can estimate T act using Eq. (9) for this case:
where T comp avg ≤ D Gm avg and T comp avg +L Gm avg ≤ D Gm avg ×(N W act −1). When T comp avg is longer than D Gm avg , the memory requests of each warp do not have queuing delay, as shown in the bottom part of Fig. 5 . We can estimate T act using Eq. (10):
B. Performance Modeling with Shared Memory
Shared memory plays an important role in GPU performance optimization, as it has lower latency and higher throughput than DRAM and can be shared among the threads within the same block. Its latency and throughput are affected by core frequency. One GPU kernel may have different patterns of shared memory utilization, which complicates the performance estimation. Basically, there are three phases for this kind of kernel. At the beginning, all of the warps send memory requests to the global memory and then store the data in the shared memory. Second, threads within the same block access shared memory for computation. Finally, the results are written back to global memory. According to the access workload to shared memory, we design the performance models for two cases as described below.
Shared memory requests are infrequent: For kernels that only have few shared memory requests, as the latency of shared memory access is much lower than that of global memory access, they can often be hidden by the later one. Fig.  6 shows this case. In the first phase, all of the warps launch global memory requests and store the data in shared memory, resulting in heavy traffic in DRAM. In the second phase, each warp only executes one shared memory access consuming only a few cycles. Then, each warp writes the results back to the global memory, which again launches numerous global memory transactions. The pattern is similar to the bottom part of Fig. 4 , except that it is shared memory latency to be hidden. We can estimate the execution time using Eq. (11):
where T comp avg ≤ D Gm avg and (T comp
. L sh denotes shared memory latency. Transpose with coalesced optimization is one instance of this case. As the shared memory latency can be hidden, the kernel is not sensitive to core frequency, but to memory frequency, which also meets the results of our previous motivating examples. Shared memory requests are intensive: If shared memory access is intensive, latency can contribute significantly to the final execution time. Fig. 7 shows matrix multiplication optimized with shared memory as an example. Due to the function of synchronization, we can regard this procedure as a repetition of two phases. First, all of the warps launch global memory reading and store the data in shared memory, which is somehow similar to the case in the bottom part of Fig. 4 . In the second phase each warp loads data from the shared memory multiple times, which takes a considerable amount of time and can no longer be hidden by global memory transactions. Thus, the total execution time of one round can be calculated using Eqs. (12)∼ (14) . R in denotes the number of shared memory transactions in the second phase.
MMS is one instance of this case. Its performance is sensitive to both core and memory frequency which is revealed in our previous motivating examples and can be explained by our model. First, T 1 contains a large number of global memory transactions which makes the execution time sensitive to memory frequency. Second, although shared memory latency is much shorter than global memory latency, nearly 3 dozen shared memory requests in T 2 also contribute a lot to the final T act , which also makes the execution time sensitive to core frequency. These two cases can be adopted to many classical GPU kernels. Although there are other more complicated irregular instances such as MC EstimatePiInlineP and reduction, the similar phase partition methodology can apply to them. We leave detailed analyses of such irregular kernels for future work. 
V. EXPERIMENTS
A. Experimental Methodology
With the help of the NVIDIA Inspector [31] , we can fix the performance state and adjust the core frequency and memory frequency of the GPU within a certain range. Using this method we can obtain the execution time data with certain frequency combinations of GPU. We cover both core frequency and memory frequency at a 2× range of scaling from 500 MHz to 1,000 MHz with a step size of 100 MHz, such that 36 frequency configurations are tested in total. We validate our model among 20 realistic GPU kernels from CUDA SDK 8.0 [32] and Rodinia [33] , listed in Table III on a real NVIDIA Maxwell GTX980. Hardware specifications of our test machine are listed in Table II . These benchmark applications cover a wide range of execution patterns, such as DRAM intensive, L2 cache intensive, shared memory intensive and computation intensive. They also have different kernel grid and block settings, which implies various GPU workloads and different magnitudes of kernel execution time. With these experimental settings, we have a total of 720 unique experiments. We repeat each experiment for 1,000 times and report the MAPEs. We use the NVIDIA Profiler [30] to extract the performance counters at the baseline core and memory frequency to drive our model. Note that we only need one time data collection, which makes our model run fast. Our current model does not consider the effects of translation look-aside buffer misses, cold cache misses of DRAM and the instruction fetching 
B. Experimental Results
First, we would like to observe the instruction distributions of different GPU kernels to facilitate an understanding of the sources of prediction errors. As Fig. 8 demonstrates, our tested kernels have various partitions of different types of instructions, suggesting that we attempt to design a general model for different types of GPU kernels. In addition, such instruction statistics help us locate the principle contributors to the execution time under certain frequency settings. Fig. 9 shows the MAPE of each tested GPU kernel across 36 frequency configurations. Our performance model estimates achieve high accuracy with a MAPE of 4.83% across 720 unique experiments. For each kernel, the MAPEs range from 0.73% to 11.2% as shown in Fig. 9 . Even nearly 90% of the predictions have below 10% absolute precision errors.
A heat-map of time prediction absolute errors is shown in Fig. 10 . Each entry is the average error of one frequency setting among all of the tested kernels. High error values gather in the lower triangular area, which implies that the model reveals larger biases when applying low core frequency along with high memory frequency. Within this area more kernels become compute-bound and shared-memory-intensive, such as MMS. This may result from the assumption that the shared memory transactions within one warp are handled in order without any queuing delay. Nevertheless, we can still achieve no more than 11% error for these extreme cases.
VI. CONCLUSION
We demonstrate a GPGPU performance predictor for a wide range of core and memory frequencies. We first estimate the total time consumption of multiple memory requests under different frequency settings, then take the profiling data of a given kernel under the baseline frequency setting as the input to estimate the performance at other frequencies. Our model can quickly and accurately predict the execution time of a GPU kernel on real hardware, which is important to deriving real-time energy conservation suggestions with DVFS techniques. We show that our performance estimation method can achieve a MAPE of 4.83% across up to 2× both core and memory frequencies scaling among 20 realistic GPU kernels. Our experimental results also indicate that our model can catch the performance scaling behaviors of DRAM, L2 cache and shared memory very precisely.
As for future work, we have two directions for improvements. First, our model does not explore too much about shared memory as we treat on DRAM and does not take texture/L1 cache and constant memory into account, which may introduce larger errors for kernels containing access requests to them. Second, collaborating with GPU power models, building a real-time voltage and frequency controller for GPUs based on energy conservations strategies with DVFS techniques may prove to be a remarkable project.
