Abstract-Energy consumption has become a major concern in recent years, and developers need to take energy-efficiency into account when they design algorithms. A design needs to be energy-efficient and low-power while still achieving the attainable performance provided by the underlying hardware. However, different optimization techniques have different effects on power and energy-efficiency, and a visual model would assist in selecting among them.
I. INTRODUCTION
Energy and power consumption play a significant role in High Performance Computing (HPC) systems, especially with recent advances in hardware design [1]. The current trend in HPC systems is towards energy-efficient designs [2]. Taking Tianhe-2, the fastest supercomputer in 2015, as the reference point, supercomputers need to become 26 times more power-efficient with respect to their performance [3] so that the 20 MW power limit set by the DoE is not exceeded. Furthermore, the current trend in datacenter design leads to a 400% increase in the cost of cooling and power, and these costs are expected to continue to rise [4]. To that end, not only do hardware designers need to consider energy consumption, but application developers and algorithm designers are also required to treat energy and power consumption as one of their design factors. The main goal of this paper is to demonstrate visually, in a single model, the relation among three pillars: performance, power consumption, and energy-efficiency of kernels, in order to provide developers with insightful information.
Our model is inspired by the classic Roofline model [5], which serves as a visual representation of the operational intensity of a kernel against the maximum attainable performance of the hardware. In that model, Williams et al. showed how peak performance and peak attainable memory bandwidth relate to each other in a system. For small values of operational intensity, due to the lack of optimization of memory operations, a kernel is bounded by the memory bandwidth of the system. For kernels that are not inherently memory-bound, improving locality turns them into compute-bound ones. In general, one can use the roofline model to find bottlenecks in a kernel and to choose a suitable technique to improve its performance.
The roofline model incorporates both computation bounds and memory bounds into one model. Consequently, one can inspect the model and identify proper optimization techniques. However, the model does not provide insights into the power limitations of the system. It implicitly assumes an unlimited source of energy and no concern about the amount of power and energy consumed, which is certainly incorrect for modern HPC centers. Therefore, a model that relates optimization techniques to energy consumption would be useful to developers.
Our contributions in this paper are as follows:
• Insightful visual representation: Our model provides an insightful visual representation of power consumption with respect to energy efficiency in a system. Taking both power consumption and efficiency of computation and memory into consideration, we identify whether a kernel is power-bound (power-hungry) or compute-bound.
• Energy efficiency: To characterize the efficiency of our architecture, we define energy efficiency as flops per Joule (J). We demonstrate the trade-off between energy-efficiency and power consumption in our model.
• Effects of optimization techniques on power and energy efficiency: Through our models, developers can understand how optimization techniques affect power and energy-efficiency and which technique has more impact on a kernel.
II. CONSTRUCTION OF ROOFLINE MODELS
For a simple von Neumann architecture [6], the energy consumption is modeled as:
where $E_{flops}$ and $E_{mem}$ stand for the total energy consumed by floating-point computations and memory operations, respectively. $E_0$ is the constant energy required for our system to work. Our assumption is that $E_0$ remains constant during the execution time. Table I describes the symbols used in this paper. Starting from Eq. 1, we rephrase it as follows:
where $T$, $W$, $Q$, and $E_C$ stand for the execution time, the number of floating-point operations, the number of memory operations, and the energy consumed by our kernel, respectively. Using the simple mathematical model in Eq. 2, the power consumption, $P$, and the reciprocal of energy efficiency, $E_W = 1/EE$, can be written as follows:
Combining Eq. 3 and 4 results in the following linear equation between $P$ and $E_W$:
On the other hand, under a specific optimization strategy, the peak value of power consumption can be represented as $P_{peak}$. Thus, a roofline model relating $E_W$ and $P$ can be defined as follows:
where $\pi$ is the performance of the system (defined as FLOP per second). Using the same approach, one can find a similar relationship between energy per byte and power consumption, as presented in Eq. 7,
in which $Q_t$ and $E_Q$ are the memory bandwidth ($BW$) and the energy required to transfer all data to/from main memory on the device, respectively. Combining this equation with the peak value of power consumption leads to the following relationship between $P$ and $E_Q$, which represents a roofline model for the performance of the memory subsystem. Equation 8 shows this roofline model.
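Read together, the definitions above imply the chain of relations sketched below. This is a reconstruction from the prose under stated assumptions, not the typeset equations themselves: it assumes that $E_W = E_C / W$ is the energy per floating-point operation, that $E_Q = E_C / Q$ is the energy per byte, and that $Q_t = Q / T$ equals the memory bandwidth $BW$. The per-operation form of Eq. 2 is omitted because its symbols are defined in Table I.

```latex
\begin{align*}
E   &= E_{flops} + E_{mem} + E_0
      && \text{(energy decomposition)} \\
P   &= \frac{E_C}{T}, \qquad E_W = \frac{E_C}{W} = \frac{1}{EE}
      && \text{(power and energy per flop)} \\
P   &= \frac{W}{T}\, E_W = \pi\, E_W
      && \text{(linear relation, slope } \pi \text{)} \\
P   &= \min\left(P_{peak},\ \pi\, E_W\right)
      && \text{(computational roofline)} \\
P   &= \frac{Q}{T}\, E_Q = Q_t\, E_Q
      && \text{(energy per byte, slope } Q_t = BW \text{)} \\
P   &= \min\left(P_{peak},\ BW \cdot E_Q\right)
      && \text{(memory roofline)}
\end{align*}
```

Under these assumptions, each ceiling is a straight line through the origin whose slope is the achieved performance (or bandwidth), capped by the horizontal line at $P_{peak}$.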
A common metric for measuring energy-efficiency in HPC systems is performance per watt [7], [8], where performance is defined as "useful work" per second. In scientific computing, work is measured as the number of arithmetic operations, while in graph traversal algorithms it could be defined as the number of traversed nodes in a graph [6]. Therefore, in this paper, computational energy-efficiency is defined by the following equation:
Eliminating $t$ in Eq. 9 defines energy efficiency as the total number of floating-point operations per Joule. It shows that $EE$ is in fact the reciprocal of the X-axis in our model. Similarly, the energy efficiency of the memory subsystem can be defined as memory bandwidth over consumed power, as shown in Eq. 10.
As with Eq. 9, Eq. 10 is the reciprocal of the X-axis of the roofline model established for the memory subsystem. Section III shows how Eq. 6 and 8 help us build our models for the two NVIDIA GPUs that we consider in our experimental analysis.
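The definitions in this section can be checked numerically with a minimal Python sketch. The function names and the sample values (2e9 FLOP, 0.5 s, 50 J, a 150 W peak) are hypothetical and chosen only for illustration; they are not measurements from our experiments.

```python
def energy_efficiency(flops, energy_joules):
    """Computational EE as floating-point operations per Joule (W / E)."""
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return flops / energy_joules


def energy_efficiency_from_rates(perf_flops_per_s, power_watts):
    """Same quantity written as performance per watt (pi / P)."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return perf_flops_per_s / power_watts


def memory_energy_efficiency(bandwidth_bytes_per_s, power_watts):
    """Memory-subsystem EE as bytes per Joule (BW / P)."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return bandwidth_bytes_per_s / power_watts


def power_ceiling(energy_per_unit, slope, p_peak):
    """Roofline value min(P_peak, slope * x).

    x is energy per flop (slope = FLOP/s) or energy per byte (slope = B/s).
    """
    return min(p_peak, slope * energy_per_unit)


# Hypothetical kernel: 2e9 FLOP in 0.5 s while consuming 50 J.
W, T, E = 2.0e9, 0.5, 50.0
P = E / T            # average power in watts
pi = W / T           # performance in FLOP/s
E_W = E / W          # energy per flop, the X-axis of the computational model

# Both forms of EE agree, and EE is the reciprocal of E_W.
ee = energy_efficiency(W, E)
```

With these numbers, `energy_efficiency_from_rates(pi, P)` returns the same value as `ee`, and `power_ceiling(E_W, pi, 150.0)` reproduces the measured power because the point lies on the sloped part of the ceiling.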
III. DETERMINING CEILINGS FOR POWER AND EE
The roofline models that we present in this paper fall into two groups: the first relates the energy-efficiency (EE) of the computational subsystem to power, and the second relates the EE of the memory subsystem to power. Our models provide upper-bound values for power and EE. This helps developers understand which optimizations would result in lower power consumption and/or better EE of the kernel. In other words, once a kernel has been executed and its power and EE levels collected, our model aims to determine the optimizations that should be implemented to make this kernel consume less power and/or be more energy-efficient. Since there is a tradeoff between power consumption and EE, it is a challenge to identify one unique technique that improves both simultaneously.
Each optimization technique is represented as a ceiling in our model that designates the effect of applying the technique. The power and EE gap between two consecutive ceilings shows how much we gain or lose by enabling the associated technique. Figure 1 depicts the computational and memory roofline models for single- and double-precision computations on the NVIDIA GeForce GTX 970. To study the effect of computational performance, we implemented a simple reduction kernel in CUDA that computes the dot product of two large arrays (each with 67,108,864 elements). At first, we changed the number of threads and blocks. In the following figures, these configurations are represented as t × b, where t and b refer to the number of threads in a block and the number of blocks, respectively. Each is expressed as a fraction of its peak value. Figure 1 demonstrates that, for single-precision computations, increasing the total number of threads results in a more energy-efficient kernel at the cost of a few Watts. This also holds when the number of blocks is increased by an order of magnitude; smaller increases do not help EE. The same trend is noticeable for double-precision computations.
As the next step, we studied the effect of enabling fused multiply-add (FMA) operations. The FMA ceiling represents this optimization while the numbers of threads and blocks are set to their peak values (1x32). Figure 1 shows that enabling FMA has no significant effect on power and EE.
The last step was to investigate the effect of instruction-level parallelism (ILP) through unrolling the main loop and maintaining partial sums. The effect of this optimization is easily spotted for single-precision computations. For both precisions, enabling ILP clearly enhances EE while consuming only a few extra Watts. The cases above show the results of studying the FMA and ILP techniques independently of each other. We also enabled them simultaneously and investigated their combined effect. Consistent with our earlier observations, FMA operations did not affect performance and energy consumption significantly.
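The loop transformation behind the ILP ceiling can be sketched as follows. This is only an illustration of unrolling with partial sums, written in Python for readability; it is not the CUDA kernel used in our experiments, and the unroll factor of four and the function names are assumptions. In Python the transformation yields no hardware-level parallelism; it shows the structure of independent accumulators that a compiled kernel could issue in parallel.

```python
def dot_naive(x, y):
    """Dot product with a single accumulator: each add depends on the last."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


def dot_unrolled4(x, y):
    """Dot product unrolled by four with independent partial sums.

    The four accumulators carry no dependency on each other inside the
    loop body, which is what exposes instruction-level parallelism.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    n = len(x)
    limit = n - n % 4          # largest multiple of 4 not exceeding n
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, limit, 4):
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
    # Handle the remaining 0 to 3 elements.
    tail = 0.0
    for i in range(limit, n):
        tail += x[i] * y[i]
    # Combine the partial sums once, after the loop.
    return (s0 + s1) + (s2 + s3) + tail
```

Because the partial sums are combined in a different order than in the single-accumulator loop, the two versions can differ in the last bits for general floating-point input; they agree exactly when all intermediate values are exactly representable.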
To study the performance of the memory subsystem, we used GPU-STREAM [9] as our benchmark application to measure the bandwidth of DRAM memory on the device. We modified the benchmark to support energy measurement by employing our Phoenix 1 library. By deliberately introducing unoptimized modifications into our kernel, we were able to investigate the effects of strided memory accesses. Furthermore, we also restricted memory accesses to a subset of threads to understand the effect of not exploiting parallelism uniformly when accessing arrays on the device. Figure 1 depicts these effects on the GTX 970. Strided memory accesses have a more significant effect on EE than limiting memory accesses to a small subset of the threads. If 50% of the threads access memory in a strided fashion, the EE of our kernel drops dramatically. This indicates that if a kernel falls behind this ceiling, developers need to look for unwanted strided accesses to memory as the source of lost EE. In addition, since double-precision computations require almost twice the bandwidth of single-precision ones, the effect of thread abandonment can be as severe as that of strided accesses. Normal refers to uniform memory access among all threads in the system.
Figure 2 shows our models for the NVIDIA Tesla K80. It shows that the only way to gain efficiency in energy consumption is through implementing ILP or enabling FMA operations. Increasing the number of threads and blocks leads to consuming more power while gaining no efficiency in energy consumption. When the number of blocks is increased by orders of magnitude, we observed some gain in EE; otherwise, as on the GTX 970, increasing the number of blocks in small steps does not help EE. All levels of parallelism should be enabled to achieve energy-efficiency in our designs. The memory ceilings of the K80 follow a trend similar to those of the GTX 970. Strided memory accesses on the device dramatically reduce our chances for EE. However, issuing memory accesses from a subset of threads (instead of uniform accesses to device memory) adversely affects EE when we are performing double-precision operations.
How does our model help developers? We represent a kernel in terms of energy per flop and energy per byte, which capture the computational and memory performance of the kernel. The position of these points with regard to the ceilings in our models helps developers identify relevant optimization techniques to improve the power and EE of the kernel. One can see visually how much more energy-efficient the system becomes, at the cost of a few Watts of power, for each technique that is enabled. For instance, if the data point of a kernel falls behind the 1x1 ceiling for double-precision computations on the K80, it indicates that we need either to implement the ILP optimization technique in our kernel or to increase the level of parallelism to its peak achievable value. It should be noted that our models represent the relationship between power consumption and EE. Although our model does not identify optimization techniques for developers (whereas the roofline model does so), it helps them understand the influence of an optimization on power and energy efficiency. Nevertheless, one can confirm that the steepness of a line (its slope) in our figures relates to the performance of computation (FLOP/s) and memory (BW); in both Figures 1 and 2, the ceilings closer to the Y-axis represent better performance.
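The way a measured point is read against the ceilings can be sketched in Python. The sketch assumes that "falling behind" a ceiling means the point's implied performance (its power divided by its energy per flop, i.e. the slope of the line from the origin to the point) is lower than the slope of that ceiling. The ceiling names and all numbers are hypothetical and are not values from Figures 1 or 2.

```python
def implied_performance(power_watts, energy_per_flop):
    """Slope of the line from the origin to the measured point (FLOP/s)."""
    if energy_per_flop <= 0:
        raise ValueError("energy per flop must be positive")
    return power_watts / energy_per_flop


def ceilings_not_reached(power_watts, energy_per_flop, ceilings):
    """Return the names of the ceilings the point falls behind.

    `ceilings` maps a ceiling name to its slope in FLOP/s. The result is
    ordered from the lowest-performing ceiling to the highest.
    """
    achieved = implied_performance(power_watts, energy_per_flop)
    ordered = sorted(ceilings.items(), key=lambda item: item[1])
    return [name for name, slope in ordered if slope > achieved]


# Hypothetical ceilings (FLOP/s); steeper lines lie closer to the Y-axis.
example_ceilings = {
    "1x1": 1.0e9,
    "ILP": 4.0e9,
    "FMA + ILP": 4.2e9,
}

# Hypothetical kernel: 80 W at 4e-8 J per flop, i.e. about 2e9 FLOP/s.
behind = ceilings_not_reached(80.0, 4.0e-8, example_ceilings)
```

Here `behind` lists the ILP and FMA + ILP ceilings, which suggests that the ILP optimization is the next technique to try for this hypothetical kernel. The same function applies to the memory model when energy per byte and bandwidth are used in place of energy per flop and FLOP/s.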
IV. DISCUSSIONS ON RELATED WORK
Needless to say, the Roofline model is well established in itself, and extensions to it have been proposed in the past. Choi et al. [6] studied energy by using operational intensity as an independent variable to discuss the effect of energy and power on performance. Our model is inherently different from theirs, as we incorporate power consumption and energy efficiency into a single roofline model given known performance and memory bandwidth. Choi's model depicts how the power level relates to the operational intensity of the kernel. However, it does not consider power and energy-efficiency in one model.
In [10], Hong and Kim present a prediction model for GPUs that predicts the optimal number of active processors for a given program, and they show that increasing the number of cores for memory-bound applications does not improve computational performance. The GPUWattch model [11] is a prediction model that accurately follows the power consumption footprint over time. Its authors also used the model to investigate the effect of DVFS on GPUs. In our model, we do not predict a kernel's power consumption; instead, we explore the relationship between the power consumption and the energy-efficiency of a kernel.
Caparrós and Püschel [12] proposed to evaluate performance by extracting the rooflines with the aid of a cycle-by-cycle analysis of the schedule that identifies the bottlenecks of the underlying architecture. Their goal is to introduce additional rooflines to the model through a set of detailed architectural abstractions. They developed a mathematical model for performance based on a set of performance-relevant parameters from a modern processor by exploiting the extracted directed acyclic graph (DAG) of the computation. Ofenbeck et al. [13] produced a model by measuring a set of relevant performance counters, such as the number of SSE and AVX instructions. We followed the same approach and calculated the number of floating-point operations as the amount of work to be done.
Ilic et al. [14] extended the original roofline model by proposing to measure the bandwidth observed from different cache levels, instead of from DRAM alone, in a system with a multilevel cache hierarchy. For each cache level, they proposed a ceiling based on the bandwidth of that level. This approach leads to a robust model that is independent of the size of the input data, whereas the original roofline model depends on it. In a subsequent study, they incorporated power and energy consumption into their model [15] by proposing a mathematical formula for the energy and power consumed at various levels of the cache hierarchy. As shown in our results, although we observe that the data size may not influence the energy efficiency, it certainly seems to affect the power consumption.
In [16] and [17], the authors modeled the system as an interactive queuing network and presented a visualized model similar to the roofline model. Through a small set of parameters from the architecture and the application, one can build the model and study the effect of different levels of parallelism on the application. Nevertheless, their model did not include power consumption and energy-efficiency; it focused merely on performance.
There have been other efforts on roofline models for GPUs as well. Nugteren et al. [18] investigated the effects of enabling DVFS on performance and represented them in the roofline model. Jia et al. [19] extracted roofline models for GPUs and demonstrated their generated models for the NVIDIA C2050 and the AMD HD5850. They introduced a set of common ceilings for both architectures.
V. CONCLUSION AND FUTURE WORK
In this paper, we introduced two roofline models inspired by the original roofline model. We aim to provide insightful data to application developers and algorithm designers on the energy-efficiency and power consumption of a kernel. We developed a mathematical model of energy consumption and extended the traditional roofline model with both energy consumption and power consumption. Our roofline models consist of a set of ceilings that represent optimization techniques. Through these ceilings, one can visually assess the effect of applying the techniques on power and energy-efficiency and accordingly achieve a low-power design. Our work in progress includes applying our model to real-world scientific molecular dynamics codes.
