With the development of hardware and software, GPUs have been used for general-purpose computation.
Introduction
As GPUs (Graphics Processing Units) are increasingly applied in high-performance computing, they deliver considerably higher peak computing power than CPUs. However, power optimization for GPUs remains a major challenge, and measuring and investigating power consumption is the key to addressing it. Sylvain et al. were the first to consider the power consumption of GPUs from a software perspective. Their research investigates how and where power is consumed in a GPU by analyzing the relationship between the measured power consumption and specific units within the GPU [1]. Y. Jiao et al. systematically characterize the power and energy efficiency of GPU computing and explore the correlation between power consumption and different computational patterns under various voltage and frequency levels [2]. The power consumption of different operations, such as data transfer, arithmetic, and memory access operations, was measured in [3]. These measurements provide valuable data for power optimization at the instruction level.
To provide insight into energy consumption in the computing phase of GPUs, we propose a simple power consumption model that predicts power consumption from the software perspective. Hyesoon Kim et al. proposed a prediction model that integrates power and performance for GPUs; their model can provide the optimal number of active processors for a given application [4]. Our approach is to scan and analyze the application's source code at the instruction level. A similar method has been used to analyze the power and energy consumption of DSP applications: SoftExplorer is a tool based on this method that can estimate power consumption from a C program or from assembly code [5]. While our work overlaps with some of these prior methods, there are distinct differences. First, our power consumption model is built for GPUs at the instruction level. Second, the model predicts the energy consumption of a GPU application at the compile phase. Since the PTX assembly code is generated by the compiler, the model is a static estimation method and is more accurate than power consumption models at the C language or function level. To build the model, we parse the PTX assembly file of the application and build the control flow graph to count the dynamic instruction number. Identifying the loops in the application is the key problem in counting instructions. Once the loops are unrolled, the average power can be obtained from the dynamic instruction number. The remainder of this paper is organized as follows. Section 2 introduces the CUDA and PTX programming model of GPUs and the power characteristics of the different instructions. The detailed power consumption model and algorithms for GPU applications are provided in Section 3. Next, we verify the power consumption model through experiments with various benchmarks in Section 4. Finally, we summarize our findings and future work.
Background and Motivation

CUDA and PTX code
Compute Unified Device Architecture (CUDA) is a programming model and hardware/software environment for GPGPU. It includes an extension to the C programming language that allows programmers to write GPGPU functions, called kernels, that run on the GPU with multiple threads [6]. A CUDA program comprises many CUDA kernels. Fig. 1(a) gives a concrete example of a CUDA kernel, a matrix multiplication application. From the software perspective, we could consider the energy consumption at the CUDA C level. However, this is not fine-grained enough for analyzing power dissipation because the CUDA C language is far removed from the binary code. CUDA code can be compiled into an intermediate representation, PTX, which NVCC generates under the -ptx flag. The translation from CUDA code to PTX code may involve important transformations and optimizations, such as register allocation and instruction scheduling. Since these optimizations have a very significant impact on program performance, analyzing the PTX code is preferable to analyzing the CUDA C code [7].
PTX Instruction Categories
Since PTX instructions translate nearly one to one into native binary microinstructions [8], we consider power consumption at the PTX language level. The advantage of analyzing energy consumption at the PTX level is that it is more accurate than analysis at the CUDA C level. The key issue in analyzing PTX code is distinguishing the instruction types. We use the instruction identifiers in [8] to divide the instruction set into two categories. One is the computing instruction set, denoted A. The other is the memory access instruction set, denoted M, which can be further divided into four subsets: global, shared, constant, and texture memory [9].
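The categorization above can be sketched in a few lines of Python. This is only an illustrative sketch, not the classifier used in the paper: the function name `classify_ptx`, the category labels, and the decision to place branch, synchronization, and non-listed memory spaces into a separate "other" bucket are assumptions made here. It relies only on the PTX convention that the state space appears as a suffix of `ld`/`st` (for example `ld.global`) and that texture fetches use the `tex` opcode.

```python
# Hypothetical classifier: maps one PTX instruction line to a category.
# "A" = computing; "M.*" = memory access subsets; "other" = everything else.
MEMORY_SPACES = {"global": "M.global", "shared": "M.shared", "const": "M.const"}
CONTROL_OPCODES = {"bra", "call", "ret", "exit", "bar", "membar"}

def classify_ptx(line):
    """Return the category of a single PTX line, or None for labels,
    directives, braces, comments, and blank lines."""
    text = line.split("//")[0].strip()          # drop trailing comments
    if not text or text.endswith(":") or text.startswith((".", "{", "}")):
        return None
    tokens = text.split()
    if tokens[0].startswith("@"):                # optional guard predicate
        tokens = tokens[1:]
    if not tokens:
        return None
    opcode_parts = tokens[0].rstrip(";").split(".")
    base = opcode_parts[0]
    if base == "tex":
        return "M.texture"
    if base in ("ld", "st"):
        for part in opcode_parts[1:]:
            if part in MEMORY_SPACES:
                return MEMORY_SPACES[part]
        return "other"                           # e.g. ld.param, ld.local
    if base in CONTROL_OPCODES:
        return "other"
    return "A"
```

Keeping the state-space table in a dictionary mirrors the idea of an instruction identifier database: supporting another memory space would only require a new table entry.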
PTX Instruction Power Consumption
Different instructions consume different amounts of power. We measured the power consumption of PTX instructions experimentally. For simplicity, we designed specific kernels, each containing a sequence of the same instruction. The execution time of each kernel should be greater than 50 ms so that the experimental results can be recorded accurately. An NVIDIA GeForce GTX 280 was used in the experiments. During kernel execution, the power dissipation of computing instructions is around 90 W. The power dissipation of global memory and texture memory access instructions is about 71 W and 75 W, respectively, a difference of about 4 W that is due to the cache. In most cases the power dissipation of constant memory access instructions fluctuates between 65 W and 68 W.
Energy Consumption Model
In this section, we build the power consumption model to estimate the energy efficiency of GPU computing applications from the software perspective. Power (P) describes the rate of energy consumption at a discrete point in time, while energy (E) represents the total energy spent in the time interval $(t_1, t_2)$:
$$E = \int_{t_1}^{t_2} P(t)\,dt \qquad (1)$$
Equation (1) specifies the relation between power, execution time, and energy. However, $P(t)$ is difficult to obtain because this function varies from application to application. To address this issue, we replace $P(t)$ with the average power $\bar{P}$ and use it to estimate the energy consumption of applications. The energy E is then the product of the average power and the execution time T ($T = t_2 - t_1$), namely $E = \bar{P} \times T$. The key issue is then how to determine the average power $\bar{P}$ of an application. The main idea is to count the instructions in the program and calculate the mean of the power consumption values of the PTX instructions.
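The relation between a power trace, its integral, and the average power can be checked numerically. The sketch below is illustrative only: the function names and the sample values are assumptions, and the trapezoidal rule stands in for the integral of $P(t)$. It shows that once the average power is defined as energy divided by execution time, multiplying it by T recovers the energy.

```python
def energy_from_trace(times_s, power_w):
    """Approximate the integral of P(t) over the trace with the trapezoidal rule.
    times_s: sample times in seconds; power_w: power samples in watts."""
    energy = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy

def average_power_from_trace(times_s, power_w):
    """Average power over the interval, defined as energy / execution time."""
    duration = times_s[-1] - times_s[0]
    return energy_from_trace(times_s, power_w) / duration

# Hypothetical trace sampled every 50 ms (values are made up for illustration).
times = [0.00, 0.05, 0.10, 0.15]
power = [88.0, 90.0, 92.0, 90.0]
energy = energy_from_trace(times, power)
p_avg = average_power_from_trace(times, power)
```

The model in this paper goes the other way: it estimates the average power from instruction counts and then multiplies by the execution time, so no measured trace is needed.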
Basic Definitions
Definition 1. Basic block. Assume that a program P is a sequence of instructions, $P = \{I_1, I_2, \ldots, I_n\}$. A basic block b is a maximal sequence of instructions $\{I_1, I_2, \ldots, I_m\}$ that can be entered only at the beginning and exited only at the end [10]. Basic blocks are built according to two rules. The first rule identifies the leaders: the first instruction in a procedure, the target of any branch, or an instruction immediately following a branch. The second rule gathers all subsequent instructions up to the next leader into the same block.
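The two rules can be sketched as follows. This is a minimal illustration under stated assumptions: instructions are numbered from 1, and the program is described only by a hypothetical table `branches` that maps each branch instruction to its target. The branch table used here is invented for the example and is not taken from any figure in the paper.

```python
def find_leaders(n, branches):
    """Rule 1. n: number of instructions, numbered 1..n.
    branches: dict mapping the index of each branch instruction to its target."""
    leaders = {1}                          # first instruction of the procedure
    for src, target in branches.items():
        leaders.add(target)                # target of a branch
        if src + 1 <= n:
            leaders.add(src + 1)           # instruction right after a branch
    return sorted(leaders)

def build_basic_blocks(n, leaders):
    """Rule 2. Each block runs from a leader up to, but not including,
    the next leader (or the end of the program)."""
    bounds = list(leaders) + [n + 1]
    return [list(range(start, end)) for start, end in zip(bounds, bounds[1:])]

# Hypothetical 14-instruction program with five branches.
branches = {2: 5, 4: 7, 6: 10, 9: 12, 11: 3}
leaders = find_leaders(14, branches)
blocks = build_basic_blocks(14, leaders)
```

With this made-up branch table the leaders come out as 1, 3, 5, 7, 10, and 12, so the program splits into six blocks.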
Fig.2 Building Basic Blocks Example
As shown in Fig. 2, the leader set is {1,3,5,7,10,12} and the basic block set is {{1,2},{3,4},{5,6},{7,8,9},{10,11},{12,13,14}}.
Definition 4. Dominator. Dominance is a relation between nodes in the control flow graph (CFG). Node d dominates node i if every path from the entry node to node i includes node d; in this case, d is a dominator of node i [11, 12].
Definition 5. Natural loop. A natural loop has a single entry node that dominates all the nodes in the loop, and a back edge, which is an arc in the CFG whose head dominates its tail.
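Definitions 4 and 5 can be turned into a small loop finder. The sketch below uses the standard iterative data-flow formulation of dominators, which is a textbook technique and not necessarily the procedure implemented by the authors. The CFG representation (a dict from each node to its successor list, with every node present as a key and reachable from the entry) and all function names are assumptions.

```python
def predecessors(succ):
    """Invert a successor map. Every node must appear as a key of succ."""
    preds = {n: set() for n in succ}
    for n, targets in succ.items():
        for t in targets:
            preds[t].add(n)
    return preds

def dominators(succ, entry):
    """Iterate Dom(n) = {n} | intersection of Dom(p) over predecessors p."""
    nodes = set(succ)
    preds = predecessors(succ)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def natural_loops(succ, entry):
    """Return (head, body) for each back edge tail -> head, where head dominates tail."""
    dom = dominators(succ, entry)
    preds = predecessors(succ)
    loops = []
    for tail, targets in succ.items():
        for head in targets:
            if head in dom[tail]:              # back edge found
                body, stack = {head, tail}, [tail]
                while stack:                   # walk backwards, stopping at head
                    for p in preds[stack.pop()]:
                        if p not in body:
                            body.add(p)
                            stack.append(p)
                loops.append((head, body))
    return loops

# Hypothetical CFG: B1 -> B2 -> {B3, B4}, with B3 looping back to B2.
cfg = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": ["B2"], "B4": []}
loops = natural_loops(cfg, "B1")
```

In this example the only back edge is B3 to B2, so the single natural loop has head B2 and body {B2, B3}.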
Power Consumption Model
The main task in building the power consumption model is to obtain the average power $\bar{P}$ of an application. Assume that a program contains $n_1$ computing instructions, $n_2$ global memory (GM) access instructions, $n_3$ shared memory (SHM) access instructions, $n_4$ constant memory (CM) access instructions, and $n_5$ texture memory (TM) access instructions, and let N be the number of instructions in the program. The power dissipation of the computing instructions and of the four kinds of memory access instructions is denoted $P_1, P_2, P_3, P_4, P_5$, respectively. The average power $\bar{P}$ can then be defined as in equation (2):
$$\bar{P} = \frac{1}{N} \sum_{i=1}^{5} n_i P_i \qquad (2)$$
where $P_i$ ($i = 1, \ldots, 5$) can be measured through the experiments discussed in Section 2.3. Assume that a program contains m kernels, and let $\bar{P}_j$ and $T_j$ denote the average power and execution time of kernel j. The power consumption model of the application is then:
$$E = \sum_{j=1}^{m} \bar{P}_j T_j$$
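As a worked illustration, the sketch below assumes that equation (2) is the count-weighted mean of the per-class powers and that the application energy is the sum over kernels of average power times execution time. The function names, the instruction counts, and the kernel times are hypothetical; the 90 W and 71 W figures are the computing and global memory values reported in Section 2.3.

```python
def average_power(counts, unit_power, total=None):
    """Count-weighted mean power in watts.
    counts: dict class -> dynamic instruction count n_i.
    unit_power: dict class -> measured power P_i in watts.
    total: N; defaults to the sum of the counts (an assumption)."""
    n_total = total if total is not None else sum(counts.values())
    return sum(counts[c] * unit_power[c] for c in counts) / n_total

def application_energy(kernels):
    """kernels: list of (average_power_w, exec_time_s) pairs, one per kernel.
    Returns the total energy in joules."""
    return sum(p * t for p, t in kernels)

# Hypothetical kernel: 800 computing and 200 global memory instructions.
unit_power = {"compute": 90.0, "global": 71.0}
p_avg = average_power({"compute": 800, "global": 200}, unit_power)

# Hypothetical application with two kernels running 0.5 s and 1.0 s.
energy = application_energy([(p_avg, 0.5), (90.0, 1.0)])
```

Here the first kernel averages 86.2 W, and the two kernels together consume 133.1 J under these assumed inputs.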
The execution time of the kernels can be obtained easily in this model. Counting the instructions, however, is the key issue, and determining the exact number of instructions in a program is a challenge. We therefore describe our approach to obtaining the instruction number in detail. CUDA programs are compiled, and the compiler generates intermediate assembler-level instructions in the NVIDIA PTX instruction set [6]. We use the number of PTX instructions as the dynamic number of instructions. Note that the dynamic number of PTX instructions is proportional to the number of data elements: because the kernel functions contain loops, we cannot simply count the instructions that appear in the kernel code. To obtain the exact dynamic number of PTX instructions [8, 13], the loops in the kernels must be identified and unrolled. The first step is to parse the PTX file output by the compiler and use knowledge of PTX to identify and categorize the instructions as computing operations, branch operations, memory load/store operations, synchronization operations, and other miscellaneous operations. This step also distinguishes the different kinds of memory access (global, shared, constant, texture). The second step is to decompose the program into basic blocks, each of which is an instruction sequence. Then the control flow graph of each kernel is constructed and the loops in the kernel are identified. Once the loops have been identified, they are unrolled and the dynamic instructions in each loop body are estimated. The detailed algorithms for counting the instructions in PTX code are provided in the next section.
Algorithms
The goal of the algorithms is to calculate the average power of GPU applications. The algorithms consist of two phases: building the control flow graph for the PTX code, and identifying the loops in the kernels. The first phase scans and analyzes the PTX code and generates the control flow graph of the application to aid further analysis. The second phase includes identifying the loops and calculating the average power. Identifying the loops in the kernels is the key step in counting the dynamic instruction number. After the loops are unrolled, the average power can be calculated by equation (2).
An Energy Consumption Model for GPU Computing at Instruction Level Haifeng Wang, Qingkui Chen
Building Basic Blocks Algorithm
Input: A sequence of instructions $\{I_1, I_2, \ldots, I_n\}$
Output: Basic block set $\{b_1, \ldots\}$
... (lines 1-3). Then the instruction sequence is grouped into a set of basic blocks (line 4), and the control flow graph of the kernel is built in order to identify the loops in the instruction sequence (line 5). After the loops are identified, two fields, the type and the number of iterations, are added to each basic block in a loop (lines 9-13). The final loop traverses the basic block set and calculates the instruction count (lines 14-19). When a basic block is not a loop block, the algorithm counts its instructions directly (lines 15-16). If the basic block belongs to a loop, its instruction count is the product of the number of iterations and the number of instructions in the basic block (lines 17-18). Finally, the result is the average power of the application (line 20).
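The counting step described above can be sketched as follows. This is not the authors' listing, which is not reproduced here; the block representation (a dict with a `type` field, an `iterations` field, and per-class static counts) and the function name are assumptions chosen to match the two fields mentioned in the text.

```python
from collections import Counter

def dynamic_counts(blocks):
    """Accumulate dynamic instruction counts per instruction class.
    Each block is a dict with:
      'type'       : 'loop' or 'plain' (hypothetical labels)
      'iterations' : iteration count, used only for loop blocks
      'counts'     : dict class -> static instruction count in the block"""
    totals = Counter()
    for block in blocks:
        # A loop block contributes its instructions once per iteration.
        factor = block["iterations"] if block["type"] == "loop" else 1
        for category, n in block["counts"].items():
            totals[category] += factor * n
    return dict(totals)

# Hypothetical kernel: prologue, a loop body run 1000 times, and an epilogue.
kernel_blocks = [
    {"type": "plain", "iterations": 1, "counts": {"A": 4, "M.global": 1}},
    {"type": "loop", "iterations": 1000, "counts": {"A": 3, "M.global": 2}},
    {"type": "plain", "iterations": 1, "counts": {"A": 1, "M.global": 1}},
]
totals = dynamic_counts(kernel_blocks)
```

The resulting per-class totals are the kind of counts that a weighted-mean power estimate would take as input.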
Implementation
To evaluate the power consumption model, we have developed a prototype for automatic energy consumption analysis. The prototype system analyzes the energy consumption profile of programs based on their PTX code. Its input is the PTX assembly code generated by the NVCC compiler, and its output is the average power of the application. This section explains the implementation of the prototype system. As shown in Fig. 3, it comprises five major components: a database that stores the instruction identifiers, a modeling rule set that adjusts the power consumption modeling approach, a set of preprocessors that handle the PTX code of applications, an automatic analysis engine that implements the analysis algorithms, and a configuration manager that schedules the other modules so that they work together effectively.
Figure 3 Structure of Prototype System
The configuration manager is the main procedure of the prototype system; it is responsible for scheduling all other modules and for initializing the instruction identifier database and the modeling rule set. The preprocessors identify the kernel functions in the PTX code, extract the PTX instruction sequence, and build the control flow graph of the application. The automatic analysis engine analyzes the control flow graph, counts the dynamic instructions, and calculates the energy consumption based on the modeling rules. The modeling rule set maintains a set of modeling parameters for different GPU architectures. The instruction identifier database is a knowledge base that stores the instruction identifiers listed in Table 1. The prototype system was implemented in Perl because Perl is well suited to textual analysis. In addition, the instruction identifier database and the modeling rule set were built in XML.
Experiments and Results
The aim of the experiments is to demonstrate the validity and accuracy of the power consumption model for GPU applications. To verify the model, we compare the average power calculated by the prototype with the power measured in the experiments. Our testbed consists of an Intel Core 2 processor, 2 GB of DDR RAM, a 320 GB Seagate hard disk, and an NVIDIA GeForce GTX 280 card with a 602 MHz core frequency and a 1107 MHz memory frequency. The operating system is Windows XP Professional with CUDA toolkit 2.3 and CUDA driver version 190.29. All the benchmarks were selected from the CUDA SDK and are listed in Table 2 [15]. To verify the power consumption prediction model and to record the variation of the power consumption, we designed a set of revised benchmarks that simply repeat a loop 1000 times. Based on Table 1, the dynamic instruction numbers are calculated by the prototype system, and the average power is obtained through the power consumption model. Fig. 4(a) shows the power profiles of the kernels while the applications are executing. In these experiments, we mainly considered the external power supply of the GPU and ignored the power supplied through PCI-E because it is not accessible for measurement [3, 16]. We use a current probe to convert the measured current into a corresponding voltage signal and convey the DC signal to a computer that logs the power profiles. For simplicity, we plot the power profiles of only three applications in Fig. 4(a); the sampling interval is 50 ms. We can see that while the kernels are running, the power fluctuates around a specific value. For example, for the Bitonic sort application, denoted #1 in Fig. 4(a), the GPU power is around 90 W in the computation phase. Since the measured power varies, we use the mean GPU power during kernel execution as the measured value. The average power $\bar{P}$ predicted by the power consumption model is then compared with this measured power.
The predicted average power is lower than the measured power because the revised benchmarks execute repeatedly in a loop, which incurs extra power for GPU initialization. The experimental results indicate that the relative error between the predicted average power and the measured power is no more than 5%, which supports the validity of this modeling method.
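The comparison step can be illustrated with a short sketch. The sample values and the predicted value below are made up for the example and are not results from the paper; the function names are likewise hypothetical. The sketch averages power samples taken during kernel execution and computes the relative error of a predicted value against that mean.

```python
def mean_power(samples_w):
    """Mean of the power samples (in watts) logged during kernel execution."""
    return sum(samples_w) / len(samples_w)

def relative_error(predicted_w, measured_w):
    """Relative error of the predicted power with respect to the measured power."""
    return abs(predicted_w - measured_w) / measured_w

# Hypothetical samples logged at a fixed interval while a kernel is running.
samples = [89.0, 91.0, 90.0, 90.0]
measured = mean_power(samples)

# Hypothetical model prediction for the same kernel.
predicted = 86.4
error = relative_error(predicted, measured)
within_bound = error <= 0.05
```

With these assumed numbers the prediction is 4% below the measured mean, which would fall inside a 5% bound.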
Conclusion
Prior work has used a similar method to build power consumption models for DSP applications. However, GPU computing applications have different structural features from DSP applications: their kernels contain only natural loops and simple branches. In this study we parsed the PTX assembly file and calculated the dynamic instruction number by unrolling the loops in the kernels. We found that the average power estimated by the power consumption model is very close to the measured power. The model allows programmers to estimate the energy consumption of their programs and helps them optimize their designs from the energy consumption perspective. In future work, we will take branch instructions into account to improve the accuracy of the model. In addition, we will extend the power consumption model to adapt to different GPU architectures and further enhance the prototype system.
