Introduction
Modern computing systems have evolved rapidly toward multicore and manycore processing. Manycore processors have become major products of many chip vendors (e.g., Intel, AMD, NVIDIA) [1, 2, 3, 4] and are now widely used in everyday life. Quad-core and hexa-core processors are available not only for high-performance computing but also for general-purpose computing in desktop personal computers. Moreover, such processors are also used in mobile phones together with manycore GPUs. This trend is likely to accelerate further with convenient parallel programming platforms such as CUDA [5] and OpenCL [6]. These software platforms, with their well-defined APIs and language constructs, provide an accessible development environment for GPU-based parallel software design.
In addition to performance, the power and energy consumption of GPU computing has been widely investigated. For performance optimization, the most important parameter choice in GPU code is the number of threads per block and the number of blocks in a grid, which defines how concurrent threads are mapped onto the streaming multiprocessors and streaming cores inside a GPU. There have been attempts to derive the optimal performance of GPU applications by exploring the design space spanned by the numbers of threads and blocks in the grid [7].
In this paper, we investigate the power and energy consumption of a commercial GPU, based on simple real measurements, using a simple parallel application, the dot product, which is well known and widely used in multimedia and scientific applications. By varying the numbers of threads and blocks in the grid, we observe the power and energy consumption of the whole system including the GPU. From these observations, we approximately estimate and analyze the power and energy consumption of the GPU device running the dot-product application. We use an NVIDIA GTX 660 GPU and CUDA for our experiments.
Section 2 briefly reviews preliminaries on the GPU, its architecture, and CUDA. Section 3 explains the power- and energy-aware design space exploration on a GPU with the dot-product application. Section 4 presents the experiments and the analysis obtained from the design space exploration. Finally, Section 5 summarizes this work and outlines future work.
Preliminary Work

CPU vs GPU
Figure 1. CPU, DSP, FPGA, ASIC and GPU comparison
A GPU was originally proposed for accelerating graphics computations, but it has since evolved into an accelerator for general-purpose computation. Notably, the GPU uses a manycore architecture for high-speed computation. Because it operates purely on software code, however, the GPU provides higher flexibility than an ASIC or FPGA, the traditional high-performance computing devices. As shown in Figure 1, a GPU provides flexibility comparable to that of a CPU while its computing power approaches that of an ASIC. Figure 2 shows the internal structures of CPU and GPU devices. A GPU includes a large number of small cores, whereas a CPU includes a few high-performance cores with large multi-level caches. The design principles behind the two architectures differ: a CPU is designed to minimize the latency of a few threads with a few high-performance cores, while a GPU is designed to maximize the throughput of many threads with hundreds or thousands of small cores.
GPU Architecture and CUDA
As shown in Figure 2, a GPU is mainly made of a scalable number of streaming multiprocessors (SMs), each of which contains a number of parallel processing cores (SPs). Furthermore, the GPU includes warp schedulers, dispatch units, special function units (SFUs), local memory, shared memory, texture memory, an L1 cache, and a constant cache. The GPU has its own global device memory of up to several gigabytes and supports high memory bandwidth. Shared memory, which can be accessed as fast as a register, can be configured manually together with the L1 cache. Both constant memory and texture memory are cached.
Reading data from the constant cache is as fast as reading a register if all threads read the same address. A group of 32 threads, called a warp, is executed together, meaning that the same instruction is applied to all 32 threads (single instruction, multiple threads, SIMT). Even though accessing global device memory takes hundreds of clock cycles, the GPU is designed to support zero-latency warp switching, which hides this latency as long as there are enough active warps to keep the cores busy [5].
CUDA (Compute Unified Device Architecture), supported on NVIDIA GPUs, is a data-parallel programming model. The host program launches a sequence of kernels with implicit barrier synchronization between kernels. Each kernel is organized as a hierarchy of lightweight parallel threads that are grouped into blocks. A set of blocks is grouped into a grid that executes a kernel function. Each thread has a unique index within its block, and each block has a unique index within the grid. The number of threads in a block is specified by the programmer.
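The thread hierarchy can be illustrated with a small host-side sketch. It is written in Python rather than CUDA so that it runs without a GPU, and it mirrors the common one-dimensional CUDA idiom blockIdx.x * blockDim.x + threadIdx.x. The function names are hypothetical and introduced here only for illustration.

```python
def global_thread_index(block_idx, block_dim, thread_idx):
    """Global index of a thread in a 1-D grid.

    Mirrors the CUDA expression blockIdx.x * blockDim.x + threadIdx.x.
    """
    if not 0 <= thread_idx < block_dim:
        raise ValueError("thread_idx must lie in [0, block_dim)")
    return block_idx * block_dim + thread_idx


def grid_indices(grid_dim, block_dim):
    """Enumerate the global index of every thread in a 1-D grid."""
    return [
        global_thread_index(b, block_dim, t)
        for b in range(grid_dim)
        for t in range(block_dim)
    ]


if __name__ == "__main__":
    # A grid of 3 blocks with 4 threads each covers indices 0..11 exactly once.
    print(grid_indices(grid_dim=3, block_dim=4))
```

Each (block, thread) pair maps to exactly one global index, which is why a kernel can use this index to pick the data element it works on.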
However, the number of blocks that run concurrently on one SM is determined by the available resources of the SM and the resource requirements of the blocks. The challenge for programmers is to exploit the GPU's architectural features by writing efficient CUDA code that exposes enough parallelism to make full use of the SIMT architecture, exploits the full device memory bandwidth, and uses shared memory and the data caches efficiently.
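As a rough illustration of how SM resources bound the number of resident blocks, the following Python sketch takes the minimum over a thread limit, a shared-memory limit, and a block-count limit. It is a simplification under stated assumptions: register usage is ignored, the function name and parameters are hypothetical, and the limit values in the usage example are illustrative only and should be replaced by the figures documented for the actual device.

```python
import math

WARP_SIZE = 32  # threads per warp, as described in the text


def resident_blocks_per_sm(threads_per_block, shared_mem_per_block,
                           max_threads_per_sm, max_blocks_per_sm,
                           shared_mem_per_sm):
    """Estimate how many blocks can be resident on one SM at the same time.

    Simplified: register usage and allocation granularity are ignored.
    """
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    # Threads are allocated in whole warps, so round up to a warp multiple.
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    threads_allocated = warps_per_block * WARP_SIZE
    limit_by_threads = max_threads_per_sm // threads_allocated
    if shared_mem_per_block > 0:
        limit_by_shared = shared_mem_per_sm // shared_mem_per_block
    else:
        limit_by_shared = max_blocks_per_sm  # shared memory is not a constraint
    return min(limit_by_threads, limit_by_shared, max_blocks_per_sm)


if __name__ == "__main__":
    # Illustrative limits only (assumptions, not taken from the paper).
    print(resident_blocks_per_sm(
        threads_per_block=128, shared_mem_per_block=512,
        max_threads_per_sm=2048, max_blocks_per_sm=16,
        shared_mem_per_sm=48 * 1024))
```

A very small block size wastes most of each warp, while a very large one lets only a few blocks fit on an SM, which is one reason the thread/block choice matters.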
GPU Design Space Exploration
Basics of Design Space Exploration
To optimize the performance of GPU application software, software designers have to understand the internal architecture of the GPU. Although CUDA and OpenCL provide high-level language constructs for software developers, strictly speaking they should be regarded as low-level programming languages because developers have to know hardware details to a certain degree. To ease the development of GPU-based applications, automated tools will be needed that optimally configure design parameters for a given optimization target (performance and power).
There has been much research on optimization for GPUs with CUDA or OpenCL. The following are commonly applied optimization strategies for modern GPU architectures [8].
(1) Mapping threads to SMs and SPs of the GPU
Among the optimization strategies, the most fundamental is to find the optimal mapping of application threads to the SMs and SPs of the GPU. Design space exploration (DSE) is normally performed to find the optimal design configuration of a system by varying the design parameters that affect performance, power/energy, or design cost.
Full design space exploration is the most thorough way to derive highly optimized code, but it can take many hours. We focus only on the first strategy to seek optimal code with respect to power and energy consumption.
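The exploration over thread and block counts can be sketched as an exhaustive sweep. The sketch below is in Python and is only an outline under stated assumptions: measure is a hypothetical callback standing in for one real run plus measurement, the default of 10 repeats follows the ten measurements per configuration used in this work, and the toy metric in the usage example is not a model of any GPU.

```python
from itertools import product
from statistics import mean


def explore(measure, thread_sizes, block_sizes, repeats=10):
    """Return {(threads, blocks): averaged metric} for every configuration.

    measure(threads, blocks) is a hypothetical callback that performs one
    run and returns a number (execution time, power, or energy).
    """
    results = {}
    for threads, blocks in product(thread_sizes, block_sizes):
        samples = [measure(threads, blocks) for _ in range(repeats)]
        results[(threads, blocks)] = mean(samples)
    return results


def best_configuration(results):
    """Configuration with the smallest averaged metric."""
    return min(results, key=results.get)


if __name__ == "__main__":
    # Toy stand-in for a real measurement; NOT a model of any GPU.
    def toy_metric(threads, blocks):
        return (threads - 8) ** 2 + (blocks - 4) ** 2

    sizes = [2 ** k for k in range(6)]  # 1, 2, 4, 8, 16, 32
    sweep = explore(toy_metric, sizes, sizes, repeats=3)
    print(best_configuration(sweep))
```

The cost of the sweep grows with the product of the two parameter ranges and the number of repeats, which is why a full exploration can take hours on real hardware.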
Dot Product Application
We choose a dot-product application for our design space exploration for power and energy optimization. The application itself is not critical to our work; the dot product is just one possible case. It has a high degree of parallelism in the pairwise multiplication phase, and the parallelism decreases step by step as the reduction phase progresses.
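The two phases of the dot product can be made concrete with a sequential Python sketch that records how many additions could run in parallel at each reduction step. This is an illustration of the general tree-reduction pattern, not the kernel used in the experiments, and the function name is hypothetical.

```python
def dot_product_tree(a, b):
    """Dot product via pairwise multiplication followed by a tree reduction.

    Returns (result, active_per_step), where active_per_step lists how many
    independent additions each reduction step performs.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Phase 1: pairwise multiplication; every product is independent.
    partial = [x * y for x, y in zip(a, b)]
    active_per_step = []
    # Phase 2: tree reduction; each step roughly halves the active work.
    while len(partial) > 1:
        half = (len(partial) + 1) // 2
        active_per_step.append(len(partial) // 2)
        partial = [
            partial[i] + (partial[i + half] if i + half < len(partial) else 0)
            for i in range(half)
        ]
    total = partial[0] if partial else 0
    return total, active_per_step


if __name__ == "__main__":
    print(dot_product_tree([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))  # (36, [4, 2, 1])
```

For eight elements the multiplication phase offers eight independent operations, while the reduction offers four, then two, then one, which is the shrinking parallelism described above.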
Power/Energy Consumption Measurement in GPU
Recently, there has been great interest in the power and energy consumption of general-purpose GPU computing, and much work has been done with various models and experiments [9, 10, 11, 12]. Our work focuses on deriving power and energy consumption with simple measuring facilities. Precise power measurement would require separate power sources: one for the GPU and the other for the main computer system excluding the GPU. In this study, however, we measured the power consumption of the whole system and derived the power consumption of the GPU from it. Figure 3 shows our experimental setup for measuring the power consumption of the system including the GPU. An Inspector II power measuring device was used as the power monitor. Because it lacks real-time measuring capability, we captured only the peak power consumption during the execution of the GPU application. To reduce noise, we performed 10 measurements for each thread/block grid configuration and report the average of the measurements. Table 1 and Table 2 show the system specification and the GPU specification, respectively. An Intel® Core™ i7 processor (i7-3770 @ 3.4 GHz) was used as the host system, and an NVIDIA GTX 660 GPU board was used as the target GPU device. PCI-Express version 3.0 is used for transferring data between the host and the target device.
Experiments for GPU-DSE
Experimental Setup
Experimental Results
When the whole system is idle, it consumes 46 W (P_idleGPU). When the GPU runs the dot-product application, approximately 71 W is consumed (P_workGPU). The system without a GPU consumes 17 W (P_noGPU). We estimate the static power of the GPU as P_staticGPU = P_idleGPU - P_noGPU, which gives approximately 29 W. The dynamic power of the GPU (P_dynGPU) can likewise be estimated as P_workGPU - P_idleGPU, which is around 25 W. The static power is a considerable share of the total, which suggests that modern high-performance GPUs need dedicated circuitry for reducing static power.
• CPU with idling GPU (P_idleGPU) = 46 W
• CPU with working GPU (P_workGPU) = 71 W
• CPU without GPU (P_noGPU) = 17 W
• P_staticGPU = 46 W - 17 W = 29 W
• P_dynGPU = 71 W - 46 W = 25 W
Tables 3 and 4 show the performance and power measurements from the experiments, respectively. When we varied the numbers of threads and blocks, we observed significant performance variations.
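The static and dynamic power estimates follow from two subtractions on the three system-level readings. A minimal Python sketch, using the measured values reported above and a hypothetical function name, is:

```python
def gpu_power_breakdown(p_idle_gpu, p_work_gpu, p_no_gpu):
    """Estimate GPU static and dynamic power from whole-system readings.

    p_idle_gpu: system power with the GPU installed but idle (W)
    p_work_gpu: system power while the GPU runs the application (W)
    p_no_gpu:   system power with the GPU removed (W)
    """
    p_static = p_idle_gpu - p_no_gpu     # attributed to the idle GPU
    p_dynamic = p_work_gpu - p_idle_gpu  # extra power while computing
    return p_static, p_dynamic


if __name__ == "__main__":
    print(gpu_power_breakdown(46.0, 71.0, 17.0))  # (29.0, 25.0)
```

Note that the dynamic term is attributed entirely to the GPU in this estimate, although a whole-system reading also includes any extra host activity while the kernel runs.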
The optimal configuration for performance is a thread size of 128 and a block size of 64. At this configuration, a power of 72.968 W was observed. Finally, the optimal energy consumption can be calculated as 16.760 mJ = 72.968 W × 229.698 μs, since energy is the product of power and execution time.
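The energy figure can be checked with a short calculation. The Python sketch below multiplies the reported power by the reported execution time and converts the result to millijoules; the function name is hypothetical.

```python
def energy_millijoules(power_watts, time_microseconds):
    """Energy in mJ from power in W and execution time in microseconds."""
    # W * us gives microjoules; dividing by 1000 converts to millijoules.
    return power_watts * time_microseconds / 1000.0


if __name__ == "__main__":
    energy = energy_millijoules(72.968, 229.698)
    print(f"{energy:.2f} mJ")  # 16.76 mJ
```

Because the power reading is a peak value for the whole system, this product is best read as an upper-bound style estimate of the energy spent during the kernel run rather than a precise GPU-only figure.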
An interesting observation from the experiments is that the power consumption exceeds 100 W when both the block size and the thread size are set to 1. This is counterintuitive, since only one SP in one SM is working on the computation. We expected this configuration to show the smallest power consumption, but the result is the opposite. We are currently investigating the reason behind this unexpected result by examining the internal operation of the GPU in detail.
Conclusion
Recently, there has been tremendous interest in accelerating general computing applications using a Graphics Processing Unit (GPU). The GPU now provides computing power not only for fast processing of graphics applications but also for general, computationally complex, data-intensive applications. At the same time, power and energy consumption are becoming important design criteria. Consequently, software designers have to consider power and energy consumption together with performance when developing software.
In this paper, we performed a design space exploration on a commercial GPU, the NVIDIA GTX 660, to investigate the best configuration of the kernel grid structure for optimal power or energy consumption. Our work focuses on deriving power and energy consumption with simple measuring facilities.
