ABSTRACT For the past 40 years, Moore's law has predicted the rapid growth of the computer industry. In the past few years, however, this growth has slowed for central processing units (CPUs). Instead, there has been a shift to multicore computing, specifically with general-purpose graphics processing units (GPUs). Conventional CPUs have between two and eight cores, but GPUs can have hundreds or even thousands of cores. By parallelizing code, the computing power of these cores can be used to greatly accelerate certain algorithms. GPUs, however, are known to consume more power than conventional processing units. While previous research has focused on the impact that GPUs have on performance, far fewer studies address their impact on energy consumption and efficiency. Some researchers have hypothesized that if an algorithm runs sufficiently faster on a GPU, the shorter execution time would cause the GPU to consume less energy overall. For the first time to our knowledge, we study the energy efficiency of a GPU with an application to iris recognition. Using GPU-based code written in the C++-compatible Compute Unified Device Architecture (CUDA) language, energy consumption tests are performed on basic image processing techniques, including image inversion, thresholding, dilation, and erosion, and on memory- and computation-intensive calculations such as template matching. We demonstrate that the portions of these algorithms implemented on the GPU reduce energy consumption by a factor of up to 272.
I. INTRODUCTION
The usefulness of graphics processing units (GPUs) for scientific programming has been demonstrated for many years. Because of their high computing speed, GPUs can now complete some computations using less energy than central processing units (CPUs). Because of their inherent parallelism, additional computing cores can be easily added as fabrication size decreases, so GPU processing speeds are growing at a greater rate than those of CPUs. GPU processing power has quadrupled in the past few years. The theoretical computational performance of a single state-of-the-art GPU is in the range of 8 trillion floating point operations per second (8 teraflops). With current technology, up to four GPUs can be placed in a single PC. With a state-of-the-art GPU costing less than 1000 dollars, the scientific community now has access to a large amount of processing power for a relatively small cost. One of the most appealing facts is that GPU processing rates are growing by a factor of 3.0-3.7 every 18 months [1]. CPU speeds have only grown by a factor of 2.2 over the same period.
The scientific community has demonstrated the use of GPUs for computational speedup on tasks such as sparse matrix solutions [2], linear algebra operations [3], fast Fourier transforms [4], discrete wavelet transforms [5], [6], image-based relighting [7], and iris recognition [8]-[17]. This research builds on some of these past demonstrated performance improvements [10], [13] and extends them to energy efficiency. There have also been past research efforts that focused on improving the efficiency of iris recognition [18]-[20]. However, no research has focused on the energy efficient processing of such algorithms with GPUs.
Historically, GPUs were considered power hungry devices. While this is true, researchers have found that energy consumption can be reduced when algorithms are correctly and optimally ported to a GPU-based system [21], [22]. Energy consumption is important in both the desktop/server market and the mobile market. With the rising difficulty in obtaining energy around the world, cost is a dominant factor in desktop computations. On the mobile side, battery life is of utmost consideration. This work concerns desktop computation of a biometric iris recognition algorithm.
Iris recognition [23] is one of the most accurate methods of human identification commonly used today [8], [9]. Error rates of one in ten million have been predicted in production systems [9]. For the last few years, iris recognition has been an important tool in the Department of Defense and the Department of Homeland Security. An iris recognition system that is both accurate and fast is crucial to support operations on large databases. The popularity of iris recognition, coupled with its processing complexity, makes this application of interest for energy efficient processing. This research demonstrates that computationally intensive portions of an iris identification algorithm ported to a GPU using the CUDA programming language can be more energy efficient than a more traditional CPU approach. These computationally intensive routines include commonly used image processing operations such as image inversion, value squaring, thresholding, dilation, erosion, and template matching. Section II provides a brief description of a popular iris recognition algorithm that is used for this research. Section III describes the portions of the algorithm that are ported to a GPU and contrasts them with a traditional CPU approach. Section IV presents and analyzes the results, while Section V provides concluding remarks.
II. BACKGROUND
Iris recognition stands out as one of the most accurate biometric methods in use today. Dr. John Daugman pioneered one of the first iris recognition algorithms [9], [23]. There have also been numerous recent advances in iris recognition accuracy [24]-[27]. The objective of this research is not to analyze and compare the various algorithms, but instead to analyze the energy consumption for portions of one of the readily available algorithms. This algorithm, known as the Ridge Energy Direction (RED) algorithm [8], is the basis for this work and is described next.
A. RIDGE ENERGY DIRECTION (RED) BACKGROUND
The main steps of the RED algorithm are as follows: segmentation, where the iris is extracted from other facial elements such as the eyelids, sclera (white part of the eye), pupil (dark circle in the center of the eye) and eyelashes; feature extraction, where the segmented iris is unwrapped and encoded into a binary template; and template matching, where the system compares the current template with other templates already in the database.
1) SEGMENTATION
Segmentation is the process of determining the location of the iris in the image, or specifically determining the inner and outer boundaries of the iris. The processing tasks involved in this step include a binary morphology operation to determine the pupil center and radius. With the center location and radius of the pupil known, local statistics are used to find the outer boundary [8]. This localization allows the encoding algorithm to extract only the meaningful portions of the iris. Portions of this step include image inversion, thresholding, dilation, and erosion. These portions are ported to the GPU as described in Section III.
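As a minimal sketch of two of the segmentation operations named above, the C++ fragment below inverts an 8-bit grayscale image and then thresholds it into a binary image. It is an illustration under stated assumptions, not the RED implementation: the row-major byte layout, the function names, and the `level` cutoff are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Grayscale image stored row-major, one byte per pixel (assumed layout).
using Image = std::vector<std::uint8_t>;

// Inversion: dark pixels (such as the pupil) become bright.
Image invert(const Image& in) {
    Image out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(255 - in[i]);
    }
    return out;
}

// Thresholding: produce a binary image (255 = foreground, 0 = background).
// `level` is a hypothetical cutoff, not a value taken from the paper.
Image threshold(const Image& in, std::uint8_t level) {
    Image out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = (in[i] > level) ? 255 : 0;
    }
    return out;
}

// Invert, then threshold, so that originally dark pixels end up as foreground.
Image pupil_candidates(const Image& gray, std::size_t width, std::size_t height,
                       std::uint8_t level) {
    assert(gray.size() == width * height);  // image must match its stated size
    return threshold(invert(gray), level);
}
```

Each output pixel depends only on the matching input pixel, which is why these two operations map naturally onto one thread per pixel.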
2) FEATURE EXTRACTION
Feature extraction is the conversion of the segmented iris image to a binary template or machine code that represents the distinctive information in the iris. In the RED algorithm, this step includes unwrapping the segmented iris image into a rectangular matrix which allows radial and annular ridges to be more easily detected. The RED algorithm uses directional filtering to generate the iris template, a set of bits that meaningfully represents a person's iris. The template is generated by comparing the results of the two different directional filters and writing a single bit that represents the filter with the highest output at the location. A '1' is assigned for strong vertical content and a '0' is assigned for strong horizontal content. The resulting bits for each pixel are concatenated to form a bit vector unique to the iris. A template mask is also created during this filtering process. The template mask is used to identify pixel locations where neither vertical nor horizontal directions are identified.
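The bit-writing rule described above can be sketched as follows. This is a hedged illustration only: the filter responses are taken as given inputs, ties are resolved toward '0', and the `min_response` cutoff used to build the mask is a hypothetical stand-in for whatever criterion the RED algorithm applies when neither direction is identified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct EncodedIris {
    std::vector<std::uint8_t> bits;  // 1 = vertical filter dominates, 0 = horizontal dominates
    std::vector<std::uint8_t> mask;  // 1 = usable pixel, 0 = no clear direction
};

// `vertical` and `horizontal` hold the two directional filter outputs per pixel.
// `min_response` is a hypothetical cutoff: if both outputs fall below it,
// the pixel is masked out.
EncodedIris encode_template(const std::vector<float>& vertical,
                            const std::vector<float>& horizontal,
                            float min_response) {
    assert(vertical.size() == horizontal.size());
    EncodedIris out;
    out.bits.resize(vertical.size());
    out.mask.resize(vertical.size());
    for (std::size_t i = 0; i < vertical.size(); ++i) {
        // One bit per pixel: which filter produced the higher output.
        out.bits[i] = (vertical[i] > horizontal[i]) ? 1 : 0;
        // Mask bit: at least one direction responded strongly enough.
        out.mask[i] = (vertical[i] >= min_response || horizontal[i] >= min_response) ? 1 : 0;
    }
    return out;
}
```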
3) TEMPLATE MATCHING
Template matching is the process in which the iris recognition system takes the created binary template and compares it to each template stored in its database until a match is found. To execute this step, the RED algorithm uses the fractional Hamming distance (HD) to measure how close the two templates are to each other, given by equation (1):
$$\mathrm{HD} = \frac{\lVert (\text{template}_A \oplus \text{template}_B) \cap \text{mask}_A \cap \text{mask}_B \rVert}{\lVert \text{mask}_A \cap \text{mask}_B \rVert} \tag{1}$$
The ⊕ operator is the binary exclusive-or operation. It detects disagreement between the bits from the two templates, which represent the ridge detection of the corresponding pixels. The template mask is used to ensure that the comparison is not influenced by artifacts such as eyelids, eyelashes, and specularities. The symbol ∩ represents the binary AND function, which is used in the numerator to discount noise pixels in the templates and in the denominator to discard any non-iris areas that could affect the accuracy of the calculation. The || || operator is the summation operator. A lower HD score indicates a closer match between the two irises being compared. A predetermined threshold value is used to make a match or non-match declaration. This template matching portion of the recognition algorithm is ported to a GPU using the CUDA programming language.
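The operators described above translate directly into bitwise code. The sketch below computes the fractional HD for one pair of templates using `std::bitset`. The 2048-bit size follows the 64 * 32 figure given in Section III, while the function and type names are illustrative assumptions rather than the authors' code.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTemplateBits = 2048;  // 64 * 32 bits per template
using Bits = std::bitset<kTemplateBits>;

// Fractional Hamming distance: disagreeing bits among those valid in both masks,
// divided by the number of bits valid in both masks.
double fractional_hd(const Bits& a, const Bits& mask_a,
                     const Bits& b, const Bits& mask_b) {
    const Bits valid = mask_a & mask_b;     // AND of the two masks (denominator)
    assert(valid.any());                    // at least one bit must be comparable
    const Bits disagree = (a ^ b) & valid;  // XOR, then discount masked bits (numerator)
    return static_cast<double>(disagree.count()) /
           static_cast<double>(valid.count());
}
```

A score of 0 means every comparable bit agrees, and a score of 1 means every comparable bit disagrees. The match decision would compare the score against the predetermined threshold.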
B. GRAPHICS PROCESSING UNITS AND COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA) BACKGROUND
GPU processing can be implemented in many configurations. A single graphics card can contain multiple GPUs and a single computer can contain multiple video graphics cards. Each individual GPU is separately addressable, and can execute a program that is identical to or distinct from the programs executed on the other cards. Video memory on one card cannot be accessed by another video card, thus no memory access conflict will exist, but data sharing is also limited. A database search on multiple GPUs could be considered a trivial parallel programming problem. As with most programs, energy efficiency will depend on the efficiency of the executed code. For a GPU, the faster the code can complete its function, the less energy it will use. Properly utilizing the parallel and hardware features of the GPU will dramatically speed up the execution of the GPU code.
The CUDA programming language offers functions and constructs that give full access to the GPU architecture and layered memory structure. CUDA can be used as an extension to C/C++ and other programming languages. It contains instructions and functions that make the GPU suitable for general purpose computing. Through its functions, GPU memory can be accessed and managed by a user application. Executable code in the application can be designated for CPU or GPU execution.
To allow controllable access to local high speed memory, processors are grouped together to form a multiprocessor. There are 32 scalar processors per multiprocessor on the GTX 580 GPU [28], with each multiprocessor having access to 64KB of shared memory. The multiprocessor is an important concept when programming the GPU because it contains shared memory and all memory optimizations are performed at the multiprocessor level. The NVIDIA GPU has a single instruction multiple data (SIMD) architecture. Small processor groups allow fast parallel and comprehensible serial access to shared memory. Each multiprocessor has hardware that performs zero-overhead thread scheduling. This replaces iterations of programming loops with independent threads that execute with zero overhead. The individual scalar processors have 32 bit floating point and integer arithmetic logic units, each having a two-instruction pipeline. Integer and floating point instructions can be executed concurrently. Floating point math is slightly faster than integer math. Since each scalar processor can execute two instructions simultaneously, a GTX 580 GPU can execute up to 1024 math instructions in a single clock cycle [29].
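As a quick check of the peak figure, the arithmetic can be written out using the 512 cores reported for this GPU in the execution platform description and the 32 scalar processors per multiprocessor stated above. The count of 16 multiprocessors is derived here rather than quoted from the text.

```latex
\frac{512\ \text{scalar processors}}{32\ \text{scalar processors per multiprocessor}} = 16\ \text{multiprocessors},
\qquad
512\ \text{scalar processors} \times 2\ \frac{\text{instructions}}{\text{cycle}} = 1024\ \frac{\text{instructions}}{\text{cycle}}
```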
To achieve higher performance, optimizing memory access is crucial. Each multiprocessor allows concurrent memory access and instruction execution. This allows a pipeline of processing to be performed as data arrives. Some threads can be executed while other threads wait for their memory request to be filled. There are five main types of available memory: global, shared, register, constant, and texture. Each is suitable for a different type of memory access. Off chip memory (global memory) is accessible by the CPU and GPU. With a latency of 400-800 GPU clock cycles, it is the slowest memory type. Coalescing is the term used to describe the merging of individual global memory requests. During coalescing, simultaneously requested addresses within a 128 byte segment are merged. Each multiprocessor contains 16 kilobytes of on-chip shared memory [28]. A shared memory access takes 2 clock cycles, making it the fastest memory available. Register memory is 16 kilobytes of on-chip memory with 2 cycle access. It has the same speed advantage as shared memory, but is private to each thread. Constant memory is a 64 kilobyte cached region of global memory. It is accessible to all threads, is read-only, and provides fast memory access when multiple threads read the same address.
Since the processors operate in SIMD fashion, repeated instructions on multiple data are pipelined, thereby increasing speed. Divergent paths result in idle processors, reducing speed. Each path of a conditional statement is serialized, decreasing speed and disrupting previously formed processing pipelines. Performing unneeded processing is sometimes faster than executing divergent branches (conditionals).
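The trade-off in the last sentence can be illustrated with a small C++ sketch. Both functions below count unmasked pixels where two templates disagree. The first skips masked pixels with conditionals, while the second always performs the XOR and AND, doing "unneeded" work but following a single instruction path. The per-pixel 0/1 byte arrays and the function names are assumptions made for illustration. On a CPU the two are simply equivalent, and the benefit described above applies to SIMD hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Pixels = std::vector<std::uint8_t>;  // each element is 0 or 1

// Branching version: each `if` can split SIMD threads into divergent paths.
int count_with_branches(const Pixels& a, const Pixels& b, const Pixels& mask) {
    assert(a.size() == b.size() && a.size() == mask.size());
    int count = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (mask[i]) {
            if (a[i] != b[i]) {
                ++count;
            }
        }
    }
    return count;
}

// Branch-free version: the XOR and AND run for every pixel, masked or not,
// so every thread would follow the same instruction sequence.
int count_branch_free(const Pixels& a, const Pixels& b, const Pixels& mask) {
    assert(a.size() == b.size() && a.size() == mask.size());
    int count = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        count += (a[i] ^ b[i]) & mask[i];
    }
    return count;
}
```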
III. GPU IMPLEMENTATIONS OF RED FUNCTIONS
We would like to emphasize that although the GPU porting required some change to the actual code, the functionality of our iris recognition algorithm is not affected, and hence the accuracy of the algorithm is unchanged. We now describe examples of our code that were executed on both a multicore CPU and a GPU. In Section III.A, we describe the thresholding, dilation, inversion, and erosion functions, and Section III.B describes the more complex template matching function. In Section III.C, we discuss the optimization decisions made for the template matching function.
A. THRESHOLDING, DILATION, INVERSION AND EROSION MULTI-CORE CPU AND GPU VERSIONS
To run effectively on a multi-core CPU, the functions must be multi-threaded. Figure 1a illustrates sample CPU code for spawning multiple threads; it applies to all functions that we tested on the CPU. Since these functions are highly parallel, porting the code was as simple as dividing the work of each function evenly among the spawned threads. The thread-spawning code is the same for all of our tested code, and hence is only shown once (see Figure 1a). The CPU code that performs dilation is shown in Figure 1b (thresholding, inversion, and erosion have a similar parallel nature, and hence are not shown). The loops over rows and columns cover the entire input image. The if statement checks the surrounding pixels to perform the dilation.
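Since the figures are not reproduced here, the following C++ sketch shows one way the described structure could look: rows are split into near-equal bands, one per thread, and each thread dilates its band by checking the surrounding pixels. It is not the code from Figure 1. The use of `std::thread`, the 3x3 neighborhood, and all names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

using Image = std::vector<std::uint8_t>;  // binary image, 0 or 255, row-major

// Dilate rows [row_begin, row_end): a pixel is set if any 3x3 neighbor is set.
void dilate_rows(const Image& in, Image& out, int width, int height,
                 int row_begin, int row_end) {
    for (int r = row_begin; r < row_end; ++r) {
        for (int c = 0; c < width; ++c) {
            std::uint8_t value = 0;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int rr = r + dr;
                    const int cc = c + dc;
                    if (rr >= 0 && rr < height && cc >= 0 && cc < width &&
                        in[static_cast<std::size_t>(rr) * width + cc] != 0) {
                        value = 255;
                    }
                }
            }
            out[static_cast<std::size_t>(r) * width + c] = value;
        }
    }
}

// Spawn worker threads, each handling an even share of the rows.
Image dilate_parallel(const Image& in, int width, int height, int num_threads) {
    assert(width > 0 && height > 0 && num_threads > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    Image out(in.size());
    std::vector<std::thread> workers;
    const int rows_per_thread = (height + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        const int begin = std::min(height, t * rows_per_thread);
        const int end = std::min(height, begin + rows_per_thread);
        if (begin == end) break;  // more threads than rows
        workers.emplace_back(dilate_rows, std::cref(in), std::ref(out),
                             width, height, begin, end);
    }
    for (std::thread& w : workers) w.join();
    return out;
}
```

Threads write to disjoint rows of the output and only read the input, so no locking is needed in this sketch.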
Illustrated in Figure 2a is sample code that allocates GPU memory and calls a GPU-based dilate_kernel function. This code shows the execution of 480 blocks, each containing 320 threads. The CPU continues execution after the kernel calls are made, so the CPU and GPU can perform computations simultaneously. When data is requested from the GPU, CPU execution blocks until the data is available. Figure 2b is an example of the GPU code for the dilate_kernel function. As previously stated, the functional behavior of the GPU version is equivalent to the CPU version shown in Figure 1, and hence the iris recognition accuracy will be the same (i.e., the same ROC curve). Similar to the CPU version, the optimal version for the GPU is also straightforward. The function operates on a single pixel of the image and is executed by each thread running on the GPU. Each thread uses the system variables blockIdx and threadIdx to identify which piece of data it should process. ImageIn and ImageOut are pointers allocated and transferred by the CPU code. This code resulted in full GPU utilization. Thresholding, erosion, and inversion were similarly easy to optimize, and their code is similar to dilation.
B. TEMPLATE MATCHING MULTI-CORE CPU AND GPU VERSIONS
Shown in Figure 3 is a portion of the CPU code for template matching. The line of code that performs the computation for template matching is the Result = . . . line, which corresponds to Equation 1. Note that the inner loop computes a match check for one template (64 * 32 = 2048 bits) and the outer loop checks all entries in the database. Figure 4 displays the GPU code for the template matching algorithm. The lines of GPU code required to apply the mask, including the XOR and AND operations, are similar to the CPU code. Although this code is executed on the GPU, it has the same functionality as the CPU version.
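The loop structure described for the CPU template matching code (an inner loop over the 64 words of one template and an outer loop over all database entries) might be sketched as below. This is not the code from Figure 3. The `Entry` layout, the use of `std::bitset` to count bits, and the choice to return the best-scoring entry under a hypothetical `threshold` are all assumptions.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWords = 64;  // 64 words * 32 bits = 2048-bit template

struct Entry {
    std::uint32_t bits[kWords];  // template bits
    std::uint32_t mask[kWords];  // 1 = usable bit
};

// Returns the index of the entry with the lowest fractional HD to `probe`,
// or -1 if no entry scores below `threshold` (a hypothetical decision value).
int best_match(const Entry& probe, const std::vector<Entry>& database,
               double threshold) {
    assert(threshold > 0.0 && threshold <= 1.0);
    int best = -1;
    double best_hd = threshold;
    for (std::size_t e = 0; e < database.size(); ++e) {  // outer loop: every entry
        std::size_t disagree = 0;
        std::size_t valid = 0;
        for (std::size_t w = 0; w < kWords; ++w) {       // inner loop: one template
            const std::uint32_t m = probe.mask[w] & database[e].mask[w];
            const std::uint32_t x = (probe.bits[w] ^ database[e].bits[w]) & m;
            disagree += std::bitset<32>(x).count();
            valid += std::bitset<32>(m).count();
        }
        if (valid == 0) continue;  // nothing comparable in this entry
        const double hd = static_cast<double>(disagree) / static_cast<double>(valid);
        if (hd < best_hd) {
            best_hd = hd;
            best = static_cast<int>(e);
        }
    }
    return best;
}
```

Each database entry is scored independently of the others, which is what allows the outer loop to be spread across CPU threads or GPU threads.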
C. TEMPLATE MATCHING GPU OPTIMIZATIONS
Although not much programming effort is required to keep the GPU busy for the thresholding, inversion, erosion and dilation, much time was spent optimizing the GPU template matching algorithm. The algorithm is used widely in the biometrics community due to its computational simplicity and speed of execution. However, this computational simplicity causes the template matching algorithm to be memory intensive. To optimize memory access, various hardware features of the GPU architecture were investigated. Parallelization and memory access strategies were tested in an attempt to supply a constant stream of data to the multiprocessors. Table 1 shows various strategies attempted and the resulting performance gains.
At the time of this research, a single serial CPU core could perform approximately 3.2 million iris template matches per second. One trivial method of algorithm parallelization is to execute the serial code on each GPU processor, similar to the inversion, dilation, erosion, and thresholding approach. The first entry in Table 1 shows this method. This trivial method performed poorly because it did not incorporate many of the memory access and thread scheduling optimizations available in the GPU hardware. Since data was not being read sequentially from global memory, data words were being fetched from global memory 4 bytes at a time. Had memory fetch coalescing occurred, 128 bytes of memory would have been fetched in the same amount of time. The first optimization was to spread the processing of each template across the scalar processors within a multiprocessor. This was designed to cause concurrently executing threads to read sequential addresses from global memory. The result was a 34 percent increase in performance. Next, to keep each processor filled with executing threads, the number of threads per multiprocessor (MP) was increased from 1 to 64 for a 346 percent increase in performance. Now the hardware thread scheduler could switch among active threads to hide memory latency.
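The difference between the two access patterns can be made concrete by counting how many 128 byte segments a group of simultaneous reads touches. The C++ sketch below models addresses only and does not run on a GPU. It assumes that templates are stored back to back at 256 bytes each (2048 bits) and that 32 threads issue their reads together. These layout details and all names are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kSegmentBytes = 128;   // coalescing window described in the text
constexpr std::size_t kWordBytes = 4;        // one 32-bit template word
constexpr std::size_t kTemplateBytes = 256;  // one 2048-bit template (assumed contiguous)

// Number of distinct 128-byte segments touched by one read per thread.
std::size_t segments_touched(const std::vector<std::size_t>& byte_addresses) {
    std::set<std::size_t> segments;
    for (std::size_t address : byte_addresses) {
        segments.insert(address / kSegmentBytes);
    }
    return segments.size();
}

// Trivial port: thread t walks its own template, so neighbors sit 256 bytes apart.
std::vector<std::size_t> one_template_per_thread(std::size_t threads, std::size_t word) {
    assert(word < kTemplateBytes / kWordBytes);
    std::vector<std::size_t> addresses;
    for (std::size_t t = 0; t < threads; ++t) {
        addresses.push_back(t * kTemplateBytes + word * kWordBytes);
    }
    return addresses;
}

// Optimized port: threads share one template and read consecutive words.
std::vector<std::size_t> one_word_per_thread(std::size_t threads, std::size_t template_index) {
    assert(threads <= kTemplateBytes / kWordBytes);
    std::vector<std::size_t> addresses;
    for (std::size_t t = 0; t < threads; ++t) {
        addresses.push_back(template_index * kTemplateBytes + t * kWordBytes);
    }
    return addresses;
}
```

Under these assumptions, 32 threads in the trivial layout touch 32 separate segments, while the same 32 threads reading consecutive words touch a single segment that can be served by one merged request.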
Many of the optimizations gave a performance gain, but some did not and are described here for pedagogical purposes. When computing the number of bits that match in two templates, the standard method of counting bits using a lookup table did not work well. Any reasonably sized lookup table required 16 bit computations to be used. Shared memory was too small to hold a 32 bit table, and global memory was too slow to allow for a performance gain. Since this algorithm is memory intensive, but not compute intensive, we assume that memory is the processing bottleneck and that instructions are blocked waiting for data. Computing the bit count using computational methods effectively decreased the ratio of memory accesses to computation, giving a 212 percent increase in performance. Increasing the number of threads to 128 per multiprocessor gave another performance increase of 23 percent. Fetching and processing two template words per thread gave a 25 percent performance increase. This change allows each thread to fill the two stage processor pipeline with one memory fetch. Increasing the number of threads to 512 per multiprocessor gave a performance increase of 13 percent. Lastly, several divergent paths were removed from the kernel code, giving a 14 percent performance gain. Because the gains compound multiplicatively (1.34 × 4.46 × 3.12 × 1.23 × 1.25 × 1.13 × 1.14 ≈ 36.9), the incremental optimizations applied to the algorithm gave a total performance increase of roughly 3600 percent over applying one thread per scalar processor.
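The two bit-counting strategies contrasted above can be sketched in C++. The first uses a 16 bit lookup table (two memory reads per 32 bit word), and the second counts bits with arithmetic alone. The specific arithmetic method shown is a well-known parallel bit-summing technique chosen for illustration. The paper does not state which computational method was used, so this should be read as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lookup-table approach: 65536 one-byte entries, two lookups per 32-bit word.
std::vector<std::uint8_t> build_table16() {
    std::vector<std::uint8_t> table(1u << 16);
    for (std::uint32_t i = 1; i < table.size(); ++i) {
        // Bits in i = lowest bit of i + bits in (i >> 1), which is already filled in.
        table[i] = static_cast<std::uint8_t>((i & 1u) + table[i >> 1]);
    }
    return table;
}

int popcount_table(std::uint32_t x, const std::vector<std::uint8_t>& table) {
    assert(table.size() == (1u << 16));
    return table[x & 0xFFFFu] + table[x >> 16];
}

// Computational approach: sum bits in parallel within the word, no memory access.
int popcount_compute(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                  // 2-bit partial sums
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // 8-bit partial sums
    return static_cast<int>((x * 0x01010101u) >> 24);  // add the four bytes
}
```

The computational version trades a handful of extra arithmetic instructions for the removal of every table read, which matches the reasoning above that memory, not computation, is the bottleneck.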
IV. RESULTS
Several categories of common image processing operations were performed to test the capability of the GPU on scientific image applications [10], [13]. One category included operations that contain a very small number of computations (thresholding and value inversion). A second category (erosion and dilation) comprised operations where concurrent writes between GPU processors were a concern. The final category was an operation that is memory intensive (template matching). This algorithm demonstrates the benefits of the GPU memory architecture and some methods of optimization for that architecture. The small templates can fit within shared memory. In [10] and [13], the template matching speeds for the CPU and GPU were compared. Here we compare the energy consumption.
A. EXECUTION PLATFORM
All experiments are conducted with the program executions running on a desktop system with and without a GPU. The CPU is a quad-core i5 with no hyper-threading. It has a 6MB SmartCache, a 3.1 GHz clock, and 4GB of main memory. The GPU is the NVIDIA GTX 580, which has 512 cores, a 772 MHz clock, and 192.4 GB/s of memory bandwidth.
To collect the power and energy data, we used a CWT 1b Rogowski [30] transducer that loops around the electrical outlet of our system (see Figure 5). The plug from the connector is shown in the bottom right of Figure 5. The two lines of the plug needed to be separated for the coil to properly separate the fields. The transducer indirectly reports the current through the line. The actual unit reports volts, and a conversion at 20.00 mV/A is required to get the amperage. This transducer is connected to our Tektronix TDS5104B Digital Phosphor oscilloscope [31]. This oscilloscope offered the best sampling rate and the lowest noise that we could obtain.
B. ENERGY AND POWER CONSUMPTION
To illustrate our calculation, Figure 6 displays the power (watts) consumed against time (seconds) for the erosion computation for three different hardware configurations. The first configuration is a single-threaded, single-core execution. The second configuration uses the same computer but utilizes all four cores. The third configuration is the same system with a GPU utilized. Wattage was calculated by multiplying the current supplied by our electrical source by the voltage supplied. As can be seen, the power consumed by the GPU-based system is by far the highest. GPUs, when fully utilized, typically consume more power than their CPU counterparts. However, as we will see, the energy consumed by the GPU-based system is less. Also, note that the power consumed by the four-core configuration is greater than that of the single core, but the relationship is not linear: the four-core system does not use four times the power of the single-core system. Table 2 presents the overall energy consumption for the same three systems. Also shown is the improvement of the GPU over the four-core configuration. Energy is calculated by integrating the power over time (the area under the curve in Figure 6). As can be seen in Table 2, the GPU can reduce the energy consumption by up to 272 times. This reduction in energy consumption is attributed to the reduction of execution time. We find that the GPU is very close to fully utilized for the inversion, dilation, erosion, and thresholding operations. Hence the speedup in time also leads to a reduction in overall energy consumption. For example, the areas under the curves shown in Figure 6 lead to a 197-fold reduction in energy consumption for erosion.
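The measurement chain described in this section (transducer volts to amps at 20.00 mV/A, amps times supply voltage to watts, and watts integrated over time to joules) can be sketched as follows. This is a simplified illustration, not the authors' analysis script. It assumes uniformly spaced samples and a constant supply voltage, and it uses the trapezoidal rule for the area under the curve. The parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr double kVoltsPerAmp = 0.020;  // 20.00 mV/A transducer scale from the text

// Convert a transducer reading (volts) into line current (amps).
double to_amps(double transducer_volts) {
    return transducer_volts / kVoltsPerAmp;
}

// Energy in joules from uniformly sampled transducer readings.
// `line_volts` (supply voltage) and `dt_seconds` (sample spacing) are assumed inputs.
double energy_joules(const std::vector<double>& transducer_volts,
                     double line_volts, double dt_seconds) {
    assert(dt_seconds > 0.0);
    double energy = 0.0;
    for (std::size_t i = 1; i < transducer_volts.size(); ++i) {
        const double p_prev = to_amps(transducer_volts[i - 1]) * line_volts;  // watts
        const double p_curr = to_amps(transducer_volts[i]) * line_volts;      // watts
        energy += 0.5 * (p_prev + p_curr) * dt_seconds;  // trapezoid area
    }
    return energy;
}
```

For instance, a steady 0.02 V reading corresponds to 1 A. At an assumed 120 V supply that is 120 W, so one second of samples integrates to about 120 J. A configuration that draws more power can still use less energy if it finishes sooner, which is the effect reported in Table 2.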
We now turn our attention to the scaling of threads for the CPU only configurations. Figures 7 and 8 show the time consumed and energy consumed for the various functions (not template matching) while varying the number of cores. As can be seen, as the number of cores doubles, the amount of time to process the request is cut in half from 1 to 2 and from 2 to 4 cores respectively. These results clearly demonstrate the ability of the iris recognition software components to execute in parallel.
The energy consumption is also greatly reduced as the number of cores is increased. However, the energy is not halved from 1 to 2 cores or from 2 to 4 cores. A single-core execution does not draw the same power as a two-core execution, since three of the cores are idle instead of two. Hence, although the time is cut in half, the energy is not, because the power drawn is not the same. However, the energy is still significantly reduced. For example, going from 1 to 2 cores reduces the energy consumption by approximately 42%.
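The relationship between these figures can be written out explicitly. Treating energy as average power times execution time, and assuming the run time is exactly halved when going from one core to two, the quoted 42% energy reduction implies the following ratio of average power. This is a derived estimate, not a value reported in the measurements.

```latex
E = \bar{P}\,t,
\qquad
\frac{E_2}{E_1} = \frac{\bar{P}_2}{\bar{P}_1}\cdot\frac{t_2}{t_1}
               = \frac{\bar{P}_2}{\bar{P}_1}\cdot\frac{1}{2} \approx 1 - 0.42 = 0.58
\quad\Longrightarrow\quad
\frac{\bar{P}_2}{\bar{P}_1} \approx 1.16
```

In other words, under these assumptions the two-core run draws roughly 16% more average power but finishes in half the time.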
Obviously, the number of CPU cores is in the single digits, while the number of GPU cores is approaching thousands. The GPU based device is utilized efficiently, and this relates to the energy savings. We conclude that the trend for both configurations is promising in terms of energy efficiency for iris recognition processing, with more cores leading to more energy efficiency.
V. CONCLUSION
As CPUs no longer keep pace with Moore's law, multicore devices like GPUs will become increasingly common. Energy, however, is no longer a trivial cost, and GPUs are known to consume more power than CPUs. As the GPU technology continues to rapidly develop, the performance increases have been noted. There is little research in the area of energy efficient biometric processing however, and this experiment presents further evidence for energy efficient computing with GPUs.
In this work, we demonstrate that portions of an iris recognition algorithm can have their energy consumption reduced when utilizing a GPU. Template matching, image inversion, value squaring, thresholding, dilation, and erosion all had their energy consumption reduced, by factors ranging from 12 to 272.
The trends for both CPU and GPU based systems are also promising, since more cores lead to more energy efficiency for these parallel biometric applications. As GPU based systems continue to scale, it will be interesting to see the energy trends as we move into multiple thousands of cores per processor. Additionally, the application space for biometrics includes the mobile market, where some smartphones are now built with GPUs inside.
