Many applications, such as machine learning and sensor data analysis, are statistical in nature and can tolerate some level of inaccuracy in their computation. Approximate computing is a viable method to save energy and increase performance by controllably trading off energy for accuracy. In this paper, we propose a tiered approximate floating point multiplier, called CFPU, which significantly reduces energy consumption and improves the performance of multiplication at a slight cost in accuracy. The floating point multiplication is approximated by replacing the costly mantissa multiplication step of the operation with lower energy alternatives. We process the data by using one of the three modes: a basic approximate mode, an intermediate approximate mode, or on the exact hardware, depending on the accuracy requirements. We evaluate the efficiency of the proposed CFPU on a wide range of applications including twelve general OpenCL ones and three machine learning applications. Our results show that using the first CFPU approximation mode results in 3.5× energy-delay product (EDP) improvement, compared to a GPU using traditional floating point units (FPUs), while ensuring less than 10% average relative error. Adding the second mode further increases the EDP improvement to 4.1×, compared to an unmodified FPU, for less than 10% error. In addition, our results show that the proposed CFPU can achieve 2.8× EDP improvement for multiply operations as compared to state-of-the-art approximate multipliers.
I. INTRODUCTION
IN 2017, the number of smart devices around the world exceeded 20 billion. This number is expected to exceed 40 billion by 2022 [1], [2]. Many of these devices have batteries with strict energy constraints, so the need for systems that can efficiently handle the computing requirements of data-intensive workloads is undeniable [3]-[7]. Running machine learning algorithms or multimedia applications on general purpose processors, e.g., GPUs and CPUs, results in large energy consumption and performance inefficiency. Many applications do not need highly accurate computation, so accepting slight inaccuracy, instead of doing all computation precisely, results in significant energy and performance improvements [8]-[15]. Approximate computing seeks to alleviate this problem. While there are a number of proposed approximate solutions, they are limited to a small range of applications because they cannot control the error rate of their output. Several data processing applications use a large range of values and require high precision. Therefore, computations in many traditional and state-of-the-art computing systems use floating point units (FPUs) [13], [16]-[18]. For example, on GPUs, frame rendering or high-performance scientific computations require many FPU operations and use a large amount of power. We observed that over 85% of floating point arithmetic involved multiplication in the general OpenCL applications we tested. To cover the same dynamic range as floating point, a fixed point unit must be five times larger and is 40% slower than the corresponding floating point unit [19], [20].
The authors are with the Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093 USA (e-mail: dperoni@ucsd.edu; moimani@ucsd.edu; tajana@ucsd.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2018.2885317
Multiplication is one of the most common and costly FP operations, slowing down the computation in many applications, such as signal processing, neural networks, and stream processing [21] - [25] . There are a number of approximate multiplication units designed to save power through different techniques. Several prior publications truncate the operands of the multiplication or use different sized blocks to enable approximate multiplication [26] - [28] . However, a lack of accuracy controls and large area overhead reduce the advantages provided by these approximate designs.
In this paper, we propose a configurable floating point multiplier, called CFPU, which significantly reduces the energy consumption of floating point multiplication by trading off accuracy. CFPU avoids the costly multiplication when calculating the fractional part of a floating point number in one of two ways.
1) Mantissa Discarding: In floating point numbers, the bottom 23 bits represent the mantissa. To calculate the result of a multiply operation, the mantissas of the two operands are multiplied together, a step which consumes the majority of the total power and bottlenecks the operation. A substantially faster and lower energy approach is to discard one of the input mantissas and use the second one directly. Using one mantissa directly has the potential to generate high error rates for individual multiplies, so we provide two modifications that increase the final output accuracy.
a) Adaptive operand selection finds and discards the mantissa which results in the lowest error. Minimizing individual operation error allows us to increase the number of calculations performed on the CFPU while maintaining the same output error.
b) Tuning examines the first N bits of the discarded mantissa to predict error in the output result.
2) Shift and Add: When mantissa discarding cannot produce results with error below a user-specified requirement, we run the operation in a more accurate approximate mode. We examine the discarded mantissa to find the location of the first "1" bit. We shift the nondiscarded mantissa based on this bit's position and add it to itself to create the new mantissa for the result value. This approach requires more energy and computation time compared to the first stage but is more accurate. An operation run in the second stage has at least 50% lower error than one run in the first stage for the same input operands.
If neither of these approaches produces an output with an acceptable error, our design can control the level of output accuracy by identifying the inputs that result in the highest output error and assigning them to compute precisely on the CFPU. We evaluate the efficiency of the proposed technique on the AMD Southern Island GPU architecture by replacing the traditional FPUs with the proposed CFPU. We test OpenCL workloads and our results show that using first stage CFPU approximation results in 3.5× energy-delay product (EDP) improvement, compared to an unmodified GPU, while ensuring less than 10% average relative error. Adding the second stage further increases the EDP improvement to 4.1× for the same level of accuracy. Comparing the proposed CFPU with previous state-of-the-art multipliers [26], [28]-[30] shows that our design can achieve a 2.8× EDP improvement with lower error.
We also examine the impact of CFPU on several machine learning algorithms. With these algorithms becoming increasingly popular, there is a strong need to improve their performance without sacrificing output accuracy. They are often naturally stochastic, allowing them to accept some error in their output. We use our design to run three Rodinia [31] machine learning benchmarks: K-nearest neighbor (KNN), Back Propagation, and K-means. K-means and KNN are used in data mining applications and involve dense linear algebra computation, while Back Propagation is used for training weights in neural networks. For machine learning algorithms, the CFPU design can achieve 2.4× energy saving and 2.0× speedup compared to an unmodified GPU while ensuring less than 1% average relative error. These benchmarks have 50% energy savings and 40% speedup when running on CFPU with two stages rather than the previously proposed one stage design [32].
The rest of this paper is organized as follows. Section II reviews the related work. Section III describes the proposed approximate multiplications. The experimental results are presented in Section IV. Finally, Section V concludes this paper.
II. RELATED WORK
There are several commonly examined approaches to approximate computing: voltage over scaling (VOS), use of approximate hardware blocks, and use of approximate memory units. VOS involves dynamically reducing the voltage supplied to a hardware component to save energy, but at the expense of accuracy. Error rates for VOS can be modeled to determine the tradeoff between energy and accuracy for applications, allowing voltage to be lowered until an error threshold is reached [18] , [33] - [37] . The circuit is sensitive to any variations, and if the operating voltage of a circuit is decreased too far, timing errors begin to appear which are too large to correct. A configurable associative memory was proposed by [9] which can relax computation by using VOS on the TCAM rows/bitlines to trade between energy and output accuracy. These techniques suffer because they are bound by GPU pipeline stages and therefore cannot improve computation performance. Because they make use of Hamming distance, a metric which does not consider bit position's impact on error, output accuracy is difficult to predict.
Another recently emerged strategy is the application of nonvolatile memories to create approximate memory units for energy efficient storage and computing purposes [9], [38]-[40]. In computing, the goal of this approach is to store common inputs and their corresponding outputs. This style of associative memory can retrieve the closest output for given inputs in order to reduce power consumption [41], [42]. This approach does not work well in applications without a large number of redundant calculations. Associative memory can be integrated into FPUs to reduce these redundancies. Resistive CAMs have been used to accelerate application level computation as shown in [43]. That work uses the bit position insensitive metric of Hamming distance, resulting in poor accuracy in GPU usage.
Finally, approximate hardware involves redesigning basic component blocks to save energy, at the cost of accurate output [26], [29], [44], [45]. Liu et al. [29] utilized approximate adders to create an energy efficient approximate multiplier. Hashemi et al. [26] designed a multiplier that selects a reduced number of bits used in the multiplication to conserve power. Speculative designs are a recently explored route. Work in [46] proposed a speculative adder with error recognition to perform approximate operations. These types of adders can be utilized in approximate multipliers. Camus et al. [44] proposed a speculative approximate multiplier that combines gate-level pruning and inexact speculative adders to lower power consumption and shrink FPU area. Deep neural networks can tolerate approximate hardware, as shown in [7], which examines the use of variable fixed point in deep neural networks. Compared to the previous work [26], [28], [30], we focus on optimizing floating point multiplication by eliminating mantissa multiplication. Our design computes common power of 2 multiplies exactly. Our configurable approximate floating point multiplier predicts accuracy based on the incoming inputs and runs high error operations on exact hardware rather than correcting results after computation. We extend our previous design [32] to offer two levels of approximation, instead of just one, to trade off between approximation error and energy benefits. The first stage approximately multiplies two values by using the mantissa from one operand directly in the output. If the first stage error is too high, in the second stage the kept mantissa is shifted based on the position of the first "1" bit of the discarded mantissa and added to itself to produce an approximate result. If the second stage error is also too high, operations can be computed on exact hardware. For the same level of application accuracy, our modified multiplier reduces application energy by up to 45% compared with the design in [32] and shows a 2.8× EDP improvement for multiply operations when compared to state-of-the-art approximate multipliers [26], [28], [30].
III. APPROXIMATE FPU MULTIPLIER
Compared to integer computing units, FPUs are usually costly and energy-hungry components, due to the complex way floating point numbers are stored. Multiplication-based components are inefficient and slow down many current applications including multimedia, streaming, neural networks, and other machine learning applications [8], [26]. Fig. 1 shows the breakdown of ALU operations for the Sobel image filter application, which has a high prevalence of multiply and multiply-add (muladd) operations compared to additions. Horowitz [47] estimated that in 45 nm a floating point multiply consumes 3.7 pJ of energy compared to a floating point add, which consumes 0.9 pJ. A floating point multiply consumes only 20% more energy than an integer multiply, so performing all operations as fixed point does not provide significant energy savings. Based on these estimates, the floating point multiply and muladd operations consume over 90% of the ALU energy for the Sobel application, making these operations good targets for energy optimization.
In order to make multiplication more efficient, we propose a two-stage floating point multiplier. The first stage optimizes mantissa multiplication by reusing one of the input mantissa directly in the output. The second stage seeks to reduce error further by shifting and adding the retained mantissa to itself.
A. IEEE 754 Floating Point Multiply
In floating point notation, a number consists of three parts: 1) a sign bit; 2) an exponent; and 3) a fractional value. In IEEE 754 single-precision floating point representation, the sign bit is the most significant bit (bit 31), bits 30 to 23 hold the exponent value, and the remaining 23 bits (bits 22 to 0) contain the fractional value, also known as the mantissa. The exponent bits represent a power of two ranging from −127 to 128. The mantissa bits store a value between 1 and 2, which is multiplied by 2^exp to give the decimal value.
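The field layout can be checked with a short Python sketch. It is only an illustration, not part of the CFPU design: the helper names `decode_float32` and `field_value` are hypothetical, and the sketch assumes normal (non-zero, non-denormal) single-precision values.

```python
import struct

def decode_float32(x):
    """Split x into its single-precision sign, biased exponent, and fraction fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # bit 31
    biased_exp = (bits >> 23) & 0xFF  # bits 30..23, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # bits 22..0, the stored mantissa bits
    return sign, biased_exp, fraction

def field_value(sign, biased_exp, fraction):
    """Rebuild the value of a normal number from its three fields."""
    mantissa = 1.0 + fraction / 2**23  # hidden leading 1 gives a value in [1, 2)
    return (-1.0) ** sign * mantissa * 2.0 ** (biased_exp - 127)

if __name__ == "__main__":
    print(decode_float32(8.5))    # 8.5 = 1.0625 * 2^3
    print(decode_float32(-60.0))  # -60 = -1.875 * 2^5
```

The later examples in this section reuse the same decomposition: 8.5 has mantissa bits "0001000..." and −60 has mantissa bits "1110000...".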
FPU multiply follows the steps shown in Fig. 2. First, the sign bit of A × B = C is calculated by XORing the sign bits of the A and B operands. Second, the effective values of the exponents are added together. Finally, the two mantissa values are multiplied to provide the result's mantissa. Because each mantissa ranges from 1 to 2, the output of the multiplication always falls between 1 and 4. If the output mantissa is 2 or greater, it is normalized by dividing by 2 and increasing the exponent by 1.
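The three steps can be sketched in Python as follows. This is a simplified model under stated assumptions: it handles only normal operands, skips the rounding, overflow, and special-value handling of a real FPU, and the names `fields` and `fp_multiply` are hypothetical.

```python
import struct

def fields(x):
    """Return (sign, biased exponent, fraction) of x as a single-precision value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp_multiply(a, b):
    """Multiply two normal single-precision values field by field (no rounding)."""
    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)
    sign = sa ^ sb                       # step 1: XOR the sign bits
    exponent = (ea - 127) + (eb - 127)   # step 2: add the unbiased exponents
    ma = (1 << 23) | fa                  # 24-bit mantissas with the hidden 1
    mb = (1 << 23) | fb
    product = ma * mb                    # step 3: mantissa product, in [1, 4) * 2^46
    if product >= (1 << 47):             # mantissa >= 2: normalize
        exponent += 1
        mantissa = product / (1 << 47)
    else:
        mantissa = product / (1 << 46)
    return (-1.0) ** sign * mantissa * 2.0 ** exponent

if __name__ == "__main__":
    print(fp_multiply(-60.0, 8.5))  # mantissa product stays below 2
    print(fp_multiply(3.0, 3.0))    # 1.5 * 1.5 = 2.25 triggers normalization
```

The 24-bit by 24-bit integer product is the step that the approximate modes described below try to avoid.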
B. CFPU Usage
CFPU is highly transparent to the application. A user running an error-tolerant application specifies the maximum error for any given multiply operation, Error max , prior to execution. CFPU uses this value to ensure operations producing error greater than the value are sent to either a more accurate approximation mode or the exact hardware.
The flow chart for the proposed design is shown in Fig. 3 . Adaptive selection checks for a mantissa which, when discarded, produces an exact output when possible. The selector controls a multiplexer (MUX) between the two inputs A and B, and copies the selected mantissa to the output while discarding the other. Tuning utilizes the first N bits from the discarded mantissa to check against a threshold value and, if the threshold is not exceeded, the computation is complete. If the threshold is exceeded, the operation is run in the second stage which uses shift and add to increase accuracy. If the error of the second stage still exceeds the specified Error max , the output is computed on the exact FPU hardware. We further explain the modifications in the following sections. 
C. First Stage Approximation
1) Mantissa Discarding: The multiplication of the mantissas is the most costly operation, taking over 80% of the total energy of the multiply operation [48], so the first stage approximation removes it entirely. Rather than multiplying the two mantissas, the unmodified mantissa from one of the input operands (e.g., input B) is used for the output value.
The error of any approximate multiply is $\mathrm{Error} = \mathrm{Mantissa}_{\mathrm{discarded}} - 1$. In the case where an operand is a power of 2, the mantissa is 1 and there is no error in the result.
Because the largest value a mantissa can be multiplied by is 2, the deviation from the kept mantissa and the correct answer is at most 100%. However, the maximum error can be reduced down to 50% by adding the first bit of the discarded mantissa to the sum of the exponent values. When the discarded mantissa is greater than 1.5 (the first mantissa bit is 1), the error is less than 50% if the kept mantissa is multiplied by 2 instead of 1. This is the functional equivalent of increasing the exponent by 1. By doing this, the error range is shifted to be −50% to 50% instead of 0% to 100% as shown in
The additional logic needed to perform approximate floating point multiplication is shown in Fig. 2 . The B mantissa is used directly as the output mantissa, and the first bit of the discarded mantissa A is added with the two exponent values. Fig. 4 shows the flow for approximation. The sign bit for the result is computed by XORing the two input sign bits. The exponent for the result is computed by adding the two input exponents and the MSB of the discarded mantissa. CFPU ensures all operations compute their results below the user specified Error max through the use of adaptive selection and tuning. In the first level of approximation, adaptive selection identifies the best mantissa to use in the output and discards the other. The upper N bits of the discarded mantissa are used to predict error and select between copying the other mantissa directly to the output, using the second level of approximation, or multiplying the two mantissas together.
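A minimal Python sketch of the first stage datapath described above, assuming normal single-precision operands. The function names are hypothetical, and operand A is always the one discarded here; adaptive selection and tuning are left out.

```python
import struct

def fields(x):
    """Return (sign, biased exponent, fraction) of x as a single-precision value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def discard_multiply(a, b):
    """Approximate a * b by discarding the mantissa of a and keeping that of b."""
    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)
    sign = sa ^ sb                             # sign: XOR of the input sign bits
    msb = fa >> 22                             # first bit of the discarded mantissa
    exponent = (ea - 127) + (eb - 127) + msb   # exponents plus the discarded MSB
    mantissa = 1.0 + fb / 2**23                # B mantissa copied unchanged
    return (-1.0) ** sign * mantissa * 2.0 ** exponent

if __name__ == "__main__":
    print(discard_multiply(2.0, 3.0))  # discarded mantissa is 1.0, result is exact
    print(discard_multiply(3.0, 2.0))  # discarded mantissa is 1.5, exponent is bumped
```

No mantissa multiplier is involved: the only arithmetic is a small exponent addition with the discarded MSB as an extra input.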
2) Adaptive Operand Selection: Choosing which mantissa should be used for the result can significantly impact the accuracy of the CFPU output. For example, if the values 2.0 and 3.0 are multiplied, the result will be either 6.0 (exact) or 8.0 (33% error) depending on which mantissa is discarded.
In [32] , we only use adaptive operand selection to identify mantissa values of zero. We improve adaptive operand selection to find the best mantissa to discard for all operations. This approach increases hit rate as more operations can be approximated, and reduces error for individual operations.
We compare the two mantissa values to determine the value which produces the lowest error when discarded. The output result uses the best mantissa. The benefits of this approach are twofold: 1) error for individual operations decreases and 2) the percentage of operations run in the first stage approximate mode increases because more operations are below the Error max . The worst case error occurs when the discarded mantissa is closest to 1.5, so the preferred mantissa can be identified by detecting the longest continuous series of identical bits starting from the MSB. The mantissa with the longest series of either 1s or 0s gives a lower error when discarded. If all mantissa bits are "0," the error is 0. If both mantissas have an identical length series of bits, mantissa A is discarded. Fig. 5 illustrates the distance calculation. If the mantissa from A is discarded, the resulting approximate multiply produces a value of 6.625 with an error of 5.9%. By comparison, if the mantissa from B is discarded instead, the output value is 8.5 with an error of 20.7%. Mantissa A has a series of three 0s compared to the single "1" of B, so discarding A results in a lower error.
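The selection rule can be sketched in Python as a comparison of leading run lengths. The helper names are hypothetical, and the mantissa bit patterns used for the check are inferred from the values quoted for Fig. 5 (A with mantissa bits "0001..." and B with "10101..."), so they are an assumption rather than data taken from the figure.

```python
def leading_run(fraction, width=23):
    """Length of the run of identical bits starting at the MSB of the stored mantissa."""
    first = (fraction >> (width - 1)) & 1
    run = 0
    for pos in range(width - 1, -1, -1):
        if (fraction >> pos) & 1 != first:
            break
        run += 1
    return run

def choose_discard(frac_a, frac_b):
    """Return which operand's mantissa to discard; ties discard A, as in the text."""
    return "A" if leading_run(frac_a) >= leading_run(frac_b) else "B"

if __name__ == "__main__":
    frac_a = 0b0001 << 19    # mantissa 1.0625, three leading 0s
    frac_b = 0b10101 << 18   # mantissa 1.65625, a single leading 1
    print(leading_run(frac_a), leading_run(frac_b), choose_discard(frac_a, frac_b))
```

An all-zero mantissa yields the maximum run of 23, so a power-of-2 operand is always the one discarded and the result is exact.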
3) Tuning Control: It is possible that neither mantissa produces output lower than Error max when discarded. CFPU automatically detects cases, where the error exceeds the specified requirement and computes the result in the more accurate second stage. For example, if the maximum desired error is 5%, then the multiplication of A and B in Fig. 5 cannot be computed using the first stage mantissa drop approximation.
The N bits after the MSB of the discarded mantissa are checked to tune the level of approximation. The case of maximum error occurs when the discarded mantissa is furthest from a power of 2, which occurs when its value is "1" followed by all 0s. When tuning, the goal is to ensure that operations with error over Error max are selected to run in the second stage, so the hardware must change its selection depending on the value of the first bit of the discarded mantissa. If the first bit of the discarded mantissa is "1," the first N tuning bits are checked for 0s, where N is selected based on Error max. If a "0" is found, the hardware runs in exact mode. Similarly, when the first bit is "0," the first N tuning bits are checked for 1s instead. For each guaranteed bit in the A_{i−1} to A_{i−N} indexes, the maximum error is reduced by half. Checking only one bit corresponds to a maximum error of 25%, two bits to 12.5%, and so on.
An example of CFPU multiplication is shown in Fig. 6 for two 32-bit floating point numbers in precise FPU and proposed CFPU with Error max set to 12.5%. An Error max of 12.5% requires CFPU to check N = 2 tuning bits. The conventional FPU finds the correct solution of −510 by adding the exponents and multiplying the two mantissa, while XORing the sign bit to find three parts of the output data. Our design first compares the mantissas, which both contain a series of three identical bits, so operand A is selected for discarding. Next CFPU checks the first mantissa bit and the N tuning bits after that. In this case, the first mantissa bit is "1," so the next two bits are checked for "0" to determine if the value is below the desired error rate. When two tuning bits are checked, the maximum error is 12.5%. In this example, both tuning bits are "1," so the calculation continues in approximate mode and the mantissa from the value 8.5 is copied to the output value. The resulting output is −544, which deviates 6.67% from the correct value of −510 and falls below the desired Error max .
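The tuning check in this example can be sketched in Python. The mapping from Error max to N assumes the halving pattern stated above (one bit for 25%, two bits for 12.5%) continues for larger N; the function names are hypothetical, and the bit patterns are those implied by the example (a discarded mantissa starting "111" and a kept value of 8.5).

```python
def tuning_bits_for(error_max):
    """Smallest N whose assumed bound, 50% halved N times, does not exceed error_max."""
    n = 0
    while 0.5 / 2**n > error_max:
        n += 1
    return n

def passes_tuning(fraction, n_bits, width=23):
    """True if the N bits after the mantissa MSB all match the MSB."""
    first = (fraction >> (width - 1)) & 1
    for k in range(1, n_bits + 1):
        if (fraction >> (width - 1 - k)) & 1 != first:
            return False  # mismatch: send the operation to a more accurate mode
    return True

if __name__ == "__main__":
    n = tuning_bits_for(0.125)
    discarded = 0x700000  # mantissa bits "1110000...", as in the example above
    print(n, passes_tuning(discarded, n))
    print(abs(-544.0 - -510.0) / 510.0)  # relative error of the approximate result
```

With N = 2 the discarded mantissa passes, so the operation stays in the first stage and produces −544 instead of −510.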
D. Second Stage Shift and Add
If the first stage produces a result which has the error greater than Error max , CFPU activates the second stage instead. It has greater accuracy than the first stage but higher energy cost. The energy draw of the second stage is less than that of the exact hardware.
Some applications, when running with only a single level of approximation, require a high percentage of multiplies to be run in exact mode. FFT, MersenneTwister, and DwtHaar1D require almost 50% of their operations to be run in the CFPU exact mode to maintain output error below 10%, as shown in Table II. Running this high a percentage of operations in exact mode limits potential energy savings, so the addition of the second level of approximation to CFPU allows more of the multiply operations to be approximated while maintaining acceptable output error.
Our second stage approximation shifts and adds the kept mantissa with itself to produce a value closer to the exact output. Our hardware detects the positions of the first two "1" bits in the discarded mantissa. Fig. 4 shows the overall flow of CFPU. The first level uses the fractional bits directly, Frac 2, while the second level uses the shifted value of Frac 2 summed with the original value of Frac 2. Frac 1 provides tuning bits to predict if the error for the first level is too large. If it is, then the second level error is predicted. If this predicted error is also too great, the two mantissa values are multiplied together to produce an exact result. Fig. 7 demonstrates the shift and add design. The first "1" bit position, P1, is detected by hardware and determines S, the shift amount, where S is the number of mantissa bits minus the bit position. The mantissa used in the output is shifted S positions right and added to its unshifted value. The second "1" bit is used for tuning and helps determine the maximum error between the exact result and the approximate result. The maximum error decreases the further the first "1" bit is from the MSB, because the shifted and added mantissa decreases relative to the original mantissa. The closer the second "1" bit, P2, is to the first, the higher the error in the result. If the two "1" bits are adjacent, the shift and add value will be up to 50% smaller than necessary to reach an exact result.
The maximum error for the second level is based on the position of P1 within the discarded mantissa. The further P1 is from the MSB, the lower the maximum error will be. Fig. 8 provides an example of the computation which the first stage design computed with 6.67% error as shown in Fig. 6. In this example, P1 is in the fourth MSB, so the shift value is 4. Each mantissa contains a hidden "1" bit left of the MSB which must be accounted for and shifted in. The kept mantissa is shifted right by 4 bits to create the value "0001111..." and added to the unshifted value "1110000..." to produce the approximate output. In this example, the second stage design calculates the output exactly. The error decreases from 6.67% calculated by the first stage to 0% from the second stage while consuming less energy than the exact multiply operation.
In cases where there is only one "1" bit in the mantissa, the shift and add approach produces exact results. In this case, the discarded mantissa is effectively a power of 2 and as such resolves to an integer value by which to shift the kept mantissa.
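The shift and add step can be sketched in Python as below, assuming normal single-precision operands and bit positions numbered 22 (MSB) down to 0, so that S = 23 − P1. The function names are hypothetical, and normalization of a mantissa sum of 2 or more is left to the final floating point arithmetic rather than modeled explicitly.

```python
import struct

def fields(x):
    """Return (sign, biased exponent, fraction) of x as a single-precision value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def shift_add_multiply(kept, discarded):
    """Approximate kept * discarded using only the first '1' bit of the discarded mantissa."""
    sk, ek, fk = fields(kept)
    sd, ed, fd = fields(discarded)
    sign = sk ^ sd
    exponent = (ek - 127) + (ed - 127)
    kept_mantissa = (1 << 23) | fk    # include the hidden 1 so it is shifted in
    total = kept_mantissa
    if fd != 0:
        p1 = fd.bit_length() - 1      # position of the first '1' bit (22 = MSB)
        shift = 23 - p1               # S: mantissa bits minus the bit position
        total += kept_mantissa >> shift
    mantissa = total / (1 << 23)      # may reach [2, 4); the product below absorbs it
    return (-1.0) ** sign * mantissa * 2.0 ** exponent

if __name__ == "__main__":
    print(shift_add_multiply(-60.0, 8.5))  # discarded mantissa has a single '1' bit
    print(shift_add_multiply(8.5, -60.0))  # discarded mantissa "111...": underestimates
```

When the discarded mantissa contains a single "1" bit the result is exact, as stated above; otherwise the lower "1" bits are ignored and the result underestimates the exact product.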
1) Running on Exact Hardware: The output error produced by the second stage approximation can still exceed Error max, so some of the computations must be run on the exact hardware. Similar to the mantissa discarding approach, we use a threshold value to determine which values are run on the exact hardware. We search the discarded mantissa for the second occurrence of a "1" bit, P2. The position of P2 relative to the first bit, and P1 relative to the MSB determines the threshold value, V. The threshold is calculated as V = S + (P1 − P2), where S is the shift amount. The minimum threshold occurs when P1 = 22 and P2 = 21, the upper MSBs, giving S = 1 and V = 2. V = 2 corresponds to a maximum output error of 25% on an individual operation. As V increases, the maximum output error becomes 50%
V must be greater than N, the number of tuning bits, in order to guarantee the result will produce an error below Error max. If the output error still exceeds Error max, the operation runs on exact FPU hardware instead. The shift and add approach is more complex and power intensive when compared to the basic mantissa discard approach. The additional logical overhead and extra power draw make it viable as an intermediate option to reduce output error without requiring the full power consumption of the exact hardware. Shift and add is more accurate than mantissa drop, but if a user requires even greater output accuracy, operands producing highly erroneous results are sent to the exact hardware. The user can configure output accuracy by adjusting the Error max value for both stages.
Fig. 9 shows the circuitry to enable CFPU adaptive operand selection and accurate tuning. We implement adaptive operand selection by checking the mantissa bits in both input operands. Our design compares the mantissas to determine which produces the lowest output error. If one mantissa has bits which are all zero, the second mantissa is copied to the output to produce an exact result. To ensure the mantissa which produces the lowest error is discarded, the hardware must locate and discard the mantissa furthest from 1.5. The further the discarded mantissa is from 1.5, the lower the output error. The circuit examines the two input mantissas and detects the one with the longest chain of continuous 1s or 0s starting from the first bit. A chain of zeros represents a mantissa close to 1, and similarly, a chain of ones represents a mantissa close to 2. The mantissa with the longest consecutive chain of either zeros or ones is discarded and the other is copied to the output.
Fig. 9. Circuitry to support adaptive operand selector and tuning the level of approximation in CFPU.
IV. CFPU HARDWARE SUPPORT
As Fig. 9 shows, the detector circuitry is a simple transistor-resistor circuit which samples the match-line (ML) voltage to detect the A_{i−1}, A_{i−2}, ..., A_0 bits of the input operand. In case of any "1" bit in a mantissa, the sense amplifier detects changes in the ML voltage (ML = 1). However, if all mantissa bits are zero, no current passes through R_sense and the B operand mantissa is selected as the output mantissa. To detect a "1" bit on the A_{i−1}, ..., A_0 indices on CFPU, the sense amplifier Clk needs to be set to 250 ps. Based on the results, we can dynamically change the sampling time to balance the ratio of the input workload running on the approximate CFPU core. The operand selection happens by using two MUXs which are controlled by our detector hardware signal. Similarly, to tune the level of approximation, our design uses N bits (after the first mantissa bit) of the selected mantissa to decide whether to perform the mantissa multiplication or approximate it. The number of tuning bits sets the level of approximation, with each additional bit reducing the maximum error by half. The goal is to check the values of A_{i−1}, ..., A_{i−N} to make sure they are the same as A_i. For this purpose, the circuitry selects the original or inverted values of the tuning bits to search. To make the design area efficient, we use the same circuitry for adaptive operand selection and tuning approximation. For each application, the sampling time can be individually set in order to provide the target accuracy.
Fig. 11. Framework to support tunable CFPU approximation.
CFPU with two level approximation requires similar hardware to perform adaptive operand selection. The two leading "1" bits in the selected input operand are detected and their positions are used to calculate the maximum error. If the estimated error is less than Error max, shift and add is used; otherwise the result is computed on exact hardware. Fig. 10 shows the proposed hardware supporting two level approximation. The adaptive operand selector identifies the best mantissa for use in the result and the tuning bits of the discarded mantissa are examined. If the error is low, the best mantissa is copied directly to the output. Otherwise, a leading "1" bit detector identifies the bit positions of the discarded mantissa to predict the second stage error. If the second stage error meets the accuracy requirement, a shift block and an adder block compute the result mantissa. If the error of the second stage is too great, the standard mantissa multiplier hardware computes the result.
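The mode decision for the two level design can be summarized in a short Python sketch. It is a behavioral model only, not the circuit of Fig. 10: the names are hypothetical, bit positions are numbered 22 (MSB) down to 0, and the second stage test uses the rule stated earlier that V = S + (P1 − P2) must be greater than N.

```python
def leading_run(fraction, width=23):
    """Length of the run of identical bits starting at the MSB of the stored mantissa."""
    first = (fraction >> (width - 1)) & 1
    run = 0
    for pos in range(width - 1, -1, -1):
        if (fraction >> pos) & 1 != first:
            break
        run += 1
    return run

def select_mode(frac_a, frac_b, n_bits):
    """Return 'discard', 'shift_add', or 'exact' for two 23-bit stored mantissas."""
    # Adaptive operand selection: discard the mantissa with the longer leading run.
    discarded = frac_a if leading_run(frac_a) >= leading_run(frac_b) else frac_b
    # First stage: the MSB and the N tuning bits after it must all match.
    if leading_run(discarded) >= n_bits + 1:
        return "discard"
    # Second stage: locate the two leading '1' bits of the discarded mantissa.
    ones = [pos for pos in range(22, -1, -1) if (discarded >> pos) & 1]
    if len(ones) == 1:
        return "shift_add"            # a single '1' bit gives an exact shift and add
    p1, p2 = ones[0], ones[1]
    v = (23 - p1) + (p1 - p2)         # V = S + (P1 - P2)
    return "shift_add" if v > n_bits else "exact"

if __name__ == "__main__":
    print(select_mode(0x700000, 0x080000, 2))
    print(select_mode(0b10101 << 18, 0b10101 << 18, 2))
    print(select_mode(0b10101 << 18, 0b10101 << 18, 3))
```

Raising N tightens both tests at once, which is how a single accuracy setting moves operations from the cheapest mode toward the exact multiplier.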
V. RESULTS

A. Experimental Setup
We integrated the proposed approximate CFPU in the FPUs of an AMD Southern Island Radeon HD 7970 GPU. We modified Multi2sim, a cycle accurate CPU-GPU simulator [49], to model the CFPU functionality in three main floating point operations in GPU architecture: 1) multiplier; 2) multiplier-accumulator; and 3) multiply-add. We evaluated the energy of traditional FPUs using Synopsys Design Compiler and optimized for power using Synopsys PrimeTime for 1 ns delay in a 45-nm ASIC flow [50]. The circuit level simulation of the CFPU design was performed using the HSPICE simulator in 45-nm TSMC technology. We first test the efficiency of the enhanced GPU on twelve general OpenCL applications from the AMD OpenCL SDK [51]: Sobel, Robert, Mean, Laplacian, Sharpen, Prewit, QuasiRandom, FFT, Mersenne, DwtHaar1D, Blur, and Blackscholes. In these applications, roughly 85% of the floating point operations involve multiplication.
Our design is then tested using machine learning applications. Machine learning algorithms are often error-tolerant allowing them to be run more efficiently on approximate hardware. We examine three OpenCL benchmarks from the Rodinia 3.1 machine learning suite [31] . These benchmarks are KNN, Back Propagation, and K-means.
1) K-Means: Highly parallelizable clustering algorithm used in many data mining applications.
2) KNN: Finds the k nearest neighbors in the given data by calculating the Euclidean distance to many data points in parallel.
3) Back Propagation: Used to train the weights in a neural network. Error values are propagated through the network to retrain nodes and reduce output error.
We propose an automated framework to fine-tune the level of approximation and satisfy the required accuracy while providing the maximum energy savings. Fig. 11 shows the proposed framework, consisting of the accuracy tuning and accuracy measurement blocks. The framework starts by putting the CFPU in the maximum level of approximation, in which no tuning bits are checked. Then, based on the user accuracy requirement, it dynamically decreases the level of approximation one tuning bit at a time until the computation accuracy satisfies the user quality of service. The tuning is adjusted using a register setting to control the approximation level of the CFPU. For each application, this framework returns the optimal number of CFPU tuning bits checked, providing maximum energy and performance efficiency. In future runs, the detected optimal configuration is set using the accuracy register setting prior to running approximable code and disabled using the register after completion of the code.
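The tuning loop of the framework can be sketched in Python as follows. This is an illustration under assumptions: `run_app` and `measure_error` are hypothetical callables standing in for an application run on the simulated CFPU and for the accuracy measurement block, and the toy error model in the demo is invented purely to exercise the loop.

```python
def tune_bits(run_app, measure_error, target_error, max_bits=22):
    """Return the smallest number of tuning bits that meets target_error, or None."""
    for n_bits in range(max_bits + 1):   # start at maximum approximation (0 bits)
        output = run_app(n_bits)         # run with the accuracy register set to n_bits
        if measure_error(output) <= target_error:
            return n_bits                # first setting that satisfies the requirement
    return None                          # no setting qualifies: run on exact hardware

if __name__ == "__main__":
    # Toy stand-ins (assumptions): the "application" returns its own setting and
    # the measured error halves with every additional tuning bit.
    toy_run = lambda n_bits: n_bits
    toy_error = lambda n_bits: 0.5 / 2**n_bits
    print(tune_bits(toy_run, toy_error, target_error=0.10))
```

Because fewer tuning bits mean more operations stay in approximate mode, returning the first setting that meets the requirement also gives the most aggressive acceptable approximation.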
B. First Stage CFPU
We first look at approximate multiplication. The proposed modified FPU can run entirely in approximate mode while providing a level of accuracy that is still acceptable for many applications. Table I shows the computation accuracy, energy savings, and speedup of running eight general OpenCL applications on the approximate GPU. These applications achieve error below 10% while using only first stage approximation. The energy and performance of the proposed hardware are normalized to the energy and performance of a GPU using conventional FPUs. Our experimental evaluation shows that our approximate hardware can achieve up to 72% energy savings, 19% speedup, and 5.4× EDP improvement for these applications compared to the traditional AMD GPU, while providing an acceptable output quality of less than 10% average relative error.
1) Adaptive Operand Selection: The approximate multiply uses both exponents in its calculation, but discards one of the mantissas, making the operation effectively a multiply by a power of 2. Therefore, a multiplication by a power of 2 always results in an exact answer on our hardware. It is possible to reduce error by ensuring the value of the discarded mantissa is equal to 1. This occurs when all the mantissa bits are 0. In the 11 OpenCL applications we tested, an average of 52% of multiplies involved at least one power of 2. Hardware that checks both inputs and adaptively chooses which mantissa to discard produces more exact computations and greatly reduces overall output error. Fig. 12(a) compares the portion of multiplications which run precisely on the proposed CFPU with and without adaptive operand selection for the evaluated applications. Fig. 12(b) shows the impact of adaptive operand selection on the computation accuracy of the proposed CFPU. In random operand selection, the mantissa of the first input is always selected for the output without comparing the potential error of each mantissa. The result shows that adaptive operand selection significantly improves computation accuracy such that for all shown applications, the average relative error decreases to less than 7%. This improvement is due to increasing the portion of multiplications which are run precisely on the CFPU. We verify this by looking at the percentage of precise CFPU operations using the adaptive selection technique. For example, in the Sobel application, 82% of the outputs are calculated exactly, with an overall relative error of 16%, when using random operand selection, while adaptive selection shows 92% of the outputs calculated exactly, at an overall relative error of 9%. All operations in Sharpen, Prewit, and QuasiRandom contain a mantissa of zero, so CFPU computes all results exactly.
The image processing applications Sobel, Roberts, Blur, Mean, and Laplacian also involve many operations that can be computed exactly with CFPU. Using adaptive operand selection to only select and discard zero mantissas improves application accuracy by up to 8×. The work in [32] only utilizes a zero mantissa selection policy, which does not reduce error significantly for many applications. The improved adaptive operand selection increases the computation accuracy by up to 13× (8×) compared to random operand selection (zero mantissa only).
The application showing the best accuracy improvement, Roberts, decreases from 13.6% application error using random operand selection to 1.85% using zero mantissa only selection [32] . The improved adaptive operand selection further decreases the output error to 1.08%. Excluding the three applications that only contain zero mantissa operations and get the best possible results, our evaluation for 12 different applications shows adaptive selection reduces average error by 2.2× more than our previous work [32] .
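The first stage multiply and the effect of operand selection can be illustrated with a small Python sketch operating on IEEE 754 single-precision bit fields. This is a software model under stated assumptions, not the hardware design: the function names are hypothetical, only normal (nonzero, finite) inputs are handled, and the rule used when neither mantissa is zero (discard the smaller fraction, since discarding a mantissa m underestimates the product by a factor of m) is an assumption for illustration.

```python
import struct


def _to_bits(x: float) -> int:
    """Reinterpret x (rounded to float32) as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _from_bits(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def approx_mul(a: float, b: float, adaptive: bool = True) -> float:
    """Approximate a * b by adding exponents and keeping one mantissa."""
    ba, bb = _to_bits(a), _to_bits(b)
    sign = (ba ^ bb) & 0x80000000
    exp_a, exp_b = (ba >> 23) & 0xFF, (bb >> 23) & 0xFF
    frac_a, frac_b = ba & 0x7FFFFF, bb & 0x7FFFFF
    if exp_a in (0, 255) or exp_b in (0, 255):
        raise ValueError("sketch handles normal float32 inputs only")
    exp = exp_a + exp_b - 127  # add exponents, remove the doubled bias
    if not 1 <= exp <= 254:
        raise ValueError("result exponent outside the normal float32 range")
    if adaptive:
        # Assumed rule: discard the smaller fraction. A zero fraction
        # (a power of 2) is always the one discarded, giving an exact result.
        kept = max(frac_a, frac_b)
    else:
        # Non-adaptive baseline: always keep the first input's mantissa.
        kept = frac_a
    return _from_bits(sign | (exp << 23) | kept)


fixed = approx_mul(4.0, 3.0, adaptive=False)   # keeps the mantissa of 4.0
adaptive = approx_mul(4.0, 3.0, adaptive=True)  # discards the mantissa of 4.0
```

In this example the fixed choice returns 8.0 because the mantissa of 3.0 is thrown away, while the adaptive choice returns the exact product 12.0 because the discarded operand is a power of 2.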
2) Tuning: We show the efficiency of the proposed CFPU by running different multimedia and general streaming applications on the enhanced GPU architecture. We consider 10% average relative error as an acceptable accuracy metric for all applications, as verified by [52]. We tune the level of approximation by checking N bits of the mantissa in one of the input operands. If all N tuning bits match the first mantissa bit, the multiplication runs in approximate mode; otherwise, it runs precisely by multiplying the mantissas of the input operands. For each application, Table II shows the average relative error and the portion of multiplications that run on the exact and on the approximate CFPU when the number of tuning bits changes from 0 to 4. Increasing the number of tuning bits improves the computation accuracy by processing the multiplications that would be most inaccurate in precise CFPU mode. It also slows down the computation, because a larger portion of data is processed on the precise CFPU. Fig. 13 shows the energy consumption and EDP of a GPU enhanced with the tunable CFPU using different numbers of tuning bits. Our experimental evaluation shows that running applications on the proposed CFPU provides 3.5× (2.7×) EDP improvement compared to a GPU using traditional FPUs, while ensuring less than 10% (1%) average relative error.
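The tuning-bit check can be sketched in Python as a predicate on the 23-bit mantissa fraction of a single-precision operand. The names below are hypothetical, and one detail is an interpretation rather than something stated above: the sketch compares the N bits that follow the first mantissa bit against that first bit, so that N = 0 sends every multiplication to the approximate path.

```python
import struct

MANTISSA_BITS = 23  # fraction width of IEEE 754 single precision


def mantissa_fraction(x: float) -> int:
    """Return the 23 stored fraction bits of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0] & 0x7FFFFF


def runs_approximate(frac: int, n_tuning_bits: int) -> bool:
    """True if the N bits after the first mantissa bit all equal that bit."""
    if not 0 <= n_tuning_bits < MANTISSA_BITS:
        raise ValueError("n_tuning_bits must be between 0 and 22")
    first = (frac >> (MANTISSA_BITS - 1)) & 1
    for i in range(1, n_tuning_bits + 1):
        bit = (frac >> (MANTISSA_BITS - 1 - i)) & 1
        if bit != first:
            return False  # mismatch: fall back to the exact multiplier
    return True


def approximate_hit_rate(operands, n_tuning_bits: int) -> float:
    """Fraction of the given operands that pass the tuning-bit check."""
    hits = sum(
        runs_approximate(mantissa_fraction(x), n_tuning_bits) for x in operands
    )
    return hits / len(operands)


# Made-up operands: 1.0 and 1.0625 start with zeros, 1.9375 with ones.
sample = [1.0, 1.5, 1.0625, 1.9375]
rate_0_bits = approximate_hit_rate(sample, 0)
rate_2_bits = approximate_hit_rate(sample, 2)
```

With these made-up operands, every value takes the approximate path when no bits are checked, while checking two bits diverts 1.5 to the exact path, which mirrors the trade-off between hit rate and accuracy described above.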
C. Second Stage CFPU
Although the proposed first stage approximate multiplication provides high energy savings, the accuracy of computation depends on the application. For some applications with quantized inputs, e.g., the Sharpen filter, the proposed design can work precisely with no average relative error. Other applications, such as recognition algorithms like motion tracking and detection applications, quantify changes in the input data, allowing them to tolerate small amounts of error. An approximate multiplier must be able to control the level of output error to ensure close to exact results are calculated for these applications. Fig. 14 shows the distribution of error rates for each multiply operation of two applications. In the case of Sobel, almost 90% of the multiplies are by a power of 2 and are handled exactly by our approximate solution. The remaining 10% of operations have incorrect values with error rates ranging up to 50%. The Mersenne Twister application, on the other hand, has a more even distribution of error rates. While about 12% of the computations have no error, the error rates are too randomly distributed to provide acceptable overall error without additional optimization. For this application, the first stage approximation does not provide sufficient accuracy on its own, so over 50% of operations must be run on exact hardware to keep error below 1%. Table III lists the hit rate of approximate hardware and average relative error for different applications running on hardware using two levels of approximation. The result shows that CFPU using two-level approximation can provide significantly higher accuracy compared to single level approximation. This efficiency comes from the ability of CFPU to assign input data to the approximation hardware that better suits it. Fig. 15 shows the energy consumption and EDP of CFPU using a two-level approximation.
The result shows that accepting 10% (1%) average relative error, CFPU can provide 4.1× (3.2×) EDP improvement as compared to a GPU using traditional FPUs. To ensure the quality of computation, Fig. 16 compares the visual results of Blur running on precise and approximate hardware. Our result shows that approximate computing creates no noticeable difference between the precise and approximate result images.
Tables IV and V list the energy efficiency improvement, speedup, and average relative error of running Rodinia applications on CFPU with one and two levels of approximation. The results are listed as the number of tuning bits changes from 0 to 4. For machine learning algorithms, CFPU with a single stage achieves 1.6× energy savings and 1.4× speedup while ensuring less than 1% average relative error. Enabling second stage approximation increases energy savings to 2.4× and speedup to 2.0×, 50% and 40% improvements, respectively. Fig. 17 shows the energy and accuracy improvements for each optimization to the CFPU. In the first case, mantissa discarding is used for every operation, resulting in the highest energy savings, but poor accuracy. Adaptive selection reduces error but adds a small additional overhead. Tuning is used to reach accuracy requirements, but energy savings decrease drastically because a large portion of operations run on exact hardware. Finally, the second stage shift and add is used to reduce energy while still maintaining accuracy. Our evaluation shows that the proposed CFPU design can achieve 4.1× (3.2×) EDP improvement while ensuring less than 10% (1%) average relative error. The tested algorithms performed well when coupled with approximate hardware. Fig. 18 compares the output of K-means on exact and approximate hardware. All classification errors occur around the boundaries of the clusters. All operations are computed approximately and only 2% of the points are incorrectly classified.
D. Overhead and Comparison
The first stage adds a 3.4% area overhead to the FPU, while the second stage adds an extra 6.2% area overhead. The energy overhead of a multiply operation when running on CFPU in exact mode is 2.7%, which is negligible compared to efficiency and tuning capability that CFPU can provide. In order to outperform the standard FPU, our design needs to run at least 4% of the data in the first or second stage. We observed significantly higher percentages in all of the applications tested on the proposed CFPU.
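One simple way to reason about such a break-even point is an energy-only model, sketched below in Python. This model is an assumption for illustration and not necessarily the analysis behind the 4% figure: it treats an exact FPU multiply as costing 1 unit, an exact-mode CFPU multiply as costing 1 plus the overhead, and an approximate multiply as costing a hypothetical relative energy `approx_energy`, which is not a value reported here.

```python
def break_even_fraction(exact_overhead: float, approx_energy: float) -> float:
    """Fraction p of approximate multiplies at which CFPU matches the FPU.

    Solves p * approx_energy + (1 - p) * (1 + exact_overhead) = 1 for p,
    with all energies expressed relative to one exact FPU multiply.
    """
    if not 0.0 <= approx_energy < 1.0:
        raise ValueError("approx_energy must be in [0, 1) to save energy")
    exact_cost = 1.0 + exact_overhead
    return exact_overhead / (exact_cost - approx_energy)


# 2.7% exact-mode overhead from the text; 0.35 is a hypothetical
# relative energy for an approximate multiply, chosen for illustration.
p_min = break_even_fraction(exact_overhead=0.027, approx_energy=0.35)
```

Under these assumed numbers the break-even fraction comes out just under 4%, so any application sending a larger share of its multiplies to the approximate stages would consume less multiply energy than the standard FPU in this model.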
To understand the advantage of the proposed design, we compare the energy consumption and delay of the proposed CFPU with the state-of-the-art approximate multipliers proposed in [26], [28], and [30]. Previous designs are limited to a small range of robust and error-tolerant applications, as they are not able to tune the level of accuracy at runtime. In contrast, our CFPU dynamically predicts the inaccurate results and processes them in precise mode. CFPU tunes the level of accuracy at runtime based on the user accuracy requirement. The ability to run in exact mode while still saving power increases the range of applications that benefit from it. Table VI lists the power consumption, critical path delay, and EDP of CFPU alongside previous work in [26], [28], and [30] in their best configurations. We set CFPU to use three tuning bits, so the maximum output error per operation is the same as or less than that of the multipliers we compare against. Tuning requires bit checks which increase energy consumption, so a CFPU configured to check 3 bits has slightly lower energy savings than one configured to predict error. Our evaluation shows that at the same level of accuracy, the proposed design can achieve 2.8× EDP improvement compared to the state-of-the-art approximate multipliers for a multiply operation.
VI. CONCLUSION
In this paper, we propose a configurable floating point multiplier which can approximately perform the computation with significantly lower energy and performance cost. CFPU controls the level of approximation by processing the data in one of the three tiers: 1) basic approximate mode; 2) intermediate approximate mode; and 3) on the exact hardware. The first stage approximate mode discards one input's mantissa and uses the second's directly in the output to save energy. Accuracy is tuned by examining the discarded mantissa to estimate output error. When error exceeds a user-specified maximum, CFPU uses a second level of approximation. This mode uses a shift and add to increase accuracy. If the approximate output error is too high, the multiply is run on exact hardware. Our results show that using first stage CFPU approximation results in 3.5× EDP improvement compared to an unmodified FPU, while ensuring less than 10% average relative error [32] . Adding the second stage further increases the EDP improvement, compared to the base FPU, to 4.1× for that same level of accuracy. In addition, our results show the proposed CFPU achieves 2.8× EDP improvement for multiply operations as compared to the state-of-the-art approximate multipliers.
