Abstract-In this paper we present approximate adders and multipliers to reduce the datapath complexity of image processing systems with only a small degradation in PSNR performance. We build upon the approximate circuits proposed in [8] and [9]. We show that selective application of accurate and approximate adders can significantly improve the accuracy of a 2D DCT system. For instance, our implementation of 2D DCT achieves PSNR performance comparable to [8] with a 39-50% reduction in area. We also propose an approximate multiplier in which the partial products have varying degrees of approximation. Such a multiplier helps improve the accuracy of the system, as demonstrated through FFT and Gaussian filter case studies.
I. INTRODUCTION
Adders and multipliers are the basic building blocks in the datapath of any signal processing implementation. Dynamic range adjustment of datapath units, or truncation, has been shown to be an effective way of reducing the area, latency and power of datapath units [1] [2] [3]. Recently, approximate circuits have been proposed in [4] [5] [6] [7] [8] [9]; these also reduce area and power but generate more accurate results than truncation-based systems.
One of the earliest works in approximate circuits is the 'almost correct adder' in [4], which propagates the carry up to log n stages, where n is the datapath width. The approximate adder in [5] utilizes logic synthesis to design approximate versions of a given function. It determines the minterms that produce an approximate version of a circuit with the smallest number of literals for a given error rate threshold. A reconfigurable approximate adder that supports varying accuracy requirements is proposed in [6]. An approximate adder and an approximate Booth multiplier are proposed in [7] and applied to the pipeline stages of a superscalar processor. A prediction unit predicts the error in the early stages of the pipeline and uses the accurate result instead. In [8], several approximations to a full adder circuit are proposed by removing transistors systematically. The DCT implementation results provided in that paper demonstrate that this is a very promising approach. An inaccurate 2x2 multiplier based on Karnaugh map simplification is proposed in [9]; it acts as the building block for larger multipliers. The inaccurate multiplier achieves 31.78%-45.4% power savings compared to the accurate design [9].
In this paper we build upon the work presented in [8] and [9] for approximate adders and multipliers, respectively. We show that a mixture of accurate and approximate adders can achieve very high performance with low area. For the 2D Discrete Cosine Transform (DCT), we show that certain dataflow configurations are more amenable to this method and that, compared to [8], we can achieve a ~39-50% reduction in area with comparable PSNR performance. Next we propose enhancements to the approximate multiplier in [9]. We show that if some of the 2x2 multiplier blocks in the design of the larger multiplier are accurate, the overall accuracy can be improved significantly, especially when two large numbers are multiplied. For the Fast Fourier Transform (FFT), we show that the proposed approximate multiplier achieves 4.7 dB better PSNR performance, with a very small area overhead, than an implementation that uses the multiplier in [9].
II. APPROXIMATE ADDERS

A. Prior Work
In order to reduce the complexity of the datapath unit, approximations to the full adder (FA) circuit have been proposed in [8]. The approximate circuits are obtained by systematically removing transistors from the 24-transistor mirror adder circuit. The two rules followed in this process are [...] The motivation for Approximation 5 is to avoid carry propagation altogether by making the sum independent of C_in. To achieve this, the Sum output is approximated to B and C_out is approximated to A.
The proposed approximations result in significant reduction in the area. 
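As a behavioural illustration of Approximation 5 as described above (Sum taken from B, C_out taken from A), the following Python sketch compares the approximate cell with an exact full adder over all eight input combinations. The function names are ours and purely illustrative; the sketch models only the logic, not the transistor-level circuit of [8].

```python
from itertools import product

def full_adder_exact(a, b, cin):
    """Exact 1-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))

def full_adder_approx5(a, b, cin):
    """Behavioural model of Approximation 5: Sum = B, C_out = A.
    C_in is ignored, so no carry ripples through this cell."""
    return b, a

def count_mismatches():
    """Count input combinations where Sum / C_out differ from the exact cell."""
    sum_errors = 0
    carry_errors = 0
    for a, b, cin in product((0, 1), repeat=3):
        s_exact, c_exact = full_adder_exact(a, b, cin)
        s_apx, c_apx = full_adder_approx5(a, b, cin)
        sum_errors += int(s_exact != s_apx)
        carry_errors += int(c_exact != c_apx)
    return sum_errors, carry_errors

if __name__ == "__main__":
    print(count_mismatches())
```

In this model the Sum output differs from the exact cell in 4 of the 8 input combinations and C_out in 2 of 8, which is why such a cell is only suitable for the least significant bits.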
B. Case Study: Discrete Cosine Transform (DCT)
The 2D DCT is implemented by applying 1D DCT along the rows and then applying 1D DCT along the columns. The multiplications with the cosine coefficients are replaced with additions and shifts. In [8] , 1D DCT is implemented using matrix vector multiplication; there is no reuse of computations. Our implementation of 1D DCT reuses many of the computations as shown in Figure 1 . The first stage of DCT is implemented by employing a simple butterfly structure. The adders in the first stage butterfly structure are accurate since these results are used in the rest of the computations. The figure also elaborates how the coefficients w0, w1, …, w7 are computed. All adders used in the computations of w0, w1, …, w7 from y0, y1, …, y7 are approximate.
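The reuse obtained from a first-stage butterfly can be sketched as follows. This is a minimal floating-point illustration that assumes the standard even/odd decomposition of an 8-point DCT-II (sums x[i] + x[7-i] feed the even coefficients, differences x[i] - x[7-i] feed the odd ones). The exact structure and indexing of Figure 1 are not reproduced here, the ordering of the butterfly outputs is our assumption, and the cosine multiplications stand in for the shift-and-add networks used in the hardware.

```python
import math

def butterfly_stage(x):
    """First-stage butterfly on 8 samples: 4 sums followed by 4 differences.
    The output ordering is an assumption, not taken from Figure 1."""
    assert len(x) == 8
    sums = [x[i] + x[7 - i] for i in range(4)]
    diffs = [x[i] - x[7 - i] for i in range(4)]
    return sums + diffs

def dct_direct(x, k):
    """Unnormalised 8-point DCT-II coefficient k computed from all 8 inputs."""
    return sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))

def dct_from_butterfly(y, k):
    """Same coefficient computed from only 4 butterfly outputs:
    even k uses the sums, odd k uses the differences."""
    half = y[:4] if k % 2 == 0 else y[4:]
    return sum(half[i] * math.cos((2 * i + 1) * k * math.pi / 16) for i in range(4))

if __name__ == "__main__":
    x = [52, 55, 61, 66, 70, 61, 64, 73]
    y = butterfly_stage(x)
    for k in range(8):
        print(k, round(dct_direct(x, k), 6), round(dct_from_butterfly(y, k), 6))
```

Each coefficient then needs four weighted terms instead of eight, which is the kind of computation reuse that a matrix-vector formulation does not exploit.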
We now compare the results of the forward-backward DCT on six standard 512x512 images: Baboon, Barbara, Boat, House, Lena and Peppers. The method in [8] uses a 20-bit datapath where approximate adders are used for the 9 LSBs and accurate adders are used for the 11 MSBs. Our implementation uses a 16-bit datapath where approximate adders are used for the 6 LSBs and accurate adders are used for the remaining 10 MSBs. We present the PSNR results for the DCT implementations using accurate adders, Approximation 1 and Approximation 5 adders in Table I. The PSNR values are calculated using MATLAB double-precision floating-point results as the ground truth. Compared to [8], we achieve an average PSNR improvement of 3.13 dB and an area reduction of 50.89% when accurate adders are used. The corresponding PSNR improvement and area reduction numbers are 4.33 dB and 48.96% for Approximation 1 adders, 2.38 dB and 48.18% for Approximation 2 adders, 7.54 dB and 47.52% for Approximation 3 adders, 6.58 dB and 47.82% for Approximation 4 adders, and 1.8 dB and 39.87% for Approximation 5 adders. We can see that an implementation with Approximation 5 adders performs quite well for all cases, with almost no overhead. The improvement in PSNR is achieved by introducing accurate adders in the first butterfly stage, and the area reduction is achieved by reuse of computations.
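The split between approximate LSBs and accurate MSBs can be modelled bit by bit. The sketch below is a behavioural model only, assuming a ripple-carry organisation with Approximation 5 cells (Sum = B, C_out = A) in the low bits and exact full adders above them. The function name `hybrid_add` and its parameters are hypothetical, and the final carry out is simply dropped.

```python
def full_adder_exact(a, b, cin):
    """Exact 1-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))

def full_adder_approx5(a, b, cin):
    """Behavioural model of Approximation 5: Sum = B, C_out = A."""
    return b, a

def hybrid_add(x, y, width=16, approx_lsbs=6):
    """Add two unsigned `width`-bit values with approximate cells in the
    `approx_lsbs` least significant positions and exact cells elsewhere.
    The carry out of the top bit is discarded (result wraps modulo 2**width)."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        cell = full_adder_approx5 if i < approx_lsbs else full_adder_exact
        s, carry = cell(a, b, carry)
        result |= s << i
    return result

if __name__ == "__main__":
    print(hybrid_add(1234, 4321, approx_lsbs=0))  # all cells exact
    print(hybrid_add(32, 5))                      # 6 approximate LSBs
```

With `approx_lsbs=0` the model reduces to an ordinary adder. With the default of 6 approximate LSBs, `hybrid_add(32, 5)` returns 69 rather than 37: the low six result bits copy the second operand, and bit 5 of the first operand is forwarded as the carry into the accurate part.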
Through this study, we conclude that selective use of accurate as well as approximate adders can lead to a more optimized solution in terms of both accuracy and area. Secondly, keeping the early stages of the algorithm accurate provides scope for aggressive approximation in later stages. Finally, this method performs better than truncation. For instance, truncating the 6 LSBs in the 16-bit datapath implementation results in almost the same area as our implementation using Approximation 5, but with 2-3 dB lower PSNR.
III. APPROXIMATE MULTIPLIER
A. Background
The basic building block of the approximate multiplier is the 2x2 multiplier, which multiplies two 2-bit words (a1 a0) and (b1 b0). An accurate 2x2 multiplier must produce 4 output bits, since the largest number generated by a 2x2 multiplication is 9 (1001). In [9], an approximation to 2x2 multiplication is introduced in which this value is estimated as 7, which can be represented with 3 bits (111). Restricting the output of the 2x2 multiplier from 4 to 3 bits greatly reduces the complexity and introduces only a small error: only one of the 16 possible input combinations produces an erroneous result. Larger multipliers are built using this 2x2 multiplier in [9].
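The behaviour of the 2x2 block of [9] described above can be checked exhaustively. The following sketch is a behavioural model only (the gate-level logic functions are not reproduced); the function names are illustrative.

```python
def mul2_reference(a, b):
    """Behavioural model of the approximate 2x2 multiplier of [9]:
    the product 3 * 3 is reported as 7 (111) instead of 9 (1001),
    so the output always fits in 3 bits."""
    if not (0 <= a <= 3 and 0 <= b <= 3):
        raise ValueError("operands must be 2-bit values")
    return 7 if (a == 3 and b == 3) else a * b

def erroneous_inputs():
    """List every operand pair for which the block differs from a * b."""
    return [(a, b) for a in range(4) for b in range(4)
            if mul2_reference(a, b) != a * b]

if __name__ == "__main__":
    print(erroneous_inputs())
```

The only mismatching pair is (3, 3), matching the statement that one of the 16 input combinations is erroneous.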
B. Proposed Multiplier
We propose three changes to the multiplier in [9]. First, we further approximate the 2x2 multiplier building block by approximating out0 to 0. Even though the critical path of the resulting multiplier remains the same, the area is reduced. Figure 2 shows the logic functions of the accurate, reference [9] and proposed 2x2 multipliers. The second enhancement concerns the way a larger multiplier is built. The multiplier in [9] uses the same 2x2 approximate multiplier to compute all partial products. Instead, we introduce three levels of approximation within the larger multiplier. We calculate the least significant partial product with the maximum degree of approximation, the middle partial products with medium approximation and the most significant partial product with no approximation. Figure 3 describes the proposed architecture. The motivation behind introducing three levels of approximation is that, when two large numbers are multiplied, only the lower partial products are computed with approximation. This improves the accuracy of the multiplier compared to the reference multiplier.
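A behavioural sketch of the graded construction is given below. It rests on several assumptions that are ours rather than stated in the text: the proposed 2x2 block is modelled as the reference block of [9] with out0 forced to 0, each 4x4 block is built from a single type of 2x2 block, and "medium approximation" corresponds to the reference block. All function names are hypothetical.

```python
def mul2_exact(a, b):
    return a * b

def mul2_reference(a, b):
    """2x2 block of [9]: 3 * 3 is approximated as 7."""
    return 7 if (a == 3 and b == 3) else a * b

def mul2_proposed(a, b):
    """Assumed model of the proposed 2x2 block: reference block with out0 = 0."""
    return mul2_reference(a, b) & 0b110

def compose(mul_hh, mul_hl, mul_lh, mul_ll, half):
    """Build a 2*half-bit multiplier from four half-bit multipliers,
    one per partial product (high*high, high*low, low*high, low*low)."""
    mask = (1 << half) - 1
    def multiply(a, b):
        ah, al = a >> half, a & mask
        bh, bl = b >> half, b & mask
        return ((mul_hh(ah, bh) << (2 * half))
                + ((mul_hl(ah, bl) + mul_lh(al, bh)) << half)
                + mul_ll(al, bl))
    return multiply

def uniform(block, half):
    """Multiplier in which every partial product uses the same block."""
    return compose(block, block, block, block, half)

mul4_exact = uniform(mul2_exact, 2)
mul4_reference = uniform(mul2_reference, 2)
mul4_proposed = uniform(mul2_proposed, 2)

# Reference 8x8: the same approximation in every partial product.
mul8_reference = uniform(mul4_reference, 4)
# Graded 8x8: exact MSB product, medium middle products, most approximate LSB product.
mul8_graded = compose(mul4_exact, mul4_reference, mul4_reference, mul4_proposed, 4)

if __name__ == "__main__":
    print(mul8_graded(240, 240), mul8_reference(240, 240), 240 * 240)
```

For the operands 240 x 240, only the most significant partial product is non-zero, so this graded model returns the exact product 57600 while the uniform reference model returns 44800.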
While our multiplier performs better when inputs are large, it can be seen that the accuracy drops when the inputs are small. To overcome this, we vary the accuracy of our multiplier based on the dynamic range of the inputs. We compute the product with medium approximation (reference multiplier) when either input is smaller than 16 and compute the product with most aggressive approximation (proposed multiplier) when both inputs are greater than 16.
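The input-dependent selection can be sketched as a small wrapper around two multiplier models. The multipliers passed in below are placeholders that only tag which path was taken; the names are hypothetical, and treating an input equal to 16 as "large" is our assumption, since the text only specifies the cases below and above 16.

```python
def make_range_adaptive(mul_small_inputs, mul_large_inputs, threshold=16):
    """Return a multiplier that picks an approximation level from the
    dynamic range of its operands.

    mul_small_inputs: used when either operand is below `threshold`
                      (medium approximation in the text).
    mul_large_inputs: used otherwise (most aggressive approximation).
    """
    def multiply(a, b):
        if a < threshold or b < threshold:
            return mul_small_inputs(a, b)
        return mul_large_inputs(a, b)
    return multiply

# Placeholder multipliers (not real approximate designs): they return the
# exact product together with a tag naming the selected path.
def placeholder_medium(a, b):
    return ("medium", a * b)

def placeholder_aggressive(a, b):
    return ("aggressive", a * b)

if __name__ == "__main__":
    adaptive = make_range_adaptive(placeholder_medium, placeholder_aggressive)
    print(adaptive(15, 200))   # one small operand
    print(adaptive(200, 17))   # both operands large
```

In hardware, a comparison against 16 on an 8-bit operand amounts to checking whether any of the four upper bits is set, so the selection logic is inexpensive.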
We now evaluate our 8x8 multiplier against the reference multiplier by sweeping the two inputs A & B from 0-255 and computing the percentage error for each input bin, as shown in Figure 4. It can be noted that the proposed multiplier has the same performance as that of the reference multiplier when either (or both) input is small. However, the proposed multiplier performs much better than the reference multiplier for larger inputs.
Figure: Building an 8x8 Multiplier using 4x4 multipliers. The figure shows varying degree of approximations in the multiplier building blocks.
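An input sweep of this kind can be reproduced in software. The sketch below applies it to a behavioural model of the reference 8x8 multiplier built recursively from the 2x2 block of [9]. The bin size of 32, the use of the mean error per bin and the convention of reporting 0% error when the exact product is 0 are assumptions of this sketch, not details taken from the evaluation above.

```python
def ref_mul(a, b, bits=8):
    """Behavioural reference multiplier: recursive composition of the
    2x2 block of [9], in which 3 * 3 is approximated as 7."""
    if bits == 2:
        return 7 if (a == 3 and b == 3) else a * b
    half = bits // 2
    mask = (1 << half) - 1
    ah, al = a >> half, a & mask
    bh, bl = b >> half, b & mask
    return ((ref_mul(ah, bh, half) << bits)
            + ((ref_mul(ah, bl, half) + ref_mul(al, bh, half)) << half)
            + ref_mul(al, bl, half))

def percent_error(mul, a, b):
    """Percentage error of mul(a, b) relative to the exact product."""
    exact = a * b
    if exact == 0:
        return 0.0  # convention chosen for this sketch
    return 100.0 * abs(mul(a, b) - exact) / exact

def binned_percent_error(mul, bin_size=32, max_value=255):
    """Mean percentage error over square bins of the (A, B) input plane."""
    bins = (max_value + 1) // bin_size
    totals = [[0.0] * bins for _ in range(bins)]
    for a in range(max_value + 1):
        for b in range(max_value + 1):
            totals[a // bin_size][b // bin_size] += percent_error(mul, a, b)
    cell = bin_size * bin_size
    return [[t / cell for t in row] for row in totals]

if __name__ == "__main__":
    for row in binned_percent_error(ref_mul):
        print(" ".join(f"{v:6.2f}" for v in row))
```

Because every 2x2 block in this model returns at least 7/9 of its exact product, the error of the reference model never exceeds about 22.2%, a bound reached for 255 x 255.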
C. Comparison of Hardware Complexity
We compare the hardware complexity of the accurate, reference [9] and proposed multipliers for the 2x2 and 8x8 configurations.
The multipliers are synthesized using Synopsys Design Compiler for the SAED 90nm Generic Library (optimized for power). The areas of the accurate, reference [9] and proposed 2x2 multipliers are 52.53, 26.72 and 19.35 nm², respectively, and the latencies are 0.69, 0.53 and 0.53 ns, respectively. For the 8x8 multiplier, the gate count of the proposed multiplier is ~40% that of the accurate multiplier and ~46% higher than that of the multiplier in [9]. This is because our multiplier also includes some accurate building blocks in order to improve the accuracy.
D. Case Studies
We use the proposed approximate multiplier to implement a 5x5 Gaussian filter and a 32-point Fast Fourier Transform (FFT). We compare the performance of the different implementations with respect to PSNR, where the ground truth is obtained by running MATLAB simulations in double-precision floating point. The results for six standard images (Baboon, Barbara, Boat, House, Lena and Peppers), each of size 512x512, are presented here.
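For reference, the PSNR metric used in these comparisons can be computed as follows. This is a minimal sketch using the standard definition PSNR = 10 log10(peak^2 / MSE); the peak value of 255 (8-bit pixels) and the flat-list image representation are assumptions of the sketch.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    sequences of pixel values (e.g. flattened images).

    reference: ground-truth pixels (here, the floating-point result).
    test:      pixels produced by the implementation under evaluation.
    peak:      maximum pixel value; 255 assumes 8-bit images.
    """
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)

if __name__ == "__main__":
    truth = [10, 20, 30, 40]
    approx = [11, 19, 30, 42]
    print(round(psnr(truth, approx), 2))
```

A difference of a few dB between two implementations, as reported in the tables, therefore corresponds to a multiplicative change in mean squared error: 3 dB is roughly a factor of two.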
Gaussian Filter: We use 8x8 multipliers to implement the 5x5 Gaussian smoothing filter for σ = 1. The internal precision is 16 bits for both adders and multipliers. The performance of the implementations using the different multipliers is shown in Table II. The implementation with the proposed multiplier shows an average drop of 3.04 dB in accuracy with an approximately 35% reduction in area compared to the accurate multiplier. As the coefficients of the Gaussian filter are fairly small, our multiplier uses almost the same set of approximations as the reference multiplier [9]. This leads to almost identical results for the two cases. For filters with large coefficients, our implementation has significantly better results compared to [9].
Fast Fourier Transform (FFT): We implemented a 2D 32x32 FFT by first applying a 1D FFT along the rows and then along the columns. We used the Split Radix FFT (SRFFT) algorithm as described in [10] to implement the 1D FFTs. The results of the forward-backward FFT for the approximate and accurate multipliers are presented in Table III. The internal precision of the adders and multipliers is 16 bits. The width of the output of the first 1D FFT is 12 bits and that of the second 1D FFT is 16 bits. The same configuration was used for all three implementations. Our implementation has significantly better performance than the implementation using the reference multiplier in [9] and comparable performance to the accurate multiplier implementation.
When two large numbers are multiplied, calculating all the partial products with the same level of approximation leads to a steeper drop in accuracy. Varying the degree of approximation within the large multiplier, by computing the MSB partial product with an accurate multiplier building block and the LSB partial product with an approximate multiplier building block, leads to higher accuracy. Overall, this scheme results in an average improvement of 4.7 dB in PSNR for a very small area overhead.
IV. CONCLUSION
In this study we present an effective way to implement image processing algorithms using approximate datapath units. Instead of using approximate adders and multipliers indiscriminately, we selectively use approximate components so that the performance drop is quite small. We show that for 2D DCT, compared to [8] , we can achieve ~39-50% area reduction with comparable PSNR performance. We also propose a multiplier with varying degrees of accuracy. The proposed multiplier performs much better than the multiplier in [9] with a small area overhead. For instance, we achieve a 4.7 dB improvement in PSNR performance for 2D 32x32 FFT compared to the implementation using the approximate multiplier proposed in [9] .
ACKNOWLEDGMENT
This work was supported in part by NSF CSR 0910699.
