Abstract-Sometimes reducing the precision of a numerical processor, by introducing errors, can lead to significant performance (delay, area and power dissipation) improvements without compromising the overall quality of the processing. In this work, we show how to perform the two basic operations, addition and multiplication, in an imprecise manner by simplifying the hardware implementation. With the proposed "sloppy" operations, we obtain a reduction in delay, area and power dissipation, and the error introduced is still acceptable for applications such as image processing.
I. INTRODUCTION
There are several fields of application of computer arithmetic that can tolerate some imprecision. For example, in audio and image processing or in wireless communication, it might be desirable to get better performance (faster, smaller, less power-hungry systems) at the expense of some quality degradation. Recently, a few papers have addressed the issue of designing imprecise hardware to save power [1], [2], [3], [4]. In this work, we introduce a systematic way of implementing imprecise arithmetic for the two most common operations: addition and multiplication. We liked the term "sloppy" introduced in [5], and we use it in the paper to refer to imprecise arithmetic operations.
II. SLOPPY ADDITION
Ignoring the least-significant bits of an addition by implementing a truncated adder saves area and delay at the expense of a truncation error. Instead of completely ignoring the least-significant bits, in the "sloppy" approach we do not propagate the carry in those bits. Assuming that we are operating on positive integers, and defining position k as the bit of weight $2^k$ in an n-bit word, we can ignore the carry up to position k when implementing the addition. The bit-level algorithm to implement this sloppy adder is the following: [...] That is, the sloppy adder computes 161 (exact value is 173), introducing an error E = 12.
By looking at the bits of weight < $2^k$, we notice that the XOR of two ones produces a zero sum bit (1 ⊕ 1 = 0). Because the carry is not computed (or propagated), in position k an error $2^{k+1}$ is generated. The error can be halved to $2^k$ by computing the OR of the two bits in place of the XOR. For the example above, the sloppy adder then computes 167, and the error is reduced from E = 12 to E = 6 (halved).
By simulating all possible combinations of the operands for the 8-bit addition (k = 4), we found that OR-ing the k least-significant bits gives an average error $E_{mean} = 3.75$, while XOR-ing them gives $E_{mean} = 7.5$.
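The exhaustive simulation described above can be sketched in a few lines of Python. This is a minimal behavioural model, not the hardware description used in the paper: the function names `sloppy_add` and `mean_error` are hypothetical, and the operands 87 and 86 are assumptions chosen only because they reproduce the values 161, 167 and 173 quoted in the text (the operands of the original example are not available here).

```python
def sloppy_add(a, b, k, use_or=False):
    """Add two non-negative integers without carries in the k low bits."""
    mask = (1 << k) - 1
    # Upper part: exact addition of the bits of weight >= 2^k, carry-in is 0.
    high = ((a >> k) + (b >> k)) << k
    # Lower part: bitwise XOR (carry-less sum) or OR (halved error).
    low = ((a | b) if use_or else (a ^ b)) & mask
    return high | low


def mean_error(n, k, use_or):
    """Average of (exact - sloppy) over all pairs of n-bit operands."""
    total = 0
    for a in range(1 << n):
        for b in range(1 << n):
            total += (a + b) - sloppy_add(a, b, k, use_or)
    return total / (1 << (2 * n))


if __name__ == "__main__":
    a, b = 87, 86  # hypothetical operands, chosen so that a + b = 173
    print(sloppy_add(a, b, 4), sloppy_add(a, b, 4, use_or=True))
    print(mean_error(8, 4, False), mean_error(8, 4, True))
```

Under these assumptions the first line printed is `161 167` and the second is `7.5 3.75`, matching the figures in the text. The factor of two follows from the identity a + b = (a XOR b) + 2·(a AND b) applied to the k low bits: the XOR variant loses 2·(a AND b), while the OR variant loses only (a AND b).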
We show in Fig. 1 the comparison of the hardware implementation of the sloppy adder used in the above example (n = 8, k = 4) and an error-free 8-bit carry-propagate adder (CPA).
The data¹ on delay, area and power dissipation are reported in Table I.
In a rough evaluation, we considered lowering the supply voltage $V_{DD}$ in the sloppy adder to match the delay of the error-free adder (1.0 ns). In our library, when $V_{DD}$ is lowered from 1.0 V to 0.7 V the delay doubles. In the expression for the power dissipated by a circuit containing N gates
$$P = V_{DD}^2 \cdot f \sum_{i=1}^{N} a_i C_i = V_{DD}^2 \cdot K$$
we assume that the switching capacitance $a_i C_i$ does not change when scaling $V_{DD}$. Therefore, K = 20 is constant:
$$P_{0.7V} = (0.7)^2 \cdot 20 \approx 10\ \mu W$$
That is, with the sloppy adder the power is reduced to 1/4 at the same adder speed.
The natural competitor of the sloppy adder is the truncated adder. We performed a comparison between our imprecise adder and the truncated adder by implementing a 16-bit adder with a Carry Look-Ahead (CLA) network to propagate the carry for different sloppy/truncated configurations. The output bits affected by errors are marked in Fig. 2 for truncation t = 8 and sloppy bits k = 8.
¹ The adders are synthesized with a radix-4 carry-look-ahead iterative carry network [6].
Gate-level netlists are generated by a C program for each unit under test. The netlists are synthesized (unconstrained) to optimize buffering and the cells' drive strength according to the actual fan-out. In the comparison, we are interested in relating the introduced error to the power dissipation. Fig. 3 shows the error introduced by the imprecise adders as a function of the number of imprecise/truncated bits (4, 8 and 12 bits). In Fig. 4 we show the power dissipation of each imprecise adder as a function of the error. The sloppy adder turned out to dissipate less power than the truncated adder for the same error level.
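As a rough software counterpart of the error comparison, the sketch below computes the mean error of a truncated adder and of a sloppy (OR-based) adder for 4 and 8 imprecise bits. It rests on assumptions that are not taken from the paper: the truncated adder is modelled as simply dropping the t low bits of both operands, the names `truncated_add`, `sloppy_add` and `mean_error` are hypothetical, and the 12-bit case is left out to keep the exhaustive loop short. It says nothing about power, which requires the synthesized netlists.

```python
def truncated_add(a, b, t):
    """Assumed truncation model: drop the t low bits of both operands."""
    return ((a >> t) + (b >> t)) << t


def sloppy_add(a, b, k):
    """Sloppy model: OR the k low bits, add the remaining bits exactly."""
    mask = (1 << k) - 1
    return (((a >> k) + (b >> k)) << k) | ((a | b) & mask)


def mean_error(adder, bits):
    """Mean of (exact - approximate) over all values of the low `bits` bits.

    In both models the error depends only on the low bits of the operands,
    so the upper bits can be left at zero without changing the average.
    """
    total = 0
    for a in range(1 << bits):
        for b in range(1 << bits):
            total += (a + b) - adder(a, b, bits)
    return total / (1 << (2 * bits))


for bits in (4, 8):
    print(bits, mean_error(truncated_add, bits), mean_error(sloppy_add, bits))
```

With this model the truncated adder loses the full sum of the low parts, while the sloppy adder loses only their bitwise AND, so for the same number of imprecise bits the sloppy mean error is about four times smaller.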
III. SLOPPY MULTIPLICATION
Parallel multiplication p = x·y can be divided into three steps: 1) generation of Partial Products (PPs); 2) carry-free reduction from n PPs to 2 operands; 3) carry-propagate addition of the two operands.
We use a sloppy approach for step 1 only, as step 2 is quite delay-efficient (no carry propagation) and step 3 has been addressed in the previous section. We consider radix-4 multiplication because for n × n-bit operands the unit is smaller: only n/2 PPs are generated. In radix-4 multiplication, the radix-4 digits of the multiplier y are recoded into signed-digit representation to avoid multiples of 3 and carry propagation, as explained in [6]. The resulting architecture (for one digit) of recoder plus PP generation (rec+PPgen) is sketched in Fig. 5(a). Similarly to what was done for the addition, we use a sloppy rec+PPgen for the least-significant digits of y. The recoding is performed as shown in Table II. The resulting hardware implementation is greatly simplified, as shown in Fig. 5(b). Fig. 6(b) shows how the sloppy bits are arranged in the array. As the average error for sloppy recoding is zero (Table II), for patterns in which two adjacent digits in y are 1 = $(01)_2$ and 3 = $(11)_2$ the errors on two different rows compensate in the internal columns of the array. This is shown in the example of Table III for $(0111)_4 \times (0231)_4$. From Fig. 6(b) it is clear that the error due to sloppy rows can propagate well into the most-significant digits of the product.
To limit this propagation, we opt for a hybrid row in which only the least-significant digits of the row are computed sloppily, as shown in Fig. 5(c). With this scheme, called sloppy-columns in the following, the propagation of the error can be limited to a given column (Fig. 6(c)). Again, a competitor of the sloppy scheme is the truncated one. To compare performance and error introduced, we implemented a 12 × 12-bit multiplier (two's complement) in the following schemes:
1) r2-mult: a radix-2 standard multiplier;
2) r4-mult: a radix-4 standard multiplier (with PP generation as in Fig. 5(a));
3) r2-trunc: an r2-mult with t truncated bits;
4) r4-trunc: an r4-mult with t truncated bits;
5) sloppy-rows: a radix-4 multiplier with PP generation as in Fig. 5(b) for k multiplier radix-4 digits (rows);
6) sloppy-cols: a radix-4 multiplier with PP generation as in Fig. 5(c) for t radix-2 columns (bits).
As done for the adder, we report the mean error as a function of the imprecise digits/bits in Fig. 7 and the power dissipation as a function of the error in Fig. 8. The power dissipation of the precise multipliers is $P_{r2}$ = 0.53 mW for the radix-2 and $P_{r4}$ = 0.47 mW for the radix-4 multiplier. The power figures do not include the final carry-propagate adder. Fig. 8 shows that among the truncated schemes, radix-4 is by far more power efficient than radix-2 because of the reduced number of PPs. Moreover, from Fig. 8 we derive that the sloppy-rows schemes with k = 1, 2, 3 have very similar characteristics (error and power) to those of radix-4 truncated to t = 4, 6, 8 bits, respectively.
IV. APPLICATIONS IN IMAGE PROCESSING
To verify the figures found in the stand-alone characterization of the imprecise operators, we implement some common image processing algorithms in imprecise hardware and evaluate the performance. As sample pictures, we used the two grayscale images (each pixel is an unsigned 8-bit integer) of Fig. 9 (upper part).
A. Image Filtering
We use the sloppy adder defined in Sec. II with k = 4 sloppy bits to process the two 256 × 256 grayscale images of Fig. 9 (top) with the following bidimensional filters: 1) an averaging (low-pass) filter; 2) a sharpening filter; 3) an edge-detection unit.
These filters can be implemented in the spatial domain by addition and shift operations. The error is measured on the intensity (luminosity) per pixel (i, j).
The maximum error $E_{max}$ and the average error E (the mean of the per-pixel errors over all pixels (i, j)) are reported in Table IV for the different types of filtering. The results show that the degradation is independent of the image (urna is a portrait, while huse has greater detail). Depending on the filter mask, we can change the design of the sloppy adder to obtain larger savings. For example, for edge-detection, a sloppy adder with k = 6 has an average error E = 28, but visually, the degradation is not noticeable.
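To illustrate how a filter built from additions and shifts behaves with a sloppy adder, the following sketch applies a 2 × 2 averaging mask and reports the maximum and mean per-pixel error. Everything specific here is an assumption made for illustration: the 2 × 2 mask, the OR-based adder model, the function names, and the random 32 × 32 test image (the masks and images used in the paper are not reproduced). The numbers it prints are therefore not comparable with Table IV.

```python
import random


def sloppy_add(a, b, k=4):
    """OR the k low bits (no carries), add the remaining bits exactly."""
    mask = (1 << k) - 1
    return (((a >> k) + (b >> k)) << k) | ((a | b) & mask)


def average_2x2(img, add):
    """Average each 2x2 neighbourhood with three additions and a shift by 2."""
    rows, cols = len(img), len(img[0])
    out = []
    for i in range(rows - 1):
        row = []
        for j in range(cols - 1):
            top = add(img[i][j], img[i][j + 1])
            bottom = add(img[i + 1][j], img[i + 1][j + 1])
            row.append(add(top, bottom) >> 2)
        out.append(row)
    return out


def error_stats(exact, approx):
    """Return (maximum, mean) of the absolute per-pixel differences."""
    errors = [abs(e - s)
              for row_e, row_s in zip(exact, approx)
              for e, s in zip(row_e, row_s)]
    return max(errors), sum(errors) / len(errors)


random.seed(0)  # synthetic 8-bit test image, not one of the paper's pictures
image = [[random.randrange(256) for _ in range(32)] for _ in range(32)]
exact = average_2x2(image, lambda a, b: a + b)
sloppy = average_2x2(image, sloppy_add)
e_max, e_mean = error_stats(exact, sloppy)
print(e_max, e_mean)
```

In this model each sloppy addition underestimates the sum by at most 15 (for k = 4), so the three additions lose at most 45 before the shift and the per-pixel error after the shift cannot exceed 12.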
B. Inverse Discrete Cosine Transformation (IDCT)
Now we combine the imprecise multiplier schemes with an error-free adder in a multiply-add (and accumulate) unit (Fig. 10), which can be used for the trivial implementation of the Inverse Discrete Cosine Transform (IDCT), which is part of the JPEG decompression algorithm. For the unit of Fig. 10 we opted for carry-save (error-free) accumulation to keep separate the imprecision due to the multiplier and to the adder. Based on the results of software simulations, we decided not to use a sloppy adder as the extra [...] Fig. 8, we excluded from the IDCT evaluation the radix-2 truncated multipliers (more power hungry than all others) and the sloppy-columns schemes (power dissipation savings are marginal when the error increases). In summary, we implemented the following multiply-accumulate units:
1) r4-mult: radix-4 12 × 12-bit multiplier and 24-bit adder;
2) r4-trunc-6: r4-mult with t = 6 truncated bits and 18-bit [...]
The results in Table V are obtained by implementation in a 90 nm standard cells library (clock rate is 100 MHz). The errors are computed with respect to a floating-point software implementation (quantization error for r4-mult). The results show that the largest reduction in power is obtained for the radix-4 truncated multipliers. This is in large part justified by the smaller area required by the accumulate circuitry (accumulate path: CSA 4:2, two registers and final adder), which for the truncated schemes is reduced by up to 33% (16- vs. 24-bit accumulate path). For the multiplier itself, as shown in Fig. 8, the smaller sloppy rows in the sloppy scheme compensate for the larger tree when compared to the truncated multipliers. The visual results obtained by sloppy-row-2 IDCT computation are shown in Fig. 9 (bottom). For the sloppy-row-3 and the r4-trunc-8 schemes the image degradation is such that for the IDCT these multipliers are probably not acceptable. The complete visual results of the IDCT test are reported in an electronic appendix [7].
V. CONCLUSIONS AND FUTURE WORK
We have presented simple ways of performing addition and multiplication in an imprecise manner with the aim of obtaining better performance (delay, area and power) at the expense of an increased error, which can be tolerated in some applications. Different combinations of precise/truncated/sloppy operators can be used depending on the specific implementation of the algorithm. In future work, we plan to characterize these imprecise operators in terms of error and performance (delay, area, power dissipation) and find a systematic way of combining them to meet the desired error/performance constraint for a given application.
