This paper presents an error compensation method for truncated multiplication. From two n-bit operands, the operator produces an n-bit product with small error compared to the 2n-bit exact product. The method is based on a logical computation followed by a simplification process. The filtering parameter used in the simplification process helps to control the trade-off between hardware cost and accuracy. The proposed truncated multiplication scheme has been synthesized on an FPGA platform. It gives a better accuracy over area ratio than previous well-known schemes such as the constant correcting and variable correcting truncation schemes (CCT and VCT).
I. INTRODUCTION
In many digital signal processing applications, fixed-point arithmetic is used. In order to avoid word-size growth, operators with n-bit input(s) must return an n-bit result. For multiplication, the 2n-bit result of an (n x n)-bit product has to be set back to n-bit by dropping the n least significant bits through a reduction scheme (usually truncation or rounding). This is the purpose of truncated multipliers.
Truncated multiplication is used mainly for applications such as finite-impulse response (FIR) filtering and discrete cosine transform (DCT) operations. It can also be used to reduce the hardware cost of function evaluation [1]. Section II introduces the notations, presents the main methods used for truncated multiplication and gives a simple classification of truncated multiplication schemes. The proposed method, which is based on carry prediction and selection, is presented in Section III. Section IV presents the error analysis and the implementation results on FPGAs, together with a comparison with some existing solutions.
II. BACKGROUND
A. Notations
Figure 1 presents the partial product array (PPA) of an unsigned (4 x 4)-bit multiplication (see [2] for full-width multiplication algorithms). The partial product $x_i y_j$ is often represented as a dot for compact notation (see Fig. 2).
We use fixed-point notation with n fractional bits (i.e., $0.x_1 x_2 x_3 \ldots x_n$) for the operands and the result. As shown in Figure 2, MP represents the n - 1 most significant columns of the partial product array. MP corresponds to the n bits of the final truncated result. $w_{lsb}$ is the weight of the least significant bit in the truncated result, i.e., $w_{lsb} = 2^{-n}$. The least significant part of the PPA is denoted LP, and we further distinguish its k most significant columns as LPmajor, and the remaining n - k columns as LPminor. In some schemes, the column in LPminor with the highest weight (the left-most column in LPminor in Figure 2) is used. It is referred to in the following as $LP^{(h)}_{minor}$. We refer to a column in the PPA by its weight: for example, MP extends from column col1 to column coln, i.e. from the column where the partial products have weight $2^{-1}$ to the column where they have weight $2^{-n} = w_{lsb}$.
Function truncn(x) denotes x truncated to the n-th bit, and roundn(x) stands for x rounded to the n-th bit.
A truncated multiplication scheme computes the partial products in MP, and adds an error compensation value (ECV) computed as a function of LP. The result of a truncated multiplication is denoted P = truncn(MP + f(LP)).
The most obvious way of performing a truncated multiplication is first to compute the exact 2n bits of the result, then round it to n bits. This full-width result is PFW = roundn (MP + LP).
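As an illustration of the notation, the following sketch splits the partial product array of an unsigned n-bit multiplication into MP, LPmajor and LPminor, and compares direct truncation with the full-width rounded result. It takes the operands as integers X and Y (so that x = X/2^n), expresses all sums in units of 2^-2n, indexes bits from the least significant one, and rounds halves up. These conventions and the function names (`split_ppa`, `full_width`, `direct_truncated`) are assumptions of the sketch, not notation from the paper.

```python
def split_ppa(X, Y, n, k):
    """Sum the partial products of X*Y by region, in units of 2**(-2n).

    Bits are indexed from the least significant one (a, b = 0..n-1), so the
    partial product X_a * Y_b has integer weight 2**(a + b).
    Returns (MP, LPmajor, LPminor).
    """
    mp = lp_major = lp_minor = 0
    for a in range(n):
        for b in range(n):
            bit = ((X >> a) & 1) & ((Y >> b) & 1)
            weight = bit << (a + b)
            if a + b >= n:          # n - 1 columns kept in the n-bit result
                mp += weight
            elif a + b >= n - k:    # k most significant columns of LP
                lp_major += weight
            else:                   # remaining n - k low-weight columns
                lp_minor += weight
    return mp, lp_major, lp_minor


def full_width(X, Y, n):
    """round_n(MP + LP): exact product rounded to n bits (half rounds up)."""
    return (X * Y + (1 << (n - 1))) >> n


def direct_truncated(X, Y, n):
    """Only MP is computed; LP and the carries it generates are ignored."""
    mp, _, _ = split_ppa(X, Y, n, k=0)
    return mp >> n


mp, lp_major, lp_minor = split_ppa(15, 15, 4, 1)
assert mp + lp_major + lp_minor == 15 * 15   # the three regions cover the PPA
```

For n = 4 and X = Y = 15, the full-width result is 14 output lsbs while direct truncation yields 11, which shows how much the carries coming from LP can contribute.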
While giving the smallest possible error, which is only due to the rounding, this method also requires the highest amount of hardware by computing all the partial products.
Since the sum bits in the 2n-bit full-width product are not all used, one is tempted to remove some low-weight columns in LP in order to diminish the hardware cost of the multiplier. However, by doing this, the carries in the low-weight part of the PPA are lost, thereby introducing an evaluation error. Two kinds of error occur in truncated multiplication: the evaluation error $E_{eval}$, which is due to the columns that are removed in LP, and the truncation error $E_{trunc}$, which occurs when the computed value of the PPA is reduced to an n-bit value.
A direct-truncated multiplier computes only the n - 1 most significant columns of the PPA. While minimizing the required amount of hardware, this approach does not take into account any of the carries propagating from LP, and leads to a maximal evaluation error. The result of a direct-truncated multiplier is:
Table I presents the classification of the previous methods and our method according to those groups.
C. Static ECV: C
In [6], the expected value of C = Sum(LP) is estimated by assuming a probability of each input bit being one. The probabilities of each carry and sum bit are evaluated using the logic properties of the half-adder and full-adder cells.
The direct-truncated multiplication scheme described previously also fits in this category with C = 0: LP is neither computed nor approximated.
D. Static ECV: LPmajor + C
Truncated multiplication schemes with a static ECV approximate the error made by leaving out the low-weight columns LPminor with a constant, which is computed either by exhaustive search on the input values or as a statistical evaluation of the expected value of LPminor. In order to [...] as an extension of MP. The multiplication result is:
In order to further diminish the error, some schemes have been proposed where, instead of approximating the value of the partial products in LPminor by a constant, it is expressed as a function of the partial products in LPmajor. [4] gives an ECV for a modified Booth encoded PPA, where each line of partial products in LPminor is estimated as a multiple of the corresponding partial product in LPmajor (k = 1). This results in a data-dependent ECV.
[8] presents another dynamic ECV for a modified Booth multiplier. For every possible combination of bits in the recoded operand, a corresponding expected value of LPminor is computed by statistical analysis, and added to the exact value of LPmajor (k = 1). This gives, for every possible value of the recoded operand, an approximation of the carries propagated from LP. A carry generation circuit is then derived using a Karnaugh map. For sizes larger than 12, the exhaustive simulation is replaced by statistical analysis.
F. Dynamic ECV: LPmajor + f(LPminor)
In [5], the ECV for a Baugh-Wooley array multiplier is computed in three parts. First, LPmajor is computed and summed. Then the partial products in $LP^{(h)}_{minor}$ are computed, some of them are inverted, and all of these are summed. The pattern of inversions applied to the partial products in $LP^{(h)}_{minor}$ is parametrized by an integer Q. The sum is denoted OQ,k. Finally, the expected value of (LPminor - OQ,k) is estimated, and added to LPmajor + OQ,k. The best value of Q is obtained by exhaustive search. For n > 16, a statistical analysis can be performed.
The variable correction truncated (VCT) multiplication [9], [10] estimates the carries propagated from LPminor by adding to the least significant column of LPmajor the partial products of $LP^{(h)}_{minor}$. This is equivalent to multiplying these partial products by two. An immediate consequence is that the ECV is minimal when the multiplication operands are minimal. The result is:
$P = trunc_n(MP + LP_{major} + 2 LP^{(h)}_{minor} + round_{n+k}(-E_{trunc}))$
A hybrid correction truncation (HCT) multiplication [11] realizes a compromise between the CCT and VCT multiplications by only using a percentage p of the partial products in LPminor for the ECV, and adding 1 - p of the evaluation error $E_{eval}$ defined in the CCT multiplication. The truncation error $E_{trunc}$ is also the same as in the CCT and VCT multiplications.
G. Dynamic ECV: LPmajor + LPminor
Only the full-width multiplier fits into this category: this is the case where all of LP is computed.
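The data-dependent part of the VCT correction can be sketched as follows: the partial products of the most significant column of LPminor are counted at twice their weight, which places them in the least significant column of LPmajor. The constant term round_{n+k}(-E_trunc) is deliberately left out because its value is not given in this section, so the sketch is not a complete VCT multiplier. The function name, the integer operand convention (x = X/2^n) and the bit indexing from the least significant bit are assumptions.

```python
def variable_correction(X, Y, n, k):
    """trunc_n(MP + LPmajor + 2 * LP(h)minor), with the constant term omitted.

    Sums are in units of 2**(-2n); bits are indexed from the LSB, so the
    partial product X_a * Y_b sits in the column of weight 2**(a + b).
    """
    total = 0
    for a in range(n):
        for b in range(n):
            bit = ((X >> a) & 1) & ((Y >> b) & 1)
            col = a + b
            if col >= n - k:            # MP and LPmajor: exact weight
                total += bit << col
            elif col == n - k - 1:      # LP(h)minor: counted twice
                total += bit << (col + 1)
            # the lower columns of LPminor are not computed at all
    return total >> n                   # keep the n result bits


# With all-zero operands every partial product is 0, so the correction vanishes.
assert variable_correction(0, 0, 4, 1) == 0
```

For n = 4, k = 1 and X = Y = 15, this sketch returns 14 output lsbs, the same value as the rounded full-width product for that input; other inputs can differ.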
III. PROPOSED METHOD
In this work, a new data-dependent truncated multiplication scheme is introduced. It is named prediction-selection correcting truncated (PSCT) multiplication. It is proposed for direct non-recoded unsigned array multiplication.
In the CCT, VCT and HCT multiplication schemes, the carries propagated from LPminor are estimated, either by statistical analysis or with the help of the partial products in column $LP^{(h)}_{minor}$. It is then difficult to know what kind of error is made, and what additional terms might be introduced in order to improve accuracy.
Our approach tries to address this issue by first computing the exact values of every carry generated in LPminor, and then discarding the less probable ones. This scheme simplifies the computation of the ECV and lowers the associated hardware cost, while keeping track of the error made by removing those products.
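A minimal sketch of the two phases, under assumptions that go beyond the text: a carry formula is stored in disjunctive normal form as a set of conjunctions, each conjunction being a set of input-bit names, and the selection keeps only the conjunctions with at most t literals. For independent, uniformly random input bits, a conjunction of m distinct positive literals is true with probability 2^-m, which is one way to read "less probable". The helper names and this reading of the threshold t are hypothetical.

```python
from itertools import combinations


def partial_product(a, b):
    """Partial product x_a * y_b as a one-conjunction DNF (bits indexed from the LSB)."""
    return {frozenset({f"x{a}", f"y{b}"})}


def and_dnf(f, g):
    """AND of two positive DNF formulas: pairwise union of their conjunctions."""
    return {p | q for p in f for q in g}


def majority_carry(terms):
    """Carry of a half or full adder: true when at least two of its terms are 1.

    Only valid for columns holding two or three terms.
    """
    carry = set()
    for f, g in combinations(terms, 2):
        carry |= and_dnf(f, g)
    return carry


def select(dnf, t):
    """Carry selection (assumed rule): drop conjunctions with more than t literals."""
    return {conj for conj in dnf if len(conj) <= t}


def probability(conj):
    """P(conjunction is true) for independent, uniformly random input bits."""
    return 2.0 ** -len(conj)


# n = 4: the second least significant column holds x0*y1 and x1*y0.
column = [partial_product(0, 1), partial_product(1, 0)]
carry = majority_carry(column)
assert carry == {frozenset({"x0", "x1", "y0", "y1"})}
assert select(carry, 3) == set()        # removed when fewer than 4 literals are allowed
```

Storing formulas as sets of literal sets keeps the AND of two positive formulas a simple union, at the price of not handling negated literals or absorption, which a real simplification step would need.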
A. Carry Prediction
Consider a complete PPA such as the one used for the full-width multiplication in Figure 2. Since the n - k least significant bits of the result are discarded, the corresponding sum bits of LPminor do not have to be computed. But if LPminor is not implemented at all, all the carries generated there are lost, leading to the evaluation error described in Eq. (1). In order to keep the evaluation error low while removing unnecessary hardware, only the logic formulas of the carries generated in LPminor have to actually be implemented.
Figure 3 shows the carry prediction steps for a 4-bit multiplication with one column in LPmajor (a). (b) The first step computes the carries generated in the least significant column of LPminor, col8. Since there is only one partial product there, no carry can be generated. (c) The carry generated in col7 is added to col6. This carry is expressed logically as $x_0 x_1 y_0 y_1$.
At this point, the truncated operator we obtain gives the same result as the full-width multiplication.
Fig. 3. The four steps of carry prediction for n = 4 and k = 1.
B. Carry Selection
So far, the evaluation error is lower than $2^{-k-1} w_{lsb}$, but the partial products in LPmajor become very complicated as n grows, and a large hardware cost may result. In order to reduce the area requirements of the truncated multiplier, some carries have to be simplified. For that purpose, the logic formulas are written in their simplified disjunctive normal form.
Consider the truncated multiplication with n = 4 and k = 1. The only column in LPmajor contains its partial products, A(x, y) and C(x, y), where $C(x, y) = x_0 x_1 x_2 y_0 y_1 y_2 \lor \ldots$ in its disjunctive normal form. If no threshold is imposed, the truncated multiplication is equivalent to the ideal rounded multiplication. For a value of the threshold t = 4, both conjunctions in C(x, y) are removed, and the evaluation error is increased by
A. Mathematical Error
For each studied multiplier scheme, the absolute bias, the average absolute error $\varepsilon_{avg}$, the standard deviation $\sigma$ and the absolute maximum error $\varepsilon_{max}$ are given in units of the output lsb, $w_{lsb} = 2^{-n}$. They are computed exhaustively for n < 8, and using extensive random sampling for larger values of n. The mathematical data for the previous example is given in Table II. We can see that, by acting on the value of the threshold t, we realize a compromise between the full-width multiplier and the CCT multiplier.
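The exhaustive evaluation of these metrics can be sketched as follows for small n. The error of an n-bit result is measured against the exact 2n-bit product, in units of the output lsb. The sign convention (approximate minus exact), the use of the population standard deviation and the function names are assumptions, and the rounded and truncated full-width products serve only as sample schemes.

```python
from math import sqrt


def error_stats(approx, n):
    """Exhaustive error statistics of approx(X, Y, n), in output-lsb units.

    Returns (absolute bias, average absolute error, standard deviation,
    maximum absolute error) over all pairs of n-bit operands.
    """
    errors = []
    for X in range(1 << n):
        for Y in range(1 << n):
            exact = X * Y / (1 << n)    # exact product in lsb units
            errors.append(approx(X, Y, n) - exact)
    count = len(errors)
    mean = sum(errors) / count
    avg_abs = sum(abs(e) for e in errors) / count
    std = sqrt(sum((e - mean) ** 2 for e in errors) / count)
    return abs(mean), avg_abs, std, max(abs(e) for e in errors)


def rounded(X, Y, n):
    """Full-width product rounded to n bits (half rounds up)."""
    return (X * Y + (1 << (n - 1))) >> n


def truncated_exact(X, Y, n):
    """Full-width product truncated to n bits."""
    return (X * Y) >> n


bias, avg_abs, std, max_abs = error_stats(rounded, 4)
assert max_abs == 0.5                   # rounding never errs by more than half an lsb
```

Any other scheme with the same `(X, Y, n)` signature can be passed to `error_stats`; for larger n the two nested loops would be replaced by random sampling of operand pairs, as described above.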
B. Synthesis
We studied the implementation of truncated multiplication schemes on FPGAs. The CAD tool used was Xilinx ISE 8.1i and the target was an FPGA of the Spartan 3 family (XC3S200) with a medium speed grade (-5). Synthesis and place-and-route were area-oriented with a standard effort. The multipliers were implemented using LUTs (not hardware block multipliers).
The Xilinx devices are optimized for functions of 4 up to 8 input bits. This allows us to perform an efficient implementation of PSCT multipliers with a threshold up to t = 8. We implemented the PSCT multipliers for t running from 3 to 8. A PSCT multiplier where t = 2 is equivalent to a CCT multiplier with the same value of the parameter k.
C. Comparisons
The comparisons are made against some well-known truncation schemes for direct multiplication, namely the CCT [7] and VCT [9] multiplications. The full-width multiplier and the direct-truncated multiplier are used as references.
The comparisons were done for n = 8, 12 and 16. Our method is not yet fit for higher values of n, because of the fast-growing computational cost of the prediction process. Figure 4 shows how the different schemes behave for n = 8, 12 and 16, from top to bottom. The X-axis gives the average absolute error, which is our principal accuracy criterion. The Y-axis gives the hardware cost relative to the full-width multiplier. The aim is to achieve good accuracy while minimizing the hardware cost. This corresponds to the lower left part of each graph.
For n = 8, the CCT is outperformed by the PSCT for t = 3, 4 and 5: the same average accuracy as the CCT can be obtained with smaller PSCT multipliers. Similarly, for n = 12, the PSCT for t = 3 requires less hardware to provide the same average accuracy as the CCT. For n = 16, the two schemes are equivalent.
Tables III, IV and V show accuracy results for the truncated multiplication schemes. If one wants an average absolute error as small as possible, that is, close to 0.25, the PSCT multiplication has a lower hardware cost than the other truncated multiplication methods.
CONCLUSION
We presented a new truncated multiplication scheme. The method first computes the logic expression of the carries propagated from LPminor, then performs simplifications while keeping control over the introduced error. This scheme achieves an improvement both in accuracy and in hardware requirements over previous schemes. The proposed method has been implemented on FPGAs, where it shows an area reduction for comparable accuracy on 8- and 12-bit multipliers.
In the near future we plan to improve the speed of our method in order to deal with larger multipliers. We also plan to study the effects of different groupings of the partial products during the carry prediction phase, which should lead to accuracy and hardware cost improvements.
