Abstract-In this paper, a new recursive multibit recoding multiplication algorithm is introduced. It provides a general space-time partitioning of the multiplication problem that not only enables a drastic reduction of the number of partial products (N/r), but also eliminates the need to precompute odd multiples of the multiplicand in higher-radix (r≥3) multiplication. Based on a mathematical proof that any higher radix $2^r$ can be recursively derived from a combination of two or more lower radices, a series of generalized radix-$2^r$ multipliers is generated by means of the primary radices $2^1$, $2^2$, $2^5$, and $2^8$. A variety of higher-radix ($2^3$ to $2^{32}$) two's complement 64×64-bit serial/parallel multipliers are implemented on a Virtex-6 FPGA and characterized in terms of multiply-time, energy consumption per multiply-operation, and area occupation for r values varying from 2 to 64. Compared to a recently published algorithm, savings of 21%, 53%, and 105% are obtained in terms of speed, power, and area, respectively.
I. BACKGROUND AND MOTIVATION
The continuous refinement of the most widely used design paradigm, based on the modified Booth algorithm [1] combined with a reduction tree (carry-save-adder array, Dadda,…), has reached saturation. In [2], only slight improvements are achieved: the proposal reduces the number of partial products from N/2+1 to N/2 using different circuit optimization techniques on the critical path.
Theoretically, only the signed multibit recoding multiplication algorithm [3] is capable of a drastic reduction (N/r) of the number of partial products, where r+1 is the number of multiplier bits processed simultaneously (1≤r≤N). Unfortunately, this algorithm requires the precomputation of the odd multiples of the multiplicand up to $(2^{r-1}-1) \cdot X$, a set whose size grows exponentially with r. The large number of odd multiples not only requires a considerable amount of multiplexers to perform the complex recoding inside the partial product generator (PPG), but also dramatically increases the routing density. The result is a counter-effect that offsets the speed and power benefits of the compression factor (N/r). This is the main reason why the multibit recoding algorithm was abandoned. In practice, designs do not exceed r=3 (radix-8).
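To make the growth of the odd-multiple set concrete, the following Python sketch enumerates the odd multiples that a plain signed radix-$2^r$ recoding would have to precompute. It assumes that the digits lie in the range $[-2^{r-1}, 2^{r-1}]$ and that each digit is realized as a shifted, possibly negated odd multiple of the multiplicand. The function name `odd_multiples` is hypothetical and is not taken from [3].

```python
def odd_multiples(r: int) -> list:
    """Odd factors m > 1 needed by a plain signed radix-2^r recoding.

    Assumption: digit magnitudes range over 1 .. 2^(r-1). Each magnitude is
    written as 2^e * m with m odd; the factor 2^e is a hardwired shift and
    m = 1 is the multiplicand itself, so only m > 1 must be precomputed.
    """
    needed = set()
    for magnitude in range(1, 2 ** (r - 1) + 1):
        m = magnitude
        while m % 2 == 0:      # strip the power-of-two factor
            m //= 2
        if m > 1:
            needed.add(m)
    return sorted(needed)


if __name__ == "__main__":
    for r in range(1, 9):
        print(r, len(odd_multiples(r)))
```

Under these assumptions, no odd multiple is needed for r=1 and r=2, a single one (3X) for r=3, and the count roughly doubles with each further increment of r.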
The current trend [4] [5] relies upon advanced arithmetic to determine minimal number bases that are representatives of the digits resulting from larger multibit recoding. The objective is to eliminate information redundancy inside r+1 bit-length slices for a more compact PPG. This is achievable as long as no or just very few odd multiples are required.
In [4], Seidel et al. introduced a secondary recoding of the digits issued from an initial multibit recoding for 5≤r≤16. The recoding scheme is based on a balanced complete residue system. Though it significantly reduces the number of partial products (N/r for 5≤r≤16), it requires some odd multiples for r≥8. In [5], Dimitrov et al. proposed a new recoding scheme based on the double-base number system for 6≤r≤11. The algorithm is limited to unsigned multiplication and requires a larger number of odd multiples.
Instead of looking for more effective number bases, which is a hard mathematical task, our approach consists in exploiting already existing odd-multiple-free recoding algorithms ($2^1$, $2^2$, $2^5$, and $2^8$) to recursively build up generalized odd-multiple-free radix-$2^r$ recoding schemes.
To achieve this goal, the multibit recoding multiplication algorithm [3] is revisited. Its design space is extended by the introduction of a new recursive version that enables a hardware-friendly space-time partitioning of the multiplication problem. Depending on the value of r, ranging from 2 to N, highly scalable signed multipliers with various levels of parallelism and latency can be systematically generated with insignificant control complexity. The new algorithm also has the merit of recursively reducing the number of partial products (N/r) without any limit on the parameter r and without any need for odd multiples of the multiplicand. It also allows different recoding schemes proposed in the literature to be combined in the same architecture for better multiplier performance. Several higher-radix ($2^3$ to $2^{32}$) two's complement 64×64-bit serial/parallel multipliers based on combined recoding schemes are implemented on a Virtex-6 FPGA and characterized in terms of speed, power, and area occupation for r values ranging from 2 to 64. Compared to a new signed version of the Dimitrov et al. algorithm [5] and to the Seidel et al. algorithm [4], outstanding results are obtained with the new multibit recoding scheme for r=8, formed by the combination of the Seidel algorithm (r=5), the MacSorley algorithm (r=2) [1], and the Booth algorithm (r=1) [6].
The respective savings are 21%, 53%, 105% and 8%, 52%, 63% in terms of multiply-time, energy consumption per multiply-operation, and total gate count. The paper is organized as follows. Section I outlines the main requirement specifications for a generalized radix-$2^r$ […]

For simplicity and without loss of generality, we assume that r is a divisor of N.
In equation (1), the two's complement representation of the multiplier Y is split into N/r two's complement slices ($Q_j$), each r+1 bits long. Each pair of contiguous slices has one overlapping bit. In the literature, equation (1) […]. Hence, the partial-product generation process consists first in selecting one odd multiple ($m \cdot X$) among the whole set of precomputed odd multiples, then applying a hardwired shift of $e$ positions, and finally conditionally complementing the result ($(-1)^s$) depending on the sign bit $s$ of the $Q_j$ term. While the lower multiples $m \cdot X$ can be obtained using just one addition (3X = 2X + 1X), the calculation of higher ones may require several computation steps (11X = 8X + 2X + 1X).
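The slicing described above can be sketched in Python as follows. The sketch assumes that equation (1) has the usual form of signed multibit recoding, $Y = \sum_{j=0}^{N/r-1} Q_j 2^{rj}$ with $Q_j = y_{rj-1} + \sum_{i=0}^{r-2} 2^i y_{rj+i} - 2^{r-1} y_{rj+r-1}$ and $y_{-1} = 0$, and that each digit is then written as $(-1)^s \cdot 2^e \cdot m$ with $m$ odd. Both forms are inferred from the surrounding description, and the function names `recode` and `sign_shift_odd` are hypothetical.

```python
def recode(y: int, n: int, r: int) -> list:
    """Split an n-bit two's complement multiplier into n/r signed digits Q_j.

    Assumed digit form: Q_j = y[rj-1] + sum_{i<r-1} 2^i y[rj+i] - 2^(r-1) y[rj+r-1].
    """
    assert n % r == 0, "r is assumed to divide N"
    assert -(1 << (n - 1)) <= y < (1 << (n - 1)), "y must fit in n bits"
    # Python's >> on negative ints is arithmetic, so this yields two's complement bits.
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    for j in range(n // r):
        lo = r * j
        q = bits[lo - 1] if lo > 0 else 0            # overlapping bit, y[-1] = 0
        q += sum(bits[lo + i] << i for i in range(r - 1))
        q -= bits[lo + r - 1] << (r - 1)             # slice sign bit
        digits.append(q)
    return digits


def sign_shift_odd(q: int) -> tuple:
    """Write q as (-1)^s * 2^e * m with m odd; returns (s, e, m), or (0, 0, 0) for q = 0."""
    if q == 0:
        return (0, 0, 0)
    s = 1 if q < 0 else 0
    m, e = abs(q), 0
    while m % 2 == 0:
        m //= 2
        e += 1
    return (s, e, m)


if __name__ == "__main__":
    digits = recode(91, 8, 4)                        # 91 = 0b01011011
    assert sum(q * 2 ** (4 * j) for j, q in enumerate(digits)) == 91
    print(digits, [sign_shift_odd(q) for q in digits])
```

In this sketch each nonzero digit selects one multiple $m \cdot X$, which is shifted by $e$ positions and negated when $s = 1$; only the values of $m$ greater than 1 correspond to precomputed odd multiples.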
To bypass the hard problem of odd multiples, we exploit the fact that the two's complement multiplier Y, on which equation (1) is applied, is composed of a series of two's complement digits ($Q_j$) on which equation (1) can be recursively applied again. Based on this observation, let us state the following two theorems.
with s+t a divisor of r, and t < s.
Likewise, when theorem (2) is applied to equation (1) 
Theorems (1) and (2) allow an exponential reduction ($1/2^{ks}$ and $1/2^{[\ldots]}$, resp.) of the number of odd multiples in equations (4) and (6) in comparison to equation (2), but at the expense of a linear increase ($ks-1$ and $k(s+t)-1$, resp.) in the number of additions. The advantage far outweighs the cost, as shown in practice in the next section.
The translation of equation (4) into an architecture is depicted in Fig. 1, where each $PPG_j(Q_j)$ is built up using identical $PPG_{ji}(P_{ji})$. This is not the case for equation (6), which requires two different $PPG_{ji}$ ($P_{ji}$ and $T_{ji}$). Theorems (1) and (2) can be merged to produce a $PPG_j$ made of a number of different $PPG_{ji}$ ($P_{ji}$, $T_{ji}$, $U_{ji}$, $V_{ji}$, ...). This is the general case, which is thoroughly studied in the next section in order to determine the optimal multiplier.
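The idea of rebuilding a radix-$2^r$ digit from lower-radix digits can be illustrated with a short Python sketch. It is a sketch under assumptions, not the authors' formulation: it assumes that each $Q_j$ is obtained by applying the same overlapping-slice recoding to narrower slices of widths s (in the spirit of theorem (1)) or alternating widths s and t (in the spirit of theorem (2)), and that the sub-digits are then added with the appropriate shifts. The names `recode_mixed` and `fold`, and the ordering of the s and t slices, are hypothetical.

```python
def recode_mixed(y: int, n: int, widths: list) -> list:
    """Recode an n-bit two's complement y into signed digits of the given
    slice widths (LSB first). Returns a list of (bit_offset, digit) pairs."""
    assert sum(widths) == n
    bits = [(y >> i) & 1 for i in range(n)]          # two's complement bits
    out, lo = [], 0
    for w in widths:
        d = bits[lo - 1] if lo > 0 else 0            # overlapping bit
        d += sum(bits[lo + i] << i for i in range(w - 1))
        d -= bits[lo + w - 1] << (w - 1)             # slice sign bit
        out.append((lo, d))
        lo += w
    return out


def fold(sub_digits: list, r: int, n: int) -> list:
    """Add the sub-digits that fall inside each r-bit slice into one radix-2^r digit.
    Assumes the sub-digit boundaries line up with multiples of r."""
    top = [0] * (n // r)
    for offset, d in sub_digits:
        j = offset // r
        top[j] += d * 2 ** (offset - r * j)          # shift inside slice j
    return top


if __name__ == "__main__":
    n, r, y = 12, 6, -1234
    direct = [d for _, d in recode_mixed(y, n, [r] * (n // r))]
    same_radix = recode_mixed(y, n, [2] * (n // 2))       # s = 2, three sub-digits per Q_j
    mixed = recode_mixed(y, n, [2, 1] * (n // 3))         # s = 2, t = 1, s + t divides r
    assert fold(same_radix, r, n) == direct
    assert fold(mixed, r, n) == direct
    # Sub-digits stay within {-2, ..., 2}: shifts and negations only, no odd multiple.
    assert all(abs(d) <= 2 for _, d in same_radix + mixed)
```

In this sketch the radix-$2^6$ digits, which would otherwise call for odd multiples up to 31X, are each rebuilt from three or four sub-digits of magnitude at most 2, at the price of the extra additions performed in `fold`.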
978-1-4673-0821-2/12/$31.00 ©2012 IEEE

Mux is a heuristic measure of the multiplexer logic inside $PPG_i$. Add is the exact number of adders. $d_i$ is the delay due to the Mux logic ($d_2 < d_5 < d_8 < d'_8$).

RECODING SCHEMES
Theorems (1) and (2) make it possible to build up any high-radix-$2^r$ multiplication algorithm from lower sub-radices, employing far fewer odd multiples. The objective is to generate high-radix-$2^r$ multiplication without odd multiples, for a maximum reduction of the multiplexer complexity inside $PPG_j$. To achieve this goal, a number of odd-multiple-free low-radix algorithms are used, such as the Booth algorithm [6] (radix-$2^1$) […] with minimum hardware resources (Table I). The generation process was manually guided by a heuristic (Table II) that evaluates the logic complexity (Mux) inside each $PPG_j$ (Fig. 1).
The multipliers were mapped to a Virtex-6 FPGA and characterized in terms of multiply-time, energy consumption per multiply-operation, and area occupation for r values varying from 2 to 64. The results obtained (Figs. 2, 3, and 4) show a marked superiority of our algorithms over their recent counterparts [4] [5]. When our algorithms are compared to each other, the $\beta 2^{2}$ algorithm is the most area- and energy-efficient for any value of r (Table II). For r ranging from 8 to 64, $\beta'' 2^{8}$ is the fastest algorithm, but it is outperformed by $\beta 2^{32}$ for r values greater than 64. The $\beta 2^{2}$ algorithm served to design a 16-bit set-point PID. The implementation results outperformed the published ones at all levels [7].
