Abstract: In this paper, a new recursive multibit recoding multiplication algorithm is introduced. It provides a general space-time partitioning of the multiplication problem that not only enables a drastic reduction of the number of partial products (n/r), but also eliminates the need to pre-compute odd multiples of the multiplicand in higher-radix ($\beta \geq 8$) multiplication. Based on a mathematical proof that any higher radix $\beta = 2^r$ can be recursively derived from a combination of two or more lower radices, a series of generalized radix $\beta = 2^r$ multipliers is generated from the primary radices $2^1$, $2^2$, $2^5$, and $2^8$. A variety of higher-radix ($2^3$ to $2^{32}$) two's complement 64x64 bit serial/parallel multipliers are implemented on a Virtex-6 FPGA and characterized in terms of multiply time, energy consumption per multiply operation, and area occupation for r values varying from 2 to 64. Compared to the reference algorithm, savings of 8%, 52%, and 63% are obtained in terms of speed, power, and area, respectively. In addition, a new low-power and highly flexible radix-$2^r$ technique for multi-precision multiplication is presented.
I. BACKGROUND AND MOTIVATION
In multiplication-intensive applications, such as digital signal processing or process control, multiply time is a critical factor that limits overall system performance. When such applications are embedded, energy consumption per multiply operation becomes an additional critical issue. Furthermore, in high-precision or large-operand-size applications such as cryptography, a scalable serial/parallel multiplier is essential because the multiplier size grows quadratically, $O(n^2)$, with the operand size n. Consequently, high speed, low power, and high scalability are the three major requirements for today's general-purpose multipliers [1].
The continuous refinement of the most widely used design paradigm, based on the modified Booth algorithm [2] combined with a reduction tree (carry-save-adder array, Dadda [3], HPM [4]), has reached saturation. In [5] and [6], only slight improvements are achieved: both proposals reduce the number of partial products from n/2+1 to n/2 using different circuit optimization techniques on the critical path.
Theoretically, only the signed multibit recoding multiplication algorithm [7] is capable of a drastic reduction (n/r) of the number of partial products, where r+1 is the number of multiplier bits treated simultaneously (1≤r≤n). Unfortunately, this algorithm requires the pre-computation of odd multiples of the multiplicand (up to $(2^{r-1}-1) \cdot X$), and the number of such multiples, $2^{r-2}$ for r≥2 when X itself is counted, grows exponentially with r. The large number of odd multiples not only requires a considerable number of multiplexers to perform the necessary complex recoding in the partial product generator (PPG), but also dramatically increases the routing density. The resulting adverse effect offsets the speed and power benefits of the compression factor (n/r). This is the main reason why the multibit recoding algorithm was abandoned. In practice, designs do not exceed r=3 (radix 8).
The current trend [8] [9] relies upon advanced arithmetic to determine minimal numeric bases that are representatives of the digits resulting from larger multibit recoding. The objective is to eliminate information redundancy inside r+1 bit-length slices for a more compact PPG. This is achievable as long as no or just very few odd multiples are required.
In [8], Seidel et al. introduced a secondary recoding of the digits produced by an initial multibit recoding for 5≤r≤16. The recoding scheme is based on a balanced complete residue system. Though it significantly reduces the number of partial products (n/r for 5≤r≤16), it requires some odd multiples for r≥8. In [9], Dimitrov et al. proposed a new recoding scheme based on the double-base number system for 6≤r≤11. That algorithm is limited to unsigned multiplication and requires a larger number of odd multiples.
Instead of looking for more effective numeric bases, which is a hard mathematical task, our approach consists in exploiting already existing odd-multiple-free recoding algorithms ($2^1$, $2^2$, $2^5$, and $2^8$) to recursively build generalized odd-multiple-free radix-$2^r$ recoding schemes. To achieve this goal, the multibit recoding multiplication algorithm [7] is revisited. Its design space is extended by the introduction of a new recursive version that enables a hardware-friendly space-time partitioning of the multiplication problem. Depending on the value of r, ranging from 2 to n, highly scalable signed multipliers with various levels of parallelism and latency can be systematically generated with insignificant control complexity. The new algorithm also has the merit of recursively reducing the number of partial products (n/r) without any limit on the parameter r and without any need for odd multiples of the multiplicand. It also allows different recoding schemes proposed in the literature to be combined in the same architecture for better multiplier performance. Several higher-radix ($\beta = 2^3$ to $2^{32}$) two's complement 64x64 bit serial/parallel multipliers based on combined recoding schemes are implemented on a Virtex-6 FPGA and characterized in terms of speed, power, and area occupation for r values ranging from 2 to 64. Compared to a new signed version of the Dimitrov et al. algorithm [9] and to the Seidel et al. algorithm [8], outstanding results are obtained with the new multibit recoding scheme for r=8, formed by the combination of the Seidel algorithm (r=5), the MacSorley algorithm (r=2) [2], and the Booth algorithm (r=1) [10]. The respective savings are 21%, 53%, 105% with respect to [9] and 8%, 52%, 63% with respect to [8], in terms of multiply time, energy consumption per multiply operation, and total gate count. In addition, a new low-power and high-throughput radix-$2^r$ technique for multi-precision multiplication is introduced. Contrary to existing techniques [11] [12], this new one allows a customized partitioning of the operands into any number of sub-operands of any bit size.
The paper is organized as follows. Section I outlines the main requirement specifications for a generalized radix-$2^r$ multiplication. Section II introduces the new recursive multibit recoding multiplication algorithm. Afterwards, some high-radix ($\beta = 2^3$ to $2^8$) variants of the new algorithm are presented in Section III, while their implementation results are discussed in Section IV. Higher-radix ($\beta = 2^8$ to $2^{32}$) algorithms are introduced in Section V. Section VI describes the new low-power technique for multi-precision multiplication. Finally, Section VII provides some concluding remarks and suggestions for future work.
II. THE NEW RECURSIVE MULTIBIT RECODING MULTIPLICATION ALGORITHM
Equation (2.1.2) of the original multibit recoding algorithm presented in [7] (see Appendix) does not offer hardware visibility. Let us rewrite it in a simpler, hardware-friendly form, referred to hereafter as equation (1). For simplicity and without loss of generality, we assume that r is a divisor of n.
In this general case, the multiplier Y is split into n/r slices, each of r+1 bit length. Each pair of two contiguous slices has one overlapping bit. In literature, equation (1) Thus, the signed multiplication between X and Y becomes:
Where each partial product can be expressed as follows:
The set $\{1, 3, \ldots, 2^{r-1}-1\}$ represents the required set of odd multiples of the multiplicand ($m \cdot X$) for radix $2^r$. Hence, the partial product generation process consists first in selecting one odd multiple ($m \cdot X$) among the whole set of pre-computed odd multiples, which is then submitted to a hardwired shift of e positions, and finally conditionally complemented, $(-1)^s$, depending on the sign bit s of the $Q_j$ term. Table I gives a picture of how the number of odd multiples grows as the radix becomes higher. While lower multiples can be obtained using just one addition (3X = 2X + 1X), the calculation of higher ones may require several computation steps (11X = 8X + 2X + 1X).
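The recoding and partial product generation described above can be sketched behaviorally. The sketch below assumes the standard signed-digit definition $Q_j = y_{rj-1} + \sum_{i=0}^{r-2} 2^i y_{rj+i} - 2^{r-1} y_{rj+r-1}$ with $y_{-1} = 0$, which matches the (r+1)-bit overlapping slices of the text; the function names `recode`, `decompose`, and `odd_multiples` are illustrative and do not come from the paper.

```python
def recode(y, n, r):
    """Split an n-bit two's complement integer y into n/r signed radix-2^r digits.

    Assumed digit definition (standard multibit recoding, y_{-1} = 0):
    Q_j = y_{rj-1} + sum_{i=0}^{r-2} 2^i * y_{rj+i} - 2^(r-1) * y_{rj+r-1}
    """
    assert n % r == 0, "r is assumed to divide n"

    def bit(i):
        return 0 if i < 0 else (y >> i) & 1  # works for negative Python ints too

    digits = []
    for j in range(n // r):
        q = bit(r * j - 1)                                   # overlapping bit
        q += sum(bit(r * j + i) << i for i in range(r - 1))  # positive-weight bits
        q -= bit(r * j + r - 1) << (r - 1)                   # negative-weight top bit
        digits.append(q)
    return digits


def decompose(q):
    """Write a digit as q = (-1)^s * 2^e * m with m odd (m = 0 when q = 0)."""
    if q == 0:
        return 0, 0, 0
    s = 1 if q < 0 else 0
    a = abs(q)
    e = (a & -a).bit_length() - 1  # index of the lowest set bit
    return s, e, a >> e


def odd_multiples(r):
    """Odd factors m that a radix-2^r partial product generator must provide."""
    top = 1 << (r - 1)
    return sorted({decompose(q)[2] for q in range(-top, top + 1) if q != 0})


if __name__ == "__main__":
    for r in (2, 3, 4, 5, 8):
        print(r, len(odd_multiples(r)))  # size of the odd-multiple set per radix
```

Each partial product is then $(-1)^s \cdot 2^e \cdot (m \cdot X)$, so only the multiples listed by `odd_multiples(r)` have to be pre-computed; the shift and the sign are hardwired or conditional operations.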
To bypass the hard problem of odd multiples, let us state the following two theorems together with their respective proofs:
, such that s is a divisor of r.
Proof. Equation (1) is recursively applied to the $Q_j$ term of equation (1). Thus, equation (1) 
As a result of theorems (1) and (2), far fewer odd multiples are needed in the partial products of equations (4) and (6) than in equation (2), but at the expense of a number of additions. The advantage far outweighs the cost, as shown in practice in the next section. The translation of equation (4) into an architecture is depicted in Fig. 1, where each $PPG_j$ is built from identical $PPG_{ji}$ blocks. This is not the case for equation (6), which requires two different $PPG_{ji}$ blocks. Theorems (1) and (2) can be merged to produce a $PPG_j$ made of a number of different $PPG_{ji}$ blocks.
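The recursion behind theorem (1) can be checked numerically. The following sketch assumes the same standard digit definition as above (slices of r+1 bits with one overlapping bit) and that theorem (1) states $Q_j = \sum_{i=0}^{r/s-1} Q_{ji} \, 2^{si}$ with s a divisor of r; since the theorem statement itself is not reproduced in this text, that reading is an assumption, and the function names are illustrative.

```python
def digit(y, j, r):
    """Signed radix-2^r digit Q_j of y (assumed standard definition, y_{-1} = 0)."""
    def bit(i):
        return 0 if i < 0 else (y >> i) & 1

    low = sum(bit(r * j + i) << i for i in range(r - 1))
    return bit(r * j - 1) + low - (bit(r * j + r - 1) << (r - 1))


def digit_from_subdigits(y, j, r, s):
    """Rebuild the radix-2^r digit Q_j from r/s radix-2^s sub-digits Q_ji."""
    assert r % s == 0, "s is assumed to divide r"
    k = r // s
    # Sub-digit i of slice j is the radix-2^s digit number j*k + i of y,
    # weighted by 2^(s*i) inside the slice.
    return sum(digit(y, j * k + i, s) * (1 << (s * i)) for i in range(k))


if __name__ == "__main__":
    y, r = -12345, 8          # a 16-bit operand, two radix-2^8 digits
    for s in (1, 2, 4, 8):
        print(s, [digit_from_subdigits(y, j, r, s) for j in range(2)])
```

With s = 1 or s = 2 every sub-digit lies in {-1, 0, 1} or {-2, ..., 2}, so each sub-product is a shifted and conditionally negated copy of X: the odd multiples disappear and are replaced by r/s - 1 additions per slice.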
III. SOME VARIANTS OF THE NEW RECURSIVE MULTIBIT RECODING MULTIPLICATION ALGORITHM
Theorems (1) and (2) make it possible to build any higher-radix multiplication algorithm from lower radices. The objective, however, is to generate higher-radix multiplication without odd multiples. To achieve this goal, a number of odd-multiple-free low-radix algorithms are used, such as Booth algorithm [10] (3) for (r,s)=(1,1) and (r,s)=(2,2). They are respectively summarized as follows: 
With ( ) { }
Higher radices are obtained as follows.
A. Radix $2^3$ recoding
Radix $2^3$ recoding based on equation (1) 
In fact, equation (9) is a combination of the Booth and modified Booth algorithms. Hence, for equation (9),
Furthermore, equation (9) is recursively used to generate any radix $2^r$ recoding with
as follows: 
This equation is referred to as $\beta2^3$ for later comparison with other general radix algorithms based on lower radices. Its corresponding architecture is illustrated in Fig. 2.
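The combination of the Booth and modified Booth algorithms into a radix-$2^3$ digit can be illustrated as follows. The sketch assumes the usual digit forms, $y_{p-1} - y_p$ for Booth and $y_{p-1} + y_p - 2y_{p+1}$ for modified Booth, and places the modified Booth digit on the two low bits of each 3-bit slice; equation (9) is not reproduced in this text, so this ordering is an assumption (the opposite ordering works equally well). Function names are illustrative.

```python
def bit(y, i):
    """Bit i of two's complement integer y, with y_{-1} = 0."""
    return 0 if i < 0 else (y >> i) & 1


def booth(y, p):
    """Radix-2 (Booth) digit at bit position p: value in {-1, 0, 1}."""
    return bit(y, p - 1) - bit(y, p)


def modified_booth(y, p):
    """Radix-4 (modified Booth) digit at bit position p: value in {-2, ..., 2}."""
    return bit(y, p - 1) + bit(y, p) - 2 * bit(y, p + 1)


def radix8_digit(y, j):
    """Direct radix-8 digit: values in {-4, ..., 4}, i.e. 3X is needed."""
    return bit(y, 3 * j - 1) + bit(y, 3 * j) + 2 * bit(y, 3 * j + 1) - 4 * bit(y, 3 * j + 2)


def radix8_from_low_radices(y, j):
    """Same digit, rebuilt as (modified Booth digit) + 4 * (Booth digit)."""
    return modified_booth(y, 3 * j) + 4 * booth(y, 3 * j + 2)


def multiply_radix8(x, y, n):
    """Signed x*y for an n-bit y (n a multiple of 3), using only 0, +/-X, +/-2X."""
    assert n % 3 == 0
    product = 0
    for j in range(n // 3):
        pp_low = x * modified_booth(y, 3 * j)   # one of 0, +/-X, +/-2X
        pp_high = x * booth(y, 3 * j + 2)       # one of 0, +/-X
        product += (pp_low + pp_high * 4) * (1 << (3 * j))
    return product


if __name__ == "__main__":
    print(multiply_radix8(-1234, 987, 12), -1234 * 987)
```

The price of avoiding the 3X multiple is one extra addition per slice, which is the trade-off the text describes for theorems (1) and (2).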
B. Radix $2^4$ recoding
For r=4, equation ( (11) is incorporated into equation (3) 
For performance comparison, we developed a new signed radix-$2^8$ version of the algorithm in [9]. The new recoding is: (12)

In the preceding section, we introduced five generalized multibit space-time partitioning schemes, which are: $\beta2^2$, $\beta2^3$, $\beta2^5$, $\beta2^8$, and $\beta'2^8$. They all require ( ) { }
In this paper, only the serial/parallel form is explored (Fig. 3), targeting applications where the serialization of multiplication is mandatory. This is the case, for instance, in embedded digital PID (Proportional-Integral-Derivative) controllers, where five multiplication cores are required [14], or in high-precision or very-large-operand-size applications (cryptography), where a fully parallel n×n bit implementation is excluded.
In signed serial/parallel multiplication, one r-bit slice of the multiplier is processed each clock cycle, which yields a theoretical multiply time of n/r clock cycles for a double-precision product (2n bits). The special cases r=n and r=$r_{min}$ correspond to the fully parallel and fully sequential multipliers, respectively. In between (r = $2r_{min}$, ..., n/2), partially parallel multipliers are obtained. The lower limit of r depends on the recoding scheme used (e.g., for $\beta2^5$, $r_{min}$=5). The reader is referred to [8], [13], and [9] for the recoding tables used in $\beta2^5$, $\beta2^8$, and $\beta'2^8$, respectively.

Before comparison, all recoding schemes proposed in this paper underwent several verification steps. First, all equations were validated with a random-test C program. Then, they were implemented at RTL level in Verilog-2001 (IEEE 1364) as technology-independent reusable IP cores [1], using exactly the same optimized coding style for a fair comparison. They are compile-time reconfigurable according to n and r. All RTL codes went through a rigorous cycle-accurate functional verification procedure using the Modelsim SE-6.3f logic simulator. They were first challenged against a set of special and severe test cases (visual simulation), and then submitted to a random test with a very large number of vectors. After successful functional verification, physical tests were performed: the designs were integrated into an FPGA evaluation board for final validation.
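The serial/parallel schedule and the random validation step can be mimicked with a small behavioral model. This is a sketch under stated assumptions, not the authors' C program or RTL: it uses the standard radix-$2^r$ digit definition, models one slice per loop iteration as one clock cycle, and all names (`digit`, `serial_parallel_multiply`, `random_check`) are hypothetical.

```python
import random


def digit(y, j, r):
    """Signed radix-2^r digit Q_j of y (assumed standard definition, y_{-1} = 0)."""
    def bit(i):
        return 0 if i < 0 else (y >> i) & 1

    low = sum(bit(r * j + i) << i for i in range(r - 1))
    return bit(r * j - 1) + low - (bit(r * j + r - 1) << (r - 1))


def serial_parallel_multiply(x, y, n, r):
    """Accumulate one partial product per 'clock cycle'; returns (product, cycles)."""
    assert n % r == 0, "r is assumed to divide n"
    acc, cycles = 0, 0
    for j in range(n // r):
        acc += x * digit(y, j, r) * (1 << (r * j))  # partial product of slice j
        cycles += 1
    return acc, cycles


def random_check(n=64, trials=200, seed=0):
    """Random test over several r values, plus the extreme operand values."""
    rng = random.Random(seed)
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    corners = [(lo, lo), (lo, hi), (hi, lo), (hi, hi), (0, lo), (-1, -1)]
    for r in (1, 2, 4, 8, 16, 32, 64):
        vectors = corners + [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(trials)]
        for x, y in vectors:
            product, cycles = serial_parallel_multiply(x, y, n, r)
            if product != x * y or cycles != n // r:
                return False
    return True


if __name__ == "__main__":
    print(random_check())
```

Setting r = n gives a single cycle (fully parallel case), while the smallest r supported by a recoding scheme gives the fully sequential case, in line with the n/r cycle count stated above.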
Afterwards, all equations were synthesized and mapped to the same Virtex-6 FPGA circuit (XC6VSX475t-2FF1156) using the Xilinx ISE 13.2 release [15]. Two's complement 64x64 bit radix-$2^r$ serial/parallel multipliers, with r varying from $r_{min}$ to 64, were characterized in terms of area occupation (number of occupied Virtex-6 slices), maximum energy consumption per multiply operation, and maximum multiply time. The results are depicted in Figs. 4, 5, and 6, respectively.
A. Area Occupation
Three basic components are necessary for the implementation of the proposed multipliers: a) multiplexers to decode the digit terms $Q_{ji}$, $P_{ji}$, ...; b) shifters for partial product generation; and c) adders for partial product summation. Whereas the exact number of adders can be known in advance, heuristics must be developed for the other two. Multiplexer complexity depends on: a) the lower radix $2^s$ used to build the higher radix $2^r$; b) the number (i) of "case" statements used to decode the digit terms; c) the number of entries ($e_i$) in each "case" statement; d) the number ($d_i$) of digit terms; and e) the number of odd multiples ($|O_{mi}|$) used to calculate the digit terms. Hence, we The total number of adders comprises ( ) 1 (Fig. 3) 
Significant conclusion: the area occupation is dominated by the Mux factor and becomes larger (Fig. 4) as the Mux number becomes higher (Table II). This correlation is used to advantage to minimize the area occupation and power consumption, as will be shown in the next section.
B. Energy consumption
Since energy consumption is a function of the switched capacitance, Figs. 4 and 5 show a direct correlation between area occupation and energy consumption. Lowering the Mux indicator therefore results in a less energy-consuming recoding algorithm.
C. Delay
The delay (T) along the critical path is the summation of PPG delay and reduction tree delay. While the former is constant, the latter depends on the topology used: either linear or logarithmic. The number of levels for each case is given in Table II and the performance of each algorithm is depicted in Fig. 6 . The total multiply time is equal to (n/r)T. Note that all results presented in this paper are based on linear implementation of the reduction tree.
Based on theory and implementation results, it is clear that the $\beta2^2$ algorithm is the best in terms of area and energy consumption. As for speed, $\beta2^2$ is the fastest up to r=16. Beyond this value, it is surpassed by $\beta2^8$. The $\beta2^2$ algorithm was used to design a scalable 16-bit setpoint PID controller employing five multiplication cores. The implementation results outperformed the published ones at all levels [14].
V. HIGHER RADIX MULTIBIT RECODING MULTIPLICATION ALGORITHMS
Further performance gains necessarily require higher r values (r ≥ 8). Guided by the Mux and Add indicators, the objective is to generate a recoding scheme that outperforms $\beta2^2$ in area and power, and $\beta2^8$ in speed. 8, 9, 10 ). A summary of the results with regard to the Dimitrov and Seidel algorithms is given in Table V.
B. Radix $2^{13}$ recoding
As $\beta2^8$ and $\beta2^5$ show good results for speed and power respectively, they have been merged ($\beta2^{13}$) for a better compromise. However, the Mux saving (130r) is not significant enough compared to the Mux value (192r) of $\beta2^8$. This explains the closeness of the results between $\beta2^{13}$ and $\beta2^8$.
C. Radix $2^{16}$ recoding
To achieve a significant Mux saving, $\beta2^8$ is combined with $\beta2^2$ based on theorems (1) and (2) simultaneously. $\beta2^{16}$ exhibits a Mux value of 100r, which is almost half of that required by $\beta2^8$. Better results are obtained in terms of area and energy. The fact that $\beta2^{16}$ is slightly slower than $\beta2^8$ is due to the higher number of PPG adders required (10). For r greater than 64, $\beta2^{16}$ will surpass $\beta2^8$ since the total number of adder levels will be lower. Higher radices provide more speed.
D. Radix $2^{24}$ recoding
To push the energy consumption lower while increasing the speed, lower Mux values with higher radices are required. This can be achieved using the mixture of: $\beta2^8$ (Fig. 11). At this level, some useful conclusions can be drawn depending on the topology of the reduction tree used, either linear or logarithmic (Table VI). In the case of a linear tree, $\beta2^2$ is the most area- and energy-efficient algorithm for any value of r. For r ranging from 8 to 64, $\beta''2^8$ is the fastest algorithm, but it will be outperformed by $\beta2^{32}$ for r values greater than 64. In the case of a logarithmic reduction tree, $\beta2^2$ is by far the best in all respects since it always requires the lowest number of adder levels (
) whatever the r value (Table VI). { } Based on higher radix recoding algorithms proposed so far ($\beta2^8$ 
can be recursively pursued further for very-large-operand-size applications (n >>). The number of adder levels required by a $\beta2^x$ algorithm will be:
, where a is a constant depending on the number of levels of PPG adders. Thus, $\beta2^x$ will outperform $\beta2^2$ for a e x 4 ≥ .
VI. NEW RADIX $2^r$ MULTI-PRECISION MULTIPLICATION TECHNIQUE
Before a highly scalable multi-precision multiplier can be developed, a flexible and low-power sign-extension technique is needed.
A. New radix $2^r$ sign extension technique
Though many low-power sign-extension techniques exist in the literature, they are not suited to reconfigurability. The reason for this shortcoming is that the correction bits must be calculated for each value of the operand size n [11] [16]. Besides, to the authors' knowledge, no sign-extension solution exists for radix-based multiplication (r). In what follows, we propose a generic low-power solution that circumvents these two obstacles. It is illustrated in Fig. 12 for n=8 and r=2, but can be systematically extended to any n and r values. Intuitively, we do not sum all the partial products simultaneously; instead, the partial product of the current step j is added to the sum of the preceding ones (from 0 to j-1). The reasoning about the number of sign bits to the left can therefore be carried out locally, step by step, row by row. In other words, we take advantage of the fact that the partial sum already contains the sum of the sign bits of the previous partial products. We must simply ensure that the sum output of the sign bit of the current step j is added to the two most-significant bits of the next step (j+1). To generalize to radix-$2^r$ multiplication, the sign bit (n-th position bit) of each partial product is extended with r bits to the left (r-1 for a maximum shift, plus one bit for the sign), and the sum output of the sign bit of step j is added to the r most-significant bits of the next step (j+1).
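The local, step-by-step nature of this sign handling can be illustrated with a behavioral model of a fixed-width accumulator. The sketch below is not the bit-level circuit of Fig. 12: it only checks, under stated assumptions, that a right-shifting accumulator of n+r+1 bits (n+r bits for a partial product extended by r bits, plus one guard bit for the addition, an assumed width) never overflows, so no sign extension across the full 2n-bit product is needed. The names `wrap`, `digit`, and `multiply_fixed_width` are hypothetical.

```python
def wrap(v, width):
    """Interpret v modulo 2^width as a two's complement number of that width."""
    v &= (1 << width) - 1
    return v - (1 << width) if v >> (width - 1) else v


def digit(y, j, r):
    """Signed radix-2^r digit Q_j of y (assumed standard definition, y_{-1} = 0)."""
    def bit(i):
        return 0 if i < 0 else (y >> i) & 1

    low = sum(bit(r * j + i) << i for i in range(r - 1))
    return bit(r * j - 1) + low - (bit(r * j + r - 1) << (r - 1))


def multiply_fixed_width(x, y, n, r):
    """Serial multiplication of n-bit signed x and y with a narrow accumulator."""
    assert n % r == 0, "r is assumed to divide n"
    width = n + r + 1            # assumed accumulator width (see lead-in)
    acc, low = 0, 0
    for j in range(n // r):
        pp = x * digit(y, j, r)                       # fits in n + r bits
        acc = wrap(acc + pp, width)                   # hardware-like truncation
        low |= (acc & ((1 << r) - 1)) << (r * j)      # r finished product bits
        acc >>= r                                     # arithmetic right shift
    return acc * (1 << n) + low


if __name__ == "__main__":
    print(multiply_fixed_width(-128, -128, 8, 2), (-128) * (-128))
```

Because the partial sum is shifted right by r bits after each step, its magnitude stays below that of a single extended partial product, which is why the sign information of earlier rows never has to be propagated beyond the next step.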
B. New Radix $2^r$ Multi-precision multiplication technique
In traditional n×n bit multi-precision multipliers, it is possible to perform either a single n×n double-precision multiplication, a single n/2×n/2 single-precision multiplication, or twin parallel n/2×n/2 single-precision multiplications. This is made possible by partitioning the two operands X and Y into most-significant and least-significant sub-operands, $X_H$, $Y_H$ and $X_L$, $Y_L$, respectively. A number of solutions exist and are summarized in [11] [12]. Unfortunately, they are either restricted to unsigned multiplication, do not take power consumption into consideration, or are not flexible enough. We propose hereafter a new technique that not only overcomes all of the above-mentioned shortcomings, but also allows a customized partitioning of the operands into any number of slices of any size. Besides, this new technique is well adapted to radix-based multiplication. Its features are compared to the technique presented in [11].
Let us take equation (1) 
(13) Note that $Q_1$ and $Q_0$ are (n/2)+1 bits in size, but $x_{-1}$ can be omitted from $Q_0$ since it is stuck at zero. Thus, we obtain four independent signed multipliers:
$Y_L$, which are respectively (n/2)+1×(n/2), (n/2)+1×n/2, n/2×(n/2), and n/2×n/2 bits in size. Fig. 13 illustrates the implementation of equation (13) for a signed 16x16 bit multiplier based on the recoding algorithm $\beta2^2$ with r=2. Equation (13) eliminates the cumbersome term (EV×$2^{n/2}$) in equation (6) of [11] as well as the logic necessary for its generation. More importantly, in Fig. 13, four 8x8 bit multiplications can be performed simultaneously, whereas in [11] only two are allowed because of the shared terms ) and CV required for the sign extension. Without counting the necessary EV generation logic and the inverters used for the negation of the sign bits, the partitioning proposed in [11] consumes a total bit count of 205 for a 16x16 bit multiplier, while ours requires 198 bits.
Note that equation (5) 
Four independent signed multipliers are generated:
$Y_L$, which are respectively (n/4)+1×(n/4), (n/4)+1×(3n/4), (3n/4)+1×(n/4), and (3n/4)×(3n/4) bits in size. The translation of equation (14) into an architecture is depicted in Fig. 14. Both partitioning schemes (Fig. 13 and Fig. 14) need the same number of bits (198).
More efficiently, equation (13) can be combined with the $\beta''2^8$ algorithm for the recoding of the $Y_H$ and $Y_L$ sub-multiplicands to produce a faster partitioning (Fig. 15) for operand sizes larger than 16 bits, according to the implementation results shown in Fig. 10.
More importantly, equation (1) can be used to partition the X and Y operands into any desired number of slices depending on the r value. Choosing for instance r=n/4 results in the following partitioning: (15) for n=16 based on $\beta2^2$ with r=2 are described in Fig. 16. Equation (15) requires a total bit count of 254, which induces an overhead of 28% compared to equation (13).
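The principle behind these partitionings can be sketched numerically: if each operand is split into signed slices with one overlapping bit, the full signed product is the shifted sum of independent small signed products. The sketch below is an illustration of that principle under assumptions, not a transcription of equations (13) to (15), which are not reproduced in this text; slice widths are listed from the least significant slice, and the names `signed_slices` and `multi_precision_multiply` are hypothetical.

```python
import random


def signed_slices(v, widths):
    """Split two's complement v into signed slices of the given widths.

    Returns (digit, offset) pairs such that v == sum(digit * 2**offset),
    provided v fits in sum(widths) bits. Each digit is built from w + 1 bits:
    the w bits of the slice plus the overlapping bit below it (0 for the first).
    """
    def bit(i):
        return 0 if i < 0 else (v >> i) & 1

    slices, offset = [], 0
    for w in widths:
        low = sum(bit(offset + i) << i for i in range(w - 1))
        q = bit(offset - 1) + low - (bit(offset + w - 1) << (w - 1))
        slices.append((q, offset))
        offset += w
    return slices


def multi_precision_multiply(x, y, x_widths, y_widths):
    """Signed x*y as a sum of independent signed sub-products, one per slice pair."""
    total = 0
    for qx, ox in signed_slices(x, x_widths):
        for qy, oy in signed_slices(y, y_widths):
            total += (qx * qy) * (1 << (ox + oy))  # independent small multiplier
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = rng.randint(-32768, 32767), rng.randint(-32768, 32767)
    for widths in ([8, 8], [4, 12], [4, 4, 4, 4]):
        print(widths, multi_precision_multiply(x, y, widths, widths) == x * y)
```

Because every sub-product is itself a complete signed multiplication, the sub-multipliers can also be used on their own for lower-precision operations, which is the flexibility the text attributes to the proposed scheme.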
Finally, equations (1) and (5) can be combined with any proposed recoding algorithm ($\beta2^r$) to produce any desired multi-precision multiplication scheme.
VII. CONCLUSION AND FUTURE WORK
We developed a recursive version of the multibit recoding multiplication algorithm, which made it possible to solve two hard problems: radix-$2^r$ signed multiplication and radix-$2^r$ multi-precision signed multiplication. The former is an odd-multiple-free solution with advanced capabilities for multiplication-intensive applications that must dissipate minimal power while operating at high speed. In addition, the solution is highly scalable, allowing a hardware-friendly partitioning that can be tailored to the desired performance and power budget.
We deliberately opted for FPGA implementation to rapidly explore a large number of variants of the recursive algorithm. Only ten recoding algorithms have been selected and reported in this paper. We first gave priority to a serial/parallel implementation as it is the most appropriate to designing embedded finite-word-length controllers, which is our ultimate objective. A fully-parallel implementation will be given the same attention for further investigation and optimization.
Guided by the Mux and Add indicators, even higher odd-multiple-free recoding algorithms ($\beta2^{64}$, $\beta2^{128}$, $\beta2^{256}$, ...) can be generated to cope efficiently with large-operand-size applications, such as cryptography. However, for large r values, the use of advanced optimization heuristics becomes mandatory in order to determine the primary radix ($2^1$, $2^2$, $2^5$, and $2^8$) configuration that leads to the optimal implementation of the desired radix. This issue is currently being explored, and we plan to report our results in a forthcoming paper.
As for the multi-precision solution, it would not have been possible without the development of a flexible sign-extension technique. Based on the new recursive algorithm, we proposed a generic partitioning scheme that can be adapted to any size combination of the operands in order to reduce the power consumption while increasing the computational throughput. This new solution will be explored in depth for further optimizations using the proposed radix-$2^r$ algorithms.
APPENDIX
A. Multibit Recoding Algorithm
Let X be an n-bit two's complement format binary integer. The value of X can then be found from:
$X = -x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i$
