Abstract-This paper addresses the problem of multiplication with large operand sizes (N≥32). We propose a new recursive recoding algorithm that shortens the critical path of the multiplier and also reduces the hardware complexity of the partial-product generators. The new recoding algorithm provides an optimal space/time partitioning of the multiplier architecture for any operand size N. As a result, the critical path is drastically reduced.

I. BACKGROUND AND MOTIVATION

In multiplication-intensive applications, such as digital signal processing or process control, multiply-time is a critical factor that limits overall system performance. When such applications are embedded, energy consumption per multiply operation becomes an additional critical issue. Furthermore, in large-operand-size applications (N≥32), a scalable architecture is essential to ensure a linear increase O(N) of multiply-time while multiplier size grows quadratically O(N^2) with operand bit-length N. Consequently, high speed, low power, and high scalability are the three major requirements for today's general-purpose multipliers [1].
commercial designs do not exceed r=4 (radix-16). A hybrid radix-4/-8 is proposed in [3] for low-power multimedia applications. To increase the speed of the multiplier, most earlier processors employed radix-8, for example Fchip [4], IBM S/390 [5], Alpha RISC [6], IA-32 [7] and AMD K7 [8]. Radix-16 is used only in the most recent Intel processors: Intel 64 and IA-32 [9], and Itanium-Poulson [10].
In research, the highest radix algorithms are proposed in the works of Seidel et al. [11] and Dimitrov et al. [12] . Both works rely upon advanced arithmetic to determine minimal number-bases that are representatives of the digits resulting from larger multibit recoding. The objective is to eliminate information redundancy inside r+1 bit-length slices for a more compact PPG. This is achievable as long as no or just very few odd-multiples are required.
Seidel introduced a secondary recoding of the digits issued from an initial multibit recoding for 5≤r≤16. The recoding scheme is based on a balanced complete residue system. Though it significantly reduces the number of partial products (N/r for 5≤r≤16), it requires some odd-multiples for r≥8. Dimitrov proposed a new recoding scheme based on a double-base number system for 6≤r≤11. The algorithm is limited to unsigned multiplication and requires a larger number of odd-multiples. Both algorithms [11] [12] require a PPG that includes a number of adders to accumulate the intermediary partial products corresponding to the recoded elementary digits.
In fact, odd-multiples are not the only obstacle to a compact PPG. Recoding large slices (r≥8) in a mono-bloc PPG, as in [11] [12], requires an RTL "case statement" with r+1 input bits. In this case, 2^(r+1) combinations must be processed, which leads to a huge amount of multiplexer resources. Thus, mono-bloc PPG recoding is incompatible with the high-radix (r≥8) approach, whose purpose is to reduce the multiply-time (N/r) of large-operand-size (N≥32) multipliers.
The objective of this paper is to overcome the two above-mentioned shortcomings. To achieve this goal, the multibit recoding multiplication algorithm [2] is revisited. Its design space is extended by a new recursive version that solves the hard problem of radix-2^r two's complement multiplication for any value of r. The solution consists essentially in dividing the high radix-2^r mono-bloc PPG_j (Fig. 1.a) into a number of lower sub-radix-2^s odd-multiple-free PPG_ji (Fig. 1.b), such that s is a divisor of r. The direct benefits of the partitioning of Fig. 1.b are:
• there is no need to pre-compute odd-multiples of the multiplicand, which drastically reduces the required amount of hardware resources and routing;
• since the input size of a PPG_ji is much smaller than that of a PPG_j (s≤r/2), the total multiplexing logic required by the RTL "case statements" that recode the inputs is greatly reduced;
• the possibility of simultaneously processing larger bit slices (r≥16) radically shortens the critical path in terms of adder levels, especially for very large operand sizes (N≥64).

A New High Radix-2^r (r≥8) Multibit Recoding Algorithm for Large Operand Size (N≥32) Multipliers

Guided by accurate area heuristics, the optimization process gradually undertaken in this paper delivers, for each value of N (N=8..8192), the appropriate radix-2^r (r=8..512) and sub-radix-2^s (s=4..32) that lead to the architecture with the shortest critical path in adder stages. The couple (r,s) serves to partition the architecture so that maximum parallelism is exploited. As for area, our proposed architectures require as many hardware resources as the modified Booth algorithm [13], whose critical path is N/2 [14] [15] [16] [17]. For instance, a 64-bit two's complement finely pipelined multiplier requires a latency of only seven clock cycles (critical path composed of a series of 7 adders). FPGA implementation on a Virtex-6 circuit of our 64-bit two's complement radix-2^32 multiplier shows important gain ratios over the Seidel [11] and Dimitrov [12] radix-2^8 algorithms: 1.62, 1.71, 2.64 and 1.83, 1.71, 3.32, respectively, in terms of multiply-time, energy consumption per multiply-operation, and total gate count.
The paper is organized as follows. Section I outlines the main requirement specifications for a generalized radix-2^r multiplication. Section II introduces the new recursive multibit recoding multiplication algorithm, illustrated by two high-radix (2^8 and 2^16) recoding examples in Section III. Section IV introduces some preliminary steps toward an optimal partitioning of the multiplier architecture, while the optimal partitioning is presented in Section V. Section VI compares and discusses the implementation results. Finally, Section VII provides some concluding remarks and suggestions for future work.
II. THE NEW RECURSIVE MULTIBIT RECODING MULTIPLICATION ALGORITHM
Equation (2.1.2) of the original multibit recoding algorithm presented in [2] does not offer hardware visibility. Let us rewrite it in a simpler hardware-friendly form, as follows:

$$Y = \sum_{j=0}^{N/r-1} \left( y_{rj-1} + \sum_{i=0}^{r-2} 2^{i}\, y_{rj+i} - 2^{r-1}\, y_{rj+r-1} \right) 2^{rj} = \sum_{j=0}^{N/r-1} Q_j\, 2^{rj}, \qquad y_{-1}=0. \tag{1}$$

For simplicity and without loss of generality, we assume that r is a divisor of N.

In equation (1), the two's complement representation of the multiplier Y is split into N/r two's complement slices (Q_j), each of r+1 bit length. Each pair of contiguous slices has one overlapping bit. Thus, the signed multiplication between X and Y becomes:

$$X \cdot Y = \sum_{j=0}^{N/r-1} X\, Q_j\, 2^{rj}. \tag{2}$$

Each partial product can be expressed as follows:

$$X\, Q_j = (-1)^{e} \cdot 2^{f} \cdot (m \cdot X).$$

Hence, the partial-product generation process consists first in selecting one odd-multiple (m.X) among the whole set of pre-computed odd-multiples, which is then submitted to a hardwired shift of f positions, and finally conditionally complemented, (-1)^e, depending on the sign bit e of the Q_j term. Table I shows how the number of odd-multiples grows as the radix becomes higher. While lower m.X can be obtained using just one addition (3X=2X+1X), the calculation of higher ones may require several computation steps (11X=8X+2X+1X).
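As an illustration, the slicing of equation (1) and the sign/shift/odd-multiple decomposition of each digit can be modeled in a few lines of Python. This is a software sketch only, not the RTL used in the paper; the function names (`recode`, `decompose`, `multiply`, `odd_multiples`) are hypothetical, and the bit below the least significant bit of Y is assumed to be 0.

```python
def recode(y, n, r):
    """Split the n-bit two's complement integer y into n//r signed digits Q_j.

    Each digit is read from an (r+1)-bit slice that overlaps the previous
    slice by one bit; the bit below the LSB (y_{-1}) is taken as 0.
    """
    assert n % r == 0 and -(1 << (n - 1)) <= y < (1 << (n - 1))
    bits = [(y >> i) & 1 for i in range(n)]   # two's complement bits of y

    def bit(i):
        return 0 if i < 0 else bits[i]

    digits = []
    for j in range(n // r):
        q = bit(r * j - 1)                                   # overlapping bit
        q += sum(bit(r * j + i) << i for i in range(r - 1))  # weights 2^0..2^(r-2)
        q -= bit(r * j + r - 1) << (r - 1)                   # sign bit of the slice
        digits.append(q)
    return digits


def decompose(q):
    """Write the digit q as (-1)**e * 2**f * m with m odd (m = 0 when q = 0)."""
    if q == 0:
        return 0, 0, 0
    e, mag, f = int(q < 0), abs(q), 0
    while mag % 2 == 0:
        mag, f = mag // 2, f + 1
    return e, f, mag


def multiply(x, y, n, r):
    """Signed product x*y accumulated from the recoded partial products."""
    total = 0
    for j, q in enumerate(recode(y, n, r)):
        e, f, m = decompose(q)
        partial = (-1) ** e * ((m * x) << f)   # select m.X, shift by f, negate if e
        total += partial << (r * j)
    return total


def odd_multiples(r):
    """Odd factors m that a radix-2^r digit can require (digits reach 2^(r-1))."""
    return {decompose(q)[2] for q in range(1, 2 ** (r - 1) + 1)}


if __name__ == "__main__":
    assert multiply(-12345, 23456, 16, 8) == -12345 * 23456
    print(sorted(odd_multiples(4)))   # odd factors needed at radix-16
```

Every odd factor larger than 1 returned by `odd_multiples` has to be pre-computed with at least one addition, which is the cost the rest of the paper sets out to avoid.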
To bypass the hard problem of odd-multiples, we exploit the fact that the N+1 bit-length two's complement multiplier Y, on which equation (1) is applied, is composed of a series (N/r) of r+1 bit-length two's complement slices (Q_j digits), on which equation (1) can be recursively applied again. Based on this observation, we state the following two theorems, whose proofs are given in the Appendix.
Theorem (1) and (2) allow an exponential reduction (1/2 ks and 1/2 k(s+t) , resp.) of the number of odd-multiples in equations (4) and (6) in comparison to equation (2) , but at the expense of a linear increase (ks-1 and k(s+t)-1, resp.) in the number of additions. The advantage by far outweighs the cost, as practically shown in the next section.
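The principle behind this trade-off can be checked numerically. The sketch below assumes the simplest case of equal sub-slices (s a divisor of r) and simply re-applies the slicing of equation (1) inside one r+1 bit slice; it does not reproduce the exact statements of the theorems. The names `digit` and `split` are hypothetical.

```python
def digit(w, r):
    """Signed value of an (r+1)-bit slice w; bit 0 of w is the overlapping bit."""
    b = [(w >> i) & 1 for i in range(r + 1)]
    return b[0] + sum(b[i + 1] << i for i in range(r - 1)) - (b[r] << (r - 1))


def split(w, r, s):
    """Sub-digits P_i of the slice w, one per (s+1)-bit window, with s dividing r."""
    assert r % s == 0
    mask = (1 << (s + 1)) - 1
    return [digit((w >> (s * i)) & mask, s) for i in range(r // s)]


if __name__ == "__main__":
    r, s = 8, 2
    for w in range(1 << (r + 1)):                 # all 512 slice patterns
        subs = split(w, r, s)
        # The radix-2^r digit is the weighted sum of the sub-digits.
        assert digit(w, r) == sum(p << (s * i) for i, p in enumerate(subs))
        # With s = 2 every sub-digit lies in {-2..2}: only 0, X and 2X are needed.
        assert all(abs(p) <= 2 for p in subs)
    print("additions per radix-2^8 digit:", r // s - 1)
```

For (r,s)=(8,2) the loop confirms that no odd-multiple other than X itself is needed, at the price of r/s-1=3 additions to recombine the four sub-digits.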
The translation of equation (4) into architecture is depicted by Fig. 1.b, where each PPG_j (Q_j) is built using r/s identical PPG_ji (P_ji). This is not the case for equation (6), which requires two different PPG_ji (P_ji and T_ji). The radix-2^1 recoding and the McSorley radix-2^2 recoding (modified Booth [13]) can be derived from equation (3) for (r,s)=(1,1) and (r,s)=(2,2), respectively. They are summarized as follows:

$$Y = \sum_{j=0}^{N-1} \left( y_{j-1} - y_{j} \right) 2^{j} \tag{7}$$

$$Y = \sum_{j=0}^{N/2-1} \left( y_{2j-1} + y_{2j} - 2\, y_{2j+1} \right) 2^{2j} \tag{8}$$
The Seidel radix-2^8 recoding is given by equation (10). Note that while equations (9) and (10) are odd-multiple free, since all included digits are powers of 2, they require a post-accumulation to deal with the odd numbers (7, 11 and 121). Thus, a number of extra adders are needed.
Optimized higher radices are obtained as follows.
A. Our new radix-2^8 recoding
Based on theorem (2), each 8+1 bit slice is split into 5+1, 2+1, and 1+1 overlapping slices using the Seidel radix-2^5 recoding, which yields equation (11).
B. Our new radix-2^16 recoding
Likewise, using theorem (2), each 16+1 bit slice is split into 8+1, 5+1, 2+1, and 1+1 overlapping slices using the Seidel radix-2^8 and radix-2^5 recodings, which yields equation (12).
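The two partitions above can be sanity-checked in software. The sketch below only verifies that unequal overlapping sub-slices still recombine into the original digit; it does not model Seidel's secondary recoding tables. The order in which the widths are assigned to bit positions (first width at the least significant end) is an assumption, and `digit` and `split_widths` are hypothetical names.

```python
def digit(w, r):
    """Signed value of an (r+1)-bit slice w; bit 0 of w is the overlapping bit."""
    b = [(w >> i) & 1 for i in range(r + 1)]
    return b[0] + sum(b[i + 1] << i for i in range(r - 1)) - (b[r] << (r - 1))


def split_widths(w, widths):
    """Return (sub_digit, bit_position) pairs for sub-slices of the given widths."""
    out, pos = [], 0
    for s in widths:
        window = (w >> pos) & ((1 << (s + 1)) - 1)   # (s+1)-bit overlapping window
        out.append((digit(window, s), pos))
        pos += s
    return out


if __name__ == "__main__":
    for widths in ([5, 2, 1], [8, 5, 2, 1]):          # radix-2^8 and radix-2^16
        r = sum(widths)
        for w in range(1 << (r + 1)):
            parts = split_widths(w, widths)
            assert digit(w, r) == sum(p << pos for p, pos in parts)
        print(widths, "recombines correctly with", len(widths) - 1, "additions")
```

The identity holds for any ordering of the widths, since each sub-slice borrows its overlapping bit from the one below it.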
In our preceding work [20], we pursued this combination process further and generated a series of higher-radix (2^24, 2^32, …) recoding schemes. However, what remains unknown is how to determine, for a given N, the proper radix (2^r) that leads to the optimal architecture.
hal-00872326, version 1 -11 Oct 2013
The translation of equations (11) and (12) into architectures is depicted in Fig. 2 .a and 2.b, respectively.
All Dimitrov algorithms developed in [12] 
For the comparative study, our proposed algorithms (eq. 11 and 12) as well as Seidel and Dimitrov algorithms (eq. 10 and 13, resp.) are first analytically characterized and then physically implemented.
C. Analytical characterization of area and speed
Prior to implementation, we need to develop a generalized theoretical model that predicts the area and speed features of each recoding algorithm with respect to N and r.
1) Area
Three basic components are necessary for the implementation of RTL multipliers:
• multiplexers (Mux1) to recode the digit terms (Q_j, P_j, …) included in the recoding expression;
• shifters (Mux2) for partial product generation;
• and adders for partial product summation.
Whereas the exact number of adders can be known in advance, we need to develop heuristics for the two others. The total multiplexer complexity (Mux1) of a radix-2^r multiplier depends on:
• the number (N/r) of PPG j ;
• the number (i) of lower sub-radices (2 
, which requires 6 adders for post-accumulation operation [11] [19] . Hence, the total number of necessary adders is: 
2) Delay
The total delay (Del_T) along the critical path is the sum of the PPG_j delay and the reduction tree delay. Based on the total number of adders (Add_T), the critical path of the multiplier is expressed in terms of logic levels. Table II provides the area occupation and delay for each recoding algorithm.
D. Physical implementation
All recoding schemes mentioned in Table II underwent several verification steps. First, all equations were validated with a random-test C program. Then, they were implemented at RTL level in Verilog-2001 (IEEE 1364) as technology-independent reusable IP cores [1], using exactly the same optimized coding style for a fair comparison. They are compile-time reconfigurable according to N and r. The reader is referred to [11], [19], and [12] for the recoding tables used in equations (9), (10), and (13), respectively.
All RTL codes went through a rigorous cycle-accurate functional verification procedure using the Modelsim SE-6.3f logic simulator. They were first challenged against a set of special corner test cases, and then submitted to a random test with a very large number of vectors. After successful functional verification, physical tests were performed: the designs were integrated into an FPGA evaluation board for an ultimate validation. Afterwards, all equations were synthesized and mapped to the same Virtex-6 FPGA circuit (xc6vsx475t-2ff1156) using Xilinx ISE release 13.2 [21]. We used for comparison a two's complement 64×64 bit parallel multiplier. The implementation results are grouped in Table III.

For the Dimitrov algorithm, the impact of the recoding delay (d_8) on the total performance is important (Table III). Besides, it is the most area-consuming despite the fact that it employs the lowest number of adders (N/4-1). Conversely, the Seidel algorithm consumes the most adders (7N/8-1). To determine which factor, Mux_T or Add_T, exerts more influence on area occupation, let us compare their respective ratios for the Seidel and Dimitrov algorithms: Mux_T(Eq.13)/Mux_T(Eq.10)=2.7 and Add_T(Eq.10)/Add_T(Eq.13)=3.5.
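The adder counts quoted here can be reproduced with a simple bookkeeping model. The sketch below assumes that the total is the number of adders inside the N/r partial-product generators plus the N/r-1 adders that sum the partial products; this decomposition is an assumption, chosen because it is consistent with the figures 7N/8-1 and N/4-1 and with the per-PPG adder counts (6 for Seidel, 1 for Dimitrov) stated elsewhere in the paper. `add_total` is a hypothetical name.

```python
def add_total(n, r, ppg_adders):
    """Adders of an n-bit radix-2^r multiplier under the assumed model:
    ppg_adders inside each of the n/r PPGs, plus n/r - 1 summation adders."""
    k = n // r
    return k * ppg_adders + (k - 1)


if __name__ == "__main__":
    n = 64
    seidel = add_total(n, 8, 6)      # equals 7N/8 - 1
    dimitrov = add_total(n, 8, 1)    # equals N/4 - 1
    mcsorley = add_total(n, 2, 0)    # equals N/2 - 1 (no adder inside the PPG)
    print(seidel, dimitrov, mcsorley)
    # Ratio of the leading terms (7N/8) / (N/4), independent of N.
    print((7 / 8) / (1 / 4))
```

The ratio of the leading terms is 3.5; for N=64 the exact counts give a slightly larger value because of the -1 terms.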
Significant conclusion: the area occupation is dominated by the Mux_T factor and grows as Mux_T increases (Tables II and III). This correlation is used to minimize area occupation, as shown in the next section.

The McSorley algorithm (eq. 8) is the least area-consuming and the slowest recoding scheme for any value of N. The best area/speed compromise for N=64 is given by our recoding scheme based on equation (11). However, the latter will be outperformed by equation (12) for larger values of N (N>64), since a higher radix (2^16) is employed. Since energy consumption is a function of the switched capacitance, Table III shows a direct correlation between area occupation and energy consumption. Lowering the Mux_T indicator therefore results in a less energy-consuming recoding algorithm.

Finally, based on theory and implementation results, we conclude that the best tradeoff for our recoding schemes depends on the values of N and r. For larger N values (N>64), larger radices are necessary to reduce the critical path. But for larger radices (r>16), we need to duplicate some of the elementary PPG_ji (2^1, 2^2, 2^5, 2^8) to build up the radix-2^r PPG_j. Therefore, a relevant question arises at this level: given N, what is the value of r and its corresponding elementary PPG_ji configuration (optimal partitioning of PPG_j) that leads to the shortest critical path (Del_Tmin) with minimum hardware resources (Mux_Tmin)? The answer to this question is given in the next sections.
IV. PRELIMINARY STUDY TO AN OPTIMAL PARTITIONING
We extend the recoding space of our equations (11) and (12) to the generalized equation (14). The translation of equation (14) into architecture is depicted in Fig. 1.b (top view only), where each PPG_j is built according to a quadruplet (a,b,c,d), as illustrated by Fig. 3. For instance, equations (11) and (12) correspond to (0,1,1,1) and (1,1,1,1), respectively.

Given N and r, to determine the optimal partitioning of the whole multiplier (a global optimum, since the PPG_j are identical), we first need to find the quadruplet (a,b,c,d) that satisfies the condition 8a+5b+2c+d=r and leads to the PPG_j with minimum hardware resources (Mux_min) and the shortest critical path (Del_min). As it is not certain that such a solution exists, we use composite metrics (Table IV) corresponding to each basic recoding algorithm (2^8, 2^5, 2^2, 2^1).

Because of the explosive number of possible combinations, the solution space is exhaustively explored using a deterministic C program for r varying from 8 to 1024. The results are reported in Table V. In conclusion, optimal area solutions (Mux=Mux_min) are exclusively based on the radix-2^2 algorithm (0,0,c,0), but they are excessively slow (Del>>Del_min), while optimal speed solutions (Del=Del_min) are entirely composed of radix-2 […] (Table IV). To correct this disequilibrium, we replace respectively the two Seidel radix-2^8 […] Table VI. The results delivered by the deterministic C program are reported in Table VII.

The new results are so interesting that we are encouraged to pursue the optimization process further, using higher basic sub-radices (s>8) to reduce the total delay (Del_T) of the multiplier. Let us this time replace […] Table VIII.
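The exhaustive exploration can be pictured with a short enumeration of the quadruplets. The sketch below is written in Python rather than C and only lists the candidates satisfying 8a+5b+2c+d=r; the area and delay metrics of Table IV are not available here, so the ranking step takes a caller-supplied cost function, and the one used in the example (number of elementary PPG_ji) is a stand-in, not the paper's metric. `quadruplets` and `best` are hypothetical names.

```python
def quadruplets(r):
    """All non-negative integer (a, b, c, d) with 8a + 5b + 2c + d == r."""
    return [(a, b, c, r - 8 * a - 5 * b - 2 * c)
            for a in range(r // 8 + 1)
            for b in range((r - 8 * a) // 5 + 1)
            for c in range((r - 8 * a - 5 * b) // 2 + 1)]


def best(r, cost):
    """Candidate with the smallest cost; `cost` maps a quadruplet to a number."""
    return min(quadruplets(r), key=cost)


if __name__ == "__main__":
    print(len(quadruplets(8)), "candidates for r = 8")
    # Stand-in cost: total number of elementary PPG_ji in the PPG_j.
    print(best(16, cost=sum))
```

The number of candidates grows quickly with r, which is why a deterministic exhaustive program is a natural choice over manual inspection.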
The C program yields even more interesting results: starting from r≥64 (Table IX), lower delays are obtained with the same multiplexer complexities as those reported in Table VII. Based on these results, we pushed the optimization process further, using even higher basic sub-radices (s=16..32).

All optimal solutions are of the form (a,0,0,0) or (0,b,0,0). At this level we can draw a significant conclusion: since the optimal solution is always of the form (a,0,0,0) or (0,b,0,0) with a=2k and b=2k', there exists an integer s=2k'' such that either (s,0,0,0) or (0,s,0,0) is the optimal solution.
Consequently, equation (14) becomes equation (15). Based on the heuristic developed in Section III, the multiplexer complexity of equation (15) for the whole multiplier is always equal to Mux_T=10×N/2=5N for any value of r and s. As for the multiplier delay (Del_T), we need to determine the couple (r,s) that leads to the shortest critical path in terms of adder levels. This is achieved in the next section.
V. THE OPTIMAL PARTITIONING
The total delay (Del_T) of the whole multiplier related to equation (15) is Del_T = N/r-1+Del+d_2, where Del is the PPG_j delay, equal to (r/s-1)+(s/2-1), and d_2 is the multiplexer delay corresponding to the recoding logic of radix-2^2. Thus,

$$Del_T = \frac{N}{r} + \frac{r}{s} + \frac{s}{2} - 3 + d_2.$$

The optimal delay with regard to r is obtained for (r,s) couples satisfying

$$\frac{\partial Del_T}{\partial r} = -\frac{N}{r^{2}} + \frac{1}{s} = 0 \quad\Rightarrow\quad r = \sqrt{N s}.$$

When r is substituted by $\sqrt{N s}$ into the Del_T expression, we obtain:

$$Del_T = 2\sqrt{\frac{N}{s}} + \frac{s}{2} - 3 + d_2.$$

Likewise, the optimal delay with regard to s is obtained for the s value satisfying

$$\frac{\partial Del_T}{\partial s} = -\sqrt{N}\, s^{-3/2} + \frac{1}{2} = 0 \quad\Rightarrow\quad s = \sqrt[3]{4N}.$$

Hence, the optimal delay becomes:

$$Del_{T} = 3\sqrt[3]{\frac{N}{2}} - 3 + d_2.$$

Finally, we conclude that the optimal N-bit multiplier, in comparison to equation (8) [13], relies on the new triple recursive equation (15) with $(r,s) = (\sqrt[3]{2N^{2}}, \sqrt[3]{4N})$. Table X provides the s and r values that lead to the optimal partitioning with respect to the operand size N. The values s and r correspond to the number of multiplier bits that are treated simultaneously inside each PPG_ji and each PPG_j, respectively. For N=64, the optimal partitioning is obtained with (r,s)=(32,8), as illustrated by Fig. 4. Whereas equations (15) and (8) require the same amount of hardware resources, (Mux_T, Add_T)=(320,31), they exhibit different critical paths: 7 and 31 adder levels, respectively.
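The delay model of this section is easy to evaluate numerically. The sketch below counts adder levels only (d_2 is left out) using Del_T = (N/r-1)+(r/s-1)+(s/2-1), and compares a brute-force search with the continuous bound that follows from the arithmetic-geometric mean inequality, N/r + r/s + s/2 ≥ 3·(N/2)^(1/3), with equality when the three terms are equal. Restricting the search to power-of-two r and s with s dividing r and r dividing N is an assumption made for the example; `delay`, `closed_form` and `best_partition` are hypothetical names.

```python
from math import isclose


def delay(n, r, s):
    """Adder levels of the partitioned multiplier (multiplexer delay d_2 excluded)."""
    return (n // r - 1) + (r // s - 1) + (s // 2 - 1)


def closed_form(n):
    """Continuous lower bound 3*(n/2)^(1/3) - 3 on the adder levels."""
    return 3 * (n / 2) ** (1 / 3) - 3


def best_partition(n):
    """Smallest (delay, r, s) over power-of-two r, s with s | r and r | n."""
    pows = [1 << k for k in range(1, n.bit_length())]
    cands = [(delay(n, r, s), r, s)
             for r in pows if n % r == 0
             for s in pows if s <= r and r % s == 0]
    return min(cands)


if __name__ == "__main__":
    print(delay(64, 32, 8))          # the (r, s) = (32, 8) partitioning for N = 64
    print(best_partition(8192))
    print(closed_form(8192))
```

For N=64 several power-of-two couples reach the same minimum of 7 adder levels, (32,8) being one of them; for N=8192 the three terms can be made exactly equal, so the search and the bound coincide.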
VI. DISCUSSION OF THE IMPLEMENTATION RESULTS
We proved via FPGA implementation (Table III) the accuracy of the area heuristics developed in Section III (Table II). On this basis, we undertook a gradual theoretical optimization process that led to equation (15). The latter was implemented on FPGA with N=64; the results in terms of multiply-time, energy consumption per multiply-operation, and total gate count are 78.98 MMPS, 1.45 pJ and 1987 slices, respectively. Compared to the implementation results of the Seidel and Dimitrov algorithms (Table III), gain ratios of 1.62, 1.71, 2.64 and 1.83, 1.71, 3.32 are obtained, respectively. A 64-bit multiplier generated by Xilinx Coregen exhibits 75.86 MMPS and consumes twelve 18×18 bit DSP-slice multipliers.
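The real reasons behind these important results are explained as follows.

TABLE X. Optimal partitioning (s, r) and critical path in adder levels with respect to the operand size N
N | s | r | Eq. (15) | Eq. (8) | Seidel (Eq. 10) | Dimitrov (Eq. 13)
8 | 4 | 8 | 2 | 3 | 6 | 1
16 | 4 | 8 | 3 | 7 | 7 | 2
32 | 8 | 16 | 5 | 15 | 9 | 4
64 | 8 | 32 | 7 | 31 | 13 | 8
128 | 8 | 32 | 9 | 63 | 21 | 16
256 | 16 | 64 | 13 | 127 | 37 | 32
512 | 16 | 128 | 17 | 255 | 69 | 64
1024 | 16 | 128 | 21 | 511 | 133 | 128
2048 | 32 | 256 | 28 | 1023 | 261 | 256
4096 | 32 | 512 | 35 | 2047 | 517 | 512
8192 | 32 | 512 | 45 | 4095 | 1029 |

Fig. 4. Optimal partitioning of a two's complement 64×64 bit radix-2^32 parallel multiplier based on equation (15)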
A. Area occupation
For operand size N=64, equation (15) is a composite radix-2^32 algorithm (Table X), where each PPG_j simultaneously processes 32+1 inputs that are split over four sub-radix-2^8 PPG_ji, each made of four instances (C_ji^k) of the McSorley algorithm (Fig. 4). The Seidel and Dimitrov algorithms, in contrast, are radix-2^8 algorithms based on a mono-bloc PPG_j.

In fact, although the radix-2^8 PPG_ji of equation (15) and the radix-2^8 PPG_j of Seidel and Dimitrov are based on different recoding schemes, they are mathematically equivalent since they produce the same partial product PP_ji/PP_j. Based on theory (Table II) and implementation results (Table III), the Dimitrov recoding is the most space-consuming due to the use of odd-multiples of the multiplicand. On the other hand, the Seidel recoding does not require odd-multiples, but since 9 inputs are treated simultaneously in a mono-bloc PPG_j, a large amount of multiplexer resources is needed to recode the 2^9=512 input combinations. Finally, the radix-2^8 PPG_ji of equation (15) is the least area-consuming because it does not employ odd-multiples and requires a small amount of multiplexers, as the total number of input combinations in each radix-2^8 PPG_ji is equal to 8+8+8+8=32. Note that the three recoding schemes incorporate a number of adders in their PPG_ji/PPG_j, which is 3, 6, and 1 for equation (15), Seidel and Dimitrov algorithms, respectively.
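The two counts of input combinations quoted above follow directly from the slice widths, as the small Python sketch below shows. It assumes that a composite PPG is made of r/s independent recoders, each seeing s+1 input bits; the function names are hypothetical.

```python
def monobloc_entries(r):
    """Input combinations of a single case statement over r+1 bits."""
    return 2 ** (r + 1)


def composite_entries(r, s=2):
    """Total input combinations of r/s recoders, each over s+1 bits."""
    assert r % s == 0
    return (r // s) * 2 ** (s + 1)


if __name__ == "__main__":
    print(monobloc_entries(8))       # mono-bloc radix-2^8 PPG
    print(composite_entries(8, 2))   # four radix-2^2 recoders: 8+8+8+8
```

The mono-bloc count doubles with every additional bit of r, whereas the composite count grows only linearly with r for a fixed s.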
Significant conclusion: the area occupation is dominated by the Mux factor, and becomes larger as Mux number becomes higher.
B. Delay
Using higher radices (r>>) will certainly shorten the critical path. However, for high r values, mono-bloc PPG_j recoding induces an important delay (d_s) due to the high density of multiplexer logic, which significantly degrades the overall performance of the multiplier. This is clearly illustrated by the Dimitrov radix-2^8 recoding, whose critical path totals 8 adder levels but exhibits a lower multiply rate (43.17 MMPS) than the Seidel recoding, which has a critical path of 13 adder levels but shows a better rate (48.62 MMPS) due to its lower multiplexer complexity (Tables II and III). As for equation (15), since a composite PPG_j is used, d_s is equal to d_2 (the C_ji^k delay), which is the smallest delay (d_2 < d_5 < d_8). Besides, the critical path goes through the smallest number (7) of adder stages, exploiting the maximum parallelism that can be provided by the triple-recursive equation (15). Thus, it is not surprising that equation (15) achieves the best performance (78.98 MMPS), even when compared to the Xilinx Coregen multiplier based on DSP-slices (75.86 MMPS). A double-recursive (s=2) version of equation (15) served to design a scalable 16-bit setpoint Finite-Word-Length PID controller employing five multiplication cores. The implementation results outperformed the published ones at all levels [23].
Significant conclusion: using composite recoding in conjunction with an optimal partitioning (r and s values) provides the shortest critical path. Equation (15) shows high aptitude for pipelining. Two finely and coarsely grained systolic architectures for 64-bit multiplier are depicted in Fig. 5.a and Fig. 5 .b, respectively. Fig. 5 .a architecture is more suitable for high throughput applications, with 7 clock-cycle latency.
VII. CONCLUSION AND FUTURE WORK
On the basis of the new multibit recoding multiplication algorithm, we developed optimal parallel multipliers with the shortest critical paths and minimum hardware resources for any operand size N. We demonstrated by theory and FPGA implementation the superiority of our high-radix algorithms over their existing counterparts. By exploiting the maximum parallelism inherent in the multiply operation, our look-up-table based multiplier (eq. 15) is even speed-competitive with Xilinx's hardwired multiplier employing DSP-Slices (18×18 bit full-custom multipliers).
More importantly, we demonstrated also that the current trend relying upon minimal number-bases for the development of high radix-2 r recoding (r≥8) with monobloc PPG requires an excessive amount of multiplexer resources, which offsets speed and power benefits of the compressor factor N/r. On the other hand, we proved that composite PPG based on the new recursive multibit recoding algorithm is the best realistic alternative.
The topology of our proposed recoding schemes shows high capabilities for pipelining, which can be finely or coarsely grained to satisfy both high-throughput and low-latency applications. A radix-2^32 64-bit parallel multiplier was finely pipelined, resulting in a systolic architecture with seven clock-cycle latency.

While the theoretical concept was validated using FPGA as a preliminary step, an ASIC implementation based on a standard-cell library is necessary for an ultimate validation of the whole optimization work. This issue will be explored in the near future, and we intend to report our results in a forthcoming paper.

Thus, the total size becomes N+1. Y is a two's complement number. It is written as follows:
