We propose a split array multiplier organized in a left-to-right leapfrog (LRLF) 
Introduction
The three steps of parallel multiplication are recoding and partial product (PP) generation (PPG), PP reduction (PPR), and final carry-propagate addition (CPA). Based on the approach to PPR, multipliers are usually classified into (i) linear array multipliers with logic delay proportional to n, and (ii) tree multipliers with delay proportional to log(n) [12]. The tree reduction treats PP bits either in rows or in columns. Although tree multipliers have the shortest logic delay in the PPR step, they have irregular layout with complicated interconnects. On the other hand, array multipliers have larger delay but offer regular layout and simpler interconnects. As interconnects become important in deep submicron design [22], architectures with regular layout and simple interconnects are desirable. Irregular layouts with complicated interconnects not only demand more physical design effort but also introduce significant interconnect delay and make noise a problem due to several types of wiring capacitance [1, 22].
Modern multiplier designs use [4:2] adders [14] to reduce the PPR logic delay and regularize the layout. To improve regularity and compact the layout, a regularly structured tree (RST) with recurring blocks [6] and a rectangular-styled tree obtained by folding [8] were proposed, at the expense of more complicated interconnects. In [15], a three-dimensional minimization (TDM) algorithm was developed to design adders of the maximal possible size with optimized signal connections, which further shortened the PPR path by 1 to 2 XOR delays. However, the resulting structure has a more complex layout than a [4:2]-adder based tree. In [10], multiplication was divided recursively into smaller multiplications to increase layout regularity and scalability, which essentially resulted in a hierarchical tree structure.
In linear array multiplier design, the even/odd split structure [9] was proposed to reduce both delay and power of conventional right-to-left (R-L) linear array structures. In [13], a leapfrog structure was proposed to take advantage of the delay imbalances in adders. In [4], a left-to-right (L-R) carry-free (LRCF) array multiplier was proposed where the final CPA step to produce the MS bits of the product was avoided by using on-the-fly conversion in parallel with the linear reduction. In [2], this LRCF approach was extended to produce a 2n-bit product. It was also discovered that glitches in L-R reduction arrays were smaller than in the conventional R-L arrays, especially for data with large dynamic range [5, 20, 7].
To further reduce the delay of array multipliers while maintaining their regular layout and simple interconnect, this paper proposes split array LRLF (SALRLF) multipliers that combine the advantages of splitting, L-R computation, and the leapfrog structure. Two types of splitting are considered: even/odd and upper/lower. Each step of SALRLF is optimized with the primary objective of delay reduction and the secondary objective of power reduction. Logic-level analysis as well as physical layout with guided floorplanning are conducted to compare SALRLF with tree multipliers.
In the following, the multiplicand $X = -x_{n-1}2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i$ and the multiplier $Y = -y_{n-1}2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i$ are integers in two's-complement form, with n even to simplify the description. For logic-level analysis, the delay of a 2-input XOR gate, $T_{XOR2}$, is used as the unit delay. The delay of a two-level complex gate such as AOI22 (AND2-NOR2) is taken to be equivalent to $T_{XOR2}$.
Partial Product Generation
Radix-4 recoding is used to halve the number of PPs. After comparing common recoders, we developed a version, neg/two/one-nf ("nf" for neg-first), shown in Fig. 1. The negation operation is done before the selection between 1X and 2X so that $two_i$ and $one_i$ set $PP_i$ to zero regardless of $neg_i$ for "−0". To generate the additional '1' for a negative $PP_i$, a correction bit $c_i = y_{2i+1}\,\overline{(y_{2i}\, y_{2i-1})}$ is used. Due to shifting, each PP has a 0 between $PP_{i+1,0}$ and $c_i$. To have a more regular LSB part of each PP, $PP_{i,0}$ is added with the $c_i$ bit in advance [18]. The $PP^{(new)}_{i,0}$ and $c^{(new)}_i$ are described as:
are obtained no later than other PP bits. The generated PP bit-array is arranged in MSB-first or L-R manner as shown in Fig. 2 . The grey circles are P P 
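The recoding step can be illustrated with a small behavioral model. This is a sketch only: the function names (`radix4_recode`, `digit_value`, `multiply`) are hypothetical, the `neg`/`two`/`one` equations follow the standard radix-4 Booth truth table rather than the gate-level design of Fig. 1, and the correction bit is assumed to be $c_i = y_{2i+1}\,\overline{(y_{2i} y_{2i-1})}$, so that it is 1 exactly when the recoded digit is strictly negative.

```python
def radix4_recode(y, n):
    """Recode an n-bit two's-complement integer y (n even) into n/2 radix-4 digits.

    Returns a list of (neg, two, one, c) tuples, least-significant digit first.
    Assumption: standard radix-4 Booth truth table, not the exact logic of Fig. 1.
    """
    if n % 2:
        raise ValueError("n must be even")
    # Python's >> on negative ints sign-extends, so this yields two's-complement bits.
    bits = [(y >> k) & 1 for k in range(n)]
    signals = []
    for i in range(n // 2):
        hi = bits[2 * i + 1]
        mid = bits[2 * i]
        lo = bits[2 * i - 1] if i > 0 else 0  # y_{-1} is taken as 0
        neg = hi
        one = mid ^ lo
        two = (hi & (1 - mid) & (1 - lo)) | ((1 - hi) & mid & lo)
        c = hi & (1 - (mid & lo))  # assumed correction bit: 0 for the "-0" case 111
        signals.append((neg, two, one, c))
    return signals


def digit_value(neg, two, one):
    """Signed digit in {-2, -1, 0, 1, 2}; 'neg' has no effect when the magnitude is 0."""
    magnitude = 2 * two + one
    return -magnitude if neg else magnitude


def multiply(x, y, n):
    """Sum the n/2 partial products digit_i * x * 4**i."""
    total = 0
    for i, (neg, two, one, _c) in enumerate(radix4_recode(y, n)):
        total += digit_value(neg, two, one) * x * 4 ** i
    return total
```

The model works on integer values rather than bit vectors, so it shows why only n/2 partial products are needed but does not model the one's-complement-plus-correction-bit arithmetic of the hardware.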
Partial Product Reduction
The delay gap between tree multipliers and array multipliers is mainly due to the linear PPR structure in conventional array multipliers. To improve the speed of array multipliers, parallelism is introduced in PPR. In addition, different adder types and the signal flow between them also have impact on delay, area, and power.
Figure 2: MSB-first radix-4 PP bit array (n=12).
L-R leapfrog (LRLF) structure
To exploit the delay difference between carry and sum signals in adders, the sum signals in the leapfrog [13] structure for R-L array multipliers skip over alternate rows. Because all the carry signals propagate through the entire array, the MSBs of the final PPR vectors arrive at the same (latest) time. This unfortunately prevents the optimization of the final CPA that is possible in tree multipliers. To allow final CPA optimization in linear array multipliers, we combine L-R computation and the leapfrog structure, resulting in a new L-R leapfrog (LRLF) array multiplier scheme. An LRLF multiplier for the PP array of Fig. 2 is shown in Fig. 3. The dashed lines are carries and the solid lines are sum signals. Each adder symbol represents either an FA if all three inputs are variables or an HA if one of the three inputs is constant.
The power and delay characteristics of LRLF multiplier have been reported in [7] . Here we optimize the [4:2] adder design according to input arrival profiles. The basic [4:2] adder module, M42, is shown in Fig. 4 
The order is arbitrary and does not affect the discussion here because all inputs are functionally equivalent. According to input arrival profiles, two designs with different Sum logic are developed: M42L (linear-Sum) in Fig. 5a and M42T (tree-Sum) in Fig. 5b. The arrival times of Tout and Cout are (4) which are smaller than those in M42. In M42L, Sum arrives at 
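As a functional reference for the [4:2] adder discussed above, the following sketch builds one from two full adders and checks its arithmetic identity. It is the textbook two-FA construction, assumed here for illustration; it does not reproduce the gate-level M42, M42L, or M42T designs of Figs. 4 and 5, and the names `full_adder` and `compressor_42` are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compressor_42(x1, x2, x3, x4, tin):
    """Textbook [4:2] adder from two full adders.

    Tout depends only on x1..x3, so horizontal carries do not ripple along a row.
    """
    s1, tout = full_adder(x1, x2, x3)
    s, cout = full_adder(s1, x4, tin)
    return s, cout, tout


def check_compressor():
    """Exhaustively verify x1 + x2 + x3 + x4 + Tin == Sum + 2 * (Cout + Tout)."""
    for x1, x2, x3, x4, tin in product((0, 1), repeat=5):
        s, cout, tout = compressor_42(x1, x2, x3, x4, tin)
        if x1 + x2 + x3 + x4 + tin != s + 2 * (cout + tout):
            return False
    return True
```

Because all five inputs are functionally equivalent in the arithmetic sense, reordering them changes only the timing, which is what the arrival-profile-driven variants exploit.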
Split array LRLF (SALRLF) structure
The PPR delay of an LRLF array multiplier is about $\frac{n}{2} T_{XOR2}$, while that of an n × n-bit radix-4 tree multiplier is $3\lceil \log_2(\frac{n}{4}) \rceil T_{XOR2}$. The delay of LRLF is not competitive with that of tree multipliers when n > 16. To reduce the PPR delay, a certain level of parallelism is necessary.
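To make the comparison concrete, the two estimates can be tabulated for common operand widths. This is a sketch under stated assumptions: the LRLF estimate is taken as n/2 and the tree estimate as 3·ceil(log2(n/4)), both in units of $T_{XOR2}$; the ceiling is an assumption about the original formula, and the function names are hypothetical.

```python
import math


def lrlf_ppr_delay(n):
    """Approximate PPR delay of an LRLF array multiplier, in units of T_XOR2."""
    return n / 2


def tree_ppr_delay(n):
    """Approximate PPR delay of a radix-4 [4:2]-adder tree, in units of T_XOR2.

    Assumes ceil(log2(n/4)) levels of [4:2] adders at 3 T_XOR2 each.
    """
    return 3 * math.ceil(math.log2(n / 4))


def delay_table(widths=(16, 24, 32, 48, 64)):
    """Return (n, lrlf, tree) rows for the given operand widths."""
    return [(n, lrlf_ppr_delay(n), tree_ppr_delay(n)) for n in widths]
```

Calling `delay_table()` shows the linear estimate pulling away from the logarithmic one as n grows, which is the motivation for splitting the array.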
One approach is to split the PP bit array into even PPs and odd PPs, as shown in Fig. 6. In each split part, successive PPs are shifted by four bits per row and reduced into two vectors using an LRLF structure. The final vectors from the even and odd parts are merged by a (2n − 3)-bit [4:2] adder. This algorithm is named even/odd LRLF (EOLRLF).
Another approach is to split the PP bit array into upper PPs and lower PPs, as shown in Fig. 7 . In each part, PPs are shifted two bits each row and reduced into two vectors using LRLF. The final vectors from upper and lower parts are merged by a [4:2] adder. To The PPR delay of SALRLF is about (
, depending on the type of adders used. For n ≤ 32, the delay is < 11 ∼ 12 while the best result of a tree multiplier is ≤ 9. Further splitting of the PP array reduces the layout regularity and will not be considered. Instead, optimization of FAs and the final CPA as well as floorplanning will be used to narrow the remaining gap.
In EOLRLF, the arrival profile of the PPR final vectors has fewer latest-arriving bits than that in tree multipliers. Fig. 8 shows the PPGR delay profiles in a 32 × 32-bit TDM multiplier, an EOLRLF, and a ULLRLF. The number of latest-arriving bits in EOLRLF is 5 while this number is 8 in TDM. The bit delay distribution in EOLRLF is also more regular: most bit groups in EOLRLF have 4-5 bits, whereas the group size varies considerably in TDM. The final adder design could exploit these better-shaped arrival profiles in EOLRLF to reduce delay.
Compared with EOLRLF, ULLRLF has two main advantages. First, the shifting distance between PPs in each upper/lower part is 2 positions instead of 4, which leads to simpler interconnects. Second, the final [4:2] adder in ULLRLF is only (n+2)-bit in contrast to (2n − 3)-bit in EOLRLF. On the other hand, ULLRLF has a worse arrival profile than EOLRLF. However, such a profile costs just one $T_{AO21}$ delay, which will be explained in Section 4. Our detailed layout experiments indicate that EOLRLF is worse than ULLRLF in all measurements. Therefore, we choose ULLRLF in the following discussion.
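The two splitting schemes can be pictured as partitions of the PP row indices. The sketch below is illustrative only: it assumes n/2 radix-4 PPs indexed from 0 (least significant), ignores any extra correction row, treats the higher-indexed half as the "upper" part, and uses hypothetical function names.

```python
def split_even_odd(num_pps):
    """Return (even, odd) lists of PP row indices."""
    idx = list(range(num_pps))
    return idx[0::2], idx[1::2]


def split_upper_lower(num_pps):
    """Return (upper, lower) lists of PP row indices; upper = more significant half."""
    idx = list(range(num_pps))
    half = num_pps // 2
    return idx[half:], idx[:half]


def row_shift_bits(group):
    """Bit offset between consecutive rows of a group.

    In radix-4, PP_i has weight 4**i, i.e. an offset of 2*i bit positions.
    """
    return [2 * (b - a) for a, b in zip(group, group[1:])]
```

For n = 12 (six PPs), `row_shift_bits` gives 4-bit steps within the even or odd group and 2-bit steps within the upper or lower group, matching the shifting distances compared above.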
Optimization of FAs
In array multipliers, the basic components for PPR are full adders (FAs). Two common FA structures, FA-MUX and FA-ND3, are shown in Fig. 9. Compared with FA-ND3, FA-MUX typically has smaller area even if pass transistors are not used. Since the FA is the most used element in array multipliers, a smaller FA leads to smaller overall area, which also helps reduce power consumption and interconnect delay. As to logic delay, however, FA-ND3 is better than FA-MUX because the delay from all inputs to Cout is $T_{AO222}$ ($T_{AO222} \approx T_{XOR2}$).
Because of the different characteristics of FA inputs, it is possible to optimize signal flow with respect to propagation delay. This technique has been applied in TDM tree multipliers [15]. In addition to delay, signal flow optimization affects power [7]. Assume the three input signals to an FA are Ain, Sin, and Cin, sorted according to their arrival times α. We assume that $\alpha_{Ain} \le \alpha_{Cin} \le \alpha_{Sin}$; the order is arbitrary since the inputs are functionally equivalent. In FA-ND3, Sin is connected to pin C. There is no restriction on the connections between Ain(Bin) and pin A(B) unless the transistor-level difference between A and B is considered. In FA-MUX, Sin is also connected to pin C. Between Ain and Bin, the signal with less switching activity is connected to pin A for power saving, because pin B has less load capacitance and is therefore used for the signal with higher switching activity. Since PP bits arrive at the earliest time and never change after PPG, they are connected to A pins. This signal flow optimization technique is named CSSC to reflect the interchange of sum and carry signals. In the experiments section, we show the delay effects of FA selection and CSSC optimization in LRLF and ULLRLF array multipliers.
Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH'03) 1063-6889/03 $17.00 (C) 2003 IEEE
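The pin-assignment rule described above can be sketched as a small helper. This is an illustration under assumptions: the `Signal` record, its `activity` estimate, and the function `assign_fa_pins` are hypothetical, and the rule is reduced to "latest input to pin C; for FA-MUX, the less active of the remaining two to pin A".

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    arrival: float   # arrival time, in units of T_XOR2
    activity: float  # estimated switching activity (hypothetical metric)


def assign_fa_pins(signals, fa_type="MUX"):
    """Map three FA input signals to pins A, B, C.

    The latest-arriving signal goes to pin C. For FA-MUX, the less active of
    the two earlier signals goes to pin A; for FA-ND3 the two earlier signals
    are simply kept in arrival order, since A and B are treated as equivalent.
    """
    if len(signals) != 3:
        raise ValueError("a full adder has exactly three inputs")
    ordered = sorted(signals, key=lambda s: s.arrival)
    early, latest = ordered[:2], ordered[2]
    if fa_type == "MUX":
        early = sorted(early, key=lambda s: s.activity)
    return {"A": early[0].name, "B": early[1].name, "C": latest.name}
```

For example, a PP bit that is stable after PPG has the lowest activity and lands on pin A, while a late sum signal lands on pin C.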
Final Adder
Final adders are optimized to match the nonuniform input arrival profiles. The optimal final adder for tree multipliers is a CSMA-based design [16]. An efficient design of the on-the-fly converter for L-R array multipliers also corresponds to a multi-level carry-select (CSEL) or conditional-sum (CSUM) adder [11]. In [19], a generalized earliest-first (GEF) algorithm was proposed to design a CSUM adder for an arbitrary input arrival profile. The similarity between CSUM and the prefix adder (PFA) is also shown in [19], where PFA is called CLA.
We followed the GEF algorithm and chose PFA for final addition because the PFA operators, AO21 and AND2, are simpler than the basic CSUM operators, a pair of MUX21. Two lists, P list and T list, are maintained in GEF. All (G, A) signal pairs are initially put into P list and sorted according to arrival times. The earliest pairs are then moved to T list. Adjacent bit pairs in T list are retrieved and merged from left to right. The merged pairs are put back into P list. The iteration continues until the MSB carry bit is generated. Other carry bits are generated using existing (G, A) bits. A PFA example for a hill-shaped arrival profile is shown in Fig. 10. Black nodes in the PFA are computation cells and white nodes have no logic or only buffers. In the original GEF, the merging is conducted from right to left. Because of the different input-output delays in operator '•', the left-to-right merging in T list leads to a $0.5\,T_{XOR2}$ delay improvement.
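The earliest-first idea can be illustrated with a simplified delay model. This is not the GEF algorithm of [19] itself: it only tracks when a group spanning all bits becomes available (the MSB carry), it charges a single hypothetical `merge_delay` per merge instead of the asymmetric delays of the '•' operator, and it merges only adjacent groups that are both ready at the current earliest time. The function name is hypothetical.

```python
def earliest_first_merge_delay(arrivals, merge_delay=1.0):
    """Estimate when a group covering all bits is ready under earliest-first merging.

    arrivals[0] is the leftmost (most significant) bit position. Each span is
    (left, right, ready_time). Simplified model, not the exact GEF of [19].
    """
    spans = [(i, i, t) for i, t in enumerate(arrivals)]
    while len(spans) > 1:
        t_min = min(t for _, _, t in spans)
        merged, did_merge, i = [], False, 0
        # Scan left to right, pairing adjacent spans that are both ready at t_min.
        while i < len(spans):
            cur = spans[i]
            if i + 1 < len(spans) and cur[2] == t_min and spans[i + 1][2] == t_min:
                nxt = spans[i + 1]
                merged.append((cur[0], nxt[1], t_min + merge_delay))
                did_merge = True
                i += 2
            else:
                merged.append(cur)
                i += 1
        if not did_merge:
            # No adjacent ready pair: the earliest spans wait for the next arrival time.
            t_next = min(t for _, _, t in merged if t > t_min)
            merged = [(l, r, max(t, t_next)) for l, r, t in merged]
        spans = merged
    return spans[0][2]
```

With a uniform profile the model degenerates to a balanced tree (depth log2 of the width), while for a hill-shaped profile the early bits on both sides are merged while the middle bits are still arriving, so the result is driven by the latest-arriving group rather than by the total width.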
Let $W_{max}$ be the largest number of adjacent signals that arrive at the same time. If these $W_{max}$ signals are also the latest-arriving signals in a hill-shaped arrival profile, the delay of PFA for such a profile can be estimated as
which is not directly related to the adder width 2n. A small $W_{max}$ leads to a small $T_{PFA}$. However, the difference in $T_{PFA}$ is just one $T_{AO21}$ for most schemes in our study because of the logarithmic relationship. One $T_{AO21}$ delay could be further eliminated from $T_{PFA}$ if carry-select adders are used for the final stages of the left part in hill-shaped arrival profiles [3].
Experiments
To compare the proposed ULLRLF with tree multipliers, logic-level delay analysis is first conducted. Actual VHDL implementation and physical layout are then performed on Synopsys and Cadence design platforms.
Delay comparison at logic level
VHDL generation programs for both LRLF and ULLRLF algorithms have been written with the flexibility of FA selection and signal flow optimization. The comparison results at logic level, without wiring effects, are normalized to $T_{XOR2}$ and listed in Table 1. $T_{GR}$ is the delay of PPG and PPR. $T_A$ is the delay of the final adder. For LRLF, the use of FA-ND3 rather than FA-MUX reduces the PPR delay and the overall delay by 1 $T_{XOR2}$. CSSC reduces the delay by 1 $T_{XOR2}$ except for 48-bit LRLF-ND3, where the reduction is 2. For ULLRLF, FA-ND3 saves one $T_{XOR2}$ in PPR but does not reduce the overall delay. CSSC only reduces the PPR delay in ULLRLF-MUX by 0.5 and also has no effect on the overall delay. This is because varying FAs and applying CSSC change the input arrival profiles of the final adder and affect $T_A$ by up to 1 $T_{XOR2}$. Even if there is little delay advantage, however, it is still useful to apply CSSC for power reduction [7]. We have also noticed that CSSC and FA-ND3 could help EOLRLF achieve 0.5 ∼ 2 less logic delay than ULLRLF. However, ULLRLF outperforms EOLRLF after layout because of smaller area and simpler wiring. Finally, it is worthwhile to note that $T_A$ has little relation to the adder width, as explained in Eq. 7.
Using the results from Table 1, we now compare the delays of LRLF/ULLRLF with tree multipliers. Radix-2 and radix-4 TDM schemes [15, 18] are chosen because they are the best tree multipliers to our knowledge. In addition, tree multipliers based on [4:2] and [3:2] CSAs are also used for comparison as they have more regular structures. To avoid the delay due to the extra row $PP_{n/2}$ in radix-4 two's-complement multipliers, the reduction of the 9 PPs from $PP_{n/2-8}$ to $PP_{n/2}$ is based on a [9:4] adder with only $3T_{XOR2}$ delay, as illustrated in Fig. 11. The $3T_{XOR2}$ delay is achieved as follows. All FAs except the rightmost one in the shaded [3:2] CSA are simplified into HAs with half the delay, as they have constant inputs. Inputs of the second-level [3:2] adders are arranged so that each FA has one input arriving at least $T_{XOR2}$ later than the other two inputs. This late input is connected to pin C of the FAs to ensure one $T_{XOR2}$ delay. To distinguish it from other tree multipliers, the radix-4 tree multiplier using this special [9:4] adder is named tree9to4. The logic delay comparison results are given in Table 2. The boxes marked '-' are blank because the $T_{GR}$ values or delay profiles from PPR are not available from the literature. The original TDM-radix4 data in [18] are normalized to our measurement base. It is shown that the radix-4 tree multipliers based on our [9:4] adder design have almost the same $T_{PPGR}$ as the TDM schemes. For n ≤ 32, ULLRLFs have 0.5 ∼ 1.5 $T_{XOR2}$ more delay than tree multipliers. For larger precisions, ULLRLF shows 23% more gate delay for 48 × 48-bit multiplication and 28% more for 54 × 54-bit multiplication.
Simulation with physical layout
For more realistic evaluation, structural VHDL designs are compiled and mapped into the Artisan TSMC 0.18µm 1.8-Volt standard-cell library [23] using Synopsys Design Compiler. For a fair comparison of different schemes, the [3:2] FA cells in the library are not used because there are no [4:2] adder cells. Buffers are inserted automatically by Design Compiler. Two radix-4 schemes for 24×24-bit and 32×32-bit multiplication are compared: tree9to4 and ULLRLF-MUX-CSSC. tree9to4 has delay similar to TDM but is more regular. MUX-CSSC based designs are chosen because they have smaller area and CSSC is good for power. CSSC is also applied in tree9to4. Standard-cell based automatic layout is first conducted using Cadence Silicon Ensemble. Interconnect parameters are extracted from the layout and back-annotated into Synopsys tools for delay and power calculation. Power consumption is measured at 100 MHz with 500 pseudo-random data. The results are shown in Table 3. For 24-bit, ULLRLF is better than tree9to4 in area, delay, and power, with up to 7% improvement. For 32-bit, ULLRLF has 3% less area and 10% less power than tree9to4 while keeping similar delay.
We have also experimented with layout using guided floorplanning for 32 × 32-bit multipliers. The floorplan of tree9to4 is shown in Fig. 12, which is based on an H-tree for symmetry and regularity [17]. The row utilization rate has to be relaxed to 63% from the 70% used in automatic layout for routability. In addition, all blocks have to be assigned to specific regions for delay reduction. The floorplan of ULLRLF is shown in Fig. 13. The results are shown in Table 4. The delay is improved by 4% from automatic layout. For ULLRLF, there is no cost in area for this delay improvement. For tree9to4, the area increases by 10%. After layout with guided floorplanning, ULLRLF and tree9to4 have similar delay while tree9to4 has 15% more area and 9% more power.
Conclusions
We have studied the left-to-right split array multiplier schemes EOLRLF and ULLRLF. An efficient radix-4 recoding logic generates the partial products in a left-to-right order. These partial products are split into upper/lower or even/odd groups. The two groups are reduced in parallel using the L-R leapfrog structure with optimized adder modules and signal flows. Results from the two groups are merged using a [4:2] adder. The final adder is a prefix adder optimized to match the non-uniform input arrival profile. We find that upper/lower splitting outperforms even/odd splitting after layout, although even/odd splitting is a little better in logic delay. We conclude that ULLRLF array multipliers and tree multipliers are similar in major performance characteristics for n ≤ 32 if standard-cell based automatic layout is conducted.
