Abstract-Since integrated circuits were invented, fabrication engineers have been able to steadily decrease the dimensions of the devices (transistors). These reductions in the minimum feature sizes have resulted in improved performance. In addition, the dimensions of the interconnect used to connect the active transistors have also scaled. The decreasing dimensions of the physical devices cause the capacitances and resistances of the different parts of the multiplier to change. Therefore, the relative delay due to each part of the multiplier changes. In addition, the different encoding schemes used to generate the partial products and the different topologies used in the reduction of the partial products affect the total latency of the multiplier. This paper examines the effects of the smaller device dimensions on multipliers. It shows that the interconnect is becoming more important and that automatic generation of partial products provides the minimum latency for small feature sizes.
INTRODUCTION
THE speed of the multiplier is a critical issue in determining the performance of microprocessors. It has become common for modern microprocessors to have fully expanded implementations of floating point multipliers in hardware, in contrast to the iterative implementations of previous generations. In fact, the very latest processors also implement fully expanded integer multiplication to speed up address translation, array indexing, and other integer operations.
Multiplication is the process of adding a number of partial products. Multiplication algorithms differ in the means of partial product generation and partial product addition to produce the final result.
The earliest multiplication methods were iterative schemes based on a simple shift and add algorithm. Later, Wallace [18] suggested the simultaneous addition of all partial products in a carry-free manner using carry save adders. Wallace reduced the partial products by connecting the carry save adders in parallel in a tree structure. A single carry propagate addition is needed in the final step. Dadda [5] then generalized the carry save adder into a "parallel (n, m) counter." These counters are combinational circuits that encode the number of ones that are present in the n input bits using m output bits. Dadda also showed how to minimize the number of counters used in the reduction of the partial products. The concept of the counter was then extended by Stenzel et al. [14], who showed that the Wallace tree and the Dadda schemes are all special cases of the generalized $(n_{k-1}, n_{k-2}, \ldots, n_1, n_0, m)$ counter. The generalized counter is a counter that encodes k successively weighted input columns and produces a weighted sum of m bits. So, a carry save (or a binary full) adder is a (3, 2) counter. A counter that adds two columns of five bits each, producing a 4-bit result, is a (5, 5, 4) counter.
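The counter notation can be made concrete with a short behavioral sketch. The function names below (`counter`, `output_width`, `full_adder`) are hypothetical and chosen only for illustration; the sketch models what a counter computes, not how any of the cited designs implement it.

```python
def counter(*columns):
    """Behavioral model of a generalized counter.

    columns[0] is the most significant input column. Each column is a
    sequence of 0/1 bits of equal weight, and adjacent columns differ in
    weight by a factor of two. Returns the weighted sum as an integer.
    """
    k = len(columns)
    total = 0
    for i, col in enumerate(columns):
        weight = 1 << (k - 1 - i)
        total += weight * sum(col)
    return total


def output_width(*column_sizes):
    """Number of output bits m needed to encode the largest possible sum."""
    k = len(column_sizes)
    max_sum = sum(size << (k - 1 - i) for i, size in enumerate(column_sizes))
    return max_sum.bit_length()


def full_adder(a, b, c):
    """(3, 2) counter: three equally weighted bits in, (carry, sum) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return carry, s
```

Here `output_width(3)` gives 2, matching the (3, 2) counter, and `output_width(5, 5)` gives 4, matching the (5, 5, 4) counter described above.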
The tree structures that are described by Wallace and Dadda suffer from irregular interconnections. These irregular connections occur because the counter connections are unique for each partial product weight. There have been several different tree structures that have been proposed by Zuras and McAllister [20] and Mou and Jutand [9] that reduce the partial products using more regular interconnections with a slight increase in the number of counter levels over those used by Wallace trees.
Weinberger [19] introduced the [4, 2] compressor, which provides a fast, symmetric, and regular design. This design is the binary tree that was implemented by Santoro [11] in building multipliers. The [4, 2] compressor was then extended by Song and De Micheli [13] in the development of the [9, 2] compressor family.
Most previous analyses of the partial product reduction trees use a simple compressor delay model as the basis for their design, where the delay from each input of a compressor to each output is equal. Also, the delay due to interconnection is typically ignored. Unfortunately, such simple models do not accurately reflect the performance of actual implementations, where not all inputs have the same delay and where the added delay due to interconnect is significant, especially for minimum feature sizes below 0.5µm. However, a simple delay model is sufficient for the design of a binary tree using [4, 2] compressors, as the delays for all inputs of a [4, 2] compressor are approximately equal.
Designing an optimized partial product array using (3, 2) counters requires taking into account all delay components. Organizing the counters in order to minimize worst case delay is not trivial. Therefore, an algorithmic approach to the design, using a sophisticated delay model that takes into account the interconnect delay due to counter placement and the different path delays, is extremely useful. We have implemented such an algorithm, based upon the T approach of Oklobdzija et al. [10]. The algorithm extends theirs by taking into account interconnect delay due to counter placement and the different path delays. Our algorithm also uses a delay model for the (3, 2) counter that recognizes the difference in delay between the two outputs. In a (3, 2) counter, the three inputs are identical from a functional perspective; however, their delays may differ due to implementation. In particular, the effective parasitic capacitances are different for the three inputs. In CMOS, this is determined by the relative position of a particular input transistor in the stack. The two outputs also have different propagation delays. In a static CMOS implementation, the carry is usually faster than the sum because the circuit is simpler.
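The value of a per-pin delay model can be illustrated with a minimal sketch. The delay numbers and the names `INPUT_TO_SUM`, `INPUT_TO_CARRY`, `output_times`, and `best_assignment` are assumptions made for this example; they are not the delays or the algorithm used in this study. The sketch only shows that routing a late-arriving signal to the fast input of a (3, 2) counter can hide its extra delay.

```python
from itertools import permutations

# Hypothetical pin-to-output delays in arbitrary units (not from the paper).
# Input c is assumed to be the fast input; carry is assumed faster than sum.
INPUT_TO_SUM = (3.0, 3.0, 2.0)
INPUT_TO_CARRY = (2.0, 2.0, 1.0)


def output_times(arrivals):
    """Arrival times of (sum, carry) given arrival times on pins (a, b, c)."""
    sum_t = max(t + d for t, d in zip(arrivals, INPUT_TO_SUM))
    carry_t = max(t + d for t, d in zip(arrivals, INPUT_TO_CARRY))
    return sum_t, carry_t


def best_assignment(arrivals):
    """Exhaustively pick the pin assignment with the earliest sum output,
    breaking ties on the carry output."""
    return min(permutations(arrivals), key=output_times)


late_first = (1.0, 0.0, 0.0)              # late signal on a slow pin
print(output_times(late_first))            # (4.0, 3.0)
print(output_times(best_assignment(late_first)))  # (3.0, 2.0)
```

A full design algorithm would apply this kind of choice across every counter in every column and would add wire delay to each arrival time; the exhaustive search here is only practical because a single counter has six pin assignments.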
Another constraint is the availability of wiring tracks for the routing of each column (or bit width) in the partial product array [1] . The number of wiring tracks available in a column is a function of the fabrication process and the floor plan of the multiplier. It is a fixed parameter for each column and it limits the possible interconnections.
The counter schemes are orthogonal to the use of Booth encoding [4], [8], [3], which is used to reduce the number of partial products. In Booth encoding, the number of summands is reduced by recoding the multiplier bits into groups that select multiples of the multiplicand. We consider several versions. Booth 2 reduces the n partial products to about n/2 + 1 by shifting and signing the partial products. Similarly, Booth 3 reduces the n partial products to about n/3, but it requires the formation of a ±3× multiple of the multiplicand. This so-called "hard multiple" itself requires an addition, potentially slowing the overall process.
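As an illustration of how Booth 2 recoding reduces the number of summands, the following is a minimal sketch of the textbook radix-4 recoding of an unsigned multiplier. The function name `booth2_digits` is hypothetical, and the sketch covers only digit selection, not the partial product generators or sign handling of an actual design.

```python
def booth2_digits(multiplier, n):
    """Radix-4 (Booth 2) recoding of an unsigned n-bit multiplier.

    Returns digits in {-2, -1, 0, 1, 2}, least significant first, such that
    multiplier == sum(d * 4**i for i, d in enumerate(digits)).
    Each digit selects one partial product (a multiple of the multiplicand).
    """
    # bits[0] is the implicit bit b_{-1} = 0; bits[j + 1] is bit b_j.
    # Two zero bits are appended so the top group is always complete.
    bits = [0] + [(multiplier >> i) & 1 for i in range(n)] + [0, 0]
    count = (n + 2) // 2  # ceil((n + 1) / 2) digits for an unsigned operand
    digits = []
    for i in range(count):
        lo, mid, hi = bits[2 * i], bits[2 * i + 1], bits[2 * i + 2]
        digits.append(lo + mid - 2 * hi)
    return digits


digits = booth2_digits(0b110111, 6)
print(digits, sum(d * 4**i for i, d in enumerate(digits)))
```

For a 53-bit significand this yields 27 digits, and for a 24-bit significand 13 digits, in line with the "about n/2 + 1" count. Every selected multiple (0, ±1×, ±2×) is obtainable by shifting and negation, which is why Booth 2 needs no hard multiple.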
A final variant is the Redundant Booth 3 [3] , which limits the length of the carry (in this study, to seven bits) in the formation of the ±3 multiples. Thus, a ±3 multiple of a 53-bit multiplicand forms a sign-extended 53-bit partial product plus 53/7 additional carry bits. The effect is to increase the tree height slightly.
Methodology
The IEEE 754 standard defines the precision of the significand for several different word lengths. The standard defines the significands in a sign-magnitude format. The required precisions of the significand include single precision (24 bits) and double precision (53 bits), and support for double extended precision (≥ 64 bits) is recommended. Recently, there has been interest in even higher precision, so quad precision (113 bits) is included.
In this study, we examined multiplier performance and area tradeoffs over combinations of several parameters: feature size (f = 1.0µm to 0.2µm, or to 0.1µm in some cases), counter configuration ((3, 2) and [4, 2]), encoding scheme (non-Booth, Booth 2, and Booth 3), and significand precision (24 bits through 113 bits). For each category, we implemented a custom layout of a binary-tree multiplier using the MAGIC layout tool. Additionally, a unique (3, 2) array was designed for every combination of feature size, encoding scheme, and significand precision. Using extracted parasitics, we performed SPICE timing simulations for each combination of parameters. Each simulation included delays due to transistors as well as interconnect. The scalable delay model of McFarland and Flynn [7], based on SPICE level 3, was used to project results down to 0.1µm. The reader should note that the accuracy of SPICE level 3 is questionable at a feature size of 0.1µm, ultimately limiting the scope of our study.
EFFECTS OF SMALLER FEATURE SIZES
As feature sizes decrease to submicron, supply voltages also decrease. For extremely short submicron channel lengths, the supply voltage can be in the 1-2 volt range. Therefore, at these small channel lengths, logic designers cannot use NMOS pass transistor logic because of the threshold voltage drops. At these small feature sizes, the logic families that use transmission gates with both NMOS and PMOS transistors operate correctly, given the threshold voltage drops.
When scaling the dimensions of the different physical devices from one technology to another, the devices (transistors) are reduced in size, reducing the lengths of the local interconnect (wires) required to connect the transistors. Therefore, one would expect that the relative delay due to the devices and interconnect should remain constant. However, wire delay scales at a slower rate than transistor delay. In ideal scaling, all physical dimensions are scaled equally. The transistor delay goes down by the scaling factor. In contrast, the delay due to ideally scaled wires remains constant. The wire resistance per length grows quadratically as the cross section of the wire is decreased, while the lengths of the wires decrease by the scaling factor. Therefore, the resistance of each wire increases by the scaling factor. The capacitance per length for wires remains constant, so the wire capacitance decreases by the scaling factor along with the wire length. Thus, the interconnection RC is constant so that wire delay is constant. However, wires are not scaled ideally, but use quasi-ideal scaling. If wires scaled "ideally," the width, spacing, and wire thickness (and field oxide) would all scale directly with the feature scale factor s = 1/feature size. Quasi-ideal interconnections scale the wire thickness by $1/\sqrt{s}$, not $1/s$, creating a relatively taller wire with a larger cross section. The net effect of quasi-ideal scaling of wires is that the wire delay decreases, but at a rate slower than the scaling factor. Consequently, at small feature sizes, wires contribute a larger part of the total delay.
The changes in the relative delays of the devices and the interconnects cause the latencies of the different encoding schemes and topologies to differ as the feature size changes.
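The scaling argument can be written out as a short first-order derivation. The symbols below are introduced only for this sketch: $\rho$ is the wire resistivity, $L$, $w$, and $t$ are the wire length, width, and thickness, $c$ is the capacitance per unit length, and $s > 1$ is the scale factor. The derivation assumes that the quasi-ideal thickness scales as $1/\sqrt{s}$ and that $c$ stays roughly constant in both cases, which neglects the extra sidewall capacitance of a taller wire.

$$
R = \frac{\rho L}{w\,t}, \qquad C = c\,L
$$

$$
\text{Ideal: } L, w, t \to \frac{L}{s}, \frac{w}{s}, \frac{t}{s}
\;\Rightarrow\;
R' = \frac{\rho\,(L/s)}{(w/s)(t/s)} = s\,R, \quad
C' = \frac{C}{s}, \quad
R'C' = RC
$$

$$
\text{Quasi-ideal: } t \to \frac{t}{\sqrt{s}}
\;\Rightarrow\;
R' = \frac{\rho\,(L/s)}{(w/s)(t/\sqrt{s})} = \sqrt{s}\,R, \quad
C' \approx \frac{C}{s}, \quad
R'C' \approx \frac{RC}{\sqrt{s}}
$$

Under these assumptions the wire delay falls as $1/\sqrt{s}$ while the transistor delay falls as $1/s$, so the wire share of the total delay grows as the feature size shrinks.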
WIRE EFFECTS
This section examines the effects of decreasing the dimensions of transistors and wires using the same scaling factor. The wires are scaled using the quasi-ideal scaling in which the wire height is reduced at a smaller rate than the wire length.
The simulations are based upon the scalable SPICE transistor models developed by McFarland and Flynn [7]. The delays are for a 25°C operating temperature. The latencies are calculated from the 50 percent Vdd points. The counters used in the multiplier are all DPL (Double Pass transistor Logic) circuits [15]. In the remainder of the paper, we present relative data to compare various approaches. The absolute values of the simulated delays are found in the appendix. Fig. 1 illustrates the effects of wires on the latency of a Booth 2 encoded, binary tree, double precision multiplier. The figure illustrates the relative delay due to wires compared to the same topology without wires (no interconnect delay). The delay due to wires increases at smaller feature sizes. At the 0.1µm feature size, wires contribute approximately 75 percent of the overall delay in this double precision, Booth 2 multiplier.
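Measuring latency between 50 percent Vdd points can be sketched as follows. This is a minimal illustration that assumes sampled waveforms and linear interpolation between samples; the function names `crossing_time` and `latency` are hypothetical and do not describe the measurement scripts used in this study.

```python
def crossing_time(times, volts, level):
    """First time at which a sampled waveform crosses `level`,
    using linear interpolation between adjacent samples."""
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        v0, v1 = volts[i], volts[i + 1]
        if v0 == level:
            return t0
        if (v0 - level) * (v1 - level) < 0:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    if volts[-1] == level:
        return times[-1]
    raise ValueError("waveform never crosses the requested level")


def latency(times, v_in, v_out, vdd):
    """Delay from the input's 50% Vdd crossing to the output's."""
    half = 0.5 * vdd
    return crossing_time(times, v_out, half) - crossing_time(times, v_in, half)


t = [0.0, 1.0, 2.0, 3.0, 4.0]
print(latency(t, [0, 2, 2, 2, 2], [0, 0, 0, 2, 2], vdd=2.0))  # 2.0
```

The same function handles rising and falling edges, because the crossing test only checks that two adjacent samples lie on opposite sides of the threshold.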
Even at these deep submicron device sizes, wire capacitance is the significant contributor to the delay of the multiplier. Wire resistance is not a significant portion of the total delay due to two factors:
1) The wires in multipliers are relatively short and, hence, the resistance of each wire is small.
2) The wires in multipliers are driven by transistors that have relatively narrow widths. Therefore, the effective resistance of the transistors is large.
These two factors cause the ratio of transistor resistance to wire resistance to be large and, hence, the delay to be insensitive to increases in the wire resistance.
The wire effects on the different encoding schemes are shown in Fig. 2. In this figure, each point is a ratio of the encoding scheme's delay to that of the same scheme with no wires. The relative positions of the different curves do not reflect the relative delays of the different encoding schemes.
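The insensitivity to wire resistance argued above can be checked with a first-order Elmore delay estimate for a driver, a distributed RC wire, and a lumped load. The component values below are hypothetical and chosen only to represent a narrow driver and a short wire; they are not extracted from the layouts in this study, and the Elmore model is a stand-in for the SPICE simulations actually used.

```python
def elmore_delay(r_driver, r_wire, c_wire, c_load):
    """Elmore delay of a driver resistance feeding a distributed RC wire
    that ends in a lumped load capacitance (units: ohms and farads)."""
    return r_driver * (c_wire + c_load) + r_wire * (0.5 * c_wire + c_load)


# Hypothetical values: 10 kOhm driver, 100 Ohm wire, 50 fF wire, 10 fF load.
base = elmore_delay(10e3, 100.0, 50e-15, 10e-15)
double_r = elmore_delay(10e3, 200.0, 50e-15, 10e-15)
double_c = elmore_delay(10e3, 100.0, 100e-15, 10e-15)

print(f"doubling wire resistance:  x{double_r / base:.3f}")
print(f"doubling wire capacitance: x{double_c / base:.3f}")
```

With these assumed values, doubling the wire resistance changes the delay by well under one percent, while doubling the wire capacitance increases it by more than half, because the wire capacitance is charged through the large driver resistance.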
The Booth 3 encoding scheme is the least affected by the incremental wire delay. This is due to two factors: 1) Booth 3 generates the smallest number of partial products, and consequently has the shortest wires. 2) Booth 3 has the largest amount of logic in its critical path due to the adder that is used to generate the three times multiple. This extra logic means that there are a greater number of transistors in the critical path of the multiplier. Thus, the wire capacitance and resistance are a smaller fraction of the total capacitance and resistance.
Both Booth 2 and non-Booth encoding have approximately the same number of transistors on the critical path. However, wire effects for non-Booth encoded partial products are larger because non-Booth has more partial products and therefore longer wires. These longer wires cause the wire capacitance and resistance to be a more significant part of the total capacitance and resistance, as compared to Booth 2. The effects of wires for Redundant Booth 3 encoding closely match those for Booth 3. The only differences are that Redundant Booth 3 has longer wires and a slightly smaller number of logic levels. Therefore, Redundant Booth 3 is affected by wires to a slightly larger degree than Booth 3.
The wire effects for the procedurally generated partial product reduction tree and the binary tree are shown in Fig. 3 for a double precision, Booth 2 encoded multiplier. In this figure, each point is a ratio of a topology's delay to the same topology without wires. The relative positions of the different curves do not reflect the relative delays of the different topologies.
The curve for the procedurally designed partial product reduction tree is nonmonotonic. This is because each point for the procedurally generated (algorithmic) reduction tree is a different design that is obtained using the delays for the counters and the wires at this feature size.
The procedurally generated partial product reduction tree is affected by wires to the same extent as the binary tree for large feature sizes. However, at the smaller feature sizes, wire delay is a larger part of the total delay. The ability of the procedurally generated partial product reduction tree to hide the extra delay of the long wires using the fast input to the output path of the (3, 2) counter causes the procedurally generated trees to be less affected by wires.
BINARY TREE VS. PROCEDURAL LAYOUTS
The previous section showed that the procedurally generated partial product reduction tree is less affected by wire delay. In order to determine the performance trade-offs of the use of regular topologies, as opposed to procedurally designed partial product reduction trees, the regular topology with the minimum latency, i.e., the binary tree, and the irregular procedurally designed trees were simulated for several partial product generation schemes, precisions, and minimum feature sizes.
The absolute performance of the procedurally generated (algorithmic) reduction tree compared to the binary tree for double precision non-Booth encoded multipliers is shown in Fig. 4a. The relative performance is also shown in Fig. 4b. The relative performance graph gives a clear comparison between the two partial product reduction schemes. Therefore, only relative delay graphs are shown in the remainder of this section. The latency of the binary tree and the procedurally generated partial product reduction tree are comparable for large feature sizes. This is because, at the larger feature sizes, interconnect does not significantly affect the total delay of the binary tree. In addition, the procedural approach is better able to hide the interconnect delay. However, at smaller feature sizes, the procedural layouts outperform the binary tree.
The same reasoning applies to multipliers built using Booth 2 encoding for quad precision (113 bits) because the number of partial products is approximately equal in both cases.
The relative performance of the procedurally generated partial product reduction tree to binary trees for quad precision non-Booth encoded multipliers is shown in Fig. 5 . The latency of the binary tree is smaller than that for the procedurally generated arrays for large feature sizes. This is because the quad format has a large number of partial products and therefore requires a larger number of counters in the critical path. However, at smaller feature sizes, interconnect has a significant effect on the total delay. The procedural approach is able to provide comparable latency to the binary tree because of its ability to hide interconnect delay by connecting the slow inputs due to the longer wires with the fast inputs for the counters. Table 1 summarizes the performance results from Figs. 4 and 5 along with several other common significand precisions and possible encoding schemes.
In this table, Procedural indicates that a procedurally designed partial product reduction tree has superior performance; Binary Tree indicates that the binary tree has superior performance. A Tie is declared whenever the latencies of both multipliers are within 3 percent of each other. The results show that the procedural approach is better suited for connecting the counters in the partial product array.
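The classification rule used in the table can be sketched as a small function. The name `winner` and the choice to measure the 3 percent band relative to the faster design are assumptions of this sketch; the text does not specify which latency the percentage is taken against.

```python
def winner(procedural_latency, binary_latency, tie_fraction=0.03):
    """Classify a design point as 'Procedural', 'Binary Tree', or 'Tie'.

    A tie is declared when the two latencies differ by no more than
    `tie_fraction` of the faster one (an assumed convention).
    """
    faster = min(procedural_latency, binary_latency)
    slower = max(procedural_latency, binary_latency)
    if slower - faster <= tie_fraction * faster:
        return "Tie"
    if procedural_latency < binary_latency:
        return "Procedural"
    return "Binary Tree"


print(winner(10.0, 10.2))  # Tie
print(winner(9.0, 10.0))   # Procedural
print(winner(10.0, 9.0))   # Binary Tree
```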
The table clearly shows that, at smaller feature sizes, the procedural approach to the design of the partial product reduction tree easily outperforms the best regular topology.
It is also true that the current form of the procedural approach can be improved. One would expect that the procedural advantage would be monotonically greater at smaller feature sizes. This does not appear to be the case in Fig. 5 from 0.3µm to 0.2µm. This aberration is not present in Booth 2 or Booth 3 quad precision designs (see Tables 11 and 18). This appears to be a tool limitation, as can also be seen by referring to Tables 17 and 18. Note the 64- and 68-bit entries. In several cases, the 68-bit design is faster than the 64-bit design. A designer concerned only about speed would obviously use the 68-bit design for a 64-bit application and throw away the extra bits. (Note the extra area cost in Table 28.)
SCALING EFFECTS
The previous sections showed that the procedurally generated partial product reduction tree provides the minimum latency when compared to the best regular topology (binary tree). However, nothing was said about which of the encoding schemes gives the minimum latency as the feature sizes decrease.
This section compares the different encoding schemes used to generate the partial products in terms of latency and latency × area for several significand lengths and topologies. Fig. 6 compares the encoding schemes for a double precision binary tree multiplier. The Booth 2 encoding provides the minimum latency for the multiplier for all feature sizes. However, at deep submicron feature sizes, Booth 3 is approximately 5 percent slower while being significantly smaller. Fig. 7 compares the encoding schemes for a single precision procedurally generated partial product reduction tree multiplier. Both non-Booth and Booth 2 encoding schemes provide the minimum latency for the multiplier for all feature sizes. The latency reduction in reducing the number of partial products in Booth 2 is offset by the extra latency required in generating the partial products. Table 2 summarizes the choice of encoding scheme which results in the minimum latency from Figs. 6 and 7, along with several other significand sizes. The results are shown for the two choices of reduction topology and for different feature sizes. A tie is considered whenever the relative delays of the encoding schemes are within 3 percent. From this table, it can be seen that, as the length of the significand increases, Booth 2 is the choice which minimizes latency. In most of the cases, the reduction in the number of summands achieved when moving from Booth 2 to Booth 3 encoding is not large enough to offset the extra delay needed to generate the hard (three times) multiple required for Booth 3.
Topology
Another important consideration in partial product reduction is the topology of the implementation. As used here, topology refers to important differences in the implementations in the way the counters are interconnected, the allowable number of wires per wiring channel, and the length of the wires to connect the counters. 
TABLE 1 REDUCTION TOPOLOGY CHOICE WHICH MINIMIZES LATENCY ACROSS FEATURE SIZES
A basic topology difference is between the array and the tree implementation. Arrays are two-dimensional implementations with n counter delays required to sum n partial products. Trees are three-dimensional implementations (such as the Wallace tree) that require only O(log n) counter delays to sum n partial products. Arrays naturally require fewer wires per channel than trees. Indeed, for trees, the number of wires per channel increases with the size of the tree (number of partial products). For our study, we restrict the number of wires per channel to 10. This allows us to use faster dual rail (domino) logic for arrays, but not for trees, which are restricted to single rail CMOS type gates.
In terms of attractive topology alternatives, we distinguish among the following:
Double linear array. An array implementation that adds the odd numbered partial product rows separately from the even rows. The two outputs are then combined by a [4, 2] compressor.
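The difference between linear and logarithmic counter depth can be made concrete with a short sketch that counts (3, 2) counter levels. The function names are hypothetical, the tree count follows the standard Wallace reduction (each level turns every group of three rows into two), and the array count assumes a simple serial chain of (3, 2) counters; neither models wiring, placement, or the dual rail versus single rail circuit choice.

```python
def wallace_levels(n):
    """Number of (3, 2) counter levels needed to reduce n rows to 2."""
    levels = 0
    while n > 2:
        n = 2 * (n // 3) + n % 3  # each full group of 3 rows becomes 2
        levels += 1
    return levels


def linear_array_levels(n):
    """Counter delays in a serial chain that absorbs one new row per
    level (assumed model: n - 2 levels to reach two output rows)."""
    return max(n - 2, 0)


for n in (24, 53, 113):
    print(n, linear_array_levels(n), wallace_levels(n))
```

For 53 non-Booth partial products, this gives 51 counter delays for the serial chain against 9 levels for the tree, which is the depth advantage that the tree topologies trade against their higher wiring demand.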
Higher order array [1] . An extension of the above to multiple (> 2) subarrays. Summation by sequence of [4, 2] compressors is limited by number of wiring tracks per channel.
Binary tree [11], [19]. A tree implementation based on [4, 2] compressors. For n partial products, it requires $\log_2 n$ wiring tracks per channel.
Balanced delay tree [20] . A tree constructed by connecting progressively larger serial chains of (3, 2) counters. A connection to the tree is made when the delay of the chain equals the delay of the tree.
Overturned staircase tree [9] . A recursively designed tree that uses a minimum number of serial (3, 2) counters (same number of counters as a Wallace tree, but with a predictable number of wires per channel).
The encoding scheme that minimizes the latency for all the topologies for double precision multipliers is shown in Table 3 . A tie is represented in this table whenever the values for the latencies are within 3 percent of each other. The double linear array, whose latency is directly proportional to the number of partial products, has both Booth 3 and Redundant Booth 3 providing the minimum latency. The extra latency for Redundant Booth 3 due to the larger number of partial products is offset by the reduction in latency due to not requiring the generation of the three times multiple. Redundant Booth 3 provides the minimum latency for the higher order array because the latency is related quadratically to the number of partial products. Therefore, the reduction in the number of summands for Booth 3 is not as significant as in the double linear array.
For the remaining tree topologies, Booth 2 provides the minimum latency, with the exception of the balanced delay tree, where there is a tie. The balanced delay tree requires a different number of counter levels for Booth 2 and Redundant Booth 3. The extra latency due to the difference in the 
Area · Time Product
Not all multiplier implementations require minimum latency. For these cases, the areas of the multipliers are also important. The relative areas of the designs are given in Fig. 8. For single precision, all the encoding schemes have approximately the same area. This is because the reduction in the area due to the smaller number of partial products for the Booth encoding schemes is offset by the larger area for the partial product generators and, in the case of Booth 3, the three times adder.
As the significand's size increases, the reduction in the area due to Booth encoding is larger. The reduction in area for Booth 2 is 25 percent compared to non-Booth for quad precision. The reduction is less than one half because of three factors: 1) Booth 2 reduces the number of partial products compared to non-Booth by only about one half. 2) The partial product generators for Booth 2 are larger than the partial product generator for non-Booth. 3) Extra logic is needed to select the possible values for the partial products (Booth encoders).
Booth 3 does not reduce the area to a third of what is required by non-Booth for the same reasons as discussed above. In addition, Booth 3 requires a dedicated adder that calculates the values of the three times multiple.
In cases where the minimum latency is not the only criterion for designing the multipliers, Table 4 summarizes the choice of encoding scheme that minimizes the latency × area product.
For single precision, both the latency and area of non-Booth and Booth 2 encoding are approximately the same. As a result, the delay × area product is the same for both. Non-Booth encoding is recommended in this case due to its simplicity of implementation. For other precisions, Booth 3 encoded multipliers are 10-15 percent smaller and 5-20 percent slower than Booth 2 encoded multipliers.
Pipelining
Pipelined multipliers pose the added problem of partitioning the multiplier into stages, each stage suited to the processor cycle. Processors have diverse cycle time constraints. Some processors are designed to have fewer long cycles for instruction execution; others use a larger number of fast cycles. In general, if we measure cycle time by the equivalent delay of a serial chain of fan-out of four (FO4) CMOS An algorithmically designed Wallace tree offers the most flexibility in tree partitioning, but also is the most complex in design effort. See [1] for some additional comments on tree partitioning.
GENERAL REMARKS AND CONCLUSIONS
Using a scalable technology delay model, we have shown the significance of technology parameters, especially feature size, in the selection of implementation algorithm for a multiplier. By extension, we would expect the same to be true for any large functional unit in a processor.
As feature sizes shrink to submicron lengths, wires contribute a larger portion of the total delay of the multiplier. Wire capacitance continues to be the significant contributor to the total delay for the multiplier.
The procedurally generated partial product reduction tree is less affected by wire delay than is the binary tree. Therefore, the use of the procedural design of the partial product reduction tree gives the smallest latency of all the topologies.
Booth 2 encoding provides the minimum latency for almost all the significand sizes and topologies. However, Booth 3 provides the best latency × area. At the smaller feature sizes, non-Booth is less attractive in terms of delay because of its longer wires.
Some of our findings may seem to contradict earlier reports. For example, Oklobdzija et al. [10] determined the superiority of non-Booth, while Santoro [11] showed the value of Binary trees. As these were older studies based on gate delay models, we are in general agreement with their findings given their assumptions. Rapid technology changes should alert the designer to the applicability of any older study conclusions, including our own.
Our study uses a SPICE based scalable delay model whose accuracy is already limited at 0.1µm. It is clear that better delay models suited to deep submicron (below 0.1µm) are required for further, more comprehensive studies.
APPENDIX DESIGN VALUES
This appendix tabulates some of the numerical results that were obtained.
Binary Tree
These are the latencies for the binary tree multiplier that was compared to the algorithmic tree. 
Procedural Tree
These are the latencies for the Procedural tree multiplier that was compared to the binary tree. 
Topology
The latencies for the different topologies used in the paper are given in this section. The latency for the array topologies is for a dual-rail circuit, while the tree topologies use singlerail. This was done under the assumption that the number of tracks/channel is limited to 10. The goal here was not to compare the topologies to each other, but rather to compare the different encoding schemes for each topology. 
Areas
The areas of the different multipliers are presented in this section. The areas are for a nonpipelined multiplier, with a latch at the beginning of the multiplier. The areas are given in terms of the minimum mask resolution λ. The minimum feature size is $f = 2\lambda$; area is $f^2 = 4\lambda^2$.
TABLE 27 AREA FOR BINARY TREE
TABLE 28 AREA FOR PROCEDURAL LAYOUT
TABLE 29 TOPOLOGY AREA
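The relation between λ and the feature size stated in the Areas section allows an area in units of λ² to be converted to physical units. The helper below is a minimal sketch with a hypothetical name, and the area value in the usage line is a placeholder rather than a figure from the tables.

```python
def lambda2_to_um2(area_lambda2, feature_um):
    """Convert an area in lambda^2 to square micrometers.

    Uses f = 2 * lambda, so lambda = f / 2 and one lambda^2 = (f / 2)^2.
    """
    lam = feature_um / 2.0
    return area_lambda2 * lam * lam


# Placeholder area of 1e6 lambda^2 evaluated at two feature sizes.
print(lambda2_to_um2(1e6, 1.0))  # 250000.0 um^2
print(lambda2_to_um2(1e6, 0.2))  # about 10000 um^2
```

Because the tabulated areas are expressed in λ², a design with the same λ² area occupies 25 times less silicon at 0.2µm than at 1.0µm.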
