Abstract - A hybrid radix-4/radix-8 architecture targeted for high bit multipliers is presented as a compromise between the high speed of a radix-4 multiplier architecture and the low power dissipation of a radix-8 multiplier architecture. In this hybrid radix-4/radix-8 multiplier architecture, the performance bottleneck of a radix-8 multiplier, the generation of three times the multiplicand for use in generating the radix-8 partial products, is performed in parallel with the reduction of the radix-4 partial products rather than serially, as in a radix-8 multiplier. This hybrid radix-4/radix-8 multiplier architecture requires 13% less power for a 64 x 64 bit multiplier, and results in only a 9% increase in delay, as compared with a radix-4 implementation. When supply voltage is scaled such that all multipliers exhibit the same delay, the 64 x 64 bit hybrid radix-4/radix-8 multiplier dissipates less power than either the radix-4 or radix-8 multipliers. The hybrid radix-4/radix-8 architecture is therefore appropriate for those applications that must dissipate minimal power and operate at high speeds.
I. Introduction
High speed multipliers are fundamental elements in signal processing and arithmetic based systems. The higher bit widths required of modern multipliers provide the opportunity to explore new architectures which would be impractical for smaller bit width multiplication. Architectures for circuit elements historically were designed to operate at maximum speed, notwithstanding the resulting power dissipation. Recently, greater emphasis has been placed on reducing the power dissipation of important circuit functions while maintaining these high speeds. Therefore, power dissipation as well as circuit speed should be considered at the architectural level.
A leveling off of the power factor, the power dissipated per bit²-Hz, and hence the power efficiency, has recently been observed [1]. This leveling of the power factor is illustrated in Figure 1. This trend leads to the conclusion that to further improve the power efficiency of multipliers, power dissipation must be addressed at the architectural level as well as at the circuit level.
In this paper a hybrid Booth radix-4/radix-8 multiplier architecture is presented as a method to trade off speed and power dissipation in two's complement signed multipliers. The speed and power dissipation characteristics of this new multiplier architecture are compared with those of standard radix-4 and radix-8 based multipliers. The hybrid radix-4/radix-8 architecture presented in this paper is described in Section II. The speed and power dissipation characteristics of the three multiplier architectures are compared in Section III. Finally, some conclusions are drawn in Section IV.
II. Hybrid Radix Architecture
The proposed hybrid radix-4/radix-8 multiplier architecture uses a combination of modified Booth radix-4 and radix-8 encoding [8-10]. The hybrid radix-4/radix-8 architecture mitigates the delay penalty associated with the generation of 3B (see Figure 2) for radix-8 encoding by using the additional parallelism of the radix-4 encoding/reduction.
In this manner the hybrid radix-4/radix-8 multiplier combines the speed advantage of the radix-4 multiplier with the reduced power dissipation of the radix-8 multiplier.
In a radix-8 architecture, the multiplication process is serially dependent upon the time required to generate 3B: while 3B is being generated by a high speed adder, no other processing can take place within the multiplier. This requirement to generate 3B leads to a significant delay penalty, on the order of 10-20%, as compared with a radix-4 architecture (where the partial products may be generated by simple shifting and/or complementing) [11].
0-7803-3073-0/96/$5.00 ©1996 IEEE

In the hybrid radix-4/radix-8 architecture, a subset of the partial products is generated using radix-4 modified Booth encoding. Reduction begins on these radix-4 partial products while 3B is simultaneously being generated by a high speed adder. Upon generating 3B, the remaining partial products are generated using radix-8 encoding, and these partial products are subsequently included within the reduction tree. A Wallace/Dadda structure is assumed for the reduction tree [12, 13]. In this manner, some reduction of the partial products takes place while the high speed adder is generating 3B; therefore, less of a delay penalty is incurred. Utilizing radix-8 encoding for many of the partial products reduces the total number of partial products, thereby reducing the power required to sum the partial products. As described in Section III, three reduction steps take place during the generation of 3B for both the 32 x 32 bit multiplier and the 64 x 64 bit multiplier. A diagram of the hybrid radix-4/radix-8 architecture is shown in Figure 2.
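The mixed encoding can be illustrated with a small functional model. The following is a minimal sketch, not the circuit described in this paper: it recodes a two's complement multiplier into signed Booth digits using radix-8 groups on the low order bits and radix-4 groups on the high order bits, then sums the partial products arithmetically. The function names (`booth_recode`, `multiply`, `needs_adder`) and the constant `HYBRID_32` are hypothetical; the grouping of six radix-8 digits and eight radix-4 digits follows the partial product counts this paper states for the 32 x 32 bit case.

```python
def booth_recode(a: int, width: int, groups: list[int]) -> list[tuple[int, int]]:
    """Recode a two's complement multiplier into signed Booth digits.

    `groups` lists the bits consumed by each digit, least significant first:
    2 for a radix-4 digit (range -2..2), 3 for a radix-8 digit (range -4..4).
    Returns (digit, shift) pairs such that a == sum(digit << shift).
    """
    assert sum(groups) >= width, "groups must cover every multiplier bit"

    def bit(i: int) -> int:
        if i < 0:
            return 0                         # implicit 0 to the right of the LSB
        return (a >> min(i, width - 1)) & 1  # sign-extend above the MSB

    digits, lo = [], 0
    for r in groups:
        # The top bit of a group has negative weight; the bit just below
        # the group is added in (the overlap bit of Booth recoding).
        d = bit(lo - 1) - (bit(lo + r - 1) << (r - 1))
        for j in range(r - 1):
            d += bit(lo + j) << j
        digits.append((d, lo))
        lo += r
    return digits


def multiply(a: int, b: int, width: int, groups: list[int]) -> int:
    """Sum the partial products digit * b * 2**shift."""
    return sum((d * b) << s for d, s in booth_recode(a, width, groups))


def needs_adder(digit: int) -> bool:
    """True when |digit| * B is not a plain shift of B (here, only 3B)."""
    m = abs(digit)
    return (m & (m - 1)) != 0


# Assumed layout: six radix-8 digits on the low bits, eight radix-4 digits above.
HYBRID_32 = [3] * 6 + [2] * 8

if __name__ == "__main__":
    a, b = -1234567, 7654321
    print(multiply(a, b, 32, HYBRID_32) == a * b)
    print([d for d, _ in booth_recode(a, 32, HYBRID_32)])
```

In this model only the radix-8 digits can take the values +3 or -3, which is why the radix-4 partial products can be formed and reduced before 3B is available.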
It is important to note that the delay penalty associated with the generation of 3B cannot be entirely mitigated using this hybrid approach. An additional delay penalty is incurred since not all of the partial products are immediately available when the reduction process is initiated. As Wallace/Dadda reduction trees utilize parallel adder cells to perform the partial product reduction, the more parallel data available to the tree, the more time efficient the reduction steps become. Thus, the availability of only a subset of the partial products at the initiation of the reduction process reduces the efficiency of the early reduction steps.
By delaying the generation of the radix-8 partial products until three reduction steps have been completed, fewer bits in parallel are initially available. Thus, the reduction process is not as time efficient, requiring additional reduction steps as compared with an architecture in which all the partial products are available simultaneously when the reduction process begins. In essence, the parallelism of the reduction tree is reduced in exchange for operating the reduction tree in parallel with the 3B adder.
By selecting the number of partial products generated by radix-4 and radix-8 encoding, it is possible to limit the number of reduction steps to just one more step than is required by a radix-4 multiplier (assuming 32 x 32 bit and 64 x 64 bit multipliers). As the radix-8 partial products are not immediately available, it is convenient to use radix-8 on the lower order partial products, as the low order bits are not used in the early reduction steps. A 32 x 32 bit hybrid radix-4/radix-8 multiplier implementation has eight partial products generated by the radix-4 encoding and six partial products generated by the radix-8 encoding. In a 64 x 64 bit hybrid radix-4/radix-8 multiplier, ten partial products are generated by the radix-4 encoding and 15 by the radix-8 encoding.
For this 64 x 64 bit hybrid radix-4/radix-8 implementation, the required nine reduction steps are as follows: 11 → ... Note that by using the one's complement plus the carry-in to form the two's complement, the number of bits at the start of the reduction process is one bit greater than the number of partial products. This additional bit is the carry-in of the highest order partial product. Thus, the hybrid reduction begins at eleven bits, although there are only ten partial products. However, when the radix-8 partial products become available after the third reduction step, the carry-in from the highest order radix-8 partial product does not align with any of the resultant bits from the first three reduction steps. Hence, the fourth reduction step begins with the four resultant bits plus the 15 radix-8 partial products, rather than four bits plus 16 partial products.
With a 32 x 32 bit multiplier, seven steps are required for partial product reduction in a hybrid radix-4/radix-8 implementation, as compared with six for a radix-4 implementation and five for a radix-8 implementation. The reduction steps for the 32 x 32 bit hybrid radix-4/radix-8 implementation are: 9 → 6 → 4 → 3, 3 + 6 → 6 → 4 → 3 → 2.
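The step counts quoted above can be checked with a simplified model. The sketch below is an illustration under stated assumptions rather than the authors' tool: it tracks only the maximum column height, assumes each reduction step lowers that height to the next smaller Dadda target (2, 3, 4, 6, 9, 13, 19, ...), and adds the late radix-8 rows after a fixed number of steps. The names `dadda_targets`, `reduce_once`, and `hybrid_schedule` are hypothetical.

```python
def dadda_targets(limit: int) -> list[int]:
    """Dadda height sequence 2, 3, 4, 6, 9, 13, ... up to at least `limit`."""
    seq = [2]
    while seq[-1] < limit:
        seq.append(seq[-1] * 3 // 2)
    return seq


def reduce_once(height: int) -> int:
    """Height after one step: the largest Dadda target below `height`."""
    if height <= 2:
        return height
    return max(t for t in dadda_targets(height) if t < height)


def hybrid_schedule(initial: int, late: int, late_after: int) -> list[int]:
    """Heights at the start of each step, ending with the final height of 2.

    `initial` rows are available immediately; `late` more rows join the tree
    once `late_after` steps have completed (assumed to be at least 1 and to
    occur before the tree has finished).
    """
    heights, h, steps = [initial], initial, 0
    while h > 2:
        h = reduce_once(h)
        steps += 1
        if steps == late_after:
            h += late          # the radix-8 partial products arrive here
        heights.append(h)
    return heights


if __name__ == "__main__":
    for name, args in (("32 x 32", (9, 6, 3)), ("64 x 64", (11, 15, 3))):
        heights = hybrid_schedule(*args)
        print(name, heights, "steps:", len(heights) - 1)
```

Under these assumptions the 32 x 32 bit case (nine bits initially, six radix-8 partial products after three steps) takes seven steps and the 64 x 64 bit case (eleven bits initially, 15 radix-8 partial products after three steps) takes nine, in agreement with the counts given in the text. The intermediate heights the model reports for the 64 x 64 bit case are an output of the Dadda assumption and are not taken from the paper.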
III. Performance
The propagation delay, transistor count, and power dissipation characteristics of the 32 x 32 bit and the 64 x 64 bit multipliers are presented in this section. In subsection A, the delay of the new hybrid radix-4/radix-8 multiplier architecture is compared with the delay of the radix-4 and radix-8 multiplier architectures. In subsection B, the num- Table I. The radix-4 multiplier exhibits the least delay, and the radix-8 multiplier exhibits the most delay. The hybrid radix-4/radix-8 delay falls between those of the radix-4 and radix-8 multipliers.
Note that the delays shown in Table I do not include the effects of interconnect impedances. as is the case in these multipliers. The transistor counts for the 32 x 32 bit and 64 x 64 bit implementations of each of the three architectures are compared in Table II. The radix-8 implementations require the fewest transistors, while the radix-4 implementations require the most transistors. The number of transistors required to implement the hybrid radix-4/radix-8 multipliers falls between those of the radix-4 and radix-8 multipliers. Table III.
Note that the encoder cells in the 64 x 64 bit and 32 x 32 bit multipliers are identical; however, the loading differs by a factor of approximately two. This distinction accounts for the differences in encoder power dissipation between the two multiplier configurations. In an n x n bit multiplier, a radix-4 encoder drives n+1 decoders, while a radix-8 encoder drives n+2 decoders. Tapered buffers [la have been included between the encoders and decoders to drive this large fanout, and the power dissipation of these buffers has been included in the total power dissipation of the encoder listed in Table III. As with the delay values presented in Table I, these power dissipation figures do not account for interconnect impedances.
Also note that although the sign generation circuitry is identical for both the radix-4 and radix-8 implementations, the power dissipation of this circuit is not identical for both applications. This disparity between the radix-4 and the radix-8 power dissipation exists because the control signal input CO does not toggle as frequently in a radix-8 implementation as it does in a radix-4 implementation. This disparity in toggling frequency leads to lower dynamic power dissipation in the sign bit generation circuit. As the number of sign extension bits varies for each partial product, the power dissipation of the sign generation circuitry (as shown in Table III) accounts for only the loading of the first stage of the tapered buffers which drive the sign extension bits. Since the tapered buffer is customized for the specific loading of each partial product, the power dissipation of these buffers is included in the architecture-specific power dissipation totals presented in Tables IV and V.
As shown in Tables IV and V, a radix-8 multiplier dissipates less power than a radix-4 multiplier. The hybrid radix-4/radix-8 architecture dissipates power at a level between that of the radix-4 and radix-8 multipliers. Thus, the hybrid radix-4/radix-8 multiplier architecture is a useful architecture for those applications which require low power while operating at speeds greater than that of a full radix-8 multiplier. Radix-8 multiplication is appropriate for those ultra-low power systems in which added delay can be tolerated.
D. The Effects of Voltage Scaling on Performance
Voltage scaling, reducing the power supply voltage, may be applied to higher speed multipliers to reduce the power dissipation of these circuits, while simultaneously increasing delay. The delay of the multipliers depends on the power supply voltage, VDD, as shown in (1), where VT represents the transistor threshold voltage, and the power dissipation is proportional to the square of the power supply voltage as shown in (2) [17].
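Equations (1) and (2) do not appear in this text. A first-order form that is widely used for CMOS gate delay and dynamic power, and that is consistent with the description above, is given here as an assumption rather than as the authors' exact expressions:

```latex
\begin{align}
  T_{\text{delay}} &\propto \frac{V_{DD}}{(V_{DD} - V_T)^2} \tag{1} \\
  P &\propto V_{DD}^2 \tag{2}
\end{align}
```

Under this form, lowering VDD toward VT increases the delay sharply while the power dissipation falls quadratically.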
The power dissipation of the radix-4, hybrid radix-4/radix-8, and radix-8 multipliers after voltage scaling is compared in Table VI. Note that the scaled voltage levels are referenced to the radix-8 multiplier operating at 5 volts.
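The scaling procedure can be sketched numerically. The following is a minimal illustration, assuming the first-order model in which delay varies as VDD / (VDD - VT)^2 and dynamic power varies as VDD^2; the threshold voltage of 0.8 V, the example speedup, and the function names are hypothetical and are not taken from this paper or its tables.

```python
import math


def relative_delay(vdd: float, v_t: float = 0.8) -> float:
    """Delay in arbitrary units under the assumed model VDD / (VDD - VT)^2."""
    return vdd / (vdd - v_t) ** 2


def scaled_supply(speedup: float, v_ref: float = 5.0, v_t: float = 0.8) -> float:
    """Supply at which a faster multiplier slows to the reference delay.

    `speedup` is (reference delay) / (this multiplier's delay), both at v_ref.
    Solves relative_delay(v) == speedup * relative_delay(v_ref) for v > v_t.
    """
    k = speedup * relative_delay(v_ref, v_t)
    # k * (v - v_t)^2 = v is a quadratic in v; the larger root lies above v_t.
    return (2 * k * v_t + 1 + math.sqrt(4 * k * v_t + 1)) / (2 * k)


def power_ratio(vdd: float, v_ref: float = 5.0) -> float:
    """Dynamic power relative to operation at v_ref, assuming P ~ VDD^2."""
    return (vdd / v_ref) ** 2


if __name__ == "__main__":
    # Illustrative only: a multiplier assumed to be 10% faster than the
    # 5 V reference is slowed until the two delays match.
    v = scaled_supply(1.10)
    print(f"scaled supply: {v:.3f} V, power ratio: {power_ratio(v):.3f}")
```

The power ratio returned here applies to a single multiplier relative to its own 5 volt operation; comparing architectures additionally requires the unscaled power dissipation of each multiplier.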
For shorter bit widths such as exemplified by a 32 x 32 bit multiplier, the delay and power dissipation overhead due to the additional 3B adder and more complex encoding is not outweighed by the reduction in delay and power dissipation associated with the partial product summation. In this case, the simpler radix-4 encoded multiplier provides the lowest power dissipation at a given delay.
However, at higher bit widths, as exemplified by the 64 x 64 bit multipliers, the radix-4 and radix-8 multipliers dissipate approximately equivalent power at a given delay, and both dissipate more power than the hybrid radix-4/radix-8 multiplier.
IV. Conclusions
As higher bit widths and lower power become important design issues in multipliers, the opportunity to develop new architectures to meet these requirements arises. A new hybrid radix-4/radix-8 multiplier architecture is presented in this paper that is both low power and high speed; this architecture provides a trade-off between the high speed of a radix-4 multiplier architecture and the low power dissipation of a radix-8 multiplier architecture. In this hybrid radix-4/radix-8 multiplier architecture, the performance bottleneck of a radix-8 multiplier (the generation of 3B for the radix-8 partial product generation) is performed in parallel with the reduction of the radix-4 partial products rather than serially, as in a radix-8 multiplier. Thus, the hybrid radix-4/radix-8 multiplier accomplishes a portion of the partial product reduction while a high speed adder is generating 3B. This strategy mitigates part of the delay penalty incurred by the radix-8 multiplier in generating 3B. The hybrid radix-4/radix-8 multiplier architecture dissipates 13% less power in a 64 x 64 bit multiplier with only a 9% increase in delay, as compared to a radix-4 implementation. When the supply voltage of the 64 x 64 bit multipliers is scaled such that the radix-4, radix-8, and hybrid radix-4/radix-8 multipliers exhibit the same delay, the hybrid radix-4/radix-8 multiplier dissipates the least power. The hybrid radix-4/radix-8 architecture therefore provides a trade-off between high speed and low power for application to those systems which require both high speed and low power signed multiplication.
