Several commercial processors have adopted the radix-8 multiplier architecture to increase speed, since it reduces the number of partial products. Radix-8 encoding shortens the digit length of the signed-digit representation. Its performance bottleneck is the generation of the term 3X, also referred to as the hard multiple. This term is usually computed by a shift-and-add operation, 3X=2X+X, in a high-speed adder. In a 2X+X addition, adjacent full adders share the same input signal. This property permits simpler algebraic expressions for the 3X operation than for a conventional addition. This paper shows that the 3X operation can be expressed in terms of two signals, $H_i$ and $K_i$, functionally equivalent to two carries. $H_i$ and $K_i$ are computed in parallel using architectures that lead to an area- and speed-efficient implementation. Standard-cell implementations of conventional adders have been compared with the proposed circuits based on these $H_i$ and $K_i$ signals. As a result, the delay of the proposed serial scheme is reduced by roughly 67% at no additional cost in area, the delay and area of the carry look-ahead scheme are reduced by 20% and 17%, and those of the parallel prefix scheme are reduced by 26% and 46%, respectively.
INTRODUCTION
The binary number system based on two's complement representation is commonly used in arithmetic units [1,2]. However, other number systems are very useful for certain applications. In 1961, Avizienis [3] defined a class of redundant signed-digit number systems with a symmetric digit set for a radix-r positional number system. A specific case of this representation used in high-speed arithmetic is the minimum redundancy signed-digit representation, where the digits are of the form $d_j \in \{\overline{r/2}, \overline{r/2-1}, \ldots, \bar{1}, 0, 1, \ldots, r/2-1, r/2\}$ with $r \geq 2$ and $r = 2^p$, where $\bar{r} = -r$. For example, for radix-2 this is the digit set $\{\bar{1}, 0, 1\}$, for radix-4 it is $\{\bar{2}, \bar{1}, 0, 1, 2\}$ and for radix-8 it is $\{\bar{4}, \bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3, 4\}$. Thus, an n-bit two's complement number $X=(x_0, x_1, \ldots, x_{n-1})$ can be expressed in a radix-r minimum redundancy signed-digit representation $D=(d_0, d_1, \ldots, d_{n'-1})$. Table 1 shows the word length $n'$ of the signed-digit representation for different radices and values of n. Note that a higher radix leads to fewer digits.
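The recoding described above can be illustrated with a short Python sketch. It assumes the standard Booth recoding for radix $2^p$, which the text does not spell out, and the function names `signed_digits` and `value` are hypothetical. Under this assumption the digit count is ceil(n/p); this is a property of the sketch and is not taken from Table 1.

```python
def signed_digits(x: int, n: int, p: int) -> list[int]:
    """Recode the n-bit two's complement integer x into radix-2**p signed digits.

    Assumption: standard Booth recoding. Digits are returned least
    significant first and each lies in [-2**(p-1), 2**(p-1)].
    """
    if not -(1 << (n - 1)) <= x < (1 << (n - 1)):
        raise ValueError("x does not fit in n bits")

    def bit(k: int) -> int:
        # Python's >> is arithmetic, so bits above n-1 repeat the sign bit.
        return 0 if k < 0 else (x >> k) & 1

    n_digits = -(-n // p)  # ceil(n / p)
    digits = []
    for j in range(n_digits):
        # Weighted low bits of the group, the negatively weighted top bit,
        # and the top bit of the previous group (overlap bit).
        low = sum(bit(p * j + k) << k for k in range(p - 1))
        top = bit(p * j + p - 1) << (p - 1)
        digits.append(low - top + bit(p * j - 1))
    return digits


def value(digits: list[int], p: int) -> int:
    """Evaluate a signed-digit vector (least significant digit first)."""
    return sum(d * (1 << (p * j)) for j, d in enumerate(digits))


if __name__ == "__main__":
    for p in (1, 2, 3):  # radix 2, 4 and 8
        d = signed_digits(-100, 8, p)
        print(f"radix {1 << p}: {len(d)} digits, {d}, value {value(d, p)}")
```

For an 8-bit operand the sketch produces 8, 4 and 3 digits for radix 2, 4 and 8 respectively, which matches the observation that a higher radix needs fewer digits.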
Multiplication is perhaps the arithmetic operation where the radix-r minimum redundancy signed-digit representation has been most widely used. It involves two basic operations: generation of partial products and their accumulation. One way to speed up multiplication is to reduce the number of partial products by using radix-r encoding. The modified Booth's algorithm [4] is the most popular approach for implementing fast multipliers using parallel encoding. This scheme requires
Table 1. Word length ($n'$) of the minimum redundancy signed-digit representation of an n-bit binary number, for different radices and values of n.
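As a rough illustration of how radix-8 encoding reduces the number of partial products, the following Python sketch multiplies two integers by recoding the multiplier into digits in {-4, ..., 4}. It assumes standard Booth radix-8 recoding, and the names `booth8_digits` and `multiply_radix8` are hypothetical. Every multiple of X is obtained by shifting or negation except 3X, which needs an addition.

```python
def booth8_digits(y: int, n: int) -> list[int]:
    """Radix-8 signed digits of the n-bit two's complement integer y.

    Assumption: standard Booth recoding over overlapping 4-bit groups.
    Digits are returned least significant first.
    """
    def bit(k: int) -> int:
        # Arithmetic shift gives sign extension above bit n-1.
        return 0 if k < 0 else (y >> k) & 1

    digits = []
    for j in range(-(-n // 3)):  # ceil(n / 3) digits
        b = 3 * j
        digits.append(-4 * bit(b + 2) + 2 * bit(b + 1) + bit(b) + bit(b - 1))
    return digits


def multiply_radix8(x: int, y: int, n: int) -> tuple[int, int]:
    """Return (x * y, number of partial products) for n-bit operands."""
    three_x = (x << 1) + x  # the hard multiple: the only one needing an adder
    multiples = {0: 0, 1: x, 2: x << 1, 3: three_x, 4: x << 2}

    digits = booth8_digits(y, n)
    acc = 0
    for j, d in enumerate(digits):
        pp = multiples[abs(d)]
        if d < 0:
            pp = -pp
        acc += pp * (1 << (3 * j))  # weight 8**j
    return acc, len(digits)


if __name__ == "__main__":
    product, count = multiply_radix8(1234, -567, 16)
    print(product, count)  # 16-bit multiplier -> 6 partial products
```

In this sketch a 16-bit multiplier yields 6 partial products, whereas a radix-4 recoding of the same operand would yield 8.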
The bottleneck of the radix-8 multiplier architecture is the generation of 3X. This term must be computed, generally by an adder, before the partial products are generated, which increases the latency of the multiplier. In pipelined radix-8 multipliers [5, 11], 3X is generated in the first stage in parallel with Booth-3 encoding; the interested reader can find a detailed description of the radix-8 CMOS S/390 pipelined multiplier in [14, 15]. A solution for non-pipelined multipliers is the hybrid radix-4/radix-8 architecture presented in [16]. In this scheme, radix-4 and radix-8 partial products are generated in parallel, reducing power by 13% at the cost of a 9% increase in delay compared with a radix-4 implementation. Another idea, based on partially redundant partial products with a bias constant, has been proposed in [17]. It uses a series of short adders with no carry propagation, and one compensation constant must be added. However, a design tradeoff must be resolved. A radix-8 encoding solution based on redundant logic that eliminates the 3X computation is presented in [18], although parameters such as speed or area are not given or compared. In other special architectures, such as the FIR filter design described in [13], 3X is pre-computed off the critical path, resulting in a fast and low-power multiplier.
This paper presents simplified algebraic expressions for the 3X operation, which yield circuits that are more efficient in area and speed than implementations based on conventional adders. To this end, two signals, $H_i$ and $K_i$, functionally equivalent to two carries, are introduced. These signals are computed in parallel, which shortens the critical path of the circuit and reduces the hardware required. Three architectures based on different schemes (serial, carry look-ahead (CLA) and parallel prefix) are proposed and compared with conventional ones using a standard-cell CMOS library. The results show a delay reduction of 67% for the serial scheme, 20% for the CLA scheme and 26% for the parallel prefix scheme. Significant area reductions are also achieved for the CLA and parallel prefix schemes.
SERIAL ADDITION
Let $X=(x_0, x_1, \ldots, x_{n-1})$ be an n-bit binary number in two's complement and $S=(S_0, S_1, \ldots, S_{n+1})$ the (n+2)-bit result of the 3X operation. A trivial way to generate 3X is to add 2X+X as shown in Figure 1. In this circuit, $S_0=x_0$, $S_n=C_{n-1}$, and the sign of S is given directly by $x_{n-1}$ ($S_{n+1}=x_{n-1}$), so sign extension is not necessary. The 2X+X operation means that adjacent full adders (FA) share the same input variable. This characteristic allows the algebraic expressions of the adders to be simplified in order to obtain area- and speed-efficient circuits.
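The conventional 2X+X ripple scheme can be checked with a small bit-level Python sketch. It uses the usual full-adder sum and carry, with $x_i$ and $x_{i-1}$ as the two operand bits of stage i, and it assumes $C_0=0$ because bit 0 adds $x_0$ to the zero shifted in by 2X. The function name `triple_ripple` is hypothetical. This models only the conventional adder, not the $H_i$/$K_i$ circuits.

```python
def triple_ripple(x: int, n: int) -> int:
    """Compute 3*x for an n-bit two's complement x with a 2X+X ripple adder.

    Bit 0 is the least significant bit. The result is the (n+2)-bit
    two's complement word S, returned as a Python integer.
    """
    bits = [(x >> i) & 1 for i in range(n)]  # x_0 ... x_{n-1}
    s = [0] * (n + 2)

    s[0] = bits[0]  # S_0 = x_0
    carry = 0       # assumption: C_0 = 0, since stage 0 adds x_0 + 0
    for i in range(1, n):
        a, b = bits[i], bits[i - 1]  # stage i sees x_i (from X) and x_{i-1} (from 2X)
        s[i] = a ^ b ^ carry                  # S_i = x_i xor x_{i-1} xor C_{i-1}
        carry = (a & b) | ((a | b) & carry)   # C_i
    s[n] = carry            # S_n = C_{n-1}
    s[n + 1] = bits[n - 1]  # S_{n+1} = x_{n-1}, the sign

    value = sum(bit << i for i, bit in enumerate(s))
    if s[n + 1]:
        value -= 1 << (n + 2)  # interpret as (n+2)-bit two's complement
    return value


if __name__ == "__main__":
    n = 8
    assert all(triple_ripple(x, n) == 3 * x for x in range(-128, 128))
    print(triple_ripple(-128, n), triple_ripple(127, n))
```

The exhaustive check for n=8 confirms that no sign extension is needed when the two top bits are taken as $C_{n-1}$ and $x_{n-1}$.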
In the full adder (FA) of Figure 1, the sum ($S_i$) and carry ($C_i$) outputs are defined as $S_i = x_i \oplus x_{i-1} \oplus C_{i-1}$ and $C_i = x_i x_{i-1} + (x_i + x_{i-1}) C_{i-1}$, respectively. Expanding the expressions of this circuit and then grouping terms, it can be verified that $C_i$ can be defined as:
Thus, $C_i$ can be expressed in terms of two variables, $H_i$ and $K_i$, defined by means of the following recursive relations:
where $i=1,2,\ldots,n-1$, $H_0=x_0$ and $K_0=0$. These signals have the following properties:
From (1)-(5), $C_i$ can be expressed in terms of $H_i$ and $K_i$ in the following way:
The output sum $S_i$ can be obtained directly from $H_i$ and $K_i$ without generating $C_i$. We get:
for $i=1,2,\ldots,n-1$ and with $K_{-1}=H_{-1}=0$. Eq. (7) can also be transformed by applying (5) into:
Figure 2 shows the 3X addition implementation derived from equations (3), (4) and (7) for n=12. Note that the $H_i$ and $K_i$ signals propagate in a parallel ripple configuration through the NOR gates, with an asymptotic delay of O(n). A more efficient implementation of this circuit can be obtained by taking advantage of the properties of $H_i$ and $K_i$. Figure 3 shows a new implementation for n=12 using the expressions derived from Eq. (7) indicated below:
for $i=0,2,4,\ldots$ This circuit simultaneously generates the $S_i$ and $S_{i+1}$ outputs from the $H_{i-1}$ and $K_{i-1}$ signals and propagates these signals in parallel by means of OR-NAND and AND-NOR gates.
CARRY LOOK-AHEAD ADDITION
Adders based on the carry look-ahead principle remain dominant, since the carry delay can be reduced by calculating the carries of each stage in parallel. The expressions for $H_i$ and $K_i$ defined in Eq. (3) and (4) have a form similar to those used in conventional carry look-ahead circuits. For example, $H_8$ and $K_8$ are defined as
The propagate and generate signals of conventional adders are replaced in Eq. (10) 
with $H_2 = gh_0$. The number of these signals is reduced to roughly n/4, compared with n in a conventional adder. For example, for j=4 we get:
$$H_{18} = gh_4 + ph_4(gh_3 + ph_3(gh_2 + ph_2(gh_1 + ph_1\,gh_0)))$$
In a similar way, let $gk_j$ be the generate signal and $pk_j$ the propagate signal of $K_{4j+2}$. These signals are defined by
The following recurrence relation is established:
$$K_{4j+2} = gk_j + pk_j\,K_{4(j-1)+2}$$
with $K_2 = gk_0$. For example, for j=4 we get:
$$K_{18} = gk_4 + pk_4(gk_3 + pk_3(gk_2 + pk_2(gk_1 + pk_1\,gk_0)))$$
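The look-ahead idea behind the j=4 example can be sketched in Python as follows. The generate and propagate bits are treated as arbitrary inputs because their definitions in terms of the $x_i$ are not reproduced here. The functions `ripple`, `k18_nested` and `k18_flat` are hypothetical names. The sketch only shows that the serial recurrence, the nested expression and the flattened sum-of-products form agree.

```python
from itertools import product


def ripple(g: list[int], p: list[int]) -> list[int]:
    """Serial recurrence: T_0 = g[0], T_j = g[j] + p[j] * T_{j-1}."""
    t = g[0]
    out = [t]
    for j in range(1, len(g)):
        t = g[j] | (p[j] & t)
        out.append(t)
    return out  # out[j] plays the role of K_{4j+2}


def k18_nested(g: list[int], p: list[int]) -> int:
    """Nested form of the j = 4 example."""
    return g[4] | (p[4] & (g[3] | (p[3] & (g[2] | (p[2] & (g[1] | (p[1] & g[0])))))))


def k18_flat(g: list[int], p: list[int]) -> int:
    """Two-level (sum-of-products) form, as a look-ahead module would evaluate it."""
    return (
        g[4]
        | p[4] & g[3]
        | p[4] & p[3] & g[2]
        | p[4] & p[3] & p[2] & g[1]
        | p[4] & p[3] & p[2] & p[1] & g[0]
    )


if __name__ == "__main__":
    # Exhaustive check over all generate/propagate combinations (p[0] is unused).
    for g in product((0, 1), repeat=5):
        for p_tail in product((0, 1), repeat=4):
            p = [0, *p_tail]
            assert ripple(list(g), p)[4] == k18_nested(g, p) == k18_flat(g, p)
    print("all forms agree")
```

The flattened form trades wider gates for a constant number of logic levels, which is the property a look-ahead module exploits.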
Note that the definitions of $gh_j$, $ph_j$, $gk_j$ and $pk_j$ yield only one $H_{4j+2}$ and one $K_{4j+2}$ signal for every four input signals, but this has the advantage of reducing the number of levels in a look-ahead scheme. Figure 4 shows the structure of a 3X implementation for n=68 using two levels of 4-bit CLA modules. $H_{4j+2}$ and $K_{4j+2}$ are computed in parallel through the CLA-I/II and CLA-III/IV modules, respectively. CLA-I implements the following expressions:
Similarly, for the K signals, CLA-III constitutes the first level of computation, defined by the following expression:
Figure 5.b shows the schematic of CLA-III. CLA-IV implements the switching functions that produce $K_{18}$, $K_{34}$, $K_{50}$ and $K_{66}$ from $K_2$ and the generate and propagate signals with indices 18, 34, 50 and 66.
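A two-level look-ahead over the K signals for n=68 might be modelled as in the following Python sketch. It is an illustration under assumptions, not the CLA-III/CLA-IV logic itself: the sixteen pairs $(gk_j, pk_j)$, j=1..16, are grouped four at a time into block generate/propagate signals, and a second level combines them with $K_2 = gk_0$. The names `block_gp`, `k_two_level` and `k_ripple` are hypothetical, and the generate/propagate bits are random inputs.

```python
import random


def block_gp(g: list[int], p: list[int]) -> tuple[int, int]:
    """Group generate/propagate of four (g, p) pairs, lowest index first."""
    G = g[3] | p[3] & g[2] | p[3] & p[2] & g[1] | p[3] & p[2] & p[1] & g[0]
    P = p[3] & p[2] & p[1] & p[0]
    return G, P


def k_two_level(gk: list[int], pk: list[int]) -> dict[int, int]:
    """gk, pk are indexed by j = 0..16 (pk[0] is unused). Returns K_18..K_66."""
    k2 = gk[0]
    # First level: one block per group of four pairs (j = 1..4, 5..8, 9..12, 13..16).
    (g18, p18), (g34, p34), (g50, p50), (g66, p66) = (
        block_gp(gk[b:b + 4], pk[b:b + 4]) for b in (1, 5, 9, 13)
    )
    # Second level: flattened look-ahead over the block signals.
    k18 = g18 | p18 & k2
    k34 = g34 | p34 & g18 | p34 & p18 & k2
    k50 = g50 | p50 & g34 | p50 & p34 & g18 | p50 & p34 & p18 & k2
    k66 = (g66 | p66 & g50 | p66 & p50 & g34
           | p66 & p50 & p34 & g18 | p66 & p50 & p34 & p18 & k2)
    return {18: k18, 34: k34, 50: k50, 66: k66}


def k_ripple(gk: list[int], pk: list[int]) -> dict[int, int]:
    """Reference: serial recurrence K_{4j+2} = gk_j + pk_j * K_{4(j-1)+2}."""
    k = gk[0]
    out = {2: k}
    for j in range(1, len(gk)):
        k = gk[j] | (pk[j] & k)
        out[4 * j + 2] = k
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(2000):
        gk = [rng.randint(0, 1) for _ in range(17)]
        pk = [rng.randint(0, 1) for _ in range(17)]
        ref = k_ripple(gk, pk)
        assert k_two_level(gk, pk) == {i: ref[i] for i in (18, 34, 50, 66)}
    print("two-level look-ahead matches the serial recurrence")
```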
PARALLEL PREFIX ADDITION
The associative property of the well-known concatenation operator "○" introduced by Brent and Kung [19,20] for prefix adders allows the ripple configuration to be transformed into a parallel binary tree structure for high-speed addition. As a result, these adders have a structure that is well suited to VLSI. The similarity between the expressions for conventional adders and the expressions for $H_{4j+2}$ and $K_{4j+2}$ described in Eq. (11)-(17) allows this operator to be applied to these signals. The operator ○ associated with $H_{4j+2}$ can be defined as
$$(gh, ph) \circ (gh', ph') = (gh + ph\,gh',\; ph\,ph') \qquad (26)$$
Fig. 7. Parallel prefix scheme for n=18.
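The effect of the associativity of ○ can be illustrated with the following Python sketch. It implements the operator of Eq. (26) on (generate, propagate) pairs and evaluates all prefixes with a Kogge-Stone-style doubling network. That network is chosen here for brevity and is an assumption; it is not necessarily the one in Fig. 7. The names `op`, `prefix_scan` and `serial` are hypothetical.

```python
from itertools import product

Pair = tuple[int, int]  # (generate, propagate)


def op(hi: Pair, lo: Pair) -> Pair:
    """Eq. (26): (g, p) o (g', p') = (g + p g', p p'); hi is the more significant pair."""
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)


def prefix_scan(pairs: list[Pair]) -> list[Pair]:
    """All prefixes pairs[j] o ... o pairs[0] in ceil(log2(len)) doubling steps."""
    cur = list(pairs)
    d = 1
    while d < len(cur):
        # Every position combines with the result d places below it, in parallel.
        cur = [op(cur[j], cur[j - d]) if j >= d else cur[j] for j in range(len(cur))]
        d *= 2
    return cur


def serial(pairs: list[Pair]) -> list[int]:
    """Reference ripple recurrence: H_2 = g_0, then H = g_j + p_j * H_previous."""
    h = pairs[0][0]
    out = [h]
    for g, p in pairs[1:]:
        h = g | (p & h)
        out.append(h)
    return out


if __name__ == "__main__":
    bits = (0, 1)
    # Associativity of the operator, checked exhaustively.
    for a, b, c in product(product(bits, repeat=2), repeat=3):
        assert op(op(a, b), c) == op(a, op(b, c))
    # The generate component of each prefix equals the serial recurrence.
    for flat in product(bits, repeat=10):
        pairs = [(flat[2 * j], flat[2 * j + 1]) for j in range(5)]
        assert [g for g, _ in prefix_scan(pairs)] == serial(pairs)
    print("prefix scan matches the serial recurrence")
```

Because the operator is associative, the order in which adjacent groups are combined does not affect the result, which is what allows the O(n) ripple chain to be replaced by a tree of logarithmic depth.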
