Introduction
Bit-serial arithmetic is often used in parallel systems with high connectivity to reduce the wiring down to a reasonable level. When multiplications are required, a typical choice is a serial-parallel multiplier. In this device, one factor is stored in parallel, while the other is entered serially. However, this scheme is not always possible, as, for instance, when both factors are input serially at the same time. In such cases a multiplier with two serial inputs is needed.
An example of serial input and output multiplier was presented by R. F. Lyon 1] . Its salient feature is a high throughput obtained at the expenses of a truncated output. Fullprecision modular serial multipliers for unsigned numbers were introduced by H. J. Sips 2] and by N. R. Strader and V. T. Rhyne 3] . In a similar paper, R. Gnanasekaran 4] presented the rst multiplication scheme for two's complement numbers. It directly takes into account the negative weight of the most signi cant bit in the two's complement representation. It results in a overcomplicated design which requires the knowledge of the cycle when the sign bit is presented. Moreover, a latency of one clock cycle is required before the output is produced. L. Dadda 5] independently extended the original scheme to perform a signed multiplication. Two solutions are presented. First, the sign-extension of the input is used to produce a result which is correct up to the 2N th cycle, where the N th bit represents the original sign-bit before sign-extension. Second, the negative weight of the most signi cant bit is directly accounted for. Anyway, a single clock-cycle latency is still present. Furthermore, the modularity issue is not directly addressed.
Finally, Rhyne and Strader 6] presented a further multiplication scheme using Booth recoding and called Signed Canonical Bit-Sequential multiplier (SCBS multiplier). It is modular but far more complicated than the basic schemes derived from the row & diagonal algorithm. Dadda 7] pointed out the unnecessary complexity of this scheme and once again presented his multiplier based on sign-extension. The same minor limitations discussed above apply. The purely combinational latency of the SCBS is emphasized by S. G. Smith 8] as the sole reason for using it instead of simpler circuits. When latency clock-cycles are acceptable, Smith proposes the ush multiplier, a very compact design, directly derived from the serial-parallel scheme. Section 2 presents an original scheme and implementation for serial-serial multiplication of about the same complexity as Sips' design. This multiplier o ers four important features: (1) there are no latency clock-cycles between the input presentation and the result output; (2) it can be used for unsigned as well as signed two's complement numbers; (3) it produces a full-precision result (no truncated or rounded bits) which can be extended to an arbitrarily large number of bits (with no extra hardware); and (4) it is fully modular. These features are individually present in other designs but none of the cited schemes has these four characteristics at the same time.
An unsigned serial squarer has been presented by Dadda 9] . It uses the algorithm originally discussed by T. C. Chen 10] . If sign-extension is used, the complexity for signed two's complement numbers is said to be almost twice that for unsigned numbers. Other techniques are described but no really compact or modular solution is shown (the simplest implements a negative weight for the most signi cant bit). The design presented in section 3 has the same complexity as the Dadda's unsigned squarer but accepts also two's complement numbers and has zero clock-cycle latency.
Notation and Symbols
The multiplications considered in this correspondence will be performed on two N-bit factors, X and Y . The resulting product is a K-bit number P . Similarly, the square of the N-bit factor X is a K-bit number S. In both cases, to ensure a correct result, K must satisfy K 2N.
All numbers are represented by capital letters, the bits of their binary representations are denoted by the corresponding lower-case letter and an index. Figure 1 shows the logic symbols used in this correspondence. The symbols (a) to (e) represent one-cycle delay elements, that is, D ip-ops whose clock inputs have been omitted for simplicity. Symbol (b) has an active-high synchronous-clear signal, and symbol (c) an active-high synchronous-set signal. Symbols (d) and (e) have respectively an active-high and active-low enable signal. The enable signal must be active for a clock transition to load a value into the element. It is obviously possible to combine clear/set and enable signals on the same delay element. In this case the clear/set signal has higher precedence.
The name (n, k) counter will be used to designate a device with n inputs and whose k outputs code a binary word representing the number of active input signals. An alternative view of this device is that of an adder of n one-bit words producing a k-bit number. The number of outputs is obviously equal to k = dlog 2 (n + 1)e. Some implementations of this class of devices have been discussed by Swartzlander 11] . Symbols (f) and (g) in gure 1 represent respectively a (3, 2) counter and a (5, 3) counter. Finally, symbol (h) is a two input multiplexer. Figure 2 shows the multiplication algorithm resulting from a straightforward application of the two's complement integer de nition. The N ? 1 least signi cant bits of each sub-product must rst be added, and then the last bit is subtracted, except for the last sub-product, for which additions are replaced by subtractions and vice versa. The complex control required for this algorithm makes it a poor candidate for hardware implementation.
Basic Algorithms
A possible simpli cation is to sign-extend the sub-products to the desired length K of the nal result, and to perform \normal" additions and subtraction, keeping only the rst K bits. The resulting algorithm is shown in gure 3, for K = 9 bits. This operation can alternatively be viewed as sign-extending the coe cient X to K bits.
However, the algorithm of gure 3 has the disadvantage that the last sub-product must be subtracted while all the previous ones are to be added. This can be overcome by also signextending the data Y to K bits and then by performing K additions. Again, only the rst K bits of the result should be kept. This algorithm, shown in gure 4, is the one used most often. Both algorithms shown in gures 3 and 4 produce a sign-extended result.
Serial-Serial Multiplier
Using the algorithm of gure 4 on sign-extended inputs, the number of partial terms to add grows linearly with the bit position. Most of the terms added to produce all bits of weight above 2 2N?1 inclusive are due to the sign extension, and are in the form x N?1 y N?1 . It can be shown that the regularity of these terms makes it possible to ignore them without producing a di erence in the output.
Proceeding in a way similar to Strader and Rhyne 3] , the partial product after the presentation of the bit number i can be expressed recursively as (2) with P i = P i for i < N. The error incurred at cycle j N by using equation (2) instead of equation (1) x3y2   x3y1   x3y3  x0y3  x1y3  x2y3  x3y3  x3y3  x3y3   x3y3  x0y3  x1y3  x2y3  x3y3  x3y3  x3y3   x3y3  x0y3  x1y3  x2y3  x3y3  x3y3  x3y3   x3y3  x0y3  x1y3  x2y3  x3y3  x3y3  x3y3   x3y3  x0y3  x1y3  x2y3  x3y3  x3y3  x3y3   x3y3  x0y3  x1y3  x2y3  x3y3  x3y3  x3y3   p7  p0  p1  p2  p3  p4  p5 
It is clear from equation (4) that E i mod 2 i+1 = 0, 8i. This implies that the missing part of the partial sum never a ects a serial output multiplier which|at cycle i = K ? 1, that is, at the end of the multiplication|has produced only the least signi cant K bits of output. Moreover, when unsigned numbers are entered zero-extended, E i is identically zero. Therefore the expression for P i can be conveniently used to build a multiplier performing equally well on unsigned zero-extended and on sign-extended operands.
The classic add & shift technique can be conveniently used introducing 
Equation (5) shows that each partial product can be produced with a combinational delay by using the stored values of the operands only up to the previous input bits (X i?1 and Y i?1 ) and the current inputs (x i and y i ). The symmetric term x i y i should be added on-line to bit number i. The 1=2 term accounts for the shift and the fractional part represents the ready part of the product which has been shifted out. Before shifting, the value added to the partial product in each cycle is an integer value. Thus, if the remainder of the division is extracted from the multiplier at each iteration, an equivalent system can be built up that stores only the integer part of the result: 
A similar expression for i N can be derived from equation (6) . A better understanding of the situation can be obtained by comparing an example of this algorithm with the basic algorithm for sign-extended operands shown in gure 4. Figure 5 presents an example multiplication of two 4-bit operands, generating a 9-bit output. The terms present in gure 4 and missing in gure 5 correspond to those in the last sum of equation (4) .
A single bit-slice implementing directly the algorithm of equation (7) is shown in gure 6. Owing to its structure, this multiplier requires both operands to be of the same size. Should this not be the case, the shorter one is to be extended to the size of the longer one. N ? 1 such slices are necessary to multiply two N-bit numbers (thus allowing storage of X N?2 and Y N?2 ). They should be connected as shown in gure 7. Perfect modularity, at the expense of some redundant hardware, can be obtained by replicating in every module the AND gate of gure 7 and by replacing the multiplexer with an N th module having signal Pin at zero. Both operands are serially entered, least signi cant bit rst, during the rst N clock cycles. As mentioned above, unsigned numbers should be extended with zeroes during the K ? N remaining cycles (with K 2N to ensure a correct result) and two's complement numbers should be signextended. The FirstBit signal should be set high during the rst cycle; subsequently held low. Similarly, the LastBit signal should be activated only during the last cycle of a multiplication or, more precisely, during the cycle preceding a new multiplication. The output is serially available starting with the rst cycle. The values outlined by the sets in the upper portion of gure 5 correspond to the terms added by each module of the multiplier; while the elements of the sets in the lower portion of the gure are added to the partial sum in the same cycle.
Usually, two new terms are added in each bit-slice at every cycle. This suggests the use of a (5, 3) counter with a far carry (weight 4) back-propagated to the module that precedes the generating one (delay element on the output Cout). Another delay element is provided to store the local carry. The output multiplexer of the (i + 1) th slice is used to add the symmetric term x i y i to the appropriate bit of the partial sum. The delay elements with enable signals progressively store the operands (X i and Y i ), and are used to produce the two new terms added to the partial sum (x i Y i?1 and y i X i?1 ). Finally the delay elements on the path Pin-Pout constitute the register to store and shift the partial sum Q i .
The shift register produced by chaining the signals Rin and Rout of the various stages activate successive modules of the multiplier; this allows the addition of the symmetric term and enables the memorization of the input bits. It can be replaced by a binary counter followed by a decoder. The counter should be synchronously reset before the rst bit is entered. This alternative scheme is valid also for the proposed squarer (see section 3). Essentially, it trades modularity for a possible small hardware reduction. Otherwise the two approaches are equivalent.
The module shown in gure 6 has a critical path composed of an AND gate, a (5, 3) counter, and a 2-input multiplexer. Compared to other designs (e.g., Dadda 7] ), this design increases the critical path delay by the propagation time through the multiplexer.
Serial Squarer
There are many systems where a serial unit should produce the square of its input; for instance, when computing the Euclidean distance between two vectors. If two di erent numbers are not required to be multiplied in the same unit, then it would be a waste of hardware to implement such a unit with the serial-serial multiplier presented in section 2. The fact that the two operands are identical facilitates in the simpli cation of the multiplication algorithm. Equation ( 
An expression for i N can be obtained by limiting the register storage and avoiding the addition of the term x i . The remarks on the irrelevance of the incurred error E i hold equally true. Figure 8 shows graphically the algorithm used to square an unsigned number of N = 4 bits. The upper part of the gure shows the conventional unsigned multiplication algorithm with both operands being identical. The lower part presents the simpli ed squaring algorithm. Similarly, gure 9 presents the algorithm used to square signed two's complement numbers. In this example, the input has N = 4 bits and the output K = 9 bits.
A single bit-slice of the proposed serial squarer is shown in gure 10. N ? 1 such slices| connected as shown in gure 11|are used to square an N-bit operand. For improved modularity, as in the case of the multiplier, the AND gate can be replaced with an N th module having signal Sin at zero. The operand is serially entered, least signi cant bit rst, during the N rst clock cycles, and is extended with zeroes (unsigned number) or sign-extended (two's complement number) during the K ? N cycles (with K 2N to ensure a correct result). As with the serial-serial multiplier, the FirstBit and the LastBit signals should be set high during the rst and the last cycle respectively; otherwise they are set low. The output is serially available with no latency. The same hardware can be used for both unsigned and two's complement numbers.
The values outlined in the sets in gures 8 and 9 are added to the partial result at successive iterations, the terms on the n th line being produced by module n. The terms x i are added through the output multiplexers. The operand is progressively stored (X i ) in the register made up of the delay elements with enable. The output of the AND gate is added to the partial sum, stored in the shift register composed of the delay elements on the path Sin-Sout. Comparing gure 10 with gure 6, it may be noted that the delay element holding Q i is here moved after the adder. This doubles the weight of the added term (output of the AND gate) and accounts for the 2 in front of the term x i X i?1 found in equation (8).
Conclusions
As detailed in the introduction, the multiplication scheme discussed above presents some similarities with other designs, but the proposed implementation is the most compact one. An interesting feature, extremely important for VLSI designs, is its high degree of modularity. The delay between an input bit presentation and the corresponding output bit has been reduced to the delay of a combinational circuit, whose critical path contains only a (5, 3) counter and a few gates. Furthermore, a novel and important feature has been added: the result remains correct (i.e., the sign is extended) as long as the inputs are sign-extended, hence allowing the production of an arbitrarily long output. In some cases, this can be a key feature. Supposing that a sum of products is being computed with all bit-serial components and that the partial-sum size is larger than that of the single products, no special circuitry has to be added between the multiplier and the serial adder, provided that the operands are appropriately sign-extended. The proposed squarer, an original implementation of the algorithm presented by Chen 10], also allows the use of signed two's complement numbers. The implementation shares the same advantages as the companion multiplier: it is compact, modular, and has no latency. It is moreover much simpler than the signed squarer described by Dadda 9] .
The described serial-serial multiplier and serial squarer have been designed in VLSI and simulated using the Compass ASIC development environment 12]. The multiplier has also been included in the GENES IV circuit 13], which has been manufactured in CMOS 1:0 m technology and successfully tested. 
