Abstract
Introduction
A high speed multiplier core forms one of the basic building blocks for digital signal processors. Signal processing applications typically implement iterative algorithms having a large number of multiply add/accumulate operations. The high throughput rates required in such applications are best satisfied by pipelining standard multiplier architectures. Recent advances in real-time filter architectures indicate that pipelined multipliers, with clock frequency much higher than the data rate, can significantly reduce the overall chip area without increasing the power consumption or the I/O bandwidth. These reasons motivate us to explore the limits of pipelined multipliers implemented in a standard, mature CMOS process.
This work describes the implementation of a CMOS 8 bit by 8 bit signed two's complement multiplier using a very finely pipelined carry save array architecture in order to achieve a throughput rate of 230 million multiplications per second. Pipelined multiplier architectures described in the literature all show the importance of the clock skew problem at high clock rates. The current architecture uses the true single phase clocking scheme [1], which has the inherent advantage that clock skew problems are restricted to the proper distribution of a single clock phase. This scheme has been demonstrated as being very attractive for the design of very fast adder elements [2].
A full circuit simulation, although desirable, is not practical for a circuit as large as the present one. Multiplier characterization has therefore been done by SPICE simulation for smaller blocks and timing simulation for the overall multiplier. This approach allows the design to be verified in reasonable time, without resorting to overall circuit simulation [3]. Simulation studies indicate a power dissipation of 540mW at a clock speed of 230MHz. Implementation is in a 1.6µm N well CMOS single polysilicon double metal process using the NELSIS IC Design System [4]. The multiplier has 5176 transistors in an area of 1.5mm by 1.4mm.
The current work uses the TSPC1/TSPC2 (true single phase clocked type 1 and type 2) circuits for the full and half adder compute elements. This scheme fits in well with a finely pipelined design, and yields very high speed circuits [2]. A dynamic clocking strategy is used for these circuits. This method gives the latency of an n stage pipeline as n/2 clock cycles. Latches are integral to the compute block in these circuits. This scheme yields a utilization of 50% for a compute block. However, because of the dynamic nature of TSPC1/TSPC2 circuits, the other 50% of the time is used for precharging and is not wasted.
Latch stages required for skewing multiplier bits and for deskewing multiplicand bits use true single phase latches [1]. Pipelining is carried out within full-adder blocks, so that a full-adder has a P half and an N half. Considering a full-adder row to be two stages, 2n stages have a latency of n clock cycles.
Multiplier Architecture
Among the various multiplier architectures, the array architecture is the architecture of choice for very high throughput pipelined multipliers, mainly due to the regularity of its structure and its semi-systolic nature, wherein most of the signals propagate between local blocks. Various forms of array architecture may be conceived, depending on the direction of data flow and the way in which the last row of the multiplier, the vector merge adder, is implemented.
The current work uses a carry save architecture implementing the modified Baugh Wooley signed two's complement multiplication algorithm [5, 3], with the LSB partial product being evaluated first. Data flow is solely in the vertical direction. Although this architecture is semi-systolic (the multiplier bits are broadcast over an adder row), in comparison to systolic architectures using both vertical and horizontal pipelining it is more efficient in silicon area usage [6]. Because it is not fully systolic, however, the long data paths of the multiplier lines add delay problems as a design issue.
The modified Baugh Wooley algorithm used is illustrated by considering a 4bit x 4bit multiplication. 
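[Illustration: partial product array for the 4bit x 4bit case, summed column-wise to give the product bits down to P1 P0.]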
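The algorithm can also be sketched in software. The following is a minimal behavioural sketch, not the circuit described here: it assumes the usual formulation of the modified Baugh Wooley array, in which every partial product that involves exactly one sign bit is complemented and a constant 1 is added at bit positions n and 2n-1 (counting from 0). The function name `baugh_wooley_multiply` and its parameters are illustrative only.

```python
def baugh_wooley_multiply(a: int, b: int, n: int = 4) -> int:
    """Multiply two n-bit two's complement integers with a modified
    Baugh Wooley partial product array (behavioural sketch only)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n-bit two's complement")

    # Python's >> is an arithmetic shift, so this yields two's complement bits.
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]

    total = 0
    for i in range(n):
        for j in range(n):
            pp = a_bits[i] & b_bits[j]           # AND gate
            if (i == n - 1) != (j == n - 1):     # exactly one sign bit involved
                pp ^= 1                          # NAND gate: complemented term
            total += pp << (i + j)

    # Correction constants: a 1 at bit n and a 1 at bit 2n-1.
    total += 1 << n
    total += 1 << (2 * n - 1)

    # Keep 2n bits and reinterpret them as a signed result.
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


if __name__ == "__main__":
    # Exhaustive check of the 4bit x 4bit case against ordinary multiplication.
    for x in range(-8, 8):
        for y in range(-8, 8):
            assert baugh_wooley_multiply(x, y, 4) == x * y
```

With n = 8 the two correction constants land on the 9th and 16th bit positions when counted from 1.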
The implementation of the above for 8bit x 8bit multiplication is shown in figure 1. This reflects the actual floorplan and consists of six major blocks:
1. The partial product summing full-adder array, which consists of 5 full-adder rows (fa), a top row of AND gates (and1), a second row of half-adders (ha) and a final full-adder row with complemented partial product terms (nfa). Pipelining is at the half bit level, with every full-adder having two pipeline stages. Partial products are generated within the full-adder cells using AND gates (except for the last row, which uses NAND gates), in parallel with partial product summation. This scheme is more efficient than one in which the AND array is kept separate [3]. Schemes like modified Booth recoding have not been used, because the fine level of pipelining makes a complex Booth recoder the major bottleneck.
2. The triangular vector merge adder summing up two 7 bit numbers to generate the most significant 8 bits of the product. This includes the blocks ha1, ha11, ha21, ha31 and ha41.
3. Latch stages to skew the multiplier bits (Z2, Z1). This includes the buffers needed to drive fairly long horizontal multiplier bit lines (Zbuf). A latch stage is made by cascading a P latch and an N latch.
4. Deskewing latch stages for the product bits (Z6,
5. Clock distribution circuitry (see figure 3).
6. Output buffers consisting of inverter chains to buffer product bits (these are not shown in figure 1).
In this design, at a given clock tick the current multiplicand is multiplied with the multiplier clocked in on the previous clock cycle. The actual implementation is very nearly square, with dimensions of 1.5mm x 1.4mm, and has in all 5176 transistors. About 25% of the area is occupied by the vector merge adder and 10% is occupied by the clock drivers.
True Single Phase Clocked Full-adder
The schematic of the pipelined single phase clocked adder is shown in figure 2. Each full-adder also includes the partial product generation circuitry. A full-adder is partitioned into two pipeline stages, a P block (left half of the schematic) and an N block (right half of the schematic). This circuit differs from that in [2] in allowing partial product evaluation in parallel with sum generation and in minimizing the complexity of the P half.
Evaluation of the sum and carry is a two step process. In the first step, during the clock low period, the P half generates: the xor of the si and ci inputs, the product term aj.bi, a latched version of the ci input, and a latched version of the aj input. Either the xor or the xnor term may be implemented with the same transistor complexity. In the current design the choice of generating the xor term was governed by layout issues. The xor term is generated by first forming si.ci using a fully complementary CMOS gate, with the xor term then being obtained from si.ci, si and ci.
In the clock high period the N half then evaluates the sum and carry outputs. The above scheme has the advantage of evaluating the partial product in parallel with the P half evaluation of the sum and carry. It partitions the addition process into two halves, with the computation arranged so that more of it is in the faster N half. The xnor and aj.bi terms in the P half are generated by using the TSPC1 circuit, while the N half exclusively uses the TSPC2 circuit. Because the P half uses the TSPC1 circuit, the outputs generated by it show spikes which are positive going [2]. These are minimized by proper transistor sizing, but are not totally eliminated. The outputs of the N half show far smaller spikes, since TSPC2 circuits are used.
Transistor count for the full-adder block is 59. If the transistor counts of the latches and the partial product generator are excluded, we get a count of 40 for the actual adder block. In comparison, an FCC full adder has a transistor count of 24, while a NORA full-adder requires 25 transistors. Area of the full-adder is 142.8µm x 147µm.
Each full-adder has been designed for abutment. Data flow is in the vertical direction except for the bi line, which runs horizontally. Power and clock lines run horizontally in first metal, without breaks.
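The two step evaluation can be illustrated with a small functional model. This is a sketch of the logic only, not of the TSPC circuits: the P half outputs follow the list given in the text, while the N half equations are an assumption based on the standard full-adder identities (sum = x xor pp; carry = pp when x is 1, otherwise the latched ci). The function names `p_half` and `n_half` are hypothetical.

```python
from itertools import product


def p_half(s_i: int, c_i: int, a_j: int, b_i: int) -> dict:
    """Clock low phase: the four values the P half is described as producing."""
    nand_sc = 1 - (s_i & c_i)            # complementary CMOS gate on si, ci
    x = (s_i | c_i) & nand_sc            # xor obtained from the gate output, si and ci
    return {
        "x": x,                          # si xor ci
        "pp": a_j & b_i,                 # partial product aj.bi
        "c": c_i,                        # latched ci
        "a": a_j,                        # latched aj, passed on to the next row
    }


def n_half(latched: dict) -> tuple:
    """Clock high phase (assumed equations): sum and carry from the latched values."""
    x, pp, c = latched["x"], latched["pp"], latched["c"]
    s_out = x ^ pp
    # If si != ci the carry equals the partial product; otherwise si == ci,
    # so the carry equals ci.
    c_out = pp if x else c
    return s_out, c_out


if __name__ == "__main__":
    # The cell must behave as a full adder on si, ci and aj.bi.
    for s_i, c_i, a_j, b_i in product((0, 1), repeat=4):
        s_out, c_out = n_half(p_half(s_i, c_i, a_j, b_i))
        assert s_out + 2 * c_out == s_i + c_i + (a_j & b_i)
```

Note that the N half in this model needs only the xor term, the partial product and the latched carry, which is consistent with keeping the P half simple.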
Buffered latch
Since each row of the multiplier evaluates its inputs on the clock low level, the buffered latch must ensure that the multiplier bit is valid during this phase.
The buffered latch is built from a P single phase latch, an inverter, a comparatively larger N latch, and a large inverter capable of meeting the timing constraints while driving the multiplier line.
Proper operation of the multiplier requires
$t_{clkh} > t_{dN} + t_{dinv}$,
where $t_{clkh}$, $t_{dN}$ and $t_{dinv}$ are the clock high period, the delay of the N latch for a high to low output transition and the inverter delay respectively. The design allows driving the 500fF multiplier line capacitance at a clock period of 3.8ns. Multiplier line capacitance was kept at a minimum, since each full-adder contributes a load equal to a single P transistor to the multiplier line (see figure 2).
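The timing condition can be checked numerically. The sketch below assumes a 50% duty cycle, so that the clock high period is half the clock period; the delay values passed in are placeholders chosen for illustration and are not figures from this design. The function name `buffered_latch_ok` is hypothetical.

```python
def buffered_latch_ok(clock_period_ns: float,
                      t_dn_ns: float,
                      t_dinv_ns: float,
                      duty_cycle: float = 0.5) -> bool:
    """Return True when the clock high period exceeds the N latch delay
    plus the inverter delay (all times in nanoseconds)."""
    t_clkh_ns = clock_period_ns * duty_cycle
    return t_clkh_ns > t_dn_ns + t_dinv_ns


if __name__ == "__main__":
    # Illustrative delays only: at a 3.8ns period the high phase is 1.9ns.
    print(buffered_latch_ok(3.8, t_dn_ns=1.0, t_dinv_ns=0.5))
    print(buffered_latch_ok(3.8, t_dn_ns=1.5, t_dinv_ns=0.5))
```

In this model, shortening the clock period or increasing the load on the multiplier line (which raises both delays) reduces the available margin.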
Vector merge adder
The vector merge adder uses a triangular array of half adders. This structure is similar to that used in the multipliers in [6, 3]. Merging of two half-adder rows is used in the current implementation, allowing an 8 bit sum of two 7 bit numbers to be generated in 4 clock cycles [3]. A cascade of two half adders thus constitutes a basic block. This has one half adder entirely in the P half and the other entirely in the N half. Blocks on the main diagonal have one half adder in the P half and a latch in the N half. The percentage area used by the vector merge adder, along with the deskewing registers required at the output, is 25% of the total area for the multiplier, not including the clock drivers and the output buffers.
The vector merge adder also takes care of the addition of 1 required in the Baugh Wooley algorithm. For the 8bit x 8bit case a 1 has to be added in the 9th bit position and the 16th bit position. This is done by feeding in a 1 to the topmost half adder row at the MSB position. For the 9th bit position, noting that an addition of 1 implies that the sum is the complement of the xor of the other two inputs and the carry is the or of the inputs, it is possible to design a block very close to a half adder. This is the scheme which is employed here. This block is labeled FA1 in the floorplan.
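The relation between an ordinary half adder and a block that also adds a constant 1 can be verified with a truth table. The sketch below is a logic-level illustration only; `half_adder` and `add_one_block` are hypothetical names, and the second function simply encodes the arithmetic fact that a + b + 1 has a sum bit equal to the xnor of a and b and a carry bit equal to their or.

```python
from itertools import product


def half_adder(a: int, b: int) -> tuple:
    """Ordinary half adder: returns (sum, carry) for a + b."""
    return a ^ b, a & b


def add_one_block(a: int, b: int) -> tuple:
    """Half-adder-like block that computes a + b + 1.

    Compared with half_adder, the sum is complemented (xnor)
    and the and in the carry is replaced by an or."""
    return (a ^ b) ^ 1, a | b


if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        s, c = half_adder(a, b)
        assert s + 2 * c == a + b
        s1, c1 = add_one_block(a, b)
        assert s1 + 2 * c1 == a + b + 1
```

Both blocks have two inputs and two outputs, which is why the constant can be absorbed without the cost of a full adder.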
Clock system
A crucial issue involved in high speed clocked designs is the determination of the clock distribution method to be employed. The clock distribution used is illustrated in figure 3.
Each row of the multiplier has a common clock line, driven by a clock buffer, for all the blocks within it. The clock buffer is a CMOS inverter. In order that the clock have a 50% duty cycle, a P device width 2.75 times that of the N device is required. Using 8 parallel connected N devices of width 16µm and 8 P devices of width 44µm (length 1.6µm), rise and fall times as determined by SPICE simulations with a load of 5.3pF are 650ps. SPICE simulations of basic blocks use rise and fall times which are much larger (2.0ns).
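The buffer sizing quoted above can be checked with simple arithmetic. The sketch below only recomputes the effective device widths and their ratio from the stated finger counts and widths; the helper name `effective_width_um` is hypothetical.

```python
def effective_width_um(fingers: int, finger_width_um: float) -> float:
    """Total width of identical devices connected in parallel."""
    return fingers * finger_width_um


if __name__ == "__main__":
    w_n = effective_width_um(8, 16.0)   # N pull-down: 8 devices of 16um
    w_p = effective_width_um(8, 44.0)   # P pull-up: 8 devices of 44um
    print(w_n, w_p, w_p / w_n)          # total widths and the P/N ratio
```

The resulting P to N width ratio is 44/16 = 2.75, which matches the ratio stated as necessary for a 50% duty cycle.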
Results
SPICE simulations of the full-adder indicate operation at clock periods of 3.8ns at 27°C. Outputs of the full-adder were assumed to drive inputs of a similar cell. The robustness of the full-adder design with respect to high temperature is indicated by its capability to sustain clock rates of 180MHz at 125°C.
The full-adder can accept fairly slow clock transition edges (a triangular clock waveform being used for SPICE simulations). The average power dissipation was 6mW. Overall timing simulations to confirm multiplier operation were done using the NELSIS simulator SLS. Multiplier power dissipation as estimated by SLS is 540mW at 230MHz. We expect a slightly higher dissipation, since the simulation estimate is not very accurate.
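A few derived figures follow directly from the numbers reported above. The sketch below is plain unit conversion under the assumption of one multiplication per clock cycle; it introduces no new measurements, and the function names are hypothetical.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1e3 / freq_mhz


def max_frequency_mhz(period_ns: float) -> float:
    """Clock frequency in MHz for a period in nanoseconds."""
    return 1e3 / period_ns


def energy_per_multiply_nj(power_mw: float, freq_mhz: float) -> float:
    """Energy per multiplication in nanojoules (mW / MHz = nJ),
    assuming one multiplication per clock cycle."""
    return power_mw / freq_mhz


if __name__ == "__main__":
    print(round(clock_period_ns(230.0), 2))                # period at 230MHz, in ns
    print(round(max_frequency_mhz(3.8), 1))                # full-adder limit at 27C, in MHz
    print(round(energy_per_multiply_nj(540.0, 230.0), 2))  # energy per multiply, in nJ
```

The full-adder period of 3.8ns corresponds to roughly 263MHz, so the 230MHz multiplier figure leaves some margin relative to the isolated cell at 27°C.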
