In this paper, we present three digital multiplier architectures capable of operating in the gigahertz range, based on the MOS Current Mode Logic (MCML) style. A small library of MCML logic gates consisting of NAND/AND, XOR/XNOR, (3x2) counter (full adder), [4:2] compressor, and master-slave flip-flop was designed and optimized for high-speed operation. Using these gates, we propose three different 8-bit MCML binary-tree multiplier architectures and compare their performance in terms of latency, throughput (number of multiplications per second) and power consumption. According to our simulation, the fastest multiplier, targeting TSMC 0.18 µm CMOS technology, attains a throughput of 4.76 GHz (4.76 billion multiplications per second) and a latency of 3.8 ns.
INTRODUCTION
The increasing demand for fast arithmetic units in floating-point co-processors, graphics processing units and DSP chips has shaped the need for highly integrated, high-speed multipliers. Traditionally, multiplier architectures fall into one of two categories: array multipliers and tree multipliers. The latency of array multipliers is a linear function of the word length of the multiplier, O(n), whereas the latency of tree multipliers is a logarithmic function of the word length, O[log(n)]. Hence, tree structures require fewer stages for partial product reduction than array structures and are more suitable for high-speed multiplier designs. To enhance the throughput, we pipelined our multipliers by inserting a register stage after every compressor cell.
The ability to build logic gates that operate at high speed while dissipating relatively little power makes MOS current mode logic (MCML) a promising technique for designing gigahertz-range arithmetic circuits. Our high-speed pipelined tree multipliers exploit several attractive features of MCML, as described later. A small library of MCML logic gates consisting of NAND/AND, XOR/XNOR, (3x2) counter (full adder), [4:2] compressor and flip-flop forms the core of our multipliers; the gates were designed and optimized for high-speed operation. We propose three 8-bit MCML multiplier architectures: a 3-2-tree architecture with a ripple carry adder, a 4-2-tree architecture with a ripple carry adder, and a 4-2-tree architecture with a carry look-ahead adder.
Section 2 covers the basics of MCML and discusses the design metrics and tradeoffs involved in MCML gate design. Section 3 describes the design of the MCML gates in our library, discusses the optimization techniques adopted for the design, and provides simulation results. In Section 4, we present our three 8-bit MCML multiplier architectures. In Section 5, we compare the performance of the proposed multiplier architectures and present simulation results. Finally, Section 6 summarizes our research.
MOS CURRENT MODE LOGIC (MCML)
The operation of an MCML gate may be understood with the help of the basic gate structure shown in Figure 1 [1]. It consists of load resistors R_L, a differential pull-down network (PDN) with complementary sets of inputs and outputs, and a constant current source I_CS. The differential inputs are applied to the PDN, which has a tree-like differential structure similar to that of the Differential Cascode Voltage Switch (DCVS) logic family [2]. The output and its complement are available at the two arms, as indicated in the figure. The PDN is grounded through the constant current source I_CS, which is usually an NMOS transistor. The voltage swing at the output and its complement is ΔV = I_CS × R_L; it is controlled by setting the value of the current source I_CS and the effective value of R_L, which is usually implemented as a PMOS transistor. The voltage swing is in the range of a few hundred mV and is a crucial leverage factor in high-speed MCML gate design. Every MCML gate has two bias voltages, RFP and RFN. The value of RFP is set to achieve the desired load resistance, which can also be controlled through the dimensions of the PMOS transistor. RFN biases the current source transistor and helps fix the desired current. The width of the current source transistor is usually large to make the transistor robust, to decrease mismatch effects, and to enable a future reduction in V_DD [3].
The equations for the total propagation delay, power dissipation, and power delay product of an MCML logic circuit and its CMOS counterpart are shown in Table 1 [1]. 
As can be seen from Table 1, the delay of an MCML logic circuit varies linearly with the voltage swing ΔV and is independent of the supply voltage V_DD, in contrast to conventional CMOS logic circuits. The power dissipation of an MCML logic circuit varies linearly with the supply voltage V_DD and is independent of the operating frequency, whereas the power dissipation of conventional CMOS circuits depends linearly on the operating frequency and has a square-law dependence on the supply voltage. Since the delay of an MCML gate depends linearly on ΔV and is independent of V_DD, the delay can be effectively minimized by lowering the voltage swing ΔV while maintaining the supply voltage. Further, as the power dissipation is independent of the operating frequency, MCML circuits may be operated at high speeds without increasing the power dissipation, in contrast to conventional CMOS circuits [1]. Important design issues for MCML logic circuits include the need for shallow logic depth and signal regeneration. For an in-depth comparison between the MCML and CMOS logic styles, the reader is referred to [1].
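The scaling trends described above can be illustrated with a small first-order model. Table 1 is not reproduced in this text, so the expressions below are the commonly used first-order forms and should be read as assumptions rather than as the table's exact entries: MCML path delay N·C·ΔV/I, MCML static power N·I·V_DD, and CMOS dynamic power N·C·V_DD²·f. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def load_resistance(delta_v, i_cs):
    """Load resistance (ohms) needed for a swing delta_v at tail current i_cs,
    from the relation delta_v = i_cs * r_load."""
    return delta_v / i_cs


def mcml_delay(n_stages, c_load, delta_v, i_cs):
    """Assumed first-order MCML path delay: N * C * dV / I.
    Linear in the swing delta_v and independent of the supply voltage."""
    return n_stages * c_load * delta_v / i_cs


def mcml_power(n_gates, i_cs, v_dd):
    """Assumed first-order MCML power: N * I * V_DD.
    Linear in V_DD and independent of the operating frequency."""
    return n_gates * i_cs * v_dd


def cmos_dynamic_power(n_gates, c_load, v_dd, freq):
    """Assumed first-order CMOS dynamic power: N * C * V_DD^2 * f.
    Linear in frequency, quadratic in V_DD."""
    return n_gates * c_load * v_dd ** 2 * freq
```

With these assumed expressions, halving the swing halves the MCML delay at constant current, while doubling V_DD doubles the MCML power but quadruples the CMOS dynamic power.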
MCML GATE LIBRARY
Our library of MCML logic gates consists of NAND/AND, XOR/XNOR, (3x2) counter (or full adder), [4:2] compressor, and master-slave flip-flop. To operate the gates in the gigahertz range, our main objective was to minimize the delays expressed in Table 1. A three-step approach was adopted for this purpose. First, the maximum current through the logic transistors per unit width was determined through simulations; it is on the order of a few hundred microamperes. After fixing the bias voltages RFP and RFN and using this current value, the dimensions of the load transistors and the current source transistor were determined through further simulations. Optimizing the PMOS load transistor is one of the most crucial and challenging tasks in MCML gate design, involving fine tradeoffs between non-linearity and signal strength. Whereas a high W/L ratio improves the delay by decreasing the resistance, it increases the non-linearity of the resistance, which degrades the output voltage swing [3]. An MCML full adder and a [4:2] compressor are shown in Figure 3. The full adder, or (3x2) counter, consists of an XOR3 gate (the sum circuit), shown in Figure 3(a), and a majority function (the carry circuit), shown in Figure 3(b). The MCML XOR3 gate shown in Figure 3(a) is based on the DCVSL design proposed by Chu and Pulfrey in [2]. This XOR3 design uses two fewer transistors than the BDD design proposed by Musicer in [1]. Our simulation results confirmed that the former design is faster than the latter, with the added advantage of smaller area. The [4:2] compressor is designed using two full adders, as shown in Figure 3(c). It is a special form of a (5, 3) counter with one carry entering and one carry leaving the compressor column [4]. The (3x2) counters and [4:2] compressors are used in our 3-2-tree and 4-2-tree MCML multipliers, respectively.
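The logical behavior of the full adder and of a [4:2] compressor built from two full adders can be sketched at the bit level. This is only a behavioral model, not the differential MCML circuit, and the wiring below is the conventional two-full-adder arrangement, assumed here because Figure 3(c) is not reproduced in this text. The function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """(3x2) counter: the sum is a 3-input XOR, the carry is a majority function."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """[4:2] compressor from two cascaded full adders (assumed wiring).
    cout depends only on x1..x3, so it does not ripple through the column."""
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


def compressor_is_consistent():
    """Check that the five input bits always equal sum + 2 * (carry + cout)."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True
```

The exhaustive check over all 32 input combinations reflects the (5, 3) counter view of the compressor: five bits of equal weight enter, and one sum bit plus two carry bits of double weight leave.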
Three different delay models were considered: intrinsic delay, FO4 delay, and actual delay. The intrinsic delay is the critical path delay of a gate without any load. The FO4 delay is the propagation delay with a fan-out of four, in which a gate drives four identical gates. The FO4 delay often serves as a benchmark for gates of different logic families and/or processing technologies. The actual delay is the critical delay of a gate when it is embedded in our proposed MCML multipliers, and it determines the speed of the proposed multiplier architectures. It should be noted that the delay of the flip-flop is calculated as the sum of its setup time and its clock-to-Q delay. Another important performance metric is power consumption. We estimated average power by applying all input combinations for a gate with at most three inputs and by applying random input patterns for a gate with a larger number of inputs. The gates were laid out with TSMC 0.18 µm CMOS as the target technology, and SPICE simulation was performed to estimate the performance. Table 2 shows the performance of our MCML library gates.
PROPOSED MULTIPLIER ARCHITECTURES
A tree structure is a good choice for high-speed multipliers and has been employed in the proposed MCML multiplier architectures. This is due to the logarithmic reduction of partial products in tree multipliers in contrast to a linear reduction in array multipliers [4] .
The latency for tree multipliers is a logarithmic function of the word length O[log(n)], whereas the latency for array multipliers is a linear function O(n).
The performance of a multiplier is usually measured in terms of its latency and throughput. Throughput is an important metric for applications such as high-speed digital signal processing and other computation-intensive circuits, whereas latency is more significant for applications such as advanced microprocessor architectures. Block diagrams of the three proposed 8-bit MCML multipliers, namely the 3-2-tree architecture with a ripple carry adder, the 4-2-tree architecture with a ripple carry adder and the 4-2-tree architecture with a carry look-ahead adder, are shown in Figure 4. To avoid clutter, the complementary signals are not shown in Figure 4.
3-2-Tree MCML Multiplier with Ripple Carry Adder
The main components of the 8-bit 3-2-tree MCML multiplier shown in Figure 4(a) are a 64-bit partial product generator, 14 3-2-tree slices, the deskew registers, and a 13-bit ripple carry adder. The partial product generator consists of 64 MCML AND/NAND gates. Each 3-2-tree slice contains seven full adders that are pipelined using flip-flops, and reduces eight partial products to two outputs (sum and carry) in four clock cycles, as shown in the block diagram. In general, a 3-2-tree design requires log_1.5(N/2) stages, rounded up, to reduce N partial products to sum and carry outputs. In contrast, a 4-2-tree design needs log_2(N/2) stages for the same reduction. Fourteen such tree slices are used to reduce the 64 partial products to sum and carry outputs that arrive at the same time. A ripple carry adder is used for parallel addition of these sum and carry bits. It is pipelined by inserting a flip-flop between every pair of adjacent full adders. Since all inputs arrive at the same time, they must be delayed successively before being applied to the adder. This delay balancing at the inputs and outputs of the ripple carry adder is achieved with the help of deskewing registers, which are columns of flip-flops used to insert appropriate delays at the input and the output. This ensures that all the outputs of the ripple carry adder appear in the same clock cycle. Thus, the total latency (number of delay stages) of the multiplier is 18: one clock cycle is attributed to the partial product generator, four clock cycles to the 3-2-tree slice, and 13 clock cycles to the ripple carry adder.
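The partial product generation and the stage-count formulas above can be sketched behaviorally. The sketch below models the 64 AND gates as an 8x8 bit matrix and counts reduction stages as the smallest integer L satisfying (k/2)^L >= N/2, which equals the rounded-up logarithm quoted in the text; integer arithmetic is used to avoid floating-point rounding. The function names are hypothetical, and the model says nothing about circuit timing.

```python
def partial_products(a, b, width=8):
    """width x width matrix of AND terms; entry [j][i] is a_i AND b_j."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for i in range(width)]
            for j in range(width)]


def weighted_sum(pp):
    """Sum the partial product bits, with entry [j][i] carrying weight 2^(i+j)."""
    return sum(bit << (i + j)
               for j, row in enumerate(pp)
               for i, bit in enumerate(row))


def tree_stages(n_rows, k):
    """Stages needed by a k-to-2 reduction tree (k = 3 or 4) to reduce n_rows
    operands to two: smallest L with (k/2)^L >= n_rows/2, in integer form."""
    stages = 0
    while 2 * k ** stages < n_rows * 2 ** stages:
        stages += 1
    return stages
```

For eight partial products, tree_stages(8, 3) gives 4 and tree_stages(8, 4) gives 2, matching the four-cycle 3-2-tree slice described above and the shallower 4-2 tree.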
4-2-Tree MCML Multiplier with Ripple Carry Adder
The 4-2-tree MCML multiplier architecture, shown in Figure 4(b), is designed in a similar manner to the 3-2-tree architecture, except that it uses [4:2] compressors. Thirteen 4-2-tree slices are used to reduce the 64 partial products. Unlike in the previous architecture, the 14-bit ripple carry adder (RCA) used for the parallel addition of the sum and carry bits is pipelined by inserting a flip-flop after every two full adders. The total latency (number of delay stages) of the multiplier is ten.
4-2-Tree Multiplier with Carry Look-ahead Adder
The architecture of the 4-2-tree MCML multiplier using a carry look-ahead adder (CLA), shown in Figure 4(c), is the same as that of the previous multiplier except for the type of adder used. In this design, a CLA consisting of three 4-bit CLAs is pipelined by inserting one flip-flop between adjacent 4-bit CLAs, and it is used for the parallel addition of the sum and carry outputs from the 13 4-2-tree slices. The total latency (number of delay stages) of the multiplier is six.
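The latency figures of the three architectures (18, 10 and 6 clock cycles) can be reproduced with a simple count. The breakdown for the 3-2-tree design is given in the text; for the two 4-2-tree designs the breakdown below (one cycle for the partial product generator, two tree stages, and one adder pipeline stage per group of bits between flip-flops) is an inference that happens to match the stated totals, not a breakdown given by the source. The function name is hypothetical.

```python
def latency_cycles(tree_stages, adder_bits, bits_per_adder_stage):
    """Total pipeline latency in clock cycles:
    1 (partial product generator) + tree stages + adder pipeline stages."""
    # Ceiling division without floating point.
    adder_stages = -(-adder_bits // bits_per_adder_stage)
    return 1 + tree_stages + adder_stages


# 3-2 tree, 13-bit RCA with a flip-flop after every full adder.
ARCH_3_2_RCA = latency_cycles(tree_stages=4, adder_bits=13, bits_per_adder_stage=1)
# 4-2 tree, 14-bit RCA with a flip-flop after every two full adders (assumed split).
ARCH_4_2_RCA = latency_cycles(tree_stages=2, adder_bits=14, bits_per_adder_stage=2)
# 4-2 tree, three 4-bit CLA blocks with a flip-flop between blocks (assumed split).
ARCH_4_2_CLA = latency_cycles(tree_stages=2, adder_bits=12, bits_per_adder_stage=4)
```

Under these assumptions the three constants evaluate to 18, 10 and 6, which shows how coarser pipelining of the final adder trades throughput for a lower cycle count.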
PERFORMANCE OF MCML MULTIPLIERS
The performance of the three 8-bit MCML multipliers proposed in this paper is summarized in Table 3. The 3-2-tree MCML multiplier with an RCA, called Architecture I, achieves the highest throughput among the three, but it incurs the largest latency and area. It operates with a throughput of 4.76 GHz and dissipates 261 mW of power. The high throughput of Architecture I is due to its high degree of pipelining, while the large area is due to the large number of flip-flops used for delay balancing. The high degree of pipelining is also the cause of the high latency, in clock cycles, of Architecture I. On the other hand, the 4-2-tree MCML multiplier with a CLA, called Architecture III, has the lowest latency in clock cycles because it has the lowest degree of pipelining, and it also has the smallest area among the three architectures. However, it yields the lowest throughput, 2 GHz. It is interesting to note that the 4-2-tree multiplier with an RCA (Architecture II) is faster than its CLA version (Architecture III), but it consumes more power and is less area-efficient. The higher speed is due to the shorter critical path of the pipelined RCA. The performance of Architecture II lies between that of Architecture I and Architecture III.
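The relation between latency in clock cycles, throughput and latency in time can be checked with a short calculation, assuming one multiplication is issued per clock cycle so that the clock frequency equals the throughput. The cycle counts (18, 10, 6) and throughputs (4.76, 3.3 and 2 GHz) are the values quoted in this paper; the function name is hypothetical.

```python
def latency_ns(cycles, throughput_ghz):
    """Latency in nanoseconds for a pipeline of `cycles` stages clocked at
    `throughput_ghz` GHz (one result per cycle is assumed)."""
    return cycles / throughput_ghz


# (latency in cycles, throughput in GHz) as quoted in the paper.
ARCHITECTURES = {
    "I (3-2 tree, RCA)": (18, 4.76),
    "II (4-2 tree, RCA)": (10, 3.3),
    "III (4-2 tree, CLA)": (6, 2.0),
}

LATENCIES_NS = {name: latency_ns(cycles, f_ghz)
                for name, (cycles, f_ghz) in ARCHITECTURES.items()}
```

Rounded to two decimals, the results are about 3.78 ns, 3.03 ns and 3.00 ns, so the three designs have similar latency in time even though their cycle counts differ by a factor of three.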
A fair comparison of other multiplier architectures with the proposed designs is difficult, because multipliers are often designed with different technologies and performance goals. We selected the four fastest multiplier designs available in the open literature for comparison with the proposed architectures. Table 4 shows the throughputs of these four multiplier architectures. It should be noted that all four designs adopted CMOS technology. The fastest multiplier among the four is the one proposed by Intel [7], which achieves a throughput of 1.5 GHz. It can be observed that all of the proposed architectures achieve higher throughputs than the designs found in the contemporary literature.
SUMMARY
In this paper, we propose three gigahertz-range multiplier architectures using the MOS Current Mode Logic (MCML) style, involving tradeoffs among throughput, latency, power dissipation and area. The 3-2-tree MCML multiplier with an RCA operates with a maximum throughput of 4.76 GHz (4.76 billion multiplications per second) and a latency of 3.78 ns. The 4-2-tree architecture with an RCA operates with a throughput of 3.3 GHz and a latency of 3 ns, and the 4-2-tree architecture with a CLA operates with a throughput of 2 GHz and a latency of 3 ns. A designer may therefore choose the architecture that best fits the requirements of the application.
