I. INTRODUCTION
One of the main advantages of MOSFET scaling to nanometer gate lengths is the ability to reach device speeds exceeding 120 GHz with low supply voltages. Despite the high intrinsic speed of transistors available in these processes, the design of 1.2-V 40-Gb/s digital blocks in CMOS for fiber optic and backplane communications remains a challenge. A CMOS implementation would permit 40-Gb/s serializer-deserializer (SERDES) chips to reach the same levels of integration as state-of-the-art 10-Gb/s ICs [1], [2] and to operate from a single 1.2-V power supply. For a 40-Gb/s SERDES to be economically viable, its cost and performance must be competitive with a 4 × 10 Gb/s solution [3]. Typically, a 40-Gb/s SERDES must have less than 2.5 times the cost of a 10-Gb/s system while consuming less than 2.5 times the power.
Although a half-rate 40-Gb/s transmitter in CMOS has been reported in [4], it operates from 1.5 V and consumes two times the power of a similar SiGe BiCMOS transmitter operating at 86 Gb/s [5]. Full-rate retiming has been successfully demonstrated at speeds above 40 Gb/s only in III-V [6], [7] and SiGe BiCMOS technologies [8], [9]. However, these circuits operate from 1.5 V or higher supplies and consume at least 20 mW per latch. A full-rate 40-Gb/s latch has yet to be realized in CMOS.
To truly benefit from the lower power potential of nanoscale MOSFETs, the traditional CML latch topology must be simplified by reducing the number of vertically stacked transistors to allow for 1.2-V operation. The availability of low-V_T devices is also a prerequisite. In the past, the low-voltage latch has been implemented either by removing the current source [10] or by using transformers [11] to couple the signal between the differential pairs in the clock and data paths. The former solution has been demonstrated in 90-nm CMOS at speeds below 20 GHz. The latter has been used in a 60-Gb/s 2:1 multiplexer (MUX) clocked at 30 GHz, but its bandwidth is limited to that of the transformer. On the other hand, low-voltage transimpedance amplifiers (TIAs) in 90-nm CMOS operating from 1 V [12] and 0.8 V [13] have been recently reported, but at data rates of 38.5 Gb/s and 25 Gb/s, respectively. This paper presents the first 40-Gb/s full-rate D-type flip-flop (DFF) and the first 40-Gb/s TIA in CMOS. Operation at 40 Gb/s is made possible by combining low- and high-V_T transistors in the latch and by optimally biasing and sizing the transistors at the peak-f_T current density (Fig. 1). This paper is organized as follows. The design of a 40-Gb/s latch and decision circuit is described in Section II. A discussion of 90-nm General Purpose (GP) and 65-nm Low Power (LP) technology performance is presented in Section III. The differences between GP and LP processes for high-speed design are analyzed based on experimental data on devices, latches, and TIA circuits.
0018-9200/$25.00 © 2007 IEEE
II. DECISION CIRCUIT
Flip-flop circuits are the most critical digital blocks used in high-speed wireline and fiber-optic transceivers, equalizers, and mm-wave-sampling ADCs [14]. In a full-rate transceiver, the flip-flop must retime the data at a clock frequency equal to the data rate, while also removing the jitter. Fig. 2 illustrates the proposed MOS-CML latch schematic and its placement in a decision circuit. The data and clock signals are applied to the master-slave flip-flop through broadband TIAs. A MOS-CML output buffer drives the 40-Gb/s signal to 50-Ω loads.
A. Latch
In the proposed latch topology, the clock signal switches the differential pair transistors M1 and M2 from 0 to 0.3 mA/µm. The latter corresponds to the peak-f_T current density of nMOSFETs [12]. Equivalently, when the gate voltages of M1 and M2 are equal, the current density through each device is 0.15 mA/µm. To fully switch the 90-nm MOS differential pair, a voltage swing exceeding 300 mV per side is required [12]. For a 30 µm × 0.09 µm device with 4.5 mA and a load resistance of 40 Ω, the voltage swing at the output of each latch is mV (1) which is sufficient to fully switch the differential pair in the next stage and results in an inverter gain 1.2. The bandwidth of the latch is extended with shunt inductive peaking. For a fanout of and the input capacitance of the next stage equal to , the total capacitance at the drain of M3 is fF (2) and the inductance pH increases the [15] to
which is adequate for 40-Gb/s operation on the data path.
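The sizing figures quoted for the latch can be cross-checked with a short calculation. The sketch below multiplies the stated current densities by the 30-µm gate width and then by the 40-Ω load. Treating the output swing as drain current times load resistance, and evaluating it at the fully switched current, are assumptions made here for illustration, since the value of (1) is not reproduced in this text. The function names are hypothetical.

```python
# Sanity check of the latch sizing figures quoted in the text.
# Assumption (not from the source): swing = switched drain current x load resistance.

def drain_current_ma(density_ma_per_um: float, width_um: float) -> float:
    """Drain current in mA for a given current density and gate width."""
    return density_ma_per_um * width_um


def swing_mv(current_ma: float, load_ohm: float) -> float:
    """Voltage drop across the load in mV (mA times ohm gives mV)."""
    return current_ma * load_ohm


WIDTH_UM = 30.0            # gate width quoted in the text
LOAD_OHM = 40.0            # load resistance quoted in the text
REQUIRED_SWING_MV = 300.0  # per-side swing needed to switch the next pair

balanced_ma = drain_current_ma(0.15, WIDTH_UM)  # equal gate voltages
peak_ma = drain_current_ma(0.30, WIDTH_UM)      # fully switched, peak-f_T density

peak_swing_mv = swing_mv(peak_ma, LOAD_OHM)
swing_ratio = peak_swing_mv / REQUIRED_SWING_MV

print(f"balanced current: {balanced_ma:.2f} mA")
print(f"peak current:     {peak_ma:.2f} mA")
print(f"peak swing:       {peak_swing_mv:.0f} mV")
print(f"swing / required: {swing_ratio:.2f}")
```

Under these assumptions the balanced current comes out at 4.5 mA, matching the figure in the text, and the fully switched current produces a swing whose ratio to the 300-mV requirement is 1.2, which would be consistent with the quoted inverter gain.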
The second technique that improves the speed of the latch is the choice of devices with different V_T in the data and clock paths of the latch. As shown in Fig. 2
B. Data and Clock Buffers
The gain stages that provide the data and clock signals to the latches in the flip-flop are implemented as fully differential versions of the nMOS TIA with pMOS active load described in [16]. As illustrated in Fig. 2, an on-chip 1:1 vertically stacked transformer converts the single-ended external clock to a differential signal applied to the TIA input for testing purposes. Although the transformer limits the bandwidth of the clock tree at low frequencies to about 20 GHz, it is preferred over applying the clock signal to one side of the amplifier's differential input. Because the differential amplifier has no common-mode rejection, the clock signal would otherwise arrive at the latch with amplitude and phase mismatch. However, in a 40-Gb/s SERDES implementation a 40-GHz VCO would be integrated on chip and the transformer would be removed. The TIAs are followed by differential common-source stages with inductive peaking.
To tune out the parasitic capacitance in the feedback loop of the TIA, a 500-pH inductor, realized with vertically stacked windings in the top two metal layers of the process, is inserted in series with the feedback resistor. Its self-resonant frequency exceeds 100 GHz when a layout with 35-µm diameter and 2-µm conductor width is employed. The relatively large series resistance of the vertically stacked inductor can be absorbed in the feedback resistor. The pMOS current mirrors control the bias currents of the MOSFETs in the TIA stages and in the following common-source amplifiers, making them independent of temperature and power supply variations [17]. The role of the differential common-source stages with inductive peaking is also to provide the proper DC levels to the latches in the flip-flop. The 12-Ω common-mode resistor at the clock tree output lowers the DC voltage level at the gates of M1 and M2. The gate voltage must correspond to a drain current density of 0.15 mA/µm, such that the transistors switch from 0 to 0.3 mA/µm. It should be noted that a CML inverter with a tail current source cannot be employed in place of the M7-M8 differential pair due to lack of voltage headroom. More importantly, this bias scheme is robust to supply voltage variation from 1.1 V to 1.3 V. When the supply voltage increases above 1.1 V, the V_DS of the clock pair transistors in the latch increases commensurately. As illustrated in Fig. 1, this has no impact on the f_T of the nMOSFET, which remains practically constant at large V_DS.
C. Simulation Results
The decision circuit was simulated over process and temperature corners after extraction of layout parasitics with a 2 1 pseudorandom signal. The corresponding output 40-Gb/s eye diagram at 1.2 V is shown in Fig. 3 .
D. Experimental Results of Decision Circuit
The decision circuit was fabricated in two different 90-nm CMOS processes to investigate the portability of the design across foundries. All transistor sizes are identical and the passive components have the same values in both technologies. Both dies (Fig. 4) occupy 800 µm × 600 µm including the pads.
The circuits were tested on wafer with 67-GHz single-ended and differential probes. In the absence of a full-fledged 40-Gb/s bit error rate tester (BERT), the 40-Gb/s pseudorandom binary sequence (PRBS) data were generated by multiplexing four appropriately shifted pseudorandom streams at 10 Gb/s each. The external clock was provided by a low phase noise Agilent E8257D PSG signal source and data were captured by an Agilent Infiniium DCA-86100C oscilloscope with 70-GHz remote heads. It should be noted that contributions from the test setup and oscilloscope have not been de-embedded from the measured jitter, amplitude, and rise/fall times shown in Figs. 5-8 and 11. Fig. 5 reproduces the input and output eye diagrams at 30 Gb/s and 1.2-V supply, showing a significant reduction in jitter from 1.7 to 0.5 ps rms. The rise/fall times are improved from 14 ps to less than 7 ps (Fig. 6). Compared to the decision circuit of [18], where 40-Gb/s operation required 1.5 V, an improved clock distribution tree in this design allowed for 40-Gb/s full-rate retiming from 1.2 V (Figs. 7 and 8). Fig. 8 illustrates the output eye diagram at 40 Gb/s for a pseudorandom input pattern. The measured phase margin of the latch is 163°. The resulting bathtub curve at 40 Gb/s can be found in Fig. 9. Error-free operation was verified for an input pattern of 508 bits by capturing the input and output bitstreams on the sampling scope. Part of the captured bitstream at 40 Gb/s is shown in Fig. 10. Power dissipation at 1.2 V is 130 mW.
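The test-pattern generation described above, in which four shifted 10-Gb/s pseudorandom streams are multiplexed into one 40-Gb/s stream, can be sketched behaviorally. The example below is only an illustration: the PRBS7 polynomial (x^7 + x^6 + 1), the seed, and the lane offsets are assumptions, since the text gives neither the polynomial nor the shifts. It may be worth noting that 508 = 4 × 127, so the 508-bit pattern is at least consistent with four interleaved sequences of length 2^7 - 1.

```python
# Behavioral sketch: build a 4x-rate serial stream by bit-interleaving four
# shifted copies of one PRBS period. Polynomial, seed, and lane offsets are
# assumed for illustration; they are not taken from the source.

def prbs7(seed: int = 0x7F) -> list[int]:
    """Return one full period (127 bits) of PRBS7, x^7 + x^6 + 1."""
    state = seed & 0x7F
    bits = []
    for _ in range(127):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F
    return bits


def rotate(seq: list[int], shift: int) -> list[int]:
    """Cyclically shift a sequence to the left by `shift` positions."""
    shift %= len(seq)
    return seq[shift:] + seq[:shift]


def mux4(lanes: list[list[int]]) -> list[int]:
    """4:1 multiplexer: take one bit from each lane in turn."""
    return [bit for group in zip(*lanes) for bit in group]


base = prbs7()
lane_offsets = (0, 32, 64, 96)  # hypothetical shifts between the four lanes
lanes = [rotate(base, k) for k in lane_offsets]
serial = mux4(lanes)

print(len(serial), "bits in the multiplexed pattern")
```

Demultiplexing `serial[i::4]` recovers lane `i`, which is a convenient way to compare captured input and output bitstreams lane by lane.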
The decision circuit was tested across temperature for different supply voltages to verify the robustness of the latch biasing scheme in the absence of current sources. Measurements were conducted for supply voltages between 1 V and 1.5 V and at temperatures up to 100 °C. At 1-V supply and 100 °C, the maximum rate with retiming and jitter reduction is 32 Gb/s. Fig. 11 shows the 40-Gb/s eye diagram at 1.2 V and 100 °C. Even though no errors were observed in this case, the output jitter is not improved over that at the input, indicating that the clock path does not have enough bandwidth and that the latches do not retime the data. Table I compares this circuit to state-of-the-art latches in SiGe BiCMOS and InP technologies. The proposed MOS-CML latch has the lowest power dissipation. At 40 Gb/s, the CMOS latch consumes half the power of the 43-Gb/s SiGe BiCMOS latch.
III. SCALING TO 65-NM CMOS

A. Device Performance in 90-nm GP and 65-nm LP CMOS
The measured f_T of 90-nm GP and 65-nm LP nMOSFETs from two different foundries is summarized in Fig. 1. The measured data in Fig. 1 clearly indicate that the peak-f_T value occurs at the same current density irrespective of the device threshold voltage and technology node. As shown in Section II, this property of submicron MOSFETs can be applied in the design of high-speed digital circuits that are robust to threshold voltage, , and ultimately power supply voltage variation. Another important aspect unveiled by the measured data in Fig. 1 is that the threshold voltage of the low-V_T 65-nm LP MOSFETs is actually higher than that of the high-V_T 90-nm GP devices. At the same time, the f_T of the 65-nm LP FETs is slightly lower than that of the 90-nm GP ones. Both effects are due to the thicker gate oxide and slightly longer gate lengths of the 65-nm LP process. This behavior is the result of the requirement to reduce gate leakage in LP processes for RF and analog applications [19]. However, gate and subthreshold leakage pose no problem at mm-wave frequencies and in high-speed digital CML gates, where the tail current far exceeds the leakage currents [20].
B. Building Block Evaluation in 65-nm LP CMOS
To investigate the benefits of switching from 90-nm GP CMOS to 65-nm LP CMOS for 40-Gb/s applications, two TIA circuits and a static divider using the same topology as the 90-nm GP CMOS latches described earlier were designed and tested.
1) Transimpedance Amplifiers:
A 40-Gb/s CMOS TIA must be able to operate from a 1.2-V supply with more than 30 GHz bandwidth and low noise. Possible TIA topologies are shown in Fig. 12. The TIA with resistive load [Fig. 12(a)] requires a significant DC voltage drop on the load resistor in order to achieve adequate open-loop gain, making it impractical. One approach to increase the gain, while requiring only 0.6 V of DC headroom, is to replace the resistor with a pMOS active load [Fig. 12(b)], as has been shown in [12]. The loop gain of the TIA increases from to . The pMOS load is needed to increase the gain of the amplifier at low supply voltages, at the expense of higher capacitance at the output node. The latter effect is partially mitigated by the feedback inductor, which resonates out the parasitic capacitance of the nMOS and pMOS transistors. To further improve performance, while reducing the power dissipation, one can employ a typical CMOS inverter with resistive and inductive feedback [Fig. 12(c)]. As outlined in [16], the CMOS inverter offers the advantage of smaller size and lower bias current for the same performance. For example, the CMOS inverter with feedback resistor has a small-signal open-loop gain of (4) with an input resistance (5). Due to its higher transconductance, it can achieve the same noise impedance for about 1/3 the transistor size of an nMOS TIA [12].
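The benefit of the higher transconductance of the inverter stage can be illustrated with an idealized low-frequency model. The sketch below is a textbook shunt-feedback approximation, not the paper's expressions (4) and (5), which are not reproduced in this text. It models the gain stage as a transconductance driving a single output resistance, with the feedback resistor connected from output to input. All element values are hypothetical, and using the same output resistance for both topologies is a simplification.

```python
# Idealized low-frequency model of a shunt-feedback TIA stage.
# Textbook approximation with hypothetical values; not the paper's (4) and (5).

def shunt_feedback_tia(gm_total_s: float, r_out_ohm: float, r_f_ohm: float):
    """Return (open-loop gain, input resistance in ohm).

    Nodal analysis of a transconductor gm_total_s loaded by r_out_ohm with a
    feedback resistor r_f_ohm from output to input gives
    R_in = (R_F + r_out) / (1 + gm * r_out).
    """
    gain = gm_total_s * r_out_ohm
    r_in = (r_f_ohm + r_out_ohm) / (1.0 + gain)
    return gain, r_in


GM_N = 0.030   # nMOS transconductance in siemens (hypothetical)
GM_P = 0.020   # pMOS transconductance in siemens (hypothetical)
R_OUT = 100.0  # combined output resistance in ohm (hypothetical)
R_F = 200.0    # feedback resistor in ohm (hypothetical)

# nMOS TIA with pMOS active load: only the nMOS device amplifies the signal.
active_load_gain, active_load_rin = shunt_feedback_tia(GM_N, R_OUT, R_F)

# CMOS inverter TIA: both devices are driven, so their transconductances add.
inverter_gain, inverter_rin = shunt_feedback_tia(GM_N + GM_P, R_OUT, R_F)

print(f"active load: gain {active_load_gain:.1f}, R_in {active_load_rin:.1f} ohm")
print(f"inverter:    gain {inverter_gain:.1f}, R_in {inverter_rin:.1f} ohm")
```

In this simplified picture the inverter reaches a higher open-loop gain and a lower input resistance for the same devices, which is the qualitative reason a smaller inverter can match the nMOS stage.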
The nMOS TIA with pMOS active load [Fig. 12(b)] and the CMOS inverter TIA [Fig. 12(c)] were fabricated in the 65-nm LP technology. The values of the transistor total gate width, feedback resistor, and inductor are shown in Table II. In both cases, the core TIA stage is followed by a buffer, which drives the signal to the external 50-Ω load. The die photo of the 65-nm CMOS TIA is reproduced in Fig. 13. The circuit occupies an area of 300 µm × 370 µm including the pads. The core area of the TIA is 85 µm × 65 µm and the 600-pH inductor has a diameter of 10 µm with 0.5-µm metal width and is realized in the top three metal layers of the process. S-parameter, eye diagram, and noise measurements were performed on wafer. Due to the higher threshold voltage of the 65-nm LP MOSFETs compared to the GP technology (Fig. 2), and therefore the larger V_GS required for maximum gain and lowest noise [12], the supply voltage of the TIA exceeds 1.3 V. The measured S-parameters of the 65-nm CMOS TIA are provided in Fig. 14. The 3-dB bandwidth of the 65-nm CMOS and nMOS TIAs is 23 GHz and 21 GHz, respectively, from 1.5-V power supplies. We also note that the 65-nm nMOS TIA has lower bandwidth while operating from higher supply voltages than its counterpart implemented in 90-nm GP CMOS [12].
Noise parameter measurements were performed up to 26 GHz with a Focus Microwaves system. The and of the CMOS TIA is presented in Fig. 16 for various voltages. Fig. 17 illustrates the and versus frequency of both 65-nm TIAs and a 90-nm nMOS TIA from [12]. Despite its lower current, the CMOS TIA has lower due to its lower noise resistance and because the real part of its optimum noise impedance is closer to 50 Ω. Eye diagrams were measured for both TIA circuits with a 508-bit pseudorandom sequence having 100 mV amplitude. The output eye diagrams at 37 Gb/s are illustrated in Figs. 18 and 19. The bandwidth improvement is apparent in the eye diagrams, with the CMOS TIA having a larger eye opening. The better frequency response of the CMOS inverter TIA allowed for 40-Gb/s operation as shown in Fig. 20. For a power consumption of 6 mW in its gain stage, the circuit achieves 0.15 mW/Gb/s, while having a noise figure lower than 9 dB and 6 dB of gain.
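The power-efficiency figure quoted for the TIA follows directly from dividing power by data rate. The short sketch below repeats that arithmetic and, for comparison, applies it to the 10.8-mW latch power reported in the conclusion. The helper name is hypothetical, and reading the latch current as power divided by the 1.2-V supply is an assumption of this sketch.

```python
# Power efficiency as power per unit data rate, using figures from the text.

def mw_per_gbps(power_mw: float, rate_gbps: float) -> float:
    """Power efficiency in mW per Gb/s."""
    return power_mw / rate_gbps


RATE_GBPS = 40.0

tia_efficiency = mw_per_gbps(6.0, RATE_GBPS)     # TIA gain stage, 6 mW
latch_efficiency = mw_per_gbps(10.8, RATE_GBPS)  # latch, 10.8 mW

# Assumption: latch supply current = latch power / supply voltage.
latch_current_ma = 10.8 / 1.2

print(f"TIA gain stage: {tia_efficiency:.2f} mW/Gb/s")
print(f"latch:          {latch_efficiency:.2f} mW/Gb/s")
print(f"latch current:  {latch_current_ma:.1f} mA at 1.2 V")
```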
The 65-nm TIA experiments prove that, in a given technology node, the CMOS inverter TIA has lower noise figure, higher gain, and larger bandwidth, while consuming less than half the power of an nMOS TIA. The 40-Gb/s CMOS TIA also consumes less power than common-gate TIAs [13], [21]. However, when comparing the performance of the same nMOS TIA topologies in 90-nm GP and 65-nm LP technologies we find that, despite the lower metal pitch and area, the 65-nm LP circuits suffer from higher noise and lower bandwidth, and dissipate more power than the 90-nm GP ones. The performance of both TIA topologies in 65-nm LP CMOS is summarized in Table II (Comparison of TIA Topologies).
2) Static Divider: The static divider consists of two latches with feedback and an output driver (Fig. 21). The same latch topology as in Fig. 2 is employed and the 65-nm LP transistors have a total width of 46 µm and 36 µm with a corresponding current of 7 mA. The single-ended external clock is converted to a differential signal through a transformer. The gate bias of the clock transistors in the latch is applied at the center tap of the transformer. Fig. 22 shows the layout of the 65-nm LP CMOS static divider. Its measured self-oscillation frequency was 28 GHz and the circuit was verified to divide up to 36 GHz when biased from a 1.5-V supply. Clearly, the 65-nm CMOS TIA and static divider measurements indicate that, to reach 40 Gb/s, 65-nm LP CMOS circuits require supply voltages exceeding 1.2 V.
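The divide-by-two action of two latches with feedback can be shown with a purely logical model. The sketch below is a behavioral illustration only: it assumes ideal level-sensitive latches, with the master transparent while the clock is high and the slave transparent while it is low, and it ignores all analog effects such as self-oscillation and bandwidth limits. The function name and the sampling of the clock as a list of levels are hypothetical.

```python
# Behavioral model of a static divide-by-two: two level-sensitive latches
# in a loop, with the slave output inverted and fed back to the master.
# Idealized logic only; analog behavior of the real circuit is not modeled.

def static_divider(clock: list[int]) -> list[int]:
    """Return the slave-latch output for each clock sample."""
    master = 0
    slave = 0
    output = []
    for level in clock:
        if level:
            master = 1 - slave  # master transparent: loads inverted feedback
        else:
            slave = master      # slave transparent: copies the master
        output.append(slave)
    return output


clock = [1, 0] * 8  # eight clock periods, two samples per period
divided = static_divider(clock)

print("clock:  ", clock)
print("divided:", divided)
```

The output repeats every four samples while the clock repeats every two, so the output frequency is half the clock frequency.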
IV. CONCLUSION
Low-voltage circuit blocks have been fabricated for 40-Gb/s wireline communications in 90-nm GP and 65-nm LP CMOS technologies. A decision circuit achieves full-rate retiming at 40 Gb/s from 1.2 V with a power consumption of 10.8 mW in the latch. A CMOS inverter TIA with resistive and inductive feedback has higher bandwidth, lower noise, and larger gain than an nMOS TIA with active pMOS load, while consuming 1/3 of the current with a power dissipation of 6 mW. Biasing MOSFETs at the peak-f_T current density (0.3 mA/µm) in the latch for maximum speed and at the optimum noise figure current density (0.15 mA/µm) in the TIAs for low noise and maximum bandwidth ensures optimum performance across technology nodes and foundries. While these low-power topologies show for the first time operation in CMOS at 40 Gb/s from 1.
