Abstract-Many design challenges exist in achieving high-frequency clocking for high-speed applications. This paper describes a new clock distribution technique that places clock doublers in close proximity to sub-circuits to achieve higher data rates and, in many cases, reduce design complexity and power in serializers. A half-rate 4:1 serializer using this frequency-doubling clock distribution technique has been implemented in a 90 nm BiCMOS process. The design includes a $2^{10}-1$ pattern length LFSR with phase shifting logic as the testing circuit and a high bandwidth cascoded output driver. The chip has dimensions of 1.8 × 2.2 mm and consumes 5.78 W from a 3.4 V supply voltage at 140 Gb/s.
I. INTRODUCTION
ADVANCEMENTS in silicon germanium technology continue to make strides in meeting the demands of high-speed communication systems [1]-[3]. IBM's 0.13 µm SiGe technology generation has achieved data rates of up to 132 Gb/s [4]. Recent advances in SiGe bipolar processes have led to 90 nm lithography with a 300 GHz $f_T$ and a 360 GHz $f_{max}$ [5]. However, even with this technology advancement, higher frequency signaling poses challenges for the clock distribution network and phase-locked loop (PLL) systems. As the operating frequency of these systems increases, so do design complexity and the requirement for stricter noise performance. A lower frequency clock distribution is advantageous when trying to achieve higher data rates because modeling complexity is reduced.
Frequency doublers have been widely explored in wireless communications and analog transmitters; however, there has been very little work on digital system implementation. This paper proposes to distribute a lower frequency clock signal throughout the chip and double its frequency via frequency (clock) doublers in close proximity to the required sub-circuits. Through simulation, such concepts in CMOS clock distribution networks have been shown to reduce power consumption by 50.2% [6]. As in CMOS, this power reduction approach can also be implemented in SiGe bipolar as current is proportional to performance. The authors seek to expand on this concept by introducing a more aggressive and advanced approach to clocking high-speed data serializers with frequency (clock) doublers to obtain higher data rates and, in many cases, reduce design complexity and power consumption. To demonstrate this, the concept is applied to a 4:1 serializer.
T. G. Neogi is with Global Foundries, Malta, NY 12020 USA (e-mail: tuhin.neogi@gmail.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2015.2472600
II. MOTIVATION
A timeline for the past 15 years of the state-of-the-art SerDes using SiGe across many generations can be viewed in Fig. 1 . This work looks to push the state-of-the-art serializers beyond these data rates using more advanced SiGe technologies.
Currently, digital systems work on a standard full-rate, half-rate, or even quarter-rate architecture where a clock signal is distributed from its respective at-rate clock source; e.g., a 40 GHz half-rate clock source producing an 80 Gb/s data rate. Half-rate architectures are a good compromise between higher data rates and the risks of duty-cycle distortion (DCD) and clock skew. Fig. 2 shows the conventional 4:1 half-rate serializer architecture where a half-rate clock is sourced; e.g., for a 160 Gb/s serial output, an 80 GHz clock is sourced. For typical serial applications, this creates more design challenges for PLL systems [13]. When using conventional high tuning range VCOs, such as ring oscillators, it becomes difficult to obtain high frequencies with low phase noise. An LC-type VCO can offer improved phase noise but with dramatically reduced tuning range. One way of minimizing the clock distribution network shown in Fig. 2 is to place the clock source in very close proximity to the last stage 2:1 and clock divider. However, for cases where serializers are integrated with other circuitry and require the clock source to be more remote, such as multi-channel SerDes using common clocks, the clock distribution can be quite extensive. This would lead not just to increased power consumption but also to increased design complexity, as a larger layout footprint and stricter modeling of the transmission lines would be needed to account for the effects of crosstalk and noise.
0018-9200 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
III. CLOCK DOUBLERS FOR DIGITAL SYSTEMS
There currently exists a wide range of frequency doubler implementations. Frequency doublers can be generally categorized into two types: passive and active.
Passive-type frequency doublers rely on tuned LC-passive elements or impedance to select the second harmonic of the input signal [16] - [19] . The drawback to this is a much larger layout footprint that is problematic for digital system integration. Active-type frequency doublers rely on the switching performance and parasitics of the device. This type can be configured in two ways: in-phase and quarter-phase.
In-phase clock doublers require a single-phase input clock signal and thus a single clock buffer path. Compared with quarter-phase doubling, this contributes to lower power consumption and lower layout complexity when doubling locally. In-phase doubling works by direct modulation of the signal, typically by means of a double-balanced Gilbert-type mixer producing a differential output [20], [21]. However, it is difficult to maintain a 50% duty cycle on the doubled clock over a wide frequency range. For half-rate architecture digital systems, DCD needs to be minimal.
Quarter-phase doubling requires two input signals with a 90° phase difference which are routed into an XOR circuit [22]. The drawback to this is the challenge of either maintaining a 90° phase difference between the two inputs [6] or making use of proper threshold biasing [13] in order to minimize DCD over a very wide frequency range. When doubling locally, two clock paths may be required, thus increasing clock distribution complexity and power consumption, which can also be problematic for large digital systems.
Borrowing a concept from RF, a mixer cell can be configured to double a clock signal by connecting its two inputs (LO and RF) together as shown in Fig. 3. With sine wave clock signals, a mixer can be mathematically represented as two signals multiplied together in the time domain.
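For a sinusoidal clock, this multiplication can be written out explicitly. The following is a standard trigonometric identity; the amplitude $A$ and angular frequency $\omega$ are symbols introduced here only for illustration.

```latex
\[
\begin{aligned}
v_{out}(t) &\propto \big(A\sin\omega t\big)\big(A\sin\omega t\big) \\
           &= \frac{A^{2}}{2}\big(1-\cos 2\omega t\big)
\end{aligned}
\]
```

The product contains a DC term and a single component at $2\omega$, so removing the DC term leaves a clock at twice the input frequency.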
The objective of this work is to find a compact design that can be implemented in close range to the destination circuitry. For this purpose, the use of passive elements, such as capacitors and inductors, should be minimized. The number of transistors should also be minimized. Furthermore, the design should only require a single-phase clock input to avoid distributing two independent signals. Therefore, an in-phase active-type doubler was used.
IV. ARCHITECTURE
A simplified block diagram of the proposed 4:1 serializer is illustrated in Fig. 4. The 4:1 serializer uses standard 2:1 half-rate multiplexor stages: it takes 4 parallel bits from the 35 Gb/s pseudo-random bit sequence (PRBS) generator and uses a half-rate clock of 70 GHz produced by the clock doubler to generate the 140 Gb/s output. Each standard 2:1 half-rate multiplexor contains a 5-latch design. The combined latch configuration results in an edge-triggered flip-flop which better retimes the incoming data.
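The data flow and rate arithmetic described above can be sketched behaviorally. This is a minimal model, not the circuit: the function names are hypothetical, and the pairing of lanes (0 with 2, 1 with 3) in the first stage is an assumption chosen so that the output bit order is 0, 1, 2, 3.

```python
def mux2(a, b):
    """Behavioral 2:1 half-rate mux: emits one bit from `a`, then one from `b`,
    per input bit period (one bit on each phase of the clock)."""
    out = []
    for bit_a, bit_b in zip(a, b):
        out.extend([bit_a, bit_b])
    return out


def serialize_4to1(d0, d1, d2, d3):
    """Tree of three 2:1 muxes (assumed lane pairing: d0/d2 and d1/d3)."""
    even = mux2(d0, d2)      # first stage, runs at 2x the lane rate
    odd = mux2(d1, d3)       # first stage, runs at 2x the lane rate
    return mux2(even, odd)   # last stage, runs at 4x the lane rate


def output_rate_gbps(distributed_clock_ghz):
    """Distributed clock -> doubled (half-rate) clock -> serial data rate."""
    half_rate_clock_ghz = 2 * distributed_clock_ghz  # local clock doubler
    return 2 * half_rate_clock_ghz                   # two bits per clock cycle


if __name__ == "__main__":
    print(serialize_4to1([0, 4], [1, 5], [2, 6], [3, 7]))
    print(output_rate_gbps(35), "Gb/s")
```

With a 35 GHz distributed clock the model gives the 70 GHz half-rate clock and 140 Gb/s output quoted in the text.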
A. PRBS
A PRBS is often used to test high-speed serial interface devices because it generates nearly random bit patterns and yet is still predictable, so that correctness or bit error rate (BER) can be verified. The higher the order of the PRBS, the longer the pattern length and the more test cases are produced. A consequence of a higher order PRBS is the increased layout space required. A 10-register PRBS is chosen in this design for its ability to provide a sufficient test sequence with a reasonable amount of layout space. An even order PRBS was also desired so that interconnecting wires between registers could be of equal and minimal length.
The 4-way PRBS generator uses a circular linear feedback shift register (LFSR) architecture illustrated in Fig. 5. For maximum length bit sequences the taps need to be made at the 10th and 7th stages, thus the polynomial is $x^{10} + x^{7} + 1$. The LFSR has a dead state where all registers contain zeros. A single reset register is used to introduce a one in the LFSR path to get the LFSR out of the dead state and into normal sequential operation. In normal sequential operation, it then generates a test sequence of 1023 bits in length. To achieve a multiple-bit output of purely pseudo-random behavior, bit sequence shifting is required [12], [23]. In this case, requiring 4 uncorrelated bit patterns, each output bit sequence must have a shift distance of a quarter of the pattern length, or about 256 digits. Therefore, the 1/4, 1/2, and 3/4 points of the pattern need to be calculated by (1).
However, the propagation delay among the outputs is non-uniform, resulting in severe data skew. In order to minimize propagation delay and power consumption, each of the four output digits needs to be rotated to find a scenario where each output requires nearly the same number of XORs and the shortest gate-depth. In this case, a gate-depth of 2 was desired. A rotational factor of 355 was found to give the minimum uniform XOR inputs of 3 or 4, yielding (2).
Each gate stage contains an XOR-merged flip-flop (FF) to further reduce propagation delay and synchronize the output. The configuration for a 3- and 4-input XOR is shown in Fig. 6. To keep clock loading uniform, a D-FF is added in Fig. 6(b).
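The LFSR behavior and the quarter-pattern shift can be checked with a short behavioral model. This is a sketch under stated assumptions, not the XOR phase-shifting network of the paper: the function names are hypothetical, the register is modeled as a simple shift register with feedback from stages 10 and 7, and the four lanes are produced by indexing into one stored period rather than by logic. It illustrates why a shift of 256 works: since 4 × 256 = 1024 ≡ 1 (mod 1023), interleaving four copies of the sequence offset by 256 bits yields a stream that obeys the same LFSR recurrence.

```python
PERIOD = 2**10 - 1   # 1023 bits for a 10-register maximal-length LFSR
MASK = 0x3FF         # 10-bit state


def step(state):
    """Advance the LFSR one clock: feedback is stage 10 XOR stage 7."""
    feedback = ((state >> 9) ^ (state >> 6)) & 1
    return ((state << 1) | feedback) & MASK


def prbs10(n_bits, seed=1):
    """Return n_bits taken from the output of the 10th stage."""
    state = seed & MASK
    if state == 0:
        raise ValueError("all-zero state is the dead state")
    out = []
    for _ in range(n_bits):
        out.append((state >> 9) & 1)
        state = step(state)
    return out


def lfsr_period(seed=1):
    """Count clocks until the state returns to the seed."""
    state, steps = step(seed & MASK), 1
    while state != (seed & MASK):
        state, steps = step(state), steps + 1
    return steps


def four_lanes(seq, n_bits, shift=256):
    """Four copies of one period `seq`, lane k offset by k * shift bits."""
    return [[seq[(n + k * shift) % PERIOD] for n in range(n_bits)]
            for k in range(4)]


def interleave(lanes):
    """Bit-interleave the lanes, as an ideal 4:1 serializer would."""
    return [bit for group in zip(*lanes) for bit in group]


if __name__ == "__main__":
    seq = prbs10(PERIOD)
    serial = interleave(four_lanes(seq, PERIOD))
    ok = all(serial[i] == serial[i - 7] ^ serial[i - 10]
             for i in range(10, len(serial)))
    print("period:", lfsr_period(), "serial stream obeys recurrence:", ok)
```

The all-zero seed is rejected because the register would never leave it, which mirrors the role of the reset register described above.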
B. Clock Distribution
The clock distribution consists of an input buffer that accepts a single-ended 35 GHz clock signal sourced off-chip and a series of differential cascading trans-admittance stage (TAS) and transimpedance stage (TIS) amplifiers to distribute the clock signal throughout the chip. Fig. 7 shows the block diagram of the clock distribution network for the 4:1 multiplexor. TAS and TIS amplifiers were chosen for their high drivability and large voltage swings [24] .
The timing of the clock signals to the first stage of multiplexors with respect to the last stage is critical. Since the clock paths of the first and second multiplexor stages differ in terms of component circuitry and physical distance, phase interpolator (PI) circuits were used to vary the input clock phase to the first multiplexor stage while keeping the last stage fixed.
For any synchronous digital circuit, clock distribution is the most critical aspect of high performance. Any mismatch in length and loading between differential lines can cause skew and ultimately eye closure. For this reason, design symmetry and H-tree layout concepts were used throughout.
A. BiCMOS Latches and Flip-Flops
A CML latch is shown in Fig. 8 and is commonly used in high-speed applications [4] , [12] . The circuit is also the most balanced design with level-2 logic inputs and output ensuring the latching and holding transistors never reach saturation. The clock signal is decoupled locally via emitter followers for enhanced isolation and improved driving of the load capacitance seen at the input of the differential transistor pair. A local biasing technique was used in the design for not only layout compactness but also its increased tolerance to variable conditions such as voltage droop and temperature. Emitter degeneration was used for more accurate current mirroring.
B. TAS and TIS Clocking
The TAS and TIS circuit schematic is shown in Fig. 9. This type of amplifier is based on the negative-feedback type proposed by Cherry and Hooper [24], as are other high performing circuits [4], [13], [25]. The implementation utilizes a strong impedance mismatch between stages for a very high gain-bandwidth product, in which both gain and bandwidth are insensitive to transistor variations [24].
Each TAS and TIS component is connected via a transmission line. The circuits were set up to have both the input and the output at a level-2 voltage, allowing for easy construction of cascading structures. Through the first emitter follower (EF) pair, the signal reaches the open-collector transistor pair with a level-3 voltage, ensuring the transistor pair never reaches saturation. The resistor combined with the feedback EF prevents the input transistor pair of the TIS from reaching saturation, thus always maintaining a sufficient collector-emitter voltage for maximum performance.
The feedback resistor and the pull-up resistor of the TIS both contribute to the gain. The feedback resistor should be made large but not large enough for it to cause voltage headroom issues that would limit output voltage swing and gain.
The transmission line should be sufficiently inductive in order to compensate the capacitive effects at the input transistors of the TIS [4]. Modeling of the transmission lines was done in the Sonnet EM solver, where the layout is extracted as shown in Fig. 10. Here, the most critical transmission line is examined, as all three signals are routed to the inputs of the clock doubler. While sufficient bandwidth needs to be verified, clock arrival times and clock skew need to be examined as well. Adjusting the vias and cornering of wires, and thus effectively the length, allows tuning of the arrival times to avoid clock skew at the most critical high-speed components of the last 2:1 multiplexor. Fig. 11 shows the large-signal amplitude vs. frequency plot of the TIS, the fan-out-of-three complex transmission line in Fig. 10, and the TAS when configured together. Cut-off is determined as 0.707 of the near-DC amplitude. Simulations show that with peak-bias currents, the system achieves a cut-off frequency of 100 GHz. When the bias current is reduced by a factor of 3, the circuit maintains sufficient amplitude at 50 GHz.
C. Phase Interpolator
The phase interpolator shown in Figs. 12 and 13 incorporates a MUX style configuration that linearly selects between a fixed slow path (A) and an unmodified fast path (B) [16] . Emitter degeneration is used to obtain linearity at the transistor pair. The amount of phase shift is dependent on the frequency of operation and the number of delay elements used in the slow path. A larger amount of delay will increase the range of phase shift but will produce a lower output voltage swing at the midpoint of phase shift, and thus may require extra buffering.
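The dependence of phase shift on operating frequency follows from converting a fixed delay range into degrees of one clock period. A minimal sketch, using the roughly 6 ps range and 35 GHz clock reported for this design (the function name is hypothetical):

```python
def phase_range_deg(delay_range_ps, clock_ghz):
    """Phase tuning range in degrees for a given delay range and clock frequency.

    One clock period is 1000 / clock_ghz picoseconds and spans 360 degrees.
    """
    period_ps = 1000.0 / clock_ghz
    return 360.0 * delay_range_ps / period_ps


if __name__ == "__main__":
    print(round(phase_range_deg(6, 35), 1), "degrees")
```

For 6 ps at 35 GHz this evaluates to 75.6°, i.e. about 76°; the same delay range covers proportionally less phase at lower clock frequencies.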
The main purpose of the PI circuit was to provide fine tuning of the clock for lab demonstration and to cover the delay difference between the clock doubler path and first stage multiplexor path. Fig. 14 shows the varied output phase range of about 6 ps. At a frequency of 35 GHz, this is about a 76° phase tuning range.
D. Clock Doubler
Shown in Fig. 15, the clock doubler contains three main stages: a doubler-core, a single-ended-to-differential active balun, and an inductively peaked differential amplifier. The front-end of the doubler-core uses a single pair of emitter followers and diodes to level shift the input signals and minimize mismatch of arrival times. A Gilbert cell design was chosen as the doubler-core. In-phase doubling with a Gilbert cell produces a high output during the current transition of the ECL pair. When fully switched, a low output is produced. Therefore, two output-high intervals and two output-low intervals occur in each cycle of the input, as shown in Fig. 16. When the ECL pair is in its switch transition, there is a point where the bias current is split evenly between the ECL transistors and the current that reaches the resistor is half the bias current. When the ECL pair is fully switched, the full bias current reaches the resistor. These two currents set the high and low levels of the output voltage swing. For high slew rate signals such as sine waves this poses the risk of uneven high and low times, as shown in Fig. 16(a); however, due to wire and device parasitics and thermal effects, the signal seen by the Gilbert cell is more of a triangle wave, as shown in Fig. 16(b). Thus, the high and low times at the output are more even. To aid this effect, large transistors can be used to increase capacitance, thus distorting the signal. Another option is to place small capacitors at the input of the doubler-core; however, this will reduce the bandwidth of the circuit if made too large.
It is important to note that by reducing the slew rate of the input signals to the clock doubler, the clock doubler becomes more susceptible to noise and crosstalk, which can result in increased jitter at the output.
Ideally, the current will only flow through the left pull-up resistor; however, due to the unavoidable difference in arrival times at the two transistor level inputs and other non-idealities, current will leak through the right side as well. When both pull-up resistors are used, a phase difference between the two differential outputs develops as shown in Fig. 17, leading to reduced differential amplitude. One way to compensate for this is to use microstrip transmission line inductors [20]; however, this increases layout size. Another option is to omit the right side resistor and pull the rail directly to the supply. Under this configuration, the resulting output is single-ended and thus a balun is required to generate the differential signal.
The output is AC-coupled and re-biased via a pair of coupling capacitors and pull-up/down resistors. The single-ended doubled frequency signal is converted to differential via an active balun. For this configuration, the base of one of the transistors has to be set at an appropriate DC level relative to the incoming signal for a balanced duty cycle output. The input carrying just the DC signal is capacitively coupled to the output of the doubler circuit that is pulled high. Due to its proximity to the doubler circuit, noise similar to that on the signal-carrying input will likely couple onto it as well, which the high CMRR of differential signaling then rejects.
The differential signal is then routed to a high-frequency amplifier to drive the latches and wire. This amplifier uses inductive peaking to extend the operational bandwidth with a higher gain. The output emitter followers are sized at twice that of the previous emitter follower pairs. This extra current helps drive wire and load at higher frequencies. Fig. 18 shows the 3D metal layout of the clock doubler in Sonnet with transistors removed. Capacitors and resistors are shown as placeholders and are only spatially accurate. Simulations in Sonnet took into account metal thicknesses, but for illustrative purposes, they are not shown in the figure.
The input differential signal arrives from the right side of Fig. 18 where it is brought down to the lower level metal and quickly doubled at the doubler-core. Once the signal reaches the doubled frequency, it is quickly brought up to the higher metal layer to conveniently connect to the capacitor and to better isolate it from substrate noise. Once the single-ended signal is converted to differential via the active balun, the middle metal layers are used to isolate the signal from power supply noise and substrate noise. A slightly higher and thicker wire is used to transfer the output to the high-speed flip-flops.
Thick middle metal layers are used for the local power rails instead of the highest level metals. This approach offers a good ground plane for the peaking inductors and offers bypass capacitance characteristics to minimize noise development on the power supply rails. High-density capacitors are added to increase this capacitive effect which can be seen at the end of each inductor in Fig. 18 .
The inductor was developed using the highest copper metal layer available. Due to limited space, a compact and customized inductor was required. To provide a more accurate model, the inductors were designed and modeled using Sonnet, and later imported back into Cadence for verification.
E. High-Speed Output Driver
The output driver shown in Fig. 19 is a double-terminated cascoded amplifier that is designed to drive external 50 ohm loads for testing. The pull-up resistors are set to 50 ohm to match the impedance of the transmission line and the 50 ohm termination at the oscilloscope. This offers good matching and reduced reflections, but requires twice the current because the effective impedance at the collectors is half of that in open-collector topologies. The level-2 input voltage is converted to level-3 for the ECL transistor pair via relatively large emitter followers. The bases of the transconductance transistor pair are tied to a large bias resistor pulled to the supply rail. With a theoretically constant and small current drawn from both bases, a sufficient bias voltage develops. Therefore, the DC component of this bias voltage can be represented by (3).
This bias voltage allows sufficient headroom for large output voltage swings and prevents the transistor pair from saturating.
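The bias relation described above can be sketched from Ohm's law alone. All symbols here are placeholders introduced for illustration and are not the paper's notation: $V_{CC}$ is the rail to which the bias resistor $R_B$ is assumed to be tied, $I_B$ is the base current of each of the two transistors, and $\beta$ is their current gain.

```latex
\[
V_{bias} \approx V_{CC} - 2\,I_B R_B ,
\qquad I_B \approx \frac{I_C}{\beta}
\]
```

Under these assumptions, a larger $R_B$ lowers the bias voltage for the same base current, which is how the resistor value trades off output headroom against the risk of saturation.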
VI. FABRICATION AND LAYOUT
The design was fabricated on a 90 nm process with ten metal layers. The chip shown in Fig. 20 has dimensions of 1.8 × 2.2 mm and utilizes four sets of 10-pin pads with dimensions of 100 × 100 µm at a 150 µm pitch.
Conventionally, the two top-most thick metal layers are used to distribute the power from the pads to the circuits. Two sets of two middle thick metals were used for the clock signal. The thinner set was used for lower frequency clock signals whose lengths did not exceed 300 µm. For longer wires and at frequencies beyond this, the thicker set was used. For lines of critical interest, such as clock lines longer than 300 µm and clock lines carrying high frequencies, the wires were modeled and extracted using the Sonnet EM solver and imported to Cadence Spectre simulation for verification.
Layout symmetry is of the utmost importance when designing high-speed synchronous systems, as can be seen in Fig. 21. Design considerations need to be made to reduce coupling, crosstalk, and switching noise. For lines carrying high frequencies, skin effect needs to be taken into account. Calculating the skin depth and its increased resistive effect for rectangular cross-sectional wires is complex; however, calculation methods do exist [26]. It is also important to note that square cross-sectional wires have the highest skin effect resistance ratio, and thus wires should be made thin, whether vertically or horizontally. Vertically thin wires are more attractive due to their reduced vertical capacitive coupling to the substrate or power rails.
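For a rough sense of scale, the classical skin depth of a good conductor, $\delta = \sqrt{\rho / (\pi f \mu)}$, can be evaluated at the clock frequencies used here. This is a first-order sketch only: it assumes bulk copper resistivity (on-chip metal is typically more resistive), a non-magnetic conductor, and it ignores the rectangular cross-section effects treated in [26]. The function name and constants are illustrative.

```python
import math

MU_0 = 4e-7 * math.pi   # vacuum permeability, H/m
RHO_CU = 1.68e-8        # assumed bulk copper resistivity, ohm*m


def skin_depth_m(freq_hz, resistivity=RHO_CU, mu_r=1.0):
    """Classical skin depth of a good conductor, in meters."""
    return math.sqrt(resistivity / (math.pi * freq_hz * MU_0 * mu_r))


if __name__ == "__main__":
    for f_ghz in (35, 70):
        depth_um = skin_depth_m(f_ghz * 1e9) * 1e6
        print(f"{f_ghz} GHz: {depth_um:.3f} um")
```

Under these assumptions the skin depth is roughly a quarter of a micron at 70 GHz, which is why wire thickness beyond a few skin depths adds little conductance at these frequencies.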
Another concern is voltage droop throughout the chip and this comes in two forms: uniform and non-uniform droop. Both should be minimized; however non-uniform droop causes the most performance degradation due to clock/data skew, and is more difficult to compensate for.
Increased temperature reduces carrier mobility in the transistors and, when non-uniform throughout the chip, causes clock/data skew as well. Ideally, transistors should be placed compactly in order to minimize interconnect wire length. However, due to differences in bias current levels, local hot spots can develop. Given this, transistors of like functionality, e.g., differential pair transistors, were placed relatively close together and separated from transistors of different functionality by greater than 5 µm. For example, differential pair transistors are placed close together and separated from current source transistors.
VII. TESTING AND RESULTS
To properly test the serializer it is important to analyze the bit stream for functional correctness and measure the quality of the signal via an eye measurement. Figs. 22 and 23 show the measurement setup. Testing of the serializer was accomplished on-die using two 10-pin 50 GHz probes, one 10-pin power probe, and one 110 GHz dual-pin probe. The high-speed serial output differential signal was sent off-chip via a 110 GHz dual-pin probe and 3" length 110 GHz 1 mm cables to remote sampling modules (86118A H01) of an Agilent DCA-X 86100D Wide-bandwidth Oscilloscope with 86107A precision time-base module. A divided down signal of the clock provided from on-chip was sent off to the oscilloscope as the trigger signal. For improved jitter measurements, a second trigger option was used via the Agilent 4068A-40 Clock divider. The clock source was provided from a 250 kHz-50 GHz Keysight (Agilent) PSG Analog Signal Generator with ultra-low phase noise option UNY, and sent on-chip via Pasternack 10-50 GHz PE2079 power splitters, a 12" 50 GHz cable, and a 50 GHz probe.
Measurements were done at various supply voltages ranging from 3.2 to 3.4 V, drawing a total current between 1.3 and 1.7 A, respectively. The serializer-core comprises roughly 30% of the total power. This includes the 4:1 multiplexor, its respective clock distribution network, and the output driver. Fig. 25 shows eye diagrams and BER bathtub plots at 80 and 128 Gb/s, respectively. Fine tuning of the PI became more useful in cleaning up the output data signals by improving sampled bit transitions at 120 Gb/s and higher. By using a linear feedforward equalizer (LFE) to compensate for high frequency loss effects, 136 and 140 Gb/s can be obtained. A 4-tap 2-precursor LFE with tap point values of 0.147345, 0.284373, 1.342589, and 0.205561 was used. Fig. 26 shows eyes and BER bathtub plots at 136 and 140 Gb/s. BER contour plots were extracted using Keysight's 86100-401 Advanced Eye Analysis as shown in Fig. 27. Fig. 28 shows 160 Gb/s eye diagrams. This hints that with proper enhancement 160 Gb/s could be obtained.
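A linear feedforward equalizer of this kind is a short FIR filter applied to the sampled waveform. The sketch below shows one way such a 4-tap, 2-precursor filter could be applied in post-processing. It is an illustration, not the oscilloscope's implementation: the function name is hypothetical, the tap values are taken as printed above, samples are assumed to be spaced one unit interval apart, and treating the third tap as the main cursor is an assumption that follows from "2-precursor".

```python
TAPS = [0.147345, 0.284373, 1.342589, 0.205561]  # values as printed in the text
N_PRECURSOR = 2  # assumed: taps 0-1 are precursors, tap 2 is the main cursor


def ffe(samples, taps=TAPS, n_pre=N_PRECURSOR):
    """Apply y[i] = sum_k taps[k] * x[i + n_pre - k], treating out-of-range x as 0.

    Precursor taps (k < n_pre) weight samples that arrive after x[i];
    postcursor taps (k > n_pre) weight samples that arrived before it.
    """
    n = len(samples)
    out = []
    for i in range(n):
        acc = 0.0
        for k, coeff in enumerate(taps):
            j = i + n_pre - k
            if 0 <= j < n:
                acc += coeff * samples[j]
        out.append(acc)
    return out


if __name__ == "__main__":
    impulse = [0] * 10
    impulse[5] = 1
    print(ffe(impulse))
```

Feeding a single-sample impulse through the filter reproduces the tap values centered on the main cursor, which is a quick way to confirm the indexing convention.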
VIII. CONCLUSION
A 140 Gb/s serializer in a 90 nm SiGe bipolar technology using a unique clock distribution technique has been developed and demonstrated. It accomplishes this by distributing a quarter-rate clock throughout the chip and obtaining the half-rate clock via compact clock doublers in close proximity to the required sub-circuits. At 140 Gb/s, the design consumes 5.78 W total power from a 3.4 V supply and has an energy efficiency of 12.5 pJ/b for the serializer-core.
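The power and energy-per-bit figures can be cross-checked from the measured supply voltage and current. A minimal sketch, assuming the 1.7 A draw at 3.4 V and the "roughly 30%" serializer-core share reported in the measurements; the function names are hypothetical.

```python
def total_power_w(supply_v, current_a):
    """DC power drawn from the supply."""
    return supply_v * current_a


def energy_per_bit_pj(power_w, data_rate_gbps):
    """Energy per bit in pJ: 1 W at 1 Gb/s is 1 nJ/b, i.e. 1000 pJ/b."""
    return power_w / data_rate_gbps * 1e3


if __name__ == "__main__":
    p_total = total_power_w(3.4, 1.7)
    p_core = 0.30 * p_total  # "roughly 30%" share, so the result is approximate
    print(f"total power: {p_total:.2f} W")
    print(f"core energy per bit: {energy_per_bit_pj(p_core, 140):.1f} pJ/b")
```

With an exact 30% share the result is about 12.4 pJ/b, consistent with the quoted 12.5 pJ/b given that the share is only stated approximately.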
