Abstract-This paper presents a 28 Gb/s multistandard SerDes macro fabricated in a TSMC 28 nm CMOS process. The transimpedance amplifier (TIA)-based analog front-end achieves a 15 dB high-frequency boost with a compact on-chip passive inductor. The adaptation loop for the boost is decoupled from the decision feedback equalizer (DFE) adaptation by the use of a group-delay algorithm. The DFE is a half-rate, first-tap-unrolled design with only two error latches in total for power and area reduction. A two-stage sense-amplifier-based latch achieves a sensitivity of 15 mV. The high-speed clock buffer uses a PMOS active inductor circuit with common-mode feedback to optimize the circuit performance. The transceiver achieves error-free operation at 28 Gb/s with 34 dB channel loss, consumes a worst case power of 560 mW/lane, and fully complies with multiple standards and applications.
I. INTRODUCTION
AS DATA center traffic increases by 20% to 30% a year, the demand for high-speed SerDes is becoming critical for deploying new standards and protocols in communication systems that address the growing bandwidth demand. To meet new standards such as OIF CEI-25G-LR, CEI-28G-MR/SR/VSR, IEEE 802.3bj, and 32G-FC, data rates have increased to 25 to 28 Gbps, more than 75% higher than the previous generation of SerDes, which operates at 10 to 16 Gbps. A high-speed SerDes must meet multiple challenges, including high-speed operation, intensive equalization, low power consumption, small area, and robustness. Particularly for SerDes applications with several hundred lanes integrated in a single chip, power consumption is a very important factor in maintaining high performance. There are several previous works at 28 Gbps or higher data rates [1]-[4]. Those works use an unrolled decision feedback equalizer (DFE) to meet the critical timing margin, but the unrolled DFE structure increases the number of DFE slicers, which increases the overall power and die area. To tackle these challenges, we introduce several circuit and architecture blocks. One of the key designs is an analog front-end (AFE) that achieves a 15 dB high-frequency boost at 14 GHz. To do this, we use a high-speed transimpedance amplifier (TIA) with on-chip compact inductor feedback. The TIA amplifier is a very simple inverter-based structure with a tail current for high bandwidth. The boost is adaptive, and its adaptation loop is decoupled from the DFE adaptation loop by the use of a group-delay adaptation algorithm. The DFE is a half-rate, first-tap-unrolled structure with only two error latches in total for power and area reduction. A two-stage sense-amplifier-based slicer is developed to achieve a sensitivity of 15 mV with a long regeneration time and DFE timing closure. We also developed a high-speed clock buffer that uses a new active inductor circuit. 
This active inductor circuit has the capability to control output common-mode voltage to optimize circuit operating points.
II. SERDES SYSTEM ARCHITECTURE

A. SerDes Subsystem
The SerDes subsystem consists of four lanes and two PLLs, as shown in Fig. 1. Each lane can take its clock from either PLLA or PLLB. With this configuration, the subsystem supports rate agility on a per-lane basis with dual PLLs and independent clock muxes. It also has half-rate, quarter-rate, and 1/8th-rate modes to support multiple standards.
B. Receiver Block
Fig. 2 shows the entire SerDes block diagram. The receiver (RX) input stage has a bump inductor followed by a termination resistor. The bump inductor is placed under the bump, so there is no area penalty, and it also improves the return loss performance. The signal then goes through an on-chip ac-coupling capacitor Cac. The capacitance of Cac is 4 pF, and the associated resistance of 120 kΩ forms the high-pass filter (HPF). The HPF corner frequency is 330 kHz, which is sufficiently low to support a wide frequency range.
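The quoted corner frequency can be checked with the first-order RC high-pass relation. The sketch below assumes an ideal single-pole filter formed by the 4 pF capacitor and the 120 kΩ resistance; the function name is ours and is only for illustration.

```python
import math


def hpf_corner_hz(r_ohm: float, c_farad: float) -> float:
    """Corner frequency of an ideal first-order RC high-pass: 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


# Values quoted in the text: 120 kOhm and 4 pF.
f_c = hpf_corner_hz(120e3, 4e-12)
print(f"HPF corner: {f_c / 1e3:.1f} kHz")
```

The result lands close to the 330 kHz stated in the text.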
When an ac-coupling capacitor is placed on a printed circuit board (PCB), it requires through-holes on the PCB, which create an impedance discontinuity. The impedance discontinuity causes signal reflection, and to compensate for this reflection, the DFE needs taps at higher locations, such as the 30th to 50th taps. This increases the overall SerDes power consumption and area. Our approach is therefore to put the ac-coupling capacitor on the die, which eliminates these reflections. We implemented this capacitor as a metal-to-metal fringing capacitor and placed it under the bump to minimize the area penalty. Furthermore, the on-chip ac-coupling capacitance decouples the dc bias voltage from the transmitter (TX) side, so the AFE input common-mode voltage can be set to the optimum bias voltage. After the signal goes through the ac-coupling capacitor, the differential signal of long run length data can droop, and a baseline wander correction (BLWC) circuit recovers the low-frequency drift [5]. Fig. 3 shows a high-level analysis of the impact of the ac-coupling capacitor on the error induced on an ideal transmitted data stream. The error is a percentage of the transmitted amplitude. We plot the error percentage versus the ac-coupling corner (pole) frequency for various PRBS patterns, including PRBS7, 9, 11, and 15. The error is measured over the length of one PRBS pattern block. Also plotted are the results for the "CLS52TP2" pattern, which is an IEEE 802.3ap recommended test pattern that emulates some of the run length properties of PRBS31. Since PRBS31 is an extremely long pattern, we isolated (through exhaustive search) a worst case section of this pattern, WP31, which produces the worst error from ac coupling. In reality, this section of the pattern occurs with low probability, but we consider it here nonetheless to upper bound the effects of ac coupling. The plot also includes the results for WP31. For shorter PRBS patterns up to PRBS11, the effect of ac coupling is very small even for corner frequencies up to 500 kHz. 
For our corner frequency of 330 kHz, the error with PRBS15 and CLS52TP2 is less than 2%. Only for the low-probability worst case section of PRBS31 does the error become much more significant, up to 9%. The BLWC loop helps to cancel any residual error from PRBS15/CLS52TP2 or the worst case section of PRBS31. The overall baseline wander cancellation scheme is similar to that of [5]: the cancellation term depends on the received data pattern, but the cancellation gain is independent of the data pattern.
The AFE is a single-stage circuit that incorporates both the variable gain adjustment (VGA) and linear equalizer (LEQ) functions. The buffer drives the DFE input stages, and the buffer output is controlled by common-mode feedback to set the common mode for the DFE input stage. The offset cancellation loop monitors the buffer output offset and feeds it back to the AFE. The DFE is a 14-tap half-rate design that uses a first-tap-unrolled architecture with the sign-sign least mean square (LMS) algorithm for tap adaptation. The DFE output data are deserialized and sent to the digital block. Because the DFE is a half-rate design, it has odd and even blocks. Each odd and even block uses a half-rate clock (2T clock) with an in-phase (I) clock and a quadrature-phase (Q) clock, which have a 90 degree phase difference from each other. The I clock is used to sample the data, and the Q clock is used for transition sampling. A phase interpolator (PI) block receives the 2T I and Q quadrature clocks from the PLLs and generates the sampling clock for the DFE.
The CDR consists of a bang-bang phase detector (BBPD) followed by a digital decimated loop filter similar to that described in [21] . The loop filter is a second-order filter with a proportional path and an integral path. The proportional and integral paths are summed to obtain the overall loop filter output which is a 6-bit phase code that controls a PI with 1UI/64 spacing or resolution.
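The loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [21]: the gains `kp` and `ki`, the class and function names, and the sign convention (+1 meaning the clock is early, which advances the code) are all assumptions made for this example. The phase detector is the textbook Alexander-style early/late decision on binary samples.

```python
import math


def bbpd(d_prev: int, t_edge: int, d_cur: int) -> int:
    """Bang-bang phase detector on binary (0/1) samples.

    Returns +1 if the edge sample still matches the previous bit (clock early),
    -1 if it already matches the current bit (clock late), 0 without a transition.
    """
    if d_prev == d_cur:
        return 0
    return 1 if t_edge == d_prev else -1


class LoopFilter:
    """Second-order loop filter: proportional path plus integral path."""

    def __init__(self, kp: float = 0.25, ki: float = 0.004, steps: int = 64):
        self.kp = kp            # hypothetical proportional gain
        self.ki = ki            # hypothetical integral gain
        self.steps = steps      # 6-bit phase code: 64 PI steps per UI
        self.integral = 0.0     # integral-path state (tracks frequency offset)
        self.phase = 0.0        # accumulated phase in PI steps

    def update(self, pd_sum: int) -> int:
        """Consume one decimated phase-detector sum and return the PI code."""
        self.integral += self.ki * pd_sum
        self.phase += self.kp * pd_sum + self.integral
        return math.floor(self.phase) % self.steps  # code wraps once per UI


lf = LoopFilter()
first_code = lf.update(4)   # four "early" votes in one decimation window
```

The modulo wrap reflects that a PI code spans exactly one UI, so the loop can track a frequency offset by rotating through the codes continuously.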
The linear range of the PI is from 6 to 14 GHz, and the PI is followed by a programmable divider with ratios of 1, 1/2, and 1/4 (Fig. 2). With this configuration, the receiver supports data rates from 3 to 28 Gbps. For the 1.25 Gbps data rate, the data path and the clock path are treated differently. The clock input to the RX is 12.5 GHz, and it goes to dedicated divide-by-10 blocks to generate ten phases. The 1.25 Gbps input signal does not suffer much from ISI, and its swing is relatively large. Thus, it does not require any equalization, and we have a separate signal path for this mode. The 1.25 Gbps input signal goes directly to a dedicated comparator, and the output of the comparator is sent to the digital block. The digital block receives the data and the ten-phase clock, and the digital clock and data recovery (CDR) loop selects one phase out of the ten.
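The 3 to 28 Gbps span follows from the PI range and the divider ratios once half-rate clocking is taken into account. The sketch below assumes two bits per clock period (consistent with the half-rate DFE described earlier); the constant and function names are ours.

```python
PI_RANGE_GHZ = (6.0, 14.0)     # linear range of the PI quoted in the text
DIVIDE_BY = (1, 2, 4)          # programmable divider ratios 1, 1/2, 1/4
BITS_PER_CLOCK = 2             # assumption: half-rate clocking


def data_rate_span_gbps(divide_by: int):
    """Data-rate span covered for one divider setting."""
    lo, hi = PI_RANGE_GHZ
    return (lo / divide_by * BITS_PER_CLOCK, hi / divide_by * BITS_PER_CLOCK)


spans = {d: data_rate_span_gbps(d) for d in DIVIDE_BY}
overall = (min(s[0] for s in spans.values()), max(s[1] for s in spans.values()))
print(spans, overall)
```

Under this assumption the three settings cover 12 to 28, 6 to 14, and 3 to 7 Gbps, which overlap into one continuous range.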
There are several buffer stages before and after the PI to maintain the clock quality. These clock buffers employ active inductor circuits with a common-mode feedback (CMFB) circuit. An IQ skew calibration circuit maintains the phase relationship between the I and Q clocks before the clocks reach the DFE latches. The DFE latches are calibrated at dc with forced values applied to the first-tap unrolling muxes.
C. Transmitter Block
The TX block consists of a serializer and a source-series-terminated (SST) driver, as shown in Fig. 4. The serializer receives 16-bit parallel data from the digital block and contains 16:4, 4:2, and 2:1 serializer stages. The serialized data are then passed to the driver. Since the data are serialized before the driver, the driver does not require any clock, which reduces the number of clock buffers and the power consumption. The driver uses an SST structure with a 3-tap FIR filter [5], [6]. To achieve low power consumption, we minimized the driver structure to 68 segments and reduced the TX output loading by reducing the ESD devices as much as possible. The output of the driver is connected to a bump inductor to improve the return loss performance. The clock path for the TX has a duty cycle distortion (DCD) correction circuit. The clock duty cycle is one of the most important parameters because duty cycle distortion directly degrades the TX eye opening, so calibration or real-time compensation is necessary. The duty cycle correction circuit we implemented is shown in Fig. 5. The clock buffer with programmable duty cycle consists of Mn1, Mn2, Mp1, and Mp2. Mn2 and Mp2 are programmable devices that control the duty cycle. During the calibration mode, digital circuitry sweeps the control code for the programmable duty cycle circuit. The output of the programmable duty cycle buffer goes through a low-pass filter, which converts the duty cycle to the voltage domain. The comparator then detects whether the duty cycle is higher or lower than 50%, and the comparator output is sent to the digital block, which finds the optimum code. The comparator is connected after the low-pass filter, whose corner frequency is set to 8 MHz so that it passes only the low-frequency content of the clock. 
Thus, the comparator does not require high-speed operation, and it uses large devices and careful layout to minimize the comparator offset.
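The calibration sweep can be illustrated with a toy model. Everything numeric below is hypothetical: the linear duty-cycle-versus-code model, the 32-code range, and the choice to stop at the first code where the comparator flips are assumptions for this sketch, not details taken from the design.

```python
def duty_cycle(code: int) -> float:
    """Hypothetical buffer model: duty cycle rises monotonically with the code."""
    return 0.443 + 0.004 * code   # illustrative numbers only


def comparator(code: int) -> bool:
    """True when the low-pass-filtered clock indicates a duty cycle above 50%."""
    return duty_cycle(code) > 0.5


def calibrate(num_codes: int = 32) -> int:
    """Sweep the control code upward and return the code where the comparator flips."""
    for code in range(num_codes):
        if comparator(code):
            return code
    return num_codes - 1   # comparator never flipped: saturate at the last code


best_code = calibrate()
print(best_code, duty_cycle(best_code))
```

Because the comparator only sees the filtered (dc) level, the residual error after calibration is bounded by one code step plus the comparator offset, which is why the text emphasizes a low-offset comparator.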
III. ANALOG FRONT-END DESIGN
A. Analog Front-End Architecture
The AFE is an important block not only for equalization but also for robustness and power. To meet these requirements, a simple construction with a minimum number of stages is preferable. For high-speed applications at 28 Gbps or above, increasing the bias current of a simple differential pair does not help beyond a certain optimum bias, because the output impedance of the differential pair degrades as the bias current increases. Thus, even though gm improves with higher current, the stage cannot achieve sufficient gain. The bandwidth also degrades because a large device size is needed at a low supply voltage.
The TIA circuit is widely used in high-speed applications [7]-[10], and it has the advantage of improving the high-frequency bandwidth. We developed a single-stage TIA-based LEQ with on-chip inductor feedback. The AFE needs to drive the DFE input stages, where several data and error DFE slicer circuits are connected. The TIA decouples the heavy loading capacitance and improves the high-frequency response. We also developed a compact inductor that can be implemented in a small silicon area. With the TIA structure plus the compact inductor, we achieved a 15 dB boost for a 28 Gbps data rate. The AFE input stage is shown in Fig. 6 and has an NMOS differential pair with a degeneration resistor and capacitor. The gain and the boost are controlled by tuning the degeneration resistor and capacitor. The output current from the input differential pair is fed to the TIA stage. The TIA amplifier is a single-stage inverter-based circuit with Mn2, Mp2, and a tail current to maintain the dc bias. To keep Mn2 and Mp2 in the TIA properly biased over PVT, the reference voltage for the CMFB is generated by the replica bias block in Fig. 6. The replica bias block is a half circuit of the TIA with its input and output shorted, which keeps the generated reference bias voltage optimum over PVT variation.
For the feedback path, we use the compact inductor and a resistor in series. The compact inductor uses the top two metal layers, and its inductance is 800 pH. The inductor occupies 25 µm × 25 µm. A common-mode feedback circuit is also implemented to set the optimum dc operating point for the TIA amplifier. Fig. 7 shows the simplified small-signal model of the analog front-end circuit. The model comprises the transconductance of the input-stage differential pair, the transconductance of the TIA amplifier, the parasitic capacitors at the input and output nodes of the TIA, and the feedback path impedance. The transfer function of this small-signal model is expressed by (1) and (2) and has the form of a second-order low-pass filter. The dc gain, pole frequency, and Q are given by (3)-(5).
Making the TIA transconductance large moves the pole frequency higher and raises the Q factor, which improves the high-frequency behavior. The dc gain expression also simplifies, which improves the robustness.
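A simplified derivation illustrates how the pole frequency and Q of a shunt-feedback stage can depend on the amplifier transconductance. The assumptions here are ours and may differ from the paper's own model: the feedback impedance is treated as a pure resistance $R_f$ (the series inductor is neglected), the finite output resistance of both stages is ignored, and the symbols $g_{m1}$, $g_{m2}$, $C_{in}$, $C_{out}$, and $V_x$ (the TIA input node) are labels chosen for this sketch.

```latex
\begin{align*}
g_{m1} V_{in} &= s C_{in} V_x + \frac{V_x - V_{out}}{R_f}, \\
\frac{V_x - V_{out}}{R_f} &= g_{m2} V_x + s C_{out} V_{out}, \\
\frac{V_{out}}{V_{in}} &= \frac{g_{m1}\,(1 - g_{m2} R_f)}
  {s^2 R_f C_{in} C_{out} + s\,(C_{in} + C_{out}) + g_{m2}}, \\
A_{dc} &= -g_{m1}\left(R_f - \frac{1}{g_{m2}}\right) \approx -g_{m1} R_f
  \quad (g_{m2} R_f \gg 1), \\
\omega_0 &= \sqrt{\frac{g_{m2}}{R_f C_{in} C_{out}}}, \qquad
Q = \frac{\sqrt{g_{m2} R_f C_{in} C_{out}}}{C_{in} + C_{out}}.
\end{align*}
```

Under these assumptions both $\omega_0$ and $Q$ grow with $\sqrt{g_{m2}}$, and the dc gain approaches $g_{m1} R_f$, which depends only on the input transconductance and the feedback resistance.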
B. Circuits Design
The AFE circuit is shown in Fig. 8. The AFE consists of the differential input pair with the TIA as its load. The overall AFE transfer function is expressed by (6)-(8), in which the feedback impedance of the TIA block appears. Tuning the degeneration resistor controls the gain, and tuning the degeneration capacitor controls the high-frequency boost.
C. AFE Adaptation
The adaptation algorithm for the AFE peaking parameter is a simplified form of a least mean squares (LMS) type of algorithm [11]. Although this algorithm generally works well, it uses the error information at the data sampling phase and past decision terms, which are also used by the DFE adaptation. Because both the AFE and DFE adaptation use the same error information, the two loops can in certain situations become coupled when they are jointly adapted. The AFE peaking control parameter is updated according to (9), which includes an adaptation gain control parameter for the AFE. The peaking parameter is mapped through a lookup table to the value that controls the peaking. Note that this update equation uses the sign of the error weighted by an average of past sliced data samples. This update equation has an LMS-based analytical underpinning derived from the AFE formulaic representation of [11]. The AFE topology of [11] is somewhat different from the one here: here, the peaking control affects the peaking magnitude as well as, to some extent, the peaking frequency, whereas in [11] the peaking frequency is independent of the peaking magnitude control. Nevertheless, we empirically determined that (9) could be used to adapt the peaking control here as well. Another, more ad hoc view of the adaptation equation, independent of the AFE topology, is obtained by considering the role of the AFE in performing post-cursor intersymbol equalization across multiple symbol periods. This function is also performed by the DFE of Section IV, although in the case of the DFE the equalization happens through feedback cancellation. Note that the LMS-based DFE tap adaptation of Section IV is governed by (13), whose terms happen to be individual terms of the AFE update equation (9). 
Equation (9) for the AFE adaptation can be thought of as a composite of the adaptation of multiple post-cursor equalization taps, performed in a manner similar to the DFE taps of (13). Although the AFE update (9) allows the AFE peaking to converge adequately, the fact that (9) and (13) use common terms implies that closely related adaptation metrics govern both the AFE peaking and the DFE tap adaptation. This can sometimes lead to a suboptimal solution, as the AFE and DFE adaptation are coupled and may "fight" to optimize the underlying minimum mean squared error criterion. Based on our simulation experience, we find that for lower loss channels the AFE may be somewhat suboptimally overequalized with typical adaptation parameters. To alleviate this potential coupling, a group-delay algorithm that uses error information at the transition sample is also considered. Fig. 9 shows the concept of the heuristic group-delay algorithm, named as such because the algorithm considers waveforms corresponding to different group delays for different amounts of AFE peaking. The figure shows three waveform segments in the vicinity of the transition sampling time, which is nominally 0.5 T away from the data sampling time. When the AFE provides optimal equalization, the waveform segments would ideally have an average zero crossing at the transition sampling point. If the AFE overequalizes the signal, the waveform zero crossing appears earlier than for optimal equalization, and the transition samples tend to have a positive bias. If the AFE underequalizes the signal, the waveform zero crossing appears later than for optimal equalization, and the transition samples tend to have a negative bias. Thus, we can correlate transition sample values with whether the AFE peaking should be increased or decreased. The algorithm uses the binary sliced transition sample value and the sliced data values. 
Table I provides the increase/decrease decision that results from correlating the transition sample with the data when there is a transition between the prior data bit and the current data bit. With this group-delay algorithm, the AFE peaking code is then updated as in (10). Since this group-delay algorithm does not use the signed error signal for adaptation, it significantly alleviates coupling with the DFE adaptation.
Although the group-delay AFE adaptation uses transition samples as does the BBPD of the CDR loop, the CDR adaptation bandwidth is significantly larger than that of the LEQ adaptation bandwidth thereby further minimizing any interaction between the CDR and LEQ adaptation loops.
IV. DFE DESIGN
The DFE block is a half-rate design and has 14 feedback taps with the first tap unrolled. The first-tap unrolling relaxes the critical timing [12]-[16]. Fig. 10 shows the DFE block diagram. Since this is a half-rate design, it has identical odd and even blocks. Each odd and even slice receives the analog signal from the AFE, and the prebuffer drives five latches: an eye finder latch for eye monitoring, an error latch for adaptation, a transition latch for the CDR, and two unrolled data latches. The DFE feedback is one of the critical paths and is shown in Fig. 11. This block diagram shows the odd slice, and the even slice has the same structure. A half-rate DFE with a first-tap (H1) unrolled architecture is implemented to eliminate the H1 tap timing from the critical path. The H2 tap then becomes one of the most critical timing paths, so we treat it separately from the other taps and feed it back to the DFE buffer. The remaining H3-H14 taps are combined at the DFE summer and fed back to the slicer.
All of the DFE taps, including the unrolled first tap, are adapted by the LMS algorithm. The input to the DFE is the output of the AFE. The DFE equalized signal is this input with the DFE feedback terms subtracted, and the adaptation operates on its sampled version. In this design, because of the first-tap unrolling, only the second and higher taps are subtracted from the AFE output to obtain the equalized signal, as in (11). With the first-tap unrolling, the final decision is made by slicing at thresholds based on the unrolled tap1 value and the prior data bit, as in (12). The algorithmic LMS adaptation for the unrolled tap1 is similar to the standard DFE tap1 adaptation, given by (13) for the first tap.
The only difference is that the sign of the error signal is obtained by slicing the signal amplitude not relative to the equalization target amplitude but at an error latch threshold obtained from a combination of the target amplitude (H0) and the unrolled first-tap value (H1). For a first-tap-unrolled DFE, there are in general four error latches per odd and even slice, as shown in Fig. 12(a). The error latch threshold positions are the combinations of these two values. The error latch slicer output is determined by (14). Thus, the unrolled DFE structure increases the number of latches and the overall power and area. In our architecture, shown in Fig. 12(b), we use only one error latch instead of four per even and odd slice. The position of this error latch is programmable (static mode) and dynamically rotatable to minimize sensitivity to pattern variations (dynamic mode). There are four possible positions for each odd and even error latch. Thus, there are 16 combinations of error positions, and any combination can be chosen by programming in the static mode. In the dynamic mode, we rotate through several combinations of positions periodically. The dwell period at each position is programmable from 512 T to 32 768 T. With this structure, we reduce the power and complexity without losing any performance.
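For readers unfamiliar with sign-sign LMS, the following sketch adapts two feedback taps on a toy channel. It is the standard, non-unrolled form with a single error threshold at a fixed target amplitude of 1 and ideal decisions, not the unrolled, rotating-error-latch scheme of this design. The channel coefficients, step size `mu`, and function names are assumptions made for this example.

```python
import random


def sign(x: float) -> int:
    return 1 if x >= 0 else -1


def adapt_dfe(post_cursors, num_bits=20000, mu=0.001, seed=1):
    """Sign-sign LMS tap adaptation on a channel with unit main cursor."""
    rng = random.Random(seed)
    n_taps = len(post_cursors)
    taps = [0.0] * n_taps
    history = [1] * n_taps                      # past decisions d[n-1], d[n-2], ...
    for _ in range(num_bits):
        d = rng.choice((-1, 1))                 # transmitted bit (ideal decision)
        y = d + sum(c * p for c, p in zip(post_cursors, history))  # channel output
        z = y - sum(h * p for h, p in zip(taps, history))          # after DFE feedback
        s = sign(z - d)                         # sign of error vs. target amplitude 1
        # Sign-sign update: each tap moves by mu along sign(error) * past decision.
        taps = [h + mu * s * p for h, p in zip(taps, history)]
        history = [d] + history[:-1]
    return taps


taps = adapt_dfe([0.3, 0.1])
print(taps)
```

Each tap settles near the corresponding post-cursor value and then dithers by a few multiples of `mu`, which is the usual trade-off between convergence speed and steady-state noise in sign-sign loops.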
The slicer circuit design is the most challenging part of the DFE implementation, as link performance depends directly on the slicer sensitivity. Fig. 13(a) shows the slicer circuit diagram, and Fig. 13(b) shows the sampling clock generation circuit and its output clock phases. Increasing the DFE slicer sensitivity requires a long regeneration time. The sampling period is minimized using an inverter gate delay, which increases the regeneration time. With this sampling phase generation, we achieved 15 mV sensitivity. The duty cycle of this sampling clock does not need to be 50% in our design.
The slicer has three differential inputs: the data, the H3-H14 feedback, and the offset. CK1 and CK2 generate three phases: the sampling phase, the regeneration phase, and the preset phase. When CK1 is high, the slicer samples the data. When CK1 goes low, the positive feedback circuit starts regeneration. After CK2 goes low, the slicer is preset. The following mux stage has the same circuit topology as the slicer circuit, and this architecture achieved 15 mV sensitivity. The error latch, transition latch, and eye monitor latch all have the same circuit structure as the data latch.
V. CLOCK BUFFER DESIGN
The high-speed clock buffer is a key building block and is used in many blocks in the SerDes. The clock buffer performance and power are very important for the overall SerDes power consumption. To achieve low power and high-frequency bandwidth, active inductor circuits are widely used in many applications [17]-[20]. Fig. 14(a) shows the conventional active inductor load circuit. It consists of an NMOS transistor with a resistor connected between the gate of the NMOS transistor and a low-impedance node, such as the power supply. A capacitor is connected between the source and the gate of the NMOS transistor. The impedance of this active inductor load is given by (15), where gm is the transconductance of the NMOS transistor. The advantage of this active inductor load is that the bandwidth is improved by the added zero. However, it requires Vgs as headroom. To relax the headroom, a native device can be used. However, the minimum channel length of a native device is usually limited by the process, so it is not suitable for high-speed applications with a low supply voltage. Fig. 14(b) shows the PMOS active inductor load circuit. The PMOS transistor, capacitor, and resistor are connected similarly to the NMOS active inductor circuit. A current source i is added so that the voltage drop iR across the resistor R improves the headroom. Furthermore, by controlling the current i, the output common-mode voltage, which is expressed relative to VDD, can be adjusted. The current source i is controlled by a common-mode feedback circuit to maintain the optimum output voltage.
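The impedance of the conventional NMOS active inductor load can be derived from the connection described above. This is a textbook first-order sketch under our own assumptions (gate tied to an ac ground through $R$, capacitor $C$ from gate to source, all other parasitics and the transistor output resistance neglected); $V_s$ and $V_g$ denote the source and gate small-signal voltages, and the form may differ from the paper's (15).

```latex
\begin{align*}
V_g &= V_s\,\frac{sRC}{1 + sRC}, \\
I_s &= (g_m + sC)\,(V_s - V_g) = \frac{g_m + sC}{1 + sRC}\,V_s, \\
Z(s) &= \frac{V_s}{I_s} = \frac{1 + sRC}{g_m + sC}.
\end{align*}
```

In this model the load looks like $1/g_m$ at dc and rises toward $R$ at high frequency, with a zero at $1/(RC)$ and a pole at $g_m/C$, so it behaves inductively when $R > 1/g_m$.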
The complete clock buffer circuit with this active inductor load is shown in Fig. 15.
The input differential pair drives the active inductor load, which is formed by the PMOS transistor, the resistor, and the capacitor c1. The common-mode feedback circuit controls the current source. Fig. 16 shows the simulated frequency response of this active inductor clock buffer circuit over PVT sweeps. One additional benefit of this clock buffer circuit is that it does not amplify the dc input. In other words, it attenuates the input offset while amplifying the amplitude at the clock frequency.
VI. MEASUREMENT RESULTS AND CHIP MICROGRAPH
The TX output waveform measured on silicon is shown in Fig. 17. This measurement was done at 28 Gbps with a PRBS7 pattern, and it shows 250 fs random jitter and 8 ps total jitter. A digitally measured AFE frequency response is shown in Fig. 18. For this measurement, we applied a sinusoidal wave at the RX input and swept its frequency. We then let the CDR lock to the waveform and ran the target amplitude adaptation without the DFE. After the adaptation completed, we monitored the mean target amplitude by reading the H0 value. We then calculated the gain from H0 compared with the input signal amplitude. This measurement includes the AFE, buffer, and DFE prebuffer response and a 2 to 3 dB loss from the package and PCB trace. Fig. 19(a) shows the frequency response of the measured test channel, including the insertion loss measurement sdd12 and the return loss measurements sdd11 and sdd22. The total insertion loss is 34 dB at 14 GHz, which is the Nyquist frequency. Fig. 19(b) is the measured eye diagram at 28 Gbps with the corresponding bit error rate (BER). The vertical eye margin is 110 mV, and the horizontal margin is 0.6 UI. Fig. 19(c) shows the measured bathtub data up to 10E-9 and the data extrapolated to 1E-18. Fig. 20 shows the eye opening test for the static mode and dynamic mode of error latch positioning with a PRBS31 pattern at 25 Gbps. With the dynamic mode, the horizontal margin is 30% larger, that is, 0.05 UI wider, than with the static mode.
The sinusoidal jitter tolerance measurements over temperature and power supply voltage variations are shown in Fig. 21. The test channel has 31.5 dB loss at 12.9 GHz and includes all other impairments, such as crosstalk and random noise interference. The tested data rate is 25.8 Gbps, and the BER is 10E-12 with a PRBS31 pattern. The power supply voltage is swept from 1.0 to 1.1 V and the temperature from 0 to 120 °C. Fig. 22 is the chip microphotograph of the subsystem macro, which consists of four lanes and two PLLs. PLLA is placed between lanes 0 and 1, and PLLB is placed between lanes 2 and 3. The clock tree path is located between the analog TX and the analog RX. Each lane is 300 µm wide and 1854 µm long. The total SerDes subsystem macro area is 3.34 mm².
The worst case power breakdown for the four-lane, one-PLL SerDes configuration is shown in Fig. 23. The measured worst case power consumption per lane is 560 mW at 28 Gbps with three-sigma fast silicon and a temperature of 125 °C. The SerDes uses three power supply voltages. The 1.5 V supply is used only for the regulator that feeds the PLL VCO and high-speed divider. The 0.85 V supply is used for the static logic, deserializer, and serializer except for the final 2:1 mux. The remaining high-speed blocks use 1.05 V. The SerDes transceiver summary and comparison are shown in Table II. Fig. 24 is a picture of the 100G Ethernet system implementation. Each chip has eight lanes, and there are eight chips on each card. A total of 64 channels communicate with another 64 channels with error-free operation. The backplane insertion loss and crosstalk are shown in Fig. 25. The insertion loss is 29.43 dB, and the insertion loss to crosstalk ratio is 13.53 dB at 12.89 GHz. 
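Two figures of merit follow directly from the numbers quoted in this section: the worst case energy per bit and the share of the macro area taken by the four lanes. The short calculation below uses only the stated values; the variable names are ours, and attributing the remaining area to shared blocks is an assumption.

```python
power_w = 0.560                 # worst case power per lane (text)
data_rate_bps = 28e9            # data rate at which that power was measured (text)
energy_pj_per_bit = power_w / data_rate_bps * 1e12

lane_area_mm2 = 0.300 * 1.854   # 300 um x 1854 um per lane (text)
four_lane_area_mm2 = 4 * lane_area_mm2
macro_area_mm2 = 3.34           # total subsystem macro area (text)
shared_area_mm2 = macro_area_mm2 - four_lane_area_mm2  # assumed: PLLs and other shared blocks

print(f"{energy_pj_per_bit:.1f} pJ/bit")
print(f"lanes: {four_lane_area_mm2:.2f} mm^2, remainder: {shared_area_mm2:.2f} mm^2")
```

This gives roughly 20 pJ/bit at the worst case corner, with the four lanes accounting for about two thirds of the macro area.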
VII. CONCLUSION
We developed a 28 Gbps SerDes macro with four lanes and two PLLs and fabricated it in a TSMC 28 nm CMOS process. The TIA-based AFE achieves a 15 dB high-frequency boost with a single pair of on-chip inductors. The half-rate unrolled DFE with two rotatable error latches achieves DFE timing closure, reduced hardware, and robust adaptation. The group-delay algorithm for AFE adaptation alleviates potential coupling between the AFE and DFE adaptation. The active inductor clock buffer circuit with CMFB extends the bandwidth without a major power penalty. The transceiver achieves error-free operation at 28 Gbps with 34 dB channel loss and a worst case power of 560 mW/lane, and it fully complies with multiple standards and applications.
Hiroshi Kimura (M'94) received the B.S. degree in electrical engineering from Chiba University, Chiba, Japan, in 1990.
He joined the System Development Laboratory, Hitachi Ltd., Yokohama, Japan, in 1990, where he worked on the research and development of read-channel analog circuits for hard disk drives. 
