Abstract-Two novel clocking strategies for a high-speed multi-channel serializer-deserializer (SERDES) are proposed in this paper. Both clocking strategies are based on groups, which makes the SERDES flexible and expandable. The first strategy, used for the serializer array, targets moderate parallel I/O conditions such as high density, short distance, consistent media, and high temperature variation. Each group in this strategy consists of a full-rate phase-locked loop (PLL), a full-rate delay-locked loop (DLL), and two channels using fixed phase alignment (FPA). The second strategy, used for the deserializer array, targets harsher I/O conditions such as higher speed, longer distance, inconsistent media, and serious crosstalk. Each group in this strategy is composed of a PLL and two DLLs; a half-rate version is chosen so that the loops also realize the 1:2 deserializer function. Based on the proposed clocking strategies, two representative ICs, one for each group type of the SERDES, are designed and fabricated in a standard 0.18 µm CMOS technology. Measurement results indicate that the two SERDES ICs work properly with their corresponding clocking strategies.
I. INTRODUCTION
Moore's Law predicts that the complexity of integrated circuits (ICs) doubles every 18 months, which suggests that the communication throughput between ICs and the external world should increase proportionally; however, Rent's rule forecasts that the number of input/output (I/O) pins will rise at a slower pace. Consequently, for a system consisting of ICs to reach its maximum performance, the bandwidth of each I/O should be increased, in addition to using parallel interconnect strategies [1]. A high-speed multi-channel, or parallel, serializer-deserializer (SERDES) is nowadays regarded as an excellent choice.
It is well known that almost all I/O communication can work properly only with an appropriate clocking strategy. For a high-speed multi-channel SERDES in particular, because of high density, skew, jitter, and so on, the clocking strategy plays a critical role in its quality, density, throughput, etc.
Based on an extensive survey, two novel clocking strategies for a high-speed multi-channel SERDES are proposed. Their advantages include reliability, compactness, and no need for a reference clock or off-chip tuning. Furthermore, these clocking strategies are based on groups, which makes the SERDES easy to expand. Finally, one representative group from each of the serializer and deserializer arrays of the 5 Gb/s/ch SERDES is designed and fabricated in a standard 0.18 µm CMOS technology. Measured results show that the SERDES ICs function properly with the proposed clocking strategies.
II. TARGET APPLICATION SYSTEM
This work aims at the high-speed interconnection between high-performance CPUs, which is shown in Fig. 1. The target application system is mainly composed of two high-speed optical interconnection ICs with embedded CPUs, which are mounted on the surface of a standard PCB, and parallel optical transmission media, usually 850 nm multimode fiber, buried in the PCB. The interconnection ICs consist of high-performance CPUs, electronic I/O transceiver ICs, and optical VCSEL/PD arrays. Fig. 2 shows the high-speed I/O transceiver array between CPUs. In the transmitter array, a number of low-speed data streams from the CPU are multiplexed into fewer high-speed data streams (5 Gb/s/ch), which the transmitters then convert into optical signals on the fiber array. In the receiver array, optical signals from the fibers are converted back into high-speed electronic data, which are then demultiplexed into low-speed data and finally fed into another CPU. In total, the 96-ch fiber (I/O) arrays are able to carry up to 480 Gb/s of data throughput.
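The aggregate figure follows directly from the channel count and the per-channel rate. A minimal Python sketch of that arithmetic is given below; the function name is illustrative and not part of the original design.

```python
def aggregate_throughput_gbps(channels: int, rate_gbps_per_channel: float) -> float:
    """Total throughput of a parallel link, in Gb/s."""
    return channels * rate_gbps_per_channel


# 96 fiber channels at 5 Gb/s/ch, as stated in the text.
print(aggregate_throughput_gbps(96, 5.0))  # 480.0

# One 12-ch basic transceiver unit at the same per-channel rate.
print(aggregate_throughput_gbps(12, 5.0))  # 60.0
```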
According to Fig. 2, a 12-ch transceiver array is the basic unit. Fig. 3 
III. PROPOSED MIXED CLOCKING STRATEGIES
At a cursory glance, an I/O interface such as the one in Fig. 3 seems to have nothing to do with clocking. In practice, however, any digital I/O interface, esp. a SERDES, can function only on the basis of a sophisticated clocking strategy.
Clocking Issues Related to Parallel I/O Interfaces
Skew and jitter are the two basic problems related to clocking [2]. Skew characterizes the spatial, or static, variation of a clock phase, which mainly results from path mismatch, parasitic capacitance and so on. Jitter, in contrast, describes the dynamic variation of a clock phase, which usually results from power/ground noise, crosstalk, device noise and so on. Jitter below 10 Hz is also called wander. Finally, it should be pointed out that the term clock here includes the clock inherent in a data stream.
Compared to a single-channel I/O interface, a multi-channel I/O interface, as shown in Fig. 4, has to account not only for jitter but also for several skew-related issues, e.g. skew between data from different channels, skew between data and clock, skew between clocks, and so on.
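The distinction between the static and the dynamic part of a phase error can be illustrated numerically. The sketch below assumes a common simplification that is not stated in the original: skew is taken as the mean timing error of the measured edges against their ideal positions, and jitter as the RMS spread around that mean. The function name and sample values are illustrative.

```python
import statistics


def skew_and_jitter(edge_times_ps, ideal_times_ps):
    """Return (skew, rms_jitter) in ps from measured vs. ideal edge times."""
    errors = [m - i for m, i in zip(edge_times_ps, ideal_times_ps)]
    skew = statistics.fmean(errors)      # static offset of the clock phase
    jitter = statistics.pstdev(errors)   # dynamic spread around that offset
    return skew, jitter


ideal = [0, 200, 400, 600]       # ideal edge positions, 200 ps apart
measured = [12, 208, 412, 608]   # illustrative measured edge positions
print(skew_and_jitter(measured, ideal))  # (10.0, 2.0)
```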
Conventional Clocking Strategies
As shown in Table 1, according to the relative magnitudes of the UI (unit interval, i.e. the width of one bit), the inter-channel skew, and the jitter, the clocking strategies for parallel I/Os can be classified into the following three categories.
As for application A, conditions are favorable: low speed, short distance, constant temperature and so on, so both skew and jitter are much smaller than the UI. For this kind of application, it is sufficient to assure the phase relationship between any one data channel and the clock by some means, and then to apply a clock-tree strategy, e.g. an H-type tree, so that the clock phases seen by all data channels are almost the same.
As for application B, conditions are moderate: typically the speed is not too high, but the I/O distance is somewhat long, and/or the temperature varies slowly, and so on, so the skew is comparable to, or even larger than, the UI. For this kind of application, de-skew strategies can be used to tune the phase relationships between data and their corresponding clocks. Usually, a special code pattern is needed at startup to accomplish the de-skew task. To account for wander, the de-skew operation is usually repeated periodically.
As for application C, conditions are tough: typically the speed is high, crosstalk is serious, and/or the temperature varies widely, and so on. For this kind of application, automatic phase identification/tracking strategies, such as feed-forward, feedback, and blind-oversampling techniques, are required, which can continuously track the phase variations inherent in the data.
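The three categories above can be summarized as a simple decision rule. The sketch below is only an illustration: the 0.1 UI threshold standing in for "much smaller than UI" and the function name are assumptions, since the exact boundaries of Table 1 are not reproduced here.

```python
def classify_application(ui_ps, skew_ps, jitter_ps, small_ratio=0.1):
    """Classify a parallel I/O link as application 'A', 'B' or 'C'.

    small_ratio is an assumed threshold for "much smaller than UI".
    """
    skew_small = skew_ps < small_ratio * ui_ps
    jitter_small = jitter_ps < small_ratio * ui_ps
    if skew_small and jitter_small:
        return "A"  # a matched clock tree is sufficient
    if jitter_small:
        return "B"  # skew is significant: de-skew at startup and periodically
    return "C"      # large dynamic variation: continuous phase tracking


print(classify_application(ui_ps=200, skew_ps=5, jitter_ps=5))     # A
print(classify_application(ui_ps=200, skew_ps=150, jitter_ps=5))   # B
print(classify_application(ui_ps=200, skew_ps=150, jitter_ps=60))  # C
```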
The above discussion mainly concerns the de-skew between given data and local clocks; to remove the inter-channel skew, a FIFO (First-In-First-Out) technique is widely used in the downstream stages. The above techniques are discussed from the viewpoint of phase relationship adjustment. According to the synchronization relationship between the clocks used to sample data locally and the clocks used to generate data at the transmitters, the clocking strategies can also be classified into (1) system or global clocking [4, 6, 11], where the reference clock is shared by both sides; (2) source synchronous/forwarded clocking [3, 9, 10, 12, 13], where a clock is sent from the transmitters alongside the data channels; (3) embedded clocking [7, 8, 22], where a clock is embedded into the data at the transmitter and then extracted at the receiver; (4) local clocking [20, 21, 24-34], where clocks are synthesized locally. Strategies (1) ~ (3) are synchronous, or mesochronous, clocking strategies, which are applicable to applications A ~ C, and almost all conventional CDR (clock and data recovery) techniques can be used with them; strategy (4) is plesiochronous, which is preferable for application C, and PI/PS (phase interpolation/phase selection) and blind-oversampling (BOS) techniques are usually used [2, 35].
Clocking strategies are mainly realized by specific CDR-related techniques. A CDR basically consists of a CR (clock recovery) or CE (clock extraction) part and a DR (data recovery) part. The latter can be further divided into data decision and phase alignment (or phase adjustment). As CDR techniques have evolved and many CDR-related techniques have appeared, the concept of a CDR has been generalized, but phase alignment is usually indispensable. Table 2 lists several basic CDR-related techniques, including PLL (phase-locked loop), DLL (delay-locked loop), MPA (manual phase alignment), FPA (fixed phase alignment), PI/PS, and BOS. These techniques are compared in terms of CE, APA (automatic phase alignment), MPA, ACI (adjacent-channel interference), stability, complexity, power consumption, occupied area, and degree of integration. FPA realizes phase alignment by means of layout with fixed, matched paths (length and trace) and load balancing.
Proposed Novel Mixed Clocking Strategies
According to Table 1, from A to C the capability of the corresponding clocking strategies rises; however, their complexity increases as well. An application usually shows different characteristics at different levels, so CDR-related techniques belonging to different conventional clocking strategies can be mixed for a given application. From another point of view, although Table 1 shows general clocking strategies for some typical parallel I/O situations, in practice limiting factors such as density, power, and reliability restrict the freedom of choice among clocking strategies. No single clocking strategy based on a single CDR-related technique is sufficient, let alone optimal, so mixed clocking strategies based on several CDR-related techniques are proposed as a tradeoff that meets the specific overall requirements or achieves maximum performance.
From Fig. 3, there are three interfaces related to the parallel optical link or SERDES. Because interface III is handled by the downstream CPU or other ICs, this paper concentrates only on interfaces I and II. The specifications of the two target interfaces are compared in Table 3. No reference clock is provided by the system for these interfaces, and power consumption should be reduced as far as possible.
As for both interfaces I and II, because no system/local reference clock is provided, an embedded clocking strategy must be employed. All the CDR-related techniques shown in Table 2 are able to recover the data, but only the PLL technique can extract clock information locally from the corresponding input data, so it is definitely required. However, in consideration of die area, power, pulling effects between VCOs, and so on, it is not wise to use a PLL for every channel. The remaining techniques can work only with the help of external clocks; in other words, they can be employed here only together with PLLs. Furthermore, MPA is excluded because it needs external tuning, and the PI/PS and BOS techniques are not preferred in view of die area, power, complexity, and so on. In conclusion, owing to their compactness, reliability, low power, and low complexity, the DLL and FPA techniques are considered here. Specifically, according to Table 3, interface I exhibits the following characteristics: multi-channel (24 channels), high speed (2.5 Gb/s/ch), space-limited (<125 µm/ch), high temperature variation, serious crosstalk, short distance (~10 cm) and so on.
Since both ends of interface I reside on a PCB and the transmission distance is short, the situation for neighboring channels is similar to application A: the inter-channel skew is small and the jitter, esp. wander, in them is highly correlated, so the FPA technique is preferred over the DLL owing to its small size.
For channels that are spaced slightly further apart, the inter-channel skews are slightly larger, so the FPA technique is not applicable. However, because conditions such as power supply and temperature are similar, the jitter, esp. wander, in these channels remains well correlated. The situation approaches application B, so the DLL technique is preferred, because it provides APA.
For channels that are spaced even further apart, the jitter (esp. wander) is almost uncorrelated, which approaches application C, so the PLL technique is preferred. Here the DLL is out of place, for the following reasons:
(1) Distributing a clock over a long distance degrades it, for example by increasing jitter or introducing distortion, and the clock buffer chains also contribute to crosstalk between channels.
(2) A large inter-channel wander needs an even longer VCDL (voltage-controlled delay line) chain, spanning several UIs or more, which sharply increases power and die area. The limited phase tuning range inherent in a DLL makes it unfit for cases where the relative wander is large. Even if control circuits are added to handle this, each time they act a transient occurs and bit errors appear. Therefore, DLLs are preferred for de-skew stages, or for cases where the relative wander is not too large.
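The cost argument in (2) can be made concrete with a rough estimate of how the VCDL length scales with the wander to be covered. The sketch below is illustrative only: the per-cell tuning range of 50 ps and the function name are assumptions, while the 400 ps UI corresponds to the 2.5 Gb/s/ch rate of interface I.

```python
import math


def vcdl_cells_needed(ui_ps, wander_ui, cell_tuning_range_ps):
    """Delay cells required so the VCDL tuning range covers the wander."""
    required_range_ps = wander_ui * ui_ps
    return math.ceil(required_range_ps / cell_tuning_range_ps)


# Assumed 50 ps of tunable delay per cell (hypothetical value).
print(vcdl_cells_needed(400, 0.5, 50))  # 4
print(vcdl_cells_needed(400, 2, 50))    # 16
```

Under these assumptions the cell count, and with it power and area, grows linearly with the wander range that the DLL has to absorb.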
Compared to interface I, interface II has the following characteristics: fewer channels (12 channels), higher speed (5 Gb/s/ch), larger pitch (250 µm/ch), longer distance (~m) and so on. Because of the higher speed, the longer distance, the conversion between the optical and electrical domains, and so on, the relative skew even between neighboring channels is hard to control, so the FPA technique is ill-suited. Apart from the FPA, the DLL and PLL can be used alternately for the same reasons as for interface I, and owing to the smaller number of channels, there is enough space to use only DLLs and PLLs.
Based on the above analysis, two novel mixed clocking strategies are proposed for interfaces I and II of the parallel SERDES application, respectively.
In the proposed clocking strategies, shown in Fig. 5, the interface channels are divided into several identical groups. Within each group, different kinds of the above-mentioned CDR-related techniques are combined to achieve the desired overall aims. Here, CH ij indicates the jth channel of the ith group.
Within interface I, 4 channels are allocated to each group and 3 kinds of CDR-related techniques are assigned to them: a PLL for one, a DLL for another, and FPA for the remaining two. The PLL is needed to extract a clock and realize the phase alignment locally, and the extracted clock is also fed into the DLL. At the same time, the two adjusted clocks from the DLL and the PLL are provided to the other two channels by FPA.
Within interface II, 3 channels are allocated to each group and 2 kinds of CDR-related techniques are assigned to them: a PLL for one and two DLLs for the other two. For example, in Group 0 of interface II, a PLL serves CH 01, and two DLLs serve CH 00 and CH 02, respectively.
Within both interfaces, PLLs and DLLs are placed shoulder to shoulder in order to facilitate the transfer of clock signals.
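The group-based assignment can be written down as a small lookup, which also shows how the scheme expands with the channel count. In the sketch below the interface II pattern follows the Group 0 example in the text; for interface I, the DLL on position 1 and the PLL on position 2 follow the clock naming used in Section IV, whereas placing FPA on positions 0 and 3 is an assumption. All identifiers are illustrative.

```python
# Technique assigned to each channel position j within a group.
# Interface I: FPA on positions 0 and 3 is an assumption for illustration.
INTERFACE_I_GROUP = ("FPA", "DLL", "PLL", "FPA")
# Interface II: PLL on CH01, DLLs on CH00 and CH02, as in the Group 0 example.
INTERFACE_II_GROUP = ("DLL", "PLL", "DLL")


def clocking_plan(group_pattern, n_channels):
    """Map every channel (i, j) to its technique, given the per-group pattern."""
    size = len(group_pattern)
    if n_channels % size != 0:
        raise ValueError("channel count must be a whole number of groups")
    plan = {}
    for ch in range(n_channels):
        i, j = divmod(ch, size)  # group index, position within the group
        plan[(i, j)] = group_pattern[j]
    return plan


plan_i = clocking_plan(INTERFACE_I_GROUP, 24)    # 6 groups of 4 channels
plan_ii = clocking_plan(INTERFACE_II_GROUP, 12)  # 4 groups of 3 channels
print(plan_ii[(0, 1)])  # PLL
```

Adding channels only adds identical groups, so the number of PLLs grows with the number of groups rather than with the number of channels.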
Within each group, the clocking strategy is somewhat like a source synchronous one. Because neighboring channels experience similar temperature variation, crosstalk, and so on, the wander, and even the skew, in neighboring channels is similar.
Owing to the low-pass characteristic of the PLL, the wander in the extracted clock is nearly the same as that in the corresponding channel, and therefore also consistent with that in the neighboring channels, so a DLL, or even FPA, can be used for the neighboring channels.
In order to reduce crosstalk through the power supply, a separate power supply is preferred for each group; this can also improve the consistency of wander, and even jitter, within each group.
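The low-pass behavior referred to above can be illustrated with a very simple phase-tracking model. The sketch below uses a first-order discrete-time loop as a stand-in for the PLL; this model, the loop gain of 0.1, and the test frequencies are assumptions chosen for illustration and do not describe the fabricated circuit.

```python
import math


def track_phase(input_phase, alpha=0.1):
    """First-order loop: the estimate moves a fraction alpha toward the input."""
    estimate = 0.0
    output = []
    for phase in input_phase:
        estimate += alpha * (phase - estimate)
        output.append(estimate)
    return output


def settled_peak(signal):
    """Peak magnitude over the second half, after the start-up transient."""
    tail = signal[len(signal) // 2:]
    return max(abs(v) for v in tail)


n = 4000
slow_wander = [math.sin(2 * math.pi * k / 1000) for k in range(n)]  # slow phase drift
fast_jitter = [math.sin(2 * math.pi * k / 4) for k in range(n)]     # fast phase noise

print(settled_peak(track_phase(slow_wander)))  # close to 1: wander is tracked
print(settled_peak(track_phase(fast_jitter)))  # much smaller: fast jitter is filtered
```

Because slow phase variation passes through almost unchanged, the extracted clock carries the same wander as its data channel, which is what allows it to be reused by neighboring channels.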
IV. CIRCUIT DESIGN
As mentioned above, interfaces I and II, shown in Fig. 3, each consist of several identical groups, so it is sufficient to focus here on the design of one group within each interface. Owing to its superior performance, such as high speed and resistance to interference, CML (current mode logic) is chosen [36, 37]. In addition to the core, input/output and inter-stage buffers are widely used to buffer and reshape signals and to improve circuit performance.
Fig. 5. The proposed mixed clocking strategies for interfaces I and II.
Circuit Design for Interface I
The block diagram of a group within interface I is shown in Fig. 6, from which the clocking strategy described in the previous section can be easily observed. Two 2:1 multiplexers (MUXs) are used to serialize 4-way input data into 2-way output data. The PLL, DLL, and FPA techniques are adopted to provide proper clocks for the two MUXs. Fig. 7 shows the block diagram consisting mainly of a PLL and a MUX for interface I, which corresponds to the lower two channels in Fig. 6. The PLL consists of a PFD (phase frequency detector), a V/I (voltage to current) converter, a LF (loop filter) and an I/Q (in-phase/quadrature) VCO (voltage controlled oscillator). Given PVT variation, the limited frequency acquisition range of a PD (phase detector) alone cannot guarantee reliable operation of the PLL, so a PFD is selected. Furthermore, a full-rate PFD is used instead of a fractional-rate one, because the half-rate multiplexer requires a clock whose frequency equals the input data rate. Finally, owing to its compact layout, a 4-stage inductorless ring VCO is adopted to generate the I/Q clocks for the PFD. It should be pointed out that, in this section, PD refers not to a photodiode as in Section II, but to a phase detector.
As for the multiplexer, a half-rate one is chosen over its full-rate counterpart because of its low power consumption, small die area, low complexity, and so on.
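The function of the half-rate 2:1 MUX can be captured by a short behavioral model: one input is passed during one clock phase and the other during the opposite phase, so two output bits are produced per clock period. The sketch below is a bit-level illustration only; which input is selected on which clock phase is an assumption.

```python
def half_rate_mux(d0, d1):
    """Interleave two equal-length bit streams into one stream at twice the rate."""
    if len(d0) != len(d1):
        raise ValueError("input streams must have the same length")
    out = []
    for a, b in zip(d0, d1):
        out.append(a)  # selected while the clock is high (assumed polarity)
        out.append(b)  # selected while the clock is low
    return out


print(half_rate_mux([1, 0, 1], [0, 0, 1]))  # [1, 0, 0, 0, 1, 1]
```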
As shown in Fig. 8, the DLL (delay-locked loop) for interface I mainly consists of a PD, a V/I converter, a LF and a VCDL. Because the DLL does not need frequency acquisition, a full-rate PD is used. The VCDL should be long enough to provide a sufficient phase shift range to allow for a large inter-channel wander and to avoid deadlock.
The full-rate PD and full-rate PFD, which are used for the above-mentioned DLL and PLL, respectively, are shown in Fig. 9. From this figure, it can be seen that the full-rate PD is actually a double-edge-triggered flip-flop (DETFF), which consists of two latches and a conventional selector, shown in Fig. 10(a) and (b), and that the PFD [38] is composed of two identical DETFFs (or PDs) and a tri-state DETFF (Tri-DETFF), which is also called a frequency detector (FD). The main difference between the Tri-DETFF and the DETFF is that a tri-state selector, shown in Fig. 10(c), is chosen instead of the conventional selector shown in Fig. 10(a). The timing diagrams for a group within interface I, shown in Fig. 6, can be found in Fig. 11. CHin i0 , CHin i1 , CHin i2 , and CHin i3 are the 4-way input data; CHout i0 and CHout i1 are the 2-way high-speed multiplexed output data. CLK i1 and CLK i2 are the clocks extracted by the above-mentioned DLL and PLL from CHin i1 and CHin i2 , respectively, so, from Fig. 11, it can be seen that the rising edges of CLK i1 and CLK i2 are aligned with the edges of CHin i1 and CHin i2 , respectively.
From Fig. 11, there is a difference of one quarter of a clock period (T/4) between the output data and their corresponding extracted clocks. This arises because, to obtain an optimum phase relationship, a T/4-phase-shifted clock is used for the selector within the half-rate MUX shown in Fig. 7.
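The structure of a DETFF built from two latches and a selector can be checked with a small discrete-time model. The sketch below is a generic behavioral illustration, not the CML circuit of Fig. 10: the latch polarities, the initial state of 0, and the signal names are assumptions.

```python
def detff(trigger, sampled):
    """Double-edge-triggered flip-flop built from two latches and a selector.

    Latch A is transparent while the trigger is low, latch B while it is high;
    the selector always outputs the latch that is currently holding.
    """
    latch_a = latch_b = 0  # assumed initial state
    out = []
    for t, x in zip(trigger, sampled):
        if t == 0:
            latch_a = x  # A follows the input, B holds
        else:
            latch_b = x  # B follows the input, A holds
        out.append(latch_a if t == 1 else latch_b)
    return out


trigger = [0, 0, 1, 1, 0, 0, 1, 1]
sampled = [1, 1, 0, 0, 0, 1, 1, 0]
print(detff(trigger, sampled))  # [0, 0, 1, 1, 0, 0, 1, 1]
```

The output changes only at rising and falling edges of the trigger, each time taking the value the sampled signal had just before that edge.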
Circuit Design for Interface II
The block diagram of a group for interface II is shown in Fig. 12; it is composed mainly of a PLL and two DLLs. The PLL is placed in the middle to facilitate the distribution of clock signals. Through the combination of a PLL and two DLLs within a group, 3-way input data can be deserialized into 6-way output data. Fig. 13 shows the block diagrams of a PLL and a DLL for interface II. In contrast to the PLL and DLL for interface I, a half-rate PLL and half-rate DLLs are used here. The advantages of the half-rate techniques are as follows. Firstly, no additional independent demultiplexers (DeMUXs) are needed, because the half-rate PFD or PD itself performs the function of a 1:2 demultiplexer. Secondly, power consumption can be reduced considerably, because nearly all blocks operate at only half the speed of their full-rate versions. Finally, it is not easy for a multi-stage inductorless ring VCO to oscillate reliably at 5 GHz in a standard 0.18 µm CMOS technology.
The PLL is composed of a half-rate PFD, a V/I converter, a LF and a ring VCO. The Savoj PFD [39], shown in Fig. 14, is chosen; it consists of two identical half-rate PDs (IPD and QPD) and a FD. The FD is identical to the one used in the full-rate PFD for interface I. The basic building blocks shown in Fig. 10 are also used for the PFD. From Fig. 14, 4-way differential clocks are needed for the PFD, so a 4-stage differential ring VCO is used. The VCO cell is given in Fig. 15, in which a current-folding technique [40] is used to alleviate the conflict between the low voltage headroom and the sensitivity of the VCO.
The DLL consists of a half-rate PD, a V/I converter, a LF (loop filter) and 2-way VCDLs. The half-rate PD is identical to the one used in the half-rate PFD shown in Fig. 14. Because 2-way differential clocks are required for the half-rate PD, 2-way VCDLs are used. The circuit diagram of a delay cell of the VCDL is given in Fig. 15. Fig. 17 shows the timing diagrams for a group within interface II shown in Fig. 12. From it, it can be seen that 3-way high-speed data (CHin i0 ~ CHin i2 ) are demultiplexed into 6-way low-speed data (CHout i00 ~ CHout i21 ), and the skews between channels can also be observed.
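The first advantage, that half-rate sampling already provides 1:2 demultiplexing, can be seen from a bit-level model: a half-rate clock has one rising and one falling edge per two input bits, so sampling on both edges naturally splits the stream in two. The sketch below is a behavioral illustration; the assignment of even bits to rising edges is an assumption.

```python
def half_rate_demux(bits):
    """Split a serial stream into two half-rate streams, as a half-rate sampler does."""
    even = bits[0::2]  # bits sampled on rising clock edges (assumed polarity)
    odd = bits[1::2]   # bits sampled on falling clock edges
    return even, odd


print(half_rate_demux([1, 0, 0, 0, 1, 1]))  # ([1, 0, 1], [0, 0, 1])
```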
V. LAYOUT DESIGN
In order to reduce the tape-out cost, only one group for each interface is designed and fabricated in a standard 0.18 µm CMOS technology. Compactness, isolation and matching play a critical role in the layout design in this work.
The layout of a group for interface I is shown in Fig. 18; including pads, it occupies a die area of 673 µm×667 µm. The core width is only 450 µm, which is narrower than 4×125 µm, so the size specification is met. The pads are arranged simply for convenient testing and area reduction.
4-way 2.5 Gb/s differential data signals are fed in from the top, left and bottom sides simultaneously, and 2-way 5 Gb/s differential data are produced on the right side. Along the chip edges, ground and power pads are interleaved with the signal pads to reduce crosstalk and improve signal integrity.
No pad is used for external tuning. NMOS transistors instead of MIM capacitors are used for the large capacitors required by the PLL and DLLs. The layout is designed to be as compact as possible, and guard strips or rings are placed between critical blocks or around sensitive circuit blocks, e.g. the VCO.
The layout of a group for interface II is shown in Fig. 19; including pads, it occupies a die area of 1200 µm×943 µm. The core width is only 750 µm, which meets the size specification of 250 µm/ch. As indicated in Fig. 19, 3-way differential input data are fed in from the left side and the bottom side, and the 6-way differential output data are taken out from the two-column pads on the right side. All the layout design techniques mentioned above are also employed here.
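The pitch check for both groups is simple arithmetic on the numbers quoted above, sketched here in Python; the function name is illustrative.

```python
def meets_pitch(core_width_um, channels, pitch_um):
    """True if the core fits within the width budget of its channels."""
    return core_width_um <= channels * pitch_um


print(meets_pitch(450, 4, 125))  # True: interface I group, 450 <= 500
print(meets_pitch(750, 3, 250))  # True: interface II group, 750 <= 750
```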
VI. MEASUREMENT RESULTS
The fabricated IC for interface I is measured after being bonded on a microwave PCB, which is shown in Fig. 20. The high-speed differential traces on the PCB are strictly controlled to avoid imperfections such as reflection, and their differential impedance is designed to be 100 Ω. Fig. 21 shows the measurement results of the IC for interface I. The measured eye diagram of one channel among the four-channel 2.5 Gb/s input data is shown in the upper half of Fig. 21(a). The measured eye diagrams of the two-channel 5 Gb/s output data are shown in the lower half of Fig. 21(a) and in Fig. 21(b), respectively. The former is obtained from the MUX associated with the PLL, and the latter from the one associated with the DLL.
In order to avoid crosstalk between the two-column bond wires for the corresponding output pads, shown in Fig. 19, the fabricated IC for interface II is measured directly on a microwave probe station, which is shown in Fig. 22. Fig. 23 shows the measurement results of the IC for interface II. The measured eye diagram of one channel among the three-channel 5 Gb/s input data is shown in the upper half of Fig. 23(a). The measured eye diagram of one-channel 2.5 Gb/s demultiplexed data from the PLL is shown in the lower half of Fig. 23(a). Fig. 23(b) and (c) show the measured one-channel output eye diagrams from the other two DLLs, respectively.
VII. CONCLUSIONS
A high-speed multi-channel SERDES plays an important role in high-speed interconnection between ICs. This paper concentrates on a 5 Gb/s/ch multi-channel SERDES and, based on extensive comparison and analysis, proposes two novel clocking strategies for it. Both clocking strategies are based on groups. The group within one strategy consists of a PLL, a DLL and two FPA channels, and the group within the other strategy is composed of a PLL and two DLLs. Each clocking strategy has distinct characteristics suited to different application cases. Finally, two representative ICs, one for each group type of the SERDES, are designed and fabricated in a standard 0.18 µm CMOS technology. Simulation and measurement results indicate that the two SERDES ICs work properly with their corresponding clocking strategies.
