Abstract - A 9.5 mW, 20 Gb/s, 40 × 70 µm² inductorless 1:4 DEMUX in a 90 nm CMOS process is presented. To reduce power and area, the DEMUX uses a multi-phase clock architecture that requires fewer latches, operating at a slower clock rate, than the conventional tree architecture. To provide low-voltage scalability, the latches operate with a near-rail-to-rail logic swing. This is realized without significant speed penalty by adopting current-sourceless CML-type latches with unconventional settings. The near-rail-to-rail swing also offers a larger noise margin and eliminates the need for logic level converters. The well-balanced, scalable design could broaden the applications of high-speed SerDes in the coming ultralow-voltage many-core era.
I. INTRODUCTION
Applications of high-performance demultiplexers (DEMUXes) [1]-[6] have so far been limited chiefly to those that accept a design trade-off in favor of speed, such as fiber-optic communication systems. In typical optical wavelength-division multiplexing (WDM) systems, for example, a per-wavelength speed of 10 Gb/s or faster is required. The conventional wisdom for gaining or maintaining the speed of current-mode logic (CML)-type circuits, used in [1]-[6], has been to reduce the signal amplitude. The dilemma is that noise immunity may have to be traded off. Another problem is that CML-type circuits do not go well with the trend of lowering the supply voltage. Beyond speed, power and area are also important considerations, because a WDM system needs as many DEMUXes as it has wavelengths. We present a 1:4 DEMUX with a near-rail-to-rail architecture that could offer a possible solution to the challenges in signal integrity and low-voltage scalability as well as in balancing speed, power, and area.
II. ARCHITECTURE
Fig. 1 shows the architecture of the 1:4 DEMUX along with a timing chart. This multi-phase clock architecture was chosen to reduce power and area. The half-rate clock CLKin is halved in frequency by a divider consisting of two latches. Its outputs are in quadrature and are fed into the two 1:2 DEMUXes. In the upper 1:2 DEMUX, a pair of master-slave latches (MSLs) samples every other bit in the input bitstream Din by using the rising and falling edges of the quarter-rate clock signal CLK2. The latch Ld1 aligns the outputs from the pair of MSLs. The lower 1:2 DEMUX operates likewise. The two 1:2 DEMUXes sample the input bitstream alternately, as dictated by the phase-shifted clock signals CLK1 and CLK2, and together perform the 1:4 DEMUX function. The latches Ld0 and Ld2 align the outputs from the lower 1:2 DEMUX with those from the upper one. The 1:4 DEMUX uses 12 latches altogether, as opposed to 15 in the conventional tree architecture [3], [4], which uses three 1:2 DEMUXes operating at a half rate. The quarter-rate operation of the 1:2 DEMUXes in our design leads to low power dissipation. The architecture, which is similar to the 12-latch 1:4 DEMUX proposed in [6], thus allows power and area reductions of more than 20% in comparison with the 15-latch DEMUX. The selector circuit in Fig. 1 allows measurement with a small number of pads. It is a simple combinational logic circuit consisting of transmission gates and NOR gates.
Fig. 2 shows circuit schematics of three different CML-type latches. CML has conventionally been the architecture of choice for high-speed digital circuits. However, in 90 nm CMOS, the power supply voltage is limited to about 1.2 V and the threshold voltage is about 0.3 V. Each transistor, therefore, cannot be given the drain-source voltage required for high-speed operation. Switching noise degrades the output eye opening, as shown in Fig. 2(a).
The current-sourceless CML structure was recently proposed for low-voltage operation [4], [7]. The tail transistor is eliminated to increase the voltage headroom. The upper transistors are sized so that the gain of the differential amplifier is as low as 1 dB to 2 dB for high-speed latching. The simulated eye diagram (Fig. 2(b)) shows a wider opening than that of the conventional CML (Fig. 2(a)). The signal swing reaches up to 0.6 V, yet the clock has to have a rail-to-rail voltage swing.
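The demultiplexing function of the architecture in Fig. 1 can be illustrated with a purely behavioral sketch. It models only which input bit ends up on which quarter-rate output, not the latches, clocks, or alignment timing. The function names and the numbering of the output streams are assumptions made for this illustration and are not taken from Fig. 1.

```python
from typing import List, Sequence


def demux_1to4(bits: Sequence[int]) -> List[List[int]]:
    """Split a serial bitstream into four quarter-rate streams.

    Bit n goes to stream n % 4. Streams 0 and 2 stand for one 1:2 DEMUX
    (sampling on the rising and falling edges of one quarter-rate clock);
    streams 1 and 3 stand for the other, clocked in quadrature.
    The stream numbering is an assumption of this sketch.
    """
    usable = len(bits) - len(bits) % 4  # drop an incomplete trailing group
    return [list(bits[phase:usable:4]) for phase in range(4)]


def latch_saving(latches_new: int = 12, latches_tree: int = 15) -> float:
    """Fractional reduction in latch count relative to the tree architecture."""
    return 1.0 - latches_new / latches_tree


if __name__ == "__main__":
    # Use the bit index as the "bit" value so the routing is easy to read.
    print(demux_1to4([0, 1, 2, 3, 4, 5, 6, 7]))
    print(f"latch count reduction: {latch_saving():.0%}")
```

The latch count alone accounts for a 20% reduction (12 versus 15 latches); any saving beyond that would have to come from other factors, such as the quarter-rate operation of the 1:2 DEMUXes.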
III. CIRCUIT DESIGN
In our design, to give a larger noise margin and better scalability in light of the ever-lowering supply voltage, a near-rail-to-rail architecture is introduced that makes use of the current-sourceless CML-type topology [4]. To go from the 0.6 V swing of [4] to a near-1.2 V swing, the resistance RL of the loads has to be roughly doubled, which by itself might reduce the speed. However, since the gain goes up with RL, the transistors can be made smaller to keep the gain low. The parasitic capacitance CL associated with the transistors, including those driven at the next stage, then also becomes smaller, and the cut-off frequency therefore remains high. The signal amplitude thus becomes nearly rail-to-rail without significant speed penalty. It is not quite rail-to-rail because of the small voltage drop across the transistors. To keep this voltage drop small, the clocked transistors should be enlarged somewhat. Circuit blocks built with the near-rail-to-rail architecture can be freely intermixed with CMOS digital blocks without logic level conversion. In Fig. 1, the divider, the 1:2 DEMUXes, and the latches Ld0 and Ld2 are near-rail-to-rail, and the rest are built of CMOS logic gates and transmission gates.
Fig. 3(a) schematically shows the layout of the current-sourceless CML latch in Fig. 2(b). The polysilicon load resistors occupy a significant area. To reduce the area, PMOS loads operating in the triode (or linear) region are adopted, as was done in [1] and as shown in Figs. 2(c) and 3(b). This is a technique used in pseudo-NMOS logic [8]. The area of the latch is estimated to be about a third of that of the current-sourceless CML latch, as illustrated in Fig. 3. The use of PMOS triode loads instead of polysilicon resistors has no adverse effect on signal integrity when the signal amplitude is near-rail-to-rail. A micrograph of the test chip is shown in Fig. 4.
IV. MEASUREMENT RESULTS
Fig. 5 shows measured output eye diagrams. The input was a 20 Gb/s non-return-to-zero (NRZ) 2³¹ − 1 pseudorandom binary sequence (PRBS). A half-rate sinusoidal clock signal of 10 GHz was also supplied. Each of the four output data streams was observed with a sampling oscilloscope and an error detector by selecting one channel at a time. Wide eye openings were observed at an error rate of less than 10⁻¹². The output jitter was 20 ps (p-p). The input data phase margin was 100 degrees for this input. The power dissipation was 9.5 mW at a supply voltage of 1.2 V. Measurements were also made at lower supply voltages, down to 0.9 V. At 0.9 V, the operating speed was 13 Gb/s and the power dissipation was 3.3 mW.
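The two measured operating points can be put on a common footing by converting them to energy per bit. The short sketch below does only this arithmetic on the figures reported above; the function and variable names are introduced here for illustration, and the energy-per-bit values are derived quantities rather than reported results.

```python
def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Energy per bit in picojoules; mW divided by Gb/s equals pJ/bit."""
    return power_mw / rate_gbps


# (supply voltage in V, data rate in Gb/s, power in mW), as reported in the text
measured = [
    (1.2, 20.0, 9.5),
    (0.9, 13.0, 3.3),
]

if __name__ == "__main__":
    for vdd, rate, power in measured:
        print(f"{vdd} V: {energy_per_bit_pj(power, rate):.3f} pJ/bit")
```

Under these figures the energy per bit is 0.475 pJ at 1.2 V and roughly 0.25 pJ at 0.9 V, which suggests that lowering the supply voltage reduces the energy per bit at the cost of the maximum data rate.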
The area and speed of CMOS DEMUXes [1]-[5] are compared in Fig. 6. Further details of the DEMUXes are given in Table I. All the earlier DEMUXes used a CML-type architecture with a small logic swing. The power dissipation values of our circuit, listed in Table I, are those of the circuit core (Fig. 1)
