Abstract: A low-voltage, 90-nm CMOS optical interconnect transceiver operating at 1550-nm optical wavelength is presented. This is the first demonstration of a novel optoelectronic modulator architecture (the quasi-waveguide angled-facet electroabsorption modulator) in a system. It features simple electronic packaging via flip-chip bonding to silicon. Devices have a broad optical bandwidth, are arrayed two dimensionally, and feature surface normal, spatially separated, and misalignment-tolerant optical ports. The modulators are driven with a novel pulsed-cascode driver capable of supplying an output-voltage swing of 2 V (twice the nominal 1-V CMOS supply) without overstressing thin-oxide core CMOS devices. At the receiver side, a sensitivity of −15.2 dBm is obtained with an integrating/double-sampling front end. The transceiver includes clock generation and recovery circuitry that enables a data serialization factor of five. At a maximum data rate of 1.8 Gb/s, the optical transmitter, receiver, and clocking circuitry consume 12.6, 4.5, and 6.5 mW, respectively, for a total link electrical power dissipation of 23.6 mW. To the best of our knowledge, this is the first demonstration of an interconnect transceiver operating at 1550 nm with a III-V output device directly integrated to the CMOS.
Optoelectronic modulators using the quantum-confined Stark effect [2] are effective transmitter devices for optical interconnects due to their potential for high-frequency operation and low power dissipation. Most of these devices fall into the category of surface normal or waveguide devices. The surface normal devices typically require a thick multiple-quantum-well (MQW) region to get an adequate contrast and, thus, require a large operating voltage to achieve the necessary electric field for switching. It is possible to reduce the MQW thickness and the operating voltage by containing the MQW region in an asymmetric Fabry-Pérot resonator, but the resulting design is constrained by a narrow wavelength band of operation and a strong temperature dependence. While the waveguide devices do not have these problems, packaging is difficult due to the need to couple to small-area modes, and waveguides can only be arrayed in one dimension of the wafer surface.
This paper demonstrates an optical interconnect transceiver at 1550 nm using a modulator architecture that combines the benefits of both the surface normal and waveguide modulators: the quasi-waveguide angled-facet electroabsorption modulator (QWAFEM). These devices have previously been demonstrated to operate over a wavelength range of 16 nm [3]. They allow for surface-normal access to spatially separated input and output ports and are tolerant to small misalignments of the beam. They have a low drive voltage of 2 V and can be flip-chip-bonded directly to CMOS without high-speed electrical packaging.
In order to explore the use of these modulators in high-density chip-to-chip optical interconnect applications, 2-D modulator arrays are directly flip-chip-bonded to a CMOS-transceiver chip, thus eliminating the need for high-speed electrical packaging. The transceiver is fabricated in a 90-nm CMOS process and employs a novel pulsed-cascode modulator driver [4] that is capable of supplying an output-voltage swing of 2 V (twice the nominal 1-V supply) without overstressing thin-oxide core CMOS devices. Completing the optical link is a low-voltage integrating and double-sampling receiver front end [5] that eliminates the requirement of a high-bandwidth transimpedance amplifier (TIA).
To the best of our knowledge, this paper presents the first demonstration of an optical interconnect transceiver operating at 1550 nm with a III-V output device directly integrated to the CMOS. Section II describes the design and operation of the QWAFEM devices. The low-voltage CMOS transceiver is outlined in Section III. In Section IV, we detail the experimental setup and results, and in Section V, we summarize this paper.
II. MODULATORS
The QWAFEM-device architecture is shown in Fig. 1 [3]. The modulators are fabricated on a double-side polished (100) InP wafer, upon which InGaAs/InP epitaxial layers have been grown via metal-organic chemical vapor deposition. The growth consists of a p-doped/intrinsic/n-doped (PIN) diode, containing an MQW structure in the intrinsic region and a three-period InGaAsP/InP distributed Bragg reflector (DBR) in the n-doped region. Mesas are etched for the diodes, and metal p- and n-contacts are deposited in an evaporator. On two opposite sides of each mesa, mirrors are selectively etched in the substrate to reveal the {111} planes. Light enters the QWAFEM at normal incidence through the antireflection-coated back surface of the InP substrate. The light undergoes three total internal reflections (two from the selectively etched mirrors and one reflection from the interface between the epitaxially grown InGaAsP and air) before exiting along a path antiparallel to, and displaced from, the incident beam. A resonant cavity is formed around the MQW region by the DBR and the epitaxy-air interface, enhancing absorption. Oblique incidence in the MQW region increases the interaction length of the light with the quantum wells and enhances the performance of the resonator.
Unlike in a waveguide modulator where propagation along the absorbing material is accomplished by coupling to a waveguide mode, there is no modal constraint imposed in coupling through the active region at grazing incidence, easing the optical alignment constraints. Furthermore, the triple-bounce geometry results in a fixed separation between the input and output beams for small translations of the input beam across the modulator substrate surface. In this implementation, the displacement between the input and output ports facilitated the separation of the output beam for detection using a pick-off mirror (POM), which can provide a fourfold improvement in insertion loss compared with using a power beamsplitter (BS) for this purpose. In addition, multiple modulators on a chip could be tested by translating the chip without realigning the detection optics. For future parallel link implementations, it should be possible to align an array of lensed fibers for the input and output couplings to the modulator chip in a single alignment step, provided that the pitches of the fibers and the modulators are matched.
Building upon previous work [3], in this implementation of the QWAFEM, the spacing of modulators was changed to match the pitch of the CMOS-transceiver chip, and the diode mesas were resized to reduce capacitance. The current devices range from 20 µm × 60 µm to 40 µm × 90 µm in diode area, corresponding to designed capacitances ranging from 700 fF to 2.5 pF. While reducing the device size reduces the capacitance, it also reduces the tolerance to misalignments, and in the limit of small devices, it reduces the maximum contrast ratio. To avoid stringent growth thickness calibration, three epitaxial wafers were grown with different resonator lengths. Each resonator had two sacrificial layers, of which none, one, or both could selectively be etched to optimize the resonator length. The optimal combination of the wafer and the number of sacrificial layers etched was chosen after experimental comparison of nine fabricated device arrays. A static curve of contrast ratio versus wavelength for a typical QWAFEM of this kind is shown in Fig. 2.
III. CMOS TRANSCEIVER

A. Transceiver Architecture
The optical interconnect transceiver architecture is shown in Fig. 3. In order to enable short bit periods without consuming excessive area and power in clock generation and distribution, a multiple clock-phase multiplexing architecture is used at both the transmitter and the receiver. In the transmitter frequency synthesis phase-locked loop (PLL), a five-stage ring oscillator provides five sets of complementary clock phases that are spaced a bit period apart. These phases are used to switch a level-shifting multiplexer to produce a serial data stream with a data rate of five times the clock frequency. The multiplexer serial output is then buffered by the modulator-driver output stage [4]. At the receiver side, the input photocurrent is integrated onto the input-node capacitance, and a double-sampling technique is used to resolve the data bits [5], [6]. A demultiplexing factor of five is directly achieved at the input node using five uniform clock phases from the clock and data recovery (CDR) system.
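The clock-phase arithmetic described above can be sketched in a few lines of Python. This is only an illustration under the assumption that one bit is launched per clock phase; the function name and return layout are hypothetical and not part of the chip design.

```python
def multiphase_clocking(data_rate_bps, n_phases=5):
    """Return (clock frequency in Hz, bit period in s, phase edge times in s).

    Assumes one bit per clock phase, so the serial data rate is n_phases
    times the clock frequency and adjacent phases are one bit period apart.
    """
    clock_hz = data_rate_bps / n_phases
    bit_period = 1.0 / data_rate_bps
    # Edge time of each phase relative to phase 0, within one clock cycle.
    edges = [k * bit_period for k in range(n_phases)]
    return clock_hz, bit_period, edges


clock_hz, t_bit, edges = multiphase_clocking(1.8e9)
print(f"clock = {clock_hz / 1e6:.0f} MHz, bit period = {t_bit * 1e12:.0f} ps")
```

At the 1.8-Gb/s rate reported in this paper, the five-phase scheme corresponds to a 360-MHz clock, which is why each receiver segment has five bit periods available for regeneration.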
B. Modulator Driver
For modern CMOS technologies, an output swing greater than the nominal power supply is required in order to provide an appropriate contrast ratio with integrated surface normal electroabsorption modulators (EMs). This conflicts with the CMOS reliability considerations [7], [8], which constrain the maximum static voltages across a core transistor's gate, source, and drain terminals to be no more than the nominal power supply, whereas transient voltage spikes must not exceed this limit by more than 20%-30%. Thick-oxide I/O devices that are rated for higher voltage operation could potentially be used to supply the necessary modulator drive voltages, but these thick-oxide devices cannot match the core CMOS devices' speed. Thus, the challenge is to provide an acceptable output swing at high data rates without overstressing the core devices. To address this, a pulsed-cascode output stage is used that reliably supplies a voltage swing of twice the nominal supply and consists of only core devices for maximum switching speed. Fig. 4 shows the pulsed-cascode output stage, which accepts both a "low" input IN_low that swings between Gnd and the nominal chip Vdd and a "high" input IN_high with the same data value that has been level-shifted to swing between Vdd and Vdd2, where Vdd2 is nominally twice the voltage of Vdd. The level-shifting multiplexer circuitry is detailed in [4]. Static-voltage overstress is eliminated in the output-stage cascode structure by equally splitting the output voltage across the series transistors. Pulsing the gates of the cascode transistors (MN2 and MP2) during transitions with NOR- and NAND-pulse gates, respectively, allows this driver to eliminate the transient drain-source voltage (V_ds) overstress present in static-biased cascode drivers [9] and prevents transistor degradation from hot-carrier injection [10].
Fig. 5 shows the simulation waveforms of the pulsed-cascode modulator driver with a nominal CMOS supply of 1 V, providing a 2-V output transition from high to low with an assumed modulator capacitance load of 1 pF. A falling transition from the "low" input switches the bottom nMOS (MN1) to drive node mid_n to Gnd, and a simultaneous falling transition on the "high" input triggers a positive pulse from the NOR-pulse gate that drives the gate of MN2 from Vdd to near Vdd2 to allow the output to begin discharging at roughly the same time that the MN2 source is being discharged [Fig. 5(a) and (b)]. The NOR-pulse gate is sized such that the gate of MN2 does not swing all the way to Vdd2 and that the edge rate of the pulse signal also matches the falling rate of mid_n. Therefore, during the transition, a gate-source voltage that does not overly exceed the nominal supply is developed across MN2. The "high" input also activates a pull-down nMOS (MN3) to drive node mid_p from Vdd2 to Vdd to prevent excessive V_ds stress on MP2. Similarly, during an output transition from low to high, the "high" input switches the top pMOS (MP1) to drive node mid_p to Vdd2, and the "low" input triggers a negative pulse from the NAND-pulse gate that drives the gate of the MP2 transistor from Vdd to near Gnd. For ratios of C_out/C_midn from 1.3 (unloaded) to 15.5, no voltage spike between the gate, source, and drain terminals of any output device exceeds the supply voltage by more than 20%.
It is important that the cascode transistors have a drive strength similar to that of the top or bottom transistors to reduce the V_ds stress during transients. Thus, in order to minimize the body effect on the cascode transistors, they are placed in separate wells that are dynamically biased with replica circuitry to track their source voltages. This reduces the cascode transistors' threshold voltages, resulting in a similar voltage drop across the two series driving transistors. The increased drive strength of the cascode transistor also serves to reduce the modulator driver's output transition time. Little power and area overhead is necessary for the replica-bias circuitry, as the replica transistors are sized to be less than 10% of the main driver transistors.
C. Integrating and Double-Sampling Receiver
While receiver circuitry power and area may not be a primary issue for traditional telecom applications which demand high sensitivity, in high-density optical interconnect applications, performance parameters such as sensitivity must be balanced with power and area constraints. A receiver front-end architecture that reduces the number of linear gain elements, and thus is less sensitive to the reduced gain in modern CMOS processes, is the integrating and double-sampling front end [6]. An absence of high-gain amplifiers allows for savings in both the power and the area and makes the integrating and double-sampling architecture more suitable for the chip-to-chip optical interconnect applications.
The integrating and double-sampling receiver front end [5], as shown in Fig. 6, demultiplexes the incoming data stream with five parallel segments that include a pair of input samplers, a buffer, and a sense amplifier. Two current sources at the receiver input node, namely, a photodiode current and a current source that is feedback-biased to the average photodiode current, supply and deplete charge from the receiver input capacitance, respectively. For data encoded to ensure dc balance, the input voltage will integrate up or down due to the mismatch in these currents. A differential voltage Δv_b is developed in each receiver segment by sampling at the beginning and end of a bit period defined by the rising edge of the recovered clocks Φ[n] and Φ[n + 1], respectively. While, in a previous implementation [8], Δv_b was directly applied to an offset-corrected StrongArm latch [11] used as a sense amplifier for data regeneration, the reduced supply voltage that comes with scaling technologies causes the integrating input to exceed the sense-amp input range. In order to fix the sense-amp common-mode input level and to buffer the sensitive sample nodes from kickback charge, a differential buffer is inserted between the samplers and the sense amp. The power penalty of the additional buffer is quite small (250 µW per segment), as the buffer gain is low to avoid sense-amp offset saturation, and the bandwidth requirements are relaxed due to input demultiplexing. The use of pMOS samplers provides a receiver input range from 0.6 to 1.1 V. Demultiplexing directly at the input allows the sense amp sufficient time (five times the bit period) for data regeneration and precharging, thus eliminating the requirement for a TIA operating at the bit rate.
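The integrate-then-double-sample principle described above can be illustrated with an idealized behavioral model. The Python sketch below is not the authors' circuit: it ignores noise, sampler nonidealities, and the buffer, and every numerical value (photocurrents, input capacitance, starting voltage) is a hypothetical assumption chosen only to keep the input node inside the 0.6-1.1-V range quoted in the text.

```python
def double_sample_decode(bits, i_one, i_zero, c_in, t_bit, v_start=0.85):
    """Ideal integrating front end with double-sampling decisions.

    All parameter values are illustrative assumptions, not measured chip data.
    """
    # For dc-balanced data, the feedback-biased source settles to the
    # mean photocurrent.
    i_avg = 0.5 * (i_one + i_zero)
    v = v_start
    samples = [v]  # input-node voltage at each bit boundary
    for b in bits:
        i_pd = i_one if b else i_zero
        # Net charge integrated on the input capacitance over one bit.
        v += (i_pd - i_avg) * t_bit / c_in
        samples.append(v)
    # Double sampling: end-of-bit sample minus start-of-bit sample.
    delta_v = [samples[n + 1] - samples[n] for n in range(len(bits))]
    decoded = [1 if dv > 0 else 0 for dv in delta_v]
    return samples, delta_v, decoded


pattern = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # dc-balanced 10-b pattern
samples, delta_v, decoded = double_sample_decode(
    pattern, i_one=20e-6, i_zero=5e-6, c_in=500e-15, t_bit=1 / 1.8e9)
print(decoded == pattern, f"|dv_b| = {abs(delta_v[0]) * 1e3:.2f} mV")
```

In this model the sign of Δv_b alone carries the data, so no amplifier has to follow the waveform at the bit rate; the magnitude of Δv_b scales with the photocurrent difference and the bit period and inversely with the input capacitance.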
IV. EXPERIMENT
Three experimental configurations were used, as shown in Fig. 7. In the first configuration, laser light was free-space-coupled onto the modulators, and the output was collimated and coupled onto a large-area photodetector for dc contrast-ratio measurements. In the second configuration, light exiting the modulators was coupled into a fiber for transition speed measurements with a high-speed oscilloscope. In the final configuration, light was coupled into high-speed detectors (HSDs) to complete the transceiver link.
Common to all of the configurations, an array of InP QWAFEMs was flip-chip-bonded to the CMOS-transceiver chip with eight transmit channels sized for varying drive strengths and two receive channels, as shown in Fig. 8(a) . The transceiver was fabricated in a 1-V 90-nm CMOS process.
The chip was placed in an open-cavity surface-mount package on a test board mounted on a three-axis translation stage. An HP8133A pulse generator supplied the reference clock to the transmitter PLLs, and the transmit data sequence was controlled with an on-chip 20-b register that can be programmed with a computer via a serial testing interface.

Fig. 7 (caption, beginning lost in extraction): ... is accomplished by a microscope objective (O1), and the spatially displaced output beam of the modulator is reflected by the pick-off mirror (POM). In A, for the dc contrast-ratio measurements, collimated light is absorbed by a large-area detector (LAD). In B, for the high-speed rise- and fall-time measurements, the light is reflected off a second mirror and focused into a single-mode fiber. In C, for the full transceiver link, the light is focused by a second microscope objective (O2) onto a high-speed detector (HSD). Alignment of the beam on the modulator and the detector is accomplished with an IR camera (not shown), an LED for illumination, and two removable pellicle beamsplitters (BS).
Light from an Agilent 81680A tunable laser with a range of 1457-1584 nm was coupled via polarization-maintaining fiber into a free-space collimator. The collimator was followed by a rotatable half-wave plate, which was used to ensure that the linear-polarized light in the QWAFEM's resonator would be transverse electric (i.e., in the plane of the quantum wells) for optimal performance. The collimated beam was focused onto the modulator array with a Mitutoyo infinity-corrected 10× near-infrared (NIR) objective with a free-space focal spot diameter of about 12 µm. Between the collimator and the objective, a removable pellicle BS was used to allow imaging of the beam on the modulator array to aid the beam alignment. The light entry and exit points on the array's substrate were displaced by 200 µm. After collimation by the microscope objective, the beam exiting the modulator was separated for detection by a POM. A photodetector was placed in the beam path for the dc contrast-ratio measurements. The default high-speed modulation of the devices was bypassed by setting each bit in the 20-b sequence to the same state, and the modulators were switched by changing the bias voltage applied to the p-contacts of all driven modulators. For each working device, the optimal combination of the bias voltage and the wavelength was chosen to maximize the contrast ratio.
The maximum contrast ratio measured on a device bonded to CMOS was 3.86 dB, measured at 1528 nm for a 2-V swing, which somewhat exceeds the performance of the unbonded device in Fig. 2. Upon coupling the beam into a single-mode fiber, the same device yielded a peak contrast ratio of 5.53 dB for the same conditions. Modulating the beam in this device could also change the beam shape because different angular components of the beam interact differently with the resonator. Any such change of shape effectively corresponds to coupling light into higher order modes. Such modes would not propagate in the fiber, so the power in them would be lost, which could increase the contrast of the modulator in the system and would explain the larger contrast ratio observed after coupling into the single-mode fiber. It was found that the contrast ratio decreased as the optical power in the system was increased, which may be due to the photogenerated carriers screening the field applied across the MQW region. A test of the electrical contacts on the modulator chip indicated that the metal contacts to the devices were nonohmic, which may be responsible for an inefficient sweepout of the carriers, and such imperfect contacts may also limit the response time of the modulator.
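To put the two contrast ratios quoted above on a linear scale, the short Python sketch below converts decibels to a power ratio and to a modulation depth. It assumes that the contrast ratio is the optical power ratio between the "pass" and "absorbing" states; the helper names are illustrative.

```python
def db_to_power_ratio(db):
    """Convert an optical power ratio in dB to a linear ratio."""
    return 10 ** (db / 10)


def modulation_depth(db):
    """Fraction of the pass-state power removed in the absorbing state."""
    return 1 - 1 / db_to_power_ratio(db)


for label, cr_db in [("free space", 3.86), ("single-mode fiber", 5.53)]:
    print(f"{label}: ratio = {db_to_power_ratio(cr_db):.2f}, "
          f"depth = {modulation_depth(cr_db):.2f}")
```

Under this assumption, 3.86 dB corresponds to roughly a 2.4:1 power ratio and 5.53 dB to roughly 3.6:1.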
Rise and fall times of the QWAFEMs were measured using an 86109A 30-GHz oscilloscope. After the POM, the setup was modified such that the beam was reflected off a second mirror and into a single-mode fiber. The fiber-coupled light passed through an erbium-doped fiber amplifier and a variable attenuator and into the oscilloscope. The devices were set to send a pattern of ten sequential bits on and then ten sequential bits off to measure the rise and fall times. The fastest transmitter had a rise time of 1.2 ns and a fall time of 900 ps, which are measured from 10% to 90%. The device's estimated capacitance was 1.5 pF. The device with the highest contrast ratio, which was used in the transceiver link, had a rise time of 3.8 ns and a fall time of 3.9 ns, and its estimated capacitance was 1.8 pF.
As these transition times exceed the anticipated values by over an order of magnitude, a combination of electrical and optical testing was performed to determine the root cause. High-speed electrical operation of the driver loaded with a bonded modulator was verified by using on-chip samplers to subsample the output voltage and convert it to a proportional current that is driven off-chip and viewed on the oscilloscope. The on-chip driver has an adequate bandwidth for 10-Gb/s operation. However, due to the excessive series contact resistance, which is estimated to be on the order of 1 kΩ, this output drive voltage is filtered at the actual modulator. Thus, the resulting optical waveform has increased transition times, as shown in the 1-Gb/s optical oscilloscope waveforms of Fig. 9.
For the high-speed transceiver link, the test setup was modified such that the output light from the POM was coupled via a Mitutoyo infinity-corrected 20× NIR objective into a 20-µm diameter high-speed InGaAs/InP photodetector (PDCS20T, Albis Optoelectronics, Switzerland). The photodetectors are attached to the receivers on a second identical CMOS-transceiver chip via short wirebonds [Fig. 8(b)]. This chip is also packaged and attached to a test board mounted to a three-axis translation stage. To enable measurements over a wider range of optical power, the output of the Agilent 81680A tunable laser was coupled via nonpolarization-maintaining fiber through an erbium-doped fiber amplifier, a variable fiber attenuator, and a polarization controller and into the free-space collimator. In this configuration, it was possible to optimize the phase and bias of the detectors. The received data are verified with an on-chip 20-b register whose output can be either scanned out to a computer or observed on an oscilloscope. The bit error rate (BER) of individual worst-case bit sequences was measured as the input optical power and the detection phase were adjusted. While the CMOS transceiver was designed to nominally operate at 5-16 Gb/s, the contact-resistance-limited transition times of the transmitter did not permit operation at that speed. When the chip was triggered too slowly, its performance degraded due to the limited voltage-controlled-oscillator (VCO) range. Thus, in order to get meaningful results from the transceiver link, we synthesized a repeating 10-b pattern by specifying the 20-b sequence in pairs of bits to allow the VCO to operate at a higher frequency. Since the receiver chooses the decision threshold based on the average current at the photodetector, it was necessary to send signals with an equal number of ones and zeros. We tested several bit patterns, attempting to generate the worst-case detection scenario available with 10 b.
By taking a histogram of each of the worst-case bits in the pattern, we were able to estimate the error rate. The transmission of 10-b sequences was tested over a range. The bit timing margin was such that the BER was estimated at less than 10⁻⁹ over a total receiver-clock phase-shift range of 47% of one bit period. Table I shows the collected results of our BER test. The lower data rates required uncharacteristically more power because the link's performance was degraded as the speed was decreased far below the chip's designed clock rate.
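The text does not state which estimator was applied to the histograms. One common approach, shown here purely as an assumed illustration, fits Gaussian statistics to the sampled "one" and "zero" levels and converts the resulting Q factor to a BER. The function names and the toy sample values are hypothetical.

```python
import math
import statistics


def ber_from_q(q):
    """BER for a given Q factor, assuming Gaussian noise on both levels."""
    return 0.5 * math.erfc(q / math.sqrt(2))


def q_from_samples(ones, zeros):
    """Q factor from sampled 'one' and 'zero' levels (any consistent unit)."""
    mu1, mu0 = statistics.mean(ones), statistics.mean(zeros)
    s1, s0 = statistics.stdev(ones), statistics.stdev(zeros)
    return (mu1 - mu0) / (s1 + s0)


# Toy data, not measurements: means 1.0 and 0.0, sample std. dev. 0.1 each.
q = q_from_samples([0.9, 1.0, 1.1], [-0.1, 0.0, 0.1])
print(f"Q = {q:.2f}, estimated BER = {ber_from_q(q):.2e}")
```

Under the Gaussian assumption, a Q factor of about 6 corresponds to a BER just under 10⁻⁹, which is the threshold used in the timing-margin statement above.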
The measured 10%-90% rise and fall times correspond to an ∼1.7-ns time constant (assuming a simple one-pole system), which, in turn, corresponds in our simulations to ∼1.3-Gb/s maximum data rate. Our measured rates of up to 1.8 Gb/s are somewhat higher than this. It is possible that changing the bias voltage on the device during the optimization of the signal in the link may have also changed the rise and fall times from the values measured, hence giving a different data-rate limit.
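The one-pole relation used above can be checked with a short Python sketch. For a single-pole response, the 10%-90% transition time equals τ·ln 9 (about 2.2τ). The comparison with an RC product uses the roughly 1-kΩ contact resistance and 1.8-pF capacitance estimated elsewhere in this paper; treating the modulator as a single lumped RC is a simplifying assumption.

```python
import math


def tau_from_rise_time(t_10_90):
    """Time constant of a one-pole system from its 10%-90% transition time."""
    return t_10_90 / math.log(9)


t_rise, t_fall = 3.8e-9, 3.9e-9   # link device, from the measurements above
c_mod = 1.8e-12                   # estimated device capacitance (F)

tau_rise = tau_from_rise_time(t_rise)
tau_fall = tau_from_rise_time(t_fall)
r_implied = tau_rise / c_mod      # series resistance implied by a lumped RC

print(f"tau (rise) = {tau_rise * 1e9:.2f} ns, tau (fall) = {tau_fall * 1e9:.2f} ns")
print(f"implied series resistance = {r_implied:.0f} ohm")
```

Under these assumptions the implied series resistance comes out just under 1 kΩ, in line with the order-of-magnitude contact-resistance estimate.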
Loss was measured for the optical path. From the laser source to the free-space collimated beam, the loss was 0.7 dB. Between there and focusing through the microscope, reflecting off the V-grooves, and the separation by the POM, the loss was an additional 7.8 dB. Focusing on the device mesa (in the "pass" state of the device) incurred a loss of 2.3 dB. In focusing the beam on the detector, the loss calculated from a photocurrent measurement was 2.6 dB.
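The individual losses listed above can be combined into a rough link budget. The Python sketch below assumes that the four stages are sequential and independent, so that their decibel losses simply add; the stage labels and the 0-dBm launch power in the example are illustrative assumptions, not values from the measurements.

```python
# Measured stage losses in dB, in the order given in the text.
STAGE_LOSSES_DB = {
    "source to collimated beam": 0.7,
    "objective, V-groove mirrors, and POM": 7.8,
    "device mesa (pass state)": 2.3,
    "focusing onto detector": 2.6,
}


def total_loss_db(losses):
    """Sum of sequential stage losses in dB."""
    return sum(losses.values())


def received_dbm(launch_dbm, losses):
    """Power at the detector for a given launch power (hypothetical input)."""
    return launch_dbm - total_loss_db(losses)


total = total_loss_db(STAGE_LOSSES_DB)
print(f"total path loss = {total:.1f} dB "
      f"({10 ** (-total / 10) * 100:.1f}% of launch power reaches the detector)")
print(f"0 dBm launched -> {received_dbm(0.0, STAGE_LOSSES_DB):.1f} dBm received")
```

With this additive assumption, the path loss totals about 13.4 dB, dominated by the 7.8-dB segment that includes the objective, the etched mirrors, and the POM.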
The total transceiver electrical power dissipation is 23.6 mW at 1.8 Gb/s. Transmitter power, including clock generation, is 15.2 mW, with 3.8 mW to drive the QWAFEM, 8.8 mW for the multiplexer and buffers, and 2.6 mW for the TX PLL. The receiver consumes 8.4 mW, including the clock recovery, with 4.5 mW from the integrating/double-sampling front end and 3.9 mW from the CDR circuitry. Total transceiver area is 0.092 mm², with 0.017 mm² for the transmitter and 0.075 mm² for the receiver. By assuming a resolution of the contact-resistance issue and a nominal increase in driver sizing for the 1.8-pF modulator, the estimated driver power consumption at 16 Gb/s is 69.0 mW, including multiplexing and buffering. At this data rate, the other link components would also consume more dynamic switching power. At 16 Gb/s, the TX PLL consumes 23.0 mW, the RX front end consumes 23.0 mW, and the RX CDR consumes 35.0 mW, for a total link power of 150 mW [5]. At 10 Gb/s, the total power reduces to 97.8 mW due to the reduced dynamic switching power. Improvements in power efficiency are possible in parallel I/O systems where there is a potential to easily share the transmit clock-generation PLL among several channels with proper clock distribution and a potential use of phase-correction circuitry. This allows the TX PLL power to be amortized over the number of channels, which could range as high as 12 or 20. In a typical 12-channel system, the implemented 1.8-Gb/s transceiver power consumption would drop from 23.6 to 21.2 mW per channel. If the aforementioned contact-resistance issues are resolved, then the per-channel power at 10 Gb/s would similarly drop from 97.8 to 84.6 mW. While the implemented per-channel clock recovery does not allow easy sharing of the receiver clock recovery, alternate clocking architectures, such as a source-synchronous forwarded-clock system, allow for similar sharing of receiver clocking circuitry at the expense of an extra dedicated channel for the clock.
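The power figures above can be tallied with a small Python sketch, which also shows how sharing the TX PLL among channels yields the per-channel numbers. The function and argument names are illustrative; the energy-per-bit values are derived here by simple division and are not quoted in the text.

```python
def per_channel_power_mw(driver, mux, tx_pll, rx_frontend, rx_cdr, n_channels=1):
    """Per-channel link power in mW with the TX PLL shared by n_channels."""
    return driver + mux + tx_pll / n_channels + rx_frontend + rx_cdr


# Measured breakdown at 1.8 Gb/s (mW).
measured = dict(driver=3.8, mux=8.8, tx_pll=2.6, rx_frontend=4.5, rx_cdr=3.9)
# Estimated breakdown at 16 Gb/s (mW); the 69.0-mW driver figure already
# includes multiplexing and buffering, so mux is set to zero here.
estimated_16g = dict(driver=69.0, mux=0.0, tx_pll=23.0, rx_frontend=23.0, rx_cdr=35.0)

p_single = per_channel_power_mw(**measured)
p_shared = per_channel_power_mw(**measured, n_channels=12)
p_16g = per_channel_power_mw(**estimated_16g)

print(f"1.8 Gb/s: {p_single:.1f} mW single, {p_shared:.1f} mW with 12-way PLL sharing")
print(f"energy per bit: {p_single / 1.8:.1f} pJ/b at 1.8 Gb/s, "
      f"{p_16g / 16:.1f} pJ/b at 16 Gb/s (estimated)")
```

Because only the TX PLL is shared in this model, the saving per channel is bounded by the PLL power itself, which is why the improvement at 1.8 Gb/s is modest.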
V. CONCLUSION
We believe that this paper presents the first demonstration of an optoelectronic interconnect transceiver at 1550 nm using an output device directly bonded to the CMOS. Integrating the modulators with CMOS is practical due to their low drive voltage of 2 V and the simplicity of packaging via solder bonds. The QWAFEM architecture is a good candidate for optical interconnect systems due to its surface normal input and output ports, its ability to be arrayed in two dimensions, its misalignment-tolerant optical ports, and its broad bandwidth of operation.
A transceiver is implemented with a pulsed-cascode driver that reliably supplies an output-voltage swing of twice the nominal CMOS supply to the QWAFEMs, allowing for compatibility with present and future scaled CMOS technologies. In addition, the integrating and double-sampling receiver front end allows for demultiplexing of the data directly at the input and eliminates the need for a high-bandwidth TIA. The good power efficiency and the small area of the transceiver make it suitable for high-density optical interconnect applications.
