Abstract-We present a new method of latency reduction in optical interconnects: using very low duty cycle return-to-zero encoding (i.e., subpicosecond pulses). An analytical comparison of three different receiver architectures, including transimpedance, integrating, and totem-pole diode pair, is presented. For all three receivers, we demonstrate that using short pulses instead of nonreturn-to-zero (NRZ) shortens the circuit delay. We also experimentally demonstrate a 65% reduction in latency of a transimpedance receiver by using short optical pulses. Finally, we show that the latency of optical interconnects can be comparable to or even less than electrical interconnects for global on-chip communication.
I. INTRODUCTION

MODERN computer processors run at clock speeds of many gigahertz, but the processor-to-memory interface typically runs at only a few hundred megahertz. A key reason for this difference, and a problem for computing in general, is that interface connection speeds have not been able to keep up with the increase in processor speeds. This is mainly because of design issues of electrical busses and their underlying physical properties. Due to the capacity limitations of electrical wires, most long distance communication is now done via optics. For medium distance communication, e.g., local-area networks (LANs), metropolitan-area networks (MANs), and wide-area networks (WANs) (about 300 m-100 km), optics is making inroads, specifically because only optics can support the high data rates required by these applications. At shorter distances (a few meters to a few hundred meters), primarily in data links, optics is rapidly gaining entry. Recently, a lot of research has been done on technologies for massively parallel optical interconnects, including dense integration of modulators [1] and vertical cavity surface emitting lasers (VCSELs) [2]. This might allow the use of optical interconnects in short distance inter-chip and even intra-chip communication.
In connections between and within electronic chips, signal latency is a critical parameter in determining system performance. As the complementary metal-oxide-semiconductor (CMOS) linewidth scales, the processor clock speed increases, making it difficult to run an entire chip synchronously. In other words, transferring data within a clock cycle is becoming difficult. According to the International Technology Roadmap for Semiconductors (ITRS) estimate [3] , gate delay and local interconnect delay are being reduced as the technology is scaling, but the delay of global interconnects with and without repeaters is continuously increasing relative to the clock period.
The propagation velocity of global interconnects with repeaters is a small fraction of the velocity of light (e.g., 0.1c-0.2c) and is not expected to improve significantly [4]-[6]. For 0.25-µm technology, the delay of global lines is less than a clock cycle, but for future technologies the delay will be longer than a clock cycle. If the signals can be propagated at a significant fraction of the velocity of light (e.g., 0.3c), the delay in communication will be less than a clock cycle up to 0.1-µm technology [4].
It might be possible to use optics to provide communication across chips at a significant fraction of the velocity of light. For optics to be feasible, the delay in the transmitter and the receiver has to be very low, of the order of a few gate delays. The delay of propagation in optical media cannot be altered, though it is relatively fast (about 0.67c in glass). Transmitter and receiver circuits designed in silicon CMOS are likely to keep pace with the speed of logic operations in silicon chips as the technology scales [7].
Latency is very important in optical interconnects, as mentioned by Dambre et al. [8], who show that with low latency optical links, three-dimensional (3-D) optoelectronic multi-FPGAs outperform two-dimensional (2-D) electronic FPGAs. Also, Collet et al. [9] recently concluded that since the most critical issue in computer architecture is the access time to the main memory, signal latency is of critical importance in implementing optical interconnects. There is some concern, however, about the increased latency of optical interconnects compared with their electrical counterparts because of the added functions of electrical-to-optical and optical-to-electrical conversion, as expressed in [10]. But because of advanced integration techniques [1], optical components with very low parasitics can be placed close to the origin of the signal, which significantly reduces the delay in routing and driving signals to optical components. Kyriakis-Bitzaros et al. [11], on the basis of a realistic model in 0.8-µm CMOS technology, demonstrated that the latency of an optical link with VCSELs and edge-emitting lasers is lower than that of the electrical link even for subcentimeter line lengths.
Most previous work has looked at the latency of an optical link using an NRZ data format with a VCSEL or an edge-emitting laser as a transmitter. The turn-on delay of lasers could add significant latency, which depends on the electrical drive signal strength and waveform [12] . This turn-on delay can be eliminated by using modulators instead of VCSELs. It is additionally possible to significantly reduce the latency of optical interconnect receivers by using short pulses with modulators. The fast optical rise time and concentration of all the energy in short pulses both work toward reducing the latency.
Short pulses provide many advantages in communications [13] , though it is important that the medium be able to support the propagation of these pulses. On electrical wires, the frequency dependent losses are very high for short pulses, and there is substantial dispersion that spreads the pulses, making their use impractical. In optics, the losses for the entire spectrum of short pulses are nearly constant. Also, the dispersion in an optical medium for small distances is tolerable and does not cause significant broadening. It has been demonstrated that use of short pulses in modulator based interconnects can remove skew and jitter from an array of modulators [14] , provide sensitivity enhancement in receivers [15] , deliver a precise clock signal [17] , and allow single source wavelength division multiplexed interconnects [18] . All these advantages of short pulses make it an attractive option for short distance optical interconnects.
In parallel optical interconnects, the latency is reduced by reducing complexity. Instead of multiplexing many data streams and sending them over fewer channels, multiple channels running at the clock rate of the given technology allow simpler receiver and transmitter circuits, which in turn reduce the latency of the link. Note that the issue in this paper is not one of the total bit rate that can be supported by different approaches, nor is it one of the efficiency of utilization of optical bandwidth. In longer communications links, long delay is unavoidable because of signal propagation times and receiver latencies are not important. In contrast, the receiver delay (latency) in short interconnects can be a substantial fraction of the overall delay, which in turn is very important for system performance. This paper, therefore, deals with the comparison of latencies between NRZ and short pulse links in different receiver architectures through simulation and measurement, at roughly the system clock rate supported by the given technology. The organization of this paper is as follows: the concept of a modulator-based interconnect system is described in Section II. Section III presents simulations of three different receiver topologies. Latency measurement results for the transimpedance receiver are presented in Section IV. Finally, conclusions are given in Section V.
II. SYSTEM CONCEPT
A reflection-modulator-based interconnect system is shown in Fig. 1 . The optical devices on the chips are p-i-n diodes with quantum wells in the intrinsic region. The devices are referred to here as multiple quantum well (MQW) diodes. A detailed description of the operation of these devices is given in [19] . MQW diodes can act as both the modulator and the detector, depending on the circuit to which they are connected. These diodes are hybrid-integrated to silicon CMOS chips [1] to take advantage of the highly advanced silicon CMOS process and the good optoelectronic properties of GaAs at 850 nm, the wavelength of operation. In a modulator-based interconnect system, a fan-out element generates multiple beams from the input beam. These beams are modulated by the reflection-modulator array and imaged on the receiver array, where the data is recovered.
Optical interconnects have three components: the transmitter, the medium of propagation and the receiver. The transmitter can be easily optimized because it essentially consists only of digital components (its input is a digital logic level). For a MQW modulator, the driver is typically an electrical buffer chain and optimization of such a circuit is outlined in [20] . The receiver, having analog input, provides the largest room for improvement [21] . Therefore, in the following sections, we analyze the latency of different receiver architectures for NRZ and short pulse inputs. Signal latency, here, is defined as the maximum delay between rise or fall of the input and output waveforms, measured at 50% of the signal amplitude.
The following parameter values are assumed for simulation; they correspond to the parameters of the 0.25-µm technology in which the receivers were fabricated. The clock period in a given technology is roughly eight FO-4, where FO-4 is the delay of an inverter driving another inverter four times its size [5]. In 0.25-µm technology, this corresponds to a clock rate of about 1 GHz, and the link is assumed to operate with this clock. Low-capacitance, high-responsivity photodiodes are assumed based on results presented in the literature [22]. (The actual diode capacitance was higher in the current fabrication used for our experimental results.)
• Supply voltage 2.5 V.
• Speed of operation 1 Gb/s.
• Photodiode responsivity 0.5 A/W.
• Number of post amplifier stages 2.
• Capacitance of the diode after integration 40 fF.
• Optical pulsewidth is subpicosecond.
• Pulsewidth of electrical current pulses generated from photodiode (limited by the transit time of carriers in intrinsic region) 10 ps.
III. MODELING OF THE RECEIVER
A. Transimpedance Receivers
The transimpedance receiver is the most commonly used receiver in optical communication. The first stage of a transimpedance receiver consists of an amplifier with resistive feedback. This is followed by gain stages. In this paper, the transimpedance receiver is based on inverters following the implementation used in [23] and [24] . Fig. 2 shows a simplified schematic of the transimpedance receiver. Intuitively, we would expect to lower the latency of the transimpedance receiver by using short pulses, instead of NRZ signaling because, for the same total energy in the bit period, a larger maximum amplitude at the output of the transimpedance stage is generated with short pulse input. This larger amplitude reduces the gain required from later stages, hence reducing the latency. This circuit was analyzed using the circuit-simulator SPICE, and a first-order analytic model. The latency of the transimpedance receiver with an NRZ data format was analyzed in [21] .
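The intuition above can be illustrated with a rough numerical sketch. It assumes a single-pole transimpedance stage, so that an NRZ bit (a constant photocurrent over the bit period) settles to the current times the transimpedance, while a short pulse (treated as an impulse of charge) produces a peak of charge times transimpedance divided by the time constant. The transimpedance of 1 kΩ and the time constant of 50 ps are hypothetical values chosen only for illustration; they are not the parameters used in this paper.

```python
import math

def nrz_peak_voltage(energy_j, responsivity_a_per_w, bit_period_s, r_f_ohm):
    """Settled output for a constant photocurrent lasting one bit period.

    Assumes the bit period is much longer than the stage time constant.
    """
    current_a = energy_j * responsivity_a_per_w / bit_period_s
    return current_a * r_f_ohm

def pulse_peak_voltage(energy_j, responsivity_a_per_w, r_f_ohm, tau_s):
    """Peak of a single-pole impulse response to a charge packet Q: Q * R_f / tau.

    Assumes the current pulse is much shorter than the stage time constant.
    """
    charge_c = energy_j * responsivity_a_per_w
    return charge_c * r_f_ohm / tau_s

ENERGY_J = 100e-15       # 100 fJ per bit (illustrative)
RESPONSIVITY = 0.5       # A/W, from the parameter list
BIT_PERIOD_S = 1e-9      # 1 Gb/s, from the parameter list
R_F_OHM = 1e3            # hypothetical transimpedance
TAU_S = 50e-12           # hypothetical stage time constant

v_nrz = nrz_peak_voltage(ENERGY_J, RESPONSIVITY, BIT_PERIOD_S, R_F_OHM)
v_pulse = pulse_peak_voltage(ENERGY_J, RESPONSIVITY, R_F_OHM, TAU_S)
print(f"NRZ peak: {v_nrz * 1e3:.1f} mV, short-pulse peak: {v_pulse * 1e3:.1f} mV")
print(f"amplitude ratio: {v_pulse / v_nrz:.1f}")
```

Under these assumptions the amplitude ratio equals the bit period divided by the stage time constant, which is why the post-amplifier chain needs less gain, and hence less delay, with short pulses.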
To understand the mechanism of latency in receivers, a simplified model of the transimpedance receiver is analyzed. This model is shown in Fig. 3. The first stage is the transimpedance amplifier with a finite gain-bandwidth product. An ideal amplifier with a series output impedance (the same as that of the transistors), together with the output capacitance, models the finite-gain-bandwidth amplifier. All the capacitances at the output of the front-end amplifier, including the input capacitance of the next stage, are combined into a single output capacitance. The gain stages are modeled as open-loop amplifiers with a finite gain-bandwidth product. After computing the swing at the output of the transimpedance amplifier, the required gain is calculated for the post-amplifier chain. Because of the finite gain-bandwidth product, the time constant of each stage can be deduced from the required gain. A first-order estimate of latency, defined as the delay from 50% input change to 50% output change, can be obtained by adding the time constants of all these stages. A step input simulates the NRZ input and a 10-ps pulse simulates the short pulse input.
The following parameter values are assumed.
• Total capacitance at the input of the receiver fF.
• Feedback resistance k .
• Output impedance of the amplifier k .
• Total capacitive loading at the output of the amplifier fF.
• Open loop gain of the amplifier .
• Gain-bandwidth product of each post amplifier stage 10 GHz.
The transfer function of the transimpedance stage is given by (1) where and . Based on this transfer function, pulse and step responses were computed for the transimpedance stage. Energies per bit were computed based on 1-Gb/s operation of the receiver.
Receiver latency with different input optical energies is shown in Fig. 4 . This result shows a latency reduction of 65% for large optical energies by using short pulses as compared with NRZ [25] . The results match the precise SPICE simulations for the transimpedance receiver. It should be noted though, that increasing the data rate reduces the advantage of short pulses over NRZ, and in the limit when bit period is the same as the short pulsewidth, they both have equivalent performance.
The effect on latency of changing the number of post-amplifier stages is considered next. If the total gain needed from the post-amplifier chain is $G$ and the chain has $N$ identical stages, each stage must provide a gain of $G^{1/N}$. Since the time constant of a stage with a fixed gain-bandwidth product is proportional to its gain, the total delay of the chain scales as

$$t_{\text{chain}} \propto N\,G^{1/N} \qquad (2)$$

where the per-stage gain $G^{1/N}$ decreases exponentially with the number of stages $N$. For low $N$, the exponential decay of $G^{1/N}$ dominates, while for a larger $N$ the linear increase of $N$ dominates. This behavior can be seen in Fig. 5. Intuitively, for a large $N$, when $N$ is increased to $N+1$, the reduction in gain per stage is very small. Since the reduction in gain is small, the reduction in delay per stage is also small, but because of the extra stage the total delay (which is the sum of the delays of all stages) increases. On the other hand, for a small $N$, when $N$ is increased to $N+1$, the reduction in gain per stage is relatively large, causing a large reduction in the delay. Even with one extra stage, the overall delay is reduced. This implies that there is an optimal number of post-amplifier stages to minimize latency.
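The tradeoff described above can be sketched numerically. The snippet below assumes identical single-pole stages, each with gain equal to the N-th root of the total gain and a time constant of gain/(2π·GBW), and it sums the time constants as the first-order latency estimate. The function names and the example gain values are illustrative assumptions; only the 10-GHz gain-bandwidth product comes from the parameter list.

```python
import math

def chain_delay_s(total_gain, n_stages, gbw_hz):
    """First-order delay of n identical stages: sum of the stage time constants."""
    stage_gain = total_gain ** (1.0 / n_stages)
    # A single-pole stage with gain A and fixed GBW has bandwidth GBW / A.
    stage_tau_s = stage_gain / (2.0 * math.pi * gbw_hz)
    return n_stages * stage_tau_s

def best_stage_count(total_gain, gbw_hz, max_stages=10):
    """Stage count (1..max_stages) that minimizes the first-order chain delay."""
    return min(range(1, max_stages + 1),
               key=lambda n: chain_delay_s(total_gain, n, gbw_hz))

GBW_HZ = 10e9  # gain-bandwidth product per stage, from the parameter list

for gain in (5, 20, 100):  # illustrative total gains
    n_opt = best_stage_count(gain, GBW_HZ)
    delay_ps = chain_delay_s(gain, n_opt, GBW_HZ) * 1e12
    print(f"total gain {gain}: best N = {n_opt}, delay = {delay_ps:.0f} ps")
```

In this simplified model the minimum falls near N = ln(G), so a larger required gain (lower optical energy) favors more stages, consistent with the trend described in the text.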
The receiver delay versus the number of stages for different input optical energies per bit is shown in Fig. 6 . This plot follows the same pattern as in Fig. 5 . For optical energy around 100 fJ, two or three stages of post-amplifier minimizes the latency of the receiver. The calculated receiver latency versus optical energy for a two-stage and a three-stage post amplifier is shown in Fig. 7 . This figure illustrates that as the pulse energy is increased, the amount of gain required reduces, causing the delay to be minimized by a lower number of stages for a pulse energy higher than a certain crossover pulse energy. Crossover occurs at 70 fJ for NRZ in this figure, but for short pulses this crossover occurs below the plotted optical energies.
In this section, we saw that by using short pulses the latency in the transimpedance receiver can be reduced by more than 60% compared with using NRZ data. The results of the first-order model and SPICE simulation match very closely. By using the first-order model, it was also concluded that for a given energy per bit, there is an optimum number of stages to minimize latency, which may not be the same for short pulses and NRZ input. For reasonable pulse energies, as a rule of thumb, the latency is minimized by using between two and five post-amplifier stages. Latency measurement results for this receiver are given in Section IV.
B. Integrating Receiver
Fig. 8 shows the circuit schematic diagram of the integrating receiver front-end. The architecture of this integrating receiver is based on a strongarm latch [16]. This receiver regeneratively amplifies the differential input to generate logic levels. It integrates the input photocurrent for half a cycle, and for the remaining half cycle it evaluates based on the integrated charge. The output of this front-end is valid for half the cycle, and for the other half it is set to the supply voltage. A set-reset (SR) latch is used to convert this output to a valid output for the entire bit period.
Fig. 8. Circuit schematic of the integrating receiver front-end.
Fig. 9. Latency with respect to the clock in the integrating receiver with NRZ and short pulse inputs.
Fig. 9 illustrates the timing for the integrating receiver. In this receiver, the latency is a function of the total integrated charge. A typical integration period is half of the clock cycle. If the energy is spread over the entire bit period, as in the case of NRZ, the latency is equal to the integration period (half of the bit period) plus the time to resolve the logic level (at any bit rate). Changing the duty cycle from 50% leads to a tradeoff of energy and latency. Reducing the integration period reduces latency but increases the required optical energy per bit, while increasing the integration period has the opposite effect. In the case of short pulses, the pulses can arrive at the end of the integration period and transfer all of the energy in an instant. As seen in the figure, the latency with a short pulse can be as low as the evaluation time. For a practical system, there needs to be some timing margin to account for the jitter and other variability in the system.
This receiver operates on the principle of positive feedback; hence it is very sensitive. The evaluation time of this receiver depends logarithmically on the amount of integrated charge. For modeling this receiver, the parameter values of the 0.25-µm CMOS technology were assumed so that the results could be compared with the transimpedance receiver. A SPICE simulation of the latency of the entire integrating receiver circuit (including the SR latch) is plotted in Fig. 10. According to these results, the total latency of the receiver is 150 ps for 50 pJ of pulse energy. At 1 Gb/s operation, the latency of the receiver is 500 ps + 150 ps = 650 ps with NRZ input, while it is 150 ps for short pulse input as shown above. Thus, the latency of this receiver is also significantly reduced by using short pulses.
Fig. 10. Latency of the entire integrating receiver, including the SR latch, with short pulse input, computed using the SPICE circuit simulator.
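The latency arithmetic for the integrating receiver can be written out as a small sketch. It assumes, as in the text, that NRZ latency is the integration period plus the evaluation time, and that a short pulse arriving at the end of the integration period leaves only the evaluation time. The function names and the optional timing-margin parameter are illustrative assumptions.

```python
def nrz_latency_ps(bit_rate_gbps, evaluation_ps, integration_fraction=0.5):
    """NRZ latency: integration period (a fraction of the bit period) plus evaluation."""
    bit_period_ps = 1000.0 / bit_rate_gbps
    return integration_fraction * bit_period_ps + evaluation_ps

def short_pulse_latency_ps(evaluation_ps, timing_margin_ps=0.0):
    """Short-pulse latency: evaluation time plus any margin kept for jitter."""
    return evaluation_ps + timing_margin_ps

EVALUATION_PS = 150.0  # total receiver latency from the SPICE result quoted in the text

print(nrz_latency_ps(1.0, EVALUATION_PS))     # 650.0
print(short_pulse_latency_ps(EVALUATION_PS))  # 150.0
```

Passing a nonzero `timing_margin_ps` shows how the practical margin mentioned in the text adds directly to the short-pulse latency.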
C. Totem-Pole Diode Receiver
Very low latency at the expense of larger optical power can be achieved by using a diode pair connected in the totem-pole configuration, as shown in Fig. 11. This design is effectively receiver-less ("recless"), as there is no voltage amplifier involved. Removing amplifier stages not only reduces the delay of the receiver but also eliminates the skew and jitter introduced by these stages. By reducing skew and jitter, a very precise optical clock can be injected with short pulses, as mentioned in [17]. Elimination of amplifiers also allows very fast edges to be delivered through this "receiver," which can be utilized for characterizing high speed signals on-chip.
This "receiver" needs to be connected to a high impedance, such as the input of a buffer, where the charge is integrated. Here the input capacitance is charged to the supply rails by providing sufficient optical energy. The optical energy required to charge the node "in" to the supply rails is a linear function of the front-end capacitance, which is typically dominated by the photodiode capacitance. If the total capacitance at the node "in" is $C$ and the total voltage swing required is $V$, then the total charge required is $Q = CV$. For a photodiode responsivity $R$, the minimum optical energy required is $E = CV/R$. This optical energy can either be delivered in a very brief period or it can be spread out over the entire bit period ($T$). If the input to this receiver is NRZ data with the minimum required pulse energy, the input node will reach half of the supply voltage in half a cycle ($T/2$). If short pulses are used instead, the input node will be charged quickly (in about 10 ps), limited only by the carrier transit time in the intrinsic region of the diode. The timing diagram in Fig. 11 shows the charging of the input node with NRZ and short pulses.
If the flip-chip bonded photodiode capacitance is 40 fF and the responsivity is 0.5 A/W, then for a total capacitance of 90 fF (i.e., the two diodes of the pair plus an assumed buffer capacitance of 10 fF) the optical energy required to charge the input node by 2.5 V (i.e., the supply voltage for 0.25-µm CMOS technology) is 450 fJ. At 1 Gb/s this translates to 450 µW of optical power. This is a relatively large amount of power, which may make practical use of this receiver difficult. In specific applications, though, this might be an acceptable power level. By using a metal-semiconductor-metal photodiode, or a silicon photodiode in a silicon-on-insulator process, the photodiode capacitance can, however, be reduced, which can reduce the optical energy required. For a 1-µm-long intrinsic region, the carrier transit time is roughly 10 ps, which determines the latency with short pulses in this receiver. By comparison, for 1 Gb/s operation, the latency with NRZ data and minimum optical power will be 500 ps. Hence, this receiver gives the minimum latency with short pulses of all three receivers studied here, though the amount of optical energy required can be much larger.
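The energy and power figures above follow from charge = capacitance × voltage swing and energy = charge / responsivity. A minimal sketch of that arithmetic, using the values quoted in the text (the function names are illustrative):

```python
import math

def required_optical_energy_j(capacitance_f, swing_v, responsivity_a_per_w):
    """Minimum optical energy to charge a node: (C * V) / responsivity."""
    charge_c = capacitance_f * swing_v
    return charge_c / responsivity_a_per_w

def average_optical_power_w(energy_per_bit_j, bit_rate_bps):
    """Optical power if this energy is delivered once per bit period."""
    return energy_per_bit_j * bit_rate_bps

energy_j = required_optical_energy_j(90e-15, 2.5, 0.5)  # 90 fF, 2.5 V, 0.5 A/W
power_w = average_optical_power_w(energy_j, 1e9)        # 1 Gb/s

print(f"energy per bit: {energy_j * 1e15:.0f} fJ")  # 450 fJ
print(f"optical power:  {power_w * 1e6:.0f} uW")    # 450 uW
```

Because the energy is linear in capacitance, halving the front-end capacitance in this sketch halves the required optical energy, which is the motivation for the lower-capacitance photodiodes mentioned in the text.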
IV. MEASUREMENT OF THE TRANSIMPEDANCE RECEIVER LATENCY
The results in the earlier section predict that receiver latency can be significantly reduced by using short pulses. To verify this concept, the latency of the receiver-modulator driver pair was measured experimentally. Circuits were fabricated in 0.25-µm standard CMOS technology, and the optical devices, MQW diodes, were flip-chip bonded on these circuits. The active area of these diodes was 20 µm × 20 µm. The schematic of the transmitter-receiver circuit is shown in Fig. 12. A PMOS transistor acts as the resistive feedback element. By varying the voltage at node "tune," the resistance value can be changed. Node "tune" was kept at 0 V for latency measurement. An optical pump-probe setup, described in detail in [26], was used for measurement. Short pulses (approximately 150 fs) generated from a Ti:sapphire mode-locked laser at 850 nm were used for the pump and probe beams at 80 MHz (limited by the repetition rate of the laser). The pump beam and a CW laser output, used as a balance beam, were incident on the differential diode pair at the receiver input. The pump beam excited the receiver, while the balance beam brought the receiver back to its original state over time. A modulator driver was driven by the output of the receiver. The voltage output of the modulator driver was sampled optically with a probe beam at the same rate as the pump beam. Varying the delay between the pump and probe beams mapped the response of the transceiver pair. Since the optical pulses were only 150 fs long, subpicosecond resolution could be achieved in measurements.
The measurement of latency for NRZ data was performed on a different setup, because the delays were much larger. These measurements were made using a high speed detector (2.5-GHz bandwidth) and directly evaluating the waveforms on an oscilloscope. The receiver was designed to operate at 1 Gb/s, but since the repetition rate of the short pulse laser was 80 MHz, all the measurements were done at 80 Mb/s. Measurement results were extrapolated to 1 Gb/s for both NRZ and short pulses. For a given receiver design this does not change the latency, which is the primary concern of this paper. Additionally, the optical energy per bit reduces proportionally with the speed of operation for NRZ data for a given latency, while it remains constant for short pulses. This implies that increasing the bit rate works in favor of NRZ. In the limit of the bit period equal to the short pulsewidth, both data formats are equivalent. Here, we assume the operation of the link is at the system clock rate supported by the 0.25-µm process, i.e., 1 Gb/s. The latency of an entire interconnect can be easily computed by adding the delay of signal propagation to the measured latency of the transceiver pair. Fig. 13 shows the measured values of latency for NRZ and short pulses for the circuit in Fig. 12. These results match the SPICE simulations of the circuit reasonably well. It is evident that short pulses reduce the latency of the transceiver pair compared with NRZ input by a very significant amount, as predicted by the simplified modeling in Section III. In these measurements, the diode capacitance was 260 fF, though the circuits were designed to drive a lower capacitance. This capacitance can be easily reduced to less than 50 fF by improved fabrication techniques and by using smaller devices. Lower capacitance and an optimized circuit will reduce the latency further.
Measurements of latency of a transimpedance receiver implemented in bipolar technology were presented by Wieland et al. [27] . The overall delay of their receiver was 1.5 ns at 1-Gb/s operation. The latency measured with NRZ data here for the transceiver is of the same order, though the latency is much lower with short pulses.
V. CONCLUSIONS AND DISCUSSION
Chip sizes are expected to increase modestly with future generations, and will remain around 2 cm on a side. Assuming a global interconnect of 2 cm, the latency of a repeatered electrical line is 330 ps (at 20% of the velocity of light) [4]. For optical interconnects, the propagation time for a 2 cm distance in glass is 100 ps. The latency in the transmitter can be brought down to 70 ps (assuming a single buffer driving the modulator capacitance), and as we have shown in Section III, the transimpedance receiver latency can be reduced to 70 ps with short pulses. It is even possible to have less than 70 ps of receiver latency by using the "recless" receiver. This shows that optical interconnects can achieve latencies comparable to or even less than those of electrical interconnects, at least theoretically, for on-chip global communication. The key reasons for low latency in optical interconnects are advanced integration of optical devices with silicon, and short pulse encoding.
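The latency budget in this comparison can be reproduced with a short sketch. It uses the figures quoted in the text (2 cm length, 0.2c for the repeatered electrical line, 70 ps each for the optical transmitter and receiver) and assumes a propagation velocity of 0.67c in glass, as stated in Section I. The function name is illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def propagation_delay_ps(length_m, velocity_fraction_of_c):
    """Time of flight over length_m at a given fraction of the speed of light."""
    return length_m / (velocity_fraction_of_c * SPEED_OF_LIGHT_M_PER_S) * 1e12

LENGTH_M = 0.02          # 2 cm global interconnect
TRANSMITTER_PS = 70.0    # single buffer driving the modulator (from the text)
RECEIVER_PS = 70.0       # transimpedance receiver with short pulses (from the text)

electrical_ps = propagation_delay_ps(LENGTH_M, 0.2)
optical_ps = propagation_delay_ps(LENGTH_M, 0.67) + TRANSMITTER_PS + RECEIVER_PS

print(f"electrical (repeatered, 0.2c): {electrical_ps:.0f} ps")
print(f"optical (glass + tx + rx):     {optical_ps:.0f} ps")
```

With these inputs the optical total comes out below the electrical figure, in line with the conclusion that optical latency can be comparable to or lower than electrical latency at this length.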
In this paper, an analysis of receiver latency for different architectures was presented. It is shown that short pulse encoding significantly reduces the receiver latency compared with NRZ. Table I summarizes the results of the receiver latency simulation. A "recless" receiver with short pulses has the shortest delay, but at the expense of optical power. The optical power required depends on the photodiode capacitance. The transimpedance and integrating receivers have nearly the same latency with short pulses.
The effects of various parameters on the latency of the transimpedance receiver by using a simplified model were described. It was shown that the number of stages required to minimize the latency depends on the input optical energy, i.e., as the energy per bit is increased, the optimal number of stages reduces. Finally, experimental measurements on a transimpedance receiver demonstrate latency reduction by more than 50% using short pulses, verifying modeling and simulation. With the same optical energy per bit, short pulses substantially reduce the latency for a given transimpedance design compared to NRZ.
This work is an initial study of the use of short optical pulses to reduce latency in optical interconnects. Short pulses hold a vast potential in optical interconnects. The availability of compact high-speed mode-locked lasers, packaging, and the cost are the current challenges in making a viable intrachip and interchip optical interconnect system. With further advances in optical devices and integration technology it might be possible to have low latency optical interconnects using short pulses for on-chip communication.
Noah C. Helman (S'02) received the A.B. degree in
