Abstract-We introduce a first-principles, end-to-end analysis of opto-electronic communication links which incorporates a thorough model of the receiver circuitry, in addition to the more familiar laser transmitter optimization. In particular, we optimize receiver sensitivity and power by studying their dependence on front-end design as well as follow-on digital sampler requirements. We find that the photo-receiver sensitivity is the most important factor in controlling the overall link power consumption. Our physical model and circuit optimization principles are applied to a heterogeneously integrated photonic+CMOS platform, where we show state-of-the-art performance through this physics-based rapid-design protocol, which also greatly simplifies the design process. Lastly, we apply this approach to extrapolate future performance trends, platform bottlenecks, and fundamental limits in optical link design, while showcasing the potential for sub-1 fJ/bit system efficiency at high speeds.
I. INTRODUCTION
OPTICAL interconnects, having completely replaced electrical interconnects for long-haul communication, are forecast to continue their expansion to shorter and shorter links, eventually bringing data directly to the processing chips, and potentially even replacing some of the longer interconnects on the chip itself [1]. This is due to several key aspects of optical interconnects: their potential for extremely high bandwidth, the distance insensitivity of optical channel loss when compared with electrical channels, and steadily improving optical components and technology. Nevertheless, as illustrated in Fig. 1, in a world where energy dissipation from computing units is becoming increasingly important, optical links must still prove that they can offer a more energy-efficient means of communication than electrical links over shorter distances.
Commonly cited objectives for chip-to-chip links are in the range of ∼100 fJ per bit, and drop to ∼10 fJ per bit for on-chip interconnects [2]. These energy requirements, combined with the extremely high bandwidths needed, still pose a number of challenges for optical links. The emergence of silicon photonics offers new possibilities and prospects in this regard by enabling seamless integration of photonics and electronics on a single platform, thereby increasing energy efficiency. The purpose of this work is to model these links and optimize them in order to explore what limits can be reached in terms of energy efficiency and how these limits depend on the specific technology available. Prior literature in this space has made strides in accurately modeling particular aspects of the link data path, namely the front-ends and the system-level energy breakdowns ([3]-[5]). However, a proper marriage between "analog"-dominated and "digital"-dominated constraints has yet to be demonstrated. More specifically, in the context of optical receiver design, specifications on the sensitivity and power of the signal are contingent on the interaction of the front-end and the follow-on samplers that ultimately convert the analog signal into a digital bit, which together set the overall energy, bandwidth, gain and noise properties. Linking all of these relevant interactions together, this work shows the behavior of the full optical link under different regimes of operation in terms of energy efficiency and noise.
Fig. 1. Demonstrated optical link efficiencies [6]-[15], against objectives from [2]. Further information at linksurvey.eecs.berkeley.edu.
The authors are with the University of California, Berkeley, California (e-mail: ktset@berkeley.edu; chriskeraly@berkeley.edu; eliy@berkeley.edu; vlada@berkeley.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2016.2633942
We begin in Section II by introducing the optical link model and discussing the fundamental assumptions we make in characterizing the front-end, the follow-on samplers, and then the link in general. We also derive first-order approximations of the sensitivity limits experienced by such a system. Next, in Section III, we apply these principles to a 65 nm heterogeneously integrated photonic-CMOS platform and show the performance of link designs optimized over various data rates. In particular, we show the importance [...] In Section IV, we use the link results from Section III to design and simulate receivers in each of these two regimes, thereby validating our methodology as well as any inherent assumptions. Lastly, in Sections V and VI, we use this methodology to extrapolate performance metrics and analyze trends of next-generation platforms from both photonics-optimized and photonics+CMOS-optimized standpoints.
1549-8328 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. OPTICAL LINK MODELING
We consider a very general model for the optical link, which enables us to optimize its topology and estimate the optimal energy per bit that can be achieved at a particular data rate given the technology constraints. The philosophy of the model is depicted in Fig. 2 and is as follows: the receiver is constructed with a transimpedance front end followed by N amplification stages and terminated with a sampling unit composed of M individual samplers. The number of amplification stages N, the size of each stage, the number of samplers M, and the sizing of their transistors constitute the optimization variables. There are of course a variety of other receiver topologies, or variations on the one suggested. The framework we describe next is readily extendable to these topologies.
The energy consumed in the receiver can be computed from the bias currents and circuit capacitances, and its sensitivity is determined by two constraints: a noise constraint, and a system output voltage constraint. Finally, the energy consumed by the transmitter can be calculated starting from the receiver sensitivity requirement, and back-propagating that through the data path losses to the transmitter. The total energy is the sum of the receiver and transmitter energy, which is minimized with respect to the optimization variables at hand.
A. Receiver Modeling
The receiver is modeled as illustrated in Fig. 2. The front end consists of a transimpedance amplifier (TIA) that converts the input photocurrent to an output voltage signal, followed by N chained gain stages forming a voltage amplifier (VA) that further amplifies the signal. All of these amplifiers are considered to be first-order stages, except for the TIA, which has two poles: one from the photodiode capacitance and feedback resistor, and one from the input capacitance of the VA. Chaining such stages degrades the overall bandwidth. The bandwidth B_chain resulting from N first-order stages of bandwidth f_S is [15]:
We set the target end-to-end bandwidth to 0.7 × f_data, where f_data is the Nyquist rate of the input data stream. This implies that the bandwidth f_S of each stage must be
in order to satisfy this constraint. The factor of 2 comes as a result of the two poles imposed by the TIA.
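The bandwidth relations referenced as (1) and (2) can be illustrated numerically. The sketch below assumes the standard result for identical cascaded first-order stages, B_chain = f_S × sqrt(2^(1/N) − 1), and further assumes that the per-stage requirement applies this result to N + 2 poles with a 0.7 × f_data target. The function names are illustrative and are not taken from the paper.

```python
import math

def chain_bandwidth(f_stage, n_poles):
    """3-dB bandwidth of n_poles identical cascaded first-order stages."""
    return f_stage * math.sqrt(2.0 ** (1.0 / n_poles) - 1.0)

def required_stage_bandwidth(f_data, n_va_stages, tia_poles=2, target=0.7):
    """Per-stage bandwidth f_S so that the full chain reaches target * f_data.

    Assumption: the two TIA poles are treated like two extra first-order
    stages, so the chain has n_va_stages + tia_poles poles in total.
    """
    n_poles = n_va_stages + tia_poles
    return target * f_data / math.sqrt(2.0 ** (1.0 / n_poles) - 1.0)

# Example: per-stage bandwidth needed at f_data = 25 GHz for 0..4 VA stages.
for n in range(5):
    f_s = required_stage_bandwidth(25e9, n)
    print(f"N = {n}: f_S = {f_s / 1e9:.1f} GHz")
```

The required per-stage bandwidth grows with every added stage, which is why chaining amplifiers cannot raise the aggregate gain indefinitely.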
1) Gain-Bandwidth Product:
While the unity current gain-bandwidth of a technology is f_T, the actual gain-bandwidth product (GBW) achieved in an individual gain stage loaded by its replica will be lower due to various parasitics and non-idealities. Additionally, different gain stage topologies yield different GBWs. For example, inductive peaking is a popular way of enhancing the bandwidth and yields a higher GBW than simple resistively loaded stages. We therefore use a parameter α which describes the fraction of f_T achieved by each individual gain stage. The GBW of a replica-loaded stage is therefore f_a = α f_T.
2) Voltage Amplifiers: Here, we introduce the analysis of the follow-on voltage amplifiers, which lays the foundation for the analysis of the transimpedance amplifier stage. Every stage in the voltage amplifier is defined by its input transistor gate width W_gate,i (where i denotes its position in the amplifier chain), which in turn defines its transconductance g_m,i, gate capacitance C_ox,i and bias current I_d,i. To simplify the problem, we assume that g_m, C_ox, and I_d are simply proportional to W_gate, which implies that the biasing of each transistor is relatively similar, a reasonable assumption to first order. The GBW of each stage depends on the capacitance seen at its output, and in the case of simple resistively loaded stages, we have GBW_i = g_m,i/(C_out,i + C_in,i+1). We define β = C_out/C_in as the ratio of output to input capacitance of a gain stage. Like α, β depends on the stage topology. These parameters are given and derived in Appendix A for different gain stage topologies. Therefore
and we can derive the GBW of every stage as:
As mentioned earlier, each gain stage must also have a 3-dB bandwidth of f_S, so that the DC gain of stage i in the linear amplifier is:
The maximum gain is capped by the intrinsic gain of the devices, g_m r_0.
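To make the gain-bandwidth trade-off concrete, the following sketch estimates the aggregate VA gain for a chain of identical replica-loaded stages, taking the per-stage DC gain as GBW divided by bandwidth, f_a / f_S. It is a simplification under stated assumptions: all stages are the same size, the intrinsic-gain cap g_m r_0 is ignored, the per-stage bandwidth assumes N + 2 identical poles, and the value of f_T is hypothetical. Only α = 0.29 is a value quoted in the text for standard amplifiers.

```python
import math

ALPHA = 0.29   # fraction of f_T reached by a standard replica-loaded stage
F_T = 150e9    # HYPOTHETICAL technology f_T in Hz, for illustration only

def stage_bandwidth(f_data, n_va_stages, target=0.7):
    """Per-stage bandwidth, assuming N VA poles plus two TIA poles."""
    n_poles = n_va_stages + 2
    return target * f_data / math.sqrt(2.0 ** (1.0 / n_poles) - 1.0)

def va_total_gain(f_data, n_va_stages, alpha=ALPHA, f_t=F_T):
    """Aggregate gain of N identical stages with GBW f_a = alpha * f_T."""
    f_a = alpha * f_t
    per_stage_gain = f_a / stage_bandwidth(f_data, n_va_stages)
    return per_stage_gain ** n_va_stages

for f_data in (5e9, 25e9):
    gains = [va_total_gain(f_data, n) for n in range(1, 5)]
    print(f_data, [round(g, 2) for g in gains])
```

With these illustrative numbers, adding stages keeps increasing the gain at 5 GHz, while at 25 GHz the per-stage gain falls below unity after a few stages, so a longer chain reduces the total gain.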
where C_PD is the photo-detector parasitic capacitance, including the interconnect between the photo-detector and the TIA, and C_in,TIA is the TIA input capacitance. The two poles resulting from a TIA designed in this fashion are not real, and the damping factor ζ is bounded as 0.5 < ζ < 1, implying that the bandwidth is marginally greater than if the poles were real. This means that (2) slightly overestimates the required bandwidth per stage. To first order this is an acceptable approximation.
The total transimpedance gain of the front end is therefore
4) Sampler:
The modeled sampling stage is made of M interleaved StrongArm samplers (also referred to as sense amplifiers (SA)) that evaluate the bits sequentially. This means that each individual StrongArm has a cycle M × T_bit long. Half of this period is dedicated to resetting the sampler, while the other half is dedicated to the integration and regeneration of the bit (minus the setup time of the follow-on flip-flop, T_D2S), so that the actual time the sampler spends evaluating is T_SA, given in (10). The schematic of an individual sampler is depicted in Fig. 3, and sample transient waveforms are shown in Fig. 4, with the three main regimes of operation highlighted. The blue and red lines show the complementary outputs of the StrongArm sampler (nodes X and Y in Fig. 3). The integration period lasts while the input pair discharges nodes P, Q, X and Y, until nodes X and Y reach V_DD − V_th,P, at which point the cross-coupled pair turns on and the regeneration period starts [16] (V_th,P is the threshold voltage of the PMOS). The regeneration gain is generated by a cross-coupled pair forming a latch, is exponential with time, and brings the output signal to logic levels. The available optimization variables are the common-mode voltage at the input, the gate widths of the input transistors, and the gate widths of the cross-coupled pair transistors. These define the length of the integration period (which must stay under T_bit in order to avoid intersymbol interference), the integration gain, and the regeneration gain. The size of the tail transistor, M_7, is not considered an optimization parameter; it is sized to be at least 2x larger than the input pair, M_1 and M_2, so that it does not current-limit the signal path.
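The timing budget described above can be sketched as follows. This is an assumed reading of the text rather than a reproduction of (10): each sampler has a cycle of M × T_bit, half of which is available for evaluation, less the D2S setup time. The function name and the default of zero setup time are illustrative.

```python
def evaluation_time(bit_rate, n_samplers, t_d2s=0.0):
    """Time available to one interleaved sampler for integration + regeneration.

    Assumption: T_SA = (M * T_bit) / 2 - T_D2S, i.e. half of the sampler
    cycle is spent in reset and the D2S setup time is subtracted.
    """
    t_bit = 1.0 / bit_rate
    return n_samplers * t_bit / 2.0 - t_d2s

# Half-cycle budgets (before subtracting T_D2S) for a few configurations.
for rate, m in ((5e9, 2), (25e9, 2), (25e9, 4)):
    print(f"{rate / 1e9:.0f} Gbps, M = {m}: {evaluation_time(rate, m) * 1e12:.0f} ps")
```

Because the regeneration gain is exponential in time, doubling M at a fixed data rate doubles this budget and can change the sampler gain by a large factor.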
The sampler then drives a dynamic-to-static (D2S) converter stage, which we simply characterize as a load capacitance to the sampler, C_in,D2S [17]. The D2S requires a fixed amount of time T_D2S ∼ 2/f_T to latch, which is taken out of the total evaluation time. The integration time and sampler gain are derived in Appendix B. Approximations are nevertheless given here:
where V_TH is the absolute value of the threshold voltages and G_SA is the final sampler gain. Finally, the input capacitance of the SA seen by the front end is given by M × C_ox,SA. The fan-out M is detrimental to the gain of the front end and, as will be shown, can be mitigated by using switches that connect only one sampler at a time to the output of the VA. In this case, the input capacitance seen by the front end is approximately C_ox,SA, neglecting the wire capacitance and junction capacitance of the sampling switches and the RC time associated with them. This assumption holds for a reasonable number of samplers, as expressed by (13). Indeed, the size of the transistor serving as a switch can be made substantially smaller than the input capacitance of the SA, by a factor ∼ f_T/f_data, to minimize its effect on the circuit bandwidth, and the only capacitance it presents to the circuit is its gate-drain capacitance, justifying (13).
The energy consumed by the sampler comes from the charging and discharging of all its capacitances in each cycle, as well as the dynamic power burned by the cross-coupled inverter during the latching process:
where C_SA comprises all the capacitances that must be charged to V_DD during the reset period.
B. Sensitivity Calculation
The sensitivity calculated for the receiver can be separated into two parts: the sampler swing requirement, and the circuit noise requirement. The final sensitivity is the sum of the two.
1) Swing Based Sensitivity Requirement:
The swing requirement represents the signal needed to ensure that the differential voltage at the output of the sampler reaches V_DD by the appropriate time, and is calculated from the sampler gain, the TIA gain and the VA gain:
The factor of 2 comes from the fact that the signal is only half the actual photon current magnitude for an optical ONE (during a ZERO, the photon current is assumed to be nil). Slight changes must be made if the modulation extinction ratio of the transmitter is finite, but the general framework is the same.
2) Noise Based Sensitivity Requirement:
The noise requirement necessitates the calculation of the input-referred noise generated by the amplification circuit. This includes the feedback resistor thermal noise, the Johnson noise from the TIA's transistors, the transistor noise from the follow-on stages, and the noise from the samplers. The TIA transistor and resistor noises are calculated using Personick's method, with all the Personick integrals set to unity [18], while the follow-on stages are estimated using approximations consistent with the literature [19]. The photon shot noise (or PD shot noise) is neglected, as it is always much lower (by roughly one order of magnitude) than the circuit noise sources for incoherent detection systems. Indeed, for a BER of 10^-12, the limit that would be imposed by photon shot noise is 27 photons per bit during a ONE (also known as the quantum limit), which is a current of 44 nA at 10 Gbps. Naturally, when the other noise sources impose a higher photon current, the absolute value of the photon shot noise also goes up, but it will necessarily remain smaller than the other noise sources.
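The quantum-limit figures quoted above can be checked with a short calculation. The sketch assumes an ideal photon-counting receiver with Poisson statistics, equiprobable bits, no photons during a ZERO, and unity quantum efficiency, so that an error occurs only when a ONE produces zero photo-electrons and BER = 0.5 × exp(−N). These assumptions are the textbook ones and are not spelled out in the text.

```python
import math

Q_E = 1.602176634e-19  # elementary charge in coulombs

def photons_per_one(ber):
    """Mean photons per ONE bit needed for a given BER (ideal Poisson counting)."""
    return math.log(1.0 / (2.0 * ber))

def quantum_limit_current(ber, bit_rate):
    """Photocurrent during a ONE at the quantum limit, unity quantum efficiency."""
    return photons_per_one(ber) * Q_E * bit_rate

n_photons = photons_per_one(1e-12)
i_one = quantum_limit_current(1e-12, 10e9)
print(f"{n_photons:.1f} photons per ONE, {i_one * 1e9:.1f} nA at 10 Gbps")
```

Under these assumptions the result is roughly 27 photons and a few tens of nanoamps, consistent with the values quoted in the text and well below the microamp-level sensitivities set by circuit noise.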
Here, k_b is the Boltzmann constant and θ is 273 Kelvin. The sampler voltage noise is approximated using the methodology presented in [20].
Finally, the sensitivity is calculated using a current SNR of 14 in order to achieve a bit error rate of 10^-12. Note that this is the current-magnitude SNR and not the power SNR.
The total photon current requirement at the input of the photodiode is the sum of the swing current requirement and the noise current requirement:
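A small numerical sketch may help connect the SNR of 14 to the BER target and to the total sensitivity. It assumes Gaussian noise of equal variance on ONE and ZERO levels with a decision threshold midway, so that the Q factor is half the current SNR and BER = 0.5 × erfc(Q / sqrt(2)). The sensitivity function follows the form i_swing + 14 × i_noise used for the simulated receivers in Section IV; the function names are illustrative.

```python
import math

def ber_from_current_snr(snr):
    """BER for a given ratio of ONE-level signal current to rms noise current.

    Assumption: Gaussian noise, threshold midway, hence Q = snr / 2.
    """
    q_factor = snr / 2.0
    return 0.5 * math.erfc(q_factor / math.sqrt(2.0))

def total_sensitivity(i_swing, i_noise_rms, snr=14.0):
    """Total required photocurrent: swing term plus SNR times rms noise."""
    return i_swing + snr * i_noise_rms

print(f"BER at SNR 14: {ber_from_current_snr(14.0):.2e}")

# Simulated 5 Gbps values reported in Section IV: 0.8 uA swing, 0.21 uA rms noise.
print(f"{total_sensitivity(0.8e-6, 0.21e-6) * 1e6:.2f} uA")
```

With the 5 Gbps numbers, the noise term is several times larger than the swing term, which is what places that design in the noise-limited regime.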
C. Energy Per Bit
The total energy per bit that is consumed by the link is the sum of the energy burned in the transmitter and the receiver
E_RX includes the energy burned in the amplification stages as well as the energy consumed by the samplers. E_TX includes the laser energy and the modulator energy E_mod, where V_TX represents the energy cost of the transmitted photons that represent a bit successfully detected at the receiver. It encompasses all the efficiencies, η, encountered from the generation of photons to their absorption into useful photocurrent in the receiver photodiode, such as the laser wall-plug efficiency, coupler inefficiencies, waveguide losses, modulator loss, photodiode quantum efficiency, etc.
D. Model Inputs and Optimization Variables
The model described enables us to rapidly predict the performance of a given optical receiver characterized by the number of amplification and sampling stages, the technology available, and the size of the transistors involved. These parameters can therefore also be optimized in order to reach minimal total link energy. The optimization variables and model parameters are described in Table I, and the optimized links are presented in Figs. 5 and 6 and Table III.
TABLE II: MINIMUM POWER LAWS FOR E/b LIMITS DEPENDENCE
E. Model Purpose and Limitations
The goal of the model is to accurately encompass the most important effects and limits that fundamentally constrain the performance of an optical link. Naturally, no model can include all practical limitations, such as systematic and random transistor mismatches, kickback, jitter, layout imperfections, etc. In particular, transistor offset and mismatch can be of significant importance, and their effects have been extensively studied [21]. However, calibration techniques, at the cost of added design complexity, can correct the effects of mismatch with only a minimal power penalty. Additionally, exotic amplification schemes such as higher-order stages or multiple interleaving schemes are not included. While these considerations are important in practical circuit design, we believe that our modeling approach is readily extendable to include them. The presented model nevertheless allows us to derive some general conclusions about critical link trade-offs. It is also precise enough to provide optimal transistor sizing and accurate sensitivity predictions, leading to functional circuits such as those shown in Section IV.
A. Technology Overview
The circuit optimization has been applied to a heterogeneous integration platform which combines a 65 nm CMOS technology node with a custom SOI photonic node [14]. In this process, separate 300 mm photonic and CMOS wafers are face-to-face oxide bonded in the CNSE 300 mm foundry.
To reduce the capacitance between the CMOS and photonic wafers, the silicon substrate is first removed from the photonic wafer and through-oxide vias (TOVs) with a lumped [...] Table I.
Using Fig. 2 as reference, light from the laser source experiences multiple sources of loss before reaching the photodiode on the receiver side. First, the laser source itself is assumed to have a wall-plug efficiency of 10%. The three vertical grating couplers, each with a measured loss of 3.5 dB, are also in the critical path of the signal. The germanium photodiode has a measured responsivity of 0.8 A/W at 1510 nm [22]. In addition, the modulator insertion loss is 5 dB. In this study we assume no waveguide loss, although it can easily be included. The above path losses translate to an overall photon energy cost, V_TX, of 580 V. The modulator energy in this platform is 20 fJ/bit and will therefore be neglected for the purposes of this analysis.
B. Single Slicer Case (M = 1)
The results of the link optimization are plotted in Fig. 5. The laser energies required to accommodate noise and swing are, respectively, the quantities described in (27).
Two clear regimes are visible: the "noise-limited regime" at low data rates, where the sensitivity of the receiver is constrained by the noise, and the "swing-limited regime" at high data rates, where the sensitivity is dominated by the output swing requirement (V_out = V_DD). The regeneration gain of the sampler is exponential with time, so it is natural that it drops significantly at higher data rates. While the VA can compensate for this drop in gain by increasing its number of stages (and this happens at ∼7.5 GHz for M = 1), there is a limit to the amount of aggregate gain achievable by chaining amplifiers due to the bandwidth requirement, as described in (2).
The justification for adding multiple slicers is now obvious. Interleaving relaxes the condition that the regeneration time be less than the bit duration, and can thus push the swing-limited regime to much higher data rates.
C. Multiple Slicer Case (M ≥ 1)
The results of the optimization when the number of samplers is not constrained to 1 are plotted in Fig. 6. We can see that there is no longer a "swing-limited regime", since the optimal topologies have several samplers in order to benefit from much longer regeneration time and higher gain. While the energy per bit is greatly reduced at higher data rates, eventually the sampler noise starts to dominate. This comes about because, as the data rate goes up, the bandwidth requirement on the VA reduces the achievable gain. Additionally, adding several samplers increases the fan-out of the VA by a factor M, further reducing the gain. Eventually the gain of the amplifier stages drops below 1, so that the input-referred noise coming from the samplers becomes greater than that coming from the front end. We therefore observe a front-end-noise-limited regime and a sampler-noise-limited regime, which is different from the sampler-swing-limited regime discussed in the single-slicer case.
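The transition from front-end-limited to sampler-limited noise can be illustrated with a minimal sketch. It assumes that the sampler's rms input voltage noise is referred to the photodiode by dividing by the front-end transimpedance (TIA gain times VA gain) and that it adds to the front-end noise as an uncorrelated source. All numerical values below are hypothetical and chosen only to show the trend.

```python
import math

def input_referred_noise(i_front_end, v_sampler, r_tia, va_gain):
    """Return (sampler noise referred to the input, total rms input noise).

    Assumptions: uncorrelated noise sources added in quadrature, and a
    front-end transimpedance equal to r_tia * va_gain.
    """
    i_sampler = v_sampler / (r_tia * va_gain)
    total = math.hypot(i_front_end, i_sampler)
    return i_sampler, total

I_FE = 1.0e-6    # HYPOTHETICAL front-end rms noise current, in A
V_SA = 1.0e-3    # HYPOTHETICAL sampler rms input noise voltage, in V
R_TIA = 500.0    # HYPOTHETICAL TIA transimpedance, in ohms

for gain in (4.0, 1.0, 0.5):
    i_sa, total = input_referred_noise(I_FE, V_SA, R_TIA, gain)
    print(f"VA gain {gain}: sampler {i_sa * 1e6:.2f} uA, total {total * 1e6:.2f} uA")
```

As the VA gain falls toward and below unity, the sampler term overtakes the front-end term, which is the behavior described for the sampler-noise-limited regime.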
IV. SCHEMATIC DESIGNS OF MODEL RESULTS
A. 5 Gbps Optical Receiver
To highlight the performance in the noise-limited regime, we introduce the schematic design of an optimized receiver topology operating at 5 Gbps, with no active equalization, running from a 1.2 V supply. Fig. 7 shows the overall topology of the front-end pre-amplifier and slicers. While the number of slicers and samplers does not match the optimal values of Fig. 6, these values were chosen because they yielded performance within a few percent of the optimum and were easier to implement. Nonetheless, the transistor sizings were still produced by the algorithm.
1) Design Overview:
The photodiode, with a total capacitance C_PD of 20 fF, feeds a TIA with a feedback resistance, R_FB, of 2.3 kΩ. The output of this stage enters a single pre-amplifier gain stage with a gain of 2 before entering the optimized, dual-data-rate (DDR) triggered StrongArm sense amplifiers and follow-on dynamic-to-static converters. The sense amplifiers and dynamic-to-static converters are triggered on clock and clockB, which each operate at half the data rate, or 2.5 GHz. The sampler transistor sizes as well as the front-end sizings are optimized using the algorithm. Additionally, the biasing at each stage is also dictated by the algorithm. More specifically, the common-mode voltage at the input of the samplers was selected to be 850 mV, while the constrained common-mode voltages at the TIA's input and output were set at V_DD/2, or 600 mV. The output of the samplers, which is effectively a 1-to-2 deserialized version of the input data sequence, was verified in simulation.
2) Simulation Results: The above design has been implemented at the simulation level, and its performance was verified against the values predicted by the model. Table III summarizes the specifications for the model and the simulated results. The optimized circuit had an overall front-end gain of 5.1 kΩ. From the StrongArm sampler's standpoint, the minimum required swing at the input (neglecting noise) to resolve successfully at 5 Gbps, or 200 ps of evaluation time per sampler, was measured to be 6 mV. This translates to a 0.8 μA receiver sensitivity due to the swing requirement of the sampler. From a noise perspective, the total input-referred noise contribution from the front-end is 0.21 μA (1σ). Thus, the total simulated input sensitivity is i_swing + 14 × i_noise, or 3.8 μA. The total energy per bit for the full RX block is 280 fJ/bit, with the front-end consuming 115 fJ/bit and the samplers plus D2S consuming 165 fJ/bit in total. The front-end E/b in this case takes the dummy front-end into account as well. From an overall link perspective, the energy in the laser and TX macro is 392 fJ/bit.
B. Active-CTLE Enhanced 5 Gbps Optical Receiver
We now use the algorithm to design a continuous-time linear equalizer (CTLE)-based optical receiver front-end. The purpose of this design is to show that it is possible to reduce the noise contribution of the feedback resistor in the front end, as will be discussed in Section V. We nevertheless present the circuit results here for consistency. The full schematic is shown in Fig. 8.
1) Design Overview:
The CTLE-based pre-amplifier allows the R_FB of the preceding TIA stage to increase drastically, from 2.3 kΩ to 70 kΩ. This enhances noise performance while keeping the overall gain-bandwidth the same. The new low-pass pole of the TIA front-end is then compensated by the peaking of the CTLE block, which adds a zero to the transfer function through the degeneration elements R_S and C_S (see Fig. 11). The zero location is chosen to match the dominant pole location of the TIA, thereby enhancing the overall bandwidth to the target specification for operation at 5 Gbps. The two poles of the CTLE are set to the same frequency in order to maximize the effective gain-bandwidth of the stage [5].
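The pole-zero compensation described above can be sketched numerically. The model below is a simplification under stated assumptions: the TIA is reduced to a single dominant pole, the CTLE is represented by one zero and two coincident poles, the response is normalized to unity DC gain, and both corner frequencies are hypothetical values that are not taken from the design.

```python
import math

def magnitude(f, f_tia, f_zero, f_pole):
    """Normalized |H| of a one-pole TIA followed by a one-zero, two-pole CTLE."""
    tia = 1.0 / complex(1.0, f / f_tia)
    ctle = complex(1.0, f / f_zero) / complex(1.0, f / f_pole) ** 2
    return abs(tia * ctle)

def bandwidth_3db(f_tia, f_zero, f_pole, f_max=1e12):
    """Bisection for the -3 dB frequency; assumes a monotonic roll-off."""
    target = 1.0 / math.sqrt(2.0)
    lo, hi = 0.0, f_max
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if magnitude(mid, f_tia, f_zero, f_pole) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

F_TIA = 0.2e9    # HYPOTHETICAL dominant TIA pole with a large R_FB, in Hz
F_POLE = 8.0e9   # HYPOTHETICAL coincident CTLE poles, in Hz

# Zero placed exactly on the TIA pole: the pair cancels.
bw_equalized = bandwidth_3db(F_TIA, F_TIA, F_POLE)
print(f"equalized bandwidth: {bw_equalized / 1e9:.2f} GHz")
```

When the zero sits exactly on the TIA pole, the overall bandwidth is set by the two CTLE poles alone, which is why the feedback resistor can be raised without lowering the end-to-end bandwidth.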
2) Results: The results are summarized in Table III. The overall receiver gain and bandwidth of the CTLE design are approximately those of the standard RX topology, at 5460 Ω and 5.3 GHz, respectively. The CTLE-based front-end consumes 250 fJ/bit, with the samplers consuming 141 fJ/bit. This yields an overall RX E/b of 391 fJ/bit. The main advantage of a CTLE-based scheme comes from the input-referred noise sensitivity, as will be further elaborated in Section V. Here, we observe 0.2 μA input sensitivity, whereas the standard RX topology had almost double that. In the CTLE topology, the feedback resistor contributes only 15% of the total [...]
C. DDR 25 Gbps Optical Receiver
1) Design Overview:
To better characterize the universality of the model, we now present an optimized optical receiver design operating in the sampler-swing-limited regime. We choose to operate at a V_DD of 1.6 V in order to allow enough voltage headroom to use cascode amplifiers as the basis for the VA stages, which have an α of 0.4 as opposed to the standard amplifiers, which have an α of 0.29. We retain the StrongArm topology for the samplers and also retain the topology of the D2S converters. In this design, we choose to operate the system as DDR to show the importance of relaxed timing margin on the sampler's evaluation period.
Under these constraints, the model-predicted topology is shown in Fig. 9. All front-end FETs, resistances, and sampler FETs are sized based on the constraints presented by the algorithm. In the DDR case, M = 2.
2) Results: To avoid bandwidth reduction at the input node of the TIA itself, the optimized TIA feedback resistance was 530 Ω. This translates to an overall gain of 770 Ω in the two-stage front-end and an overall bandwidth of 18.8 GHz, which meets our programmed target specification of 0.7 × 25 GHz, or 17.5 GHz. At this data rate, the sampler required a minimum swing of 165 mV with a common mode of 840 mV. The overall swing-based sensitivity is therefore 280 μA. The rationale for this high sensitivity is as follows: because the system was operating within the sampler-swing-dominated regime and with a fixed number of samplers for DDR, the algorithm would resort to increasing the laser power to meet the sensitivity requirement of the sampler instead of [...]
D. QDR 25 Gbps Optical Receiver
1) Design Overview:
In the subsequent analysis, we retain the same technology parameters as in the previous section. However, we now present a quadrature-data-rate (QDR) operation of the receiver (M = 4 in Fig. 9), wherein four samplers are used to parse the amplified photodiode signal. Once again, the design of the front-end as well as the samplers is fully optimized with our tool, taking into account the added capacitive load on the final stage of the VA. In using four phases, we relax the timing requirements of the samplers by doubling the time allocated to the sampling and reset phases, while adding clocking overhead in the form of quadrature phase generation. In the context of links, this drastically improves efficiency and extends the crossover point of the noise-limited and sampler-limited regimes to past 25 Gbps, as seen in Fig. 6. Although we acknowledge the added overhead of generating quadrature phases versus dual phases, the purpose of this analysis is to highlight the importance of easing timing requirements for the samplers to improve the overall performance. Indeed, an order-of-magnitude reduction in link power was observed (mostly through the increase in SA gain), not taking into account the overhead of clock phase generation. According to Fig. 6, the optimal number of samplers is actually 8, yet the energy benefit of going from 4 to 8 samplers is not enough to justify the additional design complexity.
2) Results: The QDR receiver performed on par with the DDR receiver in power, gain, and bandwidth metrics. However, from a swing sensitivity standpoint, the QDR receiver performed an order of magnitude better. The simulations yielded a swing sensitivity of under 5 μA, with a front-end gain of 760 and 20.5 GHz net bandwidth. The four samplers and D2Ss consumed 153 fJ/bit while the front-end consumed 395 fJ/bit, for a total of approximately 550 fJ/bit burned on the receiver end. The input-referred noise sensitivity of the receiver was 1.8 μA, now mostly dominated by the sampler noise.
Because of this ultra-low sensitivity, even though the RX total power stayed approximately the same for the DDR and QDR cases, the required laser power was substantially reduced, as shown in Table III .
The discrepancy in the noise sensitivity values between the modeled and designed receivers may be attributed not only to the sampler noise approximation error (which may be as high as 50%) but also to the first-order noise calculation methodology being used [20]. The swing sensitivity discrepancy stems from the following: (1) the simulated swing sensitivity looks at settling at the output of the D2S, an effect not captured by the model; (2) lower input sensitivities rely heavily on capturing the effects of regeneration properly, and an error on the regeneration side shows up as an exponential variation in the sampler gain. For the purposes of this study, however, the 2x variability in sensitivity is considered acceptable.
E. Switched QDR 25 Gbps Optical Receiver 1) Design Overview:
To alleviate the sampling noise contribution of the StrongArm sense amplifiers, a time-interleaved switching topology was implemented, reducing the load on the VA and allowing it to provide more gain for a given bandwidth constraint. The schematic is shown in Fig. 10 . By placing a track and hold circuit prior to the sampler array, not only does the sense amplifier input load capacitance diminish, but the potential effects of kickback from other sampler clocks is also theoretically reduced. The receiver topology and design process is similar to the QDR receiver in Fig. 9 . All transistor sizings are optimized by the tool with the biggest difference being in how the C ox,S A capacitance scales. C ox,S A now goes up linearly with sampler input FET size and is completely independent of the slicing count, M, as detailed in Section 2. For the purposes of this study, the non-idealities of these sampling switches (i.e. finite junction capacitance) were not taken into account within this study. However, the simulation results reflect performance with these non-idealities in place, and we see no significant difference between the predicted and simulated specifications. This is because the sampler count, M, is kept to a reasonable value according to (13) . Additionally, charge injection, which was not modeled analytically, is a common mode issue and, therefore, does not affect sensitivity drastically.
2) Results: The results in Table III for the 25 Gbps switching QDR receiver show similar performance to the non-switching receiver. However, the total noise sensitivity is reduced by 10% on account of the reduction in sampler noise contribution, while the noise from the front end stays relatively constant. The sensitivity required to overcome the sampler swing is also relatively constant, with small adjustments made to the input sampler FETs on account of the switching.
V. SENSITIVITY AND ENERGY LIMITS
While the model enables us to choose optimal transistor sizings and achieve optimal link efficiencies, it does not immediately provide a deep understanding of the limits such a system faces. In this section we derive these limits. As shown earlier, the swing requirement can be alleviated by using an appropriate number of interleaved samplers. Similarly, if a track-and-hold method is used as in Section IV-E to negate the effect of interleaving fanout, we can ensure that the dominant noise source is the very front end. We will therefore focus on the limits imposed by noise in the TIA/VA front end.
A. Front End Noise Limit
The noise in the front end is dominated by the first amplification stage, which is the TIA in this case. The two major sources of noise have been given in (18) and (19), and their input-referred noise currents are given in (29) and (30).
where $f_{TIA}$ is the bandwidth of the TIA, and 6.4 aF = $q/V_{th}$, where $V_{th} = kT/q$ is the thermal voltage. The optimal $C_{in,TIA}$ is somewhere between 0 and $C_{PD}$.
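As a quick numerical check of the two statements above, the following sketch evaluates q/V_th for an assumed thermal voltage of 25 mV and locates the minimum of a transistor noise term that scales as (C_PD + C_in)^2 / C_in. That scaling is an assumption standing in for (30), which is not reproduced here; it is the usual form when g_m grows in proportion to C_in at fixed f_T. The function names are illustrative only, and C_PD = 20 fF is the photodiode capacitance of the current platform.

```python
# Sanity checks for the front-end noise discussion.
# Assumptions (not taken from the paper's equations): V_th = 25 mV, and a
# transistor noise term proportional to (C_PD + C_in)^2 / C_in.

Q_E = 1.602176634e-19  # elementary charge [C]
V_TH = 0.025           # thermal voltage kT/q [V], assumed (about 290 K)


def charge_over_vth(v_th=V_TH):
    """Return q / V_th in farads."""
    return Q_E / v_th


def noise_shape(c_in, c_pd):
    """Relative input-referred transistor noise power (arbitrary units)."""
    return (c_pd + c_in) ** 2 / c_in


def noise_optimal_c_in(c_pd, points=2001):
    """Grid-search c_in in (0, 2*c_pd] for the minimum of noise_shape."""
    grid = [2.0 * c_pd * k / (points - 1) for k in range(1, points)]
    return min(grid, key=lambda c: noise_shape(c, c_pd))


if __name__ == "__main__":
    print(f"q/V_th = {charge_over_vth() * 1e18:.2f} aF")
    c_pd = 20e-15  # photodiode capacitance of the current platform [F]
    print(f"noise-only optimum C_in = {noise_optimal_c_in(c_pd) * 1e15:.2f} fF")
```

Under this assumed scaling the noise-only minimum sits at C_in = C_PD, the upper end of the stated range; accounting for the power drawn by the TIA itself pulls the optimum toward smaller values.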
Nevertheless, the feedback resistor noise can be reduced to some extent by increasing the value of the feedback resistor and compensating for the resulting bandwidth degradation with equalization such as a CTLE stage, as in the example circuit of Fig. 8. The total front-end bandwidth is not enhanced, since the TIA and CTLE responses compensate each other, as illustrated in Fig. 11, but the higher resistor value translates to smaller input-referred noise. In (29), this appears as a reduction of $f_{TIA}$ and therefore of the input-referred noise. In this sense, the transistor noise is somewhat more fundamental than the feedback resistor noise.
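The benefit of a larger feedback resistor can be illustrated with the standard thermal-noise expression for a resistor, whose current noise density is sqrt(4kT/R). The sketch below is not the paper's (29); it only shows how the resistor term scales. The resistor values and the 300 K temperature are hypothetical, and the noise added by the CTLE stage is ignored.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant [J/K]


def resistor_noise_current_density(r_ohm, temp_k=300.0):
    """Thermal noise current density of a resistor, sqrt(4kT/R), in A/sqrt(Hz)."""
    return math.sqrt(4.0 * K_B * temp_k / r_ohm)


def noise_reduction(r_small, r_large, temp_k=300.0):
    """Factor by which the noise current density drops when R grows."""
    return (resistor_noise_current_density(r_small, temp_k)
            / resistor_noise_current_density(r_large, temp_k))


if __name__ == "__main__":
    # Hypothetical feedback resistor values, for illustration only.
    for r in (1e3, 4e3):
        i_n = resistor_noise_current_density(r)
        print(f"R_F = {r:.0f} ohm -> {i_n * 1e12:.2f} pA/sqrt(Hz)")
    print(f"reduction factor: {noise_reduction(1e3, 4e3):.2f}")
```

Quadrupling the resistor halves the noise current density, at the price of a TIA pole that the equalizer must then compensate.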
B. Limits at High Data Rates
At high data rates, the input-referred noise contributed by the transistors is high enough that the laser energy required to overcome it becomes the dominant source of power consumption. In this case the optimal receiver is optimized purely for noise and not for its own power consumption, which is negligible. It follows from (30) that the optimal sizing for the input transistors is $C_{in,TIA} = C_{PD}$. This yields the transistor noise limit, which, expressed in photons per bit for a ONE, is:
Naturally if this value comes close to the quantum limit of 27 photons per bit for a ONE, the photon shot noise will start to take over.
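The figure of 27 photons is consistent with the textbook shot-noise limit for on-off keying with an ideal photon-counting receiver, where an error occurs only when a ONE produces zero counts, so that BER = 0.5 exp(-N). The sketch below assumes a target BER of 1e-12, which is not restated in this section; the function name is illustrative.

```python
import math


def photons_per_one_quantum_limit(ber):
    """Photons in a ONE for an ideal photon-counting OOK receiver.

    Solves BER = 0.5 * exp(-N) for N, assuming Poisson photon statistics
    and no counts during a ZERO.
    """
    return math.log(1.0 / (2.0 * ber))


if __name__ == "__main__":
    n_one = photons_per_one_quantum_limit(1e-12)  # assumed BER target
    print(f"photons per ONE at BER 1e-12: {n_one:.1f}")
```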
C. Limits at Low Data Rates
At lower data rates, the power will not necessarily be dominated by the laser power. If we consider only the noise from the TIA transistors and the power consumption of the TIA and the laser, the energy per bit consumption of the link is:
In this case, there is an optimal size for $C_{in,TIA}$. The lower the data rate, the smaller the optimal input capacitance of the TIA, in order to minimize power consumption for that stage. To obtain an analytic expression, we assume that $C_{in,TIA} \ll C_{PD}$, which leads to:
From (34), when the link energy is not dominated by the laser power, the optimal energy per bit depends neither on the data rate nor on the transistor speed $f_T$.
D. E/b Power Laws
The boundary between these two regimes is reached when the approximation $C_{in,TIA} \ll C_{PD}$ no longer holds; it is only valid when
With the photonics platform used in Section III, this leads to $f_T / f_{data} \sim 15$, which explains why 25 Gb/s is in the laser-limited regime whereas 5 Gb/s is in the full-link-limited regime. The power laws for the optimal energy per bit in these regimes are summarized in Table II.
VI. OBSERVATIONS IN SCALING AND TECHNOLOGY
With performance limitations arising from the quality of both the CMOS and the photonic devices, this section studies the effect of an improved design platform on the optimized energy per bit. Following the previous analytic treatment, here we take the model and optimization procedure described in Section II and apply them to different hypothetical technology platforms. This captures additional effects, such as sampler energy, that are not described in Section V. In doing so, we hope to identify key performance bottlenecks and the potential for improvement in the next generation of integration technologies.
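Before turning to other platforms, the regime boundary quoted in Section V-D can be checked against the two data rates discussed. The sketch below assumes the 150 GHz f_T quoted for the current CMOS platform in Section VI-B, treats the boundary as a hard threshold at a ratio of 15 (in practice the transition is gradual), and equates the bit rate in b/s with f_data in Hz. The function and constant names are illustrative.

```python
F_T_HZ = 150e9         # f_T of the current CMOS platform [Hz]
RATIO_BOUNDARY = 15.0  # approximate f_T / f_data boundary from Section V-D


def link_regime(f_data_hz, f_t_hz=F_T_HZ, boundary=RATIO_BOUNDARY):
    """Classify a data rate by comparing f_T / f_data with the boundary."""
    ratio = f_t_hz / f_data_hz
    return "full-link-limited" if ratio > boundary else "laser-limited"


if __name__ == "__main__":
    for rate in (5e9, 25e9):
        print(f"{rate / 1e9:.0f} Gb/s: f_T/f_data = {F_T_HZ / rate:.0f}, "
              f"{link_regime(rate)}")
```

With these numbers the ratio is 30 at 5 Gb/s and 6 at 25 Gb/s, which places the two rates on opposite sides of the boundary.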
A. Improvements in Photonics and Interconnects
Parasitics such as coupler losses and photodiode capacitance dominate the platform described in Table I and limit the achievable energy efficiency. To study the importance of photonic performance, we replace the existing values for coupler loss, modulator loss, laser efficiency, and photodiode capacitance $C_{PD}$ (3.5 dB/coupler, 5 dB/modulator, 10%, and 20 fF) with 1 dB/coupler, 3 dB/modulator, 30%, and 3 fF, respectively, implying $V_{TX} = 15$ V. In addition, modulator energies as low as 1 fJ/bit have been demonstrated, justifying the omission of the modulator from this analysis [23].
The results of the analysis are shown in Fig. 12. Compared with the existing heterogeneous integration platform, better photonics yields more than an order-of-magnitude improvement in link efficiency. Because the price of converting from the photonic to the electrical domain, $V_{TX}$, is now so low, the optimized links at the various data rates are more receiver-performance limited, as expected intuitively.
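Part of this improvement can be estimated from the optical power budget alone. The sketch below compares the fraction of laser wall-plug power that reaches the photodiode for the two sets of photonic parameters. It assumes two couplers per link, which is a hypothetical count, and it ignores modulator extinction ratio, photodiode responsivity, and the reduction in photodiode capacitance, so it covers only the optical-delivery part of the gain. The function names are illustrative.

```python
def db_to_linear(loss_db):
    """Convert a loss in dB to a linear power ratio (>= 1)."""
    return 10.0 ** (loss_db / 10.0)


def wall_plug_to_detector(coupler_db, modulator_db, laser_eff, n_couplers=2):
    """Fraction of laser electrical power that arrives at the photodiode.

    n_couplers is an assumed coupler count per link, not a value from the paper.
    """
    total_loss_db = n_couplers * coupler_db + modulator_db
    return laser_eff / db_to_linear(total_loss_db)


if __name__ == "__main__":
    current = wall_plug_to_detector(3.5, 5.0, 0.10)   # existing platform
    improved = wall_plug_to_detector(1.0, 3.0, 0.30)  # improved photonics
    print(f"current:  {current:.4f}")
    print(f"improved: {improved:.4f}")
    print(f"improvement factor: {improved / current:.1f}")
```

Under the two-coupler assumption, the lower losses and higher laser efficiency alone give a factor of roughly 15, in line with the order-of-magnitude improvement reported; the smaller photodiode capacitance adds to this through better receiver sensitivity.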
B. Improvements in Photonics and CMOS
To push the boundary of integration technologies altogether, we now turn to the case where the photonics and CMOS are both pushed to their limits. In particular, we use the same best-case photonic specifications as before but scale the technology node to reflect a theoretical $f_T$ of 1 THz. The results of the study are shown in Fig. 12. For lower data rates, the performance improvement from scaling $f_T$ from 150 GHz to 1 THz is observable but not drastic; it stems mostly from the lower energy consumption of the samplers themselves rather than from the front-end amplifier or the laser, as expected from the limits of Section V. For the 25 G DDR case, however, the improvement is almost an order of magnitude, since the faster amplifiers can provide gain at these speeds. Notice that the last column in this bar plot shows a 100 G DDR receiver, with a theoretical best end-to-end link efficiency of 20 fJ/bit.
While the previous sections show the performance for given technologies, we can reverse the exercise to deduce the technology properties necessary for a given link efficiency. Achieving sub-1 fJ/bit efficiency at 5 Gbps with $f_T = 1$ THz would require $C_{PD} = 200$ aF, $V_{DD} = 0.5$ V, $V_{ov} = 0.1$ V, and $V_{TX} = 10$ V. Such a small photodiode capacitance would require a device so small that some form of absorption enhancement would be necessary, such as a cavity or a metal-optic focusing scheme [24]. At this point the link energy itself is so small that effort must be redirected to the energy overhead of peripheral blocks such as clock networks and bias generators.
The performance results for these higher data rates show another interesting trend: as the CMOS platform performance improves, the energy consumption of the receiver becomes limited mostly by the sampler itself. Because we have assumed a StrongArm topology for the sampler at all data rates, the minimum achievable E/b of this sampler is far greater than that of the rest of the link put together. This leads to the conclusion that, on a better platform where photon efficiency is so high, using a simple gain stage such as an inverter as the sampler is preferable to a StrongArm or CML latch.
VII. CONCLUSION
This work introduced a fundamentals-influenced optimization approach for true end-to-end optical links incorporating a TIA, a linear amplifier chain, and follow-on StrongArm latches. In portraying this "digital IN" to "digital OUT" end-to-end link structure, distinct regimes of operation became evident. In low data rate regimes, overcoming front-end noise was the dominant contributor to the overall link budget. In high data rate regimes, however, the StrongArm voltage evaluation requirement quickly dominated and yielded the swing-limited regime. Circuit techniques such as sampler time interleaving were used to greatly reduce this swing requirement by exponentially increasing the sampler gain. As a result, the swing requirement was no longer dominant at higher data rates; sampler noise took its place in this regime of operation. Further circuit techniques, such as placing interleaving switches between the AFE and the StrongArm latches, can be used to reduce this sampler noise contribution by effectively reducing the total sampler load capacitance, thereby allowing for higher front-end gain.
This work then used this fundamental model to extrapolate the performance of next-generation, best-case technologies optimized for photonic as well as CMOS performance. As expected, the best-case energy per bit for these optimized technologies shows more than an order-of-magnitude improvement. Moreover, new limitations arise from the "weakest link" technology. For example, using the best-case photonics with a standard CMOS platform reveals that the E/b is receiver-performance limited. These trends and next-generation platform studies showcase the importance of various parameters and their ultimate relationship to end-to-end link performance.
APPENDIX A
..., the ratio of input to output capacitance. Here we calculate α and β for a simple $g_m R_L$ topology and for cascode stages on the 65 nm platform used.
A. α-Factor Derivation
For a simple $g_m R_L$ topology we have
where the second term accounts for the Miller effect, and $C_{out} = C_{gd} + C_{ds}$. For a cascode stage, we have
Notice that the $C_{gd}$ seen by the input does not see the Miller effect due to the intermediary FET between the input FET and the output node. Given that $C_{ox} = 0.5$ fF/μm, $C_{gd} = 0.2$ fF/μm, and $C_{gs} = 0.27$ fF/μm, we have α = 0.36 for a standard $g_m R_L$ stage and α = 0.4 for a cascode stage.
B. β-Factor Derivation
With the expressions given above, it is easy to show that β = 0.29 for $g_m R_L$ stages and β = 0.4 for cascode stages.
APPENDIX B SAMPLER ANALYSIS
The role of the sampler is to bring the signal coming out of the amplifier to logic levels so that the digital circuit can effectively process it at the output. The modeling described here enables the efficient optimization of transistor sizes to yield optimal sampler performance in terms of sensitivity and power consumption. Most samplers rely on a positive-feedback latching mechanism, such as a cross-coupled inverter pair, to achieve exponential gain and recover digital levels from extremely low signal voltages. The sampler analyzed here, depicted in Fig. 3, is known as the StrongArm, but the presented analysis and trends can be generalized to a large family of sampler topologies, such as CML-based samplers or more exotic techniques such as double-tail sampling.
C. StrongArm Operating Principle
Before the sampler starts evaluating, the clock is low, and the nodes P, Q, X, and Y are brought up to $V_{DD}$ by the reset transistors driven by the clock, φ. The evaluation starts when the clock goes high and is composed of two periods. In the sampling period, the nodes P, Q, X, and Y discharge through M1, M2, M3, M4, and M7, building a differential voltage on nodes X and Y. The sampling period ends when $V_{X,Y}$ reach $V_{DD} - V_{th,P}$ and the cross-coupled inverters composed of M3, M4, M5, and M6 turn on. The regeneration period then starts, and the differential voltage on nodes X and Y is amplified to logic level by the latch.
D. Sampling Period
The sampling phase can itself be divided into two separate phases. The first, during which only M1 and M2 are on, discharges nodes P and Q until they reach 
where $\tau = \dfrac{C_{XY} C_{PQ}}{g_{m,3}\,(C_{PQ} - C_{XY})}$.
There is no closed-form solution for the time at which nodes X and Y reach $V_{DD} - V_{th,P}$, but if $|\tau|$ is small compared to $V_{th,P}(C_{PQ} + C_{XY})/I_1$, which is usually the case, the end time of the second sampling phase may be approximated as $t_2 \sim V_{th,P}(C_{PQ} + C_{XY})/I_1$.
The differential mode, during the second phase, can be shown [16] 
Since $C_{XY}$ is usually greater than $C_{PQ}$, $\tau$ is usually negative, and there is no regeneration gain during the sampling period. The sampling gain can be approximated as
E. Regeneration Period
Once the top PMOS transistors turn on, the regeneration period starts. The approximation is made that only the cross-coupled inverter pairs are on, providing positive feedback gain, with a time constant 
