Abstract-Both power efficiency and per-channel data rates of high-speed input/output (I/O) links must be improved in order to support future inter-chip bandwidth demand. In order to scale data rates over band-limited channels, various types of equalization circuitry are used to compensate for frequency-dependent loss. However, this additional complexity introduces power and area costs, requiring selection of an appropriate I/O equalization architecture in order to comply with system power budgets. This paper presents a design flow for power optimization of high-speed electrical links at a given data rate, channel type, and process technology node, which couples statistical link analysis techniques with circuit power estimates based on normalized transistor parameters extracted with a constant current density methodology. The design framework selects the optimum equalization architecture, circuit logic style (CMOS versus current-mode logic), and transmit output swing for minimum I/O power. Analysis shows that low-loss channel characteristics and minimal circuit complexity, together with scaling of transmitter output swing, allow excellent power efficiency at high data rates.
I. INTRODUCTION

IMPROVING input/output (I/O) power efficiency is important for applications ranging from high-performance processors to next generation mobile devices. Many-core microprocessors, which require significant increases in parallel data bandwidth, are projected to have aggregate I/O bandwidth in excess of 1 TBps based on current bandwidth scaling rates of two to three times every two years [1]. Unless I/O power efficiency is dramatically improved, I/O power budgets will be forced to grow above the typical 10%-20% of the total processor budget and/or performance metrics must be sacrificed to comply with thermal power limits. In the mobile device space, processing performance is projected to increase 10× over the next five years in order to support the next generation of multimedia features [2]. This increased processing translates into aggregate I/O data rates in the hundreds of Gb/s, requiring the I/O circuitry to operate at low-mW/Gb/s efficiency levels for sufficient battery lifetimes. These requirements are reflected in recent work in low-power I/O design [3]-[5], where the emphasis is on improving I/O power efficiency at data rates near 10 Gb/s.
While nanometer CMOS technologies provide adequate bandwidth for data rates in excess of 10 Gb/s, limited electrical channel bandwidth prohibits high-speed I/O data rate scaling. In order to achieve reliable communication, equalization circuitry is often employed to compensate for frequency-dependent channel losses. However, excessive equalization complexity can increase I/O power dissipation to unacceptable levels for future processors [1] . This creates the need for low power architectural techniques, which can significantly improve the I/O power efficiency to comply with system power budgets.
Sophisticated link analysis tools [6]-[10], which use statistical means to combine deterministic noise sources, random noise sources, and receiver sensitivity and aperture time, are often employed to efficiently explore the performance of a potentially wide design space of transmit and/or receive equalizer combinations on a given electrical channel. The most common equalization circuits employed in high-speed links are transmitter (TX) feed-forward equalization (FFE) [11], receiver (RX) continuous-time linear equalization (CTLE) [12], and decision-feedback equalization (DFE) [13]. Given strict system power constraints, I/O designers must balance acceptable link margins with circuit power consumption. It is often the case that different configurations of the aforementioned equalization circuitry will satisfy the required link margins at a given bit-error rate (BER). However, it can be difficult to predict which configuration is optimal in terms of power efficiency, as this is generally not modeled in the link analysis tools and can vary with data rate, channel quality, and CMOS process node. This paper presents a design methodology that minimizes high-speed link power dissipation by selecting the optimum equalization architecture, circuit logic style [CMOS versus current-mode logic (CML)], and transmit output swing for a given data rate, channel type, and process technology [14]. This paper leverages previous optimization methods for electrical links [6], [15], [16] and builds upon them by combining statistical link analysis techniques with comprehensive equalization and serialization circuit power models. Due to the complex tradeoffs involved in the design of high-speed links, a statistical link analysis tool [8] is utilized to optimize equalization parameters for specific channel characteristics and estimate the link margin under user-defined voltage swing, timing noise, and receiver sensitivity parameters at a given BER.
Based on the link margin results, transmitter output swing is scaled to satisfy the minimum receiver eye opening requirement and operate at optimal power efficiency. Comprehensive transmitter and receiver circuit models, which utilize normalized transistor parameters extracted from preliminary SPICE simulations of the circuit topologies, are used to provide accurate power estimates over a wide link architecture search space.
This paper is organized as follows. An overview of the high-speed electrical interconnect model is given in Section II. Section III discusses the modeling of the link circuits and how their performance varies over data rate and different CMOS process nodes. Applying the link optimization methodology, detailed in Section IV, to electrical links operating on three backplane channels with differing loss profiles yields the power efficiency estimates in Section V. In order to observe the impact of CMOS technology scaling, modeling is undertaken in both 90- and 45-nm processes. Finally, Section VI concludes this paper.
II. ELECTRICAL I/O MODEL
Fig. 1 shows a block diagram of the high-speed electrical link modeled in this paper. Parallel input data is serialized at the transmitter in order to meet system I/O bandwidth demands under the constraints of limited high-speed I/O pins in chip packages and minimum printed circuit board wiring pitches. The incoming serial data at the receiver is typically conditioned with amplifier and/or equalizer blocks before being sampled and regenerated to logic levels by a decision element, which is often a differential dynamic sense-amplifier [17]. Finally, the data is deserialized to the core data rate of the receiver chip. The entire signal chain from serializer to deserializer is modeled in this paper, including local clock buffering. As architectures for transmit clock generation and receiver timing recovery can vary significantly with application, this modeling is left for future work in order to more clearly display the electrical channel performance impact.
The frequency responses of the three backplane channels [18] considered in this paper are shown in Fig. 2. Channel "B1" has a total length of 6.5 in, consisting of 5.2-in line card traces and only 1.3 in on the backplane board, and displays the lowest frequency-dependent loss due to both its short length and the use of the bottom backplane signaling layer to minimize impedance discontinuities. The impact of channel length is evident in the increased loss of the "C4" channel, which has a total length of 32 in, with 12-in line card traces and 20 in on the bottom layer of the backplane board. Channel "T20" is slightly shorter than "C4," with 5.9-in line card traces and 20-in traces on the top layer of the backplane board. While the T20 channel low-frequency loss is similar to that of the C4 channel, the backplane via stubs associated with signaling on the top layer introduce a capacitive impedance discontinuity that causes severe loss in the T20 channel near 7 GHz.
Sending high-speed data pulses over these low-pass channels can result in significant inter-symbol interference (ISI), which can limit the maximum achievable data rate. In order to overcome these channel distortion effects, various combinations of the transmitter and receiver equalization circuits shown in Fig. 1 are employed. The following section presents an overview of the link equalization circuits and their modeling.
III. LINK CIRCUIT MODELING
In order to achieve accurate modeling results for the transmitter and receiver circuitry, normalized transistor parameters (transconductances, capacitances, output conductances ($g_{ds}/W$), etc.) are utilized. These are extracted from preliminary SPICE simulations of the circuit topologies with the transistors at various biasing conditions of differing overdrive voltages and current density, which correspond to different transistor transition frequencies, $f_T$. In varying circuit parameters, such as bandwidth and current drive, the individual transistor parameters are scaled in a constant current density manner by incrementing transistor finger number under fixed biasing conditions and finger size. The impact of technology scaling on link power efficiency is studied by modeling the link circuits in both a 90-nm CMOS process with 110-GHz peak $f_T$ at 0.4 mA/μm current density and a 45-nm process with 225-GHz peak $f_T$ at 0.4 mA/μm.
A. Transmitter Feed-Forward Equalizer
Transmit-side equalization is typically implemented as a linear feed-forward equalizer (Fig. 3), which pre-distorts or shapes the data pulse over several bit times in order to mitigate channel distortion. Because the transmitter has peak voltage swing constraints, the FFE, which typically acts as a high-pass filter to compensate the low-pass electrical channel, attenuates the low-frequency content of the transmit output in order to flatten the combined frequency response of the channel and the transmitter finite-impulse response filter up to the Nyquist frequency of operation. While increasing the equalizer tap number allows for more flexibility in ISI cancellation, this paper models a maximum of four FFE taps (the main data, one pre-cursor, and two post-cursor taps) because performance improvements generally diminish beyond this complexity level [9].
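The peak-swing constraint and the resulting low-frequency attenuation can be illustrated with a short sketch. It assumes normalized ±1 data and tap weights whose magnitudes sum to one; the function names and the three-tap weights are hypothetical values chosen for illustration, not settings taken from this paper.

```python
def normalize_taps(taps):
    """Scale FFE tap weights so their magnitudes sum to 1.

    This models the peak-swing constraint: the worst-case data pattern
    drives every tap in the same direction, so sum(|w_i|) sets the peak.
    """
    total = sum(abs(w) for w in taps)
    if total == 0:
        raise ValueError("at least one tap weight must be nonzero")
    return [w / total for w in taps]


def ffe_gain(taps, pattern):
    """Relative output amplitude for a constant ('dc') or alternating ('nyquist') pattern."""
    if pattern == "dc":
        return abs(sum(taps))
    if pattern == "nyquist":
        # Consecutive symbols alternate in sign, so adjacent taps see opposite polarity.
        return abs(sum(w * (-1) ** i for i, w in enumerate(taps)))
    raise ValueError("pattern must be 'dc' or 'nyquist'")


# Hypothetical three-tap setting: one pre-cursor, main cursor, one post-cursor.
taps = normalize_taps([-0.1, 0.7, -0.2])
dc_gain = ffe_gain(taps, "dc")            # low-frequency content is attenuated
nyquist_gain = ffe_gain(taps, "nyquist")  # full swing remains available at Nyquist
```

With these example weights the low-frequency amplitude is well below the Nyquist amplitude, which is the high-pass behavior described above.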
Parallel current-mode drivers implement the equalization taps. These drivers are sized to produce the required transmitter output voltage swing on the parallel combination of the 50-Ω channel and the 50-Ω TX termination placed on chip to minimize signal reflections. In computing the transmitter power consumption, the entire signal chain of tap multiplexers, latches, XORs, pre-driver, and driver circuits is sized for the power-optimal transmit output swing that meets the minimum receiver eye opening requirement for the specified BER. Included in the transmitter power total is the power of the local clock buffering to clock the transmitter latches and muxes, which scales with the output stage sizing. The major constraints in modeling the transmitter equalizer circuit are as follows.
1) The maximum peak-to-peak differential output swing is set equal to the nominal power supply voltage.
2) The 20%-80% transition time, $\tau$, of the serialization and equalization circuits is limited to one-third of a bit period in order to avoid excessive on-chip ISI.
Both CML and CMOS logic-based designs are analyzed over the data rates of interest in order to predict when it is optimum from a power perspective to transition from a CMOS to a CML-based design. As an example of how the transmitter circuits' power is estimated, consider the CML pre-driver in Fig. 3. This pre-driver is sized to satisfy the maximum transition time constraint of one-third the desired bit period, $T_b$,
$$\tau = \alpha R_P C_{OP} \le \frac{T_b}{3}$$
where $\alpha$ is a constant, $R_P$ is the pre-driver load resistor, and $C_{OP}$ is the pre-driver total output capacitance, consisting of the gate capacitance of the output driver stage, $C_{GO}$, the drain capacitance of the pre-driver nMOS, $C_{DIN}$, and the parasitic capacitance of the load resistor, $C_{RP}$. Note that while $\alpha$ is $\ln(4)$ for an ideal RC system, the value of $\alpha$ used in the modeling is extracted in SPICE simulations to capture device nonlinearity and improve accuracy. In order to size the pre-driver and estimate its power consumption, a fan-out ($FO_{CML}$) ratio of the output stage capacitance, $C_{GO}$, and the input gate capacitance of the pre-driver, $C_{GIN}$, is derived from the transition time equation
where $V_{SW}$ is the CML buffer swing and $I_P$ is the tail current
where
which represents the CML buffer self-loading factor. Using SPICE simulations, a reference CML buffer is characterized to extract values for $I_{Pref}$, $R_{Pref}$, $C_{GINref}$, $C_{DINref}$, and $C_{RPref}$. These reference values are normalized by the nMOS differential pair width, $W_{INref}$, and used to compute the fan-out ratio
Equation (5) relates the fan-out factor to the bit period and the normalized parameters of the CML gate. Note that as the bit period shortens with increased data rates ($1/T_b$), this fan-out factor will drop, resulting in a larger pre-driver that consumes more power.
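The fan-out relation can be written out from the definitions given in the text. The following is a sketch under stated assumptions rather than the paper's exact expressions: it takes the transition-time constraint at equality, uses $R_P = V_{SW}/I_P$, and assumes the tail current and all capacitances scale linearly with transistor width at constant current density. The symbol $k_{CML}$ for the self-loading factor is a name introduced here.

```latex
\begin{align}
% Transition-time constraint at equality, with C_OP = C_GO + C_DIN + C_RP
\alpha R_P \left(C_{GO} + C_{DIN} + C_{RP}\right) &= \frac{T_b}{3} \\
% Divide by C_GIN and substitute R_P = V_SW / I_P
FO_{CML} = \frac{C_{GO}}{C_{GIN}}
  &= \frac{T_b\, I_P}{3\,\alpha\, V_{SW}\, C_{GIN}} - k_{CML},
  \qquad k_{CML} = \frac{C_{DIN} + C_{RP}}{C_{GIN}} \\
% Width-normalized reference values: W_INref cancels in every ratio
FO_{CML} &= \frac{T_b \left(I_{Pref}/W_{INref}\right)}
               {3\,\alpha\, V_{SW} \left(C_{GINref}/W_{INref}\right)}
          - \frac{C_{DINref} + C_{RPref}}{C_{GINref}}
\end{align}
```

In this form the fan-out depends only on the bit period, the buffer swing, and width-independent ratios of the reference device, which agrees with the observation that the fan-out falls as the bit period shortens.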
Assuming that the pre-driver transistors are designed at the same current density as the output stage, the fan-out ratio determines the number of reference transistors to use in sizing the pre-driver,
and
The fan-out ratio also serves as a ratio of pre-driver and output stage currents
Thus, the pre-driver power can be computed as
The use of this equation-based model allows for rapid power estimation over data rate and output power level, while the use of the SPICE-extracted parameters allows for high accuracy. A similar procedure is undertaken for the CMOS pre-driver case, with a look-up table extracted from SPICE simulations employed to extract pre-driver transition time versus fan-out at high accuracy.
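A minimal sketch of this equation-based pre-driver estimate is given below. It assumes the fan-out takes the form $T_b/(3 \alpha R_{Pref} C_{GINref})$ minus a self-loading term, that the pre-driver tail current equals the output stage current divided by the fan-out, and that static CML power is supply voltage times tail current. All function names and reference values are hypothetical placeholders, not SPICE-extracted data from this paper.

```python
import math


def cml_fanout(t_bit, alpha, r_pref, c_ginref, c_dinref, c_rpref):
    """Fan-out C_GO / C_GIN that just meets the one-third bit period transition time."""
    self_loading = (c_dinref + c_rpref) / c_ginref
    return t_bit / (3.0 * alpha * r_pref * c_ginref) - self_loading


def predriver_power(vdd, i_out, fanout):
    """Static CML pre-driver power, assuming I_P = I_OUT / FO at equal current density."""
    if fanout <= 0:
        raise ValueError("transition-time constraint cannot be met at this data rate")
    return vdd * i_out / fanout


# Hypothetical reference-buffer values (ohms and farads), for illustration only.
REF = dict(alpha=math.log(4), r_pref=2000.0,
           c_ginref=2e-15, c_dinref=1e-15, c_rpref=0.5e-15)

fo_10g = cml_fanout(t_bit=1 / 10e9, **REF)
fo_20g = cml_fanout(t_bit=1 / 20e9, **REF)
p_10g = predriver_power(vdd=1.2, i_out=12e-3, fanout=fo_10g)
p_20g = predriver_power(vdd=1.2, i_out=12e-3, fanout=fo_20g)
```

Doubling the data rate lowers the usable fan-out, so the same output stage needs a larger and more power-hungry pre-driver, matching the trend described in the text.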
Modeling results for the TX FFE, designed for maximum output swing and implemented with a majority of either CML or CMOS circuits, are shown in Figs. 4 and 5, respectively. For both the CMOS and CML case, the power efficiency at low data rates is dominated by the output stage, which supports maximum output swing levels of 1.2 Vppd in the 90-nm process and 1.1 Vppd in the 45-nm process. At moderate data rates, when the output stage power is more amortized, the CML designs are less power efficient relative to the CMOS designs due to the static power dissipation of the CML. However, the CMOS logic supports lower fan-outs at higher data rates due to a higher percentage of self-loading capacitance, necessitating large transistor sizes and increased power to satisfy the transition time constraint. Scaling from the 90- to the 45-nm CMOS process allows for improvements in power efficiency and maximum data rate in both the CMOS and CML designs. For a fixed output swing level, the total transmitter power increases with equalization tap number due to the extra logic associated with each tap. In the CMOS-based transmitter operating at maximum output swing level, the overhead is 28% in the 90-nm process and 16% in the 45-nm process when the equalization tap number is increased from one to four taps. However, if the transmit output swing is optimized for a given channel, the equalization logic overhead can become a larger percentage of the total transmit power. For instance, with a 0.1-Vppd transmit swing the overhead of going from one to four equalization taps at 10 Gb/s is 119% in the 90-nm process and 93% in the 45-nm process. As explained in Section IV, in order to determine the power-optimum system, the transmit swing should be optimized for each acceptable equalization configuration.
B. Receiver CTLE
At the receiver side, a CTLE is a simple structure that provides gain and equalization with low power and area overhead. As shown in Fig. 6 , it is often realized as a differential amplifier with programmable RC-degeneration, which creates a peaking response to compensate for the low-pass channel response. The CTLE transfer function is
where $g_{mIN}$ is the input differential-pair transistors' transconductance, $R_s$ is the degeneration resistance, $C_s$ is the degeneration capacitance, $R_{out}$ is the parallel combination of the pMOS load and nMOS output resistance, and $C_{out}$ is the total output capacitance formed by both the load $C_L$ and the output transistor drain capacitances, $C_{DP}$ and $C_{DIN}$.
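For reference, a commonly used one-zero, two-pole form for an RC-degenerated differential pair is sketched below with the symbols defined above. It is a textbook form offered as an assumption, not necessarily the exact expression used in this paper. In particular, the factor of 1/2 on $g_{mIN} R_s$ assumes that $R_s$ and $C_s$ are both connected between the two source nodes of the differential pair.

```latex
\begin{align}
H(s) &= \frac{g_{mIN}}{C_{out}}\,
        \frac{s + \dfrac{1}{R_s C_s}}
             {\left(s + \dfrac{1 + g_{mIN} R_s / 2}{R_s C_s}\right)
              \left(s + \dfrac{1}{R_{out} C_{out}}\right)} \\
% Zero and pole locations
\omega_z &= \frac{1}{R_s C_s}, \qquad
\omega_{p1} = \frac{1 + g_{mIN} R_s / 2}{R_s C_s}, \qquad
\omega_{p2} = \frac{1}{R_{out} C_{out}} \\
% Low-frequency gain and ideal peak gain (poles well separated)
H(0) &= \frac{g_{mIN} R_{out}}{1 + g_{mIN} R_s / 2}, \qquad
A_{vpk} \approx g_{mIN} R_{out}
\end{align}
```

Under these assumptions the degeneration lowers the low-frequency gain by $1 + g_{mIN} R_s/2$ while leaving the peak gain near $g_{mIN} R_{out}$, which produces the peaking response described in the text.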
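The major constraint in modeling the CTLE is that the 3-dB bandwidth $f_{3dB}$, which is set by the output node, should be a certain percentage, $\beta$, of the data rate $f_{DR}$
$$f_{3dB} = \beta f_{DR}$$
A reasonable $\beta$ value of 70% is assumed in order to balance CTLE bandwidth, noise, and power [19].
[Fig. 6 frequency-response sketch: gain $A_v$ versus frequency, with the zero $\omega_z$ and the poles $\omega_{p1}$ and $\omega_{p2}$ marked.]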
CTLE power dissipation is set by the capacitive loading and biasing conditions for the gain-bandwidth that supports the system data rate and channel loss. Similar to the transmit circuit modeling procedure, a fan-out ($FO_{CTLE}$) ratio of the CTLE load capacitance and input gate capacitance, $C_{GIN}$, is derived from the bandwidth equation. Using the relationship
where $A_{vpk}$ is the CTLE gain without any degeneration effects, the CTLE bandwidth can be expressed as
where $f_{TIN}$ is the transition frequency of the input differential pair transistors
which represents the CTLE self-loading factor. Using SPICE simulations, a reference CTLE is characterized to extract values for $g_{mINref}$, $C_{GINref}$, $R_{outref}$, $C_{DPref}$, $C_{DINref}$, and $I_{Cref}$. These reference values are normalized by the nMOS differential pair width, $W_{INref}$, and used to compute the fan-out ratio
where
Equation (16) relates the fan-out factor to the gain-bandwidth and the normalized parameters of the CTLE. Note that as the gain-bandwidth scales with increased data rates, this fan-out factor will drop, resulting in a larger CTLE that consumes more power.
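The CTLE fan-out relation can be written out from the definitions in the text. This is a sketch under stated assumptions, not the paper's exact equations: the output node is treated as a single pole, the peak gain is taken as $A_{vpk} = g_{mIN} R_{out}$, and $f_{TIN} = g_{mIN}/(2\pi C_{GIN})$. The symbol $k_{CTLE}$ for the self-loading factor is a name introduced here.

```latex
\begin{align}
% Output-node bandwidth with C_out = C_L + C_DP + C_DIN and R_out = A_vpk / g_mIN
f_{3dB} &= \frac{1}{2\pi R_{out} C_{out}}
         = \frac{g_{mIN}}{2\pi A_{vpk}\left(C_L + C_{DP} + C_{DIN}\right)} \\
% Normalize by C_GIN using f_TIN = g_mIN / (2 pi C_GIN)
f_{3dB} &= \frac{f_{TIN}}{A_{vpk}\left(FO_{CTLE} + k_{CTLE}\right)},
  \qquad FO_{CTLE} = \frac{C_L}{C_{GIN}},
  \qquad k_{CTLE} = \frac{C_{DP} + C_{DIN}}{C_{GIN}} \\
% Apply the bandwidth constraint f_3dB = beta * f_DR
FO_{CTLE} &= \frac{f_{TIN}}{A_{vpk}\,\beta\, f_{DR}} - k_{CTLE}
\end{align}
```

The last line shows the fan-out shrinking as the required gain-bandwidth product $A_{vpk}\,\beta\,f_{DR}$ approaches $f_{TIN}$, and becoming non-positive when the requested peak gain cannot be realized at that data rate.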
Once the load capacitance of the CTLE is known, which will be either the input capacitance of the deserializing block or a DFE, the fan-out ratio determines the number of reference transistors to use in sizing the CTLE
The CTLE power consumption is computed by scaling the reference design current by the finger number computed in (18), thus preserving the transistor current density corresponding to the $f_T$ that supports the required CTLE gain-bandwidth
Fig. 7 compares the CTLE power dissipation for a given bandwidth computed with the equation-based model versus actual transistor-level SPICE simulations. Scaling the reference CTLE design in a constant current density manner allows a close match between the modeling and the SPICE simulation results, with only slight deviation at high bandwidth.
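The CTLE sizing procedure can be sketched in a few lines. The sketch assumes a fan-out of the form $f_{TIN}/(A_{vpk}\,\beta\,f_{DR})$ minus a self-loading factor, a finger count equal to the load capacitance divided by the fan-out times the reference input capacitance, and power equal to supply voltage times the scaled reference current. The function names and all numeric values are hypothetical and are not the extracted 90-nm or 45-nm parameters.

```python
def ctle_fanout(f_tin, peak_gain_db, data_rate, self_loading, beta=0.7):
    """Load-to-input capacitance ratio that meets f_3dB = beta * data_rate."""
    a_vpk = 10 ** (peak_gain_db / 20)  # peak gain as a linear ratio
    return f_tin / (a_vpk * beta * data_rate) - self_loading


def ctle_power(c_load, fanout, c_ginref, i_cref, vdd):
    """Power of a CTLE built from scaled reference fingers, or None if infeasible."""
    if fanout <= 0:
        return None  # required gain-bandwidth exceeds what the bias point supports
    fingers = c_load / (fanout * c_ginref)
    return vdd * fingers * i_cref


# Hypothetical reference values: 80-GHz f_TIN, self-loading of 1.5,
# 10-fF load, 1-fF reference input capacitance, 100-uA reference current.
params = dict(c_load=10e-15, c_ginref=1e-15, i_cref=100e-6, vdd=1.2)

p_12db_10g = ctle_power(fanout=ctle_fanout(80e9, 12, 10e9, 1.5), **params)
p_12db_20g = ctle_power(fanout=ctle_fanout(80e9, 12, 20e9, 1.5), **params)
p_6db_20g = ctle_power(fanout=ctle_fanout(80e9, 6, 20e9, 1.5), **params)
```

With these placeholder numbers the 12-dB design becomes infeasible at 20 Gb/s while the 6-dB design remains realizable, which mirrors the qualitative dependence of the maximum data rate on peak gain.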
The modeling results of Fig. 8 show that CTLE power efficiency is a strong function of the peak gain requirement. In the 90-nm technology, 12-dB peak gain is realized only up to 14 Gb/s, whereas 6-dB peak gain is achieved past 20 Gb/s. Scaling technology to the higher-$f_T$ 45-nm process allows realization of 12-dB peak gain out to 18 Gb/s.
C. Receiver Decision-Feedback Equalizer
Another receiver-side equalization circuit commonly implemented in high-speed links is the decision-feedback equalizer. A DFE, shown in Fig. 9, attempts to directly subtract ISI from the incoming signal by feeding back the resolved bits to control the polarity of the equalization taps. Unlike linear receive equalization, a DFE does not directly amplify the input signal noise or cross-talk since it uses the quantized input values. However, there is the potential for error propagation in a DFE if the noise or residual ISI is large enough for a quantized output to be wrong. Also, due to the feedback equalization structure, the DFE cannot cancel pre-cursor ISI. The main challenge in DFE implementation is closing timing on the first tap feedback, since this must be done in one bit period or unit interval (UI). Direct-feedback implementations, such as the one modeled in this paper, require the critical timing path to be highly optimized in order to achieve adequate settling (> 95%) of the ISI subtraction. This critical timing path (Fig. 10) includes the Clk-Q delay of the sense-amplifier comparator, $t_{CLK-Q,SA}$, and the propagation delays of the feedback multiplexer, $t_{PROP,MUX}$, and amplifier $A_2$, $t_{PROP,A2}$
$$t_{CLK-Q,SA} + t_{PROP,MUX} + t_{PROP,A2} \le 1\ \mathrm{UI}$$
Fig. 9. Five-tap decision feedback equalizer [13].
A dominant term in this critical timing path is the sense-amplifier comparator Clk-Q delay, which is a function of the input voltage at the decision clock edge. In the DFE modeling results of Fig. 11, a minimum 50-mVppd comparator input signal is used in order to achieve data rates that exceed 10 Gb/s in the 90-nm technology. The minimum eye opening compliance voltage constraint is applied at the output of amplifier $A_1$, which is the DFE summation node where the ISI cancellation occurs. Amplifier $A_2$ serves to provide sufficient gain to amplify this equalized signal to a minimum 50 mVppd and also isolate the DFE summation node from sense-amplifier charge kickback. Following the comparator, the CML mux propagation delay is dominated by the time required to achieve 95% settling. This forces a maximum summation resistor, which for a given data rate is a function of the overall loading capacitance. As the number of taps grows, this loading capacitance increases, leading to a smaller summation resistor and larger power consumption to achieve the DFE minimum swing levels. Similarly, the $A_2$ amplifier output resistance and power are set by the 95% settling constraint at its output. Also included in the DFE power total is the power of the local clock buffering to clock the DFE sense amplifiers, latches, and muxes.
At data rates that are low relative to the process speed, this critical timing path is not difficult to meet and the power of the DFE is mainly set by the static current required for the minimum comparator input signal. Thus, as shown in the modeling results of Fig. 11, the DFE power efficiency improves initially as data rates scale due to the amortization of this static current. However, the power consumption of the individual blocks must increase as data rates scale further in order to reduce their cumulative delay sufficiently to meet the stringent 1-UI timing path. Ultimately, this critical timing path cannot be met at high data rates, resulting in a maximum data rate that is a function of the process technology and the number of DFE taps.
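The first-tap timing budget and the 95% settling requirement can be sketched as follows. The sketch assumes that each node settles as a single-pole RC, so that 95% settling takes about $\ln(20) \approx 3$ time constants. The function names and the delay and capacitance values are hypothetical and are not taken from the modeled designs.

```python
import math


def settling_time(r, c, settle=0.95):
    """Time for a single-pole RC node to reach the given settling fraction."""
    return r * c * math.log(1.0 / (1.0 - settle))


def max_summation_resistor(t_budget, c_load, settle=0.95):
    """Largest summation resistor that still settles within the time budget."""
    return t_budget / (c_load * math.log(1.0 / (1.0 - settle)))


def first_tap_closes(data_rate, t_clk_q, t_mux, t_a2):
    """True if the direct-feedback critical path fits within one unit interval."""
    unit_interval = 1.0 / data_rate
    return t_clk_q + t_mux + t_a2 <= unit_interval


# Hypothetical delays: 40-ps comparator Clk-Q, 30-ps mux, 20-ps amplifier A2.
ok_10g = first_tap_closes(10e9, 40e-12, 30e-12, 20e-12)
ok_12g5 = first_tap_closes(12.5e9, 40e-12, 30e-12, 20e-12)

# More taps add capacitance to the summation node, which lowers the allowed resistor.
r_two_tap = max_summation_resistor(t_budget=30e-12, c_load=50e-15)
r_five_tap = max_summation_resistor(t_budget=30e-12, c_load=80e-15)
```

A smaller allowed resistor requires more tap current to reach the same swing, which is the mechanism behind the power increase with tap count described above.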
Increasing the DFE tap number allows for an increased amount of post-cursor ISI cancellation. While the timing paths for additional taps are somewhat relaxed, increasing DFE tap number adds additional loading on the critical tap-current summation node. Also important is that wire capacitance, while modeled and scaled with each tap, will be a function of the exact layout floor plan and the specific technology constraints. Increased accuracy of the DFE critical summation node can be realized by referring to reference design layouts. As shown in the modeling results of Fig. 11 , increasing the tap number results in a reduced maximum data rate and degraded power efficiency. A maximum of five DFE taps is considered in this paper.
IV. LINK OPTIMIZATION METHODOLOGY
The objective of this design methodology is to minimize high-speed link power dissipation by selecting the optimum equalization architecture, circuit logic style [CMOS versus CML], and transmit output swing for a given data rate, channel type and process technology. Fig. 12 shows how the link optimization methodology couples both statistical link modeling, the left half of the flowchart, and accurate circuit models obtained from SPICE simulations, the right half of the flowchart. The electrical link I/O specifications used in this paper are shown in Table I and serve as the constraints for the statistical link analysis tool. Here, the jitter values are similar to the industry standard common electrical I/O [20] and the minimum eye compliance voltage is scaled down to save power, while still maintaining reasonable receiver sensitivity.
StatEye [8] , an open source statistical link analysis tool, which utilizes statistical methods in modeling the impact of ISI and deterministic and random noise sources, is used to predict the voltage and timing margins of a link with a given equalization configuration operating over a certain channel characterized by s-parameters. A database is generated, which stores the link margin and equalization coefficients for all equalization configurations that satisfy the required I/O performance constraints. For the different link architectures, transmitter output swing is optimized with the constraint that the link voltage margin meets the minimum eye opening compliance requirement, resulting in considerable power savings.
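The transmit swing optimization can be illustrated with a simplified sketch. It assumes that the received eye height scales linearly with transmit swing, which ignores swing-independent noise terms that a statistical link analysis would include. The function name and the numeric values are hypothetical and are not the Table I specifications.

```python
def optimized_swing(full_swing_vppd, eye_at_full_swing_v, min_eye_v):
    """Smallest transmit swing that still meets the minimum eye opening.

    Returns None when the configuration fails even at full swing.
    Assumes eye height is proportional to transmit swing (a simplification).
    """
    if eye_at_full_swing_v < min_eye_v:
        return None
    return full_swing_vppd * min_eye_v / eye_at_full_swing_v


# Hypothetical case: 1.2-Vppd swing gives a 120-mV eye; only 40 mV is required.
swing = optimized_swing(1.2, 0.120, 0.040)
rejected = optimized_swing(1.2, 0.030, 0.040)
```

In this example the swing can be reduced to about one-third of its maximum, while a configuration whose eye is already below the requirement at full swing is excluded from the database.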
The power consumption of these acceptable link equalization configurations is then computed based on the circuit models. As shown by the right half of the flowchart, in order to achieve accurate circuit modeling results, normalized transistor parameters (transconductance, capacitance, output conductance, etc.) are utilized. These are extracted from preliminary SPICE simulations of the circuit topologies. The circuit parameters are scaled in a constant current density manner, as outlined in the previous section, by scaling transistor finger number under fixed biasing conditions and finger size. Transmitter and receiver circuits are modeled by utilizing the equation-based modeling method discussed in the circuit link modeling section, constrained to satisfy circuit design criteria at a given data rate specification.
With the transmitter and receiver circuits modeled for a determined equalization configuration for a certain channel at a given data rate and process node, the equation-based models of the previous section determine its total power consumption. This procedure is repeated for every equalization configuration in the database to compute their respective power consumption. Thus, computing the power of all equalizer combinations that satisfy the I/O specifications provides an exhaustive search space, from which the minimum-power architecture is selected for a given data rate, channel type, and process technology node.
For example, suppose that for a 90-nm system operating at 10 Gb/s over a certain channel the statistical link analysis tool gives two potential equalization configurations that satisfy the link constraints: 1) two-tap TX FFE, RX CTLE (12-dB peaking), and two-tap DFE and 2) three-tap TX FFE and four-tap DFE. In computing the power for configuration 1, the CTLE data from Fig. 8 and the two-tap DFE data from Fig. 11 would be used, along with a scaled version of the Fig. 5 two-tap TX FFE data, which has been optimized for the required transmit swing. For configuration 2, the power for the four-tap DFE and the three-tap TX FFE, optimized for the required swing, is used. The optimizer then picks the minimum power solution.
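The selection step in this example can be sketched as a minimum search over the acceptable configurations. The class, function names, and per-block power values below are made-up placeholders for illustration; in the actual flow they would come from the circuit models and the link analysis database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkConfig:
    name: str
    tx_taps: int
    ctle_peak_db: int  # 0 means no CTLE
    dfe_taps: int


# Placeholder per-block power tables in mW (not values from this paper).
TX_POWER_MW = {2: 9.0, 3: 11.0}
CTLE_POWER_MW = {0: 0.0, 12: 4.0}
DFE_POWER_MW = {2: 6.0, 4: 12.0}


def link_power_mw(cfg):
    """Total link power as the sum of the transmitter, CTLE, and DFE estimates."""
    return (TX_POWER_MW[cfg.tx_taps]
            + CTLE_POWER_MW[cfg.ctle_peak_db]
            + DFE_POWER_MW[cfg.dfe_taps])


def select_min_power(configs):
    """Return the acceptable configuration with the lowest total power."""
    if not configs:
        raise ValueError("no configuration satisfies the link constraints")
    return min(configs, key=link_power_mw)


candidates = [
    LinkConfig("config 1", tx_taps=2, ctle_peak_db=12, dfe_taps=2),
    LinkConfig("config 2", tx_taps=3, ctle_peak_db=0, dfe_taps=4),
]
best = select_min_power(candidates)
```

Because every candidate in the list already satisfies the link constraints, the search only needs to compare total power.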
While this paper presents results for the most common high-speed link equalization architectures, TX FFE, RX CTLE, and DFE, the methodology can be applied to other transmitter and receiver filter or equalizer structures and also modulation schemes. With equivalent circuit models that accurately describe the alternative link topologies, these architectures can be investigated. Also, the effects of interference terms other than the thru channel loss, such as crosstalk, can easily be modeled by including them in the link analysis tool.
V. LINK PERFORMANCE COMPARISONS
Using the discussed optimization methodology, link power efficiency for the three channels from Section II is computed, and the impacts of optimizing transmitter output swing and circuit style are illustrated in Figs. 13 and 14, respectively. Optimizing transmit swing can dramatically reduce power. As shown in Fig. 13, at 12 Gb/s the power is roughly cut in half on the high-loss T20 channel and reduced to 20% of the nonscaled value on the low-loss B1 channel. The choice of CML versus CMOS circuit style is a function of data rate and technology node. As shown in the 90-nm modeling results of Fig. 14, at low data rates the CMOS-based link has better power efficiency than the CML-based link, which has significant static power dissipation. However, beyond 14 Gb/s the CMOS-based link power increases steeply due to reduced fan-out values, and the CML-based link becomes more power optimal. For example, at 16 Gb/s the CMOS-based link achieves 5.95 mW/Gb/s operating on the low-loss B1 channel, while the CML-based link power efficiency is only 1.62 mW/Gb/s.
The impact of electrical channel and process node is evident in the modeling results of Fig. 15, which combines the CMOS and CML-based results to select the optimum design at a given data rate, and Fig. 16, which shows the optimum equalization architecture. The high-loss T20 channel is strongly channel-limited, as there is no difference in the optimum equalization architecture or CMOS circuit style between the 90- and 45-nm processes. A three-tap FFE transmitter and four-tap DFE receiver are required at the maximum data rate of 12 Gb/s, resulting in a power efficiency of 3.0 mW/Gb/s in the 90-nm process and 1.8 mW/Gb/s in the 45-nm process.
The C4 channel has improved loss characteristics due to signaling on the bottom backplane layer, avoiding the detrimental impact of the T20 long via stubs. For this channel, the process node has an impact on the optimum equalization architecture and circuit style. In the 90-nm technology, a CMOS design is more power efficient up to 14 Gb/s, while above this data rate a CML design is chosen. A CMOS design is chosen for all data rates in the 45-nm technology. Also, the 90-nm design cannot efficiently leverage CTLE equalization above 12 Gb/s, while the 45-nm design utilizes a CTLE up to 16 Gb/s. The 90-nm design is limited to 16 Gb/s due to the inability of implementing a high-speed direct feedback DFE, while scaling to the 45-nm process allows the use of DFE to achieve operation up to 18 Gb/s, as discussed previously in Section III-C.
The low-loss B1 channel does not require significant equalization complexity until about 18 Gb/s. Interestingly, the optimal equalization architecture selected is one-tap TX FFE with CTLE up to 16 Gb/s in 90 nm and 18 Gb/s in 45 nm. Including the CTLE actually yields lower power than using only a one-tap TX FFE, i.e., no equalization, as the CTLE peak gain allows the transmit output swing to be scaled down significantly. The 90-nm design switches to a three-tap TX at 18 Gb/s due to the inefficiency of the CTLE at this high data rate, while the 45-nm design can still leverage a high-peak-gain CTLE at this data rate and does not require the three-tap TX FFE until 20 Gb/s. Excellent power efficiency is achieved with this low-loss channel, as sub-mW/Gb/s operation is possible for the transmitter and receiver circuitry, again neglecting clock generation, distribution, and recovery, in the 45-nm technology up to 18 Gb/s. Above 20 Gb/s, the channel could potentially achieve higher data rates with DFE. However, even the 45-nm technology cannot efficiently implement the direct-feedback architecture modeled in this paper. Thus, this link is technology limited, and could potentially benefit from scaling to a more advanced process node or from the use of an increased-complexity loop-unrolled DFE architecture [21].
Relative to published low-power links, comparable results are obtained from the modeling methodology, as shown in Fig. 15 . The 90-nm implementation on the T20 channel has similar 6.25-Gb/s performance as another 90-nm design [3] , minus clocking power, operating on a channel with a similar loss at 3.125 GHz. The 45-nm implementation on the B1 channel has similar 10-Gb/s performance as a 45-nm design [22] operating on a channel with a similar loss at 5 GHz, again neglecting the clocking power.
The modeling work of this paper, while being somewhat optimistic relative to these two fabricated examples, is useful in predicting the power efficiency trends versus channel loss and circuit complexity. Improved accuracy can be obtained with a specific process by iteration cycles with typical layout topologies. Also important is to consider the margin that must be built into the design to account for process variations, which ultimately leads to degradations in power efficiency.
VI. CONCLUSION
In conclusion, this paper presented a design flow for optimization of high-speed electrical I/O link power utilizing statistical link analysis methods and circuit power estimates. The use of statistical link analysis allows for the optimization of equalization parameters and estimation of link margins for given channel characteristics and data rate. Comprehensive transmitter and receiver circuit models, which utilize normalized transistor parameters extracted from preliminary SPICE simulations of the circuit topologies, are used to provide accurate power estimates over a wide link architecture design search space. The design methodology predicts the optimum equalization architecture, circuit style (CMOS versus CML), and transmit output swing for minimum I/O power. Analysis shows that low-loss channel characteristics and minimal circuit complexity, together with scaling of transmitter output swing, allow excellent power efficiency at high data rates.
