This paper presents a transceiver for fast and energy-efficient global on-chip communication, consisting of a nonlinear charge-injecting (CI) 3-tap transmit filter (TX) and a sampling receiver (RX) with a transimpedance pre-amplifier (TIA). Recently, pre-emphasis techniques [1-4] have demonstrated significantly better energy efficiency than repeater interconnects. To further improve energy efficiency over pre-emphasis techniques that require analog subtraction [3, 5], our TX selects a pattern-dependent current to inject into the wire, performing feed-forward equalization (FFE) while mitigating the nonlinearity of the driver. This 3-tap CI FFE enables stronger equalization than the capacitively driven TX [1, 2] or edge-detection pre-emphasis [4], achieving a data rate of 4Gb/s over a 1cm on-chip wire. The TIA at the RX improves bandwidth and signal amplitude and reduces bias power, breaking the trade-offs in conventional resistor termination [6], and mitigates the equalized-signal degradation caused by impedance changes in dynamic current sensing [7].
Figure 3.6.1 depicts the overall link architecture with waveforms for a single bit-flip pattern. Current sources at the TX provide a small bias current for the TIA through the signaling wires. The TX digitally computes and injects the FFE current (IT+ and IT-) into the wires. At the RX, the received current (IR+ and IR-) is converted into a voltage (VS+ and VS-) by the TIA, and the slicers and DFE decide the received bits.
The principle of CI FFE is shown in Fig. 3 [...]. To improve energy efficiency, the TX is designed in a latch-based double-data-rate (DDR) style. The TX has a weak driver designed for accuracy and a strong driver designed for efficiency. Since I0 is much smaller (~30µA) and more sensitive than the other current values, a differential current switch is used as the weak driver for accuracy. The strong driver injects large currents (I1, I2, and I1+I2) on data transitions by driving the eight 5b PMOS or NMOS digital-to-analog converters (DACs) (P1, P2, N1, N2 for each terminal) rail-to-rail for the most efficient charge delivery. Since the DAC transistors change their region of operation as the output voltage swings from 0.5 to 0.9V (see Fig. 3.6.1), the DAC current and output impedance fluctuate. However, this nonlinearity is compensated by statically tuning each DAC segment (P1, P2, N1, N2). [...] changes from '11' to '01' (on a transition by D0) by conducting -I2 current via the P2- and N2+ segments. At the same time, the drain voltage of the P2- segment also changes from the middle level to the high level. Since each state transition has a unique voltage-change profile and injected current value, assigning a DAC code for each transition compensates the nonlinear behavior. Except for the transitions between '01' and '10', coefficients are statically assigned, since a single designated DAC segment drives each transition. At a small accuracy cost, the transitions between '01' and '10' are made by turning on all strong-driver DACs and the weak driver, since the current errors of P1/N1 and P2/N2 when all segments are turned on, relative to the individual transitions, cancel each other. A decoding block is implemented with a small number of simple gates to control the DACs.
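The idea of selecting a pattern-dependent current can be illustrated with a small behavioral sketch. It is not the authors' circuit or decoding logic: the tap values (in µA), the 3-bit history encoding, and the function names are hypothetical, chosen only so that the steady-state current equals the ~30µA quoted for I0. The sketch shows that a linear 3-tap FFE is equivalent to a lookup table with one entry per bit pattern, and that a table lets each entry be trimmed individually, which a linear filter cannot do.

```python
from itertools import product

# Hypothetical tap weights in microamps (main, first post, second post).
# Their sum is the steady-state current for a long run of identical bits.
TAPS_UA = (130, -70, -30)

def ffe_linear(bits, taps):
    """Linear FIR equalizer: current[n] = sum_k taps[k] * symbol[n-k]."""
    sym = [1 if b else -1 for b in bits]
    # Assume the wire idled at the first symbol before the sequence began.
    hist = [sym[0]] * (len(taps) - 1) + sym
    out = []
    for n in range(len(sym)):
        window = hist[n:n + len(taps)][::-1]  # newest symbol first
        out.append(sum(c * s for c, s in zip(taps, window)))
    return out

def build_lut(taps):
    """One current value per bit pattern, keyed as (newest, ..., oldest)."""
    lut = {}
    for pattern in product((0, 1), repeat=len(taps)):
        lut[pattern] = sum(c * (1 if b else -1) for c, b in zip(taps, pattern))
    return lut

def ffe_lut(bits, lut, n_taps=3):
    """Pattern-dependent selection: look up the current for each bit history."""
    hist = [bits[0]] * (n_taps - 1) + list(bits)
    return [lut[tuple(hist[n:n + n_taps][::-1])] for n in range(len(bits))]

lut = build_lut(TAPS_UA)

# Per-pattern trim (hypothetical +4 uA on one falling-edge pattern) that a
# purely linear filter could not apply without affecting other patterns.
trimmed = dict(lut)
trimmed[(0, 1, 1)] += 4
```

With an untrimmed table, `ffe_lut` and `ffe_linear` produce identical currents; after the trim, only symbols whose history matches the adjusted pattern differ.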
A TIA pre-amplifier is used in the RX to combine the benefits of voltage-mode (VM) and current-mode (CM) signaling. In VM, as the termination impedance decreases, the received voltage decreases and the static current increases, but the bandwidth also increases [6], while in CM, both the signal current and the bandwidth increase. The TIA termination at the RX provides a small input impedance without a large static current. The TIA has an 860Ω input impedance for high bandwidth (~350MHz) and high current amplitude (~7.5µA) at 2GHz, and the received current is converted into a voltage VTIA (~19mV) by a 2.7kΩ TIA gain in simulation. The RX with TIA roughly doubles the received voltage at half the static current compared to VM signaling with direct 860Ω termination. After the TIA, the received voltage is fed into a 1-tap loop-unrolled DDR DFE [8]. The DFE is based on threshold-controllable Strong-Arm slicers and latches to minimize power and area cost.
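As a rough sanity check on the quoted TIA numbers, the current-to-voltage conversion can be worked out directly. This is a back-of-the-envelope sketch, assuming an ideal transimpedance (output voltage equals input current times gain) and treating the approximate figures above as exact; the function and constant names are illustrative.

```python
# Approximate figures quoted in the text (simulation values).
I_RX_A = 7.5e-6        # received current amplitude at 2GHz, ~7.5 uA
TIA_GAIN_OHM = 2.7e3   # transimpedance gain, 2.7 kOhm

def tia_output_voltage(i_rx_a, gain_ohm):
    """Ideal transimpedance conversion: V = I * R."""
    return i_rx_a * gain_ohm

v_ideal_v = tia_output_voltage(I_RX_A, TIA_GAIN_OHM)
v_ideal_mv = v_ideal_v * 1e3  # about 20.25 mV
```

The ideal product is about 20mV, in line with the ~19mV simulated value once the rounding of the quoted current and gain is taken into account.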
A test link over 1cm on-chip differential M8 wires (0.6µm wide, 0.4µm spaced) is implemented in a 90nm 1.2V CMOS process. By sweeping the threshold voltage and clock delay of the RX slicers, the BER and signal probability distributions are measured in situ [8] at room temperature. Figure 3.6.5 shows the in situ measured eye diagram at 4Gb/s with a 98mV vertical and 50% UI horizontal eye opening. A BER test showed no errors within the test-setup-limited window of ~10^6 bits. The large eye size indicates good supply-noise immunity. Without equalization, the eye is completely closed. Only 3b of the available 5b DAC resolution are used when equalized, implying a potential reduction in tuning cost. Although the DFE is implemented in the receiver, it is not used at 4Gb/s since the FFE provides sufficient equalization. Figure 3.6.6 shows the energy breakdown and a comparison with the most recent related work [1]. The energy breakdown is extracted by multiplying the measured current by the energy breakdown ratio from post-layout simulation. The total transceiver energy cost is 356fJ/b. The pre-driver energy of 85fJ/b could have been smaller (~30fJ/b), since the pre-driver was oversized to handle higher data rates. Since the DFE is not used, the DFE overhead of 30fJ/b is not included. The test-chip results show that this link design can operate about 2× faster at similar pitch for ~30% higher energy cost compared to [1]. The measured eye is larger than the eye at 1.75Gb/s reported in [1], increasing link robustness.
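The reported figures can be cross-checked with simple arithmetic. The sketch below only restates numbers from the text; the "resized pre-driver" total is an assumption that swapping the 85fJ/b pre-driver for the ~30fJ/b estimate leaves every other component unchanged, and the variable names are illustrative.

```python
# Figures quoted in the text.
DATA_RATE_BPS = 4e9            # measured data rate
REF_RATE_BPS = 1.75e9          # data rate reported in [1]
E_TOTAL_FJ = 356.0             # total transceiver energy per bit
E_PREDRIVER_FJ = 85.0          # oversized pre-driver energy per bit
E_PREDRIVER_RESIZED_FJ = 30.0  # estimated pre-driver energy if resized
E_DFE_FJ = 30.0                # DFE overhead, excluded from the total

def link_power_mw(energy_fj_per_bit, rate_bps):
    """Average power in mW from energy per bit (fJ) and data rate (b/s)."""
    return energy_fj_per_bit * 1e-15 * rate_bps * 1e3

power_mw = link_power_mw(E_TOTAL_FJ, DATA_RATE_BPS)                # ~1.42 mW
# Assumption: only the pre-driver term changes when it is resized.
e_resized_fj = E_TOTAL_FJ - E_PREDRIVER_FJ + E_PREDRIVER_RESIZED_FJ
e_with_dfe_fj = E_TOTAL_FJ + E_DFE_FJ   # total if the DFE were enabled
speedup = DATA_RATE_BPS / REF_RATE_BPS  # ~2.3x, "about 2x" in the text
ui_ps = 1e12 / DATA_RATE_BPS            # unit interval at 4Gb/s
eye_width_ps = 0.5 * ui_ps              # 50% UI horizontal opening
```

Under these assumptions the link dissipates roughly 1.4mW at 4Gb/s, a resized pre-driver would bring the total to about 301fJ/b, and the 50% UI horizontal opening corresponds to 125ps.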
