Inter-chip signaling latency and bandwidth can be key factors limiting the performance of large VLSI systems. We present a high-performance, transmission-line signaling scheme for point-to-point communications between VLSI components. In particular, we detail circuitry which allows a pad driver to sense the voltage level on the attached pad during signaling and adjust the drive impedance to match the external transmission line impedance. This allows clean, reflection-free signaling despite the wide range of variations common in IC device processing and interconnect fabrication. Further, we show how similar techniques can be used to adjust the arrival time of signals to allow high signaling bandwidth despite variations in interconnect delays. The scheme employed for high-performance signaling is a specific embodiment of a more general technique. Conventional electronic systems must accommodate a range of system characteristics (e.g. delay, voltage, impedance). As a result, circuit designers traditionally build large operating margins into their circuits to guarantee proper operation across all possible ranges of these characteristics. These margins are generally added at the expense of performance. The alternative scheme exemplified here is to sample these system characteristics in the device's final operating environment and use this feedback to tune system operation around the observed characteristics. This tuning operation reduces the range of characteristics the circuitry must accommodate.
Introduction
In this paper, we address the issue of high-performance, point-to-point transmission-line signaling. Our objective is to achieve low transmission latency and high signaling bandwidth with a design which is economical in real-estate and power consumption while remaining compatible with commodity IC technology.

For large VLSI systems, inter-chip signaling can account for a significant fraction of the operational cycle time. The delay between a pair of ICs can be decomposed into three components:

1. output delay: delay through the output pad to drive the large, external capacitance associated with any component pad
2. signal propagation delay: the time required for a signal to propagate across the interconnect media from the source to the destination
3. input delay: delay through the input pad while the signal is being sensed and level-restored for internal IC consumption

In this paper, we specifically address the issue of minimizing signal propagation delay across transmission line interconnect. Note that interconnect is best modeled as a transmission line whenever the propagation delay across the interconnect is comparable to or greater than the rise time or fall time of the signal. That is:

$$t_{pd} \gtrsim t_r \tag{1}$$

The transmission line propagation delay ($t_{pd}$) is determined by the materials in use and the physical interconnect length ($l$):

$$v = \frac{1}{\sqrt{\epsilon\mu}} = \frac{c}{\sqrt{\epsilon_r \mu_r}} \tag{2}$$

$$t_{pd} = \frac{l}{v} \tag{3}$$

For most available interconnect technologies $\mu_r = 1$ and $\epsilon_r$ lies between 2 and 5. With a relative dielectric constant ($\epsilon_r$) of four ($v_{prop} = c/2 = 15$ cm/ns), which is common among PCB technologies, and fast edge rates ($t_r$ on the order of 0.5 ns), traces over a few centimeters begin to exhibit transmission line effects. For long interconnects (tens of centimeters), propagation delay becomes the dominant fraction of inter-chip communication delay.
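As a quick check on Equations 1 to 3, the following minimal sketch computes the propagation velocity and delay for a trace and applies the transmission-line criterion. The function names and the choice to read "comparable to or greater than" as a simple greater-or-equal threshold are illustrative assumptions, not part of the original design.

```python
import math

C_CM_PER_NS = 29.9792458  # speed of light in vacuum, in cm/ns


def propagation_velocity(eps_r, mu_r=1.0):
    """Equation 2: v = c / sqrt(eps_r * mu_r), returned in cm/ns."""
    return C_CM_PER_NS / math.sqrt(eps_r * mu_r)


def propagation_delay(length_cm, eps_r, mu_r=1.0):
    """Equation 3: t_pd = l / v, returned in ns."""
    return length_cm / propagation_velocity(eps_r, mu_r)


def is_transmission_line(length_cm, rise_time_ns, eps_r, mu_r=1.0):
    """Equation 1, read here as t_pd >= t_r (an assumed hard threshold)."""
    return propagation_delay(length_cm, eps_r, mu_r) >= rise_time_ns


if __name__ == "__main__":
    print(f"v at eps_r = 4: {propagation_velocity(4.0):.2f} cm/ns")
    for length in (1.0, 7.5, 30.0):
        t_pd = propagation_delay(length, 4.0)
        flag = is_transmission_line(length, 0.5, 4.0)
        print(f"{length:5.1f} cm: t_pd = {t_pd:.3f} ns, transmission line: {flag}")
```

With $\epsilon_r = 4$ and a 0.5 ns edge, the threshold falls near 7.5 cm under this reading, consistent with the observation that traces over a few centimeters begin to show transmission line effects.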
The minimum transmission line propagation delay, however, is only achieved when the transmission line is properly terminated so that no reflections occur and the line settles to the desired voltage in one propagation time. Process variation in IC and PCB fabrication makes the termination matching difficult to achieve. Termination integrated on-chip is desirable to avoid the area and cost associated with external termination devices, but on-chip resistances vary widely with IC processing, operating voltage, and temperature. Further, the characteristics of the interconnect media itself can vary. To avoid these problems, we develop a technique which allows an output driver to examine the voltage on the line during signaling and servo the impedance to match the attached interconnect. This technique requires no external termination devices to achieve clean impedance matching and allows a single IC to match cleanly across a wide range of interconnect impedances.

This paper also addresses the issue of increasing signaling bandwidth on our transmission line interconnect. In this domain, two key properties limit signaling bandwidth:
- signal deformation: limited rise and fall times along with dispersive effects in circuits and interconnect spread out data bits.
- skew and delay variation: uncertainty in the propagation delay through components or interconnect must be accommodated by spacing the data bits to cover the entire range of possible transition timings.

To achieve reliable, high-bandwidth signaling over transmission line interconnect, our techniques sense the delays actually seen on a fabricated component in the system. On-chip delay is then adjusted to make the total propagation delay between ICs conform to tighter timing constraints. Together, these techniques allow us to remove much of the uncertainty and variation associated with signal transmission and, consequently, significantly reduce the inter-bit separation necessary for reliable inter-chip signaling. As with the impedance matching, this technique also allows a system to tolerate a larger variance in interconnect characteristics while still achieving reliable, high-bandwidth operation.

Most CMOS circuit designers are familiar with the practice of designing circuitry to compensate for the wide variations associated with silicon processing. These adjustable termination and timing techniques take the strategy one step further, allowing the component to compensate for variations in its external environment. The key theme here is to measure system characteristics and then employ tuning circuitry to bring the characteristics into a tight and favorable operating range.

Bringing these techniques together, we present a design for adaptable I/O pads engineered for high-performance, point-to-point transmission-line signaling. Section 2 presents an overview of the signaling strategy. We show a low-voltage swing, matched-impedance output pad for driving series-terminated transmission lines along with a complementary receiver in Section 3.
We introduce a technique in Section 4 for capturing dynamic timing information in response to signaling events. Section 5 details how the series impedance is adjusted in-system so the source drive impedance matches the impedance of the attached transmission line. Section 6 reviews techniques for on-chip delay adjustment and Section 7 shows how the timing-extraction and delay adjustment techniques facilitate the retiming of inter-chip communications. Section 8 describes how these techniques impact the testing of high-performance interconnect. In Section 9, we present highlights of a prototype I/O pad which incorporates many of the techniques described in this paper. In Section 10 we highlight limitations of the techniques presented here before concluding in Section 11.
Signaling Strategy
To meet the needs of point-to-point signaling with high speed and acceptable power, we utilize a series-terminated, low-voltage swing signaling scheme which uses on-chip termination and employs feedback to match termination and transmission line impedances. For the purposes of the following discussion, we focus on a CMOS integrated circuit technology.
Low-voltage swing signaling is motivated by the desire to drive the resistive transmission line load with acceptable power dissipation. We see in Equation 4 that power is quadratic in signaling voltage. In the designs which follow, we specifically consider signaling between zero and one volt. Limiting the voltage swings to one volt saves a factor of 25 in power over traditional five-volt signaling (i.e. with a 50 Ω transmission line, $P_{drive} = 250$ mW with five-volt signal swings and $P_{drive} = 10$ mW with one-volt swings).
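The quoted power figures can be reproduced with a short calculation. The sketch below assumes the drive power takes the form $P_{drive} = V^2 / (2 Z_0)$, i.e. the signaling voltage applied across the $2 Z_0$ load of a series-terminated line; this form is inferred from the numbers in the text and may differ in detail from Equation 4. The function name is illustrative.

```python
def drive_power_watts(v_swing, z0_ohms):
    """Power drawn while driving a series-terminated line.

    Assumes the driver sees R = 2 * Z0 (termination in series with the
    line) and that P = V**2 / R; both are inferred from the quoted figures.
    """
    load_ohms = 2.0 * z0_ohms
    return v_swing ** 2 / load_ohms


if __name__ == "__main__":
    for volts in (5.0, 1.0):
        milliwatts = 1e3 * drive_power_watts(volts, 50.0)
        print(f"{volts:.0f} V swing into a 50 ohm line: {milliwatts:.0f} mW")
```

Because the power is quadratic in the swing, the ratio between the five-volt and one-volt cases is 25 regardless of the line impedance.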
Recall that a properly series-terminated transmission line will present a load of $R = 2Z_0$ to the driver since the transmission line impedance ($Z_0$) occurs in series with the termination resistance.

To achieve one-volt signaling, we provide components with a one-volt power supply for the purpose of signaling. This frees the individual components from converting between the logic supply voltage and the signaling voltage level. Any power which must be consumed in the process of generating the one-volt supply is dissipated in the power supply, and not in the individual ICs.

Series termination offers several advantages over parallel termination for point-to-point signaling. First, we can integrate the termination impedance into the driver. In the parallel-terminated case, we need to drive the voltage on the transmission line close to the signaling supply rail. The effective resistance across the driver between the supply rails and the driven transmission line must be small compared to the transmission line impedance, $Z_0$, in order to drive the transmission line voltage close to the signaling supply (See Figure 1). In a CMOS implementation, this means that the transistors implementing the final driver must have a large $W/L$ ratio to make the resistance small. As a consequence, the final driver is large and, therefore, has considerable associated capacitance, $C_{driver}$. The larger capacitance, of course, increases the delay required to buffer internal logic signals to drive the final pad output stage. It also means that the charging power, $P_{charge}$ (Equation 5), will be large. In contrast, the series-terminated driver can use a higher-impedance driver. The higher impedance of the series-terminated driver allows it to have a smaller $W/L$ ratio and hence smaller $C_{driver}$, resulting in a lower output delay and requiring less power to drive the output.
Additionally, the series-terminated configuration gives us the opportunity to use voltage feedback to adjust the on-chip, series termination to match the transmission line impedance. We expect both the transmission line impedance and the conductance of the drive transistors to vary due to process variations. By monitoring the stable line voltage during the round-trip transit time between the initial transition at the source end of the transmission line and the arrival of the reflection, a controller can identify whether the driver termination is high, low, or matched to the transmission line impedance. With a properly terminated series transmission line, we expect the voltage to settle half-way between ground and the signaling supply during the first round-trip transit time. If the voltage settles much above the half-way point, the drive impedance is too low. If the voltage settles much below the half-way point, the driver impedance is too high (See Figure 2). By monitoring the voltage at the pad, the system can adjust the drive impedance until it matches the line impedance. This allows the integrated circuit to compensate for process variation in both the silicon processing and PCB manufacture.
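The half-way criterion follows from the voltage divider formed by the drive impedance and the line impedance before any reflection returns. The sketch below illustrates it under the assumptions of an ideal resistive driver and a lossless line; the function names and the 5% tolerance band are illustrative choices, not values from the design.

```python
def initial_pad_voltage(v_supply, r_drive, z0):
    """Pad voltage during the first round trip: a divider of r_drive and Z0."""
    return v_supply * z0 / (r_drive + z0)


def classify_termination(v_pad, v_supply, tolerance=0.05):
    """Compare the settled pad voltage with the midpoint.

    `tolerance` is a hypothetical dead band, as a fraction of the supply.
    """
    midpoint = v_supply / 2.0
    band = tolerance * v_supply
    if v_pad > midpoint + band:
        return "drive impedance too low"
    if v_pad < midpoint - band:
        return "drive impedance too high"
    return "matched"


if __name__ == "__main__":
    for r_drive in (25.0, 50.0, 100.0):
        v_pad = initial_pad_voltage(1.0, r_drive, 50.0)
        print(f"R = {r_drive:5.1f} ohm: pad at {v_pad:.3f} V, "
              f"{classify_termination(v_pad, 1.0)}")
```

In the pad itself the comparison is made by the receiver, which trips at the midpoint, rather than by measuring an analog voltage.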
Signaling Circuitry
In this section we present driver and receiver circuitry which facilitate low voltage signaling and impedance adjustment.
Driver
To control the output pad impedance, the output driver is connected to the high and low signaling supplies through adjustable impedance networks. As shown in Figure 3, a set of exponentially sized drive transistors form the adjustable impedance network. The impedance control drivers can be enabled via digital control lines from a scan-loaded control register and serve as a D-to-A network for the pad drive resistance. Gabara and Knauer suggest a similar scheme which places only the set of exponentially sized transistors between the signaling supplies and the output pad [6]. An AND gate preceding the D-to-A network serves to combine the logical drive value with the impedance selection.
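To make the D-to-A behavior concrete, the following sketch models the network as parallel transistors whose conductances double with each control bit. It is an idealized model: the six-bit width is assumed from the 0x00 to 0x3F settings reported for the test chip, `r_unit_ohms` (the on-resistance of the smallest transistor) is a hypothetical parameter, and real transistors are not linear resistors.

```python
import math


def drive_resistance(code, r_unit_ohms, bits=6):
    """Resistance of a binary-weighted parallel network.

    Bit i enables a transistor assumed to be 2**i times wider than the
    smallest one, so its conductance is 2**i / r_unit_ohms. Parallel
    conductances add, giving a total conductance of code / r_unit_ohms.
    """
    if not 0 <= code < (1 << bits):
        raise ValueError("impedance code out of range")
    if code == 0:
        return math.inf  # all transistors disabled: highest impedance
    return r_unit_ohms / code


def enabled_bits(code, drive_value, bits=6):
    """Per-transistor gate enables: the AND of each select bit with the data."""
    return [bool((code >> i) & 1) and bool(drive_value) for i in range(bits)]


if __name__ == "__main__":
    for code in (0x01, 0x20, 0x3F):
        print(f"code 0x{code:02X}: {drive_resistance(code, 3150.0):.1f} ohm")
```

In this model the resistance falls monotonically as the code increases, so 0x3F gives the lowest impedance and 0x00 the highest.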
Receiver
The receiver must convert the low-voltage swing input signal to a full-swing logic signal for use inside the component. In the interest of high-speed switching, we want a receiver which has high gain for small signal deviations around the mid-point between the signaling supplies. [2] and [9] introduce suitable differential receivers. Figure 4 shows one such receiver.

This section introduces the sample register, the key circuitry for timing extraction. A sample register is a string of latches enabled at closely spaced time intervals. Each latch "samples" the binary value of the signal under test during the time it is enabled. By rippling the latch enables in rapid succession, the sample register captures a discrete representation of the time behavior of a signal. In this section, we develop sample register circuits, starting from an "ideal" model and progressively refining the circuitry into practical implementations.

Alternately, one could consider delaying the target input signal as seen by each latch and using a single, register-wide enable rather than delaying the enables. We focus our discussion around a delayed enable since we generally have more control over the timing of the enable pulses we generate than we do over a random signal whose timing we wish to capture.

The sample register is compatible with scan-based Test-Access Ports (TAPs), such as the JTAG/IEEE 1149 standard [3] TAP. The TAP can be used to initiate events which the sample register captures and to offload the data captured in the sample register.

An ideal sample register is composed of a sequence of latches, each enabled at fixed delays. When a timing event occurs, a short enable pulse is driven into the sample register. Each sample latch records the value seen by the receiver when it was last enabled. After the enable pulse propagates through the sample register, the sample register holds a discrete-time sample of the target input.
Ideal
An ideal sample register would consist of an infinitely long string of latches, each enabled at uniform time intervals (See Figure 5). When the enable pulse fires at the beginning of the delay chain, the sample register captures a discrete-time representation of the attached input signal. We seek to approximate the ideal behavior with a reasonably small, finite-length sample register and employ simple circuitry to generate the sequence of enables necessary to capture timing samples.
Inverter Timing Chain
For many applications, a pair of inverters will suffice to form the inter-sample-bit delay (See Figure 6). The delay through a single inverter is often the finest granularity of timing used in a design. With a little care during layout, the geometry and loading of each inverter in the delay chain can be made identical. Consequently, the only variation between delays will be due to process variation across the die. Since a single sample register will typically occupy only a small region of the die, the processing variation among inverters in a sample register will be minimal.
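A behavioral model may help fix the idea of a finite sample register. The sketch below treats each latch as an instantaneous sample taken one inverter-pair delay after its neighbor; real latches are transparent for a short interval and the delays are not perfectly uniform. All names, and the use of integer picoseconds, are illustrative assumptions.

```python
def sample_register(signal, n_bits, tap_delay_ps, t_start_ps=0):
    """Return the bits latched by an idealized n-bit sample register.

    `signal` maps a time in picoseconds to a truthy or falsy logic value.
    Latch k is modeled as sampling at t_start_ps + k * tap_delay_ps.
    """
    return [1 if signal(t_start_ps + k * tap_delay_ps) else 0
            for k in range(n_bits)]


def step_at(t_edge_ps):
    """A low-to-high transition at t_edge_ps, as seen by the receiver."""
    return lambda t_ps: t_ps >= t_edge_ps


if __name__ == "__main__":
    # A 200 ps tap delay stands in for a pair of 100 ps inverters.
    print(sample_register(step_at(450), n_bits=8, tap_delay_ps=200))
```

The position of the first 1 in the result locates the transition to within one tap delay, which is the resolution limit discussed in the following sections.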
Sliding Window
If we can cause the event we are timing to occur under scan control, we can trade off time for space. We do not have to capture the entire waveform in a single event. We can reuse a small sample register in time to capture a long waveform. The temporal placement of the sample register can be controlled via the TAP, and the composite waveform can be reconstructed off chip. Figure 7 shows the basic sliding window concept. Figure 8 shows one possible implementation for the sliding window. The sample pulse is recycled after rippling through several sample delays. A scan-loaded configuration is compared against a trip count to allow the sample pulse to be recycled for a predetermined number of times. After the enable pulse settles following a trigger event, the sample register will contain the values corresponding to the last time the pulse was allowed to ripple through the delay chain.

Note that we depict the ripple pulse recycling before the end of the sample register. It is unlikely that the delay on the recycle path can be accurately matched to the inter-sample time defined by the inverters. By recycling the ripple from a point prior to the end of the sample register we can do two things: (1) provide overlap between sample windows and (2) make sure that there is always an inverter-pair delay between adjacent samples used to reconstruct the longer waveform. As we will see in the following section, with sufficient overlap, calibration can help us factor out any delay anomalies associated with the recycle path.

The choice of how many bits to include in the sample register and recycle path will depend on the relative speed of operation of various logic functions in the target technology. In particular, the operational frequency of the counter-comparator combination will set a lower limit on the delay between successive ripple pulses.
For example, a technology with 100 ps minimum inverter delays and a maximum counter operational frequency of 500 MHz

A simple sample register can be implemented using a pair of inverters to provide the fixed, inter-sample-bit delays.

Using a small, fixed-size sample register, we can capture a portion of the discrete-time waveform for a signal with each timing event. If we vary the placement of the capture window and repeatedly fire the timing event, we can capture the entire waveform over a series of such samples.

We can implement the sliding window in our sample register by recycling the sample-enable pulse a configurable number of times. The values left in the sample register after a timing event will correspond to the data captured during the last cycle made by the enable pulse.

With slightly lower accuracy, it is possible to use only a single latch in the sample register. As shown in Figure 9, a mux can be used to select the fine timing delays while the counter-comparator combination selects the coarse-grain timing window. The delay between sample bits in this scheme will not generally be as accurate as the earlier circuits, but may be sufficient for many applications.
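The sliding-window capture and the off-chip reconstruction described above can be sketched as follows. The model is idealized: it assumes the recycle path advances the window by exactly `stride` sample delays per trip, whereas the text notes that the real recycle delay must be calibrated out. The function names and parameters are hypothetical.

```python
def capture_window(signal, trip_count, n_bits, stride, tap_ps):
    """Bits left in the register when the pulse is recycled trip_count times.

    Assumes each recycle advances the window by exactly stride * tap_ps.
    """
    start_ps = trip_count * stride * tap_ps
    return [1 if signal(start_ps + k * tap_ps) else 0 for k in range(n_bits)]


def stitch(windows, stride):
    """Merge successive windows, checking that overlapping samples agree."""
    merged = list(windows[0])
    for window in windows[1:]:
        overlap = len(window) - stride
        if overlap <= 0:
            raise ValueError("windows must overlap: need stride < window length")
        if merged[-overlap:] != list(window[:overlap]):
            raise ValueError("overlap mismatch: event not repeatable or recycle delay off")
        merged.extend(window[overlap:])
    return merged


if __name__ == "__main__":
    edge = lambda t_ps: t_ps >= 1500  # repeatable low-to-high event
    windows = [capture_window(edge, trip, n_bits=8, stride=6, tap_ps=200)
               for trip in range(3)]
    print(stitch(windows, stride=6))
```

Each window here is captured from a separate firing of the same event, so the reconstruction is only meaningful when the event repeats identically under scan control.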
Calibration and Sharing
With the circuits shown so far, we only know when the signal is occurring in units of inverter-pair delays. Since process and environmental variation can easily account for a factor of two variance in an inverter delay, inverter-pair delays alone are not sufficient to extract fine-grained timing information. If a known timing source, such as the component clock, is available we can mux the sample input between the sample register and the known timing source to calibrate the inter-sample-bit delay time (See Figure 10). A known-frequency clock will allow us to determine the timing of events on the sample register and reassemble the overlapped, sliding windows appropriately (See Appendix A).

The mux can also be used to share a single sample register between several target signals. This muxing can be used simply to minimize the need for sample registers. It can also serve to acquire accurate, relative timing information for groups of related signals. For example, one might share a single sample register among the bits of an 8-bit data bus and its data strobe. This arrangement would provide accurate timing information on the relative occurrence of data bit transitions to each other, as well as an indication of when data bit transitions occur in relation to the data strobe.
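One simple way to turn a sampled reference clock into a delay estimate is to count the samples between successive rising edges. The sketch below does this; it is an illustrative estimator under the assumption of a clean clock captured over at least two rising edges, and is not necessarily the procedure given in Appendix A. The names are hypothetical.

```python
def rising_edges(bits):
    """Indices at which the sampled waveform goes from 0 to 1."""
    return [i for i in range(1, len(bits)) if bits[i - 1] == 0 and bits[i] == 1]


def estimate_tap_delay(bits, clock_period):
    """Estimate the inter-sample delay from a sampled known-period clock.

    Averages the sample count per clock period over all captured periods;
    the result has the same time unit as clock_period.
    """
    edges = rising_edges(bits)
    if len(edges) < 2:
        raise ValueError("need at least two rising edges to calibrate")
    samples_per_period = (edges[-1] - edges[0]) / (len(edges) - 1)
    return clock_period / samples_per_period


if __name__ == "__main__":
    # Synthetic capture: 2000 ps clock sampled every 250 ps (true tap delay).
    bits = [1 if (k * 250) % 2000 < 1000 else 0 for k in range(32)]
    print(estimate_tap_delay(bits, clock_period=2000), "ps per sample")
```

Averaging over several clock periods reduces the one-sample quantization error of any single edge-to-edge measurement.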
Tighter Timing
In applications where timing accuracy tighter than a pair of inverter delays is required, a fine-resolution, variable-delay buffer can be placed at the front-end of the delay path (See Figure 11). Some variable-delay buffers developed for CMOS Phase-Locked Loop (PLL) circuitry are suitable for this application. Figure 12 shows a voltage-controlled, variable-delay buffer which operates by varying the capacitive load seen by each stage.
Horowitz [7] details a phase interpolator which smoothly varies the phase in 15 steps between two references under digital control. Such a phase interpolator can be applied to an inverter-pair delay to provide a resolution of roughly one-eighth of an inverter delay.
Enable Pulse
To acquire a "sample", we must be able to initiate both the event under test and the sample-enable pulse. When we are using a scan-based TAP for offloading the sample registers, it will be most convenient to initiate these events under scan control. In practice, we generally want the enable pulse to fire synchronized to the event we wish to observe on the IC. Scan control is used to prime the enable pulse to trigger on the next synchronization event and to initiate the event for observation. For example, when the component is in a scan testing mode, the standard scan register load facilities can be used to cause signal transitions within the IC or at the IC boundary. Once fired, circuitry inhibits the enable pulse from firing again until we have had a chance to offload the sample register and configure it to capture the next timing event.
Summary
Combining these techniques, we can repeatedly fire an event we wish to time and sample its behavior in narrow, fixed-size windows. By integrating the information acquired across multiple samples at varying window offsets and calibrating to known frequency and phase sources, we can build up an accurate, discrete-time representation of a signal on the IC. We can easily achieve timing resolutions down to two inverter delays and, with care, can achieve even tighter resolutions. Data capture and acquisition can be completely controlled through a scan-based TAP.
Impedance Matching
Each controlled impedance pad is constructed from:

1. Driver (Figure 3)

With slightly less accuracy, a fine-grained, variable-delay buffer can be employed, reducing the number of sample latches required to one.

We can achieve higher resolution by adding a fine-grained, variable-delay buffer in the enable path preceding the inverter chain. This allows us to vary the timing of the sample taken by sub-inverter-pair quanta.

The receiver serves the dual purpose of (1) bringing the signal onto the chip when the pad is acting as an input and (2) monitoring the source of the transmission line when calibrating driver impedance. The pad's sample register is enabled whenever the core logic toggles the value driven into the output pad. Following such transitions, the sample register records the value on the output pad as seen by the receiver. Since the receiver is biased to trip at the midpoint voltage, we can use the value recorded in the sample register to determine when the series-terminated transmission line is matched. When the drive impedance is too low, the driver will quickly drive the line past the half-way point and the receiver will capture the transition. If the impedance is too high, the driver will not drive the voltage past the half-way point and the receiver will not see a transition until subsequent reflections bring the voltage past the midpoint. By firing a series of test transitions and recovering the sampled result, an off-chip controller can select an impedance setting where the drive impedance is well matched to the transmission line impedance.

The impedance selection time is almost entirely dictated by the bandwidth to the off-chip controller. Impedance setting using a scan-based TAP can take half a millisecond for a single pin. For an entire chip with hundreds of pins, scan-based impedance setting can take on the order of 100 ms (See Appendix B). Figure 14 shows data collected from a test chip (See Section 9) while scanning through impedance settings.
Figure 15 shows both ends of a series-terminated transmission line after the driver has been automatically matched to the line impedance using the data collected in Figure 14. The sample register is important to this application for two key reasons:
1. The delays through the driver and receiver will depend on IC processing. By capturing a window of the signal, we can be sure to capture the transition regardless of processing.
2. When we cannot control the length of the attached transmission line, it is difficult to know when the transmission line voltage corresponds to the initial drive or reflections. Coupled with process variation, we cannot sam-

Figure 13: "Boundary Cell" contains the standard boundary scan registers for a bidirectional I/O pad. "Impedance Register" holds the digital value which controls the pad drive impedance. "Sample Register" is a sample register as described in Section 4.

Figure 14 (axes: Sample Bit (Time) versus Impedance Setting): The data shown accompanies a low-to-high transition of the output value. The dark areas indicate that the receiver saw a high value, while the light areas indicate a low value. Impedance setting 0x3F corresponds to all impedance transistors enabled, which is the lowest impedance setting, while 0x00 corresponds to all impedance transistors disabled, the highest setting.

The sample register allows us to see transitions occur. The transitions act as calibration marks, informing us when various events occur. In effect, we have built a poor Time-Domain Reflectometer (TDR) which we use to match the driver impedance to the line impedance. The discrete-time sample in the sample register is coarser than real TDRs, the rise time on the signals is much slower, and the length of the line monitored is limited by the window size captured by the sample register. For comparison, we connected one of the test pads to a small wire terminating in a short circuit and recorded the sample data for various impedance settings. Figure 16 compares the sample data to a real TDR waveform. The time range of the test pad is limited because we used a 16-bit sample register without the recycling technique (See Section 9). The sliding window (Section 4.3) allows us to extend the time range captured considerably for modest additional silicon real-estate. Section 9 summarizes the characteristics of the test pad used to acquire the data shown in Figures 14, 15, and 16. Section 10 elaborates on the limitations of this technique as well as the expected usage patterns.
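The text leaves the controller's selection rule open. One plausible rule, sketched below, is to scan every impedance code, note where the receiver first sees a high value, and pick the highest-impedance code that still trips on the initial drive rather than on a later reflection. The rule, the one-sample `slack`, and all names are assumptions for illustration; they are not taken from the prototype's controller.

```python
def first_high(bits):
    """Index of the first sample at which the receiver saw a high, or None."""
    for index, bit in enumerate(bits):
        if bit:
            return index
    return None


def select_impedance_code(scan, slack=1):
    """Pick a drive code from scan data for a low-to-high transition.

    `scan` maps impedance code -> sample register bits. Codes whose first
    high lies within `slack` samples of the earliest trip are taken to have
    tripped on the initial drive (impedance at or below the line's). The
    smallest such code is the highest impedance among them, so it sits at
    the boundary where drive and line impedance are closest.
    """
    trips = {code: first_high(bits) for code, bits in scan.items()}
    seen = [t for t in trips.values() if t is not None]
    if not seen:
        raise ValueError("receiver never tripped for any impedance code")
    earliest = min(seen)
    early_codes = [code for code, t in trips.items()
                   if t is not None and t <= earliest + slack]
    return min(early_codes)


if __name__ == "__main__":
    def fake_capture(code):
        # Synthetic data: strong codes trip at sample 2 or 3, weak ones
        # only when the reflection returns at sample 10, code 0 never.
        if code == 0:
            return [0] * 16
        trip = 2 if code >= 40 else 3 if code >= 20 else 10
        return [0] * trip + [1] * (16 - trip)

    scan = {code: fake_capture(code) for code in range(64)}
    print(hex(select_impedance_code(scan)))
```

A practical controller would likely repeat the measurement for high-to-low transitions as well, since the pull-up and pull-down networks are set independently.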
Delay Adjustment
Mechanism
We can use similar techniques to adjust the timing of key IC signals. Figure 17 shows a variable-delay buffer suitable for coarse-grain delay adjustment. Since this buffer uses inverter pairs as the basic, unit-delay element, it provides the same granularity of adjustment as most of the sample register designs presented in Section 4. For finer-grained delay adjustments, we can borrow variable-delay elements from PLL circuits such as the VCDL buffer (Figure 12) or Horowitz's phase interpolator mentioned in Section 4.5.
Comparison with PLLs
Phase-Locked Loops are commonly employed to match timing of on-chip clock signals to external references. In such cases, where the signal is periodic with fixed frequency, on-chip circuitry can close the feedback loop to adapt component timing to match system timing. However, PLL techniques cannot be applied to non-periodic control signals and data paths. Further, traditional PLLs cannot be used to guarantee the simultaneous arrival of the bits of a wide data bus.

Our TAP-based timing extraction and timing control can provide in-system adjustment of on-chip timing for these non-periodic signals. The TAP can be employed to force events to occur and capture their timing relationships. Through the TAP, we can adjust the delay controls to servo the on-chip delays until the proper timing relationships are achieved. The feedback loop using TAP-based timing extraction and control is, of course, much slower than the feedback loops in conventional PLLs and does not operate continuously. Coupled with generally coarser-grained timing information, this makes TAP-based timing control unsuitable for the fine-grained timing adjustment provided by competent PLLs in the same technology. The coarser control of TAP-based timing does bring many of the advantages of feedback control to non-periodic signals and signal groups.
In-system Tuning
On-chip delay adjustment allows us to tune the timing of events to the target system. Once a component is deployed in its final system, many of the variables which had to be considered during design are fixed and will remain effectively constant during operational epochs. Such variables include:

- IC processing of all ICs in the target system, including this one
- External interconnect characteristics (e.g. path length, line impedance, propagation delay, capacitive loading)
- Target system clock frequency

Other variables may vary during operational epochs, but do so relatively slowly (e.g. component temperature and operating voltage). If we can monitor changes in these parameters (e.g. with on-chip temperature sensors) we can often treat these parameters as constants, retuning whenever significant environmental changes make retuning necessary.

Once system delays are fixed, and can be measured using our timing extraction techniques, in-system component operation can be specialized to these system characteristics. By specializing component timings around system characteristics, we can achieve higher performance than is possible when our design must allow for all possible variations in system parameters. In-system delay adjustment effectively gives us most of the advantages of self-timed logic without incurring the complexity and testability problems associated with asynchronous logic.
Transmission Line Timing Adjustment
For high-bandwidth signaling over long transmission lines, we can pipeline multiple data bits on the transmission line. This wire pipelining requires:
1. We know how many clock cycles it takes to traverse each transmission line interconnect.
2. We guarantee that data transitions do not occur during the setup-to-hold time window of the receiving IC.
Computers such as the Cray-1 [10] and CM-5 [13] satisfy these criteria by carefully selecting the interconnect cable lengths and designing the basic system around the logical lengths of each interconnect. Using the techniques we have introduced here, it is possible to satisfy these two conditions by monitoring the transmission line reflections and adapting the output timing to the length of the connected transmission line.

To handle long transmission lines, we add a tunable delay and a sliding-window sample register with a muxed clock for calibration to the pad design described in Section 5. We can tune the output impedance to the transmission line as described previously. However, while scanning impedances, the longer effective sample window gives us an additional piece of information: the timing of the first reflection arrival. When the drive impedance is set slightly above the transition point, the receiver will trip when the first reflection arrives. Scanning through impedance settings thus tells us both when the source is driven and when the first reflection occurs. Assuming synchronous clock distribution and symmetry of transit times across the transmission line, this also allows us to determine when the signal arrives at the destination end of the transmission line. We can take this time and determine (1) how many clock cycles it requires to traverse this interconnect and (2) where during the clock cycle the signal is arriving at the far end of the transmission line. We can then tune the variable delay associated with the output signal so that the transition at the destination is guaranteed to occur outside of the setup-to-hold time window around the clock, after taking into account any necessary uncertainties associated with the delay through the input receiver.

By combining the sample register with series-terminated, transmission-line signaling, we can tune the arrival of a transition at the far end of a variable-length interconnect.
This kind of tuning is not possible with conventional PLL circuits. Figure 18 shows a suitable pad scan architecture including the tunable output delay.
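The arithmetic behind this retiming can be sketched in a few lines. The code below assumes that the drive and first-reflection times have already been extracted and calibrated, that both are measured from a clock edge shared by the two chips, and that the transit time is symmetric, so the far-end arrival lies midway between them. The function name, the `margin` parameter for receiver uncertainty, and the policy of adding the smallest delay that clears the forbidden window are illustrative assumptions.

```python
def plan_output_delay(t_drive, t_reflection, period, setup, hold, margin=0.0):
    """Return (extra_delay, tuned_arrival, capture_edge).

    All times share one unit and are measured from a clock edge at t = 0,
    with receiver clock edges at integer multiples of `period`.
    """
    # Symmetric transit: the far end sees the edge half a round trip later.
    arrival = t_drive + (t_reflection - t_drive) / 2.0
    phase = arrival % period

    earliest_ok = hold + margin             # just after a clock edge
    latest_ok = period - setup - margin     # just before the next edge
    if earliest_ok > latest_ok:
        raise ValueError("setup, hold and margin leave no safe window")

    if phase < earliest_ok:
        extra = earliest_ok - phase                # slide out of the hold region
    elif phase > latest_ok:
        extra = (period - phase) + earliest_ok     # slide past the next edge
    else:
        extra = 0.0

    tuned_arrival = arrival + extra
    capture_edge = int(tuned_arrival // period) + 1  # edge that latches the bit
    return extra, tuned_arrival, capture_edge


if __name__ == "__main__":
    # Times in ps: 2000 ps clock, 300 ps setup, 200 ps hold.
    for t_refl in (5300, 3900, 3300):
        print(t_refl, plan_output_delay(300, t_refl, 2000, 300, 200))
```

The returned capture edge is the cycle count the rest of the system must assume for this link, which is the first of the two wire-pipelining requirements listed above.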
Implications on Interconnect Testing
It is worthwhile to note that the techniques presented here allow us to test the dynamic properties of our interconnect media. Today, TAPs are commonly employed to test the DC characteristics of ICs and interconnect integrity. Standard TAP techniques, however, cannot identify interconnect faults which only affect high-speed signals. For example, conventional TAP interconnect testing cannot identify the impedance discontinuity arising from a poorly seated connector or a short to some piece of foreign material. With the pad architecture presented here, the TAP can capture the dynamic profile of the voltage waveform resulting from signaling events. This allows the recovery of TDR-like data for the attached interconnect. Consequently, these techniques allow us to extend our TAP testing to identify the interconnect faults which affect high-speed signals.
Implementation
We have implemented a prototype, matched impedance i/o pad which incorporates many of the techniques described here [5]. Table 19 summarizes the key characteristics of the prototype pad and Figure 20 shows the layout for the bidirectional i/o pad. This pad's scan architecture matches the one depicted in Figure 13. Note that the sample register occupies 350 µm of the test pad length. In application, one could construct a single sample-register 
Limitations
In this section we review some of the costs and limitations of these on-chip sensing and adjustment techniques. We briefly address the impact of these limitations on the pragmatic application of these techniques.
Point-to-Point Signaling As noted, the specific technique described here is primarily applicable to single driver, single receiver, series-terminated signaling applications. The tuning behavior depends on the reflection profile of the series-terminated transmission line for proper operation.
Single Impedance Media The techniques described here assume the interconnection media, while varying in impedance, is characterized by a single, homogeneous impedance between the source and the destination. This is typically the case if the interconnect is a printed-circuit board or cable between ICs. However, if the ICs are connected through multiple media this might not be the case. For instance, if two ICs are communicating over a long cable and each IC has a long wire run between the cable and the IC on its attached printed circuit board, the intervening interconnect could be characterized by three distinct impedance regions. The techniques presented here will allow one to identify the impedance discontinuities, but not to compensate for them. Of course, one could use the techniques presented here to build an impedance matching buffer component to place at each potential impedance discontinuity. The impedance matching buffer could then separately match to each interconnect segment. Such a scheme would, however, add i/o delay to the signaling path for each such impedance matching buffer encountered.

Figure 20: Prototype Matched Impedance I/O Pad
Area This technique does require dedicated, on-chip silicon area. As noted in the previous section, a 16-bit sample register occupied just under 350 µm × 150 µm [5]. The prototype sample register included the inverter chain, sample latch, shift register, and a 16-bit configuration register, but did not include any recirculation or calibration circuitry. For comparison, the standard bidirectional i/o pad boundary-scan registers in the same design occupied 140 µm × 150 µm. Layout for the standard boundary-scan i/o registers was partially determined by control signal routing, while the sample register contains local connections and is dominated more by the size of the shift and configuration registers. Using the recirculation techniques suggested in Section 4, one could build a smaller, 8- to 10-bit sample register and then build a 4- to 5-bit counter and comparator in comparable space. Calibration support then requires the addition of an input mux along with an attached configuration register. Recall from Section 4.4 that the input mux can be expanded in order to share a single scan register among several signals. For bussed signals, a single sample register would typically be shared among a series of 4 to 8 adjacent lines to amortize the area cost required.
Configuration and Tuning Latency Up-front tuning latency using the sample register scheme can be moderately large. Sample registers can collect many bits of data per timed signal per experiment, but must still offload such data via the low-bandwidth, serial scan interface. To keep the area requirements down, recirculating and shared sample registers reuse the sample register in time. Consequently, many timing experiments must be performed in sequence in order to reconstruct a single waveform. These effects make TAP-based timing extraction and configuration a moderately high-latency operation. For example, a 160-pin component using the prototype pad from Section 9 requires 50-60 ms to tune all pads. Appendix B shows how to estimate configuration latencies based on scan and component architecture. In practice, tuning would occur initially at system startup time and thereafter only when environmental characteristics change. As long as the environmental characteristics change slowly, the tuning latency does not have an adverse effect on signaling operation.
Periodic Retiming Requirements As noted in Section 6.3, tuned parameters such as delay or impedance will depend on some slowly-changing environmental characteristics such as temperature and attached hardware configuration. These parameters will need to be retuned whenever environmental characteristics drift significantly from the point of tuning. The way this retuning fits into system operation will vary considerably among applications. In systems with adequate error detection, the detuning can be recognized by the error detection mechanism, and retuning may serve as a primary response to excessive errors. In systems without this kind of error detection, more preemptive measures may be required. For example, a crude, on-chip temperature sensor can serve as an early warning indicator so that retuning can compensate for changes in temperature.
Off-Chip Controller These techniques will require a moderately complex off-chip controller for waveform extraction and reintegration. A personal computer or low-end workstation is both sufficient and economical for in-system testing and tuning. For impedance and delay tuning applications, the controller should be an inseparable part of the base system. In many systems, the task can be assumed by an existing processor in the system. Some systems may require an additional, embedded microcontroller to orchestrate tuning and configuration functions.
Summary
By exploiting the information available at the source end of a series-terminated transmission line, we can identify important characteristics of our interconnect. Employing a TAP-accessible sample register, timing control, and impedance control, we can match a series-terminated transmission-line driver to its attached transmission line, in system. Specifically, we take discrete-time, binary samples of the voltage seen by the driving pad at various impedance settings. Using the sample feedback, we can determine both:

1. the driver impedance setting which best matches the transmission line impedance
2. the arrival time of the signal at the far end of the transmission line

In effect, we get the capabilities of a crude, on-chip TDR. Using this information and fine-grained timing adjustment at the source end of the transmission line, we can reliably pipeline the transmission of data bits over variable length interconnect media. This pipelining allows high bandwidth signaling even when interconnect distances are long.
In general, these techniques allow us to factor out process variation for the ICs and interconnect media and adapt to system-specific parameters such as interconnect impedance and length. Additionally, these techniques allow us to use a TAP to determine the integrity of our interconnect for high-speed signaling.
A Calibration
Sample register calibration is required for two reasons.
1. The time delay between sample bits varies with process variation.
2. The timing on the recycle path in the sliding window also varies and is generally different from the time delay between sample bits.

As a result, when we recover a pair of adjacent samples, we do not immediately know the amount of overlap between samples. Figure 21 depicts (6)

Reassembling windows is easy in this case since we immediately discover the inter-sample-bit time and can align corresponding edges between samples (see Figure 22). To be in the fast clock case, we need:

$$n_{sbits} \ge 2 + \frac{\min(t_{clklow}, t_{clkhigh})}{t_{sbit}} \tag{7}$$
For a 500 MHz clock with a 50% duty cycle in a technology with a minimum inverter delay of 100 ps, this means:

$$n_{sbits} \ge 2 + \frac{1\,\text{ns}}{100\,\text{ps}} = 12 \tag{8}$$

Slow Clock Case When the available calibration clock is slower, it may not make sense to build a sample register long enough to cover half a clock period. In this mode we have to look for two pieces of information independently:

$T_{scycle}$: the time between successive placements of the sample register window
$t_{sbit}$: the time between successive bits within the sample register

Each edge occurs at some offset within a sample register ($n_{bp}$) and at some cycle offset ($n_{wp}$). Given a pair of edges separated by time $T_{e-e}$:

$$(n_{wp2} - n_{wp1})\,T_{scycle} + (n_{bp2} - n_{bp1})\,t_{sbit} = T_{e-e} \tag{9}$$

From here there are two ways we can solve for $t_{sbit}$ and $T_{scycle}$.

1. If we can see an edge in two successive windows, we know:

$$T_{scycle} + n_{bp2}\,t_{sbit} = n_{bp1}\,t_{sbit} \tag{10}$$

The only time when this will never occur is when there is no overlap. That is:

$$n_{sbits} < \frac{T_{scycle}}{t_{sbit}} + 1 \tag{11}$$

See Figure 23 for an example of this case.

2. We can also derive a relationship between $t_{sbit}$ and $T_{scycle}$ if we can get two different sets of $((n_{wp2} - n_{wp1}), (n_{bp2} - n_{bp1}))$ pairs. An example of such differing pairs is shown in Figure 24.
Either of these cases provides a second equation in two unknowns, allowing us to solve for the bit and cycle times and calibrate the sample windows.

[Figure caption: When the calibration clock is fast relative to the length of the sample register, we can capture an entire clock phase and directly determine the inter-sample-bit time. Shown here is a 7-bit sample register with 2 bits of overlap per window. The calibration clock has a period of 8 sample bit delays and a 50% duty cycle.]

If $T_{e-e}$ is not an exact multiple of $T_{scycle}$, the edge positions within the sample windows will drift from window to window. This will guarantee that one of the above two cases will occur, allowing calibration. If $T_{e-e}$ is an exact multiple of $T_{scycle}$, then $(n_{bp2} - n_{bp1}) = 0$, so we immediately know $T_{scycle}$ but do not know $t_{sbit}$. There is also a potential problem if $T_{e-e}$ is an exact multiple of $t_{sbit}$ for certain values of the multiplier. If we have no control over the calibration clock edges and wish to avoid these exceptional cases, it will be necessary to make the timing on the recycle path adjustable. For example, an optional, extra inverter pair delay in the recycle path would allow us to change $T_{scycle}$ so that it is no longer a proper divisor of $T_{e-e}$. Of course, if $T_{e-e}$ is close to being a multiple of $T_{scycle}$, it may take many calibration clock periods, and hence windows, to accumulate enough positional shift to guarantee sufficient calibration data. The optional recycle delay can also be useful in this case to reduce the required time coverage for the sliding window sample register.
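Both slow-clock methods reduce to solving two linear equations of the form of equation (9) in the unknowns $T_{scycle}$ and $t_{sbit}$. The sketch below is one possible way to do that with Cramer's rule; the function name and the tuple layout of an observation are hypothetical, and measurement noise is ignored.

```python
def solve_calibration(obs_a, obs_b):
    """Solve for (T_scycle, t_sbit) from two independent observations.

    Each observation is a tuple (d_wp, d_bp, t_ee):
      d_wp = n_wp2 - n_wp1   (window index difference)
      d_bp = n_bp2 - n_bp1   (bit position difference)
      t_ee = known time between the two edges
    so that d_wp * T_scycle + d_bp * t_sbit = t_ee, as in equation (9).
    A single edge seen in two successive windows (equation 10) is the
    observation (1, n_bp2 - n_bp1, 0).
    """
    a1, b1, c1 = obs_a
    a2, b2, c2 = obs_b
    det = a1 * b2 - a2 * b1
    if det == 0:
        # Proportional observations carry no new information.
        raise ValueError("observations are not independent")
    t_scycle = (c1 * b2 - c2 * b1) / det
    t_sbit = (a1 * c2 - a2 * c1) / det
    return t_scycle, t_sbit
```

For example, with times in picoseconds, `solve_calibration((1, -5, 0), (1, 5, 1000))` combines an edge seen in two successive windows with a pair of edges 1000 ps apart and yields a 500 ps window cycle and a 100 ps bit delay. The `det == 0` check corresponds to the exceptional cases discussed above, where the available observations do not separate the two unknowns.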
B Impedance Tuning Time
The most straightforward way to set the impedance involves a sequence of:

1. start with minimum impedance
2. load in current impedance value
3. force a transition
4. offload resulting sample register
5. increment current impedance value and repeat at step 2 until all impedance values have been tested

Once this sequence of data has been collected, the controller has a collection of information like that shown in Figure 14.

[Figure caption: Here again is a slow clock case where a pair of calibration clock edges never occurs in a single sample window. In the two sequences shown, the pair of edges occurs within a differing number of windows. The combined information from these two series of samples gives us enough information to solve for the sample bit delay and window cycle time using]

From this basic algorithm, we can compute the bandwidth requirements for impedance tuning and get a basic estimate for tuning time:
Step #2 requires loading in an impedance value. With $n_i$ impedance transistors on each supply network, this requires $2n_i$ bits per controlled impedance pad.
Step #3 can occur in parallel for all pads being tuned and can be done in tens of clock cycles per iteration.
Step #4 offloads the resulting sample register and requires $n_{sbit}$ bits per controlled impedance pad.
If we scan through all $2^{n_i}$ settings for an impedance network, and have to iterate the process four times to achieve reasonable convergence, the total number of bits transferred is:

$$n_{bit} = 4 \cdot 2^{n_i} \cdot n_{pads} \cdot (n_{sbit} + 2n_i) \tag{12}$$

If we move data to and from the chip under serial TAP control, we get to move one bit per clock cycle. There will be some additional overhead for the TAP protocol, but this is small compared to the clock cycles required to move data on and off the chip. The tuning time can thus be approximated as:

$$T_{tune} \approx \left(4 \cdot 2^{n_i} \cdot n_{pads} \cdot (n_{sbit} + 2n_i)\right) t_{tclk} \tag{13}$$
To make this concrete, we can consider the test pad from Section 9. This pad had $n_i = 6$ and $n_{sbit} = 16$. If we further consider a component with 160 impedance controlled pins and a scan-based TAP with a 20 MHz scan clock ($t_{tclk} = 50$ ns):

$$T_{tune} \approx \left(4 \cdot 2^{6} \cdot 160 \cdot (16 + 2 \cdot 6)\right) \cdot 50\,\text{ns} \approx 57\,\text{ms}$$

This time can be reduced by clever arrangement of the scan operations. For instance, if we know we are going to be tuning all 160 pins in parallel, a parallel load of all 160 pins from the same $2n_i$ impedance bits would make the cost of uploading test impedance values almost negligible. For the example above, such a change would reduce the time to roughly 33 ms. In the analysis above, we assumed that it was necessary to reload both the pull-up and pull-down impedance during step #2, while only one of these impedances generally varies during a scan iteration. Simply allowing the pull-up and pull-down networks to be loaded independently would allow us to tune the impedance in 45 ms. Of course, if a single pad's impedance and sample registers can be accessed independently, the offload time for a single pad during tuning would be:

$$T_{tune,\,one\ pad} \approx \left(4 \cdot 2^{6} \cdot 1 \cdot (16 + 2 \cdot 6)\right) \cdot 50\,\text{ns} \approx 360\,\mu\text{s}$$

When tuning a single pad, the scan overhead will not be as negligible. A more accurate estimate for single pad tuning time is on the order of 500 µs using a scan-based TAP.
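The estimates in this appendix can be reproduced with a short calculation. The sketch below evaluates equations (12) and (13) and the variants discussed above; the function names and the `load_bits_per_pad` parameter are hypothetical conveniences, and TAP protocol overhead is ignored as in the text.

```python
def tuning_bits(n_i, n_sbit, n_pads, iterations=4, load_bits_per_pad=None):
    """Total bits scanned during impedance tuning, per equation (12).

    load_bits_per_pad defaults to 2 * n_i (pull-up and pull-down
    settings both reloaded on every step). Pass a smaller value to
    model cheaper loading schemes.
    """
    if load_bits_per_pad is None:
        load_bits_per_pad = 2 * n_i
    return iterations * (2 ** n_i) * n_pads * (n_sbit + load_bits_per_pad)


def tuning_time_s(n_bits, f_tclk_hz):
    """Tuning time per equation (13): one bit per scan clock cycle."""
    return n_bits / f_tclk_hz


F_TCLK = 20e6  # 20 MHz scan clock, i.e. t_tclk = 50 ns

baseline = tuning_time_s(tuning_bits(6, 16, 160), F_TCLK)
# Shared parallel load: per-pad impedance upload cost treated as zero.
shared_load = tuning_time_s(
    tuning_bits(6, 16, 160, load_bits_per_pad=0), F_TCLK)
# Pull-up and pull-down loaded independently: only n_i bits per step.
independent = tuning_time_s(
    tuning_bits(6, 16, 160, load_bits_per_pad=6), F_TCLK)
one_pad = tuning_time_s(tuning_bits(6, 16, 1), F_TCLK)
```

These evaluate to about 57.3 ms, 32.8 ms, 45.1 ms, and 358 µs respectively, matching the rounded figures quoted above.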
