Interconnect tuning is an increasingly critical degree of freedom in the physical design of high-performance VLSI systems. By interconnect tuning, we refer to the selection of line thicknesses, widths and spacings in multi-layer interconnect to simultaneously optimize signal distribution, signal performance, signal integrity, and interconnect manufacturability and reliability. This is a key activity in most leading-edge design projects, but has received little attention in the literature. Our work provides the first technology-specific studies of interconnect tuning. We center on global wiring layers and interconnect tuning issues related to bus routing, repeater insertion, and choice of shielding/spacing rules for signal integrity and performance. We address four basic questions. (1) How should width and spacing be allocated to maximize performance for a given line pitch? (2) For a given line pitch, what criteria affect the optimal interval at which repeaters should be inserted into global interconnects? (3) Under what circumstances are shield wires the optimum technique for improving interconnect performance? (4) In global interconnect with repeaters, what other interconnect tuning is possible? Our study of question (4) demonstrates a new approach of offsetting repeater placements that can reduce worst-case cross-chip delays by over 30% in current technologies.
Introduction
With technology scaling, on-chip interconnect becomes an increasingly critical determinant of performance, manufacturability and reliability in high-end VLSI designs. Current and future designs are generally interconnect-limited, and the available routing resource must be carefully balanced among signal distribution, power/ground distribution, and clock distribution. Table 1 reproduces several technology projections from the 1997 SIA National Technology Roadmap for Semiconductors [1]. A notable deviation from the original 1994 Roadmap is that maximum on-chip clock frequencies will reach the gigahertz range even in the 180nm process generation. The implications of technology scaling, particularly for system interconnect, are very complicated. Example considerations for a 7-layer metal (7LM) process might include:
Local interconnect layers (e.g., M1-M3) should generally remain at near-minimum dimensions and pitch in order to achieve routing density (for an example analysis of interconnect density in 0.25µm processes, see [10] ). For short lines (e.g., several hundred microns or less), thinner metal offers less lateral coupling capacitance and driver loading, and thus locally improves circuit performance. At the same time, maximum wire width is limited by the aspect ratio upper bound. The resulting thin and narrow wires are highly resistive and also subject to reliability concerns; they are hence unsuitable for global interconnects, power distribution, etc. We also note that layers M2-M3 (and maybe M4) will support a mix of local and "near-global" wiring, e.g., long wires within a single block. The distribution of lengths and performance goals for these signals can vary considerably between designs; since shorter wires are better routed on thinner metal, these design-specific considerations will affect the interconnect.
Power distribution layers (e.g., M6-M7, maybe M5), which typically also support the top-level clock distribution (mesh or balanced tree), should be as thick as possible for reliability. IR drop and clock skew, as well as robustness under process variations, also suggest the use of thick wire on these layers. Thick wire additionally conserves area, but can suffer from increased lateral capacitive coupling.
Global interconnect layers (e.g., M4-M6) support inter-block signal runs with length on the order of 3000µm to 15000µm. To satisfy delay and signal integrity constraints, at least three degrees of freedom are available: line width and spacing, repeater insertion, and shield wiring. Repeater insertion shields downstream capacitance and is the canonical means of converting "quadratic" RC delay into "near-linear" delay; this technique also improves edge rates and hence noise immunity. When lateral coupling capacitances are large, worst-case "Miller coupling" begins to dominate noise and delay calculations; this is alleviated by increasing the line spacing and/or adding shield wiring (i.e., wires connected to ground), with future techniques possibly including dedicated ground and power planes interleaved with signal layers [5].¹ Another technique to reduce the lateral coupling capacitance is to interleave signal lines that do not switch during the same signal transition period. The bus-dominated nature of global interconnects in building-block and high-performance designs only worsens the effects of coupling, since it results in longer parallel runs.
All layers are subject to shared considerations such as mutual pitch-matching and via sizing. Hence, available widths and spacings on one layer are not independent of the widths and spacings on a second layer.
The above are only a few of the applicable design considerations; the net effect is that balancing interconnect resources is now extremely difficult as designs move into and beyond the quarter-micron regime.
Interconnect Strategies
Interconnect tuning is the selection by a design team of line thicknesses, widths and spacings in multi-layer interconnect to simultaneously achieve: (i) distribution (available wiring density) for local signals, global signals, clock, power and ground; (ii) performance (signal propagation delay), particularly on global interconnects; (iii) noise immunity (signal integrity), again particularly on global interconnects; and (iv) manufacturability and reliability (e.g., required margins for AC self-heat or DC electromigration on interconnects, short-circuit power in attached devices, etc.). Today, interconnect tuning is a key activity in most leading-edge microprocessor projects. It is clearly an option whenever the design and fabrication are owned by a single entity; however, for high-volume projects even fabless design houses are exercising increasing influence on vendors' processes [10]. Nevertheless, this topic has received very little attention in the literature, with only a small handful of high-level treatments available.² Our work is the first in the literature to attempt a wide-ranging study of interconnect tuning.

¹ When two parallel neighboring lines L1 and L2 switch simultaneously in opposite directions, the driver of L1 sees the grounded line capacitance plus twice the coupling capacitance of L1 to L2. If L2 is quiet when L1 switches, then the driver of L1 sees the grounded line capacitance plus the coupling capacitance to L2. And if L2 switches simultaneously in the same direction, the driver of L1 sees only the grounded line capacitance. (In leading-edge processes, each neighbor coupling is of the same (and possibly greater) magnitude as the area coupling to ground.) The "coupling factor" or "switching factor" is often given in the range [0, 2], and since most lines have two neighbors, the total coupling factor is in the range [0, 4].
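The coupling-factor convention of footnote 1 can be illustrated with a minimal sketch. A neighbor switching in the same direction as the victim contributes a factor of 0, a quiet neighbor contributes 1, and a neighbor switching in the opposite direction contributes 2. The function name is hypothetical, and the example capacitances (in fF per 1000µm) are the Technology I values quoted later for the repeater offset experiments; they are used here only as sample inputs.

```python
def effective_capacitance(c_ground, c_left, c_right, sf_left, sf_right):
    """Capacitance seen by the victim driver for given switching factors.

    Each switching factor lies in [0, 2]:
      0 = neighbor switches in the same direction as the victim,
      1 = neighbor is quiet,
      2 = neighbor switches in the opposite direction (worst case).
    """
    for sf in (sf_left, sf_right):
        if not 0.0 <= sf <= 2.0:
            raise ValueError("switching factor must lie in [0, 2]")
    return c_ground + sf_left * c_left + sf_right * c_right


# Sample inputs in fF per 1000 um (left neighbor, ground, right neighbor).
C_LEFT, C_GROUND, C_RIGHT = 60.0, 80.0, 60.0

worst_case = effective_capacitance(C_GROUND, C_LEFT, C_RIGHT, 2.0, 2.0)
quiet_case = effective_capacitance(C_GROUND, C_LEFT, C_RIGHT, 1.0, 1.0)
best_case = effective_capacitance(C_GROUND, C_LEFT, C_RIGHT, 0.0, 0.0)
```

With a total coupling factor of 4, the lateral terms dominate the ground term for these sample values, which is why worst-case analysis is so sensitive to line spacing.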
We center on global wiring layers (e.g., M4 and M5 in a 6LM process), and interconnect tuning issues related to bus routing, repeater insertion, and choice of shielding/spacing rules for signal integrity and performance.³ We answer these questions using technology parameters from a representative 0.25µm CMOS process; this matches the process technology context for many current- and next-generation microprocessors. Coupling capacitance studies are performed with the commercial QuickCap 3-D field solver, and interconnect delay and noise coupling studies are performed with the commercial HSPICE simulator. Of particular interest is our study of question (4): we demonstrate that a new methodology for offsetting repeater placements can reduce worst-case cross-chip delays by over 30% in current technologies, versus traditional repeater insertion methodology.

² For example, [12] describes a characterization and analysis methodology and the need to break ideal scaling in deep submicron interconnect. [8] is another work that centers on analysis of a given multi-layer interconnect process, as opposed to the underlying interconnect tuning. [3] and [6] are examples of system-level treatments based on Rent's rule for interconnect length distribution.

³ Even though the results presented in this paper are for aluminum interconnects with SiO2 dielectric, similar techniques can be applied for copper interconnects and low-K dielectrics.

Allocation of Width and Spacing for Given Pitch
Our first study seeks to determine how width and spacing should be optimally allocated for a given line pitch. In practice, the actual line width used is considerably greater than the minimum line width achievable in lithography. Thus, there is freedom to tune the width and spacing once assumptions are in place for line thickness and target line length. We note that because very long inter-block lines will have repeaters inserted regularly (see Section 3 below), the maximum line length of interest is equal to the optimum interval between repeaters; this length ranges between 2500µm and 5000µm for global interconnect layers in leading-edge technologies.
We have performed detailed studies of "fast" M3 interconnect with 3.2µm pitch, assuming that M2 crossunders are dense (i.e., can be approximated as a ground plane) [9] and explicitly modeling M4 crossovers. Dielectric modeling is based on actual layer data for a representative 0.25µm CMOS process. QuickCap was used to extract coupling and area capacitances, summarized in Table 2. As is typical in such analyses, we assume worst-case coupling, i.e., a total coupling factor of 4.0 (worst-case coupling factor of 2.0 to each of the left and right neighbors of the (victim) line under analysis). Table 3 shows HSPICE-computed line delays for M3 line lengths ranging from 4000µm to 6000µm. Again, dense M2 is assumed to be a ground plane, and M4 crossovers are modeled explicitly. Table 3 shows that (width, spacing) = (1.2µm, 2.0µm) gives the best performance for the given line pitch.
Bounding the Interval Between Repeaters
A very basic study (in some sense a prerequisite to all other interconnect tuning) asks how often repeaters should be inserted into global interconnects. This is of course a chicken-and-egg problem, in that the optimum repeater interval depends on the interconnect tuning, and the interconnect tuning depends on the maximum run ever made without an intervening repeater. However, the following can be noted.
A body of study shows that repeaters should be inserted at uniform intervals. In other words, there should be a constant interconnect length (or interconnect delay) between each pair of adjacent repeaters; the first and last segments of the path are exceptions because in practice the driver and receiver sizes may not be the same as the repeater size. Such theoretical results deviate from real-life practice: on any source-destination path the repeater sizes need not be the same, and it may also be better to add repeaters in parallel in order to drive larger wire lengths. (This is not just for performance: repeaters locally affect device area and routing constraints. However, our studies have not yet addressed such layout issues.) Using the same principle (and with certain types of methodology and chip planning constraints), it can be better to increase the size of the drivers inside the block as much as possible, which would increase the first segment length.
Assuming that the driver size and the receiver size are the same as the size of the repeaters inserted along the path, we calculate the total delay, optimal number of repeaters and optimal distance between the repeaters. The total delay for a path with K repeaters is
The delay of the first stage is the total delay from the output of driver to the input of the first repeater, i.e., $T_{\text{first stage}} = T_{gd} + T_{int}$, where gate load delay is
, and $R_{rep}$, $C_{rep}$ are repeater output resistance and input gate capacitance. The effective capacitance at the gate output can be approximated as $C^{eff}_{int} = \alpha C_{int}$ where $\alpha$ is a constant between 1/6 and 1 [11]. Let $L_p$ be the interconnect path length between driver and receiver. Then for optimal placement of repeaters the interconnect length between repeaters is $L_p/(K+1)$. Therefore, the total delay for the path is
where $r$, $c$ are resistance and capacitance per unit length of the interconnect line. We compute the optimal number of repeaters that minimizes total delay by setting $\partial T_{tot}/\partial K = 0$, and obtain
To minimize total delay, gate load delay and interconnect delay should be equal. If effective capacitance is not considered in the gate load delay computation, and with current technology trends, gate load delay will always be greater than interconnect delay. Under these conditions, to minimize total delay one can increase the time of flight (or wire length) between repeaters until slew time constraints become tight. In the current range of 0.35µm and 0.25µm process generations, global interconnects have repeaters inserted with periods ranging from 2500 µm to 10000 µm.
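As a sketch of how an optimum repeater count arises, assume an Elmore-style stage model in which each of the K+1 identical stages contributes a gate load delay R_rep(α c ℓ + C_rep) and an interconnect delay r ℓ (c ℓ/2 + C_rep), with ℓ = L_p/(K+1). This specific functional form, the function names, and the values of R_rep and C_rep below are assumptions made for illustration and are not taken from the text. Under this assumed model, setting the derivative with respect to K to zero gives K + 1 = L_p sqrt(r c / (2 R_rep C_rep)).

```python
import math


def total_delay(K, L_p, r, c, R_rep, C_rep, alpha=1.0):
    """Path delay with K uniformly spaced repeaters (assumed Elmore-style model)."""
    seg = L_p / (K + 1)                          # wire length per stage
    t_gd = R_rep * (alpha * c * seg + C_rep)     # gate load delay per stage
    t_int = r * seg * (c * seg / 2.0 + C_rep)    # interconnect delay per stage
    return (K + 1) * (t_gd + t_int)


def optimal_repeater_count(L_p, r, c, R_rep, C_rep):
    """Continuous K that zeroes dT/dK for the assumed model above."""
    return L_p * math.sqrt(r * c / (2.0 * R_rep * C_rep)) - 1.0


# Illustrative inputs; lengths in um, resistance in ohm, capacitance in F.
L_P = 10000.0            # path length
R_WIRE = 50.0 / 1000.0   # 50 ohm per 1000 um
C_WIRE = 200e-15 / 1000.0  # 200 fF per 1000 um (assumed total line capacitance)
R_REP = 200.0            # hypothetical repeater output resistance
C_REP = 100e-15          # hypothetical repeater input capacitance

k_continuous = optimal_repeater_count(L_P, R_WIRE, C_WIRE, R_REP, C_REP)
k_best = min(range(0, 20),
             key=lambda K: total_delay(K, L_P, R_WIRE, C_WIRE, R_REP, C_REP))
```

In this assumed model the two K-dependent terms, (K+1) R_rep C_rep and r c L_p^2 / (2(K+1)), are equal at the optimum, which is one way to read the balance between gate and interconnect delay described above.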
Repeater insertion is also driven by pure interconnect delay, since larger time of flight implies larger slew time on the transition seen at the receiver. Edges with large slew times cause much larger gate delays, are more susceptible to noise, are more susceptible to process-distribution influenced delay variations, and also increase the short-circuit power dissipation. Even in today's designs, slew times above 600-700 ps cannot be tolerated. Thus, even without the delay minimization objective, edge rate control will force insertion of repeaters. In fact, some of the functionality of "post-layout optimization" tools for gate sizing and repeater insertion is driven by edge rate checks as opposed to signal delay reduction.
In practice, repeaters will be implemented using inverters whenever possible, due to performance and area efficiency. Table 4 summarizes M3 interconnect slew times for line width 1.0µm and line spacing 1.2µm (corresponding to a "dense" M3 routing pitch), and input slew time of 400 ps. All capacitance extractions were performed with QuickCap, and correspond to M4 and M1 as the top and bottom ground planes, respectively. Switching factors range from 4 (both neighbors switching in the opposite direction from the victim) to 2 (both neighbors quiet, or one neighbor switching in the opposite direction and one neighbor switching in the same direction with respect to the victim). We see that the M3 distance between repeaters has an upper bound of 5000µm due to edge rate considerations alone. Separate studies show that this upper bound on distance between repeaters is essentially unaffected by changes to the driver/receiver sizing or the input slew time.
Benefits of Shield Wiring
Our third study addresses the question of whether shield wiring is an effective means of improving delay and signal integrity performance of long global interconnects. We consider various width-spacing rules for M3 interconnect, in order to evaluate the utility of spacing vs. shielding techniques. Our evaluations are with respect to delay only; for all of the configurations, the assumed slew time upper bounds of approximately 600ps imply that noise coupling will not be problematic. Figure 1 contrasts five pitch-matched width-spacing rules:

Rule1: 1.2µm width, 1.0µm spacing
Single-VSS: 1.2µm width, 1.0µm spacing, with every third line grounded (i.e., every signal line has one grounded neighbor to shield it)
Rule2: 1.2µm width, 2.1µm spacing
Rule3: 2.2µm width, 2.2µm spacing
Double-VSS: 1.2µm width, 2.1µm spacing, with every other line grounded (i.e., every signal line has two grounded neighbors to shield it)

Again, QuickCap was used to extract capacitive couplings of a given victim line to its neighbor lines and the neighboring top/bottom layers; these results are shown in Table 5. Notice that the Rule1, Rule2 and Rule3 rules have worst-case coupling factors = 4. On the other hand, the Single-VSS rule has worst-case coupling factor = 3, and the Double-VSS rule has worst-case coupling factor = 2. Table 6 shows the delay performance for a 4000µm M3 line, under various bottom ground and top plane configurations. We observe:
The Rule3 rule provides a 37% decrease in total delay, but since $C_{eff}$ was not used in the gate load delay computation, actual delay reductions could be even greater.
The Single-VSS rule is less effective than the Rule2 rule; note that the two rules are equivalent in terms of effective routing density. Our studies have not yet addressed the routing interactions that can potentially affect this analysis. In particular, shield lines may be added to bring power and ground connections to repeater blocks. The Double-VSS rule gives improved total delays compared with the Rule3 rule, with the rules being equivalent in terms of effective routing density. However, the Rule3 rule yields smaller interconnect delays, so that driver size reductions have greater potential for delay improvement. Thus, the Rule3 rule seems preferable. When two buses have activity patterns such that each is quiet when the other is active, then their lines can be interleaved such that they effectively follow the Double-VSS rule. In such a case, interleaving is clearly superior to the Rule3 rule, since the effective routing density is doubled.
Gate load delays are larger than interconnect delays, suggesting that it is preferable to decrease line widths and increase line spacings. We also note that a dense M4 top layer decreases total delay, and a dense M2 bottom (ground plane) layer decreases total delay for smaller line widths only.
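The worst-case coupling factors quoted for the five rules follow from counting neighbors: a grounded (shield) neighbor never switches and so contributes a factor of 1, while a signal neighbor can switch in the opposite direction and contribute 2. A minimal sketch of that bookkeeping follows; the function and dictionary names are hypothetical.

```python
def worst_case_coupling_factor(grounded_neighbors):
    """Worst-case total coupling factor for a line with two lateral neighbors.

    A grounded neighbor is always quiet (factor 1); a signal neighbor may
    switch in the opposite direction to the victim (factor 2).
    """
    if grounded_neighbors not in (0, 1, 2):
        raise ValueError("a line has exactly two lateral neighbors")
    signal_neighbors = 2 - grounded_neighbors
    return 2 * signal_neighbors + 1 * grounded_neighbors


# Number of grounded neighbors per signal line under each rule.
GROUNDED_NEIGHBORS = {
    "Rule1": 0,
    "Rule2": 0,
    "Rule3": 0,
    "Single-VSS": 1,
    "Double-VSS": 2,
}

coupling_factors = {rule: worst_case_coupling_factor(n)
                    for rule, n in GROUNDED_NEIGHBORS.items()}
```

This counts only the switching factor; the actual coupling capacitance each factor multiplies still depends on the spacing of the rule.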
New Repeater Offset Methodology for Global Buses
Finally, we study another form of tuning that is possible for global interconnects. Our motivations are three-fold: (i) global interconnect is increasingly dominated by wide buses; (ii) present methodology designs global interconnects for worst-case Miller coupling; and (iii) present methodology routes long global buses using repeater blocks, i.e., blocks of co-located inverters spaced every, say, 4000µm.
We propose a simple method to improve global interconnect performance. The idea is to reduce the worst-case Miller coupling by offsetting the inverters on adjacent lines (see Figure 2). In the previous methodology (Figure 2(a)), the worst-case switching of a neighbor line (i.e., simultaneously and in the opposite direction to the switching of the victim line) persists through the entire chain of inverters. However, with offset inverter locations (Figure 2(b)), any worst-case simultaneous switching on a neighbor line persists only for half of each period between consecutive inverters, and furthermore becomes best-case simultaneous switching for the other half of the period.
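The geometric argument can be checked with a small sketch that tracks signal polarity along two adjacent lines. It assumes inverting repeaters, perfectly simultaneous transitions, and lines discretized into equal segments; it counts only the fraction of the victim's length over which the neighbor switches in the opposite direction and does not model delay. The function names and the segment-based discretization are illustrative assumptions.

```python
def inversions_before(segment, period, offset):
    """Number of inverters a signal has passed on entering `segment`.

    Inverters sit at the start of every segment j > 0 with
    j % period == offset % period.
    """
    return sum(1 for j in range(1, segment + 1)
               if j % period == offset % period)


def opposite_fraction(n_segments, period, neighbor_offset,
                      victim_offset=0, start_opposite=True):
    """Fraction of the line over which the neighbor switches opposite to the victim."""
    opposite = 0
    for seg in range(n_segments):
        v = inversions_before(seg, period, victim_offset)
        n = inversions_before(seg, period, neighbor_offset)
        same_parity = (v - n) % 2 == 0
        # Relative polarity flips each time only one of the two lines inverts.
        is_opposite = start_opposite if same_parity else not start_opposite
        if is_opposite:
            opposite += 1
    return opposite / n_segments


# Two full repeater periods of 100 segments each.
aligned = opposite_fraction(200, 100, neighbor_offset=0)
half_offset = opposite_fraction(200, 100, neighbor_offset=50)
```

With aligned repeaters the worst-case relationship holds along the whole line, whereas a half-period offset leaves it in force over only half of each period.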
To confirm the advantages of this method, the following experimental methodology was used.

Table 6: Delay estimates for a 4000µm M3 line, under various interconnect tuning configurations. Driver and receiver buffer sizes: (wp=100µm, wn=50µm). Delay is computed from input of driver to input of receiver.
We study systems of three parallel interconnect lines, with lengths either 10000µm or 14000µm. These lines are stimulated by a waveform with risetime = falltime = 200ps. The middle line is considered the "victim" for analysis purposes.
We model two "technologies" representative of M3 and M4 in a 0.25µm CMOS process. In each technology, line resistance is 50Ω per 1000µm. In Technology I, capacitive couplings to left neighbor, ground and right neighbor per 1000µm are respectively 60fF, 80fF and 60fF. In Technology II, capacitive couplings to left neighbor, ground and right neighbor per 1000µm are respectively 80fF, 160fF and 80fF.
We assume a period between inverters (repeaters) of 4000µm. So that HSPICE does not introduce error in its RC analysis, we manually distributed the line and coupling parasitics into 40µm segments, i.e., repeaters occurred every 100 segments, and line lengths were 250 or 350 segments. Each segment is modeled as a double-pi model.⁴ We always place the inverters on the middle line with "phase = 0", i.e., at positions 4000, 8000, ... microns along the line. Inverters on the left and right neighbors are placed according to all combinations of phase = 0, 0.1, 0.2, ..., 0.9 (again with respect to the period of 4000µm). There are 100 different phase combinations. Figure 2 shows the three-line configurations with left/right neighbor phase combinations of (0,0) and (0.5,0.5).
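The segment counts and per-segment parasitics implied by these numbers can be worked out directly. The sketch below only scales the stated per-1000µm values to one 40µm segment; how each segment's capacitance is split inside the double-pi model is not specified here and is left out. All names are illustrative.

```python
SEGMENT_UM = 40
REPEATER_PERIOD_UM = 4000
LINE_LENGTHS_UM = (10000, 14000)
R_PER_1000UM = 50.0  # ohm per 1000 um, both technologies

# Coupling to left neighbor, ground, right neighbor in fF per 1000 um.
TECHNOLOGIES = {
    "I": {"left": 60.0, "ground": 80.0, "right": 60.0},
    "II": {"left": 80.0, "ground": 160.0, "right": 80.0},
}


def per_segment(value_per_1000um, segment_um=SEGMENT_UM):
    """Scale a per-1000-um quantity to a single segment."""
    return value_per_1000um * segment_um / 1000.0


def segment_parasitics(tech):
    """Total resistance (ohm) and capacitances (fF) of one segment."""
    caps = TECHNOLOGIES[tech]
    return {
        "r_ohm": per_segment(R_PER_1000UM),
        "c_left_fF": per_segment(caps["left"]),
        "c_ground_fF": per_segment(caps["ground"]),
        "c_right_fF": per_segment(caps["right"]),
    }


segments_per_period = REPEATER_PERIOD_UM // SEGMENT_UM
segments_per_line = {length: length // SEGMENT_UM for length in LINE_LENGTHS_UM}
```

For Technology I this gives 2Ω, 2.4fF, 3.2fF and 2.4fF per segment, with 100 segments per repeater period and 250 or 350 segments per line.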
We stimulate the three lines with the periodic waveform, with the first transition either rising (R) or falling (F). There are eight combinations of directions for the first transitions, i.e., RRR, RRF, ..., FFF.
Finally, we may offset the input waveforms of the left and right neighbors by -100ps, 0ps or +100ps with respect to the input waveform of the middle line. There are nine combinations of these input offsets.

Table 7 shows HSPICE delays for systems of three lines of length 10000µm, using Technology I, for all combinations of rising (R) and falling (F) initial transition on the input waveform. The table shows delays for inverter phases (0,0) and (0.5,0.5) on the left and right neighbors of the middle line (phase 0). The effect of Miller coupling is clearly shown. Table 8 shows the worst-case delays (with respect to all eight possible combinations of rising and falling inputs) for the middle line, for each combination of phases for the inverter locations on the left and right neighbor lines. Input offsets are all 0, i.e., the waveforms start at the same time. All four combinations of Technology and line length are shown. In every case, the optimum phase combination is (0.5,0.5), while the traditional phase combination of (0.0,0.0) is actually the worst possible. The worst-case delay is reduced by anywhere from 25% to 30% when the repeaters are placed with optimum phase. Finally, Table 9 shows the same worst-case delays for the middle line, this time taken over all eight rise/fall combinations and all nine combinations of input waveform offsets. Again, even when the inputs do not switch perfectly simultaneously, the best phase combination is (0.5,0.5) and the worst phase combination is the traditional (0.0,0.0) methodology.

Table 7: HSPICE delays (ns) for three lines of length 10000µm, using Technology I, for all combinations of rising (R) and falling (F) initial transition on the input waveform. We show delays for inverter phases (0,0) and (0.5,0.5) on the left and right neighbors of the middle line (phase 0).
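The experimental sweep described above can be enumerated as follows. This is a sketch of the bookkeeping only: `delay_fn` is a hypothetical stand-in for one circuit simulation that returns the middle-line delay for a given (phases, directions, offsets) configuration, and the helper names are illustrative.

```python
from itertools import product

PHASES = [i / 10 for i in range(10)]                 # 0, 0.1, ..., 0.9
PHASE_COMBOS = list(product(PHASES, repeat=2))       # (left, right) neighbors
DIRECTION_COMBOS = ["".join(d) for d in product("RF", repeat=3)]
OFFSET_COMBOS = list(product((-100, 0, 100), repeat=2))  # ps, (left, right)


def worst_case_by_phase(delay_fn, sweep_offsets=False):
    """Worst-case middle-line delay for each neighbor phase combination.

    delay_fn(phases, directions, offsets) is a hypothetical callback that
    returns one simulated delay. With sweep_offsets=False only zero input
    offsets are used; with True all nine offset combinations are included.
    """
    offsets = OFFSET_COMBOS if sweep_offsets else [(0, 0)]
    return {
        phases: max(delay_fn(phases, dirs, offs)
                    for dirs in DIRECTION_COMBOS
                    for offs in offsets)
        for phases in PHASE_COMBOS
    }


def best_and_worst_phase(table):
    """Phase combinations with the smallest and largest worst-case delay."""
    best = min(table, key=table.get)
    worst = max(table, key=table.get)
    return best, worst
```

Each technology and line length therefore needs 100 x 8 = 800 simulations with zero input offsets, and 100 x 8 x 9 = 7200 when the input offsets are swept as well.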
Conclusions
To our knowledge, this work has provided the first technology-specific studies of interconnect tuning in the literature. We have described experimental approaches to interconnect tuning issues related to bus routing, repeater insertion, and choice of shielding/spacing rules for signal integrity and performance. In particular, four questions have been addressed: allocation of width and spacing to maximize performance for a given pitch, finding the optimal interval for repeater insertion, assessing the potential benefits of shield wiring, and optimizing the insertion of repeaters in global buses. Our answers to these questions are at times surprising: in answering (3), we demonstrate that current shielding methodologies may be suboptimal when compared with alternate width/spacing rules, and in answering (4), we propose a new repeater offset technique that can reduce worst-case cross-chip delays by over 30% in current technologies. Ongoing efforts extend our interconnect tuning research to encompass layer thicknesses, more detailed analyses of noise coupling and tuning to meet noise margins, and the delay/noise behavior in emerging technology regimes (Cu interconnect and low-K dielectrics). Finally, we seek to develop more complete full-chip interconnect tuning approaches based on analyses of the interconnect structure, speed target, and power dissipation target for a given design.

Table 9: Worst-case delays with all combinations of input offsets.
