Abstract-Due to architectural complexity and process costs, circuit-level solutions are often the preferred means of resolving signal integrity issues that affect the performance and reliability of on-chip interconnect. In this paper, we consider multi-segment bit-lines used in wide on-chip interconnect, and explore in detail the effect of signal transition skew on the delay and time of flight in the presence of crosstalk. We present the relationship between segment delay, signal transition skew and the injected noise pulse, and propose a novel staggered latch bus (SLB) architecture to explicitly exploit transition skew for improved speed and performance. Our proposed SLB architecture achieves an average of 2.5X (2.3X) improvement in speed for fully-aligned (mis-aligned) buffering schemes with no increase in area or power and no additional wires.
I. INTRODUCTION
Feature scaling has been key to sustaining the exponential growth in chip performance over the decades. However, as on-chip dimensions cross the 100 nm threshold, various signal integrity challenges are threatening to limit this trend [1]. The adoption of high aspect ratio metal layers, intended to mitigate the inverse relationship between process scaling and metal resistance, leads to the formation of large implicit coupling capacitances between physically separated metal traces. These large capacitances create non-negligible electrical interference and crosstalk noise that can distort signals on neighboring metal wires. The effect of this crosstalk noise is particularly critical in the design and performance of multi-bit on-chip interconnect structures such as buses, network links and memory bit-lines, which provide communication between functional blocks, memory elements, I/O pins, etc.
Crosstalk induced delay refers to the change in the effective switching delay of a coupled metal line due to signal activity on neighboring lines. Line delay varies depending on the direction and the relative temporal overlap of the neighboring transitions. Traditionally, the use of simultaneous latching in synchronous circuits forces a built-in temporal overlap in signal transitions. When transition directions on neighboring lines coincide, line delay is reduced below nominal; when they diverge, line delay increases above nominal; otherwise, line delay is nominal. In multi-bit interconnect design, worst-case delay margins are necessary to guarantee transmission reliability with a small performance tradeoff. Future technology nodes promise increasingly higher inter-metal coupling, which will require larger delay margins to guarantee reliability, at the cost of even larger tradeoffs. In this paper, we analyze the effect of offset switching on line delay in comparison to the traditional simultaneous switching approach. We show that offset switching is superior in the context of highly coupled, segmented bit-lines. We further present a novel staggered latching mechanism for seamless synchronous operation and demonstrate improved error rates relative to traditional methods. Our solution requires zero additional wires and no appreciable increase in power and area. The rest of this paper is organized as follows: Section II discusses selected related work, Section III introduces the effect of switching offset in electrically coupled bit-lines, Section IV introduces staggered latching as a novel clocking scheme, Section V presents experiments and results, and Section VI concludes the paper.
Figure 1: ITRS data on Cu interconnect.

II. RELATED WORK
Signal crosstalk in on-chip interconnect due to adjacent-wire capacitive coupling has received much interest and attention in the literature. Efficient methods for extracting and characterizing wire resistance, ground capacitance and coupling capacitance for both local and global wires are well known [2]. Closed form expressions for modeling local interconnect delay in the presence of coupling have been proposed, and numerically efficient methods for electronic design automation (EDA) purposes have also been published [3], [4]. The use of Miller capacitance models for inter-metal coupling capacitance, as proposed for fast delay calculations in [5], introduces non-negligible inaccuracies at feature sizes below 50 nm. As a result, complex noise superposition models, which have been shown to offer more reliable delay estimates in the presence of crosstalk, have been developed [4], [6]. In [7], closed form expressions for the total noise waveform due to all active neighbors of a wire were proposed for local wires, but they are seamlessly extendable to long wire interconnect structures.
978-1-4799-0524-9/13/$31.00 ©2013 IEEE
Crosstalk induced delay is skew-dependent. Skew-dependent delay fluctuations are due to variations in the temporal overlap between a transitioning signal and the noise waveform from various neighboring aggressors: the larger the overlap, the larger the change in the delay [8]. In [9], a similar bus delay reduction technique is proposed that deliberately introduces transition skew between adjacent wires on a bus. The authors used the Miller capacitance method and assumed a one-way aggressor-victim model for the key delay analysis. This approach over-simplifies the multi-way aggressor-victim reality and, as a result, is insufficient for sub-50 nm processes.
Our proposed approach makes two key contributions: First, using the noise model proposed in [7] for the crosstalk noise waveform, we propose an efficient corresponding delay model as a function of inter-signal input skew for a typical segment in a multi-segment interconnect (MSI). Second, unlike in [9] , we propose staggered latching, a novel synchronous clocking strategy that efficiently leverages skewed switching with no additional bus wire overhead. This improves performance in the presence of large coupling capacitance in wide MSI.
III. SIGNAL SWITCHING OFFSET AND LINE DELAY
In this section, we develop analytical models for line delay as a function of the switching offset between closely spaced metal traces.
A. Signal Response Model
The signal delay of an n-bit MSI is limited by the slowest bit-line. The response time on the slowest line depends on the resistance and capacitance measured both to the ground plane and to the adjacent lines.
A generalized coupling structure for a multi-segment interconnect is shown in Figure 2. The variable $m$ represents the mis-alignment factor between adjacent segments on neighboring bit-lines, such that choosing $m = 0$ or $m = 0.5$ models a fully-aligned or a mis-aligned strategy respectively. The test segment of interest is in the middle. The goal is to obtain analytical expressions of the overlap threshold for both arrangement strategies.
B. Nominal Response
An RC model for the coupled segments in Figure 2 is shown in Figure 3. The total response $v_b$ in Figure 3 is a superposition of the direct response due to the primary input $v_a$ and the total noise injected by the secondary inputs $v_1$, $v_2$, $v_3$ and $v_4$ through the coupling capacitances. We first obtain the noise-free response and the analytical expression for the corresponding nominal delay.
1) Drive buffer/Repeater:
The segment driver is a large inverter with minimum length (λ) sized transistors. The pull-up PMOS transistor of size $W_p$ is selected to approximately match the performance of the pull-down NMOS transistor of size $W_n$. If we define the ratio of transistor widths $X_p = W_p/W_n$ and the capacitance per square for a minimum length MOSFET ($C_{ox}$), we can obtain from ITRS data [1], Figure 1 and HSPICE characterization runs the values shown in Table III-A. The gate capacitance ($C_{gate}$), diffusion capacitance ($C_{diff}$), and drive resistance ($R_{drv}$) are then modeled by equations:
The characteristic driver delay ($tD_{drv}$) is given by the model in equation 1.
2) Metal trace: The direct response is obtained from an s-domain analysis of the circuit in Figure 3. The secondary inputs are set to zero, and a unit step is applied to the primary input $v_a$. The product of the total lumped resistance ($R_T$) and the lumped capacitance to ground ($C_T$) is the intrinsic RC constant ($\tau$) of the line segment. The coupling capacitance is defined in terms of $C_T$ and a weighting factor ($\eta$), $C_c = \eta \cdot C_T$. where $\tau = R_T C_T$ and in the denominator
The total resistance $R_T$, shown in equation 4, is the sum of the metal resistance $R_m$ and the driver switching resistance $R_{drv}$. Likewise, the total capacitance to ground for a given segment, shown in equation 5, is the sum of the contributions from the metal and the driver.
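As a rough numerical illustration of these lumped quantities, the following Python sketch computes $R_T$, $C_T$, $\tau$ and $C_c$ for one segment. The function and field names, the split of the ground capacitance into a metal term and a driver term, and all sample values are hypothetical and are not taken from the paper's tables.

```python
from dataclasses import dataclass


@dataclass
class SegmentRC:
    r_total: float   # R_T = R_m + R_drv, in ohms
    c_total: float   # C_T = metal + driver capacitance to ground, in farads
    tau: float       # intrinsic RC constant, tau = R_T * C_T, in seconds
    c_couple: float  # coupling capacitance, C_c = eta * C_T, in farads


def lump_segment(r_metal, r_drv, c_metal, c_drv, eta):
    """Lump one driven segment into the quantities used by the response model."""
    r_total = r_metal + r_drv
    c_total = c_metal + c_drv
    return SegmentRC(r_total, c_total, r_total * c_total, eta * c_total)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
seg = lump_segment(r_metal=200.0, r_drv=300.0,
                   c_metal=20e-15, c_drv=5e-15, eta=2.0)
print(seg)
```

With these made-up inputs, $R_T$ is 500 ohms, $C_T$ is 25 fF, $\tau$ is 12.5 ps and $C_c$ is 50 fF.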
Applying the inverse Laplace transform to $V_{bN}(s)$, we obtain the general, normalized, time-domain signal form
where the constants $G_0$ and $A_0$ are obtained via Padé approximation and coefficient matching of the s-domain polynomials; see Table III-C1. The nominal delay is obtained by solving for the $V_{dd}/2$ crossing point of equation 6.
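The crossing-point calculation can be sketched in Python as follows. This is only an illustration: it assumes a single-pole normalized response of the form $1 - A_0 e^{-t/(G_0\tau)}$, which may differ from the paper's equation 6, and the function names and sample constants are hypothetical.

```python
import math


def nominal_response(t, tau, g0, a0):
    """Assumed normalized noise-free response: 1 - a0 * exp(-t / (g0 * tau))."""
    return 1.0 - a0 * math.exp(-t / (g0 * tau))


def crossing_time(tau, g0, a0, level=0.5):
    """Time at which the assumed response reaches `level` (0.5 is Vdd/2)."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    return g0 * tau * math.log(a0 / (1.0 - level))


# Hypothetical constants: tau = 10 ps, G0 = A0 = 1 (a pure RC step response).
tau, g0, a0 = 10e-12, 1.0, 1.0
t50 = crossing_time(tau, g0, a0)
print(t50, nominal_response(t50, tau, g0, a0))
```

Under this assumed form, the nominal delay reduces to $G_0 \tau \ln(2A_0)$, so the fitted constants rescale the familiar $\tau \ln 2$ delay of a simple RC stage.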
C. Noise Response
The RC model in Figure 3 is easily modified for noise signal extraction by grounding the primary input $v_a$ and driving the secondary inputs $v_1$, $v_2$ and $v_3$, $v_4$ with unit step signals. Note, however, that due to the use of inverting repeaters, the transition direction of $v_3$ ($v_4$) is opposite to that of $v_1$ and $v_2$ and shifted in time by $T$. The injected noise $v_b$ therefore comprises two components, one in-phase and the other counter-phase; see Figure 4(b). The noise transient parameters depend on the segment RC characteristics, the coupling capacitance ($C_c$), and the actual number of switching neighbors ($SW$). For a planar 2D layout with a maximum of two closest neighbors (i.e., $SW$ = 0, 1, 2), the general noise response in equation 8 is obtained from an s-domain analysis of the modified Figure 3 circuit.
If we define a generalized noise pulse response for the variable $t$ and the model constants $\tau$ and $G_1(\eta)$ as $v_\eta(t)$ in equation 9,
then the inverse Laplace transform of equation 8 yields a corresponding normalized, time-domain noise response $v_b$ in equation 10.
In general, $\eta_A = (1 - m) \cdot \eta$ and $\eta_M = m \cdot \eta$. However, focusing on the fully-aligned ($m = 0$) or mis-aligned ($m = 0.5$) case, the model constants $A_1$ and $G_1$ are obtained by substituting $\eta = \eta_A$ or $\eta = \eta_M$ in the expressions in Table III-C1.
1) Noise Duration:
The noise pulse $v_b$ in equation 10 has a last crossing time $z_0$ and a last absolute maximum ($\tilde{n}$) at time $\tilde{t}$. The pulse has a duration ($d_k$) measured in terms of non-zero, integer ($k$) multiples of $G_1\tau$, i.e. for a specified noise limit ($N_{lim}$), and for all integers $k$ larger than
These parameters can be calculated from $v_b(t)$ for any chosen value of $m$.
The constant $T$ is obtained by analyzing the circuit models in Figures 2 and 3.
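One way to picture the duration $d_k$ is as the smallest integer multiple of $G_1\tau$ after which the pulse magnitude stays within $N_{lim}$. The Python sketch below does this numerically. The pulse shape `example_pulse` is a made-up unimodal stand-in and not the paper's equation 10; the function names, the search bound `k_max` and all values are likewise hypothetical.

```python
import math


def example_pulse(t, g1_tau, peak=0.3):
    """Hypothetical noise pulse (normalized to Vdd) that peaks at t = g1_tau."""
    x = t / g1_tau
    return peak * x * math.exp(1.0 - x)


def noise_duration(pulse, g1_tau, n_lim, k_max=50):
    """Return (k, d_k) with d_k = k * g1_tau, where k >= 1 is the smallest
    integer such that |pulse| <= n_lim at every multiple from k up to k_max.
    The pulse is only sampled at integer multiples of g1_tau."""
    k_ok = None
    for k in range(k_max, 0, -1):      # scan from late multiples to early ones
        if abs(pulse(k * g1_tau)) > n_lim:
            break                      # multiple k still violates the limit
        k_ok = k
    if k_ok is None:
        raise ValueError("pulse still exceeds n_lim at k_max * g1_tau")
    return k_ok, k_ok * g1_tau


g1_tau = 10e-12                        # hypothetical G1 * tau of 10 ps
k, d_k = noise_duration(lambda t: example_pulse(t, g1_tau), g1_tau, n_lim=0.05)
print(k, d_k)
```

For this stand-in pulse and a 5% noise limit, the search settles on $k = 5$, i.e. a duration of five $G_1\tau$ intervals.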
Now, if we also express the time shift as $T = j \cdot G_1\tau$, where $j > 0$, then the last maximum value ($\tilde{n}$) of $|v_b(t)|$ can be calculated; see equation 14. In general, signal delay on the middle segment in Figure 2 is defined as the time difference between the last $0.5V_{dd}$ crossing points measured from the signal $v_a$ to $v_b$. Since the signal $v_b$ is a superposition of the direct response and the injected noise, the signal delay is a function of the degree of temporal overlap between them. If we define a variable alpha ($\alpha$) as the offset between the switching events at the inputs of $v_a$ and any adjacent segment, then the signal-to-noise overlap at the output $v_b$, and consequently the signal delay $tD(\alpha)$, can be expressed in terms of $\alpha$. For large enough absolute offset values, the overlap at the output between the transition event of the direct response and the duration of the injected noise is zero. This results in a signal delay that is indistinguishable from a noise-free delay. The smallest absolute offset value for which this condition holds is defined as the offset Overlap Threshold ($\alpha_{OS}$). We can calculate this value by solving for $t$ using the normalized voltage equation 15.
Using an intermediate variable sigma ($\sigma$), we can define a parametric relationship $r(t(\sigma)) = 0.5 - n(\sigma)$. The signal $r(t)$ is the noise-free response from equation 6. The noise signal $n(t)$ is, depending on the design, either the fully aligned or the misaligned noise pulse from equation 10. Solving for $t$ and $\alpha$ in terms of the variable $\sigma$, we obtain the parameterized delay and offset equations 16
For a given design and a specified noise limit $N_{lim}$, the corresponding $d_k$ can be obtained using equation 11. Substituting the values $\sigma = d_k$ and $n(d_k) \le N_{lim}$ into equation 16, we obtain an expression for the overlap threshold for a chosen number ($SW$) of switching neighbors.
In any MSI, regardless of alignment strategy, $\alpha_{OS}$ represents the minimum mutual signal-transition offset between any set of coupled segments that assures nominal signal delay on both segments. For comparative analysis, the 0-90% segment transition time ($\alpha_{SS}$) is derived for simultaneously switched MSI using equation 6 and the constants from Table III-C1. The constant tuple $(G, A)$ for a noise-free nominal transition is chosen as $(G_0, A_0)$. For noisy transitions, $(G_2, A_2)$ and $(G_3, A_3)$ are used for aligned and misaligned MSI respectively.
The worst case segment delay for simultaneous/offset switching considering all coupling noise is shown in equation 19
Using these analytical models, the potential speedup can be estimated for an M5, 0.25 mm long, 6-line (4 signal, 2 grounded dummy), 5-segment MSI, using only metal and drive buffer RC parameters from current/predictive BEOL processes, see [...] across sub-50 nm processes.
IV. MULTIPLE PHASE STAGGERED LATCHING
In this section, we propose a b-bit wide Multiple Phase Staggered Latch (MPSL[b]) interconnect architecture that exploits offset switching to achieve improved crosstalk performance. [...], each individually contains exactly b total latch stages. Specifically, the OT latches are arranged on the j-th bit line such that j and (b − j) latches are placed at the send-side and receive-side respectively. This results in a staggered configuration and effectively achieves offset insertion at the send-side and resynchronization/offset removal at the receive-side. Note that the total number of latches traversed, end-to-end, is exactly equal for every bit position. The parallel MSI bit lines that form the physical connection between the send and receive sides can be arranged in either an aligned or a misaligned configuration.
All latches are two-state, sample/hold, clock level-sensitive latches. The latch control signals are periodic with identical period ($T_{clk}$). However, $T_{clk}$ is sub-divided into multiple phases, and specific clocking signals are generated to operate the MPSL structure. For the IF latches, a two-phase control signal identical to the system clock signal is used to control data ingress and egress. For the OT latches, all stages use a b-phase control signal. To implement offset switching, however, a stage-dependent phase offset is added to the control signals between consecutive OT latch stages, forcing a b-by-1 bit transmission/reception exclusivity across the b-bit wide physical bit lines.
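The staggered placement of the OT latches can be tabulated with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and numbering the bit lines j = 1..b is an assumption, since the text only states that the j-th bit line has j send-side and (b − j) receive-side latches.

```python
def staggered_latch_counts(b):
    """Return (send_side, receive_side) OT latch counts for bit lines j = 1..b.

    Assumes bit lines are numbered from 1; line j gets j latches on the
    send side and b - j on the receive side.
    """
    if b < 1:
        raise ValueError("b must be a positive integer")
    return [(j, b - j) for j in range(1, b + 1)]


layout = staggered_latch_counts(4)
for j, (send, recv) in enumerate(layout, start=1):
    # Every bit line traverses send + recv = b latches end-to-end.
    print(f"bit line {j}: send-side={send}, receive-side={recv}, total={send + recv}")
```

Each line gets a different send-side depth, which is what inserts the offset, while the per-line total stays at b so the bits leave the receive side realigned.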
A. Clocking and Latch Control
At each bit position, the critical latch stage from a timing perspective is the last latch before the X-segment MSI bit line. Therefore, the relationship of clock period T clk to this latch stage, across all bit positions determines the performance of the MPSL interconnect. For a general b-bit design, with i consecutive bits-in-flight (biF ), if the MSI has a maximum bit line delay (tD M ) and a bit-to-bit minimum separation (α M ) at each position we can calculate key parameters. For an X-segment bit line with SW max as the maximum possible number of switching neighbors, we use equation 19, for (w 1 , w 2 , w 3 the worst case segment delay tD max , and with tD drv from equation 1, we obtain the maximum bit line delay.
For the minimum bit-to-bit separation $\alpha_M$, with the clock period $T_{clk}$ divided into b phases (using b = 1 for simultaneous switching), we can write in general a scalar dot product of two vectors $W$ and $\alpha_{SW}$, shown in equation 21
where $\alpha_{SW}$ is the array of offset threshold values, from equations 17 and 18, associated with noise injections from neighboring switching activity. The vector $W$ contains the weight of each threshold value, derived from the statistical distribution of transitions in a data stream. We also obtain that $T_{clk}$ must satisfy equation 22 at the boundary between $i$ and $(i + 1)$ bits-in-flight.
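The weighted separation can be illustrated as a plain dot product in Python. In this sketch the function name, the choice of one entry per switching-neighbor case (SW = 0, 1, 2), the weights and the threshold values are all hypothetical placeholders rather than values from the paper.

```python
def min_bit_separation(weights, alpha_sw):
    """Dot product W . alpha_SW giving the minimum bit-to-bit separation."""
    if len(weights) != len(alpha_sw):
        raise ValueError("weights and alpha_sw must have the same length")
    return sum(w * a for w, a in zip(weights, alpha_sw))


# Hypothetical example: one entry per number of switching neighbors.
weights = [0.25, 0.50, 0.25]          # assumed share of each case in the data stream
alpha_sw = [0.0, 40e-12, 70e-12]      # assumed offset thresholds in seconds
alpha_m = min_bit_separation(weights, alpha_sw)
print(alpha_m)
```

With these placeholder numbers the result is 37.5 ps, dominated by the single-neighbor case because it carries the largest weight.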
The [...] Figure 5 can be distributed to the specific latch stage via a delay-equalized buffer tree network (not shown).
B. Staggered Latch Bus (SLB)
The MPSL implementation of an N-bit bus is the stacked-MPSL [b] , where N is subdivided into b-bit sections, with each assigned to an MPSL [b] . The simplest form is the stacked-MPSL [2] or Staggered Latch Bus (SLB). In this configuration, the LCC is simplified and the signalsφ m and φ b are identical, and likewise the signals φ m and φ b,1 . No additional logic area is required and an explicit LCC is therefore not necessary.
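The subdivision of an N-bit bus into b-bit sections can be sketched in Python as follows. The function name is hypothetical, and requiring N to be an exact multiple of b is an assumption made to keep the sketch simple.

```python
def stack_mpsl(n_bits, b):
    """Split bit indices 0..n_bits-1 into consecutive b-bit sections,
    one section per MPSL[b] instance (assumes n_bits is a multiple of b)."""
    if b < 1 or n_bits % b != 0:
        raise ValueError("n_bits must be a positive multiple of b")
    return [list(range(start, start + b)) for start in range(0, n_bits, b)]


# A 32-bit bus built from MPSL[2] sections, i.e. a staggered latch bus.
sections = stack_mpsl(32, 2)
print(len(sections), sections[0], sections[-1])
```

For a 32-bit bus with b = 2 this yields 16 two-bit sections, which matches the SLB-16 naming used for the 32-bit experiment.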
V. EXPERIMENTS AND RESULTS
In this section, we compare the data transmission error rates of two switching methods, simultaneous switching (SS) and offset switching (OS), over an increasing clock frequency. SS is the traditional strategy widely used in synchronous interconnect design, while OS is based on the MPSL architecture. We present an experimental validation of a 32-bit MPSL in an SLB-16 configuration and analyze the design cost with the aid of various tools. Although the outputs ($\phi_b$ and $\phi_{b,1}$) of a 2-bit long LCC are identical to the logic clocks ($\phi_m$ and $\bar{\phi}_m$) respectively, an explicit LCC (only needed for b > 2) is included in the experiment for completeness. Our approach combines trace data and detailed HSPICE simulations.
For the HSPICE simulations, the MSI setup consists of two planar arrays of 32 closely spaced, parallel, 5-segment bit lines, one array with fully-aligned segments and the other with misaligned segments. Each bit line segment consists of a strip of M5 copper, 0.25 mm long, driven by an optimally sized inverting buffer. Metal sizing, spacing, resistivity and inter-metal dielectric constants are taken from the ITRS forecast [1]. Device model files for the 45 nm Predictive Technology Model (PTM) process [10] are used for the buffer. The wire resistance and ground capacitance are modeled as distributed-π RC sections, with the coupling capacitances between corresponding sections on adjacent segments similarly modeled.
Bit Error Rate (BER) per word versus data clock frequency ($f_{clk} = 1/T_{clk}$) comparisons are performed for a 5-segment, 45 nm MSI and shown in Figure 8(a). Operated in either single or multiple biF mode, an SLB-16 based on the OS scheme shows a 2.5X speed improvement over a similarly sized traditional SS scheme. When MSI-misaligned segments are used, Figure 8(b), we obtain up to a 2.1X speedup compared to misaligned SS. Note that multiple biF (2-biF) modes of operation are possible, which allows support for even higher operating frequencies. The eye diagram in Figure 8(d) illustrates this: it shows an SLB-16 MSI-misaligned bus operated in 2-biF mode with an eye opening of approximately 60% at a data clock frequency of approximately 5 GHz. At similar frequencies, the eye diagram in Figure 8(c) shows the inability of the SS MSI-misaligned bus to match the performance of an OS MSI-misaligned bus.
Scaling the SLB-16 design to the 32, 22, and 16 nm nodes, similar BER versus frequency comparative analyses between the SS and OS schemes were performed. A summary of the results [...]. Although similar as a comparative measure, the difference in nominal values between the simulated average speedup of 2.5X (2.3X) and the predicted values of 2.04X (1.70X) presented in Section III-D is attributable to the constraint imposed by the selection of $N_{lim}$ used in the analytical model. In contrast, the maximum operating frequency reported here in the simulation results indicates the speed $f_{clk}$ at which the BER per word first exceeds zero. Nevertheless, for quick design space exploration, especially across process nodes, the analytical model provides a realistic, efficient speedup estimate for offset-switched MSI designers and EDA tool vendors.
In general, MPSL[b] based designs for b > 2 require an explicit LCC, careful control signal distribution planning, and additional latch hardware and area. This is unnecessary for the MPSL[2] based SLB used in the experiment. Note that, except for the latch rearrangements, the total latch count and control signals in the SLB are identical to the latch count and clock signals, respectively, in a traditional SS bus.
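The frequency metric described here, the clock frequency at which the BER per word first exceeds zero, can be extracted from a frequency sweep as sketched below in Python. The function name and the sweep data are hypothetical placeholders in arbitrary units; they are not simulation results from this work.

```python
def first_failing_frequency(sweep):
    """Return the lowest swept f_clk whose BER per word exceeds zero.

    `sweep` is a list of (f_clk, ber_per_word) pairs in any order.
    Returns None if no swept point shows errors.
    """
    for f_clk, ber in sorted(sweep):
        if ber > 0.0:
            return f_clk
    return None


# Made-up sweeps (frequency in arbitrary units), for illustration only.
ss_sweep = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.01), (4.0, 0.2)]
os_sweep = [(2.0, 0.0), (4.0, 0.0), (6.0, 0.05)]

f_ss = first_failing_frequency(ss_sweep)
f_os = first_failing_frequency(os_sweep)
print(f_ss, f_os, f_os / f_ss)
```

The ratio of the two frequencies gives a speedup figure of the same kind as the simulated values, while an analytical estimate instead depends on the chosen $N_{lim}$.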
VI. SUMMARY AND CONCLUSIONS
In this paper, we explored offset-switched interconnect and its performance, power and area characteristics. We proposed a staggered latch bus as a simple implementation of a more general multi-phase staggered latch interconnect architecture. We performed a comparative analysis with the classical simultaneously switched interconnect. The results show that offset switching in the form of the simple SLB can achieve over 2X improvement in line delay for a given line length and segment size, with no appreciable increase in power and no need for extra wires.
