In nanometer technology nodes, feature sizes shrink rapidly and wires become narrower and thinner, which leads to high RC parasitics, especially resistance. Overall system performance is dominated by interconnect rather than devices. It is therefore imperative to accurately measure and model interconnect parasitics in order to predict interconnect performance on silicon. Although many test structures have been developed to characterize device models and layout effects, only a few are available for interconnects, and these are either unsuitable for real chip implementation or too complicated to be embedded. A compact yet comprehensive test structure that captures all interconnect parasitics in a real chip is needed. To address this problem, this paper describes a set of test structures that can be used to study the timing performance (i.e., propagation delay and crosstalk) of various interconnect configurations. Moreover, an empirical model is developed to estimate the actual RC parasitics. Compared with state-of-the-art interconnect test structures, the new structure is compact and can be easily embedded on die as a parasitic variation monitor. We have validated the proposed structure on a test chip in the TSMC 28nm HPM process. Recently, the test structure was further modified to identify serious interconnect process issues for critical path design in the TSMC 7nm FF process.
Chapter 1. Introduction
With the advent of the big-data era, machine learning has become increasingly powerful in solving problems from various domains, such as face recognition in security screening and high-frequency trading in banking. However, most machine learning algorithms are complex in nature, and software implementations limit their real-time operation. Accordingly, hardware-accelerated and on-chip learning [1] [2] [3] has evolved to enable deep learning applications [4] [5], making use of powerful heterogeneous hardware platforms involving graphics processing units (GPUs), field-programmable gate arrays (FPGAs) and/or networks-on-chip (NoCs). For example, Intel last year released Xeon processors with built-in FPGAs dedicated to data center and learning applications [6] [7].
With the evolution of CMOS technology, the performance of CMOS circuits (including analog [8] [9] [10] [11], mixed-signal [12] [13] [14] [15], RF [16] [17] [18] [19] and digital [1] [2] [3] [4] [5]) has improved profoundly, mainly due to the scaling of semiconductor devices. However, the routing in CMOS circuits can hardly scale [7]. A major bottleneck in these heterogeneous platforms lies in the interconnects between the various system components; as will be demonstrated in Section 2, data latency is the decisive factor in overall performance. Meanwhile, with relentless technology scaling, feature sizes shrink rapidly and wire width and spacing are significantly reduced, resulting in high coupling capacitance. To minimize the crosstalk impact, the wire thickness is also reduced, with the major drawback of high sheet resistance due to the smaller cross-sectional area. The high sheet resistance causes serious issues for Power Distribution Network (PDN) and clock tree design.
Figure 1 shows that the interconnect delay increases drastically with technology scaling: it grows several-fold from the 45nm to the 16nm CMOS process and becomes more severe at 10nm and below.
Figure 1 Delay of 1-mm global interconnect in different technology nodes based on data reported in International Technology Roadmap for Semiconductors (ITRS)
Accordingly, it is crucial to develop an accurate interconnect model for system performance simulation and prediction. Many test structures [20] [21] [22] [23] [24] have been developed to characterize the impact of standard cell architecture, custom datapath, current mirror and I/O pad design. However, only a few test structures focus on the impact of interconnect. In addition, to be useful and practical, they must be calibrated with measured data to establish silicon-to-model correlation [25]. The conventional cross-bridge Kelvin structure [26] [27] needs extra probe pads to directly measure the on-chip parasitic effect, which makes it unsuitable for interconnect monitoring in real chip designs. Simple ring oscillators [28] [29] were later developed to measure the oscillation frequency of various interconnect configurations.
This approach is widely adopted by foundries and fabless design houses to validate interconnect performance. However, it mainly focuses on a single interconnect and ignores the impact of cross coupling between adjacent wires on overall performance. On-chip interconnect monitoring based on a time-to-digital converter (TDC) has also been proposed [30], but it suffers from the non-idealities of the TDC. A compact yet comprehensive test structure that captures all interconnect parasitics is still missing from the literature.
To address this issue, this paper describes a test structure based on a set of enhanced ring oscillator designs. It not only measures the propagation delay of various interconnect structures but also accounts for different crosstalk impacts (i.e. in-phase and out-of-phase crosstalk).
Compared with existing interconnect test structures, the proposed structure measures both interconnect delay and crosstalk impact at the same time, which significantly reduces the test structure area. It is easy to embed in a real chip for interconnect validation during production. A first-order empirical model is also put forward to estimate the RC parasitics for silicon-to-model correlation.
Since the test structure is relatively simple and small, it is easy to implement in real chips to monitor actual RC parasitic variations during manufacturing. We have validated the proposed structure on a TSMC 28nm test chip, and showed its efficacy on our in-house Neuro Processing Unit (NPU).
Chapter 2. Interconnect Structures and Models
In this section, we briefly review the interconnect structures and models as they directly relate to our proposed test structures.
Figure 2 Interconnect model
Figure 2 shows the interconnect model, in which a wire is routed between a top and a bottom layer and is coupled to adjacent wires on its left and right. The capacitances to the neighboring layers are the area capacitance ($C_{a}$) and the fringe capacitance ($C_{f}$). They are further grouped into top ($C_{top}$) and bottom ($C_{bottom}$) capacitance as follows:
where $C_{ta}$ is the top area capacitance, $C_{ba}$ is the bottom area capacitance, $C_{ft}$ is the top fringe capacitance, and $C_{fb}$ is the bottom fringe capacitance.
The total capacitance is defined as the sum of the top, bottom and coupling capacitance ($C_{c}$) as follows:
When wires run parallel to each other, the propagation of the signal (victim) is affected by the adjacent wires (aggressors) through the coupling capacitance, as shown in Figure 3. This effect is called crosstalk. Assuming there is no switching in the top and bottom routing layers, the left and right aggressors introduce noise on the victim through the coupling capacitance ($C_{c}$). The effective capacitance ($C_{eff}$) of the victim can be approximated using the following equations:
Ground capacitance: $C_{g} = C_{top} + C_{bottom}$ (Eq. 4)
Quiet mode: $C_{eff} = C_{g} + 2C_{c}$ (Eq. 5)
In-phase crosstalk: $C_{eff} = C_{g}$ (Eq. 6)
Out-of-phase crosstalk: $C_{eff} = C_{g} + 4C_{c}$ (Eq. 7)
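As an illustration of Eq. 4 to Eq. 7, the following Python sketch evaluates the effective capacitance of the victim in each crosstalk mode. It assumes that the ground capacitance is the sum of the top and bottom capacitance and that the coupling capacitance is weighted by 2, 0 and 4 in quiet, in-phase and out-of-phase operation, respectively. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
# Multipliers applied to the coupling capacitance C_c seen by a victim
# with one aggressor on each side (assumed reading of Eq. 5 to Eq. 7).
COUPLING_FACTOR = {
    "quiet": 2.0,         # both aggressors held static
    "in_phase": 0.0,      # aggressors switch in the same direction as the victim
    "out_of_phase": 4.0,  # aggressors switch in the opposite direction
}


def effective_capacitance(c_top, c_bottom, c_c, mode):
    """Return the victim's effective capacitance for one crosstalk mode."""
    c_ground = c_top + c_bottom  # Eq. 4
    return c_ground + COUPLING_FACTOR[mode] * c_c


# Hypothetical capacitances in femtofarads, for illustration only.
for mode in COUPLING_FACTOR:
    c_eff = effective_capacitance(c_top=0.5, c_bottom=0.5, c_c=1.0, mode=mode)
    print(f"{mode}: C_eff = {c_eff} fF")
```

With these example values the out-of-phase case carries five times the in-phase capacitance, which is why the two modes bracket the quiet-mode delay.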
The crosstalk can be modeled using an equivalent lumped model in the Laplace domain [30]. To simplify the model, the resistance $R$, capacitance $C$ and coupling capacitance $C_{c}$ are assumed to be the same for all branches. $V_{S1}$ and $V_{S3}$ are the aggressor input voltages, while $V_{S2}$ is the victim voltage. The output voltages $V_{A}$, $V_{B}$ and $V_{C}$ are defined as follows:
Eq.10 where
In order to examine the different crosstalk operations, $V_{B}$ is set to a step function (with Laplace transform $V/s$), and $V_{A}$ and $V_{C}$ are either set to zero for quiet-mode operation or to a step function (with Laplace transform $\pm V/s$) of the appropriate polarity to model in-phase and out-of-phase crosstalk. The result is then transformed to the time domain and simplified through a first-order approximation.
Chapter 3. Proposed Test Structure

Ring Oscillator Configuration
In this paper, we propose an enhanced test structure derived from ring oscillators. The overall structure is shown in Figure 4. It consists of an input control unit and three sets of ring oscillators.
The control unit configures the test signal (i.e., victim) and the two left/right adjacent routing signals (i.e., aggressors) for the different crosstalk operations (i.e., quiet mode, in-phase and out-of-phase crosstalk). Finally, the test signal is fed into a down counter that scales down the ring oscillator frequency for oscilloscope measurement.
Figure 4 Interconnect Test Structure
The control unit is further divided into a NAND2 gate and 4-input MUXes. The NAND2 gate controls the propagation of the test signal (victim). When the control signal SEL is low, the NAND2 output is always high, which results in no oscillation. When SEL is high, the NAND2 output is the inverse of the input signal IN, so the signal propagates through the inverter chain and creates oscillation.
The aggressors are controlled by two 4-input MUXes with in-phase crosstalk (00), out-of-phase crosstalk (01), quiet mode (11) and don't use (10) operations. If the select signals SAx and SBx are set to "10", the state is defined as "don't use" (invalid). If they are set to "11", the signal propagates alone through the inverter chain with shielding, while the aggressors are held at one or zero to avoid any crosstalk impact. If they are set to either "00" or "01", the input/output of the NAND gates is fed into the MUXes to toggle the adjacent routing signals, which then travel in the same or the opposite direction as the test signal for the in-phase and out-of-phase crosstalk study.
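The control behavior described above can be summarized in a small behavioral sketch. The Python below is not the circuit itself; it only mirrors the select-code table and the NAND2 gating, and the function names `decode_select` and `nand2` are hypothetical.

```python
# Select codes applied to the two 4-input MUXes that drive the aggressors.
AGGRESSOR_MODES = {
    "00": "in_phase",      # aggressors toggle in the same direction as the victim
    "01": "out_of_phase",  # aggressors toggle in the opposite direction
    "11": "quiet",         # aggressors held at a constant one or zero
}


def decode_select(code):
    """Map a two-bit select code (SAx/SBx) to a crosstalk mode."""
    if code not in AGGRESSOR_MODES:
        # "10" is the "don't use" state; anything else is malformed.
        raise ValueError(f"select code {code!r} is unused or invalid")
    return AGGRESSOR_MODES[code]


def nand2(sel, inp):
    """NAND2 gate: output is high while SEL is low, otherwise the inverse of IN."""
    return 0 if (sel and inp) else 1


print(decode_select("01"))
print([nand2(0, x) for x in (0, 1)])  # SEL low: oscillation disabled
print([nand2(1, x) for x in (0, 1)])  # SEL high: IN is inverted
```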
The spacing between the victim and the aggressors can be adjusted to examine how the crosstalk impact on signal propagation changes with wire spacing.
The proposed test structure is not limited to frequency measurement; an empirical model is also developed to estimate the RC parasitics. It correlates silicon measurements with simulation results, which provides a practical way to identify the source of interconnect mismatch and reduce it.
Empirical Model
Compared with conventional ring oscillator approaches, the proposed test structure not only predicts the interconnect behavior through frequency measurement; it also allows the RC parasitics to be calculated from a set of ring oscillators and correlated with silicon measurements.
The empirical model is derived as follows:
Figure 5 Reference and 2 Fanout RO Stage
Typically, the stage delay is expressed in terms of the supply voltage, average current and load capacitance as
$T_{s} = \dfrac{C_{s} V_{dd}}{I_{avg}}$ (Eq. 11)
It can also be rewritten in terms of the PMOS/NMOS saturation current
where $I_{dp}$ is the PMOS $I_{dsat}$ and $I_{dn}$ is the NMOS $I_{dsat}$.
Each stage switches twice during a complete cycle, and the ring oscillator delay is calculated as follows:
where $f_{d}$ is the ring oscillator output frequency, $n$ is the number of inverter stages and $T_{s}$ is the stage delay.
The ring oscillator delay $T_{osc}$ with output down counter scaling factor $m$ is defined as
$T_{osc} = \dfrac{m}{f_{d}}$ (Eq. 14)
$T_{osc} = 2 n m T_{s}$ (Eq. 15)
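As a worked example of the relation between the measured period and the stage delay, the Python sketch below assumes that Eq. 15 takes the form T_osc = 2·n·m·T_s, i.e., each of the n stages switches twice per cycle and the down counter divides the frequency by m. The function name and all numeric values are hypothetical.

```python
def stage_delay(t_osc, n_stages, divide_by):
    """Per-stage delay T_s recovered from the divided-down period T_osc.

    Assumes T_osc = 2 * n * m * T_s (assumed form of Eq. 15).
    """
    return t_osc / (2 * n_stages * divide_by)


# Hypothetical measurement: 51 stages, divide-by-256 counter, 261.12 ns period.
t_s = stage_delay(t_osc=2.6112e-7, n_stages=51, divide_by=256)
print(f"T_s = {t_s * 1e12:.1f} ps")
```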
The stage capacitance $C_{s}$ can be estimated using Eq. 11:
$C_{s} = \dfrac{I_{eff} T_{s}}{V_{dd}}$ (Eq. 16)
where $I_{eff}$ is the effective current, defined as the difference between the active current ($I_{dda}$) and the leakage current ($I_{ddq}$), i.e., $I_{eff} = I_{dda} - I_{ddq}$.
If $I_{ddq}$ is small enough to be ignored, Eq. 16 is rewritten as
$C_{s} = \dfrac{I_{dda} T_{osc}}{2 n m V_{dd}}$ (Eq. 17)
From Eq. 17, the stage capacitance is calculated using the in-phase crosstalk delay and current measurements rather than the shielded ones, because in-phase switching eliminates the impact of the coupling capacitance.
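The stage capacitance estimate can be illustrated with a short Python sketch. It assumes that Eq. 16 has the form C_s = I_eff·T_s/V_dd with I_eff = I_dda − I_ddq; the function name, the default of zero leakage, and the numeric values are hypothetical.

```python
def stage_capacitance(i_dda, t_s, v_dd, i_ddq=0.0):
    """Stage capacitance C_s from current, stage delay and supply voltage.

    Assumes C_s = (I_dda - I_ddq) * T_s / V_dd (assumed form of Eq. 16).
    Leaving i_ddq at zero corresponds to neglecting leakage.
    """
    i_eff = i_dda - i_ddq  # effective current
    return i_eff * t_s / v_dd


# Hypothetical in-phase measurement: 90 uA active current, 10 ps stage delay, 0.9 V.
c_s = stage_capacitance(i_dda=90e-6, t_s=10e-12, v_dd=0.9)
print(f"C_s = {c_s * 1e15:.2f} fF")
```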
The time delay is calculated from the switching resistance $R_{sw}$ and the stage capacitance $C_{s}$ by
$T_{s} = R_{sw} C_{s}$ (Eq. 18)
It can be simplified as
$R_{sw} = \dfrac{T_{osc}}{2 n m C_{s}}$ (Eq. 19)
Since crosstalk introduces voltage noise on the victim and changes the overall delay and current measurements, the quiet-mode measurement is chosen for the switching resistance calculation.
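A minimal Python sketch of the switching resistance calculation follows. It assumes T_s = R_sw·C_s together with T_osc = 2·n·m·T_s, so that R_sw = T_osc/(2·n·m·C_s); the function name and the numeric values are hypothetical.

```python
def switching_resistance(t_osc, n_stages, divide_by, c_s):
    """Switching resistance R_sw from the quiet-mode period and stage capacitance.

    Assumes T_s = R_sw * C_s and T_osc = 2 * n * m * T_s.
    """
    t_s = t_osc / (2 * n_stages * divide_by)  # per-stage delay
    return t_s / c_s


# Hypothetical quiet-mode measurement with a 1 fF stage capacitance.
r_sw = switching_resistance(t_osc=2.6112e-7, n_stages=51, divide_by=256, c_s=1e-15)
print(f"R_sw = {r_sw / 1e3:.1f} kOhm")
```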
The input gate capacitance $C_{gate}$ and the output interconnect capacitance $C_{int}$ can be further estimated using two sets of ring oscillators with single fanout (FO1) and double fanout (FO2).
The stage capacitance $C_{s}$ can be divided into an input part $C_{in}$ and an output part $C_{out}$, as shown in Figure 5:
$C_{s,FO1} = C_{in} + C_{out}$ (Eq. 20)
$C_{s,FO2} = 2C_{in} + C_{out}$ (Eq. 21)
Solving Eq. 20 and Eq. 21 gives $C_{in}$ and $C_{out}$ as
$C_{in} = C_{s,FO2} - C_{s,FO1}$ (Eq. 22)
$C_{out} = 2C_{s,FO1} - C_{s,FO2}$ (Eq. 23)
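The separation of input and output capacitance can be sketched in Python as follows. The sketch assumes that the FO1 stage capacitance equals C_in + C_out and that the FO2 stage capacitance equals 2·C_in + C_out, which is one plausible reading of the two-fanout arrangement in Figure 5. The function name and the numeric values are hypothetical.

```python
def split_stage_capacitance(c_fo1, c_fo2):
    """Separate the stage capacitance into input and output parts.

    Assumes c_fo1 = C_in + C_out and c_fo2 = 2 * C_in + C_out.
    """
    c_in = c_fo2 - c_fo1       # the extra fanout adds one input capacitance
    c_out = 2 * c_fo1 - c_fo2  # remainder attributed to the output side
    return c_in, c_out


# Hypothetical stage capacitances in femtofarads.
c_in, c_out = split_stage_capacitance(c_fo1=2.0, c_fo2=2.5)
print(f"C_in = {c_in} fF, C_out = {c_out} fF")
```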
With zero or step inputs on the aggressors, the victim output is evaluated through the inverse Laplace transform and expressed in Eq. 24 (in-phase), Eq. 25 (out-of-phase) and Eq. 26 (quiet mode):
Eq. 24
Eq. 25
Eq. 24 to Eq. 26 are further simplified using the Taylor series expansion. Since the RC time constant is quite small, the second- and higher-order terms are ignored for a first-order approximation. Moreover, the coupling capacitance $C_{c}$ is much higher than the capacitance $C$ (the sum of the input gate capacitance $C_{gate}$ and the interconnect top/bottom capacitance $C_{top}$/$C_{bottom}$), so the term $R(C + 3C_{c})$ is rewritten as $3RC_{c}$. In addition, a scaling factor of ½ is applied to convert the lumped model into a distributed one in order to match the simulation results. Finally, the capacitances $C$ and $C_{c}$ are calculated as follows:
where $T_{o}$ is the out-of-phase time delay and $T_{q}$ is the quiet-mode time delay.
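To give a sense of how much the approximation of R(C + 3C_c) by 3RC_c costs, the Python sketch below computes the relative error for a given ratio of C to C_c. The resistance cancels out of the ratio. The function name and the example values are hypothetical and are not measured data.

```python
def coupling_approximation_error(c, c_c):
    """Relative error of replacing R*(C + 3*C_c) with 3*R*C_c.

    R is a common factor and cancels, so only the capacitances matter.
    """
    exact = c + 3.0 * c_c
    approx = 3.0 * c_c
    return (exact - approx) / exact


# Hypothetical case where C_c is ten times larger than C.
err = coupling_approximation_error(c=0.1, c_c=1.0)
print(f"relative error = {err * 100:.1f}%")
```

The error shrinks as C_c grows relative to C, which is the condition stated above for dropping the C term.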
The empirical model can estimate the first order interconnect RC parasitic through simple measurements from our test structure. It is useful to monitor in-die and die-to-die RC parasitic variation and provide feedback to identify the source of variation during real chip production.
Chapter 4. Experimental Results
Test Structure Characterization
Figure 6 Ring Oscillator Design
In order to validate the test structures, four ring oscillators (RO) are implemented and taped out using the TSMC 28nm HPM process as shown in [...] The actual stage delay is highly related to the wire spacing. For single-width, single-spacing configurations, the delay is improved with in-phase and out-of-phase crosstalk. Double spacing reduces the coupling capacitance, so the overall delay is similar among the three test structures. The measured RO time period ($T_{osc}$) and current ($I_{eff}$) results are shown in Table- [...] Through the various RO structures, the RC parasitics can be calculated as shown in Table-2. We compare the data measured from our test structure with that from the state-of-the-art test structure [31]. Note that [31] can only measure quiet mode due to the lack of crosstalk aggressors. The state-of-the-art test structure [31] estimates the total stage capacitance and the switching resistance, but cannot estimate $C_{int}$, $C_{c}$ and $C_{gate}$. Using the new test structures (i.e., in-phase and out-of-phase crosstalk), $C_{int}$, $C_{c}$, $C_{gate}$, $C_{total}$ and $R_{sw}$ can all be calculated.
System Impact
To further validate the efficacy of the proposed test structure in real designs, we apply it as an interconnect monitor to our in-house NPU taped out in a 28 nm process. The NPU implements AlexNet [32] for image recognition. The chip area is 9 mm x 9 mm. It has 7 hidden layers, 650,000 neurons and 60,000,000 weight parameters. The delay profile of the interconnect can vary significantly among chips located in different corners of the wafer. The die photo is shown in Figure 8. As can be seen from the figure, the proposed test structure can be easily embedded by taking advantage of the spare area between the two chips, so the area overhead is minimal.
Figure 8 Simulated total training time (normalized to baseline slow case)
We measured the training time of the NPU in various process corners with different interconnect delays, and the results are depicted in Figure 9 for two chips: the slowest one measured and one that can potentially run faster. Note that we have normalized the total training time with respect to the baseline worst case. Based on our measurements, the performance of the NPU varies significantly with changes in interconnect and transistor delay; the difference can be as large as 30% between the best and the worst conditions. We estimated the parasitic RC of chips in different corners using our on-chip monitor. We also used quiet-mode data to estimate the parasitics that would be reported by the state-of-the-art test structure [31]. With a conservative design methodology, the NPU must run at the slowest clock frequency regardless of the actual interconnect delay, so the training time is always the same as the worst-case baseline. With the state-of-the-art test structure [31], the gate capacitance and interconnect delay can be tracked, but with a large error (~27%). Therefore, the clock rate estimated from the state-of-the-art test structure [31] is still suboptimal, even though it achieves better performance than the baseline (~15% improvement).
With our test structure, we can track transistor and interconnect delay variation accurately, and a suitable clock frequency can be obtained. A performance improvement of 25% is obtained compared with the baseline case.
Chapter 5. Conclusions
Since interconnect RC parasitics play an important role in predicting overall chip performance in nanometer technologies, enhanced test structures are implemented to measure the routing propagation delay and crosstalk for various interconnect configurations. A first-order empirical model is also developed to estimate the RC parasitics. The test structures are simple and small, and can be easily inserted into a real chip to monitor interconnect variation in volume production. Recently, we have modified the proposed test structures to account not only for intra-layer (left/right routing) crosstalk but also for inter-layer (top/bottom routing) crosstalk. Additional device parameters (i.e., source/drain capacitance) are extracted via a minor test structure update. The modified test structures and empirical models are useful for estimating both device and interconnect parameters; they also help us identify serious interconnect process issues for clock tree design in the TSMC 7nm FF process.
