Detecting Signal-Overshoots for Reliability Analysis in High-Speed System-on-Chips
Mehrdad Nourani, Member IEEE and Amir R. Attarha
Abstract-The rising complexity and speed of SoC make it increasingly vital to adequately test the system for signal integrity. Voltage overshoot is one of the integrity factors that has not been sufficiently addressed for the purpose of testing and reliability. Overshoots are known to inject hot-carriers into the gate oxide and cause permanent degradation of MOSFET transistors' performance. This performance degradation creates a serious reliability concern. Unfortunately, accurate parasitic extraction and simulation to detect interconnect problems are very time consuming and very sensitive to the circuit characteristics, and thus are not practical for large SoC.
This paper presents a built-in chip methodology to detect and measure the signal overshoots occurring on the interconnects of high-speed system-on-chips. This built-in test strategy does not require external probing or signal waveform monitoring. Instead, the overshoot detector cells monitor signals received by a core (e.g., from the system bus) and record the occurrence of overshoots over a period of operation. The overshoot information accumulated by these cells can be compressed and scanned out efficiently and inexpensively for final quality grading, reliability analysis, and diagnosis.
Index Terms-High-speed interconnect, hot carrier, reliability loss, system-on-chip, voltage overshoot. 

I. INTRODUCTION
A. Motivation
With the miniaturization of VLSI circuits and the rapid increase in the working frequency (GHz range) of digital SoC, signal integrity has become a major concern for design and test engineers. Although various parasitic factors for transistors can be well controlled during fabrication, the parasitic capacitances, inductances, and their cross-coupling effects on the interconnects are important in the functionality and performance of high-speed SoC.
The SIA technology roadmap [1] predicts a very aggressive progression of technologies to 0.1 μm and beyond. Core-based SoC design strategies help companies appreciably shorten the time-to-market and reduce the design cost of their new products. A core is a highly complex logic block which is fully defined in terms of its behavior, and is also predictable and reusable [2]. Testing the interconnects among these cores has become an important challenge. Specifically, at high frequency, the interconnects must be tested not only for opens and shorts but also for signal integrity.
In recent years, among the factors with adverse effect on signal integrity, such as delay, crosstalk, and overshoot, the least attention has been paid to detecting voltage overshoots. Voltage overshoots play a major role in final performance, reliability, and lifetime of a sub-micron GHz chip [3] and thus should be considered in the testing process. 
B. Voltage Overshoot-Cause and Effects
Distributed and coupling capacitances, and inductances of the interconnect can cause a signal with a very short rise time (e.g., a high-frequency signal) to exceed $V_{dd}$ momentarily. This phenomenon, called overshoot, is shown in Fig. 1.
Overshoots cause delay [4], noise [5], and hot-carrier damage [6]-[8] in MOS transistors. Hot-carrier degradation can have a permanent effect on MOS devices [6], [9]. In sub-micron devices, the increase of the horizontal and vertical electric fields affecting the channel region of MOSFET transistors causes electrons and holes to gain high kinetic energies. These hot-carriers might penetrate the gate oxide and cause permanent changes in the oxide charge distribution, and ultimately degrade the current-voltage characteristics of the transistors [3]. Such performance degradation over time creates a serious reliability concern [3], [10].
References [8], [11]-[13] present empirical and simulation evidence showing the impact of overshoots on transistor reliability and performance. For example, [11] shows that many circuits under hot-carrier attack reach an unacceptable speed degradation over time, e.g., after 10 to 1000 hours. The experimental results in [13] show that not only the number of overshoot occurrences, but also the voltage level of overshoots affects the reliability of VLSI circuits. Similarly, the experimental results in [8] show that an inverter delay increases by up to 49% under severe overshoot occurrence, e.g., a large number of overshoots or a large spike value.
The overshoots and undershoots do not propagate in static CMOS logic [14]. However, damping noise using such gates, e.g., CMOS buffers, on long interconnects is limited because of the performance drawback. Moreover, even for the buses that can be observed directly, it is extremely difficult to see or test the overshoots by probing the pads using test equipment, because the CMOS logic, e.g., buffers used in pads or the parasitic RLC of the probes, suppresses the overshoots [5], [15]. Recently, researchers have addressed issues about crosstalk. Accurate analysis [16], test generation for crosstalk noise [17], [18], and fault modeling [19] are examples.
C. Related Work
Most research related to signal overshoots discusses the physical cause [11], analytic models using parasitic parameters [3], or accurate simulation and analysis [5], [15]. The impact of overshoots on device hot-carrier reliability and delay is also addressed in [8], [11], [12]. To reduce the effect of hot carriers in deep-submicron devices, a hybrid junction was proposed in [20] to optimize the structure in terms of performance, reliability, and manufacturability. The BERT simulator [21] analyzes hot-electron degradation in MOSFET, bipolar, and BiCMOS transistors, and predicts circuit failure-rates due to oxide breakdown and electromigration. BERT works in conjunction with a circuit simulator, such as SPICE, to simulate reliability for actual circuits. Reference [15] proposes a technique to solve the transmission-line equations for accurate simulation of the interconnect between cores. References [14], [22] propose a test-generation procedure for generating test vectors to detect functional errors caused by overshoots, undershoots, and crosstalk. To relate the reliability and lifetime factors, a model was presented in [10] for reliability analysis using a full range of stress conditions and scalability.
D. Contribution and Paper Organization
Voltage overshoot is an integrity factor that has not been sufficiently addressed for the purpose of testing and reliability. This paper focuses on overshoots which are known to have a destructive effect on high speed circuits. The main contribution is to propose an on-chip mechanism to detect the voltage overshoots for high-speed SoC. This built-in test strategy does not require external probing or signal waveform monitoring. Instead, the overshoot-detector-cells monitor signals received by a core, e.g., from the system bus, and record the occurrence of overshoots. The overshoot information accumulated by these cells can be compressed, and eventually scanned out for final test and reliability analysis.
Section II discusses the interconnect model and various aspects of voltage overshoot. Section III analyzes a circuitry for detecting voltage overshoot occurring on the interconnects. Section IV explains the test architecture alternatives to store and read-out the information collected by overshoot detector cells. Section V discusses the experimental results.
II. INTERCONNECT MODEL AND TRACING NOISE
Aggressive progress in deep submicron technology has created concerns regarding the signal-integrity degradation which can lead to low reliability. One of these integrity parameters is voltage overshoot. Table I shows projections from the SIA roadmap [1]. As and decrease, and and increase, it is more likely to encounter signal overshoots and undershoots with larger magnitude relative to [15], [22]. To consider the overshoots/undershoots (also called noise), accurate simulation tools are needed in high frequency systems. These simulation tools use distributed models that run much slower than lumped models because they deal with many parasitic R, L, C values and their couplings [15]. The overshoots on the interconnect, in general, are input-dependent phenomena (similar to crosstalk). Therefore, the occurrence, peak value, additional delay, and other damaging effects (temporary or permanent) cannot be predicted or removed easily.
A VLSI interconnect can be modeled in several ways with various levels of accuracy and computational overhead. Generally, when an interconnect wire becomes sufficiently long, or the circuits become sufficiently fast, the inductance of the wire begins to dominate the delay behavior. Thus, in high-speed interconnect modeling, the inductance should be considered. Even when R, L, and C are all considered, the lumped model cannot demonstrate the real behavior of the interconnects; therefore the distributed model has been proposed.
The distributed RLC model of a wire is the most accurate approximation of its actual behavior [26] . Coupling capacitances are essential factors in accurate modeling of interconnects. Unintentional couplings from the neighboring interconnects appreciably affect the signal integrity, and cannot be ignored in high frequency [15] , [23] , [24] .
There are many good distributed models in the literature [25] , [26] . No specific interconnect model is advocated because the accuracy of the interconnect line model is beyond the scope of this paper. However, for the purpose of reporting the experimental results, the distributed RLC model with coupling capacitances [23] is used. Fig. 2 shows this model. Fig. 3 shows the SPICE simulation [28] results for a 10 mm interconnect line affecting a signal with ns ( GHz) using the lumped and distributed RLC models. SPICE does not reveal noise (overshoots, undershoots, etc.) for the lumped model because this simplistic model does not show the true behavior of a high-frequency signal traveling along the interconnect.
III. DETECTING THE VOLTAGE OVERSHOOT
A. Review of Sense Amplifiers
Sense amplifiers are widely used in memory architectures (both DRAM and SRAM) for performance speedup, power reduction, and signal restoration, e.g., refreshing in DRAM, [29] , [30] . Differential sense amplifiers present numerous advantages over single-ended counterparts, but they directly apply in SRAM memories only in which both and are available [29] . Fig. 4 shows 3 conventional differential sense amplifiers. In all three types, an NMOS transistor is used as a current source (when SE ) and PMOS transistors and are loads. The positive feedbacks (drain-gate connection between and ) allow amplification in these structures. In-depth details are in [29] , [30] .
Traditionally, sense amplifiers were not considered as voltage detectors, which is the goal pursued here. More importantly, to detect (sense) voltage overshoots that exceed $V_{dd}$, the sense amplifier must show specific characteristics. As explained in this paper, these characteristics were found only in the structure shown in Fig. 4(c).
B. A CMOS Overshoot Detector (OD) Cell
The modified cross-coupled PMOS differential sense amplifier was chosen to detect voltage overshoots as pictured in Fig. 5. For simplicity, Fig. 5 shows only 1 bit of an interconnect (point-to-point or bus) between Core i and Core j. The OD cell sits physically near the receiving core and samples the actual signal plus noise received by Core j. SE is connected to test_mode to create a permanent current source in the test mode, and input is connected to to define the threshold level for sensing , i.e., the voltage received in . The inverter, formed by and , stabilizes the voltage levels in the output of OD cell. By adjusting the size of the PMOS transistors ( and ), the DC currents through transistors and are set to different values. Combining this with the feedbacks between PMOS transistors, creates threshold voltages to turn the transistors on or off. Fig. 6 shows signals on input and output (points and ) of the cell to validate the behavior of this overshoot detector cell.
Let be initially low. Transistors and both are in their linear region and behave as "on" switches. Transistors and are in cut-off, thus and . As increases , goes to saturation region and the current through begins increasing. On the other hand, is in its linear region and its current equals the current of . Therefore, when the current in increases, based on the large model of MOSFET transistors [31] , the resistance between drain and source decreases. The transistors stay in their regions, as discussed, until the passes the . After passing , transistors , , are in their linear region, and is in the cut-off. Thus, is forced to "1" by ; therefore . Although the current in the cut-off region is generally negligible, sub-threshold current is important in this circuit. When begins to decrease, of decreases, and lower current flowing through is anticipated. However, is in its cut-off region, and controls the current. When decreases, the channel length also decreases and, eventually, the resistance of the channel increases. This results in a slight voltage increase in the drains of and . Thus, the gate voltage of increases, causing to decrease. This in turn decreases the current of , and decreases the drain voltage (the resistance of channel of current) of and . When passes , the and go to cut-off and linear regions, respectively, making and . The threshold voltage can be regulated based on transistor sizes, which cause different channel resistances and . This analysis, and the waveforms in Fig. 6 , confirm that the OD cell shows a hysteresis (Schmitt-trigger) property which implicitly indicates a (temporary) storage behavior. To verify this, a DC analysis was run on the OD cell to get the hysteresis curve in Fig. 7 . In Fig. 7 , for example, the solid-line curve shows that the switching threshold voltages are and when . 
The hysteresis property is desirable because, using this property, the OD cell not only captures the overshoots but also filters out the signal bounces (noise) before settling down.
The unacceptable level of overshoot can be a matter of reliability debate. Researchers estimated that 10% or more overshoot values ( ) create hot-carriers and thus can lead to permanent damage [25] . A nice feature of the OD cell is that for any , the overshoot threshold ( of hysteresis) can be tuned by changing the layout size of the PMOS transistors (mainly s of and ). This is also shown in Fig. 7 in which 2 sets of transistor widths ( and for and ) and 2 values (3.3 and 2.5 volts) have been used. Analytic analysis [29] or a simulation-based approach can be used for such tuning.
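The hysteresis behavior described above can be illustrated with a small behavioral model. This is only a sketch at the logic level, not the transistor-level circuit of Fig. 5: the class name `ODCellModel`, the helper `overshoot_threshold`, the lower threshold of half the supply voltage, and the sample waveform are all hypothetical choices made for illustration. The upper threshold is set from the 10% overshoot criterion mentioned in the text.

```python
from dataclasses import dataclass


def overshoot_threshold(vdd: float, margin: float = 0.10) -> float:
    """Voltage above which a signal counts as an overshoot (10% criterion)."""
    return vdd * (1.0 + margin)


@dataclass
class ODCellModel:
    """Behavioral (not transistor-level) model of an overshoot detector cell."""
    v_high: float  # upper switching threshold: input above this flags an overshoot
    v_low: float   # lower switching threshold: input below this clears the flag
    out: int = 1   # 1 = no overshoot recorded, 0 = overshoot flagged

    def sample(self, v_in: float) -> int:
        # The output changes only when the input crosses the threshold that is
        # relevant for the current state; this is the hysteresis property.
        if self.out == 1 and v_in > self.v_high:
            self.out = 0
        elif self.out == 0 and v_in < self.v_low:
            self.out = 1
        return self.out


vdd = 3.3
# Assumed thresholds: 10% above the supply for sensing, half the supply for release.
cell = ODCellModel(v_high=overshoot_threshold(vdd), v_low=0.5 * vdd)

# Hypothetical received waveform: rise, overshoot, ringing, fall, rise again.
waveform = [0.0, 3.3, 3.9, 3.5, 3.2, 3.4, 1.0, 3.3]
outputs = [cell.sample(v) for v in waveform]
print(outputs)  # [1, 1, 0, 0, 0, 0, 1, 1]
```

In this sketch the ringing samples (3.5, 3.2, 3.4) do not toggle the output, which mirrors the claim that the cell captures the overshoot while filtering the bounces that follow it.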
IV. TEST MECHANISM
Recording the occurrence of overshoot is a crucial step performed by the OD cells (explained in Section III). Each time that overshoot occurs ( ), the OD cell generates a "0" signal that remains unchanged until drops below . The OD cells are not expensive: each uses 7 transistors, as Fig. 5 shows. The test architecture to read the information stored in the OD cells is a DFT decision which depends on the overall SoC test methodology, testing objective, and cost consideration. Some alternatives are discussed in Sections IV-A to IV-C.
A. Using Compressors
In Fig. 8, a compressor unit is used to compact test information (overshoot occurrence) so that bits of data are compacted into bits. The compression unit is a simple combinational circuit that outputs the total number of 0s (generated by OD cells as an indication of overshoots occurring on the bus) that appear on its inputs. Similar 4 : 2 compressor units have been extensively used in multiplier designs [32]. The circuit in Fig. 8(a) can measure the overshoot occurrences on the -bit bus over 1 data-transfer cycle. The result is stored in a -bit register. If this register becomes a part of the scan chain, its content can be scanned out in cycles. If the overshoots are required to be measured over a period of transfer cycles ( patterns), then a -bit adder can be used as shown in Fig. 8(b). This adder and register can store numbers between 0 and . This is a very pessimistic, worst-case scenario in which all lines are assumed subject to overshoots in all cycles. In reality, a much smaller adder and scannable register can be used to keep the statistics. Such size can be determined based on empirical evidence, or on the threshold level that the chip will be rejected.
B. Using Flip-Flops
Fig. 9(a) demonstrates a flip-flop based test-architecture, which can record the occurrence of overshoot, and transfer it to the output. When at least 1 overshoot occurs, the output of the NAND gate (OD-flag) becomes 1, which is stored in a flip-flop. This architecture only reports the occurrence of overshoot on the bus, and can not detect a particular defective line. For diagnosis, it is proposed to use scannable OD-FF architecture, as shown in Fig. 9(b). In the test-mode, first the overshoot flag signal (OD-flag) is transferred, through the MUX, to the test controller. If an overshoot occurs (OD-flag 1), then the content of OD-FF are scanned out through for further reliability and diagnosis analysis.
The very pessimistic worst-case scenario, in terms of test time, is a case in which all lines are subject to overshoots in all cycles. This situation requires overall cycles for readout. In practice, a much shorter time (e.g., , where ) is sufficient because the presence of defects causing overshoots is quite limited.
C. Using Counters
Fig. 10 shows two other test architectures to collect information about overshoot occurrence. In Fig. 10(a), the output of the -input NAND gate is 1 if at least 1 overshoot occurs on the -bit input port of a core (attached to an -bit bus) for a specific test vector. The output of the NAND gate is connected to the clock line of a counter to record the number of times that the core input is exposed to overshoots. If the overshoot occurrence is measured over cycles ( patterns) then a -bit counter is needed in the worst case.
If the cost is justified, more accurate statistics about overshoot occurrence on each line of the bus can be obtained by assigning each line to a dedicated counter as shown in Fig. 10(b). This architecture is times more costly compared to the architecture in Fig. 10(a), but it keeps much more information about individual lines that can help in testing, diagnosis, and reliability analysis. Grouping lines into groups to use counters is a compromise architecture between the two extremes. Ultimately, the content of the counter(s) can be scanned out through dedicated scan chain or through the main scan chain.
Fig. 11. The fraction of overshoot occurrence for signal set S and various T and l.
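The recording architectures of this section can be compared with a short behavioral sketch. It assumes that each cycle delivers one list of OD-cell outputs (0 meaning overshoot), and that the worst case is every line overshooting in every cycle; the function names, the 4-line example bus, and the register-width calculation are illustrative assumptions rather than the circuits of Figs. 8 to 10.

```python
def count_overshoots(od_outputs):
    """Compressor behavior: number of 0s on the OD-cell outputs in one cycle."""
    return sum(1 for bit in od_outputs if bit == 0)


def od_flag(od_outputs):
    """NAND of all OD-cell outputs: 1 if at least one cell reports an overshoot."""
    return 0 if all(bit == 1 for bit in od_outputs) else 1


def register_width(max_value):
    """Bits needed to hold any integer in the range [0, max_value]."""
    return max(1, max_value.bit_length())


def run_session(cycles, n_lines):
    """Accumulate statistics over a test session.

    total    : adder-and-register style running sum of overshoots
    flagged  : single-counter style count of cycles with any overshoot
    per_line : one counter per bus line
    """
    total = 0
    flagged = 0
    per_line = [0] * n_lines
    for od_outputs in cycles:
        total += count_overshoots(od_outputs)
        flagged += od_flag(od_outputs)
        for i, bit in enumerate(od_outputs):
            per_line[i] += int(bit == 0)
    return total, flagged, per_line


# Hypothetical 4-line bus observed over 3 cycles.
cycles = [[1, 1, 1, 1], [0, 1, 0, 1], [1, 1, 1, 0]]
print(run_session(cycles, n_lines=4))  # (3, 2, [1, 0, 1, 1])

# Worst-case widths for a 16-bit bus over 100 cycles, under the assumptions above.
print(register_width(16))        # 5 bits for a per-cycle count of 0..16
print(register_width(16 * 100))  # 11 bits for a running sum of 0..1600
print(register_width(100))       # 7 bits for a flag counter of 0..100
```

The per-line counters keep the most diagnostic detail at the highest cost, while the single flag counter is the cheapest, which matches the trade-off described in the text.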
V. SIMULATION RESULTS
The number of cores in the SoC and the number of input ports of cores do not influence the overshoot detection process, because the OD cells independently function near core input ports. They do, however, influence the cost of test overhead, e.g., OD cells, counters, and test time, e.g., scan-out time.
A. Simulation of the Interconnect
The experimental results here are reported for a 2-core DSP processing SoC communicating through a 16-bit high-speed bus. The results, all obtained using SPICE [28] , are reported for 3 important factors that affect overshoots: 1) rise time (implies average frequency), 2) the number of RLC segments (implies the wire length), 3) the input sets (implies signal coupling).
In these experiments, the 2 cores are allowed to interact (as in normal mode) through the bus for 500 cycles (500 patterns transferred from core 1 to core 2 or vice versa). The overshoot statistics are read after every 100 cycles. Thus there are statistics for 5 signal sets, each of size 100 vectors. To be conservative in recording the worst case (maximum occurrence) of overshoots, the counting structures of Figs. 8(b) and 10(a) were used with -bit adder and -bit counter, respectively. The results are shown in Figs. 11 and 12. Fig. 11 is a histogram showing the overshoot percentages: the number of times (out of 100) that the receiving core detects overshoot in at least 1 (out of 16) lines. In Fig. 11, the overshoot percentages are reported for various rise-times (0.25 to 1.0 ns) and wire lengths (5, 10, 20 mm) for the same set of signal data ( ). As anticipated, for shorter rise times and longer wire lengths, the overshoot occurrence increases. Fig. 12 shows the overshoot fraction for a fixed rise time ( ns) and wire length (10 RLC segments corresponding to about 10 mm) for 5 signal sets ( -) traveling through the interconnect. As shown, due to the effect of coupling between adjacent lines, some signal sets ( and ) cause many more overshoots than the other 2 signal sets.
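The overshoot percentage reported per signal set can be computed as sketched below. The sketch assumes one flag per cycle (1 if at least one of the lines shows an overshoot) and a set size of 100 cycles; the function name and the flag data are made up for illustration and are not the measurements behind Figs. 11 and 12.

```python
def overshoot_percentages(flags, set_size=100):
    """Percentage of cycles with at least one overshoot, per signal set.

    flags    : per-cycle flag values, 1 if any line overshoots in that cycle
    set_size : number of cycles (patterns) in each signal set
    """
    if len(flags) % set_size != 0:
        raise ValueError("number of cycles must be a multiple of set_size")
    return [
        100.0 * sum(flags[start:start + set_size]) / set_size
        for start in range(0, len(flags), set_size)
    ]


# Hypothetical 500-cycle session split into 5 signal sets of 100 cycles each.
flags = ([1] * 30 + [0] * 70 +
         [1] * 10 + [0] * 90 +
         [1] * 45 + [0] * 55 +
         [1] * 5 + [0] * 95 +
         [1] * 20 + [0] * 80)
print(overshoot_percentages(flags))  # [30.0, 10.0, 45.0, 5.0, 20.0]
```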
B. Test Overhead
The second experiment was to show the effect of overshoot on an overall system. Five main buses (data, address, control, and two internal) of the well-known 8051 microprocessor [33] were analyzed. In the implementation, the 7 cores communicate through these buses and are potentially subject to overshoot at high frequency. For experimental purposes, it was presumed that the 8051 system runs at 1 GHz. This is not the actual working frequency of the 8051. A generic 8051 architecture model was used to obtain the patterns traveling on its interconnects. Then, the same patterns were applied to the interconnects as when they Table II summarizes the statistics, and shows that the average occurrence of overshoots in a presumed 1 GHz 8051 system is 32.44%, which would cause severe damage to the chip over time. Table II Fig. 10(b) ] are more than the other 3 architectures, because they use wider components to keep the overall statistics of overshoot occurrence in all wires through the entire test session. Among these 3 architectures, the Multiple Counter architecture is the most expensive. In terms of test time for readout, the capture time is the same for all architectures, because all use the same number and type of OD-cells. However, the OD-FF MUX architecture is the slowest.
To quantify the formulas in Table III , the statistics are shown for the 16-bit address bus in 8051 reported by SYNOPSYS design analyzer toolset [34] when 100 patterns are applied. All costs are expressed in terms of 2-input NAND gates. Table IV summarizes the results.
C. Effect of Process Variation
Variation of factors in the fabrication process can cause considerable deviation from the nominal or anticipated behavior of the circuit. For example, due to the limited resolution of the photolithographic process, the transistor dimensions ($W$, $L$) on die can be different from the ideal dimensions anticipated. Other process parameters include , , impurity concentration or densities, oxide thicknesses, and diffusion depths [29]. Specifically, for high-speed circuits it is essential to analyze the sensitivity of the circuit with respect to various process variation factors. Sections V-C-1 and V-C-2 show some of the experimental results on the effect of process variations on the interconnect as well as the OD cells. A Gaussian distribution with is used to generate variations in such factors. Thus, the change of the nominal value remains in the range [35]. The is the result of simulation without any variation of parameters. The value for is selected to keep the random parameter values between lower and upper bounds. In practice, such bounds depend on the statistical data extracted for each fabrication plant [36].
1) Effect on the Interconnect: The CODAC [27] and TISPICE [28] tools are used in a Monte-Carlo simulation environment to model and simulate the repercussion of process variations on the interconnects. In each iteration, the interconnect is simulated accurately using a randomly-chosen value for that specific process factor.
The Monte-Carlo simulation environment applied many patterns (often more than 1000 random patterns) and counted the total number of overshoots, undershoots, and settling times that exceed the nominal (average) values substantially (15% or more). The results are in Table V; the last column ( ) shows the fraction of situations in which the values are significantly different compared to the nominal values, and thus are not acceptable. As discussed in Section IV, in addition to the number of overshoot occurrences, the peak voltage of overshoots also affects chip reliability. Considering the adverse effects of process variations, the peak values of 24.7% of the overshoots are significantly greater than the peak nominal values. This information can be used to estimate the degradation of chip lifetime [10]. Such errors can be in terms of level of voltage (undershoots that cause the signal to leave the noise margin area) and timing (settling time which is at least 10% larger than the nominal value). As shown in Table V, 6.7% of the undershoots and 4.7% of settling times exceed the tolerable range because of process variation effects. This simulation provides one more reason why detecting the overshoots on the interconnect is important. Due to their probabilistic and environment-dependent nature, process variations cannot be fully considered or modeled in the design phase. Thus, after fabrication, a significant fraction of overshoots, violations of the noise margin, or excessive settling times might appear on long interconnects. These situations can only be tested using an on-chip approach such as the one proposed in this paper.
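The Monte-Carlo procedure described above can be outlined in a few lines. This is a minimal sketch and not the CODAC/TISPICE flow: the clamp at three standard deviations, the function names, the nominal values, and the linear `toy_peak_voltage` response (a stand-in for a real circuit simulation) are all assumptions introduced for illustration.

```python
import random


def sample_parameter(nominal, sigma, rng, n_sigma=3.0):
    """Draw a Gaussian-perturbed parameter, clamped to nominal +/- n_sigma * sigma."""
    value = rng.gauss(nominal, sigma)
    lower = nominal - n_sigma * sigma
    upper = nominal + n_sigma * sigma
    return min(max(value, lower), upper)


def fraction_exceeding(values, nominal, tolerance=0.15):
    """Fraction of values that exceed the nominal value by `tolerance` or more."""
    limit = nominal * (1.0 + tolerance)
    return sum(1 for v in values if v >= limit) / len(values)


def toy_peak_voltage(inductance, nominal_inductance, nominal_peak):
    """Hypothetical response: peak voltage scales linearly with inductance.

    A real flow would run a circuit simulation here instead.
    """
    return nominal_peak * inductance / nominal_inductance


rng = random.Random(0)           # fixed seed so the run is repeatable
nominal_l = 1.0                  # assumed nominal inductance (arbitrary units)
sigma_l = 0.1                    # assumed standard deviation
nominal_peak = 3.6               # assumed nominal overshoot peak in volts

peaks = [
    toy_peak_voltage(sample_parameter(nominal_l, sigma_l, rng), nominal_l, nominal_peak)
    for _ in range(1000)
]
print(fraction_exceeding(peaks, nominal_peak))
```

The printed fraction depends entirely on the assumed distribution and toy response, so it should not be compared with the figures in Table V.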
2) Effect on the OD-Cell: To show the sensitivity of the OD-cell with respect to process variations, the OD-cell was simulated with variations of the parameters. Using CODAC [27] , a Monte-Carlo simulation environment was used within TISPICE to determine the effects of variations of , , , and ) on the behavior of the OD-cell. Table VI shows the results. A Gaussian distribution with is applied to generate variations of the factors. The last column in Table VI ( ) shows the fraction of unacceptable outputs compared to the output of OD-cell with no parameter variation. As shown in Table VI , the OD-cell can adequately tolerate variations in and length of transistors. However, the adverse effects of deviations of the threshold voltage and transistor width are larger. This simulation shows that the sensitivity of the OD-cell to these important process parameters, even with pessimistic assumption of , remains reasonable. From a practical point of view, accurate values of these factors and inclusion of other process parameters need to be evaluated using the real statistics of the fab environment.
