Abstract. Statistical process variations are a critical issue for circuit design strategies that aim to ensure high yield in sub-100nm technologies. In this work we investigate the variability of flip-flop race immunity in 130nm and 90nm low power CMOS technologies. An on-chip measurement technique with a resolution of ~1ps is used to characterize hold time violations of flip-flops in short logic paths, which are generated by clock-edge uncertainties in synchronous designs. Statistical die-to-die variations of hold time violations are measured in various register-to-register configurations and show overall 3σ die-to-die variations of 12-16%. Mathematical methods to separate the measured variability into systematic and random components are discussed, and the results are presented. They show that while systematic variability is the major issue in 130nm, it is significantly decreased in 90nm technology due to better process control. Another important point is that the race immunity decreases by about 30% in 90nm, showing that smaller clock skews can lead to violations in 90nm. Normality tests to check whether the variability follows a normal (Gaussian) distribution are also presented.
Introduction
Modern synchronous digital designs necessarily include a large number of flip-flops (FFs) in pipeline stages to improve data throughput. FF timing is determined by the CLK-Q propagation time, setup time and hold time. Complying with the specified setup and hold times is a prerequisite for stable sampling of the data signal around the clock edge. Due to the increasing relevance of process, voltage and temperature variations for robust circuit operation in modern CMOS technologies on the one hand, and the frequent use of FFs in microprocessors, DSP cores and dedicated hardware on the other hand, a precise statistical characterization of FFs is mandatory. This has motivated investigations of the variability of the FF propagation time using Monte Carlo simulation [1]. Statistical variations of setup and FF propagation times in critical paths are essential for maximum chip performance. In contrast, a violation of the hold time in short FF-logic-FF paths leads to complete chip failure. In this case, races in short pipeline stages are generated by a combination of clock skew and jitter between sending and receiving FFs, and process variations within the circuits. The internal race immunity is a figure of merit that characterizes the robustness of a FF against race conditions and is defined as the difference between clock-to-Q delay and hold time. Hence, the race immunity strongly depends on the specific FF type [2].
Scan chains for DFT schemes [3] are especially sensitive circuit structures since no logic is placed between the FFs. Several techniques for the diagnosis of hold time failures in scan chains [3] [4] [5] [6] as well as in generic short logic paths [7] have been proposed. These techniques are applied for buffer insertion, i.e. hold time fixing, to increase the delay of these paths during chip design [8]. However, depending on the design and FF properties, without a detailed analysis of the critical clock skew and process variability, the extra delay introduced during hold-time fixing can be overestimated or underestimated. In this work, we therefore present a statistical analysis of the race immunity in several test paths under process variability in 130nm and 90nm CMOS technologies. The experimental data is obtained using a precise on-wafer measurement technique with ~1ps resolution. This measurement technique was presented in [9] for a 130nm CMOS technology and is here transferred to 90nm CMOS to facilitate a comparison between both technologies.
Test Circuit and Timing Issues
To evaluate the impact of statistical variations on hold time violations, four different logic paths are considered. The two basic configurations are simple pipeline stages with two master-slave edge-triggered FFs and no logic between them, similar to one stage of a scan chain. Further pipelines, which include six small inverters between the FFs, represent short logic paths. The FFs used in this work are conventional rising edge-triggered master-slave FFs composed of CMOS transmission gates in the forward propagation path and C²MOS latches in the feedback loops [10], with typical library extensions such as input and output node isolations and local clock buffers.
For each configuration, a version with the weakest FF of the standard cell library, i.e. smallest transistor sizes and hence largest sensitivity to process variations, and a version with 8x increased driving strength are used. Comparing the results of both versions makes it possible to analyze the impact of different transistor dimensions on the variability. The inverters used in both versions are of the minimum size, since these configurations represent typical non-critical paths where large driving capability is not required.
To emulate clock uncertainties, the sending and receiving FFs are controlled by different clock signals. The clock signal CLK2 of the receiving FFs is generated from the launching clock CLK1 of the sending FF by a programmable delay line, as shown in fig. 1. If this artificial clock skew is large enough, i.e. CLK2 arrives after CLK1 and the skew exceeds the internal race immunity tCLK-Q − tHOLD of the FF, a race is produced. The race is detected if the outputs of both FFs have the same value at the same time (Q1(t)=Q2(t)). The violation can be detected by initializing the FFs with opposite values and applying a pulse to the data input, as shown in fig. 2. As long as Q1(t)≠Q2(t), the pipeline operates correctly.
The probability of a hold time violation is determined by the FF race immunity (which is inherent to the FF type and size used in the design), the maximum clock skew found in the circuit, and process variations. If the clock uncertainty is very well controlled and the race immunity is large enough, process variability plays a minor role, but this is not the case for the majority of semicustom designs that have to meet a short time-to-market. Usually, the clock uncertainty and race immunity are of about the same order of magnitude.
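The race condition described above can be written as a simple comparison. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, and the extra `t_logic` term (delay of the logic between the FFs, assumed to add directly to the race immunity of the path) is an assumption based on the standard hold-time constraint. All times are in the same unit, e.g. picoseconds.

```python
def race_immunity(t_clk_q, t_hold):
    """Internal race immunity of a flip-flop: clock-to-Q delay minus hold time."""
    return t_clk_q - t_hold


def has_hold_violation(clock_skew, t_clk_q, t_hold, t_logic=0.0):
    """Return True if the skew (CLK2 arriving after CLK1) produces a race.

    t_logic is the delay of any logic between the two flip-flops.
    Assumption: it simply adds to the race immunity of the path.
    """
    return clock_skew > race_immunity(t_clk_q, t_hold) + t_logic
```

With hypothetical values of 120 ps clock-to-Q delay and a hold time of -10 ps, the race immunity is 130 ps, so a skew of 140 ps causes a violation in a path without logic, but not in a path with 60 ps of additional inverter delay.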
Measurement Scheme
To specify the critical clock skew producing a hold time violation, the artificial skew is programmable over a wide range of 80 steps corresponding to a resolution of ~1ps. The programmable delay line is composed of two inverters, and 80 NMOS/PMOS gate capacitances as load elements connected to the inverters via pass transistors. Using capacitances as programmable electrical fan out elements is advantageous since a sub-gate delay resolution is achieved. The capacitances and transistors have been carefully designed to be able to achieve steps of the desired resolution.
Programming is done using an 80-stage shift register to control the inputs of the pass transistors. For coarse-grain clock skew shifting a multiplexer to enable or disable a further buffer chain is added. It is needed because the versions with 0 or 6 inverters have very different critical clock skews. Fig. 3 shows the implemented circuit. To measure the absolute time produced by a specific setting of the programmable delay line, it is additionally placed in the middle of a ring oscillator. The ring oscillator is connected to an 11-stage frequency divider to monitor the output frequency. Thus, it is possible to determine the programmed delay based on measuring and comparing the frequencies achieved with different numbers of capacitances. Fig. 4 shows the final layout of the different circuits in the 130nm CMOS technology.
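The conversion from measured frequencies to programmed delay can be sketched as follows. This is an illustrative calculation under stated assumptions, not the authors' calibration code: each of the 11 divider stages is assumed to divide by 2, and one oscillation period is assumed to contain one rising and one falling pass through the delay line, both slowed equally by the added capacitances. The function names are hypothetical.

```python
DIVIDER_STAGES = 11  # from the text: 11-stage frequency divider


def ring_period(f_measured_hz, divider_stages=DIVIDER_STAGES):
    """Period of the ring oscillator itself, undoing the frequency divider.

    Assumption: every divider stage divides the frequency by 2.
    """
    return 1.0 / (f_measured_hz * 2 ** divider_stages)


def programmed_delay(f_reference_hz, f_setting_hz):
    """Extra delay per edge of a delay-line setting relative to a reference.

    Assumption: the period difference is shared equally between the rising
    and the falling transition through the delay line, hence the factor 2.
    """
    return (ring_period(f_setting_hz) - ring_period(f_reference_hz)) / 2.0
```

For example, if the ring period grows from 1.000 ns to 1.002 ns when one more capacitance is connected, this sketch attributes a delay step of 1 ps to that setting.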
For the measurement, the settings for each combination of the 80 capacitances are first written into the shift register. Then the frequencies of the ring oscillator on each die are measured for all configurations to calibrate the programmable skews and to eliminate the impact of systematic variations on the measurement accuracy. To measure the delay variations of the logic path, the delay line is initialized with minimum delay, and the delay is increased stepwise until a violation in the pipeline is detected. The corresponding delay estimated from the ring oscillator measurements is the critical clock skew for the given die and operating conditions. The procedure is repeated for each of the 4 test circuits for both rising and falling input transitions.
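The stepwise search for the critical clock skew can be sketched as a simple sweep. This is a minimal illustration, assuming that the calibrated delays are ordered from the smallest to the largest setting; `violates` is a hypothetical callback standing in for programming the delay line and checking whether Q1(t)=Q2(t) occurs.

```python
def find_critical_skew(calibrated_delays, violates):
    """Sweep the delay-line settings and return the critical clock skew.

    calibrated_delays: delay per setting, taken from the ring oscillator
        calibration of the same die (assumed sorted in increasing order).
    violates: hypothetical callback; violates(setting) is True if the
        pipeline shows a hold time violation at that setting.
    Returns the calibrated delay of the first violating setting, or None
    if no setting in the range produces a violation.
    """
    for setting, delay in enumerate(calibrated_delays):
        if violates(setting):
            return delay
    return None
```

If no violation is found within the fine-grain range, the text suggests switching the coarse-grain buffer chain and repeating the sweep.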
Separation of Systematic and Random Residual Variations
With the measurement technique discussed above, it is possible to measure the overall variability on the wafer. However, a deeper analysis requires mathematical transformations of the obtained data. Several methods to separate the different components of the variability are described in the literature [11, 12]. In this work, we focus on how to separate the data into systematic (over the wafer) variability and residual (within-die, local, or residuals due to imperfections in the measurement) variability.
A simple but widely used method is the moving average. In this method, the measured value of each die is replaced by the average of the value of the die itself and the values of the neighboring dies. If the number of dies is large, the averaging window can be expanded. We analyze the results using a 3x3 window (the die with its directly adjacent neighbors) and a 5x5 window (with neighbors up to 2 dies away). The drawback of this method is some deterioration at the wafer borders, since not all neighbors are available for the average.
Another common method is curve fitting. In this method, we take the measured data and apply a linear regression to find the curve that best approximates it. The curve can be a paraboloid, a plane, a Gaussian, or many others, depending on specific issues of the fabrication process. This method is more complex and computationally more intensive.
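The moving average separation can be sketched in a few lines. This is an illustrative implementation, not the one used for the results: the wafer map is represented as a dictionary from die coordinates to measured values, and border dies are averaged over whichever neighbors exist, which is one possible way to handle the missing neighbors mentioned above.

```python
def moving_average_map(wafer, radius=1):
    """Systematic component: average each die with its available neighbors.

    wafer: dict mapping (x, y) die coordinates to a measured value.
    radius=1 gives a 3x3 window, radius=2 a 5x5 window.
    Assumption: missing neighbors at the border are simply left out.
    """
    systematic = {}
    for (x, y) in wafer:
        window = [wafer[(x + dx, y + dy)]
                  for dx in range(-radius, radius + 1)
                  for dy in range(-radius, radius + 1)
                  if (x + dx, y + dy) in wafer]
        systematic[(x, y)] = sum(window) / len(window)
    return systematic


def split_variability(wafer, radius=1):
    """Return (systematic, residual) maps; residual = measured - systematic."""
    systematic = moving_average_map(wafer, radius)
    residual = {die: wafer[die] - systematic[die] for die in wafer}
    return systematic, residual
```

The standard deviations of the two returned maps can then be compared to judge how much of the total variability is systematic and how much is residual.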
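As an illustration of curve fitting, the sketch below fits a paraboloid z = a + b·x + c·y + d·x² + e·x·y + f·y² to a wafer map by ordinary least squares. Because the model is linear in its coefficients, a linear regression is sufficient. The choice of basis, the normal-equation solver and all function names are assumptions made for this example; the paper does not specify its implementation.

```python
def solve_linear(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def basis(x, y):
    """Terms of the paraboloid model for one die position."""
    return [1.0, x, y, x * x, x * y, y * y]


def fit_paraboloid(wafer):
    """Least-squares coefficients [a, b, c, d, e, f] via the normal equations."""
    rows = [basis(x, y) for (x, y) in wafer]
    values = list(wafer.values())
    k = 6
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve_linear(ata, atb)


def evaluate(coeffs, x, y):
    """Fitted (systematic) value at die position (x, y)."""
    return sum(c * t for c, t in zip(coeffs, basis(x, y)))
```

The fitted surface plays the role of the systematic component, and the difference between each measured value and `evaluate(coeffs, x, y)` is the residual. A plane fit corresponds to keeping only the first three basis terms.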
Normality Tests
To evaluate the randomness and check whether the measured variability data follows a normal (Gaussian) distribution, mathematical normality tests were performed on the results. Several tests are designed to check normality [13]. Although our data has a strong systematic component and is clearly not normally distributed, the random residual component obtained after the separation between systematic and random residual variability described in the previous section may still be normally distributed.
The first test applied to the data was the Shapiro-Wilk test (also written Wilks-Shapiro, W-S). It returns a number called the p-value, which lies between 0 and 1. The larger this number, the more compatible the data is with a normal distribution. If the p-value is larger than 0.05, the hypothesis of a normal distribution is not rejected at the 95% confidence level.
Another common test is the Anderson-Darling (A-D) normality test. The result of this test is a number larger than 0. Here, the smaller this number, the more compatible the data is with a normal distribution. A value smaller than 0.787 is taken to mean that the hypothesis of a normal distribution is not rejected at the 95% confidence level. The A-D normality test is a modification of the Kolmogorov-Smirnov (K-S) test and gives more weight to the tails than the K-S test.
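For illustration, the A-D statistic can be computed directly from its textbook definition. The sketch below is not the DataPlot implementation used for the results: it estimates mean and standard deviation from the sample, applies no small-sample correction, and compares the result with the 0.787 threshold quoted above. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(data):
    """A-D statistic A^2 against a normal distribution fitted to the data.

    Mean and standard deviation are estimated from the sample itself.
    """
    xs = sorted(data)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))
    cdf = [dist.cdf(x) for x in xs]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
            for i in range(n))
    return -n - s / n


def looks_normal(data, threshold=0.787):
    """True if A^2 is below the threshold quoted in the text (95% level)."""
    return anderson_darling_normal(data) < threshold
```

A sample built from evenly spaced normal quantiles passes this check, while a strongly bimodal sample (half the values at 0, half at 1) fails it clearly.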
An alternative way to check the normality is to calculate the kurtosis and the skewness of the data. Kurtosis is based on the size of a distribution's tails. A kurtosis of about 3 means a distribution very close to a normal distribution. Skewness is the measure of the asymmetry of the distribution. A normal distribution should be symmetrical and present a skewness value equal to 0.
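Skewness and kurtosis follow directly from the central moments of the data. The sketch below uses the population (biased) definitions and the non-excess form of kurtosis, so that a normal distribution gives a value of 3 as stated above; other software may apply bias corrections or report excess kurtosis (kurtosis minus 3).

```python
def skewness_and_kurtosis(data):
    """Return (skewness, kurtosis) from the central moments of the data.

    Uses population moments (division by n) and non-excess kurtosis,
    so a normal distribution has skewness 0 and kurtosis 3.
    """
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n
    m3 = sum((x - mu) ** 3 for x in data) / n
    m4 = sum((x - mu) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```

For the small symmetric sample `[1, 2, 3, 4, 5]`, the skewness is 0 and the kurtosis is 1.7, reflecting tails that are much lighter than those of a normal distribution.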
For the measured data, these tests were performed for all test circuits, using the total data as well as the separated systematic and random parts. The software used for the tests was DataPlot from NIST/Sematech [14].
Experimental Results
The circuits are fabricated in 130nm and 90nm low power CMOS technologies using regular-VT core devices. For the 130nm CMOS technology, 182 chips are measured on one wafer, while only 36 dies are available in 90nm CMOS due to a larger reticle size. Nominal supply voltages are Vdd = 1.5V for 130nm CMOS and Vdd = 1.32V for 90nm CMOS. The temperature was 25°C in both cases.
First, the variability of the ring oscillator frequency over the wafers is analyzed, with different results for the two technologies (fig. 6). The 130nm wafer shows a typical global wafer variation with slower dies in the center of the wafer, while in 90nm the distribution seems to be more random, with smaller systematic variability, probably due to the larger reticle size and a better controlled manufacturing process. The frequencies are normalized to omit confidential technology data. The faster circuits achieve resolutions below 1ps, while none of the chips had a resolution coarser than 1.2ps. It is important to note that the 90nm wafer was a test wafer and not a production wafer. The systematic variability was further reduced before the technology entered production, although the test wafer already showed an improvement in systematic variability compared to 130nm.
The σ deviation of the delay can be up to 5% of the nominal value. The critical skews are in the range of the clock skew that can be expected in circuits using the same technology, showing that these statistical effects have to be considered during hold-time fixing at the end of the layout generation. It is important to note that with larger FFs the absolute variation of the critical skew decreases, but the relative value remains similar, since these circuits are faster. This indicates that larger FFs have an increased probability of violation, since the clock skew needed to provoke the failure is smaller.
The test circuits with extra inverters show, as expected, a larger absolute variability but a smaller relative variability. This indicates either that the FFs are more sensitive to process variations than the inverters, or that a larger number of inverters averages out the variability.
Another important point is that the master-slave FFs used in the experiment typically have a small or even negative hold time, and consequently a larger race immunity. If the experiments were repeated with faster FFs that are used in high-speed designs and have larger hold times, the results would be even more critical.
The impact of the supply voltage on the race immunity variability was also analyzed. Figure 8 shows the average and the standard deviation of the race immunity for all 8 test path combinations in the 130nm technology. The average race immunity more than doubles when the supply voltage changes from 1.5V to 0.9V, which is a typical operating range for SoCs with dynamic voltage scaling. If the clock skew also more than doubles for the same voltage drop, the probability of a hold time violation increases. However, if the clock skew less than doubles, the probability decreases. The standard deviation, on the other hand, increases from almost 5% to almost 7%. Since it is expressed relative to the average, the variability increases more strongly than the average.
Figure 9 shows a graphical comparison of the race immunity found for both technologies. The race immunity decreases by about 30% from 130nm to 90nm. This is an expected value, since it corresponds to the speed-up from one technology to the other. However, it is much more difficult to scale the clock skew by the same percentage. This shows that the problem of hold time violations becomes more critical, and that the clock skew and variability must be better controlled in newer technologies.
The next step in the analysis was to apply the separation methods described in the previous section to the RO frequency variability. Three methods were compared: moving average with a 3x3 window, moving average with a 5x5 window, and curve fitting. Figure 10 shows the curves obtained for the 130nm wafer. In 130nm, the curve obtained was a paraboloid, which could already be observed in the original data.
In the 90nm wafer, however, the original data was very random and it was difficult to see any systematic dependence, but the mathematical methods revealed a slightly inclined plane, with the ring oscillator frequency increasing slightly from one side of the wafer to the other.
Comparison of Average
Regarding the numerical results, the standard deviation calculated with the 3x3 moving average method was very close to the one found with the curve fitting method. However, the 5x5 moving average method produced results that differed by more than 20% from the other two methods, always decreasing the systematic variability while increasing the random residuals. This shows that a 5x5 window may be too large for the available data, masking part of the systematic variability and, in particular, leading to a deformation at the corners.
Based on these results, we decided to continue the analysis using only the 3x3 moving average method, due to its simplicity and the close agreement of its results with curve fitting. The final step was to apply the method to the data obtained for the critical clock skew distribution in all circuit configurations. Table 2 shows the total measured variability, and the systematic and residual variability calculated with the method. The results show that the systematic variability is dominant in 130nm technology. The residual variability, probably influenced by the local and within-die variability, is expected to become much more important in 90nm and 65nm CMOS technologies, while the systematic variability will probably decrease due to better process control. Table 2 also shows that using larger transistors (stronger FFs) and inverters between the flip-flops decreases the residual variability. This may result from a decrease in the local and within-die variability in larger transistors, and from the averaging effect that occurs when a larger number of gates (inverters) is used.
The next step was to apply the normality tests to the data set. All 8 test paths and also the ring oscillator frequency were tested, using the methods described previously. The results of the different test circuits show some similarities. First, the Shapiro-Wilk and Anderson-Darling tests produced consistent conclusions in all cases. In all cases, the random variability is closer to normal than the corresponding systematic variability, while the total data lies somewhere between them. The only exception is the ring oscillator composed of 17 stages, where the random variability is strongly reduced due to the use of large transistor widths. Moreover, the relative random variation of the total propagation delay of a delay chain or ring oscillator decreases according to 1/√n, where n is the logic depth of the circuit. For the ring oscillator used in the test circuit, the logic depth between two rising edges, which are used for frequency measurement at the input of the frequency divider, is n=34, and random variations are therefore suppressed due to averaging.
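The 1/√n averaging can be made concrete with a short calculation. The sketch assumes n identical stages whose random delay variations are statistically independent; the absolute σ of the total delay then grows with √n while the total delay grows with n, so the relative σ shrinks with 1/√n. The per-stage value of 10% used in the example is hypothetical.

```python
import math


def relative_random_sigma(sigma_stage_rel, n_stages):
    """Relative sigma of the total delay of a chain of n identical stages.

    Assumption: stage delays vary independently with the same relative
    sigma, so the relative sigma of the sum scales with 1/sqrt(n).
    """
    return sigma_stage_rel / math.sqrt(n_stages)
```

With a hypothetical per-stage relative σ of 10%, a logic depth of n=34 reduces the relative random variation of the total delay to roughly 1.7%.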
In all cases except for the ring oscillator frequency, the random variability is considered normal in the conclusion of the tests, while systematic variability is not normal (as expected, since it has a strong spatial dependence from the middle to the corners of the wafer). The total data is considered normal in the cases with weak FFs and no inverters, since these are the cases where random variability prevails, while it is not normal in other cases. In the case where we have large FFs and inverters, the role of random (local) variability is diminished and the role of systematic variability increases.
Analysing the kurtosis and skewness results, it is possible to see that in all cases, they are close to the values of 3 and 0, respectively, as expected for normal Gaussian curves.
Conclusions
This work presents an experimental analysis of the variability of hold time violations of edge-triggered master-slave FFs due to process variations in 130nm and 90nm low power CMOS technology. For accurate on-wafer characterization, a test circuit and a measurement technique with ~1ps resolution are presented. The proposed methodology provides detailed information about the circuit robustness of FFs under realistic operating conditions. This precise FF characterization then enables designers to perform hold-time fixing for short paths considering statistical variations of FFs as well as delay increasing inverters during buffer insertion. Moreover, during standard cell library development, the methodology is beneficial to optimize the FF portfolio, i.e. to balance race immunity and clock-to-Q propagation delay for various cell driving strengths and different FF topologies.
The proposed technique can be extended to characterize other timing constraints. Finally, statistical timing violations in edge-triggered master-slave flip flops are investigated experimentally, at different supply voltages. Mathematical methods to isolate systematic and random residual variations from the experimental data are discussed and compared. Results show that the absolute race immunity reduces by about 30% from 130nm to 90nm CMOS technology due to speed improvement, leading to a faster CLK-Q delay. This indicates that hold time violations are a harder problem in newer technologies if the clock skew is not expected to scale in the same way.
The results also show that the systematic variability is larger than the random variability in a 130nm CMOS technology, but this trend is not expected to continue in newer technologies. However, there are design techniques available to reduce the impact of systematic variations, and the trend may be different between logic circuits and SRAMs. The normality tests performed on the results showed that, in general, the random variability follows a normal (Gaussian) distribution, while the systematic variability does not; the total variability is normal only in the cases with weak FFs and no inverters.
Future work includes the investigation of the impact of different temperature and supply voltage on variability.
