Increasing parameter variations, caused by variations in process, temperature, power supply, and wear-out, have emerged as one of the most important challenges in semiconductor manufacturing and test. As a consequence for gate delay testing, a single test vector pair is no longer sufficient to provide the required low test escape probabilities for a single delay fault. Recently proposed statistical test generation methods are therefore guided by a metric, which defines the probability of detecting a delay fault with a given test set. However, since runtime and accuracy are dominated by the large number of required metric evaluations, more efficient approximation methods are mandatory for any practical application.
I. INTRODUCTION
In recent years, parameter variations have emerged as a new challenge for manufacturing test methods [1] - [3] . The detrimental impact of these variations on the delay test quality may lead to many test escapes due to test invalidation [4] . A delay test is called invalid, if the test fails to detect a target fault due to delay variations, hazards or other reasons. More stringent path sensitization conditions can reduce the risk of test invalidation, but these conditions may not be satisfiable for a large number of paths [5] . Hence, a single test vector pair can no longer provide sufficiently low test escape probabilities for a delay fault [6] .
In order to keep test cost within an acceptable budget, statistical delay test generation methods must be aware of the diminishing returns in delay test quality by each additional test vector pair. The most promising approach is to guide the test generation procedure by a metric [7] - [10] , which defines the detection probability of a delay fault with a given test set. However, this metric must be computed O(k 2 n) times to evaluate k applicable test vector pairs for each gate delay fault in a circuit with n gates [8] . Hence, the accuracy and the computational complexity of any practical application strongly depend upon the efficiency of the metric evaluations.
For the research and development of new statistical test generation methods, the output deviation was proposed as a low-cost surrogate metric [11] . However, the output deviation tends to saturate, and equal values are obtained for long and intermediate sensitized paths.
In [7] , the detection probability is used to guide the selection of the longest paths through every gate of the circuit, which are subsequently targeted for test generation. However, in order to reduce the runtime of the algorithm, the authors approximate the path delays as independent random variables.
An alternative method is to create a superset of test vector pairs for each fault site. Then, a minimal subset of these test vector pairs with sufficiently high delay test quality is selected. A pattern-selection algorithm following this principle was proposed in [8] . However, the detection probability was computed using Monte Carlo simulations of the entire circuit, which is inefficient for practical applications. Instead, only sufficiently long paths, which are also sensitized by the given test set, can have a significant impact on the delay test result and should therefore be considered.
In a related work [12] , the authors proposed to utilize statistical static timing analysis techniques to estimate the process-induced variation in pattern delays. An event-driven timing simulation was used to identify the portion of the circuit, which is sensitized by a given test set. After removing the remaining parts of the circuit, a block-based statistical static timing analysis technique was applied to estimate the pattern delay distribution. However, the block based approach results in unnecessary error accumulation and requires all gate and interconnect delays to have a normal distribution.
The approach presented here combines several efficient block-and path-based statistical static timing analysis techniques, to minimize the approximation error and the computational complexity. In contrast to [12] , only sensitized paths with a significant probability of causing a timing failure are considered. The algorithm is not restricted to normally distributed gate and interconnect delays, which is particularly important for the consideration of the exponential fault size distribution of a gate delay fault. Furthermore, both structural as well as spatial correlations are taken into account.
The contribution of this work is twofold: (1) based on [12] , a new advanced statistical dynamic timing analysis algorithm is introduced, which can be used in test applications; (2) the efficiency of the algorithm is demonstrated in the context of delay variation aware pattern selection for small delay defects.
The remainder of the paper is organized as follows. Section II describes the proposed approximation algorithm. The experimental results for NXP benchmark circuits are presented in section III. Conclusions are drawn in section IV.
II. STATISTICAL TIMING ANALYSIS ALGORITHM
To detect a particular gate delay fault, a suitable set of test vector pairs Θ is applied to the circuit. The proposed algorithm approximates the probability, that an incorrect logic value will be captured into at least one scan flip-flop. More precisely, the algorithm computes the probability of a timing failure, which occurs if at least one primary output of the combinational network has not stabilized to its final logic value within the system clock cycle time T clk . It is assumed, that the whole circuit is subject to delay variation, which may cause path delay faults by itself or in combination with a gate delay fault.
A sensitization analysis similar to [12] is performed for every test vector pair in the test set. But instead of the entire sensitized portion of the circuit, only paths with a significant probability of causing a timing failure are extracted. The delay distribution of such a long path is approximately a normal distribution, regardless of the distribution of the gate and interconnect delays (central limit theorem). The circuit delay for the considered test set is now given by the statistical maximum of all path delays. The timing failure probability is finally obtained by evaluating the cumulative distribution function of the corresponding multivariate normal distribution.
The proposed algorithm approximates the probability of a timing failure in four major steps, as depicted in Fig 1. The first step identifies all paths, along which transitions travel from the primary inputs to the primary outputs under a given test set. Only those paths, whose probabilities of causing a timing failure exceed a user defined threshold, are extracted in the second step. The set of correlated random variables, defined by all critical transition path delays, is reduced in the third step with Clark's maximum estimation method [13] . The probability of a timing failure is finally computed in the last step, using an efficient numerical integration algorithm proposed by Genz [14] . The following subsections present a detailed description of all steps of the algorithm. A method to enhance its accuracy is introduced in subsection II-E.
A. Identification of Complete Transition Paths
A complete transition path is a sensitized path, along which a transition propagates from a primary input to a primary output. The identification of complete transition paths is based on a single pass event-driven timing simulation of each vector pair in the test set. This simulation considers only nominal delay values and may identify different complete transition paths for different delays, due to dynamic path sensitization and hazards. In rare cases, this dependency may have a large impact on the circuit delay, leading to complete transition paths, which are not representative for the vector pair.
Following the simulation of a vector pair, the complete transition paths can easily be identified by tracing the events backwards from the primary outputs to the primary inputs. To avoid any ambiguities during event tracing, each event at the output of a gate has a reference to its preceding event at the gate input. A further reference to the applied delay value, selected according to the conditions of the gate timing model, is stored to improve the efficiency of the following step. 
B. Extraction of Critical Transition Paths
A critical transition path is defined as a complete transition path, for which the probability of the path delay exceeding the system clock cycle time T clk is above a user defined threshold. If the probability is very large (e.g. greater 0.98), then a timing violation is very likely and no further analysis is required.
For a complete transition path with correlated and normally distributed gate and interconnect delays X 1 , . . . , X k , the path delay Y also has a normal distribution with mean µ(Y ) and variance σ 2 (Y ) given by
where σ(X i , X j ) denotes the covariance of X i and X j . Assuming independent delays X 1 , . . . , X k of arbitrary distributions, the path delay distribution is obtained from the convolution of all probability density functions of X 1 , . . . , X k . The following steps of the algorithm require the delay of all critical transition paths to have a normal distribution. However, according to the central limit theorem, the distribution of a sum of independent random variables with arbitrary distribution converges to a normal distribution. In practice, this convergence occurs rapidly for less than 10 variables, especially if the distribution of these variables is close to a normal distribution [15] . While the central limit theorem only holds for independent random variables, for most practical models of correlation, the distribution is also guaranteed to converge to a normal distribution.
C. Dimension Reduction with Statistical Maximum Operation
The goal of the remaining two steps is to compute the probability Ψ, that the delay of at least one critical transition path exceeds the system clock cycle time T clk . More formally, this probability is defined as
where the critical transition path delays equal correlated random variables X 1 , . . . , X n of normal distribution. The computation of the timing failure probability (2) requires knowledge about possible structural and spatial correlations between the critical transition path delays. For structural correlations, the covariance of two paths equals the sum of the variances of all gate and interconnect delays, which are shared by both paths. However, depending on the particular circuit model, the estimation of spatial correlations can be significantly more difficult and is beyond the scope of this paper. In this case, it is assumed that the covariance is obtained using one of the well-known methods.
Using Clark's approximation method [13] , any two random variables X i and X j with 1 ≤ i < j ≤ n can be replaced by a new random variable Y = max{X i , X j }, representing the statistical maximum of X i and X j . Following the definitions
the first two moments of Y are obtained by
Before removing X i and X j , it is necessary to compute all covariances σ(X k , Y ), where X k denotes any of the remaining random variables. The covariance of X k and Y is given by
This approximation disregards all higher-order moments of Y , which introduces an approximation error. The accumulation of this error can be minimized by carefully choosing the order, in which the pairwise statistical maximum operations are performed [16] . In this paper, Clark's maximum estimation method is iteratively applied to the pair of random variables with maximum α. To avoid unnecessary error accumulation, this process stops once the number of random variables has dropped below a user defined threshold.
D. Computation of Timing Failure Probability
This step finally computes the timing failure probability Ψ using an efficient numerical integration method. Replacing the maximum in (2) yields
where Y 1 , . . . , Y m denote the remaining random variables after the previous step. Instead of computing the entire distribution of the statistical maximum, only the cumulative distribution function must be evaluated. Using the definition of the cumulative distribution function this becomes
where φ denotes the multivariate normal density function, defined by the mean vector µ = [µ(Y 1 ), . . . , µ(Y m )] and the covariance matrix
For this problem, the numerical integration algorithm proposed by Genz in [14] has proven to be surprisingly effective. It is applicable to very large dimensions and quickly converges to the required accuracy of about 10 −4 . However, in some cases the dependencies between path delays may lead to a covariance matrix that is not positive-definite. To avoid severe numerical instabilities, various techniques can be applied to transform such a covariance matrix into a positive-definite matrix [17] . In this paper, the diagonal elements of the matrix are multiplied with a very small constant greater than unity.
E. Enhancement of Accuracy
Path sensitization and hazards can be very sensitive to delay variation. However, the sensitization analysis in the first step uses only nominal delay values, which is the main source of approximation error of this algorithm.
To explore the impact of delay variations on the composition of the critical transition path set, the algorithm can be executed N times using randomly chosen delays during sensitization analysis. If the approximation results Ψ 1 , . . . , Ψ N differ significantly, a classical Monte Carlo simulation of the whole circuit could be performed. However, a more efficient approach is to use the average approximation resultΨ
where the calculation of the timing failure probability is only required for the distinct sets of critical transition paths. To focus on the most likely critical transition path sets, the standard deviation for the random number generation is reduced to 0.3σ, which was chosen based on experimental results. However, the relative frequency and the delays of the critical transition paths are not necessarily independent as assumed by (12) , which may lead to a slight underestimation of the timing failure probability. Indeed, the later a transition arrives at a gate input, the more likely the off-path inputs have already stabilized to their non-controlling values and allow the transition to be propagated.
III. EXPERIMENTAL RESULTS
The experiments were performed on several NXP benchmark circuits. The circuits were first speed-optimized using a commercial synthesis tool and then mapped to the NanGate 45nm Open Cell Library [18] . To avoid an unnecessary complex experimental setup, chip layouts were not produced. As a result, interconnect delays and spatial correlations have been ignored. All experiments used the gate delay model defined by the Verilog HDL [19] standard. However, instead of real numbers, all delay values X were assumed to have a normal distribution with σ(X) = c v |µ(X)| of the nominal value µ(X). The nominal gate delay values were extracted from the Standard Delay Format (SDF) description of the synthesized netlists. A variation coefficient of c v = 0.25 was assumed, based on predictions for the 12nm process technology [20] . The system clock cycle time T clk was chosen, such that 5% of the defect-free manufactured chips would fail due to timing failures caused by delay variations.
For each circuit, a benchmark of 20000 randomly chosen single gate delay faults was created. For every delay fault, a set of test vector pairs, suitable for small delay defects, was generated. Using only the nominal gate delay values, the k longest paths through a fault site were found with a commercial static timing analysis tool. The number of paths was limited to 1000, of which at most 100 paths were allowed to end at the same primary output. A commercial ATPG tool was later used to sensitize the resulting set of paths. Only delay faults, for which at least 20 vector pairs had been generated, were considered. The fault size was set to the slack of the complete transition pathπ with the largest nominal delay, passing through the fault site.
For every delay fault, four different test sets Θ 1 , . . . , Θ 4 were defined as follows. The first test set Θ 1 contained only a single vector pair which sensitizesπ for nominal delay values. Then, Θ i+1 was created from Θ i , by adding several vector pairs in decreasing order of the delay of the longest complete transition path. Thereby gradually extending the test set to five, ten and finally to twenty vector pairs.
The approximation results are compared to the fault detection probability, which is computed using 10 5 classical full circuit Monte Carlo simulations with a highly optimized eventdriven timing simulator. During one iteration, a fault is said to be detected, if at least one primary output does not have the expected value at the sampling time T clk . The runtime of the Monte Carlo simulations is dominated by the large number of random values required for every iteration. This is due to the detailed gate delay model defined by the Verilog HDL [19] standard, which distinguishes between different pinto-pin, asymmetric rising/falling and conditional path delays. Hence, a two-input gate may require up to eight different delay values. The random number generation utilizes highperformance implementations of the Box-Muller transform and the Mersenne Twister pseudo-random number generator. All programs were implemented in C++ and executed on Intel Core i7-2600K processor workstations with 8 or 32GB RAM.
A. Proposed Approximation Algorithm
In this subsection, the effectiveness of the proposed approximation algorithm is demonstrated. The results for a single execution and for multiple executions are presented in Table I and II, respectively. Column (1) gives the name of the circuit, and column (2) shows the number of test vector pairs in the individual test sets Θ. All values in columns (3)-(8) represent average results over all delay faults. The average number of critical transition paths |Π|, obtained for the individual test sets, is shown in column (3) . A complete transition path with normally distributed delay Y was considered critical, if µ(Y ) + 3σ(Y ) ≥ T clk . The number of critical transition paths slowly saturated, because the additional vector pairs predominantly sensitized shorter paths and many paths may have already been sensitized by previous vector pairs.
The runtime and the accuracy of the proposed approximation algorithm were compared to 10 5 Monte Carlo simulations, which evaluated the fault detection probability as a golden reference. The absolute approximation error is presented in column (4) as the average absolute difference to the golden reference. As explained in section II-E, a single test vector pair might not sensitize a critical transition path for all possible delay realizations, due to different arrival times of the transitions at the off-path inputs. However, in this case, a different vector pair might still be able to sensitize the path. Hence, increasing the size of the test set leads to a reduction of the approximation error.
Clark's maximum estimation method was only applied in rare cases of more that 1000 critical transition paths. Hence, the impact on the average runtime and accuracy of the algorithm was very small. The multivariate normal integral (11) was computed using a FORTRAN routine named MVNDST, which was developed by Genz [14] . The largest estimated approximation error of the numerical integration algorithm was always less than 10 −2 .
Column (5) shows the relative approximation error¯ , defined as the average mean difference to the golden reference. A positive mean error¯ indicates overestimation, while a negativē shows a tendency for underestimation of the fault detection probability. A single execution of the proposed algorithm does not consider the impact of delay variations on the composition of the critical transition path set. Hence, the tendency shifts from overestimation for small test set sizes to underestimation for larger test set sizes.
As expected, the proposed extension of the algorithm improves the accuracy by computing the average result of 10 executions. For any delay fault, the number of executions was increased to 100 if the difference between the smallest and greatest timing failure probability exceeded 0.1. The fault detection probability was rarely overestimated, but contrary to a single execution, the mean error¯ for some circuits was increasing with the test set size |Θ|. This is caused by statistical dependencies between the relative frequency of a critical transition path and its timing failure probability, as described in subsection II-E. (6) shows the average runtime for a single execution of the proposed algorithm. The runtime is dominated by the maximum estimation and numerical integration (41.8%), followed by the critical transition path extraction (34.6%) and event driven-simulation (23.6%). Clark's maximum estimation had only a minor impact on the average runtime.
The average runtime of the Monte Carlo simulation is presented in column (7) and the corresponding speed-up is given in column (8) . The amount of time required for the event-driven simulation and path selection increases linearly with the test set size |Θ|. However, due to the increasing number of critical transition paths, the relative contribution of the numerical integration to the overall runtime of the approach increases significantly. On the other hand, the Monte Carlo algorithm becomes more efficient, because the high cost for generating the large number of high quality random delay values is shared among a greater number of vector pairs. Hence, the speed-up decreases with the size of the test set. The runtime for multiple executions depends on the number of executions and distinct critical transition path sets.
The experimental results have shown an average speedup of between four and five orders of magnitude, compared to classical Monte Carlo simulations. The small approximation error of the algorithm is predominantly caused by the impact of delay variation on path sensitization and hazards. Multiple executions can further enhance the accuracy of the results. 
B. Trade-Off Between Test Cost and Statistical Test Quality
In this subsection, the accuracy of the proposed approximation algorithm is demonstrated in the context of delay variation aware pattern selection for small delay defects. The fault detection probability is used to find a suitable compromise between the statistical test quality and the total number of vector pairs. Based on the results of the previous subsection, which were stored in a database, a suitable test set Θ ∈ {Θ 1 , . . . , Θ 4 } is selected for each delay fault.
At first, the fault detection probability P 4 of the largest test set Θ 4 with 20 vector pairs was compared to the smaller subsets Θ 1 , Θ 2 and Θ 3 with 1, 5 and 10 vector pairs, respectively. To minimize test cost, the smallest test set Θ i with a fault detection probability P i satisfying
was selected. A threshold value of b = 0.10 was chosen to specify a trade-off between test quality and test cost. This process was repeated for all delay faults and circuits. The described pattern selection approach was first performed using only the Monte Carlo simulation results. The test set size distribution over all delay faults in each circuit is presented in Fig. 2 . For more than 50% of all delay faults in this example, five vector pairs provide superior fault detection probability, close to the four times larger test set Θ 4 . The whole process was repeated using the proposed approximation algorithm, and both results were compared in Fig. 3 . The abscissa shows the difference in the test set size |Θ M C | − |Θ P A | over all delay faults and circuits, where Θ M C and Θ P A denote the test set chosen based on the results of the Monte Carlo simulation and the proposed approximation algorithm, respectively. The results match for more than 80% of all delay faults, while only for a small fraction a slightly smaller or larger test set was selected. By using the same randomly chosen delay values for all test sets, the average result of multiple executions provides even greater accuracy.
IV. CONCLUSION
Delay variations in recent technology nodes reduce the quality and reliability of all delay tests. Statistical test generation methods are guided by the fault detection probability to find a trade-off between statistical delay test quality and test cost. An efficient statistical dynamic timing analysis algorithm was proposed to reduce the high runtime complexity of these methods. The approach was compared with results of extensive Monte Carlo simulations and has shown a large speedup with only a small loss of accuracy.
V. ACKNOWLEDGEMENT

