At nanometer manufacturing technology nodes, process variations significantly affect circuit performance. To combat them, post-silicon clock tuning buffers can be deployed to balance timing budgets of critical paths for each individual chip after manufacturing. The challenge of this method is that path delays must be measured for each chip to configure the tuning buffers properly. Current methods for this delay measurement rely on path-wise frequency stepping. This strategy, however, requires too much time from expensive testers. In this paper, we propose an efficient delay test framework (EffiTest) to solve the post-silicon testing problem by aligning path delays using the already-existing tuning buffers in the circuit. In addition, we test only representative paths, and the delays of other paths are estimated by statistical delay prediction. Experimental results demonstrate that the proposed method can reduce the number of frequency stepping iterations by more than 94% with only a slight yield loss.
Introduction
As technology nodes advance, increasing process variations together with aging effects require a nearly unaffordably large timing margin, thus causing expensive overdesign. To combat such challenges, post-silicon tuning components and mechanisms have been considered to alleviate the effect of process variations.
A widely used post-silicon tuning technique is clock tuning using delay buffers. For example, the structure of the delay buffer (clock vernier device) in [1] is illustrated in Figure 1 . The delay of such a buffer can be adjusted by setting the configuration bits in the three registers. In high-performance designs, these tuning buffers are inserted during the design phase. After manufacturing, the delay values of these buffers are tuned to allot critical paths more timing budget by shifting clock edges toward stages with smaller combinational delays.
In recent years, several methods have already been proposed for statistical timing analysis and optimization of circuits with clock tuning buffers. In [2] a clock scheduling method is developed and clock tuning buffers are selectively inserted to balance the skews due to process variations. In [3] algorithms are proposed to insert buffers into the clock tree to guarantee a given yield, while either the number of buffers or the total area of buffers is minimized. In [4] the yield loss due to process variations and the total cost of clock tuning buffers are formulated together for gate sizing. In [5], the placement of clock tuning buffers is investigated and a considerable benefit is observed when the clock tree is designed using the proposed tuning system. In addition, the work in [6] proposes an efficient post-silicon tuning method by searching a configuration tree combined with graph pruning, and an insertion algorithm to group buffers into clusters. The yield of such a circuit with clock tuning buffers can be evaluated efficiently using the method in [7], and post-silicon testing methods for such circuits have been discussed in [8, 9]. In applying post-silicon tuning buffers, a major challenge is that delays of critical paths need to be measured specifically for each chip after manufacturing. Only with the knowledge of these delays can the tuning buffers be configured properly. However, so far these path delays are still measured using frequency stepping individually [2, 6, 8, 9], which requires much time from an expensive tester.
This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre "Invasive Computing" (SFB/TR 89).
In this paper, we propose a novel framework (EffiTest) to solve this delay measurement problem. Our contributions are as follows.
• Multiple paths are tested in parallel in our framework. Since we can adjust the existing clock tuning buffers during test, we can align the delays of combinational paths so that a frequency step can capture delay information of multiple paths.
• Instead of exhaustive frequency stepping, we apply statistical delay prediction. With this technique, we need to test only about 10% of the paths whose delays are required for buffer configuration using testers, and the delays of other paths are estimated from the tested delays.
• Experimental results confirm that the number of frequency stepping iterations can be reduced by more than 94%, with only about 2% yield loss.
The rest of this paper is organized as follows. In Section 2 we give an overview of timing constraints for circuits with post-silicon clock tuning buffers and the post-silicon testing problem. We explain the proposed method in detail in Section 3. Experimental results are shown in Section 4. The conclusion is given in Section 5.
Background of Post-silicon Clock Tuning
In a circuit with post-silicon tuning buffers, the propagation delays of clock paths to flip-flops can be adjusted after manufacturing for each chip individually. The concept of this technique can be explained using the example in Figure 2 , where four flip-flops are connected into a loop by combinational paths. The numbers next to inverters represent delays of combinational paths. Although gate delays in advanced technology nodes are statistical [10] , they become fixed values after manufacturing.
Without tuning buffers, the minimum clock period of this circuit is 8. If clock edges can be moved by adjusting the delays of the tuning buffers, the minimum clock period can be reduced to 5.5. For example, the buffer value x2 shifts the launching clock edge at F2 2.5 units earlier. Therefore, with a clock period of 5.5, the combinational path between F2 and F3 now has 5.5+2.5=8 time units to finish signal propagation. This shifting of the clock edge reduces the timing budget of the path between F1 and F2 to 5.5-2.5=3 units after post-silicon tuning, which is still sufficient for this path. Note that the buffer delays are defined with respect to a reference clock signal, so that they can have negative values.
The timing imbalance between combinational paths as in Figure 2 potentially appears when process variations become large in advanced technology nodes. For an individual chip, this post-silicon clock tuning is similar to the concept of useful clock skews [11] . The difference is that the tuning values are specific to each individual chip after manufacturing, so that the effect of process variations can be dealt with specifically for each chip.
Timing constraints with clock tuning buffers can be explained using Figure 3, where two flip-flops with such buffers are connected by a combinational circuit. Assume that the clock signal switches at reference time 0. The clock events at flip-flops i and j happen at time $x_i$ and $x_j$, respectively. To meet the setup time and hold time constraints, the following constraints must be satisfied
$$x_i + \overline{d}_{ij} \le x_j + T - s_j \;\Longleftrightarrow\; D_{ij} + x_i - x_j \le T \qquad (1)$$
$$x_i + \underline{d}_{ij} \ge x_j + h_j \;\Longleftrightarrow\; x_i - x_j \ge d_{ij} \qquad (2)$$
where $x_i$ and $x_j$ are delay values of tuning buffers, $\overline{d}_{ij}$ ($\underline{d}_{ij}$) is the maximum (minimum) delay of the combinational circuit between flip-flops i and j, $s_j$ ($h_j$) is the setup (hold) time of flip-flop j, T is the clock period, $D_{ij} = \overline{d}_{ij} + s_j$, and $d_{ij} = h_j - \underline{d}_{ij}$. For simplicity, we will still refer to $D_{ij}$ and $d_{ij}$ as path delays in the following discussion.
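The two constraints can be checked numerically. The sketch below assumes the setup constraint has the form Dij + xi − xj ≤ T and the hold constraint the form xi − xj ≥ dij, as implied by the definitions above. The function names are illustrative, and the delay of 3 units for the path between F1 and F2 is an assumed value chosen to match the Figure 2 discussion, not a number taken from the figure.

```python
def setup_slack(x_i, x_j, D_ij, T):
    """Slack of the setup constraint D_ij + x_i - x_j <= T (non-negative means met)."""
    return T - (D_ij + x_i - x_j)


def hold_slack(x_i, x_j, d_ij):
    """Slack of the hold constraint x_i - x_j >= d_ij (non-negative means met)."""
    return (x_i - x_j) - d_ij


# Assumed values for illustration: clock period 5.5, buffer x2 = -2.5.
T = 5.5
x = {"F1": 0.0, "F2": -2.5, "F3": 0.0}
D = {("F1", "F2"): 3.0, ("F2", "F3"): 8.0}  # path delays including setup time

slacks = {p: setup_slack(x[p[0]], x[p[1]], D[p], T) for p in D}
untuned = setup_slack(0.0, 0.0, D[("F2", "F3")], T)  # same path with all buffers at 0
print(slacks, untuned)
```

With the buffer at F2 set to −2.5, both paths meet the setup constraint at a period of 5.5, whereas the path from F2 to F3 violates it when all buffers are left at zero.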
Owing to area cost, the configurable delay of a clock buffer usually has a limited range. For buffer i, this range is specified as
$$r_i \le x_i \le r_i + \tau_i \qquad (3)$$
where $r_i$ and $\tau_i$ are constants determined by methods such as [3]. In the range (3), $x_i$ may only take discrete values according to the implementation of buffers. After manufacturing, path delays in chips, e.g. the delays stated next to the inverters in Figure 2, are measured. Thereafter, the configuration values of buffers are determined by finding a feasible solution meeting the constraints (1)-(3).
The most challenging task of using this post-silicon tuning technique, however, is delay measurement of combinational paths in chips after manufacturing. The measured delays should be relatively accurate to configure buffers properly. But the cost due to this delay test must remain low; otherwise, the benefit of using tuning buffers to improve yield may be offset by the ensuing test cost.
In previous methods such as [2, 6, 8, 9] , path delays are measured straightforwardly using frequency stepping. In this technique, a path is tested with a given clock period. If the sink flip-flop of this path can latch data correctly, the setup time constraint at the sink flip-flop is met, so that an upper bound of the path delay is found. Thereafter, a smaller clock period is applied until data cannot be latched correctly to find a lower bound of the path delay. With enough frequency steps, the path delay can be approximated by narrowing lower and upper bounds.
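The bound-narrowing idea behind frequency stepping can be sketched as a binary search on the clock period. This is a minimal illustration rather than the test flow of any cited work: `passes_at` is a hypothetical stand-in for applying a clock period on the tester and checking whether the sink flip-flop latches correctly, and the initial bounds and resolution are assumed values.

```python
def frequency_stepping(passes_at, lower, upper, resolution):
    """Narrow [lower, upper] around a path delay by stepping the clock period.

    passes_at(T) is a hypothetical tester hook that returns True when the
    data is latched correctly with clock period T.
    """
    iterations = 0
    while upper - lower > resolution:
        T = (lower + upper) / 2.0
        if passes_at(T):
            upper = T  # setup constraint met: T is a new upper bound
        else:
            lower = T  # latching failed: T is a new lower bound
        iterations += 1
    return lower, upper, iterations


# Simulated chip: the path passes whenever its (hidden) delay fits into T.
true_delay = 7.3
lo, hi, n = frequency_stepping(lambda T: true_delay <= T, 4.0, 12.0, 0.1)
print(lo, hi, n)
```

Each iteration halves the delay range, so the iteration count per path grows with the logarithm of the ratio between the initial range and the required resolution. This per-path cost is what makes path-wise stepping expensive when many paths must be measured.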
Frequency stepping is very easy to use to test path delays, but the number of iterations (frequency steps) might be large if many paths are tested. Though there are some techniques that can be used to combine tests of several paths to reduce the iteration number, no method has considered the fact that the tuning buffers in the circuit can be used to align path delays, so that a clock period can sweep the delay ranges of several paths at the same time. For example, if the buffers in Figure 2 could be preset as shown, one clock period can be used to test these four paths together, because all these paths pass or fail the test at the same time.
To reduce test cost, the correlation information between path delays provided by statistical timing analysis techniques [10] can also be used. Consequently, only a set of representative paths need to be tested while the other path delays are estimated from the test results. In the example in Figure 2 , if the correlations between the path delays are high, it is possible that only one or two out of the four paths need frequency stepping. 
Statistical Prediction and Aligned Delay Test
In this section, we explain our method to reduce the total number of frequency stepping iterations in testing path delays using statistical prediction and delay alignment by tuning buffers. In the test scenario, we assume that the locations of buffers have been determined, using a method such as [3, 12] . We also assume that there is a separate pass/fail test after the buffers are configured, similar to [8] . The flow of the proposed method is summarized in Figure 4 .
Statistical Delay Prediction
Consider the test scenario shown in Figure 5 , where nodes represent flip-flops and edges represent maximum delays between flip-flops. These maximum delays are needed to configure tuning buffers after manufacturing. Although the number of tuning buffers in the circuit is small, the number of paths that need to be tested may still be large. Consequently, it is impractical to test all these paths with frequency stepping directly, as assumed in [2, 6, 8, 9] . In high-performance designs, the logic gates on a critical path usually are not spread out all over the chip. Therefore, critical paths converging at or leaving from flip-flops with buffers tend to form physical clusters on the chip, as shown in Figure 5 . This physical proximity results in high correlation of the path delays [10] . Since a high correlation means that two delays resemble each other in a manufactured chip, actually only a few paths in a highly correlated path set need to be measured in silicon. Thereafter, the delays of other paths can be estimated from these measured delays, using a conditional statistical prediction technique [13] , which has been used in [14] to predict the timing performance of a circuit from the measurements of on-chip test structures. This delay prediction technique can be applied to path clusters individually as in Figure 5 . In such a cluster, path delays are highly correlated, so that the accuracy of delay prediction can be well maintained.
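Assume there are N statistical path delays $D_t$ which are selected to be measured by frequency stepping, and the delay $d_k$ of another path should be estimated from these N test results $d_t$. Assume that these delays follow Gaussian distributions. These variables can be written jointly as
$$D = \begin{bmatrix} d_k \\ D_t \end{bmatrix} \sim N(\mu, \Sigma), \qquad \mu = \begin{bmatrix} \mu_k \\ \mu_t \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_k^2 & \Sigma_{t,k}^{T} \\ \Sigma_{t,k} & \Sigma_t \end{bmatrix}$$
where $\mu$ is the mean value vector of $D$, $\Sigma$ is the covariance matrix of $D$, $d_k \sim N(\mu_k, \sigma_k^2)$, $D_t \sim N(\mu_t, \Sigma_t)$, and $\Sigma_{t,k}$ is the covariance matrix between $d_k$ and $D_t$.
By frequency stepping, the delays $D_t$ can be measured as $d_t$. According to [13], the mean value $\mu'_k$ and the variance $\sigma'^2_k$ of the conditional distribution of $d_k$ after $D_t$ are measured can be expressed as
$$\mu'_k = \mu_k + \Sigma_{t,k}^{T}\,\Sigma_t^{-1}\,(d_t - \mu_t) \qquad (4)$$
$$\sigma'^2_k = \sigma_k^2 - \Sigma_{t,k}^{T}\,\Sigma_t^{-1}\,\Sigma_{t,k} \qquad (5)$$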
Since the second product term in (5) is nonnegative, the variance of the delay dk cannot increase, and it is reduced whenever dk is correlated with Dt. Consequently, the value of dk is confined to a narrow range, and it may become unnecessary to measure the exact delay of dk for buffer configuration if the correlation between dk and Dt is high. On the other hand, a small correlation allows the delays to vary freely, leading to a relatively large variance even after statistical prediction is applied.
In the discussion above, all variables are assumed to be Gaussian. This is an assumption widely used in statistical timing analysis [10]. The proposed method, however, only requires an estimation of the upper bounds of delays for buffer configuration (described in later sections), so that the exact distributions do not affect the result much. For a non-Gaussian distribution, independent component analysis (ICA) may also be considered as in [15] with an expansion on conditional distribution.
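The conditional prediction can be illustrated with a small pure-Python sketch of the standard Gaussian conditioning formulas, which is the form assumed here for (4) and (5): the conditional mean is the prior mean plus cov(dk, Dt) times the inverse covariance of Dt times (dt minus the mean of Dt), and the conditional variance is the prior variance minus the corresponding quadratic form. The function names and the numerical values are illustrative assumptions.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [list(map(float, A[r])) + [float(b[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def predict_delay(mu_k, var_k, mu_t, cov_t, cov_tk, d_t):
    """Conditional mean and variance of d_k given measured delays d_t.

    cov_t is the covariance matrix of the tested delays and cov_tk the
    covariance vector between d_k and the tested delays.
    """
    w = solve(cov_t, cov_tk)  # w = cov_t^{-1} cov_tk (cov_t is symmetric)
    mean = mu_k + sum(wi * (di - mi) for wi, di, mi in zip(w, d_t, mu_t))
    var = var_k - sum(wi * ci for wi, ci in zip(w, cov_tk))
    return mean, var


# One tested path with correlation 0.9 to the predicted path (assumed numbers).
mean, var = predict_delay(10.0, 1.0, [10.0], [[1.0]], [0.9], [11.0])
print(mean, var)
```

In this example the measured delay lies one standard deviation above its mean, so the predicted mean moves up by 0.9 and the variance shrinks from 1 to about 0.19. The variance does not depend on the measured value, only on the covariances.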
Since the quality of delay prediction relies on the magnitude of correlation, we partition path delays in different groups. We first extract the paths with high correlations. These paths have a good delay prediction accuracy, and only a small number of paths from this group need to be tested. We then lower the correlation threshold to extract further path groups until all paths are extracted. This grouping technique can handle the case that there are several clusters of critical paths that are far away in the circuit. The correlations between paths from different clusters may be small, but inside each cluster the correlation is still high.
For each path group, we decompose the delays using principal component analysis (PCA) [16, 17] to identify the principal components (PCs) shared by correlated variables. Since only the PCs carry correlation information and only those are useful in predicting path delays, we need only to select those paths that can capture the values of these PCs by frequency stepping. Assuming that the number of PCs in the ith path group Pi is |PCi|, we select the same number (|PCi|) of paths from Pi [14]. After decomposition, the delays of paths Pi are represented as linear combinations of PCs. We first select the path with the largest coefficient for the first PC. Thereafter, from the remaining paths, we select the one with the largest coefficient for the next PC. This process is repeated until |PCi| paths are selected. The pseudocode of path grouping and path selection for frequency test is shown in Procedure 1.
Procedure 1: Select Paths
P: paths whose delays are required. Pt: paths to be tested.
[...]
Pt = Pt ∪ Pt,i;
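The per-group selection step described above can be sketched as follows. This is only an illustration of the greedy rule "one path per PC, largest coefficient first". It assumes the PCA coefficients of each path are already available, compares coefficients by magnitude (an assumption, since the text does not say how signs are treated), and uses made-up path names and coefficients.

```python
def select_paths(coeffs):
    """Pick one path per principal component.

    coeffs maps a path name to its list of PC coefficients (same length for
    all paths). For each PC in order, the not-yet-selected path with the
    largest coefficient magnitude is chosen.
    """
    n_pcs = len(next(iter(coeffs.values())))
    remaining = set(coeffs)
    selected = []
    for pc in range(n_pcs):
        if not remaining:
            break
        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=lambda p: abs(coeffs[p][pc]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical group with three paths and two PCs.
group = {"p14": [0.9, 0.1], "p34": [0.8, 0.3], "p46": [0.2, 0.7]}
tested = select_paths(group)
print(tested)
```

Here p14 is chosen for the first PC and p46 for the second, so p34 would be predicted from the two measured delays instead of being tested.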
Path Test Multiplexing
Since the individual delays of selected paths (Pt in Procedure 1) should be measured, paths converging at or leaving from the same flip-flop cannot be tested in parallel. For example, the paths p14 and p34 in Figure 5 cannot be tested together, because a data latching failure at node 4 cannot be attributed unambiguously to either p14 or p34. Consequently, paths that are measured together should be arranged in series. For example, paths p14, p46, p67, p89, p9a, pab can be tested together with the same clock period. These paths are called a batch in the following discussion. In real circuits, some paths in a test batch may not be activated by ATPG vectors at the same time due to logic masking. These paths can be set as mutually exclusive and arranged into different test batches.
Since the delays of paths in a test batch can be measured in parallel, naturally we should arrange paths to be tested into as few batches as possible to reduce the overall number of frequency stepping iterations. This arrangement can be determined easily using a depth-first search or a simple ILP model so that we skip the detailed discussion here.
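As a rough illustration of batch formation, the sketch below uses a first-fit heuristic instead of the depth-first search or ILP mentioned above, so it is not guaranteed to find the minimum number of batches. It only enforces the rule that no two paths in a batch share a source flip-flop or a sink flip-flop. The node numbers are hypothetical.

```python
def form_batches(paths):
    """Group (source, sink) paths into batches with first-fit.

    Within a batch, no two paths may leave from the same flip-flop or
    converge at the same flip-flop. Paths in series (sink of one equals
    source of another) are allowed in the same batch.
    """
    batches = []
    for src, snk in paths:
        for batch in batches:
            if src not in batch["sources"] and snk not in batch["sinks"]:
                batch["paths"].append((src, snk))
                batch["sources"].add(src)
                batch["sinks"].add(snk)
                break
        else:
            batches.append({"paths": [(src, snk)], "sources": {src}, "sinks": {snk}})
    return [batch["paths"] for batch in batches]


batches = form_batches([(1, 4), (3, 4), (4, 6), (6, 7)])
print(batches)
```

The paths (1, 4) and (3, 4) converge at node 4 and therefore end up in different batches, while the serial paths (1, 4), (4, 6) and (6, 7) share one batch.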
After the test batches are formed, there might still be some unoccupied slots in a test batch because paths might not be distributed evenly at flip-flops with buffers. Since the path batches should be tested anyway, we add additional paths to these empty test slots to gather more delay information. According to (5) , the variance of a path after estimation does not rely on the results of delay test dt. Since a large variance means that the delay cannot be estimated with enough accuracy, we add such paths with large variances to the empty slots in the identified test batches so that their delays can also be measured to reduce the predicted delay ranges.
Test with Delay Alignment by Tuning Buffers
After path batches are identified, they should be tested using frequency stepping to determine their path delays. In this section, we discuss how the delays of paths in a single batch are measured. Note this is the only step in the proposed framework that is executed by expensive testers that are able to generate various clock signals with a high accuracy.
In frequency stepping, a clock period is applied to the chip under test and the paths in a test batch are sensitized by test vectors. If the setup time constraint (1) at a flip-flop is violated, the data at this flip-flop cannot be latched correctly. This error shows that Dij + xi − xj is larger than T so that T is its lower bound. On the other hand, if the clock period is large enough so that there is no timing violation, the constraint (1) is met and T is an upper bound of Dij + xi − xj. By applying different clock periods in a binary search style, the value of Dij can be approximated with a given accuracy. Consider the case shown in Figure 6a , where a delay has given upper and lower bounds. These bounds are initialized with µ ± 3σ, where µ and σ are the mean value and the standard deviation of the delay calculated by statistical timing analysis. When the delay is tested with a given clock period T in an iteration, either a new upper bound or a new lower bound of it is generated. Consequently, the corresponding delay range is partitioned into two parts by T and the real delay value falls into one of them. To partition the delay range efficiently, it is preferable that T is aligned to the center (middle point) of the range. Otherwise, T might not partition the delay range evenly, but instead slices it in small steps, leading to many test iterations to estimate the delay, as illustrated in Figure 6b .
When several path delays in one test batch are considered as in Figure 6c , it is not always possible to partition all the delay ranges evenly with one clock period. However, we can still find a clock period T that partitions several delay ranges at the same time, so that the ranges of these delays can be reduced in one test iteration.
To use a clock period T to partition multiple delay ranges, there must be some overlap between the delay ranges, such as d2 and d3 in Figure 6c. According to (1), the quantity that is actually tested against T is Dij + xi − xj. Since the tuning buffers are already deployed in the circuit and their values xi and xj can be adjusted through the scan chain, we change the value of xi − xj to align the delay ranges, as illustrated in Figure 6d. Consequently, a clock period can partition more delay ranges so that the delays can be measured more efficiently compared with the case in Figure 6c. Because the configuration bits of buffers can be scanned into the chip under test together with the test vectors, this technique requires no change to the existing test platform.
In real circuits, the buffer values xi and xj can only be adjusted in a limited range as specified by (3) . In addition, these buffer values may affect more than one path delay. For example, in Figure 5 the buffer value of node 4 affects all the paths converging at or leaving from it. To test the path delays efficiently, we need to find a proper set of buffer values to align the ranges of path delays as much as possible.
Assume that the upper and lower bounds of Dij between nodes i and j are uij and lij, respectively. When the buffers at the source and sink nodes of the path are considered, the lower bounds and the upper bounds are shifted by xi − xj as defined in (1). Therefore, the distance ηij between a given T and the center of the shifted range of the path delay Dij can be expressed as
$$\eta_{ij} = \left| T - \left( \frac{u_{ij} + l_{ij}}{2} + x_i - x_j \right) \right| \qquad (6)$$
If we minimize the sum of ηij from all delay ranges, the resulting T will approximate the centers of delay ranges as much as possible, while the buffer values xi and xj are also determined. Minimizing the sum of ηij directly, however, cannot handle the special case in Figure 6e where the two delay ranges still do not overlap even after the buffer values have been adjusted to the limit.
In this case, the sum of distances η1 + η2 is independent of where T is placed between the centers of the two ranges. To solve this problem, we sort the centers of delay ranges determined in the previous test iteration. Thereafter, we assign the weight k0 to the range whose center is in the middle of the sorted list, and reduce the weights of other ranges by kd successively. In the proposed method, we set k0 ≫ kd, so that the ranges at the middle of the sorted list have slightly higher priorities. With this weight assignment, the weights of the two ranges in Figure 6e are different so that the next test clock period T should align at the center of the range with the larger weight.
The optimization problem to determine the clock period T and the corresponding set of buffer values xi and xj to align delay ranges can thus be expressed as
$$\text{minimize} \quad \sum_{i,j} k_{ij}\,\eta_{ij} \qquad (7)$$
subject to the constraints (8)-(14), which are imposed for every path pij in the test batch. Here (8)-(13) are linear constraints transformed from (6) and M is a very large positive constant [18]; $z^p_{ij}$ and $z^n_{ij}$ are two 0-1 variables corresponding to the two cases that T − ((uij + lij)/2 + xi − xj) is no less than zero and no greater than zero, respectively. Constraint (14) defines the ranges of buffer values as in (3).
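For a very small batch, the alignment objective can be explored by exhaustive search instead of an ILP solver. The sketch below is not the formulation (7)-(14); it enumerates the discrete buffer values directly and evaluates the weighted sum of distances between T and the shifted range centers. It relies on the fact that a weighted sum of absolute deviations is minimized at one of the points, so only the shifted centers are tried as candidates for T. All names and numbers are assumptions.

```python
from itertools import product


def align(paths, buffer_values, weights=None):
    """Brute-force search for T and buffer values that align delay ranges.

    paths: list of (i, j, lower, upper) with delay bounds of path i -> j.
    buffer_values: dict mapping each node to its allowed discrete x values.
    Returns (cost, T, x) with the smallest weighted sum of |T - center|.
    """
    nodes = sorted(buffer_values)
    weights = weights or [1.0] * len(paths)
    best = None
    for combo in product(*(buffer_values[n] for n in nodes)):
        x = dict(zip(nodes, combo))
        # center of each delay range after shifting by x_i - x_j
        centers = [(l + u) / 2.0 + x[i] - x[j] for i, j, l, u in paths]
        for T in centers:
            cost = sum(w * abs(T - c) for w, c in zip(weights, centers))
            if best is None or cost < best[0]:
                best = (cost, T, x)
    return best


# Two serial paths 1 -> 2 -> 3; only the buffer at node 2 is adjustable.
paths = [(1, 2, 4.0, 6.0), (2, 3, 7.0, 9.0)]
buffers = {1: [0.0], 2: [-1.5, 0.0, 1.5], 3: [0.0]}
cost, T, x = align(paths, buffers)
print(cost, T, x)
```

With x2 = −1.5, both shifted ranges are centered at 6.5, so a single clock period bisects both delay ranges in one test iteration. The search is exponential in the number of buffers, which is why a solver-based formulation is needed for real batches.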
After the clock period and the corresponding buffer values are determined by solving the ILP problem (7)-(14), the paths in the current batch are tested. According to the test result, either the upper bounds or the lower bounds of their delays are updated. If the distance between the range bounds uij and lij of a path is smaller than a given threshold, the path is removed from the current batch. The test iterations finish when all paths in the batch have been removed. The pseudocode of the test process is shown in Procedure 2.
Procedure 2: Delay Test

Buffer Configuration with Delay Estimation
After a path in Pt is tested by frequency stepping, its delay is known to lie in a range with a lower bound and an upper bound. For another delay dk that is not measured directly but estimated, (4) and (5) are used to calculate the conditional mean value µ'k and the conditional standard deviation σ'k. According to (4) and (5), σ'k is determined exclusively by the covariance matrix, but µ'k is affected by dt, which are the delays measured by frequency stepping. When calculating µ'k, we use the upper bounds of dt so that the estimated delays are conservative. Since the variances of estimated delays are often not zero, indicating that purely random variations still affect path delays, we assign a lower bound µ'k − 3σ'k and an upper bound µ'k + 3σ'k to an estimated delay, so that all path delays are constrained similarly for the following buffer configuration. A real delay may take any value in the range defined by the lower and upper bounds, but the exact location of this delay in the range is unknown due to test resolution and delay estimation. In this situation, a conservative method to configure the buffers is to assume the upper bounds of the ranges as path delays, so that the chip always works with the resulting buffer configuration. This method, however, may incorrectly report some chips as nonfunctional due to this delay overestimation. To solve this problem, we try to find a buffer configuration for a chip while keeping the assumed delays as close to their corresponding upper bounds as possible. By minimizing the distance of the assumed delays from their corresponding upper bounds when determining the buffer configuration, the chance that the chip works after configuration becomes large, so that the final pass/fail test will accept most post-silicon configured chips as functional.
The optimization problem to find a buffer configuration while minimizing the distance ξ of the assumed delays from the corresponding upper bounds is described as follows.
where D'ij is the assumed delay value of a path during buffer configuration; Td is the designated clock period for the design; (16) and (18) are derived from (1) and (3), respectively. By solving the optimization problem (15)-(18), a set of buffer configuration values xi and xj can be found.
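One formulation consistent with this description is sketched below. It is an assumption, not a reproduction of (15)-(18): it takes the setup constraint in the form Dij + xi − xj ≤ T, assumes the buffer range has the form ri ≤ xi ≤ ri + τi, and assumes that ξ bounds the largest distance between an assumed delay D'ij and its upper bound uij.

```latex
\begin{align*}
\text{minimize}\quad   & \xi \\
\text{subject to}\quad & D'_{ij} + x_i - x_j \le T_d
                         && \text{for all paths } p_{ij}, \\
                       & l_{ij} \le D'_{ij} \le u_{ij}, \quad
                         u_{ij} - D'_{ij} \le \xi
                         && \text{for all paths } p_{ij}, \\
                       & r_i \le x_i \le r_i + \tau_i
                         && \text{for all buffers } i .
\end{align*}
```

Under these assumptions, ξ = 0 corresponds to the fully conservative configuration in which every path is assumed to sit at its upper bound. A positive optimum means that no buffer setting supports all upper bounds at Td, and ξ then measures how far the assumed delays had to be relaxed.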
Tuning Bounds due to Hold Time Constraints
In the discussion above, we do not consider hold time constraints. However, tuning buffers may affect hold time constraints significantly if they are configured improperly. For example, in Figure 3 , if xj is much larger than xi, the constraint (2) may be violated. As shown in (2), hold time constraints are affected by xi − xj instead of individual values of xi and xj. In our method, we do not test against hold time violations after configuring buffers. Instead, we set a lower bound λij for xi − xj by sampling the statistical distribution of dij in (2) so that a given yield can be maintained.
Consider the case that dij in (2) is sampled M times for all short paths and its value in the kth sample is dij,k. For the kth sample, we use a 0-1 variable yk to indicate whether the lower bounds λij satisfy
$$\lambda_{ij} - d_{ij,k} \ge \mathcal{M}\,(y_k - 1), \quad \text{for all short paths } p_{ij} \qquad (19)$$
where $\mathcal{M}$ is a very large constant (distinct from the number of samples M). The yield of the circuit with respect to hold time can thus be constrained as
$$\sum_{k=1}^{M} y_k \ge Y \cdot M \qquad (20)$$
where Y is a given yield for hold time constraints, set to 0.99 in our method. To allow buffers to have the largest freedom in value configuration, we minimize the sum of all the lower bounds $\sum_{i,j} \lambda_{ij}$.
After λij are determined, the buffer configuration values can be constrained to avoid hold time violation, as
$$x_i - x_j \ge \lambda_{ij} \qquad (21)$$
This constraint is added into the optimization problems in Section 3.3 and Section 3.4 to incorporate hold time constraints to determine buffer values xi and xj.
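For a single short path, the sampling-based bound reduces to picking an empirical quantile, which the sketch below illustrates. It is a simplification: with several short paths, a sample counts toward the yield only if the bounds of all paths hold in that sample, which is what the 0-1 variables yk express and what requires an optimization model. The sample values and the target yield of 0.9 are made up for the example.

```python
import math


def hold_bound(samples, target_yield):
    """Smallest lambda such that at least target_yield of the samples satisfy lambda >= d_k.

    samples: sampled values d_k = h_j - (minimum path delay) of one short path.
    """
    need = math.ceil(target_yield * len(samples))  # samples that must be covered
    return sorted(samples)[need - 1]


d_samples = [-2.0, -1.5, -1.0, -0.5, -3.0, -2.5, -1.2, -0.8, -1.8, 0.4]
lam = hold_bound(d_samples, 0.9)
covered = sum(1 for d in d_samples if d <= lam) / len(d_samples)
print(lam, covered)
```

Requiring xi − xj ≥ lam then avoids hold violations in nine of the ten samples. The single outlier sample at 0.4 is sacrificed so that the bound stays low and leaves the buffers more freedom.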
Experimental Results
The proposed framework was implemented in C++ and tested using a 3.20 GHz CPU. We demonstrate the results with circuits from ISCAS89 and TAU13 benchmark sets. Information about these circuits is shown in Table 1 , where ns is the number of flip-flops and ng the number of logic gates. The number of inserted tuning buffers was less than 1% of the number of flip-flops. The numbers of buffers are shown in the column nb. As in [19] , we assumed that the maximum allowed buffer ranges were 1/8 of the original clock period and all tuning delays were set to be discrete with 20 steps. The logic gates in the circuits were mapped to a library from an industry partner. The standard deviations of transistor length, oxide thickness and threshold voltage were set to 15.7%, 5.3% and 4.4% of the nominal values. The correlation of variations in two side-by-side gates was set to 1 and the correlation due to global variations was set to 0.25. The ILP solver for the optimization problems was Gurobi [20] .
In Table 1 the column np shows the numbers of paths whose delays are required for buffer configuration. Although there are only a small number (nb) of buffers in the circuits, the numbers of paths to be tested (np) are still large, especially for the circuits mem_ctrl and pci_bridge32. The column npt shows the numbers of paths that are actually tested by the proposed method. Due to statistical prediction, only a small number of paths were selected so that the number of test iterations can be reduced directly. In our experiments, we tested 10 000 simulated chips. The column ta shows the average number of frequency stepping iterations for each chip using the proposed method, and the column tv shows the average number of iterations per path, where tv = ta/npt.
For comparison, we implemented the method applying frequency stepping to each path individually, as assumed in [2, 6, 8, 9]. The column t'a in Table 1 shows the total numbers of test iterations. Since there are a lot of paths that should be tested (np), the values of t'a are extraordinarily large. These numbers confirm that the straightforward frequency stepping method is impractical for large circuits. Furthermore, the column t'v shows the average numbers of frequency stepping iterations per path, where t'v = t'a/np. Comparing the columns tv and t'v, we can find that the proposed method is much more efficient, due to the test multiplexing technique described in Section 3.2 and the aligned test technique described in Section 3.3. The columns ra(%) and rv(%) show the reduction ratios of the test iterations per chip and the test iterations per path, where ra = (t'a − ta)/t'a × 100 and rv = (t'v − tv)/t'v × 100. Combining statistical prediction and aligned delay test, the overall test effort can be reduced by more than 94% (94.71% to 99.29%). If we look at the ratios of test iterations per path (rv(%)), we can find that the test reductions are between 57.59% and 75.15%. This reduction comes only from test multiplexing and aligned delay test, while the statistical prediction technique does not affect this ratio much. Both comparisons, however, confirm that the proposed test framework reduces test cost significantly.
The runtimes of the proposed method are shown in the last three columns in Table 1 , where Tp is the runtime for path grouping and selection, test multiplexing and hold time bound computation. Because these steps are performed offline, the runtime is already acceptable. The column Tt(s) shows the average runtime when computing the clock period T and the buffer configuration values for all test batches of a chip. Since this computation can be performed in parallel while path batches are tested, the runtime is also acceptable compared with the execution time of scan test. The last column Ts(s) shows the runtime to determine the final buffer values using the method in Section 3.4. This step is not performed on expensive testers so that the efficiency is good enough.
In the proposed framework, the results of aligned delay test produce lower and upper bounds for delays. This inaccuracy cannot be avoided due to the nature of delay test and it affects the yields of the circuits after buffer configuration. In addition, the technique of statistical prediction also introduces configuration inaccuracy in the estimated delays. Consequently, it is expected that the yields of the circuits should drop from the ideal yields with delays measured exactly. We tested several cases with two clock periods T1 and T2 and the results are shown in Table 2 . For T1 and T2 the original yields without buffers were 50% and 84.13%, respectively. The column yi shows the yields with a perfect delay measurement; the column yt shows the yields with delays measured by the proposed method; and the column yr shows the yield drops due to the inaccuracy in the tested delays, where yr = yi − yt. In these results, we can see that the yield drops are around 1-2%, where the improved yields are still far better than the yields without buffers, 50% and 84.13%, respectively.
Since the results of the statistical prediction technique in Section 3.1 depend on the correlations between path delays, we manually increased the standard deviations of all delays by 10%. Since we did not change the covariance matrix between variables, this change led to a large increase in the purely random parts of the delays. Figure 7 shows the yield results of three cases: 1) no buffers in the circuits; 2) with buffers and the buffer configurations generated by the proposed method; 3) with buffers and perfect buffer configurations. The latter two cases clearly demonstrate that the yields were still improved substantially due to tuning buffers. When testing and configuring the buffer values with the proposed method, the yields dropped more from the ideal case than the cases in Table 2 due to the increased random variation. But the final results are still good considering the significant reduction in test cost.
To verify the effectiveness of test multiplexing and aligned delay ranges described in Section 3.2 and Section 3.3, we applied them directly to reduce test iterations without statistical prediction. Figure 8 shows the comparison of the numbers of test iterations per path in three cases: 1) path-wise frequency stepping; 2) test multiplexing without delay alignment using buffers; 3) multiplexing with delay alignment using buffers (the proposed method). The second case uses the method in Section 3.2 and Section 3.3, but all buffer values were set to zero. Comparing the results of the first case and the second case, we can see that test multiplexing is a powerful technique to reduce test iterations. When the technique of delay alignment is applied, test iterations can be reduced further, as demonstrated by the third case. These results confirm that even without taking advantage of the correlations between path delays, the proposed method can still reduce test cost significantly.
Figure 8: Test comparison without statistical prediction
Conclusion
In this paper we propose an efficient framework to reduce test cost in configuring tuning buffers in high-performance designs. This framework combines statistical prediction and aligned delay test with path multiplexing, with which the number of test iterations can be reduced by more than 94%. The effectiveness of these techniques has been confirmed by experimental results using ISCAS89 and TAU13 benchmark circuits.
