Abstract-Pseudo
I. INTRODUCTION
In recent years, considerable research effort has been dedicated to at-speed test using the path delay fault model [1] [2] [3] [4], since it captures the cumulative effects of small delay defects [5]. However, the maximum operating frequency (FMAX) during at-speed scan test may not correlate well with functional and system tests due to a mismatch in power supply noise (PSN) [6, 7].
A. Over-testing and under-testing
One of the primary reasons for the FMAX mismatch between scan and functional test is that test patterns are generated utilizing only structural information and traditional two-cycle at-speed test uses only a short clock burst in the functional mode. Since the scanned-in pattern may contain illegal states [8], a short burst of functional cycles is insufficient to exclude all illegal states and bring the circuit to a functional state [9]. The illegal states can lead to increased switching activity, resulting in excessive power supply voltage drop. Once the test-mode PSN exceeds the design margin, a good chip may fail the test, which is called over-testing [10].
Several techniques have been proposed to handle over-testing in traditional two-cycle delay testing, such as low-power test generation [11], X-filling [12] [13] [14] [15], and power-aware static compaction [10]. The main disadvantage of these methods is that illegal states may still be included in the test pattern. Pseudo functional test (PFT) [8, 16] has been proposed to explicitly constrain test generation to use functional or near-functional states.
A test must also propagate transitions on paths under the realistic worst-case PSN in order to minimize under-testing, where a test is too quiet and leads to a scan test FMAX higher than functional FMAX. Approaches using a genetic algorithm (GA) [1], automatic test pattern generation (ATPG) [3] and SAT [17] have been developed to maximize PSN for delay test.
In order to strike a balance between over-testing and under-testing, the approach in [2] proposes to test chips under the worst-case functional PSN condition. Delay tests considering both functional constraints and power supply noise are reported in [2, 18].
B. Pseudo functional KLPG test
Pseudo functional KLPG (PKLPG) test is proposed to generate delay tests that test the longest paths through each line in the circuit while having PSN similar to that seen during normal functional operation. Rather than scanning in a test pattern, applying it with a few functional test clocks and scanning out the results, PKLPG is applied by scanning in a test pattern, running multiple functional clocks, and then scanning out the result, as shown in Figure 1. The initial functional clocks, termed the preamble, run at lower speed to ramp up off-chip power supply currents and minimize the noise due to the chip and package inductance [7]. The preamble cycle time must be much less than the off-chip inductor time constant, but should be as large as possible to minimize the number of preamble cycles needed to stabilize off-chip inductor currents [7, 19].
PKLPG can be viewed as short bursts of functional tests [9]. Its primary advantage is that it applies tests in a more functional manner, so that power supply noise, signal coupling and power dissipation are more similar to functional operation. As for shift mode (both in and out), cumulative shift power dissipation can be minimized either by DFT, such as scan-chain partitioning [20], or by test pattern manipulation, such as MT-filling [21]. Our in-house KLPG test generator, CodGen [4], was modified to support PKLPG pattern generation by introducing additional cycles prior to the launch-capture cycles. Currently, CodGen supports up to 32 preamble cycles followed by at-speed launch and capture.
Another key issue with PKLPG test is that functional PSN during the capture cycle can be much lower than the worst-case functional PSN and hence may cause under-testing. Moreover, it is interesting to explore the range of functional PSN during launch and capture. This can be used to verify timing robustness during post-silicon validation and to set the PSN margin. Therefore, the objective of our work is to develop a PSN control scheme for PKLPG.
In this work, we first investigate the PSN scenario for PKLPG test using random pattern simulation, which shows that PKLPG is more vulnerable to under-testing than over-testing. A simulation-based X-filling procedure called Bit-Flip is proposed to maximize PSN during launch and capture for a partially-specified PKLPG pattern. Experimental results on uncompacted longest path patterns of ITC benchmark circuit b19 demonstrate that our scheme is able to improve effective WSA by as much as 38.7% compared with the best random fill. Results on compacted patterns show that Bit-Flip can perform effectively even if the care bit density is as high as 20%. The trade-off between CPU time and noise maximization is also discussed.
The remainder of this paper is organized as follows: Section II reviews previous work on PSN control for traditional two-cycle delay test. Section III investigates PSN in PFT. Section IV presents our proposed X-filling scheme, Section V describes experimental results, and Section VI concludes the paper and discusses future work.
II. BACKGROUND
This section reviews previous research on controlling PSN for traditional two-cycle delay test and discusses the scope for extending this work to PKLPG test.
A. Related prior work
Several techniques [1-3, 17, 18] have been studied to maximize PSN during delay test. In [1] , a GA-based iterative procedure is proposed. In each iteration, waveform simulations and fitness calculation are performed to guide selection, crossover and mutation to find patterns that induce larger PSN. In [17] , PSN maximization is modeled as a MIN-ONE problem and a SAT solver is used to maximize the transition count. Two methods [2, 3] use justification to maximize transitions on the neighboring cells. While [2] proposes several techniques to speed up the justification process, [3] utilizes a commercial ATPG engine to sensitize neighboring signal lines by virtual test point insertion. Max-Fill [18] computes functional reachable states that induce maximal WSA using both logic simulation and justification. Later partially specified patterns are filled with these computed states.
The central challenge of applying prior work to PKLPG is that PKLPG is a multiple-cycle sequential test. Computational cost increases dramatically with the number of preamble cycles, making it difficult to apply GA, SAT or justification-based methods.
In our work, we investigate a simulation-based method, Bit-Flip, to fill X bits such that the effective WSA during the capture cycle is maximized. A partially-specified scan pattern set is first randomly filled and then an iterative procedure is invoked to flip a randomly-selected group of the filled bits. Effective WSA is measured after each flip and the inversion is retained if it improves the effective WSA. Bit-Flip requires no modification to the ATPG process or circuit and has no knowledge of the fault model used during ATPG. Incremental simulation is utilized to reduce CPU time.
Bit-Flip is similar to the hill-climbing procedure reported in [13], but with a few differences. We explore Bit-Flip to maximize effective WSA for PKLPG test rather than minimize Hamming distance of scan cell states for traditional two-cycle delay test. The PSN in PKLPG focuses on IR-drop. We flip a group of bits per iteration rather than one at a time.
III. MULTIPLE-CYCLE PSN PROFILING
This section presents the PSN profiling results for PKLPG using random pattern simulation and analyzes the impact of random filling on WSA during the capture cycle. Primary inputs (PIs) are kept constant during simulation since low-cost test equipment has few high-speed pins. For each circuit, 30,000 random patterns are applied with a burst of 16 functional mode cycles. WSA [9, 13] is used as a metric to evaluate the PSN during each cycle.
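As an illustration of the metric, the following minimal sketch computes a per-cycle WSA profile from simulated gate values and normalizes it to a reference cycle. The WSA formula from [9, 13] is not restated here, so the weighting used (one plus the gate fan-out count) is an assumption based on a common convention; the function names `wsa_per_cycle` and `normalize_to_cycle` and the toy data are hypothetical.

```python
from typing import Dict, List


def wsa_per_cycle(values: List[Dict[str, int]], fanout: Dict[str, int]) -> List[int]:
    """Return the WSA of each clock cycle following the first snapshot.

    values[k] maps each gate output to its logic value after cycle k.
    A gate that toggles between snapshots k-1 and k contributes a weight
    of 1 + fanout (assumed weighting, not taken from the paper).
    """
    result = []
    for prev, curr in zip(values, values[1:]):
        wsa = sum(1 + fanout[g] for g in curr if curr[g] != prev[g])
        result.append(wsa)
    return result


def normalize_to_cycle(wsa: List[int], ref_index: int) -> List[float]:
    """Normalize a WSA profile to the value observed at ref_index."""
    ref = wsa[ref_index]
    return [w / ref for w in wsa]


# Toy example: three gates, the scanned-in state plus three functional cycles.
fanout = {"g1": 2, "g2": 1, "g3": 3}
snapshots = [
    {"g1": 0, "g2": 0, "g3": 0},
    {"g1": 1, "g2": 1, "g3": 1},  # all three gates toggle
    {"g1": 1, "g2": 0, "g3": 0},  # g2 and g3 toggle
    {"g1": 1, "g2": 0, "g3": 1},  # only g3 toggles
]
profile = wsa_per_cycle(snapshots, fanout)
normalized = normalize_to_cycle(profile, 0)
print(profile, normalized)
```

In a real flow the snapshots would come from the logic simulator for each of the 16 functional cycles, and the profile would be averaged over all random patterns before normalization.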
A. Pseudo-Functional PSN
Figures 2 and 3 depict the detailed results of PSN profiling. Chip1 and Chip2 are industrial designs, while b19 and s38417 are from the ITC99 and ISCAS89 benchmarks respectively. Figure 2 shows the average WSA over the 16 cycles for the selected circuits. For convenience, WSA is normalized to cycle 2. As expected, for all the circuits the WSA falls rapidly during the first several cycles, and then stabilizes at 20-30% of the cycle 2 level at roughly cycle 6. The high WSA in the first two cycles is probably introduced by illegal states. After the preamble cycles, the illegal states die out and the circuit approaches a functional state. Therefore, applying at least six preamble cycles (PKLPG) will produce PSN closer to functional [9, 18]. Traditional two-cycle test is applied at cycle 2 and sees a higher WSA. PKLPG sees a much lower WSA, and thus is more vulnerable to under-testing than to over-testing.
Figure 3 illustrates the correlation between the WSA at cycle 2 and cycle 16 for b19. Similar results are observed for the other circuits. The PSN results of the 30,000 random patterns at cycle 2 are sorted in increasing order. The corresponding WSA at cycle 16 is also plotted. The span between the minimum and maximum WSA observed at cycle 16 is denoted as the "Pseudo Functional PSN Range." We can see that the WSA at cycle 16 (last capture cycle) is independent of that at cycle 2 (first capture cycle). This indicates that probability-based methods [14] cannot guide filling in the presence of many preamble cycles. The large pseudo-functional PSN range at cycle 16 indicates the importance of X-filling. Given a partially specified pattern, the X bits should be assigned to sensitize the worst functional PSN condition. In the following section, we present a simulation-based X-filling scheme for this purpose.
IV. NOISE CONTROL FOR PKLPG
In this section, a simulation-based PSN maximization scheme, Bit-Flip, is described. For a given partially-specified test pattern, it attempts to incrementally improve the effective WSA by flipping a group of randomly-selected X-bits.
A. Overview of Proposed Framework
The Bit-Flip algorithm is shown in Table 1. It consists of a preprocessing step, an iterative step and a bit-relaxation step. The circuit netlist, layout, critical path list and test pattern set are inputs to the algorithm.
In the preprocessing step, a pattern is fetched from the test set together with the corresponding path(s). First, procedure Critical-Identify(P, T, Circuit) identifies the cells near on-path gates by physical position matching and stores them in SCells. We term these cells critical cells. It has been demonstrated that critical cells have considerable impact on the PSN of on-path gates [3]. Meanwhile, scan cells located in the fan-in cone of the critical cells are also marked and stored in SBits. These scan cells are termed critical bits. Next, all critical bits (SBits) are randomly filled and an event-driven logic simulator, CodSim, simulates the filled pattern. Then effective WSA (EWSA) is measured as the initial Max-WSA. EWSA is defined as the sum of critical cell WSA.
In the second step, the number of rounds and the initial group size (the initial value of $g$) are chosen based on the CPU time budget.
At the beginning of a round, GetPotentialSCList(Circuit) collects potential scan cells that have constant logic values at the launch/capture cycles. The union, L1, of SBits and the potential scan cell list serves as the final bit set for PSN control. Then Bit-Flip enters the iterative process. In each iteration, it randomly selects up to $g$ scan cells from the set L1 and flips them. Then CodSim simulates the result, and EWSA is measured. If EWSA increases, the flipped bits are kept; otherwise, they are restored. At the end of each round, the group size is reduced by a constant. The flipping process terminates when the maximum number of rounds is reached or enough consecutive failures occur. After the iterative procedure concludes, bit-wise relaxation is performed to maximize the number of X-bits, for the benefit of MT-filling or test compaction.
Table 1. Bit-Flip algorithm
3. (SCells, SBits) = Critical-Identify(P, T, Circuit);
4. Critical-Random-Fill(SBits, T);
5. CodSim(T, Circuit);
6. Max-WSA = Critical-WSA(SCells);
7. while more Rounds
8.   L1 = GetPotentialSCList(Circuit) ∪ SBits;
9.   while (L1 != ∅)
10.    Randomly select up to $g$ bits, flip them;
11.    Event-Driven-Sim(T, Circuit);
12.    NewWSA
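The iterative step can be sketched as a greedy group-flip search. This is a minimal illustration under stated assumptions, not the CodSim-based implementation: the EWSA evaluation is passed in as a callback (`ewsa`), the candidate bit set is fixed rather than recomputed each round, and the names `bit_flip` and `toy_ewsa` and all parameter values are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def bit_flip(pattern: List[int],
             free_bits: Sequence[int],
             ewsa: Callable[[List[int]], float],
             rounds: int,
             iters_per_round: int,
             group_size: int,
             decrement: int,
             max_failures: int,
             rng: random.Random) -> float:
    """Greedy group-flip search; edits pattern in place and returns its EWSA."""
    best = ewsa(pattern)
    if not free_bits:
        return best
    failures = 0
    for _ in range(rounds):
        for _ in range(iters_per_round):
            # Select up to group_size bits from the candidate set and flip them.
            k = rng.randint(1, min(group_size, len(free_bits)))
            group = rng.sample(list(free_bits), k)
            for i in group:
                pattern[i] ^= 1
            new = ewsa(pattern)
            if new > best:
                best = new          # keep the flipped bits
                failures = 0
            else:
                for i in group:     # restore the flipped bits
                    pattern[i] ^= 1
                failures += 1
                if failures >= max_failures:
                    return best
        # Shrink the group size by a constant at the end of each round.
        group_size = max(1, group_size - decrement)
    return best


def toy_ewsa(p: List[int]) -> float:
    """Stand-in for logic simulation: count adjacent bits that differ."""
    return float(sum(a != b for a, b in zip(p, p[1:])))


rng = random.Random(0)
pattern = [0] * 16
free = list(range(4, 16))           # bits 0-3 play the role of care bits
initial = toy_ewsa(pattern)
final = bit_flip(pattern, free, toy_ewsa, rounds=3, iters_per_round=50,
                 group_size=4, decrement=1, max_failures=40, rng=rng)
print(initial, final)
```

Because a flip is only retained when the measured value increases, the result can never be worse than the initial random fill, and bits outside the candidate set are never modified.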
B. Critical Cell Identification
In this work, we utilize the effective region to identify critical cells, as shown in Figure 4. This approach is motivated by the model used in [3]. Vertically, gates in the same and neighboring rows are critical cell candidates since they share either power or ground with on-path gates. Horizontally, each row is divided into segments by power/ground strips. Based on power grid analysis, an effective region can be set around on-path gates in order to capture the localized character of PSN. All gates within the effective region are critical cell candidates.
Critical-Identify (P, T, Circuit) performs 3-value (logic 1, 0, X) simulation on the partially specified pattern. All candidates that have undetermined values (XX, 1/0X, X1/0) at launch/capture cycles are denoted as critical cells.
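The selection rule can be illustrated with a small 3-valued evaluator. This sketch makes several simplifying assumptions: the launch and capture frames are simulated as two independent combinational evaluations rather than as a multi-cycle sequence, only AND, OR and NOT gates are modeled, and the netlist format and the names `simulate` and `critical_cells` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

X = None  # unknown logic value
Value = Optional[int]
Gate = Tuple[str, str, Tuple[str, ...]]  # (output net, gate type, input nets)


def and3(a: Value, b: Value) -> Value:
    if a == 0 or b == 0:
        return 0            # a controlling 0 decides the output
    if a is X or b is X:
        return X
    return 1


def or3(a: Value, b: Value) -> Value:
    if a == 1 or b == 1:
        return 1            # a controlling 1 decides the output
    if a is X or b is X:
        return X
    return 0


def not3(a: Value) -> Value:
    return X if a is X else 1 - a


def simulate(netlist: Sequence[Gate], inputs: Dict[str, Value]) -> Dict[str, Value]:
    """Evaluate a topologically ordered netlist in 3-valued logic."""
    values = dict(inputs)
    for out, kind, ins in netlist:
        if kind == "NOT":
            values[out] = not3(values[ins[0]])
        elif kind == "AND":
            values[out] = and3(values[ins[0]], values[ins[1]])
        else:  # "OR"
            values[out] = or3(values[ins[0]], values[ins[1]])
    return values


def critical_cells(candidates: Sequence[str],
                   launch: Dict[str, Value],
                   capture: Dict[str, Value]) -> List[str]:
    """Keep candidates whose launch or capture value is still undetermined."""
    return [c for c in candidates if launch[c] is X or capture[c] is X]


netlist = [("n1", "AND", ("a", "b")), ("n2", "OR", ("a", "c")), ("n3", "NOT", ("n1",))]
launch = simulate(netlist, {"a": 0, "b": X, "c": 1})
capture = simulate(netlist, {"a": 1, "b": X, "c": X})
cells = critical_cells(["n1", "n2", "n3"], launch, capture)
print(cells)
```

In this toy case n2 is fully determined in both frames, so filling X bits cannot change its transition and it is excluded.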
After critical cells are identified, Bit-Flip attempts to maximize the sum of the WSA of critical cells (EWSA). 
C. Task granularity
Bit-Flip flips a group of bits each time. To select an appropriate group size, we need to consider the potential EWSA improvement as well as simulation time.
Assuming there is no overlap among the fan-out cones of the flipped bits, simulation time increases linearly with the group size and the total iterations. This is usually true if the group size is much smaller than the number of scan bits, since the flipped bits then spread sparsely along the scan chains.
The total number of bits that are covered by Bit-Flip is:
$$N_B = \sum_{r=0}^{R-1} I \, (g_0 - r\,d)$$
where $R$ is the number of rounds, $I$ is the total iterations of each round, $g_0$ is the initial group size, and $d$ is the decrement of the group size. The time cost of Bit-Flip can be formulated as $T = N_B \cdot t$, where $t$ is the simulation time cost of flipping one bit. In order to reach the maximal PSN, two conditions must be satisfied: 1) $N_B \gg N_S$, where $N_S$ is the number of scan bits. This guarantees that each bit is flipped enough times. 2) $I$ is large enough to adequately explore the exponential search space.
In practice, the time budget $T$ is fixed. Therefore, the total number of bits that can be flipped, $N_B$, is fixed. Here we assume that condition 1) is satisfied. With a fixed $N_B$, in order to make $I$ large enough, we need to make $g_0$ as small as possible. However, too small a group size causes the transitions (flipped bits) to die out over the preamble cycles, and so fails to improve EWSA.
Therefore, in Bit-Flip we first try a large group size to search across the exponential space and approach a locally optimal PSN. By decreasing the group size round by round, we gradually approach the optimal result while limiting the execution time.
D. Critical Bit Fill and Bit Relaxation
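The budget trade-off can be made concrete with a short calculation. The sketch below assumes that the group size shrinks by a constant after every round and that the simulation cost is proportional to the number of flipped bits; the function names and all numeric values (iterations, group sizes, seconds per bit) are hypothetical and are not taken from the experiments.

```python
def bits_covered(iters_per_round: int, initial_group: int,
                 decrement: int, rounds: int) -> int:
    """Total number of bit flips attempted over all rounds."""
    total = 0
    for r in range(rounds):
        group = initial_group - r * decrement
        if group <= 0:
            break  # the group size cannot shrink below one bit
        total += iters_per_round * group
    return total


def time_cost(total_bits: int, seconds_per_bit: float) -> float:
    """Simulation time, assuming cost scales linearly with flipped bits."""
    return total_bits * seconds_per_bit


# Same flip budget spent with a large and with a small initial group size.
large = bits_covered(iters_per_round=1000, initial_group=30, decrement=5, rounds=4)
small = bits_covered(iters_per_round=3000, initial_group=10, decrement=0, rounds=3)
print(large, small, time_cost(large, 1e-4))
```

Both configurations attempt the same number of bit flips, but the second performs three times as many iterations per round, which illustrates why a smaller group size allows more iterations within a fixed budget.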
In order to narrow the search space, structural information can be used to identify critical bits that are highly correlated to the logic value of the critical cells. Bits in the fan-in cone of critical cells are most likely critical. After critical cells are identified, a multi-cycle back-trace procedure is called to collect critical bits, SBits. However, multi-cycle back-trace may cause too many bits to become critical, which will increase the fill rate and degrade pattern compaction performance.
To limit the number of SBits, we need to identify the bits which are insignificant to EWSA maximization and exclude them from SBits. That is, if we relax an insignificant bit to X, EWSA will not be reduced. Therefore, we apply a bitwise relaxation procedure, Bit-Relaxation(Circuit, T), to turn insignificant bits into X bits. The procedure relaxes each bit to X, simulates the circuit, and keeps the relaxation if EWSA is not decreased. Otherwise the bit is restored. An efficient relaxation method can be found in [22], although its focus is fault coverage, not PSN.
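The relaxation loop can be sketched as follows. This is an illustration under assumptions: X is represented by `None`, the EWSA evaluation is a callback standing in for 3-valued simulation, and the names `relax_bits` and `toy_ewsa` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Pattern = List[Optional[int]]  # None represents an X bit


def relax_bits(pattern: Pattern, filled: Sequence[int],
               ewsa: Callable[[Pattern], float]) -> List[int]:
    """Relax each filled bit to X and keep the change if EWSA does not drop.

    Edits pattern in place and returns the indices that were relaxed.
    """
    target = ewsa(pattern)
    relaxed = []
    for i in filled:
        saved = pattern[i]
        pattern[i] = None
        if ewsa(pattern) >= target:
            relaxed.append(i)       # insignificant bit: leave it as X
        else:
            pattern[i] = saved      # significant bit: restore its value
    return relaxed


def toy_ewsa(p: Pattern) -> float:
    """Stand-in metric: definite transitions among the first four bits only."""
    window = p[:4]
    return float(sum(a is not None and b is not None and a != b
                     for a, b in zip(window, window[1:])))


pattern: Pattern = [0, 1, 0, 1, 1, 0, 1, 0]
relaxed = relax_bits(pattern, filled=[2, 3, 4, 5, 6, 7], ewsa=toy_ewsa)
print(relaxed, pattern)
```

Each trial costs one evaluation, so the loop is linear in the number of filled bits; with a real simulator, incremental simulation would keep each evaluation cheap.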
If the fill rate of the test patterns is limited, such as to enable high test compression rates, a trade-off must be made between EWSA maximization and X-bit utilization. This is done by adding a significance ranking to X-bits during the relaxation process. We use the change in EWSA to rank the bits. This can then be used to select which bits are relaxed.
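One possible reading of the ranking step is sketched below: each filled bit is scored by the EWSA lost when that bit alone is relaxed, and only the highest-scoring bits are kept until a fill limit is met. The scoring rule, the tie-breaking by bit index, and the names `rank_by_significance`, `relax_to_limit` and `toy_ewsa` are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence

Pattern = List[Optional[int]]  # None represents an X bit


def rank_by_significance(pattern: Pattern, filled: Sequence[int],
                         ewsa: Callable[[Pattern], float]) -> List[int]:
    """Order filled bits from most to least significant for EWSA."""
    base = ewsa(pattern)
    scored = []
    for i in filled:
        saved = pattern[i]
        pattern[i] = None                      # relax this bit alone
        scored.append((base - ewsa(pattern), i))
        pattern[i] = saved                     # undo the trial relaxation
    scored.sort(key=lambda s: (-s[0], s[1]))   # larger EWSA loss ranks first
    return [i for _, i in scored]


def relax_to_limit(pattern: Pattern, filled: Sequence[int],
                   ewsa: Callable[[Pattern], float], max_filled: int) -> List[int]:
    """Keep the max_filled most significant bits and relax the rest to X."""
    ranked = rank_by_significance(pattern, filled, ewsa)
    for i in ranked[max_filled:]:
        pattern[i] = None
    return ranked[:max_filled]


def toy_ewsa(p: Pattern) -> float:
    """Stand-in metric: definite transitions among the first four bits only."""
    window = p[:4]
    return float(sum(a is not None and b is not None and a != b
                     for a, b in zip(window, window[1:])))


pattern: Pattern = [0, 1, 0, 1, 1, 0, 1, 0]
kept = relax_to_limit(pattern, filled=[2, 3, 4, 5, 6, 7], ewsa=toy_ewsa, max_filled=1)
print(kept, pattern)
```

Scoring bits one at a time ignores interactions between bits, so a tighter fill limit can lose more EWSA than the individual scores suggest.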
E. Compacted Pattern Consideration
Test compaction [10, 23] is used to reduce pattern count and minimize test time. Compacted patterns typically have higher care-bit density, which reduces the search space for PSN control. Bit-Flip can be applied to compacted patterns with slight modification.
First, the paths tested by a given pattern can be searched in a breadth-first manner. If the pattern tests a critical path, we term this a critical pattern and Bit-Flip is applied to it. Critical paths can be obtained from STA tools or by setting a threshold on path length. In practice, the threshold can be selected based on the path length distribution and the CPU time budget. If the compaction algorithm attempts to pack critical paths together, the number of critical patterns may be small.
Second, critical cells are identified for each critical path tested by the critical patterns, and their EWSA is weighted based on the path length. Since a longer path is more sensitive to PSN-induced delay, a larger weight is assigned to its critical cells. The weight is the ratio of path length to the longest path length (or clock cycle time). If there is an overlap of critical cells on different paths, the WSA is weighted by the longest path. Bit-Flip attempts to maximize the weighted EWSA of all critical cells.
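The weighting rule can be sketched as follows. The data layout (a list of path lengths with their critical cell sets and a per-cell WSA dictionary) and the names `cell_weights` and `weighted_ewsa` are hypothetical; the weights follow the ratio described above, with overlapping cells taking the weight of the longest path that contains them.

```python
from typing import Dict, Iterable, List, Tuple


def cell_weights(paths: List[Tuple[float, Iterable[str]]],
                 longest: float) -> Dict[str, float]:
    """Weight each critical cell by path length / longest path length.

    A cell shared by several paths takes the weight of the longest one.
    """
    weights: Dict[str, float] = {}
    for length, cells in paths:
        w = length / longest
        for c in cells:
            weights[c] = max(weights.get(c, 0.0), w)
    return weights


def weighted_ewsa(cell_wsa: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum the WSA of all critical cells, scaled by their weights."""
    return sum(w * cell_wsa.get(c, 0.0) for c, w in weights.items())


# Two critical paths that share cell "b"; lengths are in arbitrary time units.
paths = [(10.0, {"a", "b"}), (8.0, {"b", "c"})]
weights = cell_weights(paths, longest=10.0)
total = weighted_ewsa({"a": 3, "b": 2, "c": 5}, weights)
print(weights, total)
```

Passing the clock cycle time instead of the longest path length as `longest` gives the alternative normalization mentioned in the text.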
V. EXPERIMENTAL RESULTS
We implemented Bit-Flip in C++ running on a 3.16 GHz processor with 4 GB of memory. Robust paths and patterns are generated using the in-house PKLPG tool, CodGen, with K=1 (one longest rising and falling path per line) and 6 preamble cycles. Physical layouts were generated using commercial tools. In the following, Bit-Flip with N iterations will be termed BF-N. The 10 longest paths from b19 that do not share gates were selected for experiments. These paths/patterns are termed P0 to P9.
First, we investigate how group size affects Bit-Flip performance for a fixed CPU time budget. We ran Bit-Flip on path P0 while limiting CPU time to 10s. This is a generous amount of CPU time for one path. For each group size, we filled the pattern 1000 times and the average EWSA is compared with the best of 10,000 randomly-filled patterns (∆EWSA). As shown in Figure 5, the average ∆EWSA peaks for an initial group size of 30, which is about 0.5% of the total bits. Similar results are observed on ISCAS89 circuits s38417, s38584, and s35932, which peak at group size 5. A larger group size can discover the logic correlation among bits. However, too large a group cannot maximize the average ∆EWSA within the time budget.
Second, we investigate how the number of iterations affects performance. We ran Bit-Flip on P0 to P9 for 1000, 4000 and 10,000 iterations (BF-1000, BF-4000, and BF-10000) with an initial group size of 30. To validate Bit-Flip effectiveness, each pattern is filled 100 times for each configuration and the results are compared with the best random patterns as shown in Table 2.
In Table 2, the initial and final care-bit count, average ∆EWSA and CPU time are shown for each path. The average ∆EWSA of BF-1000, BF-4000, and BF-10000 are 10.31%, 15.71% and 17.09%. The best performance is observed on P4 using BF-10000, which has a ∆EWSA of 38.7%. Most paths have a 10%-25% improvement using BF-4000. The rate of EWSA improvement levels off with more iterations. For most paths, BF-4000 provides the best trade-off between PSN maximization and CPU time. The 95% confidence interval for average ∆EWSA is shown in Figure 6.
There is a relatively large range of pseudo functional EWSA for a given path. Quiet and noisy patterns can be binned and used to characterize the noise sensitivity of the paths. For example, Figure 7 illustrates the EWSA distribution for P0 of 1000 randomly filled patterns and 1000 patterns filled using BF-1000 and BF-10000. By applying patterns from left (quiet) to right (noisy) and computing FMAX for each bin, the sensitivity of delay to PSN can be understood.
Bit-Flip provides the least improvement on paths P2, P6 and P8 compared to the best random pattern. To understand these three cases, the total number of critical cells (T.C.), transitioning critical cell count (T.O.) and transition rate (T.R.) are shown in Table 3. It can be seen that the transition rate of these three paths is higher than that of the other paths. The noise on these three paths is already relatively high, so there is not much room for improvement.
The number of X bits used for PSN control is about 40% more than the original care bits. In aggregate, less than 10% of the pattern bits are care bits. Since logic simulation time dominates the algorithm, the CPU time is nearly linear in the number of iterations. BF-4000 takes about 40s on b19 while simulating 10,000 random patterns takes more than 2 hours.
We compare the Bit-Flip approach to the ATPG-based PSN maximization approach in [3]. Based on their published data, we can only compare the average transition rates (T.R.) (total aggressor transitions divided by total aggressors). The average T.R. in [3] is 14.08% with virtual test point insertion, with a CPU time of 40s for the ATPG step. As shown in Tables 2 and 3, BF-4000 averages 17.24% T.R. at 38s CPU time. So the two methods have similar performance.
Third, we investigate how fill rate constraints limit the performance of Bit-Flip. We run BF-4000 with group size 30 while the fill rate is varied from 3% to 10%, compared to the original fill rate of 2.4%. For each case, we run BF-4000 100 times on P0. After BF-4000 completes, the remaining X bits in the filled pattern are randomly filled for a fair comparison. The average ∆EWSA relative to random fill (Average ∆EWSA-R) and relative to the best random fill (Average ∆EWSA-BR) are shown in Figure 8. BF-4000 always performs better than random fill and always performs better than best random once the fill rate is more than 5%. The improvement for P0 levels off when the fill rate is above 7%.
Finally, we evaluate the ∆EWSA achieved on compacted patterns of benchmark circuits in Table 4. The patterns were dynamically compacted [23]. Paths longer than 70% of the longest path are considered critical, and any patterns containing them are subject to the Bit-Flip procedure. We chose 70% as the threshold since STA errors of 30% have been reported in the literature. The total pattern count (T.P.), critical pattern count (C.P.) and transition rate (T.R.) are also shown. On average 24% of patterns are critical and require PSN control. b19 is less compacted than the other three circuits and a large ∆EWSA is obtained. Although the other three circuits have about a 20% care bit density, Bit-Flip still performs significantly better than random fill.
Future work will focus on improving algorithm efficiency by utilizing parallel pattern logic simulation. This should reduce the CPU time by more than an order of magnitude. The compaction algorithm will also be investigated to determine its interaction with PSN on PKLPG patterns. Prior work [10] has focused on vetoing compactions that would exceed a noise limit, but has not looked at compacting patterns to minimize critical patterns. Furthermore, we will investigate more accurate and efficient PSN estimation methods, such as [24], and apply the proposed scheme to multiple clock domains.
