Path delay fault testing becomes increasingly important due to higher clock rates and higher process variability caused by shrinking geometries. Achieving high-coverage path delay fault testing requires the application of scan justified test vector pairs, coupled with careful ordering of the scan flip-flops and/or insertion of dummy flip-flops in the scan chain. Previous works on scan synthesis for path delay fault testing using scan shifting have focused exclusively on maximizing fault coverage and/or minimizing the number of dummy flip-flops, but have disregarded the scan wirelength overhead. In this paper we consider both dummy flip-flop and wirelength costs, and focus on post-layout formulations that capture the achievable tradeoffs between these costs and delay fault coverage in scan chain synthesis.
Introduction
Scan-based path delay fault testing requires the application of two test vectors: the first test vector, or initialization vector, initializes the logic to a known state, while the second vector, or activation vector, activates the targeted fault, causing a transition to be propagated along the path under test. It is well-known that at-speed application of test vector pairs to the primary inputs has low path delay fault coverage [4]. Improved coverage can be achieved by using scan chaining, which has become the design-for-test (DFT) technique of choice for stuck-at fault testing [1].
A scan chain is formed of scan flip-flops, which are some or all of the flip-flops existing in a design. One end of the scan chain appears as a primary input (PI) and the other end appears as a primary output (PO). Standard scan-based delay fault testing involves justifying a test vector by giving clocks to the circuit placed in test mode, giving one (scan shifting) or two (functional justification) clock(s) to the circuit in normal mode, then shifting out the resulting flip-flop values by giving clocks in test mode.
There are two techniques to produce the vector pairs for path delay fault testing: functional justification and scan shifting. With functional justification, the initialization vector not only sensitizes the proper paths but also produces the activation vector. On the other hand, with scan justification the activation vector is produced by a single shift of the initialization vector. Some pros and cons of the two techniques are as follows.
- It is known that test generation complexity using scan shifting is typically lower than that using functional justification. To save test generation time, vectors may be generated using scan shifting first, and functional justification may be used for faults that cannot be tested by scan shifting. This approach was studied in [7], and a 30% savings in test generation time was reported.
- It has been argued that path delay faults which cannot be detected by functional justification are likely to be functionally false paths, but identification of functionally untestable paths is a hard problem [16]. Several faults that cannot be detected using functional justification (by commercial ATPG tools) may be detected by scan shifting.
- Vectors generated using functional justification have the advantage of being scan order independent. This allows scan order to be driven by layout such that the wirelength is minimized. However, there may be multiple equi-wirelength scan chain orderings, some of which may be conducive to scan justification based path delay testing. It is therefore possible to increase fault coverage with little or no impact on layout overhead of scan.
The requirement that the activation vector must be obtained from the initialization vector by one-bit shifting along the scan chain [23] constrains scan chain synthesis for delay fault testing using scan shifting. In general, not all activation vectors can be realized in this way once we fix the order of the flip-flops in the scan chain. Under the standard practice of using a single scan-enable signal, with scan chain edges always linking the non-negated data output pin of the source flip-flop to the data input pin of the destination flip-flop, we can capture the interdependence between test-vector pairs and scan chain order as follows: a scan chain edge i-j conflicts with a test if the value that the initialization vector places in flip-flop i differs from the value that the activation vector requires in flip-flop j. Scan chain edge i-j can be made compatible with all conflicting tests either by "enhancing" flip-flop i to store an additional bit or by inserting a separate 1-bit flip-flop between i and j. We will refer to this operation as inserting a dummy flip-flop in the edge i-j.
In this paper we consider both dummy flip-flop and wirelength costs and focus on post-layout formulations that capture the achievable tradeoffs between these costs and delay fault coverage in scan chain synthesis. Layout information is beneficial to scan chain synthesis in two important ways: (1) it enables higher ATPG selectivity in the choice of paths to be tested due to the availability of accurate path criticalities, and (2) it makes possible accurate estimation of scan routing cost and impact on circuit performance, thus enabling better informed coverage-cost tradeoff decisions.
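The one-bit shift requirement can be illustrated with a minimal sketch. It assumes fully specified vectors (no don't-care bits), a single scan chain given as a list of flip-flop indices, and that the first flip-flop in the chain is loaded from the scan input and is therefore unconstrained. The names `edge_conflicts`, `conflicting_edges`, and `dummies_needed` are hypothetical and used only for illustration.

```python
from typing import List, Sequence, Tuple

# A test is a pair (initialization vector, activation vector), each indexed
# by flip-flop id. Assumption: every bit is specified (no don't-cares).
Test = Tuple[Sequence[int], Sequence[int]]


def edge_conflicts(i: int, j: int, test: Test) -> bool:
    """True if scan edge i -> j cannot realize `test` by a one-bit shift.

    After one shift, flip-flop j holds the value flip-flop i held in the
    initialization vector, so the edge is compatible only if that value is
    the one the activation vector requires in j.
    """
    init, act = test
    return init[i] != act[j]


def conflicting_edges(order: Sequence[int], test: Test) -> List[Tuple[int, int]]:
    """Scan chain edges of `order` that conflict with `test`."""
    return [(i, j) for i, j in zip(order, order[1:]) if edge_conflicts(i, j, test)]


def dummies_needed(order: Sequence[int], tests: Sequence[Test]) -> int:
    """Number of edges that conflict with at least one test, i.e. the number
    of dummy flip-flops required for complete coverage with this order."""
    bad = set()
    for t in tests:
        bad.update(conflicting_edges(order, t))
    return len(bad)


if __name__ == "__main__":
    t = ((1, 0, 1), (0, 1, 0))
    print(conflicting_edges([0, 1, 2], t))
    print(conflicting_edges([2, 0, 1], t))
```

In this toy instance the same test is realizable by shifting under one flip-flop order and requires one dummy flip-flop under another, which is the dependence on scan order discussed above.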
Our contributions include:
- An efficient heuristic for maximizing delay fault coverage by simultaneous layout-aware scan chain synthesis and insertion of a bounded number of dummy flip-flops.
- A compact ILP formulation for the problem of optimally inserting a number of dummy flip-flops in a given scan chain. This ILP is solved in practical runtime using the CPLEX commercial optimizer for designs with up to tens of thousands of scan flip-flops.
- A comprehensive empirical evaluation of the proposed algorithms on industry testcases, including a detailed analysis of the tradeoffs between delay fault coverage on one hand and number of dummies and scan chain wirelength on the other hand.
The rest of our paper is organized as follows. In Section 2, we give an efficient heuristic for the problem of maximizing path delay coverage by scan chain synthesis and simultaneous insertion of a bounded number of dummy flip-flops. In Section 3 we prove the NP-hardness of, and give a compact ILP formulation for, the problem of computing achievable tradeoffs between delay fault coverage and the number of dummy flip-flops inserted in an already routed scan chain. Finally, we present experimental results in Section 4 and conclude in Section 5.
Formulations for Post-Layout Coverage Driven Scan Chain Synthesis
[...] The number of dummy flip-flops needed to achieve complete coverage (i.e., the number of edges in π that conflict with at least one test of T) is minimized.
The above formulation is appropriate when complete fault coverage is a design requirement. However, for most designs full coverage is not required. Rather, designers decide on a design-by-design basis the best tradeoff between delay fault coverage and scan chain cost (wirelength, dummy flip-flops, impact on performance, [...]
Figure 1 (excerpt). Phase 3: Construct the auxiliary graph G'' by adding to [...] the edges [...]
[...] is made untestable by the inclusion of an edge into the chain fragments, it should no longer be counted as conflicting with the remaining edges.
Simultaneously, the multi-fragment greedy heuristic also attempts to keep the wirelength of the fragments as low as possible. It does so by growing the fragments as much as possible using short edges before starting to use longer edges. To consider both wirelength and coverage simultaneously, we rank the edges according to a weighted combination of length normalized by the average length, and number of incompatible faults (see Step 2 in Figure 1). First, the algorithm considers edges with a weighted combination value below a threshold value T (we use an initial threshold of 1.4 in our experiments). The parameter w determines the relative weight of normalized length vs. lost coverage, and can be modified to achieve different tradeoffs between the wirelength and the coverage of the final tour. In our experiments we use w = 2.0. The fragments are then extended iteratively using these edges (edges with the least number of incompatible faults first, breaking ties based on wirelength). When no edges are left, we increase the threshold T by a multiplicative factor J (we use J = 1.2), and attempt to grow the fragments in the same manner using the edges that now become eligible, i.e., have a weighted combination of length and number of incompatible faults below T. The experimental conclusions that we report below are not very sensitive to the choice of the T and J parameter values.
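The threshold-driven fragment growth described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the exact procedure of Figure 1: the weighted combination is assumed here to be `w * length / average_length + (number of newly lost faults)`, edge lengths are assumed positive, and the function name `multi_fragment_greedy` and its data layout (dictionaries keyed by directed edges) are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[int, int]  # directed scan edge (source flip-flop, destination flip-flop)


def multi_fragment_greedy(
    n: int,
    length: Dict[Edge, float],
    conflicts: Dict[Edge, Set[Hashable]],
    num_dummies: int,
    w: float = 2.0,
    T: float = 1.4,
    J: float = 1.2,
) -> Tuple[List[List[int]], Set[Hashable]]:
    """Grow chain fragments until num_dummies + 1 fragments remain.

    Returns the fragments and the set of faults made untestable.
    The scoring formula is an assumption made for this sketch.
    """
    avg_len = sum(length.values()) / len(length)
    succ: Dict[int, int] = {}
    pred: Dict[int, int] = {}
    parent = list(range(n))  # union-find over fragments, used to avoid cycles
    lost: Set[Hashable] = set()

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def new_losses(e: Edge) -> int:
        # Faults already lost are no longer counted against remaining edges.
        return len(conflicts.get(e, set()) - lost)

    def score(e: Edge) -> float:
        return w * length[e] / avg_len + new_losses(e)

    edges_to_pick = n - (num_dummies + 1)
    picked = 0
    while picked < edges_to_pick:
        feasible = [
            e for e in length
            if e[0] not in succ and e[1] not in pred and find(e[0]) != find(e[1])
        ]
        if not feasible:
            break  # graph too sparse to merge further
        eligible = [e for e in feasible if score(e) <= T]
        if not eligible:
            T *= J  # relax the threshold and retry
            continue
        # Fewest incompatible faults first, ties broken by wirelength.
        i, j = min(eligible, key=lambda e: (new_losses(e), length[e]))
        succ[i], pred[j] = j, i
        parent[find(i)] = find(j)
        lost |= conflicts.get((i, j), set())
        picked += 1

    fragments = []
    for start in range(n):
        if start not in pred:
            frag = [start]
            while frag[-1] in succ:
                frag.append(succ[frag[-1]])
            fragments.append(frag)
    return fragments, lost
```

The sketch recomputes the eligible edge list after every insertion, which is quadratic per step; a practical implementation would presumably maintain sorted candidate lists instead.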
In the second phase, we combine the D + 1 fragments into a single scan chain with the help of D dummy flip-flops. Since the objective of this phase is to increase the wirelength of the scan chain by the least amount possible, we perform this "fragment stitching" by using a wirelength driven ATSP solver (even high-quality solvers such as LKH [13] can be used in practice since the number of fragments is small).
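For intuition, fragment stitching can be sketched with exhaustive search over fragment orders in place of an ATSP solver; this substitution is only viable for a handful of fragments and is not the solver used in the paper. The sketch assumes each fragment keeps its internal direction, that the cost of a join is the Manhattan distance from the tail of one fragment to the head of the next, and that the dummy flip-flop sits on that join at no extra wirelength. The names `stitch_fragments` and `manhattan` are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def stitch_fragments(
    fragments: Sequence[Sequence[int]], pos: Dict[int, Point]
) -> Tuple[List[int], List[Tuple[int, int]], float]:
    """Order the fragments so that the total tail-to-head wirelength is minimal.

    Returns the stitched chain, the join edges (one dummy flip-flop each),
    and the added wirelength. Exhaustive search: factorial in the number of
    fragments, used here only as a stand-in for an ATSP solver.
    """
    best_cost, best_perm = None, None
    for perm in permutations(range(len(fragments))):
        cost = sum(
            manhattan(pos[fragments[a][-1]], pos[fragments[b][0]])
            for a, b in zip(perm, perm[1:])
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    chain = [ff for k in best_perm for ff in fragments[k]]
    joins = [
        (fragments[a][-1], fragments[b][0])
        for a, b in zip(best_perm, best_perm[1:])
    ]
    return chain, joins, best_cost
```

The problem is asymmetric because joining fragment A before B connects A's tail to B's head, which generally differs in length from connecting B's tail to A's head.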
In the third phase, all edges compatible with surviving faults are added to the tour, and an ATSP solver is called to further decrease 
Optimal Dummy Flip-Flop Insertion in a Given Scan Chain
In this section we consider minimum dummy flip-flop insertion in a scan chain constructed in a previous design phase (possibly using the three-phase heuristic in Section 2). We assume that a set of spare sites available for dummy flip-flop insertion have been iden- 
Proof.
We will show that the NP-hard CLIQUE problem reduces in polynomial time to MCDI. Given a graph G = (V, E) and a positive integer k, the CLIQUE problem asks if G has a complete subgraph of size k. Without [...]
The ILP formulation is the following:
It is easy to see that ILP (1) is equivalent to MCDI: constraint (2) ensures that no more than D dummies are inserted, while constraints (3) make sure that a test t is counted by the objective function as covered only if dummies have been inserted on all scan edges conflicting with it. ILP (1) has compact size (at most n + |T| - 1 binary variables and at most |T| + 1 constraints), and, as shown by the results in Section 4, can be solved to optimality in practical running time by the commercial solver CPLEX. If needed, significant speedups can be achieved in practice by instructing CPLEX to stop as soon as it finds feasible solutions known to be within a small cost of the optimum.
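The semantics of the optimization problem (insert at most D dummies, and count a test as covered only when every scan edge conflicting with it has a dummy) can be checked on toy instances with exhaustive search. The sketch below is not the ILP and does not use CPLEX; it enumerates dummy placements directly, which is exponential and only meant to make the objective and constraints concrete. The function name `max_coverage_dummy_insertion` and the representation of each test by its set of conflicting edge indices are assumptions.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple


def max_coverage_dummy_insertion(
    conflict_sets: Sequence[Set[int]], D: int
) -> Tuple[int, Tuple[int, ...]]:
    """Choose at most D scan edges for dummy insertion to maximize coverage.

    conflict_sets[t] is the set of scan edge indices conflicting with test t.
    A test is covered only if all of its conflicting edges receive a dummy
    (tests with no conflicting edge are always covered).
    Returns (number of covered tests, chosen edge indices).
    """
    # Only edges that conflict with at least one test can help.
    candidates = sorted(set().union(*conflict_sets))
    best = (-1, ())
    for k in range(min(D, len(candidates)) + 1):  # budget: at most D dummies
        for chosen in combinations(candidates, k):
            selected = set(chosen)
            covered = sum(1 for c in conflict_sets if c <= selected)
            if covered > best[0]:
                best = (covered, chosen)
    return best


if __name__ == "__main__":
    tests = [{0}, {0, 1}, {2}, set(), {3, 4}]
    print(max_coverage_dummy_insertion(tests, D=2))
```

In an ILP view of the same problem, each chosen edge would correspond to a binary edge variable and each covered test to a binary test variable, which is consistent with the variable count stated above.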
Experimental Study
In this section we describe our experimental setup and results. The testcases used in our experiment are described in Table 2. Reported ILP runtimes are obtained using CPLEX 7.0 on a 300MHz Sun Ultra-10 with 1GB RAM. The three-phase MaxDFC-Scan heuristic and ScanOpt were run on a 2.4GHz Intel Xeon server with 2GB RAM. Since vectors using functional justification can be used to test faults irrespective of the scan order, we separate the paths that are testable by functionally justified vectors. We use a commercial ATPG tool, Synopsys TetraMAX, to generate robust vectors using functional justification for the testcases. We obtain a scan order using each of three different flows, and compare the final coverages and scan chain wirelengths.
For each of the testcases we conducted the experiments in the following way:
1. The Verilog RTL design is synthesized using Synopsys Design Compiler in an Artisan TSMC 0.13 µm library.
2. The most critical paths and their sensitizing test patterns were found using Synopsys PrimeTime. We select the top 5000 critical paths or the paths that have a slack less than 30% of the clock period, whichever is less. Then the true paths (as detected by PrimeTime) are selected and used for testing.
3. Robust vectors using functional justification are generated for the synthesized netlist using Synopsys TetraMAX. We consider only robust tests in our experiments since this type of test is guaranteed to detect excessive delay on the given path irrespective of timing on other paths. Robust tests can also be useful for characterizing the timing of a particular path, or for better diagnostic resolution of a failing path delay test. Note however that requiring only robust path delay fault tests will result in lower overall coverage.
4. Path sensitization vectors from Synopsys PrimeTime are used to construct robust vectors to be applied using scan justification. The paths tested using functional justification in the previous step are excluded. Only this set of test vectors is passed on to the scan chain ordering flows.
5. The synthesized design is placed with Cadence PKS to generate a placed DEF netlist.
6. We do the scan chain ordering using each of the following:
Flow I: Placement driven scan chain ordering flow. Cell-to-cell distances from the placed netlist are used to drive the ScanOpt TSP solver [5].
Flow II: Test driven scan chain ordering flow.
Flow III: [...] stitched into a single tour by inserting dummy flip-flops as described in Section 2.
7. We calculate the coverage by finding the number of faults compatible with the generated scan order and report it. The scan chain wirelength is estimated in all flows by summing up cell-to-cell Manhattan distances between FF locations.
Table 3 gives the path coverage and wirelength of the compared flows for zero dummies inserted. The scan coverage rows show how many of the critical paths received as input by the 3 flows (i.e., of the critical paths that are not robustly testable using functional justification) can be robustly tested by using the scan order produced by each flow. The total coverage rows show how many of the critical paths for which TetraMAX generates robust tests are testable using either scan or functional justification. On the reported testcases, all runtimes for ScanOpt range from 200 to 600 seconds, but they are not directly comparable since alternative flows either read in and solve, or simply solve, the ATSP instance; the read-in time is a substantial portion of total runtime in the former case. MFG runtimes range from 0.5 to 440 seconds.
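The two metrics reported for each flow, scan coverage and estimated scan chain wirelength, can be computed from a scan order as in the following minimal sketch. It assumes zero dummies, fully specified vector pairs, and flip-flop locations given as (x, y) coordinates; the function names `scan_wirelength` and `scan_coverage` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]
Test = Tuple[Sequence[int], Sequence[int]]  # (initialization, activation)


def scan_wirelength(order: Sequence[int], pos: Dict[int, Point]) -> float:
    """Sum of cell-to-cell Manhattan distances along the scan chain."""
    return sum(
        abs(pos[i][0] - pos[j][0]) + abs(pos[i][1] - pos[j][1])
        for i, j in zip(order, order[1:])
    )


def scan_coverage(order: Sequence[int], tests: Sequence[Test]) -> int:
    """Number of tests compatible with the scan order (no dummies inserted).

    A test is compatible if, for every scan edge i -> j, the initialization
    value of flip-flop i equals the activation value required in flip-flop j.
    """
    return sum(
        1
        for init, act in tests
        if all(init[i] == act[j] for i, j in zip(order, order[1:]))
    )


if __name__ == "__main__":
    pos = {0: (0, 0), 1: (1, 0), 2: (5, 0), 3: (6, 0)}
    tests = [((1, 0, 1, 0), (0, 1, 0, 1)), ((1, 1, 0, 0), (0, 0, 0, 0))]
    order = [0, 1, 2, 3]
    print(scan_wirelength(order, pos), scan_coverage(order, tests))
```

Dividing the coverage count by the number of tests passed to the flow gives a scan coverage percentage for that order.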
As expected, Flow I has shortest wirelength, but poorest fault coverage. Flow II has 100% total fault coverage in all testcases, but uses as much as 25x more wirelength than Flow I. Flow III achieves an excellent tradeoff between coverage and wirelength: it achieves 100% total fault coverage in 5 of the 6 testcases, with a wirelength comparable to that of Flow I (see also Figure 3). We put special emphasis on the zero dummies case since we are able to achieve reasonably high coverages, and also since ECO insertion of a large number of dummies implies significant overheads. In some cases, dummy insertion results in drastic improvement in coverage as shown by the results of our heuristic for the testcase s38417 in Table 4.
Footnote 5: The options set delay -diagnostic_propagation and add pi constraints 0 test_se are used to get robust vectors using functional justification.
Table 4: Fault coverage using scan shifting on testcase s38417 as function of added dummies. Time reported for Flow I is the time taken by the CPLEX ILP implementation for dummy insertion.
The quality of flip-flop orderings produced by Flows II and III is reflected by the fact that 100% coverage for testcase s38417 is achieved after inserting only a small number of dummies. In contrast, the purely wirelength-driven Flow I needs over 100 dummy flip-flops to achieve 100% total coverage, and the classic enhanced-scan methodology would indiscriminately enhance all 1561 flip-flops. Note that, as the number of dummies increases, the wirelength of scan chains produced by Flows I and II remains constant, while the wirelength cost incurred by Flow III is slightly decreasing due to the second-phase optimization.
[...] ILP in Section 3 for testcase s38417. The table shows that CPLEX runtime is dependent on the bound on the number of dummies. For very small or very large bounds the ILP is easy to solve and the branch and bound algorithm needs few iterations. For intermediate bound values the branch and bound tree grows larger and more iterations are needed to prove optimality. However, even in this case the runtime remains acceptable.
Conclusions
In this work we have proposed algorithms for computing the achievable tradeoffs in scan chain synthesis between number of dummy flip-flops, scan chain wirelength, and path delay fault coverage. With our layout and test-aware scan chain ordering methodology, we see up to 200% improvement in path delay coverage with just 20% increase in wirelength overhead compared to a layout-driven scan chain ordering approach. Also, up to 25x improvement in wirelength is achieved on the testcases compared to a test-driven scan chain ordering approach.
Ongoing work seeks to extend these algorithms to redundant test-vector pairs, to exploit additional degrees of freedom such as selection of the flip-flop data output pins used to connect scan chain edges, and to improve routability of resulting scan chains by using the available congestion information. As discussed above, extensions to multiple scan chains are also possible. We are also integrating our methods with dummy flip-flop placement and detailed routing to confirm that estimated wirelength savings reported in Section 4 correspond to actual (post-detailed routing) wirelength savings.
