Abstract - Implementing a built-in self-test by a "test per clock" scheme offers advantages concerning fault coverage, detection of delay faults, and test application time. Such a scheme is implemented by test registers, for instance BILBOs and CBILBOs, which are inserted into the circuit structure at appropriate places. An algorithm is presented which is able to find the cost-optimal placement of test registers for nearly all the ISCAS'89 sequential benchmark circuits, and a suboptimal solution with slightly higher costs is obtained for all the circuits within a few minutes of computing time. The algorithm can also be applied to the Minimum Feedback Vertex Set problem in partial scan design, and an optimal solution is found for all the benchmark circuits.
INTRODUCTION

Self-Testable Structures
Built-in self-test (BIST) is one of the most important techniques for testing large and complex systems. Test registers are added at the primary inputs and outputs of a circuit, and some additional test hardware is inserted into the circuit. In a "test per scan" scheme, test registers feed and evaluate a (partial) scan path (see Fig. 1 ).
It has been shown independently by several authors that breaking all cycles in the circuit structure bounds the length of the required test sequences to the sequential depth of the circuit [1, 2, 3, 4, 5] . To keep the hardware overhead low, the number of flip-flops that are integrated into the partial scan path in order to break all cycles should be as small as possible, and the NP-complete "Minimum Feedback Vertex Set" (MFVS) problem has to be solved [6] .
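To make the MFVS problem concrete, the following is a minimal brute-force sketch, not the method of any cited work. It assumes the circuit structure is given as a dictionary mapping each flip-flop to its successors; the function names and the small example graph are hypothetical. The search is exponential and is only meant for tiny graphs.

```python
from itertools import combinations

def is_acyclic(graph, removed):
    """Return True if the graph without the `removed` nodes has no cycle."""
    nodes = [v for v in graph if v not in removed]
    indegree = {v: 0 for v in nodes}
    for v in nodes:
        for w in graph[v]:
            if w in indegree:
                indegree[w] += 1
    # Repeatedly strip nodes without predecessors (topological peeling).
    ready = [v for v in nodes if indegree[v] == 0]
    stripped = 0
    while ready:
        v = ready.pop()
        stripped += 1
        for w in graph[v]:
            if w in indegree:
                indegree[w] -= 1
                if indegree[w] == 0:
                    ready.append(w)
    return stripped == len(nodes)

def minimum_feedback_vertex_set(graph):
    """Try node subsets in order of increasing size (exponential time)."""
    nodes = sorted(graph)
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            if is_acyclic(graph, set(subset)):
                return set(subset)

# Hypothetical example: two cycles, R1-R2 and R2-R3, that share R2.
example = {"R1": ["R2"], "R2": ["R1", "R3"], "R3": ["R2"], "R4": []}
scan_flip_flops = minimum_feedback_vertex_set(example)
```

In this example a single flip-flop, R2, breaks both cycles, which is why the cardinality of the set rather than the number of cycles determines the scan overhead.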
Chakradhar, Balakrishnan, and Agrawal presented an algorithm for computing the MFVS exactly using a branch and bound technique [7].
Fig. 1. "Test per scan" scheme applied to a circuit with registers R1 ... R9, combinational logic blocks (CLB), and pipeline structures
In a "test per clock" scheme, some system registers are enhanced such that in special test modes they generate patterns or compact test responses. Examples of these multi-mode test registers are BILBO and GURT [8, 9] . A "test per clock" scheme has advantages with respect to test application time, delay testing and defect coverage, but it often requires a higher hardware overhead than the "test per scan" scheme.
In order to obtain self-testable circuits, test registers must be placed at appropriate positions [10, 11, 12, 13] . The circuit structure obtained from breaking all cycles, however, is not a priori suited to a "test per clock" scheme since during self-testing some test registers may have to generate patterns and compact test responses concurrently (e.g. test register T2 in random pattern resistant and all the states required for testing are reachable [14, 15] . But in general, the patterns are not exhaustive, (weighted) random, or deterministic. Then, BILBOs have to be employed, and at least two of them are required in each cycle. In Fig. 2 , register R1 must also be enhanced to a BILBO register.
Using BILBOs, it is possible to segment the circuit into a set of subcircuits that are completely bounded by test registers (see Fig. 3). For testing a portion of the circuit, at least one test register must collect test responses. Thus the smallest region that can be tested independently (test unit) consists of one test register that can be configured as a multiple input signature register, the block of logic connected to the inputs of this register, and a set of test registers to generate test patterns for the inputs of the block (cf. [11, 16, 17]). In this way, every test unit u(T_i) is uniquely determined by the test register T_i at its outputs. In Fig. 3 
Multi-Mode Test Registers
The general "test per clock" scheme requires at least two multi-mode test registers to be placed in each cycle of the circuit structure. In particular, problems arise when a register feeds itself through combinational logic (self-adjacent register [19]), e.g. register R3 in Fig. 4. Here register R3 must be enhanced to a BILBO register T3, and an additional test register T4 of the BILBO type that is transparent in normal mode must be inserted into the feedback path.
In this paper, we present an exact algorithm that determines an appropriate placement of test registers and selects their types such that the total hardware cost of all the built-in test registers is minimum. As a special case, it also solves the MFVS problem for partial scan and the "test per scan" scheme.
Optimal Test Register Placement
During "top-down" design or synthesis, test registers are usually inserted at register transfer level [12, 24, 25, 26, 27, 28] . This way, hierarchy can be exploited, and the underlying complex optimization problem can be solved efficiently. But at RT-level not all of the structural knowledge is yet available, and much better solutions are possible if gate level information is used. As the gate level description of a system has much higher complexity than the RT-level description, highly efficient algorithms for the insertion of BIST cells (1-bit elements of test registers) are required. In this paper, a hardware-optimal algorithm is presented which finds a BIST solution with minimum transistor overhead for nearly all the ISCAS'89 benchmark circuits [29] . 
TEST REGISTER INSERTION: GATE LEVEL VERSUS RT-LEVEL
Often, for a gate level circuit there is a variety of corresponding register graphs. The register graph is determined by the way the flip-flops are partitioned and assembled to registers. The register configuration of the system mode is not always optimal for testing. As an example, Fig. 7 shows a carry save adder (CSA) and its register graph. Such a circuit is often used for implementing sequential multiplication [30] . and C' of length n are required for making it self-testable. The resulting graph is shown in The presented algorithm uses a pre-processing step similar to [5] where iteratively all the nodes without predecessors or without successors are removed since they cannot be part of any cycle. Moreover, as the transistor cost of a transparent test cell is the same for all the gate inputs and outputs, it is sufficient to consider the outputs of combinational fanout-free regions Σ l(v') < 2, and if these cycles share at least one node with label 0, then all the nodes and edges of both cycles belong to the same TCC.
The labeled graph of Fig. 10 has two TCCs. The graph of Fig. 11 , however, cannot be divided since it is composed of a single TCC. Fig. 12 shows a graph that does not contain any 
Procedure A
In procedure A, the following rules are applied until no more changes are feasible. An optimal solution for the modified graph is still an optimal one for the original graph. After all possible restrictions and simplifications have been made, procedure B follows.
Rule (i):
If v is a combinational node, v ∉ V_s, with l(v) = 0
Procedure B
Procedure B selects a node v that has not yet been considered for labeling (hence with label 0) and tries all the admissible assignments l(v)∈L(v). The node v is selected by the following criteria:
• The number of admissible labels for v, |L(v)|, should be minimum.
• Setting l(v) = 2 should make it possible to divide the TCC into smaller TCCs, and the largest of these smaller TCCs should contain a minimal number of nodes.
The first of these two points is most important. Of course, selecting a node v with |L(v)| = 1 is best since no decisions are required that may have to be reversed later.
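The two selection criteria can be read as a lexicographic ordering, with the number of admissible labels as the primary key. A minimal sketch, assuming the admissible label sets are held in a dictionary and that a caller-supplied function (hypothetical here, named `largest_tcc_after_split`) reports the size of the largest TCC that would remain after setting l(v) = 2:

```python
def select_node(unlabeled, admissible, largest_tcc_after_split):
    """Pick the next node for procedure B (illustrative sketch only).

    Primary key: fewest admissible labels |L(v)|.
    Secondary key: smallest 'largest remaining TCC' if l(v) = 2 is set.
    """
    return min(
        unlabeled,
        key=lambda v: (len(admissible[v]), largest_tcc_after_split(v)),
    )

# Hypothetical data: label sets and split sizes are made up for illustration.
admissible = {"a": {0, 1, 2}, "b": {0, 2}, "c": {0, 2}}
split_size = {"a": 3, "b": 5, "c": 4}
chosen = select_node(["a", "b", "c"], admissible, split_size.get)
```

Nodes b and c tie on the first criterion, so the second criterion decides in favor of c, even though node a would give the best split.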
If the newly assigned label is 1 or 2, an attempt is made to divide the TCC into smaller TCCs, and procedure A is called for each of these. If the new label is l(v) = 0, the TCC cannot be divided. In this case we set L(v) := {0}, and when procedure A is called, the application of rule (vii) reduces the graph by ignoring node v.
Pruning the Search Tree
At each node of the search tree the cost of the current labeling is computed. If this sum (current_cost) equals or exceeds the value obtained by the best solution found so far (best_cost), the children of this node in the search tree need not be investigated and the search tree can be pruned.
Also, if there is a node v with l(v) ≠ 0 and l(v) ∉ L(v), or a node v with no admissible label, L(v) = Ø, the current (partial) labeling cannot be extended to a minimum cost labeling, and the search tree can be pruned at this point. Using these criteria for pruning, most of the search space may be skipped while it is still guaranteed that an optimal solution will be found.
After pruning, backtracking may be necessary. Starting from the current node, the search tree is traversed backward until a node is encountered where procedure B made a choice among several possible assignments of labels. A different assignment is made, and the search algorithm continues by again calling procedure A and procedure B alternately.
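The interplay of cost-based pruning and backtracking can be sketched on a toy problem. This is not procedures A and B: it is a plain depth-first branch and bound, assuming the cycles are given explicitly, that every cycle needs a label sum of at least 2, and that the cost table below (hypothetical values) assigns a cost to each label.

```python
def cheapest_labeling(nodes, admissible, cost, cycles):
    """Depth-first branch and bound over label assignments (toy sketch)."""
    best = {"cost": float("inf"), "labels": None}

    def search(index, labels, current_cost):
        # Cost-based pruning: this branch cannot beat the best solution.
        if current_cost >= best["cost"]:
            return
        # Feasibility pruning: a fully labeled cycle with label sum < 2.
        for cycle in cycles:
            if all(v in labels for v in cycle):
                if sum(labels[v] for v in cycle) < 2:
                    return
        if index == len(nodes):
            best["cost"], best["labels"] = current_cost, dict(labels)
            return
        v = nodes[index]
        for label in admissible[v]:
            labels[v] = label
            search(index + 1, labels, current_cost + cost[label])
            del labels[v]  # backtrack and try the next assignment

    search(0, {}, 0)
    return best["cost"], best["labels"]

# Hypothetical instance: one two-register cycle and one self-loop.
nodes = ["R1", "R2", "R3"]
admissible = {v: [0, 1, 2] for v in nodes}
cost = {0: 0, 1: 2, 2: 7}          # assumed costs for labels 0, 1, 2
cycles = [("R1", "R2"), ("R3",)]
total, labels = cheapest_labeling(nodes, admissible, cost, cycles)
```

With these assumed costs, two label-1 cells on the R1-R2 cycle are cheaper than one label-2 cell, while the self-loop at R3 can only be covered by label 2.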
As the underlying problem is NP-hard, some circuits might be intractable by the exact algorithm. In this case good suboptimal solutions are obtained by a heuristic approach.
A parameter quality ≤ 1 is introduced. This factor is used during two steps of the algorithm. The second step where this parameter is applied occurs during pruning of the search tree.
Here the tree is pruned if current_cost > best_cost * quality. Setting quality = 1 yields an optimal placement. Usually the costs of the suboptimal solutions are distinctly below the limit optimal_cost / quality^2, as shown in the next section.
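The effect of the pruning rule on solution quality can be bounded with a short argument. The following is a sketch under two assumptions that are not stated explicitly above: 0 < quality ≤ 1, and the current_cost of a partial labeling never exceeds the cost of any complete labeling below it in the search tree (cell costs are non-negative). The symbols C^*, c, b, and q are introduced here for illustration and stand for optimal_cost, current_cost, best_cost, and quality.

```latex
\begin{align*}
  &\text{Suppose a subtree containing an optimal labeling is pruned at a node with cost } c. \\
  &c \le C^{*} \quad\text{and}\quad c > q\, b
  \;\Longrightarrow\; b < \frac{c}{q} \le \frac{C^{*}}{q}.
\end{align*}
```

Since best_cost never increases afterwards, the pruning step on its own keeps the final cost below optimal_cost / quality. This covers only the pruning step; the other step in which quality is used is not modeled here.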
EXPERIMENTAL RESULTS
The presented algorithm has been applied to the ISCAS'89 benchmark circuits [29] . For the validation of the algorithm, we are interested in provably optimal solutions, computing times, and the impact of the factor quality on the costs of the found solutions. The solutions and computing times on a SUN Sparc 10 workstation are listed in Table 1 for parameter set I, and in Table 2 for parameter set II. The first column denotes the circuit, and #B, #Bt, #C, and #Ct are the number of BILBO cells, transparent BILBO cells, CBILBO cells, and transparent CBILBO cells, respectively. Test cells at the primary inputs and outputs are not counted. "Cost" is the sum of the overheads for these cells, and q is the value of the parameter quality where this solution is found. The last column gives the computing time.
Table 1. Test cell placement for parameter set I
The correctness of the solutions can easily be verified. For each node v, it is checked if the set of nodes that can be reached on paths with label sum less than 2 includes the node v.
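This check can be sketched as a shortest-path search in which entering a node costs its label. The code below is a minimal illustration, assuming non-negative integer labels, a successor-list dictionary for the graph, and the convention (an assumption of this sketch) that each node on a cycle, including the start node, contributes its label once. The function names and the example graph are hypothetical.

```python
import heapq

def min_cycle_label_sum(graph, labels, start):
    """Smallest label sum of a closed walk through `start`, or None."""
    settled = set()
    heap = [(labels[w], w) for w in graph[start]]
    heapq.heapify(heap)
    while heap:
        total, u = heapq.heappop(heap)
        if u == start:
            return total  # first return to start is the cheapest one
        if u in settled:
            continue
        settled.add(u)
        for w in graph[u]:
            if w not in settled:
                heapq.heappush(heap, (total + labels[w], w))
    return None

def is_valid_labeling(graph, labels):
    """A labeling fails if some node reaches itself with label sum < 2."""
    for v in graph:
        smallest = min_cycle_label_sum(graph, labels, v)
        if smallest is not None and smallest < 2:
            return False
    return True

# Hypothetical graph: a three-node cycle plus a self-loop at T3.
graph = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1", "T3"]}
ok = is_valid_labeling(graph, {"T1": 0, "T2": 0, "T3": 2})
bad = is_valid_labeling(graph, {"T1": 1, "T2": 1, "T3": 1})
```

The second labeling fails only because of the self-loop at T3, whose label sum is 1; the three-node cycle itself has a label sum of 3.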
With the cost distribution of parameter set I, only CBILBO cells are used. Parameter set II significantly increases the costs for CBILBOs, but even then the number of inserted BILBO cells is very small (see Table 2 ).
Table 2. Test cell placement for parameter set II
It is seen that the efficiency of the algorithm depends on the cost distribution. For parameter set I, an optimal solution is found for all the circuits but one, whereas with parameter set II five circuits are hard to handle.
The impact of q on the costs is investigated in Table 3 there is a distinct loss in quality for some circuits.
Table 3. Costs for solutions found with parameter set II and different values of the quality factor
In another experiment, the algorithm has been applied to the S-graphs of the circuits. If the costs of CBILBO cells are set less than the costs of BILBO cells, then the MFVS problem for implementing a partial scan path is solved. The algorithm is not tuned to handle S-graphs and the MFVS problem and cannot use many of the graph reductions applied in, e.g., [5, 7], as it is designed for a much more general problem. But for all the benchmark circuits it found the provably optimal solution of the MFVS problem within a few seconds of computing time.
Some authors propose not to break self-loops for a partial scan design [1, 2, 5] . This problem can be solved by removing the self-loop edges from the S-graph. Also for these modified graphs the algorithm provides an optimal solution for all the benchmark circuits with the exception of s38584 where the factor quality must be reduced. The optimal solutions found agree with the numbers reported in [7] .
Generally, if testability analysis shows that some parts of the circuit are easily testable without modifications or that some cycles need not be broken as they do not cause poor controllability or poor observability, then we can remove some parts from the circuit graph before we solve the MCP problem. Thus, the requirements for test registers can be reduced further.
It is also possible to model the effect of increased delays due to inserted test cells. The cost of a test cell can be adjusted such that it reflects the time slack of the considered gate or flip-flop. Alternatively, the insertion of test cells may be prohibited at some nodes by restricting the set of admissible labels. In the latter case, however, it is no longer guaranteed that a solution to the MCP problem always exists.
Furthermore, an estimation of the overheads regarding BIST control logic and routing can be incorporated into the costs of test cells. CBILBOs generally require a smaller control effort than BILBOs (see next section).
SELF-TESTABLE TARGET STRUCTURES
The solutions for the ISCAS'89 circuits give new insight into the structure of self-testable circuits with a "test per clock" technique and minimal hardware overhead. In most of the circuits only CBILBO cells have to be placed, and in the remaining circuits only very few BILBO cells are required. We assumed a CBILBO cell to be up to 3.5 times more expensive than a BILBO cell, but the overall hardware overhead of a CBILBO-based approach is still smaller than that of a BILBO solution.
Hardware is not only saved within the data path, but test control and routing the test The hardware-minimal circuit structures found by the algorithm have also advantages with respect to test application time. All test registers may work in parallel, and the maximal degree of concurrency is not restricted by conflicts regarding the modes of test registers, but only by the limits of power dissipation during BIST operation [33] .
The results obtained so far also give hints for an appropriate data path structure to be used by high-level synthesis systems and designers in order to implement self-testable systems.
Recent synthesis-for-testability approaches try to reduce the number of cycles in a circuit. Avra and McCluskey try to reduce the number of self-adjacent registers using a so-called "incompatibility graph" [26, 34]. This graph indicates possible self-loops as well as constraints between variables due to the selected schedule, and graph coloring provides an admissible assignment of variables to registers. In [35, 36], binding is formulated as a network flow problem for each control step, such that area and timing are optimized and the number of self-loops is minimized. Papachristou, Chiu and Harmanani consider "(extended) testable functional blocks" consisting of functional units and the necessary test registers as the basic blocks of a self-testable RT structure [12, 27]. Allocation and binding of resources correspond to covering a given data path with a minimal number of testable functional blocks.
But looking at the results of the hardware-optimal algorithm, it is found that not only the number of cycles has an impact on BIST costs. More important is the existence of a small number of registers such that each cycle contains at least one of these registers. The hardware-minimal BIST solution will enhance these registers to CBILBOs, and if binding maps all the self-loops of variables to just these registers, a self-loop does not cause any additional cost for BIST.
This problem is similar to designing a circuit that has an MFVS with low cardinality and is optimal for partial scan. In [37], [38], and [39], retiming and resynthesis procedures have been presented that reduce the MFVS of a circuit described at gate level. In [40], a high-level synthesis method is proposed that synthesizes data paths such that they have an MFVS consisting of a small number of flip-flops. In contrast to partial scan designs, BIST cells for a "test per clock" scheme may also be inserted at combinational nodes. This can be exploited for further optimizations.
CONCLUSION
An exact algorithm has been presented that selects flip-flops to be incorporated into multi-mode test registers or into a partial scan path in order to implement a "test per clock" or a "test per scan" scheme with minimum hardware overhead. The algorithm finds provably optimal solutions for nearly all the ISCAS'89 benchmark circuits, and the remaining circuits can be handled by a heuristic version very efficiently. It also considers BIST cells within the combinational logic, and takes into account that the hardware costs of BIST cells depend on their type. The MFVS problem of partial scan design is solved as a special case.
He is a member of the steering committee of the GI/ITG Special Interest Group on "Testmethoden und Zuverlässigkeit von Schaltungen und Systemen" and has been a member of the program committee at numerous conferences. He has been a reviewer of research proposals submitted to NSF and NATO, and within the European projects EUROCHIP and EUROPRACTICE he has been a lecturer for courses on VLSI design and test.
Dr. Wunderlich is author and co-author of three books and over 70 papers in the field of test, synthesis, and fault tolerance of digital systems.
FIGURE CAPTIONS
CBILBO solution for the circuit of Fig. 3
