ABSTRACT
INTRODUCTION
The widespread use of core-based designs makes built-in self test (BIST) an increasingly attractive design option [19] . BIST is a design-for-testability technique that places the testing functions physically with the circuit under test (CUT). BIST has several advantages over the alternative, external testing: (i) the ability to test in-system and at-speed, (ii) reduced test application time, (iii) less dependence on expensive test equipment, and (iv) the ability to automatically test devices on-line or in the field. On-line testing is especially important for high-integrity applications such as automotive systems, in which we are interested.
When BIST is employed, a VLSI system is partitioned into a number of CUTs. Each component CUT is logically configured as shown in Figure 1 . In normal mode, the CUT receives its inputs X from other modules and performs the function for which it was designed. In test mode, a test pattern generator circuit (TG) applies a sequence of test patterns S to the CUT, and the test responses are evaluated by a response monitor (RM). This paper concentrates on the design of TG, although we also consider some relevant aspects of RM. In the most common type of BIST, test responses are compacted in RM to form signatures. The response signatures are compared with reference signatures generated or stored on-chip, and the error signal indicates any discrepancies detected. We assume this type of response processing in the following discussion.
Four primary parameters must be considered in developing a BIST methodology:
• Fault coverage: the fraction of faults of interest that can be exposed by the test patterns produced by TG and detected by RM. Most RMs produce the same signature for some faulty response sequences as for the fault-free one (aliasing). Safety-critical applications require very high fault coverage, typically 100% of the modeled faults.
• Test set size: the number of test patterns produced by the TG. This parameter is linked to fault coverage: generally, large test sets imply high fault coverage. However, for on-line testing either at system start-up or periodically during normal operation, test set size must be kept small to minimize impact on system resources and reduce error latency, that is, the time elapsing before the effects of a fault are detected.
• Hardware overhead: the extra hardware needed for BIST. In most applications, high hardware overhead is not acceptable because of its impact on circuit size, packaging, power consumption, and cost.
• Performance penalty: the impact on performance of the normal circuit function, such as critical path delays, due to the inclusion of BIST hardware. This type of overhead is sometimes more important even than hardware overhead.
We have been investigating the design of TGs in the four-dimensional design space defined by the above parameters with the goals of 100% fault coverage, very small test sets, and low hardware overhead. The specific CUTs we are targeting are high-speed datapath circuits to which most existing BIST methods are not applicable. Our CUTs are N-input, scalable, combinational circuits with large values of N (64 or more). They also employ carry lookahead, a very common structure in high-performance datapaths. It is well known that such circuits have small deterministic test sets that can be computed fairly easily. For example, it is shown in [13] that the standard n-bit carry-lookahead adder (CLA) design, which has N = 2n + 1 inputs, has easily-derived and provably minimal test sets for all stuck-line faults; these test sets contain N + 1 test patterns. Some low-cost, scalable TG designs for datapath circuits based on C-testability (a constant number of test patterns independent of N) are known [12] [26], but they do not apply when CLA is used.
In this paper we describe a novel TG design methodology that addresses all the above issues, and illustrate it with several examples including an adder, an ALU and a multiplier-adder. The TG's structure is based on a twisted ring counter, and is tailored to generate a regular, deterministic test set of near-minimum size. Its hardware overhead is low enough to suggest that the TG can be incorporated into a standard cell or core design, as has been done for RAMs [20] , adders [21] and multipliers [12] . For a modest increase in hardware overhead and test set size, our method can also minimize the performance penalty. The proposed approach covers the major types of fast arithmetic circuits, and appears to be generalizable to other CUT types as well.
The paper is organized as follows. Section 2 reviews previous work on designing test generators. Section 3 describes the proposed approach to designing scalable test sets and test generators. In Section 4 we apply our approach to carry-lookahead adders, and in Section 5 we apply it to several other examples. We present some conclusions in Section 6.
TEST GENERATOR DESIGN
A generic TG structure applicable to most BIST styles is shown in Figure 2 [7]. The sequence generator SG produces an m-bit-wide sequence of patterns that can be regarded as compressed or encoded test patterns, and the decoder DC expands or decodes these patterns into N-bit-wide tests, where N is the number of inputs to the CUT. Generally, m ≤ N, and the SG can be some type of counter that produces all m-bit patterns.
The most common TG design is a counter-like circuit that generates pseudorandom sequences, typically a maximal-length linear feedback shift register (LFSR) [6], a cellular automaton [5], or occasionally, a nonlinear feedback shift register [9]. These designs basically consist of a sequence generator only, and have m = N. The resulting TGs are extremely compact, but they must often generate excessively long test sequences to achieve acceptable fault coverage. Some CUTs, including the datapath circuits of interest, contain hard-to-detect faults that are detected by only a few test patterns T_hard. An N-bit LFSR can generate a sequence S that eventually includes 2^N - 1 patterns (essentially all possibilities); however, the probability that the tests in T_hard will appear early in S is low. Two general approaches are known to make S reasonably short. Test points can be inserted in the CUT to improve controllability and observability; this, however, can result in a performance loss. Alternatively, some determinism can be introduced into S, for example, by inserting "seed" tests for the hard faults. Such methods aim to preserve the cost advantages of LFSRs while making S much shorter. However, these objectives are difficult to satisfy simultaneously. It can also be argued that pseudorandom approaches represent "overkill" for datapath CUTs, which, like RAMs [20], seem much better suited to directed deterministic approaches.
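The late appearance of a hard test in a pseudorandom sequence can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the 4-bit register, the feedback taps, the seed, and the pattern T_HARD are all hypothetical choices made for illustration and are not taken from any CUT discussed here.

```python
def lfsr_states(n_bits, taps, seed=1):
    """Return the successive states of a right-shifting Fibonacci LFSR.

    taps lists the bit positions (0 = least significant) that are XORed
    to form the bit shifted in at the most significant end.
    """
    states = []
    state = seed
    while True:
        states.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (n_bits - 1))
        if state == seed:
            return states


# Hypothetical 4-bit LFSR; these taps realize the recurrence
# a[k+4] = a[k+1] XOR a[k], i.e. feedback polynomial x^4 + x + 1.
S = lfsr_states(4, taps=(0, 1), seed=0b0001)
T_HARD = 0b0011  # hypothetical test for a hard-to-detect fault

print(len(S))                # number of states before the seed recurs
print(S.index(T_HARD) + 1)   # position at which T_HARD is applied
```

With this particular seed the register cycles through all 2^4 - 1 nonzero states, and the chosen T_HARD is the last pattern produced, so the whole sequence must be applied before the hard fault is exercised.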
Weighted random testing adds logic to a basic LFSR to bias the pseudorandom sequence it generates so that patterns from the desired test set T appear near the start of S [6] . In a related method proposed by Dufaza and Cambon [11] , an LFSR is designed so that T appears as a square block at the beginning of S. A test set must usually be partitioned into many square blocks, and the feedback function of the LFSR must be modified after the generation of each block, making this method complex and costly. The approach of Hellebrand et al. [14] [15] modifies the seeds used by the LFSR, as well as its feedback function. In other work, Touba and McCluskey [25] describe mapping circuits that transform pseudorandom patterns to make them cover hard faults.
Another large group of TG design methods, loosely called deterministic or nonrandom, attempt to embed a complete test T of size P in a generated sequence S. A straightforward way to do this is to store T in a ROM and address each stored test pattern using a counter. SG is then a ⌈log_2 P⌉-bit address counter and the ROM serves as DC. Unfortunately, ROMs tend to be too expensive for storing entire test sequences. Alternatively, a P-state finite state machine (FSM) that directly generates T can be synthesized. However, the relatively large values of P and N, and the irregular structure of T, are usually more than current FSM synthesis programs can handle.
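The cost of the ROM-based scheme can be made concrete with a short sketch. It assumes a test set of P patterns, each N bits wide, stored one pattern per ROM word; the function names and the example test set are hypothetical.

```python
def address_bits(p):
    """Width of the address counter needed to step through p ROM words."""
    return max(1, (p - 1).bit_length())


def rom_bits(test_set):
    """Total ROM storage, in bits, for a list of equal-length bit strings."""
    return len(test_set) * len(test_set[0])


def rom_based_tg(test_set):
    """Yield the stored tests in address order, as the counter would."""
    for address in range(len(test_set)):
        yield test_set[address]


# Hypothetical test set with P = 6 patterns of N = 5 bits each.
T = ["10101", "10110", "11000", "10100", "10001", "00111"]
print(address_bits(len(T)), rom_bits(T))
print(list(rom_based_tg(T)) == T)
```

The address counter grows only logarithmically with P, but the ROM grows as P times N, which is the cost that makes this approach unattractive for large test sets.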
Several methods have been proposed that, like a ROM-based TG, use a simple counter for SG and design a low-cost combinational circuit for DC to convert the counter's output patterns into the members of T [3] [10]. Chen and Gupta [8] describe a test-width compression technique that leads to a DC that is primarily a wiring network. Chakrabarty et al. [7] explore the limits of test-pattern encoding, and develop a method for embedding T into test sequences of reasonable length.
Some TG design methods strive for balance between the straightforward generation of T using a ROM or FSM, and the hardware efficiency of an LFSR or counter. Perhaps the most straightforward way to do this was suggested by Agarwal and Cerny [1]. Their scheme directly combines the ROM and the pseudorandom methods. The ROM provides a small number of test patterns for hard-to-detect faults and the LFSR provides the rest of T.
None of the BIST methods discussed above explicitly addresses the scalability of the TG as the CUT is scaled. Scalable TGs based on C-testability have been described for iterative (bit-sliced) array circuits, such as ripple-carry adders [21] and array multipliers [12] . However, no technique has been proposed to design deterministic TGs that can be systematically rescaled as the size of a non-bit-sliced circuit, such as a CLA, is changed. This paper introduces a class of TGs where SG is a compact (n + 1)-bit twisted ring counter and DC is CUT-specific. The output of SG can be efficiently decoded to produce a carefully crafted test sequence S that contains a complete test set for the CUT. As we will see, both SG and DC have a simple, scalable structure of the bit-sliced type. S is constructed heuristically to match a DC design of the desired type, so we can view this process as a kind of "co-design" of tests and their test generation hardware.
BASIC METHOD
We first examine the scalability of the target datapath circuits and their test sets. A circuit or module M(n) with the structure shown in Figure 3 is loosely defined as scalable if its output function Z(n) is independent of the number n of its input data buses. Each such bus is w bits wide, and 
For example, if Z is addition, we can write
Z(A(n+1), B(n+1)) = Z(A(n), B(n)) + 2^n (A_n + B_n)
where the 2^n factor accounts for the shifted position of the new operand D_n = (A_n, B_n). Similarly, a test sequence S(n) for a scalable circuit M(n) can be represented in recursive form. S(n) is considered to be scalable if
S(A(n+1), B(n+1)) = s[S(A(n), B(n)), A_n, B_n]
As we will see, the test scaling functions s and S can take a few regular, shift-like forms for the CUTs of interest.
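The recursive view of a scalable function can be checked numerically for addition. The sketch below assumes 1-bit buses (w = 1) and takes the scaling step for addition to be Z(n+1) = Z(n) + 2^n (A_n + B_n); the function name and the bit-list representation are illustrative choices.

```python
def add_by_scaling(a_bits, b_bits):
    """Add two operands given as bit lists (least significant bit first).

    Each loop iteration scales the adder by one slice:
    Z(n+1) = Z(n) + 2^n * (A_n + B_n).
    """
    z = 0
    for n, (a_n, b_n) in enumerate(zip(a_bits, b_bits)):
        z = z + (1 << n) * (a_n + b_n)
    return z


def to_bits(value, width):
    """Bit list of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


# Compare the recursive form against ordinary integer addition.
width = 4
ok = all(
    add_by_scaling(to_bits(a, width), to_bits(b, width)) == a + b
    for a in range(1 << width)
    for b in range(1 << width)
)
print(ok)
```

Adding one more operand pair extends the loop by one iteration without changing the function computed on the lower slices, which is the sense in which the circuit is scalable.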
To introduce our method, we use the very simple example of a ripple-carry incrementer shown in Figure 4. Here the carry-in line C_0 is set to 1 in normal operation, but is treated as a variable during testing. The increment function Z_inc can be expressed as
Z_inc(A(n), C_0) = 2^(n-1) A_(n-1) + ... + 2 A_1 + A_0 + C_0    (1)
When n = 1, Equation (1) reduces to the half-adder equation
Z_inc(A(1), C_0) = A_0 + C_0    (2)
and (2) is realized by a single half-adder. An (n + 1)-bit incrementer M_inc(n) is obtained by appending a half-adder stage to M_inc(n - 1). Figure 4 shows how M_inc(3) is scaled up to implement
Z_inc(A(4), C_0) = Z_inc(A(3), C_0) + 2^3 A_3    (3)
A corresponding scaling of a test sequence S_inc(n) for n = 3 to 4 is also shown in the figure.
S_inc(n) consists of 2n + 2 test patterns of the form A_(n-1) A_(n-2) … A_0 C_0, each corresponding to a row in the binary matrices of Figure 4. These tests exhaustively test all half-adder slices of M_inc(n) by applying the four patterns {00, 01, 10, 11} to each half-adder and propagating any errors to the Z outputs. For example, the first test pattern A_3 A_2 A_1 A_0 C_0 = 00001 in S_inc(4) applies 00 to the top three half-adders, and 01 to the bottom one. The next test 00011 applies 00 to the top two half-adders, 01 to the third half-adder from the top, and 11 to the bottom one, and so on. If a fault is detected in, say, the bottom half-adder HA_0 by some pattern, an error bit appears either on Z_0, or on HA_0's carry-out line; in the latter case, the error will propagate to output Z_1, provided the fault is confined to HA_0. Thus S_inc(n) detects 100% of all cell faults in the incrementer and, by extension, all single stuck-line (SSL) faults in M_inc(n), independent of the internal implementation of the half-adder stages. The members of S_inc(n) can easily be shown to constitute a minimal complete test with respect to cell or SSL faults. Note that, unlike a ripple-carry adder, a ripple-carry incrementer such as M_inc(n) is not C-testable, and can easily be shown to require at least 2n + 2 tests for 100% fault coverage. This linear testing requirement is unusual in bit-sliced circuits, but is typical of CLA designs.
Each test in the sequences S_inc(n) shown in Figure 4 has been carefully chosen to be a shifted version of the test above it. Moreover, the first n + 1 tests have been chosen to be bitwise complements of the second n + 1 tests. (We will see later that these special properties of S(n) can be satisfied in other, more general datapath circuits.) The sequence of the 2(n + 1) test patterns of S is exactly the state sequence of an (n + 1)-bit twisted ring (TR) counter. This immediately suggests that a suitable test generator TG_inc(n) for M_inc(n) is an (n + 1)-bit TR counter, as shown in Figure 4. Clearly TG_inc(n) is also a scalable circuit. Thus we have a TG design conforming to the general model of Figure 2, in which SG is a TR counter and DC is vacuous.
Although at first glance, a TG like TG_inc(4) seems to embody a large amount of BIST overhead given the small size of M_inc(4), we can argue that, in fact, TG_inc(4) is close to minimal for any TG that is required to produce a complete test sequence of near-minimal length.
We can now outline our general approach to designing TGs for scalable datapath circuits. We use high-level information about the CUT to explore in a systematic, but still heuristic, fashion a large number of its possible test sets to find one that has a regular, shift-complement (SC) structure resembling that illustrated by S_inc(n) in Figure 4. The main steps involved are as follows:
1. Obtain a high-level, scalable model of the CUT M(n).
2. Analyze this model using high-level functional analysis to derive a complete SSL-fault test set T(n) for M(n) for some small value of n. Use don't cares in the test patterns wherever feasible.
3. Convert T(n) to an SC-style test sequence S(n) as far as possible.
4. Synthesize a test generator TG(n) for S(n) in the style of Figure 5.
TG(n) consists of a TR counter and a mode-select FSM, as shown in Figure 5; they have 2n + 2 and k states, respectively, where k is a fixed number independent of n. The total number of states for TG(n) is thus k(2n + 2), which approximates the number of tests in the test set T(n).
Our use of functional, high-level circuit models to derive test sets (Steps 1 and 2 above) is based on the work of Hansen and Hayes [13], who show that test generation for datapath circuits can be done efficiently at the functional level while, at the same time, providing 100% coverage of low-level SSL faults for typical implementations. The model required for Step 1 is usually available for these types of circuits, since their scalable nature is exploited in their specification and carries through to high-level modeling during synthesis, as illustrated by our incrementer example (Figure 4).
Step 3 is perhaps the most difficult to formalize. It requires modifying and ordering the tests from Step 2 to obtain a sequence of shifting test patterns that resemble the output of the TR counter, but retain the full fault coverage of the original tests.
In the remaining sections, we apply the preceding approach to derive similar, scalable test sets and test generators for the CLA and some other datapath circuits.
CARRY-LOOKAHEAD ADDER
Figure 5 General structure of TG(n) and its state behavior.
The CLA is a key component of many high-speed datapath circuits, including arithmetic-logic units and multipliers. A high-level model of a generic n-bit CLA M_CLA(n), with the 4-bit 74283 [24] serving as a model, was derived in [13] and is shown in Figure 6. It is composed of (i) a module M_PGX(n) that realizes the functions P_i = A_i OR B_i, G_i = A_i AND B_i, and X_i = A_i XOR B_i, (ii) a carry-lookahead generator (CLG) module M_CLG(n) that computes all carry signals, and (iii) an XOR word gate that computes the sum outputs. The CLG module M_CLG(n) contains the adder's hard-to-detect faults, and so is the focus of the test-generation process. Its testing requirements can be satisfied by generating tests for the SSL faults on the input lines of M_CLG(n) that propagate the fault effects along the path to C_n, which is the longest and "hardest" fault-detection path. The resulting test set T_CLG(n) contains 2n + 2 tests and detects all faults in the CLG logic. For example, when n = 2, T_CLG(2) = {10101, 10110, 11000, 10100, 10001, 00111}, where the test patterns are in the form P_1 G_1 P_0 G_0 C_0. Hansen and Hayes [13] have proven that such a test set detects all SSL faults in typical implementations of M_CLG(n). Their method induces high-level functional faults from the SSL faults, and generates T_CLG(n) for a small set of functional faults that cover all SSL faults. Because the carry functions are unate, it can be shown that T_CLG(n) is a "universal" test set in the sense of [2], hence it covers all SSL faults in any inverter-free AND/OR implementation of M_CLG(n).
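The way T_CLG(2) exercises the input lines of the CLG can be illustrated with a small fault simulation. This sketch assumes the standard lookahead expression C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0 for the carry-out and models only stuck-at faults on the five input lines (as stems), observed at C_2; it does not model any particular gate-level implementation.

```python
LINES = ("P1", "G1", "P0", "G0", "C0")
T_CLG_2 = ["10101", "10110", "11000", "10100", "10001", "00111"]


def carry_out(p1, g1, p0, g0, c0):
    """Assumed lookahead carry: C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0."""
    return g1 | (p1 & g0) | (p1 & p0 & c0)


def detected_faults(tests):
    """Input stuck-at faults whose effect reaches C_2 for some test.

    A fault is recorded as (line name, stuck value).
    """
    detected = set()
    for test in tests:
        bits = [int(ch) for ch in test]
        good = carry_out(*bits)
        for i, name in enumerate(LINES):
            faulty = list(bits)
            faulty[i] = 1 - bits[i]  # stuck at the opposite value
            if carry_out(*faulty) != good:
                detected.add((name, faulty[i]))
    return detected


all_faults = {(name, v) for name in LINES for v in (0, 1)}
print(detected_faults(T_CLG_2) == all_faults)
```

Under these assumptions all ten input stuck-at faults are observed at C_2, and removing any one of the six tests leaves at least one of them undetected.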
Once the tests for M_CLG(n) are known, they are traced back to the primary inputs of M_CLA(n) through the module M_PGX(n); the resulting test sets for n = 2 are shown in Table 1(a). The table gives a condensed representation of M_CLG(2)'s test requirements within M_CLA(2), and specifies implicitly all possible sets of 6 tests (the minimum number) that cover all SSL faults in M_CLG(2). The test requirements of M_PGX(2) are shown in Table 1(b). The n XOR gates that feed the sum output Z are automatically covered by the tests for M_CLG(n) and M_PGX(n), and also provide non-blocking error propagation paths for these modules.
Figure 6 High-level model of the n-bit CLA [13].
Table 1 Tests for M_CLA(2) at its primary inputs: (a) requirements of M_CLG(2), (b) requirements of M_PGX(2), (c) numbered test sequence.
(a)                           (b)                      (c)
A_1B_1   A_0B_0   C_0         A_1B_1  A_0B_0  C_0      Test #  A_1B_1  A_0B_0  C_0
{10,01}  {10,01}  1           01      xx      x        1       10      10      1
{10,01}  00       1           10      xx      x        2       10      00      1
00       11       1           xx      01      x        3       00      11      1
{10,01}  {10,01}  0           xx      10      x        4       01      01      0
{10,01}  11       0                                    5       01      11      0
11       00       0                                    6       11      00      0
Once we know the possible test sets for M_CLA(n), our next goal is to obtain a specific test sequence that follows the SC style. Such a test sequence of size 6 is extracted in Table 1(c). A test generator TG_CLA(n) for M_CLA(n) can now be synthesized from S_CLA(n) following the general structure in Figure 5. As in the incrementer example, the sequence generator is an (n + 1)-bit TR counter. Note, however, that the number of input lines has almost doubled from N = n + 1 to N = 2n + 1. The size of S_CLA(n) is 2n + 2, which is the number of states of the TR counter, so no mode-control FSM is needed. The combinations (H Q_(i+1) Q_i) = {010, 101} never appear at the inputs of the decoder cells (Figure 7), hence the outputs of DC_i are considered don't care for these combinations. Furthermore, the patterns (H Q_(i+1) Q_i) = {011, 100} never appear at the inputs of the high-order decoder cell DC_(n-1); however, we choose not to take advantage of this, since our goal is to keep the decoder logic DC simple and regular. The carry-in signal C_0 can be seen from Table 3 ...
[Table columns: Test #, Input pattern, Response]
Our TGs, like the underlying TR counters, produce two sets of complementary test patterns.
Such tests naturally tend to detect many faults because they toggle all primary inputs and outputs, as well as many internal signals. An n-bit adder also has the interesting property that A plus B plus C_in = C_out S implies A' plus B' plus C_in' = C_out' S', where a prime denotes bitwise complementation and plus denotes addition modulo 2^n. Hence the adder's outputs are complemented whenever a test is complemented, implying that there are only two distinct responses, 100...0 and 011...1, to all the tests in TG_CLA(n), as can be seen from Table 2. Consequently, a simple, low-cost and scalable RM can be easily designed for the CLA adder as depicted in Figure 7. This example shows that some of the benefits of scalable, regular tests carry over to RM design.
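The complement property of the adder, and its effect on the responses, can be checked directly. The sketch below treats the adder behaviorally as integer addition with an (n + 1)-bit result C_out S. The three tests in FIRST_HALF are the first three tests of the n = 2 sequence read from Table 1, with columns assumed to be (A_1B_1, A_0B_0, C_0); the remaining three tests are generated here as their complements rather than read from the table.

```python
def add(a, b, c_in):
    """Behavioral adder: returns C_out S as one integer."""
    return a + b + c_in


def complement(x, width):
    """Bitwise complement of x within the given width."""
    return x ^ ((1 << width) - 1)


def complement_property_holds(n):
    """Check that complementing all inputs complements C_out S."""
    for a in range(1 << n):
        for b in range(1 << n):
            for c in (0, 1):
                lhs = add(complement(a, n), complement(b, n), c ^ 1)
                if lhs != complement(add(a, b, c), n + 1):
                    return False
    return True


# Assumed column order: (A_1 B_1, A_0 B_0, C_0).
FIRST_HALF = [("10", "10", 1), ("10", "00", 1), ("00", "11", 1)]


def responses(first_half, n=2):
    """Responses to the given tests followed by their complements."""
    tests = []
    for high, low, c in first_half:
        a = int(high[0] + low[0], 2)  # A = A_1 A_0
        b = int(high[1] + low[1], 2)  # B = B_1 B_0
        tests.append((a, b, c))
    tests += [(complement(a, n), complement(b, n), c ^ 1)
              for a, b, c in tests]
    return [format(add(a, b, c), "0{}b".format(n + 1)) for a, b, c in tests]


print(complement_property_holds(4))
print(responses(FIRST_HALF))
```

Only the two responses 100 and 011 occur, so under these assumptions a response monitor for this sequence needs to distinguish just two expected output words.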
Figure 7 Scalable hardware test generator and response monitor for an n-bit CLA.
OTHER EXAMPLES
In this section, we extend the approach developed in the preceding sections to the design of a TR-counter-based TG for an arithmetic logic unit and two circuits involving multiplication.
Arithmetic Logic Unit. We first consider an n-bit ALU M_ALU(n) that employs the standard design represented by the 4-bit 74181 [24]. This ALU is basically a CLA with additional circuits that implement all 16 possible logic functions of the form f(A, B). A high-level model for the 74181 is shown in Figure 8 [13], and consists of a CLG module M_2, a function-select module M_1, and several word gates. Following the approach of the previous section, the tests needed for the CLG module M_2 are traced back to the ALU's primary inputs. During this process, the signal values applied to the function-select control bus S are chosen to satisfy the testing needs for M_1 as well.
An obvious choice is to make S select the add (S_3 S_2 S_1 S_0 = 1001) and subtract (S_3 S_2 S_1 S_0 = 0110) modes of the ALU. However, we found by trial and error that the assignments S_3 S_2 S_1 S_0 = 1010 and 0101 lead to a TG design with less overhead. The testing needs for the word gates in the high-level model of the ALU must also be considered. The final test sequence S_ALU(n) has an SC structure that closely resembles that of the CLA. Table 4 shows S_ALU(4); note how the tests exhibit the same shifting property as before for the patterns ... and ... . Moreover, tests 1:20 are the complements of tests 21:40. The test sequence S_ALU(4) is not minimal, however, since 12 tests are sufficient to detect all SSL faults in the 74181 [13]. S_ALU(4) can be easily extended to S_ALU(n) with a near-minimal size of 8n + 8.
A test generator TG_ALU(n) for M_ALU(n) is shown in Figure 9, which again follows the general test generator model of Figure 5. Since the test sequence size is 8n + 8 and the general test generator ...
[Figure labels: 2-bit binary counter; state table of the mode-select FSM]
... remain undetected. These undetected faults require two extra tests, leading to a complete test set of size 12. Once the possible test sets are determined, a sequence that has the desired SC structure is constructed. Table 5 shows a possible test sequence S_MAU(4) of size 20 for M_MAU(4). This test sequence can be easily extended to M_MAU(n) with a resultant test set of size ...
A test generator TG_MAU(n) for M_MAU(n) in the target style is shown in Figure 11. The hardware overhead of TG_MAU(n) is estimated to be only 0.8% for a ...-bit multiply-add unit.
Booth multiplier. Our technique can be applied, with some minor modifications, to a fast Booth multiplier that is composed of a cascaded sequence of carry-save adders followed by a final stage consisting of a 2n-bit CLA [4]. Our design is faster than the usual Booth multiplier where the last stage is a ripple-carry adder; test generation has been studied before only for the slower, ripple-carry design [12]. We have been able to derive a complete scalable test sequence of size ...
DISCUSSION
We have presented a new approach to the design of scalable hardware test generators for BIST, and illustrated it for several practical datapath circuits. The resulting test generators produce complete and extremely small test sets; they are of minimal or near-minimal size for all examples covered. Small test sets of this kind are essential for the on-line use of BIST, especially in applications requiring fast arithmetic techniques like carry-lookahead, for which previously proposed BIST schemes are not well suited. The TGs proposed here also have low hardware overhead, and are easily expandable to test much larger versions of the same target CUT. When applying BIST in a system, designers usually try to take advantage of existing flip-flops and logic already present in or around the CUT. For a typical datapath in, say, a digital signal processing circuit, all the data inputs to ALUs or multipliers come from a small register file. These registers can be designed to be reconfigured into TR counters like that in Figure 5 , thus eliminating the need for special flip-flops in SG. Similar schemes have been proposed in prior techniques such as BILBO [6] . Moreover, it may be possible to share the resulting SGs among several CUTs.
Multiplexing logic will then be needed to select the DCs for individual CUTs during test mode but ...
In some cases, it may be feasible to share the entire TG. To illustrate this possibility, consider an n-bit ALU, an ...-bit MAU, and a register file connected to a common bus. A single, reconfigurable TG attached to the bus can test both arithmetic units. The results of this approach are summarized in Table 6 for various values of n, and suggest that replacing separate TGs for the ALU and MAU by a single combined TG reduces overhead by about a third.
Our TG designs shed some light on the following interesting, but difficult question: How much overhead is necessary for built-in test generation? As we noted in the incrementer case, the size of TG_inc(4) must be close to minimal for any TG that is required to produce a complete test sequence of near-minimal length. The same argument applies to TG_CLA(4), since it has 5 flip-flops in SG and a small amount of combinational logic in DC; any test generator G(4) producing the same number of tests (12) must contain at least 4 flip-flops in its SG. In general, the overhead of a TR-counter-based design TG(n) scales up linearly and slowly with n. The number of flip-flops in some other test generator G(n) may increase logarithmically with n, but the combinational part of G(n) is likely to scale up at a faster rate than that of TG(n). This suggests that even if the overhead of TG(n) is considered high, it may not be possible to do better using other BIST techniques under similar overall assumptions. If the constraints on test sequence length are relaxed, simpler TGs for datapath circuits may be possible, but such designs have yet to be demonstrated.
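The flip-flop counts in this argument follow from a simple counting bound: a sequence generator that steps through a given number of distinct states needs at least the base-2 logarithm of that number, rounded up, in flip-flops. The sketch below compares this bound with the n + 1 flip-flops of a TR counter that has 2n + 2 states; the function names are illustrative, and the comparison says nothing about the combinational logic either design needs.

```python
def min_flip_flops(num_states):
    """Fewest flip-flops that can encode num_states distinct states."""
    return max(1, (num_states - 1).bit_length())


def tr_counter_flip_flops(n):
    """Flip-flops in an (n + 1)-bit twisted ring counter."""
    return n + 1


print(min_flip_flops(12))  # bound for a generator producing 12 tests
for n in (4, 16, 64):
    states = 2 * n + 2
    print(n, states, tr_counter_flip_flops(n), min_flip_flops(states))
```

The TR counter's flip-flop count grows linearly with n while the counting bound grows logarithmically, which is the trade-off against decoder complexity discussed above.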
