Abstract-This paper presents a new feedback shift register-based method for embedding deterministic test patterns on-chip, suitable for complementing conventional BIST techniques for in-field testing. Our experimental results on 8 real designs show that the presented approach outperforms the bit-flipping approach by 24.7% on average. We also show that it is possible to exploit the uneven distribution of don't care bits in test patterns to reduce the area required for storing deterministic test patterns by more than a factor of 3 with less than a 2% drop in fault coverage.
I. INTRODUCTION
Large test data volume is widely recognized as a major contributor to the testing cost of integrated circuits [1]. The test data volume in 2017 is expected to be 10 times larger than in 2012 [2]. In contrast, the size of the Automatic Test Equipment (ATE) memory is expected only to double [2].
A number of efficient on-chip test compression techniques have been proposed as a solution for reducing ATE memory requirements, including [1], [3], [4], [5], [6]. A test set for the circuit under test is compressed to a smaller set, which is stored in ATE memory. An on-chip decoder is used to generate the original test set from the compressed one during test application. Test compression has already established itself as a mainstream design-for-test methodology for manufacturing testing [6]. However, it cannot be used for in-field testing, where ATE is not available [7].
For in-field testing, Built-In Self Test (BIST), including use of JTAG, is applied, in which either pseudo-random test patterns are generated within the system or pre-computed deterministic test patterns are stored in system memory [7]. In terms of test application time and fault coverage, deterministic test patterns are more effective than pseudo-random ones. The fault coverage achieved with pseudo-random test patterns can be as low as 65% [8]. Several methods for increasing BIST test coverage have been proposed, including modification of the circuit under test [9], insertion of control and observe points into the circuit [4], modification of the LFSR to generate a sequence with a different distribution of 0s and 1s [10], embedding of deterministic test patterns into the LFSR's patterns by LFSR re-seeding [11] or bit-flipping [12], or storing them in an on-chip memory [13]. The idea of complementing pseudo-random patterns with deterministic patterns is particularly attractive because the deterministic patterns can also address transition or delay faults, which are not handled efficiently by pseudo-random patterns. However, the area required to store deterministic test patterns within the system can be prohibitively high. For example, the memory required to store them may exceed 30% of the memory used in a conventional ATPG approach [14].
In this paper, we propose a new method for embedding deterministic test patterns on-chip, suitable for complementing conventional techniques for in-field testing. We generate deterministic test patterns using a structure known as a binary machine. This name was introduced by S. Golomb in his seminal book [15]. Binary machines can be considered a more general type of Non-Linear Feedback Shift Register (NLFSR) [16] in which every stage is updated by its own feedback function.
Binary machines are typically smaller and faster than NLFSRs generating the same sequence. For example, consider the 4-stage NLFSR with the feedback function
where "⊕" is the XOR (addition modulo 2), "·" is the AND, and $x_i$ is the variable representing the value of stage $i$, $i \in \{0, 1, 2, 3\}$. If this NLFSR is initialized to the state $(x_0 x_1 x_2 x_3) = (1000)$, it generates the output sequence
with the period 15. The same sequence can be generated by the 4-stage binary machine with the feedback functions
We can see that the binary machine uses 3 binary operations, while the NLFSR uses 5 binary operations. Furthermore, the depth of feedback functions of the binary machine is smaller than the depth of the feedback function of the NLFSR. Thus, the binary machine has a smaller propagation delay than the NLFSR. While binary machines can potentially be smaller and faster than NLFSRs, the search space for finding a best binary machine for a given sequence is much larger than the corresponding one for NLFSRs. Algorithms for constructing binary machines were presented in [17] , [18] . Both algorithms result in binary machines with the minimum number of stages for a given binary sequence. However, they do not minimize the circuit complexity of feedback functions. For Finite State Machines (FSM), it is known that an FSM with a non-minimal number of stages, e.g. encoded using one-hot encoding, often has a smaller total size than an FSM with a minimal number of stages [19] .
In this paper, we present an algorithm which constructs binary machines with a non-minimal number of stages. Our experimental results show that binary machines constructed by the presented algorithm are 63.28% smaller on average than the ones constructed by the algorithm [18]. The presented algorithm is particularly efficient for incompletely specified sequences, which are important for testing.
The rest of the paper is organized as follows. Section II gives an introduction to binary machines. Section III reviews related work. Section IV describes the new algorithm for constructing binary machines. Section V presents the experimental results. Section VI concludes the paper and discusses open problems.
II. BINARY MACHINES
An n-stage binary machine consists of n binary storage elements, called stages [15]. Each stage $i \in \{0, 1, \ldots, n-1\}$ has an associated state variable $x_i \in \{0, 1\}$, which represents the current value of stage $i$, and a feedback function $f_i : \{0, 1\}^n \to \{0, 1\}$, which determines how the value of $x_i$ is updated (see Figure 1).
A state of a binary machine is a vector of values of its state variables. At every clock cycle, the next state of a binary machine is determined from its current state by updating the values of all stages simultaneously to the values of the corresponding feedback functions. An n-stage binary machine has $2^n$ states corresponding to the set $\{0, 1\}^n$ of all possible binary n-tuples.
The degree of parallelization of an n-stage binary machine, k, is the number of output bits generated at each clock cycle,
The dependence set of a Boolean function f : {0, 1} n → {0, 1} is defined by
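The Algebraic Normal Form (ANF) [20] of a Boolean function $f : \{0, 1\}^n \to \{0, 1\}$ (also called Reed-Muller canonical form [21]) is an expression in the Galois Field of order 2, GF(2), of type
$$f(x_0, x_1, \ldots, x_{n-1}) = \bigoplus_{i=0}^{2^n - 1} c_i \cdot x_0^{i_0} \cdot x_1^{i_1} \cdots x_{n-1}^{i_{n-1}},$$
where $c_i \in \{0, 1\}$ are constants, $(i_0 i_1 \ldots i_{n-1})$ is the binary expansion of $i$, and $x_j^{i_j}$ equals $x_j$ for $i_j = 1$ and 1 for $i_j = 0$.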
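As an illustration of the ANF definition, the coefficients $c_i$ can be computed from a truth table with the binary Moebius transform. The following is a minimal sketch, not part of the presented method; the function names and the indexing convention (bit $j$ of the table index is the value of $x_j$) are assumptions made for this example.

```python
def anf_coefficients(truth_table):
    """Return the ANF coefficients c_0..c_{2^n-1} of a Boolean function.

    truth_table[i] is f at the point whose coordinate x_j is bit j of i
    (an indexing convention assumed for this sketch).
    """
    size = len(truth_table)
    n = size.bit_length() - 1
    if size < 1 or size != 1 << n:
        raise ValueError("truth table length must be a power of two")
    c = list(truth_table)
    # Binary Moebius transform: process one variable at a time.
    for j in range(n):
        bit = 1 << j
        for i in range(size):
            if i & bit:
                c[i] ^= c[i ^ bit]
    return c


def anf_terms(truth_table):
    """List the monomials with c_i = 1 as tuples of variable indices."""
    c = anf_coefficients(truth_table)
    n = len(c).bit_length() - 1
    return [tuple(j for j in range(n) if (i >> j) & 1)
            for i, ci in enumerate(c) if ci]


# OR of two variables has the ANF x0 XOR x1 XOR x0*x1.
print(anf_terms([0, 1, 1, 1]))
```

The number of monomials returned gives a rough indication of how many AND and XOR operations a direct ANF implementation of a feedback function would need.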
III. RELATED WORK
The first algorithm for constructing a binary machine with the minimum number of stages for a given binary sequence was presented in [17]. This algorithm exploits the unique property of binary machines that any binary n-tuple can be the next state of a given current state. The algorithm assigns every 0 of a sequence a unique even integer and every 1 of a sequence a unique odd integer. Integers are assigned in increasing order starting from 0. For example, if the 8-bit sequence 00101101 is given, the sequence of integers 0, 2, 1, 4, 3, 5, 6, 7 can be used. This sequence of integers is interpreted as a sequence of states of a binary machine. The largest integer in the sequence of states determines the number of stages. In the example above, $\lceil \log_2 7 \rceil = 3$, thus the resulting binary machine has 3 stages. The feedback functions $f_0, f_1, f_2$ implementing the resulting current-to-next state mapping are derived using traditional logic synthesis techniques [22].
Note that, in general, any permutation of integers can be used as a sequence of binary machine's states, as long as the selected integer modulo 2 is equal to the corresponding bit of the output sequence. Different state assignments result in different feedback functions. The size of these functions may vary substantially.
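The even/odd assignment described above can be sketched as follows. This is an illustrative reading of the description of [17], not the authors' implementation; the function names are hypothetical, and the wrap-around from the last state to the first is an assumption made so that the machine can be simulated.

```python
def assign_states(bits):
    """Give each 0 the next unused even integer and each 1 the next odd one."""
    next_even, next_odd = 0, 1
    states = []
    for b in bits:
        if b == "0":
            states.append(next_even)
            next_even += 2
        else:
            states.append(next_odd)
            next_odd += 2
    return states


def num_stages(states):
    """Number of bits needed to hold the largest state."""
    return max(1, max(states).bit_length())


def run(states, cycles):
    """Follow the current-to-next state mapping; output is the state modulo 2."""
    # Assumption: the last state maps back to the first one.
    nxt = dict(zip(states, states[1:] + states[:1]))
    state, out = states[0], []
    for _ in range(cycles):
        out.append(str(state & 1))
        state = nxt[state]
    return "".join(out)


states = assign_states("00101101")
print(states, num_stages(states), run(states, 8))
```

For the 8-bit example, the sketch yields the states 0, 2, 1, 4, 3, 5, 6, 7 and 3 stages, and simulating the mapping returns the original sequence.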
In [18], the algorithm [17] was extended to binary machines generating k bits of the output sequence per clock cycle. The main idea is to encode a binary sequence into an m-ary sequence which can be generated in a simpler way. As an example, suppose that we use the 4-ary encoding (00) = 0, (01) = 1, (10) = 2, (11) = 3 to encode the binary sequence 00101101 from the example above into the quaternary sequence 0231. Then, we can construct a parallel binary machine generating 00101101 at 2 bits per clock cycle with the sequence of states 0, 2, 3, 1. Note that $\lceil \log_2 3 \rceil = 2$, so the resulting parallel binary machine has one stage less than the binary machine constructed above. This is surprising, given that all existing techniques for the parallelization of LFSRs [23], [24] and NLFSRs [25], [26] incur an area penalty. It was shown in [18] that, for random sequences, parallel binary machines can be an order of magnitude smaller than parallel LFSRs or NLFSRs generating the same sequence.
IV. SYNTHESIS OF BINARY MACHINES
The problem of finding a best binary machine for a given sequence can be divided into three sub-problems: 1) Selecting an optimal degree of parallelization for a given binary sequence. 2) Choosing an optimal state assignment for a given degree of parallelization. 3) Finding a best circuit for feedback functions for a given state assignment.
A. Optimal degree of parallelization
The degree of parallelization determines how many output bits are generated per clock cycle. The size of binary machines may differ substantially for different parallelization degrees. The degree of parallelization is optimal if it minimizes the size of the resulting binary machine.
In order to construct a binary machine with the degree of parallelization p, we map a binary sequence into a $2^p$-ary sequence by partitioning the binary sequence into vectors of length p. The resulting vectors are treated as binary expansions of elements of a $2^p$-ary sequence. The same approach was used in [18].
Let us denote by $N_i$ the number of occurrences of a digit $i$ in the $2^p$-ary sequence, $0 \le i < 2^p$. Let $N_{max}$ be the largest of the $N_i$. In [18], it was shown that the minimum number of stages in a binary machine generating a given binary sequence with the degree of parallelization p is equal to
$$k = p + \lceil \log_2 N_{max} \rceil. \quad (2)$$
From (2) we can see that if $N_{max} = 1$, then $k = p$. Such a case is called full parallelization. Based on our experimental results, we hypothesise that the optimal degree of parallelization belongs to the interval
where n is the sequence length. Note that for some applications, including testing, the degree of parallelization is specified by the user. For example, for testing it is equal to the number of scan chains.
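The relation between the degree of parallelization and the minimum number of stages can be checked on small sequences. The sketch below assumes that (2) has the form $k = p + \lceil \log_2 N_{max} \rceil$, which is what the surrounding text implies; the function names are hypothetical, and the sequence length is assumed to be a multiple of p.

```python
from collections import Counter


def to_digits(bits, p):
    """Split a bit string into p-bit vectors and read each as a 2^p-ary digit."""
    if len(bits) % p:
        raise ValueError("sequence length must be a multiple of p in this sketch")
    return [int(bits[i:i + p], 2) for i in range(0, len(bits), p)]


def min_stages(bits, p):
    """Minimum number of stages, assuming k = p + ceil(log2(N_max))."""
    n_max = max(Counter(to_digits(bits, p)).values())
    # (n_max - 1).bit_length() equals ceil(log2(n_max)) for n_max >= 1.
    return p + (n_max - 1).bit_length()


seq = "00101101"
print(to_digits(seq, 2))      # quaternary digits of the example sequence
for p in (1, 2, 4):
    print(p, min_stages(seq, p))
```

For the sequence 00101101, p = 1 gives 3 stages and p = 2 gives 2 stages, in agreement with the two examples of Section III.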
B. Optimal state assignment
A state assignment determines the sequence of states which a binary machine follows. Different sequences of states give rise to different current-to-next state mappings and, thus, to different updating functions. The state assignment is optimal if it minimizes the size of the resulting binary machine.
Since a binary machine is a deterministic finite state automaton, any current state has a unique next state. For a given $2^p$-ary encoding, the minimal number of bits which have to be added to the p-tuples to make the current-to-next state mapping unique is $\lceil \log_2 N_{max} \rceil$. The minimal number of stages in the resulting binary machine is given by (2).
The strategy for state assignment presented in this paper has two major differences from the one in [18]. First, we use a non-minimal number of stages, namely $p + \lceil \log_2 \lceil n/p \rceil \rceil$.
Second, we assign states so that the feedback functions implementing the current-to-next state mapping depend on the minimum number of state variables. It is known that a Boolean function of k variables needs $O(2^k/k)$ gates to be implemented (Shannon-Lupanov bound) [27]. Feedback functions of binary machines are random functions. For random functions, the actual size is very close to the upper bound. So, each extra variable nearly doubles the size of the function.
In our method, the feedback functions of an (m + p)-stage binary machine depend on $m = \lceil \log_2 \lceil n/p \rceil \rceil$ variables only. In [18], the feedback functions can potentially depend on all state variables.
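The effect of one extra variable on the Shannon-Lupanov bound can be made explicit by taking the ratio of the bound for k + 1 and for k variables. This is a short worked calculation on the $2^k/k$ expression only, not a statement about any particular synthesized circuit:

```latex
\frac{2^{k+1}/(k+1)}{2^{k}/k} \;=\; \frac{2k}{k+1} \;=\; 2 - \frac{2}{k+1},
\qquad
\text{e.g. } k = 4:\ 1.6, \quad k = 10:\ \approx 1.82, \quad k = 20:\ \approx 1.90 .
```

The ratio approaches 2 as k grows, which is why restricting the feedback functions to m of the m + p state variables matters.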
The pseudocode of the presented state assignment algorithm is shown as Algorithm 1. The input of the algorithm is a binary sequence $A = (a_0, a_1, \ldots, a_{n-1})$ and the desired degree of parallelization p. The output is a sequence $S = (s_0, s_1, \ldots, s_{r-1})$ of binary vectors $s_i = (s_{i,0}, s_{i,1}, \ldots, s_{i,p+m-1}) \in \{0, 1\}^{p+m}$, where $r = \lceil n/p \rceil$ and $m = \lceil \log_2 r \rceil$, corresponding to the states of a (p + m)-stage binary machine generating A with the degree of parallelization p.
The algorithm partitions A into p-tuples and prepends m extra bits to the ith p-tuple. These extra bits correspond to the binary expansion of the ith element of the permutation vector Π.
Next, we define a mapping $s_i \to s_{i+1}$, for all $i \in \{0, 1, \ldots, r-2\}$. Since Π is a permutation, each state in the resulting sequence of states has a unique next state, so the mapping is well-defined. The last state $s_{r-1}$ and each of the $2^{p+m} - r$ remaining states of the resulting (p + m)-stage binary machine are mapped to don't care values. This gives us the possibility to specify the functions $f_0, f_1, \ldots, f_{p+m-1}$ implementing the current-to-next state mapping in a way which minimizes their size. Since $r \le 2^m$, we can treat them as functions depending on the first m variables only. This is very important because, as we mentioned above, for random functions the size nearly doubles with each extra variable.
Since, by construction, the last p bits of each state $s_i$ in $S = (s_0, s_1, \ldots, s_{r-1})$ correspond to the ith p-tuple of A, the resulting binary machine generates A with the degree of parallelization p.
As an example, let us construct a binary machine which generates the following 20-bit binary sequence with the degree of parallelization 2: A = (0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0).
Since n = 20 and p = 2, we get r = 10 and $m = \lceil \log_2 10 \rceil = 4$. Suppose we use the following permutation of (0, 1, . . . , 15): Π = (1, 8, 4, 2, 9, 12, 6, 11, 5, 10, 13, 14, 15, 7, 3, 0). Then, we get the following sequence of states: S = (000100, 100011, 010001, 001011, 100100, 110010, 011011, 101110, 010111, 101000).
Algorithm 1 Assign states to a binary machine which generates a binary sequence $A = (a_0, a_1, \ldots, a_{n-1})$ with the degree of parallelization p.

The functions implementing the resulting current-to-next state mapping are
where "-" stands for a don't care value. Recall that the functions depend on the four variables $x_5, x_4, x_3, x_2$ only. The remaining 6 input assignments are mapped to don't cares. We can implement the above functions as:
where " " stands for a complement.
It is important to use permutations Π which have a low-cost implementation. Examples of such permutations are sequences of states generated by counters, LFSRs, or NLFSRs with simple feedback functions [28]. In the example above, we used the sequence of states of an LFSR with the generator polynomial $1 + x + x^4$.
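The worked example can be reproduced with a short script. This is a minimal sketch of the state assignment as described in the text, not the authors' Algorithm 1: the function names are hypothetical, the leading 0 of A and the leading 1 of Π are inferred from the listed states, and the LFSR tap convention (shift right, feed $x_0 \oplus x_1$ into the top bit) is chosen because it yields the listed permutation.

```python
def lfsr_states(seed=1):
    """One full cycle of a 4-bit LFSR: shift right, feed x0 XOR x1 into bit 3."""
    states, s = [], seed
    while True:
        states.append(s)
        s = (s >> 1) | (((s ^ (s >> 1)) & 1) << 3)
        if s == seed:
            return states


def assign_states(bits, p, perm):
    """Prefix the i-th p-tuple of bits with the m-bit expansion of perm[i]."""
    if len(bits) % p:
        raise ValueError("sequence length must be a multiple of p in this sketch")
    r = len(bits) // p
    m = (r - 1).bit_length()              # ceil(log2(r))
    if len(set(perm[:r])) != r or max(perm[:r]) >= 1 << m:
        raise ValueError("perm must supply r distinct m-bit values")
    states = [format(perm[i], f"0{m}b") + bits[i * p:(i + 1) * p]
              for i in range(r)]
    return states, m


def next_state_table(states, m):
    """Map the m permutation bits of each state to the full next state."""
    return {s[:m]: t for s, t in zip(states, states[1:])}


A = "00110111001011101100"
PI = lfsr_states() + [0]                  # 15 nonzero LFSR states, then 0
S, m = assign_states(A, 2, PI)
table = next_state_table(S, m)
print(PI)
print(S)
for current, nxt in table.items():
    print(current, "->", nxt)
```

The table has 9 specified rows; every other value of the four permutation bits, including the one belonging to the last state, is left unspecified and can be used as a don't care during logic optimization.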
C. Best circuit for feedback functions
The problem of finding a best circuit for a given Boolean function is known to be notoriously hard. Exact solutions are known only for functions of up to five variables [29]. However, there are many powerful heuristic algorithms for multi-level circuit optimization which are capable of finding good circuits for larger functions [22]. We optimize feedback functions using UC Berkeley's tool ABC [30]. Our experimental results show that, even for random functions, ABC is capable of reducing the size of the original, non-optimized circuit by 30% on average.
V. EXPERIMENTAL RESULTS

A. Comparison to previous BM synthesis algorithms
In the first experiment, we compared the presented algorithm to the algorithm [18]. Using both algorithms, we constructed binary machines for random sequences of length $2^{20}$ with different numbers of don't care bits. The results are summarized in Table I and Figure 2. As we can see, the presented algorithm is significantly more efficient than the algorithm [18] for sequences with many don't cares. For the case of 99% don't cares, it outperforms the algorithm [18] by 93.4%.
B. Comparison to previous approaches for embedding deterministic test patterns
In the second experiment, we compared the presented algorithm to the bit-flipping approach for embedding deterministic test patterns which, in our opinion, is one of the most efficient ones [12]. The results presented in this section were obtained using our implementation of the bit-flipping algorithm.
The results are summarized in Table II. We first applied 9000 pseudo-random patterns to all designs. Then, we computed the top-off patterns required to reach the maximum achievable stuck-at fault coverage using a commercial ATPG tool. We used bit-flipping and the presented algorithm to represent these top-off patterns. As we can see from Table II, the presented approach outperforms the bit-flipping approach by 24.7% on average. The difference in the number of gates required by the two approaches can be up to 51.5%. Even more importantly, the area overhead of the presented approach goes down as the number of scan chains grows. In contrast, the area overhead of the bit-flipping approach goes up (see Figure 3).
However, in spite of the improvements, the percentage of the overall chip area required to store deterministic test patterns can be prohibitively high for some designs (see column 6 of Table III). It is known that the size of a representation of data is related to the entropy of the data [31]. Entropy puts a theoretical limit on the size of the minimal representation that can be achieved.
If a lower fault coverage is acceptable, then the area overhead can be reduced by exploiting the fact that don't care bits are normally unevenly distributed among test patterns. As an example, consider the diagram in Figure 4. Each point on this diagram shows the number of don't care bits in a test pattern of the dma benchmark (in total 411 patterns of length 1720 bits each). These test patterns were generated by a commercial ATPG tool with dynamic compaction turned on and random fill turned off. They cover 100% of detectable stuck-at faults. The total percentage of specified bits is 6.45%. We can see that only the first few test patterns are highly specified. If we drop the first 5% of test patterns, the entropy of the remaining patterns is halved. Therefore, they can be represented with a representation half the size of the one required for the whole set of test patterns. By using the last 95% of test patterns, we can achieve 95.7% test coverage for stuck-at faults. In Table III we show that, by using only a subset of the top-off patterns, we can reduce the area required for their representation by more than a factor of 3 in some cases, while sacrificing less than 2% of the fault coverage.
VI. CONCLUSION
We presented a new method for embedding deterministic test patterns on-chip based on binary machines. The presented algorithm for synthesis of binary machines is significantly more efficient than previous work, especially for test data with many don't cares. Our experimental results on 8 real designs show that the proposed approach outperforms the bit-flipping approach by 24.7% on average. We also show that it is possible to exploit uneven distribution of don't care bits in test patterns to reduce the area required for generating top-off patterns more than 3 times with less than 2% decrease in fault coverage.
We believe that the presented algorithm for synthesis of binary machines is quite close to optimal. What can be improved in the proposed method is the strategy for selecting a subset of top-off patterns which maximizes the fault coverage and minimizes the area overhead. At present, we use a simple greedy algorithm which selects top-off patterns based on the number of don't care bits and the number of covered faults. A more sophisticated approach is likely to bring better results.
Binary machines can potentially be used for storing compressed test patterns for on-chip test compression techniques. This would eliminate the dependence of test compression on ATE memory. We are currently investigating the feasibility of such an approach on large industrial designs.
VII. ACKNOWLEDGEMENT
This work was supported in part by a research grant No 2011-03336 from the Swedish Governmental Agency for Innovation Systems VINNOVA.
