Abstract: Test has been an essential task since the early days of digital circuits. Every produced chip undergoes at least a production test, supported by on-chip test infrastructure to reduce test cost. Throughout the technology evolution, fault tolerance has gained importance and is now necessary in many applications to mitigate soft errors that threaten consistent operation. While a variety of effective solutions exist for both areas, test and fault tolerance are often implemented orthogonally and hence do not exploit the potential synergies of a combined solution.
I. INTRODUCTION
The technology evolution of digital circuits is accompanied by two main challenges. To assure product quality, offline test is a necessity. Under elevated soft error rates, online fault tolerance that constantly monitors operation is of vital importance for reliability [1]. These two challenges require an efficient hardware test to cope with manufacturing defects as well as fault tolerance to confine transient errors caused by Single Event Upsets (SEUs) altering the sequential state.
Testing a circuit after production and throughout its lifetime to detect manufacturing defects or wearout effects is one of the most challenging areas in digital circuits. Testing sequential circuits without additional Design for Test (DfT) infrastructure is hard due to the limited access to the circuit state and the associated high complexity of sequential automatic test pattern generation (ATPG). The most widely adopted DfT infrastructure is scan design [2]. It provides observability and controllability of the circuit state by replacing sequential elements with scannable counterparts and grouping them into scan chains that are read and written sequentially. Nowadays multiple scan chains are used. The ability to use combinational test sets is paid for with additional area overhead as well as increased test application times and test data volume. Although solutions like the use of multiple (shorter) scan chains or on-chip test data (de-)compression and compaction [3, 4, 5, 6] are able to reduce the test time and volume, they often substantially increase the area overhead in addition to the overhead introduced by the scan elements.
An alternative to scan-based DfT infrastructure is Random Access Scan (RAS) [7, 8]. It arranges the flip-flops of a circuit in an array, providing unique access to read and write single bits. In [9, 10] a toggle flip-flop is used to invert a bit instead of writing it. While still incurring a high area overhead, the results show that significant savings in test time and volume are possible if the next test pattern is set up by selectively updating the captured circuit response.
Fault tolerance can be achieved by time, space or information redundancy. Due to the non-regular structure of random logic, most schemes protect the sequential state by a combination of time and space redundancy. The RAZOR approach [11] as well as the GRAAL scheme [12] duplicate each bit to detect SEUs and correct them by restoring the value from the shadow element. The area overhead inherent to bitwise duplication and comparison is reduced by using latches. If present, the scan portion can be reused to implement the shadow elements; however, this implies that it runs at speed.
The work presented here targets the convergence of test and system reliability solutions by the following contributions:
• A Unified Architecture (Fig. 1 
II. ONLINE FAULT TOLERANCE ARCHITECTURE
The online fault tolerance architecture from [13] is slightly extended to serve as the foundation for an efficient offline test. It protects the sequential state stored in registers against SEUs (Fig. 2). For each register $R_i$, a combination of information and structural redundancy is employed to derive a resident checksum and store it in an additional register $C_i$. SEUs are detected by a signature $S_i$, computed as the difference between the stored resident checksum $C_i$ and the checksum $C'_i$ recomputed from the register values. Detected SEUs are localized by decoding the signature. Finally, the clock is gated and the affected register bit is corrected in one additional clock cycle with the help of a sequential standard cell that is inherently able to invert its internal state. False corrections due to SEUs in $C_i$ are prevented by a parity of $C_i$. For offline testing, scan design is added to $C_i$ and the decoder is gated by the scan enable signal.
The following subsections discuss the two main underlying concepts used to implement the fault tolerance: A) the area-efficient error detecting and correcting (EDAC) code computation over the register content, and B) the efficient correction of SEUs at bit level employing Bit-Flipping Flip-Flops.
A. Error Detecting and Correcting Code
Let $R_i$ be a register with $n$ bits. Let $R_i = [r_n, \cdots, r_1]^T$ represent the data word vector in matrix notation, where $r_{adr}$ ($n \geq adr \geq 1$) references the bit at address $adr$. The modulo-2 address characteristic proposed in [14] is defined as the bitwise XOR of all addresses where $r_{adr} = 1$. $r_0$ is not used, as address 0 does not contribute to $C_i$.
The mapping between data and characteristic bits corresponds to the generator matrix of a Hamming code and can be expressed by a modulo-2 characteristic matrix $M$.
It consists of $l$ rows and $n$ columns, where $n$ is the number of data bits and each column contains the binary address $adr$ of the associated data bit. The maximum length over all used addresses defines the size of the calculated characteristic and depends on $n$ logarithmically:
$$l = \lceil \log_2(n + 1) \rceil \quad (1)$$
The characteristic $C_i$ is computed by multiplying $M$ with $R_i$:
$$C_i = M \cdot R_i \pmod 2$$
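As a minimal sketch of this matrix formulation, the following Python snippet builds a characteristic matrix whose columns hold the binary addresses $n, \cdots, 1$ and multiplies it with a data word over GF(2). The function names and the row ordering (most significant address bit in the first row) are assumptions made for illustration, and the 7-bit data word is an arbitrary example value.

```python
def characteristic_matrix(n):
    """Return M as l rows of n bits; column order matches R = [r_n, ..., r_1]."""
    # For n >= 1, n.bit_length() equals ceil(log2(n + 1)), the number of address bits.
    l = n.bit_length()
    addresses = range(n, 0, -1)  # n, n-1, ..., 1
    # Row 0 holds the most significant address bit (an illustrative convention).
    return [[(adr >> (l - 1 - row)) & 1 for adr in addresses] for row in range(l)]


def characteristic(M, R):
    """Multiply M with the column vector R over GF(2)."""
    return [sum(m & r for m, r in zip(row, R)) % 2 for row in M]


M = characteristic_matrix(7)
R = [1, 0, 1, 1, 0, 1, 0]  # r_7 ... r_1, illustrative data word
C = characteristic(M, R)
print(M)
print(C)
```

Because each row of $M$ only selects the data bits whose address has the corresponding bit set, every row reduces to an XOR tree over a subset of the register bits.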
To detect an error, the characteristic of the original register content $R_i$ is computed at time $t_j$ and stored in an additional register $C_i$ of size $l$. We call $C_i$ the resident characteristic.
The recomputed characteristic $C'_i$ is then concurrently derived from the register content $R_i$ until new data is written. The difference between the resident characteristic $C_i$ and the recomputed characteristic $C'_i$ is called the signature of $R_i$:
$$S_i = C_i \oplus C'_i$$
If $S_i$ is the all-zero vector, no deviation was detected; otherwise $S_i$ contains the address localizing the register bit affected by a single bit upset (SBU). The characteristic computation can be efficiently implemented using XOR2 standard cells [15].
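Example: Let $R_1$ be a 7-bit register with value $[1011010]^T$.
Together with the modulo-2 characteristic matrix $M$, the resident characteristic $C_1$ is computed and stored:
$$C_1 = M \cdot R_1 = \begin{bmatrix} 1&1&1&1&0&0&0 \\ 1&1&0&0&1&1&0 \\ 1&0&1&0&1&0&1 \end{bmatrix} \cdot [1\,0\,1\,1\,0\,1\,0]^T = [1\,0\,0]^T$$
Now an SBU affects $R_1$ and flips bit 5, resulting in the faulty register value $R'_1 = [1001010]^T$. The characteristic is recomputed as $C'_1$ and the signature $S_1$ is calculated:
$$C'_1 = M \cdot R'_1 = [0\,0\,1]^T, \quad S_1 = C_1 \oplus C'_1 = [1\,0\,1]^T$$
As $S_1$ is not zero, the SBU is detected. Moreover, $S_1$ contains the address 5 (binary 101), thereby correctly localizing the SBU. During offline test the characteristic is used for test response compaction.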
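The example can be replayed with a short Python sketch that uses the equivalent XOR-of-addresses definition of the characteristic. The register is held as an integer whose least significant bit is $r_1$; this representation and the helper name `address_characteristic` are illustrative assumptions, not part of the described hardware.

```python
from functools import reduce
from operator import xor


def address_characteristic(value, n):
    """XOR of all addresses adr in 1..n whose bit r_adr is 1 (r_1 is the LSB of value)."""
    return reduce(xor, (adr for adr in range(1, n + 1) if (value >> (adr - 1)) & 1), 0)


n = 7
r1 = 0b1011010                                  # register value from the example
resident = address_characteristic(r1, n)        # stored in C_1
faulty = r1 ^ (1 << (5 - 1))                    # an SBU flips the bit at address 5
recomputed = address_characteristic(faulty, n)  # derived from the faulty register
signature = resident ^ recomputed               # non-zero, equals the upset address

print(f"{resident:03b} {recomputed:03b} {signature:03b}")
```

A single bit upset at address $adr$ adds or removes exactly that address from the XOR, which is why the signature equals $adr$.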
B. Bit Flipping Flip-Flop
Whenever the signature $S_i$ is not the zero vector, it is decoded into an $n$-bit wide correction vector by a 1-out-of-$n$ decoder. The vector then triggers the correction of the erroneous register bit while preserving the state of all other bits.
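The decoding step can be sketched in Python as follows. The correction vector is modeled as an integer bit mask with at most one bit set, and applying it is a bitwise XOR. Treating a signature larger than $n$ as "no correction" is an assumption of this sketch; the function names are hypothetical.

```python
def decode_one_out_of_n(signature, n):
    """Return an n-bit correction mask with only the bit at address `signature` set.

    A zero signature means no error; signatures above n are ignored here,
    which is an assumption of this sketch.
    """
    if 1 <= signature <= n:
        return 1 << (signature - 1)
    return 0


def correct(value, signature, n):
    """Flip only the addressed bit and preserve all other bits."""
    return value ^ decode_one_out_of_n(signature, n)


faulty = 0b1001010
repaired = correct(faulty, 5, 7)
print(f"{repaired:07b}")
```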
In contrast to the Bit-Flipping-Latch from [13], the Bit-Flipping Flip-Flop (Fig. 3) targets an edge-triggered design style. The master latch consists of two inverters (INV) and two transmission gates (TG). Both transmission gates are controlled by the control signal pair $\{L, \overline{L}\}$, selecting whether a new value is latched or the internal state is preserved. The slave latch contains an additional inverting feedback loop (TG4, inverter INV3 and TG3) to flip the internal state. The new control signal pair $\{HI, \overline{HI}\}$ for TG2 and TG3 selects either the original or the inverting feedback loop. To avoid metastability of the inverting feedback loop, the inverter INV3 is precharged by TG4 if and only if the loop is not active. Inverting $\{HI, \overline{HI}\}$ while the slave latch stores a value feeds the inverted value of Q to the inverter chain. If the inversion is canceled, the non-inverting loop stores the inverted value.
The Bit-Flipping Flip-Flop can be implemented efficiently as a standard cell similar to the bit-flipping latch in [13] .
III. OFFLINE TEST ACCESS ARCHITECTURE
The unified architecture is now used for test access to observe and control the sequential circuit state. To this end, the characteristic registers are equipped with scan design and the scan enable signal is used to gate the decoders (Fig. 2). Instead of directly observing the register content $R_i$ in $n$ cycles, the compacted characteristic in register $C_i$ is observed in $l$ cycles. To control $R_i$, the bit-flipping capability inherent to the fault tolerance architecture is reused: bit-flips are triggered at desired positions by shifting an appropriate characteristic into $C_i$. The efficiency of the resulting test access depends on the ratio between $n$ and $l$ as well as on the number of bit-flips.
A. Observing a Test Response
After setting up the stimulus for a test pattern $p$, the circuit response is captured into the internal registers $R$. To determine whether the test pattern passed or failed, the test response must be observed. The fault tolerance infrastructure already computes the resident characteristic of each $R_i$ and stores it in the additional register $C_i$ (Fig. 2). Instead of observing the captured response in $R_i$ in $n$ shift cycles, $C_i$ is made scannable and the compacted circuit response is observed in $l$ shift cycles, where $l \ll n$ (Eq. 1). The value of $C_i$ depends on all bits of register $R_i$ and represents a compacted version of the register content. It has the same properties as a response generated by a dedicated response compactor but reuses existing infrastructure. The characteristic has the same aliasing probability as other SECDED Hamming codes [14]. It follows that the compaction quality of Bit-Flipping Scan is comparable to methods like X-Compact [5] or EDT [3].
B. Controlling a Register by Bit-Flipping
A mechanism to flip single bits of a register $R_i$ is present in the architecture to correct SBUs in fault tolerance mode. This feature is now used to set up the next state of register $R_i$ by a series of bit-flips. For each pattern and bit-flip only $l$ bits need to be shifted in, thereby reducing the complexity of the shift operation logarithmically from $O(n)$ to $O(\log_2 n)$ (Eq. 1). Fig. 2 shows the involved architecture parts.
Let $p_1$ and $p_2$ be two test patterns, let $O(R_i, p_1)$ denote the state of register $R_i$ after the capture cycle of $p_1$ with both characteristics $C_i$, $C'_i$ being equal, and let $I(R_i, p_2)$ denote the state of $R_i$ needed to set up $p_2$. Assume without loss of generality that $I(R_i, p_2)$ and $O(R_i, p_1)$ differ in exactly one bit at address $adr_b$ ($1 \leq adr_b \leq n$), i.e., their Hamming distance is one. Then the desired register value is obtained by a single flip of the bit at $adr_b$.
To trigger a bit-flip at this address, the signature $S_i$ needs to encode $adr_b$: $S_i = adr_b$. As the register state after $p_1$ and the associated recomputed characteristic $C'_i$ are known, the resident characteristic $C_i$ is computed as:
$$C_i = C'_i \oplus adr_b$$
Scanning in $C_i$ triggers a bit-flip at $adr_b$, generating $I(R_i, p_2)$ from $O(R_i, p_1)$ with $l$ shift cycles and one additional cycle for the bit-flipping. At the same time, the compacted register state $C_i$ is scanned out and observed. If the Hamming distance between the two register states is larger than one, a series of single bit-flips is used.
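The following Python sketch illustrates how a series of single bit-flips could be derived for two register states. For every differing bit it computes the resident characteristic to scan in, so that the signature equals the target address, and then models the flip. The integer register model, the ascending order of flips and the name `flip_sequence` are assumptions of this sketch.

```python
from functools import reduce
from operator import xor


def address_characteristic(value, n):
    """XOR of all addresses adr in 1..n whose bit r_adr is 1 (r_1 is the LSB)."""
    return reduce(xor, (adr for adr in range(1, n + 1) if (value >> (adr - 1)) & 1), 0)


def flip_sequence(current, target, n):
    """Return the characteristics to scan in and the register state they produce."""
    scans = []
    state = current
    diff = current ^ target
    for adr in range(1, n + 1):          # flip order is an arbitrary choice here
        if (diff >> (adr - 1)) & 1:
            recomputed = address_characteristic(state, n)
            resident = recomputed ^ adr  # signature = resident XOR recomputed = adr
            scans.append(resident)
            state ^= 1 << (adr - 1)      # the decoder flips the bit at address adr
    return scans, state


scans, final = flip_sequence(0b1011010, 0b1001011, 7)
print(scans, f"{final:07b}")
```

After each modeled flip the recomputed characteristic equals the value that was scanned in, so the next flip starts again from a state with matching characteristics.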
C. Efficient Test Access
For the presented Bit-Flipping Scan scheme, the test time is dominated by the number of bit-flips $bf$. For each flip, $l$ shift cycles and one flip cycle are needed. After applying all flips the circuit state is captured: $TAT_{BFS} = bf \cdot (l + 1) + 1$.
Bit-Flipping Scan results in short test times. Formally, the maximum number of bit-flips at which both schemes have the same test time is defined by
$$bf = \frac{n}{l + 1}$$
Example: For a maximum register size, and hence scan chain length, of 127 ($l = 7$), it follows that Bit-Flipping Scan has a lower test time if 15 or fewer flips per register and pattern are required ($bf \leq 15.875$).
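The break-even point of the example can be checked numerically with a small Python sketch. The Bit-Flipping Scan formula is taken from the text; the per-pattern cost of classical scan, $n$ shift cycles plus one capture cycle, is an assumption of this sketch that is consistent with the stated bound of 15.875.

```python
def tat_bit_flipping_scan(bf, l):
    """Cycles per pattern: bf flips of (l shift cycles + 1 flip cycle), then one capture."""
    return bf * (l + 1) + 1


def tat_scan(n):
    """ASSUMPTION: classical scan needs n shift cycles plus one capture cycle per pattern."""
    return n + 1


def break_even_flips(n):
    """Number of flips at which both schemes need the same number of cycles."""
    l = n.bit_length()  # ceil(log2(n + 1)) for n >= 1
    return n / (l + 1)


n = 127
l = n.bit_length()
print(break_even_flips(n))
print(tat_bit_flipping_scan(15, l), tat_scan(n))
print(tat_bit_flipping_scan(16, l), tat_scan(n))
```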
Bit-Flipping Scan facilitates efficient test access by a logarithmic scan chain length reduction and altered scan chain semantics. Without loss of generality, classical test data (de-)compression and compaction schemes can be utilized to further improve test efficiency. The next sections show how the generation of optimized Bit-Flipping Scan test sequences is modeled and solved heuristically.
IV. MODELING THE TEST SEQUENCE GENERATION
While in principle any test set can be applied using the test access provided by the unified architecture, it is very likely to result in a suboptimal test time and volume due to a high number of involved bit-flips. The goal of efficiently utilizing the unified architecture for offline test is achieved by a tailored sequence of test patterns. After defining the properties of a globally optimal test sequence, the reduction of sequential ATPG under bit-flips to a Boolean satisfiability problem and its modeling in conjunctive normal form (CNF) are discussed.
For a circuit $C$ with a set of faults $F$, an optimal Bit-Flipping Scan test sequence $P_{opt}$ ensures that
• all faults $f$ in the fault universe $F$ are detected by $P_{opt}$,
• the number of bit-flips to set up a register $R_i$ for pattern $p_j$ from the previous register state $O(R_i, p_{j-1})$ is bounded by $HammingDist(O(R_i, p_{j-1}), I(R_i, p_j)) \leq bf_{bound}$,
• the length of $P_{opt}$ is minimal.
A. Circuit Modeling
A combinational representation $C_C$ of $C$ is built by removing all sequential elements and adding pseudo-primary inputs and outputs (PPI/PPO). Each gate $g_i \in C_C$ is then represented in CNF using the Tseitin encoding, which generates a linear number of clauses at the cost of introducing a linear number of new variables [16]. Each gate $g$ with inputs $i_1, \cdots, i_n$ and output $o$ implementing a Boolean function $f_g$ is described by the equivalence
$$o \leftrightarrow f_g(i_1, \cdots, i_n)$$
Expanding the equation in a product-of-sums form yields the set of clauses $\Phi_g$ in CNF. The circuit $C_C$ is then described in CNF as
$$\Phi_{C_C} = \bigwedge_{g \in C_C} \Phi_g$$
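To illustrate the Tseitin encoding, the following Python sketch encodes a two-input AND gate and a two-input XOR gate as clause lists and checks the resulting CNF of a tiny circuit against its truth table. Literals are signed integers in the DIMACS convention; the example circuit $y = (x_1 \wedge x_2) \oplus x_3$ and all function names are assumptions chosen for illustration.

```python
from itertools import product


def tseitin_and(o, a, b):
    """Clauses for o <-> (a AND b)."""
    return [[-o, a], [-o, b], [o, -a, -b]]


def tseitin_xor(o, a, b):
    """Clauses for o <-> (a XOR b)."""
    return [[-o, a, b], [-o, -a, -b], [o, -a, b], [o, a, -b]]


def satisfied(clauses, assignment):
    """True if every clause contains at least one literal that is true."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)


def cnf_matches_circuit(cnf):
    """Brute-force check: the CNF holds exactly for consistent signal values."""
    for bits in product([False, True], repeat=5):
        v = dict(zip(range(1, 6), bits))
        consistent = v[4] == (v[1] and v[2]) and v[5] == (v[4] != v[3])
        if satisfied(cnf, v) != consistent:
            return False
    return True


# Variables 1..3 are inputs, 4 is the AND output, 5 is the circuit output y.
cnf = tseitin_and(4, 1, 2) + tseitin_xor(5, 4, 3)
print(len(cnf), cnf_matches_circuit(cnf))
```

The circuit CNF is simply the concatenation of the per-gate clause sets, which mirrors the conjunction over all gates.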
B. Modeling of Stuck-At Faults
Each stuck-at fault in $F$ is represented as a new free literal $f$. The faulty circuit $\Phi_{c_f}$ is modeled by copying the output cone $c_f$ of the fault site $s_f$ and assigning new literals to the fault location and all other signals in the fault cone $c_f$ ($s'_f$ for $s_f$ and $\forall s_n \in c_f: s'_n$). At the edge of the cone the corresponding literals from $\Phi_{C_C}$ are used. To generate a test pattern for $f$ with polarity $p_f \in \{0, 1\}$, three conditions need to hold:
• Fault-free circuit: the fault site has the correct value: $s_f = \overline{p_f}$
• Faulty circuit: the fault site has the faulty value: $s'_f = p_f$
• $f$ is observed at at least one output, i.e., for at least one output of the fault cone the fault-free literal and its faulty copy differ.
Then fault f is modeled as
C. Sequential Mapping and Modeling of Bit-Flips
The sequential behavior of $C_S$ is modeled by unrolling. Each timeframe $t_j$ is modeled by $\Phi_{C_C,t_j}$, consisting of a copy of $\Phi_{C_C}$ with appropriate literal renaming, and $\Phi_{f,t_j}$ denotes fault $f$ in timeframe $t_j$.
Bit-flips are modeled by introducing new free literals $B(R_i, t_j)$ for each register $R_i$. Together with the pseudo-primary output literals $O(R_i, t_{j-1})$ of the previous timeframe, the sequential state in timeframe $t_j$ is modeled:
$$I(R_i, t_j) = O(R_i, t_{j-1}) \oplus B(R_i, t_j)$$
The sequential behavior under bit-flips is modeled as Φ B = ∧ 
D. Optimal Test Sequence P opt
The SAT instance for the optimal test sequence P opt from the beginning of this section can now be modeled as follows.
The circuit is unrolled for $x$ timeframes: $\Phi_{C_C} = \Phi_{C_C,t_1} \wedge \cdots \wedge \Phi_{C_C,t_x}$. All faults are added to each timeframe. As it is sufficient to detect a fault once, a disjunction over all timeframes is added per fault: $\bigwedge_{f \in F} \left( \bigvee_{j=1}^{x} f(t_j) \right)$. Between consecutive timeframes $t_{j-1}, t_j$, bit-flip cardinality constraints are added to limit the maximum number of flips per register to $bf_{bound}$. The literals of timeframe 0 are set to the registers' initialization values: $\Phi_0$.
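A cardinality constraint of this kind can be expressed in CNF in several ways. The text does not state which encoding is used, so the following Python sketch shows the simple binomial encoding as one assumed possibility: for every subset of $k + 1$ bit-flip literals, at least one must be false. The function names are hypothetical, and the brute-force check is only practical for small literal counts.

```python
from itertools import combinations, product


def at_most_k(literals, k):
    """Binomial encoding: every subset of k + 1 literals contains a false one."""
    return [[-lit for lit in subset] for subset in combinations(literals, k + 1)]


def satisfied(clauses, assignment):
    """True if every clause contains at least one literal that is true."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)


def encoding_is_exact(num_literals, k):
    """Brute-force check that the CNF holds exactly when at most k literals are true."""
    literals = list(range(1, num_literals + 1))
    cnf = at_most_k(literals, k)
    for bits in product([False, True], repeat=num_literals):
        assignment = dict(zip(literals, bits))
        if satisfied(cnf, assignment) != (sum(bits) <= k):
            return False
    return True


flip_literals = [1, 2, 3, 4]  # stand-ins for the bit-flip literals of one register
print(at_most_k(flip_literals, 1))
print(encoding_is_exact(4, 1), encoding_is_exact(5, 2))
```

The binomial encoding needs no auxiliary variables but grows combinatorially with the register width, so for wide registers a more compact encoding would likely be preferable.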
Solving the model yields the bit-flips to apply in each timeframe (Section III-B). The test sequence $P_{opt}$ with minimum length is found by bisection over the number of timeframes.
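The bisection can be sketched in Python as follows. The callback `is_satisfiable(x)` is a hypothetical stand-in for building and solving the SAT instance with $x$ timeframes, and the sketch assumes that satisfiability is monotone in $x$, i.e., that a model with $x$ timeframes implies one with $x + 1$.

```python
def minimal_timeframes(is_satisfiable, upper):
    """Smallest x in [1, upper] for which is_satisfiable(x) holds, or None.

    ASSUMPTION: is_satisfiable is monotone (False ... False True ... True).
    """
    if not is_satisfiable(upper):
        return None
    lo, hi = 1, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if is_satisfiable(mid):
            hi = mid       # a sequence of length mid exists, try shorter ones
        else:
            lo = mid + 1   # no sequence of length mid, more timeframes needed
    return lo


calls = []


def toy_oracle(x):
    """Toy stand-in for the SAT solver: pretend six timeframes are required."""
    calls.append(x)
    return x >= 6


print(minimal_timeframes(toy_oracle, 16), len(calls))
```

Each oracle call corresponds to one full SAT run, so the logarithmic number of calls matters far more than the bookkeeping around it.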
Finding the globally optimal test sequence is only feasible for small circuits, small fault universes and a limited number of timeframes due to the high complexity of ATPG and the associated runtimes [17]. Nonetheless, an optimized Bit-Flipping Scan test sequence can be generated iteratively as depicted in the next section.
V. BIT-FLIPPING SCAN TEST SEQUENCE GENERATION
The heuristic iteratively generates patterns of the test sequence for combinationally detectable faults from the fault universe $F$, where each pattern $p$ targets a limited number of not yet detected faults $F_{nd}$. All patterns are guaranteed to require a minimal number of bit-flips while covering the maximum number of faults from $F_{nd}$. As a preprocessing step, $F$ is classified by combinational ATPG and undetectable faults are removed. The remaining faults are sorted in descending order according to their testability, thereby putting preference on hard-to-detect faults in the iterative pattern generation.
The heuristic depicted in pseudocode in Algorithm 1 is invoked with the initialization pattern p 0 , the fault universe F and the number of concurrently targeted faults maxF .
First, the SAT model $\Phi$ (l. 4) is built, modeling one timeframe of the combinational circuit ($\Phi_{C_C}$), the limited number of faults contained in $F_{nd}$, the bit-flips associated with the PPIs ($\Phi_B$) as well as a cardinality constraint $\Phi_{B_{numBF}}$ restricting the number of flips per register to $numBF$.
The for-loop (l. 6-14) searches for a pattern $p_j$ covering a maximum number of faults $numF$ from $F_{nd}$. In each iteration, for a given number of faults, a cardinality constraint is added to detect at least $numF$ faults: $\Phi_{numF} = atleast(F_{nd}, numF)$. If the model is satisfiable under the current sequential state $\Phi_{p_{j-1}}$, pattern $p_j$ is extracted and the loop continues with an increased $numF$. Once the model is not satisfiable, two cases need to be distinguished:
Algorithm 1
    [...]  ▷ maxF not det. faults
 3: P ← ∅ ∪ p0; j ← 1; numBF ← 1  ▷ init, 1 BF per Reg.
 4: Φ ← Φ_CC ∧ Φ_Fnd ∧ Φ_B ∧ Φ_B,numBF  ▷ model
 5: while F ≠ ∅ do
 6:     for numF ← 1, maxF do  ▷ cover max. faults
 7:         update(Φ, Φ_numF)  ▷ add cardinality constraint
    [...]
26: return P
27: end function
[...] than 10% of the faults contained in $F_{nd}$ are detected, the model is rebuilt with the next $maxF$ faults (l. 20).
• No pattern was found at all (l. 22): The iteration proves that no pattern detecting even a single fault exists under the constrained number of bit-flips. $numBF$ is increased and the for-loop is re-executed, thereby ensuring that the next pattern detects the maximum number of modeled faults under the increased number of bit-flips.
The surrounding while loop (l. 5-25) terminates when all faults from $F$ are detected. As $F$ contains only combinationally detectable faults, each fault in $F$ will be detected, in the last resort by a pattern requiring a high number of bit-flips.
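The overall control flow of the heuristic can be sketched in Python on a toy model. This is not the SAT-based procedure itself: the solver is replaced by brute-force enumeration of all states within the flip bound, fault detection is given by a hypothetical lookup function `detects`, the captured response by a hypothetical function `response`, and the model rebuild tied to the 10% threshold is omitted. Resetting `num_bf` to 1 after each pattern is also an assumption of this sketch.

```python
from itertools import combinations


def neighbours(state, n, max_flips):
    """All n-bit states reachable from `state` with at most max_flips bit-flips."""
    for k in range(max_flips + 1):
        for positions in combinations(range(n), k):
            s = state
            for pos in positions:
                s ^= 1 << pos
            yield s


def generate_sequence(p0, faults, max_f, n, detects, response):
    """Greedy sketch: few flips first, then as many targeted faults as possible."""
    remaining = list(faults)              # assumed to be pre-sorted, hardest first
    sequence, state, num_bf = [p0], response(p0), 1
    while remaining:
        targeted = set(remaining[:max_f])
        best = None
        for num_f in range(1, len(targeted) + 1):
            # Stand-in for the SAT call with the constraint "at least num_f faults".
            found = next((p for p in neighbours(state, n, num_bf)
                          if len(detects(p) & targeted) >= num_f), None)
            if found is None:
                break
            best = found
        if best is None:
            if num_bf >= n:
                raise ValueError("remaining faults are not detectable")
            num_bf += 1                   # no pattern at all: allow one more flip
            continue
        sequence.append(best)
        remaining = [f for f in remaining if f not in detects(best)]
        state, num_bf = response(best), 1
    return sequence


# Toy data: which pattern detects which fault; the response equals the pattern.
table = {0b001: {"a"}, 0b011: {"b", "c"}, 0b111: {"d"}}
sequence = generate_sequence(0, ["a", "b", "c", "d"], 2, 3,
                             lambda p: table.get(p, set()), lambda p: p)
print([f"{p:03b}" for p in sequence])
```

In this toy run every pattern is reached from its predecessor with a single flip; a fault that needs a more distant pattern only becomes reachable after the flip bound has been raised.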
VI. EXPERIMENTAL EVALUATION
The presented scheme is evaluated for ISCAS89 and ITC99 benchmarks as well as for industrial circuits kindly provided by NXP (formerly Philips). For each circuit, the combinational core is synthesized for the 45 nm Nangate Open Cell Library (OCL) [18]. For b) and c), all FFs are organized into scan chains with a maximum length of 127. For d), a register is implemented for each chain from the scan chain configuration used in b) and c). Note that a chain length of 127 is rather short. For longer chains, scenarios b) and c) will scale linearly in terms of area and test time. The area of the unified architecture and the test time of Bit-Flipping Scan sequences will grow more slowly due to the logarithmic relation between $n$ and $l$ (Eq. 1).
A. Area Overhead
The gate area of the synthesized original circuit in µm² is used as a baseline in Table I; the corresponding gate counts can be found in columns 2 & 3 of Table II. The implementation of scan design increases the area (col. 3), and the area overhead relative to the original circuit is between 4.3% and 24.7% (col. 4). Implementing fault tolerance by bitwise redundancy orthogonal to scan design further increases the area (cols. 5 & 6). The area associated with the presented unified architecture (col. 7) is moderate for all circuits, with an overhead between 21.7% and 81.8% (col. 8). The last column depicts the difference between the overheads of BFScan (col. 8) and FTScan (col. 6).
The results show that, compared to an orthogonal combination of the two classical methods, the unified architecture targeting both test and fault tolerance uses less area.
B. Test Application Time
Table II compares the test application times. A highly compacted test set is generated for each FTScan configuration using a commercial ATPG. The heuristic from Section V is used to generate BFScan test sequences, where maxF was set to 100, providing a good tradeoff between runtime and achieved test time.
The total number of clock cycles for Bit-Flipping Scan is significantly lower than for FTScan (cols. 12 & 7), although the BFScan test sequence contains more patterns (cols. 8 & 4). Instead of $n$ shift cycles, only $\log_2(n + 1)$ cycles are used per pattern and bit-flip, allowing to apply more patterns in
