SELF-CHECKING CIRCUITS
Self-Checking Design in Eastern Europe
STANISLAW J. PIESTRAK Technical University of

LEADING VLSI CHIP and digital system manufacturers report an increasing number of undetected temporary faults that cause permanent damage in general-purpose microprocessor-based systems. Clearly, modern VLSI circuits require higher reliability as traditional means of protection against temporary faults become insufficient. This need has renewed interest in self-checking circuits for on-line error detection in commercial computers. A self-checking circuit is a digital circuit that detects its internal faults simultaneously with normal functioning, for example, by encoding input, internal, and/or output states using error-detecting codes (EDCs). Thanks to their error signals, self-checking modules offer some side-effect advantages: easy testability and localization of faulty modules. These modules also let us build fault-tolerant multiprocessor systems with dynamic reconfiguration for self-repair. Encoding information with EDCs and implementing circuits with self-checking capability thus become a viable and attractive option worthy of consideration.
Self-checking circuits
Self-checking circuits fall into five classes that are defined by a circuit's behavior in the presence of internal faults and input errors. Circuit H is self-testing for a set of faults F if, for every fault f in F, it produces a noncode space output for at least one code space input. Circuit H is fault secure for a set of faults F if, for every fault f in F, it never produces an incorrect code word at the output for a code word at the input. Circuit H is totally self-checking for a set of faults F if it is both self-testing and fault secure for F. Commonly, F includes all single stuck-at-z (s/z) faults, z ∈ {0,1}.
0740-7475/96/$05.00 © 1996 IEEE

Circuit H is code disjoint if it maps code words at the inputs to code words at the outputs, and noncode words at the inputs to noncode words at the outputs. A self-testing checker for code C and set of faults F is a circuit H that is code disjoint for C and self-testing for F. Typically, a self-testing checker is a two-output circuit with code outputs {(01), (10)} and noncode outputs {(00), (11)}. The noncode outputs signal either an input error or an internal fault in the checker itself.
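The code-disjoint property can be checked exhaustively for a small circuit. The following is a minimal sketch, assuming the classic two-pair two-rail checker as the example circuit; that circuit, the function names, and the two-rail input code are illustrative assumptions and are not taken from the designs surveyed here.

```python
from itertools import product

# Assumed example: a two-pair two-rail checker with outputs (z0, z1).
def two_rail_checker(a0, a1, b0, b1):
    z0 = (a0 & b0) | (a1 & b1)
    z1 = (a0 & b1) | (a1 & b0)
    return (z0, z1)

def is_input_code_word(a0, a1, b0, b1):
    # Two-rail input code: each pair must carry complementary values.
    return a0 != a1 and b0 != b1

OUTPUT_CODE = {(0, 1), (1, 0)}  # code outputs; (0,0) and (1,1) are noncode

def is_code_disjoint():
    """True if code inputs map to code outputs and noncode to noncode."""
    for word in product((0, 1), repeat=4):
        out_is_code = two_rail_checker(*word) in OUTPUT_CODE
        if out_is_code != is_input_code_word(*word):
            return False
    return True
```

Checking the self-testing property would additionally require a fault model, for example injecting each single s/z fault and confirming that some code input produces a noncode output.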
All these circuits aim at a so-called totally self-checking goal: the first erroneous output resulting from an internal fault of the circuit is detectable.

A general scheme of the totally self-checking system (Figure 1) consists of two blocks: a functional circuit that executes some useful function, implemented as totally self-checking; and a checker that monitors the correctness of the output generated by the functional circuit, implemented as self-testing.

Parity codes. The parity code is the least redundant EDC and requires only one check bit indicating the parity of the information part. We can immediately apply it to any bit-sliced circuit, since a fault in a single bit slice can cause only a single-bit, and hence detectable, error. A later section discusses several parity prediction schemes using shared circuitry.
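A minimal sketch of the parity code follows, assuming even parity (the choice of even rather than odd parity and the function names are assumptions made for illustration). It shows that any single-bit error is detected, while a double-bit error is not.

```python
def parity_encode(data):
    """Append one check bit so that the whole word has even parity."""
    return list(data) + [sum(data) % 2]

def parity_check(word):
    """True if the word is a code word of the even-parity code."""
    return sum(word) % 2 == 0

def flip(word, *positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out
```

For example, `parity_check(flip(parity_encode([1, 0, 1, 1]), 2))` is false, whereas flipping two bits yields a word that passes the check.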
considered the testability of parity trees, which compose a self-testing checker for parity code. Gorozhin and Krainov11 proved that any subset of |C| = 2^(n-2) + 1 parity code words fully tests any n-input parity checker. Aksenova12 derived the necessary and sufficient conditions for self-testing a parity checker, depending on n input lines and N actually used input combinations, for various pairs of n and N.
Sapozhnikov13 proved that just n + 2 tests fully check an n-input parity tree for all multiple faults (including primary input line faults).
Unordered codes. If both 1 and 0 errors can occur in received words, but in any particular word all errors are of one type, we call them unidirectional. Unidirectional errors occur in various LSI/VLSI devices, including highly structured circuits such as ROMs and programmable logic arrays (PLAs). Unordered codes have no code word covering some other code word and thus can detect unidirectional errors of any multiplicity. For instance, four-tuples (0011) and (1010) are unordered, whereas (0011) is ordered with respect to (0111), or we say that (0111) covers (0011).
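The covering relation is easy to state in code. The following minimal sketch (function names are illustrative assumptions) tests whether one word covers another and whether a set of words forms an unordered code.

```python
def covers(x, y):
    """x covers y if x has a 1 in every position where y has a 1, and x != y."""
    return tuple(x) != tuple(y) and all(a >= b for a, b in zip(x, y))

def is_unordered(code):
    """True if no code word covers another code word."""
    return not any(covers(x, y) for x in code for y in code)
```

Because a unidirectional error turns a word into one that covers it or is covered by it, an erroneous word can never be another code word of an unordered code.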
The optimal unordered codes are nonsystematic m-out-of-n (m/n) codes (some of which are also referred to as constant weight, balanced, or equilibrium), in which a code word always has exactly m 1s on n bits. Berger codes are the optimal systematic unordered codes in which a code word is formed as the concatenation of data part X of length I with the check part of length K = ⌈log2(I + 1)⌉. The check part is the bit-by-bit complement of the binary representation of the number of 1s in X.
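Both code families can be sketched in a few lines. The example below assumes the check part is written most significant bit first (a presentation choice, not something fixed by the text); the function names are likewise illustrative.

```python
from math import ceil, log2

def is_m_out_of_n(word, m):
    """True if the word has exactly m ones (an m/n code word, n = len(word))."""
    return sum(word) == m

def berger_encode(data):
    """Append the complemented binary count of 1s in data (MSB first, assumed)."""
    length = len(data)                      # I in the text
    k = ceil(log2(length + 1))              # K = ceil(log2(I + 1)) check bits
    ones = sum(data)
    count_bits = [(ones >> i) & 1 for i in reversed(range(k))]
    check = [1 - b for b in count_bits]     # bit-by-bit complement
    return list(data) + check
```

With four data bits, three check bits are needed; the data part (1011) has three 1s, binary 011, so the check part is 100.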
Virtually unknown is an important new unordered code with identifier that extends a Berger code with several extra code words. The extra code words are relatively easy to decode, as introduced in Varshavsky et al.10,14 See the adjacent box for the construction procedure.
A Berger code becomes a special case of the code with identifier only when the I + 1 largest (or smallest) consecutive check parts are used. The construction procedure permits a variety of codes that can either maximize the capacity of the code or minimize the com-

m-out-of-n codes. Several researchers15-20 proposed self-testing checkers for 1/n codes. Golan15 reported the first purely combinational self-testing checker for the 1/3 code, using a self-testing checker for a concatenated code built from the 1/3 code. Sapozhnikov and Sapozhnikov16 gave a formal proof that a combinational checker for the 1/3 code alone that is self-testing for any s/z fault cannot be designed. They also gave several sequential solutions for a self-testing checker for the 1/3 code.
Others proposed self-testing checkers for 1/n codes with n > 3.17-20 Maznev17 offered a simple construction of a self-testing checker for the 1/n code.
Construction involves only a few modules: a number of smaller self-testing checkers for some 1/p_i codes, p_i < n. Figure 2 shows a self-testing checker for the 1/10 code that is entirely built using one module, the self-testing checker for the 1/4 code (see Figure 3a).
Kotocova18 designed minimal-level, that is, three-level, self-testing checkers with a total number of gate inputs

For all other n, Piestrak20 reports quite a different construction of the self-testing checkers for 1/n codes. This method uses a self-testing/code-disjoint translator of a 1/n code into some 2/n' code with 1

He recommends a new self-testing checker for the 2/n codes. It minimizes the total number of gate inputs, which is proportional only to 2n (the best known lower bound). Piestrak also contributed the least complex self-testing checkers for the 2/n codes with n ≥ 5. These are built of a number of stages of 2/n_(i-1) to 2/n_i translators, n_(i-1) > n_i, which are tested by only n code words that are cyclic shifts of two adjacent 1s. Figure 3c shows the self-testing checker for the 2/6 code. We can easily obtain the 2/5 code checker by setting one arbitrary input line to 0 and simplifying the circuit from Figure 3c.
Researchers21-25 also proposed a number of self-testing checkers for other m/n codes. Vizirev21 proposed easily testable self-testing checkers for any m/n code (m > 1). These circuits need only n tests to detect a single s/z fault, although they are also rather slow (about n gate levels). Many self-testing checkers for m/n codes are both faster and less complex than those from Vizirev. The checkers from Piestrak are currently the least complex self-testing checkers for the m/2m codes. These checkers use the multioutput threshold circuits implemented as sorting networks and are tested by only 3m/2 (2m) tests for m even (odd).
Piestrak25 also reported high-speed, three- or four-level self-testing checkers with minimum complexity for the m/2m and m/(2m ± 1) codes. Those for the 3/7, 4/8, and 4/9 codes not only have three levels but also use the smallest number of gates of all known designs. His22 earlier self-testing checkers for other m/n codes with m ≥ 3 and n ≤ 4m perform slightly better than those from Sapozhnikov and Sapozhnikov. However, they present the least complex self-testing checkers for m/n codes with m ≥ 3 and n > 4m. See Table 2.

Table 2. Parameters (corrected) of various self-testing checkers for the most frequently used m/n codes. Columns: Code, Gates, Gate inputs, Levels, Tests, Method.
Other unidirectional EDCs. Piestrak gave the encoders and self-testing checkers for various separable all-unidirectional (Berger), t-unidirectional (capable of detecting up to t unidirectional errors), and burst-unidirectional EDCs. Since a self-testing checker for most separable codes can be built using an encoder followed by a bank of inverters and a comparator, I concentrate here on designing encoders only.
Encoders for separable codes come in two versions. One uses parallel counters of 1s built of full-adders, half-adders, and XOR gates (see Figure 4). It applies to self-testing checkers for certain codes only. The other version uses the multioutput threshold circuit, which is applicable for any unidirectional EDC. The latter design solves an open problem of a self-testing checker for some Bose-Lin and all Jha-Vora t-unidirectional EDCs.26 Overall, designers can use the multioutput threshold circuit realized as a sorting network (as suggested by Piestrak). Such a network can be used to build circuitry supporting any error-detecting (or error-correcting) code in which the check part is derived from the weight of the remaining part of a code word.
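To illustrate how a sorting network acts as a multioutput threshold circuit, here is a minimal sketch. It assumes an odd-even transposition network, chosen only for brevity; the networks in the cited designs may differ, and all names are illustrative. On single bits, a compare-exchange element is just an OR gate (maximum) and an AND gate (minimum).

```python
def compare_exchange(a, b):
    """Two-input sorter on bits: OR yields the larger value, AND the smaller."""
    return a | b, a & b

def sort_bits(bits):
    """Odd-even transposition sorting network; all 1s end up at the front."""
    w = list(bits)
    n = len(w)
    for stage in range(n):                  # n stages suffice for n inputs
        for i in range(stage % 2, n - 1, 2):
            w[i], w[i + 1] = compare_exchange(w[i], w[i + 1])
    return w

def threshold_outputs(bits):
    """Output T_k (index k-1) is 1 iff at least k of the input bits are 1."""
    return sort_bits(bits)
```

The weight of the input word, from which a check part can be derived, is then the position of the last 1 in the sorted output.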
PLAs and unordered codes. Self-testing checkers implemented using PLAs require special consideration as they must be made self-testing for s/z, adjacent-line, and crosspoint faults. First, a PLA must be inverter-free with all products unordered to be crosspoint irredundant. Then an input code must be both unordered and closed. Also, a PLA must be inverter-free to allow design of a totally self-checking/code-disjoint PLA. Piestrak30 also formulated the general rules of designing PLA self-testing checkers for unordered codes. He showed that a multistage PLA self-testing checker can be designed even for highly irregular, unordered, heavily incomplete codes, provided that all subfields of an input code contain closed unordered subcodes. Derbunovich and Neshveev28 suggested a promising general approach to designing one-stage PLA self-testing checkers for m/n codes. They based their approach on the partitioning of the set of all m/n code words into about C(n, m)/n cyclic subsets of the m/n code, where C(n, m) is the binomial coefficient. This allows us to consider constructing two disjoint subsets of an m/n code with complexity reduced by a factor of n.
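The partitioning into cyclic subsets can be sketched as follows. This is only an illustration of the counting argument (names are assumptions), not the checker construction itself: each subset collects all cyclic shifts of one m/n code word.

```python
from itertools import combinations

def code_words(m, n):
    """Generate all m-out-of-n code words as tuples of bits."""
    for ones in combinations(range(n), m):
        yield tuple(1 if i in ones else 0 for i in range(n))

def cyclic_classes(m, n):
    """Partition the m/n code into classes closed under cyclic shifts."""
    seen, classes = set(), []
    for w in code_words(m, n):
        if w in seen:
            continue
        orbit = {w[i:] + w[:i] for i in range(n)}
        seen |= orbit
        classes.append(sorted(orbit))
    return classes
```

For the 3/7 code, the 35 code words fall into exactly 35/7 = 5 classes of seven words each. When some word is invariant under a shorter shift, as (1010) is in the 2/4 code, a class is smaller than n, which is why the count is only approximately C(n, m)/n.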
We can realize two-stage PLA self-testing checkers using Piestrak's design methods for any m/n code with m ≥ 3 and 2m < n ≤ 4m, and for the 2/6, 2/7, and 2/8 codes only. All use translation into the 2/4 code. This is achieved by assuming that they are realized in four levels, and all products of the first stage are complete (composed of exactly m input variables) and properly arranged (see Figure 5). Designers should follow Sapozhnikov and Sapozhnikov to realize PLA self-testing checkers for other m/n codes.
Combinational circuits
As early as 1970, Sogomonyan31 formulated the general approach to designing self-checking combinational circuits with parity prediction. Sogomonyan and Slabakov7 later proposed several refined solutions and summarized extensions.
Given a circuit's logic structure, we can construct a generalized circuit graph to analyze the dependency between a signal on any internal line and the output lines for any input combination. Suppose an s/z fault on a particular line causes a multiple-output error of even parity (which is undetected by a single parity bit) for some input combination. Then, we can modify the fan-out in such a way that any possible error becomes detectable, but at the cost of introducing extra logic. However, the limitations of this approach are exclusion of primary input faults and exponential computational complexity. At present, it is not clear whether this approach offers any advantages over a conceptually simple, bit-sliced realization with parity.
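The effect of shared fan-out on parity checking can be seen in a toy example. The sketch below assumes a hypothetical three-input, two-output circuit in which one AND gate g feeds both outputs; the circuit, the fault model (g stuck at 0 or 1), and all names are invented for illustration and are not taken from the cited work. With shared fan-out, a fault on g flips both outputs, so the output parity is unchanged. Duplicating the gate confines the fault to one output.

```python
from itertools import product

def outputs(a, b, c, stuck_g=None, shared=True):
    """Toy circuit: y0 = g XOR c, y1 = g XOR a, where g = a AND b."""
    g_first = a & b                 # gate copy feeding y0 (the fault site)
    g_second = a & b                # gate copy feeding y1
    if stuck_g is not None:
        g_first = stuck_g
        if shared:                  # one physical gate drives both outputs
            g_second = stuck_g
    return g_first ^ c, g_second ^ a

def predicted_parity(a, b, c):
    """Fault-free parity of (y0, y1): (g^c) ^ (g^a) = a ^ c."""
    return a ^ c

def undetected_errors(shared):
    """List (input, stuck value) pairs giving a wrong output with correct parity."""
    missed = []
    for a, b, c in product((0, 1), repeat=3):
        good = outputs(a, b, c)
        for z in (0, 1):
            y0, y1 = outputs(a, b, c, stuck_g=z, shared=shared)
            if (y0, y1) != good and (y0 ^ y1) == predicted_parity(a, b, c):
                missed.append(((a, b, c), z))
    return missed
```

In this toy circuit, every erroneous output with shared fan-out escapes the parity check, whereas none does once the gate is duplicated, at the cost of one extra gate.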
Another algebraic approach starts with the sum-of-products logic functions and applies to minimal-level (as well as multilevel) implementations. Others proposed totally self-checking/code-disjoint, bit-sliced, two-level AND-OR combinational circuits using complete m/n codes. These designs realize self-checking synchronous and asynchronous sequential machines.34,35 A more general approach allows design of totally self-checking/code-disjoint combinational circuits using m/n codes and shared logic.32 Piestrak extended it to a self-testing and/or code-disjoint NOT-AND-OR combinational circuit using any input EDC. If a basic circuit is not self-testing for some faults, he also suggested a modification method that introduces one (AND) or two (AND-OR) extra gate levels. The usefulness of this approach was demonstrated on the new self-testing checkers for all arithmetic codes with check base A = 3. Finally, Piestrak presented the formal conditions for designing totally self-checking/code-disjoint PLAs using any unordered input code (directly applicable to a two-level AND-OR gate circuit as well).
The circuits in the Reed-Muller (Zegalkin) polynomial form (built using AND and XOR gates only) have long been recognized to be easily testable. However, Gorozhin and Krainov11 is the only known work formulating the necessary and sufficient conditions for designing totally self-checking versions of these circuits using parity. (See Figure 6.)
Synchronous sequential machines
For these machines, the main problem is the need to check whether a sequence of outputs corresponds to a given sequence of inputs. Since a fault may affect the state transition function as well as the output function, both generally need to be correct. One notable common restriction in all known self-checking synchronous state machines (SSMs) is that faults on the clock lines do not occur. Most designs considered here also exclude the occurrence of faults on primary input lines. Several researchers proposed and systematically presented self-checking SSM design methods that can be divided into three groups.
The first group of methods generalizes the built-in parity circuit approach for combinational circuits from Sogomonyan31 to a state machine M specified by the logic

To find all changes of outputs for a given fault set F, we would simulate a faulty M for every f in F. On this basis, we can construct a table of faulty functions that contains all incorrect output combinations produced after the occurrence of an inter-

The second group of methods commences with a state machine given by a flow table. It considers using either parity or m/n codes for internal and output state assignment. The resulting circuit is totally self-checking. These designs rely on using self-testing checkers that continuously monitor whether the input, internal, and/or output state represents a code word. After a fault has occurred in a state machine, the checkers ultimately produce the error signal at the occurrence of the first error (Figure 7a).
Essentially, the main problem is to guarantee that combinational input (excitation) and output circuits are totally self-checking and that the checkers are self-testing even for incomplete encodings. The ultimate goal is the minimum hardware implementation of all extra circuits, so that the alternative scheme (functionally equivalent) in Figure 7b is also worthy of consideration. Three reports34,37,38 proposed totally self-checking SSMs using m/n encodings of internal and output states, using JK and D flip-flops as memory elements. Gorozhin reports totally self-checking SSMs that use D flip-flops, parity, and totally self-checking combinational circuits implemented using Reed-Muller polynomial form.
Finally, the most general approaches are algebraic ones considered in three reports40-42 and extended and systematically presented in Shcherbakov and Podkopaev. Here, self-checking is achieved at the specification level rather than at the logic design level. This group of methods relies on checking the algebraic relations between output signals of basic state machine M and checking state machine Mk. Here, Mk evaluates the

Figure 8 shows the general scheme of a self-checking state machine that uses checking state machine Mk, which is homomorphic to basic state machine M. Circuit H transforms the outputs produced by M so that they can be com-

The design goal is to find a reduced state machine Mk and a circuit H that ensure detection of internal faults of a specified class in the system from Figure 8. We attempt to have |Qk| << |Q|, so that most extra hardware expenses would go into H. Here, Q (Qk) is the set of internal states in M (Mk). We assume that simultaneous faults in M and Mk do not occur, and in most designs the error signal occurs at the first output error. We can design a number of Mk versions under various constraints imposed on the

comprehensive theory of designing self-checking SSMs based on algebraic automata theory and pair algebra.
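The idea of a homomorphic checking machine can be illustrated with a deliberately small example. The sketch below assumes M is a modulo-6 counter whose output is its state, Mk is a modulo-3 counter, and H reduces the output of M modulo 3; these machines, the injected fault, and all names are hypothetical and serve only to show the comparison of H(output of M) with the output of Mk.

```python
def step_M(q, x, fault=False):
    """Basic machine M: modulo-6 counter (6 states); x = 1 increments."""
    nxt = (q + x) % 6
    # Hypothetical internal fault: the transition lands one state too far.
    return (nxt + 1) % 6 if fault else nxt

def step_Mk(qk, x):
    """Checking machine Mk: modulo-3 counter (3 states), a homomorphic image of M."""
    return (qk + x) % 3

def H(z):
    """Maps the output of M onto the output alphabet of Mk."""
    return z % 3

def first_error_cycle(inputs, fault_at=None):
    """Return the first cycle at which H(output of M) != output of Mk, else None."""
    q = qk = 0
    for t, x in enumerate(inputs):
        q = step_M(q, x, fault=(t == fault_at))
        qk = step_Mk(qk, x)
        if H(q) != qk:
            return t
    return None
```

This choice of Mk detects any state error that changes the residue modulo 3 but would miss an error of exactly three states, which is one way to see why the class of detectable faults must be specified together with Mk and H.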
Linear sequential machines
We call a state machine linear if its structure is entirely composed of XOR gates and delay elements (D flip-flops). Linear state machines include numerous circuits such as part of the encoding/decoding circuitry of many error-correcting codes, counters, address generators, signature analyzers, pseudorandom sequence generators, and many others. We conveniently describe a linear state machine by a set of general matrix equations:
Here, A, B, C, and D are constant matrices of appropriate sizes, and we carry out the summation and multiplication over the Galois field GF(2).
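As a minimal sketch, assume the usual state-space form Q(t+1) = A·Q(t) ⊕ B·X(t) and Z(t) = C·Q(t) ⊕ D·X(t), with state vector Q, output vector Z, and input vector X. The symbol X, this particular form, and the example matrices below are assumptions made for illustration. Over GF(2), addition is XOR and multiplication is AND.

```python
def mat_vec_gf2(M, v):
    """Matrix-vector product over GF(2)."""
    return [sum(m & x for m, x in zip(row, v)) % 2 for row in M]

def vec_add_gf2(u, v):
    """Vector addition over GF(2), that is, bitwise XOR."""
    return [a ^ b for a, b in zip(u, v)]

def step(A, B, C, D, q, x):
    """One clock cycle: return (next state, current output)."""
    z = vec_add_gf2(mat_vec_gf2(C, q), mat_vec_gf2(D, x))
    q_next = vec_add_gf2(mat_vec_gf2(A, q), mat_vec_gf2(B, x))
    return q_next, z

def run(A, B, C, D, q0, inputs):
    """Apply a sequence of input vectors; return (final state, list of outputs)."""
    q, outs = list(q0), []
    for x in inputs:
        q, z = step(A, B, C, D, q, x)
        outs.append(z)
    return q, outs

# Assumed example: a 3-bit feedback shift register with one serial input.
A = [[0, 0, 1], [1, 0, 1], [0, 1, 0]]
B = [[1], [0], [0]]
C = [[0, 0, 1]]
D = [[0]]
```

Starting from the all-zero state, the response to the XOR of two input sequences equals the XOR of the individual responses, which is the linearity property that checking machines for such circuits exploit.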
The problem is to design a checking linear state machine Mk that satisfies the following conditions:
1. It is also a linear state machine.
2. Only a subset of Q is accessible to Mk (in most cases, there is no access to Q).
3. E = M Z ⊕ z_k, where M is a row vector of constant coefficients, and z_k is the output variable produced by Mk.
4. Any single error in Z or z_k at some cycle leads to E ≠ 0 at the same cycle.

Its order k, which is defined as the number of memory (delay) elements used, characterizes the complexity of Mk.
Several researchers have extensively studied the design of self-checking linear state machines. Although some of these design methods have been explicitly formulated in terms of functional diagnosis of linear dynamic systems only, they also readily apply to the design of self-checking linear state machines (which are a special case).
Mironovski produced principal general results that include the necessary and sufficient conditions for the existence of Mk of prescribed order, and rules of determining k_min of Mk for a given M.

Figure 9b gives the checking linear state machine Mk for M, designed according to Mironovski. Samofalov et al.49 proposed a signature analyzer protected by parity, which is totally self-checking for all s/z faults.
In summary, evaluation of the efficiency of these methods is difficult as no comparison figures appear in the literature. Further extensive study is required before estimates of these methods can be given, since specific details concerning the actual implementation of sample self-checking state machines are not available.
Asynchronous circuits
We can loosely define asynchronous circuits (ACs) as circuits with no global clock. Depending on the delay model, we can distinguish several classes of asynchronous circuits, but concentrate here on two of them.
In Huffman's fundamental-mode circuit, the delay in all circuit elements and wires is unknown but bounded. The design requires special consideration of multiple input changes, race avoidance, and hazard removal. Also necessary is a guarantee that a circuit reaches a stable state before an environment applies the next input transition. The design becomes particularly difficult when we take into account the behavior of a faulty circuit.
In Muller's speed-independent circuit, gate delays are unbounded but finite, and wires have no delay. This means that any delay is concentrated at a gate's output line.
Although these two asynchronous design methodologies have been known for about 40 years, only recently have they received more attention.
SPRING 1996
We can attribute this resurgence to their well-known potential benefits: the possibility of avoiding clock problems, which are common in synchronous circuits (including the behavior of a circuit with faulty clock lines); lower power consumption; and average rather than worst-case performance.
An important attraction of speed-independent circuits is that they are inherently self-checking, since many faults halt the circuit's operation, and a time-out circuit can detect this halt.
Researchers studied fundamental-mode, totally self-checking asynchronous sequential machines (ASMs) extensively for many years.50,51 Several design methods of totally self-checking ASMs use m/n codes for encoding internal and output states that are monitored by self-testing checkers for m/n codes. (Sapozhnikov and Sapozhnikov's monograph6 offers a more complete bibliography of contributions on fail-safe and totally self-checking ASMs.) These are essentially extensions of the design approaches for SSMs, which add extra limitations on internal state assignment to prevent hazards and races.
One important novelty of these approaches is that, when necessary, we can introduce some extra internal states to a minimized flow table. This guarantees that a sufficiently large number of m/n code words occur in normal operation, which makes the asynchronous state machine and the monitoring self-testing checkers self-testing for all faults.
Danilov and Zhirabok50 report a design that follows the scheme from Figure 8. There, Mk is a factor asynchronous state machine constructed on the basis of asynchronous state machine M by using partitions with substitution property.
The other general approach to designing self-checking and fault-tolerant asynchronous state machines is based on the concept of a completely separating system. Since the mid-1960s, Sagalovich4,51 considered various constructions of redundant, completely separating systems that are suited to designing asynchronous state machines. These machines are race-free even in the presence of internal faults; one is constructed on the basis of Reed-Solomon codes.

For many years, Varshavsky et al.10,14,52,53 studied various properties of totally self-checking, speed-independent circuits (also called aperiodic). They52 recognized that a permanent s/z fault at the output of a gate is equivalent to unbounded delay occurring in this gate. This effectively stops a circuit's operation, and lets it be detected by a time-out circuit. They showed that a combinational circuit is totally self-checking for any s/z fault on a gate's output if each allowed change of an input signal is indicated by an allowed change of an output signal. However, the problem of designing speed-independent combinational circuits to be totally self-checking for faults occurring on lines beyond the fan-out point remains open.
Note the importance of unordered codes in the design of speed-independent circuits. They are important even though in these applications they have been called self-synchronized or delay-insensitive codes (names that more directly reflect their most important property for speed-independent circuits).
Varshavsky et al. showed the suitability of any unordered code for designing an indicatable speed-independent combinational circuit that follows the two-phase (idle/working) operation discipline using two spacers (00 ... 0) and (11 ... 1). (Recall that Varshavsky et al.14 first proposed a new unordered code called a code with identifier, unknown to designers of self-checking circuits.) They extended the totally self-checking properties of speed-independent, indicatable combinational circuits to Moore state machines as well. They also emphasized the other advantages of totally self-checking speed-independent circuits over their synchronous counterparts. These are freedom from the problem of faulty clock lines as well as from the critical races in the (01) → (10) (or (10) → (01)) transitions at the checker's outputs. Varshavsky et al.10,53 further considered totally self-checking, speed-independent autonomous circuits (those to which no external inputs are applied). They reported that the semimodularity property of a circuit is essential for total self-checking, and that fault detection manifests itself by entering some deadlock state.
MANY OF THESE self-checking digital circuit designs represent advanced state-of-the-art designs, which, for various reasons, have not yet found (or have found, but only very recently) recognition in the Western world. Designers of self-checking circuits in many countries will readily benefit from this survey, since most of the works referenced here are available in English. Especially interesting are many contributions on designing self-checking sequential synchronous, linear, and asynchronous machines. These subjects have been virtually absent in most Western monographs on fault-tolerant hardware design.
