Introduction
Future electronics technologies based on nanoscale transistors or printed organic materials will be more vulnerable than present-day technologies to process variation and to single event upset in harsh environments. As more and more Commercial Off-the-Shelf (COTS) components are integrated into state-of-the-art high-value and mission-critical systems, there is also growing concern over the economy of manufacture (i.e. yield) [1] and the in-service robustness of such systems. This is prompting new studies into fault tolerant designs [2] at the extremely fine-grained level that are based in part on the pioneering work of Von Neumann on the theory of massively redundant design [3]. This represents a distinct departure from design strategies that are prevalent today, such as component screening and high-level modular redundancy, and is therefore a new design paradigm for future electronics.
Faults that occur in electronics are best categorised into a number of fault domains including: process variation, ageing, current thermal stress, soft errors, hard errors and extreme environmental events that are unforeseen [4]. Within each of these domains, there are many root causes that elicit faults.
In this paper we present an experimental fault testing platform for verifying novel fault-tolerant design strategies in electronics with both analogue and digital diagnosis. The particular design strategy evaluated in this paper is relevant to low voltage CMOS circuits within ASICs and in power circuits that use discrete transistors such as IGBTs. Here we adopt an abstract model of switching elements based on field-effect transistors (FET) of some kind, and assume that the resource requirements depend upon two factors: the number of switching elements and the number of interconnects. Specific fault conditions are then injected into the hardware and the response recorded.
Besides the goal of achieving fault tolerance by fine-grained redundancy, a further opportunity exists to consider fault detection and reporting at the same time. Most fine-grained methods are based on the fault masking property, i.e. fault events are made non-critical so that they cannot cause an error in the output. The circuit is therefore able to continue to function correctly in the presence of the fault. An example of this is the quadded logic strategy [5], where gate and interconnect redundancy guarantee error-free operation for any single fault event, as illustrated in Figure 1a. Like most masking strategies, this approach is limited to single faults such as those caused by single event upset (SEU), though in some cases certain combinations of two simultaneous faults can also be tolerated. When an SEU occurs it causes a momentary charge event within one or more switching elements (i.e., charge is generated in the semiconductor gate of the FET) that may cause a stuck-on or stuck-off fault. Although the fault mechanism is short-lived, it may cause a persistent effect in the circuit that remains until power-off. Hence SEU-induced faults may remain in the circuit for some time, and may potentially accumulate in several locations, as is often observed in the SRAM of FPGA chips [6]. If the circuit is configurable (as is the case with FPGA chips) then scrubbing is often combined with masking strategies in order to maximise resilience to SEU. Other fault tolerant strategies apply the concept of triple modular redundancy (TMR) to fine-grained logic in electronics. This has become commonplace in FPGAs for mission-critical situations such as space exploration, where existing designs are augmented by TMR structures using HDL extensions.
Several strategies have been developed for FPGA chips that involve various forms of fine-grained modular redundancy, including the capacity for online reconfiguration in response to permanent fault conditions [7, 8]. However, they are highly confined to the available FPGA resources and architecture, while reconfiguration requires a great deal of processing capability or else pre-storage of multiple alternative configurations. The TMR approach has also been applied in multi-processor core design [9]. Despite the presence of built-in checking via the TMR voting logic, this is not generally used for reporting at higher system levels.
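The TMR voting principle mentioned above can be sketched in a few lines of Python. This is an illustrative behavioural model only, not code from any of the cited works; the function names `majority` and `vote_with_flag` are hypothetical. The disagreement flag shows the kind of information that a voter already holds and that could, in principle, be reported to higher system levels.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)


def vote_with_flag(a: int, b: int, c: int):
    """Return the voted value and a flag that is True when the modules disagree.

    The flag is an illustrative addition (an assumption of this sketch): it
    exposes the fault event instead of silently masking it.
    """
    voted = majority(a, b, c)
    disagreement = not (a == b == c)
    return voted, disagreement


if __name__ == "__main__":
    # One of the three copies has a flipped bit; the vote masks it.
    value, flag = vote_with_flag(0b1010, 0b1010, 0b0010)
    print(bin(value), flag)
```

The vote masks any single faulty copy, while the flag records that a fault event took place.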
Error detection and correction (EDC) is also used for self-checking by using information redundancy [10]. This approach is most effective for protecting data registers and memory blocks. Early computer systems used combined EDC and modular redundancy [11] but EDC is far more common in state-of-the-art computers. Again, however, EDC is usually performed internally with little or no record of fault events. Custom logic, such as the arithmetic logic unit (ALU), is more difficult to modify for fault tolerance since it must be designed for speed and minimal component layout.
At the power electronics scale, discrete IGBTs used in renewable energy power conversion systems are becoming a source of concern. Although their intrinsic reliability is high, modern energy conversion sub-systems may rely upon a large number of such devices, increasing the likelihood of failures occurring. This causes serious disruption for critical energy systems such as offshore wind farms, where maintenance and repair are difficult and expensive. Mitigation strategies then become a choice between further improving component reliability and introducing efficient fault tolerance with the smallest possible overhead.
As a result of the above context, our fault tolerant design aims to combine the following properties within the confines of custom logic gate design:
Scalability: the redundancy structure is applicable to low and high level electronic design.
Strategic masking: only stuck-off events are masked.
Strategic fault detection: stuck-on events trigger a built-in fault detection mechanism.
Low overhead: the redundancy overhead is minimised.
Resource overhead represents a balance of cost/benefit where, in this case, the benefits gained are fault selectivity and detection. The goal of this work is therefore to design and test a hardware prototype that demonstrates the above features of the fault tolerant design.
Proposed design
The basis of this design is the NAND logic gate, which is central to electronic ALUs and many other circuits. The reference NAND gate is illustrated in Figure 2a as a complementary metal-oxide-semiconductor (CMOS) design. This gate uses two p-type FETs and two n-type FETs. The equivalent fault tolerant gate design is illustrated in Figure 2b, where additional FETs and interconnects have been added. The interconnect topology shown is one of several variations under investigation [12]. The design is intended to provide masking of stuck-off faults and detection of stuck-high faults. For simplicity it is assumed that each FET is fabricated separately, i.e. that there is no overlapping of resources in the layout. For CMOS circuits this may not be the case because combined structures are commonly used. However, the degree of resource overlap depends on the technology library used, hence we adopt a simple model that can later be refined using a specific technology library. The predicted fault rate of the NAND gate design is shown in Figure 3 alongside other variations. The basic NAND gate is shown leftmost on the horizontal axis and each redundant design "NAND+n" contains n redundant FETs. The full quad-transistor (QT) design (NAND+12) achieves complete fault tolerance against any single stuck-on or stuck-off fault [13], but with a 4x resource overhead requirement in comparison to the non-redundant design (NAND). The design presented here is represented by the NAND+4 variation with a fault rate of 12.5%. The concept of augmenting resilience with fault event intelligence has been suggested by others. For example, a radiation sensor is incorporated into an FPGA-based subsystem in [14] to create environmental awareness that enhances the basic TMR/scrubbing procedures. By comparison, our strategy is implemented using an analogue trigger flag referred to as IDDQ.
This is an attractive approach because detection is confined to the electronic circuit domain rather than relying upon external sensory input. Furthermore, active responses triggered by IDDQ events are controlled by local circuitry and hence become extremely rapid and potentially autonomous.
The concept is illustrated in Figure 4, where a stuck-on fault has been asserted at FET T7. In this case IDDQ current flows when the inputs are set to '01'. When IDDQ current flows the output logic state is difficult to predict as it depends upon the analogue conditions of drain-source impedance of each FET. Therefore, the output is not considered trustworthy and should be ignored. However, the occurrence of IDDQ can be used to identify the stuck-on fault condition.
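The IDDQ mechanism can be illustrated with a minimal switch-level sketch in Python. Note the assumptions: the model below is of the non-redundant reference CMOS NAND gate (two parallel p-type FETs, two series n-type FETs), not of the NAND+4 topology of Figure 2b, whose interconnect is not reproduced here. The labels `P1`, `P2`, `N1`, `N2` and the functions `conducts`, `evaluate` and `iddq_patterns` are hypothetical names chosen for illustration, and each FET is treated as an ideal switch.

```python
from itertools import product

# Hypothetical labels for the four FETs of the reference (non-redundant) NAND.
# P1/N1 are driven by input a, P2/N2 by input b.


def conducts(fet, a, b, fault=None):
    """Ideal-switch model. fault is None or a tuple (fet_name, 'on' | 'off')."""
    if fault is not None and fault[0] == fet:
        return fault[1] == "on"          # a stuck FET ignores its gate signal
    gate = a if fet in ("P1", "N1") else b
    return gate == 0 if fet.startswith("P") else gate == 1


def evaluate(a, b, fault=None):
    """Return (output, iddq). output is None when the level is ill-defined."""
    pull_up = conducts("P1", a, b, fault) or conducts("P2", a, b, fault)      # parallel pFETs
    pull_down = conducts("N1", a, b, fault) and conducts("N2", a, b, fault)   # series nFETs
    iddq = pull_up and pull_down          # rail-to-rail conduction path exists
    if iddq or not (pull_up or pull_down):
        return None, iddq                 # contested or floating output
    return (1 if pull_up else 0), iddq


def iddq_patterns(fault):
    """Input patterns for which the given fault creates a rail-to-rail path."""
    return [(a, b) for a, b in product((0, 1), repeat=2) if evaluate(a, b, fault)[1]]


if __name__ == "__main__":
    print(iddq_patterns(("P1", "on")))
    print(iddq_patterns(("N1", "on")))
    print(evaluate(0, 1, ("P1", "off")))
```

In this simplified model a stuck-on fault produces a rail-to-rail path for only some input patterns, which is why a digital input sweep is needed to expose it. The last call shows that the non-redundant gate does not mask a stuck-off fault: the output is left floating.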
Test strategy
To confirm the properties of the fault tolerant design we implemented the NAND gate using discrete FETs mounted on a test PCB. Fault-tolerant designs are typically evaluated using either software simulation or FPGA boards in order to predict their usefulness. A discussion of the different test approaches can be found in [15]. Examples of model-based approaches are seen in [1, 2, 5], wherein behavioural predictions of the fault response are formed. Alternatively, hardware fault injection within FPGA boards has been carried out in the form of random bit-flips injected into the configuration bitstream. Examples of this are seen in [16] for evaluation of FPGA fault-tolerant design techniques. The faults injected are emulated, that is, artificially inserted into the active hardware by a separate hardware controller. Some modern FPGA chips benefit from a built-in fault injection interface [17]. Finally, accelerated radiation testing constitutes the ultimate form of fault testing, whereby faults are induced by the actual physics-of-failure mechanism involving high energy particles interacting with semiconductor materials.
For the purposes of this study we have chosen to adopt fault emulation by hardware injection due to the high repeatability of the approach and lower cost in comparison to accelerated testing. For the fault tolerant design under consideration, hardware fault injection offers the possibility to monitor both analogue and digital domain behaviour. This is essential when observing the stuck-high behaviour for the case when rail-to-rail current flows. This condition is conventionally ignored in fault analysis of logic circuits (see for example [18]) because the logic output level becomes ill-defined. However, our fault detection strategy relies only upon the presence of (analogue) IDDQ current flow rather than determination of the specific (digital) logic level.
Detailed information about IDDQ is not available using FPGA chips due to their architecture, hence a custom circuit implementation was created. This also created the possibility of asserting different fault conditions for each FET. The FETs used are low-power MOSFETs (types IRFD020 and IRFD9024) rated at 1 Watt, hence the circuit could also be used to demonstrate a small-scale power application. The test PCB (Figure 5) is fitted with fault injection points that allow insertion of stuck-on / stuck-off conditions for any FET. In addition, there are fault injection points for gate signals that are not used in this experiment. The PCB is wired to a National Instruments PXI test system.

Figure 3 axis labels: NAND, NAND+1 to NAND+12 (horizontal); fault rate (%) (vertical).

Figure 4 Example of IDDQ triggered by stuck-on fault occurring at T7.

Fault injection is controlled by a fault insertion unit (FIU) type PXI-2510 capable of coordinating up to 64 fault channels across two fault buses. The basic topology is illustrated in Figure 6. To inject a fault, relays are opened/closed such that the relevant FET is connected in parallel with different fault loads. The test PCB includes connection points for up to four different fault loads for each fault bus. In this experiment, the fault loads comprise a short circuit track and a 500 kΩ resistor. Gate-level testing includes a digital response test and analogue IDDQ measurement. A high-speed digital I/O module was programmed to test the NAND gate response during each fault event. IDDQ was recorded using a PXI-DMM module connected in-line with the power rail.

Testing is coordinated using a LabVIEW test panel that allows users control of the test configuration and output data (Figure 7). In the example scan list, fault location "dut0" is connected to fault Bus A, then fault load "a0" (stuck-on fault) and load "a1" (stuck-open) are connected. The fault location is then returned to normal operation.
Many such scan list files are read and executed by the VI. Results are organised into matrices and saved in csv format for analysis.
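The overall test flow (iterate over fault locations and fault loads, apply every input pattern, record the digital output and IDDQ, save as CSV) can be sketched as follows. This is a hedged illustration in Python and not the LabVIEW VI used in the experiment: the function names, the CSV column names and the `placeholder_measure` stub are assumptions of this sketch. The stub stands in for the PXI hardware and returns no measured data.

```python
import csv
import io
from itertools import product

FIELDS = ["location", "load", "inputs", "output", "iddq_A"]  # hypothetical column names


def run_sweep(locations, loads, measure):
    """Apply all four input patterns for every (fault location, fault load) pair."""
    rows = []
    for loc, load in product(locations, loads):
        for a, b in product((0, 1), repeat=2):
            out, iddq_amps = measure(loc, load, a, b)
            rows.append({"location": loc, "load": load, "inputs": f"{a}{b}",
                         "output": out, "iddq_A": iddq_amps})
    return rows


def to_csv(rows):
    """Serialise the result rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def placeholder_measure(loc, load, a, b):
    """Stand-in for the hardware: ideal NAND output and zero current.

    These are NOT measured values; replace with real instrument access.
    """
    return int(not (a and b)), 0.0


if __name__ == "__main__":
    rows = run_sweep(["dut0"], ["a0", "a1"], placeholder_measure)
    print(to_csv(rows))
```

Passing the measurement routine in as a parameter keeps the sweep logic independent of the instrument drivers, so the same loop could be exercised against a simulation before being connected to hardware.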
Finally the experimental test system components are shown in Figure 8 . 
Results
After testing is complete the results are easily analysed using Excel macros. A key result is that all stuck-off faults were successfully masked by the NAND gate design. Stuck-on results are shown in Figure 9, where the circuit response has been recorded for a single stuck-on fault asserted at each FET T1…T8. The green indicators show where an IDDQ event has been detected and hence where a stuck-on fault can be detected. The data shows that an IDDQ event occurs once for every fault location and hence the stuck-on fault is always detectable. The fault rate is therefore 25% for stuck-high faults and 12.5% considering both stuck-high/low faults. IDDQ occurs for different input states depending upon the fault location, hence a digital response test is required to trigger IDDQ detection. Although this is a potentially time-intensive operation, there is the possibility of built-in fault localisation. For example, if IDDQ occurs for the input pattern '10' then a stuck-high fault must be located at either T5 or T6.
Another feature is that output errors only occur during IDDQ events. Hence the circuit output can be considered trustworthy except when IDDQ events occur: the gate's output could still be used in 75% of stuck-high fault conditions and discarded whenever an IDDQ event occurs.
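The quoted fault rates can be reproduced with a short calculation. The sketch below assumes that the fault rate is defined as the fraction of (fault, input pattern) combinations that produce an IDDQ event; this interpretation is an assumption of the sketch, chosen because it is consistent with the reported figures.

```python
N_FETS = 8        # FETs T1..T8 in the tested gate
N_INPUTS = 4      # input patterns 00, 01, 10, 11

# Reported results: one IDDQ event per stuck-on location, all stuck-off faults masked.
stuck_on_events = N_FETS * 1
stuck_off_events = N_FETS * 0

stuck_on_cases = N_FETS * N_INPUTS               # 32 (fault, input) combinations
all_cases = 2 * N_FETS * N_INPUTS                # 64 when stuck-off faults are included

stuck_on_rate = stuck_on_events / stuck_on_cases
combined_rate = (stuck_on_events + stuck_off_events) / all_cases

print(f"stuck-on fault rate: {stuck_on_rate:.1%}")
print(f"combined fault rate: {combined_rate:.1%}")
```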
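The rule of discarding the output during IDDQ events and using the input pattern for localisation can be sketched as below. The threshold value `IDDQ_THRESHOLD_A` and the function `interpret` are hypothetical, and the lookup table contains only the single entry stated in the text (pattern '10' implies T5 or T6); the remaining entries would have to be taken from the measured data.

```python
from typing import Optional, Set, Tuple

IDDQ_THRESHOLD_A = 1e-3   # hypothetical detection threshold in amperes (assumption)

# Partial localisation table: only the entry given in the text is included.
CANDIDATES = {"10": {"T5", "T6"}}


def interpret(inputs: str, output: int, iddq_amps: float) -> Tuple[Optional[int], Set[str]]:
    """Return (trusted output or None, candidate stuck-high FETs)."""
    if iddq_amps > IDDQ_THRESHOLD_A:
        # IDDQ event: the logic level is ill-defined, so discard the output
        # and report the candidate fault locations for this input pattern.
        return None, set(CANDIDATES.get(inputs, set()))
    return output, set()


if __name__ == "__main__":
    print(interpret("10", 1, 0.05))   # IDDQ event: output discarded
    print(interpret("11", 0, 0.0))    # no IDDQ: output trusted
```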
Conclusions
An experimental test bench has been used to demonstrate a novel fault tolerant design of a NAND gate. Three key features have been confirmed via fault injection and analogue measurement: 1) masking of all stuck-off faults; 2) an IDDQ event for all stuck-on faults; 3) fault localisation by combining the IDDQ event and the logic input pattern. Further tests have shown that the NAND gate also tolerates a limited number of double stuck-off faults. In some situations current flow between VDD and VSS would risk damage to the active FETs and should be avoided. In these cases the IDDQ event could be used to trigger an auto-switchover mechanism whereby a standby NAND gate takes the place of the faulty gate. An important benefit here would be that switch-over becomes selective and only occurs upon detection of a damaging stuck-high fault condition rather than stuck-off faults or dormant stuck-high faults.
The demonstrated test bench implementation is extensible to 64 fault locations, with further extension possible using a multiplexing switch unit. Complex fault patterns are easily programmed via scan list files. Hence the approach is capable of scaling to the evaluation of fault discrimination/masking in logic units composed of multiple NAND gates, such as basic arithmetic logic units. By generalising the fault injection patterns to include multiple simultaneous faults (along with IDDQ measurement), important fault-rate data may then be measured and accumulated for reliability calculations according to the procedure described in [2]. This will lead to a better understanding of the resource/performance trade-offs incurred in the design of fault tolerant electronics.
