FPGA-Based, Self-Checking, Fault-Tolerant Computers by Rennels, David & Some, Raphael
NASA Tech Briefs, August 2004
Electronics/Computers
A proposed computer architecture
would exploit the capabilities of com-
mercially available field-programmable
gate arrays (FPGAs) to enable computers
to detect and recover from bit errors.
The main purpose of the proposed ar-
chitecture is to enable fault-tolerant com-
puting in the presence of single-event
upsets (SEUs). [An SEU is a spurious bit
flip (also called a soft error) caused by a
single impact of ionizing radiation.] The
architecture would also enable recovery
from some soft errors caused by electri-
cal transients and, to some extent, from
intermittent and permanent (hard) er-
rors caused by aging of electronic com-
ponents.
A typical FPGA of the current genera-
tion contains one or more complete
processor cores, memories, and high-
speed serial input/output (I/O) chan-
nels, making it possible to shrink a
board-level processor node to a single
integrated-circuit chip. Custom, highly
efficient microcontrollers, general-pur-
pose computers, custom I/O processors,
and signal processors can be rapidly and
efficiently implemented by use of
FPGAs. Unfortunately, FPGAs are sus-
ceptible to SEUs. Prior efforts to miti-
gate the effects of SEUs have yielded so-
lutions that degrade performance of the
system and require support from exter-
nal hardware and software.
In comparison with other fault-toler-
ant-computing architectures (e.g., triple
modular redundancy), the proposed ar-
chitecture could be implemented with
less circuitry and lower power demand.
Moreover, the fault-tolerant computing
functions would require only minimal
support from circuitry outside the cen-
tral processing units (CPUs) of comput-
ers, would not require any software sup-
port, and would be largely transparent
to software and to other computer hard-
ware.
There would be two types of modules:
a self-checking processor module and a
memory system (see figure). The self-
checking processor module would be
implemented on a single FPGA and
would be capable of detecting its own
internal errors. It would contain two
CPUs executing identical programs in
lock step, with comparison of their outputs
to detect errors. It would also contain
various cache and local-memory circuits,
communication circuits, and
configurable special-purpose processors
that would use self-checking checkers.
(The basic principle of the self-checking
checker method is to utilize logic cir-
cuitry that generates error signals when-
ever there is an error in either the
checker or the circuit being checked.)
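
The lock-step comparison and the self-checking checker principle described above can be illustrated with a small software model. This is only a sketch under stated assumptions: it uses the classic two-rail checker cell as one example of a self-checking checker, and the article does not say which checker design the authors intended. The names `two_rail_cell`, `compare_outputs`, and `error_detected`, and the 8-bit output width, are hypothetical.

```python
from functools import reduce
from typing import Tuple

# A two-rail pair is "valid" (no error) only when its two bits differ.
Rail = Tuple[int, int]


def two_rail_cell(x: Rail, y: Rail) -> Rail:
    """Merge two two-rail pairs; the result is valid only if both inputs are."""
    x0, x1 = x
    y0, y1 = y
    f = (x0 & y0) | (x1 & y1)
    g = (x0 & y1) | (x1 & y0)
    return (f, g)


def compare_outputs(out_a: int, out_b: int, width: int = 8) -> Rail:
    """Compare the outputs of two lock-stepped CPUs (assumed `width` bits wide).

    Bit i of CPU A is paired with the complement of bit i of CPU B, so the
    pair is valid exactly when the two CPUs agree on that bit.
    """
    pairs = [((out_a >> i) & 1, ((out_b >> i) & 1) ^ 1) for i in range(width)]
    return reduce(two_rail_cell, pairs)


def error_detected(signal: Rail) -> bool:
    # (0, 1) and (1, 0) encode "no error"; (0, 0) and (1, 1) encode "error".
    return signal[0] == signal[1]


if __name__ == "__main__":
    print(error_detected(compare_outputs(0xA5, 0xA5)))  # CPUs agree
    print(error_detected(compare_outputs(0xA5, 0xA4)))  # one bit differs
```

Because "no error" is encoded as a pair of differing bits rather than as a single wire held at one level, an output rail that becomes stuck cannot keep reporting "no error" for every input, which is the property that lets such a checker expose some of its own faults.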
The memory system would comprise a
main memory and a hardware-con-
trolled check-pointing system (CPS)
based on a buffer memory denoted the
recovery cache. The main memory
would contain random-access memory
(RAM) chips and FPGAs that would, in
addition to their other functions, implement
single-error-correcting, double-error-detecting
(SEC-DED) memory functions to enable
recovery from single-bit errors.
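
To make the single-error-correcting, double-error-detecting behavior concrete, the following sketch uses an extended Hamming code on a 4-bit word. The word size, the bit layout, and the function names `encode` and `decode` are assumptions chosen for brevity; the article does not specify the code used in the proposed main memory, and a real memory would protect much wider words.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity,
    indices 1..7 are Hamming positions (parity bits at 1, 2, and 4)."""
    code = [0] * 8
    data = [(nibble >> i) & 1 for i in range(4)]
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # Parity bit p covers every data position whose index has bit p set.
        for i in (3, 5, 6, 7):
            if i & p:
                code[p] ^= code[i]
    for i in range(1, 8):
        code[0] ^= code[i]  # overall parity enables double-error detection
    return code


def decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    overall = 0
    for i, bit in enumerate(code):
        overall ^= bit
        if bit and i > 0:
            syndrome ^= i  # XOR of the indices of set bits locates a single flip
    word = list(code)
    if overall == 1:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even number of flips with a nonzero syndrome: double error.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status


if __name__ == "__main__":
    stored = encode(0b1011)
    stored[6] ^= 1  # simulate a single-event upset in one stored bit
    print(decode(stored))
```

A single flipped bit is located by the syndrome and corrected in place, whereas two flipped bits leave the overall parity even with a nonzero syndrome, so the error is reported rather than silently miscorrected.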
The main purpose served by the
memory system as a whole would be to
enable the computer to return to a valid
state — a known good point reached in
the computations before the occur-
rence of a detected error. In operation,
the checkers in the self-checking
processor module would signal errors to
the memory system. Recovery would in-
volve halting the operation of the self-
checking processor module, correcting
its configuration bits if necessary, re-
loading its registers, and returning con-
trol to a previous, known good point in
the program. The CPUs could then re-
sume correct computations.
The known good point in the computations
would be provided by the CPS in a procedure
denoted, variously, as check-pointing and
checkpoint recovery or checkpoint rollback.
The CPS would periodically command each CPU
to store the contents of its registers in the
recovery cache and clear its caches. This action
would establish a checkpoint. Thereafter, the
first time the CPU overwrote a clean RAM block
(one not yet modified since the checkpoint), the
original value and the address of that block
would be stored in a special RAM within the
recovery cache. Subsequent writes to that block
would be carried out normally (that is, without
intervention by the recovery cache). If an error
in the CPU were detected, the data in the special
recovery-cache RAM would be used to restore the
corresponding data in the main memory to their
prior correct values, the processor configuration
would be reloaded, the caches in the processor
module would be cleared, and the processor
registers would be restored to their prior values.
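
The checkpoint and rollback procedure described above behaves like an undo log. The sketch below models it in software purely for illustration; the proposed CPS would be implemented in hardware, and the class name `CheckpointedMemory`, its methods, and the word-sized "blocks" are hypothetical simplifications.

```python
from typing import Dict, List


class CheckpointedMemory:
    """Software model of main memory plus a recovery cache (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.ram: List[int] = [0] * size
        self.saved_registers: Dict[str, int] = {}
        # Recovery-cache RAM: block address -> value held at the checkpoint.
        self.undo_log: Dict[int, int] = {}

    def checkpoint(self, registers: Dict[str, int]) -> None:
        """Save the CPU registers and mark every block clean again."""
        self.saved_registers = dict(registers)
        self.undo_log.clear()

    def write(self, address: int, value: int) -> None:
        # Only the first write to a clean block is logged; later writes to
        # the same block proceed without involving the recovery cache.
        if address not in self.undo_log:
            self.undo_log[address] = self.ram[address]
        self.ram[address] = value

    def rollback(self) -> Dict[str, int]:
        """Restore memory to its checkpoint state and return the saved registers."""
        for address, original in self.undo_log.items():
            self.ram[address] = original
        self.undo_log.clear()
        return dict(self.saved_registers)


if __name__ == "__main__":
    mem = CheckpointedMemory(size=8)
    mem.write(2, 10)
    mem.checkpoint({"pc": 100, "r0": 1})
    mem.write(2, 20)   # first write since the checkpoint: old value 10 is logged
    mem.write(2, 30)   # second write to the same block: nothing new is logged
    registers = mem.rollback()  # simulated response to a detected error
    print(mem.ram[2], registers)
```

Logging only the first overwrite of each block keeps the recovery cache small: its occupancy grows with the number of distinct blocks modified since the checkpoint, not with the total number of writes.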
A new checkpoint could be ordered
when the recovery cache became filled to
capacity. Alternatively, checkpoints could
be forced at strategic points in the software.
Another alternative would be to
force checkpoints periodically, at inter-
vals short enough to ensure that rollback
time did not exceed a value that could be
specified by design.
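
The three checkpoint triggers mentioned above can be summarized as a simple decision rule. This is a hedged sketch: the class `CheckpointPolicy`, its fields, the cycle-count interval, and the order in which the triggers are tested are all assumptions made for illustration, not details given in the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointPolicy:
    cache_capacity: int  # entries the recovery cache can hold (hypothetical)
    max_interval: int    # longest allowed gap between checkpoints, in cycles (hypothetical)

    def reason(
        self,
        logged_entries: int,
        cycles_since_checkpoint: int,
        software_request: bool = False,
    ) -> Optional[str]:
        """Return why a new checkpoint is due, or None if none is needed yet."""
        if software_request:
            return "software"    # forced at a strategic point in the program
        if logged_entries >= self.cache_capacity:
            return "cache_full"  # recovery cache filled to capacity
        if cycles_since_checkpoint >= self.max_interval:
            return "interval"    # periodic checkpoint bounds the rollback time
        return None


if __name__ == "__main__":
    policy = CheckpointPolicy(cache_capacity=4, max_interval=1000)
    print(policy.reason(logged_entries=4, cycles_since_checkpoint=10))
```

Choosing a smaller `max_interval` shortens the worst-case amount of computation that must be repeated after a rollback, at the cost of more frequent register saves and cache clears.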
This work was done by Raphael Some and
David Rennels of Caltech for NASA’s Jet
Propulsion Laboratory. Further informa-
tion is contained in a TSP (see page 1).
NPO-30806
FPGA-Based, Self-Checking, Fault-Tolerant Computers
No software support and little hardware support would be needed for fault tolerance.
NASA’s Jet Propulsion Laboratory, Pasadena, California
[Figure: block diagram showing the Self-Checking Processor Module and the Memory System (Main Memory, CPS Recovery Cache), connected by lines labeled "Save/Restore Registers, Etc." and "Error Detected".]
The Memory System would store the states of computations at checkpoint intervals. Upon detection
of an error in the self-checking processor module, the memory system would provide the information
needed to roll back the computations to the immediately preceding checkpoint.
