Employment of Reduced Precision Redundancy for Fault Tolerant FPGA Applications by Sullivan, Margaret A. et al.
Calhoun: The NPS Institutional Archive
Faculty and Researcher Publications
2009
http://hdl.handle.net/10945/46466




Capt. Margaret A Sullivan, USAF, Professor Herschel H Loomis, and Professor Alan A Ross 
Naval Postgraduate School, Monterey, CA 






This research explores the employment of 
Reduced Precision Redundancy (RPR) as a power-
saving  alternative to traditional Triple Modular 
Redundancy (TMR). This paper focuses on the details 
of RPR implementation and the effect of RPR fault 
tolerance on the performance of spacecraft systems.  
RPR-protected system performance is evaluated using 
a signal-to-noise ratio analogy developed with 
MATLAB and Simulink computational tools.  This 
research demonstrates that RPR is an effective fault 
tolerance approach for arithmetic operations. 
Experimental results show that the benefit of RPR 
increases with the complexity of the operation to which 
it is applied.  System performance simulations 
demonstrate that RPR provides very good recovery 




Field programmable gate arrays (FPGA) offer a 
high degree of flexibility for multiple applications in 
space vehicle computers [1].  The challenges faced by 
the developer of an FPGA-based system in space are 
not trivial.  In addition to the damage of long-term 
radiation exposure, which can be minimized using 
standard electronics shielding techniques [2], FPGAs 
are susceptible to errors in both data and architecture 
configuration caused by single event effects (SEE).   
Triple modular redundancy (TMR) is a fault-
tolerance approach that uses parallel computation and 
voting to detect and correct errors in a circuit.  The 
basic structure of TMR is to build three identical 
copies of an operation, and then apply a two-of-three 
majority voter to each bit of the output of the three 
circuits.  The bitwise majority voter will correct any 
single error in the three operation circuits. 
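
The two-of-three bitwise vote described above can be sketched in a few lines. This is an illustrative software model only, not the hardware voter used in the experiments; the function name `majority_vote` and the 8-bit example words are assumptions made for the illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-of-three majority of three equal-width words.

    Each output bit is 1 when at least two of the three input bits are 1.
    """
    return (a & b) | (a & c) | (b & c)


# Hypothetical 8-bit result produced by three identical copies of a circuit.
golden = 0b1011_0010

# A single upset flips one bit in one copy; the vote restores the value.
faulty = golden ^ 0b0000_1000
voted = majority_vote(golden, faulty, golden)
print(f"{voted:08b}")
```

Because the vote is taken per bit, upsets in two different copies are also masked as long as they do not hit the same bit position.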
The concept of Reduced Precision Redundancy 
(RPR) allows the sacrifice of some level of precision in 
calculation, in the event that a fault occurs, in return 
for space and power savings on an FPGA.  The theory 
as developed by Snodgrass [3] at the Naval 
Postgraduate School (NPS) in 2006 suggests that 
instead of generating three identical copies of a circuit 
and voting on the outcome, some functions lend 
themselves to operating in a single thread at full 
precision, and in two additional threads at reduced 
precision.  The two reduced-precision operations will 
generate an upper and lower bound on the correct 
function output.  The precise calculation result is then 
compared to these upper and lower bounds, and voting 
logic determines whether the precise result may be 
used, or if an error has occurred in the precise solution 
and the average of the bounds must be used as a less-
precise result.  Shanbhag and his colleagues have used 
the same name and similar concepts to flag errors in 
DSP applications [4].  The significant difference is that 
Shanbhag assumes that the redundant calculation and 
all voting/comparison logic are fault-free. Snodgrass 
and the results in this paper do not require this 
assumption.   In this paper we examine the 
performance of RPR for fundamental arithmetic 
operations. 
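
The voting rule described above (use the precise result when it lies between the bounds, otherwise use the average of the bounds) can be sketched as follows. This is a minimal model that only covers a fault in the precise thread; how faults in the bound threads or in the voter itself are handled is not described in this text, so the sketch does not attempt it. The name `rpr_vote` and the integer example values are assumptions.

```python
def rpr_vote(precise: int, lower: int, upper: int) -> int:
    """Select the RPR output from one precise and two bounding results.

    Assumes lower <= upper, i.e. both reduced-precision threads are healthy.
    """
    if lower <= precise <= upper:
        # The precise result is consistent with the bounds: keep full precision.
        return precise
    # The precise result is outside the bounds: fall back to the
    # less-precise midpoint of the bounds.
    return (lower + upper) // 2


# Hypothetical values: the bounds bracket the true result 100.
print(rpr_vote(100, 96, 104))   # precise result accepted
print(rpr_vote(228, 96, 104))   # corrupted precise result replaced by midpoint
```

Note that the comparison is numerical (a magnitude test against the bounds) rather than a bit-by-bit vote.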
 
2. Assumptions, Terminology and Rules 
 
In order to discuss RPR, it is necessary to define a 
metric that represents the “degree” of RPR – that is, 
the amount by which precision is reduced for the 
redundant lower-precision calculations.  We choose 
the ratio of the number of non-sign-bits r in the 
variables of the redundant calculations to the number 
of non-sign-bits n in the variables of the precise 
calculation.  For example, if the precise operands are 
represented in eight bits of precision and the upper and 
lower bounds have five bits of precision, then the 
degree of RPR is r/n = 5/8. 
2009 17th IEEE Symposium on Field Programmable Custom Computing Machines
978-0-7695-3716-0 2009
U.S. Government Work Not Protected by U.S. Copyright
DOI 10.1109/FCCM.2009.53
To apply RPR successfully to a given operation, it 
is necessary to determine the reduced-precision upper 
and lower limits, or bounds, of the result (the output) 
as functions of the operands (the input).  Using two’s 
complement notation for addition and subtraction 
enables identical treatment of the two operations, 
provided the transitions between the negative number 
closest to zero ($1.111\ldots_2$) and zero ($0.000\ldots_2$) are 
carried out successfully.  In this scheme, the upper 
bound of any operand or result x must lie to the right 
of x on a number line and the lower bound  must lie to 
the left of x.  
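
One way to obtain bounds with this number-line property is sketched below. It is an assumption made for illustration, not necessarily the construction used in the experiments: the lower bound is formed by truncating the n - r low-order bits (which in two's complement always moves toward the left on the number line), the upper bound is one reduced-precision step above it, and the bounds of a sum are the sums of the operand bounds. Operands are modeled as Python integers and overflow is ignored; the names `bounds` and `add_bounds` are hypothetical.

```python
from typing import Tuple


def bounds(x: int, n: int, r: int) -> Tuple[int, int]:
    """Return (lower, upper) r-bit bounds of an n-bit two's complement value."""
    k = n - r                      # number of low-order bits discarded
    lower = (x >> k) << k          # arithmetic shift floors, also for x < 0
    upper = lower + (1 << k)       # one reduced-precision step to the right
    return lower, upper


def add_bounds(a: int, b: int, n: int, r: int) -> Tuple[int, int]:
    """Bounds on a + b computed only from the reduced-precision operand bounds."""
    a_lo, a_hi = bounds(a, n, r)
    b_lo, b_hi = bounds(b, n, r)
    return a_lo + b_lo, a_hi + b_hi


# Degree of RPR r/n = 5/8, as in the example in the text.
print(bounds(21, 8, 5))            # bounds of a positive operand
print(bounds(-3, 8, 5))            # bounds of a negative operand
print(add_bounds(21, -3, 8, 5))    # bounds that bracket 21 + (-3) = 18
```

Subtraction follows the same pattern once the second operand is negated, which is why two's complement allows the two operations to be treated identically.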
 
3. Comparing RPR & TMR Realizations 
 
If RPR is to be useful, it must require less space 
than TMR on an FPGA while still providing the same 
accuracy of computation in a no-error situation, and 
accuracy within a certain tolerance when errors do 
occur.  If this reduction in the area required can be 
achieved, using RPR instead of TMR to protect a 
circuit will require less power – or, conversely, more 
functionality may be obtained on a given FPGA.  To 
demonstrate, RPR and TMR adders were programmed 
using Xilinx Integrated Software Environment (ISE) 
Release 6.3.03i and targeted for the Xilinx Virtex™ 
XQVR600.  The RPR circuits contain both arithmetic 
operations and voters, and are compared to analogous 
TMR circuits using FPGA area as a metric for 
evaluation.  The FPGA area is reported during the 
mapping process in ISE as a slice count [5].  The 
results are presented in Table 1, which illustrates that 
for simple operations, the added complexity of the 
voter overwhelms the reduction in the size of the 
operation module.   
Table 1. Area Required by TMR and Representative RPR Addition Experiments 

                               Slice Count
Type  Precision (or Degree)    Operation  Voter  Complete*
TMR   64                       99         65     163
RPR   32/64 (0.5)              67         115    181
RPR   16/64 (0.25)             51         95     146
RPR   8/64 (0.125)             43         58     100
TMR   32                       51         33     83
RPR   16/32 (0.5)              35         78     114
RPR   8/32 (0.25)              27         42     68
TMR   16                       27         17     43
RPR   8/16 (0.5)               19         34     52

*Complete circuit area is computed independently of “Operation + Voter” 
 
Since multiplication is essentially a collection of 
many addition operations, it is reasonable that mapping 
a multiplication operation to an FPGA should require 
significantly greater area than an addition operation 
requires.  This is indeed the case; in fact, the difference 
is so great that the size of the multiplication operation 
modules in either TMR or RPR implementations 
dwarfs the size of their respective voter modules, as 
shown in Table 2. 
Table 2. Area Required by TMR and Representative RPR Multiplication Experiments 

                               Slice Count
Type  Precision (or Degree)    Operation  Voter  Complete*
TMR   64                       6378       64     6436
RPR   32/64 (0.5)              3208       226    3429
RPR   16/64 (0.25)             2400       130    2526
RPR   8/64 (0.125)             2194       101    2285
TMR   32                       1622       32     1112
RPR   16/32 (0.5)              816        114    926
RPR   8/32 (0.25)              610        85     684
TMR   16                       413        16     427
RPR   8/16 (0.5)               205        77     274

*Complete circuit area is computed independently of “Operation + Voter” 
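
The trend in Tables 1 and 2 is easier to see as a ratio of complete-circuit area. The short calculation below uses only the 64-bit "Complete" slice counts reported in the two tables; the dictionary layout and the name `relative_area` are assumptions made for the illustration.

```python
# "Complete" slice counts for the 64-bit rows of Tables 1 and 2.
COMPLETE_SLICES = {
    "addition":       {"TMR": 163,  "RPR 32/64": 181,  "RPR 16/64": 146,  "RPR 8/64": 100},
    "multiplication": {"TMR": 6436, "RPR 32/64": 3429, "RPR 16/64": 2526, "RPR 8/64": 2285},
}


def relative_area(operation: str, design: str) -> float:
    """Area of an RPR design as a fraction of the TMR area for the same operation."""
    table = COMPLETE_SLICES[operation]
    return table[design] / table["TMR"]


for operation, table in COMPLETE_SLICES.items():
    for design in table:
        if design != "TMR":
            ratio = relative_area(operation, design)
            print(f"{operation:15s} {design:10s} {ratio:.2f} x TMR area")
```

A ratio above 1 means the RPR circuit is larger than the TMR circuit it would replace, which is the case for 32/64 addition; every multiplication ratio is well below 1.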
 
4.     Modeling Errors Due to Single Event 
Effects 
 
4.1.   Classes of Errors in FPGAs 
 
It is important to remember that there are two 
distinct classes of SEE-generated faults that may cause 
errors in an FPGA-based system.  A data fault occurs 
when a charged particle changes the value of one or 
more bits in data memory.  This may cause a transient 
error, so named because it only lasts as long as that 
particular piece of data is in the system.   A 
configuration fault occurs when a charged particle 
changes the value of one or more bits in the memory 
that stores the configuration of the FPGA.  In the 
Xilinx® Virtex™ XQVR600 FPGAs used in this 
study, there are 6,127,744 configuration bits and only 
98,304 total block SelectRAM bits [5], so it is much 
more likely that a randomly incident charged particle 
will affect a configuration bit than a data bit.  A 
configuration memory fault also has the potential to 
inflict much more damage on a reconfigurable system 
than a transient fault could cause, because a 
configuration fault has the potential to generate an 
error in every value that it touches – it is persistent.  
Both classes of faults manifest themselves as errors in 
the output values of a reconfigurable computer.  
However, the propagation and extent of the errors 
caused by data and configuration faults are different.  
A transient error due to a data fault may propagate 
through multiple steps in a processor, but within a 
finite number of clock cycles the error will become 
obsolete and no longer affect the computation results.  
A persistent error due to a configuration fault will not 
be corrected until part or all of the FPGA is 
reconfigured. 
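
The likelihood argument based on the bit counts quoted above can be made concrete with a rough calculation. It assumes that every bit is equally likely to be struck and that only the configuration memory and the block SelectRAM are counted; both are simplifying assumptions of this sketch, not claims from the device documentation.

```python
CONFIG_BITS = 6_127_744      # configuration bits quoted for the XQVR600
BLOCK_RAM_BITS = 98_304      # block SelectRAM bits quoted for the XQVR600


def strike_fraction(target_bits: int, total_bits: int) -> float:
    """Fraction of uniformly distributed upsets that land in the target memory."""
    return target_bits / total_bits


total_bits = CONFIG_BITS + BLOCK_RAM_BITS
p_config = strike_fraction(CONFIG_BITS, total_bits)
p_data = strike_fraction(BLOCK_RAM_BITS, total_bits)

print(f"configuration fault: {p_config:.1%}")
print(f"data fault:          {p_data:.1%}")
print(f"ratio:               {CONFIG_BITS / BLOCK_RAM_BITS:.1f} to 1")
```

Under these assumptions, roughly 98 of every 100 upsets would be configuration faults, which is why the persistent-error case receives the most attention in the experiments.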
 
4.2.   Modeling Errors as Noise 
 
Consider a signal that is digitally sampled at 
regular intervals with precision $n = 32$ and maximum 
value of 1 volt.  The smallest representable voltage 
level in the sampling regime, assuming fixed-point 
fractional representation, is $2^{-32} = 2.328 \times 10^{-10}$ volts.  If 
the signal processor has no fault tolerance and an error 
occurs, a given output could be incorrect in any or all 
digits: the maximum error magnitude is bounded by 
$\varepsilon \le O(2^{-1}) = (0.1)_2 = (0.5)_{10}$. 
If the same signal is sampled with reduced 
precision $r = 8$, the smallest representable voltage 
level is $2^{-8} = 3.91 \times 10^{-3}$ volts.  If a signal processor has 
RPR fault tolerance and an error occurs, the RPR result 
will be used.  The RPR result is guaranteed to be 
correct to the rth bit, i.e., maximum error is 
$\varepsilon \le O(2^{-(r+1)}) = 1 \times 2^{-9} = (0.001953)_{10}$ volts.  The r-bit RPR 
result sacrifices n - r bits of precision while it 
preserves r bits of precision in the fault case.   
The difference in magnitude of the error in a 
system with no fault tolerance and a system with RPR 
fault tolerance can be expressed using the signal-to-
noise ratio (SNR).  When there is no fault tolerance in 
a system, the “noise” that corrupts a result due to an 
error has the same potential magnitude and power as 
the signal itself.  The maximum possible error 
$(\varepsilon)_{10} = 0.5$ translates to $SNR_{worst} = 3$ dB for worst-case 
error scenarios.  When an error is detected in an RPR 
system and the RPR result is used, the loss of precision 
in the RPR result is analogous to noise at a lower 
power.  In the example system (8/32 RPR), the relative 
noise power represented by the RPR result is 
$SNR_{RPR} = 10 \log_{10}(1 / 0.001953) = 27.1$ dB.  This procedure 
can be applied to any degree of RPR to determine the 
equivalent noise introduced by using the RPR result. 
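
The procedure can be sketched for an arbitrary reduced precision r. The sketch assumes a full-scale signal of 1 volt and the error bounds stated above (0.5 with no fault tolerance, 2^-(r+1) with an r-bit RPR result); the function names `max_error` and `snr_db` are hypothetical.

```python
import math


def max_error(r: int) -> float:
    """Maximum error of an r-bit RPR result for a 1 V full-scale signal."""
    return 2.0 ** -(r + 1)


def snr_db(error: float) -> float:
    """Equivalent SNR in dB, treating the error bound as relative noise power."""
    return 10.0 * math.log10(1.0 / error)


print(f"no fault tolerance: {snr_db(0.5):.1f} dB")
for r in (8, 16, 24, 32):
    print(f"r = {r:2d}: {snr_db(max_error(r)):.1f} dB")
```

Each additional bit retained in the reduced-precision result adds about 3 dB to the equivalent SNR, since every bit halves the error bound.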
 
4.3.   Experiment Details 
The experiments described below were conducted 
using MATLAB Release 2007a with Simulink version 
6.6.  Machine epsilon ε for the system configuration 
used was $2.2204 \times 10^{-16}$, equivalent to n = 52 in fixed-
point representation.  Errors were simulated using 
either a random number generator with output scaling 
(to represent an uncorrected error or an RPR result), or 
an additive white Gaussian noise (AWGN) channel 
simulator with SNR corresponding to the desired 
reduction in precision.   
The example system chosen for RPR performance 
evaluation was the NPS Bifocal Relay Mirror Satellite 
(BRMS) Simulator (BRMSS) attitude control system 
model, developed in 2005 by Kim [7]. 
One of the most sensitive points in a satellite 
attitude determination and control system (ADCS) is 
allocation of the control command to the actuators, in 
this case Control Moment Gyros (CMGs).  CMGs are 
heavy, delicate machines that apply very high torque 
to a spacecraft relative to the power they require to 
operate.  An error in commanding a set of CMGs can 
cause both temporary and permanent damage to the 
CMGs and the spacecraft.  For this reason, the 
worst-case scenario was chosen by injecting error into 
the control command. 
The three components of the control command 
were passed through independent additive white 
Gaussian noise (AWGN) channels.  Since the 
experiment was meant to simulate a single event effect, 
only one noise channel was activated for any given 
trial.   
The scenario used for the experiment was a 
standard “reference maneuver,” where the spacecraft 
began at rest in attitude orientation $(\phi, \theta, \psi) = (0^\circ, 0^\circ, 0^\circ)$, 
and was commanded to move to orientation $(1^\circ, 1^\circ, 1^\circ)$ 
and stop.  The maneuver with no fault-induced errors 
is executed in less than twenty seconds, which includes 
the time required for the system to settle to within two 
percent of the target orientation.  Figure 1.a. shows the 
reference maneuver with no error. 
The error scenarios tested with the ADCS model 
included both transient (discrete delta function δ) and 
persistent (step function) error models.  Both types of 
errors occurred at five seconds into the simulation (t = 
5).  For each type of error, the reference maneuver was 
executed with SNR levels equivalent to no fault 
tolerance (SNR = 3 dB), and RPR fault tolerance for r 
= 8, 16, 24, and 32 (SNR = 27, 51, 75, and 99 dB, 
respectively).  Rather than presenting an exhaustive 
collection of data from the thirty scenarios tested, only 
the most significant results are included here.  The first 
notable outcome of the tests is the conclusion that at 
the low operating power level assumed for this ADCS 
(25 mW), a transient data fault can generate an error 
only great enough to change the magnitude of the 
system transient response – it cannot change the 
steady-state properties of the system.  The insertion of 
a transient error causes an overshoot in the response of 
the spacecraft, but the position settles to its steady state 
at $(1^\circ, 1^\circ, 1^\circ)$ in the same amount of time that it took to 
settle when there was no error to counteract. Because 
the system was stable enough to deal with the transient 
data error, no redundancy was required. 
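
The two error models used in these trials, a discrete delta for a transient fault and a step for a persistent fault, both occurring at t = 5, can be sketched as simple sampled profiles. This is a deterministic simplification: the experiments injected errors through a random number generator or an AWGN channel, whereas the sketch uses a fixed amplitude. The function name `error_profile`, the 0.01 s sample period, and the 20 s duration are assumptions.

```python
from typing import List


def error_profile(kind: str, amplitude: float, t_fault: float = 5.0,
                  t_end: float = 20.0, dt: float = 0.01) -> List[float]:
    """Sampled error added to one component of the control command."""
    n_samples = int(round(t_end / dt)) + 1
    k_fault = int(round(t_fault / dt))
    if kind == "transient":
        # Discrete delta: the error is present in a single sample only.
        return [amplitude if k == k_fault else 0.0 for k in range(n_samples)]
    if kind == "persistent":
        # Step: the error is present in every sample from the fault onward.
        return [amplitude if k >= k_fault else 0.0 for k in range(n_samples)]
    raise ValueError(f"unknown error kind: {kind!r}")


# Amplitude taken from the r = 8 error bound, 2^-(r+1).
rpr_amplitude = 2.0 ** -9
transient = error_profile("transient", rpr_amplitude)
persistent = error_profile("persistent", rpr_amplitude)
print(sum(1 for e in transient if e != 0.0), "corrupted sample (transient)")
print(sum(1 for e in persistent if e != 0.0), "corrupted samples (persistent)")
```

The difference in the number of corrupted samples is the reason the two fault classes affect a feedback loop so differently: the delta disturbs one control command, while the step disturbs every command until the device is reconfigured.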
 
Figure 1. ADCS Reference Maneuver. 
When a persistent error is introduced to a control 
system, the effect is much more significant.  Since the 
source of a persistent error is in the system 
configuration, a persistent error corrupts data in every 
execution of the feedback loop and disturbs every 
control command sent to the actuators.  A 
representative example of the effect of a persistent 
error in the control command is shown in Fig. 1.b.   
When RPR is applied to the system, the magnitude 
of the error is dramatically reduced.  While a 
configuration fault in the unprotected system causes 
unbounded error that makes the system completely 
unusable, a configuration fault in the system protected 
with RPR generates errors whose magnitude is strictly 
bounded, and even a small degree of RPR (r = 8) 
shows marked improvement in the system trajectory 
(Fig. 1.c). 
The scenario in Fig. 1.c was run with SNR 
equivalent to RPR of degree 8/52 (SNR = 27 dB) 
applied to the Y component of the control command.  
The response for RPR 8/52 is still not satisfactory for 
the fine pointing requirements of the BRMS, but 
would provide an adequately steady state to operate a 
spacecraft with less stringent attitude control mission 
requirements (e.g., an RF communications satellite).  
However, when the scenario was run with SNR 
equivalent to RPR of degree 16/52 (SNR = 51 dB), 
both control and angle trajectories were virtually 
indistinguishable from the error-free case.   
These results show that RPR has the potential to 
supply enough precision in a reduced-precision 
solution to allow continuous operation through both 





This research has shown that RPR is a viable fault 
tolerance approach for arithmetic operations.  In order 
for RPR to be effective, the upper and lower bounds of 
the result must be generated in a manner depending on 
the operation being executed, paying particular 
attention to the signs of the operands.  Also, the RPR 
voter must be constructed such that it conducts a 
numerical comparison of the MSB of the precise result 
with the bound results, as opposed to the bitwise 
comparison used in TMR.   
Experimental results show that for the simplest 
operations, RPR is not always the most efficient fault 
tolerance approach.  Essentially, the benefit of RPR 
increases with the complexity of the operation to 
which it is applied. 
System performance simulations demonstrate that 
RPR provides very good recovery from errors caused 
by SEE in spacecraft systems.  With a baseline 
precision n of 52 bits, even an approximate RPR result 
with only eight bits of precision drastically improved 
the transient and steady-state response of an attitude 
control system.  This and other implementation 
considerations contribute to the design trade space of 
FPGA capacity and power, fault tolerance 




[1] “Digital Hardware,” class notes for EC4530, Department of 
Electrical and Computer Engineering, Naval Postgraduate School, 
Summer 2008. 
[2] “Space Radiation Effects,” class notes for SS3035, Space 
Systems Academic Group, Naval Postgraduate School, Spring 2007. 
[3]  J. Snodgrass, “Low-Power Fault Tolerance For Spacecraft 
FPGA-Based Numerical Computing,” Ph.D. dissertation, Naval 
Postgraduate School, Monterey, CA, 2006. 
[4] B. Shim, S. R. Sridhara, and N. R. Shanbhag, “Reliable Low-
Power Digital Signal Processing via Reduced Precision 
Redundancy,” in IEEE Trans. VLSI Systems, May 2004. 
[5] Xilinx, Inc., “Virtex™ 2.5 V Field Programmable Gate Arrays 
Architectural Description,” Xilinx Product Specification DS003-2 
(v2.8.1), December 9, 2002. 
[6] Xilinx, Inc., “Virtex™ 2.5 V Field Programmable Gate Arrays 
Electrical Characteristics,” Xilinx Product Specification DS003-3 
(v3.2), September 10, 2002. 
[7]  J. J. Kim and B.N. Agrawal, “Automatic Mass Balancing of Air-
Bearing Based Three-Axis Rotational Spacecraft Simulator,” Naval 
Postgraduate School, Monterey, CA, October 2008. 
