In this paper, we use reduced precision checking (RPC) 
INTRODUCTION
In this paper, we focus on detecting errors in floating point arithmetic performed by the floating point units (FPUs) in processor cores. We focus on all of the common floating point operations: addition/subtraction, multiplication, division, and square root.
We base our error detection on the previously introduced idea of reduced precision checking (RPC) [1] [9] . The key idea behind RPC is to check a full-precision (32-bit, in this paper) FPU with a reduced-precision checker FPU. The full-precision FPU operates on IEEE standard operands that have a sign bit, 8 exponent bits, and 23 fraction bits. The checker FPU operates on operands that also have a sign bit and 8 exponent bits, but the fractions have k bits, where k≤23 is a parameter that is chosen by the designer. In the absence of errors, the full-precision result of the FPU will match the reduced-precision result of the checker; only errors in the least significant bits can evade detection. RPC has an inherent trade-off between the cost of the checker and the likelihood of it not detecting an error.
Prior work has developed RPC for addition/subtraction [1] and multiplication [9]. The prior work introduced the idea of RPC and built and evaluated hardware implementations for addition/subtraction and multiplication. Prior work is limited, however, in several important ways. First, it does not include floating point division or square root, both of which are common hardware units. Second, we show that the size of the undetected errors for addition/subtraction and multiplication can be greatly reduced; prior work made overly conservative assumptions that led to unnecessarily large errors evading detection. Third, for some floating point operations on some standard operands (e.g., subtraction of two similarly sized numbers), the prior RPC schemes were unable to detect any errors at all and had to be disabled for these operation/operand combinations. (A non-standard operand is a denorm, infinity, or NaN.)
In this paper, we make the following contributions:
• We develop and evaluate RPC for a complete FPU in RTL, with addition/subtraction, multiplication, division, and square root. The results include error detection coverage, chip area, and energy.
• We mathematically compute the maximum undetectable error for every floating point operation, and we show that we can obtain tighter bounds for addition/subtraction and multiplication than prior work.
• Using a new RPC technique that we call "reverse checking," we eliminate the unchecked operation/operand pairs of prior work. That is, our RPC unit checks every operation on all standard operands.
BACKGROUND AND NOTATION
In this section, we introduce background and notation that we use throughout the rest of this paper. We assume that all full-precision floating point numbers adhere to the IEEE-754 standard. Furthermore, we consider only 32-bit floating point numbers in this paper, but our approach applies in an identical way to 64-bit floating point numbers.
Floating Point Format
As illustrated in Figure 1, a 32-bit floating point number X consists of a sign bit ($S_X$), 8 exponent bits, and 23 fraction bits. The exponent is stored in biased notation, such that the exponent of X ($E_X$) is the unsigned value of the exponent bits after 127 is subtracted (i.e., $E_X = X[30:23] - 127$). The mantissa of X is an implicit 1 followed by the 23 fraction bits. We define fraction ($F_X$) as the integer value of the lower 23 bits of X (X[22:0]), and mantissa ($M_X$) as the floating point value of the mantissa, including the implicit one, i.e., $M_X = 1 + F_X \times 2^{-23}$.
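The field definitions above can be sketched in a few lines of Python. This is only an illustration, not part of the RPC hardware: the function name `decode_f32` is hypothetical, and the sketch assumes a normalized operand (not zero, denorm, infinity, or NaN).

```python
import struct

def decode_f32(x: float):
    """Return (S, E, F, M) for the single-precision encoding of x.

    Assumes x is a normalized number (hypothetical helper for illustration).
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                        # X[31]
    exponent = ((bits >> 23) & 0xFF) - 127   # biased exponent bits minus 127
    fraction = bits & 0x7FFFFF               # X[22:0] as an integer
    mantissa = 1 + fraction / 2**23          # implicit 1 followed by the fraction
    return sign, exponent, fraction, mantissa

# -6.0 = -1.5 x 2^2, so S = 1, E = 2, and M = 1.5
print(decode_f32(-6.0))
```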
RPC Notation
Because RPC involves computations on portions of a floating point number's representation, it is helpful to introduce some notation for describing these quantities. Assume that the RPC checker uses a fraction with k bits, where $k \le 23$. Thus, an RPC checker operates on numbers that have 9+k bits, where the 9 corresponds to 1 sign bit and 8 exponent bits.
We use A and B to refer to the FPU's operands and C to refer to its (rounded) result. We use $\tilde{C}$ to denote the "true" result that would be produced with unlimited precision. For example, for multiplication, $A \times B = \tilde{C}$. We use the notation $A \times B \to C$ to denote that C is the result produced by the hardware. We denote the rounding error associated with the mantissa of C ($M_C$) as $\delta_C$. Thus, $A \times B = \tilde{C} = S_C \times 2^{E_C} \times (M_C + \delta_C)$.
Assume the full-precision FPU computes $A \times B \to C$, where A, B, and C are full 32-bit numbers. For purposes of comparing its result, C, to the result of the checker, we consider the most significant 9+k bits of C and denote them as $C_H$. The mantissa of $C_H$ is $M_{C_H}$, and the value of the remaining 23-k low-order fraction bits is $M_{C_L}$, so that $M_C = M_{C_H} + M_{C_L}$. For example, if $M_C = 1.\{1\}_{23}$ and k = 7, $M_{C_H} = 1.\{1\}_7$ and $M_{C_L} = 0.\{0\}_7\{1\}_{16}$. The subscript indicates the number of repetitions of the bit within braces. In this case, $0.\{0\}_7\{1\}_{16}$ means 7 zeros and 16 ones following the binary point.
The checker's output, $C'$, has a fraction and mantissa that we denote as $F_{C'}$ and $M_{C'}$, respectively.
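The split of a mantissa into its high and low parts can be illustrated as follows. This is a minimal sketch: `split_mantissa` is a hypothetical helper name, and the sketch assumes a normalized single-precision operand and $0 \le k \le 23$. It reproduces the k = 7 example above, where the mantissa has all 23 fraction bits set.

```python
import struct

def split_mantissa(x: float, k: int):
    """Split the mantissa of single-precision x into (M_H, M_L).

    M_H keeps the implicit 1 and the top k fraction bits; M_L is the value
    of the remaining 23-k low-order fraction bits (hypothetical helper).
    """
    fraction = struct.unpack(">I", struct.pack(">f", x))[0] & 0x7FFFFF
    low_bits = 23 - k
    f_high = fraction >> low_bits                 # top k fraction bits
    f_low = fraction & ((1 << low_bits) - 1)      # dropped low-order bits
    return 1 + f_high / 2**k, f_low / 2**23

# Mantissa 1.{1}_23 with k = 7: M_H = 1.{1}_7 and M_L = 0.{0}_7{1}_16
m_h, m_l = split_mantissa(2 - 2**-23, 7)
print(m_h, m_l)
```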
RPC AND ITS CHALLENGES
At a high level, RPC is similar to dual modular redundancy. Assume the FPU performs a full-precision floating point operation on operands A and B, say $A \times B$, and obtains the full-precision result C. To check this result, the checker performs the same floating point operation on the truncated inputs $A_H$ and $B_H$ to obtain $C'$, which is then compared to $C_H$, as illustrated in Figure 2.
Although RPC is quite simple at this high level of abstraction, it is challenging to efficiently compare $C_H$ to $C'$. With identical dual modular redundancy, comparison between the two results can be efficiently implemented with bitwise comparison. In contrast, we need to compute the difference between the two results, because $C_H$ and $C'$ can differ even in the absence of errors. However, we do not want to perform floating point subtraction, because of the time and energy required to do so.
Instead of comparing the floating point values $C_H$ and $C'$ by performing a floating point subtraction, RPC compares them using integer subtraction; we refer to the resulting integer difference between $C_H$ and $C'$ as Diff.
In Sections 5-7, we prove that, in the absence of errors, the bounds on Diff are [-1,1] for addition and subtraction and [-1,3] for multiplication, division, and square root.
If Diff is outside of its allowable range, then an error has occurred in either the full-precision FPU or the checker. (The latter scenario is a false positive.) If Diff is non-zero but within its allowable range, then an error may or may not have occurred; if an error did occur, RPC would not detect it.
Later, we prove that the integer values of the bounds on Diff do not vary with k. However, the significance of these integer bounds does vary. The least significant bit of Diff is in the least significant bit position of the checker, which has the floating point value $2^{-k}$. Intuitively, a larger value of k corresponds to a smaller significance of Diff (i.e., $2^{-k}$ is smaller) and thus a smaller magnitude of error that will not be detected by RPC. As a result, RPC provides a tunable tradeoff between error coverage and cost. We devote much of this paper to proving how large the integer difference of $C'$ and $C_H$ may be in the absence of errors; any difference larger than this allowable discrepancy will be detected as an error, and any difference smaller will be ignored.
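The comparison step can be sketched as below. The names `ALLOWED`, `high_word`, and `rpc_flags_error` are hypothetical, and the sketch assumes that both (9+k)-bit words carry the same sign bit and are subtracted as plain unsigned integers; the exact bit-level comparison in the hardware may differ.

```python
import struct

# Allowable error-free range of Diff per operation, as stated in the text.
ALLOWED = {"add": (-1, 1), "sub": (-1, 1),
           "mul": (-1, 3), "div": (-1, 3), "sqrt": (-1, 3)}

def high_word(x: float, k: int) -> int:
    """Upper 9+k bits of single-precision x: sign, exponent, k fraction bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> (23 - k)

def rpc_flags_error(full_word: int, checker_word: int, op: str) -> bool:
    """True if the integer difference is outside the allowable range for op."""
    lo, hi = ALLOWED[op]
    diff = full_word - checker_word   # integer subtraction, not floating point
    return not (lo <= diff <= hi)

c_h = high_word(2.25, 7)                       # 16-bit word for C = 2.25
print(rpc_flags_error(c_h, c_h - 3, "mul"))    # Diff = 3 is tolerated
print(rpc_flags_error(c_h, c_h - 4, "mul"))    # Diff = 4 is flagged
```

One unit of Diff corresponds to $2^{-k}$ in the mantissa, so the same integer thresholds tolerate smaller absolute discrepancies as k grows.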
Although integer subtraction is cheap and easy to implement, it introduces the following challenges in bounding the maximum allowable difference in the absence of errors.
Challenge #1: Unequal "Matches" of Exponents
It might appear at first that $C_H$ and $C'$ can differ only in their fractions. However, they can also differ in their exponents in rare situations that must be considered. When we truncate the inputs to the checker (i.e., convert A and B into $A_H$ and $B_H$ in multiplication, for example), it leads to the absolute value of $C'$ being less than or equal to the absolute value of C. Thus, the exponent of $C'$ can be less than the exponent of $C_H$. The mismatch in exponents further causes misalignment between the fractions of $C'$ and $C_H$ for purposes of computing the integer difference. This rare scenario leads to complexity in bounding Diff.
Challenge #2: The Need for "Reverse" Checking
The second challenge with RPC is the existence of certain scenarios in which direct checking is ineffective. We illustrate this challenge with a base-10 example: consider the case of $3.9999 \times 10^{23} - 3.9996 \times 10^{23}$, and assume that the reduced precision checker truncates the last two digits. The (error-free) full-precision result is $C = 3.0000 \times 10^{19}$, and the RPC result $C'$ is $3.99 \times 10^{23} - 3.99 \times 10^{23} = 0$. The discrepancy between $C_H$ ($3.00 \times 10^{19}$) and $C'$ in scenarios like this can be almost arbitrarily large because the result of the reduced-precision subtraction is zero. More generally, the problem exists for subtraction when both operands have the same sign and for addition when both operands have different signs. Prior work detected these scenarios and disabled checking when they occurred [1].
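The same cancellation can be reproduced in binary. The sketch below is illustrative only: `truncate` is a hypothetical helper that zeroes the low 23-k fraction bits of a single-precision value, and the operands are chosen so that they differ only in bits that a k = 7 checker discards.

```python
import struct

def truncate(x: float, k: int) -> float:
    """Zero the low 23-k fraction bits of single-precision x (hypothetical helper)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= ~((1 << (23 - k)) - 1)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

a = 1 + 2**-10          # differs from b only below the 7th fraction bit
b = 1 + 2**-12
full = a - b                                 # full-precision result is non-zero
checker = truncate(a, 7) - truncate(b, 7)    # forward check cancels to zero
print(full, checker)
```

Because the forward checker's result is zero, no fixed tolerance on Diff can relate it to the non-zero full-precision result.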
To overcome this challenge, we propose reverse checking for same-sign subtraction (SSSUB) and different-sign addition (DSADD). In the case of SSSUB, forward checking would compute $A_H - B_H \to C'$ and compare $C'$ to $C_H$. Instead, with reverse checking, RPC uses the truncated result of the full-precision FPU, $C_H$, as one of its inputs and then "reverses" the computation to check it.
The operation of reverse checking depends on the signs of A, B, and C. There are two cases. First, if $S_A = S_B = S_C$, then the checker computes $B_H + C_H \to A'$ and then compares $A'$ to $A_H$. This check avoids the problem of SSSUB because the checker is performing same-sign addition (SSADD) when computing $B_H + C_H \to A'$. In the second case, $S_A = S_B \ne S_C$, and the checker computes $A_H - C_H \to B'$ and compares $B'$ to $B_H$. Because $B = A - C$, the checker is performing different-sign subtraction (DSSUB) in computing $A_H - C_H \to B'$.
For division and square root, we also adopt reverse checking, but for a different reason. Both division and square root can be checked, using reverse checking, with the same reduced-precision multiplier we use to check multiplication. Reusing the multiplier checker is efficient, not least because multiplication is cheaper than division and square root. To check $A \div B \to C$, the checker performs $B_H \times C_H \to A'$ and compares $A'$ to $A_H$. To check $\sqrt{A} \to C$, the checker performs $C_H \times C_H \to A'$ and compares $A'$ to $A_H$.
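As an illustration of reverse checking for division, the following sketch models the reduced-precision multiplier with exact rational arithmetic followed by round-to-nearest-even at k fraction bits. All helper names are hypothetical, only positive normalized values are handled, and a hardware checker would not use rational arithmetic; the sketch is meant only to show which quantities are multiplied and compared.

```python
import struct
from fractions import Fraction

def high_word(x, k):
    """Upper 9+k bits of the single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> (23 - k)

def word_value(word, k):
    """Exact value of a positive (9+k)-bit word."""
    mantissa = Fraction((1 << k) + (word & ((1 << k) - 1)), 1 << k)
    return mantissa * Fraction(2) ** ((word >> k) - 127)

def round_to_word(value, k):
    """Round a positive exact value to k fraction bits, ties to even."""
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    scaled = round(value * (1 << k))      # Fraction rounds half to even
    if scaled == 1 << (k + 1):            # rounding carried into the next binade
        scaled >>= 1
        exponent += 1
    return ((exponent + 127) << k) | (scaled - (1 << k))

def reverse_check_div(a, b, c, k):
    """Diff = A_H - A', where A' = B_H x C_H rounded to k fraction bits."""
    product = word_value(high_word(b, k), k) * word_value(high_word(c, k), k)
    return high_word(a, k) - round_to_word(product, k)

# C is the single-precision quotient 1.0 / 3.0
c = struct.unpack(">f", struct.pack(">f", 1.0 / 3.0))[0]
print(reverse_check_div(1.0, 3.0, c, 7))
```

For this input, $C_H = 0.33203125$ and $B_H \times C_H = 0.99609375$, which is one checker ulp below $A_H = 1.0$, so Diff is 1 and falls inside the allowable range.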
Challenge #3: the Rounding Effect
Another challenge in bounding Diff is that both the fullprecision FPU and the reduced-precision checker produce inexact, rounded results. We must consider this rounding in two scenarios.
First, rounding in the full-precision unit might produce a carry from low-order bits that could propagate to high-order bits, leading to a mismatch between the high-order bits of $C_H$ and $C'$. Second, rounding occurs twice in reverse checking, because one input operand used by the checker is rounded (and the result is rounded). For instance, in the case of $A \div B \to C$, even with full precision, $B \times C \ne A$. With reverse checking, we compute $B_H \times C_H \to A'$, during which we first round the input $C_H$ (by truncating C) and then round the result to get $A'$.
AXIOMS
In the following sections, we frequently use several axioms that derive from the format of IEEE floating point numbers.
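The smallest possible value of $M_X$ is $1.\{0\}_{23} = 1$. The largest possible value of $M_X$ is $1.\{1\}_{23} = 10.\{0\}_{23} - 0.\{0\}_{22}\{1\} = 2 - 2^{-23}$.
Axiom 1: $1 \le M_X \le 2 - 2^{-23}$
Similarly, the smallest possible value of $M_{X_H}$ is also 1 and its largest possible value is $1.\{1\}_k = 10.\{0\}_k - 0.\{0\}_{k-1}\{1\} = 2 - 2^{-k}$.
Axiom 2: $1 \le M_{X_H} \le 2 - 2^{-k}$
For $M_{X_L}$, the smallest possible value is $0.\{0\}_{23} = 0$ and the largest possible value is $0.\{0\}_k\{1\}_{23-k} = 0.\{0\}_{k-1}\{1\}\{0\}_{23-k} - 0.\{0\}_{22}\{1\} = 2^{-k} - 2^{-23}$.
Axiom 3: $0 \le M_{X_L} \le 2^{-k} - 2^{-23}$
For all floating point operations, the result gets rounded even in the full-precision unit. Take multiplication $A \times B \to C$ for example:
$A \times B = S_C \times 2^{E_C} \times (M_C + \delta_C)$
The rounding error $\delta_C$ is bounded by machine epsilon ($\epsilon$). In the following derivations, we assume the default IEEE rounding mode round-to-nearest (to even on tie), in which case $\epsilon$ equals $2^{-24}$. For the other three modes (toward 0, +∞, and −∞), $\epsilon$ equals $2^{-23}$.
Axiom 4: $|\delta_C| \le 2^{-24}$
For the reduced precision checker, the rounding error is still 0.5 ulp of the checker. Because the least significant bit in the checker's mantissa is $2^{-k}$, the checker's rounding error $\delta_{C'}$ is bounded by
Axiom 5: $|\delta_{C'}| \le 2^{-(k+1)}$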
Note that Axiom 4 and Axiom 5 apply to all operation types, not just multiplication, because we considered only the relationship between the result produced by an FPU and the true result with unlimited precision. Regardless of the type of operation, rounding is performed in the same manner.
An important part of bounding Diff is to connect the integer subtraction used to calculate Diff to the floating point values of its two operands. Consider the integer subtraction for computing Diff in forward checking. When $E_{C_H} = E_{C'}$, the exponent bits cancel, and Diff is the difference between the two mantissas, left shifted by k (i.e., converted to an integer).
Axiom 6: If $E_{C_H} = E_{C'}$, then $Diff = (M_{C_H} - M_{C'}) \times 2^k$
Axiom 6 applies to reverse checking similarly, except that, instead of $C_H - C'$, $A'$ or $B'$ is subtracted from $A_H$ or $B_H$, respectively.
RPC FOR MULTIPLICATION
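The full-precision FPU computes $A \times B \to C$:
$S_A \times 2^{E_A} \times M_A \times S_B \times 2^{E_B} \times M_B = S_C \times 2^{E_C} \times (M_C + \delta_C)$
$S_C = S_A \times S_B$. So the signs on both sides of the equation cancel and leave:
$2^{E_A + E_B} \times M_A \times M_B = 2^{E_C} \times (M_C + \delta_C)$ (3)
The RPC unit computes $A_H \times B_H \to C'$:
$2^{E_A + E_B} \times M_{A_H} \times M_{B_H} = 2^{E_{C'}} \times (M_{C'} + \delta_{C'})$ (5)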
After the hardware adds $E_A$ and $E_B$ and multiplies $M_A$ and $M_B$ with integer operations, it must normalize the result. To normalize, the hardware shifts $M_A \times M_B$ until only one non-zero bit is to the left of the binary point. The number of right shifts is then added to $E_A + E_B$ to get $E_C$. By Axiom 1, both $M_A$ and $M_B$ are less than 2. Therefore $M_A \times M_B$ is strictly less than 4 (100 in binary), and there can be at most 2 non-zero bits to the left of the binary point. Thus, $M_A \times M_B$ is right shifted by at most one bit.
In multiplication, the exponent of the product is equal to or is one larger than the sum of the exponents of the operands.
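The exponent rule can be spot-checked in software. This is a small sketch with hypothetical helper names; it assumes normalized single-precision operands whose product neither overflows nor underflows.

```python
import struct

def exponent(x: float) -> int:
    """Unbiased exponent of the single-precision encoding of x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return ((bits >> 23) & 0xFF) - 127

def product_exponent_offset(a: float, b: float) -> int:
    """E_C - (E_A + E_B) for the single-precision product of a and b."""
    return exponent(a * b) - (exponent(a) + exponent(b))

print(product_exponent_offset(1.25, 1.5))   # 1.875: no normalization shift
print(product_exponent_offset(1.5, 1.5))    # 2.25: one right shift
```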
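We apply this rule to both the FPU and the checker:
$E_C \in \{E_A + E_B,\ E_A + E_B + 1\}$ and $E_{C'} \in \{E_A + E_B,\ E_A + E_B + 1\}$
Because operands used by the checker are truncated from operands used by the full-precision FPU, the absolute value of $C'$ is less than or equal to the absolute value of C. So $E_{C'}$ must be less than or equal to $E_C$. Combining the conditions above, there are three cases we need to consider.
Common Case 1: $E_C = E_{C'} = E_A + E_B$ (6)
Common Case 2: $E_C = E_{C'} = E_A + E_B + 1$ (7)
Corner Case: $E_C = E_A + E_B + 1$ and $E_{C'} = E_A + E_B$ (8)
The common cases are discussed in Sections 5.1 and 5.2. For both common cases, because $E_{C_H} = E_{C'}$, Axiom 6 is valid, and thus we can subtract Eqn. (5) from Eqn. (3) to get:
$2^{E_A + E_B} \times (M_A M_B - M_{A_H} M_{B_H}) = 2^{E_C} \times (M_C - M_{C'} + \delta_C - \delta_{C'})$ (10)
where $\delta_C$ and $\delta_{C'}$ are the rounding errors of the FPU and the checker, respectively.
Eqn. (8) is the corner case when $E_{C_H} \ne E_{C'}$, so Diff no longer equals $(M_{C_H} - M_{C'}) \times 2^k$. We present this corner case in Appendix B.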
Common Case 1
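Using Eqn. (6), we simplify Eqn. (10) to
$M_{C_H} - M_{C'} = (M_A M_B - M_{A_H} M_{B_H}) + (-M_{C_L}) + (\delta_{C'} - \delta_C)$ (11)
where we have used $M_C = M_{C_H} + M_{C_L}$ and the three parenthesized terms are the boxed terms 1, 2, and 3.
Using Axiom 2 and Axiom 3, we can bound boxed term 1 in Eqn. (11), which we denote <1> hereafter. Writing $M_A = M_{A_H} + M_{A_L}$ and $M_B = M_{B_H} + M_{B_L}$ gives <1> $= M_{A_H} M_{B_L} + M_{A_L} M_{B_H} + M_{A_L} M_{B_L}$, so:
$0 \le$ <1> $\le 2 \times (2 - 2^{-k}) \times (2^{-k} - 2^{-23}) + (2^{-k} - 2^{-23})^2 < 4 \times 2^{-k}$
Using Axiom 3, we bound <2>:
$-2^{-k} + 2^{-23} \le$ <2> $\le 0$
Using Axiom 4 and Axiom 5, we bound <3>:
$-2^{-(k+1)} - 2^{-24} \le$ <3> $\le 2^{-(k+1)} + 2^{-24}$
We calculate the upper bound of Diff by summing the upper bounds of the three terms and multiplying by $2^k$:
$Diff < (4 \times 2^{-k} + 0 + 2^{-(k+1)} + 2^{-24}) \times 2^k = 4.5 + 2^{k-24}$
Similarly for the lower bound on Diff, we have:
$Diff \ge (0 - 2^{-k} + 2^{-23} - 2^{-(k+1)} - 2^{-24}) \times 2^k = -1.5 + 2^{k-24} > -1.5$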
The above analysis for the upper bound is simplified for purposes of providing intuition. With more sophisticated analysis (Appendix C), we can prove that the upper bound of Diff is strictly less than 4. The reason is that the three bounded terms above are not independent of each other and cannot reach their maximums at the same time. As Diff is an integer, the allowable range of values for Diff in this case is [-1,3].
Common Case 2
Using Eqn. 
The lower bound is the same as in Section 5.1 (i.e., -1). Prior work [9] proved a looser bound of [-7,7].
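The forward check for multiplication can be exercised with a small software model. This is a sketch under stated assumptions: the helper names are hypothetical, the checker is modeled with exact rational arithmetic followed by round-to-nearest-even at k fraction bits, and only positive normalized operands are handled. The third example was chosen by hand so that truncation of both operands and of the result line up to give Diff = 3, the upper end of the allowable range.

```python
import struct
from fractions import Fraction

def high_word(x, k):
    """Upper 9+k bits of the single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> (23 - k)

def word_value(word, k):
    """Exact value of a positive (9+k)-bit word."""
    mantissa = Fraction((1 << k) + (word & ((1 << k) - 1)), 1 << k)
    return mantissa * Fraction(2) ** ((word >> k) - 127)

def round_to_word(value, k):
    """Round a positive exact value to k fraction bits, ties to even."""
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    scaled = round(value * (1 << k))
    if scaled == 1 << (k + 1):            # rounding carried into the next binade
        scaled >>= 1
        exponent += 1
    return ((exponent + 127) << k) | (scaled - (1 << k))

def forward_check_mul(a, b, k):
    """Diff = C_H - C' for A x B -> C checked by A_H x B_H -> C'."""
    c_h = high_word(a * b, k)             # product of two floats is exact in a double
    c_prime = round_to_word(
        word_value(high_word(a, k), k) * word_value(high_word(b, k), k), k)
    return c_h - c_prime

print(forward_check_mul(1.5, 1.5, 7))                              # exact product
print(forward_check_mul(2 - 2**-23, 2 - 2**-23, 7))                # largest mantissas
print(forward_check_mul(171 / 128 - 2**-23, 191 / 128 - 2**-23, 7))
```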
RPC FOR DIVISION AND SQUARE ROOT
RPC for division and square root uses reverse checking with multiplication. We use the same reduced-precision multiplier that checks multiplication.
Division
To check $A \div B \to C$, the checker performs $B_H \times C_H \to A'$. In this section, we show that reverse checking of division has the same bounds on Diff as multiplication. The full precision FPU computes $A \div B \to C$:
The checker computes $B_H \times C_H \to A'$:
Using similar analysis as for multiplication, because $1 \le M_A, M_B < 2$ according to Axiom 1, the FPU's quotient $M_A \div M_B$ is in the open interval (0.5, 2). Therefore, the quotient needs to be left-shifted by at most 1 to be normalized such that one bit is to the left of the binary point.
In division, the exponent of the quotient is equal to or is one less than the difference between the exponents of the dividend and divisor.
Applying this rule to the FPU and retaining the same rule for the checker that we had for multiplication, we have:
The absolute value of $A'$ is less than or equal to the absolute value of A due to truncating the mantissas of B and C, and thus $E_{A'} \le E_A$. Combining the conditions above, there are again three cases we must consider:
The first two cases are presented in the following subsections. The corner case, when $E_{A_H} \ne E_{A'}$, can be analyzed similarly to the corner case of multiplication in Appendix B.
Common Case 1
By Axiom 4 and Axiom 5, <3> now has bounds that differ slightly from the bounds on <3> in Section 5.1. This slight difference in this term does not affect the overall bounds on $M_{A_H} - M_{A'}$ as compared to Section 5.
Again, the upper bound can never reach 4 for the same reason as in multiplication. Therefore, the allowable range of Diff is still [-1,3].
Common Case 2
Using Eqn. The boundary condition of <3> becomes
<1>, <2>, and <3> have the same boundary conditions as <1>, <2>, and <3> in Section 5.2, respectively. Therefore, both the upper and lower bounds of Diff are identical to those in Section 5.2.
Summary: The corner case of division can be proved in the same manner as multiplication presented in Appendix B. Overall, Diff has the same bounded range for division as for multiplication, which is [-1,3].
Square Root
Checking square root is almost identical to checking division. To check $\sqrt{A} \to C$, the checker performs $C_H \times C_H \to A'$. The analysis for square root is identical to the analysis for division except the two operands of the checker are the same. Therefore, the allowable range of Diff for square root is also [-1,3].
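Reverse checking of square root can be sketched in the same style. The helper names are hypothetical, the checker multiplier is modeled with exact rational arithmetic followed by round-to-nearest-even at k fraction bits, and the full-precision root is obtained by rounding Python's double-precision `math.sqrt` to single precision, which is an assumption that holds for the inputs shown.

```python
import math
import struct
from fractions import Fraction

def high_word(x, k):
    """Upper 9+k bits of the single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> (23 - k)

def word_value(word, k):
    """Exact value of a positive (9+k)-bit word."""
    mantissa = Fraction((1 << k) + (word & ((1 << k) - 1)), 1 << k)
    return mantissa * Fraction(2) ** ((word >> k) - 127)

def round_to_word(value, k):
    """Round a positive exact value to k fraction bits, ties to even."""
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    scaled = round(value * (1 << k))
    if scaled == 1 << (k + 1):            # rounding carried into the next binade
        scaled >>= 1
        exponent += 1
    return ((exponent + 127) << k) | (scaled - (1 << k))

def reverse_check_sqrt(a, k):
    """Diff = A_H - A', where A' = C_H x C_H and C is the rounded root of A."""
    c_h = word_value(high_word(math.sqrt(a), k), k)
    return high_word(a, k) - round_to_word(c_h * c_h, k)

print(reverse_check_sqrt(2.0, 7))   # C_H^2 rounds back up to exactly 2.0
print(reverse_check_sqrt(3.0, 7))   # C_H^2 lands one checker ulp below 3.0
```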
RPC FOR ADDITION/SUBTRACTION
Because of the cancellation problem we discussed in Section 3.2 for addition and subtraction [1] , we must separately consider the following two scenarios:
• addition of same-sign operands (SSADD) or subtraction of different-sign operands (DSSUB) (Section 7.1)
• addition of different-sign operands (DSADD) or subtraction of same-sign operands (SSSUB) (Section 7.2)
Same-Sign Addition (SSADD) or Different-Sign Subtraction (DSSUB)
For both addition of same-sign operands ($S_A = S_B$) and subtraction of different-sign operands ($S_A \ne S_B$), RPC checks the result of the baseline FPU using forward checking, i.e., $A + B \to C$ is checked with $A_H + B_H \to C'$, and $A - B \to C$ is checked with $A_H - B_H \to C'$. These two situations can be easily converted to one another to get $A + (-B) \to C$. The resulting operation is an addition with same-sign operands (A and -B). As a result, SSADD and DSSUB have identical upper and lower bounds for Diff. Without loss of generality, we explain only SSADD in this section but the bounds apply equally to DSSUB.
The full-precision FPU computes $A + B \to C$:
The RPC checker computes $A_H + B_H \to C'$:
As with multiplication, division, and square root, a key aspect of our derivations is to understand the relationship between the exponents of the operands and results. To sum two numbers with the same sign, the mantissa of the operand with the smaller exponent must right shift until its exponent matches the operand with the larger exponent, $E_L = \max(E_A, E_B)$. If exactly one bit is to the left of the binary point after summing the two mantissas, then the exponent of the result equals $E_L$. Else, if more than one bit is to the left of the binary point, then the sum of the mantissas must right-shift accordingly, and the exponent of the result equals $E_L$ plus the number of right shifts. The sum of the two mantissas reaches its maximum when both operands have the same exponent (i.e., no shift in mantissa of either operand), and this maximum sum of the mantissas is strictly less than 4 (100 in binary) by Axiom 1. As a result, the sum of the two mantissas has at most two bits to the left of the binary point, and thus the number of right shifts is either zero or one.
Based on this analysis, the exponent of the sum, $E_C$, equals $E_L$ or $E_L + 1$. When $E_C = E_L$, we further know that $E_A \ne E_B$ because, if $E_A = E_B$, then we can add $M_A$ and $M_B$ without shifting. The smallest value $M_A + M_B$ can be is 2 (10 in binary) because $M_A, M_B \ge 1$ according to Axiom 1. Thus, the sum must right shift to normalize, which violates $E_C = E_L$.
For addition, the exponent of the sum is equal to or one greater than the maximum of the operands' exponents ($E_L$). Furthermore, when the exponent of the sum equals $E_L$, the two operands must have unequal exponents.
Applying this rule to the FPU and checker, we have:
The absolute value of $C'$ must be less than or equal to the absolute value of C. Therefore, $E_{C'} \le E_C$. We present the corner case for Eqn. (25), when $E_{C_H} \ne E_{C'}$, in Appendix B.
Common Case 1
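In this case, $E_C = E_{C'} = E_L$ and $E_A \ne E_B$. Subtracting the checker's equation from the FPU's equation and dividing by $2^{E_L}$ expresses $M_{C_H} - M_{C'}$ as the sum of three terms: <1> $= M_{A_L} \times 2^{E_A - E_L} + M_{B_L} \times 2^{E_B - E_L}$, <2> $= -M_{C_L}$, and <3> $= \delta_{C'} - \delta_C$. Because one of the two scaling factors in <1> equals 1 and the other is at most $2^{-1}$, Axiom 3 gives $0 \le$ <1> $\le 1.5 \times (2^{-k} - 2^{-23})$ and $-2^{-k} + 2^{-23} \le$ <2> $\le 0$. Using Axiom 4 and Axiom 5, we can bound <3>:
$-2^{-24} - 2^{-(k+1)} \le \delta_{C'} - \delta_C \le 2^{-24} + 2^{-(k+1)}$
As a result, the upper bound of Diff is:
$Diff \le (1.5 \times (2^{-k} - 2^{-23}) + 0 + 2^{-24} + 2^{-(k+1)}) \times 2^k = 2 - 2^{k-23} < 2$
The lower bound of Diff is
$Diff \ge (0 - 2^{-k} + 2^{-23} - 2^{-24} - 2^{-(k+1)}) \times 2^k = -1.5 + 2^{k-24} > -1.5$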
Therefore, the allowable range of Diff is [-1,1].
Common Case 2
Without loss of generality, we can again explain only ≥ and ≥ . So the upper bound of Diff is:
The lower bound of Diff is still greater than -1.5. Therefore, the allowable range of Diff is [-1,1].
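Forward checking of same-sign addition can be modeled as follows. The helper names are hypothetical; the checker adder is modeled with exact rational arithmetic followed by round-to-nearest-even at k fraction bits, which idealizes the alignment and guard-bit handling of a real adder. The sketch also assumes positive normalized operands whose sum is exact in double precision before it is rounded to single precision.

```python
import struct
from fractions import Fraction

def high_word(x, k):
    """Upper 9+k bits of the single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> (23 - k)

def word_value(word, k):
    """Exact value of a positive (9+k)-bit word."""
    mantissa = Fraction((1 << k) + (word & ((1 << k) - 1)), 1 << k)
    return mantissa * Fraction(2) ** ((word >> k) - 127)

def round_to_word(value, k):
    """Round a positive exact value to k fraction bits, ties to even."""
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    scaled = round(value * (1 << k))
    if scaled == 1 << (k + 1):            # rounding carried into the next binade
        scaled >>= 1
        exponent += 1
    return ((exponent + 127) << k) | (scaled - (1 << k))

def forward_check_add(a, b, k):
    """Diff = C_H - C' for A + B -> C checked by A_H + B_H -> C'."""
    c_prime = round_to_word(
        word_value(high_word(a, k), k) + word_value(high_word(b, k), k), k)
    return high_word(a + b, k) - c_prime

print(forward_check_add(1.5, 1.5, 7))                       # exact sum
print(forward_check_add(1 + 2**-7 - 2**-23, 2**-8, 7))      # truncated bits matter
```

In the second call, the bits truncated from A would have pushed the sum across a checker ulp boundary, while the checker's own sum lands on a tie and rounds down, so Diff is 1.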
Different-Sign Addition (DSADD) or Same-Sign Subtraction (SSSUB)
Recall that our simple example in Section 3.2 was a SSSUB that motivated reverse checking. The situation occurs when $E_A = E_B$ and $M_{A_H} = M_{B_H}$, but $M_{A_L} \ne M_{B_L}$. In this situation, the result of $A - B \to C$ computed by the full-precision FPU has a non-zero value, whereas $A_H - B_H \to C'$ computed by the checker (i.e., with forward checking) equals zero. To resolve the challenge, the checker applies reverse checking using $C_H$ as one of its operands. In this way, we use SSADD/DSSUB to check SSSUB/DSADD.
For SSSUB: $A - B \to C$, $S_A = S_B$:
• If $S_C = S_A$, the checker computes $B_H + C_H \to A'$ and compares $A'$ to $A_H$.
• If $S_C \ne S_A$, the checker computes $A_H - C_H \to B'$ and compares $B'$ to $B_H$.
In the following section, we analyze the first scenario above: SSSUB with $S_A = S_B = S_C$. The other cases can be verified in a similar manner.
The full-precision FPU computes $A - B \to C$:
The checker performs $B_H + C_H \to A'$:
To understand the relationships between $E_A$, $E_B$, and $E_C$ in SSSUB (which also apply to DSADD), consider $A - B = \tilde{C}$, where $\tilde{C}$ is the true C with unlimited precision. Then $B + \tilde{C} = A$. From our previous analysis in Section 7.1, we have: Scenario (1&4) differs from Scenario (1&3) only when > , so = max , = in (1). Then because of (4), = − 1. So = − 1 = max , . Thus, Scenario (1&4) is impossible because it implies that abs(C)>abs(A), which is impossible after B is subtracted from C, given that = . Scenario (2&4) differs from (2&3) only when
, which is the same as scenario (1&3).
In subtraction, the exponent of the minuend (A) is equal to or one greater than $E_L$, which is the maximum of the exponents of the subtrahend (B) and difference (C). Furthermore, when the exponent of the minuend equals $E_L$, the subtrahend and the difference must have unequal exponents.
Applying this rule to the FPU and applying the rule for addition to the checker, we have: 
Common Case 1
In term <1>, because of Eqn. (29), either $2^{E_B - E_A}$ or $2^{E_C - E_A}$ must equal 1 because $\max(E_B, E_C) = E_A$. The other is less than or equal to $2^{-1}$ because $E_B \ne E_C$. As a result, <1> is bounded by:
$0 \le M_{B_L} \times 2^{E_B - E_A} + M_{C_L} \times 2^{E_C - E_A} \le (2^{-k} - 2^{-23}) + (2^{-k} - 2^{-23}) \times 2^{-1} = 1.5 \times (2^{-k} - 2^{-23})$
For the same reason, $2^{E_C - E_A}$ in <3> is smaller than or equal to 1. So we can bound <3>:
$-2^{-24} - 2^{-(k+1)} \le \delta_C \times 2^{E_C - E_A} + \delta_{A'} \le 2^{-24} + 2^{-(k+1)}$
Term <2> is bounded to $[-2^{-k} + 2^{-23}, 0]$ according to Axiom 3. As <1>, <2>, and <3> have the same boundary conditions as <1>, <2>, and <3> in Section 7.1.1, respectively, the bounds on Diff are identical to those in Section 7.1.1. Therefore, Diff is in the range [-1,1].
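Reverse checking of same-sign subtraction can be sketched for the case $S_A = S_B = S_C$, with positive operands and $A > B$. The helper names are hypothetical, and the checker adder is again modeled with exact rational arithmetic followed by round-to-nearest-even at k fraction bits. The first example is a binary cancellation case in which a forward check would compute zero; the second shows Diff reaching the lower end of the allowable range.

```python
import struct
from fractions import Fraction

def high_word(x, k):
    """Upper 9+k bits of the single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> (23 - k)

def word_value(word, k):
    """Exact value of a positive (9+k)-bit word."""
    mantissa = Fraction((1 << k) + (word & ((1 << k) - 1)), 1 << k)
    return mantissa * Fraction(2) ** ((word >> k) - 127)

def round_to_word(value, k):
    """Round a positive exact value to k fraction bits, ties to even."""
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    scaled = round(value * (1 << k))
    if scaled == 1 << (k + 1):            # rounding carried into the next binade
        scaled >>= 1
        exponent += 1
    return ((exponent + 127) << k) | (scaled - (1 << k))

def reverse_check_sub(a, b, k):
    """Diff = A_H - A', where A' = B_H + C_H and C = A - B (A > B > 0)."""
    c = a - b                             # exact in a double for the inputs below
    a_prime = round_to_word(
        word_value(high_word(b, k), k) + word_value(high_word(c, k), k), k)
    return high_word(a, k) - a_prime

print(reverse_check_sub(1 + 2**-10, 1 + 2**-12, 7))   # cancellation case
print(reverse_check_sub(2 - 2**-23, 1.0, 7))          # tie in the checker rounds up
```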
Common Case 2

MAXIMUM PERCENTAGE ERROR
With bounds on Diff, we can now approximate the Maximum Percentage Error (MPE) for errors undetected by RPC. This derivation is an approximation in that it applies to all common situations but not some of the extremely rare corner cases. This section shows that the undetected errors are small. We compute the percentage error as:
We have proven that Diff is in the range of [-1,1] for addition/subtraction and [-1,3] for multiplication, division, and square root. For the common cases, Diff corresponds to the difference between the fractions of the baseline FPU and the checker.
We represent an erroneous computation with an "E" above the right-arrow and an erroneous result as .
Forward Checking
In forward checking, the FPU and the checker compute → and →
RPC only misses an error when Diff is within the derived bounds. Also
Reverse Checking
In reverse checking of division (which also applies to square root), Diff represents the difference in the fractions of and A'. However, what we are really interested in is the difference between and C. We know that: ÷ → and × =
The percentage error is:
As − ≥ and ≥ 1, MPE is loosely bounded by
In reverse checking of SSSUB (also applies to DSADD), − → and
Summary
Because $\max|Diff|$ for SSSUB is just 1, MPE is approximately $\max|Diff| \times 2^{-k} \times 100\%$ for all operations. This approximate analysis does not completely bound the maximum percentage error uncaught by RPC, but rather shows that, under most circumstances, undetected errors have very small percentage errors. For example, the MPE for a 16-bit checker (k=7) is 0.78% for addition and subtraction, and 2.34% for the other operations.
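The approximate MPE is simple enough to tabulate directly. The function name below is hypothetical; the formula is the approximation stated above, with $\max|Diff|$ equal to 1 for addition/subtraction and 3 for the other operations.

```python
def approx_mpe_percent(max_abs_diff: int, k: int) -> float:
    """Approximate maximum percentage error: max|Diff| x 2^-k x 100%."""
    return max_abs_diff * 2.0**-k * 100.0

# 16-bit checker (k = 7): addition/subtraction, then mul/div/sqrt
print(approx_mpe_percent(1, 7), approx_mpe_percent(3, 7))

# Each additional fraction bit in the checker halves the approximate MPE
for k in (1, 7, 15, 23):
    print(k, approx_mpe_percent(1, k), approx_mpe_percent(3, k))
```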
IMPLEMENTATION AND INTEGRATION INTO PROCESSOR CORE
We implemented RPC as an extension of a floating point unit developed by Kwon et al. [3] . Our implementation of RPC includes the following four components: one k+9-bit adder/subtractor, one k+9-bit multiplier, logic to determine how to perform the checking, and buffers to hold operands and results until checking can be performed.
Handling Floating Point Exceptions
When certain floating point exceptions occur, RPC is unable to check for errors. The reason for this limitation is that the assumptions we make about rounding error are valid only when the FPU does not encounter the following exceptions: overflow, underflow, invalid, and divide-by-zero. In these circumstances, the result of the FPU is formatted to positive/negative infinity, zero or denorm, NaN, and positive/negative infinity, accordingly. Therefore, our checker is suppressed when these rare cases are detected.
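One way to picture the suppression condition is as a test on the exponent field of the FPU's result. This is only a sketch with a hypothetical function name; it assumes that an all-zeros or all-ones exponent field (zero, denorm, infinity, or NaN) is what triggers suppression, whereas the actual hardware may instead use the FPU's exception flags.

```python
import struct

def checker_suppressed(result: float) -> bool:
    """True if the single-precision result is zero, denorm, infinity, or NaN."""
    bits = struct.unpack(">I", struct.pack(">f", result))[0]
    exponent_field = (bits >> 23) & 0xFF
    return exponent_field == 0 or exponent_field == 0xFF

print(checker_suppressed(float("inf")), checker_suppressed(1.0))
```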
Performance Impact
RPC can impact performance if the processor core is waiting for a result to be checked before it can commit a floating point instruction. For some floating point operations, we can perform checking in parallel with the full-precision FPU, but reverse checking does not permit this parallelism. If we try to begin each check as soon as it has its operands, we could encounter a situation in which a reverse check and a forward check need to use the same checker at the same time. To avoid the complexity of managing these situations, we choose to have the checker always wait until the FPU has completed (even for forward checking).
As the checker is modified based on the baseline FPU, it should have at least the same throughput and thus the performance impact of RPC is limited to the extra reorder buffer (ROB) pressure incurred by floating point instructions that are ready to commit but have not yet been checked. The performance penalty due to this extra ROB pressure depends on the microarchitecture and the software running on it, but we do not expect it to be large.
ERROR DETECTION COVERAGE
The primary goal of our experimental evaluation is to determine the error detection coverage of RPC.
Error Injection Methodology
To evaluate the error detection coverage, we ran an extensive set of error injection experiments. In each experiment, we forced a single stuck-at error on a different wire in the flattened netlist that includes the full-precision FPU and all of the RPC hardware. Note that a single stuck-at error on a wire can often cause multiple errors downstream of the injected error, due to fan-out, and thus injecting errors at this low level is far more realistic than injecting errors in a microarchitectural simulator [4] . For every wire in the netlist, we ran 2000 experiments. Half of the experiments inject a stuck-at-0 error on the wire and test 1000 different inputs; the other half inject a stuck-at-1 on the wire and test the same 1000 random inputs.
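The flavor of a stuck-at experiment can be conveyed with a toy software model. This is a deliberate simplification and not the methodology above: it forces a single bit of the FPU's 32-bit output rather than an internal netlist wire, it assumes the checker itself is fault-free, and the function names are hypothetical.

```python
import struct

def high_word(bits: int, k: int) -> int:
    """Upper 9+k bits of a 32-bit single-precision pattern."""
    return bits >> (23 - k)

def classify_stuck_at(correct_bits, bit, stuck_value, checker_word, k, lo, hi):
    """Classify a stuck-at fault on one output bit of the full-precision result."""
    if stuck_value:
        faulty = correct_bits | (1 << bit)
    else:
        faulty = correct_bits & ~(1 << bit)
    if faulty == correct_bits:
        return "masked"                       # fault does not change the result
    diff = high_word(faulty, k) - checker_word
    return "UMD" if not (lo <= diff <= hi) else "UMUD"

# 1.5 x 1.5 = 2.25 has bit pattern 0x40100000; the checker's product is exact here,
# so its error-free word equals the upper 16 bits of the correct result.
c_bits = struct.unpack(">I", struct.pack(">f", 1.5 * 1.5))[0]
checker_word = high_word(c_bits, 7)

for bit, value in [(20, 1), (0, 1), (22, 1), (20, 0)]:
    print(bit, value, classify_stuck_at(c_bits, bit, value, checker_word, 7, -1, 3))
```

Faults confined to the low 23-k bits never reach the compared word and are undetected, which is the source of the small undetected errors that RPC tolerates by design.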
We considered injecting transient errors, but a very large fraction of transient errors were masked (i.e., had no impact on execution). Masking is quite common, as in prior work in error injection [8] , because each error affects the results for only a subset of all possible inputs, and often these subsets are tiny. Transient errors, in particular, are masked with very high probability. To obtain statistically significant data in a tractable amount of time, we injected permanent stuck-at errors, which are less likely to be masked. Moreover, a floating point operation is a relatively short latency event, which makes transient errors similar to permanent errors.
Results
In Figure 3 , we show our results by classifying errors into categories, and we present a separate graph for each of the five arithmetic operations (add, sub, mul, div, sqrt). The figures focus on the errors that are unmasked, i.e., those errors that affect the result of the FPU and/or the result of the checker.
Among the unmasked errors, we classify the errors into four categories:
• unmasked and detected (UMD) - our desired outcome
• unmasked undetected (UMUD) - a silent data corruption, which is the worst outcome
• unmasked unchecked (UMUC) - unusual corner case when RPC suppresses checking
• false positive (FP) - error affected checker and is detected even though result of FPU is correct
The first observation from Figure 3 is that there is a clear trade-off between error detection capability and cost. As the checker is made wider, there are fewer undetected errors.
A corollary to this first result is that there are also more false positives, because the checker is larger and thus more liable to be the victim of an injected error. Notice that there are fewer false positives in division and square root than addition, subtraction, and multiplication; this is because the divider and square root unit are significantly larger than the checker multiplier. Unlike a dual modular redundancy scheme (i.e., simply replicate the unit to be checked), which has 50% false positives, RPC can reduce the fraction of false positives by having a smaller checker.
From Figure 3 , it appears RPC misses a large portion of unmasked errors at small checker width (such as 16, i.e., 7 bits mantissa). However, Figure 4 shows that RPC detects the vast majority of the large errors. Figure 4 presents the distribution of the size of the unmasked undetected (UMUD) errors. In these graphs, the top, middle, and bottom bars of the vertical box indicate the first quartile, median, and second quartile of percentage errors, respectively. The top whisker defines a 1.5 inter-quartile range away from the first quartile, and data above here are defined as outliers. The connected green diamonds mark the approximate MPE computed in Section 8. The percentage number above the diamonds is the percentage of UMUD errors that has percentage error larger than the approximate MPE. An error larger than the approximate MPE includes the corner case scenarios and faults in a handful of components that, when faulty, can cause extremely strange behavior. For example, an error in the wire that determines whether the output should be formatted in an exceptional way (e.g., as a NaN) can cause the output to differ dramatically from the correct result and cause an outlier in the percentage error calculation.
We observe that only a very small fraction of UMUDs have percentage errors that are not bounded by the MPE. Overall, the percentage errors of UMUDs are extremely small; the median is less than 0.1% even with a minimally-sized 1-bit mantissa (10-bit checker) for all operations. At checker width 16, only 0.003% to 0.055% of undetected errors have a percentage error larger than the approximate MPE (0.78% for addition and subtraction, and 2.34% for the other operations). This means that the vast majority of the undetected errors in Figure 3 at width 16 are very small.
AREA, POWER, AND ENERGY OVERHEADS
We now evaluate the area and power overheads of our RPC implementation.
Area
We used Synopsys CAD tools to lay out the FPU, with and without RPC, in 45nm technology [7]. The results in Figure 5(a) show that RPC's area overhead ranges from about 30% to about 90%, depending on the checker's width. This overhead is far less than that of simple dual modular redundancy (100%).
Energy
We compute the per operation energy overhead using the power and latency results from the CAD tools. As with the area analysis, we determine the dynamic and static power overheads of RPC with post-synthesis gate-level power estimation. After laying out the circuitry, we obtained the parasitic resistances and capacitances and back-annotated the circuits with them.
To determine the power, we drive the synthesized module with 1000 random inputs for each operation at every checker width to acquire the switching activity of each wire and cell, which is then used to compute both dynamic and static power. To minimize the power consumed by the buffers for reverse checking, we use clock gating.
Our experiments show that static power comprises roughly 20% of overall power and, unlike dynamic power, is relatively stable across different operations and checker widths. Addition and subtraction suffer the largest energy overhead: because they are relatively cheap operations, the checking logic is a nontrivial fraction of their cost. However, for widths below 18, the overhead is still lower than that of dual modular redundancy. Multiplication incurs somewhat less energy overhead, reaching 100% at width 28. For the most expensive operations, division and square root, the checker multiplier is significantly smaller than the unit it checks, so RPC is very efficient and has less than 70% overhead at full checker width. Note that the overheads in Figure 5(b) include the impact of the checker, logic, and buffers for each operation in isolation. However, the checker adder is shared between addition and subtraction; the checker multiplier is shared among the other three operations; and the logic and buffers are shared among all operations. The overheads are therefore not additive for a workload that mixes all five operations.
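The per-operation energy overhead described above can be sketched as follows, assuming that energy is simply average power multiplied by operation latency. The function names and the numbers in the example are hypothetical placeholders, not measurements from Figure 5.

```python
def energy_per_op_pj(power_mw: float, latency_ns: float) -> float:
    """Energy per operation in picojoules (mW * ns = pJ)."""
    return power_mw * latency_ns


def energy_overhead(base_power_mw: float, base_latency_ns: float,
                    rpc_power_mw: float, rpc_latency_ns: float) -> float:
    """Relative energy overhead of FPU+RPC over the unprotected FPU.

    ASSUMPTION: energy = power * latency for a single operation; the
    inputs are hypothetical stand-ins for the CAD tool outputs.
    """
    base = energy_per_op_pj(base_power_mw, base_latency_ns)
    rpc = energy_per_op_pj(rpc_power_mw, rpc_latency_ns)
    return (rpc - base) / base


if __name__ == "__main__":
    # Hypothetical numbers: 10 mW baseline, 16 mW with the checker,
    # both completing in 2 ns.
    print(f"{energy_overhead(10.0, 2.0, 16.0, 2.0):.0%}")
```

An overhead of 1.0 (100%) from this function corresponds to the dual modular redundancy reference point used in the comparison above.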
RELATED WORK
The most related work is the prior work on RPC, which we have already discussed [1] [9] . In addition, there have been other proposed schemes for detecting errors in floating point hardware.
Lipetz and Schwarz [5] propose residue checking, which is complete and cost-efficient in principle, but we could not determine how their scheme handles rounding. We speculate that, based on an IBM patent [2], the FPU passes rounding information to the residue checker, but such a design would be unable to detect errors in the rounding logic.
Maniatakos et al. [6] propose a low-cost scheme in which the hardware checks only the exponents of floating point operations. Like us, they seek to detect all large errors, but checking only exponents misses many errors that we consider too large to be acceptable.
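To make the limitation of exponent-only checking concrete, the following sketch flips the most significant fraction bit of a single-precision value and compares the exponent fields. It is an illustration written for this discussion, not the scheme of Maniatakos et al.; the helper names are hypothetical.

```python
import struct


def float_to_bits(x: float) -> int:
    """Bit pattern of x as an IEEE 754 single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def exponent_field(bits: int) -> int:
    return (bits >> 23) & 0xFF


def leading_fraction_bits(bits: int, k: int) -> int:
    """The k most significant of the 23 fraction bits."""
    return (bits & 0x7FFFFF) >> (23 - k)


def flip_fraction_msb(x: float) -> float:
    """Inject a single-bit error into the top fraction bit (bit 22)."""
    return bits_to_float(float_to_bits(x) ^ (1 << 22))


if __name__ == "__main__":
    correct = 1.0
    faulty = flip_fraction_msb(correct)
    rel_err = abs(faulty - correct) / correct
    same_exp = (exponent_field(float_to_bits(correct))
                == exponent_field(float_to_bits(faulty)))
    print(faulty, rel_err, same_exp)
```

The faulty value differs from the correct one by 50% while the exponent field is unchanged, so a comparison of exponents alone sees nothing; the leading fraction bits, which a reduced-precision comparison also examines, do differ.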
CONCLUSIONS
In this work, we have applied RPC to an entire floating point unit. We have comprehensively analyzed its ability to detect errors and the size of errors that can go undetected. Based on our results, we believe that RPC is an attractive method for detecting errors in FPUs. We are unaware of other approaches for detecting errors in FPUs that can be tuned to trade off error detection coverage versus cost.
≤ 1, but they cannot both equal 1 at the same time. Thus, we have: < + < 2 × 2 Therefore, must be in the form of 0. {0 { … , where each x represents an unknown bit and can be either zero or one. Each x is independent from each other and they may or may not hold the same value.
is all zero after its first k bits to the right of its binary point, i.e., it is in the form 1. { {0 … : 0. 000 … 000 … : The first (k-2) values of x in the fraction of must be 1 for the sum to reach 2. The sum of the last two x bits of (denoted as ab) and last two y bits of (denoted as cd) must be greater than or equal to 4 = 100 to produce the carry one.
: 0. 000 … 00 … + : 1. 111 … 11 0 … 0 + : 10. 000 … 00 … : 1. 000 … 000 … k bits
Now Diff can be evaluated in the following steps:
Consider all possible values of ab, cd, and ef, where $ab + cd = 1ef_2$. The two-bit values ab and cd are both less than or equal to $11_2 = 3$, so $ab + cd \le 6$. Therefore, ef cannot be $11_2$, as that would require $ab + cd = 111_2 = 7$. Thus, the possible values of ab, cd, ef, and Diff are as follows:
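| ab | cd | 1ef | 10e | Diff = 10e − cd |
|----|----|-----|-----|-----------------|
| 01 | 11 | 100 | 100 | 1 = 1 |
| 11 | 01 | 100 | 100 | 11 = 3 |
| 10 | 10 | 100 | 100 | 10 = 2 |
| 10 | 11 | 101 | 100 | 1 = 1 |
| 11 | 10 | 101 | 100 | 10 = 2 |
| 11 | 11 | 110 | 101 | 10 = 2 |

Therefore, Diff is still within the range [-1,3] of multiplication.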
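The enumeration can be checked mechanically. The sketch below assumes, as read off the listed values, that 10e denotes the three-bit value formed by the bits 1, 0, and e, and that only pairs with ab + cd ≥ 4 (those that produce the carry) are considered. The function name is illustrative.

```python
def enumerate_diff():
    """List (ab, cd, 1ef, 10e, Diff) for every two-bit ab, cd that carries.

    ASSUMPTION: '10e' is the binary value with bits 1, 0, e, where e is
    the middle bit of the three-bit sum 1ef = ab + cd.
    """
    rows = []
    for ab in range(4):
        for cd in range(4):
            total = ab + cd          # equals 1ef when a carry occurs
            if total < 4:            # no carry out of the two-bit field
                continue
            e = (total >> 1) & 1     # middle bit of 1ef
            ten_e = 0b100 | e        # bits 1, 0, e
            rows.append((ab, cd, total, ten_e, ten_e - cd))
    return rows


if __name__ == "__main__":
    for ab, cd, total, ten_e, diff in enumerate_diff():
        print(f"{ab:02b} {cd:02b} {total:03b} {ten_e:03b} {diff}")
```

Every row produced this way has Diff equal to 1, 2, or 3, which agrees with the range stated above.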
Appendix C: Upper Bound of Diff in Multiplication
In Section 5.1 we proved the upper bound of Diff is less than 4.5, indicating that this integer difference can be as large as 4. However, after billions of simulations, we never observed Diff equaling 4. In this section, we introduce a more sophisticated proof that bounds Diff to be less than 4 (i.e. − < 4 × 2 ) and thus this integer difference can be at most 3. This proof also implies that the upper bounds on Diff for division and square root are also less than 4.
To bound Diff, we must understand how differs from . Thus, we need to understand how rounding happens in the checker and how rounding in the FPU affects the most significant k bits in (i.e. ). In the following analysis, we denote the unrounded (unlimited precision) result of the FPU's mantissa as and the unrounded result of the checker's mantissa as . With unlimited precision, has more than 23 bits of fraction (i.e., bits after the binary point), and has more than k bits of fraction even though their input operands have only 23 and k bits of fractions, respectively. We split both and into High and Low terms: = + and = + , where refers to the implicit 1 and the first k bits of the fraction; refers to the bits less significant than . For example, = 1. { … , = 1. { , = 0. {0 { … . We derive an upper bound on − in three steps:
Step 1: Analyze the relationship between and .
Step 2: Analyze (a) how rounding in the FPU affects and (b) how rounding in the checker affects .
Step 3: Analyze the relationship between and .
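As a concrete illustration of the question that Step 2(a) asks, the sketch below rounds an unrounded mantissa to 23 fraction bits and reports whether the implicit 1 and the leading k fraction bits change. It assumes round-to-nearest-even and models the unrounded value as an integer with a few extra guard bits; the helper names, the choice of four guard bits, and the test patterns are all hypothetical.

```python
FRACTION_BITS = 23


def round_nearest_even(value: int, drop: int) -> int:
    """Drop the low `drop` bits of value using round-to-nearest-even."""
    quotient, remainder = divmod(value, 1 << drop)
    half = 1 << (drop - 1)
    if remainder > half or (remainder == half and quotient & 1):
        quotient += 1
    return quotient


def high_bits_change(unrounded: int, guard_bits: int, k: int) -> bool:
    """True if rounding alters the implicit 1 plus the leading k fraction bits."""
    before = unrounded >> (FRACTION_BITS + guard_bits - k)
    rounded = round_nearest_even(unrounded, guard_bits)
    after = rounded >> (FRACTION_BITS - k)
    return before != after


def make_mantissa(k: int, guard_bits: int, guard_value: int,
                  low_all_ones: bool) -> int:
    """Build 1.000...0 in the leading k bits, then fraction bits k+1..23."""
    low = (1 << (FRACTION_BITS - k)) - 1      # bits k+1..23 all ones
    if not low_all_ones:
        low -= 1                              # clear the 23rd fraction bit
    leading_one = 1 << (FRACTION_BITS + guard_bits)
    return leading_one | (low << guard_bits) | guard_value


if __name__ == "__main__":
    k, g = 7, 4
    print(high_bits_change(make_mantissa(k, g, 0b1001, True), g, k))   # rounds up, carry ripples
    print(high_bits_change(make_mantissa(k, g, 0b0111, True), g, k))   # rounds down
    print(high_bits_change(make_mantissa(k, g, 0b1001, False), g, k))  # carry is absorbed
```

In this model the leading bits change only when the discarded bits force a round-up and every fraction bit from position k+1 through 23 is 1, so that the carry ripples all the way up.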
Step 1 Given + = in Eqn. (2)
Step 2.a For the FPU, = + . So = + = + + The only way in which ≠ is if, after rounding, the carried one at the 23rd bit propagates to the most significant k bits. For this to happen, the (k+1)th to the 23rd bits after the binary point in must all equal 1. To produce the carried one, the bits after the 23rd bit of must be larger than 2 , so rounds up to get . Thus, ≠ when > 0. {0 {1 {1 = 2 − 2 . After rounding, all bits in must be zero because they all sum with the carried one. For purposes of analysis, we consider two ranges of values for :
Step 2.b For the checker, = + = + Notice there is no term, because is produced by the checker and has only k bits after the binary point. For analysis, we consider two ranges of values for :
Range 2: 2 ≤ < 2
In Range 1, M rounds down to get M . equals after rounding.
Step 3
We now have to consider all possible combinations of cases from Steps 1, 2.a, and 2.b. Because there are two cases in each step, there are $2^3 = 8$ theoretical combinations, but some combinations are impossible.
• Case I-Case III-Case V: Plug in the equations for the high bits . Then the boundary condition for * is 2 − 2 < * < 2 Since the upper bound is smaller than the lower bound, because k<24, there is no * that satisfies the requirements of Case I-Case IV-Case V, i.e. the case does not exist.
• Case I-Case III-Case VI: Plug Case III and Case VI into Case is in the range [0, 2 . Then 2 × 2 − 2 < * < 1.5 × 2 Because the upper bound is smaller than the lower bound, this case is impossible.
• Case II-Case IV-Case VI (impossible): As in Case II-Case III-Case V, + * is in the range ( 2 × 2 − 2 , 2 × 2 ). is in the range [2 , 2 . Thus 1.5 × 2 − 2 < * < 1.5 × 2 Because * < 2 , this case is impossible.
Conclusion Overall, − < 4 × 2 . Therefore = − × 2 in Section 5.1 is in [-1,3] .
