Monte-Carlo arithmetic is a form of self-validating arithmetic that accounts for the effect of rounding errors. We have implemented a floating point unit that can perform either IEEE 754 or Monte-Carlo floating point computation, allowing hardware accelerated validation of results during execution. Experiments show that our approach has a modest hardware overhead and allows the propagation of rounding error to be accurately estimated.
INTRODUCTION
Rounding error is inevitable for all finite precision computations. The most common solution is to perform each arithmetic operation at sufficiently high precision so the accumulation of error in the result is within an acceptable limit. Static analysis using affine arithmetic can also be used to estimate the propagation of rounding error in fixed point computations [3, 9] and this technique is widely used in bit-width optimization of digital circuits.
Unfortunately, analysis of rounding error propagation in floating point computations is not as straightforward. Unlike fixed point operations, the error is not bounded in a fixed interval and depends on the magnitude of the operands. A method for static analysis of floating point error based on affine arithmetic is proposed in references [4] and [5] . These methods depend on input range information and can Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. produce overly pessimistic error bounds unless the range is bound to a small interval. In a general computing problem, the range of possible input values can either be large, or simply unavailable before runtime. Moreover, associativity of mathematical operations does not hold in floating point computations and the analysis of rounding error is highly dependent on the sequence of operations. In practice, to produce correct results, numerical computing applications implemented in software often rely on the inherent stability of the algorithm rather than careful error analysis. When such an algorithm is implemented in reconfigurable hardware, any change to the floating point implementation and sequence of operations can compromise its stability.
A different, complementary approach for dealing with rounding errors is to track their propagation at runtime. Such self-validating numerical methods can produce not only the required result, but also an error bound. A traditional approach is interval arithmetic [8] which produces strict upper and lower bounds by operating on an interval instead of a point. Unfortunately, the bound is overly pessimistic in most cases, a major reason being that this technique does not take correlations into account. Affine arithmetic [14] overcomes this problem, however, the length of the error term grows as the computation proceeds, and estimating the propagation of rounding errors at runtime is very expensive. Another approach suitable for runtime implementation is the CES-TAC method [16] in which the computation is repeated using three different rounding modes and the part of the result that is the same for all rounding modes are assumed to be the significant digits.
Monte Carlo Arithmetic (MCA) [11] can track rounding errors at runtime by applying randomization to make rounding errors behave like random variables. Over a number of trials, a normal computation is turned into a Monte Carlo simulation and hence statistics on the effect of rounding errors can be obtained. Apart from floating point, MCA has also been applied to logarithmic number systems [17] . In this paper we focus on applying MCA to detect catastrophic cancellation which is the major cause of loss of significant digits in a computation.
While self-validating numerical methods produce valuable information on how rounding error affects the accuracy of the result, their implementation to date has mainly been in software which suffers from poor performance compared with hardware. This is especially problematic since applications that require high accuracy often require high performance.
In a field-programmable computing device, the flexibility to use custom floating point units to improve performance is present. A limited number of hardware implementations of interval arithmetic can be found in the literature [1, 12, 13] . A hardware implementation of the CESTAC method has also been published [2] . We are not aware of any hardware implementations of affine or Monte Carlo floating point arithmetic.
In this paper we describe a novel self-validating floating point unit (FPU) which uses MCA to track rounding error propagation. We also show that the area and performance overheads are modest compared to a standard FPU. We believe that self-validating numerical methods are important in reconfigurable computing for the follow reasons:
• The ability to produce an error estimate at runtime allows more aggressive optimizations to be used.
• For applications where the accuracy of the result is critical, a hardware generated error bound on the result is very useful.
• The ability to analyze rounding error at runtime enables the construction of computer systems that can dynamically tune themselves according to the input data. Such an ability would fully capitalize on the strength of reconfigurable hardware.
The remainder of the paper is organised as follows. In Section 2, we provide background on floating point numbers and MCA. A modified hardware algorithm for MCA addition and multiplication is described in Section 3. The implementation of the MCA arithmetic unit is described in Section 4. Results are shown in Section 5 and conclusions drawn in Section 6.
BACKGROUND

Floating Point Numbers
A binary floating point number x fp can be represented as a 3-tuple < n, f, e >, where n ∈ {0, 1} is the sign, f is an unsigned fraction referred to as the significand, and e is the integer exponent. Such a floating point number represents the real value:
The significand of a normalized floating point number has a range of 1 ≤ f < 2. It has an implicit most significant bit of 1, called the hidden bit and so the actual value stored in the binary representation of the significand is f − 1. e is represented as a signed binary integer in excess format. We define machine precision, p, to be the number of bits in the significand, excluding the hidden bit. For an IEEE 754 single precision floating point number, p = 23.
The IEEE-754 floating point standard also supports denormalized floating point numbers which represent those below the range representable by normalized numbers. For these numbers, the exponent is set to its smallest value and no hidden bit is assumed. In this case, assuming single precision, < n, f, e > represents the real value:
Our implementation is fully IEEE 754 compliant and hence supports denormalized numbers. Without loss of generality, normalized floating point numbers are assumed in the rest of this paper unless otherwise specified.
MCA
MCA, proposed by Parker [10] , is a way to detect catastrophic cancellation and overcome several arithmetic anomalies in floating point calculations. Parker employs a high precision floating point unit to perform low precision Monte Carlo floating point computation. In his experiments, a double precision floating point unit is used to perform single precision Monte Carlo floating point computations.
Exact values are real numbers that can be represented exactly within the floating point precision whereas inexact values are rounded due to finite precision or real values that are not completely known. In MCA, an inexact value x is modeled with a random variable that agrees with x to s digits:x
where e is the base 2 exponent of x, s is a positive integer, and ξ is a random variable in the interval (− A full MCA floating point operation is computed as:
where t is the virtual precision, emulating a precision less than the actual machine precision (t ≤ p), and ξ is a random variable uniformly distributed in the interval (− ). The function round() is any floating point rounding function. In this paper, we assume round to the nearest.
In a full MCA floating point operation, the function inexact() is applied: (1) once to each of the operands, and (2) also to the result of the floating point operation before rounding. The former is called precision bounding and can be used for detection of catastrophic cancellation. The latter is random rounding, which improves the statistical properties of floating point rounding and can be used to address anomalies in floating point arithmetic such as non-associativity and bias of round-off errors. This is because, if random rounding is applied, the expected value of the result converges to the correct value.
Precision bounding and random rounding can be used independently. In this work, we only consider the precision bounding operation since our objective is to estimate the propagation of rounding error. The techniques described in this paper could also be applied to random rounding, In MCA, a computation is performed n times with the same input, forming a Monte Carlo simulation. The output is an n-tuple X =< x 1, x2, . . . , xn >. The arithmetic mean of X is used as the result, while the distribution of X can be used to estimate the rounding error. Instability in rounding is reflected by an X with large variance.
MCA ARITHMETIC UNIT
In Parker's work, double precision floating point arithmetic was used to perform single precision MCA so that finite precision effects were negligible. This approach is obviously very inefficient for hardware implementations. In this section, we propose a modified algorithm for MCA floating point that uses lower precision arithmetic.
The key idea is to use a random perturbation of operands and operator result to model rounding error. For a round to nearest scheme, the rounding error is at most a unit in the last place (ulp), computing fop(x, y) + δ requires extending the precision of the arithmetic unit. Our algorithm avoids this by computing a result that matches the statistical distribution of fop(x, y) + δ without directly evaluating the expression. Since our objective is estimating the propagation of rounding error in IEEE floating point computation, we do not apply random rounding.
MCA Addition
We define a function unround as the hardware equivalent of the precision bounding operation:
where Ξ is a uniformly distributed random variable with a probability density function
For implementation purposes, we use the equivalent function:
where Φ is a random variable uniformly distributed in the interval [0, 1). The probability density function for Φ is:
Let a, b, c be 3 floating point numbers such that
MCA addition can be defined as
This function is realized in hardware by making some changes to the standard floating point addition algorithm. In the standard algorithm, operand significands are aligned by shifting the smaller operand to the right by an amount equal to the difference of the exponents. The aligned operands are then added together. If the normalization step is not considered, this step can be represented by the following equation:
Our goal is to compute this expression without extending the precision of the adder. In a floating point adder, the adder is p + 4 bits wide, where p + 1 is the width of the significand, and an additional 3 bits are used for rounding. We will compute f c under this bit-width constraint. During normalization, the rightmost e b − ea − 3 bits of f b are shifted out. In normal floating point addition, rounding of these lost bits is tracked via the sticky bit. Since there are an infinite sequence of random bits to the right of the least significant bit, the sticky bit is not necessary. We denote those bits shifted out f b0 and the remaining ones f b1 . Hence,
, and fc can be computed as:
Computing ε exactly would require enlarging the adder to a width of 2p−3 bits. Instead, we approximate f b0 using a uniform random number. This approximation only produces a small error (as confirmed experimentally later in the paper) since the bits shifted out consist of the lower order bits of the number, which are roughly randomly distributed. Under this assumption, we can compute ε using the expression:
where φx is a random number taken from the distribution Φ, representing all lower order bits of f b that are not visible. and the uncomputed part of the addition generates a carry 1 2 of the time, during rounding a carry input is added to the least significant place with probability 1 2 . After taking the carry operation into account, the unrecorded bits are uniformly distributed and its value is simply rounded to the nearest floating point number.
MCA Multiplication
Let a, b, c be 3 floating point numbers such that a = −1
Using the function unround() defined in Equation 1, MCA multiplication can be defined as mc mult(a, b) = c = unround(a) × unround(b).
This function is realized in hardware by making some changes to the standard floating point multiplication algorithm. The first step is to compute the product of the significand. Ignoring the normalization step, the significand of c can be computed by the following equation:
A random error is injected to both operands. Since a floating point multiplier does not use any guard bit in the multiplication stage, the multiplier width must be increased to accommodate the error injected. The number of extra bits added affects how precise the error is propagated after the multiplication. The absolute error of a multiplication is (
is a very small value, it is ignored. Taking normalization into account, the relative error can be represented by the equation: Since both f a and f b are both normalized, fa ∈ [1, 2) and f b ∈ [1, 2). The maximum relative error is 2(2 −p ), and this translates to an error of at most 2 ulp in the result. Since the final result is rounded to 1 ulp, the error cannot be represented in high precision in the result, so there is little advantage to computing the error propagated in high precision. The multiplier width is increased by 1 bit to accommodate the injected error. Zero error is injected with a probability of 1 2 , an error of −1 and +1 is injected with a probability of 1 4 . Rounding is unchanged except that the extra 2 bits from the multiplier output are included the calculation of sticky bit.
Handling correlated rounding error
In a datapath circuit, when a intermediate result becomes the input of more then one subsequent operation, the rounding error of the input is necessarily correlated. In such a case, our scheme can partially account for the correlation between rounding errors by arranging for all PRNGs corresponding to the same variable to be initialized with the same random seed. This causes all operations using the same intermediate result to perturbate the input using the same random number. Figures 3 and 5 show the modifications made to the adder and the multiplier for handling correlated error. Figure 1 shows an example of how correlation can be handled. In this example, the expression (x + a) − (y + a) is computed, where a is the result from some previous computation. URNG2 and URNG4 are initialized with the same random seed so that the same random sequence is used in the all unround operations associated with the variable a.
IMPLEMENTATION
A floating point adder/subtractor and multiplier incorporating the MCA algorithm described in the previous section is implemented. The floating point unit can operate in either MCA mode or IEEE 754 single precision mode. Our test design is based the floating point unit used in refer- ; r e t u r n ( s1ˆs2ˆs3 ) ; } ence [6] , which itself is derived from an open source floating point unit [15] . The floating point unit is a IEEE 754 compliant single precision one, supporting all IEEE 754 rounding modes and denormalized numbers. Four pipeline stages are employed, this being optimized for latency rather than maximum clock frequency. It would be possible to add additional pipeline stages to operate at a higher clock frequency. We will highlight the major changes made to support MCA operation.
The architecture for the adder is shown in Figure 2 . A 32-bit combined Tausworthe pseudo-random number generator [7] is used to generate the random numbers, and is described in Listing 1. A 32-bit random number is produced each clock cycle by combining the output of 3 Tausworthe generators s1, s2 and s3.
The floating point adder is composed from 4 major modules: the prenormalization unit, the fixed point adder/subtractor, the post-normalization unit and the rounding unit. The parts that are modified for MCA are highlighted in Figure 2. Here we list the modifications made to each of the modules.
• In the prenormalization unit, 3 bits from the URNG are appended to the significand of each operand. A binary value '100' is then subtracted from the significand of each operand.
• In the fixed point adder/subtractor, 1 is added to the sum with a probability 1 2 for addition. For subtraction, 1 is subtracted from the sum with a probability 1 2 . This is implemented by feeding a random bit to the carry in so no additional adder is required.
• In the normalization left shift, random digits are filled into the least significant bit (LSB) instead of zero.
• In Monte Carlo mode, the sticky bit is ignored, the result is rounded up if and only if the round bit is 1. Figure 3 shows the modifications made to the adder to account for rounding error correlation. Different PRNGs are used for each operand. The PRNG corresponding to each variable is initialized with a different random seed so the same pseudo-random sequence is used by each fan-out of a variable, as shown in Figure 1 .
A diagram for a floating point multiplier using the proposed algorithm is shown in Figure 4 . A 32-bit combined Tausworthe pseudo-random number generator is used and only 4 bits of the output are used. The modifications for MCA are highlighted in Figure 2 . For clarity, only relevant signals are shown. Here we list the modifications made.
• Both operands are extend by 1 bit by appending a zero. Then 1 is either subtracted from or added to the least significant bit with a probability of 1 4 for addition and 1 4 for subtraction.
• The multiplier is extended by one bit to 25 × 25.
• During rounding, the addition output bits from the multiplier is included in the calculation of the sticky bit. Figure 5 shows the modifications made to the multiplier to account for rounding error correlation. Different PRNGs are used for each operand. The PRNG corresponding to each variable is initialized with a random seed so that the same pseudo-random sequence is used by each fan-out of a variable.
The adder and multiplier are synthesized on a Xilinx Virtex 5 XC5VLX50T FPGA using ISE 11.1 using the default settings of optimising for timing performance with IOB packing. In the current implementation, a single Monte Carlo floating point adder occupied 460 slices and runs at 86 MHz while the unmodified adder occupied 304 slices and runs at 110 MHz. The Monte Carlo multiplier occupies 419 slices, and runs at 90 MHz, while the unmodified multiplier occupies 284 slices, and runs at 86 MHz. Note that the adder uses only 26 of the 32 output bits of the URNG while the multiplier uses only 4 of the 32 bits. If both the adder and multiplier are implemented, the URNG can be shared, eliminating a URNG.
RESULTS
Logarithmically Distributed Inputs
To test the effectiveness of the algorithm, pairs of double precision floating point numbers are drawn by independently generating the significand and exponent from uniform distributions in the required range. The resulting floating point numbers are logarithmically distributed. The numbers are then rounded to single precision and fed to the MCA floating point unit. A total of 64 Monte Carlo iterations are performed for each pair and the standard deviation of the output computed. The floating point unit is also run in non-MCA mode for 1 iteration to obtain an IEEE 754 single precision result. The results are compared by performing the same operation in double precision using the original pair of double precision floating point number. The rounding error is the difference between the MCA result and the double precision calculation. Figure 6 shows the results from the MCA floating point adder. A total of 5000 pairs of numbers distributed in the full range of single precision floating point number are tested and the rounding error is plotted against the standard deviation. Here we test how the standard deviation of the MCA simulation can be used to estimate the distribution of the rounding error. If the result of the Monte Carlo iterations are approximately normally distributed, we will expect that most of the values will lie within 3 standard deviation of the population. The dotted line indicates the point where the rounding error is equal to 3 standard deviations. It can be seen that, as expected, the data-points lie below the 3 standard deviation line. Figure 7 shows the same test with the MCA multiplier and similar results are observed. [ 
Inputs in the range
Comparison with Double Precision Simulation
The addition of pairs of single precision random numbers using MCA is compared to a double precision MCA simulation using a virtual precision of 23 bits. An example of the output distribution for −1.0 + 1.0009765625 is shown in Figure 10 . The values were chosen to have a large relative error. The output distribution closely matches the double precision simulation. Figure 11 shows the result of a similar test for the MCA multiplier calculating 1.5 × 1.5, with the x-axis plotted in ulp. The sparse distribution is due to the fact that the error propagation is small compared to 1 ulp.
Catastrophic Cancellation
Catastrophic cancellation occurs when two inexact floating point values, similar in magnitude, are subtracted. Consider the example a = 1.10000000000000000000000 
CONCLUSION
Based on Monte Carlo Arithmetic, an hardware algorithm for randomising rounding errors was devised and an IEEE 754 compliant single precision floating point unit incorporating the algorithm implemented. Experiments show that the floating point unit gives an accurate estimation of the propagation of rounding error. In our implementation using 4 pipeline stages, the multiplier has no speed penalty and occupies 51% more area than a standard one. The adder has a 22% increase in delay and 47% increase in area. We did not make any attempt to optimize the speed and area of the implementation. Speed could be improved by better balancing of pipeline stages and area can be reduced by sharing of the PRNGs between the adder and multiplier.
This work shows our approach can effectively estimate the propagation of rounding error with a minimal impact on performance. Future work will involve applying this approach at a system level and exploring aggressive floating point optimizations that reduce hardware resources while tracking rounding errors at runtime. 
