Abstract. We report on the formal verification of the floating point unit used in the VAMP processor. The FPU is fully IEEE compliant, and supports denormals and exceptions in hardware. The supported operations are addition, subtraction, multiplication, division, comparison, and conversions. The hardware is verified on the gate level against a formal description of the IEEE standard by means of the theorem prover PVS.
Introduction
Our institute at Saarland University is currently working on the formal verification of a complete microprocessor called VAMP. Part of this microprocessor is a fully IEEE compliant floating point unit (FPU). This paper describes the verification of the FPU in the theorem prover PVS [22].
The FPU we have verified is developed in the textbook on computer architecture by Müller and Paul [20]. The designs go down to the level of single gates. Along with the complete designs come paper proofs for the correctness of the circuits. These paper proofs served as guidelines for the formal proofs. We have specified and verified these designs on the gate level in PVS. Only small changes to the designs were necessary, some due to errors in [20] and some to slightly simplify the proofs, with negligible impact on hardware cost and cycle time.
We have verified the designs with respect to a formalization of the IEEE standard 754 [12] (hereafter called "the standard"). We have partly used the formalization of the standard and the theory of rounding from [8, 20], particularly the notion of factorings, round decomposition, and $\alpha$-equivalence. Other parts of our IEEE formalization are influenced by Miner's formalization of the standard in PVS [17], particularly the definition of the rounding function and the arithmetic on special operands.
The FPU we have verified supports both single and double precision. It can perform floating point addition, subtraction, multiplication, division, comparison, conversion between both floating point formats, and conversion between floating point numbers and integers. Denormal numbers are handled entirely in hardware. Exceptions and wrapped exponents are computed as mandated by the standard.
The verified VAMP processor will be implemented on a Xilinx FPGA.
Paper outline. In section 2, we sketch the formalization of the IEEE standard. The implementation and verification of the combinatorial FPU is described in section 3.
We describe the errors that we have encountered during the verification at the end of section 3. The pipelining of the combinatorial FPU is briefly discussed in section 4. We conclude in section 5.
Related work. The verification of floating point algorithms and hardware using formal methods has received considerable attention over the last years. As mentioned above, the formalization of the IEEE standard that we use is based on [8, 17, 20]. The notion of factorings, round decomposition, and $\alpha$-equivalence is taken from [8, 20]. We have formally verified this theory in [13]. Since the definition of the rounding function is informal in [8, 20], we use a formal definition of rounding, which is based on Miner's formalization of the standard [17].
Harrison has formalized the IEEE standard in the theorem prover HOL Light [10]. Both Miner and Harrison have no direct counterpart to the decomposition theorem and $\alpha$-equivalence (cf. Sect. 2). They do not cover the actual implementation of operations or rounding.
Aagaard and Seger combine BDD based methods and theorem proving techniques to verify a floating point multiplier [1]. Chen and Bryant [4] use word-level SMV to verify a floating point adder. Neither of these verification projects handles exceptions or denormals.
Verkest et al. verify a binary non-restoring integer division algorithm [28] . Clarke et al. [7] and Ruess et al. [23] verify SRT division algorithms. Miner and Leathrum [18] verify a general class of subtractive division algorithms with respect to the IEEE standard. More mechanized proofs of SRT integer division are reported in [3, 14] .
Cornea-Hasegan describes the computation of division and square root by Newton-Raphson iteration in the Intel IA-64 architecture [5, 6]. The verification is done using Mathematica. O'Leary et al. report on the verification of the gate level design of Intel's FPU using a combination of model-checking and theorem proving [21]. Denormals and exceptions are not covered in the paper. Their definition of rounding is not directly related to the IEEE standard.
Moore et al. have verified the AMD K5 division algorithm [19] with the theorem prover ACL2. Russinoff has verified the K5 square root algorithm as well as the Athlon multiplication, division, square root, and addition algorithms [24, 25, 27] . In all his verification projects, Russinoff proves the correctness of a register transfer level implementation against his formalization of the IEEE standard using ACL2. Russinoff does not handle exceptions and denormals in his publications; however, he states that he handles denormals in unpublished work [26] . The definition of sticky in [19, 27] corresponds to our rounding of representatives.
IEEE Floating Point Arithmetic
To formally verify the correctness of an FPU, we need a formal notion of "correctness", i.e., a formalization of the IEEE standard which the FPU shall obey. In this section, we sketch the formalization of the IEEE standard used in our verification project. We detail the IEEE formalization in PVS in [13]. The formalization is primarily based on [8, 17, 20].
Factorings
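We abstract IEEE numbers as they are defined in the standard to factorings. A factoring is a triple $(s, e, f)$ consisting of a sign bit $s$, an exponent $e$, and a significand $f$; its value is $(-1)^s \cdot 2^e \cdot f$.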
Rounding
We proceed with the definition of the rounding function. The IEEE standard defines four rounding modes: round to nearest, up, down, and to zero. We define a function $\mathrm{rint}(\cdot, M)$ for each rounding mode $M \in \{\mathrm{near}, \mathrm{up}, \mathrm{down}, \mathrm{zero}\}$, which rounds reals $x$ to integers [17].
By scaling by $2^{P-1}$, reals are rounded to rationals with $P-1$ fractional bits.
Further scaling with $2^{\hat e}$, where $(\hat s, \hat e, \hat f) = \eta(x)$ is the IEEE factoring of $x$, yields the IEEE rounding function $\mathrm{rd}(x, M)$.
It is not obvious that this definition conforms with the IEEE standard. We prove the theorems stating this conformance in [13] .
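The three layers of this definition can be sketched in executable form. The following is a minimal illustration in Python with exact rational arithmetic, not the PVS formalization: the function names `rd_fixed` and `eta_exponent`, the tie-to-even formulation of `near`, and the unbounded upper exponent range (no overflow) are assumptions made for the sketch.

```python
from fractions import Fraction
import math

def rint(x, mode):
    """Round the rational x to an integer; ties in mode 'near' go to even."""
    lo, hi = math.floor(x), math.ceil(x)
    if mode == "up":
        return hi
    if mode == "down":
        return lo
    if mode == "zero":
        return lo if x >= 0 else hi
    if mode != "near":
        raise ValueError(mode)
    if x - lo != hi - x:
        return lo if x - lo < hi - x else hi
    return lo if lo % 2 == 0 else hi

def rd_fixed(x, mode, P):
    """Round x to P-1 fractional bits by scaling with 2**(P-1)."""
    scale = Fraction(2) ** (P - 1)
    return rint(x * scale, mode) / scale

def eta_exponent(x, e_min):
    """floor(log2 |x|), clamped below at e_min (denormal range); e_min for 0."""
    if x == 0:
        return e_min
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    return max(e, e_min)

def rd(x, mode, P, e_min):
    """Rounding to precision P; overflow is ignored (assumption of this sketch)."""
    two_e = Fraction(2) ** eta_exponent(x, e_min)
    return rd_fixed(Fraction(x) / two_e, mode, P) * two_e
```

With the double precision parameters `P = 53` and `e_min = -1022`, `rd(Fraction(1, 3), "near", 53, -1022)` agrees with the value of the Python float `1 / 3`.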
The rounding of reals $x$ can be decomposed into three steps: $\eta$-computation, significand rounding, and post-normalization. The benefit of having the decomposition theorem is that it simplifies the design and verification of the rounder (cf. Sect. 3.3).
The $\eta$-computation step computes the IEEE factoring $\eta(x)$, where $x$ is the number to be rounded. The significand round step then rounds the significand computed in the $\eta$-computation to $P-1$ digits behind the binary point. This is formalized in the function $\mathrm{sigrd}$, which takes an arbitrary IEEE factoring $(s, e, f)$ and a rounding mode $M \in \{\mathrm{near}, \mathrm{up}, \mathrm{down}, \mathrm{zero}\}$.
In the case that the significand round returns 0 or 2, the factoring has to be post-normalized. If the significand round returns 2, the exponent is incremented, and the significand is forced to 1; if the significand round returns 0, the sign bit is forced to 0 in order to yield an IEEE factoring. The post-normalization is defined as follows:
$$\mathrm{postnorm}(s, e, f) = \begin{cases} (s, e+1, 1) & \text{if } f = 2 \\ (0, e, f) & \text{if } f = 0 \\ (s, e, f) & \text{otherwise} \end{cases}$$
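Theorem 1 (Decomposition Theorem). For any real $x$ and rounding mode $M \in \{\mathrm{near}, \mathrm{up}, \mathrm{down}, \mathrm{zero}\}$, it holds that
$$\mathrm{postnorm}\bigl(\mathrm{sigrd}(\eta(x), M)\bigr) = \eta\bigl(\mathrm{rd}(x, M)\bigr)$$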
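A small worked instance may help to see why the post-normalization step is needed. The numbers are illustrative only and not taken from the paper: we assume a toy precision $P = 4$, the rounding mode near with ties going to the even neighbour, and an input in the normal range.

```latex
\begin{align*}
x &= 1.1111_2 = \tfrac{31}{16}, \qquad \eta(x) = \bigl(0,\; 0,\; \tfrac{31}{16}\bigr) \\
\text{significand round:}\quad
  & \mathrm{rint}\bigl(\tfrac{31}{16} \cdot 2^{3}, \mathrm{near}\bigr) \cdot 2^{-3}
    = \mathrm{rint}(15.5, \mathrm{near}) \cdot 2^{-3} = 16 \cdot 2^{-3} = 2 \\
\text{post-normalization:}\quad
  & (0,\; 0,\; 2) \;\mapsto\; (0,\; 1,\; 1) \\
\text{direct rounding:}\quad
  & \mathrm{rd}(x, \mathrm{near}) = 2 = (-1)^{0} \cdot 2^{1} \cdot 1
\end{align*}
```

The significand round overflows to 2, which is not a legal significand; incrementing the exponent and forcing the significand to 1 yields the factoring of the directly rounded value.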
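$\alpha$-Equivalence
We now define the concept of $\alpha$-equivalence and $\alpha$-representatives [20]. As we will see in theorem 3, this concept is a very concise way to speak about sticky-bit computations.
Let $\alpha$ be an integer. Two reals $x$ and $y$ are said to be $\alpha$-equivalent ($x \equiv_\alpha y$) if $x = y$ or if both lie strictly between the same two consecutive integral multiples of $2^\alpha$, i.e., $x, y \in (q \cdot 2^\alpha, (q+1) \cdot 2^\alpha)$ for some integer $q$.
The $\alpha$-representative $[x]_\alpha$ of $x$ is chosen as follows: if $x$ is an integral multiple of $2^\alpha$, the representative of $x$ is $x$ itself, and the midpoint of the interval between the surrounding multiples of $2^\alpha$ otherwise. The following lemma summarizes some important facts: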
Theorems 2 and 3 together allow a very efficient computation of representatives (respectively their IEEE-factorings) by or-ing the less significant bits in an OR-tree, and replacing them by the sticky-bit. This technique is well known [9], but introducing the formalism with $\alpha$-representatives allows for a very concise argumentation about these sticky-computations.
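The correspondence between representatives and sticky bits can be checked on small examples. The sketch below is an illustration under assumptions, not the verified circuit: `representative` works on exact rationals, `sticky_compress` works on a non-negative integer bit string, and both names are hypothetical.

```python
from fractions import Fraction
import math

def representative(x, alpha):
    """alpha-representative of x: x itself if x is a multiple of 2**alpha,
    otherwise the midpoint between the surrounding multiples of 2**alpha."""
    unit = Fraction(2) ** alpha
    q = math.floor(Fraction(x) / unit)
    if q * unit == x:
        return Fraction(x)
    return (q + Fraction(1, 2)) * unit

def sticky_compress(n, drop):
    """OR the `drop` low-order bits of n into a single sticky bit that is
    appended to the remaining high-order bits."""
    sticky = 1 if n & ((1 << drop) - 1) else 0
    return ((n >> drop) << 1) | sticky

# A bit string n with 6 fractional bits has the value n / 64. Dropping its
# 3 low-order bits corresponds to alpha = -3; the compressed string has
# 4 fractional bits.
n = 0b1001
compressed = sticky_compress(n, 3)                  # 0b11
same = representative(Fraction(n, 64), -3) == compressed * Fraction(1, 16)
```

Here `same` is true for every non-negative `n`, which is the reason the OR-tree computes the representative.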
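The valuable property of $\alpha$-representatives is that rounding $x$ and its representative $[x]_{\hat e - P}$, where $\hat e$ is the exponent of $\eta(x)$, yields the same result: $\mathrm{rd}(x, M) = \mathrm{rd}([x]_\alpha, M)$ for $\alpha = \hat e - P$.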
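As an illustration of this property, consider a toy precision $P = 4$ (an assumption for the example; the numbers are not from the paper) and a value $x$ in the normal range with exponent $\hat e = 0$:

```latex
\begin{align*}
x &= 1.001001_2 = \tfrac{73}{64}, \qquad \alpha = \hat e - P = -4 \\
x &\in \bigl(18 \cdot 2^{-4},\; 19 \cdot 2^{-4}\bigr)
  \;\Longrightarrow\; [x]_{-4} = 18.5 \cdot 2^{-4} = \tfrac{37}{32} = 1.00101_2 \\
\mathrm{rd}(x, \mathrm{near}) &= \mathrm{rint}(9.125, \mathrm{near}) \cdot 2^{-3} = \tfrac{9}{8}
  = \mathrm{rint}(9.25, \mathrm{near}) \cdot 2^{-3} = \mathrm{rd}\bigl([x]_{-4}, \mathrm{near}\bigr) \\
\mathrm{rd}(x, \mathrm{up}) &= \tfrac{10}{8} = \mathrm{rd}\bigl([x]_{-4}, \mathrm{up}\bigr)
\end{align*}
```

The representative keeps exactly the information a rounder needs: the $P$ leading bits and whether the remainder is zero, which is what a single sticky bit encodes.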
Exceptions
The IEEE standard defines five exceptions: invalid operation (INV), division by zero (DIVZ), overflow (OVF), underflow (UNF), and inexact result (INX). Our formalization of these exceptions is taken literally from [20], as is the implementation in the actual hardware. As a consequence of theorem 4, the exceptions can be detected by considering only the representative of the exact result. In case of underflow or overflow with the respective trap handler enabled, the standard mandates scaling the result into the representable range, and passing the scaled result to the trap handler. This is called a wrapped exponent. The handling of wrapped exponents is as in [20].
Correctness of the FPU
The standard requests that every floating point operation shall return a result obtained as if one first computed the exact result with infinite precision, and then rounded this result.
The floating point unpacker converts the operands from the IEEE format into a more convenient format. It translates the exponent from biased integer into two's complement format, and reveals the hidden significand bit. In case of multiplication and division, the unpacker normalizes the significand and adjusts the exponent accordingly, if the operand is denormal. Single and double precision operands are embedded into the same internal format. Furthermore, the floating point unpacker handles special cases such as operations on $\pm\infty$, NaN, zeros, etc.
In the following sections, we describe the construction and verification of the functional units and the rounder. As an example, we prove the correctness of the addition algorithm. The proof is a transcript of the actual PVS proof using standard mathematical notation instead of PVS notation for the sake of readability. The proof is similar to the proof given in [20], which, however, has larger gaps than the proof given here. The significance of the proof presented here is that it is formally verified.
We do not describe the proofs of the other components due to lack of space.
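The unpacking step described above can be illustrated at the level of factorings. The following Python sketch decodes an IEEE double into a triple $(s, e, f)$ and normalizes denormal significands; it is a software illustration under assumptions (function names, tag strings for special operands), not a model of the gate-level unpacker.

```python
import struct
from fractions import Fraction

BIAS, E_MIN, FRAC_BITS = 1023, -1022, 52    # IEEE double precision parameters

def unpack_double(value):
    """Decode a float into (s, e, f) with value == (-1)**s * 2**e * f.
    Infinities and NaNs are reported as tag strings instead."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    s = bits >> 63
    biased = (bits >> FRAC_BITS) & 0x7FF
    frac = bits & ((1 << FRAC_BITS) - 1)
    if biased == 0x7FF:
        return "nan" if frac else ("-inf" if s else "+inf")
    if biased == 0:                          # zero or denormal: hidden bit is 0
        return s, E_MIN, Fraction(frac, 1 << FRAC_BITS)
    return s, biased - BIAS, 1 + Fraction(frac, 1 << FRAC_BITS)

def normalize(s, e, f):
    """Shift a non-zero denormal significand into [1, 2), as done for
    multiplication and division; the exponent may drop below E_MIN."""
    while f != 0 and f < 1:
        f, e = f * 2, e - 1
    return s, e, f
```

For example, `normalize(*unpack_double(5e-324))` turns the smallest positive denormal into a factoring with significand 1 and an exponent below `E_MIN`, which is why the internal exponent format has to be wider than the IEEE exponent field.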
Adder
The floating point adder has IEEE-factorings $(s_a, e_a, f_a)$ and $(s_b, e_b, f_b)$ as inputs. From these, the adder computes the sum (or difference in case of subtraction) $(s_s, e_s, f_s)$, which is fed into the rounder. Let $a$ and $b$ denote the values of the two input factorings. Since the unpacker embeds single and double precision inputs into the same internal format, we do not distinguish between single and double precision. The final rounding stage will round the result to the appropriate precision. We therefore fix $P = 53$ in this section.
To simplify the description, we assume that the adder shall perform an addition. If it shall perform a subtraction, $b$ is replaced with $-b$ by inverting the sign bit $s_b$. The exact sum is denoted by $S = a + b$. We assume that 
Proof. By definition, we have We now have
Multiplying with the corresponding power of 2 and taking logarithms yields (3).
The adder circuit (Fig. 2) is a straightforward implementation of the described algorithm using basic components [2]. If a subtraction is to be performed, $s_b$ is negated, yielding $s_b'$. Circuit EXPSUB computes the exponent difference $as$ and a flag indicating whether $e_b > e_a$. The result's exponent $e_s$ is selected by a multiplexer. Circuit SWAP swaps $a$ and $b$ in case $e_b > e_a$. The shift distance is limited in circuit LIMIT to 63. Circuit ALIGN performs the actual alignment shift. It primarily consists of a 64-bit shifter and a sticky computation, which collects the bits shifted out during the alignment. Circuit SIGADD performs the addition, i.e., steps 4 and 5 from our informal description.
The verification of the adder is straightforward: prove the correctness of the subcircuits, and combine them using the above lemma and theorem.
Verifying the gate level. As an example for the level of detail our proofs operate on, we present the LIMIT circuit (Fig. 3) that calculates the shift distance $as_2$ for circuit ALIGN. First, an approximation $as_1$ of the absolute value of the shift distance $as$ is computed.
If one of the high order bits $as_1[10{:}6]$ is set, then $as_1$ exceeds 63. In this case, the low order bits of $as_1$ are forced to one by the OR-gates. Otherwise, the shift distance $as_1$ is unchanged. It holds
Both statements are easily verified in PVS.
In case $e_b > e_a$, the described computation introduces an error of 1 in the shift distance. This is compensated by pre-shifting the significand by 1 in circuit SWAP in this case. This detour is done to reduce the cycle time of the adder. The approximation of the absolute value can be computed with the delay of a single inverter. If one computed the exact absolute value of $as$, one would introduce the delay of an incrementer that would increase the length of the critical path of the adder.
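The two ideas behind the LIMIT circuit, the approximate absolute value and the saturation by OR-gates, can be mimicked with integer bit operations. This is a sketch under assumptions: the width of the exponent difference (`WIDTH = 12`) and the function names are hypothetical, and the bit-level wiring of the actual circuit is not reproduced.

```python
WIDTH = 12       # assumed width of the two's complement exponent difference
LOW_BITS = 6     # a 64-bit shifter needs a 6-bit shift distance

def approx_abs(diff):
    """One's complement 'absolute value' of diff: exact for diff >= 0,
    |diff| - 1 for diff < 0. Only inverters are needed, no incrementer."""
    bits = diff & ((1 << WIDTH) - 1)            # two's complement encoding
    negative = (bits >> (WIDTH - 1)) & 1
    if negative:
        bits ^= (1 << WIDTH) - 1                # invert every bit
    return bits, bool(negative)

def limit(as1):
    """Saturate the shift distance at 63: if any high-order bit is set,
    all low-order bits are forced to one."""
    high_set = (as1 >> LOW_BITS) != 0
    low = as1 & ((1 << LOW_BITS) - 1)
    return low | (((1 << LOW_BITS) - 1) if high_set else 0)
```

For a negative difference the result of `approx_abs` is one too small, which corresponds to the error of 1 that the text compensates by pre-shifting in circuit SWAP; `limit(as1)` equals `min(as1, 63)` for every input in range.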
Multiplier and Divider
The product of two floating point numbers can be computed by adding the exponents and multiplying the significands. The less significant bits of the significands' product are then compressed by means of a sticky bit computation. The representative of the product computed in this way is then passed to the rounder. Implementation and verification of this algorithm are straightforward.
In order to compute the quotient of two floating point numbers, one subtracts the exponents, and computes the quotient of the significands. The latter is the interesting part of the MULT/DIV unit.
Let $f_a$ and $f_b$ be the two significands. We may assume $1 \le f_a, f_b < 2$, since the unpacker provides normalized significands. In our FPU, we use Newton-Raphson iteration to compute an approximation of $1/f_b$. We start with an initial approximation. The analysis of the actual Newton-Raphson iteration and the following computation of the representative of the significands' quotient is described in great detail in [20]. The translation of the proofs to PVS is therefore straightforward.
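The quadratic convergence that makes Newton-Raphson attractive for division can be seen in a few lines. The sketch below uses the textbook reciprocal iteration $x_{i+1} = x_i (2 - b\,x_i)$ in exact rational arithmetic; the initial value and the number of steps are hypothetical, and the truncation of intermediate results that a hardware implementation has to analyze is deliberately left out.

```python
from fractions import Fraction

def reciprocal_newton(b, x0, steps):
    """Approximate 1/b with x_{i+1} = x_i * (2 - b * x_i), starting from x0."""
    x = Fraction(x0)
    for _ in range(steps):
        x = x * (2 - b * x)         # two multiplications per iteration
    return x

def relative_error(b, x):
    """|1 - b*x|; each iteration step squares this quantity."""
    return abs(1 - b * x)

b = Fraction(3, 2)                  # a normalized significand, 1 <= b < 2
x0 = Fraction(5, 8)                 # hypothetical initial approximation of 1/b
errors = [relative_error(b, reciprocal_newton(b, x0, k)) for k in range(4)]
# errors == [2**-4, 2**-8, 2**-16, 2**-32]
```

Since $1 - b\,x_{i+1} = (1 - b\,x_i)^2$, the number of correct bits doubles with every step, so a modest initial approximation reaches double precision accuracy after a few iterations.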
Rounder
Let $x$ be the exact result of an operation, and let $(s, e, f)$ be the input factoring to the rounder. This factoring is not necessarily an IEEE-factoring. Let $(\hat s, \hat e, \hat f) = \eta(x)$. The rounder is shown in Fig. 4. The $\eta$-shifter computes an IEEE-factoring $(s_n, e_n, f_n)$ with $s_n = \hat s$, $e_n = \hat e$, and $f_n \equiv_{-P} \hat f$. Two cases have to be distinguished:
1. In case of an addition/subtraction, the exponent satisfies $e \ge e_{min}$ by construction (Sect. 3.1). However, the significand lies in the interval $(0, 4)$, and can, due to cancellation, be less than 1 even if $e > e_{min}$. In the latter case, the significand has to be shifted left.
2. In case of multiplication and division, the input significand lies in the interval $[1, 4)$, since the inputs to the multiplier/divider were normalized by the unpacker. The exponent $e$, however, does not necessarily satisfy $e \ge e_{min}$. In the case where $e < e_{min}$, the significand has to be shifted right by $e_{min} - e$ digits. Since this shift could be very far, the shift distance is limited similarly to the adder alignment shift explained in section 3.1.
The $\eta$-shifter outputs $f_n$ with 128 binary digits. The circuit REP computes the representative $f_r = [f_n]_{-P}$. This is done using an OR-tree, as suggested by theorem 3. We then have $(s_n, e_n, f_r) = \eta([x]_{\hat e - P})$ by theorem 2. The next circuits SIGRD and POSTNORM exactly correspond to the functions $\mathrm{sigrd}$ and $\mathrm{postnorm}$ from section 2.2. The significand round on $f_r$ is performed by investigation of the 3 least significant bits of $f_r$, and either chopping or incrementing the higher order bits [20]. The post-normalization increments the exponent and forces the significand to 1 if normalization is necessary.
Theorems 1 and 4 together imply the correctness of the rounder:
The circuit PACK outputs the IEEE bit string encoding of this factoring. In case of an untrapped overflow, however, the circuit PACK outputs either the format's maximal value, or infinity, depending on the sign and the rounding mode. The correctness of the unpacker, the functional units, and the rounder together imply the correctness of the whole FPU.
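The choice between the maximal value and infinity follows the overflow rule of the standard: round to nearest always delivers infinity, round to zero always delivers the largest finite number, and the directed modes deliver infinity only when rounding away from zero. A minimal sketch for double precision, with a hypothetical function name and Python floats standing in for the packed bit strings:

```python
import math
import sys

def untrapped_overflow_result(sign, mode):
    """Result delivered on an untrapped overflow for a result with the
    given sign bit (0 or 1) under the given rounding mode."""
    to_infinity = {
        "near": True,
        "zero": False,
        "up": sign == 0,        # positive overflows go to +infinity
        "down": sign == 1,      # negative overflows go to -infinity
    }[mode]
    magnitude = math.inf if to_infinity else sys.float_info.max
    return -magnitude if sign else magnitude
```

For instance, a negative overflow in mode `up` yields the most negative finite double, whereas the same overflow in mode `down` yields negative infinity.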
Errors encountered
We briefly describe some of the errors in [20] that we have encountered during the verification of the FPU in PVS:
1. Most importantly, the specification of the rounder interface (pg. 392) is wrong. There it is required that an overflow does not occur if a denormal significand is fed into the rounder, i.e., $f < 1 \Rightarrow \neg\mathrm{OVF}(x, M)$. This is necessary to detect overflows correctly (pg. 397). However, the requirement is not strong enough: it must hold that the input exponent is less than $e_{max}$ in case of a denormal input significand, i.e., $f < 1 \Rightarrow e < e_{max}$. Otherwise, the proof on page 397 fails. Nevertheless, the hardware was correct; only the proofs were wrong.
2. On page 400, a carry-in is fed into a compound adder, although compound adders do not feature a carry-in. A similar error was found in the exponent addition circuit in the multiplier (pg. 383).
3. In circuit SIGRND (pg. 406), chopping the significand in single precision mode leaves non-zero digits after the least significant bit. The claims in section 8.4.5 are therefore wrong. This can be fixed by tying the bits after the least significant bit to zero.
4. In the significand round, the circuit for the decision whether to chop or to increment the significand is wrong (pg. 407). The XOR has to be replaced by an XNOR gate.
5. In the adder, the computation of the sign bit is wrong (pg. 371).
Some of the proofs in [20] have large gaps. These gaps had to be filled during the verification in PVS. Most proof gaps could be filled without revealing errors in [20], but some proof gaps hid errors, e.g., the errors listed above. Having formally verified the proofs in PVS ultimately gives us the confidence that the design of the FPU is correct, under the assumption that PVS is sound.
FPU Control
So far we have verified combinatorial circuits. In order to implement the FPU in hardware with reasonable cycle time, one has to insert pipelining registers. Since multipliers are very expensive, one cannot fully pipeline the iterative Newton-Raphson algorithm. A loop has to be incorporated into the pipeline structure to reuse the multiplier in each iteration. This saves hardware costs, but considerably complicates control and the correctness proof.
In [20] , the FPU is integrated into an in-order variant of the DLX-processor. In our verification project, the FPU will be integrated into a Tomasulo based out-of-order DLX-variant. It is therefore necessary to design a new control automaton for the FPU in order to exploit the benefits of the out-of-order scheduler.
After pipelining, the FPU has a variable latency, and operations finish out of order. The latency of the FPU is 1 cycle for comparison and for operations involving special operands. It is 5 cycles for addition, subtraction, and multiplication. The division has a latency of 17 and 21 cycles in single and double precision, respectively. Two divisions can be performed interleaved without increased latency.
We have verified the new FPU control using McMillan's SMV model-checker [16] . However, we are working on the verification of the FPU control using PVS, because we aim for a proof of the complete CPU in only one verification system. Using different, not tightly integrated proof tools potentially introduces unsoundness. The verification of temporal properties of circuits using PVS is feasible, but somewhat tedious. We omit the control implementation details here, since they are not specific to FPUs.
Summary and Future Work
We have formally verified a fully IEEE compliant floating point unit. The supported operations are addition, subtraction, multiplication, division, comparison, and conversions. The FPU handles denormals and exceptions as required by the IEEE standard. The hardware has been verified on the gate level with respect to a formal description of the IEEE standard using the theorem prover PVS.
The proofs in PVS used paper proofs from [20] as guidelines. However, some of the proofs in [20] were erroneous, and most proofs had gaps that needed to be filled in PVS. Some of those gaps hid errors in the design in [20]. Having formally verified the proofs (and filled the proof gaps) in PVS gives us the certainty that the hardware is now correct with respect to its specification.
To the best of our knowledge, this is the first time that a floating point unit with the said features has been formally verified, and the specifications and proof scripts are made publicly available. The amount of work needed to develop the PVS hardware description and proofs was roughly a man-year for each of the authors. Since theorem proving strongly profits from experience, we think we would succeed in at most half the time now on a comparable project.
We are currently proving the correctness of the FPU pipeline, and are integrating the FPU into the VAMP processor. The PVS hardware specifications will be converted automatically to Verilog HDL. The VAMP processor including the FPU will then be implemented on a Xilinx FPGA.
