The floating-point division bug in Intel's Pentium processor and the overflow flag erratum of the FIST instruction in Intel's Pentium Pro and Pentium II processors have demonstrated the importance and the difficulty of verifying floating-point arithmetic circuits. In this paper, we present a "black box" approach to the verification of FP adders. In our approach, FP adders are verified by an extended word-level SMV using reusable specifications without knowing the circuit implementation. Word-level SMV is improved by using Multiplicative Power HDDs (*PHDDs), and by incorporating conditional symbolic simulation as well as a short-circuiting technique. Based on a case analysis, the adder specification is divided into several hundred implementation-independent sub-specifications. We applied our system and these specifications to verify the IEEE double precision FP adder in the Aurora III Chip from the University of Michigan. Our system found several design errors in this FP adder. Each specification can be checked in less than 5 minutes. A variant of the corrected FP adder was created to illustrate the ability of our system to handle different FP adder designs. For each adder, the verification task finished in 2 CPU hours on a Sun UltraSPARC-II server.
Introduction
The floating-point (FP) division bug [11, 22] in Intel's Pentium processor and the overflow flag erratum of the FIST instruction (floating-point to integer conversion) [14] in Intel's Pentium Pro and Pentium II processors have demonstrated the importance and the difficulty of verifying floating-point arithmetic circuits and the high cost of an arithmetic bug. FP adders are the most common units in floating-point processors. Modern high-speed FP adders [20, 23] are very complicated, because they require many types of modules: a right shifter for alignment, a left shifter for normalization, a leading zero anticipator (LZA), an adder for mantissas, a rounding unit, etc. Exhaustive simulation or formal verification can be used to ensure the correctness of FP adders.
Most of the IEEE floating-point standard has been formalized by Carreño and Miner [5] in the HOL and PVS theorem provers. Theorem provers have been used to verify arithmetic circuits [18, 21]. However, theorem provers require users to make use of detailed circuit knowledge, and the verification process for floating-point circuits is very tedious. Another drawback of theorem provers is that the proofs are implementation-dependent.
After the famous Pentium division bug [11] , Intel researchers applied word-level SMV [9] with Hybrid Decision Diagrams (HDDs) [8] to verify the functionality of the floating-point unit in one of Intel's processors [7] . Due to the limitations of HDDs, the FP adder was partitioned into several sub-circuits whose specifications were expressed in terms of integer functions. Each sub-circuit was verified individually based on assumptions about its inputs. The correctness of the overall circuit had to be ascertained manually from the verified specifications of the sub-circuits. This partitioning approach is error prone, since mistakes could be introduced in any of the following steps: partitioning the circuit and the specification, performing case analysis for each sub-circuit, and proving the overall correctness of the circuit, which potentially could have used a theorem prover. Moreover, their specifications are highly dependent on the circuit implementations.
The combination of model checking and theorem proving techniques was used to verify an IEEE double precision floating-point multiplier [1]. The circuit was partitioned into several sub-circuits which can be verified by model checking. The theorem prover handled the completeness of the proofs by using inference rules to compose the verified specifications. This approach combines the strengths of both techniques. However, the proofs based on this approach are still implementation-dependent.
To the best of our knowledge, only two types of arithmetic circuits can be verified by treating them as black boxes (i.e., the specifications contain only the inputs and outputs). First, an integer adder can be verified by using Binary Decision Diagrams (BDDs) [2]. Second, Hamaguchi et al. [15] presented the verification of integer multipliers without knowing their implementations using Multiplicative Binary Moment Diagrams (*BMDs) [4]. However, their approach does not work for incorrect designs, because the *BMDs explode in size and counterexamples cannot be generated for debugging. None of the previous approaches can verify FP adders without knowing their circuit implementations.
In this paper, we present a black box approach to the verification of FP adders. In our approach, an FP adder is treated as a black box and is verified by an extended version of word-level SMV with reusable specifications. Word-level SMV is improved by using Multiplicative Power HDDs (*PHDDs) [6] to represent the FP functions, and by incorporating conditional symbolic simulation as well as a short-circuiting technique. The FP adder specification is divided into several hundred sub-specifications based on the sign bits and the exponent differences. These sub-specifications are implementation-independent, since they use only the input and output signals of FP adders.
The concept of conditional symbolic simulation is to perform the symbolic simulation of the circuit with some conditions that restrict the behavior of the circuit. This approach can be viewed as dynamically extracting circuit behavior under the given conditions without modifying the actual circuit. Can we verify the specifications of FP adders using conditional symbolic simulation, avoiding any use of circuit knowledge? We identify a conflict in variable orderings between the mantissa comparator and the mantissa adder, which causes the BDD explosion in conditional symbolic simulation. A short-circuiting technique to overcome this ordering conflict is presented and integrated into the word-level SMV package. In general, this short-circuiting technique can be used when different parts of the circuit are used under different operating conditions.
We used our system and these specifications to verify the FP adder in the Aurora III Chip [16] at the University of Michigan. This FP adder is based on the design described in [20], and supports IEEE double precision and all four IEEE rounding modes. In this verification work, we verified the FP adder only in the round-to-nearest mode, because we believe that this is the most challenging rounding mode for verification. Our system found several design errors. Each specification can be checked in less than 3 minutes, or 5 minutes including counterexample generation. A variant of the corrected FP adder was created and verified to illustrate the ability of our system to handle different FP adder designs. For each FP adder, verification took 2 CPU hours. We believe that our system and specifications can be applied directly to verify other FP adder designs and to help find design errors.
The overflow flag erratum of the FIST instruction (FP to integer conversion) [14] in Intel's Pentium Pro and Pentium II processors has illustrated the importance of verification of the conversion circuits which convert the data from one format to another format (e.g., IEEE single precision to double precision). Since these circuits are much simpler than FP adders and only have one input operand, we believe that our system can be used to verify the correctness of these circuits.
*PHDD Preliminary
For expressing functions from Boolean variables to integer values, BMDs [4] use the moment decomposition of a function:
$$
\begin{aligned}
f &= (1 - x) \cdot f_{\bar{x}} + x \cdot f_{x} \\
  &= f_{\bar{x}} + x \cdot (f_{x} - f_{\bar{x}}) \\
  &= f_{\bar{x}} + x \cdot f_{\delta x}
\end{aligned}
\tag{1}
$$
where $\cdot$, $+$ and $-$ denote multiplication, addition and subtraction, respectively. Term $f_{x}$ ($f_{\bar{x}}$) denotes the positive (negative) cofactor of $f$ with respect to variable $x$, i.e., the function resulting when the constant 1 (0) is substituted for $x$. By rearranging the terms, we obtain the third line of Equation 1. Here, $f_{\delta x} = f_{x} - f_{\bar{x}}$ is called the linear moment of $f$ with respect to $x$. This terminology arises by viewing $f$ as a linear function with respect to its variables, and thus $f_{\delta x}$ is the partial derivative of $f$ with respect to $x$. The negative cofactor $f_{\bar{x}}$ will be termed the constant moment, i.e., it denotes the portion of function $f$ that remains constant with respect to $x$.
This decomposition is also called positive Davio in K*BMDs [13]. Each vertex of a BMD describes a function in terms of its moment decomposition with respect to the variable labeling the vertex. The two outgoing arcs denote the constant and linear moments of the function with respect to the variable. Clarke et al. [8] extended BMDs to a form they call Hybrid Decision Diagrams (HDDs), where a function may be decomposed with respect to each variable in one of six decomposition types. In our experience with HDDs, we found that three of their six decomposition types are useful in the verification of arithmetic circuits. These three decomposition types are Shannon, Positive Davio, and Negative Davio. Therefore, Equation 1 is generalized to the following three equations according to the variable's decomposition type:
$$
f =
\begin{cases}
(1 - x) \cdot f_{\bar{x}} + x \cdot f_{x} & \text{(Shannon)} \\
f_{\bar{x}} + x \cdot f_{\delta x} & \text{(Positive Davio)} \\
f_{x} + (1 - x) \cdot f_{\delta \bar{x}} & \text{(Negative Davio)}
\end{cases}
$$
Here, $f_{\bar{x}}$ and $f_{x}$ are the negative and positive cofactors of $f$, $f_{\delta x} = f_{x} - f_{\bar{x}}$ is the partial derivative of $f$ with respect to $x$, and $f_{\delta \bar{x}} = f_{\bar{x}} - f_{x}$ is the partial derivative of $f$ with respect to $\bar{x}$. The BMD representation is a subset of HDDs. In other words, the HDD graph is the same as the BMD graph if all of the variables use the positive Davio decomposition.
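The three decomposition types can be checked numerically on a small word-level function. The following is a minimal sketch, not part of the original system: it represents a function from Boolean variables to integers as a plain Python callable (a stand-in for a decision diagram), and the helper names `cofactor` and `check_decompositions` are hypothetical.

```python
from itertools import product

def cofactor(f, var, value):
    """Return f with the variable `var` fixed to `value` (0 or 1)."""
    return lambda env: f({**env, var: value})

def check_decompositions(f, var, names):
    """Check Shannon, positive Davio and negative Davio on every assignment."""
    f0 = cofactor(f, var, 0)          # negative cofactor
    f1 = cofactor(f, var, 1)          # positive cofactor
    for bits in product((0, 1), repeat=len(names)):
        env = dict(zip(names, bits))
        x = env[var]
        shannon = (1 - x) * f0(env) + x * f1(env)
        pos_davio = f0(env) + x * (f1(env) - f0(env))        # constant + x * linear moment
        neg_davio = f1(env) + (1 - x) * (f0(env) - f1(env))
        if not (f(env) == shannon == pos_davio == neg_davio):
            return False
    return True

# Unsigned 3-bit word X = x0 + 2*x1 + 4*x2, a function from Booleans to integers.
def word(env):
    return env["x0"] + 2 * env["x1"] + 4 * env["x2"]

NAMES = ["x0", "x1", "x2"]
ALL_OK = all(check_decompositions(word, v, NAMES) for v in NAMES)
```

For this word, the linear moment with respect to `x1` is the constant 2, which is why the positive Davio form keeps the diagram of an unsigned word linear in the number of bits.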
Chen and Bryant [6] introduced *PHDDs, which extend HDDs by attaching weights that are powers of a constant to the edges. An edge is written as a pair $\langle w, f \rangle$, where $\langle w, f \rangle$ denotes $c^{w} \cdot f$. In general, the constant $c$ can be any positive integer. Since the base value of the exponent in the IEEE floating-point (FP) format is 2, we will consider only $c = 2$ for the remainder of this paper. Observe that $w$ can be negative, allowing representation of rational numbers. The power edge weights enable us to represent functions mapping Boolean variables to FP values without using rational numbers in our representation. See [6] for more details of FP representation using *PHDDs.
Floating-Point Adders
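Let us consider the representation of FP numbers by IEEE standard 754. Double-precision FP numbers are stored in 64 bits: 1 bit for the sign ($S_x$), 11 bits for the exponent ($E_x$), and 52 bits for the mantissa ($N_x$). The exponent is a signed number represented with a bias ($B$) of 1023. The mantissa ($N_x$) represents a number less than 1. Based on the value of the exponent, the IEEE FP format can be divided into four cases:
$$
X =
\begin{cases}
(-1)^{S_x} \times 1.N_x \times 2^{E_x - B} & \text{if } 0 < E_x < \text{all 1s (normal)} \\
(-1)^{S_x} \times 0.N_x \times 2^{1 - B} & \text{if } E_x = 0 \text{ (denormal)} \\
\mathrm{NaN} & \text{if } E_x = \text{all 1s and } N_x \neq 0 \\
(-1)^{S_x} \times \infty & \text{if } E_x = \text{all 1s and } N_x = 0
\end{cases}
$$
where NaN denotes Not-a-Number and $\infty$ represents infinity. Let $M_x = 1.N_x$ or $0.N_x$. Let $m$ be the number of mantissa bits including the bit to the left of the binary point and $n$ be the number of exponent bits. For IEEE double precision, $m = 53$ and $n = 11$.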
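The four cases of the format can be made concrete with a short decoder. This is an illustrative sketch only: `decode_double` and `bits_of` are hypothetical helper names, and exact rational arithmetic from Python's standard `fractions` module is used so that the decoded value can be compared against the machine's own doubles.

```python
import math
import struct
from fractions import Fraction

BIAS = 1023          # B
EXP_BITS = 11        # n
FRAC_BITS = 52       # stored mantissa bits N_x

def decode_double(bits):
    """Decode a 64-bit pattern into an exact Fraction, 'NaN', or +/- infinity."""
    s = bits >> 63
    e = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    n = bits & ((1 << FRAC_BITS) - 1)
    sign = -1 if s else 1
    if e == (1 << EXP_BITS) - 1:                  # exponent is all 1s
        return "NaN" if n != 0 else sign * math.inf
    if e == 0:                                    # denormal: M_x = 0.N_x
        mant = Fraction(n, 1 << FRAC_BITS)
        return sign * mant * Fraction(2) ** (1 - BIAS)
    mant = 1 + Fraction(n, 1 << FRAC_BITS)        # normal: M_x = 1.N_x
    return sign * mant * Fraction(2) ** (e - BIAS)

def bits_of(x):
    """Return the 64-bit pattern of a Python float."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]
```

For example, `decode_double(bits_of(1.5))` yields `Fraction(3, 2)`, and the smallest denormal `5e-324` decodes through the second case.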
Due to this encoding, an operation on two FP numbers cannot be rewritten as an arithmetic function of two inputs. For example, the addition of two FP numbers $X$ ($S_x$, $E_x$, $M_x$) and $Y$ ($S_y$, $E_y$, $M_y$) cannot be expressed as $X + Y$, because of the special cases when one of them is NaN or $\pm\infty$.
Figure 2.a shows the block diagram of the SNAP FP adder designed at Stanford University [20]. This adder was designed for fast operation based on the following facts. First, the alignment (right shift) and normalization (left shift) needed for addition are mutually exclusive. When a massive right shift is performed during alignment, the massive left shift is not needed. On the other hand, the massive left shift is required only when the mantissa adder performs subtraction and the absolute value of the exponent difference is less than 2 (i.e., no massive right shift). Second, the rounding can be performed by having the mantissa adder generate A + C, A + C + 1 and A + C + 2, where A and C are the inputs of the mantissa adder, and using the final multiplexor to choose the correct output. In the exponent path, the exponent subtractor computes the difference of the exponents. The MuxAbs unit computes the absolute value of the difference for alignment. The larger exponent is selected as the input to the exponent adjust adder. During normalization, the mantissa may need a right shift, no shift or a massive left shift. The exponent adjust adder is prepared to handle all of these cases.
In the mantissa path, the operands are swapped as needed depending on the result of the exponent subtractor. The inputs to the mantissa adder are the mantissa with the larger exponent (A) and one of three versions of the mantissa with the smaller exponent (C): unshifted, right shifted by 1, and right shifted by many bits. The path select unit chooses the correct version of C based on the value of the exponent difference. The version right shifted by many bits is provided by the right shifter, which also computes the information needed for the sticky bit. The mantissa adder performs the addition or subtraction of its two inputs depending on the signs of both operands and the operation (add or subtract). If the adder performs subtraction, the mantissa with the smaller exponent will first be complemented. The adder generates all possible outcomes (A + C, A + C + 1, and A + C + 2) needed to obtain the final, normalized and rounded result. A + C + 2 is required because of the possible right shift during normalization. For example, when the most significant bits of A and C are both 1, A + C will have m + 1 bits and must be right shifted by 1 bit. If the rounding logic decides to add 1 to the least significant bit of the right shifted result, this amounts to adding 2 to A + C. When the operands have the same exponent and the operation of the mantissa adder is subtraction, the outputs of the adder could be negative. The ones complementer is used to adjust them to be positive. Then, one of these outputs is selected by the GRS unit to account for rounding. The GRS unit also computes the true guard (G), round (R) and sticky (S) bits and the bit to be left shifted into the result during normalization. When the operands are close (the exponent difference is 0, 1, or -1) and the operation of the mantissa adder is subtraction, the result may need a massive left shift for normalization. The amount of left shift is predicted by the leading zero anticipator (LZA) unit in parallel with the mantissa adder.
The predicted amount may differ by one from the correct amount, but this 1 bit shift is made up by a 1-bit fine adjust unit. Finally, one of the four possible results is selected to yield the final, rounded, and normalized result based on the outputs of the path select and GRS units.
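The need for the A + C + 2 output can be illustrated with plain integers. The sketch below is an illustration under simplifying assumptions, not the SNAP datapath: mantissas are modeled as unsigned m-bit integers, the bit shifted out during normalization is simply dropped (in the real design it would feed the GRS logic), and both function names are hypothetical.

```python
def round_up_after_overflow(a, c, m):
    """Add two m-bit mantissas that carry out, shift right by 1, then add 1 in the LSB."""
    total = a + c
    assert total >> m == 1, "example assumes a carry out of the m-bit adder"
    return (total >> 1) + 1

def select_from_precomputed(a, c):
    """The same value, taken from the precomputed A + C + 2 output and shifted."""
    return (a + c + 2) >> 1
```

With m = 4, a = 0b1101 and c = 0b1011, both functions return 13: incrementing the least significant bit after the 1-bit right shift is the same as having added 2 before the shift.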
Figure 3: Detailed circuit of the compare unit.
As an alternative to the SNAP design, the ones complementer after the mantissa adder can be avoided if we ensure that input C of the mantissa adder is smaller than or equal to input A when the exponent difference is 0 and the operation of the mantissa adder is subtraction. To ensure this property, a mantissa comparator and extra circuits, as shown in [23], are needed to swap the mantissas correctly. This compare unit exists in many modern high-speed FP adder designs [23] and makes verification harder, as described in Section 5.4. Figure 3 shows the detailed circuit of the compare unit, which generates the signal to swap the mantissas. The signal $E_x < E_y$ comes from the exponent subtractor. When $E_x < E_y$, or $E_x = E_y$ and $M_x < M_y$ (i.e., $h = 1$), A is $M_y$ (i.e., the mantissas are swapped). Otherwise, A is $M_x$.
Specifications of FP Adders
In this section, we focus on the general specifications of the FP adder, especially when both operands have denormal or normal values. For the cases in which at least one of the operands is NaN or $\infty$, the specifications can be easily written at the bit level. For example, when both operands are NaN, the expected output is NaN (i.e., the exponent is all 1s and the mantissa is not equal to zero). The specification can be expressed as: the "AND" of the exponent output bits is 1 and the "OR" of the mantissa output bits is 1.
When both operands have normal or denormal values, the ideal specification is OUT = Round(X + Y ).
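However, for *PHDDs, FP addition has complexity exponential in the word size of the exponent part. Thus, the specification must be divided into several sub-specifications for verification. According to the signs of both operands, the function $X + Y$ can be rewritten as Equation 3.
$$
X + Y =
\begin{cases}
(-1)^{S_x} \times (2^{E_x - B} \times M_x + 2^{E_y - B} \times M_y) & \text{if } S_x = S_y \text{ (true addition)} \\
(-1)^{S_x} \times (2^{E_x - B} \times M_x - 2^{E_y - B} \times M_y) & \text{if } S_x \neq S_y \text{ (true subtraction)}
\end{cases}
\tag{3}
$$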
Similarly, for FP subtraction, the function $X - Y$ can also be rewritten as true addition when both operands have different signs and true subtraction when both operands have the same sign.
Figure 4: Cases of true addition for the mantissa part.
True Addition
The *PHDDs for the true addition and subtraction still grow exponentially. Based on the sizes of the two exponents, the function $X + Y$ for true addition can be rewritten as Equation 4.
$$
X + Y =
\begin{cases}
(-1)^{S_x} \times 2^{E_x - B} \times (M_x + (M_y \gg (E_x - E_y))) & \text{if } E_y \le E_x \\
(-1)^{S_x} \times 2^{E_y - B} \times (M_y + (M_x \gg (E_y - E_x))) & \text{if } E_y > E_x
\end{cases}
\tag{4}
$$
When $E_y \le E_x$, the exponent is $E_x$ and the mantissa is the sum of $M_x$ and $M_y$ right shifted by $(E_x - E_y)$ bits (i.e., $M_y \gg (E_x - E_y)$ in the equation). $E_x - E_y$ can range from 0 to $2^n - 2$, but the number of mantissa bits in the FP format is only $m$. Figure 4 illustrates the possible cases of true addition for $E_y \le E_x$ based on the values of $E_x - E_y$. In Figure 4.a, for $0 \le E_x - E_y < m$, the intermediate (precise) result contains more than $m$ bits. The right portion of the result contains the L, G, R and S bits, where L is the least significant bit of the mantissa. The rounding mode will use these bits to perform the rounding and generate the final result ($M_{out}$) in $m$-bit format. When $E_x - E_y \ge m$, as shown in Figure 4.b, the right shifted $M_y$ only contributes to the intermediate result in the G, R and S bits. Depending on the rounding mode, the output mantissa will be $M_x$ or $M_x + 1 \times 2^{-m+1}$. Therefore, we only need one specification in each rounding mode for the cases $E_x - E_y > m$. A similar analysis can be applied to the case $E_y > E_x$. Thus, the specifications for true addition with rounding can be written as:
where [6]. With this decomposition, the graph sizes of $E_x$ and $E_y$ are exponential in *PHDDs, but $2^{E_x}$ and $2^{E_y}$ will have linear size. While building BDDs and *PHDDs for OUT from the circuit, the function on the left side of $\Rightarrow$ will be used to simplify the BDDs automatically by conditional forward simulation.
The number of specifications for true addition is 2m + 1. For instance, the value of m for IEEE double precision is 53, thus the number of specifications for true addition is 107. Since the specifications are very similar to one another, they can be generated by a looping construct in word-level SMV.
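The looping construct can be mimicked outside SMV to check the count. The sketch below is only an assumption about how the case split is laid out: the case names `Ca1` to `Ca4` are hypothetical (chosen by analogy with the true subtraction conditions), and the two far cases are taken to start at an exponent difference of m so that the total comes out to 2m + 1.

```python
def true_addition_cases(m):
    """List (name, predicate) pairs over the exponent pair (ex, ey)."""
    cases = []
    for i in range(m):                      # E_x = E_y + i, 0 <= i < m
        cases.append((f"Ca1[{i}]", lambda ex, ey, i=i: ex == ey + i))
    cases.append(("Ca2", lambda ex, ey: ex >= ey + m))    # assumed boundary
    for i in range(1, m):                   # E_y = E_x + i, 1 <= i < m
        cases.append((f"Ca3[{i}]", lambda ex, ey, i=i: ey == ex + i))
    cases.append(("Ca4", lambda ex, ey: ey >= ex + m))    # assumed boundary
    return cases

DOUBLE_PRECISION_CASES = true_addition_cases(53)
```

For m = 53 the list has 107 entries, matching the count given above.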
True Subtraction
The specification for true subtraction can be divided into two cases: far ($|E_x - E_y| > 1$) and close ($E_x - E_y = 0$, $1$ or $-1$). For the far case, the result of mantissa subtraction does not require a massive left shift (i.e., the LZA is not active). Similar to the true addition, the specifications for true subtraction can be written as Equation 6.
where $C_{s1}[i]$, $C_{s2}$, $C_{s3}[i]$ and $C_{s4}$ are $Cond_{sub}$ & $E_x = E_y + i$, $Cond_{sub}$ & $E_x > E_y + m$, $Cond_{sub}$ & $E_y = E_x + i$, and $Cond_{sub}$ & $E_y > E_x + m$, respectively. $Cond_{sub}$ represents the condition for true subtraction.
For the close case, the difference of the two mantissas may generate some leading zeroes, such that normalization is required to produce a result in IEEE format. A special case is that the output is zero when $E_x$ is equal to $E_y$ and $M_x$ is equal to $M_y$. The specification is as follows: ($Cond_{sub}$ & $E_x = E_y$ & $M_x = M_y$) $\Rightarrow$ OUT = 0.
Specification Coverage
Since the specifications of floating-point adders are split into several hundred sub-specifications, do these sub-specifications cover the entire input space? To answer this question, one might suggest using theorem provers to handle the case splitting. In contrast, we propose a BDD approach to compute the coverage of our specifications.
Our approach is based on the observation that our specifications are in the form "cond $\Rightarrow$ out = expected result" and that cond depends only on the inputs of the circuit. Thus, the union of the conds of our specifications, which can be computed by BDD operations, must be TRUE when our specifications cover the entire input space. In other words, the union of the conds can be used to compute the percentage of the input space covered by our specifications and to generate the cases which are not covered by our specifications.
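The coverage check can be sketched on a toy input space. In the following illustration, exhaustive enumeration over small exponent words stands in for the BDD union, and the case split in `exponent_conds` (including its boundary cases) is an assumption made for the example rather than the exact conditions of our specifications.

```python
from itertools import product

def exponent_conds(m):
    """Assumed case-split conditions on the exponent pair (ex, ey)."""
    conds = [lambda ex, ey, i=i: ex == ey + i for i in range(m)]
    conds.append(lambda ex, ey: ex >= ey + m)
    conds += [lambda ex, ey, i=i: ey == ex + i for i in range(1, m)]
    conds.append(lambda ex, ey: ey >= ex + m)
    return conds

def coverage(conds, n):
    """Return (covered fraction, uncovered inputs) over all pairs of n-bit exponents."""
    space = list(product(range(1 << n), repeat=2))
    uncovered = [pair for pair in space if not any(c(*pair) for c in conds)]
    return 1 - len(uncovered) / len(space), uncovered

ALL_CONDS = exponent_conds(3)
full_fraction, full_missing = coverage(ALL_CONDS, n=4)
# Drop the case E_x = E_y + 1 to see which inputs are reported as uncovered.
partial_fraction, partial_missing = coverage(ALL_CONDS[:1] + ALL_CONDS[2:], n=4)
```

With the complete list, the covered fraction is 1.0. With one case removed, the uncovered list contains exactly the input pairs of the missing case, which is the information needed to write the missing sub-specification.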
Verification System: Extended Word-Level SMV with *PHDDs
Model checking is a technique to determine which states satisfy a given temporal logic formula for a given state-transition graph. In SMV [19], BDDs are used to represent the transition relations and sets of states. The model checking process is performed iteratively on these BDDs. SMV has been widely used to verify control circuits in industry, but for arithmetic circuits, particularly for ones containing multipliers, the BDDs grow too large to be tractable. Furthermore, Boolean formulas are not an appropriate way to express the desired behavior of such circuits.
To verify arithmetic circuits, word-level SMV [9] with HDDs extended SMV to handle word-level expressions in the specification formulas. In word-level SMV, the transition relation as well as those formulas that do not involve words are represented using BDDs. HDDs are used only to compute word-level expressions such as addition and multiplication. When a relational operation is performed on two HDDs, a BDD is used to represent the set of assignments that satisfies the relation. The BDDs for temporal formulas are computed in the same way as in SMV. For example, the evaluation of the formula AG(R = A + B), where R, A and B are word-level functions and AG is a temporal operator, is performed by first computing the HDDs for R, A, B and A + B, then generating BDDs for the relation R = A + B, and finally applying the AG operator to these BDDs. The reader can refer to [9] for the details of word-level SMV.
Zhao's thesis [24] describes layering backward substitution, a variant of Hamaguchi's backward substitution approach [15], although the publicly released version of word-level SMV does not implement this feature. We have implemented this feature in our system. The main idea of layering backward substitution is to virtually cut the circuit horizontally by introducing auxiliary variables to avoid the explosion of BDDs while symbolically evaluating bit-level circuits. Figure 5 shows a horizontal division of a combinational circuit with primary inputs $x_0, \ldots, x_m$ and outputs $y_0, \ldots, y_n$. For $0 \le i \le n$, $y_i = f_i(x_0, \ldots, x_m)$ where $f_i$ is a Boolean function, but it may not be feasible to represent it as a BDD. The circuit is divided into several layers by declaring some of the internal nodes as auxiliary variables. In this example, $y_i = f_{1i}(z_0, \ldots, z_k)$, $z_i = f_{2i}(w_0, \ldots, w_l)$, and $w_i = f_{3i}(x_0, \ldots, x_m)$. Since each $f_{ji}$ is simpler than $f_i$, the BDDs needed to represent them are generally much smaller.
When we try to compute the *PHDD representation of the word $(y_0, \ldots, y_n)$ in terms of the variables $x_0, \ldots, x_m$, we first compute the *PHDD representation of the word in terms of the variables $z_0, \ldots, z_k$ as $F = \sum_{i=0}^{n} 2^{i} f_{1i}(z_0, \ldots, z_k)$. Then we replace each $z_i$, one at a time, by $f_{2i}(w_0, \ldots, w_l)$. After this, we have obtained the *PHDD representation for the word in terms of the variables $w_0, \ldots, w_l$. Likewise, we can replace each $w_i$ by $f_{3i}(x_0, \ldots, x_m)$. In this way, the *PHDD representation of the word in terms of the primary inputs can be computed without building BDDs for each output bit.
The main drawback of backward substitution is that users still need to provide the information about the auxiliary variables (i.e., the virtual cuts). Another drawback is that the *PHDDs may grow exponentially during the substitution process, since the auxiliary variables may generalize the circuit behavior for some regions. For example, suppose that the internal nodes $z_k$ and $z_{k-1}$ in the original circuit have the relation that both of them cannot be 0 at the same time and that the circuit of region 1 can only handle this case. After introducing the auxiliary variables, variables $z_k$ and $z_{k-1}$ can be 0 simultaneously. Hence, the word-level function $F$ represents a function more general than the original circuit of region 1. This generalization may cause the *PHDD for $F$ to blow up.
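The substitution process can be sketched with a much simpler word-level representation. The example below is an assumption-laden illustration, not our implementation: it uses multilinear integer polynomials over Boolean variables (dictionaries from sets of variable names to coefficients) in place of *PHDDs, a single layer of auxiliary variables `z0` and `z1`, and a hypothetical two-gate circuit beneath them.

```python
def const(c):
    return {frozenset(): c} if c else {}

def var(name):
    return {frozenset([name]): 1}

def add(p, q):
    out = dict(p)
    for mono, coef in q.items():
        total = out.get(mono, 0) + coef
        if total:
            out[mono] = total
        else:
            out.pop(mono, None)          # drop cancelled terms
    return out

def mul(p, q):
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out = add(out, {m1 | m2: c1 * c2})   # x * x = x for Boolean x
    return out

def substitute(p, name, g):
    """Replace the auxiliary variable `name` in p by the polynomial g."""
    out = {}
    for mono, coef in p.items():
        if name in mono:
            out = add(out, mul({mono - {name}: coef}, g))
        else:
            out = add(out, {mono: coef})
    return out

def xor(a, b):
    return add(add(a, b), mul(const(-2), mul(a, b)))

def and_(a, b):
    return mul(a, b)

def or_(a, b):
    return add(add(a, b), mul(const(-1), mul(a, b)))

z0, z1, x0, x1 = var("z0"), var("z1"), var("x0"), var("x1")
# Top layer: y0 = z0 XOR z1, y1 = z0 AND z1, word F = y0 + 2*y1.
F_z = add(xor(z0, z1), mul(const(2), and_(z0, z1)))
# Lower layer (hypothetical): z0 = x0 AND x1, z1 = x0 OR x1, substituted one at a time.
F_x = substitute(substitute(F_z, "z0", and_(x0, x1)), "z1", or_(x0, x1))
```

After both substitutions the word collapses to x0 + x1, even though no bit-level function for the individual outputs was ever built in terms of the primary inputs.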
Conditional Symbolic Simulation
To overcome these drawbacks, we introduced a conditional symbolic simulation technique into word-level SMV. Symbolic simulation [3] performs the simulation with inputs having symbolic values (i.e., Boolean variables or Boolean functions). The simulation process builds BDDs for the circuits. If each input is a Boolean variable, this approach may cause the explosion of BDD sizes in the middle of the process, because it tries to simulate the entire circuit for all possible inputs at once. The concept of conditional symbolic simulation is to perform the simulation process under a restricted condition, expressed as a Boolean function over the inputs.
In [17] , Jain and Gopalakrishnan encoded the conditions together with the original inputs as new inputs to the symbolic simulator using a parametric form of Boolean expressions, but it is hard to incorporate this approach into word-level SMV. Our approach is to apply the conditions directly during the symbolic simulation process. Right after building the BDD for a circuit gate, the condition is used to simplify the BDDs using the restrict [12] algorithm. Then, the simplified BDD is used as the input function for the gates connected to this one. This process is repeated until the outputs are reached. This approach can be viewed as dynamically extracting the circuit behavior under the specified condition without modifying the actual circuit.
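The gate-by-gate use of the condition can be sketched as follows. This is a toy model under explicit assumptions: Boolean functions are stored as truth tables (sets of satisfying rows) rather than BDDs, `restrict` is a simple stand-in that keeps a function only where the condition holds (the real restrict operator [12] may choose any function that agrees with it on the condition), and the netlist is a hypothetical 2-to-1 multiplexer.

```python
from itertools import product

INPUTS = ("sel", "a", "b")
SPACE = [dict(zip(INPUTS, bits)) for bits in product((0, 1), repeat=len(INPUTS))]
FULL = frozenset(range(len(SPACE)))

def table(fn):
    """Truth table of fn as the set of row indices where fn is true."""
    return frozenset(i for i, env in enumerate(SPACE) if fn(env))

def restrict(f, cond):
    """Stand-in for BDD restrict: keep f only where cond holds."""
    return f & cond

def simulate(netlist, cond):
    """Evaluate the gates in order, simplifying every gate output with cond."""
    values = {n: restrict(table(lambda env, n=n: env[n]), cond) for n in INPUTS}
    for out, op, ins in netlist:
        args = [values[i] for i in ins]
        if op == "and":
            result = args[0] & args[1]
        elif op == "or":
            result = args[0] | args[1]
        else:                                   # "not"
            result = FULL - args[0]
        values[out] = restrict(result, cond)    # simplified value feeds later gates
    return values

# out = (sel AND (a AND b)) OR (NOT sel AND (a OR b))
NETLIST = [
    ("g_and", "and", ("a", "b")),
    ("g_or", "or", ("a", "b")),
    ("n_sel", "not", ("sel",)),
    ("t1", "and", ("sel", "g_and")),
    ("t0", "and", ("n_sel", "g_or")),
    ("out", "or", ("t1", "t0")),
]
COND = table(lambda env: env["sel"] == 1)
VALUES = simulate(NETLIST, COND)
```

Under the condition sel = 1, the branch through `t0` reduces to the constant 0 during simulation, so only the behavior of the selected path is extracted and the circuit itself is left unmodified.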
Equalities and Inequalities with Conditions
To verify arithmetic circuits, it is very useful to compute the set of assignments that satisfy $F \sim G$, where $F$ and $G$ are word-level functions represented by HDDs or *PHDDs, and $\sim$ can be any one of $=, \neq, \le, \ge, <, >$.
In general, the complexity of this problem is exponential. However, Clarke et al. presented a branch-and-bound algorithm to efficiently solve this problem for a special class of HDDs, called linear expression functions, using the positive Davio decomposition [8]. The basic idea of their algorithm is first to compute $H = F - G$ and then to compute the set of assignments satisfying $H \sim 0$ using a branch-and-bound approach. The complexity of subtracting two HDDs is $O(|F| \cdot |G|)$. This algorithm only works well for the special class of HDDs (i.e., linear expression functions); for other classes of HDDs or *PHDDs, its complexity can grow exponentially. In the verification of arithmetic circuits, HDDs and *PHDDs are not always in the class of linear expression functions. Thus, the $H \sim 0$ operations cannot be computed in most cases. For example, the check "OUT = Round(...)" in some specifications of FP adders in Section 4.2 could not be finished even after several CPU hours. To solve this problem, we introduce relational operations with conditions, since the equalities and inequalities in our specifications must hold only under the conditions. These operations take three arguments $F$, $G$ and cond, where $F$ and $G$ are word-level functions and cond is a Boolean function. The operation first computes $H = F - G$ and then computes the set of assignments satisfying $H \sim 0$ under the condition cond. For example, the algorithm for $H = 0$ under the condition cond is given in Figure 6. This algorithm produces the BDDs satisfying $H = 0$ under the condition cond, and is similar to the algorithm in [8], except that it takes an extra BDD argument for the condition and uses the condition to stop the equality checking as soon as possible. To be conservative, when the condition is false, the returned result is false. In line 1, the condition is used to stop this algorithm when the condition is false.
In line 16, the condition is also used to stop the addition of two *PHDDs and the further equality checking in lines 18 and 19, respectively. The efficiency of this algorithm will depend on the BDDs for the condition. If the condition is always true, then this algorithm has the same behavior as Clarke's algorithm. If the condition is always false, then this algorithm will immediately return false regardless of how complex the *PHDD is. This new algorithm has reduced the computation time dramatically for the specifications in Section 4.2.
Equalities and Inequalities
The efficiency of Clarke's algorithm for relational operations on two HDDs depends on the complexity of computing $H = F - G$. The complexity of subtracting two HDDs is $O(|F| \cdot |G|)$, and similar algorithms can be used for these relational operators with *PHDDs. However, the complexity of subtracting two *PHDDs with disjoint sets of supporting variables may grow exponentially. For example, the complexity of subtracting two FP encodings represented by *PHDDs grows exponentially with the word size of the exponent part [6]. Thus, Clarke's algorithm is not suitable for these operators with two *PHDDs having disjoint sets of supporting variables. These operators are commonly used in our specifications of FP adders in Sections 4.1 and 4.2.
We have developed algorithms for these relational operators with two *PHDDs having disjunctive sets of supporting variables. Figure 7 shows the new algorithm for computing BDDs for the set of assignments that satisfy F > G. Similar algorithms are used for other relational operators. The main concept of this algorithm is to directly apply the branch-and-bound approach without performing a subtraction, whose complexity could be exponential. First, if both arguments are constant, the algorithm returns the comparison result of the arguments.
In lines 2 and 3, the weights $w_f$ and $w_g$ are adjusted by the minimum of them to increase the sharing of the operations, since $2^{w_f} f > 2^{w_g} g$ is the same as $2^{w_f - min} f > 2^{w_g - min} g$, where $min$ is the minimum of $w_f$ and $w_g$. Line 4 checks whether the comparison is in the computed cache and returns the result if it is found. In lines 5 to 7, the top variable is chosen and the 0- and 1-branches of $f$ and $g$ are computed. In lines 8 and 9, the function bound_value is used to compute the upper and lower bounds of these four sub-functions. The algorithm of bound_value is similar to that described in [8], except that edge weights are handled. The complexity of bound_value is linear in the graph size. When the top variable uses the Shannon decomposition, lines 11 and 12 try to bound and finish the search for the 0-branch. If this is not successful, line 13 recursively calls this algorithm for the 0-branch. The 1-branch is handled in a similar way. When the top variable uses the positive Davio decomposition, the computation for the 0-branch is the same as that in the Shannon decomposition. Since $\langle w_{f_1}, f_1 \rangle$ is the linear moment of $\langle w_f, f \rangle$ and the 1-cofactor of $\langle w_f, f \rangle$ is equal to $\langle w_{f_1}, f_1 \rangle + \langle w_{f_0}, f_0 \rangle$, the lower (upper) bound of the 1-cofactor of $\langle w_f, f \rangle$ is bounded by the sum of the lower (upper) bounds of $\langle w_{f_1}, f_1 \rangle$ and $\langle w_{f_0}, f_0 \rangle$.
This algorithm works very well for two *PHDDs with disjoint sets of supporting variables, while Clarke's algorithm has exponential complexity. For example, let $F = \prod_{i=0}^{n} 2^{2^{i} x_i}$ and $G = \prod_{i=0}^{n} 2^{2^{i} y_i}$. The variable ordering is $x_n, y_n, \ldots, x_0, y_0$ and all variables use the Shannon decomposition. The *PHDDs for $F$ and $G$ have the structure shown in Figure 8. It can be proven that the complexity of this algorithm for this type of function is $O(n)$ if the computed cache is a complete cache.
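The idea of comparing two functions by bounding, without first subtracting them, can be sketched on a simplified data structure. The code below is not the algorithm of Figure 7: it assumes Shannon-decomposed trees written as nested tuples `(variable, low, high)` with integer leaves, omits edge weights and the Davio cases, and uses hypothetical helper names throughout.

```python
from functools import lru_cache
from itertools import product

ORDER = ["x1", "y1", "x0", "y0"]            # interleaved variable ordering
RANK = {v: i for i, v in enumerate(ORDER)}

def build(fn, names, env=()):
    """Shannon-expand fn over `names` into nested (var, low, high) tuples."""
    if len(env) == len(names):
        return fn(dict(zip(names, env)))
    low = build(fn, names, env + (0,))
    high = build(fn, names, env + (1,))
    return low if low == high else (names[len(env)], low, high)

@lru_cache(maxsize=None)
def bounds(node):
    """(lower, upper) bound of the integer function rooted at node."""
    if not isinstance(node, tuple):
        return node, node
    lo0, hi0 = bounds(node[1])
    lo1, hi1 = bounds(node[2])
    return min(lo0, lo1), max(hi0, hi1)

def top(node):
    return RANK[node[0]] if isinstance(node, tuple) else len(ORDER)

def cofactors(node, var):
    if isinstance(node, tuple) and node[0] == var:
        return node[1], node[2]
    return node, node

@lru_cache(maxsize=None)                     # plays the role of the computed cache
def greater(f, g):
    """Decision tree with Boolean leaves for the assignments where f > g."""
    lo_f, hi_f = bounds(f)
    lo_g, hi_g = bounds(g)
    if lo_f > hi_g:                          # f > g on every assignment
        return True
    if hi_f <= lo_g:                         # f > g on no assignment
        return False
    var = ORDER[min(top(f), top(g))]
    f0, f1 = cofactors(f, var)
    g0, g1 = cofactors(g, var)
    r0, r1 = greater(f0, g0), greater(f1, g1)
    return r0 if r0 == r1 else (var, r0, r1)

def evaluate(tree, env):
    while isinstance(tree, tuple):
        tree = tree[2] if env[tree[0]] else tree[1]
    return tree

F = build(lambda e: 2 * e["x1"] + e["x0"], ["x1", "x0"])
G = build(lambda e: 2 * e["y1"] + e["y0"], ["y1", "y0"])
RESULT = greater(F, G)
COUNT = sum(evaluate(RESULT, dict(zip(ORDER, bits)))
            for bits in product((0, 1), repeat=len(ORDER)))
```

Here `F` and `G` are two 2-bit unsigned words over disjoint variables. The recursion stops as soon as the bounds decide the comparison, and no difference of the two functions is ever constructed.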
Short-Circuiting Technique
Can we verify the specifications of FP adders by conditional forward simulation? In our experience, all specifications for the FP adder design without a mantissa comparator, as in Figure 2.a, can be verified by conditional forward simulation, but not so for the FP adder containing a mantissa comparator, as in Figure 2.b. This is caused by a conflict of variable orderings for the mantissa adder and the mantissa comparator, which generates the signal $M_x < M_y$ (i.e., signal $d$ in Figure 3). The best variable ordering for the comparator is to interleave the two vectors from the most significant bit to the least significant bit (i.e., $x_{m-1}, y_{m-1}, \ldots, x_0, y_0$). Table 2 .., $y_0$, when the exponent difference is $k$. We observed that the best ordering for the specification represented by *PHDDs is the same as the best ordering for the mantissa adder. Thus, the extended word-level SMV cannot build the BDDs for both the mantissa comparator and the mantissa adder by conditional forward simulation when the exponent difference is large.
Table 2: Performance measurements of a 52-bit comparator with different orderings.
Let us examine the compare unit carefully. We find that the signal d is used only when $E_x = E_y$. In other words, it is not necessary to build the BDDs for it when $|E_x - E_y|$ is greater than 0. Based on this fact, we introduce a short-circuiting technique to eliminate unnecessary computations as early as possible. The word-level SMV and *PHDD packages are modified to incorporate this technique. In the *PHDD package, the BDD operators, such as And and Or, are modified to abort the operation and return a special token when the number of newly created BDD nodes within this BDD call is greater than a size threshold. In word-level SMV, for an And gate with two inputs, if the first input evaluates to 0, then 0 is returned without building the BDDs for the second input. Otherwise, the second input is evaluated. If the second input evaluates to 0 and the first input evaluates to a special token, 0 is returned. A similar technique is applied to Or gates with two inputs. Nand (Nor) gates can be decomposed into Not and And (Or) gates and use the same technique to terminate early. For other logic gates with two inputs, the result is a special token if any of the inputs evaluates to a special token. If the special token is propagated to the output of the circuit, then the size threshold is doubled and the output is recomputed. This process is repeated until the output BDD is built. For example, when the exponent difference is 30, the size threshold is 10000, the ordering is the best ordering of the mantissa adder, and the evaluation sequence of the compare unit shown in Figure 3 is d, e, f, g and h, the values of signals d, e, f, g and h will be special token, 0, 0, 1, and 1, respectively, by conditional forward simulation.
With these modifications, the new system can verify all of the specifications for both types of FP adders by conditional forward simulation. We believe that this short-circuiting technique can be generalized and used in any verification that exercises only part of a circuit.
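The gate-level rules of the short-circuiting technique can be sketched as follows. This is a minimal illustration under strong simplifying assumptions: signal values are the constants 0 and 1 rather than BDDs, the cost of building a signal is a made-up number compared against the size threshold, and the names `ABORT`, `costly`, `and_gate`, `or_gate`, and `evaluate` are hypothetical.

```python
ABORT = object()  # stands in for the "special token"

def costly(value, cost):
    """Hypothetical input: yields `value` only if its build cost fits the threshold."""
    def build(threshold):
        return value if cost <= threshold else ABORT
    return build

def and_gate(first, second):
    def build(threshold):
        a = first(threshold)
        if a is not ABORT and a == 0:
            return 0                  # second input is never built
        b = second(threshold)
        if b is not ABORT and b == 0:
            return 0                  # 0 wins even if the first input aborted
        if a is ABORT or b is ABORT:
            return ABORT
        return 1
    return build

def or_gate(first, second):
    def build(threshold):
        a = first(threshold)
        if a is not ABORT and a == 1:
            return 1
        b = second(threshold)
        if b is not ABORT and b == 1:
            return 1
        if a is ABORT or b is ABORT:
            return ABORT
        return 0
    return build

def evaluate(output, threshold):
    """Double the threshold until the output is built without the token."""
    while True:
        result = output(threshold)
        if result is not ABORT:
            return result, threshold
        threshold *= 2

# An expensive signal that aborts, gated by a cheap signal that is 0.
expensive = costly(1, 10**9)
cheap_zero = costly(0, 10)
print(evaluate(and_gate(expensive, cheap_zero), 10000))  # (0, 10000)
```

In the last line the expensive signal is never fully built, yet the And gate still yields 0 at the initial threshold, which mirrors how the comparator output can be ignored when it does not influence the result.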
Verification of FP Adders
In this section, we use the FP adder in the Aurora III Chip [16], designed by Dr. Huff as part of his PhD dissertation at the University of Michigan, as an example to illustrate the verification of FP adders. This adder is based on the same approach as the SNAP FP adder [20] at Stanford University. Dr. Huff found several errors with the approach described in [20]. This FP adder only handles operands with normal values. When the result is a denormal value, it is truncated to 0. The adder supports the IEEE double precision format and the four IEEE rounding modes. In this work, we verify the adder only in round-to-nearest mode, because we believe that this mode is the hardest one to verify. All experiments were carried out on a Sun 248 MHz UltraSPARC-II server with 1.5 GB of memory. The FP adder is described in the Verilog language in a hierarchical manner. The circuit was synthesized by Dr. John Zhong at SGI into flattened, gate-level Verilog, which contains latches, multiplexors, and logic gates. Then, a simple Perl script was used to translate the circuit from gate-level Verilog to SMV format.
Latch Removal
Huff's FP adder is a pipelined, two-phase design with a latency of three clock cycles. We handled the latches during the translation from gate-level Verilog to SMV format. Figure 9.a shows the latches in the pipelined, two-phase design. In the design, the phase 2 clock is the complement of the phase 1 clock. Since we only verify the functional correctness of the design and the FP adder does not have any feedback loops, the latches can be replaced by And gates, as shown in Figure 9.b, without changing the functional behavior of the circuit. Since the phase 2 clock is the complement of the phase 1 clock, we must replace the phase 2 clock by the phase 1 clock. Otherwise the circuit behavior will be incorrect.
Design with Bugs
In this section, we describe our experience with the verification of an FP adder with design errors. During the verification process, our system found several design errors in Huff's FP adder. These errors were not caught by the random simulation performed by Dr. Huff.
The first error we found occurs when A + C = 01.111...11, A + C + 1 = 10.000...00, and the rounding logic decides to add 1 to the least significant bit (i.e., the result should be A + C + 1), but the circuit outputs A + C as the result. This error is caused by incorrect logic in the path select unit, which categorizes this case as a no-shift case instead of a right shift by 1. While we were verifying the specification of true addition, our system generated a counterexample for this case in around 50 seconds. To ensure that this bug was not introduced by the translation, we confirmed it in the original design with Cadence's Verilog simulation, using the input pattern generated by our system.
Another design error we found is in the sticky bit generation. The sticky bit generation is based on the table given on page 10 of Quach's paper describing the SNAP FP adder [20]. The table only handles cases where the absolute value of the exponent difference is less than 54. The sticky bit is set to 1 when the absolute value of the exponent difference is greater than 53 (for normal numbers only). The bug is that the sticky bit is not always 1 when the absolute value of the exponent difference is equal to 54. Figure 10 shows the sticky bit generation when $E_x - E_y = 54$. Since $N_x$ has 52 bits, the leading 1 will be the Round (R) bit and the sticky (S) bit is the OR of all of the $N_y$ bits, which may be 0. Therefore an entry for the case $|E_x - E_y| = 54$ is needed in the table of Quach's paper [20].
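The sticky bit case can be checked with a small integer model of the alignment shift. This is a sketch under stated assumptions, not a model of Huff's circuit: the smaller operand is taken to be a normal double with a 52-bit fraction and a hidden leading 1, and the guard and round bits are taken to be the two positions just below the larger operand's least significant bit, which is consistent with the leading 1 landing on the Round bit at an exponent difference of 54. The function name `grs_after_alignment` is hypothetical.

```python
FRACTION_BITS = 52  # IEEE double precision fraction width

def grs_after_alignment(fraction_y, exp_diff):
    """Return (guard, round, sticky) after shifting y right by exp_diff."""
    significand = (1 << FRACTION_BITS) | fraction_y    # 53 bits with hidden 1
    extended = significand << 2                        # room for guard and round
    kept = extended >> exp_diff
    dropped = extended & ((1 << exp_diff) - 1)         # bits shifted past round
    guard = (kept >> 1) & 1
    round_bit = kept & 1
    sticky = int(dropped != 0)                         # OR of all dropped bits
    return guard, round_bit, sticky

print(grs_after_alignment(0, 54))   # (0, 1, 0): sticky is 0 at difference 54
print(grs_after_alignment(1, 54))   # (0, 1, 1)
print(grs_after_alignment(0, 55))   # (0, 0, 1): hidden 1 falls into sticky
```

With an all-zero fraction and a difference of 54 the sticky bit is 0, while for any difference of 55 or more the hidden 1 always falls into the sticky region. Under these assumptions, this is why forcing the sticky bit to 1 for every difference above 53 is wrong only at 54.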
In our experience, design errors in the mantissa path do not cause the *PHDD explosion problem. However, when the error is in the exponent path, the *PHDD may grow exponentially while building the output. A useful tip to overcome the *PHDD explosion problem is to reduce the exponent value to a smaller range by changing the exponent range condition in $Cond_{add}$ or $Cond_{sub}$ in Equations 5, 6 or 7.
Corrected Designs
After identifying the bugs, we fixed the circuit in the SMV format. In addition, we created another FP adder by adding the compare unit in Figure 2.b to Huff's FP adder. This new adder is equivalent to the FP adder in Figure 2.b, since the one's complement unit will not be active at any time.
To verify the FP adders, we combined the specifications for both addition and subtraction instructions into the specifications of true addition and true subtraction. We use the same specifications to verify both FP adders. Table 3 shows the CPU time in seconds and the maximum memory required for the verification of both FP adders. The CPU time is the total time for verifying all specifications. For example, the specifications of true addition are partitioned into 18 groups, and the specifications in the same group use the same variable ordering. The CPU time is the sum of these 18 verification runs, and the maximum memory is the maximum memory requirement of these 18 runs. FP adder II cannot be verified by conditional forward simulation without the short-circuiting technique. For both FP adders, the verification can be done within two hours and requires less than 55 MB. Each individual specification can be verified in less than 200 seconds.
In Table 3, FP adder II is the adder with the compare unit of Figure 2.b. For true subtraction, far represents the cases $|E_x - E_y| > 1$, and close represents the cases $|E_x - E_y| \le 1$.
In our experience, the decomposition type of the subtrahend's variables is very important to the verification time for the true subtraction cases. For these cases, the best decomposition type for the subtrahend's variables is the negative Davio decomposition. If the subtrahend's variables use the positive Davio decomposition, the *PHDDs for OUT cannot be built even after a long CPU time (> 4 hours).
As for coverage, the verified specifications cover 99.78% of the input space for the floating-point adders in IEEE round-to-nearest mode. The uncovered input space (0.22%) is caused by the unimplemented circuits for handling operands with denormal, NaN or ∞ values, and the cases where the result of a true subtraction is a denormal value.
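A rough sanity check of the coverage figure is possible with simple counting. The breakdown below is not given in the text: it assumes that the input space is counted uniformly over operand bit patterns and that an operand is normal exactly when its 11-bit exponent field is neither all zeros nor all ones. The variable names are illustrative.

```python
EXPONENT_CODES = 2 ** 11              # 11-bit exponent field in double precision
NORMAL_CODES = EXPONENT_CODES - 2     # exclude all-zeros and all-ones encodings

# Fraction of operand pairs in which both operands are normal numbers.
both_normal = (NORMAL_CODES / EXPONENT_CODES) ** 2
print(f"{both_normal:.4%}")           # 99.8048%

# Share left for true subtractions with denormal results, if 99.78% is covered.
remaining = both_normal - 0.9978
print(f"{remaining:.4%}")
```

Under these assumptions, excluding denormal, NaN and ∞ operands alone removes about 0.195% of the input space, which leaves roughly 0.02% to be attributed to true subtractions with denormal results.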
Our results should not be compared with the results in [7], since the FP adders handle different precisions (i.e., their adder handles IEEE extended double precision) and the CPU performance ratio of the two machines is unknown (they used an HP 9000 workstation with 256 MB of memory). Moreover, their approach partitioned the circuit into sub-circuits that are verified individually based on assumptions about their inputs, while our approach is implementation-independent.
Conversion Circuits
The overflow flag erratum of the FIST instruction (floating-point to integer conversion) [14] in Intel's Pentium Pro and Pentium II processors has illustrated the importance of verifying conversion circuits [16], which convert data from one format to another (e.g., IEEE single precision to double precision). These circuits are another common unit in floating-point processors. For example, the MIPS processor supports conversions between any of the three number formats: integer, IEEE single precision, and IEEE double precision.
We believe that the verification of the conversion circuits is much easier than the verification of floating-point adders, since these circuits are much simpler than floating-point adders and have only one operand (i.e., fewer variables than FP adders). For example, the specification of the double-to-single (D2S) operation, which converts the data from double precision to single precision, can be written as "( ... As another example, the specification of the single-to-double (S2D) operation, which converts the data from single precision to double precision, can be written as "output = input", since every number represented in single precision can be represented in double precision without rounding (i.e., the output represents the exact value of the input).
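The claim that S2D is exact while D2S needs rounding can be checked numerically. The sketch below is only an illustration of the arithmetic fact, not of the verification method: it assumes that Python floats are IEEE double precision values (true on common platforms), it handles normal single precision numbers only, and the helper names `single_bits_to_exact`, `s2d`, and `d2s` are hypothetical.

```python
import struct
from fractions import Fraction

def single_bits_to_exact(bits):
    """Exact rational value of a normal IEEE single precision bit pattern."""
    sign = -1 if bits >> 31 else 1
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    assert 0 < exponent < 255, "normal numbers only"
    return sign * Fraction((1 << 23) | fraction) * Fraction(2) ** (exponent - 127 - 23)

def s2d(bits):
    """Single-to-double: decode the single; the Python float result is a double."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def d2s(x):
    """Double-to-single and back, to expose the rounding step."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

bits = 0x3DCCCCCD                     # single precision value nearest to 0.1
# S2D: the double carries exactly the same value as the single input.
print(Fraction(s2d(bits)) == single_bits_to_exact(bits))   # True
# D2S: the double 0.1 is not representable in single precision.
print(d2s(0.1) == 0.1)                                     # False
print(d2s(0.5) == 0.5)                                     # True
```

The first comparison is the "output = input" specification evaluated on one value. The second shows why a D2S specification needs a rounding condition that S2D does not.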
Conclusions and Future Work
We presented extensions to word-level SMV to enable the verification of floating-point adders with implementation-independent specifications. Word-level SMV was improved by using the Multiplicative Power HDD (*PHDD) representation, by deriving efficient algorithms for equality and inequality operations between two *PHDDs, and by incorporating conditional symbolic simulation as well as a short-circuiting technique. Based on a case analysis of the signs and the relations of the two exponents, the specifications of floating-point adders are divided into several hundred implementation-independent sub-specifications.
Conditional forward simulation has the advantage of allowing implementation-independent specifications. We identified a conflict in the variable orderings between the mantissa comparator and the mantissa adder which prevents the use of conditional forward simulation. We presented a short-circuiting technique to solve this problem. This technique can be generalized and used in any verification that exercises only part of a circuit.
We used our system and the implementation-independent specifications to verify an FP adder from the University of Michigan. Our system found several bugs in Huff's FP adder and generated counterexamples within several minutes. After fixing the bugs, we created a variant of the Aurora III FP adder by introducing a mantissa comparator and extra circuits, to demonstrate the capability of our system to handle different FP adder designs. For each of the FP adders, the verification task finished in 2 CPU hours on a Sun UltraSPARC-II server for IEEE double precision. The verified specifications covered 99.78% of the entire input space. The uncovered input space (0.22%) corresponds to the unimplemented cases: operands with denormal, NaN or ∞ values, and true subtractions whose result is a denormal value. We believe that our system and specifications can be applied to directly verify FP adders and to help find design errors.
The overflow flag erratum of the FIST instruction (floating-point to integer conversion) [14] in Intel's Pentium Pro and Pentium II processors has illustrated the importance of verification of the conversion circuits which convert the data from one format to another format (e.g., IEEE single precision to double precision). Since these circuits are much simpler than floating-point adders and only have one input operand, we believe that our system can be used to verify the correctness of these circuits. We plan to verify the conversion circuits in the Aurora III chip.
