Abstract-An architecture for a combinational floating point multiplier and squarer is described for the purpose of producing a low power floating point square with small area requirements. The floating-point multiplier and squarer architectures are compared to demonstrate the power advantage of the squarer. The multiplier and squarer are combined into one circuit in order to take advantage of the squarer power improvements with a minimal increase in area. Circuitry shared among the units justifies the inclusion of a dedicated squarer: only a small amount of additional circuitry is required, and the power savings for squaring computations are significant compared to using a general-purpose multiplier to generate a squared value.
I. INTRODUCTION
The floating point squarer design presented in this paper is a full precision squarer. The squarer was constrained to be low power and to require small area in order to justify including both a multiplier and a squarer. Furthermore, the design shares circuitry between the floating-point squarer and multiplier circuits to minimize the area increase. If a unit-in-the-last-place (ulp) accurate approximation is the accepted rounding method for the system, a left-to-right approximation squarer such as the one presented in [8, 9, 10] would be the best choice, since floating point circuits only use the n most significant bits if truncation is used. This approach, however, is not a feasible option here because all the IEEE rounding methods are implemented in the floating-point multiplier in [2]. Requiring the squarer to implement the same rounding methods as the multiplier allows increased resource sharing when the two circuits are combined and provides a fairer comparison. However, the combinational circuit presented here can easily be modified to use only truncation, resulting in an approximate squarer (i.e., most-significant bit first).
To keep area cost low, the right-to-left low-order-bit-first radix-4 squarer presented in [7] is used. The squarer takes advantage of Booth radix-4 folding, an idea first presented in [11, 12, 13] that uses symmetry and radix-4 Booth recoding to reduce the number of partial products.
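The symmetry that folding exploits can be illustrated with a plain binary (radix-2) sketch. This is not the radix-4 Booth-folded design of [7]; the function name `folded_square` and the bit-serial loop are assumptions made only for illustration. Because the partial product bits satisfy x_i x_j = x_j x_i, each off-diagonal pair is generated once at doubled weight, and each diagonal term x_i x_i reduces to x_i.

```python
def folded_square(x, width):
    """Square an unsigned `width`-bit integer using the folded partial-product array.

    Illustrative radix-2 folding only (hypothetical helper, not the design of [7]).
    """
    total = 0
    for i in range(width):
        if not (x >> i) & 1:
            continue
        # Diagonal term: x_i * x_i = x_i, with weight 2^(2i).
        total += 1 << (2 * i)
        for j in range(i + 1, width):
            if (x >> j) & 1:
                # Pair (i, j) stands for both x_i*x_j and x_j*x_i,
                # so it is added once with weight 2^(i+j+1).
                total += 1 << (i + j + 1)
    return total
```

Only width*(width+1)/2 partial product bits are generated instead of width^2, which is the source of the area and power reduction that folding aims for.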
The overall architecture presented here is capable of utilizing any full-precision squaring implementation as a core design.
II. BACKGROUND
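The IEEE 754 floating point standard [1] can be interpreted as follows: a normalized value $x$ with sign bit $s$, biased exponent field $e$, and fraction field $f$ represents $x = (-1)^{s} \times 1.f \times 2^{e - \mathit{bias}}$.
Conceptually, floating point multiplication is performed through addition of the exponents and an accompanying multiplication of the significands, with normalization governing the value of the product exponent. If both a and b are normalized, the equation
$a \times b = (m_a \times m_b) \times 2^{E_a + E_b}$, where $m_a$, $m_b$ are the significands and $E_a$, $E_b$ the true exponents, is all that is needed to calculate the floating-point product. The above equation does not account for the exponent bias. The true value of the exponent is the value represented by the exponent bit field minus the bias, $E = e - \mathit{bias}$. In order to preserve the bias, it must be subtracted from the sum of the exponent fields as $2^{e_a - \mathit{bias}} \times 2^{e_b - \mathit{bias}} = 2^{(e_a + e_b - \mathit{bias}) - \mathit{bias}}$, so that the exponent field of the product is $e_a + e_b - \mathit{bias}$.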
Since the values for a and b are in the range [1, 2) , the product range is [1, 4) . The output from the circuit must be within the range [1, 2) , thus, the product undergoes normalization. This is accomplished by shifting the significand to the right by one bit position if the significand is greater than or equal to two. If the significand is divided by two, the sum of the exponents must accordingly be incremented by one.
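The significand multiplication and the one-bit normalization step described above can be sketched as follows. This is a minimal behavioural model, assuming binary32 field widths by default. The function name, the integer encoding of the significands (hidden one included), and the choice to move the binary point rather than discard a bit are assumptions for illustration. Overflow, underflow, and rounding are not handled here.

```python
def multiply_significands(sig_a, exp_a, sig_b, exp_b, frac_bits=23, bias=127):
    """Multiply two normalized operands given as (integer significand, biased exponent).

    Each significand includes the hidden one and encodes a value in [1, 2)
    with `frac_bits` fractional bits. Returns (product, frac_len, exponent),
    where the normalized significand is product / 2**frac_len, in [1, 2).
    """
    product = sig_a * sig_b            # value in [1, 4) with 2*frac_bits fractional bits
    exponent = exp_a + exp_b - bias    # one bias must be removed from the sum
    frac_len = 2 * frac_bits
    two_or_more = (product >> (frac_len + 1)) & 1
    if two_or_more:
        frac_len += 1                  # divide by two by moving the binary point
        exponent += 1                  # compensate in the exponent
    return product, frac_len, exponent
```

For example, 1.5 × 1.5 = 2.25 is encoded with `sig_a = sig_b = 3 << 22` and biased exponents of 127; the sketch returns the significand 1.125 with a biased exponent of 128.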
The algorithm must account for inputs that are denormalized as well as the special values defined in the standard. Normalizing an input value is accomplished in a similar manner to normalizing the output. The significand is shifted left until its most significant one occupies the leading bit position, and the exponent is decremented by the number of positions the significand is shifted. In the IEEE standard, the special values that the input can represent are 'not a number' (NaN), positive infinity, and negative infinity. The equation presented above cannot be used to calculate the product if one or more of the inputs is a special value. Therefore, special values must be detected and the appropriate output selected.
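Detection of the special values and of denormalized inputs follows directly from the IEEE 754 encoding. The sketch below assumes the binary32 format; the function names and the dictionary keys are hypothetical and are not the signal names used in [2].

```python
import struct


def float_to_bits(x):
    """Return the 32-bit pattern of x rounded to IEEE 754 binary32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def classify_binary32(bits):
    """Classify a binary32 bit pattern from its exponent and fraction fields."""
    exp = (bits >> 23) & 0xFF      # 8-bit biased exponent field
    frac = bits & 0x7FFFFF         # 23-bit fraction field
    return {
        "is_nan": exp == 0xFF and frac != 0,
        "is_inf": exp == 0xFF and frac == 0,
        "is_zero": exp == 0 and frac == 0,
        "is_denorm": exp == 0 and frac != 0,
    }
```

An all-ones exponent field marks NaN or infinity, and an all-zeros exponent field marks zero or a denormalized number; the fraction field distinguishes the two cases in each pair.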
Since only a certain range of values is represented, the output is not always an exact representation of the product. In this case, flags are set to indicate that the output is not exact. Flags are required for the cases of divide by zero, invalid, underflow, overflow, and inexact. The remaining sections describe how a floating-point multiplier accomplishes these tasks for the purpose of comparison to our design.
III. FLOATING POINT MULTIPLIER
The floating-point multiplier implemented in [2] decomposes the multiplier architecture into the following components: pre-processor, special case detector, prenormalizer, multiplier, exponent adder, normalizer, shifter, rounder, flagger, and assembler. The components are connected as shown in Fig. 2. Each of the components is discussed in detail in the following paragraphs.
The preprocessor in Fig. 2 implements the following Boolean equations to determine whether either of the inputs is a special case. The variables Asig, Bsig, Aexp, and Bexp represent the width of the significands and exponents of the inputs A and B, respectively.
The outputs aisnan, bisnan, zero, and infinity from the preprocessor are inputs to the special case detector as well as A and B. The specialsigncase is defined as
If invalid is set, specialsign is the most significant bit of A when aisnan is true and either the significand of A is greater than or equal to the significand of B or bisnan is false (this condition is referred to as aishighernan in [2]); otherwise, specialsign is the most significant bit of B.
The inputs A and B and the pre-processor outputs aisdenorm and bisdenorm are used to normalize A and B such that the significand value lies within [1, 2). If the input is denormalized, the significand is left shifted by the width minus the bit position of the most significant one. If the significand is shifted, the exponent must be adjusted such that the number continues to represent the same value.
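The prenormalization step can be sketched as follows, assuming binary32 field widths by default. The function name and the use of a signed Python integer for the adjusted exponent are assumptions; a hardware implementation needs a widened exponent to hold the non-positive values that result.

```python
def prenormalize(exp_field, frac_field, frac_bits=23):
    """Return (significand, exponent) with the significand in [1, 2).

    The significand is an integer with `frac_bits` fractional bits and the
    hidden one made explicit. The exponent stays biased and may be <= 0
    for denormalized inputs.
    """
    width = frac_bits + 1
    if exp_field != 0:
        # Normalized input: only the hidden one has to be inserted.
        return (1 << frac_bits) | frac_field, exp_field
    if frac_field == 0:
        # Zero cannot be normalized; it is left to the special case logic.
        return 0, 0
    # Shift by the width minus the bit position of the most significant one.
    shift = width - frac_field.bit_length()
    # Denormalized numbers use an effective biased exponent of 1.
    return frac_field << shift, 1 - shift
```

For the smallest binary32 denormalized number (fraction field 1), the shift is 23 and the adjusted biased exponent is -22, which corresponds to 1.0 × 2^(-22-127) = 2^-149.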
The multiplier circuit used here is a carry-save array multiplier. A carry-save array multiplier is a tree multiplier consisting of a one-sided reduction tree of carry-save adders with a ripple-carry adder for the final addition [14] . The normalized significands are input to the multiplier. The most significant bit of the product is used to determine if the product is equal to or greater than two. The exponent adder calculates the exponent of the product and the tiny bit by the equation, . If expsum is less than one, tiny is set to one.
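The structure of a carry-save array multiplier can be modelled at the bit-vector level as shown below. This is a behavioural sketch rather than the gate-level circuit of [2]; the function names are hypothetical, and Python integers stand in for the rows of full adders.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (sum, carry) with sum + carry == a + b + c."""
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def array_multiply(x, y, width):
    """Multiply two unsigned `width`-bit integers with a one-sided carry-save chain."""
    acc_sum, acc_carry = 0, 0
    for i in range(width):
        # Partial product for bit i of the multiplier operand y.
        partial_product = (x << i) if (y >> i) & 1 else 0
        # Each row absorbs one new partial product without carry propagation.
        acc_sum, acc_carry = carry_save_add(acc_sum, acc_carry, partial_product)
    # The final carry-propagate addition models the ripple-carry adder.
    return acc_sum + acc_carry
```

Carries are only propagated once, in the final addition, so the delay of each row is that of a single full adder.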
If tiny and twoormore are true, the product is already normalized and normalized is set to prod. Since twoormore represents an overflow of the product and tiny represents an overflow of the sum of the exponents, the product is shifted left by one position if either is true or it is shifted left by two if neither one is true. This results in placing the most significant bit of the possible output in the most significant bit of the product. If tiny is true, the product is denormalized and the significand will need to be shifted right by the magnitude of expsum. If expsum is greater than zero, shiftexp is set to expsum, otherwise shiftexp is set to zero. If any accuracy is lost due to the shifting operations, shiftloss is set to one.
The multiplier produces an output with a bit-width twice as large as the bit widths of the inputs. However, the IEEE standard only allows for the output to be stored in the same bit width as the inputs. Since it sometimes is not possible to store the exact result, the output must be rounded. Rounding in the floating-point multiplier implementation in [2] is based on the rounding methods presented in [3]. The multiplier allows four different types of rounding modes: round to nearest even (00), round to zero (01), round to positive infinity (10), and round to negative infinity (11). The rounding mode is determined by the two least significant bits of control. The upper half of the shifted product is used as the basis for the answer that is referred to as unrounded. The purpose of the Rounder in Fig. 2 is to determine whether or not to add one to unrounded. The simplest rounding method is round to zero: one is never added to unrounded, as the least significant bits are always truncated (referred to as [...] in [2]). Unrounded is incremented if unrounded is [...] of the lowerBits are 1, and the rounding mode is round to nearest even. Using the round to positive infinity method, unrounded is incremented when the sign bit is 0 and any of the lowerBits are 1. The round to negative infinity mode requires that unrounded is incremented when the sign bit is 1 and any of the lowerBits are 1. [...] following Boolean expression:
The exponent will need to be adjusted if the [...] to be rounded up and adding one to the [...] cause it to be greater than or equal to two. [...] the exponent will need to be incremented and [...] left shifted by one. Since the exponent might [...] must be rechecked to determine if the number [...] it must also be checked for overflow. In co[...] IEEE standard [1], once the value has been [...] longer exact. If the value is inexact, it must [...] is accomplished by computing [...]. Just as [...] are checked in to determine if they are denormalized, [...] rounded product must also be checked.
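The decision of whether to add one to unrounded can be sketched for the four rounding modes as follows. The mode encodings are those given in the text; the function name, the `lower_width` parameter, and the exact round to nearest even condition (the usual IEEE tie-to-even rule) are assumptions, since the corresponding Boolean expression of [2] is not reproduced here.

```python
ROUND_NEAREST_EVEN = 0b00
ROUND_ZERO = 0b01
ROUND_POS_INF = 0b10
ROUND_NEG_INF = 0b11


def round_increment(unrounded, lower_bits, lower_width, sign, mode):
    """Return 1 if one must be added to `unrounded`, otherwise 0.

    `lower_bits` holds the `lower_width` discarded low-order bits of the
    product, and `sign` is the sign bit of the result.
    """
    any_lower = lower_bits != 0
    if mode == ROUND_ZERO:
        return 0                              # always truncate
    if mode == ROUND_POS_INF:
        return int(sign == 0 and any_lower)   # positive results round up
    if mode == ROUND_NEG_INF:
        return int(sign == 1 and any_lower)   # negative magnitudes grow
    # Round to nearest even (assumed tie-to-even rule).
    half = 1 << (lower_width - 1)
    if lower_bits > half:
        return 1
    if lower_bits == half:
        return unrounded & 1                  # tie: round up only if odd
    return 0
```

If the increment makes the significand reach two, the exponent has to be adjusted and rechecked for overflow, as described above.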
The flagger implements the Boolean [...]
The Assembler in Fig. 2 is comp[...] multiplexers. The inputs specialsigncase, [...] and overflow are used to select the c[...] specialsigncase is 1, the most significant [...] specialsign, otherwise it is set to sign.
IV. DUAL RECODED RADIX-4 SQ
Several modifications are made to the [...] multiplier discussed in the previous section. [...] is in the pre-processor. The pre-processor [...] represented by the Boolean expressions:
The adder circuit in the exponent [...] left shift by one. The multiplier [...] as described in [7]. The squarer [...] square generators, a sign extension [...] The three least significant bits [...] 0, 1, or 2. Based on the radix-4 [...] least significant bits are negated [...] least significant bits are defined [...]
The sign extension circuit [...], and the output of the partial square generators [...] added together using an adder [...] rounder, flagger, and assembler [...]
V. RESULTS AN
The floating-point squarer and multiplier are both implemented in Verilog [...] standard cell library describe[...] constrained to a 50 ns clock cycle [...] 32, and 64 bit operands.
The pre-processor and pre-normalizer can easily be split into two [...] normalizers with no area penalty [...] not increase is that the dataflow [...] pre-processor and pre-normalizer [...] other. The only exception is the [...] pre-processor. Therefore the sign [...] are moved outside of the p[...]
For the special case detector, the inputs bzero and azero are produced by a 2×1 multiplexer (MUX) to produce zero. The MUX is controlled by the input mult, which is set when the circuit functions as a multiplier and is zero when the circuit functions as a squarer. Therefore, the output of the MUX is zero when mult is 0. The same circuit with ainf and binf is used to calculate the infinity flag. Input B is ignored if mult is 0. The size of the special case detector increases by two 1-bit 2×1 MUXs, two 2-input OR gates, and one n-bit 2×1 MUX.
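One reading of the special case detector modification is sketched below. It assumes that each 2×1 MUX selects between the B-side flag and a constant 0 under control of mult, and that its output is ORed with the A-side flag; this gate arrangement and the function name are assumptions that are consistent with the stated gate count but are not confirmed by the text. The n-bit MUX is not modelled.

```python
def special_case_flags(azero, bzero, ainf, binf, mult):
    """Return (zero, infinity) for multiplier mode (mult=1) or squarer mode (mult=0)."""
    # 2x1 MUXs: the B-side flags are forced to 0 when the circuit squares A.
    bzero_gated = bzero if mult else 0
    binf_gated = binf if mult else 0
    # 2-input OR gates combine the A-side flags with the gated B-side flags.
    zero = azero | bzero_gated
    infinity = ainf | binf_gated
    return zero, infinity
```

In squarer mode the flags therefore depend on A alone, which matches the statement that input B is ignored if mult is 0.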
The other modifications to the multiplier circuit that allow it to include a squaring unit are shown in Figure 6. Four 2×1 MUXs, a full precision integer squarer, an exponent adder for the square, and an AND gate are required. While a MUX could have been used on one of the inputs of the multiplier exponent adder to eliminate the need for the square exponent adder, it was not used because it would have offered only minimal area gains while sacrificing the power gains from using a shift left by one instead of an adder to compute the exponent. The AND gate is used in lieu of a 2×1 MUX since the sign of the square is always positive.
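The exponent and sign simplifications for the square can be sketched as follows. The function names are hypothetical, the default bias of 127 assumes binary32, and the placement of the bias subtraction after the shift is an assumption; the text states only that a shift left by one replaces the adder and that an AND gate replaces a MUX for the sign.

```python
def product_exponent(exp_a, exp_b, bias=127):
    """Biased exponent of a product before normalization (uses an adder)."""
    return exp_a + exp_b - bias


def square_exponent(exp_a, bias=127):
    """Biased exponent of a square: doubling is a left shift by one."""
    return (exp_a << 1) - bias


def result_sign(sign_a, sign_b, mult):
    """Sign of the result: XOR of the input signs, forced to 0 when squaring."""
    return (sign_a ^ sign_b) & mult
```

Because a square is never negative, gating the multiplier sign with mult is sufficient, and no second data input is needed as it would be for a MUX.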
VI. CONCLUSION
The floating-point squarer implementation results in significant savings in total power. Since many circuits require a multiplier, the use of the squarer becomes a tradeoff between power and area. The solution to the area issue is the inclusion of both a multiplier and a squarer unit that utilize common circuitry. The shared circuitry overcomes excessive area penalties due to the presence of both units. The components are only enabled as needed for the selected operation, thus maintaining most of the power reductions that would occur from using separate circuits. This architectural arrangement justifies the inclusion of individual squaring and multiplication circuitry since the power savings for the squaring operation are achieved with a minimal increase in overall area.
