Introduction
Recent increases in FPGA capacity and capability have led to broader use of custom floating-point datapaths. When configurable resources were scarce, floating-point arithmetic could not be practically implemented on FPGAs because of its large area and latency cost compared to fixed-point arithmetic. The steady and rapid growth of FPGA resources has increased FPGA floating-point throughput to match or beat conventional floating-point processors, and FPGA floating-point throughput is growing at a faster rate. Indeed, it has been forecast that FPGAs will enjoy an order of magnitude higher throughput on double precision floating-point arithmetic than conventional CPUs by the year 2009 [25].
Recently, there has been much work done on floating-point for FPGAs, ranging from investigating hardware architectures [20] to implementation-specific optimizations [23], [19]. Several parameterizable floating-point libraries have been developed specifically for FPGAs [1], [5]. However, the cost of floating-point arithmetic on FPGAs is still high enough that novel alternatives such as Dual Fixed Point continue to be considered [8].
The general computing world has settled on floating-point representations which conform to IEEE standards 754 and 854 ([15], [16]). These standards play a crucial role in ensuring numerical robustness and code compatibility among machines of vastly different architectures. However, the choice of floating-point representation has such a dominant impact on FPGA implementation cost that the standards are often bent, giving the designer freedom to choose a custom floating-point representation in order to spend FPGA resources as efficiently as possible. For example, work has been done to automatically determine custom floating-point bitwidths for each node of a computation [10], and others have demonstrated the suitability of very small floating-point representations with much less precision and range than IEEE single precision [6].
Choosing non-standard floating-point representations by manipulating bitwidths is natural for the FPGA community, since bitwidth has such an obvious effect on circuit implementation costs. Besides the non-standard bitwidths, FPGA-based floating-point units often save hardware cost by omitting support for denormalized numbers or some of the rounding modes specified by the IEEE standard.
Although the impact of non-standard bitwidth floating-point representations on FPGA implementation is well known, the effect of non-standard radix floating-point representations has not been examined. The word "radix" in the context of computer arithmetic has acquired several meanings, which can be confusing. We use the word radix to refer to the numerical base of the floating-point representation, meaning that the mantissa is interpreted to be composed of digits of some base greater than 2. This is not to be confused with high radix Booth encoding for multiplication or high radix division algorithms, as found in references to "high-radix" floating-point operators such as [26].
In this paper, we show that higher radix floating-point representations, especially hexadecimal floating-point, are uniquely suited for FPGA-based computation, especially when denormalized numbers are supported. Choosing a higher radix floating-point representation can reduce adder area by 25% and multiplier area by 12%, while still providing equal worst-case and better average-case numerical accuracy than the standard binary representation. This paper justifies higher radix representations from a numerical perspective as well as presents implementation results (Xilinx Virtex-II) for arithmetic operators which operate on higher radix number representations.
Mathematical Terminology
Floating-point arithmetic approximates a real number $x$ by choosing an element of a finite set of exactly representable real numbers $S$, called the significance space [21]. Elements of $S_\beta^u$ have the form
$$x = s \cdot \left(\sum_{i=0}^{u-1} d_i \beta^{i}\right) \cdot \beta^{\delta-u} \cdot \beta^{e} \qquad (1)$$
where $s = \pm 1$ represents the sign, $\beta$ is the base, or radix, $u$ is the number of $\beta$-ary digits in the mantissa, $d_{u-1} \cdots d_0$ are the digits of the mantissa, with $d_{u-1}$ being the most significant digit, $e$ is the exponent, and $\beta^{\delta-u}$ is a term that accounts for the placement of the implied radix point. With this notation, the radix point is placed $\delta$ digits into the mantissa, from the most significant side. Equivalently, we can understand the $\beta^{\delta-u}$ term as a scaling factor which leads to interpreting the mantissa to be in the range $[\beta^{\delta-1}, \beta^{\delta})$. We consider radices of the form $\beta = 2^{\nu}$, which ensures that each digit $d_i$ is efficiently representable in binary. Expanding (1) into binary, with $\beta = 2^{\nu}$, elements of $S$ have the form
$$x = s \cdot \left(\sum_{j=0}^{t-1} b_j 2^{j}\right) \cdot 2^{\nu\delta-t} \cdot 2^{\nu e} \qquad (2)$$
where each $\beta$-ary digit $d_i$ from (1) is expanded into its binary form $b_{\nu(i+1)-1} \cdots b_{\nu i}$, and $t = \nu u$ is the number of bits in the binary encoding of the mantissa.
The term $2^{\nu\delta-t}$ accounts for the placement of the implied binary point. We require $t$ to be an integer, which ensures that the mantissa is representable with an integral number of bits, but make no such restriction on $u$, allowing fractional digits of radix $\beta$. Similarly, we require $\nu\delta$ to be an integer, but allow fractional $\delta$. With this representation, the radix point is placed $\nu\delta$ bits into the mantissa, which may fall in the middle of a $\beta$-ary digit. In other words, we allow the radix point to function as a binary point, regardless of radix, positioning it between any two bits of the mantissa, not just at the boundaries of radix $\beta$ digits.
If the leading one is found within the most significant $\nu = \log_2 \beta$ bits, the number is considered normalized. Otherwise, the number is considered denormalized, which is permitted only when representing exceptionally small numbers. Normalization is essential to floating-point accuracy because it keeps the mantissa bits significant and enables easy comparison of two floating-point numbers. However, it is also expensive to implement in FPGA hardware. Higher radix representations simplify normalization: with conventional binary representations, the leading non-zero bit must be exactly located and positioned, whereas with a radix $2^{\nu}$ representation, the leading non-zero bit is located and positioned less precisely, only to within $\nu$ bits. This simplification results in hardware savings.
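The binary form of the representation and the normalization rule can be made concrete with a small sketch. The function names `value` and `is_normalized` and the argument layout are illustrative assumptions, not part of the paper; the mantissa is passed as a plain t-bit integer and `nu_delta` is the product νδ (the number of bits in front of the binary point).

```python
from fractions import Fraction

def value(sign, mant, t, e, nu, nu_delta):
    """Exact value s * M * 2^(nu*delta - t) * 2^(nu*e) of a radix-2^nu number.

    mant is the t-bit mantissa read as an unsigned integer M.
    """
    assert sign in (1, -1) and 0 <= mant < (1 << t)
    return sign * Fraction(mant) * Fraction(2) ** (nu_delta - t + nu * e)

def is_normalized(mant, t, nu):
    """True when the leading one lies within the most significant nu bits."""
    return (mant >> (t - nu)) != 0

# Radix 2, 4-bit mantissa 1.101, exponent 1
print(value(1, 0b1101, 4, 1, 1, 1))
# Radix 16, 7-bit mantissa 0.001101, exponent 1 (three leading zeros, still normalized)
print(value(1, 0b0001101, 7, 1, 4, 1), is_normalized(0b0001101, 7, 4))
```

Both calls evaluate to 13/4, which shows how leading zeros in the radix 16 mantissa carry information that the radix 2 format keeps in its exponent.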
Background
Before the advent of floating-point standards, various radices greater than 2 were in use. For example, the Illiac II used β = 4, the Burroughs 5500 used β = 8, and the IBM 360 used β = 16 [2] . IBM mainframes still support hexadecimal floating-point (β = 16) for compatibility reasons [11] . The designers of these systems chose higher radix representations because of area and latency savings for higher radix floating-point arithmetic units, which come primarily through reductions in the size of the shifters and leading one detection circuitry due to relaxed normalization procedures.
During the late 1960s and early 1970s, there was tension between hardware designers and numerical analysts as to the choice of radix. Hardware designers wanted to use higher radix representations to reduce the hardware cost of floating-point functional units, and numerical analysts were set on radix 2 because of its numerical advantages. The numerical analysts won the battle, because the cost of a floating-point arithmetic unit decreased so quickly that hardware penalties incurred by the use of radix 2 ceased to be a concern. IEEE standard 754 mandates the use of radix 2, and although IEEE 854 is entitled "IEEE Standard for Radix-Independent Floating-Point Arithmetic", it forbids the use of radices other than β = 2 and β = 10 [16] . Decimal representations are required for financial calculations, in order to produce exactly the same results as those done by hand [4] , but their inefficient implementation causes them to be avoided whenever possible.
Despite the hardware advantages of higher radix floating-point, radix 2 has been chosen as the standard over other commensurable radices because radix 2 systems always have the best numerical accuracy when given a fixed number of bits to encode the entire floating-point number, including mantissa, exponent, and sign [3]. This comes about because there are no leading zeros in normalized radix 2 mantissas, which means that all mantissa bits are always significant. With higher radices of the form $\beta = 2^{\nu}$, up to $\nu - 1$ bits may be leading zeros. These leading zeros can be understood as exponent information which has been encoded into the mantissa, which has the effect of reducing the number of significant bits in the mantissa. Additionally, the first digit of a normalized radix 2 mantissa is always 1. Since this is known for every normalized radix 2 number, the leading digit can be implied, freeing one extra bit of precision in the actual representation. Because the first digit of a normalized floating-point number lies in the range $[1, \beta - 1]$, none of the leading bits can be implied for representations with $\beta > 2$. This fact gives radix 2 representations an extra bit of precision over other representations.
Because memory and register file oriented computing systems must represent floating-point data in a convenient, fixed number of bits, numerical accuracy per bit of representation is the dominant measure of a floating-point representation's usefulness for the general computing world. The studies which led to the choice of radix 2 as the standard were all based on this underlying premise, and so they kept the bitwidth of the floating-point word constant as they determined which radix was most advantageous (e.g., [2] , [3] ). To our knowledge, this fundamental assumption has not been questioned in light of the unique capabilities and limitations of FPGAs.
In the ASIC community, hexadecimal floating-point has been recently advocated for use in lightweight, low power ASIC designs [9] , where the authors found that it reduced the size of the floating-point adder by 11%, but increased the size of the multiplier by 43% for very small (14-15 bit) floating-point word sizes. Our work shows a greater benefit for hexadecimal floating-point operators because we include support for denormalized numbers, we are implementing on an FPGA instead of an ASIC, and because we present results from larger floating-point formats (equivalent to IEEE single, double, and quadruple precision).
Higher Radix Representations for FPGAs
In contrast to conventional computing systems, custom floating-point datapaths implemented on FPGAs are not as limited by memory concerns. Data being processed on an FPGA is more likely to stay on chip until the application has finished processing it [25] . This, along with the use of distributed state in pipeline registers instead of a central register file, frees FPGA-based computation systems from rigid restrictions on floating-point word size imposed by memory interfaces. Instead, FPGA performance is constrained by circuit area, since FPGAs gain their high performance by exploiting spatial parallelism, unrolling a computation to fill the available compute fabric. Non-standard bitwidth floating-point formats are common on FPGAs because their use may enable the implementation of a particular computation or increase performance, with "acceptable" numerical accuracy.
Since FPGA performance is constrained by circuit area instead of memory interface, the fundamental assumption which led to the choice of radix 2 and exclusion of higher radix representations is not of primary importance. Instead of numerical accuracy per bit of representation, FPGA-based computing systems aim to maximize numerical accuracy and performance per unit of circuit area. From this perspective, higher radix representations are more efficient for FPGAs, even when their binary forms must be enlarged slightly in order to equalize numerical performance with their radix 2 counterparts. Also, the implied bit touted as a unique advantage of radix 2 representations is not a compelling advantage from this perspective, since it saves less than 1% of circuit area in FPGA implementations of floating-point operators.
The numerical disadvantages of higher radix representations can be resolved by adding a few bits to the mantissa, which is not practical in the general computing world because of the constraints imposed by memory interfaces. For a radix $2^{\nu}$ representation, an additional $\nu - 1$ bits of mantissa are sufficient to equalize worst case numerical accuracy, while providing increased average accuracy [9]. Because FPGAs are architected with bit-level granularity, the penalty for a few extra mantissa bits is minimal.
Storing slightly wider intermediate results in embedded block memories on FPGAs should not pose a problem, since most FPGA block memories can be configured in multiples of 9 bits wide, and thus have a few extra bits to store data. These extra bits were originally intended to store parity information; however, they are often used to store data. For example, the internal single precision floating-point datatype used in [12] is 34 bits wide, and another single precision floating-point datatype provided commercially by Nallatech is 36 bits wide [24]. Our higher radix floating-point representations still fit conveniently in FPGA embedded memories, despite being a few bits wider than the standard datatypes.
Some people may feel that a higher radix implementation is not acceptable for FPGA designs which aim to replace an IEEE compliant CPU. Although it is true that a higher radix design will not produce bit-for-bit the same output as a standard IEEE design, the IEEE specification does not require identical output from all IEEE compliant operators. For example, the Intel x87 floating-point unit performs all calculations in an internal 80-bit double extended format, converting down to single or double precision only on command [17] . The results from an x87 FPU will thus be more accurate and therefore not identical to the results from a 64-bit double precision unit which satisfies the bare minimum of the IEEE specification. Similarly, the widespread use of fused multiply-add units, such as those on IBM and Motorola's PowerPC and Intel's Itanium processors, also results in more accurate computation than the IEEE standard requires [13] . This occurs because only one rounding operation is required in a multiply-add operation, as opposed to the two which are necessary to do a multiply and then an add, using standard operators. Systems which use a fused multiply-add unit will therefore produce slightly different, more accurate results than those which do not.
Analogously, FPGA-based systems which use IEEE formats externally and compute internally with a higher radix are acceptable for applications requiring IEEE compliance, since they have higher numerical accuracy and equal dynamic range.
The Radix Point and Dynamic Range
Changing the radix of a floating-point representation affects both the mantissa and the exponent value of a floating-point number. Since the radix is exponentiated by the exponent value, higher radix representations need smaller values of exponent to represent the same number. Essentially, we divide the radix 2 exponent by $\nu$ to yield the radix $2^{\nu}$ exponent. Thus, the exponent of a radix $2^{\nu}$ representation can be restricted in range by a factor of $\nu$ compared to a radix 2 representation, while still keeping a dynamic range equal to that of the radix 2 representation. This allows us to represent the higher radix exponent with $\log_2 \nu$ fewer bits and keep roughly the same dynamic range. However, there are some subtleties that should be explained.
According to the IEEE standards, exponents are represented in biased form, where an $n$ bit exponent has bias $\mathrm{BIAS} = 2^{n-1} - 1$, and the actual encoded exponent value is $e + \mathrm{BIAS}$. This particular bias allows floating-point comparison to be performed as a signed integer comparison when the floating-point number is structured as [sign, exponent, mantissa] [22], which is useful, and so we choose the standard bias for our higher radix representations.
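As a rough check of the claim that an exponent with log2(ν) fewer bits covers about the same range, the sketch below computes the span of powers of two reachable by a biased exponent. The helper names are hypothetical, and the sketch deliberately ignores exponent codes that a real format reserves for zero, denormalized numbers, infinities, and NaNs.

```python
def bias(n):
    """Standard bias for an n-bit exponent field."""
    return 2 ** (n - 1) - 1

def exponent_span_log2(n, nu):
    """(lowest, highest) power of two scaled by an n-bit biased radix-2^nu exponent.

    Assumes every exponent code 0 .. 2^n - 1 is usable (no reserved codes).
    """
    lowest = (0 - bias(n)) * nu
    highest = ((2 ** n - 1) - bias(n)) * nu
    return lowest, highest

print(exponent_span_log2(8, 1))  # radix 2, 8 exponent bits
print(exponent_span_log2(6, 4))  # radix 16, 6 exponent bits
```

Under these assumptions the radix 2 exponent spans 2^-127 to 2^128 and the radix 16 exponent spans 2^-124 to 2^128; the small gap at the low end is recovered by the leading zeros allowed in the radix 16 mantissa.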
Along with the biased exponent, another feature of IEEE standard floating-point is that the mantissa is interpreted to be within the range [1, 2) . This means that the standard places the binary point 1 binary digit into the mantissa, or utilizing our earlier notation, defines δ = 1. We envision that many applications will require input and output data in a conventional, IEEE compliant form, so we choose the parameters of our higher radix representation to keep translation hardware to a minimum. The placement of the binary point affects both dynamic range and the translation hardware necessary to map from standard representations to higher radix representations, so we need to choose it carefully.
Conversions between radix 2 and radix $\beta = 2^{\nu}$ will involve division and multiplication by $\nu$, as explained earlier, so we are interested in simplifying the conversions for radices such that $\nu = 2^{k}$, which allows the division and multiplication to be accomplished by shifts alone. Figure 1 shows this mapping for a radix 16 representation with 6 bits of exponent: the upper 6 bits of the radix 2 exponent become the radix 16 exponent. The information from the truncated exponent bits is encoded by introducing up to 3 leading zeros into the radix 16 mantissa. In order for the exponent mapping to be accomplished by a simple truncation, figure 1 shows that the mantissa of the higher radix representation should be interpreted to be within the range $[2/\beta, 2)$. This requires placing the implied radix point within the first $\beta$-ary digit of the mantissa. In our earlier notation, this choice corresponds to $\delta = 1/\nu$. This choice of radix point placement is unorthodox: other higher-radix floating-point representations, such as the hexadecimal formats used by IBM [11] or the CMU lightweight floating-point project [9], place the radix point to the left of the mantissa. The standard choice leads to a more complicated exponent mapping, as shown by figure 2.
When the implied binary point is selected as outlined, the dynamic range of the higher radix format is as close as possible to standard radix 2. For other radices of the form $\beta = 2^{\nu}$ with $\nu \neq 2^{k}$, it is not possible to equalize the dynamic range. These representations will cover either a considerably larger or smaller range than standard binary representations.
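The reason truncating the exponent works with a mantissa range of [2/β, 2) can be checked with a short derivation for β = 16. It assumes the standard bias for both formats, an n-bit radix 2 exponent field, and an (n − 2)-bit radix 16 exponent field; the symbols E_2, E_16, r, m_2, and m_16 are introduced here for illustration only.

```latex
% Split the biased radix-2 exponent into its upper bits and its two lowest bits:
%   E_2 = 4 E_{16} + r, \qquad r \in \{0, 1, 2, 3\}.
\begin{aligned}
x &= m_2 \cdot 2^{\,E_2 - (2^{n-1} - 1)}, \qquad m_2 \in [1, 2) \\
  &= m_2 \cdot 2^{\,4E_{16} + r - 2^{n-1} + 1} \\
  &= \left(m_2 \cdot 2^{\,r-3}\right) \cdot 2^{\,4\left(E_{16} - (2^{n-3} - 1)\right)} \\
  &= m_{16} \cdot 16^{\,E_{16} - (2^{n-3} - 1)}, \qquad
     m_{16} = m_2 \cdot 2^{\,r-3} \in \left[\tfrac{1}{8},\, 2\right) = \left[\tfrac{2}{16},\, 2\right).
\end{aligned}
```

The upper exponent bits E_16 are therefore already a correctly biased radix 16 exponent, and the two discarded bits r only select a right shift of the mantissa by 3 − r places, which is where the up to 3 leading zeros come from.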
Encoding
Now that we have explained how the radix point should be placed, we can illustrate how changing the radix affects bit-level encoding. The first row of table 1 shows how the number 2.0 is encoded in a radix 2 representation with 4 bits of exponent and 4 bits of mantissa, explicitly showing the leading one of the mantissa that is usually implicit. The second row shows how the same number is encoded in radix 16 with 4 bits of mantissa and 2 bits of exponent, given the binary point is placed as we described earlier. Notice that in this case, no precision is lost, and both systems are able to exactly represent the number.
The third row of the table shows how the number 3.25 is encoded in the example radix 2 representation. Row 4 shows how encoding 3.25 in the hexadecimal representation causes precision to be lost. Since 3 leading zeros were introduced, the bottom 3 significant bits of the mantissa were lost, leading to a significant representation error: instead of 3.25 as desired, we end up with 2.0! Row 5 shows how adding an additional 3 bits to the mantissa is sufficient for the hexadecimal representation to capture all the precision of its binary counterpart. Since the worst possible scenario for hexadecimal floating-point introduces 3 leading zeros, if the mantissa is extended by 3 bits, every number representable in binary floating-point is exactly represented in hexadecimal format.
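The encodings described for 2.0 and 3.25 can be reproduced with a small sketch. It assumes positive inputs, an unbiased exponent, a mantissa interpreted in [1/8, 2) with the binary point one bit into the mantissa, and truncation of excess mantissa bits (the behavior that yields 2.0 in the text); `encode_radix16` and `decode_radix16` are hypothetical helper names.

```python
import math
from fractions import Fraction

def encode_radix16(x, t):
    """Truncate a positive x to a radix-16 number with a t-bit mantissa.

    Returns (e, M): unbiased exponent and mantissa integer, so that the
    stored value is M * 2^(1 - t) * 16^e with the mantissa in [1/8, 2).
    """
    x = Fraction(x)
    assert x > 0
    e = 0
    while x / Fraction(16) ** e >= 2:
        e += 1
    while x / Fraction(16) ** e < Fraction(1, 8):
        e -= 1
    m = x / Fraction(16) ** e            # exact mantissa in [1/8, 2)
    M = math.floor(m * 2 ** (t - 1))     # keep t bits, drop the rest
    return e, M

def decode_radix16(e, M, t):
    """Exact value represented by (e, M) with a t-bit mantissa."""
    return Fraction(M, 2 ** (t - 1)) * Fraction(16) ** e

for x, t in ((2.0, 4), (3.25, 4), (3.25, 7)):
    e, M = encode_radix16(x, t)
    print(x, t, e, format(M, "0{}b".format(t)), float(decode_radix16(e, M, t)))
```

With a 4-bit mantissa, 3.25 is stored as mantissa 0001 and decodes to 2.0; with a 7-bit mantissa it is stored as 0001101 and decodes to 3.25 exactly.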
Numerical Accuracy
Higher radix floating-point representations are currently unpopular because of a perceived lack of numerical accuracy. When the number of bits in the floating-point word is kept constant, high radix representations do lack accuracy, as we just illustrated, but if the mantissa is allowed to grow slightly in a higher radix representation, worst case accuracy can be equalized. A normalized radix $\beta = 2^{\nu}$ representation introduces up to $\nu - 1$ leading zero bits, which can be understood as an encoding of exponent information from the radix 2 representation. Thus, if the mantissa is extended by $\nu - 1$ bits, worst case accuracy will be exactly equal to that of the corresponding radix 2 representation. Table 2 illustrates how the overall floating-point word size changes as a function of radix, while keeping worst case accuracy and dynamic range equal to or better than radix 2, taking into account the loss of the implied leading bit, the reduction in exponent size, and the expansion of the mantissa which come with higher radix representations.

| Radix | Floating-Point Word Size |
|---|---|
| 2 | n bits |
| 4 | n + 1 bits |
| 8 | n + 2 bits |
| 16 | n + 2 bits |
| 256 | n + 6 bits |
| $\beta = 2^{\nu}$ | $n + \log_2 \beta - \log_2 \log_2 \beta$ |

Table 2. Floating-point Word Size
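The bookkeeping behind the word sizes for radix 4 and radix 16 can be sketched as follows. The function names are illustrative assumptions, `mant_bits` counts the radix 2 mantissa including its leading one, and the sketch covers only ν = 2^k, where the exponent shrinks by a whole number of bits.

```python
def higher_radix_format(exp_bits, mant_bits, nu):
    """Exponent and mantissa widths of the radix-2^nu format matching a radix-2 one.

    mant_bits includes the radix-2 leading one. Covers nu = 2^k only.
    """
    k = nu.bit_length() - 1
    assert nu == 1 << k, "sketch covers nu = 2**k only"
    return exp_bits - k, mant_bits + nu - 1

def word_size(exp_bits, mant_bits, nu):
    """Stored word size: sign + exponent + stored mantissa bits."""
    e, m = higher_radix_format(exp_bits, mant_bits, nu)
    stored = m - 1 if nu == 1 else m     # only radix 2 can imply its leading bit
    return 1 + e + stored

# IEEE single precision: 8 exponent bits, 24 mantissa bits with the implied one
for nu in (1, 2, 4):
    print(2 ** nu, higher_radix_format(8, 24, nu), word_size(8, 24, nu))
```

For single precision this gives 32, 33, and 34 bits for radix 2, 4, and 16, with the radix 16 format using 6 exponent bits and 27 mantissa bits.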
Interestingly, when worst case accuracy is equalized, the higher radix representation has better average case accuracy. To see this, we compare the significance space density of a higher radix representation with an extra ν − 1 bits of mantissa to that of the corresponding radix 2 representation. Relative significance space density is the ratio of the amount of distinct numbers which can be exactly represented in 2 different significance spaces. Matula found [21] that the relative significance space density for two floating-point representations S 
Illustrating the meaning of this equation, figure 3 shows the 16 members of S Figure 3 illustrates that every number in a binary floating-point format is exactly represented in its hexadecimal counterpart, which justifies our claim that higher radix representations can provide equal worst case accuracy to standard binary representations. Figure 4 shows how significance space density changes for radices 2, 4 and 16 as a function of overall floating-point word size. At equal word size, hexadecimal representations have 94% of the density of binary representations. When the hexadecimal representation has 2 more bits, and therefore equal worst-case accuracy, it represents 3.75 times as many numbers as its binary counterpart. Since rounding ensures that the closest element of S to the exact result of the computation is selected as the output of that computation, the denser significance space of worst-case accuracy normalized higher radix representations translates into better average-case accuracy.
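The 94% and 3.75 figures can be reproduced with a simple counting argument, offered here as a sketch rather than as the formula from [21]. It counts normalized mantissas (leading one within the top ν bits) and divides by ν, since one step of a radix 2^ν exponent spans ν factors of two. The single precision widths used below (8/24 bits for radix 2, 6 exponent bits for radix 16) are taken from the formats described in this paper; the function name is hypothetical.

```python
from fractions import Fraction

def density(t, nu):
    """Normalized t-bit mantissas per factor-of-two interval for radix 2^nu."""
    count = 2 ** t - 2 ** (t - nu)   # mantissas with a one in the top nu bits
    return Fraction(count, nu)       # one exponent step covers nu binades

binary = density(24, 1)              # radix 2, 24-bit mantissa (32-bit word)
hex_equal_word = density(25, 4)      # radix 16 squeezed into the same 32 bits
hex_two_more = density(27, 4)        # radix 16 with 2 more bits (34-bit word)

print(float(hex_equal_word / binary))  # 0.9375
print(float(hex_two_more / binary))    # 3.75
```

The ratios 15/16 and 15/4 match the densities quoted in the text.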
Indeed, the authors of [9] found that using a hexadecimal floating-point format with only 1 extra bit, instead of the 2 that are required for worst-case accuracy normalization, gave more accurate results in their calculation than the standard radix 2 format, despite the fact that their hexadecimal format had worse worst-case accuracy. Accordingly, for some applications, fully extending the mantissa to equalize worst case accuracy may not be required.
Rounding
Moving to a higher radix also affects rounding. There are several different types of rounding defined in the IEEE specification. The default and most numerically accurate is unbiased rounding to nearest even, and since it is also the most complicated rounding procedure, we will focus on how this mode must be implemented to preserve its numerical properties with higher radix representations.
The most important property of a good rounding procedure is that the rounded result of an arithmetic operator is the same result as if the operation had been accomplished with infinite precision, then rounded to the given representation [18] . We want our rounding procedure, adapted for higher radices, to preserve this property.
During a floating-point add operation, before the add occurs, the radix points of the two operands must be aligned. This is accomplished by shifting the mantissa of the smaller operand to the right as dictated by the difference in their exponents. During this shifting, significant bits may be shifted away into oblivion. Later on, during normalization, the result of the add may be shifted to the left, which ideally should reintroduce the bits which were lost at alignment. In order to do this, in radix 2 addition there are three extra bits which are added to the least significant end of the smaller addend, which are usually called the Guard, Round and Sticky bits [7] .
For higher radix addition, the Guard bit must be turned into a Guard digit in order for the operation to retain all the significant bits that may be shifted out during alignment, and later shifted back in during normalization. The function of the Round and Sticky bits doesn't change, and so they remain unchanged in higher radix rounding procedures.
Thus, instead of the 3 extra round bits needed for radix 2, we now have ν + 2 round bits. We have included this rounding procedure in our adder.
For multiplication, there is no need for a guard digit. Unbiased rounding requires one round bit to determine whether the result should be rounded up or down, and the sticky bit to signal whether all other bits of the result are 0. The rounding procedures remain as they are in radix 2 operations.
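A minimal sketch of round-to-nearest-even driven by a round bit and a sticky bit is shown below, together with the count of extra adder bits (a ν-bit guard digit plus round and sticky). It operates on non-negative integers and is an illustration of the rounding rule only; the function names are assumptions, and a real datapath would also handle the carry-out that rounding up can produce.

```python
def round_nearest_even(value, drop):
    """Drop the low `drop` bits of a non-negative integer, rounding to nearest even."""
    if drop == 0:
        return value
    kept = value >> drop
    round_bit = (value >> (drop - 1)) & 1               # first discarded bit
    sticky = (value & ((1 << (drop - 1)) - 1)) != 0     # OR of all lower bits
    # Round up when above the halfway point, or exactly halfway and kept is odd.
    if round_bit and (sticky or (kept & 1)):
        kept += 1
    return kept

def extra_adder_bits(nu):
    """Guard digit (nu bits) plus the round and sticky bits."""
    return nu + 2

print(round_nearest_even(0b10110, 2))  # 5.5  -> 6 (tie, rounds to even)
print(round_nearest_even(0b10010, 2))  # 4.5  -> 4 (tie, rounds to even)
print(round_nearest_even(0b10011, 2))  # 4.75 -> 5
print(extra_adder_bits(1), extra_adder_bits(4))  # 3 for radix 2, 6 for radix 16
```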
Implementation and Results
Using the parameterization capability of JHDL [14], we have implemented an adder and multiplier which are parameterizable in both bitwidth and radix, as well as conversion circuitry between radix 2 and radix 16. The parameterized circuits are unpipelined, so we also implemented pipelined radix 2 and radix 16 single precision adders and multipliers to show that the efficiency gains seen in the unpipelined operators remain after pipelining.
All experiments were placed and routed on a Xilinx Virtex-II 6000, speed grade 6, with embedded multiplier stepping 1. No hand or relative placement was used. We present results for radix 16 and radix 4, since they are easily convertible to radix 2 and are therefore of greatest interest. All circuits implement the round to nearest even rounding procedure, as well as support for denormalized numbers. When reference is made to single precision, etc., the high radix circuits have the same worst case accuracy and dynamic range as their IEEE radix 2 counterparts, i.e. they use the formats described above, including the extension of the mantissa by $\nu - 1$ bits and the contraction of the exponent by $\log_2 \nu$ bits. Thus, the hexadecimal representation compared against IEEE single precision has 6 bits of exponent and 27 bits of mantissa, while its radix 2 counterpart has 8 bits of exponent and 24 bits of mantissa.
Priority Encoder
The priority encoder is one of the critical circuits in the adder and multiplier. It finds the leading non-zero digit using fast carry logic. Using a higher radix significantly reduces the size and critical path of the priority encoder. Figure 5 illustrates a radix 16 priority encoder. The incoming word is divided into radix-16 digits, which are then priority encoded conventionally. The key is that the number of digits is reduced by a factor of 4, significantly reducing the complexity of the operation. The priority encoder for a single precision radix 2 adder is 25 bits long: 24 bits for the mantissa, plus 1 for the guard bit where the leading one may be located. The carry chain for such an encoder is then 23 bits long, since the top and bottom bits do not require carry propagation. In contrast, the corresponding priority encoder for the single precision radix 16 adder has a 6 bit long carry chain. This arises because there are 27 bits in the mantissa and 4 guard bits, making 31 bits or 8 radix-16 digits, since the incomplete digit must be counted as a full digit. The priority encoder is a relatively small circuit, so it doesn't contribute much to the size reduction. However, it is in the critical path of the normalizing circuitry, which makes the encoder critical path length reduction more prominent.
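The digit-level priority encoding can be modeled in software as below. This is a behavioral sketch, not the carry-chain circuit of Figure 5; the function names are hypothetical, and padding the incomplete digit at the least significant end is an assumption made here so that digits align from the most significant bit.

```python
def digit_count(width, nu):
    """Number of radix-2^nu digits in a width-bit word; a partial digit counts as one."""
    return -(-width // nu)

def leading_digit_index(word, width, nu):
    """Index (0 = most significant) of the first non-zero digit, or None if word is 0."""
    n = digit_count(width, nu)
    padded = word << (n * nu - width)    # assumed: pad the partial digit at the LSB end
    mask = (1 << nu) - 1
    for i in range(n):
        if (padded >> ((n - 1 - i) * nu)) & mask:
            return i
    return None

print(digit_count(25, 1))   # radix 2 single precision adder: 25 positions to examine
print(digit_count(31, 4))   # radix 16: 27 mantissa + 4 guard bits -> 8 digits
print(leading_digit_index(1 << 30, 31, 4), leading_digit_index(1, 31, 4))
```

The encoder only needs to resolve 8 candidate positions for radix 16 instead of 25 for radix 2, which is the source of the shorter carry chain described above.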
Normalizing and Aligning Shifters
The bulk of the hardware savings comes from reducing the size of the normalizing and aligning shifters. Since a radix $2^{\nu}$ shifter only has to shift to within $\nu$ bits, the amount of shifting which must be performed is reduced significantly. Figure 6 illustrates this benefit for the normalizing shifter of the single precision radix 16 adder. The shifter must shift 0 to 7 radix-16 digits, requiring 3 stages of 2 input muxes. The corresponding radix 2 shifter must shift 0 to 24 bits, requiring 5 stages of 2 input muxes. Since these shifters occupy a relatively large area, reducing their cost creates most of the area benefit of high radix multipliers and adders.
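The stage counts follow from the number of distinct shift amounts, assuming a logarithmic shifter in which each stage of 2 input muxes either passes its input or shifts it by a fixed power-of-two amount. The helper name below is an illustrative assumption.

```python
def mux_stages(max_shift):
    """Stages of 2-input muxes needed to shift by any amount in 0..max_shift.

    Equals the number of bits needed to encode max_shift.
    """
    return max_shift.bit_length()

print(mux_stages(7))    # radix 16 normalizer: 0 to 7 digits -> 3 stages
print(mux_stages(24))   # radix 2 normalizer: 0 to 24 bits  -> 5 stages
```

Each stage is as wide as the datapath, so removing two of five stages removes a substantial share of the shifter's multiplexers.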
Unpipelined Adder
Our adder implements the canonical single path floating-point adder architecture as outlined in [22], [7]. Table 4 illustrates that the combinatorial critical path through high-radix adders is reduced slightly, around 5% for radix 4 and 7% for radix 16.
The benefits we have seen using radix 16 are greater than those observed in [9] for several reasons. Firstly, shifters are relatively cheaper in VLSI technology than in FPGA fabric, since they can use more efficient transistor level structures specifically designed for shifting. This reduces the impact of minimizing the shifters, in contrast to FPGAs, on which shifters are expensive. Secondly, [9] examines the benefit of hexadecimal floating-point representations at very small word sizes. As can be seen in table 3, the benefit from higher radix representations increases with word size.
Unpipelined Multiplier
Our multiplier uses the single-path architecture outlined in [25] , and supports denormalized numbers. The multiplier makes use of embedded block multipliers for the mantissa multiplication. Table 5 shows that radix 4 multipliers are slightly smaller than their radix 2 counterparts, while radix 16 multipliers are around 12% smaller. Higher radix operators used exactly the same number of block multipliers as the binary multiplier.
Interval arithmetic reminds us that the result of a normalized radix 2 multiplication with both mantissas in the range $[1, 2)$ will have a mantissa in the range $[1, 4)$, while the result of a normalized higher radix multiplication with both mantissas in the range $[2/\beta, 2)$ will have a mantissa in the range $[4/\beta^2, 4)$. Thus, the radix 2 multiplier has 2 ranges to select between to produce a normalized result: $[1, 2)$ and $[2, 4)$, while the higher radix multiplier has 3 ranges to choose from: $[4/\beta^2, 2/\beta)$, $[2/\beta, 2)$, and $[2, 4)$. This results in an extra layer of muxing, which, along with the larger adder tree needed to form the mantissa product, reduces the benefit of higher radix representations for multipliers.
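The range selection after the mantissa multiply can be sketched as a three-way choice. The sketch assumes normalized operands with mantissas in [2/β, 2), so the product lies in [4/β², 4), and it works on exact fractions rather than bit vectors; `normalize_product` is a hypothetical name.

```python
from fractions import Fraction

def normalize_product(p, e, beta):
    """Bring a mantissa product p in [4/beta^2, 4) back into [2/beta, 2).

    Returns (mantissa, exponent); the represented value p * beta^e is unchanged.
    """
    p = Fraction(p)
    if p >= 2:                       # range [2, 4): shift right one digit
        return p / beta, e + 1
    if p < Fraction(2, beta):        # range [4/beta^2, 2/beta): shift left one digit
        return p * beta, e - 1
    return p, e                      # range [2/beta, 2): already normalized

print(normalize_product(3, 0, 16))                 # mantissa 3/16, exponent 1
print(normalize_product(Fraction(1, 64), 0, 16))   # mantissa 1/4, exponent -1
print(normalize_product(3, 0, 2))                  # radix 2: mantissa 3/2, exponent 1
```

For β = 2 the lower threshold equals the smallest possible product, so the middle branch and the shift-right branch are the only ones reached, which is why the radix 2 multiplier needs one less mux layer.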
Multipliers which support denormalized numbers must have both a normalizing and a denormalizing shifter, the sizes of which are reduced by high-radix representations. This results in the area benefit we have observed: if our multiplier did not support denormalized numbers, we would see a small area penalty rather than a savings, due to the added mux and slightly enlarged mantissa multiplier. However, FPGAs see a smaller penalty from the mantissa extension than ASIC implementations because of the discrete area scaling behavior of multipliers constructed from smaller block multipliers. Thus, block multipliers and support for denormalized numbers explain why we observe an area benefit, as opposed to the area penalty seen by [9]. The combinatorial critical path through our high radix multipliers increased by 2-8% for the hexadecimal multiplier, and somewhat more for the radix 4 multiplier. This is primarily due to the enlarged mantissa multiplier.
Pipelined Operators
Table 7 shows that the area benefits observed earlier are not changed significantly by pipelining. The radix 16 single precision adder is 20% smaller than its radix 2 counterpart, while the radix 16 multiplier is 10% smaller than the radix 2 multiplier. This is an expected result, since the topology and architecture of the operators do not change with radix, and therefore the costs of pipelining should not be significantly different for higher radix operators. Table 8 shows the clock periods of the pipelined operators. The radix 16 adder sees a 13% smaller clock period at the same pipeline depth. This is due to the priority encoder stage, which is significantly less complicated in the radix 16 adder. The multiplier sees an insignificantly reduced clock period. Combining the area and time savings, the radix 16 adder has a 30% smaller area-time product, while the radix 16 multiplier has a 14% smaller area-time product.
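The area-time figures quoted for the pipelined operators follow from multiplying the remaining area fraction by the remaining clock-period fraction. The short sketch below reproduces that arithmetic from the rounded percentages in the text. The helper names are hypothetical, and the multiplier's clock-period reduction is only back-calculated here from the quoted area and area-time figures, not a reported measurement.

```python
def area_time_saving(area_saving, period_saving):
    """Fractional reduction in area-time product from fractional
    reductions in area and in clock period."""
    return 1.0 - (1.0 - area_saving) * (1.0 - period_saving)


def implied_period_saving(area_saving, area_time_saving_value):
    """Clock-period reduction implied by a given area reduction and
    area-time reduction (inverse of area_time_saving)."""
    return 1.0 - (1.0 - area_time_saving_value) / (1.0 - area_saving)


# Radix 16 adder: 20% smaller area, 13% shorter clock period.
adder = area_time_saving(0.20, 0.13)

# Radix 16 multiplier: 10% smaller area, 14% smaller area-time product.
# The implied period reduction is approximate because the inputs are rounded.
multiplier_period = implied_period_saving(0.10, 0.14)

if __name__ == "__main__":
    print(f"adder area-time saving: {adder:.1%}")
    print(f"implied multiplier period saving: {multiplier_period:.1%}")
```

The adder result rounds to the 30% quoted above, and the multiplier's implied period reduction of a few percent is consistent with the small clock-period change described for it.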

Converter Hardware
As explained earlier, the hardware necessary to convert a radix 2 representation to a radix β representation is simplified when β = 2^(2^k). Of radices that satisfy this condition, radix 16 seems to be optimal, since it yields more hardware savings than radix 4, yet doesn't require the floating-point word size to be lengthened excessively to compensate for reduced accuracy, as do large radices such as 256.
Since a hexadecimal floating-point representation is 2 bits longer than its corresponding binary counterpart, some applications will require keeping the datapath externally radix 2 but internally radix 16, stationing converters at the gateways to the circuit. Although converter circuitry may be necessitated by higher radix representations, it is worth noting that FPGA-based floating-point datapaths gain performance by keeping data on chip as much as possible, especially since FPGAs are very pin-limited compared with the parallelism that can be accommodated internally. These two facts combined support the assertion that relatively few of these converters should be needed, and the overall system cost should be reduced by using a higher radix representation.
We chose the implied binary point placement to simplify conversion between standard radix 2 and radix 16. Because of this choice, conversion from radix 2 to radix 16 requires only a shifter which shifts the mantissa 0-3 places to the right, as determined by the bottom 2 bits of the exponent, which are then discarded to form the radix 16 exponent. A small bit of logic is required to handle exponent corner cases. No rounding is necessary, since no significant bits are lost in the conversion.
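The radix 2 to radix 16 conversion described above can be modeled in software as follows. This is a minimal sketch under stated assumptions, not the converter circuit itself: exponents are treated as unbiased integers (so exponent biasing and the corner-case logic are omitted), the mantissa carries an explicit hidden bit, and the names `radix2_to_radix16`, `value2`, `value16` and `FRAC_BITS` are hypothetical.

```python
from fractions import Fraction

FRAC_BITS = 23  # fraction bits of the radix 2 mantissa (IEEE single precision)


def radix2_to_radix16(mant2, exp2):
    """Convert a normalized radix 2 number to radix 16.

    mant2 is an integer in [2**FRAC_BITS, 2**(FRAC_BITS + 1)), i.e. a
    mantissa in [1, 2) with the hidden bit explicit; exp2 is unbiased.
    Returns (mant16, exp16), where mant16 has FRAC_BITS + 3 fraction bits.
    """
    shift = (-exp2) % 4          # 0-3, depends only on the low 2 exponent bits
    exp16 = (exp2 + shift) // 4  # exact, since exp2 + shift is a multiple of 4
    # The radix 16 mantissa field is 3 bits wider, so the right shift by
    # `shift` places loses nothing: in the wider field it is a left shift
    # by 3 - shift.
    mant16 = mant2 << (3 - shift)
    return mant16, exp16


def value2(mant2, exp2):
    """Exact value of a radix 2 (mantissa, exponent) pair."""
    return Fraction(mant2) * Fraction(2) ** (exp2 - FRAC_BITS)


def value16(mant16, exp16):
    """Exact value of a radix 16 (mantissa, exponent) pair."""
    return Fraction(mant16, 2 ** (FRAC_BITS + 3)) * Fraction(16) ** exp16
```

Comparing `value2` and `value16` before and after conversion confirms that the value is preserved exactly, which is why no rounding step appears in this direction.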
The conversion from radix 16 back to radix 2 requires a shifter to shift the mantissa 0-3 places to the left, eliminating the leading zeros. Since the radix 16 format can represent more numbers than the radix 2 format, a round operation is required to choose the closest representable radix 2 number, and some logic must be included for exponent corner cases. In order to avoid instantiating a rounder in this converter, we integrate the converter into the normalization and rounding steps of the arithmetic operators, making hybrid radix operators which accept hexadecimal numbers and output binary, IEEE results.
Table 9. Converter Circuitry Area
The cost of these converters is reasonable: in the worst case scenario, with a datapath consisting of a radix 2 → radix 16 converter, a single radix 16 adder, and a radix 16 → radix 2 converter, the aggregate cost is between 2% and 9% more than the cost of a single radix 2 adder. Since FPGAs gain their performance by performing multiple calculations and limiting I/O, few of these converters should be needed compared to the number of arithmetic operators in the datapath. Thus, using hexadecimal floating-point internally and binary floating-point externally should reduce overall system cost, despite the use of converter circuitry.
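For the radix 16 → radix 2 direction, the left shift and rounding described in this section can be sketched in software as below. This is an illustrative model, not the hybrid operators themselves: exponents are unbiased integers, exponent corner cases are omitted, round-to-nearest with ties to even is assumed as the rounding rule, and the name `radix16_to_radix2` and its `frac_bits` parameter are hypothetical.

```python
def radix16_to_radix2(mant16, exp16, frac_bits=23):
    """Convert a normalized radix 16 number to radix 2.

    mant16 is an integer with frac_bits + 3 fraction bits and one integer
    bit, representing a mantissa in [1/8, 2); exp16 is unbiased.
    Returns (mant2, exp2) with mant2 in [2**frac_bits, 2**(frac_bits + 1)).
    """
    width = frac_bits + 4                     # total bits in the mant16 field
    lead_zeros = width - mant16.bit_length()  # 0-3 for normalized input
    if not 0 <= lead_zeros <= 3:
        raise ValueError("mant16 is not a normalized radix 16 mantissa")

    aligned = mant16 << lead_zeros            # leading one now in the top bit
    exp2 = 4 * exp16 - lead_zeros

    # Drop the 3 extra fraction bits, rounding to nearest, ties to even
    # (the tie rule is an assumption of this sketch).
    kept, dropped = aligned >> 3, aligned & 0b111
    if dropped > 4 or (dropped == 4 and kept & 1):
        kept += 1
    if kept == 1 << (frac_bits + 1):          # rounding carried out to 2.0
        kept >>= 1
        exp2 += 1
    return kept, exp2
```

With a small `frac_bits` the behavior is easy to trace by hand. For example, with `frac_bits=4` the 8-bit mantissa `0b00010111` has three leading zeros, so it is shifted left by three places and the exponent is reduced by three, with no rounding needed because the dropped bits are zero.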
Future Work
We have not examined the impact of higher radix representations on divider or square-root circuitry.
We expect high radix representations to reduce power consumption by about as much as, or slightly more than, they reduce area, although this is as yet unproven. Choosing a higher radix representation may thus be another way to lower power consumption. Future work will explore these questions on pipelined versions of our higher radix operators.
Conclusion
The choice of floating-point representation has a major impact on FPGA-based floating-point datapaths. Choosing a higher radix representation can yield implementations with better numerical accuracy, while still reducing area cost. Radix 16 is a particularly good choice, since it provides good area savings, and converters to and from radix 2 are simplified. Designs that are heavily constrained by memory interfaces can either sacrifice some accuracy to fit the representation within a convenient number of bits, or they can use converters at the gateways to the floating-point datapath.
High radix approaches may not be optimal for designs with much I/O and little computation, for designs using very small, non-standard representations, or for designs with many multipliers and no support for denormalized numbers. For such applications, radix 2 may be the best choice. However, for many designs, higher radix representations can be used to maximize efficiency for floating-point datapaths implemented on FPGAs. Some designers are beginning to push for greater precision than afforded by IEEE double precision, and need support for denormalized numbers [25] . The area savings afforded by higher radix representations, especially when support for denormalized numbers is required, may enable the implementation of such extremely high precision calculations on an FPGA. Since processors with hardware quadruple precision units are rare and expensive at present, such calculations must be run in software, making them an even bigger target for FPGA implementation. Calculations requiring less precision can also benefit from higher radix representations, especially if there are proportionally many add operations in the datapath.
Due to the established consensus that binary floating-point is optimal, the choice of floating-point radix has been neglected. The unique traits of FPGAs, such as the high ratio of calculation to I/O, high shifter cost, and embedded block multipliers, make higher radix floating-point representations, especially hexadecimal floating-point, particularly attractive. Designers of FPGA-based custom floating-point datapaths should consider whether a high radix representation would be better suited to their needs.
