Abstract: Due to the widespread use and inherent complexity of floating-point addition, much effort has been devoted to its speedup via algorithmic and circuit techniques. We propose a new redundant-digit representation for floating-point numbers that leads to computation speedup in two ways: 1) reducing the per-operation latency when multiple floating-point additions are performed before result conversion to nonredundant format and 2) removing the addition associated with rounding. While the first of these advantages is offered by other redundant representations, the second one is unique to our approach, which replaces the power- and area-intensive rounding addition by low-latency insertion of a rounding two-valued digit, or twit, in a position normally assigned to a redundant twit within the redundant-digit format. Instead of conventional sign-magnitude representation, we use a sign-embedded encoding that leads to lower hardware redundancy, and thus, reduced power dissipation. While our intermediate redundant representations remain incompatible with the IEEE 754-2008 standard, many application-specific systems, such as those in DSP and graphics domains, can benefit from our designs. Description of our radix-16 redundant representation and its addition algorithm is followed by the architecture of a floating-point adder based on this representation. Detailed circuit designs are provided for many of the adder's critical subfunctions. Simulation and synthesis based on a 0.13 μm CMOS standard process show a latency reduction of 15 percent or better, and both area and power savings of around 58 percent, compared with the best designs reported in the literature.
INTRODUCTION
Floating-point addition is believed to be the most frequent computer arithmetic operation. Intricacies of floating-point number representation make floating-point addition inherently more complex than integer addition. Thus, methods for speeding up floating-point addition are of utmost importance. A floating-point number is conventionally composed of a sign bit, an exponent, and a significand [1]. Since the ANSI/IEEE standard for binary floating-point arithmetic [2] (IEEE 754 for short) was introduced in 1985, virtually all implementations have adhered to its representation formats, even when they do not follow the full provisions of the standard, or its revised version, IEEE 754-2008 [3]. Implementations in this category include systems for digital signal/image processing [4] and computer graphics [5]. The short (long), also known as 32-bit or single (64-bit or double), normalized standard format incorporates a sign bit $s$, a biased excess-127 (excess-1023) representation $e$ for the exponent, and a significand $\sigma$ composed of 23 (52) bits to the right of the binary point that has a hidden 1 to its left. A short (long) floating-point number $f = (s, e, \sigma)$ represents the real value $(-1)^s \, 2^{e - \mathrm{bias}} \, 1.\sigma$, where $1.\sigma$ stands for $1 + \sigma \times 2^{-23}$ ($2^{-52}$) and $e$ is an unsigned 8-bit (11-bit) integer. Besides "ordinary" floating-point numbers just described, some special values (such as $\pm 0$, $\pm\infty$, and NaN) have unique codes assigned to them.
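The short-format interpretation just described can be illustrated with a small Python sketch. The function names `decode_binary32` and `value_of_normal` are hypothetical helpers introduced only for this illustration; the sketch covers normalized numbers and ignores the special codes.

```python
import struct


def decode_binary32(x: float):
    """Return the fields (s, e, frac) of x after rounding it to IEEE 754 short format."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    s = word >> 31              # sign bit
    e = (word >> 23) & 0xFF     # biased (excess-127) exponent
    frac = word & 0x7FFFFF      # 23 bits to the right of the binary point
    return s, e, frac


def value_of_normal(s: int, e: int, frac: int) -> float:
    """Evaluate (-1)^s * 2^(e - 127) * 1.frac for a normalized number."""
    if not 0 < e < 255:
        raise ValueError("special values and subnormals are not handled in this sketch")
    return (-1) ** s * 2.0 ** (e - 127) * (1 + frac * 2.0 ** -23)
```

For instance, `decode_binary32(-6.5)` yields a sign of 1, a biased exponent of 129, and a fraction field encoding 0.625, since 6.5 = 1.625 × 2².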
The operation of floating-point addition/subtraction consists of several steps, as outlined in Algorithm 1. Description of each step is followed, in square brackets, by the order of its worst-case latency, assuming a fast implementation. Each of the steps above may consist of a number of simpler substeps. To minimize the addition latency, clever methods have been developed to allow concurrent execution of (sub)steps in Algorithm 1 (e.g., [6], [7], [8], [9]). Rounding, in particular, is problematic, because it could introduce a second power/area-intensive word-width addition (actually, an incrementation). Pipelining of steps in floating-point operations has also reduced the average latency per operation. Further per-addition speedup is possible with redundant representation of intermediate results, thus allowing carry-free addition, provided that a series of floating-point operations is performed before there is a need to store a result in memory or to send it to an output device. The rounding-packet forwarding scheme of Nielsen et al. [7] keeps values in binary signed-digit (BSD) form, resulting in double-size registers (in view of the 2 bits required in each radix-2 position). The latest relevant work, by Fahmy and Flynn [10], uses sign-magnitude addition with redundant high-radix signed digits, which leads to lower redundancy. In the latter work, rounding is done concurrently with the exponent comparison of the next floating-point addition, or an unrounded value is used in a subsequent operation, which also receives a "rounding value" to be included in the significand addition.
In this paper, we follow the approach of Fahmy and Flynn, but with our particular redundant encoding of digits, which significantly reduces the active hardware redundancy through sign-embedding and obviates one of the two challenges cited in [10]; namely, recognition and transformation of insignificant digits in the process of leading-nonzero-digit detection. We use a redundant representation dubbed stored-unibit-transfer, or SUT [11], where the encoding provides room in the transfer part of the least-significant position of the result for the three possible rounding values $-1$, 0, and 1. Other features of our approach include:
• redundant-digit internal format with embedded sign,
• carry-free addition/subtraction,
• simple detection of the leading nonzero digit, and
• elimination of rounding increment operation and postrounding exponent adjustment.

Here is a roadmap for the rest of this paper. Section 2 contains an overview of the state of the art in the design of floating-point adders, including one by Fahmy [12]. In Section 3, we review the SUT encoding of a class of redundant numbers and present the associated carry-free adder/subtractor that uses only standard full/half-adders. Sections 4-7 are devoted to key design considerations of our dual-path floating-point adder: redundant internal number format, path separation, extra (guard, round, and sticky) digits, and rounding decision. Analytical and simulation results, presented in Section 8, are used to compare our designs with those of [12]. Conclusions and directions for further work appear in Section 9. Drawbacks of the work in [12] by way of possible bad rounding positions, along with the difficulty of adherence to the IEEE 754-2008 standard with either SUT or maximally redundant signed-digit number representation, are discussed in the Appendix.
FLOATING-POINT ADDITION
Design of hardware floating-point units has a long history, dating back to early digital computers that were used primarily for scientific computations [13] . Initial efforts in providing high performance in floating-point units were impeded by the exorbitant cost of hardware. This forced serialization of potentially parallel steps to allow hardware reduction and sharing, even in top-of-the-line supercomputers [14] . As hardware cost decreased, an array of innovative designs began to emerge. Once the more or less straightforward performance enhancement schemes were exhausted, replication of units and other hardware-intensive methods were employed to squeeze out incremental gains. The state of the art in floating-point adder design uses dual data paths, as depicted in Fig. 1 , to separate the relatively slow alignment and postnormalization shifts (Steps 2 and 6 of Algorithm 1) into different paths. This parallelization, based on the exponent difference, is credited to Farmwald [6] . Others have refined the path separation criteria. For example, Seidel and Even [15] use both the exponent difference and the actual operation for this purpose.
The innovations cited above notwithstanding, there is still room for fine-tuning and improvements in speed, latency-area tradeoffs, and energy dissipation given the following challenges in high-speed floating-point adders/subtractors:
• The prevalent sign-magnitude encoding leads to a more complex significand addition process than 1's- or 2's-complement format. Some techniques meant to speed up the addition of sign-magnitude significands (e.g., [16]) entail additional chip-area and power overheads.
• Postnormalization via counting the leading (non)zero digits is a log-latency operation at best. However, the count of leading 0/1 digits can be obtained concurrently with addition, perhaps by deriving an approximate count quickly and fine-tuning the result at the end [7].
• Rounding to nearest may require an incrementation and possible exponent adjustment. With some extra hardware (e.g., the parallel-prefix adder of [15]), both the normal and an incremented result can be computed in parallel, thus allowing rapid selection of the rounded value.

Many computations involve a sequence of arithmetic operations, without a need to store a result in memory. Converting an operand loaded from memory to an intermediate redundant encoding is possible in a small constant time, independent of operand width. Carry-free addition/subtraction of redundant operands reduces the latency of each operation, at the cost of wider register files to accommodate the redundant representation. However, the final result must be converted back to nonredundant form before it is stored in memory. This process is at best a logarithmic-time operation, but its latency is more than compensated for by the per-add savings, compounded over many redundant addition levels. Also, the conversion delay can be hidden by the memory store operation.
The redundant floating-point addition scheme of Nielsen et al. [7] [12] . However, maximal redundancy allows for cancellation of several nonzero digits, beginning with 1 ($-1$) and followed by a chain of $-15$ (15) digits to the right. This property complicates leading nonzero digit detection. Note that for narrower digit sets, cancellation can occur only for the leading 1 ($-1$).
• Path separation: "The cancellation path is used only in the case of an effective subtraction with an exponent difference of zero or an effective subtraction with an exponent difference of one and a cancellation of some of the leading digits occurring in the result. In all other cases, the far path is used." [10].
• Rounding: Another challenge is the handling of the rounding increment/decrement operation. Assimilation of the increment/decrement is postponed and is performed concurrently with the exponent difference computation of the next addition. The problem here is that the rounding position (i.e., the exact binary position for inserting the rounding value)
should be determined based on what the rounding position would be after converting to nonredundant IEEE 754 format. It turns out that there may be four bad rounding positions to the right of the least significant digit of the unrounded redundant result. The first of these is handled by extending the significand adder to the right, and the rest are prevented in the process of leading nonzero digit detection via PN recoding [17] . More detail is supplied in the Appendix, where we show the difficulties of handling the bad rounding positions. Before proceeding to our redundant-digit floating-point design in Section 4, we compare some of the possible approaches to the design of floating-point addition schemes. The coarse comparison in Table 1 (to be supplemented with more detailed simulation results in Section 8), is based on latency and active hardware redundancy, with the latter also serving as an indicator of power requirements.
In column 1 of Table 1, we list actual and potential implementations of Algorithm 1. Columns 2-5 show the number of steps whose latency is proportional to the logarithm of the given column parameter (per the bracketed latency formula appearing after each step of Algorithm 1). We assume that, where applicable, carry acceleration is employed to achieve logarithmic latency, with the needed extra hardware in terms of units of width($\sigma$) given in column 6. Full replication (e.g., SD adders to concurrently compute sum and sum $\pm$ 1 in [12, Fig. C.1]) is cited in column 7. The last row of Table 1 is included for completeness; its entries will be justified once we have explained our work. Other parts are explained below, with the understanding that Steps 1 and 3 (7 and 8) .
• MRSD floating-point adder: In Fahmy's design [12], Step 4 entails digit-length (i.e., $h$-bit) carry propagation. But either Step 2 or Steps 5-6 require $\lceil \mathrm{width}(\sigma)/h \rceil$-digit operations in the worst case. The delay for exponent difference computation (Step 1) is hidden by postponed rounding. The rather high hardware overhead is due to two shifters and five SD adders used (four in the normalization path and one in the alignment path), with each adder internally triplicated to also compute sum $\pm$ 1. Each of these 15 adders, the two shifters, and the leading zeros predictor (18 units in all) is assumed to contain carry acceleration circuitry.
REDUNDANT REPRESENTATIONS
To allow the use of conventional arithmetic components for bit compression and manipulation, redundant digits are often encoded as weighted bit sets. We have previously studied weighted-bit-set (WBS) encodings for digit sets and extended them to include the use of other two-valued digits [11]. Graphically, we use an extended dot notation: • for a bit or posibit in $\{0, 1\}$, and distinct dot symbols for a negabit in $\{-1, 0\}$ and for a unibit in $\{-1, 1\}$. Symbolically, we distinguish the latter two with letters or constants (0, 1) bearing the superscripts $-$ and $\pm$, respectively (Table 2). Generalized two-valued digits (twits) lead to weighted twit-set (WTS) encoding of digit sets. Logically, we use 0 for the smaller and 1 for the larger of the two twit values, a convention leading to inverted encoding of negabits (0 denotes $-1$ and 1 denotes 0), complementary to the common usage. Such encodings (Table 2) have been shown to result in efficient, VLSI-friendly adder designs [11], [18].
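The logical-to-arithmetic mapping of the three twit types can be written down compactly. The following Python sketch only restates the convention given above (logical 0 for the smaller value, logical 1 for the larger); the names `TWIT_VALUES` and `twit_value` are hypothetical and not taken from [11].

```python
# (value for logical 0, value for logical 1) for each twit type
TWIT_VALUES = {
    "posibit": (0, 1),
    "negabit": (-1, 0),   # inverted encoding: logical 0 denotes -1
    "unibit": (-1, 1),
}


def twit_value(kind: str, logic: int) -> int:
    """Arithmetic value of a twit of the given kind holding logical 0 or 1."""
    if logic not in (0, 1):
        raise ValueError("a twit holds a single logical bit")
    return TWIT_VALUES[kind][logic]
```

Equivalently, a logical bit $b$ is worth $b$ as a posibit, $b - 1$ as a negabit, and $2b - 1$ as a unibit, which is the form used in the arithmetic checks elsewhere in this paper.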
As noted in Section 2, the rounding-packet forwarding scheme of Nielsen et al. [7] requires double-size registers and the MRSD adder uses three parallel 5-bit 2's-complement adders. The area and power dissipation penalties of this triplication of active hardware are significant. The maximally redundant radix-16 digit set $[-15, 15]$ also complicates the leading nonzero digit detection. Both of the latter undesirable characteristics may be mitigated through improved encoding and/or adder design. For example, clever designs [19], [20] provide MRSD adders with no hardware redundancy, but the problem of cancellation of leading digits persists.
As another example, the stored-posibit-transfer (SPT) encoding [18] of the radix-16 digit set $[-8, 8]$, where there can be at most one leading insignificant 1 ($-1$) followed to the right by $-8$ (8), obviates the need for any complex provision to convert leading insignificant nonzero digits. But the active hardware replication factor of the SPT adder is only modestly less than that in the design of Fahmy and Flynn. Nevertheless, the SPT adder has an edge in terms of latency.
The minimally asymmetric digit set $[-9, 8]$ of this paper, which uses the SUT encoding [11], also has at most one insignificant leading 1 or $-1$ digit. The corresponding adder uses two rows of full-adders, which is equivalent to hardware duplication, and exhibits the same latency as the SPT adder just cited. More importantly, however, it obviates the need for rounding increment due to its ability to store the rounding value as a unibit. The cardinality of SPT or SUT digit sets is slightly more than half that of the maximally redundant radix-16 digit set. But this is not a disadvantage, because all three schemes require the same number of redundant radix-16 digits after conversion from nonredundant format (e.g., 7 for IEEE 754-2008 short format).
As shown in Fig. 2, each radix-16 SUT digit has a main 2's-complement part in $[-8, 7]$ and a transfer part in $\{-1, 1\}$. A high-level design for the $i$th block of a radix-16 SUT adder/subtractor, mainly composed of full-adder blocks, is shown in Fig. 3. Each input's type is symbolically denoted using our extended dot notation. Note that due to our twit encoding scheme, standard full-adders can receive and produce a variety of twit combinations, as justified in [11].
The critical path in Fig. 3 (heavy line) has a latency of five cascaded full-adders. Note, however, that the bottom full-adder row in Fig. 3 may be augmented by carry-lookahead logic for greater speed. The collective arithmetic value of the two external inputs of the lower full-adder in position 0 of the least-significant digit (i.e., $i = 0$) is normally zero. Therefore, these inputs can accommodate a rounding digit in $[-1, 1]$ used in lieu of rounding decrement or increment. Conversion of a 2's-complement number to an SUT number is possible in constant time (Table 3); details are given in Section 4.
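To make the structure of an SUT digit concrete, the following Python sketch enumerates all combinations of a main part in $[-8, 7]$ and a unibit transfer, confirming that they cover exactly the digit set $[-9, 8]$. The helper name `sut_digit_value` is hypothetical, and the main part is passed as a plain integer rather than as individual twits, which is a simplification made for this illustration.

```python
from collections import Counter


def sut_digit_value(main: int, transfer_logic: int) -> int:
    """Value of a radix-16 SUT digit: 2's-complement main part plus a unibit transfer."""
    if not -8 <= main <= 7:
        raise ValueError("main part must fit in 4-bit 2's complement")
    return main + (1 if transfer_logic else -1)   # unibit: logical 0 -> -1, 1 -> +1


# Count how many (main, transfer) encodings map to each digit value.
encodings = Counter(sut_digit_value(m, t) for m in range(-8, 8) for t in (0, 1))
digit_set = sorted(encodings)
```

Under this enumeration, the 32 encodings cover 18 digit values; the extreme values $-9$, $-8$, 7, and 8 have a single encoding each, while every other value has two.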
A REDUNDANT RADIX-16 REPRESENTATION
For brevity, and without loss of generality, let the nonredundant inputs be in IEEE short format (the first format in Fig. 4). Note, however, that the experimental results in Section 8 are based on the long format, as are those of [12]. Our redundant binary and radix-16 representations correspond to the second and third formats in Fig. 4. Recall that we distinguish negabits and unibits with letters or constants (0, 1) bearing the superscripts $-$ and $\pm$ (Table 2). Also, primed and double-primed symbols in the same position denote equally weighted entities. The variables $s$, $e_i$, and $\sigma_i$ ($0 \le i \le 7$) represent the sign, exponent twits, and radix-16-digit components of the significand.
The process of converting from IEEE 754-2008 to SUT format is described next. The explanations may be long, but the process itself is quite simple in hardware cost and latency. Conversion from other nonredundant formats is similar.
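Exponent conversion. The biased input exponent $e = e_7 e_6 e_5 e_4 e_3 e_2 e_1 e_0$ is converted to the unbiased exponent $e' = e_7 e_6^- e_5^- e_4^- e_3^- e_2^- e_1^- e_0^-$, in which the most-significant twit remains a posibit and the other seven bits are reinterpreted as negabits. Using $\|b\|$ to denote the arithmetic value of a bit-string $b$, the fact that the same exponent bit-string interpreted differently becomes the unbiased internal exponent is justified by:

$$\|e'\| = 2^7 e_7 + \sum_{i=0}^{6} 2^i (e_i - 1) = \sum_{i=0}^{7} 2^i e_i - (2^7 - 1) = \|e\| - 127.$$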
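A quick numerical check of this reinterpretation is sketched below in Python. It assumes, as the range $-127$ to 128 quoted in the next paragraph implies, that $e_7$ stays a posibit while $e_6$ through $e_0$ are read as inverted-encoded negabits; the function names are hypothetical.

```python
def bits_of(e: int):
    """Exponent bit-string e7..e0 (most-significant first) of an 8-bit biased exponent."""
    return [(e >> i) & 1 for i in range(7, -1, -1)]


def biased(bits) -> int:
    """Conventional unsigned reading of the bit-string."""
    return sum(b * 2 ** (7 - k) for k, b in enumerate(bits))


def reinterpreted(bits) -> int:
    """Same bit-string read as one posibit (e7) followed by seven negabits (e6..e0)."""
    e7, low = bits[0], bits[1:]
    # A negabit holding logical b is worth b - 1 (inverted encoding).
    return e7 * 128 + sum((b - 1) * 2 ** (6 - k) for k, b in enumerate(low))
```

No bit is changed by the conversion; only the interpretation differs, so the hardware cost of removing the bias is nil under this assumption.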
Due to inverted encoding of negabits, the lowest (highest) possible value for $e'$, that is, $-127$ (128), is represented by a string of eight 0s (1s). Therefore, ease of comparison (i.e., the rationale for conventional use of a biased exponent) is achieved here, with the unbiased exponent $e'$.

Significand conversion. Each bit-pair $x_{4j} x_{4j-1}$ ($0 < j < 6$) is independently transformed, leading also to the introduction of a transfer in position $4j$ (see Table 3). The following equations govern all transformations:
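As an illustration of a value-preserving bit-pair transformation of this kind, the Python sketch below maps an ordinary bit pair to a posibit and a unibit in the upper position and a negabit in the lower position. The specific logic functions are an assumption: they were chosen to agree with the position-0 posibit and unibit expressions used in the negation discussion later in this section, and they are not claimed to reproduce the entries of Table 3. The function names are hypothetical.

```python
def transform_pair(hi: int, lo: int):
    """Map bits (hi, lo), of weights 2 and 1, to twits (p, u, n).

    p: posibit of weight 2, u: unibit of weight 2 (the transfer),
    n: negabit of weight 1. All three are returned as logical bits.
    """
    p = 1 - (hi ^ lo)   # XNOR
    u = hi | lo         # OR
    n = 1 - lo          # NOT
    return p, u, n


def twit_pair_value(p: int, u: int, n: int) -> int:
    """Arithmetic value of the three twits under the encodings of Table 2."""
    return 2 * p + 2 * (2 * u - 1) + (n - 1)
```

Each output depends on only two input bits, so the transformation takes constant time regardless of operand width, consistent with the constant-time conversion claimed in Section 3.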
Radix conversion for the significand. Starting at the right end of the format in Fig. 4, every four positions, up to but not including the leftmost part of the significand, may be viewed as a redundant SUT digit with a 4-bit 2's-complement main part and a stored unibit in its least-significant position. The proper handling of the leftmost 4-bit group in the significand will be discussed along with the conversion of the exponent base to 16.
Radix-16 exponent. The radix-16 exponent is actually e Table 4 for positive significands, where the $\sigma$ variables are radix-16 SUT digits. The radix point is between $\sigma_6$ and $\sigma_5$, with $\sigma_6 \ge 1$; hence, we have a normalized radix-16 floating-point representation. To have a full radix-16 SUT digit before the radix point, we extend the significand width to 28 bits, with the following shift decisions in effect (dashed lines near the lower-right corner of Fig. 4 show the four possible alignments that might arise):
No shifting is needed, but the 23-bit significand is extended to the right by one binary position, which is filled with the effective arithmetic value 0. The appearance of $x_{22}$ to the left of the radix point is due to the SUT transformation. The hidden 1 is accommodated as a unibit and the effective arithmetic value of the three most-significant bits is 0. . $\|e$

Fig. 4. Short (single) floating-point format, with the hidden 1 exposed; an equivalent representation, with redundant significand and unbiased exponent; and the internal radix-16 redundant format used in our design.

compensate for ignoring the value of $\|e$ (Table 4) in parallel. Likewise, when the required operation is subtraction, the subtrahend is negated, with the result added to the minuend. Therefore, the actual operation is always addition, leading to the elimination of the following provisions and corresponding reductions in the hardware complexity and latency:
• detection of the actual operation,
• possible swapping of the operands,
• postcomplementation, and
• widening of the addition circuitry to capture extra digits for rounding.
One drawback of the SUT paradigm, although not problematic for subtraction (see Fig. 3), is that negation of an SUT digit involves digitwide carry propagation in general, given the asymmetry of the digit set [21]. Here, however, the value interdependence between the posibit and unibit in position 0 of each SUT digit (see Table 4), produced by the conversion outlined above, obviates the need for carry propagation in digit-by-digit negation. An SUT digit is negated by inversion of its unibit transfer and 2's-complementation of its main part.
The latter requires carry propagation in general, but for the digits $\sigma_1$ through $\sigma_5$, as well as for $\sigma_6$ in the bottom two rows of Table 4, position 0 of each digit holds a posibit $p = \overline{x_{j+1} \oplus x_j}$ and a unibit $u^{\pm} = (x_{j+1} \vee x_j)^{\pm}$. Complementing all twits and adding a constant posibit 1 in position 0 of the latter digits leaves the three twits 1, $\bar{p}$, and $\bar{u}^{\pm}$ in position 0. The first two of these may be replaced by the original $p = 1 \oplus \bar{p}$ and a carry $\bar{p}$ into the next position. Collectively, the value of this carry (i.e., $2(x_{j+1} \oplus x_j)$) and the unibit $\bar{u}^{\pm}$ (i.e., $-1 + 2\bar{u}^{\pm}$) is $2(x_{j+1} \oplus x_j) - 1 + 2\,\overline{x_{j+1} \vee x_j}$, which can be accommodated by a unibit $(\overline{x_{j+1} x_j})^{\pm}$, as justified in Table 5, where only the second column from the right holds arithmetic constants. The value-specific SUT digit $\sigma_0$, as well as instances of $\sigma_6$ in the upper two rows of Table 4, are easily negated independently. Based on the observations above, the two-step process for accommodating the sign of a negative nonredundant floating-point number in its equivalent radix-16 SUT encoding can be reduced to a direct process (Table 6). Moreover, examination of Tables 4 and 5 shows that the overall conversion process can be summarized as follows for hardware implementation:
• Extend the nonredundant floating-point number by one radix-2 position to the right and perform no operation, 1-bit right shift, 2-bit left shift, or 1-bit left shift, for $\|e$

The latency for conversion from nonredundant floating-point format to SUT format equals the delay of 2 bit-shifts, an XOR gate for sign embedding, and an OR gate for the final restructuring into SUT format.
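Returning to the carry-free negation argument summarized in Table 5, the claim can be checked exhaustively over the four values of the bit pair. The Python sketch below assumes that position 0 holds a posibit equal to the XNOR of the two bits and a unibit equal to their OR, and that the negated digit keeps the same posibit with a unibit equal to their NAND; the function names are hypothetical.

```python
from itertools import product


def position0_twits(a: int, b: int):
    """Posibit p and unibit u (logical values) assumed to sit in position 0 of a digit."""
    return 1 - (a ^ b), a | b


def negated_position0(a: int, b: int):
    """Twits after negation: the posibit is unchanged and the unibit becomes NAND(a, b)."""
    p, _ = position0_twits(a, b)
    return p, 1 - (a & b)


def negation_is_carry_free() -> bool:
    """Compare 'complement all twits and add 1' with the direct replacement."""
    for a, b in product((0, 1), repeat=2):
        p, u = position0_twits(a, b)
        # Constant 1, complemented posibit, and complemented unibit in position 0.
        complemented = 1 + (1 - p) + (2 * (1 - u) - 1)
        p_new, u_new = negated_position0(a, b)
        direct = p_new + (2 * u_new - 1)
        if complemented != direct:
            return False
    return True
```

Since the replacement twits stay within position 0, no carry needs to enter the rest of the digit, which is the property the text relies on for digit-by-digit negation.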
PATH SELECTION
With the embedded sign representation of Section 4, addition and subtraction operations do not need to be distinguished in implementation. Therefore, we focus only on the radix-16 exponent difference $\Delta''$, used for path separation, between the radix-16 exponents of the two operands $f_1$ and $f_2$ according to the encoding of Fig. 4. The binary positions enclosed in square brackets indicate equally weighted negabits in the least-significant positions of the two exponents.

As is common in high-performance floating-point units, we develop a dual-path adder, with the alignment path allowing word-length alignment shifts and the normalization path going through a postaddition normalizing shifter. Both paths are active in every add/subtract operation, but one of the results may be wrong. The correct result is obtained through multiplexing the result from the alignment and normalization paths. The selection is based on the exponent difference $\Delta''$, as detailed in the following paragraphs. Fig. 5 depicts our dual-path adder. The two paths have shared exponent difference and rounding logic. To avoid hardware redundancy owing to concurrent shifted and nonshifted additions, no prediction logic for exponent difference is used in the normalization path.

We can now explain the entries of the last row of Table 1. The rounding block latency (to be discussed in Section 7) amounts to only three logic levels, and as such cannot hide the latency of exponent difference computation. However, the exponent difference computation for the next floating-point addition may begin at the same time as the process of rounding decision, thereby taking the latter off the critical path. Leading zero detection and the following normalization shifts contribute to the column titled $\lceil \mathrm{width}(\sigma)/h \rceil$, while the former adder latency is proportional to $\log h$. The two adders, two shifters, exponent subtractor, and detection logic (total of six) use carry acceleration circuitry, and there are three instances of hardware replication arising from adders, each with internal duplication due to two full-adder rows in Fig. 3.
Alignment path. For $\Delta'' \ge 2$, the latency of alignment shift is significant, while that of normalization shift is minimal. This is because for the operand having the larger exponent (say $f_1$), we have $|\sigma_6| \ge 1$ and $-9 \le \sigma_5 \le 8$, while for the operand with the smaller exponent ($f_2$), the postalignment values are $\sigma_6 = \sigma_5 = 0$. When $|\sigma_6| \ge 2$ for $f_1$, the MSD of the result is nonzero and there is no normalization shift. When $|\sigma_6| = 1$ for $f_1$, the transfer value to the most-significant digit of the result may make it zero. In this case, however, the next most-significant digit cannot be zero, given the $\sigma_5$ values for the two operands. Thus, we will either have a normalized result or need a single-digit right/left shift to normalize it. The sign and magnitude of the exponent difference are derived by Algorithm 2 below. Note that inverted encoding of negabits is in effect.
Normalization path. For $\Delta'' = 0$, no exponent comparison, swapping, or postcomplementation is needed. This is due to the sign-embedded representation of significands, which allows the result to be negative. In case of $\Delta'' = 1$, there is only a 1-digit alignment right shift, but in both cases (i.e., $\Delta'' \le 1$), a lengthy normalization shift may be necessary. To compute the amount of normalization postshift, we need to locate the first nonzero digit. Because the SUT-encoded
, where all full-adders, except the one in the leftmost position, receive three negabits. The fourth negabit in the rightmost position is kept intact, so that d
Therefore, the negabits of $\Delta''$ or their complements, which represent the magnitude of the exponent difference, can be used to control a parallel (e.g., barrel) shifter directly.
When $\|\Delta''\| > 6$ ($\|\Delta''\| < -6$), all the digits of the operand with the smaller exponent will be shifted out. Therefore, only d
) are needed to control the parallel shifter. Shifting is not required when d
= 000); in this case, no addition is necessary, but the shifted-out digits may contribute to the rounding decision.
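The path-selection rule of this section can be sketched in a few lines of Python. The sketch assumes that selection depends on the magnitude of the radix-16 exponent difference (normalization path for a magnitude of at most 1, alignment path otherwise) and that a 7-digit significand, as in the short format, is fully shifted out once the magnitude exceeds 6. The names `select_path` and `alignment_shift` are hypothetical.

```python
def select_path(delta: int) -> str:
    """Choose which path's result is kept, given the radix-16 exponent difference."""
    return "normalization" if abs(delta) <= 1 else "alignment"


def alignment_shift(delta: int, digits: int = 7) -> int:
    """Number of radix-16 digit positions by which the smaller operand is shifted right.

    Shifts beyond the significand width are clamped, since every digit
    has already been shifted out at that point.
    """
    return min(abs(delta), digits)
```

In the hardware described here both paths operate concurrently and a multiplexer keeps one result; the function above models only that final selection.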
GUARD, ROUND, AND STICKY DIGITS
In the alignment path, two or more radix-16 digits (i.e., SUT digits in $[-9, 8]$) of the operand with the smaller exponent are shifted out. As discussed in Section 5, normalization postshift in this path is limited to one digit. Therefore, one needs only to save as a guard digit the most significant one of the digits that were shifted out. This guard digit may then be shifted back during the postnormalization process. The other shifted-out digits are only needed to the extent that they affect the rounding decision.
For conventional floating-point units, an extra round bit and a sticky bit (logical OR of all subsequent bits) suffice for correct rounding, where the sticky bit being 0 signals the need for applying the halfway rule of the round-to-nearest-even mode. For redundant-digit floating-point units, however, round and sticky information should carry information about the range of the shifted-out digits, at least indicating whether the fractional value represented by the shifted-out digits is negative, zero, or positive.
In our radix-16 SUT floating-point adder, we recognize three (four) possibilities for sticky (guard and round) digit values. We use the equally weighted posibit $s'$ and negabit $s''^{-}$ to represent sticky information such that $s' s''^{-} = 00^-$, $01^-$, and $11^-$ ($10^-$ not used) represent a negative, zero, and positive value passed into and through the sticky position, respectively. Rounding information, on the other hand, is encoded as r Fig. 6 are to be latched back to inputs with the same names appearing at the left, with the initial value $s' s''^{-} = 01^-$ (i.e., zero) and inverted encoding of negabits in effect. Note that in case of no normalization shift, the guard digit will serve as the round digit and the original round digit should be fed into the logic of Fig. 6 for updating the sticky digit. As in nonredundant floating-point addition, in the alignment path of SUT floating-point addition, a one-digit normalization right shift may also occur, in which case the postaddition shifted-out digit serves as the round digit. The original guard and round digits should then modify the sticky digit through the logic of Fig. 6.
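The three-valued sticky encoding can be illustrated with the Python sketch below. The decoding follows directly from the twit values (a posibit worth $s'$ plus a negabit worth $s'' - 1$). The update rule in `update_sticky` is an assumption introduced for illustration and is not the logic of Fig. 6: it relies on the fact that the digits below any SUT digit amount to less than one unit of that digit in magnitude, so the sign of the shifted-out fraction is the sign of its most significant nonzero digit. All names are hypothetical.

```python
# Logical (s', s'') pairs for a negative, zero, and positive sticky value.
STICKY_CODE = {-1: (0, 0), 0: (0, 1), 1: (1, 1)}


def sticky_sign(s1: int, s2: int) -> int:
    """Decode posibit s' and inverted-encoded negabit s'' into -1, 0, or +1."""
    if (s1, s2) == (1, 0):
        raise ValueError("the combination 10 is not used")
    return s1 + (s2 - 1)


def update_sticky(old_sign: int, incoming_digit: int) -> int:
    """Sticky sign after a more significant digit passes into the sticky position."""
    if incoming_digit > 0:
        return 1
    if incoming_digit < 0:
        return -1
    return old_sign   # a zero digit leaves the accumulated sign unchanged
```

Starting from the initial code for zero and applying `update_sticky` to each digit in order of increasing significance yields the sign of the entire shifted-out tail under the stated assumption.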
In the normalization path, where alignment shift is limited to one right shift, maintaining the guard digit is still required. With one alignment shift and no postshift (a one-digit postshift to the right), the guard (shifted-out) digit serves the same purpose as the round digit and the sticky digit is zero (is derived from the guard digit). But when there are one or more shifts to the left, no rounding action is necessary, given that the round and sticky digits are zero.
ROUNDING DECISION
In this section, we describe how the round-to-nearest-even mode might be implemented with our SUT floating-point addition scheme. The main idea is to compute a rounding value $R_v$ to be stored as the transfer part of the least-significant digit in the final sum. Note that the SUT addition scheme (described in Section 3) does not generate a value for the transfer part of the least-significant digit, thus leaving it available for our rounding scheme. Unfortunately, however, this simple rounding decision does not always work correctly. The challenges include the possible shifting of the transferless LSD and bad rounding positions. We discuss the former below, and deal with the latter in the Appendix, where we identify two bad rounding positions, to the right of the LSD, that may lead to a redundant value different from that obtained by nonredundant floating-point operations in accordance with the IEEE 754-2008 standard. For the sake of a comprehensive comparison, we also show that a simple rounding decision scheme in case of [7] fails in four bad positions.
Unfortunately, there seems to be no simple solution for correct rounding in the case of bad rounding positions such that the requirements of IEEE 754-2008 are met. This is true for both our SUT representation and for Fahmy and Flynn's scheme. The reason is that the combined contribution of the rounding value and the other bits at position $R_p$ or higher is less than 1 ulp, and thus, cannot be stored with the LSD. However, no problem arises for rounding after conversion to nonredundant format. Fahmy and Flynn overcome the problem of bad positions at the cost of additional complexity, as noted in the Appendix. Alternate redundant number systems are under investigation by the authors to alleviate this problem. Meanwhile, the SUT floating-point scheme, primarily owing to the advantages arising from "sign embedding" (refer to Section 4), may find applications in special-purpose processors and embedded systems, where full compliance with IEEE 754-2008 is not a requirement.
Shifting of the Transferless LSD
Due to normalization shifts, the transferless LSD may be shifted right or left, and replaced by the next more significant digit or by the guard digit, respectively. Both replacements lead to a new LSD with a transfer digit, and thus, no room for storing the rounding value. We consider the cases of right, left, and no shift separately (Figs. 7a, 7b , and 7c, respectively), and summarize the results in Table 7 .
Normalization Right Shift
The LSD of the right-shifted result (i.e., the new LSD) has a nonempty unibit transfer $u^{\prime\prime\pm}$. Let $R_d$ (for rounding digit) denote the collective value of $u^{\prime\prime\pm}$ and the last shifted-out digit (i.e., the old transferless LSD). The twits that constitute $R_d$ were generated as the sum of two radix-16 SUT digits in the LSD positions of the two operands and then shifted one radix-16 position to the right. The following interval equation summarizes the process:
We can now extract $R_v$ in $\{-\mathrm{ulp}, 0, \mathrm{ulp}\}$ for different subranges of $R_d$ as follows:
In the two singular boundary cases (i.e., $-\mathrm{ulp}/2$ and $\mathrm{ulp}/2$), $R_v$, again in $\{-\mathrm{ulp}, 0, \mathrm{ulp}\}$, is decided by the sticky digit and the parity of the new LSD.
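The decision just described can be illustrated with a minimal Python sketch. It is not the paper's circuit or its Table 7: the function name `rounding_value`, the representation of $R_d$ as an exact fraction of one ulp, the encoding of the sticky digit as a sign in {-1, 0, 1}, and the flag `lsd_is_odd` are all assumptions made for illustration. The sketch rounds to the nearest integer multiple of ulp and resolves exact ties toward an even LSD.

```python
from fractions import Fraction
import math


def rounding_value(rd, sticky, lsd_is_odd):
    """Return a rounding value in ulps (hypothetical helper).

    rd         -- value of the rounding digit in ulps, as a Fraction (assumed)
    sticky     -- sign of the discarded part: -1, 0, or 1 (assumed encoding)
    lsd_is_odd -- parity of the new LSD before the rounding value is stored
    """
    half = Fraction(1, 2)
    lower = math.floor(rd)          # largest integer not exceeding rd
    frac = rd - lower               # fractional part, in [0, 1)

    # Clearly above the midpoint, or on it with a positive sticky part.
    if frac > half or (frac == half and sticky > 0):
        return lower + 1
    # Clearly below the midpoint, or on it with a negative sticky part.
    if frac < half or sticky < 0:
        return lower
    # Exact tie: pick the candidate that makes LSD + rounding value even.
    return lower if (lower % 2 == 1) == lsd_is_odd else lower + 1


print(rounding_value(Fraction(3, 4), 0, False))    # above ulp/2
print(rounding_value(Fraction(1, 2), 0, True))     # tie, odd LSD
print(rounding_value(Fraction(-1, 2), 0, False))   # tie, even LSD
```

Under these assumptions, any input with magnitude below 3/2 ulp yields a value in {-1, 0, 1}, which matches the set quoted above; only the two boundary inputs consult the sticky sign and the LSD parity.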
Normalization Left Shift
In this case, a nonzero $R_v$ occurs only in the alignment path, where a shifted-out digit may be placed back in the LSD position. The transfer part of this digit and the main part of the next digit to the right, if any (i.e., $R_d$ as defined in Section 7.1.1), have been generated either as the sum of two radix-16 SUT digits from the previous addition, or through conversion of a nonredundant operand to SUT format. The former may be treated exactly as in Section 7.1.1. For the latter, 
For the singular cases of $\pm\mathrm{ulp}/2$, the rounding decision is as in Section 7.1.1. Note that depending on the sticky digit,
However, given the significand conversion equations of Section 4, the latter situation occurs only for $l_b = 0$, where $l_b$ is the least-significant posibit of the LSD; hence, the possibility of storing $R_v$ in lieu of the two least-significant twits of the LSD (#20 in Table 7).
No Normalization Shift
The LSD is transferless in this case. For the sake of uniformity in the rounding decision, however, we assume a unibit $u^{\prime\prime\pm}$ at the least-significant end of the LSD. This unibit may be produced by placing a second full-adder at the low end of the SUT digit adder handling the LSD position (similar to the lower right full-adder of Fig. 3), and setting its input negabit to 1 and its right input posibit to 0, collectively representing the arithmetic value 0. Entries #1 to #4 in Table 7 occur particularly in this case, when the collective arithmetic value of $l_b$ and $u^{\prime\prime\pm}$ is 0. Other entries for the no-shift case are shared by the cases of normalization shift.
Table 7 shows 21 combinations in deriving the rounding value; the other 43 possible combinations cannot occur and constitute don't-care conditions. Here, $u^{\prime\prime\pm}$, $r_1^{-} r_0$, $s^{\prime} s^{\prime\prime-}$, and $R_v$ represent the unibit transfer of the LSD, the round and sticky digits, as defined in Section 6, and the rounding value, respectively. The rightmost posibits of the LSD before and after storing $R_v$ are denoted by $l_b$ and $l_a$, and $r^{\prime\prime\pm}$ is the stored rounding unibit. Note that the rounding value computation is done after normalization shifts have taken place, and the first digit after the LSD is viewed as the round digit. The rounding value $-2$ (2) is taken care of by $l_a = 0$ (1) and $r^{\prime\prime\pm} = 0$ (1), since they can only occur when $l_b$ is adjusted to 1 ($u^{\prime\prime\pm} = 0$) and 0 ($u^{\prime\prime\pm} = 1$), respectively. Fig. 8 depicts the simple logic implementing the following equations for $r^{\prime\prime\pm}$ and $l_a$, based on Table 7:
COMPARATIVE EVALUATIONS
The redundant-digit floating-point addition scheme of Fahmy [12] and the one proposed in this paper are both based on radix-16 signed-digit number representation. We now show that the coarse comparison of Table 1 is supported by detailed analytical evaluation of both schemes. For a fair comparison, we follow the analytical model of [12]. In this model, the FO4 delays (the unit being the delay of an inverter with a fanout of 4) of a full-adder, a k-bit fast adder, an m-to-1 multiplexer, and an n-way shifter are as in Table 8, where f (fan-in) is the maximum number of inputs for a gate in the design. Based on the component delays of Table 8, the overall FO4 delays of the redundant-digit floating-point adders of [12] and [7] are represented by the following equations (adapted from [12]), where $h = 1$ in [7]. Recall that the parameters of these equations are the significand width, the exponent width e, and the digit width h ($h = \log_2 r$):
Using the same component delays for the critical path of our design (Fig. 5) yields the following equation, where the five variable terms that follow the fixed latency of 16 are due to the exponent-difference, shifter, adder, and leading-zero detector (two terms) units, respectively:
The preceding analysis yields 34 FO4 gate delays for the design of [12], versus 28 for our design (using $f = 3$, $h = 4$).
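For orientation, the relative reduction implied by these two analytical figures follows from simple arithmetic. It uses only the values 34 and 28 quoted above and is not a measured result:

```latex
\[
\frac{34 - 28}{34} = \frac{6}{34} \approx 0.176
\]
```

That is, roughly 17.6 percent fewer FO4 delays in this analytical model.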
The delay values for significand widths ranging from 8 to 120 are plotted in Fig. 9.
For more realistic results, we produced VHDL code for both schemes and ran simulations and synthesis using the Synopsys Design Compiler. The target library is based on TSMC 0.13 μm standard CMOS technology. For dynamic and leakage power, we used the Synopsys Power Compiler. The same design environment (e.g., operating conditions and wire model) and design constraints (e.g., maximum path delay and area consumption) are assumed for both floating-point adders being compared. The results, as depicted in Table 9, show that our proposed floating-point adder is faster than the design of [12] and significantly outperforms it in terms of power and area. The following differences are responsible for the improved performance, power consumption, and layout area:
1. Less hardware redundancy in the main signed-digit adder. In reference [12], three signed-digit adders are used to form sum, sum + 1, and sum − 1 simultaneously.
2. Removal of the need for postcomplementing the significand in case of equal exponents, due to the sign-embedded representation. To avoid added latency, the design of [12] includes extensive hardware redundancy, in the form of multiple adders to compute $A - B$, $B - A$, $A - B_{\mathrm{shifted}}$, and $B - A_{\mathrm{shifted}}$. It also uses two shifters in the normalization path and a 5-to-1 final selector (versus 2-to-1 in our design).
3. Elimination of rounding increment; [12, Fig. 4.2] includes four active rounding increment/decrement modules.
4. No need for PN recoding.
5. Simpler logic for rounding decision.
CONCLUSION
We have described a new redundant-digit representation for floating-point numbers that leads to computation speedup as well as reduced layout area and power dissipation. These benefits have been confirmed by approximate analyses and through more detailed simulation results.
By describing algorithms and circuit implementations for a floating-point adder based on a redundant-digit representation, we have shown that the new representation offers the unique advantage of replacing the slow full-width addition required for rounding by the insertion of a rounding value, derived by a simple three-level logic circuit, in a position normally assigned to a redundant "twit" within the redundant format. We have also demonstrated that our scheme leads to a simpler rounding decision and immediate incorporation of the rounding value so that, in case of conversion to nonredundant format, no additional time beyond that of redundant-to-binary conversion is required. The cost paid for these advantages is a one-time, two-multiplexer latency for converting IEEE 754-2008 floating-point numbers to our internal format.
Although our scheme does not provide a simple solution to the problem of lack of full IEEE 754-2008 compliance with redundant representations, it does reduce the number of bad rounding positions from four, in the existing scheme of [23], to two.
APPENDIX
For this property to hold, it must be the case that converting the unrounded result to its nonredundant equivalent and then rounding it leads to the same result as rounding the redundant result first and then converting to its nonredundant equivalent.
To deal with this challenge, we follow Fahmy's approach [12] in predicting the position of the leading 1 in the converted (i.e., nonredundant) result and accordingly locating the "rounding position" within the LSD of the redundant result. The rounding position within the LSD corresponds to the leading 1 in the MSD, except that conversion to nonredundant format may shift the leading 1 to the right by one position. There would be no propagating 1 during the conversion process, given that the radix-16 digits of the converted result can accommodate the SUT digits in $[0, 8]$. However, negative SUT digits in $[-9, -1]$ generate a propagating $-1$, leaving behind a radix-16 digit in $[7, 15]$. Therefore, if a propagating $-1$ reaches the MSD, it may turn the latter into 0, with the next digit to the right being in $[6, 15]$. In this case, the leading 1 after conversion will be 1-2 binary positions to the right of the MSD, implying that the rounding position, before conversion, may be 1-2 binary positions to the right of the LSD. With maximally redundant digits in $[-15, 15]$, as used in [23], however, a propagating $-1$ may leave behind a digit in $[0, 15]$. This leads to the rounding position falling as far as four binary positions to the right of the LSD. The most-significant one of these positions is handled in [23] by extending the adder to the right and the other three are prevented via PN recoding logic.
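The borrow behavior described above can be checked with a small Python sketch. It assumes, for illustration only, that each SUT digit has already been collapsed to a single integer in [-9, 8] (the twit-level encoding is ignored) and that digits are listed MSD first. The names `to_nonredundant` and `value` are hypothetical helpers, not part of the paper's design.

```python
def value(digits, radix=16):
    """Integer value of a digit list given MSD first."""
    total = 0
    for d in digits:
        total = total * radix + d
    return total


def to_nonredundant(digits, radix=16):
    """Convert signed digits in [-9, 8] (MSD first) to digits in [0, 15].

    Returns (borrow_out, converted_digits); borrow_out is 0 or -1.
    """
    borrow = 0                      # no borrow enters the LSD
    out = []
    for d in reversed(digits):      # process from LSD to MSD
        t = d + borrow
        if t < 0:                   # negative: leave t + 16, propagate -1
            out.append(t + radix)
            borrow = -1
        else:                       # nonnegative: nothing propagates
            out.append(t)
            borrow = 0
    out.reverse()
    return borrow, out


# The MSD 1 is hit by a propagating -1 and becomes 0.
borrow, conv = to_nonredundant([1, -3, 5])
print(borrow, conv)                 # 0 [0, 13, 5]
```

In this example the MSD becomes 0 and the next digit, 13, lies in the range $[6, 15]$ mentioned above, while the represented value (213) is unchanged.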
The latency of the latter prediction process is at best logarithmic in the number of significand digits. However, given that storing of the rounding value does not lead to any exponent adjustment, the prediction process may be taken off the critical path by overlapping it with the exponent difference computation of the next floating-point addition. Having obtained the rounding position, we recognize two cases for storing the rounding value:
• Good rounding positions: When the computed rounding position $R_p$ coincides with one of the binary positions of the result's LSD, the posibits to the right of that position affect the rounding value $R_v$ to be added in position $R_p$.
• Bad rounding positions: In the two cases when $R_p$ falls to the right of the LSD, the most-significant bit that contributes to the rounding value weighs $\mathrm{ulp}/4$ or $\mathrm{ulp}/8$. The resulting rounding value, which is not an integral multiple of ulp regardless of its sign, cannot be stored with the LSD. After conversion to nonredundant format, however, a normalization shift of 1-2 bits moves the rounding position to the rightmost position of the LSD.
