Abstract: Redundant representations play an important role in high-speed computer arithmetic. One key reason is that such representations support carry-free addition, that is, addition in a small, constant time, independent of operand widths. The implications of stored-transfer representation of digit sets and the associated addition schemes, as an extension of the stored-carry concept to redundant number systems, on the speed and cost of arithmetic algorithms, are explored. Two's-complement digits as the main part and any two-valued digit (twit) in place of a stored carry are allowed, leading to further broadening of the generalised signed-digit representations. The characteristics of the digit sets, possibly not having zero as a member, that allow for most efficient carry-free addition, are investigated. Circuit speed is gained from storing or saving, instead of combining through addition, the interdigit transfers generated during the carry-free addition process. Encoding efficiency is gained from using a twit-transfer set encoded by one logical bit, where more bits would otherwise be needed to represent a transfer value.
Introduction
A positional radix-$r$ number system is deemed redundant if the cardinality of its digit set is greater than $r$; for example, the decimal digit set $\{0, 1, 2, \ldots, 9, 10, 11\}$ or the binary digit set $\{-1, 0, 1\}$ [1]. In modern digital circuits, redundancy is commonly introduced in number representation with the aim of improving the speed or efficiency of arithmetic operations [2, 3]. One reason for speed improvement with redundancy is the possibility of carry-free addition; that is, addition in a small, constant time, independent of operand widths. This desirable property is routinely exploited in digital system designs where internal redundant representations (invisible to the casual user) are employed, although explicit use of redundant representations is also gaining in popularity [4-6]. In carry-free addition, a carry produced by one digit position is always absorbed by the next higher position [1]. Because the term 'carry' often conjures the notion of a propagating signal, the information produced by one stage and absorbed by the next stage is referred to as a 'transfer' value or digit.
Another reason for the usefulness of redundant representations is that redundancy allows some imprecision in the decision processes associated with arithmetic algorithms (such as quotient or root digit selection [7] ); this tolerance to imprecision removes enough complexity from the computation's critical path to yield significant performance improvement. As a specific example, quotient digits in high-radix division can be chosen by inspecting only a few bits of the divisor and of the redundant partial remainder, thus allowing a much shorter cycle time via hardwired or tabular implementation of the quotient digit selection process [8] .
Contributions to redundant number representation are of two main types. In abstract studies (e.g. [1, 9, 10] ), arithmetic algorithms are presented in terms of digit-level operations, specifying how each result digit is derived from operand digits and auxiliary quantities such as interdigit transfers. Implementation-oriented studies, on the other hand, are often based on specific encodings for digit sets encountered in the course of solving particular design problems; for example, implementation of a high-speed two's-complement full-tree multiplier [11] . Some contributions of this latter type have dealt with limited classes of digit-set encodings without directly associating them with a specific design problem. Falling into the latter category are hybrid-redundant number systems [12, 13] and representation paradigms of high-radix signed digit numbers [14] .
Certain weighted encodings of redundant number systems, using two-valued digits (twits, or generalised bits), have recently been shown to fill the gap between the aforementioned contributions [15]. Twits are exemplified by a unibit that represents a value in $\{-1, 1\}$, with $-1$ encoded as logical 0 and $1$ encoded as logical 1. Such encodings lead to efficient representations for redundant number systems and make it possible to realise arithmetic circuits based on widely available optimised full/half adders, compressors, and carry acceleration cells. Furthermore, they allow faithful representation of digit sets, including those not having zero as a member, leading to enhanced encoding efficiency in some cases. Here, we focus on stored-transfer representation of redundant digit sets [16], as an instance of weighted twit-set (WTS) encodings [15], study the implications of twit transfers, and adapt the conventional carry-free addition algorithm to stored-transfer representation of general digit sets that may exclude 0 (e.g. digit values extending from 3 to 12).
Conventional carry-free addition of radix-$r$ operands $x$ and $y$, whose digits $x_i$ and $y_i$ belong to the redundant digit set $D = [a, b]$, is described as follows (see Fig. 1a).
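The three-step scheme can be illustrated with a short sketch. It assumes the usual formulation in which step 1 forms the position sum $p_i = x_i + y_i$, step 2 splits it as $p_i = w_i + r\,t_{i+1}$, and step 3 forms $s_i = w_i + t_i$. The function names, the linear search for a suitable transfer, and the radix-10 digit set $[-6, 6]$ in the usage lines are illustrative assumptions, not taken from the text.

```python
def carry_free_add(x, y, r, a, b, transfers):
    """Sketch of three-step carry-free addition (assumed formulation).

    x, y      : equal-length digit lists, least-significant digit first,
                each digit in the interval [a, b]
    transfers : candidate transfer values (assumed to include 0 so that
                position 0 can receive a null transfer)
    Returns (sum digits, outgoing transfer of the top position).
    """
    # w_i must stay in [lo, hi] so that w_i + t_i lands in [a, b]
    # whichever transfer t_i arrives from the position below.
    lo, hi = a - min(transfers), b - max(transfers)
    k = len(x)
    w = [0] * k
    t = [0] * (k + 1)
    for i in range(k):  # positions are independent (parallel in hardware)
        p = x[i] + y[i]                                  # step 1: position sum
        t[i + 1] = next(c for c in transfers
                        if lo <= p - r * c <= hi)        # step 2: choose transfer
        w[i] = p - r * t[i + 1]                          #         interim sum
    s = [w[i] + t[i] for i in range(k)]                  # step 3: absorb transfer
    return s, t[k]


def value(digits, r):
    """Integer value of a least-significant-first digit list."""
    return sum(d * r**i for i, d in enumerate(digits))


# Usage: radix 10, digit set [-6, 6], transfer set {-1, 0, 1}
s, t_out = carry_free_add([6, -3, 5], [5, 4, 6], 10, -6, 6, (-1, 0, 1))
assert all(-6 <= d <= 6 for d in s)
assert value(s, 10) + t_out * 10**3 == value([6, -3, 5], 10) + value([5, 4, 6], 10)
```

No transfer chosen in one position depends on the transfer chosen in another, which is what makes the latency independent of the operand width.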
We derive bounds on possible values of $t_i$, and the necessary and sufficient number of these values for carry-free addition to be applicable, in Section 2. Note that the digit-size additions of steps 1 and 3, though quite fast compared with word-size additions required with nonredundant representations, are merely used for algorithm description and need not be explicitly performed in hardware. The addition in step 1 can be avoided, for example, by noting that $w_i$ and $t_{i+1}$ are directly computable in hardware as functions $v$ and $t$ of $x_i$ and $y_i$ (see Fig. 1a).
This, in effect, fuses steps 1 and 2 and allows the designer to choose the best possible merged implementation. It may be the case, with certain digit sets and/or encodings, that some form of addition is still part of the best hardware implementation scheme for $v$ and $t$, but this is not required. We are thus motivated to investigate methods for eliminating, or else simplifying, the addition in step 3. In Section 3 we discuss stored-transfer representation of redundant number systems [16] and adapt Algorithm 1 to such numbers. In Section 4 we study the implications of limiting the transfer set to one twit. Section 5 offers an efficient realisation of a single-twit stored-transfer adder based on the SUT adder of [15] and compares its performance with that of other redundant adders. Conversion from/to two's-complement representation is taken up in Section 6.
2 Transfer values in carry-free addition
In this section, general characteristics of the interdigit transfer process are studied and bounds on transfer values within the framework of carry-free addition (Algorithm 1) are derived. The results obtained allow for the number of possible transfer values in steps 2 and 3 of Algorithm 1 to be kept to a minimum, leading to more efficient implementations. In this study, two practical restrictions imposed in [17] are relaxed; that is, we deal with digit sets that do not necessarily include zero as a member, and noncontiguous transfer sets are allowed. Proofs for formal results in this section are to be found in the Appendix. In Lemma 1, the focus is on the upper (lower) bound for $c_0$ ($c_{d-1}$) because these bounds determine the smallest span and consequently the minimum cardinality of transfer values. Note that, in general, the larger the number of choices for the transfer value, the greater the complexity of the circuit that must make the selection and the higher its latency. Thus, whether $r$ is small (so that the upper bound above is $r$) or relatively large (so that the upper bound is $3 + \lfloor (r - 2)/(r - 1) \rfloor$), the variation in $d_{\min}$ is quite limited.
Corollary 3 affirms the results first reported in [17]. The following two results are also generalisations of the corresponding results for generalised signed-digit representations. The preceding results contain both bad news and good news. The bad news is that the transfer value cannot be represented by a single bit, thus forcing us to use two bits for its binary encoding. The good news is that, in virtually all practical cases, we do not need to go beyond two bits in representing the transfer values.
For $r = 2^h$, a redundant radix-$r$ digit set with at most $2r$ members and the transfer values for carry-free addition of Algorithm 1 can be encoded by $h + 1$ and two bits (i.e. the minimum possible for both), respectively.
In the following sections, the practical consequences of the preceding results are explored, leading to the particularly efficient implementations in Section 5.
Stored-transfer representations
In a manner similar to stored-carry or carry-save representation of binary numbers [2], we study the implications of stored-transfer or transfer-save representations of redundant digits, where the pair $(w_i, t_i)$ in the carry-free addition of Algorithm 1 is viewed as an encoding of the sum digit $s_i$. This interpretation obviates the need for the final addition $s_i = w_i + t_i$ in step 3 ($w_i$ is the main part and $t_i$ the transfer part of a digit's stored-transfer encoding). The stored-transfer representation of Definition 3 leads to a two-step formulation of carry-free addition, as described by Algorithm 2 and depicted in Fig. 1b.
Perform the following digit operations for all positions $i$ ($0 \le i < k$) concurrently:
Of course, steps 1 and 2 in this new two-step process can again be fused, in the manner previously outlined for Algorithm 1, leading to a merged or single-step implementation.
Note that the transfer set $G = \{c_0, c_1, \ldots, c_{d-1}\}$, satisfying $c_0 < c_1 < \cdots < c_{d-1}$, is $d$-valued but does not necessarily contain a set of $d$ consecutive integers. This more general view is taken in anticipation that it may provide added flexibility for optimisations. It will be seen later that even though such generalised transfer sets do not provide additional benefits directly, they can be used with minor modifications to the carry-free addition algorithm. On the other hand, the main part of a digit belongs to an interval $D = [a, b]$ of values. Whereas gaps in this set are also admissible, provided that the set contains one member from each of the $r$ residue equivalence classes $j \bmod r$ ($0 \le j \le r - 1$), this generality has not been found to yield any speed or cost benefit.
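The two-step scheme can be sketched as follows, under the assumption that step 1 forms the four-operand position sum of the two main parts and the two transfer parts, and step 2 splits that sum into a new main part and an outgoing transfer that is stored rather than added. The function names and the radix-16 example (main part in $[-8, 7]$, transfer set $\{-1, 0, 1\}$) are illustrative assumptions.

```python
def stored_transfer_add(x, y, r, main_lo, main_hi, transfers):
    """Sketch of two-step stored-transfer addition (assumed formulation).

    x, y : equal-length lists of (main, transfer) pairs, least-significant
           digit first; main in [main_lo, main_hi], transfer in `transfers`.
    The sum digit at position i is the pair (s_main[i], s_tr[i]); the
    final w_i + t_i addition of the three-step scheme is never performed.
    """
    k = len(x)
    s_main = [0] * k
    s_tr = [0] * (k + 1)   # s_tr[0] = 0 assumes 0 is a legal transfer value
    for i in range(k):     # all positions are independent
        p = x[i][0] + x[i][1] + y[i][0] + y[i][1]       # step 1: four-operand sum
        s_tr[i + 1] = next(c for c in transfers
                           if main_lo <= p - r * c <= main_hi)  # step 2: split p
        s_main[i] = p - r * s_tr[i + 1]
    return list(zip(s_main, s_tr[:k])), s_tr[k]


def st_value(digits, r):
    """Integer value of a list of (main, transfer) digit pairs."""
    return sum((m + t) * r**i for i, (m, t) in enumerate(digits))


# Usage: radix 16, two's-complement main part [-8, 7], transfers {-1, 0, 1}
x = [(7, 1), (-8, -1)]
y = [(7, 1), (3, 0)]
s, t_out = stored_transfer_add(x, y, 16, -8, 7, (-1, 0, 1))
assert st_value(s, 16) + t_out * 16**2 == st_value(x, 16) + st_value(y, 16)
```

The transfer arriving at a position is simply paired with the main part produced there, so the only arithmetic per digit is the small four-operand sum and its split.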
An objection may be raised that Algorithm 2 simply shifts the complexity of the original step 3 in Algorithm 1 to the new step 1. The fact that this is not the case will become apparent when the methods employed in this paper are explained in more detail. Here, it is argued that the new scheme can, in principle, be faster than that of Algorithm 1. For one thing, a four-operand addition, where two of the operands (transfer parts) are fairly small, can indeed be faster and less complex than two separate additions [18]. For another, the stored-transfer representation $\langle z', z'' \rangle$ may well contain the same total number of bits as the binary encoding of $z$ [15]. In such a case, the function pairs $(v, t)$ of Section 1 and $(s', s'')$ of this section have comparable bit-level complexities.
Example 2: Stored-transfer representations of some redundant number systems appear in Table 1. The hybrid signed-digit entries (lines 6 and 7) use the radix $r = 2^h$. Note that even though not all entries in Table 1 are practically useful, they have been included in the list to demonstrate the generality of the results. It is the belief of the authors that such generality is desirable and must be pursued whenever it does not interfere with the clarity of presentation for more practical cases. One important reason for this viewpoint is the fact that general results ensure that no important special case has been overlooked by imposing arbitrary restrictions based on current practice or implementation technologies.
The transfer sets of entries 6, 10, and 12 in Table 1 are two-valued and thus representable by a single bit. However, by Theorem 4, the cardinality $d$ of the transfer set must be at least three for carry-free addition to be possible. Moreover, the noncontiguous two-valued transfer sets $\{-1, 1\}$ and $\{2, 4\}$ of entries 4, 9, and 12 do not satisfy the result of Corollary 4, stating that for carry-free addition with redundancy index $\rho = 2$, the transfer set must contain a three-valued interval of integers; that is, it must consist of three consecutive integers. In Section 4, where an implementation of stored-transfer addition is presented, a simple design modification is used to deal with these two problems. Note that in designing a stored-transfer encoding, the transfer set $G$ used should preferably be of the minimum size prescribed by Theorem 4. Extra values in $G$, though they offer small advantages in extending the range of the digit set, degrade the encoding efficiency and increase implementation costs (i.e. latency, area and power). Furthermore, the wider digit set may not be preserved under carry-free addition, thus nullifying any accrued benefits.
Because a four-valued $G$ is always sufficient by Theorem 4, our stored-transfer representations need at most two bits of redundancy per digit compared with binary encoding of the nonredundant digit set $[0, r - 1]$. Virtually all practical redundant representations use power-of-two radices and thus imply at least one bit of redundancy. Therefore the incremental cost of the proposed scheme, in its initial
other arithmetic operations that use addition as a building block. In multioperand addition, and thus in multiplication, as well as in subtractive and multiplicative division, the per-add savings are compounded over many addition levels.
Because the main part of digits in a stored-transfer representation can be in nonredundant two's-complement format, much of the digit-level addition circuits can be based on readily available, and well optimised, binary
4 Two-valued stored transfers
The representational efficiency of the proposed stored-transfer scheme can be improved by using a design trick involving coupled encoding of the two components $x'$ and $x''$ of a digit $x$. Consider a three-valued stored transfer $x'' \in \{-1, 0, 1\}$, where the main part $x' = 2u' + v'$ is encoded in two parts: a single bit denoting $v'$ and an arbitrary encoding for $u'$. A stored-transfer digit $\langle 2u' + 0, 0 \rangle$ can be recoded as $\langle 2u' + 1, -1 \rangle$, and $\langle 2u' + 1, 0 \rangle$ as $\langle 2u' + 0, 1 \rangle$, thus making it unnecessary to store the transfer value 0. The resulting two-valued stored transfer renders the representational efficiency of the proposed scheme competitive with the most efficient redundant representations. The delay and circuit costs of this recoding are small, given that only a single bit $v'$ in the encoding of $x'$ is affected. The more general case of a three-valued transfer $x'' \in \{l, l + 1, l + 2\}$ is handled with equal ease: recode $\langle 2u'$
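The recoding can be sketched in a few lines. The case $l = -1$ follows the two substitutions stated above; the extension to a general offset $l$ is an assumption made here by requiring that the digit value (main part plus transfer) be preserved. The function name `recode_transfer` is illustrative.

```python
def recode_transfer(main, transfer, l=-1):
    """Map a digit with transfer in {l, l+1, l+2} to one with transfer in {l, l+2}.

    Only the least-significant bit of the main part changes, so the digit
    value main + transfer is preserved.
    """
    if transfer != l + 1:        # already one of the two stored values
        return main, transfer
    if (main & 1) == 0:          # main = 2u' + 0  ->  (2u' + 1, l)
        return main + 1, l
    return main - 1, l + 2       # main = 2u' + 1  ->  (2u' + 0, l + 2)


# Usage with the default l = -1 (transfer set {-1, 0, 1} reduced to {-1, 1})
assert recode_transfer(4, 0) == (5, -1)
assert recode_transfer(5, 0) == (4, 1)
assert recode_transfer(5, 1) == (5, 1)
```

Because an even main part is only ever incremented and an odd one only ever decremented, a two's-complement main part stays within its original range.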
The modification of the preceding paragraph may be viewed as reintroducing step 3 of the carry-free addition process, but in a much simpler form involving single-bit logical operations. It can be applied after each carry-free addition operation to keep representations efficient in the arithmetic circuits and their associated registers, or it can be applied only at the interface between the arithmetic unit and the storage system. Other ad hoc simplifications and efficient implementations for special cases may be derived. For example, the following algorithm (basically a detailed description of Algorithm 2) is applicable for addition of two $k$-digit radix-$2^h$ stored-transfer numbers $x$ and $y$, where the main part of each digit is an $h$-bit two's-complement number and the transfer is a unibit in $G = \{-1, 1\}$.
Algorithm 3 (Radix-$2^h$ stored-unibit-transfer addition to compute $s = x + y$): Perform the following digit operations for all positions $i$ ($0 \le i < k$) concurrently:
Consider $G = \{-1, 1\}$, with its two members encoded as $\{0, 1\}$. The rightmost bit of $z_i$ is always 0, the next bit is derived by an XNOR operation on the unibits, and the identical leftmost $h - 2$ bits by a NOR operation (XNOR and NOR are complements of logical XOR and OR functions).
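The bit-level rule for the transfer sum can be checked with a small sketch. It assumes that $z_i$ denotes the $h$-bit two's-complement sum of the two unibits of position $i$ and that $h \ge 3$, so that both $-2$ and $+2$ fit; the function names are illustrative.

```python
def unibit_sum_bits(xb, yb, h):
    """h-bit two's-complement sum of two unibits, most-significant bit first.

    xb, yb are the logical encodings (0 for -1, 1 for +1); h >= 3 is
    assumed so that the values -2, 0 and +2 are all representable.
    """
    xnor = 1 - (xb ^ yb)      # bit of weight 2
    nor = 1 - (xb | yb)       # sign extension: set only for (-1) + (-1)
    return [nor] * (h - 2) + [xnor, 0]


def twos_complement_value(bits):
    """Value of an MSB-first two's-complement bit list."""
    unsigned = int("".join(map(str, bits)), 2)
    return unsigned - (bits[0] << len(bits))


# Usage: h = 4
assert unibit_sum_bits(0, 0, 4) == [1, 1, 1, 0]   # (-1) + (-1) = -2
assert unibit_sum_bits(0, 1, 4) == [0, 0, 0, 0]   # (-1) + (+1) =  0
assert unibit_sum_bits(1, 1, 4) == [0, 0, 1, 0]   # (+1) + (+1) = +2
```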
Step 2 can be realised by standard full-adders.
Step 3 requires an adder that is $h$ bits wide ($h - 1$ if an extra half-adder is used in step 2); this adder can be of any suitable design. In step 4, $s'_i$ and $s''_{i+1}$ are directly derived in constant time from $p_i$ and its two most significant bits, respectively. Step 5 involves one gate delay, as previously discussed. Only step 3 has a latency that depends on $h$. Moreover, steps 1, 2 and steps 3, 4 may be partially overlapped to further reduce the constant-time component of the addition latency. A high-level circuit design, based on standard full/half-adders and carry acceleration cells for stored-unibit-transfer addition/subtraction (i.e. with $G = \{-1, 1\}$), is offered in [15].
Unfortunately, the two-valued transfer scheme just discussed is not applicable to arbitrary digit sets. Corollary 4 suggests the suitability of this scheme for any digit set with redundancy index $\rho = 2$. However, there are other digit sets with $\rho > 2$ and minimal $d$ that lead to consecutive three-valued transfer sets (i.e. $d = 3$). By Lemma 2, such desirable digit sets are limited to $2 \le \rho \le r - 1$. For example, it is easy to verify that possible transfer sets for radix-16 digit sets $[-9, 9]$ (with $\rho = 3$) and $[20, 45]$ (with $\rho = 10$) are $\{-1, 0, 1\}$ and $\{1, 2, 3\}$, respectively. But the main part of the two-valued stored-transfer representation of these digit sets is necessarily redundant. This complicates the addition scheme and renders the simple adder design, as previously discussed and implemented in [15], inapplicable. Therefore, in the following section we focus exclusively on minimally redundant digit sets with $\rho = 2$.
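The two radix-16 examples can be verified by brute force. The sketch below assumes the three-step form of carry-free addition, in which a position sum $p$ is split as $p = w + r\,t$ and the interim sum $w$ must remain a valid digit after absorbing any incoming transfer; the function name `supports_carry_free` is illustrative.

```python
def supports_carry_free(r, a, b, transfers):
    """True if every position sum p in [2a, 2b] can be split as p = w + r*t
    with t in `transfers` and w + t' in [a, b] for every incoming t'."""
    lo, hi = a - min(transfers), b - max(transfers)   # admissible range of w
    return all(
        any(lo <= p - r * t <= hi for t in transfers)
        for p in range(2 * a, 2 * b + 1)
    )


print(supports_carry_free(16, -9, 9, (-1, 0, 1)))   # True
print(supports_carry_free(16, 20, 45, (1, 2, 3)))   # True
print(supports_carry_free(16, -9, 9, (-1, 1)))      # False: p = 0 cannot be split
```

The last line shows why a two-valued transfer set is insufficient on its own for the three-step algorithm: with only $-1$ and $1$ available, position sums near zero have no admissible split.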
Encoding of stored transfers
The results derived in Section 2 suggest the applicability of Algorithm 2 (as well as Algorithm 1) for arbitrary digit sets $[a, b]$, possibly excluding zero as a member and with $a < b$. As discussed earlier, to obtain an efficient addition scheme, the focus is on digit sets with redundancy index $\rho = 2$, using stored-transfer encoding of the digit set with a nonredundant two's-complement main part and a twit transfer part. Thus, we use $[-2^{h-1}, 2^{h-1} - 1]$ as the two's-complement main part and $G = \{l, l + 2\}$ as the twit transfer part. With these assumptions, given a value $a$ as the lower bound of the digit set, the upper bound $b$ and the twit parameter $l$ are derived as

$$b = a + 2^h + 1, \qquad l = a + 2^{h-1}.$$
Example 3 (SUT-like digit sets): Table 2 shows the characteristic parameters of some stored-transfer digit sets, using twit transfers in $\{l, l + 2\}$, that satisfy the conditions of Corollary 4. It is shown that for such digit sets, the SUT addition scheme (Algorithm 3) is applicable.
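Parameters of this kind can be generated with a short sketch. It assumes an $h$-bit two's-complement main part and a twit transfer in $\{l, l + 2\}$, so that the smallest digit is the smallest main part plus $l$ and the largest digit is the largest main part plus $l + 2$. The function names are illustrative, and the two usage cases (the digit set from 3 to 12 mentioned in the Introduction, and the radix-16 stored-unibit-transfer set with $l = -1$) are chosen here as examples.

```python
def twit_digit_set(a, h):
    """Given the lower bound a and main-part width h, return (b, l)."""
    l = a + 2 ** (h - 1)    # a = -2^(h-1) + l
    b = a + 2 ** h + 1      # b = (2^(h-1) - 1) + (l + 2)
    return b, l


def representable(h, l):
    """All values main + transfer, with main in [-2^(h-1), 2^(h-1) - 1]
    and transfer in {l, l + 2}."""
    mains = range(-2 ** (h - 1), 2 ** (h - 1))
    return sorted({m + t for m in mains for t in (l, l + 2)})


# Radix 8: digit values 3 to 12, twit transfers {7, 9}
assert twit_digit_set(3, 3) == (12, 7)
assert representable(3, 7) == list(range(3, 13))

# Radix 16, unibit transfers {-1, 1}: digit set [-9, 8]
assert twit_digit_set(-9, 4) == (8, -1)
assert representable(4, -1) == list(range(-9, 9))
```

In both cases the digit set has $2^h + 2$ members, two more than the radix.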
In the addition scheme for radix-$2^h$ stored-unibit-transfer operands (first entry of Table 2), an $h$-bit two's-complement sum of the transfers is derived (step 1 of Algorithm 3), and a standard three-operand addition is performed (fused steps 2-4 of Algorithm 3). Both operations are performed in parallel for all radix-$2^h$ digits. The required high-level circuit design is reproduced in Fig. 2 from [15], where the transfer addition cell of step 1, generating the $u$ inputs, appears in Fig. 3. Note that step 5 of Algorithm 3 is taken care of by in-place reduction [15] in the lower full-adder in position $ih$ of Fig. 2.
Table 3 presents some performance measures for the proposed SUT adder, along with those of two other high-radix redundant representations. The addition scheme of Fahmy and Flynn [6], proposed in the context of implementing a floating-point adder, is based on maximally redundant radix-16 ($h = 4$) symmetric signed digits in $[-15, 15]$, where a digit is represented by a 5-bit two's-complement number, excluding the value $-16$. The compound, maximally redundant signed-digit adder (first row of Table 3) computes three values simultaneously: the actual sum, and the incremented and decremented sums (sum $\pm$ 1). This is done for reduced latency, at the cost of tripling the layout area and power consumption relative to a standard 5-bit two's-complement adder. Despite the use of the aforementioned speed-up technique, the extra control logic needed increases the overall addition latency by several logic levels [6], the extent of which has not been specified.
The hybrid redundancy scheme of Phatak and Koren [12] (second row in Table 3) suffers from highly asymmetric digit sets (see row 7 of Table 1), which reduces its appeal for general-purpose applications and complicates the use of addition circuitry for the subtraction operation. Moreover, the inapplicability of carry acceleration techniques leads to a best-case $O(h)$ addition latency for radix-$2^h$ operands. The design is based on the use of nonstandard adder cells for redundant and nonredundant positions, realised by 42 and 32 transistors, respectively, which result in the implementation cost of $32h + 10$ transistors for each radix-$2^h$ digit. The average latency per radix-2 position is roughly equal to that of one full adder ($D_{FA}$).
The proposed SUT adder, represented in the third row of Table 3, requires two rows of full adders. This corresponds to an active hardware redundancy factor of two, compared to a redundancy factor of at least three for the addition scheme in Fahmy and Flynn [6]. The latency of the proposed adder is no greater than that of an $(h + 1)$-bit two's-complement adder, which can be reduced drastically via standard carry acceleration. A radix-2 slice in the proposed adder is realisable by roughly 28 transistors, given the use of 14-transistor full adders [19].
An attractive solution for the addition of stored-twit-transfer operands, in cases where the stored transfers are twits other than unibit, is to design a special transfer adder for twit transfers and use the same three-operand adder of Fig. 1. Table 4 shows the addition summary for two twits $A''$ and $B''$ in $\{l, l + 2\}$ and a special binary representation of the sum. Following the convention in Jaberipur et al. [15], bold-italic underlined uppercase type is used for twits, while uppercase (lowercase) regular type corresponds to negabits (posibits).
Based on Lemma 1, for $r = 2^h$, the minimum value for generated transfers in carry-free addition based on Algorithms 1 and 2 is $t_{\min} = \lfloor a/(2^h - 1) \rfloor$, leading to
Using this equation for $l$, the possible transfer sum values $2l$, $2l + 2$, and $2l + 4$ can be expressed as follows, where $i = 0$, 2, and 4, respectively:
The first, middle, and last terms on the right-hand side of the equation are represented in Table 4 as $t_h$, $Q_{h-1} u_{h-2} \ldots u_2 u_1$, and $t_0$, respectively, where the middle term represents an $h$-bit two's-complement number in the range $-2$
This conformance is necessary for using the adder of Fig. 1. The SUT adder of Fig. 2 generates a transfer value in $\{-1, 1\}$ in position $h$, where adding $t_h$ produces a transfer in $\{t_{\min}, t_{\min} + 2\}$. Finally, $t_0$ of the next higher radix-$2^h$ digit is also added to the latter transfer value to yield a value in $\{l, l + 2\}$. The two additions just described do not actually take place, and $t_h$ and $t_0$ are not physically stored. They are merely used to show the relevant interpretation of the transfers produced. Note that for the SUT adder with $l = t_{\min} = -1$ and $v = 2^{h-1} - 2$, we have $t_h = t_0 = 0$ and the transfer sum values associated with the entries in Table 4 are $-2$, 0, 0, and 2.
Example 4 (Radix-16 stored-twit-transfer addition): Table 5 lists
Table 7 for justification):
The delay of steps 1 and 3 of Algorithm 4 is constant and independent of $h$ and $k$, but step 2 requires word-width carry propagation in general. However, there are special practical cases where this step may be eliminated (e.g. for $l = 0$ and in the SUT case of [15] with $l = -1$).
For the reverse conversion, the main parts are added, treating their corresponding transfers as doublebits (bits with double the normal weight), all in parallel. This yields a redundant number of the same value with two's-complement radix-$2^h$ digits. The rest of the process follows conventional redundant-to-binary conversion techniques [1], except that addition of a constant $l(2^{kh} - 1)/(2^h - 1)$ should be fused into the process. Therefore the reverse conversion, as is expected for any redundant representation, involves word-width carry propagation.
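The role of the fused constant can be illustrated with a sketch. It assumes that each digit is stored as a pair (main part, transfer bit), where the bit $t \in \{0, 1\}$ encodes the transfer value $l + 2t$; the function names are illustrative, and Python integer arithmetic stands in for the word-width carry-propagating step.

```python
def stored_twit_value(digits, h, l):
    """Direct evaluation: digit i contributes (main + l + 2*bit) * 2^(h*i)."""
    return sum((m + l + 2 * bit) * 2 ** (h * i)
               for i, (m, bit) in enumerate(digits))


def reverse_convert(digits, h, l):
    """Reverse conversion sketch for k digits, least-significant first."""
    k = len(digits)
    # Treat each transfer bit as a doublebit: digit-parallel, carry-free.
    redundant = [m + 2 * bit for m, bit in digits]
    # Conventional redundant-to-binary step (carry propagation in hardware).
    total = sum(d * 2 ** (h * i) for i, d in enumerate(redundant))
    # Fused constant: l * (1 + 2^h + ... + 2^(h(k-1))); the division is exact.
    return total + l * (2 ** (k * h) - 1) // (2 ** h - 1)


# Usage: radix 16 (h = 4), unibit transfers (l = -1)
digits = [(7, 1), (-8, 0), (3, 1)]
assert reverse_convert(digits, 4, -1) == stored_twit_value(digits, 4, -1)
```

The constant collects the offset $l$ contributed by every digit position, which is why it vanishes for $l = 0$.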
Conclusions
We have shown that the stored-transfer representation of redundant numbers offers speed and cost benefits in the carry-free addition process. We have proved the necessity of at least three transfer digit values, and sufficiency of four values, to allow carry-free addition in all cases of practical interest.
We have further shown that by a simple adjustment in the final stage of the carry-free addition algorithm, the number of stored transfers can be reduced to two values, thus requiring a single bit for storage. The proposed stored-transfer scheme is thus competitive with other practical redundant representations with regard to storage cost. In the course of establishing the theoretical basis for the proposed method, two practical restrictions imposed in the general signed-digit representations of Parhami [17] were relaxed, thereby providing the ability to deal with more general digit sets that do not necessarily include zero as a member and with noncontiguous transfer sets. This is an important aspect of the work reported in this paper because the generalisation comes with no inherent latency or cost penalty, but opens up valuable alternatives in exploring the implementation options in the design space associated with different redundant number representations.
We also demonstrated that converting a two's-complement number to stored-transfer form implies minimal constant cost and latency for many important practical cases, while the reverse conversion needs the obligatory carry propagation. This affinity with two's-complement numbers in representation and circuit implementation (i.e. use of standard full/half adders and compressors) is a key strength of the stored-transfer scheme.
Derivation of algorithms for stored-transfer multiplication and division is quite feasible. Very-high-radix SRT division algorithms with signed-digit partial remainders and signed-digit quotient [20] can be modified to accept stored-transfer operands. A series of arithmetic operations can thus be performed without carry propagation by representing the inputs, intermediate results, and outputs in stored-transfer format. Results on other arithmetic operations, particularly for floating-point operands, and a number of useful arithmetic support functions (such as shifting) will be reported in the near future. 
Proof of Theorem 3
According to Theorem 1, the condition $(\rho - 1)(r - 2) > v_1 + v_2$ is necessary and sufficient for carry-free addition, where $\rho$ is the redundancy index, $v_1 = a \bmod (r - 1)$ and $v_2 = (-b) \bmod (r - 1)$. For $\rho > 3$, the condition always holds, given that $(\rho - 1)(r - 2) > 2(r - 2) \ge v_1 + v_2$. For $\rho = 3$, the condition reduces to $v_1 + v_2 < 2(r - 2)$, which always holds, except when $v_1 = v_2 = r - 2$. So the proof will be complete if we show that for $v_1 = r - 2$, we cannot have $v_2 = r - 2$ for any digit set $[a, b]$ with $\rho = 3$. For $a = u_1(r -$
