Abstract-Digit-recurrence binary dividers are sped up via two complementary methods: keeping the partial remainder in redundant form and selecting the quotient digits in a radix higher than 2. Use of a redundant partial remainder replaces the standard addition in each cycle by a carry-free addition, thus making the cycles shorter. Deriving the quotient in high radix reduces the number of cycles (by a factor of about h for radix 2 h ). To make the redundant partial remainder scheme work, quotient digits must be chosen from a redundant set, such as [-2, 2] in radix 4. The redundancy provides some tolerance to imprecision so that the quotient digits can be selected based on examining truncated versions of the partial remainder and divisor. No closed form formula for the required precision in the partial remainder and divisor, as a function of the quotient digit set and the range of the partial remainder, is known. In this paper, we establish tight upper bounds on the required precision for the partial remainder and divisor. The bounds are tight in the sense that each is only one bit over a well-known simple lower bound. We also discuss the implications of these bounds for the quotient digit selection process.
INTRODUCTION
DIVISION is the most complex, and hence slowest, of the four basic arithmetic operations. Thus, techniques for speeding up division have been extensively studied by researchers in computer arithmetic [9] . Simple dividers are usually based on digitrecurrence algorithms employing repeated additions, while highperformance implementations tend to use multiplier-based convergence schemes. In digit-recurrence division [3] , digits of the quotient are derived one at a time, from the most to the least significant. For a k-digit quotient, there are at least k cycles in the division process; one cycle for producing each radix-r quotient digit, plus additional cycles for initialization, possible final correction, and rounding.
There are but two ways to speed up such an algorithm: reducing the number of cycles and making each cycle shorter in duration. The first implies higher radix division, while the second leads to several techniques, including keeping the partial remainder in redundant form so as to eliminate the relatively long delay of carry-propagate addition in each cycle. Regardless of the design choices at the algorithmic level, circuit techniques can be used to improve the divider speed and often have a significant impact on the resulting performance [5] . We do not deal with circuit techniques in this paper.
To make the use of redundant partial remainder possible, quotient digits must be chosen from a redundant set, such as [-2, 2] in radix 4. The redundancy provides some tolerance to imprecision so that the quotient digits can be selected based on examining truncated versions of the divisor and the partial remainder [12] , [1] . A redundant partial remainder that is truncated to a few bits can be quickly transformed into an estimate for the partial remainder or its bits used directly as inputs to a PLA or ROM table that spews out quotient digit values. Among other issues, the designer of a fast digit-recurrence divider determines the precision with which the divisor and the partial remainder must be examined in order to deduce the value of the next quotient digit. This precision, stated in the number of bits before and after the radix point, dictates the complexity of the PLA or the size of the ROM table yielding the quotient digit value. The PLA/ROM size influences not only the cost (VLSI area), but also the circuit speed, given that the quotient digit selection block is usually on the critical path.
Many practical aspects of the quotient digit selection process and associated hardware implementations have been discussed in the literature [13] , [3] , [7] . Particularly relevant to this work are the original analysis by Atkins [1] showing the minimal precision required and more recent contributions by Burgess and Williams [2] and Oberman and Flynn [8] elaborating on the selection criteria and trade offs. These latter contributions will be reviewed in Section 6 dealing with redundant representation of the partial remainder.
No closed form formula for the required precision in the divisor and the partial remainder, as a function of the quotient digit set and the range of the partial remainder, is known. So, the design process is often described as trial and error. In this paper, we establish tight upper bounds on the minimum precision required for the divisor and the partial remainder. The bounds are tight in the sense that each is only one bit over a well-known simple lower bound, leading to the requirement for just four trial points during the design process: the one represented by the lower bounds, one (or two) more bit(s) of precision for d, one more bit of precision for p. If none of these four points leads to a viable design, then one more bit of precision for both d and p is guaranteed to work. We also elaborate on the implications of these results on the design of practical SRT dividers that use a redundant representation for the partial remainder. The notation used in our analyses is listed in Table 1 for ready reference.
QUOTIENT DIGIT SELECTION
We derive our main results in Sections 2-5 assuming a nonredundant partial remainder. We then consider the effects of a redundant partial remainder on these results in Section 6. Even though practical SRT dividers are almost always based on a redundant representation of the partial remainder, this separation and stepwise derivation lead to a simpler and clearer presentation, as well as better understanding of the impact of redundancy.
Radix-r digit-recurrence division for the (possibly adjusted) dividend s
where s ðjÞ is the jth partial remainder and q ¼ 0:q À1 q À2 . . . q Àk is the developed quotient. We assume s ð0Þ ¼ jzj hjdj, perhaps ensured through preshifting of the dividend, yielding the adjusted dividend. The parameter h will be defined shortly. Additionally, we assume that the radix is a power of 2 so that everything is done in binary, with a radix-r digit corresponding to log 2 r bits.
To allow the choice of q Àj based on truncated versions of d and p ðjÞ ¼ rs ðjÞ , we use a redundant quotient digit set ½Àa; a, with a ! r=2 (usually, we have a r À 1, although overredundant digit sets satisfying a ! r are also of some interest).
Assuming d > 0, to maintain the invariant jsj hd, throughout the division process, we must be able to restore a shifted partial remainder, having the worst-case magnitude rhd, to hd or less by subtracting ad from it. This requirement defines h by bounding it from above:
. rhd À ad hd or h a=ðr À 1Þ:
Choosing h ¼ a=ðr À 1Þ so as to impose the least possible restriction on the range of s, we can easily determine the two boundaries of the region where x is a valid quotient digit value:
This leads to the following range for p, across which x is a valid quotient digit value:
The ranges associated with the validity of x and x À 1 as quotient digit values overlap in: 
LOWER BOUNDS FOR THE REQUIRED PRECISION
The lower bounds presented in this section are well known. However, given that the methods and notation used in their derivations are essential for understanding the rest of the paper, we include them here for the sake of completeness. If we truncate d to bits after the radix point, the uncertainty in the value of d, based on its truncated version d trunc , will be less than
truncate p to " bits after the radix point to get p trunc , the uncertainty will be less than y ¼ 2 À" and we have p
These two truncations together define a grid or tiling on the p-d representing uncertainties in the values of d and p, intersects both boundaries of an overlap region. Put another way, between each pair of adjacent vertical (horizontal) grid lines, there must exist a horizontal (vertical) grid line that is totally contained in the overlap region. Defining Ád as the minimum horizontal separation between the boundaries of the overlap region, clearly we must have w Ád. Based on the equations for the two lines bounding the overlap region in Fig. 1 , we find
which assumes its smallest possible value for x ¼ a. Hence, we arrive at the following well-known lower bound (lb) for the precision required of d in quotient digit selection:
A similar argument, with the roles of the horizontal and vertical grid lines interchanged, yields:
Of course, in addition to the fractional bits of d and p, as dictated by the bounds above, integer bits of each operand must also be inspected. Because, even in unsigned division, the partial remainder can go negative, the sign bit of p is important as well. In the case of signed division, the sign bit of d is also involved in the decision.
Returning to the discussion of the number of fractional bits needed, we endeavor to prove that setting ¼ lbðÞ þ 1 and " ¼ lbð"Þ þ 1 is always sufficient for quotient digit selection, thus establishing the simultaneous upper bounds ubðÞ ¼ lbðÞ þ 1 and ubð"Þ ¼ lbð"Þ þ 1. Fig. 2 represents an example. Here, based on the precision lower bounds provided by Ád and Áp, the coarse grid of W Â Y tiles (solid, heavy gray lines) is built. The resulting precisions are clearly inadequate, as evident from the absence of a solid horizontal grid line spanning the overlap region between the vertical grid lines at d min and d min þ W . Our results will show that halving the grid line spacing in each dimension (adding the broken grid lines in Fig. 2 ) provides adequate precision in the worst case.
UPPER BOUNDS FOR THE MINIMUM PRECISION
With the background presented in the previous sections, we are now ready to state and prove our main result [10] , which helps in the vast majority of cases where precisions equal to the lower bounds previously derived are inadequate for a hardware divider. Theorem 1. To determine the quotient digit in radix-r division, with r a power of 2, it is sufficient to inspect bits of d and " bits of the nonredundant p after their radix points, where:
Proof. We need to show that if vertical grid lines are spaced w ¼ 2 À Ád=2 apart, there will be at least one horizontal grid line (spaced y ¼ 2
À"
Áp=2 apart) that is entirely contained in the overlap region between two successive vertical grid lines. The proof is based on the geometric representation in Fig. 3 and consists of showing that any rectangle that touches two consecutive vertical grid lines and the boundaries of the overlap region has a height of at least Áp=2, which is no less than 2 À" . We only need to prove this at the leftmost edge of the uppermost overlap region that defines the boundary between choosing a À 1 or a as the quotient digit value. Corresponding rectangles in other parts of the same overlap region or in other overlap regions are taller, as exemplified by the rectangle near the right end of the overlap region in Fig. 3 . Consider the shaded rectangle near the left edge of Fig. 3 whose width is w and which touches both boundaries of the overlap region. Let the height of this rectangle be u ¼ Áp À v. We want to demonstrate that u ! Áp=2 or v Áp=2. This is easily proven by combining tan ¼ v=w ¼ Áp=Ád with w Ád=2. The latter inequality results from w ¼ 2
Based on the joint upper bound results of Theorem 1, only the first three cases shown in Table 2 need to be investigated. If none of these three cases proves viable, then the use of lbðÞ þ 1 bits of precision for d and lbð"Þ þ 1 bits of precision for p (per Theorem 1) is called for and is guaranteed to work. Case 4 in Table 2 is included because it might be of some interest when a redundant partial remainder is used (see Section 6).
DETERMINING THE MINIMUM REQUIRED PRECISION
Verification for each of the first three cases in Table 2 is relatively easy to mechanize. It consists of checking that, for all of the overlap regions, of which there are a in the first quadrant, and all consecutive pairs of vertical grid lines, of which there are 2 ðd max À d min Þ, a horizontal grid line exists that is totally contained in the overlap region. Referring to Fig. 3 , this means that, for the points A and B of the rectangle associated with the overlap region for x À 1 and x and the pair of vertical grid lines at d min þ 2 À m and d min þ 2 À ðm þ 1Þ, the coordinates
must be such that there exists an integer multiple of 2 À" between them, i.e., we must have:
b2
" pðAÞc ! 2 " pðBÞ: So, the algorithm, presented in Fig. 4 , has two nested loops, corresponding to varying x and m, with the inner loop's termination condition being 2 " pðBÞ À 2 " pðAÞ ! 1. This latter condition is sufficient for the existence of the required horizontal grid line, but is not necessary; for example, there exists an integer between 2.75 and 3.25, even though their difference is less than 1. Note that the condition is verified only in the first quadrant of the p-d plot for, even though the choice of quotient digit is not symmetric about the two axes, the placement of the uncertainty regions, which affects the required precisions, is symmetric.
Note that, for the sake of clarity, the pseudocode in Fig. 4 has not been fully optimized. In an actual coding of the algorithm, the values of A, B, S, and T could be initialized for x = 1 outside the x while loop and then appropriately updated with advancing x. Updating of these values inside the while loop would then require only the addition of increment values in much the same way that updating of A and B in the inner m loop is now handled.
Note that the selection constants or, alternatively, the table entries, are obtained as byproducts of the checking function in Fig. 4 . In other words, when, for a particular m, we have bAc ! B, any integer in ½bAc; B can be used as the selection constant to choose between x À 1 and x for the divisor interval ½m; m þ 1Þ; i.e., when the truncated divisor, taken as an integer, represents m.
REDUNDANT PARTIAL REMAINDER
Most modern implementations of the SRT division use carry-save or BSD (binary signed-digit or borrow-save) representation of the partial remainder. Both of these representations require 2 bits in each radix-2 position. Given that redundant formats with lower representational redundancy are gaining in popularity, we base our presentation in this section on a more general format that covers binary carry-save and BSD representations as special cases [6] .
Let the partial remainder s be represented as s 0 þ s 00 , with the main component s 0 being a radix-2 unsigned or two's complement number and the auxiliary component s 00 taking the form of a radix-2 g number whose digits are in [0, 1] or [-1, 0]. The special case of g ¼ 1 covers binary carry-save and BSD representations, while g > 1 includes high-radix carry-save representations [4] and hybrid redundancy [11] with the recurring BSD position encoded as a pair of negative and positive bits (see Fig. 5 ).
Truncating a redundant partial remainder to " bit positions usually leads to a larger maximum error than that of a nonredundant partial remainder. It is clear, however, that, for all the examples shown in Fig. 5 and, more generally, for any redundant representation that contains at most 2 bits in any given position, inspecting one additional bit of the redundant partial remainder is adequate, provided that these bits are used directly as inputs to a lookup table. Such a process is practical only for small radices (perhaps only 2 and 4). For larger radices, and sometimes even for radix 2 or 4, a common compromise is to assimilate a portion of the redundant partial remainder into a truncated nonredundant estimate and then use the resulting estimate in the quotient digit selection process as before [2] , [8] . This is attractive for both carry-save and BSD partial remainders because the delay due to assimilation of an l-bit segment, which is done through standard binary addition, grows rather slowly with the width l.
The assimilation-based approach needs to be further discussed as it is unclear how our preceding results can be applied in this case. A nonredundant estimate derived from assimilating a section of a redundant partial remainder has two sources of error: 1) A maximum error of just under 2 À" due to the truncated estimate. 2) An additional error due to assimilating only l fractional bits of the redundant partial remainder (complete assimilation would, of course, defeat the purpose of having a redundant partial remainder in the first place). We need only specify how the additional error above affects the selection process.
To show what needs to be done, we redraw the lightly shaded rectangle of Fig. 3 , with opposite vertices A and B touching the boundaries of the overlap region at vertical grid lines, in Fig. 6 . The function "adequate" in Fig. 4 checks to see if a horizontal grid line spans the entire width of this rectangle. When we assimilate only l fractional bits of a redundant partial remainder, we commit an additional error that is equal to the sum of all the dropped bits. For example, for carry-save representation, the positive bits that are dropped can add to almost 2 Àlþ1 and, for BSD representation, the bits dropped can add to a value in the open interval ð2
Àl ; 2 l Þ. Based on the preceding discussion, we can use the function "adequate" of Fig. 4 to determine if assimilation of l bits of a redundant partial remainder would be adequate, provided A and B are initialized as follows, instead of as in Fig. 4 :
maximum magnitude of dropped positive bits
maximum magnitude of dropped negative bits:
The corrective terms on the right-hand sides of these two expressions are easily derived, given the particular redundant format (as in the examples in Fig. 5 ) and the assimilation width l. In fact, rather than test for different values of l to see which one leads to success, we can modify the function "adequate" of Fig. 4 to return the smallest acceptable value for l along with its true/false answer. If the returned value of l is unacceptable, then larger values of " and/or may be tried. Note that, because the magnitude of the additional error term due to the incomplete assimilation of a redundant partial remainder is never more than the original 2 À" error term, at most a couple of cases beyond those in Table 2 , namely those with " ¼ lbð"Þ þ 2, need to be considered. Note also that one byproduct of the lower redundancies associated with the high-radix carry-save and hybrid-redundant representations in Fig. 5 is that straight table lookup (i.e., without first assimilating a truncated redundant partial remainder into nonredundant form) might become practical where it is not with binary carry-save or BSD representation.
CONCLUSION
In the literature on computer arithmetic, it is often mentioned that determining the precisions required of truncated versions of d and p to correctly choose the radix-r quotient digits from the digit set [-a, a] involves several iterations over the design space. In this paper, we have proven that the number of cases to be tried is limited to a handful, as listed in Table 2 and further augmented in Section 6, given choices for r (power-of-2 radix) and a (defining a symmetric, redundant digit set). Theorem 1, along with some of its implications to the design of high-radix dividers, constitutes the main contributions of this work.
Considering that the trials mentioned above are easily mechanized, as demonstrated in Fig. 4 , and that the choice of the radix r is quite limited by cost/performance requirements, the design process is not as cumbersome as one might have thought. This is true even after variations in the format of the redundant partial remainder (Fig. 5) are taken into account. It would be interesting to see if corresponding results for high-radix square-rooting, which is intimately related to SRT division, can be derived.
A possible area for future research is that of quotient digit-set encoding. In our derivations, the chosen quotient digit q i was treated as a single indivisible entity. However, a high-radix digit set must be encoded as a collection of bits. Possible choices include 2'scomplement representation, gray-code encoding [8] , and ad hoc sign-plus-weighted-bits encodings that facilitate the multiple generation process. With any particular choice of encoding, each bit in the representation of the selected quotient digit can be viewed as a separate function of the truncated divisor and partial remainder. Using separate logic circuits or tables to obtain these bits can potentially lead to savings in the total circuit complexity or table size.
ACKNOWLEDGMENTS
A preliminary version of this paper appeared in the Procedings of the 35th Asilomar Conference on Signals, Systems, and Computers [9] . The anonymous referees supplied detailed comments and a few references. Their contributions to the clarity and completeness of the paper are gratefully acknowledged. [2] , [3] , and digital signal processing [4] . Desirable features of finite-field arithmetic units include small critical path delay (CPD), high computation speed (throughput), short computation delay (latency), and small area-time (AT) complexity. Following Parhi [5] , we define CPD to be the total delay of the longest path between any two storage elements (or latches) and latency to be the number of clock cycles from the time at which the first bit of the input enters the architecture and the time at which the last bit of the corresponding output exits the architecture. Also, the AT complexity of an architecture is simply the product of the respective orders of the gate count and the number of cycles between successive outputs.
Inversion and division are the most complicated finite-field arithmetic operations and have received the most attention, leading to a variety of algorithms and architectures. We consider only those algorithms and architectures that can be used for any value of m and will not discuss the architectures using the recursive construction approach [15] , [16] , [17] that are suitable only for certain choices of m. Several of the general architectures have Oðm 2 Þ AT complexity and OðmÞ latency. These include all the previously proposed architectures that are based on the extended Euclidean algorithm (EEA) [6] , [7] , [8] , [9] , [10] , [11] , which have a CPD of four gates delay or more because of the complicated control signals required, the double-basis inversion architecture proposed by Hasan [19] 
