Absfmcf -The parallel division algorithms discussed in this paper may be classified among modified higher radix nonrestoring on-line division methods, where redundant representations are extensively utilized to speed up the operation. The network realizations of these algorithms are cellular, or even systolic with exclusively local control; they have size (area) and time both O ( n ) , where n is the length of the dividend representation. The same structures can also be used as a signed, digit-serial multiplier. When suitably equipped with some control and a few registers, the dividerhultiplier brings remarkable performance to large modular arithmetic, RSA cryptography, and gcd computations. They are also of interest to the design of floating-point units, and signal processing applications.
I. INTRODUCTION HE design of efficient division algorithms is a central
T problem in digital arithmetic. Indeed, division (the formal inverse of multiplication over the rationals) is substantially different from multiplication when we operate over the integers, and algorithmically at least as complex.
Several approaches to digital division have been proposed in the last four decades [l] , [2] . Division methods can be broadly subdivided into two classes: those that iteratively update the dividend, and those that compute the quotient through multiplication of the dividend by the reciprocal of the divisor. Only the latter have the capability of achieving high speed, at the expense of great circuit complexity [3] - [5] . In particular, to this class belongs the only known method [4] that achieves minimal time O(logn), where n is the operand length.
The approach presented in this paper belongs to the first class. The operation is very simple and only elementary techniques are involved, resulting in a small and practical design. It is well-known (see, e.g., [ l , p. 2351 ) that, for any radix, the most-significant quotient digit can be estimated within an additive error of 2 on the basis of two digits of the dividend and one of the divisor. To cope with this approximation we adopt, as in SRT division [6] , a redundant representation of the quo- tient, so that each quotient digit is estimated on the basis of short prefixes of the divisor and current dividend (partial remainder). Due to the redundant representation, an inaccurate estimate of a quotient digit gets corrected by the next digit, ultimately resulting in the (redundantly represented) correct quotient. The use of redundant representations not only permits the correct restoration of the quotient, but also allows constant parallel time computation of the dividend updates. As a result, the division process for n-digit operands can be carried out in time O(n) by a circuit whose size and area are Approaches inspired by analogous ideas have been proposed earlier by Trivedi and Ercegovac [8] and by Takagy, Yasuura, and Yajima [9] . Such approaches are called "on-line," because the result is generated digit-by-digit. The algorithm of [9] is for the base 2 and uses area O(n2) and time O(n) for n-digit operands. The algorithms of [8] are designed for arbitrary radix and are analyzed in terms of their on-line delay.
In this paper, we present on-line division algorithms of two alternative types: Euclidean division (i.e., standard) and division by smallest remainder, and introduce a novel analysis based on the invariance of the membership interval of the most-significant digit of the current dividend. The O( n)-time O( n)-size performance of our dividers makes them analogous to well-known cellular multipliers [lo] , [ 111. The analogy extends to the realizability as systolic networks [ 121-where each module is amenable to exclusively local control and data operation.
In addition, our schemes lend themselves to the following applications.
The divider can be operated in a rather straightforward manner, as an integer multiplier, where the digits of the (redundantly represented) product are serially generated starting from the most-significant end.
The divider/multiplier can be used to implement modular multiplication, which in turn can compute exponentials modulo a given integer. This leads to an implementation of high-security (2512 bits) RSA cryptography [13] more efficient than previously known ones.
The approach can be adapted to greatest common divisor (gcd) computation, in linear O(n) time and area. For this application, the use of division by smallest remainder appears to be crucial. Our direct implementation of Euclid's algorithm should be contrasted with the gcd algorithm of Brent and Kung [14] . The latter, which is also implemented in linear time and area, is based on the binary gcd and avoids division. (Indeed, ' Asymptotically analogous performance is exhibited by a recently proposed divider circuit [7], which is based on an entirely different technique, and generates the quotient from the least significant end.
also O(n).'
0018-9340/90/0500-0605$01.00 O 1990 IEEE the use of division in the inner loop was considered by Brent and Kung an unattractive feature of Euclid's algorithm.) This paper is organized as follows. In Section 11, we briefly review the positional irredundant and redundant number representations used in the algorithms. In Section 111, we present the two division algorithms. Section IV describes two typical circuit implementations, digit parallel and cellular. Finally, in Section V, we illustrate uses of the divider in the applications mentioned above.
NUMBER REPRESENTATIONS
Numbers are represented in base b = 2', 1 2 1 by a sequence nonl . . 'nk of digits, with the usual positional convention:
By placing various constraints on the range of values allowed for digits nj, j 2 1 , to the right of the fractional point, different number systems arise whose nomenclature is given below. No such constraints are placed on no, which in this paper appears as an integer represented in base 10.
Ordinary nonredundant base b unsigned-digit representation: The conversions between any two of the above five representations are either trivial or implementable with circuits analogous to an ordinary binary adder. These conversions, which are part of the computer arithmetic folklore, are summarized here for the reader's convenience.
Since a number in one-value-redundant form is also in two-value-redundant form, and the latter is also in one-bitredundant form, these conversions are trivial. The inverse conversions are based on expressing a digit n, as n j = s, + c, x b and on selecting (carry) c, as follows:
It is immediately verified that carries propagate at most one b) two-value redundant + one-value redundant digit position.
-1
Here again, carries propagate at most one digit position. Combining a) and b), we see that carries propagate at most two digit positions in the conversion from one-bit-redundant to one-value-redundant representation. Clearly, the above conversions can be implemented in constant parallel time.
Conversions to the unsigned-irredundant (standard) representation are accomplished by a subtraction, where the minuend corresponds to the bits with positive weights, and the subtrahend to those with negative weights. Slightly more subtle is the conversion from unsigned-irredundant form to the signedirredundant one. Denoting by a/-la/-2 . . . a0 the binary representation of a b-ary digit and by c and c' the carry-in and the carry-out of the generic digit position when performing the conversion, it is a simple exercise to show that
L L
The structural identity of this expression with the one encountered in binary addition (ar-1 and A:Eia,, respectively, corresponding to the "generate" and "propagate" parameters) shows that the conversion is accomplished by a modified adder. Thus, these conversions are executed either in O(n)
Finally, the conversion from unsigned-irredundant to any redundant form is trivially done by inserting a 0 bit (the sign bit) to the left of each b-ary digit. This is a minor, but interesting, variation of the one-valueredundant representation.
0 One-bit-redundant base b representation:
Each digit is encoded with I + 1 bits, the most significant one having negative sign. For example, n = 1000/44 can 111. DIVISION ALGORITHMS A . N -and Z-Divisions and a current dividend N(r), related to the initial dividend
We determine the next quotient digit of qt by dividing the first digit of the dividend Net) by that of the divisor, according to the Z-division criterion:
Let integers N , D, Q, and R , respectively, denote the dividend, divisor, quotient, and remainder, related by
A further condition uniquely determines Q and R. The algorithms to be presented in this note are based on the following criteria.
( 5 )
We then compute the partial remainder
all digits being processed independently without carry propagation
so as to obtain the current remainder in one parallel time unit. In order to obtain the updated dividend N('+') in the representation ( l), we propagate carries through one position to the left in constant time. To do so, define sum digits s and carry digits c such that
To distinguish between the two, we use the notation
for the (usual Euclidean) N-division, and 
B . Algorithm for Z-Division
(9) The operands are suitably normalized, by shifting the frac-
(10) tional point, and we adopt the following representations:
Dividend N = no . . . n , is one-bit redundant:
in constant time.
(l), (2) , (3), and (7), we obtain
Invariance of the Representations of Operands: From
The membership intervals of no and do will be specified later [see (13) and (14)l. Below they are shown as numbers in base 10.
and In the description of the iterative algorithm, we use superscripts to denote the index t of the division step. Specifically, At the completion of the tth division step, we have computed a partial quotient for b 4; base 2 has we initialize N' O) = N , and assume throughout that do > 0.
Relations (12) and (13) Remark I : Relation (14) shows that for b 2 4, the membership interval of n:+" for j 2 1 is contained in [ -b, b -11, as specified by (1). This suggests that the membership intervals of either d , or qt may be enlarged without significantly affecting any other detail of the algorithm. Indeed, it is easy to verify that we may adopt for d , or qt any interval Z such that [-4, 41 
Example of Z-Division: An example of 2-division is given in Fig. 1 The resulting quotient Q = 27 is computed, in base 4, by the algorithm in the form Q = 211.
C . Algorithm for N-Division
The algorithm described in this section is based on different selections of the intervals for the parameters n:), d j , and qt , and of the division criterion. In the notation of the preceding section, we have
(20)
The quotient digit qt is computed by N-division as follows:
As a consequence, the sign of rt") = nt' -qt x do is identical to that of n z ) , and Irt") 1 5 do -1 .
We now distinguish three cases, depending upon the sign (>O, =o, <O) of qr:
[ 4 t l I n this case, we first compute and obtain the terms cy+') and sy") according to the Ndivision:
quo b , sy") = r:f+') mod b.
From (19), ( 1 8), and (20) we immediately have
and from (22):
From these, we obtain which shows that (19) is an invariant of the method.
Moreover, we repeat below the expression for n;+":
Bounding n ; ' " from above, we obtain
bounding it from below, we have n;
Iqr]OJIn this case, symmetric to the previous one, we compute the terms cy+') and s:' ", j > 0, according to the rule
Arguing as before, we find that
and n;") E [ -bdo + 1 , b2 -11.
14, = OlIn this case, we let n y + l ) = ,(t) nb"" = bnb" + nr') for j 2 1 , and
We note first that invariant (19) trivially holds. Moreover, qt = 0 implies that
Combining with the above definition of nb'"), we readily ob-
Merging the analyses of the three cases yields 
Remark 4:
The N-division appears to require that the divisor be represented in nonredundant form. This should be contrasted with Remark 1 , and will prove to be a handicap in the applications.
Example of N-Division: An example of N-division is given in Fig. 2 The resulting quotient Q = 27 is computed, in base 4, by the algorithm in the form Q = 13T.
IV . IMPLEMENTATION
In this section, we consider some network implementations, and limit our considerations to the algorithm for Z-division (Section 111-B). The approach is readily extensible to the Ndivision (Section 111-C).
For convenience, we repeat below the operations to be executed, for t = 0, 1 , 2, . . .. We also recall the representation adopted for the operand digits.
The dividend is one-bit redundant: j 2 1, n, E [-b, b - 
11.
The divisor is signed-digit irredundant:
The quotient is one-value redundant:
A . Initialization of the Operands
If we assume that the operands are stored in memory in conventional binary form, i .e., unsigned-digit irredundant form; it is necessary to perform a preliminary conversion of the dividend and of the divisor. Specifically, the dividend is trivially converted into one-bit redundant form, and the divisor into one-value redundant form by a modified binary addition (refer to Section 11). For both dividend and divisor, the initialization can be simply merged with the normalization phase, all in linear serial time. An alternative, based on Remark 1, is to simply convert d to a redundant representation, in constant time.
B. Network Operations
The above operations (25)-(31) immediately identify two types of modules; the front-end module ( j = 0), which computes the next quotient digit, and the standard multiplier modules ( j > 1).
I ) Broadcast Implementation:
The most straightforward implementation of the algorithm is one where qr is broadcast by the front-end to all the standard modules, which all execute (28) simultaneously. It has the advantage of simplicity, but it falls short of balancing the workload between front-end and standard modules. Indeed, operations must be performed in the order
(25), (26); (271, (28); ( W , (30); (31)
where ";" and '0'' respectively separate parallel and sequential steps. Since no has 3 and do 2 digits, the most time consuming operations are (25) and (27), both to be performed sequentially by the front-end. Furthermore, the front-end must be able to broadcast qt , which physically requires time asymptotically proportional to length of the global communication link, hence to the divisor length d [17] . This realization is thus not modular electrically; in view of its conceptual simplicity, it is however likely to be implemented in practice for moderate values of n (e.g., for n = 64).
2) Systolic Implementation: An alternative implementation has a systolic nature. In this solution, the quotient digits Fig. 3. Physical interconnection of the standard module chain. migrate from the front-end along the chain of neighborly connected standard modules. The connection is modular and the control exclusively local.
Standard Modules: We begin by considering the standard modules. Using (28), (29), and (30), we first rewrite (31) as
This expression reveals that, in order to compute n:"', the correct values of nyll and nyi2 must be available, i.e., the digit q l -l of the quotient must have proceeded beyond the position of weight b-JP2. This means that qt-l must be at least three cells ahead of q t . This spacing of the quotient digits is due to the carry save feature, independently of the base.
For simplicity, terms n;!' As indicated above, quotient digits are present in a module only one time unit out of three; this time unit is said to be active for that module. Thus, during its active time unit for qt , module M; performs the following action:
(32) (33)
To prove correctness, we assume that qr is available at module Mk, k > 2 at time 3t +k + 1, for all t > 0. Next, we assume inductively that the pair ( s f ' , c f ' ) is available at time 3p + k , for all 0 I p < t . Then, by (32) and ( 3 3 ) , sy' is available at time 3t + j and at time 3t + j + 1. This and the assured availability of qt imply that, one time unit later, i.e., at time 3t + j + 2 = 3(t + 1 ) + j -1, the values of c:.'?;) and sy?:' can be computed, thereby extending the inductive hypothesis.
It follows that each standard module contains four registers, respectively, for digits c, s, q, and d. A fragment of the standard modules chain is shown in Fig. 3 . The following is a timing diagram of the operations of modules M j , M j + l , and Mj+2, where we display values only at the times of their updates in the module registers:
Qt
Front end Modules: We now consider the structure of the front-end section of the network. Referring to (27), we note that c;+l) is available two time units after the generation of qj . In order to feed the standard modules at the stipulated rate, qj+l must be produced three time units after q j . Therefore, we only have one time unit to carry out the computations
Considering the large size of the operands ntil), r;+l), and do, this objective can only be achieved by adequately lengthening the duration of the time unit, thereby resulting in a global slowdown.
To circumvent this difficulty, we modify the computation embodied by (25)-(27) in such a way that the update of nt' and the calculation of q j can be conveniently overlapped, while preserving the essential features of the technique. Specifically, let q j now denote the values of the quotient digits obtained by the modified techni ue. Rather than performing the calculation of s(ll+l) and cy" as specified by (28)-(31), we directly add the sum c(ll'b + slf) + cr), available at time 3t + 2, to the (shifted) current remainder r t ) .
The resulting sum has a discrepancy -dlqj from the correct value; however, since q j is available only at time 3t + 3, it is necessary to wait for the next digit cycle to add the appropriate correction, which must then be multiplied by b. Thus, we have the modified algorithm The method is correct if n t t l ) satisfies ( 1 3 , or, equivalently, if The quotient Q = 27 is computed by the algorithm in the form 2 I i.
V . APPLICATIONS
The division techniques of Section 111 systolic Z-division can be adapted to several significant arithmetic computations. We shall consider integer multiplication, modular integer multiplication and exponentiation, and greatest-common-divisor (gcd) calculation.
A . Integer Multiplication
respectively, represented in base b by Let integers A and B be the multiplier and the multiplicand, A a n -l a n -2 . .
.ao and B b n -l b n -2 . .
.bo.
We want to compute the product As in Section 111, we assume that both operands A and B are represented either in the signed-digit-nonredundant form, or in the one-value-redundant form. We initially set np' = 0 for j 2 -2 , and dj = bn-j for 1 5 j 5 n , with d-2 = d-1 = Subsequently, we carry out 2n + 2 steps of the Z-division algorithm, replacing the determination of the quotient digits do =O. 
4t = {
Relations (7) and (8) are extended to j 2 -1 , while (10) is extended to j 2 -2. Moreover, digit n-3 is updated as follows:
we prove by induction on t that ny) E [ -3/4b -1 , 3/4b] for a l l t > O a n d j > -1 :
The base of induction is trivially established for t = 0. An analogous, more specialized, argument can be developed for the base b = 2. This shows that carries never propagate beyond the position of index -2. Since the tth step of the computation yields the partial product 3/4b] for j > 1 , while r!'") = nj"') E [-3/4b -1 , 3/4b] p(t) = bp(t-1) +an-t x B for t > 0, with PcO) = 0, clearly n!f3' is the tth digit of the product for 0 < t 5 2n, counting from the most significant end. The product is therefore available in one-bit-redundant form, starting from the most significant digit.
This multiplier has both area and time O(n), as for the well-known Muller-Atrubin multiplier [ 101, [ 1 13.
B . Modular Integer Multiplication and Exponentiation
the modular product Let A , B, and M be three integers. We want to compute
Initially, the operands are represented with n digits in base-b signed-irredundant form. We assume, without loss of generality, that M 2 (b/2)bn-'. Operands A and B are used as multiplier and multiplicand, respectively. Integers B and M, the multiplicand and the modulus, must be normalized so that the most significant digit of M is sufficiently large, while the corresponding digit of B is zero. Specifically, we have with rno > 0. The choice of the interval membership of ma, to be made shortly, defines a unique integer U > 0 such that b" gives the relative shift of M versus B. Correspondingly, A is also multiplied by b" , i.e. U zero digits are appended to the right of its representation. In conclusion, we are effectively carrying out the calculation Pb" = (Ab") x B(modMb").
The computation is performed in n + U iterations, designated by index t 2 1 , and using variables P('), U('), ut), and q f as follows:
We now establish the correctness of the method. For b 2 4, q t -l = ug-')-mo;
(40)
algorithm, with each step performed by Z-division. Let
with Pc0) = 0. We thus alternate between one multiplication (38) and one division [(40), (41)] step. However, the left-shift step is performed only once, after each division step. Obviously, the divisor register is now replaced by two registers, one for the multiplicand, the other for the modulus. Total area and time remain O( n).
To ensure correctness, the interval for mo must be chosen so that u t ) E [ -y m o , 2 -11 .
Let p t ) denote the coefficient of bo in the representation of P('). Using the notations of Section 111-B, operations (40) and (41) b2 + 6 b + 5 2 4) We perform one step of carry propagation; this can be achieved by computing one division step with partial quotient q = 0, but, without shifting the operand. As a result of (8), R is now in two-value-redundant form: 
If E is represented in binary as E = e n -l e n -2 . . .eo, with e,-l = 1 , the computation of C is carried out by the following routine: 1) R :=A;
2 ) for j = n -1 downto 1 do 3) begin R : = R x R m o d M ;
4)
i f ej-1 = 1 thenR := R x A modM;
end.
Since both Steps 3 and 4 take time O(n'), where n' is the number of digits of A and M , the RSA-residue C is computed in time O(nn').
C . Computation of the Greatest Common Divisor
Our last application will be to compute the greatest com- and proceed to the next iteration.
Each shift in Step 1, and each division in Step 2 reduces the total operand length n + d by one unit.
Step 3 is performed only once per stage in Euclid's algorithms, for a total which is well known [ l ] to be of at most logg D < 2d. It follows that the above algorithm for computing gcd is both time and space linear in the length of the operands.
