Abstract
Introduction
Most VLSI implementations of the division operation are based on digit recurrence; this class of algorithms, also known as SRT, offers a good tradeoff between implementation cost and performance. It is an iterative algorithm with linear convergence, i.e. one digit $q_j$ of the result is calculated at each iteration. The quotient after $k$ iterations is $Q_k = \sum_{j=1}^{k} q_j r^{-j}$, $r$ being the radix. Several techniques proposed to improve the execution time of the algorithm, including prediction, operand prescaling and overlapping of several selection stages, are reported in the book by Ercegovac and Lang [4]. Speculation of quotient digits, as a technique to speed up the division algorithm, was first reported in [2] and then extended to high-radix division and square root in [3]. The main differences between the method proposed in [2, 3] and the one we have developed are the following:
1. The number of bits for the truncation of the divisor and partial remainder to calculate the quotient digit is analytically derived;
2. The estimates of $w$ and $d$ used for speculation, error detection and correction are computed as a function of the desired speculation error;
3. The correction is always carried out in one extra cycle, whatever the error.
This work has been funded by the Ministry of Education of Spain under CICYT, TIC 98-0410.
The rest of the paper is structured as follows: in Section 2 we review the SRT algorithm and introduce the notation used in this paper. Section 3 describes the proposed algorithm. Section 4 applies the theory developed in the previous section to the design of a radix-16 divider and compares its performance with that of previous implementations. Finally, in Section 5, we draw some conclusions.
SRT Division and Notation
In an SRT division scheme, the following recurrence is applied repeatedly:
$$w_{j+1} = r\,w_j - q_{j+1}\,d \qquad (1)$$
where $w_j$ is the residual after the $j$-th iteration, $q_{j+1}$ the new quotient digit, $d$ is the divisor and $x$ the dividend, from which the initial residual $w_0$ is obtained. We also assume that $x$ and $d$ are positive normalized fractions.
The quotient digit is a signed digit $|q_{j+1}| \le a$ with redundancy factor $\rho = \frac{a}{r-1}$; in particular we will consider digit sets with $\rho \le 1$. Furthermore, in order to guarantee the convergence of the algorithm, the residual must be bounded, that is:
$$|w_j| \le \rho\,d \qquad (2)$$
The quotient digit $q_{j+1}$ is determined by a digit selection function $F$ whose arguments are estimates $\hat{w}_j$ and $\hat{d}$ of the residual and the divisor, respectively; that is:
$$q_{j+1} = F(\hat{w}_j, \hat{d})$$
As described in [4], one of the methods for quotient digit selection is essentially a bound-checking operation, i.e. a comparison of $\hat{w}_j$ with a set of selection constants $m_k$ that may depend on $d$ and such that $L_k \le m_k \le U_{k-1}$, where
$$L_k = (k - \rho)\,d, \qquad U_k = (k + \rho)\,d$$
are respectively the lower and the upper bound of the selection interval for digit $q_{j+1} = k$. A selection function must satisfy two fundamental conditions [4], containment and continuity, which determine the range of the selection intervals. However, a speculative selection function need only satisfy the latter, since a correction is performed every time the residual exceeds the correct bounds.
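The recurrence and the residual bound reviewed in this section can be exercised with a small software model. The following is only a sketch under stated assumptions, not the selection logic of the paper: the function name `digit_recurrence_divide` is hypothetical, the initial residual is assumed to be $w_0 = x/r$, and the digit is chosen by exact rounding of $r\,w_j/d$ rather than by comparing truncated estimates with selection constants.

```python
from fractions import Fraction


def digit_recurrence_divide(x, d, r=16, a=12, steps=8):
    """Radix-r digit-recurrence division with signed digits in [-a, a].

    Assumptions (not taken from the paper): w_0 = x / r, and the digit is
    selected by exact rounding instead of a table on truncated estimates.
    Returns (Q, w, digits) such that x / (r * d) == Q + w * r**(-steps) / d.
    """
    rho = Fraction(a, r - 1)            # redundancy factor a / (r - 1)
    d = Fraction(d)
    w = Fraction(x) / r                 # assumed initial residual
    quotient = Fraction(0)
    digits = []
    for j in range(1, steps + 1):
        q = round(r * w / d)            # idealised selection function
        q = max(-a, min(a, q))          # stay inside the digit set
        w = r * w - q * d               # residual recurrence
        assert abs(w) <= rho * d        # residual bound needed for convergence
        quotient += Fraction(q, r ** j) # accumulate q_j * r^(-j)
        digits.append(q)
    return quotient, w, digits


if __name__ == "__main__":
    Q, w, digits = digit_recurrence_divide(Fraction(3, 4), Fraction(5, 8))
    print(digits)          # [1, 3, 3, 3, 3, 3, 3, 3]
    print(float(16 * Q))   # close to 3/4 divided by 5/8, i.e. 1.2
```

With $x = 3/4$ and $d = 5/8$ the residual settles at $1/8$ after the first step, so every later digit is 3, which matches the hexadecimal expansion of 1.2.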
Quotient Digit Speculation
In case of speculation of the quotient digit the recurrence for division becomes [ 
where $q^c_{j+1}$ is the correction digit, drawn from a symmetric signed-digit set $\{\ldots, -1, 0, +1, \ldots\}$. A wrongly speculated remainder may be corrected using the following relation:
The selection bounds for digit $q^s_{j+1} = k$ are:
where $\varepsilon$ is the maximum speculation error for $d = 1$. Digit $q^s_{j+1}$ is determined by a function $F_s$ that evaluates estimates of both the full-precision residual and the divisor, that is:
If $t$ is the number of bits of the fractional part of $\hat{w}_j$, then the estimate $w'_j$ of the residual at the $j$-th iteration may be obtained from $\hat{w}_j$ by discarding least significant bits of its fractional part. To determine how many bits may be discarded, we impose that the overlap (for the worst case $d = \frac{1}{2}$) between two adjacent selection intervals must be greater than or equal to the truncation error of the carry-save redundant representation of $w'_j$, namely:
A residual in carry-save form is represented by two bit vectors: the former stores the sum bits, the latter the carry bits of an addition. Imposing the continuity condition between two consecutive selection intervals (L s 
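The truncation error of a carry-save value can be illustrated with a short model. This is a sketch on non-negative integers read as fixed-point numbers; the helper names `carry_save_add`, `truncate` and `truncation_error` are hypothetical and do not come from the paper.

```python
def carry_save_add(a, b, c):
    """3:2 compression of three non-negative integers into (sum, carry)."""
    s = a ^ b ^ c                            # bitwise sum, carries ignored
    cy = ((a & b) | (a & c) | (b & c)) << 1  # carries, shifted one position left
    return s, cy


def truncate(v, drop):
    """Clear the `drop` least significant bits of v."""
    return (v >> drop) << drop


def truncation_error(s, cy, drop):
    """Error made when only the truncated sum and carry vectors are added."""
    return (s + cy) - (truncate(s, drop) + truncate(cy, drop))


if __name__ == "__main__":
    s, cy = carry_save_add(0b1011, 0b0110, 0b0111)
    assert s + cy == 0b1011 + 0b0110 + 0b0111
    print(truncation_error(s, cy, drop=2))  # always below 2 ** (2 + 1)
```

Because each of the two vectors loses strictly less than one unit of the last retained position, the combined error is below two such units; in fixed-point terms this is an error below $2^{-e+1}$ when $e$ fractional bits are kept.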
Error Detection and Correction
A correct speculation is performed if and only if, at each step $j$ of the recurrence, the speculated partial remainder $w^s_j$ is bounded; that is:
Thus, in order to verify the correctness of a prediction we may check whether the partial remainder generated by a speculation falls within the correct bounds expressed by (11). Furthermore, the correction function has to guarantee that the sufficient condition for convergence (2) is always respected. Since a wrongly speculated partial remainder exceeds the correct bounds expressed by (11), it is surely greater than unity; as a consequence, integer bits of the remainder have to be evaluated in order to detect a speculation error. The number of integer bits examined also takes the sign bit into account.
The speed of the comparison may be increased by considering the estimates $w''_j$ and $d''$ of $w^s_j$ and $d$, respectively.
Namely $w''_j = w^s_j - 2^{-e+1}$ and $d'' = d - 2^{-f}$, which is equivalent to truncating the carry-save representation of $w^s_j$ to the $e$-th fractional bit and $d$ to the $f$-th fractional bit;
thus we obtain:
$$-\rho\,d'' \le w''_j \le +\rho\,d'' - 2^{-e+1} \qquad (14)$$
In case $w''_j$ is out of bounds and close to one of the bounds of (14) 
Performance Evaluation
Performance evaluation of a design is based on the calculation of the average number of cycles per quotient digit 
If D is the cycle delay of the design, the delay per bit may be defined as follows:
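As a rough illustration of this kind of figure of merit, the sketch below assumes (the exact expressions are not recoverable from the text) that every misspeculated digit costs one extra correction cycle, so that the average number of cycles per digit is $1 + (1 - h)$ for a hit ratio $h$, and that a radix-$r$ digit delivers $\log_2 r$ quotient bits. The function names and the numeric values in the usage lines are hypothetical.

```python
import math


def average_cycles_per_digit(hit_ratio, penalty_cycles=1):
    """Average cycles per quotient digit, assuming each miss adds a fixed penalty."""
    return 1.0 + (1.0 - hit_ratio) * penalty_cycles


def delay_per_bit(cycle_delay, hit_ratio, radix=16, penalty_cycles=1):
    """Cycle delay times average cycles per digit, divided by bits per digit."""
    bits_per_digit = math.log2(radix)
    return cycle_delay * average_cycles_per_digit(hit_ratio, penalty_cycles) / bits_per_digit


if __name__ == "__main__":
    # Hypothetical cycle delay of 20 gate delays and a 90% hit ratio.
    print(delay_per_bit(cycle_delay=20.0, hit_ratio=0.9))
```

Under these assumptions a higher hit ratio lowers the delay per bit only as long as the faster selection logic does not lengthen the cycle delay $D$.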
Implementation and Timing
We have designed the radix-16 unit implementing our algorithm using a standard-cell library (ES2-ECPD10) and SIS for logic minimization. The characteristics of both our design and those presented in [3] are reported in Table 1. Delays and area are expressed as multiples of the delay and area of a two-input NAND gate with a fanout of three NAND gates. Figure 1 shows the speculative radix-16 divider with $\rho = \frac{12}{15}$. The use of a redundant digit set allows carry-free additions to be performed using carry-save adders (CSA). Quotient conversion from redundant to non-redundant form may be performed on the fly, i.e. in parallel with the division operation, as also described in [4]. The dashed line indicates the critical path. In order to reduce the number of CSAs necessary to implement all the multiples of $d$, digits $q^s_{j+1} \in \{-11, +11\}$ are not speculated, since they would need three CSAs; consequently they are obtained by exploiting the correction function. Figure 1 shows the cycle time of the divider. In order to speed up the circuit, speculation at step $j$ is overlapped with the error detection at step $j-1$. Each digit $q^s_{j+1}$ is the sum of two contributions computed by two different combinational blocks; namely:
$$q^s_{j+1} = q_h + q_l$$
where $q_h \in \{\pm 8, \pm 4, 0\}$ and $q_l \in \{\pm 4, \pm 2, \pm 1, 0\}$. Computing $q_h$ can be performed faster, due to the reduced number of inputs, in order to shorten the critical path. On the other hand, computing $q_l$ is a slower operation and is overlapped with the computation of $q_h$ and with part of the delay introduced by the first CSA in the chain (see Figure 2). What leads to improved performance with respect to the implementations without partial advance described in [3] is the different arrangement of the divisor multiples along the adder chain. The choice of the multiples is determined according to the frequency with which a digit is selected. For this reason, the choice of speculating digits $q^s_{j+1} \in \{-12, +12\}$, unlike [3] where such digits are generated by exploiting the correction logic, leads to hit ratios very close to 90%, while the architecture proposed in [3] has a hit ratio of 70%. In addition, although we use one more bit to speculate, we achieve cycle delays comparable to those of [3] by designing the selection logic as reported in Figure 2 (since the table that computes $q_h$ requires fewer bits than the assimilation) and by using boolean relations [5] to synthesize the tables. The correction function also contributes to speeding up the execution time of the architecture since, unlike [3], it always needs only a single cycle to correct an error.
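The two-block decomposition can be checked by enumeration. The sketch below assumes that both contributions are signed, i.e. $q_h \in \{0, \pm 4, \pm 8\}$ and $q_l \in \{0, \pm 1, \pm 2, \pm 4\}$; the names `Q_H`, `Q_L` and `speculated_digits` are hypothetical.

```python
from itertools import product

Q_H = (-8, -4, 0, 4, 8)          # assumed values of the fast contribution q_h
Q_L = (-4, -2, -1, 0, 1, 2, 4)   # assumed values of the slow contribution q_l


def speculated_digits():
    """Map every reachable digit to the (q_h, q_l) pairs that produce it."""
    table = {}
    for qh, ql in product(Q_H, Q_L):
        table.setdefault(qh + ql, []).append((qh, ql))
    return table


if __name__ == "__main__":
    table = speculated_digits()
    missing = [k for k in range(-12, 13) if k not in table]
    print(missing)  # [-11, 11]
```

Under this assumption every digit from $-12$ to $+12$ is the sum of one $q_h$ and one $q_l$ value except $\pm 11$, which is consistent with those two digits being left to the correction function.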
Conclusions
Quotient digit speculation is an alternative to conventional division techniques. Assimilating a reduced number of bits of both residual and divisor may lead to a fast selection function and hence to a shorter execution time. Nevertheless, the speculated digit may be incorrect. The correctness of the selection is checked by an error detection function that verifies whether the speculated residual falls within the correct bounds. In case of incorrect speculation the algorithm rolls back and the digit is corrected; this operation is performed by an error correction function. Because of the possible rollbacks, the execution time is variable, i.e. different iterations may take different times. As a consequence, we have transformed a fixed-latency unit into a variable-latency one running with a faster clock cycle, whose behaviour is reminiscent of the telescopic units introduced in [1]. Moreover, due to its variable latency, our algorithm is also particularly appealing for asynchronous implementations.
