Abstract: The design of a fast divider is an important issue in high-speed computing. The paper presents a fast radix-4 SRT division architecture. Instead of finding the correct quotient digit, an estimated quotient digit is first speculated. The speculated quotient digit is used to simultaneously compute the two possible partial remainders for the next step while the quotient digit is being corrected. Thus, this two-step process does not influence the overall speed. Since the decision-making circuits can be implemented with simple gate structures, the proposed divider operates at high speed. Based on the physical layout, the circuit takes 247 ns for a double-precision division (56 bits for the fraction part), where 2 µm CMOS technology in MAGIC is employed and simulated.
Introduction
The design of fast dividers is an important issue in high-speed computing because division accounts for a significant fraction of the total arithmetic operations [1]. Most implementations of division are based on the SRT algorithm, which uses a recurrence producing one quotient digit per step [2–9]. The speed of such SRT-based dividers is mainly determined by the complexity of the quotient-digit selection. Fig. 1 illustrates an architecture of a radix-4 SRT divider which employs a quotient-digit selection table (QST). The use of a QST significantly reduces the complexity of quotient-digit selection. However, the table size increases drastically with high radices [2, 9]. The table size can be reduced significantly by estimating the quotient digit instead of finding the exact one [9]. The estimated quotient digit is calibrated in parallel with updating the new partial remainder. Since the two-step process does not affect the division speed, the approach is fast owing to the significant reduction in table size. This paper presents the detailed design of a fast radix-4 SRT divider and its VLSI implementation. Results show that, based on the physical layout, the circuit takes 291 ns to compute a double-precision division (56 bits in the fraction part), where the CMOS technology in MAGIC is employed for simulation.
Radix-4 division
Consider the following recursive equation for the partial remainder in a high-radix SRT division [2]: $r_i = \beta r_{i-1} - q_i D$, where $\beta = 2^m$ is the radix, $D$ is the divisor, $r_{i-1}$ is the partial remainder at the $(i-1)$th step, and $q_i$ is the $i$th quotient digit. The high-radix SRT division can be implemented in such a way that its quotient digit $q_i$ is selected from a digit set $\{-a, \ldots, -1, 0, 1, \ldots, a\}$, where $\lceil(\beta - 1)/2\rceil \le a \le (\beta - 1)$. The ratio $k = a/(\beta - 1)$ is a measure of the redundancy in the representation of the quotient digits. The smaller the value of $k$, the smaller is the redundancy in the number system for the quotient. For the radix-4 case considered here, $(\beta, a) = (4, 2)$ and the divisor $D$ is normalised to $[0.5, 1.0)$. The quotient digit $q \in \{-2, -1, 0, 1, 2\}$. Let $P = 4r_{i-1}$ denote the previous partial remainder, where $r_{i-1}$ is the partial remainder at the $(i-1)$th step and $|r_{i-1}| \le (2/3)D$. Thus, the upper and lower limits for $P$ are $(-2/3 + q)D \le P \le (2/3 + q)D$ (1)
and the regions for all values of $q$ are listed in Fig. 1c, where $0.5 \le D < 1$. There exists an overlapping region between two adjacent regions corresponding to two consecutive values of a quotient digit, as the shaded areas show in Fig. 1b. Let $P_{j,j+1}$ denote the overlapping region for $q = j$ and $q = j + 1$. The overlapping region $P_{j,j+1}$ is expressed as $P_{j,j+1} = \{(P, D) \mid (-2/3 + j + 1)D \le P \le (2/3 + j)D \text{ and } 0.5 \le D < 1\}$ (2). Therefore, the overlapping regions in Fig. 1b are listed in Fig. 1d. The quotient digit in the overlapping region $P_{j,j+1}$ can be either $q = j$ or $q = j + 1$. The implication of having an overlap region is that we have a choice of values, of both the partial remainder and the divisor, that will eventually separate these two adjacent regions. The selected values of the partial remainder and divisor separating the adjacent regions of $q$ serve as comparison constants during the execution of the divide operation. If there is a value $c$ satisfying $(j + 1/3)D_{max} \le c \le (j + 2/3)D_{min}$ (3), then the selection of $q$ will be independent of $D$ and will depend only on $P$. The number of bits required to represent this constant determines the necessary precision when examining the partial remainder to select the quotient digit $q$. However, if this inequality is not satisfied, the interval $[D_{min}, D_{max})$ is partitioned into several smaller intervals. The stepping points determine the precision (i.e. the number of bits) at which $D$ is examined, while the height of the steps determines the precision at which the partial remainder has to be examined. Therefore, the quotient-digit selection table stores the quotient assignment in the P–D plot and the table can be implemented by either a 1.5 Kbit ROM or a $(n_i, n_o, n_p) = (9, 3, 35)$ PLA [9], where $n_i$, $n_o$, $n_p$ are the number of inputs, outputs and product terms, respectively.
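The condition in eqn. 3 can be checked numerically. The following Python sketch is illustrative only: the function name `constant_range` is hypothetical, and the divisor interval is treated as closed at its upper end for simplicity.

```python
from fractions import Fraction as F

def constant_range(j, d_min, d_max):
    """Bounds (lo, hi) on a divisor-independent comparison constant c that
    separates q = j from q = j + 1 (eqn. 3), or None if no such c exists."""
    lo = (j + F(1, 3)) * d_max   # lower edge of the overlap at the largest divisor
    hi = (j + F(2, 3)) * d_min   # upper edge of the overlap at the smallest divisor
    return (lo, hi) if lo <= hi else None

# Whole divisor range, j = 0: only c = 1/3 remains, which has no finite binary form.
print(constant_range(0, F(1, 2), F(1)))
# Whole divisor range, j = 1: no single constant exists, so the range must be partitioned.
print(constant_range(1, F(1, 2), F(1)))
# A narrower divisor interval admits a short constant such as 3/4.
print(constant_range(1, F(1, 2), F(9, 16)))
```

This shows why a conventional table has to examine several bits of the divisor as well as the partial remainder.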
Proposed radix-4 division
Instead of selecting the correct quotient digit $q$, $-a \le q \le a$, the proposed approach first estimates a quotient digit $q^{\#}$, $-a \le q^{\#} \le a - 1$, such that $q \in \{q^{\#}, q^{\#} + 1\}$, i.e. the actual quotient digit is either $q^{\#}$ or $q^{\#} + 1$, and the possible partial remainder is either $P^0 = 4r_{i-1} - q^{\#}D$ or $P^1 = 4r_{i-1} - (q^{\#} + 1)D$. Thus, the actual quotient digit is $q = q^{\#} + q^{*}$, where $q^{*}$, referred to as a correction value, is either 0 or 1. This division process includes two steps: (i) quotient digit estimation, and (ii) possible remainder updating and quotient digit correction. This Section first describes the proposed division algorithm, hardware implementation and error analysis. In addition, the speed performance of the proposed architectures is estimated. To obtain a realistic estimate of the performance, the physical layout has been simulated.
Algorithm development
The estimated quotient digit is selected as follows. For an estimated quotient digit $q^{\#}$, the upper and lower limits for the corresponding partial remainder $P$ are
Thus, the regions for all values of $q^{\#}$ are listed in Fig. 2a, and the overlapping regions are shown in Fig. 2b. Fig. 2a shows that the new overlapping regions are much wider and flatter than those in Fig. 1b, where $P$ is represented in two's-complement form. Three comparison constants can be generated from the overlapping regions. They are $c_1 = (000.1)_2$, $c_0 = 0$ and $c_{-1} = (111.1)_2$. Thus, the quotient digit is estimated by
Once the estimated quotient digit $q^{\#}$ is determined, the range of the partial remainder is
and the correction value $q^{*}$ is determined as follows:
This shows that $P^{*}$ is equivalent to $P$ in eqn. 1 with $q = 0$ and $q = 1$.
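One reading of the limits and of the estimation rule that is consistent with eqn. 1, with $q \in \{q^{\#}, q^{\#} + 1\}$ and with the three constants quoted above is sketched below. It is offered as an assumption-laden reconstruction rather than the authors' exact formulation: $P^{*}$ is taken to mean $P - q^{\#}D$, and the placement of the non-strict inequalities is assumed.

```latex
\begin{align*}
(-\tfrac{2}{3} + q^{\#})\,D \;&\le\; P \;\le\; (\tfrac{5}{3} + q^{\#})\,D
  && \text{(union of the regions for } q^{\#} \text{ and } q^{\#}+1\text{)} \\
q^{\#} \;&=\;
  \begin{cases}
     1, & c_1 \le P \\
     0, & c_0 \le P < c_1 \\
    -1, & c_{-1} \le P < c_0 \\
    -2, & P < c_{-1}
  \end{cases}
  && \text{(}c_1 = \tfrac{1}{2},\; c_0 = 0,\; c_{-1} = -\tfrac{1}{2}\text{)} \\
-\tfrac{2}{3}\,D \;&\le\; P^{*} = P - q^{\#}D \;\le\; \tfrac{5}{3}\,D
  && \text{(eqn. 1 with } q = 0 \text{ and } q = 1\text{)}
\end{align*}
```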
Once $q^{*}$ is determined, the actual quotient digit is $q = q^{\#} + q^{*}$ and the partial remainder is $P^0$ if $q^{*} = 0$, or $P^1$ if $q^{*} = 1$. The detailed division process is summarised in Algorithm I, and Fig. 3 illustrates the stepwise procedure with an example, where the dividend $X = (0.01111111)_2$, the divisor $D = (0.1001)_2$ and $(\beta, a) = (4, 2)$. The division process starts with comparing $X$ to the comparison constants in Fig. 2a to generate $q^{\#} = 1$ so that $r_0 = X - D$. Since $X < D$, this results in $q^{*} = 0$. Thus, $q_0 = q^{\#} + q^{*} = 1$. It is followed by the generation of $q_1$, where a speculated quotient digit $q^{\#} = -1$ is first estimated. Together with $q^{*} = 1$, we obtain the actual quotient $q_1 = -0$, which is the same as $q_1 = 0$.
For $q_2$, a speculated digit $q^{\#} = -2$ is estimated. With $q^{*} = 0$, we obtain $q_2 = -2$. Therefore, the final quotient $Q = (0.1110)_2$ and the remainder $R = (0.0001)_2 \times 2^{-4}$. It can be easily verified that $X = Q \cdot D + R$.
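The worked example can be replayed with exact rational arithmetic. The Python sketch below is not Algorithm I itself. The estimation thresholds are the constants $c_1$, $c_0$ and $c_{-1}$ quoted in the text. The correction threshold (1/2 for $D \ge 3/4$, otherwise 1/4) is an assumption derived from eqn. 3 with $j = 0$. The function names are hypothetical. The sketch applies the same rule to $X$ in the first step, so its split into $q^{\#}$ and $q^{*}$ for $q_0$ may differ from Fig. 3 even though the digit is the same.

```python
from fractions import Fraction as F

def estimate(P):
    """Speculated digit q# from the constants c1 = 1/2, c0 = 0, c-1 = -1/2."""
    if P >= F(1, 2):
        return 1
    if P >= 0:
        return 0
    if P >= F(-1, 2):
        return -1
    return -2

def correct(P0, D):
    """Correction value q*. ASSUMPTION: threshold 1/2 if D >= 3/4, else 1/4."""
    c = F(1, 2) if D >= F(3, 4) else F(1, 4)
    return 1 if P0 >= c else 0

def divide(X, D, steps):
    """Return the radix-4 digits q_0 .. q_(steps-1) and the last partial remainder."""
    digits, P = [], X                # the first step works on X itself
    r = X
    for _ in range(steps):
        q_est = estimate(P)
        P0 = P - q_est * D           # remainder if q = q#
        P1 = P - (q_est + 1) * D     # remainder if q = q# + 1
        q_cor = correct(P0, D)       # in hardware this overlaps with computing P0 and P1
        digits.append(q_est + q_cor)
        r = P1 if q_cor else P0
        P = 4 * r                    # shift left by two bits for the next step
    return digits, r

X, D = F(127, 256), F(9, 16)         # (0.01111111)_2 and (0.1001)_2
digits, r = divide(X, D, 3)
Q = sum(F(q, 4 ** i) for i, q in enumerate(digits))
R = r / 4 ** (len(digits) - 1)
print(digits, Q, R)
```

With these inputs the digits are 1, 0 and -2, so $Q = 7/8 = (0.1110)_2$ and $R = 2^{-8}$, in agreement with the example.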
Hardware implementation
Based on Algorithm I, Fig. 4 illustrates the proposed architecture. It is assumed that the divisor $D$ is an $n$-bit positive normalised binary number ($0.5 \le D < 1$), the dividend $X$ is a $k$-bit binary number ranged between $-(8/3)D$ and $(8/3)D$, where $X$ is represented as $(S P_1 P_0 P_{-1} P_{-2} \cdots P_{-k})$ in two's complement form and $S$ is the sign bit, $k \le 2n$, and the quotient $Q = (q_0.q_1 q_2 \cdots q_n)$, where $q_i \in \{-2, -1, 0, 1, 2\}$ is a binary redundant digit, $i = 0, 1, 2, \ldots, n$. At the first cycle, the first $n + 3$ bits of $X$, i.e. $(S P_1 P_0 P_{-1} P_{-2} \cdots P_{-n})$, are processed, and in the following cycles, two new bits of $X$ are shifted into the updated remainder.
According to Algorithm I, we first estimate the quotient digit $q^{\#}$ from Block QH. The multiplexer circuit takes the quotient digit $q^{\#}$ and generates $-q^{\#}D$ and $-(q^{\#} + 1)D$. Two adders are used to compute $P^0$ and $P^1$ in Step 2.1 and the results are compared with the constants shown in Fig. 2 to generate $q^{*}$ and $q^{\#}$, where only the seven most significant bits of $P^0$ and $P^1$ (to be explained shortly) are required. Two parallel addition schemes without carry-propagation delay may be considered: the signed-digit adder [2] and the carry-save adder (CSA). In practice, a CSA is preferred when the adder is used to perform both addition and subtraction. Therefore, this implementation uses two $(n + 3)$-bit CSAs and two 7-bit CLAs (carry-lookahead adders) to avoid long carry propagation. After generating the correction value $q^{*}$ in Step 2.2, two multiplexers are used to select the CSA and CLA outputs, respectively, for $P^0$ or $P^1$. The process is repeated $n$ times. In this implementation a signal start is used to indicate the beginning of the division process. More specifically, when the division begins, i.e. start = 0, Block QH selects the estimated quotient digit $q^{\#}$ for the dividend $X$, the signal start is set to 1, and then the CSAs compute $P^0 = X - q^{\#}D$ and $P^1 = X - (q^{\#} + 1)D$. The remaining procedure is the same as that described above. Therefore, for an $n$-bit division, the process needs $n + 1$ cycles.
Block QS:
Block QS generates the correction value $q^{*}$ from the computed remainder $P^0$. Given an updated partial remainder $4r_{i-1}$ and the estimated quotient digit $q^{\#}$, by eqn. 7, $q^{*}$ is determined by the comparison constants in Fig. 2d. Let $P^0 = (S P_1 P_0 . P_{-1} \cdots P_{-n})$. Since the maximum positive value of $P^0$ is 5/3 and the bit $P_1 = 0$ is true for all positive $P^0$, the bit $P_1$ is ignored here. The comparison constants shown in Fig. 2d can be tabulated as shown in Table 1. A $P^0$ with $c_3 = (000.10)_2 \le P^0$ can be represented by $(S P_1 P_0 . P_{-1} \cdots P_{-n}) = (001.xx \cdots x)$ or $(000.1x \cdots x)$, and will be truncated as $(S P_1 P_0 . P_{-1}) = (001.x)$ or $(000.1)$. Note that the bit $P_1$ is always 0 for a positive $P^0$. Thus, the statement '$c_3 \le P^0$' in eqn. 7 is equivalent to '$S = 0$ & {$P_0 = 1$ or $P_{-1} = 1$}'. Similarly, the statement '$c_2 \le P$' is equivalent to '$S = 0$ & {$P_0 = 1$ or … $\cdots x$). Thus, '$D < (0.11)_2$' is equivalent to '$d_{-2} = 0$'. Thus eqn. 7 can be rewritten as
and the logic function is
Fig. 5a shows the logic implementation of eqn. 8.
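The bit-level form of the comparison with $c_3$ described for Block QS can be checked exhaustively over the values a truncated $P^0$ can take. The Python sketch below is an illustration, not eqn. 8. It assumes that $P^0$ is truncated to the four bits $(S P_1 P_0 . P_{-1})$, that $-2/3 < P^0 < 5/3$, and the helper name `ge_c3` is hypothetical.

```python
def ge_c3(k):
    """k / 2 is P^0 truncated to a grid of 1/2, held as 4-bit two's complement
    (S P1 P0 . P-1). Returns the predicate S = 0 and (P0 = 1 or P-1 = 1)."""
    bits = k & 0xF
    S = (bits >> 3) & 1
    P0 = (bits >> 1) & 1
    Pm1 = bits & 1
    return S == 0 and (P0 == 1 or Pm1 == 1)

# Truncated values reachable from -2/3 < P^0 < 5/3 are -1, -1/2, 0, 1/2, 1 and 3/2.
checks = [(k / 2, ge_c3(k)) for k in range(-2, 4)]
print(checks)
```

The predicate is true exactly for the truncated values that are at least 1/2, which is why the bit $P_1$ can be left out of the comparison.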
Block QH:
The correction value $q^{*}$ selects the updated remainder $r_i$ from either $P^0$ or $P^1$, together with the first seven most significant bits of the updated remainder. As shown in Fig. 4, the updated remainder is shifted left by 2 bits to block QH for generating $q^{\#}$. Thus, after shifting 2 bits, we obtain $(S P_{-1} P_{-2} . P_{-3})$, which is compared with the constants given in Table 2. According to the encoding scheme in Table 1, eqn. 5 implies that $q_s = S$ and $q_a = 0$ if $(P_{-1} P_{-2} . P_{-3}) = (00.0)$ or $(11.1)$, where $q_a$ is expressed by the following logic function:
Fig. 5b shows the logic implementation of eqn. 9.
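The rule stated for Block QH can be compared against a threshold test on the shifted remainder. The Python sketch below is a plausibility check rather than eqn. 9. It assumes the two-bit encoding of $q^{\#}$ given with the multiplexer circuitry (1, 0, -1, -2 as (0,1), (0,0), (1,0), (1,1)) and a shifted remainder of magnitude below 8/3. The names `qh` and `by_threshold` are hypothetical.

```python
DECODE = {(0, 1): 1, (0, 0): 0, (1, 0): -1, (1, 1): -2}

def qh(k):
    """k / 2 is the shifted remainder truncated to the bits (S P-1 P-2 . P-3)."""
    bits = k & 0xF
    S = (bits >> 3) & 1
    low = bits & 0b111                     # the three bits P-1 P-2 . P-3
    q_s = S
    q_a = 0 if low in (0b000, 0b111) else 1
    return q_s, q_a

def by_threshold(p):
    """Reference estimate from the constants 1/2, 0 and -1/2."""
    if p >= 0.5:
        return 1
    if p >= 0:
        return 0
    if p >= -0.5:
        return -1
    return -2

# Truncated values from -3 to 2.5 cover every remainder of magnitude below 8/3.
table = [(k / 2, DECODE[qh(k)]) for k in range(-6, 6)]
print(table)
```

Only three remainder bits and the sign are inspected, which is what keeps this block small.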
Block QC:
Block QC sums up the estimated quotient digit $q^{\#}$ and the correction value $q^{*}$ to produce the actual $i$th quotient digit $q_i$, i.e. $q_i = q^{\#} + q^{*}$, where $q_i \in \{-2, -1, 0, 1, 2\}$. Let $q = (S_q q_1 q_0)$ be in sign-magnitude form, where $S_q$ is the sign bit and $(q_1 q_0)$ is the binary representation of the absolute value of $q$. Fig. 5c shows the logic implementation of eqn. 10.
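A possible gate-level form of this addition is sketched below in Python. The Boolean expressions are derived here from the truth table of $q^{\#} + q^{*}$ and are not claimed to be eqn. 10. They assume the encoding of $q^{\#}$ as $(q_s, q_a)$ with 1, 0, -1, -2 mapped to (0,1), (0,0), (1,0), (1,1), and they keep the sign bit for the "-0" case mentioned in the worked example.

```python
DECODE = {(0, 1): 1, (0, 0): 0, (1, 0): -1, (1, 1): -2}

def qc(q_s, q_a, q_star):
    """Sign-magnitude digit (S_q, q1, q0) for q = q# + q* (derived, hypothetical logic)."""
    s_q = q_s                      # the sign follows the estimate, so -1 + 1 gives "-0"
    q1 = q_a & (q_s ^ q_star)      # magnitude 2: (+1) + 1 or (-2) + 0
    q0 = q_s ^ q_a ^ q_star        # magnitude 1
    return s_q, q1, q0

def value(s_q, q1, q0):
    """Integer value of a sign-magnitude digit."""
    mag = 2 * q1 + q0
    return -mag if s_q else mag

for (q_s, q_a), q_est in DECODE.items():
    for q_star in (0, 1):
        print(q_est, q_star, qc(q_s, q_a, q_star))
```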
Adders – CSA and CLA:
Two CSAs (carry-save adders) are used to compute the remainders $P^0$ and $P^1$. The CSA with shifting is realised by FAs and latches, as shown in Fig. 5d. The sum and carry outputs of the first seven FAs are fed to a 7-bit CLA.
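The carry-save principle can be illustrated in a few lines of Python. This is a generic sketch of a carry-save adder built from a row of full adders, not a model of the circuit in Fig. 5d. The function name `csa` and the 8-bit width are illustrative choices.

```python
def csa(a, b, c, width):
    """Carry-save addition: reduce three operands to a sum word and a carry word
    using one full adder per bit, with no carry propagation between bits."""
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                               # per-bit sum outputs
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask  # per-bit carries, moved up one place
    return s, carry

width = 8
s, carry = csa(0b00110101, 0b11100110, 0b00000001, width)
# A short carry-propagate adder is needed only where a resolved value is required.
resolved = (s + carry) & ((1 << width) - 1)
print(bin(s), bin(carry), resolved)
```

Resolving only the leading bits of the sum and carry words is the role played by the 7-bit CLA.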
Multiplexer circuitry:
The multiplexer circuitry (MUXC), as shown in Fig. 6, generates $-D = ((S_d)' e_0 . e_{-1} e_{-2} \cdots e_{-n})$ and $-2D = ((S_d)' e_{-1} e_{-2} \cdots e_{-n} 0)$, where $e_i = d_i'$ and $e_{-n} = d_{-n}' + 1$. Note that the $(-n)$th bit of $(-2D)$, namely $e_{-(n+1)}$, is a 0. The two's complementation is realised by using a simple one's complementation and assigning a 1 as the initial carry of the CSA.
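The negation scheme can be sketched as follows in Python, using plain integers as fixed-width words. The helper names `neg_d` and `neg_2d` and the 8-bit width are hypothetical; the sketch only illustrates that one's complementation plus an initial carry of 1 yields the two's complement, and that $-2D$ is the same pattern moved up by one bit position.

```python
def neg_d(d, width):
    """-D as one's complement (e_i = d_i') plus an initial carry of 1."""
    mask = (1 << width) - 1
    ones = ~d & mask
    return (ones + 1) & mask

def neg_2d(d, width):
    """-2D: the complemented bits with the carry added, shifted left; the new lowest bit is 0."""
    mask = (1 << width) - 1
    ones = ~d & mask
    return ((ones + 1) << 1) & mask

d = 0b00100100                      # an arbitrary 8-bit pattern standing in for D
print(bin(neg_d(d, 8)), bin(neg_2d(d, 8)))
```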
The estimated quotient digit, $q^{\#} \in \{1, 0, -1, -2\}$, is represented by two bits, i.e. $q^{\#} = (q_s, q_a)$, and its values, 1, 0, -1 and -2, are encoded as (0, 1), (0, 0), (1, 0) and (1, 1), respectively, as shown in Table 4. Fig. 6a illustrates the block diagram that implements the MUXC. Each bit slice of the MUXC is realised by six 2-to-1 multiplexers (MUXs), as shown in Fig. 6b, where the first level of MUXs is selected by the signal $q_a$, while the second level is determined by $q_s$. The output of the $i$th bit slice in MUXC is fed to the CSA, as shown in Fig. 7a. To generate the initial carry of the CSA, the inputs to the $(-n)$th bit of MUXC are modified by assigning a 0 to $d_{i+1} = d_{-(n+1)}$ and a 1 to $e_{i+1} = e_{-(n+1)}$. The inputs to the last bit of the CSA are modified as follows: the initial carry of the CSA is fed to $y_{-n}$ of the $(-n)$th FA, where $y_{-n} = 0$ (1) if the CSA is used as an adder (subtracter). Therefore, for $-q^{\#}D$, $y_{-n} = 0$ for 0, $D$ and $2D$, and $y_{-n} = 1$ for $-D$, as shown in Table 4. On the other hand, for $-(q$ …
Table 4: Function of MUXC and bit assignment for the $(-n)$th bit of the CSA
VLSI implementation – speed and area:
Fig. 8 shows the physical layout of an appropriate 56-bit floating-point divider (fraction part). The layout is generated by the MAGIC layout editor, where 2 µm SCMOS technology is employed. The layout includes a 57-bit CSA, two 7-bit CLAs with 1-level, a 57-bit MUXC with 4-to-1 MUXs, and Blocks QS, QC and QH. Three different types of MUX circuits are used. Type-1, MUX$_1$, includes the one receiving its inputs from the CLAs and the one sending its outputs to QH; type-2, MUX$_2$, includes the one receiving its inputs from the CSAs and the one sending its outputs to the CSAs; and type-3, MUX$_3$, is the MUXC. The propagation delay time of each unit has been simulated by PSpice, where the circuit parameters are extracted from the layout. Table 5 lists the size and propagation delay of each unit. According to the layout in Fig. 8 with the routing areas, for 56-bit radix-4 SRT division, the total area is approximately 3.7 × 5.3 mm$^2$, and its delay time is 13.9 ns per cycle. The division time can be improved by the alternative architecture shown in Fig. 9, where an extra MUXC is used. As indicated by the bold lines, Block QS is no longer in the critical path. Thus, the critical path includes MUX$_3$, MUX$_2$, CSA, CLA, MUX$_1$ and QH. Simulation results show that the circuit has a delay of 10.4 ns per cycle, or 291.2 ns for the 56-bit division. However, the speed improvement is achieved at the cost of increased area, which is nearly 4 × 5.3 mm$^2$, as shown in Fig. 10.
Since block QH is on the critical path, improving its speed will also make the division process faster. Eqn. 9 shows that the output $q_a$ is a function of $P_{-1}$, $P_{-2}$ and $P_{-3}$. Therefore, the output $q_a$, as a function of three variables, can be realised by a 4-to-1 MUX which takes only 0.43 ns. As a result, the total delay in the critical path is 8.83 ns. Therefore, the proposed SRT algorithm can be achieved with a delay of 247.24 ns for 56-bit division, where 2 µm CMOS technology is employed. Table 6 shows the performance comparison given in [10], where 1.2 µm CMOS technology is employed for all cases. Comparing the proposed radix-4 division, at 247.24 ns with 2 µm CMOS technology, to the performance of the various designs shown above, the proposed circuit performance is promising. The proposed circuit can be further improved by using 1.2 µm or better technology.
Fig. 10 Physical layout of improved version
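The quoted division times follow from the per-cycle delays if a 56-bit radix-4 division is taken to need 28 cycles, i.e. two quotient bits per cycle. The cycle count is an inference from the quoted figures, not a number stated in the text:

```latex
\begin{align*}
N_{\text{cycles}} &= 56 / 2 = 28 \\
T_{\text{Fig. 9}} &= 28 \times 10.4\ \text{ns} = 291.2\ \text{ns} \\
T_{\text{4-to-1 MUX in QH}} &= 28 \times 8.83\ \text{ns} = 247.24\ \text{ns}
\end{align*}
```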
Conclusion
This paper presents a simple yet fast radix-4 divider design and its VLSI implementation. The proposed division method first speculates a quotient digit. The speculated digit is used to compute the two possible partial remainders for the next step, in parallel with the quotient-digit correction process. The algorithm can be implemented with a delay of only 247.24 ns for double-precision division. Although this paper addresses only radix-4 division, the same design concept is readily extended to higher radices, reducing the quotient-digit selection table size and making the use of high radices possible and practical [9]. Also, similarly to the discussion in [10], the proposed architecture can be implemented for square-root calculation. It should be mentioned that the primary focus of the developed algorithm and hardware implementation was placed on optimising execution time. It is worthwhile to further investigate and develop an alternative design and implementation which is jointly optimised in terms of execution time, chip area and/or power dissipation.
References
Appendix: Error analysis
This Appendix justifies that a 7-bit CLA is sufficient for this implementation. Let $C_j$ and $S_j$, $j = s, 1, 0, -1, \ldots, -n$, denote the $j$th carry and sum bits of the CSA for the remainder $P^0$ at the $i$th cycle, respectively. Generating the complete $(n + 3)$-bit binary value for $P^0 = (S P_1 P_0 . P_{-1} P_{-2} \cdots P_{-n})$ requires an $(n + 3)$-bit CLA to sum the carry and sum bits as follows:
To avoid long carry propagation, this implementation rounds the value of $P^0$ to the bit $P_{-t}$, i.e.
where $C^{*}_{-(t+1)} = C_{-(t+1)} + C_{in}$. The rounding scheme defines the value of $C_{in}$, where $C_{in} = 0$ if $C_{-(t+2)} = S_{-(t+1)} = 0$, and $C_{in} = 1$ otherwise. The signal $C_{in}$ is obtained by ORing $S_{-(t+1)}$ and $C_{-(t+2)}$. Note that $P_1$ is not connected to block QH nor block QS.
Let P … If the real remainder lies in the overlapping regions $P_{0,1}$, $P_{-1,0}$ or $P_{-2,-1}$, then we should be able to generate $q^{*} = 1$ and $q^{*} = 0$ for the truncated and rounded remainder, respectively. This results in both having the same quotient digit, i.e. $q = 1$. Let $e_{max}$ be the maximum error which causes the truncated and rounded values to be located in different $q^{\#}$-regions while they are still in the same overlapping region. For the overlapping region $P_{0,1}$, we have $e_{max} = P^{0}_{ro} - P$ …. This confirms that the first seven significant digits, $(S P_1 P_0 . P_{-1} P_{-2} P_{-3} P_{-4})$, are sufficient in this implementation.
