Abstract-In new generations of microprocessors, the superscalar architecture is widely adopted to increase the number of instructions executed in one cycle. Among all instructions, division needs more cycles than the rest, e.g., addition and multiplication. This makes the division instruction an important contributor to the cycles-per-instruction figure of modern microprocessors. In this paper, a radix-16/8/4/2 divisor is proposed, which uses a variety of techniques, including operand scaling, table partitioning, and, particularly, table sharing, to increase performance without the cost of increasing complexity. A physical chip using the proposed method is implemented in 0.35-μm single-poly four-metal (1P4M) CMOS technology. The test measurements show that the chip can execute signed 64-b/32-b integer division in 3-13 cycles with an 80-MHz operating clock.
I. INTRODUCTION
Integer division is a critical operation in CPU design since the number of clock cycles needed to complete an integer division is typically long and unpredictable. The role of division is becoming more and more critical owing to the requirements of signed computer arithmetic, modulus computation, the calculation of encryption keys, and so on. Division algorithms can be roughly classified into two categories, namely, digit-recurrence methods [1] and functional iteration techniques [1], of which the former is more commonly used. Regarding the digit-recurrence method, there are traditionally two types of division schemes, i.e., restoring and nonrestoring schemes. However, they both require multiple operation steps to derive a quotient bit. Not only is the efficiency poor, but a long adder/subtracter is also needed to execute the remainder bit adjustment. In this paper, we employ a modified high-radix, i.e., radix-16/8/4/2, digit-recurrence division method and an on-the-fly conversion method to reduce the required cycles for the 64-b/32-b signed integer division, while keeping the hardware complexity under control.
II. HIGH-RADIX 64-b/32-b SIGNED INTEGER DIVISOR

A. Digit-Recurrence Theory
Let x, d, q, and rem denote the dividend, divider, quotient, and remainder of the division operation, respectively, and let r denote the radix of the division. The division is then defined as $x = q \cdot d + rem$.
In the digit-recurrence division algorithm [2], one quotient digit is obtained in every iteration. In other words, in a radix-$2^b$ digit-recurrence division, b bits of the quotient are obtained every iteration. In [3], the digit-recurrence algorithm is defined as
$w[j+1] = r \cdot w[j] - q_{j+1} \cdot d$    (1)
where $w[j+1]$ is the residual of the (j + 1)th iteration, r is the radix, and $q_{j+1}$ is the quotient digit generated in the (j + 1)th iteration. In a radix-r division, $r = 2^b$, the quotient digit set is defined as $q_j \in D_a = \{-a, \ldots, -1, 0, 1, \ldots, a\}$. Since $\|D_a\| > r$, more than r values are used to represent the quotient digits, which makes this quotient representation a redundant form. Besides, a is restricted to $a \ge \lceil r/2 \rceil$. In (1), a quotient digit is generated in every iteration. Hence, we can define the quotient-digit selection function as $q_{j+1} = \mathrm{SEL}(w[j], d)$, where the SEL() function can be simplified as a table lookup function.
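The recurrence can be illustrated with a minimal Python sketch. It assumes the standard form $w[j+1] = r \cdot w[j] - q_{j+1} \cdot d$ with $w[0] = x$, and it idealizes SEL() as exact rounding of $r \cdot w[j] / d$ clamped to the digit set, which is not the table-based selection developed later in this paper. The function names `digit_recurrence` and `quotient_value` are hypothetical.

```python
from fractions import Fraction

def digit_recurrence(x, d, r=16, a=15, steps=8):
    """Run `steps` iterations of w[j+1] = r*w[j] - q[j+1]*d with w[0] = x.

    Assumes d > 0 and |x| <= a/(r-1) * d so that the residual stays bounded.
    SEL is idealized as exact rounding of r*w/d, clamped to {-a, ..., a};
    a hardware divider would use a truncated table lookup instead.
    """
    w = Fraction(x)
    d = Fraction(d)
    digits = []
    for _ in range(steps):
        shifted = r * w
        q = max(-a, min(a, round(shifted / d)))  # idealized SEL(w[j], d)
        w = shifted - q * d                      # next residual
        digits.append(q)
    return digits, w

def quotient_value(digits, r=16):
    """Value of the signed-digit quotient: sum of q_j * r**(-j), j = 1..n."""
    return sum(Fraction(q, r ** (j + 1)) for j, q in enumerate(digits))

x, d = Fraction(3, 10), Fraction(7, 10)
digits, w = digit_recurrence(x, d)
# Invariant of the recurrence: x = d * Q + w[n] * r**(-n)
assert x == d * quotient_value(digits) + w / 16 ** 8
```

Because the digits are signed, the quotient must still be converted to conventional form, which is the role of the on-the-fly conversion mentioned in Section I.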
Although the digit-recurrence algorithm has been well documented in [2], many difficulties remain unsolved when it comes to realizing such a divisor in hardware, including the following.
1) A long adder is needed at the adjustment of the remainder.
2) Extra adjustment actions are required when the last cycle of the division contains nonmultiple digits of the radix. (For instance, the radix is 16, but there is only 1 b left in the dividend to be processed.)
3) The adjustment of the remainder is missing when the signed division is executed.
4) A data flow control unit is required, which provides correct timing control such that the results of the division can be correctly placed on the output ports.
5) The size of the quotient selection table will grow exponentially with the radix. Besides, it is likely that one radix needs one table. These two factors lead to a huge chip area consumption if the divisor is implemented on silicon.
In short, the above problems will occur during the realization of a long signed divisor. If these problems are not resolved efficiently, the hardware divisor will be large and slow.
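Difficulty 2) can be made concrete with a small Python sketch of how a mixed-radix divisor might retire the leftover bits. The greedy policy below (use radix 16 while at least 4 b remain, then finish with radix 8, 4, or 2) is an assumption made for illustration; the actual control policy of the proposed chip is not specified here, and `radix_schedule` is a hypothetical name.

```python
def radix_schedule(quotient_bits, max_bits_per_step=4):
    """Return the radix used in each iteration under a greedy policy.

    Assumption: each iteration retires up to `max_bits_per_step` bits,
    and a final shorter iteration uses a smaller radix (8, 4, or 2).
    """
    schedule = []
    remaining = quotient_bits
    while remaining > 0:
        bits = min(max_bits_per_step, remaining)
        schedule.append(2 ** bits)   # radix-2**bits retires `bits` bits
        remaining -= bits
    return schedule

# 33 quotient bits: eight radix-16 steps, then one radix-2 step for the last bit
assert radix_schedule(33) == [16] * 8 + [2]
```

With a fixed radix-16 datapath, the single leftover bit in this example would require an extra adjustment step, whereas a mixed-radix datapath can retire it directly.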
B. Mixed Radix-16/8/4/2 64-b/32-b Integer Divisor
In [4], a mixed radix-8/4/2 integer divisor was proposed, whose performance is better than that of a normal radix-4/2 integer divisor [5]. However, this came at the price of increased hardware complexity, and the total area of the divisor nearly doubled owing to the sizes of the tables. In this study, although the radix is raised to 16 to retire more bits of the quotient per cycle, the complexity of the hardware is kept at a similar level by using several methods, including operand prescaling, table partitioning [6], and table folding.
1) Operand Scaling:
In high-radix divisors, the cycle time is generally determined by the quotient-digit selection operation, which is basically a table lookup operation. The complexity of the quotient selection function increases exponentially if the radix increases linearly. Consequently, it results in a long table lookup time. Operand scaling is a better alternative to avoid the long table lookup time.
The maximal overlap between quotient digits appears when the divider is at its maximum. That is, the maximal amount of overlap occurs when a divider d is normalized and approaches one. This observation leads to the concept of divider prescaling. In the first step of the scaling method, the divider is prescaled by a factor M so that the scaled divider z satisfies $1 - \alpha \le z = M \cdot d \le 1 + \beta$, where $\alpha$ and $\beta$ are chosen such that the scaling factor $S_i$, $i \in \{0, \ldots, 6\}$, is identical in all divider intervals and the quotient-digit selection is independent of the divider. Besides, the value of M should be chosen to minimize $(\alpha + \beta)$, which produces the smallest achievable range of z. In order to preserve the value of the quotient, three alternative ways of performing the scaling were proposed [2]. The operand scaling process produces a scaled estimated residual ŷ, which is generated as shown in Fig. 1. In Fig. 1, the "estimated residual" is taken from the first seven bits of $w[j]$ in (1).
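The effect of prescaling can be sketched in Python as follows. The sketch assumes a normalized divider d in [1/2, 1) split into seven equal-width intervals, with one scaling factor per interval rounded to four fractional bits; these choices, and the names `scale_factors` and `scaled_range`, are illustrative assumptions and do not reproduce the scaling factors of the proposed design. The result shows how the scaled divider z = M·d is confined to a narrow range [1 - alpha, 1 + beta] around one.

```python
from fractions import Fraction

def scale_factors(intervals=7, frac_bits=4):
    """One scaling factor M per divider interval (assumed equal widths).

    M approximates 1/d at the interval midpoint, rounded to `frac_bits`
    fractional bits so that M*d could be formed by shifts and additions.
    """
    width = Fraction(1, 2) / intervals
    table = []
    for i in range(intervals):
        d_lo = Fraction(1, 2) + i * width
        d_hi = d_lo + width
        mid = (d_lo + d_hi) / 2
        m = Fraction(round(2 ** frac_bits / mid), 2 ** frac_bits)
        table.append((d_lo, d_hi, m))
    return table

def scaled_range(table):
    """Return (alpha, beta) such that 1 - alpha <= M*d <= 1 + beta."""
    z_min = min(m * d_lo for d_lo, _, m in table)
    z_max = max(m * d_hi for _, d_hi, m in table)
    return 1 - z_min, z_max - 1

alpha, beta = scaled_range(scale_factors())
# The scaled divider stays within 12.5% of one for this coarse choice of M.
assert alpha < Fraction(1, 8) and beta < Fraction(1, 8)
```

A finer choice of scaling factors narrows the range further, which is what allows the quotient-digit selection to become independent of the divider.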
2) Table-Sharing Algorithm: The quotient-digit selection function in the radix-8 division has been shown to be the bottleneck in each iteration [4]. Hence, the radix-16/8/4/2 integer division will be infeasible unless a simplified quotient-digit selection function is developed. A modified version of the table-partitioning algorithm [6], called "table sharing," is proposed to simplify the digit selection process. The table-partition method in [6] arranged the entries in a nonmergeable manner. We simply reorder the rows to place the negative entries of Lo. in the top half of the table and the positive ones in the bottom half. Hence, the modified table can be merged across different radixes. Moreover, merging the radix-2, radix-4, radix-8, and radix-16 quotient selection tables reduces the number of required tables, which, in turn, reduces the area of the chip.
3) Quotient Digit Decomposition and Table Sharing:
In the proposed scheme, the maximally redundant quotient digit set is chosen for radix-16 division and decomposed into four components as follows:
$q_{j+1} = q_{h,h} + q_{h,l} + q_{l,h} + q_{l,l}$; $q_{h,h} \in \{-8, 0, 8\}$; $q_{h,l} \in \{-4, 0, 4\}$; $q_{l,h} \in \{-2, 0, 2\}$; $q_{l,l} \in \{-1, 0, 1\}$    (2)
where $q_{j+1} \in \{-15, -14, \ldots, 0, \ldots, 14, 15\}$, and $q_{h,h}$, $q_{h,l}$, $q_{l,h}$, and $q_{l,l}$ are tabulated as shown in Table I. In Table I, the selection intervals for $q_{j+1}$ in radix-16, radix-8, radix-4, and radix-2 divisions are included and tabulated as indicated by the four braces, respectively. Besides, Lo. and Hi. in Table I denote the lower and upper bounds of the shifted estimated residual, respectively. Notably, the bounds of the scaled shifted residual ŷ are derived from $|w[j]| \le (8288/8192)$ and the corresponding 2-b truncation error. Namely, $\lfloor -r \cdot (8288/8192) - 2^{-2} \rfloor \le \hat{y} \le \lfloor r \cdot (8288/8192) \rfloor$.
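The decomposition in (2) can be illustrated with a short Python sketch. It assumes a sign-magnitude split in which all four components share the sign of the digit; a signed-digit decomposition is not unique (for example, 7 can also be written as 8 - 1), so this is one valid instance rather than the encoding of Table I. The name `decompose` is hypothetical.

```python
def decompose(q):
    """Split a radix-16 digit q in [-15, 15] into four weighted components.

    Assumption: sign-magnitude split, so every nonzero component carries
    the sign of q. Returns (q_hh, q_hl, q_lh, q_ll) with weights 8, 4, 2, 1.
    """
    if not -15 <= q <= 15:
        raise ValueError("digit outside the maximally redundant radix-16 set")
    sign = -1 if q < 0 else 1
    mag = abs(q)
    q_hh = sign * (mag & 8)   # in {-8, 0, 8}
    q_hl = sign * (mag & 4)   # in {-4, 0, 4}
    q_lh = sign * (mag & 2)   # in {-2, 0, 2}
    q_ll = sign * (mag & 1)   # in {-1, 0, 1}
    return q_hh, q_hl, q_lh, q_ll

# Every digit of the set is recovered as the sum of its components.
assert all(sum(decompose(q)) == q for q in range(-15, 16))
```

Since each component is zero or a signed power of two, the product of a component and the divider reduces to a shift and an optional negation.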
Nevertheless, since the quotient-digit table is shared by different radixes, the highest order digit will be incorrectly enabled if the value of ŷ is close to the bounds, as illustrated in the first and last rows and as indicated by the braces in Table I. Fortunately, this can be fixed easily later at the quotient-digit assimilation stage, where $q_{h,h}$, $q_{h,l}$, $q_{l,h}$, and $q_{l,l}$ recompose the quotient digits.
4) Table Folding: By inspecting Table I, the entries of the top half are identical to the opposite ones in the bottom half. This allows us to implement only the positive half. Accordingly, the proposed scheme needs only six bits, besides the common sign bit, as the input to the quotient-digit selection table, including five integer bits and one fractional bit of ŷ, in contrast to the 11 bits required in the radix-8 division presented in [4]. Thus, a total of seven bits, $\hat{y} = y_5 y_4 y_3 y_2 y_1 y_0 y_{-1}$, are used to derive the quotient-digit selections given in (3).
Notably, all four radixes can share the expressions given in (3) without any changes. The scenario requiring many quotient selection tables for different radixes appearing in the prior designs no longer exists.
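The idea behind table folding can be sketched in Python as follows. The estimate ŷ is represented in units of 1/2 (five integer bits and one fractional bit of magnitude plus a sign), and the selection rule used to fill the table (nearest integer, clamped to ±15) is a placeholder assumption, not the contents of Table I. The names `nearest_digit`, `FULL_TABLE`, `HALF_TABLE`, and `folded_select` are hypothetical.

```python
MAX_DIGIT = 15

def nearest_digit(y_half):
    """Placeholder selection rule: nearest integer to y_half / 2,
    ties away from zero, clamped to the digit set {-15, ..., 15}."""
    if y_half >= 0:
        q = (y_half + 1) // 2
    else:
        q = -((-y_half + 1) // 2)
    return max(-MAX_DIGIT, min(MAX_DIGIT, q))

# Unfolded table: one entry per signed estimate (127 entries).
FULL_TABLE = {y: nearest_digit(y) for y in range(-63, 64)}

# Folded table: magnitudes only (64 entries, addressed by six bits).
HALF_TABLE = [nearest_digit(m) for m in range(64)]

def folded_select(y_half):
    """Look up the magnitude in the positive half, then reapply the sign."""
    sign = -1 if y_half < 0 else 1
    return sign * HALF_TABLE[abs(y_half)]

# The folded lookup reproduces the full table for every signed estimate.
assert all(folded_select(y) == FULL_TABLE[y] for y in range(-63, 64))
```

The folded table is about half the size of the unfolded one, at the cost of handling the sign outside the table.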
C. Quotient Digit Assimilation Unit
The quotient digit assimilation unit performs the assimilation of the selections q Table II .
The 38-bit carry-save adder (CSA) in Fig. 2 right after the divider d is scaled in every operation cycle. Multipliers will then not be required in our high-radix division implementations.
Note that the expressions of $q_l$ and $q_h$ for different radixes are identical, except for the cases in which the higher order quotient digit is incorrectly set to one when the value of ŷ is close to the bounds, as illustrated in the boundary rows of each radix in Table I. The complete scheme for the mixed radix-16/8/4/2 64-b/32-b integer divisor is presented in Fig. 2.
III. IMPLEMENTATION AND MEASUREMENT
The chip is implemented in synthesizable Verilog register transfer level (RTL) code and synthesized with Synopsys. Taiwan Semiconductor Manufacturing Company (TSMC) 0.35-μm 1P4M CMOS technology is employed to carry out the design, while CADENCE standard delay format (SDF) simulation tools are used to execute both the pre- and post-layout simulations. The highest working clock of this radix-16/8/4/2 64-b/32-b divisor is 80 MHz. Table III compares mixed radix-4/2, mixed radix-8/4/2, Aoki's high-radix divider [8], and our design. Fig. 3 shows a die photograph of the physical chip fabricated by TSMC. The area of the die is 2187 × 2204 μm². Fig. 4 shows the measured results captured by the HP 1660CP logic analyzer. These measurements demonstrate that our proposed design is silicon proven.
IV. CONCLUSION
In this paper, we have proposed a novel scheme to improve the performance of integer division. The methods that we propose include operand scaling, table folding, and table sharing to realize the mixed radix-16/8/4/2 quotient selection tables. A physical chip is implemented to prove our method on silicon. The results verify that our design saves operating cycles at the cost of only a slight increase in gate count.
