This paper presents a new division algorithm, which requires two multiplication operations and a single lookup in a small table. The division algorithm takes two steps. The table lookup and the first multiplication are processed concurrently in the first step, and the second multiplication is executed in the next step. This divider uses a single multiplier and a lookup table with $2^m(2m + 1)$ bits to produce $2m$-bit results that are guaranteed correct to one ulp. By using a multiplier and a 12.5KB lookup table, the basic algorithm generates a 24-bit result in two cycles.
Introduction
Division is an important operation in many areas of computing, such as signal processing, computer graphics, networking, and numerical and scientific applications. In general, division algorithms may be divided into five categories: digit recurrence, functional iteration, high radix, table lookup, and variable latency. These algorithms differ in overall latency and area requirements. An overview of division algorithms can be found in [4].
This paper introduces a new high radix division algorithm based on the well-known Taylor series expansion. A number of high radix division algorithms based on the Taylor series have been proposed in the past. For example, Farmwald [2] proposed using multiple tables to look up the first few terms in the Taylor series. Later, Wong [5] proposed an elaborate iterative quotient approximation with multiple lookup tables. Wong demonstrated that only the first two terms in the Taylor series are necessary to achieve fast division because of the time required to evaluate the power terms.
The previous algorithms consider each individual term in the Taylor series separately; hence, many lookup tables are needed and the designs are complicated. Our proposed algorithm combines the first two terms of the Taylor series, and only requires a small lookup table to generate accurate results. This algorithm achieves fast division by multiplying the dividend in the first step, which is done in parallel with the table lookup. In the second step, another multiplication operation is executed to generate the quotient.
Basic Algorithm
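Let $X$ and $Y$ be two $2m$-bit fixed point numbers between one and two defined by Equations 1 and 2, where $x_i, y_i \in \{0, 1\}$.
$$X = 1 + 2^{-1}x_1 + 2^{-2}x_2 + \cdots + 2^{-(2m-1)}x_{2m-1} \tag{1}$$
$$Y = 1 + 2^{-1}y_1 + 2^{-2}y_2 + \cdots + 2^{-(2m-1)}y_{2m-1} \tag{2}$$
$Y$ is decomposed into a higher order part $Y_h$ and a lower order part $Y_l$, defined in Equations 3 and 4.
$$Y_h = 1 + 2^{-1}y_1 + \cdots + 2^{-m}y_m \tag{3}$$
$$Y_l = 2^{-(m+1)}y_{m+1} + \cdots + 2^{-(2m-1)}y_{2m-1} \tag{4}$$
The range of $Y_h$ is between 1 and $Y_{hmax}$ $(= 2 - 2^{-m})$, and the range of $Y_l$ is between 0 and $Y_{lmax}$ $(= 2^{-m} - 2^{-(2m-1)})$. Dividing $X$ by $Y$, we get Equations 5 and 6.
$$\frac{X}{Y} = \frac{X}{Y_h + Y_l} = \frac{X(Y_h - Y_l)}{Y_h^2 - Y_l^2} \tag{5}$$
$$\frac{X}{Y} \approx \frac{X(Y_h - Y_l)}{Y_h^2} \tag{6}$$
Since $Y_h > 2^m \cdot Y_l$, the maximum fractional error in Equation 6 is less than $2^{-2m}$ (or 1/2 ulp).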
Using the Taylor series, Equation 5 can be expanded at $Y_l/Y_h$ as in Equation 7. The approximation in Equation 6 is equivalent to combining the first two terms in the Taylor series.
$$\frac{X}{Y} = \frac{X}{Y_h}\left(1 - \frac{Y_l}{Y_h} + \frac{Y_l^2}{Y_h^2} - \cdots\right) \tag{7}$$
Figure 1 shows the block diagram of the algorithm. In the first step, the algorithm retrieves the value of $1/Y_h^2$ from a lookup table and multiplies $X$ with $(Y_h - Y_l)$ at the same time. In the second step, $1/Y_h^2$ and $X \cdot (Y_h - Y_l)$ are multiplied together to generate the result.
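The two steps can be sketched in a few lines of Python using exact rational arithmetic. This is a minimal illustration rather than the hardware datapath: the function names (`split_divisor`, `build_table`, `divide`) are hypothetical, the table holds exact values of $1/Y_h^2$ instead of truncated entries, and neither product is rounded. Operands are passed as integers equal to the fixed point value times $2^{2m-1}$.

```python
from fractions import Fraction

def split_divisor(y_int: int, m: int):
    """Split a 2m-bit divisor Y in [1, 2), given as the integer Y * 2**(2m-1),
    into its high part Yh (leading m+1 bits) and low part Yl (trailing m-1 bits)."""
    low_bits = m - 1
    yh = Fraction(y_int >> low_bits, 1 << m)
    yl = Fraction(y_int & ((1 << low_bits) - 1), 1 << (2 * m - 1))
    return yh, yl

def build_table(m: int):
    """Exact 1/Yh**2 for every Yh, indexed by the m fraction bits y1..ym."""
    return [1 / Fraction((1 << m) + i, 1 << m) ** 2 for i in range(1 << m)]

def divide(x_int: int, y_int: int, m: int, table):
    """Approximate X/Y as X * (Yh - Yl) * (1/Yh**2)."""
    x = Fraction(x_int, 1 << (2 * m - 1))
    yh, yl = split_divisor(y_int, m)
    index = (y_int >> (m - 1)) - (1 << m)   # fraction bits of Yh
    # Step 1: table lookup and first multiplication (concurrent in hardware).
    recip_sq = table[index]
    partial = x * (yh - yl)
    # Step 2: second multiplication.
    return partial * recip_sq

m = 3
table = build_table(m)
x_int, y_int = 0b101101, 0b110011           # X = 45/32, Y = 51/32
q = divide(x_int, y_int, m, table)
exact = Fraction(x_int, y_int)
rel_err = (exact - q) / exact               # equals (Yl/Yh)**2 = 1/256 for this Y
```

Without rounding, the relative error depends only on the divisor and equals $(Y_l/Y_h)^2$, which stays below $2^{-2m}$.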
Lookup Table Construction
To minimize the size of the lookup table, the table entries are stored in normalized form, with the exponent handled separately. A lookup table with $m = 3$ is shown in Table 1. Each table entry represents the truncated value of $1/Y_h^2$ to $2m + 2$ significant bits. The exponent part of $1/Y_h^2$ may be stored in the same table, but can also be determined by some simple logic gates. In this example, the exponent is 1.00 when $y_1 = y_2 = y_3 = 0$, the exponent is 0.10 when $y_1 = 0$ and $y_2 \lor y_3 = 1$, and the exponent is 0.01 when $y_1 = 1$.
Booth Encoding
The Booth encoding algorithm [1] has been widely used to minimize the number of partial product terms in a multiplier. In our division algorithm, special Booth encoders are needed to achieve the $X(Y_h - Y_l)$ multiplication without explicitly calculating the value of $(Y_h - Y_l)$. Lyu and Matula [3] proposed a general redundant binary Booth recoding scheme. In our case, the $Y_h$ and $Y_l$ bits are non-overlapping, and a cheaper and faster encoding scheme is feasible.
We use Booth 2 encoding to illustrate our encoding algorithm, but the same principle can apply to the other Booth encoding schemes. In Booth 2 encoding, the multiplier is partitioned into overlapping strings of 3 bits, and each string is used to select a single partial product. Unlike conventional Booth 2 encoding, the encoding of $(Y_h - Y_l)$ consists of four types of encoders. Figure 2 shows the locations of these four types of encoders: the $Y_l$ group contains all the 3-bit strings that reside entirely within $Y_l$; the boundary string contains some $Y_l$ bits as well as some $Y_h$ bits; the first $Y_h$ string is located next to the boundary string; the $Y_h$ group contains all the remaining strings within $Y_h$.
The $Y_h$ bits represent positive numbers, whereas the $Y_l$ bits represent negative numbers. Hence, conventional Booth 2 encoding is used in the $Y_h$ group, but the partial products in the $Y_l$ group are negated. As shown in the diagram, the boundary region between $Y_h$ and $Y_l$ requires two additional special encoders. Depending on whether $m$ is even or odd, the encoding schemes for these two encoders are different. It is possible to use only one such encoder in the boundary region, but that encoder would then need to generate the $-3X$ multiple of the multiplicand (for even $m$). In order to speed up the multiplication and simplify the encoding logic, two special encoders are used to avoid the "difficult" multiples. Table 2 summarizes the four different encoding schemes for both even and odd $m$. It is important to note that the first $Y_h$ encoder actually needs to examine both the first $Y_h$ string and the boundary string when $m$ is odd. If the boundary string is 101, the LSB of the first $Y_h$ string is set to 0 instead of 1. If the boundary string is not 101, the LSB of the first $Y_h$ string is set to be the MSB of the boundary string (as usual). In this encoding scheme, all but two of the encoders are normal Booth encoders, which is particularly useful if the same multiplier hardware is used for both the first and the second multiplications.
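The idea of multiplying by $(Y_h - Y_l)$ without forming the difference can be illustrated in software. The sketch below applies ordinary radix-4 (Booth 2) recoding to the $Y_h$ bits and to the $Y_l$ bits separately and negates the partial products that come from $Y_l$. It makes no attempt to reproduce the two boundary encoders of Table 2; the two recodings are simply kept apart. The function names are hypothetical, and operands are plain integers, with $Y$ scaled by $2^{2m-1}$.

```python
def booth2_digits(n: int, nbits: int):
    """Radix-4 Booth recoding of an unsigned nbits-wide integer.
    Returns digits d_k in {-2, ..., 2} with sum(d_k * 4**k) == n."""
    digits, prev = [], 0                # prev is the bit just below the current pair
    for k in range((nbits + 2) // 2):   # enough pairs to reach a zero top bit
        b0 = (n >> (2 * k)) & 1
        b1 = (n >> (2 * k + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits

def multiply_by_difference(x: int, y_int: int, m: int) -> int:
    """Return x * (Yh - Yl) * 2**(2m-1), built from Booth digits of Yh and Yl."""
    yh_int = y_int >> (m - 1)                # m+1 leading bits of Y
    yl_int = y_int & ((1 << (m - 1)) - 1)    # m-1 trailing bits of Y
    total = 0
    for k, d in enumerate(booth2_digits(yh_int, m + 1)):
        total += (d * x) << (2 * k + m - 1)  # Yh group: conventional partial products
    for k, d in enumerate(booth2_digits(yl_int, m - 1)):
        total -= (d * x) << (2 * k)          # Yl group: partial products negated
    return total
```

Every digit selects one of the multiples $0, \pm X, \pm 2X$, so no $\pm 3X$ multiple is ever needed in this simplified form either.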
Error Analysis
There are four sources of errors: Taylor series approximation error ($E_S$), lookup table rounding error ($E_T$), the rounding error of the first multiplication ($E_{M1}$), and the rounding error of the second multiplication ($E_{M2}$).
The total error is equal to $E_S + E_T + E_{M1} + E_{M2}$.
To minimize this error, the divider can be designed such that $E_S \le 0$, $E_T \le 0$, $E_{M1} \le 0$, and $E_{M2} \ge 0$, so that the positive rounding error of the second multiplication offsets the three negative terms. This means that the table entries are truncated to $2m + 2$ bits, the first multiplication is truncated to $2m + 2$ bits, and the second multiplication is rounded up to $2m$ bits.
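The effect of these rounding choices can be checked exhaustively for small $m$. The sketch below is an illustration under stated assumptions: "truncated to $2m + 2$ bits" is read as truncation to $2m + 2$ significant bits, the helper names `round_sig` and `divide_rounded` are hypothetical, and the quantity checked is the relative error of the result, which should stay below $2^{-(2m-1)}$ (one ulp of a $2m$-bit value near 1).

```python
from fractions import Fraction
from math import ceil, floor

def round_sig(v: Fraction, bits: int, up: bool = False) -> Fraction:
    """Round a positive value to `bits` significant bits: toward zero by
    default, toward +infinity when `up` is True."""
    e = v.numerator.bit_length() - v.denominator.bit_length()
    if Fraction(2) ** e > v:
        e -= 1                                  # now 2**e <= v < 2**(e+1)
    scale = Fraction(2) ** (bits - 1 - e)
    n = ceil(v * scale) if up else floor(v * scale)
    return n / scale

def divide_rounded(x_int: int, y_int: int, m: int) -> Fraction:
    """Two-step division with the rounding modes described above."""
    unit = 1 << (2 * m - 1)
    x = Fraction(x_int, unit)
    yh = Fraction(y_int >> (m - 1), 1 << m)
    yl = Fraction(y_int, unit) - yh
    t = round_sig(1 / yh ** 2, 2 * m + 2)       # table entry, truncated
    p = round_sig(x * (yh - yl), 2 * m + 2)     # first product, truncated
    return round_sig(p * t, 2 * m, up=True)     # second product, rounded up

def worst_relative_error(m: int) -> Fraction:
    """Largest |Q/(X/Y) - 1| over all 2m-bit operands in [1, 2)."""
    lo, hi = 1 << (2 * m - 1), 1 << (2 * m)
    return max(abs(divide_rounded(x, y, m) / Fraction(x, y) - 1)
               for x in range(lo, hi) for y in range(lo, hi))
```

The exhaustive loop is only practical for small $m$, since it visits $2^{4m-2}$ operand pairs.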
Optimization Techniques
This section describes two optimization techniques for the division algorithm. The first technique uses a slightly different lookup table and allows the two multiplications to use the same rounding mode, whereas the second technique uses an error compensation term to further reduce the Taylor series approximation error.
Alternative Lookup Table
As described in Section 2.3, the rounding modes of the first and the second multiplications are different. This may be undesirable if the two multiplications need to share the same multiplier. A simple solution is to use round-to-nearest mode in the two multiplications as well as in constructing the lookup table. Since each error term can then be either positive or negative, the errors no longer offset one another, and the maximum total error becomes the sum of the maximum magnitudes of the individual error terms.
Let $T(Y_h)$ be the table entry at $Y_h$ with infinite precision. The expression for the approximation error $E_S$ is shown in Equation 9.
In order to minimize $|E_S|$, $T(Y_h)$ is set to be slightly larger than $1/Y_h^2$. For each $Y_h$, the optimum table entry is determined by setting the maximum positive error (at $Y_l = 0$) to be the same as the maximum negative error (at $Y_l = Y_{lmax}$). Equation 10 shows the expression for the optimum table entry.
The approximation error is at its maximum when $Y_h = 1$ and $Y_l = Y_{lmax}$. Using Equation 9, the maximum approximation error can easily be derived as in Equation 11. In this case, $|E_S|$ is slightly less than 1/4 ulp.
Using round-to-nearest rounding mode, $|E_{M1}| < 1/8$ ulp, $|E_{M2}| < 1/2$ ulp, and $|E_T| < 1/8$ ulp. As in Section 2.3, the total error of the alternative lookup table is also less than 1 ulp. Each entry of the alternative table represents the round-to-nearest value of $T(Y_h)$ to $2m + 2$ significant bits.
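The balancing argument can be explored numerically. The sketch below assumes the approximation error is measured as a fractional error, $E_S = (Y_h^2 - Y_l^2)\,T(Y_h) - 1$; this is an assumption about the form of Equation 9, not a transcription of it. Under that assumption, equating the error at $Y_l = 0$ with the negated error at $Y_l = Y_{lmax}$ gives $T(Y_h) = 2/(2Y_h^2 - Y_{lmax}^2)$. All function names are hypothetical, and 1/4 ulp is taken as $2^{-(2m+1)}$.

```python
from fractions import Fraction

def y_l_max(m: int) -> Fraction:
    """Largest value of Yl: 2**-m - 2**-(2m-1)."""
    return Fraction(1, 1 << m) - Fraction(1, 1 << (2 * m - 1))

def optimum_entry(yh: Fraction, m: int) -> Fraction:
    """Table entry that balances the errors at Yl = 0 and Yl = Ylmax
    (derived under the fractional-error assumption stated above)."""
    return 2 / (2 * yh ** 2 - y_l_max(m) ** 2)

def fractional_error(yh: Fraction, yl: Fraction, t: Fraction) -> Fraction:
    """(approximate quotient) / (exact quotient) - 1 for table entry t."""
    return (yh ** 2 - yl ** 2) * t - 1

def max_abs_error(m: int) -> Fraction:
    """Worst |fractional error| over all Yh; the error is monotonic in Yl,
    so only the two endpoints of the Yl range need to be examined."""
    worst = Fraction(0)
    for i in range(1 << m):
        yh = Fraction((1 << m) + i, 1 << m)
        t = optimum_entry(yh, m)
        for yl in (Fraction(0), y_l_max(m)):
            worst = max(worst, abs(fractional_error(yh, yl, t)))
    return worst
```

For single precision ($m = 12$), `max_abs_error(12)` comes out just under $2^{-25}$, and the worst case occurs at $Y_h = 1$, in line with the text.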
Error Compensation
The Taylor series approximation error can be further reduced by adding an error compensation term in the first multiplication. Equation 8 shows that the magnitude of the Taylor series approximation error ($E_S$) increases when either $Y_l$ gets larger or $Y_h$ gets smaller. By looking at the first few bits of $Y_l$ and $Y_h$, it is possible to identify large $Y_l$ and small $Y_h$, and then compensate for the approximation error. However, it is important to ensure that the approximation error is not overcompensated and becomes positive; otherwise this would increase the total error (Section 2.3).
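The dependence on $Y_l$ and $Y_h$ can be seen from a direct calculation of the difference between the approximation $X(Y_h - Y_l)/Y_h^2$ and the exact quotient. This is a derivation from the definitions and is not a transcription of Equation 8, whose exact form may differ.

```latex
\begin{aligned}
E_S &= \frac{X (Y_h - Y_l)}{Y_h^2} - \frac{X}{Y_h + Y_l}
     = \frac{X\left[(Y_h - Y_l)(Y_h + Y_l) - Y_h^2\right]}{Y_h^2 (Y_h + Y_l)}
     = -\frac{X\,Y_l^2}{Y_h^2 (Y_h + Y_l)} \le 0, \\[4pt]
\frac{E_S}{X/Y} &= -\frac{Y_l^2}{Y_h^2}.
\end{aligned}
```

Both the absolute and the fractional form grow in magnitude with $Y_l$ and shrink as $Y_h$ grows, and both are never positive, which is why overcompensation has to be avoided.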
Discussion
This paper presents a simple and fast division algorithm based on Taylor series expansion. Using a multiplier and a lookup table with $2^m(2m + 1)$ bits, this algorithm produces a $2m$-bit result in two steps. For example, a 12.5KB lookup table is required for single precision (24 bits) floating point division.
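The table size quoted for single precision follows directly from the formula, as the short calculation below shows. The function name is hypothetical, and 1 KB is taken as 1024 bytes.

```python
def table_size_bytes(m: int) -> float:
    """Size of a lookup table with 2**m entries of (2m + 1) bits each."""
    entries = 2 ** m
    bits_per_entry = 2 * m + 1
    return entries * bits_per_entry / 8

# Single precision: 2m = 24, so m = 12 -> 4096 entries of 25 bits.
size_kb = table_size_bytes(12) / 1024   # 12.5
```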
The same principle can be applied to some elementary functions, such as square root. Using the same definitions of $Y$, $Y_h$, and $Y_l$ as before, we get the following approximation. This is very similar to the approximation used in the division algorithm. The differences are that the $Y_l$ term is shifted by one bit in the numerator and the lookup table contains a different power of $Y_h$ instead of $1/Y_h^2$ entries.
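As an illustration of how such an approximation behaves, the sketch below uses the first-order Taylor expansion $\sqrt{Y} \approx (Y_h + Y_l/2)/\sqrt{Y_h}$. This particular form is an assumption derived from the expansion of $\sqrt{Y_h + Y_l}$ and may not match the formula intended in the text; the function name is hypothetical, and floating point arithmetic stands in for the table and multiplier.

```python
import math

def sqrt_approx(y_int: int, m: int) -> float:
    """Approximate sqrt(Y) for a 2m-bit Y in [1, 2), given as Y * 2**(2m-1),
    using one 1/sqrt(Yh) value (a table entry in hardware) and one product."""
    yh = (y_int >> (m - 1)) / (1 << m)        # leading m+1 bits
    yl = y_int / (1 << (2 * m - 1)) - yh      # trailing m-1 bits
    recip_sqrt = 1.0 / math.sqrt(yh)          # would come from the lookup table
    return (yh + yl / 2) * recip_sqrt

m = 4
y_int = 0b10000111                             # Yh = 1, Yl = 7/128
rel_err = sqrt_approx(y_int, m) / math.sqrt(y_int / (1 << (2 * m - 1))) - 1
```

With this form the relative error is bounded by $(Y_l/Y_h)^2/8$, which is below $2^{-(2m+3)}$.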
We can also combine more than the first two terms in the Taylor series expansion. For example, if we use the first four terms in the expansion, we get the following approximation.
As before, only a single $1/Y_h^2$ lookup table is needed, but this algorithm also needs to calculate $Y_l^2$. Our next step is to generalize the existing algorithm and to investigate the optimum Taylor series approximation for different input precisions.
