In this paper, we propose an area- and speed-effective fixed-point pipelined divider that reduces the bit-width of the division unit to fit a mobile rendering processor. To decide the bit-width of the division unit, error analysis has been carried out in several ways. As a result, the proposed method reduces the bit-width of the division unit from the original 31 bits to 24 bits and reduces the area by 42%, with a maximum error of 0.00001%.
Introduction
In recent years, the design of high-performance hardware dividers has become increasingly important, driven by applications such as 3D graphics, multimedia, and digital signal processing [1]. In particular, 3D graphics processing has become an important application of processors and, as a result, high-performance dividers are required for high-performance 3D graphics processing [2]-[4].
For the rasterization step of the 3D graphics pipeline, a pipelined divider for high performance is required. In a desktop PC, a floating-point format is used for the high performance rendering processor. In [5] - [7] , a fixed-point format has been used for the low-cost rendering processor of a mobile device. A fixed-point pipelined divider consists of a left shifter for normalization of the dividend and the divisor, a division unit for division execution, and a right shifter for converting the result to a fixed-point format.
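The three stages named above (left shifter, division unit, right shifter) can be sketched in software as follows. This is only an illustrative model under stated assumptions: the function name `fixed_divide`, the 31-bit magnitude width, and the use of exact integer division as a stand-in for the division unit are not taken from the paper, and sign handling and overflow detection are omitted.

```python
WIDTH = 31  # assumed magnitude width: 32-bit word minus the sign bit


def fixed_divide(x_raw: int, y_raw: int, f: int) -> int:
    """Divide two non-negative fixed-point magnitudes that have f fraction bits."""
    if y_raw == 0:
        raise ZeroDivisionError("division by zero")
    if x_raw == 0:
        return 0
    # Left shifter: move the leading '1' of each operand to the MSB.
    sx = WIDTH - x_raw.bit_length()
    sy = WIDTH - y_raw.bit_length()
    xn, yn = x_raw << sx, y_raw << sy
    # Division unit (stand-in): quotient of the normalized operands,
    # kept with WIDTH - 1 fraction bits.
    qn = (xn << (WIDTH - 1)) // yn
    # Right shifter: undo the normalization and restore f fraction bits.
    shift = (WIDTH - 1) - f - sy + sx
    return qn >> shift if shift >= 0 else qn << -shift


# 3.5 / 0.25 with 16 fraction bits
quotient = fixed_divide(int(3.5 * 2**16), int(0.25 * 2**16), 16)
print(quotient / 2**16)  # 14.0
```

Whenever the final shift is a right shift, the low bits of `qn` are dropped, which is the property the bit-width reduction relies on.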
Many existing pipelined division units use a multiplicative algorithm based on the Taylor-series expansion. Hung et al. store the first two terms of the Taylor series in a lookup table (LUT), and execute a division by referencing the LUT in the first step and using a multiplier in the second step [8]. In [9], a cost-effective pipelined division algorithm has been proposed by modifying the Taylor series and decreasing the LUT size. In [10], the algorithm suggested in [9] has been expanded to handle double-precision floating-point numbers.
In this paper, we propose a fixed-point division algorithm that is effective in chip area and speed by reducing the bit-width of the division unit, while allowing a very small error range. The proposed algorithm exploits two features. First, if there are leading zeros in the dividend and the divisor, which are the inputs of the multipliers inside the division unit, then as many least significant bits (LSBs) of the left-shift normalization result equal 0 as there are leading zeros. Second, when the right shift that converts the result of the division unit to a fixed-point format is carried out, as many LSBs of the division unit result are discarded as the number of positions shifted. Because of these two features, reducing the bit-width of the division unit results in only a small error.
We performed analytic error analyses of division operation results according to bit-width to prove the proposed method. As a result, a fixed-point divider, using a pipelined divider with a 24-bit width, reduces the area by 42%, as compared to the 31-bit fixed-point pipelined divider proposed in [9] , while the maximum error is about 0.00001%. Also, the critical path has been decreased by 8%.
In the next section, we give a brief overview of the pipelined divider and its architectural features. In Sect. 3, we illustrate the new division method and its features. Error analysis, various simulation results, and performance evaluation are given in Sects. 4, 5, and 6. Conclusions are presented in Sect. 7.
Related Work
Hung [8] and Jeong [9] proposed pipelined division algorithms. These express division with Taylor-series expansions, and then calculate the upper two or four terms with an LUT and multipliers.
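First, the Hung algorithm expands division, as in Eq. (1).
$$\frac{X}{Y} = \frac{X}{Y_h + Y_l} = \frac{X}{Y_h}\left(1 - \frac{Y_l}{Y_h} + \frac{Y_l^2}{Y_h^2} - \cdots\right) \qquad (1)$$
In Eq. (1), $Y_h = 2^{0} y_0 + 2^{-1} y_1 + 2^{-2} y_2 + \cdots + 2^{-(p-1)} y_{p-1}$ and $Y_l = Y - Y_h$. If the upper two terms are used in Eq. (1), then Eq. (2) can be derived as follows:
$$\frac{X}{Y} \approx \frac{X (Y_h - Y_l)}{Y_h^2} \qquad (2)$$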
Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
Equation (2) requires an LUT to calculate $1/Y_h^2$, the multiplication of $X$ and $(Y_h - Y_l)$, and the multiplication of $X(Y_h - Y_l)$ and $1/Y_h^2$. The block diagram of this process is shown in Fig. 1. To conduct a division in single-precision format with this procedure, an LUT of approximately 13 KB is required.
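The two-term approximation described above, $X(Y_h - Y_l)/Y_h^2$, can be checked numerically with a short sketch. The function name `hung_divide`, the split point `p = 12`, and the sample operands are hypothetical; the reciprocal $1/Y_h^2$ is computed directly here, whereas the hardware would read it from the LUT.

```python
from fractions import Fraction


def hung_divide(x: Fraction, y: Fraction, p: int) -> Fraction:
    """Two-term Taylor approximation of x / y for a normalized 1 <= y < 2."""
    scale = 2 ** (p - 1)
    # Y_h keeps the leading p bits of Y (weights 2^0 down to 2^-(p-1)).
    y_h = Fraction(int(y * scale), scale)
    y_l = y - y_h
    # Stand-in for the LUT lookup of 1 / Y_h^2.
    inv_yh_sq = 1 / (y_h * y_h)
    # Two multiplications: X * (Y_h - Y_l), then the product with 1 / Y_h^2.
    return x * (y_h - y_l) * inv_yh_sq


x = Fraction(3, 2)
y = Fraction(5, 4) + Fraction(1, 2**20)
p = 12  # hypothetical number of bits kept in Y_h
approx = hung_divide(x, y, p)
exact = x / y
rel_err = abs(exact - approx) / exact
print(float(rel_err))
```

Algebraically the relative error of this approximation is $(Y_l / Y_h)^2$, so it stays below $2^{-2(p-1)}$ for a normalized divisor.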
Jeong proposed an algorithm that derives two coarse quotients ($Q$ and $\tilde{Q}$) using a $Y_h$ of reduced bit-width, and reduces the LUT size by adding these two terms [9]. This method conducts division with the procedure in Eq. (3) and reduces the area by 27%, as compared to the algorithm of [8]. Figure 2 shows the block diagram of [9].
Proposed Method
Whereas a conventional pipelined divider [9] (Fig. 2) takes normalized values as input data, we use fixed-point values as input data. Therefore, to use a pipelined divider like that of [9], the fixed-point input values must first be normalized. The left shifter of Fig. 3 does this by finding the leading '1' of the input value and then performing a left shift. Figure 4(a) shows the structure of the 32-bit fixed-point format (sign: 1 bit, integer: i bits, fraction: f bits). The LSBs of the left-shift result for normalization are filled with as many '0's as there are leading '0's (Fig. 4(b) and (c)). These LSBs are the lowest bits of the multiplicand and the multiplier, so the LSBs of the multiplication result are likewise filled with as many '0's as there are leading '0's. Meanwhile, to convert the final result of the division unit to a fixed-point format, a right shift is carried out. The final fixed-point division result therefore does not use as many LSBs of the division unit output as the number of positions shifted right.
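The first feature, that normalization fills the low end with as many zeros as there were leading zeros, can be seen in a small sketch. The helper name `normalize`, the 31-bit magnitude width, and the sample value are assumptions made for illustration.

```python
WIDTH = 31  # assumed magnitude width: 32-bit word minus the sign bit


def normalize(value: int, width: int = WIDTH) -> tuple[int, int]:
    """Left-shift a non-zero magnitude until its leading '1' sits at the MSB.

    Returns the normalized value and the shift amount (number of leading zeros).
    """
    if value <= 0 or value >= 1 << width:
        raise ValueError("value must be a non-zero magnitude that fits in width bits")
    shift = width - value.bit_length()
    return value << shift, shift


raw = 0b1011_0110_0001  # hypothetical 12-bit input, so 19 leading zeros
norm, shift = normalize(raw)
# Count the trailing zeros of the normalized value.
trailing_zeros = (norm & -norm).bit_length() - 1
print(shift, trailing_zeros)  # 19 19
```

The trailing zeros equal the shift here because the sample input is odd; an input that already ends in zeros would have more.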
The proposed method exploits these features and uses only the upper N bits of the normalized dividend and divisor during division. Since the internal bit-width of the division unit is reduced, the multiplier and the LUT become smaller, so the area and the latency of the divider can be reduced as well. Figure 5 shows $\varepsilon_1$ and $\varepsilon_2$, the discarded lower parts that cause error in the division of the variables $X$ and $Y$ when only the upper N bits of $X$ and $Y$ are used.
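Keeping only the upper N bits can be modelled as splitting each operand into a retained part and a discarded part, the latter playing the role of $\varepsilon_1$ or $\varepsilon_2$. The helper `split_upper_bits` and the sample operand below are hypothetical and only illustrate the idea for N = 24 and 16 fraction bits.

```python
from fractions import Fraction


def split_upper_bits(value: int, n: int) -> tuple[int, int]:
    """Split a positive magnitude into (kept, dropped).

    kept retains the n bits that start at the leading '1';
    dropped is the remainder below them, in raw LSB units.
    """
    extra = max(value.bit_length() - n, 0)
    kept = (value >> extra) << extra
    return kept, value - kept


F_BITS, N = 16, 24
x_raw = (1 << 30) | 0x7F  # hypothetical operand: MSB set, lowest 7 bits set
kept, dropped = split_upper_bits(x_raw, N)
eps1 = Fraction(dropped, 2**F_BITS)  # discarded part as a real value
print(kept == 1 << 30, dropped, float(eps1))
```

For this operand the discarded part is 127 raw units, that is $2^{-9} - 2^{-16}$, the largest value that a 31-bit magnitude can lose when N = 24 and f = 16.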
Boundary Conditions for Error Analysis
Following Fig. 4 and Fig. 5, the maximum values of $X$ and $Y$ occur when all digits except the sign bit have value '1', and can be expressed as $2^{i} - 2^{-f}$. The minimum value of $X$ is 0 and that of $Y$ is $2^{-f}$, because the exceptional case of division by zero occurs when $Y$ equals 0. This can be expressed in the following formula:
$$0 \le X \le 2^{i} - 2^{-f}, \qquad 2^{-f} \le Y \le 2^{i} - 2^{-f} \qquad (4)$$
The minimum values of $\varepsilon_1$ and $\varepsilon_2$ are 0, and the maximum values occur when the most significant bit (MSB) is '1' and the lower bits after the upper N bits are all '1', as Fig. 6 shows. This can be expressed as Eq. (5).
$$0 \le \varepsilon_1, \varepsilon_2 \le 2^{i-N} - 2^{-f} \qquad (5)$$
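Since $X' = X - \varepsilon_1$ and $Y' = Y - \varepsilon_2$, where $X'$ and $Y'$ denote the operands that keep only the upper N bits of $X$ and $Y$, following Eq. (4), the ranges of $X'$ and $Y'$ are as follows: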
An error occurs when division is carried out using the truncated operands $X'$ and $Y'$ instead of $X$ and $Y$ and $\varepsilon_1$ or $\varepsilon_2$ is not 0. In this case, the minimum value of $X'$ occurs when $\varepsilon_1$ is $2^{-f}$, and the value is $2^{N-f}$, as in Fig. 7. The minimum value of $Y'$ is also $2^{N-f}$. The maximum values of $X'$ and $Y'$ are identical to the maximum values in Eq. (6). With this derivation, Eq. (6) can be converted to Eq. (7).
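The bounds stated in this section can be checked by exhaustive enumeration on a deliberately small format. The parameters below (3 integer bits, 3 fraction bits, N = 4) are hypothetical and chosen only so that every value can be listed; the helper `truncate` models keeping the upper N bits starting at the leading '1'.

```python
from fractions import Fraction

I_BITS, F_BITS, N = 3, 3, 4  # hypothetical small format for enumeration
UNIT = Fraction(1, 2**F_BITS)  # weight of one LSB, 2^-f


def truncate(raw: int, n: int) -> int:
    """Keep the n bits that start at the leading '1' and zero the rest."""
    extra = max(raw.bit_length() - n, 0)
    return (raw >> extra) << extra


# (kept, dropped) in raw units for every non-zero magnitude.
pairs = [(truncate(r, N), r - truncate(r, N)) for r in range(1, 2 ** (I_BITS + F_BITS))]

max_eps = max(e for _, e in pairs) * UNIT
max_kept = max(k for k, _ in pairs) * UNIT
min_kept_with_error = min(k for k, e in pairs if e) * UNIT

print(max_eps, max_kept, min_kept_with_error)  # 3/8 15/2 2
```

For this format the enumeration agrees with the closed forms in the text: the largest discarded part is $2^{i-N} - 2^{-f}$, the largest truncated operand is $2^{i} - 2^{i-N}$, and the smallest truncated operand that has a non-zero discarded part is $2^{N-f}$.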
If the exceptional cases (overflow or underflow) of the division result are handled by the divider, the range of division results without exceptions is given by Eq. (8). This range follows from the representational limits of the fixed-point format, and the minimum and maximum division results are the same as the bounds of $X$ in Eq. (4).
Error Analysis
In this section, we analyse the size of the error that occurs when division is carried out using the truncated operands $X'$ and $Y'$ instead of $X$ and $Y$, and the proportion of the division result that this error represents. The error $\varepsilon_{rr}$, which occurs when only the upper N bits of the 31 bits are used for division, is expressed as Eq. (9):
The error ratio, $\varepsilon_{rate}$, which is the proportion of the final division result taken up by the error, is expressed as Eq. (10):
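Eqs. (9) and (10) are not reproduced in this text, so the following sketch rests on an assumed reading: the error is taken as the absolute difference between the exact quotient and the quotient of the truncated operands, and the ratio is taken relative to the exact quotient, in percent. The function name `division_error` and the sample operands are hypothetical.

```python
from fractions import Fraction


def division_error(x: Fraction, y: Fraction, eps1: Fraction, eps2: Fraction):
    """Return (error, error ratio in percent) for truncated operands.

    Assumed definitions: error = |x/y - (x - eps1)/(y - eps2)|,
    ratio = error / (x/y) * 100.
    """
    exact = x / y
    approx = (x - eps1) / (y - eps2)
    err = abs(exact - approx)
    rate_percent = err / exact * 100 if exact else Fraction(0)
    return err, rate_percent


# Only the divisor loses a low part here (eps1 = 0, eps2 = 1/16).
err_example, rate_example = division_error(
    Fraction(10), Fraction(65, 16), Fraction(0), Fraction(1, 16)
)
print(err_example, rate_example)  # 1/26 25/16
```

Exact rational arithmetic is used so that the small differences under study are not masked by floating-point rounding.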
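The maximum error value, $\varepsilon_{max}$, can be analysed in three cases: a) $\varepsilon_1 = 0$, $\varepsilon_2 \neq 0$; b) $\varepsilon_2 = 0$, $\varepsilon_1 \neq 0$; and c) $\varepsilon_1 \neq 0$, $\varepsilon_2 \neq 0$. According to Eq. (4), when both $\varepsilon_1$ and $\varepsilon_2$ are 0, $X' = X$ and $Y' = Y$, so that there is no error.
5.1 Case a) $\varepsilon_1 = 0$, $\varepsilon_2 \neq 0$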
When ε 1 equals 0, Eqs. (9) and (10) are as follows:
The maximum error occurs when X has the maximum value and Y has the minimum value, in accordance with Eq. (11).
Since $\varepsilon_1 = 0$, $X' = X$, and the maximum value is $X = 2^{i} - 2^{i-N}$, following Eq. (6). The minimum value of $Y'$ is $2^{N-f}$, in accordance with Eq. (7); then, as shown in Fig. 7, $\varepsilon_2$ is $2^{-f}$. If each of these values is inserted into Eq. (11), the following formula can be derived:
If the formula is expanded, it can be expressed as follows:
If Y and ε 2 are inserted into Eq. (12), the maximum error ratio is as follows:
5.2 Case b) $\varepsilon_2 = 0$, $\varepsilon_1 \neq 0$
When $\varepsilon_2$ equals 0, Eqs. (9) and (10) become as follows:
The maximum error $\varepsilon_{max}$ occurs when $Y$ has the minimum value and $\varepsilon_1$ has the maximum value, in accordance with Eq. (14). Also, the maximum error rate, $\varepsilon_{max\,rate}$, occurs when $X$ is minimum, in accordance with Eq. (15). When $\varepsilon_1$ takes its maximum value $2^{i-N} - 2^{-f}$, following Eq. (5), the MSB of $X$ is 1, so the minimum value of the truncated operand $X'$ is $2^{i-1}$, as shown in Fig. 8.
Since $\varepsilon_2$ equals 0, the next equation can be derived when the truncated dividend $X'$ is inserted into Eq. (8) to calculate the minimum of $Y$ without overflow.
Taking $2^{i}$ as the bound on the right-hand side and solving for $Y$ on the left-hand side gives Eq. (16). According to Eq. (16), the minimum of $Y$ must be larger than $2^{-1}$.
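Eq. (16) is not reproduced in this text. One reading that is consistent with the surrounding sentences, assuming the truncated dividend takes its minimum value $2^{i-1}$ and the quotient must stay below $2^{i}$ to avoid overflow, would be:

```latex
\frac{2^{i-1}}{Y} < 2^{i}
\;\Longrightarrow\;
2^{i-1} < 2^{i}\, Y
\;\Longrightarrow\;
Y > 2^{-1}
```

Under this reading the bound does not depend on N, only on the position of the MSB of the dividend.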
As Fig. 9 shows, the minimum value of $Y$ that is larger than $2^{-1}$ is $2^{-1} + 2^{-f}$. If this value is inserted into Eq. (14), the maximum error is expressed as in the following formula:
Fig. 8 Minimum of X and maximum of $\varepsilon_1$ when $\varepsilon_1$ is the maximum.
When $X' = 2^{i-1}$ and $\varepsilon_1 = 2^{i-N} - 2^{-f}$ are inserted into Eq. (15), the maximum error rate can be expressed as follows:
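5.3 Case c) $\varepsilon_1 \neq 0$, $\varepsilon_2 \neq 0$
Since the truncated divisor $Y'$ consists of the N bits of $Y$ that start at the leading '1' and the remaining digits form $\varepsilon_2$, $Y'$ and $\varepsilon_2$ are correlated. Since $Y'$ is larger than $\varepsilon_2$, in accordance with Eq. (9), $\varepsilon_{rr}$ reaches its maximum when $Y'$ is small rather than when $\varepsilon_2$ is large. Equation (18) can be derived as: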
In accordance with Eq. (7), the maximum value of the truncated dividend $X'$ is $2^{i} - 2^{i-N}$, and the minimum value of $\varepsilon_1$ is $2^{-f}$, following Fig. 7. Likewise, the minimum value of the truncated divisor $Y'$ is $2^{N-f}$, and $\varepsilon_2$ is $2^{-f}$. The result of inserting these values into Eq. (18) is Eq. (19). As Eq. (19) shows, $\varepsilon_{max}$ is always smaller than the value in Eq. (13). Case c) therefore always has a smaller error than Case a), and it does not affect the final maximum error of the division result.
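The comparison between Case a) and Case c) can be checked numerically at the operating points named in the text. Since Eqs. (13) and (19) are not reproduced here, the sketch assumes that the error is the absolute difference between the exact quotient and the quotient of the truncated operands; the function name `worst_case_a_and_c` is hypothetical.

```python
from fractions import Fraction


def worst_case_a_and_c(i: int, f: int, n: int):
    """Errors at the stated operating points of Case a) and Case c)."""
    unit = Fraction(1, 2**f)  # 2^-f
    x_trunc = Fraction(2) ** i - Fraction(2) ** (i - n)  # maximum truncated dividend
    y_trunc = Fraction(2) ** (n - f)  # minimum truncated divisor
    eps2 = unit
    y = y_trunc + eps2
    computed = x_trunc / y_trunc  # what the reduced-width unit produces
    # Case a): eps1 = 0, so the exact dividend equals the truncated one.
    err_a = abs(x_trunc / y - computed)
    # Case c): eps1 = 2^-f, so the exact dividend is slightly larger.
    err_c = abs((x_trunc + unit) / y - computed)
    return err_a, err_c


for fmt in [(15, 16, 24), (31, 0, 24)]:
    err_a, err_c = worst_case_a_and_c(*fmt)
    print(fmt, float(err_a), float(err_c), err_c < err_a)
```

At these points the non-zero $\varepsilon_1$ partly cancels the effect of $\varepsilon_2$, which is why the Case c) value comes out below the Case a) value.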
Results of Error Analysis
In this paper, we measured the maximum error and maximum error ratio in two cases. The first case uses the number of fraction bits of the fixed-point format of OpenGL ES 2.0, which is a standard for mobile devices. The second case is i = 31, which gives the largest maximum error value according to Eqs. (13) and (17). First, because the fixed-point format of OpenGL ES 2.0 allocates 16 bits to the fraction [11], we assume that i is 15 and f is 16. Table 1 shows the maximum error and maximum error ratio for each N. When N = 24, the maximum error occurs in case b). The error is 0.0038756, and its ratio to the division result is 0.0000118%. When N = 20, the maximum error is 0.0624676, and its ratio to the division result is 0.00019%.
The second case analyses the maximum error when i is set to 31, according to Eqs. (13) and (17). Table 2 shows the maximum error and maximum error ratio for each N. Since the errors in case a) and case c) are smaller than 1, there is no error in the division result regardless of the value of N. When N = 24 for case b), the maximum error is 127, and its ratio to the division result is 0.0000118%. When N = 20, the maximum error is 2047, and its ratio to the division result is 0.00019%.
Table 1 Maximum error and error rate according to N in case of i = 15 and f = 16.
Table 2 Maximum error and error rate according to N in case of i = 31 and f = 0.
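The case b) figures quoted above can be recomputed from the operating point described in Sect. 5: the largest discarded part $2^{i-N} - 2^{-f}$, a truncated dividend of $2^{i-1}$, and the smallest representable divisor above $2^{-1}$. The function name `case_b_maximum` is hypothetical, and the ratio is taken relative to the exact dividend, which is an assumption because Eq. (15) is not reproduced in this text.

```python
from fractions import Fraction


def case_b_maximum(i: int, f: int, n: int):
    """Maximum error and error ratio (percent) for case b): eps2 = 0."""
    unit = Fraction(1, 2**f)  # 2^-f
    eps1 = Fraction(2) ** (i - n) - unit  # largest discarded part of X
    x_trunc = Fraction(2) ** (i - 1)  # truncated dividend, MSB only
    # Smallest multiple of 2^-f strictly above 2^-1 (equals 1 when f = 0).
    y_min = (Fraction(1, 2) // unit + 1) * unit
    err = eps1 / y_min
    rate_percent = eps1 / (x_trunc + eps1) * 100
    return err, rate_percent


for fmt in [(15, 16, 24), (15, 16, 20), (31, 0, 24), (31, 0, 20)]:
    err, rate = case_b_maximum(*fmt)
    print(fmt, float(err), float(rate))
```

With these assumptions the sketch gives about 0.0038756 and 0.0624676 for i = 15, f = 16, and exactly 127 and 2047 for i = 31, f = 0, with ratios of about 0.0000118% and 0.00019%, in line with the values quoted in the text.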
Table 3 compares the delay and area cost of the proposed scheme with those of previous approaches. The delay and area cost have been calculated based on the analytical method in [12]. The delays are expressed in terms of τ, the delay of a complex gate such as one full adder. The unit employed for the area cost estimation is the size of one full adder, fa.
For Jeong's algorithm, the results in Table 3 show that using N = 24 for division decreases the area by 42%, as compared to N = 31, and decreases the critical path delay by 8%. For Hung's algorithm, the area decreases by 92%, as compared to the case of N = 31, and the critical path delay decreases by 5%.
Conclusion
For the rasterization step of the 3D graphics pipeline, a divider, especially a high-performance pipelined divider, is required. Existing pipelined division algorithms use a Taylor-series expansion, which requires a large LUT. In this paper, an area- and speed-effective fixed-point pipelined divider is proposed that reduces the bit-width of the division unit to fit a low-cost rendering processor. To decide the bit-width of the division unit, error analysis has been carried out in several ways.
When the bit-width is restricted to 24 bits from the original 31 bits, a 42% decrease in area is possible. This is because of the reduction of the input bit-width of the multiplier and the reduction of the LUT size in the division unit. Also, the analysis of the error caused by the input bit restriction shows a maximum error of 0.00002% for the divider implemented with 24 bits. The proposed structure can be applied not only to a pipelined divider, but also to other division algorithms.
