Perspective-correct texturing for correct 3D graphics images requires the per-pixel division, but the division can be avoided by midpoint algorithms. Previous approaches using midpoint algorithms for perspective-correct texturing evaluate the integer texture coordinates, which are not suitable for trilinear filtering used in the most applications. In this paper, we propose a rasterizer architecture which separates the evaluation of integer part and the computation of fraction parts. The fraction part of a texture coordinate is computed per cycle in spite of short pipeline depth, consequently much smaller cost than divider. The implemented parallel rasterizer runs at 100 Mhz clock and evaluates 400M pixels and 800M texture coordinates per second.
INTRODUCTION
Every pixel on the display in widespread mobile phones or PDA systems must be rendered in higher quality because the average eye-to-pixel distance is very short. In order to improve 3D graphics image with no visual pixel defects, every textured pixel on the display must be rendered with perspective-correct methods. [1] However, usual implementations for perspective correct rendering of planar polygons require a division per pixel. [1] Division requires long latency and large area, so division per pixel makes it difficult to achieve high throughput of pixel-fill rate with small hardware cost.
In order to avoid division per pixel, there have been several studies. Some techniques to approximate hyperbolic curve with linear and quadratic interpolation have been presented. [5] These polynomial algorithms use only adders instead of dividers, but they are approximations with possible errors. There have been other approaches that use midpoint algorithms [2, 3] , which are well known for the trace of hyperbolic functions. These approaches which induce mid-point algorithms for perspective correct texturing are well summarized and published by Barenbrug [3] . He demonstrates that the midpoint algorithm makes exact integer texture coordinates for point sampling, only using several additions instead of a division. However, point sampling does not provide good image quality, and then trilinear filtering instead of point sampling is usually used in the most applications. Usual filtering methods such as trilinear filtering require the fraction part of texture coordinates as filtering coefficients. Fixed point representations including fraction parts can be scaled such that they take integer values only, as mentioned in [3] , but the number of iteration are too increased to be used in hardware.
In this paper, we modify midpoint algorithms to be suitable for hardware and present its hardware architecture and implementation. The fraction part of texture coordinates are evaluated without iteration overhead. We also demonstrate mipmap level switching architecture using the internal variables for midpoint algorithms.
This paper is organized as follows. First, perspective-correct texturing and midpoint algorithms are reviewed. After that, fraction evaluations and mipmap switching are explained in detail. And then hardware architecture is described. Finally, discussion and summary are offered in conclusion.
BACKGROUND

Perspective-correct texture mapping
The perspective texture coordinates of a certain pixel (u, v) are calculated as follows. For triangles, the screen coordinates and the texture coordinates are connected by a homogeneous linear transformation. The relation among the screen coordinates (x, y), the homogeneous 2D screen coordinates (x', y', w') and the texture coordinates (u, v) is given by: 
As shown in (1), two division operations or one reciprocal and two multiplications are required in calculating exact texture coordinates of one pixel.
Midpoint Algorithm
The idea using midpoint algorithm in perspective texture rendering was proposed in [3] . The key idea of [3] is that repetition of additions can follow hyperbolic curves within given precision as shown in Figure 1 . Let's assume that u and x are represented by only integers. We do not need to know the exact value of f(x i ) for given x i . The exact function value f(x i ) must be taken in limited precision, so the nearest integer u i is acceptable for f(x i ). The difference between u i and f(x i ) is smaller than the precision error. Therefore, u i is the exact value allowed by given precision.
The acceptable value u i for x i must be an integer inside acceptable region shadowed in Fig. 1 . Considering both of x and y in (1), this condition is formulated as follows. As proposed in [3] , several variables are introduced:
Assuming d>0, to substituting (3) simplifies (2) to (4).
A new variable A is introduced for simplicity. When x increases by 1, d and E are changed as shown in (5).
When u increases by 1, A and E are modified as follows.
Therefore, when E and d change by increase or decrease of x or y, E can be restored to satisfying inequality (4) by adjusting u with iterative additions or subtractions in (6) .
However, In order to trace a hyperbolic curve as shown in Fig. 1 , an increment by 1 in x or y may require multiple increments or decrements of u depending of partial derivatives of the hyperbolic curve. The iteration of additions or subtractions does not have upper bound generally, but mipmap [[] limits the derivatives such as x u ∂ ∂ / to two. But still, when using texture filtering requires fraction parts of texture coordinates, much more iterations must be done as shown in Fig 2. the example in Fig 2 shows the case that 5 iterations of computing (6) and checking (4) are required to obtain correct u i+1 from u i . To calculate u i+2 , 3 iterations of (6) and (4) must be computed. If three bits are used to represent the fraction of u, 2Ý2 3 =16
iterations must be performed in the worst case in order to obtain texture coordinates of one pixel. This long latency by the iteration wastes clock cycles in the hardware implementation, like divisions. This problem can be solved by the proposed architecture that separates integer part interpolation and fraction part computation.
COMPUTING FRACTION OF TEXTURE COORDINATES
Assuming that m bit precision of the texture coordinate fraction is required as texture filtering coefficients, inequalities (2) are to be modified to (7). Instead of (7), let us consider an inequality (8) and Fig. 3 . Inequality (8) evaluates u i by truncation while inequality (7) rounds off f(x i ). Rounding off computes closer u i to original curves than truncation, but the difference is negligible for enough precision.
We introduces E' and A' instead of E and A defined in (3) and (6) so that E' is independent of m as follows. Note that the second and the third in the relation (5) are still valid.
Therefore, substituting E' in (8) yields: Let u| n imply the number with n fraction bits equal to or less than u. Then next relations are induced easily. 
The sequence k is obtained from k 1 to k m step by step as shown in Fig. 4 . These sequential evaluations can be pipelined and the required hardware resource for one pipeline stage is only one comparator for checking (13) and one adder for updating E'.
LEVEL OF DETAIL AND MIPMAP
Mipmap is pyramidal data structure of pre-processed texture data. [4] Trilinear filtering using mipmap provides better image quality than point sampling. Mipmap also limits the derivatives so that the iteration limit is bound in midpoint algorithms.
Because of its pyramidal structure, the texture coordinate u goes to the half when a mipmap level is increased by one. Therefore, we rewrite (8) and (9) 
If ǻȜ is less than 0, the mipmap level will be decreased. If ǻȜ is equal to or more than 1, the mipmap level will be increased. We can check this condition without logarithm as follows.
THE HARDWARE IMPLEMENTATION
Overall Architecture
In this section, overall architecture of the proposed rasterizer will be described. Fig. 5 shows the block diagram of a basic rasterization unit.
Context Register contains the variables of a current pixel such as position, color, depth and texture coordinates. Note that Context Register stores only integer part of texture coordinates.
Pixel Interpolator interpolates the contexts by the controls of FSM. Internal variables E' and d are also updated by (5) in Pixel Interpolator.
Texel Interpolator finds correct the integer part of texture coordinates by (6) . Texel Interpolator has two adder sets to avoid iterations.
Mipmap Switching Detector checks the condition (14) to determine whether a current mipmap level must be changed. When a mipmap level is changed, the outputs of Pixel Interpolator and Texel Interpolator are not updated to Context Register. Instead, the outputs of Mipmap Switch are updated. This is one cycle loss to change a mipmap level, but mipmap level shifting is not frequently occurred. Note that Mipmap Switching Detector determines mipmap level shifting by using A' of integer texture coordinates. It is mathematically correct to use A' of full texture coordinates, but this makes no problem since a precise mipmap level threshold is not critical. LOD evaluator generates the fraction of LOD for a trilinear filtering coefficient.
Fraction Evaluator produces the fraction part of texture coordinate by (13). A Fraction Evaluator block produces only one fraction bit by (13), but pipelining multiple Fraction Evaluator blocks enable to obtain multiple bits of fraction parts without loss of performance. 4 bit fraction is usually enough for trilinear filtering coefficients. Therefore, 4 pipeline stages produce a texture coordinate per cycle.
Fraction Evaluator is very simple, which is only composed of a comparator and an adder, but the registers to contain context of pixel are quite large. Therefore, it is better to cascading several Fraction Evaluator blocks in a pipeline stage to produce multiple fraction bits. This reduces the number of pipeline stages instead of increasing delay.
Implementation
The proposed architecture was verified by GATE [6] and implemented by Verilog HDL. The implemented rasterizer has 4 parallel basic rasterization units for 4 pixels per cycle throughput. Each basic rasterization unit evaluates 4 texture coordinates to support two level multi-textures. All the internal variables and texture coordinates are represented in 16 bit fixed point representation and the fraction parts of texture coordinates are 4 bits. The rasterizer is pipelined into two stages to evaluate 4 bit fraction. It is verified from static timing analysis in 0.13um process that the rasterizer affords 100 Mhz clock in the worst case. The total gate count is 220k.
CONCLUSION
To render perspective-correct textured images, divisions are required per pixels. However, using dividers is very expensive in terms of area and speed. Dividers are also deep pipelined to reduce latency, but the pipeline registers to store full pixel context are very large. Pipelined dividers with short latency have large look-up-tables. [7] To avoid expensive divisions, the proposed rasterizer uses midpoint algorithms. To use midpoint algorithms instead of divisions require also iterative computations, but the integer parts of texture coordinates are evaluated without iterations due to the characteristics of the 3D graphics mipmap structure.
The fraction parts of texture coordinates are evaluated in pipelined methods. The pipeline depth is determined on the precision of only fraction parts while the pipeline depth of divider is related on full precision. The precision of fraction part is usually much less than the precision of integer part. Therefore, the pipeline depth is much shortened and the area overhead by the pipeline registers becomes small. Besides, one or two pipeline stages as we used in hardware implementation is not overhead in the entire graphics systems. Alpha test or pre-depth tests are located between rasterizer and texture engine and executed in parallel with the fraction evaluation.
We implemented the proposed architecture into real hardware. The implemented parallel rasterizer produces 4 pixels with two texture coordinates for multi-texturing per cycle. The rasterizer operates at 100 Mhz clock and hence gives 400M pixels and 800M texture coordinates per second.
ACKNOWLEGEMENT
This work is supported in part by SAMSUNG Electronics and A Collaborative Project for Excellence in Basic System IC Technology.
