Abstract: As 3D scenes become increasingly complex and screen resolutions increase, the design of an effective memory architecture is one of the most important issues for 3D rendering processors. We propose a pixel rasterization architecture that performs the depth test twice, before and after texture mapping. By performing the depth test before texture mapping, the proposed architecture eliminates the memory bandwidth wasted on fetching texture data for obscured fragments. It also reduces the miss penalty of the pixel cache by using a prefetch scheme: the frame memory access caused by a cache miss at the first depth test is performed simultaneously with texture mapping. We have built a trace-driven simulator for the proposed architecture and provide the results of various simulations to validate it. The proposed pixel rasterization architecture uses memory bandwidth effectively and reduces power consumption while producing high performance gains.
INTRODUCTION
In the latest high-performance rendering processors, not only basic functions, such as scan conversion and the z-test (depth test), but also image mappings, such as texture mapping, bump mapping, and environment mapping, are provided to enhance realism for highly complex scenes [2], [3], [4]. Texture mapping performance is an especially important criterion in estimating the performance of a rendering processor. Performing texture mapping at a high pixel fill rate requires considerable hardware logic and generates heavy memory traffic [7], [13], [14]. Therefore, organizing the texture mapping stage effectively is one of the major design issues for high-performance rendering processors [5], [6], [13].
In conventional high-performance rendering processors, texture mapping is performed before the z-test [2], [3], [7], [9], [11]. We call this type of processing flow pretexturing. Pretexturing properly supports the semantics of standard APIs such as OpenGL [15], [16]. Its major disadvantage is that texture mapping is performed unnecessarily for fragments obscured by previously drawn pixels. To remove such unnecessary operations, texture mapping should be performed after the z-test. We call this form of processing posttexturing. However, posttexturing requires a wider fragment queue for pipeline execution. Moreover, the wide separation between reading and writing a z-value makes maintaining frame memory consistency more difficult because more than one fragment in the pipeline may have the same pixel address.
A hierarchical z-buffer [17] is included in the HyperZ technology of Radeon [3]. It keeps a reduced-resolution version of the z-buffer and removes fragments that fail the z-test as early as possible, before texture mapping. After this step, texture mapping and the final z-test against the full-resolution frame memory are performed. With this scheme, a considerable share (60-70 percent on average) of the fragments with z-test failures is detected and discarded from the pipeline. However, the hierarchical z-buffer requires a very large data structure because it is constructed as a z-pyramid [17]. Maintaining the hierarchical z-buffer on every frame memory update may also impose an excessive computational burden.
In this paper, we propose a new pixel rasterization architecture that adds two new stages, z-read and z-test, to its pixel rasterization pipeline. The key idea of the proposed architecture is to perform texture mapping between the first z-test and the second z-test. We call this processing midtexturing. In the proposed architecture, unnecessary texture mapping for obscured fragments can be eliminated because texture mapping is performed after the first z-test, as in posttexturing. However, unlike posttexturing, a normal-sized fragment queue is sufficient, and the consistency problem due to the wide separation does not occur because the second z-read is performed after texture mapping. Additionally, when a pixel cache miss occurs at the first z-read stage, the frame memory access for the cache miss can be performed simultaneously with the pipeline executions between the first and the second z-read stages, inclusive.
We have built a trace-driven simulator for the proposed pixel rasterization architecture. To validate our proposed architecture, various simulation results with three benchmarks are given. The simulation results show that the average z-test failure rate is 40 percent for the Quake3 [20] benchmark. The consistency problem occurrence rates are also obtained for various stage separation lengths. These rates are so low that unnecessary texture mapping is hardly done. The simulation results for the cache miss penalty reduction of the proposed architecture show that the miss penalty can be reduced up to 90 percent. The performance analysis shows that midtexturing achieves an almost zero-latency memory system for Crystal Space [21] and Quake3.
In the next section, we give a brief overview of the conventional pixel rasterization pipeline flow and its architectural features. In Section 3, we illustrate a new rasterization architecture and its architectural features. Various simulation results and performance evaluation are given in Section 4. Conclusions are presented in Section 5.
BACKGROUND
The rendering process consists of two stages: geometry processing and rasterization. In the geometry-processing stage, triangle vertices are transformed from object space into screen space. In the rasterization stage, a triangle is converted into pixels, which are depth buffered into the frame memory. A general rasterization pipeline consists of three steps: per-triangle processing, per-span processing, and per-pixel processing. The per-triangle step, called triangle setup [10], includes the computations of a series of increments used to walk along the edges of each triangle. In the per-span step, called the edge-walk, the span endpoints and the interpolation parameters for pixel operations are computed. During the edge-walk, the triangles are decomposed into horizontal spans. In the per-pixel step, the pixel rasterization generates a series of fragments along the span by interpolating the color and texture coordinates. The final visible colors of these fragments are then written into the frame memory with their own z-values.
The Pretexturing Pixel Rasterization Architecture
Two types of detailed pretexturing pixel rasterization pipelines are shown in Fig. 1. The pipeline in Fig. 1a includes the pixel cache and the one in Fig. 1b does not. The latest rendering systems have a structure similar to one of these two types. The first two stages read four or eight texels from the texture cache, perform either bilinear or trilinear filtering on them to produce a single texel, and blend the texel with the pixel color. The alpha-value of the current fragment is then compared with that of the filtered texel. The alpha-test decides whether a fragment is accepted or not, based on the comparison between the alpha-value of the incoming fragment and a reference value. If the alpha-test fails, the fragment is dropped from the pipeline. The next two stages read the z-value from the frame memory or from the pixel cache and compare it with that of the current fragment. If the z-test fails, that is, the current fragment is obscured by the previously drawn pixel, then the current fragment is removed from the pipeline. Otherwise, a new z-value is written into the frame memory or into the pixel cache. Finally, the color data are read, alpha-blended with the result of texture blending, and then written back to either the frame memory or the pixel cache.
In the texture read stage, the level-of-detail calculation of the mip-map and the address generation for the four or eight texels must be performed. Completing these two operations and the filtering operation requires considerable cycle latency. Moreover, a texture prefetch scheme [13] requires additional cycle time for the fragment queue. Because the access time of a cache is generally one unit time, the access time for either the texture cache or the pixel cache in Fig. 1 is assumed to be one unit time in the case of a cache hit.
The pipelines in Fig. 1 may not be sufficient if we compare them with those of current rendering processors in terms of capabilities, such as bump mapping, environment mapping, and stencil effects. In some implementations of the pipelines, texture computation can be done in parallel with the z-read [10], [11]. Since all these functions can be provided with minor modifications, we can still regard these pipelines as typical pixel rasterization pipelines by adding appropriate stages to the pipelines in Fig. 1.
We define depth complexity as the average number of generated fragments per pixel for a frame; that is, depth complexity indicates how many objects overlap in the same pixel on average. For some scenes, depth complexity may be as high as 10 or more. In most cases, depth complexity is two or three. In a scene with a depth complexity of three, about seven out of 18 fragments fail the z-test [7].
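The seven-out-of-18 figure can be reproduced with a short calculation. The sketch below is one way to arrive at it, under an assumption that is not stated in the text: the fragments covering a pixel arrive in uniformly random depth order, so the k-th arrival is the nearest so far (and passes the z-test) with probability 1/k. The function name is hypothetical.

```python
from fractions import Fraction


def expected_ztest_failure_fraction(depth_complexity: int) -> Fraction:
    """Expected fraction of fragments that fail the z-test at one pixel.

    Assumption (not from the text): the fragments covering the pixel arrive
    in uniformly random depth order, so the k-th arrival passes with
    probability 1/k.
    """
    if depth_complexity < 1:
        raise ValueError("depth complexity must be at least 1")
    # Expected number of passing fragments is the harmonic number H_d.
    expected_passes = sum(Fraction(1, k) for k in range(1, depth_complexity + 1))
    return 1 - expected_passes / depth_complexity


print(expected_ztest_failure_fraction(3))  # 7/18
```

Under the same assumption the failure fraction grows with depth complexity, which is why early rejection of obscured fragments matters more for complex scenes.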
The documentation on Neon [7], [8] is very important for our research because no other documents on the internal architecture of a recent high-performance rendering processor have been made public. The pixel rasterization pipeline of [7] is similar to that of Fig. 1b. In [7], it is mentioned that pretexturing is adopted for the following three reasons: First, posttexturing requires a wider fragment queue for texture mapping after the z-test and Neon could not afford it. Second, OpenGL semantics do not allow updating the z-buffer until texture mapping is finished because a textured fragment may be completely transparent. Such a wide separation between reading and writing a z-value may cause difficulty in maintaining the frame memory consistency. Third, maintaining the spatial locality of texture access is difficult. However, since current semiconductor technology can afford sufficient cache sizes, this third problem may not significantly affect the performance of a texture-cache system.
The Posttexturing Pixel Rasterization Architecture
The posttexturing pixel rasterization pipeline with a pixel cache is shown in Fig. 2. The first stage reads the z-value from the pixel cache. When a pixel cache miss occurs for the z-read, the pipeline stops until the missing cache block is transmitted from the frame memory to the pixel cache. Then, the z-test is performed by comparing the z-value retrieved at the z-read stage with that of the current fragment. If the test fails, the current fragment is dropped from the pipeline. The next three stages, the texture read/filter, texture blend, and alpha-test stages, are identical to those in the pretexturing architecture. Finally, the color data are read, alpha-blended, and then written back to the pixel cache.
Since the z-data retrieved from the pixel cache at the z-read stage must be carried up to the z-write stage, along with the fragment information, a wider fragment queue is needed between the z-read and the z-write stages. Since recent rendering processors are able to generate four pixels in a cycle, if the pipeline length between the z-read and the z-write stages is at least 20, then a queue size of at least 3K bits must be added to the normal fragment queue. This hardware burden is not negligible.
The wide separation between reading and writing z-values in posttexturing causes the consistency problem. If two fragments have the same pixel address, then the z-write for the first fragment must be completed before the z-read of the second fragment; otherwise, the z-read of the second fragment should use an internal bypass. Since such a close time overlap occurs rarely, it is acceptable to stop reading the pixel data until the z-write for the first fragment is completed. However, an associative overlap detector covering this separation is not suitable because of its hardware burden.
THE PROPOSED PIXEL RASTERIZATION ARCHITECTURE
Fig. 3 shows the proposed pixel rasterization pipeline with midtexturing. In this architecture, extra z-read and z-test stages are added to the pipeline. The first pair of z-read and z-test operations is performed before texture mapping and the second pair is performed after texture mapping. The proposed architecture has the following two advantages: First, the architectural advantages of both pretexturing and posttexturing can be obtained. Since texture mapping is performed after the first z-test, there is no memory bandwidth waste caused by fetching obscured texture data. Since the second pair of z-read and z-test operations is performed after texture mapping, a normal-sized fragment queue is sufficient. Even though the consistency problem may occur between the first z-read and the z-write stages, the consistency between the frame buffer and the pixel cache can still be maintained because the second z-read and z-test are performed after texture mapping. Second, the penalty of a pixel cache miss can be reduced. This will be shown in Section 4.3. We now describe the processing flow of the pipeline in Fig. 3 according to the following three cases. Note that, since the last three stages, color read, alpha blending, and color write, are identical to those in the previous architectures, we omit those portions in the processing flow.
The first case is that the first z-read for the pixel cache results in a hit and the first z-test fails. Since this case happens when the current fragment is obscured by the previously drawn pixel, the fragment is removed from the pipeline. Thus, the subsequent stages, including texture mapping, do not need to be executed.
The second case is that the first z-read results in a pixel cache hit and the first z-test is successful. In this case, the processing flow of the pipeline after the first z-test stage is equivalent to that of pretexturing. That is, texture mapping/filtering, texture blending, and the alpha-test are performed in turn and then the second z-read and z-test are done afterward. It may happen that the first z-test falsely succeeds due to some overlap occurrence. However, this consistency problem is resolved by the second z-read and z-test. As a result, two z-reads for the pixel cache should be performed in this case. Next, a new z-value is written to the pixel cache according to the result of the z-test.
The third case is that the first z-read results in a pixel cache miss. In spite of a pixel cache miss, as opposed to pretexturing and posttexturing, the pipeline does not stop its execution. The cache miss penalty that transfers a missed block from the frame memory into the pixel cache can be handled simultaneously with the pipeline executions between the first z-read and the second z-read stages. With this scheme, we can reduce the miss penalty. Then, the depth test is performed by the second z-read and z-test operations after texture mapping. The detailed execution flow is shown in Fig. 4 . Next, a new z-value is written to the pixel cache according to the result of the z-test.
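The three cases can be summarized in a small functional model. This is only a sketch under simplifying assumptions that are not part of the proposed hardware: a smaller z-value means nearer, the pixel cache is tracked per pixel rather than per block, the alpha-test always passes, and fragments are processed one at a time so no overlap occurs. The name `process_fragment` and its arguments are hypothetical.

```python
import math
from typing import Dict, Set, Tuple

Pixel = Tuple[int, int]


def process_fragment(pixel: Pixel, z: float,
                     z_buffer: Dict[Pixel, float],
                     cached_pixels: Set[Pixel]) -> Tuple[int, bool, bool]:
    """Return (case number, texture mapped?, z-value written?)."""
    first_hit = pixel in cached_pixels                 # first z-read
    if first_hit and not z < z_buffer.get(pixel, math.inf):
        return 1, False, False                         # case 1: dropped early
    case = 2 if first_hit else 3
    if not first_hit:
        # Case 3: the miss starts a prefetch and the pipeline keeps running.
        cached_pixels.add(pixel)
    # Texture read/filter, texture blend, and alpha-test would run here.
    passed = z < z_buffer.get(pixel, math.inf)         # second z-read, z-test
    if passed:
        z_buffer[pixel] = z                            # z-write
    return case, True, passed


z_buffer: Dict[Pixel, float] = {}
cached: Set[Pixel] = set()
print(process_fragment((0, 0), 0.5, z_buffer, cached))  # (3, True, True)
print(process_fragment((0, 0), 0.8, z_buffer, cached))  # (1, False, False)
print(process_fragment((0, 0), 0.2, z_buffer, cached))  # (2, True, True)
```

Only the first case skips texture mapping; the second and third cases always reach the second z-test, which is what keeps the result correct even when the first test could not be trusted.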
In Fig. 4 , at the first z-read stage, the tag field of the pixel address of an input fragment is looked for in the tag table of the pixel cache. Then, the pixel address of an input fragment and that of a cache tag are compared. If a tag comparison reveals a miss, the cache tag is updated with the pixel address of the fragment and then this address is forwarded to the memory request FIFO queue. The request FIFO queue sends a request for the missing cache blocks to the frame memory. Then, the requested cache blocks are transmitted to the pixel data RAM of the pixel cache. When a fragment reaches the second z-read, the tag is checked again. If its result is a hit, the pipeline execution can be processed immediately. Otherwise, it must wait until the corresponding cache block is completely transmitted from the frame memory.
In the second of the three pipeline processing flows in Fig. 3, it may happen that an older cache block is prematurely overwritten with a new cache block. When the tag check of the first z-read is a hit and that of the second z-read is a miss for the same fragment, the full cache miss penalty must be paid. However, the occurrence rate of this case, denoted as $M1_{pix}$, is so small that it does not affect the performance of the pixel cache significantly. The values of $M1_{pix}$ obtained with some benchmarks will be given in Table 2 of Section 4.3.
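The way the prefetch hides the miss penalty can be illustrated with a simple timing model. The sketch below is not the simulator used in this paper; it assumes that one fragment enters the pipeline per cycle, that the frame memory serves one block request at a time from the request FIFO, and that stalls do not shift the issue times of later fragments. The function name and parameters are hypothetical.

```python
from typing import List, Sequence


def second_read_stalls(miss_issue_cycles: Sequence[int],
                       separation: int,
                       miss_penalty: int) -> List[int]:
    """Stall cycles seen at the second z-read for each first-z-read miss.

    miss_issue_cycles: sorted cycles at which misses are detected.
    separation: pipeline stages between the first and second z-read.
    miss_penalty: cycles the frame memory needs to deliver one block.
    """
    stalls = []
    memory_free_at = 0
    for issue in miss_issue_cycles:
        start = max(issue, memory_free_at)   # wait in the request FIFO
        ready = start + miss_penalty         # block arrives in the pixel cache
        memory_free_at = ready
        arrival = issue + separation         # fragment reaches second z-read
        stalls.append(max(0, ready - arrival))
    return stalls


print(second_read_stalls([0, 100, 200], separation=25, miss_penalty=10))  # [0, 0, 0]
print(second_read_stalls([0, 1, 2], separation=25, miss_penalty=10))      # [0, 0, 3]
print(second_read_stalls([0], separation=0, miss_penalty=10))             # [10]
```

In this model isolated misses are hidden completely once the separation exceeds the miss penalty, while back-to-back misses queue up and can leave a residual stall, which is consistent with the observation that grouped misses limit the penalty reduction.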
EXPERIMENTAL SIMULATIONS AND PERFORMANCE EVALUATION
In order to validate the proposed architecture, various simulation results are given in this section. A trace-driven simulator has been built for the proposed architecture. The traces are generated with three benchmarks, Crystal Space, Quake3, and Lightscape, by modifying the Mesa OpenGL-compatible API (Application Programming Interface). For each benchmark, 100 frames are used to generate each trace. The trace-driven simulator calculates the depth complexity and the z-test failure rate for each pixel with each trace. The overlap condition occurrence rates for various wide separations can also be calculated for each trace with this simulator. The pixel cache simulations that calculate the pixel cache hit rate and the cache miss penalty reduction rate are performed by modifying the well-known Dinero III cache simulator [19]. Fig. 5 shows the captured scenes for the benchmarks. Crystal Space and Quake3 are popular OpenGL-based video game engines. The game engines are architectural walkthroughs with visibility culling [18]. Quake3 in particular is one of the typical current video games and is frequently used as a benchmark in other related works for their simulations. Lightscape is a product of SPECviewperf2 [22] and is an industry-standard benchmark for measuring the performance of 3D rendering systems running under OpenGL. Lightscape is used as a benchmark in this paper because of its high scene complexity, compared with other SPECviewperf2 products.
Depth Complexity versus the Z-Test Failure Rate
Fig. 6 shows the depth complexity and the z-test failure rate for each benchmark as the frames are running. The depth complexity and the average z-test failure rate for each benchmark are given in Table 1. In most cases, the z-test failure rate increases as the corresponding depth complexity increases. The z-test failure rate also indicates the quantity of redundant texture mapping in pretexturing. Therefore, the memory bandwidth saving rate for texture mapping with midtexturing is equal to the z-test failure rate minus the redundant execution rate.
There are three cases that cause redundant executions. The first case is caused by the obscured fragments among those with pixel cache misses at the first z-read stage. The rate of this case can be calculated by multiplying the pixel cache miss rate by the z-test failure rate, which is quite small and, hence, negligible. The second case arises when the first z-test succeeds owing to some overlap occurrence, although it would have failed if consistency had been maintained between the first z-read and the z-write stages. Because the overlap occurrence rate (as described in Section 4.2) is so low, it can be neglected. The third case is caused by the additional z-read operation. However, the memory bandwidth wasted by the additional z-read is not significant because z-data are read from the pixel cache.
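As a worked example of the saving rate, the sketch below subtracts the first two redundant-execution terms from the z-test failure rate. The function name, the parameter names, and the sample numbers (a 40 percent failure rate, as reported for Quake3, and an illustrative 5 percent pixel cache miss rate) are assumptions made for illustration. The extra z-read is left out because it costs pixel cache bandwidth rather than texture bandwidth.

```python
def texture_bandwidth_saving(z_fail_rate: float,
                             pixel_miss_rate: float,
                             overlap_redundancy_rate: float = 0.0) -> float:
    """Estimated fraction of texture traffic saved by midtexturing."""
    # Case 1: obscured fragments that miss the pixel cache at the first
    # z-read skip the early test and are still texture mapped.
    missed_and_obscured = pixel_miss_rate * z_fail_rate
    # Case 2: fragments whose first z-test falsely succeeds because of an
    # overlap; supplied directly as a (hypothetical) measured rate.
    redundant = missed_and_obscured + overlap_redundancy_rate
    return z_fail_rate - redundant


saving = texture_bandwidth_saving(z_fail_rate=0.40, pixel_miss_rate=0.05)
print(f"{saving:.3f}")  # 0.380
```

With these sample inputs the redundant term removes only two percentage points, which illustrates why the saving rate stays close to the z-test failure rate.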
Wide Separation versus Overlap Condition Occurrences
Fig. 7 shows the occurrence rate of the overlap condition when the separations between the z-read and the z-write stages are 10, 50, and 100 stages wide. The simulation assumes that the geometry processing unit, the frame memory system, and the texture mapping memory system are perfect; that is, the latency of each unit is assumed to be zero. Thus, it can be assumed that one pixel is generated per cycle with one pixel pipeline. The simulation results show that the performance degradation due to the wide separation is quite low because the overlap conditions occur very rarely.
Pixel Cache Hit Ratio and Cache Miss Penalty Reduction
Fig. 8 shows the cache miss rates of a conventional pixel cache architecture for various cache sizes, block sizes, and set associativities. The simulation results show that the miss rate varies with the block size, but not with the cache size or the associativity. They also show that, as the block size increases, the miss rate decreases. If the cache miss rate is 5 percent and the miss penalty takes 10 cycles, the performance is degraded by about 35 percent. As semiconductor technology advances, this performance degradation also increases due to the increase in miss-penalty cycles. Moreover, in a conventional architecture, the performance degradation caused by the cache miss penalty is unavoidable.
Fig. 9 shows the cache miss penalty reductions by midtexturing with various wide separations for a direct-mapped 16 Kbyte cache with a block size of 64 bytes. We assume that the miss penalty can be handled in 10 cycles, considering the clock frequencies of a rendering processor and a high-performance memory, such as a Double Data Rate memory. Even though the pixel cache miss rates are similar for each benchmark, as shown in Fig. 8, the miss penalty reductions vary because of their cache miss distributions; for example, several groups of misses occur in Lightscape. We assume that the geometry processing unit and the texture mapping memory system are perfect. A perfect zero-latency frame memory system can be achieved if the pixel cache always hits or if the miss penalty is reduced to zero. Fig. 9 shows that the miss penalty reduction rates decrease when the length of the wide separation is over 50, due to premature overwriting of an older cache block with a new one, as described in the previous section. However, the decrease in the miss penalty reduction may not be significant because a stage separation of 50 is sufficient for a practical processor implementation. The values of $M1_{pix}$, the rate at which the first z-read hits and the second z-read misses for a fragment, as defined in Section 3, are given in Table 2 for various wide separations.
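The rate of first-z-read hits that turn into second-z-read misses can be measured from a trace with a small tag-table model. The sketch below is not the modified Dinero III simulator; it assumes a direct-mapped tag table, one fragment per cycle, a second z-read exactly `separation` fragments after the first, and an immediate tag update on every miss. The function name and the toy traces are hypothetical.

```python
from typing import List, Optional, Sequence


def m1_pix_rate(blocks: Sequence[int], num_lines: int, separation: int) -> float:
    """Fraction of fragments that hit at the first z-read but miss at the second.

    blocks: frame-memory block index touched by each fragment, in order.
    """
    if separation < 1:
        raise ValueError("separation must be at least 1")
    tags: List[Optional[int]] = [None] * num_lines
    first_hit: List[bool] = []
    premature = 0
    for t in range(len(blocks) + separation):
        old = t - separation
        if 0 <= old < len(blocks):             # second z-read of older fragment
            line = blocks[old] % num_lines
            if tags[line] != blocks[old]:
                if first_hit[old]:
                    premature += 1             # block was overwritten in between
                tags[line] = blocks[old]       # fetch the block again
        if t < len(blocks):                    # first z-read of new fragment
            line = blocks[t] % num_lines
            hit = tags[line] == blocks[t]
            first_hit.append(hit)
            if not hit:
                tags[line] = blocks[t]         # tag updated, prefetch issued
    return premature / len(blocks) if blocks else 0.0


print(m1_pix_rate([0, 0, 1, 1], num_lines=4, separation=2))  # 0.0
print(m1_pix_rate([0, 0, 4, 0], num_lines=4, separation=2))  # 0.5
```

The second toy trace maps blocks 0 and 4 to the same cache line, so it deliberately provokes the premature overwriting; real traces with good spatial locality would be expected to behave more like the first one.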

Performance Evaluation
To evaluate the performance of the pretexturing and the midtexturing schemes analytically, we calculate the average cycles per fragment (ACPF). For calculating ACPF, we assume that only the miss penalties of the pixel cache and the texture cache can degrade the overall performance because they are the most important factors in estimating the performance of a rendering processor. This assumption implies that the memory bandwidth between the caches and the external memory is infinite and does not affect the overall performance. Then, the ACPF of the pretexturing scheme can be computed as follows:
$$ACPF_{pre} = H_{pix} \times H_{tex} + (1 - H_{pix}) \times T_{pix} + (1 - H_{tex}) \times T_{tex}$$
where $H_{pix}$ and $H_{tex}$ are the hit rates of the pixel cache and the texture cache, respectively, and $T_{pix}$ and $T_{tex}$ are the cycle times for the miss penalties of the pixel cache and the texture cache, respectively. $ACPF_{pre} = 1$ when both the pixel cache and the texture cache hit, $ACPF_{pre} = T_{pix}$ if a pixel cache miss occurs, and $ACPF_{pre} = T_{tex}$ if a texture cache miss occurs. The ACPF of the midtexturing scheme can be easily obtained by analyzing the three cases of the processing flow described in Section 3.
$$ACPF_{mid} = H_{pix} \times \alpha \times 1 + H_{pix} \times (1 - \alpha) \times [H1_{pix} \times H_{tex} + (1 - H1_{pix}) \times T_{pix} + (1 - H_{tex}) \times T_{tex}] + (1 - H_{pix}) \times [(1 - H_{tex}) \times T_{tex} + T_{pix} \times (1 - reduction)]$$
where $\alpha$ represents the occlusion rate by the first z-test failure, $H1_{pix} = 1 - M1_{pix}$, and $reduction$ is the pixel cache miss penalty reduction rate as given in Fig. 9. The above equation shows that the occurrence rates of the three cases of the processing flow are $H_{pix} \times \alpha$, $H_{pix} \times (1 - \alpha)$, and $1 - H_{pix}$, respectively. $ACPF_{mid} = 1$ for the first case. Because the processing flow after the first z-test stage of the second case is equal to that of the pretexturing scheme and the pixel cache hit rate in this case is $H1_{pix}$,
$$ACPF_{mid} = H1_{pix} \times H_{tex} + (1 - H1_{pix}) \times T_{pix} + (1 - H_{tex}) \times T_{tex}.$$
In the third case of the processing flow, because the second z-read is performed after texture mapping with cache miss penalty reduction, $ACPF_{mid} = (1 - H_{tex}) \times T_{tex} + T_{pix} \times (1 - reduction)$.
Fig. 10 shows the ACPFs of the pretexturing and the midtexturing schemes when the stage separation lengths are 25, 50, and 100. The pixel cache configuration is assumed to be direct-mapped with a cache size of 16 Kbytes and a block size of 64 bytes. The miss ratio of the pixel cache is given in Fig. 8. We assume that $T_{pix}$ takes 10 cycles. The prefetch scheme for texture mapping can achieve at least 97 percent of the performance of a zero-latency memory system, even though the fragment texel miss rate is 12 percent [13]. Thus, we assume that $H_{tex}$ is 0.88 and $(1 - H_{tex}) \times T_{tex} = 0.15$. For Crystal Space and Quake3, the proposed architecture performs at almost zero latency in accessing the memory and improves performance by over 37 percent, compared with the pretexturing scheme. In the case of Lightscape, a performance improvement between 8 percent and 17 percent can be achieved. Because the midtexturing scheme also reduces the memory bandwidth requirement compared with the pretexturing scheme, its performance advantage over pretexturing would be even larger if the memory bandwidth between the caches and the external memory is limited.
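The ACPF model can be evaluated numerically with a few lines of code. The sketch below is a reading of the case description rather than the authors' code: it assumes that the pretexturing cost is `h_pix * h_tex + (1 - h_pix) * t_pix + (1 - h_tex) * t_tex` and that the midtexturing cost weights the three cases by their occurrence rates. The function names, the `occlusion` parameter, and all sample inputs other than $T_{pix} = 10$, $H_{tex} = 0.88$, and $(1 - H_{tex}) \times T_{tex} = 0.15$ are illustrative assumptions.

```python
def acpf_pre(h_pix: float, h_tex: float, t_pix: float, t_tex: float) -> float:
    """Average cycles per fragment for pretexturing (assumed form)."""
    return h_pix * h_tex + (1 - h_pix) * t_pix + (1 - h_tex) * t_tex


def acpf_mid(h_pix: float, h1_pix: float, h_tex: float,
             t_pix: float, t_tex: float,
             occlusion: float, reduction: float) -> float:
    """Average cycles per fragment for midtexturing (assumed form)."""
    case1 = 1.0                                    # hit, first z-test fails
    case2 = acpf_pre(h1_pix, h_tex, t_pix, t_tex)  # hit, first z-test passes
    case3 = (1 - h_tex) * t_tex + t_pix * (1 - reduction)  # first z-read miss
    return (h_pix * occlusion * case1
            + h_pix * (1 - occlusion) * case2
            + (1 - h_pix) * case3)


T_PIX = 10.0
H_TEX = 0.88
T_TEX = 0.15 / (1 - H_TEX)   # chosen so that (1 - H_tex) * T_tex = 0.15

# Illustrative inputs only: 95 percent pixel cache hit rate, no premature
# overwriting, 40 percent occlusion, 90 percent miss penalty reduction.
pre = acpf_pre(0.95, H_TEX, T_PIX, T_TEX)
mid = acpf_mid(0.95, 1.0, H_TEX, T_PIX, T_TEX, occlusion=0.40, reduction=0.90)
print(f"pretexturing: {pre:.3f}, midtexturing: {mid:.3f}")
```

With a perfect pixel cache the pretexturing cost in this form reduces to 0.88 + 0.15 = 1.03 cycles, which matches the stated 97 percent of zero-latency performance for the texture prefetch scheme.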
From the viewpoint of hardware complexity, the proposed architecture requires additional hardware for the first z-read stage, that is, one read port for the z-data of the pixel cache. Adding one read port to the cache may increase the cache size by about 20 percent. In the proposed architecture, however, an increase of only 10 percent is enough because the additional read port is needed only for the z-data, not for the color data.
CONCLUSIONS
In this paper, we proposed a new pixel rasterization pipeline architecture for a rendering processor. The proposed architecture eliminates memory bandwidth waste and requires a normal-sized fragment queue instead of a wider one. It can also reduce the pixel cache miss penalty by up to 90 percent. We have built a trace-driven simulator for the proposed architecture. The simulation results show that the proposed rendering architecture is effective in terms of performance and efficiency. An accurate dynamic cycle simulator for the proposed architecture is currently under development. An effective realization unit including bump mapping, environment mapping, etc., is also being studied. The proposed rendering processor will be implemented on a chip in the near future.
