ABSTRACT
INTRODUCTION
3-D graphics is emerging rapidly in consumer electronics. Because of its vivid visual effects, 3-D graphics plays an important role in multimedia, entertainment, virtual reality, and user interfaces. Although many approaches have been proposed for PC-based or entertainment platforms, 3-D graphics rendering still seldom appears in embedded systems such as PDAs, mobile phones, and car navigation systems.
One of the major reasons is computing power. Many embedded systems are equipped with low-tier CPUs. Especially in portable devices, low-power and low-cost requirements limit the use of high-performance CPUs. Hence, 3-D graphics rendering in pure software suffers from low speed and poor image quality. Previous research tried to improve this with a modified API [1], and 10k polygons were reported without lighting, shading, and texture mapping. Such speed and image quality can hardly support fancy 3-D graphics applications.
On the other hand, the 3-D processor approach [2] [3] costs too much to be realized in embedded systems. Because 3-D graphics rendering is computation-intensive and 3-D graphics applications require high image quality, commercial 3-D processors are designed to achieve high performance. A performance-driven architecture demands high computation power, large memory, and huge bandwidth. These factors are bottlenecks to realizing 3-D graphics rendering in embedded systems. Hence, a 3-D graphics rendering approach for embedded systems is desired, and it can be used in many consumer electronics devices, such as set-top boxes, car navigation systems, PDAs, and mobile phones.

In our previous research, we proposed the index rendering [4] and deferred lighting [5] approaches. These approaches reduce redundant operations on hidden pixels and lighting operations on invisible triangles, and they can be applied in embedded systems. In this paper, we further extend the deferred lighting approach to eliminate the transformations on invisible triangles. Because transformations are a heavy burden in the geometry subsystem, the enhanced version of deferred lighting saves more operations. As a result of these design decisions, the traditional rendering pipeline is divided into two pipelines, and this new architecture removes many unnecessary operations without image quality loss.

This work was supported by the National Science Council, Taiwan, R.O.C., under Grant NSC-89-2215-E009-052.
The organization of this paper is as follows. In Section 2, we first review the 3-D graphics pipeline and show the strategies to reduce operations. Then, we introduce our new architecture for embedded systems in Section 3. Because of index rendering and enhanced deferred lighting, this architecture features two separate pipelines. In Section 4, we present the simulation and analysis of this architecture. Finally, we conclude this paper in Section 5.
2. 3-D GRAPHICS RENDERING PIPELINE
The 3-D graphics rendering pipeline is generally divided into two parts: the geometry subsystem and the raster subsystem. The geometry subsystem transforms vertices and performs lighting and perspective transformation. The raster subsystem receives the output of the geometry subsystem and renders the transformed polygons for display. These two subsystems are generally pipelined for high throughput, as shown in Fig. 1.

In the viewing transformation, the CPU needs 12 multiplication and 9 addition instructions. The equation is as follows:
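As a minimal sketch of where the 12 multiplications and 9 additions can come from, assume the viewing transformation is a 3x4 matrix applied to a homogeneous vertex (x, y, z, 1). The matrix layout and the name `view_transform` are assumptions made for illustration and are not taken from the paper.

```python
def view_transform(m, v):
    """Apply a 3x4 viewing matrix m to a vertex v = (x, y, z).

    Returns the transformed vertex and the (multiplications, additions)
    count, treating the homogeneous coordinate w = 1 as a real operand.
    """
    x, y, z = v
    w = 1
    out = []
    muls = adds = 0
    for row in m:
        # Each output coordinate costs 4 multiplications and 3 additions.
        out.append(row[0] * x + row[1] * y + row[2] * z + row[3] * w)
        muls += 4
        adds += 3
    return tuple(out), (muls, adds)


# Hypothetical matrix: identity rotation with a translation of (1, 2, 3).
M = [
    [1, 0, 0, 1],
    [0, 1, 0, 2],
    [0, 0, 1, 3],
]
vertex, ops = view_transform(M, (10, 20, 30))
```

Under this assumption, three output coordinates at four multiplications and three additions each give the stated totals.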
For the perspective transformation, the equation is assumed as in Eq. 3.

Setup is the operation that prepares the necessary information for rasterization. In the setup operation, two kinds of data are generated: the first is related to shape information, while the second is related to color information.
Lighting
Lighting is the essential procedure that calculates the illumination at a given position according to a lighting model. Nowadays 3-D
Setup for shape information
Because triangles are described by their vertices in the geometry subsystem, the setup of shape information prepares each triangle for scan conversion into a group of pixels.

The conventional scanline-based method, which describes a triangle by its left-side and right-side edges, works well on PC-based platforms, but the edge description is a problem in embedded systems. The edges are usually described by their slopes, and division operations are needed to generate the slopes. Because most embedded systems employ low-tier CPUs, the divisions for edge setup are a large burden. In addition, this method decomposes a triangle into two scanline-aligned triangles, so the other triangle information is duplicated for data transmission, which may cause bandwidth problems.

In embedded systems, Pineda's algorithm [9] is more suitable, because it uses only integer operations and does not need to decompose the triangle. This algorithm represents each edge of a triangle by a linear edge function, which divides the plane into two parts. As shown in Fig. 3(a), two vertices (x1, y1) and (x2, y2) define a linear equation E12(x, y). To detect in which part a pixel (x, y) is located, we simply substitute its (x, y) coordinates into E12(x, y) and check whether the result is greater or less than zero. For the three edges of a triangle, a pixel is inside the triangle only when all three edge functions are positive or all are negative, as shown in Fig. 3(b) and Fig. 3(c). Because the all-positive or all-negative result depends on the direction of the three edge vectors, this algorithm can also perform back-face culling.

Although Pineda's algorithm requires three edge functions to be set up, we do not need to find the actual parameters of E(x, y). Because the two end-point vertices lie on the edge, the edge function must be zero at those vertices. For example, for an edge function E12(x, y) defined by vertices (x1, y1) and (x2, y2), E12(x1, y1) and E12(x2, y2) must be zero. For another pixel (xk, yk), the edge function becomes:

Therefore, both the setup and the rasterization of the triangle shape can use only integer operations. Although Pineda's algorithm was designed for parallel rendering, it is also suitable for 3-D graphics rendering in embedded systems.

Setup for color information

As shown in Fig. 4, we take the R value as an example. The lighting operation gives the color intensity at the three vertices, and hence a plane is defined in this space:

The intensity of R at (x1, y1) is known as R1; therefore the R value at (xk, yk) is:
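The two setup steps can be sketched as follows. This is only an illustration under stated assumptions: the edge function uses the common form E12(x, y) = (x - x1)(y2 - y1) - (y - y1)(x2 - x1), and the color plane is solved with a single reciprocal shared by all color components. The paper's own equations are not reproduced here, so the exact sign convention and the function names (`edge`, `inside`, `color_setup`, `color_at`) are hypothetical.

```python
def edge(x1, y1, x2, y2, x, y):
    """Edge function through (x1, y1) and (x2, y2), evaluated at (x, y).

    It is zero at both end points and needs only integer multiplications
    and subtractions when the inputs are integers.
    """
    return (x - x1) * (y2 - y1) - (y - y1) * (x2 - x1)


def inside(tri, x, y):
    """True when all three edge functions agree in sign (zero counts as inside)."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    e = (
        edge(x1, y1, x2, y2, x, y),
        edge(x2, y2, x3, y3, x, y),
        edge(x3, y3, x1, y1, x, y),
    )
    return all(v >= 0 for v in e) or all(v <= 0 for v in e)


def color_setup(tri, colors):
    """Per-triangle color setup: one plane per color component.

    A single reciprocal is computed per triangle and reused for every
    component; a degenerate (zero-area) triangle would divide by zero.
    """
    (x1, y1), (x2, y2), (x3, y3) = tri
    dx2, dy2, dx3, dy3 = x2 - x1, y2 - y1, x3 - x1, y3 - y1
    inv_d = 1.0 / (dx2 * dy3 - dx3 * dy2)  # the only division per triangle
    c1, c2, c3 = colors
    grads = []
    for a, b, c in zip(c1, c2, c3):
        gx = ((b - a) * dy3 - (c - a) * dy2) * inv_d
        gy = (dx2 * (c - a) - dx3 * (b - a)) * inv_d
        grads.append((gx, gy))
    return c1, grads


def color_at(tri, setup, x, y):
    """Color at (x, y) on the plane through the three vertex colors."""
    x1, y1 = tri[0]
    base, grads = setup
    return tuple(
        b + gx * (x - x1) + gy * (y - y1) for b, (gx, gy) in zip(base, grads)
    )


# Hypothetical triangle with (R, G, B) values at its three vertices.
TRI = ((0, 0), (4, 0), (0, 4))
SETUP = color_setup(TRI, ((0, 0, 0), (8, 0, 4), (0, 8, 4)))
```

In this sketch the per-pixel work in `color_at` uses only multiplications and additions, which matches the motivation for keeping divisions out of the per-pixel path on a low-tier CPU.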
The value in Eq. 6 is calculated in the setup stage. Because the Ku term depends only on the vectors between vertices, it is the same for the R, G, and B color components. Hence, only one division is necessary for each triangle, and the other color information can be generated by multiplication and addition operations.

Raster Subsystem

In the conventional rendering pipeline, the raster subsystem handles rasterization. Rasterization consists of three subtasks: scan conversion, visibility comparison, and shading [12]. Scan conversion decomposes a polygon into a group of pixels. It is handled in the Rasterize block in Fig. 1. Hence, the operations before Rasterize are triangle-level operations, while those after Rasterize are pixel-level operations. Shading colors each pixel for display, and texture mapping is also applied here. Visibility comparison determines the visibility of each pixel, and the Z-test algorithm is the most common one.

In order to eliminate redundant operations on invisible triangles and invisible pixels, we use index rendering and deferred lighting to realize the raster subsystem. We discuss them further in the following section.

THE PROPOSED ARCHITECTURE

In this paper, we propose a new architecture for 3-D graphics rendering in embedded systems. As shown in Fig. 5, the chipset, consisting of the Rasterizer Controller (RC) and the Color Shader (CS), realizes the raster subsystem and setup, while the CPU handles the geometry subsystem. Two blocks of memory are used for data storage. One holds the original object models and can be realized by ROM or RAM, depending on the application. This database is named GTdb (Global Triangle Database). The other memory block is for temporary storage during 3-D graphics rendering, and therefore it should be realized by RAM. The hardware architecture is based on our index rendering [4] and the enhanced version of our deferred lighting [5] approaches.
Index Rendering
Index rendering is an approach that avoids redundant operations on invisible pixels. It is also the essential architecture for realizing deferred lighting. The major concepts of index rendering are separating triangle and pixel data to exploit parallelism, and rearranging operations for optimal data flow.
The traditional rendering architecture is a long pipeline, and therefore triangles and pixels carry their whole data through all pipeline stages. In fact, most pipeline stages relate to only some parts of the data; the other parts are merely stored and forwarded. On the other hand, a pipeline by nature has a fixed data flow, which is a limitation.

In our approach, index rendering, we use an index to separate triangle and pixel data to exploit parallelism. The index is a serial number assigned to each polygon. We use this index to denote the information and the pixels that come from this parent polygon. In the rendering pipeline, the information is stored in a database, and each pixel carries only its index number through the long rendering pipeline. If one part of the information is needed in a pipeline operation, we can fetch it from the database on demand.
In order to eliminate redundant shading operations on invisible pixels, index rendering stores shading information in TdbS (Triangle Database for Shading), as shown in Fig. 5. After the Z-test, the index numbers of visible pixels are stored in a screen-size buffer, named the I-buffer, as shown in Fig. 6. Then, we can calculate the color value of each pixel from Eq. 7 to generate the final image.
In Fig. 5, the CS handles the shading operations, while the RC handles the other operations in the raster subsystem and setup. The job of the RC is to generate the index pattern in the I-buffer and the data in TdbS, and the CS uses the data in the I-buffer and TdbS to generate the final result. Because the I-buffer and TdbS keep enough information to generate the final result, the frame buffer can be optional if the CS can generate pixels at the screen scan-out rate.
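The division of work between the index pass and the shading pass can be sketched as follows. This is a simplified illustration, not the RC/CS hardware: fragments are assumed to be already scan-converted, a smaller z is assumed to be nearer, and TdbS is reduced to one flat color per triangle index. The names `render_indices` and `shade` are hypothetical.

```python
def render_indices(fragments, width, height):
    """Z-test pass: keep only the index of the nearest triangle per pixel.

    fragments: iterable of (x, y, z, index) tuples.
    Returns the I-buffer, a screen-size grid of triangle indices
    (None marks a background pixel).
    """
    z_buffer = [[float("inf")] * width for _ in range(height)]
    i_buffer = [[None] * width for _ in range(height)]
    for x, y, z, index in fragments:
        if z < z_buffer[y][x]:
            z_buffer[y][x] = z
            i_buffer[y][x] = index
    return i_buffer


def shade(i_buffer, tdbs, background=(0, 0, 0)):
    """Shading pass: one database lookup per visible pixel, none for hidden ones."""
    return [
        [tdbs[index] if index is not None else background for index in row]
        for row in i_buffer
    ]


# Triangle 0 covers both pixels at z = 5; triangle 1 hides it at pixel (1, 0).
FRAGMENTS = [(0, 0, 5.0, 0), (1, 0, 5.0, 0), (1, 0, 2.0, 1)]
TDBS = {0: (255, 0, 0), 1: (0, 255, 0)}
I_BUFFER = render_indices(FRAGMENTS, width=2, height=1)
IMAGE = shade(I_BUFFER, TDBS)
```

The hidden fragment of triangle 0 at pixel (1, 0) never reaches `shade`, which is the saving the approach aims for.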
Enhanced Deferred Lighting
Beyond eliminating redundant operations on invisible pixels, our deferred lighting approach avoids redundant operations on invisible triangles. This approach defers the lighting calculation until after the Z-test. If all pixels of a triangle fail the Z-test, the triangle is invisible, and hence we can skip the lighting calculation for it. This idea is straightforward but hard to realize in the traditional rendering pipeline. With index rendering, it can be realized in the 3-D graphics rendering pipeline. This approach was proposed in [5].

In this paper, we further extend the deferred lighting approach to eliminate the transformations on invisible triangles. To do this, the triangle information must be separated before processing. The triangle information related to geometry, such as the (x, y, z) coordinates, goes first to define the shape of the triangle, which is then scan-converted into a group of pixels. After all pixels are Z-tested, we know whether the triangle is hidden. If any pixel of the triangle passes the Z-test, the triangle information related to shading enters the rendering pipeline. After the setup operation, the shading information is stored in TdbS. Finally, the resulting image is generated from the I-buffer and TdbS.
Dual Pipeline Rendering Architecture
With the index rendering and enhanced deferred lighting approaches, the 3-D graphics rendering hardware becomes a dual pipeline architecture. Fig. 7 shows the rendering pipelines: Fig. 7(a) shows the traditional one, while Fig. 7(b) shows our dual pipeline architecture. The major difference is that we divide the 3-D rendering pipeline into two parallel pipelines. The ASIC chip handles the operations in the gray shaded area, and the CPU of the embedded system handles the other area.
To render a triangle in the dual pipeline architecture, the upper pipeline goes first. The upper pipeline takes as input the triangle information related to geometry, namely the coordinates of the vertices. After all pixels of this triangle are Z-tested, a signal is sent to the CPU to indicate whether the triangle is visible. If the triangle is invisible, the other part of the triangle information is discarded. If the triangle is visible, the other part of the triangle information enters the second pipeline. Because the Phong lighting model is applied, the triangle information related to shading consists of the normal vectors at the vertices of the triangle.
In the dual pipeline architecture, the setup is divided into two parts. The setup in the upper pipeline handles shape generation, and the setup in the second pipeline handles color generation. The terms shared by Eq. 5 and Eq. 6, namely the vectors, can be reused to reduce operations.
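The control flow of the two pipelines can be sketched as follows. This is a minimal software model under several assumptions: each triangle arrives with its pixels already scan-converted, a smaller z is nearer, and a clamped dot product stands in for the lighting model. The dictionary layout and the names `diffuse` and `dual_pipeline` are hypothetical.

```python
def diffuse(normal, light_dir=(0.0, 0.0, 1.0)):
    """Stand-in lighting term: clamped dot product of normal and light direction."""
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))


def dual_pipeline(triangles, width, height):
    """Run the geometry pipeline first; light a triangle only if it is visible.

    triangles: list of dicts with 'pixels' [(x, y, z)] and 'normals' [(nx, ny, nz)].
    Returns the I-buffer, the shading database, and the number of lit vertices.
    """
    z_buffer = [[float("inf")] * width for _ in range(height)]
    i_buffer = [[None] * width for _ in range(height)]
    tdbs = {}
    lit_vertices = 0
    for index, tri in enumerate(triangles):
        # Upper pipeline: geometry only, ending with the Z-test of every pixel.
        visible = False
        for x, y, z in tri["pixels"]:
            if z < z_buffer[y][x]:
                z_buffer[y][x] = z
                i_buffer[y][x] = index
                visible = True
        # Visibility signal: shading data enters the second pipeline
        # only when at least one pixel passed the Z-test.
        if visible:
            tdbs[index] = [diffuse(n) for n in tri["normals"]]
            lit_vertices += len(tri["normals"])
    return i_buffer, tdbs, lit_vertices


NEAR = {"pixels": [(0, 0, 1.0)], "normals": [(0.0, 0.0, 1.0)] * 3}
FAR = {"pixels": [(0, 0, 5.0)], "normals": [(0.0, 0.0, 1.0)] * 3}
I_BUFFER, TDBS, LIT = dual_pipeline([NEAR, FAR], width=1, height=1)
```

In this model the normals of the hidden triangle are never touched. Note that the outcome depends on submission order: a far triangle drawn before a nearer one would pass the Z-test at that moment and would still be lit.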
SIMULATION AND ANALYSIS
The performance of index rendering and deferred lighting has already been analyzed and simulated. Hence, we demonstrate here the performance of the enhanced version of deferred lighting. Compared with our previous deferred lighting approach, enhanced deferred lighting further eliminates the redundant transformations on invisible triangles. Because the CPU handles this part in an embedded system, we focus on the reduction in the CPU's operations.
Java [17] and Mesa [18] were used to develop our simulation environment, and two 3-D object models, Dolphins and Castle, are applied, as shown in Fig. 8 [19]. Their original triangle numbers are listed in Table 1. Because Pineda's algorithm is applied, the triangles are

After simulation at a resolution of 320x200, we find the numbers of triangles that pass back-face culling and that are visible in the final image. We denote the original triangle number as (a), the number of triangles that pass back-face culling as (b), and the number of visible triangles as (c). The visible ratio equals the visible triangle number (c) divided by the triangle number that should be

Two realizations of the traditional architecture are compared. TYPE II, the traditional architecture without back-face culling, is a straightforward way to realize the traditional architecture but wastes computation power. TYPE I, the traditional architecture with back-face culling, is a better way to realize it: the rendering pipeline performs back-face culling after lighting and before triangle setup. As shown in Table 2, the numbers of operations on vertex coordinates all equal three times the original triangle number, (a) x 3, no matter which architecture is applied.
On the other hand, the numbers of operations on normal vectors vary with the architecture. The normal vectors are needed to perform the lighting operation at each vertex. In the traditional architecture without back-face culling (TYPE II), the number of lighting operations equals three times the original triangle number, (a) x 3. In the traditional architecture with back-face culling (TYPE I), the number equals (b) x 3. In our proposed architecture, the number becomes (c) x 3. From the data in Table 1, we can see the improvement in reducing lighting operations. Moreover, owing to enhanced deferred lighting, the number of viewing transformations on normal vectors is also reduced to (c) x 3.
Then, the CPU costs of generating the 3-D graphics are analyzed. From Fig. 9, we can analyze the CPU costs for transformations and lighting on the three types of architecture: traditional architecture TYPE I, TYPE II, and the proposed architecture. Because no specific CPU or platform is assumed, we assign a cost to each CPU instruction as a basis for measuring performance: an addition costs 1, a multiplication costs 2, and a reciprocal square-root instruction costs 16. Because each triangle has three vertices, the number of operations equals three times the related triangle number. Triangle strips and fans are not considered here, for a fair comparison.

Table 2
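The cost model can be written down as a short sketch. The instruction weights and the 12-multiplication, 9-addition viewing transformation come from the text; the triangle counts used below are hypothetical placeholders, not the values of Table 1, and the helper names are illustrative.

```python
# Instruction weights stated in the text.
COST = {"add": 1, "mul": 2, "rsqrt": 16}


def cpu_cost(adds=0, muls=0, rsqrts=0):
    """Weighted cost of a group of CPU instructions."""
    return adds * COST["add"] + muls * COST["mul"] + rsqrts * COST["rsqrt"]


# One viewing transformation: 12 multiplications and 9 additions.
VIEW_TRANSFORM_COST = cpu_cost(adds=9, muls=12)


def lighting_operations(a, b, c):
    """Number of per-vertex lighting operations for each architecture.

    a: original triangles, b: triangles passing back-face culling,
    c: visible triangles. Each triangle contributes three vertices.
    """
    return {"TYPE II": 3 * a, "TYPE I": 3 * b, "proposed": 3 * c}


# Hypothetical triangle counts, for illustration only.
counts = lighting_operations(a=1000, b=500, c=200)
vertex_transform_total = 3 * 1000 * VIEW_TRANSFORM_COST
```

The cost of one lighting operation is not given in this section, so the sketch stops at operation counts for lighting and gives a weighted cost only for the vertex transformations.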
CONCLUSION
This paper proposes a new architecture for computation-effective 3-D graphics rendering in embedded multimedia systems. It is based on our index rendering and enhanced deferred lighting approaches. Compared with the traditional architecture, its distinguishing feature is the dual pipeline rendering architecture. This architecture is computation-effective because it renders 3-D graphics images with fewer operations and without image quality loss. We achieve this by eliminating the redundant operations on hidden pixels and invisible triangles.
By simulation and analysis in resolution 320x200, the result shows our dual pipeline architecture can reduce 
