Abstract-This paper presents a power-aware, dynamically reconfigurable rendering engine design whose power consumption scales with rendering throughput and image quality. At the algorithm level, a precision-aware shading scheme is proposed to improve the power efficiency of the conventional shading algorithm through a combination of precision detection and fraction masking techniques. At the architecture level, a processing element (PE) based scalable architecture is combined with dynamic task scheduling and dispatching techniques to raise the hardware utilization rate and reduce computation latency. Finally, a prototype design that delivers 453 MPixels/s, 16.4 MTriangles/s, and 2.24 MPixels/mJ is presented.
I. INTRODUCTION
Real-time 3-D computer graphics plays an important role in current multimedia applications. The need for better image quality and real-time response drives system designers to increase rendering performance, which in turn causes enormous power consumption. Many studies show that low-power design is the near-term grand design challenge and that power will be the only limitation in the future. Power efficiency, throughput, and area are the metrics used to evaluate a design. The power efficiency of a rendering system is defined as pixel rendering throughput over power [1], i.e., how many pixels can be output per unit of energy consumed during computation (Pixels/mJ).
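As a unit check on the metric defined above, the following minimal sketch computes power efficiency from a throughput and a power figure. The function names and the numbers in the usage note are illustrative assumptions, not values taken from this design.

```python
def power_efficiency_mpixels_per_mj(throughput_mpixels_per_s, power_mw):
    """Pixel throughput over power, in MPixels/mJ.

    (MPixels/s) / (mW) = (1e6 pixels/s) / (1e-3 J/s) = 1e9 pixels/J,
    and 1 MPixel/mJ is also 1e9 pixels/J, so the plain ratio already
    carries the unit MPixels/mJ.
    """
    if power_mw <= 0:
        raise ValueError("power must be positive")
    return throughput_mpixels_per_s / power_mw


def energy_per_pixel_pj(efficiency_mpixels_per_mj):
    """Energy per pixel in pJ: 1 mJ = 1e9 pJ spread over 1e6 * efficiency pixels."""
    if efficiency_mpixels_per_mj <= 0:
        raise ValueError("efficiency must be positive")
    return 1000.0 / efficiency_mpixels_per_mj
```

For a hypothetical engine that sustains 100 MPixels/s at 50 mW, the sketch gives 2.0 MPixels/mJ, which corresponds to 500 pJ per pixel.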
Traditional low-power VLSI design techniques are constrained by the iteration bound, so further power reduction relies on other techniques. Static power reduction mainly relies on multi-threshold voltage cells, and most of the effort is handled by EDA tools. For dynamic power reduction, dynamic voltage scaling (DVS), dynamic frequency scaling (DFS), and clock gating are popular approaches. Reference [2] points out the increasing importance of power awareness for VLSI systems. A power-aware system is defined as a system that is able to adjust its power consumption in response to varying operating conditions. The changes may be brought about by time-varying inputs, desired output quality, or simply environmental conditions. Regardless of whether they were engineered to be power aware, systems display variations in power consumption as operating conditions change.
According to the above definition, high power awareness for a rendering system means the capability to adjust its power consumption according to rendering throughput and image quality. Dynamic power consumption should increase with throughput, and the relation should be as linear as possible. To achieve this goal, a processing element (PE) based scalable architecture and dynamic reconfiguration schemes are adopted for linear scaling. Because the limited parallelism of the rendering algorithm restricts scalability and the responsiveness of power consumption, a proper division of the rendering steps helps to improve both scalability and hardware utilization. Inside each PE, to keep power consumption linearly related to the output pixel rate and image quality, the power efficiency of the algorithm must be taken into consideration. In this paper, we improve power performance with two proposed methods:
Precision-aware linear interpolation.
PE-based scalable architecture with dynamic task scheduling, dispatching, and processing.
The architecture is verified on an FPGA, and a prototype chip design provides the physical specification for performance measurement and power simulation.
II. ALGORITHM LEVEL DESIGN
Reference [3] shows that Gouraud shading of a triangle is decomposed into two stages: span generation and span interpolation. The functions of span generation and span interpolation are illustrated in Fig. 1. Span generation creates the pixels on the three edges based on the three vertices of a triangle. For each span, the pixels are linearly interpolated from the pixels on the leading edge and the trailing edge. The two stages are similar because both are special cases of generalized multi-dimensional linear interpolation.
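The two stages can be sketched with a single incremental interpolation routine that is used once along the edges and once along each span. This is a minimal illustration under simplifying assumptions: it handles only a flat-bottom triangle, uses floating-point slopes instead of fixed-point hardware arithmetic, and all function names are hypothetical.

```python
def interpolate(start, end, steps):
    """Incremental (DDA-style) linear interpolation of attribute tuples.

    Yields steps + 1 samples from start to end inclusive. The slope setup
    costs one division per attribute; each sample costs one addition.
    """
    if steps == 0:
        yield tuple(start)
        return
    deltas = [(e - s) / steps for s, e in zip(start, end)]
    current = list(start)
    for _ in range(steps + 1):
        yield tuple(current)
        current = [c + d for c, d in zip(current, deltas)]


def generate_spans(apex, base_left, base_right):
    """Span generation: walk the leading and trailing edges scanline by scanline.

    Vertices are (X, Y, Z, R, G, B) tuples; base_left and base_right are
    assumed to share the same Y (flat-bottom triangle).
    """
    rows = base_left[1] - apex[1]
    leading = interpolate(apex, base_left, rows)
    trailing = interpolate(apex, base_right, rows)
    return list(zip(leading, trailing))


def interpolate_span(left, right):
    """Span interpolation: fill one scanline between its two edge pixels."""
    width = round(right[0]) - round(left[0])
    return list(interpolate(left, right, width))


def render(apex, base_left, base_right):
    """Return every shaded pixel of the triangle as an attribute tuple."""
    pixels = []
    for left, right in generate_spans(apex, base_left, base_right):
        pixels.extend(interpolate_span(left, right))
    return pixels
```

With `apex = (4, 0, 0, 255, 0, 0)`, `base_left = (0, 4, 0, 0, 255, 0)`, and `base_right = (8, 4, 0, 0, 0, 255)`, span generation produces 5 spans and span interpolation produces 25 pixels in total, which shows how the data volume grows from three vertices to a list of spans to a two-dimensional set of pixels.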
The nature of a rendering system is time-varying volume data processing [4]. A triangle of arbitrary size consists of three vertices, each containing {X, Y, Z, R, G, B} information, and has $O(1)$ space complexity. At the span generation stage, the space complexity increases to $O(N)$. At the span interpolation stage, it increases to $O(N^2)$. Note that the total energy consumed by linear interpolation is not spent only on the iterations that generate pixels; setting up the algorithm also consumes energy. There are many candidate algorithms for implementing generalized multi-dimensional linear interpolation. The chosen algorithm should cost little energy for both the algorithm setup and the generation of each pixel.
Traditional division-free scan conversion algorithms can be extended to multiple dimensions and operate on integer number systems. However, these algorithms suffer from two problems. First, because division-free algorithms compute with residue arithmetic, every edge and span computation must obey the slope restriction, and the slope limitation is not the only constraint in solving the pixel density problem: they cannot guarantee a constant pixel generation rate when the absolute value of any of {dR, dG, dB, dZ} is larger than the absolute distance in the spatial domain. Second, they all need additional mapping from the first octant to the other octants.
Division-based interpolation provides a constant rendering rate in all cases. However, it needs to compute the slope of the edge for the incremental approach, an extension of the Digital Differential Analyzer (DDA) algorithm [5]. It is important that the slope computation not be a heavy overhead for interpolation. In many systems [3, 6], the slope computation uses a series of subtractions and shifts to avoid the need for division hardware, or uses ROM and multipliers as a SIMD division unit [7]. Reference [6] restricts the divisor to between 2 and 8 to reduce the slope computation overhead, including the hardware cost and the number of iteration cycles. Here we adopt another approach, which generalizes to any integer division and scales computation power gracefully with a controllable error: we modify the DDA algorithm so that the number of division cycles is variable. For simplicity, $D$ denotes the dividend and $S$ denotes the divisor. Equations (1) and (2) give the ranges of their values; both are represented by unsigned integers. $Q_F$ denotes the fixed-point quotient and $Q_R$ denotes the real quotient. The relationship between $Q_F$ and $Q_R$ follows (3), and they are represented by unsigned fixed-point numbers.
These variables satisfy (4). The accumulated error $E \cdot S$ is given in (5) and should be smaller than $2^{-1}$, as stated in (6). The inequality is solved in (7), and the minimum $E$ is obtained by (8). For radix-2 pre-aligned iterative division, the number of iteration cycles is given by (9). It takes $(k-m)$ cycles for the integer part of the quotient and $(m+2)$ cycles for the fraction part.
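The idea of a division whose cycle count depends on the operands can be sketched as a radix-2 shift-and-subtract division that stops as soon as the accumulated error bound is met. This is a minimal sketch under its own assumptions, not a reproduction of (1)-(9): it assumes $S < 2^m$ with $m$ the bit length of $S$, and it picks $m+1$ fraction bits so that a truncation error below $2^{-(m+1)}$, accumulated over at most $S$ steps, stays below $2^{-1}$. The exact cycle counts in the text depend on the range definitions in (1) and (2), so they need not match the counts returned here. All names are hypothetical.

```python
from fractions import Fraction


def precision_aware_divide(d, s):
    """Return (quotient, frac_bits, cycles) for the fixed-point slope d / s.

    quotient is floor(d * 2**frac_bits / s), produced one bit per cycle.
    """
    if d < 0 or s <= 0:
        raise ValueError("expects d >= 0 and s > 0")
    m = s.bit_length()                       # s < 2**m
    frac_bits = m + 1                        # assumed bound: error * s < 2**-1
    int_bits = max(d.bit_length() - m + 1, 0)
    cycles = int_bits + frac_bits            # one quotient bit per cycle
    remainder = d << frac_bits
    quotient = 0
    for shift in range(cycles - 1, -1, -1):  # divisor pre-aligned to the MSB
        quotient <<= 1
        if remainder >= (s << shift):
            remainder -= s << shift
            quotient |= 1
    return quotient, frac_bits, cycles


def worst_accumulated_error(d, s):
    """Largest |n * d / s - n * Q_F| over the s incremental steps, as an exact fraction."""
    q, frac_bits, _ = precision_aware_divide(d, s)
    return max(
        abs(Fraction(n * d, s) - Fraction(n * q, 1 << frac_bits))
        for n in range(s + 1)
    )
```

For example, `precision_aware_divide(100, 7)` needs 4 fraction bits and 9 cycles in this sketch, whereas a divider that always produces 16 fraction bits would spend 12 more cycles on precision that a 7-pixel span cannot use.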
The flow chart of the linear interpolation approaches is shown in Fig. 3(a). The proposed precision-aware shading scheme is shown in Fig. 3(b). A small precision detection unit sets the number of iteration cycles of the division. The number of iteration cycles is reduced according to (9), and the redundant fraction bits are masked to zero.
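Fraction masking can be illustrated in a few lines. The sketch below assumes a fixed-point slope register with a fixed number of fraction bits and a hypothetical precision detector that derives the needed bits from the span length; the detector rule (bit length plus one) is an assumption chosen for illustration, not the rule given by (9).

```python
def needed_fraction_bits(span_length):
    """Hypothetical precision detector: fraction bits needed for a span of this length."""
    if span_length <= 0:
        raise ValueError("span_length must be positive")
    return span_length.bit_length() + 1


def mask_fraction(value, register_frac_bits, needed_frac_bits):
    """Zero the low-order fraction bits beyond the needed precision."""
    if not 0 <= needed_frac_bits <= register_frac_bits:
        raise ValueError("needed_frac_bits out of range")
    drop = register_frac_bits - needed_frac_bits
    return (value >> drop) << drop


def accumulate(start, slope, steps):
    """Incremental interpolation in fixed point; returns every accumulator value."""
    values = []
    acc = start
    for _ in range(steps + 1):
        values.append(acc)
        acc += slope
    return values
```

With a 16-bit fraction register and a span of 5 pixels, the detector asks for 4 bits, so `mask_fraction(0x2ABCD, 16, 4)` returns `0x2A000`. When the start value has a zero fraction, the low 12 bits of the accumulator then stay at zero for the whole span, which is the source of the reduced switching activity.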
III. ARCHITECTURE LEVEL DESIGN
Due to the time-varying volume data characteristic of rendering, we adopt a block-pipelining approach between key stages. Fig. 4(b) shows the three main processing stages: span generation, span interpolation, and visibility test and texturing. References [7] and [8] present different architecture designs for rendering systems. In [7], the architecture uses constant pipelining to control the clock gating at each pipeline stage precisely. In [8], a processor-based reconfigurable architecture balances the computation load in the rendering system and reduces the total computation time for given tasks. To raise the hardware utilization of our design, the fixed binding between processing units and memory partitions is released, and a scheduling unit handles the data exchange. The proposed system architecture is shown in Fig. 4(a). The command dispatcher handles the bus protocol of the command interface and dispatches different types of rendering commands to different processing stages. The triangle/span dispatchers, buffers, and span/pixel schedulers form the data path for power-aware computing. The data path adapts itself as the workload changes. Flow control units monitor the buffer status and determine the number of active Span Generation PEs (SGPEs) and Span Interpolation PEs (SIPEs). Clock gating is applied to non-operating SGPEs and SIPEs to adjust power at the triangle processing level (span generation) and the span processing level (span interpolation). The visibility test and texturing unit eliminates redundant texture memory accesses, as in [7].
The proposed architecture is functionally verified through an FPGA design flow. Fig. 5 shows the texturing of Lena produced by our prototype design. It also shows the rendering result for 256 randomly generated triangles.
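The role of a flow control unit can be sketched as a small policy function that maps buffer occupancy to a number of active PEs and a set of clock enables. The proportional policy below is purely an assumption for illustration; the text does not specify the rule used by the design, and the names are hypothetical.

```python
def active_pe_count(buffer_occupancy, buffer_capacity, total_pes):
    """Hypothetical policy: enable PEs in proportion to how full the buffer is.

    Returns 0 when the buffer is empty, so every PE can be clock gated.
    """
    if buffer_capacity <= 0 or total_pes <= 0:
        raise ValueError("capacity and PE count must be positive")
    if buffer_occupancy <= 0:
        return 0
    occupancy = min(buffer_occupancy, buffer_capacity)
    # Ceiling division in integers: at least one PE runs when work is pending.
    return -(-occupancy * total_pes // buffer_capacity)


def clock_enables(active, total_pes):
    """Per-PE clock enable signals; False means the PE clock is gated."""
    return [index < active for index in range(total_pes)]
```

For a buffer of 8 entries and 4 PEs, 3 pending entries enable 2 PEs, so `clock_enables(2, 4)` gives `[True, True, False, False]`. The same function could serve both the triangle buffer feeding the SGPEs and the span buffer feeding the SIPEs, each with its own capacity and PE count.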
IV. IMPLEMENTATION RESULTS

A. Layout and Specification
The layout of the power-aware rendering engine design is shown in Fig. 6. Table I summarizes the specification of the macro design.
B. Performance and Power
The proposed precision-aware scheme reduces the total switching activity needed to interpolate the given triangles. From the viewpoint of algorithm efficiency, the setup time is tightly related to the space complexity of the data being processed. Table II shows that the rendering speed-up ranges from 1.004× to around 1.342×, which results from the reduced number of division iterations. The shortened computation time and the eliminated redundant precision reduce the switching activity. Therefore, total energy is saved and the average power consumption is reduced. Table III shows the power simulation results from PrimePower. The power efficiency is improved by 1.743× to 4.265×, depending on the given triangle size. For smaller triangles, less precision is needed, and therefore more power is saved by the proposed precision-aware interpolation. Note that even when the triangle is as large as 255 in both width and height, there is still a 1.743× improvement in power efficiency. The shorter spans in the triangle use less precision and account for the energy saving. The average power consumption is 146.0 mW when the macro operates at 250 MHz. It takes 607.8 pJ on average to render a pixel. The average power efficiency of the power-aware rendering engine is 2.241 MPixels/mJ. Table IV compares this work with other works. The proposed design has higher power efficiency. This work also has higher throughput because of its shorter critical path. Frequency scaling and source voltage scaling techniques can be applied when the performance requirement is not very high.
V. CONCLUSIONS
In this paper, a power-aware rendering engine design is presented. Power awareness is achieved through both algorithm-level and architecture-level design. The precision-aware shading approach is advocated over conventional incremental linear interpolation. Eliminating redundant precision calculation reduces the setup overhead of the division-based shading algorithm, and fraction bit masking reduces the switching activity during the interpolation of an edge or a span. The scalable, dynamically reconfigurable system architecture promptly controls the working status of the computation units and scales power as the workload changes. Finally, a prototype design demonstrates that the proposed shading approach and system architecture save power while providing high performance. The power efficiency of rendering can be improved through the proposed approaches.
