Current graphics cards include advanced graphics processing units to accelerate the rendering of 3D objects with millions of polygons. As object models grow in complexity, rendering based on points as primitives is regarded as superior in terms of scalability and efficiency. Next-generation graphics cards could contain reconfigurable fabrics, similar to those implemented in current FPGAs, to offer two advantages: a) fast rendering units and b) new mechanisms for custom, run-time exchangeable accelerators. In this paper, we propose a hardware point-rendering architecture tailored specifically to reconfigurable systems. The presented implementation on a real FPGA-based platform demonstrates the effectiveness of the approach and provides valuable insights into possible future improvements for this problem scenario.
Introduction
In recent years, two factors in the graphics card sector have changed dramatically. First, performance and visual quality have leaped into new areas. Second, graphics cards have been established as computational accelerators. However, as polygonal models have become increasingly complex, the size of the projected primitives has decreased accordingly. This raises the question of whether polygons are the right primitives for very detailed and complex models [10, 7, 17].
The major challenge of point-based rendering algorithms is to achieve continuous interpolation between discrete point samples which are irregularly distributed on a smooth surface [12, 13, 20, 14] . Rendering large data sets at low magnification will often cause primitives to be smaller than the output device pixels. In order to minimize rendering time, it is desirable to control the level of detail through the use of multiresolution model objects [14, 5] .
Recent approaches such as [2, 6] address high speed point-rendering by exploiting GPU acceleration and onboard video memory caches. A state of the art ASIC chip and a multi FPGA architecture for point-rendering were recently presented in [18] and high image quality aspects have been considered in [3, 15] .
A partitioned DSP/FPGA implementation [8] uses the FPGA only for the Z-buffer test and final screen buffering. All other rendering operations are performed on the DSP and not in hardware. The design achieves a throughput of 5 million points per second.
In this paper, we present an efficient point-rendering hardware architecture on an FPGA platform and demonstrate that an implementation on FPGAs can be both high-performance and resource-efficient. Furthermore, our implementation distinguishes itself from known approaches by a careful HW/SW partitioning strategy that balances the trade-off between performance and resource utilization.
Point-Based Rendering
In point-based rendering, a 3D object is represented by a set of points [10, 7, 13, 15]. Each point $p_i$ consists of its 3D coordinate $x_i = (x, y, z)^T$, a color value $c_i = (r, g, b)$, and a normal vector $n_i$ that is orthogonal to the surface sampled at the point. An additional homogeneous w-coordinate is necessary to obtain the 3D to 2D projection by means of matrix operations [4].
The rasterization of 3D data points is performed in the rendering pipeline. The most important standards of this model are OpenGL and Direct3D. A detailed description of the pipeline is beyond the scope of this paper, but a short overview is given here for clarity.
In Figure 1, the stages of the rendering pipeline are shown. The modeling transformation consists of a translation, a scaling, and a rotation operation that map 3D points from object space into world space. Furthermore, the viewing transformation maps points from world space into the camera space that is defined by the position and orientation of the virtual camera. Since both mappings are linear transformations in homogeneous coordinates, they can be combined into a single 4 × 4 matrix, called the ModelView matrix $M_{MV}$.
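The composition of the modeling and viewing transformations into one matrix can be illustrated with a minimal Python sketch. The helper names, the choice of a rotation about the z-axis, and the example camera placement are assumptions made for illustration only; they are not taken from the implementation described in this paper.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scaling(s):
    # Uniform scaling by the factor s.
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]

def rotation_z(angle_rad):
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def transform(m, p):
    """Apply a 4x4 matrix to a homogeneous point (x, y, z, w)."""
    return tuple(sum(m[i][k] * p[k] for k in range(4)) for i in range(4))

# Modeling transformation: scale, then rotate, then translate (object -> world).
model = matmul(translation(1.0, 0.0, 0.0),
               matmul(rotation_z(math.pi / 2), scaling(2.0)))
# Viewing transformation (assumed): camera at z = +5 looking down the negative z-axis.
view = translation(0.0, 0.0, -5.0)
# Both mappings collapse into a single ModelView matrix.
model_view = matmul(view, model)

p_camera = transform(model_view, (1.0, 0.0, 0.0, 1.0))
```

Because the product is computed once per frame, each point only needs a single matrix-vector multiplication, regardless of how many elementary transformations were combined.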
The subsequent backface culling stage ensures that only points with normal vectors pointing towards the camera are processed further, i.e., points that sample surfaces visible to the camera.
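A backface test of this kind can be sketched in a few lines of Python. The sketch assumes that the point and its normal are already given in camera space with the camera at the origin; the function name is hypothetical, and the exact test used in hardware is not specified here.

```python
def is_front_facing(position, normal):
    """Return True if the normal points towards a camera located at the origin."""
    # The vector from the point to the camera is the negated position.
    to_camera = tuple(-c for c in position)
    # A positive dot product means the angle between both vectors is below 90 degrees.
    return sum(n * v for n, v in zip(normal, to_camera)) > 0.0
```

For example, a point at (0, 0, -5) with normal (0, 0, 1) passes the test, while the same point with normal (0, 0, -1) is culled.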
After this, the point's actual color value is calculated in the lighting stage. For this purpose, the color value is weighted by a factor that depends on the angle between the direction from the point to the light source and the point's normal vector. This technique is known as Lambert shading.
The purpose of the projection transformation is to map the viewing frustum defined by the camera parameters (focus, field-of-view, etc.) to a standard cube whose sides span the interval [−1, 1]. This transformation is also described by a 4 × 4 matrix, the projection matrix $M_P$.
Based on this standard cube, the clipping stage determines the points that fall outside the camera frustum and discards them.
After that, perspective division by the w-coordinate occurs. Now, a point's position on the image plane is given by its x- and y-coordinates. The final viewport transformation determines the pixel that represents the point according to the current viewport resolution. Finally, the z-coordinate is used to ensure that only points not occluded by others are displayed (Z-Test).
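The last stages before the Z-Test can be summarized in a short Python sketch. It assumes OpenGL-style clip coordinates, a 640 × 480 viewport, a flipped y-axis (row 0 at the top of the screen), and a depth value mapped to [0, 1]; these conventions and the function name are illustrative assumptions.

```python
def clip_to_pixel(clip, width=640, height=480):
    """Map a clip-space point (x, y, z, w) to (px, py, depth), or None if clipped."""
    x, y, z, w = clip
    if w <= 0.0:
        return None  # point lies behind the camera
    # Clipping against the standard cube, tested before the division.
    if not (-w <= x <= w and -w <= y <= w and -w <= z <= w):
        return None
    # Perspective division by the w-coordinate.
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w
    # Viewport transformation: [-1, 1] is mapped to pixel indices.
    px = min(int((ndc_x + 1.0) * 0.5 * width), width - 1)
    py = min(int((1.0 - ndc_y) * 0.5 * height), height - 1)
    depth = (ndc_z + 1.0) * 0.5
    return px, py, depth
```

A point at the center of the cube, `clip_to_pixel((0.0, 0.0, 0.0, 1.0))`, lands at pixel (320, 240) with depth 0.5, while `clip_to_pixel((2.0, 0.0, 0.0, 1.0))` is discarded.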
ESM Hardware Platform
A dynamically partially reconfigurable platform called Erlangen Slot Machine (ESM) [1, 11] is used for prototyping the point-rendering pipeline; technical data sheets are available at http://www.r-space.de. The ESM platform is centered around an FPGA serving as the main reconfigurable engine and an FPGA realizing a crossbar switch. These two FPGAs reside on the BabyBoard and the MotherBoard and are realized using a Xilinx VirtexII-6000 and a Xilinx SpartanII-600 FPGA [19], respectively. The slot-based architecture of the ESM consists of the main VirtexII FPGA on the BabyBoard, local SRAM memories providing the memory architecture for the rendering pipeline, configuration memory, and a reconfiguration manager. The MotherBoard contains a PowerPC running an embedded version of Linux and I/O peripherals connected to the crossbar. In Figure 2 the ESM BabyBoard and the MotherBoard are shown.
The main idea of the ESM architecture is to accelerate application development as well as research in the area of partially reconfigurable hardware. The advantage of the ESM platform is its unique slot-based architecture, which allows individual hardware modules, arranged in 1-D vertical slots, to be configured at run-time independently of their peripheral needs. A separate crossbar switch is in charge of routing the data dynamically from the periphery, e.g., a video input signal, to the current position of the responsible hardware module. We decided to implement the crossbar off-chip on the MotherBoard to keep as many resources as possible available on the FPGA for partially reconfigurable modules.
Thus, the ESM architecture is based on the flexible decoupling of the FPGA I/O-pins from a direct connection to an interface chip. This flexibility allows the independent placement of application modules in any available slot at run-time. As a result, run-time placement is not constrained by physical I/O-pin locations as the I/O-pin routing is done automatically in the crossbar.
Implementation of the Point-Rendering Pipeline
Our point-rendering implementation consists of the main hardware pipeline and a substantial, necessary software part. The rendering process is controlled through the software part, see Figure 3.
HW/SW Partitioning
The point-rendering pipeline itself is performance-critical and should therefore be implemented in hardware, as its throughput determines the overall system performance. Consequently, the model memory, the Z-buffer, and the screen buffer must be hardware controlled. Hence, these parts are implemented in hardware on the BabyBoard.
All matrix computations required by the point-rendering pipeline can be implemented either in software or in hardware. As long as the software execution time and the communication overhead are not prohibitive, the software solution a) saves many hardware resources, in our case 6,273 slices and 48 block multipliers on the VirtexII-6000 FPGA, and b) has an inherent flexibility advantage. Adding a new transformation, e.g., the OpenGL gluLookAt transformation, becomes a simple software extension. Furthermore, the software uses a double-precision floating-point number format for all arithmetic operations, which gives a precision advantage in the final ModelView matrix. After computing the matrix in software, the results are sent to the hardware point-rendering pipeline via the crossbar, see Figure 3.
Software Control Flow
In order to process a point model object, four main steps have to be executed in software:
1. Model point data must be downloaded onto the BabyBoard local memory prior to any rendering. The model memory stores the point data in coherent point group objects.
2. Update of the pipeline state, which is fully controlled through software. Here, only the pipeline state is transferred and, e.g., not the operands for the ModelView matrix. This means that the software generates the appropriate pipeline state after computing, e.g., the ModelView matrix $M_{MV}$.
3. Enable execution inside the point-rendering pipeline. Now model point data is continuously read from the model memory and fed into the point-rendering pipeline. The rendered picture is then written into the output screen buffer, which implements a double buffering technique. However, the screen buffer and the Z-buffer must be cleared before a new picture can be rendered.
4. Finally, the rendered picture is read from the screen buffer and transferred via the crossbar to the VGA output at a resolution of 640 × 480 pixels.
In the following, implementation issues of the hardware pipeline are discussed.
Number Representation
Each point coordinate in our model data is represented by a 24-bit word, which exactly matches our implemented fixed-point Q7.16 number format (7 integer and 16 fractional bits). Additional compression of the coordinates is nontrivial and was not implemented. However, all normal vectors are compressed. This allowed us to reduce the bit width from 72 to 15 bits, as proposed in [13]. The color information is stored in a coded color index. Therefore, one point of our model data is encoded in 12 bytes.
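The conversion between floating-point values computed in software and 24-bit fixed-point words can be sketched as follows. The sketch assumes that the 24th bit is a sign bit in two's-complement representation and that out-of-range values saturate; both are assumptions, as are the function names.

```python
FRAC_BITS = 16
WORD_BITS = 24  # assumed layout: 1 sign bit + 7 integer bits + 16 fractional bits

def to_q7_16(value):
    """Convert a float to a 24-bit two's-complement Q7.16 word (saturating)."""
    raw = int(round(value * (1 << FRAC_BITS)))
    lowest = -(1 << (WORD_BITS - 1))
    highest = (1 << (WORD_BITS - 1)) - 1
    raw = max(lowest, min(highest, raw))
    # Keep only the lower 24 bits so that negative values wrap to two's complement.
    return raw & ((1 << WORD_BITS) - 1)

def from_q7_16(word):
    """Convert a 24-bit two's-complement Q7.16 word back to a float."""
    if word & (1 << (WORD_BITS - 1)):
        word -= 1 << WORD_BITS
    return word / (1 << FRAC_BITS)
```

With this layout the resolution is 2^-16 and the representable range is roughly [-128, 128), so model coordinates have to be scaled into this range before they are downloaded.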
External SRAM Utilization
The ESM platform has 6 SRAM banks with 2 MB capacity each. Our object model memory occupies two SRAMs and has to deliver 12 bytes for every pipeline clock period. The double screen buffer uses another two SRAMs, see Figure 3. Therefore, only the last two SRAMs can be used to implement the Z-buffer, which limits our implementation to 16 bits instead of the recommended 24 bits [9].
Pipeline State Vector
Figure 4 shows the implementation of the rendering pipeline. The pipeline state vector holds the current state of the complete pipeline. Table 1 lists all controlled pipeline elements together with the required bit widths. Control words issued by the protocol state machine have a length of 1,306 bits (see the Data signal outgoing from the protocol FSM in Figure 4). At the HW/SW interface, the software part controls the setup phase and the rendering process by sending 104-bit instruction words to the protocol state machine. The opcode is encoded in 8 bits and the remaining bits are used for data transfers. In our implementation only four instructions are needed to update the ModelView matrix.
The instruction opcode is grouped into a) model memory operations, b) state update operations, and c) rendering control operations. Model memory operations allow the software to alter the model data. The points are stored in a linear array. This array is segmented into groups with a start and stop index, as rendering is only performed on complete groups rather than individual points. State update operations allow the software to update the various parameters of the pipeline as presented in Table 1. Some parameters are set only once (e.g., the window parameters), others are expected to change quite often (e.g., the ModelView matrix). Finally, rendering control operations enable the rendering of point groups through the activation of individual pipeline elements.
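To make the instruction format more concrete, the following Python sketch packs a 104-bit instruction word from an 8-bit leading field and a 96-bit data field. The division of the data field into four 24-bit values, the field order, and the opcode value used in the example are assumptions for illustration; the actual encoding is not specified here.

```python
OPCODE_BITS = 8
PAYLOAD_BITS = 96   # 104-bit instruction word minus the 8-bit leading field
VALUE_BITS = 24     # assumed: the payload carries up to four 24-bit values
SLOTS = PAYLOAD_BITS // VALUE_BITS

def pack_instruction(opcode, values):
    """Pack an 8-bit opcode and up to four 24-bit values into one integer."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode must fit into 8 bits")
    if len(values) > SLOTS:
        raise ValueError("at most four 24-bit values fit into the payload")
    if any(not 0 <= v < (1 << VALUE_BITS) for v in values):
        raise ValueError("each value must fit into 24 bits")
    # Unused slots are padded with zeros.
    slots = list(values) + [0] * (SLOTS - len(values))
    payload = 0
    for v in slots:
        payload = (payload << VALUE_BITS) | v
    return (opcode << PAYLOAD_BITS) | payload

def unpack_instruction(word):
    """Split an instruction word into its opcode and its four 24-bit values."""
    opcode = word >> PAYLOAD_BITS
    payload = word & ((1 << PAYLOAD_BITS) - 1)
    mask = (1 << VALUE_BITS) - 1
    values = [(payload >> (VALUE_BITS * i)) & mask for i in (3, 2, 1, 0)]
    return opcode, values

# Hypothetical opcode 0x10: one instruction carrying four 24-bit values.
word = pack_instruction(0x10, [1, 2, 3, 4])
```

Under this assumed layout, four such instructions carry sixteen 24-bit values, which would be enough for a complete 4 × 4 matrix.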
Point-Rendering Pipeline Elements
The implemented point-rendering pipeline has a throughput of one point per clock cycle. Every rendering transformation and visibility test is mapped to a corresponding hardware pipeline element, as shown in Figure 4. Two signals are used to control the visibility of the currently processed point.
The latency of a pipeline element is not crucial, as long as it sustains a throughput of one point per clock cycle. All control signals and point data are synchronously passed through each pipeline element.
ModelView Transformation
The ModelView matrix is used to transform model coordinates into camera coordinates. Since we allow only affine transformations, the last row of the ModelView matrix is fixed at (0, 0, 0, 1) and need not be computed, which reduces the number of multiplications from 16 down to 12.
Similarly, we allow only linear transformations for the normal vector. This reduces the number of multiplications from 16 down to 9.
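Written out, the two simplifications look as follows. The entry names $m_{jk}$ and $a_{jk}$ are notation introduced here for illustration; in particular, how the 3 × 3 normal matrix is derived from $M_{MV}$ is not specified and is left open.

```latex
\[
\begin{pmatrix} x' \\ y' \\ z' \\ w' \end{pmatrix}
=
\begin{pmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
m_{21} & m_{22} & m_{23} & m_{24} \\
m_{31} & m_{32} & m_{33} & m_{34} \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix},
\qquad
n' =
\begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{pmatrix}
n
\]
```

The first three rows of the point transformation need 3 × 4 = 12 multiplications, while $w' = w$ is passed through unchanged. The normal transformation needs 3 × 3 = 9 multiplications.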
Lighting
A reflection coefficient is produced by the lighting computation and multiplied with the point color to output the visible screen color. However, we are using a screen buffer with 8 bits per pixel, which is only suitable for gray-scale color coding. The normal vector is decoded by the memory controller into Cartesian coordinates but is not normalized there. Since the lighting computation requires the normalized vector $n_i^0 = n_i / \|n_i\|$ to obtain correct results, the normalization must be performed at this stage.
The reflection coefficient $\rho_i$ depends on the angle between the point's surface normal $n_i$ and the direction to the light source $l$. We use diffuse reflection for our lighting computations and assume that the light source is far away. As a result, $l$ is constant for all points. With $l^0$ being the normalized vector of $l$, the coefficient is calculated as the dot product $\rho_i = n_i^0 \cdot l^0$.
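The lighting stage can be mimicked in software with a minimal Python sketch of Lambertian shading. The function names are hypothetical, and clamping negative values to zero is an assumption that is common practice but not stated for the hardware implementation.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / length for c in v)

def reflection_coefficient(normal, light_dir):
    """Cosine of the angle between surface normal and light direction."""
    n0 = normalize(normal)
    l0 = normalize(light_dir)
    cos_angle = sum(a * b for a, b in zip(n0, l0))
    # Assumption: surfaces facing away from the light receive no light.
    return max(0.0, cos_angle)

def shade(gray, normal, light_dir):
    """Scale an 8-bit gray value by the reflection coefficient."""
    return int(round(gray * reflection_coefficient(normal, light_dir)))
```

Since the light direction is constant for all points, its normalization can be done once in software, whereas each decoded normal has to be normalized per point.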
Projection
The projection transformation uses the intrinsic camera values (near plane $n$, far plane $f$, coordinates of the left and right vertical clipping planes $l$, $r$, and of the top and bottom horizontal clipping planes $t$, $b$). The projection transformation is shown in Equation 4 and can be implemented using 6 multiplications and 3 additions.
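For reference, the standard OpenGL perspective frustum matrix built from these parameters has the form below. That the implementation uses exactly this matrix is an assumption, but it is consistent with the stated operation count.

```latex
\[
M_P =
\begin{pmatrix}
\frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\
0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\
0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\
0 & 0 & -1 & 0
\end{pmatrix}
\]
```

The first three rows each contain two non-trivial entries, giving 6 multiplications and 3 additions in total. The last row only negates the z-coordinate and therefore needs no multiplier.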
Z-Buffer
An optimal, fully pipelined implementation of the Z-buffer algorithm needs a dual-ported SRAM, which was not available on our platform. Instead, the SRAM controller of the implemented Z-Test is clocked at double the pipeline clock frequency so that the available single-port SRAMs can be used. However, special care has to be taken because the same point and control data are now sampled twice.
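The access pattern behind this doubled clock can be illustrated with a small software model in Python. Each point needs one read and one write slot on the single memory port, i.e., two memory accesses per pipeline clock cycle. The class name, the convention that smaller depth values are nearer, and the 16-bit far value used for clearing are assumptions of this sketch.

```python
class SinglePortZBuffer:
    """Software model of a Z-buffer with one memory access per fast clock tick."""

    def __init__(self, width, height, depth_bits=16):
        self.width = width
        self.far = (1 << depth_bits) - 1
        self.depth = [self.far] * (width * height)
        self.accesses = 0  # counts ticks of the doubled memory clock

    def clear(self):
        """Reset every entry to the far value before a new picture is rendered."""
        self.depth = [self.far] * len(self.depth)

    def test_and_update(self, px, py, z):
        """Return True if the point is visible; always costs two memory ticks."""
        addr = py * self.width + px
        # Tick 1: read the stored depth value.
        stored = self.depth[addr]
        self.accesses += 1
        visible = z < stored
        # Tick 2: write slot, used only if the new point is nearer.
        self.accesses += 1
        if visible:
            self.depth[addr] = z
        return visible
```

In this model, three points hitting the same pixel consume six memory ticks, which corresponds to three cycles of the pipeline clock.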
Results
Our final hardware implementation of the point-rendering pipeline, as shown in Figures 3 and 4, consumes 13,462 (40.4%) slices and 80 (56%) block multipliers of the VirtexII-6000, and achieves a clock frequency of 60 MHz. This means that the pipeline itself can process 60 million points per second. Our model memory can store model objects with up to 262,144 points in the final implementation. This number is only limited by the size of our external SRAM memory banks.
Due to the limited external clock frequency of the SRAMs on the ESM platform, the point-rendering pipeline has to wait 16 clock cycles for each new point sample. Therefore, our current rendering throughput drops to 3.75 million points per second.
A state-of-the-art multi-FPGA implementation of the point-rendering pipeline [18] achieves a throughput of 140 million points per second while running at 70 MHz (due to two parallel rendering units). Its splatting performance ranges from 0.7 to 2 million splats per second at a screen resolution of 512 × 512 pixels.
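The throughput figures follow directly from the clock frequency and the memory wait time, as the short Python calculation below shows. The frame-rate estimate at the end is a derived illustration for a model that fills the entire model memory; it ignores buffer clearing and screen readout and is not a measured result.

```python
PIPELINE_CLOCK_HZ = 60_000_000   # achieved pipeline clock frequency
CYCLES_PER_POINT_FETCH = 16      # wait cycles per point caused by the external SRAMs
MAX_MODEL_POINTS = 262_144       # capacity of the model memory

# Peak throughput: one point per clock cycle.
peak_points_per_second = PIPELINE_CLOCK_HZ
# Sustained throughput: one point every 16 clock cycles.
sustained_points_per_second = PIPELINE_CLOCK_HZ / CYCLES_PER_POINT_FETCH
# Rough upper bound on the frame rate for a model of maximum size.
frames_per_second = sustained_points_per_second / MAX_MODEL_POINTS
```

The sustained rate evaluates to 3.75 million points per second, and a model of maximum size could be redrawn roughly 14 times per second under these simplifying assumptions.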
Hardware Resource Utilization
The hardware resource utilization for the implemented point-rendering pipeline is shown in Table 2 .
In our implementation, three different multiplier variants were used. The first variant uses only the MULT18x18 blocks found in the VirtexII, which results in the use of 4 of these blocks per multiplication. The second variant uses a hybrid multiplier generated by the CoreGen utility [19]. It uses one MULT18x18 block and implements the remaining logic using slices. Due to pipelining, this implementation has a throughput of one multiplication per clock cycle. The third variant is built from slices only, so it consumes no MULT18x18 blocks.
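Why a 24-bit multiplication needs four 18 × 18 blocks can be seen from the following Python sketch: each operand is wider than 18 bits and is therefore split into two halves, which yields 2 × 2 partial products. The split into 12-bit halves and the restriction to unsigned operands are simplifying assumptions; the decomposition actually used in the design is not specified here.

```python
HALF_BITS = 12
MASK = (1 << HALF_BITS) - 1

def mul24_from_four_blocks(a, b):
    """Multiply two unsigned 24-bit integers using four narrow partial products."""
    a_hi, a_lo = a >> HALF_BITS, a & MASK
    b_hi, b_lo = b >> HALF_BITS, b & MASK
    # Each partial product is 12 x 12 bits and fits into one 18 x 18 multiplier block.
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi
    # The partial products are aligned by shifts and summed with adders.
    return (p_hh << (2 * HALF_BITS)) + ((p_lh + p_hl) << HALF_BITS) + p_ll
```

The hybrid and slice-only variants trade some of these dedicated blocks for general-purpose logic, which is why the three variants differ in their slice and multiplier counts.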
Conclusions and Future Work
We presented a HW/SW co-design architecture and an FPGA implementation of a point-rendering pipeline that is both high-performance and resource-efficient. In this implementation, a careful HW/SW partitioning was used to find a good trade-off between performance and resource utilization. This can be regarded as trading pipeline throughput against hardware resource utilization. The rendering pipeline architecture can easily be extended to a parallel architecture with two or even four rendering pipelines. However, the memory bandwidth will then become the main performance bottleneck. Features still missing are surface splatting and level of detail control [13, 15]. These are possible future extensions to our architecture.
A potential area of future research is the partial run-time reconfiguration of hardware pipeline elements. The three hardware units that would benefit most are the lighting stage, the screen buffer stage, and the model object memory controller. The run-time reconfiguration of the lighting pipeline element will enable loading of custom hardware shaders right into the rendering pipeline. By changing the screen buffer stage at run-time, we can include custom filters such as Sobel or median filters before writing a picture to the screen buffer. Another very interesting concept is custom memory controllers that create procedural model objects [16] based on precomputed parameters stored in the model memory.
