Estimating Performance of a Ray-Tracing ASIC Design
Sven Woop    Erik Brunvand    Philipp Slusallek
Saarland University University of Utah Saarland University
Figure 1: Test scenes used to evaluate the DRPU ASIC: Conference (282k triangles), Mafia (15k triangles), Skeleton (16k triangles),
Helix (78k triangles), and DynGael (85k triangles). For more test scenes see Figure 6.
Abstract
Recursive ray tracing is a powerful rendering technique used to
compute realistic images by simulating the global light transport
in a scene. Algorithmic improvements and FPGA-based hardware
implementations of ray tracing have demonstrated realtime performance,
but hardware that achieves performance levels comparable
to commodity rasterization graphics chips is still not available.
This paper describes the architecture and ASIC implementations
of the DRPU design (Dynamic Ray Processing Unit) that closes this
performance gap. The DRPU supports fully programmable shading
and most kinds of dynamic scenes and thus provides capabilities
similar to those of current GPUs. It achieves high efficiency due to SIMD
processing of floating point vectors, massive multithreading,
synchronous execution of packets of threads, and careful management
of caches for scene data. To support dynamic scenes, B-KD trees
are used as spatial index structures that are processed by a custom
traversal and intersection unit and modified by an Update Processor
on scene changes.
The DRPU architecture is specified as a high-level structural
description in a functional language and mapped to both FPGA and
ASIC implementations. Our FPGA prototype clocked at 66 MHz
achieves higher ray tracing performance than CPU-based ray tracers
even on a modern multi-GHz CPU. We provide performance results
for two 130nm ASIC versions and estimate what performance
would be using a 90nm CMOS process. For a 90nm version with a
196 mm² die we conservatively estimate clock rates of 400 MHz and
ray tracing performance of 80 to 290 fps at 1024x768 resolution in
our test scenes. This estimated performance is 70 times faster than
what is achievable with standard multi-GHz desktop CPUs.
CR Categories: I.3.1 [Hardware Architecture]: Graphics processors;
I.3.7 [3D Graphics and Realism]: Ray-Tracing
Keywords: Ray-Tracing, Hardware Architecture, ASIC Implementation,
Performance Estimation




1 Introduction
The current state of the art in realtime computer graphics is the
rasterization algorithm, mainly because low cost and highly efficient
hardware implementations are available that achieve remarkable
levels of performance. The basic principle of this algorithm,
used in all current commodity graphics chips, is to independently
rasterize one triangle at a time onto the screen. This local triangle
operation can be computed quickly using deep pipelines of custom
floating point hardware. However, the incorrect assumption that
triangles are independent is the great weakness of rasterization as it
limits the possible shading operations to local per-triangle computations.
This does not allow for directly computing any global light
effects such as shadows, reflections, transparency, or indirect illumination,
as this would require direct access to potentially the entire
scene database during rendering. Multi-pass techniques that are often
used to approximate these effects are inaccurate and inefficient,
especially with respect to the required external memory bandwidth.
The trend in realtime computer graphics is towards high realism,
which becomes more and more difficult to achieve with rasterization.
Conceptually simple simulation-based rendering techniques
like the ray tracing algorithm [2] can compute highly realistic images
by simulating the physics of light based on the rendering equation.
Recursive ray tracing [32] can easily compute shadows, reflections,
refractions, and even combinations of them by recursively
spawning secondary rays at the object intersection point. Ray tracing
even allows global illumination to be computed by stochastically
gathering incoming light at a point of interest [27]. The results
of this computation are high quality, photo-realistic images that are
often hard to distinguish from photographs.
Despite all these algorithmic advantages, ray tracing suffers from
its high computational cost, causing renderings to take many seconds
to hours to finish. Much research has been performed over
the last two decades to speed up this computation, using different
platforms and algorithms.
1.1 Previous Work
On the software side significant research has been performed on
mapping ray tracing efficiently to parallel machines, including
MIMD and SIMD architectures [7, 12]. The key goal has been
to exploit the parallelism of the hardware architecture in order
to achieve high floating point and thus high ray tracing performance
[16, 15]. The OpenRT project implemented a high performance
ray tracer for commodity PCs that are connected via a standard
Ethernet network [30, 25]. We use this system for performance
comparison.
Figure 2: DRPU Architecture: Several Rendering Units per chip are supported by the DRPU architecture. These units consist of an application
programmable Shader Processor (SP) to generate and shade rays and a fixed-function part that contains a high performance Traversal Processor
(TP) to traverse the B-KD tree (when requested to by the SP) and a Geometry Unit (GU) to intersect rays with triangles or to transform them
to the local coordinate space defined by a B-KD transformation node. These units are all connected to the Memory Interface via small first
level caches and a Thread Generator schedules new pixels for computation. On dynamic scene changes the B-KD trees are efficiently updated
by the Update Processor.
Existing programmable GPUs available in the graphics cards of
today's PCs can be used for many computationally intensive algorithms
as they offer excellent raw floating point performance by
implementing up to 48 SIMD processors. However, the programming
model of these GPUs is very limited and does not efficiently
support ray tracing [19, 5]. In particular, it does not provide flexible
control flow and supports only very restricted memory access.
With realtime ray tracing it also becomes necessary to handle
interactive changes in dynamic scenes. This is possible by using
grids as spatial index structure as they allow for fast insertion of
objects and even fast coherent rendering [29]. Separation of the
scene into objects with piece-wise rigid motion and separate static
spatial indices has been suggested [11] and has been implemented
for realtime use on a cluster of PCs [26]. Bounding Volume Hierarchies
[23, 28] have also successfully been used for rendering of
dynamic scenes.
In recent years multiple custom hardware architectures for ray
tracing have been proposed, both for volume [17, 8] and surface
models. Partial hardware acceleration has been proposed [6] and
a different implementation is commercially available [9]. In addition
a complete ray tracing hardware architecture has been simulated
[10]. The first complete, fully functional realtime ray tracing
chip was presented in [20, 21]. With the RPU hardware architecture
[35] a fully programmable design was implemented with limited
support for dynamic scenes. A fixed function architecture using
B-KD trees to render highly dynamic scenes was published in
[34]. Although the results of these hardware implementations are
promising, none of them achieves performance levels and functionality
comparable with current rasterization hardware.
2 DRPU Hardware Architecture
The DRPU approach of this paper is the first one that supports
programmable material and lighting shaders on the one hand and
highly dynamic scenes on the other hand. The architecture mainly
consists of two parts:
1. Ray Casting Units for managing spatial index structures during
rendering and for manipulating them on scene changes
2. a Shader Processor which consists of four highly multi-threaded
4-way vector units (SPUs) for SIMD synchronous
execution of bundles of threads to perform shading and ray
generation tasks.
The Ray Casting Units are identical to the architecture
described in [34] while the Shader Processor (SP) is very similar to the
SPUs described in [35]. A contribution of this paper is the combination
of both designs, which makes it comparable to rasterization
hardware in terms of support for programmable shading and handling
of many kinds of dynamic scenes. This paper describes the
basic details of B-KD trees and the SPU, while more details can be
found in the related papers. The DRPU architecture also contains
the Skinning Processor as described in [34], which is not explained
further here.
The main contribution of this paper is that we implemented and
tested the DRPU on an FPGA and, in particular, recast that same architecture
to an ASIC using a 130nm CMOS standard cell library
from UMC [24]. We use the FPGA version to help calibrate the
performance estimations of our ASIC implementation. We then extrapolate
performance to a 90nm version, which shows that with an
amount of hardware resources comparable to current GPUs one can
achieve a comparable level of rendering performance, while gaining
all the advantages of the ray tracing algorithm.
The DRPU hardware architecture is designed for ray tracing of
dynamic scenes with programmable material and lighting shaders.
It is highly scalable by supporting several Rendering Units on a
single chip (see Figure 2). Each such Rendering Unit consists of
a Shader Processor (SP) to shade packets of four rays, a Traversal
Processor (TP) to traverse packets of four rays through B-KD trees,
and a Geometry Unit (GU) to intersect packets of rays with triangles.
To hide computation and memory latencies several of these
packets of rays are processed in parallel in the highly multithreaded
hardware. The threads are scheduled by a Thread Scheduler that
performs load balancing and thread generation. Each time a packet
of threads has finished its execution in one of the Rendering Units,
the Thread Scheduler sends four new adjacent pixels to the Rendering
Unit for processing. There, the packet of four threads is initialized
and executed synchronously. The threads stay together in the
packet, but may be masked out on diverging control flow. The SP,
TP, and GU each support the same number of thread-packets such
that a packet can always continue its computation in a different unit.
On dynamic scene changes the Update Processor on the chip is
used prior to rendering to update bounds of the B-KD trees to adapt
to the scene changes. The spatial index structure and the hardware
units that handle it are explained in detail in the following section.
3 Ray Casting Units for Dynamic Scenes
For efficiently tracing rays through a scene, ray tracing requires
spatial index structures that subdivide space into cells that can efficiently
be enumerated along a ray. However, recomputing these
spatial index structures is very expensive, which can limit ray tracing
to static scenes. To cope with this problem we chose B-KD trees
as index structure, which are a kind of Bounding Volume Hierarchy
with one dimensional bounds. The structure of the B-KD tree
is computed initially and maintained during rendering where only
some node bounds need to be recomputed.
Such a B-KD tree is a binary tree, where each node recursively
subdivides the geometry of the scene into two disjoint subsets represented
by its two children. Each node stores the index of a coordinate
axis and bounds on the geometric extent of its two children
along this axis in the form of two bounding intervals, often also referred
to as slabs (see Figure 3). Each leaf node stores a reference
to a single primitive of the scene. For instantiations of objects we
support transformation nodes that store a pointer to the object's root
node and a transformation matrix to specify its position.
Figure 3: B-KD tree: A B-KD tree node divides a set of primitives
into two disjoint subsets, represented by the two children. The node
stores the extent of the geometry for each child as two bounding
intervals along one splitting axis. The geometry is recursively subdivided
until there is only a single primitive per node.
A main advantage of the B-KD trees is low memory consumption
as they store the bounds of the children in only a single dimension.
The implicit full bound per node can be obtained, similar to
KD trees, by clipping against the bounding intervals from the top
down. We build our B-KD trees using a Surface Area Heuristic
(SAH) similar to the approach in [28]. The concept of B-KD trees
is described in more detail in [34].
3.1 Update Processor for B-KD Trees
For changed geometry the B-KD tree bounds can be updated by
a simple bottom-up algorithm that merges the full axis-aligned
bounds of the nodes from the leaves up through the tree and updates
for each B-KD tree node the extent of the two children along the
node axis. This algorithm can be implemented using only
trivial min/max operations that do not touch the structure of the
tree.
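The bottom-up update described above can be sketched in software as follows. This is a minimal illustration, not the Update Processor's instruction stream: the `Node` layout and the helper names `triangle_box`, `merge`, and `update` are assumptions made for this example, and the hardware's vertex and bound registers are replaced by ordinary recursion.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner) of an axis-aligned box


@dataclass
class Node:
    # Inner node: split axis plus one [lo, hi] interval per child along that axis.
    axis: int = 0
    bounds: Optional[List[Tuple[float, float]]] = None
    children: Optional[Tuple["Node", "Node"]] = None
    # Leaf node: the three vertices of a single triangle (hypothetical layout).
    triangle: Optional[Tuple[Vec3, Vec3, Vec3]] = None


def triangle_box(tri) -> Box:
    """Full axis-aligned bound of a triangle from its three vertices."""
    lo = tuple(min(v[k] for v in tri) for k in range(3))
    hi = tuple(max(v[k] for v in tri) for k in range(3))
    return lo, hi


def merge(a: Box, b: Box) -> Box:
    """Merge two full bounds using only min/max operations."""
    lo = tuple(min(a[0][k], b[0][k]) for k in range(3))
    hi = tuple(max(a[1][k], b[1][k]) for k in range(3))
    return lo, hi


def update(node: Node) -> Box:
    """Refit the 1D child bounds bottom-up; returns the subtree's full bound."""
    if node.triangle is not None:
        return triangle_box(node.triangle)
    boxes = [update(child) for child in node.children]
    # Only the extent along the node's own axis is stored in the node.
    node.bounds = [(b[0][node.axis], b[1][node.axis]) for b in boxes]
    return merge(boxes[0], boxes[1])
```

Note that the tree topology is never modified: only the stored intervals change, which is why the update is cheap compared with rebuilding the index structure.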
This update procedure is performed by a dedicated Update Processor
that is fed by an instruction stream which is precomputed
by the driver application for each dynamic object. This instruction
stream includes instructions for loading vertices into one of the 64
vertex registers, computing a triangle bound from 3 vertices, and
merging two bounds together. By operating on vertices the processor
is optimized for triangle meshes with shared vertices. By
keeping these shared vertices in the vertex registers they can optimally
be reused for computing the bound of several triangles, thus
no caches are required. All partial results, such as computed node
bounds, are stored to one of 64 special bound registers to minimize
the external memory traffic to only the required updates of
the nodes, vertex fetches, and additional instruction fetches.
For best results the structure of the B-KD tree should "match"
the geometry and its dynamics. This means that geometry in a subtree
should stay as close together as possible during the course of
changes. A mismatch can result in significant overlap of the bounds
of child nodes. This leads to redundant traversal and missed opportunities
for early ray termination, as both child nodes must be
traversed if a ray enters an overlap region. As a consequence only
dynamic scenes that show some coherent motion can be handled
efficiently with B-KD trees. Many typical motions, like skinned
meshes, obey this restriction, as will be shown in the results section
by some animated characters. Random movement of triangles,
however, is handled less efficiently because the significant overlaps
would require the traversal of many B-KD tree nodes.
3.2 Traversal Processor (TP)
Traversing B-KD trees typically requires between 50 and 100 traversal
steps. Using a fully programmable unit for these operations
wastes precious cycles, since every step would correspond to several
instructions. Instead a Traversal Processor (TP) is used that
consists of four custom fixed function Traversal Processing Units
(TPUs) that are used in SIMD mode, which greatly improves traversal
performance as it can perform one packet traversal operation
each clock cycle.
The Traversal Processor traverses several packets of four rays
in parallel through the B-KD tree. In order to hide memory and
computation latencies multiple packets of rays are processed simultaneously
using a wide multi-threading approach [20]. The rays
in the packet are synchronized to operate on the same B-KD tree
node, which reduces the memory bandwidth. This multi-threading
and packet-based approach performs very well because of the high
coherence between adjacent rays. The memory bandwidth is reduced
further by using dedicated first level caches to store B-KD
tree nodes.
The implemented traversal algorithm for the B-KD tree is similar
to that of standard KD trees [22]. The recursive traversal function
traverses the scene in a traversal interval I = [near, far] along the
ray. We first test for early ray termination by determining if the
current hit is before the near distance. We then intersect the ray
with the four bounding planes defined by the node, giving the two
intersection intervals I_0 and I_1 for the two child nodes (see Figure 4). A
child that lies partially in front of the second one is traversed first,
as it is more likely that the closest hitpoint is located there. Before
this closer child (for instance child 1) is traversed, two comparisons
determine if its intersection interval I_1 overlaps the current traversal
interval I. We recursively traverse the child if this is the case.
The traversal interval is then updated to the intersection of I and
I_1, which requires two min/max operations. If the other child overlaps
the traversal interval it is stored onto a stack together with the
intersection of I and I_0 as its traversal interval.
Figure 4: Ray Traversal: The ray is intersected with the four planes
defined by the bounds of each child giving two intersection intervals
I_0 and I_1 along the ray. A child is traversed iff its intersection interval
overlaps the traversal interval I = [near, far] of the ray. The closer
child is always traversed first to improve performance through early
ray termination.
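The traversal scheme can be illustrated with a small single-ray software sketch. It is only an approximation of what the Traversal Processor does in hardware: it handles one ray rather than a packet of four, uses an explicit software stack, and delegates leaf intersection to a caller-supplied callback. The names `BKDNode`, `slab_interval`, `traverse`, and `hit_primitive` are hypothetical and chosen for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BKDNode:
    axis: int = 0
    # One [lo, hi] slab per child along the node's axis.
    slabs: Tuple[Tuple[float, float], Tuple[float, float]] = ((0.0, 0.0), (0.0, 0.0))
    children: Optional[Tuple["BKDNode", "BKDNode"]] = None
    primitive: object = None  # set only on leaf nodes


def slab_interval(lo, hi, o, d):
    """Ray-parameter interval for which the ray lies inside lo <= x <= hi."""
    if d == 0.0:
        # Ray parallel to the slab planes: either always inside or never.
        return (-math.inf, math.inf) if lo <= o <= hi else (math.inf, -math.inf)
    t0, t1 = (lo - o) / d, (hi - o) / d
    return (min(t0, t1), max(t0, t1))


def traverse(root, origin, direction, hit_primitive, near=0.0, far=math.inf):
    """Return (primitive, distance) of the closest hit, or (None, inf)."""
    best_t, best_prim = math.inf, None
    stack = [(root, near, far)]
    while stack:
        node, t_near, t_far = stack.pop()
        # Early ray termination: the current hit lies before this interval.
        if best_t < t_near:
            continue
        if node.children is None:
            t = hit_primitive(node.primitive)
            if t is not None and t < best_t:
                best_t, best_prim = t, node.primitive
            continue
        o, d = origin[node.axis], direction[node.axis]
        iv = [slab_interval(lo, hi, o, d) for lo, hi in node.slabs]
        # The child whose interval starts first is the closer one.
        order = (0, 1) if iv[0][0] <= iv[1][0] else (1, 0)
        # Push the farther child first so the closer child is popped next.
        for idx in reversed(order):
            lo, hi = max(t_near, iv[idx][0]), min(t_far, iv[idx][1])
            if lo <= hi:  # child interval overlaps the traversal interval
                stack.append((node.children[idx], lo, hi))
    return best_prim, best_t
```

Clipping each child interval against the current traversal interval costs two min/max operations per child, matching the operation count given in the text. When the closer child yields a hit before the stacked child's near distance, the stacked entry is discarded without touching its subtree.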
3.3 Geometry Unit (GU)
If the Traversal Processor reaches a leaf node, the Geometry Unit
(GU) is responsible for sequentially intersecting the rays of a packet
with the contained triangle geometry using the Möller-Trumbore
algorithm [14], or for sequentially transforming rays to the local coordinate
space defined by a transformation node of the B-KD tree.
This transformation requires no additional arithmetic units as they
can be shared with the ones used for ray/triangle intersection. The
Geometry Unit is pipelined and can perform one ray/triangle intersection
or one ray transformation every two clock cycles. Thus
eight cycles are required to transform or intersect the four rays of a
packet.
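For reference, the Möller-Trumbore test works directly on the three triangle vertices, which is why no precomputed per-triangle data is needed. The following is a plain software sketch of the standard algorithm, not a description of the GU pipeline; the function name `intersect` and the tolerance `eps` are choices made for this example.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect(orig, direction, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) for a hit in front of the ray origin, else None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv_det     # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det    # distance along the ray
    return (t, u, v) if t > eps else None
```

For a packet, the GU would apply such a test to each of the four rays in turn, which at one intersection every two cycles gives the eight cycles per packet stated above.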
4 Shader Processor (SP)
At its core the DRPU architecture contains a general purpose
Shader Processor (SP) similar to the SPUs used in the RPU architecture
[35]. It supports random memory read and write operations
as well as arbitrary address computations using integer arithmetic.
However, the design has been optimized for algorithms with properties
similar to those of ray tracing: generous thread-level parallelism,
high data coherence between threads of nearby pixels, and
a large number of short vector floating point operations. Via a special
"trace" instruction it can recursively call the Ray Casting Units
described in the previous section for efficient traversal of additional
rays through the index structure. On a "trace" instruction the ray
and a pointer to the spatial index structure are sent to the Traversal
Processor (TP). The TP performs the traversal of the ray with that
structure and writes results to special return registers. The SP may
now continue operating on the thread-packet using the information
provided by the TP.
Similar to current GPUs we use four component, single precision
floating point or integer vectors as the basic data type in the core
Shader Processing Unit (SPU) to exploit the available instruction
level parallelism. This results in fewer memory requests of larger
size, and significantly reduces the size of the shader programs compared
to scalar code. Again similar to GPUs, dual-issue instructions
are supported to split the vector into two parts and perform
different computations on them, or to pair an arithmetic instruction
with a load or branch instruction.
Figure 5: An abstract view of the implemented DRPU. Each of the
four Shader Processing Units (SPUs) operates on four-component
vectors as its basic data type, and all four SPUs operate synchronously
in SIMD mode making up a 4-thread packet. A total of
32 packets are supported and executed asynchronously in the multi-threading
hardware for a total of 128 supported hardware threads.
Four Traversal Processing Units (TPUs) synchronously traverse packets
of four rays through a B-KD tree using the Geometry Unit (GU)
for ray/triangle intersection.
We take advantage of the thread parallelism in ray tracing
through a massively multi-threaded hardware design with 128 hardware
threads supported in the implemented version of the DRPU.
[...] between threads as required. Multi-threading allows an increase
in [...]
The raw bandwidth requirement of the unmodified ray tracing algorithm
is huge [20]. It can be reduced considerably by exploiting
the high coherence between adjacent rays. To this end, four threads
are packed into a packet and executed synchronously in SIMD mode
in parallel by four SPUs in the hardware (see Figure 5). There are
32 of these four-thread packets supported. Because all threads in
a packet execute the same instruction, identical memory requests
are highly likely for coherent rays in the packet and can be combined.
Using SIMD mode, these SPUs can share much of their infrastructure
(e.g. instruction scheduling and caches), which reduces
the hardware complexity. The current numbers of four threads per
packet and 32 packets were chosen after detailed simulation as a
good balance between hardware complexity and available memory
bandwidth in the current hardware. An increase of the number
of threads would yield higher performance, but a slightly sublinear
relation to the required additional space makes 32 packets a
good compromise. On the other hand, increasing the number of
synchronous rays per packet to more than four could cause problems
during Place and Route of the ASIC design. A synchronization
circuit is required to synchronize between the single units and
larger packets could cause this circuit to be far away from some of
the units. Furthermore, very large packets would reduce the performance
for incoherent computations such as highly triangulated
scenes, because few rays would be active during the computation.
In order to allow for complex control flow even in a SIMD environment
the architecture supports conditional branching and full recursion
using masked execution and a hardware-maintained register
stack accessible through the register file. Unused parts of this stack
can transparently be spilled out to main memory by the hardware
to allow for deep recursions. Diverging branches of the threads in
the packet are automatically handled sequentially by processing one
control path and putting instruction pointer and activity mask of the
second one onto a control stack. In the executed control path, an activity
mask determines which threads take part in the computations.
If the current control path reaches a return statement the next item
of the control stack is executed.
Memory requests are a key issue with multi-core designs. It turns
out that the synchronous execution of rays leads to many identical
memory requests that can be packed and thus reduce bandwidth.
This memory packing mechanism only performs the required number
of memory requests for the packet of rays, e.g. if each ray of the
packet wants to read data from the same address only one packed
memory request is performed. Nevertheless incoherent packets are
allowed and cause no additional overhead but do not see improvements
either, as four single requests are performed. All memory
accesses go through small dedicated caches (see Figure 2) in order
to further reduce external bandwidth and re-use data between different
packets of threads. Cache hit rates are generally much higher
than 90% in our test scenes, which results in low external bandwidth
requirements (see Table 2).
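The effect of memory packing on a single packet can be sketched as follows. This is an illustrative software model only; the function name `pack_requests` and the optional activity mask argument are assumptions made for this example and say nothing about how the hardware circuit is organized.

```python
def pack_requests(addresses, active_mask=None):
    """Combine identical addresses requested by the threads of one packet.

    Returns (unique, routing): the list of addresses that actually have to be
    fetched, and for every thread the index of its data in that list (None
    for threads that are masked out).
    """
    unique = []
    slot_of = {}
    routing = []
    for i, addr in enumerate(addresses):
        if active_mask is not None and not active_mask[i]:
            routing.append(None)       # inactive thread issues no request
            continue
        if addr not in slot_of:        # first thread asking for this address
            slot_of[addr] = len(unique)
            unique.append(addr)
        routing.append(slot_of[addr])
    return unique, routing
```

A fully coherent packet such as `pack_requests([0x100] * 4)` needs one fetch, while a fully incoherent packet still needs four, which corresponds to the behaviour described in the text.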
The main difference between the DRPU and the RPU design
described in [35] is the special Geometry Unit that the DRPU uses
for ray/triangle intersection based on shared triangle vertices. This
unit is required to handle dynamic scenes, as precomputing acceleration
data to speed up a ray/triangle intersection in software on
the SP is difficult: this computation requires matrix inversions
and a second iteration over the dynamic geometry. This would also
greatly increase the size of the scene database, as no vertices could
be shared, resulting in more memory traffic. The RPU design had
only limited support for rigid-body motion, not for highly dynamic
scenes as supported by the DRPU by using B-KD trees.
5 DRPU Implementation
5.1 FPGA Prototype
To accurately estimate the performance of our ASIC design, we implemented
an FPGA version of the DRPU on a Xilinx Virtex-4 LX
160 FPGA [36] that is hosted on the Alpha Data ADM-XRC-4 PCI
board [1]. The FPGA has access to four 16-bit wide DDR memory
chips used in parallel to make a 64-bit wide memory interface that
can deliver a peak bandwidth of 1.0 GB/s at 66 MHz. The FPGA
is connected via a 64-bit wide PCI bus to the host PC. The DMA
capabilities of the PCI bridge are used to upload scene data (B-KD
tree nodes, shader code, and all shader parameters) to DRAM and
to download frame buffer contents to the application for storage or
display via standard graphics APIs.
The hardware description of the entire DRPU prototype is about
8000 lines of ML [13] code using the HWML library for hardware
description [33]. The specification is fully parameterizable, thus
each of the design parameters, like packet size, number of threads,
latencies, floating point accuracy, and caches, can be changed by
adjusting a single configuration file. We adjusted the configuration
to achieve the best possible performance with our FPGA by completely
using the available logic resources.
Due to the limited size of the FPGA not all features of the DRPU
architecture could be enabled for the prototype: integer operations
are not included, which limits memory reads to offsets of precomputed
addresses. Write support is limited to a single vector per
shader (similar to GPUs). A fixed register stack of 16 entries is
provided without automatic spilling of unused parts to memory. In
order to take advantage of the available 18-bit multipliers on the Xilinx
FPGA, a 24-bit floating point format was used. With a packet
size of 4 and 32 packets (128 hardware threads total) the DRPU
occupies about 99% of the logic cells, 165 of the 288 block memories
(57%), and 58 of the 96 18-bit multipliers (60%) of the FPGA
chip. These numbers show that we use the FPGA to its limits, which
sometimes causes problems with routing and overmapping. The design
contains 113 floating point units, mostly in the SPUs, TPUs,
and GU. The worst-case timing according to the Xilinx mapping
tools is 55 MHz, but the DRPU runs at 66 MHz as implemented.
At this clock speed the theoretical peak performance is 7.5 GFlops.
Figure 7: Plot of the DRPU ASIC shown with only four of the six
levels of metal wiring so that the memories are visible. The Shader
Processor (SP), Traversal Processor (TP), Geometry Unit (GU), and
Update Processor are shaded, to show their die area and complexity.
As we did not design the external connection of the chip (PCI plus
DRAM interfaces), pads are not included in the Figure.
5.2 ASIC Design
For the ASIC version of the DRPU we mapped the HWML-generated
description to a set of standard cells in a 130nm CMOS
process from UMC [24]. For on-chip RAM we used memories generated
by an SRAM memory compiler from Virtual Silicon. Physical
assembly and post layout timing was done using the Cadence
SOC Encounter tools.
The ASIC version does not suffer from space limitations, thus we
increased the floating point data path to full 32-bit single-precision
width, including integer arithmetic and other features that had been
disabled in the FPGA version. To easily estimate performance we
configured the DRPU ASIC version in a similar way to the FPGA
version with packet size 4 and 32 packets. This also results in
113 floating point units on the DRPU ASIC. To make a conservative
performance extrapolation, we additionally increased the cache
sizes and implemented four-way set associative caches for the SP
(16 KBytes), TP (16 KBytes), and GU (16 KBytes). The total core
size for this DRPU is 7 mm x 7 mm (49 mm²) in the 130nm CMOS
process. The post layout timing estimates for the current version
are 161 MHz worst case (1.08V, 125°C) and 299 MHz typical case
(1.25V, 25°C). These speed estimates are approximately 70% of
the maximum possible speed of the on-chip memories generated
with our memory compiler, which shows room for further improvements.
A clock rate of 266 MHz should easily be achievable if the
chip were fabricated, giving a theoretical peak performance
of 30.0 GFlops.
Figure 6: Some of the scenes used for benchmarking the prototype: Scene6 (0.5k triangles), Office (34k triangles),
Mafia Spheres (20k triangles), Hand (17k triangles), and Gael (52k triangles). See Figure 1 for more benchmarking scenes.
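The quoted peak figures are consistent with a simple product of unit count and clock rate. The sketch below assumes one floating point operation per unit per cycle; that assumption is not stated in the text but reproduces the 7.5 GFlops of the FPGA prototype and roughly the 30 GFlops of the 130nm ASIC. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(fpus: int, clock_mhz: float, flops_per_cycle: int = 1) -> float:
    """Theoretical peak in GFlops: units x clock x operations per cycle."""
    return fpus * clock_mhz * flops_per_cycle / 1000.0


# 113 floating point units as stated for both the FPGA and the ASIC version.
fpga_peak = peak_gflops(113, 66)    # FPGA prototype at 66 MHz
asic_peak = peak_gflops(113, 266)   # 130nm ASIC at 266 MHz
print(f"FPGA: {fpga_peak:.3f} GFlops, ASIC: {asic_peak:.3f} GFlops")
```

Under this assumption the FPGA value is 7.458 GFlops and the ASIC value is 30.058 GFlops, so the 30.0 GFlops in the text appears to be a slightly rounded-down figure.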
A plot of the DRPU layout is shown in Figure 7 with the memories
visible as the large blocks and some hardware units labeled
and shaded. In total there are approximately 9 million non-memory
transistors in the DRPU (686k standard cells, 191k of them
flip-flops) and approximately 2.57 MBit of on-chip RAM in the caches
(0.6 MBit), register files (1.2 MBit), and other memory structures
(0.77 MBit) that are implemented in 280 generated memory blocks.
6 Performance Evaluation
DRPU FPGA: The fully functional FPGA prototype, configured as
described in Section 5.1, runs at 66 MHz with 1 GB/s peak memory
bandwidth between the on-board SDRAM and the on-chip caches.
It turns out that half the peak memory bandwidth is sufficient for
most of our test scenes, thus for measurements we scaled the available
bandwidth down to only 0.5 GB/s using some test circuits. The
performance of the DRPU FPGA is measured directly from the running
hardware by counting the number of cycles required to update
spatial index structures and to compute the image (see Table 2).
DRPU ASIC: The timing of the DRPU ASIC, configured as described
in Section 5.2, is estimated from post layout timing analysis
using Cadence SOC Encounter. Because the architecture is the
same, and the ASIC clock rate is four times higher than for the
FPGA, we can derive performance numbers for this ASIC version
by scaling the FPGA frame rates linearly. This is precise as long
as the external memory bandwidth can also be scaled linearly to
2.1 GB/s. Because of the larger caches of the ASIC this performance
estimate is quite conservative.
Today's high end rasterization graphics chips like the ATI R520
use a 90nm process with a 288 mm² die. This is much larger than our
DRPU ASIC version, whose die is 49 mm² and uses a 130nm
process. For this reason we estimate performance for two further
ASIC versions with larger die size and with a 90nm process (see
Table 3).
DRPU4 ASIC: First wc maintain the proccss and put four copics 
of the basic DRPU ASIC on a single chip. Wc did no ASIC layout 













                   Pentium 4      Cell  DRPU FPGA  DRPU ASIC  DRPU4 ASIC  DRPU8 ASIC
Freq [MHz]           2,667.0   3,200.0       66.0      266.0       266.0       400.0
GFlops                  10.6     256.0        7.5       30.0       120.2       361.6
process [nm]             130        90         90        130         130          90
die size [mm²]         145.0     221.0          -       49.0       196.0       186.6
bandwidth [GB/s]         8.5      25.0        0.5        2.1         8.5        25.6
Table 1: Comparison of the different hardware architectures: the
OpenRT software implementation running on a Pentium 4, the Cell
implementation, the DRPU FPGA implementation, the DRPU and
DRPU4 ASIC implementations on a 130nm process, and the
extrapolation of the DRPU8 ASIC to a 90nm process.
die (196 mm²) at 130nm if one ignores the area required for
connecting the four DRPU copies to main memory. If run at 266 MHz
the 452 floating point units of the DRPU4 ASIC would provide a
peak floating point performance of 120.0 GFlops. The performance
could again be scaled up linearly if the chip were connected to
a DDR memory interface with 8.5 GB/s peak bandwidth, which can
be implemented quite feasibly with two 64-bit wide DDR2 memory
interfaces clocked at an effective 532 MHz.
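The peak figures quoted for the DRPU4 ASIC can be reproduced with simple arithmetic. The sketch below assumes one floating point operation per unit per cycle and counts memory bandwidth as total bus width times effective transfer rate. Both function names are hypothetical and only serve the illustration.

```python
def peak_gflops(fp_units, clock_mhz):
    """Peak GFlops, assuming each unit completes one operation per cycle."""
    return fp_units * clock_mhz / 1000.0


def peak_bandwidth_gbs(interfaces, bus_bits, effective_mhz):
    """Peak bandwidth in GB/s for parallel memory interfaces of the given width."""
    bytes_per_transfer = interfaces * bus_bits / 8
    return bytes_per_transfer * effective_mhz / 1000.0


drpu4_gflops = peak_gflops(452, 266)
drpu4_bandwidth = peak_bandwidth_gbs(2, 64, 532)
print(f"{drpu4_gflops:.1f} GFlops, {drpu4_bandwidth:.1f} GB/s")
```

Under the same assumption, 904 units at 400 MHz give the 361.6 GFlops listed for the DRPU8 ASIC in Table 1.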
DRPU8 ASIC: Next we extrapolate performance levels that
could be achieved with the DRPU design by going from our 130nm
process to a 90nm process. Because we don't have access to this
process, we cannot provide precise timing results from Cadence
SOC Encounter, but extrapolations using constant field scaling are
reasonably accurate [31]. If one scales the dimensions of a
process by s using constant field scaling, then the frequency scales
by a factor of 1/s. If we extrapolate from our 130nm design to a
90nm process, s is 0.69 and we get a maximal operating frequency
of 299 MHz / 0.69 = 433 MHz for the DRPU. Thus we consider a
90nm version running at 400 MHz. Feature size scales by s, thus
the DRPU ASIC has a die size of 4.83mm x 4.83mm = 23.3mm²
in the 90nm process, and we can instantiate eight copies on a
186.6mm² die. To provide enough memory bandwidth we would
need to connect this DRPU8 ASIC to a 25.6 GB/s memory
interface. External memory interfaces at that speed are difficult to
implement, but realistic considering current high-end GPUs with
external bandwidths of more than 40 GB/s. Again the memory interface
and connection to the Rendering Units would consume additional die
area. The DRPU8 ASIC would have an additional 3 times speedup
over the DRPU4 ASIC, because of higher frequency and twice the
number of computational units. The 904 floating point units would
provide a peak floating point performance of 361 GFlops, which is
very close to the peak floating point performance of today's GPUs.
Because of the high rendering performance, a high speed PCI
Express connection would be required to download the rendered pixels
for display.
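The constant field scaling steps above can be written out as a short calculation. The sketch below takes s = 0.69 and the 299 MHz starting frequency from the text. Treating the 49 mm² die as a square is an assumption made here for illustration, as is the function name.

```python
import math


def constant_field_scale(freq_mhz, die_area_mm2, s):
    """Scale frequency by 1/s and linear die dimensions by s (square die assumed)."""
    edge_mm = math.sqrt(die_area_mm2) * s
    return freq_mhz / s, edge_mm, edge_mm ** 2


freq_90nm, edge_90nm, area_90nm = constant_field_scale(299.0, 49.0, 0.69)
drpu8_area = 8 * area_90nm                 # eight copies, interconnect area ignored
speedup_vs_drpu4 = (400 / 266) * (8 / 4)   # clock ratio times ratio of copies

print(f"{freq_90nm:.0f} MHz, {edge_90nm:.2f} mm edge, {area_90nm:.1f} mm^2 per copy")
print(f"{drpu8_area:.1f} mm^2 for eight copies, {speedup_vs_drpu4:.1f}x over DRPU4")
```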
Table 1 gives an overview of frequency, peak floating point
performance, and die characteristics for the FPGA version, the
different ASIC design versions, the Pentium 4 chip, and the Cell
processor used for speed comparison.
We have not done detailed power analysis or estimation of any
of the ASIC versions, but we would expect power to be high, due
to the large number of floating point units and the computational
requirements of the rendering process. In this dimension we expect
no particular improvements over existing GPUs, which also exhibit
high power consumption.
To show the possible performance, we have chosen a number
of benchmark scenes (see Figure 1 and Figure 6) that cover a
large fraction of possible scene characteristics. The scenes range
from very simple ones like the Shirley6 and Office scenes, to
complex ones (Conference) and the Gael level from Unreal Tournament
2004. The Mafia Spheres scene shows a room containing four
reflecting and one refracting sphere, to show secondary ray tracing









Scene          triangles  objects  #rays  update cycles  render cycles  framerate  TP hit  GU hit  SP hit  bandwidth
Shirley6       0.5k       1        1.5M   -              14M            4.7 fps    98.6%   99.2%   85.7%   113 MB/s
Conference     282k       52       1.5M   -              39M            1.7 fps    81.3%   85.1%   89.6%   164 MB/s
Office         34k        1        1.5M   -              18M            3.6 fps    91.5%   93.7%   88.0%   103 MB/s
Mafia Room     15k        1        1.5M   -              24M            2.8 fps    91.4%   96.3%   67.2%   186 MB/s
Mafia Spheres  20k        6        1.6M   -              36M            1.8 fps    88.7%   96.1%   59.8%   210 MB/s
Hand           17k        2        1.3M   118k           13M            5.0 fps    91.8%   97.9%   75.3%   126 MB/s
Skeleton       16k        2        1.3M   113k           11M            5.9 fps    89.8%   97.5%   96.3%   73 MB/s
Helix          78k        2        1.5M   602k           18M            3.5 fps    80.0%   93.2%   87.2%   145 MB/s
Gael           52k        1        1.5M   -              34M            1.9 fps    87.7%   91.4%   72.1%   188 MB/s
DynGael        85k        4        1.5M   121k           33M            2.0 fps    86.1%   91.6%   88.0%   154 MB/s
Table 2: Performance statistics of the DRPU FPGA prototype clocked at 66 MHz. For several scenes, the complexity in number of triangles, 
instantiated objects, and number of rays shot per image at 1024x768 resolution are shown. Further, the table shows the number of cycles 
required for updating the B-KD tree, for rendering the image, and the resulting framerate. Cache hitrates are shown for the TP, GU, and SP 
cache. The low resulting external memory bandwidth is presented, showing the scalability of the approach. Phong shading is used including 
textures and shadows. The cycles required to read back the framebuffer contents for display are not included (but would be below 1% for most 
scenes). See Figures 1 and 6 for images of the scenes.
Scene          triangles  objects  #rays  DRPU FPGA  DRPU ASIC  DRPU4 ASIC  DRPU8 ASIC
Shirley 6      0.5k       1        1.5M   4.7 fps    18.8 fps   75.2 fps    225.6 fps
Conference     282k       52       1.5M   1.7 fps    6.7 fps    27.0 fps    81.2 fps
Office         34k        1        1.5M   3.6 fps    14.4 fps   57.6 fps    172.8 fps
Mafia Room     15k        1        1.5M   2.8 fps    11.2 fps   44.8 fps    134.4 fps
Mafia Spheres  20k        6        1.6M   1.8 fps    7.2 fps    28.8 fps    86.4 fps
Hand           17k        2        1.3M   5.0 fps    20.0 fps   80.0 fps    240.0 fps
Skeleton       16k        2        1.3M   5.9 fps    23.6 fps   94.4 fps    283.2 fps
Helix          78k        2        1.5M   3.5 fps    14.0 fps   56.0 fps    168.0 fps
Gael           52k        1        1.5M   1.9 fps    7.6 fps    30.4 fps    91.2 fps
DynGael        85k        4        1.5M   2.0 fps    8.0 fps    32.0 fps    96.0 fps
Table 3: Estimated performance of the DRPU versions for a number of benchmark scenes of varying complexity. We provide the number of 
cycles required for updating of the B-KD tree and rendering of the images at 1024x768 resolution with shadows, Phong shading, and textures. 
Frames per second are directly computed from the number of cycles required for the computation. The cycles required to read back the 
framebuffer contents for display are not included (but would be below 1% for most scenes). See Figures 1 and 6 for images of the scenes.









Scene       Pentium 4  Cell   DRPU FPGA  DRPU ASIC  DRPU4 ASIC  DRPU8 ASIC
Shirley 6   3.2        180.0  5.0        20.0       80.0        240.0
Office      2.6        n/a    4.1        16.4       65.6        196.8
Conference  2.0        60.0   3.4        13.6       54.4        163.2
Gael        2.0        n/a    3.8        15.2       60.8        182.4
Table 4: Performance comparison of the OpenRT software
implementation running on a Pentium 4 with 2.66 GHz, the Cell ray tracer,
the DRPU FPGA running at 66 MHz, post-layout estimates for the
DRPU ASIC, and estimates for the DRPU4 ASIC running at 266 MHz
and the DRPU8 ASIC running at 400 MHz. All performance numbers
are for 1024x768 resolution with Phong shading including bilinear
texturing, vertex normal interpolation, and a single light source
without shadows.
effects. Some Poser [18] animations (Hand, Skeleton, and Helix)
show the support for dynamic scenes. The vertex positions and
normals are precomputed by Poser, and uploaded via DMA for each
frame. The DynGael scene shows the combination of the static
Gael level with two dynamic skeleton instances.
We use a subset of these scenes for speed comparisons, see
Table 4. The FPGA prototype and the three ASIC versions are
compared against the OpenRT software ray tracer running on an
Intel Pentium 4 at 2.66 GHz [4] and against a Cell implementation
of ray tracing [3]. The results show that the
FPGA version outperforms the software implementation by 40%
to 70% even though clocked at a 40 times lower frequency. The
DRPU8 ASIC version would outperform the software ray tracer by
a factor of up to 75. A comparison to a Cell implementation of ray
tracing shows up to 2.5 times higher performance, despite the
hardware complexity being similar (see Table 1), and the DRPU8 ASIC
performing much more complex shading (including textures). This
shows the efficiency of the DRPU architecture compared to general
purpose designs.
For the full set of test scenes, detailed statistics are provided
in Table 2 and performance extrapolations in Table 3. The statistics
include the complexity of the scene, number of object instances,
and number of rays shot for computation. The scenes are rendered
with a realistically complex shader with more than 90 assembly
instructions to perform: bilinear texture lookup, diffuse term,
specular term, light fall-off, vertex normal interpolation, vertex color
interpolation, and pixel accurate shadows. Table 2 further shows
the exact number of cycles required to update the B-KD trees and
to render the image, and the resulting frame rate of the FPGA. The
cache hit rates of the direct mapped FPGA caches are an indicator
for the coherence of the computations, and typically are much
higher than 90%. The hit rates drop especially for higher resolution
textures that are accessed by the SP. As the ASIC versions implement
four-way set associative caches with twice the size of the FPGA
caches, much higher hit rates are expected, which would reduce
external bandwidth even more.
The performance extrapolations of Table 3 show performance
for the DRPU FPGA and all DRPU ASIC versions. Comparing
these performance values against Table 4, the numbers for the
Shirley6 and Office scenes are surprisingly only slightly lower
despite containing shadows. This is because traversal and shading
can be performed in parallel, and for these two simple scenes the
Ray Casting Units can trace the shadow ray at the same time as
the SP performs shading. For the test scenes, the estimated
performance of the DRPU8 ASIC is between 80 and 280 frames per
second. The performance mainly depends on the cost of the rays,
which increases with a higher number of visible geometry elements,
and the total number of rays shot. Thus the performance of the
Mafia Spheres scene is lower than that of the Mafia Room scene,
because more triangles are visible and more rays need to be shot
due to the refraction and reflection effects.
The Gael level renders with more than 90 frames per second at
1024x768 resolution even with two animated Skeleton instances.
This is sufficient for game play and would leave much room for
improved image quality. For instance, it would be possible to
improve filtering of edges and shadows by using adaptive
oversampling techniques. These techniques take an additional pass over the
generated image to find regions where more rays could effectively
improve image quality.
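A minimal sketch of such an additional pass is given below. It shows only one possible criterion and is not the method of any particular system: a pixel is flagged for extra rays when its luminance differs from a direct neighbor by more than a threshold. The function name, the luminance-only image representation, and the threshold value are assumptions for illustration.

```python
def pixels_to_refine(image, threshold):
    """Return (x, y) of pixels whose luminance differs from a 4-neighbor by more than threshold."""
    height, width = len(image), len(image[0])
    flagged = []
    for y in range(height):
        for x in range(width):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                inside = 0 <= nx < width and 0 <= ny < height
                if inside and abs(image[y][x] - image[ny][nx]) > threshold:
                    flagged.append((x, y))
                    break  # one high-contrast neighbor is enough
    return flagged


# Tiny luminance image with a vertical edge between columns 1 and 2.
example_image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
edge_pixels = pixels_to_refine(example_image, threshold=0.5)
print(edge_pixels)
```

Only the flagged pixels would receive additional rays, so the extra cost grows with the amount of edge and shadow boundary detail rather than with the full image size.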
The DRPU hardware architecture can render even highly
dynamic scenes efficiently, as shown by the results of the Hand,
Skeleton, and Helix animations. For these dynamic test scenes the
number of cycles required to update the B-KD tree is about two orders
of magnitude below the render cycles, and rendering these
animations causes little overhead.
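The relation between update and render cost can be read off Table 2 directly. The sketch below uses the rounded cycle counts from that table; the dictionary layout and the function name are hypothetical.

```python
# (update cycles, render cycles) per frame, rounded values from Table 2.
dynamic_scenes = {
    "Hand": (118e3, 13e6),
    "Skeleton": (113e3, 11e6),
    "Helix": (602e3, 18e6),
}


def render_to_update_ratio(update_cycles, render_cycles):
    """How many times more cycles rendering takes than the B-KD tree update."""
    return render_cycles / update_cycles


ratios = {
    name: render_to_update_ratio(update, render)
    for name, (update, render) in dynamic_scenes.items()
}
for name, ratio in ratios.items():
    print(f"{name}: rendering costs about {ratio:.0f}x the update")
```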
7 Conclusions
This paper presents ASIC implementations of the programmable
DRPU architecture for efficient high performance ray tracing of
dynamic scenes. The DRPU contains a Shading Processor
implemented as a four-element vector floating point processor core
with support both for synchronous SIMD execution of packets of
threads and multithreading. It also contains custom hardware for
ray/triangle intersection and for traversing the B-KD tree, which is
required for efficient ray tracing of dynamic scenes.
The FPGA prototype is fully working and makes for convincing
demonstrations of the power of this technique. We hope to fabricate
at least the single-DRPU ASIC to demonstrate the full potential of
this architecture. Measurements on the implemented FPGA
prototype and timings based on a 130nm ASIC design indicate that
performance levels sufficient for game play are achievable,
especially if it is possible to use a high-end 90nm ASIC technology. A
DRPU would also offer much higher quality of image and realism
due to the use of recursive ray tracing rather than rasterization.
References
[1] Alpha-Data. ADM-XRC-II. http://www.alphadata.uk.co, 2003.
[2] Arthur Appel. Some Techniques for Shading Machine Renderings of Solids.
SJCC, pages 27-45, 1968.
[3] Carsten Benthin, Ingo Wald, Michael Scherbaum, and Heiko Friedrich. Ray
Tracing on the CELL Processor. In IEEE Symposium on Interactive Ray
Tracing, 2006.
[4] Andreas Dietrich, Ingo Wald, Carsten Benthin, and Philipp Slusallek. The
OpenRT Application Programming Interface - Towards A Common API for
Interactive Ray Tracing. In Proceedings of the 2003 OpenSG Symposium, pages
23-31, Darmstadt, Germany, 2003. EUROGRAPHICS Association.
[5] Tim Foley and Jeremy Sugerman. Kd-tree acceleration structures for a gpu
raytracer. In HWWS '05: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS
conference on Graphics hardware, pages 15-22, New York, NY, USA, 2005.
ACM Press.
[6] Stuart A. Green. Parallel processing for computer graphics. MIT Press, pages
62-73, 1991.
[7] Stuart A. Green and Derek J. Paddon. A highly flexible multiprocessor solution
for ray tracing. The Visual Computer, 6(2):62-73, 1990.
[8] M. Porrmann, H. Kalte, and U. Rückert. Using a dynamically reconfigurable
system to accelerate octree based 3D graphics. Technical report, System and
Circuit Technology, University of Paderborn, 2000.
[9] D. Hall. The AR350: Today's ray trace rendering processor. In Proceedings of
the EUROGRAPHICS/SIGGRAPH workshop on Graphics Hardware - Hot 3D
Session, 2001.
[10] Hiroaki Kobayashi, Kenichi Suzuki, Kentaro Sano, and Nobuyuki Oba.
Interactive Ray-Tracing on the 3DCGiRAM Architecture. In Proceedings of
ACM/IEEE MICRO-35, 2002.
[11] Jonas Lext and Tomas Akenine-Moller. Towards Rapid Reconstruction for
Animated Ray Tracing. In Eurographics 2001 - Short Presentations, pages
311-318, 2001.
[12] Tony T.Y. Lin and Mel Slater. Stochastic Ray Tracing Using SIMD Processor
Arrays. The Visual Computer, pages 187-199, 1991.
[13] Robin Milner, Mads Tofte, and Robert Harper. The Definition of Standard ML.
1990.
[14] Tomas Moller and Ben Trumbore. Fast, minimum storage ray triangle
intersection. Journal of Graphics Tools, 2(1):21-28, 1997.
[15] Jean-Christophe Nebel. A Mixed Dataflow Algorithm for Ray Tracing on the
CRAY T3E. In Third European CRAY-SGI MPP Workshop, September 1997.
[16] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and Peter Pike
Sloan. Interactive ray tracing. In Interactive 3D Graphics (I3D), pages
119-126, April 1999.
[17] Hanspeter Pfister, Jan Hardenbergh, Jim Knittel, Hugh Lauer, and Larry Seiler.
The VolumePro real-time ray-casting system. Computer Graphics, 33, 1999.
[18] Poser. Poser Web Page. http://www.e-frontier.com, 2006.
[19] Timothy J. Purcell. Ray Tracing on a Stream Processor. PhD thesis, Stanford
University, 2004.
[20] Jorg Schmittler, Ingo Wald, and Philipp Slusallek. SaarCOR - A Hardware
Architecture for Ray Tracing. In Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 27-36,
2002.
[21] Jorg Schmittler, Sven Woop, Daniel Wagner, Wolfgang J. Paul, and Philipp
Slusallek. Realtime Ray Tracing of Dynamic Scenes on an FPGA Chip. In
Proceedings of Graphics Hardware, 2004.
[22] K. R. Subramanian. A Search Structure based on K-d Trees for Efficient Ray
Tracing. PhD thesis, The University of Texas at Austin, December 1990.
[23] Tomas Akenine-Moller and Thomas Larsson. Strategies for bounding volume
hierarchy updates for ray tracing of deformable models. Technical report,
February 2003.
[24] United Microelectronics Corporation. http://www.umc.com, 2005.
[25] Ingo Wald. Realtime Ray Tracing and Interactive Global Illumination. PhD
thesis, Computer Graphics Group, Saarland University, 2004. Available at
http://www.mpi-sb.mpg.de/~wald/PhD/.
[26] Ingo Wald, Carsten Benthin, and Philipp Slusallek. Distributed Interactive Ray
Tracing of Dynamic Scenes. In Proceedings of the IEEE Symposium on Parallel
and Large-Data Visualization and Graphics (PVG), 2003.
[27] Ingo Wald, Carsten Benthin, and Philipp Slusallek. Interactive Global
Illumination in Complex and Highly Occluded Environments. In Per H Christensen
and Daniel Cohen-Or, editors, Proceedings of the 2003 EUROGRAPHICS
Symposium on Rendering, pages 74-81, Leuven, Belgium, June 2003.
[28] Ingo Wald, Solomon Boulos, and Peter Shirley. Ray Tracing Deformable Scenes
using Dynamic Bounding Volume Hierarchies (revised version). Technical
Report, SCI Institute, University of Utah, No UUSCI-2006-023, 2006.
[29] Ingo Wald, Thiago Ize, Andrew Kensler, Aaron Knoll, and Steven G Parker.
Ray Tracing Animated Scenes using Coherent Grid Traversal. ACM SIGGRAPH
2006, 2006.
[30] Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner. Interactive
Rendering with Coherent Ray Tracing. Computer Graphics Forum,
20(3):153-164, 2001. (Proceedings of EUROGRAPHICS).
[31] Neil Weste and David Harris. CMOS VLSI Design: A Circuits and Systems
Perspective. Addison Wesley, 2005.
[32] Turner Whitted. An Improved Illumination Model for Shaded Display. CACM,
23(6):343-349, June 1980.
[33] Sven Woop, Erik Brunvand, and Philipp Slusallek. HWML: RTL/Structural
Hardware Description using ML. Technical report, Computer Graphics Lab,
Saarland University, 2006.
[34] Sven Woop, Gerd Marmitt, and Philipp Slusallek. B-KD Trees for Hardware
Accelerated Ray Tracing of Dynamic Scenes. In Proceedings of Graphics
Hardware, 2006.
[35] Sven Woop, Jorg Schmittler, and Philipp Slusallek. RPU: A Programmable Ray
Processing Unit for Realtime Ray Tracing. In SIGGRAPH 2005 Conference
Proceedings, pages 434-444, 2005.
[36] Xilinx. Virtex-II. http://www.xilinx.com, 2003.
