Hardware-Accelerated Ray-Triangle Intersection Testing for
High-Performance Collision Detection
Sung-Soo Kim, Seung-Woo Nam, Do-Hyung Kim and In-Ho Lee
Digital Content Research Division
Electronics and Telecommunications Research Institute (ETRI), South Korea
{sungsoo, swnam, kdh99, leeinho}@etri.re.kr
ABSTRACT
We present a novel approach for hardware-accelerated collision detection. This paper describes the design of the hardware
architecture for primitive intersection testing components implemented on a multi-FPGA Xilinx Virtex-II prototyping system.
This paper focuses on accelerating the ray-triangle intersection operation, which is one of the most important operations
in various applications such as collision detection and ray tracing. The proposed hardware architecture is also general enough
to handle intersection operations on other object pairs such as sphere vs. sphere, oriented bounding box (OBB) vs. OBB,
cylinder vs. cylinder, and so on.
The result is a hardware-accelerated ray-triangle intersection engine that is capable of out-performing a 2.8GHz Xeon processor,
running a well-known high performance software ray-triangle intersection algorithm, by up to a factor of seventy. In addition,
we demonstrate that the proposed approach could prove to be faster than current GPU-based algorithms as well as CPU based
algorithms for ray-triangle intersection.
Keywords: Collision Detection, Graphics Hardware, Intersection Testing, Ray Tracing.
1 INTRODUCTION
Collision detection is a fundamental task in many
diverse applications, including surgical simulation,
computer animation, computer games, robotics,
physically-based simulation, automatic path finding,
and virtual assembly simulation. The collision query
checks whether two objects intersect and returns all
pairs of overlapping features. We address the problem
of collision query among collision primitives for
interactive graphics applications. The set of colli-
sion primitives includes ray, axis-aligned bounding
box (AABB), oriented bounding box (OBB), plane,
cylinder, sphere and triangle.
The problem of fast and reliable collision detection
has been extensively studied [Bergen04, Ericson04].
Despite the vast literature, real-time collision queries
remain one of the major bottlenecks for interactive
physically-based simulation and ray tracing. One of the
challenges in the area is to develop the custom hard-
ware for collision detection and ray tracing [ALB05,
RBAZ05]. However, one major difficulty for imple-
menting hardware is the multitude of collision detection
and ray tracing algorithms. Dozens of algorithms and
data structures exist for hierarchical scene traversal and
intersection computation. Though the performance of
these algorithms seems to be similar in software
implementations, their applicability to hardware
implementation has not yet been thoroughly investigated.
Since collision detection is such a fundamental task,
it is highly desirable to have hardware acceleration
available just like 3D graphics accelerators. Using spe-
cialized hardware, the system’s CPU can be freed from
computing collisions.
1.1 Main Results
We present a novel FPGA-accelerated architecture for
fast collision detection among rigid bodies. Our proposed
custom hardware for collision detection supports
13 intersection types among rigid bodies. To evaluate
the proposed hardware architecture, we have implemented
it on an FPGA to accelerate intersection computations
among collision primitives.
We demonstrate the effectiveness of our hardware
architecture for collision queries in three scenarios: (a)
ray-triangle intersection computation with 260 thousand
static triangles, (b) the same computation with
dynamic triangles, and (c) dynamic sphere-sphere
intersection testing. The performance of our FPGA-based
hardware varies between 30 and 60 msec, depending
on the complexity of the scene and the types of collision
primitives. For our comparative study we also analyze
three popular ray-triangle intersection algorithms
to estimate the hardware resources they require. More
details are given in Section 4.
Permission to make digital or hard copies of all or part of
this work for personal or classroom use is granted without
fee provided that copies are not made or distributed for profit
or commercial advantage and that copies bear this notice
and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
Copyright UNION Agency – Science Press, Plzen, Czech
Republic.
Journal of WSCG       ISSN 1213-6972 17 ISBN 978-80-86943-00-8
As compared to prior methods, our hardware-
accelerated system offers the following advantages:
• Direct applicability to collision objects with dynam-
ically changing topologies since geometric transfor-
mation can be done in our proposed hardware;
• Sufficient memory on our board to buffer the
ray-intersection input and output data, and a significant
reduction in the number of data transmissions;
• Up to an order of magnitude faster runtime perfor-
mance over prior techniques for ray-triangle inter-
section testing;
• Interactive collision query computation on mas-
sively large triangulated models.
The rest of the paper is organized as follows. We
briefly survey previous work on collision detection in
Section 2. Section 3 describes the proposed hardware
architecture for accelerating collision detection. We
present our hardware implementation of ray-triangle in-
tersection in Section 4. Finally, we analyze our im-
plementation and compare its performance with prior
methods in Section 5.
2 RELATED WORK
The problems of collision detection and distance
computation are well studied in the literature. We refer
the readers to recent surveys [Bergen04, Ericson04]. In
this section, we give a brief overview of related work
on collision detection, programmable GPU-based ap-
proaches, and custom hardware for collision detection.
Collision Detection: Collision detection is one of
the most studied problems in computer graphics.
Bounding volume hierarchies (BVHs) are commonly
used for collision detection and separation distance
computation. Most collision detection schemes involve
updates to bounding volumes, pairwise bounding
volume tests, and pairwise feature tests between
possibly-intersecting objects. Complex models or
scenes are often organized into BVHs such as sphere
trees [Hubbard95], OBB-trees [GLM96], AABB-trees
[Bergen04], and k-DOP-trees [KHMSZ98]. Projection
of bounding boxes extents on the coordinate axes is
the basis of the sweep-and-prune technique [Cohen95].
However, these methods incur overhead for each time
interval tested, spent updating bounding volumes and
collision pruning data structures, regardless of the
occurrence or frequency of collisions during the time
interval.
Programmable GPU: With the new programmable
GPUs, tasks other than traditional polygon rendering
can exploit their parallel programmability. The GPU
can now be used as a general-purpose SIMD processor
and, following this idea, many existing algorithms have
recently been migrated to the GPU to solve problems
such as global illumination, linear algebra, image
processing, and multigrid solvers quickly [GLM05].
Recently, GPU-based ray tracing approaches have been
introduced [Foley05, PWH02]. Ray tracing was also
mapped to rasterization hardware using programmable
pipelines [PWH02]. However, according to [RBAZ05],
it seems that an implementation on the GPU cannot
gain a significant speed-up over a pure CPU-based
implementation. This is probably because the GPU is a
streaming architecture. Another disadvantage which
they share with GPUs is the limited memory.
Out-of-core solutions are in general not an alternative
due to the high bandwidth needed.
Custom Hardware: The need for custom graphics
hardware arises with the demand for interactive
physically-based simulations and real-time rendering
systems. The AR350 processor is a commercial product
developed by Advanced Rendering Technologies for
accelerating ray tracing [CRR04]. Schmittler et al.
proposed a hardware architecture (SaarCOR) for real-time
ray tracing and implemented the custom hardware using
an FPGA [SWWPS04, SWS05]. They also introduced the
programmable ray processing unit (RPU) based on
SaarCOR [Woop05]. The first published work on
dedicated hardware for collision detection was presented
in [ZK03]. It focused on a space-efficient implementation
of the algorithms, while we aim at maximum performance
for various types of collision queries in this paper.
In addition, it presented only a functional simulation,
while we present a full VHDL implementation on an
FPGA chip.
3 HARDWARE ARCHITECTURE
In this section, we give an overview of the hardware
architecture for accelerating collision detection. Our
hardware architecture is based on a modular collision
detection pipeline. As shown in Figure 1, the proposed
architecture includes three key parts: the input registers,
the collision detection engine, and the update engine.
3.1 Input Registers and Transformer
Our proposed hardware has three inputs: the counter
register, the primary data register file, and the secondary
data register file. The transformer provides geometric
transformation functions for secondary objects to
improve performance. The counter register contains the
number of primary objects and the number of secondary
objects. The geometries of the primary objects are stored
in the primary data register file. The secondary data
register file likewise holds the geometries of the
secondary objects for collision queries.
In our research, we suppose that the primary objects
P change at every time step. On the other hand, the
secondary objects S do not change their geometry in the
local coordinate system. Therefore, only geometric
transformations such as translation and rotation need to
be applied to S. For instance, in ray tracing applications
the triangulated models are S and the rays are P for the
intersection computations. More specifically, S is
denoted as $S = \{(T_1, \ldots, T_n) \mid n \ge 1\}$, where
each T is a triangle defined by three vertices
$V_j \in \mathbb{R}^3$, $j \in \{0, 1, 2\}$. P is the set of
rays, each given by its origin O and direction D.

Figure 1: The proposed hardware architecture.
[Block diagram labels: Input registers (Counter, Primary,
Secondary); Transformer; Collision detection engine
(Acceleration structures, Primitive intersection testing;
function selector and ready signals); Output registers
(Collision position, Collision flag, Distance (T) value);
Update Engine (Buffers: Collision position, F-Index,
Stencil-T; Memory controller).]
When testing the intersection between the primary
objects and the secondary objects, we perform the
following processing steps. First, we upload the secondary
objects to on-board memory all at once through the direct
memory access (DMA) controller. Second, we transfer the
primary objects to on-chip memory in the collision
detection engine (CDE). For this step, we use register
files that hold packets of primary object data to reduce
the feeding time for the CDE. Finally, we invoke the
ray-triangle intersection module in the CDE to compute
the intersections between primary objects and secondary
objects. The details of our hardware-accelerated
ray-triangle intersection algorithm for massive
triangulated models are shown in Algorithm 1.
procedure HW-AcceleratedRayTriangleIntersection
input : P, S
output : R (CP, F-value, index, T-value)
collisionType CT = RAY_TRIANGLE;
initializeDevice();
secondaryUpload(S);
for ∀ O_k, D_k ∈ P do
primaryRegFileUpload(O_k, D_k);
invokeCDE(CT);
R ← downloadSRAM();
return R
Algorithm 1: Hardware-Accelerated Ray-Triangle
Intersection Testing.

One of the primary benefits of the transformer in our
architecture is that it reduces the number of
re-transmissions of the secondary objects from main memory
to on-board memory. If certain objects from the geometry
buffer have to be reused, they can be transformed by the
transformer without re-transmission from main memory.
Reducing the number of re-transmissions in turn reduces
the bus bottleneck. The bus width from the secondary
register file to the CDE is 288 (= 9 × 32) bits, so we can
transfer 288 bits to the collision detection engine in
every clock cycle. The ultimate goal of our work is to
apply our results to physically-based simulation, so we
use single-precision floating point to provide more
accurate results.
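As a rough software illustration of what the transformer does, the sketch below applies a rotation and a translation to triangles that are assumed to be already resident in on-board memory, so that only a new pose has to be supplied per frame. The function names (`rotation_z`, `transform_vertex`, `transform_triangles`) and the tuple-based data layout are hypothetical and chosen for illustration; the paper does not specify the transformer's interface.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation matrix about the z axis (row-major tuples)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))

def transform_vertex(rot, trans, v):
    """Return rot * v + trans for a single 3D vertex."""
    return tuple(
        sum(rot[r][k] * v[k] for k in range(3)) + trans[r] for r in range(3)
    )

def transform_triangles(rot, trans, triangles):
    """Apply the same rigid transformation to every vertex of every triangle."""
    return [
        tuple(transform_vertex(rot, trans, v) for v in tri) for tri in triangles
    ]

# Local-space geometry is stored once (stand-in for the uploaded S);
# each frame only the pose (rotation, translation) changes.
local_tris = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))]
world_tris = transform_triangles(rotation_z(math.pi / 2), (0.0, 0.0, 5.0), local_tris)
```

The local-space list is never modified, which mirrors the idea that the uploaded secondary objects can be reused across frames without being sent again.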
3.2 Collision Detection Engine
The collision detection engine (CDE) is a modular
hardware component for performing the collision
computations between P and S. The CDE consists of
the acceleration structures and primitive intersection
testing components.
As discussed in Section 2, a wide variety of
acceleration schemes have been proposed for collision
detection over the last two decades. For example, there
are octrees, general BSP-trees, axis-aligned BSP-trees
(kd-trees), uniform, non-uniform and hierarchical grids,
BVHs, and hybrids of several of these methods. Our
hardware architecture can accommodate hierarchical
acceleration structures for collision culling, as shown
in Figure 1. However, we could not implement the
acceleration structure due to the FPGA resource limit.
With a hierarchical acceleration structure, we could
search for the index or the smallest T-value much faster.
The primitive intersection testing component performs
several operations for intersection computations
among collision primitives. To provide these operations,
we classified 13 types of intersection queries according
to the primary and secondary collision primitives:
ray-triangle, OBB-OBB, triangle-AABB, triangle-OBB,
sphere-sphere, triangle-sphere, ray-cylinder,
triangle-cylinder, cylinder-cylinder, OBB-cylinder,
OBB-plane, ray-sphere, and sphere-OBB intersection
testing. We have implemented hardware-based collision
pipelines to verify these intersection types. The proposed
hardware contains the 13 collision pipes, and more pipes
can be added if hardware resources are sufficient. Using
the function selector signal, the CDE selects one
collision pipe that is ready to work among the 13
collision pipes. Each pipe can be triggered in parallel by
its ready signal. However, it is difficult to execute the
pipelines in parallel due to the limited input bus width
and routing problems. Thus, our proposed hardware reads
an input packet from on-board memory and stores it in the
register file, which contains two or more elements.
We use a pipelined technique in which multiple in-
structions are overlapped in execution. This technique
is used for real hardware implementation in order to im-
prove performance by increasing instruction through-
put, as opposed to decreasing the execution time of an
individual instruction.
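The throughput argument can be made concrete with a small cycle count. The sketch below assumes that one intersection test enters the pipeline per clock cycle and that the pipeline depth equals the latency figure reported for Möller's algorithm in Table 1 (10, taken here to be clock cycles, which the paper does not state explicitly). The function names are hypothetical.

```python
def pipelined_cycles(num_tests, latency):
    """Cycles needed when a new test is issued every clock cycle."""
    if num_tests == 0:
        return 0
    # The first result appears after `latency` cycles, then one per cycle.
    return latency + num_tests - 1

def unpipelined_cycles(num_tests, latency):
    """Cycles needed when each test must finish before the next one starts."""
    return num_tests * latency

NUM_TRIANGLES = 259_572   # size of the test terrain used in Section 5
LATENCY = 10              # Table 1 latency for Möller's algorithm (assumed cycles)

with_pipeline = pipelined_cycles(NUM_TRIANGLES, LATENCY)
without_pipeline = unpipelined_cycles(NUM_TRIANGLES, LATENCY)
```

Under these assumptions the latency of a single test is unchanged, but the total cycle count for a long stream of tests approaches one cycle per test.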
There are four outputs: the collision flag (F-value),
the collision position (CP), the index, and the separation
distance or penetration depth (T-value). To obtain
these outputs, the CDE performs the intersection testing
between P and S. If a collision occurs, the CDE stores
output values for CP, index, T-value and F-value. The
CP denotes the collision position of the object pair and
the index is the triangle (T) index of the triangulated
mesh. The T-value denotes the penetration depth between
the two objects and the F-value is set to true. Otherwise,
CP and index hold invalid values, the T-value is the
separation distance between the two objects and the
F-value is set to false.
3.3 Update Engine
We can simplify routing data lines and make memory
controller efficient by coupling buffers such as F-index
buffer and two stencil-T buffers as shown in Figure 1.
We compare old T-value from stencil-T buffer0 (or 1)
with new T-value from CDE and update smaller T-value
from stencil-T buffer1 (or 0) of the two values within
one clock. We do not transfer T-values from the stencil-
T buffer to CPU in order to find the smallest or the
largest T, which makes it possible to reduce transmis-
sion time. Stencil value in the stencil-T buffer is used
for masking some regions of the F-index buffer to save
searching time for the index of the collision object.
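The compare-and-keep-smaller behaviour of the update engine can be sketched in software as follows. This is a simplified model under stated assumptions: it uses a single T buffer per ray instead of the two alternating stencil-T buffers, it omits the stencil masking, and the class and attribute names (`UpdateEngine`, `t_buffer`, `index_buffer`) are hypothetical.

```python
import math

class UpdateEngine:
    """Keeps, per ray, the smallest T-value seen so far and its triangle index."""

    def __init__(self, num_rays):
        self.t_buffer = [math.inf] * num_rays   # stand-in for the stencil-T buffer
        self.index_buffer = [-1] * num_rays     # stand-in for the F-index buffer

    def update(self, ray_id, tri_index, hit, t_value):
        # Only a reported hit with a smaller T-value replaces the stored entry.
        if hit and t_value < self.t_buffer[ray_id]:
            self.t_buffer[ray_id] = t_value
            self.index_buffer[ray_id] = tri_index

engine = UpdateEngine(num_rays=2)
engine.update(0, tri_index=7, hit=True, t_value=4.0)
engine.update(0, tri_index=3, hit=True, t_value=2.5)   # closer hit wins
engine.update(0, tri_index=9, hit=False, t_value=1.0)  # a miss is ignored
engine.update(0, tri_index=11, hit=True, t_value=3.0)  # farther hit is ignored
```

Because the minimum is maintained next to the engine, only the final index and T-value per ray would need to be read back by the host.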
Figure 2: The acceleration board with a 64-bit/66 MHz
PCI interface. On the board, there is a Xilinx V2P20
for the PCI controller and a Xilinx V2P70 for memory and
the collision detection logic. The board also includes two
1 GB DDR memories with a 288-bit input bus and seven
2 MB SRAMs with a 224-bit output bus.
[Photograph labels: Collision Detection Engine, PCI
Controller, DDR Memory, DDR Memory.]
We use IEEE 754 single-precision floating point to
represent each element of a vertex or vector and the
T-value, so that the results can be compared with the
speed of CPU arithmetic. One of the main reasons for
using single-precision floating point is to provide more
accurate results in physically-based simulation systems.
We therefore created many floating-point arithmetic
units with the CoreGen library supported by the Xilinx
ISE tool.
We use two types of memory on the board. One
is upload memory, which consists of two DDR SDRAMs.
The other is storage memory, which consists of six
SRAMs that store the output results (see Figure 2).
Block RAMs on the FPGA are used for buffering P. The
primary register file corresponds to the block RAM on
the FPGA.
In our ray-triangle intersection computation, the
primary object data P contains an origin point O and a
direction vector D for each ray. A total of 256 rays can
be transferred from main memory to the block RAMs on the
FPGA at a time. Each secondary object in S is a triangle
T, which contains three vertices. When there are more
than 256 rays, they are divided into packets of 256 rays,
and the packets are transferred one by one, one per step.
We define a step as processing collision detection
between one packet of primary objects and all secondary
objects. The larger the block RAMs are, the better the
CDE performs. Although FPGAs usually have only several
small memories, the advantage of such memories is that
the memory blocks can be accessed in parallel.
Each triangle of the secondary object is represented
using 288 (9 × 32) bits of data. Nearly 55 million
triangles can be transferred from main memory to the two
DDR SDRAMs on the board through the DMA controller.
We designed a wide bus for the secondary object data to
eliminate the input bottleneck of the CDE. Therefore, we
are able to read one triangle from the queue of the DDR
SDRAM in each hardware clock cycle.
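The data sizes quoted above can be checked with a short host-side sketch. It packs one triangle as nine single-precision floats (288 bits) and splits a ray list into packets of 256. The byte order, the vertex ordering within the record, and the helper names (`pack_triangle`, `ray_packets`) are assumptions for illustration and are not taken from the paper.

```python
import struct

RAYS_PER_PACKET = 256      # from the text: 256 rays are transferred at a time
TRIANGLE_FORMAT = "<9f"    # assumption: little-endian, (x, y, z) for V0, V1, V2

def pack_triangle(v0, v1, v2):
    """Serialize one triangle as nine 32-bit floats (36 bytes = 288 bits)."""
    return struct.pack(TRIANGLE_FORMAT, *v0, *v1, *v2)

def ray_packets(rays):
    """Yield consecutive slices of at most RAYS_PER_PACKET rays."""
    for start in range(0, len(rays), RAYS_PER_PACKET):
        yield rays[start:start + RAYS_PER_PACKET]

triangle_bytes = pack_triangle((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
packet_sizes = [len(p) for p in ray_packets(list(range(600)))]
```

With 600 rays, three steps are needed: two full packets and one partial packet, each of which is tested against all secondary objects.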
4 ANALYSIS OF INTERSECTION AL-
GORITHMS
In this section we present the analysis results for
ray-triangle intersection algorithms in terms of hardware
resources. Fast ray-triangle intersection has long been
an active field of research in computer graphics and has
led to a large variety of algorithms [Plü65, Badouel90,
MT97, SF98, SF01].
We use three major ray-triangle intersection
algorithms: the first is Badouel's algorithm
[Badouel90], the second is Möller and Trumbore's
algorithm [MT97], and the last is the algorithm
using Plücker coordinates [Plü65]. These algorithms
are known to be stable and highly efficient. Because
we are mostly interested in performance-related aspects
of intersection testing, we skip correctness validations
and refer to the original publications instead.
Badouel’s Algorithm: The algorithm introduced by D.
Badouel is similar to Snyder and Barr’s earlier approach
[Badouel90]. This algorithm is based on the study of
barycentric coordinates, following the line of Snyder
and Barr’s algorithm. It is split into two phases:
1. The ray is tested for intersection with the triangle’s
embedding plane, defined by the three vertices Vi,
i ∈ {0, 1, 2} of the triangle. Combining the para-
metric representation of the ray r and the implicit
plane equation leads to
\[ t = -\frac{d + N \cdot O}{N \cdot D} \tag{1} \]
where O denotes the ray origin, D denotes the ray
direction, N denotes the normal of the embedding plane,
$d = -V_0 \cdot N$ and $r(t) = O + Dt$.
Based on the evaluation of the parameter t, the in-
tersection is rejected if either the ray and the trian-
gle are parallel (N · D = 0), the intersection point
lies behind the origin of the ray (t ≤ 0) or a closer
intersection has already been found (t > tnearest).
2. If the ray intersects the embedding plane, the
coordinates of the intersection point P are determined.
As shown in Figure 3, point P can be expressed as
\[ \overrightarrow{V_0 P} = \alpha \overrightarrow{V_0 V_1} + \beta \overrightarrow{V_0 V_2} \tag{2} \]
Finally, the intersection point P is inside the triangle
if α ≥ 0, β ≥ 0 and α + β ≤ 1.
Figure 3: Parametric representation of the ray-triangle
intersection point P .
For a more detailed derivation of the algorithm we refer
to Badouel’s original article [Badouel90].
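The two phases above can be sketched in software as follows. This is a minimal illustration, not the circuit analyzed in the paper: the function name `badouel_intersect`, the tolerance `eps`, and the tuple-based vectors are assumptions. To solve Equation (2), the sketch drops the coordinate in which the normal N is largest and applies Cramer's rule to the remaining 2D system.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def badouel_intersect(O, D, V0, V1, V2, t_nearest=float("inf"), eps=1e-9):
    """Return (t, alpha, beta) for a hit, or None for a miss."""
    # Phase 1: intersect the ray with the embedding plane, Equation (1).
    N = cross(sub(V1, V0), sub(V2, V0))
    denom = dot(N, D)
    if abs(denom) < eps:                 # ray parallel to the plane
        return None
    d = -dot(V0, N)
    t = -(d + dot(N, O)) / denom
    if t <= 0.0 or t > t_nearest:        # behind the origin or not the closest
        return None
    P = tuple(O[k] + t * D[k] for k in range(3))
    # Phase 2: solve Equation (2) in the plane obtained by dropping the
    # dominant axis of N, which keeps the 2x2 determinant away from zero.
    i0 = max(range(3), key=lambda k: abs(N[k]))
    i1, i2 = [k for k in range(3) if k != i0]
    u0, v0 = P[i1] - V0[i1], P[i2] - V0[i2]
    u1, v1 = V1[i1] - V0[i1], V1[i2] - V0[i2]
    u2, v2 = V2[i1] - V0[i1], V2[i2] - V0[i2]
    det = u1 * v2 - u2 * v1
    alpha = (u0 * v2 - u2 * v0) / det
    beta = (u1 * v0 - u0 * v1) / det
    if alpha < 0.0 or beta < 0.0 or alpha + beta > 1.0:
        return None
    return t, alpha, beta

tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = badouel_intersect((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri)
miss = badouel_intersect((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri)
```

The plane normal is recomputed per call here; storing the plane parameters per triangle instead is the memory cost that the next algorithm avoids.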
Möller-Trumbore’s Algorithm: The algorithm pro-
posed by Möller and Trumbore does not test for inter-
section with the triangle’s embedding plane and there-
fore does not require the plane equation parameters
[MT97]. This is a big advantage mainly in terms of
memory consumption – especially on the GPU and cus-
tom hardware – and execution performance. The algo-
rithm goes as follows:
1. In a series of transformations the triangle is first
translated into the origin and then transformed to a
right-angled unit triangle in the y−z plane, with the
ray direction aligned with x. This can be expressed
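by a linear equation
\[ \begin{bmatrix} t \\ u \\ v \end{bmatrix} = \frac{1}{P \cdot E_1} \begin{bmatrix} Q \cdot E_2 \\ P \cdot T \\ Q \cdot D \end{bmatrix} \tag{3} \]
where $E_1 = V_1 - V_0$, $E_2 = V_2 - V_0$, $T = O - V_0$,
$P = D \times E_2$ and $Q = T \times E_1$.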
2. This linear equation can now be solved to find the
barycentric coordinates of the intersection point
(u, v) and its distance t from the ray origin.
Again we refer to the original article for a more de-
tailed explanation. Optimized variations of the original
implementation can be found in Möller’s follow-up
article.
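Equation (3) translates almost directly into code. The following sketch evaluates it with early rejection on u and v; the function name `moller_trumbore`, the tolerance `eps`, and the tuple-based vectors are assumptions for illustration, and the sketch is a software reference rather than the pipeline implemented on the FPGA.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def moller_trumbore(O, D, V0, V1, V2, eps=1e-9):
    """Return (t, u, v) per Equation (3) for a hit, or None for a miss."""
    E1 = sub(V1, V0)
    E2 = sub(V2, V0)
    P = cross(D, E2)
    det = dot(P, E1)                 # the denominator P . E1
    if abs(det) < eps:               # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det              # the single division of the algorithm
    T = sub(O, V0)
    u = dot(P, T) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    Q = cross(T, E1)
    v = dot(Q, D) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(Q, E2) * inv_det
    if t <= 0.0:                     # intersection behind the ray origin
        return None
    return t, u, v

tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = moller_trumbore((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri)
miss = moller_trumbore((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri)
```

Only the three vertices are needed as per-triangle input, which matches the nine 32-bit values per triangle used elsewhere in the paper.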
Algorithm using Plücker Coordinates: Plücker
coordinates are a way of specifying directed lines in
three-dimensional space [Plü65]. The Plücker coordinates
$\pi_r$ represent a ray $R(t) = O + Dt$ by an oriented line:
\[ \pi_r = \{D : D \times O\} = \{p_r : q_r\} \tag{4} \]
Then the inner product of Plücker space
\[ \pi_r \odot \pi_s = p_r \cdot q_s + q_r \cdot p_s \tag{5} \]
defines the relative orientation of the two lines r and s.
A positive result means that r passes s clockwise, while
in the negative case r passes s counter-clockwise. If this
product is zero both lines intersect each other. Note that
this inner product is proportional to the signed volume
of a tetrahedron spanned by the origin and direction val-
ues r and s.
Furthermore, the Plücker test directly provides the
scaled barycentric coordinates of the intersection of a
ray with the three edges $e_i$ of a triangle, which is a
major advantage over plane intersection-based approaches.
Thus, to compute barycentric coordinates, each Plücker
value only needs to be divided by the sum of all three
values obtained from the three edges of a triangle:
\[ w_i = \pi_r \odot \pi_{e_i}, \qquad u_i = \frac{w_i}{\sum_{j=0}^{2} w_j} \tag{6} \]
We compared these algorithms in terms of latency,
the number of I/Os, and hardware resources, as shown
in Tables 1 and 2. We could not use the Plücker test,
which requires too many multipliers and inputs relative
to Möller's and Badouel's algorithms. Preprocessing of
the Plücker coordinates reduces the number of inputs and
the latency of the hardware pipeline. However, it still
needs more storage than the others.
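For comparison with the two algorithms above, the Plücker test of Equations (4) to (6) can be sketched as follows. The edge ordering (edge i runs from vertex i to vertex i+1), the function names, and the final distance computation are assumptions for illustration: the Plücker products only decide whether the line through the ray passes inside the triangle, so the sketch recovers the hit point from the barycentric weights and rejects hits behind the origin in a separate step.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker(direction, origin):
    """Equation (4): the pair (p, q) = (D, D x O)."""
    return direction, cross(direction, origin)

def inner(a, b):
    """Equation (5): p_a . q_b + q_a . p_b."""
    return dot(a[0], b[1]) + dot(a[1], b[0])

def plucker_intersect(O, D, V0, V1, V2):
    """Return (t, [u0, u1, u2]) for a hit, or None for a miss."""
    verts = (V0, V1, V2)
    ray = plucker(D, O)
    w = [inner(ray, plucker(sub(verts[(i + 1) % 3], verts[i]), verts[i]))
         for i in range(3)]
    # The line passes inside the triangle only if it passes all three
    # edges with the same orientation.
    if not (all(x >= 0.0 for x in w) or all(x <= 0.0 for x in w)):
        return None
    total = sum(w)
    if total == 0.0:                      # ray lies in the triangle plane
        return None
    u = [x / total for x in w]            # Equation (6)
    # u[i] weights the vertex opposite edge i: u[0] -> V2, u[1] -> V0, u[2] -> V1.
    hit_point = tuple(u[1] * V0[k] + u[2] * V1[k] + u[0] * V2[k] for k in range(3))
    t = dot(sub(hit_point, O), D) / dot(D, D)
    if t <= 0.0:
        return None
    return t, u

tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = plucker_intersect((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri)
miss = plucker_intersect((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri)
```

The three edge records are independent of the ray, so they could be precomputed per triangle, which is the preprocessing trade-off between inputs, latency and storage mentioned above.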
Algorithms   # of inputs   # of outputs   Latency
Badouel's    9             6              16
Möller's     9             6              10
Plücker's    15            6              17
Table 1: Comparison of ray-triangle intersection
algorithms in terms of the number of inputs, the number
of outputs and latency for hardware implementation.

Möller's algorithm is similar to Badouel's in
terms of the latency of the hardware pipeline, the
number of I/Os, and hardware resources, as shown in
Tables 1 and 2. Möller's algorithm has been shown to be
more efficient than Badouel's algorithm in terms of
processing speed and storage usage [MT97]. Therefore,
we select Möller's algorithm for the VHDL implementation
of a real circuit on the FPGA.

Algorithms   Badouel's   Möller's   Plücker's
Multiplier   27          27         54
Divider      2           1          1
Adder        13          12         31
Subtractor   23          15         17
Comparator   6           8          3
AND          3           2          2
Table 2: Analysis of the hardware resources for
ray-triangle intersection algorithms.
5 IMPLEMENTATION AND ANALYSIS
In this section we describe the implementation of our
collision detection hardware and highlight its applica-
tion to perform ray-triangle intersection testing for huge
triangulated meshes.
5.1 Implementation
We have evaluated our hardware on a PC running the
Windows XP operating system with an Intel Xeon 2.8 GHz
CPU, 2 GB of memory and an NVIDIA GeForce 7800 GT
GPU. We used C++, OpenGL as the graphics API, and the
Cg language for implementing the fragment programs
[Fern03]. We have implemented the ray-triangle collision
detection engine in VHDL and simulated it with
ModelSim by Mentor Graphics. The ray-triangle
intersection algorithm we used is Möller's algorithm.
To evaluate our hardware architecture, we realized this
algorithm as circuits on an FPGA. In our experiments,
the inputs for intersection testing are dynamic rays for
P and a triangulated terrain containing 259,572 triangles
for S (Figure 4). The origin of the ray moves along the
flight path shown as a red curve, and the direction of
the ray changes randomly in every frame (Figure 8 (a)).
Figure 4: Our test terrain model (259,572 triangles)
5.2 Comparison
We have classified three configurations of collision
detection according to the properties of the collision
primitives. A static object is an object whose
topology does not change in the scene. On the other
hand, a dynamic object is an object whose topology
changes in the scene in each frame.
Static Objects vs. Static Objects: In this scenario, the
performance depends on the number of primary objects
due to the limited size of the block RAMs on the FPGA.
Thus, we choose the set with the smaller number of
objects as the primary objects in our architecture. If
the number of objects is larger than the capacity of the
block RAM, then data transmission from main memory to
block RAM occurs two or more times.
Static Objects vs. Dynamic Objects: We choose the
dynamic objects as the secondary objects. Since the
transformation is performed in our hardware, we
do not need to retransfer the data of dynamic objects
unless objects disappear or are newly created. The
position and orientation of the dynamic objects can be
changed by the transformer in Figure 1. We expect the
performance to be comparable to the above case.
Dynamic Objects vs. Dynamic Objects: Our hardware
architecture only supports the transformation function
for secondary objects. In this scenario, the transmission
time is determined by the number of primary objects,
which are transformed on the CPU. Thus, the performance
depends on the number of primary objects and the CPU
processing speed.
We evaluate the performance of our proposed
architecture in each case by comparing it with that of
the CPU and GPU. The proposed hardware performs 259,572
ray-triangle collision tests per frame, which takes 31
milliseconds including the ray data transmission time,
while the CPU-based software implementation takes 2,100
milliseconds, as shown in Figure 5. Our hardware was
about 70 times faster than the CPU-based ray-triangle
implementation. The proposed hardware is also four times
faster than the GPU-based ray-triangle intersection
approach. For dynamically moving vertices of the
triangles on the terrain, the proposed hardware was 30
times faster than the CPU-based approach, as shown in
Figure 6.

Figure 5: The comparison result of the ray-triangle
intersection testing (static objects vs. static objects).
[Plot: time (msec) versus frame number for the CPU-based
approach, the GPU-based approach, and our approach.]
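The reported static-scene timings can be turned into a speedup and a test rate with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (259,572 tests per frame, 31 ms for the hardware, 2,100 ms for the CPU); the variable names are illustrative, and the derived test rate includes the ray transmission time because the 31 ms figure does.

```python
TESTS_PER_FRAME = 259_572   # ray-triangle tests per frame (terrain size)
HW_MS = 31.0                # hardware time per frame, incl. ray transmission
CPU_MS = 2100.0             # CPU-based software time per frame

speedup = CPU_MS / HW_MS                                  # about 67.7
hw_tests_per_second = TESTS_PER_FRAME / (HW_MS / 1000.0)  # about 8.4 million
cpu_tests_per_second = TESTS_PER_FRAME / (CPU_MS / 1000.0)
```

The ratio of roughly 68 is consistent with the "about 70 times" figure stated for the static case.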
Figure 6: The comparison result of the ray-triangle
intersection testing (static objects vs. dynamic objects).
[Plot: time (msec) versus frame number for the CPU-based
approach, the GPU-based approach, and our approach.]
Figure 7: The comparison result according to the
number of objects.
[Plot: frames per second (FPS) versus number of objects
(100, 200, 500, 1000) for the CPU-based approach and our
approach.]
We also performed another experiment on dynamic
sphere-sphere intersection computation. In this
scenario, one thousand spheres move dynamically in
every frame. The input data for each sphere consists of
a center point and a radius, represented as four 32-bit
floating-point values. For collision detection between
dynamically moving spheres, our hardware was 1.4 times
faster than the CPU-based implementation (Figure 7).
Figure 8 (b) shows a snapshot of the dynamic
sphere-sphere intersection result.

Figure 8: Snapshots of intersection results:
(a) ray-triangle intersection, (b) sphere-sphere
intersection. A ray (blue line) is shot on the
triangulated terrain in an arbitrary direction for each
frame.
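A sphere-sphere query with the four-value input described above can be sketched as follows. The function name `sphere_sphere` and the tuple layout `(cx, cy, cz, r)` are assumptions; the returned pair follows the F-value and T-value convention from Section 3.2 (penetration depth on collision, separation distance otherwise), and treating exactly touching spheres as colliding is a choice made here, not something the paper specifies.

```python
import math

def sphere_sphere(a, b):
    """a and b are (cx, cy, cz, r); returns (F-value, T-value)."""
    center_distance = math.dist(a[:3], b[:3])
    gap = center_distance - (a[3] + b[3])
    if gap <= 0.0:
        return True, -gap      # collision: T-value is the penetration depth
    return False, gap          # no collision: T-value is the separation distance

overlapping = sphere_sphere((0.0, 0.0, 0.0, 1.0), (1.5, 0.0, 0.0, 1.0))
separated = sphere_sphere((0.0, 0.0, 0.0, 1.0), (3.0, 0.0, 0.0, 1.0))
```

Compared with a ray-triangle test, this query needs one square root and a handful of additions, so the host-side transfer and transformation of the moving spheres accounts for a larger share of the frame time.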
5.3 Analysis and Limitations
Our hardware provides good collision detection
performance for large triangulated meshes. The overall
benefit of our approach is due to two reasons:
• Data reusability: We exploit the transformer in the
proposed hardware to avoid the transmission bottleneck
caused by performing the transformation on the CPU. As a
result, we have observed a 30 to 70 times improvement
in ray-triangle intersection computation over the
CPU-based method and a fourfold improvement over the
GPU-based method.
• Runtime performance: We use the high-speed
processing power of the proposed hardware. We
also utilize instruction pipelining to improve the
throughput of the collision detection engine. Moreover,
our current hardware implementation involves
no hierarchy computation or update.
For these two reasons, we obtain considerable
speedups over prior methods. Moreover, we are able to
perform various collision queries at almost interactive
frame rates.
Limitations: Our approach has a few limitations. Our
hardware architecture includes a component for
acceleration structures such as kd-trees, grids and BVHs
(Figure 1). However, we could not implement this
component due to the hardware resource limit, so our
current implementation does not support hierarchical
collision detection. This limitation could be worked
around by performing the traversal of acceleration
structures on the CPU.
6 CONCLUSION
We have presented a dedicated hardware architecture for
performing collision queries. We evaluated the hardware
architecture for ray-triangle and sphere-sphere collision
detection under the three configurations.
We have used our hardware to perform different col-
lision queries (ray-triangle intersection, sphere-sphere
intersection) in complex and dynamically moving mod-
els. The result is a hardware-accelerated ray-triangle
intersection engine that is capable of out-performing
a 2.8GHz Xeon processor, running a well-known
high performance software ray-triangle intersection
algorithm, by up to a factor of seventy. In addition, we
demonstrate that the proposed approach could prove to
be faster than current GPU-based algorithms as well as
CPU based algorithms for ray-triangle intersection.
REFERENCES
[ALB05] N. Atay, J.W. Lockwood, and B. Bayazit,
A Collision Detection Chip on Reconfigurable
Hardware, In Proceedings of Pacific Conference
on Computer Graphics and Applications (Pacific
Graphics), Oct. 2005.
[Badouel90] D. Badouel, An Efficient Ray-Polygon In-
tersection, Graphics Gems I, pp. 390-394, 1990.
[Bergen04] G. V. D. Bergen, Collision Detection in
Interactive 3D Environments, Elsevier-Morgan
Kaufmann, 2004.
[Cohen95] J.D. Cohen, M.C. Lin, D. Manocha and
M.K. Ponamgi, I-COLLIDE: An Interactive and
Exact Collision Detection System for Large-
Scale Environments, In Symposium on Interac-
tive 3D Graphics, pp.189-196, 1995.
[CRR04] C. Cassagnabere, F. Rousselle, and C Re-
naud, Path Tracing Using AR350 Processor, In
Proceedings of the 2nd International Conference
on Computer Graphics and Interactive Tech-
niques in Australasia and South East Asia, pp.
23-29, 2004.
[Ericson04] C. Ericson, Real-Time Collision Detec-
tion, Morgan Kaufmann, 2004.
[Fern03] Randima Fernando, Mark J. Kilgard, The Cg
Tutorial, Addison-Wesley, 2003.
[Foley05] T. Foley and J. Sugerman, KD-Tree
Acceleration Structures for a GPU Ray Tracer, In
Proceedings of the ACM SIGGRAPH/Eurographics
Conference on Graphics Hardware, pp. 15-22,
2005.
[GLM96] S.Gottschalk, M.C.Lin, D.Manocha, OBB
tree: A Hierarchical Structure for Rapid Inter-
ference Detection, In Proceedings of ACM SIG-
GRAPH, pp. 171-180, 1996.
[GLM05] N.K. Govindaraju, M.C. Lin and D.
Manocha, Quick-CULLIDE: Fast Inter- and
Intra-Object Collision Culling Using Graphics
Hardware, In Proceedings of IEEE Conference
on Virtual Reality, pp. 59-66, 2005.
[Hubbard95] P. M. Hubbard, Collision Detection for
Interactive Graphics Applications, IEEE Trans-
actions on Visualization and Computer Graph-
ics, pp. 218-230, 1995
[KHMSZ98] J.T. Klosowski, M. Held, J.S.B. Mitchell,
H. Sowizral and K. Zikan, Efficient Collision
Detection Using Bounding Volume Hierarchies
of k-DOPs, IEEE Transactions on Visualization
and Computer Graphics, 4(1), pp. 21-36, 1998.
[MT97] T. Möller and B. Trumbore, Fast, Minimum
Storage Ray-Triangle Intersection, Journal of
Graphics Tools, 2(1), pp. 21-28, 1997.
[Plü65] J. Plücker, On A New Geometry Of Space,
Phil. Trans. Royal Soc. London, 155:725-791,
1865.
[PWH02] T.J. Purcell, I. Buck, W.R. Mark, P. Han-
rahan, Ray Tracing on Programmable Graph-
ics Hardware, ACM Transactions on Graphics,
21(3), pp. 703-712, 2002.
[RBAZ05] A. Raabe, B. Bartyzel, J.K. Anlauf, and
G. Zachmann, Hardware Accelerated Collision
Detection-An Architecture and Simulation Re-
sult, In Proceedings of IEEE Design Automation
and Test in Europe Conference, vol. 3, pp. 130-
135, 2005.
[SF98] Segura, R.J., Feito, F.R. An algorithm for De-
termining Intersection Segment-Polygon in 3D,
Computer & Graphics, Vol. 22, No. 5, pp. 587-
592, 1998.
[SF01] Segura, R.J., Feito, F.R. Algorithms to Test
Ray-Triangle Intersection. Comparative Study.
In Proceedings of WSCG’2001.
[SWS05] Jörg Schmittler, Ingo Wald, and Philipp
Slusallek, SaarCOR-A Hardware Architecture
for Ray Tracing, In Proceedings of SIG-
GRAPH/Eurographics Workshop on Graphics
Hardware, pp. 27-36, 2002.
[SWWPS04] J. Schmittler, S. Woop, D. Wagner, W.J.
Paul and P. Slusallek, Real Time Ray Tracing
of Dynamic Scenes on an FPGA Chip, In Pro-
ceedings of Eurographics Workshop on Graphics
Hardware, pp. 95-106, 2004.
[Woop05] S. Woop, J. Schmittler, P. Slusallek, RPU: A
Programmable Ray Processing Unit for Realtime
Ray Tracing, ACM Transactions on Graphics, 24(3),
pp. 434-444, 2005.
[ZK03] G. Zachmann and G. Knittel, An architecture
for hierarchical collision detection, In Journal of
WSCG’2003, pp. 149-156, 2003.
