An architecture for hierarchical collision detection by Zachmann, Gabriel & Knittel, Günter
An Architecture for
Hierarchical Collision Detection
Gabriel Zachmann
Computer Graphics, Informatik II
University of Bonn
email: zach@cs.uni-bonn.de
Günter Knittel
WSI/GRIS
University of Tübingen
email: knittel@gris.uni-tuebingen.de
Abstract
We present novel algorithms for efficient hierarchical collision detection and propose a hardware architecture for a single-chip accelerator. We use a hierarchy of bounding volumes defined by k-DOPs for maximum performance. A new hierarchy traversal algorithm and an optimized triangle-triangle intersection test reduce bandwidth and computation costs. The resulting hardware architecture can process two object hierarchies and identify intersecting triangles autonomously at high speed. Real-time collision detection of complex objects can be achieved at the rates required by force-feedback and physically-based simulations.
Keywords: graphics hardware, computer animation, virtual reality, hierarchical algorithms, triangle
intersection.
1 Introduction
Collision detection is an elementary task in areas like animation systems, virtual reality, games, physically-based simulation, automatic path finding, virtual assembly simulation, and medical training and planning systems.
In many of these systems, collision avoidance is the ultimate goal. Since algorithms for computing the exact time of collision are still too slow or too restrictive, most approaches are “reactive”: they first try to place objects at a new position, then check for collision, and then try other positions, based on physical laws or constraints [14, 21]. This places very high demands on collision detection performance, because such systems must perform many collision checks per simulation cycle. Another very demanding application is force-feedback rendering, where collisions of an (invisible) surface contact object must be checked at about 1000 Hz in order to achieve stable force computations.
Since collision detection is such a fundamental task, it would be highly desirable to have hardware acceleration for it, just as 3D graphics accelerators exist for rendering. With specialized hardware, general-purpose processors can be freed from computing collisions.
Journal of WSCG, Vol. 11, No. 1., ISSN 1213-6972
WSCG’2003, February 3–7, 2003, Plzen, Czech Republic.
Copyright UNION Agency – Science Press.
In this paper, we propose an architecture that implements hierarchical collision detection for rigid objects in hardware. We have concentrated on hierarchical algorithms because they have offered the best performance for so-called “polygon soups”. Such collision detection hardware would form the last stage of a collision detection pipeline [20]. This is where the bulk of the work is done in typical scenarios involving a modest number of objects with large polygon counts. We assume the hierarchies have already been computed. This is not a time-critical task and can be done in software when the application loads objects at startup.
The next section describes related work, while Section 3 describes novel algorithms that are suitable for hardware implementation. Section 4 describes the hardware design in detail. Finally, Section 5 presents some benchmarks and considerations about the performance of the envisioned architecture. An extended version of this paper is available at www.gabrielzachmann.org.
2 Related Work
Considerable work has been done on hierarchical collision detection in software [4–6, 17, 19]. The bounding volumes (BVs) used include spheres, axis-aligned bounding boxes (AABBs), oriented bounding boxes (OBBs), and discretely oriented polytopes (DOPs). However, all traversal schemes proposed so far are inefficient in that they may visit the same nodes many times.
There is essentially no literature on the design of hardware architectures dedicated to collision detection. All research so far has tried to utilize existing graphics hardware. Some approaches use the stencil buffer and multi-pass rendering [2, 13, 16], which works well only for convex objects; others use the feedback feature of OpenGL during culling [10], which is applicable only in special cases.
All of the approaches using graphics hardware have the disadvantage that either they compete with the rendering module for the graphics pipe, or an additional graphics board must be dedicated to collision detection. The former slows down the overall frame rate considerably, while the latter would be tremendous overkill, since most of the resources of the hardware would go unused. Furthermore, these approaches work in image space, which reduces precision significantly.
A number of algorithms for ray-triangle and triangle-triangle intersection have been presented in the literature [1, 3, 7, 11, 12, 15, 18]. Most of them compute either barycentric coordinates or a number of 4 × 4 determinants. We propose a very efficient algorithm for checking the intersection of triangles that does not need any division. Our new algorithm not only uses fewer multiplications and additions than [11] and [1], but is also very well suited for a hardware implementation due to its very uniform control and data flow.
3 The Algorithm
3.1 DOP Trees
The basic operation of any hierarchical collision detection algorithm is the overlap check of two nodes from different objects. In this section, we briefly recall the calculations necessary for collision detection using DOP trees. The derivation of the following formulas can be found in [19].
DOPs are bounding volumes that generalize axis-aligned bounding boxes. They were introduced into computer graphics by [8]. DOP trees are a hierarchical representation of objects [9, 19]. Each inner node stores a DOP and pointers to its children, which it encloses; leaves store polygons (or other graphical primitives). A DOP is described by k numbers (hence k-DOP), usually represented by a vector of k floats. Extensive benchmarks have shown k = 24 to be optimal.
Given two objects $O_A$ and $O_B$, and two DOPs $d, e \in \mathbb{R}^k$ from the DOP trees of $O_A$ and $O_B$, respectively, the overlap test proceeds in two steps. First, DOP $d$ from $O_A$'s hierarchy is transformed into $d'$ in the coordinate frame of $O_B$ by

$$
d' = C \times d + c, \qquad (1)
$$

where

$$
C = \begin{pmatrix}
\cdots & C_{0,0} & \cdots & C_{0,1} & \cdots & C_{0,2} & \cdots \\
 & & & \vdots & & & \\
\cdots & C_{k-1,0} & \cdots & C_{k-1,1} & \cdots & C_{k-1,2} & \cdots
\end{pmatrix}
$$

and exactly three entries per row of $C$ are non-zero. Second, $d'$ is compared componentwise with DOP $e$ according to

$$
\exists\, i,\ 0 \le i < \tfrac{k}{2} :\quad d'_i > e_{\frac{k}{2}+i} \;\lor\; e_i > d'_{\frac{k}{2}+i}
\quad\Longleftrightarrow\quad d \text{ and } e \text{ do not overlap}, \qquad (2)
$$

where each pair $d'_i \le d'_{\frac{k}{2}+i}$ defines a slab (analogously for all DOPs).
Matrix $C$ and vector $c$ depend only on the position of the two objects relative to each other. They are computed during set-up by the software API of the collision detection hardware.
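A minimal Python sketch of Equations 1 and 2 follows. It assumes (hypothetically) that a DOP is a list of k floats whose first k/2 entries are the lower and last k/2 entries the upper slab bounds, matching the 0-based indexing of Equation 2. The function names are illustrative, and the dense k × k product is used only for clarity.

```python
from typing import List, Sequence


def transform_dop(C: Sequence[Sequence[float]], c: Sequence[float],
                  d: Sequence[float]) -> List[float]:
    """Equation (1): d' = C x d + c with a dense k x k matrix C."""
    k = len(d)
    return [sum(C[i][j] * d[j] for j in range(k)) + c[i] for i in range(k)]


def dops_overlap(d_prime: Sequence[float], e: Sequence[float]) -> bool:
    """Equation (2): d and e are disjoint iff one slab pair is separated.

    Assumed layout: entries 0 .. k/2-1 are lower bounds, entries
    k/2 .. k-1 the matching upper bounds along the same directions.
    """
    half = len(e) // 2
    for i in range(half):
        if d_prime[i] > e[half + i] or e[i] > d_prime[half + i]:
            return False  # a separating slab exists
    return True
```

For k = 6 and an identity C, this reduces to the familiar axis-aligned box test, which makes the slab layout easy to check by hand.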
Since the $k \times k$ matrix $C$ in Equation 1 has exactly 3 non-zero coefficients per row, we can compute $d'$ more efficiently by

$$
d'_i = C_i \cdot \begin{pmatrix} d_{j_{i,0}} \\ d_{j_{i,1}} \\ d_{j_{i,2}} \end{pmatrix} + c_i, \qquad (3)
$$

where the correspondence $j$ stores the positions of the non-zero coefficients, and $C_i$ now denotes the row vector of the three non-zero coefficients of row $i$. So, by introducing a $k \times 3$ correspondence matrix $j$, we can reduce the size of the transformation matrix $C$ to $k \times 3$. Consequently, the number of multiplications is $3k$.
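The reduced form of Equation 3 can be sketched as follows, assuming the dense C of Equation 1 is available once at set-up time to build the k × 3 coefficient matrix and the correspondence j. `compress` and `transform_dop_sparse` are hypothetical names, not part of any published API.

```python
from typing import List, Sequence, Tuple


def compress(C: Sequence[Sequence[float]]) -> Tuple[List[List[float]], List[List[int]]]:
    """Split a dense k x k matrix with exactly three non-zeros per row into
    a k x 3 coefficient matrix and the k x 3 correspondence matrix j."""
    C3: List[List[float]] = []
    j: List[List[int]] = []
    for row in C:
        cols = [col for col, v in enumerate(row) if v != 0.0]
        if len(cols) != 3:
            raise ValueError("each row of C must have exactly three non-zeros")
        C3.append([row[col] for col in cols])
        j.append(cols)
    return C3, j


def transform_dop_sparse(C3: Sequence[Sequence[float]], j: Sequence[Sequence[int]],
                         c: Sequence[float], d: Sequence[float]) -> List[float]:
    """Equation (3): d'_i = C_i . (d[j_i0], d[j_i1], d[j_i2]) + c_i."""
    return [C3[i][0] * d[j[i][0]] + C3[i][1] * d[j[i][1]] + C3[i][2] * d[j[i][2]] + c[i]
            for i in range(len(d))]
```

Each output element needs exactly three multiplications, so a full DOP costs 3k multiplications instead of k².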
3.2 Hierarchy Traversal
The general, traditional scheme for hierarchical collision detection is a simultaneous, recursive traversal of two BV hierarchies. However, this procedure incurs several penalties:
1. Nodes in both trees are usually visited several times; this is a general problem of all hierarchical collision detection algorithms (see Figure 1).
2. If the nodes have to be transformed (or other computations specific to individual nodes have to be performed), this will be done several times for the same node.
The second penalty is a consequence of the first one; it could be alleviated by storing the result of the node transformation back into the node. Unfortunately, this has other disadvantages: first, the BV hierarchy occupies more memory (in the case of DOP trees, memory usage would increase by a factor of 2); second, and more importantly, the algorithm would no longer be thread-safe, so multiple pairs of objects could no longer be checked in parallel.
Figure 1: The simultaneous traversal of two BV hierarchies is, conceptually, equivalent to the traversal of a hierarchy of BV pairs. Here, the right DOP tree is “tumbled” with respect to the DOP orientations of the left tree’s reference frame.
In contrast, our novel traversal scheme dramatically reduces the number of nodes visited, the transfer volume from memory, and the number of node transformations. It needs only a small additional stack.
The idea is to avoid simultaneous traversal of two BV hierarchies. Instead, we traverse only one hierarchy and compare each node of that one with a list of nodes from the other hierarchy. Let us call nodes that need to be transformed tumbled nodes, and the other ones aligned nodes (see Figure 1). Assume that we are visiting a tumbled node $A$, and that a list $L$ contains all aligned nodes with which $A$ needs to be checked for overlap. We check all pairs $(A, L_i)$; whenever such a pair overlaps, we append the two children $L_{ij}$, $j \in \{1, 2\}$, to a new list $L'$. After $L$ has been completely processed, $L'$ contains all aligned nodes that need to be checked with $A_1$ and $A_2$, the two children of $A$. With this traversal, we visit each tumbled node only once, and thus transform the DOP stored with it exactly once.
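The list-based traversal can be sketched in Python as below. To keep the example small, one-dimensional intervals stand in for DOPs and a caller-supplied `transform` plays the role of the node transformation; `Node`, `list_traversal`, and `recursive_traversal` are illustrative names. When the tumbled node is already a leaf, the sketch descends the aligned hierarchy alone, which is our own choice for the mixed case, and it counts transformations so that the two schemes can be compared.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Interval = Tuple[float, float]


@dataclass
class Node:
    """Hypothetical BV node; a 1-D interval stands in for a DOP."""
    lo: float
    hi: float
    children: List["Node"] = field(default_factory=list)
    name: str = ""


def overlap(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def list_traversal(tumbled: Node, aligned: Node,
                   transform: Callable[[Node], Interval]):
    """Traverse only the tumbled tree; each tumbled node A carries the list L
    of aligned nodes it must be checked against. Returns (hits, #transforms)."""
    hits: List[Tuple[str, str]] = []
    transforms = 0

    def visit(A: Node, L: List[Node]) -> None:
        nonlocal transforms
        box = transform(A)               # A is transformed exactly once
        transforms += 1
        next_list: List[Node] = []       # L' for the children of A
        work = list(L)
        while work:
            N = work.pop()
            if not overlap(box, (N.lo, N.hi)):
                continue
            if A.children:               # pass N's children (or leaf N) on
                next_list.extend(N.children or [N])
            elif N.children:             # mixed case: A is a leaf
                work.extend(N.children)
            else:
                hits.append((A.name, N.name))
        if next_list:
            for child in A.children:
                visit(child, next_list)

    visit(tumbled, [aligned])
    return sorted(hits), transforms


def recursive_traversal(A: Node, N: Node,
                        transform: Callable[[Node], Interval]):
    """Traditional simultaneous recursion; transforms A once per pair test."""
    hits: List[Tuple[str, str]] = []
    transforms = 0

    def rec(a: Node, n: Node) -> None:
        nonlocal transforms
        transforms += 1
        if not overlap(transform(a), (n.lo, n.hi)):
            return
        if a.children and n.children:
            for x in a.children:
                for y in n.children:
                    rec(x, y)
        elif a.children:
            for x in a.children:
                rec(x, n)
        elif n.children:
            for y in n.children:
                rec(a, y)
        else:
            hits.append((a.name, n.name))

    rec(A, N)
    return sorted(hits), transforms
```

Both functions report the same leaf pairs, but `list_traversal` transforms each tumbled node at most once, while the recursive version transforms a tumbled node once per pair test.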
This scheme works for all kinds of hierarchical collision detection, not just DOP trees. Depending on how much of the work per node-node overlap test can be factored out into one of the two nodes, the benefit of our new method can be dramatic.
A hardware implementation allows us to improve the algorithm further by performing DOP overlap tests in parallel. Due to the nature of the binary tree, performing two overlap tests in parallel yields the greatest cost/performance benefit. To this end, we load a sibling pair of tumbled DOPs $(A, B)$, transform them sequentially, and compare the two in parallel with each DOP from $L$. This results in two new lists, one for the child pair $(A_1, A_2)$ and one for $(B_1, B_2)$. In the sequential version described in the previous paragraph, these two lists were produced at very different times during the traversal, and each list was processed twice (once per tumbled sibling); now, we produce both lists simultaneously and process each of them only once.¹ As a result, the time needed for overlap tests and the number of times an aligned DOP needs to be transferred from memory are cut by a factor of two. Note that, for clarity, we have omitted the “mixed” cases.
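One step of the sibling-pair variant is sketched below under the same simplifying assumptions (intervals instead of DOPs, mixed cases omitted as in the text, hypothetical names). Every node of L is fetched once and compared with both transformed siblings, yielding the lists for (A1, A2) and (B1, B2) in a single pass.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[float, float]


@dataclass
class AlignedNode:
    """Hypothetical aligned node; an interval stands in for its DOP."""
    bv: Interval
    children: List["AlignedNode"] = field(default_factory=list)


def overlap(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def process_sibling_pair(box_a: Interval, box_b: Interval,
                         L: List[AlignedNode]):
    """Compare the transformed siblings A and B with every node of L in
    one pass, producing the lists for (A1, A2) and for (B1, B2)."""
    list_a: List[AlignedNode] = []
    list_b: List[AlignedNode] = []
    fetches = 0
    for node in L:
        fetches += 1                      # node is read from memory once
        hit_a = overlap(box_a, node.bv)   # in hardware, these two comparisons
        hit_b = overlap(box_b, node.bv)   # run in parallel comparator banks
        passed_on = node.children or [node]
        if hit_a:
            list_a.extend(passed_on)
        if hit_b:
            list_b.extend(passed_on)
    return list_a, list_b, fetches
```

Processing A and B one after the other would read every node of L twice; here `fetches` equals `len(L)`.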
In a hardware implementation, we have to maintain the stack and the lists ourselves. This can be done with a stack of lists (see Figure 2). On the same stack, we keep pointers to pairs of tumbled nodes. Going down from node pair $(A, B)$ to $(A_1, A_2)$, we push the pointer to $(A, B)$ onto the stack. Later, when the recursion returns to this node pair, we need to decide whether to go down into node pair $(B_1, B_2)$ or to make a step upwards. This information can be kept in an additional bit on the stack: when the pointer is pushed onto the stack, the corresponding bit is reset; when we return to this node, we go down into the other branch and flip the bit to 1. When we return the next time, the algorithm knows to make another step upwards.
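The following Python sketch shows one way a stack of lists with a direction bit could drive the traversal without recursion. It assumes both hierarchies are complete binary trees of equal depth (so there are no mixed cases), uses intervals in place of DOPs, and stores on each stack entry the tumbled pair, the bit, and the list reserved for (B1, B2). These layout details are our assumptions, not the CollisionChip's actual stack format.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Interval = Tuple[float, float]


@dataclass
class Node:
    """Hypothetical BV node; a 1-D interval stands in for a DOP."""
    bv: Interval
    children: List["Node"] = field(default_factory=list)
    name: str = ""


def overlap(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def stack_traversal(t_root: Node, a_root: Node,
                    transform: Callable[[Node], Interval]) -> List[Tuple[str, str]]:
    """Iterative traversal with a stack of [tumbled pair, bit, list].

    Assumes complete binary trees of equal depth (no mixed cases)."""
    hits: List[Tuple[str, str]] = []
    if not overlap(transform(t_root), a_root.bv):
        return hits
    stack: list = []
    pair, L = tuple(t_root.children), list(a_root.children)
    while True:
        # Go down as long as the current list is non-empty.
        while L:
            A, B = pair
            ta, tb = transform(A), transform(B)   # each sibling once
            la = [N for N in L if overlap(ta, N.bv)]
            lb = [N for N in L if overlap(tb, N.bv)]
            if not A.children:                    # leaf level reached
                hits += [(A.name, N.name) for N in la]
                hits += [(B.name, N.name) for N in lb]
                break
            la = [c for N in la for c in N.children]
            lb = [c for N in lb for c in N.children]
            stack.append([(A, B), 0, lb])         # bit is reset on push
            pair, L = tuple(A.children), la
        # Go up: entries whose bit is 1 have finished both branches.
        while stack and stack[-1][1] == 1:
            stack.pop()
        if not stack:
            return sorted(hits)
        top = stack[-1]
        top[1] = 1                                # flip the bit, take (B1, B2)
        pair, L = tuple(top[0][1].children), top[2]
```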
3.3 Polygon Intersection Test
In the case of collision, the traversal reaches pairs of leaves containing triangles, which have to be checked for intersection. Assume triangle $A$ is given by vertices $V_1, V_2, V_3$ and triangle $B$ by vertices $W_1, W_2, W_3$, both in their object's reference frame. Assume triangle $A$ is part of object $O_A$, and $B$ is part of $O_B$.
The approach in our algorithm is to check (conceptually) each edge of $A$ against $B$, and vice versa. First, $A$'s vertices are transformed into the reference frame of $O_B$. Assume further a $3 \times 3$ transformation $M_B$ for triangle $B$ such that $M_B \cdot (W_i - W_1)$ maps $B$ onto the unit triangle $(0, 0, 0)$, $(1, 0, 0)$, $(0, 1, 0)$. Then, we transform $A$ by $(M_B, W_1)$, i.e., each vertex $V$ is mapped to $M_B \cdot (V - W_1)$ (see Figure 3). For the sake of simplicity, we will call the new vertices $V_i$ again.

¹ This scheme can be generalized in a straightforward way to process $2^m$ tumbled nodes simultaneously.

Figure 2: The improved traversal scheme can be implemented by a stack of lists. (In a hardware implementation, the stack on the right is merged into the left one.)

Figure 3: Using a special transformation, the intersection test can be done very efficiently. (Figure labels: triangle B, edge PQ of triangle A with intersection point S, axes x, y, z.)
Consider each edge $PQ := V_i V_{i+1}$. If $P_z$ and $Q_z$ are both $\ge 0$ or both $\le 0$, we skip this edge. Otherwise, we compute (conceptually) the intersection $S$ of the supporting line $X = P + t\,r$, $r = Q - P$, with the plane $z = 0$, which is given by $t = -P_z / r_z$, as

$$
S = P - r\,\frac{P_z}{r_z}
$$

(we know $r_z \ne 0$). We know that $0 \le t \le 1$. We also know that $S_z = 0$, so we need to compute only

$$
S_x = P_x - r_x \frac{P_z}{r_z}
$$

and, similarly, $S_y$, which are, basically, the barycentric coordinates of the intersection point. Finally, we just check whether $S_x \ge 0 \land S_y \ge 0 \land S_x + S_y \le 1$. If so, $A$ and $B$ do intersect; otherwise, we check the other edges and, in case of no intersection, we check $B$ against $A$.
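The conceptual per-edge test, including the division, can be written directly from these formulas. This sketch assumes the vertices have already been mapped by $(M_B, W_1)$ so that $B$ is the unit triangle; the function names are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def edge_hits_unit_triangle(P: Vec3, Q: Vec3) -> bool:
    """Does edge PQ cross the unit triangle (0,0,0), (1,0,0), (0,1,0) in z = 0?"""
    if (P[2] >= 0 and Q[2] >= 0) or (P[2] <= 0 and Q[2] <= 0):
        return False                        # both endpoints on one side: skip
    r = (Q[0] - P[0], Q[1] - P[1], Q[2] - P[2])
    t = -P[2] / r[2]                        # r_z != 0 here, and 0 <= t <= 1
    sx = P[0] + t * r[0]                    # S = P + t r = P - r P_z / r_z
    sy = P[1] + t * r[1]
    return sx >= 0 and sy >= 0 and sx + sy <= 1


def any_edge_hits(V: Sequence[Vec3]) -> bool:
    """Check the three edges V_i V_{i+1} of a triangle mapped by (M_B, W_1)."""
    return any(edge_hits_unit_triangle(V[i], V[(i + 1) % 3]) for i in range(3))
```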
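In order to save the division and the vector subtraction (for $r$), we reformulate the condition as follows (assuming $r_z > 0$):

$$
\begin{aligned}
P_x Q_z &\ge Q_x P_z \;\land \\
P_y Q_z &\ge Q_y P_z \;\land \\
P_x Q_z - Q_x P_z + P_y Q_z - Q_y P_z &\le Q_z - P_z
\end{aligned}
\qquad (4)
$$

This follows from $S_x = (P_x Q_z - Q_x P_z)/r_z$ and $S_y = (P_y Q_z - Q_y P_z)/r_z$. If $r_z < 0$, all three inequalities must be reversed ($\le$, $\le$, and $\ge$, respectively).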
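Equation 4 translates into a division-free predicate in which the sign of $r_z$ selects the set of comparisons. The sketch below is a hypothetical software rendering for illustration, not the IT-Unit's circuit.

```python
from typing import Sequence


def edge_hits_unit_triangle_nodiv(P: Sequence[float], Q: Sequence[float]) -> bool:
    """Equation (4): same predicate as S_x >= 0, S_y >= 0, S_x + S_y <= 1,
    using only multiplications, additions and comparisons."""
    Pz, Qz = P[2], Q[2]
    if (Pz >= 0 and Qz >= 0) or (Pz <= 0 and Qz <= 0):
        return False                      # edge does not cross z = 0
    a = P[0] * Qz - Q[0] * Pz             # equals S_x * r_z
    b = P[1] * Qz - Q[1] * Pz             # equals S_y * r_z
    rz = Qz - Pz
    if rz > 0:
        return a >= 0 and b >= 0 and a + b <= rz
    return a <= 0 and b <= 0 and a + b >= rz   # r_z < 0: inequalities reversed
```

Because the function only multiplies, adds, and compares, it also works unchanged with exact rational inputs such as `fractions.Fraction`, which is convenient for checking it against the division-based formulas.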
The algorithm gains its special efficiency because we can precompute the matrices $M_A$ and $M_B$ (they can be obtained from a simple linear equation system), and because we do not need to compute the exact intersection point.
In our case of collision detection using DOP trees, we can store these matrices in the leaves instead of the DOPs. We do not need to check pairs of leaf DOPs, because the immediate check of the triangles is faster. Storing the triangle matrix $M_B$ (as a $3 \times 4$ affine matrix including the translation) and the 3 vertices needs $3 \times 4 + 3 \times 3 = 21$ floats, which fit well into the nodes of a 24-DOP tree.
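One way to precompute $M_B$ and pack a leaf is sketched below. The text only states that $M_B$ comes from a simple linear system; here we assume the third condition maps the triangle normal $n = (W_2 - W_1) \times (W_3 - W_1)$ to $(0, 0, 1)$, which makes the inverse available in closed form via cross products. The leaf layout (12 affine coefficients followed by 9 vertex coordinates) and all function names are likewise assumptions.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def triangle_matrix(W1: Vec3, W2: Vec3, W3: Vec3) -> List[List[float]]:
    """M_B with M_B (W2 - W1) = (1,0,0), M_B (W3 - W1) = (0,1,0) and, as an
    assumed third condition, M_B n = (0,0,1) for the normal n."""
    e1, e2 = sub(W2, W1), sub(W3, W1)
    n = cross(e1, e2)
    det = dot(e1, cross(e2, n))              # = |n|^2 > 0 for a proper triangle
    rows = [cross(e2, n), cross(n, e1), cross(e1, e2)]   # rows of [e1 e2 n]^-1
    return [[x / det for x in row] for row in rows]


def leaf_record(W1: Vec3, W2: Vec3, W3: Vec3) -> List[float]:
    """Pack the 3 x 4 affine map X -> M_B (X - W1) plus the 3 vertices."""
    M = triangle_matrix(W1, W2, W3)
    affine = [row + [-dot(row, W1)] for row in M]        # 3 x 4 = 12 floats
    return [x for row in affine for x in row] + [*W1, *W2, *W3]   # + 9 floats
```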
4 Hardware Design
The target design is a PCI board with one ASIC, a large on-board memory for the hierarchy, and a small SRAM as dedicated stack memory. Crucial for performance is the bandwidth to the local memory, so a four-bank SDRAM configuration with a 256-bit bus was chosen (see Figure 4).

Figure 4: Schematic diagram. (Figure labels: CollisionChip, SDRAM Modules 0–3 on a 256-bit bus, SRAM (Stack), PCI bus 32/64 bits.)
4.1 The CollisionChip
Figure 5 shows all functional units of the CollisionChip. It consists of a number of large register files grouped around an arithmetic unit for floating-point dot products (the DOTADD-unit), a Triangle Intersection Test Unit (IT-Unit), register banks connected to comparators, interfaces to the PCI bus and to the local memories, the Stack Engine, and control units as well as address generators. Although the processing of bounding volumes and of triangles differs quite substantially, a common architecture was found with only low redundancy.
The DOTADD-Unit. The DOTADD-unit is similar to the transform units found in modern graphics accelerators. Its basic function is to perform

$$
d'_i = d_k \times C_{i,0} + d_m \times C_{i,1} + d_n \times C_{i,2} + c_i
$$

on 32-bit floating-point numbers. The indices refer to locations in the register files. Due to the absence of data dependencies in the control flow, it can be pipelined for high clock frequency and throughput.
Figure 5: All functional units of the CollisionChip. (Figure labels include the SDRAM Interface Unit, Address Generators 1–3, the Hierarchy Traversal Stack Engine, the DOP, Matrix, and Correspondence Register Files, the DOTADD-Unit computing $d'_i = d_k C_{i0} + d_m C_{i1} + d_n C_{i2} + c_i$, the Results Register Bank with comparators signaling OVERLAP / NO OVERLAP, the Intersection Test Unit for Triangles, the Master Controller, the PCI-Bus Interface (32/64 bits), and the Fast SRAM interface.)
Processing of Bounding Volumes. Prior to processing two hierarchies, the matrix $C$ and the coefficients $c$ must be loaded into the Matrix Register File. Also, the correspondence indices $i$, $k$, $m$, and $n$ must be stored in the Correspondence Register File. This happens via the PCI bus under software control and occupies 24 lines in both register files. The software also transmits the pointers (local memory addresses) of the two root nodes to the Master Controller as the starting point.
A DOP from the tumbled object is loaded from memory and sequentially stored in the DOP Register File under control of Address Generator 1. After some constant delay (again predetermined by software), a sufficiently large subset of the DOP elements $d$ is, or will become, available in time for continuous evaluation of Equation 1. At this point in time, Address Generator 2 is triggered and the operands are fed into the DOTADD-unit. Note that, for maximum performance, the lines of matrix $C$ are processed out of order, depending on the earliest availability of the required $d$-elements. Also note that a specific $d$ may be used for more than one line. The transformed DOP elements $d'$ are then stored in the Results Register Bank under control of Address Generator 3, which basically delays an index $i$ by the pipeline delay of the DOTADD-unit. The same processing is applied to the sibling of the tumbled node.
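The out-of-order processing of the lines of $C$ can be modeled as a simple greedy scheduler: a row becomes ready when its three operands have arrived in the DOP Register File, and one ready row enters the 6-stage DOTADD pipeline per cycle. This is a hypothetical software model to illustrate the idea; the arrival cycles, the issue policy, and the function name are assumptions, not the Address Generators' actual logic.

```python
from typing import Dict, List, Sequence, Tuple


def schedule_rows(j: Sequence[Sequence[int]], arrival: Sequence[int],
                  pipeline_depth: int = 6) -> Tuple[List[int], Dict[int, int], Dict[int, int]]:
    """Issue one row of Equation (3) per cycle into the DOTADD pipeline.

    ready[i] is the cycle at which all three operands d[j[i][*]] of row i
    are in the DOP Register File; rows are issued in order of readiness."""
    ready = [max(arrival[idx] for idx in row) for row in j]
    order = sorted(range(len(j)), key=lambda i: (ready[i], i))
    issue: Dict[int, int] = {}
    cycle = 0
    for i in order:
        cycle = max(cycle, ready[i])      # wait until the operands are present
        issue[i] = cycle
        cycle += 1                        # one row enters the pipeline per cycle
    done = {i: issue[i] + pipeline_depth for i in order}
    return order, issue, done
```

With elements arriving one per cycle, rows whose correspondence indices point to early elements are issued first, independent of their line number in $C$.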
The DOP elements $e$ of the aligned node are then loaded from memory in three transfers and stored in the registers labeled $e_0 \dots e_{23}$ under control of the Master Controller. The bank of comparators determines overlap in parallel and signals this condition to the Master Controller. The Stack Engine constructs the lists of child nodes to be checked for overlap according to this condition.
Hierarchy Traversal. The novel traversal algorithm described in Section 3.2 is implemented using a dedicated, fast external SRAM to store the lists and a suitably designed Stack Engine. Its basic task is to hand node pointers from the current list to the Master Controller and to receive child node pointers to construct new lists. Internally, it maintains a stack of list pointers and a register containing the current level.
Processing of Triangles. As described in Section 3.3, testing triangle $A$ from object $O_A$ against triangle $B$ from object $O_B$ requires transforming $A$ into the coordinate system of $O_B$ using a rotation and a translation. These are constant for two objects and can therefore be precomputed. The reverse test, $B$ against $A$, may also be necessary, which requires the inverse transform. For maximum performance, all coefficients are kept on chip in the Matrix Register File in lines 24 through 29. After this first transformation, the triangle must be transformed using the matrix stored in the leaf of the other triangle, whose coefficients are loaded into lines 30 through 32 of the Matrix Register File. The Correspondence Register File has 27 lines dedicated to these transforms, so they can be performed with the DOTADD-unit by properly setting up the Address Generators.
The concrete sequence of operations is as follows. The vertices of $A$ are loaded into DOP registers $d_0 \dots d_8$, transformed, and written back into DOP registers $d_9 \dots d_{17}$. Meanwhile, the coefficients from the other leaf are loaded from memory. The second transformation of $A$ leaves the vertices in registers $d_0 \dots d_8$ again.
Further processing according to Section 3.3 and Equation 4 is then done in a separate unit called the IT-Unit. This unit can have low-performance, and thus low-cost, arithmetic units, since triangle tests are not performed very frequently (see Table 1). It is controlled by the Master Controller via a 7-bit bus and supplied with operands from the DOP Register File using indices from the Correspondence Register File. Five pairs of operands need to be read per edge, which requires an additional 15 entries, for a total of 66 lines in the Correspondence Register File. The accumulation circuitry, consisting of the ADD/SUB unit and the register ACC, computes the left side of the third condition in Equation 4. The combinational circuitry to the left determines whether the edge cuts the $z = 0$ plane at all.
If an intersection is found, it is reported to the software; otherwise, the above procedure is repeated with the operands reversed.
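Putting Section 3.3 and the steps above together, a software model of the complete triangle test might look as follows. It assumes a rigid transform $x_B = R\,x_A + t$ between the object frames, builds each leaf map with the triangle normal as the third column (an assumption, since the text only mentions a simple linear system), and falls back to the reverse test with the inverse transform $(R^T, -R^T t)$. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def sub(a, b) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rigid(R: Mat3, t: Vec3, x: Vec3) -> Vec3:
    return tuple(dot(R[r], x) + t[r] for r in range(3))


def unit_map(W: Sequence[Vec3]) -> List[List[float]]:
    """3 x 4 affine map X -> M (X - W1) sending triangle W onto the unit triangle."""
    e1, e2 = sub(W[1], W[0]), sub(W[2], W[0])
    n = cross(e1, e2)
    det = dot(e1, cross(e2, n))
    M = [[x / det for x in row] for row in (cross(e2, n), cross(n, e1), cross(e1, e2))]
    return [row + [-dot(row, W[0])] for row in M]


def edge_crosses(P: Vec3, Q: Vec3) -> bool:
    """Equation (4), division-free."""
    if (P[2] >= 0 and Q[2] >= 0) or (P[2] <= 0 and Q[2] <= 0):
        return False
    a = P[0] * Q[2] - Q[0] * P[2]
    b = P[1] * Q[2] - Q[1] * P[2]
    rz = Q[2] - P[2]
    if rz > 0:
        return a >= 0 and b >= 0 and a + b <= rz
    return a <= 0 and b <= 0 and a + b >= rz


def edges_hit(V: Sequence[Vec3], W: Sequence[Vec3], R: Mat3, t: Vec3) -> bool:
    """Map V by (R, t) into W's object frame, then by W's leaf map; test V's edges."""
    M = unit_map(W)
    U = [tuple(dot(row[:3], x) + row[3] for row in M)
         for x in (rigid(R, t, v) for v in V)]
    return any(edge_crosses(U[i], U[(i + 1) % 3]) for i in range(3))


def triangles_intersect(V: Sequence[Vec3], W: Sequence[Vec3], R: Mat3, t: Vec3) -> bool:
    """V in O_A coordinates, W in O_B coordinates, with x_B = R x_A + t."""
    if edges_hit(V, W, R, t):                                # A against B
        return True
    Rt = [[R[c][r] for c in range(3)] for r in range(3)]     # R^T
    t_inv = tuple(-dot(Rt[r], t) for r in range(3))          # -R^T t
    return edges_hit(W, V, Rt, t_inv)                        # B against A
```

The second call covers configurations in which only an edge of $B$ pierces $A$, which is why the reverse test cannot be omitted.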
5 Performance Estimations
Processing a given list involves reading and transforming two tumbled nodes, and reading and comparing the appropriate number of aligned node pairs. We assume that throughput is limited by transformation performance and memory bandwidth; the Stack Engine is assumed to be always fast enough. We also do not consider triangle-triangle tests here, since they do not occur very often.
Further assumptions are as follows: nodes are defined by 24 single-precision floating-point numbers plus auxiliary data, placed in memory on 128-byte boundaries. The memory is built from DDR-SDRAM chips with a 2-2-2 access characteristic (2 cycles each for the precharge time, RAS-CAS delay, and CAS latency). The CollisionChip is assumed to run at the data burst frequency, e.g., 266 MHz for PC133 memory chips. A cycle of the CollisionChip thus equals one half of a memory cycle. The SDRAM Interface can buffer an entire node pair (256 bytes) and thus allows a burst length of eight to be used.
Object | num. pgons | avg.: aligned nodes visited (1) | avg.: tumbled nodes visited (2) | avg.: pgon checks | avg.: time HW | avg.: time SW | worst case: aligned nodes visited (1) | worst case: tumbled nodes visited (2) | worst case: time HW | worst case: time SW | speedup avg. / worst case
Filter | 19 326 | 12 474 | 240 | 1 660 | 153 | 14 936 | 542 888 | 3 553 | 5 638 | 717 164 | 98 / 127
Frontlight | 30 075 | 389 | 73 | 68 | 15 | 524 | 7 065 | 937 | 207 | 9 226 | 36 / 44
Lock | 62 023 | 279 | 81 | 9 | 15 | 401 | 4 854 | 877 | 178 | 6 804 | 27 / 38
Car body | 60 755 | 259 | 66 | 55 | 12 | 383 | 3 076 | 538 | 110 | 5 390 | 31 / 49
Buddha | 125 000 | 159 | 50 | 7 | 9 | 240 | 3 345 | 301 | 77 | 4 074 | 26 / 53
Table 1: Performance of our hardware architecture for a number of objects placed in close proximity. All times are in microseconds. The average numbers were obtained by rotating one of the objects relative to the other; the worst-case numbers are the respective maxima observed during that rotation. The hardware collision detection times were calculated with Equation 5, assuming 3.76 ns/cycle (α = number of aligned nodes / number of tumbled nodes, τ = number of tumbled nodes / 2). Columns marked (1) count multiple visits of aligned nodes, while those marked (2) count the number of unique tumbled nodes (which, unlike in traditional traversal schemes, are visited only once). The software performance was measured using the traditional recursive hierarchy traversal.
In the following, cycles refer to chip cycles. A random access to a node pair then takes 16 cycles to complete.

The first d-parameter of a tumbled node can be written into the DOP Register File 10 cycles after the (random) read was initiated. On average, continuous evaluation of Equation 1 can start after an additional 12 cycles, when the first half of all d's is available. The DOTADD-unit is assumed to have 6 pipeline stages. The first result will be clocked into the Results Register Bank after a total of 28 cycles (10 + 12 + 6), the last one after 52 cycles.
The last access to the DOP Register File for processing the first tumbled sibling occurs in cycle 46. The other sibling can then be transferred sequentially from the SDRAM Interface Unit into the DOP Register File and processed in the same way. The transformed sibling will be ready in the Results Register Bank after 88 cycles.

By that time, the first pair of aligned nodes in the list has been fetched from memory, with one of the nodes present in the “e” register bank. The other node will be processed four cycles later. The load of the second node pair has been initiated such that processing can continue uninterrupted through cycle 100.
For all further memory reads, since we assume page faults for practically all memory reads, a delay will occur between read cycles. On memory chips with four internal banks, this delay will be two cycles on average due to bank interleaving, giving a total read time of 10 cycles for a node pair. Since the first two aligned node pairs are already covered by the initial 100 cycles, the performance can be estimated as

$$
T_L = 100 + (\alpha - 2) \cdot 10,
$$

where $T_L$ is the number of cycles needed to process a list, and $\alpha$ is the number of aligned node pairs in the list. If, for a given collision test of two objects, there are $\tau$ lists to process, each with $\alpha$ node pairs on average, the total performance can be characterized as

$$
T_T = \bigl(100 + (\alpha - 2) \cdot 10\bigr) \cdot \tau \qquad (5)
$$

The number of lists $\tau$ is given by the number of visited tumbled node pairs.
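Equation 5 is easy to evaluate in code. The sketch below also applies the definitions of α and τ given in the caption of Table 1 and converts cycles to microseconds at 3.76 ns per cycle; the function names are ours.

```python
CYCLE_NS = 3.76  # ns per chip cycle (266 MHz), as assumed for Table 1


def list_cycles(alpha: float) -> float:
    """T_L: 100 cycles cover the two tumbled siblings and the first two
    aligned node pairs; every further pair costs a 10-cycle read."""
    return 100 + (alpha - 2) * 10


def total_cycles(alpha: float, tau: float) -> float:
    """Equation (5): T_T = (100 + (alpha - 2) * 10) * tau."""
    return list_cycles(alpha) * tau


def from_counts(aligned_nodes: int, tumbled_nodes: int) -> float:
    """Equation (5) with the Table 1 caption's definitions
    alpha = aligned / tumbled and tau = tumbled / 2."""
    return total_cycles(aligned_nodes / tumbled_nodes, tumbled_nodes / 2)


def to_microseconds(cycles: float, cycle_ns: float = CYCLE_NS) -> float:
    return cycles * cycle_ns / 1000.0
```

Substituting these definitions of α and τ, Equation 5 simplifies to $T_T = 40 \cdot (\text{tumbled nodes}) + 5 \cdot (\text{aligned nodes})$ cycles.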
This estimate is compared to a software implementation on a PC with a Pentium-III CPU running at 1 GHz. The results are summarized in Table 1.
6 Conclusions and Future Work
In this paper, we have presented novel algorithms and a hardware architecture for hierarchical collision detection. It is arguably the first special-purpose hardware architecture dedicated to this task. We emphasize that this architecture is suitable for “polygon soups” in general, as opposed to previously reported methods utilizing graphics hardware.
As can be seen in Table 1, the speedup ranges between about 25 and 125 for our benchmarks. It is generally higher in worst-case scenarios, which is an important result, because interactivity is limited most severely by these cases. Thus a chip design is very well justified.
A good part of the speedup can be attributed to our novel hierarchy traversal scheme, which can be applied to all kinds of bounding volume hierarchies. Our near-term goal is to implement a VHDL model of the CollisionChip, identify potential bottlenecks, and further optimize the architecture towards even higher processing speeds. Our long-term goal is to integrate this project into an industrial virtual prototyping application.
Figure 6: Some of the objects of our test suite (car body, front light, cooling filter, door lock; courtesy VW and
BMW).
References
[1] J. Arenberg, Re: Ray/Triangle Intersection with Barycentric Coordinates, Ray Tracing News, 1 (1988). http://www1.acm.org/pubs/tog/resources/RTNews/html/rtnews5b.html.
[2] G. Baciu, W. S.-K. Wong, and H. Sun, RECODE: an image-based collision detection algorithm, The Journal of Visualization and Computer Animation, 10 (October–December 1999), pp. 181–192. ISSN 1049-8907.
[3] D. Badouel, An Efficient Ray-Polygon Intersection, in Graphics Gems, A. S. Glassner, ed., Academic Press, San Diego, 1990, pp. 390–393. Includes code.
[4] J. Eckstein and E. Schömer, Dynamic Collision Detection in Virtual Reality Applications, in Proc. The 7th Int’l Conf. in Central Europe on Comp. Graphics, Vis. and Interactive Digital Media ’99 (WSCG’99), Plzen, Czech Republic, Feb. 1999, University of West Bohemia, pp. 71–78.
[5] S. Gottschalk, M. Lin, and D. Manocha, OBBTree: A Hierarchical Structure for Rapid Interference Detection, in SIGGRAPH 96 Conference Proceedings, H. Rushmeier, ed., ACM SIGGRAPH, Addison Wesley, Aug. 1996, pp. 171–180. Held in New Orleans, Louisiana, 04–09 August 1996.
[6] P. M. Hubbard, Collision detection for interactive graphics applications, IEEE Transactions on Visualization and Computer Graphics, 1 (1995), pp. 218–230. ISSN 1077-2626.
[7] R. Jones, Intersecting a Ray and a Triangle with Plücker Coordinates, Ray Tracing News, 13 (2000). http://www1.acm.org/pubs/tog/resources/RTNews/html/rtnv13n1.html.
[8] T. L. Kay and J. T. Kajiya, Ray Tracing Complex Scenes, in Computer Graphics (SIGGRAPH ’86 Proceedings), D. C. Evans and R. J. Athay, eds., vol. 20, Aug. 1986, pp. 269–278.
[9] J. T. Klosowski, M. Held, J. S. Mitchell, H. Sowizral, and K. Zikan, Efficient Collision Detection Using Bounding Volume Hierarchies of k-DOPs, IEEE Transactions on Visualization and Computer Graphics, 4 (1998), pp. 21–36.
[10] J.-C. Lombardo, M.-P. Cani, and F. Neyret, Real-time collision detection for virtual surgery, in Proc. of Computer Animation, Geneva, Switzerland, May 26–28, 1999.
[11] T. Möller, A fast triangle-triangle intersection test, Journal of Graphics Tools, 2 (1997), pp. 25–30.
[12] T. Möller and B. Trumbore, Fast, Minimum Storage Ray-Triangle Intersection, Journal of Graphics Tools, 2 (1997). ISSN 1086-7651.
[13] K. Myszkowski, O. G. Okunev, and T. L. Kunii, Fast collision detection between complex solids using rasterizing graphics hardware, The Visual Computer, 11 (1995), pp. 497–512. ISSN 0178-2789.
[14] J. Sauer and E. Schömer, A Constraint-Based Approach to Rigid Body Dynamics for Virtual Reality Applications, in Proc. VRST ’98, Taipei, Taiwan, Nov. 1998, ACM, pp. 153–161.
[15] C. Schlick and G. Subrenat, Ray Intersection of Tessellated Surfaces: Quadrangles versus Triangles, in Graphics Gems V, A. Paeth, ed., Academic Press, San Diego, 1995, pp. 232–241.
[16] M. Shinya and M.-C. Forgue, Interference detection through rasterization, The Journal of Visualization and Computer Animation, 2 (1991), pp. 132–134. ISSN 1049-8907.
[17] G. J. A. van den Bergen, Collision Detection in Interactive 3D Computer Animation, PhD dissertation, Eindhoven University of Technology, 1999.
[18] D. Voorhies and D. Kirk, Ray-Triangle Intersection Using Binary Recursive Subdivision, in Graphics Gems II, J. Arvo, ed., Academic Press, San Diego, 1991, pp. 257–263.
[19] G. Zachmann, Rapid Collision Detection by Dynamically Aligned DOP-Trees, in Proc. of IEEE Virtual Reality Annual International Symposium; VRAIS ’98, Atlanta, Georgia, Mar. 1998, pp. 90–97.
[20] G. Zachmann, Optimizing the Collision Detection Pipeline, in Proc. of the First International Game Technology Conference (GTEC), Jan. 2001.
[21] G. Zachmann and A. Rettig, Natural and Robust Interaction in Virtual Assembly Simulation, in Eighth ISPE International Conference on Concurrent Engineering: Research and Applications (ISPE/CE2001), West Coast Anaheim Hotel, California, USA, July 2001.
