Wait-free shared-memory irradiance caching by Debattista, Kurt et al.
66 September/October 2011 Published by the IEEE Computer Society 0272-1716/11/$26.00 © 2011 IEEE
Feature Article
Wait-Free Shared-Memory 
Irradiance Caching
Kurt Debattista and Piotr Dubla ■ University of Warwick
Luís Paulo Peixoto dos Santos ■ Universidade do Minho
Alan Chalmers ■ University of Warwick
Desktop multicore computing calls for mod-ifying traditional rendering algorithms to parallelize the available resources and 
maximize their use. For certain algorithms, such as 
classic ray tracing, this conversion can be relatively 
straightforward, but computing more complex 
lighting conditions requires careful consideration 
of thread synchronization to minimize overhead 
and allow computation.1
We’ve developed a parallel so-
lution to one method used for 
computing complex lighting con-
ditions—namely, the irradiance 
cache (IC), an acceleration data 
structure that caches sparsely 
sampled values for indirect dif-
fuse irradiance in the framework 
of a distributed ray-tracing al-
gorithm.2 However, because all 
rendering threads can write to 
and read from the IC, a multi-
threaded shared-memory system 
must have a data-access-control 
mechanism to ensure that the 
data structure isn’t corrupted.
Such control mechanisms in-
cur their own overhead, so they 
must be carefully designed to not compromise per-
formance. Traditionally, data-access-control mech-
anisms use lock-based mutual exclusion. However, 
nonblocking data structures that take the form of 
obstruction-free, lock-free, or wait-free data struc-
tures offer considerable performance advantages. 
Wait-free structures are the most powerful (see 
the “Nonblocking Synchronization” sidebar), but 
system developers and researchers have considered 
them difficult to construct, and they’re relatively 
rare in practice.3
Here, we present an efficient wait-free algorithm 
that lets all threads concurrently access an un-
bounded shared IC, without using locks or critical 
sections. This is an extension of our previous re-
search,4 which presented an initial version of the 
wait-free algorithm and tested it on an eight-core 
machine. The earlier algorithm could handle only 
fixed-size arrays in the IC. Furthermore, it dis-
carded some data when a conflict among threads 
occurred. We’ve fixed these limitations to guarantee 
the successful insertion of all new irradiance val-
ues. We’ve also assessed this algorithm’s efficiency 
on two highly concurrent multicore systems with 
up to 24 physical cores. The results demonstrate 
its superior performance and scalability over two 
other more traditional and straightforward mecha-
nisms for sharing the IC among a shared-memory 
system’s threads. (An audiovisual presentation that 
augments this article’s description of the algo-
rithm is available at http://doi.ieeecomputersociety.
org/10.1109/MCG.2010.80.)
Related Work in Parallel IC
Recent ray-tracing improvements have enabled in-
teractive computation of many global effects, such 
as specular phenomena and correct shadows.5 How-
ever, the dense sampling at each shading point 
required by indirect diffuse interreflections dra-
matically increases rendering times. Gregory Ward 
and his colleagues exploited a continuous smooth 
Parallelizing rendering 
algorithms to exploit 
multiprocessor and 
multicore machines isn’t 
straightforward. For example, 
the irradiance cache (IC) is an 
acceleration data structure 
that caches indirect diffuse 
irradiance values. In multicore 
systems, threads must 
share the IC to achieve high 
efficiency. A novel wait-free 
access mechanism significantly 
reduces synchronization 
overhead.
 IEEE Computer Graphics and Applications 67
function that generally characterizes the indirect 
diffuse component over space, unaffected by the 
high-frequency changes common with the spec-
ular component.2 They proposed the IC to allow 
sparse evaluation of indirect diffuse irradiance, 
storing the sparsely calculated values in the IC and 
reusing them to extrapolate or interpolate values 
at nearby locations. By caching indirect diffuse ir-
radiance samples in the framework of a distributed 
ray-tracing algorithm,2 the IC allows irradiance 
values to be interpolated for regions in a given 
sample’s neighborhood. This reduces rendering 
Using shared-memory multithreading to execute paral-lel computations requires careful design of the access 
to data structures that all threads can access concurrently. 
A data-access-control mechanism must be able to ensure 
that the data structure isn’t corrupted.
Traditionally, access control to shared-memory data 
structures is maintained via mutual exclusion. When critical 
sections are reasonably large, threads can use blocking 
mechanisms to preempt the running thread. However, 
when frequent access to a shared data structure is required, 
blocking costs can be prohibitive. In such cases, a thread 
would typically enter a busy-wait state, usually using a 
spin lock, when another thread is in the critical section and 
maintain that state until the other thread completes. Such 
control mechanisms incur overheads, such as serialization 
of accesses to the shared data structure. Blocking entails 
expensive context switches, and busy-waiting of frequently 
accessed resources leads to contention that can drastically 
reduce performance as the number of threads increases.1
Alternatives that avoid mutual exclusion exist in the form 
of nonblocking synchronization. By carefully ordering in-
structions, a system developer can remove all critical sections 
and thereby reduce contention by allowing nonblocking al-
gorithms to guarantee that none of the code is serialized.2
The weakest form of nonblocking data structures is 
obstruction-free methods, which guarantee that a thread 
can complete in finite time if it operates in isolation. When 
nonblocking data structures can guarantee that at least one 
among a set of concurrent threads will complete in finite 
time, they’re said to be lock free. All lock-free algorithms 
are obstruction free. However, lock-free and obstruction-
free methods rely on retrials and can’t guarantee an upper 
bound on the number of executed instructions.
When an algorithm can guarantee that all threads will 
complete in finite time, the algorithm is said to be wait 
free. Wait-free algorithms can guarantee an upper bound 
on the number of instructions, thus avoiding starvation, 
deadlock, livelock, and priority-inversion problems. Wait-
free algorithms are ideal for multiprogrammed multipro-
cessors—for example, when a thread holding a lock is pre-
empted and causes all other threads to busy-wait uselessly. 
Clearly, all wait-free data structures are also lock free.
The construction of nonblocking algorithms requires 
powerful atomic primitives, which execute without inter-
ruption as a single instruction on modern architectures. 
We can view these algorithms as a limiting case that re-
duces the size of critical sections to the size of individual 
machine instructions.
Figures A and B show pseudocode for two atomic in-
structions, fetch and add (XADD) and compare and swap 
(CAS), that we use for our wait-free irradiance cache. Mau-
rice Herlihy provided a hierarchy of such primitives’ ef-
fectiveness.3 The most effective are those that can be used 
to implement any wait-free data structure, which Herlihy 
described as CAS (or load-link store-conditional instruc-
tion pairs, which are an alternative to the CAS in some 
architectures).
References
 1. T.E. Anderson, E.D. Lazowska, and H.M. Levy, “The Perfor-
mance Implications of Thread Management Alternatives for 
Shared-Memory Multiprocessors,” IEEE Trans. Computers, vol. 
38, no. 12, 1989, pp. 1631–1644.
 2. M. Herlihy and N. Shavit, The Art of Multiprocessor Program-
ming, Morgan Kaufmann, 2008.
 3. M. Herlihy, “Wait-Free Synchronization,” ACM Trans. Program-
ming Languages and Systems, vol. 13, no. 1, 1991, pp. 124–149.
Nonblocking Synchronization
1 atomic XADD(address location)
2 {
3  int value = *location;
4  *location = value + 1;
5  return value
6 }
Figure A. Pseudocode for the fetch and add (XADD) atomic 
instruction. XADD atomically adds 1 to a value and returns the 
previous value. This function would be implemented as a single 
instruction on modern processor architectures.
1 atomic CAS(address location, value   
  cmpVal, value newVal)
2 {
3  if(*location == cmpVal) {
4   *location = newVal;
5   return true;
6  } else return false;
7 }
Figure B. Pseudocode for the compare and swap (CAS) atomic 
instruction. CAS compares the value at location with cmpVal and, 
if they are the same, updates the value pointed to by location to 
newVal. This function would be implemented as a single instruction 
on modern processor architectures.
68 September/October 2011
Feature Article
time by exploiting spatial coherence. (The “Irra-
diance Caching” sidebar describes the motivation 
and mechanism behind this data structure.)
To accelerate range searches for locating valid 
samples in the IC, the algorithm builds an octree 
incrementally every time a new sample is added. 
Writing to the cache requires both storing the new 
indirect diffuse irradiance value and updating the 
octree. In parallel systems, each rendering process, 
or thread, might evaluate new indirect diffuse 
irradiance values and add them to the IC. To 
increase efficiency, all processes must share the IC, 
thus avoiding replicated work in which one process 
evaluates an irradiance value that other processes 
have already calculated. Ideally, the IC becomes 
a shared data structure, requiring some sharing 
mechanism to ensure that all processes can access 
the available data, that the data isn’t corrupted, 
and that overheads don’t compromise efficiency.
In distributed-memory systems, such as work-
station clusters, each node has its own address 
space, resulting in multiple copies of the shared 
data structure that are regularly synchronized. The 
standard radiance distribution6 supports a paral-
lel renderer over a distributed system using Sun’s 
Network File System for concurrent IC access. This 
approach has led to contention and can result in 
poor performance when using inefficient file lock 
managers. Roland Koholka and his colleagues 
broadcast IC values among processors after every 
50 samples calculated at each slave.7 David Robert-
son and his colleagues presented a centralized par-
Physically based computation of the radiance reflected at a point p along a direction Q is dictated by the ren-
dering equation:1
L p f p L p N dr r i p→( )= ↔( ) ←( ) ( )∫Θ Θ Ψ Ψ ΨΩ ΩΨ, cos ,

,
where f pr ,Θ Ψ↔( )  is the bidirectional reflectance distri-
bution function at p for directions Q and Y, L pi ←( )Ψ  is 
the incident radiance at p along Y, 

Np  is the surface nor-
mal at p, and W, the integration domain, is the hemisphere 
centered at p and oriented around 

Np .
Ray tracing approximates L pr →( )Θ  by shooting rays 
along a number of directions Y, thus sampling L pi ←( )Ψ  
for these directions. We can usually compute certain light 
transport phenomena, such as specular scattering and 
direct illumination, with a limited number of rays. Other 
phenomena, because they lack directionality, require 
hemisphere sampling—that is, stochastically selecting and 
shooting many rays across W. This stochastic integration 
method, called Monte Carlo integration, usually accounts 
for much of the computation.
One such phenomenon is indirect diffuse reflection, the 
diffusely reflected radiance at p along a given direction re-
sulting from the indirect irradiance E(p). E(p) is the indirect 
incident radiant flux per unit area at p. Accurately com-
puting E(p) requires densely sampling W. In a ray-tracing 
context, this requires shooting hundreds or thousands of 
rays distributed across the hemisphere while, simultane-
ously, carefully avoiding directions corresponding to light 
sources so that we include only indirect lighting—that is, 
light that’s been reflected by at least one object.
Indirect diffuse reflections are crucial to convey a per-
ception of realism (see Figure C), but sampling the hemi-
sphere at all shading points results in very long rendering 
times, deemed unacceptable even for most offline render-
ings. Gregory Ward and his colleagues realized in 1988 
that indirect diffuse reflection is generally a continuous 
smooth function over space, not affected by the high-
frequency changes common with specular reflections.2 
They proposed accelerating the computation of indirect 
diffuse reflections by densely sampling the hemisphere 
at only a sparse set of shading points and interpolating 
the remaining ones. They store the sparsely calculated, 
indirect irradiance values in the irradiance cache (IC) data 
structure and later reuse them to extrapolate or interpo-
late irradiance values at nearby locations.
Irradiance Caching
(1) (2) (3)
Figure C. The contribution of indirect lighting: (1) direct only, (2) indirect only, and (3) full. Without the computation of the indirect lighting, 
the rendered images are largely inaccurate.
 IEEE Computer Graphics and Applications 69
allel radiance version that sends the calculated IC 
values to a master process whenever a threshold is 
met.8 Each slave then collects the values deposited 
at the master by the other slaves. Another proposal 
restricts diffuse irradiance evaluations to a subset 
of the available processors, synchronizing the IC 
among these at a higher frequency than with the 
remaining processors.9
We’re unaware of any publication describing a 
data-access-control mechanism for sharing the IC 
among rendering threads in a shared-memory par-
allel system, other than our previous wait-free al-
gorithm. The algorithm we propose here supports 
extendable memory for inserting an unbounded 
number of IC samples and for successfully insert-
ing all new irradiance values.
Data-Access-Control Algorithms
We now present the algorithms for the three 
data-access-control mechanisms we evaluated in 
experiments.
We begin with a traditional single-threaded IC 
that has no access control (see Figure 1). The Irra-
dianceCache data structure represents the IC; it 
consists of an octree of recursive nodes. The individ-
ual node is called ICNode. Each ICNode contains 
pointers to another eight nodes and an ICList 
storing the list of IC samples. Figure 2 shows the 
ICNode for the wait-free method. For the other 
methods, the ICList is just a single dynamic array. 
The Lock-Based Irradiance Cache
The lock-based access-control algorithm (LCK) 
We can interpolate E(p) from a set S(p) of previously 
evaluated irradiance values E(pi) at points pi, by using a 
weighted average:
E p
w p E p
w p
i i
i S p
i
i S p
( )=
( ) ( )
( )
∈ ( )
∈ ( )
∑
∑
,
where the weights 
w p
p p
R
N Ni
i
i
p pi( )=
−
+ − ⋅






−
1
1 
 
depend on the distance between p and pi, on the harmonic 
mean distance (Ri) to objects visible from pi, and on the rela-
tive orientation of the normals at p and pi. We can determine 
S(p) by requiring wi(p) to be larger than the reciprocal of the 
maximum acceptable error, a, which is a user-supplied pa-
rameter: S(p) = {i : wi(p) > 1/a}.
When the renderer requires indirect irradiance at any 
point p, it first determines S(p) by querying the IC. If S(p) 
is empty, E(p) is evaluated by the Monte Carlo integration; 
otherwise, it’s interpolated from the E(pi) belonging to S(p). 
Querying the IC to determine S(p) amounts to locating all 
samples pi stored in the cache that meet the search crite-
rion. This range search task is computationally demanding, 
but we can optimize it by resorting to 3D hierarchical data 
structures, such as octrees or k-d trees, and spatially order-
ing the IC.
By exploiting spatial coherence, the IC offers an order-of-
magnitude improvement in rendering time over Monte Carlo 
integration. We can further improve performance when 
rendering animations of static scenes because the indirect 
diffuse irradiance remains constant, which allows reuse of the 
IC samples across frames.
Researchers have recently extended the IC as a stand-
alone algorithm in many guises—for example, as an accel-
eration data structure for rendering glossy surfaces by stor-
ing radiance,3 as a participating-media phenomenon,4 for 
translucency,5 or in conjunction with photon mapping.6 
Other extensions have exploited coherence in the tem-
poral domain.7–9 PDI/DreamWorks has also used similar 
methods to accelerate rendering.10
References
 1. J.T. Kajiya “The Rendering Equation,” Proc. Siggraph, ACM 
Press, 1986, pp. 143–150.
 2. G. Ward, “A Ray Tracing Solution for Diffuse Interreflection,” 
Proc. Siggraph, ACM Press, 1988, pp. 85–92.
 3. J. Krivanek et al., “Radiance Caching for Efficient Global 
Illumination Computation,” IEEE Trans. Visualization and 
Computer Graphics, vol. 11, no. 5, 2005, pp. 550–561.
 4. D. Jarosz and J. Zwicker, “Radiance Caching for Participating 
Media,” ACM Trans. Computer Graphics, vol. 27, no. 1, 2008, 
article 56.
 5. S.-L. Keng, W.-Y. Lee, and J.-H. Chuang, “An Efficient Caching-
Based Rendering of Translucent Materials,” The Visual Computer, 
vol. 23, no. 1, 2006, pp. 59–69.
 6. H.W. Jensen, Realistic Image Synthesis Using Photon Mapping, 
A K Peters, 2001.
 7. M. Smyk et al., “Temporally Coherent Irradiance Caching 
for High Quality Animation Rendering,” Computer Graphics 
Forum, vol. 24, no. 3, 2005, pp. 401–412.
 8. P. Gautron, K. Bouatouch, and S. Pattanaik, “Temporal 
Radiance Caching,” IEEE Trans. Visualization and Computer 
Graphics, vol. 13, no. 5, 2007, pp. 891–901.
 9. K. Debattista et al., “Instant Caching for Interactive Global 
Illumination,” Computer Graphics Forum, vol. 28, no. 8, 2009, 
pp. 2216–2228.
 10. E. Tabellion and A. Lamorlette, “An Approximate Global 
Illumination System for Computer Generated Films,” ACM 
Trans. Graphics, vol. 23, no. 3, 2004, pp. 469–476.
 
70 September/October 2011
Feature Article
locks the IC whenever a read or write is made to 
it (see Figure 3, lines 4–6 and 12–14). However, 
the code responsible for hemisphere sampling, 
ComputeIrradianceRT(), isn’t a critical re-
gion, so for this computation the method allows 
concurrent irradiance evaluation. The LCK’s major 
disadvantage is that it serializes all accesses—both 
reads and writes—to the shared IC. As the number 
of threads increases, contention will increase, pre-
venting performance from scaling with the degree 
of parallelism.
The Local-Write Irradiance Cache
An alternative approach is to have a global IC 
readable by all threads and an additional local 
IC per thread (see Figure 4). Each thread writes 
only to its local IC but reads from both. At certain 
predefined execution points, such as the end of a 
frame, the local ICs sequentially merge into the 
global IC. This synchronization uses a frame end 
as a barrier, effectively constituting a blocking 
approach to synchronization.
This approach’s major drawback is that it disal-
lows any sharing in a single frame, thus result-
ing in work replication. The LW algorithm has 
a much higher IC sample count than the other 
two approaches we evaluated because each thread 
must locally evaluate all irradiance values required 
by its assigned image tiles. Additionally, memory 
consumption is dictated by the number of threads 
used and the octree’s complexity.
The Wait-Free Irradiance Cache
The wait-free (WF) algorithm doesn’t rely on any 
critical sections to both read and write to the 
shared IC. The algorithm changes three methods 
from the traditional IC.
The first method is the ICList::Add func-
tion (see Figure 5). The insert onto the node it-
self takes the form of an insertion onto an array 
or unbounded queue. For completeness, we dem-
onstrate how the method works for an unbounded 
queue. Insertion onto a fixed-size array is just a 
specialized case of this algorithm. The fixed-size 
array version is the same as the enqueue function 
of the Herlihy-Wing concurrent queue.10
The structure of ICList is an unbounded queue 
 1 IrradianceCache IC;
 2
 3 ComputeIndirectDiffuse() {
 4  //get irradiance from IC if there are valid records
 5  inIC = IC.getIrradiance ();
 6  if (!inIC) { // no valid records found
 7   // compute it by sampling the hemisphere
 8   ICsample = ComputeIrradianceRT ();
 9   // insert new IC sample into the octree
10   IC.insert (ICsample);
11  }
12 }
13
14 IrradianceCache::getIrradiance(Irr) {
15  Irr = {0,0,0};
16  <Traverse the octree>
17  <verify validity of sample>
18  <extrapolate irradiance; add to Irr>
19  if (found) return true;
20  else return false;
21 }
22
23 IrradianceCache::insert (ICsample) {
24  // recursively traverse the octree
25  // starting at root
26  IC.root.insert (ICsample);
27 }
28
29 ICNode::insert (ICSample) {
30  if (correct insertion node) {
31    IClist.Add (ICsample);
32  } else {
33   // go deeper in the octree
34   xyz = EvaluateOctant();
35   if (children[xyz] == NULL)
36    children[xyz] = new ICNode ();
37   children[xyz].insert (ICsample);
38  }
39 }
40
41 ICList::Add (ICsample) {
42  // insert new record in head of list
43  IClist.records[head++] = ICsample;
44 }
Figure 1. A traditional sequential irradiance cache (IC). This approach has 
no access control. It consists of an octree of recursive notes.
0 1 2 3 481 ...
k k+1 k+2 k+3
k+3head
k+4...
ICListICNode pointers to children 
ICNode
Figure 2. An ICNode structure for wait-free access control. For the other methods, ICList is just a single 
dynamic array that’s extended whenever required.
 IEEE Computer Graphics and Applications 71
of arrays, used to maintain coherence and ensure 
that queue extensions aren’t frequent (see Figure 
2). In Figure 5, the array type is qNode, which 
contains an array of qNodeSize elements and a 
pointer to another qNode. Initially, the queue of 
arrays contains only one qNode; when  the inser-
tion requires a new qNode, the executing thread 
creates it and attaches it to the previous one. An 
array is always initialized with a list of NULL point-
ers (or some other symbol that the computation 
doesn’t use) to denote that none of the threads has 
yet added an ICsample. When adding samples to 
an IC node, the algorithm uses the atomic fetch-
and-add operator (XADD; see Figure 5, line 3). This 
returns a unique index into the list of records, en-
suring that samples are never overwritten. Simul-
taneously, the thread increments the index to the 
next free position.
When the structure must be extended (see Figure 
5, line 13), the algorithm creates a new qNode 
and uses a compare-and-swap instruction (CAS) 
to insert it onto the previous qNode (see Figure 
5, line 19). If another thread hasn’t yet extended 
the queue—that is, if the pointer is still NULL, 
the CAS completes successfully and the executing 
thread inserts the associated ICsample onto 
the structure (see Figure 5, line 26). However, if 
another thread extended the queue, the CAS will 
fail and this thread will discard the created qNode 
(see Figure 5, line 20). The thread will then insert 
the associated sample onto the qNode that some 
other thread created (otherwise, the CAS would 
have succeeded).
Figure 6 demonstrates three concurrent threads—R, 
G, and B—executing this method for a qNodeSize 
of five. This will help illustrate how ICList:Add()
functions. In Figure 6a, the ICList is completely 
empty. In Figure 6b, R has just incremented the 
head but hasn’t yet inserted the sample. R inserts 
the sample in Figure 6c. In Figure 6d, both B and G 
have just incremented the head but not inserted the 
samples. In Figure 6e, B hasn’t inserted its sample 
and R has inserted another sample, but G has yet 
to insert the sample and is still on the same line of 
code as in Figure 6d.
Figure 6f demonstrates the scenario of the 
list needing to be extended and the possible 
conflicts that might occur. R has just filled in 
the first qNode, and G and B are about to insert 
another two samples. They have, in fact, already 
incremented head. Because both G and B have 
checked that the last qNode is full and has no 
successor (see Figure 5, line 16), both created a 
new qNode. However, the CAS at line 19 in Figure 
5 means that only one will succeed in attaching it 
to the previous qNode. In this case, at Figure 6g, 
we can see that G has succeeded and B is deleting 
the qNode it created. G has inserted the sample 
onto the new qNode. At Figure 6h, B inserts the 
sample onto the qNode that G had created.
The second method that’s changed from 
the traditional IC is the insert onto the octree 
structure (see Figure 7). When adding a new child 
node to the octree, the executing thread builds the 
new node using a temporary pointer. Once built, 
the node is attached to the octree using the CAS 
operator (see Figure 7, line 10). The reasoning 
underlying this method is similar to that for 
ICList::Add. Either this thread creates the 
subtree or some other thread does; computation 
proceeds notwithstanding.
The final modified method is Irradiance-
Cache::getIrradiance() (see Figure 8). 
The modifications simply reflect the structure 
 1 ComputeIndirectDiffuse()
 2 {
 3  //get irradiance from IC if there are valid  
  records
 4  IC.lock();
 5  inIC = IC.getIrradiance (Irr);
 6  IC.unlock();
 7
 8  if (!inIC) { // no valid records found
 9   // compute it by sampling the hemisphere
10   ICsample = ComputeIrradianceRT ();
11   // insert new IC sample into the octree
12   IC.lock();
13   IC.insert (ICsample);
14   IC.unlock();
15  }
16 }
Figure 3. A lock-based IC (LCK). This method uses busy-waiting for 
reading and extending the IC.
 1 IrradianceCache IClocal[number threads],   
  ICglobal;
 2 
 3 ComputeIndirectDiffuse()
 4 {
 5  //get irradiance from IC if there are valid  
  records
 6  inIC = ICglobal.getIrradiance (Irr);
 7 
 8  if (!inIC)
 9   inIC = IClocal[current thread].getIrradiance ();
10 
11  if (!inIC) { // no valid records found
12   // compute it by sampling the hemisphere
13   ICsample = ComputeIrradianceRT ();
14   // insert new sample into the local cache
15   IClocal[current thread].insert (ICsample);
16  }
17 }
Figure 4. A local-write IC (LW). This method maintains a separate IC per 
thread and synchronizes them after a single frame is rendered.
72 September/October 2011
Feature Article
 1 ICList::Add (ICsample) {
 2  // get index of new sample in node list
 3  int index = XADD (&head);
 4  int iteration = index / qNodeSize;
 5  int pos = index % qNodeSize;
 6  qNode * tail;
 7  int count = 0;
 8
 9  // identify node — can be optimized with a local tail
10  for (tail = qHead; tail —>next != NULL && count <iteration;
11   tail = tail —>next, count++);
12
13  if (iteration > count){
14   for (int n = 0; n < iteration — count; n++) {
15    // this is where we add the new Array
16    if (tail —>next == NULL) {
17     // all entries are initialized as NULL 
18     qNode * newN = new qNode;
19     if(!CAS(&tail —>next, NULL, newN))
20      delete newN;
21    }
22    tail = tail —>next;
23  }
24  // if this thread did not update
25  // some other thread must have updated
26  tail —>records[pos] = ICsample;
27  return index;
28 }
Figure 5. The wait-free IC Add method. This wait-free method ensures that IC samples can be added 
concurrently by different threads without blocking or busy-waiting.
(a) R = out, G = out, B = out
(d) R = out, G = 6, B = 7
(f) R = 27, G = 19, B = 18
(g) R = out, G = 27, B = 20 (h) R = out, G = out, B = 27
(e) R = 27, G = 6, B = out
(b) R = 4, G = out, B = out
(c) R = 27, G = out, B = out
0 1 2 3 4 0 1 2 3 4
0 1 2 3 4 0 1 2 3 4
0 1 2 3 4 0 1 2 3 4
5 6 7 8 9
5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
head 0
head 4 head 7
head 7head 7
head 1
head 1 head 3
Figure 6. How three threads (R, G, and B) would concurrently add samples to a node using our novel wait-free method. Numbers 
refer to the thread’s location in Figure 5. out means that the thread isn’t executing this function. ^ represents the NULL pointer. 
 IEEE Computer Graphics and Applications 73
changes to ICList. The method uses the fact of 
qNode elements being initialized to NULL. This 
function queries all elements that aren’t NULL 
and uses them to calculate the irradiance, if the 
valid neighborhood criterion is satisfied.
The wait-free approach ensures that all threads 
can access the single shared IC concurrently. Our 
experimental results show increased execution 
time both when interpolating and creating IC 
samples, without suffering the LW approach’s 
larger memory requirements.
Experimental Results
We obtained all the results presented here on two 
shared-memory systems.
One system was a dual-quad-core machine based 
on the Intel Xeon E5520 (Nehalem architecture), 
running at 2.26 GHz with 12 Gbytes of RAM. 
These processors include the Intel QuickPath Inter-
connect (which replaces the legacy front-side bus). 
They also support hyperthreading, enabling two 
threads per core and thus reporting a total of 16 
logical processors to the operating system. Hyper-
threading replicates certain processor resources but 
not the main execution units. Intel claims up to 30 
percent speed improvement, compared to otherwise 
identical, nonhyperthreaded processors.11
The second system was a quad hexacore ma-
chine based on the Intel Xeon E7450 (Dunning-
ton architecture), running at 2.40 GHz with 64 
Gbytes of RAM. With 24 physical cores, this sys-
tem let us evaluate the scalability of our wait-free 
access-control mechanism.
Both systems ran CentOS 5.2 with the Intel 
Compiler Suite Professional v. 11.0.
For all experiments, we used our own interactive 
ray tracer, which doesn’t employ packetization or 
explicit SIMD (single instruction, multiple data) 
operations. The only exception is the ray-bounding 
volume-intersection test used to traverse the 
acceleration data structure, which is a bounding-
volume-hierarchy implementation based on Ingo 
Wald and his colleagues’ research.12
Figure 9 shows the five scenes we used. We 
 1 ICNode::insert (ICSample) {
 2  if (correct insertion node)
 3   IClist.Add (ICsample);
 4  else { // go deeper in the octree
 5
 6   xyz = EvaluateOctant();
 7   if (children[xyz]==NULL) {
 8    temp = new ICNode();
 9    // Update new branch into the octree
10    if (!CAS (children[xyz], NULL, temp))
11     free temp;
12   }
13   // irelevant to whether this thread created the subtree
14   // someone must have created anyway
15   // recurse the insertion of ICsample onto the subtree
16   children[xyz].insert (ICsample);
17  }
18 }
Figure 7. The wait-free IC insert method. This method ensures that the octree grows dynamically without 
the use of busy-waiting or blocking.
 1 Irradiance Cache::getIrradiance(Irr) {
 2  Irr = {0,0,0};
 3  <Traverse the octree>
 4   for (qNode * tNode = qHead; tNode != NULL; tNode = tNode —>next) {
 5    for (i = 0; i < qNodeSize; i++)
 6     if (tNode —>records[i] != NULL) {
 7      <verify validity of sample>
 8      <extrapoloate irradiance; add to Irr>
 9     }
10  }
11  if (found) return true;
12  else return false;
13 }
Figure 8. The wait-free IC getIrradiance method. The method takes into account the dynamic nature of 
the ICList construction in the wait-free IC.
74 September/October 2011
Feature Article
picked them to provide a range of geometric com-
plexity, physical dimensions, and lighting condi-
tions. We rendered all scenes at 640 × 480 resolu-
tion. We use the following labels for the methods: 
traditional sequential (TRA), lock based (LCK), local 
write (LW), and wait free (WF). Tables detailing the 
results for still images and animation are at http://
doi.ieeecomputersociety.org/10.1109/MCG.2010.80.
Still Images
We varied the number of threads, and thus the 
number of used cores, up to 16 for the Nehalem 
architecture and 24 for the Dunnington architec-
ture. We obtained the results for a single thread 
using TRA, with no data-access-control, and com-
puted the speedup for the different techniques with 
respect to the sequential timings. Each image was 
calculated with an empty IC to show a worst-case 
scenario with maximal irradiance calculations.
Figures 10 and 11 present the speedup, normal-
ized number of evaluated IC samples, and effi-
ciency for both architectures. Each metric is the 
average over the five scenes. We normalized the 
metrics with respect to the results obtained for the 
same scene with one single thread and the tradi-
tional approach to the IC (no data-access-control). 
Absolute values particular to each scene are there-
fore irrelevant, as long as the behavior was similar 
across the different scenes for each access-control 
mechanism. We measured worst-case standard de-
viations of 14.2 and 4.2 percent for, respectively, 
absolute speedup and the normalized number of 
generated IC samples. The low worst-case values 
indicate that we can use these averages as reliable 
statistics to analyze our results.
For absolute speedup, Figures 10a and 11a in-
clude a dashed line depicting the linear speedup 
possible if the parallel solution incurred no algo-
rithmic or implementation penalties. For the Ne-
halem architecture with more than eight threads, 
linear speedup increased only 30 percent with 
each additional logical core, corresponding to 
Intel’s claim about hyperthreading’s maximum 
speed improvement.11
For all experiments with the Nehalem system 
and up to 14 threads in the Dunnington case, LW 
performed and scaled worse than the two other 
algorithms. This is because only one frame was 
rendered and the local caches merged only at the 
end of the frame, so no sharing actually occurred. 
Each thread had to evaluate all irradiance samples 
that projected into its assigned tiles of the image 
plane, leading to work replication. This is evident 
by the number of evaluated irradiance samples (see 
Figures 10a and 11a), which increased dramatically 
with the level of concurrency.
(a) (b)
(c) (d) (e)
Figure 9. The five scenes used in the experiments: (a) conference room (190k polygons), (b) Sponza (66k polygons), (c) Cornell 
(48k  polygons), (d) desk (12k polygons), and (e) office (20k polygons). The scene provide a range of geometric complexity, 
physical dimensions, and lighting conditions, all rendered at 640 × 480 resolution.
 IEEE Computer Graphics and Applications 75
The performance difference between LCK and 
WF became evident as the number of threads in-
creased: time waiting for locks grew, causing a 
major performance loss. The wait-free algorithm 
scaled much better. With fewer threads, LCK per-
formed similarly to WF because the computation 
spent most of the time evaluating new irradiance 
samples, which isn’t a critical region of the code. 
As the number of threads increased, the render-
ing performed more range searches. Because LCK 
serialized these searches, it incurred a perfor-
mance penalty. Figures 10a and 11a clearly show 
that LCK’s performance loss isn’t due to work rep-
lication. In fact, the total number of IC samples 
evaluated by LCK and WF decreased above a cer-
tain number of threads. Success in finding valid 
samples to interpolate from depended on the or-
der in which the threads requested and evaluated 
samples. Concurrent rendering of multiple image 
plane tiles quickly filled the IC with samples that 
were better distributed over object space, result-
ing in more successful range searches than with 
the sequential approach. Above a significant num-
ber of threads, LCK’s serialization penalty became 
larger than the overhead associated with work 
replication, and it performed even worse than LW 
(see Figure 11).
Parallel algorithms seldom exhibit linear speedup 
because of overheads, such as load imbalance, work 
replication, and communication and synchroniza-
tion costs. Wait-free access control can minimize 
the last two overheads; it exhibited almost linear 
speedup and, consequently, a nearly constant effi-
ciency of 0.9 on systems up to 24 cores (see Fig-
ures 10b and 11b). With around 14 cores on the 
Dunnington system, LCK speedup reached an in-
flection point and started decreasing, showing that 
lock-based approaches don’t scale with increas-
ing levels of concurrency. On the other hand, WF 
speedup grew linearly up to 24 cores, although at 
No. of threads
No. of threads
12
10
8
8
6
4
2
1.1
1.0
0.9
0.8
0.7
(a)
(b)
1 2 4 6 8 10 12 14 16
0.85
0.90
0.95
1.00
1.05
1.10
1.15
A
bs
ol
ut
e 
sp
ee
du
p
Ef
c
ie
nc
y
N
or
m
al
iz
ed
 n
o.
 o
f I
C
 s
am
p
le
s
1 2 4 6 8 10 12 14 16
WF
LW
LCK
WF LCK
Wait free (WF)
Solution that incurs no algorithm or implementation penalties
Local write (LW) IC samples:Speedup: Lock based (LCK) LW
Figure 10. Performance for still images on the Nehalem architecture: (a) speedup and the normalized number 
of IC samples and (b) efficiency. (All values are averaged over the five scenes used in the experiments.) The 
wait-free IC presents an efficiency above 94 percent up to 16 logical (hyperthreaded) cores.
76 September/October 2011
Feature Article
a rate slightly lower than the increase in the num-
ber of cores. The WF curve’s shape suggests that 
any eventual inflection point is still far from being 
reached, which demonstrates its superior scalabil-
ity potential. The shared-memory parallel ray tracer 
also incurs overheads such as workload distribution 
and results gathering, which were partially respon-
sible for the small deviation from linear speedup 
with WF.
Animations
Figure 12 shows, for the Sponza and conference 
room scenes running on 24 cores on the Dun-
nington system, the frame rate and the number 
of IC samples evaluated per frame when running 
an animation of 36 frames while the camera per-
formed a 360-degree rotation around the scene 
(10 degrees from frame to frame). Each frame 
reused previously created cache samples while si-
multaneously calculating new ones. This provides 
a performance overview of mixed evaluation and 
interpolation, unlike the case of the still images. 
For each scene, the first frame is the equivalent of 
one of the still images, where the cache is totally 
empty and all the samples must be generated.
Clearly, LCK performs worse than LW and WF. 
Because the IC won’t be empty except for the first 
frame, the computation can reuse many irradi-
ance samples. However, LCK serializes all range 
searches to locate these samples, thus severely af-
fecting performance. LCK achieves the best ren-
dering time for the first frame, suggesting that 
temporal reuse of previously calculated irradiance 
samples is worse than recalculating these values, 
which completely contradicts the rationale behind 
the IC.6 We can thus conclude that synchroniza-
tion overheads make such lock-based access mech-
anisms prohibitive when rendering animations of 
No. of threads
1 2 4 6 8 10 161412 18 2220 24
No. of threads
1 2 4 6 8 10 161412 18 2220 24
0.80
0.95
0.10
1.25
1.40
1.55
1.70
N
or
m
al
iz
ed
 n
o.
 o
f I
C
 s
am
p
le
s
24
20
16
12
8
4
0
(a)
(b)
A
bs
ol
ut
e 
sp
ee
du
p
WF
LW
LCK
1.0
0.9
0.8
0.7
0.6
0.5
0.4
0.3
Ef
c
ie
nc
y
WF LCK
Wait free (WF)
Solution that incurs no algorithm or implementation penalties
Local write (LW) IC samples:Speedup: Lock based (LCK) LW
Figure 11. Performance for still images on the Dunnington architecture: (a) speedup and the normalized 
number of IC samples and (b) efficiency. (All values are averaged over the five scenes used in the experiments.) 
The wait-free IC achieves close to ideal speedup and, at 24 cores, is twice as fast as its closest rival.
 IEEE Computer Graphics and Applications 77
static scenes on highly concurrent shared-memory 
systems.
WF outperforms LW because the former shares 
irradiance samples immediately without any extra 
synchronization overhead associated with reading, 
whereas the latter doesn’t share samples in a 
frame and so incurs extensive, costly evaluations 
of more indirect diffuse irradiance values. The 
bars in Figure 12 clearly show that, for WF and 
LW, performance variations from frame to frame 
correlate highly with the number of IC samples 
evaluated per frame. More important, these graphs 
show that WF’s better results come from evaluating 
fewer irradiance samples, which is a consequence 
of efficient data sharing among threads.
In summary, LCK is penalized mostly by read-
ing serialization, and LW is penalized by work 
replication. WF efficiently shares IC values while 
minimizing writing overheads and eliminating 
synchronization overheads associated with con-
current reads.
Multicore and multiprocessor systems now represent the standard form of desktop 
computing. Soon, such systems will likely have a 
degree of parallelism larger than what’s available 
on current machines. So, the relevance of efficient, 
scalable, and reliable shared data structures for 
maximizing performance is continually increas-
ing. Wait-free data structures offer an alternative 
to traditional locking and blocking approaches and 
enable traditional graphics algorithms to exploit 
modern hardware. The shared-memory IC we’ve 
demonstrated shows such techniques’ potential. 
Our algorithm has let us achieve close to interac-
tive rates for ray tracing with global illumination. 
We hope our solution will motivate similar paral-
lel methods in other computer graphics areas. 
Acknowledgments
Partial funding for this work came from the Portuguese 
Foundation for Science and Technology’s Project IGIDE 
N
o.
 o
f I
C
 s
am
p
le
s
Fr
am
es
 p
er
 s
ec
on
d
(a)
(b)
1 3 5 7 9 11 13 15 17 19
Frame number
21 23 25 27 29 31 33 35
3,000
2,250
1,500
750
08
6
4
2
0
N
o.
 o
f I
C
 s
am
p
le
s
Fr
am
es
 p
er
 s
ec
on
d
6,000
4,000
5,000
2,000
3,000
1,000
0
8
10
12
6
4
2
0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35
WFIC samples:Speedup: Wait free (WF) Local write (LW) Lock based (LCK) LW
Frame number
WFIC samples:Speedup: Wait free (WF) Local write (LW) Lock based (LCK) LW
Figure 12. Performance for animation on the Dunnington system with 24 cores, for the (a) Sponza and  
(b) conference room scenes. The wait-free IC outperforms the other methods for both scenes.
78 September/October 2011
Feature Article
(Interactive Global Illumination within Dynamic 
Environments), grant PT-DC/EIA/65965/2006, and 
from the United Kingdom’s Engineering and Physical 
Sciences Research Council, grant EP/D069874/2. 
Greg Ward provided the office and conference scenes 
from the Radiance package, and Stanford’s Graphics 
Group provided the bunny model from the Stanford 
3D Repository. Alberto Proença granted us access to 
the two multicore systems.
References
 1. M. Herlihy, “Technical Perspective: Highly Concur-
rent Data Structures,” Comm. ACM, vol. 52, no. 5, 
2009, p. 99.
 2. G. Ward, “A Ray Tracing Solution for Diffuse 
Interreflection,” Proc. Siggraph, ACM Press, 1988, 
pp. 85–92.
 3. M. Herlihy, V. Luchangco, and M. Moir, “Obstruction-
Free Synchronization: Double-Ended Queues as 
an Example,” Proc. 23rd Int’l Conf. Distributed 
Computing Systems (ICDCS 03), IEEE CS Press, 2003, 
pp. 522–529.
 4. P. Dubla et al., “Wait-Free Shared-Memory Irradiance 
Cache,” Proc. Eurographics Symp. Parallel Graphics 
and Visualization (ESPGV 09), Eurographics Assoc., 
2009, pp. 57–64.
 5. I. Wald et al., “State of the Art in Ray Tracing 
Animated Scenes,” Eurographics 2007 State of the Art 
Reports, Eurographics Assoc., 2007, pp. 89–116.
 6. G. Ward, “The Radiance Lighting Simulation and 
Rendering System,” Proc. Siggraph, ACM Press, 1994, 
pp. 459–472.
 7. R. Koholka, H. Mayer, and A. Goller, “MPI-Parallelized 
Radiance on SGI CoW and SMP,” Proc. 4th Int’l ACPC 
Conf. (ParNum 99), Springer, 1999, pp. 549–558.
 8. D. Robertson et al., “Parallelization of Radiance for 
Real Time Interactive Lighting Visualization Walk-
throughs,” Proc. 1999 ACM/IEEE Conf. Super-
computing, ACM Press, 1999, article 61.
 9. K. Debattista, L.P. Santos, and A. Chalmers, 
“Accelerating the Irradiance Cache through Parallel 
Component-Based Rendering,” Proc. Eurographics 
Symp. Parallel Graphics and Visualization, Eurographics 
Assoc., 2006, pp. 27–34.
 10. M.P. Herlihy and J.M. Wing, “Linearizability: A 
Correctness Condition for Concurrent Objects,” 
ACM Trans. Programming Languages and Systems, vol. 
12, no. 3, 1990, pp. 463–492.
 11. D. Marr et al., “Hyper-threading Technology 
Architecture and Microarchitecture,” Intel Technology 
J., vol. 6, no. 1, 2002, pp. 4–15.
 12. I. Wald, S. Boulos, and P. Shirley, “Ray Tracing 
Deformable Scenes Using Dynamic Bounding 
Volume Hierarchies,” ACM Trans. Graphics, vol. 26, 
no. 1, 2007, article 6.
Kurt Debattista is an assistant professor at the University 
of Warwick’s International Digital Laboratory, WMG. His 
research focuses on high-fidelity rendering, parallel comput-
ing, high-dynamic-range imaging, and serious games. De-
battista has a PhD in computer science from the University 
of Bristol. Contact him at k.debattista@warwick.ac.uk.
Piotr Dubla is a graphics programmer at Rockstar North, 
a game studio, while completing his PhD at the University 
of Warwick’s International Digital Laboratory, WMG. His 
research interests include interactive ray tracing, global il-
lumination, offline physically based rendering, and realistic 
simulations of light transport as well as parallel processing 
and data structures for computer graphics and procedural 
content generation. Dubla has a BSc Hons (with distinc-
tion) in computer science from the University of Cape Town. 
Contact him at p.b.dubla@wariwck.ac.uk.
Luís Paulo Peixoto dos Santos is an auxiliary professor 
in Universidade do Minho’s Department of Computer Sci-
ence. His research interests include high-fidelity interactive 
rendering and parallel processing. Santos has a PhD in par-
allel processing from Universidade do Minho. Contact him 
at psantos@di.uminho.pt.
Alan Chalmers is a professor at the University of Warwick’s 
International Digital Laboratory, WMG. His research inter-
ests include high-fidelity graphics, multisensory perception, 
virtual archaeology, and parallel rendering. Chalmers has 
a PhD in computer science from the University of Bristol. 
He is the honorary president of Afrigraph and a former vice 
president of ACM Siggraph. Contact him at a.g.chalmers@
warwick.ac.uk.
Selected CS articles and columns are also available 
for free at http://ComputingNow.computer.org.
NEW from
Y2K: BEWARE! 
IT’S NOT OVER YET
by Robert L. Glass
In these new and updated essays told around 
the proverbial IT camp re, Robert Glass looks 
back at the Y2K crisis to  nd lessons applicable 
to future crises coming all too soon.
PDF edition • $12 list / $9 members • 15 pp.
Order Online:
COMPUTER.ORG/STORE
