Memory-savvy distributed interactive ray tracing by Parker, Steven G. & DeMarle, David E.
Eurographics Symposium on Parallel Graphics and Visualization (2004) 
Dirk Bartz, Bruno Raffin and Han-Wei Shen (Editors)
Memory-Savvy Distributed Interactive Ray Tracing
David E. DeMarle, Christiaan P. Gribble, and Steven G. Parker†
Scientific Computing and Imaging Institute 
University of Utah
Abstract
Interactive ray tracing in a cluster environment requires paying close attention to the constraints of a loosely coupled
distributed system. To render large scenes interactively, memory limits and network latency must be addressed
efficiently. In this paper, we improve previous systems by moving to a page-based distributed shared memory layer,
resulting in faster and easier access to a shared memory space. The technique is designed to take advantage of
the large virtual memory space provided by 64-bit machines. We also examine task reuse through decentralized
load balancing and primitive reorganization to complement the shared memory system. These techniques improve
memory coherence and are valuable when physical memory is limited.
Categories and Subject Descriptors (according to ACM CCS): I.3.2 [Computer Graphics]: Graphics Systems - Distributed/network graphics; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism - Ray tracing
1. Introduction
Datasets are growing in size at an alarming rate. Typical
datasets, including regular volumetric data, tetrahedral volumes,
surface models, and textures, are often hundreds of
megabytes or tens of gigabytes in size and are much larger
than the capacity of the physical memory in most uniprocessor
machines. Parallel computers allow us to solve this problem
because their processing power is great enough to render
the data and because their memory resources are plentiful
enough to hold the data. Amdahl's law, which implies that
any amount of sequential processing limits parallel computational
scaling, encourages us to take advantage of the second
feature of parallel computing and make the best possible
use of the parallel memory resources.
It is well known that ray tracing is trivially parallel. With 
the huge number of data accesses incurred by ray tracing, 
however, memory access delays quickly become the indivis­
ible portion of the rendering time and limit achievable scala­
bility when visualizing large, shared datasets. To reduce this 
bottleneck, we apply a page-based distributed shared mem­
ory (PDSM) system to interactive ray tracing.
In our distributed interactive ray tracing system, each
node in the cluster manages different pieces of the scene
data and reserves some local space to cache remote pieces.
Our renderer accesses data through the regular address space
by employing operating systems services, particularly virtual
memory. Accessing distributed memory through the virtual
memory system is beneficial because the shared memory
software layer intervenes only when a missing page is
referenced. Figure 1 presents a high-level view of the process
by which render threads access scene data.
† {demarle|cgribble|sparker}@sci.utah.edu
The page-based approach has the potential to allow clusters
of inexpensive machines to render large datasets quickly,
even when each node has a physical memory that can store
only a fraction of the total data. This point is important because,
while next generation 64-bit machines are able to address
tremendous amounts of memory, the cost of that memory
is a limiting factor in the quantity available on each node.
Efforts to employ the total sum of the memory in an efficient
and cost-effective manner are valuable.
Our new system has advantages over previous systems because
memory access is fully decentralized, does not rely on
disk access, and has a natural programming interface. Shared
memory hits incur no overhead, so when the working set is
reasonably bounded and changes gradually between frames,
the system operates at nearly the same speed as if the data
had been replicated on each node.
(c) The Eurographics Association 2004.
D. E. DeMarle, C. P. Gribble, & S. G. Parker /  EG Memory-Savvy
Figure 1: Basic Page-Based DSM Architecture. The virtual
memory hardware detects misses, and the PDSM layer
causes remote references to be paged from across the network.
Two additional changes to our renderer attempt to capitalize
on this advantage. Both seek to increase hit rates, helping
to alleviate the large miss penalties and take advantage
of low hit times. The first change is to move from a centralized
demand driven load balancing scheme, in which renderers
obtain work from the supervisor, to a decentralized
work stealing scheme. Work stealing helps reduce the variation
in tile assignments between frames, enabling previously
cached data to be reused.
The second change seeks to improve hit rates by reorganizing
the scene data to improve coherence. The key observation
is that spatially local primitives should be stored together
in memory as well. Similar to data bricking for volumetric
datasets [PSL*98, CE97], this technique can improve
rendering rates by increasing the probability that a page of
memory will be reused once it has been loaded.
2. Related Work
DeMarle et al. [DPH*03] use an object-based distributed
shared memory (ODSM) to render large volumetric datasets.
Distributed access to scene data in a switched network provides
data at the combined bandwidth of all of the machines
in the cluster. Badouel et al. [BBP94] achieved a
similar effect with a page-based distributed shared memory.
Quarks [CKK95] and Adsmith [LKL97] are representative
examples of full featured page- and object-based DSMs.
In this paper we compare the performance of both types of
DSM layers for distributed interactive ray tracing.
Wald et al. [WSB01] have explored coherent ray tracing 
techniques in the distributed environment. They address the 
challenge of rendering large, complex models interactively 
by combining centralized data access and client-side caching
of geometry voxels. They take pains to exploit spatial co­
herence within BSP tree nodes and temporal coherence be­
tween subsequent frames. In their system, both tile assign­
ments and data retrieval go through central servers. In this 
work we parallelize these functions to eliminate the central 
bottlenecks.
Many authors have considered load balancing for parallel
ray tracing. For example, Heirich et al. [HA98] discuss the
necessity of dynamic load balancing for ray tracing in an interactive
setting and describe a scheme based on a diffusion
model. Reinhard et al. [RCJ99] present an advanced hybrid
load balancing scheme in which both objects and rays can
be transferred between processing elements. The data parallel
tasks allow their system to render large and complex scenes
efficiently while the demand driven tasks, consisting of coherent
ray packets, balance the load more evenly. A complete
discussion of load balancing techniques for parallel rendering
can be found in [CDR02]. We focus on decentralized
load balancing with work stealing as a simple and effective
method to balance the load while improving memory and
network performance characteristics.
As the gap between processor speeds and memory
and network speeds continues to widen, data locality
becomes increasingly critical to rendering performance.
Pharr et al. [PKGH97] use caching techniques to manage
model complexity in an off-line rendering process. They
complement lazy data loading with data reorganization in
a geometry cache. The reorganization ties spatial locality in
the three-dimensional space of the scene to locality in memory
so that expensive disk access times can be amortized.
Cox et al. [CE97] apply similar methods for scientific visualization
of large datasets. The data reorganization technique
we describe targets the same goal, and in this paper, we examine
its effect in a distributed shared memory environment.
3. Page-Based Distributed Shared Memory
In [DPH*03], we presented a solution to the memory
problem in which a C++ dataserver object obtains data
from remote nodes when requested by one or more render
threads. With a well-constructed acceleration structure, volume
bricking, caching, and strict attention to efficient accesses
to the dataserver, more than 95% of data accesses resulted
in hits. Such a high hit rate allowed the program to
produce interactive visualizations of multi-gigabyte datasets,
despite the fact that miss penalties were on the order of
1000 µs.
The high hit rates in our renderer make the hit times a 
tempting target for optimization. In an ODSM system, ev­
ery memory access must go through expensive access tests, 
which are performed in software, to find a block and ensure 
that it is available for the duration of its use. In our new sys­
tem, the virtual memory hardware of the machine handles 
all memory accesses and a signal handler processes only the
exceptional case of a cache miss. The DSM system requires 
no kernel modifications because it is implemented entirely 
in user-space using standard Unix system calls. Because the 
scenes are read-only during rendering, the DSM system does 
not implement page invalidation, is not prone to false shar­
ing inefficiencies and does not require complex consistency 
mechanisms.
The distributed memory space now occupies a reserved 
range of virtual memory addresses. As before, each node as­
sumes ownership of different stripes of the shared memory 
space and populates them with data at program initialization. 
The remainder of the shared address range is initially empty 
and unmapped.
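The ownership lookup implied by this striping can be sketched as follows. This is only an illustration: the round-robin stripe layout, the page size used in the example, and the names `PdsmLayout`, `page_of`, and `owner_of` are assumptions, since the text does not specify how stripes are assigned to nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of the reserved shared address range.
struct PdsmLayout {
    std::uintptr_t base;          // first address of the shared range
    std::size_t page_size;        // bytes per shared page (assumed value)
    std::size_t pages_per_stripe; // consecutive pages owned by one node
    int num_nodes;                // nodes taking part in the striping

    // Page index of an address that lies inside the shared range.
    std::size_t page_of(std::uintptr_t addr) const {
        return (addr - base) / page_size;
    }

    // Node that owns a page; stripes are dealt out round-robin (an assumption).
    int owner_of(std::size_t page) const {
        return static_cast<int>((page / pages_per_stripe) % num_nodes);
    }
};
```

With five nodes and four pages per stripe, for example, pages 0 to 3 would belong to node 0, pages 4 to 7 to node 1, and page 20 would wrap around to node 0 again.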
During execution, render threads generate a segmentation 
fault signal every time they access an unmapped address. A 
registered handler catches the signal and retrieves the fault­
ing address. The handler finds the node that owns the page 
of memory in which the faulting address resides, issues a 
request to the owner, and suspends the render thread.
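The fault-driven access path can be tried out on a single machine with the sketch below. It is not the implementation described here: it assumes a POSIX system such as Linux, reserves the range with `PROT_NONE` and enables pages with `mprotect` rather than mapping fresh pages, replaces the network request with a local `fake_fetch`, and is single-threaded. All names (`pdsm_init`, `on_segv`, `fake_fetch`, the `g_` globals) are hypothetical, and the calls made inside the handler are not formally async-signal-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical globals describing the reserved shared range.
static unsigned char* g_base = nullptr;
static std::size_t g_page_size = 0;
static std::size_t g_num_pages = 0;
static volatile sig_atomic_t g_faults = 0;

// Stand-in for the remote fetch: fill the page with (page number + 1).
static void fake_fetch(unsigned char* page, std::size_t page_num) {
    std::memset(page, static_cast<int>(page_num + 1), g_page_size);
}

// Fault handler: locate the faulting page, make it accessible, load its data.
static void on_segv(int, siginfo_t* info, void*) {
    auto addr = reinterpret_cast<std::uintptr_t>(info->si_addr);
    auto base = reinterpret_cast<std::uintptr_t>(g_base);
    if (addr < base || addr >= base + g_num_pages * g_page_size) {
        _exit(1);  // a genuine crash outside the shared range
    }
    std::size_t page_num = (addr - base) / g_page_size;
    unsigned char* page = g_base + page_num * g_page_size;
    mprotect(page, g_page_size, PROT_READ | PROT_WRITE);
    fake_fetch(page, page_num);
    g_faults = g_faults + 1;
    // Returning from the handler retries the faulting instruction.
}

// Reserve an inaccessible range and register the handler.
bool pdsm_init(std::size_t num_pages) {
    g_page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    g_num_pages = num_pages;
    void* p = mmap(nullptr, num_pages * g_page_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return false;
    g_base = static_cast<unsigned char*>(p);

    struct sigaction sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    // Some systems report protection faults as SIGBUS, so register both.
    return sigaction(SIGSEGV, &sa, nullptr) == 0 &&
           sigaction(SIGBUS, &sa, nullptr) == 0;
}
```

After `pdsm_init`, the first read of any address in the range takes the fault path once, and later reads of the same page proceed at ordinary memory speed, which is the property the page-based layer relies on.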
When the owner responds with the page, the communi­
cation thread first checks the number of cached pages. If 
there are no open slots, a page is selected for eviction to 
reclaim space. Currently a random page replacement policy 
is used because of the difficulty of implementing a more sophisticated
algorithm in user-space. Next, the communication
thread receives the sent data directly into a newly created
page residing at the requested address and wakes the 
render thread, which continues forward.
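The slot bookkeeping with random replacement can be sketched as below. The class name `RandomPageCache`, its interface, and the use of `std::mt19937` are assumptions made for illustration; only the policy itself (evict a random cached page when no slot is open) comes from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_map>
#include <vector>

// Tracks which remote pages occupy the local cache slots.
// The capacity must be at least 1.
class RandomPageCache {
public:
    RandomPageCache(std::size_t capacity, unsigned seed)
        : capacity_(capacity), rng_(seed) {}

    bool contains(int page) const { return slot_of_.count(page) != 0; }
    std::size_t size() const { return slots_.size(); }

    // Records the arrival of `page`. Returns the evicted page number,
    // or -1 if the page was already cached or a free slot was available.
    int insert(int page) {
        if (contains(page)) return -1;
        if (slots_.size() < capacity_) {
            slot_of_[page] = slots_.size();
            slots_.push_back(page);
            return -1;
        }
        // No open slot: choose a victim uniformly at random.
        std::uniform_int_distribution<std::size_t> pick(0, capacity_ - 1);
        std::size_t slot = pick(rng_);
        int victim = slots_[slot];
        slot_of_.erase(victim);
        slots_[slot] = page;
        slot_of_[page] = slot;
        return victim;
    }

private:
    std::size_t capacity_;
    std::mt19937 rng_;
    std::vector<int> slots_;                       // slot -> page number
    std::unordered_map<int, std::size_t> slot_of_; // page number -> slot
};
```

A caller would unmap the returned victim, if any, before receiving the new page into the freed slot.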
One strength of a PDSM is that data accesses do not re­
quire special handling in the application. The application 
need not distinguish between items that lie in shared space 
and those that do not. Programs that use large amounts of 
memory can be developed in a similar fashion to those that 
employ shared memory hardware. This ease of use makes 
it feasible to render large scenes composed of any object 
type. Additionally, placing the acceleration structure in the 
shared memory space can be beneficial. Although accesses 
to uncached portions of the structure impose slight penalties, 
these do not occur frequently in practice. In fact, frequently 
accessed root level data typically remain loaded while un­
used branches tend to be pruned away.
The primary disadvantage of a page-based system is that 
it is limited to scenes that fit within the address space of the 
machines on which it runs. For 32-bit machines, the maxi­
mum theoretical limit is 4 GB. In practice, however, operat­
ing system limits and the need to leave some addresses for 
other program data make the limit slightly less than 2 GB. 
Fortunately, the increasing availability of 64-bit machines 
makes this limitation much less severe.
The core routines of the PDSM software layer are given 
as pseudocode in Figure 2. In Section 6.1 we analyze the effectiveness
of rendering large scenes using the PDSM space.
// At program start, this function is
// registered to handle SIGSEGV
void memintercept(siginfo_t* sinfo) {
  void* faulting_addr = sinfo->si_addr;
  int page_num = get_page_num(faulting_addr);
  int owner = get_owner(page_num);
  send_msg(owner, REQUEST_PAGE, page_num);
  page_wait_sem.down();  // suspend until the page arrives
}

// At program start, this function is
// registered to handle PDSM messages
void handlemessage(int sender, int msgid,
                   int page_num) {
  if (msgid == REQUEST_PAGE) {
    // Owner side: reply with the requested page
    send_msg(sender, SENT_PAGE,
             &page[page_num]);
  } else {
    // Requester side: make room, then receive the page
    int destslot = num_pages_loaded;
    if (num_pages_loaded == cachesize) {
      int victim = select_victim_page();
      unmap(&page[victim], sizeof(page));
      destslot = victim;
    } else
      num_pages_loaded++;
    mmap(&(page[destslot]), sizeof(page));
    recv_msg(sender, &page[destslot]);
    page_wait_sem.up();  // wake the render thread
  }
}
Figure 2: Page-Based DSM Core Functionality. These two
functions implement the distributed shared memory.
memintercept is invoked when a render thread accesses
memory that is unmapped, and handlemessage reacts to
the resulting network messages.
4. Decentralized Load Balancing
Even with faster access to shared scene data, each miss in­
flicts hundreds of microseconds of delay. The miss time is 
largely a function of the network characteristics and is not 
easily reduced. In an interactive rendering system, then, it 
is critical to reduce the number of misses. Toward this end,
we have experimented with a decentralized load balancing
scheme that employs work stealing. Our algorithm is a simplification
of that in [RSAU91].
As in our previous system, rendering tasks are based on an 
image-space division because primary rays can be traced independently.
Before, the supervisor node maintained a work 
queue, and workers implicitly requested new tiles from the 
supervisor when they returned completed assignments. Al­
though the central work queue quickly achieves a well- 
balanced workload, it results in poor memory coherence be­
cause tile assignments are essentially random and change ev­
ery frame.
With a work stealing load balancer, each render thread 
starts frame t with the assignments it completed in frame 
t - 1. This pseudo-static assignment scheme increases hit
rates because the data used to render frame t - 1 will likely
be needed when rendering frame t. The goal of this approach
is similar to the scheduling heuristic described by
Wald et al. [WSB01].
Our new system uses a combination of receiver- and 
sender-initiated task migration to prevent the load from be­
coming unbalanced when the scene or viewpoint changes. 
Once a worker finishes its assignments for a given frame, it 
picks a peer at random and requests more work. If that peer 
has work available, it responds. To improve the rate of con­
vergence toward a balanced load, heavily loaded workers can 
also release work without being queried. In our current im­
plementation, for example, the node that required the most 
time to complete its assignments will send a task to a ran­
domly selected peer at the beginning of the next frame.
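The receiver-initiated part of this scheme can be illustrated with a small single-process simulation. It is a simplified sketch rather than our implementation: tiles have unit cost, time advances in lock-step, the sender-initiated release is omitted, and the function name `render_frame` and its interface are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <random>
#include <vector>

// Simulates one frame. queues[w] holds the tiles worker w rendered in the
// previous frame. The result lists the tiles each worker renders in this
// frame, which would become its starting assignment for the next frame.
std::vector<std::vector<int>> render_frame(std::vector<std::deque<int>> queues,
                                           unsigned seed) {
    const std::size_t n = queues.size();
    std::vector<std::vector<int>> done(n);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick_peer(0, n - 1);

    auto remaining = [&queues] {
        std::size_t r = 0;
        for (const auto& q : queues) r += q.size();
        return r;
    };

    while (remaining() > 0) {
        for (std::size_t w = 0; w < n; ++w) {
            if (queues[w].empty()) {
                // Idle worker: ask one randomly chosen peer for work.
                std::size_t peer = pick_peer(rng);
                if (peer != w && !queues[peer].empty()) {
                    queues[w].push_back(queues[peer].back());
                    queues[peer].pop_back();
                }
            }
            if (!queues[w].empty()) {
                // Render one tile in this time step.
                done[w].push_back(queues[w].front());
                queues[w].pop_front();
            }
        }
    }
    return done;
}
```

When the previous assignment is already balanced, no steals occur and every worker keeps exactly the tiles it had, which is the case in which cached pages are reused most effectively.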
Figure 3 contains diagnostic images showing typical im­
age tile distributions for the original demand driven and the 
new work stealing algorithms. Note the distribution of tiles 
in the work stealing image is more regular. In Section 6.2 we 
analyze the effectiveness of this optimization.
Figure 3: Comparing Task Assignment Strategies. Tiles ren­
dered by each node have unique gray-levels. On the left, 
tasks constantly change with demand driven assignment. On 
the right, assignments are more stable with work stealing, 
allowing workers to reuse locally cached data more often.
5. Address Sorting
The problem of accessing data is acute in network memory 
and out-of-core systems, where the access time to missing 
memory is high. To decrease the number of misses, we care­
fully organize the layout of scene data in memory. If primi­
tives located together in three-dimensional space can be re­
arranged so they are also located together in address space, 
pages are more likely to be reused. The effect of our technique
is similar to that achieved by Pharr et al. [PKGH97].
We reorganize the memory layout of our input data using 
a preprocessing program. This program reads a mesh file and 
creates a multi-level grid acceleration structure that groups
nearby objects together. To sort the geometry for improved 
coherence, we traverse the acceleration structure and write 
the primitives, in order, to a new scene database. Although 
the input data may contain neighboring triangles p and q 
that are separated by tens of megabytes of address space, the 
output data will contain new triangles p' and q' within a few 
bytes of each other. The preprocessing program takes only a 
few minutes for the models we tested.
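The idea behind the reordering can be sketched with a much simpler structure than the one our preprocessor builds. The example below assumes a single-level uniform grid, buckets each triangle by its centroid only (so nothing is duplicated across cells), and assumes centroids normalized to [0, 1). The names `Triangle`, `cell_index`, and `sort_by_cell` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Triangle {
    float cx, cy, cz;  // centroid, assumed to lie in [0, 1) on each axis
    int id;            // position of the triangle in the input file
};

// Linear index of the grid cell containing a centroid (res cells per axis).
std::size_t cell_index(const Triangle& t, std::size_t res) {
    auto cell = [res](float v) {
        std::size_t c = static_cast<std::size_t>(v * static_cast<float>(res));
        return std::min(c, res - 1);
    };
    return (cell(t.cz) * res + cell(t.cy)) * res + cell(t.cx);
}

// Returns the triangles reordered so that those sharing a grid cell are
// contiguous, and therefore close together once written to memory.
std::vector<Triangle> sort_by_cell(std::vector<Triangle> tris, std::size_t res) {
    std::stable_sort(tris.begin(), tris.end(),
                     [res](const Triangle& a, const Triangle& b) {
                         return cell_index(a, res) < cell_index(b, res);
                     });
    return tris;
}
```

Writing the returned sequence to the scene database places spatial neighbors at nearby addresses, so a page fetched for one ray is more likely to serve its neighbors as well.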
Figure 4 shows graphically what it means to group trian­
gles in the shared address space according to spatial locality. 
In the figure, all triangles within pages owned by a partic­
ular node have identical hues. Figure 4b demonstrates that, 
without reorganization, neighboring triangles may be placed 
far apart in memory or owned by different nodes. Figure 4c 
shows the address alignment of the sorted mesh. With this 
layout, neighboring rays are more likely to find the data they 
need within an already referenced page and throughout the 
lower levels of the memory hierarchy. In Section 6.3 we analyze
the effectiveness of this optimization.
6. Results
In this section, we benchmark interactive rendering sessions 
under varying conditions to analyze the performance bene­
fits of the three techniques we have applied. All scenes have 
a single light source and include hard shadows. For each test, 
the images were rendered at a resolution of 512x512 pix­
els and were divided into 16x16 pixel tiles, except where 
noted. Our test machine is a 32 node cluster consisting of 
dual 1.7 GHz Xeon PCs with 1 GB of RAM. The nodes 
are connected via switched gigabit ethernet. We run a sin­
gle rendering thread on each node, except where noted. The 
reported node counts do not include the use of a single dis­
play machine.
6.1. Page-Based Distributed Shared Memory Analysis
The first test compares the hit and miss times for the ODSM 
and PDSM layers that our ray tracer employs when render­
ing large datasets. Table 1 gives the measured hit and miss 
penalties for object- and page-based DSMs recorded in a 
random access test. In the test, one million 16 KB blocks 
are chosen at random from a 128 MB shared memory space. 
The access times have been recorded using the gettimeofday 
system call. From this table, it is clear that the hit time of the 
PDSM system is substantially lower than that of the ODSM 
system. If the renderer is able to maintain high hit rates, the 
PDSM layer results in higher frame rates.
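The dependence on hit rate can be made concrete with the usual weighted average of hit and miss costs. The sketch below is illustrative: the function name `mean_access_us` is hypothetical, and the 95% hit rate used in the example is borrowed from the earlier system described in Section 3 rather than measured in this test.

```cpp
#include <cassert>
#include <cmath>

// Mean cost of one shared-memory access, in the same unit as the inputs:
// hit_rate * hit_us + (1 - hit_rate) * miss_us.
double mean_access_us(double hit_rate, double hit_us, double miss_us) {
    return hit_rate * hit_us + (1.0 - hit_rate) * miss_us;
}
```

With the penalties in Table 1 and a 95% hit rate, this gives about 41.1 µs per access for the ODSM and about 36.3 µs for the PDSM. As the hit rate approaches 100%, the miss terms vanish and the ratio approaches that of the hit times alone.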
Next we compare the performance of the ODSM and 
PDSM layers in the ray tracer. We render isosurfaces of a 
512 MB scalar volume created from a computed tomogra­
phy scan of a child’s toy. In the test, we replay a recorded 
session in which the viewpoint and isosurface change, causing
the working set to vary. Extra work is required to obtain
the rendered data when using the DSM systems because we
restrict the DSM layers to store only 81 MB on each node.

                    Hit Time   Miss Time
Object-based DSM    10.2       629
Page-based DSM      4.97       632

Table 1: DSM Access Penalties. Average access penalties,
in µs, over 1 million random addresses to a 128 MB address
space on five nodes.
Figure 5 shows the recorded frame rates from the test and 
a sampling of rendered frames. The test is started with a cold 
cache. In the first half of the test, the entire volume is in 
view, while in the second, only a small portion of the dataset 
is visible. Both DSM layers struggle to keep the caches full 
during the first part of the test. However, the lower hit time 
of the PDSM allows it to outperform the ODSM throughout. 
In later frames most memory accesses hit in the cache, so the 
PDSM adds little overhead to data replication. Overall, the 
average frame rates for this test are 3.74 fps with replication, 
3.06 with the PDSM and 1.22 with the ODSM.
6.2. Decentralized Load Balancing Analysis
We now analyze the extent to which decentralized load bal­
ancing improves performance. For this test, we rendered two 
of Stanford University’s widely available PLY format mod­
els. In particular, we report results using the models shown 
in Figure 6. Details of these models and the run-time data 
structures are given in Table 2.
In these tests, we place the geometry data and a large, 
highly efficient acceleration structure in PDSM space. By 
varying the local cache size, we can analyze how both load 
balancing algorithms impact the performance of the dis­
tributed memory system.
Figure 7 shows the results. As memory becomes re­
stricted, the work stealing scheme maintains interactivity 
better because it is able to reuse cached data more often 
and yields fewer misses. We note, however, when memory 
is plentiful, either approach works well.
A decentralized scheme also eliminates a synchronization 
bottleneck at the supervisor that is amplified by the network 
transmission delay. Unless frameless rendering is used, a 
frame cannot be completed until all image tiles have been 
assigned and returned. Asynchronous task assignment can 
hide the problem, but as processors are added, message start­
up costs will determine the minimum rendering time. In this 
case, the rendering time is at least the product of the mes­
sage latency and twice the number of task assignments in a 
frame.
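This bound is easy to evaluate for a given configuration. In the sketch below the function names `tiles_per_frame` and `min_frame_time_s` are hypothetical, and the 50 µs message latency in the example is an assumed figure chosen only for illustration, not a measurement from our cluster.

```cpp
#include <cassert>
#include <cmath>

// Number of square tiles needed to cover a square image (sizes in pixels).
int tiles_per_frame(int image_px, int tile_px) {
    int per_side = (image_px + tile_px - 1) / tile_px;
    return per_side * per_side;
}

// Lower bound on frame time when every task passes through the supervisor:
// one message to assign the task and one to return it.
double min_frame_time_s(double latency_s, int tasks) {
    return latency_s * 2.0 * tasks;
}
```

A 512x512 image yields 1024 tasks with 16x16 pixel tiles and 4096 tasks with 8x8 pixel tiles. At the assumed 50 µs latency, the latter would bound the frame time below by roughly 0.41 s, or about 2.4 frames per second, regardless of how many workers are added.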
Figure 6: The Stanford Bunny and Dragon PLY Models.
[Plot: misses per worker (legend includes "DD misses/worker") versus local memory [MB], 50 to 250.]
Figure 7: Effect of Task Reuse with Limited Local Memory.
Each node exhibits fewer misses when reusing previous
tasks. As a result, work stealing improves frame rates when
the local memory is limited.
On a switch-based interconnect, a decentralized task assignment
scheme takes advantage of the fact that nodes B
and C can communicate at the same time as nodes D and E.
Work stealing eliminates all task assignment messages from
the supervisor and allows workers to assign tasks independently.
When the system is network bound, this approach
can potentially increase the frame rate by a factor of two.
To demonstrate, we render a small sphereflake scene consisting
of only 827 primitives. To emphasize the effect of 
work stealing on the supervisor’s communication time, the 
test uses 8x8 pixel tiles and two rendering threads per node.
Model   Vertices  Triangles  Prim. Size  Grid    Total   Sorted     Prim. Size  Grid    Total
                             [MB]        [MB]    [MB]    Triangles  [MB]        [MB]    [MB]
Bunny   35947     69451      2.138       9.653   11.79   324635     7.978       8.415   16.39
Dragon  437645    871304     26.62       178.0   204.6   3724385    91.92       163.8   255.7
Total   602045    1202281    28.76       187.7   216.4   7545220    99.90       172.2   272.1
Table 2: Characteristics of the Stanford PLY models, the preprocessed geometry data, and the acceleration structure.
Table 3 reports the measured time the supervisor spends 
communicating, as well as the resulting frame rate. Note that 
as the number of workers grows, the supervisor's communi­
cation time remains constant with work stealing.
Table 3: Supervisor Communication Time. In the decentralized
approach, the supervisor’s communication time remains
constant as the number of worker nodes increases. Communication
times are given in s/f, frame rates in f/s.
6.3. Address Sorting Analysis
Our last test examines how sorting spatially local primitives
in address space affects performance. In this test, we use the
same PLY models as before, but we now render the data after
preprocessing with the sorting program described in Section 5.
Figure 8 shows that the effect of address sorting is similar
to that of decentralized load balancing. Specifically, when
the local memory of each node is small compared to the
total data size, sorting geometry decreases the number of
misses enough to increase the frame rates. However, when
the memory size is large enough to contain the working set,
sorting does not yield improved performance. The acceleration
structure used for sorting is based on a uniform grid, and
as we traverse the structure, triangles that cross cell boundaries
are duplicated. It is possible that using a different acceleration
structure would reduce this increase in data size.
We have also experimented with dereferencing vertex
pointers when sorting the geometry. The sorting process is
the same as described earlier, except that vertex pointers are
dereferenced and the vertex data is stored with each sorted
triangle. Data may be duplicated many times because vertices
are often shared by several triangles. The data bloat
resulting from this process is substantially higher, and, in
general, the achievable frame rates are lower still.
[Plot; horizontal axis: local memory [MB].]
Figure 8: Effect of Address Sorting with Limited Local Memory.
Each node exhibits fewer misses when memory is limited
because, with address sorting, spatially local primitives exhibit
better memory coherence.
7. Conclusion and Future Work
We have found that by utilizing virtual memory hardware 
and associated operating system services to manage a shared 
memory space, large datasets can be rendered more quickly 
and easily than with a software-only solution. There are sig­
nificantly lower memory access penalties, and the program­
ming task of using the shared memory is also reduced. These 
benefits make it possible to render large amounts of almost 
any type of data.
Complementary to the PDSM is a distributed load bal­
ancing mechanism that improves cache hit rates and helps 
overcome the network transmission latency barrier. Hit rates 
can be improved further by sorting the scene data in address 
space so that spatially local data exhibits improved memory 
coherence. Both techniques are most useful when memory 
available on each node is limited. Higher quality decentral­
ized load balancing heuristics and improved sorting algo­
rithms that reduce data replication are left as future work.
All three optimizations should be valuable in the context 
of 64-bit clusters, where the virtual address space will likely 
be substantially larger than the physical memory of any one 
node. Techniques like ours will enable interactive visualiza­
tion of very large datasets with these clusters. We plan to test 
our implementation on a 64-bit cluster in the near future.
A disadvantage of the user-space PDSM is that it is difficult
to create a page-based memory that is usable by multiple
rendering threads. A race condition exists whenever the
communication thread fills a received page of data. Rendering
threads must be prevented from accessing the invalid
page during this time. To overcome this limitation, 
we have begun experiments with asynchronous signaling to 
temporarily suspend all rendering threads during page han­
dling. When threads are stalled, however, efficiency drops. 
More critically, all rendering code must be asynchronous 
signal-safe for this approach to work. Preliminary testing is 
currently underway.
Similar drops in efficiency result because render threads 
are suspended while waiting for previously unmapped pages 
to arrive. Rescheduling rays that cause segmentation faults 
and allowing the render thread to trace other rays may eliminate
this problem. PDSM performance may also be improved
with a better page replacement policy. Finally, a read-write
PDSM implementation will be required for rendering 
most dynamic scenes.
Two other potential targets for optimization are the cen­
tralized result gathering phase of the computation and the 
size of data transfers. Decentralized task assignment greatly 
reduces the number of messages the supervisor must handle, 
but we have not yet examined the delay incurred because 
each rendered tile must be returned through the same bot­
tleneck. In addition, results by Wald et al. [WSB01] have 
shown that on-the-fly compression for data transmitted over 
the network can reduce access penalties. We would like to 
investigate similar techniques in this page-based system.
Acknowledgments
This work has been sponsored in part by the National Sci­
ence Foundation under grants 9977218 and 9978099, by 
DOE VIEWS and by NIH grants. The authors thank Anthony
Davis from HyTec, Inc. and Bill Ward and Patrick McCormick
at Los Alamos National Labs for the furby dataset.
References
[BBP94] Badouel D., Bouatouch K., Priol T.: Distributing data and control for ray tracing in parallel. IEEE Computer Graphics and Applications 14, 4 (1994), 69-77.
[CDR02] Chalmers A., Davis T., Reinhard E.: Practical Parallel Rendering. AK Peters Publishing, Natick, Massachusetts, 2002.
[CE97] Cox M., Ellsworth D.: Application-controlled demand paging for out-of-core visualization. In Proceedings of IEEE Visualization (1997), pp. 235-244.
[CKK95] Carter J. B., Khandekar D., Kamb L.: Distributed shared memory: Where we are and where we should be headed. In Fifth Workshop on Hot Topics in Operating Systems (HotOS-V) (1995), pp. 119-122.
[DPH*03] DeMarle D. E., Parker S., Hartner M., Gribble C., Hansen C.: Distributed interactive ray tracing for large volume visualization. In IEEE Symposium on Parallel and Large-Data Visualization and Graphics (Oct. 2003), pp. 87-94.
[HA98] Heirich A., Arvo J.: A competitive analysis of load balancing strategies for parallel ray tracing. The Journal of Supercomputing 12, 1-2 (1998), 57-68.
[LKL97] Liang W.-Y., King C.-T., Lai F.: Adsmith: An object-based distributed shared memory system for networks of workstations. IEICE Transactions on Information and Systems E80-D, 9 (1997), 899-908.
[PKGH97] Pharr M., Kolb C., Gershbein R., Hanrahan P.: Rendering complex scenes with memory-coherent ray tracing. Computer Graphics 31, Annual Conference Series (1997), 101-108.
[PSL*98] Parker S., Shirley P., Livnat Y., Hansen C., Sloan P.-P.: Interactive ray tracing for isosurface rendering. In Proceedings of IEEE Visualization (Oct. 1998), pp. 233-238.
[RCJ99] Reinhard E., Chalmers A., Jansen F. W.: Hybrid scheduling for parallel rendering using coherent ray tasks. In IEEE Symposium on Parallel Visualization and Graphics (1999), ACM Press, pp. 21-28.
[RSAU91] Rudolph L., Slivkin-Allalouf M., Upfal E.: A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures (1991), ACM Press, pp. 237-245.
[WSB01] Wald I., Slusallek P., Benthin C.: Interactive distributed ray tracing of highly complex models. In 12th Eurographics Workshop on Rendering (June 2001), pp. 277-288.
Figure 4: Improving Coherence via Data Reorganization. In (a), a standard rendering of the happy buddha PLY model. In (b),
the input mesh with each node’s triangles shown in a different hue. In (c), the reorganized mesh in which neighboring triangles
are more likely to reside in the same page of memory.
[Plot: frame rate versus frame number, with sample images from frames 2, 222, 460, 510, and 710.]
Figure 5: Comparing Memory Organization. Frame rates are above and images from the test are below. The page-based DSM
outperforms the object-based DSM in all cases. Moreover, its performance is competitive with full data replication, even though
the local memory size is reduced to 16% of the total.
