Surface Processing on Graphics Hardware by Carr, Nathan Aaron
c© 2004 by Nathan Aaron Carr. All rights reserved.
SURFACE PROCESSING ON GRAPHICS HARDWARE
BY
NATHAN AARON CARR
B.S., College of William & Mary, 1996
M.S., Washington State University, 2000
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the





This thesis would never have been possible without the support of many people. Thanks to my
advisor John Hart who has stuck with me through six long years of study and research. He has
provided great technical inspiration, but more importantly friendship; reminding me to work hard,
but also to have plenty of fun along the way. Let the mountain biking trails, ski slopes, and graphics
research always stay open for another good run together.
Thanks to my office-mates and research partners: Jesse Hall and Patrick Lacz. Both have
provided me endless amounts of help, guidance, and friendship. Thanks to Nvidia for their generous
support, providing me the freedom to pursue my own ideas, and from this I have grown and gone
far.
This thesis is a testament to my wonderful parents and siblings. Thanks to my father who taught
me to love science and all its explorations. If it where not for his example and encouragement, I
would not have come half as far. May I someday catch up to his technical greatness. Special thanks
to my mother for her belief that I could achieve all this and more. Lastly, thanks to my wife for her
wonderful patience and support throughout my years of study. Life for two has made this journey
all the more wonderful.
iv
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2 General Computation on the GPU . . . . . . . . . . . . . . . . . . . . . 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Example: The Ray Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Ray Tracing with the GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.4 Ray-Triangle Intersection on the GPU . . . . . . . . . . . . . . . . . . . . . . 13
2.2.5 Ray Engine Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.6 The Ray Engine Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.8 Analysis and Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.9 Ray Engine Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Example: GPU Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 Modern GPU Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3 Cache Aware Matrix Multiply . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Multi-Channel GPU Matrix Multiplies . . . . . . . . . . . . . . . . . . . . . . 34
2.3.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 GPU Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 3 Texture Atlas Background . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Chart Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Goals and Metrics for Cutting . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Single Chart Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.3 Multi-Chart Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.4 Normal Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.5 Agglomerative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
v
3.2.6 Greedy Region Growing Methods . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Chart Parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Quadratic Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.2 Non-linear Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Chart Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Metrics and Goals for Texture Atlases . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.1 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.2 Sample Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.3 Distortion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.4 Discontinuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 4 The Multi-Resolution Meshed Atlas (MMA) . . . . . . . . . . . . . . 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 MIP Mapping a Texture Atlas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Chart Generation using Recursive Bisection . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Mapping Triangles into Texture Space . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Chapter 5 Dynamic Texture Atlas Management . . . . . . . . . . . . . . . . . . . 69
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Bloom Filter Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.2 Node Imbalance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.3 Cluster Perimeter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.4 Bloom Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.5 Metric Minimization via Hierarchy Perturbation . . . . . . . . . . . . . . . . 72
5.2.6 Tree Balancing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.7 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.8 Bloom Filter Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Adaptive k-means Style Clustering on a Surface . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.3 Multiple Heaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.4 Border Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.5 Cluster Boundary Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.6 Chart Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.7 Chart Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3.8 k-means MMA Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 6 GPU Surface Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Subsurface Scattering on the GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.3 Subsurface Scattering on the GPU . . . . . . . . . . . . . . . . . . . . . . . . 98
vi
6.2.4 Real-Time Subsurface Approximation Algorithm . . . . . . . . . . . . . . . . 99
6.2.5 A Three Pass GPU Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.6 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2.7 Subsurface Scattering Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3 3D Painting on the GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3.3 The GPU Painting Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Chapter 7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
vii
List of Tables
2.1 Rays per second across a variety of scenes and applications. . . . . . . . . . . . . . . 21
2.2 Percentage of rays sent to the GPU across a variety of scenes and applications. . . . 24
2.3 Speedup by using the GPU to render the teapot room. . . . . . . . . . . . . . . . . . 25
2.4 Tuning the ray engine by varying the range of triangles T sent to the GPU, measured
on the teapot room. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Tuning the number of rays R sent to the GPU for intersection. . . . . . . . . . . . . 26
2.6 Bytes transferred internally by each GPU matrix multiplication method. . . . . . . . 39
2.7 Floating point operations required by each GPU matrix multiplication method. . . . 40
5.1 Bit loss rate at leaf nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Cluster perimeter edge lengths ei as a function of cluster size and shared edge per-
centage c. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Perimeter Loss Rates, c = 1/4, 1 bit per edge. . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Perimeter Loss Rates, c = 1/4, 2 bits per edge. . . . . . . . . . . . . . . . . . . . . . 79
6.1 Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 Subsurface scattering performance on a GeForce FX 5900. . . . . . . . . . . . . . . . 106
6.3 GPU painting performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
viii
List of Figures
1.1 Surface Processing Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.1 Environment map versus ray-traced reflections . . . . . . . . . . . . . . . . . . . . . 7
2.2 Ray intersection crossbar. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Programmable pixel shading is a crossbar. . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Leaky teapot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Ray engine organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Ray Engine result images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 GPU ray-triangle intersection test performance. . . . . . . . . . . . . . . . . . . . . . 23
2.8 The Larson-McAllister multipass algorithm for multiplying two matrices. . . . . . . 28
3.1 Packed Atlas Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Strip packing of 7,232 undistorted scalene triangles into a texture atlas. . . . . . . . 47
3.3 L2 stretch metric image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 Rasterization rules for texture atlases . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Aligning texture coordinates for rasterization. . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Bilinear filtering support and texture coordinate layouts. . . . . . . . . . . . . . . . . 58
4.1 Supporting mip-mapping with discontinuity in texture space. . . . . . . . . . . . . . 61
4.2 Moon with MMA atlas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 MMA mapped moon with mip-map levels . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Balanced disjoint partition of vertices minimizing interconnecting edges. . . . . . . . 64
4.5 Weighted edge collapses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6 Recursive partitioning of 2-D texture space (top) and head mesh data set (bottom). 65
4.7 Base case square atlas partitions for mesh components of two, three or four triangles. 66
4.8 Cutting a 3-ring into a 3-fan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 Perturbations for rebalancing the MMA tree. . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Algorithm to rebalance the tree after the metric of a single node has changed. . . . . 74
5.3 Outward region growing algorithm implemented using a single heap. . . . . . . . . . 84
5.4 Optimization for the outward region growing process. Nested heap structure used
to reduce search times. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5 Chart clustering before (a) and after (b) triangle corner reduction. . . . . . . . . . . 87
5.6 Tree formed on the bunny model using recursive bisection with inertial partitioning
at the top levels of the tree, and agglomerative approach for constructing lower
subtrees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.7 k-means Clustering Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.8 Dynamic clustering of Buddha and bunny model . . . . . . . . . . . . . . . . . . . . 92
ix
5.9 k-means surface clustering performance . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.1 Subsurface scattered Buddha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 High-level GPU subsurface scattering process. . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Three pass process for rendering GPU subsurface scattering effects. . . . . . . . . . . 101
6.4 Subsurface scattering result images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5 Head rendered with subsurface scattering at 40 fps Nvidia GeForceFX 5900. . . . . 106
6.6 Static versus dynamic parametrization for 3D paint. . . . . . . . . . . . . . . . . . . 107
6.7 GPU painting process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.8 Painted bunny. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114




(a) (b) (c) (d)
Figure 1.1: Surface Processing on the GPU: (a) Procedural Shading (b) Ray-Tracing (c) Surface
Painting (d) Real-Time Subsurface Scattering
1.1 Introduction
Computer graphics has predominantly focused on the appearance of objects. Many objects derive
their visual appearance from the features on their surface and not from their internal structures.
This allows many computer graphics applications to ignore the volumetric aspects of the world and
work directly with surface representations such as meshes, spline patches, and implicit surfaces. As
a result, a significant portion of the computation in computer graphics takes place directly on the
surfaces of objects.
The importance of surface computation in computer graphics cannot be understated. Light
transport simulation, radiosity, both global and local illumination, subsurface scattering, and pro-
cedural shading all require a predominant amount of computation to be done on surfaces. Figure
1.1 shows three applications in computer graphics that require efficient storage and processing of
surface information.
1
The search for efficient surface representations has a long history and is an on-going area of
research. Given a surface, the goal of finding an efficient representation is not restricted solely to
its geometric aspects. Surface signals such as color, reflectance functions, normals, displacements,
and coefficients to approximating functions must also be efficiently expressed in the representation.
One of the more popular methods for representing surfaces and their signals is with a polygonal
mesh. Any number of surface signals may be sampled and encoded as additional attributes stored
at the vertices of the mesh. The disadvantage of this method is that the sample distribution and
density is fixed over all signal domains. This is not ideal since the signal frequencies vary between
domains and require different sampling rates for accurate reconstruction.
To alleviate this problem, surface information is commonly stored as 2D images called textures.
The regularity of 2D textures provides nice domain for storing surface information. This is partly
due to the fact that a large body of literature and tools exists for the processing, filtering, and
compression of 2D images. Furthermore, efficient memory layouts exist for 2D textures allowing
for fast access and filtering during rendering. For these reasons, the use of textures as a means of
storing surface information is nearly ubiquitous in the graphics community, having support in both
consumer level graphics hardware, and in non-realtime production rendering systems.
In order to use textures to store surface information a mapping from surface positions to image
locations must be defined. Such correspondences between texture image locations and surface
positions are defined in a texture map or texture atlas. The goal of automatically generating ideal
texture atlases for mesh surfaces is an active area of research. This is primarily due to the fact
that the flattening of surfaces in most cases leads to imperfect problem, providing several tradeoffs
between a number of ends.
What is relatively new to the graphics community is the notion of the textures as an ideal
domain for performing computation on surfaces. This is in large part due to the advancement
of programmable graphics hardware. Graphics cards are designed to perform computation on 2D
images allowing them to naturally perform surface computations directly in the texture domain.
The true difficulty in texture mapping has not been in the storage and representation of 2D images,
but rather in finding good mappings from mesh surfaces into the 2D image domain.
The primary goal of this research is to develop a general paradigm for performing computation
2
on surfaces in graphics hardware. Investigation is made into the limitations of existing surface rep-
resentation methods and new methods are proposed. The primary goals in developing a new surface
representation scheme are two fold: first, is to explore the concept of dynamic parametrization; to
develop a method that allows for re-parametrization to occur at real-time rates in the presence
of a dynamic environment. This provides a domain that can rapidly adapt and store information
on dynamically changing surfaces with dynamically changing surface signals. Secondly, to develop
a method for representing surface information that is highly amenable to the modern graphics
hardware enabling the rapid computation and display of surface information.
This opens up many new possibilities for a wide range of existing computer graphics applications.
A side part to this research is to investigate new applications for which GPU surface processing
may be applied.
1.2 Overview
In chapter 2, the GPU is investigated as a viable platform for general computation. Sections 2.2
and 2.3, cover our investigations into GPU processing: a global illumination engine, and dense
matrix multiplication.
Chapter 3 details the state of the art for texture atlas generation and surveys the current
literature. A new method for automatically generating texture atlases on surfaces is presented in
Chapter 4. This texture atlas method is further improved upon in Chapter 5 to be updatable at
near real-time rates
Chapter 6 covers two applications made possible the texture atlas and fast GPU surface com-
putation. The first a method for performing real-time subsurface scattering using the texture atlas
and is presented in section 6.2. This second is is a GPU based 3D painting system that can re-
distribute paint detail dynamically and is discussed in section 6.3). Finally conclusions, future
directions for research, and final remarks are made in Chapter 7.
1.3 Contributions
Below lists the contributions made in this thesis to this body of research:
3
• GPU as a computation unit:
Ray tracing on the GPU [14].
Matrix Multiplication on the GPU [41].
• Surface Representation:
The multi-resolution mip-mappable texture atlas (MMA) [16].
Fast dynamic re-parametrization of the MMA atlas [13].
• Surface Processing Applications:
Real-time subsurface scattering on the GPU[15].
GPU based 3D painting[13].
4
Chapter 2
General Computation on the GPU
2.1 Introduction
The GPU (graphics processing unit) has become a significant piece of hardware in modern day
consumer level computers. Recent history has shown an incredible growth in the processing power,
memory, and programmability of modern GPU hardware. Parallelism inherent to the graphics
pipeline has enabled its performance growth to far outpace that of the CPU. It has been projected
that this trend will continue for the foreseeable future as fuelled by market desire for better and
faster computer imagery. What may seam less obvious, is that the GPU is rapidly becoming a
massively parallel vector processing engine for the masses.
In spite of the great potential for graphics hardware, its use has predominantly been restricted
to the efficient display of images. The growing significance of the GPU in terms of performance
and functionality suggests that it can and should be used for a much broader array of tasks. To
date, however, graphics hardware remains to be an under-utilized resource in the modern personal
computing environment.
The lack of adoption of graphics hardware into every day computing lies in the following two
areas: programmability and functionality. Programmability refers to the ease at which developers
may access and utilize the GPU to solve common day problems. Functionality directly addresses
the features in modern day GPU implementations that prohibit such algorithms from efficiently
running on graphics hardware. It is important to note that GPU and CPU architectures are
fundamentally different and as a result they are tuned to work best on different types of problems.
Important research is to explore what types of algorithms work well in graphics hardware as well
5
as which algorithms can be modified to work well on the GPU.
The development of advanced compiler technology such as Nvidia’s Cg [29] and Microsoft’s
HLSL[20] have made great strides reducing the programmatic complexity of accessing the power of
the GPU. Even in the presence of the most advanced compiler technology, the problem of finding
a good GPU mapping for more general classes of algorithms remains a significant challenge. Often
times it involves complete redesign of the algorithm itself.
The past two years has seen a marked change in the functionality of graphics hardware. For
example, the GPU has moved from being a highly configurable device to a fully programmable unit.
Precision on graphics hardware has increased to a 32-bit floating point representation, allowing for
numerical computation to be performed. Recent developments in GPU functionality however, do
not fully address the needs required to achieve widespread adoption. Graphics hardware must
change to meet this need while not comprising the pipelining an parallelism that it relies on for its
success. More importantly research must be done to explore what types of algorithms work well
in graphics hardware – and paradigms must be developed that ease the mapping of more general
problems into the accelerated GPU environment.
To asses the viability of GPU processing we explored two applications: ray-tracing, and dense
matrix multiplication. These explorations are covered in the following sections 2.2, and 2.3.
6
2.2 Example: The Ray Engine 1 (Beyond Local Illumination)
2.2.1 Introduction
Hardware-accelerated rasterization has made great strides in simulating global illumination ef-
fects, such as shadows[116, 91, 28], reflection[10], multiple-bounce reflection[23], refraction[47],
caustics[108] and even radiosity[58]. Nonetheless some global illumination effects have eluded ras-
terization solutions, and may continue to do so indefinitely. The environment map provides polygon
rasterization with limited global illumination capabilities by approximating the irradiance of all
points on an object surface with the irradiance at a single point[10]. This single-point irradiance
approximation can result in some visually obvious errors, such as the boat in wavy water shown in
Figure 2.1.
Figure 2.1: What is wrong with this environment-mapped picture? (1) The boat does not meet its
reflection, (2) the boat is reflected in the water behind it, and (3) some aliasing can be seen in the
reflection.
Ray tracing of course simulates all of these effects and more. It can provide true reflection
and refraction, complete with local and multiple bounces. Complex camera models with compound
lenses are easier to simulate using ray tracing[60]. Numerous global illumination methods are based
on ray tracing including path tracing[53], Monte-Carlo ray tracing[113] and photon mapping[50].
Ray tracing is classically one of the most time consuming operations on the CPU, and the
graphics community has been eager to accelerate it using whatever methods possible. Hardware-
based accelerations have included CPU-specific tuning, distribution across parallel processors and
1Contains work previously published in Graphics Hardware 2002[14].
7
even construction of special purpose hardware.
Graphics cards have recently included support for programmable shading in an effort to increase
the realism of their rasterization-based renderers[66]. This added flexibility is transforming the
already fast graphics processing unit (GPU) into a supercomputing coprocessor, and its power is
being applied to a wider variety of applications than its developers originally intended.
One such application is ray tracing. Section 2.2 shows how to configure the graphics processing
unit (GPU) to compute ray-triangle intersections, and Section 2.2.4 details an implementation. This
GPU ray-triangle intersection reconfigures the graphics accelerator into a ray engine, described in
Section 2.2.5, that hides the details of its back-end GPU ray-triangle intersection, allowing the ray
engine to be more easily integrated into existing rendering software systems.
The ray engine can make existing rasterization-based renderers look better. A rasterization
renderer augmented with the ray engine could trace the rays necessary to achieve effects currently
impossible with rasterization-only rendering, including local reflections (Figure 1.1), true refractions
and sub-surface scattering[52].
The ray engine is also designed to be efficiently integrated into existing ray-tracing applica-
tions. The ray engine performs best when intersecting caches of coherent rays[83] from host-based
rendering tasks. This is a form of load balancing that allows the GPU to do what it does best
(perform the same computation on arrays of data), and lets the CPU do what the GPU does worst
(reorganize the data into efficient structures whose processing requires flow control). A simple ray
tracing system we built using the ray engine is already running at speeds comparable to the fastest
known ray tracer, which was carefully tuned to a specific CPU[112]. The ray engine could like-
wise accelerate Monte Carlo ray tracing, photon mapping, form factor computation and visibility
preprocessing.
2.2.2 Related Work
Although classic ray tracing systems support a wide variety of geometric primitives, some recent ray
tracers designed to achieve interactive rates (including ours) have limited themselves to triangles.
This has not been a severe limitation as geometric models can be tessellated, and the simplicity of
the ray-triangle intersection has led to highly efficient implementations [6, 75].
8
Hardware z-buffered rasterization can quickly determine the visibility of triangles. One early
hardware optimization for ray tracing was the first-hit speedup, which replaced eye-ray intersections
with a z-buffered rasterization of the scene using object ID as the color [114]. Eye rays are a special
case of a coherent bundle of rays. Such rays can likewise be efficiently intersected through z-buffered
rasterization for hardware accelerated global illumination [107], of which ray tracing is a subset.
One obvious hardware acceleration of ray tracing is to optimize its implementation for a specific
CPU. The current fastest CPU implementation we are aware of is a coherent ray tracer tuned for
the Intel Pentium III processor [112]. This ray tracer capitalized on a variety of spatial, ray
and memory coherencies to best utilize CPU optimizations such as caching, branch prediction,
instruction reordering, speculative execution and SSE instructions. Their implementation ran at
an average of 30 million intersections per second on an 800 Mhz Pentium III. They were able to
trace between 200K and 1.5M rays per second, which was over ten times faster than POV-Ray and
Rayshade.
There have been a large number of implementations of ray tracers on MIMD computers [92].
These implementations focus on issues of load balancing and memory utilization. One recent
implementation on 60 processors of an SGI Origin 2000 was able to render at 5122 resolution
scenes of from 20 to 2K patches at rates ranging from two to 20 Hz [78].
Special purpose hardware has also been designed for ray tracing. The AR350 is a production
graphics accelerator designed for the off-line (non-real-time) rendering of scenes with sophisticated
Renderman shaders [40]. A ray tracing system designed around multiprocessors with smart memory
is also in progress [87].
Our ray engine is similar in spirit to another GPU-based ray tracing implementation that
simulates a state machine [88]. This state-based approach breaks ray tracing down into several
states, including grid traversal, ray-triangle intersection and shading. This approach performs the
entire ray tracing algorithm on the GPU, avoiding the slow readback process required for GPU-
CPU communication that our approach must deal with. The state-based method however is not
particularly efficient on present GPU’s due to the lack of control flow in the fragment program,
resulting in a large portion of pixels (from 90% to 99%) remaining idle if they are in a different state
than the one currently being executed. Our approach has been designed to organize ray tracing to
9
achieve full utilization of the GPU.
2.2.3 Ray Tracing with the GPU
Ray Casting
The core of any ray tracer is the intersection of rays with geometry. Rays are represented paramet-
rically as r(t) = o+ td where o is the ray origin, d is the ray direction and t ≥ 0 is a real parameter
corresponding to points along the ray. The classic implementation of recursive ray tracing casts
each ray individually and intersects it against the scene geometry. This process generates a list of
parameters ti corresponding to points of intersection with the scene’s geometric primitives. The
least positive element of this list is returned as the first intersection, the one closest to the ray
origin.
Figure 2.2: Ray intersection is a crossbar.
Figure 2.2(a) illustrates ray casting as a crossbar. This illustration represents the rays with
horizontal lines and the (unorganized) geometric primitives (e.g. triangles) with vertical lines. The
crossing points of the horizontal and vertical lines represent intersection tests between rays and
triangles.
This crossbar represents an all-pairs check of every ray against every triangle. Since their in-
ception, ray tracers have avoided checking every triangle against every primitive through the use of
spatial coherent data structures on both the rays and the geometry. These data structures reorga-
nize the crossbar into a sparse overlapping block structure, as shown in Figure 2.2(b). Nevertheless
10
Online Submission ID: 0092
Figure 3: Programmable pixel shading is a crossbar.
vanced shading [Lindholm et al. 2001]. These programmable el-
ements can be separated into two components, the vertex shader
and the pixel shader, as shown in Figure 3(b).
The vertex shader replaces the graphics pipeline with a user-
programmable stream processor. This stream processor cannot
change the number of vertices passing through it, but it can change
the vertex attributes, including position, color and texture coordi-
nates.
The pixel shader generalizes the per-pixel access and application
of texture. The pixel shader can perform arithmetic operations on
the texture coordinates before they index into the texture, and can
then perform additional arithmetic operations with the fetched tex-
ture result. In a single pass, the pixel shader computes each pixel in
isolation, and cannot access data stored at other pixels in the frame-
buffer.
The speed of modern graphics accelerators is indicated by vertex
rate, which measures the vertical bandwidth of Figure 3, and its
pixel rate, which measures the horizontal bandwidth. The pixel
rate is an order of magnitude faster than the vertex rate on modern
graphics cards.
3.3 Mapping Ray Casting to Programmable Shading
Hardware
We map the ray casting crossbar in Figure 2 to the rasterization
crossbar in Figure 3 by distributing the rays across the pixels and
broadcasting a stream of triangles to each pixel by sending their
coordinates down the geometry pipeline as the vertex attribute data
(e.g. color, texture coordinates) of screen filling quadrilaterals.
The rays are stored in two screen-resolution textures. The color
of each pixel of the ray-origins texture stores the coordinates of
the origin of the ray. The color of each pixel of the ray-directions
texture stores the coordinates of the ray direction vector.
An identical copy of the triangle data is stored at each vertex
of a screen-filling quadrilateral. Rasterization of this quadrilateral
interpolates these attributes at each pixel of its screen projection.
Since the attributes are identical at all four vertices, interpolation
simply distributes a copy of the triangle data to each pixel.
A pixel shader performs the ray-triangle intersection computa-
tion by merging the ray data stored per-pixel in the texture maps
with the triangle data distributed per-pixel by the interpolation of
the attribute data stored at the vertices of the quadrilateral. The
specifics of this implementation will be described further in Sec-
tion 4
3.4 Discussion
The decision to store rays in texture and triangles as vertex at-
tributes was based partially on precision. The geometry pipeline
supports full-precision 32-bit floating point values whereas the tex-
ture pipeline is restricted to 8-bit clamped fixed-point values. Rays
can be specified with five real values whereas triangles require nine.
We found it easier and more accurate to store the five ray values in-
stead of the nine triangle values in the lower texture precision. We
were also able to use special high-precision texture modes designed
for bump mapping to store the ray origins as 10- and 11-bit values,
and ray directions as 16-bit values.
Vertex shaders perform computations at a higher precision and
range (currently 32-bit floating point) than do pixel shaders (cur-
rently 16-bit fixed point ranging from -8 to 8). We nonetheless de-
cided to perform ray-triangle intersection as a pixel shader instead
of a vertex shader. Vertex shaders do not have direct access to the
rasterization crossbar, so our test implementation of ray-triangle in-
tersection as a vertex shader had to store ray data as constants in the
vertex shader’s local memory. Furthermore the triangle rate is an
order of magnitude slower than the pixel rate on modern graphics
accelerators. The vertex shader was able to compute 4.1M ray-
triangle intersections per second, which is much less than what the
pixel shader (or the CPU) is capable of performing.
Since the GPU can be viewed as a SIMD processor [Peercy et al.
2000], the decision to distribite ray data as pixels and broadcast the
geometry was also influenced by other SIMD ray tracing imple-
mentations. SIMD ray tracers have the choice of rays or geometry
distribution. Ray distribution stores the ray data locally per proces-
sor and broadcasts the geometry simultaneously to all processors
whereas geometry distribution stores the triangle data locally and
broadcasts the ray data.
The AR350 ray tracing hardware utilized a fine-grain ray dis-
tribution to isolated processors [Hall 2001]. This distribution im-
proved load balancing, but inhibited the possible advantages of ray
coherence. The geometry was broadcast from the host similar to
standard graphics cards.
The coherent ray tracer [Wald et al. 2001b] also distributed rays
at its lowest level. It intersected each triangle with four coherent
rays using the SIMD instructions available on the CPU. An axis-
aligned BSP-tree provided spatial coherence of the triangle data,
but required special implemention to efficiently support parsing by
the four ray-parallel processes.
One counterexample worth noting is a distributed-memory paral-
lel ray tracer that used a geometry distribution to handle the special
problems of ray tracing large scene databases [Wald et al. 2001a].
One final benefit of ray distribution, as the next section will show,
is that it allows us to use the z-buffer to efficiently maintain the
parametric distance to the first triangle intersected by each ray.
4 Ray-Triangle Intersection on the GPU
The pixel shader implementation of ray-triangle intersection treats
the GPU as a SIMD parallel processor [Peercy et al. 2000]. In this
model, the framebuffer is treated as an accumulator data array of 5-
vectors (r,g,b,α,z), and texture maps are used as data arrays for in-
put and variables. Operations on this data array are performed using
image-processing combinations of the textures and the framebuffer.
Pixel shaders are sequences of these image-processing combina-
tion operators. While compilers exist for multipass programming
[Peercy et al. 2000; Proudfoot et al. 2001], the current limitations
of pixel shaders required complete knowledge and control of the
available instructions and registers to implement ray intersection.
4.1 Input
Ray Data. As mentioned in Section 3.3, the GPU component of
the ray engine intersects multiple rays with a single triangle. Every
pixel in the data array corresponds to an individual ray. Our imple-
mentation stores ray data in two textures: a ray-origins texture and
a ray-directions texture. Batches of rays cast from the eyepoint or a
point light source will have a constant color ray-origins texture and
3
Figure 2.3: Program l pixel shading is a crossbar.
the individual blocks are themselves full crossbars that perform an all pairs comparison on their
subset of the rays and geometry.
The result of ray casting is the identification of the geometry (if any) intersected first by each ray.
This result is a series of points in the crossbar, no greater t an one r horizontal line (ray). These
first intersections are shown as black disks in Figure 2.2(c). The other ray-triangle intersections
are indicated with open circles and are ignored in simple ray casting.
Programmable Shading Hardware
Graphics accelerators have been designed to implement a pipeline that converts polygons vertices
from model coordinates to viewport coordinates. Once in viewport coordinates, rasterization fills
the polygon with pixels, interpolating the depth, c lor and texture coordina es in p rspective-
correct fashion. During rasterization, interpolated texture coordinates index into texture memory
to map an image texture onto the polygon.
This rasterization process can also be viewed as a crossbar, as shown in Figure 2.3(a). The verti-
cal lines represent individual polygons passing through the graphics pipeline whereas the horizontal
lines represent the screen pixels.
Consider the case where each pol gon, a quadrilateral, exactly covers all of the screen pixels.
Then rasterization of these polygons performs an all-pairs combination of every pixel with every
polygon.
11
While even early graphics accelerators were programmable through firmware[18], modern graph-
ics accelerators contain user-programmable elements designed specifically for advanced shading[66].
These programmable elements can be separated into two components, the vertex shader and the
pixel shader, as shown in Figure 2.3(b). The vertex shader is a user-programmable stream pro-
cessor that can alter the attributes (but not the number) of vertices sent to the rasterizer. The
pixel shader can perform arithmetic operations on multiple texture coordinates and fetched texture
samples, but does so in isolation and cannot access data stored at any other pixel. Pixel shaders
run about an order of magnitude faster than vertex shaders.
Mapping Ray Casting to Programmable Shading Hardware
We map the ray casting crossbar in Figure 2.2 to the rasterization crossbar in Figure 2.3 by
distributing the rays across the pixels and broadcasting a stream of triangles to each pixel by
sending their coordinates down the geometry pipeline as the vertex attribute data (e.g. color,
texture coordinates) of screen filling quadrilaterals.
The rays are stored in two screen-resolution textures. The color of each pixel of the ray-origins
texture stores the coordinates of the origin of the ray. The color of each pixel of the ray-directions
texture stores the coordinates of the ray direction vector.
An identical copy of the triangle data is stored at each vertex of a screen-filling quadrilateral.
Rasterization of this quadrilateral interpolates these attributes at each pixel of its screen projection.
Since the attributes are identical at all four vertices, interpolation simply distributes a copy of the
triangle data to each pixel.
A pixel shader performs the ray-triangle intersection computation by merging the ray data
stored per-pixel in the texture maps with the triangle data distributed per-pixel by the interpolation
of the attribute data stored at the vertices of the quadrilateral. The specifics of this implementation
will be described further in Section 2.2.4.
Discussion
The decision to store rays in texture and triangles as vertex attributes was based initially on
precision. Since rays can be specified with five real values whereas triangles require nine we found
12
it easier and more accurate to store the ray values at the lower texture precisions.
We also chose to implement ray-triangle intersection as a pixel shader instead of a vertex shader.
Vertex shaders do not have direct access to the rasterization crossbar, and hence needed to store
ray data as constants in the vertex shader’s local memory. The vertex shader is also slower, and
was able to compute 4.1M ray-triangle intersections per second, which is much less than what the
CPU is currently capable of performing.
Viewing the GPU as a SIMD processor[80] allowed us to compare other SIMD ray tracing
implementations. SIMD ray tracers typically distribute rays to the processors and broadcast the
geometry, or distribute geometry and broadcast the rays. The AR350 ray tracing hardware utilized
a fine-grain ray distribution to isolated processors[40], which improved load balancing, but inhibited
the possible advantages of ray coherence. The coherent ray tracer[112] also distributed rays at its
lowest level, intersecting each triangle with four coherent rays using SSE whereas an axis-aligned
BSP-tree coherently organized the triangles (but required special implementation to efficiently
intersect four-ray bundles). Geometry distribution on the other hand seems better suited for
handling the special problems due to ray tracing large scene databases[111].
2.2.4 Ray-Triangle Intersection on the GPU
The pixel shader implementation of ray-triangle intersection treats the GPU as a SIMD parallel
processor[80]. In this model, the framebuffer is treated as an accumulator data array of 5-vectors
(r, g, b, α, z), and texture maps are used as data arrays for input and variables. Pixel shaders
perform sequences of operations that combine the textures and the framebuffer. While compilers
exist for multipass programming[80, 86], the current limitations of pixel shaders required complete
knowledge and control of the available instructions and registers to implement ray intersection.
Input
Ray Data. As mentioned in Section 2.2.3, the GPU component of the ray engine intersects multiple
rays with a single triangle. Every pixel in the data array corresponds to an individual ray. Our
implementation stores ray data in two textures: a ray-origins texture and a ray-directions texture.
Batches of rays cast from the eyepoint or a point light source will have a constant color ray-origins
13
texture and their texture map could be stored as a single pixel or a pixel shader constant.
Triangle Data. The triangle data is encapsulated in the attributes of the four vertices of a
screen filling quad. Let a,b, c denote the three vertices of the triangle, and n denote the triangles
front facing normal. The triangle id was stored as the quad’s color, and the vectors a,b,n,ab(=
b− a),ac,bc were mapped to multi-texture coordinate vectors.
The redundant vector information includes ray-independent pre-computation that reduces the
size and workload of the pixel shader. Our implementation passes only the three vertices of the
triangle from the host, and computes the additional redundant values in the vertex shader.
The texture coordinates for texture zero (s0, t0) are special and are not constant across the
quadrilateral. They are instead set to (0, 0), (1, 0), (1, 1), (0, 1) at the four vertices, and rasterization
interpolates these values linearly across the quad’s pixels. These texture coordinates are required
by the pixel shader to access each pixel’s corresponding ray in the screen-sized ray-origins and
ray-directions textures.
Output
The output of the ray-triangle intersection needs to be queried by the CPU, which can be an
expensive operation due to the asymmetric AGP bus on personal computers (which sends data to
the graphics card much faster than it can receive it). The following output format is designed to
return as little data as necessary, limiting itself to the index of the triangle that intersects the ray
closest to its origin, using the z-buffer to manage the ray parameter t of the intersection.
Color. The color channel contains the color of the first triangle the ray intersects (if any). For
typical ray tracing applications, this color will be a unique triangle id. These triangle id’s can index
into an appearance model for the subsequent shading of the ray-intersection results.
Alpha. Our pixel shader intersection routine conditionally sets the fragments alpha value to
indicate ray intersection. The alpha channel can then be used as a mask by other applications if
the rays are coherent (e.g. like eye rays through the pixels in the frame buffer).
The t-Buffer. The t-value of each intersection is computed and replaces the pixel’s z-value. The
built-in z-test is used so the new t-value will overwrite the existing t-value stored in the z-buffer
if the new value is smaller. This allows the z-buffer to maintain the least positive solution t for
14
each ray. Since the returned t value is always non-negative, the t-value maintained by the z-buffer
always corresponds to the first triangle the ray intersects.
Intersection
We examined a number of efficient ray-triangle intersection tests[26, 6, 75], and managed to reor-
ganize one[75] to fit in a pixel shader.
Our revised ray-triangle intersection is evaluated as
ao = o− a, (2.1)
bo = o− b, (2.2)
t = −n · ao
n · d , (2.3)
aod = ao× d, (2.4)
bod = bo× d, (2.5)
u = ac · aod, (2.6)
v = −ab · aod, (2.7)
w = bc · bod. (2.8)
The intersection passes only if all three (unnormalized) barycentric coordinates u, v and w are
non-negative. If the ray does not intersect the triangles, the alpha channel for that pixel is set to
zero and the pixel is killed. The parameter t is also tested against the current value in the z-buffer,
and if it fails the pixel is also killed. Surviving pixels are written to the framebuffer as the ray
intersection currently closest to the ray origin.
This implementation reduces cross products, which require multiple pixel shader operations to
compute. The quotient (2.3) was implemented using the texdepth instruction, which implements
the “depth replace” texture shader.
Results
We tested the PS1.4 implementation of the ray-triangle intersection using the ATI R200 chipset
found on the Radeon 8500 graphics card. The limited numerical precision of its pixel shader (16-bit
fixed point, with a range of ±8) led to some image artifacts shown in Figure 2.4, this implementation
did suffice to measure the speed of an actual hardware pixel shader on the task of ray intersection.
We clocked our GPU implementation of ray intersection at 114M intersections per second.
The fastest CPU-based ray tracer was able to compute between 20M and 40M intersections per
15
Figure 2.4: Leaky teapot, due to the low precision implementation on PS1.4 pixel shaders used to
test the performance of ray-triangle intersection. Our simulations using the precision available on
upcoming hardware are indistinguishable from software renderings.
second on an 800Mhz Pentium III[112]. Even doubling the CPU numbers to estimate performance
on today’s computers, our GPU ray-triangle intersection performance already exceeds that of the
CPU, and we expect the gap to widen as GPU performance growth continues to outpace CPU
performance growth.
2.2.5 Ray Engine Organization
This section outlines the encapsulation of the GPU ray-intersection into a ray engine. It begins with
a discussion of why the CPU is a better choice for the management of rays during the rendering
process. Since the CPU is managing the rays, the ray engine is packaged to provide easy access to
the GPU ray-intersection acceleration through a front-end interface. This interface accepts rays in
coherent bundles, which can be efficiently traced by the GPU ray-intersection implementation.
The Role of the CPU
We structured the ray engine to perform ray intersection on the GPU and let the host organize the
casting of rays and manage the resulting radiance samples. Since the bulk of the computational
resources used by a ray tracer are spent on ray intersection, the management of rays and their
results is a relatively small overhead for the CPU, certainly smaller than performing the entire ray
tracing on the CPU.
The pixel shader on the GPU is a streaming SIMD processor good at running the same algorithm
16
on all elements of a data array. The CPU is a fast scalar processor that is better at organizing
and querying more sophisticated data structures, and is capable of more sophisticated algorithmic
tools such as recursion. Others have implemented the entire ray tracer on the GPU[88], but such
implementations can be cumbersome and inefficient.
For example, recursive ray tracing uses a stack. While some have proposed the addition of state
stacks in programmable shader hardware[71], such hardware is not currently available. Recursive
ray tracing can be implemented completely on the GPU[88], but apparently at the expense of
generating two frame buffers full of reflection and refraction rays at each intersection, which are
then managed by the host.
The need for a stack can be avoided by path tracing[53]. Paths originating from the eye point
passing through a pixel can accumulate its intermediate results at the same location in texture
maps. Path tracing requires importance sampling to be efficient, even with fast ray intersection.
Sophisticated importance sampling methods[110] use global queries into the scene database, as
well as queries into previous radiance results in the scene. Such queries are still performed more
efficiently on the CPU than on the GPU.
Some ray tracers also organize rays and geometry into coherent caches that are cast in an
arbitrary order to more efficiently render large scenes[83]. The management of ray caches and the
radiances resulting
rom their batched tracing requires a lot of data shuﬄing. An implementation on the GPU
would require all of the pixels in the image returned by the batch ray intersection algorithm to be
shuﬄed to contribute to the radiance of the previously cast rays. While dependent texturing can
be used to perform this shuﬄing[88], the GPU is ill-designed to organize and set up this mapping.
We used the NV FENCE extension to overlap the computation of the CPU and GPU. This
allows the CPU to test whether the GPU is busy working on a ray-triangle bundle so the CPU can
continue to work simultaneously on ray caching.
2.2.6 The Ray Engine Interface
Organizing high-performance rendering services to be transparent makes them easier to integrate
into existing rendering systems[59]. We structured the ray engine as both a front end driver that
17
runs on the host and interfaces with the application, and a back end component that runs on the
GPU to perform ray intersections.
The front end of the ray engine accepts a cache of rays from a host application. This front
end converts the ray cache into the texture map data for the pixel shader to use for intersection.
The front end then sends the geometry (from a shared database with the application) down the
geometry pipeline to the pixel shader. The pixel shader is treated as a back end of this system
that intersects the rays with the triangles passed to it. The front end grabs the results of ray
intersection (triangle id, t-value and, if supported, the barycentric coordinates) and returns them
to the application in a more appropriate format.
v
Ray Intersection Pixel Shader
GPU
CPU
Application (Ray Tracing, Path Tracing,




















Figure 2.5: The organization of the ray engine.
The main drawback of implementing ray casting applications on the host is the slow readback
bandwidth of the AGP bus when transferring data from the GPU back to the CPU. This bottleneck
is addressed by the ray engine system with compact data that is returned infrequently (once after
all triangles have been sent to the GPU).
The Ray Cache
Accelerating the implementation of ray intersection is not enough to make ray tracing efficient.
The number of ray intersections needs to be reduced as well. The ray engine uses an octree to
maintain geometry coherence and a 5-D ray tree[5] to maintain ray coherence.
The ray engine works more efficiently when groups of similar rays intersect a collection of spa-
18
tially coherent triangles. In order to maintain full buckets of coherent rays, we utilize a ray cache[83].
Ray caching allows some rays to be cast in arbitrary order such that intersection computations can
be performed as batch processes.
As rays are generated, they are added to the cache, which collects them into buckets of rays with
coherent origins and directions. For maximum performance on the ray engine, each bucket should
contain some optimal hardware-dependent number of rays. Our bucket size was 256 rays, organized
as two 64×4 ray-origin and ray-direction textures. Textures on graphics cards are commonly stored
in small blocks instead of in scanline order to better capitalize on spatial coherence by placing more
relevant texture samples into the texture cache of the GPU. The size of these texture blocks is
GPU-dependent and can be found through experimentation.
If adding a ray makes a bucket larger than the optimal bucket size then the node is split into
four subnodes along the axis of greatest variance centered at the using the mean values of the ray
origins and directions. We also add rays to the cache in random order which helps keep the tree
balanced.
When the ray tracer needs a result or the entire ray cache becomes full, a bucket is sent to the
ray engine to be intersected with geometry. We send the fullest buckets first to maximize utilization
of the ray engine resources. Each node of the tree contains the total number of rays contained in
the buckets below it. Our search traverses down the largest valued nodes until a bucket is reached.
While this simple greedy search is not guaranteed to find the largest bucket, it is fast and works
well in practice since the buckets share the same maximum size. This greedy search also tends to
balance the tree.
Once the search has chosen a bucket, rays are stolen from that node’s siblings to fill the bucket to
avoid wasting intersection computations. Due to the greedy search and the node merging described
next, this ensures that buckets sent to the ray engine are always as full as possible, even though in
the ray tree they are typically only 50-80% full.
Once a bucket has been removed from the tree and traced, it can often leave neighboring buckets
containing only a few rays. Our algorithm walks back up the tree from the removed bucket leaf
node, collecting subtrees into a single bucket leaf node if the number of rays in the subtree has
fallen below a threshold. Our tests showed that this process typically merged only a single level of
19
the tree.
The CPU performs a ray bucket intersection test[5] against the octree cells to determine which
should be sent to the GPU. We also used the vertex shader to cull back-facing triangles as well as
triangles outside the ray bucket from intersection consideration. The vertex shader cannot change
the number of vertices passing through it, but it can transform the screen-filling quad containing
the triangle data to an off screen location which causes it to be clipped.
2.2.7 Results
We implemented the ray engine on a simulator for an upcoming GPU based on the expected
precision and capabilities needed to support the Direct3D 9 specification. These capabilities allow
us to produce full precision images that lack the artifacts shown earlier in Figure 2.4.
We used the ray engine to classically ray trace a teapot room and an office scene, shown in
Figure 2.6(a) and (c). We applied the ray engine to a Monte-Carlo ray tracer that implemented
distributed ray tracing and photon mapping, which resulted in Figure 2.6(b). The ray engine was
also used to ray trace two views of one floor from the Soda Hall data set, shown in Figures 2.6(d)
and (e).
(a) (b) (c) (c) (c)
Figure 2.6: Images tested by the ray engine: teapot Cornell box ray traced classically (a) and
Monte Carlo (b), office (c), and Soda Hall side (d) and top (e) views.
The performance is shown in Figure 2.1. Since our implementation is on a non-realtime sim-
ulator, we have estimated our performance using the execution rates measured on the GeForce 4.
We measured the performance in rays per second, which measures the number of rays intersected
with the entire scene per second. This figure includes the expensive traversal of the ray-tree and
triangle octree as well as the ray-triangle intersections.
This performance meets the low end performance of the coherent ray tracer, which was able
to trace from 200K to 1.5M rays per second[112]. It too used coherent data structures to increase
20
Scene Polygons Rays/sec.
Teapot Room Classical 2,650 206,905
Teapot Room Monte-Carlo 2,650 149,233
Office 34,000 114,499
Soda Hall Top View 11,052 129,485
Soda Hall Side View 11,052 131,302
Table 2.1: Rays per second across a variety of scenes and applications.
performance, in this case an axis aligned BSP tree organized specifically to be efficiently traversed
by the CPU. Our ray traversal implementation is likely not as carefully optimized as theirs.
2.2.8 Analysis and Tuning
Suppose we are given a set of R rays and a set of T triangles for performing ray-triangle intersection
tests. We denote the time to run the tests on the GPU and CPU respectively as GPU(R, T ) and
CPU(R, T ). To achieve improved performance, we are only interested in values of R and T for which
GPU(R, T ) ≤ CPU(R, T ), suggesting the right problem granularity for which the GPU performs
best.
Since the GPU performs all pairs intersection test between the rays and triangles passed to it,
its performance is independent of scene structure
GPU(R, T ) = O(RT ). (2.9)
The running time for CPU(R, T ) is dependent on both scene and camera (sampling) structure since
partitioning structures in both triangle and ray space may be used to reduce computation
CPU(R, T ) ≤ O(RT ). (2.10)
As Section 2.2.4 shows, the constant of proportionality in the O(RT ) in (2.9) is smaller (by at least
a factor of two) than the one in (2.10). Tuning the ray engine will require balancing the raw speed
of GPU(R, T ) with the efficiency of CPU(R, T ).
21
The Readback Bottleneck
We can model GPU(R, T ) by analyzing the steps in the GPU ray-triangle intersection in terms of
GPU operations, and empirically measuring the speed of these operations. A simple version of this
model sufficient for our analysis is
GPU(R, T ) = TR fill−1 +Rγ readback−1, (2.11)
where γ is the number of bytes read back from the graphics card per ray. This model shows that
the GPU ray-triangle intersection time is linearly dependent on the number of rays and affinely
dependent on the number of triangles. This model does not include the triangle rate, which would
add a negligible term proportional to T to the model. Once we determine values for fill and
readback we can then determine the smallest number of triangles Tmin needed to make GPU ray-
triangle intersection practical.
The fill rate is measured in pixels per second (which includes the cost of the fragment shader
execution) whereas the readback rate is measured in bytes per second. The fill rate is measured
pixels per second instead of bytes per second because it is non-linear in the number of bytes
transferred (modern graphics cards can for example multitexture two textures simultaneously).
Since our ray engine uses two ray textures (an origins texture and a directions texture) we simply
divide the number of rays (pixels) by the fill rate (pixels per second) to get the fragment shader
execution time.
We determine values for the fill and readback rates empirically. For example, the GeForce3
achieves a fill rate of 390 MP/sec. (dual-textured pixels) and an AGP 4x readback rate of 250
MB/sec (which is only one quarter of the 1 GB/sec that should be available on the AGP bus).
Returning a single 64-bit triangle ID uses a γ of four, whereas returning an additional three single-
precision floating-point barycentric coordinates sets γ to 16. Hence we can return triangle ID’s
at a rate of 62.5M/sec., but when we include barycentrics the rate drops to 15.6M/sec. We can
further increase performance by reducing the number of bytes used for the index of each triangle,
especially since the ray engine sends smaller buckets of coherent triangles to the GPU.
For small values of T the performance is limited by the readback rate. As T increases, the
22



























Radeon 8500 LE perf.




GeForce4 perf.GeForce4 w/full AGP 4x bandwidth
Figure 2.7: Theoretical performance in millions of ray-triangle intersection tests per second on the
GPU with γ = 4.
constant cost of readback is amortized over a larger number of intersections tests. (When we
measured peak ray-triangle intersection rates on the Radeon 8500, we sent thousands of triangles
to the GPU.) In each case, the curve asymptotically approaches the fill rate, which is listed as the
maximum performance possible. Realistically, only smaller values of T should be considered since
the GPU intersection routine is an inefficient all-pairs O(RT ) solution and our goal is to only send
coherent rays and triangles to it.
Figure 2.7 shows that even for small value of T , the performance is quite competitive with
that of a CPU based implementation in spite of the read back rate limitation. For example, the
ray-triangle intersections per second for ten triangles clock at 240M on the GeForce3 and 286M
on the GeForce4 Ti4600 (if they had the necessary fragment processing capabilities). The recent
availability of AGP 8x, and the upcoming AGP 3.0 standard will further reduce the impact of the
readback bottleneck and further validate this form of general GPU processing.
Avoiding Forced Coherence
The previous section constructed a model for the efficiency of the GPU ray-triangle intersection.
We must now determine when it is more efficient to use the CPU instead of the GPU.
It is important to exploit triangle and ray coherence only where it exists, and not to force it
where it does not. We hence identify the locations in the triangle octree where the ray-triangle
23
coherence is high enough to support efficient GPU intersection. This preprocess occurs after the
triangle octree construction, and involves an additional traversal of the octree, identifying cells that
represent at least Tmin triangles. Since these cells are ideal for GPU processing we refer to these
cells as GPU cells.
Rendering employs a standard recursive octree traversal routine. When a ray ray traverses
through a cell not tagged as a GPU cell, the standard CPU based ray-triangle intersection is
performed. If a ray encounters a GPU cell during its traversal, the ray’s traversal is terminated
and it is placed in the ray cache for that cell for future processing. When the ray cache corresponding
to a given GPU cell reaches Rmin rays, its rays and triangles are sent to the GPU for processing
using the ray-intersection kernel.
A point may be reached where the ray engine has receive all known rays from the application to
be processed. At this poing there may exist GPU cells whose ray cache is non-empty, but containing
less than Rmin rays. A policy may be chose to select a GPU cell and force its ray cache to be send to
the CPU instead of the GPU. This allows the ray engine to continually advance towards completion
for rendering the scene.
Results
We have performed numerous tests to tune the parameters of the geometry engine to find the
highest performance.
Scene % GPU Rays
Teapot Room Classical 89%
Teapot Room Monte-Carlo 71%
Office 65%
Soda Hall Top View 70%
Soda Hall Side View 89%
Table 2.2: Percentage of rays sent to the GPU across a variety of scenes and applications.
Table 2.2 demonstrates the utilization of the GPU. As mentioned earlier, only reasonably sized
collections of coherent rays and triangles are sent to the GPU. The remaining rays and triangles are
traced by the CPU. The best performers resulted from classical ray tracing of the teapot room and
the ray casting of the Soda Hall side view. The numerous bounces from Monte Carlo ray tracing
likely reduce the coherence on all but the eye rays. Coherence was reduced in the office scene due to
24
the numerous small triangles that filled the triangle cache before the ray cache could be optimally
filled. The Soda Hall top view contains a lot of disjoint small “silhouette” wall polygons that likely
failed to fill the triangle cache for a given optimally filled ray cache.
System Rays/sec. Speedup
CPU only 135,812
plus GPU 165,098 22%
Asynch. Readback 183,273 34%
Infinitely Fast GPU 234,102 73%
Table 2.3: Speedup by using the GPU to render the teapot room.
Table 2.3 illustrates the efficiency of the ray engine. The readback delay was only responsible
for 12% of the potential speedup of 34%. One feature that would allow us to recover that 12% is
to be able to issue an asynchronous readback (as is suggested in OpenGL 2.0), such that the CPU
and GPU can continue to work during the readback process. The NV FENCE mechanism could
then report when the readback is complete. This feature could possibly be added through the use
of threads, but this idea has been left for future research.
The last row of Table 2.3 shows the estimated speed if we had an infinitely fast GPU, which
shows that most of our time is spent on the CPU reorganizing the geometry and rays into coherent
structures. This effect has been observed in similar ray tracers[112], where BSP tree traversal is
“typically 2-3 times as costly as ray-triangle intersection.”
T GPU Rays Rays/sec. Speedup
CPU 135,812
4–16 78% 147,630 8%
5–12 81% 157,839 16%
5–15 89% 165,098 22%
Table 2.4: Tuning the ray engine by varying the range of triangles T sent to the GPU, measured
on the teapot room.
Table 2.4 shows the effect of tuning the number of triangles that get sent to the GPU. In each
of these cases, the number of rays intersected by each GPU pass was set to 64.
Table 2.5 shows that the number of rays in each bucket can also be varied to achieve peak
efficiency. Tuning the ray engine to assign more rays to the GPU frees the CPU to perform more
caching. For example, for the teapot room classical ray tracing, we were able to achieve a 52%








Table 2.5: Tuning the number of rays R sent to the GPU for intersection.
2.2.9 Ray Engine Conclusions
We have added ray tracing to the growing list of applications accelerated by the programmable
shaders found in modern graphics cards. Our ray engine performed at speeds comparable to the
fastest CPU ray tracers. We expect the GPU will become the high-performance ray-tracing platform
of choice due to the rapid growth rate of GPU performance.
By partitioning computation between the CPU and GPU, we combined the best features of both,
at the expense of the slow readback of data and the overhead of ray caching. The AGP graphics
bus supports high-bandwidth transmission from the CPU to the GPU, but less bandwidth for
recovery of results. We expect future bus designs and driver implementations will soon ameliorate
this roadblock.
The overhead of ray caching limited the performance speedup of GPU to less than double
that of the CPU only, and this overhead has also burdened other methods[112]. Even though our
method for processing the data structures is considered quite efficient[93], we are anxious to explore
alternative structures that can more efficiently organize rays and geometry for batch processing by
the GPU.
26
2.3 Example: GPU Matrix Multiplication
2.3.1 Introduction
The multiplication of matrices is one of the most central operations applied in scientific computing.
Recent history has shown continued research for better tuned algorithms that improve the efficiency
of matrix multiplication. The ATLAS system for automatic tuning of matrix multiplication for
target CPU’s has shown much success [115]. New advances in PC chip design (such as streaming
SIMD extensions) has led to research into how to best leverage modern mico-architectures for this
task [1].
Recently, consumer based graphics processors (GPU’s) have become increasingly more powerful
and are starting to support programmable features. The parallelism in graphics hardware pipelines
makes the GPU a strong candidate for performing many computational tasks including matrix
multiplication [62]. We detail a new approach for multiplying matrices on GPU hardware. This
approach takes advantage of multiple levels of parallelism found in modern GPU hardware and
reduces the bandwidth requirements necessary to make this technique effective.
The architecture of modern GPUs relevant to this paper is described in Section 2.3.2. In
summary, GPUs receive 3D graphics primitives (typically triangles or quadrilaterals) specified as
a set of vertices from the application. These vertices are transformed into screen coordinates,
and a fragment is generated for each pixel covered by the primitive (generating these fragments is
called rasterization). Fragments contain information such as colors and texture coordinates which
are interpolated across the primitive from values associated with the vertices. These auxiliary
attributes are used to shade each fragment, which results in a final color which is written to a pixel
in the framebuffer. One of the most common ways of shading fragments is to use auxiliary attributes
known as texture coordinates to index into a previously supplied image (texture). Multiple sets of
texture coordinates can be used to retrieve colors from multiple textures; the results are combined
to form the final color for the fragment. This is multitexturing. Finally, a shaded fragment can
either replace the current value of the pixel in the frame buffer or it can be added to the current
value. This version of the graphics pipeline is described in [118].
The performance of GPUs comes from the fact that large amounts of parallelism are available
27
in this pipeline. In particular, each fragment is independent of all other fragments, so they can be
processed in parallel. Processing of fragments can also be overlapped to hide pipeline stalls and
memory latencies, resulting in very efficient use of the hardware.
Multitexturing can be used to multiply matrices [62]. An m × n matrix can be represented
by a greyscale texture, with the each pixel containing an element of the matrix 1. These matrices
can be displayed on the screen by drawing an m × n-pixel rectangle with the texture coordinates
(0, 0), (0, n−1), (m−1, n−1), (m−1, 0) assigned to the vertices (clockwise starting with the upper-
left vertex). One benefit of this is that we can access the transpose of a matrix by drawing an
n×m-pixel rectangle with texture coordinates (0, 0), (m−1, 0), (m−1, n−1), (0, n−1). The exact
same texture is used for drawing the matrix transposed and un-transposed. We have only changed
the mapping of the texture image onto the rectangle.
Matrix multiplication performs C = AB where A is an m× l-element matrix and B is an l×n-
element matrix. By storing matrices A and B as textures, we can compute C in l multitexturing
passes as shown in Figure 2.8.
Clear the screen.
Set the drawing mode to overlay.
Load texture texA with matrix A.
Load texture texB with matrix B.
Set the multitexturing mode to modulate.
Set framebuffer write mode to accumulate.
for i = 0 . . . l − 1
draw an m× n-pixel rectangle with
texA coords (0, i), (0, i), (m− 1, i), (m− 1, i), and
texB coords (i, 0), (i, n− 1), (i, n− 1), (i, 0).
end
Screen contains result of A×B.
Figure 2.8: The Larson-McAllister multipass algorithm for multiplying two matrices.
The texture coordinates (0, i), (0, i), (m − 1, i), (m − 1, i) replicate the ith column across the
entire rectangle, whereas the texture coordinates (i, 0), (i, n − 1), (i, n − 1), (i, 0) replicate the ith
1Textures are typically indexed using (s, t), with s indexing the horizontal axis and t indexing the vertical axis.
This is the opposite of standard matrix notation, where the first index represents the row and the second index
representing the column. In the rest of this paper, we’ll use the matrix style rather than the texture style. Also,
texture coordinates have traditionally been in the range [0 . . . 1], with various filters used to generate a color for
indices that fall between pixels. We use an the OpenGL texture rectangle extension to allow integer indexing, and
disable filtering.
28
row. These textured rectangles are then combined.
2.3.2 Modern GPU Organization
The graphics pipeline implemented on graphics acceleration hardware was classically organized
in a series of transformations, clipping and rasterization steps. Modern graphics hardware has
generalized this pipeline into programmable elements. The modern graphics pipeline consists of
vertex processing, rasterization and fragment processing. The vertex processor performs operations
on the individual vertices of triangles sent to the graphics accelerator. Once transformed, these
triangles are rasterized into a collection of pixels. Each pixel output by the rasterizer is called a
fragment. The rasterization process linearly interpolates attributes, such as texture coordinates,
stored at the vertices and stores the interpolated values at each fragment. A fragment processor
uses the interpolated texture coordinates to lookup texture values from texture memory, and can
perform special-purpose arithmetic operations on both the texture addresses and the fetched texture
values.
The vertex processor is structured similarly to vector processors (pipeline on a stream of ver-
tices), whereas the fragment processor is structured similarly to a SIMD array processor (one
processor per pixel). Our experiments have shown that the vertex processor does not provide much
advantage over existing CPU capabilities, whereas the fragment processor already outperforms the
CPU on some operations like ray-triangle intersections [14].
Because these processors were originally developed for texturing, the programs the GPU exe-
cutes are called shaders. A fragment shader is a program executed by the fragment processor that
describes the color of each fragment before it is assigned to its corresponding pixel. The inputs to a
fragment shader are a set of program constants, interpolated attribute data from triangle vertices,
and texture maps (blocks of texture memory addressed by the texture coordinates). Before ren-
dering, the fragment shader is compiled and loaded into the graphics hardware. Primitives (in our
case quadrilaterals) are then sent down the graphics pipeline invoking the enabled fragment shader
as they are rasterized. The output of a fragment shader is an output color plotted to the screen.
A fragment shader is allotted a fixed set of temporary registers R0 . . . Rn. Each register holds a
single 4-vector corresponding to the four color channels red, green, blue, alpha. The color channels
29
of each register may be accessed individually as follows: Ri.c where c ∈ {r, g, b, a}, i ∈ {1..n}. Stan-
dard arithmetic operations are defined over the set of registers, such as addition and multiplication.
For example: R2.b = R1.a + R0.g, assigns the blue channel of register R2 to be the sum of the
alpha channels of registers R1 with the green channel of R0.
Modern fragment shaders allow for up to four-instruction to be executed simultaneously, much
like that of the SIMD instructions found in modern PC architectures. For example:
R2 = R1.abgr ∗R0.ggab (2.12)
can be issued as a single GPU instruction which refers to four simultaneous multiplications numer-
ically equivalent to:
R2.r = R1.a ∗R0.g
R2.g = R1.b ∗R0.g
R2.b = R1.g ∗R0.a
R2.a = R1.r ∗R0.b (2.13)
The SIMD nature of the operation defined in (2.12) allows for four additions to occur in parallel,
taking one fourth the computation time of (2.13).
In equation (2.12), R1’s color channels are referenced in arbitrary order. This is referred to
as swizzling. The green channel of R0 is referenced multiple times. This is known as smearing.
Arbitrary swizzling and smearing (and also negation) of input operands can be done with no
performance penalty. This is in contrast to the Intel’s SSE instructions, where moving data between
channels requires additional instructions.
The output color of a fragment program is placed in a designated register (usually R0) upon
termination of the program. This value is written to the fragment’s screen location in the frame
buffer.
Fragment shaders have access to three kinds of data: constants, interpolated vertex attribute
data (e.g. texture coordinates), and texture data (indexed by texture coordinates). Interpolated
vertex attributes data are accessed as the registers T0 . . . Tm wherem+1 is the number of attributes
30
stored with each vertex.
Data is fetched from texture memory by the lookup() operation. For example, R0 = lookup(T0.r, T0.g,M)
uses the first two coordinates of T0 to access the textureM . We can also perform arithmetic on the
texture coordinate before the fetch, or we can use the result of one texture fetch as the coordinates
for a second texture fetch (dependent texturing).
Although fragment shaders provide a very powerful SIMD model for programming, they are
currently limited in a number of ways. Modern implementations restrict the number of available
registers, the total instruction length, and the number of lookup() operations that may occur in a
given fragment shader. Control flow is also restricted in fragment shading. For example, branching
is not supported and conditional execution is limited to predicating instructions on previously set
condition codes.
Our model for a fragment shaders is based on the one described by the DirectX 9.0[69]. This
model provides capabilities currently found in vertex shaders at the fragment shader level. This
model has also been used to describe the implementation of a ray tracer as a fragment shader [88].
This paper assumes similar fragment processor capabilities, specifically fragment shaders of up to
256 instructions, an unrestricted number of texture access operations, a set of at least six registers,
and standard single-precision floating point data formats.
2.3.3 Cache Aware Matrix Multiply
Suppose we are multiplying two large matrices X and Y , wlog who’s dimensions are a perfect power
of two, with n = 2i rows and n = 2i columns. A general algorithm for computing Z = XY can be
expressed as follows:
31
for i = 1 . . . n
for j = 1 . . . n
Zij = 0
for m = 1 . . . n





The outer two loops are implemented on the GPU by rendering a single screen filling quadri-
lateral. This implies that everything within the outer two loops must be handled by the fragment
shader. Below we have inserted pixel shader pseudo-code in the appropriate places. Matrices X and
Y are now assumed to be stored in single channel texture maps and accessed through the fragment
shader lookup() operation.
for i = 1 . . . n





for m = 1 . . . n
R1.r = lookup(i,m,X)
R2.r = lookup(m, j, Y )





The above pseudo-code in (2.15) requires that either loops are available in fragment shaders
or that the fragment shader instruction count is long enough to allow the innermost loop to be
unrolled. We assume neither are realistic assumptions. To address this issue, we turn to a standard
blocking strategy.
32
Blocking has been shown to improve cache performance, but for this application blocking also
serves the purpose of allowing us to work within the constraints of our fragment programming model.
The pseudo-code for our blocking strategy is shown in (2.16). A new matrix F is introduced that
is initialized to be all zeroes and used as a temporary store by the routine. The value b is a scalar
representing the block size.
for k = 1..n step b
for i = 1 . . . n





for m = k . . . k + b− 1
R1.r = lookup(i,m,X)
R2.r = lookup(m, j, Y )
R3.r = R3.r +R1.r ∗R2.r
end
R4.r = lookup(i, j, F )
R0.r = R3.r +R4.r
end
end
Copy frame buffer into texture F
end
(2.16)
This new algorithm is a multipass method requiring multiple renderings to the frame buffer.
The outer three loops are handled by rendering screen filling quadrilateral to the frame buffer
n/b times. Between each of the n/b passes, the frame buffer is copied into texture map F to be
accumulated with result of the next pass. This copy operation is required since modern GPU
hardware does not support direct lookup operations on the framebuffer. Some graphics hardware
does however, support blending modes allowing fragment values to accumulate directly with the
contents of the frame buffer eliminating the need for the temporary texture F , and consequently
more efficient rendering.
33
The fragment program (which covers the portion inside the j loop), can now be unrolled by
choosing an appropriate value of b. For our tests we have chosen b = 32. We were able to reduce
our total fragment program instruction count to be four instructions per iteration of the m loop,
for a total of 6 + 4b = 134 total instructions.
2.3.4 Multi-Channel GPU Matrix Multiplies
Texture map sizes on modern day GPU’s are often restricted. Let n being the maximum size of
any dimension. NVidia’s GeForce4 Ti4600 has a maximum allowable renderable size of n = 2048,
limiting multipass programs to two-dimensional textures containing at most 2048× 2048 elements.
Texture maps may consist of between one and four channels (luminance, luminance-alpha, RGB or
RGBA). This implies that the GeForce4 can handle textures in size up to 2048× 2048× 4.
Methods have already been presented for handling matrices whose dimensions are at most
n = 2048 using single channel texture maps. It is naturally desirable to be able to handle matrices
of larger sizes. The GeForce4 for example should be able to multiply matrices 4096× 4096 in size
by utilizing all four of the color channels. This section describes a matrix multiplication algorithm
that takes advantage of this four-component storage capability.
These four-channel textures store matrices of 4096×4096 32-bit floating point values, and occupy
64MB. Our GPU matrix multiplication implementation requires four times this space for storing
the two operands, a temporary store, and the result. Current consumer level GPU’s such as the
GeForce4 currently ship with 128MB of on-board memory, suggesting a maximum capable matrix
side of 2828 × 2828. Exceeding this memory threshold under a non-unified memory architecture
of present-day PC GPU’s results in paging to main system memory, and increased traffic over the
graphics card bus.
Basic Formulation
Suppose we are multiplying two large matrices X and Y , wlog whose dimensions are a perfect
power of two, with 2i rows and 2i columns.
XY = Z (2.17)
34
The matrix multiply in (2.17) can be expressed as the following series of matrix multiplies of
smaller matrices
X︷ ︸︸ ︷ A B
C D

Y︷ ︸︸ ︷ E F
G H
 =
Z︷ ︸︸ ︷ AE +BG AF +BH
CE +DG CF +DH
 (2.18)
Elements A,B,C,D,E, F,G, and H are sub-matrices decomposing X and Y . Let the dimen-
sions of A . . .H be 2i−1 rows by 2i−1 columns.
Blocked Matrix Texture Maps
We can store matrices X and Y as texture maps in a 2i−1 by 2i−1 sized texture maps on the GPU,








We now introduce subscript notation on matrices M , such that Mi for i ∈ r, g, b, a refers to the
sub-matrices composing M . For example XrYg = AF .
We have presented our technique for multiplying matrices contained in a single color channel
XrYr = Zr in Section 2.3.3. We can extend this same approach to a SIMD notation by using





The above notation assumes an architecture where four sub-matrices may be operated on in
parallel by a single instruction. This notation is useful since graphics hardware is designed in a
SIMD manner to work simultaneously on four color channels at a time. Using this notation, we
can now concisely express matrix multiplication X and Y as follows:
35










To apply the formulation derived in (2.21) into an algorithm usable by graphics hardware, we
must first re-examine fragment shader programming. As discussed in Section 1, a single lookup()
operation can retrieve a 4-vector corresponding the four color channels. If we store X as a texture
map in blocked form (2.19) then a single lookup R0 = lookup(i, j,X) can be used to retrieve four
values coming from the sub-matrices of R0 = Aij , Bij , Cij , Dij . The four matrix multiplies from
equation (2.21) may now be parallelized within a single fragment shader. The matrix swizzling
and smearing suggested by (2.21) is handled at the per-element level utilizing the capabilities of
graphics hardware, as shown in (2.22).
36
for k = 1 . . . n/2step b
for i = 1 . . . n/2





for m = k . . . k + b− 1
R1 = lookup(i,m,X)
R2 = lookup(m, j, Y )
R3 = R1.rrbb ∗R2.rgrg +R3
R3 = R1.ggaa ∗R2.baba+R3
end
R4 = lookup(i, j, F )
R0 = R3 +R4
end
end
copy frame buffer into texture F
end
(2.22)
The above algorithm represents an efficient use of the SIMD computation power of the GPU
by working on all four color channels in parallel. This implementation only increases the fragment
shader instruction count by one per iteration of m, thus resulting in a total fragment program
length of 6 + 5b = 166 with b = 32.
2.3.5 Analysis
We have analyzed the new blocked and multichannel GPU matrix multiplication algorithms with
respect to memory bandwidth, instruction count and predicted performance.
Bandwidth Considerations
To analyze the potential bandwidth limitations for our approach, we first distinguish between the
two bandwidth limited areas of modern GPUs. The external bandwidth is the rate at which data
37
may be transferred between the GPU and the main system memory. On modern PC’s this is limited
by the speed of the AGP graphics bus which can transfer data at the rate of 1GB/sec. The internal
bandwidth is the rate at which the GPU may read and write from its own internal memory. The
GeForce4 Ti 4600 is currently capable of transferring 10.4 GB/sec.
For our application the external bandwidth of the GPU affects our application in two areas.
First, the matrices must be copied into the GPU’s memory as texture maps, and the result of the
computation must be read back from the card into main memory. These transfers use the AGP
bus, which currently has a theoretical bandwidth of about 1 GB/s (for AGP 4x). However, in
practice sending data to the GPU is much faster than reading data back from the GPU (since the
hardware and drivers are optimized for this case), and the best speed we’ve measured for reading
data back into host memory is 175 MB/s. Even assuming an average of 200 MB/s transfer for both
reads and writes, transferring two 1024 × 1024 single-precision matrices to the GPU and reading
the 1024×1024 result back requires 60 ms. At 4 GFLOPS the actual computation takes about 510
ms. For this problem, transfer time is about 11% of the total time. Thus, external bandwidth is a
significant but not overwhelming fraction of the total time.
The second part of the algorithm affected by external bandwidth is the time to send the ge-
ometry, the screen filling quads on which the matrices are texture mapped. In fact this is a very
small amount of data (about 48 bytes per pass), and graphics hardware is very good at transferring
geometry information in parallel with other tasks, including running fragment shaders. Thus, this
cost is negligible.
One of the primary bottlenecks in performing matrix multiplies on the GPU is the internal
bandwidth [62]. This is also true for CPU implementations. For our analysis we consider multiply-
ing two n × n matrices. For both the multi-channel and single-channel block-matrix approaches,
the processing of each fragment requires two texture lookup() operations per iteration of the inner
m loop, plus an additional lookup() to combine it with the results from the previous pass. A single
write to output occurs per fragment per pass as its result is written to the frame buffer. Thus,
there are 2b + 2 memory operations per fragment per pass. In our single channel method, every
memory operation transfers 4 bytes of data. Our multi-channel method transfers 16 bytes of data
(4 channels, 4 bytes per channel) per memory operation.
38
method passes frags/pass bytes/frag total
L-M n n2 16 16n3
single nb n






2 (2b+ 2)16 4n
3(b+1)
b
Table 2.6: Bytes transferred internally by each GPU matrix multiplication method.
Table 2.6 summarizes these results and shows the total internal bandwidth in bytes transferred
by each method, which is just the product of the number of passes, the fragments per pass and the
bytes per fragment. The L-M (Larson-McAllister) figures assume an implementation based on a
single floating-point channel. (Even though multiple channels were mentioned, [62] did not describe
a multi-channel implementation.) The four-byte floats are accessed four times (two matrices, a
temporary store and a result) for a total of 16 bytes transferred per fragment.
Our single-channel block matrix algorithm performs identically to L-M when b is set to one
(no blocking). As the blocking size b grows, the bandwidth drops by nearly a factor of two when
compared to L-M. The multi-channel method further reduces the memory bandwidth by exactly
one half over the single channel method, reducing the bandwidth to nearly 25% of L-M.
Instructions
The GPU uses the fact that the same instructions are being executed for a large number of frag-
ments to overlap the processing of different fragments. As no communication between executions of
the fragment shaders is needed, a large amount of parallelism is available. This parallelism is used
to hide the latency of memory operations and other causes of stalls. As a result, when enough frag-
ments are available (as in our case), the running time of a fragment shader is approximately linear
in the number of instructions executed (assuming computation is the limiting factor). Therefore,
it makes sense to analyze the number of instructions used by our algorithm.
Each execution of the fragment shader needs four instructions for setup and adding in the result
of the previous pass. The multi-channel algorithm also needs three instructions per iteration of the
inner loop (two multiply-add instructions and one addition to update the texture indices). The
single-channel algorithm removes one of the multiply-add instructions. Therefore, the instruction
counts are 3b + 4 for the multi-channel case and 2b + 4 for the single-channel case. Note that in
the multi-channel case, each many of the instructions involve a 4-wide data issue, operating on the
39
four color channels in parallel. Currently there is no performance penalty for this since modern
GPU’s designed to natively work on multiple channels. Table 2.7 summarizes the analysis and the
total GPU floating point instructions required by each method.
method passes frags/pass inst/frag total inst
L-M n n2 2 2n3
single nb n






2 3b+ 4 n
3(3b+4)
8b
Table 2.7: Floating point operations required by each GPU matrix multiplication method.
The additional fragment program overhead makes the single channel block-structured matrix
multiplication longer than the L-M algorithm. As b increases, the instruction count asymptotically
approaches that of the L-M algorithm. The multi-channel method executes 3/16ths as many
instructions of either of the single channel methods as the block size grows.
Performance
ATLAS has demonstrated 4.0 GFLOPS/s for matrix multiplication on a 1.5GHz Pentium4 using
Intel’s SSE2 SIMD instructions [24]. Is the GPU comparable to this? For a matrix size of 1024×1024
and a block size of 32 (n = 1024, b = 32), our multi-channel algorithm transfers 4.125 GB of data.
In order to match the ATLAS P4 SSE numbers, we need to perform this multiplication in 0.5
seconds. This means we will need 8.25 GB/s of bandwidth. Current hardware has a theoretical
bandwidth of 10.4 GB/s to the main memory; as with CPUs, it also has a cache between the
GPU and memory which supports much higher bandwidth. Thus, existing hardware should be
able to support our bandwidth needs. Future hardware is likely to improve both cache size and
performance and memory bandwidth.
Performance of CPU implementations of matrix multiplication are typically limited by memory
bandwidth, not CPU speed. Larson and McAllister [62] reported similar results with their GPU
implementation. However, due to much lower clock speeds on CPUs relative to CPUs, the move
from fixed-point byte operations to floating point may increase the processing requirements enough
to make the GPU speed the bottleneck.
40
2.4 GPU Conclusions
In this chapter we demonstrated two applications which utilize the GPU as a general purpose pro-
cessor. We use these two preliminary studies as a motivation for GPU processing. Both applications
clearly demonstrate the viability of the GPU as a computing platform and at the same time begin
to address some of the issues surrounding working in a multi-processor GPU/CPU environment.
Furthermore, both applications suggest frameworks for mapping particular types of algorithms to
the GPU.
Since our research there has been a lot of continuing research into the area of GPU computing.
Both in the area of photo realistic rendering and numerical algorithms. Purcell et al. [89] developed
a method for performing photon mapping in programmable graphics hardware. Bolz et al. [12]
developed conjugate gradient ( for sparse linear systems ) and multi-grid solvers that run entirely
on the GPU. Goodnight et al. [36] have also developed multi-grid GPU algorithms for boundary
value problems. Similar in spirit to our matrix multiplication research, Kreuger and Westermann
[61] developed GPU paradigms for performing simple linear algebra operators on the GPU. This
research did not include dense matrix matrix multiplication. Lastly, Harris et al. [46] used the
GPU to solve fluid flow problem to simulate clouds.
We now focus our attention on using the GPU to perform the computation on surfaces. Before
this can be done, however, a general framework for representing surfaces on the GPU must be






Finding suitable parameterizations for meshed surfaces is a fundamental problem in computer
graphics that has seen a lot of research[68, 64, 32, 33, 65, 101, 99, 39, 105, 16]. The choice of
parametrization determines how texture samples are distributed across the surface of a mesh. Many
applications such as texture mapping [85, 70, 68], mesh compression [77], re-meshing [4, 2, 3, 106],
image processing [38], 3D-painting [43, 49, 104], and GPU surface computation [15] rely on such
parameterizations.
Alternate approaches have been proposed such as storing surface information in volumetric
data structures as octrees[9, 21]. Texture maps have the advantage over the volumetric methods
in that they provide a convenient memory layout and have widespread hardware implementation.
Furthermore, post-filtering techniques for texture maps exist, such as MIP-mapping, allowing for
fast hardware support of alias removal during texture minification[117].
In this research we consider a subclass surface parametrization methods referred to as texture
atlases. A texture atlas is an injective mapping between an object’s surface and a texture. Such
mappings ensure that every surface position is uniquely represented in the texture domain; a
requirement of many applications that utilize texture atlases.
A texture atlas should be chosen in a way that best distributes texture samples to represent
some underlying surface signal such as colors, normals, and displacements. In may cases, the
underlying signal is not known a-priori. As a result, many texture atlas schemes focus on generating
parameterizations that preserve geometric properties of the mesh, such as surface area, angles, and
42
edge lengths.
To date, numerous methods for generating automatic meshed texture atlases (and texture pa-
rameterizations) have been proposed. The next section discusses recent advancements and related
works in texture atlas generation.
3.2 Chart Formation
Many surfaces, such as closed surfaces, cannot be flattened without forming overlap in the parametriza-
tion resulting in the loss of injectivity in the mapping. Such an injective mapping forms an embed-
ding for the surface. The necessary condition for a mesh to be flattenable without overlap, is that
is must be homeomorphic to a disc. Mesh surfaces that do not satisfy this property (such as closed
surfaces) require cutting so that they can be laid flat.
The cutting process decomposes a meshed surface into one to many flattenable ( but possibly
overlapping ) simply connected components. Each component is referred to as a coordinate-patch or
chart. Charts are flattened and packed into the 2D image domain resulting in a texture atlas that
is comprised of multiple piecewise continuous parameterizations. The collection of parameterized
charts forms an atlas for the given mesh.
Regions along which the mesh is cut form texture seams. In general, seams are undesirable
since they lead to discontinuities in the parametrization. Discontinuous regions may suffer sampling
artifacts during rendering and are more difficult to filter.
3.2.1 Goals and Metrics for Cutting
To avoid distortion and stretch during the flattening process each resulting chart should be de-
velopable, having zero Gaussian curvature everywhere. One way to satisfy this constraint a tri-
angle mesh may be broken down into a chart for every triangle or set of connected triangle
pairs [94, 17, 16], resulting in a large number of seams. In practice most methods seek a bal-
ance by minimizing seams while maintaining some bound on the allowable distortion during flat-
tening [27, 101, 105, 101]. Many proposed methods, however, seek a course approximation by
searching for charts that are either near planar or have near zero Gaussian curvature.
Minimizing both Gaussian curvature and seams alone may not lead to satisfactory results. For
43
example, some atlas schemes fix the shape of the boundary during the chart flattening process [38].
Such schemes desire that each chart parameterizes well into the fixed shape region. Other atlas
generations methods desire that charts have compact boundaries so that they may be more tightly
packed into texture space. Lastly, seams may be placed in regions of low visibility to hide sampling
and reconstruction artifacts[65, 101].
Ideally, charts should be chosen that lead to the best texture atlas parametrization for a given
method and much of this depends on nature of the method used. To date, most existing methods
handle the chart generation process separately from chart parametrization. Simple heuristics are
used in finding charts that lead to low error during flattening.
3.2.2 Single Chart Methods
One approach to chart generation is to simply ”unfold” the mesh into the plane forming a single
chart by carefully selecting a set of cuts through the mesh’s surface[27, 101, 105]. Any closed
polygonal manifold surface may be opened to form a topological disc given an appropriate set of
cuts. The set of edges whose removal transforms a mesh into a topological disc forms a cut graph.
Such an unfolding is commonly referred to as a topological schema.
Gu et al. proposed an O(nlgn) two stage process for flattening a mesh[38]. In stage one, an
arbitrary seed face is selected from the mesh. A region is grown outward from the seed triangle.
The topological disc property for the grown region is maintained as an invariant during the growth
process. At each step a face that is adjacent to the border is selected and added to the grown region.
Face selection is chosen in priority of minimum geodesic distance from the seed face. This heuristic
helps to minimize the resulting seam. During face addition, only a single edge that connects the
face to the region is removed from being a candidate for the final cut graph.
Stage 2 operates on the remaining edges that are candidates for the final cut graph. Some of
the remaining edges form connected loops along generators that must be cut to lay the mesh flat.
Other edge segments for dangle off of the generating cycles. These limbs are detected and removed
from the final cut graph. The loop cycles provide the set of edges required to flatten the mesh.
The final cut graph may be straightened using a constrained shortest path algorithm to provide a
smoother cut for flattening the mesh.
44
Erickson and Har-Peled proposed an alternate method for finding the set of cuts of required to
flatten a polyhedral mesh that minimizes some metric over the edges (e.g. length, curvature,etc..)
[27]. Their approach is similar to that of [38], but provides more exhaustive searching to further
minimize the resulting cuts.
Scheffer and Hart provided alternative guidance for choosing cuts. They note that seams should
be placed along high Gaussian curvature regions, and in regions of the model that are least visibly
conspicuous[101].
3.2.3 Multi-Chart Methods
Alternative schemes have been proposed that decompose the mesh into multiple charts. Multi-
Chart methods have the disadvantage over single chart approaches in that they generally lead to
more seams. Decomposing the mesh into multiple components has its advantages in that smaller
pieces are often times easier to work with. They can be chosen to parameterize well into fixed
boundary regions, and can be packed to reduce wasted texture space.
Triangle and Quadrilateral Packing Methods
One of the simplest approaches for chart creation is to consider each triangle (or connected pairs
of triangles )of a mesh as a separate chart. This can lead to texture atlases having no distortion
since triangles and connected triangle pairs are trivially developable.
Triangular Block Atlas. Maruya [70] was apparently the first to devise a packed triangle texture
atlas, placing the triangles in an arbitrary order in the texture map. He devised a block packing
mechanism consisting of rows of squares, with each square containing two triangle blocks as shown
in Figure 3.1(a). Each triangle block could then contain one large triangle, two medium triangles,
four small triangles, and so on for any combination of triangles of “size” 1/2i so long as their
sizes summed to no more than one, as shown in Figure 3.1(b). Triangles were quantized into a
given size depending on their “scale,” which was defined as the number of texture samples the
triangle’s texture atlas image would cover divided by the number of color samples the surface
triangle contained. A minimum scale α0 would be set a priori (based on the number of triangles
and the texture resolution), and surface triangles would be mapped to atlas triangles of a size such
45
(a) (b) (c) (d) (e)
Figure 3.1: Some previous atlases, with wasted texture space hatched. Triangle block packing (a)
of 32 triangular blocks. Blowup of a sample triangle block (b) containing triangles of size 1/2, 1/4,
1/8, 1/16, 1/32, and two of size 1/64. Quantized area mesh atlas (c) of 68 triangles quantized to
three different power-of-two widths. Sheared mesh atlas (d) of 34 triangles quantized into three
power-of-two altitudes. Progressive mesh atlas (e) of 28 clusters.
that the resulting scale would be no less than α0. Hence this technique would result in a texture
atlas with α0 texture samples per surface color sample.
Quantized Area Mesh Atlas. Soucy et al. [94] described a packing of triangles assumed to be
from a simplified mesh, such that the color data from the vertices of the original unsimplified mesh
will become the texture of the simplified mesh. Let Di be the average distance between vertices
in the region of the original high resolution mesh that simplifies to triangle i. Let Emax be the
length of the longest edge of triangle i. Then their technique would require triangle i to map to a
triangle of edge-length Emax/Di in the atlas, with a goal of achieving at least one texture sample
per vertex color sample in the high resolution mesh. The atlas triangles were quantized into right
triangles whose adjacent and opposite edges were equal and integer powers of two, as shown in
Figure 3.1(c). Arbitrary pairs of equal-sized atlas triangles were joined together at the hypotenuse
to form a rectangular cell. The triangles were offset by one pixel at the hypotenuse to avoid a
seam problem later described in Section 3.5.4, resulting in rectangular cells one pixel longer in one
dimension than the other. The resulting rectangular cells were strip packed into the texture map.
Undistorted Mesh Atlas. Battke et al. [8] constructed a texture atlas by performing a strip
packing of triangles into a square region. The triangles were rotated and translated such that
their longest edge lies along the horizontal axis, then sorted by altitude. The triangles are then
alternately flipped and compacted to form the efficient though sub-optimal greedy packing shown
in Figure 3.2. This packing was sensitive to texture sampling such that the packing did not cause
neighboring triangles to share the same border pixel in the texture atlas. Their algorithm could
46
Figure 3.2: Strip packing of 7,232 undistorted scalene triangles into a texture atlas.
optionally surround triangles with a boundary to support bilinear filtering and to remove seam
artifacts.
Sheared Mesh Atlas. Cignoni et al. [17] developed an anisotropic mesh atlas to better accom-
modate skinny triangles. This mesh atlas worked similarly to the undistorted texture atlas that
rigidly packed triangles into the texture atlas, except that it quantized the altitude of the triangles
to the next higher power of two, then sheared each triangle to fit exactly with the previous triangle
edge in the current strip, as shown in Figure 3.1(d). The method also packed large triangles first,
and packed smaller triangle strips in the remaining gaps.
3.2.4 Normal Clustering
Maillot et. al. [68] suggested grouping faces by their normals. An arbitrary n-sided polyhedron
is selected for grouping the faces. Each triangle is then grouped with the polyhedral face whose
normal direction is closest to that of the triangle’s normal. Triangles that are contained within the
same group and who are adjacent on the surface are clustered together to form a chart. A similar




Garland et al. [35] proposed an agglomerative method for building a complete face hierarchy, which
forms a binary tree. Each node in the tree represents a simply connected regions of triangles that
can be used to form a chart. The algorithm starts by considering every triangle in the mesh as
a separate chart and iteratively applying greedy merges of faces to construct the complete tree.
This is accomplished by using the dual graph of the mesh ( every face is a vertex, and faces are
connected by edges only if they are adjacent) and performing edge contractions.
An edge contraction collapses two dual vertices into a single dual vertex, thereby clustering
groups of faces together into a single simply connected chart. Every edge is assigned an error based
on the resulting planarity and compactness of the chart that would result in contracting the edge.
At each step of the algorithm, and edge is removed from the top of the queue, a contractions of the
edge is performed, and the queue is updated. This is repeated until a complete face cluster tree is
formed.
Since every node in the tree represents a simply connected regions of triangles, any cut through
the resulting face cluster tree may be used as a partitioning of the mesh’s faces into a collections
charts. The location of the cut may be chosen based on the planarity and compactness error at the
level of the tree, and the desired number of charts.
Sander et al. [99] used this agglomerative approach to devise an atlas of packed clusters resulting
from a modified progressive-mesh simplification process. Neighboring triangles were repeatedly
merged into sufficiently co-planar clusters with minimal boundary length, and these boundaries
were “straightened” afterward into rough polygons. These clusters were then rescaled relative to
each other to more evenly distribute the samples, and strip packed (using bounding rectangles)
into the texture map as shown in Figure 3.1(e).
3.2.6 Greedy Region Growing Methods
Another class of chart formations algorithms involves greedy region growing methods [65, 96].
These algorithms start by choosing a set of seed faces (referred to as cluster chart centers) on the
mesh. A region is grown out around each seed face by adding surrounding faces (that have not
been assigned a region) that meet with some error criterion. Region growth is terminated when
48
adding additional faces to the region results in the loss of some desired properties for the region
(e.g. topology, compactness, flatness, or low Gaussian curvature). Additional seeds may be added
and region growing performed until every face has been assigned a region. The resulting regions
form the set of charts for parametrization.
Le´vy et al. [65] start by detecting features of the mesh. Features here are defined as sharp edges
along which curvature is high. Dijkstra’s shortest path algorithm is used to compute the shortest
path from every face in the mesh to a feature edge. This distance is stored with every face and
referred to as the distance to features Morse function. Faces at the local maxima of this function
provide the starting seed locations for the greedy region growing algorithm. Region growing is
performed simultaneously out from each seed adding the faces with higher distance to features
first. This ensures that chart growth does not cross along feature boundaries.
Sander et al. [96] propose an adaptation of Lloyd’s [67] vector quantization algorithm to de-
termine seed locations. At the start initial seed faces are chosen on the mesh. The seed faces are
equivalent to the cluster centers described in Lloyd’s algorithm. Greedy region growing is performed
simultaneously around each seed face using Dijkstra’s shortest path algorithm, so that every face
is assigned to its closet seed faces (as defined by the shortest path in the dual graph). The distance
function between adjacent faces F and F ′ connected vertices in the graph is given by:
cost(F, F ′) = (λ− (Nc −NF ′))||PF − PF ′ ||2 (3.1)
where PF and PF ′ are the location of the faces centers, Nc is the average normal of the region, and
NF ′ is the normal for face F ′. Sander chose the parameter λ = 1.0 giving no distance penalty to
faces co-planar with the global region normal.
After every face has been assigned to a region, a new seed for each region is computed. This
is done by performing Dijkstra shortest path algorithm inward on each region to find the faces
that are furthest from each region’s border. These faces become the starting seeds for successive
iterations of the algorithm. The process is repeated until convergence is reached ( or a cycle is
detected ).
Both the approaches detailed by Le´vy et al. and Sander et al. produce charts that avoid crossing
areas of high curvature. Both of these approaches, however, do not explicitly distinguish between
49
Gaussian curvature and mean curvature, which is less than ideal. The later of the approaches has
added flexibility in that it can be be made to favor the growth of compact charts by setting λ = 2.0.
Sorkine et al. [105] took a different approach to the charting process. They performing greedy
region growing and parametrization as a combined step. A starting seed face is first selected and
parameterized. The a cluster is greedily grown outward from this face. As each face is added to this
cluster, it is added to the cluster parametrization. Cluster growth is terminated when the cluster
exceeds some threshold of distortion, or exceeds some compactness criteria. Additional seed faces
are chosen until the entire mesh has been broken int parameterized clusters. This approach has
the nice property of providing a bound on the distortion in the resulting parametrization.
For methods that handle chart generation separately from chart parametrization, the problem
arises as to how each chart should be flattened into the plane.
3.3 Chart Parametrization
A number of methods have been proposed for flattening meshed surfaces that are already topological
discs requiring the surface to have a single boundary loop. The loop boundary is either set to lie on
some fixed boundary in the parameter domain, or it is allowed to freely move during the flattening
process to take on a more natural shape.
3.3.1 Quadratic Metrics
A number of approaches for mesh flattening have been proposed that involve solving a linear system.
Loop boundary vertices are fixed lie on the boundary of some convex shape in the parameter domain.
A linear system is formulated to determine the location of the remaining mesh vertices that are not
part of the boundary. The are commonly referred to as internal vertices since they get mapped
inside the convex region.
To construct the linear system, every internal vertices location is defined to be some weighted
combination of its neighboring vertex locations in the parameter domain. The choice of weights
can be arbitrary, however, it has been proven that to guarantee and embedding, a vertex must be
an convex combination of its neighbors[109, 32]. A number of methods for assigning weights have
been proposed, only some of which guarantee an embedding.
50
One of the simplest vertex weighting schemes proposed by Tutte [109], is to weight every vertices
neighbors equally. Such a mapping guarantee’s an embedding, however, angles or shapes of triangles
are not preserved by the method. Eck et al. [25], suggest a non-uniform weighting scheme based
on edge lengths that will generate a harmonic map. Pinkall and Polthier [84] also showed how
these weights minimize Dirichlet energy and generate conformal maps (LSCM). The edge weights,
however, in some case become negative resulting on triangle fold-over. Floater [32, 33] improves up
this method by providing choices for weights based on mean value coordinates that better preserves
triangle shape while guaranteeing an embedding.
Desbrun et al. [22] further generalized such mappings by noting that the choices of weighting
schemes ( and therefore parameterizations ) is limited under the assumption of translation and scale
invariance. Choices for weights that preserve angles (conformal) or area (authalic) are provided,
and it is argued that any admissible set of choices for weights is a linear combination of the two.
Unlike the method proposed by Floater, the provided weighting scheme does not guarantee a valid
embedding for the mesh. Le´vy et al. provide an alternative method for computing LCSM for
meshes[65]. Ray and Le´vy propose a hierarchical multi-grid scheme that allows for fast evaluation
of the LSCM useful for large models[90].
Guskov [39] developed a choice of weights based on Floater’s method that takes into account
anisotropy, where triangles are allowed to distort more (or less) in given directions. Their method
does not guarantee a convex mapping ( and therefore fold-over may occur ). They suggest the use
of clamping the weight values as a means to control such behavior.
Proposed methods based on linear solves for surface parametrization are fast, robust, and
easy to implement, however, they have significant drawbacks. First, while providing translational,
rotational, and scale invariance, they are limited in how the stretch and distortion is distributed
across the mesh. This may lead to significant under-sampling and over-sampling in regions of the
model when using the parametrization for applications such as texture mapping. Secondly, many
linear methods require that loop boundary vertices are placed along the boundary of a fixed convex
region in the parameter domain. The linear solve does not allow for the vertices to move and take
on a natural shape, and the convexity of the boundary may add more distortion than would occur
if the boundary could be freely chosen. For these reasons research has been done into non-linear
51
metrics that can better preserve properties of the mesh.
3.3.2 Non-linear Metrics.
Le´vy and Mallet [64] proposed a non-linear metric based on preserving the spacing of isoparametric
curves defined by the mapping. Sheffer and de Sturler [100] proposed another non-linear metric
that attempts to minimize the deformation of angles during the flattening process. Hormann and
Greiner [48] proposed a different metric based on the ratio of singular values of the 2×3 matrices that
transforms triangles from 3D into the parameter domain. For both methods, the boundary vertices
in the parameter domain are allowed move, taking on a natural shape that best minimizes the
prescribed metric. Additional constraints are applied during the minimization process to prevent
the moving boundary from overlapping during flattening.
Both of the metrics provided by Scheffer and de Sturler, and Hormann and Greiner are invariant
under rotations and translations and scaling. The latter of which is undesirable since area is an
important property to preserve during parametrization. Large triangles may get mapped to small
triangles resulting in poor sampling when used in texture mapping. Such algorithms, however, do
perform well when flattening meshes with near zero Gaussian curvature.
Geometric Stretch Metrics.
To preserve both area and stretching, Sander et al. [99] took a similar approach to that of Horman
and Greiner. Rather than looking at the ratio of singular values Sander et al. expressed the error
as the root mean square of the two singular values, referred to as their L2 metric.
The L2 “stretch” metric is defined for a mesh M as follows: given the triangle Ti and its 3D
coordinates (x1,x2,x3) and its surface texture coordinates (u1,u2,u3), let φ be the implied affine
map such that φ(ui) = xi, and let J be its 3 × 2 Jacobian. (The inverse of φ is the texture map
u(x)). The L2(φ) norm was set to the RMS of the two singular values of J, and was extended





where A(Ti) is the surface area of the triangle Ti defined by its 3D coordinates (x1,x2,x3). This
52
φFigure 3.3: Single chart parameterized using the L2 metric proposed by Sander et al. [99].
was not the first use of the singular values of φ. Their squares were present in the trace of the first
fundamental form Iφ = JTJ used by [68] in their minimization of texture distortion. Figure 3.3
shows a model of a parasaur that has been mapped using the L2 stretch metric.
Alternatively Sander et al. suggested using the max singular value (as opposed to the RMS of
the two singular values) referred to as the L∞ metric. The L∞ metric can be used to measure the
maximum stretch of a mapped triangle. Both the L2 metric, and the L∞ metric, do not penalize
shrinkage. For this reason, Sorkine et al. [105] suggested taking the max of either the largest
singular value or one over the smallest singular value, thereby penalizing stretch and shrinkage.
Zhang et al. [119] also noted that the L2 metric does not distinguish between isotropic versus
anisotropic stretch. They proposed using the Green-Lagrange tensor to analyze the symmetric
matrix JTJ , as similarly proposed by Maillot et al. [68].
Signal Specialized Stretch Metric.
Up until now, we have only discussed metrics for flattening meshes that rely strictly on the geometric
properties of the mesh. In practice we have have some signal over our mesh (colors, normals, etc..)
that we wish to preserve. Sander et al. proposed a signal specialized stretch metric based on the
L2 metric, that measures the stretch in signal space[97].
When forming a parametrization that preserves a some surface signal s˜ ∈ Rn, we must examine
the affine mapping from texture positions to signal space S˜. This mapping is given by φ˜ : ui = s˜i.
The Jacobian J˜ of this transformation is a n × 2 matrix. Just like in the geometric stretch case,
we can analyze the tensor matrix J˜T J˜ to determine the maximum and minimum signal stretch
53





where δ is the sampling density in texture space. This error term gives the squared ( two norm )
distance between the signal at a point and its reconstruction.
3.4 Chart Packing
After a mesh has been cut into one to many coordinate charts, and each chart has been flattened,
the problem arises as to how to best pack them without overlap into the (u, v) ∈ [0, 1] texture
domain. The goal of the packing procedure is to minimize the amount of texture space that does
not get covered by any chart. This problem, known at the pants packing problem, has been shown
to be NP-hard[72, 73, 74]. A number of heuristics have been proposed for packing three sided
charts[70, 94, 17, 98, 16], four sided charts [76, 49], and n-sides polygons [99, 65, 96].
3.5 Metrics and Goals for Texture Atlases
In this section we discuss metrics for analyzing texture atlases and briefly discuss tradeoffs between
various trade-offs between proposed texture atlas methods.
3.5.1 Coverage
The coverage C of an atlas measures how effectively the parameterization uses the available pixels
in the texture map. The coverage ranges between zero and one and indicates the percentage of the





where A() returns the area of a triangle. We assume the texture map is a unit square.
54
3.5.2 Sample Distribution
Whereas the coverage measures how well the parameterization utilizes texture samples, the relative
scale S indicates how evenly samples are distributed across the surface. We measure the relative













The additional summation factor computes the surface area of the object in model space, and
normalizes the relative scale so it can be used as a measure to compare the quality of atlases across
different models. A relative scale less than one indicates that the atlas is contracting a significant
number of large triangles too severely, whereas a relative scale greater than one indicates that small
triangles are taking up too large a portion of the texture map.
The relative scale is similar to the L2 “stretch” metric 3.2. Both the relative scale S and stretch
L2 represent the RMS of the area sampling, but the stretch metric can detect anisotropy in the
scaling, such as when a sliver surface triangle maps to an equilateral texture triangle of equal area.
The relative scale is also affected by the coverage. An inefficient packing can scale chart images
smaller than necessary in order to make them fit into available texture space.
3.5.3 Distortion
The measure of distortion for a texture atlas may be done with the non-linear metrics mentioned
section 3.3.2. By area weighting the metrics evaluated over each triangle as presented in equa-
tion 3.2, a global distortion metric for the entire mesh may be derived.
3.5.4 Discontinuity
Texture atlases are discontinuous along the boundaries of their charts. Texture mapping can reveal
these discontinuities as a rendering artifact known as a seam. Seams are pixels in the texture map
along the edges of charts. They appear along the mesh edges as specks of the wrong color, either
the texture map’s background color or a color from a different part of the texture.









Figure 3.4: Texture pixel samples are located at the center of grid cells (a). The rasterization of
triangles A and B (shown in blue and yellow) does not cover their entire domain (shown by their
boundaries). Because of this difference, nearest-neighbor sampling misclassifies the red pixels (c)
resulting in seam artifacts.
images in the texture atlas, e.g. [99]. These partitions improved the atlas continuity, resulting in
fewer charts, though with complex boundaries. While this method reduced seams to the complex
boundaries of fewer charts, it did not eliminate them.
Our real-time procedural solid texturing system rasterizes triangles into the texture map using
its surface texture coordinates. Seams can appear because the rasterization rules differ from the
sampling rules of the texture magnification filter. The rules of polygon scan conversion are designed
with the goal of plotting each pixel in a local polygonal mesh neighborhood only once1. The rules
for texture magnification are designed to appropriately sample a texture when the sample location
is not the center of a pixel, usually either the nearest-neighbor sample or bilinear interpolation of
the nearest four samples.
Figure 3.4 demonstrates the disparity between the rasterization and sampling of two triangles
A and B. Assume for the moment that integer coordinates in the texture map correspond to the
centers of texture samples (the pixels of the texture map) in the configuration shown in (a). Since
these integer pixel coordinates occur at the center of the grid cells, the grid cell indicates the set
of points whose nearest neighbor is the sample located at the cell’s center. Two triangles with
integer coordinates are rasterized into the texture map, as shown in (b) using the standard rules
[34]. Un-rasterized pixels remain white. The chart images of triangles A and B are also shown in
(b). The texture sampling process needs to reconstruct an approximate sample for any point in
1 Missing pixels can result in holes or even cracks in the mesh, whereas plotting the same pixel twice (once for














Figure 3.5: Standard rasterization rules disagree with texture sampling rules (a). Moving the
texture coordinate vertices one-half pixel diagonally aligns the rasterization with the sampling (b),
except along the hypotenuse. Rasterizing a shifted A and an enlarged B (c) results in a rasterization
that can be by nearest-neighbor sampled correctly by the shifted A and original B (d) at the cost
of an extra column of pixels.
these chart image regions. The nearest sample for some points in both regions A and B are white
background pixels. The nearest neighbor for some points near the shared hypotenuse in region A
are pixels rasterized from plotting B. These regions are indicated in red in (c).
A common solution is to over-scan the polygons in the texture map, but surrounding all three
edges of each triangle with a one-pixel safety zone wastes valuable texture samples.
Figure 3.5 illustrates a better solution, adapted from [94]. Triangles A and B have been raster-
ized into the texture map in (a), as shown before. The surface texture coordinates of the triangles
in (b) have been offset by one half pixel but the rasterization of this triangle in (b) generates the
same pixels as in (a). The rasterization rules generate the same pixels for the triangles in both
(a) and (b), but the texture coordinates interpolated across the no background pixels (e.g. white
pixels) would be accessed by a nearest-neighbor texture magnification filter in (b). However, a
nearest neighbor filter of samples in triangle B near its hypotenuse can still return A’s color. The
nearest-neighbor misclassification along the hypotenuse is fixed by shifting the texture coordinates
of triangle A right by one pixel, and enlarging B by one pixel horizontally and vertically, as shown
in (c). The nearest neighbor sampling of the shifted A and the original B now find samples from
the appropriate rasterization. This over-scanning solution reduces the coverage slightly, but only
costs one column of pixels for each triangle pair in a horizontal strip.













Figure 3.6: Reducing the rasterization of the triangles by half a pixel yields the same rasterization
(a). Similarly reducing the texture regions (b) leads to partial support for bilinear reconstruction.
If A and B are neighboring triangles on the object surface and sharing their hypotenuse edge (c)
then support for bilinear reconstruction is complete (d).
mesh atlas without over-scanning, as the edge they share has appropriate pixels on either side of
it.
Figure 3.6 illustrates how contracting the triangles by one half pixel yields a result that supports
bilinear interpolation of samples. The rasterization remains the same, shown in (a). For points
away from the hypotenuse in the reduced triangle regions A and B shown in (b), the four nearest
samples centers of cells) are from the appropriate rasterization. If triangles A and B are neighbors
on the object surface and share the edge that forms their hypotenuse, then bilinear interpolation
would also be correct near the hypotenuse (and the extra column of pixels would not be necessary).
The multi-resolution atlas introduced in Section 5 can support this shared hypotenuse property,
such that the resulting atlas can support bilinear magnification filtering of texture samples.
3.6 Conclusion
In this chapter we have surveyed the field of texture atlas generation. The presented set of metrics:
sample distribution, distortion, discontinuity, and coverage, must be examined as a whole. Many of
the metrics compete against one another, and as a result lead to challenging trade-offs for texture
atlas construction. In the next chapter we detail a new approach for texture atlases construction
that attempts to address limitations present in many existing methods.
58
Chapter 4
The Multi-Resolution Meshed Atlas
(MMA) 1
4.1 Introduction
In this section present a new texture atlas method that provides the fundamental construct used
for surface processing used in the rest of this thesis. Our method provides an atlas that utilizes all
available texture samples, supports post-filter mip-mapping, and obeys rasterization rules so that
it may be directly processed in graphics hardware without forming seams.
Many texture atlas methods, although providing very low distortion solutions, ignore ( or do
not handle well ) the issue of achieving good texel coverage. For example, chart packing methods
as proposed by in [99, 96, 65, 68] waste as much as a third or more of the available texture samples.
Such wasted space is inefficient, and leads to problems during post-filtering. Our solution is to
utilize all available texture samples, but with the trade-off of having slightly higher distortion.
Another aspect of texture atlas generation that is commonly ignored by many texture atlas
algorithms is the ability to post-filter the texture map resulting from the texture atlas scheme. For
example, texture minification produces aliasing artifacts when projected texture resolution exceeds
screen resolution. The MIP-map [117] is the primary mechanism for inhibiting texture minification
aliases in modern computer graphics hardware. In order to inhibit texture minification aliases our
meshed texture atlases needs to support MIP-mapping.
Lastly we are interested in forming a texture atlas that can be rasterized on the GPU without
forming cracks between charts. This ability is of fundamental importance for the GPU surface
1Contains work previously published in ACM Transactions on Graphics 2002[16]
59
processing applications presented chapter 6.
To date we are aware of only two atlas algorithms that support similar features for complex
topological meshes. The progressive mesh atlas [99] supports MIP-mapping by interpolating colors
in the crevices between packed chart images; but because the chart images were strip packed in
size order, areas of separate sections of the surface could be blended together at lower resolution
levels of minification. In contrast, our novel clustering method yields rectangular chart images that
pack efficiently and according to a property that preserves spatial coherence across all levels of the
MIP-map. The irregular shaped charts, although providing low distortion, do not obey graphics
card rasterization rules producing seams when directly rendered. Furthermore, texture space is
wasted between packed charts.
The goals behind the development of geometry images[38] are most closely related to our work.
Geometry images provide an atlas that is completely continuous within the texture domain leading
fewer discontinuities that our method. Being designed with surface storage and processing in mind,
it is both mip-mappable and rasterization friendly so is well suited for surface processing. The
geometry image approach differs from our method in the it relies on a single chart. Although the
use of a single chart has reduced seams, it may be difficult to find cuts that allow the mesh to
be flattened into a rectangular region without incurring excessive distortion and stretch. We view
geometry images as an alternative to our approach when higher degree of continuity is desired.
Our method for texture atlas generation is based on the idea of packing triangles or quadrilat-
erals into texture space. This approach has been taken by other researchers as well [70, 94, 8, 17],
however, none of them provide packing in a way that allows for mip-mapping.
4.2 MIP Mapping a Texture Atlas
A MIP-map is a multi-resolution pyramid of textures, filtering the texture from full resolution in
half-resolution steps down to a single pixel. Each pixel at level l of a MIP-map represents 4l pixels
of the full resolution texture map (at level zero).
Assume we have a uniform mesh atlas where the adjacent edge a of each of the triangles is a
power of two. Then at levels up to la = lg a, some pixels from both sides of a triangle pair will

































Figure 4.1: The triangles on the left are mapped into texture map in the same order they appear in
the surface mesh. The triangles on the right are mapped into the texture map in a different order,
though the triangles in each quadrant are connected neighborhoods. Averaging quadrants for the
next level of the MIP-map does not depend on the order of the elements.
in the surface mesh.
At level la + 1, four neighboring triangle-pairs in the texture map are averaged together. The
uniform mesh atlas cannot be MIP-mapped at level la+1 or above as there is no spatial relationship
between triangles in the atlas. We can however impose a spatial relationship on the uniform mesh
atlas that permits MIP-mapping above level la.
At level la, triangle pairs are each represented by a single pixel. At level la + 1, the result of
averaging neighboring triangles pairs is a single pixel. Hence, the mesh needs to have neighborhoods
of triangle pairs grouped together, but the grouping need not be in any particular order.
This subset property states that each quadrant at levels above la of the MIP-map describes a
connected neighborhood of mesh elements, but these elements need not adhere to any particular
order. Figure 4.1 illustrates this subset property.
We implement this subset property on the atlas by partitioning the surface mesh hierarchi-
cally into an area-balanced quadtree. Each level of the quadtree partitions the mesh into disjoint
contiguous sections.
We implement this subset property on the atlas by partitioning the surface mesh hierarchi-
cally into an area-balanced quadtree. Each level of the quadtree partitions the mesh into disjoint
contiguous sections, as shown later in Figure 15. We implement this quadtree using a face cluster-
ing method described in the next section. As Figure 4.2 shows, the square subquandrants of the
MIP-map correspond to partitions on the model surface.
61
(a) (b)
Figure 4.2: Moon textured with a multiresolution atlas: at full texture resolution (a) and at
decreasing MIP map levels (b).
We have two other goals in the construction of our quadtree partition of the mesh.
1. The quadtree should be balanced, yielding a hierarchy of sub-meshes of equal surface area.
2. The seam length separating partitions at each level of the quadtree should be minimized.
3. Near the leaves of the quad-tree each patch should be developable leading to low distortion
and stretch when parameterized into a rectangle.
We used the Metis partitioning algorithm to achieve these goals, yielding balanced partitions such
as the one shown in figure 4.3.
4.3 Chart Generation using Recursive Bisection
We now detail one approach that may be used for forming the quad-tree MMA hierarchy. Later in
Chapter 5 we detail additional methods that are practical in dynamic environments.
We cluster the faces of our mesh using the Metis multi-constraint mesh partitioning algorithm
[56, 57, 55]. Mesh clustering has already found a wide variety of applications in computer graphics,
e.g. [54, 35]. The Metis algorithm begins by finding an acceptable partitioning of weighted vertices
62
Figure 4.3: A tie-dyed moon at full texture resolution (left) and the MIP-map used to generate it
(right). (The missing triangle is explained at the end of Section 4.4.)
of a graph simplified by repeated edge collapses. It then tries to preserve this solution as the graph
is unsimplified via vertex splits back to its original resolution. Metis can generate face clusters
when applied to the dual of the mesh.
Each vertex in a simplification of the dual mesh is weighted by the surface area of the face
cluster it represents in the original mesh. The merging of vertices in a simplified dual represents
the clustering of faces in the original mesh. Each edge in the dual mesh is weighted by the number
of edges it crosses in the original graph. For example in figure 4.4, Vi and Vj represent dual vertices
that are formed as a result of a series of edge contractions. They are connected by a single edge
Eij , whose weight is 6 (equal to the number of edges crossing between Vi and Vj in the original
un-contracted graph).
Every dual vertex v is assigned a weight, denoted |v|, corresponding to the surface area of the
triangle face it represents in the 3D model. Given a dual M = (V,E) of a given subset of the
original mesh, we need to find a disjoint partition V = Vi∪Vj of its vertices whose summed weights








 ≤ 1 (4.1)
and the cardinality of the set Eij = (Vi × V j) ∩ E of edges connecting the partition Vi to Vj is at
63
Vi VjEij















Figure 4.5: Weighted edge collapses.
least close to minimal. Such a partition is illustrated in Figure 4.4.
Figure 4.5 demonstrates a pair of edge collapses. The weight of a vertex created by an edge
collapse is the sum of the collapsed edge’s vertex weights. The weight of the collapsed edge is
removed (“hidden” inside the vertex), but if an edge collapse merges two edges, then their weights
are summed. Hence the weight of each vertex is the area of the cluster of faces it represents, and
the weight of each edge is the number of edges it represents.
We perform a heavy-edge matching to determine the order of edge collapse. By collapsing the
heaviest edges first, we effectively hide as many edges in the original base as possible. This leaves





we are trying to minimize in the partitioning phase.
Once the dual graph has been simplified, typically down to about ten percent of its original
vertex count, a greedy region-growing algorithm finds a good initial partitioning.





Figure 4.6: Recursive partitioning of 2-D texture space (top) and head mesh data set (bottom).
shuﬄes the partitions to maintain optimality [30]. It prioritizes each vertex in both partitions based
on its “gain,” the change in |Eij | if the vertex were moved from one partition to the other. Then
it repeatedly selects the vertex with the most negative gain in either partition, moves it to the
other partition, and recomputes the gain for all affected vertices. If the new partitioning is nearly
balanced, then the partition data is saved.
While a direct four-way partitioning algorithm could potentially achieve better results than
repeated two-way partitioning [57], the simpler two-way partitioning provided reasonable results.
Figure 4.6 shows the face clusters on the original mesh and their corresponding allocations of texture
space (along with the number of faces in the corresponding cluster). These surface area of these
partitions is balanced within 1.5%. This process is recursively applied to each partition containing
more than four triangles.
4.4 Mapping Triangles into Texture Space
Once the MMA quad-tree hierarch has been formed, the triangles are mapped into the texture
domain. Every level of the MMA tree represents both a region of texture domain and a simple
connected collection of triangles. At any level of this hierarchy, the collection of triangles may be
mapped into the texture domain, provided that they are homeomorphic to a disc. This can be done
using any number of methods as discussed in section 3.3. Note that the partitioning algorithm
65
2-strip 3-fan 4-strip 4-fan 4-ring 4-star
Figure 4.7: Base case square atlas partitions for mesh components of two, three or four triangles.
3-ring 3-fan
Figure 4.8: Cutting a 3-ring into a 3-fan.
provides no constraints on topology or developability of a sub-cluster within the MMA tree. Checks
may be performed to evaluate each nodes topology, compactness, and developability. If an MMA
node satisfies the desired constraints it is flattened, otherwise the nodes descendants are examined.
Our original implementation solves this problem by recursing down the MMA tree mapping all
nodes with four our fewer triangles. When a partition contains four or fewer triangles, it is mapped
directly into 2-D texture space based on its configuration.
Figure 4.7 shows layouts of six of the eight possible configurations of four or fewer triangles.
These layouts are either square or rectangular in texture space, depending on the parity of the
current level within the partition tree. Note that all shared edges between triangles internal to
the rectangular region are also shared edges between the triangles in object space. This eliminates
the necessity to follow the rasterization rules within each square/rectangular region of texture
space, which resulted in blocks one pixel wider than tall. Since blocks are now square, they easily
cover the entire texture map. Furthermore in all but the 2-strip, the location of non-corner texture
coordinates may be chosen to reduce sampling bias in the map. Our present implementation chooses
these free coordinates as bisections.
Figure 4.8 shows the 3-ring, one of the two remaining partitions that cannot be directly deformed
into a square. The one edge of the 3-ring case must be cut to turn the 3-ring into a 3-fan for it to
be laid out in a rectangular region.
66
The final case is when a partition contains a single triangle. We modified the partitioning
algorithm to borrow a triangle from a neighboring partition to ensure all partitions have at least
two connected triangles. This slightly reduces the optimality of the partitioning but avoids wasted
samples, which was the ultimate goal of the partitioning. Our present implementation will not
borrow a triangle when the imbalance is greater than 7:1, which is why a black triangle appears
in the atlas in Figure 4.3. We developed special filters for our MIP-map construction that would
ignore samples from these triangle holes during the down-sampling process.
4.5 Discussion
Observation and analysis of the multiresolution atlas algorithm has yielded a number of benefits
and drawbacks.
• Total Coverage. The multi-resolution approach represents a complete use of texture in-
formation, which results in more texture detail being placed on the model during rendering
time. This added detail helps to offset the additional distortion that occurs as a result of
forcing square/rectangular charts.
• Magnification Filtering. Texture magnification filtering, particularly for high frequency
textures is imperative for reducing temporal and spatial aliasing. The spatial coherence of the
partitioning allows for linear texture magnification filters to be applied to the mesh within
the rectangular partition boundary. The half-pixel boundary introduced by contracting our
partition corner vertices one-half pixel (as described in Section 3.5.4) creates exactly enough
room to support linear magnification filtering at the lowest level of the mip-map pyramid.
Our experiments verified that bilinear magnification filtering did not yield any seam artifacts.
Bilinear filtering at the other levels of the mip-map does result in artifacts. This is due to
the face that the gutter spacing is reduced by half at each higher level of the mip-map. For
example, the lowest level of the mip-map pyramid has 12 texel gutter. The next level up has
a 14 texel gutter, and so bilinear filtering breaks down. This suggests using different texture
coordinates for each level of the mip-map. The MMA method does correctly support mip-map
filtering and linear interpolation between mip-map levels, which significantly reduces aliasing.
67
• Sliver Sensitivity. The quality of the MMA map is largely dependant on the tessellation of
the model. Models with sliver triangles prevent the MMA algorithm from finding clusters that
parameterize well into square/rectangular regions. Retessellation of such models is required
to achieve satisfactory results.
• Distortion. The MMA construction method detailed in this section completely ignores the
issue of distortion and measured atlas quality completely in terms of sampling area ratios.
Improvements with this construction can be made integrating some distortion metric ( such
as the L2 stretch metric [99]) during parametrization of the base level charts.
• Continuity and Seams. Although the MMA approach carefully assigns samples to each
chart to minimize visual discontinuities, it may be advantages to reduces the number of seams
in the model. This can be done by reducing the number of charts by allowing more triangles
per chart than that suggest by Figure 4.7. This may be more important when working with
large polygon count models.
68
Chapter 5
Dynamic Texture Atlas Management
5.1 Introduction
In this chapter we present two new algorithms for maintaining disjoint face clusterings on dynamics
meshes that both run in fractions of a second for meshes containing over 100k faces. We then show
how these fast clustering methods may be applied to construct the MMA texture atlas (4) so that
it may be updated at interactive to real-time rates. This process leads us to a solution for dynamic
parametrization, where the texture atlas may be rapidly updated to provide accurate sampling on
meshes undergoing free form deformations with changing surface signal information.
The decomposition of polygonal meshes into collections of smaller components is an important
problem in computer graphics having numerous applications such as texture mapping, collision
detection, radiosity, and compression [96, 65, 37, 42, 54]. Although many algorithms have been
proposed for the problem, to our knowledge, no known method has dealt with the issue of main-
taining such clusterings for non-static meshes undergoing free-form deformations.
The first algorithm is based on a randomized data structure and provides a complete balanced
binary face cluster hierarchy. The algorithm attempts to achieve cluster compactness while main-
taining some balance metric at each level of the hierarchy. Such balance metrics might include
surface area, triangle count, sample distribution importance, or spatial spread. The second algo-
rithm computes optimized face clusterings for a fixed set of clusters and then constructs a complete
hierarchy from the resulting clusters.
69
5.2 Bloom Filter Clustering
5.2.1 Introduction
This section describes a rapid method for re-balancing the MMA hierarchy behind a multi-resolution
mesh atlas. Though the re-balancing is approximate, Section 5.2.7 later shows it is well behaved
in typical cases.
5.2.2 Node Imbalance.
We define the imbalance of a node n in a hierarchical face clustering as
βn =

0 if n is a leaf,
δl/δr − 1 if δl > δr,
δr/δl − 1 otherwise.
(5.1)
where as before δl and δr are the texture sample distribution metric of the left and right children
of n respectively. The imbalance indicates how much heavier one child is as a percentage of the
weight of the other.
When the imbalance of a node reaches some acceptability threshold, the hierarchy needs to
be rebalanced. We rebalance the hierarchy with a greedy search over local perturbations of the
tree described later in Section 5.2.5. However, tree perturbation can result in bad clustering,
where sibling clusters sharing few edges are merged into long skinny parent clusters. Section 5.2.3
describes a fast method for estimating the number of shared edges between two clusters, and this
estimate can help determine which tree perturbations to avoid.
5.2.3 Cluster Perimeter Estimation
Given two nodes in the tree ni and nj , we desire to know the total length of the shared border
between their corresponding clusters. This length can be found as the length of the intersection of
the set of perimeter edges of cluster ni with the set of perimeter edges of cluster nj . But maintaining
complete perimeter edge sets at every node in the tree is prohibitively expensive. To reduce
memory and especially computation requirements, we instead use a randomized data structures
70
that approximates these perimeters and measures shared boundary length more efficiently.
5.2.4 Bloom Filters
Bloom filters are space and time efficient data structures for answering fast set queries with con-
trolled error [11]. A Bloom filter B(X) = (v, {hi}ki=1) operates on a subset X of some domain
space X, and is comprised of a bit-vector v of m bits, and a set {hi}ki=1 of k hash functions
hi : X→ {1, 2, . . . ,m}. To insert an element x ∈ X into B(X) we set the bits in v indexed by hi(x)
for all i ∈ 1 . . . k. An element x is predicted by B(X) to be in X if the bits in v indexed by the k
hash functions on x are all set. This approach prevents false negatives; if the bit in v indexed by
any hi(x) is not set, then x 6∈ X. But there may be false positives; there may exist y ∈ X \X such
that every bit in v indexed by an hi(y) is set. The error rate (probability of a false positive) can
be controlled by increasing the number hash functions k and the bit-vector length m.
We Bloom filter the perimeter edge sets of every cluster in the hierarchy. First a unique id is
assigned to every edge in the mesh. Then we construct a perimeter bit vector v of a cluster by
hashing the ids of the edges in its perimeter. The Manhattan metric (L1 norm) on this bit vector,
denoted ||vn||, is the edge length of this perimeter predicted by the Bloom filter. The shared
boundary between two clusters corresponding to two nodes i and j is predicted as the number of
set bits the two vectors vi and vj have in common, or more concisely ||vi&vj ||.
We store a perimeter bit vector vn at every node n in the binary tree. The perimeter bit vector
of a leaf node is constructed by hashing the ids of the edges in its cluster’s perimeter. The perimeter
bit vector of a parent is computed by the element wise exclusive-or of the bit vectors of its two
children, which resets the bits corresponding to the shared edges of its children’s perimeters.
The hashing functions, on which Bloom filters depend, can and do collide, introducing false
positive errors. Section 5.2.7 shows this error is strictly a function of the height of the tree and the
number of bits hashed at the leaves. Hence local tree perturbations do not accumulate additional
error in the approximation.
71
Perimeter Metric.
We define a perimeter metric γn for every tree node n as
γn =
 0 if n is a leaf,||vl||+ ||vr||+ α| ||vl|| − ||vr|| | otherwise, (5.2)
Where vl and vr are the perimeter edge set bit vectors for the left and right child of node n
respectively. The node’s perimeter metric is a function of the children’s perimeters (instead of its
own perimeter) because the tree perturbation methods described next alter the child perimeters,
but leave the node’s perimeter unchanged. The value α is a freely chosen parameter that gives some
preference to balanced perimeter lengths. If α = 0, then the metric may be small even though one
sibling’s perimeter is nicely rounded while the other’s is long and thin. We have found setting
α = 1/4 suffices.
5.2.5 Metric Minimization via Hierarchy Perturbation
Finding a good hierarchical face clustering becomes the task of finding a binary tree with every
node n suitably balanced (according to the imbalance βn) and corresponding to a well-shaped
cluster (according to the perimeter metric γn). We search the space of all such hierarchies with a
greedy minimization that explores perturbations of the children and grandchildren of each node.
We explore two kinds of perturbations: rotations and grandchild swapping.







A B C D A BC D A B CD
(d) (e) (f )
Figure 5.1: The tree (a) can be rotated into both configurations (b) and (c) because the order of
children within a node is irrelevant. The grandchildren (d) can likewise be swapped into configu-
rations (e) and (f).
Figure 5.1 demonstrates half of the tree rotations available to a node. The relevant children and
grandchildren of a node are shown in (a). A standard rotation leads to configuration (b). However,
since the order of the children within a node is irrelevant, switching grandchild A with B before
the rotation leads to (c). We also investigate the two opposite rotations where the right child is
72





= 6 possible pairings of four grandchildren into
two child nodes, organized into three different configurations (d),(e) and (f).
To maintain triangle cluster coherence, every tree node must correspond to a simply-connected
cluster of triangles. Both the rotation and grandchild swap may violate this property.
We prune the available perturbations to better maintain connected partitions. Of the four
rotations, we select only the one whose new grandchild pairing has the largest estimated shared
edge length. Of the grandchild swaps, we reject the ones that do not meet or exceed the estimated
shared edge lengths between the existing grandchild pairings.
The perimeter metric and imbalance are then evaluated on the remaining perturbations. To
further minimize perimeter length, we fix an imbalance threshold τ . If a node’s imbalance exceeds
τ , we choose the perturbation with the smallest perimeter metric that brings the node closer to
balance. If the node’s imbalance is already within τ , we choose the perturbation with the smallest
perimeter metric whose imbalance remains below τ.
If a rotation or grandchild swap is found and implemented, newly grouped descendants might,
though rarely, violate the imbalance threshold τ. After a rotation we rebalance the parent of the
newly paired grandchildren. After a grandchild swap, we rebalance both children of the base node.
5.2.6 Tree Balancing Algorithms
Balanced Tree Construction
Construction of the initial face cluster hierarchy is performed bottom-up. The lossy nature of the
Bloom filtering of perimeter length prevents triangle insertion into the tree in a top down manner.
A single triangle is selected from the mesh to form the first node of the tree. Triangles are added
to the hierarchy from a breadth first traversal expanding from this initial triangle. Each triangle is
inserted at the bottom of the tree by forming a new two-triangle cluster with an adjacent triangle
that has already been inserted in the tree. The insertion of each new triangle into the tree requires
rebalancing to minimize the prescribed metrics. For each new triangle, we rebalance all of the
nodes in its ancestry, from the new two-triangle cluster all the way to the tree’s root. Hence, for
reasonable meshes, a balanced hierarchy is constructed in approximately O(n log n) time, where n
is the number of triangles.
73
Algorithm Rebalance(Node n)
Evaluate metrics β(n) and γ(n)
Initialize the best choice n∗ = n
If (β(n) > τ) Then For n′ ∈ Perturbations(n)
If (γ(n′) < γ(n∗) and β(n′) < β(n))
n∗ = n′
Else For n′ ∈ Perturbations(n)
If (γ(n′) < γ(n∗) and β(n′) < τ)
n∗ = n′
Replace n with n∗
If n∗ was a rotation right Then Rebalance(RChild(n))
Else If n∗ was a rotation left Then Rebalance(LChild(n))
Else If n∗ was a granchild swap
Rebalance(LChild(n)) Rebalance(RChild(n))
If (p = Parent(n)) Rebalance(p)
Figure 5.2: Algorithm to rebalance the tree after the metric of a single node has changed.
Balanced Tree Maintenance
When the signal or shape of a set of triangles changes, the meshed atlas needs to be rebalanced.
This rebalancing occurs by calling Rebalance(n) on each leaf node n whose cluster contains at least
one of the changed triangles. Note that this results in a walk up the ancestry of each of these leaf
nodes, even though they may share common ancestors. Simultaneously updating the metrics of all
of the leaf nodes sends the tree too far out of balance to be easily recovered by the greedy local
perturbation search. Assuming a constant number of changed leaf nodes, maintenance is thus, on
average, an O(log n) time process.
5.2.7 Analysis
This section derives the error rate of the Bloom filter as a function of its level in the hierarchy,
which leads to a heuristic for choosing an appropriate bit-vector size m. An additional outcome of
this analysis is the discovery that hashing two bits per edge instead of one provided better cluster
shapes.
We first consider the error imposed by hashing. Suppose a node corresponds to a cluster with
a perimeter of e edges. The node contains a bit vector v of m elements that underestimates the
74
perimeter edge length, because collisions may occur in the hashing of perimeter edges. If we assume
a uniform hashing distribution, the probability that a particular bit of v will be zero is ((m−1)/m)e.









It follows that the expected number of collisions that occur when constructing v from the e perime-
ter edges is the difference between the actual perimeter edge length e and the expected perimeter
length underestimate
E[coll] = e− E[||v||]. (5.4)





This allows us to compute Table 5.1, which shows the chances that an edge is lost due to a
hashing collision at the single-triangle leaf nodes, by computing Ploss for hash functions that set
one bit per edge e = 3 and various bit-vector lengths m. For later reference, we also compute the
leaf-node Ploss for hash functions that set two bits per edge by setting e = 6.
bit vector size m Ploss, e = 3 Ploss, e = 6
2 KBytes 0.0488% 0.122%
4 KBytes 0.0244% 0.061%
8 KBytes 0.0122% 0.031%
Table 5.1: Bit loss rate at leaf nodes.
We now develop a model to extend Ploss to reveal the percentage of bits lost at every level
higher in the tree. Suppose a parent node P has two children C1 and C2 each having n bits set.
Let
E1 = Event that a particular bit in v(C1) is 1,
E2 = Event that another particular bit in v(C2) is 1,
E3 = Event that yet another particular bit in v(P ) is 1.
75
Note that the probabilities of the first two events Pr[E1] = Pr[E2] = n/m. Since the bit vector in
P is the exclusive-or of the bit vectors of its two children, we can express the probability that a
given bit is set in v(P ) as
Pr[E3] = Pr[E1 ∪ E2]− Pr[E1E2]
= Pr[E1] + Pr[E2]− 2Pr[E1E2]
= 2n/m− 2Pr[E1E2]
We do not know a priori the number of edges shared between the clusters of C1 and C2.
In other words E1 and E2 are not mutually independent, therefore Pr[E1E2] 6= Pr[E1]Pr[E2].
Thus, we predict the percentage of a cluster’s edges that are shared with a single neighboring
cluster. If clusters form equilateral triangles, then neighboring clusters each share 1/3 of their edges.
Quadrilateral clusters share 1/4 of their edges, and hexagonal clusters share 1/6. We measured the
mean fraction of shared edges resulting from our clustering method on several meshes, and found
it surprisingly to uniformly approximate 1/5.
We now examine the growth rates of cluster perimeters under various neighboring predictions.
Let c denote the assumed percentage of edges a cluster shares with a single neighboring cluster, e.g.
1/3, 1/4 or 1/6. Let ei be the perimeter edge length of a cluster at level i, where i = 0 corresponds
to a leaf node. Thus e0 = 3 and
ei = 2ei−1 − 2cei−1,
= 2(1− c)ei−1,
= 3(2(1− c))i. (5.6)
Table 5.2 illustrates the perimeter growth rate under various edges-shared percentages c.
As expected, the growth of the perimeter edge set grows much slower than the number of
triangles in the cluster, which suggests the efficiency of Bloom filters when working with large
meshes. It is interesting to note that decreasing c from triangular (c = 1/3) to hexagonal (c = 1/6)
does not imply rounder clusters. In fact the perimeter lengths increase which implies skinnier
76
triangles edges c = 1/3 c = 1/4 c = 1/6
1K 1,623 53 173 496
8K 12,580 126 584 2,296
256K 395,433 532 4,437 29,539
1M 1,577,852 946 9,976 82,053
Table 5.2: Cluster perimeter edge lengths ei as a function of cluster size and shared edge percentage
c.
clusters that reach more neighboring clusters.
An expression can now be formulated for Pr[E1E2] using our cluster connectedness assumption
c. We define the two events
E2b = a particular shared border bit in C2 is 1,
E2p = a particular non-shared perimeter bit in C2 is 1.
The probability of Pr[E1E2] may now be expressed
Pr[E1E2] = Pr[E1]Pr[E2|E1],










Pr[E3] = 2n(1− c)(1− n/m).
Using this model, the expected number of bits set in node P will be





We now have an expression that predicts the number of bits set in a node given the number of bits
set in its two children. The expected number of collisions that have occurred higher up in the tree
77
can be expressed as follows:
E[coll(P )] = e(P )− E[||w(P )||] (5.7)
where e(P ) is the actual number of edges in the perimeter of the cluster corresponding to P. As
before, Ploss(P ) = E[coll(P )]/e(P ).
Error increases in our model due to collisions. As our bit vector is required to hold more and
more edges, it becomes saturated. The effect of saturation is that more collisions occur and thus
information is lost during the exclusive-or operation. Table 5.3 uses the above formulated model
to show the percent loss occurring at various subtree levels.
tree size Ploss
leaves (tris) m=2K m=4K m=8K m=16K
1K 14.67% 7.80% 4.03% 2.05%
8K 38.41% 23.01% 12.73% 6.71%
256K 85.30% 73.04% 55.71% 37.17%
1M 93.24% 86.83% 75.53% 58.89%
Table 5.3: Perimeter Loss Rates, c = 1/4, 1 bit per edge.
A loss rate of 50% at a given level of the tree implies that half of the perimeter bits have been
lost or more concretely that our estimate of the edge length of the perimeter of our root cluster is
half of its actual edge length. Although this may seem sloppy, we only need enough bits to make
good judgements on which local tree perturbations to perform. At higher levels in our tree, the loss
rate increases, but the number of bits set increases, allowing us to still make good choices about
what tree operations should be performed.
In practice we found the loss at higher levels in the tree to be satisfactory. We did find that
random collisions occurring very low in the tree proved fatal, leading the algorithm to make poor
tree operation choices. To prevent this we chose to hash two bits per edge. Although our overall
perimeter loss rate increased, we found that our clusters became better shaped (rounder) due to




leaves (tris) m=2K m=4K m=8K m=16K
1K 26.15% 14.68% 7.81% 4.03%
8K 57.06% 38.41% 23.02% 12.73%
256K 92.41% 85.30% 73.04% 55.71%
1M 96.59% 93.24% 86.83% 75.53%
Table 5.4: Perimeter Loss Rates, c = 1/4, 2 bits per edge.
5.2.8 Bloom Filter Limitations
Although Bloom filters have provided a reasonable solution to the clustering problem, the method
requires significant amount of storage and computation. As analyzed in section 5.2.1 the error grows
considerable with the height of the tree as the bit-fields become saturated and more collisions occur.
As a result the metric tends to become less reliable. This becomes problematic when choosing tavg
to be larger since the clusters higher in the tree are less round due to the break down in metric and
as a result do not parameterize very well.
This break down in metric results in a more insidious problem of forming clusters containing
a significant number of border triangles that have two edges along the border. Vertices impacting
these edges must be mapped to the corners of the rectangular domain or else the triangle containing
the vertex will become degenerate during parametrization. These vertices are commonly referred
to as corner vertices [38]. More than four of such corner vertices make it impossible to form a
mapping that does not result in degenerate parameterizations (e.g. zero texture area triangles).
Furthermore, the Bloom filter heuristic (perimeter minimization) provides only a weak metric
since ideally we desire clusters that are developable and are parameterizable into a square region
with low distortion on the underlying surface signal over the cluster. The Bloom filter perimeter
metric is purely geometry based, and thus can lead to poor sampling and distortion of the surface
signal.
Another observation about the proposed clustering strategy, is that nodes in the MMA tree
above where we’ve chosen to parameterize(form charts), do not require the strict connectivity
constraint on their children. Only balance and proximity is important. As a result, the Bloom
filter metrics places unnecessary restrictions on the nodes higher in the tree, which may led to
sub-optimal clusters being parameterized.
79
Tree Balancing.
We would also like to note the limitations in the re-balancing strategies used in section 5.2.1. Simple
tree rotations and grandchildren swapping are very locally greedy and limit the algorithms ability
to find a good minimum. Consider the boundary resulting from the 2-way mesh partitioning at
the root of the MMA tree. It is very difficult for triangles(or clusters of triangles) stored deep in
the tree to ever move across this boundary since the only way for this to occur is for a rotation or
grandchild swap to occur at the root. The obvious solution to this is to find additional strategies
besides rotations and grandchildren swapping that allow clusters of triangles to regroup with other
clusters in a local region of the mesh. This becomes difficult since the only connectivity information
is encoded in the Bloom filters. Additional adjacency information is require for a less locally greedy
approach.
5.2.9 Conclusion
Although Bloom filters has drawbacks for use in constructing MMA texture atlases, we believe
that it still has merit for other applications that may require face clusterings. Since the core of our
Bloom filter algorithm is a fast approximation to a complex graph partitioning problem, similar to
Metis [56], it may have application in a number of areas beyond computer graphics. Our algorithm
easily extends to higher dimensions where it could form a volume balanced hierarchical clustering
of a tetrahedral mesh, that minimizes surface area between tetrahedral clusters. Such higher
dimensional clustering schemes may prove beneficial in the area of data analysis or visualization.
80
5.3 Adaptive k-means Style Clustering on a Surface
5.3.1 Introduction
In this section we describe an alternative approach to dynamic face clustering. This approach
finds starts by finding suitable clusters for parametrization. These clusters are then placed into a
complete balanced MMA hierarchy and parameterized. We closely based our clustering strategy
on existing methods, however, we adapt these approaches to allow the clustering to be updated at
interactive rates.
As discussed in section 3, Levy et al. , Schlafmen et al. , Sander et al. ,and Sorkine et al. [105]
proposed greedy regions growing methods for chart construction[65, 96]. The method proposed
in [105] performs cluster determination and parametrization on the fly. This method, however,
assumes parametrization is performed into regions with natural boundaries so is not suitable for
MMA construction. We chose to adapt the approach proposed by Sander et al. for two reasons.
First, it has the added benefit of allowing a mesh to be decomposed into any number of clusters.
Secondly, it is based on an iterative k-means type approach allowing the solution to be updated
rather than re-computed during mesh changes.
5.3.2 Background
The algorithm presented by Sander et al. in [96], can be broken down into two separate phases
that are iteratively applied until convergence is reached: an outward region growing phase, and an
inward growing phase. The outward region growing phase, grows clusters outward from a set of seed
faces also referred to as cluster centers. Faces in the mesh are assigned to the cluster represented
by their closest cluster center where distance is defined to be the shortest path in the dual graph.
The inward growing phase performs a shortest path search inward on each cluster to determine
the face that is most central to the cluster. These faces become the cluster centers for the next
iteration of the algorithm. The two phases are applied until convergence is reached and the cluster
centers do not change.
Both the outward region growing phase and inward growing phase use Dijkstra’s search algo-
rithm, touching every face in the mesh. Dijkstra’s algorithm has the well known running time of
81
O((V +E)lgV ), or O(V lgV +E) when implemented with a Fibonacci heap[19]. The latter approach,
however, has a high cost hidden in constant factors. The overall running time for the algorithm is
given by: O(I(V + E)lgV ), where I is the number of iterations required for convergence.
Initialization.
For every dual vertex u, a scalar distance DIST(u), and its closest cluster CLUSTER(u) is main-
tained. The distance is defined to be the minimum distance (along the dual graph) from u its
closest cluster center vertex whose cluster is given by CLUSTER(u). Sander et al. 5.9 proposed
the following edge distance metric between adjacent faces F and F ′ in the dual graph:
DIST(F, F ′) = (λ− (Nc −NF ′))||PF − PF ′ ||2 (5.8)
where PF and PF ′ are the location of the faces centers, Nc is the average normal of the region, and
NF ′ is the normal for face F ′. Sander chose the parameter λ = 1.0 giving no distance penalty to
faces co-planar with the global region normal. Schlafman et al. proposed a similar metric:
DIST(F, F ′) = (1− λ)(1− cos2(α)) + (λ)||PF − PF ′ ||2 (5.9)
Here α is the dihedral angle between the faces, and λ ∈ [0, 1] is freely chosen select between the first
term ( angular distance) and the second term which approximates geodesic distance. We prefer the
metric given by 5.8 since the metric in 5.9 relies on edge lengths to be on a similar scale to that of
the angular distance term.
At the start of the process k seed faces (dual vertices) are selected to act as centers for the mesh
clustering process. These may be selected randomly, or any number of heuristic approaches. The
k-means strategy is only guaranteed to converge to a local minima, and the so the starting choice
of seed faces impacts both the number of iterations required for convergence, but also the quality
of the resulting clusters.
For achieving our starting seed configuration we use Euclidean distance and use recursive co-
ordinate bisection[102]. Faces are treated as a 3D point-set using their centroid locations. We
compute the center of mass for the point set, and its 3 × 3 inertial tensor matrix. The center of
82
mass and minimum eigenvector of the tensor matrix defines a plane that minimizes the sum of
squared distance between all points in the set and the plane. The cutting plane partitions the
points into two spatially coherent groups. We recursively apply the process to divide the points
into smaller sets until the number of sets equals the number of starting seeds. For each set we find
the most central face ( in the Euclidean sense) and use it as a starting seed face.
Phase 1: Outward Region Growing
The cluster center vertices are assigned a distance of zero (since they are are zero distance from
a cluster center), and assigned to a unique cluster. All other dual vertices are assigned an infinite
distance since their shortest paths to a cluster center are currently unknown. Cluster center vertices
are then placed into a priority queue. At each step of the region growing process, a dual vertex u
is removed from the top of the priority queue, and assigned to its closest cluster, CLUSTER(u).
Relaxation is performed on the neighbor set v ∈ Adj[u] of u, where Adj[u] is the set of neigh-
boring vertices attached by an edge to u in the dual graph. This is done by computing a distance d
for each adjacent face that has not been assigned to a cluster. This distance is computed as follows:
d = DIST(u)+DIST(u, v). If d is less than the currently known distance for v: DIST(v), then u
is relaxed by setting DIST(v) = d and CLUSTER(v) =CLUSTER(u), and updating the location
of v in the priority queue accordingly. Pseudo code for phase 1 of Sander’s algorithm is given in
algorithm 5.3.
Phase 2: Inward Region Growing.
Following phase 1, every face has been assigned to a cluster. The quality of each cluster is dependant
on both the number of seed faces, and the choice of seed face locations. Phase 2, computes a new
seed face for each cluster. The new seed face for each cluster is chosen to be the face whose distance
(along the dual graph) to the cluster border is maximal. These face are found by placing the faces
along the borders of each cluster onto a heap, and assigning them a zero distance. Regions are
grown inward using the same Dijkstra search algorithm detailed in phase 1. The last face reached
within each cluster during this inward growing process becomes the new seed face for the cluster.
Using the updated seed faces, phase 1 may be applied to compute the new clusters. The two
83
Algorithm Phase1(faceHeap H)
while H 6= Ø
u ← EXTRACT-MIN(H)
C ← CLUSTER(u)
C ← C ∪ u
for each v ∈ Adj[u]
d ← DIST(u,v)
if DIST(u) + d < DIST(v)
DIST(v) ← DIST(u) + d
CLUSTER(v) ← CLUSTER(u)
if v ∈ H
DECREASE-KEY( H,v,DIST(v))
Figure 5.3: Outward region growing algorithm implemented using a single heap.
phases are iterated on until convergence or a cycle is detected.
5.3.3 Multiple Heaps
To improve the running time of the algorithm we start by decomposing the problem into a collection
of heaps; one per cluster. The algorithm proceeds in the similar manner as that presented in 5.3.
Rather adding faces to a global heap during the relaxation process, faces are added to the heap
whose cluster center is currently closest. The order in which faces are processed is handled by
selecting the cluster heap, whose top element has the minimum distance. This search may be
accelerated by placing all clusters into a global priority queue ( heap ) represented by HG in
algorithm 5.4.
By decomposing the problem into a collection of heaps, the inward region growing process
(phase 2) may be handled separately for each cluster. This is done by iterating over the clusters,
and performing the Dijkstra shortest path search inward from the cluster’s border to find the face
that is most distant from the edge.
Assuming every cluster grows to contain the same number of triangles, the running time of
phase 1 is reduced to O((V lg(k)+ (V +E)lg(V/k)), and phase 2 is reduced to O((V +E)lg(V/k)).
84
Algorithm Phase1 Optimized(globalHeap HG)
while HG 6= Ø
C ← EXTRACT-MIN(HG)
u ← EXTRACT-MIN(HC)
C ← C ∪ u
for each v ∈ Adj[u]
d ← DIST(u,v)
if DIST(u) + d < DIST(v)
DIST(v) ← DIST(u) + d
CLUSTER(v) ← CLUSTER(u)
if v ∈ Hi, Hi 6= HC
DELETE( Hi,v)




if HC 6= Ø
u ← TOP(HC)
INSERT( HG,C,DIST(u))
Figure 5.4: Optimization for the outward region growing process. Nested heap structure used to
reduce search times.
5.3.4 Border Sets
The optimization presented above reduces the cost of executing a single iteration of the k-means
process. The cost of each successive iteration, however, remains constant. Ideally we wish the
iteration cost to decrease as it converges towards the solution.
The key observation is that interactions on successive iterations occur on the borders between
cluster regions. We store with each cluster C the set BC of triangles on its border. A triangle t is
in BC if it belongs to the cluster C, but shares at least one edge with another cluster.
If a cluster’s border set has not changed following an iteration of phase 1, then clearly the seed
face for that cluster does not change either. In this case, phase 2 (the inward growing process) need
not be performed for that cluster.
The known border is also used to improve the performance of phase 1. The idea is that the
current clusters provides a good starting guess for an iteration of phase 1. Rather than initializing
each cluster’s heap HC with the cluster’s single seed face, we insert the cluster’s border set BC into
85
its heap. Successive iterations of phase 1 are therefore performed starting with each cluster already
grown out to its border. The cost of phase 1 is reduced to the movement of a few triangles between
clusters on the frontier between cluster regions.
In the event that a cluster’s border set changes following phase 1, a new seed face must be
computed for that cluster. We apply a phase 2 operation to that cluster to find the new cluster
center. For phase 1, however, we must know the distance between every face and its closest cluster
center. This requires the distances within a cluster to be recomputed by growing the patch outward
from its new cluster center to its border. In some cases, however, this situation may be avoided. If
the result of performing a phase 2 operation on the cluster finds the cluster seed face unchanged,
then the distances to the cluster center may not need to be computed. In the case of the DIST ()
metric provided by Sander et al. , distances are impacted by the global chart normal so this
optimization cannot be used.
The last improvement is to note that that if a cluster C and all of its adjacent clusters Cadj
remain unchanged after an iteration of phase 1 and phase 2, then C cannot grow outward during
the next iteration of phase 1. Therefore, we do not place cluster C on the global head HG, or
initialize C’s cluster heap HC with its border set.
5.3.5 Cluster Boundary Smoothing
Using the above detailed approach, cluster may be rapidly found on meshes that may deform over
time. Ideally we wish that every cluster parameterize well into the square domain. To improve
cluster parametrization we post-process the the clusters removing corner triangles.
Corner triangles (triangles that are connected to a cluster by a single edge) greatly impact
sample distribution when parameterizing onto a square domain [38]. To avoid these corner triangles
from being degenerate in the parameter domain they are mapped to one of the four corners of the
rectangular parameter space. A cluster containing more than four corner triangles cannot be
mapped without forming a degeneracy.
To avoid this problem, we start by eliminating as many corner triangles as possible. Any corner
triangle that shares two edges with the same cluster may be eliminated by reassigning the triangle
to that cluster. Corner triangles that are adjacent to three different clusters are greedily assigned
86
to the cluster with the fewest number of corner triangles. If both clusters have the same number of
corner triangles, the corner triangle is assigned to the cluster that maximizes the distance between
any two corner triangles within a cluster. This keeps the corner triangles assignments spread out
within a cluster lead to a better parameterization. Figure 5.5, shows an example of the quality
clustering that may be achieved using Sander’s algorithm followed by corner triangle reduction.
Charted Bunny (a) (b)
Figure 5.5: Chart clustering before (a) and after (b) triangle corner reduction.
5.3.6 Chart Parameterization
As mentioned in chapter 3 charts may be statically parameterized into the square u, v ∈ [0, 1]2
domain using any number of methods [99, 33, 22]. Methods that solve a linear system to determine
u, v parameter locations, provide a fast means for flattening a mesh. Some linear methods guarantee
the absence of fold-over [33]. In practice we have found that linear parameterizations are not
sufficient to adequately control distortion and sample distribution. As a result we turned to the
non-linear geometric L2 stretch metric proposed by Sander et al.[99].
Biased Metric.
The L2 metric used is purely geometry based. Let triangle Ti have 3D coordinates xi = (x1, x2, x3)
and 2D coordinates ui = (u1, u2, u3). Let A(Ti) denote the area of the triangle defined by coordinate
points xt. We wish to bias this metric to account for the varying signal importance across the surface
of our mesh. This can be done by applying a unique linear transformation Si : xi → xˆi to each







where A() returns the area of the triangle specified by the given coordinates. In our implementation,
we use the scalar per-triangle sample distribution metric δi to form the uniform scaling matrix Si.
Application of non-uniform scaling could be applied to better represent triangles with strongly
anisotropic surface signal frequencies.
Solving the System.
The L2 stretch metric leads to a non-linear minimization problem. Solving such systems may be
expensive and costly. We start the optimization process by providing an initial starting guess using
the linear solve weighting scheme proposed by Tutte [32]. We chose this weighting scheme for two
reasons: first, it guarantees no fold-overs occur and secondly, it leads to a sparse symmetric system
that can be rapidly solved using the pre-conditioned conjugate gradient method.
Rather than performing random line searches on local neighborhoods by optimizing a single
vertex at a time as done in [99], we attempted a global approach to the optimization problem.
To simplify the computation of the analytic gradient, we minimize the square of the L2 metric.
Let Uˆn denote a vector containing the locations of the freely moving u, v coordinates of our mesh.
This includes all of the u, v vertex location on the interior of the parameterization, in addition to
the freely moving u or v coordinates from vertices on the boundary. The gradient ∇(L2(M) of
the error metric L2(M) with respect to the freely moving u, v coordinates Uˆn has a closed form
analytic solution. We use the conjugate gradient approach for non-linear optimization as detailed
by Fletcher and Reeves [31] to search for a minimum.
Chart Re-parametrization.
When the signal changes, each impacted chart may be re-parameterized to move more samples to
locations where they will better reproduce the new signal. We use the existing parametrization as
a starting point for the optimization process, which converges rapidly especially when the change
in surface signal is small.
88
5.3.7 Chart Balancing
Each parameterized chart is placed into a balanced MMA hierarchy similar to that presented in
section 4. In this case the leaves of the MMA tree are charts rather than triangles. Rather than
working with a quaternary tree structure as suggested in 4, we work with a binary tree for simplicity.
Odd levels of the binary tree correspond to mip-map subquadrants. In the case a leaf terminates
at an even level of the tree hierarchy, the chart is mapped into a rectangular region corresponding
to half of a mip-map subquadrant.
Two primary goals are present in the construction of the MMA tree. First the sample metric
should be evenly balanced at every node of the tree. Secondly, for correct mip-mapping results,
clusters contained within a subtree should be spatially proximate. A subtree, however, need not be
comprised of clusters that are simply connected.
The location of a chart within the MMA tree determines a unique rectangular region of the
texture atlas for the chart. Each chart’s texture coordinates are scaled and translated (from the
[0, 1]2 domain) during run-time to their appropriate location in the texture atlas.
We form the MMA hierarchy by combining two heuristic approaches: a top down divisive
approach, and a bottom up agglomerative approach.
We let the sample distribution metric δ of a cluster be the area weighted mean of the δ values
of its triangles. The average reflects the effectiveness of the parametrization redistributing samples
within a chart.
We start forming our tree in a divisive manner using recursive coordinate bisection [102]. The
center of mass for each cluster is computed, yielding a single 3D coordinate value per-cluster. The
points are used to compute a 3 tensor matrix whose smallest eigenvector determines an axis where
the point have greatest variance. The axis is used to define the orientation of a partitioning plane.
The plane is swept along this axis until the sum of the chart sample metrics δn on either side of
the plane are nearly equal. Using this plane, the points (clusters) are divided into two spatially
coherent sets with nearly equal metric values. This process may be recursively applied to each
sub-group to construct the full MMA tree.
To find the axis we form the symmetric 3× 3 inertial tensor matrix I from the cluster centers.
The splitting axis is given by the smallest eigenvector of the inertial tensor matrix. Given a
89









Ixx = n(var(yi) + var(zi)),
Iyy = n(var(xi) + var(zi)),
Izz = n(var(xi) + var(yi)),
Ixy = Iyx = −n cov(xi, yi),
Iyz = Izy = −n cov(yi, zi),
Ixz = Izx = −n cov(xi, zi),
where var(), cov() indicate the sample variance and covariance of the n cluster centers. The inertial
tensor is construced in linear time, but since its size is always 3 × 3, its eigenvalues are found in
constant time.
The recursive coordinate bisection approach produces good results in forming balanced parti-
tions at the top of the MMA tree. Near the leaves of the final tree, the algorithm fails to find
a good partitioning due to the small set size. To prevent poor balancing near the leaves of the
tree, we detect when the recursive bisection algorithm divides a set into subsets that fall below a
threshold size of 16 clusters. A bottom-up approach is then used for forming that subtree.
The agglomerative algorithm joins together clusters/sub-trees that are both of equal sample
metric importance and spatially proximate to form larger subtrees. All of the clusters in a sub-tree
are placed in a priority queue based on minimum δ. At each iteration, a cluster c0 is removed
from the top of the queue. A search is performed on the elements in the queue to find the best
cluster c1 with which to pair c0 with. We choose the cluster whose center is closest c0, and whose
sample metric δc1 is within 30% of δc0 . If no matching cluster can be found, we pair c0 with the
next cluster on the top of the queue yielding the best balance. We remove c1 from the queue, and
group c0 and c1 into a subtree/super-cluster represented by a parent node p. The node p is inserted
into the queue for the next iteration. The algorithm continues until the group of clusters has been
combined into one subtree.
Figure 5.6 shows the top three levels of the MMA tree formed by this process. Levels 1 and 2
clearly show the location of the splitting planes used to bisect regions. At level 3 the agglomeration
90
level 1 level 2 level 3
Figure 5.6: Tree formed on the bunny model using recursive bisection with inertial partitioning at
the top levels of the tree, and agglomerative approach for constructing lower subtrees.
strategy was executed on some of subtrees as evidenced by the loss of some proximity between
neighboring clusters.
5.3.8 k-means MMA Results
These enhancements play a significant role in the performance of the clustering process. As the
algorithm converges, the number of changing clusters decreases along with the processing time.
Figure 5.7 shows the clustering times for the horse model. Note that the cost per iterations for
the original clustering approach proposed by Sander et al. [96] remains constant. By maintaining
cluster boundaries our method achieves a significant performance which grows as the clusters move
closer towards convergence.
Clustering Performance Resulting Clusters
Figure 5.7: Clustering performance on the horse model containing 100k faces, using k = 256 and
randomly chosen seed faces.
To test our dynamic clustering software we integrated our algorithm into an simple modeler.
In Figure 5.8 we demonstrate the Buddha and bunny model being deformed and reclustered at
91
interactive rates, using the clustering metric from equation 5.8. Each cluster is guided to be both
compact and planar, as would be useful for parametrization.
Figure 5.8: Buddha and bunny model (100K/69K faces) being interactively deformed and auto-
matically re-clustered at >10Hz. Clusters are optimized for planarity and compactness.
Figure 5.9 compares the quality of parametrization achievable by our dynamic method with that
of other popular schemes. These existing methods do not adapt to surface signal information so we
examined sample distribution based on geometric properties of the mesh. We use the relative scale
metric as define in Carr [16], which relates the ratio of normalized triangle surface area in 3D to its
area in texture space. Methods (e.g. (a)-(c)) lead to large over-sampled and under-sampled regions
of texture space. The non-linear L2 metric with a single chart (d) leads to much more equitable
sample distribution than the linear methods. Further improvement is made by breaking the mesh
into multiple charts (e), however, the mesh is under-sampled everywhere since many texels are
92
left unused. The MMA map (f) further reduced the relative scale error by utilizing all available
samples. The maximum relative scale error for any triangle in the mesh is slightly greater than that
of (e) since charts were forced into rectangular domains, rather than taking natural boundaries(f).
(a) Authalic (b) Conformal (c) Mean Value
(d) L2 Stretch (e) Zang et al. Stretch (f ) L2 Multi-Chart
(g) MMA Bloom Filters (h) MMA k-Means L2 Stretch
undersampledoversampled
Figure 5.9: Plots showing relative scale for a number of parametrization methods. Single-chart
linear solves (a)–(c) [22, 33], non-linear geometric stretch metric with single chart and multiple
charts (d)-(f) [99], and Dynamic MMA mapping with Bloom filters(g) and k-means (h).
5.4 Conclusions and Discussion
We have detail two separate methods for maintaining dynamic face clusterings on surfaces. We
used these approaches in implementing the MMA texture atlas to provide two methods for fully
dynamic parametrization.
In practice we found that the Bloom filter clustering method for the purpose of parametrization
93
inferior to that of the latter approach. This is largely do to the corner triangle problem, which
results in a large number of clusters. By controlling the number of clusters using k-means style
clustering followed by smoothing of cluster boundaries, we were able to increase the continuity in





The MMA texture atlas has many useful a properties and can be applied to a wide range of
applications. One obvious application is that of solid procedural texturing[79, 81]. The MMA
texture atlas provides a place to store and evaluate and lighting and viewing independent procedural
shader components[16]. Although graphics hardware is making great strides towards real-time
evaluation of shaders on a per-fragment basis, shader evaluation remains to be a computationally
expensive part of the graphics pipeline. The texture atlas provides a nice domain for caching these
results for later use between frames.
In this chapter we explore two less obvious applications made possible by the features of the
texture atlas method presented in Chapter 4. Section 6.2 shows how the hierarchical properties of
texture atlas may be used to implement subsurface scattering effects. Section 6.3 details a dynamic
painting system that fully utilizes the dynamic nature of our texture atlas method from Chapter 5
to redistribute surface samples. Both applications take advantage of the GPU friendly nature of the
parametrization method to allow for fast surface computation to take place in graphics hardware
at interactive to real-time rates.
95
6.2 Subsurface Scattering on the GPU 1
(cream) (cream) (apple)
Figure 6.1: Buddha with and without subsurface scattering: 100k faces, 20 fps, Nvidia QuadroFX
2000.
6.2.1 Introduction
The standard rendering equation using a BRDF approximation has been used for many years to
model light transport and the reflectance of surfaces. However, the BRDF formulation assumes
that light entering a material at a point also leaves the material at the same point. For many real-
world surface this approximation is sufficient, but for numerous translucent materials (skin, milk,
marble, etc..), much of their appearance comes from internal scattering of light. Figure 6.1 shows
examples Buddha statue with and without subsurface scattering effects. To account for subsurface
scattering, the BRDF is replaced with a more general model known as the BSSRDF (bidirectional
surface scattering reflectance distribution function).
We base our subsurface scattering scheme on the method derived by Jensen et al.[52]. For
subsurface scattering as proposed by Jensen et al. [51] to be integrated into into an existing
renderer involves two major steps. First, the incident illumination must be evaluated on all sides of
the model ( and not just visible portions) and stored in some data structure. Secondly light paths
interior to the model must be evaluated, transporting light through the interior of the model. We
found that our MMA algorithm as presented in chapter 4 an ideal data structure that can assist in
1Contains work previously published in Graphics Hardware 2003[15]
96
both steps of the process and lead to a real-time solution for rendering subsurface scattering effects
for static models under dynamic lighting and viewing conditions.
6.2.2 Related Work
Hanrahan and Krueger[44] showed subsurface scattering to be an important phenomena when
rendering translucent surfaces, and used a path tracing simulation to render skin and leaves. Pharr
et al.[82] extended these techniques into a general Monte-Carlo ray tracing framework. Jensen and
Buhler[52] used a dipole and diffusion to approximate multiple scattering. These methods made
clear the importance of subsurface scattering to the graphics community, and led some to consider
additional approximations and accelerations.
Jensen et al.[51] accelerates subsurface scattering using a hierarchical approach consisting of
several passes. Their first pass finds surface irradiances from external light whereas the second pass
transfers these irradiances to the other surface patches. Their hierarchical approach uses an octree
to volumetrically represent the scattered irradiances in the interior of the translucent substance.
Hao et al.[45] approximated subsurface scattering using only local illumination, by bleeding
illumination from neighboring vertices to approximate local back scattering. They precompute
a scattering term for source lighting expressed in a piecewise linear basis. They reconstructed
scattering per-vertex from directional light by linearly interpolating these terms computed from
the nearest surrounding samples. Their technique was implemented on the CPU but achieved
real-time rendering speeds.
Sloan et al.[103] uses a similar precomputation strategy, though using a spherical harmonic basis
for source lighting. They precomputed per-vertex a transfer matrix of spherical harmonic coeffi-
cients from environment-mapped incoming radiance to per-vertex exit radiance that includes the
effects of intra-object shadowing and interreflection in addition to subsurface scattering. They were
able to compress these large transfer matrices in order to implement the evaluation of precomputed
radiance transfer entirely on the GPU to achieve real-time frame rates.
Lensch et al.[63] approximated back scattering by filtering the incident illumination stored in a
texture atlas. The shape of these kernels is surface dependent and precomputed before lighting is
applied. They also approximated forward scattering by precomputing a vertex-to-vertex throughput
97
factor, which resembles a form factor. Forward scattering is rendered by performing a step similar
to radiosity gathering, by collecting for a given vertex the irradiance from the other vertices scaled
by the throughput factor.
While Lensch et al.[63] benefited from some hardware acceleration, for example using a vertex
shader to accumulate external irradiances, the application of the vertex-to-vertex throughput factors
to estimate forward scattering, and the atlas kernel filtering to estimate backward scattering is
performed on the CPU (actually a pair of CPUs). They expressed a desire to explore full GPU
implementations of these techniques, and our GPU implementation of subsurface scattering is an
extension of their technique. But as we have discovered and documented in this paper, their
technique requires novel extensions to be efficiently implemented given the limited resources of a
modern GPU.
6.2.3 Subsurface Scattering on the GPU




Ω S(xi, ~ωi;xo, ~wo)Li(xi, ~ωi)(~ni · ~ωi)d~ωidA(xi).
(6.1)
Integration is performed over the surface A at points xi and all incident light directions. The term
S is the BSSRDF, which relates the amount of outgoing radiance at point xo in direction ~ωo, given
that there is incoming radiance at some other surface point xi in direction ~ωo.
Φ Flux,Power (Watts)













Jensen et. al. noted that single scattering for many common materials only accounts for a
small percentage of the outgoing radiance of a material[51]. Also, light tends to become heavily
scattered when it enters materials, removing the relationship between incident and exitant light
directions. This simplifies the BSSRDF to a four dimensional function Rd known as the diffuse













Li(xi, ~ωi)Ft(η, ~ωi)|~ni · ~ωi|d~ωi (6.4)
where Rd(xi, xo) is the diffuse subsurface scattering reflectance. It encodes geometric information
and also the volumetric material properties anywhere in the object dealing with light transport
from xi to xo. For Rd we use the dipole approximation model detailed by Jensen et. al. [51], and
also used in Lensch [63]. Also found in Jensen et. al. are the scattering and absorption coefficients:
σ′s, σa for common materials that are used in this model. For brevity we have omitted the details
here.
Lensch et. al. noted this strong similarity between the radiosity formula and the formula in
(6.4). This equation may be solved by discretizing the scene into patches.
6.2.4 Real-Time Subsurface Approximation Algorithm
We start by discretizing our object into a collection of N patches where Pi and Pj are patches on





which resembles a single transport step of radiosity transport. The multiple diffuse scattering









Rd(xj , xi)dxjdxi. (6.6)
For a static model, we can precompute the Fij factors between all pairs of patches. Using (6.5),
the radiosity due to diffuse multiple scattering now reduces to a simple inner product for each patch
resulting in O(N2) operations to compute the incident scattered irradiance for all patches.
A simple way to reduce the number of interactions is to turn to a clustering strategy like that
used to solve the N-body problem and hierarchical radiosity[42]. This is particularly applicable
to subsurface scattering since the amount of scattered radiance drops rapidly as it travels further
through the media. This implies that patches that are far away from the patch whose scattered ra-
diosity we are computing may be clustered together and be replaced by a single term to approximate
the interaction.
6.2.5 A Three Pass GPU Method
We now formalize a solution the diffuse subsurface scattering equation (6.5) as a three pass GPU
scheme (as shown in Fig. 6.2) preceded by a pre-computation phase. By assuming that our geometry











Figure 6.2: Three passes for rendering subsurface scattering on the GPU.
Our first pass to the GPU computes the amount of light incident on every patch of our model,
and scales this by the Fresnel term, storing the radiosity that each patch emits internal to the
object. This map forms our radiosity map.
Our second pass acts as a gathering phase evaluating equation (6.5). For every patch/texel the
100
transmitted radiosity is gathered and scaled by the precomputed throughput factors and stored
into a scattered irradiance map.
In our third and final pass we render our geometry to the screen using the standard OpenGL
lighting model. The contribution from subsurface scattered light is added in by applying the
scattered irradiance texture map to the surface of the object scaled by the Fresnel term.
Pass 1 Pass 2 Pass 3
Figure 6.3: Pass one plots triangles using their texture coordinates (left), interpolating vertex colors
set to their direct illumination scaled by a Fresnel transmission term. Pass two transfers these
radiances (center) via precomputed scattering links implemented as dependent texture lookups
into a MIP-map of the previous pass result (left). Pass three scales the resulting radiances by a
Fresnel transmission factor and texture maps the result onto the original model.
Pre-computation Phase
We start by forming a hierarchical disjoint face clustering and texture atlas parameterization of our
mesh with a method previously developed for real-time procedural solid texturing[16]. Every texel
in our texture atlas corresponds to a patch on the surface of our model. This parameterization
method is ideal for a GPU solution to the subsurface scattering problem for a number of reasons.
First, it provides a natural face cluster hierarchy necessary for a hierarchical approach. Secondly, it
works directly with the GPU rasterization rules allowing the GPU to perform surface computation
in a seam-free manner. Thirdly, it is MIP-mappable, allowing the GPU to compute fast average
and sums over the surface hierarchy. Lastly, by using a parameterization scheme as a domain to
store and compute our surface samples, the number of surface samples is independent of both the
tessellation of our geometry, and the resolution of the screen we are rendering to. This marks
a difference between earlier interactive subsurface scattering approaches where vertex to vertex
101
interactions were used to discretize the scattering equation.
Full evaluation of equation (6.5) of patch to patch throughput factors would require P 2 inter-
actions, where P is the number of patches (and also texels in our texture atlas). Each interaction
forms a link. Since all of our patches are in texture space, we need only store a u, v location and
a throughput factor for each link. By adding an LOD term into the link structure we can access
any level in the MIP-map surface hierarchy. Based on our computation and space restrictions we
assign some maximum number of links L that we store for each patch Pb in the base level of our
texture map.
For a non-adaptive approach we can choose some level l in the hierarchy. We then build all
links from patches Pb to Pl where Pb are the patches at the lowest level in the hierarchy, and Pl are
patches at level l in the hierarchy.
An adaptive top-down approach for the construction of links may be done similar to that of
hierarchical radiosity. For every patch in the lowest level of our hierarch Pb, we start by placing the
root patch Pr of our hierarchy onto a priority queue (with highest throughput factor at the top).
We then recursively remove the patch at the top of the queue, compute the throughput factors
from Pb to its four children, and insert the four children into the priority queue. This process is
repeated until the queue size grows to the threshold number of links desired.
Once L adaptive links for each patch/texel in our atlas have been created, we store the result




Pb in size. We use an fp16 (16-bit floating point) texture
format supported on the GeForceFX to pack the link information: u, v, Fij ,LOD into the four color
channels to be used during pass two of our rendering stage. To reduce storage in favor of more
links, we opted to reduce our form factor to a single scalar. Form factors per color channel may be
supported at a cost of space and bandwidth.
In the case of a non-adaptive approach, the link locations and LOD are the same for every patch
Pb, we therefore store this information in constant registers on the graphics card during the second
pass. The throughput factors, however, vary per-link. We store the form factor information into
L/4 texture maps where each texel holds four throughput factors (corresponding to 4 links) in the
rgbα channels.
102
Pass 1: Radiosity Map
Given our face cluster hierarchy and MIP-mappable parameterization, we must first compute the
Ej ’s for the patches in our scene. To do this, we start by computing a single incident irradiance for
every texel in our texture atlas, thereby evaluating lighting incident on all sides of the model. We
scale the result of the incident illumination by the Fresnel term, storing the result in the texture
atlas. Each texel now holds the amount of irradiance that is transferred through the interface to
the inside of the model. This step is similar to the illumination map used in Lensch et al.[63].
To accomplish this efficiently on the GPU, we use the standard OpenGL lighting model in a
vertex program on the GPU. Using the OpenGL render-to-texture facility, we send our geometry
down the graphics pipeline. The vertex program computes the lighting on the vertices scaled by the
Fresnel term placing it in the color channel and swaps the texture coordinates for the vertices on
output. The radiosity stored in the color channel is then linearly interpolated as the triangle gets
rendered into the texture atlas. Our method does not prevent the use of more advanced lighting
models and techniques for evaluating the incident irradiance on the surface the object.
As an alternative to computing our transmitted radiosity in a vertex program, we could perform
the computation entirely in a fragment program per-texel. In addition to higher accuracy, this
approach may have improved performance for high triangle count models. A geometry image (e.g.
every texel stores surface position) and a normal map may be precomputed for our object and
stored as textures on the GPU. Rendering the radiosity map would only entail sending a single
quadrilateral down the graphics pipeline texture mapped with the geometry image and normal
map. The lighting model and Fresnel computation can take place directly in the fragment shader.
We use the automatic MIP-mapping feature available in recent OpenGL version to compute
the average radiosity at all levels of our surface hierarchy. The radiosity map is then used in the
next pass of this process.
Pass 2: Scattered Irradiance Map
This pass involves evaluating equation (6.5) for every texel in our texture atlas. We render directly
to the texture atlas, by sending a single quadrilateral to the GPU. In a fragment program, for
each texel, we traverse the L links stored during the pre-computation phase. Texture lookups
103
are performed to get the link information. Using the u,v and LOD link information a dependent
texture lookup is performed on the mip-mapped Radiosity Map. The result of this is scaled by the
links form factor. All links are traversed for a given texel and the accumulated radiosities form the
incident scattered irradiance for the texel.
This pass can be the most costly of the three passes depending on the number of links chosen.
For our adaptive approach, 2*L texture lookups must be performed in the fragment shader. For
our non-adaptive scheme we were able to pack four links into a single texture lookup resulting in
1.25 ∗ L lookups.
Pass 3: Final Rendering
Pass three begins with the scattered irradiance map resulting from pass two. This pass multiplies
the scattered irradiance texture with a texture of inverted Fresnel terms to get a texture of exter-
nal radiosities on the outside of the translucent object’s surface. This texture product is further
modulated by direct illumination resulting from a standard OpenGL lighting pass, evaluated either
per-vertex or per-fragment. This results in the final rendering of the translucent object.
6.2.6 Implementation Details
The accuracy of this method is governed by many factors, and can be controlled based on the
desired output and performance. First, the radiosity map (during pass 1) must be of high enough
resolution to capture the transmitted lighting on the surface. We found that increasing the size of
this map ( e.g. from 5122 to 10242 ) did not significantly slow down the overall performance, but
neither did it improve the quality much.
More important for improving accuracy is increasing the number of links per texel of the scat-
tered irradiance map. Increasing the number of links becomes not only computationally expensive,
but also expensive in terms of storage. To reduce storage and thus bandwidth costs, we start by
vector quantizing the form factor data using using a variant of GLA (Generalized Lloyd’s Algo-
rithm) [67]. We chose a 256 element code-book, to reduce our form factor storage to a single 8-bit
index.
We found in practice that the visual results using this approximation were indistinguishable
104
from the uncompressed version. This approximation works well since we have assumed that our
material is homogeneous, and thus the form factors compress very well. The predominant error
in our method remains in the number of links L that can be evaluated and still achieve real-time
performance. On modern hardware we were able to use 64 links per texel of our scattered irradiance
map, and still achieve near real-time rates.
6.2.7 Subsurface Scattering Results
Figure 6.4 shows the result of our subsurface scattering algorithm running on a GeForce FX card.
To achieve real-time performance we are running with either a 5122 or 10242 texture atlas with 16
links per texel. We initially tried to use our adaptive approach for assigning links but found that the
adaptivity when using such a course number of links led to visible discontinuities during rendering
along the mip-map boundaries. As we increased the number of links, the seams diminished due to
the improved approximation.
(a) (b)
Figure 6.4: Without subsurface scattering (a), and with subsurface scattering (b) using σ′s = 2.21,
σa = 0.0012, 16 links, 40fps Nvidia GeForceFX 5800.
Precomputation of the links was performed entirely on the CPU and took approximately nine
seconds for the 10242 resolution.
We tested the algorithm using an NVidia GeForce FX 5900 on three models. The ”head” model
105
Model res. fps Pass 1 Pass 2 Pass 3
Head 5122 61.10 11% 82% 7%
Dolphin 5122 30.35 43% 43% 14%
Bunny 5122 30.33 37% 34% 28%
Head 10242 15.40 13% 85% 2%
Dolphin 10242 15.09 8% 85% 7%
Bunny 10242 12.05 18% 68% 14%
Table 6.2: Subsurface scattering performance on a GeForce FX 5900.
with 7,232 faces, a ”dolphin” model with 21,952 faces and the ”bunny” model with 69,451 faces.
The results of the GPU subsurface scattering algorithm is shown in Table 6.2. At the 10242 texture
resolution, Pass 2 dominates, and since this pass is a texture-to-texture pass, the use of a texture
atlas has effectively decoupled the total shading cost from the tessellation complexity.
Figure 6.5: Head rendered with subsurface scattering at 40 fps Nvidia GeForceFX 5900.
106
6.3 3D Painting on the GPU
Static Conformal Atlas Dynamic MMA
Figure 6.6: Left is a static atlas generated by conformal mapping [22] and right is a dynamic
multiresolution meshed atlas that is rebalanced and reparameterized during the painting process
to better distribute texture resolution to capture painted detail.
6.3.1 Introduction
One use for dynamic texture atlases is in 3D paint systems. In 3D painting, the surface signal
(represented by paint strokes) is unknown at the start of the process, and evolves over time as the
artist paints on the model. Static parametrization schemes do not adapt to the changing surface
signal information leading to poor memory utilization and loss of fidelity in important regions as
shown in figure 6.6.
To date, all known commercial paint systems work with static parametrization schemes. Since
the signal is not known a-priori the artist is typically provided with a parametrization based solely
on the geometric aspects of the model. Tools for manually editing the parametrization are given
to the artist, allowing for adjustments in areas where detail may be needed. Such manual editing
of parametrization may be unintuitive to the artist, adding an additional step to the painting
process[43, 49]. Furthermore, it requires the artist to plan ahead where detail on the model is to
be placed. Dynamic parametrization can alleviate this problem by reallocating available texture
samples on the fly so that painted detail is represented more faithfully.
In this section we detail a GPU based painting system based that uses the dynamic parametriza-
tion method presented in chapter 5. This system detects regions of high frequency paint detail and
107
updates the model’s parametrization and texture at interactive rates while the artist paints. Each
time the parametrization is updated, we avoid resampling by rendering all painted strokes to con-
struct the new texture. This process is accelerated by using the speed of modern graphics hardware
to analyze where more texture samples are needed, and to perform the rendering of paint strokes
during texture reconstruction.
6.3.2 Related Work
Several recent texture atlas generation techniques have focused specifically on distributing texture
samples to reduce texture aliasing. Sander et al. [99] developed an atlas for texturing progressive
meshes that inhibited texture aliasing across different mesh resolutions. Carr and Hart [16] used
an area-balanced hierarchical face clustering to create an atlas to support real-time procedural
solid texturing. Both techniques strived to distribute samples evenly across the surface, and both
supported MIP mapping. Our atlases are generated using a variation of [16] because it uses more
of the available texture samples.
Sander et al. [97] developed a signal-stretch metric that combines both surface area and surface
signal bandwidth. Their metric guided texture atlas construction to better sample high frequency
texture areas, but runs too slowly to support interactive rates. Balmelli et al. [7] developed a
method that reduces texture space by optimizing texture coordinate locations using wavelet analysis
on the texture map. Section 6.3.3 we show how the GPU can quickly perform an edge-detecting
high-pass filter of a texture by computing its gradient magnitude.
Our work focuses on the redistribution of texture samples to recover dynamic surface signals.
Sloan et al. [104] also noticed the weakness of a static assignment of texture coordinates, and
formalized the notion of an importance map to guide the reassignment of texture coordinates based
on the surface signal. Alliez et al. [4] interactively reparameterized a surface by halftoning the
importance map to discretize it into a new set of surface samples evenly distributed according to
the importance map. Their method differs from ours; it is designed for semi-regular meshing and
displacement mapping. Moreover, their halftoning scheme constructs the opposite of an atlas; it
distributes mesh vertices in the texture map whereas an atlas distributes texture samples on the
mesh.
108
Our goals for a paint system are most closely aligned with those of Igarashi and Cosgrove [49].
They too developed a paint system that supports high detail by storing the strokes from each pose
in independent texture maps. When a surface painting was completed, they packed these individual
texture maps as charts in a global atlas. Since these chart domains could overlap, the resulting
atlas was not 1:1, which wastes the texture samples whose contents are overwritten by others, and
the probability of overlap increases with the number of individual charts.
Our technique maintains a single texture atlas using fixed texture resources during the entire
painting process. Our method redistributes texture samples with each new stroke, and does not
suffer from each new view change, which improves on [49]. Because our texture atlas is one-to-
one and covers the available texture space without overlap, our method more efficiently uses the
available texture samples to represent the painted surface.
6.3.3 The GPU Painting Process
We now describe an overview of our painting system as shown in figure 6.7. To avoid later confusion
we remind the reader that the texture atlas is the mapping between 2D and 3D coordinates for our
model. The atlas texture is its corresponding 2D grid of texel values.
The painting process starts when the artist applies a stroke of paint to the model. This paint
stroke is rendered into a separate empty buffer referred to as the stroke buffer. A depth map of the
object for the given pose is also rendered into a depth buffer. Rendering is then performed directly
into an area-balanced atlas texture by swapping vertex texture coordinates with vertex positions.
The texture coordinates(formerly vertex positions), are transformed into viewing coordinate by a
simple vertex program. The transformed coordinates are used to access the stroke buffer and depth
buffer, which are texture mapped onto the transformed geometry. The stroke information is only
applied if the atlas location was visible in the current pose as determined the contents of the depth
buffer. This prevents paint applied in a given pose from leaking through to non-visible regions of
the model.
A gradient filter is applied to the area-balanced atlas texture to determine areas of high signal
stretch. This forms the importance map for our atlas. The importance map is read from the



























Render Model with 
Signal-Balanced
MMA Texture
Figure 6.7: The GPU-accelerated surface painting process. The strokes are rendered into the
MMA by swapping vertex positions and texture coordinates, and using the stroke buffer as texture
(Sec. 6.3.3). An edge detector pixel shader converts the MMA into an importance map (Sec. 6.3.3).
The CPU reads back the MMA and a downsampled importance map, rebalances and reparame-
terizes the MMA to assign more texture samples to strokes (Sec. 6.3.3), and loads the new texture
map to render the model with painted detail.
and depth buffer are rendered into this new optimized texture atlas, and texture mapped onto the
geometry for final display. Each step of the painting process is covered in detail in the following
sections.
Stroke Rendering
Each collection of paint strokes applied in the same object pose are rendered directly into our atlas
texture using GPU hardware. We accomplish this by first rendering the strokes for a given pose
into an empty texture called the stroke buffer. To support transparent strokes, we enable blending
as strokes are rendered into the stroke buffer. The stroke buffer is first cleared to RGBα =
[0, 0, 0, 1]. We enable the source and destination blending function in the RGB channel to be:
RGBsrc(αsrc) + RGBdst(1 − αsrc). The blending function in the alpha channel we set to be:
αsrc(0) + αdst(1− αsrc). The choice of blending functions is described later in section 6.3.3.
110
Depth Buffer Rendering
A depth buffer of the object in its given pose is also rendered into a depth texture as shown in
figure 6.7. Each texel of the depth buffer stores the z depth value of all visible fragments. This
depth texture is later used to prevent paint from being applied to non-visible portions of the model.
Texture Atlas Rendering
We then resample the strokes from screen coordinates into the atlas texture. This resampling occurs
by rendering the surface geometry with a simple vertex shader that first transforms the model-space
position into viewing coordinates, then swaps the vertex’s position (in viewing coordinates) with
its texture coordinates. The viewing coordinates (now stored as texture coordinates) are used to
access the stroke buffer and depth buffer.
Rasterization of the triangles under these new coordinates resamples the stroke buffer into the
atlas texture. The OpenGL shadow buffer extension compares each fragment’s depth with a texture
lookup into the depth buffer to prevent surfaces that were not visible in the pose from receiving
paint. Note that this depth buffer test is only needed if the artist desires the paint to be applied
to visible regions. By disabling this test our system easily supports laser paint mode as described
in [49].
We blend the resampled stroke data into the atlas texture by setting the blending mode oper-
ations to be: RGBsrc(1) +RGBdst(αsrc). The destination alpha value is left unchanged.
Blending Support
The blending functions into the stroke buffer and atlas texture are specifically chosen so that a
series of strokes (all strokes for a given pose) may be accumulated together, and then combined at
the end with color in the atlas texture. This changes the typical order of blending operations.
Suppose we are given an ordered set of colors C0..Cn ∈ RGBα to be blended provided from
a series of overlapping strokes, and the standard blending over operator ⊗ performs the following
blending on two input colors: Cb ⊗ Ca = Ca(Caα) + Cb(1− Caα). For standard alpha blending, the
final output color Cf is correctly given by Cf = Cn ⊗ ..(C2 ⊗ (C1 ⊗ C0))..). Suppose that C0 is
the current color in our texture atlas, and a collection of strokes whose colors are given by C1..Cn
111
respectively are accumulated over the top of one another into the stroke buffer. The resulting stroke
buffer color Cs is given by Cs = (Cn ⊗ (..C3 ⊗ (C2 ⊗ C1))..). To combine the stroke buffer color
with the current texture color to produce a correctly blended output, we need some operator ,
where Cs C0 = Cf . Letting αs = (1−C1α)(1−C2α)..(1−Cnα). Then blending operator  is then
given by Cs  C0 = Cs + C0(αs). It is easy to verify that this yields the correctly blended result.
Accumulating Strokes From All Poses
The process of rendering both the stroke buffer and depth buffer, and then combining the results
into the atlas texture, is repeated for all poses where strokes occurred in order of their occurrence.
The resulting atlas texture holds all of the accumulated stroke information, and can be texture
mapped onto the model.
This process gives us a fast means to render all stored strokes of paint into a given atlas texture.
We can then change the texture atlas (i.e. the texture coordinate mapping) at any time and use
the process to reconstruct the new corresponding atlas texture.
For this process we keep two MMA texture atlas mappings around. The first texture atlas we
refer to as the area-balanced MMA atlas. This atlas is the un-optimized area-balanced MMA map,
that distributes samples purely based on the geometric properties of the mesh. Our signal-balanced
MMA atlas is the texture atlas that has been chosen to better distribute samples across our model.
We update the signal-balanced MMA atlas following each paint stroke provided by the artist.
The purpose of maintaining two mappings, is that it is easier to analyze texture frequency with
respect to the area-balanced MMA atlas. This is because the area-balanced atlas fairly distributes
samples based on surface area, and has less distortion. We compute our frequency samples discretely
so the atlas ensures that the signal across each triangle is fairly analyzed irrespective of its signal
importance.
Texture Frequency Analysis
We next analyze the area-balanced atlas texture to find regions in need of additional samples.
These undersampled regions occur in the atlas wherever the texture color gradient becomes too
large. We use a four-tap gradient magnitude filter to find undersampled regions, implemented as a
112
GPU fragment shader to support interactive processing speeds. The output texture of this filter is
half the horizontal and vertical resolution of the input texture.
To use the importance samples, we need to know the correspondences between texel locations
in the importance map, and triangles in our mesh. This involves knowing the rasterized locations
of each triangle in the atlas texture. To do this we use an item buffer rendering technique; a unique
id for every triangle is encoded in the color channel, and rendering is performed to the importance
map by sending down the atlas texture coordinates as vertex positions. The unique triangle id is
output into the RGB components of the importance map, and the result of the filter output in the
α channel.
The rendered geometry is texture mapped with the texture corresponding to the reference
texture atlas. For each fragment, the shader fetches four samples (taps) from the corresponding
2×2 support region of the input texture. Each of these four taps falls directly between two adjacent
input texture samples, and so its bilinearly interpolated value will be the mean of these two samples.
The fragment shader arithmetic uses these four taps to construct a central difference estimate of
the horizontal and vertical derivatives.
Using these derivatives, the following error (filter result) is computed by the shader on the signal











where φ˜ is the surface signal function that maps texture coordinate positions to colors φ˜ : R2 → R3,
and u1,u2 are the coordinate axis of our texture domain. Note that this error term is off by a
scaling factor related to the sample grid spacing, from that proposed by Sander. We are able to
disregard this factor since the error for every triangle is sampled at roughly the same frequency
since we are using the area-balanced MMA atlas. The resulting error term is written to the alpha
channel on output.
When applied to our texture, this yields an importance map, encoding the areas of our texture
with high frequencies. The support of the filter is only 2× 2 and aligned with the MIP-map such
that it does not interfere with (nor is it sensitive to) the structure of the parameterization.
The atlas rebalancing algorithm, described earlier in Chapter 5, is implemented on the CPU,
113
but requires the importance map generated by the GPU.
Tree Rebalancing and Reparametrization
The intensities in our importance map encode areas that are under-sampled. As described in
Section 4, each node in the face cluster tree corresponds to a rectangular region of the texture
domain, in this case the importance map. The distribution metrics stored in the cluster hierarchy
nodes are initialized to a uniform area distribution. Since the importance map is non-negative, we
use non-zero importance values to increase the distribution metrics of corresponding nodes in the
cluster hierarchy.
Sampling is performed on the importance map to assign every triangle a sample metric δ. We
choose δ for a triangle to be the maximum pixel intensity over its covered area in the importance
map.
For each cluster, we also assign δ importance to be the maximum pixel intensity over its covered
area multiplied by its surface area. Charts that contain triangles with updated metric values are re-
parameterized. Finally, the MMA tree is reconstructed to ideally balance the placement of samples
as described in section 4. The newly defined atlas is signal-balanced. Figure 6.8 shows a painted
bunny model and its optimized texture atlas.
Figure 6.8: Painted bunny model and its 10242 signal optimized atlas texture.
114
Resampling from a Base Texture
At some point, the number of poses may become quite large, slowing the stroke rendering phase
below acceptable interactive update rates. To avoid this problem we allow the strokes to be flattened
into a base texture so the individual stroke buffers and pose information can be discarded. The
resolution of these strokes, once flattened, is thereby fixed and can no longer be improved by the
dynamic resolution management algorithm. They do, however, impact the future frequency analysis
and texture atlas layouts.
We can use the GPU to resample a base texture to coincide with the new parametrization.
This resampling is performed by first loading the base texture, and rendering the surface geometry
vertices using the new atlas coordinates as position, and its previous base-texture parameterization
as texture coordinates. This rendering redistributes the base texture image into a multiresolution
mesh atlas image that is consistent with the new parameterization.
If the base texture is already a multiresolution meshed atlas, then we can take advantage of its
MIP-map. We can use MIP-mapping to inhibit minification aliases that may occur when resampling
a previous atlas image into a new atlas image, particularly in regions that will now receive fewer
texture samples. Resampling is performed from the original base texture, to avoid the accumulation
of error that can result from successive resampling.
6.3.4 Results
Figure 6.9 shows our dynamic parametrization scheme in comparison with the signal specialized
stretch metric detailed in [97]. We found that the difference in signal reconstruction between our
signal-balanced MMA and the signal-stretch approach taken by Sander et al. to be very close in
quality. Figure 6.9 shows that at the same resolution our method shows some slight artifacts from
the method of [97], but maintains the salient features of the signal. At twice the resolution, our
method is visually indistinguishable.
We tested our painting system on a wide variety of models ranging in size from 6k to 100k faces.
All tests were performed on a 2.08GHz Athlon, with a QuadroFX 2000 graphics card.
One of the freely chosen parameters of our system is the number of charts per mesh. With




(b) MMA 128x128 (c) MMA 256x256
Figure 6.9: Signal Stretch (a) versus Dynamic Parametrization (b) and (c). Some artifacts are
present when compared at same resolution: (a) and (b). At 2562 the results are visibly indistin-
guishable: (a) and (c).
model faces param (sec.) re-param (sec.)
bunny 69K 9.2 0.072 - 1.10
parasaur 7.6K 1.04 0.03- 0.186
head 7.2K 1.03 0.034 - 0.183
Table 6.3: Algorithm performance times for various models. Param time includes clustering, chart
parameterization, and MMA tree construction. Re-param time includes analyzing the importance
map, chart parameterization update, and MMA tree re-construction.
model into many charts results in poorly parameterized charts. This is due to fact that charts with
few triangles tend not to parameterize into square/rectangular regions without high distortion and
scaling error. For this reason we found our dynamic parametrization method worked better on
larger polygon count models. In practice we achieved good results by breaking our models into
clusters containing an average of 64 faces.
Table 6.3 shows some of the performance results from our dynamic parameterization algorithm.
The re-parameterization times greatly varied in our painting system based on the number patches
affected by a given stroke. The upper ranges in the table reflect re-parametrization times with a
surface signal that affects all charts. For typical strokes that cover only a fraction of the model we
were able to re-parameterize our models at a rate of 5-60Hz.
We found the overall performance of our GPU paint system a bit disappointing. Readback of the
116
importance map (at 5122) from the GPU accounted for as much a 90% of the total processing time
required to re-parameterize and re-render all strokes into a new atlas. We hope to see improvement
in this with the release of faster graphics card buses that offer symmetric transfer speeds such as
PCI Express. Regardless of current bus transfer speeds, the performance of our GPU painting
system should scale since only a single image must be readback during texture atlas optimization.
117
6.4 Conclusion
In this chapter we have demonstrated two very diverse applications that can take advantage of the
texture atlas construction developed in chapters 4 and 5. These two applications exemplify the
fast surface computation and sampling made possible by our texture atlas method. We believe
numerous other application can benefit from this new atlas. Our hope is that this research will lead
to the growth in future applications that take advantage of the graphics hardware to bring more





In this thesis we have developed a new texture atlas domain that can be used for surface processing
in graphics hardware. This is a fundamental problem in computer graphics that will only grow as
the predicted performance between graphics hardware and more general purpose CPU’s widen. We
have motivated the importance of graphics hardware by developing two applications that use the
GPU as a general computational unit. Both the applications of ray-tracing and numerical routines
such as large matrix multiplication demonstrate the future potential of graphics hardware. To
extend this capability to surfaces, texture atlas methods have been examined, and their limitations
with respect to graphics hardware have been noted.
Our new MMA method, provides a unique set of features not present in other methods. The
MMA parametrization scheme, works on orientable 2-manifold surfaces of any genus, provides
complete texel usage, is mip-mappable, and can be made adaptive for use in dynamic environments.
We have shown the usefulness of our parametrization scheme by developing two new applications:
real-time subsurface scattering, and a fully dynamic painting system.
7.2 Future Work
As future work, there are a number of directions that can be examined.
The problem of finding an optimal partitioning of a model into clusters that parametrize well into
square/retangular regions, remains to be an open problem. We have provide a number of heuristic
119
approaches to this end, however, we believe that better construction methods are possible. Future
work in parameterized shape-based clustering would lead to better lower distortion clusters, making
the argument for using rectangular parameter domains even stronger.
We would also be interested in examining clustering based on signal information. To date,
clustering methods only examine the geometric aspects of the mesh, seeking compactness, and
planarity or developability. In the presence of surface signal information such geometric aspects
are not important. Clusters should be chosen so that they best preserve surface signal information
when parameterized.
Another important area of research is the examination of non-linear optimization strategies
that allows for better assignment of boundary vertex. Our current method pins the four vertices
to the four corners of our rectangular parameter domain. Such constraints limit the degree to
which the parametrization of the patch may adapt to represent underlying surface signal. This
problem inevitably leads to constrained non-linear optimization along the parameter domain border,
however, additional care must to taken with the discontinuities occurring at the corners.
The signal balanced atlas can be viewed as a form of texture compression. One potential is
to examine the potential of the MMA atlas as a means to compress and store surface data. Since
the MMA breaks down the surface into texture blocks, applying compression to each texture block
may yield better results in some cases than the signal specialized parametrization methods, which
is more constrained by continuity.
One of the more obvious areas of explorations is in developing new applications that can benefit
from the MMA texture atlas structure. We are very interested in examining its usefulness towards
the evaluation of global illumination and light transport algorithms in graphics hardware. The
MMA construction might prove useful in evaluating hierarchical radiosity solutions on the GPU.
Our hope is the dynamic atlas may be used to redistribute samples to areas of importance where
more computation should take place. For example, the dynamic atlas could possibly be applied
to perform subsurface scattering on dynamically deforming surfaces. Our hope is the the dynamic




[1] Douglas Aberdeen and Jonathan Baxter. General matrix-matrix multiplication using SIMD
features of the PIII (research note). In European Conference on Parallel Processing, pages
980–983, 2000.
[2] Pierre Alliez, David Cohen-Steiner, Olivier Devillers, Bruno Le´vy, and Mathieu Desbrun.
Anisotropic polygonal remeshing. ACM Transactions on Graphics. Special issue for SIG-
GRAPH conference, pages 485–493, 2003.
[3] Pierre Alliez, E´ric Colin de Verdie`re, Olivier Devillers, and Martin Isenburg. Isotropic surface
remeshing. In Proceedings of Shape Modeling International, pages 49–58, 2003.
[4] Pierre Alliez, Mark Meyer, and Mathieu Desbrun. Interactive geometry remeshing. In Proc.
SIGGRAPH 2002, pages 347–354, 2002.
[5] James Arvo and David B. Kirk. Fast ray tracing by ray classification. Proc. SIGGRAPH 87,
pages 55–64, July 1987.
[6] Didier Badouel. An efficient ray-polygon intersection. In Graphics Gems, pages 390–393, 735.
Academic Press, Boston, 1990.
[7] L. Balmelli, G. Taubin, and F. Bernardini. Space-optimized texture maps. In Eurographics,
pages 411–420, 2002.
[8] H. Battke, D. Stalling, and H-C Hege. Fast line integral convolution for arbitrary surfaces in
3d. Technical Report SC-96-59, Konrad-Zuse-Zentrum fu¨r Informationstechnik Berlin (ZIB),
Dec. 1996.
121
[9] David Benson and Joel Davis. Octree textures. ACM Trans. on Graphics, 21(3):785–790,
July 2002.
[10] James F. Blinn and Martin E. Newell. Texture and reflection in computer generated images.
Comm. ACM, 19(10):542–547, Oct. 1976.
[11] B. Bloom. Space/time tradeoffs in hash coding with allowable errors. Communications of the
ACM, 13(2):159–188, July 1970.
[12] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schro¨der. Sparse matrix solvers on the
GPU: Conjugate gradients and multigrid. ACM Trans. on Graphics, 22(3):to appear, July
2003. (Proc. SIGGRAPH 2003).
[13] Nathan A. Carr, , and John C. Hart. Painting detail. To be submitted for publication in
SIGGRAPH 2004, Jan. 2003.
[14] Nathan A. Carr, Jesse D. Hall, and John C. Hart. The ray engine. In Proc. Graphics Hardware
2002, pages 37–46, Sep. 2002.
[15] Nathan A. Carr, Jesse D. Hall, and John C. Hart. GPU algorithms for radiosity and subsurface
scattering. Proc. Graphics Hardware 2003, pages 51–59, July 2003.
[16] Nathan A. Carr and John C. Hart. Meshed atlases for real-time procedural solid texturing.
ACM Transactions on Graphics, 21(2):106–131, 2002.
[17] P. Cignoni, C. Montani, C. Rocchini, and R. Scopigno. A general method for preserving
attribute values on simplified meshes. In Proc. Visualization ’98, pages 59–66. IEEE, Oct.
1998.
[18] James Clark. The geometry engine: A VLSI geometry system for graphics. Proc. SIGGRAPH
82, pages 127–133, July 1982.
[19] Thomas Cormen, Charles Leiserson, and Ronald Rivest. Introduction to Algorithms. McGraw-
Hill, 1990.
[20] Microsoft Corp. D3DX 9.0 SDK. http://www.microsoft.com/windows/directx/default.aspx,
2003.
122
[21] David Debry, Johnathan Gibbs, Devorah DeLeon Petty, and Nate Robbins. Painting and
rendering textures on unparameterized models. ACM Transactions on Graphics SIGGRAPH,
21(3):763–768, July 2002.
[22] M. Desbrun, M. Meyer, and P. Alliez. Intrinsic parameterizations of surface meshes. Computer
Graphics Forum, 21:209–218, 2002.
[23] Paul J. Diefenbach and Norman I. Badler. Multi-pass pipeline rendering: Realism for dy-
namic environments. In Proc. Symposium on Interactive 3D Graphics, pages 59–70. ACM
SIGGRAPH, Apr. 1997.
[24] Jack Dongarra. An update of a couple of tools: ATLAS and PAPI. DOE Salishan Meeting
(Available from http://www.netlib.org/utk/people/JackDongarra/
SLIDES/salishan.ps), Apr., 2001.
[25] Matthias Eck, Tony DeRose, Tom Duchamp, Hugues Hoppe, Michael Lounsbery, and Werner
Stuetzle. Multiresolution analysis of arbitrary meshes. In Proceedings of the 22nd annual
conference on Computer graphics and interactive techniques, pages 173–182. ACM Press,
1995.
[26] Jeff Erickson. Pluecker coordinates. Ray Tracing News, 10(3):11, 1997. www.acm.org-
/tog/resources/RTNews/html/rtnv10n3.html#art11.
[27] Jeff Erickson and Sariel Har-Peled. Optimally cutting a surface into a disk. ACM Symposium
on Computational Geometry, 2002.
[28] Randima Fernando, Sebastian Fernandez, Kavita Bala, and Donald P. Greenberg. Adaptive
shadow maps. Proc. SIGGRAPH 2001, pages 387–390, Aug. 2001.
[29] Randima Fernando and Mark J. Kilgard. The Cg Tutorial. Addison-Wesley, Reading, MA,
USA, March 2003.
[30] Charles M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network
partitions. In Proc. 19th IEEE Design Automation Conference, pages 175–181, 1982.
[31] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, 1987.
123
[32] Michael S. Floater. Parametrization and smooth approximation of surface triangulations.
Computer Aided Geometric Design, 14(4):231–250, 1997.
[33] Michael S. Floater. Mean value coordinates. Comput. Aided Geom. Des., 20(1):19–27, 2003.
[34] J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes. Computer Graphics: Principles and
Practice. Addison-Wesley, 2nd edition, 1990.
[35] Michael Garland, Andrew Willmott, and Paul S. Heckbert. Hierarchical face clustering on
polygonal surfaces. In Proc. Interactive 3D Graphics, pages 49–58. ACM, 2001.
[36] Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, and Greg Humphreys. A
multigrid solver for boundary value problems using programmable graphics hardware. In
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware,
pages 102–111. Eurographics Association, 2003.
[37] S. Gottschalk, M. C. Lin, and D. Manocha. OBBTree: A hierarchical structure for rapid
interference detection. Computer Graphics, 30(Annual Conference Series):171–180, 1996.
[38] Xianfeng Gu, Steven J. Gortler, and Hugues Hoppe. Geometry images. In Proc. SIGGRAPH
2002, pages 355–361. ACM, 2002.
[39] Igor Guskov. An anisotropic mesh parameterization scheme. In Proc. of 11th International
Meshing Roundtable, 2002.
[40] Daniel Hall. The AR350: Today’s ray trace rendering processor. In Peter. N. Glagowski,
editor, Hot 3D Presentations, pages 13–19. Graphics Hardware 2001, Aug. 2001.
[41] Jesse D. Hall, Nathan A. Carr, and John C. Hart. Cache and bandwidth aware matrix multi-
plication on the GPU. Technical Report UIUCDCS-R-2003-2328, University of Illinois, Apr.
2003. At ftp.cs.uiuc.edu in /pub/dept/tech reports/2003/ as UIUCDCS-R-2003-2328.ps.gz.
[42] P. Hanrahan, D. Salzman, and L. Aupperle. A rapid hierarchical radiosity algorithm. Com-
puter Graphics, 25(4):197–206, July 1991. (Proc. SIGGRAPH 91).
[43] Pat Hanrahan and Paul E. Haeberli. Direct WYSIWYG painting and texturing on 3D shapes.
In Proc. SIGGRAPH 1990, volume 24:4, pages 215–223. ACM, Aug. 1990.
124
[44] Pat Hanrahan and Wolfgang Krueger. Reflection from layered surfaces due to subsurface
scattering. In Proc. SIGGRAPH 93, pages 165–174, 1993.
[45] Xuejun Hao, Thomas Baby, and Amitabh Varshney. Interactive subsurface scattering for
translucent meshes. In Proc. Interactive 3D Graphics, pages 75–82, 2003.
[46] Mark J. Harris, William V. Baxter, Thorsten Scheuermann, and Anselmo Lastra. Sim-
ulation of cloud dynamics on graphics hardware. In Proceedings of the ACM SIG-
GRAPH/EUROGRAPHICS conference on Graphics hardware, pages 92–101. Eurographics
Association, 2003.
[47] Wolfgang Heidrich, Hendrik Lensch, and Hans-Peter Seidel. Light field-based reflections and
refractions. Eurographics Rendering Workshop, 1999.
[48] K. Hormann and G. Greiner. MIPS: An efficient global parametrization method. In P.-J.
Laurent, P. Sablonnie`re, and L. L. Schumaker, editors, Curve and Surface Design: Saint-
Malo 1999, Innovations in Applied Mathematics, pages 153–162. Vanderbilt University Press,
Nashville, TN, 2000.
[49] T. Igarashi and D. Cosgrove. Adaptive unwrapping for interactive texture painting. Sympo-
sium on Interactive 3D Graphics 2001, pages 209–216, 2001.
[50] Henrik Wann Jensen. Global Illumination Using Photon Maps. In Rendering Techniques ’96
(Proceedings of the Seventh Eurographics Workshop on Rendering), pages 21–30, New York,
NY, 1996. Springer-Verlag/Wien.
[51] Henrik Wann Jensen and Juan Buhler. A rapid hierarchical rendering technique for translu-
cent materials. ACM Trans. on Graphics, 21(3):576–581, 2002. (Proc. SIGGRAPH 2002).
[52] Henrik Wann Jensen, Stephen R. Marschner, Marc Levoy, and Pat Hanrahan. A practical
model for subsurface light transport. Proc. SIGGRAPH 2001, pages 511–518, Aug. 2001.
[53] James T. Kajiya. The rendering equation. Proc. SIGGRAPH 86, pages 143–150, Aug. 1986.
125
[54] Zachi Karni and Craig Gotsman. Spectral compression of mesh geometry. In Proceedings of
the 27th annual conference on Computer graphics and interactive techniques, pages 279–286.
ACM Press/Addison-Wesley Publishing Co., 2000.
[55] G. Karypis. Multi-level algorithms for muli-constraint hypergraph partitioning. Technical
Report 99-034, University of Minnesota, Nov 1999.
[56] G. Karypis and V. Kumar. Multilevel algorithms for multi-constraint graph partitioning. In
Proc. Supercomputing 98, Nov. 1998.
[57] G. Karypis and V. Kumar. Multilevel k-way hypergraph partitioning. In Proc. Design
Automation Conference, pages 343–348. IEEE, 1999.
[58] Alexander Keller. Instant radiosity. Proc. SIGGRAPH 97, pages 49–56, Aug. 1997.
[59] Peter Kipfer and P. Slusallek. Transparent distributed processing for rendering. Proc. Parallel
Visualization and Graphics Symposium, pages 39–46, 1999.
[60] Craig Kolb, Patrick M. Hanrahan, and Don Mitchell. A realistic camera model for computer
graphics. Proc. SIGGRAPH 95, pages 317–324, Aug. 1995.
[61] Jens Kru¨ger and Ru¨diger Westermann. Linear algebra operators for gpu implementation of
numerical algorithms. ACM Transactions on Graphics (TOG), 22(3):908–916, 2003.
[62] Scott E. Larson and David McAllister. Fast matrix multiplies using graphics hardware. Super
Computing, Nov. 2001.
[63] Hendrik P. A. Lensch, Michael Goesele, Philippe Bekaert, Jan Kautz, Marcus A. Magnor,
Jochen Lang, and Hans-Peter Seidel. Interactive rendering of translucent objects. In Proc.
Pacific Graphics 2002, pages 214–224, Oct. 2002.
[64] Bruno Le´vy and Jean-Laurent Mallet. Non-distorted texture mapping for sheared triangu-
lated meshes. In Proc. SIGGRAPH 98, pages 343–352, July 1998.
[65] Bruno Le´vy, Sylvain Petitjean, Nicolas Ray, and Je´rome Maillot. Least squares conformal
maps for automatic texture atlas generation. In Proc. SIGGRAPH 2002, pages 362–371.
ACM, 2002.
126
[66] Erik Lindholm, Mark J. Kilgard, and Henry Moreton. A user-programmable vertex engine.
Proc. SIGGRAPH 2001, pages 149–158, July 2001.
[67] S. P. Lloyd. Least squares quantization in pcm. IEEE Trans. Inform. Theory, 28:129–137,
1982.
[68] Je´roˆme Maillot, Hussein Yahia, and Anne Verroust. Interactive texture mapping. In Proc.
SIGGRAPH 93, pages 27–34. ACM, Aug. 1993.
[69] Brian Marshall. DirectX graphics future. http://www.microsoft.com/windows/directx/default.aspx,
Jul. 2001.
[70] Makoto Maruya. Generating a texture map from object-surface texture data. Computer
Graphics Forum, 14(3):397–406, Aug. 1995.
[71] Michael D. McCool and Wolfgang Heidrich. Texture shaders. In Proc. Graphics Hardware
99, pages 117–126. SIGGRAPH/Eurographics Workshop, August 1999.
[72] Victor J. Milenkovic. Rotational polygon overlap minimization. Computational Geometry:
Theory and Applications, 10:305–318, 1997.
[73] Victor J. Milenkovic. Rotational polygon containment and minimum enclosure. In Proceedings
of the fourteenth annual symposium on Computational geometry, pages 1–8. ACM Press, 1998.
[74] Victor J. Milenkovic. Rotational polygon containment and minimum enclosure using only
robust 2d constructions. Comput. Geom. Theory Appl., 13(1):3–19, 1999.
[75] Tomas Mo¨ller and Ben Trumbore. Fast, minimum storage ray-triangle intersectuion. Journal
of Graphics Tools, 2(1):21–28, 1997.
[76] Hiroshi Murata, Kunihiro Fujiyoshi, Shigetoshi Nakatake, and Yoji Kajitani. Rectangle-
packing-based module placement. In Proceedings of the 1995 IEEE/ACM international con-
ference on Computer-aided design, pages 472–479. IEEE Computer Society, 1995.
[77] Daniel Cohen-Or Olga Sorkine. Eurographics 2001 / jonathan c. roberts short presentations
warped textures for uv mapping encoding.
127
[78] Steven Parker, WilliamMartin, Peter-Pike J. Sloan, Peter S. Shirley, Brian Smits, and Charles
Hansen. Interactive ray tracing. In 1999 ACM Symposium on Interactive 3D Graphics, pages
119–126. ACM SIGGRAPH, Apr. 1999.
[79] Darwyn R. Peachey. Solid texturing of complex surfaces. Computer Graphics, 19(3):279–286,
July 1985.
[80] Mark S. Peercy, Marc Olano, John Airey, and P. Jeffrey Ungar. Interactive multi-pass pro-
grammable shading. Proc. SIGGRAPH 2000, pages 425–432, 2000.
[81] Ken Perlin. An image synthesizer. Computer Graphics, 19(3):287–296, July 1985.
[82] Matt Pharr and Pat Hanrahan. Monte Carlo evaluation of non-linear scattering equations
for subsurface reflection. In Proc. SIGGRAPH 2000, pages 75–84, 2000.
[83] Matt Pharr, Craig Kolb, Reid Gershbein, and Patrick M. Hanrahan. Rendering complex
scenes with memory-coherent ray tracing. Proc. SIGGRAPH 97, pages 101–108, Aug. 1997.
[84] Ulrich Pinkall and Konrad Polthier. Computing discrete minimal surfaces and their conju-
gates. Experimental Mathematics, 2(1):15–36, 1993.
[85] Emil Praun, Adam Finkelstein, and Hugues Hoppe. Lapped textures. In Proc. SIGGRAPH
2000, pages 465–470, July 2000.
[86] Kekoa Proudfoot, William R. Mark, Svetoslav Tzvetkov, and Pat Hanrahan. A real-time
procedural shading system for programmable graphics hardware. Proc. SIGGRAPH 2001,
pages 159–170, 2001.
[87] Timothy J. Purcell. SHARP ray tracing architecture. SIGGRAPH 2001 Real-Time Ray
Tracing Course Notes, Aug. 2001.
[88] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray tracing on pro-
grammable graphics hardware. In John F Hughes, editor, Proceedings of SIGGRAPH 2002,
Computer Graphics Proceedings, Annual Conference Series. ACM, ACM Press / ACM SIG-
GRAPH, 2002.
128
[89] Timothy J. Purcell, Craig Donner, Mike Cammarano, Henrik Wann Jensen, and Pat Han-
rahan. Photon mapping on programmable graphics hardware. In Graphics Hardware, July
2003.
[90] Nicholas Ray and Bruno Lvy. Hierarchical least squares conformal maps. In Pacific Graphics
2003 conf. proc., 2003.
[91] William T. Reeves, David H. Salesin, and Robert L. Cook. Rendering antialiased shadows
with depth maps. Proc. of SIGGRAPH 87, pages 283–291, Jul. 1987.
[92] E. Reinhard, A.G. Chalmers, and F.W. Jansen. Overview of parallel photorealistic graphics.
Eurographics ’98 STAR, pages 1–25, Sep. 1998.
[93] J. Revelles, C. Urena, and M. Lastra. An efficient parametric algorithm for octree traversal.
Proc. Winter School on Computer Graphics, 2000.
[94] Marc Rioux, Marc Soucy, and Guy Godin. A texture-mapping approach for the compression
of colored 3d triangulations. The Visual Computer, 12(10):503–514, 1996.
[95] M. Samek. Texture mapping and distortion in digital graphics. The Visual Computer,
2(5):313–320, 1986.
[96] P. V. Sander, Z. J. Wood, S. J. Gortler, J. Snyder, and H. Hoppe. Multi-chart geome-
try images. In Proceedings of the Eurographics/ACM SIGGRAPH symposium on Geometry
processing, pages 146–155. Eurographics Association, 2003.
[97] Pedro V. Sander, Steven J. Gortler, John Snyder, and Hugues Hoppe. Signal-specialized
parametrization. Proc. Eurographics Rendering Workshop, 2002.
[98] Pedro V. Sander, Xianfeng Gu, Steven J. Gortler, Hugues Hoppe, and John Snyder. Silhouette
clipping. In Proceedings of the 27th annual conference on Computer graphics and interactive
techniques, pages 327–334. ACM Press/Addison-Wesley Publishing Co., 2000.
[99] Pedro V. Sander, John Snyder, Steven J. Gortler, and Hugues Hoppe. Texture mapping
progressive meshes. In Proc. ACM SIGGRAPH 2001, pages 409–416, 2001.
129
[100] Alla Sheffer and Eric de Sturler. Surface parameterization for meshing by triangulation
flattening. Proc. Meshing Roundtable, pages 161–172, 2000.
[101] Alla Sheffer and John C. Hart. Seamster: Inconspicuous low-distortion texture seam layout.
Proc. Visualization 2002, pages 291–298, 2002.
[102] Horst D. Simon. Partitioning of unstructured problems for parallel processing. Computing
Systems in Engineering, 2:135–148, 1991.
[103] Peter-Pike Sloan, Jesse Hall, John Hart, and John Snyder. Clustered principal components
for precomputed radiance transfer. ACM Transactions on Graphics (TOG), 22(3):382–391,
2003.
[104] Peter-Pike J. Sloan, David M. Weinstein, and J. Dean Brederson. Importance driven texture
coordinate optimization. Computer Graphics Forum, 17(3), 1998.
[105] Olga Sorkine, Daniel Cohen-Or, Rony Goldenthal, and Dani Lischinski. Bounded-distortion
piecewise mesh parameterization. IEEE Visualization, pages 355–362, 2003.
[106] Vitaly Surazhsky, Pierre Alliez, and Craig Gotsman. Isotropic remeshing of surfaces: a local
parameterization approach. In Proceedings of 12th International Meshing Roundtable, 2003.
[107] La´szlo´ Szirmay-Kalos and Werner Purgathofer. Global ray-bundle tracing with hardware
acceleration. Proc. Eurographics Rendering Workshop, pages 247–258, June 1998.
[108] Chris Trendall and A. James Stewart. General calculations using graphics hardware with ap-
plications to interactive caustics. In Rendering Techniques 2000: 11th Eurographics Workshop
on Rendering, pages 287–298. Eurographics, Jun. 2000.
[109] W.T. Tutte. Convex representations of graphs. In Proc. London Math Society, volume 10,
1960.
[110] Eric Veach and Leonidas J. Guibas. Metropolis light transport. Proc. SIGGRAPH 97, pages
65–76, Aug. 1997.
130
[111] Ingo Wald, Philipp Slusallek, and Carsten Benthin. Interactive distributed ray tracing of
highly complex models. In Rendering Techniques 2001, pages 277–288. Eurographics Ren-
dering Workshop, 2001.
[112] Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner. Interactive rendering
with coherent ray tracing. Computer Graphics Forum, 20(3):153–164, 2001.
[113] Gregory J. Ward. The radiance lighting simulation and rendering system. Proc. SIGGRAPH
94, pages 459–472, Jul. 1994.
[114] H. Weghorst, G. Hooper, and D.P. Greenberg. Improved computational methods for ray
tracing. ACM Trans. on Graphics, 3(1):52–69, Jan. 1984.
[115] R. Clinton Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations
of software and the ATLAS project. Parallel Computing, 27(1-2):3–35, 2001.
[116] Lance Williams. Casting curved shadows on curved surfaces. Proc. SIGGRAPH 78, pages
270–274, Aug. 1978.
[117] Lance Williams. Pyramidal parametrics. Computer Graphics (Proc. SIGGRAPH 83),
17(3):1–11, July 1983.
[118] Mason Woo, Jack Neider, Tom Davis, and Dave Shreiner. OpenGL Programming Guide.
Addison-Wesley, Reading, MA, USA, 1997.
[119] Eugene Zhang, Konstantin Mischaikow, and Greg Turk. Feature-based surface parameteri-




Nathan Aaron Carr was born in Madison, Wisconsin on February 19, 1974. He graduated Magna
Cum Laude from the College of William & Mary in 1996 with a degree in computer science, with
a minor degree in fine arts. For two years following graduation, Nathan worked in the information
technology industry developing computer systems for Bell Atlantic. In 1998, Nathan returned to
school to pursue graduate study in computer graphics under the guidance of John C. Hart. He
completed his master’s degree in 2000, and transferred to the University of Illinois at Urbana-
Champaign for doctoral study also under the guidance on John C. Hart. While at the University
of Illinois he was twice awarded a fellowship from Nvidia for his studies in computer graphics.
Following completion of his Ph.D., Nathan will stay at the University of Illinois doing post-doctoral
research under the guidance of John Hart.
132
