On-the-fly Vertex Reuse for Massively-Parallel Software
Geometry Processing
Michael Kenzel
michael.kenzel@icg.tugraz.at
Graz University of Technology
Graz, Austria
Bernhard Kerbl
bernhard.kerbl@icg.tugraz.at
Graz University of Technology
Graz, Austria
Wolfgang Tatzgern
wolfgang.tatzgern@student.tugraz.at
Graz University of Technology
Graz, Austria
Elena Ivanchenko
elena.ivanchenko@icg.tugraz.at
Graz University of Technology
Graz, Austria
Dieter Schmalstieg
dieter.schmalstieg@icg.tugraz.at
Graz University of Technology
Graz, Austria
Markus Steinberger
markus.steinberger@icg.tugraz.at
Graz University of Technology
Graz, Austria
(a) Shading rate for XYZRGB Dragon (b) Rendering a scene from The Witcher 3 (c) Computation of inner and outer mesh envelopes
Figure 1: Reducing the number of shader invocations during rendering is essential to guarantee high performance.
Traditionally, redundant vertex shading can be bypassed using a post-transform cache, but its poor scalability makes
the vertex cache a poor choice in massively parallel environments. (a) The batch-based approaches we explore in this
work show good reuse characteristics on modern GPUs (green vertices are shaded only once, dark red six times).
(b, c) We evaluate static and dynamic batching in a variety of applications, e.g., rasterization of captured game
scenes and computation of mesh simplification envelopes. The Witcher 3: Wild Hunt screenshot courtesy of
CD PROJEKT S.A.; used with permission.
ABSTRACT
Compute-mode rendering is becoming more and more attractive
for non-standard rendering applications, due to the high flexibil-
ity of compute-mode execution. These newly designed pipelines
often include streaming vertex and geometry processing stages. In
typical triangle meshes, the same transformed vertex is on average
required six times during rendering. To avoid redundant compu-
tation, a post-transform cache is traditionally suggested to enable
reuse of vertex processing results. However, traditional caching
neither scales well as the hardware becomes more parallel, nor can
it be efficiently implemented in a software design. We investigate
alternative strategies for reusing vertex shading results on-the-fly
for massively parallel software geometry processing. Forming static
and dynamic batches on the data input stream, we analyze the
effectiveness of identifying potential local reuse based on sorting,
hashing, and efficient intra-thread-group communication. Altogether,
we present four vertex reuse strategies, tailored to modern
parallel architectures. Our simulations show that our batch-based
strategies significantly outperform parallel caches in terms
of reuse. On actual GPU hardware, our evaluation shows that our
strategies not only lead to good reuse of processing results, but also
boost performance by 2–3× compared to naïvely ignoring reuse
in a variety of practical applications.
Conference’17, July 2017, Washington, DC, USA
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
This is the author’s version of the work. It is posted here for your personal use. Not
for redistribution. The definitive Version of Record was published in Proceedings of
ACM Conference (Conference’17), https://doi.org/10.1145/nnnnnnn.nnnnnnn.
CCS CONCEPTS
• Computing methodologies → Rasterization; Massively parallel algorithms;
KEYWORDS
Vertex Processing, GPU
ACM Reference Format:
Michael Kenzel, Bernhard Kerbl, Wolfgang Tatzgern, Elena Ivanchenko,
Dieter Schmalstieg, and Markus Steinberger. 2018. On-the-fly Vertex Reuse
for Massively-Parallel Software Geometry Processing. In Proceedings of
ACM Conference (Conference’17). ACM, New York, NY, USA, 11 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn
arXiv:1805.08893v1 [cs.GR] 22 May 2018
1 INTRODUCTION
Although hardware-supported, real-time rendering of 3D scenes
is highly efficient, the standard rendering pipeline implemented in
hardware lacks flexibility in certain aspects. With modern graphics
processing units (GPUs) inexorably rising in compute power,
implementing (parts of) custom pipelines in compute mode, i.e.,
in software, becomes an interesting alternative. Although certain
features, like rasterization, will likely always be multiple orders of
magnitude faster in hardware, others may also be realized efficiently
in software. Primitive transformation, i.e., vertex shading, is one
of those applications. While implementing vertex shading stages
in software for execution on GPU compute units becomes more
and more common, vertex reuse, i.e., reusing the result of the vertex
shader when a vertex is referenced more than once, is usually ignored.
This is in part due to vertex reuse being realized in hardware in the
conventional pipeline, a feature that is not exposed for custom use
in software.
However, vertex reuse should not be neglected, as it offers several
benefits for high-performance rendering. In addition to a signif-
icant reduction of required memory for storing input geometry
data, effective vertex reuse can greatly reduce the number of shader
invocations. In a typical triangle mesh, each vertex is referenced about
six times on average. The traditional solution to enable vertex reuse is the
employment of a post-transform cache. The post-transform cache
stores shaded vertex information, which can then be retrieved in-
stead of computing the same information multiple times [Sheaffer
et al. 2004; Wang et al. 2011]. The significance of this assumption
is underlined by the wide body of research aiming at improving
the ordering of vertices in meshes to yield better cache behavior.
Unfortunately, there is little publicly available information on the
implementation specifics used in current GPUs. The lack of mention
of the post-transform cache in more recent articles [Kubisch
2015; Purcell 2010] also raises the question of to what degree the
widely accepted preconceptions about vertex reuse still hold.
The adequacy of a central vertex cache in contemporary graphics
pipelines and its use in a software pipeline should be questioned,
and justifiably so: with the increasing degree of parallelism usually
present in modern GPUs, the costs of a post-transform cache can
be expected to rise drastically. Alternative design choices tailored
towards massively parallel devices may circumvent this bottleneck
while achieving similar or even better reuse characteristics. In this
light, we see large potential benefits by revisiting the problem of
efficient vertex reuse with an additional focus on software rendering
pipelines. In search of methods capable of scaling with the massively
parallel architecture of current and future GPUs, we make the
following contributions:
(1) We investigate batch-based vertex uniquization as an alter-
native to post-transform caching for achieving reuse.
(2) Next to a naïve processing scheme, we discuss four batch-
based approaches to identify unique vertices on massively
parallel devices.
(3) We evaluate all approaches with respect to their theoreti-
cal and practical vertex reuse effectiveness in a variety of
computer graphics applications.
2 RELATED WORK
It has been realized early on that there is significant potential for
optimization by minimizing redundancy in an input stream de-
scribing mesh geometry. The pioneering work by Deering [1995],
Evans et al. [1996], and Chow [1997] considered the problem from
a data compression point of view. However, due to this angle of
approach, these methods required input geometry to always first
be encoded according to some compression scheme which would
then be decompressed during processing.
Hoppe [1999] was the first to explore the use of
a k-FIFO post-transform vertex cache to reduce redundant vertex
processing on-the-fly during rendering of triangle meshes. This work
furthermore presented a set of algorithms that automatically optimize
the rendering sequence for a given mesh to maximize utilization
of the proposed cache architecture. The downside of this
optimization approach is that it requires exact knowledge of the
properties of the underlying hardware, which are subject to change.
However, this work inspired a long line of follow-up research
improving upon its results [Chhugani and Kumar 2007; Lin and Yu
2006; Sander et al. 2007]. Arguably one of the most impactful works
is the architecture-agnostic approach by Forsyth [2006].
A more current area of research where we encounter the problem
of massively parallel vertex processing is software rendering on the
modern GPU. Noteworthy examples of GPU software rendering
pipelines include Freepipe [Liu et al. 2010], CUDARaster [Laine
and Karras 2011], and Piko [Patney et al. 2015]. They all use the
compute mode of the GPU (typically on top of the CUDA [NVIDIA
2016] ecosystem) to implement rasterization and fragment shading,
but lack mechanisms for vertex reuse. Freepipe simply executes
the vertex shader every time an index is fetched. CUDARaster and
Piko run the vertex shader in a preprocessing step on the entire
vertex buffer and store the results in global memory. This strategy is
wasteful if a substantial portion of the vertices is never referenced
or used during rendering, which is a rather common case in practice,
e.g., with methods such as level-of-detail or occlusion culling. The
need for buffering the entire intermediate output of the geometry
stage also leads to excessive memory requirements.
In order to avoid ambiguity in the following sections, we will
employ the nomenclature for parallel execution and hardware concepts
according to CUDA [NVIDIA 2016]. Hence, wave fronts of
single-instruction-multiple-data (SIMD) width will be referred to
as warps. Warp divergence denotes the case where threads of a warp
take different execution paths; since warps advance in lockstep,
every thread then has to step through paths it does not need. Logical
groups of warps that run on the same multiprocessor, share a
portion of fast local shared memory, and can easily synchronize will
be referred to as blocks.
3 VERTEX REUSE STRATEGIES
A major goal of this work is a characterization of vertex reuse in
a software-based, massively parallel context, and heightening the
understanding of its influence on graphics workload. To this aim,
we formulate the following assumptions: We only consider indexed
triangles as primitives, for which the index buffer can be used to
identify recurring vertices (Figure 2). The routine (or shader) for
processing a vertex is invoked based on an index buffer, where
groups of threads are assigned to consecutive primitives in the
index buffer. In the ideal case, the vertex shader should be executed
only once for each vertex that is referenced by the index buffer. To
ensure high performance, shading must happen in parallel, without
any need for expensive synchronization or communication across
GPU multiprocessors. In order to support maximum performance
in streaming pipelines, preprocessing of the vertex or index buffer
should be kept at a minimum, or avoided altogether.
[Figure 2 illustration: a mesh section with vertices 2 to 9 forming triangles 1 to 7 around vertex 9]
indices: …,2,3,9,3,4,9,4,5,6,4,6,9,9,6,7,9,7,8,9,8,2,…
Figure 2: Section of a mesh and its representation as an indexed
triangle list. On average, each vertex is referenced six
times in a typical mesh.
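To make the reuse potential in Figure 2 concrete, the following minimal Python sketch counts how often each vertex of the depicted index excerpt is referenced. The helper name `reference_counts` is purely illustrative and not part of the original work; the index values are copied from the figure.

```python
from collections import Counter

# Index excerpt copied from Figure 2: seven triangles, three indices each.
indices = [2, 3, 9, 3, 4, 9, 4, 5, 6, 4, 6, 9, 9, 6, 7, 9, 7, 8, 9, 8, 2]

def reference_counts(index_buffer):
    """Map each vertex index to the number of times it is referenced."""
    return Counter(index_buffer)

counts = reference_counts(indices)
naive_invocations = len(indices)  # one shader call per index, no reuse
ideal_invocations = len(counts)   # one shader call per distinct vertex

print(counts[9], naive_invocations, ideal_invocations)  # prints: 6 21 8
```

In this excerpt, the central vertex 9 is referenced six times, and perfect reuse would cut the number of shader invocations from 21 to 8.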
3.1 The post-transform cache revisited
Given the above considerations for geometry processing in a stream-
ing pipeline, a global persistent post-transform cache appears to be
the intuitive choice to reduce the number of vertex shader invoca-
tions. However, such a cache is difficult to implement efficiently if it
is required to work across the multiprocessors on the GPU. On the
other hand, even in a single multiprocessor, the high level of data
concurrency may defeat the purpose of caching. The reuse of vertex
information can occur almost instantly, if neighboring triangles are
referenced in quick succession in the index buffer (e.g., triangle strip
layout). Consequently, an advancing wave front of threads may
process the same vertex multiple times in parallel, and new cache
entries become available too late to be of use. In a software-only im-
plementation, caching additionally suffers from high latency when
using conventional memory rather than dedicated cache hardware.
Furthermore, it is prone to cause detrimental thread divergence,
since cache hits and misses lead to different control paths in the
execution of warps [Clarberg et al. 2013].
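The effect of concurrency on a cache can be illustrated with a small simulation. The sketch below is a simplified model under our own assumptions, not the simulator used for the evaluation: a FIFO cache is queried by a wave of `wave_size` threads at once, and results only become visible to later waves. The function name and parameters are hypothetical.

```python
from collections import OrderedDict

# Index excerpt from Figure 2.
indices = [2, 3, 9, 3, 4, 9, 4, 5, 6, 4, 6, 9, 9, 6, 7, 9, 7, 8, 9, 8, 2]

def lockstep_cache_invocations(index_buffer, wave_size, cache_entries):
    """Count shader invocations when whole waves query a FIFO cache at once.

    Assumption: a result produced in a wave is only visible to later waves,
    so duplicates inside the same wave are all shaded.
    """
    cache = OrderedDict()  # insertion order doubles as FIFO order
    invocations = 0
    for start in range(0, len(index_buffer), wave_size):
        wave = index_buffer[start:start + wave_size]
        misses = [idx for idx in wave if idx not in cache]
        invocations += len(misses)
        for idx in misses:  # publish results after the wave has finished
            if idx not in cache:
                cache[idx] = True
                if len(cache) > cache_entries:
                    cache.popitem(last=False)  # evict the oldest entry
    return invocations

for wave_size in (1, 6, 21):
    print(wave_size, lockstep_cache_invocations(indices, wave_size, 8))
```

With a single thread, the cache achieves the ideal 8 invocations for this excerpt. A wave of 6 threads already needs 12, and a wave covering all 21 indices gains nothing from the cache.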
3.2 Batch-based vertex reuse
To avoid the issues raised by the use of a central cache, we propose
the concept of batch-based vertex processing, which naturally lends
itself to execution on massively parallel architectures. We define
a batch as a bounded region in the index buffer that is
assigned to a single warp or block for processing. The thread block is
responsible for executing the shader once for each referenced vertex
within its batch and assembles the output triangles. Each block must
analyze its batch, assign vertices uniquely to threads for shader
invocation and finally distribute shading results for assembling the
output triangles. This implies that duplicate indices in the batch
need to be identified before executing the vertex shader.
An obvious challenge in the parallel generation of this many-
to-one mapping is that the input-to-output ratio is not known in
advance. Ideally, we would like to choose a batch size such that the
number of unique vertices equals the block size, and one thread
can run exactly one instance of the vertex shader. If that is not the
case, under-utilization will arise, as threads that receive no unique
vertex to process will simply idle. With larger batch sizes, it may be
possible to identify a larger number of duplicate indices, at the cost
of requiring multiple rounds of shader invocations to finish a whole
batch. These considerations lead to the proposal of two strategies
outlined in Figure 3: static batching and dynamic batching.
(a) static batching, Np = 3 × 3
    …, | 2,3,9, 3,4,9, 4,5,6, | 4,6,9, 9,6,7, 9,7,8, | 9,8,2, …
    unique vertices per batch: 2,3,9,4,5,6 | 4,6,9,7,8
(b) dynamic batching, Nu ≤ 5
    …, | 2,3,9, 3,4,9, | 4,5,6, 4,6,9, 9,6,7, | 9,7,8, 9,8,2, …
    unique vertices per batch: 2,3,9,4 | 4,5,6,9,7
Figure 3: The input can either be divided into (a) batches of a
constant size Np (static batching) or (b) batches of a variable
size chosen such that the number of unique vertices stays
below a threshold Nu (dynamic batching).
3.3 Static batching
For static batching, each thread block simply fetches a fixed num-
ber of indices from the input buffer to process. As a guideline for
efficient processing, we use a common multiple of the block size
and the primitive size as batch size, e.g., for triangles and thread
block size 32, we could use any multiple of 3 · 32 = 96. Since the
batch size is fixed, static batching requires no preprocessing of the
index buffer and can be applied directly to the input of a streaming
pipeline.
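As a rough illustration of the batch size guideline and of the reuse that is available within fixed-size batches, consider the following Python sketch. Both function names are hypothetical, and the example reuses the index excerpt and batch size from Figure 3(a).

```python
from math import gcd

# Index excerpt from Figure 3.
indices = [2, 3, 9, 3, 4, 9, 4, 5, 6, 4, 6, 9, 9, 6, 7, 9, 7, 8, 9, 8, 2]

def static_batch_size(block_size, primitive_size=3):
    """Smallest common multiple of the block size and the primitive size."""
    return block_size * primitive_size // gcd(block_size, primitive_size)

def unique_per_static_batch(index_buffer, batch_size):
    """Number of distinct vertices in each fixed-size batch."""
    return [len(set(index_buffer[start:start + batch_size]))
            for start in range(0, len(index_buffer), batch_size)]

print(static_batch_size(32))                # prints: 96
print(unique_per_static_batch(indices, 9))  # prints: [6, 5, 3]
```

With a batch size of 9 indices, the first two batches contain 6 and 5 unique vertices, matching Figure 3(a); the trailing partial batch contributes 3 more.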
Statically batched naïve. As a baseline, we implement a naïve
strategy that does not attempt any vertex reuse. Instead, every
thread is directly assigned to a primitive, and invokes the vertex
shader for all its indices. As thread blocks always fetch the same
number of indices, the static batch size is implicitly given. Notice
that, while this strategy leads to duplicate vertex shader execu-
tion, it avoids all communication overhead. Thus, for very simple
vertex shaders, this naïve approach may in fact show very good
performance.
Statically batched warp voting. For this strategy, we aim to fill up
warps with triangles so that every thread receives a unique vertex
to work on, as outlined in Figure 4. For this purpose, we use fast,
warp-level communication mechanisms, as detailed in Algorithm 1.
Every thread first loads an index from the buffer and subsequently
publishes it via register shuffle instructions to all other threads
in the warp. Each thread then informs its peers via warp voting
whether a duplicate index has been found. We track the number of
unique indices observed so far and assign each new index to the
available thread with the lowest ID. We also maintain an inverse
lookup-table in shared memory for fast reassembly after shading.
We keep fetching indices until either all threads were assigned
a unique vertex, or the batch boundary is hit. Next, all identified
unique vertices are shaded and output assembly is carried out.
This process is repeated iteratively, until all indices in the batch
have been processed. Note that starting a new iteration can lead
to duplicate shader invocations inside a batch, since shading results
are not carried over from the previous iteration. To distribute the
shaded vertices within the warp, we again use shuffle instructions.

Figure 4: Statically-batched warp voting uses all threads in
a warp (5 in this example) to load indices. We exploit warp
voting and shuffle instructions to unify the indices and store
the result in local registers. This process is repeated until all
indices have been consumed, or all threads have acquired a
unique index for processing in the vertex shader. As primitive
size must be considered, early shading results might be
discarded (e.g., for index 5 in this example).
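The iteration structure of warp voting can be approximated by a sequential model. The sketch below is a deliberate simplification, not a transcription of Algorithm 1: it consumes indices one at a time instead of in warp-sized chunks, and it only counts shader invocations. The function name is hypothetical, and the batch length is assumed to be a multiple of three.

```python
# Index excerpt used in Figure 4.
indices = [2, 3, 9, 3, 4, 9, 4, 5, 6, 4, 6, 9, 9, 6, 7, 9, 7, 8, 9, 8, 2]

def warp_voting_invocations(batch, warp_size):
    """Sequential model of warp voting for one static batch.

    Per iteration, lanes are filled with unique indices until a new index
    arrives while all lanes are taken. Only complete triangles are emitted;
    the next iteration restarts at the first unfinished triangle, so a
    vertex may be shaded again.
    """
    if warp_size < 3:
        raise ValueError("a warp must be able to hold one full triangle")
    invocations = 0
    start = 0
    while start + 3 <= len(batch):
        lanes = []  # unique indices assigned to lanes in this iteration
        pos = start
        while pos < len(batch):
            idx = batch[pos]
            if idx not in lanes:
                if len(lanes) == warp_size:
                    break  # no free lane left for a new vertex
                lanes.append(idx)
            pos += 1
        invocations += len(lanes)        # every occupied lane shades once
        triangles = (pos - start) // 3   # only complete triangles are output
        start += 3 * triangles
    return invocations

print(warp_voting_invocations(indices, 5))   # prints: 14
print(warp_voting_invocations(indices, 32))  # prints: 8
```

For the 5-thread example, this model needs 14 invocations for 8 distinct vertices, because vertices such as 5 are shaded in one iteration and again in the next. A warp that can hold all unique vertices reaches the ideal count.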
3.4 Dynamic batching
We assess the potential for optimizing vertex processing by allowing
for a fast, low-impact preprocessing step to retrieve analytical data
from the submitted index buffer. Specifically, we investigate the
performance of several dynamic batching strategies, which rely on
a load-time analysis of the input to derive optimal batch sizes. This
routine splits the buffer into batches of variable length, with the
goal of maximizing thread occupancy at runtime for the loading
and processing of vertices.
To achieve this, we define N to be a multiple of the block size
and scan the triangles in the index buffer front to back, counting
unique indices until we reach N , or a maximum allowed number of
primitives has been added to the batch. As soon as either of these
conditions is met, we start a new batch and continue scanning the
index buffer until all indices have been assigned to their respective
batches. The batch starting positions, stored in an auxiliary buffer,
allow us to feed a close-to-ideal amount of data to each thread
block.

Algorithm 1: Statically batched warp voting.
shared map[ ]
cStart ← BatchBegin
while cStart < BatchEnd do
    fill ← 0, done ← 0, my_id ← −1, offset ← cStart
    while offset < BatchEnd and fill < WarpSize do
        incoming ← −1, outgoing ← −1
        if offset + laneId < BatchEnd then
            incoming ← indexBuffer[offset + laneId]
        for i ∈ WarpSize do
            current ← shfl(incoming, i)
            match ← ballot(current = my_id)
            if match = 0 then
                if fill = laneId then
                    my_id ← current
                match ← BitShift(1, fill)
                fill ← fill + 1
            if i = laneId then
                outgoing ← match
        map[done + laneId] ← ffs(outgoing) − 1
        firstmask ← ballot(outgoing = 0 or incoming = −1)
        additional ← min(WarpSize, ffs(firstmask))
        done ← done + additional
        offset ← offset + WarpSize
    triangles ← ⌊done/3⌋
    if laneId < fill then
        v ← shade(vertexBuffer[my_id])
    v0 ← shfl(v, map[3 · laneId])
    v1 ← shfl(v, map[3 · laneId + 1])
    v2 ← shfl(v, map[3 · laneId + 2])
    if laneId < triangles then
        output(v0, v1, v2)
    cStart ← cStart + 3 · triangles
Note that we do not require the buffer to forward information
about the unique vertices, and leave their identification to be con-
ducted by threads at runtime. Hence, no information other than the
splitting of the index buffer into optimally processable portions is
output at this point. Therefore, we can abstract our preprocessing
procedure to an elaborate work scheduling routine, that could very
well be realized by dedicated hardware. We propose three distinct
strategies for the dynamic approach: dynamically batched sorting,
dynamically batched hashing, and dynamically batched parallel hashing.
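The load-time splitting of the index buffer described above can be sketched in a few lines of Python. This is a minimal sequential illustration under our own assumptions: a batch is closed as soon as the next triangle would push the unique vertex count above the threshold, or once the primitive limit is reached. The function and parameter names are hypothetical.

```python
# Index excerpt from Figure 3.
indices = [2, 3, 9, 3, 4, 9, 4, 5, 6, 4, 6, 9, 9, 6, 7, 9, 7, 8, 9, 8, 2]

def dynamic_batch_starts(index_buffer, max_unique, max_prims):
    """Scan triangles front to back and return the batch start offsets."""
    starts = [0]
    seen = set()  # unique indices in the current batch
    prims = 0     # triangles in the current batch
    for t in range(0, len(index_buffer) - 2, 3):
        new = set(index_buffer[t:t + 3]) - seen
        too_many_vertices = len(seen) + len(new) > max_unique
        if prims > 0 and (too_many_vertices or prims == max_prims):
            starts.append(t)  # this triangle opens the next batch
            seen, prims = set(), 0
            new = set(index_buffer[t:t + 3])
        seen |= new
        prims += 1
    return starts

print(dynamic_batch_starts(indices, max_unique=5, max_prims=341))  # prints: [0, 6, 15]
```

For a threshold of 5 unique vertices, the returned offsets reproduce the batch boundaries of Figure 3(b). Only these start offsets are stored; which vertices are unique within a batch is still determined at runtime.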
Dynamically batched sorting. One straightforward way to determine
the assignment between threads and unique vertices is
to use parallel sorting. We load and sort a full batch of indices in
shared memory, and run a prefix sum over the sorted sequence to
determine all unique vertex indices. The original position in the
batch is carried along during sorting, to be used as an inverted
lookup-table for assembly as outlined in Algorithm 2 and Figure 5.
[Figure 5 illustration: the batch 4,5,6,4,6,9,9,6,7 is sorted to 4,4,5,6,6,6,7,9,9 in shared memory]
Figure 5: Dynamically-batched sorting brings indices into a
monotonic order using radix sort. This allows unique indices
to be determined efficiently. Computing the prefix sum yields
offsets for the individual indices to store shading results at.
An inverse mapping is used to assemble the output primitives.
Algorithm 2: Dynamically batched sorting.
shared ids[ ], linIds[ ], map[ ], marks[ ], uniqueIds[ ], v[ ]
for i ∈ size(Batch) do in parallel
    ids[i] ← indexBuffer[BatchBegin + i]
    linIds[i] ← i
RadixSort(ids, linIds)
for i ∈ size(Batch) do in parallel
    marks[i] ← 1 if ids[i] ≠ ids[i + 1] else 0
numVertices ← PrefixSum(marks)
for i ∈ size(Batch) do in parallel
    map[linIds[i]] ← marks[i]
    uniqueIds[marks[i]] ← ids[i]
for j ∈ numVertices do in parallel
    v[j] ← shade(vertexBuffer[uniqueIds[j]])
for i ∈ size(Batch)/3 do in parallel
    output(v[map[3i]], v[map[3i + 1]], v[map[3i + 2]])
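The data flow of dynamically batched sorting can be followed in a sequential Python rendition. It is a sketch under two assumptions of ours: the prefix sum is exclusive, and the last sorted element always counts as a boundary. Python's built-in sort stands in for the parallel radix sort, and the function name is hypothetical.

```python
from itertools import accumulate

# Batch from Figure 5.
batch = [4, 5, 6, 4, 6, 9, 9, 6, 7]

def unique_by_sorting(batch):
    """Return (unique_ids, vertex_map) for one batch.

    vertex_map[i] is the slot in unique_ids (and thus in the array of
    shaded vertices) that batch position i refers to.
    """
    # Sort positions by index value; `order` plays the role of linIds.
    order = sorted(range(len(batch)), key=lambda i: batch[i])
    ids = [batch[i] for i in order]
    # Mark the last occurrence of every distinct index.
    marks = [1 if i + 1 == len(ids) or ids[i] != ids[i + 1] else 0
             for i in range(len(ids))]
    # Exclusive prefix sum: output slot for every sorted element.
    slots = [0] + list(accumulate(marks))[:-1]
    unique_ids = [0] * sum(marks)
    vertex_map = [0] * len(batch)
    for i, original_position in enumerate(order):
        vertex_map[original_position] = slots[i]
        unique_ids[slots[i]] = ids[i]
    return unique_ids, vertex_map

unique_ids, vertex_map = unique_by_sorting(batch)
print(unique_ids)  # prints: [4, 5, 6, 7, 9]
print(vertex_map)  # prints: [0, 1, 2, 0, 2, 4, 4, 2, 3]
```

Each of the five unique vertices would be shaded once, and triangle assembly reads the shaded results through `vertex_map`.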
Dynamically batched hashing. In this strategy, we employ a hash
map in shared memory to remove duplicate vertex indices. As hash
function, we use multiplicative hashing with linear probing. We
choose the size of the hash map to match the thread block size.
Ideally, the hash map is fully filled after loading a batch due to the
size restrictions applied in our preprocessing step. Consequently,
filling the hash map allows us to uniquely assign vertices to threads.
Upon entering an index into the hash map, the loading thread
records the value of the hash function, to be able to identify the required
vertex after shading. Since the hash map is filled in parallel, we
use atomic operations for insertion, as outlined in Algorithm 3 and
Figure 6.
[Figure 6 illustration: the batch 4,5,6,4,6,9,9,6,7 is inserted into a five-entry hash map in shared memory, which then holds 5, 7, 4, 6, 9]
Figure 6: Dynamically-batched hashing uses a hash map
that can hold one entry for each thread. Filling up the hash
map removes duplicates in the data and each thread can
execute the vertex shader on one unique vertex.
Algorithm 3: Dynamically batched hashing.
shared hashtable[ ], map[ ], v[ ]
for i ∈ BatchSize do in parallel
    hashtable[i] ← −1
for i ∈ size(Batch) do in parallel
    id ← indexBuffer[BatchBegin + i]
    p ← hash(id)
    inserted ← false
    while not inserted do
        prev ← atomicCAS(hashtable[p], −1, id)
        if prev = −1 or prev = id then
            loc ← p
            inserted ← true
        else
            p ← probing(p)
    map[i] ← loc
for j ∈ BatchSize do in parallel
    if hashtable[j] ≠ −1 then
        v[j] ← shade(vertexBuffer[hashtable[j]])
for i ∈ size(Batch)/3 do in parallel
    output(v[map[3i]], v[map[3i + 1]], v[map[3i + 2]])
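The hash map insertion can be emulated sequentially to see how duplicates collapse into one slot. In this sketch, the multiplier and the modulo reduction are our own illustrative choices for a multiplicative hash, not the authors' exact function, and a plain test-then-write stands in for the atomic compare-and-swap.

```python
# Batch from Figure 6.
batch = [4, 5, 6, 4, 6, 9, 9, 6, 7]

def hash_slot(index, table_size):
    """Illustrative multiplicative hash (constant chosen for this sketch)."""
    return ((index * 2654435761) & 0xFFFFFFFF) % table_size

def insert_batch(batch, table_size):
    """Insert all indices with linear probing; return (table, vertex_map)."""
    table = [-1] * table_size
    vertex_map = []
    for idx in batch:
        p = hash_slot(idx, table_size)
        for _ in range(table_size):
            if table[p] == -1:      # sequential stand-in for atomicCAS
                table[p] = idx
            if table[p] == idx:     # slot newly claimed or already ours
                break
            p = (p + 1) % table_size  # linear probing
        else:
            raise ValueError("hash table is full")
        vertex_map.append(p)
    return table, vertex_map

table, vertex_map = insert_batch(batch, table_size=5)
print(sorted(table))  # prints: [4, 5, 6, 7, 9]
```

Every occupied slot corresponds to exactly one vertex to shade, and `vertex_map` records for each batch position which slot holds its vertex.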
Dynamically batched parallel hashing. One issue with the simple
hashing approach above is that a fully occupied hash map will
likely lead to excessive linear probing. In some cases, this may lead
to pathological warp divergence, as a single thread repeatedly tries
to find the last free entry, and the remaining peers in the warp have
to join in the effort. As a remedy, we propose to perform hashing
as a two-tiered approach. First, every thread executes up to a fixed
number of linear probing attempts. Second, all threads within a
warp collaborate to find available spots until all
indices have been inserted. This fast-path/slow-path strategy effectively
repurposes otherwise idle threads in order to speed up the
search for free spots. Coordination within a warp can be realized
through efficient register shuffle and warp voting.
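The two-tiered idea can be modeled sequentially as follows. This sketch rests on our own assumptions: the slow path continues the probe sequence and inspects `warp_size` consecutive slots per round, which stands in for a warp-wide vote. The hash function, names, and parameters are illustrative only.

```python
# Batch from Figure 6.
batch = [4, 5, 6, 4, 6, 9, 9, 6, 7]

def hash_slot(index, table_size):
    """Illustrative multiplicative hash (constant chosen for this sketch)."""
    return ((index * 2654435761) & 0xFFFFFFFF) % table_size

def two_tier_insert(batch, table_size, max_probes, warp_size):
    """Return (table, vertex_map, slow_path_rounds)."""
    table = [-1] * table_size
    vertex_map = []
    slow_path_rounds = 0
    for idx in batch:
        p = hash_slot(idx, table_size)
        # Tier 1: bounded linear probing by the inserting thread alone.
        probes = 0
        while probes < max_probes and table[p] not in (-1, idx):
            p = (p + 1) % table_size
            probes += 1
        # Tier 2: the warp checks warp_size consecutive slots per round.
        scanned = 0
        while table[p] not in (-1, idx):
            if scanned >= table_size:
                raise ValueError("hash table is full")
            window = [(p + k) % table_size for k in range(warp_size)]
            hits = [s for s in window if table[s] in (-1, idx)]
            slow_path_rounds += 1
            scanned += warp_size
            p = hits[0] if hits else (p + warp_size) % table_size
        table[p] = idx
        vertex_map.append(p)
    return table, vertex_map, slow_path_rounds

table, vertex_map, rounds = two_tier_insert(batch, 5, max_probes=1, warp_size=4)
print(sorted(table))  # prints: [4, 5, 6, 7, 9]
```

Because the slow path always picks the first suitable slot in probe order, the resulting table is the same as with plain linear probing; only the number of sequential steps a single thread has to take changes.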
4 EVALUATION
For performance evaluation, we use a set of commonly processed
models, as well as content captured from six recent video games
and an NVIDIA technical demo: Age of Mythology (abbreviated
am), Assassin’s Creed: Black Flag (as), Deus Ex: Human Revolution
(dx), Stone Giant animation (sg), Total War: Shogun 2 (sh), Rise
of the Tomb Raider (tr), and The Witcher 3 (tw). A representative
rendering from our 19 different scenes is shown in Figure 1b.
Obviously, evaluating the effectiveness of different vertex reuse
methods is meaningful only if the processed data actually allows for
detecting reuse. As this is usually not considered in the generation,
formatting or conversion of mesh data, the order in which vertices
are referenced in input models can be at best arbitrary, or, at worst,
biased. Popular mesh processing algorithms have been presented
previously, with the aim of reordering indices in a given mesh to
increase vertex locality. As shown by Figures 7 and 8, applying such
algorithms to popular models can remove unusual discontinuities,
and significantly improve reuse potential in the OpenGL streaming
pipeline. Incidentally, this is also true for our own techniques, which
appear to exhibit similar behavior to OpenGL. In order to generate
a fair ordering and enable vertex reuse even in unstructured models,
we preprocess all meshes with the optimization algorithm by
Forsyth [2006].
For comparison, we simulate a parallel caching scheme with
different cache sizes and compare performance to our static and
dynamic batching approaches. Furthermore, we investigate perfor-
mance of our techniques when using them in a software streaming
rendering pipeline and compare to a non-streaming, multi-kernel
setup.
4.1 Caching vs batching and OpenGL
We first determine the ideal reuse rate as the ratio of duplicate vertex
indices over the total length of the index buffer. Theoretically, a very
large, global post-transform cache with instant reusability could
yield the reported ideal figures. Unfortunately, such a global vertex
cache does not seem practical for massively parallel devices, like
modern GPUs. Storing and retrieving data from device-wide caches
introduces an order of magnitude higher latency than caches on
multiprocessors, which significantly reduces performance.
Furthermore, as cache entries can only be generated after vertex
shader execution, threads that concurrently receive the same
non-cached index to execute will not be captured by the cache.
On a massively parallel device, like the GPU, tens of thousands of
threads may execute in parallel, preventing an immense vertex reuse
potential from being utilized. The problem gets even worse considering
the long latency of a device-wide cache, e.g., even if one thread
completes a vertex shader invocation and another thread queries
the cache for that vertex just after, it may still result in a cache miss,
as the latency for storing the shading result has not yet elapsed.
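The ideal reuse rate defined above is straightforward to compute. The following Python sketch uses a hypothetical helper name and two small inputs: the index excerpt from Figure 2 and a tetrahedron.

```python
def ideal_reuse_rate(index_buffer):
    """Ratio of duplicate indices to the total length of the index buffer."""
    if not index_buffer:
        return 0.0
    duplicates = len(index_buffer) - len(set(index_buffer))
    return duplicates / len(index_buffer)

# Index excerpt from Figure 2: 21 indices, 8 distinct vertices.
excerpt = [2, 3, 9, 3, 4, 9, 4, 5, 6, 4, 6, 9, 9, 6, 7, 9, 7, 8, 9, 8, 2]
# A tetrahedron: 4 triangles, 4 distinct vertices.
tetrahedron = [0, 1, 2, 0, 3, 1, 0, 2, 3, 1, 3, 2]

print(round(ideal_reuse_rate(excerpt), 3))      # prints: 0.619
print(round(ideal_reuse_rate(tetrahedron), 3))  # prints: 0.667
```

For a large closed triangle mesh with roughly half as many vertices as triangles, this ratio approaches 1 − 1/6 ≈ 0.833, which is consistent with the ideal figures reported for the closed models in Table 1.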
In order to compare to a more realistic cache-based approach,
we simulate a per-multiprocessor cache setup. Based on the design
of our main evaluation GPU, the NVIDIA GTX 1080Ti, we simulate
28 multi-processors, where each multi-processor runs 1024 vertex
shader invocations in parallel and stores the results in its dedicated
least-recently-used (LRU) cache of size 16, 32, or 64KB. As men-
tioned before, a cache hit cannot occur for duplicate vertices being
concurrently processed, but only if the result was already produced
in an earlier shading cycle. For our statically batched warp voting,
we use a batch size of 96, which fits the warp size of 32. For all
dynamic batching approaches, we use a maximum batch size of
1023 indices and 256 unique vertices, and assign 256 threads to each
batch. Test scene statistics and achieved reuse for each technique
are listed in Table 1.
Table 1: Scene statistics: vertices, triangles, and average vertex
reuse (equivalent to cache hit rates) in an ideal case, for a
per-multiprocessor cache, warp voting, and dynamic batching.

                                 Parallel Cache       Ours
scene    vert   tris   ideal   16KB   32KB   64KB   warp   dyn.
bunny    34k    70k    .832    .056   .060   .067   .714   .799
sphere   40k    82k    .833    .063   .065   .069   .712   .795
tree     492k   239k   .314    .001   .001   .001   .314   .313
buddha   544k   1.1M   .833    .050   .048   .049   .725   .804
dragon   3.6M   7.2M   .833    .059   .059   .059   .713   .795
am02     3k     6k     .801    .042   .057   .075   .709   .779
am03     2k     4k     .839    .090   .108   .108   .735   .807
as01     108k   183k   .803    .034   .035   .035   .719   .788
as04     598k   538k   .629    .017   .018   .018   .575   .620
dx29     25k    42k    .796    .034   .035   .037   .719   .782
dx33     37k    60k    .795    .028   .029   .029   .718   .784
sg14     135k   254k   .822    .043   .044   .044   .726   .805
sg16     38k    69k    .813    .047   .048   .048   .720   .796
sh11     812k   1.1M   .754    .025   .026   .026   .693   .744
sh21     521k   701k   .751    .029   .029   .029   .681   .737
tr04     191k   283k   .775    .033   .034   .034   .708   .763
tr09     78k    118k   .780    .031   .031   .031   .705   .769
tw03     268k   487k   .816    .051   .052   .053   .709   .787
tw30     695k   565k   .589    .020   .020   .020   .532   .579
The recorded reuse rates of these cache-based techniques are
always below 15% of the ideal reuse rate. In contrast, our dynamic
batching usually falls within 5% of the ideal reuse, and statically
batched warp voting within 10–20%. In general, both static and
dynamic batching seem to be highly effective for all kinds of scenes.
The sum of these observations strengthens our assumption that
the approaches presented here may be better suited for modern
architectures than a post-transform cache. Furthermore, our
tests with implementing a post-transform cache quickly led us
to believe that there is no efficient way to implement such a cache
in software. Thus, we solely focus on the presented batch-based
approaches throughout the evaluation of use-case scenarios.
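The relation between achieved and ideal reuse can be checked with simple arithmetic. The Python sketch below expresses a handful of rows copied from Table 1 as fractions of the ideal reuse rate; the selection of rows and the helper name are ours.

```python
# (ideal, 64KB cache, warp voting, dynamic batching), copied from Table 1.
rows = {
    "bunny": (0.832, 0.067, 0.714, 0.799),
    "tree":  (0.314, 0.001, 0.314, 0.313),
    "am03":  (0.839, 0.108, 0.735, 0.807),
    "tw30":  (0.589, 0.020, 0.532, 0.579),
}

def fraction_of_ideal(row):
    """Express cache, warp, and dynamic reuse relative to the ideal rate."""
    ideal, cache, warp, dyn = row
    return cache / ideal, warp / ideal, dyn / ideal

for scene, row in rows.items():
    cache, warp, dyn = fraction_of_ideal(row)
    print(f"{scene}: cache {cache:.1%}, warp {warp:.1%}, dyn. {dyn:.1%}")
```

For these rows, the simulated cache stays far below the ideal reuse rate, while dynamic batching reaches at least 95% of it.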
4.2 Rasterization
The major motivation and use case for vertex reuse is the geometry
processing stage of a real-time 3D rendering pipeline. To test our
batch-based reuse techniques, we have implemented a configurable
On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing Conference’17, July 2017, Washington, DC, USA
O
rig
in
al
To
m
F
(a) OpenGL (b) Static batching (c) Dynamic batching
Figure 7: Vertex reuse visualization for the Stanford Bunny model: green indicates a single shader invocation, red indicates
six shader calls. Compared to the original input, preprocessing the model with TomF reduces arbitrary discontinuities and
enables better overall potential for vertex reuse for both OpenGL and software techniques. While dynamic batching shows
higher reuse, static warp voting seems to be closer to actual OpenGL behavior on a GTX 1080Ti.
none Hoppe tomF random
0
2
4
6
ve
rte
x 
sh
ad
er
 in
vo
ca
tio
ns
(a) GL
0
2
4
6
ve
rte
x 
sh
ad
er
 in
vo
ca
tio
ns
(b) warp
0
2
4
6
ve
rte
x 
sh
ad
er
 in
vo
ca
tio
ns
(c) dyn.
Figure 8: Preprocessing the Stanford Bunny for vertex local-
ity significantly improves its reuse potential. In OpenGL, ev-
ery vertex is shaded four times in the original mesh; a ran-
dom organization leads to nearly six invocations; prepro-
cessing with TomF [Forsyth 2006] or Hoppe [Hoppe 1999]
reduces the number of shader calls. Our approaches (warp,
dyn.) mirror this behavior.
geometry stage in CUDA, that can be included into a streaming
pipeline design. The geometry processing stage is simply given an
input stream of indices and a vertex buffer. Based on the respective
batching approach, indices are fetched from the index buffer and
vertex reuse is evaluated. As a final step, one thread per triangle
is used to write the output primitive with its vertices into a queue.
This output queue could—when integrated into a full streaming
pipeline—be consumed by the next stage in the rendering pipeline.
For a traditional real-time rendering pipeline, this queue would
form the natural point for a sort-middle approach.
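The batch-based geometry stage described above can be sketched sequentially as follows. This is a minimal Python illustration, not the CUDA implementation: the per-batch dictionary is an assumption that stands in for the warp voting, hashing, or sorting used to detect reuse, and the names `process_batches`, `shade`, and `out_triangles` are hypothetical. The default batch size of 96 indices is borrowed from the static configuration reported in Section 5.1.

```python
def process_batches(indices, vertices, shade, batch_size=96):
    """Shade each unique vertex once per batch and emit triangles.

    `indices` is a flat triangle list. `batch_size` counts indices and
    must be a multiple of 3 so that no triangle straddles two batches.
    """
    assert batch_size % 3 == 0
    out_triangles = []   # stands in for the output queue
    invocations = 0      # number of vertex shader calls
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        shaded = {}      # vertex index -> shaded result, valid for this batch only
        for idx in batch:
            if idx not in shaded:
                shaded[idx] = shade(vertices[idx])
                invocations += 1
        # One output primitive per triangle, assembled from shaded vertices.
        for t in range(0, len(batch) - 2, 3):
            out_triangles.append(tuple(shaded[i] for i in batch[t:t + 3]))
    return out_triangles, invocations


# Two triangles sharing an edge: 4 unique vertices, 6 indices.
quad_vertices = [(0, 0), (1, 0), (0, 1), (1, 1)]
quad_indices = [0, 1, 2, 2, 1, 3]
double = lambda v: (2 * v[0], 2 * v[1])   # placeholder vertex shader
triangles, calls = process_batches(quad_indices, quad_vertices, double, batch_size=6)
```

With both triangles in one batch, the shared vertices 1 and 2 are shaded once each. With `batch_size=3`, each triangle forms its own batch and no reuse is possible.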
Runtime results for the vertex processing stage with selected
techniques on an NVIDIA GTX 1080 Ti are shown in Table 2. To
simulate different vertex shader loads, we used a simple matrix
multiplication (simple) as well as synthetic loads of 256, 512, and
1024 cycles; for comparison, a cached access to L1 takes about 20
cycles, and to L2 about 200. We also include a non-streaming,
multi-staged processing implementation for reference. With this
approach, all vertices in the vertex buffer are processed exactly
once by a separate kernel. The output vertex data can then be loaded
directly from global memory in a second kernel that assembles the
output primitives. This approach is employed, e.g., by Laine and
Karras (2011) for rasterization of 3D scenes. Since every vertex is
shaded exactly once, this technique can achieve ideal reuse, but
only at the cost of sacrificing the advantages of a streaming
architecture. For instance, unreferenced vertices are shaded in this
approach regardless of their relevance to the scene. This can exact
a significant performance penalty if, e.g., one large vertex buffer
is used in combination with multiple index buffers to draw relevant
portions on demand. Although this is not the case in our test
scenes, our vertex reuse techniques can even outperform multi-staged
geometry processing in several cases.
The full data set we produced for evaluating reuse in our software
renderer can be found in the accompanying supplemental material
for this paper.
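To make the trade-off between multi-staged and streaming processing concrete, the following Python sketch counts shader invocations for both strategies. It is an illustrative model under simplifying assumptions: batches are fixed-size slices of the index buffer, reuse within a batch is perfect, and the function names are hypothetical.

```python
def multi_stage_invocations(num_vertices):
    # Every vertex in the buffer is shaded exactly once, referenced or not.
    return num_vertices


def streaming_invocations(indices, batch_size):
    # Unique indices are shaded once per batch; reuse is lost across batches.
    return sum(len(set(indices[s:s + batch_size]))
               for s in range(0, len(indices), batch_size))


# A large vertex buffer of which one draw call references only vertices 0..9.
NUM_VERTICES = 1000
strip = [i + k for i in range(8) for k in range(3)]   # triangles (i, i+1, i+2)

shaded_multi = multi_stage_invocations(NUM_VERTICES)
shaded_one_batch = streaming_invocations(strip, batch_size=24)
shaded_two_batches = streaming_invocations(strip, batch_size=12)
```

In this example the multi-staged approach shades all 1000 vertices, while streaming shades only the referenced ones: 10 with a single batch, and 12 with two batches, because the vertices shared across the batch boundary are shaded twice.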
As can be seen, for a very simple vertex shader, the naïve, no-
reuse approach is the fastest, as it has no communication overhead.
However, warp voting and sorting are on average only about 1.5×
slower, and both hashing approaches are about 2.0× behind. The
Table 2: Processing times achieved with different vertex
reuse techniques for rendering of geometry on a GTX
1080Ti. We also include a non-streaming, multi-kernel tech-
nique (multi) for comparison.
load         method    sph   tree    dra    bud   as01   dx33   sg14   sh11   tr04   tw03
simple       naïve    0.07   0.34   4.59   0.81   0.14   0.05   0.19   0.83   0.21   0.35
simple       warp     0.13   0.40  10.38   1.69   0.24   0.11   0.34   1.39   0.36   0.61
simple       hash     0.16   0.42  12.89   2.05   0.27   0.12   0.46   1.47   0.40   0.74
simple       phash    0.13   0.41  11.00   1.82   0.23   0.10   0.39   1.32   0.35   0.65
simple       sort     0.20   0.51   6.36   1.21   0.31   0.13   0.41   1.18   0.34   0.61
simple       multi    0.10   0.38   6.24   1.02   0.19   0.08   0.26   1.09   0.30   0.47
256 cycles   naïve    0.22   0.66  14.71   2.57   0.47   0.17   0.64   2.57   0.72   1.18
256 cycles   warp     0.13   0.33   5.38   0.93   0.19   0.11   0.26   0.93   0.27   0.43
256 cycles   hash     0.16   0.47   9.27   1.60   0.30   0.14   0.49   1.39   0.40   0.69
256 cycles   phash    0.15   0.42   8.42   1.59   0.26   0.12   0.41   1.28   0.33   0.66
256 cycles   sort     0.22   0.66   8.19   1.69   0.37   0.15   0.51   1.56   0.54   0.87
256 cycles   multi    0.11   0.49   6.62   1.14   0.23   0.09   0.31   1.24   0.36   0.54
512 cycles   naïve    0.39   1.10  25.22   4.55   0.84   0.29   1.13   4.45   1.27   2.10
512 cycles   warp     0.17   0.53   6.38   1.13   0.27   0.14   0.35   1.27   0.38   0.56
512 cycles   hash     0.19   0.62  11.20   1.78   0.32   0.16   0.51   1.40   0.44   0.79
512 cycles   phash    0.18   0.55   8.48   1.64   0.29   0.15   0.44   1.32   0.37   0.74
512 cycles   sort     0.26   0.83   9.57   1.94   0.41   0.16   0.53   1.89   0.56   0.98
512 cycles   multi    0.13   0.68   7.82   1.36   0.27   0.11   0.38   1.57   0.44   0.65
1024 cycles  naïve    0.71   2.02  46.88   8.09   1.56   0.53   2.09   7.43   2.25   3.75
1024 cycles  warp     0.25   0.98  10.90   1.78   0.43   0.21   0.56   2.14   0.65   0.95
1024 cycles  hash     0.24   1.00  10.91   1.98   0.42   0.22   0.63   1.98   0.58   0.90
1024 cycles  phash    0.23   0.91  10.29   2.01   0.37   0.20   0.57   1.86   0.54   0.89
1024 cycles  sort     0.31   1.25  12.45   2.34   0.49   0.21   0.64   2.53   0.73   1.17
1024 cycles  multi    0.16   1.08  10.24   1.80   0.36   0.14   0.51   2.16   0.62   0.87
results indicate that, among the techniques capable of vertex
reuse, warp voting has the lowest overhead.
As the vertex shader load increases, the naïve approach quickly
falls behind, indicating that the proposed approaches can efficiently
detect vertex reuse. For a 256-cycle load, warp voting achieves the
best performance, followed by parallel hashing and hashing. For
this load, the lower overhead of warp voting still outweighs its
lower vertex reuse rate. However, for larger loads the two hashing
approaches catch up, and performance is overall tied between
these three techniques, while sorting eventually trails behind. We
attribute the high performance of both hashing approaches and
their marginal difference to the efficient implementation of shared
memory atomics on recent GPUs.
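The crossover between overhead and reuse benefit can be checked directly against the numbers in Table 2. The short Python sketch below divides the warp voting runtimes by the naïve runtimes for the lightest and heaviest shader load. Only the ten scenes listed in Table 2 are included, so the resulting averages are not expected to match the averages over the full test body quoted in the text; the helper names are hypothetical.

```python
SCENES = ["sph", "tree", "dra", "bud", "as01", "dx33", "sg14", "sh11", "tr04", "tw03"]

# Values copied from Table 2 (GTX 1080 Ti), in the scene order above.
TABLE2 = {
    ("simple", "naïve"): [0.07, 0.34, 4.59, 0.81, 0.14, 0.05, 0.19, 0.83, 0.21, 0.35],
    ("simple", "warp"): [0.13, 0.40, 10.38, 1.69, 0.24, 0.11, 0.34, 1.39, 0.36, 0.61],
    ("1024 cycles", "naïve"): [0.71, 2.02, 46.88, 8.09, 1.56, 0.53, 2.09, 7.43, 2.25, 3.75],
    ("1024 cycles", "warp"): [0.25, 0.98, 10.90, 1.78, 0.43, 0.21, 0.56, 2.14, 0.65, 0.95],
}


def relative_runtimes(load, technique):
    """Per-scene runtime of `technique` divided by the naïve runtime."""
    base = TABLE2[(load, "naïve")]
    return [t / b for t, b in zip(TABLE2[(load, technique)], base)]


def mean(values):
    return sum(values) / len(values)


for load in ("simple", "1024 cycles"):
    ratios = relative_runtimes(load, "warp")
    print(load, "mean warp/naïve ratio:", round(mean(ratios), 2))
```

For the simple shader every ratio is above 1 (reuse detection only adds overhead), whereas for the 1024-cycle load every ratio is below 1.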
To assess the performance of the proposed approaches across multiple
GPU generations, we also tested on an NVIDIA GTX 780 Ti and a GTX
980 Ti, and report the execution time relative to naïve, averaged
over the entire test body, in Figure 9. As can be seen, there is a
significant difference across GPU generations, which is mostly due
to more efficient shared memory operations. While for a simple
shader, all vertex reuse approaches reduce performance in comparison
to naïve, the more complex shaders again benefit greatly
from reuse. Although statically batched warp voting again slightly
loses ground in comparison to the other approaches on the GTX
Table 3: Runtimes for parallel Loop subdivision, executing
on three of our input models, with different reuse methods,
given in ms. Our techniques achieve up to 22% speed-up for
happy buddha over naïve streaming.
model          naïve   warp   hash  p.hash   sort
bunny           0.70   0.58   0.58    0.59   0.57
sphere          0.79   0.68   0.68    0.69   0.64
happy buddha   11.10   8.50   8.15    8.17   8.61
1080 Ti in the complex case, it outperforms the other approaches on
average over all GPUs. Additionally, statically batched warp voting
does not require analysis of the index buffer and thus can be used
in a full streaming approach.
5 SOFTWARE APPLICATIONS
In addition to their impact on rendering performance, we
also evaluate our techniques for their potential in the context of
general, software-based processing tasks that allow for reuse.
Specifically, we consider mesh subdivision and morphological
transformation for inner and outer envelopes on 2-manifold models.
Furthermore, we run a random walk simulation based on probabilistic
input parameters. All applications were implemented in
CUDA and executed on an NVIDIA GTX 1080 Ti.
5.1 Mesh Subdivision
Subdivision of low-detail meshes can be achieved by adding primitives
to a mesh, based on the adjacency information of its vertices.
The particular steps to take, as well as the orientations of the newly
introduced triangles, are determined by analyzing their topological
neighborhood. An easily parallelizable subdivision algorithm has
been presented in [Loop 1987]. Loop subdivision produces a piecewise
linear approximation of smooth surfaces based on B-spline and
multivariate spline theory. In each subdivision iteration, vertices
are added for each edge and vertex. The position of each new vertex is
computed from a convex combination of the vertices of the adjacent
primitives. Because adjacent primitives access the same vertices,
considering vertex reuse allows these memory accesses to be merged.
Figure 10 shows results for subdividing a simplified version
of the original buddha statue in this way. We ran one iteration of
the Loop subdivision with different vertex reuse strategies on the
bunny, sphere and happy buddha models.
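As an illustration of where reuse arises in subdivision, the following Python sketch computes the new edge vertices of one Loop iteration. It uses the standard Loop edge rule (weights 3/8 for the edge endpoints and 1/8 for the two opposite vertices, midpoint on boundary edges), which the text above does not spell out. The dictionary keyed by edge is an assumed stand-in for the reuse mechanism, ensuring that an edge shared by two triangles is evaluated once; `loop_edge_points` is a hypothetical name.

```python
def loop_edge_points(triangles, positions):
    """Compute one new vertex per edge, shared by both adjacent triangles."""
    opposite = {}  # edge key (lo, hi) -> opposite vertex indices
    for a, b, c in triangles:
        for u, v, w in ((a, b, c), (b, c, a), (c, a, b)):
            opposite.setdefault((min(u, v), max(u, v)), []).append(w)

    points = {}
    for (u, v), opp in opposite.items():
        if len(opp) == 2:   # interior edge: convex combination of four vertices
            weights = ((u, 3 / 8), (v, 3 / 8), (opp[0], 1 / 8), (opp[1], 1 / 8))
        else:               # boundary edge: midpoint of the two endpoints
            weights = ((u, 1 / 2), (v, 1 / 2))
        points[(u, v)] = tuple(
            sum(wgt * positions[i][k] for i, wgt in weights) for k in range(3))
    return points


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
triangles = [(0, 1, 2), (2, 1, 3)]
edge_points = loop_edge_points(triangles, positions)
```

The two triangles reference six edges in total, but only five are unique; the point on the shared edge (1, 2) is computed once and used by both.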
We evaluated a wide variety of parameters for batch
and thread block size, and chose those producing the best results for
our final comparison. For naïve and warp, a batch size of 96 was
used, with a block size of exactly one and two warps, respectively.
For both hash and p.hash, we chose batches containing up to
192 indices and 64 threads per block. Note that, for the dynamic
methods, the thread block size also equals the maximum
number of unique indices allowed in a batch. For sort, the best results
were achieved at a batch size of 768, with a block size of 256 threads.
The Loop subdivision algorithm is arguably quite simple, and hence
the cost of re-shading vertices is comparatively low. However,
the reduction in runtime with our vertex reuse techniques can still
be as high as 26%. Without exception, all vertex reuse techniques
outperform the naïve approach for the tested scenes (see Table 3).
[Figure 9 plots: series naïve, warp, hash, phash, sort; panels (a) GTX 780 Ti simple, (b) GTX 980 Ti simple, (c) GTX 1080 Ti simple, (d) GTX 780 Ti 512 cycles, (e) GTX 980 Ti 512 cycles, (f) GTX 1080 Ti 512 cycles]
Figure 9: Reported runtimes of the vertex processing stage in our software renderer, averaged over the entire test body (19
original scenes) and plotted relative to naïve. Different GPU architectures behave quite diversely: note the poor performance
of hash, compared to our optimization p.hash on older architectures (GTX 780 Ti). Overall, statically batched warp voting
seems to be the most reliable approach.
(a) Buddha statue (b) Wireframe close-up w/o and w/ subdivision
Figure 10: Running Loop subdivision on a simplified buddha
with vertex streaming. The smoother, subdivided output is
shown in the lower right.
5.2 Simplification Envelopes
The inner/outer envelopes of a mesh are defined to occupy a strict
spatial sub-/superset of the input. Resulting meshes can be used,
e.g., as input to a variety of simplification algorithms, for manipulation
of subdivision surfaces, or for conservative intersection/collision testing
with tolerance [Cohen et al. 1996; Zhou et al. 2007]. An envelope is
Table 4: Runtimes for parallel envelope creation, given in
ms. Due to the particularly high shader cost, models with
good vertex reuse (sphere) can achieve 3× the performance
obtained with naïve streaming alternatives.
model      naïve     warp     hash   p.hash     sort
bunny      71.73    29.41    22.21    21.74    21.85
sphere     15.08     5.55     4.02     4.03     4.03
buddha   5021.80  2112.45  1595.31  1516.15  1509.76
obtained by moving vertices along their vertex normal towards the
inside or the outside of the model. Provided that the original mesh
does not contain self-intersections, an inner or outer envelope must
also retain this property. Hence, before moving each vertex, we
need to determine a safe distance ϵ to ensure that no intersections
will occur as a result of its transformation. We have implemented
the analytical approach presented by Cohen et al. (1996) and ported it for
parallel execution in CUDA. Potential intersections are identified
and resolved efficiently by providing an octree representation
of the scene as auxiliary input. Inner and outer envelopes for the
bunny model are shown in Figure 1c.
We configured our routine to generate outer envelopes with
a target ϵ equal to 2% of the mesh’s bounding box diagonal. We
again report results for the best configuration found for each
technique, i.e., the one yielding consistently good results. As with subdivision,
batch/block sizes for naïve and warp were chosen as 96/32 and
96/64, respectively. For all dynamic methods, we found that a block
size of 128 works best. For hash, we picked a batch size of 576
indices, and used 768 indices for both p.hash and sort. The envelope
creation routine is comparatively complex, and the cost incurred
for each "shaded" vertex in the creation of envelopes is high:
computing an intersection-free offset for a vertex requires
traversing a spatial data structure, which has to be stored
in global memory. As in our rendering experiments with
high shader loads, a speed-up of more than 3× can be achieved over
naïve streaming. Table 4 lists the runtimes in milliseconds for
processing 2-manifold models.
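The per-vertex operation that is subject to reuse here can be sketched in a few lines of Python. The sketch derives the target ϵ as 2% of the bounding box diagonal and moves each vertex along its normal by a safe distance. The octree-based intersection test that determines this safe distance is deliberately omitted; the `limits` list is a hypothetical placeholder for its result, and all function names are assumptions.

```python
import math


def bounding_box_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box."""
    lo = [min(p[k] for p in points) for k in range(3)]
    hi = [max(p[k] for p in points) for k in range(3)]
    return math.dist(lo, hi)


def offset_vertices(points, normals, safe_eps, outward=True):
    """Move each vertex along its unit normal by its own safe distance."""
    sign = 1.0 if outward else -1.0
    return [tuple(p[k] + sign * e * n[k] for k in range(3))
            for p, n, e in zip(points, normals, safe_eps)]


points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
target_eps = 0.02 * bounding_box_diagonal(points)

# Hypothetical per-vertex limits; in a full implementation these would come
# from the intersection test against the spatial data structure.
limits = [1.0, 0.05]
safe_eps = [min(target_eps, limit) for limit in limits]
outer = offset_vertices(points, normals, safe_eps)
```

Since the expensive part is determining the safe distance per vertex, every avoided re-evaluation of a shared vertex saves one traversal of the spatial data structure.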
5.3 Parallel Random Walk
A random walk [Pearson 1905] describes a stochastic or random
process, where a path is chosen on top of a graph structure or given
domain, based on successive randomized steps. Random walks are
used, e.g., to simulate the paths of molecules traveling through
liquids, the random search path of animals, or messages traversing
through a social network.
To evaluate whether on-the-fly reuse computations can increase
the performance of such random processes, we implemented a parallel
walk on a discrete domain that follows a Lévy flight [Kleinberg
2000]. We use a grid size of 256 × 256 and place 300 000 agents
on this grid. To simulate their activity, we overlay multiple Gaussian
functions on this domain. The likelihood for agents to move
a certain distance is then computed based on the activity input
to the Fokker–Planck equation. To evaluate the movements, we
run through all potential moves with a maximum distance of 16
and keep only the 8 with the highest likelihood. Then, every
agent draws a random number to choose one of the stored options,
where each option is chosen with a probability proportional to its
relative likelihood.
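The selection step described above can be sketched in Python as follows. The sketch assumes a Euclidean distance bound of 16 for the candidate moves (the metric is not specified in the text) and takes the likelihood as an arbitrary callable; the actual likelihood derived from the Fokker–Planck equation is not reproduced here. The names `CANDIDATES` and `choose_move` are hypothetical.

```python
import random

MAX_DISTANCE = 16

# All integer moves within the assumed Euclidean distance bound.
CANDIDATES = [(dx, dy)
              for dx in range(-MAX_DISTANCE, MAX_DISTANCE + 1)
              for dy in range(-MAX_DISTANCE, MAX_DISTANCE + 1)
              if dx * dx + dy * dy <= MAX_DISTANCE ** 2]


def choose_move(candidates, likelihood, rng, keep=8):
    """Keep the `keep` most likely moves, then sample one proportionally."""
    best = sorted(candidates, key=likelihood, reverse=True)[:keep]
    weights = [likelihood(m) for m in best]
    return rng.choices(best, weights=weights, k=1)[0]


# Placeholder likelihood that favors short moves.
def short_moves(move):
    dx, dy = move
    return 1.0 / (1.0 + dx * dx + dy * dy)


move = choose_move(CANDIDATES, short_moves, random.Random(0))
```

Only the ranking and the weights depend on the agent's position; the random draw itself is individual to each agent.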
Reuse can be implemented in this scenario as follows. We encode
the current agent location as a combined integer, using half of the
bits for each dimension, yielding a single 32-bit word. This number
serves as a virtual “index” for the reuse computations, combining
agents that are currently placed on the same grid location. Given
that the movement probability only depends on the current position,
all agents with the same combined “index” will see identical movement
likelihoods, which need to be computed only once. The final step, which
involves drawing a random number and choosing a
move, has to be carried out separately for each agent.
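The encoding and the resulting reuse can be sketched in Python as follows. The bit layout (y in the upper 16 bits, x in the lower 16 bits) is an assumption, as the text only states that each dimension receives half of the 32 bits. A single dictionary replaces the batch-wise reuse detection for brevity, and `compute_likelihoods` is a hypothetical stand-in for the expensive likelihood evaluation.

```python
def pack_location(x, y):
    """Combine a grid position into one 32-bit key, 16 bits per dimension."""
    assert 0 <= x < (1 << 16) and 0 <= y < (1 << 16)
    return (y << 16) | x


def unpack_location(key):
    """Inverse of pack_location, returning (x, y)."""
    return key & 0xFFFF, key >> 16


def likelihoods_with_reuse(agents, compute_likelihoods):
    """Evaluate the expensive function once per occupied grid cell."""
    cache = {}          # packed location -> likelihood data
    evaluations = 0
    per_agent = []
    for x, y in agents:
        key = pack_location(x, y)
        if key not in cache:
            cache[key] = compute_likelihoods(x, y)
            evaluations += 1
        per_agent.append(cache[key])
    return per_agent, evaluations


agents = [(3, 5), (3, 5), (200, 17), (3, 5)]
results, evaluations = likelihoods_with_reuse(agents, lambda x, y: (x + y,))
```

In this example four agents occupy two distinct cells, so the likelihoods are evaluated twice instead of four times.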
Initializing all 300 000 agents randomly and running 10 simulation
steps on the 256 × 256 grid showed that our reuse strategies
can significantly increase the performance of the parallel random
walk. naïve, warp, hash, p.hash and sort took 0.30 ms, 0.10 ms,
0.09 ms, 0.13 ms and 0.10 ms, respectively, for one time step in their best
configurations. The batch sizes that achieved the best performance
were rather large (1536 for dynamic batching and 576 for static
batching). At first glance, the strong performance of reuse is not
surprising, as the likelihood computations are rather expensive,
and a high benefit can be expected for expensive vertex shaders.
However, note that the agents are also likely to diverge significantly
throughout the random walk. Further analysis revealed that even a
small amount of reuse already entails a significant performance
gain, as the large batch sizes can still reduce the computations.
6 CONCLUSION
While the traditional solution to vertex reuse is represented by the
post-transform vertex cache, caching seems to be less applicable for
modern, massively parallel devices. Our simulations have shown
that, even under ideal conditions, cache miss rates in a distributed
environment commonly exceed 90%. We have presented four inher-
ently parallel, batch-based approaches, providing a suitable alterna-
tive to conventional caching. Our methods are straightforward to
realize in software or hardware, and operate directly on the input
buffers for indexed triangle meshes, with little to no preprocess-
ing required. Especially for complex shading routines, we showed
that batching can achieve high reuse and increase performance
by up to 3× over non-reuse approaches. We have evaluated both
static and dynamic batching methods on a variety of applications
and test cases. Due to its use of fast, warp-level communication,
our static warp-voting technique is well-suited for basic shading
tasks, while a dynamic, hashing-based batching approach usually
performs best at high shader complexity. Considering that vertex
shaders often exhibit low-to-medium complexity and that warp voting
requires no preprocessing step, it appears to be the recommended
choice for streaming pipelines written in a compute
language.
Our results are obtained from simulations and applications in
CUDA, but similar approaches are straightforward to implement
in hardware, where communication primitives can be set
up even more efficiently. As vertex reuse needs to interface with
vertex shading, batch sizes and efficiency considerations for warp-
based execution are certainly transferable into a hardware design.
Furthermore, data reuse considerations are not only applicable to
rendering, but also to general mesh processing and parallel graph
traversal problems, where node dependencies require a similar treat-
ment. We hope that our implementation will help other researchers
kick off the conception of novel software rendering pipelines. Our
source code is publicly available at [link removed for review].
ACKNOWLEDGMENTS
This research was supported by the Max Planck Center for Visual
Computing and Communication, by the German Research Foun-
dation (DFG) grant STE 2565/1-1 and the Austrian Science Fund
(FWF) grant I 3007.
The Witcher game © CD PROJEKT S.A.
REFERENCES
Jatin Chhugani and Subodh Kumar. 2007. Geometry Engine Optimization: Cache
Friendly Compressed Representation of Geometry. In Proceedings of the 2007 Sym-
posium on Interactive 3D Graphics and Games (I3D ’07). ACM, New York, NY, USA,
9–16. https://doi.org/10.1145/1230100.1230102
Mike M. Chow. 1997. Optimized Geometry Compression for Real-time Rendering.
In Proceedings of the 8th Conference on Visualization ’97 (VIS ’97). IEEE Computer
Society Press, Los Alamitos, CA, USA, 347–ff. http://dl.acm.org/citation.cfm?id=
266989.267103
Petrik Clarberg, Robert Toth, and Jacob Munkberg. 2013. A Sort-based Deferred
Shading Architecture for Decoupled Sampling. ACM Trans. Graph. 32, 4, Article
141 (July 2013), 10 pages. https://doi.org/10.1145/2461912.2462022
Jonathan Cohen, Amitabh Varshney, Dinesh Manocha, Greg Turk, Hans Weber, Pankaj
Agarwal, Frederick Brooks, and William Wright. 1996. Simplification Envelopes.
In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive
Techniques (SIGGRAPH ’96). ACM, New York, NY, USA, 119–128. https://doi.org/
10.1145/237170.237220
Michael Deering. 1995. Geometry Compression. In Proceedings of the 22nd Annual
Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’95). ACM,
New York, NY, USA, 13–20. https://doi.org/10.1145/218380.218391
Francine Evans, Steven Skiena, and Amitabh Varshney. 1996. Optimizing Triangle
Strips for Fast Rendering. In Proceedings of the 7th Conference on Visualization ’96
(VIS ’96). IEEE Computer Society Press, Los Alamitos, CA, USA, 319–326. http:
//dl.acm.org/citation.cfm?id=244979.245626
Tom Forsyth. 2006. Linear-speed vertex cache optimisation.
Hugues Hoppe. 1999. Optimization of Mesh Locality for Transparent Vertex Caching.
In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive
Techniques (SIGGRAPH ’99). ACM Press/Addison-Wesley Publishing Co., New York,
NY, USA, 269–276. https://doi.org/10.1145/311535.311565
Jon M Kleinberg. 2000. Navigation in a small world. Nature 406, 6798 (2000), 845.
Christoph Kubisch. 2015. Life of a triangle – NVIDIA’s logical pipeline. Tech-
nical Report. NVIDIA Corporation. https://developer.nvidia.com/content/
life-triangle-nvidias-logical-pipeline
Samuli Laine and Tero Karras. 2011. High-performance Software Rasterization on
GPUs. In Proc. High Performance Graphics (HPG ’11). 79–88.
G. Lin and T. P. Y. Yu. 2006. An improved vertex caching scheme for 3D mesh rendering.
IEEE Transactions on Visualization and Computer Graphics 12, 4 (July 2006), 640–648.
https://doi.org/10.1109/TVCG.2006.59
Fang Liu, Meng-Cheng Huang, Xue-Hui Liu, and En-Hua Wu. 2010. FreePipe: A
Programmable Parallel Rendering Architecture for Efficient Multi-fragment Effects.
In Proc. Symposium on Interactive 3D Graphics and Games (I3D ’10). 75–82.
Charles Loop. 1987. Smooth Subdivision Surfaces Based on Triangles. Ph.D. Dissertation.
NVIDIA. 2016. CUDA C Programming Guide. NVIDIA Corporation.
Anjul Patney, Stanley Tzeng, Kerry A. Seitz, Jr., and John D. Owens. 2015. Piko: A
Framework for Authoring Programmable Graphics Pipelines. ACM Trans. Graph.
34, 4, Article 147 (July 2015), 13 pages. https://doi.org/10.1145/2766973
Karl Pearson. 1905. The problem of the random walk. Nature 72, 1867 (1905), 342.
Tim Purcell. 2010. Fast Tessellated Rendering on the Fermi GF100. In High Performance
Graphics Conf., Hot 3D presentation.
Pedro V. Sander, Diego Nehab, and Joshua Barczak. 2007. Fast Triangle Reordering for
Vertex Locality and Reduced Overdraw. ACM Trans. Graph. 26, 3, Article 89 (July
2007). https://doi.org/10.1145/1276377.1276489
J. W. Sheaffer, D. Luebke, and K. Skadron. 2004. A Flexible Simulation Framework
for Graphics Architectures. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS
Conference on Graphics Hardware (HWWS ’04). ACM, New York, NY, USA, 85–94.
https://doi.org/10.1145/1058129.1058142
Po-Han Wang, Chia-Lin Yang, Yen-Ming Chen, and Yu-Jung Cheng. 2011. Power
Gating Strategies on GPUs. ACM Trans. Archit. Code Optim. 8, 3, Article 13 (Oct.
2011), 25 pages. https://doi.org/10.1145/2019608.2019612
Kun Zhou, Xin Huang, Weiwei Xu, Baining Guo, and Heung-Yeung Shum. 2007. Direct
Manipulation of Subdivision Surfaces on GPUs. ACM Trans. Graph. 26, 3, Article
91 (July 2007). https://doi.org/10.1145/1276377.1276491
