Load-balanced rendering on a general-purpose tiled architecture by Chen, Jiawen (Jiawen Kevin)
Load-balanced Rendering on a General-Purpose Tiled
Architecture
by
Jiawen Chen
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2005
© Massachusetts Institute of Technology 2005. All rights reserved.
Author ..........................................................
Department of Electrical Engineering and Computer Science
May 15, 2005
Certified by...........
Fredo Durand
Assistant Professor
Thesis Supervisor
Accepted by................... .........
Arthur C. Smith
Chairman, Department Committee on Graduate Students
Load-balanced Rendering on a General-Purpose Tiled Architecture
by
Jiawen Chen
Submitted to the Department of Electrical Engineering and Computer Science
on May 15, 2005, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Commodity graphics hardware has become increasingly programmable over the last few
years, but has been limited to a fixed resource allocation. These architectures handle some
workloads well, others poorly; load-balancing to maximize graphics hardware performance
has become a critical issue. I have designed a system that solves the load-balancing problem
in real-time graphics by using compile-time resource allocation on general-purpose hard-
ware. I implemented a flexible graphics pipeline on Raw, a tile-based multicore processor.
The complete graphics pipeline is expressed using StreamIt, a high-level language based
on the stream programming model. The StreamIt compiler automatically maps the stream
computation onto the Raw architecture. The system is evaluated by comparing the perfor-
mance of the flexible pipeline with a fixed allocation representative of commodity hardware
on common rendering tasks. The benchmarks place workloads on different parts of the
pipeline to determine the effectiveness of the load-balance. The flexible pipeline achieves
up to twice the throughput of a static allocation.
Thesis Supervisor: Fredo Durand
Title: Assistant Professor
Acknowledgments
I would like to thank all my collaborators in this project, including Mike Gordon, Bill
Thies, Matthias Zwicker, Kari Pulli, and of course, my advisor Professor Frédo Durand
for all their hard work and support. Special thanks go to Mike Doggett for his unique
insider perspective on graphics hardware. I can't forget Eric Chan, for his incredibly precise
comments and crisp writing style. Finally, I would like to thank the entire Graphics Group
at MIT for their support, and wonderful games of foosball.
Contents
1 Introduction
2 Background
2.1 Fixed-Function Graphics Hardware
2.2 Programmable Graphics Hardware
2.3 Streaming Architectures
2.4 The Raw Processor
2.5 The StreamIt Programming Language
2.6 Compiling StreamIt to Raw
3 Flexible Pipeline Design
3.1 Pipeline Design
3.2 Variable Data Rates
3.3 Load-Balancing Using the Flexible Pipeline
4 Performance Evaluation
4.1 Experimental Setup
4.2 Case Study 1: Phong Shading
4.3 Case Study 2: Multi-Pass Rendering - Shadow Volumes
4.4 Case Study 3: Image Processing - Poisson Depth-of-Field
4.5 Case Study 4: Particle System
4.6 Discussion
5 Conclusion
A Code
B Figures
List of Figures
3-1 Reference Pipeline Stream Graph.
4-1 Case Study 1 Output image. Resolution: 600 x 600.
4-2 Compiler generated allocation for case study #1, last case. 1 vertex processor, 12 pixel pipelines, 2 fragment processors for each pixel pipeline. Unallocated tiles have been removed to clarify the routing.
4-3 Case 1 utilization graphs.
4-4 Case Study 2 Output image. The relatively small shadow caster still creates a very large shadow volume. Resolution: 600 x 600.
4-5 Compiler generated layout for case study #2. 1 Vertex Processor and 20 pixel pipelines. Unallocated tiles have been omitted to clarify routing.
4-6 Case 2 utilization graph. Depth buffer pass on the left, shadow volume pass on the right.
4-7 Compiler generated layout for case study #3. 1 tile wasted for the start signal and one was unused. 62 tiles perform the filtering. Unallocated tiles have been omitted to clarify routing.
4-8 Case Study 3 Output image. Resolution: 600 x 600.
4-9 Case 3 utilization graph.
4-10 Compiler generated layout for case study #4, last case. 2 stage pipelined Triangle Setup. Unallocated tiles have been omitted to clarify routing.
4-11 Case Study 4 Output image. Resolution: 600 x 600.
4-12 Case 4 utilization graphs.
A-1 Common Triangle Setup Code, Page 1 of 3.
A-2 Common Triangle Setup Code, Page 2 of 3.
A-3 Common Triangle Setup Code, Page 3 of 3.
A-4 Common Rasterizer code, using the homogeneous rasterization algorithm.
A-5 Common Frame Buffer Operations Code.
A-6 Case Study #1 Vertex Shader Code.
A-7 Case Study #1 Pixel Shader Code.
A-8 Case Study #2: Shadow Volumes Z-Fail Pass Frame Buffer Operations Code.
A-9 Case Study #3 Poisson Disc Filter Code, Page 1 of 2.
A-10 Case Study #3 Poisson Disc Filter Code, Page 2 of 2.
A-11 Case Study #4 Vertex Shader Code, Page 1 of 2.
A-12 Case Study #4 Vertex Shader Code, Page 2 of 2.
B-1 Fixed function pipeline block diagram.
List of Tables
4.1 Case 1 Performance Comparison
4.2 Case 2 Performance Comparison
4.3 Case 4 Performance Comparison
Chapter 1
Introduction
Rendering realistic scenes in real-time has always been a challenge in computer graphics.
Displaying large-scale environments with sophisticated lighting simulation, textured sur-
faces, and other natural phenomena requires an enormous amount of computational power
as well as memory bandwidth. Graphics hardware has long relied on using specialized units
and the inherent parallelism of the computation to achieve real-time performance. Tradi-
tionally, graphics hardware has been designed as a pipeline, where the scene description
flows through a series of specialized fixed-function stages to produce the final image.
Recently, there has been a trend toward adding programmability to the graphics pipeline.
Modern graphics processors (GPUs) feature fully programmable vertex and fragment processors
and reconfigurable texturing and blending stages. The latest features, such as floating-point
per-pixel operations and dynamic flow control, offer exciting possibilities for sophisticated
shading as well as for performing general-purpose computation using GPUs.
Despite significant gains in performance and programmable features, current GPU ar-
chitectures have a key limitation: a fixed resource allocation. For example, the NVIDIA
NV40 processor has 6 vertex pipelines, 16 fragment pipelines, and a fixed set of other re-
sources surrounding these programmable stages. The allocation is fixed at design time and
remains the same for all software applications that run on this chip.
Although GPU resource allocations are optimized for common workloads, it is diffi-
cult for a fixed allocation to work well on all possible scenarios. This is due to a key
difference between graphics processing and traditional digital signal processing (DSP)
applications. DSP hardware is usually optimized for specific types of algorithms such as
multiply-accumulate schemes for convolution and matrix-vector multiplication. Such al-
gorithms operate with a static data rate, which allows the hardware to be extremely deeply
pipelined and achieve excellent throughput. In contrast, graphics applications introduce a
variable data rate into the pipeline. In graphics, a triangle can occupy a variable number of
pixels on the screen. Hence, the hardware cannot be optimized for all inputs. For instance,
consider a scene that contains detailed character models standing on a landscape. The
detailed character models comprise thousands of triangles, each of which covers only
a few pixels on the screen, while the landscape contains only a few large triangles that
cover most of the screen. In this case, it is impossible for a fixed resource allocation to
balance the workload: the bottleneck is in either the vertex or the pixel processing stage.
Programmability only aggravates the load-balancing problem. Many applications now perform
additional rendering passes using programmable pixel shaders to do special effects, during
which the vertex engines are completely idle. Conversely, vertex shaders are now being
used to perform complex deformations on the geometry and the pixel engines are idle. And
when an application spends much of its time rasterizing shadow volumes, almost the entire
chip is idle. These are common scenarios that suffer from load imbalance due to a fixed
resource allocation.
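To make the imbalance concrete, the following sketch models a two-stage pipeline with a fixed number of vertex and fragment units, using the NV40's 6/16 split quoted above. The per-item costs and triangle sizes are illustrative assumptions rather than measurements, and each stage is treated as perfectly parallel.

```python
# Toy model of a two-stage pipeline with a fixed resource allocation.
# All costs and workload numbers below are illustrative assumptions.

def stage_utilization(vertex_units, fragment_units, pixels_per_triangle,
                      vertex_cost=1.0, fragment_cost=1.0, verts_per_triangle=3):
    """Return (vertex_util, fragment_util) at steady state.

    Each stage is assumed perfectly parallel, so the time a stage needs
    per triangle is its work divided by its unit count. The slower stage
    sets the pace and runs at 100%; the other idles proportionally.
    """
    vertex_time = verts_per_triangle * vertex_cost / vertex_units
    fragment_time = pixels_per_triangle * fragment_cost / fragment_units
    bottleneck = max(vertex_time, fragment_time)
    return vertex_time / bottleneck, fragment_time / bottleneck

# Detailed characters: many tiny triangles (about 2 pixels each).
characters = stage_utilization(6, 16, pixels_per_triangle=2)
# Landscape: a few huge triangles (about 10,000 pixels each).
landscape = stage_utilization(6, 16, pixels_per_triangle=10_000)

print("characters: vertex %.0f%%, fragment %.0f%%" % tuple(100 * u for u in characters))
print("landscape:  vertex %.0f%%, fragment %.0f%%" % tuple(100 * u for u in landscape))
```

Under these assumed numbers the same allocation leaves the fragment units mostly idle on the character models and the vertex units almost entirely idle on the landscape.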
I propose a new approach to solving the load-balancing problem statically by using
compile-time resource allocation. In this method, the programmer designs a set of profiles
that specify the resource allocation at compile time. At runtime, the hardware switches
between profiles on explicit "context switch" instructions, which can occur within a frame.
The proposed method is somewhat extreme: the method requires hardware that has not only
programmable units, but also a programmable network between units to control the data
flow. While it is expected that GPUs will retain a level of specialization in the foreseeable
future (in particular for rasterization), a fully programmable pipeline gives us a new point
in the design space that permits efficient load-balancing.
The prototype implementation of the approach is made feasible by two unique technologies:
a multicore processor with programmable communication networks and a programming
language that allows the programmer to easily specify the topology of the graphics
pipeline using high-level language constructs. The pipeline is executed on the Raw
processor [20], which is a highly scalable architecture with programmable communica-
tion networks. The programmable networks allow the programmer to realize pipelines
with different topologies; hence, the user is free to allocate computation units to rendering
tasks based on the demands of the application. The prototype is written completely using
StreamIt [11], which is a high-level language based on a stream abstraction. The stream
programming model of StreamIt facilitates the expression of parallelism with high-level
language constructs. The StreamIt compiler generates the code to control Raw's networks
and relieves the programmer of the burden of manually managing data routing between
processor tiles. On the compiler side, the main challenge has been to extend StreamIt to
handle the variable data rates present in 3D rendering due to the variable number of pixel
outputs per triangle and shader lengths.
I must emphasize that the implementation is a proof of concept and cannot compete
with commodity GPUs in terms of pure performance. Beyond the implementation of a
load-balanced 3D graphics pipeline in this particular environment, the thesis is that load-
balancing and increased programmability can be achieved through the following approach:
A multicore chip with exposed communication enables general-purpose computation and
resource reallocation by rerouting data flow.
A stream-based programming model that facilitates the expression of arbitrary compu-
tation.
A compiler approach to static load-balancing facilitates the appropriate allocation of com-
puting units for each application phase. This shifts load balancing away from the
programmer who only needs to provide hints such as the expected number of pixels
per triangle.
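The third point can be sketched as a small search: given a tile budget and a programmer hint for the expected number of pixels per triangle, pick the split that maximizes throughput. This is an illustrative model only. The function names, the cost parameters, and the exhaustive search are assumptions and do not describe the StreamIt compiler's actual partitioning algorithm.

```python
# Hypothetical compile-time allocator: choose how many tiles go to vertex
# work and how many to pixel work, given a hint about the workload.
# Costs are in assumed cycles per item; none of this is from the thesis.

def throughput(vertex_tiles, pixel_tiles, pixels_per_triangle,
               vertex_cost, pixel_cost, verts_per_triangle=3):
    """Triangles per cycle, limited by the slower of the two stages."""
    vertex_rate = vertex_tiles / (verts_per_triangle * vertex_cost)
    pixel_rate = pixel_tiles / (pixels_per_triangle * pixel_cost)
    return min(vertex_rate, pixel_rate)

def best_allocation(total_tiles, pixels_per_triangle, vertex_cost, pixel_cost):
    """Exhaustively try every split that gives each stage at least one tile."""
    candidates = ((v, total_tiles - v) for v in range(1, total_tiles))
    return max(
        candidates,
        key=lambda split: throughput(split[0], split[1], pixels_per_triangle,
                                     vertex_cost, pixel_cost),
    )

# Two "profiles" computed ahead of time for different phases of a frame.
small_triangles = best_allocation(16, pixels_per_triangle=2,
                                  vertex_cost=100, pixel_cost=50)
large_triangles = best_allocation(16, pixels_per_triangle=500,
                                  vertex_cost=100, pixel_cost=50)
print("small-triangle profile (vertex, pixel):", small_triangles)
print("large-triangle profile (vertex, pixel):", large_triangles)
```

With these assumed costs, the hint alone moves the split from mostly vertex tiles to mostly pixel tiles, which is the kind of decision the approach delegates to the compiler.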
This thesis focuses on load balancing and the programming model at the cost of the fol-
lowing points, which I plan to address in future work. Emphasis is not placed on the mem-
ory system, although it plays a major role on performance, especially in texture caching.
Additionally, the proposed method only explores static or compile-time load-balancing.
Dynamic load-balancing would be an exciting direction in future research. Finally, my
method pushes full programmability quite far and does not use specialized units for ras-
terization, the stage of the 3D pipeline that seems the least-likely component to become
programmable in the near future for performance reasons. Specializing triangle rasteriz-
ers to support other rendering primitives (e.g., point sprites) would also be interesting for
future research.
Despite these limitations, and although the performance obtained by the simulation
cannot compete with state-of-the-art graphics cards, I believe the proposed method is an
important first step in addressing the load-imbalance problem in current graphics architec-
tures. Solving this problem is important because doing so maximizes the use of available
GPU resources, which in turn implies a more efficient and cost-effective rendering archi-
tecture. I hope that this work will inspire future designs to be more reconfigurable and yield
better resource utilization through load-balancing.
A review of background and related work, the Raw processor, and the StreamIt programming
language is presented in Chapter 2. I describe the implementation in Chapter 3,
including details of the algorithms used in each pipeline stage. Chapter 4 evaluates the
implementation and proposes enhancements to the architecture. Finally, Chapter 5 will
conclude the thesis and propose possible directions for future research.
Chapter 2
Background
This chapter provides some background on the evolution of graphics hardware from fixed
function rasterizers to fully programmable stream processors. It also highlights other works
relevant to this thesis, including parallel interfaces, load distribution, scalability, general
purpose computation on GPUs (GPGPU), and the stream abstraction. Section 2.4 provides
a brief overview of the Raw processor and the unique features that enable efficient map-
ping of streaming computation to the hardware. In Section 2.5, I give an overview of the
StreamIt language, its programming model, and the mapping of StreamIt programs to Raw.
2.1 Fixed-Function Graphics Hardware
3D graphics hardware in the 1980s and 1990s consisted of fixed-function pipelines. The typical
graphics pipeline was composed of the following stages (Figure B-1 in Appendix B shows a
pipeline block diagram [2]):
• Command / Input Parser: Receives geometry and state change information from
the application.
• Vertex Transformation and Lighting: Transforms vertex attributes (position, normal,
texture coordinates) from 3D world space to 3D eye space and projects them into
2D screen space. One of a fixed set of lighting models is applied to the vertices.
• Primitive Assembly / Triangle Setup: Vertices are assembled into primitives such
as triangles, triangle strips, and quadrilaterals. Primitive facing is determined, and
primitives are optionally discarded depending on user-selected culling modes (typically
front or back).
• Rasterization: The primitive is rasterized onto the pixel grid and fragments are
generated for texturing. Vertex colors and texture coordinates are interpolated across the
primitive to generate fragment colors and texture coordinates.
• Texturing: One or two textures may be retrieved from memory using the fragment's
texture coordinates and applied using a fixed set of texture operations (such as decal
or color modulation).
• Frame Buffer Operations: The incoming fragment is optionally tested against a
depth buffer and a stencil buffer, and blended with the previous fragment in the frame
buffer. The user can typically choose from one of several modes for each buffer
operation.
• Display: The display device draws the contents of the frame buffer.
Components of the graphics pipeline were at most configurable. For example, the user
may have the ability to select between one or two texture filtering modes, but is not able
to customize the shading of a fragment. In these early architectures, exploiting the paral-
lelism in the computation and using specialized hardware was key to achieving real-time
performance.
Molnar et al. [12] have characterized parallel graphics architectures by their sorting
classification. Due to the order dependent nature of graphics applications, parallel archi-
tectures require at least one synchronization point to enforce correct ordering semantics.
Sorting can take place almost anywhere in the pipeline. A "sort-first" architecture distrib-
utes vertices to parallel pipelines during geometry processing before screen-space positions
are known and takes advantage of object-level parallelism. A "sort-middle" architecture
synchronizes between geometry processing and rasterization, redistributing screen-space
primitives to take advantage of image-space parallelism. A "sort-last" architecture redis-
tributes fragments after rasterization to parallel frame buffers taking advantage of memory
bandwidth.
The Silicon Graphics RealityEngine [1] and InfiniteReality [13] were highly scalable
sort-middle architectures. These fixed-function pipelines were built using special-purpose
hardware and distributed onto multiple boards, each of which contained dozens of chips.
Both architectures used specialized hardware for geometry processing, rasterization, and
display generation (frame buffer). Despite off-chip and off-board communications latencies,
these architectures achieved real-time framerates for large scenes by combining extremely
fast specialized hardware with a highly parallel processing pipeline.
In 1996, Nishimura and Kunii [14] designed VC-1, a highly scalable sort-last architecture
based on general-purpose processors. While VC-1 relied on general-purpose processors,
it did not feature user programmability. VC-1 was able to achieve excellent performance
by exploiting virtual frame buffers, which greatly simplified network communications
and parallelized the computation.
In 1997, Eyles et al. designed PixelFlow, a highly-scalable architecture [5]. It was com-
posed of multiple networked workstations each with a number of boards dedicated to parts
of the rendering pipeline. While described as an "object-parallel" rendering system, Pix-
elFlow had two synchronization points not unlike Pomegranate, a later "sort-everywhere"
architecture [4]. It featured an 800 MB/s geometry sorting network and a 6.4 GB/s im-
age composition network as well as API changes to permit programmer control over the
parallel computation.
These designs all took advantage of both data and task parallelism. Since during com-
putation, the data ordering is essentially independent (except for the one required synchro-
nization point), multiple pipelines can be used for each stage. For example, in a sort middle
architecture, after rasterization, the fragments are essentially independent. As long as two
fragments that occupy the same position are drawn in the same order they arrived, they
can be distributed onto an arbitrary number of pipelines. Computation tasks are also paral-
lel: once a vertex has been transformed, it can go on to the next stage. The next vertex is
independent of the current vertex. Hence, the pipeline can be made extremely deep.
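The ordering rule above can be illustrated with a short sketch. Fragments are routed to a pipeline chosen purely from their pixel position, so two fragments for the same pixel always land in the same FIFO and keep their arrival order. The interleaved mapping and the tuple layout are assumptions made for illustration.

```python
from collections import deque

def distribute(fragments, num_pipelines, width):
    """Route (x, y, payload) fragments to per-pipeline FIFO queues.

    The pipeline index depends only on the pixel position, so fragments
    that hit the same pixel share a queue and stay in arrival order.
    """
    queues = [deque() for _ in range(num_pipelines)]
    for x, y, payload in fragments:
        queues[(y * width + x) % num_pipelines].append((x, y, payload))
    return queues

# Arrival order: pixel (1, 0) is written by "red" first, then "blue".
stream = [(0, 0, "a"), (1, 0, "red"), (2, 0, "b"), (1, 0, "blue"), (3, 1, "c")]
queues = distribute(stream, num_pipelines=4, width=4)

# Each queue can now be drained independently; the last write per pixel wins.
frame_buffer = {}
for q in queues:
    for x, y, payload in q:
        frame_buffer[(x, y)] = payload
print(frame_buffer[(1, 0)])
```

Because the queues never share a pixel, they can be drained in any order or in parallel without changing the final image.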
The scalability of such parallel architectures has been studied from both an interface
perspective [6] and from a network and load distribution perspective [4]. Both agree
that in order to fully exploit parallel architectures, explicit programmer-controlled
synchronization primitives are required. These synchronization primitives effectively change the
programming model and force the programmer to consider parallelism when rendering.
Pomegranate [4] takes advantage of these synchronization primitives to design a
"sort-everywhere" architecture that fully parallelizes every stage of the graphics pipeline.
2.2 Programmable Graphics Hardware
Increased transistor budgets provided by modern manufacturing processes have made it
more viable to add a certain level of programmability to various functional units in the
GPU. The first generation of truly programmable GPUs was the "DirectX 8 series",
including NVIDIA's GeForce3 and ATI's Radeon 8500 [7]. Although these chips featured
user-programmable vertex and pixel shaders, their programmability was limited. Program-
ming was done in a low-level assembly-like language, with limited instruction counts, no
branching, and a small instruction set. Despite these limitations, programmers were able
to take advantage of these new capabilities. Multi-pass rendering, in particular, was and
still is used extensively to create special effects using programmable hardware. A
typical multi-pass rendering scenario is rendering real-time shadows using the shadow map
algorithm. In the first pass, the scene is rendered from the point of view of a light, writing
only to the depth buffer. The depth buffer is read back as a texture, and the scene is ren-
dered again from the point of view of the camera. The vertex shader is used to transform the
object position into the coordinate system of the light, and the fragment shader compares
the fragment's light-space depth value against the depth value stored in the texture. The
result of this comparison is whether the fragment is lit or not, and can be used to control
the resulting color.
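The two passes can be sketched on the CPU as follows. This is a deliberately simplified stand-in: the "scene" is a list of points already expressed in light space, the shadow map is a dictionary keyed by texel, and the depth bias is an assumed constant. Real implementations run these steps in vertex and fragment shaders.

```python
# Simplified shadow-map test. Points are (u, v, depth) in light space, where
# (u, v) is the shadow-map texel and depth is the distance from the light.
# The bias value is an assumption to avoid self-shadowing from rounding.

def build_shadow_map(points):
    """Pass 1: keep the nearest depth seen by the light at each texel."""
    shadow_map = {}
    for u, v, depth in points:
        if depth < shadow_map.get((u, v), float("inf")):
            shadow_map[(u, v)] = depth
    return shadow_map

def is_lit(point, shadow_map, bias=1e-3):
    """Pass 2: a fragment is lit if nothing nearer to the light occludes it."""
    u, v, depth = point
    return depth <= shadow_map.get((u, v), float("inf")) + bias

occluder = (5, 5, 2.0)   # close to the light
floor_a = (5, 5, 9.0)    # same texel, farther away: in shadow
floor_b = (6, 5, 9.0)    # different texel: unobstructed

shadow_map = build_shadow_map([occluder, floor_a, floor_b])
for name, p in [("occluder", occluder), ("floor_a", floor_a), ("floor_b", floor_b)]:
    print(name, "lit" if is_lit(p, shadow_map) else "shadowed")
```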
Current-generation GPUs, such as NVIDIA's NV40 and ATI's R420 (DirectX 9 Se-
ries) are much more powerful than their DirectX 8 counterparts. They permit much longer
shader lengths, and dynamic branching inside the programmable shaders. There are also
a number of high-level languages that target these platforms [8, 10, 18]. It is expected that
future GPUs will feature "unified shaders", where both the programmable vertex and pixel
shaders use the same physical processor. This design could increase overall throughput
since in a typical application, the load is not balanced between the vertex and pixel stages.
The ease of programmability, immense arithmetic capability and large memory bandwidth
of GPUs have made them more and more attractive for general purpose, high-performance
computing. The architecture of rendering pipelines is beginning to match that of traditional
DSPs, albeit with much more arithmetic and memory capability. In particular, GPU
architectures seem well suited for streaming computations. Buck et al. [3] have designed
Brook, a streaming language that permits the implementation of fairly general streaming
algorithms on the graphics processor, mapping these algorithms to vertex and pixel shaders.
2.3 Streaming Architectures
Owens et al. characterized polygon rendering as a stream operation and demonstrated an
implementation of a fixed-function pipeline on Imagine, a stream processor [16]. In their
implementation, the various stages of the graphics pipeline were expressed as stream ker-
nels (a.k.a. filters), which were swapped in and out of the Imagine's processing units.
Producer/consumer locality was exploited by storing intermediate values in a large set of
stream registers and using relatively small kernels which minimizes the cost of context
switches. Owens et al. have also compared Reyes, an alternative rendering pipeline organization,
against the traditional OpenGL pipeline on a stream processor [17]. Reyes is a very different
pipeline from OpenGL, using tessellation and subdivision instead of triangle rasterization.
It has several properties well-suited for stream computation: bounded-size primitives, only
one programmable shading stage, and coherent texture memory access. The comparison
shows that Reyes essentially has the same deficiencies as OpenGL: the variable data rate in
the computation is now during surface subdivision instead of triangle rasterization.
My architecture builds on Brook and Imagine in that it also uses the stream abstraction,
but as the programming model. Not only is it used to express vertex and fragment shader
computation, but also to express the full graphics pipeline itself. The approach differs from
Imagine in that Imagine uses a time-multiplexing scheme: the data remains in memory
and kernels are context switched to operate on data. My approach uses space-multiplexing,
where kernels are mapped onto a set of processing units, and data flows through the kernels.
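The distinction can be sketched with two ways of running the same pair of kernels. The kernel functions are hypothetical placeholders. In the time-multiplexed version one processor runs each kernel over the whole batch in turn, while in the space-multiplexed version each kernel is a resident stage and items flow through the stages one at a time (modeled here with Python generators).

```python
def transform(v):          # hypothetical stand-in for a vertex kernel
    return v * 2

def shade(v):              # hypothetical stand-in for a fragment kernel
    return v + 1

def time_multiplexed(data):
    """Data stays in memory; kernels are swapped in one after another."""
    buffer = list(data)
    for kernel in (transform, shade):       # one "context switch" per kernel
        buffer = [kernel(item) for item in buffer]
    return buffer

def stage(kernel, upstream):
    """A kernel pinned to its own processing unit, consuming a stream."""
    for item in upstream:
        yield kernel(item)

def space_multiplexed(data):
    """Kernels stay resident; each item flows through all stages."""
    return list(stage(shade, stage(transform, iter(data))))

items = [1, 2, 3]
print(time_multiplexed(items), space_multiplexed(items))
```

Both schedules compute the same result; they differ in whether the code or the data moves, which is what determines how resources can be reallocated.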
My stream programming method is also related to the shader algebra [9], where shaders
can be combined and code analysis leads to efficient compilation and dead-code elimination
for complex vertex and pixel shader combinations. In this case, rather than eliminate dead-
code from a limited number of programmable stages, depending on the application, unused
pipeline stages can be dropped completely from the graphics pipeline.
Given the trend of increased programmability and unifying programmable units in
GPUs, I consider rendering on an architecture that contains a single type of general-purpose
processing element. The goal is to study the implications and challenges that this scenario
imposes on the "driver" of such a processor. The driver will be of critical importance be-
cause it allocates resources depending on the rendering task and ensures that the processor
is used efficiently.
2.4 The Raw Processor
The Raw processor [20,22] is a versatile architecture that achieves scalability by addressing
the wire delay problem. Raw aims to perform as well as Application Specific Integrated
Circuits (ASICs) on special-purpose kernels while still achieving reasonable performance
on general-purpose programs. Raw's design philosophy is to address the wire-delay prob-
lem by exposing its rich on-chip resources, including logic, wires, and pins, through a new
instruction set architecture (ISA) to the software. In contrast with other architectures, this
allows Raw to more effectively exploit all forms of parallelism, including instruction, data,
and thread level parallelism as well as pipeline parallelism.
Tile-Based Architecture. Raw is a parallel processor with a 2-D array of identical, pro-
grammable tiles. Each tile contains a compute processor as well as a switch processor
that manages four networks to neighboring tiles. The compute processor is composed of
an eight-stage in-order single-issue MIPS-style processor, a four-stage pipelined floating-point
unit, a 32kB data cache, and a 32kB instruction cache. The current prototype is
implemented in an IBM 180nm ASIC process running at 450MHz; on one chip, it con-
tains 16 uniform tiles arranged in a square grid. The theoretical peak performance of this
prototype is 6.8 GFLOPS. Raw's scalable design allows an arbitrary number of 4x4 chips
to be combined into a "Raw fabric" with minimal changes. The performance evaluation
for my architecture is done using btl, a cycle-accurate Raw simulator, modeling a 64-tile
configuration. Though the prototype chip contains only 16 tiles, a 64-tile configuration is
planned.
On-Chip Communication Networks. The switch processors control four 32-bit full-
duplex on-chip networks. The networks are register-mapped, blocking, and flow-controlled,
and they are integrated directly into the bypass paths of the processor pipeline. As a key
innovative feature of Raw, these networks are exposed to the software through the Raw
ISA.
There are four networks total: two static and two dynamic. The static networks are
used for communication patterns that are known at compile time. To route a word from one
tile to another over a static network, it is the responsibility of the compiler to insert a route
instruction on every intermediate switch processor. The static networks are ideal for reg-
ular stream-based traffic and can also be used to exploit instruction level parallelism [21].
The dynamic networks support patterns of communication that vary at runtime. Items are
transmitted in packets; a header encodes the destination tile and packet length. Routing is
done dynamically by the hardware, rather than statically by the compiler. There are two
dynamic networks: a memory network for trusted clients (data caches, I/O, etc.) and a
general network for use by applications.
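The compiler's task on the static network can be sketched as follows: for one word travelling between two tiles, list every switch processor on the path together with the ports it must connect. The X-then-Y path is an assumed policy chosen for simplicity, and the tuples are a hypothetical representation, not Raw switch assembly.

```python
# Hypothetical sketch: which switch processors need a route instruction to
# move one word between two tiles of a grid. Tiles are (col, row), with row
# numbers growing toward the south. "P" denotes the tile's own processor.

def static_route(src, dst):
    """Return [(tile, in_port, out_port)] for every switch on the path."""
    path = [src]
    x, y = src
    while x != dst[0]:                       # travel along the row first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                       # then along the column
        y += 1 if dst[1] > y else -1
        path.append((x, y))

    def direction(a, b):
        dx, dy = b[0] - a[0], b[1] - a[1]
        return {(1, 0): "E", (-1, 0): "W", (0, 1): "S", (0, -1): "N"}[(dx, dy)]

    opposite = {"E": "W", "W": "E", "N": "S", "S": "N"}
    routes = []
    for i, tile in enumerate(path):
        # The word enters from the processor at the source and is delivered
        # to the processor at the destination; in between it crosses switches.
        in_port = "P" if i == 0 else opposite[direction(path[i - 1], tile)]
        out_port = "P" if i == len(path) - 1 else direction(tile, path[i + 1])
        routes.append((tile, in_port, out_port))
    return routes

for tile, in_port, out_port in static_route((0, 0), (2, 1)):
    print(f"switch {tile}: route {in_port} -> {out_port}")
```

Every tile on the path appears in the result, which reflects the requirement that each intermediate switch be programmed explicitly.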
Memory System. On the boundaries of the chip, the network channels are multiplexed
onto the pins to form flexible I/O ports. Words routed off the side of the chip emerge on
the pins, and words put on the pins by external devices appear on the networks. Raw's
memory system is built by connecting these ports to external DRAMs. For the 16 tile
configuration, Raw supports a maximum number of 14 ports, which can be connected to
up to 14 full-duplex DRAM memory banks, leading to a memory bandwidth of 47GB per
second. Tiles on the boundary can access memory directly over the static network. Tiles
on the interior of the chip access memory over the memory dynamic network and incur a
latency proportional to the distance to the boundary. Data-independent memory accesses
can be pipelined; hence, overall throughput is unaffected.
Raw as Graphics Hardware. Raw is an interesting architecture for graphics hardware
developers, because its design goals share a number of similarities with current GPUs.
Raw is tailored to effectively execute a wide variety of computations, from special purpose
computations that are often implemented using ASICs to conventional sequential programs.
GPUs exploit data parallelism (by replicating rendering pipelines, using vector units), in-
struction level parallelism (in super-scalar fragment processors), and pipeline parallelism
(by executing all stages of the pipeline simultaneously). Raw, too, is capable of exploiting
these three forms of parallelism. In addition, Raw is scalable: it consists of uniform tiles
with no centralized resources, no global buses, and no structures that get larger as the tile
count increases. In contrast to GPUs, Raw's computational units and communication chan-
nels are fully programmable, which opens up almost unlimited flexibility in laying out a
graphics pipeline and optimizing its efficiency on Raw.
On the other hand, the computational power of the current 16-tile, prototype Raw
processor is more than an order of magnitude smaller than the power of current GPUs.
GPUs can perform far more parallel operations in a single cycle than Raw: Raw's computation
units do not perform vector computation. In addition, Raw is a research prototype
implemented with a 0.18 µm process; an industrial design with a modern 90 nm process
would achieve higher clock frequencies.
Hence, my prototype implementation is not intended to compete with current GPUs
in terms of absolute performance, but demonstrates the benefits of a flexible and scalable
architecture for efficient resource utilization.
2.5 The StreamIt Programming Language
StreamIt [11,23] is a high level stream language that aims to be portable across communication-
exposed architectures such as Raw. The language exposes the parallelism and communica-
tion of streaming programs without depending on the topology or granularity of the under-
lying architecture. The StreamIt programming model is based on a structured stream ab-
straction: all stream graphs are built out of a hierarchical composition of filters, pipelines,
split-joins, and feedback loops.
As I will describe in more detail in Chapter 3, the structured stream graph abstraction
provided by StreamIt lends itself to expressing data parallelism and pipeline parallelism that
appear in graphics pipelines. In particular, I will show how to use StreamIt for high-level
specification of rendering pipelines with different topologies. As previously published,
StreamIt permits only static data rates. In order to implement a graphics system that allows
different triangle sizes, variable data rates are a necessity. Section 3.2 will describe how
variable data rates are implemented in the StreamIt language and compiler.
Language Constructs. The basic unit of computation in StreamIt is the filter (also called
a kernel; the two terms are used interchangeably). A filter is a single-input, single-output block with a
user-defined procedure for translating input items to output items. Filters send and receive
data to and from other filters through FIFO queues with compiler type-checked data types.
StreamIt distinguishes between filters with static and variable data rates. A static data rate
filter reads a fixed number of input items and writes a fixed number of output items in each
cycle of execution, whereas a variable data rate filter has a varying number of input or
output items in each cycle.
In addition to the filter, StreamIt provides three language constructs to compose stream
graphs: pipeline, split-join, and feedback-loop. Each of these constructs, including a filter,
is called a stream. In a pipeline, streams are connected in a linear chain so that the outputs
of one stream are the inputs to the next stream. In a split-join configuration, the output from
a stream is split onto multiple (not necessarily identical) streams that have the same input
data type. The data can be either duplicated to every branch or distributed among the
branches according to a weighted round-robin policy. The split data must be joined somewhere downstream unless it is the sink to the
stream. The split-join allows the programmer to specify data-parallelism between streams.
The feedback loop enables a stream to receive input from downstream, for applications
such as MPEG.
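
The constructs described above can be modeled in a few lines of Python. This is an illustrative sketch, not StreamIt syntax: the class names `Filter`, `Pipeline`, and `SplitJoin` and the list-based channels are assumptions made for this example, and only static data rates and unit-weight round-robin splitting are modeled.

```python
from typing import Callable, List


class Filter:
    """Static-rate filter: pops `pop` items per firing and pushes what `work` returns."""

    def __init__(self, pop: int, work: Callable[[List[float]], List[float]]):
        self.pop = pop
        self.work = work

    def run(self, items: List[float]) -> List[float]:
        out: List[float] = []
        # Fire once per complete group of `pop` input items.
        for i in range(0, len(items) - self.pop + 1, self.pop):
            out.extend(self.work(items[i:i + self.pop]))
        return out


class Pipeline:
    """Linear chain: the output of each stream is the input of the next."""

    def __init__(self, *stages):
        self.stages = stages

    def run(self, items):
        for stage in self.stages:
            items = stage.run(items)
        return items


class SplitJoin:
    """Round-robin splitter and joiner with unit weights over parallel branches."""

    def __init__(self, *branches):
        self.branches = branches

    def run(self, items):
        n = len(self.branches)
        # Splitter: item i goes to branch i mod n.
        outs = [b.run(items[k::n]) for k, b in enumerate(self.branches)]
        # Joiner: take one item from each branch in turn
        # (zip stops at the shortest branch output).
        joined = []
        for group in zip(*outs):
            joined.extend(group)
        return joined


scale = Filter(1, lambda xs: [2.0 * xs[0]])
offset = Filter(1, lambda xs: [xs[0] + 1.0])
pair_sum = Filter(2, lambda xs: [xs[0] + xs[1]])

# A split-join of two different filters, followed by a filter that pops two items.
graph = Pipeline(SplitJoin(scale, offset), pair_sum)
result = graph.run([1.0, 2.0, 3.0, 4.0])
```

Because every construct exposes the same `run` interface, a split-join can be nested inside a pipeline (and vice versa), which mirrors the hierarchical composition of streams.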
2.6 Compiling StreamIt to Raw
A compiler for mapping static data rate StreamIt to Raw has been described in previous
work [11]. Extending the compiler to support variable data rates for graphics is described in
Section 3.2. Compilation involves four stages: dividing the stream graph into load-balanced
partitions, laying out the partitions on the chip, scheduling communication between the
partitions, and generating code. The operation of these stages is summarized below.
Partitioning. StreamIt hides the granularity of the target machine from the programmer.
The programmer specifies the computation as abstract filters, which are independent of the underlying hardware.
It is the responsibility of the compiler to partition the high level stream graph into
efficient units of execution for the particular architecture. Given a processor with N ex-
ecution units, the partitioning stage transforms a stream graph into a set of no more than
N filters that run on the execution units, while satisfying the logical dataflow of the graph
as well as constraints imposed by the hardware. To achieve this, the StreamIt partitioner
applies a set of fusion, fission, and reordering transformations to the stream graph. Workload
estimations for filters are calculated by simulating their execution and appropriate
transformations are chosen using a greedy strategy. However, simulation can only pro-
vide meaningful workload relationships for stream graphs with static data rates. Hence, in
stream graphs that contain variable data-rate filters partitioning is performed separately for
each static subgraph. Refer to Section 3.2 for the discussion on variable data rates.
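
The greedy strategy can be illustrated with a deliberately simplified sketch. It assumes a linear pipeline and models fusion only (fission and reordering are omitted); the function name `greedy_fuse`, the filter names, and the work estimates are hypothetical and do not come from the StreamIt compiler.

```python
from typing import List, Tuple


def greedy_fuse(work: List[Tuple[str, int]], n_tiles: int) -> List[Tuple[str, int]]:
    """Fuse adjacent filters of a linear pipeline until at most n_tiles remain."""
    parts = list(work)
    while len(parts) > n_tiles:
        # Pick the adjacent pair whose fused work estimate is smallest.
        i = min(range(len(parts) - 1), key=lambda k: parts[k][1] + parts[k + 1][1])
        (name_a, work_a), (name_b, work_b) = parts[i], parts[i + 1]
        parts[i:i + 2] = [(name_a + "+" + name_b, work_a + work_b)]
    return parts


# Hypothetical per-firing work estimates (in cycles) for a five-filter pipeline.
estimates = [("input", 5), ("transform", 40), ("light", 35), ("pack", 10), ("output", 5)]
partitions = greedy_fuse(estimates, 3)
```

Fusing the cheapest neighbors first keeps the heaviest filters on tiles of their own, which is the intuition behind balancing the partitions.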
Layout. The layout stage assigns filters in the partitioned stream graph to computational
units in the target architecture while minimizing the communication and synchronization
present in the final layout. For Raw, this involves establishing a one-to-one mapping from
filters in the partitioned stream graph to processor tiles, using Raw's networks for communication
between filters. Computing the optimal layout is NP-hard; to make this problem
tractable, a cost function is developed that measures the overhead of transmitting data
between tiles for each layout [11]. Memory traffic, static network communication, and
dynamic network communication are all components of the cost function. For example,
off-chip memory access latency is shortest when the filter is allocated near the edge of
the chip. The cost function measures the memory traffic and communication overhead for
a given layout, as well as the synchronization imposed when independent communication
channels are mapped to intersecting routes on the chip. A layout is found by minimizing
the cost function with a simulated annealing algorithm. The automatic layout algorithm is
a useful tool for approximating the optimal layout. The programmer is free to manually
specify the layout for further optimization.
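
A minimal sketch of layout by simulated annealing follows. It is not the compiler's implementation: the toy cost function counts only the Manhattan distance of each channel (memory traffic and synchronization terms are omitted), and the cooling schedule, step count, and all names are assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Tile = Tuple[int, int]


def layout_cost(placement: Dict[str, Tile], edges: List[Tuple[str, str]]) -> int:
    """Toy cost: total Manhattan distance travelled over all channels."""
    return sum(abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
               for a, b in edges)


def anneal(filters, edges, tiles, steps=2000, t0=2.0, seed=0):
    rng = random.Random(seed)
    slots = list(tiles)            # slots[i] holds filters[i]; extra slots are unused tiles
    rng.shuffle(slots)

    def cost(s):
        return layout_cost(dict(zip(filters, s)), edges)

    cur_cost = cost(slots)
    initial_cost = cur_cost
    best, best_cost = list(slots), cur_cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9       # linear cooling schedule
        i, j = rng.sample(range(len(slots)), 2)
        slots[i], slots[j] = slots[j], slots[i]       # propose swapping two tiles
        new_cost = cost(slots)
        delta = new_cost - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost                       # accept the move
            if cur_cost < best_cost:
                best, best_cost = list(slots), cur_cost
        else:
            slots[i], slots[j] = slots[j], slots[i]   # reject: undo the swap
    return dict(zip(filters, best)), best_cost, initial_cost


filters = ["input", "vertex", "setup", "raster", "pixel", "framebuffer"]
edges = list(zip(filters, filters[1:]))               # a simple six-stage pipeline
tiles = [(r, c) for r in range(3) for c in range(3)]  # 3 x 3 grid of tiles
placement, final_cost, initial_cost = anneal(filters, edges, tiles)
```

Occasionally accepting a worse layout while the temperature is high lets the search escape local minima; tracking the best layout seen guarantees the result is never worse than the random starting point.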
Communication Scheduling. The communication scheduling stage maps the communication
channels specified by the stream graph to the communication network of the target.
While the stream graph represents communication channels as infinite FIFO abstractions, a
target architecture will only provide limited buffering resources. Communication schedul-
ing must avoid deadlock and starvation while trying to utilize the parallelism explicit in
the stream graph. On Raw, StreamIt channels between static data rate filters are mapped
to the static network. The static network communication schedule is computed by simu-
lating the firing of filters in the stream graph and recording the communication pattern for
each switch processor. Outputs of variable data rate filters are transmitted via the general
dynamic network and scheduled differently (Section 3.2). More details on scheduling can
be found in the original publication about mapping StreamIt to Raw by Gordon et al. [11].
Code Generation. Code generation for Raw involves the generation of computation code
for the compute processors and communication code for the switch processors. Computa-
tion code is generated from the output of the partitioning and layout stages. The code is a
mixture of C and assembly and is compiled using Raw's GCC 3.3 port. Assembly commu-
nication code for the switch processors is generated directly from the schedules obtained
in the communication scheduling stage.
Chapter 3
Flexible Pipeline Design
This chapter describes the design and implementation of a flexible graphics pipeline on
Raw using the StreamIt programming language. The primary goal is to build a real-time
rendering architecture that can be reconfigured to adapt to the workload. The architecture
should also be fully programmable: not just in the vertex and pixel processing stages, but
throughout the pipeline and should not be restricted to one topology. Reconfigurable, in
this case, means that the programmer is aware of the range of inputs in the application and
has built profiles at compile-time. At runtime, an explicit "switch" instruction is given to
reconfigure the pipeline. My design leverages the Raw processor's programmable tiles and
routing network as well as the StreamIt language to realize this architecture. Section 3.1
discusses how to express the components of a graphics pipeline using higher order StreamIt
constructs and how the resulting stream graph is mapped to the Raw architecture. In Section 3.2,
I describe extending StreamIt with variable data rates for graphics computation.
Finally, Section 3.3 describes how the flexible pipeline can be used for load balancing.
3.1 Pipeline Design
The StreamIt philosophy is to implement filters as interchangeable components. Following
that philosophy, each stage in my flexible pipeline is implemented as a StreamIt filter and
allocated to Raw tiles by the StreamIt compiler. The programmer is free to vary the pipeline
topology by rearranging the filters and recompiling, with different arrangements reusing
the same filters. The flexible pipeline has several advantages over a fixed pipeline on the
GPU. First, any filter (i.e. stage) in the pipeline can be changed. For example, in the first
pass of shadow volume rendering, texture mapping is not used, and dead-code elimination
can be performed. The entire pipeline is changed so that texture coordinates are neither
interpolated nor part of the dataflow. Second, the topology does not even need to conform to
any traditional pipeline configuration. In the image processing case study (see Section 4.4),
the current GPU method would render the scene to a texture and use a complex pixel
shader to perform image filtering. Raw can simply be reconfigured to act as an extremely
parallel image processor.
In the case studies (Chapter 4) the performance of a flexible pipeline is compared
against a fixed-allocation reference pipeline. The reference pipeline models the same de-
sign tradeoff as made in GPUs in fixing the ratio of fragment to vertex units. It is imple-
mented using StreamIt and emulates most of the functionality of a programmable GPU.
It is manually laid out on Raw (Figure 3.1). The pipeline stages include Input, Program-
mable Vertex Processing, Triangle Setup, Rasterization, Programmable Pixel Shading, and
Reconfigurable Frame Buffer Operations that write to the frame buffer.
The reference pipeline is a sort-middle architecture [12]; Figure 3.1 displays its stream
graph. The Input stage is connected to off-chip memory through an I/O port. Six tiles are
assigned to programmable vertex processing, and they are synchronized through one synchronization
tile. The synchronizer consumes the output of the vertex shaders using a round-robin
strategy and pushes the data to the triangle setup tile. The rasterizer uses the homoge-
neous rasterization algorithm to avoid clipping overhead [15]. Triangle setup computes the
vertex matrix and its inverse, the screenspace bounding box, triangle facing, and the pa-
rameter vectors needed for interpolation. It distributes data to the 15 pixel pipelines. The
pixel pipelines are screen-locked and interleaved: each pipeline is assigned every 15th
column. The pixel pipelines each consist of three tiles: a rasterizer that outputs the visible
fragments of the triangle, a programmable pixel processor, and a raster operations tile,
which communicates with off-chip memory through an I/O port to perform Z-buffering,
blending and stencil buffer operations. Due to routing constraints, not all Raw tiles are
used: in particular, no two variable data rate paths can cross. The code for triangle setup,
rasterization, and frame buffer operations is listed in Appendix A.
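
The screen-locked, column-interleaved assignment can be sketched as a modulo mapping. The function name and the choice of which pipeline owns column 0 are assumptions for illustration; the 600-pixel width is the screen resolution used in the experiments of Chapter 4.

```python
NUM_PIPELINES = 15   # pixel pipelines in the reference design
WIDTH = 600          # screen width used in the experiments


def pipeline_for_column(x: int) -> int:
    """Pipeline that owns screen column x when every 15th column goes to the same pipeline."""
    return x % NUM_PIPELINES


# Count how many columns each pipeline is responsible for.
columns_per_pipeline = [0] * NUM_PIPELINES
for x in range(WIDTH):
    columns_per_pipeline[pipeline_for_column(x)] += 1
```

Interleaving by column spreads a large triangle across all pipelines, whereas assigning contiguous blocks of columns would concentrate a small triangle's fragments on one or two pipelines.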
In contrast to the reference pipeline, my flexible pipeline builds on the same filters as
the reference one, but it exploits the StreamIt compiler for automatic layout onto Raw.
The pipeline is parameterized by the number of split-join branches at the vertex and pixel
stages, and the programmer provides the desired number of branches for a given scenario.
The programmer can also increase the pipeline depth manually by dividing the work into
multiple stages. For example, a complex pixel shading operation may be divided into two
stages, each of which fits onto one tile. In some cases, the programmer also omits some of
the filters when they are not needed. Together, flexible resource allocation and dead-code
elimination greatly improve performance.
While the automatic layout of filters to tiles by the compiler provides great flexibility,
the layout is often not optimal. The compiler uses simulated annealing to solve the NP-hard
constrained optimization problem. Hence, when using automatic layouts, in order to satisfy
routing constraints, a number of tiles are left unallocated. Automatic layout can act as a
good first approximation so the programmer can iterate on pipeline configurations without
having to manually configure the tiles. It also has a reasonable runtime: only 5 minutes on
a Pentium Xeon 2.2 GHz with 1 GB of RAM. In the case studies, only automatic layout is
used for flexible pipeline configurations.
3.2 Variable Data Rates
Variable data rates are essential for graphics rendering. Because the number of pixels cor-
responding to a given triangle depends on the positions of the vertices for that triangle, the
input/output ratio of a rasterizer filter cannot be fixed at compile time. This contrasts with
traditional applications of stream-based programming such as digital signal processing that
exhibit a fixed ratio of output to input and can be implemented using synchronous models.
In particular, the original version of StreamIt relies on the fixed data rate assumption.
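
To make the variable output rate concrete, the sketch below counts the fragments produced for one input triangle. It uses plain 2D edge functions rather than the homogeneous rasterization algorithm used in the thesis, and the function names and sample triangles are assumptions chosen for illustration.

```python
from typing import Tuple

Point = Tuple[float, float]


def edge(a: Point, b: Point, p: Point) -> float:
    """Signed area test: non-negative when p lies on or to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def fragment_count(v0: Point, v1: Point, v2: Point) -> int:
    """Number of pixel centers covered by a counter-clockwise triangle."""
    xs = [v0[0], v1[0], v2[0]]
    ys = [v0[1], v1[1], v2[1]]
    count = 0
    # Only visit pixels inside the screen-space bounding box.
    for y in range(int(min(ys)), int(max(ys)) + 1):
        for x in range(int(min(xs)), int(max(xs)) + 1):
            p = (x + 0.5, y + 0.5)
            if edge(v0, v1, p) >= 0 and edge(v1, v2, p) >= 0 and edge(v2, v0, p) >= 0:
                count += 1
    return count


# Each call consumes exactly one triangle but produces a different number of fragments.
small = fragment_count((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
large = fragment_count((0.0, 0.0), (40.0, 0.0), (0.0, 40.0))
```

The input rate is fixed at one triangle per firing, but the output rate depends on the vertex positions, so no static schedule can describe this filter.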
The StreamIt language and compiler are augmented to support variable data rates be-
tween filters. The language extension is simple, allowing the programmer to tag a data
rate as variable. On the compiler side, each phase of the mapping of StreamIt to Raw is
modified.

[Figure 3-1: Reference Pipeline Stream Graph. Input feeds 6 programmable Vertex Shaders, followed by Sync and Triangle Setup, which feeds 15 Pixel Pipelines, each consisting of a Rasterizer, a programmable Pixel Shader, and a Frame Buffer stage.]

[Figure: Reference Pipeline Layout on an 8 x 8 Raw configuration. Color-coded squares represent Raw tiles (unused tiles have been omitted for clarity). Data arrives from an I/O port off the edge of the chip. Legend: Vertex Processor, Sync, Triangle Setup, Rasterizer, Pixel Processor, Frame Buffer Ops, Static Data Rate, Variable Data Rate.]

Partitioning with Variable Data Rates. The partitioner supports variable rates by dividing
the stream graph into static-rate subgraphs. Each subgraph represents a stream in
which child filters have static data rates for internal communication. A variable data rate
can appear only between subgraphs. Because each subgraph resembles a static-rate appli-
cation, the existing partitioning algorithm can be used to adjust its granularity. However,
as variable data rates prevent the compiler from judging the relative load of subgraphs, the
compiler relies on a programmer hint as to how many tiles should be allocated to each sub-
graph. Figure 3.2 shows the reference pipeline's stream graph partitioned into static-rate
subgraphs.
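
Dividing a stream graph into static-rate subgraphs amounts to finding the connected components that remain once variable-rate channels are cut. The sketch below uses a small union-find for this; the graph, the node names, and the choice of which channels are variable-rate (here, the rasterizer outputs) are assumptions for illustration, not the compiler's data structures.

```python
from typing import Dict, List, Set, Tuple


def static_subgraphs(nodes: List[str], edges: List[Tuple[str, str, bool]]) -> List[Set[str]]:
    """Group filters connected by static-rate channels; variable-rate channels are cut."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    for src, dst, variable in edges:
        if not variable:
            parent[find(src)] = find(dst)   # merge the two static-rate groups

    groups: Dict[str, Set[str]] = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


nodes = ["input", "vertex", "setup", "raster0", "raster1",
         "pixel0", "pixel1", "fb0", "fb1"]
edges = [
    ("input", "vertex", False), ("vertex", "setup", False),
    ("setup", "raster0", False), ("setup", "raster1", False),
    ("raster0", "pixel0", True), ("raster1", "pixel1", True),   # variable-rate boundary
    ("pixel0", "fb0", False), ("pixel1", "fb1", False),
]
subgraphs = static_subgraphs(nodes, edges)
```

Each resulting group can then be handed to the existing static-rate partitioner, together with the programmer's hint for how many tiles the group should receive.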
Recall that the flexible graphics pipeline is parameterized by split-joins that provide
data-level parallelism for the vertex and pixel sides of the pipeline. The programmer spec-
ifies the number of branches for the split joins, thereby choosing the respective expected
loads.
Layout with Variable Data Rates. Variable data rates impose two new layout constraints.
First, a switch processor must not interleave routing operations for distinct static subgraphs.
Because the relative execution rates of subgraphs are unknown at compile time, it is impos-
sible to generate a static schedule that interleaves operations from two subgraphs (without
risking deadlock). Second, there is a constraint on the links between subgraphs: variable-
rate communication channels that are running in parallel (e.g., in different paths of a split-
join) must not cross on the chip. Even when such channels are mapped to the dynamic
network, deadlock can result if parallel channels share a junction (as a high-traffic channel
can block another). In my implementation, these constraints are incorporated into the cost
function in the form of large penalties for illegal layouts.
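
The second constraint can be folded into a cost function roughly as follows. This sketch assumes that each channel is routed along its row first and then along its column, and it treats any shared tile as a crossing; the penalty value and all names are hypothetical.

```python
from typing import List, Tuple

Tile = Tuple[int, int]
ILLEGAL_PENALTY = 10**6   # large enough to dominate any legal layout's cost


def route(src: Tile, dst: Tile) -> List[Tile]:
    """Tiles visited when routing along the row first, then along the column."""
    (r0, c0), (r1, c1) = src, dst
    step_c = 1 if c1 >= c0 else -1
    step_r = 1 if r1 >= r0 else -1
    path = [(r0, c) for c in range(c0, c1 + step_c, step_c)]
    path += [(r, c1) for r in range(r0 + step_r, r1 + step_r, step_r)]
    return path


def crossing_penalty(channels: List[Tuple[Tile, Tile]]) -> int:
    """Add a large penalty for every pair of variable-rate channels that share a tile."""
    penalty = 0
    for i in range(len(channels)):
        for j in range(i + 1, len(channels)):
            if not set(route(*channels[i])).isdisjoint(route(*channels[j])):
                penalty += ILLEGAL_PENALTY
    return penalty


parallel = [((0, 0), (0, 3)), ((1, 0), (1, 3))]   # two channels in separate rows
crossing = [((0, 0), (2, 2)), ((2, 0), (0, 2))]   # two channels that meet in column 2
```

Expressing the constraint as a penalty rather than a hard rule lets the annealer pass through illegal layouts on its way to a legal one, while still making illegal layouts lose to any legal alternative.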
Communication Scheduling with Variable Data Rates. Communication scheduling requires
a simple extension: channels with variable data rates are mapped to Raw's general
dynamic network (rather than the static network, which requires a fixed communication
pattern). Within each subgraph, the static network is still used. This implementation avoids
the cost of constructing the dynamic network header for every packet [11]; instead, the
header is constructed at compile time. Even though the rate of communication is variable,
the endpoints of each communication channel are static.

[Figure: Reference Pipeline stream graph partitioned into static-rate subgraphs. The graph of Figure 3-1 (Input, 6 programmable Vertex Shaders, Sync, Triangle Setup, and 15 Pixel Pipelines of Rasterizer, programmable Pixel Shader, and Frame Buffer) annotated with Static Rate Subgraphs and the Variable Data Rate Boundary.]

3.3 Load-Balancing Using the Flexible Pipeline
The flexible pipeline exploits both data-level parallelism and pipeline parallelism to gain
additional throughput. Data-level parallelism is expressed using the StreamIt split-join
construct. Split-joins specify the number of vertex units and the number of parallel pixel
pipelines after the rasterizer. Pipeline parallelism must be controlled manually by the pro-
grammer, who can find a more fine-grained graph partition than the compiler.
The static load-balancing scenario is as follows. Given an input scene, it is profiled on
the reference pipeline to determine where the bottlenecks are (usually the vertex or pixel
stage). If the bottleneck is in the vertex or pixel stage, the programmer iteratively adjusts the
width of the split-joins to change the ratio between vertex and pixel stages and re-profiles
the scene until the load is balanced across the chip. If the load-imbalance is elsewhere (for
example, in the triangle setup stage due to the sort-middle architecture), a different graph
topology may be used.
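
The iterative adjustment can be sketched as a small search loop. In practice each step requires recompiling and re-profiling the scene; the sketch replaces profiling with a simple analytical model in which throughput is limited by the slower of the two stages. The per-triangle costs, the total of 21 units, and all names are hypothetical.

```python
def throughput(vertex_units: int, pixel_units: int,
               vertex_cost: float, pixel_cost: float) -> float:
    """Triangles per cycle when the slower of the two stages limits the pipeline."""
    return min(vertex_units / vertex_cost, pixel_units / pixel_cost)


def balance(total_units: int, vertex_cost: float, pixel_cost: float, vertex_units: int):
    """Move one unit at a time toward the bottleneck stage until throughput stops improving."""
    pixel_units = total_units - vertex_units
    best = throughput(vertex_units, pixel_units, vertex_cost, pixel_cost)
    while True:
        vertex_bound = vertex_units / vertex_cost < pixel_units / pixel_cost
        v = vertex_units + (1 if vertex_bound else -1)
        p = total_units - v
        if v < 1 or p < 1:
            break                      # every stage needs at least one unit
        candidate = throughput(v, p, vertex_cost, pixel_cost)
        if candidate <= best:
            break                      # no further improvement
        vertex_units, pixel_units, best = v, p, candidate
    return vertex_units, pixel_units, best


# Hypothetical fragment-heavy scene: 100 cycles of vertex work and
# 4000 cycles of pixel work per triangle, starting from a 6 / 15 split.
start = throughput(6, 15, 100.0, 4000.0)
v_units, p_units, balanced = balance(21, 100.0, 4000.0, vertex_units=6)
```

For a workload this lopsided the loop ends with a single vertex unit, which matches the qualitative behavior reported for the fragment-bound case study in Section 4.2.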
Chapter 4
Performance Evaluation
This chapter describes a series of case studies where I have profiled the performance of
the reference and flexible pipelines on four common rendering tasks. These experiments
serve to test if a flexible pipeline can outperform a pipeline with fixed resource allocation.
Given a workload, there should be a load-imbalance between the units in the fixed pipeline
that causes suboptimal throughput. Once the bottleneck is identified, the flexible pipeline
should be able to reallocate its resources to balance the load and achieve better throughput.
My experiments include the cases of a complex pixel shader (Section 4.2), a multi-pass
algorithm (Section 4.3), an image processing algorithm (Section 4.4), and a particle system
(Section 4.5). This chapter presents the details of each experiment and demonstrates how
the load imbalance in each case can be improved by reallocating resources appropriately,
with over 100% increase in throughput in some cases. I conclude the chapter with a dis-
cussion on how the Raw hardware could be improved to attain even greater performance
for rendering tasks.
4.1 Experimental Setup
Each experiment involves a real-world rendering task that induces a strongly imbalanced
load on different pipeline stages and that does not necessarily use all functional units of
today's GPUs at the same time. These scenarios demonstrate that the flexible pipeline
avoids bottlenecks and idle units, a common problem in today's architectures.
Instead of comparing the graphics performance of the flexible pipeline on Raw against a real GPU, I compare
it against the reference pipeline on Raw. It is unrealistic to compare Raw against
a modern GPU because today's GPUs have several orders of magnitude more computational
power than Raw. GPUs not only have far more arithmetic power, but also more memory bandwidth
and more registers. For a more realistic comparison, I simulate a GPU's fixed pipeline
topology on Raw, and compare the performance and load-balance of the flexible pipeline
against this reference.
Performance numbers are listed in terms of triangles per second and percent utilization.
The screen resolution is fixed at 600x600 with 32-bit color. Pipeline stage utilization is
computed as the number of instructions completed by all tiles assigned to that stage divided
by the number of cycles elapsed. Note that this metric for processor utilization is unlikely
to reach 100% in any scenario, even in highly parallel computations such as image filtering
(Section 4.4). While each tile is fully pipelined, it is unlikely to achieve 1 instruction per clock
cycle. Floating point operations incur a 4 cycle latency, memory access costs 3 cycles even
on a cache hit, and there are likely to be data hazards in the computation. Furthermore,
Raw tiles do not feature predication and all conditionals must be expressed as software
branches.
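
The utilization metric can be written out as a short calculation. The sketch assumes that the instruction count is normalized by the number of tiles in the stage so that the result lies between 0 and 1; that normalization, the function names, and the sample counts are assumptions for illustration.

```python
from typing import List


def stage_utilization(instructions_per_tile: List[int], cycles_elapsed: int) -> float:
    """Instructions completed by a stage's tiles per available issue slot."""
    tiles = len(instructions_per_tile)
    return sum(instructions_per_tile) / (cycles_elapsed * tiles)


def issue_bound(cycles_per_instruction: int) -> float:
    """Upper bound on utilization if every instruction occupied this many cycles."""
    return 1.0 / cycles_per_instruction


# Hypothetical stage with three tiles observed for 1000 cycles.
utilization = stage_utilization([600, 700, 500], 1000)

# A stream made only of 3-cycle cache-hit loads could not exceed one third.
memory_bound = issue_bound(3)
```

The second function shows why 100% is out of reach: any instruction that occupies the pipeline for more than one cycle lowers the ceiling on this metric.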
4.2 Case Study 1: Phong Shading
Consider the case of rendering a coarsely tessellated polyhedron composed of large trian-
gles with per pixel Phong shading. In the vertex shader, the vertex's world space position
and normal are bound as texture coordinates. The rasterizer interpolates the texture coor-
dinates across each triangle and the pixel shader computes the lighting direction and the
diffuse and specular contributions. Most of the load is expected to be on the fragment
processor. Vertex and pixel shader code are in the appendix (Figures A and A).
Reference Pipeline. As expected, the reference pipeline suffers from an extreme load
imbalance. The fragment processor is at 68% utilization and the rasterizer at 17%, while the
other units are virtually idle (< 1%) (Figure 4.2). Overall chip utilization is only 25.5%.
Throughput is 4300 triangles per second.

                     Vertex Units   Pixel Units                       tris / sec   Throughput Increase
Reference Pipeline   6              15                                4299.27      N/A
Automatic Layout     6              15                                4353.08      1.25%
Automatic Layout     1              15                                4342.67      1.01%
Automatic Layout     1              24 (12 pipelines, 2 units each)   6652.18      54.73%
Table 4.1: Case 1 Performance Comparison

Flexible Pipeline. I tried several different allocations for this scenario. The first test
used the same pipeline configuration as the reference pipeline, but with automatic layout
instead of manual layout; it yielded a marginal improvement. In both cases, the pixel processor was the
bottleneck: the pixel processor's utilization was 68%, the rasterizer's was 17%, and the other units
were virtually idle. Since the fragment stage was the bottleneck, the next test left the number of
pixel pipes the same but used only one vertex unit. The utilization and throughput were
virtually unchanged, supporting the hypothesis. The final configuration was to allocate most
of the tiles to pixel processing, with a pipelined 2-stage pixel shader. This configuration
yielded the greatest gain in performance.
In this allocation, the first pixel processor is at 74% utilization, and the second at 60%.
The rasterization stage's utilization increases to 31%. The load-balance has improved sig-
nificantly, even though it is not balanced across the entire chip. Overall chip utilization is
only 39.5%. This allocation achieves a throughput of 6652 triangles per second, a 55%
increase over the fixed allocation.
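
The throughput increases reported in Table 4.1 follow directly from the measured triangle rates. The short calculation below reproduces them; the helper name and the dictionary keys are illustrative only.

```python
def throughput_increase(reference: float, measured: float) -> float:
    """Percentage increase of `measured` over `reference`."""
    return 100.0 * (measured / reference - 1.0)


reference = 4299.27   # tris / sec, reference pipeline (Table 4.1)
measured = {
    "automatic, 6 vertex / 15 pixel": 4353.08,
    "automatic, 1 vertex / 15 pixel": 4342.67,
    "automatic, 1 vertex / 24 pixel": 6652.18,
}
increases = {name: round(throughput_increase(reference, rate), 2)
             for name, rate in measured.items()}
```

The first two configurations differ from the reference by about 1%, which is consistent with the observation that the vertex stage is not the bottleneck in this scene.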
4.3 Case Study 2: Multi-Pass Rendering - Shadow Volumes
To demonstrate the utility of a flexible pipeline, I benchmarked shadow volume rendering,
a popular technique for generating real-time hard shadows, on Raw. In this algorithm,
the load shifts significantly over the three passes. In the first pass, the depth buffer is
initialized with the depth values of the scene geometry. In the given scene, the
triangles are relatively large and the computation is rasterization bound. In the second pass,
the shadow volume itself is rendered. This incurs an even greater load on the rasterizer
which has to rasterize large shadow volume polygons, and the frame buffer operations,
which must perform a depth test and update the stencil buffer. The final pass is fragment
processing bound, where the fragment shader is used to light and texture the final image. I
analyze the first two passes in this case study.
Reference Pipeline. On the first pass, as expected, the rasterization stage is the bottleneck
at 69% utilization (see Figure 4.3). It takes approximately 55 floating point operations for
the software rasterizer to output each fragment. The pixels are output in screen aligned
order and memory access is very regular for the large triangles. The frame buffer updates
only achieve a 7% utilization. The other units in the pipeline are virtually idle, with the
exception of the pixel shader, which reaches 6% utilization simply forwarding rasterized fragments to the
frame buffer operations unit. Throughput is 988 triangles / second and utilization is 25%.
On the second pass, the results are virtually identical, with a slight increase in utilization
at the frame buffer operations stage where the Z-Fail algorithm updates the stencil buffer
based on triangle orientation. Throughput is 796 triangles / second and utilization is 23%.
Flexible Pipeline. Noticing that the computation is rasterization limited for both the first
and second pass, the allocation is changed so that only one tile is assigned to vertex process-
ing. The final pass is fragment bound, so the allocation from Case Study #1 is used. Notice
that in this multi-pass algorithm, a different allocation can be used for each pass. In fact,
multiple allocations can be used within a pass. Since neither the first nor the second pass
requires pixel shading, the pixel shading stage is removed completely, and the tiles are re-
allocated to increase the number of pixel pipelines to 20 (see Figure 4.3). Since the input
vertices do not contain any attributes other than position, interpolation and parameter vector
calculations for those attributes can be safely removed from the rasterization and triangle
setup stages. The flexible pipeline achieves more than 100% increase in throughput over
the reference pipeline in both passes. In the first pass, throughput is 2223 triangles / sec
with 27% overall utilization. In the second pass, throughput is 1800 triangles / sec with
27% overall utilization. It is interesting to note that although overall chip utilization has
increased, the utilization in the rasterization stage has actually decreased from 71% down
to 49% due to the elimination of dead code.

                     Vertex Units   Pixel Units   tris / sec   Throughput Increase
Ref., Depth Pass     6              15            987.79       N/A
Ref., Shadow Pass    6              15            796.25       N/A
Auto., Depth Pass    1              20            2223.16      125.06%
Auto., Shadow Pass   1              20            1798.53      125.88%
Table 4.2: Case 2 Performance Comparison

4.4 Case Study 3: Image Processing - Poisson Depth-of-Field
Image processing requires a quite different pipeline architecture than 3D rendering. Since
Raw is a general purpose architecture, the computation does not need to be mapped onto
a traditional graphics pipeline. Consider the Poisson-disc fake depth-of-field algorithm
by ATI [19]. In a GPU implementation, the final pass of the algorithm would require
submitting a large screen-aligned quadrilateral and performing the filtering in the pixel
shader. The operation is extremely fragment bound since the scene contains only 2 triangles
and the pixel shader must perform many texture accesses per output pixel.
In the flexible pipeline, each tile is allocated as an image filtering unit. The tile con-
figuration is expressed as a 62-way StreamIt split-join. Currently, the StreamIt compiler
requires that split-joins have a source; hence, two tiles are "wasted" for data routing;
otherwise, the full 64 tiles could be used. The color and depth buffers are split into
62 blocks. At 600x600 resolution, the blocks fit in the data cache of a tile. The configu-
ration gets only 38% utilization of the chip and throughput of 130 frames per second (see
Figure 4.4). Due to the memory-intensive nature of the operation, 100% utilization is not
reached: a cache hit still incurs a 3 cycle latency, so the result is near the expected 33%
utilization.
4.5 Case Study 4: Particle System
For the fourth experiment, consider automatic tessellation and procedural deformation of
geometric primitives. In this test, the vertex shaders are modified to receive a complete
triangle as input and output 4 complete triangles. Each vertex is given a small random
perturbation. The input triangles comprise a particle system. Since these primitives
occupy little screen area and require no shading, this scene is expected to be vertex-bound
in performance on the reference pipeline.
Reference Pipeline. It turns out, however, that the bottleneck lies in the triangle setup
stage. Triangle setup has a 49% utilization, the rasterizer is at 22%, and the other units
are stalled (< 4%) (see Figure 4.5). In retrospect, this is unsurprising; the sort-middle
architecture required output vertices to be synchronized and contained only one triangle
setup stage. Since the triangles are small, setup takes a proportionally large amount of
computation relative to rasterization. Overall chip utilization was only 7.8%.
Flexible Pipeline. The flexible pipeline has the immediate advantage of removing unnecessary
work such as texture coordinate interpolation in the rasterizer and parameter vector
computation in triangle-setup. The automatically laid out version of the reference pipeline
performed slightly better, with utilization up to 9.1% and throughput up by 21%. Assum-
ing that the computation was vertex limited, I reallocated some of the pixel pipelines to
vertex units. With 10 pixel pipelines and 9 vertex shaders, performance was increased by
71% over the reference pipeline, with utilization up to 20.6%. To test the hypothesis that
the computation was vertex limited, I increased the number of pixel pipelines up to 12,
and there was virtually no change. Finally, noticing that triangle setup was a bottleneck, I
pipelined the stage by dividing the work onto two tiles and forwarding the necessary data.
The pipelined version obtained a performance increase of over 157% over the reference
pipeline and utilization up to 24.6%. Pipelining the triangle setup yielded the most dra-
matic increase in throughput and demonstrated the communication limited nature of the
architecture. Even though I originally misjudged where the bottleneck would be, this ex-
periment still illustrates the benefit of a flexible architecture: it can achieve a substantial
performance gain by transferring a tile originally assigned to an idle stage to a busy one.

                                 Vertex Units   Pixel Units   tris / sec   Throughput Increase
Ref. Pipeline                    6              15            62300.98     N/A
Automatic                        6              15            75465.37     21%
Automatic                        8              12            106812.25    71%
Automatic                        10             10            107604.02    73%
Automatic (2-stage tri. setup)   8              12            159857.91    157%
Table 4.3: Case 4 Performance Comparison

4.6 Discussion
In the above experiments, I compared processor utilization and throughput achieved by a
fixed versus flexible resource allocation under several rendering scenarios and configurations
of the flexible pipeline. The experiments showed that a flexible resource allocation increases
utilization and increases throughput by up to 157%. I believe that these results are indicative of the
speed-ups that could be obtained by designing more flexible GPU architectures. In these
scenarios, the flexible pipeline was able to adapt to the workload and partially alleviate
the bottleneck in the computation. The image processing experiment also showed that a
flexible resource allocation leads to a more natural representation of the computation. Instead
of relying on the fixed topology and forcing an inherently 2D computation into 3D by
drawing a rectangle and using a pixel shader, the flexible pipeline can easily parallelize the
operation.
One issue that has not yet been discussed is the switching cost: the time it takes to
reconfigure the chip for a different resource allocation. The simplest scheme for reallocating
resources would be to first flush the pipeline and then have all the tiles branch to another
section of code to begin their new task. The overall cost would be one pipeline flush, a branch,
and possibly an instruction cache miss. The latter two would most likely be masked by
the latency of the flush. For applications that use the single-pass and multi-pass rendering
algorithms considered in the experiments above, switching cost would not be an issue. In
the single-pass algorithms considered, the application would be in the process of switching
to a new shader, and would most likely need to flush the pipeline anyway. Similarly, in the
case of multi-pass algorithms, a pipeline flush is required to guarantee correctness. However,
the programmer may want to switch configurations within a pass
without changing a shader. For example, consider a scene that is rendered in two parts.
First, the background, composed of large triangles, is rendered. Then detailed characters are
rendered over the background. The first part is fragment-bound while the second part is
vertex-bound. In this case, the programmer would want to switch from devoting tiles to
fragment processing to devoting them to vertex processing. The switch would require
a flush between drawing the background and drawing the characters, whereas on a GPU
such a flush is not necessary. This additional overhead has not been well studied and is
left as future work.
Although the flexible pipeline uses resources much more efficiently over a wider range
of applications, the absolute performance obtained by Raw is orders of magnitude lower
than that of current GPUs. The triangle throughput, even in the best case, is several orders of
magnitude less than that of an NVIDIA NV40. If case 1 is considered without the expensive
pixel shader, the load becomes severely imbalanced in the rasterizer. A serious weakness of
the Raw architecture is the use of general-purpose computation for rasterization. In a GPU
pipeline, the rasterizer performs a tremendous amount of computation every clock; however,
all of it is very simple floating-point arithmetic. A DirectX 9 class GPU's rasterizer
can output 16 fragments every clock and push the data to the fragment units. In contrast,
a Raw tile must fetch and execute all the instructions to rasterize each fragment. My current
implementation of homogeneous rasterization requires 55 floating-point instructions
and one branch per fragment in steady state. Clearly, the specialized rasterizer is a huge
advantage for the GPU, and integrating one into a Raw architecture would greatly improve
performance. How such an augmentation would be done is also an interesting research
question.
Another advantage of the GPU is the presence of floating-point vector ALUs and vectorized
busses. Almost all ALUs on a modern GPU are 128-bit vector units
that can perform a multiplication and an addition in one cycle. They can also forward data
across wide busses and store all values in vector registers. In contrast, Raw's ALUs, busses,
and registers only operate on scalar values. Simply equipping Raw with vector ALUs would
yield at least a 4x speedup. If the entire chip were vectorized, even greater speedups could be
expected due to lower latency in transmitting data between tiles.
Figure 4-1: Case Study 1 Output image. Resolution: 600 x 600.
[Figure 4-2: tile layout diagram. Legend: Vertex Processor, Triangle Setup, Rasterizer,
Pixel Processor A, Pixel Processor B, Frame Buffer Ops. Channel types: Static Data Rate,
Variable Data Rate.]
Figure 4-2: Compiler generated allocation for case study #1, last case. 1 vertex processor,
12 pixel pipelines, 2 fragment processors for each pixel pipeline. Unallocated tiles have
been removed to clarify the routing.
[Figure 4-3: per-tile utilization plots, usage (0% to 100%) against cycles (0 to 15M).
Panel titles: Fixed allocation pipeline, overall utilization: 25%; Flexible pipeline, overall
utilization: 27%; Flexible pipeline, overall utilization: 27%; Flexible pipeline, overall
utilization: 40%.
Tile allocations shown: (15 fbuffer ops, 6 vertex, 15 rasterizer, 1 setup, 15 pixel, 1 input);
(15 fbuffer ops, 6 vertex, 15 rasterizer, 1 setup, 15 pixel, 1 input);
(15 fbuffer ops, 1 vertex, 15 rasterizer, 1 setup, 15 pixel, 1 input);
(12 fbuffer ops, 1 vertex, 12 rasterizer, 1 setup, 12 pixel A, 12 pixel B, 1 input).]
Figure 4-3: Case 1 utilization graphs.
Figure 4-4: Case Study 2 Output image. The relatively small shadow caster still creates a
very large shadow volume. Resolution: 600 x 600.
[Figure 4-5: tile layout diagram. Legend: Vertex Processor, Triangle Setup, Rasterizer,
Frame Buffer Ops. Channel types: Static Data Rate, Variable Data Rate.]
Figure 4-5: Compiler generated layout for case study #2. 1 Vertex Processor and 20 pixel
pipelines. Unallocated tiles have been omitted to clarify routing.
[Figure 4-6: per-tile utilization plots, usage (0% to 100%) against cycles (0 to 4M for each
pass).
Panel titles: Fixed-allocation pipeline: average utilizations: 25%, 23%; Flexible pipeline:
average utilizations: 27%, 28%.
Tile allocations shown: (15 fbuffer ops, 6 vertex, 15 rasterizer, 1 setup, 15 pixel, 1 input);
(20 fbuffer ops, 20 rasterizer, 1 setup, 1 vertex, 1 input).]
Figure 4-6: Case 2 utilization graph. Depth buffer pass on the left, shadow volume pass on
the right.
[Figure 4-7: tile layout diagram. Legend: Input (Start Signal), Poisson Disc Filter.]
Figure 4-7: Compiler generated layout for case study #3. 1 tile wasted for the start signal
and one was unused. 62 tiles perform the filtering. Unallocated tiles have been omitted to
clarify routing.
Figure 4-8: Case Study 3 Output image. Resolution: 600 x 600.
[Figure 4-9: utilization plot, usage (0% to 100%) against cycles (0 to 100K).
Panel title: Flexible pipeline: average usage 38%.
Tile allocation shown: 1 input, 62 Poisson Disc Filters.]
Figure 4-9: Case 3 utilization graph.
[Figure 4-10: tile layout diagram. Legend: Vertex Processor, Triangle Setup, Triangle
Setup 2, Rasterizer, Frame Buffer Ops. Channel types: Static Data Rate, Variable Data Rate.]
Figure 4-10: Compiler generated layout for case study #4, last case. 2 stage pipelined
Triangle Setup. Unallocated tiles have been omitted to clarify routing.
Figure 4-11: Case Study 4 Output image. Resolution: 600 x 600.
[Figure 4-12: per-tile utilization plots, usage (0% to 100%) against cycles (0 to 0.8M).
Panel titles: Fixed-allocation pipeline, overall utilization: 7.8%; Flexible pipeline, overall
utilization: 9.1%; Flexible pipeline, overall utilization: 20.6%; Flexible pipeline, overall
utilization: 20.8%; Flexible pipeline, overall utilization: 24.6%.
Tile allocations shown: (15 fbuffer ops, 6 vertex, 15 rasterizer, 1 triangle setup, 15 pixel,
1 input); (15 fbuffer ops, 6 vertex, 15 rasterizer, 1 triangle setup);
(10 fbuffer ops, 9 vertex, 10 rasterizer, 1 triangle setup);
(12 fbuffer ops, 9 vertex, 12 rasterizer, 1 triangle setup);
(12 fbuffer ops, 9 vertex, 12 rasterizer, 1 triangle setup A, 1 triangle setup B, 1 input).]
Figure 4-12: Case 4 utilization graphs.
Chapter 5
Conclusion
I have presented a graphics hardware architecture based on a multicore processor, where
load balancing is achieved at compile-time using automatic resource allocation. Both the
3D rendering pipeline and shaders are expressed in the same stream-based language, allow-
ing for full programmability and load balancing. Although the prototype cannot compete
with state-of-the-art GPUs, I believe it is an important first step in addressing the load-
balancing challenge in graphics architecture.
As discussed in Chapter 4, there are several limitations to the current design. One
promising option for improving performance would be to replace certain Raw tiles with
specialized rasterizers since this is the stage of the graphics pipeline that benefits most from
specialization. Another direction for future research is studying the memory hierarchy for
optimal graphics performance, in particular the pre-fetching of textures. Texture mapping
is an important part of modern rendering and was not considered in this thesis. Dynamic
load balancing is the most exciting avenue of future work. A first intermediate step might
exploit the statistics from the previous frame to refine resource allocation or switch be-
tween different pre-compiled versions of the pipeline. In the future, I hope that graphics
hardware will be able to introspect itself and switch resource allocation within a frame or
rendering pass depending on the relative load of computation units and on the occupancy
of its buffers. Achieving the proper granularity for such changes and the appropriate state
maintenance are the biggest challenges.
Appendix A
Code
Vertex->TriangleSetupInfo filter TriangleSetup( int screenWidth, int screenHeight )
{
    Vertex v0;
    Vertex v1;
    Vertex v2;
    TriangleSetupInfo tsi;
    Vector3f[3] ndcSpace;
    Vector3f[3] screenSpace;
    Matrix3f vertexMatrix;
    Matrix3f vertexMatrixInverse;

    void normalizeW()
    {
        ndcSpace[0].x = v0.position.x / v0.position.w;
        ndcSpace[0].y = v0.position.y / v0.position.w;
        ndcSpace[0].z = v0.position.z / v0.position.w;
        ndcSpace[1].x = v1.position.x / v1.position.w;
        ndcSpace[1].y = v1.position.y / v1.position.w;
        ndcSpace[1].z = v1.position.z / v1.position.w;
        ndcSpace[2].x = v2.position.x / v2.position.w;
        ndcSpace[2].y = v2.position.y / v2.position.w;
        ndcSpace[2].z = v2.position.z / v2.position.w;
    }

    void viewport()
    {
        screenSpace[0].x = screenWidth * ( ndcSpace[0].x + 1.0 ) / 2.0;
        screenSpace[0].y = screenHeight * ( ndcSpace[0].y + 1.0 ) / 2.0;
        // shift z range from [-1..1] to [0..1]
        screenSpace[0].z = ( ndcSpace[0].z + 1.0 ) / 2.0;
        screenSpace[1].x = screenWidth * ( ndcSpace[1].x + 1.0 ) / 2.0;
        screenSpace[1].y = screenHeight * ( ndcSpace[1].y + 1.0 ) / 2.0;
        // shift z range from [-1..1] to [0..1]
        screenSpace[1].z = ( ndcSpace[1].z + 1.0 ) / 2.0;
        screenSpace[2].x = screenWidth * ( ndcSpace[2].x + 1.0 ) / 2.0;
        screenSpace[2].y = screenHeight * ( ndcSpace[2].y + 1.0 ) / 2.0;
        // shift z range from [-1..1] to [0..1]
        screenSpace[2].z = ( ndcSpace[2].z + 1.0 ) / 2.0;
    }

    void computeScreenBoundingBox()
    {
        int v0x, v0y, v1x, v1y, v2x, v2y;
        int temp;
        v0x = ( int )( screenSpace[0].x );
        v0y = ( int )( screenSpace[0].y );
        v1x = ( int )( screenSpace[1].x );
        v1y = ( int )( screenSpace[1].y );
        v2x = ( int )( screenSpace[2].x );
        v2y = ( int )( screenSpace[2].y );
        // x max
        if( v0x > v1x )
            temp = v0x;
        else
            temp = v1x;
        if( v2x > temp )
            temp = v2x;
        tsi.maxX = temp + 1;
        // y max
        if( v0y > v1y )
            temp = v0y;
        else
            temp = v1y;
        if( v2y > temp )
            temp = v2y;
        tsi.maxY = temp + 1;
        // x min
        if( v0x < v1x )
            temp = v0x;
        else
            temp = v1x;
        if( v2x < temp )
            temp = v2x;
        tsi.minX = temp;
        // y min
        if( v0y < v1y )
            temp = v0y;
        else
            temp = v1y;
        if( v2y < temp )
            temp = v2y;
        tsi.minY = temp;
        // clamp the bounding box to the screen
        if( tsi.minX < 0 )
        {
            tsi.minX = 0;
        }
        if( tsi.maxX > ( screenWidth - 1 ) )
        {
            tsi.maxX = screenWidth - 1;
        }
        if( tsi.minY < 0 )
        {
            tsi.minY = 0;
        }
        if( tsi.maxY > ( screenHeight - 1 ) )
        {
            tsi.maxY = screenHeight - 1;
        }
    }
Figure A-1: Common Triangle Setup Code, Page 1 of 3.
    void computeVertexMatrix()
    {
        vertexMatrix.m[0] = v0.position.x;
        vertexMatrix.m[3] = v0.position.y;
        vertexMatrix.m[6] = v0.position.w;
        vertexMatrix.m[1] = v1.position.x;
        vertexMatrix.m[4] = v1.position.y;
        vertexMatrix.m[7] = v1.position.w;
        vertexMatrix.m[2] = v2.position.x;
        vertexMatrix.m[5] = v2.position.y;
        vertexMatrix.m[8] = v2.position.w;
    }

    void computeVertexMatrixInverse()
    {
        float d;
        d = ( vertexMatrix.m[0] * vertexMatrix.m[4] * vertexMatrix.m[8]
            - vertexMatrix.m[0] * vertexMatrix.m[5] * vertexMatrix.m[7]
            - vertexMatrix.m[3] * vertexMatrix.m[1] * vertexMatrix.m[8]
            + vertexMatrix.m[3] * vertexMatrix.m[2] * vertexMatrix.m[7]
            + vertexMatrix.m[6] * vertexMatrix.m[1] * vertexMatrix.m[5]
            - vertexMatrix.m[6] * vertexMatrix.m[2] * vertexMatrix.m[4] );
        vertexMatrixInverse.m[0] = ( vertexMatrix.m[4] * vertexMatrix.m[8] - vertexMatrix.m[5] * vertexMatrix.m[7] ) / d;
        vertexMatrixInverse.m[3] = -( vertexMatrix.m[3] * vertexMatrix.m[8] - vertexMatrix.m[5] * vertexMatrix.m[6] ) / d;
        vertexMatrixInverse.m[6] = -( -vertexMatrix.m[3] * vertexMatrix.m[7] + vertexMatrix.m[4] * vertexMatrix.m[6] ) / d;
        vertexMatrixInverse.m[1] = -( vertexMatrix.m[1] * vertexMatrix.m[8] - vertexMatrix.m[2] * vertexMatrix.m[7] ) / d;
        vertexMatrixInverse.m[4] = ( vertexMatrix.m[0] * vertexMatrix.m[8] - vertexMatrix.m[2] * vertexMatrix.m[6] ) / d;
        vertexMatrixInverse.m[7] = -( vertexMatrix.m[0] * vertexMatrix.m[7] - vertexMatrix.m[1] * vertexMatrix.m[6] ) / d;
        vertexMatrixInverse.m[2] = -( -vertexMatrix.m[1] * vertexMatrix.m[5] + vertexMatrix.m[2] * vertexMatrix.m[4] ) / d;
        vertexMatrixInverse.m[5] = -( vertexMatrix.m[0] * vertexMatrix.m[5] - vertexMatrix.m[2] * vertexMatrix.m[3] ) / d;
        vertexMatrixInverse.m[8] = ( vertexMatrix.m[0] * vertexMatrix.m[4] - vertexMatrix.m[1] * vertexMatrix.m[3] ) / d;
    }

    void computeEdgeEquations()
    {
        // edge01 = vertexMatrixInverse * [0 0 1]^T
        tsi.edge01.x = vertexMatrixInverse.m[6];
        tsi.edge01.y = vertexMatrixInverse.m[7];
        tsi.edge01.z = vertexMatrixInverse.m[8];
        // edge12 = vertexMatrixInverse * [1 0 0]^T
        tsi.edge12.x = vertexMatrixInverse.m[0];
        tsi.edge12.y = vertexMatrixInverse.m[1];
        tsi.edge12.z = vertexMatrixInverse.m[2];
        // edge20 = vertexMatrixInverse * [0 1 0]^T
        tsi.edge20.x = vertexMatrixInverse.m[3];
        tsi.edge20.y = vertexMatrixInverse.m[4];
        tsi.edge20.z = vertexMatrixInverse.m[5];
    }

    void computeWInterp()
    {
        // w coefficients = vertexMatrixInverse * [1 1 1]^T
        tsi.wInterp.x = vertexMatrixInverse.m[0] + vertexMatrixInverse.m[3] + vertexMatrixInverse.m[6];
        tsi.wInterp.y = vertexMatrixInverse.m[1] + vertexMatrixInverse.m[4] + vertexMatrixInverse.m[7];
        tsi.wInterp.z = vertexMatrixInverse.m[2] + vertexMatrixInverse.m[5] + vertexMatrixInverse.m[8];
    }

    void computeZInterp()
    {
        tsi.zInterp.x = vertexMatrixInverse.m[0] * screenSpace[0].z + vertexMatrixInverse.m[3] * screenSpace[1].z + vertexMatrixInverse.m[6] * screenSpace[2].z;
        tsi.zInterp.y = vertexMatrixInverse.m[1] * screenSpace[0].z + vertexMatrixInverse.m[4] * screenSpace[1].z + vertexMatrixInverse.m[7] * screenSpace[2].z;
        tsi.zInterp.z = vertexMatrixInverse.m[2] * screenSpace[0].z + vertexMatrixInverse.m[5] * screenSpace[1].z + vertexMatrixInverse.m[8] * screenSpace[2].z;
    }

    void determineFacing()
    {
        float e01x = screenSpace[1].x - screenSpace[0].x;
        float e01y = screenSpace[1].y - screenSpace[0].y;
        float e12x = screenSpace[2].x - screenSpace[1].x;
        float e12y = screenSpace[2].y - screenSpace[1].y;
        float z = e01x * e12y - e01y * e12x;
        if( z > 0 )
        {
            tsi.isFrontFacing = 1;
        }
        else
        {
            tsi.isFrontFacing = 0;
        }
    }

    float computeInterpolantX( float u0, float u1, float u2 )
    {
        return vertexMatrixInverse.m[0] * u0 + vertexMatrixInverse.m[3] * u1 + vertexMatrixInverse.m[6] * u2;
    }

    float computeInterpolantY( float u0, float u1, float u2 )
    {
        return vertexMatrixInverse.m[1] * u0 + vertexMatrixInverse.m[4] * u1 + vertexMatrixInverse.m[7] * u2;
    }

    float computeInterpolantZ( float u0, float u1, float u2 )
    {
        return vertexMatrixInverse.m[2] * u0 + vertexMatrixInverse.m[5] * u1 + vertexMatrixInverse.m[8] * u2;
    }

    void computeNormalInterp()
    {
        // nx
        tsi.nxInterp.x = computeInterpolantX( v0.normal.x, v1.normal.x, v2.normal.x );
        tsi.nxInterp.y = computeInterpolantY( v0.normal.x, v1.normal.x, v2.normal.x );
        tsi.nxInterp.z = computeInterpolantZ( v0.normal.x, v1.normal.x, v2.normal.x );
        // ny
        tsi.nyInterp.x = computeInterpolantX( v0.normal.y, v1.normal.y, v2.normal.y );
Figure A-2: Common Triangle Setup Code, Page 2 of 3.
        tsi.nyInterp.y = computeInterpolantY( v0.normal.y, v1.normal.y, v2.normal.y );
        tsi.nyInterp.z = computeInterpolantZ( v0.normal.y, v1.normal.y, v2.normal.y );
        // nz
        tsi.nzInterp.x = computeInterpolantX( v0.normal.z, v1.normal.z, v2.normal.z );
        tsi.nzInterp.y = computeInterpolantY( v0.normal.z, v1.normal.z, v2.normal.z );
        tsi.nzInterp.z = computeInterpolantZ( v0.normal.z, v1.normal.z, v2.normal.z );
    }

    void computeColorInterp()
    {
        // red
        tsi.rInterp.x = computeInterpolantX( v0.color.r, v1.color.r, v2.color.r );
        tsi.rInterp.y = computeInterpolantY( v0.color.r, v1.color.r, v2.color.r );
        tsi.rInterp.z = computeInterpolantZ( v0.color.r, v1.color.r, v2.color.r );
        // green
        tsi.gInterp.x = computeInterpolantX( v0.color.g, v1.color.g, v2.color.g );
        tsi.gInterp.y = computeInterpolantY( v0.color.g, v1.color.g, v2.color.g );
        tsi.gInterp.z = computeInterpolantZ( v0.color.g, v1.color.g, v2.color.g );
        // blue
        tsi.bInterp.x = computeInterpolantX( v0.color.b, v1.color.b, v2.color.b );
        tsi.bInterp.y = computeInterpolantY( v0.color.b, v1.color.b, v2.color.b );
        tsi.bInterp.z = computeInterpolantZ( v0.color.b, v1.color.b, v2.color.b );
    }

    void computeTextureInterp()
    {
        // t0.s
        tsi.t0sInterp.x = computeInterpolantX( v0.texCoord0.x, v1.texCoord0.x, v2.texCoord0.x );
        tsi.t0sInterp.y = computeInterpolantY( v0.texCoord0.x, v1.texCoord0.x, v2.texCoord0.x );
        tsi.t0sInterp.z = computeInterpolantZ( v0.texCoord0.x, v1.texCoord0.x, v2.texCoord0.x );
        // t0.t
        tsi.t0tInterp.x = computeInterpolantX( v0.texCoord0.y, v1.texCoord0.y, v2.texCoord0.y );
        tsi.t0tInterp.y = computeInterpolantY( v0.texCoord0.y, v1.texCoord0.y, v2.texCoord0.y );
        tsi.t0tInterp.z = computeInterpolantZ( v0.texCoord0.y, v1.texCoord0.y, v2.texCoord0.y );
        // t0.p
        tsi.t0pInterp.x = computeInterpolantX( v0.texCoord0.z, v1.texCoord0.z, v2.texCoord0.z );
        tsi.t0pInterp.y = computeInterpolantY( v0.texCoord0.z, v1.texCoord0.z, v2.texCoord0.z );
        tsi.t0pInterp.z = computeInterpolantZ( v0.texCoord0.z, v1.texCoord0.z, v2.texCoord0.z );
        // t0.q
        tsi.t0qInterp.x = computeInterpolantX( v0.texCoord0.w, v1.texCoord0.w, v2.texCoord0.w );
        tsi.t0qInterp.y = computeInterpolantY( v0.texCoord0.w, v1.texCoord0.w, v2.texCoord0.w );
        tsi.t0qInterp.z = computeInterpolantZ( v0.texCoord0.w, v1.texCoord0.w, v2.texCoord0.w );
    }

    work pop 3 push 1
    {
        v0 = pop();
        v1 = pop();
        v2 = pop();
        computeVertexMatrix();
        computeVertexMatrixInverse();
        normalizeW(); // clip space -> ndc space
        viewport(); // ndc space -> screen space
        computeScreenBoundingBox();
        computeEdgeEquations();
        // special interpolants
        computeWInterp();
        computeZInterp();
        // other interpolants
        computeNormalInterp();
        computeColorInterp();
        computeTextureInterp();
        // determine backfacing
        determineFacing();
        // push out
        push( tsi );
    }
}
Figure A-3: Common Triangle Setup Code, Page 3 of 3.
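The edge equations in the triangle setup listing are columns of the inverse of the 3x3 matrix whose rows are the (x, y, w) clip-space positions of the three vertices. The following is a minimal sketch of that step in Python (used only for illustration; the function names are hypothetical, and the flat index layout is copied from the listing). Each edge function evaluates to 1 at the opposite vertex and to 0 at the two vertices on the edge.

```python
def inverse3(m):
    """Inverse of a 3x3 matrix stored as a flat list of 9 values.

    The cofactor formula is the same for row-major and column-major
    storage, because the inverse of a transpose is the transpose of
    the inverse.
    """
    d = (m[0] * m[4] * m[8] - m[0] * m[5] * m[7] - m[3] * m[1] * m[8]
         + m[3] * m[2] * m[7] + m[6] * m[1] * m[5] - m[6] * m[2] * m[4])
    return [
        (m[4] * m[8] - m[5] * m[7]) / d,
        -(m[1] * m[8] - m[2] * m[7]) / d,
        (m[1] * m[5] - m[2] * m[4]) / d,
        -(m[3] * m[8] - m[5] * m[6]) / d,
        (m[0] * m[8] - m[2] * m[6]) / d,
        -(m[0] * m[5] - m[2] * m[3]) / d,
        (m[3] * m[7] - m[4] * m[6]) / d,
        -(m[0] * m[7] - m[1] * m[6]) / d,
        (m[0] * m[4] - m[1] * m[3]) / d,
    ]

def edge_equations(v0, v1, v2):
    """v0, v1, v2 are (x, y, w) clip-space positions.

    Returns (edge01, edge12, edge20) as coefficient triples.
    """
    # Same layout as computeVertexMatrix(): m[0], m[3], m[6] hold v0.
    m = [v0[0], v1[0], v2[0],
         v0[1], v1[1], v2[1],
         v0[2], v1[2], v2[2]]
    inv = inverse3(m)
    edge01 = (inv[6], inv[7], inv[8])
    edge12 = (inv[0], inv[1], inv[2])
    edge20 = (inv[3], inv[4], inv[5])
    return edge01, edge12, edge20

def evaluate(edge, v):
    """Evaluate an edge function at a homogeneous point (x, y, w)."""
    return edge[0] * v[0] + edge[1] * v[1] + edge[2] * v[2]
```

A point is inside the triangle when all three edge functions are non-negative, which is the test the rasterizer applies per pixel. A degenerate triangle gives d = 0; this sketch does not guard against that case.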
// offset = rasterizer number (0, 1, 2, ... numUnits - 1)
TriangleSetupInfo->Fragment filter Rasterizer( int offset, int numUnits, int screenWidth, int screenHeight )
{
    int numColumns;
    TriangleSetupInfo tsi;

    init
    {
        numColumns = screenWidth / numUnits;
    }

    float interpolate( float interpX, float interpY, float interpZ,
                       float ndcX, float ndcY, float w )
    {
        return( ( interpX * ndcX + interpY * ndcY + interpZ ) * w );
    }

    work pop 1 push *
    {
        tsi = pop();
        // given an x coordinate:
        // x / numUnits = group number
        // x % numUnits = offset within group
        // group number * numUnits = start of group
        int groupNumber = tsi.minX / numUnits;
        int startOfGroup = groupNumber * numUnits;
        int xStart = startOfGroup + offset;
        if( xStart < tsi.minX )
        {
            xStart = xStart + numUnits;
        }
        for( int y = tsi.minY; y < tsi.maxY; ++y )
        {
            // for( int x = offset; x < tsi.maxX; x = x + numUnits )
            for( int x = xStart; x < tsi.maxX; x = x + numUnits )
            {
                // compute NDC coordinates for current pixel position
                float ndcX = ( float )( x ) * 2.0 / ( float )screenWidth - 1.0;
                float ndcY = ( float )( y ) * 2.0 / ( float )screenHeight - 1.0;
                // interpolate w
                float w = 1.0 / ( tsi.wInterp.x * ndcX + tsi.wInterp.y * ndcY + tsi.wInterp.z );
                float inside01 = interpolate( tsi.edge01.x, tsi.edge01.y, tsi.edge01.z, ndcX, ndcY, w );
                float inside12 = interpolate( tsi.edge12.x, tsi.edge12.y, tsi.edge12.z, ndcX, ndcY, w );
                float inside20 = interpolate( tsi.edge20.x, tsi.edge20.y, tsi.edge20.z, ndcX, ndcY, w );
                if( inside01 >= 0 && inside12 >= 0 && inside20 >= 0 )
                {
                    // interpolate z
                    float z = interpolate( tsi.zInterp.x, tsi.zInterp.y, tsi.zInterp.z, ndcX, ndcY, w );
                    if( z > 0 )
                    {
                        Fragment f;
                        f.x = x;
                        f.y = y;
                        f.z = z;
                        f.isFrontFacing = tsi.isFrontFacing;
                        f.nx = interpolate( tsi.nxInterp.x, tsi.nxInterp.y, tsi.nxInterp.z, ndcX, ndcY, w );
                        f.ny = interpolate( tsi.nyInterp.x, tsi.nyInterp.y, tsi.nyInterp.z, ndcX, ndcY, w );
                        f.nz = interpolate( tsi.nzInterp.x, tsi.nzInterp.y, tsi.nzInterp.z, ndcX, ndcY, w );
                        f.r = interpolate( tsi.rInterp.x, tsi.rInterp.y, tsi.rInterp.z, ndcX, ndcY, w );
                        f.g = interpolate( tsi.gInterp.x, tsi.gInterp.y, tsi.gInterp.z, ndcX, ndcY, w );
                        f.b = interpolate( tsi.bInterp.x, tsi.bInterp.y, tsi.bInterp.z, ndcX, ndcY, w );
                        f.texCoord0.x = interpolate( tsi.t0sInterp.x, tsi.t0sInterp.y, tsi.t0sInterp.z, ndcX, ndcY, w );
                        f.texCoord0.y = interpolate( tsi.t0tInterp.x, tsi.t0tInterp.y, tsi.t0tInterp.z, ndcX, ndcY, w );
                        f.texCoord0.z = interpolate( tsi.t0pInterp.x, tsi.t0pInterp.y, tsi.t0pInterp.z, ndcX, ndcY, w );
                        f.texCoord0.w = interpolate( tsi.t0qInterp.x, tsi.t0qInterp.y, tsi.t0qInterp.z, ndcX, ndcY, w );
                        // push fragment
                        push( f );
                    }
                }
            }
        }
    }
}
Figure A-4: Common Rasterizer code, using the homogeneous rasterization algorithm.
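The rasterizer listing assigns screen columns to rasterizers in an interleaved fashion: rasterizer number offset handles every column x with x % numUnits equal to offset. The following is a minimal Python sketch of how the first column inside a bounding box is found (illustration only; the function names are hypothetical, and the step to the next group when the first candidate falls left of minX is an assumption inferred from the surrounding code).

```python
def first_column(min_x, offset, num_units):
    """First column >= min_x owned by rasterizer `offset` (min_x >= 0)."""
    group_number = min_x // num_units
    start_of_group = group_number * num_units
    x_start = start_of_group + offset
    if x_start < min_x:
        # Assumed behavior: skip ahead to this unit's column in the next group.
        x_start += num_units
    return x_start

def columns_for_unit(min_x, max_x, offset, num_units):
    """All columns in [min_x, max_x) handled by rasterizer `offset`."""
    return list(range(first_column(min_x, offset, num_units), max_x, num_units))
```

For example, with a bounding box covering columns 7 to 19 and four rasterizers, unit 3 starts at column 7 and unit 0 starts at column 8; together the four units cover each column exactly once.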
Raster->void filter FrameBufferOps( int offset, int numUnits, int screenWidth, int screenHeight )
{
    int[ ( screenWidth / numUnits ) * screenHeight ] rgb;
    float[ ( screenWidth / numUnits ) * screenHeight ] zBuffer;
    int width;

    init
    {
        width = screenWidth / numUnits;
        for( int i = 0; i < width * screenHeight; ++i )
        {
            zBuffer[i] = 1.1;
        }
    }

    work pop 1
    {
        Raster r = pop();
        r.x = r.x / numUnits;
        int index = r.x * screenHeight + r.y;
        if( r.z < zBuffer[ index ] )
        {
            rgb[ index ] = ( ( int )( r.r * 255.0 ) << 16 ) | ( ( int )( r.g * 255.0 ) << 8 ) | ( ( int )( r.b * 255.0 ) );
            zBuffer[ index ] = r.z;
        }
    }
}
Figure A-5: Common Frame Buffer Operations Code.
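The frame buffer listing combines a depth test with packing of the floating-point color into one 24-bit integer. The following is a minimal Python sketch of the same logic (illustration only; the class and function names are hypothetical). Each unit stores only its own interleaved columns, so the global column x is divided by the number of units to obtain a local column.

```python
def pack_rgb(r, g, b):
    """Pack color components in [0, 1] into a 0xRRGGBB integer."""
    return (int(r * 255.0) << 16) | (int(g * 255.0) << 8) | int(b * 255.0)

class FrameBufferColumns:
    """Depth and color storage for the columns owned by one unit."""

    def __init__(self, screen_width, screen_height, num_units):
        self.num_units = num_units
        self.height = screen_height
        size = (screen_width // num_units) * screen_height
        self.rgb = [0] * size
        # Depth values lie in [0, 1], so 1.1 means "nothing drawn yet".
        self.z_buffer = [1.1] * size

    def write(self, x, y, z, r, g, b):
        """Depth-tested write; returns True if the pixel was updated."""
        index = (x // self.num_units) * self.height + y
        if z < self.z_buffer[index]:
            self.rgb[index] = pack_rgb(r, g, b)
            self.z_buffer[index] = z
            return True
        return False
```

A nearer fragment (smaller z) overwrites the stored color, while a farther one is discarded.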
Vertex->Vertex filter VertexShader( int id )
{
    Matrix4f modelView;
    Matrix4f projection;
    float worldX;
    float worldY;
    float worldZ;
    float worldW;
    float worldNX;
    float worldNY;
    float worldNZ;
    float eyeX;
    float eyeY;
    float eyeZ;
    float eyeW;
    float clipX;
    float clipY;
    float clipZ;
    float clipW;
    float inR;
    float inG;
    float inB;

    init
    {
        modelView.m[0] = 2;
        modelView.m[1] = 0;
        modelView.m[2] = 0;
        modelView.m[3] = 0;
        modelView.m[4] = 0;
        modelView.m[5] = 1;
        modelView.m[6] = 0;
        modelView.m[7] = 0;
        modelView.m[8] = 0;
        modelView.m[9] = 0;
        modelView.m[10] = 1;
        modelView.m[11] = 0;
        modelView.m[12] = 0;
        modelView.m[13] = 0;
        modelView.m[14] = -5;
        modelView.m[15] = 1;
        // nominal projection matrix
        // fov = 50 degrees, 1:1 aspect ratio, near = 1, far = 10
        projection.m[0] = 2.144507;
        projection.m[1] = 0;
        projection.m[2] = 0;
        projection.m[3] = 0;
        projection.m[4] = 0;
        projection.m[5] = 2.144507;
        projection.m[6] = 0;
        projection.m[7] = 0;
        projection.m[8] = 0;
        projection.m[9] = 0;
        projection.m[10] = -1.022222;
        projection.m[11] = -1;
        projection.m[12] = 0;
        projection.m[13] = 0;
        projection.m[14] = -2.022222;
        projection.m[15] = 0;
    }

    void computeEyeSpace()
    {
        eyeX = modelView.m[0] * worldX + modelView.m[4] * worldY + modelView.m[8] * worldZ + modelView.m[12] * worldW;
        eyeY = modelView.m[1] * worldX + modelView.m[5] * worldY + modelView.m[9] * worldZ + modelView.m[13] * worldW;
        eyeZ = modelView.m[2] * worldX + modelView.m[6] * worldY + modelView.m[10] * worldZ + modelView.m[14] * worldW;
        eyeW = modelView.m[3] * worldX + modelView.m[7] * worldY + modelView.m[11] * worldZ + modelView.m[15] * worldW;
    }

    void computeClipSpace()
    {
        clipX = projection.m[0] * eyeX + projection.m[4] * eyeY + projection.m[8] * eyeZ + projection.m[12] * eyeW;
        clipY = projection.m[1] * eyeX + projection.m[5] * eyeY + projection.m[9] * eyeZ + projection.m[13] * eyeW;
        clipZ = projection.m[2] * eyeX + projection.m[6] * eyeY + projection.m[10] * eyeZ + projection.m[14] * eyeW;
        clipW = projection.m[3] * eyeX + projection.m[7] * eyeY + projection.m[11] * eyeZ + projection.m[15] * eyeW;
    }

    work pop 1 push 1
    {
        Vertex vIn = pop();
        worldX = vIn.position.x;
        worldY = vIn.position.y;
        worldZ = vIn.position.z;
        worldW = vIn.position.w;
        // "ftransform()"
        computeEyeSpace();
        computeClipSpace();
        Vertex vOut;
        vOut.position.x = clipX;
        vOut.position.y = clipY;
        vOut.position.z = clipZ;
        vOut.position.w = clipW;
        vOut.normal.x = vIn.normal.x;
        vOut.normal.y = vIn.normal.y;
        vOut.normal.z = vIn.normal.z;
        vOut.color.r = vIn.color.r;
        vOut.color.g = vIn.color.g;
        vOut.color.b = vIn.color.b;
        // copy position as texture coordinate
        vOut.texCoord0.x = vIn.position.x;
        vOut.texCoord0.y = vIn.position.y;
        vOut.texCoord0.z = vIn.position.z;
        push( vOut );
    }
}
Figure A-6: Case Study #1 Vertex Shader Code.
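The projection constants in the vertex shader listing can be checked against the standard OpenGL-style perspective matrix. The following is a minimal Python sketch (illustration only; the function name is hypothetical, and the assumption that the listing follows the OpenGL convention with column-major storage is mine, since the source does not state it). Under that assumption, the entry 2.144507 matches a 50 degree field of view, and the entries -1.022222 and -2.022222 match near = 1 and far = 91. The far = 10 stated in the listing's comment would instead give about -1.2222 and -2.2222.

```python
import math

def perspective(fov_degrees, aspect, near, far):
    """Column-major 4x4 OpenGL-style perspective matrix as a flat list."""
    f = 1.0 / math.tan(math.radians(fov_degrees) / 2.0)
    m = [0.0] * 16
    m[0] = f / aspect
    m[5] = f
    m[10] = -(far + near) / (far - near)
    m[11] = -1.0
    m[14] = -2.0 * far * near / (far - near)
    return m

# Parameters that reproduce the constants in the listing.
listing_like = perspective(50.0, 1.0, 1.0, 91.0)
# Parameters as stated in the listing's comment.
as_commented = perspective(50.0, 1.0, 1.0, 10.0)
```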
Fragment->Raster filter PixelShader( int id )
{
    Vector3f lightPosition;
    Vector3f eyePosition;
    float shininess;

    init
    {
        lightPosition.x = -0.75;
        lightPosition.y = 0;
        lightPosition.z = 1.0;
        eyePosition.x = 0;
        eyePosition.y = 0;
        eyePosition.z = 5;
        shininess = 20.0;
    }

    float max( float x, float y )
    {
        if( x > y )
            return x;
        else
            return y;
    }

    work pop 1 push 1
    {
        Fragment f = pop();
        // compute light vector
        Vector3f lightVector;
        lightVector.x = lightPosition.x - f.texCoord0.x;
        lightVector.y = lightPosition.y - f.texCoord0.y;
        lightVector.z = lightPosition.z - f.texCoord0.z;
        // normalize light vector
        float lvNorm = sqrt( lightVector.x * lightVector.x + lightVector.y * lightVector.y + lightVector.z * lightVector.z );
        lightVector.x /= lvNorm;
        lightVector.y /= lvNorm;
        lightVector.z /= lvNorm;
        // compute view vector
        Vector3f viewVector;
        viewVector.x = eyePosition.x - f.texCoord0.x;
        viewVector.y = eyePosition.y - f.texCoord0.y;
        viewVector.z = eyePosition.z - f.texCoord0.z;
        // normalize view vector
        float vvNorm = sqrt( viewVector.x * viewVector.x + viewVector.y * viewVector.y + viewVector.z * viewVector.z );
        viewVector.x /= vvNorm;
        viewVector.y /= vvNorm;
        viewVector.z /= vvNorm;
        // normalize normal
        float normalNorm = sqrt( f.nx * f.nx + f.ny * f.ny + f.nz * f.nz );
        f.nx /= normalNorm;
        f.ny /= normalNorm;
        f.nz /= normalNorm;
        // diffuse contribution
        float lDotN = f.nx * lightVector.x + f.ny * lightVector.y + f.nz * lightVector.z;
        lDotN = max( 0.0, lDotN );
        // specular contribution
        // compute reflection vector
        Vector3f reflectVector;
        reflectVector.x = 2.0 * lDotN * f.nx - lightVector.x;
        reflectVector.y = 2.0 * lDotN * f.ny - lightVector.y;
        reflectVector.z = 2.0 * lDotN * f.nz - lightVector.z;
        // normalize reflection vector
        float rvNorm = sqrt( reflectVector.x * reflectVector.x + reflectVector.y * reflectVector.y + reflectVector.z * reflectVector.z );
        reflectVector.x /= rvNorm;
        reflectVector.y /= rvNorm;
        reflectVector.z /= rvNorm;
        float rDotV = viewVector.x * reflectVector.x + viewVector.y * reflectVector.y + viewVector.z * reflectVector.z;
        float iSpecular = pow( rDotV, shininess );
        iSpecular = max( 0.0, iSpecular );
        Raster r;
        r.x = f.x;
        r.y = f.y;
        r.z = f.z;
        r.r = f.r * lDotN + iSpecular;
        r.g = f.g * lDotN + iSpecular;
        r.b = f.b * lDotN + iSpecular;
        if( r.r > 1.0 )
            r.r = 1.0;
        if( r.g > 1.0 )
            r.g = 1.0;
        if( r.b > 1.0 )
            r.b = 1.0;
        push( r );
    }
}
Figure A-7: Case Study #1 Pixel Shader Code.
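The pixel shader listing evaluates a per-fragment Phong model: a diffuse term from the light and normal vectors plus a specular term from the reflection and view vectors. The following is a minimal Python sketch of the same computation (illustration only; the function names are hypothetical, and the default light position, eye position, and shininess are the constants from the listing). One deliberate difference, labeled in the code, is that this sketch clamps the reflection-view dot product before exponentiation, whereas the listing clamps after calling pow.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shade(position, normal, color,
          light_pos=(-0.75, 0.0, 1.0), eye_pos=(0.0, 0.0, 5.0),
          shininess=20.0):
    """Return the clamped (r, g, b) color of one fragment."""
    light = normalize(tuple(l - p for l, p in zip(light_pos, position)))
    view = normalize(tuple(e - p for e, p in zip(eye_pos, position)))
    n = normalize(normal)
    l_dot_n = max(0.0, dot(light, n))
    reflect = normalize(tuple(2.0 * l_dot_n * nc - lc
                              for nc, lc in zip(n, light)))
    # Difference from the listing: clamp before the power, so a negative
    # base is never raised to a floating-point exponent.
    r_dot_v = max(0.0, dot(view, reflect))
    specular = r_dot_v ** shininess
    return tuple(min(1.0, c * l_dot_n + specular) for c in color)
```

With the light and the eye both directly above a surface point whose normal faces them, the specular term reaches 1 and the result saturates to white.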
Raster->void filter FrameBufferOps( int offset, int numUnits, int screenWidth, int screenHeight )
{
    float[ ( screenWidth / numUnits ) * screenHeight ] zBuffer = init_array_1D_float(
        "Pass0.zBuffer.xy." + offset + ".arr", ( screenWidth / numUnits ) * screenHeight );
    int[ ( screenWidth / numUnits ) * screenHeight ] stencilBuffer;
    int width;

    init
    {
        width = screenWidth / numUnits;
        for( int i = 0; i < width * screenHeight; ++i )
        {
            zBuffer[i] = 1.1;
        }
    }

    work pop 1
    {
        Raster r = pop();
        r.x = r.x / numUnits;
        int index = r.x * screenHeight + r.y;
        // zFail algorithm
        if( r.z >= zBuffer[ index ] )
        {
            if( r.isFrontFacing == 1 )
            {
                stencilBuffer[ index ] = stencilBuffer[ index ] - 1;
            }
            else
            {
                stencilBuffer[ index ] = stencilBuffer[ index ] + 1;
            }
        }
    }
}
Figure A-8: Case Study #2: Shadow Volumes Z-Fail Pass Frame Buffer Operations Code.
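The z-fail pass in the listing above only touches the stencil buffer when a shadow-volume fragment fails the depth test: back faces increment the count and front faces decrement it. The following is a minimal Python sketch of that update (illustration only; the function name is hypothetical). A pixel whose stencil count is non-zero after all shadow-volume fragments are processed lies inside the shadow volume.

```python
def z_fail_update(stencil, z_buffer, index, z, is_front_facing):
    """Apply one shadow-volume fragment to the stencil buffer (z-fail)."""
    if z >= z_buffer[index]:
        # The fragment is behind the visible surface: depth test fails.
        if is_front_facing:
            stencil[index] -= 1
        else:
            stencil[index] += 1

# One pixel whose visible surface is at depth 0.5.
stencil = [0]
z_buffer = [0.5]
# Front face of the volume is in front of the surface: no update.
z_fail_update(stencil, z_buffer, 0, 0.3, True)
# Back face of the volume is behind the surface: increment.
z_fail_update(stencil, z_buffer, 0, 0.8, False)
in_shadow = stencil[0] != 0
```

In this example the surface lies between the front and back faces of the volume, so the pixel ends up in shadow.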
int->void filter PoissonDOF( int width, int height, int lowWidth, int lowHeight )
{
    int[width * height] inputPackedArray = init_array_1D_int( "inputhigh.arr", width * height );
    int[lowWidth * lowHeight] inputPackedArrayLow = init_array_1D_int( "inputlow.arr", lowWidth * lowHeight );
    int[width * height] outputPackedArray;
    float maxCoCRadius;
    float radiusScale;
    float[8] poissonX = { -0.45, -0.9, -0.85, -0.2, 0.4, 0.55, 0.33, 0.8 };
    float[8] poissonY = { 0.04, 0.4, -0.3, -0.6, 0.34, -0.2, -0.58, 0.3 };
    float tmpR;
    float tmpG;
    float tmpB;
    float tmpRLow;
    float tmpGLow;
    float tmpBLow;
    int iTmpR;
    int iTmpG;
    int iTmpB;

    init
    {
        maxCoCRadius = 5.0;
        radiusScale = 0.25;
    }

    // bilinear fetch from the high-resolution image into tmpR, tmpG, tmpB
    void getHigh( float x, float y )
    {
        int x0 = ( int )x;
        int x1 = x0 + 1;
        int y0 = ( int )y;
        int y1 = y0 + 1;
        int val00 = inputPackedArray[ y0 * width + x0 ];
        int val01 = inputPackedArray[ y1 * width + x0 ];
        int val10 = inputPackedArray[ y0 * width + x1 ];
        int val11 = inputPackedArray[ y1 * width + x1 ];
        float fracX = x - x0;
        float fracY = y - y0;
        float blue00 = ( val00 & 0xff ) / 255.0;
        float green00 = ( ( val00 >> 8 ) & 0xff ) / 255.0;
        float red00 = ( ( val00 >> 16 ) & 0xff ) / 255.0;
        float blue01 = ( val01 & 0xff ) / 255.0;
        float green01 = ( ( val01 >> 8 ) & 0xff ) / 255.0;
        float red01 = ( ( val01 >> 16 ) & 0xff ) / 255.0;
        float blue10 = ( val10 & 0xff ) / 255.0;
        float green10 = ( ( val10 >> 8 ) & 0xff ) / 255.0;
        float red10 = ( ( val10 >> 16 ) & 0xff ) / 255.0;
        float blue11 = ( val11 & 0xff ) / 255.0;
        float green11 = ( ( val11 >> 8 ) & 0xff ) / 255.0;
        float red11 = ( ( val11 >> 16 ) & 0xff ) / 255.0;
        float redTop = red00 + fracX * ( red10 - red00 );
        float redBot = red01 + fracX * ( red11 - red01 );
        float greenTop = green00 + fracX * ( green10 - green00 );
        float greenBot = green01 + fracX * ( green11 - green01 );
        float blueTop = blue00 + fracX * ( blue10 - blue00 );
        float blueBot = blue01 + fracX * ( blue11 - blue01 );
        tmpR = redTop + fracY * ( redBot - redTop );
        tmpG = greenTop + fracY * ( greenBot - greenTop );
        tmpB = blueTop + fracY * ( blueBot - blueTop );
    }

    work pop 1
    {
        pop();
        for( int y = 5; y < height - 5; ++y )
        {
            // a progress print statement appears here in the source; its argument is illegible
            for( int x = 5; x < width - 5; ++x )
            {
                // fetch center tap
                float blurriness = ( ( inputPackedArray[ y * width + x ] >> 24 ) & 0x7F ) / 127.0;
                float radiusHigh = blurriness * maxCoCRadius;
                // float radiusLow = blurriness * radiusScale;
                float redAccum = 0;
                float greenAccum = 0;
                float blueAccum = 0;
                float xx = x / 5.0;
                float yy = y / 5.0;
                float coordLowX;
                float coordLowY;
                float coordHighX;
                float coordHighY;
                float redTapLow;
                float greenTapLow;
                float blueTapLow;
                float redTapHigh;
                float greenTapHigh;
                float blueTapHigh;
                float redTapBlurred;
                float greenTapBlurred;
                float blueTapBlurred;
                int tmpX;
                int tmpY;
                int val;
                for( int k = 0; k < 8; ++k )
                {
                    coordHighX = x + ( poissonX[k] * radiusHigh );
                    coordHighY = y + ( poissonY[k] * radiusHigh );
                    coordLowX = coordHighX / 5.0;
                    coordLowY = coordHighY / 5.0;
                    // nearest-neighbor fetch from the low-resolution image
                    tmpX = ( int )( coordLowX + 0.5 );
                    tmpY = ( int )( coordLowY + 0.5 );
                    val = inputPackedArrayLow[ tmpY * lowWidth + tmpX ];
                    tmpRLow = ( ( val >> 16 ) & 0xff ) / 255.0;
                    tmpGLow = ( ( val >> 8 ) & 0xff ) / 255.0;
                    tmpBLow = ( val & 0xff ) / 255.0;
                    val = getHigh( coordHighX, coordHighY );
Figure A-9: Case Study #3 Poisson Disc Filter Code, Page 1 of 2.
                    tmpR = ( ( val >> 16 ) & 0xff ) / 255.0;
                    tmpG = ( ( val >> 8 ) & 0xff ) / 255.0;
                    tmpB = ( val & 0xff ) / 255.0;
                    redTapBlurred = tmpR + blurriness * ( tmpRLow - tmpR );
                    greenTapBlurred = tmpG + blurriness * ( tmpGLow - tmpG );
                    blueTapBlurred = tmpB + blurriness * ( tmpBLow - tmpB );
                    redAccum = redAccum + redTapBlurred;
                    greenAccum = greenAccum + greenTapBlurred;
                    blueAccum = blueAccum + blueTapBlurred;
                }
                // average the 8 taps
                tmpR = redAccum * 0.125;
                tmpG = greenAccum * 0.125;
                tmpB = blueAccum * 0.125;
                iTmpR = ( int )( 255.0 * tmpR );
                iTmpG = ( int )( 255.0 * tmpG );
                iTmpB = ( int )( 255.0 * tmpB );
                outputPackedArray[ y * width + x ] = ( iTmpB | ( iTmpG << 8 ) | ( iTmpR << 16 ) );
            }
        }
    }
}
Figure A-10: Case Study #3 Poisson Disc Filter Code, Page 2 of 2.
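The high-resolution fetch in the Poisson disc filter is a bilinear interpolation over packed 0xRRGGBB pixels. The following is a minimal Python sketch of that step (illustration only; the function names are hypothetical, and the function returns the color rather than writing it to shared temporaries as the listing does). The caller must keep x + 1 and y + 1 inside the image, which the listing ensures by skipping a 5-pixel border.

```python
def unpack_rgb(val):
    """Split a packed 0xRRGGBB integer into floats in [0, 1]."""
    return (((val >> 16) & 0xFF) / 255.0,
            ((val >> 8) & 0xFF) / 255.0,
            (val & 0xFF) / 255.0)

def lerp3(a, b, t):
    """Component-wise linear interpolation between two colors."""
    return tuple(ca + t * (cb - ca) for ca, cb in zip(a, b))

def sample_bilinear(pixels, width, x, y):
    """Bilinearly interpolated color at the fractional position (x, y)."""
    x0, y0 = int(x), int(y)
    x1, y1 = x0 + 1, y0 + 1
    frac_x, frac_y = x - x0, y - y0
    c00 = unpack_rgb(pixels[y0 * width + x0])
    c10 = unpack_rgb(pixels[y0 * width + x1])
    c01 = unpack_rgb(pixels[y1 * width + x0])
    c11 = unpack_rgb(pixels[y1 * width + x1])
    top = lerp3(c00, c10, frac_x)
    bottom = lerp3(c01, c11, frac_x)
    return lerp3(top, bottom, frac_y)
```

Sampling halfway between a black column and a pure red column yields a red value of 0.5.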
Vertex->Vertex filter VertexShader( int id )
{
    Vertex[6] vertices;
    Matrix4f modelView;
    Matrix4f projection;

    float worldX;
    float worldY;
    float worldZ;
    float worldW;
    float eyeX;
    float eyeY;
    float eyeZ;
    float eyeW;
    float clipX;
    float clipY;
    float clipZ;
    float clipW;

    init
    {
        // modelview matrix, identity for now
        modelView.m[0] = 1;
        modelView.m[1] = 0;
        modelView.m[2] = 0;
        modelView.m[3] = 0;
        modelView.m[4] = 0;
        modelView.m[5] = 1;
        modelView.m[6] = 0;
        modelView.m[7] = 0;
        modelView.m[8] = 0;
        modelView.m[9] = 0;
        modelView.m[10] = 1;
        modelView.m[11] = 0;
        modelView.m[12] = 0;
        modelView.m[13] = 0;
        modelView.m[14] = -5;
        modelView.m[15] = 1;

        // nominal projection matrix
        // fov = 50 degrees, 1:1 aspect ratio, near 1, far 10
        projection.m[0] = 2.144507;
        projection.m[1] = 0;
        projection.m[2] = 0;
        projection.m[3] = 0;
        projection.m[4] = 0;
        projection.m[5] = 2.144507;
        projection.m[6] = 0;
        projection.m[7] = 0;
        projection.m[8] = 0;
        projection.m[9] = 0;
        projection.m[10] = -1.022222;
        projection.m[11] = -1;
        projection.m[12] = 0;
        projection.m[13] = 0;
        projection.m[14] = -2.022222;
        projection.m[15] = 0;
    }

    void tesselate()
    {
        vertices[3].position.x = ( vertices[0].position.x + vertices[1].position.x ) / 2.0;
        vertices[3].position.y = ( vertices[0].position.y + vertices[1].position.y ) / 2.0;
        vertices[3].position.z = ( vertices[0].position.z + vertices[1].position.z ) / 2.0;
        vertices[3].position.w = 1;
        vertices[3].color.r = ( vertices[0].color.r + vertices[1].color.r ) / 2.0;
        vertices[3].color.g = ( vertices[0].color.g + vertices[1].color.g ) / 2.0;
        vertices[3].color.b = ( vertices[0].color.b + vertices[1].color.b ) / 2.0;

        vertices[4].position.x = ( vertices[1].position.x + vertices[2].position.x ) / 2.0;
        vertices[4].position.y = ( vertices[1].position.y + vertices[2].position.y ) / 2.0;
        vertices[4].position.z = ( vertices[1].position.z + vertices[2].position.z ) / 2.0;
        vertices[4].position.w = 1;
        vertices[4].color.r = ( vertices[1].color.r + vertices[2].color.r ) / 2.0;
        vertices[4].color.g = ( vertices[1].color.g + vertices[2].color.g ) / 2.0;
        vertices[4].color.b = ( vertices[1].color.b + vertices[2].color.b ) / 2.0;

        vertices[5].position.x = ( vertices[0].position.x + vertices[2].position.x ) / 2.0;
        vertices[5].position.y = ( vertices[0].position.y + vertices[2].position.y ) / 2.0;
        vertices[5].position.z = ( vertices[0].position.z + vertices[2].position.z ) / 2.0;
        vertices[5].position.w = 1;
        vertices[5].color.r = ( vertices[0].color.r + vertices[2].color.r ) / 2.0;
        vertices[5].color.g = ( vertices[0].color.g + vertices[2].color.g ) / 2.0;
        vertices[5].color.b = ( vertices[0].color.b + vertices[2].color.b ) / 2.0;
    }

    void transform()
    {
        for( int i = 0; i < 6; ++i )
        {
            worldX = vertices[i].position.x;
            worldY = vertices[i].position.y;
            worldZ = vertices[i].position.z;
            worldW = vertices[i].position.w;
            computeEyeSpace();
            computeClipSpace();
            vertices[i].position.x = clipX;
            vertices[i].position.y = clipY;
            vertices[i].position.z = clipZ;
            vertices[i].position.w = clipW;
        }
    }

    void computeEyeSpace()
    {
        eyeX = modelView.m[0] * worldX + modelView.m[4] * worldY + modelView.m[8]
            * worldZ + modelView.m[12] * worldW;
        eyeY = modelView.m[1] * worldX + modelView.m[5] * worldY + modelView.m[9]
            * worldZ + modelView.m[13] * worldW;
        eyeZ = modelView.m[2] * worldX + modelView.m[6] * worldY + modelView.m[10]
            * worldZ + modelView.m[14] * worldW;
        eyeW = modelView.m[3] * worldX + modelView.m[7] * worldY + modelView.m[11]
            * worldZ + modelView.m[15] * worldW;
    }

    void computeClipSpace()
    {
Figure A-11: Case Study #4 Vertex Shader Code, Page 1 of 2.
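The projection constants and the column-major matrix-vector products in Figure A-11 can be checked with a short Python sketch. It assumes the constants follow the standard OpenGL-style perspective formula (the one used by `gluPerspective`), which the listing does not state. The helper names `perspective` and `transform` are hypothetical.

```python
import math


def perspective(fov_deg, aspect, near, far):
    """Return a column-major 4x4 perspective matrix as a flat list of 16 floats."""
    f = 1.0 / math.tan(math.radians(fov_deg) / 2.0)
    m = [0.0] * 16
    m[0] = f / aspect
    m[5] = f
    m[10] = -(far + near) / (far - near)
    m[11] = -1.0
    m[14] = -2.0 * far * near / (far - near)
    return m


def transform(m, v):
    """Multiply the column-major matrix m by the 4-vector v = (x, y, z, w)."""
    return tuple(
        m[row] * v[0] + m[row + 4] * v[1] + m[row + 8] * v[2] + m[row + 12] * v[3]
        for row in range(4)
    )
```

Under this assumed formula, a 50 degree field of view with a 1:1 aspect ratio gives m[0] = m[5] ≈ 2.144507, matching the listing. The listing's m[10] and m[14] values are reproduced with near = 1 and far = 91. The indexing in `transform` mirrors computeEyeSpace(): row r of the result uses elements m[r], m[r + 4], m[r + 8], and m[r + 12].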
        clipX = projection.m[0] * eyeX + projection.m[4] * eyeY + projection.m[8]
            * eyeZ + projection.m[12] * eyeW;
        clipY = projection.m[1] * eyeX + projection.m[5] * eyeY + projection.m[9]
            * eyeZ + projection.m[13] * eyeW;
        clipZ = projection.m[2] * eyeX + projection.m[6] * eyeY + projection.m[10]
            * eyeZ + projection.m[14] * eyeW;
        clipW = projection.m[3] * eyeX + projection.m[7] * eyeY + projection.m[11]
            * eyeZ + projection.m[15] * eyeW;
    }

    void randomize()
    {
        for( int i = 0; i < 6; ++i )
        {
            // make a "random" vector
            float rx = rand( -1, 1 );
            float ry = rand( -1, 1 );
            float rz = rand( -1, 1 );
            float norm = sqrt( rx * rx + ry * ry + rz * rz );
            rx = 0.05 * rx / norm;
            ry = 0.05 * ry / norm;
            rz = 0.05 * rz / norm;

            vertices[i].x += rx;
            vertices[i].y += ry;
            vertices[i].z += rz;
        }
    }

    // uniform tesselation
    // one triangle in, 4 triangles out
    work pop 3 push 12
    {
        vertices[0] = pop();
        vertices[1] = pop();
        vertices[2] = pop();

        tesselate();
        transform();
        randomize();

        push( vertices[0] );
        push( vertices[3] );
        push( vertices[4] );

        push( vertices[3] );
        push( vertices[1] );
        push( vertices[4] );

        push( vertices[0] );
        push( vertices[4] );
        push( vertices[5] );

        push( vertices[5] );
        push( vertices[4] );
        push( vertices[2] );
    }
}
Figure A-12: Case Study #4 Vertex Shader Code, Page 2 of 2.
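The uniform tessellation step of the vertex shader in Figures A-11 and A-12 (one triangle in, four triangles out) can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis code: vertices are plain tuples of attributes, the helper names `midpoint`, `tessellate`, and `area` are hypothetical, and the output triangle order is an assumption based on the push() sequence in Figure A-12.

```python
def midpoint(a, b):
    """Average two vertices given as equal-length tuples of attributes."""
    return tuple((p + q) / 2.0 for p, q in zip(a, b))


def tessellate(v0, v1, v2):
    """Split one triangle into four using the three edge midpoints."""
    v3 = midpoint(v0, v1)
    v4 = midpoint(v1, v2)
    v5 = midpoint(v0, v2)
    # Assumed output order, following the twelve push() calls in the listing.
    return [(v0, v3, v4), (v3, v1, v4), (v0, v4, v5), (v5, v4, v2)]


def area(tri):
    """Unsigned area of a triangle, using only its x and y coordinates."""
    (ax, ay), (bx, by), (cx, cy) = (p[:2] for p in tri)
    return abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) / 2.0
```

A quick sanity check is that the four output triangles cover the input exactly: for the triangle (0, 0), (2, 0), (0, 2), each piece has area 0.5 and the pieces sum to the original area of 2.0. Because every attribute is averaged, the same `midpoint` helper would interpolate colors along with positions if they were appended to each tuple.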
Appendix B
Figures
[Block diagram. Blocks, top to bottom: Input; Vertex Transformation and Lighting; Primitive Assembly and Triangle Setup; Rasterization; Texture; Frame Buffer Operations; To Display.]
Figure B-1: Fixed function pipeline block diagram.
Bibliography
[1] Kurt Akeley. Reality Engine Graphics. In Proceedings of the 20th Annual Confer-
ence on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 109-
116, 1993.
[2] OpenGL Architecture Review Board, Jackie Neider, Tom Davis, and Mason Woo.
OpenGL Programming Guide. Addison-Wesley Publishing Company, Reading,
Massachusetts, fourth edition, 14 November 2003.
[3] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike
Houston, and Pat Hanrahan. Brook for GPUs: Stream Computing on Graphics Hard-
ware. ACM Transactions on Graphics, 23(3):777-786, 2004.
[4] Matthew Eldridge, Homan Igehy, and Pat Hanrahan. Pomegranate: A Fully Scalable
Graphics Architecture. In Proceedings of the 27th Annual Conference on Computer
Graphics and Interactive Techniques (SIGGRAPH), pages 443-454, 2000.
[5] John Eyles, Steven Molnar, John Poulton, Trey Greer, Anselmo Lastra, Nick Eng-
land, and Lee Westover. PixelFlow: The Realization. In Proceedings of the ACM
SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 57-68, 1997.
[6] Homan Igehy, Gordon Stoll, and Pat Hanrahan. The Design of a Parallel Graphics
Interface. In Proceedings of the 25th Annual Conference on Computer Graphics and
Interactive Techniques, pages 141-150, 1998.
77
[7] Erik Lindholm, Mark J. Kilgard, and Henry Moreton. A User-Programmable Vertex
Engine. In Proceedings of the 28th Annual Conference on Computer Graphics and
Interactive 7echniques (SIGGRAPH), pages 149-15 8, 2001.
[8] William R. Mark, R. Steven Glanville, Kurt Akeley, and Mark J. Kilgard. Cg: A Sys-
tem for Programming Graphics Hardware in a C-like Language. ACM Transactions
on Graphics, 22(3):896-907, 2003.
[9] Michael McCool, Stefanus Du Toit, Tiberiu Popa, Bryan Chan, and Kevin Moule.
Shader Algebra. ACM Transactions on Graphics, 23(3):787-795, August 2004.
[10] Michael D. McCool, Zheng Qin, and Tiberiu S. Popa. Shader Metaprogramming. In
Proceedings of the ACM SIGGR APH/Eurographics Conference on Graphics Hard-
ware, pages 57-68, 2002.
[11] Michael Gordon and William Thies and Michal Karczmarek and Jasper Lin and Ali
S. Meli and Christopher Leger and Andrew A. Lamb and Jeremy Wong and Henry
Hoffman and David Z. Maze and Saman Amarasinghe. A Stream Compiler for
Communication-Exposed Architectures. In International Conference on Architec-
tural Support for Programming Languages and Operating Systems, 2002.
[12] Steven Molnar, Michael Cox, David Ellsworth, and Henry Fuchs. A Sorting Classifi-
cation of Parallel Rendering. IEEE Computer Graphics and Applications, 14(4):23-
32, 1994.
[13] John S. Montrym, Daniel R. Baum, David L. Dignam, and Christopher J. Migdal.
InfiniteReality: A Real-Time Graphics System. In Proceedings of the 24th Annual
Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages
293-302, 1997.
[14] Satoshi Nishimura and Tosiyasu L. Kunii. VC-1: A Scalable Graphics Computer
with Virtual Local Frame Buffers. In Proceedings of the 23rd Annual Conference on
Computer Graphics and Interactive Techniques (SIGGRAPH), pages 365-372, 1996.
78
[15] Marc Olano and Trey Greer. Triangle Scan Conversion Using 2D Homogeneous
Coordinates. In Proceedings of the ACM SIGGRAPH/Eurographics Conference on
Graphics Hardware, pages 89-95, 1997.
[16] John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, and
Ben Mowery. Polygon Rendering on a Stream Architecture. In Proceedings of the
ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 23-32,
2000.
[17] John D. Owens, Brucek Khailany, Brian Towles, and William J. Dally. Compar-
ing Reyes and OpenGL on a Stream Architecture. In Proceedings of the ACM SIG-
GRAPH/Eurographics Conference on Graphics Hardware, pages 47-56, 2002.
[18] Kekoa Proudfoot, William R. Mark, Svetoslav Tzvetkov, and Pat Hanrahan. A Real-
time Procedural Shading System for Programmable Graphics Hardware. In Proceed-
ings ofthe 28th Annual Conference on Computer Graphics and Interactive Techniques
(SIGGRAPH), pages 159-170, 2001.
[19] Thorsten Scheuermann. Advanced Depth of Field. Game Developers Conference
2004, 2004.
[20] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben
Greenwald, Henry Hoffmann, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma,
Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman
Amarasinghe, and Anant Agarwal. The Raw Microprocessor: A Computational Fab-
ric for Software Circuits and General Purpose Programs. IEEE Micro, pages 25-35,
Mar/Apr 2002.
[21] Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, and Anant Agarwal.
Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architec-
tures. In HPCA '03: Proceedings of The Ninth International Symposium on High-
Performance Computer Architecture (HPCA'03), page 341, 2003.
79
[22] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben
Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf,
Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant
Agarwal. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architec-
ture for ILP and Streams. In ISCA '04: Proceedings of the 31st annual international
symposium on Computer architecture, 2004.
[23] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A Language
for Streaming Applications. In International Conference on Compiler Construction,
April 2002.
80
