A SIMD-efficient 14 instruction shader program for high-throughput microtriangle rasterization by Roca Monfort, Jordi et al.
Vis Comput (2010) 26: 707–719
DOI 10.1007/s00371-010-0492-4
ORIGINAL ARTICLE
A SIMD-efficient 14 instruction shader program
for high-throughput microtriangle rasterization
Jordi Roca · Victor Moya · Carlos Gonzalez ·
Vicente Escandell · Albert Murciego ·
Agustin Fernandez · Roger Espasa
Published online: 14 April 2010
© Springer-Verlag 2010
Abstract This paper shows that breaking the barrier of 1
triangle/clock rasterization rate for microtriangles in mod-
ern GPU architectures in an efficient way is possible. The
fixed throughput of the special purpose culling and triangle
setup stages of the classic pipeline limits the GPU scalability
to rasterize many triangles in parallel when these cover very
few pixels. In contrast, the shader core counts and increasing
GFLOPs in modern GPUs clearly suggest parallelizing this
computation entirely across multiple shader threads, mak-
ing use of the powerful wide-ALU instructions. In this pa-
per, we present a very efficient SIMD-like rasterization code
targeted at very small triangles that scales very well with the
number of shader cores and has higher performance than tra-
ditional edge equation based algorithms. We have extended
the ATTILA GPU shader ISA (del Barrio et al. in IEEE
International Symposium on Performance Analysis of Systems
and Software, pp. 231–241, 2006) with two fixed point
instructions to meet the rasterization precision requirement.
This paper also introduces a novel subpixel Bounding Box
size optimization that adjusts the bounds much more finely,
which is critical for small triangles, and doubles the
2 × 2-pixel stamp test efficiency. The proposed shader
rasterization program can run on top of the original pixel
shader program in such a way that selected fragments are
rasterized, attribute interpolated and pixel shaded in the
same pass. Our results show that our technique yields better
performance than a classic rasterizer at 8 or more shader
cores, with speedups as high as 4× for 16 shader cores.
Keywords Microtriangle rasterization · GPU rendering ·
Shader performance
J. Roca (✉) · V. Moya · C. Gonzalez · V. Escandell ·
A. Murciego · A. Fernandez
Computer Architecture Department (UPC), Barcelona, Spain
e-mail: jroca@ac.upc.edu
V. Moya
e-mail: vmoya@ac.upc.edu
C. Gonzalez
e-mail: cgonzale@ac.upc.edu
V. Escandell
e-mail: vicente@ac.upc.edu
A. Murciego
e-mail: albertm@ac.upc.edu
A. Fernandez
e-mail: agustin@ac.upc.edu
R. Espasa
Intel Barcelona, Barcelona, Spain
e-mail: roger.espasa@intel.com
1 Introduction
The traditional rasterization pipeline in current GPU archi-
tectures exploits the parallelism of fragment-inside-triangle
tests within a single large polygon [16, 17, 19, 20] to main-
tain a high fragment fill rate. Implicitly, current GPUs rely
on medium to large triangle sizes to deliver high interactive
frame rates: while GPUs have vast amounts of processing
power to deal with fragments, the triangle processing por-
tion of their pipeline (culling, clipping, setup and traversal)
is still limited to 1 or 2 triangles per clock [1]. This
limitation is one of the reasons why the geometric complexity
of 3D scenes in interactive rendering has been traditionally
low (300 K triangles/frame in the best-looking games like
Crysis) compared with cinematic-quality rendered scenes
(500 M polygons/frame) [13].
Lately, the desire for higher quality interactive rendering
has led to an increase of the geometric detail, causing a de-
crease in average triangle size. Additionally, large numbers
of very small triangles are commonly generated by parti-
cle systems in games to render fire, fog, etc. And looking
forward at future game generations, the upcoming DX11
specification includes tessellation support in the rendering
pipeline API [7]. With tessellation, the coarse-detail geome-
try is broken on the fly into smaller triangles of a few or even
a single pixel, which results in smooth rendered surfaces.
This increase in small and very-small size primitives,
commonly referred to as microtriangles, is a problem for
edge-equation based setup and traversal rasterization meth-
ods used in current GPUs. First, the fixed rate of 1 trian-
gle per clock of the culling, clipping, setup and traversal
stages generates a low percentage of triangle valid frag-
ments, which starves the rest of the fragment processing
pipeline. To increase the fragment throughput, the trivial
replication of triangle setup units leads to a waste of com-
putation because of their improper scaling: the expensive
costs of computing the triangle edge equations and setting
up the traversal are not justified for the few fragment-inside-
triangle tests in the small area of a microtriangle. As demon-
strated in [10], solutions to efficiently rasterize micropoly-
gons require parallelism across many polygons rather than
across fragments within a single polygon to keep a high frag-
ment fill rate.
This paper introduces a novel rasterization approach that
can be fully integrated in the current GPU rendering pipeline
and optionally selected by API users to process streams of
microtriangles, while still preserving the classic rasterizer
for normal triangles. The presented alternative pipeline con-
sists of two different components. The first one is an op-
timized rasterization shader program that can run in many
parallel GPU shader threads for high microtriangle rasteri-
zation throughput. It uses only 14 SIMD instructions and,
for the most part, fully utilizes the parallel ALU blocks of
the GPU shader cores. The code is straightforward with no
loops and requires a few feasible modifications in the shader
instruction set: two new fixed point instructions FXMUL
and FXMAD to meet the precision requirements for crack-
free triangle rasterization and a CMP&KIL instruction for
improved intersection testing.
The second key feature of our approach is that we com-
pute the microtriangle axis-aligned bounding box with sub-
pixel precision in a high-throughput hardware unit called
Triangle Bound. This hardware uses the computed subpixel
bounds to execute an important optimization pass aimed at
reducing the bounding box size, and therefore the number of
fragments to test. Our experiments show that this
optimization is critical in microtriangles to reduce the
percentage of useless rasterization shading work. It has also
proven very useful to cull entire microtriangles, specifically
those lying in between pixel sampling points, and this method
can be implemented in current GPUs at little cost.
Our engineered rasterization program shares with [10]
a fundamental design feature: the edge equation based ras-
terization has been replaced by individual point-in-triangle
tests. In Sect. 2, we justify the substantially lower cost of
the latter when few points have to be tested. Since
point-in-triangle tests are independent, individual shader
rasterization threads can be executed in parallel for each
fragment with no inter-thread communication. The required
face culling test is also performed as part of the
rasterization program, at no extra cost in our approach thanks
to the powerful SIMD instructions.
However, our proposal differs from [10] in that shader
threads do not correspond to whole microtriangles but to
tightly selected fragment candidates that may fall within the
microtriangle edge limits. This major difference allows our
approach to rasterize both small-but-larger-than-one-pixel
triangles and sub-pixel size triangles not perfectly aligned
to pixel boundaries, which can cover up to 4 neighboring
pixels. Both types of small triangles are much more likely
outputs of GPU tessellators. An ideal triangle-to-pixel
perfect tessellation does not exist yet, and it is only closely
approximated in the REYES rendering architecture [6], where
high-definition primitives are diced to fit approximately in
one screen pixel.
The implemented microtriangle pipeline directly sends
groups of fragments from the adjusted bounding boxes of
several different microtriangles to the shader units, which
together execute the rasterization program immediately fol-
lowed by the original fragment program. The fewer frag-
ments per triangle, the more triangles can be rasterized and
shaded per clock. Shaded fragments are finally reordered
to match the original assembled triangle input as required
for rasterized lists of triangles. However, our software-based
rasterization is different from the Larrabee rasterization ap-
proach [2] in that we use shader programs to process streams
of small triangles only (when the low utilization of the
shader units favors its use), and still use the classic GPU
fixed-function hardware rasterizer for the large ones.
The remainder of this paper is organized as follows. Sec-
tion 2 describes the classic rasterization pipeline and its
main limitations when rendering microtriangles. In Sect. 3,
we present our shader rasterization program and the ad-
justed subpixel Bounding Box solution for microtriangles.
In Sect. 4, we describe the implementation details and cur-
rent limitations of our microtriangle pipeline. Obtained per-
formance is shown in Sect. 5, and a few lines about the inter-
action of our technique with Z optimizations can be found
in Sect. 6. Finally, Sect. 7 summarizes the conclusions of
this work.
2 Classic rasterization
The task of the rasterization stage is to transform screen-
space projected polygons into covered pixel positions within
the screen area. The per-pixel generated rendering objects,
called fragments, carry the color and visibility informa-
tion required by further stages of the rendering pipeline to
compute their contribution to the final rendered image.
Per-fragment depth, color or texture coordinate values are
derived by interpolating the corresponding attributes
associated with the vertices of the polygon.
For triangle rasterization specifically, graphics systems
such as [8] use the scan-line rasterization algorithm, in which
horizontal spans of fragments are traversed within the
triangle boundaries while per-fragment attributes are
calculated by adding interpolation factors. In the case of
commodity graphics hardware, the widely adopted alternative
has been the edge equations approach [16, 17, 19, 20], in which
a per-triangle preprocessing step called “setup” calculates
explicit equations from the triangle’s edges that allow groups
of pixels to be evaluated in parallel using just a few
additions, which makes it more suitable for dedicated graphics
hardware. According to our cost analysis based on [16] and
summarized in Fig. 1, when the triangle area covers at least
4 pixels, the expensive edge equation calculation is amortized
over a cheaper per-pixel evaluation (blue and yellow lines).
However, individual pixel-in-triangle tests based on
cross-products require fewer total operations when triangles
are very small (pink line). Although the number of operations
involved in the triangle setup stage is lower than in other
graphics pipeline computations, the trivial replication of
setup + traversal engines to increase triangle throughput in
GPU architectures is considered inefficient due to the
operations wasted on computing edge equations for the small
number of pixels to test [10].
Traditional graphics pipelines heavily rely on the abun-
dance of large triangles in typical 3D workloads and size up
to a fixed processing rate of 1 or 2 triangles per cycle in the
rasterization units, so that parallel generation of fragments
within a single triangle results in a reasonable number of
inputs to feed the rest of the fragment pipeline. As shown
in Fig. 8(a), this fixed triangle rate results, however, in a
low utilization of Z/Stencil, Color and Shader units when
streams of very small triangles are the input to the pipeline,
while the triangle setup operations of the rasterizer unit be-
come the bottleneck. The simple replication of rasterization
engines can be particularly wasteful in terms of rasterizer
area efficiency because, when processing sequences of large
triangles, the units will become idle most of the time (a sin-
gle rasterizer can feed the fragment pipeline).
Modern GPU rasterizers test coarse regions of screen pixels
within the triangle bounding box, called tiles, expecting
those common large triangles to succeed in most of the pixel
evaluations on edge equations. It is key for the rasterization
performance to avoid as many failing pixel tests as possible,
to minimize the work on fragments that do not contribute to
the final rendered image. This sample test efficiency (STE),
as referred to in [10], becomes an especially challenging
problem with small triangles: even the relatively small
2 × 2-pixel data unit used in the fragment processing pipeline
becomes highly under-utilized when triangles are about one
pixel in size. Methods to select a tighter set of candidates
for fragment evaluation, like our bounding box optimization
pass presented in Sect. 3.2, are therefore essential for
microtriangles.
Fig. 1 Estimated cost of the edge equations and cross-products
approaches over triangle size (one sample per pixel, no MSAA).
For edge equations, “traversal” evaluates one pixel at a time
and “parallel” parallelizes the pixel evaluation. In both
cases, a base cost of 65 operations is always paid to set up
the triangle and evaluate at the starting point. “traversal”
uses fewer operations (only 4 adds per pixel) but they are
serialized, while “parallel” evaluates several pixels in a row
using 8 muls and 8 adds for each pixel. “Cross-products” skips
setup computations at the expense of 31 operations per pixel,
which amounts to fewer total operations for triangles of size
1 and 2
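The crossover described in the Fig. 1 caption can be checked with a few lines of arithmetic. The sketch below is only a rough model built from the operation counts quoted there. It assumes that the base cost of 65 operations already covers the first pixel, so per-pixel costs are charged only for the remaining pixels; that reading, and all function names, are assumptions of this sketch rather than part of the original analysis.

```python
# Rough operation-count model for the three approaches compared in Fig. 1.
# Assumption of this sketch: the 65-operation base cost already includes the
# evaluation of the first pixel, so per-pixel costs apply to pixels - 1.
SETUP_OPS = 65                    # setup + evaluation at the starting point
TRAVERSAL_OPS_PER_PIXEL = 4       # 4 adds per pixel (serialized)
PARALLEL_OPS_PER_PIXEL = 8 + 8    # 8 muls + 8 adds per pixel
CROSS_PRODUCT_OPS_PER_PIXEL = 31  # no setup, 31 operations per pixel


def edge_equation_ops(pixels, per_pixel_ops):
    """Total operations for an edge-equation rasterizer on `pixels` pixels."""
    return SETUP_OPS + per_pixel_ops * (pixels - 1)


def cross_product_ops(pixels):
    """Total operations for independent cross-product tests."""
    return CROSS_PRODUCT_OPS_PER_PIXEL * pixels


def cheapest(pixels):
    """Name of the approach with the fewest total operations."""
    costs = {
        "traversal": edge_equation_ops(pixels, TRAVERSAL_OPS_PER_PIXEL),
        "parallel": edge_equation_ops(pixels, PARALLEL_OPS_PER_PIXEL),
        "cross-products": cross_product_ops(pixels),
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, cheapest(n))
```

Under this model the cross-products test uses the fewest operations for triangles of 1 and 2 pixels, which agrees with the caption, and both edge-equation variants are cheaper from 4 pixels on.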
3 In-shader microtriangle rasterization
3.1 The rasterization program
Our shader rasterization approach is based on the execution
of a more efficient algorithm for triangles covering a single
or very few pixels, the cross-products method of Fig. 1.
The program requires a minimum of 4 live vector registers
throughout the code, and the proper allocation of temporary
values avoids explicit MOVs or branches while still keeping
the instruction count minimal and the SIMD utilization high,
at 75% to 100% for the most part (operated components are in
bold, according to the write mask).
Fig. 2 The rasterization program (a) is executed on fragments
with the inputs described in (b) to test if their center lies
in the positive halfplane of all three microtriangle edges (c).
Indeterminate cases at edges intersecting the pixel center are
resolved by the edge-direction rule, and successful fragments
interpolate their depth from Z values at vertices
Each program execution determines if the tested fragment
center (Fig. 2(b) shows stored offset coordinates in i0.x and
i1.x) lies in the positive halfplane of all three microtriangle
edges by computing cross-products of the edge directed vectors
e_i with their respective p_i vectors (as shown in Fig. 2(c)).
The screen projected vertex positions are read from the
remaining components of i0 and i1 (Fig. 2(b)). If the fragment
center is outside, it is killed and further computation
aborted. A successful (inside) fragment interpolates its Z and
input attributes, and finally the original fragment shader
instructions are executed to compute the fragment color and
optional depth (as shown in Fig. 3). Concretely, the
rasterization program performs the following required steps:
– Computing the edge directed vectors e_i
By properly subtracting the i0 and i1 input vector components
in (a) and (b), the microtriangle edges are stored in t1 and
t3, grouped by the A and B coefficients of the explicit edge
equations, respectively. This allocation simplifies the later
tie break test.
– Computing the testing point vectors p_i
(c) and (d) compute the vectors going from each vertex to the
testing point (the pixel center). In addition, the negation of
e_1 (−e_1), used to compute the microtriangle area and face
orientation, is calculated in the 4th SIMD slot.
– Computing the cross-products
The proper allocation of values and the use of swizzles allow
(e) and (f) to compute the three cross-products of the e and p
vectors using fixed point precision (operations are described
in Sect. 4.2). The sign of each cross-product determines if
the tested point lies on the left (+), on the right (−) or
exactly on the edge (0). The magnitude of each cross-product
represents twice the area of the subtriangle defined by the
operated vectors. Normalizing by the total area of the
triangle (2A has been computed in t0.x as the 4th
cross-product, of e_2 and −e_1) provides the areal coordinates
for attribute interpolation.
– Testing inside-triangle with the edge tie break rule
The next instruction (g) tests the sign of the cross-products
to kill fragments in any of the negative half-spaces of the
edges. The sign of the computed microtriangle area 2A in t0.x
also indicates its face orientation (clockwise (−),
counterclockwise (+)), hence clockwise face culling is also
performed in (g) (the remaining culling modes are explained in
Sect. 3.3). The following CMP&KIL instruction (h) works as the
ARB 1.0 [4] CMP instruction fused with a KIL: for each
component, if the first operand is negative, it copies into
the destination register the corresponding component of the
second operand register, otherwise that of the third operand.
If any of the masked components in the destination register is
negative, the thread is killed. Therefore, any positive
cross-product in t0 will directly succeed by testing the
constant 1, but a zero cross-product (the pixel center lies on
the edge) will test the corresponding coefficient A in t1
first and, if A is also zero, the second CMP&KIL (i) will test
the B coefficient in t3, following the tie break rule
described in [16].
– Interpolating the fragment’s Z
The perspective interpolation of the Z value for those
successful fragments is performed by (j), (k) and (l) using
the following expression in [15]:
$$Z = \frac{2A}{E_0 \cdot \frac{1}{v_0.z} + E_1 \cdot \frac{1}{v_1.z} + E_2 \cdot \frac{1}{v_2.z}},$$
with the E_i being the magnitudes of the cross-products of
the corresponding e_i and p_i vectors and the 1/v_i.z being the
precomputed inverses of the Z values at each corresponding
vertex.
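The steps above can be sketched in scalar form as follows. This is a minimal illustration, not the 14-instruction program of Fig. 2: it uses plain integers in place of the fixed point arithmetic, assumes that edge i is the edge opposite vertex i (so that 2A equals the cross-product of e_2 and −e_1, as stated above), and derives A = −e.y and B = e.x from the cross-product expansion. The function names and the zero-area guard are assumptions of this sketch.

```python
def cross(ax, ay, bx, by):
    """2D cross product a x b; positive when b lies to the left of a."""
    return ax * by - ay * bx


def rasterize_fragment(v0, v1, v2, px, py):
    """Return the interpolated Z if (px, py) is covered, otherwise None.

    Vertices are (x, y, z) tuples. x, y and the test point are integers
    standing in for fixed-point screen coordinates.
    """
    # Assumed labeling: edge i is the directed edge opposite vertex i.
    edges = ((v1, v2), (v2, v0), (v0, v1))
    e = [(b[0] - a[0], b[1] - a[1]) for a, b in edges]   # edge vectors e_i
    p = [(px - a[0], py - a[1]) for a, _ in edges]       # point vectors p_i

    # Twice the signed area: cross product of e_2 and -e_1.
    two_area = cross(e[2][0], e[2][1], -e[1][0], -e[1][1])
    if two_area <= 0:
        # Negative area: clockwise face, culled. The zero-area case is an
        # extra guard added by this sketch to avoid dividing by zero.
        return None

    big_e = [cross(ei[0], ei[1], pi[0], pi[1]) for ei, pi in zip(e, p)]
    for (ex, ey), value in zip(e, big_e):
        if value < 0:
            return None                  # outside this edge
        if value == 0:
            a_coef, b_coef = -ey, ex     # A = dE/dx, B = dE/dy
            if a_coef < 0 or (a_coef == 0 and b_coef < 0):
                return None              # tie break: the neighbor owns it

    # Perspective interpolation: Z = 2A / sum(E_i * 1/v_i.z).
    denom = sum(value * (1.0 / v[2])
                for value, v in zip(big_e, (v0, v1, v2)))
    return two_area / denom
```

With two counterclockwise triangles sharing an edge, a pixel center lying exactly on the shared edge is accepted by only one of them, which is the purpose of the tie break rule.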
The rasterization algorithm shown here has a fundamental
advantage compared to the algorithm used in traditional
rasterizers: fragment tests of the same or different
microtriangles are independent pieces of work (no inter-shader
thread communication is required) so their computation can
be parallelized across multiple shader threads. As we will
see in Sect. 4, due to a current limitation in our
implementation related to texture sampling LOD computation, we
are, however, constrained to send 2 × 2 fragment stamps as
the minimum shader input work per microtriangle.
Fig. 3 Complete shader program with detailed interpolation
code options (according to the perspective requirements)
Our microtriangle pipeline can rasterize up to four 2 × 2
pixel stamps (4 × 1 or 1 × 4 stamps in either axis direction
or a 2 × 2 squared stamp area) to cover an approximate total
area of up to 8 pixels. We have considered these larger trian-
gles because current GPU tessellation approaches, although
view dependent, mostly tessellate in object space rather than
screen space. They therefore commonly generate microtriangles
that are not perfectly aligned and larger than a single pixel area.
We added the CMP&KIL instruction to the ATTILA
GPU shader instruction set [5] for the tie break testing and
it has proven also very useful to test other face orientations
(see Sect. 3.3). The hardware implementation is rather sim-
ple: the instruction must AND the component write mask
with the corresponding sign bit of the CMP result and set
the thread kill/alive status.
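The behavior of CMP&KIL described above can be emulated per component as in the following sketch. It follows only the textual description given in Sect. 3.1 (select, then kill on any masked negative result). Swizzles are left out, registers are modeled as Python lists, and the names cmp_kil and tie_break_alive are hypothetical.

```python
def cmp_kil(src0, src1, src2, write_mask):
    """Emulate CMP&KIL on lists of equal length.

    For each component, pick src1 when src0 is negative, otherwise src2.
    The thread is killed when any write-masked result is negative.
    Returns (result, killed).
    """
    result = [b if a < 0 else c for a, b, c in zip(src0, src1, src2)]
    killed = any(m and r < 0 for m, r in zip(write_mask, result))
    return result, killed


def tie_break_alive(edge_values, a_coefs, b_coefs):
    """Apply the two CMP&KIL steps (h) and (i) to non-negative edge values.

    Assumes fragments with a negative cross-product were already killed
    by instruction (g).
    """
    ones = [1] * len(edge_values)
    mask = [True] * len(edge_values)
    # (h): positive cross-products select the constant 1, zeros select A.
    t2, killed = cmp_kil([-v for v in edge_values], ones, a_coefs, mask)
    if killed:
        return False
    # (i): where the previous result is still zero (A == 0), select B.
    _, killed = cmp_kil([-v for v in t2], ones, b_coefs, mask)
    return not killed
```

In hardware terms, the kill decision is the OR over components of the write mask bit ANDed with the sign bit of the selected value, which is what the `any(...)` expression models.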
3.2 Microtriangle bounding box optimization
It is a fundamental requisite in our implementation to test
as few fragments as possible, ideally only those contribut-
ing to the rendered image, since our microtriangle pipeline
is mainly shader bound (as shown in Fig. 8(b)), so the total
instructions executed directly relates to the achievable per-
formance. This requirement justifies the use of an earlier mi-
crotriangle bounding stage that carefully selects the tight set
of fragments to be sent to the shader rasterization.
In [10], the same rasterization program computes the mi-
cropolygon bounding box and then chooses a set of test sam-
ples within the computed limits. The low sample test effi-
ciency (STE) of this approach is identified by the authors
as one of the biggest problems with micropolygons. In our
microtriangle pipeline, although we previously select those
fragment candidates to be rasterized, a careless selection re-
sults in a low test efficiency and waste of shader computa-
tion with no contribution to the final image. As shown in
Fig. 4(a), the usual pixel-precision bounding box generates
a suboptimal number of fragments to test: a lot of failing
2 × 2 pixel stamps are selected if the microtriangle edges
slightly cover pixels whose sample points are clearly out-
side the microtriangle. Similarly, in cases such as that shown
in Fig. 4(b), the microtriangle lies in-between the four pixel
centers but does not hit any of them.
Fig. 4 (a) and (b) describe typical scenarios where the
conventional inclusive pixel bounding box (thick red line)
selects a suboptimal set of 2 × 2 fragment stamp candidates
(yellow) compared to the stamps with fragments that actually
hit the triangle (blue). By comparing the subpixel accurate
bounds (dashed green line) with the limits of the sampling
area, the simple BB reduction operations described in (c)
(4 comparisons + 4 increments) help to increase the percentage
of shaded microtriangle stamps that generate valid fragments
The solution proposed in this work consists in computing
the subpixel-accurate bounding box of the microtriangle and
comparing its decimal fractions against the limits of the
sampling region (the bounding box that contains all the sample
points) to determine whether the triangle has any chance of
hitting a sample point. As shown in Fig. 4(c), the required
operations become as simple as comparing each microtriangle
subpixel bounding box edge with the opposite edge of the
sampling region. If no overlap is detected, the integer
bounding limit can be stepped one pixel inwards, thus reducing
the bounding box size. An interesting result is that some very
small triangles can even be totally culled with this
optimization (as is the case in Fig. 4(b)). It is easy to
prove that this simple comparison of decimal fractions works
in all cases, even at the viewport borders, provided that an
inclusive bounding box is used. Figure 5 shows the significant
improvement from 20% to 45% of the 2 × 2-stamp test efficiency
using this optimization. Although the presented optimization
mostly benefits small triangles, it can be used as well in
current GPU architectures using a traditional rasterizer to
adjust the traversal area of normal size triangles and cull
entire microtriangles.
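The bounding box reduction can be sketched as follows for the single-sample case. The code assumes non-negative fixed-point coordinates with 8 fractional bits and a sampling region that collapses to the pixel center. The function names and the default region are assumptions of this sketch, and the hardware unit performs the four comparisons in parallel rather than sequentially.

```python
FRAC_BITS = 8                 # subpixel bits used for the bounds
ONE = 1 << FRAC_BITS
FRAC_MASK = ONE - 1
PIXEL_CENTER = ONE // 2       # single sample at the pixel center (assumed)


def reduce_axis(lo_fx, hi_fx, region_lo=PIXEL_CENTER, region_hi=PIXEL_CENTER):
    """Return the adjusted inclusive pixel bounds (lo_px, hi_px) on one axis.

    lo_fx and hi_fx are the fixed-point subpixel bounds of the triangle;
    region_lo and region_hi are the fractional limits of the sampling region.
    """
    lo_px = lo_fx >> FRAC_BITS
    hi_px = hi_fx >> FRAC_BITS
    # The lower bound starts beyond the last sample of its pixel: step inwards.
    if (lo_fx & FRAC_MASK) > region_hi:
        lo_px += 1
    # The upper bound ends before the first sample of its pixel: step inwards.
    if (hi_fx & FRAC_MASK) < region_lo:
        hi_px -= 1
    return lo_px, hi_px


def optimized_bbox(xs_fx, ys_fx):
    """Adjusted pixel bounding box (x0, y0, x1, y1), or None if culled."""
    x0, x1 = reduce_axis(min(xs_fx), max(xs_fx))
    y0, y1 = reduce_axis(min(ys_fx), max(ys_fx))
    if x0 > x1 or y0 > y1:
        return None           # the triangle lies in between sample points
    return x0, y0, x1, y1
```

For example, a triangle spanning 0.9 to 2.1 pixels on both axes has a 3 × 3 inclusive pixel bounding box, but only pixel (1, 1) has its center inside the subpixel bounds, so the adjusted box shrinks to that single pixel.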
We decided to implement the computation of the sub-
pixel-accurate bounding box along with the bounding box
optimization pass and the generation of the resulting mi-
crotriangle stamps, in a fixed-function stage called Trian-
gle Bound (Fig. 6). It computes the bounding box as usual:
(1) perspective division by w of triangle vertices, (2) trans-
formation to viewport coordinates and clamping to the scis-
sor rectangle, and (3) calculation of the horizontal and ver-
tical fixed-point bounds (with subpixel precision). The re-
duction operations of Fig. 4(c) can be parallelized using an
integer comparator for each bound, resulting in a very sim-
ple unit easily replicable for high-throughput at little cost.
Our experiments (see Fig. 5) show that 8 bit precision is
enough for the bounds comparison. When the microtrian-
gle pipeline is enabled, the Microtriangle Stamp Generator
subunit generates the set of microtriangle fragments within
the adjusted bounds and forwards them to the shader units,
therefore skipping the setup and traversal stages.
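The three bounding box steps listed above can be illustrated with the sketch below. It uses the conventional viewport mapping from normalized device coordinates, floors to fixed point with 8 fractional bits, and assumes already clipped vertices with w > 0. These choices and the function name are assumptions of this sketch, since the exact arithmetic of the Triangle Bound unit is not detailed here.

```python
import math

FRAC_BITS = 8  # subpixel precision of the fixed-point bounds


def subpixel_bounds(clip_vertices, viewport, scissor):
    """Fixed-point bounds (x_min, y_min, x_max, y_max) of a triangle.

    clip_vertices: three (x, y, z, w) tuples with w > 0 (assumed clipped).
    viewport: (x0, y0, width, height) in pixels.
    scissor: (x_min, y_min, x_max, y_max) in pixels.
    """
    vx, vy, vw, vh = viewport
    sx0, sy0, sx1, sy1 = scissor
    xs, ys = [], []
    for x, y, _z, w in clip_vertices:
        ndc_x, ndc_y = x / w, y / w               # (1) perspective division
        xs.append(vx + (ndc_x * 0.5 + 0.5) * vw)  # (2) viewport transform
        ys.append(vy + (ndc_y * 0.5 + 0.5) * vh)

    def clamp(value, lo, hi):
        return max(lo, min(hi, value))

    def to_fixed(value):
        return math.floor(value * (1 << FRAC_BITS))

    # (2) clamp to the scissor rectangle, (3) convert to fixed point.
    return (
        to_fixed(clamp(min(xs), sx0, sx1)),
        to_fixed(clamp(min(ys), sy0, sy1)),
        to_fixed(clamp(max(xs), sx0, sx1)),
        to_fixed(clamp(max(ys), sy0, sy1)),
    )
```

The low 8 bits of each returned bound are the fractions that the comparators of Fig. 4(c) test against the sampling region limits.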
3.3 Face culling
A long-time supported feature of the graphics pipeline is the
face culling test, that allows a triangle to be discarded based
on its face orientation (the order of its vertices). It is pri-
marily used to discard hidden back faces of solid rendered
objects [21] and implement “hard” shadows in Stencil Volume
algorithms [3]. Rasterized fragments in our approach must be
similarly tested, and the thread killed accordingly, to
prevent fragments of a culled microtriangle face from being
drawn.
As described in Sect. 3.1, the rasterization program in
Fig. 2 performs the clockwise face culling for free through
instructions (a) to (g). To cull counterclockwise faces in-
stead, we must simply invert the sign of the face orientation
and areal coordinates before the test, which have inverted
sign in alive clockwise faces. In addition, A and B coeffi-
cients must be negated for a consistent tie break test: the co-
incident edges of two adjacent triangles must always point
in opposite directions. This is the summary of instructions:
(g) CMP_KIL t0.xyzw, -t0.xyzw, -t0.xyzw, -t0.xyzw;
(h) CMP_KIL t2.yzw, -t0.yyzw, c255.xxxx, -t1.yyzw;
(i) CMP_KIL t2.yzw, -t2.yyzw, c255.xxxx, -t3.yyzw;
Finally, to allow both triangle orientations (disable
culling), the following code negates A and B (using (g.1)
and (g.2)) and areal coordinates (g) according to the face
orientation (note that here a fragment is never killed regard-
less of its orientation):
(g.1) CMP t1.yzw, t0.xxxx, -t1.yyzw, t1.yyzw;
(g.2) CMP t3.yzw, t0.xxxx, -t3.yyzw, t3.yyzw;
(g) CMP_KIL t0.xyzw, t0.xxxx, -t0.xyzw, t0.xyzw;
(h) CMP_KIL t2.yzw, -t0.yyzw, c255.xxxx, t1.yyzw;
(i) CMP_KIL t2.yzw, -t2.yyzw, c255.xxxx, t3.yyzw;
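The effect of the three culling configurations can be summarized in scalar form as below. This sketch mirrors the sign manipulations described in this section rather than the exact register-level sequences: the area, the areal coordinates and the A and B coefficients are negated when the configuration requires it, and the usual inside test with tie break follows. The function name and the mode strings are hypothetical.

```python
def fragment_alive(two_area, edge_values, a_coefs, b_coefs, cull_mode="cw"):
    """Return True when the fragment survives culling and the inside test.

    cull_mode: "cw" culls clockwise faces (negative area), "ccw" culls
    counterclockwise faces, and "none" accepts both orientations.
    """
    if cull_mode not in ("cw", "ccw", "none"):
        raise ValueError("unknown cull mode: " + cull_mode)

    # Negate everything for counterclockwise culling, or when culling is
    # disabled and the face happens to be clockwise.
    flip = cull_mode == "ccw" or (cull_mode == "none" and two_area < 0)
    if flip:
        two_area = -two_area
        edge_values = [-v for v in edge_values]
        a_coefs = [-a for a in a_coefs]
        b_coefs = [-b for b in b_coefs]

    # From here on this is the clockwise-culling test of Sect. 3.1.
    if two_area < 0 or any(v < 0 for v in edge_values):
        return False
    for value, a_coef, b_coef in zip(edge_values, a_coefs, b_coefs):
        if value == 0 and (a_coef < 0 or (a_coef == 0 and b_coef < 0)):
            return False
    return True
```

Negating A and B together with the areal coordinates keeps the tie break consistent, since a flipped face traverses each shared edge in the opposite direction.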
3.4 MSAA
Anti-aliasing techniques are a fundamental feature in
current graphics pipelines to avoid the stair-like
rasterization of triangle edges (called “jaggies”), which is
especially unrealistic in interactive 3D animation. The
multi-sampling anti-aliasing (MSAA) rasterization approach in
GPUs involves testing at different points inside the pixel.
Z values are computed at these sampling points to determine a
finer visibility that better approximates the non-aliased
color of the pixel. The rasterization stage is therefore
responsible for the construction of the sample-inside-triangle
coverage mask and the interpolation of the Z values at the
sampling points.
Fig. 5 2 × 2-stamp test efficiency increase with respect to
the number of bits used for the subpixel bounding box. The
0 bits series corresponds to the optimization disabled. Data
is shown for a 1/2 pixel size regular microtriangle mesh
rolled through 360 degrees about the view direction
Fig. 6 Detailed pipe of the Triangle Bound unit. The enabled
MSAA mode looks up a small memory storing the corresponding
sampling region bounds, which are compared with the subpixel
fractional bounds of the projected microtriangle. The outputs
from the comparators control the adders which adjust the
bounding box used for the microtriangle stamp generation
To test and interpolate these additional sampling points in
our rasterization program we compute the areal coordinates,
as usual, at the first sample point (the base sample), and then
we take advantage of the linearity of edge evaluations to in-
crementally evaluate every other sample in the same pixel
using constant offsets (defined by the MSAA pattern). For
example, to incrementally evaluate E0 at s, given the evalu-
ation at the base we perform the following operation:
E0(s) = E0(base) + A0 ∗ delta.x + B0 ∗ delta.y,
where delta is the constant subpixel vector from the base to
s, and A0 and B0 are the coefficients of the directed edge e0. The A and
B coefficients of the edges were deliberately stored in t1 and t3,
respectively, after instructions (a) and (b) in Fig. 2. Deltas
to the rest of samples can be precomputed in program con-
stants. The following two FXMAD instructions (described
in Sect. 4.2) incrementally evaluate (using the required fixed
point precision) at the new sample point, simultaneously for
the three edges, given the areal coordinates E0, E1 and E2
at the base sample in t0 and the delta vector in the constant
254:
FXMAD fx4.yzw, c254.xxxx, t1.yyzw, t0.yyzw;
FXMAD t4.yzw, c254.yyyy, t3.yyzw, fx4.yyzw;
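The incremental evaluation can be checked against a direct evaluation with a short sketch. Integers stand in for the fixed point values, each multiply-add is assumed to compute a * b + c, and the coefficients are taken as A = −e.y and B = e.x, which follows from expanding the cross-product. The function names are hypothetical.

```python
def edge_coefficients(start, end):
    """A and B of the directed edge start -> end (A = dE/dx, B = dE/dy)."""
    ex, ey = end[0] - start[0], end[1] - start[1]
    return -ey, ex


def edge_value(start, end, x, y):
    """Direct cross-product evaluation of the edge at point (x, y)."""
    ex, ey = end[0] - start[0], end[1] - start[1]
    return ex * (y - start[1]) - ey * (x - start[0])


def evaluate_sample(base_values, a_coefs, b_coefs, delta):
    """Edge values at base + delta for all edges, using two multiply-adds."""
    dx, dy = delta
    # First multiply-add: delta.x * A + E(base).
    partial = [dx * a + e for a, e in zip(a_coefs, base_values)]
    # Second multiply-add: delta.y * B + partial result.
    return [dy * b + t for b, t in zip(b_coefs, partial)]
```

Because the edge functions are linear, the two-step result matches the direct evaluation exactly as long as the arithmetic does not lose precision, which is the role of the fixed point FXMAD instruction.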
We repeat this process iteratively for the remaining sample
points, and every time the areal coordinates for the new
sample are stored in t4 and then tested (with the tie break
rule) using (g), (h) and (i) and then used for Z
interpolation with (j), (k) and (l). Since the last MUL (l) is
actually a scalar operation, it can be factored out so that we
perform just one vector multiplication every 4 samples. In
total we execute 7 instructions per remaining sample point
plus the base (11 instructions) and the final instructions
that multiply and output per-sample Z values (all the required
instructions per MSAA mode are summarized in Table 1). To
build the coverage mask, instead of the KIL and CMP&KIL
instructions we use their respective counterparts (KLS and
CMP&KLS) with an additional operand to identify the sample to
kill (mark as noncovered).
Table 1 Shader rasterization program length for the different
MSAA modes
MSAA   Shader instruction summary x:y                    Total
       (from instruction x to y in Fig. 2(b))            instrs
off    a:n                                               14
2x     a:k + 1x[2xFXMAD + g:k] + l + n + c:f + m         25
4x     a:k + 3x[2xFXMAD + g:k] + l + n + c:f + m         39
6x     a:k + 5x[2xFXMAD + g:k] + 2x[l + n] + c:f + m     55
8x     a:k + 7x[2xFXMAD + g:k] + 2x[l + n] + c:f + m     69
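The totals in Table 1 follow directly from the instruction groups named in its summary column, as the following sketch shows. The closed form for the number of [l + n] groups, one per group of up to four samples, is inferred from the table rows and from the remark about one vector multiplication every 4 samples; it is an assumption for sample counts not listed in the table. The constant and function names are hypothetical.

```python
BASE_INSTRS = 11          # a:k, evaluation and test of the base sample
PER_EXTRA_SAMPLE = 7      # 2 x FXMAD + g:k for each remaining sample
Z_OUTPUT_GROUP = 2        # l + n, once per group of up to 4 samples
ATTRIBUTE_COORDS = 4      # c:f, areal coordinates at the pixel center
FINAL_INSTR = 1           # m
NO_MSAA_LENGTH = 14       # a:n, the plain program


def program_length(samples):
    """Rasterization program length for a given MSAA sample count."""
    if samples == 1:
        return NO_MSAA_LENGTH
    z_groups = (samples + 3) // 4     # ceil(samples / 4)
    return (BASE_INSTRS
            + PER_EXTRA_SAMPLE * (samples - 1)
            + Z_OUTPUT_GROUP * z_groups
            + ATTRIBUTE_COORDS
            + FINAL_INSTR)
```

For instance, the 4x mode gives 11 + 3 × 7 + 2 + 4 + 1 = 39 instructions, matching the corresponding row of the table.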
Finally, the areal coordinates for attribute interpolation
are computed using 4 additional instructions ((c) to (f)). In
this work, we directly choose the pixel center for attributes,
however, this simple approach is known to suffer from tex-
ture sampling artifacts because computed texture coordi-
nates can lie outside the texture region. The centroid inter-
polation [14] solves this problem by using the middle point
between covered samples of the pixel. Due to the program-
mability of our solution, more sophisticated shader rasteri-
zation programs for MSAA could use the centroid interpo-
lation.
4 Implementation
4.1 The microtriangle pipeline
The presented pipeline is not intended to replace the classic
rasterizer consisting of the setup and traversal stages but
to provide the API user with a selectable specialized path
for high-throughput microtriangle rasterization. For the
implementation, we added the Triangle Bound unit to the
baseline ATTILA GPU architecture [5], between the Triangle
Assembly and Triangle Setup stages, as shown in Fig. 7.
This new unit is bypassed when the microtriangle pipeline
is disabled, in which case Triangle Setup performs the
perspective division and bounding box computation. As
explained in Sect. 3.2, when enabled, the Triangle Bound unit
takes over these tasks and executes the bounding box
optimization pass, and its Microtriangle Stamp Generator
subunit generates the interior fragment stamps with their
corresponding shader input vectors, which are sent to the
Shader Work Distributor unit (the main scheduler of shader
work in the ATTILA architecture).
Fig. 7 Microtriangle pipeline overview
Shaded microtriangle stamps wait for completion in a reorder
queue in the Shader Work Distributor unit before depth testing,
which preserves the original triangle order (a major requirement
of the graphics pipeline). Our initial implementation is based
on late Z test (shading before depth occlusion), because the
depth value of a microtriangle fragment is computed in the
shader rasterization program. Another limitation of our current
implementation is that rasterization is enforced on
unfragmentable 2 × 2 pixel microtriangle stamps. This is because
the level of detail (LOD) must be computed from the texture
coordinate derivatives to make mipmapping and anisotropic
filtering work for texture lookup instructions. Future work will
investigate alternative approximations for LOD computation, such
as those summarized in [9]. Hence, in the current implementation
the peak performance when processing 1-pixel or 1-stamp
triangles is the same. Table 2(a) summarizes this peak
microtriangle rate as a linear function of the number of shader
cores and the number of instructions in the rasterization
program. Since the baseline rasterizer in modern GPUs has a
fixed processing rate of 1 triangle per clock (Table 2(b)), this
rate fairly matches the maximum achievable speedup.
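The peak-rate estimate described above can be written down as a short C sketch. The function names are hypothetical. The constants come from the text: 16 per-fragment instructions per clock per shader core and 2 × 2 pixel stamps. The rule used for the minimum number of Triangle Bound pipes (one pipe per triangle per clock, rounded up) is an assumption; it happens to agree with the values tabulated in Table 2(a).

```c
#include <assert.h>

#define PIXINSTRS_PER_CORE_PER_CLOCK 16  /* 64 fragments over 4 cycles */
#define FRAGMENTS_PER_STAMP 4            /* 2 x 2 pixel stamp */

/* Fragments per clock that complete a rasterization program of `instrs`
 * instructions on `cores` shader cores. */
double shader_output_rate(int cores, int instrs) {
    assert(cores > 0 && instrs > 0);
    return (double)cores * PIXINSTRS_PER_CORE_PER_CLOCK / instrs;
}

/* Peak microtriangles per clock for triangles that need
 * `stamps_per_triangle` stamps (1 for 1-pixel and 1-stamp triangles). */
double peak_utriangle_rate(int cores, int instrs, int stamps_per_triangle) {
    assert(stamps_per_triangle > 0);
    double stamps_per_clock =
        shader_output_rate(cores, instrs) / FRAGMENTS_PER_STAMP;
    return stamps_per_clock / stamps_per_triangle;
}

/* Assumption: each Triangle Bound pipe delivers 1 triangle per clock, so
 * the peak 1-stamp rate is rounded up (integer ceiling, no libm needed). */
int min_triangle_bound_pipes(int cores, int instrs) {
    assert(cores > 0 && instrs > 0);
    int num = cores * PIXINSTRS_PER_CORE_PER_CLOCK;
    int den = instrs * FRAGMENTS_PER_STAMP;
    return (num + den - 1) / den;
}
```

For example, 4 shader cores running the 14-instruction program give 64/14 ≈ 4.57 fragments per clock, hence about 1.14 stamps (and 1-stamp microtriangles) per clock and 2 Triangle Bound pipes.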
Table 2 (a) Estimated microtriangle peak rate as a function of the
number of shader cores and the triangle size, together with the
minimum number of Triangle Bound (TB) pipes required. (b) ATI
reference architectures with their shader cores and rasterizer
triangle rate [1], compared with the estimated shader rasterizer
performance in (a)

(a)
Shader  Shader rate    Shader output  μtriangle   μtriangles/clock              Min TB
cores   (pixinstrs     rate           stamps      1 pixel   1 stamp   4 stamps  pipes
        per clock)     (14 instrs)    per clock
 1       16             1.14          0.29        0.29      0.29      0.07      1
 2       32             2.29          0.57        0.57      0.57      0.14      1
 3       48             3.43          0.86        0.86      0.86      0.21      1
 4       64             4.57          1.14        1.14      1.14      0.29      2
 8      128             9.14          2.29        2.29      2.29      0.57      3
10      160            11.43          2.86        2.86      2.86      0.71      3
16      256            18.29          4.57        4.57      4.57      1.14      5
20      320            22.86          5.71        5.71      5.71      1.43      6

(b)
ATI reference   Shader   Rasterizer rate (triangles/clock)
architecture    cores    Classic   Shader (1 pixel)
R520             1       2         0.29
R580             3       2         0.86
R600             4       1         1.14
RV770           10       1         2.86
RV870           20       1         5.71

As shown in Fig. 3, the rasterization program is inserted
before the original fragment shader program to determine whether
the fragment is actually covered by the microtriangle.
Additionally, the corresponding interpolation code (according
to the perspective requirements) is inserted in between to
linearly interpolate each of the input registers required by
the fragment shader. To this end, vertex attribute values are
passed to the rasterization program by extending the original
input vector to three times its size. Simple compiler
transformation steps have been implemented to replace input
references in the original shader program with temporary
registers storing the interpolated results. In this way,
fragments that are successfully rasterized inside the
microtriangle are pixel shaded within the same shader thread
(no inter-stage transition is required).
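As an illustration of the inserted interpolation code, the following C sketch linearly interpolates one 4-component attribute from its three per-vertex values using the areal coordinates of the fragment. The struct layout and names (`AttributeTriple`, `interpolate_attribute`) are hypothetical, and perspective correction is deliberately left out; the sketch only shows the linear case.

```c
#include <assert.h>
#include <stddef.h>

#define VEC 4

typedef struct { float v[VEC]; } Vec4;

/* Hypothetical layout of one attribute in the extended input vector:
 * the same attribute as seen at each of the three triangle vertices. */
typedef struct { Vec4 at_v0, at_v1, at_v2; } AttributeTriple;

/* Linear interpolation with areal coordinates (w0 + w1 + w2 == 1).
 * The result is what a temporary register would hold in place of the
 * original fragment shader input. */
Vec4 interpolate_attribute(const AttributeTriple *in,
                           float w0, float w1, float w2) {
    assert(in != NULL);
    Vec4 out;
    for (int e = 0; e < VEC; e++)
        out.v[e] = w0 * in->at_v0.v[e]
                 + w1 * in->at_v1.v[e]
                 + w2 * in->at_v2.v[e];
    return out;
}
```

Flat interpolation falls out as the special case with weights (1, 0, 0), which simply forwards the value of the first vertex.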
4.2 The fixed-point instructions
In the rasterization of triangle meshes, and especially for
very small triangles, it is important to stick to a precise set
of rules and operations in order to prevent cracks or
overlapping pixels. Fixed-function rasterizers use fixed-point
arithmetic to achieve the required precision. In our approach,
the triangle vertex positions are converted to a fixed-point
representation, snapping the vertices to a subpixel grid, and
the cross products that determine which pixels are inside the
triangle are then computed with the required fixed-point
subpixel precision to avoid rounding inconsistencies.
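The following C sketch shows why snapping plus integer cross products avoids rounding inconsistencies. The 4 subpixel bits and the function names are assumptions, not values taken from this paper. Because the edge function is evaluated exactly in integers, the two triangles sharing an edge always see opposite signs for the same sample, so the sample cannot be claimed by both or (given a tie-breaking rule for the zero case) by neither.

```c
#include <assert.h>
#include <stdint.h>

#define SUBPIXEL_BITS 4  /* assumption: 1/16-pixel grid */

/* Snaps a screen coordinate to the nearest subpixel grid point,
 * i.e. floor(coord * 16 + 0.5), written without libm. */
int32_t snap_to_subpixel(float coord) {
    assert(coord > -32768.0f && coord < 32768.0f);
    float r = coord * (float)(1 << SUBPIXEL_BITS) + 0.5f;
    int32_t i = (int32_t)r;      /* truncates toward zero */
    if ((float)i > r)            /* fix up negative values to get floor */
        i -= 1;
    return i;
}

/* Exact 2D cross product (b - a) x (p - a) on snapped coordinates.
 * The sign tells on which side of edge a->b the point p lies. */
int64_t edge_function(int32_t ax, int32_t ay, int32_t bx, int32_t by,
                      int32_t px, int32_t py) {
    return (int64_t)(bx - ax) * (py - ay) - (int64_t)(by - ay) * (px - ax);
}
```

Evaluating the same edge in the opposite direction, as the neighboring triangle does, yields exactly the negated value, which a floating-point evaluation does not guarantee.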
Using the original 32-bit floating-point arithmetic
instructions available in the ATTILA shader ISA would not meet
these precision requirements. Hence, we extended the
instruction set with two new fixed-point instructions. As shown
in Fig. 2(a), FXMUL (e) and FXMAD (f) compute the cross
products using a fixed-point register for the intermediate
result. FXMUL converts the two 4-element input vectors from
32-bit floating point to fixed point and then computes a
conservative element-by-element multiplication using
fixed-point arithmetic. The result is stored in a fixed-point
register fx_i:
    FixedPoint[4] FXMUL(float32 a[4], float32 b[4]) {
        FixedPoint res[4];
        for (int e = 0; e < 4; e++)
            res[e] = float32ToFixedPoint(a[e]) *
                     float32ToFixedPoint(b[e]);
        return res;
    }
FXMAD converts the two 4-element multiplication operands from
32-bit floating point to fixed point, computes a conservative
vector multiplication, and adds the fixed-point vector provided
by the third operand. The result is converted back to 32-bit
floating point:
    float32[4] FXMAD(float32 a[4],
                     float32 b[4],
                     FixedPoint c[4]) {
        float32 res[4];
        for (int e = 0; e < 4; e++)
            res[e] = FixedPointToFloat32(
                float32ToFixedPoint(a[e]) *
                float32ToFixedPoint(b[e]) + c[e]);
        return res;
    }
As shown, the new fixed-point instructions convert their
32-bit floating-point inputs to fixed point prior to the
operations. The logic required to implement a 32-bit
floating-point fused multiply-add instruction can be reused,
with a few control modifications, to perform floating-point
to/from fixed-point conversions and fixed-point arithmetic [12].
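A software emulation of the two instructions might look like the C sketch below. The fixed-point format is an assumption (8 fractional bits per input, so the intermediate register holds products with 16 fractional bits), and the "conservative" multiplication mentioned in the text is modeled here simply as an exact integer product; the actual hardware format and rounding behavior are not specified in this section.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8  /* assumption: fractional bits of each converted input */

/* Intermediate register: a product, so it carries 2 * FRAC_BITS
 * fractional bits. */
typedef int64_t FixedPoint;

/* Round-to-nearest conversion, floor(x * 2^FRAC_BITS + 0.5), without libm. */
static int32_t float32_to_fixed(float x) {
    assert(x > -32768.0f && x < 32768.0f);
    float r = x * (float)(1 << FRAC_BITS) + 0.5f;
    int32_t i = (int32_t)r;
    if ((float)i > r)
        i -= 1;
    return i;
}

static float fixed_to_float32(FixedPoint p) {
    return (float)p / (float)(1 << (2 * FRAC_BITS));
}

/* res[e] = fixed(a[e]) * fixed(b[e]) */
void fxmul(const float a[4], const float b[4], FixedPoint res[4]) {
    for (int e = 0; e < 4; e++)
        res[e] = (int64_t)float32_to_fixed(a[e]) * float32_to_fixed(b[e]);
}

/* res[e] = float(fixed(a[e]) * fixed(b[e]) + c[e]) */
void fxmad(const float a[4], const float b[4], const FixedPoint c[4],
           float res[4]) {
    for (int e = 0; e < 4; e++)
        res[e] = fixed_to_float32(
            (int64_t)float32_to_fixed(a[e]) * float32_to_fixed(b[e]) + c[e]);
}
```

Under these assumptions, a 2D cross product `ax*by - ay*bx` maps onto the pair as `fxmul(ax, by)` followed by `fxmad(-ay, bx, fx)`, with the subtraction folded into a negated operand.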
5 Results
Because our current implementation lacks a tessellation unit to
generate real 3D game microtriangle workloads, we decided to
build synthetic OpenGL traces to test our implementation. The
traces, described in Table 3, all draw a strip mesh with a
total of 320 K regular-size triangles that covers the full
screen area.
We used four different screen resolutions (200 × 200,
400 × 400, 800 × 800 and 1600 × 1600) to project the same
320 K regular triangle mesh, so that the resolutions
respectively produce a stream of microtriangles of an eighth of
a pixel, half a pixel, 2 pixels and 8 pixels in size. For each
resolution, the mesh was tested at four different rotation
angles (0, 15, 30, and 45 degrees); hence, up to 16 versions of
each trace in Table 3 have been run in the experiments. Since
the trace workloads are mostly triangle processing bound, using
the same large number of primitives across traces avoids the
unfair speedup comparisons that different triangle counts would
cause: the fraction of time spent rasterizing microtriangles is
similar and long enough to minimize the effect of the initial
phase (only transforming vertices) and the final phase of the
trace (shading the remaining fragments and dumping the frame
buffer).
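The triangle sizes quoted above follow from dividing the screen area by the number of triangles in the mesh. A minimal C sketch of that arithmetic, reading "320 K" as 320,000 triangles (an assumption about the notation) and using a hypothetical function name:

```c
#include <assert.h>

#define MESH_TRIANGLES 320000  /* "320 K" read as 320,000 (assumption) */

/* Average screen area, in pixels, of one triangle of a mesh that
 * exactly covers a width x height screen. */
double average_triangle_area_pixels(int width, int height) {
    assert(width > 0 && height > 0);
    return (double)width * height / MESH_TRIANGLES;
}
```

For the four resolutions this gives 40,000/320,000 = 1/8, 160,000/320,000 = 1/2, 640,000/320,000 = 2 and 2,560,000/320,000 = 8 pixels per triangle.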
The “flat” trace that draws triangles of 1/2 a pixel is used
in Fig. 8 to show the unit utilization in the microtriangle
pipeline. In the classic GPU pipeline, the fixed-throughput
rasterization unit becomes the bottleneck; in the microtriangle
pipeline, the rasterization job is transferred to the shader
units, which then bound the execution.
Table 3 Description of the microtriangle synthetic traces

Trace     VP      Interp.    Required  Original fragment program    Microtriangle program
          instrs  attribs    interp.   ALU-to-texture  Bilinears    Total   ALU-to-bilinear
                                       ratio           per texture  instrs  ratio
flat       5      color      flat      1:0             –            14      14:0
lights    60      color      smooth    1:0             –            22      22:0
texbil     5      texcoord   smooth    1:1             1            24      24:1
texaniso   5      texcoord   smooth    1:1             16           24      3:2
texfog    11      texcoord,  smooth    6:1             1            37      37:1
                  fogcoord
msaa4x     5      color      flat      1:0             –            39      39:0
Under these conditions, the rasterization throughput is
directly related to the total shader performance, and the
speedup over the classic rasterizer scales well with the number
of shader cores. The average end-to-end pipeline speedup across
the 4 triangle orientations, for the different traces and their
respective sizes, is shown in Fig. 9. For the experiments we
used a unified shading GPU architecture [5] in which each
shader core executes the same instruction on groups of 64
fragments over 4 cycles or, in effect, 16 fragment instructions
per cycle. As a reference, Table 2(a) matches the number of
shader cores with the total number of per-fragment instructions
executed per cycle (pixinstrs/clock). We configured the total
memory bandwidth to 112 GB/s, typical of today’s high-end
graphics cards. Thanks to the large pool of shader units in the
unified architecture, which execute a rather inexpensive vertex
transformation, vertex attribute reads and vertex shading are
not limiting factors in the modeled GPU architecture.
Therefore, the model provides a sufficient input microtriangle
rate, similar to that of a GPU with a hardware tessellator
unit.
As Fig. 9 shows, with 8 or more shader cores and for the 1/8
and 1/2 trace sizes, the microtriangle pipeline obtains higher
performance than the classic rasterization pipeline, reaching
speedups between 1.5× and 4.3× for the smallest triangle size
with 16 shader cores. The higher performance of the 1/8 traces
compared with the 1/2 traces arises because a high percentage
of the microtriangles covering just an eighth of a pixel are
culled by the BB optimization pass presented in Sect. 3.2. The
cull rate for 1/8 triangles due to the optimization was 55% on
average across all the traces, which yields a 1.25× average
speedup with respect to running with the optimization disabled.

Fig. 8 Compared unit utilization of (a) the classic and (b) the
microtriangle pipeline (with the classic rasterizer switched off)

Fig. 9 Speed-up of our microtriangle pipeline compared to the
traditional rasterizer (1 triangle/clock). Each point shows the
average across 4 different mesh orientations (0, 15, 30, and 45
degrees), for each trace and triangle size with respect to the
shader cores
The “lights” trace shows the importance of inexpensive
per-vertex computation for the microtriangle pipeline to scale
properly. This trace uses 4 point lights (see Table 3) for a
total of 60 vertex instructions, and its performance drops
significantly compared with the “flat” trace, which uses only 5
vertex instructions. In this case, a higher number of shader
cores is necessary to compensate for the fewer shader threads
available for rasterization.
Among the traces using texture mapping, the “texbil” trace
bilinearly samples a 512 × 512 texture and outputs the sampled
color. A mipmapped version of the same base texture is sampled
in “texaniso” using 16x anisotropic filtering (up to 32
bilinears are computed in the anisotropy direction). The high
anisotropy ratio is obtained by drawing very thin triangles
along the view direction. Finally, the “texfog” trace adds a
per-pixel fog contribution to the sampled texture (1 bilinear)
by performing additional computation and interpolating an extra
fog coordinate attribute (the instruction overhead in the
shader program is 50% compared to “texbil”).
It is worth highlighting that the “texaniso” performance drops
only slightly, even though the individual fragment shading
execution time is much longer due to the expensive texture
access. The reason is that the multithreaded architecture of
the shader cores allows the rasterization instructions to
execute for free in the ALUs while waiting for the filtered
samples from the texture units (current GPUs can execute on the
order of 3 or 4 ALU instructions per bilinear access; the
actual ratios for our traces are shown in the last column of
Table 3). Finally, the “msaa4x” trace shows moderate scaling
due to the additional instructions needed to test the three
additional sample positions.
6 Interaction with Z optimizations
EarlyZ [18] is a widely adopted technique in current GPUs
that computes fragment visibility prior to the fragment shading
stage to discard hidden surfaces. Because our microtriangle
pipeline computes the Z values and executes the fragment
program within the same shader pass, our current implementation
does not work with EarlyZ. However, we argue that the
over-shading caused by using LateZ is not a major impediment to
the success of our solution in its current form: as seen in
Fig. 8(a), the major performance limiter when rendering
microtriangles is the rasterization rate rather than the
shading rate. Indeed, microtriangles cause an under-utilization
of the shading units. Hence, any step toward increasing the
rasterization rate and balancing the global pipeline
utilization will improve performance, despite the extra shading
cost.
For large fragment shaders, however, the cost of disabling
EarlyZ could become a performance problem. Hence, to enable
EarlyZ in our pipeline, we are currently exploring two avenues:
(1) splitting the rasterization and fragment processing stages
into two separate shading passes and performing the EarlyZ test
after rasterization and prior to fragment shading, and (2)
executing a special “ztest” instruction in the same shader,
which would similarly perform the test once the Z value has
been computed. In this latter approach, the new instruction
would dispatch a request out of the shader to the Z test units,
which would execute the Z operation and return its result. As
with texture lookups, the latency of this operation can be
completely or partially hidden in a multithreaded shader unit.
Since the shader stages execute fragments out of the original
triangle order, both solutions require an intermediate
reordering of the fragment Z requests before the test.
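The reordering step can be pictured as a small in-order release queue, sketched below in C. This is a generic reorder buffer, not the ATTILA implementation: the capacity, the struct and the function names are all hypothetical. Entries are allocated in triangle order, may complete in any order, and are only released (for example, to the Z test) once every older entry has been released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SLOTS 8  /* hypothetical capacity */

typedef struct {
    bool done[QUEUE_SLOTS];
    uint32_t head;  /* sequence number of the oldest unreleased entry */
    uint32_t tail;  /* next sequence number to allocate */
} ReorderQueue;

void rq_init(ReorderQueue *q) {
    q->head = q->tail = 0;
    for (int i = 0; i < QUEUE_SLOTS; i++)
        q->done[i] = false;
}

/* Reserves a slot in original triangle order; returns its sequence number. */
uint32_t rq_allocate(ReorderQueue *q) {
    assert(q->tail - q->head < QUEUE_SLOTS);  /* queue must not be full */
    q->done[q->tail % QUEUE_SLOTS] = false;
    return q->tail++;
}

/* Marks an in-flight entry as finished; completion may be out of order. */
void rq_complete(ReorderQueue *q, uint32_t seq) {
    assert(seq - q->head < q->tail - q->head);  /* seq in [head, tail) */
    q->done[seq % QUEUE_SLOTS] = true;
}

/* Releases the oldest entries while they are complete; returns how many. */
int rq_release(ReorderQueue *q) {
    int released = 0;
    while (q->head != q->tail && q->done[q->head % QUEUE_SLOTS]) {
        q->head++;
        released++;
    }
    return released;
}
```

An entry that finishes early simply waits: completing entry 2 before entries 0 and 1 releases nothing until the older entries are also done.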
Hierarchical Z [11] is a technique that tests the visibility
of groups of rasterized pixels early against a coarse version
of the Z-buffer, saving much of the bandwidth required by the
more accurate Z-test. It can be enabled with our microtriangle
pipeline, since, given the small size of a microtriangle, its
closest vertex Z value is a good reference depth against which
to test all of its fragments.
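A minimal C sketch of such a test follows. It assumes depth values in [0, 1] where smaller means closer and a "less than" depth function; the function names and the idea of storing one farthest depth per coarse tile are illustrative assumptions rather than details taken from this paper.

```c
#include <assert.h>
#include <stdbool.h>

/* Closest (smallest) depth among the three vertices of a microtriangle. */
float closest_vertex_depth(float z0, float z1, float z2) {
    float z = z0;
    if (z1 < z) z = z1;
    if (z2 < z) z = z2;
    return z;
}

/* Conservative coarse test: tile_max_depth is the farthest depth currently
 * stored in the coarse Z-buffer tile. If even the closest point of the
 * microtriangle is not in front of it, every fragment is hidden. */
bool hiz_may_be_visible(float tri_min_depth, float tile_max_depth) {
    assert(tri_min_depth >= 0.0f && tri_min_depth <= 1.0f);
    assert(tile_max_depth >= 0.0f && tile_max_depth <= 1.0f);
    return tri_min_depth < tile_max_depth;
}
```

A microtriangle that fails this test can be dropped as a whole; one that passes still needs the accurate per-fragment Z test.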
7 Conclusion
This work shows that it is possible to use the GPU shader
units to efficiently rasterize microtriangles by extending the
shader model with a minor set of changes. Two new arithmetic
instructions working with fixed-point precision (FXMUL and
FXMAD), a fused CMP&KIL instruction (which barely requires
extra hardware), and an extended shader input vector are used
to rasterize in parallel, across multiple shader threads,
individual fragments within the microtriangle bounding box and
fragments of different triangles. For streams of small
microtriangles (≤2 pixels), the presented rasterization program
is more efficient in total number of instructions per fragment
than the simple replication of the setup + traversal engines of
current GPU pipelines, and it scales with the number of shader
cores.
The implementation also requires a new fixed-function
Triangle Bound stage to compute the subpixel-accurate bounding
box. This new stage, which can be easily parallelized for
high-throughput triangle processing, implements a novel
bounding box optimization pass that reduces, in a very simple
way, the number of failing (not finally rendered) microtriangle
fragments sent to the shader units. This optimization can
discard entire microtriangles in some cases, and it is equally
applicable to the bounding box computation stages of current
graphics pipelines for enhanced microtriangle processing.
The results have shown the performance scalability of both
aliased and multisampled synthetic traces under different
shader load conditions. The presented implementation achieves
higher performance than the traditional rasterizer for
microtriangle meshes of 1/8 and 1/2 pixel sizes with 8 or more
shader cores, and reaches speedups of up to 4.3× with 16 cores.
The novel bounding box optimization pass has proven key to
achieving this performance by increasing the 2 × 2 pixel stamp
test efficiency from 20% to 45%.
Our approach is not intended to replace the traditional
setup + traversal rasterizer but to provide an optimized
pipeline for streams of microtriangles, to be selected at the
API user’s convenience. We are currently working on a new
design to run actual 3D game workloads combining both large
triangles and highly tessellated surfaces in the same frame,
which could switch between the traditional pipeline and the
alternative pipeline presented in this work. Such an
implementation would maximize the use of the GPU shader cores
to rasterize microtriangle streams in situations where the
traditional rasterizer becomes the rendering bottleneck.
Finally, although our rasterization approach is currently
software-based, fixed-function hardware implementations based
on either the presented or improved algorithms will be studied
in future research.
Acknowledgements The authors thank the anonymous reviewers
for their helpful comments. This work is supported by the
Ministry of Education and Science of Spain and the European
Union (FEDER funds) under Contract TIN2007-60625 and by Intel.
For additional information about the authors, visit the Attila
Research Group web site https://attila.ac.upc.edu/.
References
1. Beyond3D Graphic Hardware and Technical Forums.
   http://www.beyond3d.com/resources (2010)
2. Abrash, M.: Rasterization on Larrabee.
   http://software.intel.com/en-us/articles/rasterization-on-larrabee/ (2009)
3. Akenine-Möller, T., Haines, E., Hoffman, N.: Real-Time Rendering,
   3rd edn. Peters, Natick (2008)
4. ARB: ARB fragment program specification v 1.0.
   http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.tx
   (2002)
5. del Barrio, V., Gonzalez, C., Roca, J., Fernandez, A.: ATTILA: a
   cycle-level execution-driven simulator for modern GPU architectures.
   In: IEEE International Symposium on Performance Analysis of Systems
   and Software, pp. 231–241 (2006)
6. Cook, H.L., Carpenter, L., Catmull, E.: The Reyes Image Rendering
   Architecture, pp. 28–35 (1988)
7. Dudash, B.: Tesselation of displaced subdivision surfaces in DX11.
   GPU-BBQ 2008. http://www.nvidia.in/object/gpubbq-2008-subdiv.html
   (2008)
8. Eldridge, M., Igehy, H., Hanrahan, P.: Pomegranate: a fully scalable
   graphics architecture. In: SIGGRAPH ’00: Proceedings of the 27th
   Annual Conference on Computer Graphics and Interactive Techniques,
   pp. 443–454. ACM/Addison-Wesley, New York (2000)
9. Ewins, J.P., Waller, M.D., White, M., Lister, P.F.: MIP-Map level
   selection for texture mapping. IEEE Trans. Vis. Comput. Graph. 4(4),
   317–329 (1998)
10. Fatahalian, K., Luong, E., Boulos, S., Akeley, K., Mark, W.R.,
    Hanrahan, P.: Data-parallel rasterization of micropolygons with
    defocus and motion blur. In: HPG ’09: Proceedings of the Conference
    on High Performance Graphics, pp. 59–68. ACM, New York (2009)
11. Greene, N., Kass, M., Miller, G.: Hierarchical Z-buffer visibility.
    In: SIGGRAPH ’93: Proceedings of the 20th Annual Conference on
    Computer Graphics and Interactive Techniques, pp. 231–238. ACM,
    New York (1993)
12. Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative
    Approach. Morgan Kaufmann, San Mateo (2003)
13. Kun, Z., Qiming, H.: RenderAnts: interactive REYES rendering on
    GPUs. ACM Trans. Graph. (2009)
14. Licea-Kane, B.: GLSL: Center or centroid? (or when shaders attack!).
    http://www.opengl.org/pipeline/article/vol003_6/ (2007)
15. Low, K.L.: Perspective-Correct Interpolation (2002)
16. McCool, M.D., Wales, C., Moule, K.: Incremental and hierarchical
    Hilbert order edge equation polygon rasterization. In: HWWS ’01:
    Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics
    Hardware, pp. 65–72. ACM, New York (2001)
17. McCormack, J., McNamara, R.: Tiled polygon traversal using
    half-plane edge functions. In: HWWS ’00: Proceedings of the ACM
    SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pp. 15–21.
    ACM, New York (2000)
18. Mitchell, J., Sander, P.: Applications of explicit Early-Z culling.
    In: Real-Time Shading Course, SIGGRAPH 2004 (2004)
19. Olano, M., Greer, T.: Triangle scan conversion using 2D homogeneous
    coordinates. In: HWWS ’97: Proceedings of the ACM
    SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pp. 89–95.
    ACM, New York (1997)
20. Pineda, J.: A parallel algorithm for polygon rasterization. In:
    SIGGRAPH ’88: Proceedings of the 15th Annual Conference on Computer
    Graphics and Interactive Techniques, pp. 17–20. ACM, New York (1988)
21. Zhang, H., Hoff, K.E. III: Fast backface culling using normal masks.
    In: SI3D ’97: Proceedings of the 1997 Symposium on Interactive 3D
    Graphics, pp. 103–106. ACM, New York (1997)
Jordi Roca is a Ph.D. candidate at the Computer Architecture
Department of the Polytechnic University of Catalunya (UPC). His
main research interests are in new GPU architecture designs for
specialized rendering pipelines. He has previously worked on the
ATTILA GPU driver and fixed function emulation, and the workload
characterization of 3D games in terms of GPU performance.
Victor Moya has been a Ph.D. student in the Computer Architecture
Department of the Polytechnic University of Catalonia since 2001.
He has been working on the ATTILA cycle-accurate GPU simulator
for years and his main interest is in GPU microarchitecture and
real-time 3D graphic rendering from a hardware perspective.
Carlos Gonzalez received his B.Sc. degree in 2004 in Computer
Science from the Barcelona School of Informatics, Spain. Now he
is a Ph.D. candidate at the Department of Computer Architecture
at UPC (Barcelona, Spain). His current research interests are
focused on GPU architectures, chiefly the GPU memory hierarchy,
memory access patterns and GPU memory scheduling techniques.
Vicente Escandell received a B.Sc. in Computer Engineering from
the Polytechnic University of Catalunya (UPC) in 2010. Now he is
an M.Sc. candidate at the Computer Architecture Department (UPC).
His current research interests are GPU RTL hardware
implementation and power aware architectures.
Albert Murciego is a B.Sc. candidate in Computer Science and
Engineering at the Polytechnic University of Catalunya (UPC). His
main research interests include computer graphics, 3D
visualization and multi GPUs.
Agustin Fernandez is an assistant professor in the Computer
Architecture Department of the Polytechnic University of
Catalonia (UPC), in Barcelona, Spain. He received a computer
science degree in 1988 and his Ph.D. in Computer Science in 1992,
both from the UPC. His research interests are GPUs,
microarchitecture and high performance compilers.
Roger Espasa is a Principal Engineer at Intel Corporation in
Barcelona, Spain. He is also a part-time professor at the
Computer Architecture Department of the Polytechnic University of
Catalunya (UPC). Roger was the architect for the Alpha Tarantula
vector processor while working for Compaq. He joined Intel in
2003, has worked on the definition of vector instructions for
the Intel ISA, and is now the architect for the Larrabee texture
sampler and the Larrabee instruction set architecture. Roger
received his Ph.D. from the Polytechnic University of Catalunya
(UPC).
