Optimizing The Performance of Streaming Numerical Kernels On The IBM Blue Gene/P PowerPC 450 by Malas, Tareq Majed Yasin
Optimizing The Performance of Streaming Numerical Kernels
On The IBM Blue Gene/P PowerPC 450
Thesis by
Tareq Majed Yasin Malas
In Partial Fulfillment of the Requirements
For the Degree of
Master of Science
King Abdullah University of Science and Technology, Thuwal,
Kingdom of Saudi Arabia
July, 2011
2The thesis of Tareq Malas is approved by the examination committee.
Committee Chair(Thesis Supervisor): Prof. David Keyes
Committee Member: Prof. Mikhail Moshkov
Committee Member: Dr. Aron Ahmadia
King Abdullah University of Science and Technology
2011
3Copyright ©2011
Tareq Majed Yasin Malas
All Rights Reserved
4ABSTRACT
Optimizing The Performance of Streaming Numerical Kernels
On The IBM Blue Gene/P PowerPC 450
Tareq Majed Yasin Malas
Several emerging petascale architectures use energy-efficient processors with vec-
torized computational units and in-order thread processing. On these architectures
the sustained performance of streaming numerical kernels, ubiquitous in the solution
of partial differential equations, represents a formidable challenge despite the regular-
ity of memory access. Sophisticated optimization techniques beyond the capabilities
of modern compilers are required to fully utilize the Central Processing Unit (CPU).
The aim of the work presented here is to improve the performance of streaming
numerical kernels on high performance architectures by developing efficient algorithms
to utilize the vectorized floating point units. The importance of the development time
demands the creation of tools to enable simple yet direct development in assembly to
utilize the power-efficient cores featuring in-order execution and multiple-issue units.
We implement several stencil kernels for a variety of cached memory scenarios us-
ing our Python instruction simulation and generation tool. Our technique simplifies
the development of efficient assembly code for the IBM Blue Gene/P supercomputer’s
PowerPC 450. This enables us to perform high-level design, construction, verifica-
5tion, and simulation on a subset of the CPU’s instruction set. Our framework has
the capability to implement streaming numerical kernels on current and future high
performance architectures. Finally, we present several automatically generated im-
plementations, including a 27-point stencil achieving a 1.7x speedup over the best
previously published results.
6ACKNOWLEDGMENTS
I owe my deepest gratitude to my mentor, Aron Ahmadia, for his encouragement,
guidance, support, and contribution in our research work since the beginning. He
enabled me to develop better understanding of Scientific Computing research. With
his amazing leadership skills, he formed our research group and ensured flawless and
synergistic teamwork.
Many thanks to John Gunnels, the manager of High Performance Computing at
IBM T.J. Watson Research Center. He has made available his support in a number of
ways. He provided guidance and support to improve the performance of our research
code through our weekly meetings. His long experience in High Performance Com-
puting was essential in achieving our research results and understanding the resulting
performance. Moreover, I am grateful for his help in reviewing the draft of this thesis.
I am in debt to Jed Brown from ETH for his contribution, utilizing his professional
coding experience, in our research group. His proof-of-concept PowerPC 450 simu-
lator provided great assistance to our research. Also, his bright ideas in the kernel’s
algorithm were essential in achieving our successful results.
I am heartily thankful to Professor David Keyes. He provided all the support
required to make our research group possible. Even with his commitments as the
founding dean of the MCSE Division at KAUST, he generously granted me some
of his time through our various meetings. I am honored to be one of his graduate
students.
7I am most indebted to my parents for all their time and effort raising me. Their
sustained encouragement gave me the power to pursue knowledge through higher
education. Also, I am grateful for their patience during my busy periods when I
could not visit them.
I am indebted to my brothers and sister for helping me in my school courses,
an essential factor that enabled me to carry out my undergraduate and graduate
education. Moreover, I am thankful for their patience, support, and encouragement
during my education in the undergraduate and Masters degree.
Finally, I am grateful to Andy Ray Terrel for his helpful commentary on an early
draft of this thesis. I am also endebted to Andrew Winfer for his support in conduct-
ing our numerical experiments on the Shaheen Blue Gene/P system at the KAUST
Supercomputing Laboratory, and to Vernon Austel for his assistance in running ex-
periments on the Blue Gene/P system at IBM’s Watson Research Center.
8TABLE OF CONTENTS
List of Abbreviations 10
List of Illustrations 11
List of Tables 12
I Introduction 14
I.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
I.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
II Related Work 18
III Implementation Considerations 21
III.1 Stencil Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
III.2 PowerPC 450 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
III.3 Instruction Scheduling Optimization . . . . . . . . . . . . . . . . . . 26
IV Implementation 32
IV.1 STREAM benchmark on Shaheen . . . . . . . . . . . . . . . . . . . . 32
IV.2 C and Fortran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
IV.3 Synthetic Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
IV.3.1 Kernels Design . . . . . . . . . . . . . . . . . . . . . . . . . . 35
IV.3.2 Unroll-and-Jam . . . . . . . . . . . . . . . . . . . . . . . . . . 37
9IV.3.3 PowerPC 450 Simulator . . . . . . . . . . . . . . . . . . . . . 39
V Performance 42
V.1 3-Point Stencil Computations . . . . . . . . . . . . . . . . . . . . . . 43
V.2 7-Point Stencil Computations . . . . . . . . . . . . . . . . . . . . . . 45
V.3 27-Point Stencil Computations . . . . . . . . . . . . . . . . . . . . . . 47
V.4 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
VI Concluding Remarks 52
VI.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
VI.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Appendices 55
A Code generation with SimPPC 56
A.1 Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.2 Code template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
A.3 Code generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.4 Generated code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
References 69
10
LIST OF ABBREVIATIONS
API Application Programming Interface
ArBB Array Building Blocks
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
DAG Directed Acyclic Graph
FIFO First In First Out
FMA Floating point Multiply-Add
FPR Floating Point Registers
FPU Floating Point Unit
GPR General Purpose Registers
GPU Graphics Processing Unit
ILP Integer Linear Programming
LRU Least Recently Used
LSU Load Store Unit
11
MG Multi Grid
MPI Message Passing Interface
NAS NASA Advanced Supercomputing
NUMA Non-Uniform Memory Access
OpenMP Open Multi-Processing
SIMD Single Instruction Multiple Data
SIMT Single Instruction Multiple Thread
SMPs Symmetric Processors
WENO Weighted Essentially Non-Oscillatory
12
LIST OF ILLUSTRATIONS
II.1 Unroll-and-jam example for a 3-dimensional array copy operation . . 20
(a) Regular loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
(b) Unrolled-and-jammed loop . . . . . . . . . . . . . . . . . . . . . 20
III.1 7-point stencil operator . . . . . . . . . . . . . . . . . . . . . . . . . . 22
III.2 27-point stencil operator . . . . . . . . . . . . . . . . . . . . . . . . . 23
III.3 Example of DAG representation of the instructions . . . . . . . . . . 27
III.4 A 0/1 array of size N ×M representing the scheduling variables . . . 28
IV.1 SIMD stencil computations on 1D streams . . . . . . . . . . . . . . . 36
IV.2 The components of the code generation framework in this work . . . 41
V.1 3-point stencil performance with load-copy kernel . . . . . . . . . . . 44
V.2 7-point stencil performance comparison between mutate-mutate and
load-copy kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
V.3 7-point stencil performance comparison . . . . . . . . . . . . . . . . . 47
V.4 27-point stencil performance with mutate-mutate kernel . . . . . . . . 48
V.5 27-point stencil performance comparison . . . . . . . . . . . . . . . . 49
13
LIST OF TABLES
IV.1 STREAM performance (MB/s) on PowerPC 450 using 1, 2, and 4
threads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
IV.2 STREAM performance (MB/s) on PowerPC 450 running 2 and 4 MPI
processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
IV.3 Resource usage per stencil of mm and lc . . . . . . . . . . . . . . . . 37
IV.4 Computational requirements . . . . . . . . . . . . . . . . . . . . . . . 40
V.1 Predictions vs. observations for in-L1 and streaming performance . . 51
14
CHAPTER I
Introduction
I.1 Motivation
As computational science soars past the petascale to exascale, a large number of
applications continue to achieve disappointingly small fractions of the sustained per-
formance capability of new machines. Although the traditional focus in scientific
computing is on massively parallel or even multicore scalability, in many cases this
shortcoming in performance stems from issues at the single processor/thread level.
The ‘many-core’ revolution brings simpler, slower, more power-efficient cores with
vectorized floating point units in large numbers to a single micrprocessor. At each
stage of such a design, tradeoffs are made that complicate performance optimization
for the compiler and programmer in exchange for improved power and die efficiency
from the hardware. For example, out-of-order execution logic allows a microprocessor
to reorder instruction execution on the fly to avoid pipeline and data hazards. When
out-of-order logic is removed to save silicon and power, the responsibility for avoiding
these hazards is returned to the compiler and programmer. In the same vein, a wide
vector processing unit can significantly augment the floating point performance ca-
pabilities of a processing core at the expense of the single-element mappings between
15
input and output in performant code. Such vector units also incur greater bandwidth
demands for a given level of performance, as measured by percentage of theoreti-
cal peak. Graphics processing units provide a set of compute semantics similar to
a traditional Single Instruction Multiple Data (SIMD) vector processing unit with
Single Instruction Multiple Thread (SIMT), but still execute in-order and require
vectorized instruction interlacing to achieve optimal performance. The trend to im-
prove efficiency of performance with wider vector units and in-order execution units
is observable in the architectures of the IBM Blue Gene/P PowerPC 450 [1], the Cell
Broadband Engine Architecture [2], Intel’s Larrabee architecture [3], and NVIDIA’s
Tesla Graphics Processing Unit (GPU) [4].
The characteristics of Blue Gene/P that motivate the work in this thesis will per-
sist in the processor cores of exascale systems. The fundamental challenge of the
exascale relative to petascale is electrical power [5]. An optimistic benchmark (goal)
for a petascale system is the continuous consumption of somewhat over a MegaWatt
of electrical power, which represents the average continuous power consumption of
roughly one thousand people in an OECD country (1.4 kW per person). A na¨ıve scal-
ing from peta to exa of the hardware resources known today implies that operation of
a single exascale system would displace one million people from the power grid. Power
reductions relative to delivered flop/s of factors of one to two orders of magnitude
are expected en route to the exascale, which means more threads instead of faster-
executing threads, power growing roughly as the cube of the clock frequency. It also
means much less memory and memory bandwidth per thread because the movement
of data over copper interconnects consumes much more power than the operations
on that data in registers. Mathematical formulations of problems and algorithms to
implement them will be rewritten to increase arithmetic intensity. Implementations
in hardware will have to be made without some of today’s popular power-intensive
performance optimizations.
16
In general terms, we may expect less memory per thread, less memory bandwidth
per thread, and more threads per fixed data set size, creating an emphasis on strong
scaling within a shared-memory unit. We expect larger grain sizes of SIMDization
and high penalization of reads and writes from main memory.
I.2 Background
We define streaming numerical kernels as small, cycle-intensive regions of a program
where, for a given n bytes of data accessed in a sequential fashion, O(n) computational
operations are required. Streaming numerical kernels are generally considered to be
memory-bandwidth bound on most common architectures due to their low arith-
metic intensity. The actual performance picture is substantially more complicated
on high performance computing microprocessors, with constraints on computational
performance coming from software limitations in the expressiveness of standard C
and Fortran when targeting SIMD processors and a host of hardware bottlenecks and
constraints, from the number of available floating point registers, to the available in-
structions for streaming memory into SIMD registers, to the latency and throughput
of buses between the multiple levels of the memory hierarchy.
In this thesis, we focus upon stencil operators, a subset of streaming kernels that
define computations performed over a local neighborhood of points in a spatial multi-
dimensional grid. They are commonly found in partial differential equation solver
codes as finite-difference discretizations of continuous differential operators. Perhaps
the most well-known of these is the 7-point stencil, which usually arises as a finite
difference discretization of the Laplace kernel on structured grids. Although modern
numerical methods and discretization schemes have diminished this operator’s rele-
vance for many problems as a full-fledged numerical solver, it is still a cycle-intensive
subcomponent in several important scientific applications such as Krylov iterations of
17
Poisson terms in pressure corrections, gravitation, electrostatics, etc. [6] and adaptive
mesh refinement methods [7]. Other important stencils include the 27-point sten-
cil and the 3-point stencil. The 27-point stencil arises when cross-terms are needed
such as in the NASA Advanced Supercomputing (NAS) parallel Multi Grid (MG)
benchmark, which solves a Poisson problem using a V-cycle multigrid method with
the stencil operator [8]. The 3-point stencil is a one-dimensional stencil which serves
as a building block for the higher-order stencils and is useful on its own in tensor
product formulations such as the solution of Riemann kernels in hyperbolic systems
of equations.
We concentrate on the Blue Gene/P architecture [1] for a number of reasons.
The forward-looking design of the Blue Gene series, with its power-saving and ultra-
scaling properties exemplifies some of the characteristics that we feel will be common
in exascale systems. Blue Gene/P has SIMD registers, a slow clock rate (850 MHz), an
in-order, narrow superscalar execution path, and is constructed to be highly reliable
and power-efficient, holding several slots in the top 10 of the GREEN500 [9] list as
recently as 2009. While the system is several years old, it is still used for cutting-edge
science, as is evidenced by the continued presence of Blue Gene systems as winners
and finalists in the Gordon Bell competition [10, 11, 12] as well as receiving awards
as prestigious as The National Medal of Technology and Innovation in 2009.
18
CHAPTER II
Related Work
As stencil operators have been identified as an important component of many scientific
computing applications, a good deal of effort has been spent attempting to optimize
their performance. Recently, [13] optimized the performance of a 7-point stencil and
a Lattice Boltzmann method on CPUs and GPUs over three-dimensional grids. They
performed spatial and temporal blocking, named 3.5D blocking, to decrease the mem-
ory bandwidth requirements of their memory bound problems. An earlier yet more
comprehensive work by Datta [14] constructed an auto-tuning framework to optimize
the 7- and the 27-point stencils and the Gauss-Seidel Red-Black Helmholtz kernel.
Datta’s work was performed on diverse multicore architectures modern at the time.
He achieved good performance employing a search over a variety of algorithmic and
architecture-targeting techniques including common subexpression elimination and
Non-Uniform Memory Access (NUMA)-aware allocations. It is interesting to note
that aside from register blocking and common subexpression elimination, Datta’s
techniques were ineffective in improving the performance of stencil operators on the
Blue Gene/P platform. This was attributed in part to the in-order-execution archi-
tecture of the PowerPC 450 processor.
A common domain-specific technique in practice is time skewing [15] [16]. Unfor-
19
tunately, time skewing is not generalizable and has limited application as it does not
allow other computations between time steps to occur.
There have been several other recent studies of performance improvements in sten-
cil operators in the literature [17, 18, 19], including core-level optimization techniques.
The majority of previous work focuses on spatial and temporal blocking of memory
access: Frigo provides a cache-oblivious algorithm for general n-dimensional stencil
computation in [20], whereas in [21] Wellein performs stencil multi-core optimizations
using cache blocking in time (temporal blocking). Kamil conducted a study on the
impact of modern memory subsystems on three-dimensional stencils [22], and pro-
posed two cache optimization strategies for stencil computation in [23], as well as
an auto-tuning framework for code generation in [24]. Kamil’s framework accepts
the stencil’s kernel in Fortran or then converts it to a tuned version in Fortran, C, or
Compute Unified Device Architecture (CUDA). In [25], a software synthesis approach
was proposed to generate optimized stencil code. Machine Learning is utilized in [26]
to tune the parameters of the optimization techniques for the 7- and the 27-point
stencil. In [27] tiling in space is utilized to improve the performance by improving
the spatial locality of the stencil computations.
Several emerging frameworks are facilitating the development of efficient high
performance code without having to go down to the assembly level, at least not
directly. These implementations are largely motivated by the difficulties involved in
utilizing vectorized Floating Point Unit (FPU) and other advanced features in the
processor. CorePy [28], a Python implementation similar to our approach, provides a
code synthesis package with Application Programming Interface (API) to develop high
performance applications by utilizing the low-level features of the processor that are
usually hidden by the programming languages. Intel has introduced a new dynamic
compilation framework, Array Building Blocks (ArBB) [29], a high level approach to
automatically using the SIMD units on Intel processors without targeting a specific
20
architecture.
Several techniques are developed in the literature to address the alignment prob-
lems in utilizing the SIMD capabilities of the modern processors. In [30] the alignment
is handled by using an algorithm to reorganize the data in the registers to satisfy the
alignment constraints of the processor. A compilation technique for data layout trans-
formation is proposed in [31] to have the data aligned properly to be used with the
SIMD without problems.
In this work we focus on using code synthesis to manually make use of unroll-and-
jam, an optimization technique that creates tiling on multiply nested loops through
a two-step procedure [32, 33]. Unroll-and-jam combines two well-known techniques,
loop unrolling on the outer loop to create two inner loops, then loop fusion, or “jam-
ming,” to combine the two inner loops into a single loop. Figure II.1 shows example
of unroll-and-jam for a copy operation between two 3-dimensional arrays, where each
of the two outer most loops is unrolled once. This technique can work well for
3-dimensional local operators because it promotes register reuse and can increase ef-
fective arithmetic intensity, though it requires a large number of registers available to
work effectively.
for i = 1 to N
for j = 1 to N
for k = 1 to N
A(i,j,k) = R(i,j,k)
endfor
endfor
endfor
(a) Regular loop
for i = 1 to N step 2
for j = 1 to N step 2
for k = 1 to N
A(i ,j ,k) = R(i ,j ,k)
A(i+1,j ,k) = R(i+1,j ,k)
A(i ,j+1,k) = R(i ,j+1,k)
A(i+1,j+1,k) = R(i+1,j+1,k)
endfor
endfor
endfor
(b) Unrolled-and-jammed loop
Figure II.1: Unroll-and-jam example for a 3-dimensional array copy operation
21
CHAPTER III
Implementation Considerations
III.1 Stencil Operators
A 3-D stencil is a linear operator on RMNP , the space of scalar fields on a Cartesian
grid of dimension M ×N × P . Apart from some remarks in Chapter VI, we assume
throughout this thesis that the operator does not vary with the location in the grid,
as is typical for problems with regular mesh spacing and space-invariant physical
properties such as constant diffusion. The input A and output R are conventionally
stored in a one-dimensional array using lexicographic ordering. We choose a C-style
ordering convention so that an entry ai,j,k of A has flattened index (iN + j)P + k,
with zero-based indexing. We assume that the arrays A and R are aligned to 16 byte
memory boundaries. Some of our experimental kernels assume an “odd” alignment,
where the start of either the A or R array is set to an offset of 8 bytes from a 16 byte
alignment. We could generalize the alignment requirements to 8 bytes at almost no
computational cost by adding some logic to the prologue of the kernels.
Formally, the 3-point stencil operator defines a linear mapping from a weighted
sum of three consecutive elements of A to one element in R:
ri,j,k = w0 ∗ ai,j,k−1 + w1 ∗ ai,j,k + w2 ∗ ai,j,k+1
22
We further assume certain symmetries in the stencils that are typical of self-
adjoint problems, there that w0 = w2. The effect of this assumption is that we
require fewer registers for storing the w coefficients of the operator, allowing us to
unroll the problem further; however, this assumption is not applicable to convective
problems.
The 7-point stencil (Figure III.1) defines the result at ri,j,k as a linear combination
of the input ai,j,k and its six three-dimensional neighbors with Manhattan distance
one. The 27-point stencil uses a linear combination of the set of 26 neighbors with
Chebyshev distance one. The boundary values ri,j,k with i ∈ {0,M − 1}, j ∈ {0, N −
1}, or k ∈ {0, P −1} are not written as is standard for Dirichlet boundary conditions.
Other boundary conditions would apply a different one-sided stencil at these locations
which is cheap even without highly optimized code.
i
k
j
Figure III.1: 7-point stencil operator
The 27-point stencil (Figure III.2) can be seen as the summation over nine inde-
pendent 3-point stencil operators into a single result. We assume symmetry along
but not between the three dimensions, leading to 8 unique weight coefficients. The
symmetric 7-point stencil operator has 4 unique weight coefficients.
23
i
k
j
Figure III.2: 27-point stencil operator
III.2 PowerPC 450
Designed for delivering power-efficient floating point computations, the nodes in a
Blue Gene/P system are four-way Symmetric Processors (SMPs) comprised of Pow-
erPC 450 processing cores [34]. The processing cores possess a modest superscalar
architecture capable of issuing a SIMD floating point instruction in parallel with var-
ious integer and load/store instructions. Each core has an independent register file
containing 32 4-byte general purpose registers and 32 16-byte SIMD floating point
registers which are operated on by a pair of fused floating point units. A multiplexing
unit on each end of this chained floating point pipeline enables a rich combination
of parallel, copy, and cross semantics in the SIMD floating point operation set. This
flexibility in the multiplexing unit can enhance computational efficiency by replacing
the need for independent copy and swap operations on the floating point registers. To
provide backward compatibility as well as some additional functionality, these 16-byte
SIMD floating point registers are divided into independently addressable, 8-byte pri-
mary and secondary registers wherein non-SIMD floating point instructions operate
transparently on the primary half of each SIMD register.
An individual PowerPC 450 core has its own 64KB L1 cache divided evenly into
24
32 KB for instructions and 32KB for data. The L1 cache uses a round-robin (First In
First Out (FIFO)) replacement policy in 16 sets, each with 64-way set-associativity.
The L1 cache line is 32 bytes in size.
Every core also has its own private prefetch unit, designated as the L2 cache,
“between” the L1 and the L3. In the default configuration, each PowerPC 450 core
can support up to 5 “deep fetching” streams or up to 7 shallower streams. These
values stem from the fact that the L2 prefetch cache has 15 128-byte entries. If the
system is fetching two lines ahead (such settings can be configured at boot time),
each stream occupies three positions, one current and two “future,” while a shallower
prefetch lowers the occupancy to two per stream. The final level of cache is an 8 MB
L3 cache, shared among the four cores which features a Least Recently Used (LRU)
replacement policy, with 8-way set associativity and 128-byte lines.
On this architecture the desired scenario in highly performant numerical codes
is the dispatch of a SIMD floating point instruction every cycle (in particular, a
Floating point Multiply-Add (FMA)), with any load or store involving one of the
floating point registers issued in parallel, as inputs are streamed in and results are
streamed out. Floating point instructions can be retired one per cycle, yielding a peak
computational throughput of one (SIMD) FMA per cycle, leading to a theoretical
peak of 3.4 GFlops/s per core. Blue Gene/P floating point load instructions, whether
they be SIMD or non-SIMD, can be retired every other cycle, leading to an effective
bandwidth to the L1 of 8 bytes a cycle for aligned 16-byte SIMD loads (non-aligned
loads result in a significant performance penalty) and 4 bytes a cycle otherwise. As
a consequence of the instruction costs, no kernel can achieve peak floating point
performance if it requires a ratio of load to SIMD floating point instructions greater
than 0.5. It is important to ensure packed “quad-word” SIMD loads occur on 16 byte
aligned memory boundaries on the PowerPC 450 to avoid performance penalties that
ensue from the resulting hardware interrupt.
25
An important consideration for achieving high throughput performance on modern
floating point units is pipeline latency, the number of cycles that must transpire
between accesses to an operand being written or loaded into in order to avoid pipeline
hazards (and their consequent stalls). Floating point computations on the PowerPC
450 have a latency of approximately 5 cycles, whereas double-precision loads from
the L1 require at least 4 cycles and those from the L2 effectively require 15 cycles [1].
Latency measurements when fulfilling a load request from the L3 or DDR memory
banks are less precise: we assume an additional 50 cycle average latency penalty for
all loads outside the L1 that hit in the L3.
In the event of an L1 cache miss, up to three concurrent requests for memory
beyond the L1 can execute (an L1 cache miss while 3 requests are “in-flight” will cause
a stall until one of the requests to the L1 has been fulfilled). Without assistance from
the L2 cache, this leads to a return of 96 bytes (three 32-byte lines) every 56 cycles (50
cycles of memory latency + 6 cycles of instruction latency), for an effective bandwidth
of approximately 1.7 bytes/cycle. This architectural characteristic is important in our
thesis, as L3-confined kernels with a limited number of streams can effectively utilize
the L2 prefetch cache and realize as much as 4.6 bytes/cycle bandwidth per core,
while those not so constrained will pay the indicated bandwidth penalty.
The PPC450 is an in-order unit with regards to floating point instruction exe-
cution. An important consequence is that a poorly implemented instruction stream
featuring many non-interleaved load/store or floating point operations will suffer from
frequent structural hazard stalls with utilization of only one of the units. Conversely,
this in-order nature makes the result of efforts to schedule and bundle instructions
easier to understand and extend.
26
III.3 Instruction Scheduling Optimization
We wish to minimize the required number of cycles to execute a given code block com-
posed of PowerPC 450 assembly instructions. This requires scheduling (reordering)
the instructions of the code block to avoid the structural and data hazards, described
in the previous section. In this section we present the theoretical formulation for
instruction scheduling technologies and constraint to the maximum allowed register
allocation for a given code block.
We formulate the Integer Linear Programming (ILP) optimization problem. Our
work is based on the formulation of [35] that considers optimizations combining
both register allocation and instruction scheduling of architectures with multi-issue
pipelines. To account for multi-cycle instructions, we include parts of the formulation
at [36]. We consider two minor changes to these approaches. First, we consider the
two separate sets of registers of the PowerPC 450, the General Purpose Registers
(GPR), and the Floating Point Registers (FPR). Second, we account for the instruc-
tions that use the Load Store Unit (LSU) occupying the pipeline for two cycles.
We consider a code block composed of N instructions initially ordered as I =
{I1, I2, I3, ..., IN}. A common approach to represent the data dependencies of these
instructions is to use a Directed Acyclic Graph (DAG). Given a graph G(V,E), the
nodes of the graph (V ) represent the instructions whereas the directed edges (E)
represent the dependencies between them. Figure III.3 shows an example of a DAG
representing a sequence of instructions. Edges represent data dependencies between
instructions. Each read-after-write data dependency of an instruction Ij on an in-
struction Ii is represented by a weighted edge eij. This weighted edge corresponds to
the number of cycles needed by an instruction Ii to produce the results required by
an instructionIj . The weights associated with Write-after-read and write-after-write
data dependencies are set to one.
The PowerPC 450 processor has one LSU and one FPU. This allows the processor
27
I1 : ld fa, rx, ry
I2 : ld fb, rx, ry
I3 : sub fd, fa, fb
I4 : ld fb, rt, ry
I5 : mul fe, fb, fd
I6 : st fe, rz, ry
I7 : mul fc, fd, fc
I8 : st fc, rv, ry
(a) Instruction sequence
1
2
3
4
5
7
8
1
4
4
5
4
4
5
5
5
1
1
6
(b) DAG representation
Figure III.3: Example of DAG representation of the instructions
to execute a maximum of one floating point operation per cycle and one load/store
operation every two cycles (each operation using the LSU consumes 2 cycles). The
instructions use either the LSU (ILSU ∈ LSU) or the other computation resources of
the processor including the FPU (IFPU ∈ FPU). A critical path, C, in the DAG is a
path in the graph on which the sum of its edges’ weights attains the maximum. We
can compute the lower bound (Lbound) to run the instructions I as follows:
Lbound(I) = max{C, 2 ∗ ILSU , IFPU} (III.1)
We can compute an upper bound Ubound(I) by simulating the number of cycles
to execute the scheduled instructions. If Ubound(I) = Lbound(I) we can generate an
optimal schedule.
We define the 0/1 optimization variables xji , having the value 1 to represent that
the instruction Ii is scheduled at a given cycle c
j and 0 otherwise. These variables are
represented in the array shown in Figure III.4, where M represents the total number
28
of the cycles.
c1 c2 c3 . cM
I1 x
1
1 x
2
1 x
3
1 . x
M
1
I2 x
1
2 x
2
2 x
3
2 . x
M
2
I3 x
1
3 x
2
3 x
3
3 . x
M
3
. . . . . .
IN x
1
N x
2
N x
3
N . x
M
N
Figure III.4: A 0/1 array of size N ×M representing the scheduling variables
The first constraint in our optimization is to force each instruction to be scheduled
only once in the code block. Formally:
M∑
j=1
xji = 1, ∀i ∈ {1, 2, ..., N} (III.2)
The PowerPC 450 processor can execute a maximum of one floating point oper-
ation every cycle and one load/store operation every two cycles. This imposes the
following constraints:
∑
i:Ii∈IFPU
xji ≤ 1, ∀j ∈ {1, 2, ...,M} (III.3)
∑
i:Ii∈ILSU
(xji + x
j+1
i ) ≤ 1, ∀j ∈ {1, 2, ...,M − 1} (III.4)
Finally, to maintain the correctness of the code’s results we enforce dependency
constraints:
M∑
k=1
(k ∗ xki )−
M∑
k=1
(k ∗ xkj ) + eij + 1 ≤ 0, ∀i, j : eij ∈ E (III.5)
Instruction latency to write on a GPR is 1 cycle (eij = 1). The latency for writing
on an FPR is 5 cycles for Ii ∈ IFPU and at least 4 cycles for Ii ∈ ILSU . Load
instructions have higher latency when the loaded data is present in the L3 cache or
the RAM. This can be considered in future formulations to maximize the number of
29
cycles between loading the data at an instruction Ii and using it by maximizing eij
in the objective.
We wish also to constrain the maximum number of allocated registers in a given
code block. All the registers are assumed to be allocated and released within the
code block. An instruction Ii scheduled at a cycle c
j allocates a register r by writing
on it. The register r is released (deallocated) at a cycle ck by the last instruction
reading from it. The life span of the register r is defined to be from cycle cj to cycle
ck inclusive.
We define two sets of 0/1 register optimization variables gji and f
j
i to represent the
usage of the GPR and the FPR, respectively. Each of these variables belongs to an
array of the same size as the array in Figure III.4. The value of gji is set to 1 during
the life span of a register in the GPR modified by the instruction Ii scheduled at the
cycle cj and last read by an instruction scheduled at the cycle ck, that is gzi = 1 when
j ≤ z ≤ k and zero otherwise. The same applies to the variables f ji to represent the
FPR register allocation.
To compute the values of gji and f
j
i , we define the temporary variables gˆ
p
i and fˆ
p
i .
Let Ki be the number of instructions reading the from the register allocated by the
instruction Ii. These temporary variables are computed as follows:
fˆpi = K ×
p∑
z=1
xzi −
∑
∀j:eij∈E
(
p∑
z=1
xzj), ∀i : Ii writes on FPR (III.6)
gˆpi = K ×
p∑
z=1
xzi −
∑
∀j:eij∈E
(
p∑
z=1
xzj), ∀i : Ii writes on GPR (III.7)
Our optimization variables gji and f
j
i will equal to 1 only when gˆ
j
i > 0 and fˆ
j
i > 0,
30
respectively. This can be formulated as follow:
f ji − fˆ ji ≤ 0 (III.8)
gji − gˆji ≤ 0 (III.9)
K × f ji − fˆ ji ≥ 0 (III.10)
K × gji − gˆji ≥ 0 (III.11)
Now we can constrain the maximum number of used registers FPRmax and GPRmax
by the following:
N∑
i=1
(gji )−GPRmax ≤ 0, ∀j ∈ {1, 2, ...,M} (III.12)
N∑
i=1
(f ji )− FPRmax ≤ 0, ∀j ∈ {1, 2, ...,M} (III.13)
Our optimization objective is to minimize the required cycles to execute the code
(Crun). The complete formulation of our optimization is:
31
Minimize: Crun
Subject to :
M∑
j=1
xji = 1, ∀i ∈ {1, 2, ..., N} (III.2)
∑
i:Ii∈IFPU
xji ≤ 1, ∀j ∈ {1, 2, ...,M} (III.3)
∑
i:Ii∈ILSU
(xji + x
j+1
i ) ≤ 1, ∀j ∈ {1, 2, ...,M − 1} (III.4)
M∑
k=1
(k ∗ xki )−
M∑
k=1
(k ∗ xkj ) + eij + 1 ≤ 0, ∀i, j : eij ∈ E (III.5)
N∑
i=1
(gji )−GPRmax ≤ 0, ∀j ∈ {1, 2, ...,M} (III.12)
N∑
i=1
(f ji )− FPRmax ≤ 0, ∀j ∈ {1, 2, ...,M} (III.13)
M∑
j=1
j ∗ xji − Crun ≤ 0, ∀i : Ii has no successors (sink node) (III.14)
This optimization problem is known to be NP-complete [37], so there is no known
algorithm to solve it in polynomial time. In our current implementation, we use a
greedy algorithm which yields an approximate solution in a practical time, as we
describe in IV.3.3.
32
CHAPTER IV
Implementation
IV.1 STREAM benchmark on Shaheen
STREAM benchmark [38] is a program to measure the sustainable memory bandwidth
for a set of basic vector computations on the target processor. Given arrays A, B, and
C and a constant q, these vector computations are: Copy (A = B), Scale (A = q×B),
Sum (A = B + C), and Triad (A = B + q × C).
The theoretical peak of the processor does not provide practical upper bounds
of the memory bandwidth. We use the STREAM benchmark to get practical up-
per bounds on the memory bandwidth of our target platform, PowerPC 450. We
performed our experiments on Shaheen using the XL compiler with different config-
urations and number of cores to explore the set of configurations achieving the best
performance.
We generated the results shown in Table IV.1 on a single node for different numbers
of Open Multi-Processing (OpenMP) threads using optimization flag -O5 without
using the Message Passing Interface (MPI).
Furthermore we examined the performance of the STREAM benchmark when
independent MPI processes run on the same 4-cores chip as shown in Table IV.2.
33
Threads # 1 2 4
total total per thread total per thread
Copy 3871 7538 3769 8718 2180
Scale 3621 7074 3537 9812 2453
Add 3701 7353 3677 9405 2351
Triad 3707 7405 3703 9819 2455
Table IV.1: STREAM performance (MB/s) on PowerPC 450 using 1, 2, and 4 threads.
Mode (processes) Dual (2) Virtual (4)
Total Per process Total Per process
Copy 7656 3828 9992 2498
Scale 7200 3600 9904 2476
Add 7372 3686 10284 2571
Triad 7412 3706 10620 2655
Table IV.2: STREAM performance (MB/s) on PowerPC 450 running 2 and 4 MPI
processes
We have two observations from the STREAM benchmark results on a four Pow-
erPC 450 cores chip. First, running four separate MPI processes achieved better
performance than running one process with four OpenMP threads. Second, when we
use less than four cores, the achieved bandwidth per process in MPI or per thread in
OpenMP is more than the theoretical peak of a single core when using all the 4 cores.
These results motivate the use of MPI processes in our experiments. In addition,
we run our experiments using four MPI processes on the four cores chip to avoid
misleadingly optimistic results.
IV.2 C and Fortran
We implement three steaming kernels in C and Fortran and utilize published results
as a benchmark to gain a better understanding of the performance of the general
class of streaming numerical kernels on the IBM PowerPC 450. Our first challenge is
in expressing a SIMDized mapping of the stencil operator in C or Fortran. Neither
34
language natively supports the concept of a unit of SIMD-packed doubles, though
Fortran’s complex type comes very close. Complicating matters further, the odd
cardinality of the stencil operators necessitates the use of clever tactics to properly
SIMDize. Regardless of unrolling strategy, one of every two loads is unaligned or re-
quires clever register manipulation. It is our experience that standard C and Fortran
implementations of our stencil operators will compile into exclusively scalar instruc-
tions that cannot attain greater than half of peak floating point performance. For
example, with a naive count of 53 flops per 27-point stencil, this is 31.5 Mstencil/s on
a single core of the PowerPC 450. Register pressure and pipeline stalls further reduce
the true performance to 21.5 Mstencil/s. We note that Datta [39] was able to improve
this to 25 Mstencil/s using manual unrolling, common subexpression elimination, and
other optimizations within C.
The XL compilers support intrinsics which permit the programmer to specify
which instructions are used, but the correspondence between intrinsics and instruc-
tions is not exact enough for our purposes. The use of intrinsics can aid programmer
productivity, as this method relies upon the compiler for such tasks as register al-
location and instruction scheduling. However, our methods required precise control
of such aspects of the produced assembly code, making this path unsuitable for our
needs.
IV.3 Synthetic Assembly
The challenges in writing fast kernels in C and Fortran motivate us to program at the
assembly level, a (perhaps surprisingly) productive task when using our code synthesis
framework: an experienced user was able to design, implement, and test efficient
kernels for a stencil operator in one day using the framework. Initally, we design
two small, na¨ıvely scheduled, 3-point kernels: mutate-mutate and load-copy. The
35
relative balance between load/store and floating point cycles consumed distinguishes
the kernels from each other.
IV.3.1 Kernels Design
We find that efficiently utilizing SIMD units in stencil computations is a challenging
task. To fill the SIMD registers, we pack two consecutive data elements from the
fastest moving dimension, k, allowing us to compute two stencils simultaneously, as
in [40]. Computing in this manner is semantically equivalent to an unrolling by 2
in the k direction. Since i and j are static for any given stream, we denote the two
values occupying the SIMD register for a given array by their k indices, e.g., SIMD
register a34 contains Ai,j,3 in its primary half and Ai,j,4 in its secondary half.
The stencil operators map a subset of adjacent A elements with odd cardinality
to each R element, as is illustrated in the left half of Figure IV.1, which depicts the
SIMD register contents and computations mid-stream of a 3-point kernel. This pre-
vents a direct mapping from SIMD input registers to SIMD output registers. Note
that aligned SIMD loads from A easily allow for the initialization of registers contain-
ing a23 and a45. Similarly, the results in r34 can be safely stored using a SIMD store.
The register containing a34, unaligned elements common to the aligned registers con-
taining adjacent data, requires a shuﬄe within the SIMD registers through the use
of additional load or floating point move instructions.
We introduce two kernels, which we designate as mutate-mutate (mm) and load-
copy (lc), as two different approaches to form the packed data into the unaligned
SIMD registers while streaming through A in memory. The mutate-mutate kernel,
as the name indicates, mutates the value of the SIMD register by replacing the older
half with the next element of the stream. In our example in Figure IV.1 we start with
a23 loaded in the SIMD register, then after the first computation we load a4 into the
primary part of the SIMD register so that the full contents are a43.
36
!"#$%&'()*+(,+-$
•  ./+$01+$-)23+3$42-5+$)6$2&789:;$6()%$<(+4&)5/$&0+(2=),$
•  >)%<50+$?(/0$'),0(&@5=),$0)$01+$'5((+,0$(+/5-0$(&78:A;$
Aij 0 1 2 3 4 5 6 
Rij 0 1 2 3 4 5 6 
*$
B$ BCDB9$ B!$
E$
F$ 9$ :$
G$
H$ :$ A$
2$I$FJ&K7K*LCM*L!N$$O-)23$2C!$
(!9$G$2C!$E$!"###
2C!P<$I$FJ&K7K*L9N$$O%50+$29!$
(!9$LG$29!$QE$!$#
29!P/$I$FJ&K7K*L:N$$O%50+$29:$$
(!9$LG$29:$E$!%#
(!9$R$HJ&K7K*L!M*L9N$$$$
"""""""""""""""""""""""""""""""""""""""$
&'(#)#*%'#+#!"#
29:P<$I$FJ&K7K*LAN$$O%50+$2A:$
(:A$LG$2A:$QE$!$#
2A:P/$I$FJ&K7K*LSN$$O%50+/$2AS$
(:A$LG$2AS$E$!%#
(:A$R$HJ&K7K*L:M*LAN$
(ST$G$2AS$E$!"#
(a) Compute contributions from a23
!" !#$!%" !&"
'"
(" )" *"
+,"
-" *" )"
&./"01234563768"
•  9:6";<6"84=>6>"?=8@6"4A"=1BC%*D"A340"E36?14@:"1;63=F47"
•  G40E@;6"H3:;"247;31I@F47";4";<6"2@3367;"36:@8;"31BC*)D"
Aij 0 1 2 3 4 5 6 
Rij 0 1 2 3 4 5 6 
5"
="J"(K1LBL5+#M5+&N""O84=>"=#&"
3&%","=#&"'" ###
=#&PE"J"(K1LBL5+%N""O0@;6"=%&"
3&%"+,"=%&"Q'" #
=%&P:"J"(K1LBL5+*N""O0@;6"=%*""
3&%"+,"=%*"'" #
3&%" "-K1LBL5+&M5+%N""""
......................................."
3*)","=%*"'"!#"
=%*PE"J"(K1LBL5+)N""O0@;6"=)*"
&'(#)*#+('#,-#!$#
=)*P:"J"(K1LBL5+SN""O0@;6:"=)S"
3*)"+,"=)S"'"!%#
3*)"R"-K1LBL5+*M5+)N"
3ST","=)S"'"!"#
(b) Compute contributions from a43
!" !#$!%" !&"
'"
(" )" *"
+,"
-" ." )"
&/0"12345674879"
•  :;7"<=7"95>?7?"@>9A7"5B">2CD%.E"B451"F47@25A;"2<74>G58"
•  H51FA<7"I4;<"358<42JAG58"<5"<=7"3A4478<"47;A9<"42CD.)E"
Aij 0 1 2 3 4 5 6 
Rij 0 1 2 3 4 5 6 
6"
>"K"(L2MCM6+#N6+&O""P95>?">#&"
4&%",">#&"'" "###
>#&QF"K"(L2MCM6+%O""P1A<7">%&"
4&%"+,">%&"R'" #
>%&Q;"K"(L2MCM6+.O""P1A<7">%.""
4&%" ">%." " #
4&%"S"-L2MCM6+&N6+%O""""
///////////////////////////////////////"
&'(#)#*%'#+#!"#
>%.QF"K"(L2MCM6+)O""P1A<7">)."
4.)"+,">)."R'"!$#
>).Q;"K"(L2MCM6+*O""P1A<7;">)*"
4.)"+,">)*"'"!%#
4.)"S"-L2MCM6+.N6+)O"
4*T",">)*"'"!"#
(c) Compute contributions from a45
Figure IV.1: SIMD stencil computations on 1D streams
The load-copy kernel instead combines the unaligned values in a SIMD register
from two consecutive aligned quad-words loaded in two SIMD registers by copying, via
an inter-register move, the primary element from the second register to the primary
element of the first. In our example a23 and a45 are loaded in two SIMD registers.
Then, after the needed computations involving a23 have been dispatched, a floating
point move instruction replaces a2 with a4 to form a43.
The two kernels use an identical set of floating point instructions, visually depicted
in the right half of Figure IV.1, to accumulate the computations into the result reg-
isters. The two needed weight coefficients are packed into one SIMD register. This
packing is enabled by the copy feature provided by the floating point unit. The first
37
floating point operation is a cross copy-primary multiply instruction, multiplying two
copies of the first weight coefficient by the two values in a23, then placing the results
in the SIMD register containing r34 (Figure IV.1a). Then, the value of a23 in the
SIMD register is mutated to become a43, replacing one data element in the SIMD
register. The second floating point operation is a cross complex multiply-add instruc-
tion, performing a cross operation to deal with the reversed values (Figure IV.1b).
Finally, the value a45, which has either been preloaded by load-copy or is created by a
second mutation in mutate-mutate, is used to perform the last computation (Figure
IV.1c).
We list resource requirements of the load-copy and the mutate-mutate kernels in
Table IV.3. The two kernels are at complementary ends of a spectrum. The mutate-
mutate kernel increases pressure exclusively on the load pipeline while the load-copy
kernel incurs extra cycles on the floating point unit. The two strategies can be used
in concert, using mutate-mutate when the floating point unit is the bottleneck and
the load-copy when it is not.
Table IV.3: Resource usage per stencil of mm and lc
Kernel Operations Cycles Registers
ld-st FPU ld-st FPU Input Output
mm 2-1 3 6 3 1 1
lc 1-1 4 4 4 2 1
IV.3.2 Unroll-and-Jam
The 3-point kernels are relatively easy to specify in assembly, but the floating point
and load/store instruction latencies will cause pipeline stalls if they are not unrolled.
Further unrolling in the k-direction is a possible solution that we do not explore in
this thesis. Although this solution would reduce the number of concurrent memory
streams, it would also reduce data reuse for the other stencils and require even more
intricate programming.
38
Unrolling and jamming once in transverse directions provides independent arith-
metic operations to hide instruction latency, but interleaving the instructions by hand
produces large kernels that are difficult to understand and modify. To simplify the
design process, we constructed a synthetic code generator and simulator with reorder-
ing capability to interleave the jammed FPU and load/store instructions to minimize
pipeline stalls. The synthetic code generator also gives us the flexibility to implement
many general stencil operators, including the 7-point and 27-point stencils using the
3-point stencil as a building block. This implies that when we improve performance
for the 3-point stencil we almost transparently improve performance for the entire
class of stencil operators.
Unroll-and-jam serves a second purpose for the 7-point and 27-point stencil op-
erators due to the overlapped data usage among adjacent stencils. When applied,
it eliminates the redundant loads of the common data elements among the jammed
stencils, reducing pressure on the memory subsystem by increasing the effective arith-
metic intensity of the kernel.
This can be quantified by comparing the number of input streams, which we refer
to as the “frame size,” with the number of output streams. For example a 2 × 2
jam for the 27-point stencil produces 2 · 2 = 4 output streams using a frame size of
4 · 4 = 16 while the same jam for the 3-point stencil has a frame size of just 2 · 2 = 4.
We use mutate-mutate and load-copy kernels to construct 3-, 7-, and 27-point sten-
cil kernels over several different unrolling configurations. Table IV.4 lists the register
allocation requirements for these configurations and provides per-cycle computational
requirements. In both the mutate-mutate and load-copy kernels the 27-point stencil
is theoretically FPU bound because of the high reuse of loaded input data elements
across streams.
The mutate-mutate kernel allows more unrolling for the 27-point stencil than the
load-copy kernel because it uses fewer registers per stencil. The number of allocated
39
registers for input data streams at the mutate-mutate kernel is equal to the number
of the input data streams, while the load-copy kernel requires twice the number of
registers for its input data streams.
Using unroll-and-jam enables the reuse of loaded input data across streams, in-
creasing the FPU to load/store instruction ratio. The mutate-mutate kernel outper-
forms the load-copy kernel when using unroll-and-jam, allowing complete utilization
of the FPU unit when the stencil operator has high reuse across the data streams.
IV.3.3 PowerPC 450 Simulator
The initial implementation of our simulator is created by Jed Brown. It optimizes our
several hundred synthesized assembly instructions for in-order execution. First, we
produce a list of non-redundant instructions using a Python generator. The simula-
tor reflects an understanding of the instruction set, including semantics such as read
and write dependencies, instruction latency, and which execution unit it occupies. It
functions as if it were a PowerPC 450 with an infinite-lookahead, greedy, out-of-order
execution unit. On each cycle, it attempts to start an instruction on both the load/-
store and floating point execution units while observing instruction dependencies. If
this is not possible, it provides diagnostics about the size of the stall and what de-
pendencies prevented another instruction from being scheduled. The simulator both
modifies internal registers that can be inspected for verification purposes and pro-
duces a log of the reordered instruction schedule. The log is then rendered as inline
assembly code which can be compiled using the XL or GNU C compilers.
Our code generation framework is shown in Figure IV.2. Python code is prepared
to produce the instruction sequence of our code block. First, registers are allocated
by assigning them variable names. Next, a list of instructions’ objects is created,
utilizing the variables to address the registers. Enabling the instruction scheduler
allows the simulator to execute the instructions out of order; otherwise they will be
40
Table IV.4: Computational requirements
C
on
fi
gu
ration
s
R
egisters
In
stru
ction
s
B
an
d
w
id
th
K
ern
el
F
ram
e
S
ten
cils/
In
p
u
t
R
esu
lt
W
eigh
t
C
ou
n
t
C
y
cles
U
tilization
%
B
y
tes/
Iteration
ld
-st
F
P
U
ld
-st
F
P
U
ld
-st
F
P
U
sten
cil
27-m
m
-1x
1
9
2
9
1
4
18-1
27
36-2
27
100
71.1
80
27-m
m
-1x
2
12
4
12
2
4
24-2
54
48-4
54
96.3
100
56
27-m
m
-1x
3
15
6
15
3
4
30-3
81
60-6
81
81.5
100
48
27-m
m
-2x
2
16
8
16
4
4
32-4
108
64-8
108
66.7
100
40
27-m
m
-2x
3
20
12
20
6
4
40-6
162
80-12
162
56.8
100
34.7
7-m
m
-2x
3
16
12
16
6
2
22-6
42
44-12
42
100
75
29.3
7-lc-2x
3
16
12
22
6
2
16-6
48
32-12
48
91.7
100
29.3
3-lc-1x
1
1
2
2
1
1
1-1
4
2-2
4
100
100
16
3-lc-2x
1
2
4
4
2
1
2-2
8
4-4
8
100
100
16
3-lc-2x
2
4
8
8
4
1
4-4
16
8-8
16
100
100
16
3-lc-2x
3
6
12
12
6
1
6-6
24
12-12
24
100
100
16
3-lc-2x
4
8
16
16
8
1
8-8
32
16-16
32
100
100
16
41
executed in order when we turn the scheduler off. The instruction simulator uses
virtual GPR, FPR, and memory to simulate the pipeline execution and to simulate
the expected results of the given instruction sequence. A log is produced by the
instruction simulator to provide the simulation details, showing the cycles at which
the instructions are scheduled and the encountered data and structural hazards, if
there are any. Also, the log can contain the contents of the GPR, the FPR, and the
memory, at any cycle, to debug the code for results correctness. The C code generator
takes the simulated instructions and appends their equivalent inline assembly code in
a provided C code template. To provide more clarity in the generated assembly code,
we associate each generated line with a comment showing the mapping between the
used registers’ numbers and their corresponding variables’ names in the Python code.
Figure IV.2: The components of the code generation framework in this work
42
CHAPTER V
Performance
Code synthesis allows us to easily generate and verify performance of 3-, 7-, and 27-
point stencil operators against our predictive models over a range of unrolling-and-
jamming and inner kernel options. We use individual MPI processes mapped to the
four PowerPC 450 cores to provide 4-way parallelization and ascertain performance
characteristics out to the shared L3 and DDR memory banks.
The generated assembly code, comprising the innermost loop, incurs a constant
computation overhead from the prologue, epilogue, and registers saving/restoring.
The relative significance of this overhead is reduced when larger computations are
performed at the innermost loop. This motivated us to decompose the problem’s
domain among the four cores along the outermost dimension for the 3- and 7-point
stencils, resulting in better performance. However, the 27-point stencil has another
important factor, dominating the innermost loop overhead cost. Its computations
inhibit relatively high input data points sharing among neighbor stencils. Splitting the
inner most dimension allows the shared input data points to be reused by consecutive
middle dimension iterations, where the processor will likely have them in the L1 cache,
resulting in faster computations. On the other hand, if the computation is performed
using a large innermost dimension, the shared input data points will no longer be in
43
the L1 cache at the beginning of the next iteration of the middle loop.
All three studies are conducted over a range of cubic problems from size 143 to
3623. The problem sizes are chosen such that all variations of the kernels can be
evaluated naively without extra code for cleanup.
Each sample is computed in a loop to reduce noise from startup costs and timer
resolution, and the samples’ average is used. To have improved measurements, we
repeat this experiment over another loop, selecting the fastest experiment.
All performance results are given per-core, though results were computed on an
entire node with a shared L3 cache. Full node performance can be reasonably obtained
by multiplying by 4.
During the course of our experiments on the stencils, we noticed performance
problems for many of the stencil variants when loading data from the L3 cache. The
large number of concurrent hardware streams in the unroll-and-jam approach over-
whelms the L2 streaming unit, degrading performance. This effect can be amplified in
the default optimistic prefetch mode for the L2, causing wasted memory traffic from
the L3. We make use of a boot option that disables optimistic prefetch from the L2
and compare against the default mode where applicable in our results. We distinguish
the two modes by using solid lines to indicate performance results obtained in the
default mode and dashed lines to indicate results where the optimistic prefetch in L2
has been disabled.
V.1 3-Point Stencil Computations
We begin our experiments with the 3-point stencil, our computational building block
for the other experiments in our work (Figure V.1). The 3-point stencil has the lowest
arithmetic intensity of the three stencils studied, and unlike its 7-point and 27-point
cousins, does not see an increase in effective arithmetic intensity when unroll-and-
44
jam is employed. It is clear from Section IV.3 that the load-copy kernel is more
efficient in bandwidth-bound situations, so we use it as the basis for our unroll-and-
jam experiments. We see the strongest performance in the three problems that fit
partially in the L1 cache (the peak of 224 Mstencil/s is observed at 263), with a drastic
drop off as the problem inputs transition to the L3. The most robust kernel is the 2×1
jam, which reads and writes to two streams simultaneously, and can therefore engage
the L2 prefetch unit most effectively. The larger unrolls (2×2, 2×3, and 2×4), enjoy
greater performance in and near the L1, but then suffer drastic performance penalties
as they exit the L1 and yet another performance dip near 2503. Disabling L2 prefetch
does not seem to have any large effect on the 2×1 kernel, though it unreliably helps
or hinders the other kernels.
0 50 100 150 200 250 300 350 400
Input length
60
80
100
120
140
160
180
200
220
240
M
st
en
ci
l/s
2x4
2x3
2x2
2x1
1x1
Figure V.1: 3-point stencil performance with load-copy kernel
45
The 3-point kernel seems to be an ideal target on the PowerPC 450 for standard
unrolling in the fastest moving dimension, k, a technique we did not attempt due to
its limited application to the larger problems we studied. Unroll-and-jam at sufficient
sizes to properly cover pipeline hazards overwhelms the L2 streaming unit due to the
large number of simultaneous streams to memory. Unrolling in k would cover these
pipeline hazards without increasing the number of streams.
V.2 7-Point Stencil Computations
Our next experiment focuses on performance of the 7-point stencil operator (Figure
V.2). We directly compare the mutate-mutate and load-copy kernels using the same
unroll configurations, although we note that the mutate-mutate kernel can support
a slightly larger unroll on this problem with a compressed usage of general purpose
registers that were only implemented for the 27-point stencil. Again, we notice strong
performance within the L1, then a dropoff as the loads start coming from the L3
instead of the L1. The performance drop near 2563 is caused when the 2×3 kernel’s
frame size of (2 + 2)(3 + 2)− 4 = 16 multiplied by the length of the domain exceeds
the size of L1. For smaller sizes, neighbors in the j direction can reside in L1 between
consecutive passes so that only part of the input frame needs to be supplied by streams
from memory. With up to 16 input streams and 2 · 3 = 6 output streams, there is no
hope of effectively using the L2 prefetch unit. We notice that performance improves
with the L2 optimistic prefetch disabled slightly within the L3, and drastically when
going to the DDR banks. The load-copy kernel shows better performance than the
mutate-mutate kernel, it is clear here that load/store cycles are more precious than
floating point cycles. It is likely that this kernel could attain better results with the
use of cache tiling strategies, though we note that without any attempts at cache tiling
this result is commensurate with previously reported results for the PowerPC 450 that
46
focused on cache tiling for performance tuning.
0 50 100 150 200 250 300 350 400
Input length
40
60
80
100
120
140
160
M
st
en
ci
l/s
mm_2x3 - L2 optimistic prefetch on
mm_2x3 - L2 optimistic prefetch disabled
lc_2x3 - L2 optimistic prefetch on
lc_2x3 - L2 optimistic prefetch disabled
Figure V.2: 7-point stencil performance comparison between mutate-mutate and load-
copy kernels
In Figure V.3 we compare our results with Datta’s findings in [14]. Both our results
and Datta’s results achieved perfect scaling from one to two and four cores. It is clear
that Datta’s performance is compute bound resulting in the same performance for
both small problems fitting in the L1 cache and large prblems streaming from the L3
cache. Our work improved the comutation performance, resulting in 2.2x speedup for
small problems fitting in the L1 cache. We got similar performance for large problems
because the performance becomes memory bound when the data is streamed from
the L3 cache.
47
(a) In-L1 performance (b) Streaming performance
Figure V.3: 7-point stencil performance comparison
V.3 27-Point Stencil Computations
The 27-point stencil should be amenable to using a large number of jammed unrolls
due to the high level of reuse between neighbor stencils. Indeed, we see nearly perfect
scaling in Figure V.4 as we increase the number of jams from 1 to 6 using the mutate-
mutate kernel. Although there is a gradual dropoff from the peak of 54 Mstencil/s
(85% of arithmetic peak) as the problem sizes increase to the point that there is little
reuse from the L1 cache, the kernel consistently sustains an average of 45 Mstencil/s
(70% of arithmetic peak), even when the problem sizes greatly exceed the L3 cache.
Despite the L2-overwhelming frame size ((2 + 2)(3 + 2) = 20 input streams and
2 · 3 = 6 output streams), the jammed stencil achieves good performance with no
blocking largely due to the high level of reuse of input data afforded by the unrolls
in i and j.
We compare with Datta’s results of the 27-point stencil in Figure V.5. Similar
to the 7-point stencil results, Datta’s performance is compute bound for both small
problems fitting in the L1 cache and large problems streaming from the L3 cache. This
48
0 50 100 150 200 250 300 350 400
Input length
0
10
20
30
40
50
60
70
M
st
en
ci
l/s
Comp. peak
BW peak
2x3
2x2
1x3
1x2
1x1
Figure V.4: 27-point stencil performance with mutate-mutate kernel
resulted in the same performance. By improving the computation performance, we
achieved improvements for both small problem size, achieving 2.2x speedup, and large
problem size, achieving 1.7x speedup. The performance of the large problem size is
improved because of the high arithmetic density in the 27-point stencil computations.
V.4 Model Validation
As we utilized a simulator which incorporates a model of the architecture’s perfor-
mance characteristics to produce our kernels, we sought to validate our performance
model by comparing the implicit predictions of our generative system to the empirical
results seen in Table V.1.
49
(a) In-L1 performance (b) Streaming performance
Figure V.5: 27-point stencil performance comparison
Since performance within a core is often considerably easier to predict than when
one must go beyond the core for memory access, we divide our comparisons into those
on-core (L1) and those that go off the core, to L3 or main memory (streaming).
Our modeling of the 27-point stencils can be seen to be highly accurate. Inside
the L1 cache the disparity between predicted and actual performance is consistently
less than 1.5%. Shifting our attention to the streaming predictions, our accuracy can
be seen to be considerably degraded. This is not surprising, given that our simulator
was largely targeted to model the L1 domain. However, the relative error is less than
15% in all cases; this appears to be sufficient for producing highly efficient code. This
shortcoming appears to stem directly from the level of detail with which we model
the shared L3 cache and main memory subsystem and we are working to correct this
in our simulator.
The match between predicted and witnessed performance for the 7-point stencil
shows the same pattern. When modeling performance inside the L1 our relative error
is less than 5%, but when extending our prediction to the components of the system
50
shared by all four cores, our error is as great as 17.5%. That our greatest error in this
instance is an under-prediction that is probably attributable to a fortuitous alignment
of the continuous vectors in the k-direction, as staggering these carefully often result
in bandwidth benefits on the order of 10-15%.
Finally, while our simulator proved useful in generating code with tolerable per-
formance characteristics for the 3-point stencil, it is apparent that our model does
not accurately reflect some of the characteristics of the hardware. From some fur-
ther experimentation, we are reasonably certain that the chief reason for our lack of
accuracy in predicting the performance both within the L1 cache and beyond stems
from the bandwidth that must be shared between the multiple write streams and our
failure to account for this in our model. As the L1 cache is write-through in Blue
Gene/P, this same characteristic likely accounts for both sets of results. It is most
apparent in the 3-point stencil predictions as the ratio of write streams to either read
streams or floating point operations is highest in this case.
51
Table V.1: Predictions vs. observations for in-L1 and streaming performance
K
ern
el
In
stru
ction
lim
its
B
an
d
w
id
th
lim
its
In
-L
1
S
tream
in
g
N
aive
S
im
u
lated
L
1
stream
in
g
P
red
icted
O
b
served
P
red
icted
O
b
served
27-m
m
-1x
1
44.74
12.49
80.88
19.38
12.49
12.41
12.49
12.38
27-m
m
-1x
2
62.96
24.61
113.19
28.58
24.61
24.8
24.61
23.39
27-m
m
-1x
3
62.96
36.37
130.58
33.95
36.37
36.76
33.95
29.83
27-m
m
-2x
2
62.96
47.8
154.28
41.8
47.8
47.38
41.8
39.49
27-m
m
-2x
3
62.96
59.19
175.52
49.43
59.19
58.59
49.43
45.06
7-m
m
-2x
3
182.14
150.29
203.54
60.46
150.29
149.69
60.46
55.32
7-lc-2x
3
212.5
187.04
203.54
60.46
187.04
178.22
60.46
71.04
3-lc-1x
1
425
109.07
338.72
281.56
109.07
95.22
109.07
67.5
3-lc-2x
1
425
198.68
338.72
281.56
198.68
171.01
198.68
120.03
3-lc-2x
2
425
300
338.72
136.76
300
212.28
136.76
113.12
3-lc-2x
3
425
349.08
338.72
136.76
338.72
241.35
136.76
110.02
3-lc-2x
4
425
355.34
338.72
136.76
338.72
244.88
136.76
100.81
52
CHAPTER VI
Concluding Remarks
VI.1 Conclusion
The main contribution of this thesis is the reuse of registers in constant coefficient
linear operators. The loads of the input vector elements and stores of the output
vector elements are minimized and the fraction of multiply-adds among all cycles
is maximized. This is achieved by using our novel streams computation algorithms
mutate-mutate and load-mutate. This was possible, without data layout reordering,
because of the SIMD-like instructions of the PowerPC 450 core.
The recommended research agenda for libraries in the exascale domain includes the
the fusion of library routine implementations as well as the creation of frameworks that
enable the optimal instantiation of a given routine when supplied with architectural
information [41]. We feel that the work presented here is a contribution to that end.
Further, the nature of our simulator allows us to optimize our code as measured
by other metrics such as bandwidth or energy consumption, given a simple model
of the cost. We also demonstrate performance comparable to advanced cache tiling
approaches for the 7-point stencil, despite the fact that we make no effort to optimize
for cache reuse.
53
VI.2 Future Work
While the three individual problems considered (3-point stencils in one dimension and
7-point and 27-point stencils in three dimensions) all with constant coefficients and
symmetry within each spatial dimension (but not across them) are important, there
are also two generalizations of importance: higher-order stencils and chained iterative
passes over the vectors. To set upper bounds for the generalization of the performance
enhancement techniques of this thesis to these other important kernels, we character-
ize their volumes of reads and writes of floating point and integer arguments relative
to the number of floating point multiply-adds performed.
Higher-order stencils expand the number of adjacent input vector elements that
enter into a single output vector element, in successive steps of semi-width one in
each of the spatial dimensions i, j, and k, with an additional weight coefficient cor-
responding to each additional increment of semi-width in each dimension. This is
a modest generalization. Higher-order discretization increases register pressure be-
cause of the larger number of inputs that combine in each output. Opportunities for
reuse of input elements expand with the stencil width up to the ability to keep them
resident. In a P -point regular stencil (regardless of number of spatial dimensions)
each input element is operated upon with a pre-stored weight P times: once in the
stencil centered upon it, and once in each neighboring stencil with which its own
stencil overlaps. Floating point arithmetic intensity increases in proportion to P .
That is, if there are N elements in the input or output array, there are PN floating
point multiply-adds per N floating reads and N floating writes. Explicit methods
for nonlinear systems, especially with high-order discretization techniques such as
Weighted Essentially Non-Oscillatory (WENO) or discontinuous Galerkin [42], have
similar properties, including a larger number of input streams, but with much higher
arithmetic intensity.
S-stage chaining (as in the simultaneous accumulation of Ax, A2x, A3x, . . .Asx)
54
allows the output vector to be fed back as input before being written. Per output
vector of N floating point writes, there are N/S reads and PN floating point multiply-
adds. Therefore, up to the ability to keep the additional operands cached, both higher-
order operators and chained operations improve the potential for the transformations
described here.
Two other generalizations tend to make the work of this thesis less relevant. Ir-
regular stencils require integer reads, in addition to floating point reads to determine
which elements of the input vector go with each row of the matrix. This further
dilutes advantages that lead to the great breakthroughs in stencils per second de-
scribed here. Stencil operations with constant coefficients and sparse matrix-vector
multiplies with general coefficients are similar when counting floating operations, but
very different when it comes to data volume.
Spatially varying coefficients require the loading of additional weights, each of
which is used only once, each of which is of the same floating precision of the input
and output vectors, P of them in the production of each output vector element, with
each input vector element being combined with P different weights. While each input
and output element can still be reused up to P times in the execution of one pass
through the overall vector-to-vector map, the dominant array in the workspace is
the coefficient matrix of weights so the benefits of reusing the vectors are minimal.
This situation is typical for nonlinear problems when using Newton-Krylov and linear
multigrid methods. However, when “free flops” are available, the weights can also
be recomputed on the fly as a nonlinear function of a given state and/or scalar
coefficients. In this case, the number of input streams is similar to the linear constant
coefficient case (perhaps larger by a factor of 2 or 3), but the number of floating point
results is several times higher and involves the problem-specific “physics.” Putting
the physics inside the kernel like this suggests that there will be an emphasis on the
ability to quickly develop high-performance kernels for new physics.
APPENDICES
56
APPENDIX A
Code generation with SimPPC
We describe the code generation process of our framework by a walkthrough example
to generate the innermost loop of a 27-point stencil kernel with a 1 × 2 unroll-and-
jam. We drive our experiments by a Python module presented in section A.1. This
module appends the generated code in a C code template described in section A.2.
The generated code is prepared by executing the Python functions shown in section
A.3. Finally, section A.4 contains the generated C code.
A.1 Driver
This code starts by parsing the arguments passed by the user and preparing a di-
rectory for the generated source code file (lines 3-10). Then, the generated code
is prepared by the get template dict function (line 11). Next, the code reads the
text of the C code template, by the get template function at line 13. Finally, the
generated code is appended in the template and the output C file is written by the
write template function in line 14.
1 #!/ usr / b in /env python3
2 def main ( ) :
3 args = pa r s e a r g s ( )
4 b u i l d d i r = ’ . / bu i ld ’
5 en su r e d i r ( b u i l d d i r )
6 gen params = { ’ i n t e r l e a v e ’ : a rgs . i n t e r l e av e , ’ verbose ’ : a rgs .
verbose }
7 opt params = [ ’ u n r o l l i ’ , ’ u n r o l l j ’ ]
8 for k in opt params :
9 i f g e t a t t r ( args , k ) i s not None :
10 gen params [ k ] = g e t a t t r ( args , k )
11 t emp la t e d i c t = ge t t emp l a t e d i c t ( args . kerne l , gen params )
12 i f not args . dry run :
13 ke rne l t emp la t e = get template ( args . k e rne l )
57
14 wr i t e t emp la t e ( args . kerne l , ke rne l t emplate , t empla te d i c t ,
b u i l d d i r )
15
16 def get t emplate ( ke rne l ) :
17 import os
18 with open ( os . path . j o i n ( ’ k e rn e l s ’ , k e rne l+’ . i c . t ’ ) ) as
t emp l a t e f i l e :
19 ke rne l t emp la t e = t emp l a t e f i l e . read ( )
20 return ke rne l t emp la t e
21
22 def wr i t e t emp la t e ( kerne l , ke rne l t emplate , t empla te d i c t , b u i l d d i r
) :
23 import os
24 with open ( os . path . j o i n ( bu i l d d i r , ’ g en ke rne l . i c ’ ) , ’w ’ ) as
k e r n e l f i l e :
25 k e r n e l f i l e . wr i t e ( ke rne l t emp la t e . format (** t emp la t e d i c t ) )
26
27 def g e t t emp l a t e d i c t ( kerne l , gen params ) :
28 import os , import l ib , ke rne l s , i n sp e c t
29 kerne l module = impor t l i b . import module ( ’ k e r n e l s . ’+kerne l , ’
k e r n e l s ’ )
30 gene ra to r s = [m for m in i n sp e c t . getmembers ( kerne l module .
Generator , i n sp e c t . i s f u n c t i o n )
31 i f m[ 0 ] [ 0 : 4 ] == ’ gen ’ ]
32 g = kerne l module . Generator (** gen params )
33 t emp la t e d i c t = d i c t ( )
34 for name , method in gene ra to r s :
35 in s t , c l o ck = method ( g )
36 print (name+’ : ’+s t r ( c l o ck ) )
37 t emp la t e d i c t [ name ] = i n s t
38 return t emp la t e d i c t
39
40 def pa r s e a r g s ( ) :
41 import argparse
42 import sys
43 from glob import glob
44 from os . path import basename
45 par s e r = argparse . ArgumentParser ( d e s c r i p t i o n=’ Build . py :
Generates k e rn e l s ’ )
46 par s e r . add argument ( ’−k ’ , ’−−ke rne l ’ , d e f au l t=’mock ’ ,
47 help=’ ke rne l to generate ’ )
48 # arguments f o r the k e rne l ’ s Python genera tor
49 par s e r . add argument ( ’− i ’ , ’−− i n t e r l e a v e ’ , d e f au l t=”1” ,
50 help=’ enable / d i s ab l e i n s t r u c t i o n s
i n t e r l e a v i n g ( d e f au l t 1 ” enabled ”) ’ )
51 par s e r . add argument ( ’−v ’ , ’−−verbose ’ , d e f au l t=”1” , he lp=’Show
the log o f the s imu la tor ( d e f au l t 1 ” enabled ”) ’ )
52 par s e r . add argument ( ’−n ’ , ’−−dry−run ’ , he lp=’Do not wr i t e the
generated ke rne l ’ , a c t i on=’ s t o r e t r u e ’ )
53 par s e r . add argument ( ’−−u n r o l l j ’ , type=int , he lp=’ s p e c i f y the
un r o l l i n g s i z e in J ’ )
54 par s e r . add argument ( ’−−u n r o l l i ’ , type=int , he lp=’ s p e c i f y the
un r o l l i n g s i z e in I ’ )
55 class Li s tAct ion ( argparse . Action ) :
58
56 def c a l l ( s e l f , parser , namespace , va lues , o p t i o n s t r i n g=
None ) :
57 print ( ’ [ Kerne ls ]\n ’ + ’ \n ’ . j o i n ( basename (k ) . r ep l a c e ( ’ . i c
. t ’ , ’ ’ ) for k in glob ( ’ k e rn e l s /* i c . t ’ ) ) )
58 sys . e x i t (0 )
59 par s e r . add argument ( ’− l ’ , ’−− l i s t ’ , a c t i on=ListAct ion , nargs=0,
60 help=’ l i s t a v a i l a b l e k e rn e l s ’ )
61 return par s e r . p a r s e a r g s ( )
62
63 def en su r e d i r (d) :
64 import os , er rno
65 try :
66 os . makedirs (d)
67 except OSError as exc :
68 i f exc . e r rno == errno .EEXIST :
69 pass
70 else : raise
71
72 i f name == ” main ” :
73 main ( )
Listing A.1: The main function in the framework
A.2 Code template
Our C code template example contains the static part of the code and four replace-
ment fields ( {gen header defines!s}, {gen initialize registers!s}, {gen prologue!s},
and {gen inner iter!s}). These fields are replaced by their corresponding generated
code. The general format of the replacement field is {string!s}. We use the terminol-
ogy gen [name] for the string. Each gen [name] has a corresponding code generation
function in Python, which has the same name. Our framework executes these python
functions, starting with gen , and replaces the replacement fields by the returned text
in the C template.
1 /*
2 FP Reg i s t e r s a l l o c a t i o n :
3 0−15: Input data frame ( j i s f a s t moving dimension )
4 16−19: F i r s t s e t o f jam r e s u l t s
5 20−23: Second s e t o f jam r e s u l t s
6 24−31: we i gh t s c o e f f i c i e n t s
7
8 Weights mapping in the s t e n c i l opera tor
9 frame k=0 k=1 k=2
10 j j j
11 7 4 7 6 2 6 7 4 7
12 i 5 3 5 1 0 1 5 3 5
13 7 4 7 6 2 6 7 4 7
14
15 Weight index : 0 1 2 3 4 5 6 7
16 Weight r e g i s t e r : 24 25 26 27 28 29 30 31
17 */
59
18 { g en heade r d e f i n e s ! s }
19
20 #define k e r n e l r e f e r e n c e k e r n e l r e f e r e n c e 2 7p t
21 #define f o r t r a n k e r n e l f o r t r a n k e r n e l 2 7 p t
22
23 i n l i n e void gen ke rne l ( const double * r e s t r i c t a , double * r e s t r i c t r ,
const double * r e s t r i c t w, int i s t r i d e , int j s t r i d e ) {{
24 register const double * r e s t r i c t weights asm ( ”26” ) ;
25 register int k asm ( ”31” ) ;
26 register int s i x t e en asm ( ”27” ) ;
27 register double * a pt r asm ( ”3” ) ;
28 register int next frame asm ( ”4” ) ;
29 register int next j j am asm ( ”5” ) ;
30 register int next i j am asm ( ”6” ) ;
31
32 next frame = 1 − ( i s t r i d e ) * (GEN KERNEL NI JAMS+2−1) − j s t r i d e
* (GEN KERNEL NJ JAMS+2−1) ;
33 next j j am = 8* j s t r i d e ;
34 next i j am = 8*( i s t r i d e − j s t r i d e * (GEN KERNEL NJ JAMS+2−1)) ;
35
36 a pt r = (double *) a − i s t r i d e − 1* j s t r i d e − next frame ;
37 next frame=8*( next frame ) ; // f i r s t update in the pro logue uses
next f rame f o r index update
38
39 s i x t e en = 16 ;
40
41 // I n i t i a l i z e r e g i s t e r s
42 { g e n i n i t i a l i z e r e g i s t e r s ! s }
43
44 // Prologue
45 { gen pro logue ! s }
46
47 a pt r += 1 ;
48
49 // Inner i t e r a t i o n
50 for ( k=1; k < j s t r i d e −2; k+=2) {{
51 { g e n i n n e r i t e r ! s }
52 }}
53 }}
Listing A.2: 27-point stencil kernel’s template
A.3 Code generation
The first step in the code generation is to give variable names to the registers (allocate
registers) as shown in Listing A.3. We allocate registers for both GPR and FPR.
Several variables address the same register in the case of the registers containing
the weight coefficients. This is done to have direct mapping between each weight
coefficient’s value and its stencil point instead of doing convoluted mapping between
the location and the value of each weight coefficient in the stencil operator.
60
The gen header defines function, shown in Listing A.4 defines important values
that are specific to the generated stencil code to have correct execution and perfor-
mance measurements.
We define and initialize our registers at gen initialize registers function, shown
in Listing A.5. It generates the variables definitions and initializations that are re-
quired by the generated code. In this function, the number of allocated variables
changes according to the required unroll-and-jam size, where larger jams requires
more registers.
The loop prologue is generated by the gen prologue function, as shown in Listing
A.6. The generated inline assembly code is responsible for loading the weight coef-
ficients to the registers and perform the prologue’s computations of the loop. The
Python’s code appends the instructions’ objects to the array istream. By the end
of the function’s execution, the list of instructions are passed to the PowerPC 450
simulator (at line 23) to generate the inline assembly of these instructions and to com-
pute the number of simulated cycles to execute the code. By default, the simulator
performs out-of-order scheduling. The user has the option to disable the out-of-order
scheduling using runtime parameters.
Finally, the main loop body is generated by the gen inner iter function, as shown
in Listing A.7 starting from line number 19. For each point in the stencil’s operator,
we perform the same FMA operation for all jams. Hence, we aggregate these common
operations at the function fma block starting from line number 1.
1 class Common:
2 def i n i t ( com se l f , s e l f ) :
3 # crea t e func t i on to dea l wi th b l o c k s o f r e g i s t e r s
4 com se l f . c = ge t c o r e ( u s e t r a c e=s e l f . u s e t r a c e )
5 com se l f . cv = CViewer ( )
6
7 # Al l o ca t e fp r e g i s t e r s
8 FPR pool = com se l f . c . a c q u i r e f p r e g i s t e r s ( range (27 ,−1 ,−1) )
9 # Al l o ca t e input data fp r e g i s t e r s f o r the input data frame
10 com se l f . streams = [ 0 ] * s e l f .FRAME SIZE
11 for i in range ( s e l f .FRAME SIZE) : c om se l f . streams [ i ] = FPR pool .
pop ( )
12 # Al l o ca t e r e s u l t s fp r e g i s t e r s
13 com se l f . r e s u l t s = [ 0 ] * s e l f . JAM SIZE
14 for i in range ( s e l f . JAM SIZE) : c om se l f . r e s u l t s [ i ] = FPR pool .
pop ( )
15 # Al l o ca t e 4 fp r e g i s t e r s f o r the we i gh t s
16 com se l f . w compact = com se l f . c . a c q u i r e f p r e g i s t e r s ( range (28 ,32)
)
17 com se l f . w unique = [ 0 ] * 8
18 com se l f . w unique [ 0 ] = com se l f . w compact [ 0 ]
19 com se l f . w unique [ 1 ] = com se l f . w compact [ 1 ]
20 com se l f . w unique [ 2 ] = com se l f . w compact [ 2 ]
21 com se l f . w unique [ 3 ] = com se l f . w compact [ 0 ]
22 com se l f . w unique [ 4 ] = com se l f . w compact [ 2 ]
23 com se l f . w unique [ 5 ] = com se l f . w compact [ 1 ]
24 com se l f . w unique [ 6 ] = com se l f . w compact [ 3 ]
25 com se l f . w unique [ 7 ] = com se l f . w compact [ 3 ]
61
26
27 com se l f .w = [ 0 ]*27
28 com se l f .w[4+9*1] = com se l f . w unique [ 0 ]
29 com se l f .w[3+9*1] = com se l f .w[5+9*1] = com se l f . w unique [ 1 ]
30 com se l f .w[1+9*1] = com se l f .w[7+9*1] = com se l f . w unique [ 2 ]
31 com se l f .w[4+9*0] = com se l f .w[4+9*2] = com se l f . w unique [ 3 ]
32 com se l f .w[1+9*0] = com se l f .w[7+9*0] = com se l f .w[1+9*2] =
com se l f .w[7+9*2] = com se l f . w unique [ 4 ]
33 com se l f .w[3+9*0] = com se l f .w[5+9*0] = com se l f .w[3+9*2] =
com se l f .w[5+9*2] = com se l f . w unique [ 5 ]
34 com se l f .w[0+9*1] = com se l f .w[2+9*1] = com se l f .w[6+9*1] =
com se l f .w[8+9*1] = com se l f . w unique [ 6 ]
35 com se l f .w[0+9*0] = com se l f .w[2+9*0] = com se l f .w[6+9*0] =
com se l f .w[8+9*0] = \
36 com se l f .w[0+9*2] = com se l f .w[2+9*2] = com se l f .w[6+9*2] =
com se l f .w[8+9*2] = com se l f . w unique [ 7 ]
37
38 # Al l o ca t e i n t r e g i s t e r s
39 # Reserve i n t e g e r r e g i s t e r s f o r streams po in t e r and index ing
40 com se l f . a p t r = In tReg i s t e r (3 , ’ a p t r ’ )
41 com se l f . next frame = In tReg i s t e r (4 , ’ next frame ’ )
42 com se l f . next j j am = In tReg i s t e r (5 , ’ next j j am ’ )
43 com se l f . nex t i j am = In tReg i s t e r (6 , ’ next i j am ’ )
44 com se l f . a index ing = [ com se l f . next j j am ]* s e l f .FRAME SIZE
45 for i in range ( s e l f .FRAME SIZE) :
46 i f i%s e l f .FRAME J == 0 : com se l f . a index ing [ i ] = com se l f .
n ex t i j am
47 com se l f . a index ing [ 0 ] = com se l f . next frame
48 # Resu l t s r e g i s t e r s po in t e r s
49 r e s u l t s s t r = [ ’ r%d%d ’ % ( i // s e l f . JAM J , i%s e l f . JAM J) for i in
range ( s e l f . JAM SIZE) ]
50 com se l f . r e s u l t s p = [ In tReg i s t e r ( s e l f . g p r po i n t e r s [ i ] ,
r e s u l t s s t r [ i ] ) for i in range ( s e l f . JAM SIZE) ]
51 # cons tan t s
52 com se l f . s i x t e en = In tReg i s t e r (27 , ’ s i x t e en ’ )
53 # Weight po in t e r
54 com se l f . w p = In tReg i s t e r (26 , ’ we ights ’ )
Listing A.3: Python’s register allocation class
1 def g en heade r d e f i n e s ( s e l f ) :
2 s t e n c i l s i z e = 27
3 s t r = ’ ’
4 s t r+=’#de f i n e GEN KERNEL A OFFSET (0) \n ’
5 s t r+=’#de f i n e GEN KERNEL R OFFSET (1) \n ’
6 s t r+=’#de f i n e GENKERNELNAME (”%s ”) \n ’ % s e l f . kernel name
7 s t r+=’#de f i n e GEN KERNEL OPERATOR SIZE (%.1 f ) \n ’ % s t e n c i l s i z e
8 s t r+=’#de f i n e GEN KERNEL LS CYCLES (%.1 f ) \n ’ % ( ( s e l f .FRAME SIZE*2
+ s e l f . JAM SIZE) *2) # *2 c y c l e s
9 s t r+=’#de f i n e GEN KERNEL FPU CYCLES (%.1 f ) \n ’ % ( s t e n c i l s i z e * s e l f
. JAM SIZE)
10 s t r+=’#de f i n e GEN KERNEL NI JAMS (%d) \n ’ % s e l f . JAM I
11 s t r+=’#de f i n e GEN KERNEL NJ JAMS (%d) \n ’ % s e l f . JAM J
12 s t r+=’#de f i n e N FLOPS (N STENCILS*%.1 f ) \n ’ % ( s t e n c i l s i z e *2−1)
62
13 return s t r , 0
Listing A.4: Python’s definitions code generation
1 def g e n i n i t i a l i z e r e g i s t e r s ( s e l f ) :
2 i n i t r e g = ’ ’
3 for i in range ( s e l f . JAM SIZE) : # de f i n e the po in t e r s o f the
r e s u l t s s t r i d e s
4 i n i t r e g += ’ r e g i s t e r double * r%d%d asm (”%d”) ;\n ’ % \
5 ( i // s e l f . JAM J , i%s e l f . JAM J , s e l f . g p r po i n t e r s [ i ] )
6 i n i t r e g += ’ \n ’
7 # Assign the r e s u l t s s t r i d e s po in t e r s to the i n t e g e r r e g i s t e r s
8 s h i f t = l i s t ( range ( s e l f . JAM J) )
9 for i in range ( s e l f . JAM SIZE) :
10 i n i t r e g += \
11 ’ r%d%d = ( double *) ( r + %d* i s t r i d e + %d* j s t r i d e − 1) ;\n ’ %
\
12 ( i // s e l f . JAM J , i%s e l f . JAM J , s h i f t [ i // s e l f . JAM J ] , s h i f t [ i%s e l f
. JAM J ] )
13 # Assign the we i gh t s po in t e r to i t s i n t e g e r r e g i s t e r
14 i n i t r e g += ’ \n weights = w − 2 ;\n ’
15 return i n i t r e g , 0
Listing A.5: Python’s registers initialization code generation
1 def gen pro logue ( s e l f ) :
2 com = s e l f .Common( s e l f )
3 i n s t s t r = [ ’ ’ ]
4 i s t ream = [ ]
5 c l o ck count = [ 0 ]
6 ## PROLOGUE
7 # load we i gh t s r e g i s t e r s
8 # Weight index : 3 0 5 1 4 2 7 6
9 # Weight r e g i s t e r : 28 29 30 31
10 # Python Index : 0 1 2 3
11 # Compact index ing :
12 # 7 6 4 2
13 # 5 1 3 0
14 #
15 w map = [ 0 , 1 , 2 , 0 , 2 , 1 , 3 , 3 ]
16 for i in range (8 ) :
17 i f i in [ 0 , 1 , 2 , 6 ] :
18 i s t ream += [ i s a . l f s dux (com . w compact [ w map [ i ] ] , com . w p , com .
s i x t e en ) ]
19 e l i f i in [ 3 , 5 , 4 , 7 ] :
20 i s t ream += [ i s a . l fdux (com . w compact [ w map [ i ] ] , com . w p , com .
s i x t e en ) ]
21 # load frame 1/3
22 i s t ream += [ i s a . l fxdux (com . streams [ i ] , com . a ptr , com . a index ing [ i ] )
for i in range ( s e l f .FRAME SIZE) ]
23 s e l f . generate code append ( istream , com , i n s t s t r , c l o ck count )
24 i s t ream = [ ]
25 return i n s t s t r [ 0 ] , c l o ck count [ 0 ]
Listing A.6: Python’s prologue code generation
63
1 def fma block ( s e l f ,w , streams , r e su l t , stream index , k index ) :
2 i s t ream = [ ]
3 # Determine the curren t we igh t c o e f f i c i e n t
4 weight = w[ stream index%s e l f .FRAME J +3*( s tream index // s e l f .
FRAME J) + 9* k index ]
5 # genera te the i n d i c e s o f the a c t i v e par t o f the frame
6 act ive window = [ ( stream index+ i%s e l f . JAM J + s e l f .FRAME J*( i //
s e l f . JAM J) ) for i in range ( s e l f . JAM SIZE) ]
7 # Do the FMA’ s
8 for i in range ( s e l f . JAM SIZE) :
9 ind = act ive window [ i ]
10 i f k index == 0 :
11 i f st ream index == 0 :
12 i s t ream += [ i s a . fxpmul ( r e s u l t [ i ] , weight , streams [ ind ] ) ] #
over r i d e the r e g i s t e r at the f i r s t wr i t e
13 else :
14 i s t ream += [ i s a . fxcpmadd ( r e s u l t [ i ] , weight , streams [ ind ] , r e s u l t [ i
] ) ]
15 e l i f k index == 1 : i s t ream += [ i s a . fxcxma ( r e s u l t [ i ] , weight ,
streams [ ind ] , r e s u l t [ i ] ) ]
16 e l i f k index == 2 : i s t ream += [ i s a . fxcpmadd ( r e s u l t [ i ] , weight ,
streams [ ind ] , r e s u l t [ i ] ) ]
17 return i s t ream
18
19 def g e n i n n e r i t e r ( s e l f ) :
20 com = s e l f .Common( s e l f )
21 i s t ream = [ ]
22 # do the FMA’ s f o r frame 1/3
23 for i in s e l f . b l o ck ind : i s t ream += s e l f . fma block (com .w, com .
streams , com . r e s u l t s , i , s e l f .K0)
24 # mute f o r frame 2/3
25 i s t ream += [ i s a . l f s dux (com . streams [ i ] , com . a ptr , com . a index ing [ i ] )
for i in range ( s e l f .FRAME SIZE) ]
26 # do the FMA’ s f o r frame 2/3
27 for i in s e l f . b l o ck ind : i s t ream += s e l f . fma block (com .w, com .
streams , com . r e s u l t s , i , s e l f .K1)
28 # mute f o r frame 3/3
29 i s t ream += [ i s a . l fdux (com . streams [ i ] , com . a ptr , com . a index ing [ i ] )
for i in range ( s e l f .FRAME SIZE) ]
30 # do the FMA’ s f o r frame 3/3
31 for i in s e l f . b l o ck ind : i s t ream += s e l f . fma block (com .w, com .
streams , com . r e s u l t s , i , s e l f .K2)
32 # wr i t e back r e s u l t s e t 1/2
33 i s t ream += [ i s a . s t fxdux (com . r e s u l t s [ i ] , com . r e s u l t s p [ i ] , com .
s i x t e en ) for i in range ( s e l f . JAM SIZE) ]
34 return s e l f . g ene ra te code ( istream , com)
Listing A.7: Python’s loop body code generation functions
64
A.4 Generated code
This section presents the generated code of our 27-point stencil example. The out-
of-order scheduling is disabled in Listing A.8 to provide better clarity of the different
stages of the code. Then, we show how the inner loop body appears when our out-of-
order scheduling is used in Listing A.9. This code does not show the mapping between
the registers’ numbers and their corresponding variables’ names in the Python code
to have a more copact view of the code.
1 /*
2 FP Reg i s t e r s a l l o c a t i o n :
3 0−15: Input data frame ( j i s f a s t moving dimension )
4 16−19: F i r s t s e t o f jam r e s u l t s
5 20−23: Second s e t o f jam r e s u l t s
6 24−31: we i gh t s c o e f f i c i e n t s
7
8 Weights mapping in the s t e n c i l opera tor
9 frame k=0 k=1 k=2
10 j j j
11 7 4 7 6 2 6 7 4 7
12 i 5 3 5 1 0 1 5 3 5
13 7 4 7 6 2 6 7 4 7
14
15 Weight index : 0 1 2 3 4 5 6 7
16 Weight r e g i s t e r : 24 25 26 27 28 29 30 31
17 */
18 #define GEN KERNEL A OFFSET (0)
19 #define GEN KERNEL R OFFSET (1)
20 #define GENKERNELNAME (”bagm−mute−27−po int 1x2 jam Kernel − no
i n t e r l e a v i n g ” )
21 #define GEN KERNEL OPERATOR SIZE (27 . 0 )
22 #define GEN KERNEL LS CYCLES (52 . 0 )
23 #define GEN KERNEL FPU CYCLES (54 . 0 )
24 #define GEN KERNEL NI JAMS (1)
25 #define GEN KERNEL NJ JAMS (2)
26 #define N FLOPS (N STENCILS*53 .0 )
27
28
29 #define k e r n e l r e f e r e n c e k e r n e l r e f e r e n c e 2 7p t
30 #define f o r t r a n k e r n e l f o r t r a n k e r n e l 2 7 p t
31
32 i n l i n e void gen ke rne l ( const double * r e s t r i c t a , double * r e s t r i c t r ,
const double * r e s t r i c t w, int i s t r i d e , int j s t r i d e ) {
33 register const double * r e s t r i c t weights asm ( ”26” ) ;
34 register int k asm ( ”31” ) ;
35 register int s i x t e en asm ( ”27” ) ;
36 register double * a pt r asm ( ”3” ) ;
37 register int next frame asm ( ”4” ) ;
38 register int next j j am asm ( ”5” ) ;
39 register int next i j am asm ( ”6” ) ;
40
41 next frame = 1 − ( i s t r i d e ) * (GEN KERNEL NI JAMS+2−1) − j s t r i d e *
(GEN KERNEL NJ JAMS+2−1) ;
65
42 next j j am = 8* j s t r i d e ;
43 next i j am = 8*( i s t r i d e − j s t r i d e * (GEN KERNEL NJ JAMS+2−1)) ;
44
45 a pt r = (double *) a − i s t r i d e − 1* j s t r i d e − next frame ;
46 next frame=8*( next frame ) ; // f i r s t update in the pro logue uses
next f rame fo r index update
47
48 s i x t e en = 16 ;
49
50 // I n i t i a l i z e r e g i s t e r s
51 register double * r00 asm ( ”7” ) ;
52 register double * r01 asm ( ”8” ) ;
53
54 r00 = (double *) ( r + 0* i s t r i d e + 0* j s t r i d e − 1) ;
55 r01 = (double *) ( r + 0* i s t r i d e + 1* j s t r i d e − 1) ;
56
57 weights = w − 2 ;
58
59 // Prologue
60 asm volat i le ( ” l f s dux 28 , %0, %1” : ”+b” ( weights ) : ”b” ( s i x t e en ) ) ;
61 asm volat i le ( ” l f s dux 29 , %0, %1” : ”+b” ( weights ) : ”b” ( s i x t e en ) ) ;
62 asm volat i le ( ” l f s dux 30 , %0, %1” : ”+b” ( weights ) : ”b” ( s i x t e en ) ) ;
63 asm volat i le ( ” l fdux 28 , %0, %1” : ”+b” ( weights ) : ”b” ( s i x t e en ) ) ;
64 asm volat i le ( ” l fdux 30 , %0, %1” : ”+b” ( weights ) : ”b” ( s i x t e en ) ) ;
65 asm volat i le ( ” l fdux 29 , %0, %1” : ”+b” ( weights ) : ”b” ( s i x t e en ) ) ;
66 asm volat i le ( ” l f s dux 31 , %0, %1” : ”+b” ( weights ) : ”b” ( s i x t e en ) ) ;
67 asm volat i le ( ” l fdux 31 , %0, %1” : ”+b” ( weights ) : ”b” ( s i x t e en ) ) ;
68 asm volat i le ( ” l fxdux 0 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next frame ) ) ;
69 asm volat i le ( ” l fxdux 1 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
70 asm volat i le ( ” l fxdux 2 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
71 asm volat i le ( ” l fxdux 3 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
72 asm volat i le ( ” l fxdux 4 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
73 asm volat i le ( ” l fxdux 5 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
74 asm volat i le ( ” l fxdux 6 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
75 asm volat i le ( ” l fxdux 7 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
76 asm volat i le ( ” l fxdux 8 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
77 asm volat i le ( ” l fxdux 9 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
78 asm volat i le ( ” l fxdux 10 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
79 asm volat i le ( ” l fxdux 11 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
80
81
82 a pt r += 1 ;
83
84 // Inner i t e r a t i o n
85 for ( k=1; k < j s t r i d e −2; k+=2) {
86 asm volat i le ( ” fxpmul 12 , 31 , 0” ) ;
87 asm volat i le ( ” fxpmul 13 , 31 , 1” ) ;
88 asm volat i le ( ”fxcpmadd 12 , 30 , 1 , 12” ) ;
89 asm volat i le ( ”fxcpmadd 13 , 30 , 2 , 13” ) ;
90 asm volat i le ( ”fxcpmadd 12 , 31 , 2 , 12” ) ;
91 asm volat i le ( ”fxcpmadd 13 , 31 , 3 , 13” ) ;
92 asm volat i le ( ”fxcpmadd 12 , 29 , 4 , 12” ) ;
93 asm volat i le ( ”fxcpmadd 13 , 29 , 5 , 13” ) ;
94 asm volat i le ( ”fxcpmadd 12 , 28 , 5 , 12” ) ;
66
95 asm volat i le ( ”fxcpmadd 13 , 28 , 6 , 13” ) ;
96 asm volat i le ( ”fxcpmadd 12 , 29 , 6 , 12” ) ;
97 asm volat i le ( ”fxcpmadd 13 , 29 , 7 , 13” ) ;
98 asm volat i le ( ”fxcpmadd 12 , 31 , 8 , 12” ) ;
99 asm volat i le ( ”fxcpmadd 13 , 31 , 9 , 13” ) ;
100 asm volat i le ( ”fxcpmadd 12 , 30 , 9 , 12” ) ;
101 asm volat i le ( ”fxcpmadd 13 , 30 , 10 , 13” ) ;
102 asm volat i le ( ”fxcpmadd 12 , 31 , 10 , 12” ) ;
103 asm volat i le ( ”fxcpmadd 13 , 31 , 11 , 13” ) ;
104 asm volat i le ( ” l f s dux 0 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next frame ) ) ;
105 asm volat i le ( ” l f s dux 1 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
106 asm volat i le ( ” l f s dux 2 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
107 asm volat i le ( ” l f s dux 3 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
108 asm volat i le ( ” l f s dux 4 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
109 asm volat i le ( ” l f s dux 5 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
110 asm volat i le ( ” l f s dux 6 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
111 asm volat i le ( ” l f s dux 7 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
112 asm volat i le ( ” l f s dux 8 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
113 asm volat i le ( ” l f s dux 9 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
114 asm volat i le ( ” l f s dux 10 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
115 asm volat i le ( ” l f s dux 11 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
116 asm volat i le ( ” fxcxma 12 , 31 , 0 , 12” ) ;
117 asm volat i le ( ” fxcxma 13 , 31 , 1 , 13” ) ;
118 asm volat i le ( ” fxcxma 12 , 30 , 1 , 12” ) ;
119 asm volat i le ( ” fxcxma 13 , 30 , 2 , 13” ) ;
120 asm volat i le ( ” fxcxma 12 , 31 , 2 , 12” ) ;
121 asm volat i le ( ” fxcxma 13 , 31 , 3 , 13” ) ;
122 asm volat i le ( ” fxcxma 12 , 29 , 4 , 12” ) ;
123 asm volat i le ( ” fxcxma 13 , 29 , 5 , 13” ) ;
124 asm volat i le ( ” fxcxma 12 , 28 , 5 , 12” ) ;
125 asm volat i le ( ” fxcxma 13 , 28 , 6 , 13” ) ;
126 asm volat i le ( ” fxcxma 12 , 29 , 6 , 12” ) ;
127 asm volat i le ( ” fxcxma 13 , 29 , 7 , 13” ) ;
128 asm volat i le ( ” fxcxma 12 , 31 , 8 , 12” ) ;
129 asm volat i le ( ” fxcxma 13 , 31 , 9 , 13” ) ;
130 asm volat i le ( ” fxcxma 12 , 30 , 9 , 12” ) ;
131 asm volat i le ( ” fxcxma 13 , 30 , 10 , 13” ) ;
132 asm volat i le ( ” fxcxma 12 , 31 , 10 , 12” ) ;
133 asm volat i le ( ” fxcxma 13 , 31 , 11 , 13” ) ;
134 asm volat i le ( ” l fdux 0 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next frame ) ) ;
135 asm volat i le ( ” l fdux 1 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
136 asm volat i le ( ” l fdux 2 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
137 asm volat i le ( ” l fdux 3 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
138 asm volat i le ( ” l fdux 4 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
139 asm volat i le ( ” l fdux 5 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
140 asm volat i le ( ” l fdux 6 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
141 asm volat i le ( ” l fdux 7 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
142 asm volat i le ( ” l fdux 8 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
143 asm volat i le ( ” l fdux 9 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
144 asm volat i le ( ” l fdux 10 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
145 asm volat i le ( ” l fdux 11 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
146 asm volat i le ( ”fxcpmadd 12 , 31 , 0 , 12” ) ;
147 asm volat i le ( ”fxcpmadd 13 , 31 , 1 , 13” ) ;
148 asm volat i le ( ”fxcpmadd 12 , 30 , 1 , 12” ) ;
67
149 asm volat i le ( ”fxcpmadd 13 , 30 , 2 , 13” ) ;
150 asm volat i le ( ”fxcpmadd 12 , 31 , 2 , 12” ) ;
151 asm volat i le ( ”fxcpmadd 13 , 31 , 3 , 13” ) ;
152 asm volat i le ( ”fxcpmadd 12 , 29 , 4 , 12” ) ;
153 asm volat i le ( ”fxcpmadd 13 , 29 , 5 , 13” ) ;
154 asm volat i le ( ”fxcpmadd 12 , 28 , 5 , 12” ) ;
155 asm volat i le ( ”fxcpmadd 13 , 28 , 6 , 13” ) ;
156 asm volat i le ( ”fxcpmadd 12 , 29 , 6 , 12” ) ;
157 asm volat i le ( ”fxcpmadd 13 , 29 , 7 , 13” ) ;
158 asm volat i le ( ”fxcpmadd 12 , 31 , 8 , 12” ) ;
159 asm volat i le ( ”fxcpmadd 13 , 31 , 9 , 13” ) ;
160 asm volat i le ( ”fxcpmadd 12 , 30 , 9 , 12” ) ;
161 asm volat i le ( ”fxcpmadd 13 , 30 , 10 , 13” ) ;
162 asm volat i le ( ”fxcpmadd 12 , 31 , 10 , 12” ) ;
163 asm volat i le ( ”fxcpmadd 13 , 31 , 11 , 13” ) ;
164 asm volat i le ( ” st fxdux 12 , %0, %1” : ”+b” ( r00 ) : ”b” ( s i x t e en ) ) ;
165 asm volat i le ( ” st fxdux 13 , %0, %1” : ”+b” ( r01 ) : ”b” ( s i x t e en ) ) ;
166 }
167 }
Listing A.8: Generated 27-point stencil for a 1x2 jam kernel
1 asm volat i le ( ” fxpmul 12 , 31 , 0” ) ;
2 asm volat i le ( ” l f s dux 0 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next frame ) ) ;
3 asm volat i le ( ” fxpmul 13 , 31 , 1” ) ;
4 asm volat i le ( ”fxcpmadd 12 , 30 , 1 , 12” ) ;
5 asm volat i le ( ” l f s dux 1 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
6 asm volat i le ( ”fxcpmadd 13 , 30 , 2 , 13” ) ;
7 asm volat i le ( ”fxcpmadd 12 , 31 , 2 , 12” ) ;
8 asm volat i le ( ” l f s dux 2 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
9 asm volat i le ( ”fxcpmadd 13 , 31 , 3 , 13” ) ;
10 asm volat i le ( ” l f s dux 3 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
11 asm volat i le ( ”fxcpmadd 12 , 29 , 4 , 12” ) ;
12 asm volat i le ( ” l f s dux 4 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
13 asm volat i le ( ”fxcpmadd 13 , 29 , 5 , 13” ) ;
14 asm volat i le ( ”fxcpmadd 12 , 28 , 5 , 12” ) ;
15 asm volat i le ( ” l f s dux 5 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
16 asm volat i le ( ”fxcpmadd 13 , 28 , 6 , 13” ) ;
17 asm volat i le ( ”fxcpmadd 12 , 29 , 6 , 12” ) ;
18 asm volat i le ( ” l f s dux 6 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
19 asm volat i le ( ”fxcpmadd 13 , 29 , 7 , 13” ) ;
20 asm volat i le ( ” l f s dux 7 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
21 asm volat i le ( ”fxcpmadd 12 , 31 , 8 , 12” ) ;
22 asm volat i le ( ” l f s dux 8 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
23 asm volat i le ( ”fxcpmadd 13 , 31 , 9 , 13” ) ;
24 asm volat i le ( ”fxcpmadd 12 , 30 , 9 , 12” ) ;
25 asm volat i le ( ” l f s dux 9 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
26 asm volat i le ( ”fxcpmadd 13 , 30 , 10 , 13” ) ;
27 asm volat i le ( ”fxcpmadd 12 , 31 , 10 , 12” ) ;
28 asm volat i le ( ” l f s dux 10 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
29 asm volat i le ( ”fxcpmadd 13 , 31 , 11 , 13” ) ;
30 asm volat i le ( ” l f s dux 11 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
31 asm volat i le ( ” fxcxma 12 , 31 , 0 , 12” ) ;
32 asm volat i le ( ” l fdux 0 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next frame ) ) ;
68
33 asm volat i le ( ” fxcxma 13 , 31 , 1 , 13” ) ;
34 asm volat i le ( ” fxcxma 12 , 30 , 1 , 12” ) ;
35 asm volat i le ( ” l fdux 1 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
36 asm volat i le ( ” fxcxma 13 , 30 , 2 , 13” ) ;
37 asm volat i le ( ” fxcxma 12 , 31 , 2 , 12” ) ;
38 asm volat i le ( ” l fdux 2 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
39 asm volat i le ( ” fxcxma 13 , 31 , 3 , 13” ) ;
40 asm volat i le ( ” l fdux 3 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
41 asm volat i le ( ” fxcxma 12 , 29 , 4 , 12” ) ;
42 asm volat i le ( ” l fdux 4 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
43 asm volat i le ( ” fxcxma 13 , 29 , 5 , 13” ) ;
44 asm volat i le ( ” fxcxma 12 , 28 , 5 , 12” ) ;
45 asm volat i le ( ” l fdux 5 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
46 asm volat i le ( ” fxcxma 13 , 28 , 6 , 13” ) ;
47 asm volat i le ( ” fxcxma 12 , 29 , 6 , 12” ) ;
48 asm volat i le ( ” l fdux 6 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
49 asm volat i le ( ” fxcxma 13 , 29 , 7 , 13” ) ;
50 asm volat i le ( ” l fdux 7 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
51 asm volat i le ( ” fxcxma 12 , 31 , 8 , 12” ) ;
52 asm volat i le ( ” l fdux 8 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next i j am ) ) ;
53 asm volat i le ( ” fxcxma 13 , 31 , 9 , 13” ) ;
54 asm volat i le ( ” fxcxma 12 , 30 , 9 , 12” ) ;
55 asm volat i le ( ” l fdux 9 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
56 asm volat i le ( ” fxcxma 13 , 30 , 10 , 13” ) ;
57 asm volat i le ( ” fxcxma 12 , 31 , 10 , 12” ) ;
58 asm volat i le ( ” l fdux 10 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
59 asm volat i le ( ” fxcxma 13 , 31 , 11 , 13” ) ;
60 asm volat i le ( ” l fdux 11 , %0, %1” : ”+b” ( a pt r ) : ”b” ( next j j am ) ) ;
61 asm volat i le ( ”fxcpmadd 12 , 31 , 0 , 12” ) ;
62 asm volat i le ( ”fxcpmadd 13 , 31 , 1 , 13” ) ;
63 asm volat i le ( ”fxcpmadd 12 , 30 , 1 , 12” ) ;
64 asm volat i le ( ”fxcpmadd 13 , 30 , 2 , 13” ) ;
65 asm volat i le ( ”fxcpmadd 12 , 31 , 2 , 12” ) ;
66 asm volat i le ( ”fxcpmadd 13 , 31 , 3 , 13” ) ;
67 asm volat i le ( ”fxcpmadd 12 , 29 , 4 , 12” ) ;
68 asm volat i le ( ”fxcpmadd 13 , 29 , 5 , 13” ) ;
69 asm volat i le ( ”fxcpmadd 12 , 28 , 5 , 12” ) ;
70 asm volat i le ( ”fxcpmadd 13 , 28 , 6 , 13” ) ;
71 asm volat i le ( ”fxcpmadd 12 , 29 , 6 , 12” ) ;
72 asm volat i le ( ”fxcpmadd 13 , 29 , 7 , 13” ) ;
73 asm volat i le ( ”fxcpmadd 12 , 31 , 8 , 12” ) ;
74 asm volat i le ( ”fxcpmadd 13 , 31 , 9 , 13” ) ;
75 asm volat i le ( ”fxcpmadd 12 , 30 , 9 , 12” ) ;
76 asm volat i le ( ”fxcpmadd 13 , 30 , 10 , 13” ) ;
77 asm volat i le ( ”fxcpmadd 12 , 31 , 10 , 12” ) ;
78 asm volat i le ( ”fxcpmadd 13 , 31 , 11 , 13” ) ;
79 asm volat i le ( ” st fxdux 12 , %0, %1” : ”+b” ( r00 ) : ”b” ( s i x t e en ) ) ;
80 asm volat i le ( ” st fxdux 13 , %0, %1” : ”+b” ( r01 ) : ”b” ( s i x t e en ) ) ;
Listing A.9: Generated 27-point stencil loop body with instruction interleaving for a
1x2 jam kernel
69
REFERENCES
[1] Sosa, C. and International Business Machines Corporation Organization.
International Technical Support, IBM system Blue Gene solution: Blue Gene/P
application development. IBM International Technical Support Organization,
2008. [Online]. Available: http://www.citeulike.org/user/inis afalon/article/
7678013
[2] Dac C. Pham, Tony Aipperspach, David Boerstler, Mark Bolliger, Rajat
Chaudhry, Dennis Cox, Paul Harvey, Paul M. Harvey, H. Peter Hofstee,
Charles Johns, Jim Kahle, Atsushi Kameyama, John Keaty, Yoshio
Masubuchi, Mydung Pham, Ju¨rgen Pille, Stephen Posluszn, Kazuaki
Yazawa, “Overview of the architecture, circuit design, and physical
implementation of a first-generation cell processor,” Solid-State Circuits,
IEEE Journal of, vol. 41, no. 1, pp. 179–196, 2006. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=1564359
[3] P. H. Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash,
Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert
Cavin, Roger Espasa, Ed Grochowski, Toni Juan, “Larrabee: a many-core x86
architecture for visual computing,” in ACM SIGGRAPH 2008 papers. ACM,
2008, pp. 1–15. [Online]. Available: http://portal.acm.org/citation.cfm?id=
1360612.1360617
[4] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A
unified graphics and computing architecture,” Micro, IEEE, vol. 28, no. 2,
pp. 39–55, 2008. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs all.jsp?
arnumber=4523358
[5] D. Keyes, “Exaflop/s: The why and the how,” Comptes Rendus
Me´canique, vol. 339, no. 2-3, pp. 70–77, 2011. [Online]. Available:
http://linkinghub.elsevier.com/retrieve/pii/S1631072110002032
[6] W. Briggs and S. McCormick, A multigrid tutorial. Society for Industrial
Mathematics, 2000. [Online]. Available: http://books.google.com/books?hl=
en&lr=&id=SRAZwqAkrQMC&oi=fnd&pg=PA1&dq=briggs+multigrid&ots=
WLIA6jiqfd&sig=CI2SeFCTq0ILmPWepiYwr5ADOCg
[7] M. J. Berger and J. Oliger, “Adaptive Mesh Refinement for Hyperbolic Partial
Differential Equations,” Journal of Computational Physics, vol. 53, no. 3, pp.
70
484–512, 1984. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/
0021999184900731
[8] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi,
P. Frederickson, T. Lasinski, R. Schreiber, H.D. Simon, V. Venkatakrishnan,
and S. Weeratunga, “The NAS parallel benchmarks,” International Journal of
High Performance Computing Applications, vol. 5, no. 3, p. 63, 1991. [Online].
Available: http://hpc.sagepub.com/content/5/3/63.short
[9] W. Feng and K. Cameron, “The Green500 List: Encouraging Sustainable Super-
computing,” COMPUTER, pp. 50–55, 2007.
[10] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, “The
cat is out of the bag: cortical simulations with 109 neurons, 1013 synapses,”
in Proceedings of the Conference on High Performance Computing Networking,
Storage and Analysis, ser. SC ’09. New York, NY, USA: ACM, 2009, pp.
63:1—-63:12. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654124
[11] D. F. Richards, J. N. Glosli, B. Chan, M. R. Dorr, E. W. Draeger,
J.-L. Fattebert, W. D. Krauss, T. Spelce, F. H. Streitz, M. P. Surh, and
J. A. Gunnels, “Beyond homogeneous decomposition: scaling long-range
forces on Massively Parallel Systems,” in Proceedings of the Conference on
High Performance Computing Networking, Storage and Analysis, ser. SC ’09.
New York, NY, USA: ACM, 2009, pp. 60:1—-60:12. [Online]. Available:
http://doi.acm.org/10.1145/1654059.1654121
[12] A. Ghoting and K. Makarychev, “Indexing genomic sequences on the
IBM Blue Gene,” in Proceedings of the Conference on High Performance
Computing Networking, Storage and Analysis, ser. SC ’09. New York,
NY, USA: ACM, 2009, pp. 61:1—-61:11. [Online]. Available: http:
//doi.acm.org/10.1145/1654059.1654122
[13] A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey, “3.5-D Blocking
Optimization for Stencil Computations on Modern CPUs and GPUs,” sc, pp.
1–13, 2010. [Online]. Available: http://www.computer.org/portal/web/csdl/
doi/10.1109/SC.2010.2
[14] K. Datta, “Auto-tuning Stencil Codes for Cache-Based Multicore Platforms,”
Ph.D. dissertation, EECS Department, University of California, Berkeley, Dec.
2009. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/
EECS-2009-177.html
[15] Z. Li and Y. Song, “Automatic tiling of iterative stencil loops,” ACM
Transactions on Programming Languages and Systems (TOPLAS), vol. 26,
no. 6, pp. 975–1028, 2004. [Online]. Available: http://portal.acm.org/citation.
cfm?id=1034777
71
[16] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev,
and P. Sadayappan, “Effective automatic parallelization of stencil computa-
tions,” ACM Sigplan Notices, vol. 42, no. 6, p. 235, 2007. [Online]. Available:
http://portal.acm.org/citation.cfm?id=1250734.1250761
[17] H. Dursun, K.-I. Nomura, L. Peng, R. Seymour, W. Wang, R. K. Kalia,
A. Nakano, and P. Vashishta, “A Multilevel Parallelization Framework for
High-Order Stencil Computations,” in Proceedings of the 15th International
Euro-Par Conference on Parallel Processing, ser. Euro-Par ’09. Berlin,
Heidelberg: Springer-Verlag, 2009, pp. 642–653. [Online]. Available: http:
//dx.doi.org/10.1007/978-3-642-03869-3 61
[18] L. Peng, R. Seymour, K.-I. Nomura, R. K. Kalia, A. Nakano, P. Vashishta,
A. Loddoch, M. Netzband, W. R. Volz, and C. C. Wong, “High-order
stencil computations on multicore clusters,” IPDPS, pp. 1–11, 2009. [Online].
Available: http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=5161011
[19] S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick, “Lattice
Boltzmann simulation optimization on leading multicore platforms,” 2008 IEEE
International Symposium on Parallel and Distributed Processing, vol. 69, no. 9,
pp. 1–14, 2008. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/
wrapper.htm?arnumber=4536295
[20] M. Frigo and V. Strumpen, “Cache Oblivious Stencil Computations,” ICS 05
Proceedings of the 19th annual international conference on Supercomputing,
vol. 1, no. 212, pp. 361–366, 2005. [Online]. Available: http://portal.acm.org/
citation.cfm?doid=1088149.1088197
[21] G. Wellein, G. Hager, T. Zeiser, M. Wittmann, and H. Fehske, “Efficient
Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront
Parallelization,” 2009 33rd Annual IEEE International Computer Software
and Applications Conference, pp. 579–586, 2009. [Online]. Available:
http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5254211
[22] S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick, “Impact of
modern memory subsystems on cache optimizations for stencil computations,”
Memory System Performance, pp. 36–43, 2005. [Online]. Available: http:
//portal.acm.org/citation.cfm?id=1111589
[23] S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick, “Implicit
and explicit optimizations for stencil computations,” Proceedings of the 2006
workshop on Memory system performance and correctness - MSPC ’06, p. 51,
2006. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1178597.
1178605
[24] S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams, “An auto-tuning
framework for parallel multicore stencil computations,” in Parallel & Distributed
72
Processing (IPDPS), 2010 IEEE International Symposium on. IEEE, 2010, pp.
1–12. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=
5470421
[25] A. Solar-Lezama, G. Arnold, L. Tancau, R. Bodik, V. Saraswat, and S. Seshia,
“Sketching stencils,” in Proceedings of the 2007 ACM SIGPLAN conference on
Programming language design and implementation. ACM, 2007, pp. 167–178.
[Online]. Available: http://portal.acm.org/citation.cfm?id=1250754
[26] A. Ganapathi, K. Datta, A. Fox, and D. Patterson, “A case for machine learn-
ing to optimize multicore performance,” in Proceedings of the First USENIX
conference on Hot topics in parallelism. USENIX Association, 2009, pp. 1–1.
[27] G. Rivera and C. W. Tseng, “Tiling optimizations for 3D scientific
computations,” in Proceedings of SC00. IEEE Computer Society, 2000.
[Online]. Available: http://portal.acm.org/citation.cfm?id=370403
[28] C. Mueller and B. Martin, “CorePy: High-Productivity Cell/BE Programming,”
Applications for the Cell/BE, 2007. [Online]. Available: http://sti.cc.gatech.
edu/Slides/Mueller-070619.pdf
[29] C. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S. Du Toit,
Z. Wang, Z. Du, Y. Chen, G. Wu, P. Guo, Z. Liu, and D. Zhang,
“Intel’s Array Building Blocks: A Retargetable, Dynamic Compiler and
Embedded Language,” Proceedings of Code Generation and Optimization,
2011. [Online]. Available: http://software.intel.com/en-us/blogs/wordpress/
wp-content/uploads/2011/03/ArBB-CGO2011-distr.pdf
[30] A. Eichenberger, P. Wu, and K. O’Brien, “Vectorization for SIMD architectures
with alignment constraints,” ACM SIGPLAN Notices, vol. 39, no. 6, pp. 82–93,
2004. [Online]. Available: http://portal.acm.org/citation.cfm?id=996853
[31] T. Henretty, K. Stock, L. Pouchet, F. Franchetti, J. Ramanujam, and
P. Sadayappan, “Data Layout Transformation for Stencil Computations on
Short-Vector SIMD Architectures,” in Compiler Construction. Springer,
2011, pp. 225–245. [Online]. Available: http://www.springerlink.com/index/
75714T3217111443.pdf
[32] D. Callahan, J. Cocke, and K. Kennedy, “Estimating interlock and
improving balance for pipelined architectures 1,” Journal of Parallel and
Distributed Computing, vol. 5, no. 4, pp. 334–358, 1988. [Online]. Available:
http://linkinghub.elsevier.com/retrieve/pii/0743731588900020
[33] S. Carr and K. Kennedy, “Improving the ratio of memory operations
to floating-point operations in loops,” ACM Transactions on Programming
Languages and Systems (TOPLAS), vol. 16, no. 6, pp. 1768–1810, 1994.
[Online]. Available: http://portal.acm.org/citation.cfm?id=197366
73
[34] IBM Blue Gene Team, “Overview of the IBM Blue Gene/P project,” IBM
J. Res. Dev., vol. 52, no. 1/2, pp. 199—-220, 2008. [Online]. Available:
http://portal.acm.org/citation.cfm?id=1375990.1376008
[35] C. Chang, C. Chen, and C. King, “Using integer linear programming
for instruction scheduling and register allocation in multi-issue processors
1,” Computers & Mathematics with Applications, vol. 34, no. 9, pp.
1–14, 1997. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/
S0898122197001843
[36] K. Wilken, “Optimal instruction scheduling using integer programming,” ACM
SIGPLAN Notices, 2000. [Online]. Available: http://portal.acm.org/citation.
cfm?id=349318
[37] A. Malik, “Constraint programming techniques for optimal instruction schedul-
ing,” Ph.D. dissertation, 2008. [Online]. Available: http://citeseerx.ist.psu.edu/
viewdoc/download?doi=10.1.1.146.3980&amp;rep=rep1&amp;type=pdf
[38] J. D. McCalpin, “Memory bandwidth and machine balance in current high per-
formance computers,” IEEE Computer Society Technical Committee on Com-
puter Architecture (TCCA) Newsletter, pp. 19–25, Dec. 1995.
[39] K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, and K. Yelick,
“Auto-tuning the 27-point Stencil for Multicore,” in Proc. iWAPT2009: The
Fourth International Workshop on Automatic Performance Tuning, 2009.
[Online]. Available: http://crd.lbl.gov/∼{}oliker/papers/iwapt09.pdf
[40] M. Araya-Polo, F. Rubio, R. De, M. Hanzich, and J. Mar´ıa, “3D seismic imaging
through reverse-time migration on homogeneous and heterogeneous multi-core
processors,” Scientific Programming, vol. 17, pp. 185–198, 2009.
[41] J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J.-C. Andre,
D. Barkai, J.-Y. Berthou, T. Boku, B. Braunschweig, F. Cappello, B. Chapman,
C. Xuebin, A. Choudhary, S. Dosanjh, T. Dunning, S. Fiore, A. Geist,
W. Gropp, R. Harrison, M. Hereld, M. Heroux, A. Hoisie, K. Hotta,
J. Zhong, Y. Ishikawa, F. Johnson, S. Kale, R. Kenway, D. Keyes,
B. Kramer, J. Labarta, A. Lichnewsky, T. Lippert, B. Lucas, B. Maccabe,
S. Matsuoka, P. Messina, P. Michielse, B. Mohr, M. S. Mueller, W. E.
Nagel, H. Nakashima, M. E. Papka, D. Reed, M. Sato, E. Seidel, J. Shalf,
D. Skinner, M. Snir, T. Sterling, R. Stevens, F. Streitz, B. Sugar, S. Sumimoto,
W. Tang, J. Taylor, R. Thakur, A. Trefethen, M. Valero, A. van der Steen,
J. Vetter, P. Williams, R. Wisniewski, and K. Yelick, “The international
exascale software project roadmap,” International Journal of High Performance
Computing Applications, vol. 25, no. 1, pp. 3–60, 2011. [Online]. Available:
http://hpc.sagepub.com/content/25/1/3.abstract
74
[42] C. Shu, “High-order finite difference and finite volume WENO schemes and dis-
continuous Galerkin methods for CFD,” International Journal of Computational
Fluid Dynamics, vol. 17, no. 2, pp. 107–118, 2003.
