Tiling Stencil Computations To Maximize Parallelism by Bandishti, Vinayaka Prakasha
Abstract
Stencil computations are iterative kernels often used to simulate the change in a discretized
spatial domain over time (e.g., computational fluid dynamics) or to solve for unknowns in a
discretized space by converging to a steady state (e.g., when solving partial differential
equations). They are commonly found in scientific and engineering applications. Most stencil
computations allow tile-wise concurrent start, i.e., there exists a face of the iteration space and a
set of tiling hyperplanes such that all tiles along that face can be started concurrently. This
provides load balance and maximizes parallelism.
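As a concrete illustration of the kind of kernel meant here, the following is a minimal sketch of a 3-point Jacobi stencil for the 1-D heat equation. The function name `heat_1d`, the coefficient `alpha`, and the fixed-boundary treatment are assumptions chosen for illustration; they are not taken from the thesis.

```python
def heat_1d(initial, steps, alpha=0.25):
    """Run `steps` time iterations of a 3-point Jacobi stencil.

    Each interior cell at time t+1 is computed from its own value and its
    two neighbours at time t. The two boundary cells are held fixed.
    """
    prev = list(initial)
    n = len(prev)
    for _ in range(steps):
        cur = prev[:]  # boundary cells are carried over unchanged
        for i in range(1, n - 1):
            cur[i] = prev[i] + alpha * (prev[i - 1] - 2.0 * prev[i] + prev[i + 1])
        prev = cur
    return prev
```

The outer loop is the time dimension and the inner loop the space dimension; together they form the two-dimensional iteration space that tiling operates on.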
Loop tiling is a key transformation used to exploit both data locality and parallelism from
stencils simultaneously. Numerous works target improving locality and controlling the
frequency of synchronization and the volume of communication wherever applicable. However,
concurrent start-up of tiles, which translates directly into perfect load balance and often into a
reduction in the frequency of synchronization, has been completely ignored. Existing automatic
tiling frameworks often choose hyperplanes that lead to pipelined start-up and load imbalance.
We address this issue with a new tiling technique that ensures concurrent start-up as well as
perfect load balance whenever possible. We first provide necessary and sufficient conditions on
tiling hyperplanes to enable concurrent start for programs with affine data accesses. We then
discuss an iterative approach to find such hyperplanes.
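The abstract does not state the conditions themselves. The sketch below is therefore only an illustration for a 1-D, 3-point stencil, under two assumptions: the standard tiling legality test (every hyperplane has a non-negative dot product with every dependence distance vector), and the assumed form of the concurrent-start condition, namely that the normal of the start face is a strictly positive combination of the tiling hyperplanes. All names (`DEPS`, `is_valid_tiling`, `allows_concurrent_start`) are hypothetical.

```python
from fractions import Fraction

# Dependence distance vectors (time, space) of a 3-point 1-D stencil:
# iteration (t+1, i) reads values produced at (t, i-1), (t, i), (t, i+1).
DEPS = [(1, -1), (1, 0), (1, 1)]


def is_valid_tiling(hyperplanes, deps):
    """Standard legality test: h . d >= 0 for every hyperplane h and dependence d."""
    return all(h[0] * d[0] + h[1] * d[1] >= 0 for h in hyperplanes for d in deps)


def conic_coefficients(h1, h2, face):
    """Solve face = a*h1 + b*h2 by Cramer's rule; None if h1 and h2 are dependent."""
    det = h1[0] * h2[1] - h1[1] * h2[0]
    if det == 0:
        return None
    a = Fraction(face[0] * h2[1] - face[1] * h2[0], det)
    b = Fraction(h1[0] * face[1] - h1[1] * face[0], det)
    return a, b


def allows_concurrent_start(h1, h2, face, deps):
    """Assumed condition: the tiling is legal and `face` lies strictly inside
    the cone spanned by h1 and h2 (both coefficients strictly positive)."""
    if not is_valid_tiling([h1, h2], deps):
        return False
    coeffs = conic_coefficients(h1, h2, face)
    return coeffs is not None and all(c > 0 for c in coeffs)
```

With the time axis `(1, 0)` as the face normal, the pair `(1, 1)` and `(1, -1)` passes this check (both coefficients are 1/2), whereas the time-skewed pair `(1, 0)` and `(1, 1)` is legal but fails it, because the second coefficient is zero. The latter corresponds to pipelined start-up.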
It is not possible to directly apply automatic tiling techniques to periodic stencils because
of the wrap-around dependences in them. To overcome this, we use iteration space folding
techniques as a pre-processing stage after which our technique can be applied without any
further change.
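To see why wrap-around dependences are a problem, the sketch below enumerates the spatial dependence distances of a 3-point stencil on `n` cells, with and without periodic boundaries, and applies the standard legality test to a hyperplane. It does not implement iteration space folding; the function names and the choice of `n` are assumptions made for illustration.

```python
def spatial_distances(n, periodic):
    """Return the set of (target - source) spatial distances of a 3-point stencil."""
    dists = set()
    for i in range(n):
        for off in (-1, 0, 1):
            src = i + off
            if periodic:
                src %= n  # wrap around the domain boundary
            elif not 0 <= src < n:
                continue  # non-periodic: no neighbour outside the domain
            dists.add(i - src)
    return dists


def hyperplane_is_legal(h, n, periodic):
    """Check h . (1, d) >= 0 for every dependence (1 time step, spatial distance d)."""
    return all(h[0] * 1 + h[1] * d >= 0 for d in spatial_distances(n, periodic))
```

For `n = 6`, the non-periodic stencil has distances `{-1, 0, 1}`, while the periodic one also has `-5` and `5`. These long dependences grow with the domain size, so a hyperplane such as `(1, 1)` that is legal for the non-periodic stencil violates the legality test in the periodic case.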
We have implemented our techniques on top of Pluto, a source-level automatic parallelizer.
Experimental evaluation on a 12-core Intel Westmere shows that our code is able to outperform
a tuned domain-specific stencil code generator by 4% to 2×, and previous compiler techniques
by a factor of 1.5× to 15×. For the swim benchmark from SPECFP2000, we achieve an
improvement of 5.12× on a 12-core Intel Westmere machine and 2.5× on a 16-core AMD
Magny-Cours machine, over the auto-parallelizer of the Intel C Compiler.
