Robust structured multigrid at extreme scales by Reisner, Andrew
© 2019 Andrew Reisner




Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2019
Urbana, Illinois
Doctoral Committee:
Professor Luke Olson, Chair and Director of Research




The solution of elliptic partial differential equations is a common performance bottleneck
in scientific simulations. By exploiting structure in a problem, robust structured multigrid
methods gain important performance benefits because they preserve structure throughout
the multigrid hierarchy. In parallel these methods benefit from nearest neighbor stencil-
based communication patterns; however, the increased communication demands of coarse-
grid problems and block smoothers needed for a robust solver challenge parallel efficiency.
In this dissertation, methods for reducing parallel communication through changes in the
parallel implementation are explored. To reduce communication costs for coarse-grid prob-
lems, recursive agglomeration of tasks on a logically structured grid of processors is consid-
ered. This communication is optimized using a predictive performance model to guide how
tasks are agglomerated. This approach provides an efficient strategy for parallel coarsening
in a structured setting that can adapt to changes in the target architecture or multigrid al-
gorithm through its incorporation in the performance model. Parallel results show favorable
weak scaling using this strategy out to 500k cores and consistency of the performance model
in quantifying the cost of various redistribution decisions.
To reduce communication costs in block smoothers, an automated strategy for aggregat-
ing communication across blocks is considered. With minor changes to a block solver due
to the introduction of a service abstraction layer, user-level threads are used to execute
blocks concurrently so communication can be aggregated. This results in a reduction in the
amount of messages sent during a block smoothing operation. This strategy is demonstrated
in plane smoothing to extend the strong scaling limit by reducing communication latency
costs. Parallel results demonstrate scalable multilevel relaxation with log p communication
complexity and plane relaxation with automated communication aggregation that doubles
the strong scaling performance of a V-cycle.
Lastly, the application of robust structured solvers to emerging heterogeneous architectures
is considered. Benchmarks are used to develop a performance expectation for structured
matrix-based operations on each target processing unit. OpenMP with unified memory
is then used to offload solve phase operations in the open-source, structured variational
multigrid solver Cedar [1]. The performance expectation is then used to provide context
for performance gains by targeting GPUs on Sierra—a current Power9 system at Lawrence
Livermore National Laboratory. Results show speedup of a Cedar V-cycle targeting a V100
GPU over a Power9 CPU consistent with an approximate speedup estimated by comparing
achievable memory bandwidth on each processing unit.
ii
ACKNOWLEDGMENTS
Many people have provided crucial guidance and support towards the completion of this
work. I would first like to thank my advisor Luke Olson for his invaluable insight in con-
structing concise research questions and conducting quality research. I would next like to
thank David Moulton who through numerous lab visits and continued collaboration, pro-
vided much useful guidance and advice significantly improving the quality of publications.
I would also like to thank my committee members, Bill Gropp and Andreas Kloeckner.
Additionally, I would like to thank the Computer Science and Math professors at Wartburg
College for providing an excellent and supportive education. Specifically, I would like to
thank John Zelle for his valuable advice and Mariah Birgen for encouraging me toward an
education in computer science.
Finally, I would like to thank my friends and family for their support over the last six
years with special thanks to Imani Palmer for her support during the writing process.
This work was carried out under the auspices of the National Nuclear Security Adminis-
tration of the U.S. Department of Energy, at the University of Illinois at Urbana-Champaign
under Award Number DE-NA0002374, and at Los Alamos National Laboratory under Con-
tract Number DE-AC52-06NA25396, and was partially supported by the Advanced Simula-
tion and Computing / Advanced Technology Development and Mitigation Program.
This research is part of the Blue Waters sustained-petascale computing project, which is
supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and
the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-
Champaign and its National Center for Supercomputing Applications.
This research used resources of the Argonne Leadership Computing Facility, which is a
DOE Office of Science User Facility supported under Contract DE--AC05--00OR22725.
Special thanks to Kyle Mackay from the University of Illinois at Urbana-Champaign for
the mesh and image files in Figure 4.15.
iii
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Application: XPACC Center . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
CHAPTER 2 COMPONENTS OF ROBUST MULTIGRID . . . . . . . . . . . . . . 7
2.1 Robust Variational Multigrid on Structured Grids . . . . . . . . . . . . . . . 8
CHAPTER 3 OPTIMIZED COARSE-GRID REDISTRIBUTION . . . . . . . . . . 18
3.1 Parallel Robust Variational Multigrid on Structured Grids . . . . . . . . . . 20
3.2 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Optimized Parallel Redistribution Algorithm . . . . . . . . . . . . . . . . . . 28
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
CHAPTER 4 STRUCTURED RELAXATION . . . . . . . . . . . . . . . . . . . . . 45
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Parallel Structured Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Automated Communication Aggregation . . . . . . . . . . . . . . . . . . . . 57
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
CHAPTER 5 EMERGING ARCHITECTURES . . . . . . . . . . . . . . . . . . . . 72
5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Performance Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
CHAPTER 6 CEDAR FRAMEWORK . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 91
CHAPTER 7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
iv
CHAPTER 1: INTRODUCTION
The performance of many numerical simulations is limited by the efficiency of solvers
for elliptic partial differential equations. Although increased parallelism on emerging high
performance computing architectures facilitates solving larger problems or meeting higher
accuracy demands, these globally coupled problems present significant challenges to their
efficient solution on modern large-scale parallel systems. By taking advantage of structure in
a problem, these challenges can be addressed in a setting where coarse-grid communication
patterns are known a priori and local computation uses direct memory access. By avoiding
indirection used in unstructured sparse matrix storage formats and by maintaining efficient
stencil-based operators on coarse grids, robust structured solvers gain notable performance
benefits (e.g., 10× speedup over Algebraic Multigrid (AMG) for heterogeneous diffusion
problems [2]).
1.1 APPLICATION: XPACC CENTER
Electric field calculations play a key role in plasma-coupled combustion simulations. The
electric field is significant to the formation of plasmas. One canonical problem studied in
the Center for Exascale Simulation of Plasma-Coupled Combustion (XPACC), is shown in
Figure 1.1. Figure 1.1a shows a fuel entering a cross-flowing oxidizer. Near the top of
the jet, a plasma actuator is positioned consisting of electrodes separated by a dielectric.
This setup is meshed using multiple overlapping curvilinear grids as shown in Figure 1.1b.
Figure 1.1c shows an experimental image of this setup. The electric field is described through
the potential, E = ∇φ(x), and satisfies
−∇ · ε(x)∇φ = f in Ω, (1.1)
[n · ε(x)∇φ] = g on ΓI , (Interface Conditions)
n · ∇φ = 0 on ΓN , (Boundary Conditions)
φ = φground on ΓE− , (Ground Electrode)
φ = φapplied on ΓE+ , (Applied Electrode)
where ε(x) is the permittivity, ΓI is a material interface, ΓN is a Neumann boundary, and
ΓE− ,ΓE+ are Dirichlet blocks in the domain Ω. A simplified model problem used to focus on
the plasma actuator is shown in Figure 1.2. The actuator, as shown in Figure 1.2a, consists




Figure 1.1: Jet in cross-flow. (a) shows fuel entering a cross-flowing oxidizer. (b) shows
overlapping grids used to mesh the domain. (c) shows an experimental image of the setup.









Figure 1.2: Dielectric barrier discharge (DBD) plasma actuator. Images from the XPACC
center at the University of Illinois.
electric field plotted on a slice of the domain. The electric field calculation is a global elliptic
solve and represents a significant computational challenge for this application. In addition,
this problem also presents a number of challenges for multilevel solvers:
Discontinuous coefficient The permittivity, ε(x), of the electric field is discontinuous in
the domain. In this case, care must be taken when defining interpolation operators
used in coarse-grid correction. Standard bilinear interpolation cannot be used in this
2
case as it relies on the continuity of the gradient of the solution and not the normal
flux.
Stretched meshes Computation with curvilinear meshes can introduce anisotropy in the
operator (see Section 1.1.1). This anisotropy will need to be addressed by either the
coarsening algorithm or with line/plane smoothers to maintain favorable convergence.
Fine-scale physical features The electrodes are represented as fixed potentials in the do-
main. The electrodes along with the discontinuous material interface represent features
given by the fine-scale problem and the influence of these features on the coarse-grid
solution must be well represented in the coarse-grid operator. Figure 1.4 shows an
example of convergence deterioration of geometric multigrid when a fine-scale feature
is not aligned to the coarse grid. As the mesh is coarsened, the Dirichlet block shrinks
in relative size. By level four, the entire block disappears on the coarse mesh. This is
reflected in significant convergence deterioration as more levels are used. When four
levels are used, the coarsest mesh is no longer able to represent the Dirichlet block
resulting in divergence.
To address these challenges, robust variational multigrid is considered as described in Chap-
ter 2.
1.1.1 Curvilinear Discretization
To construct a suitable discretization for Equation 1.1 we consider a mapping from a
physical domain in (x, y) to a logical domain in (r, s) as described in Figure 1.5. Thus a
coordinate transformation x(ξ) : Ξn → Xn defines a mapping from logically structured
computational domain Ξn to the physical domain. Using this coordinate transformation,
(1.1) is solved on the computational domain.
To consider 1.1 on the computational domain, we use a contravariant metric tensor of the
physical domain in the coordinates ξ1, . . . , ξn, which is given by
gαβ = 〈∇ξα,∇ξβ〉. (1.2)
Then, ∇ · ε(x)∇φ = f on the logically structured computational domain Ξn using the non-















Level 1 Level 2
Level 3 Level 4
Figure 1.3: Coarsening of a 128 × 128 mesh with an electrode (hole) that is not aligned to
the coarse grid. As the fine grid is coarsened, the coarse-grid representation of the electrode
shrinks. By level 4, the electrode disappears from the coarse mesh.
where 1/g is the Jacobian of the contravariant matrix above [3]. Here, Einstein notation is
used and repeated indices imply summation.































































































Figure 1.4: Convergence of geometric multigrid using a varying number of levels with an
isotropic Poisson model problem using meshes from Figure 1.3. When more levels are used,
the coarse-grid representation of the electrode shrinks resulting in convergence deterioration.
Divergence occurs when the electrode is no longer represented on the coarsest grid.
ΞX
x(ξ)
Figure 1.5: Mapping of a structured mesh
A distributed memory parallel implementation of this discretization is released as open source
in the Stella package [4].
5
1.2 OVERVIEW OF CONTRIBUTIONS
The main contributions in this dissertation support the goal of addressing scaling and
efficiency challenges for robust structured multilevel solvers on large scale parallel systems.
Cedar Framework
A robust, variational multigrid library implementing Black Box Multigrid (BoxMG) on
large scale parallel systems. This library includes a C++ control layer that leverages
efficient legacy Fortran kernels to promote code reuse and flexibility.
Optimized Coarse-grid Redistribution
Using predictive performance modeling to guide coarse-grid redistribution decisions.
This is used to optimize the parallel execution of coarse-grid computations in robust
variational multigrid methods on structured grids.
Automated Communication Aggregation for Block Smoothers
The increased communication demands of block smoothers that are robust for anisotropic
problems challenge parallel performance at scale. The uniform communication pattern
of each block can be exploited to reduce communication costs; however, this aggre-
gation is difficult when robust and scalable block solvers are desired. An automated
strategy for aggregating this communication is included that reduces communication
latency costs for robust structured solvers.
Robust Structured Multigrid on Emerging Architectures
Emerging high performance computing architectures greatly increase on-node paral-
lelism and heterogeneity. Specifically, the recent machines at Lawrence Livermore
National Lab (LLNL) and Oak Ridge National Lab (ORNL) motivate targeting nodes
with multiple GPUs. Benchmarks are used to evaluate the use of OpenMP and uni-
fied memory for offloading structured matrix-based operations on these systems and
to establish a performance expectation on each processing unit.
6
CHAPTER 2: COMPONENTS OF ROBUST MULTIGRID
The solution of a logically structured elliptic PDE with Black Box Multigrid (BoxMG)
begins with an accurate discretization yielding the fine-grid system of linear equations
Akuk = fk, (2.1)
where Ak is a five of nine point nearest neighbor stencil operator on fine-grid index k.
Logically structured coarse-grid operators and intergrid transfer operators are generated
using a structured coarsening procedure and the fine-grid operator. This process, shown
Algorithm 2.1: Multigrid Setup Phase
Input: Ak fine-grid operator
setup inter intergrid transfer setup routine
setup coarse operator operator coarsening routine
Output: A0, . . . , Ak−1
R1, . . . , Rk restriction operators
P1, . . . , Pk prolongation operators
1 for l = k, k − 1, . . . , 2 do
2 Rl, Pl = setup inter(Al)
3 Al−1 = setup coarse operator(Rl, Al, Pl)
in Algorithm 2.1, is called the multigrid setup phase and creates operators needed in the
solve phase for approximating fine-grid smooth error on coarser grids. The details of these
operations are given in Section 2.1.
In the multigrid solve phase, iterations of a cycling procedure are run. In the case of a
standard multigrid V-cycle, a smoothing procedure is first used to dampen high-frequency
error in Equation 2.1. The problem is then restricted to a coarser grid:
Ak−1uk−1 = Rk(fk − Akuk). (2.2)
The solution of Equation 2.2 on grid (k − 1) is a coarse-grid representation of the error on
grid k since the right-hand side is residual restricted to grid (k− 1). This process continues
until the coarsest grid is reached. At this point, Equation 2.2 is solved directly for u1. The
coarse-grid representation of error is then used as a correction for the fine-grid approximation
uk = uk + Pkuk−1. (2.3)
7
Smoothing is then run on uk to reduce any high-frequency error introduced from the correc-
tion. This process is shown in Algorithm 2.2.
Algorithm 2.2: Multilevel V-cycle
Input: uk fine-grid initial guess
fk fine-grid right-hand side
A0, . . . , Ak
P1, . . . , Pk
Output: uk fine-grid iterative solution
1 for l = k, . . . , 1 do
2 relax(Al, ul, fl, ν1) {relax ν1 times}
3 rl = fl − Alul {compute residual}
4 fl−1 = P
T
l rl {restrict residual}
5 u0 = solve(A0, f0) {coarse-grid direct solve}
6 for l = 1, . . . , k do
7 ul = ul + Plul−1 {interpolate and correct}
8 relax(Al, ul, fl, ν2) {relax ν2 times}
2.1 ROBUST VARIATIONAL MULTIGRID ON STRUCTURED GRIDS
The three pieces needed for building a variational multigrid method for symmetric, positive
definite matrix problems are a coarsening strategy, interpolation operator, and smoothing
procedure. Given an interpolation operator P , a fine-grid operator Ak, and choosing re-




This operator is called the Galerkin coarse-grid operator and follows a variation principle that
minimizes coarse-grid correction error in the range of interpolation [5]. Variational coarsening
is an important component for black box solvers as it does not require rediscretization on each
multigrid level—instead relying on the variational principle to generate coarse problems using
Equation 2.4. This is crucial when fine-scale features are present in the fine-grid problem
that change or disappear on coarser grids. In this approach, the interpolation must be
sufficiently accurate to generate a Galerkin operator satisfying the approximation property.
The approximation property is important for multigrid convergence and can be thought of
as bounding the difference in the action of the inverse of the fine-grid operator and the
interpolated action of the inverse of the Galerkin coarse-grid operator on a restricted vector
8
by some power of the grid spacing [6, 7, 8, 9]. In the case of structured grids, the complexity
of the coarse-grid operators is bounded and the stencil pattern can be fixed a priori.
2.1.1 Coarsening Strategies for Structured Grids
In contrast to Algebraic Multigrid (AMG), coarsening strategies for structured grids are
prescribed a priori to preserve structure on coarse grids. Standard coarsening is the simplest
Figure 2.1: Standard coarsening by a factor of 2.
coarsening strategy for structured grids–coarsening the grid by a constant factor in each
dimension as shown in Figure 2.1. In this case, any anisotropy in the problem must be
handled by the smoother to maintain effective coarse-grid correction (see Section 2.1.3).
When using point smoothing with anisotropic problems, error less smoothed in the direction
of weak connections. If the anisotropy is aligned with the grid, semicoarsening can be
used to coarsen in the direction of strong connection, where smoothing is effective. While
semicoarsening allows the use of inexpensive point smoothing for anisotropic problems, it
must be combined with other smoothing techniques for general anisotropy. In contrast to
standard coarsening, semicoarsening also reduces the rate at which the coarse problem is
diminished at each multigrid level impacting the performance of a solver.
Multiple semicoarsening [10, 11] is another structured coarsening strategy used to ad-
dress general anisotropy. In multiple semicoarsening, the fine-grid problem is semicoarsened
simultaneously in each direction. This generates multiple coarse-grid problems on a given
multigrid level. In three dimensions, the large number of coarse problems needed for multiple
semicoarsening can be avoided by using 2D multiple semicoarsening with a block smoothing
operation for the third dimension.
9
2.1.2 Operator-induced Interpolation
When using Galerkin coarsening, interpolation must be sufficiently accurate to ensure
the coarse-grid operator satisfies the approximation property [12]. For example, in two-
dimensional diffusion problems with discontinuous coefficients, bilinear interpolation is not
accurate across discontinuities because the gradient of the solution is not continuous. Thus,
a key element in robust multigrid methods is operator-induced interpolation [13, 12], which
uses the matrix problem to construct intergrid transfer operators. In the case of structured
grid problems, operator-induced interpolation is naturally motivated by noting the normal
component of the flux is continuous [14], and its impact on the properties of the variational
coarse-grid operator is understood through its connection to homogenization [15, 16]. This
approach has natural extensions to non-symmetric problems as well [17, 18].
To introduce an operator-induced interpolation for logically structured grids, a nine point




Figure 2.2: Nine point stencil centered at point (i, j) on a logically structured Cartesian
mesh.
point (i, j) in a logically structured Cartesian grid the nearest neighbor stencil weights as
shown in Figure 2.2. This notation is written more concisely using the stencil notation
A[i,j] =






where the row corresponding to (i, j) in the system Ax = b is written as A[i,j]x[i,j] = bi,j or
Ci,jxi,j +Wi,jxi−1,j + Ei,jxi+1,j
+ SWi,jxi−1,j−1 + Si,jxi,j−1 + SEi,jxi+1,j−1
+NWi,jxi−1,j+1 +Ni,jxi,j+1 +NEi,jxi+1,j+1 = bi,j.
(2.6)
To develop an operator-induced interpolation, interpolation is chosen to capture error not
effectively reduced by smoothing—often termed algebraically smooth error. The standard
assumption that algebraically smooth error produces small residuals [5] is then used, written
as
A[i,j]e[i,j] ≈ 0. (2.7)
Interpolation weights are found by first writing Equation 2.7 as
ei,j ≈ −(Wi,jei−1,j + Ei,jei+1,j
+ SWi,jei−1,j−1 + Si,jei,j−1 + SEi,jei+1,j−1
+NWi,jei−1,j+1 +Ni,jei,j+1 +NEi,jei+1,j+1)/Ci,j.
(2.8)
Approximations are then made to remove connections to points that are not present on the
coarse-grid. This process results in an approximate interpolation stencil for a given fine-grid
point (i, j) denoted P[i,j]. For simplicity, stencil notation using fine-grid indexing is used for
P[i,j] with  denoting positions in the stencil not collocated with coarse-grid points. This
stencil maps coarse-grid vectors to the fine-grid point (i, j). For example,
P[i,j]e[i,j] =




e[i,j] = ω1eI,J + ω2eI+1,J , (2.9)
where e[i,j] includes values in the coarse-grid vector e that correspond to weights in P[i,j],
and (I, J) is a coarse-grid index corresponding to the fine-grid point (i− 1, j).
When using coarsening by a factor of 2, classical black box multigrid interpolation [14, 15,
19, 20] considers three cases for interpolating to a fine-grid point (i, j) as shown in Figure 2.3.
(a) (i, j) is also a coarse-grid point. In this case injection is used resulting in the interpo-
11
(a) (b) (c)
Figure 2.3: BoxMG interpolation whe coarsening by a factor of 2 in 2D. denotes coarse-grid
points, and denotes fine-grid points considered for interpolation.
lation stencil
P[i,j] =





(b) (i, j) is embedded in a coarse line with (i±1, j) or (i, j±1) being coarse-grid points. In
this case, the fine-grid stencil is collapsed by assuming error is constant in the direction
orthogonal to the coarse-grid line. When (i±1, j) are coarse-grid points, it is assumed





The collapsed stencil now involves only the point to be interpolated and the coarse-grid
points. Using Equation 2.7 with the collapsed stencil, we have
−(N + C + S)ei,j ≈ (NW +W + SW )ei−1,j + (NE + E + SE)ei+1,j. (2.12)
This leads to the interpolation stencil
P[i,j] =






Similarly, when (i, j ± 1) are coarse-grid points, the stencil for A[i,j]e[i,j] is collapsed in
the first dimension giving 0 NW +N +NE 00 W + C + E 0

















(c) (i, j) is in logical cell center. In this case, the fine-grid points to be removed from
the operator stencil are all embedded in coarse-grid lines. Because of this, we can use
the interpolation stencils from Equations 2.13 and 2.15 to remove connections to these
points in A[i,j]:  NW N · P[i,j+1] NEW · P[i−1,j] C E · P[i+1,j]




Using Equation 2.7, the interpolation stencil is given implicitly as
P[i,j]e[i,j] =





− [Wi,j(P[i−1,j]e[i−1,j]) + Ei,j(P[i+1,j]e[i+1,j])
+ Si,j(P[i,j−1]e[i,j−1]) +Ni,j(P[i,j+1]e[i,j+1])]/Ci,j
(2.17)
By collapsing the operator stencil across grid lines to form one dimensional interpolation
stencils, operator-induced interpolation in BoxMG preserves continuity in n · ε∇u for grid
aligned discontinuities in the diffusion coefficient ε while not relying on continuity in n ·
∇u [14]. This is essential for problems with large discontinuities in the diffusion coefficient.
Using BoxMG operator-induced interpolation with Galerkin coarsening ensures bounded
complexity in the coarse-grid operator with operators not exceeding nine-point stencils.
13
2.1.3 Smoothing
Careful consideration of smoothers is necessary for structured solvers where coarsening
is prescribed to preserve structure. For anisotropic problems, smoothers must be designed
to produce error suitable for coarse-grid correction with a given coarsening pattern. For






∇u = f, (2.18)
























= f after 10 iterations of Gauss-Seidel point smoothing
using a random initial guess with ε = 0.001.
ing considering Equation 2.18 with homogeneous Dirichlet boundary conditions with a ran-
dom initial guess. This shows point smoothing is only effective in producing smooth error
in the direction of strong connection, namely in the x direction. Coarse-grid correction with
14
standard coarsening for this error is ineffective as error in the y direction is not smoothed.
Block smoothers, however, can be used to address this anisotropy. Line smoothing operates
on matrix entries corresponding to the same logical grid line collectively. The structure of























where p is the number of logical grid lines. Each block smoothing sweep consists of solving
p systems of form 
Dixi = bi − Uixi+1 if i = 1
Dixi = bi − Lixi−1 − Uixi+1 if 1 < i < p
Dixi = bi − Lixi−1 if i = p.
(2.20)
In line smoothing, rows in the matrix are grouped by logical grid line to reach this structure.
For a 2D nearest neighbor stencil-based matrix, each Di block is tridiagonal and each Li, Ui
is tridiagonal if diagonal connections are present and diagonal otherwise. Each sweep of
line smoothing solves p tridiagonal from Equation 2.20. This type of block smoothing is
effective for grid-aligned anisotropy by collectively processing strongly connected unknowns.
Figure 2.5 shows the error after ten iterations of line smoothing in the x direction considering
Equation 2.18 with homogeneous Dirichlet boundary conditions. In contrast to the error
shown in Figure 2.4, this error is smooth in both directions and appropriate for coarse-grid
correction with standard coarsening. By simply alternating directions in each smoothing
step, line relaxation in two dimensions is robust for anisotropy that is not aligned to the
grid [13, 12].
Similar to line smoothing in two dimensions, plane smoothing can be used as a robust
smoother for anisotropic problems in three dimensions. In plane smoothing, rows in the
matrix are grouped by logical plane to reach the structure shown in Equation 2.19. Each
block Di corresponds to a 2D structured operator for each plane. Plane smoothing sweeps,
described in Equation 4.6, consist of solving p 2D structured systems. In practice, these
systems do not need to be solved exactly. In [21] it has been shown that a single V(1,1)
cycle is sufficient for recovering expected convergence factors.


























= f after 10 iterations of line smoothing in the x direction
using a random initial guess with ε = 0.001.
ing the implementation in BoxMG [22, 14, 15] with standard coarsening by a factor of 2.
BoxMG is used because its operator-induced interpolation is designed to handle discontinu-
ous coefficients, and additional heuristics ensure optimal performance for Dirichlet, Neumann
and Robin boundary conditions on logically structured grids of any dimension (i.e., it is not
restricted to grids 2k + 1). BoxMG supports vertex- and cell-based discretizations of the
diffusion equation on logically structured grids that lead to nearest neighbor stencils (i.e.,
five- and nine-point operators in two dimension). In addition, only the fine-grid problem is
specified; coarse-grid operators are constructed through the variational (Galerkin) product.
The package is also released as open source in the Cedar Project [1].
16
Figure 2.6: Domain partition. Circles represent degrees of freedom, with denoting points
on process k, representing halo points needed for communication from other processors,
and points on other processors. Fictitious points on the domain boundary are denoted by
.
2.1.4 Domain Decomposition Parallel Implementation
The distributed memory parallel implementation of BoxMG divides the problem domain
among available processors. The processors are arranged in a structured grid and points
in the fine-grid problem are divided among processors in each dimension — see Figure 2.6.
Since the computation is structured, inter-process communication occurs through nearest
neighbor halo updates with a halo width of 1. An important feature of BoxMG is bounded
complexity in the coarse-grid operator. By exploiting the structure of the problem, BoxMG
produces coarse-grid problems with a fixed structure. This results in known communication
patterns and guaranteed data locality.
17
CHAPTER 3: OPTIMIZED COARSE-GRID REDISTRIBUTION
Portions of this chapter appear in the publication: “Scaling Structured Multigrid to 500K+
Cores Through Coarse-Grid Redistribution” [23].
The efficient solution of large, sparse linear systems resulting from the discretization of
elliptic partial differential equations (PDEs) is crucial to the performance of many nu-
merical simulations. Although there has been significant progress in developing general
Algebraic Multigrid (AMG) solvers [12], modern high-performance computing (HPC) ar-
chitectures continue to pose significant challenges to parallel scalability and performance
(e.g., [24, 25, 26]). These challenges include reducing data movement, increasing arithmetic
intensity, and identifying opportunities to improve resilience, and are more readily addressed
in settings where problem structure can be identified and exploited. For example, robust
structured variational multigrid methods, such as Black Box Multigrid (BoxMG) [14, 17],
take advantage of direct memory addressing and fixed stencil patterns throughout the multi-
grid hierarchy to realize a 10× speed-up over AMG for heterogeneous diffusion problems [2].
In addition, the communication patterns in a parallel BoxMG solve are fixed and predictable
throughout the multigrid hierarchy. Here we explore using this information to improve the
parallel scalability and performance of elliptic solves for problems with structure.
The meshing strategy used in the discretization of a PDE has a significant impact on the
amount of structure that can be exploited by the solver. For example, single-block logically
structured grids can be mapped to conform to smooth non-planar geometries [27], and can
use embedded boundary discretization techniques to add additional flexibility to the rep-
resentation of object boundaries. Robust structured variational methods, such as BoxMG,
are directly applicable to these cases. In contrast, fully unstructured grids can capture very
complex geometries, including non-smooth features over a range of scales, but demand gen-
eral algebraic multilevel solvers, such as AMG or smoothed aggregation AMG. The needs
of many applications lie between these extremes, and a variety of adaptive or multi-mesh
strategies have been developed to preserve structure and enable its use in the discretization
and solver. These approaches generally lead to specialized hybrid solvers, favoring struc-
tured techniques at higher levels of refinement and unstructured techniques below a suitably
chosen coarse level [28, 29]. In [29], memory efficient matrix-free geometric methods are used
on fine levels with hybrid hierarchical grids to solve a problem reaching 1.1× 1013 degrees of
freedom. While these hybrid solvers present a variety of challenges, a single-block logically
structured solver is a critical component of their design and performance.
18
A common approach to parallel multigrid solvers for PDEs is to partition the spatial
domain across available processor cores. However, on coarser levels, the local problem size
decreases and the communication cost begins to impact parallel performance. A natural
approach to alleviate this problem is to gather the coarsest problem to a single processor (or
redundantly to all the processors), and to solve it with a serial multigrid cycle. This approach
was first motivated by a performance model of early distributed memory clusters [30] where
core counts were quite small, and it was used in the initial version of parallel BoxMG.
Unfortunately, as the weak scaling study in Figure 3.1 shows, this approach scales linearly
with the number of cores, and hence, is not practical for modern systems. A modification of
this approach that gathers the coarse problem to a multi-core node [31], and then leverages
OpenMP threading on that node, improves the performance but does not address the scaling.
This challenge of reducing communication costs at coarser levels is even more acute in AMG
methods, and this led to exploration of agglomeration to redundant coarse problems at
higher levels [24], as well as redistribution of smaller portions of the coarse problem to
enhance scalability [25], leading to approximately 2× speedup in the solve phase.
An alternative to this two-level gather-to-one approach is a more conservative redistribu-
tion strategy that can be applied recursively as the problem is coarsened. In single-block
structured solvers, the decision to redistribute data is driven by balancing communication
costs in relaxation with diminishing local work. This approach was first considered in a struc-
tured setting [32], and used a heuristic to guide recursive application of nearest neighbor
agglomeration. Later, in the Los Alamos AMG (LAMG) solver, a heuristic was developed to
guide the reduction of the number of active cores at each level by a power of two [33]. In [34],
agglomeration is performed on a per-process basis in AMG. When the number of unknowns
on a process falls below a specified threshold, the process donates their local problem to their
neighbor with the fewest unknowns. Hierarchical agglomeration is performed for hierarchical
distributed grids in [35] beginning with the coarse problem on one process and performing
incremental refinement of the processor grid. In [36], uniform agglomeration is used to min-
imize communication costs during geometric coarsening. The necessity of processor agglom-
eration at scale was noted in [37]–motivating the addition of an agglomeration framework in
PETSc. While manual specification of agglomeration is required, performance model guided
agglomeration is proposed as future work. While these works use recursive agglomeration to
address parallel scalability of coarse-grid problems, they use fixed agglomeration strategies
and do not explore optimizing agglomeration through predictive performance models.
In this chapter, an optimized redistribution algorithm is proposed for robust structured
multigrid methods that balances the computation and communication costs at each level.
The structured setting enables the enumeration of possible coarse-grid configurations, and
19
a performance model is developed to support optimization of the coarsening path that is
selected through these configurations. The utility of this approach is demonstrated through
scaling studies extending beyond 100K cores on two modern supercomputers.
The remainder of this chapter is organized as follows. Section 3.1 examines the domain
decomposition implementation in BoxMG, and proposes a redistribution algorithm that can
be applied recursively. The performance model for this parallel algorithm with redistribution
is developed in Section 3.2, and the optimization algorithm is presented in Section 3.3. Scal-
ing studies are presented in Section 3.4 demonstrating the efficacy of the proposed method,
and Section 3.5 gives conclusions.












Figure 3.1: Weak scaling on Blue Waters for a local problem size of 568× 71 (rectangular)
and 200 × 200 (square) using V(2,1)-cycles with a gather-to-one serial BoxMG coarse-grid
solver.
3.1 PARALLEL ROBUST VARIATIONAL MULTIGRID ON STRUCTURED GRIDS
Examining the distributed memory parallel implementation of BoxMG (see Section 2.1.4)
in the context of parallel scalability, many of the operations are stencil based computa-
tions with halo updates. Since the operations are relatively local, they are not expected
20
to significantly limit parallel scalability. For this chapter, the particular focus is in parallel
decisions at coarse grids. To balance useful local work with communication costs, agglom-
eration methods are used to redistribute coarse problems. A straightforward approach is
agglomeration to one task. This task is then mapped to either a single processor or to all
processors redundantly. This method of agglomeration was used in the initial distributed
memory implementation of BoxMG; while effective for low processor counts it suffers at scale
since the global coarse problem grows linearly with the number of processors. As seen in
Figure 3.1, this is especially true with rectangular local problems for which the coarsest local
problem size is larger, and hence, the linear growth is faster and overall parallel scaling is
lower. In addition to agglomerating to one task, the coarse problem can be redistributed by
agglomerating to n tasks, where n is less than the number of processors. The problem then
becomes choosing a desirable value of n which is dependent on the problem distribution on
the processor grid. This agglomeration can be applied recursively to gradually reduce the
size of the processor grid as the size of the global coarse-grid problem is reduced.
3.1.1 Redistribution
To extend parallel coarsening in a structured setting, the algorithm introduced in this
chapter aims to redistribute the coarse-grid problem to an incrementally smaller subset
of processors. Parallel coarsening continues until a parameterized minimum local problem
size is reached. At this point, a set of possible redistributions is enumerated. Here, a
redistribution is given by (p0, p1, . . . , pD−1) where pk is the number of processors in space
direction k and D the number of dimensions. These redistributions are evaluated based on
a cost derived through a performance model. An optimal redistribution sequence is then
selected. It is important to note that since a redistributed problem on a given level also
limits parallel coarsening, the selection algorithm is designed to be applied recursively to
obtain the highest efficiency possible for the multigrid cycle. Section 3.3.1 discusses the
redistribution of a grid of processors to coarser processor grids.
To redistribute the coarse-grid problem on a smaller number of processors, the processor
grid is agglomerated into processor blocks. This agglomeration is performed by dimension to
maintain a distributed tensor-product grid structure. Each processor block is then mapped
to one task in the redistributed solver.
To map the coarse tasks to processors, two approaches are considered. The first approach
uses one processor from each processor block. This is visualized in Figure 3.2 where the
following steps are taken:
21
1. Agglomerate: processors grouped into blocks to define coarse tasks
2. Gather: processors within each block perform gather on coarse problem
3. Cycle: cycling continues with redistributed problem
4. Scatter: iterative solution scattered after redistributed cycle completes
0 1 2 3
4 5 6 7
8 9 a b
c d e f
0 1 2 3
4 5 6 7
8 9 a b














0 1 2 3
4 5 6 7
8 9 a b
c d e f
Cycle
Scatter
Figure 3.2: A redistribution of a 4 × 4 processor grid to 2 × 2 processor blocks. The boxes
represent the processing elements and the circles represent their respective coarse-grid local
problems.
The second approach employs redundancy by mapping each processor in the processor block
to the coarse task. This is visualized in Figure 3.3 with the following steps
1. Agglomerate: processors grouped into blocks to define coarse tasks
2. Allgather: processors within each block perform allgather on coarse problem
3. Cycle: cycling continues redundantly with redistributed problem
While the second approach avoids an additional communication phase at the end of each
cycle and adds an opportunity for resilience through redundant cycling, the first approach
22
0 1 2 3
4 5 6 7
8 9 a b
c d e f
0 1 2 3
4 5 6 7
8 9 a b







































0 1 2 3
4 5 6 7
8 9 a b
c d e f
×4
Redundant Cycle
Figure 3.3: A redundant redistribution of a 4×4 processor grid to 2×2 processor blocks. The
boxes represent the processing elements and the circles represent their respective coarse-grid
local problems.
avoids the increased network usage involved in simultaneous coarse cycles. Algorithm 3.1
supplements Algorithm 2.2 with steps needed for redistribution. The algorithm is annotated
with parallel communication required for each step.
3.2 PERFORMANCE MODEL
In this section, a performance model is introduced for the BoxMG V-cycle (see Algo-
rithm 3.1). The model helps identify the parallel performance limitations of the V-cycle,
particularly at coarse-levels in the multigrid hierarchy, and also provides a cost metric that
is used to guide the coarse-level redistribution algorithm introduced in Section 3.3.
A key kernel in the V-cycle is that of matrix-vector multiplication. Since the stencil-based
computations for this operation are relatively uniform, the cost of parallel communication
in matrix-vector multiplication is accurately modeled with a postal model [38], leading to a
total cost of
T = nf · γ︸ ︷︷ ︸
computation




Algorithm 3.1: Multilevel V-cycle with Redistribution
Input: uL−1 fine-grid initial guess
fL−1 fine-grid right-hand side
A0, . . . , AL−1
P1, . . . , PL−1
p0, . . . , pL−1 number of processors on each level
Output: uL−1 fine-grid iterative solution
1 for l = L− 1, . . . , 1 do
2 relax(Al, ul, fl, ν1) {halo exchange}
3 rl = fl − Alul {halo exchange}
4 fl−1 = P
T
l rl
5 if pl > pl−1
6 gather rhs(fl−1, pl, pl−1) {local gather}
7 u0 = solve(A0, f0)
8 for l = 1, . . . , L− 1 do
9 if pl > pl−1
10 scatter sol(ul−1, pl, pl−1) {local scatter}
11 ul = ul + Plul−1 {halo exchange}
12 relax(Al, ul, fl, ν2) {halo exchange}
with nf denoting the number of floating point operations, γ a measure of the computation
rate or inverse effective FLOP rate, α the interprocessor latency, 1/β the network bandwidth,
and m the number of bytes in an MPI message. The value γ is determined by measuring
the computation time of a local stencil-based matrix-vector product, which is a memory
bandwidth bound operation. The values α and β are determined through standard machine
benchmarks such as mpptest1. As an example, the parameters derived for a 9-point 2D
stencil on Blue Waters, a Cray XE6 machine at the National Center for Supercomputing
Applications2, are listed in Table 3.1. A more accurate model for communication may
be used, particularly for multiple communicating cores with large message sizes [39] or to
account for network contention [40], which may play a prominent role in communication.
α β γ
0.65 µs 5.65 ns/B 0.44 ns/flop




In an L-level multigrid V-cycle, Algorithm 2.2 is modeled through
TV-cycle = Tcgsolve +
L−1∑
l=1









In the following expressions, each component of (3.2) represents the time taken in the
actual implementation in BoxMG and does not necessarily form a lower bound on each
portion of the computation. In the model parameters below, D denotes the number of
dimensions in the problem, nld the number of local grid points in dimension d on level l,
and ns the number of points in the stencil. Grid quantities involved in communication
and computation are assumed to be 8 byte double-precision floating-point numbers. The
dimension of the parallel decomposition is also assumed to match the dimension of the
problem. The communication required for a halo exchange of width 1 in D dimensions on
level l is modeled as
T lexchange(D) = 2 ·D · α + 2 ·
D−1∑
d=0
nld · 8 · β. (3.3)
Smoothing, using Gauss-Seidel with nc colors, results in
T lsmooth = 2 · ns ·
D−1∏
d=0
nld · (ν1 + ν2) · γ + nc · (ν1 + ν2) · T lexchange(D), (3.4)
where the factor of 2 accounts for both multiply and addition operations. Likewise, the
residual computation is
T lresidual = 2 · ns ·
D−1∏
d=0
nld · γ + T lexchange(D). (3.5)
Since it is unnecessary to communicate halo regions in restriction, the computation is entirely
local, leading to
T lrestrict = 2 · ns ·
D−1∏
d=0
nld · γ. (3.6)
Following [20], the interpolation and correction computes
ul ← ul + I ll−1ul−1 + rl/C, (3.7)
25
where rl is the previously computed residual and C is the center, diagonal stencil coefficient
of the operator. Interpolation for edges (see Figure 3.4) yields
ul ← ul + ωwul−1w + ωeul−1e + rl/C︸ ︷︷ ︸
5FLOPs
(3.8)
for the x-direction (and similar for the y-direction). Likewise, the interior stencil (see Fig-
ure 3.4) yields
ul ← ul + ωswul−1sw + ωseul−1se + ωnwul−1nw + ωneul−1ne + rl/C︸ ︷︷ ︸
9FLOPs
(3.9)
Here interpolation is given as weights stored at the coarse points. For example, the interpo-
lation stencil stored at a given coarse point isωnw ωn ωneωw ωe
ωsw ωs ωse
 .





nld + 20 ·
1∏
d=0





· γ + T lexchange(2). (3.10)





nld + 60 ·
2∏
d=0





To agglomerate the coarse problem, the right hand side is gathered within processor blocks
before the redistributed cycle begins and approximate solution scattered after the redis-
tributed cycle completes. This time is then given by:
T lagglomerate =
T lgather rhs + T lscatter sol if pl > pl−10 else, (3.12)
where T lgather rhs and T
l




















Figure 3.4: Interpolation computation in 2D. (a) The top two grids show computation
performed on the first coarse line in the first dimension. (b) Similarly, the middle two grids
show computation performed on the first coarse line in the second dimension. Each coarse
point performs injection to the corresponding fine point. Interpolation to the preceding fine
point embedded in the coarse-line is then computed using the surrounding coarse points. (c)
The bottom three grids show computation performed for interior coarse points. For each
coarse point, the corresponding fine point is injected. The preceding coarse points in each
dimension are then interpolated. Finally, the logical cell center preceding the coarse point
is interpolated using the surrounding coarse points.
and solution, as follows. The number of processors within a processor block and the local
















For this operation, a MPI allgather or gather/scatter is used depending on whether the
redistribution is redundant. Following [41], the cost of these collective operations is given
by





T lscatter sol =
0 if redundantT lgather rhs else. (3.15)
Finally, the time for the solve on the very coarsest level after agglomerating to one pro-










This assumes the Cholesky factorization has been computed and stored in the setup phase.
This performance model provides a basic predictive model to begin exploring the guided
redistribution algorithm proposed in this chapter. In contrast to [42, 29], the performance
model in the present work is used to provide an estimate of the communication and compu-
tation costs involved in the multigrid solve phase that can be used to predict the impact of
different coarse-grid redistribution options on the overall solve time.
3.3 OPTIMIZED PARALLEL REDISTRIBUTION ALGORITHM
3.3.1 Coarse Processor Grid Enumeration
To enumerate potential redistributions, the fine-grid tasks described by the processor grid
are agglomerated into coarser tasks called processor blocks. In agglomerating by dimension,
the processor blocks form a coarser tensor product grid. The potential redistributions are
enumerated by beginning with a 1 × 1 processor block and refining greedily by dimension
with respect to the agglomerated local problem size.
Figure 3.5 illustrates this process. A dimension is considered for refinement if the refine-
ment is feasible. A refinement is feasible if the number of processor blocks in that dimension
after refinement is less than or equal to the initial fine-grid task size in that dimension. If
refinements in multiple dimensions are feasible, the dimension with the largest agglomerated
local problem size is chosen. As refinement is chosen in each step by dimension, the number
of enumerated processor blocks is bounded by dlog2 npe using a refinement factor of 2.
Table 3.2 illustrates this enumeration strategy using a 16×8 initial fine-grid task fine size.
This refinement procedure is used to limit the number of potential redistributions thereby
making the global search feasible within the setup phase. Figure 3.6 shows an example
of parallel structured grid coarsening with a fine grid problem size of 9088× 568 degrees of















(b) All possible redistributions for 8 proces-
sors.
Figure 3.5: (a) shows processor grid refinement by a factor of 2. The dimension chosen for
refinement in each step depends on the agglomerated local problem size. (b) shows a tree
of all possible redistributions with an initial processor grid containing 8 processors. Nodes
in the tree represent the number of processors in the redistribution. Redistributions are
enumerated on each level of the tree using refinement by dimension recursively.
Step Redistribution Agglomerated Local Problem
1 1× 1 1136× 71
2 2× 1 568× 71
3 4× 1 284× 71
4 8× 1 142× 71
5 16× 1 71× 71
6 16× 2 71× 36
7 16× 4 71× 18
Table 3.2: Example redistribution enumeration using a fine grid problem of 9088 × 568
degrees of freedom with a 16 × 8 processor grid. The global coarse-grid considered for
agglomeration contains 1136 × 71 degrees of freedom. After step 4 refinement in the first
dimension is infeasible.
of grid points in a dimension is reached. The minimum was nine grid points in a dimension
on a given processor for this example. Three potential redistributions are then shown, each
of which results in a different number of multigrid levels until the minimum number of grid




































16× 4 processors 16× 2 processors 8× 1 processors
Figure 3.6: Example of three redistributions from a 16 × 8 grid of processors using a fine
grid problem of 9088× 568.
3.3.2 Redistribution Search
The recursive enumeration of coarse processor grids generates a search space of possible
redistributions. To find an optimal redistribution, a state in the search space is given by the
size of the processor grid and the size of the coarse-grid associated with a distributed solver.
The search space can be viewed as a directed graph — an example is given in Figure 3.7.
The initial state is given by the top-level distributed solver and the goal state is the state
with a 1 × 1 processor grid (in the case of a 2-D problem). State transitions have varying
costs; the goal is an inexpensive path from the initial state to the goal state.
To search the state space, a path cost function is defined as
f(s) = g(s) + h(s) (3.17)
where s is a vertex or state in the graph in Figure 3.7, g is the cost to reach s from the initial
state, and h is an estimate of the cost to reach the goal state from state s. That is, g(s)
represents the time predicted by the performance model for a solve phase executed down to
coarse-grid state s. For the heuristic function h(s), a weighted combination of the coarse-grid
problem and the processor grid size is used to predict the cost. If the performance model
is an accurate predictive model, an inexpensive path from the initial state to the goal state




















Figure 3.7: Example redistribution search space: The fine grid global problem is 9088× 568
degrees of freedom with a 16×8 processor grid, yielding a fine grid local problem of 568×71.
The optimal path through this redistribution space is highlighted.
defined, the space of redistributions is searched for an optimal path. This is performed in
the MG setup phase to dictate agglomeration when a coarsening limit is reached. Using a
brute force approach by searching every path for the best redistribution strategy incurs an
O(np) cost, since the redistribution search space is constructed recursively. To show this, we
let d = dlog2(np)e and consider building a tree of possible redistributions (see Figure 3.5b).
We let a node in the tree represent the total number of processors for a given processor
grid. Building this tree recursively using the above processor grid enumeration results in the




R(k) + 1, (3.18)
as the immediate children of a node include the powers of two up to but not including np.
Using induction, we wish to show R(d) = 2d−1. For the base case, we have R(0) = 0 = 20−1.
31




R(k) + 1 (3.19)












= 2d + 2d − 1
= 2d+1 − 1,
this leads to R(d) = 2d − 1 = 2log2(np) − 1 = np − 1.
To address the O(np) cost, the A* algorithm [43] is used to determine the optimal path
with a heuristic of smoothing on the coarsest level multiplied by log2p. This never overesti-
mates the cost of reaching the goal state—weighting redistribution levels by their coarse-level
computations. While the worst case complexity for A* is is O(np) for this search, the cost
with an optimal heuristic with evaluation cost O(1) is the length of the solution path. The
length of this path in the redistribution search is O(log2(np)). Figure 3.8 shows the A*
heuristic is effective in avoiding the brute force cost, but does not reach the optimal cost.
3.4 EXPERIMENTAL RESULTS
To explore the performance of the proposed redistribution algorithm a standard 5-point
finite volume discretization of the diffusion equation is used. Since square problems may
lead to a natural or predictable redistribution path, rectangular local problems are used in
order to fully stress the redistribution algorithm in a weak scaling study. In particular, the
ratio of processors in the processor grid is fixed at 2 : 1, and the ratio of unknowns in the
fine-grid local problem is fixed at 8 : 1. Other processor grid ratios lead to similar findings.
Note that for an isotropic diffusion problem this grid stretching results in anisotropy in the
discrete problem, causing the convergence of the solver to deteriorate when using point-wise
smoothing. Consequently, a diffusion problem with compensating anisotropy is defined in
order to focus on the parallel scalability of the coarse-grid redistribution algorithm (rather
than the well-established convergence rate of the multigrid algorithm itself). Specifically,
32
















Figure 3.8: Performance of redistribution search algorithms weak scaling with local problem
size 568× 71. np denotes the number of cores.
the following diffusion problem is used in these numerical experiments,
−∇ · (D∇u) = f(x, y) in Ω = (0, 1)× (0, 1) , (3.20)
u = 0 on ∂Ω , (3.21)
where the diffusion tensor is D = diag[1
r
, r] and r ≈ 16. This compensating anisotropy
results in optimal convergence of multigrid V-cycles with point-wise smoothing.
3.4.1 Scaling Studies
Both the weak and strong scalability of multigrid V-cycles that use the proposed redistri-
bution algorithm are important for applications on exascale systems. Here, the discretization
of the diffusion problem given above is used for both weak and strong scaling studies and
the algorithmic components of the BoxMG multigrid library (e.g., interpolation, restriction)
are used in the redistributed multigrid V-cycles.
Since, the cost of coarsening (with redistribution) and the cost of the coarse-grid problem
are dominated by parallel communication, network speed and machine topology play an
important role in timings. To explore this dependency, two different petascale systems are
33
considered in the scaling tests:
Mira3 An IBM Blue Gene/Q system at Argonne National Laboratory. Mira uses an IBM
network which comprises a 5D torus with 49,152 compute nodes using PowerPC A2
processors. Each compute node and Mira has a shared cache size of 32MB.
Blue Waters4 A Cray XE system at the National Center for Supercomputing Applications
(NCSA) at the University of Illinois at Urbana-Champaign. Blue Waters employs a
3D torus using a Cray Gemini interconnect and has 22,640 XE compute nodes each
with two AMD Interlagos processors. Each XE compute node has a total cache size of
32MB, with a 16MB L3 cache for each socket.
In each case, the machine parameters used in the performance model of Section 3.2 are
determined using the beff benchmark [44].
Weak Scalability
In this section, a weak scaling study is conducted to highlight the scalability of the
redistribution algorithm at large core counts. For the numerical experiments below, 10
V(2,1)-cycles are executed using two different local problems sizes: 568 × 71 = 40, 328 and
288 × 36 = 10, 368. We observed an average geometric convergence factor of 0.1 for these
runs with the largest error norm being approximately 10−10. The memory footprint for these
two local problems was approximately 7MB per core for the 40k local problem size, and 1.8
MB per core for the 10k local problem size. For both machines, the per core cache size when
using 16 cores on each node is 2MB.
Figures 3.9 and 3.10 show the run times on Mira of various computational kernels in the
multigrid solve phase. The “solve” line shows the time of the entire solve phase and includes
the other timings. The “redistribution” line shows communication needed to redistribute
coarse problems. This communication is low in comparison to the cost of relaxation. Overall,
the algorithm exhibits high parallel scalability in the solve time for both local problem sizes
on Mira.
With different network capacities, redundant redistribution (see Section 3.1.1) may not
yield the lowest communication costs. Indeed, the results in Figure 3.11 for the Blue Waters
system show that while redundant redistribution of the data at course levels is inexpensive,
the network contention introduced by redundant cycling contributes to an increase in com-


































Figure 3.9: Weak scaling on Mira with local problem size: 568× 71. The parallel efficiency
(%) relative to 32 cores for the overall solve is shown for each data point.































Figure 3.10: Weak scaling on Mira with local problem size: 288× 36. The parallel efficiency
(%) relative to 32 cores for the overall solve is shown for each data point.
basis could lead to reduced costs. In contrast, non-redundant redistribution (in Figure 3.11)
exhibits high scalability — thus, it is used for the following runs on Blue Waters.
Figures 3.12 and 3.13 show run times on Blue Waters for the multigrid solve phase,
35























Figure 3.11: Weak scaling on Blue Waters with local problem size 288 × 36 shows the sig-
nificant improvement of non-redundant redistribution compared to redundant redistribution
beyond 8192 cores.
highlighting that the proposed algorithm achieves good parallel scalability here as well.






























Figure 3.12: Weak scaling on Blue Waters with local problem size: 568 × 71. The parallel
efficiency (%) relative to 32 cores for the overall solve is shown for each data point.
However, comparing runs on the two machines, it is apparent that the weak scalability is
superior on Mira. This is in part attributed to the network capabilities and scheduling dif-
36































Figure 3.13: Weak scaling on Blue Waters with local problem size: 288 × 36. The parallel
efficiency (%) relative to 32 cores for the overall solve is shown for each data point.
ferences of the two machines, noting that communication is included in these measurements
(see Figure 3.14). Jobs on Mira receive a dedicated, full torus partition of the machine,
which often reduces contention resulting from neighboring jobs on the machine, since parti-
tions receive a full torus network. Moreover, the lower dimensional 3D torus on Blue Waters
also contributes to an increase in contention within the running job.
Nevertheless, on Blue Waters communication cost for redistribution remains relatively
low in comparison to smoothing, which remains the dominant kernel in the overall solve.
Figure 3.14 decomposes the cost of smoothing into communication and computation. This
decomposition indicates that the communication cost of the halo exchange is the primary
contributor to the reduced scalability.
In contrast to the network advantages of Mira, lower floating point performance is observed
for the key computational kernels on Mira than on Blue Waters. This difference, combined
with an extra core dedicated to operating system functions, yields more predictable per-
formance and superior scaling behavior on Mira. This is also observed for a variety of
applications [45]: loss in performance on the Cray XE6 in comparison to BG/Q systems as
the core count increases. However, with the data locality and fixed communication patterns
of the algorithm, the network performance on Blue Waters is good enough to let its superior
floating performance yield the best time to solution, solving approximately 2.5 times faster
at 131K cores.
37
















Figure 3.14: Weak scaling of smoothing routine on Blue Waters with local problem size
568× 71.
Strong Scalability
Turning to strong scalability, an isotropic model diffusion problem is setup on a 3200
× 3200 grid, and then executed on a sequence of square processor grids. The number of
cores in each coordinate direction is successively doubled from 8 to 128, to create a range
of core counts from 64 to 16,384, with a corresponding local problem size ranging from
160,000 to 625 degrees of freedom (dof). The strong scaling performance for Blue Waters
is shown in Figure 3.15. Here, the strong scaling limit is reached at 1024 cores, which is
equivalently 10K dof per core. The parallel efficiency relative to 64 cores drops from 125%
to 53% from 256 cores to 1024 cores, and then to 13% at 4096 cores. This scaling limit
is consistent with other performance assessments of multilevel solvers in application codes
(e.g., [46, 28]). Superlinear speedup is also observed with a parallel efficiency of 125% as the
fine grid problem is able to fit in cache at 256 cores.
The computational cost of the solve is again dominated by smoothing and scaling is limited
by communication, as supported in Figure 3.16 by the performance model of Section 3.2. The
steady reduction in computation cost due to a decreasing local problem size is not reflected
in the near constant communication cost. The communication cost is expected to be reduced
at a slower rate than computation since the size of the halo region is proportional to the
surface area of the local problem. The performance model is reasonably accurate, although
the slowly increasing underestimation of the communication cost suggests that further in-
vestigation of more advanced topology-aware performance models [39] may be worthwhile.
In addition, increased network utilization at higher core counts may also contribute to the
38





















Figure 3.15: Strong scaling on Blue Waters with problem size: 3200 × 3200. The parallel
efficiency (%) relative to 64 cores for the overall solve is shown for each data point.
overestimation. Often a hybrid Gauss-Seidel Jacobi sweep is used to limit communication
to once per relaxation sweep [24]. This improves the cycle’s strong scaling behavior, but
may slow convergence. Here, the focus is on redistribution and the impact of communication
is highlighted with a strict implementation of four-color Gauss-Seidel, which results in four
halo exchanges per relaxation sweep. The granularity of the component timers is such that
the residual and interpolation operations each include a single halo exchange. This leads
to strong scaling curves for these operations that are very similar to smoothing, but much
faster in absolute terms.
In contrast, the restriction has no communication, and would exhibit perfect strong scaling
in the absence of redistribution. With redistribution fewer cores are used at coarser levels,
and strong scaling begins to degrade at 2.5K dof per core, although speedup is still realized
even at 625 dof per core. This observation highlights the complexity of trade-offs in multilevel
algorithms on modern systems, and the important role of performance models in design and
runtime optimization.
3.4.2 Extension to 3D
This approach was extended naturally to 3D. Following the 2D results, the ratio of proces-
sors in the processor grid was fixed at 2 : 2 : 1. We considered two local problem sizes, one
with approximately 40k unknowns per core and one with approximately 5M unknowns per
39
















Figure 3.16: Strong scaling of smoothing routine on Blue Waters with problem size: 3200×
3200.
core. The memory footprint was approximately 14MB per core for the 40k local problem size
and 1.7GB per core for the 5M local problem size. Figure 3.17 shows weak scaling on Blue






















Figure 3.17: Weak scaling on Blue Waters with local problem size: 52×28×28. The parallel
efficiency (%) relative to 256 cores for the overall solve is shown for each data point.
Waters. These results are similar to the 2D results as they have similar computation and
communication requirements. The increased communication involved in a 3D halo exchange
may account for the decreased parallel efficiency relative to the 2D results. In 2D, we also
customized the rank ordering on Blue Waters to match the topology of the machine. This
40
resulted in improved performance, especially at higher core counts.




















Figure 3.18: Weak scaling on Blue Waters with local problem size: 260 × 140 × 140. The
parallel efficiency (%) relative to 256 cores for the overall solve is shown for each data point.
Figure 3.18 shows weak scaling with a much larger local problem size than the previous
weak scaling results. The previous local problem sizes were chosen to be near the strong
scaling limit where communication costs dominate and weak scalability is challenged. The
local problem size chosen in Figure 3.18, however, better saturates the memory on each node.
Since computation dominates in this case, the weak scaling behavior is superior. The global
problem size for this case reaches 6.5× 1011 degrees of freedom with an overall solve time of
20 seconds. As a reference, in [29], matrix-free geometric methods are used on fine levels of
refinement to solve a problem with 1.1× 1013 degrees of freedom in under 13 minutes.
While the focus of the present work is on the scalability limitations introduced by coarse
levels, future work could consider features of emerging architectures. In addition, a perfor-
mance analysis using a target architecture [42, 29] could be used to assess the efficiency of
the solver.
3.4.3 Performance of Optimized Redistribution
To understand the impact of the redistribution path on solve time, the 568×71 weak scaling
problem used in Figure 3.12 with 2048 processors is executed over a variety of redistribution
choices. The selected redistribution paths are listed in Table 3.3. Path 1 indicates the optimal
redistribution path used in Figure 3.12 and is selected by the algorithm from Section 3.3. In
41
stark contrast to Path 1 is Path 0, which is the original all-to-one redistribution path that
is known to scale poorly. The remaining paths highlight the flexibility in the agglomeration
sequence.
Path Processor Grid Redistribution
0 64× 32 → 1× 1
1 64× 32 → 64× 16 → 64× 8 → 64× 4 → 32× 2 → 16× 1 → 1× 1
2 64× 32 → 64× 4 → 8× 1 → 4× 1
3 64× 32 → 16× 2 → 1× 1
4 64× 32 → 64× 16 → 2× 1 → 1× 1
5 64× 32 → 4× 1 → 1× 1
6 64× 32 → 64× 16 → 4× 1 → 1× 1
7 64× 32 → 64× 16 → 64× 8 → 2× 1 → 1× 1
8 64× 32 → 2× 1 → 1× 1
Table 3.3: Example redistribution paths using a 64 × 32 processor grid and 568 × 71 local
problem size.
The bar graph in Figure 3.19 compares the run times and predicted times of the multigrid
solve phase for the various redistribution paths shown in Table 3.3. The predicted times are
computed using the performance model from Section 3.2, and are in good agreement with
the actual run times. Comparing the run times of Path 0 and Path 1, this graph shows a
40 times speedup of the new redistribution strategy over the original all-to-one strategy. In
addition, Path 1 has the lowest run time of several plausible alternatives to the all-to-one
strategy. Thus, Figure 3.19 demonstrates the utility of using a global search guided by a
performance model as a predictive tool for choosing optimal redistribution paths. Moreover,
using this strategy to select the redistribution path delivered effective weak parallel scaling
out to 500K+ cores on Mira, and out to 132K+ cores on Blue Waters, while the original
code was essentially unusable beyond 4K cores.
3.5 CONCLUSIONS
Emerging architectures place an increasing importance on data locality and minimizing
data movement. In these environments, structured approaches benefit from predictable
memory access patterns and avoiding indirect addressing. This motivates the development
of methods that exploit local structure to avoid incurring the performance consequences of
full algebraic generality. A robust structured single-block multigrid implementation that
scales well on modern architectures is an important step toward this goal.
42











Figure 3.19: Model and run times on Blue Waters of various redistribution paths using a
64× 32 processor grid and 568× 71 local problem size.
In this chapter, a new optimized recursive agglomeration algorithm for redistributing
coarse-grid work in structured multilevel solvers is introduced. This algorithm combines a
predictive performance model with a structure exploiting recursive enumeration of coarse
processor grids to enable a global search for the optimal agglomeration strategy. This ap-
proach significantly improves the weak parallel scalability of robust, structured multigrid
solvers such as BoxMG. In this study using BoxMG operators, this new algorithm delivers
very good weak scaling up to 524, 288 cores on an IBM BG/Q (Mira), and reasonable weak
scaling up to 131, 072 cores on Cray XE (Blue Waters). The speedup over the previous all-
to-one agglomeration approach is significant even at modest core counts, 40 times speedup
on just 2048 cores and 144 times speedup on 8192 cores when comparing the data from
Figure 3.1 and Figure 3.12. At larger core counts, the all-to-one agglomeration approach be-
came infeasible as the increased memory requirements for the coarse-grid problem exceeded
available memory.
In addition, the strong scaling of multigrid solves using this redistribution algorithm is
demonstrated on Blue Waters. Overall, the scaling limit is observed to be approximately
10K dofs per core, which is similar to results obtained in other studies and is dominated by
the communication cost of the smoother. Future work on additive variants of multigrid and
improvements to the performance model to more accurately capture deep memory hierarchies
are needed to reduce this limit.
43
To focus on coarse-grid redistribution, only pointwise Gauss-Seidel relaxation is considered
in this chapter. In the future, smoothers that may enhance performance, such as hybrid
or polynomial smoothers, or enhance robustness, such as line and plane smoothers [47]
(see Chapter 4), may be considered and their impact on optimal coarse-grid redistribution
explored.
44
CHAPTER 4: STRUCTURED RELAXATION
Relaxation methods, such as weighted Jacobi or Gauss-Seidel, play an important role in
the success of a multilevel methods (see Section 4.1 for a detailed description). Yet, in
many scenarios such as stretched meshes or anisotropic diffusion operators, more powerful
relaxation techniques may be necessary. For example, relaxation along entire lines or planes
of the mesh and in alternating fashions often improve convergence significantly, but at a
substantial cost if parallel communication is not addressed properly, as we show in this
chapter.
As an example, consider a cylindrical domain in Figure 4.1a with a curvilinear discretiza-
tion of a Poisson problem. Here, the mesh mapping introduces an anisotropy into the result-
ing operator. Using standard pointwise weighted Jacobi relaxation in a multigrid scheme
(BoxMG) results in stagnation of the iterates as shown in Figure 4.1b. In contrast, using
relaxation methods that annihilate the residual on subsets of points (rather than individual
points) is more effective. Indeed, for this example, alternating line smoothing (in the radial
and in the azimuthal directions) significantly improves performance, as shown in Figure 4.1b.
In a parallel setting, operating on entire lines or planes of a mesh (as is the case in line and
plane relaxation) presents a challenge for the parallel efficiency of structured solvers. Unlike
point relaxation, these methods involve global operations over a subset of dimensions in
the problem. Parallel scalability necessitates matching the dimensions of the grid partition
to the problem. This implies the global connections will span a non-constant number of
processors and increase the communication requirements of the solver. In this chapter, we
focus on reducing the communication costs in line and plane relaxation methods to improve
the parallel performance of structured multilevel solvers.
Specifically, the recursive version of the parallel tridiagonal solver proposed in [48] is imple-
mented to improve the parallel scalability of line smoothing in Black Box Multigrid. Parallel
results are introduced that confirm the expected communication complexity of this solver.
In three dimensions, an automated strategy for aggregating parallel communication in plane
smoothing is proposed. This strategy relies on efficient context switching to synchronize
communication phases of 2D solvers used in plane smoothing. This approach permits com-
munication within existing 2D solvers to be aggregated when used in a plane smoothing
operation. This is especially valuable for parallel 2D solvers with nontrivial communica-
tion patterns; for example, a 2D solver that recursively redistributes coarse-grid problems.
Parallel results using model problems in addition to a selected application are included to
demonstrate the utility of this approach.
45
The chapter is organized as follows. Section 4.1 introduces smoothers used in structured
solvers and motivates the need for block smoothers when structure is preserved on coarse
levels. In Section 4.2, parallel algorithms for block smoothing in two and three dimensions
are discussed. Section 4.3 proposes an automated strategy for reducing communication
latency costs in plane smoothing. Performance studies included in Section 4.4 demonstrate






(a) Cross section of cylindrical mesh


















(b) Convergence of point and alternating line smoothers on a curvilinear mesh.
Figure 4.1: Solver convergence with point and line smoothing on mapped mesh. Line smooth-
ing is needed for robust solution with mapped mesh from (a).
47
4.1 BACKGROUND
4.1.1 Smoothers for Structured Solvers
Given an n× n matrix problem
Ax = b (4.1)
the ith entry if x is denoted xi. Given a vector x, a pointwise weighted Jacobi scheme is
defined for each i = 1, . . . , n as
x
(new)







for some weight ω. In matrix form this becomes
x(new) = x+ ωD−1(b− Ax), (4.3)
where D is the diagonal of A. The method effectively annihilates the residual at each suc-
cessive element of the vector x to form a new iterate. Point-wise iterative methods, such
as Gauss-Seidel or weighted Jacobi, operate on single entries of a vector and are effec-
tive smoothers for isotropic problems — see Figure 4.2. With anisotropic problems, either
coarsening or relaxation must be modified to account for the anisotropy (see Figure 4.1b).
Coarsening in a single direction, referred to as semicoarsening, can be effective for problems
where the anisotropy is aligned to the grid [49, 50]. For general anisotropy, semicoarsen-
ing must be combined with block smoothers for robustness [21]. Semicoarsening also has
a drawback of increasing the total work in each cycle, since coarsening is less rapid — e.g.
coarsening by a factor of 2 is common in 2D, whereas full coarsening reduces the coarse
problem size by a factor of 4.
When coarsening is fixed, block smoothers are used to address anisotropy [21, 12, 13]. For
example, line smoothing operates on matrix entries corresponding to the same logical grid
line collectively. That is, x-line relaxation refers to logical grid lines of constant x. Consider
the case of a 2D 9-point discretization on a regular n × n mesh, which has the following
48
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
Figure 4.2: Updated data in the case of point relaxation (green), y-line relaxation (pink),
and x-line relaxation (blue).
sparsity pattern (in the case of n = 4):
   
     
     
   
     
        
        
     
     
        
        
     
   
     
     
   

(4.4)
Here, the matrix problem can be written in block form, say by groups of points in y-lines,
49























In each case Di, Li, and Ui are tridiagonal matrices.
In line relaxation, each step is a block solve, which has the form
Dixi = bi − Uixi+1 if i = 1
Dixi = bi − Lixi−1 − Uixi+1 if 1 < i < n
Dixi = bi − Lixi−1 if i = n.
(4.6)
This type of block smoothing is effective for grid aligned anisotropy by collectively pro-
cessing strongly connected unknowns. By simply alternating directions in each smoothing
step, line relaxation in two dimensions is robust for anisotropy that is not aligned to the
grid.[21, 13, 12]
Similar to line smoothing in two dimensions, plane smoothing can be used in three dimen-
sions as a robust smoother for anisotropic problems. In plane smoothing, rows in the matrix
are grouped by logical plane to reach the structure shown in Equation 4.5. Consider a 3D
27-point discretization on an n× n× n mesh. In the case of n = 3, the sparsity patterns is
shown in (4.7).
In this case, each block of (4.5) represents a 2D problem (with corresponding tridiagonal
blocks as in the 2D example above). Each Di, for example, is a structured 2D operator for
each plane —- plane relaxation then requires successive 2D solves. While a full 2D solve
is relatively expensive, in practice it is found that these systems do not need to be solved




   
 
   
     

     
   
 
   
     
  
     
        

        
     
  
     
   
 
   
     

     
   
 
   

(4.7)
4.1.2 Parallel Structured Decomposition Implementation
In a distributed parallel setting, the processor layout adds an additional layer of mappings
to the block structures outlined above. In this chapter we consider Cartesian processor
topologies, where logically structured problems (e.g. Figure 4.1a) are decomposed on a
processor grid by dividing the points among the processors by dimension.
Stencil-based communication is accomplished using halo regions with a halo width of one.
Parallel grid coarsening is used considering standard coarsening by a factor of two in each
dimension. As processors run out of local work, recursive coarse-grid redistribution is used
to repartition the work on a subset of involved processors as described in [23].
4.2 PARALLEL STRUCTURED SMOOTHING
Efficient parallel smoothers are important to the performance of structured solvers as they
often dominate the cost of a solve. In point smoothing, computation and communication
are straightforward. The computation consists of stencil-based updates for each grid point
and the communication is performed through single layer halo exchanges. In the case of
line and plane smoothing, however, global operations are executed over a subset of involved
51
dimensions. The increased communication requirements of these operations motivates the
need for scalable and efficient methods for processing a sequence of distributed block solves
from Equation 4.6. In line relaxation, these blocks correspond to tridiagonal systems, thus
requiring a scalable distributed memory tridiagonal solver. In plane relaxation, efficient
processing of a series of distributed two dimensional cycles is needed.
In this chapter, Gauss-Seidel block smoothing is considered for line and plane smoothing.
To process blocks in parallel, red-black ordering of blocks is used. However the techniques
can be extended to any number of structured smoothers.
4.2.1 Parallel Line Smoothing
An efficient tridiagonal solver is key to the performance of line smoothing in parallel. In
this section, we consider a series of tridiagonal systems in a distributed memory environment,
which forms the basis of the steps outlined in (4.6). Many parallel algorithms have been
developed for the direct solution of tridiagonal linear systems. Many of the early algorithms,
such as cyclic reduction [51] and recursive doubling [52], target fine-grained parallelism.
Later, a wide class of algorithms based on parallel partitioning of the matrix rows were
developed to target coarse-grained parallelism [53, 54, 55, 56, 57]. In [58], a framework for
the unified analysis and development of such algorithms is also proposed. While many of
these algorithms focus on a single level of partitioning, recursive partitioning should be used
to avoid the linear scaling of computation and communication with respect to the number
of processes.
In this chapter, the recursive parallel partitioning algorithm proposed in [48] for scalable
distributed line relaxation is considered. In this approach, rows of the matrix are partitioned
across p processors. For example in the case of p = 3 and a single n× n tridiagonal matrix,






For example, Figure 4.3 shows a tridiagonal matrix partitioned across three processors.
Rows adjacent to partition boundaries are called interface rows, while rows not connected to
a row across a partition boundary are called interior rows. If the unknowns in each partition























































































Bi represents the matrix of interior rows, Ei and Fi represent coupling to and from the
interface rows, and Ci represents couplings between interface rows on a given partition. xi, fi
are the interior unknowns and right-hand side. yi, gi are the interface unknowns and right-






































Bi Fi Ei Ci Ei,i−1 Ei,i+1
Figure 4.4: Block structure of partitioned tridiagonal matrix.
reordered with interface unknowns ordered last. Three phases are then considered.
The first phase. To form a decoupled reduced system involving only the interface rows,
the interior rows are used to eliminate each Fi block. To perform this elimination, two cases
54
are considered. For the lower interface equation which has a nonzero in the last column of







is used to eliminate this nonzero. Here, ek is the kth column of the identity matrix with k
equal to the number of rows in Ci. For the upper interface equation which has a nonzero in







is used to convert this matrix to upper triangular and remove the nonzero. During each
elimination step, the corresponding row of Ci is modified and fill is introduced from the
nonzero columns of Ei. Letting Ĉi denote the updated Ci matrix and ĝi the updated right-
hand side, the reduced system produced by eliminating the Fi block for each interface row























The second phase. The reduced system in Equation 4.13 is a tridiagonal system with
2p − 2 rows. To solve this system, recursion can be used by partitioning Equation 4.13 to
s processors with s < p. This is done by combining processors from the original partition.
Recursion is terminated when s = 1. In this case, the reduced system is gathered to one
processor and solved using the sequential Thomas algorithm. When this phase is complete,
the solution yi is scattered from processors in the merged partition to their corresponding
processors in the original partition.
The third phase. The nonzero elements in Ei are first eliminated by substituting the
solution values yi from the second phase. Once complete, the interior rows form independent
tridiagonal systems which can be solved on each processor using the sequential Thomas
algorithm.
By combining a constant number of processors in the second phase, this approach provides
a direct solver with a communication complexity that is logarithmic with respect to the
number of processors. As described in [48], the Gaussian elimination steps in the first
phase can be completed using storage for only six values to build the Ĉi blocks–avoiding
55
the increased cost of storing a factorization. This is especially important in reducing the
significant memory requirements in line and plane relaxation where a tridiagonal solve is














Figure 4.5: Weak scaling of line relaxation on Blue Waters with 100 × 100 local problem
using single level and recursive partitioning.
Figure 4.5 shows the importance of using recursive partitioning to avoid a serial fraction
that grows linearly with the number of processes.
4.2.2 Parallel Plane Smoothing
Plane smoothing in parallel involves a series of distributed 2D multilevel cycles to solve
Equation 4.6. Using red-black ordering of planes, half of the planes can be processed in
parallel without communication across planes. This involves a series of independent parallel
tasks with 2D structured communication. Planes on a given processor have uniform com-
munication and computation patterns. They perform the same 2D operations on the same
logically structured region. Communication phases for planes on a given processor then in-
volve the same remote processors. Instead of processing the planes sequentially, the uniform
communication pattern of planes on a process can be exploited to aggregate communication.
As noted in [12, 59], the parallel efficiency of plane relaxation can be improved by collecting
data to be sent from each 2D parallel solver. This data can then be sent as a single message
to reduce the overall latency cost in plane relaxation. To aggregate communication across
planes in this way, concurrent execution of 2D parallel plane cycles is needed. In the context
56
of a scalable robust structured solver, coordination of the concurrent execution of these 2D

























































































Figure 4.6: High-level view of operations in a scalable 3D robust structured Vcycle using
block smoothers.
using block smoothing with a scalable 3D structured solver. Within each 3D Vcycle, plane
smoothing is used at a variety of resolutions. Each plane smoothing step performs a series
of 2D Vcycles each of which includes line smoothing at a variety of resolutions. The line
smoothing operations use multilevel tridiagonal solvers which recursively repartition data to
improve parallel scalability. In addition to the block smoothing operations, each 2D and 3D
distributed Vcycle performs coarse-grid redistribution needed for parallel scalability. Since
these operations lead to nontrivial execution, we focus on automating the coordination of
2D cycling needed to aggregate communication in plane smoothing.
4.3 AUTOMATED COMMUNICATION AGGREGATION
To automate communication aggregation, we introduced a software layer realized as a set
of services to be provided to a distributed multilevel solver instance. The services used for
57
Algorithm 4.1: Parallel Plane Relaxation
Input: u initial guess
f right-hand side
A
nplanes number of planes
Output: u iterative solution
1 for c = 0, . . . , 2 do
2 for i = c, c+ 2, . . . , nplanes do
3 A2 = slice(A, i)
4 u2 = slice(u, i)
5 f2 = slice(f, i)
6 cycle2d(A2, u2, f2)
7 halo update(u)
aggregating communication are:
1. Memory Pool: Services requests for grid function memory allocations
2. Halo Exchange: Exchanges halo data on each multigrid level
3. Message Passing: Provides a message passing interface
Implementations of these services can be selected dynamically to facilitate aggregation when
a 2D solver instance is used within plane relaxation. The memory pool service is used to
ensure allocations of fine and coarse-grid vectors that will later be involved in communica-
tion are contiguous in memory across planes of the same color. This removes the need for
the data collection stage when aggregating plane communication—providing better memory
efficiency. The halo exchange and message passing services provide an interface through
which communication in the 2D solver is performed. In this way, aggregated communication
routines may be selected when used within a plane relaxation sweep. Even when commu-
nication aggregation is not used, the message passing service is necessary to avoid creating
redundant MPI communicators in plane relaxation. For instance, coarse-grid redistribution
involves creating new communicators for redistributed cycling. Multilevel line relaxation
also creates communicators for collective communication at each level and halo exchange
libraries typically duplicate the communicator passed to them. When run at scale, the plane
relaxation routine quickly reaches the MPI communicator limit in modern implementations
if each plane creates these communicators independently.
In plane relaxation, the two main communication routines are a halo exchange and the
gather/scatter operations used in multilevel line relaxation. Since vectors involved in com-
58
Figure 4.7: Distributed plane solves. The colors denote processors, points degrees of freedom,
and lines connections in the mesh. The original 3D problem distributed on eight processors
is shown on the left. The right visualizes plane solves for this distribution where off plane
coefficients are moved to the right-hand side.
munication are contiguous in memory across planes of the same color, the aggregated halo
exchange service can be implemented by simply using a 3D halo exchange with communi-
cation in the direction orthogonal to the plane disabled. The halo regions communicated
are then 2D regions consisting of the 1D halos of each plane on a processor. To implement
aggregated gather and scatter operations without needing a data collection stage, a MPI
user defined data type can be used to correctly interleave plane data as shown in Figure 4.8.
While the service layer provides a mechanism for presenting different data layouts and
communication routines to the plane solvers, a method for coordinating the concurrent ex-
ecution of the planes is also needed. In contrast to the entirely parallel operation of plane
solves of the same color, the addition of communication aggregation necessitates synchroniza-
tion for each communication operation. Since each 2D solve involves the nontrivial control
flow and communication patterns associated with recursive coarse-grid redistribution and
multilevel line relaxation, avoiding manual coordination is desired. Context switching is
well suited for this type of coordination–allowing for planes to be suspended and resumed
when communication occurs. Using context switching, plane coordination is implemented
using Algorithm 4.2 for each communication routine. This assumes a one-to-one mapping




Standard gather to one
processor (blue)
Interleaved gather to one
processor (blue)
Figure 4.8: Aggregated gather/scatter operation of three planes on two processors. Data
associated with the first, second, and third plane are denoted by , , and respectively. The
gather and scatter operations interleave plane data to keep consecutive planes contiguous in
memory.








index of plane (0, . . . , num planes - 1)
1 switch context(mod(plane index+1, num planes))
2 if plane index = 0
3 communicate(...)
constructs could be added if extra cores are available (e.g., fewer MPI ranks than compute
cores). To manage the concurrent execution of plane solves and provide efficient context
switching, the lightweight user-level threading library Argobots1 [60] is used. Figure 4.9
shows the execution of planes on an MPI rank. Planes are grouped by color into teams
of manager and worker planes. The manager plane is then responsible for running the
aggregated communication routines and splitting MPI communicators.
Using lightweight user-level threads with the service layer provides an automated method
1http://www.argobots.org
60
Computation Communication ComputationManager 0
Computation Communication ComputationWorker 2
Computation Communication ComputationWorker 4
Computation Communication ComputationManager 1
Computation Communication ComputationWorker 3
Computation Communication ComputationWorker 5
(a) Red sweep
Computation Communication ComputationManager 0
Computation CommunicationWorker 2
Computation CommunicationWorker 4
Computation Communication ComputationManager 1
Computation Communication ComputationWorker 3




Figure 4.9: Execution schedule for red-black plane relaxation.
for aggregating communication for a sequence of distributed tasks with uniform communi-
cation patterns. In the context of plane smoothing in a 3D structured solver, this allows
the 3D smoothing routine to directly use 2D solver routines for each plane and have their
communication aggregated.
4.4 EXPERIMENTAL RESULTS
To assess the parallel scalability of this approach, a variety of scaling studies are included.
For these scaling studies the following petascale system was considered:
Blue Waters2 A Cray XE system at the National Center for Supercomputing Applications
(NCSA) at the University of Illinois at Urbana-Champaign. Blue Waters employs a
2https://bluewaters.ncsa.illinois.edu/blue-waters
61
Figure 4.10: Series of distributed plane operations.
3D torus using a Cray Gemini interconnect and has 22,640 XE compute nodes each
with two AMD Interlagos processors.
To improve communication performance for structured grids, a rank ordering that matches
the application topology to the machine topology was used [61].
4.4.1 Series of Uniform Distributed Tasks
To focus on the performance of communication aggregation in reducing communication
costs for a series of uniform distributed tasks, a scaling study is included using 2D smoothing
routines. These routines are run on a sequence of 2D problems with the same parallel data
distribution. This is analogous to a single level within a sweep of plane relaxation. Runtimes
are then compared using automated communication aggregation vs. independent execution
of the 2D routines.
To look at strong scaling, a 2D model diffusion problem was setup on a series of 100
1024 × 1024 grids. The number of cores in each coordinate direction was doubled from 8
to 128 cores, to create a range from 64 to 16,384. The corresponding local problem sizes
also varied from 64 to 16,384 degrees of freedom per core. Figure 4.11 shows strong scaling
behavior of point and line relaxation on a series of 2D grids. In both cases, communica-
tion aggregation significantly improves strong scaling with greater speedup as the the local
problem size decreases. For example, sequential point smoothing reaches a parallel efficiency
under ten percent at approximately 4k cores in contrast to aggregated point smoothing which
reaches an efficiency of 37 percent. Aggregation in this case extends the strong scaling limit
of ten percent efficiency from approximately 1k cores to 4k cores. Similarly, sequential line
smoothing reaches an efficiency under ten percent at approximately 4k cores whereas aggre-
gated line smoothing reaches an efficiency of 30 percent. Even at 16k cores, aggregated line
smoothing retains an efficiency above ten percent and the strong scaling limit is extended


































Figure 4.11: Strong scaling relaxation routines on Blue Waters with problem size: 1024 ×
1024. The aggregate label indicates the relaxation routine was run with automated commu-
nication aggregation using ULTs, the sequential label indicates that the 2D routines were
run independently, and the manual aggregate label indicates the smoothing routine was
manually rewritten to aggregate communication.
more dominant, in contrast to local computation and communication bandwidth, as the local
problem size decreases. Since communication aggregation targets this latency cost, it is most
useful in reducing communication on coarser grids. To help quantify the overhead of using
user-level threads to automate communication aggregation, the point smoothing routine was
rewritten to manually aggregate communication. This is shown in Figure 4.11 as the “man-
ual aggregate point” line. By manually aggregating communication, the overhead incurred
by context switches among the 100 2D problems is avoided. As seen in Figure 4.11, this
overhead is relatively small due to the lightweight user-level threading provided by Argobots.
Turning to weak scaling, the 2D model diffusion problem was again setup on a series of
100 grids. Local problem sizes were chosen near both ends of the strong scaling range, with
one chosen away from the strong scaling limit and one near this limit where weak scalability
is best challenged. Figure 4.12 shows aggregation is critical for improving the performance
of point smoothing for the smaller local problem size where communication latency costs
dominate. For the larger problem size, aggregation was also shown to improve weak scal-
ing efficiency above 1k cores. For this problem size, sequential point smoothing reached an
efficiency of 68 percent at approximately 4k cores in contrast to aggregated point smooth-
ing which retained an efficiency of 99 percent. The overhead of using user-level threading
to coordinate aggregation was most significant in the smaller problem size; however, this
cost is relatively low and is not shown to significantly impact weak scalability. Similar to





























Figure 4.12: Weak scaling point relaxation routines on Blue Waters with local problem
sizes: 10 × 10 and 100 × 100. Aggregate indicates the relaxation routine was run with
















Figure 4.13: Weak scaling line relaxation routines on Blue Waters with local problem sizes:
10× 10 and 100× 100. Aggregate indicates the relaxation routine was run with automated
communication aggregation using ULTs, and sequential that the 2D routines were run inde-
pendently.
smoothing for the smaller local problem size. Unlike the constant communication complex-
ity of point smoothing, line smoothing communication scales logarithmically with respect to




To investigate the performance of communication aggregation with plane relaxation, a
strong scaling study and resolution study are included. In both studies, red-black plane
relaxation is used with each plane executing a single V(1,1) cycle. Both the 2D plane cycles
and outer 3D cycle use recursive coarse-grid redistribution to balance communication and
computation costs at coarse levels. With the exception of the communication in coarse-grid
redistribution, all communication in the multigrid solve phase is aggregated.
To assess the impact of automated communication aggregation on the strong scalability
of plane relaxation, ten 3D V(1,1) cycles were run using a 27 point operator with a problem
size of 128 × 128 × 128. The number of processors in each dimension ranged from 2 to
16, and the local problem size ranged from 8 to 64 degrees of freedom in each dimension.
Figure 4.14 shows strong scaling on Blue Waters where the planes at each level were processed
sequentially, and where user-level threads were used to aggregate plane communication on
each level. This figure shows communication aggregation improves the strong scalability of
the solve—approximately doubling the parallel efficiency.
To examine the performance of this approach with an application, we consider calculating
the electric potential on simulation meshes for the ACT-II facility at Illinois [62]. Figure 4.15
shows a schematic of this facility and the corresponding simulation meshes. To mesh this
facility, overlapping logically structured grids are used. The potential of the electric field is
given by
−∇ · ε∇φ = f, (4.14)
where ε is the permittivity of the electric field. Computation on each mesh is performed
in the logical space Ξ = (ξ1, ξ2, ξ3) given by the unit cube. In curvilinear coordinates,









where J is the Jacobian of the transformation, gij the contravariant metric tensor [3]. The
additional metric terms in Equation 4.15 can introduce anisotropy in the operator depend-
ing on the physical geometry of the mesh. To conduct a single mesh resolution study, the
main application mesh shown on the bottom of Figure 4.15 was chosen. This mesh rep-
resents the most significant computational challenge in the setup and includes stretching
appropriate for plane smoothing. Three resolutions of this mesh were generated with di-
mensions listed in Table 4.1. The processor grid sizes were chosen to keep the local problem
















point sequential solve point aggregate solve














line sequential solve line aggregate solve
(b) Planes with line smoothing
Figure 4.14: Strong scaling 3D solve with plane relaxation. Aggregate indicates the relax-
ation routine was run with automated communication aggregation using user-level threads,
and sequential that the 2D routines were run independently.
consider Equation 4.15 with a constant right hand side and homogeneous Dirichlet bound-
ary conditions. Figure 4.16 shows convergence of Cedar with Mesh1 from Table 4.1 using
a variety of smoothers. Using point smoothing with this problem resulted in the poorest
convergence. This is due to anisotropy introduced from mapping this mesh to logical space
66
mesh size logical grid processor grid
Mesh1 1,002,177 191× 159× 33 4× 2× 2
Mesh2 10,000,650 550× 319× 57 8× 5× 4
Mesh3 100,035,208 1003× 959× 104 16× 12× 8
Table 4.1: Three resolutions of simulation mesh.
in Equation 4.14. Alternating plane smoothing, which alternates among xy, xz, yz planes,
is a robust smoother for anisotropic problems and results in the best convergence for this
problem. yz plane smoothing had similar convergence to alternating plane smoothing in
contrast to xy and xz plane smoothing. This shows coupling for this problem is in the
yz direction. This is supported by Figure 4.17 which shows less stretching in the y and
z directions. This leads to stronger coupling in the operator when mapped to the compu-
tational domain. Figure 4.17 also shows stretching in the y and z directions are related
leading to a nearly isotropic yz plane. For this reason line smoothing is not needed for the
yz planes. This is shown in the similar convergence behavior of Cedar with yz planes using
point smoothing and line smoothing from Figure 4.16. Figure 4.18 shows multigrid setup
and solve times of Cedar using point and plane relaxation using the meshes and processor
decompositions shown in Table 4.1. Setup and solve times for the parallel algebraic multigrid
solver BoomerAMG from Hypre are also included to show the performance benefit of using
a robust structured solver for this problem.
Solver speedup from Mesh1 to Mesh2 when using point smoothing results from differences
in the metric terms from Equation 4.15 at different resolutions. To provide a measure of the
anisotropy introduced to the operator from mapping each mesh from Table 4.1 to a logically
structured grid with unit spacing, the following metric is considered. For each interior vertex
on the mesh, the vector
~c =
1√




is computed. This represents the normalized diagonal of each cell on the mesh. The scalar
projection of ~c onto ~1 ∈ R3 is then used as a single number to quantify mesh anisotropy.
Computed for each cell of the mesh, this is used as a metric for mesh stretching as shown
in Figure 4.19. With a higher percentage of cells close to square, Mesh2 introduced a
significantly smaller amount of anisotropy compared to Mesh1 and Mesh3. This impacts
solver convergence with point smoothing. The iteration counts to reach the fixed relative
67
tolerance using point smoothing in Figure 4.18 were 48 with Mesh1, 27 with Mesh2, and
49 with Mesh3. This resulted in a smaller solve time using point smoothing with Mesh2
compared to Mesh1 even though Mesh2 had approximately ten times the number of grid
points.
4.5 CONCLUSIONS
To achieve performance at scale, robust structured solvers must include efficient coarse-
grid correction and smoothing routines. While optimized coarse-grid redistribution [23]
provides an efficient strategy for scaling the application topology with diminishing local work,
efficient smoothing routines to handle anisotropic problems are also needed for robustness.
In this chapter, the scalable parallel tridiagonal solver proposed in [48] is used to provide
efficient line relaxation without the need to store a factorization for each line. Parallel
results from figure 4.5 confirm the logarithmic communication complexity. To improve the
efficiency of robust structured smoothing in 3D, an automated strategy for aggregating
communication for a series of distributed tasks with uniform communication patterns is
proposed. Using lightweight user-level threads to manage the concurrent execution of plane
solves and a service layer to specialize communication routines and memory allocation,
manual coordination and data collection among planes is avoided. Parallel results show
this approach is effective in reducing communication costs for coarse-grid computations–
significantly improving strong scalability.
Using user-level threads to orchestrate plane execution in a multilevel solve also exposes
additional communication optimizations. The ability to inexpensively switch contexts among
planes facilities communication and computation overlap across planes. For example, com-
munication from one plane can be overlapped with computation from another using non-
blocking communication routines. Communication aggregation can then be used to maintain
useful overlap at a variety of resolutions by aggregating communication for a subset of planes
on a process. This is applicable when plane smoothing is used within a multilevel Vcycle











Figure 4.15: ACT-II facility at Illinois [62]. (a) shows a schematic of the upstream section of
the facility with labeled electrodes. (b) shows the set of overset meshes used for simulations.
(c) shows the mesh used for the resolution study. This mesh represents the largest mesh in
the setup and includes the region where the effects of the electric field are most significant.
(d) focuses on the region of the mesh used for the resolution study where cells are most
skewed. Mesh and image files produced by Kyle Mackay from the University of Illinois.
69















Figure 4.16: Convergence of Cedar using a variety of smoothers with mesh from Figure 4.15.
Alternating and YZ-Plane smoothing have best convergence for this mesh.






















Figure 4.17: Components of normalized cell diagonals on the mesh. Larger values indicate
strong stretching in that coordinate direction. Stretching for each mesh is strongest in the x


















Figure 4.18: Resolution study with an approximately constant number of grid points on each
processor. Multigrid setup and solve times are shown with BoomerAMG from Hypre and
Cedar with both point and plane relaxation. Robust structured solve with plane smoothing














Figure 4.19: Scalar projection of normalized mesh cell diagonals onto ~1 ∈ R3. Values of 1.0
indicate square mesh cells. Point smoothing is most effective with Mesh2 which contains a
higher percentage of cells near square.
71
CHAPTER 5: EMERGING ARCHITECTURES
Emerging high performance computing architectures continue to challenge solver perfor-
mance through heterogeneity and deep memory hierarchies. The performance of robust
structured multilevel solvers is driven by structured matrix-based operations. The arith-
metic intensity of these operations is approximately 1. With low arithmetic intensity, these
operations are bound by the memory performance of the target architecture. This motivates
the careful consideration of the memory subsystem for improving solver performance [63, 64].
The next generation of machines deployed at Lawrence Livermore National Laboratory and
Oak Ridge National Laboratory feature multiple GPUs per node using the IBM Power9
architecture with NVIDIA Volta GPUs. To achieve performance on these machines, solvers
must have the facility to run on GPUs as they represent the majority of memory and floating
point performance on each node.
Robust structured multilevel solvers exploit structure in a problem to maintain efficient
stencil-based operators throughout the multigrid hierarchy. This avoids CSR indirection and
increased communication costs at coarse levels. These performance advantages, in addition
to fixed communication patterns, make them a natural candidate for GPU accelerators as the
improved data locality leads to a simpler use of the disparate resources—requiring less com-
munication across devices. By targeting the higher memory bandwidth available on GPUs,
performance improvements to the stencil based computations in Cedar are anticipated.
In this chapter, an approach for offloading legacy code with structured regions to the GPU
on current Power9/Volta systems is presented. A performance expectation is developed
to evaluate technologies used to simplify the porting of legacy code to this architecture
and provide context for performance improvements. This approach is then used to offload
structured solve phase operations written in Fortran to the GPU with few changes required.
The remainder of this chapter is organized as follows. Section 5.1 introduces the heteroge-
neous system targeted in this chapter in addition to technologies and programming models
commonly used on the system. Section 5.2 builds a preliminary performance expectation
for memory bandwidth limited operations on each processing unit on the IBM Power sys-
tem and provides guidance on thread affinity and unified memory settings that result in the
best performance for these operations. In Section 5.3, results of offloading the solve phase
of Cedar to the GPU are included demonstrating performance improvement over the CPU
similar to the expected improvement from Section 5.2.
72
5.1 BACKGROUND
To study performance of structured solvers on emerging architectures, we focus on IBM
Power9 Systems with NVIDIA Volta GPUs—the platform used in the upcoming Summit and
Sierra machines at LLNL and ORNL. Table 5.1 summarizes published data transfer rates
Location Bandwidth
CPU to memory 140 GB/s
GPU to memory 900 GB/s
CPU to CPU 64 GB/s
GPU to GPU (4 GPUs) 150 GB/s
GPU to GPU (6 GPUs) 100 GB/s
CPU to GPU (4 GPUs) 150 GB/s
CPU to GPU (6 GPUs) 100 GB/s
Table 5.1: Aggregate transfer rates on IBM Power9 nodes.
between different devices on this heterogeneous system [65]. Each Power9 CPU is connected
to system memory with eight channels delivering 17.5 GB/s. The Volta GPUs are connected
using NVLink links, an NVIDIA high-speed interconnect for GPUs, which provide 50 GB/s.
Since each GPU and CPU supports six NVLink links, transfer rates vary between four and









Figure 5.1: GPU connectivity on Power9+Volta systems with 2 GPUs per socket (a) and
3 GPUs per socket (b). Each Arrow represents an NVLink2 connection with 50 GB/s
bidirectional bandwidth.
two GPUs using three NVLink links. GPUs connected to the same CPU are then connected
to each other using three NVLink links. On nodes with six GPUs, each CPU is connected
to three GPUs using two NVLink links. GPUs connected to the same CPU are connected
73
to each other using two NVLink links. In both node configurations, GPUs connected to
different CPUs are not connected through NVLink. This motivates grouping GPUs on a
node by socket when devising application communication patterns in order to minimize
GPU communication across sockets.
With the variety of memory regions of this architecture, a number of options exist for
simplifying data access across a node. Unified memory is a technology that provides a single
virtual address space available to any processor on a system. This significantly simplifies
the process of porting applications to heterogeneous systems since data does not need to
be allocated separately on each device with data transfers managed explicitly. In CUDA,
unified memory can be created using a custom allocator routine. This memory is managed
by CUDA and uses hardware supported page faults to transfer data between processors.
Access counters are also used to migrate pages based on access frequency. The CUDA API
includes routines to provide access hints to memory regions that optimize unified memory
performance. In addition to CUDA managed memory, the Power AC922 system includes an
address translation service (ATS) that allows GPUs to access CPU page tables directly [66,
67]. This provides unified memory access without a custom allocator.
A number of programming models exist to simplify the porting and development of appli-
cations for heterogeneous systems. Kokkos, a C++ framework developed at Sandia National
Laboratories [68], provides a set of C++ abstractions for data structures and parallel ex-
ecution of computationally expensive operations. These abstractions are used to generate
optimized code for a variety of platforms. Similar to Kokkos, RAJA is a C++ performance
portability layer developed at Lawrence Livermore National Laboratory [69, 70]. Similarly,
by expressing performance critical code in RAJA, applications can remain flexible to evolving
hardware and lower level programming models.
OpenMP is a directive based shared memory programming model [71]. OpenMP consists
of a runtime library and compiler directives for C/C++ and Fortran. Recently, support for
accelerators was added to the OpenMP 4 standard. This allows offloading compute intensive
code to the GPU.
Anticipating the deployment of the Sierra and Summit systems, a number of projects
have made progress in targeting this architecture. The Hypre project [72] at LLNL began
by adding GPU support in their structured solver codes. This effort focused on porting their
boxloop construct using various heterogeneous APIs. This construct abstracts nested loops
over a structured region (box) and is used for most computational operations. The current
version (2.15.1) supports CUDA, RAJA, and Kokkos. Using unified memory and MPI, this
approach required relatively few changes since many of the computational kernels use the
boxloop construct. In their algebraic multigrid solver, BoomerAMG, unified memory and
74
MPI are used with OpenMP 4.5 and CUDA. As of v2.17.0, only the solve phase includes
GPU support although there is active development in porting the setup phase.
Most computationally intensive operations in Cedar currently use legacy Fortran kernels.
To use RAJA or Kokkos, these operations would need to be rewritten in C++. To best use
current legacy code, OpenMP offloading with unified memory is considered in this chapter.
5.2 PERFORMANCE EXPECTATION
Many computational kernels in structured solvers involve structured matrix-vector prod-
ucts. With a low arithmetic intensity, the performance of these operations are bound by
the memory bandwidth of the target architecture. To develop a performance expectation
for Power9 systems, the STREAM benchmark [73, 74] is considered to measure bandwidth
performance on both the CPU and GPU. Benchmarking is also used to evaluate the perfor-
mance implications of using unified memory and OpenMP offloading to execute structured
matrix-based operations on the GPU. Many of the computationally intensive operations in
Cedar rely on legacy Fortran code. For this reason, OpenMP offloading is considered for its
ability to generate device code from both Fortran and C.
The Lassen system1 at Lawrence Livermore National Laboratory is used for all perfor-
mance results. Lassen is an unclassified Sierra system with 774 IBM Power AC922 nodes
using IBM Power9 processors and NVIDIA Volta GPUs. Each node has two Power9 pro-
cessors with 22 cores per socket (although 2 are reserved for the operating system) and four
Volta GPUs. Each GPU has 16 GB high bandwidth memory with 900 GB/s peak bandwidth
and each node has 256 GB of DDR4 memory.
5.2.1 Vector Triad Performance
To explore the performance implications of different programming models and unified
memory mechanisms on the IBM Power system, the vector triad benchmark [73, 74] is
considered. This benchmark, shown in Algorithm 5.1, is useful for determining sustained
Algorithm 5.1: Vector Triad
1 for i = 1, . . . , N do
2 A[i] = B[i] + C[i] ∗D[i]
memory bandwidth performance. Along with providing a tool for evaluating different mem-
1https://hpc.llnl.gov/hardware/platforms/sierra
75
ory management strategies and programming models on this system, it provides a useful
expectation of performance for memory bandwidth bound kernels used in structured solvers.
To examine the performance of this benchmark on the Power9 processor, the vector triad
was implemented using OpenMP. Focusing on memory performance for a single Power9
processor, the benchmark was run using thread counts up to the number of cores on a CPU
using a problem size of 1.2 GB. Figure 5.2 shows performance of the OpenMP vector triad




















Figure 5.2: Vector triad performance on Power9 CPU with varying thread affinity. The
cores line assigns OpenMP threads to cores on a socket sequentially. The sockets line uses
OMP PLACES=’sockets(1)’ with OMP PROC BIND=close. The block line uses a block map-
ping to assign OpenMP threads to cores on a socket. This mapping assigns thread i to core
i × 20/N where N is the number of requested threads. The hybrid line assigns threads to
a single core in each core pair sharing cache on the chip for thread counts less than 11. For
thread counts larger than the number of core pairs, the hybrid line uses the same binding
procedure as the cores line.
using a variety of thread affinities. Cores on the Power9 CPU are arranged into pairs sharing
an L2 and a 10 MB segment of L3 cache. In this setting, affinities that spread threads across
core pairs that share lower-level caches achieve the best memory bandwidth performance.
In Figure 5.2, the cores line assigns threads to cores sequentially. This does not avoid using
76
both cores in a core pair resulting in decreased memory bandwidth performance for thread
counts less than the number of core pairs. In Figure 5.2, the hybrid affinity achieved the
highest performance. In this hybrid method, threads are assigned to the first core in each
core pair sharing lower-level cache until all core pairs were used. As a result, the maximum
memory bandwidth performance is achieved when all eight memory channels are saturated
using eight cores from separate core pairs. For core counts greater than the number of core
pairs, the hybrid affinity used OMP PLACES=cores with OMP PROC BIND=close. Figure 5.3


























































Figure 5.3: Vector triad performance on Power9 CPU with gcc (a), IBM XL (b), and clang
(c) compilers using the hybrid thread affinity. Data points plot the mean over ten runs.
The error bars are generally tight and mark minimum and maximum values over ten runs.
Performance is limited when all eight memory channels are saturated using stack allocated
memory.
shows performance of the OpenMP vector triad compiled with gcc, IBM XL, and clang
compilers using the hybrid affinity from Figure 5.2. The compilation flags used for gcc
where -O3 -fopenmp -ftree-vectorize, the compilation flags used for the XL compiler
where -O3 -qsmp=omp -qsimd -qarch=pwr9 -qtune=pwr9:st, and the compilation flags
used for clang were -O3 -fopenmp -ftree-vectorize. This figure provides an estimate
of how each compiler performs with standard optimization levels; however, the plots in
Figure 5.3 can not be directly compared as each compiler employs different optimizations.
In Figure 5.3, memory bandwidth performance is sensitive to both the compiler and memory
location of the arrays. When using stack allocated arrays, the OpenMP triad compiled with
each compiler reaches a performance limit at eight cores which corresponds to the number of
memory channels. With heap allocated arrays, additional cores are needed to achieve a lower
performance limit than when using stack allocated arrays. Only the vector triad compiled
with gcc using stack allocated arrays hits the achievable memory bandwidth performance of
the Power9 CPU.
To examine the performance of the vector triad on the Volta GPU, the benchmark is
77
implemented using OpenMP offloading using CUDA managed memory and system allocated
memory with the IBM Power address translation service (ATS). This is then compared to
a CUDA implementation of the kernel without unified memory where data is managed
explicitly with separate host and device pointers. In the case of unified memory, all data is
prefetched for this experiment. The CUDA implementation uses a fixed block size of 1024
threads and mapped one triad index to each thread. The number of thread blocks in this
case grows with the problem size. For this experiment we consider problem sizes ranging




















Figure 5.4: Performance of vector triad implemented in CUDA and OpenMP offloading on
Volta GPU with three problems sizes with varying device memory management. Managed
indicated CUDA managed memory was used for the triad arrays. Explicit indicates separate
device and host pointers were used with explicit memory copies between the host and device.
ATS indicates unified memory using address translation services was used for the triad arrays.
The use of ATS incurs performance penalties for these problem sizes.
from 10 MB to 10 GB. Figure 5.5 shows performance with a variety of problem sizes in this
range while Figure 5.4 focuses on three representative sizes. For this benchmark, OpenMP
offloading demonstrates similar performance to the CUDA implementation when not using
unified memory with address translation services (ATS). CUDA managed unified memory
also exhibits similar performance to explicitly managed data. This is in contrast to ATS
which resulted in a performance loss in both the CUDA and OpenMP implementations.
This performance loss is more significant for the CUDA implementation that used more
thread blocks than the OpenMP implementation. Figure 5.5 shows performance of the
vector triad with problem sizes ranging from 10 MB to 10 GB. These problem sizes cover a
78






















Figure 5.5: Performance of OpenMP and CUDA vector triad on Volta GPU with problem size
varying from 10 MB to 10 GB. Managed indicates CUDA managed memory was used for triad
arrays. Explicit indicates separate device and host pointers were used with explicit memory
copies between the host and device. ATS indicates unified memory with address translation
services was used for the triad arrays. The explicit and managed data points overlap as they
resulted in similar performance. Figure 5.4 shows three representative problem sizes from
this figure.
significant portion of the 16 GB high bandwidth memory on the Volta GPU. Using explicit
and CUDA managed memory, the OpenMP and CUDA triad implementations display similar
performance and quickly reach a performance limit at 92% of peak memory bandwidth
performance. Using unified memory with address translation services resulted in a significant
performance loss for problem sizes larger than 1.12 GB.
The results in this section provide a rough expectation of performance that can be gained
by targeting the higher memory bandwidth performance of the GPU over the CPU on a
Power9 system for memory bandwidth bound operations. The results also demonstrate the
low overhead of OpenMP offloading with CUDA managed memory using prefetched data
which achieves a high percentage of peak performance.
79
5.2.2 Structured Residual Performance
To evaluate the performance of OpenMP offloading and unified memory for a represen-
tative structured solver operation, the residual calculation b − Ax where A is a symmetric
stencil-based operator is considered. Figure 5.6 shows how the stencil operator is stored in
i, j
i − 1, j − 1 i, j − 1 i + 1, j − 1
i − 1, j i + 1, j





Figure 5.6: Storage for a symmetric, nine point stencil centered at coordinate i, j on a logi-
cally structured Cartesian mesh. The solid lines indicate stencil weights stored at coordinate
i, j.
memory for a two dimensional, nine point stencil. Each point (i, j) on a logically structured
grid stores the five stencil weights shown in Figure 5.6. Fictitious points are used on the
boundary of the logically structured grid in order for every stencil weight to be represented
using this simple storage pattern. Since the stencil is symmetric, the weights from Figure 5.6
can be used to uniquely represent every stencil connection on a logically structured grid.
Algorithm 5.2 shows the 2D residual calculation using a stencil-based operator with the
storage pattern from Figure 5.6. This calculation is similar to most solve phase opera-
tions and involves a structured matrix-vector product. With 20 memory operations and 18
floating-point operations, the residual calculation is memory bound with a low arithmetic
intensity. In contrast to the vector triad from Section 5.2.1, the residual calculation has
strided array access, a nested loop, and data reuse in A and x. The added complexities in
this operation may better challenge compilers, but the data reuse should lead to improved
memory performance due to caching.
Figure 5.7 shows the run time of the residual operation from Algorithm 5.2 implemented
80
Algorithm 5.2: Residual calculation with a nine point, symmetric, stencil-based
operator. Fictitious points at the first and last grid lines are used for ease of
programming. Arrays use column-major, zero-based indexing.
Input: Nx number of grid points in first dimension
Ny number of grid points in second dimension
A(Nx + 2, Ny + 2, 5) stencil-based operator
x(Nx + 2, Ny + 2) approximate solution
b(Nx + 2, Ny + 2) right-hand side
Output: r(Nx + 2, Ny + 2) residual
Let C = 0,W = 1, S = 2, SW = 3, NW = 4
for j = 1, . . . , Ny do
for i = 1, . . . , Nx do
r(i, j) = b(i, j)
− A(i, j,W ) · x(i− 1, j)
− A(i+ 1, j,W ) · x(i+ 1, j)
− A(i, j, S) · x(i, j − 1)
− A(i, j + 1, S) · x(i, j + 1)
− A(i, j, SW ) · x(i− 1, j − 1)
− A(i+ 1, j, NW ) · x(i+ 1, j − 1)
− A(i, j + 1, NW ) · x(i− 1, j + 1)
− A(i+ 1, j + 1, SW ) · x(i+ 1, j + 1)
− A(i, j, C) · x(i, j)
for the CPU using OpenMP and the GPU using both OpenMP offloading and CUDA. For the
first data point, the entire 10 MB problem fits in cache leading to a significantly smaller run
time. OpenMP offloading was effective in generating efficient device code from the Fortran
source containing nested loops and strided array access and achieved similar run time to
the CUDA implementation when using CUDA managed memory. Similar to Figure 5.5,
using unified memory with address translation services resulted in a performance loss for
problem sizes between 1 GB and 5 GB although this did not significantly impact the run
time. Overall, Figure 5.7 shows OpenMP with CUDA managed memory can be used to
generate efficient device code for structured operations in Cedar.
81















Figure 5.7: Cedar 2D residual with problem size varying from 10 MB to 10 GB. The CPU
line shows run time on the CPU with 20 cores using OpenMP. The OpenMP lines show run
time on the GPU using OpenMP offloading with CUDA managed memory (managed) and
unified memory with address translation services (ATS). The CUDA line shows run time on
the GPU using a CUDA implementation of the residual.
5.3 EXPERIMENTAL RESULTS
Guided by the performance expectation from Section 5.2, the solve phase of Cedar is
implemented using OpenMP offloading and CUDA managed memory. Ten V(1,1)-cycles
were then run on Lassen with a problem of size 9250 × 9250. This problem size is chosen
so all arrays used in the solve phase of Cedar on all levels of the multigrid vcycle fill a
large portion of high bandwidth memory on a single Volta GPU. For this problem, around
11 GB of 16 GB of high bandwidth memory are used. The solve phase is then run using
three different execution modes: serial, OpenMP, and offloading. The serial mode runs the
solve phase without OpenMP on the Power9 CPU. The OpenMP mode runs using OpenMP
with the hybrid affinity from Figure 5.2 with all 20 cores. Since the arrays are dynamically
allocated, this affinity and core count results in the best CPU performance. The offloading
mode runs the solve phase with OpenMP offloading and CUDA managed memory. In this
82
mode, all computation in the solve phase uses OpenMP offloading with the exception of
the coarse-grid solver. The coarse-level operation is typically inexpensive by design and the
Cholesky solver from cuSOLVER [75] is currently used.

















Figure 5.8: Run time of solve phase operations on the CPU using OpenMP and in serial in
addition to on the GPU with OpenMP offloading. Solve indicates the overall run time of
the ten cycles. Speedup of offloading over OpenMP is listed above the offloading bars.
Figure 5.8 shows the run time of operations in the solve phase of Cedar and the overall solve
on Lassen. The times shown are the total time spent in each operation over all ten iterations
all levels in the Vcycle. The four colored Gauss-Seidel relaxation dominates the overall solve
time on both the CPU and GPU. Consistent speedups from OpenMP to offloaded execution
over all operations leads to an overall solve phase speedup of 6.2. This speedup is reasonable
when compared to the performance expectation from Section 5.2. In this section, the ratio of




To better understand the speedups of each kernel, run times are decomposed by multigrid
level. Figure 5.9 shows the run time of point relaxation on each multigrid level. Starting with
level 0, each level reduces the problem size by a factor of four using standard coarsening by
a factor of two in each dimension. Table 5.2 shows the size of the logical grids and memory
usage of all arrays used in computation on each level. The four coloring used in this operation
requires four passes over approximately a quarter of the logically structured grid. This results
83
















Figure 5.9: Four colored Gauss-Seidel point relaxation by level.
mesh size logical grid memory usage
Level 0 85,562,500 9250× 9250 6.8 GB
Level 1 21,390,625 4625× 4625 3.4 GB
Level 2 5,349,969 2313× 2313 0.96 GB
Level 3 1,338,649 1157× 1157 0.21 GB
Level 4 335,241 579× 579 54 MB
Level 5 84,100 290× 290 14 MB
Level 6 21,025 145× 145 3.5 MB
Level 7 5,329 73× 73 0.90 MB
Level 8 1,369 37× 37 0.24 MB
Level 9 361 19× 19 71 KB
Level 10 100 10× 10 23 KB
Table 5.2: Grid size and memory usage on each multigrid level.
in a reduction by a factor of four in the amount of available parallelism when compared to
a structured matrix-vector operation, e.g., the residual calculation from Section 5.2.2. The
reduction in parallelism in addition to the diminished locality in colored passes, contributes
84
to lower OpenMP offloading performance compared to the residual calculation. For point
relaxation, offloading execution reaches a performance limit around level 5. At this level,
the entire problem fits in cache on the Volta GPU and the overhead of offloading is not
amortized. OpenMP on the CPU reaches a similar performance limit around level 7.















Figure 5.10: Run time of residual calculation on each level.
Figure 5.10 shows the run time of the residual calculation over ten Vcycle iterations on
each multigrid level. Similar to Figure 5.9, the OpenMP offloading performance reached a
performance limit around level 5. The higher degree of parallelism and locality of this calcu-
lation on each level when compared to four colored relaxation, resulted in better OpenMP
offloading performance—remaining faster than OpenMP until level 6.
Figure 5.11 shows the run time of interpolation over ten iterations on each multigrid level.
This operation interpolates a coarse-grid vector to the next finest grid and adds it as a
correction to the fine-grid solution vector computing:
ul = ul + I ll−1u
l−1 + rl/C, (5.1)
where ul is the approximate solution on fine-grid level l, ul−1 is the approximate solution
85















Figure 5.11: Run time of interpolation by level.
on coarse-grid level l − 1, I ll−1 is the interpolation operator, rl is the residual on level l,
and C is the center, diagonal stencil coefficient of the operator. The last term is included
to improve solver convergence for little cost [20]. This computation is split into three loop
bodies in Cedar. The first loop body is embedded in a nested loop over fine-grid points in
each dimension to precompute rl/C. The second loop body is contained in a loop over all
coarse x points to compute Equation 5.1 for points in the first grid line in the y direction
(see Figure 3.4). The third loop body is contained in a loop over all coarse y points. It first
computes Equation 5.1 for points in the first grid line in the x direction and than executes a
loop over all coarse x points to compute Equation 5.1 for all points outside of the first grid
line in each dimension. Since the inner loop is not nested directly inside the outer loop, the
computation of Equation 5.1 for points in the first grid line in the x direction was moved
into a fourth loop for the OpenMP and offload execution modes. This allows OpenMP to
distribute iterations of both loops using the loop collapse directive.
Figure 5.12 shows the run time of the restriction operation over ten Vcycle iterations
on each level. This operation showed similar performance to the residual calculation, but
reached performance limits and negative speedup points in OpenMP and offload execution
86
















Figure 5.12: Run time of restriction by level.
modes a level earlier. This is reasonable considering the residual operation can be viewed as
a matrix-vector operation on the fine grid, while the restriction operation can be viewed as
a matrix-vector operation on the coarse grid.
5.4 CONCLUSIONS
Emerging high-performance computing systems significantly increase on node parallelism
using a heterogeneous architecture with multiple GPUs per node. The recent Summit and
Sierra machines motivate the need to target execution on GPUs for performance. Both
machines have a greater amount of GPUs than CPUs and each GPU has higher floating-
point and memory performance than each CPU. While robust structured solvers have low
arithmetic intensity due to non-constant, spatially dependent stencil weights, their direct
memory access and structured loop patterns make them a natural candidate for exploiting
the significantly higher memory bandwidth performance of GPUs on these machines. In
this chapter, benchmarks on the CPU and GPU are used to develop a rough expectation of
87
performance for memory bandwidth bound operations on the Power9 heterogeneous archi-
tecture. OpenMP with unified memory was shown to be effective in offloading solve phase
operations written in legacy Fortran in Cedar. The performance expectation then provides
important context for speedups of the solve phase on the GPU over the CPU.
The work in this chapter demonstrates an effective method for executing structured
matrix-based operations on the GPU. Using OpenMP offloading, the solve phase of Cedar is
offloaded to the GPU resulting in a speedup of 6.2 over the CPU with a global problem size
of approximately 85× 106 unknowns using 12 multigrid levels. Requiring minor changes to
the Fortran code, this approach is attractive for offloading legacy code containing structured
parts to the GPU. In the future, this work can be extended to multiple processing elements
on a heterogeneous node in addition to multiple nodes using MPI and the halo exchange
library Tausch. When targeting multiple devices on a node using unified memory, codes
must carefully consider the memory layout of key data structures to avoid expensive mem-
ory page faults from sharing pages across devices. Halo based data structures using MPI
avoids this issue—treating devices as separate distributed memory regions. Optimizations
can then be implemented to take advantage of unified memory for devices on a node. This
approach requires the fewest changes to the code and facilitates incremental optimization.
Careful consideration of communication costs guided by performance modeling is likely
to remain important to the design of efficient structured solvers when using multiple het-
erogeneous nodes. To better understand on node performance, a more detailed analysis of
solver performance on the memory system is needed [42, 29, 76]. Modeling communication
in a heterogeneous environment that considers on node data movement in addition to net-
work interactions will also be crucial to building a predictive model. These models can then
be used to investigate optimized coarse-grid redistribution and guide overdecomposition of
structured communication on heterogeneous systems.
88
CHAPTER 6: CEDAR FRAMEWORK
Cedar is a framework for structure exploiting multigrid on large scale parallel systems.
Cedar currently implements Black Box Multigrid using standard coarsening for symmetric
stencil-based operators in two and three dimensions. This implementation uses legacy For-
tran kernels developed at Los Alamos National Laboratory. This framework includes a set
of abstractions and data structures for implementing a variational multigrid solver on 2D
and 3D structured grids targeting serial and distributed memory parallel systems. The code
is written in C++ using MPI for distributed memory communication. CUDA and OpenMP
are also used for targeting emerging heterogeneous architectures.
6.1 DESIGN
The design of Cedar was motivated by the goal of using a modern language to pro-
vide flexibility in exploring modifications to parallel algorithms for improving performance
and scalability while effectively using legacy code. Cedar evolved around the legacy For-
tran BoxMG code using its efficient, stateless numerics kernels while introducing a common
generic control layer for serial and parallel solvers in two and three dimensions. Two exam-
ples of such kernels are calculating a residual and running a single iteration of Gauss-Seidel
smoothing. Kernels that include parallel communication, such as Gauss-Seidel smoothing,
are passed a communication service object to perform the communication in C++. Using
a higher-level language to control common multigrid setup and cycling, define parallel data
structures, and handle parallel communication greatly improved code reuse while enabling
communication optimizations such as coarse-grid redistribution of data and aggregation of
communication in plane smoothing. This pattern provides a useful strategy for evolving
modern codes around a legacy code base.
Figure 6.1 shows the high-level design of Cedar. The framework is divided into three main
parts:
Control Layer This layer includes generic logic for setting up a multigrid hierarchy and
running the multigrid solve phase. This logic is used by serial and parallel solvers
in two and three dimensions. This layer also provides data structures for logically
structured grid functions and stencil-based operators. High-level distributed-memory
operations such as the optimized redistribution of coarse-grid data structures are also









































Figure 6.1: Cedar high-level architecture.
Service Layer This layer abstracts a set of services provided to multilevel structured
solvers. Implementations of these services can be registered for each solver in a data
structure called the service manager. This allows for the dynamic selection of imple-
mentations of these services in a solver based on user input or execution modes. Plane
relaxation uses this layer to register implementations of services that aggregate commu-
nication in its component 2D solvers. The message passing and halo exchange services
provide distributed-memory communication routines. The memory allocation service
provides implementations with application specific knowledge that can be exploited
for performance optimizations.
Kernel Layer Similar to the service layer, the kernel layer abstracts a set of computational
90
kernels used in multigrid solvers. Kernel implementations provide setup and solve
phase methods that are called generically by the control layer using the kernel manager
data structure. Kernel declarations are templated with a solver_types structure to
maintain type safety between implementations (e.g., 2D, 3D, and parallel). Kernels
that need parallel communication are passed a handle to the appropriate service from
the service layer.
A fundamental data structure in Cedar is the multidimensional array. Templated by the
number of dimensions and data type, this data structure is used to provide Cartesian indexing
on logically structured grids. Serial and parallel grid function data structures in two and
three dimensions inherit from the multidimensional array and include storage for fictitious
points on domain boundaries in addition to halo regions used in parallel communication.
The stencil operator data structure is used as a structured matrix data structure in Cedar.
This data structure is templated by the dimension of the logically structured grid and the
stencil type. The stencil operator then inherits from a multidimensional array of dimension
one greater than the dimension of the stencil operator. The extra dimension is used for the
stencil direction with the stencil type restricted to scoped enumerations. Listing 6.1 shows an
example of solving a 2D Poisson problem using Cedar. After MPI and Cedar are initialized,
a logically structured grid object is created from grid extents specified in the input file. A
five point finite difference Poisson operator is then built and the right-hand side set. Once
the approximate solution is computed, the infinity norm of the error is computed using an
analytical solution. The timer save routine then aggregates timings of operations on each
level across all processes and writes a summary file.
For interoperability outside of C++, Cedar includes a C interface. This interface was
modeled after the MPICH [77] C API implementation and encodes information such as an
object’s type and location in integer handles that can be passed to both C and Fortran. By
avoiding pointers, this also adds a degree of safety to a library interface in that the user
cannot easily write into the library’s memory. The C interface was also used to build a
higher level interface to PETSc [78]. This interface currently uses PCSHELL and MATSHELL
to setup Cedar as a preconditioner in PETSc.
6.2 CONCLUSIONS AND FUTURE WORK
Cedar introduces a flexible framework for robust structured variational multigrid methods
on large scale parallel systems. Using generic parallel data structures and multigrid setup
and cycling logic, Cedar effectively uses legacy code to implement Black Box Multigrid for
91
int main(int argc, char *argv[])
{
using namespace cedar;
using namespace cedar::cdr2; // 2D data structures
int provided;




// create structured grid from configuration file
auto grid = util:create_topo(conf);
mpi::stencil_op<five_pt> so = mpi::gallery::poisson(grid);
mpi::grid_func b(grid);
set_rhs(b);
// create parallel solver for five point stencil-based operator
mpi::solver<five_pt> bmg(so);
MPI_Barrier(MPI_COMM_WORLD); // synchronize before timing solve
auto sol = bmg.solve(b);
mpi::grid_func exact_sol(grid);
set_solution(exact_sol);
auto diff = exact_sol - sol;
log::status << "Solution norm: "






Listing 6.1: Solve 2D Poisson with Cedar
symmetric operators in two and three dimensions.
In the future, kernels from the nonsymmetric Black Box Multigrid implementation can
be used to support nonsymmetric problems in Cedar. Using the generic data structures in
Cedar, this could be accomplished by first defining a scoped enumeration for the nonsym-
92
metric stencil. The legacy kernels can then be registered using the kernel manager. By
implementing the nonsymmetric solver in Cedar, the solver would benefit from common
management and important communication optimizations such as optimized coarse-grid re-
distribution crucial for parallel scalability.
93
CHAPTER 7: CONCLUSIONS
This dissertation includes a number of algorithmic contributions for improving the par-
allel performance of robust structured solvers at scale. These focus on taking advantage of
structure in the problem by exploiting the predictable, structured communication patterns.
In Chapter 3, knowledge of the structure of coarse-grid operators is used to optimize the
redistribution of coarse problems. Contributions in this chapter are:
• Introduction of a predictive performance model for Black Box Multigrid.
• Implementation of recursive redistribution using aggregation of coarse tasks based on
locality in a structured grid.
• Performance model used to guide global search over recursive redistribution provides
robust decisions for structured solvers.
• Demonstrated parallel scalability of this approach in two and three dimensions with
over 500k cores.
With a scalable parallel algorithm for structured coarse problems, smoothing becomes
the dominant cost in robust variational structured solvers. When coarsening is designed
to preserve structure, block smoothers can be used to handle anisotropic problems. While
structured solvers with block smoothers are robust for general anisotropy, these smoothers
significantly increase the communication requirements of the solver. Chapter 4 focuses on
approaches for reducing the communication costs in block smoothers. Contributions in this
chapter are:
• Parallel implementation and performance results for a recursive version of the memory
efficient tridiagonal solver proposed in [48]. This provides robust block smoothing in
two dimensions.
• Novel algorithm for automating communication aggregation across plane solves using
user-level threading. This reduces the communication latency cost in plane smoothing
which provides robust smoothing in 3D. The automated approach is crucial at scale
where both the 3D solver and 2D plane solvers recursively redistribute coarse-grid
problems.
• Performance of this approach demonstrated on both model problems and an applica-
tion from the XPACC center.
94
Recent HPC architectures motivate the need to target GPUs as they contribute to a
majority of the computing and memory performance of these systems. In Chapter 5, the
performance of a robust variational structured multigrid solver is improved by targeting the
increased memory performance of these accelerators. Contributions in this chapter are:
• Addition of Cedar solve phase operations that target GPUs on emerging Power9 sys-
tems using OpenMP offloading and unified memory to effectively use legacy code.
• Results showing performance gains of targeting GPUs for structured multigrid op-
erations on this architecture given context by a performance expectation for each
processing unit.
In Chapter 6, a parallel framework for structure exploiting multigrid is discussed. Contri-
butions in this chapter are:
• Scalable implementation of Black Box Multigrid in two and three dimensions.
• Flexible high-level control layer included to improve code reuse and flexibility while
effectively using legacy code.
• Parallel communication and memory allocation routines can be registered and switched
by the user by registering them as services. Application specific information is provided
to services to enable optimization.
• Implementations of computational kernels in multigrid can be registered and selected
by the user.
• Structured kernels or methods implemented in this framework benefit from common,
scalable management and data structured (e.g., optimized coarse-grid redistribution).
95
REFERENCES
[1] D. Moulton, L. N. Olson, and A. Reisner, “Cedar framework,” 2017. [Online].
Available: https://github.com/cedar-framework/cedar
[2] S. P. MacLachlan, J. M. Tang, and C. Vuik, “Fast and robust solvers for pressure-
correction in bubbly flow problems,” J. Comput. Phys., vol. 227, no. 23, pp. 9742–9761,
2008.
[3] V. D. Liseikin, Grid Generation Methods, 2nd ed. Springer Publishing Company,
Incorporated, 2009.
[4] L. N. Olson and A. Reisner, “Stella,” 2018, version 0.1. [Online]. Available:
https://github.com/cedar-framework/cedar
[5] A. Brandt, S. F. McCormick, and J. W. Ruge, “Algebraic multigrid (AMG) for sparse
matrix equations,” Sparsity and Its Applications, 1984.
[6] S. P. MacLachlan and L. N. Olson, “Theoretical bounds for algebraic multigrid per-
formance: review and analysis,” Numerical Linear Algebra with Applications, vol. 21,
no. 2, pp. 194–220, 2014.
[7] W. Hackbusch, “Convergence of multigrid iterations applied to difference equations,”
Mathematics of Computation, vol. 34, no. 150, pp. 425–440, 1980.
[8] W. Hackbusch, “On the convergence of multi-grid iterations,” Beiträge Numer. Math,
vol. 9, pp. 213–239, 1981.
[9] W. Hackbusch, “Survey of convergence proofs for multi-grid iterations,” Special topics
of applied mathematics, North-Holland, Amsterdam, pp. 151–164, 1980.
[10] N. H. Naik and J. Van Rosendale, “The improved robustness of multigrid elliptic solvers
based on multiple semicoarsened grids,” SIAM Journal on Numerical Analysis, vol. 30,
no. 1, pp. 215–229, 1993.
[11] P. N. Brown, R. D. Falgout, and J. E. Jones, “Semicoarsening multigrid on distributed
memory machines,” SIAM Journal on Scientific Computing, vol. 21, no. 5, pp. 1823–
1834, 2000.
[12] U. Trottenberg and A. Schuller, Multigrid. Orlando, FL, USA: Academic Press, Inc.,
2001.
[13] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial: Second
Edition. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics,
2000.
[14] J. E. Dendy, “Black box multigrid,” J. Comput. Phys., vol. 48, pp. 366–386, 1982.
96
[15] J. D. Moulton, J. E. Dendy, and J. M. Hyman, “The black box multigrid numerical
homogenization algorithm,” J. Comput. Phys., vol. 141, pp. 1–29, 1998.
[16] S. P. MacLachlan and J. D. Moulton, “Multilevel upscaling through variational coars-
ening,” Water Resour. Res., vol. 42, 2006.
[17] J. E. Dendy, “Black box multigrid for nonsymmetric problems,” Appl. Math. Comput.,
vol. 13, pp. 261–283, 1983.
[18] P. D. Zeeuw, “Matrix-dependent prolongations and restrictions in a blackbox multigrid
solver,” Journal of Computational and Applied Mathematics, vol. 33, no. 1, pp. 1–27,
1990.
[19] S. P. MacLachlan, J. D. Moulton, and T. P. Chartier, “Robust and adaptive multigrid
methods: comparing structured and algebraic approaches,” J. Numer. Lin. Alg. App.,
vol. 19, pp. 389 – 413, 2012.
[20] J. E. Dendy and J. D. Moulton, “Black box multigrid with coarsening by a factor of
three,” J. Numer. Lin. Alg. App., vol. 17, no. 2-3, pp. 577–598, 2010.
[21] C.-A. Thole and U. Trottenberg, “Basic smoothing procedures for the multigrid treat-
ment of elliptic 3d operators,” Appl. Math. Comput., vol. 19, no. 1-4, pp. 333–345, July
1986.
[22] R. E. Alcouffe, A. Brandt, J. E. Dendy, and J. W. Painter, “The multi-grid method
for the diffusion equation with strongly discontinuous coefficients,” SIAM J. Sci. Stat.
Comput., vol. 2, no. 4, pp. 430–454, 1981.
[23] A. Reisner, L. Olson, and J. Moulton, “Scaling structured multigrid to 500k+ cores
through coarse-grid redistribution,” SIAM Journal on Scientific Computing, vol. 40,
no. 4, pp. C581–C604, 2018.
[24] A. H. Baker, R. D. Falgout, H. Gahvari, T. Gamblin, W. Gropp, T. V. Kolev, K. E.
Jordan, M. Schulz, and U. M. Yang, “Preparing algebraic multigrid for exascale,”
Lawrence Livermore National Laboratory, Lawrence Livermore, CA, Tech. Rep. LLNL-
TR-533076, 2012.
[25] H. Gahvari, W. Gropp, K. E. Jordan, M. Schulz, and U. M. Yang, “Systematic reduction
of data movement in algebraic multigrid solvers,” in Proceedings of the 2013 IEEE 27th
International Symposium on Parallel and Distributed Processing Workshops and PhD
Forum, ser. IPDPSW ’13. Washington, DC, USA: IEEE Computer Society, 2013, pp.
1675–1682.
[26] A. Bienz, R. D. Falgout, W. Gropp, L. N. Olson, and J. B. Schroder, “Reducing parallel
communication in algebraic multigrid through sparsification,” SIAM J. Sci. Comput.,
vol. 38, no. 5, pp. S332–S357, 2016.
97
[27] G. L. Delzanno, L. Chacón, J. M. Finn, Y. Chung, and G. Lapenta, “An opti-
mal robust equidistribution method for two-dimensional grid adaptation based on
monge–kantorovich optimization,” Journal of Computational Physics, vol. 227, no. 23,
pp. 9841–9864, 2008.
[28] H. Sundar, G. Biros, C. Burstedde, J. Rudi, O. Ghattas, and G. Stadler, “Parallel
geometric-algebraic multigrid on unstructured forests of octrees,” in Proceedings of the
International Conference on High Performance Computing, Networking, Storage and
Analysis. IEEE Computer Society Press, 2012, p. 43.
[29] B. Gmeiner, M. Huber, L. John, U. Rüde, and B. Wohlmuth, “A quantitative perfor-
mance study for stokes solvers at the extreme scale,” Journal of Computational Science,
vol. 17, pp. 509 – 521, 2016.
[30] W. Gropp, “Parallel computing and domain decomposition,” in Domain Decomposition
Methods for Partial Differential Equations. Norfolk, VA, USA: Society for Industrial
& Applied Mathematics, 1992, pp. 349 – 361.
[31] K. Nakajima, “Optimization of serial and parallel communications for parallel geomet-
ric multigrid method,” in 2014 20th IEEE International Conference on Parallel and
Distributed Systems (ICPADS), Dec 2014, pp. 25–32.
[32] D. E. Womble and B. C. Young, “A model and implementation of multigrid for massively
parallel computers,” International Journal of High Speed Computing, vol. 2, no. 3, pp.
239–255, 1990.
[33] W. Joubert and J. Cullum, “Scalable algebraic multigrid on 3500 processors,” Electron.
T. Numer. Ana., vol. 23, pp. 105–128, 2006.
[34] M. Emans, “Coarse-grid treatment in parallel amg for coupled systems in cfd applica-
tions,” Journal of Computational Science, vol. 2, no. 4, pp. 365 – 376, 2011.
[35] S. Reiter, A. Vogel, I. Heppner, M. Rupp, and G. Wittum, “A massively parallel geo-
metric multigrid solver on hierarchically distributed grids,” Comput. Vis. Sci., vol. 16,
no. 4, pp. 151–164, 2013.
[36] J. Rudi, A. C. I. Malossi, T. Isaac, G. Stadler, M. Gurnis, P. W. J. Staar, Y. Ineichen,
C. Bekas, A. Curioni, and O. Ghattas, “An extreme-scale implicit solver for complex
pdes: Highly heterogeneous flow in earth’s mantle,” in Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis, 2015.
[37] D. A. May, P. Sanan, K. Rupp, M. G. Knepley, and B. F. Smith, “Extreme-scale multi-
grid components within petsc,” in Proceedings of the Platform for Advanced Scientific
Computing Conference, 2016, pp. 5:1–5:12.
[38] A. Bar-Noy and S. Kipnis, “Designing broadcasting algorithms in the postal model for
message-passing systems,” in Proceedings of the Fourth Annual ACM Symposium on
Parallel Algorithms and Architectures, ser. SPAA ’92. New York, NY, USA: ACM,
1992, pp. 13–22.
98
[39] W. Gropp, L. N. Olson, and P. Samfass, “Modeling MPI communication performance on
SMP nodes: Is it time to retire the ping pong test,” in Proceedings of the 23rd European
MPI Users’ Group Meeting, ser. EuroMPI 2016. New York, NY, USA: ACM, 2016,
pp. 41–50.
[40] A. Bienz, W. D. Gropp, and L. Olson, “Topology-aware performance modeling of par-
allel spmvs,” http://meetings.siam.org/sess/dsp talk.cfm?p=75934.
[41] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of collective communication
operations in MPICH,” International Journal of High Performance Computing Appli-
cations, vol. 19, no. 1, pp. 49–66, 2005.
[42] B. Gmeiner, U. Rüde, H. Stengel, C. Waluga, and B. Wohlmuth, “Towards textbook
efficiency for parallel multigrid.” Numerical Mathematics: Theory, Methods & Applica-
tions, vol. 8, no. 1, pp. 22 – 46, 2015.
[43] P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the heuristic determination
of minimum cost paths,” IEEE Transactions on Systems Science and Cybernetics, vol. 4,
no. 2, pp. 100–107, 1968.
[44] R. Rabenseifner, “Effective bandwidth (beff) benchmark.” [Online]. Available:
www.hlrs.de/mpi/b eff
[45] D. J. Kerbyson, K. J. Barker, A. Vishnu, and A. Hoisie, “Comparing the performance
of Blue Gene/Q with leading Cray XE6 and InfiniBand systems,” Proceedings of the
International Conference on Parallel and Distributed Systems - ICPADS, pp. 556–563,
2012.
[46] G. E. Hammond, P. C. Lichtner, C. Lu, and M. R.T., “Pflotran: Reactive flow and
transport code for use on laptops to leadership-class supercomputers,” in Groundwater
Reactive Transport Models, F. Zhang, G. Yeh, and J. C. Parker, Eds. Sharjah, UAE:
Bentham Science Publishers, 2012, pp. 141–159.
[47] T. M. Austin, M. Berndt, and J. D. Moulton, “A memory efficient parallel tridiagonal
solver,” Mathematical Modeling and Analysis Group, Los Alamos National Laboratory,
Los Alamos, NM, Tech. Rep. LA-UR 03-4149, 2004.
[48] T. M. Austin, M. Berndt, and J. D. Moulton, “A memory efficient parallel tridiagonal
solver,” Mathematical Modeling and Analysis Group, Los Alamos National Laboratory,
Los Alamos, NM, Tech. Rep. LA-UR 03-4149, 2004.
[49] S. Schaffer, “A semicoarsening multigrid method for elliptic partial differential equations
with highly discontinuous and anisotropic coefficients,” SIAM Journal on Scientific
Computing, vol. 20, no. 1, pp. 228–242, 1998.
[50] J. Dendy Jr, “Semicoarsening multgrid for systems,” Electronic Transactions on Nu-
merical Analysis, vol. 6, pp. 97–105, 1997.
99
[51] R. W. Hockney, “A fast direct solution of poisson’s equation using fourier analysis,” J.
ACM, vol. 12, no. 1, pp. 95–113, Jan. 1965.
[52] H. S. Stone, “An efficient parallel algorithm for the solution of a tridiagonal linear
system of equations,” J. ACM, vol. 20, no. 1, pp. 27–38, Jan. 1973.
[53] H. H. Wang, “A parallel method for tridiagonal equations,” ACM Trans. Math. Softw.,
vol. 7, no. 2, pp. 170–183, June 1981.
[54] A. H. Sameh and D. J. Kuck, “On stable parallel linear system solvers,” J. ACM, vol. 25,
no. 1, pp. 81–91, Jan. 1978.
[55] L. Brugnano, “A parallel solver for tridiagonal linear systems for distributed memory
parallel computers,” Parallel Computing, vol. 17, no. 9, pp. 1017 – 1023, 1991.
[56] I. N. Hajj and S. Skelboe, “A multilevel parallel solver for block tridiagonal and banded
linear systems,” Parallel Computing, vol. 15, no. 1, pp. 21 – 45, 1990.
[57] A. Krechel, H.-J. Plum, and K. Stüben, “Parallelization and vectorization aspects of
the solution of tridiagonal linear systems,” Parallel Computing, vol. 14, no. 1, pp. 31 –
49, 1990.
[58] A. P., B. L., and P. T., “Parallel factorizations for tridiagonal matrices,” SIAM Journal
on Numerical Analysis, no. 3, p. 813, 1993.
[59] U. Gärtel, A. Krechel, A. Niestegge, and H.-J. Plum, Parallel Multigrid Solution of 2D
and 3D Anisotropic Elliptic Equations: Standard and Nonstandard Smoothing. Basel:
Birkhäuser Basel, 1991, pp. 191–209.
[60] S. Seo, A. Amer, P. Balaji, C. Bordage, G. Bosilca, A. Brooks, P. Carns, A. Castelló,
D. Genet, T. Herault, S. Iwasaki, P. Jindal, L. V. Kalé, S. Krishnamoorthy, J. Lifflan-
der, H. Lu, E. Meneses, M. Snir, Y. Sun, K. Taura, and P. Beckman, “Argobots: A
lightweight low-level threading and tasking framework,” IEEE Transactions on Parallel
and Distributed Systems, vol. 29, no. 3, pp. 512–526, March 2018.
[61] W. D. Gropp, “Using node information to implement mpi cartesian topologies,” in
Proceedings of the 25th European MPI Users’ Group Meeting, ser. EuroMPI’18. New
York, NY, USA: ACM, 2018, pp. 18:1–18:9.
[62] D. Baccarella, Q. Liu, T. Lee, S. D. Hammack, and H. Do, The Supersonic Combustion
Facility ACT-2, 2017.
[63] M. Kowarschik, U. Rüde, and C. Weiß, “Data layout optimizations for variable coef-
ficient multigrid,” in International Conference on Computational Science. Springer,
2002, pp. 642–651.
[64] C. C. Douglas, J. Hu, M. Kowarschik, U. Rüde, and C. Weiß, “Cache optimization
for structured and unstructured grid multigrid,” Electronic Transactions on Numerical
Analysis, vol. 10, pp. 21–40, 2000.
100
[65] A. B. Caldeira, “Ibm power system ac922 introduction and technical overview,” IBM
Redbooks, 2018.
[66] T. NVIDIA, “V100 gpu architecture,” 2017. [Online]. Available: https://images.nvidia.
com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[67] IBM, “POWER9 processor user’s manual,” 2018. [Online]. Available: https:
//openpowerfoundation.org/?resource lib=power9-processor-users-manual
[68] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling manycore perfor-
mance portability through polymorphic memory access patterns,” Journal of Parallel
and Distributed Computing, vol. 74, no. 12, pp. 3202 – 3216, 2014, domain-Specific
Languages and High-Level Frameworks for High-Performance Computing.
[69] LLNL, “RAJA performance portability layer.” [Online]. Available: https://github.
com/LLNL/RAJA
[70] R. Hornung, H. Jones, J. Keasler, R. Neely, O. Pearce, S. Hammond, C. Trott, P. Lin,
C. Vaughan, J. Cook et al., “Asc tri-lab co-design level 2 milestone report 2015,”
Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States), Tech. Rep.,
2015.
[71] L. Dagum and R. Menon, “Openmp: An industry-standard api for shared-memory
programming,” Computing in Science & Engineering, no. 1, pp. 46–55, 1998.
[72] R. D. Falgout and U. M. Yang, “hypre: A library of high performance preconditioners,”
in International Conference on Computational Science. Springer, 2002, pp. 632–641.
[73] J. D. McCalpin, “Memory bandwidth and machine balance in current high performance
computers,” IEEE Computer Society Technical Committee on Computer Architecture
(TCCA) Newsletter, pp. 19–25, Dec. 1995.
[74] J. D. McCalpin, “Stream: Sustainable memory bandwidth in high performance
computers,” University of Virginia, Charlottesville, Virginia, Tech. Rep., 1991-2007,
a continually updated technical report. http://www.cs.virginia.edu/stream/. [Online].
Available: http://www.cs.virginia.edu/stream/
[75] NVIDIA, “CUDA sparse matrix library.” [Online]. Available: https://docs.nvidia.com/
cuda/cusolver
[76] G. Hager, J. Treibig, J. Habich, and G. Wellein, “Exploring performance and power
properties of modern multi-core chips via simple machine models,” Concurr. Comput.
: Pract. Exper., vol. 28, no. 2, pp. 189–210, Feb. 2016.
[77] W. Gropp and E. Lusk, “User’s guide for mpich, a portable implementation of mpi,”
1996.
[78] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin,
A. Dener, V. Eijkhout, W. Gropp et al., “Petsc users manual,” 2019.
101
