Abstract. We investigate a technique -segmental refinement (SR) -proposed by Brandt in the 1970s as a low memory multigrid method. The technique is attractive for modern computer architectures because it provides high data locality, minimizes network communication, is amenable to loop fusion, and is naturally highly parallel and asynchronous. The network communication minimization property was recognized by Brandt and Diskin in 1994; we continue this work by developing a segmental refinement method for a finite volume discretization of the 3D Laplacian on massively parallel computers. An understanding of the asymptotic complexities, required to maintain textbook multigrid efficiency, are explored experimentally with a simple SR method. A two-level memory model is developed to compare the asymptotic communication complexity of a proposed SR method with traditional parallel multigrid. Performance and scalability are evaluated with a Cray XC30 with up to 64K cores. We achieve modest improvement in scalability from traditional parallel multigrid with a simple SR implementation.
Introduction
Full multigrid (FMG) is a provably asymptotically exact, non-iterative, algebraic equation solver with work complexity of about five residual calculation for the constant coefficient Laplacian -textbook multigrid efficiency. Multigrid has been developed for many applications in the past 40 years and continues to be an active area of research and development. Textbook multigrid efficiency, or nearly so, has been observed experimentally in many applications [BD81, TDB01, TOS01, ASB10] . FMG is a highly parallel algorithm with a computational depth of log 2 n, where n is the number of equations; the theoretical lower bound for solving that Laplacian is log n. While textbook multigrid efficiency is only provable for a few classes of problems, FMG is applied widely in practice, which makes it an important solution method to adapt to the rapidly changing computer architectures that we are encountering today and anticipate in the future.
Memory movement, in both intra-node and inter-node communication, is the primary driver of costs, in power and time, for partial differential equations (PDE) simulations on current, and anticipated future, computer architectures. Memory movement pressures are not new but have been accumulating for decades and the powering and moving of memory will become more central to the cost of PDE solves in the future. The desire to continue the exponential growth of extreme-scale PDE simulations combined with an economic need to not increase power budgets exponentially presents a challenge to computer engineers and they will not be able to deliver machines that maintain the historic balance of compute and communication performance. Engineers have often innovated past the worst fears of the community; the degree to which they do so in the next ten years is an open question.
There is the potential to significantly increase productivity on future machines by exploring algorithms now that are designed with memory movement, and not flops, as the primary resource to be conserved. This paper develops low memory movement techniques to maintain, and allow for the continued expansion of, multigrid methods on future architectures. Brandt proposed segmental refinement (SR) as a low memory complexity technique for FMG in the 1970s [Bra77] §7.5; [BL11] §8.7; [Din79] ; [BD94] ) that does not store the entire solution in memory at any one time -evanescent data. This algorithm requires that the multigrid algorithm be processed "vertically" as opposed to the traditional "horizontal" approach where entire grids are processed sequentially, from fine to coarse and back again in the classic V-cycle. Though SR was originally proposed as a serial, low memory, method that "sweeps" across the grid, it was later recognized to have attractive properties for parallel computing by Brandt and Diskin [BD94] . This algorithm is inherently asynchronous and highly parallel and amenable to loop fusion [WKS + 12], with a modest amount of extra storage and work in SR buffer cells. The goal of this paper is to continue the development of this technique on modern parallel computers, quantify the extra costs experimentally, understand the asymptotic behavior of the method to better design future instantiations, and develop a memory model useful in analyzing multilevel solver methods from the perspective of memory movement.
We proceed by presenting some multigrid background in §2; an SR data model and technique for cell centered discretizations is described in §3; the accuracy of this SR data model with respect to the parameters of the method is experimentally investigated in §4; a new SR data model, appropriate for asymptotic analysis and computation, is defined and compared to traditional parallel multigrid with a two level memory model in §5; the performance of SR on a Cray XC30 with up to 64K processes is investigated in §6; and we conclude in §7.
Multigrid Background
Multigrid is an effective method for solving systems of algebraic equations that arise from discretized PDEs. Modern multigrid's antecedents go back to Southwell in the 1930s [Sou40] , Fedorenko in the early 1960s [Fed61] , and others [TOS01] . Brandt developed multigrid's modern form in the 1970s, with orders of magnitude lower work complexity, equivalent to a few residual calculations (work units) -textbook multigrid efficiency -applied to complex domains, variable coefficients and nonlinear problems [Bra73] . A substantial body of literature, both theoretical and experimental, exists that proves and demonstrates the efficacy of multigrid [TOS01, BL11] . Full Approximation Scheme (or Full Approximation Storage, FAS) multigrid as also been demonstrated to be an effective nonlinear solver with costs very similar to that of a linearized multigrid solve ([TOS01] §5.3.3, [ASB10] , and many others).
2.1. Multigrid V-cycle. Multigrid methods are motivated by the observation that a low resolution discretization of an operator can capture modes or components of the error that are expensive to compute directly on a highly resolved discretization. More generally, any poorly locally-determined solution component has the potential to be resolved with a coarser representation. This process can be applied recursively with a series of coarse "grids", thereby requiring that each grid resolve only the components of the error that it can resolve efficiently. This process is known as a V-cycle (see Figure 1 ). These coarse grids have fewer grid points, typically a factor of two or more in each dimension. The total amount of work in the multigrid process is a geometric sum that converges to a small factor of the work on the finest mesh. The multigrid approach has proven to be effective in separating the near-field from the far-field contributions in the solution of elliptic operators -the coarse grid correction captures the far-field contribution and the near-field is resolved with a local process called a smoother. These concepts can be applied to problems with particles/atoms or pixels as well as the traditional grid or cell variables considered here.
The coarse grid space can be represented algebraically as the columns of the prolongation operator I h H , where h is the fine grid mesh spacing, and H is the coarse grid mesh spacing. The prolongation operator is used to map corrections to the solution from the coarse grid to the fine grid. Residuals are mapped from the fine grid to the coarse grid with the restriction operator I 2.2. Nonlinear multigrid. The multigrid V-cycle can be adapted to a nonlinear method by observing that for solving Lu = f the coarse grid residual equation can be written as
where u is the exact solution,û H approximates I H h u h , the full solution represented on the coarse grid, hence the name Full Approximation Scheme (FAS), and e is the error. With this, and an approximate solution on the fine gridũ h , the coarse grid equation can be written as 
-restriction of residual to coarse grid t C ← u C -temporary store of coarse grid solution 
Full
Multigrid. An effective V-cycle reduces the error by a constant fraction and is thus an iterative method. The V-cycle can be used to build a non-iterative, asymptotically exact, solver that reduces the algebraic error to the order of the incremental error, or truncation error. Moreover, the error is reduced at each level at the same rate as the discretization (eg, quadratically), with O(N ) work complexity. FMG starts on the coarsest grid, where an inexpensive accurate solve is available, prolongates the solution to the next finest level, applies a V-cycle, and continues until a desired resolution is reached. FMG is well know to be the most efficient multigrid cycle, when a textbook efficient V-cycle is available. A higher order interpolator between the level solves, Π h H , is useful for optimal efficiency of the FMG process. FMG has been proven, and demonstrated, to reduce the error to the order of the truncation error and is thus an asymptotically exact direct method [BD81] , §3.2.2 in [TOS01] .
One can analyze FMG with induction where the induction hypothesis is that the ratio r of the algebraic error to the truncation error is below a factor such as , the truncation error is of the form O(h p ), where p in the order of the discretization, and that the solver on each level (eg, one V-cycle) reduces the error by some factor Γ, which can be proven or measured experimentally, to derive an equation that directly relates r to Γ. This provides any desired ratio of algebraic error to truncation error if a sufficiently powerful V-cycle is used (Γ < 0.25 is required for an asymptotically exact solver with p = 2 and a refinement ratio of two). Adams et. al. applied these ideas to compressible resistive magnetohydrodynamics where two V-cycle were required at each level to achieve sufficient error reduction [ASB10] . Figure 2 shows the full multigrid algorithm, 
Segmental refinement
Consider a distributed memory FAS-FMG solver where the coarsest grid is solved on one process and is generally very small (eg, one cell). This grid is refined until the grid is large enough to utilize distributed memory parallelism (eg, 4
. Further refinements use the same size patch and populate more processes, forming "reduced process" levels, until all processes are populated. Then continue to refine to a new, larger, patch size where in-process parallelism is efficient (eg, 32 3 cells). Finally, continue refinement by splitting domains and populating lower level compute elements to form an octree in 3D, albeit with cross communication.
Segmental refinement starts with this traditional distributed memory FAS-FMG multigrid method and at some level, the transition level, buffer cells are added to each SR patch or subdomain. These SR buffer cells are populated with genuine data from the nearest neighbors in a standard exchange process, though with more buffer cells than required by a traditional solver. Subsequent grids are refined from each SR patch of this transition level -independently -without any exchange of data with other trees, only communicating vertically with their parent and (eight) children. The process is applied recursively forming an octree, or a forest of octrees if more that one SR patch is used on the transition level. Two SR data models are considered herein: 1) the implementation used for experimental investigation where an SR patch on the transition level is defined as the process subdomain in a traditional domain decomposition multigrid solver §3.1, and 2) an asymptotically appropriate data model with only one massive parallel SR patch on the transition level §5.3. SR removes the "horizontal" communication of traditional parallel multigrid between SR patches; the buffer cells "protect" the genuine cells from patch boundary errors that occur because boundary data is not refreshed with communication as in the traditional solver.
The segmental refinement approach allows for the application of several techniques: Segmental refinement: Domain decomposition and (deep) buffer regions to isolate subdomains. Loop fusion: Sweep over subdomain patch computing, for instance, FMG prolongation, smoothing, residual, and restriction on successive planes and hence only bringing data into high levels of cache once per SR kernel (See Figure 3) , [WKS + 12]. Evanescent data: Use a full update of fine grids; sweep a window across the local domain, fuse operators, including collecting functionals of interest, and not storing the entire fine grid patch at any one time. Asynchronous process: Any leaf of the tree can be processed at any time, naturally accommodating asynchronous programming models. Block structured adaptive mesh refinement: BSAMR is naturally accommodated by "pruning" the octree. level. The FAS-FMG process continues in the SR levels, the transition level uses a standard V-cycle as the coarse grid solver; genuine data is injected into the traditional FAS-FMG data structure, the V-cycle is called, the updated solution is copied into the SR data structure and a (large) data exchange populates the entire patch, compute region, with genuine data.
3.1.
A segmental refinement data model: "a forest of pine trees". A standard parallel FAS-FMG solver is used as the SR coarse grid solver. At the transition level, assigned index 0, the standard FMG refinement process is replaced by SR. The transition level is the finest non-SR level, but requires an SR representation for prolongation to the first SR level, level 1. These SR representations have SR buffer cells as defined below. Each subdomain of the grid on the transition level is refined locally on each MPI process. This results in no MPI communication on the SR levels.
Consider cell centered discretizations and an SR data model as follows. Define the genuine region on level i as Ω Figures 5 and 6 shows the SR modifications to the FAS-FMG algorithm, with K SR levels, and the transition level is assigned index 0 (the SR coarse grid).
CopyIn() and CopyOut() methods move data between the SR and non-SR data structures in the transition level; SRExchange(u) refreshes the SR buffer cells with a standard exchange of data with neighbor patches (on the transition level).
, Ω C return u Figure 6 . FAS-FMG-SR algorithm,ū,r,w,t on the coarse grid C Prolongation is computed to both Ω C ∪ Ω SRG cells, that is to all data on the fine grid.
Segmental Refinement Parameters
The primary purpose of this work is to investigate the accuracy of the algorithm with respect to parameters of the method. Several parameters define the SR data model and effect its cost and accuracy. First, size the SR buffer on each level with the equation, buffer schedule,
2 where i = 0 on the transition grid and i = K on the finest grid, and A and B are two independent parameters. N B i is an even integer to simplify restriction. A third parameter is the size C of the genuine data on the transition grid. In Figure 4 , A = 2 and B = 0 (ie, both levels have two layers of buffer cells), and C = 4 if the coarse grid is the transition grid.
4.1. Experimental Determination of Segmental Refinement Parameters. We use a multigrid refinement ratio of two, piecewise constant restriction, and linear prolongation for both the FMG and V-cycle prolongation, a 2 nd order Chebyshev polynomial for both presmoother and post-smoothing, one pre V-cycle smoothing step (an F(1,2,2) cycle). We test
, of the Laplacian Lu = f , on the rectangular domain with lower corner at the origin and upper corner at (2, 1, 1), and homogenous Dirichlet boundary conditions. We use a 27-point finite volume stencil that is 2 nd order accurate. Consider the model problem with 16 processes (4 x 2 x 2 process grid), with cube subdomains; N is the number of cells on each subdomain in each direction.
To investigate the relationship of A, B, C, and K on accuracy, we desire that the solver maintain 2 nd order convergence rates in the error as the FMG algorithm progresses. Define a bound on an acceptable level of error as an error of less that about 10% more than the traditional solver (the underlined entries in Tables 1), which has perfect 2 nd order convergence. We sample this high dimensional parameter design space along a constant log 2 C − K line (columns in Tables 1), because decreased C reduced the number of usable SR levels K, and show the ratio of the infinity norm of the SR error to the error in the traditional solver (e r = e SR /e traditional ) with A = 2, 4, 6, 8 (tables) and B = 0, 1, 2, 3 (rows) in the Tables 1.   2 log 2 C / K B 4/6 3/5 2/4 0 17 7.2 2.7 1 2.9 2.1 1.2 2 1. This shows that A and B are both useful and A ≈ 2B for larger A and A ≈ B for small A. A second observation is that each table has an approximate diagonal isosurface with a positive slope of one. These observations indicate that, in this parameter regime at least, K ≤ log 2 C + 2 and K ≤ B + A 2 is required. These expressions can also be expressed as a required number of buffer cells on the transition grid N B 0 (K) (Figure 7) .
We observe an increase in error with decrease in C. The source of this error is from an increase in the ratio of buffer or boundary cells to the total number of cells with decreasing C. Each boundary cell is a source of error, with a decaying Green's function, for every cell; these errors add to the error at each cell thereby increasing error. To further investigate this phenomenon fix A = 8, B = 0 (N B 0 = 4), and K = 5; the relative error as a function of C is shown in Table 2 with N genuine cells per process, in each dimension, on the fine grid. , is an important parameter because these cells are the source of all horizontal communication for SR. Consider a maximum buffer schedule, where N B 0 is a parameter and the SR buffers completely support the transition grid, Ω C = Ω F . This removes one source of error in SR: no update of the solution and τ correction on the compute region not supported by its fine grid. Table 3 shows the error ratio as a function of C with fixed K = 4 and N B 0 = 2 and the maximum buffer schedule. log 2 C 6 5 4 3 2 1 N 1024 512 256 128 64 32 e r 1.02 1.05 1.11 1.25 1.4 1.9 Table 3 . Effect of C on error ratio with N B 0 = 2, K = 4, and maximum buffer schedule This data shows slightly less degradation of the solution with C that Table 4 . Maximum buffer schedule, linear increasing log 2 C, N B 0 , and K only a modest increase in error as the number of SR levels increases if the first experiment with N B 0 = 1 is ignored. This indicates a regime that is close to an asymptoticly accurate buffer and C schedule. Table 5 . We can investigate a lower bound on the required B by searching for an acceptable value experimentally. Approximate a fractional B ≈ could provide an asymptotically exact solver with the maximal buffer schedule, independent of C. Future work entails verifying that this linear N B 0 schedule is sufficient asymptotically and attaining these bounds with the finite buffer schedule. Figure 7 shows a plot of N B 0 and the K at which the error is less than about 10% from the data in Table 1 (boxed values) and Table 5 (left). This data, albeit limited in range, shows a strong correlation of N B 0 to the maximum allowable SR levels.
Segmental Refinement Communication Complexity
Segmental refinement inherits the computational depth of traditional multigrid; a more refined complexity model is required to distinguish SR. To this end, this section defines a new SR data model and a two level memory model. . Message aggregation is not currently prevalent but will likely become more relevant in the future and is ignored it in this study. SR can be viewed as applying message aggregation ideas at a more algorithmic level, or message aggregation can be viewed as a narrower, lower level, technique than SR that does not alter the semantics of multigrid.
A traditional parallel multigrid V-cycle uses 26 (c H ) messages, per process, in each residual, smoother, and operator application when using a 27-point stencil in 3D in a standard nearest neighbor exchange process, per level. With a V (2, 2) cycle this results in 156 messages per level, including coarse grid τ correction term, plus restriction and prolongation messages, in eight message phases, or bulk synchronous steps: six horizontal and two vertical communication phases. Vertical message phases can have about the same number of messages although the SR method discussed thus far has no messages whatsoever in the SR grids. 5.3. A New Segmental Refinement Data Model. The experimental data in §4.1 suggests that N B 0 and C should rise with SR levels K. In practice, fixing K = 5 and C = 32 would result in a reduction of the size of the traditional solver (the SR coarse grid solver) by a factor or 32K, which would be substantial at least in a non-BSAMR context and be very practical. SR with K = 2 might also be very useful with accelerator hardware where interprocess communication is very expensive and coarser grids run on the host. Our observations suggests an SR data model for an asymptotic application and analysis that amortizes the cost of the buffer cells and exploits the benefits of increasing C.
To this end consider one massively parallel SR patch on the transition level, which is refined to create eight (2 D ) "child" SR patches of the same logical size, with this process continuing to create an octree. This results in 2K +1 multigrid levels, K SR levels and K +1 traditional levels,
The coarsest grid is refined until one partition with √ Q words is filled with a grid. This is the SR transition level i = 0 and is the root of the SR octree. Each of the K subsequent refinement levels refines the grid by two and forms 2 D child SR subdomains with the same logical size, that is, they each fill one √ Q memory partition, until all partitions are filled by a grid. There are √ Q memory partitions on the fine grid by the definition of our machine and the same number of SR patches, thus 
Segmental refinement thus reduces the bisection bandwidth requirements from O(N 2 ) to O(N log 2 N ). 
Scaling Studies
The scalability of our SR method is tested with a FORTRAN90 + MPI code that is fully parallel and has the option to be incrementally compute the coarse grids redundantly, where all processors are active on all levels doing redundant work, as well as the option for the traditional tree bases approach without redundant computation. The redundant coarse grid solve approach results in a "butterfly" communication pattern instead of a classic tree. This approach has the advantage of requiring no communication in the prolongation phase, hence reducing the number of bulk synchronous communication steps, at the expense of sending more data with more messages overall.
The performance of our current implementation is measured on a Cray XC30, Edison at NERSC, where up to 64K cores are used (only using 8 of the 12 cores on each socket and thus utilizing up to 96K cores), and the same test problem as §4 is used. Consider weak scaling with 128 3 and 32 3 cells per core, with the SR solver, with four and three SR levels respectively (C = 8 and C = 4 respectively), and compare this with a traditional FAS-FMG solver, which is also used as the SR coarse grid solver. The 128 3 subdomain is indicative of a more realistic size of 32 3 cells per core on a socket with 64 cores as is expected in the near future. The solve times for a V-cycle solve with a relative tolerance of the residual of 10 There is some indication that the communication savings of the SR solver is compensating for the extra computational costs and there is little difference with redundant and non-redundant coarse grid solves. Loop fusion has not been implemented in SR; we anticipate that loop fusion would make SR significantly faster. FMG is much faster than the iterative V-cycle solver and Figure 10 shows the expected degradation in error at scale because a V-cycles solver converged to a constant relative error, or residual, reduction is not an asymptotically exact solver. V-cycles can be an asymptotically exact solver if the convergence tolerance is adjusted appropriately. We have presented a segmental refinement method for cell centered discretizations of the Laplacian , and demonstrated that SR can maintain the semantics of textbook efficient multigrid with processing that is more attractive on modern memory centric architectures than traditional parallel multigrid in that it removes "far" communication from finer grids and naturally facilitates message aggregation and loop fusion. We experimentally verify that our SR method is an asymptotically exact solver on 64K cores of a Cray XC30, with a fixed number of SR levels. We have only observed modest gains in communication from SR at scale but have established the feasibility of the approach. We have performed extensive testing to understand the mathematical asymptotic behavior of SR with a simple SR method and proposed and analyzed a new SR data model. Future work entails, for one, exploiting the data locality provided by SR to fuse loops and increase arithmetic intensity and hence flop rates (without any additional work). Future work also includes a more flexible implementation that allows for more SR levels, to investigate the basic algorithm more thoroughly. We also intend to use vertex centered methods, which have different multigrid characteristics than cell centered discretizations, so as to understand the fundamental nature of the SR technique more thoroughly. Future work involves developing an asymptotical exact solver with a finite buffer schedule and data model, perhaps with the massive parallel SR patches model proposed in §5.3.
We see the first algorithmic improvement to the current work as increasing the order of prolongation because SR depends on boundary conditions, from coarse grids, that are not smoothed nor refreshed with genuine data as in a traditional method. SR may be particularly sensitive to the order of prolongation because it is used to set the frozen ghost cells in the SR buffer region, and is not smoothed with the α smoothing steps nor post smoothing.
We have investigated the model problem; further work involves extending the application of SR to more application domains, such as variable coefficient, mapped grids, and nonlinear problems. Complex domains can be naturally accommodated with block structured adaptive mesh refinement (BSAMR) [SCGK11] ; the octree need only be pruned, which is facilitated by not possessing horizontal data dependancies.
