ISSN

0302-9743
ISBN
9783319321486
Authors
Druinsky, A Ghysels, P Li, XS et al.
Introduction
We study the performance of a novel algebraic multigrid algorithm that was recently introduced by Brezina and Vassilevski [4] for solving difficult elliptic PDEs with a variable coefficient that can be resolved only using a fine-grained discretization. We use the two-level variant of the method, implemented in the serial C++ code SAAMGe [11, 12] , and our focus is on the impact of the coarsegrid solver on the performance of the algorithm. The coarse grid represents the parallel bottleneck in the code, and therefore solving the corresponding systems efficiently is crucial for the solver's performance. Outside of the coarse grid, most of the work in SAAMGe is formulated in terms of sparse matrix-vector multiplications (SpMVs) that involve large matrices with regular sparsity structure. Optimizing such SpMVs is a well-studied problem and optimized implementations are available for many architectures.
We consider two coarse-grid solvers. One is the preconditioned conjugate gradient method (PCG), which we precondition using a single step of the Jacobi iteration. Although careful optimization of PCG can make it a powerful coarsegrid solver, using it when convergence is slow can be expensive due to its low arithmetic intensity. As an alternative, we use STRUMPACK, a new HSS-embedded low-rank sparse-factorization algorithm [10] . It has a higher arithmetic intensity, and it is robust, meaning that it can be used as a direct solver. 3 The use of low-rank structure in the factorization reduces the amount of work, memory space, and memory bandwidth that we expend, making the solver potentially more efficient than earlier sparse-factorization algorithms.
There exist thorough studies of parallel-performance optimization and analysis of algebraic multigrid (AMG) [2, 3, 8, 9] . Our paper differs from these studies by focusing on the architecture level and performing detailed performance-bound modeling, taking into account the arithmetic intensity of the different algorithmic components. We are the first to apply this methodology to AMG, and this allows us to bridge the gap in the understanding of the limits of AMG performance on individual multicore nodes with large core counts.
Our main contributions are the following. We make a comprehensive study of the impact of the coarse-grid solver on the performance of a two-level AMG solver. We perform extensive optimization on two multicore architectures, one of which is the challenging Xeon Phi, characterized by having a large number of relatively slow cores and a high memory latency. We incorporate a novel randomized HSS-embedded sparse-factorization algorithm as the coarse-grid solver. Finally, we develop and validate a bounds-and-bottlenecks Roofline model which helps us to identify performance optimization opportunities on our target architectures and to characterize the limitations of our code.
Background
Test Problems and Machines
We use the oil-reservoir simulation benchmark spe (model 2) from the SPE Comparative Solution Project [6] . Here, fluid flow in porous media is described by the Darcy equation (in primal form):
where p(x) is the pressure field and κ(x) is the permeability of the medium. The model is described on a regular Cartesian grid of 60 × 220 × 85 cells. The coefficient κ(x) admits a wide range of variation between distinct horizontal layers of the medium, which makes this a challenging problem to solve. Finite-element discretization of the problem produces a fine-grid matrix of order 1.2 million with a regular sparsity structure and an average of 26.4 nonzero elements in each row. From this matrix, the SAAMGe algorithm produces a coarse-grid matrix whose dimensions and sparsity pattern depend on the algorithm's parameters, as we explain below. We carried out our study on two machines at the National Energy Research Scientific Computing Center in Oakland, CA. One machine is Edison, a Cray XC30 machine of 5,600 Ivy Bridge EP nodes. The other is Babbage, a commodity Intel cluster in which each node contains two Xeon Phi Knights Corner coprocessors. In the following, we refer to these machines as IVB-EP and KNC, respectively. In this paper, we focus on one CPU socket of an IVB-EP node and one KNC coprocessor. Each IVB-EP socket consists of 12 cores with a theoretical 230.4 peak gflop/s rate. A KNC coprocessor has 60 cores with a theoretical 1,010.9 peak gflop/s rate. Although KNC has more computational power and memory bandwidth, it has slower cores, a wider SIMD architecture and more primitive hardware stream prefetchers. As a result, SpMV-like operations are much more difficult to optimize on KNC. The cache hierarchies of both machines are coherent and have the same net capacity-30 MB. However, whereas IVB-EP provides a 30 MB unified L3 cache, KNC maintains 60 caches of 512 KB each. This results in superfluous data movement and coherency transactions that can impede the effective bandwidth on KNC. Furthermore, ineffective software prefetching can highlight the fact that KNC's memory latency can be an order of magnitude higher than that of IVB-EP [13] . As a result, although KNC has 5× the nominal bandwidth of IVB-EP, it can often be underutilized or squandered given complex memory access patterns endemic in sparse methods.
Algebraic Multigrid
The SA-ρAMGe method that we study [4] works by forming a coarse-grid matrix A c and solving the problem iteratively by repeating the following steps:
1. Pre-smoothing:
Here, M −1 is a polynomial smoother and P is the so-called prolongation operator. The operator P is formed by representing A as the sum of local stiffness matrices (which requires knowledge of the finite-element discretization) and computing the eigenvectors that correspond to the smallest eigenvalues of each such matrix. We form the tentative prolongatorP , which is a rectangular blockdiagonal matrix whose diagonal blocks correspond to the local stiffness matrices and consist of the eigenvectors that we computed. The ultimate prolongator has the form P = SP , where S is a matrix-polynomial smoother. The coarse-grid matrix is obtained by forming the sparse-matrix product A c = P T AP .
HSS Sparse Solver
PCG is relatively easy to implement and parallelize, but its convergence can be slow for numerically difficult problems. Sparse-factorization methods can serve as powerful alternatives. Here, we consider an HSS-embedded sparse-factorization method that has asymptotically lower complexity than traditional factorizations. The algorithm has a shared-memory parallel implementation in the package STRUMPACK [10] , which uses nested-dissection ordering and multifrontal factorization. The sparse solver consists of the following steps: preprocessing (e.g., sparsity-preserving ordering by a graph-partitioning algorithm such as METIS, and symbolic factorization), numerical factorization and solution. Fig. 1 . A regular two-dimensional grid, partitioned using nested dissection (left) and the corresponding separator tree (right). The nodes of the separator tree represent frontal matrices. Only the nodes of the top levels are compressed using HSS.
HSS dense
The novelty is to represent the dense frontal matrix corresponding to a node of the separator tree as a hierarchically-semiseparable (HSS) matrix [10, 14, 18] . The HSS structure exploits the data-sparsity of the dense matrix using lowrank compression. Furthermore, the hierarchical partitioning of the matrix blocks and the use of nested bases lead to factorization and solve algorithms that are asymptotically faster than the classical ones and use less memory. Figure 1 illustrates the nested-dissection procedure on a small regular mesh. The blue points denote the root separator, splitting the domain in two unconnected parts. The nested-dissection procedure is then repeated on both parts recursively. Each of the separators corresponds to one frontal matrix, placed in a data structure called the separator or elimination tree, illustrated in Figure 1 (right) . The largest frontal matrices, corresponding to the top s levels of the elimination tree, are approximated as HSS matrices. The other smaller frontal matrices are stored as full-rank dense matrices. The factorization is computed by a bottom-up traversal of the elimination tree, computing a partial factorization of each front and passing the Schur complement from child to parent. The structure of an HSS matrix is also illustrated in Figure 1 (right) . In an HSS matrix, diagonal blocks are recursively partitioned until at the finest level diagonal blocks are stored as full-rank dense matrices. Off-diagonal blocks are represented as low-rank products
The STRUMPACK solver can be used as a preconditioner, where the quality of the preconditioner is controlled by the accuracy of the low-rank approximations in the HSS structure and by the number of HSS levels s . The HSS approximation accuracy can be controlled by a user specified tolerance . A single solve with the STRUMPACK preconditioner is more expensive than a single PCG iteration, but it is more effective for numerically challenging PDEs.
Code Optimizations
Obtaining top performance on a traditional multicore architecture such as the IVB-EP is a well-studied problem, and therefore we focus in this section on the performance optimizations that we conducted on KNC. The challenges on this platform are due to a large core count, wide SIMD FPU and distributed coherent L2 caches.
PCG Thread-Friendly Optimizations
In contrast with IVB-EP, a large proportion of the time on KNC is spent on solving coarse-grid systems, and this proportion is increasing as we use more threads. Ultimately, using 180 threads, coarse-grid PCG accounts for more than 50% of the total time on KNC. Furthermore, our study in Sect. 4 showed that AMG performs best when the coarse-grid system is small, and so the optimizations that we require are different from those that are typically done in large-scale implementations.
The initial version of PCG in our code was implemented as a serial code that launched parallel OpenMP kernels such as dot product, vector linear combination and SpMV. Arranging the computation in this way incurs an overhead of entering and exiting an OpenMP parallel region for each computational kernel. For this reason, we introduced an alternative implementation, omp-for-all, in which the whole iteration is nested inside a single OpenMP parallel region. This yielded a speedup of 1.55 (using 180 threads).
We accomplished a further 1.69 speedup by introducing the omp-for-spmv variant, in which all kernels are serial, except for SpMV, which is parallel. Our explanation for the speedup in this case is that the coarse grid is represented by a small matrix and therefore the overhead of parallelizing the dot-product and vector-linear-combination kernels outweighs the benefits of such parallelization.
Finally, we also considered the omp-parallel-spmv variant, which we obtained from the original code by replacing all parallel kernels with serial ones, except for SpMV. Similarly to omp-for-spmv, in this version all kernels except SpMV are serial and there is only one parallel region. However, in contrast with omp-for-spmv, the parallel region here is inside the main loop and therefore incurs an overhead in each iteration. This version is slightly slower than ompfor-spmv, reaching 88% of omp-for-spmv's performance using 180 threads. Nevertheless, omp-parallel-spmv is competitive with omp-for-spmv and it allows us to use an external library that implements the SpMV kernel, such as the one proposed in [13] .
HSS Thread-Friendly Optimizations
We now consider the use of STRUMPACK for solving coarse-grid systems. Table 1 and Fig. 2 show the running time of the algorithm. The solve time on both machines is more than an order of magnitude faster than factorization, which is for the better, because factorization is required only once, whereas solve is required in each AMG iteration. On both machines, the computation scales well, with the exception of ordering and symbolic factorization on KNC, where performance stagnates early, at about 10 threads. These steps involve purely combinatorial algorithms which are hard to parallelize on a machine architecture optimized for a high flop rate. Parallel scaling on IVB-EP is better than on KNC, with a speedup of 12 threads over one thread of 6.5× and 3.3× for factorization and solve, respectively, compared to speedups of 12.8× and 7.6× using 60 threads on KNC.
The solver is implemented using OpenMP task parallelism, so that the numeric-factorization phase is represented by a single parallel region, and therefore the only barriers in the code correspond to dependencies between the tasks. We took the following additional steps to improve performance on KNC. We replaced the default memory allocator by the more scalable TBB [1] allocator. Tasks are generated by recursive functions. We tuned the total number of tasks, i.e., the task granularity, specifically for each machine. We replaced the SCOTCH graph partitioner, which we were using for ordering the matrix, with METIS, and thereby gained a 10× time savings in the ordering step on KNC. 4, 5 We disabled a preprocessing step that performs permutation and scaling using the MC64 code [7] . This step is not required because our matrix is symmetric positivedefinite. Finally, we disabled code for counting the number of executed flops, which we found was causing a 3× slowdown in the solve phase on KNC.
Performance Comparison
To complement our optimization effort, we conducted a comprehensive performance assessment of the solvers from the architectural viewpoint (two machines) as well as the algorithmic one (varying the algorithm's parameters). In particular, we considered five parameters of SAAMGe, and explored the effect of changing these parameters within a five-dimensional space. For each parameter we consider a range of values, as described in Table 2 .
6 There are altogether 216 configurations explored on each machine. We used 60 cores on KNC and 12 cores on IVB-EP.
The following summarizes our findings. First, choosing the proper algorithm parameters is crucial to achieve good performance. The difference in runtime 4 Based on this experience, we changed the STRUMPACK default to METIS. 5 The ordering phase consists of running METIS, applying the computed permutation to the matrix and sorting the column indices within each row of the permuted matrix. METIS runs serially, but the rest of the work is done in parallel. 6 We also used the following parameters to control the accuracy of the computed solution. For the HSS algorithm, we used four levels of compression with compression tolerance 10 −4 and zero GMRES iterations. For PCG, we used relative tolerance 10 −4 . These were chosen so as to maximize performance without sacrificing accuracy. Table 2 . The parameters of the algorithm and the corresponding values that we explored. The number of elements per agglomerate determines the size of the local stiffness matrices; νP and ν M −1 are respectively the polynomial degrees of the interpolator smoother S and the relaxation smoother M −1 ; and θ is the spectral tolerance, which determines how many eigenvectors of each local stiffness matrix represent that matrix in the tentative prolongator.
Parameter
can be as large as an order of magnitude among the 216 configurations. Taking PCG on IVB-EP as an example, the fastest configuration took 9.6 seconds, while the slowest took 168.1 seconds -more than a 17× difference. Second, on the same machine, the HSS coarse-grid solver always won over PCG. For the best configurations, HSS is 1.54× faster than PCG on IVB-EP and 1.34× on KNC. Finally, with the best configurations, IVB-EP is 1.7× and 1.49× faster than KNC using HSS and PCG, respectively.
Roofline Performance Model
We developed a bounds-and-bottlenecks Roofline model to drive the performance optimization of our OpenMP code [17] . 7 The goal is to gain insight about the machine's performance bottlenecks and terminating performance optimization. Here, we focus on the AMG solution cycle; modeling AMG setup is future work.
The model consists of formulas, one for each component of the algorithm, expressing the number of bytes that we move between the levels of the memory hierarchy and the number of flops that we carry out. To obtain runtime estimates from this model, we divide the total memory traffic by the machine bandwidth, and also divide the total number of flops by the machine flops rate. This yields two lower bounds on the runtime: one that corresponds to memory bandwidth being the bottleneck, and the other to the floating-point units.
Concurrent with traditional Roofline analysis, the inputs to our model are: 1) The machine peak flop rate and its sustainable memory bandwidth, measured using a modified STREAM benchmark [15] ; 2) The dimensions of A and A c , denoted by n and n c , respectively; 3) The number of nonzeros in A, A c and P , denoted by nza, nzc and nzp, respectively; 4) The number of AMG cycles; and 5) Parameters that are specific to the coarse solver: the average number of PCG iterations per AMG cycle when we use PCG, and the memory size of the HSS factors when we use HSS.
The Model for the Combination of AMG with PCG
The model that we obtain for the version of the solver in which we use PCG to solve coarse-grid systems is shown in Table 3 . We used the following combination of parameters: elements-per-agglomerate is set to 400, ν M −1 = 3 and θ = 0.001. The corresponding runtime bounds on IVB-EP are shown in Fig. 3 When 1 or 2 cores are used, our flops-based bound is within 1% and 7% of actual runtime respectively. As the number of cores is increased, memory bandwidth becomes the bottleneck. For 4, 8 and 12 cores, our memory-bandwidthbased bound is within 23% of the actual runtime. We attribute the difference to the extremely different memory access patterns in AMG compared to the STREAM benchmark.
The Model for the Combination of AMG with HSS
Following the same practice as in Sect. 5.1, we conduct a performance-bound analysis when HSS is used as the coarse solver. Comparing to Table 3 , the costs are the same for smoothing, restriction, interpolation, and termination. The difference is in the coarse solve, where the code needs to stream through the factored matrix. For our test cases, the factors are larger than the largest cache of the machines. Therefore, we assume that the factors are read from DRAM. Table 4 shows the performance upper bound based on the DRAM sustained bandwidth. On IVB-EP, the best configuration is: elements-per-agglomerate = 256, ν M −1 = 1 and θ = 0.01. On KNC, the best configuration is: elements-peragglomerate = 256, ν M −1 = 3 and θ = 0.01. The bandwidth-based performance bound is quite accurate on IVB-EP, yielding an estimate within a gap of 31% of the actual time. Among all the stages, the best match between model and reality is the smoothing step-about an 18% gap. The worst gap corresponds to the coarse solve-about 55%.
On the other hand, the estimated time on KNC is far less than the actual time, implying that the memory bandwidth is severely underutilized. Attributing this significant performance difference to either architecture or model is an area of continued investigation. Table 4 . Runtime bounds and actual runtime (seconds) of the AMG iteration. HSS is used as coarse-grid solver. The column "R/I" represents the combined restriction and interpolation steps. "Stopping" refers to the evaluation of the stopping criterion. 
Conclusion
We proposed a series of optimizations to improve the performance of the coarsegrid solver. The optimizations aim to expose fine-grained parallelism, exploit high memory bandwidth, and reduce OpenMP overheads. These led to a 2.6× reduction of the AMG solve time on a 60-core Xeon Phi KNC machine. We expect these optimizations to be effective on other manycore architectures as well. We also compared the performance of PCG with STRUMPACK when the two algorithms are used as coarse-grid solvers. We found that PCG is at a disadvantage because of its slow convergence. HSS usually leads to a faster AMG cycle, up to 2× faster than PCG. We expect the relative performance of PCG and HSS to depend on the problem. If the problem is such that the AMG parameters can be tuned so as to produce a well-conditioned coarse-grid matrix, then PCG could outperform HSS. Additionally, we explored the parameter space of our AMG algorithm and found high variation in performance. This makes the algorithm a good candidate for an autotuning approach. Our roofline model yields a bound that is within 23% (for PCG) and 31% (for HSS) of the actual performance on the Ivy Bridge EP. The gap is much more significant on the Xeon Phi KNC, which indicates that the bottleneck on that machine is not the memory bandwidth or the FPU but rather the high memory latency. More effective prefetching could hide this latency, but achieving this is challenging because of the relatively primitive hardware prefetchers on that machine, and because of the irregular memory access pattern in our coarse-grid solvers. We are exploring this optimization. Finally, the only aspect of performance that we considered in our study was time to solution. This is the objective that users care most about. Nevertheless, in future work it would also be valuable to consider other parameters, such as the financial cost of the hardware and its energy efficiency.
