Lecture 01: Scalable Solvers: Universals and Innovations by Keyes, David
University of Arkansas, Fayetteville 
ScholarWorks@UARK 
Mathematical Sciences Spring Lecture Series Mathematical Sciences 
4-5-2021 
Lecture 01: Scalable Solvers: Universals and Innovations 
David Keyes 
King Abdullah University of Science and Technology, david.keyes@kaust.edu.sa 
Follow this and additional works at: https://scholarworks.uark.edu/mascsls 
 Part of the Analysis Commons, Computer and Systems Architecture Commons, Data Storage Systems 
Commons, Dynamical Systems Commons, Numerical Analysis and Computation Commons, Numerical 
Analysis and Scientific Computing Commons, and the Ordinary Differential Equations and Applied 
Dynamics Commons 
Citation 
Keyes, D. (2021). Lecture 01: Scalable Solvers: Universals and Innovations. Mathematical Sciences Spring 
Lecture Series. Retrieved from https://scholarworks.uark.edu/mascsls/2 
This Video is brought to you for free and open access by the Mathematical Sciences at ScholarWorks@UARK. It 
has been accepted for inclusion in Mathematical Sciences Spring Lecture Series by an authorized administrator of 
ScholarWorks@UARK. For more information, please contact ccmiddle@uark.edu. 
David Keyes
Extreme Computing Research Center





University of Arkansas Department of Mathematical Sciences 
46th Spring Lecture Series
My goals for the series
n Recruit researchers to a “renaissance” in scalable solvers 
§ set the stage of opportunity
§ introduce global leaders in core techniques as guest lecturers
§ introduce several “universals” that govern scalable computing into the 
indefinite future
§ show some current algorithmic developments that relate to these 
universals
n Provide motivation to grow the computational 
mathematics community at our host institution
n Feature the research of young colleagues in KAUST’s 
Extreme Computing Research Center
n Encourage gender diversity in the math sciences
n Celebrate the elegance and power of the math sciences 
Structure of the series
n Traditional SLS structure 
§ five principal lectures 
§ ten guest lectures
n Plus a public outreach lecture
§ Harnessing the power of mathematics for HPC
n Plus a panel on Women in STEM
n A venerable tradition going back to 1977
§ An honor for us to be associated with the influential 
mathematical scientists – pure, applied, statistical, and 
computational – that have graced this series so far
§ Thanks to Professor Tulin Kaman for her vision, initiative, 
persistence, and logistics
SLS weeklong schedule
A falcon flies to where the prey will be …
… rather than where it is
flying to where the 
target will be
flying towards the target
C. H. Brighton, 
et al., PNAS 
(2017)
Let’s fly with the falcons… 
to where computer architectures will be
Some “universals” of exascale computing
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)
• Co-design algorithms with hardware, incl. computing in the network or in memory
Strategies in progress
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space
Strategies in practice
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages
• Exploit heterogeneity in processing, memory, and networking elements
Architectural imperatives
Timely appearance in CACM




§ “co-design” of architectures and applications
§ coordination of enabling software development
“Poster child” example: 
§ Quantum Chromodynamics (QCD), the 
application that led to IBM’s Blue Gene/L
https://www.exascale.org
Timely global topic (see lecture 5)
https://www.exascale.org/bdec
Exascale software agenda
n Emphasize heterogeneity and hierarchy
§ Heterogeneity is the new normal
§ Hierarchy is the key to efficient representation and access of big data
• Watch hardware opportunities
§ Processors: CPU, vector, GPU, TPU, FPGA, neuromorphic, quantum, …
§ Memories: cache, HBM, DRAM, NVRAM, …
§ Channels: copper, optical fiber, direct optical
n Think on two levels
§ High-level: how to find thresholds that amortize overheads for changing 
devices (heterogeneity) or scales (hierarchy)
§ Low-level: how to express (vector extensions, CUDA, libraries for remote ops)
n Gain hands-on experience and integration
§ Ideally in a multidisciplinary team, so one’s specialized efforts are part of 
something bigger that motivates and brings visibility and sponsorship
Exascale algorithmic opportunity
To “go big” and achieve the potential of emerging architectures 
for scientific applications, we need implementations of fast
• linear and least squares solvers
• singular value and eigensolvers 
• nonlinear solvers and optimizers
• integrators and sensitivity solvers
• stencil and tensor operators
that
n offer tunable accuracy-time-space tradeoffs
n exploit data sparsity
n exploit hierarchy of precisions
n may require more flops but complete earlier, thanks to more concurrency 
or less communication or synchronization
n are energy efficient
Two computational universes exist side-by-side
c/o Instageeked.com
* Global indices *
do i  {
do j  {




* Local indices *
for matrix blocks  (k,l)
do i  {
do j  {
for (i,j)  in Sk,l do op
}
}
Algorithms were once flat (Cholesky, 1910)
classical global triangular loop, O(n3)
A=LLT   or A= RTR  or A=LDLT
across columns
top to diag of right factor





Architectures were flat, as well (vN, 1945)
classical separation of ALU & memory
One hierarchy is not so bad…
As humans managing implementation complexity, we 
would prefer:
! hierarchical algorithms on flat architectures
or even (suboptimally) 
! flat algorithms on hierarchical architectures
… but two independent hierarchies may not match
n need to marshal irregular structures into uniform 
batches and/or 
n to feed dynamic runtime queues
n to best exploit hierarchical memory and heterogeneous 
accelerators
Hierarchies may not perfectly match, but…
We go to exascale with the architectures we have, 
not with the architectures we want!
! First exascale Gordon Bell Prize (2018) awarded on the 
heterogeneous Summit system at ORNL (currently the #2 
ranked system by HPL), with GPUs and Power9 cores
! A 4,000-node subset of Summit sustained 1.88 ExaOp/s of 
mixed precision on a genome-wide association studies 
(GWAS) application
! Majority of these operations are half-precision (16-bit 
floating point) NVIDIA tensor-core matrix-matrix 
multiplies, 64 FP FMADD operations per clock
Algorithmic philosophy
Algorithms must span a widening gulf …
A full employment program 











Must address the  tension between
n highly uniform vector, matrix, and general SIMT operations 
– prefer regularity and predictability
n hierarchical algorithms with tree-like data structures and 
scale recurrence – possess irregularity and adaptability
Hierarchical algorithms and extreme scale
our target
è Billions of
of investment worldwide in open source and 
commercial scientific software hangs in the balance 








! Solvers / integrators
! Dynamic load balancers
! Discretization adaptors
! Data (de-)compressors 
! Random no. generators
! Uncertainty quantifiers
! Graph & combinatorial 
operators












u Dynamic resource 
managers







u Fault monitors & 
recoverers
High-end computers come 
with little of this. Most is 
contributed by the user  
community.
in Cray LibSciin NVIDIA cuBLAS Aramco ExaWave
Our modest contributions at
https://github.com/ecrc
What will exascale algorithms look like?
n Attempt to start with algorithms as close as possible to 
optimal asymptotic order, O(N logpN)
n Some such optimal (typically hierarchical!) algorithms
! Fast Fourier Transform (1960’s) 
! Multigrid (1970’s)
! Fast Multipole (1980’s)
! Sparse Grids (1990’s)
! H matrices (2000’s)
! Randomized algorithms (2010’s)
! <What will you call your contribution?> (2020’s)
“With great computational power comes great 






Some “universals” of exascale computing
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages
• Exploit heterogeneity in processing, memory, and networking elements
Architectural imperatives
Classical memory hierarchy
c/o K. Webb (2018)
Memory placement increasingly a user decision
c/o J. Ang et al (2014)
HPL Top 10 memory BW trends, 2010-2020
Fugaku
The last three #1 systems
TaihuLight (Nov 2017) B/F = 0.004 
Summit (June 2018) B/F = 0.0005
Fugaku (June 2020) B/F = 0.303
Keren Bergman’s lab at Columbia 
has been tracking architectural 
trends in memory and networking 
interconnects for two decades. 
This slide is updated for Fugaku.
NB: log scale
Single-node speeds/feeds ratios, 1990-2020
John McCalpin, now at TACC, has 
been tracking architectural trends 
through the STREAM benchmark 
since 1990, when he noticed that 
code loops that gave 90% peak on 
Cray gave less than 10% on RISC
n BW based on node level GF/s divided by node 
level sustainable BW (memory or network)
n Latency based on GF/s for one core and 







On-node memory latency, 1990-2020
What happens if all cores stall on a local memory latency?
n 50% / yr increase reflects increase in # cores per socket package
n This worse-than-single-core scenario prevails, for example, if an OpenMP 
coordinating thread is in a serial section while the other cores are idle, and gets 







Why exa-… is hard
Moore’s Law (1965) has not fully ended 
but Dennard’s MOSFET scaling (1972) has
Eventually, processing is 
limited by transmission, 
as known for > 4 decades
Robert Dennard, IBM 
(inventor of DRAM, 1966)
Dennard et al., IEEE J. Solid-State Circuits (1974)
Typical power costs per operation
c/o J. Shalf (LBNL)
Remember that a pico (10-12) of something done exa (1018) 
times per second is a mega (106)-somethings per second
u 100 pJ at 1 Eflop/s is 100 MW (for the flop/s only!)
u 1 MW-year costs about $1M ($0.12/KW-hr × 8760 hr/yr)
• We “use” 1.4 KW continuously, so 100MW is 71,000 people
Operation approximate energy cost
DP FMADD flop 100 pJ
DP DRAM read-to-register 5,000 pJ
DP word transmit-to-neighbor 7,500 pJ
DP word transmit-across-system 10,000 pJ
Some “universals” of exascale computing
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages
• Exploit heterogeneity in processing, memory, and networking elements
Architectural imperatives
Rely on SIMD/SIMT tasks
n Many specialized operations are now hard-wired, e.g., 
§ traditional vector triadic operations
§ matrix-matrix operations used in DL
n Such instructions cannot be ignored 
§ 4x4 matrix-matrix multiply-add does 64 
FMADD instructions in one clock cycle
§ varieties of scales and precisions abound












































































~ 2 orders of 
magnitude spread!
191 of the Top500 systems
All but 3 of 
the 40 most 
efficient are 
accelerated
Specialization includes precision choice
c/o Nick Higham (2021)
Each halving of precision generally doubles execution rate 
§ sometimes more than 2x from higher memory residency for given no. of elements
Some “universals” of exascale computing
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages
• Exploit heterogeneity in processing, memory, and networking elements
Architectural imperatives






What happens if all cores stall on a network latency?
n Duration of the stalls will be something like log2(P) network latencies
n This is common during MPI collective operations that synchronize all participating 
cores, e.g., inner products, norms, barriers




Leslie Valiant, Harvard 
2010 Turing Award Winner Communications of the ACM, 1990
How are most simulations implemented at 
the petascale today?
n Iterative methods based on data decomposition and 
message-passing
! data structures (e.g., grid points, particles, agents) are distributed
! each individual processor works on a subdomain of the original 
(“owner computes”)
! exchanges information at its boundaries with other processors 
that own portions with which it interacts causally, to evolve in 
time or to establish equilibrium
! computation and neighbor communication are both fully 
parallelized and their ratio remains constant in weak scaling
n The programming model is BSP/SPMD/CSP
! Bulk Synchronous Programming 
! Single Program, Multiple Data
! Communicating Sequential Processes
BSP parallelism w/ domain decomposition
Partitioning of the grid 
induces block structure on 























By the Gordon Bell Prize, performance on real applications (e.g., 
mechanics, materials, petroleum reservoirs, etc.) has improved more than 
a million times in two decades.  Simulation cost per performance has 







Extrapolating exponentials eventually fails
Proceeded steadily for decades from giga- (1988) to 
tera- (1998) to peta- (2008) with 
§ same BSP programming model
§ same assumptions about who (hardware, systems software, 
applications software etc.) is responsible for what 
(resilience, performance, processor mapping, etc.)
§ same classes of algorithms (cf. 25 yrs. of Gordon Bell 
Prizes)
Main challenge going forward for BSP
Almost all “good” algorithms in linear algebra, 
differential equations, integral equations, signal 
analysis, etc., require frequent synchronizing global 
communication
§ inner products, norms, and fresh global residuals are 
“addictive” idioms
§ tends to hurt efficiency beyond 100,000 threads
§ can be fragile for smaller concurrency, as well, due to 
algorithmic load imbalance, hardware performance variation, 
etc.
Concurrency is heading into the billions of cores
§ Already 10.6 million on TaihuLight (currently #4 overall)
Some “universals” of exascale computing
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages
• Exploit heterogeneity in processing, memory, and networking elements
Architectural imperatives
Motivation to communicate less
How 3 solvers exploit more bandwidth
Geometric MG (LBNL)
many-to-many msgs, 5% of runtime
Algebraic MG (LLNL)
many small msgs, 40% of runtime
Spectral (ANL)
large msgs, 68.5% runtime
• Improvements resulting from additional rails in a fat-tree network depend 
on the application’s communication pattern
• For some apps, reduction in communication, not more bandwidth is the 
only alternative for runtime improvements
• Applications sending large numbers of small packets with fewer 
synchronization points (left) can see major improvements
• Applications transferring small numbers of larger packets with frequent 
synchronization (right) see diminished improvement
c/o Jens Domke (RIKEN, 2021)
Some “universals” of exascale computing
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages
• Exploit heterogeneity in processing, memory, and networking elements
Architectural imperatives
Heterogeneity is taking over (top of) Top500
c/o Erich Strohmeier (LBNL, 2020)
Nearly one-third of the Top500 systems exploit accelerators 
§ disproportionally concentrated at the top of the list
Heterogenous HPL performance 
and power efficiency






recent years, all 




after J. Ang et al (Sandia, 2014)
Heterogeneity in today’s smart phone
Typical smart phone has 40+ special processors
c/o John Shalf (LBNL, 2021)
Some “universals” of exascale computing
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space
Strategies in practice
Exploit extra memory to reduce comm
Exploit extra memory to reduce comm
Some “universals” of exascale computing
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space
Strategies in practice
Perform extra flops to synchronize less




Some “universals” of exascale computing
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space
Strategies in practice
Use high-order discretizations for fewer DOFs
Rediscretize from 32 spectral 
elements of order 8 on a side 
to 8 spectral elements of order 
16 on a side
Same error in key functional: 
- approx 4e-6
Savings in execution time:
- factor of 8
Use high-order discretizations for fewer DOFs
Four different dense linear algebra libraries compared on 15 different 
element orders for execution rate and memory transfer rate
Performance of all libraries improves up to 16th-order elements
LIBXSMM continues to improve up to 32nd-order elements
Some “universals” of exascale computing
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space
Strategies in practice
Adapt precision to accuracy requirements




Implicit question: Do we want to wait for NVIDIA Hopper (2023)? 
Or do we want Hopper performance on NVIDIA Ampere today?   
Adapt precision to accuracy requirements
fp64, fp32, fp16 defined by IEEE standard
Bfloat16: Google, Intel, ARM, NVIDIA
c/o Nick Higham (Manchester, 2021)
Some “universals” of exascale computing
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space
Strategies in practice
Resilience in algorithms, not hardware
Key ideas:
§ Reliable computing is expensive
§ Divide memory: reliable/unreliable
§ Divide routines: reliable/unreliable
§ Do most of the work in unreliable 
mode with reliable detection and 
correction
§ Ex.: FT-GMRES with unreliable 
matvec or preconditioner
Resilience in algorithms, not hardware
Ill-conditioned Stokes problem




Some “universals” of exascale computing
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)




















































































Task graph for the first 3 stages of a 
Generalized Symmetric EVP with 4 blocks
Int Conf Par Comput (ParCo) 2011
Employ dynamic scheduling
n Remove artifactual synchronizations in the form of subroutine boundaries
n Remove artifactual orderings in the form of pre-scheduled loops






Some “universals” of exascale computing
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)
• Co-design algorithms with hardware, incl. computing in the network or in memory
Strategies in progress
Employ APIs to specialized back-ends
Employ APIs to specialized back-ends
applications
architectures
(ARM, AMD, IBM, Intel, NVIDIA, …)
algorithmic 
infrastructure
n Tiling and recursive subdivision 
create large numbers of small 
problems that can be marshaled 
for batched operations on GPUs 
and MICs
" amortize call overheads
" polyalgorithmic approach based 
on block size
n Non-temporal stores, coalesced 
memory accesses, double-
buffering, etc. reduce sensitivity to 
memory
n Code is complex 
n Code is architecture-specific at the 
bottom
n Need to hide the support from the 
apps through an API
Some “universals” of exascale computing
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)










Complexities of rank-structured factorization
For a square dense matrix of O(N) :
n “Straight” LU or LDLT
§ Operations O(N3)
§ Storage O(N2)
n Tile low-rank (Amestoy, Buttari, L’Excellent & Mary, SISC, 2016)*
§ Operations O(k0.5 N2)
§ Storage O(k0.5 N1.5)
§ for uniform blocks with size chosen optimally for max rank k of any 
compressed block, bounded number of uncompressed blocks per row 
n Hierarchically low-rank (Grasedyck & Hackbusch, Computing, 2003) 
§ Operations O(k2 N log2N)
§ Storage O(k N)
§ for strong admissibility, where k is max rank of any compressed block
* First reported O(k0.5 N2.5), then later O(k0.5 N2) for variant that reorders updates and recompression
Some “universals” of exascale computing
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)
• Co-design algorithms with hardware, incl. computing in the network or in memory
Strategies in progress
Process “on the fly”
H matrix-H matrix multiplication
Fast matvecs ⇒ fast approx inversions with Newton-Schulz
Some “universals” of exascale computing
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)
• Co-design algorithms with hardware, incl. computing in the network or in memory
Strategies in progress
Co-design algorithms with hardware
Co-design algorithms with hardware
Some “universals” of exascale computing
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)
• Co-design algorithms with hardware, incl. computing in the network or in memory
Strategies in progress
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space
Strategies in practice
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages




are brought closer within reach
with insights from math
print c/o Toshi Yoshida
Thank you!
اركش
david.keyes@kaust.edu.sa
