University of Arkansas, Fayetteville

ScholarWorks@UARK
Mathematical Sciences Spring Lecture Series

Mathematical Sciences

4-5-2021

Lecture 01: Scalable Solvers: Universals and Innovations
David Keyes
King Abdullah University of Science and Technology, david.keyes@kaust.edu.sa

Follow this and additional works at: https://scholarworks.uark.edu/mascsls
Part of the Analysis Commons, Computer and Systems Architecture Commons, Data Storage Systems
Commons, Dynamical Systems Commons, Numerical Analysis and Computation Commons, Numerical
Analysis and Scientific Computing Commons, and the Ordinary Differential Equations and Applied
Dynamics Commons

Citation
Keyes, D. (2021). Lecture 01: Scalable Solvers: Universals and Innovations. Mathematical Sciences Spring
Lecture Series. Retrieved from https://scholarworks.uark.edu/mascsls/2

This Video is brought to you for free and open access by the Mathematical Sciences at ScholarWorks@UARK. It
has been accepted for inclusion in Mathematical Sciences Spring Lecture Series by an authorized administrator of
ScholarWorks@UARK. For more information, please contact scholar@uark.edu.

University of Arkansas Department of Mathematical Sciences

46th Spring Lecture Series
David Keyes
Extreme Computing Research Center
King Abdullah University of Science and Technology
5-9 April 2021

Lecture 1

Scalable Solvers:
Universals and Innovations

My goals for the series
• Recruit researchers to a “renaissance” in scalable solvers
  - set the stage of opportunity
  - introduce global leaders in core techniques as guest lecturers
  - introduce several “universals” that govern scalable computing into the indefinite future
  - show some current algorithmic developments that relate to these universals
• Provide motivation to grow the computational mathematics community at our host institution
• Feature the research of young colleagues in KAUST’s Extreme Computing Research Center
• Encourage gender diversity in the math sciences
• Celebrate the elegance and power of the math sciences

Structure of the series
• Traditional SLS structure
  - five principal lectures
  - ten guest lectures
• Plus a public outreach lecture
  - Harnessing the power of mathematics for HPC
• Plus a panel on Women in STEM
• A venerable tradition going back to 1977
  - An honor for us to be associated with the influential mathematical scientists – pure, applied, statistical, and computational – that have graced this series so far
  - Thanks to Professor Tulin Kaman for her vision, initiative, persistence, and logistics

SLS weeklong schedule

A falcon flies to where the prey will be … rather than where it is
[Figure: two pursuit paths, “flying towards the target” and “flying to where the target will be”. C. H. Brighton et al., PNAS (2017)]

Let’s fly with the falcons…
to where computer architectures will be

Some “universals” of exascale computing

Architectural imperatives
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages
• Exploit heterogeneity in processing, memory, and networking elements

Strategies in practice
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space

Strategies in progress
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based on DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)
• Co-design algorithms with hardware, incl. computing in the network or in memory

Timely appearance in CACM

Timely US interagency topic

https://nitrd.gov

Timely global topic

Key concepts:
• “co-design” of architectures and applications
• coordination of enabling software development

“Poster child” example:
• Quantum Chromodynamics (QCD), the application that led to IBM’s Blue Gene/L

https://www.exascale.org

Timely global topic (see lecture 5)

https://www.exascale.org/bdec

Exascale software agenda
• Emphasize heterogeneity and hierarchy
  - Heterogeneity is the new normal
  - Hierarchy is the key to efficient representation and access of big data
• Watch hardware opportunities
  - Processors: CPU, vector, GPU, TPU, FPGA, neuromorphic, quantum, …
  - Memories: cache, HBM, DRAM, NVRAM, …
  - Channels: copper, optical fiber, direct optical
• Think on two levels
  - High-level: how to find thresholds that amortize overheads for changing devices (heterogeneity) or scales (hierarchy)
  - Low-level: how to express (vector extensions, CUDA, libraries for remote ops)
• Gain hands-on experience and integration
  - Ideally in a multidisciplinary team, so one’s specialized efforts are part of something bigger that motivates and brings visibility and sponsorship

Exascale algorithmic opportunity
To “go big” and achieve the potential of emerging architectures for scientific applications, we need implementations of fast
• linear and least squares solvers
• singular value and eigensolvers
• nonlinear solvers and optimizers
• integrators and sensitivity solvers
• stencil and tensor operators
that
• offer tunable accuracy-time-space tradeoffs
• exploit data sparsity
• exploit hierarchy of precisions
• may require more flops but complete earlier, thanks to more concurrency or less communication or synchronization
• are energy efficient

Two computational universes exist side-by-side
(image c/o Instageeked.com)

Flat (global indices):
do i {
  do j {
    for (i,j) in S do op
  }
}

Hierarchical (local indices):
for matrix blocks (k,l)
  do i {
    do j {
      for (i,j) in S_{k,l} do op
    }
  }
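
The two loop structures above can be sketched in Python. This is an illustration only, assuming a square n-by-n index set and a hypothetical block size `b`; the function names are not from the lecture. Both traversals visit the same index pairs, but the hierarchical one groups them by block and uses local indices inside each block.

```python
def flat_traversal(n):
    """Visit every (i, j) of an n-by-n index set with one global double loop."""
    return [(i, j) for i in range(n) for j in range(n)]


def tiled_traversal(n, b):
    """Visit the same index set block by block, with local indices inside each block."""
    visited = []
    for k in range(0, n, b):                      # block row offset
        for l in range(0, n, b):                  # block column offset
            for i in range(min(b, n - k)):        # local row index within the block
                for j in range(min(b, n - l)):    # local column index within the block
                    visited.append((k + i, l + j))
    return visited


flat = flat_traversal(6)
tiled = tiled_traversal(6, 4)
# Same set of index pairs, different visiting order.
same_work = sorted(flat) == sorted(tiled)
```

The visiting order is the only difference; on a machine with a memory hierarchy, that order determines how much of each block stays resident near the processing elements.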

Algorithms were once flat (Cholesky, 1910)
geodesic least squares problem

A = LL^T or A = R^T R or A = LDL^T
across columns
top to diag of right factor
inner prod length “i”

classical global triangular loop, O(n^3)
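
A minimal sketch of the flat triangular loop for the A = R^T R form, written for a small dense symmetric positive definite matrix stored as a list of lists. The function name and the test matrix are illustrative assumptions. It follows the slide's description: sweep across columns, go from the top of each column to the diagonal, and use an inner product of length i.

```python
import math


def cholesky_upper(a):
    """Return upper-triangular R with A = R^T R for a symmetric positive definite A."""
    n = len(a)
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):                  # across columns of the right factor
        for i in range(j + 1):          # from the top of column j down to the diagonal
            # inner product of length i between columns i and j of R
            s = sum(r[k][i] * r[k][j] for k in range(i))
            if i == j:
                r[j][j] = math.sqrt(a[j][j] - s)
            else:
                r[i][j] = (a[i][j] - s) / r[i][i]
    return r


A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
R = cholesky_upper(A)
```

The three nested loops (over j, i, and the inner product index k) are what give the O(n^3) operation count.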

Architectures were flat, as well (von Neumann, 1945)

classical separation of ALU & memory

One hierarchy is not so bad…
As humans managing implementation complexity, we would prefer:
• hierarchical algorithms on flat architectures
or even (suboptimally)
• flat algorithms on hierarchical architectures

… but two independent hierarchies may not match
• need to marshal irregular structures into uniform batches and/or
• to feed dynamic runtime queues
• to best exploit hierarchical memory and heterogeneous accelerators

Hierarchies may not perfectly match, but…
We go to exascale with the architectures we have, not with the architectures we want!
• First exascale Gordon Bell Prize (2018) awarded on the heterogeneous Summit system at ORNL (currently the #2 ranked system by HPL), with GPUs and Power9 cores
• A 4,000-node subset of Summit sustained 1.88 ExaOp/s of mixed precision on a genome-wide association studies (GWAS) application
• Majority of these operations are half-precision (16-bit floating point) NVIDIA tensor-core matrix-matrix multiplies, 64 FP FMADD operations per clock

Algorithmic philosophy
Algorithms must span a widening gulf …
[Diagram: adaptive algorithms spanning the gulf between austere architectures and ambitious applications]

A full employment program for algorithm developers ☺

Hierarchical algorithms and extreme scale
Must address the tension between
• highly uniform vector, matrix, and general SIMT operations – prefer regularity and predictability
• hierarchical algorithms with tree-like data structures and scale recurrence – possess irregularity and adaptability

[Diagram: our target, hierarchical algorithms on GPU and manycore architectures]

⇒ Billions of dollars of investment worldwide in open source and commercial scientific software hangs in the balance until our algorithmic infrastructure evolves to span the architecture-applications gap

Required software

Model-related
• Geometric modelers
• Meshers
• Discretizers
• Partitioners
• Solvers / integrators
• Dynamic load balancers
• Discretization adaptors
• Data (de-)compressors
• Random no. generators
• Uncertainty quantifiers
• Graph & combinatorial operators
• Subgridscale physics machine learners

Development-related
• Build configurers
• Source-to-source translators
• Compilers
• Simulators
• Message passers
• Debuggers
• Profilers

Production-related
• Dynamic resource managers
• Dynamic performance optimizers
• Authenticators
• I/O optimizers
• Visualizers
• Workflow controllers
• Data miners
• Fault monitors & recoverers

High-end computers come with little of this. Most is contributed by the user community.

Our modest contributions at
https://github.com/ecrc
in NVIDIA cuBLAS

in Cray LibSci

Aramco ExaWave

What will exascale algorithms look like?
• Attempt to start with algorithms as close as possible to optimal asymptotic order, O(N log^p N)
• Some such optimal (typically hierarchical!) algorithms
  - Fast Fourier Transform (1960’s)
  - Multigrid (1970’s)
  - Fast Multipole (1980’s)
  - Sparse Grids (1990’s)
  - H matrices (2000’s)
  - Randomized algorithms (2010’s)
  - <What will you call your contribution?> (2020’s)

“With great computational power comes great algorithmic responsibility.” – Longfei Gao
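
As one concrete instance of a hierarchical algorithm of near-optimal order, here is a minimal recursive radix-2 FFT sketch. It assumes the input length is a power of two and is checked against a direct O(N^2) transform; the function names are illustrative. The recursion splits the problem into two half-size problems, which is the source of the O(N log N) count.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])                  # half-size problem on even-indexed samples
    odd = fft(x[1::2])                   # half-size problem on odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft(x):
    """Direct O(N^2) discrete Fourier transform, used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
max_diff = max(abs(a - b) for a, b in zip(fft(signal), dft(signal)))
```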

Energy-aware
generation

Flat, bulk
synchronous
generation

Classical memory hierarchy

c/o K. Webb (2018)

Memory placement increasingly a user decision

c/o J. Ang et al (2014)

HPL Top 10 memory BW trends, 2010-2020
[Plot, log scale; Fugaku marked]

Keren Bergman’s lab at Columbia has been tracking architectural trends in memory and networking interconnects for two decades. This slide is updated for Fugaku.

The last three #1 systems
TaihuLight (Nov 2017) B/F = 0.004
Summit (June 2018) B/F = 0.0005
Fugaku (June 2020) B/F = 0.303

Single-node speeds/feeds ratios, 1990-2020
[Plot, log scale; curve annotations: 18% / yr, 20% / yr, 25% / yr, 15% / yr]

John McCalpin, now at TACC, has been tracking architectural trends through the STREAM benchmark since 1990, when he noticed that code loops that gave 90% peak on Cray gave less than 10% on RISC
• BW based on node level GF/s divided by node level sustainable BW (memory or network)
• Latency based on GF/s for one core and latency for “load” to local memory or “get” from another node

On-node memory latency, 1990-2020
[Plot, log scale; curves: all cores; 1 core, 20% / yr (same curve)]

What happens if all cores stall on a local memory latency?
• 50% / yr increase reflects increase in # cores per socket package
• This worse-than-single-core scenario prevails, for example, if an OpenMP coordinating thread is in a serial section while the other cores are idle, and gets worse with flooding of cores per socket

Why exa-… is hard
Moore’s Law (1965) has not fully ended
but Dennard’s MOSFET scaling (1972) has

Robert Dennard, IBM
(inventor of DRAM, 1966)

Eventually, processing is
limited by transmission,
as known for > 4 decades
Dennard et al., IEEE J. Solid-State Circuits (1974)

Typical power costs per operation

Operation                         Approximate energy cost
DP FMADD flop                     100 pJ
DP DRAM read-to-register          5,000 pJ
DP word transmit-to-neighbor      7,500 pJ
DP word transmit-across-system    10,000 pJ

Remember that a pico (10^{-12}) of something done exa (10^{18}) times per second is a mega (10^{6})-somethings per second
• 100 pJ at 1 Eflop/s is 100 MW (for the flop/s only!)
• 1 MW-year costs about $1M ($0.12/kW-hr × 8760 hr/yr)
• We “use” 1.4 kW continuously, so 100 MW is 71,000 people
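
The arithmetic behind these three bullets can be checked in a few lines. All inputs are the figures quoted on the slide; the variable names are illustrative.

```python
PICO = 1e-12
EXA = 1e18

flop_energy_j = 100 * PICO               # 100 pJ per DP FMADD flop
flop_rate = 1 * EXA                      # 1 Eflop/s
power_w = flop_energy_j * flop_rate      # joules per second, i.e. watts
power_mw = power_w / 1e6                 # about 100 MW

price_per_kwh = 0.12                     # dollars per kW-hr
hours_per_year = 8760
cost_per_mw_year = price_per_kwh * 1000 * hours_per_year   # 1 MW = 1000 kW

per_person_kw = 1.4                      # continuous per-person use quoted on the slide
people = power_mw * 1000 / per_person_kw
```

With these inputs the cost of a MW-year comes out slightly above one million dollars, and the head count comes out slightly above 71,000, consistent with the rounded figures on the slide.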

c/o J. Shalf (LBNL)

Rely on SIMD/SIMT tasks
• Many specialized operations are now hard-wired, e.g.,
  - traditional vector triadic operations
  - matrix-matrix operations used in DL
• Such instructions cannot be ignored
  - 4x4 matrix-matrix multiply-add does 64 FMADD instructions in one clock cycle
  - varieties of scales and precisions abound
  - more than an order of magnitude efficiency at stake
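
To see where the count of 64 comes from, the following sketch performs D = A·B + C on 4x4 matrices in plain Python and counts one multiply-add per inner step. It only models the operation count (4 × 4 × 4), not how a hardware unit executes it; the function name is illustrative.

```python
def matmul_add_4x4(a, b, c):
    """Return (d, fma_count) with d = a @ b + c, counting one multiply-add per inner step."""
    n = 4
    d = [row[:] for row in c]
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]    # one multiply-add
                fma_count += 1
    return d, fma_count


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
d, fma_count = matmul_add_4x4(identity, ones, ones)
```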

Power efficiencies
(191 entries of Nov 2020 Top500 report efficiency rating)
[Bar chart: power efficiency in GF/s/W (axis from 0 to 30) for the 191 reporting Top500 systems, indexed 1 to 191]
• Most efficient: > 26 GF/s/W
• All but 3 of the 40 most efficient are accelerated
• Median: 3.29 GF/s/W
• Least efficient: < 0.2 GF/s/W
• ~ 2 orders of magnitude spread!

Specialization includes precision choice

Each halving of precision generally doubles execution rate
• sometimes more than 2x from higher memory residency for given no. of elements
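
The memory residency point can be illustrated by counting how many elements of each precision fit in a fixed amount of fast memory. The 32 KiB capacity below is a hypothetical figure chosen for illustration; the element sizes come from Python's `struct` module.

```python
import struct

BYTES_PER_ELEMENT = {
    "fp64": struct.calcsize("<d"),    # IEEE double precision
    "fp32": struct.calcsize("<f"),    # IEEE single precision
    "fp16": struct.calcsize("<e"),    # IEEE half precision
}


def elements_resident(capacity_bytes, fmt):
    """Number of elements of the given format that fit in capacity_bytes."""
    return capacity_bytes // BYTES_PER_ELEMENT[fmt]


CACHE_BYTES = 32 * 1024               # hypothetical fast-memory capacity
resident = {fmt: elements_resident(CACHE_BYTES, fmt) for fmt in BYTES_PER_ELEMENT}
```

Each halving of precision doubles the number of elements that stay resident, so a working set that spills out of fast memory in fp64 may fit entirely in fp32 or fp16.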

c/o Nick Higham (2021)

Off-node data latency, 1990-2020
[Plot, log scale; curves: all cores; 1 core, 18% / yr (same curve)]

What happens if all cores stall on a network latency?
• Duration of the stalls will be something like log_2(P) network latencies
• This is common during MPI collective operations that synchronize all participating cores, e.g., inner products, norms, barriers
• “Bandwidth is limited by money, but latency is limited by physics”
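
One way to see the log_2(P) factor is to simulate a recursive-doubling all-reduce, a common pattern for global sums. This is a serial simulation under the assumption that the number of ranks is a power of two; it does not use MPI, and the function name is illustrative. Each round stands for one network latency.

```python
def allreduce_sum(values):
    """Simulate a recursive-doubling all-reduce; return (per-rank results, rounds)."""
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("number of ranks must be a power of two")
    vals = list(values)
    rounds = 0
    dist = 1
    while dist < p:
        # every rank exchanges with its partner (rank XOR dist), then both add
        vals = [vals[r] + vals[r ^ dist] for r in range(p)]
        dist *= 2
        rounds += 1
    return vals, rounds


result, rounds = allreduce_sum(list(range(16)))
```

Doubling the number of ranks adds one more round, so the stall grows with the logarithm of P even when every message is tiny.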

Bulk Synchronous
Parallelism

Leslie Valiant, Harvard
2010 Turing Award Winner

Communications of the ACM, 1990

How are most simulations implemented at the petascale today?
• Iterative methods based on data decomposition and message-passing
  - data structures (e.g., grid points, particles, agents) are distributed
  - each individual processor works on a subdomain of the original (“owner computes”)
  - exchanges information at its boundaries with other processors that own portions with which it interacts causally, to evolve in time or to establish equilibrium
  - computation and neighbor communication are both fully parallelized and their ratio remains constant in weak scaling
• The programming model is BSP/SPMD/CSP
  - Bulk Synchronous Programming
  - Single Program, Multiple Data
  - Communicating Sequential Processes
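
The pattern described above can be sketched as a serial simulation. The example assumes a 1D Jacobi averaging iteration with fixed end values and at least one interior point per simulated owner; the function names and the problem are illustrative, not taken from the lecture. Each sweep has three phases: halo exchange, owner-computes update, and a global reduction.

```python
def serial_jacobi(u, sweeps):
    """Reference: repeated sweeps of u_i <- (u_{i-1} + u_{i+1}) / 2 with fixed end values."""
    u = list(u)
    for _ in range(sweeps):
        u = [u[0]] + [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)] + [u[-1]]
    return u


def bsp_jacobi(u, sweeps, nprocs):
    """Same sweeps with the interior points split among nprocs simulated owners."""
    n = len(u) - 2
    bounds = [1 + (n * p) // nprocs for p in range(nprocs + 1)]
    # each owner stores its own points plus one ghost cell on each side
    local = [[0.0] + list(u[bounds[p]:bounds[p + 1]]) + [0.0] for p in range(nprocs)]
    change = 0.0
    for _ in range(sweeps):
        # phase 1: neighbor (halo) exchange fills the ghost cells
        for p in range(nprocs):
            local[p][0] = u[0] if p == 0 else local[p - 1][-2]
            local[p][-1] = u[-1] if p == nprocs - 1 else local[p + 1][1]
        # phase 2: each owner updates only its own subdomain ("owner computes")
        new = [[loc[0]] + [(loc[i - 1] + loc[i + 1]) / 2 for i in range(1, len(loc) - 1)]
               + [loc[-1]] for loc in local]
        # phase 3: global reduction over all owners (the synchronizing step)
        change = max(abs(a - b) for old, cur in zip(local, new)
                     for a, b in zip(old[1:-1], cur[1:-1]))
        local = new
    interior = [x for loc in local for x in loc[1:-1]]
    return [u[0]] + interior + [u[-1]], change


u0 = [0.0] * 9 + [1.0]
parallel, change = bsp_jacobi(u0, 25, 3)
```

The neighbor exchange in phase 1 scales with the subdomain surface, while the reduction in phase 3 involves every owner on every sweep.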

BSP parallelism w/ domain decomposition
[Figure: grid partitioned into subdomains Ω1, Ω2, Ω3; the block row A21 A22 A23 of the system matrix holds the rows assigned to proc “2”]

Partitioning of the grid induces block structure on the system matrix (Jacobian)

BSP has an impressive legacy
By the Gordon Bell Prize, performance on real applications (e.g., mechanics, materials, petroleum reservoirs, etc.) has improved more than a million times in two decades. Simulation cost per performance has improved by nearly a million times.

Gordon Bell Prize: Peak Performance
Year    Gigaflop/s delivered to applications
1988    1
1998    1,020
2008    1,350,000

Gordon Bell Prize: Price Performance
Year    Cost per delivered Gigaflop/s
1989    $2,500,000
1999    $6,900
2009    $8

Extrapolating exponentials eventually fails
Proceeded steadily for decades from giga- (1988) to tera- (1998) to peta- (2008) with
• same BSP programming model
• same assumptions about who (hardware, systems software, applications software etc.) is responsible for what (resilience, performance, processor mapping, etc.)
• same classes of algorithms (cf. 25 yrs. of Gordon Bell Prizes)

Main challenge going forward for BSP
Almost all “good” algorithms in linear algebra, differential equations, integral equations, signal analysis, etc., require frequent synchronizing global communication
• inner products, norms, and fresh global residuals are “addictive” idioms
• tends to hurt efficiency beyond 100,000 threads
• can be fragile for smaller concurrency, as well, due to algorithmic load imbalance, hardware performance variation, etc.

Concurrency is heading into the billions of cores
• Already 10.6 million on TaihuLight (currently #4 overall)

Motivation to communicate less

How 3 solvers exploit more bandwidth
Geometric MG (LBNL): many-to-many msgs, 5% of runtime
Algebraic MG (LLNL): many small msgs, 40% of runtime
Spectral (ANL): large msgs, 68.5% of runtime

• Improvements resulting from additional rails in a fat-tree network depend on the application’s communication pattern
• For some apps, reduction in communication, not more bandwidth, is the only alternative for runtime improvements
• Applications sending large numbers of small packets with fewer synchronization points (left) can see major improvements
• Applications transferring small numbers of larger packets with frequent synchronization (right) see diminished improvement

c/o Jens Domke (RIKEN, 2021)

Heterogeneity is taking over (top of) Top500

Nearly one-third of the Top500 systems exploit accelerators
• disproportionately concentrated at the top of the list

c/o Erich Strohmeier (LBNL, 2020)

Heterogeneous HPL performance and power efficiency
[Two plots, each labeled 2019]
For these recent years, all of the Top 5 systems were heterogeneous

c/o H. Sim, S. Vazhkudai & A. Khan (ORNL, 2020)

Exploit heterogeneity

after J. Ang et al (Sandia, 2014)

Heterogeneity in today’s smart phone

Typical smart phone has 40+ special processors
c/o John Shalf (LBNL, 2021)

Exploit extra memory to reduce comm

Perform extra flops to synchronize less

NB: log scale
Speedup
of 2.25x

Use high-order discretizations for fewer DOFs

Rediscretize from 32 spectral elements of order 8 on a side to 8 spectral elements of order 16 on a side
• Same error in key functional: approx 4e-6
• Savings in execution time: factor of 8
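
A rough count of grid points for the two discretizations, under assumptions the slide does not state: a box-shaped domain, continuous tensor-product elements (shared faces counted once), and a spatial dimension of either 2 or 3. The function name is illustrative. The count says nothing about cost per point, which rises with order, so it is not a model of the reported run time.

```python
def grid_points(elements_per_side, order, dim):
    """Nodes of a continuous tensor-product spectral element mesh on a box."""
    return (elements_per_side * order + 1) ** dim


low_order_2d = grid_points(32, 8, 2)
high_order_2d = grid_points(8, 16, 2)
low_order_3d = grid_points(32, 8, 3)
high_order_3d = grid_points(8, 16, 3)

ratio_2d = low_order_2d / high_order_2d
ratio_3d = low_order_3d / high_order_3d
```

Under these assumptions the higher-order mesh has roughly 4 times fewer points in 2D and roughly 8 times fewer in 3D for the same reported error.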

Use high-order discretizations for fewer DOFs
Four different dense linear algebra libraries compared on 15 different
element orders for execution rate and memory transfer rate

Performance of all libraries improves up to 16th-order elements
LIBXSMM continues to improve up to 32nd-order elements

Adapt precision to accuracy requirements

(2020)
(2017)
(2014)

Implicit question: Do we want to wait for NVIDIA Hopper (2023)?
Or do we want Hopper performance on NVIDIA Ampere today?

Adapt precision to accuracy requirements

fp64, fp32, fp16 defined by IEEE standard
Bfloat16: Google, Intel, ARM, NVIDIA
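
The trade-off between these formats can be read off their bit layouts. The sketch below uses the standard layouts (exponent bits, stored significand bits) and computes the unit roundoff and largest finite value of each; the helper names are illustrative.

```python
FORMATS = {
    # name: (exponent bits, stored significand bits)
    "bfloat16": (8, 7),
    "fp16": (5, 10),
    "fp32": (8, 23),
    "fp64": (11, 52),
}


def unit_roundoff(sig_bits):
    """Half the spacing of floating point numbers just above 1."""
    return 2.0 ** -(sig_bits + 1)


def largest_finite(exp_bits, sig_bits):
    """Largest finite value representable with the given layout."""
    emax = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -sig_bits) * 2.0 ** emax


summary = {name: (unit_roundoff(s), largest_finite(e, s))
           for name, (e, s) in FORMATS.items()}
```

bfloat16 keeps the exponent range of fp32 and gives up significand bits, while fp16 keeps more significand bits and has a much smaller range.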

c/o Nick Higham (Manchester, 2021)

Resilience in algorithms, not hardware

Key ideas:
• Reliable computing is expensive
• Divide memory: reliable/unreliable
• Divide routines: reliable/unreliable
• Do most of the work in unreliable mode with reliable detection and correction
• Ex.: FT-GMRES with unreliable matvec or preconditioner
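
The split between unreliable work and reliable detection can be illustrated with a much simpler method than FT-GMRES. The sketch below is not FT-GMRES: it is a Jacobi iteration whose sweep is treated as unreliable (a random entry is occasionally corrupted), wrapped in a residual check that is treated as reliable. The matrix, fault model, and all names are assumptions made for illustration.

```python
import random


def solve_with_unreliable_sweeps(a, b, iters=200, fault_rate=0.1, seed=0):
    """Jacobi sweeps with injected faults, guarded by a reliable residual check."""
    rng = random.Random(seed)
    n = len(b)

    def residual_norm(x):               # treated as the reliable routine
        return max(abs(b[i] - sum(a[i][j] * x[j] for j in range(n))) for i in range(n))

    def unreliable_sweep(x):            # treated as the unreliable routine
        new = [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
               for i in range(n)]
        if rng.random() < fault_rate:
            new[rng.randrange(n)] *= 1e6     # injected soft fault
        return new

    x = [0.0] * n
    best = residual_norm(x)
    rejected = 0
    for _ in range(iters):
        candidate = unreliable_sweep(x)
        r = residual_norm(candidate)
        if r < best:                    # detection: accept only if the residual drops
            x, best = candidate, r
        else:
            rejected += 1               # correction: discard and retry from trusted x
    return x, best, rejected


n = 5
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
x, best, rejected = solve_with_unreliable_sweeps(A, b)
```

Only the residual evaluation and the accept/reject decision need to be trusted; the bulk of the arithmetic runs in the unreliable routine.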

Resilience in algorithms, not hardware
Ill-conditioned Stokes problem
matvec unreliable
deterministically spaced faults

matvec & preconditioner unreliable
random faults

Employ dynamic scheduling

Task graph for the first 3 stages of a
Generalized Symmetric EVP with 4 blocks

Int Conf Par Comput (ParCo) 2011

Employ dynamic scheduling

[Plot legend: LAPACK, MKL, PLASMA; speedup of 21x]
• Remove artifactual synchronizations in the form of subroutine boundaries
• Remove artifactual orderings in the form of pre-scheduled loops
• Expose more concurrency
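
The effect of removing subroutine-boundary barriers can be sketched on a small task graph. The example below uses a tiled Cholesky factorization rather than the eigenvalue problem shown in the figure, assumes every task takes one time unit, and uses a simple greedy scheduler in place of a real dynamic runtime; all names are illustrative.

```python
def cholesky_tasks(t):
    """Task list of a tiled Cholesky on t-by-t tiles: (name, deps, phase) per task."""
    tasks = []
    last_writer = {}                     # tile -> index of the task that last wrote it

    def add(name, reads, writes, phase):
        deps = {last_writer[x] for x in reads + [writes] if x in last_writer}
        tasks.append((name, deps, phase))
        last_writer[writes] = len(tasks) - 1

    for k in range(t):
        add(f"POTRF({k})", [], (k, k), (k, 0))
        for i in range(k + 1, t):
            add(f"TRSM({i},{k})", [(k, k)], (i, k), (k, 1))
        for i in range(k + 1, t):
            for j in range(k + 1, i):
                add(f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j), (k, 2))
            add(f"SYRK({i},{k})", [(i, k)], (i, i), (k, 2))
    return tasks


def critical_path(tasks):
    """Length of the longest dependency chain, in tasks."""
    level = []
    for _, deps, _ in tasks:
        level.append(1 + max((level[d] for d in deps), default=0))
    return max(level)


def dag_schedule_steps(tasks, workers):
    """Greedy scheduling: each step runs up to `workers` tasks whose inputs are done."""
    done, steps = set(), 0
    while len(done) < len(tasks):
        ready = [i for i, (_, deps, _) in enumerate(tasks)
                 if i not in done and deps <= done]
        done.update(ready[:workers])
        steps += 1
    return steps


def fork_join_steps(tasks, workers):
    """Same tasks with a barrier after every phase (the subroutine boundary)."""
    widths = {}
    for _, _, phase in tasks:
        widths[phase] = widths.get(phase, 0) + 1
    return sum(-(-w // workers) for w in widths.values())   # ceil(w / workers)


tasks = cholesky_tasks(4)
dag_steps = dag_schedule_steps(tasks, workers=2)
barrier_steps = fork_join_steps(tasks, workers=2)
```

With two workers, the dependency-driven schedule finishes in fewer steps than the barrier-per-phase schedule, because tasks from the next panel start as soon as their own inputs are ready.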

Employ APIs to specialized back-ends
[Diagram: applications / algorithmic infrastructure / architectures (ARM, AMD, IBM, Intel, NVIDIA, …)]
• Tiling and recursive subdivision create large numbers of small problems that can be marshaled for batched operations on GPUs and MICs
  - amortize call overheads
  - polyalgorithmic approach based on block size
• Non-temporal stores, coalesced memory accesses, double-buffering, etc. reduce sensitivity to memory
• Code is complex
• Code is architecture-specific at the bottom
• Need to hide the support from the apps through an API

Exploit data sparsity
[Figure panels: TLR; HLR, weakly admissible; HLR, strongly admissible]

Complexities of rank-structured factorization
For a square dense matrix of dimension O(N):
• “Straight” LU or LDL^T
  - Operations O(N^3)
  - Storage O(N^2)
• Tile low-rank (Amestoy, Buttari, L’Excellent & Mary, SISC, 2016)*
  - Operations O(k^{0.5} N^2)
  - Storage O(k^{0.5} N^{1.5})
  - for uniform blocks with size chosen optimally for max rank k of any compressed block, bounded number of uncompressed blocks per row
• Hierarchically low-rank (Grasedyck & Hackbusch, Computing, 2003)
  - Operations O(k^2 N log^2 N)
  - Storage O(k N)
  - for strong admissibility, where k is max rank of any compressed block

* First reported O(k^{0.5} N^{2.5}), then later O(k^{0.5} N^2) for variant that reorders updates and recompression
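
To get a feel for the gaps between these orders, the sketch below evaluates the leading terms for a sample size. Constants are dropped, the logarithm is assumed to be base 2, and the values N = 2^20 and k = 16 are arbitrary illustrative choices, so only the relative magnitudes are meaningful.

```python
import math


def dense_cost(n):
    """Leading-order cost of a straight dense factorization."""
    return {"ops": n ** 3, "storage": n ** 2}


def tlr_cost(n, k):
    """Leading-order cost of the tile low-rank variant quoted above."""
    return {"ops": math.sqrt(k) * n ** 2, "storage": math.sqrt(k) * n ** 1.5}


def hlr_cost(n, k):
    """Leading-order cost of the hierarchically low-rank variant quoted above."""
    return {"ops": k ** 2 * n * math.log2(n) ** 2, "storage": k * n}


N, K = 2 ** 20, 16
costs = {"dense": dense_cost(N), "TLR": tlr_cost(N, K), "HLR": hlr_cost(N, K)}
```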

Process “on the fly”

H matrix-H matrix multiplication

Fast matvecs ⇒ fast approx inversions with Newton-Schulz
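
The Newton-Schulz iteration builds an approximate inverse from matrix products alone, X ← X(2I − AX). The sketch below applies it to a small dense matrix as a stand-in; in the setting of the slide the products would be hierarchical-matrix operations. The starting guess X0 = A^T / (‖A‖_1 ‖A‖_∞) is a standard choice that is assumed here, and the helper names are illustrative.

```python
def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def newton_schulz_inverse(a, iters=30):
    """Approximate inverse of a nonsingular square matrix via X <- X (2I - A X)."""
    n = len(a)
    norm1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    # starting guess: scaled transpose of A
    x = [[a[j][i] / (norm1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iters):
        ax = matmul(a, x)
        corr = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                for i in range(n)]
        x = matmul(x, corr)
    return x


A = [[4.0, 1.0],
     [2.0, 3.0]]
X = newton_schulz_inverse(A)
```

Each iteration costs two matrix products and squares the residual I − AX, so fast products translate directly into fast approximate inversion.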

Co-design algorithms with hardware

Some “universals” of exascale computing

Architectural imperatives
• Reside “high” on the memory hierarchy, close to the processing elements
• Rely on SIMD/SIMT-amenable batches of tasks at fine scale
• Reduce synchrony in frequency and/or span
• Reduce communication in number and/or volume of messages
• Exploit heterogeneity in processing, memory, and networking elements

Strategies in practice
• Exploit extra memory to reduce communication volume
• Perform extra flops to require fewer global operations
• Use high-order discretizations to manipulate fewer DOFs (w/more ops per DOF)
• Adapt floating point precision to output accuracy requirements
• Take more resilience into algorithm space, out of hardware/systems space

Strategies in progress
• Employ dynamic scheduling capabilities, e.g., dynamic runtime systems based on DAGs
• Code to specialized “back-ends” while presenting high-level APIs to general users
• Exploit data sparsity to meet “curse of dimensionality” with “blessing of low rank”
• Process “on the fly” rather than storing all at once (esp. large dense matrices)
• Co-design algorithms with hardware, incl. computing in the network or in memory

Closing haiku

Exascale summits
are brought closer within reach
with insights from math

print c/o Toshi Yoshida

Thank you!

ﺷﻛرا
david.keyes@kaust.edu.sa

