Lecture 11: The Road to Exascale and Legacy Software for Dense Linear Algebra by Dongarra, Jack
University of Arkansas, Fayetteville 
ScholarWorks@UARK 
Mathematical Sciences Spring Lecture Series Mathematical Sciences 
4-6-2021 
Lecture 11: The Road to Exascale and Legacy Software for Dense 
Linear Algebra 
Jack Dongarra 
University of Tennessee 
Follow this and additional works at: https://scholarworks.uark.edu/mascsls 
 Part of the Algebra Commons, Computer and Systems Architecture Commons, Databases and 
Information Systems Commons, Data Storage Systems Commons, Digital Communications and 
Networking Commons, Numerical Analysis and Computation Commons, Numerical Analysis and 
Scientific Computing Commons, Ordinary Differential Equations and Applied Dynamics Commons, and 
the Programming Languages and Compilers Commons 
Citation 
Dongarra, J. (2021). Lecture 11: The Road to Exascale and Legacy Software for Dense Linear Algebra. 
Mathematical Sciences Spring Lecture Series. Retrieved from https://scholarworks.uark.edu/mascsls/10 
This Video is brought to you for free and open access by the Mathematical Sciences at ScholarWorks@UARK. It 
has been accepted for inclusion in Mathematical Sciences Spring Lecture Series by an authorized administrator of 
ScholarWorks@UARK. For more information, please contact ccmiddle@uark.edu. 
4/1/21 1
The Road to Exascale and Legacy 
Software for Dense Linear Algebra
Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
Copy of slides at http://bit.ly/dongarra-arkansas-042021
Dense Linear Algebra
• Common Operations
• A major source of large dense linear systems is problems involving the solution of 
boundary integral equations.
• The price one pays for replacing three dimensions with two is that what started as a 
sparse problem in O(n3) variables is replaced by a dense problem in O(n2).
• Dense systems of linear equations are found in numerous other applications, 
including:
• Airplane wing design;
• Radar cross-section studies;
• Flow around ships and other off-shore constructions;
• Diffusion of solid bodies in a liquid;
• Noise reduction; and
• Diffusion of light through small particles. 2
Ax = b;    min
x
|| Ax − b ||;    Ax = λx
4/1/21
Existing Math Software - Dense LA
http://www.netlib.org/utk/people/JackDongarra/la-sw.html
¨ LINPACK, EISPACK, LAPACK, ScaLAPACK
ØPLASMA, MAGMA 34/1/21
DLA Solvers
• We are interested in developing Dense Linear 
Algebra Solvers




Over the Past 50 Years Evolving SW and Alg
Tracking Hardware Developments
Software/Algorithms follow hardware evolution in time
EISPACK (1970’s) 
(Translation of Algol to F66)
Rely on













- PBLAS Mess Passing
PLASMA / MAGMA (2010’s)
(Many-core friendly & GPUs)
Rely on 
- DAG/scheduler
- block data layout
SLATE (2020’s)
(DM and Heterogeneous arch)
Rely on  C++
- Tasking DAG scheduling
- Tiling, but tiles can come from anywhere
- Heterogeneous HW, Batched dispatch
6
What do we mean by performance?
¨ What is the unit: floating point operations per second (flop/s)?
Ø flop/s is a rate of execution, some number of floating point operations per second. 
Ø Whenever this term is used it will refer to 64 bit floating point operations and the 
operations will be either addition or multiplication. 
Ø Tflop/s refers to trillions (1012) of floating point operations per second
Ø Pflop/s refers to 1015 floating point operations per second.
Ø Eflop/s is 1018 floating point operations per second.
¨ What is the theoretical peak performance?
Ø The theoretical peak is based not on an actual performance from a benchmark run, 
but on a paper computation to determine the theoretical peak rate of execution of 
floating point operations for the machine. 
Ø The theoretical peak performance is determined by counting the number of floating-
point additions and multiplications (in 64-bit precision) that can be completed during 
a period of time, usually the cycle time of the machine. 
Ø For example, an Intel Skylake core at 2.1 GHz can complete 32 floating point 
operations per cycle or a theoretical peak performance per core of:
32 fl.pt. ops / cycle * 2.1 G-cycles / second = 67.2 Gflop/s 
Ø With 24 cores per socket: 24*67.2 Gflop/s or 1.61 Tflop/s for the socket.
Peak Performance - Per Core
Floating point operations per cycle per core
Ø Most of the recent computers have FMA (Fused multiple add): 
(i.e. x ←x + y*z in one cycle)
Ø Intel Xeon earlier models and AMD Opteron have SSE2
Ø 2 flops/cycle/core DP & 4 flops/cycle/core SP
Ø Intel Xeon Nehalem (2009) & Westmere (2010) have SSE4
Ø 4 flops/cycle/core DP & 8 flops/cycle/core SP
Ø Intel Xeon Sandy Bridge(2011) & Ivy Bridge (2012) have AVX (vector instructions) 
Ø 8 flops/cycle/core DP & 16 flops/cycle/core SP
Ø Intel Xeon Haswell (2013) & Broadwell (2014) AVX2
Ø 16 flops/cycle/core DP & 32 flops/cycle/core SP
Ø Xeon Phi (per core) is at 16 flops/cycle DP & 32 flops/cycle SP
Ø Intel Xeon Skylake (server)  & KNL AVX 512 
Ø 32 flops/cycle/core DP & 64 flops/cycle/core SP
Ø Skylake w/24 cores & Xeon Phi (Knight’s Landing) w/68 cores
Ø Intel Xeon Cascade Lake, Kaby Lake, Coffee Lake, Ice Lake…
Ø 32 flops/cycle/core DP & 64 flops/cycle/core SP
Ø Next Gen Sapphire Rapids, with AMX (matrix instructions) 




Commodity Processors … 
Over provisioned for floating point operations
Today it’s all about data movement
Each Core: 32 Flops per core / cycle
With 2.6 GHz 
(32 flops/cycle*2.6 Gcycles/sec = 83.2 Gflop/s)
Each Core Peak DP 83.2 Gflop/s
Each Socket (8 cores) Peak 665.6 Gflop/s
Memory Access Latencies in Clock Cycles
167 cycles to move a word from memory to a register
In 167 cycles single core: 5344 DP Flops, socket: >40K Flops
Need Cache Friendly Algorithms
Matrix Multiply and Data Reuse
Memory transfer







The model IS simplified (see next slide) but it provides an upper bound on 
performance as well. I.e., we will never go faster than what the model predicts.     
(And, of course, we can go slower … )
( Omitting latency here. )
56 GFLOP/sec/core x 2 cores
Intel iCore7 4850HQ
Haswell
Cycle time = 2.3 GHz
Turbo Boost = 3.5 GHz
3.5 GHz*16 flops/cycle = 







for ( j = 0; j < n; j++)
y[i] += a * x[i];
(without increment)
alpha = 0e+00;
for ( j = 0; j < n; j++)










Note: It is reasonable to expect the one loop codes shown here to perform as well as 
their Level 1 BLAS counterpart (on multicore with an OpenMP pragma for example).
The true gain these days with using the BLAS is (1) Level 3 BLAS, and (2) portability.
• Take two double precision vectors x and y of size n=375,000.
• Data size: 
– ( 375,000 double ) * ( 8 Bytes / double ) = 3 MBytes per vector
( Two vectors fit in cache (6 MBytes). OK.) 
• Time to move the vectors from memory to cache:
– ( 6 MBytes ) / ( 25.6 GBytes/sec ) = 0.23 ms
• Time to perform computation of DOT:




total_time ≥ max ( time_comm , time_comp )
= max ( 0.23ms , 0.01ms ) = 0.23ms
Performance = (2 x 375,000 flops)/.23ms = 3.2 Gflop/s
Performance for DOT ≤ 3.2 Gflop/s
Peak is 56 Gflop/s
We say that the operation is communication 
bounded. No reuse of data.
Level 1, 2 and 3 BLAS
Level 2 BLAS  Matrix-Vector operations
Level 1 BLAS  Matrix-Vector operations














AXPY: 2n READ, n WRITE
DOT:   2n READ
RATIO Fl Pt Ops to Memory Ops: 1:1
2n2 FLOPs
n2 memory references
RATIO Fl Pt Ops to Memory Ops: 2:1
2n3 FLOPs
3n2 memory references
3n2 READ, n2 WRITE
RATIO Fl Pt Ops to Memory Ops: n:2
• Double precision matrix A and vectors x and y of size n=860.
• Data size: 
– ( 8602 + 2*860 double ) * ( 8 Bytes / double ) ~ 6 MBytes
Matrix and two vectors fit in cache (6 MBytes).
• Time to move the data from memory to cache:
– ( 6 MBytes ) / ( 25.6 GBytes/sec ) = 0.23 ms
• Time to perform computation of GEMV:




Matrix - Vector Operations 
total_time ≥ max ( time_comm , time_comp )
= max ( 0.23ms , 0.026ms ) = 0.23ms
Performance = (2 x 8602 flops)/.23ms = 6.4 Gflop/s
Performance for GEMV ≤ 6.4 Gflop/s
Peak is 56 Gflop/s
We say that the operation is communication 
bounded. Very little reuse of data.
!"#$%#&'()"*$%#*+,-*.*/01*2$3%456
• Take two double precision matrices A and B of size n=500.
• Data size: 
– ( 5002 double ) * ( 8 Bytes / double ) = 2 MBytes per matrix
( Three matrices fit in cache (6 MBytes). OK.) 
• Time to move the matrices in cache:
– ( 6 MBytes ) / ( 25.6 GBytes/sec ) = 0.23 ms
• Time to perform computation in GEMM:




total_time ≥ max ( time_comm , time_comp )
= max( 0.23ms , 4.46ms ) = 4.46ms
For this example, communication time is less than 6% of the computation time. 
Performance = (2 x 500 3 flops)/4.5ms = 55.5 Gflop/s
There is a lots of data reuse in a GEMM; 2/3n per data element. Has good 
temporal locality.
If we assume total_time ≈ time_comm +time_comp, we get 
Performance for GEMM ≈ 55.5 Gflop/sec
Performance for DOT ≤ 3.2 Gflop/s
Performance for GEMV ≤ 6.4 Gflop/s

























Level 1, 2 and 3 BLAS
1 core Intel Haswell i7-4850HQ, 2.3 GHz (Turbo Boost at 3.5 GHz); 
Peak = 56 Gflop/s
1 core Intel Haswell i7-4850HQ, 2.3 GHz, Memory: DDR3L-1600MHz
6 MB shared L3 cache, and each core has a private 256 KB L2 and 64 KB L1. 
The theoretical peak per core double precision is 56  Gflop/s per core.




18 cores Intel Xeon Gold 6140, 2.3 GHz (Skylake)
The theoretical peak double precision is 1325 Gflop/s
Compiled with icc and using Intel MKL 2018  
Level 1, 2 and 3 BLAS
18 cores Intel Xeon Gold 6140 (Skylake), 2.3 GHz, Peak DP = 1325 Gflop/s 
2k 4k 6k 8k 10k 12k 14k 16k 18k 20k












dgemm BLAS Level 3
dgemv  BLAS Level 2





C = C + A*B 
y = y + A*x 




• Reuse based on matrices that fit into cache.
• What if you have matrices bigger than cache?
• Break matrices into blocks or tiles that will fit.
4/1/21 20
Issues 
• Reuse based on matrices that fit into cache.
• What if you have matrices bigger than cache?
• Break matrices into blocks or tiles that will fit.
4/1/21 21
LU Factorization in LINPACK (1970’s)
• Factor one column at a time
– i_amax and _scal
• Update each column of trailing matrix, one column at a time
– _axpy
• Level 1 BLAS
• Bulk synchronous
– Single main thread
– Parallel work in BLAS
– “Fork-and-join” model##
• Factor panel of nb columns
– getf2, unblocked BLAS-2 code
• Level 3 BLAS update block-row of U
– trsm
• Level 3 BLAS update trailing matrix
– gemm
– Aimed at machines with cache hierarchy
• Bulk synchronous
#$
The Standard LU Factorization LAPACK
1980’s HPC of the Day: Cache Based SMP
Parallelism in LAPACK




































- block data layout




Scalable Linear Algebra PACKage
• Distributed memory
• Message Passing
– Clusters of SMPs
– Supercomputers
• Dense linear algebra
• Modules
– PBLAS: Parallel BLAS
– BLACS: Basic Linear Algebra Communication Subprograms
26
Parallelism in ScaLAPACK
• Similar to LAPACK
• Bulk-synchronous processing
– separate message passing & compute
• Most flops in gemm update
– 2/3 n3 term
– Can use sequential BLAS,
p x q = # cores
= # MPI processes,
num_threads = 1
– Or multi-threaded BLAS,
p x q = # nodes
= # MPI processes,
num_threads = # cores/node
27
Today’s HPC Environment for Numerical Libraries
• Highly parallel
– Distributed memory
– MPI + Open-MP programming model
• Heterogeneous
– Commodity processors + GPU accelerators
• Simple loop level parallelism too limiting in terms of performance
• Communication between parts very                                      
expensive compared to floating point ops 
• Comparison of operation counts may not reflect time to solution
• Floating point hardware at 64, 32, and 16 bit levels
ORNL Summit, 200 Pflop/s, 4608 nodes 
(node= 2-Power9 chips + 6-Nvidia GPUs)
2.3 x 106 Cores
Yesterday’s HPC
ScaLAPACK
• First released Feb 1995, 25 years old
• Lacks dynamic scheduling, look-ahead panels,
communication avoiding algorithms, ...
• Can’t be adequately retrofitted for accelerators
• Written in archaic language (Fortran 77)
SGI Origin 2000 (ASCI Blue Mountain, 1998)





SLATE: Software for Linear Algebra Targeting Exascale
• Distributed, GPU-accelerated, dense linear algebra library
– Target large HPC machines
– BLAS: matrix multiply (C = AB), etc.
– Linear systems (Ax = b)
• LU, Cholesky, symmetric indefinite
– Least squares (Ax ≈ b)
• QR, LQ
– Eigenvalue (Ax = λx)
– SVD (A = UΣVH)
• Modern replacement for ScaLAPACK





LU (partial pivoting) ✓ ✓
LU, band (pp) ✓ ✓
LU (non-pivoting) ✘ ✓ (new)
Cholesky ✓ ✓
Cholesky, band ✓ ✓ (new)
Symmetric Indefinite (Aasen) ✘ ✓ (CPU only)
Mixed precision ✘ ✓
Inverses (LU, Cholesky) ✓ ✓
ScaLAPACK SLATE
Level 1 PBLAS ✓ ✘ (use Level 3)
Level 2 PBLAS ✓ ✘ (use Level 3)
Level 3 PBLAS ✓ ✓
Matrix norms ✓ ✓
Test matrix generation ✓ ✓ (new)
ScaLAPACK SLATE
QR ✓ ✓
LQ ✓ ✓ (new)
Least squares solver ✓ ✓
ScaLAPACK SLATE
SVD ✓ ✓ values (new)
Symmetric eigenvalues ✓ ✓ values (new)
Generalized symmetric eig. ✓ ✓ values (new)
Polar decomposition (QDWH) ✘ ✓ (new)
Non-symmetric eigenvalues ✘ pieces ✘ (2021–2022)
Basic linear algebra (C = AB, ...)
Linear systems (Ax = b)
Least squares (Ax ≅ b)
SVD, eigenvalues (A = UΣVH, Ax = λx)
All SLATE routines listed are GPU-accelerated,
except symmetric indefinite
(new) since Sep 2019 review
Tile Algorithms: Matrix Decomposition 
32
LAPACK Algorithm (right looking) SLATE: Tile Algorithm
Track dependencies — Directed acyclic graph (DAG)
33









• High utilization of each core
• Scaling to large number of cores
• Synchronization reducing algorithms
• Methodology
• Dynamic DAG scheduling using OpenMP 
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling





Multiply A-1 = L-T L-1
Cholesky A = LLT
Invert L-1 time →









• Initial version with NVIDIA CUDA
• Port to AMD and Intel in progress
• Use BLAS++ as abstraction layer
– cuBLAS backend (done)
– hip/rocBLAS backend (done)
– oneAPI backend (in progress)
• Few CUDA kernels are memory bound
(batched add tiles, scale tiles, norms of tiles)
– Port to HIP using hipify (prototype done)
– Port to DPC++ in progress
– Alternatively, port to OpenMP offload








Many fields are beginning to adopt machine learning to augment modeling and simulation 
methods
Deep Learning Needs Small Matrix Operations
Matrix Multiply is the time consuming part.
Convolution Layers and Fully Connected Layers require matrix multiply
There are many GEMM’s of small matrices, perfectly parallel, can get by with 16-bit floating point
$" 6&%7
Convolution Step















• Define standard API for batched BLAS and LAPACK in 
collaboration with Intel/Nvidia/other users
• Fixed size: most of BLAS and LAPACK released
• Variable size: most of BLAS released
• Variable size: LAPACK in the branch
• Native GPU algorithms (Cholesky, LU, QR) in the branch
• Tiled algorithm using batched routines on tile or LAPACK 
data layout in the branch
• Framework for Deep Neural Network kernels
• CPU, KNL and GPU routines
• FP16 routines in progress
Standard for Batched Computations 
Batched Computations 
• Non-batched computation
• loop over the matrices one by one and compute using multithread (note that, since
matrices are of small sizes there is not enough work for all the cores). So we expect low
performance as well as threads contention might also affect the performance
for (i=0; i<batchcount; i++)
dgemm(…)
There is not enough work
to fulfill all the cores.
Low percentage of the 
resources is us d
Batched Computations 
• Batched computation
• Distribute all the matrices over the available resources by assigning a matrix to each
group of core/TB to operate on it independently
• For very small matrices, assign a matrix/core (CPU) or per TB for GPU
• For medium size a matrix go to a team of cores (CPU) or many TB’s (GPU)
• For large size switch to multithreads classical 1 matrix per round.
Batched_dgemm(…)
Based on the kernel
design that decide the


















0 500 1000 1500 2000 2500 3000 3500 4000
50~1000 matrices of size
Nvidia V100 GPU
Batch cuBLAS DGEMM 
Standard cuBLAS DGEMM

















0 500 1000 1500 2000 2500 3000 3500 4000
50~1000 matrices of size
Nvidia V100 GPU
Batch cuBLAS DGEMM 
Standar  cuBLAS DGE
IEEE 754 Half Precision (16-bit) Floating Pt Standard
A lot of interest driven by “machine learning”
FP16
FP16
Google TPU different then IEEE
bfloat16
1 bit for the sign, 
8 bits for the exponent (same as SP) 
7 bits for the mantissa
Mixed Precision
• Today many precisions to deal with (IEEE Standard)
¨ Note the number range with 









Nvidia Volta peak rates
• 64 bit floating point (FMA): 7.5 Tflop/s
• 32 bit floating point (FMA): 15 Tflop/s
• 16 bit floating point (FMA): 30 Tflop/s
• 16 bit floating point with Tensor core: 120 Tflop/s
07
45




Leveraging Half Precision in HPC on V100
matrix size

























Matrix matrix multiplication GEMM
• dgemm achieve about 6.4 Tflop/s
Study of the Matrix Matrix multiplication kernel on Nvidia V100 
Leveraging Half Precision in HPC on V100
matrix size


























Matrix matrix multiplication GEMM
• dgemm achieve about 6.4 Tflop/s
• sgemm achieve about 14   Tflop/s
Study of the Matrix Matrix multiplication kernel on Nvidia V100 
~2X
Leveraging Half Precision in HPC on V100
Matrix matrix multiplication GEMM
• dgemm achieve about 6.4 Tflop/s
• sgemm achieve about 14   Tflop/s
• hgemm achieve about 27   Tflop/s
matrix size



























Study of the Matrix Matrix multiplication kernel on Nvidia V100 
~4X
Leveraging Half Precision in HPC on V100
Matrix matrix multiplication GEMM
• dgemm achieve about 6.4 Tflop/s
• sgemm achieve about 14   Tflop/s
• hgemm achieve about 27   Tflop/s
• Tensor cores gemm reach about 85 Tflop/s
matrix size




























Study of the Matrix Matrix multiplication kernel on Nvidia V100 
~12X
Leveraging Half Precision in HPC on V100
Matrix matrix multiplication GEMM
• dgemm achieve about 6.4 Tflop/s
• sgemm achieve about 14   Tflop/s
• hgemm achieve about 27   Tflop/s
• Tensor cores gemm reach about 85 Tflop/s
matrix size




























Study of the Matrix Matrix multiplication kernel on Nvidia V100 
Leveraging Half Precision in HPC on V100
• In LU factorization need matrix 
multiple but operations is a 
rank-k update computing the 
Schur complement
m=n
































Study of the rank k update used by the LU factorization algorithm on Nvidia V100
Leveraging Half Precision in HPC on V100
• LU factorization is used to solve a 
linear system Ax=b
A  x = b 
LUx = b



















24 FP16 hgetrf LU factorization Tensor Cores
FP16 hgetrf LU factorization
FP32 sgetrf LU factorization
FP64 dgetrf LU factorization
Study of the LU factorization algorithm on Nvidia V100
3~4X
Iterative refinement for dense systems, Ax = b, can work this way.
L U = lu(A) lower precision O(n3)
x = U\(L\b) lower precision O(n2)
r = b – Ax FP64 precision O(n2)
WHILE || r || not small enough
1. find a correction “z” to adjust x that satisfy Az=r
solving Az=r could be done by either:
Ø z = U\(L\r) Classical Iterative Refinement lower precision O(n2)
Ø GMRes preconditioned by the LU to solve Az=r Iterative Refinement using GMRes lower precision O(n2)
2. x = x + z FP64 precision O(n1)
3. r = b – Ax FP64 precision O(n2)
END
Idea: use low precision to compute the expensive flops (LU O(n3)) and then iteratively refine 
the solution in order to achieve the FP64 arithmetic
Ø Wilkinson, Moler, Stewart, & Higham provide error bound for SP fl pt results when using DP fl pt.
Ø It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
Ø Need the original matrix to compute residual (r) and matrix cannot be too badly conditioned
Leveraging Half Precision in HPC on V100
Requires extra storage, total is 1.5 times normal;
O(n3) work is done in lower precision
O(n2) work is done in high precision
Problems if the matrix is ill-conditioned
Higham and Carson showed can solve the inner problem with iterative method and not infect the solution.
E. Carson & N. Higham, “Accelerating the Solution of 
Linear Systems by Iterative Refinement in Three 
Precisions SIAM J. Sci. Comput., 40(2), A817–A847.
Improving Solution
• z is the correction or (xi+1 – xi)
• Computed in lower precision and then added to the approximate 
solution in higher precision xi + z




Leveraging Half Precision in HPC on V100
Performance Behavior
• solving Ax = b using FP64 LU
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
Matrix size


















Performance of solving Ax=b















Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
Leveraging Half Precision in HPC on V100
Performance Behavior
• solving Ax = b using FP64 LU
• solving Ax = b using FP32 LU and iterative 
refinement to achieve FP64 accuracy
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
Matrix size


















Performance of solving Ax=b















Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
Leveraging Half Precision in HPC on V100
Performance Behavior
• solving Ax = b using FP64 LU
• solving Ax = b using FP32 LU and iterative 
refinement to achieve FP64 accuracy
• solving Ax = b using FP16 LU and iterative 
refinement to achieve FP64 accuracy
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
Matrix size


















Performance of solving Ax=b















Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
Matrix size


















Performance of solving Ax=b















Leveraging Half Precision in HPC on V100
Performance Behavior
• solving Ax = b using FP64 LU
• solving Ax = b using FP32 LU and iterative 
refinement to achieve FP64 accuracy
• solving Ax = b using FP16 LU and iterative 
refinement to achieve FP64 accuracy
• solving Ax = b using FP16 Tensor Cores LU and 
iterative refinement to achieve FP64 accuracy
4X
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
Matrix size



















































Performance of solving Ax=b















Leveraging Half Precision in HPC on V100
Performance Behavior
• solving Ax = b using FP64 LU
• solving Ax = b using FP32 LU and iterative 
refinement to achieve FP64 accuracy
• solving Ax = b using FP16 LU and iterative 
refinement to achieve FP64 accuracy
• solving Ax = b using FP16 Tensor Cores LU and 
iterative refinement to achieve FP64 accuracy
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
Critical Issues at Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms
▪ Break Fork-Join model
• Communication-reducing algorithms
▪ Use methods which have lower bound on communication
• Mixed precision methods (half (16bit), single(32 bit), & double precision (64))
▪ 2x – 10x speed of ops and 2x - 4x speed for data movement
• Autotuning – Performance Debugging
▪ Today’s machines are very complicated, build “smarts” into software to adapt to the 
hardware
• Fault resilient algorithms
▪ Implement algorithms that can recover from failures/bit flips
• Reproducibility of results
▪ Today we can’t guarantee this. We understand the issues, but some of our “colleagues” 










• Data movement critical for performance.
• Algorithm / Software advances follows hardware
▪ And there is “plenty of room at the top”
▪ “There's life in the old dog yet”
62








u PaRSEC (Parallel Runtime Scheduling & Execution Control)
• http://icl.cs.utk.edu/parsec/
Also see: http://www.netlib.org/utk/people/JackDongarra/papers.htm u Collaborating partners
University of Tennessee, Knoxville
University of California, Berkeley
University of Colorado, Denver
Looking for Grad Students and 






















Ø bulk synchronous processing
89
OpenMP tasking
• Added with OpenMP 3.0 (2009)
• Allows parallelization of irregular problems






Leveraging Half Precision in HPC on V100
# iterations










Convergence history for Iterative Refinement using GMRes
FP32->64 IRGM
FP16->64 IRGM
FP16->64 IRGM (Tensor Cores)
Factorization and Inversion of a Million Matrices on GPUs Abdelfattah, Haidar, Tomov and Dongarra
The reason is that such software is optimized to handle large sizes of data with embarrassingly
parallel computational kernels such as matrix multiplication (GEMM). In dense linear algebra,
the idea of blocking enables the use of compute-intensive trailing matrix updates that are well-
suited for GPUs [19]. However, when the sizes are very small, the LAPACK-style blocking cannot
be applied, since the blocking sizes will be very small, leading to memory-bound trailing matrix
updates. We need a di↵erent design mindset in order to develop high performance GPU kernels
to handle such type of workloads.
This paper presents highly optimized GPU kernels for batched LU factorization and matrix
inversion. The kernels can handle millions of matrices of sizes up to 32. We show a step-by-
step methodology, where incremental improvements in the kernel design lead to incremental
performance gains. We justify all of our design choices by showing the performance before
and after every incremental improvement. One of the main advantages of our design is the
elimination of the expensive intermediate row interchanges by delaying them to the end of the
kernel, which leads to a much faster kernel that produces the exact same result of a LAPACK-style
LU-factorization and inversion. The performance results show significant speedups against the
vendor-supplied CUBLAS kernels using a Pascal P100 GPU.
2 ddd
Matrix of size 10240 generated with positive   and clustered singular values,
 i=(1, · · · , 1, 1cond ) and where its condition number is equal to 10
2.
3 Related Work
GPUs are well suited for embarrassingly parallel tasks on large chunks of data, such as matrix
multiplication(GEMM) [13][18]. The nearly optimal performance of GEMM on GPUs allowed the
development of high performance LAPACK algorithms [19][11]. Following a growing interest in
performing matrix computations on large batches of small matrices, an obvious start was to
develop a batched GEMM routine for GPUs. This has already been addressed in literature [7][4],
leading to developments of higher-level batched LAPACK algorithms. For example, Kurzak et
al. [12] introduced batched Cholesky factorization for single precision up to sizes 100 ⇥ 100,
while Dong et al. provided a more generic design [9]. Some contributions also proved the
ability of GPUs to deal with variable size batched workloads [4][3].
Wang et al. [20] introduced FPGA-based parallel LU factorization of large sparse matrices,
where the algorithm is reduced to factorizing many small matrices concurrently. Villa et al. [16]
developed a GPU-based batched LU factorization, which has been used in subsurface transport
simulation, where many chemical and microbiological reactions in a flow path are simulated in
parallel [17]. Batched matrix inversion is also introduced in the context of generating block-
Jacobi preconditioners [6]. The work presented by Haidar et al. [10] provides a batched LU
2
# iterations










Convergence history for Classic Iterative Refinement
FP32->64 IR
FP16->64 IR
FP16->64 IR   (Tensor Cores)
Leveraging Half Precision in HPC on V100
Matrix size


















Performance of solving Ax=b
using FP64 or IR with GMRes to achieve FP64 accuracy
FP64 dgesv
Factorization and Inversion of a Million Matrices on GPUs Abdelfattah, Haidar, Tomov and Dongarra
The reason is that such software is optimized to handle large sizes of data with embarrassingly
parallel computational kernels such as matrix multiplication (GEMM). In dense linear algebra,
the idea of blocking enables the use of compute-intensive trailing matrix updates that are well-
suited for GPUs [19]. However, when the sizes are very small, the LAPACK-style blocking cannot
be applied, since the blocking sizes will be very small, leading to memory-bound trailing matrix
updates. We need a di↵erent design mindset in order to develop high performance GPU kernels
to handle such type of workloads.
This paper presents highly optimized GPU kernels for batched LU fact rization and matrix
inversion. The kernels can handle millions of matrices of sizes up to 32. We show a step-by-
step methodology, where incremental improvements in the kernel design lead to incremental
performance gains. We justify all of our design choices by showing the performance before
and after every incremental improvement. One of the main advantages of our design is the
elimination of the expensive intermediate row interchanges by delaying them to the end of the
kernel, which leads to a much faster kernel that produces the exact same result of a LAPACK-style
LU-factorization and inversion. The performance results show significant speedups against the
vendor-supplied CUBLAS kernels using a Pascal P100 GPU.
2 ddd
Matrices generated with positive   and clustered distribution of its singular values
 i=(1, · · · , 1, 1cond ) and where its condition number is equal to 10
2.
3 Related Work
GPUs are well suited for embarrassingly parallel tasks on large chunks of data, such as matrix
multiplication(GEMM) [13][18]. The nearly optimal performance of GEMM on GPUs allowed the
development of high performance LAPACK algorithms [19][11]. Following a growing interest in
performing matrix computations on large batches of small matrices, an obvious start was to
develop a batched GEMM routine for GPUs. This has already been addressed in literature [7][4],
leading to developments of higher-level batched LAPACK algorithms. For example, Kurzak et
al. [12] introduced batched Cholesky factorization for single precision up to sizes 100 ⇥ 100,
while Dong et al. provided a more generic design [9]. Some contributions also proved the
ability of GPUs to deal with variable size batched workloads [4][3].
Wang et al. [20] introduced FPGA-based parallel LU factorization of large sparse matrices,
where the algorithm is reduced to factorizing many small matrices concurrently. Villa et al. [16]
developed a GPU-based batched LU factorization, which has been used in subsurface transport
simulation, where many chemical and microbiological reactions in a flow path are simulated in
parallel [17]. Batched matrix inversion is also introduced in the context of generating block-
Jacobi preconditioners [6]. The work presented by Haidar et al. [10] provides a batched LU
2
• solving Ax = b using FP64 LU
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
Leveraging Half Precision in HPC on V100
Matrix size


















Performance of solving Ax=b
using FP64 or IR with GMRes to achieve FP64 accuracy
FP64 dgesv
FP32->64 dsgesv
Factorization and Inversion of a Million Matrices on GPUs Abdelfattah, Haidar, Tomov and Dongarra
The reason is that such software is optimized to handle large sizes of data with embarrassingly
parallel computational kernels such as matrix multiplication (GEMM). In dense linear algebra,
the idea of blocking enables the use of compute-intensive trailing matrix updates that are well-
suited for GPUs [19]. However, when the sizes are very small, the LAPACK-style blocking cannot
be applied, since the blocking sizes will be very small, leading to memory-bound trailing matrix
updates. We need a di↵erent design mindset in order to develop high performance GPU kernels
to handle such type of workloads.
This paper presents highly optimized GPU kernels for batched LU fact rization and matrix
inversion. The kernels can handle millions of matrices of sizes up to 32. We show a step-by-
step methodology, where incremental improvements in the kernel design lead to incremental
performance gains. We justify all of our design choices by showing the performance before
and after every incremental improvement. One of the main advantages of our design is the
elimination of the expensive intermediate row interchanges by delaying them to the end of the
kernel, which leads to a much faster kernel that produces the exact same result of a LAPACK-style
LU-factorization and inversion. The performance results show significant speedups against the
vendor-supplied CUBLAS kernels using a Pascal P100 GPU.
2 ddd
Matrices generated with positive   and clustered distribution of its singular values
 i=(1, · · · , 1, 1cond ) and where its condition number is equal to 10
2.
3 Related Work
GPUs are well suited for embarrassingly parallel tasks on large chunks of data, such as matrix
multiplication(GEMM) [13][18]. The nearly optimal performance of GEMM on GPUs allowed the
development of high performance LAPACK algorithms [19][11]. Following a growing interest in
performing matrix computations on large batches of small matrices, an obvious start was to
develop a batched GEMM routine for GPUs. This has already been addressed in literature [7][4],
leading to developments of higher-level batched LAPACK algorithms. For example, Kurzak et
al. [12] introduced batched Cholesky factorization for single precision up to sizes 100 ⇥ 100,
while Dong et al. provided a more generic design [9]. Some contributions also proved the
ability of GPUs to deal with variable size batched workloads [4][3].
Wang et al. [20] introduced FPGA-based parallel LU factorization of large sparse matrices,
where the algorithm is reduced to factorizing many small matrices concurrently. Villa et al. [16]
developed a GPU-based batched LU factorization, which has been used in subsurface transport
simulation, where many chemical and microbiological reactions in a flow path are simulated in
parallel [17]. Batched matrix inversion is also introduced in the context of generating block-
Jacobi preconditioners [6]. The work presented by Haidar et al. [10] provides a batched LU
2
• solving Ax = b using FP64 LU
• s lving Ax = b using FP32 LU and 
iterative refinement to achieve FP64 
accuracy
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
Leveraging Half Precision in HPC on V100
Matrix size


















Performance of solving Ax=b




Factorization and Inversion of a Million Matrices on GPUs Abdelfattah, Haidar, Tomov and Dongarra
The reason is that such software is optimized to handle large sizes of data with embarrassingly
parallel computational kernels such as matrix multiplication (GEMM). In dense linear algebra,
the idea of blocking enables the use of compute-intensive trailing matrix updates that are well-
suited for GPUs [19]. However, when the sizes are very small, the LAPACK-style blocking cannot
be applied, since the blocking sizes will be very small, leading to memory-bound trailing matrix
updates. We need a di↵erent design mindset in order to develop high performance GPU kernels
to handle such type of workloads.
This paper presents highly optimized GPU kernels for batched LU fact rization and matrix
inversion. The kernels can handle millions of matrices of sizes up to 32. We show a step-by-
step methodology, where incremental improvements in the kernel design lead to incremental
performance gains. We justify all of our design choices by showing the performance before
and after every incremental improvement. One of the main advantages of our design is the
elimination of the expensive intermediate row interchanges by delaying them to the end of the
kernel, which leads to a much faster kernel that produces the exact same result of a LAPACK-style
LU-factorization and inversion. The performance results show significant speedups against the
vendor-supplied CUBLAS kernels using a Pascal P100 GPU.
2 ddd
Matrices generated with positive   and clustered distribution of its singular values
 i=(1, · · · , 1, 1cond ) and where its condition number is equal to 10
2.
3 Related Work
GPUs are well suited for embarrassingly parallel tasks on large chunks of data, such as matrix
multiplication(GEMM) [13][18]. The nearly optimal performance of GEMM on GPUs allowed the
development of high performance LAPACK algorithms [19][11]. Following a growing interest in
performing matrix computations on large batches of small matrices, an obvious start was to
develop a batched GEMM routine for GPUs. This has already been addressed in literature [7][4],
leading to developments of higher-level batched LAPACK algorithms. For example, Kurzak et
al. [12] introduced batched Cholesky factorization for single precision up to sizes 100 ⇥ 100,
while Dong et al. provided a more generic design [9]. Some contributions also proved the
ability of GPUs to deal with variable size batched workloads [4][3].
Wang et al. [20] introduced FPGA-based parallel LU factorization of large sparse matrices,
where the algorithm is reduced to factorizing many small matrices concurrently. Villa et al. [16]
developed a GPU-based batched LU factorization, which has been used in subsurface transport
simulation, where many chemical and microbiological reactions in a flow path are simulated in
parallel [17]. Batched matrix inversion is also introduced in the context of generating block-
Jacobi preconditioners [6]. The work presented by Haidar et al. [10] provides a batched LU
2
• solving Ax = b using FP64 LU
• s lving Ax = b using FP32 LU and 
iterative refinement to achieve FP64 
accuracy
• solving Ax = b using FP16 LU and 
iterative refinement to achieve FP64 
accuracy
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
slow converg nce affect 
the performance
Leveraging Half Precision in HPC on V100
Matrix size


















Performance of solving Ax=b





Factorization and Inversion of a Million Matrices on GPUs Abdelfattah, Haidar, Tomov and Dongarra
The reason is that such software is optimized to handle large sizes of data with embarrassingly
parallel computational kernels such as matrix multiplication (GEMM). In dense linear algebra,
the idea of blocking enables the use of compute-intensive trailing matrix updates that are well-
suited for GPUs [19]. However, when the sizes are very small, the LAPACK-style blocking cannot
be applied, since the blocking sizes will be very small, leading to memory-bound trailing matrix
updates. We need a di↵erent design mindset in order to develop high performance GPU kernels
to handle such type of workloads.
This paper presents highly optimized GPU kernels for batched LU fact rization and matrix
inversion. The kernels can handle millions of matrices of sizes up to 32. We show a step-by-
step methodology, where incremental improvements in the kernel design lead to incremental
performance gains. We justify all of our design choices by showing the performance before
and after every incremental improvement. One of the main advantages of our design is the
elimination of t e xpensive intermediate row interchanges by delaying them to the end of the
kernel, which leads to a much faster kernel that produces the exact same result of a LAPACK-style
LU-factorization and inversion. The performance results show significant speedups against the
vendor-supplied CUBLAS kernels using a Pascal P100 GPU.
2 ddd
Matrices generated with positive   and clustered distribution of its singular values
 i=(1, · · · , 1, 1cond ) and where its condition number is equal to 10
2.
3 Related Work
GPUs are well suited for embarrassingly parallel tasks on large chunks of data, such as matrix
multiplication(GEMM) [13][18]. The nearly optimal performance of GEMM on GPUs allowed the
development of high performance LAPACK algorithms [19][11]. Following a growing interest in
performing matrix computations on large batches of small matrices, an obvious start was to
develop a batched GEMM routine for GPUs. This has already been addressed in literature [7][4],
leading to developments of higher-level batched LAPACK algorithms. For example, Kurzak et
al. [12] introduced batched Cholesky factorization for single precision up to sizes 100 ⇥ 100,
while Dong et al. provided a more generic design [9]. Some contributions also proved the
ability of GPUs to deal with variable size batched workloads [4][3].
Wang et al. [20] introduced FPGA-based parallel LU factorization of large sparse matrices,
where the algorithm is reduced to factorizing many small matrices concurrently. Villa et al. [16]
developed a GPU-based batched LU factorization, which has been used in subsurface transport
simulation, where many chemical and microbiological reactions in a flow path are simulated in
parallel [17]. Batched matrix inversion is also introduced in the context of generating block-
Jacobi preconditioners [6]. The work presented by Haidar et al. [10] provides a batched LU
2
• solving Ax = b using FP64 LU
• s lving Ax = b using FP32 LU and 
iterative refinement to achieve FP64 
accuracy
• solving Ax = b using FP16 LU and 
iterative refinement to achieve FP64 
accuracy
• solving Ax = b using FP16 Tensor Cores 
LU and iterative refinement to achieve 
FP64 accuracy
Flops = 2n3/(3 time) 
meaning twice higher is twice faster
4X
Time (sec)















































Solving Ax=b on Nvidia V100
FP64 solver dgesv
CPU: 10 cores E5-2650 v3
GPU: Nvidia V100
Leveraging Half Precision in HPC
Power awareness
CPU Intel Xeon E5-2650 v3 (Haswell)2x10 cores @ 2.30 GHz V100
NVIDIA Volta GPU
80 MP x 64 @ 1.38 GHz 
Power is for GPU + CPU + DRAM
Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
• Power consumption of the FP64 algorithm to 
solve Ax=b for a matrix of size 34K, it achieve 
5.5 Tflop/s and requires about 2021 joules 
providing about 14 Gflops/Watts.
Leveraging Half Precision in HPC
Power awareness
Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
• Power consumption of the FP64 algorithm to 
solve Ax=b for a matrix of size 34K, it achieve 
5.5 Tflop/s and requires about 2021 joules 
providing about 14 Gflops/Watts.
• Power consumption of the mixed precision 
FP32à64 algorithm to solve Ax=b for a 
matrix of size 34K, it achieve 10.7 Tflop/s and 
requires about 1041 joules providing about  
30 Gflops/Watts.
Mixed precision techniques can provide 
a large gain in energy efficiency
Time (sec)


















































Solving Ax=b on Nvidia V100
FP64 solver dgesv
FP32 --> 64 solver dsgesv
CPU: 10 cores E5-2650 v3
GPU: Nvidia V100
Iterative refinement 
Leveraging Half Precision in HPC
Power awareness
Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
• Power consumption of the FP64 algorithm to 
solve Ax=b for a matrix of size 34K, it achieve 
5.5 Tflop/s and requires about 2021 joules 
providing about 14 Gflops/Watts.
• Power consumption of the mixed precision 
FP32à64 algorithm to solve Ax=b for a 
matrix of size 34K, it achieve 10.7 Tflop/s and 
requires about 1041 joules providing about  
30 Gflops/Watts.
• Power consumption of the mixed precision 
FP16à64 algorithm to solve Ax=b for a 
matrix of size 34K, it achieve 16.8 Tflop/s and 
requires about 609 joules providing about     
48 Gflops/Watts. 
Mixed precision techniques can provide 
a large gain in energy efficiency
Time (sec)





















































Solving Ax=b on Nvidia V100
FP64 solver dgesv
FP32 --> 64 solver dsgesv
FP16 --> 64 solver dhgesv
CPU: 10 cores E5-2650 v3
GPU: Nvidia V100
Leveraging Half Precision in HPC
Power awareness
Problem generated with an arithmetic distribution of the singular values                                          and positive eigenvalues.
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers ScalA17, November 12–17, 2017, Denver, CO, USA
Matrix size



























































(a) matrix with diagonal dominant.
Matrix size






























































(b) matrix with positive l and where si is random number between 1cond ,1 such that
their logarithms are uniformly distributed.
Matrix size





























































(c) matrix with positive l and with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size

























































(d) matrix with clustered singular values, si=(1, · · · , 1, 1cond )
Matrix size





























































(e) matrix with positive eigenvalues and arithmetic distribution of its singular values


















































40 40 40 40 40 40 40
(f) matrix with arithmetic distribution of its singular values si = 1  ( i 1n 1 ) (1 
1
cond ).
Figure 4: Performance in Tflop/s for the three linear solver (dgesv,dsgesv,dhgesv) and the required number of iterations to achieve
FP64 arithmetic solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrices size
and different type of matrices. Note that cond = 102 for these experiments.
• Power consumption of the FP64 algorithm to 
solve Ax=b for a matrix of size 34K, it achieve 
5.5 Tflop/s and requires about 2021 joules 
providing about 14 Gflops/Watts.
• Power consumption of the mixed precision 
FP32à64 algorithm to solve Ax=b for a 
matrix of size 34K, it achieve 10.7 Tflop/s and 
requires about 1041 joules providing about  
30 Gflops/Watts.
• Power consumption of the mixed precision 
FP16à64 algorithm to solve Ax=b for a 
matrix of size 34K, it achieve 16.8 Tflop/s and 
requires about 609 joules providing about     
48 Gflops/Watts. 
• Power consumption of the mixed precision 
FP16à64 TC algorithm using Tensor Cores 
to solve Ax=b for a matrix of size 34K, it 
achieve 24 Tflop/s and requires about 470 
joules providing about 74 Gflops/Watts. 
Mixed precision techniques can provide 
a large gain in energy efficiency
Time (sec)
























































Solving Ax=b on Nvidia V100
FP64 solver dgesv
FP32 --> 64 solver dsgesv
FP16 --> 64 solver dhgesv
FP16 --> 64 solver dhgesv (TC)




Critical Issues at Peta & Exascale for 
Algorithm and Software Design
• Synchronization-reducing algorithms
§ Break Fork-Join model
• Communication-reducing algorithms
§ Use methods which have lower bound on communication
• Mixed precision methods
§ 2x speed of ops and 2x speed for data movement
§ Now we have 16 bit floating point as well
• Autotuning
§ Today’s machines are too complicated, build “smarts” into software to adapt to 
the hardware
• Fault resilient algorithms
§ Implement algorithms that can recover from failures/bit flips
• Reproducibility of results
§ Today we can’t guarantee this. We understand the issues, but some of our 
“colleagues” have a hard time with this.
77












University of Tennessee, Knoxville
University of California, Berkeley




• SLATE — distributed dense linear algebra
• CEED — tensor algebra, batched operations
• PEEKS — Krylov methods
• heFFTe — distributed FFT
• PAPI — performance measurement and 
modeling
• ParSEC — distributed tasking for exascale
79
www.icl.utk.edu/jobs
