Accelerating Reduction and Scan Using Tensor Core Units by Dakkak, Abdul et al.
Accelerating Reduction and Scan Using Tensor Core Units
Abdul Dakkak1, Cheng Li1, Isaac Gelado2, Jinjun Xiong3, and Wen-mei Hwu4
dakkak@illinois.edu,cli99@illinois.edu,igelado@nvidia.com,jinjun@us.ibm.com,w-hwu@illinois.edu
1Department of Computer Science , University of Illinois, Urbana-Champaign
2NVIDIA Corporation, Santa Clara, CA, USA
3IBM Thomas J. Watson Research Center , Yorktown Heights, NY
4Department of Electrical and Computer Engineering , University of Illinois, Urbana-Champaign
Abstract
Driven by deep learning, there has been a surge of special-
ized processors for matrix multiplication, referred to as Tensor
Core Units (TCUs). These TCUs are capable of performing
matrix multiplications on small matrices (usually 4× 4 or
16×16) to accelerate the convolutional and recurrent neural
networks in deep learning workloads. In this paper we lever-
age NVIDIA’s TCU to express both reduction and scan with
matrix multiplication and show the benefits — in terms of pro-
gram simplicity, efficiency, and performance. Our algorithm
exercises the NVIDIA TCUs which would otherwise be idle,
achieves 89%− 98% of peak memory copy bandwidth, and
is orders of magnitude faster (up to 100× for reduction and
3× for scan) than state-of-the-art methods for small segment
sizes — common in machine learning and scientific applica-
tions. Our algorithm achieves this while decreasing the power
consumption by up to 22% for reduction and 16% for scan.
1. Introduction
Deep learning’s reliance on matrix-multiplication for com-
pute has driven both research and industry to develop matrix-
multiplication accelerator hardware — collectively called Ten-
sor Core Units (TCUs) in this paper. TCUs come under the
guise of different marketing terms, be it NVIDIA’s Tensor
Cores [18], Google’s Tensor Processing Unit [10], Intel KNL’s
AVX extensions [76], Apple A11’s Neural Engine [2], or
ARM’s Machine Learning Processor [3]. TCUs are designed
to accelerate Multilayer Perceptrons (MLP), Convolutional
Neural Networks (CNN), Recurrent Neural Networks (RNN),
or Deep Neural Network (DNN) in general TCUs vary in im-
plementation [18, 36, 40, 43, 48, 54, 71, 74, 75, 76, 79, 87], and
are prevalent [1, 4, 8, 9, 10, 11, 24, 70] in edge devices, mobile,
and the cloud.
Although TCUs are prevalent and promise increase in per-
formance and/or energy efficiency, they suffer from over spe-
cialization — with only matrix-multiplication operations being
supported. This limits their applicability to general algorithms
and makes them confined to narrowly specialized libraries and
application domains. Take NVIDIA’s Tensor Cores, for exam-
ple. For matrix multiplication, the NVIDIA Volta architecture
achieves a 8× throughput increase — with each Streaming
Multiprocessor (SM) capable of performing 1024 half preci-
sion operations per cycle using the TCUs compared to 128
half precision operations per cycle without the TCUs. This
Load/Store SPU
Registers
Control
Cache
INT FP64 FP32 TCUTCU
Figure 1: Each processing block (subcore) in the NVIDIA Tesla
V100 PCI-E architecture contains 2 TCUs. In total, 640 TCUs
are available — achieving a theoretical peek of 113 TFLOPS.
is enabled by the fact that NVIDIA’s Volta GPUs dedicate a
large portion of the SM processing unit (or subcore) chip area
to TCUs, shown in Figure 1. Currently algorithms other than
general matrix-matrix multiplication (GEMM) do not utilize
the TCUs— resulting in an idle TCU and low chip utilization
for these algorithms.
The objective of the paper is to expand the class of algo-
rithms that can execute on TCUs— enabling the TCU to be
used for non-GEMM kernels. We choose reduction and scan,
since a large body of work [30, 31, 32, 33, 60, 68, 82, 84] has
shown that they are key primitives of data parallel implemen-
tations of radix sort, quicksort, string comparison, lexical
analysis, stream compaction, polynomial evaluation, solving
recurrence equations, and histograms. We formulate a simple
mapping between reduction or scan and TCUs. Then we de-
velop a library of warp-, block-, and grid-level primitives for
reduction and scan that utilize the NVIDIA TCUs, and present
how they can be used in concert to achieve near-ideal per-
formance. Although this paper targets GPUs, the motivation,
methods, and observations are applicable to a wide number of
TCU implementations.
While the formulation is the main objective, we show that
our reduction and scan algorithms is either order of magnitude
faster or rivals the fastest GPU implementation, with much
lower programming complexity. We also show that by leverag-
ing the TCU we free up the general purpose ALUs on the SMs
for tasks that cannot be expressed in terms of TCU operations.
The key contributions of the paper are as follows:
1. We show how to use the TCU to compute both reduction
ar
X
iv
:1
81
1.
09
73
6v
1 
 [c
s.P
F]
  2
4 N
ov
 20
18
and scan — we believe we are the first to formulate these
algorithms in terms of TCUs operations.
2. We measure our algorithms and show that we are orders
of magnitude better than state-of-art algorithms for small
segment sizes — this is common in mathematics (e.g. eval-
uating polynomials), scientific applications (e.g. finite dif-
ference), and machine learning (e.g. batch norm) — and
are comparable to the fastest algorithm for large segments.
3. We find that we are more energy efficient and decrease the
utilization of general purpose ALUs — making them free
for tasks that cannot be represented on TCUs.
4. We describe the current usage and the programmability
of the NVIDIA TCUs and evaluate GEMM on the TCUs
using cuBLAS [15], CUTLASS [17] and the CUDA TCU
API. The results show that TCUs can be profitably used in
a much wider range of libraries.
5. We provide insights into how to relax the current CUDA
TCU programming interface constraints.
This paper is divided as follows: we first describe the cur-
rent NVIDIA TCUs and show the performance of GEMM and
GEMV computation in Section 2. In Section 3, we give a back-
ground of reduction and scan and show the TCU algorithms
for reduction (Section 4) and scan (5). We then compare our
implementation against state-of-the-art in Section 6. Section 7
describes the related work, before we conclude in Section 8.
2. NVIDIA Tensor Cores
A marquee feature of NVIDIA’s Volta (Tesla V100) archi-
tecture is its TCUs— a programmable matrix-multiply-and-
accumulate hardware units, called Tensor Cores1 by NVIDIA.
Figure 1 illustrates the processing block within each SM, with
the V100 containing 80 SMs and each having 4 processing
blocks. In turn, each processing block contains 2 Tensor Cores
— for a total of 640 Tensor Cores on the V100 and achieving a
12× throughput improvement over previous generation Tesla
P100 [22]. This section first describes the current usage of the
NVIDIA TCUs, then details the current TCU API and presents
some evaluation results that motivate our work.
1 #include <mma.h>
2 using namespace nvcuda::wmma;
3 __global__ void dot_wmma_16x16(half *a, half *b, half *c) {
4 fragment<matrix_a, 16, 16, 16, half, col_major> a_frag;
5 fragment<matrix_b, 16, 16, 16, half, row_major> b_frag;
6 fragment<accumulator, 16, 16, 16, half> c_frag;
7 load_matrix_sync(a_frag, a, /* leading dim */ 16);
8 load_matrix_sync(b_frag, b, /* leading dim */ 16);
9 fill_fragment(c_frag, 0.0f);
10 mma_sync(c_frag, a_frag, b_frag, c_frag);
11 store_matrix_sync(c, c_frag, 16, row_major);
12 }
Listing 1: A simple CUDA kernel performing 〈16,16,16〉
matrix multiplication (C = A.B+C) in half precision using
the CUDA WMMA API.
Each Tensor Core provides a 4× 4× 4 tensor processing
array capable of performing the operation D = A.B+C within
1We will use TCU and Tensor Core interchangeably in this paper.
a cycle, where A, B, C and D are 4×4 matrices. The inputs A
and B must be in half precision format while the accumulators,
C and D, can be either single or half precision. Each Tensor
Core can perform 64 (4× 4× 4) FMA operations per cycle.
Using Tensor Cores, each SM can perform 1024 (64×2×8)
floating point operations per cycle since each each FMA con-
sists of two floating point operations and each SM contains
8 Tensor Cores. This is an 8× SM throughput increase com-
pared to Pascal for floating point operations [22].
2.1. Current Usage
Tensor Cores have been only used to accelerate GEMM opera-
tions, most prominently through NVIDIA’s CUDA libraries.
The NVIDIA libraries, such as cuBLAS [15] and cuDNN [16],
require users to opt-in to use the Tensor Cores and both li-
braries accelerate GEMM computation (HGEMM for cuBLAS
and convolution and recurrent neural networks for cuDNN).
NVIDIA also provides the CUTLASS (CUDA Templates for
Linear Algebra Subroutines) [17] library, which is a CUDA
templated library that provides the building block primitives
to write high performance GEMM-like kernels. Deep learning
frameworks such as NVCaffe [52], Caffe2 [5], MXNet [14],
PyTorch [23], TensorFlow [25], and TensorRT [19] leverage
the NVIDIA libraries for DNN training [13] and inference.
2.2. Programming Interface
NVIDIA also provides a CUDA C++ Warp Matrix Multiply
and Accumulate (WMMA) [7] API to program the Tensor
Cores directly. The current WMMA API provides warp-level
matrix operations for matrix load (load_matrix_sync), ma-
trix store (store_matrix_sync), and matrix multiply and
accumulate (mma_sync). These APIs operate on a special data
type fragment, which holds a matrix tile in thread-local reg-
isters. A helper function to fill a matrix fragment with a scalar
constant (fill_fragment) is provided as well. NVIDIA has
not provided any API for explicitly calling the TCU at sub
warp level — neither in the IR nor in the PTX [20, 21].
A load_matrix_sync distributes values of the matrix
across the warp lanes. Threads within a warp utilize multiple
Tensor Cores concurrently to perform the mma_sync operation
— collaborating to compute the DM×N =AM×K .BK×N+CM×N ,
with M, N, K denoting the matrix dimensions. The API im-
poses limitations on the dimensions — requiring the shape
〈M,N,K〉 to be either 〈16,16,16〉, 〈32,8,16〉, or 〈8,32,16〉.
Listing 1 shows a simple CUDA kernel that computes a
〈16,16,16〉 matrix multiplication within a warp using the
WMMA API. Lines 4–6 declare the matrix fragments. The
API supports 3 kinds of matrices — matrix_a (A), matrix_b
(B), and accumulator (C or D) — with each having their
own internal data layout 2 as well as loading, storing, and
computing semantics. Users specify both the data type and
2The mapping between individual matrix elements to their residing
thread(s) is purposely opaque [7] and undocumented. We will discuss how
we alleviate some of the constraints in 6.1.
2
29 210 211 212 213 214
M = N = K (log scale)
0
25
50
75
100
H
al
f
P
re
ci
si
on
T
F
L
O
P
S
WMMA HGEMM
WMMA HGEMM (na¨ıve)
cuBLAS HGEMM w/o TCU
cuBLAS HGEMM w TCU
CUTLASS HGEMM
(a) GEMM with half precision input and half precision output.
29 210 211 212 213 214
M = N = K (log scale)
0
20
40
60
80
M
ix
ed
P
re
ci
si
on
T
F
L
O
P
S WMMA MGEMM
WMMA MGEMM (na¨ıve)
cuBLAS MGEMM w/o TCU
cuBLAS MGEMM w TCU
CUTLASS MGEMM
(b) Mixed precision GEMM with half precision input and single precision output.
Figure 2: General matrix-matrix multiplication (GEMM) performance using Tensor Cores for both half- (2a) and mixed- (2b) pre-
cision on a V100 PCI-E GPU with a clock frequency of 1380 MHz and a 113 TFLOPS peek performance. The inputs are square
matrices with variable 〈M,N,K〉 dimensions. The optimized and naïve WMMA GEMM algorithms are described in the text.
29 210 211 212 213 214
M = N (log scale)
0.0
0.2
0.4
0.6
0.8
T
F
L
O
P
S
WMMA MGEMV using WMMA MGEMM (na¨ıve)
WMMA HGEMV using WMMA HGEMM (na¨ıve)
WMMA MGEMV using cuBLAS MGEMM
WMMA HGEMV using cuBLAS HGEMM
cublasSgemv
Figure 3: General matrix-vector multiplication (GEMV) per-
formance using Tensor Cores on a V100 PCI-E GPU. GEMV
can be implemented in terms of a GEMM (with dimensions
〈M,N,16〉) or calling the GEMV method in CUBLAS (which
currently does not support half precision).
the 〈M,N,K〉 shape of the fragments. For both the A and B
matrix kinds, users specify whether the matrix is in column- or
row-major order. Users specify the stride between rows using
the leading dimension and load the data from either shared or
global memory — Lines 7–8. Line 9 initializes the matrix_c
elements to zero by broadcasting the scalar value 0 into each
index of the fragment. Once the data is loaded, users perform
the matrix multiplication operation — Line 10 — and store
the results — Line 11.
The kernel in Listing 1 can be generalized to implement
GEMM for arbitrary matrix dimensions in a manner similar
to tiled matrix multiplication. For example, a naive imple-
mentation (referred to as WMMA HGEMM (naïve)) assigns a
strip of 16 rows from matrix A and a strip of 16 columns
from matrix B columns to each warp to compute a 16×16 tile
of the output C. Each warp iterates through the A rows and
B columns by loading 16× 16 tiles of A and B from global
memory into the fragments using load_matrix_sync, then
performing mma_sync, and repeating. After the entire rows
of A and columns of B have been consumed, the warp uses
store_matrix_sync to store the accumulated C values into
global memory. An optimized implementation (referred to
as WMMA HGEMM) utilizes persistent threads where each thread
block collaboratively loads multiple 16×16 tiles of matrix A
and B from global memory to shared memory so that some of
the tiles can be re-used across warps. The tiles are then loaded
into fragments and the mma_sync operation is performed.
2.3. GEMM Evaluation
We evaluate the GEMM performance using Tensor Cores on a
Tesla V100 PCI-E GPU with CUDA 9.2.88 through cuBLAS,
CUTLASS (version 0.1.1), and hand written kernels using the
WMMA API. The results are shown in Figure 2.
For half precision GEMM (HGEMM), shown in Figure 2a,
cuBLAS HGEMM with Tensor Cores achieves a maximum
performance of 96.3 TFLOPS — approximately 85% the peak
performance — and over 3.4× that of cuBLAS without the
use of TCUs. For mixed precision GEMM (MGEMM), shown
in Figure 2b, we achieved maximum performance of 85.8
TFLOPS on NVIDIA TCUs using cuBLAS, approximately
76% the peak performance, and is about 6.2× the performance
of cuBLAS without Tensor Cores (the degradation of perfor-
mance compared to HGEMM is due to output bytes count
being twice as large). CUTLAS MGEMM is more performant
that it’s HGEMM, we suspect this is due to optimizations for
mixed precision that are absent for half-precision.
2.4. GEMV Evaluation
The order of magnitude speedup of GEMM with TCU raises
the question: can we formulate other algorithms in terms of
matrix multiplication and also benefit from the TCU? The most
obvious algorithm is matrix-vector multiplication (GEMV).
We implement HGEMV (half precision GEMV) and
MGEMV (mixed-precision GEMV) with HGEMM or
MGEMM call of dimension 〈M,N,(K = 16)〉— with K re-
quired to be at least 16 by the cuBLAS API. This method
wastes at least 15N memory loads and performs 15MN ex-
tra flops. We evaluate our implementations against cuBLAS
SGEMV, since half precision GEMV is not present in
cuBLAS.
Figure 3 shows that even when accounting for both re-
source and computation waste, HGEMV, implemented us-
ing cuBLAS HGEMM with Tensor Cores, outperforms the
cuBLAS SGEMV by at least 2×. Naïve (non-optimized 3)
HGEMV and MGEMV are super imposed atop each other
since the overhead of using mixed-precision is dwarfed by in-
efficient memory access. Both naïve versions still outperform
3A user implicitly performs tiling when utilizing the WMMA API.
3
Warp Level
Block Level
Grid Level
…
…
1 1
2
3
2 2
33
Figure 4: The reduction algorithm is 1 composed of warp-
level reduction that reduces each segment and is used to 2
implement block-level reduction that further reduces each seg-
ment of partially reduced values. The partially reduced values
are reduced across the grid 3 to perform full reduction.
cuBLAS SGEMV for large input sizes.
The WMMA HGEMV implementation saturates at about
900 GFLOPS in Figure 3. This is because each FMA requires
one matrix element and one vector element, each is 2 bytes in
size, to perform the FMA operations. That is we need to fetch
one byte from the global memory into the shared memory for
every floating-point operation performed. Note that there is no
reuse of matrix elements and heavy reuse of vector elements.
Assume that the vectors are mostly loaded from the L2 cache
due to heavy reuse, the 900 GB/s global memory bandwidth of
V100 is the hardware limitation that prevents the any GEMV
implementation to go beyond 900 GFLOPS in half precision.
3. Conventional Reduction and Scan on GPU
The GEMV evaluation shows that the matrix-multiplication
performance of NVIDIA TCUs is high enough to tolerate
resource and computation waste in algorithms. Driven by
this observation, we examine how to formulate two widely
used collectives — reduction and scan — to utilize TCUs.
Both reduction and scan are memory-bandwidth bound so we
expect that even with some resource and compute waste, their
HGEMM-based implementations can beat or match the most
highly tuned traditional implementations.
1 __device__ half warp_reduce(half val) {
2 for (int offset = WARP_SIZE/2; offset > 0; offset /= 2)
3 val += __shfl_down_sync(0xFFFFFFFFU, val, mask);
4 return val;
5 }
6 __device__ half warp_scan(half val) {
7 for (int offset = 1; offset < WARP_SIZE; offset *= 2) {
8 auto n = __shfl_up_sync(0xFFFFFFFFU, val, mask);
9 if (laneid >= offset) val += n;
10 }
11 return val;
12 }
Listing 2: NVIDIA’s recommended warp-level reduction and
scan implementations utilizing shuffle instructions.
Given a vector A= [a1,a2, . . . ,an], the reduction, also called
fold or total, of A is defined by its sum ∑ni=1 ai. Segmented
reduction is defined as reductions on subsets of the input vector.
In a regular segmented reduction, all segments are the same
size4. The scan operation, also called prefix sum, is defined by
the vector [a1,a1+a2, . . . ,∑ni=1 ai]. Segmented scan is defined
similarly to segmented reduction.
State of the art libraries such as CUB [62], MAGMA [26],
OpenMP runtime [38], implement both reduction and scan
in terms of warp-, block-, and device-level collectives, as
illustrated in Figure 4. The building blocks — warp-level
reduction and scan, are commonly implemented using shuffle
instructions [12]. The shuffle implementations, shown in List-
ing 2, allow threads within a warp to share values via registers
without using shared memory or block-level synchronization.
4. TCU Reduction Algorithm
Intuitively, reduction can be represented as a special case of a
matrix multiplication, since
Reduction([a1,a2, . . . ,an]) =
1 1 · · · 1
0 0 · · · 0
...
...
. . .
...
0 0 · · · 0
 .

a1 a2 . . . an
0 0 · · · 0
...
...
. . .
...
0 0 · · · 0

T
=

∑ni=1 ai 0 · · · 0
0 0 · · · 0
...
...
. . .
...
0 0 · · · 0

The challenge is to map generic input sizes onto the fixed
tensor shapes supported by the TCU. For simplicity, this paper
will use the 16× 16 matrix configuration as an example to
implement the segmented reduction. Other configurations can
be used in a similar manner to perform segmented reduction
for multiples of 8 or 32. It is natural to extend our algorithms
to support mixed precision computation.
We use ReductionK to represent a K regular segmented
reduction — partial reductions of the input uniformly par-
titioned into K element subsets. We will also use P as the
matrix which has the first row set to 1 and the rest to 0 —
i.e. (P)r,c =
{
1 if r = 0
0 if r 6= 0 , and the notation X to denote a
matrix where all the elements are the constant value X .
To make our formulation non-WMMA API specific,
we present our algorithms in an API neutral way. In
the following sections, we use LoadTile in place of the
load_matrix_sync which accepts a matrix layout (de-
fault is row-major) and stride (default is 16). We abstract
store_matrix_sync to make it addressable as if it was a
linear array. We will also use the notation A.B+C to denote
the mma_sync operation.
4.1. Warp-level Reduction
We look at warp-level reduction first, since it is the building
block for both block- and grid-level reductions. Even though
using shuffle instructions for warp-level reduction is efficient,
it can be a bottleneck due to its limited throughput. On the
NVIDIA Volta architecture only 32 warp shuffle operations
can be performed within a clock cycle per SM.
4Irregular segmented reduction is implemented in terms of regular seg-
mented reduction, and is elided from this paper.
4
0= AV 0 P
1
1
2
3
Figure 5: The Reduction16 algorithm 1 each warp loads 256
elements into the matrix A in column major order, 2 performs
the TCU operation where the P matrix has ones for the first
row, and then 3 the result, which is in the first row of V , is
stored into the output vector.
This section shows the formulation of warp-level reduction
onto the TCU when the segment size is 16, 256, and multiples
of 16 or 256. Support for arbitrary segment sizes can be
realized either by padding the input with zeros or by masking
through the P matrix. We find that padding introduces minimal
overhead and is required in some cases to maintain the memory
alignment imposed by the TCU API.
Segment Size 16: The Reduction16 algorithm, shown in Al-
gorithm 1 and Figure 5, performs warp-level reduction on 256
elements which represent 16 segments of size 16. The data
is loaded from memory into a column-major order fragment
(matrix A). Each row is then reduced using V = P.A. The
result — first row of V — is stored in the output.
Algorithm 1 The Reduction16 algorithm.
1: Initialize P matrix.
2: idx← global offset
3: A← LoadTile(in[idx . . . idx+256],“colma jor”)
4: V ← P.A+0
5: if laneIdx < 16 then out[ idx16 + laneIdx]←V [laneIdx]
Depending on the TCU API’s flexibility, one can formulate
the Reduction16 algorithm as either P.A where A is in column-
major order or A.P where A is in row-major order. The A.P
formulation is more friendly to loading of A elements since
the input vector elements naturally form the row-major layout
for A. However, there is no performance difference when
using the WMMA API since: (1) the API not expose sub-
fragment operations and (2) the CUDA compiler performs the
transposition through registers. We formulate the algorithm as
P.A to make it amicable for a more relaxed API, since the 16
element output is in linear row-major order and can be written
to consecutive locations in the global memory.
Segment Size 256: For handling the segments of size 256,
one follows a pattern similar to Reduction16. First, all 256
elements are loaded onto the TCU. The rows are reduced using
the same procedure as Reduction16. Finally, we reduce the
first row to get a scalar. The algorithm is shown in Algorithm 2
and is a single iteration of the algorithm illustrated in Figure 6.
Algorithm 2 The Reduction256 algorithm.
1: Initialize P matrix
2: idx← global offset
3: A← LoadTile(in[idx . . . idx+256],“colma jor”)
4: V ← P.A+0
5: V ←V.PT +0
6: if laneIdx = 0 then out[ idx256 ]←V [0]
Segment Size Multiples of 256: With the above Reduction16
and Reduction256 warp-level primitives, we can express seg-
ments that are multiples of either 16 (denoted by 16N) or 256
(denoted by 256N). We will first look at the 256N algorithm,
since it will be used for representing the 16N algorithm.
A naïve way is to implement the 256N segmented reduction
as N-repeated applications of the Reduction256, shown in Fig-
ure 6. While this is correct, it is work inefficient — wasting
one matrix multiplication for each iteration. Instead of per-
forming two reductions in each iteration, we can implement
a work efficient 256N segmented reduction by first reducing
each row of the 16×16 matrix (Reduction16) in each iteration
and use the row of reduced values as an accumulator. In the
final iteration, the final row is reduced into a scalar. Figure 7
illustrates the algorithm.
Segment Size Multiples of 16: Similar to Reduction256, seg-
mented reduction where the segment size is multiples of 16
(16N) can be performed in two ways. The first is a strided
segmented reduction, shown in Figure 8 (for the case where
N = 2). During each iteration i, a warp loads 16 segments
=
0=
AV
R
0
QL
0
1
0 P
s
1
0
PT
1
0
1V
5
2
1
3
4
6
Figure 6: The work-inefficient Reduction256N algorithm 1 ini-
tializes the Q matrix with all zeros and 2 loads the 256 input
elements into a matrix A in column major order. 3 A dot
product V = P.A+ 0 where the P matrix has the first row as
ones and the rest of the values are zeros is performed to re-
duce each row into a scalar. 4 the dot product R=V.PT +Q
reduces the first row into a scalar. 5 If the segmented re-
duction size is equal to the matrix size (i.e. N = 1) or for the
last iteration, then the first element of the R matrix is stored in
the output array, otherwise 6 the first element of R matrix is
used as the first element of the Q matrix and the procedure is
iterated starting from step 2 .
5
(each of length 16) into the matrix A with a stride of 16N, i.e.,
the beginning of each 16-element segment is 16N elements
away from the beginning of the next segment in the original
input vector. The 16 columns of A are then reduced and accu-
mulated into the first row of V . This repeats for N iterations.
This method is simple, works for arbitrary multiple of 16, and
leverages GPU cache for small N. For large N this method
suffers from bad locality, consistently misses the cache, and
results in poor performance.
For N > 256, one can implement Reduction16N in terms
of Reduction256N — shown in Algorithm 3. This makes bet-
ter use of cache locality and reduces uncoalesced memory
accesses. The left over, Reduction(N%16)×16, can be imple-
mented using the strided segmented 16N reduction method.
Algorithm 3 The Coalesced Reduction16N algorithm.
1: Initialize P matrix
2: V ← 0
3: gidx← global offset
4: numSegs← b 16N256 c . Number of 256 segments
5: for i← 0; i < numSegs; i← i+1 do
6: idx← gidx+256i
7: A← LoadTile(in[idx . . . idx+256],“colma jor”)
8: V ← P.A+V
9: end for
10: . . . . Reduce rest of segments using StridedReduction16
11: if laneIdx < 16 then out[ gidx16N + laneIdx]←V [laneIdx]
0= A1V1 P
1
ANPVN = VN-1
1
=R VN PT
1 0
1
2
2
3
4
0
0
0
Figure 7: The work-efficient Reduction256N algorithm 1
loads 256 input elements into matrix Ai in each iteration. It
then 2 performs a matrix multiplication Vi = P.Ai +Vi−1 for
i between 1 and N with V0 = 0. The final vector is reduced
3 by performing the R =VN .PT operation with the 4 result
stored in the output.
= A1V1 0P
1
0
A20P
1
0=V2 V1
1 4 1 4 1 4
2
3
5
6
Figure 8: A strided Reduction16N algorithm for N = 2 1
loads 256 elements where the stride between each row is 16N.
2 We then perform the matrix multiplication V1 = P.A1 and
3 use the V1 matrix as an accumulator for the next itera-
tion where 4 we again load the next 256 elements with the
leading dimension set to 16N. The 5 matrix multiplication
V2 = P.A2 +V1 is performed and 6 the first row is stored in
the output vector.
Algorithm 4 The Block-level Reduction256N algorithm.
1: wpb← warps per block
2: prtls← alloc shared mem[wpb]
3: partial← Reduction256 Nwpb (in)
4: if laneIdx = 0 then prtls[warpIdx]← partial
5: sync threads
6: if warpIdx = 0 then
7: out[blockIdx]← Reductionwpb(prtls)
8: end if
4.2. Block-level Reduction
When the segment size is large, collaborative reduction within
a block becomes profitable. We follow standard practice [62]
to implement block-level reduction, but differ in that we still
use the TCU to perform reduction on the partially reduced
values. Algorithm 4 shows how we use warp-level reduction
to implement the block-level Reduction256N kernel.
4.3. Grid-level Reduction
When the segment size is very large a grid-level reduction
might be needed. A naïve grid-level reduction for a list of
length N involves two kernel launches. The first kernel launch
performs a segmented reduction with the output stored in a
list of partials. A second kernel then reduces the partials
into a scalar. Although it is simple to implement, it achieves
performance that’s on par with the fastest algorithm.
5. TCU Scan Algorithm
Unlike reduction, it might be less intuitive to represent scan as
a matrix multiplication. For a vector V of 256 elements, we
can store it in row-major order within a 16×16 matrix A —
6
with ai, j =V [16( j−1)+ i].
A =

a1,1 a1,2 . . . a1,16
a2,1 a2,2 . . . a2,16
...
...
. . .
...
a16,1 a16,2 . . . a16,16

We notice that a row-wise scan can then be obtained by
multiplying the matrix A with an upper diagonal matrix —
with the values of the upper diagonals being 1 and the rest 0.
RowScan

a1,1 a1,2 . . . a1,16
a2,1 a2,2 . . . a2,16
...
...
. . .
...
a16,1 a16,2 . . . a16,16
= A.U =
A.

1 1 . . . 1
0 1 . . . 1
...
...
. . .
...
0 0 . . . 1
=

a1,1 a1,1+a1,2 . . . ∑16i=1 a1,i
a2,1 a2,1+a2,2 . . . ∑16i=1 a2,i
...
...
. . .
...
a16,1 a16,1+a16,2 . . . ∑16i=1 a16,i

Similarly, to get the scan of each column one can use a
lower diagonal matrix. We use a strict lower diagonal, i.e. the
diagonal is 0, to get an exclusive scan of each column.
ExclusiveColumnScan
 a1,1 a1,2 . . . a1,16... ... . . . ...
a16,1 a16,2 . . . a16,16
= L.A =

0 0 . . . 0
1 0 . . . 0
...
...
. . .
...
1 1 . . . 0
 .A =

0 0 . . . 0
a1,1 a1,2 . . . a1,16
a1,1+a2,1 a1,2+a2,2 . . . a1,16+a2,16
...
...
. . .
...
∑15j=1 a j,1 ∑
15
j=1 a j,2 . . . ∑
15
j=1 a j,16

We use the L.A matrix to create a G matrix where each
element G j,i is the reduction of the jth row of L.A. That
is, all elements in the jth row of G are of the same value
— the sum of the all elements preceding the jth row of A,
i.e. G j,i = ∑ j−1k=1∑
16
i=1 Ak,i. The G matrix can be generated by
multiplying L.A with a matrix with all element values set to 1.
We then add G to the A.U matrix to generate the scan of V —
which is read in linear row-major order.
Scan(V ) = L.A.

1 1 . . . 1
1 1 . . . 1
...
...
. . .
...
1 1 . . . 1
+A.U = G+A.U =

a1,1 a1,1+a1,2 . . . ∑16i=1 a1,i
a2,1+∑16i=1 a1,i a2,1+a2,2+∑
16
i=1 a1,i . . . ∑
2
j=1∑
16
i=1 a j,i
...
...
. . .
...
a16,1+∑15j=1∑
16
i=1 a j,i a16,1+a16,2+∑
15
j=1∑
16
i=1 a j,i . . . ∑
16
j=1∑
16
i=1 a j,i

Throughout this section we will use U to represent the
upper diagonal matrix where the upper diagonal values are
one, and use L to represent the strict lower diagonal matrix
where the values bellow the lower diagonal are one — i.e.
(U)r,c =
{
1 if r >= c
0 if r < c
and (L)r,c =
{
1 if r < c
0 if r >= c
.
A SAU = 10
0
U
3
21
ALA = 0
0
1 L4
1R = L5 AU
7
6
LA
Figure 9: The Scan256N algorithm 1 loads 256 elements from
the input vector into a matrix A and 2 initializes the S matrix
to 0. The 3 AU = A.U +S and 4 LA = L.A+0 matrix mul-
tiplications are performed to compute the prefix sum of each
row and column. 5 A row wise reduction is performed on the
LA and added to the AU matrix. 6 The result R is stored in
the output vector. 7 If the segment size is a multiple of 256,
then the last element of R (position [16,16]) is broadcasted
into the S matrix and the procedure is repeated.
5.1. Warp-level Scan
With the above formulation, we follow a similar structure
to Section 4: first introducing warp-level primitives before
presenting the block- and grid-level primitives. We write
ScanK to represent a K regular segmented scan. Since the
process of building warp-level, block-level, and grid-level
scans from ScanK is very similar to that of reduction, we will
only highlight the key differences.
Segment Size 16: Is the RowScan equation and is illustrated
in steps 1 , 2 , and 3 in Figure 9.
Segment Size 256: Is implemented using 3 matrix multiplica-
tions shown in Figure 9 and presented symbolically above.
Segment Size Multiples of 16: Is similar to strided 16N re-
duction, with the difference being that we broadcast the last
column rather than the reduced value, shown in Algorithm 5.
Segment Size Multiples of 256: Only a small modification
to Scan256 is needed to implement Scan256N and is illustrated
in Figure 9. Algorithm 6 shows that we keep track of the
sum (last element of the R matrix) and broadcast it to the S
matrix after each iteration. The S matrix is then used when
performing subsequent iterations.
5.2. Block-level Scan
Algorithm 7 shows how to perform the scan at the block level.
It first computes the segmented scan using the warp primitives
and stores the reduced values into a partials list. A scan is
performed on the partial list and the values are added to the
intermediate results to get the output.
7
Algorithm 5 The Scan16N algorithm.
1: Initialize U matrix.
2: gidx← global offset
3: S← 0
4: lid← laneIdx
5: for i← 0; i < N; i← i+1 do
6: idx← gidx+16i
7: A← LoadTile(in[idx . . . idx+256N],stride = 16N)
8: R← A.U +S
9: S← Broadcast(LastColumn(R))
10: if lid < 16 then
11: oi← idx+ lid ∗16N
12: out[oi . . .oi+16]← R[16lid . . .16lid+16]
13: end if
14: end for
Algorithm 6 The Scan256N algorithm.
1: Initialize U and L matrices.
2: gidx← global offset
3: S← 0
4: for i← 0; i < N; i← i+1 do
5: idx← gidx+256i
6: A← LoadTile(in[idx . . . idx+256])
7: AU ← A.U +S
8: LA← L.A+0
9: R← LA.1+AU
10: out[idx . . . idx+256]← R
11: S← Broadcast(R[255])
12: end for
Algorithm 7 also exercises the TCU to perform the scan on
the partially reduced values across tiles. On Line 16 we use the
offset of the last row (240) and 256 as the leading dimension
when loading the tile. This loads the last row of R across tiles
into E. Line 17 then performs an exclusive scan on the last
column of the E and stores the results into the list of partials5.
5.3. Grid-level scan
Similar to reduction, the segmented scan is used as a building
block for the grid-level scan. The grid-level scan uses a text
book implementation, scan-then-propagate strategy, and in-
volves 3 kernel calls. The first kernel uses segmented scan and
produces partially reduced values for each block. The second
kernel performs a scan on the partially reduced values. The
third kernel then uniformly adds the partially scanned values
to their corresponding segments.
5 The implementation of LastColumnScan16 is performed by just loading
the last column values into the first row and performing an TCU version of
the exclusive scan algorithm. Formulating the intermediate operation this way
is needed to adhere to the CUDA WMMA API’s byte alignment constraint
for loading fragments.
Algorithm 7 The Block-level Scan256N algorithm.
1: Initialize U and L matrices.
2: gidx← global offset
3: wpb← warps per block . Assumed to be less than 16
4: sout← alloc shared mem[256×16]
5: prtls← alloc shared mem[16] . Partial sums
6: S← 0
7: for i← 0; i < N; i← i+warpsPerBlock do
8: idx← gidx+256(i+warpIdx)
9: A← LoadTile(in[idx . . . idx+256])
10: AU ← A.U +S
11: LA← L.A+0
12: R← LA.1+AU
13: sout[256warpIdx . . .256warpIdx+256]← R
14: sync threads
15: if warpIdx = 0 then
16: E← LoadTile(sout[240 . . .4096],stride = 256)
17: prtls← LastColumnScan16(E) . Exclusive scan
18: end if
19: sync threads
20: for j← 1; j ≤ 256; j← j+warpSize do
21: it← j+ laneIdx
22: val← sout[256warpIdx+ it]+ prtls[warpIdx]
23: out[idx+ it]← val
24: end for
25: S← Broadcast(prtls[15])
26: end for
6. Evaluation
We evaluate our implementation on an Intel Xeon E5-2698
with CentOS 4.3, CUDA Driver 396.26, and CUDA Runtime
9.2.88 installed. We use the Tesla V100-PCIE GPU with
16GB of GPU Memory. The Tesla V100 sports HBM2 with
a theoretical peak bandwidth of 900GB/s or 450 billion half
precision elements per second. All the results below show
the throughput of the algorithms in terms of billions of half
precision elements per second.
We implemented 6 our algorithms as a C++ header li-
brary. The API is designed to be similar to CUB’s high level
API — providing functions such as TCU::SegmentedReduce,
TCU::Reduce, TCU::SegmentedScan, and TCU::Scan. We
employ auto-tuning to select the ideal algorithm, number of
warps (or independent TCU operations) per block, coarsening
factor (the number segments to perform within a warp), and
block dimensions for the user-provided segment size.
6.1. Relaxing the WMMA API Constraints
Constraints arise when using the current WMMA API for
non-GEMM computation. These limitations would not exist
if one is to perform just GEMM computation. The constraints
observed were:
1. Loads or stores must be performed at fragment granularity.
6Accessible at https://www.removed/for_blind_review.
8
2. Loading and storing fragments can only be performed using
global or shared memory; constant memory cannot be used.
3. The memory layout for the matrix kinds are not the same,
making casting between matrix kinds a kluge.
We address these limitations in different ways within our
implementation. For (1) and (2) we use knowledge about
the internal layout of the fragment [53], we implemented
API enhancements tailored to our usage. Listing 3 shows
our API to support extended usage of the matrix_b frag-
ment for both storing a constant matrix into a fragment
(matrix_b_set_upper_triangular) and loading partial
fragments (matrix_b_get_first_column).
using frag_b = fragment<matrix_b, 16, 16, 16, half, row_major>;
__device__ int matrix_b_get_row_idx() {
const int laneIdx = threadIdx.x % warpSize;
return laneIdx&0x10 >> 2 + laneIdx&0x0B;
}
__device__ void matrix_b_set_upper_triangular(frag_b &f) {
#pragma unroll
for (int ii = 0; ii < f.num_elements; ii++)
f.x[ii] = matrix_b_get_row_idx() < ii ? 0.0f : 1.0f;
}
__device__ void matrix_b_get_first_column(half* out, frag_b f) {
const int laneid = threadIdx.x % warpSize;
if (laneid & 0x04) return ; // avoid redundant writes
out[matrix_b_get_row_idx()] = f.x[0];
}
Listing 3: The WMMA API can only perform load/store from
shared or global memory and lacks the ability to fill an TCU
fragment from constant memory or operate on sub-fragments.
This code shows how we enhance the NVIDIA WMMA API, us-
ing knowledge of the fragment layout, to create an upper tri-
angular matrix and get the first column of a fragment for the
matrix_b fragment kind.
Although we can use the layout information to shuffle reg-
isters to address (3), we express the cast in terms of load/s-
tore available through the WMMA API. For example, to cast
a matrix in the matrix_a format to matrix_b format, we
first store the matrix into shared memory and then perform
a load from memory to matrix_b. These extensions can be
leveraged by compiler runtimes to perform casting between
fragment kinds.
We observe that API extensions using fragment layout infor-
mation requires less block synchronization, uses less shared-
memory — resulting in an increased performance by up to 5%
for our algorithms. For brevity we omit these results.
6.2. Optimizing CUB for Half Precision
CUB is a C++ template library that contains multiple al-
gorithms for the collectives. The CUB library contains
fastest [63, 66] implementation for the reduction and scan
collectives and is used by libraries such as Thrust [29] as
well as most deep learning frameworks [5, 14, 23, 25, 68]. We
compare against the latest release of CUB [62] (version 1.8)
and evaluate against different parameters of the collectives.
As an optimization, warp-level shuffle reduction and scan are
Figure 10: We evaluate the segmented reduction for the algo-
rithms presented on different segment sizes (between 16 and
230) for a fixed 230 element list. Through a combination of the
algorithms presented, for the range between 16 and 224 we
are able to achieve throughput within 90% and 98% of ideal
throughput (the theoretical peak is 450 billion half precision
elements per second). The bar on top of the figure shows the
best performing algorithm for each range of segment sizes.
implemented in PTX for integer, float, and double data types,
since NVCC is currently unable to use the shuffle instruction’s
predicate to guard against invalid peers [12,69]. CUB does not
contain these shuffle-based optimizations for half precision.
To make the evaluations fair and more technically meaningful,
we implement these optimization for the half precision data
type in CUB. The modified CUB is used for the evaluation to
provide a more aggressive base of comparison.
6.3. Warp- and Block-level Reduction and Scan
Theoretically our warp-level TCU reduction algorithms re-
quire less than one fourth of the cycles of the warp-level shuf-
fle reduction. For example, consider performing a warp-level
Reduction256: the warp-level reduction show in Listing 2 re-
quires 8 iterations of 32 element reduction to reduce each
segment. The total cycles is therefore 256, since each shuffle
instruction and addition take 4 cycles. On the other hand, our
algorithm performs the reduction using two matrix multipli-
cations or 64 cycles — since each TCU WMMA matrix mul-
tiplication requires 32 cycles. However, reduction is known
to be memory bound, with the ideal performance bounded by
memory copy speed.
We evaluate the TCU segmented reduction algorithms
against cub::DeviceSegmentedReduce::Sum by fixing the
number of input elements and varying the segment size, shown
in Figure 10. When the segment size is less than 256, the 16N
algorithm is used. Its performance then degrades for large
segment sizes due to its strided access pattern resulting in
uncoalesced global memory access. When the segment size is
larger than 256, the 256N algorithm is used, but again suffers
from performance degradation after segment size 215 due to
low occupancy. When the segment size is large — greater
than 215, the block-level 256N reduction is used. Our TCU
implementation achieves more than 90% of the peak through-
put consistently for variable segment size and is always better
than the CUB implementation.
When segment size is large and the number of segments is
small, the performance of both CUB and our implementation
9
16 32 64 128 256 512 1024 2048 4096 8192
0
100
200
300
400
B
ill
io
ns
of
E
le
m
en
ts
/Sec CUB Warp
CUB Block
CUB Device
Our 16N
Our 256N
1
16 32 64 128 256 512 1024 2048 4096 8192
0
50
100
150
200
250
300
B
ill
io
ns
of
E
le
m
en
ts
/Sec CUB Warp
CUB Block
Our 16N
Our 256N
2
Figure 11: Segmented 1 reduction and 2 scan are evaluated in terms of billions of half-precision elements per second (y-axis)
for segment sizes between 24 and 213 (x-axis). The best configurations for our implementation as well as CUB are selected.
drops. Since each segment gets mapped onto a block, a small
number of segments causes some SMs to be idle. For example
when segment size is 225, both CUB and our implementation
achieve an occupancy of around 0.25 and SM efficiency of
around 40%. A better strategy for these cases would be to
assign multiple thread blocks to collaboratively reduce each
segment when the size of the segment is very large. This
can be achieved using CUDA 9’s cooperative groups [6].
Such optimization is outside the focus of this paper, which
is to propose and evaluate TCU-based reduction and scan
primitives.
Our TCU implementation largely outperforms CUB’s
device-wide segmented reduction for different segment
size. Through profiling, we identified the major pre-
dictors of performance to be, in the order of impor-
tance, the number of half-precision floating-point instruc-
tions (inst_fp_16 in NVProf metric), warp instructions
(inst_inter_thread_communication), and integer in-
structions (inst_integer). Consistently, our implementa-
tion’s half-precision instructions is approximately equal to the
number of total elements (231) while CUB requires a much
larger number. Moreover CUB requires large number of in-
teger and warp shuffle instructions while our implementation
uses no warp shuffle instructions and a smaller number of in-
teger instructions. This contributes to the 100× performance
improvement observed for segment size 16.
24 27 210 213 216 219 222 225 228
Segment Size (log scale)
0
100
200
B
ill
io
ns
 o
f E
le
m
en
ts
 / 
Se
c
Thrust
Our 16N
Our 256N
Our 256N Block
Figure 12: We evaluate the segmented scan for the algorithms
presented on different segment sizes for a fixed 231 element
list. Through a combination of the algorithms presented, for
the range between 16 and 219 we are able to achieve through-
put within 89% and 97% of ideal throughput (the theoretical
peak is 225 billion half precision elements per second).
We examined the power consumption by measuring the av-
erage power within the execution phase of the kernel using
NVVP. Our implementation uses 7.4−22.3% less power com-
pared to CUB across different segment sizes. Again, this is
because of the efficient use of the FP16 and INT ALUs as
well as better SM and DRAM utilization. We note that our
algorithm leaves the ALUs (other than the TCU) idle, allowing
less contention on these units.
CUB provides a cub::WarpReduce, applicable for seg-
ment sizes 16 and 32, to compute a parallel reduction of ele-
ments within a warp. CUB also provides cub::BlockReduce
to perform reduction within a block. These two primitives
require users to partition the data and construct the kernel.
Since CUB’s device-wide segmented reduction does not per-
form well for segment size smaller then 213, we evaluate
our TCU implementations against cub::WarpReduce and
cub::BlockReduce implementations, shown in Figure 11.
The cub::WarpReduce implementation is tunable on block
size. The cub::BlockReduce implementation is tunable on
block size, thread coarsening factor, and reduction algorithms.
We compare our implementation to the best CUB implemen-
tation. We find that our TCU implementations is still faster
for segment size smaller than 1024, and is comparable to
cub::BlockReduce for the other cases.
We evaluate TCU segmented scan algorithms against
Thrust’s implementation (inclusive_scan_by_key), since
CUB has no user visible API for segmented scan. The Thrust
implementation utilizes CUB’s internal warp- and block-level
scan to implement the scan-by-key operation. We use different
segment sizes with a fixed number of input elements — the re-
sults are shown in Figure 12. Thrust, consistent with previous
observation [42], has constant performance irrespective of the
segment size. Our scan TCU implementations achieve more
than 89% of the peak throughput and is 3× faster than thrust
for small segment sizes. We observe lower power consump-
tion compared to Thrust — observing it to be either equivalent
in power usage or up to 17% less. Our segmented scan is not
ideal for large segment sizes since, as explained in Section 4,
only a small number of blocks get launched and thus the GPU
is under utilized. This can be remedied using the same strategy
10
218 220 222 224 226 228 230
Number of Elements (log scale)
0
100
200
300
400
B
ill
io
ns
 o
f E
le
m
en
ts
 / 
Se
c
CUB
Thrust
Ours
Figure 13: A full reduction implementation based on the de-
scription in Section 4 achieves performance on par to CUB.
described for reduction.
CUB provides cub::WarpScan to compute a parallel scan
of items partitioned within a warp, and cub::BlockScan
within a block. Similar to reduction, these two primitives re-
quire more programming effort from the users to partition the
data and construct the kernel. We evaluate our TCU segmented
scan against the best cub::WarpScan and cub::BlockScan
implementations, shown in Figure 11. The CUB scan im-
plementations have the same tunable parameters as the CUB
reduction implementations. We can see that our TCU imple-
mentations are still largely faster for small segment size, and
are at least comparable to cub::BlockScan in other cases.
6.4. Grid-level Reduction and Scan
Unlike the warp- and block-level operations, this paper does
not attempt to optimize grid-level operations — opting to use
a naïve implementation for the grid-level collectives. The
naïve implementation involves multiple kernel launches. We
include the evaluation results to show that even our naïve grid-
level implementation achieves performance that is better or
comparable to that of CUB and Thrust.
215 216 217 218 219 220 221
Number of Elements (log scale)
0
50
100
B
ill
io
n
s
of
E
le
m
en
ts
/
S
ec
CUB
Thrust
Ours
Figure 14: A full scan implementation based on the descrip-
tion in Section 5 achieves performance comparable to CUB.
We compare against both CUB and Thrust for full reduc-
tion, Figure 13, and scan, Figure 14. For both cases, our
implementation uses the 256N block-level algorithm. Even
with our naïve grid-level implementation, we are able to mach
the performance of CUB and are considerably faster than the
Thrust implementation. For reduction and scan, the TCU im-
plementation is slightly faster than CUB with large input sizes
being bounded by memory bandwidth and is within 98% (for
reduction) of peek memory copy bandwidth.
For scan, our current unoptimized implementation internally
uses CUB to reduce the partial values (kernel 2 described in
Section 5.3). CUB, however, fails for inputs larger than 221
which causes our implementation to fail. Future optimized
implementations would not use CUB.
7. Related Work
The mapping of some algorithms onto matrix multiplication
has already been well studied [49, 56, 72, 83]. Similarly, both
reduction and scan are well studied from a performance and
application [30, 31, 32, 46, 55, 64, 78] aspect on a wide range
of architectures and have been heavily evaluated across GPU
architectures [39,41, 61, 77, 85]. To our knowledge however,
there has been no attempt at mapping either reduction or scan
in terms of matrix multiplication.
Considerable research has been done on the development
of performance portable reduction and scan kernels [28, 34,
57, 80]. These compilers express the algorithms as system
of alternative building blocks that are then composed and
auto-tuned at compile time for both the target architecture and
the input characteristics. These tools are synergistic with our
technique, since we are able to add our algorithm as another
building block to implement reduction or scan.
Previous work [35, 62, 65, 73, 85] has also shown that opti-
mizations can be made to either avoid or hide the overhead of
multi-kernel launches. These optimizations would enable our
grid-level operations to be competitive for large sizes when
compared to state-of-the-art methods. Other research looked
at specific cases of scan, in [58] the authors look at performing
scan on tuples while minimizing global reads and facilitating
latency hiding. Recently there has been some work in applying
scan and reduction to optimize database queries [37, 47, 81].
Work describing NVIDIA’s Tensor Cores is scarce.
In [53], the authors use microbenchmarks to discern micro-
architectural details of the V100 architecture. In [45, 59] use
half precision and Tensor Cores to implement iterative solvers.
They use half precision along with low quality solvers to
compute the initial conditions and then switch to both higher
precision solvers for subsequent iterations. The authors also
examine the error incurred when using Tensor Cores and half-
precision for HPC workloads. In [27,44,67] the authors devise
schemes to accelerate neural network training with limited loss
of precision using half precision and Tensor Cores.
8. Conclusion
This paper leverages specialized hardware that was developed
to optimize matrix multiplication for deep learning to imple-
ment both reduction and scan. This is in the same vein to early
GPU research which implemented these algorithms in terms
of shaders. We point to directions for future API and archi-
tectural changes to relax some of the TCU constraints such
as loading fragments from constant, extracting single row or
single column, etc. — resulting in a simplified implementation
that achieves more energy efficiency and performance.
In this paper we showed how to map both reduction and
scan onto TCU. We believe we are the first to formulate these
11
algorithms to exercise the TCU. Our implementation achieves
up to 100× speedup on reduction, up to 3× on scan, and
showed that the performance rivals the state of the art imple-
mentation in the worst cases. We observed a 22% less power
consumption for reduction and 16% for scan. We are able
to make use the otherwise idle TCUs— enabling better GPU
utilization for kernels that exercise the general purpose ALUs.
Future work would leverage the techniques described to
map more algorithms and functions onto TCUs. We are specif-
ically interested in transcendental and special functions (such
as Tanh and Log), since the NVIDIA special function units
have been observed to be the bottleneck in HPC applications.
We also want to express neural network layers in terms of
TCUs— using a cuDNN like API. Some layer implementa-
tions and layer fusion opportunities are natural extensions
of our work: such as the computation of variance in batch-
norm [50, 51, 86] or the evaluation of special functions in
activation layers.
References
[1] Amazon EC2 Instances with Up to 8 NVIDIA Tesla V100 GPUs (P3).
https://goo.gl/GyrFWN. Accessed: 2018-7-24.
[2] Apple A11 Bionic. https://www.apple.com/iphone-x/#a11. Ac-
cessed: 2018-7-24.
[3] Arm Machine Learning Processor. https://developer.
arm.com/products/processors/machine-learning/
arm-ml-processor. Accessed: 2018-7-24.
[4] Azure IoT Edge. https://azure.microsoft.com/en-us/
services/iot-edge. Accessed: 2018-7-24.
[5] Caffe2. https://caffe2.ai. Accessed: 2018-7-24.
[6] Cooperative Groups: Flexible CUDA Thread Programming. https://
devblogs.nvidia.com/cooperative-groups/. Accessed: 2018-
7-24.
[7] CUDA C Programming Guide. https://docs.nvidia.com/cuda/
cuda-c-programming-guide/index.html#wmma. Accessed:
2018-7-24.
[8] FPGA Cloud Compute. https://cloud.baidu.com/product/
fpga.html. Accessed: 2018-7-24.
[9] FPGA Cloud server. https://cn.aliyun.com/product/ecs/
fpga. Accessed: 2018-7-24.
[10] Google Cloud TPU. https://cloud.google.com/tpu. Accessed:
2018-7-24.
[11] Google Edge TPU. https://cloud.google.com/edge-tpu. Ac-
cessed: 2018-7-24.
[12] Kepler Shuffle: Tips and Tricks. http://on-demand.
gputechconf.com/gtc/2013/presentations/
S3174-Kepler-Shuffle-Tips-Tricks.pdf. Accessed:
2018-7-24.
[13] Mixed Precision Training. https://docs.nvidia.com/
deeplearning/sdk/mixed-precision-training. Accessed:
2018-7-24.
[14] MXNet. https://mxnet.apache.org. Accessed: 2018-7-24.
[15] NVIDIA cuBLAS. https://developer.nvidia.com/cublas. Ac-
cessed: 2018-7-24.
[16] NVIDIA cuDNN. https://developer.nvidia.com/cudnn. Ac-
cessed: 2018-7-24.
[17] NVIDIA CUTLASS. https://devblogs.nvidia.com/
cutlass-linear-algebra-cuda. Accessed: 2018-7-24.
[18] NVIDIA Tensor Cores. https://www.nvidia.com/en-us/
data-center/tensorcore. Accessed: 2018-7-24.
[19] NVIDIA TensorRT. https://developer.nvidia.com/tensorrt.
Accessed: 2018-7-24.
[20] NVVM IR Specification 1.5. https://docs.
nvidia.com/cuda/nvvm-ir-spec/index.html#
nvvm-intrin-warp-level-matrix. Accessed: 2018-7-24.
[21] Parallel Thread Execution ISA Version 6.2. https:
//docs.nvidia.com/cuda/parallel-thread-execution/
index.html#warp-level-matrix-instructions. Accessed:
2018-7-24.
[22] Programming Tensor Cores in CUDA 9. https://devblogs.
nvidia.com/programming-tensor-cores-cuda-9. Accessed:
2018-7-24.
[23] PyTorch. https://pytorch.org. Accessed: 2018-7-24.
[24] Snapdragon Artificial Intelligence. https://www.qualcomm.com/
snapdragon/artificial-intelligence. Accessed: 2018-7-24.
[25] TensorFlow. https://www.tensorflow.org. Accessed: 2018-7-
24.
[26] Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub
Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire
Tomov. Numerical linear algebra on emerging architectures: The
plasma and magma projects. In Journal of Physics: Conference Series,
volume 180, page 012037. IOP Publishing, 2009.
[27] Andrew Anderson, Servesh Muralidharan, and David Gregg. Effi-
cient multibyte floating point data formats using vectorization. IEEE
Transactions on Computers, 66(12):2081–2096, 2017.
[28] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao,
Alan Edelman, and Saman Amarasinghe. PetaBricks: a language and
compiler for algorithmic choice, volume 44. ACM, 2009.
[29] Nathan Bell and Jared Hoberock. Thrust: A productivity-oriented
library for cuda. In GPU computing gems Jade edition, pages 359–371.
Elsevier, 2011.
[30] Guy E Blelloch. Scans as primitive parallel operations. IEEE Transac-
tions on computers, 38(11):1526–1538, 1989.
[31] Guy E Blelloch. Prefix sums and their applications. Technical re-
port, Technical Report CMU-CS-90-190, School of Computer Science,
Carnegie Mellon University, 1990.
[32] Guy E Blelloch, Michael A Heroux, and Marco Zagha. Segmented
operations for sparse matrix computation on vector multiprocessors.
Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Com-
puter Science, 1993.
[33] Timothy M Chan. More algorithms for all-pairs shortest paths in
weighted graphs. SIAM Journal on Computing, 39(5):2075–2089,
2010.
[34] Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-
Luna, and Wen-mei Hwu. Efficient kernel synthesis for performance
portable programming. In The 49th Annual IEEE/ACM International
Symposium on Microarchitecture, page 12. IEEE Press, 2016.
[35] Gaurav Chaurasia, Jonathan Ragan-Kelley, Sylvain Paris, George Dret-
takis, and Fredo Durand. Compiling high performance recursive filters.
In Proceedings of the 7th Conference on High-Performance Graphics,
pages 85–94. ACM, 2015.
[36] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss:
An energy-efficient reconfigurable accelerator for deep convolutional
neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138,
2017.
[37] Mateus SH Cruz, Yusuke Kozawa, Toshiyuki Amagasa, and Hiroyuki
Kitagawa. Accelerating set similarity joins using gpus. In Transactions
on Large-Scale Data-and Knowledge-Centered Systems XXVIII, pages
1–22. Springer, 2016.
[38] Leonardo Dagum and Ramesh Menon. Openmp: an industry standard
api for shared-memory programming. IEEE computational science
and engineering, 5(1):46–55, 1998.
[39] Yuri Dotsenko, Naga K Govindaraju, Peter-Pike Sloan, Charles Boyd,
and John Manferdelli. Fast scan algorithms on graphics processors. In
Proceedings of the 22nd annual international conference on Supercom-
puting, pages 205–213. ACM, 2008.
[40] Zidong Du, Shaoli Liu, Robert Fasthuber, Tianshi Chen, Paolo Ienne,
Ling Li, Tao Luo, Qi Guo, Xiaobing Feng, Yunji Chen, et al. An
accelerator for high efficient vision processing. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 36(2):227–
240, 2017.
[41] Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran.
Low-level functional gpu programming for parallel algorithms. In
Proceedings of the 5th International Workshop on Functional High-
Performance Computing, pages 31–37. ACM, 2016.
[42] Marco Eilers. Multireduce and multiscan on modern gpus. Department
of Computer Science, University of Copenhagen. Master’s thesis, 2014.
[43] Yao Fu, Ephrem Wu, Ashish Sirasao, Sedny Attia, Kamran Khan, and
Ralph Wittig. Deep learning with int8 optimization on xilinx devices.
White Paper, 2016.
[44] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish
Narayanan. Deep learning with limited numerical precision. In Inter-
national Conference on Machine Learning, pages 1737–1746, 2015.
[45] Azzam Haidar, Panruo Wu, Stanimire Tomov, and Jack Dongarra.
Investigating half precision arithmetic to accelerate dense linear system
solvers. In Proceedings of the 8th Workshop on Latest Advances in
Scalable Algorithms for Large-Scale Systems, page 10. ACM, 2017.
12
[46] Mark Harris, Shubhabrata Sengupta, and John D Owens. Parallel prefix
sum (scan) with cuda. GPU gems, 3(39):851–876, 2007.
[47] Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. Fast seg-
mented sort on gpus. In Proceedings of the International Conference
on Supercomputing, page 12. ACM, 2017.
[48] Miao Hu, John Paul Strachan, Zhiyong Li, Emmanuelle M Grafals,
Noraica Davila, Catherine Graves, Sity Lam, Ning Ge, Jianhua Joshua
Yang, and R Stanley Williams. Dot-product engine for neuromorphic
computing: programming 1t1m crossbar to accelerate matrix-vector
multiplication. In Proceedings of the 53rd annual design automation
conference, page 19. ACM, 2016.
[49] Xiaohan Huang and Victor Y Pan. Fast rectangular matrix multiplica-
tion and applications. Journal of complexity, 14(2):257–299, 1998.
[50] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. arXiv
preprint arXiv:1502.03167, 2015.
[51] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo,
Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu. Highly Scal-
able Deep Learning Training System with Mixed-Precision: Training
ImageNet in Four Minutes. ArXiv e-prints, July 2018.
[52] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev,
Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell.
Caffe: Convolutional architecture for fast feature embedding. arXiv
preprint arXiv:1408.5093, 2014.
[53] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza.
Dissecting the nvidia volta gpu architecture via microbenchmarking.
arXiv preprint arXiv:1804.06826, 2018.
[54] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav
Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden,
Al Borchers, et al. In-datacenter performance analysis of a tensor
processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE
44th Annual International Symposium on, pages 1–12. IEEE, 2017.
[55] Hee-Seok Kim, Shengzhao Wu, Li-wen Chang, and W Hwu Wen-mei.
A scalable tridiagonal solver for gpus. In Parallel Processing (ICPP),
2011 International Conference on, pages 444–453. IEEE, 2011.
[56] Tamara G Kolda and Brett W Bader. Tensor decompositions and
applications. SIAM review, 51(3):455–500, 2009.
[57] Rasmus Wriedt Larsen and Troels Henriksen. Strategies for regular
segmented reductions on gpu. In Proceedings of the 6th ACM SIGPLAN
International Workshop on Functional High-Performance Computing,
pages 42–52. ACM, 2017.
[58] Sepideh Maleki, Annie Yang, and Martin Burtscher. Higher-order and
tuple-based massively-parallel prefix sums, volume 51. ACM, 2016.
[59] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng,
and Jeffrey S Vetter. Nvidia tensor core programmability, performance
& precision. arXiv preprint arXiv:1803.04014, 2018.
[60] Michael D McCool, Arch D Robison, and James Reinders. Structured
parallel programming: patterns for efficient computation. Elsevier,
2012.
[61] Trevor L McDonell, Manuel MT Chakravarty, Gabriele Keller, and
Ben Lippmeier. Optimising purely functional gpu programs. ACM
SIGPLAN Notices, 48(9):49–60, 2013.
[62] D Merrill. CUB v1.8.0: CUDA Unbound, a library of warp-wide,
blockwide, and device-wide GPU parallel primitives. NVIDIA Re-
search, 2018.
[63] Duane Merrill and Michael Garland. Single-pass parallel prefix scan
with decoupled look-back. Technical report, NVIDIA Technical Report
NVR-2016-002, 2016.
[64] Duane Merrill and Andrew Grimshaw. Parallel scan for stream ar-
chitectures. University of Virginia, Department of Computer Science,
Charlottesville, VA, USA, Technical Report CS2009-14, 2009.
[65] Duane G Merrill and Andrew S Grimshaw. Revisiting sorting for
gpgpu stream architectures. In Proceedings of the 19th international
conference on Parallel architectures and compilation techniques, pages
545–546. ACM, 2010.
[66] Bruce Merry. A performance comparison of sort and scan libraries for
gpus. Parallel Processing Letters, 25(04):1550007, 2015.
[67] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos,
Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii
Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training.
In International Conference on Learning Representations, 2018.
[68] Rory Mitchell and Eibe Frank. Accelerating the xgboost algorithm
using gpu computing. PeerJ Computer Science, 3:e127, 2017.
[69] Wilt Nicholas. The cuda handbook: A comprehensive guide to gpu
programming, 2013.
[70] Paavo Pärssinen. Modern mobile graphics processors. Science: Inter-
net, Data and Things (CS-E4000), Spring 2018, page 211.
[71] Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan,
Christos Kozyrakis, and Mark A Horowitz. Convolution engine: bal-
ancing efficiency & flexibility in specialized computing. In ACM
SIGARCH Computer Architecture News, volume 41, pages 24–35.
ACM, 2013.
[72] Stephan Rabanser, Oleksandr Shchur, and Stephan Günnemann. In-
troduction to tensor decompositions and their applications in machine
learning. arXiv preprint arXiv:1711.10781, 2017.
[73] Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly
Barnes, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo
Durand. Halide: decoupling algorithms from schedules for high-
performance image processing. Communications of the ACM,
61(1):106–115, 2017.
[74] Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and
David Brooks. Deep learning for computer architects. Synthesis
Lectures on Computer Architecture, 12(4):1–123, 2017.
[75] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama,
Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-
Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-
accurate deep neural network accelerators. In ACM SIGARCH Com-
puter Architecture News, volume 44, pages 267–278. IEEE Press, 2016.
[76] Andres Rodriguez, Eden Segal, Etay Meiri, Evarist Fomenko,
Young Jim Kim, Haihao Shen, and Barukh Ziv. Lower Numerical
Precision Deep Learning Inference and Training. Technical report,
Intel, January 2018.
[77] Shubhabrata Sengupta, Mark Harris, and Michael Garland. Efficient
parallel scan algorithms for gpus. NVIDIA, Santa Clara, CA, Tech. Rep.
NVR-2008-003, 1(1):1–17, 2008.
[78] Shubhabrata Sengupta, Aaron E Lefohn, and John D Owens. A work-
efficient step-efficient prefix sum algorithm. In Workshop on edge
computing using new commodity architectures, pages 26–27, 2006.
[79] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasub-
ramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and
Vivek Srikumar. Isaac: A convolutional neural network accelerator
with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer
Architecture News, 44(3):14–26, 2016.
[80] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. Lift:
a functional data-parallel ir for high-performance gpu code genera-
tion. In Code Generation and Optimization (CGO), 2017 IEEE/ACM
International Symposium on, pages 74–85. IEEE, 2017.
[81] Hiroki Tokura, Toru Fujita, Koji Nakano, Yasuaki Ito, and Jacir L
Bordim. Almost optimal column-wise prefix-sum computation on the
gpu. The Journal of Supercomputing, 74(4):1510–1521, 2018.
[82] Jean-Baptiste Tristan, Joseph Tassarotti, and Guy Steele. Efficient
training of lda on a gpu by mean-for-mode estimation. In International
Conference on Machine Learning, pages 59–68, 2015.
[83] Charles Van Loan. A survey of matrix computations. Handbooks in
operations research and management science, 3:247–321, 1992.
[84] Xiaolong Xie, Yun Liang, Xiuhong Li, and Wei Tan. Culda_cgs: Solv-
ing large-scale lda problems on gpus. arXiv preprint arXiv:1803.04631,
2018.
[85] Shengen Yan, Guoping Long, and Yunquan Zhang. Streamscan: fast
scan algorithms for gpus without global barrier synchronization. In
ACM SIGPLAN Notices, volume 48, pages 229–238. ACM, 2013.
[86] Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch size to
32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017.
[87] Yuhao Zhu, Matthew Mattina, and Paul Whatmough. Mobile machine
learning hardware at arm: A systems-on-chip (soc) perspective. arXiv
preprint arXiv:1801.06274, 2018.
13
