Data Movement Is All You Need: A Case Study on Optimizing Transformers by Ivanov, Andrei et al.
Data Movement Is All You Need: A Case Study on
Optimizing Transformers
Andrei Ivanov∗, Nikoli Dryden∗, Tal Ben-Nun, Shigang Li, Torsten Hoefler
ETH Zürich
firstname.lastname@inf.ethz.ch
∗ Equal contribution
Abstract—Transformers have become widely used for language
modeling and sequence learning tasks, and are one of the most
important machine learning workloads today. Training one is a
very compute-intensive task, often taking days or weeks, and
significant attention has been given to optimizing transformers.
Despite this, existing implementations do not efficiently utilize
GPUs. We find that data movement is the key bottleneck when
training. Due to Amdahl’s Law and massive improvements in
compute performance, training has now become memory-bound.
Further, existing frameworks use suboptimal data layouts. Using
these insights, we present a recipe for globally optimizing data
movement in transformers. We reduce data movement by up
to 22.91% and overall achieve a 1.30× performance improve-
ment over state-of-the-art frameworks when training BERT.
Our approach is applicable more broadly to optimizing deep
neural networks, and offers insight into how to tackle emerging
performance bottlenecks.
Index Terms—Data movement, high-performance computing,
deep learning, transformers
I. INTRODUCTION
Transformers [1] are a class of deep neural network ar-
chitecture for sequence transduction [2], similar to recurrent
neural networks [3] and LSTMs [4]. They have recently
had a major impact on natural language processing (NLP),
including language modeling [5]–[7], question-answering [8],
translation [1], and many other applications. The significant
improvement in accuracy brought by transformers to NLP
tasks is comparable to the improvement brought to computer
vision by AlexNet [9] and subsequent convolutional neural
networks. Transformers have also begun to be applied to
domains beyond NLP where RNNs would previously have
been used, including speech recognition [10], reinforcement
learning [11], molecular property prediction [12], and sym-
bolic mathematics [13].
Training transformers is very compute-intensive, often tak-
ing days on hundreds of GPUs or TPUs [14]–[20]. Further,
generalization only improves with model size [18], [20]–[22],
number of training samples [16], [20], and total iterations [16],
[23]. These all significantly increase compute requirements.
Indeed, transformers are becoming the dominant task for
machine learning compute where training a model can cost
tens of thousands to millions of dollars and even cause
environmental concerns [24]. These trends will only accelerate
with pushes toward models with tens of billions to trillions
of parameters [22], [25], their corresponding compute re-
quirements [26], and increasing corporate investment towards
challenges such as artificial general intelligence [27]. Thus,
improving transformer performance has been in the focus of
numerous research and industrial groups.
Significant attention has been given to optimizing transform-
ers: local and fixed-window attention [28]–[32], more general
structured sparsity [33], learned sparsity [34]–[36], and other
algorithmic techniques [19], [37] improve the performance of
transformers. Major hardware efforts, such as Tensor Cores
and TPUs [38] have accelerated tensor operations like matrix-
matrix multiplication (MMM), a core transformer operation.
Despite this, existing implementations do not efficiently
utilize GPUs. Even optimized implementations such as Mega-
tron [18] report achieving only 30% of peak GPU flop/s.
We find that the key bottleneck when training transform-
ers is data movement. Improvements in compute performance
have reached the point that, due to Amdahl’s Law and the
acceleration of tensor contractions, training is now memory-
bound. Over a third (37%) of the runtime in a BERT training
iteration is spent in memory-bound operators: While tensor
contractions account for over 99% of the flop performed, they
are only 61% of the runtime. By optimizing these, we show
that the overhead of data movement can be reduced by up
to 22.91%. Further, while MMM is highly tuned by BLAS
libraries and hardware, we also find that existing frameworks
use suboptimal data layouts. Using better layouts enables
us to speed up MMM by up to 52%. Combining these
insights requires moving beyond peephole-style optimizations
and globally optimizing data movement, as selecting a single
layout is insufficient. Overall, we achieve at least 1.30×
performance improvements in training over general-purpose
deep learning frameworks, and 1.08× over DeepSpeed [39],
the state of the art manually-tuned implementation of BERT.
For robustly training BERT [16], this translates to a savings
of over $85,000 on AWS using PyTorch. For the GPT-3
transformer model [40] with a training cost of $12M [41],
our optimizations could save $3.6M and more than 120 MWh
energy. To do this, we develop a recipe for systematically
optimizing data movement in DNN training.
Our approach constructs a dataflow graph for the training
process, which we use to identify operator dependency patterns
and data volume. With this representation, we identify oppor-
tunities for data movement reduction to guide optimization.
We aim to maximize data reuse using various forms of fusion.
Then we select performant data layouts, which is particularly
ar
X
iv
:2
00
7.
00
07
2v
2 
 [c
s.L
G]
  2
 Ju
l 2
02
0
@dace.program
def mha_forward(
wq: dace.float16[P, H, I], bq: dace.float16[P, H],
q: dace.float16[I, B, J],
wk: dace.float16[P, H, I], bk: dace.float16[P, H],
k: dace.float16[I, B, K],
wv: dace.float16[W, H, I], bv: dace.float16[W, H],
v: dace.float16[I, B, K],
wo: dace.float16[W, H, I], bo: dace.float16[I],
scaler: dace.float16
):
qq = np.einsum("phi,ibj->phbj", wq, q) + bq[:,:,None,None]
kk = np.einsum("phi,ibk->phbk", wk, k) + bk[:,:,None,None]
vv = np.einsum("whi,ibk->whbk", wv, v) + bv[:,:,None,None]
beta = scaler * np.einsum("phbk,phbj->hbjk", kk, qq)
alpha = dropout(softmax(beta))
gamma = np.einsum("whbk,hbjk->whbj", vv, alpha)
out = np.einsum("whi,whbj->ibj", wo, gamma)+bo[:,None,None]
return out
(a) Input Code
WV V
whi,ibk->whbk
whbk,hbjk->whbj
whi,whbj->ibj
OUT
WQ Q
phi,ibj->phbj
WK K
phi,ibk->phbk
phbk,phbj->hbjk
softmax [k]
8G 910
4G 102
8G 910
8G 9108G 910
4G 102
160M 2.5 flop
flop / IO
MHA
dropout
bias [ph]bias [ph]
4M 0.54M 0.5
32M 0.5
bias [wh]
4M 0.5
bias [i]
4M 0.5
scaler
WO
IO > flop
IO ≈ flop
IO < flop
Tensor contraction
Normalization
Elementwise
VV
KK QQ
GAMMA
ALPHA
BETA
(b) Resulting dataflow
Fig. 1: Input code and stateful dataflow multigraph (SDFG) for Multi-Head Attention. Throughout the paper, if not stated
otherwise, the values are given for the following set of parameters: P = W = 64, H = 16, I = P ⋅ H = 1024, B = 8,
J = K = 512. P/W: key/value projection size; H: # heads; I: embedding size; B: batch size; J/K: input/output sequence length.
impactful for normalization and tensor contraction operators,
where it provides opportunities for vectorization and different
tiling strategies. The performance data gathered is then used
to find operator configurations that produce a fast end-to-end
optimized implementation of training.
We evaluate these implementations first for multi-head
attention, a core primitive within transformers and one that
has significant applications beyond transformers [42]–[44]. We
then consider the encoder layer from BERT [14], a widely-
used transformer architecture. In each case, we compare
against existing highly optimized implementations to provide
strong baselines. Using this recipe, we are able to demonstrate
significant performance improvements in both microbench-
marks and end-to-end training, outperforming PyTorch [45],
TensorFlow+XLA [46], cuDNN [47], and DeepSpeed [39].
While we focus our work on particular transformer models,
our approach is generally applicable to other DNN models and
architectures. We summarize our contributions as follows:
• We find transformer training to be memory-bound and
significantly underperforming on GPUs.
• We develop a generic recipe for optimizing training using
dataflow analyses.
• Using this recipe, we systematically explore the performance
of operators and the benefits of different optimizations.
• We demonstrate significant performance improvements, re-
ducing data movement overheads by up to 22.91% over
existing implementations, and achieving at least 1.08×
performance improvements over specialized libraries, and
1.30× over general-purpose frameworks.
• We make our code available at https://github.com/spcl/
substation.
II. BACKGROUND
Here we provide a brief overview of our terminology,
transformers, and data-centric programming. We assume the
reader is generally familiar with training deep neural networks
(see Goodfellow et al. [48] for an overview).
A. Training Deep Neural Networks
We use the standard mini-batch data-parallel approach to
training, wherein a mini-batch of samples is partitioned among
many GPUs. During backpropagation, we distinguish between
two stages: computing the gradients with respect to a layer’s
input (dX), and computing the gradients with respect to the
layer’s parameters (dW ). Note that the second stage is relevant
only for layers with learnable parameters.
B. Transformers
The transformer architecture [1], originally developed for
machine translation, is a neural network architecture for se-
quence transduction, or transforming an input sequence into
an output sequence. Transformers build upon a long sequence
of work within natural language processing, most relevantly
beginning with word embeddings [49]–[51], neural machine
translation [52], [53], and sequence-to-sequence learning [54].
A key component is attention [28], [29], which enables a
model to learn to focus on particular parts of a sequence.
The transformer makes two key contributions. First, it
generalizes attention to multi-head attention, which we discuss
below. Second, the transformer neglects recurrent or con-
volutional mechanisms for processing sequences, and relies
entirely on attention. Critically, this enables significantly more
parallelism during training, as the model can process every
element of a sequence simultaneously, instead of having a
serial dependence on the prior element.
1) Multi-head Attention: Multi-head attention (MHA) gen-
eralizes attention, and uses h attention “heads” in parallel
to attend to different learned projections of a sequence. We
provide Python code and an illustration of MHA forward
propagation in Fig. 1.
Linear+Bias
Dropout
LayerNorm
MHA
ReLU
X
Y
ReLU dX
MHA back
Linear+Bias dXLinear+Bias dW
LayerNorm dXLayerNorm dW
Dropout dX
dY
dX
Dropout
Linear+Bias
Dropout
LayerNorm
Dropout dX
Linear+Bias dXLinear+Bias dW
LayerNorm dXLayerNorm dW
Dropout dX
ln1 dW
linear1 dW
linear2 dW
ln2 dW
1/3
3.5
585
1/2
1/3
1024
1/3
2.33 2 3
1/3
1/3
1/3
1170
819
1365
1365
1/3
2 3
239 26243G
4M
29M
34G
16M
16M
34G
4M
29M 17M 38M
4M
34G 34G
16M
16M
34G34G
17M 38M
4M
86G
flop
flop / IO
IO > flop
IO ≈ flop
IO < flop
1/34M
4M 1/3
4M
4M 1/3
1/3
Tensor contraction
Normalization
Elementwise
Fig. 2: Forward and backpropagation for a BERT-large encoder
layer, and the ratio of flop to memory accesses (words) when
training on a batch B = 8 and sequence length J = K = 512.
Each attention head is an instance of scaled dot-product
attention, and takes three input tensors: queries (q), keys (k),
and values (v). Conceptually, attention finds values corre-
sponding to the keys closest to the input queries. Heads are
also augmented with learned linear layers that project their
inputs into a lower-dimensional space. The three inputs are
first multiplied by weight tensors wq, wk, wv, respectively,
as a learned input projection (we use Einstein-notation sums,
or einsums, for tensor contractions). The query and key
tensors are subsequently multiplied together and scaled (stored
in beta), followed by applying the softmax operation in order
to weight and select the most important results. This is then
multiplied with vv to produce the per-head output (gamma).
The outputs of all the heads are finally concatenated and
linearly projected back to the input dimension size (i).
The respective dataflow graph (Fig. 1b) immediately ex-
poses coarse- (whole graph) and fine-grained (within rectan-
gular nodes) parallelism, as well as data reuse. As every edge
represents exact data movement, their characteristics (access
sets and movement volume) can be inspected for guided
bottlenecks and potential solution analysis.
There are three broad classes of MHA, distinguished by
their inputs. General attention uses distinct tensors as the
queries, keys, and values. Encoder/decoder attention uses the
same tensor for both keys and values (typically produced by an
encoder layer). Self-attention uses the same tensor for all three
inputs. MHA may also have a masking step, which is used
during training to prevent a model from “seeing the future”
and using information from a later part of a sequence.
2) Transformer Architectures: BERT [14] is a widely-used
transformer for NLP tasks. Fig. 2 illustrates the forward and
backward pass for a single BERT encoder layer. The layer
essentially consists of MHA (as self-attention) followed by
a feed-forward neural network. The feed-forward network
consists of two linear layers with bias and ReLU activations
after the first layer. Dropout [55], layer normalization [56],
and residual connections [57] are also used.
Transformers also incorporate several other layers that we
will not discuss in detail: embedding layers for input sequences
and various output layers, depending on the task. Other
transformer architectures, such as the original Transformer and
GPT-2/3 [21], [40] have very similar architectures.
C. Data-Centric Programming
As DNN processing is among the most popular compute-
intensive applications today, considerable efforts have been
made to optimize its core operators [58]. This has driven
the field to the point where optimization today is almost
exclusively performed beyond the individual operator, either
on the whole network [59], [60] or repeated modules.
Performance optimization on modern architectures consists
of mutations to the original code, sometimes algorithmic
[61]–[63] but mostly relating to hardware-specific mapping
of computations and caching schemes. This includes tiling
computations for specific memory hierarchies, using special-
ized units (e.g., Tensor Cores) for bulk-processing of tiles,
modifying data layout to enable parallel reductions, hiding
memory latency via multiple buffering, pipelining, and using
vectorized operations. It is thus apparent that all current
optimization techniques revolve around careful tuning of data
movement and maximizing data reuse.
The Data-Centric (DaCe) parallel programming frame-
work [64] enables performance optimization on heterogeneous
architectures by defining a development workflow that en-
forces a separation between computation and data movement.
The core concept powering program transformation is the
Stateful Dataflow multiGraph (SDFG), which is a graph inter-
mediate representation that defines containers and computation
as nodes, and data movement as edges. DaCe takes input code
written in Python or DSLs, and outputs corresponding SD-
FGs. Subsequently, programmers can mutate the dataflow via
user-extensible graph-rewriting transformations to change the
schedule of operations, the layout of data containers, mapping
of data and computation to certain processing elements, or
any adaptation to the data movement that does not change the
underlying computation.
As opposed to traditional optimizing compilers and deep
learning frameworks (e.g., XLA, TorchScript), DaCe promotes
a white-box approach for performance optimization. The
framework provides an API to programmatically instrument
and explore, e.g., different layouts and kernel fusion strategies,
all without modifying the original code. DaCe was shown to
map applications to different hardware architectures, including
CPUs, GPUs, and FPGAs [64], enabling both whole-program
and micro-optimizations of nontrivial applications to state-of-
the-art performance [65].
The combination of the separation of the algorithm from
the transformed representation and white-box approach for
optimization enables us to inspect and optimize the data
movement characteristics of Transformer networks. As we
shall show in the next sections, this drives a methodical
approach to optimizing a complex composition of linear
algebraic operations beyond the current state of the art.
III. ELEMENTS OF OPTIMIZED TRANSFORMERS
We now apply our recipe to optimize data movement in
training, using a BERT encoder layer as an example. We
focus on a single encoder layer, since these are simply re-
peated throughout the network, and other components (e.g.,
embedding layers) are not a significant component of the
runtime. In this section, we discuss dataflow and our operator
classification. Sections IV and V discuss our optimizations and
Section VI presents end-to-end results for transformers.
At a high level, our recipe consists of the following steps:
1) Construct a dataflow graph for the training process and an-
alyze the computation to identify common operator classes.
2) Identify opportunities for data movement reduction within
each operator class using data reuse as a guide.
3) Systematically evaluate the performance of operators with
respect to data layout to find near-optimal layouts.
4) Find the best configurations to optimize end-to-end perfor-
mance of the training process.
A. Dataflow Analysis
We use SDFGs and the DaCe environment to construct and
analyze dataflow graphs. Fig. 2 provides a simplified represen-
tation of dataflow in a transformer encoder layer. Each node
represents an operator, which is a particular computation along
with its associated input and output, which may vary in size.
An operator may be implemented as multiple compute kernels,
but is logically one operation for our analysis. To produce an
SDFG, all that is required is a simple implementation using
NumPy. As the goal of this stage is to understand the dataflow,
we do not need to optimize this implementation: It is simply
a specification of the computations and data movement.
Using DaCe, we can easily estimate data access volume
and the number of floating point operations (flop) required for
each computation. Fig. 2 is annotated with the number of flop
and the ratio of flop to data volume, and we provide a precise
comparison with PyTorch in Tab. III. The key observation is
that the ratio of data movement to operations performed varies
significantly among operators. In many cases, the runtime of
an operator is dominated by data movement, rather than useful
computation, and this should be the target for optimization.
TABLE I: Proportions for operator classes in PyTorch.
Operator class % flop % Runtime
△ Tensor contraction 99.80 61.0
⬜ Stat. normalization 0.17 25.5
○ Element-wise 0.03 13.5
B. Operators in Transformers
With this analysis, we can now identify high-level patterns
that allow us to classify operators. We base our classification
both on the data movement to operation ratio and the structure
of the computations. This classification is useful as it allows
us to consider optimizations at a more general level, as
opposed to working on each operator (or kernel) individually.
For transformers, we find three classes: tensor contractions,
statistical normalizations, and element-wise operations. The
border of each operator in Fig. 2 indicates its class and Tab. I
gives the proportion of flop and runtime for a BERT encoder
layer for each class.
1) Tensor Contractions △: These are matrix-matrix mul-
tiplications (MMMs), batched MMMs, and in principle could
include arbitrary tensor contractions. We consider only MMMs
and batched MMMs for simplicity, as these are efficiently
supported by cuBLAS. In transformers, these are linear layers
and components of MHA. These operations are the most
compute-intensive part of training a transformer. For good
performance, data layout and algorithm selection (e.g., tiling
strategy) are critical.
2) Statistical Normalizations ⬜: These are operators such
as softmax and layer normalization. These are less compute-
intensive than tensor contractions, and involve one or more
reduction operation, the result of which is then applied via
a map. This compute pattern means that data layout and
vectorization is important for operator performance.
3) Element-wise Operators ○: These are the remaining
operators, and include biases, dropout, activations, and residual
connections. These are the least compute-intensive operations.
C. Memory Usage Efficiency (MUE)
The MUE metric [66] provides a measure of the memory
efficiency of an operation, both with respect to its implementa-
tion and achieved memory performance. This provides another
method for understanding performance beyond flop/s that is
particularly relevant for applications that are bottlenecked by
data movement. MUE evaluates the efficiency of a particular
implementation by comparing the amount of memory moved
(D) to the theoretical I/O lower bound [67] (Q) and the ratio
of achieved (B) to peak (Bˆ) memory bandwidth:
MUE = Q/D ⋅ B/Bˆ ⋅ 100.
If an implementation both performs the minimum number
of operations and fully utilizes the memory bandwidth, it
achieves MUE = 100. This can be thought of as similar to
the percent of peak memory bandwidth.
DlinB2
Dlin2
bji->i add [bji]
Dresid2 Dlin1_linW1
Dsb1ln1
DS1
bji,bji->i
fusion pattern 1
fusion pattern 4
mul [bju]
Ddrop2
drop2mask
Dact
backward activation [bju]
lin1
fusion pattern 2
Dlin1
resid1
sub [bju] - [bj]
bju->bj
ln1mean
ln1diff
fusion pattern 3
Fig. 3: Examples of operator fusion.
D. Evaluation Details
All our results were produced on the Lassen supercom-
puter [68], which consists of 795 nodes, each with two IBM
Power9 CPUs and four Nvidia V100 GPUs with NVLINK2
and 16 GB of memory. We use CUDA 10.1.243 and GCC
7.3.1 to build our code. For comparison, we use cuDNN
7.6.5, PyTorch 1.5.0 (PT) (built-in transformer implementa-
tion), TensorFlow 2.1 from IBM PowerAI with XLA enabled
(transformer adapted from [69]) (TF+XLA), and a recent
development version of DeepSpeed (DS). Unless noted, our
results use mixed-precision training [70], with FP16 data
and accumulations performed at FP32. In PyTorch, we use
Apex [71] for mixed-precision; in TensorFlow and DeepSpeed,
we use the built-in automatic mixed-precision. Results are the
mean of 100 measurements. When we compute the percent of
peak performance, we use the 125 Gflop/s Tensor Core peak
on our V100s for tensor contraction operations, and the 31.4
Gflop/s half-precision peak for other operations.
Our running example is training BERT-large. We use a mini-
batch size of B = 8, sequence length L = 512, embedding size
N = 1024, H = 16 heads, and a projection size of P = 64.
IV. FUSION
A significant portion of the runtime in existing transformer
implementations is in statistical normalization and element-
wise operators (Tab. I). Thus, fusion is a major opportunity
for promoting data reuse. We find fusion opportunities with
a combination of standard performance engineering heuristics
and by using DaCe to identify conformant iteration spaces.
We develop a set of fusion rules applicable to any appli-
cation with operators similar to the three here. The process
consists of two steps: detecting which operations can be fused
and then deciding whether it is beneficial to fuse them.
To discover fusion opportunities, we analyze their iteration
spaces. Every operator has independent dimensions. Statistical
normalization operators also contain reduction dimensions.
Tensor contractions additionally have reduction dimensions,
and special independent dimensions for the two input tensors.
The type of iteration space determines which tools are
used to implement them. Independent dimensions can be
implemented using GPU block or thread parallelism, or with
"for" loops within a single thread. Reduction dimensions use
these techniques but also specify how the reduction is to be
performed: accumulating to the same memory location ("for"
loops), or as grid-, block-, or warp-reductions.
Two operators can be fused if their iteration space imple-
mentations are compatible: They are either the same or the
only difference is that one operator performs a reduction.
The order and size of dimensions and the implementation
for each must match. If the first operator produces output the
second uses as input, partial fusion can be done: The outermost
independent dimensions can be shared, but the innermost
iteration spaces are put in sequential order inside.
When a fusion opportunity is detected, we take it in two
cases: First, if we can perform fewer kernel launches by
merging iteration spaces. Second, if we achieve less data
movement by avoiding loads and stores between operators.
Theoretically, the first case could increase memory pressure,
but we observe it provides benefits in practice.
We attempt to fuse maximally. There are four structural
patterns (Fig. 3) in the dataflow graph for the encoder layer
when fusion rules explained above are applied to a pair of
non-tensor contraction operators. Using the SDFG, we fuse
two adjacent operators whenever we detect these patterns and
continue until we cannot fuse further. This means we fuse until
either a reduction dimension or iteration space changes. As a
further constraint, we only fuse simple element-wise scaling
operations into tensor contraction operators (see Sec. IV-C).
A. Implementation
We implement each fused operator as a single custom
CUDA kernel and specialize it for a specific data layout using
templates to maximize opportunities for compiler optimiza-
tion. To correctly handle data dependencies, if a reduction is
the first operator in a fusion kernel, it is implemented with two
loops: the first computes the reduction and the second uses it.
Otherwise, each kernel is implemented as a single loop.
Reduction operations in statistical normalizations use a warp
allreduce among all threads in a warp, implemented with
CUDA intrinsics. If the number of elements to be reduced
exceeds the warp size, we perform a sequential reduction
over smaller chunks first. Layout-permuting, we use vectorized
loads, stores, and arithmetic within a single thread, and fall
back to word-wise implementations otherwise. For dropout
operators, which must generate a random mask, we use
cuRAND for random number generation.
After fusion, we have the following fused element-wise and
normalization operations. Fig. 3 illustrates several cases.
• AIB: Attention input bias.
• BAOB: Backward attention output bias.
• BAIB: Backward attention input bias.
• SM: Softmax with scaling and dropout.
TABLE II: Algebraic fusion for MHA Q/K/V (µs).
Unfused QK fused QKV fused
Forward 345 294 275
Backward 342 312 291
• BRD: Bias, ReLU, and dropout.
• BDRLN: Bias, dropout, residual connection, and layernorm.
• BSB: Backward layernorm scale and bias.
• BLNRD: Backward layernorm dX and dropout, saving the
intermediate result for the residual connection.
• BDRB: Backward dropout, ReLU, and bias.
• EBSB: Backward residual and layernorm scale and bias.
• BS: Backward dropout and softmax with scaling.
• BEI: Backward encoder input residual connection.
B. Results
Tab. III presents our comprehensive results, including op-
erator fusion. In this, we show a detailed breakdown of the
required and observed flop, data movement, runtime, and MUE
for each operator within the encoder layer, for both PyTorch
and our implementation, with our fused operators marked. We
can easily observe that while the vast majority of flop are
in tensor contractions, much of the runtime is in statistical
normalization and element-wise operators (see also Tab. I).
These operators are indeed memory-bound.
In forward propagation, every fused operator outperforms
PyTorch’s. In backpropagation, this trend generally holds, but
EBSB and BAOB are slower due to our configuration selection
algorithm choosing a layout that is suboptimal for some
operators to optimize the overall performance (see Sec VI).
By studying the MUE and flop/s, we can reason about the
bottlenecks behind each operator. For the fused operators, we
see that high MUE rates are often achieved. In fact, the MUE
from Tab. III and the theoretical flop/IO ratio from Fig. 2
are highly correlated across operators. We say that a kernel
is memory bound if its MUE is larger than the achieved
peak flop/s, and compute bound otherwise. This insight aids
in analyzing the bottlenecks of general DNNs and automated
tuning of operators, prior to measuring their performance. We
note that for our operators, which involve multiple tensors
of different shapes, 100% MUE is potentially unattainable as
achieving peak memory bandwidth requires a specific, highly
regular access pattern into DRAM.
As for the tensor contraction results, we see that the attained
MUE is consistently under 50%. This is acceptable, since
the underlying matrix multiplications are generally compute-
bound. However, we note that some cases, such as QKT ,
are both low in flop/s and MUE. A more in-depth look into
the dimensions of the contraction reveals that the dimensions
are small, which then indicates that the tensor cores are
underutilized. This may result from insufficient scheduled
threads, or low memory throughput to compute ratio. We
thus try to increase hardware utilization by fusing additional
operators into the contractions next.
C. Fusing into Tensor Contractions
Because cuBLAS does not support fusing arbitrary opera-
tors into (batched) MMMs, we evaluated CUTLASS [72] ver-
sion 2.1 as an alternative, which does support fusing element-
wise operators. We conducted a simple benchmark comparing
cuBLAS with a separate bias kernel to CUTLASS for the
first linear layer of BERT. We found that the batched MMM
in CUTLASS is approximately 40 µs slower than cuBLAS.
The reduction in runtime by fusing the bias is less than this.
Hence, we only consider cuBLAS for tensor contractions.
cuBLAS does support simple scaling operations, which we
use to implement the scaling for the scaled softmax operator.
D. Algebraic Fusion
There is an additional opportunity for fusion among certain
tensor contraction operators. Using domain knowledge and the
dataflow graph, we can identify some operations that can be
combined into a single algebraically equivalent operation. For
example, there are several different ways to perform the input
projections (batched MMMs) in self-attention, since the input
queries, keys, and values are the same tensor, X:
1) X can be multiplied by each of the projection matrices:
W
Q
X , WKX , and WVX .
2) The WQ and WK projection matrices can be stacked and
two multiplications performed: [WQ WK]X and WVX .
Similarly, the WK and WV matrices can be stacked.
3) All three can be stacked: [Q̃ K̃ Ṽ ] = [WQWK WV ]X .
In backpropagation, the dW and dX computations for each of
the projection matrices can be similarly fused: X[dQ̃dK̃ dṼ ]
and [WQWK WV ][dQ̃ dK̃ dṼ ], respectively.
There are different tradeoffs to these approaches, which
must be determined empirically. Performing separate MMMs
may enable task parallelism. On the other hand, this stacking
enables data reuse, since X is used only once.
Tab. II presents results for each of these cases. Fully fusing
this batched MMM performs the best. Unfortunately, cuBLAS
launches kernels that utilize the entire GPU, so task parallelism
is not profitable. This specific example can also be adapted to
fuse keys and values in encoder/decoder attention.
V. DATA LAYOUT
We now consider data layout selection, which enables
efficient access patterns, vectorization, and tiling strategies for
tensor contraction and statistical normalization operators. To
study this systematically, for each operator, including the fused
operators produced in the prior step, we benchmark every
feasible data layout to measure its performance, as well as
varying other parameters depending on the specific operator.
The best parameterization of an operator is highly dependent
on the GPU model and tensor sizes, and may not be obvious
a priori; hence it is important to empirically consider this.
Because there are a myriad of potential configurations for
each operator, we summarize the distribution of runtimes over
all configurations using violin plots. The width of the violin
represents the relative number of configurations with the same
runtime. This allows us to see not only the best runtime, but to
TABLE III: Flop analysis for BERT encoder layer. △ – tensor contraction, ⬜ – statistical normalization, ○ – element-wise.
MHA operators are filled black. We bold whichever is greater, % peak (compute-bound) or MUE (memory-bound). The speedup
is computed for kernels in isolation, overall speedup may be different due to measurement overheads.
Input Output PyTorch Ours
Operator Gflop (1e6) (1e6) Gflop Time (µs) % peak Time (µs) % peak MUE Speedup Kernel
Fo
rw
ar
d
▲ Q, K, V 24 7.3 12.5 24.012 333 56.2 306 61.2 12 1.08 —
● Input bias 0.012 12.5 12.5 0.023 90 0.4 66 0.5 78 1.35 }AIB
▲ QKT 4 8.3 33.5 4.031 189 16.5 143 21.8 50 1.32 —
⬛ Scaled softmax 0.188 33.5 100.6 0.89 453 1.3 433 1.3 32 1.04 }SM
▲ Gamma 4 37.7 4.1 8.008 142 21.9 160 19.4 6 0.88 —
▲ Out 8 5.2 4.1 8.09 136 45.9 120 52 10 1.13 —
● Output bias 0.004 4.1 4.1 0.008 34 0.4
102 0.1 42 1.68
⎫⎪⎪⎪⎪⎬⎪⎪⎪⎪⎭DRLN
○ Dropout 0.004 4.1 8.3 0.013 37 0.3
○ Residual 0.004 8.3 4.1 0.008 36 0.3
⬜ LayerNorm 0.027 4.1 4.1 0.048 63 1.3
△ Linear 32 8.3 16.7 32.016 451 55.4 402 62.1 12 1.12 —
○ Bias 0.016 16.7 16.7 0.031 116 0.4
183 0.3 76 1.90
⎫⎪⎪⎪⎬⎪⎪⎪⎭BRD○ ReLU — 16.7 16.7 — 112 0○ Dropout 0.016 16.7 33.5 0.048 120 0.4
△ Linear 32 20.9 4.1 32.09 449 55.6 369 67.6 6 1.21 —
○ Bias 0.004 4.1 4.1 0.008 35 0.3
101 0.1 43 1.70
⎫⎪⎪⎪⎪⎬⎪⎪⎪⎪⎭BDRLN
○ Dropout 0.004 4.1 8.3 0.013 37 0.3
○ Residual 0.004 8.3 4.1 0.008 37 0.3
⬜ LayerNorm 0.027 8.3 4.1 0.048 63 1.3
B
ac
kw
ar
d
⬜ LayerNorm dW 0.016 8.3 <0.01 0.02 184 0.3 150 0.3 6 1.22 }BSB
⬜ LayerNorm dX 0.035 8.3 4.1 0.06 78 1.4 71 1.5 37 1.58 }BLNRD○ Dropout dX 0.004 8.3 4.1 0.008 34 0.4
△ Linear+Bias dX 32 8.3 16.7 32.016 427 58.4 414 60.3 5 1.03 —
△ Linear dW 32 20.9 4.1 32.027 424 58.9 378 66 13 1.11 —
⬜ Bias dW 0.004 4.1 <0.1 0.005 24 0.5
362 <0.1 38 1.05
⎫⎪⎪⎪⎪⎬⎪⎪⎪⎪⎭BDRB
○ Dropout dX 0.016 33.5 16.7 0.031 129 0.4
○ ReLU dX — 33.5 16.7 — 166 0
⬜ Bias dW 0.016 16.7 <0.1 0.02 61 0.8
△ Linear+Bias dX 32 20.9 4.1 32.027 417 59.9 398 62.7 6 1.04 —
△ Linear dW 32 20.9 4.1 32.09 437 57.2 372 67.2 6 1.17 —
○ Residual 0.004 8.3 4.1 0.008 36 0.3 250 <0.1 17 0.89 }EBSB⬜ LayerNorm dW 0.016 8.3 <0.1 0.02 186 0.3
⬜ LayerNorm dX 0.035 8.3 4.1 0.06 80 1.4 69 1.6 37 1.64 }BLNRD○ Dropout dX 0.004 8.3 4.1 0.008 34 0.4
⬛ Output bias dW 0.004 4.1 <0.1 0.005 23 0.5 38 0.3 22 0.60 }BAOB
▲ Out dX 8 4.3 4.1 8.044 131 47.6 119 52.2 10 1.09 —
▲ Out dW 8 8.3 1.0 8.09 136 45.9 113 54.8 5 1.19 —
▲ Gamma dX1 4 8.3 33.5 8.008 136 22.8 147 21.2 7 0.93 —
▲ Gamma dX2 4 67.1 33.5 4.031 188 16.6 123 25.2 8 1.52 —
⬛ Scaled softmax dX 0.156 12.5 4.1 0.199 790 0.6 426 1.1 49 1.85 }BS
▲ QKT dX1 4 37.7 4.1 4.004 135 23.1 155 20 7 0.86 —
▲ QKT dX2 4 37.7 4.1 8.008 139 22.3 115 26.9 9 1.20 —
▲ Q, K, V dX 24 15.7 4.1 24.027 344 54.4 274 68.2 6 1.25 —
▲ Q, K, V dW 24 20.4 1.0 24.132 329 57 293 64 6 1.12 —
⬛ Input bias dW 0.012 12.5 <0.1 0.015 52 0.7 39 0.9 66 1.32 }BAIB
○ Residual 0.004 8.3 4.1 0.008 35 0.3 31 0.4 83 1.14 }BEI
△ Tensor contractions 312 — — 324.75 4951 43.1 4411 48.5 — 1.12
⬜ Stat. normalizations 0.535 — — 1.389 2063 0.9 1591 0.6 — 1.29
○ Element-wise 0.098 — — 0.223 1096 0.3 735 0.1 — 1.49
Total 312.633 — — 326.362 8110 31.1 6739 35 — 1.20
see how sensitive operators are to layouts, an important factor
when globally optimizing the layout in Step 4 of our recipe.
A. Tensor Contractions
Using the Einsum notation for tensor contractions, we
consider all equivalent permutations of the summation string.
The einsum is then mapped to a single cuBLAS MMM or
batched MMM call. While this notation allows us to express
arbitrary tensor contractions, as cuBLAS does not support all
configurations, we limit ourselves to these two types.
In addition, we consider every possible cuBLAS algorithm
for each layout, as we have observed that the heuristic selec-
tion provided by its default algorithm is not always optimal.
We use the cublasGemmEx API to manually select algo-
rithms. We support both regular and Tensor Core operations,
and perform all accumulations at single-precision, in line with
best practices for mixed-precision training [70].
Fig. 4 presents performance distributions over all data lay-
outs for every tensor contraction in the encoder layer training,
including algebraic fusion variants. Since input matrices for
best: 2.32 ms
best: 0.38 ms
worst: 3.30 ms
worst: 0.65 ms
best: 0.31 ms
best: 0.14 ms
worst: 0.82 ms
worst: 0.18 ms
best: 1.19 ms
best: 0.21 ms
worst: 1.66 ms
worst: 0.33 ms
best: 0.59 ms
best: 0.12 ms
worst: 0.87 ms
worst: 0.23 ms
best: 0.32 ms
best: 0.12 ms
worst: 0.79 ms
worst: 0.40 ms
best: 1.75 ms
best: 0.29 ms
worst: 2.48 ms
worst: 0.56 ms
best: 1.14 ms
best: 0.20 ms
worst: 1.77 ms
worst: 0.34 ms
best: 2.32 ms
best: 0.37 ms
worst: 3.29 ms
worst: 2.46 ms
best: 1.19 ms
best: 0.20 ms
worst: 1.67 ms
worst: 0.33 ms
best: 1.70 ms
best: 0.29 ms
worst: 2.63 ms
worst: 0.50 ms
best: 0.60 ms
best: 0.11 ms
worst: 0.85 ms
worst: 0.18 ms
best: 1.74 ms
best: 0.28 ms
worst: 2.48 ms
worst: 0.48 ms
dXQK
M: 4096, N: 1024, K: 2048, B: 1
dXQKV
M: 4096, N: 1024, K: 3072, B: 1
KV
M: 4096, N: 2048, K: 1024, B: 1
QKV
M: 4096, N: 3072, K: 1024, B: 1
dX1gamma, QKT
M: 512, N: 512, K: 64, B: 128
dX1QKT, dX2gamma, dX2QKT, gamma
M: 512, N: 64, K: 512, B: 128
dXlin2, lin1
M: 4096, N: 4096, K: 1024, B: 1
dXout, dXQ, out, Q
M: 4096, N: 1024, K: 1024, B: 1
dWlin1, dWlin2, dXlin1, lin2
M: 4096, N: 1024, K: 4096, B: 1
dWout, dWQ
M: 1024, N: 1024, K: 4096, B: 1
dWQK
M: 2048, N: 1024, K: 4096, B: 1
dWQKV
M: 3072, N: 1024, K: 4096, B: 1
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
% of peak performance16-bit FPUs Tensor Cores
Fig. 4: Tensor contraction performance. Tensor Core peak: 125 Tflop/s; FP16 peak: 31.4 Tflop/s.
cuBLAS MMs can be easily swapped, results for both orders
are merged into the same tile of the figure. All tiles are labeled
with M > N . As expected, in many cases using tensor cores
offers significant performance advantages, but interestingly, in
several cases (where one of the matrix dimensions is 64) the
performance is quite close to the regular floating point units,
due to a failure to saturate the tensor cores. Among the tensor
core results, we can typically see there are several modes in the
performance distributions; these correspond to particularly im-
portant axes for data layout. Indeed, for many configurations,
one of these is near or contains the best-performing configura-
tion, indicating that many slightly different data layouts could
be used with little impact on performance depending on the
needs of our global optimization pass. However, this does not
mean that any data layout is acceptable; in every case, the
majority of the mass does not perform well, illustrating the
importance of careful tuning.
We also investigated how well the built-in cuBLAS algo-
rithm heuristics compare to the best-performing configuration.
On half precision, we found that the algorithm chosen by
cuBLAS’s heuristic was up to 14.24% worse than the best
algorithm (in QKT dX1). We also investigated the perfor-
mance at single-precision and found similar results with up to
7.18% worse performance. This demonstrates the importance
of carefully tuning for the particular hardware and workload.
B. Fused Operators
For our fused operators, we consider all combinations of in-
put and output layout permutations. This enables us to include
transposes of the output data as part of the operator, should
the next operator perform better in a different layout than
the input. The data layout is especially critical for statistical
normalization operators, where the appropriate data layout can
enable vectorization opportunities, especially vectorized loads
and stores for more efficient memory access. Where relevant,
we therefore also consider vectorization dimensions, the map-
ping of dimensions to GPU warps, etc. Our implementation
takes advantage of these layouts when feasible.
Fig. 5 presents the runtime distribution for all configurations
of our fused operators (note some are used twice). The
distributions here are qualitatively similar to those in Fig. 4,
except these have much longer tails: A bad configuration is,
relatively, much worse, potentially by orders of magnitude.
All kernels support changing the layouts of tensors they
use. This change is done via template parameterization, so
the compiler can generate efficient code. All kernels support
selecting different vectorization dimensions. The BRD and BEI
kernels can select the dimension used for CUDA thread dis-
tribution; BSB, EBSB, and BDRB can select the warp reduction
dimension, as they reduce over two dimensions.
The most noticeable performance improvement is made by
layouts that enable vectorized memory access, showing that
the main performance limitation is the amount of accessed
data volume. The second notable category are layouts with the
same reduce and vector dimensions. Joining these dimensions
decreases the number of registers required to store partially
reduced values from the vector size (eight at FP16) to one.
We can expect to get good performance restricting ourselves
to configurations in the two groups described above. Usually,
the best layout discovered follows them. For example, the SM
kernel has the same warp and reduction dimensions, and these
dimensions are the last and sequential ones for involved arrays.
However, this intuition is not always correct. Examining the
results in Fig. 5, we find that there are kernel configurations
that both satisfy these intuitive rules yet are very slow. For
example, the best configuration of AIB takes 65 µs, and the
worst "intuitively good" configuration takes 771 µs.
best: 0.065
worst: 5.317
best: 0.101
worst: 7.643
best: 0.396
worst: 45.379
best: 0.033
worst: 0.863
best: 0.034
worst: 0.876
best: 0.037
worst: 1.271
best: 0.014
worst: 0.146
best: 0.071
worst: 3.521
best: 0.078
worst: 3.334
best: 0.167
worst: 14.453
best: 0.176
worst: 6.583
best: 0.402
worst: 81.286
BS BSB EBSB SM
BDRLN BEI BLNRD BRD
AIB BAIB BAOB BDRB
0
5
10
15
0
2
4
6
0
20
40
60
80
0.04
0.08
0.12
0
1
2
3
0
1
2
3
0.00
0.25
0.50
0.75
0.00
0.25
0.50
0.75
0.0
0.5
1.0
0
2
4
0
2
4
6
8
0
10
20
30
40
Ti
m
e 
[m
s]
Fig. 5: Performance of fused kernels for element-wise and
statistical normalization operators.
Configurations discovered through exhaustive search can en-
hance our intuitive understanding of requirements that should
be satisfied to achieve better performance. The BRD kernel
provides an example of that. It uses four 3D tensors for which
we can choose one among six possible layouts. According to
our intuitive rules, we want to put the vectorized dimension
last to make it sequential for all of them. However, the best
configuration has only two out of four arrays vectorized while
the others are not vectorized. With this information, we can
refine our intuition. Probably the factor that limits vector-
ization over all arrays is excessive GPU register usage. The
important point here is that even this new intuitive knowledge
doesn’t help to find the exact number of tensors that should
be vectorized, but the exhaustive search does.
VI. END-TO-END OPTIMIZATION
The final step is to assemble fused components and select
data layouts for each operator to yield a complete implemen-
tation. This is the culmination of the prior steps performing
dataflow analysis, fusion, and layout selection. From these, we
have performance data for all data layouts as well as algebraic
0
0
0
source
t1
t2
ijb
t4
ibj
t3
bji
⋅⋅
⋅
t5
t6
t7
phjb
t13
t12
t11
hpjb
t9
t8
t10
phbj
⋅⋅
⋅
0
phjb
0
hpjb
0
phbj
⋅⋅
⋅
target
QKV-fused AIB
in in/out out
Fig. 6: Example configuration selection graph for SSSP.
fusion strategies. One cannot simply pick a single data layout
a priori, as the benefit of running two operators in different
layouts may outweigh the overhead of transposing data.
Our assembled implementation is structured using the SD-
FGs produced in Step 1. We integrate it into PyTorch [45] via
its C++/CUDA operator API.
A. Configuration Selection
We develop an automatic configuration selection algorithm
to globally optimize our implementation using the perfor-
mance data. We construct a directed graph based on the
dataflow graph of the operators. Beginning from the input data
and proceeding in the order given by a pre-order depth-first
search, we add a node to the graph for each input and output
data layout of the operator. An edge is added from the input
to the output layout, weighted with the minimum runtime of
any configuration with that layout. Determining this simply
requires a linear search over the matching performance data.
This allows us to select both the best data layout and any other
operator knobs (e.g., vectorization dimension). To minimize
the size of the graph, we only add a configuration from an
operator when it has at least one input and output edge. We
then run a single-source shortest-path (SSSP) algorithm from
the input to the output in the graph; the resulting path gives
our final configuration. Because this graph is a DAG and
the number of feasible input/output configurations for each
operator is small, SSSP takes linear time asymptotically and
seconds for BERT. We illustrate the graph used in Fig. 6.
To simplify the implementation, we make two assumptions
on the global configuration space. First, we omit residual
connections. Second, we run SSSP only for the forward
propagation dataflow graph, and then infer the layouts of
the backpropagation operators from those selected (including
weight and gradient layouts). Both of these assumptions could
be relaxed in a future version of this algorithm. Although
this means we are not guaranteed to find a globally optimal
data layout, the runtime of our configuration is nevertheless
within 4% of the sum of the best possible configuration of
each operator (which ignores data layout constraints).
B. Multi-head Attention
We first analyze the performance of multi-head self-
attention. While it is a key primitive in BERT, MHA is also
used outside of transformers, so understanding its performance
TABLE IV: Multi-head attention performance for BERT.
TF+XLA PT cuDNN Ours
Forward (ms) 1.60 1.90 131 1.25
Backward (ms) 2.25 2.77 652 1.86
TABLE V: Full BERT encoder layer performance.
PT TF+XLA DS Ours
Forward (ms) 3.45 3.2 2.8 2.63
Backward (ms) 5.69 5.2 4.8 4.38
in isolation can inform other models too. Tab. IV compares
our globally-optimized implementation with PyTorch, Ten-
sorFlow+XLA, and cuDNN. cuDNN’s (experimental) MHA
implementation (cudnnMultiHeadAttnForward and re-
lated) supports six different data layouts; we report the fastest.
cuDNN’s performance is orders of magnitude worse than
the others; as it is a black box, our ability to understand
it is limited. However, profiling shows that its implementa-
tion launches very large numbers of softmax kernels, which
dominate the runtime, indicating additional fusion would be
profitable. TensorFlow+XLA finds several fusion opportunities
for softmax. However, its implementation does not perform
algebraic fusion for the queries, keys, and values, and it uses
subpar data layouts for tensor contractions.
Our performance results in Tab. III illustrate the source of
our performance advantage over PyTorch: Our data layout and
algorithm selection results in faster tensor contractions overall.
This is despite the Gamma stage actually being slower than
PyTorch’s: Sometimes locally suboptimal layouts need to be
selected to improve performance globally.
C. End-to-End Performance
We present overall performance results for the encoder layer
in Tab. V. For forward and backpropagation combined, we
are 1.30× faster than PyTorch and 1.20× faster than Ten-
sorFlow+XLA, including unoptimized framework overheads.
At a high level, this is because we perform a superset of
the optimizations used by both frameworks, and globally
combine them to get all the advantages while minimizing
drawbacks. As a general guideline, we use flop and MUE
rates as proxies for which operators require the most attention
and their corresponding bottlenecks. This ensures a guided
optimization rather than tuning all operators aggressively.
We also include performance results from DeepSpeed,
which we are 1.08× faster than. This is despite DeepSpeed
being a manually-optimized library tuned specifically for
BERT. Note also that DeepSpeed modifies some operations,
e.g., to be reversible or to exploit output sparsity, and so is
not always strictly comparable to the other implementations.
This also provides it opportunities for optimization that we do
not consider.
The total data movement reduction we attain is ∼22.91%
over the standard implementation. We obtain this information
from Tab. III, where for each fused kernel we omit the interim
outputs and inputs that are not part of the overall I/O that the
fused kernels perform. TensorFlow+XLA’s automatic kernel
fusion reduces data movement similarly to ours. However,
the data layouts used for tensor contractions are not optimal,
and its BERT encoder implementation does not use algebraic
fusion in MHA. PyTorch’s data layouts enable faster tensor
contractions and it implements the algebraic fusion, but it has
higher overheads for other operators.
Our fusion pass finds all the opportunities that TF+XLA
does, plus several additional ones; for example, we implement
layernorm as a single fused kernel. Our data layout selection
picks better layouts than PyTorch in nearly every individual
instance; when it does not, this is because the layout change
enables greater performance gains downstream. In Tab. III, we
also see that PyTorch performs more flop than predicted. Some
of this is due to padding in cuBLAS operations, and generic
methods performing excess operations. However, we also dis-
covered that some cuBLAS GEMM algorithms, including ones
called by PyTorch, incorrectly perform twice as many FLOPs
as necessary; our recipe avoids these cases automatically.
We also briefly considered another configuration for training
BERT, where we change the batch size to B = 96 and the
sequence length to L = 128, and retuned our layout selection.
In this case, forward and backpropagation for a single encoder
layer takes 18.43 ms in PyTorch, 16.19 ms in DeepSpeed, and
16.22 ms in our implementation. We significantly outperform
PyTorch, and match the performance of DeepSpeed in this
case (even with its additional optimizations). We believe that
with further improvements to our layout selection algorithm,
the performance of our implementation will improve further.
Beyond BERT, other transformers have very similar layers,
such as decoder layers in GPT-2/3. With very few changes, our
recipe and implementations are directly applicable to these.
Our implementation can also be extended to support a full
training pipeline by stacking our optimized layers.
VII. RELATED WORK
There has been significant work optimizing both transform-
ers in particular and deep learning in general. For a recent
comprehensive overview, see Ben-Nun & Hoefler [58]. To
help guide training regimes for transformers, recent work has
provided empirical recommendations on model size, batch
size, and so on [23], [73], [74]. Many of the subsequent
techniques we review are complementary to our work.
Most directly relevant are other approaches specifically
to accelerate transformer training. Distributed-memory tech-
niques, such as ZeRO [75], Megatron [18], and Mesh-
TensorFlow [76] scale training to many GPUs to accelerate
it. Mesh-TensorFlow also presents a classification of opera-
tors similar to ours. Large batches have also been used to
accelerate training via LAMB [77] or NVLAMB [78]. None
of these directly address the performance of a single GPU
as done in this paper. DeepSpeed [39], which we compare
with in Section VI-C, is closest to our work, but performs all
optimizations and layout selections manually.
Transformer architectures to enable improved training have
also been the subject of significant recent work. ALBERT [19]
used a combination of weight sharing and factorized em-
bedding layers to reduce memory requirements; however
compute times are unaffected. Transformer-XL [79] caches
prior hidden states to learn long sequences. RevNets [80], a
variant of ResNets which allow activations to be reconstructed
during backpropagation, have been applied to transformers.
Notably, Reformer [37] combines reversible residual layers
with locality-sensitive hashing to improve the efficiency of
multi-head attention. Sparsity optimizations for attention [28]–
[36] reduce memory and compute requirements. We view these
as orthongal to our work: the same principles of data-flow
analysis can be applied to optimize for sparsity and reuse.
There has also been much work on optimizing deep learn-
ing in general. Many frameworks provide implementations
of transformers or their components, such as PyTorch [45],
TensorFlow [46], cuDNN [47], and others built atop these [69],
[81]. Optimizing frameworks can also be applied to transform-
ers [59], [60], [82]–[100]. None of these frameworks provide
all the optimizations or the systematic study of data movement
and its impact on performance. We have specifically compared
against some of the most popular production frameworks:
PyTorch, TensorFlow with XLA, and cuDNN. Beyond these,
TASO [87] targets similar optimizations to ours by using
graph substitutions, but considers only inference and does not
exhaustively explore the search space.
Other optimizations, including model parallelism [76], [85],
[101]–[105], pipeline parallelism [106]–[109], microbatch-
ing [110], and recomputation for memory reduction [111],
[112] are all also applicable. Communication can also be a
major bottleneck for training transformers, due to the large
model size [18], [76]. Frameworks for inference, including
TensorRT [113], Caffe2 [114], and the ONNX Runtime [115],
all help to enable a suite of optimizations primarily applicable
during inference. Pruning [116], [117] and distillation [118]
has also been used to accelerate inference.
Neural network architecture-specific optimizations have a
long history outside of transformers, and have primarily
targeted CNNs [9], [119]–[126]. Notably, Li et al. [127]
optimized data layouts for convolution.
In general, data movement reduction is a core component of
high-level optimization [128]. Optimizing compilers, most no-
tably components that specialize in polyhedral programs [129],
[130], apply loop transformations (e.g., tiling, skewing) that
belong to the class of data movement optimization. Other
white-box approaches for separation of program definition
from data optimization passes include Halide [131], JAX [82],
[83], Legion [132], Lift [133], and MLIR [134]. The data-
centric approach proposed here enables user-extensible coarse-
and fine-grained data movement optimization via the flat (yet
parametric) dataflow graph representation [64]. This allows us
to perform and tune complex data layout and fusion trans-
formations that span multiple granularity levels, surpassing
the optimization capabilities of the aforementioned tools and
achieving state-of-the-art performance.
VIII. DISCUSSION
The recipe we propose in this paper can be directly adopted
in other DNN architectures. Additional transformer networks,
such as Megatron-LM [18] and GPT-3 [40], only differ by
dimensions and minor aspects in the encoder and decoder
blocks (e.g., dropout position, biases). Once a data-centric
graph is constructed from them, the recipe remains unchanged.
A. Beyond Transformers
The classification of operators into three groups covers a
wide variety of operators that span beyond transformers.
Large tensors and their contraction are ubiquitous in modern
DNNs. For fully connected networks (MLPs) and recurrent
neural networks (RNNs), there is little change, as the core
operator types are essentially the same. Convolutions, pooling,
and other local spatial operators can be treated similarly to
tensor contractions, owing to their arithmetic intensity proper-
ties and abundance of optimized implementations. Therefore,
the same considerations we take here are just as critical
in CNNs. However, as opposed to contractions (see Section
IV-C), further fusion is typically considered for convolutions.
Statistical normalization also takes different forms in DNNs.
This includes a variety of reductions, as well as Instance,
Group, and Batch Normalization, where the latter constitutes
the second largest computation in ResNets after convolutions.
When varying data layouts, these operators share properties
(normalizing a dimension) and are optimized in exactly the
same way. Lastly, element-wise operators always exist in
DNNs and benefit from the same fusion and bulk data move-
ment optimizations as we perform here. For graph neural
networks [135], capsule neural networks [136], and other
emerging architectures, the operators change more signifi-
cantly, but the basic procedure applies.
Due to the prohibitively large search space of configurations
in transformers, writing manually-optimized kernels becomes
infeasible. Instead, each of our data-centric implementations
chooses an optimization scheme (e.g., tiling, vectorization,
warp-aggregated reductions) automatically, according to the
input data layout and the operator type. Combined with
automated configuration selection (Section VI-A), we rely only
on the dataflow graph structure to choose the best feasible data
layout configuration.
As many networks are bound by data movement rather
than compute performance [127]. This leads us to believe
that our recipe, regardless of the objective (e.g., classification,
regression, RL) and the constituent operators, is generically
applicable for optimizing any neural network architecture.
B. Hardware Implications
The implications of data movement reduction extend beyond
software. Given that the highest performance for different
operators is achieved with different data layouts, there would
be significant benefits if future machine learning accelerators
included built-in support for fast data layout changes. We
confirm this in our MUE results (Tab. III), showing that even
the most compute-intensive tensor contractions are bounded by
the hardware’s capability of transferring data to Tensor Cores.
Hardware trends indicate a similar situation. New archi-
tectures are moving towards reducing data format conversion
(e.g., TF32 [137]), increased on-chip memory and low-latency
interconnects [138], and coarse-grained spatial hardware [139].
For the latter two, the recipe and analysis provided here is cru-
cial to maintain high utilization in pipelined DNN execution.
IX. CONCLUSIONS
Despite the importance of transformer neural networks,
training them is memory-bound and underutilizes GPUs. Using
our recipe for data movement analysis, we identified bottle-
necks and optimizations, yielding improved implementations
that outperform the already highly tuned state-of-the-art. As
training transformers is already a major compute workload
that will only grow larger, our improvements offer significant
real-world impacts for both research and industry.
Our approach is applicable more broadly to deep learning;
many neural networks easily fit within our operator classifica-
tion. This is especially important for guiding the optimization
of emerging or novel model architectures, which do not benefit
from existing acceleration libraries. Our results also highlight
the importance of considering data movement at every level of
the training stack—from the application down to hardware.
ACKNOWLEDGMENTS
This project received funding from the European Re-
search Council (ERC) under the European Union’s Horizon
2020 programme (grant agreements DAPP, No. 678880, and
EPiGRAM-HS, No. 801039). N.D. is supported by the ETH
Postdoctoral Fellowship. T.B.N. is supported by the Swiss
National Science Foundation (Ambizione Project #185778).
Experiments were performed at the Livermore Computing
facility.
REFERENCES
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
Advances in Neural Information Processing Systems (NeurIPS), 2017.
[2] A. Graves, “Sequence transduction with recurrent neural networks,”
International Conference on Machine Learning (ICML) Workshop on
Representation Learning, 2012.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning rep-
resentations by back-propagating errors,” Nature, vol. 323, no. 6088,
1986.
[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Computation, vol. 9, no. 8, 1997.
[5] A. Radford, K. Narasimhan, T. Salimans, and
I. Sutskever, “Improving language understanding
by generative pre-training,” 2018. [Online]. Avail-
able: https://s3-us-west-2.amazonaws.com/openai-assets/research-
covers/language-unsupervised/language_understanding_paper.pdf
[6] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman,
“GLUE: A multi-task benchmark and analysis platform for natural lan-
guage understanding,” in Proceedings of the 2018 EMNLP Workshop
BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,
2018.
[7] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill,
O. Levy, and S. Bowman, “SuperGLUE: A stickier benchmark for
general-purpose language understanding systems,” in Advances in
Neural Information Processing Systems (NeurIPS), 2019.
[8] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know:
Unanswerable questions for SQuAD,” in Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics
(ACL), 2018.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems (NeurIPS), 2012.
[10] C.-F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain,
K. Schubert, C. Fuegen, and M. L. Seltzer, “Transformer-transducer:
End-to-end speech recognition with self-attention,” arXiv preprint
arXiv:1910.12977, 2019.
[11] E. Parisotto, H. F. Song, J. W. Rae, R. Pascanu, C. Gulcehre, S. M.
Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury et al.,
“Stabilizing transformers for reinforcement learning,” arXiv preprint
arXiv:1910.06764, 2019.
[12] Ł. Maziarka, T. Danel, S. Mucha, K. Rataj, J. Tabor, and S. Jastrze˛bski,
“Molecule attention transformer,” arXiv preprint arXiv:2002.08264,
2020.
[13] G. Lample and F. Charton, “Deep learning for symbolic mathematics,”
arXiv preprint arXiv:1912.01412, 2019.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-
training of deep bidirectional transformers for language understanding,”
in Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics (NAACL), 2019.
[15] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and
Q. V. Le, “XLNet: Generalized autoregressive pretraining for language
understanding,” in Advances in Neural Information Processing Systems
(NeurIPS), 2019.
[16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized
BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[17] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher,
“CTRL: A conditional transformer language model for controllable
generation,” arXiv preprint arXiv:1909.05858, 2019.
[18] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catan-
zaro, “Megatron-LM: Training multi-billion parameter language mod-
els using GPU model parallelism,” arXiv preprint arXiv:1909.08053,
2019.
[19] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut,
“ALBERT: A lite BERT for self-supervised learning of language rep-
resentations,” in Proceedings of the Seventh International Conference
on Learning Representations (ICLR), 2019.
[20] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of trans-
fer learning with a unified text-to-text transformer,” arXiv preprint
arXiv:1910.10683, 2019.
[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever,
“Language models are unsupervised multitask learners,” 2019.
[Online]. Available: https://openai.com/blog/better-language-models/
[22] Microsoft, “Turing-NLG: A 17-billion-parameter lan-
guage model by Microsoft,” 2020. [Online]. Avail-
able: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-
billion-parameter-language-model-by-microsoft/
[23] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess,
R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws
for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
[24] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy consid-
erations for deep learning in NLP,” in Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics (ACL), 2019.
[25] Microsoft, “ZeRO & DeepSpeed: New system optimizations
enable training models with over 100 billion parameters,” 2020.
[Online]. Available: https://www.microsoft.com/en-us/research/blog/
zero-deepspeed-new-system-optimizations-enable-training-models-
with-over-100-billion-parameters/
[26] OpenAI, “AI and compute,” 2018. [Online]. Available: https:
//openai.com/blog/ai-and-compute/
[27] ——, “Microsoft invests in and partners with OpenAI to support
us building beneficial AGI,” 2019. [Online]. Available: https:
//openai.com/blog/microsoft/
[28] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” in Proceedings of the Third
International Conference on Learning Representations (ICLR), 2014.
[29] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to
attention-based neural machine translation,” in Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2015.
[30] T. Shen, T. Zhou, G. Long, J. Jiang, and C. Zhang, “Bi-directional
block self-attention for fast and memory-efficient sequence modeling,”
in Proceedings of the Sixth International Conference on Learning
Representations (ICLR), 2018.
[31] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and
D. Tran, “Image transformer,” in Proceedings of the 35th International
Conference on Machine Learning (ICML), 2018.
[32] Y. Tay, S. Wang, L. A. Tuan, J. Fu, M. C. Phan, X. Yuan, J. Rao,
S. C. Hui, and A. Zhang, “Simple and effective curriculum pointer-
generator networks for reading comprehension over long narratives,”
in Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics (ACL), 2019.
[33] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long
sequences with sparse transformers,” arXiv preprint arXiv:1904.10509,
2019.
[34] G. M. Correia, V. Niculae, and A. F. Martins, “Adaptively sparse
transformers,” in Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
2019.
[35] S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin, “Adaptive
attention span in transformers,” in Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics (ACL), 2019.
[36] Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, “Sparse sinkhorn
attention,” arXiv preprint arXiv:2002.11296, 2020.
[37] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient
transformer,” in Proceedings of the Eighth International Conference
on Learning Representations (ICLR), 2020.
[38] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter
performance analysis of a tensor processing unit,” in Proceedings of
the 44th Annual International Symposium on Computer Architecture
(ISCA), 2017.
[39] Microsoft, “DeepSpeed,” 2020. [Online]. Available: deepspeed.ai
[40] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language
models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
[41] K. Wiggers, “OpenAIâA˘Z´s massive GPT-3 model is impres-
sive, but size isnâA˘Z´t everything,” 2020. [Online]. Avail-
able: https://venturebeat.com/2020/06/01/ai-machine-learning-openai-
gpt-3-size-isnt-everything/
[42] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention
augmented convolutional networks,” in Proceedings of the IEEE Inter-
national Conference on Computer Vision (CVPR), 2019.
[43] N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya, and
J. Shlens, “Stand-alone self-attention in vision models,” in Advances
in Neural Information Processing Systems (NeurIPS), 2019.
[44] J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship
between self-attention and convolutional layers,” in Proceedings of the
Eighth International Conference on Learning Representations (ICLR),
2020.
[45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An
imperative style, high-performance deep learning library,” in Advances
in Neural Information Processing Systems (NeurIPS), 2019.
[46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro,
G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat,
I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz,
L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga,
S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner,
I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan,
F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,
Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning
on heterogeneous systems,” 2015. [Online]. Available: https://
www.tensorflow.org/
[47] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catan-
zaro, and E. Shelhamer, “cuDNN: Efficient primitives for deep learn-
ing,” arXiv preprint arXiv:1410.0759, 2014.
[48] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT
Press, 2016, http://www.deeplearningbook.org.
[49] T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in
continuous space word representations,” in Proceedings of the 2013
Conference of the North American Chapter of the Association for
Computational Linguistics (NAACL), 2013.
[50] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of
word representations in vector space,” arXiv preprint arXiv:1301.3781,
2013.
[51] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,
“Distributed representations of words and phrases and their com-
positionality,” in Advances in neural information processing systems
(NeurIPS), 2013.
[52] N. Kalchbrenner and P. Blunsom, “Recurrent continuous translation
models,” in Proceedings of the 2013 Conference on Empirical Methods
in Natural Language Processing (EMNLP), 2013.
[53] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
RNN encoder-decoder for statistical machine translation,” in Pro-
ceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), 2014.
[54] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” in Advances in neural information processing
systems (NeurIPS), 2014.
[55] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: a simple way to prevent neural networks from
overfitting,” Journal of machine learning research (JMLR), vol. 15,
no. 1, 2014.
[56] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv
preprint arXiv:1607.06450, 2016.
[57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE International Conference on
Computer Vision (CVPR), 2016.
[58] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed
deep learning: An in-depth concurrency analysis,” ACM Comput. Surv.,
vol. 52, no. 4, 2019.
[59] Google, “XLA: Optimizing compiler for machine learning,” 2020.
[Online]. Available: https://www.tensorflow.org/xla
[60] PyTorch Team, “TorchScript,” 2020. [Online]. Available: https:
//pytorch.org/docs/stable/jit.html
[61] K. Chellapilla, S. Puri, and P. Simard, “High performance convolutional
neural networks for document processing,” in Tenth International
Workshop on Frontiers in Handwriting Recognition, 2006.
[62] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional
networks through FFTs,” arXiv preprint arXiv:1312.5851, 2013.
[63] A. Lavin and S. Gray, “Fast algorithms for convolutional neural
networks,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2016.
[64] T. Ben-Nun, J. de Fine Licht, A. N. Ziogas, T. Schneider, and
T. Hoefler, “Stateful dataflow multigraphs: A data-centric model for
performance portability on heterogeneous architectures,” in Proceed-
ings of the International Conference for High Performance Computing,
Networking, Storage and Analysis (SC), 2019.
[65] A. N. Ziogas, T. Ben-Nun, G. I. Fernández, T. Schneider, M. Luisier,
and T. Hoefler, “A data-centric approach to extreme-scale ab initio
dissipative quantum transport simulations,” in Proceedings of the In-
ternational Conference for High Performance Computing, Networking,
Storage and Analysis (SC), 2019.
[66] O. Fuhrer, T. Chadha, T. Hoefler, G. Kwasniewski, X. Lapillonne,
D. Leutwyler, D. Lüthi, C. Osuna, C. Schär, T. C. Schulthess et al.,
“Near-global climate simulation at 1 km resolution: establishing a
performance baseline on 4888 GPUs with COSMO 5.0,” Geoscientific
Model Development, vol. 11, no. 4, 2018.
[67] H. Jia-Wei and H.-T. Kung, “I/O complexity: The red-blue pebble
game,” in Proceedings of the thirteenth annual ACM symposium on
Theory of computing, 1981.
[68] Livermore Computing Center, “Lassen,” 2020. [Online]. Available:
https://hpc.llnl.gov/hardware/platforms/lassen
[69] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi,
P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew, “Hug-
gingFace’s transformers: State-of-the-art natural language processing,”
arXiv preprint arXiv:1910.03771, 2019.
[70] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia,
B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed
precision training,” in Proceedings of the Sixth International Confer-
ence on Learning Representations (ICLR), 2018.
[71] Nvidia, “Apex (A PyTorch Extension),” 2020. [Online]. Available:
https://nvidia.github.io/apex/
[72] ——, “CUTLASS: CUDA templates for linear algebra subroutines,”
2020. [Online]. Available: https://github.com/NVIDIA/cutlass
[73] Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, and
J. E. Gonzalez, “Train large, then compress: Rethinking model size
for efficient training and inference of transformers,” arXiv preprint
arXiv:2002.11794, 2020.
[74] J. S. Rosenfeld, A. Rosenfeld, Y. Belinkov, and N. Shavit, “A construc-
tive prediction of the generalization error across scales,” in Proceedings
of the Eighth International Conference on Learning Representations
(ICLR), 2020.
[75] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “ZeRO: Memory op-
timization towards training a trillion parameter models,” arXiv preprint
arXiv:1910.02054, 2019.
[76] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanan-
takool, P. Hawkins, H. Lee, M. Hong, C. Young et al., “Mesh-
TensorFlow: Deep learning for supercomputers,” in Advances in Neural
Information Processing Systems (NeurIPS), 2018.
[77] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song,
J. Demmel, K. Keutzer, and C.-J. Hsieh, “Large batch optimization for
deep learning: Training BERT in 76 minutes,” in Proceedings of the
Eighth International Conference on Learning Representations (ICLR),
2020.
[78] NVIDIA AI, “A guide to optimizer implementation for BERT
at scale,” 2019. [Online]. Available: https://medium.com/nvidia-ai/a-
guide-to-optimizer-implementation-for-bert-at-scale-8338cc7f45fd
[79] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov,
“Transformer-XL: Attentive language models beyond a fixed-length
context,” in Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics (ACL), 2019.
[80] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse, “The reversible
residual network: Backpropagation without storing activations,” in
Advances in neural information processing systems (NeurIPS), 2017.
[81] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier,
and M. Auli, “fairseq: A fast, extensible toolkit for sequence modeling,”
in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
[82] R. Frostig, M. J. Johnson, and C. Leary, “Compiling machine learning
programs via high-level tracing,” Systems for Machine Learning, 2018.
[83] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary,
D. Maclaurin, and S. Wanderman-Milne, “JAX: composable
transformations of Python+NumPy programs,” 2018. [Online].
Available: http://github.com/google/jax
[84] N. Rotem, J. Fix, S. Abdulrasool, G. Catron, S. Deng, R. Dzhabarov,
N. Gibson, J. Hegeman, M. Lele, R. Levenstein et al., “Glow: Graph
lowering compiler techniques for neural networks,” arXiv preprint
arXiv:1805.00907, 2018.
[85] Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism
for deep neural networks,” in Proceedings of the 2nd Conference on
Systems and Machine Learning (SysML), 2019.
[86] Z. Jia, J. Thomas, T. Warszawski, M. Gao, M. Zaharia, and A. Aiken,
“Optimizing DNN computation with relaxed graph substitutions,” in
Proceedings of the 2nd Conference on Systems and Machine Learning
(SysML), 2019.
[87] Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken,
“TASO: optimizing deep learning computation with automatic gen-
eration of graph substitutions,” in Proceedings of the 27th ACM
Symposium on Operating Systems Principles (SOSP), 2019.
[88] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan,
L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM:
An end-to-end optimization stack for deep learning,” in 13th USENIX
Symposium on Operating Systems Design and Implementation (OSDI),
2018.
[89] Paul G. Allen School of Computer Science & Engineering, University
of Washington, Amazon Web Service AI team, and DMLC open-source
community, “NNVM compiler: Open compiler for AI frameworks,”
2017. [Online]. Available: https://tvm.apache.org/2017/10/06/nnvm-
compiler-announcement
[90] M. Sivathanu, T. Chugh, S. S. Singapuram, and L. Zhou, “Astra:
Exploiting predictability to optimize deep learning,” in Proceedings of
the Twenty-Fourth International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), 2019.
[91] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou,
N. Kumar, M. Norouzi, S. Bengio, and J. Dean, “Device placement
optimization with reinforcement learning,” in Proceedings of the 34th
International Conference on Machine Learning, 2017.
[92] S. Cyphers, A. K. Bansal, A. Bhiwandiwalla, J. Bobba, M. Brookhart,
A. Chakraborty, W. Constable, C. Convey, L. Cook, O. Kanawi et al.,
“Intel nGraph: An intermediate representation, compiler, and executor
for deep learning,” arXiv preprint arXiv:1801.08058, 2018.
[93] R. Baghdadi, A. N. Debbagh, K. Abdous, B. F. Zohra, A. Renda, J. E.
Frankle, M. Carbin, and S. Amarasinghe, “TIRAMISU: A polyhedral
compiler for dense and sparse deep learning,” in Workshop on Systems
for ML at NeurIPS, 2019.
[94] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S.
Moses, S. Verdoolaege, A. Adams, and A. Cohen, “Tensor com-
prehensions: Framework-agnostic high-performance machine learning
abstractions,” arXiv preprint arXiv:1802.04730, 2018.
[95] R. Lethin, “Polyhedral optimization of tensorflow computation graphs,”
in Sixth Workshop on Extreme-scale Programming Tools (ESPT), 2017.
[96] R. Wei, L. Schwartz, and V. Adve, “DLVM: A modern compiler
infrastructure for deep learning systems,” in Proceedings of the Sixth
International Conference on Learning Representations - Workshop
Track (ICLR), 2018.
[97] L. Truong, R. Barik, E. Totoni, H. Liu, C. Markley, A. Fox, and
T. Shpeisman, “Latte: A language, compiler, and runtime for elegant
and efficient deep neural networks,” in Proceedings of the 37th ACM
SIGPLAN Conference on Programming Language Design and Imple-
mentation (PLDI), 2016.
[98] A. Venkat, T. Rusira, R. Barik, M. Hall, and L. Truong, “SWIRL:
High-performance many-core CPU code generation for deep neural
networks,” The International Journal of High Performance Computing
Applications, vol. 33, no. 6, 2019.
[99] X. Dong, L. Liu, P. Zhao, G. Li, J. Li, X. Wang, and X. Feng, “Acorns:
A framework for accelerating deep neural networks with input sparsity,”
in 2019 28th International Conference on Parallel Architectures and
Compilation Techniques (PACT), 2019.
[100] V. Elango, N. Rubin, M. Ravishankar, H. Sandanagobalane, and
V. Grover, “Diesel: DSL for linear algebra and neural net computations
on GPUs,” in Proceedings of the 2nd ACM SIGPLAN International
Workshop on Machine Learning and Programming Languages, 2018.
[101] B. Van Essen, H. Kim, R. Pearce, K. Boakye, and B. Chen, “LBANN:
Livermore big artificial neural network HPC toolkit,” in Proceedings of
the Workshop on Machine Learning in High-Performance Computing
Environments (MLHPC), 2015.
[102] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc, “Integrated
model, batch, and domain parallelism in training neural networks,” in
Proceedings of the 30th on Symposium on Parallelism in Algorithms
and Architectures (SPAA), 2018.
[103] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ran-
zato, A. Senior, P. Tucker, K. Yang et al., “Large scale distributed
deep networks,” in Advances in neural information processing systems
(NeurIPS), 2012.
[104] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project
Adam: Building an efficient and scalable deep learning training sys-
tem,” in 11th USENIX Symposium on Operating Systems Design and
Implementation (OSDI), 2014.
[105] P. Buchlovsky, D. Budden, D. Grewe, C. Jones, J. Aslanides,
F. Besse, A. Brock, A. Clark, S. G. Colmenarejo, A. Pope et al.,
“TF-Replicator: Distributed machine learning for researchers,” arXiv
preprint arXiv:1902.00465, 2019.
[106] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, “Pipelined back-
propagation for context-dependent deep neural networks,” in Thirteenth
Annual Conference of the International Speech Communication Asso-
ciation (INTERSPEECH), 2012.
[107] Y. Li, M. Yu, S. Li, S. Avestimehr, N. S. Kim, and A. Schwing, “Pipe-
SGD: A decentralized pipelined SGD framework for distributed deep
net training,” in Advances in Neural Information Processing Systems
(NeurIPS), 2018.
[108] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur,
G. R. Ganger, P. B. Gibbons, and M. Zaharia, “PipeDream: generalized
pipeline parallelism for dnn training,” in Proceedings of the 27th ACM
Symposium on Operating Systems Principles (SOSP), 2019.
[109] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee,
J. Ngiam, Q. V. Le, Y. Wu et al., “GPipe: Efficient training of giant
neural networks using pipeline parallelism,” in Advances in Neural
Information Processing Systems (NeurIPS), 2019.
[110] Y. Oyama, T. Ben-Nun, T. Hoefler, and S. Matsuoka, “Accelerating
deep learning frameworks with micro-batches,” in 2018 IEEE Interna-
tional Conference on Cluster Computing (CLUSTER), 2018.
[111] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with
sublinear memory cost,” arXiv preprint arXiv:1604.06174, 2016.
[112] P. Jain, A. Jain, A. Nrusimha, A. Gholami, P. Abbeel, K. Keutzer,
I. Stoica, and J. E. Gonzalez, “Checkmate: Breaking the memory wall
with optimal tensor rematerialization,” in Proceedings of the Third
Conference on Machine Learning and Systems (MLSys), 2020.
[113] NVIDIA, “NVIDIA TensorRT,” 2020. [Online]. Available: https:
//developer.nvidia.com/tensorrt
[114] Facebook, “Caffe2,” 2020. [Online]. Available: https://caffe2.ai/
[115] Microsoft, “ONNX Runtime,” 2020. [Online]. Available: https:
//microsoft.github.io/onnxruntime/
[116] N. Shazeer, “Fast transformer decoding: One write-head is all you
need,” arXiv preprint arXiv:1911.02150, 2019.
[117] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing
multi-head self-attention: Specialized heads do the heavy lifting, the
rest can be pruned,” in Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics (ACL), 2019.
[118] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled
version of BERT: smaller, faster, cheaper and lighter,” in Fifth Work-
shop on Energy Efficient Machine Learning and Cognitive Computing
at NeurIPS 2019, 2019.
[119] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew,
“Deep learning with COTS HPC systems,” in International Conference
on Machine Learning (ICML), 2013.
[120] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Ky-
rola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD:
training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
[121] T. Akiba, S. Suzuki, and K. Fukuda, “Extremely large minibatch SGD:
Training ResNet-50 on ImageNet in 15 minutes,” in NeurIPS 2017
Workshop: Deep Learning at Supercomputer Scale, 2017.
[122] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, “ImageNet
training in minutes,” in Proceedings of the 47th International Confer-
ence on Parallel Processing (ICPP), 2018.
[123] H. Mikami, H. Suganuma, P. U-chupala, Y. Tanaka, and Y. Kageyama,
“Massively distributed SGD: ImageNet/ResNet-50 training in a flash,”
arXiv preprint arXiv:1811.05233, 2018.
[124] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, “Image classifi-
cation at supercomputer scale,” in NeurIPS Systems for ML Workshop,
2018.
[125] N. Dryden, N. Maruyama, T. Benson, T. Moon, M. Snir, and B. Van Es-
sen, “Improving strong-scaling of CNN training by exploiting finer-
grained parallelism,” in 2019 IEEE International Parallel and Dis-
tributed Processing Symposium (IPDPS), 2019.
[126] N. Dryden, N. Maruyama, T. Moon, T. Benson, M. Snir, and B. Van Es-
sen, “Channel and filter parallelism for large-scale CNN training,” in
Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis (SC), 2019.
[127] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou, “Optimizing
memory efficiency for deep convolutional neural networks on GPUs,”
in Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis (SC), 2016.
[128] D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco,
B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger,
F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung,
H. Ltaief, N. Maruyama, C. J. Newburn, , and M. Pericas, “Trends in
Data Locality Abstractions for HPC Systems,” IEEE Transactions on
Parallel and Distributed Systems (TPDS), vol. 28, no. 10, 2017.
[129] T. Grosser, A. Größlinger, and C. Lengauer, “Polly - performing
polyhedral optimizations on a low-level intermediate representation,”
Parallel Processing Letters, vol. 22, no. 4, 2012.
[130] U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam,
A. Rountev, and P. Sadayappan, “Automatic transformations for
communication-minimized parallelization and locality optimization in
the polyhedral model,” in International Conference on Compiler Con-
struction (ETAPS CC), 2008.
[131] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and
S. Amarasinghe, “Halide: A language and compiler for optimizing
parallelism, locality, and recomputation in image processing pipelines,”
ACM SIGPLAN Notices, vol. 48, no. 6, 2013.
[132] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, “Legion: Expressing
locality and independence with logical regions,” in Proceedings of the
International Conference on High Performance Computing, Network-
ing, Storage and Analysis (SC), 2012.
[133] M. Steuwer, T. Remmelg, and C. Dubach, “Lift: A functional data-
parallel ir for high-performance gpu code generation,” in Proceedings
of the 2017 International Symposium on Code Generation and Opti-
mization (CGO), 2017.
[134] C. Lattner, J. Pienaar, M. Amini, U. Bondhugula, R. Riddle, A. Cohen,
T. Shpeisman, A. Davis, N. Vasilache, and O. Zinenko, “MLIR: A
compiler infrastructure for the end of moore’s law,” arXiv preprint
arXiv:2002.11054, 2020.
[135] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst,
“Geometric deep learning: going beyond Euclidean data,” IEEE Signal
Processing Magazine, vol. 34, no. 4, 2017.
[136] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between
capsules,” in Advances in neural information processing systems
(NeurIPS), 2017.
[137] Nvidia, “TensorFloat-32 in the A100 GPU Accelerates AI Training,
HPC up to 20x,” 2020. [Online]. Available: https://blogs.nvidia.com/
blog/2020/05/14/tensorfloat-32-precision-format
[138] Z. Jia, B. Tillman, M. Maggioni, and D. P. Scarpazza, “Dissecting the
Graphcore IPU architecture via microbenchmarking,” 2019.
[139] Cerebras, “Cerebras CS-1 Product Overview,” 2020. [Online].
Available: https://www.cerebras.net/product
