High-Performance Deep Learning via a Single Building Block by Georganas, Evangelos et al.
High-Performance Deep Learning via a Single Building Block
Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat,
Michael Anderson, Greg Henry, Hans Pabst, Alexander Heinecke
Intel Corporation
ABSTRACT
Deep learning (DL) is one of the most prominent branches of ma-
chine learning. Due to the immense computational cost of DL work-
loads, industry and academia have developed DL libraries with
highly-specialized kernels for each workload/architecture, leading
to numerous, complex code-bases that strive for performance, yet
they are hard to maintain and do not generalize. In this work, we
introduce the batch-reduce GEMM kernel and show how the most
popular DL algorithms can be formulated with this kernel as the
basic building-block. Consequently, the DL library-development
degenerates to mere (potentially automatic) tuning of loops around
this sole optimized kernel. By exploiting our new kernel we imple-
ment Recurrent Neural Networks, Convolution Neural Networks
and Multilayer Perceptron training and inference primitives in
just 3K lines of high-level code. Our primitives outperform vendor-
optimized libraries onmulti-node CPU clusters, and we also provide
proof-of-concept CNN kernels targeting GPUs. Finally, we demon-
strate that the batch-reduce GEMM kernel within a tensor compiler
yields high-performance CNN primitives, further amplifying the
viability of our approach.
1 INTRODUCTION
In the past decade, machine learning has experienced an academic
and industrial renaissance where deep learning (DL) has been the
main driving force. More specifically, deep neural networks have
advanced the fields of computer vision, speech recognition, machine
translation and search ranking, and naturally emerge in numerous
applications and scientific domains [1–6].
Three types of neural networks (NN) comprise the most promi-
nent DL workloads by representing 95% of the data-centers’s de-
mands [7]: i) Recurrent Neural Networks (RNN) [8] with the so-
called Long Short-Term Memory (LSTM) [9] networks being the
most popular variation, ii) Convolution Neural Networks (CNN) [1],
and iii) Multi-Layer Perceptrons (MLP) [10, 11]. Additionally, the
contemporary Transformer [12] and BERT [13] workloads com-
putationally involve fully-connected layers which also lie in the
heart of MLP. All these neural networks can be further associ-
ated with two use-cases: training of the underlying NN models
(i.e. learning via back-propagation [14]), and inference (i.e. yielding
predictions) based on trained models. Due to the increase of the
involved datasets’ size and complexity in deep neural networks
(DNN), the training and inference tasks require vast amount of
computation. Therefore, academia and industry have invested into
the development of DL libraries targeting all the aforementioned
workloads on various architectures.
The development of such DL libraries typically embraces one of
the following strategies: (i) the specific workload kernel leverages
coarse-grained, linear algebra library calls, e.g. LSTM cell via large
0
500
1000
1500
2000
2500
3000
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
G
FL
O
PS
Layer ID
ResNet-50 forward convolutions on Skylake-SP 8180 (Minibatch=28)
This work MKL-DNN batched gemm + im2col gemm peak
Figure 1: Performance of ResNet-50 forward convolutions
GEneral Matrix Multiply (GEMM) calls in mkl-dnn [15], convolu-
tions via image-to-column tensor transformations and subsequent
large GEMM calls [16, 17], or (ii) for each workload and use-case
(training/inference) the kernel employs a specialized implementa-
tion that targets the specific algorithm/workload and architecture
at hand, e.g. convolution kernels in mkl-dnn and cuDNN [18]. The
former approach of deploying coarse-grained, linear algebra library
calls provides ease in the DL library development process since
no special kernel development is involved. However it may result
in suboptimal data reuse (e.g. redundant data movements to for-
mat underlying tensor/matrices in the required layout that enables
GEMM calls), and also it is not flexible enough to allow efficient,
fine-grained fusion of other operators. The latter approach of im-
plementing specialized kernels for each DL workload/use-case and
platform/architecture strives for performance but naturally results
in numerous, complex code-bases that are hard to maintain and do
not generalize. For example, the code-base only for convolutions on
CPUs within mkl-dnn consists of ∼36,000 lines of code. Figure 1
shows the performance of various convolution kernel implementa-
tions on a Xeon Skylake-SP 8180 processor. The yellow and green
lines represent implementations adopting strategy (i). More specifi-
cally, the green line shows the performance of convolutions that
leverage small GEMM library calls, whereas the yellow line illus-
trates the performance of an implementation which uses image
to column transformations and batched GEMM [19] library calls.
Both approaches perform far from the machine’s peak with average
efficiencies of 61% and 49% respectively. On the other hand, the
orange line exhibits the performance of the vendor-optimized mkl-
dnn library that follows strategy (ii) with ad hoc, specialized direct
convolution kernels and achieves average efficiency of 81%, being
1.33× and 1.64× faster than the aforementioned generic implemen-
tations. However, this performance comes at the cost of complex,
specialized kernels that do not generalize to different workloads
(e.g. RNN/LSTM/MLP) or different architectures (e.g. GPUs).
In this work, we introduce a new kernel called batch-reduce
GEMM and show how the most popular DL workloads and algo-
rithms (RNN/LSTM, CNN and MLP) can be formulated with this
new kernel as basic building block. The batch-reduce GEMM kernel
essentially multiplies a sequence of input sub-tensor blocks (which
ar
X
iv
:1
90
6.
06
44
0v
2 
 [c
s.L
G]
  1
8 J
un
 20
19
form a batch) and the partial multiplication results are reduced into
a single accumulator/output sub-tensor block. Our new kernel is
flexible enough to accommodate coarse-grained and fine-grained
operations that arise in DL workloads, whereas its semantics lend
themselves to various optimizations (e.g. load/store optimizations
of the result sub-tensor, prefetching of the sub-tensors to be multi-
plied). Also, since the kernel supports operations at fine granularity,
fusion of subsequent operators on the output sub-blocks is inher-
ently efficient. The blue line in Figure 1 shows the performance
of the convolution primitive that leverages our new batch-reduce
GEMM kernel achieving average efficiency of 83%, and outperforms
even the ad hoc, vendor-optimized kernel.
Having a single kernel as basic building-block is transformative:
by implementing and optimizing this single kernel for a given
architecture, the development of DL primitives degenerates to mere
loop tuning around this kernel. Essentially our approach with a
single kernel addresses the issue of combinatorial explosion of low-
level optimization work that is required for each pair <architecture,
DL primitive>. Instead, for each architecture we need to optimize
at low-level only one kernel for all DL primitives.
Furthermore, having a single, highly efficient building-block
enables efficient usage of tensor compiler frameworks. Such frame-
works embrace tensors as first class citizens, and provide specific op-
timization techniques targeting tensor algebra programs. Since DL
primitives are inherently tensor algebra programs, there is a large
amount of ongoing research that leverages specialized tensor com-
pilers for DL workload development (e.g. TVM [20], GLOW [21],
PlaidML [22], MLIR [23]). However, compilers struggle to optimize
small GEMM-flavored loop nests that arise in tensor programs [24].
Contemporary architectures become increasingly complex, and all
the micro-architectural idiosyncrasies have to be considered in or-
der to achieve close-to-peak performance. Our kernel is optimized
for the nuances of the architecture at hand, and serves tensor com-
pilers a robust building block that can be used during the polyhedral
optimization phase of general loop nests [22, 25].
To illustrate the viability and generality of our methodology with
a single kernel, we develop DL primitives which target training and
inference of RNN/LSTM, CNN and MLP workloads in ∼3,000 lines
of high-level C code. Our primitives outperform vendor-optimized
libraries on CPUs. We also provide proof-of-concept design with
a tensor compiler framework by showcasing efficient CNN imple-
mentation in TVM that leverages our batch-reduce GEMM kernel.
Additionally, our methodology provides a pathway for performance
portability; we present exemplary, high-performance CNN kernels
on integrated GPUs. Last but not least, we integrate our primitives
in distributed DL frameworks (Tensorflow [26] and GxM [27]), and
show performance results on two training workloads: Google’s Neu-
ral Machine Translation (GNMT) [5] and ResNet-50 training [28].
These results push the envelope of DL training performance on
CPU clusters. The main contributions of this paper are:
• The introduction of the batch-reduce GEMM kernel along
with its efficient implementation.
• The design and implementation of multi-threaded, high per-
formance DL primitives covering RNN/LSTM, CNN andMLP
inference and training algorithms with batch-reduce GEMM
A0
A1
A2
A3
B0
B1
B2
B3
Cj
Tensor A Tensor B Tensor C 
!" = $ ∗ !" + α ∑ '( ∗ )(*+,(-.
(b)
A i
su
bc
ol
um
n
C
ja
cc
um
ul
at
or
s 
in
 
24
 v
ec
to
r r
eg
is
te
rs
(a)
bcast Bi subrow in 6 vec registers
30 
0
29 28 27 26 25 
24 23 22 2120 19 
18 1716 15 1413 
12 11 10 9 8 7 
65 4 3 2 1 
Figure 2: (a) The batch-reduce GEMM kernel (b) Outer prod-
uct small GEMMmicrokernel
kernel being the basic building block. We need to optimize
at low-level only this kernel for all DL primitives.
• A detailed performance comparison of our DL primitives
with state-of-the-art vendor-optimized libraries.
• Distributedmemory results of LSTM and CNN trainingwork-
loads that leverage our optimized DL kernels and outperform
the best in class results on CPU clusters.
• CNN proof-of-concept results on integrated GPUs and CNN
kernels within TVM that leverage the batch-reduce GEMM
kernel.
2 THE BATCH-REDUCE GEMM KERNEL
In this section, we describe the design and implementation of the
new batch-reduce GEMM kernel which comprises the cornerstone
of our deep learning primitives. Figure 2 (a) illustrates the function-
ality of the new kernel which materializes the operation:
Cj = β ·Cj + α
N−1∑
i=0
Ai · Bi
This kernel multiplies the specified blocks Ai ∈ IRm×k and Bi ∈
IRk×n and reduces the partial results to a block Cj ∈ IRm×n of a
tensor C . Tensors A and B can alias and also the blocks Ai and Bi
can reside in any position in the input tensors A and B. The batch-
reduce GEMM kernel takes the following arguments: (i) two arrays
of pointers to the corresponding blocks Ai and Bi to be multiplied,
(ii) a pointer to the output blockCj , (iii) the number N of the blocks
to be multiplied and (iv) the scaling parameters α and β .
Our kernel differs from the recently introduced batchedGEMM [19]
and its variation strided-batch-gemm [29] that materialize:
Ci = β ·Ci + α · Ai · Bi
These batched routines are missing the reduction functionality and
cannot optimize for the output matrix re-use. Also, the strided-
batch-gemm kernel accesses the Ai and Bi subblocks based on
fixed strides and therefore is more restrictive.
The new batch-reduce GEMM kernel specification naturally
lends itself to a handful of optimizations. First, this kernel min-
imizes the output data movement compared to GEMM or batched
GEMM approaches since the specification dictates the use of a sin-
gle output. Second, the input subblocks that are multiplied can
reside in arbitrary locations within tensors, therefore the kernel
obviates the need for tensor transformations/copy overheads that
are otherwise required in order to obtain long accumulation chains
2
Algorithm 1 The batch-reduce GEMM kernel
Inputs:Ai ∈ IRm×k ,Bi ∈ IRk×ni = 0, ...,N -1,Cj ∈ IRm×n α , β ∈ IR
Output: Cj = β ·Cj + α ∑N−1i=0 Ai · Bi
1: for in = 0 . . .n − 1 with step nb do
2: for im = 0 . . .m − 1 with step mb do
3: acc_regs← loadmb × nb Cj subblockim,in
4: for i = 0 . . .N − 1 with step 1 do
5: for ik = 0 . . .k − 1 with step 1 do
6: ▷ Outer product GEMM microkernel
7: acc_regs += Ai subcolumnim,ik × Bi subrowik ,in
8: Cj subblockim,in ← acc_regs
(e.g. image to column transformations are required to implement
convolutions via large GEMM calls). Such long accumulation chains
are essential in order to achieve high performance. Additionally,
being able to provide arbitrary sub-tensor blocks as inputs provides
ease of integration with blocked/tiled tensor layouts. Last but not
least, since the input Ai and Bi subblocks are part of the interface,
the implementation can trivially prefetch them in order to hide the
latency of data movement.
In order to obtain a high performance implementation of the
batch-reduce GEMM kernel we build upon and extend the open
source LIBXSMM [24] library which leverages JIT techniques and
generates small GEMMS achieving close to peak performance. Al-
gorithm 1 shows the pseudocode of the batch-reduce GEMM kernel.
Lines 1-2 block the computation of the resultCj inmb×nb subblocks.
Once such a subblock is loaded into the accumulation registers (line
3), we loop over all pairsAi , Bi (line 4) and we accumulate into the
loaded registers the products of the correspondingmb ×k subblocks
of Ai with the relevant k × nb subblocks of Bi (lines 5-7). In order
to calculate a partial product of anmb × k subblock of Ai with a
k × nb subblock of Bi , we follow an outer product formulation. In
particular, we multiply anmb × 1 column ofAi with a 1×nb row of
Bi (line 7) and we repeat the analogous outer product computation
for all k columns/rows of the Ai /Bi subblocks (line 5). Figure 2(b)
depicts the outer product microkernel that multiplies anmb × 1
column of Ai with a 1 × nb row of Bi (in this examplemb = 64,
nb = 6). For illustration purposes, we consider that the underlying
architecture has 32 vector registers where each one can hold 16 ten-
sor elements. In this example, accumulation registers 7-30 hold the
partialCj result. First, we broadcast the row of Bi into registers 1-6.
Then, we load in register 0 the first 16 elements of the Ai column
and via 6 fused-multiply-add instructions (FMAs) with registers
1-6 we update the accumulators 7-12. We repeat the analogous
process for the remaining 48 elements of the Ai column and we
update all the accumulation registers. We note here that this is just
one of the methods that LIBXSMM adopts for the outer product
microkernel; LIBXSMM leverages various strategies depending on
the architecture at hand (i.e. vector length) and themb , nb values.
Once the mb × nb subblock of Cj is fully computed for all pairs
of Ai and Bi matrices, the accumulators are stored in the proper
location of Cj (line 8). Finally, we further enhance the microkernel
with software prefetches of Ai and Bi elements aiming to mitigate
cache miss latency overheads.
Optimizing Deep Learning LSTM Topologies on
Intel Xeon Architecture
ISC 2019: Research Poster June 16 – 20, 2019
Long Short-Term Memory (LSTM)
I LSTM is a type of recurrent neural network (RNN) which is
well-suited for processing temporal data
I Unlike traditional RNN, LSTM can handle exploding and
vanishing gradient problems encountered during neural
network training
I LSTM has found applications in language translation, text
generation, handwriting recognition and image captioning
among many others
I Operations in the forward pass of an LSTM cell
it =  (Wi ⇤ xt + Ri ⇤ ht 1 + bi)
ct = tanh(Wc ⇤ xt + Rc ⇤ ht 1 + bc)
ft =  (Wf ⇤ xt + Rf ⇤ ht 1 + bf )
ot =  (Wo ⇤ xt + Ro ⇤ ht 1 + bo)
st = ft   st 1 + it   ct
ht = ot   tanh(st)
Ri
Rc
Rf
Ro
Wo Wf Wc Wi
bi
bc
bf
bo+
+
+
+  
tanh
 
 
mul
+
mul
st 1
tanh
mul ht
st
ht 1
xt
Figure 1: A diagram of an LSTM cell
Typical implementation of LSTM
+ Perform two large GEMMs (W ⇤ x and R ⇤ h) or one larger
GEMM (concatenated WR with concatenated xh)
X Easy to implement – leverage vendor-optimized GEMM
⇥ Weight reuse relies on how the GEMMs are parallelized
and hence may be sub-optimal for GEMMs stemming from
small minibatch size
⇥ Element-wise operations are exposed as bandwidth-bound
kernel (vs in-cache reuse of the GEMM outputs)
Our implementation of LSTM
+ Adopt a “dataflow” based approach for optimizations
I Use blocked layout to better exploit locality and avoid
conflict misses
I Given N = minibatch size, C = input channels and K =
output channels and T = total time steps
I Internally, transform the inputs in blocked format:
I Input: [T ][N][C]! [T ][N/BN][C/BC][BN][BC]
I Hidden: [T ][N][K ]! [T ][N/BN][K/BK ][BN][BK ]
I Weights: [C][4K ]! [4K/BK ][C/BC][BC][BK ]
I Recurrent Weights: [K ][4K ]! [4K/BK ][K/BK ][BK ][BK ]
I BN, BC and BK are blocking factors for N, C and K respectively
I Perform computation with fused time steps
I Amortize cost of blocking
I Optimized weight gradient computation
I Also, allow blocked inputs / weights to be passed directly
from framework
I Useful when performing one time step at a time
I Use JIT batch-reduce GEMM kernels
I Implement optimized blocked GEMM
I Implement fused kernel for elementwise operations (it , ft , ot , ct , st , ht)
I Using Intel AVX512 intrinsics to vectorize
I Use the Intel Short Vector Math Library (SVML) for fast tanh and sigmoid
I Once a block of GEMM is computed, apply element-wise operations
on it while hot in cache
I Our LSTM operators are thread-library agnostic (can use
any of pthreads, OpenMP, C++ threads, Cilk, TBB, etc.)
+ Same optimization principles applied to backward and
weight update passes
+ Our code is available through LIBXSMM at [9]
Integration into TensorFlow
I XsmmFusedLSTM: Implemented a wrapper in TensorFlow
similar to LSTMBlockFusedCell
I Single TensorFlow Op performing all time steps
I Best for performance but may require significant source code change
I XsmmLSTMCell: A wrapper in TensorFlow compatible with
BasicLSTMCell to perform single time step
I Allows use of RNNCell wrappers like MultiRNNCell, DeviceWrapper,
DropoutWrapper and ResidualWrapper
I Allows easy replacement inside application code where fused cell is
not used, e.g. GNMT
I Weights: Uses same layouts as in TensorFlow LSTMCell
I Optimizes block transpose when using XsmmLSTMCell
I Transpose happens outside time step loop when using dynamic rnn
Experimental Setup
All the experiments and measurements are conducted over
following hardware / software configuration
I Machine: Single socket Xeon Platinum 8180 with 28 Cores
(3+ TFLOPS peak), NVIDIA K40m (4+ TFLOPS peak, [6])
I MKL-DNN: from github (commit 3439371) compiled with
icc 19.0.0.117
I LIBXSMM: compiled with icc version 18.0.0 (we observed
slowdown with latest icc/SVML version)
I Stock Tensorflow w/o MKL: v1.12.0 installed using “pip
install tensorflow”
I Tensorflow with MKL: v1.12.0 compiled using gcc 8.3.0
with “-config=mkl”
I GNMT: NMT + GNMT attention (8 layers) with
Minibatch: 168, inter op threads: 1, intra op threads: 28
GNMT end-to-end training with TensorFlow
I First, with few lines of source code change, we replaced
BasicLSTMCell code by XsmmLSTMCell (XsmmLSTM)
I Then, we replaced unidirectional encoder layers with
XsmmFusedLSTM layers (+Fused Encoders)
I Switching to the Fused Cell for decoders is subject to future
due to Tensorflow’s decoder implementation.
I For 8-layer German-to-English model, Perplexity and
Gradient Norm of our implementation follows closely with
reference run and we achieved similar BLEU score to
reference version for 2-layer Vietnam-to-English translation
I Overall, we achieved 1.9⇥ training speed up compared to
original TensorFlow code for 8-layer German-to-English
translation model exceeding Nividia K40m performance
I Major benefits come from improved efficiency for forward
pass GEMMs (1.5x speed up) and 12⇥ reduction in cost
for elementwise operations (from 30% to 2.5%). Out of
24% of backward/update elementwise operations, single
BiasAddGrad takes about 16% of time which reduces to
less than 1% after optimization
0.97
1.22
1.63
1.85
1.70
1x 1.25x 1.68x 1.91x 1.75x
0.0
0.5
1.0
1.5
2.0
Reference Reference XsmmLSTM +Fused Encdr Reference
TF w/o MKL TF with MKL (Compiled from Source) TF with Cuda
Single Socket Intel Xeon Platinum 8180 Nvidia K40m
Ki
lo
 w
or
ds
 p
er
 se
co
nd
WPS Speed Up
Figure 2: GNMT 8-layer Performance (with Turbo Enabled)
In a similar implementation ”OpenSeq2Seq” NVIDIA V100 (5x faster than one Xeon 8180) achieves 4.8x more throughput
2
20
200
2000
20000
1 42 83 12
4
16
5
20
6
24
7
28
8
32
9
37
0
41
1
45
2
49
3
53
4
57
5
61
6
65
7
69
8
73
9
78
0
82
1
1 2 3 4
Pe
rp
le
xi
ty
#Training Iterations (in Hundreds)
#Epochs
Reference (Stock TF) Reference
XsmmLSTM  +Fused Encoders
Figure 3: GNMT Convergence: Perplexity
3.5% 3.7% 3.5%5.7%
5.8% 5.8%
4.7% 4.8% 3.4%
75.3%
49.6%
42.6%
10.7%
10.9%
10.7%
1x 1.34x 1.51x
0%
20%
40%
60%
80%
100%
120%
Reference XsmmLSTM +Fused Encoders
Rest
LSTM Cell
DropOut
Output Proj
Attention
Speedup
Figure 4: GNMT 8-layer: Overall Time Breakup
18.7% 14.3% 11.2%
21.5% 25.1%
22.9%
5.4% 1.1%
1.0%
24.0%
1.0
1.0
5.8%
8.1%
6.5%
75.3%
49.6%
42.6%
0%
10%
20%
30%
40%
50%
60%
70%
80%
Reference XsmmLSTM +Fused Encoders
Others
Bwd EltWise
Fwd EltWise
Bwd GEMM
Fwd GEMM
Figure 5: GNMT: Time Spent inside LSTM Cell
LSTM cell efficiency
I Intel R  Math Kernel Library for Deep Neural Networks
(MKL-DNN) is an open source performance library from
Intel intended for acceleration of deep learning frameworks
on Intel architecture
I To demonstrate that our LSTM cell offers best-in-class
performance, we not only compare to Tensorflow
end-to-end but also to the MKL-DNN LSTM cell which is not
available in Tensorflow at the time of this writing
16
18 1
99
7 23
23
20
19 21
46
11
84 1
55
7 18
27 20
47
20
36
3046
0
500
1000
1500
2000
2500
3000
3500
256 512 1024 2048 4096
GF
LO
PS
Hidden State Size
Minibatch: 168  #Time Steps: 50
LIBXSMM cell MKL-DNN Peak
Figure 6: Forward pass results, Turbo disabled for stability
I LIBXSMM cell is up to 1.4⇥ faster than MKL-DNN LSTM
forward pass
I For large hidden state sizes, the two approaches exhibit
similar performance
I GEMM has cubic complexity while element-wise
operations quadratic! for large sizes the element-wise
operations/bandwidth overhead are less emphasized
10
70 13
21
20
07
19
53
20
28
84
6 12
34 15
38 17
50 20
94
3046
0
500
1000
1500
2000
2500
3000
3500
256 512 1024 2048 4096
GF
LO
PS
Hidden State Size
Minibatch: 168  #Time Steps: 50
LIBXSMM cell MKL-DNN Peak
Figure 7: Backward/weight update pass results, Turbo disabled
I LIBXSMM cell is up to 1.3⇥ faster than MKL-DNN LSTM
backward/weight update pass
Summary
I Implementation of LSTM cell using a “dataflow” approach
instead of large GEMMs
I Maximize locality, weight reuse
I Fuse element-wise operations
I For small/medium sized problems, our implementation of
LSTM forward pass is up to 1.4⇥ faster than the MKL-DNN
cell, while for backward/weight update it is up to 1.3⇥ faster
I For large weight matrices the two approaches have similar
performance
I Cubic GEMM scaling VS quadratic elementwise scaling
I This conclusion may change with GEMM accelerated
hardware
I Dataflow approach is well suited for CPUs
I Coarse-grained parallelization and better locality control
I Modified TensorFlow which invokes our LSTM cell
implementation is shown to perform end-to-end training
attaining identical BLEU score and in as many iterations as
original TensorFlow CPU implementation
I A speed up of 1.9⇥ is achieved using our LIBXSMM LSTM
cell over original TensorFlow implmentation for 8-layer
German-to-English translation model training
Current Research
I Our LSTM cell also supports bfloat16 – a new datatype
introduced by Intel – however, further tuning is needed to
expose its full potential
I Other than LSTM, we have also implemented vanilla RNN
and Gated Recurrent Unit (GRU) (available online on
github); we intend to experiment with these variants of RNN
and report their performance benefits on neural network
training/inference
I Evaluating how and if the proposed JIT batch-reduce
GEMM kernel can be used on GPU or deep learning
focused architectures
References:
[1] Sepp Hochreiter, Jurgen Schmidhuber. Long Short-Term Memory, Neural Computation 9(8): 1735–1780, 1997.
[2] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, Hans Pabst. LIBXSMM: Accelerating small matrix multiplications by runtime code generation, SC 2016: 981–991.
[3] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, CoRR abs/1609.08144, 2016.
[4] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen et al. TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016: 265–283.
[5] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, CoRR abs/1412.3555, 2014.
[6] GNMT – TensorFlow Neural Machine Translation Tutorial, https://github.com/tensorflow/nmt
[7] MKL-DNN – Intel R  Math Kernel Library for Deep Neural Networks, https://github.com/intel/mkl-dnn
[8] TensorFlow – An Open Source Machine Learning Framework for Everyone, https://github.com/tensorflow/tensorflow
[9] LIBXSMM – Library targeting Intel Architecture for specialized dense and sparse matrix operations, and deep learning primitives, https://github.com/hfp/libxsmm
Figure 3: Long-Short Term Memory data flow.
3 DEEP LEARNING KERNELS
Here we describe the design and implementation of our DL primi-
tives that exploit the batch-reduce GEMM kernel. In particular, we
outline how to implement the required algorithms for LSTM [9],
CNN [2] and MLP [11] workloads. We choose performance-optimal
data lay uts which might differ from classic layout specifications in
today’s vendor libraries. However, this is fully acceptable as modern
DL frameworks anyways change tensor layouts during their graph
optimization phase for operator fusion (e.g. Tensorflow’s Grap-
pler). Therefore, freedom of data layout choice is a fundamental
cornerstone to enable high performance through tensor compilers.
We highlight that the subsequent algorithmic descriptions are
agnostic of the compute precision.The only prerequisite in order
to get an implementation with the desired compute precision is to
generate the corresponding batch-reduce GEMM kernel. The re-
sults we present in section 4 are in single precision (FP32), however
we already have implementations supporting the int8 and bfloat16
datatypes (via the new Intel VNNI and bfloat16 instructions re-
spectively) which have been shown to sufficiently cover a range
of DL training and inference workloads [30–32] and are supported
on up-coming CPU architectures. Also, the same algorithms are
applicable for GPUs; in Section 4 we showcase exemplary results
of CNNs on integrated GPUs.
3.1 Long-Short Term Memory (LSTM)
LSTM is a type of RNN which is well-suited for processing tempo-
ral data. Unlike traditional RNN, LSTM can handle exploding and
vanishing gradient problems encountered during neural network
training. LSTM has found applications in language translation, text
generation, handwriting recognition and image captioning. In this
subsection, we focus on the forward propagation to train an LSTM
cell (the forward propagation pass is utilized also for the inference
use-case). The backward by data andweight update kernels required
for the entire training via the back-propagation algorithm [14] are
implemented in an analogous way.
3.1.1 LSTM equations and prior art
Given the batch size N , the sequence length T , the state size
C and hidden state size K , the inputs of the forward propagation
pass in the training process of the LSTM cell are: i) the weight
tensorsWi ,Wc ,Wf ,Wo ∈ IRK×C , ii) the recurrentweightsRi ,Rc ,Rf ,
3
Ro ∈ IRK×K , iii) the input sequence tensor x ∈ IRT×C×N , and iv) the
bias tensors bi , bc , bf , bo ∈ IRK . These tensors are combined based
on the Equations 1-6 and yield the output sequence h ∈ IRT×K×N
and tensor s ∈ IRT×K×N :
it = σ (Wi · xt + Ri · ht−1 + bi ) (1)
ct = tanh(Wc · xt + Rc · ht−1 + bc ) (2)
ft = σ (Wf · xt + Rf · ht−1 + bf ) (3)
ot = σ (Wo · xt + Ro · ht−1 + bo ) (4)
st = ft ◦ st−1 + it ◦ ct (5)
ht = ot ◦ tanh(st ) (6)
In these equations, observe the recurrent relationship between sub-
tensors it , ct , ft ,ot and st of the current time-step t and subtensors
ht−1, st−1 of the previous time-step t − 1. Also, σ () represents the
standard logistic sigmoid function, tanh() is the hyperbolic tangent
function and “◦" stands for element-wise multiplication of tensors.
Figure 3 visualizes the computations and the dependencies involved
in the forward propagation pass of the LSTM network.
Typical implementations of the LSTM cell (e.g. basic LSTM cell
in Tensorflow) stack theWi ,Wc ,Wf ,Wo matrices intoW ∈ IR4K×C
and the Ri , Rc , Rf , Ro into R ∈ IR4K×K and then employ two large
GEMMSW · xt and R · ht−1 to calculate the relevant partial prod-
ucts in Equations 1-4. Moreover, these two large GEMMs can be
further replaced with a single large GEMM call by stackingW ,
R and xt , ht−1 and performing:
[
W R
] · [xTt hTt−1]T . Then, such
an implementation applies the element-wise operations (sigmoid/-
tanh) onto the GEMM results and concludes with the element-wise
operations dictated by Equations 5-6. While such an approach is
easy to implement by exploiting large vendor-optimized GEMM
library calls, the data reuse of the underlying tensors relies on how
GEMMs are parallelized and may be suboptimal for GEMM sizes
stemming from small batch size N . Also, the element-wise opera-
tions are exposed as a bandwidth-bound kernel after the GEMM
which is typically a compute-bound kernel; the outputs of the large
GEMM are not hot in cache (due to limited cache capacity) and as
such the involved tensors have to be re-read from memory for the
element-wise operations.
3.1.2 Optimized LSTM cell via the batch-reduce GEMM kernel
In order to ameliorate the inefficiencies of the large GEMM
approach, we follow a data flow methodology for our optimized
LSTM cell, an approach which has been also explored in previ-
ous work [33]. More specifically, we implement a parallel blocked
matrix GEMM in order to achieve load balance, maximize weight
matrix reuse and fuse the element-wise operations after partial
GEMM blocks are computed and while they are still hot in cache.
Algorithm 2 shows our data-flow implementation. In particular,
the output and the intermediate GEMM results/tensors are divided
into logical bn × bk blocks which constitute the work items. Then
these work items are assigned onto the available threads (line 2) and
subsequently each thread proceeds with its assigned computations.
Lines 6-17 indicate how such a bn × bk block of it is calculated
by a specific thread. First (line 8), the corresponding it block is
initialized with the according bias tensor values from bi . Then, lines
9-12 employ the batch-reduce GEMM kernel described in Section 2
Algorithm 2 Forward propagation pass of LSTM cell
Inputs: Weight tensorsW∗[Kb ][Cb ][bc ][bk ],R∗[Kb ][Kb ][bk ][bk ]
Input sequence x[T ][N ][C], Bias b∗[K], blocking factors bk ,bc ,bn
Outputs: Output sequence h[T ][N ][K] and s[T ][N ][K]
1: Nb ← N /bn
2: Based on thread_id calculate Kb_start , Kb_end , Nb_start and
Nb_end to assign output work items
3: for t = 0 . . .T − 1 do
4: for ibk = Kb_start . . .Kb_end do
5: for ibn = Nb_start . . .Nb_end do
6: ▷ Compute a block of it = σ (Wi · xt + Ri · ht−1 + bi )
7: ik ← ibk · bk , in ← ibn · bn
8: i[t][in ][ik ] ← bi [ik ]
9: for ibc = 0 . . .Cb − 1 do
10: Aptrs [ibc ] = &Wi [ibk ][ibc ][0][0]
11: Bptrs [ibc ] = &x[t][in ][ibc · bc ]
12: batchreduce_gemm(Aptrs ,Bptrs ,&i[t][in ][ik ],Cb )
13: for ibc = 0 . . .Kb − 1 do
14: Aptrs [ibc ] = &Ri [ibk ][ibc ][0][0]
15: Bptrs [ibc ] = &h[t − 1][in ][ibc · bk ]
16: batchreduce_gemm(Aptrs ,Bptrs ,&i[t][in ][ik ],Kb )
17: i[t][in ][ik ] ← σ (i[t][in ][ik ])
18: ▷ Ditto for blocks of ct , ft ,ot via Equations 2-4
19: s[t][in ][ik ] ← f [t][in ][ik ] ◦ s[t − 1][in ][ik ]+
20: i[t][in ][ik ] ◦ c[t][in ][ik ]
21: h[t][in ][ik ] ← o[t][in ][ik ] ◦ tanh (s[t][in ][ik ])
and calculate the contributionWi ·xt to the current block of it . More
specifically, lines 9-11 prepare the arguments of the batch-reduce
GEMMcall by calculating the pointers of the requiredWi and xt sub-
blocks and storing them in auxiliary arrays Aptrs and Bptrs . Then,
line 12 calls the batch-reduce GEMM kernel which accumulates the
partial products from theWi and xt sub-blocks onto the current
it block. We emphasize here that our batch-reduce GEMM allows
small blocking values bn and bk to be used since: (a) the small
GEMM microkernel runs close to peak even for small dimensions
and (b) it avoids the redundant load/stores of the accumulators that
arise from the batch-reduce operation and would cripple the overall
performance; instead it keeps the accumulation chain in-registers
for as long as possible (see Algorithm 1). In an analogous way, lines
13-16 calculate the contribution Ri · ht−1 to the current block of it
as shown in Equation 1. Subsequently, line 17 applies the element-
wise operation (sigmoid in this case) onto the just-computed block
of it . Since the block of it is hot in cache, the application of the
element-wise operation does not incur any data movement from
memory. The same technique is used to calculate the corresponding
sub-blocks of ct , ft and ot (omitted in Algorithm 2 for simplicity).
It is noteworthy that the ct , ft and ot computations reuse the same
entries of xt and ht−1 from cache since these tensor entries were
also used for the computation of it . Finally, lines 19-21 conclude the
computation of the corresponding blocks of the output tensors ht
and st based on the element-wise operations dictated by Equations
5-6. After all the work items assigned to the available threads for a
given time-step are fully computed, all the threads synchronize and
proceed to the next time-step (loop at line 3). Such a synchronization
4
is necessitated because all the output entries ht of the current time-
step are required in the next time-step iteration.
We also note here that the way the work items are processed
by the threads affects the data reuse of the weight tensorsW∗ and
R∗. In particular, since work items are processed by iterating the
“mini batch" dimension first (loop at line 5), the corresponding slices
of the weight tensorsW∗ and R∗ are reused Nb_end − Nb_start −
1 times from cache (potentially from mid-level cache). Another
optimization that is not shown in Algorithm 2 for simplicity is
further cache blocking of the batch-reduce loops at lines 9 and 13.
In particular, if the weight tensors at hand have large state sizes C
and K , we block these dimensions in order to fit the corresponding
weight tensors slices in cache. In such a case, the algorithm would
have yet another loop just after the time-step loop (at line 3) which
blocks the batch-reduce loops at lines 9 and 13.
Last but not least, Algorithm 2 carefully chooses the layouts
of the corresponding tensors. The weight tensorsW∗ and R∗ are
conceptually 2 dimensional tensors, whereas our implementation
employs a blocked layout (with Cb = C/bc and Kb = K/bk ) :
W∗[C][K] →W∗[Kb ][Cb ][bc ][bk ], R∗[K][K] → R∗[Kb ][Kb ][bk ][bk ]
Such a blocked layout exposes better locality (i.e. the corresponding
accesses of weight sub-blocks are non-strided with such a layout)
and more importantly avoids cumbersome conflict cache misses.
Typically the C and K values are large powers of 2 resulting in
strided accesses (in the case of the non-blocked format) which
are known to cause conflict misses in contemporary associative
cache designs. However, our blocked format bypasses this issue
by laying out the weight tensors in a format allowing non-strided
accesses in the GEMM microkernel. In regard to the activation
tensors, we keep the original non-blocked three dimensional format
x[T ][N ][C], h[T ][N ][K] and s[T ][N ][K] since strided accesses are
barely an issue for the “B" matrix in the GEMM microkernel (we
also confirmed this by experimenting with a blocked format for the
activation tensors). Note that even though our LSTM cell internally
uses a blocked layout for the weight tensors, this does not need
to be exposed at the application level; instead, we can transform
the weight tensors into the desired blocked layout in the beginning
of the algorithm and such a transformation overhead is amortized
among the multiple time-steps in the LSTM cell.
Finally, we would like to briefly discuss the importance of a
single, architecture-specific optimized kernel. All the functionalities
in the LSTM cell (forward propagation/backward by data/weight
update pass) utilize as building block just our batch-reduce GEMM
kernel. The development/parallelization/optimization of the LSTM
cell then merely degenerates to tuning/calibrating the surrounding
loops around this microkernel – a process which can be automated
to some extent or even implemented in different programming
frameworks/tensor compilers like TVM [20] or PlaidML [22].
3.2 Convolution Neural Networks (CNN)
Convolutional Neural networks (CNN) consist of layers with multi-
ple neurons connected by weights, and have found applications in
image recognition, semantic segmentation, autonomous driving and
medical imaging. Similar to the LSTM cell, our CNN primitives im-
plement all the kernels required for training via back-propagation.
In this section, we describe only the forward propagation kernels
Output O[N][K][P][Q]
• P, Q: spatial dims
• K : feature maps
…
K
S
=✻
Input I[N][C][H][W]
• H, W: spatial dims
• C : feature maps
Weight W[K][C][R][S]
• R, S: spatial dims
• C,K : feature maps
C
R
… …
C
H
…
W
…
N
K
P
…
Q
…
N
… …
Figure 4: Convolution Neural Network (CNN) tensors
Algorithm 3 CNN forward propagation loops
1: Cb = C/bc , Kb = K/bk , Qb = Q/bq
2: for n = 0 . . .N − 1 do
3: for kb = 0 . . .Kb − 1 do
4: for cb = 0 . . .Cb − 1 do
5: for oj = 0 . . . P − 1 do
6: for oib = 0 . . .Qb − 1 do
7: oi = oib · bq , ii = str · oi, ij = str · oj
8: for r = 0 . . .R − 1 do
9: for s = 0 . . . S − 1 do
10: ▷ Small GEMM loops
11: for k ′ = 0 . . .bk − 1 do
12: for oi ′ = 0 . . .bq − 1 do
13: for c ′ = 0 . . .bc − 1 do
14: oi ′′ = oi + oi ′
15: ij ′ = ij + r , ii ′ = ii + str · oi ′ + s
16: O[n][kb ][oj][oi ′′][k ′] +=
17: W [kb ][cb ][r ][s][c ′][k ′] · I [n][cb ][ij ′][ii ′][c ′]
which are used as-is for inference. The implementation of back-
ward by data and gradient update kernels follows the same design
principles as the forward propagation.
3.2.1 Direct convolution loops and prior art
The values assigned to a neuron are usually called activations.
Both activations and weights are represented with multidimen-
sional tensors as illustrated in Figure 4. The input activation ten-
sors are convoluted with the weight tensors to yield the output
activation tensors. The activation tensors conceptually consist of 4
dimensions: the minibatch size N , the number of feature maps C
and the spatial dimensions H andW . We denote the input tensor
dimensions with N , C , H andW while the corresponding output
tensor dimensions are N , K (output feature maps), P and Q (output
spatial dimensions). The weight tensor is conceptually character-
ized also by 4 dimensions: the feature map dimensions C , K and
the spatial dimensions R and S .
Algorithm 3 shows a basic implementation of the forward propa-
gation loops where the feature map loops (lines 3 and 4) are blocked
by factorsbk andbc respectively and theQ loop (output tensor pixel
dimension) is blocked by a factor bq . The input tensor pixels can
be also accessed in a strided fashion via a stride str . Additionally,
the tensors employ a blocked layout format which has been shown
5
Algorithm 4 CNN forward pass via batch-reduce GEMM
1: Cb = C/bc , Kb = K/bk , Qb = Q/bq
2: for n = 0 . . .N − 1 do
3: for kb = 0 . . .Kb − 1 do
4: for cb = 0 . . .Cb − 1 with step Bc do
5: for oj = 0 . . . P − 1 do
6: for oib = 0 . . .Qb − 1 do
7: oi = oib · bq , ii = str · oi, ij = str · oj, i = 0
8: ▷ Prepare batch-reduce GEMM arguments
9: for r = 0 . . .R − 1 do
10: for s = 0 . . . S − 1 do
11: for c = 0 . . . Bc − 1 do
12: Aptrs [i] = &W [kb ][cb + c][r ][s][0][0]
13: Bptrs [i++] = &I [n][cb +c][ij+r ][ii+s][0]
14: Out = &O[n][kb ][oj][oi][0]
15: batchreduce_gemm(Aptrs ,Bptrs ,Out ,R · S · Bc )
to exhibit better locality properties for direct convolutions [27]:
Input tensor : I [N ][Cb ][H ][W ][bc ]
Weiдht tensor : W [Kb ][Cb ][R][S][bc ][bk ]
Output tensor : O[N ][Kb ][P][Q][bk ]
By adopting such a blocked layout and given the loop ordering of
Algorithm 3, the three innermost loops (lines 11-17) form a small
GEMMof abk×bc weight sub-tensor with abc×bq input sub-tensor
yielding abk ×bq output subtensor (note that the leading dimension
of the input sub-tensor is str ·bc ). The authors of previous work [27]
identified this property; however, they implemented a specialized
convolution kernel because:
• they optimize load/store of the output O in case of R, S > 1
and in case the input feature map loop (line 4) is reordered
as the innermost loop in order to maximize output reuse.
• they apply additional pixel blocking when Q = bq and this
value is smaller than the FMA latency of the architecture at
hand.
In the following subsection, we describe how we address these
issues with our new batch-reduce GEMM kernel.
3.2.2 Optimized convolutions via the batch-reduce GEMM kernel
The introduction of the batch-reduce GEMM kernel obviates
the need for a specialized convolution kernel. More specifically,
the batch-reduce GEMM kernel inherently optimizes load/store of
the output O in case of R, S > 1 and in case the input feature map
loop is reordered as the innermost loop. By properly selecting the
sub-tensors of weights/inputs to be multiplied and reduced onto an
O sub-tensor, the accumulation takes place entirely in registers as
described in Section 2. In order to tackle the second issue regarding
the case with Q = bq and bq being smaller than the FMA latency,
we make the following observation: the small GEMM microkernel
utilizes bq × (bk/VLEN ) accumulator registers where VLEN is the
vector length of the architecture at hand. Therefore, if bq is small
then we accordingly increase bk such that bq ×(bk/VLEN ) is larger
than the FMA latency.
Algorithm 4 shows how to implement the convolution loops us-
ing our new batch-reduce GEMMkernel. Note that the input feature
Input
Layer
Hidden Layers Output 
Layer
C Neurons K Neurons
0
C-1
j
0
K-1
Wij
i
Figure 5: A Mulitlayer Perceptron (MLP) topology
map loop (line 4) is blocked by a factor Bc and these Bc iterations
are brought into the batch-reduce call in order to further increase
the output register reuse. The loops at lines 9-11 prepare the argu-
ments of the batch-reduce GEMM call by calculating the pointers
to the weight and input sub-tensors that need to be multiplied and
reduced onto a sub-tensor in O . In this way, we optimize the out-
put sub-tensor O register reuse: without the batch-reduce kernel
we would have to load/store the output registers (R × S × Bc ) − 1
additional times. Another optimization involves the case of convo-
lutions with R = S = 1 and unit stride (i.e. str = 1). In such a case,
the input spatial dimensions (loops 5 and 6) are accessed sequen-
tially and as such one can consider that the spatial dimensions are
collapsing into a single dimension allowing even more aggressive
blocking parameter values bq .
In regard to the parallelization of Algorithm 4, we observe that
the mini-batch dimension (line 2), the output feature map blocks
(line 3) and the output pixels blocks (lines 5 and 6) define N ×Kb ×
P × Qb independent tasks. Typically we opt to divide work first
based on the mini-batch dimension since the weight tensors could
be reused by multiple threads from shared caches. If we don’t have
sufficient work just based on the mini-batch size, then we consider
all N × Kb × P × Qb tasks and they are assigned to the available
threads in a block fashion. In case our convolution at hand involves
large weights, it may be better to assign tasks by starting from the
feature map dimension Kb . In this way, each thread will touch only
a part of the large weight tensor which could be further blocked
for a specific cache-level. We implemented all these parallelization
strategies and use the most suitable one based on the convolution
layer specifications and the available number of threads.
Our backward by data/weight update kernels with batch-reduce
GEMM leverage previous work [27]. The authors in [27] show that
only slight modifications to the forward kernel are required in order
to implement the back-propagation kernels, as they can be mapped
through linear index transformations into the forward convolution
loop nest (“dual convolutions"). The data reuse optimizations/par-
allelization tasks then simply translate to tuning the surrounding
loops as shown in Algorithm 4. In Section 4.3, we show results of a
proof-of-concept design where we develop CNN primitives within
a tensor compiler framework via our batch-reduce GEMM kernel.
We also show how the same design principles are applicable for
integrated GPUs, yielding high performance convolution kernels.
3.3 Multilayer Perceptron (MLP)
Multilayer perceptrons (MLP) comprise a class of feed-forward
artificial neural networks that are widely used for classification
tasks, brain modeling, time series prediction, character recognition
6
Algorithm 5 Forward pass of Fully Connected Layer
Inputs: WeightW [Kb ][Cb ][bc ][bk ], Input X [Nb ][Cb ][bn ][bc ]
Outputs: Output Y [Nb ][Kb ][bn ][bk ]
1: Based on thread_id calculate Kb_start , Kb_end , Nb_start and
Nb_end to assign output work items
2: for ibn = Nb_start . . .Nb_end do
3: for ibk = Kb_start . . .Kb_end do
4: ▷ Prepare batch-reduce GEMM arguments
5: for ibc = 0 . . .Cb − 1 do
6: Aptrs [ibc ] = &W [ibk ][ibc ][0][0]
7: Bptrs [ibc ] = &X [ibn ][ibc ][0][0]
8: Out = &Y [ibn ][ibk ][0][0]
9: batchreduce_gemm(Aptrs ,Bptrs ,Out ,Cb )
10: Y [ibn ][ibk ][0][0] ← д(Y [ibn ][ibk ][0][0])
and data compression. An MLP consists of (at least three) fully
connected layers of neurons as illustrated in Figure 5: the topology
starts with an input layer, followed by a number of hidden layers
which conclude to the output layer. Each neuron in the topology
uses a non-linear activation function. For the rest of this section
we consider the optimization of the fully connected layers since
they constitute the cornerstone of MLP. The fully-connected layers
also lie in the heart of the modern Transformer [12] and BERT [13]
workloads. We dive into the details of the forward propagation
algorithm of the MLP training process (also used for inference); we
also implemented all the required kernels of the back-propagation
training in an analogous fashion.
3.3.1 Fully Connected layers and prior art
The dashed box in Figure 5 illustrates two fully connected layers
consisting ofC andK neurons respectively. A neuron i from the first
layer is connected to a neuron j in the second layer with a weight
Wi j . Mathematically, an input layer x ∈ IRC is mapped to an output
layer y ∈ IRK via the relation y =W · x , whereW ∈ IRK×C is the
weight tensor of the connections between the neurons. During the
training process, N multiple inputs (N is the so-called mini-batch
size) are grouped together yielding the equation Y =W · X with
W ∈ IRK×C , X ∈ IRC×N and Y ∈ IRK×N . After the output tensor Y
is computed, a non-linear activation function д() is applied on it.
Observe that by increasing the mini-batch N , we fundamentally
increase the weight tensor reuse. Typical implementations of Fully
Connected layers leverage a large GEMM call and they apply the
activation functions onto the GEMM outputs. Even though such
an approach is straightforward to implement, its performance can
be underwhelming for three reasons: i) typical high-performance
GEMM library calls internally perform packing of sub-matrices
to ameliorate TLB misses and cache conflict misses [34], ii) the
multi-threaded implementation of GEMM with shapes arising from
small mini-batch values N may not fully exploit the available data
reuse, and iii) in case of large matrices that do not fit in cache, the
activation function application is exposed as a bandwidth-bound
kernel which decays the overall performance. In the next subsection,
we describe how our implementation of Fully Connected layers via
the batch-reduce GEMM kernel addresses all these issues.
3.3.2 Fully Connected layers via the batch-reduce GEMM kernel
Algorithm 5 shows the implementation of the forward propa-
gation in the training process of fully connected layers. First, we
highlight the blocked tensor layout; all the 2 dimensional tensors
are transformed into 4 dimensional ones by blocking the mini-batch
dimension N with a factor bn and the tensor dimensions C and K
with blocking factors bc and bk respectively. Such a blocked layout
addresses issue (i) mentioned in the previous subsection by expos-
ing better locality and avoiding large, strided sub-tensor accesses
which are known to cause TLB misses and cache conflict misses in
case the leading dimensions are large powers of 2.
Our algorithm first assigns the output sub-tensor blocks to the
available threads (line 1) and every thread then for each assigned
output Y block calculates the addresses of theW and X sub-tensor
blocks that need to be multiplied and reduced onto the current
output Y block (lines 5-7). Note that our JIT-ed kernel allows small
values of blocking values bn to be used, and as such we can extract
parallelism from the mini-batch dimension even for small values
of N . By following the loop ordering of Algorithm 5, a weight
sub-tensor is reused by each thread Nb_end − Nb_start − 1 times,
potentially from some level of cache. Also, multiple threads are able
to read weights from shared caches when the assigned Y blocks cor-
respond to the same subspace of the K dimension. Finally, in case a
weight sub-tensor does not fit in the targeted/desired level of cache,
we can further block loops at lines 3 and 5. These cache blocking
techniques in combination with the flexible blocking factors bn , bc
and bk which yield high performance micro-kernels, address the
data reuse issue (ii) mentioned in the previous subsection.
Finally, once the arguments of the batch-reduce GEMM have
been calculated, we perform the batch-reduce GEMM call (line
9) and while the output sub-tensor block Y is still hot in cache
we apply on it the relevant activation function (line 10). In this
way, we ensure that the application of the activation function takes
place when the data are still hot in cache and it does not incur
any additional data movement from memory, addressing issue (iii)
from the previous subsection. Once again, the development of the
Fully Connected primitive follows the same recipe as the LSTM
and CNN primitives. Therefore, the loops surrounding the batch-
reduce GEMM kernel can be automatically optimized with a tensor
compiler/infrastructure.
4 PERFORMANCE RESULTS
In subsection 4.1, we evaluate the performance of our DL kernels.
Then, in subsection 4.2, we present distributed memory results
on two state of the art workloads, namely Google’s Neural Ma-
chine Translation (GNMT) which corresponds to LSTM training
and ResNet-50 which is representative of CNN training. Finally, in
subsection 4.3, we show a couple of proof-of-concept results that
highlight the generalizability of our approach. More specifically, we
show CNN kernel results on integrated GPUs and conclude with
CNN kernel results that are generated by TVM, both leveraging
batch-reduce GEMM kernel as their basic building block.
4.1 Performance evaluation of our DL kernels
Since we use a JIT-ing methodology for the batch-reduce GEMM
kernel, we can virtually run on every platform supporting SSE, AVX,
7
0
20
40
60
80
100
0
500
1000
1500
2000
2500
3000
256 512 1024 2048 4096
%
 o
f p
ea
k
G
FL
O
PS
forward pass (N=168, T=50, 28 cores)
This work MKL-DNN efficiency
0
20
40
60
80
100
0
500
1000
1500
2000
2500
3000
256 512 1024 2048 4096
%
 o
f p
ea
k
G
FL
O
PS
backward/update pass (N=168, T=50, 28 cores)
This work MKL-DNN efficiency
Hidden State Size Hidden State Size
Figure 6: Performance of LSTM cell: (Left) Forward propagation and (Right) backward by data and weight update pass.
batch-reduce GEMM Elementwise Tensor
LSTM pass % of total GFLOPS operations reformatting
fwd 93.3% 2550 5.3% 1.4%
bwd & upd 91.2% 2350 5.3% 3.5%
Table 1: Breakdown of LSTM cell performance (C=K=1024).
AVX2 and AVX-512 instructions. All the experiments presented in
this subsection are conducted on a Skylake-SP (SKX) 8180 processor
with 28 cores, 96 GBDDR4 2666 main memory at 2.3 GHz (AVX512)
Turbo at 205W TDP. The stream triad of a single socket is 105GB/s.
For the experiments we used all 28 cores with turbo disabled
(i.e. AVX-512 base frequency at 1.7 GHz) in order to get stable mea-
surements. With such a setup, the peak of the machine is ∼3,050
GFLOPS (single precision). All the experiments were performed 400
times and we report the average timing; due to careful configuration
of our platform (i.e. tick-less Linux kernel, core pinning, turbo dis-
abled) the run-to-run variation is within 3%. For our work we used
the Intel compilers (version 18.0.0). For performance comparisons,
we used the latest version of MKL-DNN (version 0.9).
4.1.1 Performance evaluation of LSTM cell
Figure 6 (Left) illustrates the performance of the forward propa-
gation algorithm in the LSTM cell that is described in subsection 3.1.
In this experiment, we fix N = 168 (mini-batch), T = 50 (sequence
length), and we vary the hidden state size K which is equal to the
input state size (i.e. C = K). The blue bars represent the perfor-
mance in GFLOPS (see Left y-axis) of our kernels. We observe that
even in the case of small C and K , our LSTM cell runs at ∼60%
of peak (see Right y-axis), whereas for larger weight tensors the
activation-tensor reuse is larger and consequently the kernels run
at ∼70% of peak. In Table 1 (row labeled “fwd”) we provide more
details regarding how the time is spent within the LSTM cell for
the case with C = K = 1024. During the forward pass, 93.3% of the
time is spent in the batch-reduce GEMM kernel which runs at 2550
GFLOPS or equivalently at 84% of peak. Then, 5.3% of the execution
time is spent for the elementwise operations described in subsec-
tion 3.1.2. The rest 1.4% is spent in reformatting the weight tensors
to take advantage of the blocked format (see subsection 3.1.2).
Figure 6 (Right) exhibits the performance of the remaining two
passes in the LSTM training process, namely backward propagation
and weight update pass (henceforth called “bwd” and “upd” respec-
tively). The performance follows the same trend as the forward
propagation, i.e. with larger weight tensors, the overall efficiency is
closer to peak due to more re-use of the activation tensors. Notably,
the overall efficiency is diminished compared to the forward propa-
gation; by inspecting the time breakdown at Table 1 (row labeled
ID C K H W R S str ID C K H W R S str
1 3 64 224 224 7 7 2 11 512 1024 28 28 1 1 2
2 64 256 56 56 1 1 1 12 512 256 28 28 1 1 2
3 64 64 56 56 1 1 1 13 256 256 14 14 3 3 1
4 64 64 56 56 3 3 1 14 256 1024 14 14 1 1 1
5 256 64 56 56 1 1 1 15 1024 256 14 14 1 1 1
6 256 512 56 56 1 1 2 16 1024 2048 14 14 1 1 2
7 256 128 56 56 1 1 2 17 1024 512 14 14 1 1 2
8 128 128 28 28 3 3 1 18 512 512 7 7 3 3 1
9 128 512 28 28 1 1 1 19 512 2048 7 7 1 1 1
10 512 128 28 28 1 1 1 20 2048 512 7 7 1 1 1
Table 2: ResNet-50 layers specifications
“bwd & upd”) we observe that larger fraction of the overall time
is spent in tensor reformatting. This is expected because bwd and
upd passes require algorithmically additional weight and activa-
tion tensor transposes [14]. Also, the batch-reduce GEMM runs at
2350 GFLOPS or equivalently at 77% of peak, which is a bit lower
than the efficiency of the one achieved in the forward pass. This
is a result of different tensor shapes in the “upd” pass, where the
reduction dimension of GEMM becomes the mini-batch dimension
and typically it is smaller than C/K which constitute the GEMM
reduction dimensions in forward pass.
In Figure 6, we also provide performance comparison of our
LSTM cell with the vendor-optimized LSTM cell within MKL-DNN
(orange bars). For small to medium problem sizes, our LSTM cell is
faster than the MKL-DNN cell in the range of 1.2-1.3× for forward
propagation and 1.1-1.7× for the bwd/upd pass. This is a result
of the adopted “data-flow” approach described in subsection 3.1
that leverages our batch-reduce GEMM kernel: the elementwise
operations are naturally fused within the GEMM operations which
run at high efficiency. For larger problem sizes, the overall cost
of the GEMM operation dominates the entire computation and as
such the elementwise operations are negligible. This is expected
since the GEMM computation cost scales cubically compared to the
quadratic scaling of the elmentwise operations. Therefore, for large
problem sizes a coarse grained approach like the one described in
subsection 3.1.1 yields good performance. It is worth mentioning
that in the following subsection 4.2 where we present distributed
memory GNMT training results, the involved LSTM corresponds to
the case withC=K=1024 in Figure 6, where our code is 1.26× faster
than MKL-DNN for all training passes (for N=168).
4.1.2 Performance evaluation of CNN kernels
We conducted experiments with the ResNet-50 topology which
yields state of the art results in image recognition tasks [28]. The
convolution layers within the ResNet-50 topology cover a wide
variety of parameters/configurations (e.g. filter dimensionality and
sizes, input sizes, strided convolutions) and can be seen at Table 2.
In this Table we assign to each convolution layer an ID that is used
8
0
500
1000
1500
2000
2500
3000
3500
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
G
FL
O
PS
Layer ID
ResNet-50 forward convolutions (N=28)
This work MKL-DNN peak
0
500
1000
1500
2000
2500
3000
3500
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Layer ID
ResNet-50 backward by data convolutions (N=28)
This work MKL-DNN peak
Figure 7: Performance of ResNet-50 convolutions: (Left) Forward propagation and (Right) backward by data pass.
0
500
1000
1500
2000
2500
3000
3500
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
G
FL
O
PS
Layer ID
ResNet-50 weight update convolutions (N=28)
This work MKL-DNN peak
Figure 8: Performance of ResNet-50 weight update pass.
as identifier in the performance plots. Also, for the remaining of
this paper we will use the term weighted efficiency when presenting
ResNet-50 results; each layer i of Table 2 requires Fi flops, takes ti
seconds to be computed, and is represented ni times in the entire
topology (which has 53 layers in total). The weighted efficiency of
the entire topology is given by: (∑52i=0 ni · Fi )/(∑52i=0 ni · ti ).
Figures 7 and 8 show the performance of the ResNet-50 convolu-
tions with mini-batch size N = 28. The blue bars in Figure 7 (Left)
represent the achieved performance (in GFLOPS) of the forward
(FWD) propagation algorithm described in subsection 3.2.2 that
leverages the batch-reduce GEMM kernel. The weighted efficiency
of the FWD convolutions within the ResNet-50 topology is 83% of
peak. More specifically, the convolutions with large spatial filters
(e.g. R=S=3 in convolutions with IDs 4, 8, 13, 18) run at ∼90% of
peak since they inherently have more input and output tensor reuse
than the convolutions with R=S=1 which run at ∼80% of peak. No-
tably, layer with ID 2 runs at 65% of peak since it has large output
spatial and feature map dimensions and as such its performance
is bound by the write bandwidth of our system. By comparing
the performance of our kernels to MKL-DNN (orange bars) we
observe similar trends. The MKL-DNN library exhibits weighted
efficiency of 81% of peak for FWD convolutions and as such it is
2.5% slower than our work. This result highlights the effectiveness
of our approach with a single kernel as basic building block: our
convolutions consist of just ∼1500 lines of code (for all training
passes) whereas the convolution portion of the MKL-DNN library
is ∼36000 lines of code since it leverages ad hoc, specialized kernels
(e.g. ad hoc optimization of the direct convolution loops, different
approaches/code generation for various R and S values).
Figures 7 (Right) and 8 exhibit the performance of the convolu-
tion kernels in the remaining training passes, namely backward by
data (BWD) and weight update (UPD). Our kernels have weighted
efficiency 80% and 73.6% for the BWD pass and the UPD pass re-
spectively. Similarly to the convolutions in the FWD pass, the layers
with large spatial weight dimensions show better performance than
the ones with 1 × 1 spatial dimensions since the former have better
input and output tensor reuse properties. Also, we note that the
efficiency of the UPD kernels is ∼10% lower than the efficiency of
FWD/BWD kernels. This is a consequence of the weight tensor
reduction which is required in the weight update algorithm in or-
der to maximize the input and output tensor data movement [27].
For comparison, the MKL-DNN BWD and UPD kernels illustrate
weighted efficiencies of 78.9% and 68.9% respectively, and are 1%
and 7% slower than our kernels. In subsection 4.2 we integrate our
kernels in the GxM distributed framework and improve the best in
class performance of ResNet-50 training on CPUs.
4.1.3 Performance evaluation of Fully Connected Layers
Figure 9 shows the performance of the Fully Connected layers
which are the cornerstone of the MLP workload. In these experi-
ments we fix the mini-batch size N=1344 and we vary the dimen-
sions of the weight tensors. For each configuration, we show results
for the forward propagation (FWD), backward by data pass (BWD)
and weight update pass (UPD). We observe that our approach (blue
bars) with the fine-grained batch-reduce GEMM kernel shows for
the smaller configuration (C=K=256) efficiencies in the range 57%-
73%, for the medium weight sizes (C=K=512) 55%-94% and for the
larger configuration (C=K=1024) the efficiencies are 67%-82%. In all
configurations, we observe that the BWD kernels’s performance
deteriorates compared to the equivalent FWD kernels. This is the
case because the BWD kernels require a weight transpose [14]. The
overhead of this weight transpose is more emphasized in the cases
with small C/K values while for the cases with large C/K values
the cost of the GEMM kernel dominates the overall runtime and as
such the transpose cost in negligible (see case with C=K=1024). In
regard to the UPD kernels, we also observe that for smaller weight
tensors the performance is lower than the corresponding FWD
kernels. This is due to the limited parallelism that is available in
such cases. More precisely, the FWD pass employs parallelism in
the N /K dimensions (see Algorithm 5), the BWD pass in the N /C di-
mensions while the UPD pass in theC/K dimensions. Consequently
it is more challenging to extract sufficient parallelism within the
configurations with small C/K values.
Moreover, Figure 9 shows the performance of the Fully Con-
nected layers within the MKL-DNN library (orange bars). These
kernels use the coarse-grained approach (i.e. a single large GEMM
call) as described in subsection 3.3.1. Considering the average ef-
ficiencies of all MKL-DNN kernels (FWD, BWD and UPD), the
smallest configuration achieves 55% of peak, the medium configu-
ration runs at 56% of peak and the largest test case attains 70% of
9
0
20
40
60
80
100
0
500
1000
1500
2000
2500
3000
3500
FWD BWD UPD
%
 o
f p
ea
k
G
FL
O
PS
Fully Connected (N=1344, C=256, K=256)
This work MKL-DNN efficiency
0
20
40
60
80
100
0
500
1000
1500
2000
2500
3000
3500
FWD BWD UPD
%
 o
f p
ea
k
G
FL
O
PS
Fully Connected (N=1344, C=512, K=512)
This work MKL-DNN efficiency
0
20
40
60
80
100
0
500
1000
1500
2000
2500
3000
3500
FWD BWD UPD
%
 o
f p
ea
k
G
FL
O
PS
Fully Connected (N=1344, C=1024, K=1024)
This work MKL-DNN efficiency
Figure 9: Performance of Fully Connected Layers. Bars correspond to Left y-axis / efficiency corresponds to Right y-axis.
80
160
320
640
1280
2560
5120
1 2 4 8 16 32
im
ag
es
/s
ec
on
d
Number of nodes
SKX + this work strong scaling
V100 (FP16)
V100 (FP32)
SKX + this work single node
SKX + MKL-DNN + TF single node 870
4432
371
149
103
(a) (b)
N = 1344 N = 2688 N = 5376
0
1
10
100
1 2 4 8 16 2 4 8 16 4 8 16
Nodes Nodes Nodes
KW
PS
Perfect  scaling This work+TF reference LSTM cell+TF
Figure 10: Distributed memory training results: (a) 4-Layer GNMT model (LSTM kernels), (b) ResNet-50 model (CNN kernels)
peak. In contrast, our approachwith the batch-reduce GEMMkernel
achieves 64%, 76%, and 76% of peak respectively and is 1.16×, 1.36×,
and 1.09× faster than the corresponding coarse-grained approach.
4.2 Distributed memory training results
Our experimental platform is a 32-node cluster (Intel Omnipath
interconnect), each node having two Skylake-SP (SKX) 8180 proces-
sors. For these runs, we enable the turbo mode on the processors
(i.e. clock frequency up to 2.3 GHz).
4.2.1 Distributed memory GNMT training results
We conducted our experiments with the 4-layer GNMT [5]model.
The framework of our choice is Tensorflow (TF) [26], where we
replaced the Tensorflow LSTM cell implementation with our op-
timized LSTM cell that leverages the batch-reduce GEMM kernel.
Then, we utilized Uber’s Horovod library [35] to enable efficient
multi-node runs. In order to accelerate the communication perfor-
mance of Horovod, we replaced its default MPI communication
backend with Intel’s MLSL library [36] which optimizes communi-
cation primitives arising in deep learning. Moreover, we extend the
partitioning logic of the inputs by grouping sequences with similar
length together in order to achieve load balance; such a technique
yields up to 1.5× speedup compared to classic input partitioning.
For all the experiments in this section, we use 1 MPI rank per CPU
socket. As a baseline for comparisons, we used the default LSTM cell
for CPUs within TF which we configured to use the MKL library to
materialize efficiently the large GEMM calls. In order to assess the
benefits of our new kernels, we incorporated our LSTM cell and we
further modified the TF code to support fused time-step operations
as they are described in Algorithm 2. We verified correctness of the
code changes by achieving state of the art BLEU score of 22.7 after
3 epochs with the German to English WMT16 dataset [37].
Figure 10 (a) illustrates the strong scaling of the distributed mem-
ory training with three different batch sizes; the usage of such large
batch sizes (up to ∼ 5K) is enabled by the LEGW [38] approach.
The y-axis represents the achieved training performance in Kilo
Words per Second (KWPS) while the x-axis shows the number of
nodes. Both axes are in logarithmic scale. For the smaller batch size
(N=1344), our approach (solid blue line) scales from 1 to 4 nodes
with 84% strong scaling efficiency, and when we keep scaling from
4 to 16 nodes the parallel efficiency further drops down to 38%. The
main reason for this efficiency drop involves the small mini-batch
per socket as we strong scale (we use pure data parallelism to scale
out). As a result, we get reduced efficiency within the LSTM cell
computation. For example, with N=1344 and at the concurrency of
1 node (2 MPI ranks), the mini-batch per socket is 672 whereas at 16
nodes (32 MPI ranks), the mini-batch per socket is 42. Nevertheless,
we are able to increase the performance all the way up to 16 nodes
even for such a small batch size and we achieve 35.8 KWPS. For
comparison, the reference LSTM cell + TF approach (orange line)
achieves 15.36 KWPS, thus the approach with our kernels is 2.33×
faster. As we increase the global batch size, the strong scaling effi-
ciency is better because the local computation does not suffer from
very small mini-batch. For example, with batch size N=2688 our
strong scaling efficiency at 16 nodes is 58% achieving 52.5 KWPS,
and is 2.77× faster than the reference LSTM cell that achieves 18.9
KWPS. With the largest batch size (N=5376) the strong scaling effi-
ciency at 16 nodes is 75.2% achieving 65.9 KWPS. For the same setup,
the reference LSTM cell achieves 32.32 KWPS, thus our approach is
2.04× faster. For the last two experimental setups, we start from 2
and 4 nodes respectively since those batch sizes are too large for the
available memory of a single node. To the best of our knowledge,
these achieved GNMT training results are the best in class on CPU
platforms. For completeness, we mention here the performance
of contemporary Nvidia V100 GPU systems on the 4-layer GNMT
model in Tensorflow (FP32 precision): The achieved performance
is 12.7 and 83.3 KWPS on 1 and 8 GPUs respectively [39]. The FP32
peak performance of V100 GPU is 15.7 TF/s which is ∼2× larger
than a single CPU node, and also the available bandwidth is 900
GB/s which is 3.6× larger than the bandwidth of our node. The
10
0
200
400
600
800
1000
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
G
FL
O
PS
Layer ID
ResNet-50 forward convolutions on Gen9 (N=32) 
This work clDNN
0
500
1000
1500
2000
2500
3000
3500
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
G
FL
O
PS
Layer ID
ResNet-50 inference (N=1) on SKX 8180
TVM+batch-reduce GEMM Amazon-AutoTVM MKL-DNN This work peak
Figure 11: CNN forward propagation: (Left) integrated GPU Gen9, (Right) implementation with batch-reduce GEMM in TVM
scaling from 1 to 8 GPUs shows similar scaling efficiency to our
distributed training results.
4.2.2 Distributed memory ResNet-50 training results
For ResNet-50 training, we integrated our new CNN kernels into
the GxM framework which has been shown to scale efficiently on
clusters of CPUs via the MLSL library [27, 36]. We verified correct-
ness of our experiments by converging to the same state of the art
accuracies, e.g. 75.7% Top-1 accuracy. For Nvidia V100 performance,
we use numbers from Nvidia [40] and previous work [41].
Figure 10 (b) summarizes the obtained performance. For the
single node CPU measurements, we use 28 cores per socket and the
mini-batch is 56 (dual socket nodes). Our approach (red flat line)
achieves 149 images/sec and is 1.45× faster than the configuration
with MKL-DNN and Tensorflow (orange flat line) which attains
103 images/sec. If we increase the mini-batch to 224, MKL-DNN
improves its efficiency and is able to obtain 129 images/sec. For
completeness, we mention the achieved performance of one V100
GPU which is 371 images/sec in single precision (green flat line)
and 870 images/sec in mixed FP16/FP32 precision (black flat line) –
the latter approach uses the available tensor cores [42].
In order to scale out with GxM, we dedicate 2 cores per node
for communication via MLSL primitives and we use 54 cores for
computations. In Figure 10 (b), we illustrate with solid blue line
the scaling of GxM with our new CNN kernels (both axes are in
logarithmic scale). We scale up to 32 nodes with 95.3% parallel
efficiency, achieving at the concurrency of 32 nodes (1,792 cores)
4432 images/sec. To the best of our knowledge, these are the best
reported distributed memory training results for CPUs (in terms
of efficiency). Previous work, which also uses GxM [27], obtained
on the same platform 1696 images/sec at the concurrency of 16
nodes, while our CNN kernels within GxM achieve 2239 images/sec,
improving the end-to-end training performance by 1.32×.
4.3 CNN kernels on integrated GPU and TVM
In order to showcase the generalizability of our approach to diverse
platforms, we developed the batch-reduce GEMMkernel in OpenCL.
We implemented forward propagation CNN kernels (Algorithm 4)
targeting Intel’s integrated GPU Gen9 (Core i7 6770HQ) [43] which
has peak performance of 1152 GFLOPS. Figure 11 (Left) illustrates
the performance of our kernels (blue bars) and the vendor-optimized
library Intel clDNN [44] (orange bars). For the clDNN experiments,
we tried all the available tensor layouts and picked the one that
yields the highest performance. For this experiment the mini-batch
size is N=32. We conclude that our kernels with batch-reduce
GEMM achieve similar performance to clDNN. When consider-
ing the weighted efficiency, our kernels run at 728.3 Gflops and the
clDNN kernels at 753.5 Gflops, thus our approach is within 3% of
the vendor-optimized, ad hoc implementation.
Once again, in this work, the specific algorithm/kernel devel-
opment can be seen as loop tuning around batch-reduce GEMM,
cf. section 3. Even though our DL kernels perform this tuning in a
well-informed, manual fashion and are implemented in high-level
C code, an alternative is to use a high-level tensor framework. Here
we present results of such a proof-of-concept design, where we
implement the forward convolutions within TVM [20]. More specif-
ically, we provide to TVM the forward propagation loop recipes in
high-level Python code, and at the innermost loop nest level we
invoke our batch-reduce GEMM kernel.
To assess the efficiency of our approach, we consider the infer-
ence use-case of CNNs (i.e. only the forward propagation pass). One
idiosyncrasy of inference compared to training is the very small
mini-batch N=1 that is necessitated to meet the latency require-
ments of the application. Figure 11 (Right) shows the performance
of our TVM implementation (green bars) on ResNet-50 forward
kernels (mini-batch N=1) on the SKX 8180 platform. In this plot
we also show the performance of the following implementations: i)
CNN kernels within AutoTVM developed by Amazon [45] (yellow
bars), which are auto-tuned for inference, ii) MKL-DNN (orange
bars), and iii) the CNN kernel performance of our high-level C
code kernels (blue bars). First, we observe that our DL kernels are
the most efficient, achieving overall weighted efficiency of 2492
GFLOPS. Our Python implementation within TVM that exploits the
batch-reduce GEMMkernel runs at 2361 GFLOPS, and consequently
is within 5.3% of our C implementation. Our TVM implementation
is 2% faster than the Amazon-AutoTVM auto-tuned code and 1.24×
faster than the vendor optimized MKL-DNN library. These results
show that high-performance DL kernels within high-level tensor
frameworks are feasible if the proper building block is used.
5 RELATEDWORK
The status quo in the development of high-performance DL work-
loads entails vendor-optimized DL primitives within some high-
level deep learning framework (e.g. Tensorflow [26], Pytorch [46],
Caffe [47]). MKL-DNN [15] is the Intel optimized DL library that
provides specialized primitives (e.g. convolutions, RNN/LSTM cell,
fully connected layers) for Intel CPUs. Each one of these primitives
is individually optimized at a low-level on a per-platform basis
in order to maximize performance, leading to numerous, highly
11
specialized code-bases that do not generalize to different architec-
tures. In an analogous way, cuDNN [18] is the vendor-optimized
DL library targeting Nvidia GPUs. clDNN [44] is an open source
performance library for DL applications intended for acceleration
of DL Inference on Intel GPUs. All these library approaches suffer
from the combinatorial explosion of the low-level optimizations
that have to be done for each pair <architecture, DL primitive>. On
the other hand, our proposed batch-reduce GEMM kernel covers
all major DL primitives and it is the sole building-block that has to
be optimized at a low-level.
An alternative methodology to implement DL primitives is to
leverage vendor-optimized linear algebra library calls. For example,
convolutions can be lowered to a matrix multiplication [48], but this
lowering requires tensor transformations, and the obtained perfor-
mance deteriorates [18, 27]. Aiming to accelerate the small matrix
operations that pertain in DL, academia and industry have recently
developed batched linear algebra routines [19, 29, 49]. The batched
GEMM approaches in [29, 49] specifically target only Nvidia GPUs
where the reduction across output subtensors is relatively cheap.
Even though such approaches improve the performance of the DL
primitives, they still perform worse than ad hoc implementations
(cf. batched GEMM approach and mkl-dnn in Figure 1). Our work
extends the batched GEMM routine, enables more optimizations
(see Section 2), optimizes for locality and is therefore well suited
for latency architectures such as CPUs. Also, the derived DL primi-
tives match/exceed the performance of vendor-optimized ad hoc
implementations as shown in Section 4.
Tensor compilers comprise a promising research area for end-
to-end DL workload optimization and performance portability (e.g.
TVM [20], GLOW [21], PlaidML [22], MLIR [23], Tensor Compre-
hensions [50]). Such frameworks treat tensors as first-class objects,
and provide optimizations targeting tensor algebra programs (e.g.
polyhedral optimizations for data movements). However, compilers
struggle to optimize the GEMM-flavored loop nests for the nuances
of the increasingly complex architectures. Our work can be seen
as complementary to this effort, where the GEMM-flavored loops
are abstracted into our batch-reduce GEMM call (which is indepen-
dently optimized at a low-level). Then, the specific DL primitive
optimization is reduced to mere loop tuning around a single kernel,
and this task can be handed off to a tensor compiler. In Section 4 we
showcased a prototype for CNNs within TVM that uses our kernel.
6 CONCLUSIONS
In this work, we showed how the most popular DL algorithms
(RNN/LSTM, CNN and MLP) can be formulated with batch-reduce
GEMM as basic building block. We demonstrated that our method-
ology outperforms vendor-optimized, low-level DL primitives by
factors up to 1.4×. Moreover, we integrated our DL kernels into
distributed frameworks, and optimized end-to-end workflows for
GNMT and ResNet-50 training. In multi-node experiments we ex-
ceeded the performance of vendor-optimized implementations by
up to 2.3×. Additionally, we highlighted the architectural-agnostic
aspect of our methodology by matching the CNN kernel perfor-
mance of a vendor-provided library on integrated GPUs. Finally,
we prototyped CNN kernels in a tensor compiler framework by
harnessing our batch-reduce GEMM kernel, and matched the per-
formance of auto-tuned inference TVM primitives. As future work,
we plan to extend our DL primitives for a wider set of architec-
tures/workloads, and also we intend to experiment with Tensor
compilers’ automatic polyhedral optimization (e.g. [22, 25]).
REFERENCES
[1] Alex Krizhevsky, I. Sutskever, and G.E. Hinton. Image classification with deep
convolutional neural networks. Advances in neural information processing systems,
pages 1097–1105, 2012.
[2] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going
deeper with convolutions. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9, 2015.
[3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[4] Dong Yu, Michael L Seltzer, Jinyu Li, Jui-Ting Huang, and Frank Seide. Feature
learning in deep neural networks-studies on speech recognition tasks. arXiv
preprint arXiv:1301.3605, 2013.
[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi,
Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al.
Google’s neural machine translation system: Bridging the gap between human
and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chan-
dra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir,
et al. Wide & deep learning for recommender systems. In Proceedings of the 1st
Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016.
[7] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,
Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-
datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE
44th Annual International Symposium on Computer Architecture (ISCA), pages
1–12. IEEE, 2017.
[8] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition
with deep recurrent neural networks. In 2013 IEEE international conference on
acoustics, speech and signal processing, pages 6645–6649. IEEE, 2013.
[9] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural
computation, 9(8):1735–1780, 1997.
[10] Marvin Minsky and Seymour A Papert. Perceptrons: An introduction to computa-
tional geometry. MIT press, 2017.
[11] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward
networks are universal approximators. Neural networks, 2(5):359–366, 1989.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Advances in neural information processing systems, pages 5998–6008, 2017.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv
preprint arXiv:1810.04805, 2018.
[14] Yann LeCun, D Touresky, G Hinton, and T Sejnowski. A theoretical framework
for back-propagation. In Proceedings of the 1988 connectionist models summer
school, volume 1, pages 21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988.
[15] Intel MKL-DNN. https://github.com/intel/mkl-dnn, Accessed on 4/3/2019.
[16] Aravind Vasudevan, Andrew Anderson, and David Gregg. Parallel multi channel
convolution using general matrix multiplication. arXiv preprint arXiv:1704.04428,
2017.
[17] Andrew Anderson, Aravind Vasudevan, Cormac Keane, and David Gregg. Low-
memory gemm-based convolution algorithms for deep neural networks. arXiv
preprint arXiv:1709.03395, 2017.
[18] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John
Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep
learning. arXiv preprint arXiv:1410.0759, 2014.
[19] Jack Dongarra, Sven Hammarling, Nicholas J Higham, Samuel D Relton, Pedro
Valero-Lara, and Mawussi Zounon. The design and performance of batched blas
on modern high-performance computing systems. Procedia Computer Science,
108:495–504, 2017.
[20] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen
Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. {TVM}: An
automated end-to-end optimizing compiler for deep learning. In 13th {USENIX}
Symposium on Operating Systems Design and Implementation ({OSDI} 18), pages
578–594, 2018.
[21] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov,
James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen,
Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering
compiler techniques for neural networks. CoRR, abs/1805.00907, 2018.
[22] Tim Zerrell and Jeremy Bruestle. Stripe: Tensor compilation via the nested
polyhedral model. arXiv preprint arXiv:1903.06498, 2019.
12
[23] Multi-Level Intermediate Representation. https://github.com/tensorflow/mlir,
Accessed on 4/10/2019.
[24] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst.
LIBXSMM: Accelerating small matrix multiplications by runtime code generation.
In Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis, SC ’16, pages 84:1–84:11, Piscataway, NJ, USA,
2016. IEEE Press.
[25] Roman Gareev, Tobias Grosser, and Michael Kruse. High-performance gener-
alized tensor operations: A compiler-oriented approach. ACM Transactions on
Architecture and Code Optimization (TACO), 15(3):34, 2018.
[26] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, San-
jay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard,
Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Leven-
berg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike
Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul
Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[27] Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg
Henry, Hans Pabst, and Alexander Heinecke. Anatomy of high-performance deep
learning convolutions on SIMD architectures. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage, and Analysis,
page 66. IEEE Press, 2018.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 770–778, 2016.
[29] Yang Shi, Uma Naresh Niranjan, Animashree Anandkumar, and Cris Cecka.
Tensor contractions with extended blas kernels on cpu and gpu. In 2016 IEEE 23rd
International Conference on High Performance Computing (HiPC), pages 193–202.
IEEE, 2016.
[30] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of
neural networks on cpus. 2011.
[31] “Using bfloat16 with TensorFlow models".
https://cloud.google.com/tpu/docs/bfloat16, Accessed on 4/3/2019.
[32] Christopher De Sa,Megan Leszczynski, Jian Zhang, AlanaMarzoev, Christopher R
Aberger, Kunle Olukotun, and Christopher Ré. High-accuracy low-precision
training. arXiv preprint arXiv:1803.03383, 2018.
[33] Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He. Deepcpu:
Serving rnn-based deep learning models 10x faster. In 2018 {USENIX} Annual
Technical Conference ({USENIX}{ATC} 18), pages 951–965, 2018.
[34] Kazushige Goto and Robert A Geijn. Anatomy of high-performance matrix
multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12,
2008.
[35] Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep
learning in tensorflow. arXiv preprint arXiv:1802.05799, 2018.
[36] Srinivas Sridharan, Karthikeyan Vaidyanathan, Dhiraj Kalamkar, Dipankar Das,
Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellem-
pudi, Sasikanth Avancha, Bharat Kaul, and Pradeep Dubey. On scale-out deep
learning training for cloud and hpc. arXiv preprint arXiv:1801.08030, 2018.
[37] WMT16 dataset. https://google.github.io/seq2seq/data/, Accessed on 4/3/2019.
[38] Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, and Cho-Jui
Hsieh. Large-batch training for lstm and beyond. arXiv preprint arXiv:1901.08256,
2019.
[39] GNMTv2 For TensorFlow. https://github.com/nvidia/deeplearningexamples/tree/
master/tensorflow/translation/gnmt#training-performance-results, Accessed on
4/3/2019.
[40] NVIDIA Tesla Deep Learning Product Performance.
https://developer.nvidia.com/deep-learning-performance-training-inference,
Accessed on 4/3/2019.
[41] Rengan Xu, Frank Han, and Quy Ta. Deep learning at scale on nvidia v100 accel-
erators. In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation
of High Performance Computer Systems (PMBS), pages 23–32. IEEE, 2018.
[42] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S
Vetter. Nvidia tensor core programmability, performance & precision. In 2018 IEEE
International Parallel and Distributed Processing Symposium Workshops (IPDPSW),
pages 522–531. IEEE, 2018.
[43] Stephen Junkins. The compute architecture of intel® processor graphics gen9.
paper, Aug, 14:22, 2015.
[44] “Compute library for deep neural networks (clDNN)". https://01.org/cldnn, Ac-
cessed on 4/3/2019.
[45] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. Optimizing
cnn model inference on cpus. arXiv preprint arXiv:1809.02697, 2018.
[46] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch:
Tensors and dynamic neural networks in python with strong gpu acceleration.
PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceler-
ation, 6, 2017.
[47] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross
Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional archi-
tecture for fast feature embedding. In Proceedings of the 22nd ACM international
conference on Multimedia, pages 675–678. ACM, 2014.
[48] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolu-
tional neural networks for document processing. In Tenth International Workshop
on Frontiers in Handwriting Recognition. Suvisoft, 2006.
[49] Lucien Ng, Kwai Wong, Azzam Haidar, Stanimire Tomov, and Jack Dongarra.
Magmadnn high-performance data analytics for manycore gpus and cpus. In
magma-DNN, 2017 Summer Research Experiences for Undergraduate (REU). 2017.
[50] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal,
Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert
Cohen. Tensor comprehensions: Framework-agnostic high-performance machine
learning abstractions. arXiv preprint arXiv:1802.04730, 2018.
Optimization Notice: Software and workloads used in performance
tests may have been optimized for performance only on Intel micro-
processors. Performance tests, such as SYSmark and MobileMark,
are measured using specific computer systems, components, soft-
ware, operations and functions. Any change to any of those factors
may cause the results to vary. You should consult other informa-
tion and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product
when combined with other products. For more information go to
http://www.intel.com/performance.
Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation
in the U.S. and/or other countries.
13
