Optimizing Memory-Access Patterns for Deep Learning Accelerators by Zheng, Hongbin et al.
arXiv:2002.12798v1 [cs.PF] 27 Feb 2020
Optimizing Memory-Access Paerns for Deep
Learning Accelerators
Hongbin Zheng, Sejong Oh, Huiqing Wang, Preston Briggs, Jiading Gai,
Animesh Jain, Yizhi Liu, Rich Heaton, Randy Huang, Yida Wang
Amazon Web Services
ABSTRACT
Deep learning (DL) workloads are moving towards accelerators
for faster processing and lower cost. Modern DL accelerators
are good at handling the large-scale multiply-accumulate
operations that dominate DL workloads; however, it is chal-
lenging to make full use of the compute power of an accel-
erator since the data must be properly staged in a software-
managed scratchpad memory. Failing to do so can result in
significant performance loss. This paper proposes a system-
atic approach which leverages the polyhedral model to ana-
lyze all operators of a DL model together to minimize the
number of memory accesses. Experiments show that our
approach can substantially reduce the impact of memory
accesses required by common neural-network models on a
homegrown AWS machine-learning inference chip named
Inferentia, which is available through Amazon EC2 Inf1 in-
stances.
KEYWORDS
Compiler, Deep Learning Accelerator
1 INTRODUCTION
As deep learning (DL) models grow in sophistication and
computational load, the traditional approach of executing
DL workloads, i.e., neural networks, on CPUs and GPUs is
becoming more time consuming and expensive. There is a
trend to move DL workloads to custom accelerators [2, 6].
By designing domain-specific architectures, these processors
are able to accelerate DL workloads and reduce energy re-
quirements by orders of magnitude.
A typical DL model can be represented as a graph, where
nodes are operators and directed edges denote the depen-
dences between nodes. Modern accelerators mostly focus
on compute-bound operators such as convolution (CONV)
and general matrix multiplication (GEMM) via specially de-
signed compute units like systolic arrays. These units are
able to process multiply-accumulate operations in a highly
efficient manner. On the other hand, the accelerators depend
on complex software-managed scratchpads. End-to-end per-
formance will be limited if memory references of a neural
network are not well organized. Current solutions, e.g., the
XLA compiler for Google’s TPU [11], handle memory-access
optimization within an operator, but ignore opportunities
to reduce the number of memory accesses across multiple
operators. There is some global optimization work for DL
models [5, 7], but no one seems to have attacked global op-
timization of memory-access patterns for DL accelerators.
We propose a systematic way to optimize the memory-
access patterns of DL models for efficient execution on DL
accelerators. Specifically, our approach takes a DL model
as input and performs a number of global optimizations that
remove unnecessary memory copies and intelligently schedule
the necessary memory accesses on the accelerator to maximize
memory-bandwidth usage. Experiments show that we
are able to significantly reduce the impact of memory references
when running on a homegrown AWS machine-learning inference
chip named Inferentia. The chip is available to the public
through Amazon EC2 Inf1 instances¹.
2 OPTIMIZATION METHOD
Our work is part of the compiler toolchain for Inferentia. The
toolchain reads in the computation graph of a DL model, de-
fines the operators via TVM [1] to build an intermediate rep-
resentation (IR) that represents the whole neural network,
applies analyses and optimizations to the IR, and eventually
produces a low-level IR for machine-code generation.
This paper focuses on a small portion of the compiler: op-
timizing the memory-access patterns. A DL workload ma-
nipulates high dimensional tensors with loop nests. With-
out loss of generality, we define the tensor accesses with
element-wise load and store instructions inside a loop nest
based on the polyhedral model [10]:
$$v_l = t_m[\vec{f}(\vec{i})] \qquad \text{(load)}$$
$$t_m[\vec{f}(\vec{i})] = v_s \qquad \text{(store)}$$
In these definitions, $\vec{i} = (i_0, i_1, \ldots, i_{n-1})$ represents a loop nest
with $n$ loops, where $i_j$ is the loop at level $j$, $t_m$ represents the
$m$-dimensional tensor that is being read or written by the
load/store instructions, and $\vec{f}(\vec{i}) = C\vec{i} + \vec{b}$. Since the matrix
$C$ and the vector $\vec{b}$ are compile-time constants, $C\vec{i} + \vec{b}$ is an
affine expression. Finally, $v_l$ in (load) represents the result
of the load instruction and $v_s$ in (store) represents the data
being written to $t_m[\vec{f}(\vec{i})]$ by the store instruction.
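As a concrete illustration of the (load) definition, the following minimal Python sketch evaluates an affine access function at every point of a small loop nest. The matrix C, the offset b, the tensor contents, and the helper name affine_index are hypothetical values chosen for illustration; they are not taken from the Inferentia compiler.

```python
from itertools import product

def affine_index(C, b, i):
    """Evaluate f(i) = C*i + b for an integer matrix C (list of rows) and offset b."""
    return tuple(sum(c * x for c, x in zip(row, i)) + off
                 for row, off in zip(C, b))

# Hypothetical access: a 2-deep loop nest (i0, i1) reads a 2-D tensor t
# transposed and shifted by one column, i.e. f(i0, i1) = (i1, i0 + 1).
C = [[0, 1],
     [1, 0]]
b = [0, 1]

t = [[10, 11, 12],
     [20, 21, 22]]                      # a 2 x 3 tensor

loaded = {}
for i in product(range(2), range(2)):   # i0 in [0, 2), i1 in [0, 2)
    idx = affine_index(C, b, i)         # index into t
    loaded[i] = t[idx[0]][idx[1]]       # v_l = t[f(i)]
```

Because C and b are fixed before the loop runs, every index touched by the nest can be derived at compile time, which is what makes the analysis in the following sections possible.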
¹ https://aws.amazon.com/ec2/instance-types/inf1/
C4ML ’20, February 23, 2020, San Diego, CA Zheng, et al.
Our approach tries to eliminate unnecessary data movement
in the workload (Section 2.1) and, for the remainder,
to maximize the utilization of the on-chip memory by maintaining
data locality in the scratchpad (Section 2.2). Our approach
was designed for DL accelerators equipped with powerful
compute units and limited on-chip memory.
2.1 Data-Movement Elimination
Data-movement elimination tries to eliminate pairs of instructions
$(v = t_l[\vec{f}_l(\vec{i})],\ t_s[\vec{f}_s(\vec{i})] = v)$, where the result of
the load instruction, $v$, directly feeds the input of the store
instruction. Such patterns are found in DL workloads by analyzing
the loop nests of pairs of memory-bound operators
like repeat, tile, split, transpose, strided_slice, etc. Existing solutions
cannot thoroughly eliminate them without optimizing
globally.
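To eliminate such pairs, we first generate the reverse of
$\vec{f}_s$ as $\vec{f}'_s : \overrightarrow{idx}_{t_s} \mapsto \vec{i}$. Using $\vec{f}'_s$, we build a function:
$$\vec{g}_{ls} = \vec{f}_l \circ \vec{f}'_s = \vec{f}_l(\vec{f}'_s(\overrightarrow{idx}_{t_s})) : \overrightarrow{idx}_{t_s} \mapsto \overrightarrow{idx}_{t_l} \qquad (1)$$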
to map the index space of tensor $t_s$ to the index space of
tensor $t_l$. Using $\vec{g}_{ls}$, we rewrite the accesses that read $t_s$ so
they directly read $t_l$, which in turn allows us to eliminate the
stores that defined $t_s$. Specifically, for each load instruction
that reads $t_s$, $v' = t_s[\vec{f}'_l(\vec{i}')]$, we build a function:
$$\vec{g}' = \vec{g}_{ls} \circ \vec{f}'_l = \vec{g}_{ls}(\vec{f}'_l(\vec{i}')) = \vec{f}_l(\vec{f}'_s(\vec{f}'_l(\vec{i}'))) : \vec{i}' \mapsto \overrightarrow{idx}_{t_l} \qquad (2)$$
to map the loop indices $\vec{i}'$ to the index space of $t_l$ and rewrite
the load instruction as $v' = t_l[\vec{g}'(\vec{i}')]$. Once we apply such
transformations to all load instructions that read tensor $t_s$,
$t_s$ can be eliminated along with all instructions defining it.
We repeat this process until we cannot eliminate any more
load/store pairs. The affine function reverse and composition
are implemented using the Integer Set Library [9].
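The rewrite in Equations (1) and (2) can be sketched in plain Python. This is an illustrative simplification, not the compiler's implementation: the class Affine and its methods are hypothetical, and reverse() only handles the case where C is a permutation matrix (as in transpose-like operators), whereas the paper relies on the Integer Set Library for general affine maps. The example assumes a copy that stores the transpose of t_l into t_s, followed by a load that reads t_s transposed again.

```python
from itertools import product

class Affine:
    """Affine map x -> C*x + b. Assumption: C is a permutation matrix."""
    def __init__(self, C, b):
        self.C, self.b = C, b

    def __call__(self, x):
        return tuple(sum(c * v for c, v in zip(row, x)) + off
                     for row, off in zip(self.C, self.b))

    def compose(self, inner):
        """Return self ∘ inner, i.e. x -> self(inner(x))."""
        rows, mid, cols = len(self.C), len(inner.C), len(inner.C[0])
        C = [[sum(self.C[r][k] * inner.C[k][c] for k in range(mid))
              for c in range(cols)] for r in range(rows)]
        b = [sum(self.C[r][k] * inner.b[k] for k in range(mid)) + self.b[r]
             for r in range(rows)]
        return Affine(C, b)

    def reverse(self):
        """For a permutation C, y = C*x + b reverses to x = C^T*y - C^T*b."""
        n = len(self.C)
        Ct = [[self.C[c][r] for c in range(n)] for r in range(n)]
        b = [-sum(Ct[r][k] * self.b[k] for k in range(n)) for r in range(n)]
        return Affine(Ct, b)

identity = Affine([[1, 0], [0, 1]], [0, 0])
transpose = Affine([[0, 1], [1, 0]], [0, 0])

f_l, f_s = identity, transpose       # copy: t_s[i1, i0] = t_l[i0, i1]
f_l_prime = transpose                # later load: v' = t_s[i1', i0']

g_ls = f_l.compose(f_s.reverse())    # Eq. (1): index of t_s -> index of t_l
g_prime = g_ls.compose(f_l_prime)    # Eq. (2): loop index i' -> index of t_l

# Materialize t_s only to check that the rewritten load reads the same data.
t_l = [[1, 2, 3], [4, 5, 6]]
t_s = [[None, None] for _ in range(3)]
for i in product(range(2), range(3)):
    (s0, s1), (l0, l1) = f_s(i), f_l(i)
    t_s[s0][s1] = t_l[l0][l1]

rewrite_ok = all(
    t_s[f_l_prime(i)[0]][f_l_prime(i)[1]] == t_l[g_prime(i)[0]][g_prime(i)[1]]
    for i in product(range(2), range(3))
)
```

In this example the two transposes cancel, so g_prime is the identity map and the second load can read t_l directly; t_s and the copy that defines it are no longer needed.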
2.2 Global Memory-Bank Mapping
Not all data movement in a DL workload can be removed.
For compulsory references, we try to fully exploit the avail-
able memory bandwidth. In order to maximize the internal
memory bandwidth, accelerators typically organize on-chip
memories into multiple banks with disjoint address spaces,
each of which connects to one portion of the compute units
(e.g., a specific row of the systolic array). Data movement
between different banks is very slow because it goes through the main
memory; therefore, tensor data needs to be carefully spread across
the banks for computation. For example, in a Conv2D oper-
ator, data from different channels of the feature map and
weights must be mapped to different memory banks so that
the internal compute units can read and process the data in
parallel. At the same time, the result of the Conv2D needs
to be spread across several banks, guided by the different
output channels.
In prior work [3], bank mapping focused on a single loop
nest with a goal of maximizing the memory-access paral-
lelism for that nest. We call this local bank mapping.
Our goal is to minimize inter-bank data movement between
multiple operators (represented by multiple loop nests
in our compiler). To achieve this goal, we first derive bank
mappings for the operators with bank-mapping restrictions,
e.g., conv2D, matmul, pooling, etc., then propagate these mappings
across the network based on the data dependencies
between operators. We perform a fixed-point iteration to
propagate the mappings to cover all operators in the neural
network and make sure that the output of an operator maps
to the memory banks required by the next operator. If a ten-
sor t has conflicting mapping requirements during the prop-
agation, i.e., the data layout changes between consecutive
operators in the network, we will introduce a tensor t ′ and
a memcopy between t and t ′ to represent data movement
between memory banks. Typically, for a high-dimensional
tensor, we map its outer dimensions to different banks and
use its inner dimensions to address different elements in the
same bank to support sequential data access.
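The propagation step can be sketched as a small fixed-point loop. This is a simplified illustration under stated assumptions: each tensor's bank mapping is reduced to a single label naming the dimension that is spread across banks, layout-preserving operators are modeled as (source, destination) tensor pairs, and the function name propagate_bank_mappings and all tensor names are hypothetical. A conflict is only recorded as a (tensor, existing mapping, conflicting mapping) tuple, standing in for the tensor t′ and the memcopy described above.

```python
def propagate_bank_mappings(edges, seeds):
    """Propagate bank mappings until nothing changes.

    edges: (src, dst) tensor pairs joined by a layout-preserving operator.
    seeds: (tensor, bank_dim) requirements from restricted operators.
    Returns (mapping, memcopies).
    """
    mapping = {}
    memcopies = set()

    def assign(tensor, dim):
        """Record dim for tensor; return True if the mapping table changed."""
        if tensor not in mapping:
            mapping[tensor] = dim
            return True
        if mapping[tensor] != dim:
            # Conflict: keep the first mapping and note that a copy is needed.
            memcopies.add((tensor, mapping[tensor], dim))
        return False

    for tensor, dim in seeds:
        assign(tensor, dim)

    changed = True
    while changed:                       # fixed-point iteration
        changed = False
        for src, dst in edges:
            if src in mapping:           # forward: consumer inherits producer
                changed |= assign(dst, mapping[src])
            elif dst in mapping:         # backward: producer inherits consumer
                changed |= assign(src, mapping[dst])
    return mapping, sorted(memcopies)

# Hypothetical chain a -> b -> c -> d: tensor a must be spread by channel "C",
# while a later restricted operator needs d spread by height "H".
mapping, memcopies = propagate_bank_mappings(
    edges=[("a", "b"), ("b", "c"), ("c", "d")],
    seeds=[("a", "C"), ("d", "H")],
)
```

The loop terminates because each tensor is assigned a mapping at most once. In the example, b and c inherit the channel mapping from a, and the single conflict at d marks the one place where data must move between banks.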
3 EVALUATION
We conducted our experiments on a homegrown AWS chip
called Inferentia, specifically on an Amazon EC2 Inf1.xlarge instance.
For the sake of space, we present results of a single model
for each algorithm.
We tested the effectiveness of data-movement elimina-
tion on Parallel WaveNet [8]. Our optimization was able to
eliminate 123 out of 124 load-store pairs. As a result, we elim-
inated 145 MB (out of 146 MB) of tensors that were used for
intermediate storage. We saved 10% of the on-chip memory
copies and 11% of the off-chip memory copies (measured in
bytes).
We tested the effectiveness of global memory-bank map-
ping by running our compiler on ResNet-50 [4], comparing
two different mapping algorithms:
• Local mapping, which generates mappings within each
  operator, without propagation, but keeps the output
  of an operator in on-chip memory if it will be directly
  used as the input of the next operator.
• Global mapping, as described in Section 2.2.
Taking results from local mapping as a baseline, we saw
global mapping eliminate 76% of the on-chip data copies and
37% of the off-chip copies (measured in bytes).
Optimizing Memory-Access Paerns for Deep Learning Accelerators C4ML ’20, February 23, 2020, San Diego, CA
4 CONCLUSION
To conclude, this paper proposes a systematic approach to
globally optimize the memory-access patterns of DL work-
loads on accelerators. Experimental results show that we are
able to significantly reduce memory references for state-of-the-art
networks on Inferentia, a homegrown AWS machine-learning
inference chip.
REFERENCES
[1] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie
Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis
Ceze, et al. 2018. TVM: An Automated End-to-End Optimizing Com-
piler for Deep Learning. In USENIX Symposium on Operating Systems
Design and Implementation, Vol. 13. 578–594.
[2] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier
Temam. 2016. DianNao Family: Energy-Efficient Hardware Accelera-
tors for Machine Learning. Commun. ACM 59, 11 (2016), 105–112.
[3] Wei Ding, Diana Guttman, and Mahmut Kandemir. 2014. Compiler
Support for Optimizing Memory Bank-Level Parallelism. In
IEEE/ACM International Symposium on Microarchitecture, Vol. 47. 571–582.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
Residual Learning for Image Recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 770–778.
[5] Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Za-
haria, and Alex Aiken. 2019. Optimizing DNN Computation with Re-
laxed Graph Substitutions. In Proceedings of the Conference on Systems
and Machine Learning, Vol. 19.
[6] Norman P. Jouppi, Cliff Young, Nishant Patil, and David Patterson.
2018. A Domain-Specific Architecture for Deep Neural Networks.
Commun. ACM 61, 9 (2018), 50–59.
[7] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang.
2019. Optimizing CNN Model Inference on CPUs. In USENIX Annual
Technical Conference, Vol. 19. 1025–1040.
[8] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan,
Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Ed-
ward Lockhart, Luis C Cobo, Florian Stimberg, et al. 2017. Paral-
lel Wavenet: Fast High-Fidelity Speech Synthesis. arXiv preprint
arXiv:1711.10433 (2017).
[9] Sven Verdoolaege. 2010. isl: An Integer Set Library for the Polyhedral
Model. In International Congress on Mathematical Software. Springer,
299–302.
[10] Sven Verdoolaege. 2016. Presburger Formulas and Polyhedral Compi-
lation. (2016).
[11] XLA Team. 2017. XLA: TensorFlow, Compiled.
https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
