Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost. Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads; however, it is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory. Failing to do so can result in significant performance loss. This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses. Experiments show that our approach can substantially reduce the impact of memory accesses required by common neural-network models on a homegrown AWS machine-learning inference chip named Inferentia, which is available through Amazon EC2 Inf1 instances.
INTRODUCTION
As deep learning (DL) models grow in sophistication and computational load, the traditional approach of executing DL workloads, i.e., neural networks, on CPUs and GPUs is becoming more time-consuming and expensive. There is a trend to move DL workloads to custom accelerators [2, 6]. By designing domain-specific architectures, these processors are able to accelerate DL workloads and reduce energy requirements by orders of magnitude.
A typical DL model can be represented as a graph, where nodes are operators and directed edges denote the dependences between nodes. Modern accelerators mostly focus on compute-bound operators such as convolution (CONV) and general matrix multiplication (GEMM) via specially designed compute units like systolic arrays. These units are able to process multiply-accumulate operations in a highly efficient manner. On the other hand, the accelerators depend on complex software-managed scratchpads. End-to-end performance will be limited if memory references of a neural network are not well organized. Current solutions, e.g., the XLA compiler for Google's TPU [11], handle memory-access optimization within an operator, but ignore opportunities to reduce the number of memory accesses across multiple operators. There is some global optimization work for DL models [5, 7], but, to our knowledge, none of it addresses global optimization of memory-access patterns for DL accelerators.
We propose a systematic way to optimize the memory-access patterns of DL models for efficient execution on DL accelerators. Specifically, our approach takes a DL model as input, performs a number of global optimizations to remove unnecessary memory copies, and intelligently schedules the necessary memory accesses on the accelerator to maximize memory-bandwidth usage. Experiments show that we are able to significantly reduce the impact of memory references when running on a homegrown AWS machine-learning inference chip named Inferentia. The chip is available to the public through Amazon EC2 Inf1 instances.
OPTIMIZATION METHOD
Our work is part of the compiler toolchain for Inferentia. The toolchain reads in the computation graph of a DL model, defines the operators via TVM [1] to build an intermediate representation (IR) that represents the whole neural network, applies analyses and optimizations to the IR, and eventually produces a low-level IR for machine-code generation.
This paper focuses on a small portion of the compiler: optimizing the memory-access patterns. A DL workload manipulates high-dimensional tensors with loop nests. Without loss of generality, we define the tensor accesses with element-wise load and store instructions inside a loop nest based on the polyhedral model [10]:
$$l = t_m[\vec{f}(\vec{i})] \quad \text{(load)}$$
$$t_m[\vec{f}(\vec{i})] = s \quad \text{(store)}$$
In these definitions, $\vec{i} = (i_0, i_1, \ldots, i_{n-1})$ represents a loop nest with $n$ loops, where $i_j$ is the loop at level $j$, $t_m$ represents the $m$-dimensional tensor which is being read/written by the load/store instructions, and $\vec{f}(\vec{i}) = C\vec{i} + \vec{b}$ is the access function that maps the loop indices to an index of $t_m$.
Since the matrix $C$ and the vector $\vec{b}$ are compile-time constants, $C\vec{i} + \vec{b}$ is an affine expression. Finally, $l$ in (load) represents the result of the load instruction and $s$ in (store) represents the data being written to $t_m[\vec{f}(\vec{i})]$ in the store instruction.
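As a small illustration of the affine access function f(i) = C i + b, the following sketch evaluates a load inside a two-level loop nest. The helper name `affine_access`, the example matrix, and the tensor contents are assumptions made for illustration only; they are not part of the Inferentia toolchain.

```python
from itertools import product

def affine_access(C, b):
    """Return f(i) = C*i + b as a function on loop-index tuples."""
    def f(i):
        return tuple(sum(c * x for c, x in zip(row, i)) + off
                     for row, off in zip(C, b))
    return f

# Hypothetical example: a transposed read t[j][i] inside the loop nest (i, j).
C = [[0, 1],
     [1, 0]]
b = [0, 0]
f = affine_access(C, b)

tensor = [[10, 11, 12],
          [20, 21, 22]]          # a 2 x 3 tensor

loads = []
for i, j in product(range(3), range(2)):   # i in [0, 3), j in [0, 2)
    r, c = f((i, j))                       # tensor index is (j, i)
    loads.append(tensor[r][c])             # l = t[f(i)]
```

Because C and b are plain constants, the whole access pattern is known at compile time, which is what makes the later analyses possible.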
Our approach first tries to eliminate unnecessary data movements in the workload (Section 2.1) and then, for the remaining ones, maximizes the utilization of the on-chip memory by maintaining data locality in the scratchpad (Section 2.2). Our approach was designed for DL accelerators equipped with powerful compute units and limited on-chip memory.
Data-Movement Elimination
Data-movement elimination tries to eliminate the pair of instructions:
$$l = t_l[\vec{f}_l(\vec{i})]$$
$$t_s[\vec{f}_s(\vec{i})] = l$$
where the result of the load instruction, $l$, directly feeds the input of the store instruction. Such patterns are found in DL workloads by analyzing the loop nests of pairs of memory-bound operators like repeat, tile, split, transpose, strided_slice, etc. Existing solutions cannot thoroughly eliminate them without optimizing globally.
To eliminate such pairs, we first generate the reverse of $\vec{f}_s$ and compose it with $\vec{f}_l$:
$$\vec{f}_{ls} = \vec{f}_l \circ \vec{f}_s^{-1}$$
to map the index space of tensor $t_s$ to the index space of tensor $t_l$. Using $\vec{f}_{ls}$, we rewrite the accesses that read $t_s$ so they directly read $t_l$, which in turn allows us to eliminate the stores that defined $t_s$. Specifically, for each load instruction that reads $t_s$,
$$l' = t_s[\vec{f}'_s(\vec{i}')]$$
we build a function:
$$\vec{f}'(\vec{i}') = \vec{f}_{ls}(\vec{f}'_s(\vec{i}'))$$
to map the loop indices $\vec{i}'$ to the index space of $t_l$ and rewrite the load instruction as $l' = t_l[\vec{f}'(\vec{i}')]$. Once we apply such transformations to all load instructions that read tensor $t_s$, $t_s$ can be eliminated along with all instructions defining it.
We repeat this process until we cannot eliminate any more load/store pairs. The affine function reverse and composition are implemented using the Integer Set Library [9].
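The reverse-and-compose step can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: affine maps are stored as (C, b) pairs, the store's access matrix is square and invertible, and iteration domains are ignored. The actual implementation uses the Integer Set Library; the helper names `compose`, `inverse`, and `apply`, and the example accesses, are hypothetical.

```python
from fractions import Fraction

def compose(f, g):
    """Return f∘g for affine maps stored as (C, b), i.e. x -> C x + b."""
    (Cf, bf), (Cg, bg) = f, g
    C = [[sum(Cf[r][k] * Cg[k][c] for k in range(len(Cg)))
          for c in range(len(Cg[0]))] for r in range(len(Cf))]
    b = [sum(Cf[r][k] * bg[k] for k in range(len(bg))) + bf[r]
         for r in range(len(Cf))]
    return C, b

def inverse(f):
    """Invert a square, non-singular affine map by Gauss-Jordan elimination."""
    C, b = f
    n = len(C)
    # Augmented matrix [C | I]; a singular C makes next() raise StopIteration.
    A = [[Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(n)]
         for r, row in enumerate(C)]
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    Cinv = [row[n:] for row in A]
    # x = C i + b  =>  i = C^-1 x - C^-1 b
    binv = [-sum(Cinv[r][k] * b[k] for k in range(n)) for r in range(n)]
    return Cinv, binv

def apply(f, i):
    """Evaluate an affine map on an index tuple; the result must be integral."""
    C, b = f
    out = []
    for row, off in zip(C, b):
        v = Fraction(sum(c * x for c, x in zip(row, i)) + off)
        assert v.denominator == 1, "index must be integral"
        out.append(int(v))
    return tuple(out)

# Hypothetical producer: l = t_l[i, j+1]; t_s[j, i] = l
f_l = ([[1, 0], [0, 1]], [0, 1])
f_s = ([[0, 1], [1, 0]], [0, 0])
f_ls = compose(f_l, inverse(f_s))        # index space of t_s -> index space of t_l

# Hypothetical consumer: l' = t_s[q, p] in a loop nest (p, q)
f_s_read = ([[0, 1], [1, 0]], [0, 0])
f_new = compose(f_ls, f_s_read)          # rewritten load: l' = t_l[f_new(p, q)]

# Check against explicit execution with the intermediate tensor t_s.
t_l = [[0, 1, 2, 3], [4, 5, 6, 7]]
t_s = [[None, None] for _ in range(3)]
for i in range(2):
    for j in range(3):
        t_s[j][i] = t_l[i][j + 1]
via_copy = [t_s[q][p] for p in range(2) for q in range(3)]
direct = [t_l[r][c] for p in range(2) for q in range(3)
          for r, c in [apply(f_new, (p, q))]]
```

In this example the rewritten load reads `t_l[p][q+1]` directly, so the intermediate tensor `t_s` and the loop nest that fills it are no longer needed.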
Global Memory-Bank Mapping
Not all data movement in a DL workload can be removed. For compulsory references, we try to fully exploit the available memory bandwidth. In order to maximize the internal memory bandwidth, accelerators typically organize on-chip memories into multiple banks with disjoint address spaces, each of which connects to one portion of the compute units (e.g., a specific row of the systolic array). Data movement between different banks is very slow through the main memory; therefore, tensor data needs to be carefully spread across the banks for computation. For example, in a Conv2D operator, data from different channels of the feature map and weights must be mapped to different memory banks so that the internal compute units can read and process the data in parallel. At the same time, the result of the Conv2D needs to be spread across several banks, guided by the different output channels.
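To make the idea of spreading channels across banks concrete, here is a minimal sketch of one possible address mapping for a (channel, height, width) tensor. The round-robin placement, the function name `bank_address`, and the bank count are assumptions for illustration; the source does not describe Inferentia's actual address layout.

```python
def bank_address(index, shape, num_banks):
    """Map element (c, h, w) to (bank, offset).

    Assumed policy: the outer (channel) dimension selects the bank in
    round-robin order, and the inner dimensions are laid out row-major
    inside that bank.
    """
    c, h, w = index
    _, H, W = shape
    bank = c % num_banks
    offset = (c // num_banks) * H * W + h * W + w
    return bank, offset

# Every element of an 8 x 4 x 4 tensor gets a distinct (bank, offset) slot.
shape = (8, 4, 4)
slots = {bank_address((c, h, w), shape, 4)
         for c in range(8) for h in range(4) for w in range(4)}
```

With such a mapping, elements that differ only in channel land in different banks and can be read in parallel, while elements of one channel sit at consecutive offsets in a single bank.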
In prior work [3], bank mapping focused on a single loop nest with a goal of maximizing the memory-access parallelism for that nest. We call this local bank mapping.
Our goal is to minimize inter-bank data movement between multiple operators (represented by multiple loop nests in our compiler). To achieve this goal, we first derive bank mappings for the operators with bank-mapping restrictions, e.g., conv2D, matmul, pooling, etc., then propagate these mappings across the network based on the data dependencies between operators. We perform a fixed-point iteration to propagate the mappings to cover all operators in the neural network and make sure that the output of an operator maps to the memory banks required by the next operator. If a tensor t has conflicting mapping requirements during the propagation, i.e., the data layout changes between consecutive operators in the network, we will introduce a tensor t ′ and a memcopy between t and t ′ to represent data movement between memory banks. Typically, for a high-dimensional tensor, we map its outer dimensions to different banks and use its inner dimensions to address different elements in the same bank to support sequential data access.
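The propagation described above can be sketched as a small fixed-point loop. This is a deliberately simplified model, not the compiler's algorithm: each operator has a single "dimension mapped across banks" label for its inputs and its output, unconstrained operators inherit the label from a producer (or, failing that, from a consumer), and a memcopy is recorded on every edge whose two ends disagree. The function name, the graph encoding, and the operator names are hypothetical.

```python
def propagate_bank_mappings(ops, fixed):
    """Propagate bank mappings over an operator graph.

    ops:   {op_name: [names of producer ops]}
    fixed: {op_name: (input_mapping, output_mapping)} for restricted ops
    Returns (output mapping per op, list of (producer, consumer) memcopies).
    """
    in_map = {op: m[0] for op, m in fixed.items()}
    out_map = {op: m[1] for op, m in fixed.items()}
    consumers = {op: [] for op in ops}
    for op, preds in ops.items():
        for p in preds:
            consumers[p].append(op)

    changed = True
    while changed:                       # iterate until nothing new is learned
        changed = False
        for op in ops:
            if op in out_map:
                continue
            # Assumed policy: prefer what a producer delivers,
            # otherwise take what a consumer requires.
            cands = [out_map[p] for p in ops[op] if p in out_map]
            if not cands:
                cands = [in_map[c] for c in consumers[op] if c in in_map]
            if cands:
                in_map[op] = out_map[op] = cands[0]
                changed = True

    # A layout change between consecutive operators needs a copy t -> t'.
    memcopies = [(p, op) for op, preds in ops.items() for p in preds
                 if p in out_map and op in in_map and out_map[p] != in_map[op]]
    return out_map, memcopies

# Hypothetical network: conv1 -> relu -> conv2 -> bias_add -> matmul
ops = {"conv1": [], "relu": ["conv1"], "conv2": ["relu"],
       "bias_add": ["conv2"], "matmul": ["bias_add"]}
fixed = {"conv1": ("channel", "channel"),
         "conv2": ("channel", "channel"),
         "matmul": ("row", "row")}
out_map, memcopies = propagate_bank_mappings(ops, fixed)
```

Here `relu` and `bias_add` inherit the channel-wise mapping from the convolutions, so the only inter-bank copy is the one in front of `matmul`, where the layout requirement actually changes.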
EVALUATION
We conducted our experiments on a homegrown AWS chip called Inferentia, specifically on an Amazon EC2 Inf1.xlarge instance. To save space, we present results for a single model for each algorithm.
We tested the effectiveness of data-movement elimination on Parallel WaveNet [8]. Our optimization was able to eliminate 123 out of 124 load-store pairs. As a result, we eliminated 145 MB (out of 146 MB) of tensors that were used for intermediate storage. We saved 10% of the on-chip memory copies and 11% of the off-chip memory copies (measured in bytes).
We tested the effectiveness of global memory-bank mapping by running our compiler on ResNet-50 [4], comparing two different mapping algorithms:
Local mapping: generates mappings within each operator, without propagation, but keeps the output of an operator in on-chip memory if it will be directly used as the input of the next operator.
Global mapping: as described in Section 2.2.
Taking the results from local mapping as a baseline, global mapping eliminated 76% of the on-chip data copies and 37% of the off-chip copies (measured in bytes).
CONCLUSION
To conclude, this paper proposes a systematic approach to globally optimize the memory-access patterns of DL workloads on accelerators. Experimental results show that we are able to significantly reduce memory references for state-of-the-art networks on Inferentia, a homegrown AWS machine-learning inference chip.
