Abstract: We present MAESTRO, a framework to describe and analyze CNN dataflows and to predict performance and energy efficiency when running neural network layers across various hardware configurations. It includes two components: (i) a concise language to describe arbitrary dataflows and (ii) an analysis framework that accepts the dataflow description, hardware resource description, and DNN layer description as inputs and generates buffer requirements, buffer access counts, network-on-chip (NoC) bandwidth requirements, and roofline performance information. We demonstrate both components across several dataflows as case studies.
Deep learning techniques, especially convolutional neural networks (CNN), have pervaded vision applications across image classification, face recognition, video processing, and so on due to the high degree of accuracy they provide [1], [2]. Both industry and academia are exploring specialized hardware accelerator ASICs as a solution to provide low-latency and high-throughput for CNN workloads [3], [4], [5], [6].
The convolution operation is a deeply nested multiply-accumulate loop. For throughput and energy efficiency, each accelerator chooses different strategies to manipulate the loop order/tiling of the convolution operations and the spatial/temporal mapping of data on compute units, which we collectively refer to as dataflow. The throughput and energy efficiency of a dataflow change dramatically depending on both the DNN topology (i.e., layer shapes and sizes) and the accelerator hardware resources (buffer size and network-on-chip (NoC) bandwidth). This makes dataflow a first-order consideration for deep learning accelerator ASICs, both at design time, when hardware resources (buffers and interconnects) are being allocated on-chip, and at compile time, when different layers need to be optimally mapped for high utilization and energy efficiency.
We present MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Resource Occupancy), an open-source tool for modeling and evaluating the performance and energy-efficiency of different dataflows. Fig. 1 presents an overview. Our key novelty is a methodology to formally describe dataflows using concisely defined pragmas over nested loops of convolutions. The "stationary"-based taxonomy introduced by prior work such as Eyeriss [3] is captured by one of our pragmas. We provide a concise DSL that allows users to input the neural network structure (shape/size), hardware resource description (buffer size and interconnect topology/bandwidth), and desired dataflow using our pragmas. MAESTRO computes the maximum performance (roofline throughput) and hardware resources (buffer sizes and NoC bandwidth) required to achieve this performance. It also produces buffer access and link traversal counts across the memory hierarchy, which can be plugged into an energy model.
MODELING DATAFLOW IN CNN ACCELERATORS

Convolutional Neural Networks (CNN)
CNNs consist of convolutional, pooling, and fully-connected layers. Among these layers, convolutional layers account for a significant share of both the computation and the required weight/input data [7]. As presented in Fig. 2, the computation in convolutional layers is often implemented as a sliding window operation with MACC (Multiplication-Accumulation). The operations can be described as six nested for-loops, as shown in Code 1.
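As an illustration, the six-loop computation can be sketched in Python. This is a minimal reference sketch, not MAESTRO code: the function name `conv_layer` is hypothetical, and the array layout (`weights[k][c][r][s]`, `inputs[c][y + r][x + s]`, `outputs[k][y][x]`, with Y and X taken as the output height and width, stride 1, no padding) is an assumption.

```python
def conv_layer(weights, inputs, K, C, Y, X, R, S):
    """Direct convolution over one image using six nested loops.

    Assumed layout: weights[k][c][r][s], inputs[c][y + r][x + s],
    outputs[k][y][x]; Y and X are the output height and width.
    Returns the output feature map and the number of MACs performed.
    """
    outputs = [[[0 for _ in range(X)] for _ in range(Y)] for _ in range(K)]
    macs = 0
    for k in range(K):                      # output channels
        for c in range(C):                  # input channels
            for y in range(Y):              # output rows
                for x in range(X):          # output columns
                    for r in range(R):      # filter rows
                        for s in range(S):  # filter columns
                            outputs[k][y][x] += (
                                weights[k][c][r][s] * inputs[c][y + r][x + s]
                            )
                            macs += 1
    return outputs, macs
```

The MAC count returned is always K * C * Y * X * R * S, which is the amount of work that any dataflow has to distribute over PEs and over time.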
[Figure: DNN layer description and neural network structure (IN, CONV, POOL, CONV, ..., FC, FC, OUT); each layer computes sums of W * I products.]
Spatial CNN Accelerator
Our discussion models the target accelerator as a spatial collection of processing elements (PEs), where each PE houses a MAC unit and a local "L1 buffer". The PEs are connected to each other and to a shared "L2 buffer" by a network-on-chip (NoC), as shown in Fig. 1. This is an abstract model: a real implementation might group multiple PEs together to create a larger PE; the L1 could be a single latch, a FIFO, or a scratchpad; and the NoC could be hierarchical buses [3], systolic, a tree, a crossbar, and so on.
for (int ki = 0; ki < K; ki++)                //Loop_K
  for (int ci = 0; ci < C; ci++)              //Loop_C
    for (int yi = 0; yi < Y; yi++)            //Loop_Y
      for (int xi = 0; xi < X; xi++)          //Loop_X
        for (int ri = 0; ri < R; ri++)        //Loop_R
          for (int si = 0; si < S; si++)      //Loop_S
            O[ki][yi][xi] += W[ki][ci][ri][si] * I[ci][yi+ri][xi+si];  //MAC; array index layout assumed
Code 1: 6D convolution code over one image
arXiv:1805.02566v1 [cs.DC] 4 May 2018
CNN Dataflows
Since the multiplications in Code 1 are all independent, accelerator architectures can re-order and sub-tile the computations for efficiency and parallelism. This is necessary to limit off-chip accesses, because the input feature map (up to 6.5MB in VGG16) and the weight values (up to 4.5MB in VGG16) are too large to be loaded onto the L1 buffers at once. These sizes also mean that choosing how to sequence the computation and split (tile) data onto spatially deployed PEs is a large design space exploration problem by itself.
Definition of Dataflow
Because of the complexity of the loop nest as shown in Code 1, choosing the best dataflow for a given network layer on a given accelerator is not intuitive to describe or design.
To address this challenge, we take a variable-centric approach, which regards the mapping of weights/inputs as equivalent to assigning iteration variables to each PE. For example, if we assign (ki, ci, yi, xi, ri, si) = (3, 4, 5, 6, 0-2, 0-2) to a PE, we observe that the PE should receive nine weights (W[3][4][0-2][0-2]). This approach provides a better abstraction than identifying the exact weights/inputs to assign to each PE (data-centric approach) because it converts the 6D array mapping problem into a six-variable assignment problem.
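The variable-centric view can be illustrated with a short Python sketch that derives the operands a PE needs from the iteration variables assigned to it. The function name `operands_for_assignment` is hypothetical, and the indexing of weights by (k, c, r, s), inputs by (c, y + r, x + s), and outputs by (k, y, x) is an assumption based on a stride-1 convolution.

```python
from itertools import product

def operands_for_assignment(k, c, y, x, r_values, s_values):
    """Derive the data a PE needs from its assigned iteration variables.

    Assumes weights are indexed by (k, c, r, s), inputs by (c, y + r, x + s),
    and outputs by (k, y, x), i.e. a stride-1 convolution.
    """
    pairs = list(product(r_values, s_values))
    weights = {(k, c, r, s) for r, s in pairs}
    inputs = {(c, y + r, x + s) for r, s in pairs}
    outputs = {(k, y, x)}
    return weights, inputs, outputs
```

Calling `operands_for_assignment(3, 4, 5, 6, range(3), range(3))` yields nine weight indices, nine input indices, and a single output index, so the six-variable assignment fully determines the data movement for that PE.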
Elements of Dataflow
Loop-ordering. Changing the order of loops in Code 1 affects the data reuse patterns in each PE. For example, if we iterate in the order of K → C → Y → X → R → S and assign each X loop iteration across PEs, weight values within the same channel can remain in each PE until the channel iteration variable ci increases. However, if we iterate in the order of K → R → S → C → X → Y and assign each C loop iteration across PEs, input feature map values within the same channel can remain in each PE. We probably cannot keep weight values, because the reuse distance spans two inner loops (X, Y), which is much larger than a typical PE's L1 buffer size. This means that the same weight might be read multiple times from the shared L2 buffer over the course of the computation. Thus, the numbers of L1 (local register) and L2 (shared buffer) reads and writes of a dataflow depend on the CNN sizes, the buffer size in each PE, and the dataflow itself, as we will show later in Fig. 7.
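A toy simulation can make the dependence of buffer traffic on loop order concrete. The Python sketch below is not part of MAESTRO and does not reproduce the two spatially mapped examples above; it assumes a single PE whose L1 buffer holds `l1_capacity` weights with LRU replacement (both assumptions) and counts how often a weight must be fetched from L2. The function name `count_weight_fetches` is hypothetical.

```python
from collections import OrderedDict
from itertools import product

def count_weight_fetches(loop_order, bounds, l1_capacity):
    """Count L2-to-L1 weight fetches for one PE with an LRU L1 buffer.

    loop_order lists the loop variables from outermost to innermost,
    bounds maps each variable name to its trip count.
    """
    l1 = OrderedDict()  # keys ordered from least to most recently used
    fetches = 0
    ranges = [range(bounds[v]) for v in loop_order]
    # itertools.product advances the last range fastest (innermost loop).
    for values in product(*ranges):
        idx = dict(zip(loop_order, values))
        weight = (idx["K"], idx["C"], idx["R"], idx["S"])
        if weight in l1:
            l1.move_to_end(weight)      # hit: refresh recency
        else:
            fetches += 1                # miss: fetch from L2
            l1[weight] = None
            if len(l1) > l1_capacity:
                l1.popitem(last=False)  # evict least recently used
    return fetches
```

With bounds K = C = 2 and Y = X = R = S = 3 and room for nine weights, the order K, C, Y, X, R, S fetches each of the 36 weights once, whereas the order Y, X, K, C, R, S cycles through all 36 weights for every output pixel and misses on every one of the 324 accesses.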
Spatial and Temporal Mapping. Spatial mapping enables individual PEs to process different sets of iteration variables at the same time. When the number of compute units or the local buffer capacity is not sufficient to cover a given spatial mapping, implicit temporal mapping is necessary: the original dataflow graph is conceptually "folded" onto a physical unit over time. For example, if the feature map width X = 4 and we spatially map X over two PEs, PE0 and PE1 can process xi = 0 and xi = 1, respectively, at cycle 0. Then, PE0 and PE1 process xi = 2 and xi = 3, respectively, at cycle 1.
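The folding in this example can be sketched in a few lines of Python. The function name `fold_spatial_mapping` is hypothetical, and the sketch assumes that each PE handles one iteration per step, as in the example above.

```python
def fold_spatial_mapping(extent, num_pes):
    """Fold a spatially mapped loop of `extent` iterations onto `num_pes` PEs.

    Returns schedule[step][pe] = iteration index handled by that PE at that
    step, or None when the PE idles because the loop has run out.
    """
    schedule = []
    for base in range(0, extent, num_pes):
        step = [base + pe if base + pe < extent else None
                for pe in range(num_pes)]
        schedule.append(step)
    return schedule
```

For X = 4 on two PEs, `fold_spatial_mapping(4, 2)` returns `[[0, 1], [2, 3]]`, matching the two cycles described above. When the extent is not a multiple of the PE count, the last step leaves some PEs idle, which lowers utilization.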
Temporal mapping indicates that a PE can process different sets of iteration variables over time, which effectively folds the compute unit requirement for full parallel execution into the time domain. For example, a PE could first process ki = 0 during cycles 0-9, and ki = 1 during cycles 10-19. Temporally mapped variables can result in stationary data (inputs, weights, or partial outputs), using the taxonomy introduced in Eyeriss [3], which requires a local buffer to store this stationary data over time.
Tiling. Spatial and temporal mapping can target individual PEs or coarse-grained groups of PEs, which we call tiles. For example, Flexflow [8] organizes PEs into multiple rows and assigns the operations for one output pixel to one of the PE rows. In this case, variables K, X, and Y are temporally mapped on a PE row, but variables C, R, and S are spatially mapped to the PEs within the row. Tiles can also be organized into more than two dimensions.
Describing Dataflows
Based on the elements discussed in Section 2.3.2, we present a method to formally describe dataflows using a combination of five pragmas that cover all the elements of a dataflow (loop ordering, spatial/temporal mapping, and tiling), as presented in Table 1. Each pragma is followed by a loop variable that indicates its target. MAESTRO processes pragmas in two phases: tile construction and variable mapping. In the first phase, tile construction, MAESTRO processes TILE pragmas from the innermost to the outermost loop to recursively construct the tile structure, as shown in the tile example in Table 1. Through this process, MAESTRO assigns a tile structure to each loop so that it can map the corresponding variables onto the right tiles.
Data Reuse
Maximizing data reuse is the prime target of many accelerators as it improves both the throughput and energy efficiency. Data reuse reduces the number of energy-consuming L2 reads and writes (which translates to fewer DRAM reads and writes), in turn translating to reduced bandwidth requirements from the L2 and the NoC implementations within the accelerator. We define three classes of data reuse.
Temporal data reuse (Stationary data): Temporal data reuse opportunity, which is the same as the stationary taxonomy introduced by Eyeriss [3], is based on non-shared variables among data classes. For example, if the innermost spatially-mapped loop variable is K, although the target weight pixel changes every loop-K iteration, the target input pixel does not, because inputs do not have a K dimension. The stationary data class is determined by the loop order and the innermost spatially mapped loop. For example, weights in a row-stationary dataflow [3] are reused in the temporal dimension as illustrated in Fig. 3, in which the weight is stationary in the K and C dimensions. Since loops R and S are merged with the Y dimension, each PE has unique R/S values; thus the weight is fully stationary in each PE for each Loop Y iteration. To exploit temporal reuse, an accelerator needs a local L1 buffer in each PE.
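The "non-shared variable" rule can be written down directly. The following Python sketch is an illustration rather than MAESTRO's implementation: the table `DATA_CLASS_VARIABLES` and the function `stationary_classes` are hypothetical names, and the variable sets assume weights indexed by (K, C, R, S), inputs by (C, Y, X, R, S), and outputs by (K, Y, X).

```python
# Loop variables that each data class depends on (assumed indexing:
# W[k][c][r][s], I[c][y + r][x + s], O[k][y][x]).
DATA_CLASS_VARIABLES = {
    "weight": {"K", "C", "R", "S"},
    "input": {"C", "Y", "X", "R", "S"},
    "output": {"K", "Y", "X"},
}

def stationary_classes(changing_variable):
    """Return the data classes that stay unchanged while one variable advances."""
    return sorted(
        name
        for name, variables in DATA_CLASS_VARIABLES.items()
        if changing_variable not in variables
    )
```

For instance, `stationary_classes("K")` returns `["input"]`, which matches the observation that inputs have no K dimension, and `stationary_classes("Y")` returns `["weight"]`.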
Spatial data reuse (Multicasted data): Spatial data reuse opportunity is based on temporal mapping and sliding window halo. For example, in Fig. 3 (b), because of the halo from Spatial_Map (2, 1), I1 is shared between PE1 and PE2 at the same time. Rather than reading the data twice from the L2 buffer, an accelerator can read it only once and multicast I1 to PE1 and PE2. To exploit spatial reuse, an accelerator needs a NoC that supports multicasting (bus, tree, etc.).
Table 1 (pragma descriptions):
Tile: Recursively constructs new tile structures using the tile structure of the next inner loop. The tile structure in the innermost loop consists of PEs, and the Tile pragma allows constructing tiles of arbitrary dimension from a 1D PE array.
Temporal_Map (Sz, Ofs): Maps Sz (mapping size) variables on every tile in the tile structure at the loop on which this pragma is used. When all the inner loops finish, the index of the mapped variables is increased by the stride Ofs (offset).
Spatial_Map (Sz, Ofs): Similar to Temporal_Map, maps Sz variables on every tile in the tile structure at the loop. However, across adjacent tiles, the mapped variables are increased by the stride Ofs, which maps different variables on each tile. Involves implicit temporal mapping (folding) when the number of tiles cannot cover the entire loop.
Spatio-temporal data reuse (Local-forwarded data): Also called producer/consumer parallelism, this is based on sliding window halo over implicit temporal mapping. For example, in the example for Spatial_Map in Table 1, a = 2 is reused over time stamps 0 and 1, in different tiles. The corresponding input/weight values can be directly forwarded to T0 from T1, instead of being read from the prefetch buffer (L2). To exploit spatio-temporal reuse, an accelerator needs local forwarding links between PEs. MAESTRO assumes that all three data reuse opportunities are provided by the accelerator in order to evaluate the potential of the dataflow. We describe the details next.
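To make the spatial and spatio-temporal reuse cases concrete, the Python sketch below enumerates which variable indices land on which tile at each time step and then looks for multicast and forwarding opportunities. It is a sketch under stated assumptions, not MAESTRO's implementation: the function names are hypothetical, and the folding rule (after each step the base index advances by `num_tiles * ofs`) is an assumption chosen to be consistent with the examples in the text. The parameters used below (Sz = 2, Ofs = 1, two tiles) are likewise chosen for illustration.

```python
def spatial_map(extent, sz, ofs, num_tiles):
    """schedule[t][tile] = variable indices mapped on that tile at step t."""
    assert sz > 0 and ofs > 0 and num_tiles > 0
    schedule = []
    base = 0
    while base + sz <= extent:
        step = []
        for tile in range(num_tiles):
            start = base + tile * ofs
            fits = start + sz <= extent
            step.append(list(range(start, start + sz)) if fits else [])
        schedule.append(step)
        base += num_tiles * ofs  # assumed folding rule
    return schedule

def multicast_candidates(schedule):
    """(step, index) pairs where an index is mapped on several tiles at once."""
    found = []
    for t, step in enumerate(schedule):
        seen = {}
        for tile, indices in enumerate(step):
            for i in indices:
                seen.setdefault(i, []).append(tile)
        found += [(t, i) for i, tiles in sorted(seen.items()) if len(tiles) > 1]
    return found

def forwarding_candidates(schedule):
    """(index, source tile, destination tile) reused in consecutive steps."""
    found = []
    for t in range(len(schedule) - 1):
        for src, src_indices in enumerate(schedule[t]):
            for dst, dst_indices in enumerate(schedule[t + 1]):
                if src != dst:
                    found += [(i, src, dst) for i in src_indices
                              if i in dst_indices]
    return found
```

With `spatial_map(5, 2, 1, 2)`, tile 0 and tile 1 hold indices [0, 1] and [1, 2] at step 0 and [2, 3] and [3, 4] at step 1. Index 1 is a multicast candidate at step 0, and index 2 can be forwarded from tile 1 to tile 0 between steps, which mirrors the T1-to-T0 forwarding described above.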
MAESTRO FRAMEWORK
MAESTRO Domain-specific Language
We present the syntax of the MAESTRO DSL in Fig. 4 (a), which consists of the DNN layer description, hardware description, and dataflow description. For the layer description, users can specify the shape and size of each dimension in a convolutional layer. For the hardware description, users can specify the L1 (i.e., private/local) and L2 (i.e., shared/global) buffer (i.e., FIFO/scratchpad) sizes and the NoC bandwidth for L2 ingress/egress traffic. The dataflow description is specified using the pragmas presented in Section 2.3.3. Fig. 4 (b) shows an example.
MAESTRO Analysis Engine
Fig. 5 provides an overview of how MAESTRO analyzes the given dataflow. As described in Section 2.3.3, MAESTRO first processes the given dataflow in two steps: tile construction and variable mapping. After the dataflow is parsed, the analysis mainly focuses on the innermost spatial loop. This is because the innermost spatial loop has the finest granularity of spatial processing, and the cost of a dataflow is closely related to the innermost spatial loop, as we show in Fig. 5. Data reuse is abstracted in the UTsz function, which identifies the unique values of a loop variable. Based on the relationship between dataflow and costs defined in Fig. 5, MAESTRO lets users know whether the given hardware resources are enough to run the dataflow without temporal folding beyond what is specified in the dataflow description. Buffer access counts, in particular, can be integrated into energy analysis tools to estimate the energy consumption of a dataflow. In the following section, we present analysis results of five dataflows using MAESTRO.
EVALUATION
Fig. 6 presents the bandwidth and L1 memory requirements of the five dataflows discussed in Table 2. Throughput is measured for a hypothetical 64-PE architecture running in steady state (non-edge regions). Fig. 7 plots the energy consumption across the MAC, L1, and L2 for the same dataflows. We perform this analysis for two CONV layers of VGG16. We emphasize that this is an evaluation of these dataflows' applicability to this hypothetical architecture, and is not meant as a comparison of the original systems, which vary widely in number of PEs, buffer sizes, network topology, and so on.
[Figure caption, partially recovered: "... Table 2. The access counts generated by MAESTRO are multiplied by appropriate energy values from Cacti [10] at 28nm for a 2KB L1 scratchpad in each PE and a 1MB shared L2 buffer. The values are normalized to the MAC energy consumption of NLR."]
We gather useful insights across the dataflows and across layers. Between dataflows, we observe, as expected, that NLR has the lowest L1 memory requirement (as it does not perform temporal reuse at the PE), and therefore has significant L2 energy consumption. For CONV1, NVDLA has the highest energy consumption among all dataflows. However, for CONV11, this trend reverses: NVDLA's energy consumption remains similar, and is 2× lower than that of NLR, WS, and Shi. This is because CONV1 has just 3 input channels, while CONV11 has 512. NVDLA is tuned for layers with many input channels (as Temporal_Map (64, 64) on variable C of the NVDLA dataflow in Table 2 shows), which makes it inefficient for early layers, where it still pays the energy cost of vector reads, but much more efficient than other dataflows in later layers. For the same reason, NVDLA requires extremely high NoC bandwidth in CONV11 (compared to CONV1), since more partial sums get mapped on each PE of NVDLA with CONV11, leading to more L1 to L2 communication for partial sums and outputs. The RS dataflow is observed to be the most energy-efficient due to very few L2 reads, especially for CONV11, demonstrating the best input and weight reuse. Compared to NVDLA, it has much worse roofline throughput in CONV1, but slightly better throughput in CONV11. The Shi dataflow has the highest L1 buffer requirement among all dataflows, as it spatially replicates variable X across 3 PEs.
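The energy comparison above rests on a simple calculation: access counts multiplied by a per-access energy, then normalized to a reference. A minimal Python sketch of that calculation follows. The function names `energy_breakdown` and `normalize` are hypothetical, and the counts and per-access energies in the usage note are made-up placeholders, not values from MAESTRO or Cacti.

```python
def energy_breakdown(access_counts, energy_per_access):
    """Multiply each component's access count by its per-access energy."""
    return {
        component: access_counts[component] * energy_per_access[component]
        for component in access_counts
    }

def normalize(breakdown, reference):
    """Express every component's energy relative to a reference energy."""
    return {component: value / reference
            for component, value in breakdown.items()}
```

As a usage example with placeholder numbers, counts of `{"MAC": 1000, "L1": 3000, "L2": 100}` and per-access energies of `{"MAC": 1.0, "L1": 2.0, "L2": 50.0}` give a breakdown of 1000, 6000, and 5000 units; normalizing to the MAC energy yields 1.0, 6.0, and 5.0. Even a small number of L2 accesses can therefore rival the L1 energy, which is why dataflows with few L2 reads fare well.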
CONCLUSION
Dataflow analysis is a high-dimensional problem that requires defining clear relationships among entities with different dimensions, such as the problem (6D), time (1D), PE mapping (1D or more based on tiling), buffer address space (1D), NoC bandwidth (1D), and so on. In this work, we proposed a systematic approach that handles this high-dimensional problem by treating each dimension separately. Our dataflow pragmas are designed to describe the dataflow behavior on each dimension, and our framework, MAESTRO, receives a dataflow description written with these pragmas, along with hardware and DNN layer descriptions, to analyze the efficiency of the given dataflow.
Using dataflows from recent accelerators, we demonstrated that the efficiency of each dataflow varies dramatically with the CNN layer sizes and that each dataflow presents a tradeoff among NoC bandwidth requirements, memory requirements, energy, and throughput. We believe that our taxonomy and analysis framework for dataflow exploration will help CNN accelerator architects to explore how much these benefits apply to their specific approach.
