Efficiently tiling and mapping high-dimensional convolutions onto limited execution and buffering resources is a challenge faced by all deep learning accelerators today. We term each unique approach a dataflow. The dataflow determines overall throughput (utilization of the compute units) and energy-efficiency (reads, writes, and reuse of model parameters and partial sums across the accelerator's memory hierarchy). In this work, we provide a first-of-its-kind framework called MAESTRO to formally describe and analyze CNN dataflows. MAESTRO uses a set of concise pragmas to describe three kinds of data reuse: spatial, temporal, and spatio-temporal. It predicts the roofline performance and energy-efficiency of each dataflow when running neural network layers, and reports the hardware resources (size of buffers across the memory hierarchy, and network-on-chip (NoC) bandwidth) required to support this dataflow. Using MAESTRO, we demonstrate trade-offs between various dataflows, and demonstrate the potential benefits of a hardware substrate with a specialized NoC that can support adaptive dataflows.
Introduction
Convolutional neural networks (CNNs) are one of the most popular deep learning approaches for image classification, face recognition, web search, and video processing [11, 25, 7] today, having exceeded human accuracy for many of these traditionally challenging problems. They are being increasingly deployed for real-time processing on edge devices with applications across gaming, self-driving cars, photo tagging, surveillance, and so on. Unfortunately, modern CNNs require billions of computations, placing extreme demands on the performance and energy-efficiency of the hardware they run on.
Specialized accelerators (ASICs and FPGAs) have emerged to address this challenge [4, 3, 2, 20, 21, 10]. Many of them are spatial in nature, i.e., they are built using an array of processing elements (PEs) and use direct communication between PEs, rather than communication via memory, for energy-efficiency. These accelerators leverage the observation that CNN processing inherently relies on the same filters convolving with multiple input images, and vice-versa, providing opportunities for data reuse. As a simple example, VGG16's CONV3_2 layer requires 1.85B MAC operations, but has less than 1M unique weights, inputs and outputs. Thus, recent accelerators have tried to keep weights or inputs or partial outputs stationary within on-chip PEs while streaming the other data types through the array. This is known as dataflow.
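The reuse opportunity behind these numbers can be checked with a few lines of arithmetic. The sketch below assumes the standard VGG16 CONV3_2 shape (256 input channels, 256 output channels, 3x3 filters, and a 56x56 output feature map); the variable names are illustrative, and the padding halo is ignored when counting inputs.

```python
# Assumed VGG16 CONV3_2 shape: 256 -> 256 channels, 3x3 filters,
# 56x56 output feature map (stride 1, "same" padding).
K, C, R, S = 256, 256, 3, 3
Y_OUT, X_OUT = 56, 56

# One MAC per (output channel, input channel, output pixel, filter tap).
macs = K * C * R * S * Y_OUT * X_OUT

# Unique data points per data class (padding halo ignored for inputs).
unique_weights = K * C * R * S
unique_inputs = C * Y_OUT * X_OUT
unique_outputs = K * Y_OUT * X_OUT

# Each weight takes part in one MAC per output pixel.
macs_per_weight = macs // unique_weights

print(f"MACs:            {macs:,}")
print(f"unique weights:  {unique_weights:,}")
print(f"unique inputs:   {unique_inputs:,}")
print(f"unique outputs:  {unique_outputs:,}")
print(f"MACs per weight: {macs_per_weight:,}")
```

Under these assumptions the layer performs roughly 1.85 billion MACs while each data class holds fewer than one million unique values, which is the gap that a dataflow tries to exploit.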
Coming up with the right dataflow is, however, an open research question today. Convolutions are deeply nested multiply-accumulate loops. There are literally millions of ways of partitioning these loops to map the billions of computations over hundreds of compute units, each of which results in a unique dataflow and corresponding data reuse. Moreover, the throughput and energy efficiency of a dataflow can change dramatically depending on both the DNN topology (i.e., layer shapes and sizes) and the accelerator hardware resources (buffer size, and network-on-chip (NoC) bandwidth and connectivity). The research community today lacks any formal framework to model and reason about the performance and energy-efficiency of different dataflows.
We present MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Resource Occupancy), an analytical tool for modeling and evaluating the performance and energy-efficiency of CNN dataflows. Fig. 1 presents an overview. Our key novelty is a methodology to formally describe dataflows using five concisely defined pragmas over nested loops of convolutions. We show that the "stationary"-based taxonomy introduced by prior work such as Eyeriss [4] is not sufficient for the full description of a dataflow, and is captured by one of our pragmas. We provide a concise DSL that allows users to input the neural network structure (shape/size), hardware resource description (buffer size and interconnect topology/bandwidth), and desired dataflow using our pragmas. MAESTRO computes the maximum performance (roofline throughput) and the hardware resources (buffer sizes and NoC bandwidth) required to achieve this performance. It also produces buffer access and link traversal counts across the memory hierarchy, which can be plugged into an energy model. Using two case studies, we demonstrate how MAESTRO can be used at design-time, to provide quick first-order metrics when hardware resources (buffers and interconnects) are being allocated on-chip, and at compile-time, when different layers need to be optimally mapped for high utilization and energy-efficiency.
The rest of the paper is organized as follows. Section 2 characterizes dataflows in CNN accelerators. Section 3 presents MAESTRO. Using MAESTRO, Section 4 presents two case studies: comparing different dataflows, and demonstrating the benefits of adaptive dataflow support in hardware. Section 6 discusses related work and Section 7 concludes.
Dataflows in CNN Accelerators
CNNs consist of convolutional, pooling, and fully-connected layers. Among these layers, convolutional layers are significant in the amount of computations and the size of required weight/input data [24] . As presented in Fig. 2 , the computation in convolutional layers is often implemented as a sliding window operation with MAC (Multiplication-Accumulation). Code 1 describes the operations as six nested for-loops.
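Code 1 is not reproduced in this text. As a rough sketch, the loop nest can be written as follows, assuming stride 1, no padding, and Y and X denoting the output height and width. The function name and argument layout are illustrative choices, not part of MAESTRO.

```python
def conv2d_direct(weights, inputs, K, C, Y, X, R, S):
    """Direct convolution as six nested loops (stride 1, no padding).

    weights[k][c][r][s] holds the filters, inputs[c][y][x] holds the input
    feature map with (Y + R - 1) rows and (X + S - 1) columns, and the
    result outputs[k][y][x] has Y rows and X columns per output channel.
    """
    outputs = [[[0 for _ in range(X)] for _ in range(Y)] for _ in range(K)]
    for k in range(K):                      # output channels
        for c in range(C):                  # input channels
            for y in range(Y):              # output rows
                for x in range(X):          # output columns
                    for r in range(R):      # filter rows
                        for s in range(S):  # filter columns
                            outputs[k][y][x] += (
                                weights[k][c][r][s] * inputs[c][y + r][x + s]
                            )
    return outputs


# Tiny example: one 2x2 all-ones filter over a 3x3 input.
example_weights = [[[[1, 1], [1, 1]]]]
example_inputs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
example_outputs = conv2d_direct(example_weights, example_inputs,
                                K=1, C=1, Y=2, X=2, R=2, S=2)
```

Every multiply in the innermost statement is independent of the others, which is what lets an accelerator reorder and tile these loops freely.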
Our discussion models the target accelerator as a spatial collection of processing elements (PEs), where each PE houses a MAC unit and a local "L1 buffer", as shown in the accelerator architecture in Fig. 1. The PEs are connected to each other, and to a shared "L2 buffer", by a network-on-chip (NoC), as also shown in Fig. 1. This is an abstract model: a real implementation might group multiple PEs together to create a larger PE, the L1 could be a single latch or a FIFO or a scratchpad, and the NoC could be hierarchical buses [4], systolic, a tree, a crossbar, and so on.
Since the multiplications in Code 1 are all independent, accelerator architectures can re-order and sub-tile the computations for efficiency and parallelism. This is necessary to limit off-chip accesses, because the input feature map (up to 6.5MB in VGG16) and the weight values (up to 4.5MB in VGG16) are too large to be loaded at once onto the L1 buffers. These sizes also mean that choosing a strategy for sequencing the computation and splitting (tiling) data onto spatially deployed PEs is a large design space exploration problem by itself.
Definition of Dataflow
Because of the complexity of the loop nest as shown in Code 1, choosing the best dataflow for a given network layer on a given accelerator is not intuitive to describe or design.
To address this challenge, we take a variable-centric approach, which regards the mapping of weights/inputs as equivalent to assigning iteration variables to each PE. For example, if we assign (ki, ci, yi, xi, ri, si) = (3, 4, 5, 6, 0-2, 0-2) to a PE, we observe that the PE should receive nine weights (W[3][4][0-2][0-2]) and one input (I[4][5][6]). This approach provides a better abstraction than identifying the exact weights/inputs to assign to each PE (data-centric approach) because it converts the 6D array mapping problem into a six-variable assignment problem.
Elements of Dataflow
Loop-ordering. Changing the order of loops in Code 1 affects the data reuse patterns in each PE. For example, if we iterate in the order K → C → Y → X → R → S and assign each X loop iteration across PEs, weight values within the same channel can remain in each PE until the channel iteration variable ci increases. However, if we iterate in the order K → R → S → C → X → Y and assign each C loop iteration across PEs, input feature map values within the same channel can remain in each PE. In that case we probably cannot keep weight values, because the reuse distance spans two inner loops (X, Y), much larger than a typical PE's L1 buffer size. This means that the same weight might be read multiple times from the shared L2 buffer over the course of the computation. Thus, the number of L1 (local register) and L2 (shared buffer) reads and writes of a dataflow changes with the CNN sizes, the buffer size in each PE, and the dataflow, as we will show later in Fig. 17.
Spatial and Temporal Mapping. Spatial mapping enables individual PEs to process different sets of iteration variables at the same time. When the number of compute units or the local buffer capacity is not sufficient to cover a given spatial mapping, implicit temporal mapping is necessary: the original dataflow graph is conceptually "folded" onto a physical unit over time. For example, if the feature map width X = 4 and we spatially map X over two PEs, PE0 and PE1 can process xi = 0 and xi = 1, respectively, at cycle 0. Then, PE0 and PE1 process xi = 2 and xi = 3, respectively, at cycle 1. Temporal mapping indicates that a PE can process different sets of iteration variables over time, which effectively folds the compute unit requirement for fully parallel execution into the time domain. For example, a PE could first process ki = 0 during cycles 0-9, and ki = 1 during cycles 10-19. Temporally mapped variables can result in stationary data (inputs, weights, or partial outputs), using the taxonomy introduced in Eyeriss [4], which requires a local buffer to store this stationary data over time.
Tiling. Spatial and temporal mapping can target separate PEs, or coarse-grained groups of PEs, which we call a tile. For example, Flexflow [16] organizes PEs into multiple rows and assigns the operations for one output pixel to one of the PE rows. In this case, variables K, X, and Y are temporally mapped on a PE row, but variables C, R, and S are spatially mapped to PEs in the PE row. A tile can be organized into more than two dimensions.
Describing Dataflows
Based on the elements discussed in Section 2.2, we propose a method to formally describe dataflows using a combination of five pragmas that covers all the elements of a dataflow (loop ordering, spatial/temporal mapping, and tiling), as presented in Fig. 5. The pragmas follow the philosophy of pragma-based parallel programming libraries such as OpenMP [5], which provide a concise way to describe loop parallelization, as the 1D convolution mapping example in Fig. 3 shows. We use the example in Fig. 3 as a representative example in this paper. Fig. 7 (a) describes the variable mapping that results from the MAESTRO pragmas in the example in Fig. 3, showing the capability to describe both temporal and spatial mapping. In addition to the temporal and spatial map pragmas, MAESTRO provides loop unroll and PE tiling pragmas, which enable the description of various dataflows. We highlight the syntax and semantics of the pragmas in Section 3.
Data Reuse
Maximizing data reuse is the prime target of many accelerators as it improves both throughput and energy efficiency. Data reuse reduces the number of energy-consuming L2 reads and writes (which translates to fewer DRAM reads and writes), in turn reducing the bandwidth required from the L2 and the NoC within the accelerator. We define three classes of data reuse.
Temporal data reuse (Stationary data): The temporal data reuse opportunity highlighted by the stationary taxonomy in Eyeriss [4] is based on non-shared variables among data classes. For example, if the innermost spatially-mapped loop variable is K, although the target weight pixel changes every K loop iteration, the target input pixel does not because inputs do not have a K dimension. The stationary data class is determined by the loop order and the innermost spatially mapped loop. For example, weights in a row-stationary dataflow [4] are reused in the temporal dimension as illustrated in Fig. 4, in which the weight is stationary in the K and C dimensions. In this example, each PE has one R and one S value until an entire row is processed; thus the weight is fully stationary for each Loop Y iteration. To exploit temporal reuse, an accelerator needs a local L1 buffer in each PE that is large enough to stage data until their next reuse.
Spatial data reuse (Multicasted data): The spatial data reuse opportunity is based on temporal mapping and the sliding window halo. For example, in Fig. 4 (b), because of the halo from SPATIAL_MAP (2, 1), I1 is shared between PE1 and PE2 at the same time. Rather than reading the data twice from the L2 buffer, an accelerator can read it only once and multicast I1 to PE1 and PE2. To exploit spatial reuse across PEs, an accelerator needs a NoC that supports multicasting (bus, tree, etc.). At a finer granularity, spatial reuse can be exploited even from the L1 (e.g., fanning out an input to multiple weight datapaths via internal wires).
Spatio-temporal data reuse (Local-forwarded data): Also called producer/consumer parallelism, this is based on the sliding window halo over implicit temporal mapping. For example, in Fig. 4, the second partial sum for the third output, denoted as P3_1, is forwarded from PE3 to PE1 to be used at a different time. To exploit spatio-temporal reuse, an accelerator needs local forwarding links among PEs, which could be dedicated links to neighbors. General topologies such as meshes can also support spatio-temporal reuse using the standard arbitrated links.
Given the description of an architecture and buffer hierarchy, MAESTRO assumes that the accelerator exploits all three data reuse opportunities to evaluate the potential of the dataflow.
MAESTRO: Dataflow Analysis
Fig. 5 presents the syntax of the MAESTRO DSL and an example hardware, layer, and dataflow description using it. For the layer description, users can specify the shape and size of each dimension of a convolutional layer. For the hardware description, users can specify the L1 (i.e., private/local) and L2 (i.e., shared/global) buffer (i.e., FIFO/scratchpad) sizes, and the NoC average hops, bisection bandwidth, and ingress/egress bandwidth from/to L2. For the dataflow description, users can specify various mappings and tilings using the pragmas presented in Section 2.3. Fig. 5 (b) shows an example.
Dataflow pragmas
The dataflow description syntax includes the dataflow pragmas we introduced in Section 2.3, which allow users to describe a variety of dataflows. The MAESTRO DSL includes four pragmas (temporal map, spatial map, unroll, and tile), whose syntax is described in Fig. 5. We highlight the semantics of each dataflow pragma.
Temporal and spatial map. Temporal and spatial mappings are fundamental pragmas to describe a dataflow. Temporal map is a variable mapping pragma that applies the same variable sets to all the tiles (PEs in the base case, when no tile pragma is used). Spatial map, on the other hand, maps different variable sets to each tile simultaneously, allowing users to exploit parallelism. Temporal and spatial maps have two arguments, size and offset. The size argument represents the number of mapped variables, and the offset argument defines how the mapped variables are updated when the PE array finishes a spatial or temporal iteration. Fig. 6 describes the precise semantics of how temporal and spatial maps determine the mapped variables at each iteration. The temporal offset of a spatial map shows how implicit temporal folding happens with spatial maps. When the number of tiles is not sufficient to cover all the variables in the spatially mapped loop, the tile array moves on to the next set of spatial mappings. This process is repeated until the spatial map covers all the variables in the corresponding loop. Fig. 7 (a) shows how the arguments affect the map size and the update of mapped variables over space (PEs) and time.
[Figure caption: MAESTRO scans dataflow pragmas from the innermost loop and constructs tiles by updating index mapping structures at each loop. After the tile construction, it scans pragmas from the outermost loop to assign three attributes described in (a). (b) shows how MAESTRO determines variable mapping in each iteration of each loop.]
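To make the size and offset arguments concrete, here is a minimal Python sketch. It assumes that a map pragma produces index windows of `size` values whose start advances by `offset`, that windows are truncated at the loop bound, and that a spatial map deals consecutive windows out to tiles and folds the remainder onto later steps. The function names are hypothetical, and the precise semantics in Fig. 6 may differ at the edges.

```python
def index_windows(size, offset, loop_bound):
    """Windows of `size` indices whose start advances by `offset`."""
    return [list(range(start, min(start + size, loop_bound)))
            for start in range(0, loop_bound, offset)]


def temporal_map(size, offset, loop_bound):
    """Every tile sees the same window at each temporal step."""
    return index_windows(size, offset, loop_bound)


def spatial_map(size, offset, loop_bound, num_tiles):
    """Consecutive windows go to different tiles in the same step.

    When there are more windows than tiles, the remaining windows are
    folded onto later steps (implicit temporal mapping).
    Result layout: steps[step][tile] -> list of indices.
    """
    windows = index_windows(size, offset, loop_bound)
    return [windows[i:i + num_tiles]
            for i in range(0, len(windows), num_tiles)]


# X = 4 mapped over two PEs, one index per PE: two folded steps.
folded = spatial_map(size=1, offset=1, loop_bound=4, num_tiles=2)

# Size 2 with offset 1: neighbouring PEs share one index (the halo).
with_halo = spatial_map(size=2, offset=1, loop_bound=4, num_tiles=2)
```

With `size=1` and `offset=1` the sketch reproduces the folding example given earlier for X = 4 on two PEs; with `size=2` the overlap between neighbouring windows is the halo that spatial reuse exploits.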
Unroll. The unroll pragma performs loop unrolling. Unroll embeds all the operations within an unrolled loop into its next upper loop, which results in the mapping of all the values of the unrolled loop variable. Therefore, it does not involve any temporal or spatial folding and yields the same effect as TEMPORAL_MAP (loop_size, loop_size) of the unrolled loop variable. Fig. 7 (b) presents an example of UNROLL. As discussed, the example maps all the s values in [0, 10) to each PE.
Tile. A tile is the fundamental mapping target of variables. In the base case, when no tile pragma is placed, each tile is a PE. The tile pragma groups PEs or tiles into groups of the size given in the pragma. For example, in Fig. 7 (c), TILE (5) X groups five tiles (in this base case, PEs) to construct a new tile. The mappings of X and its upper loops target the tiles, and the mappings of variables in inner loops target the PEs (subtiles in the case of multiple tile pragmas). Multiple tile pragmas construct hierarchical tiles by grouping the tiles present at each loop level. Tile pragmas are processed from the innermost loop to the outermost loop to support the correct hierarchical tile structure.
[Table 1 fragment: columns Accelerator, Dataflow Strategy, Dataflow Example for this work; rows include No Local Reuse (NLR) and Output Stationary (OS).]
As we observed in this subsection, the dataflow pragmas include all the elements of dataflows discussed in Section 2.2. They allow users to change the loop order by reordering pragmas, enable spatial and temporal mapping through dedicated pragmas, and support constructive tiling with the TILE pragma. We now discuss how MAESTRO analyzes a dataflow written in dataflow pragmas and generates statistics about the cost and benefit of the input dataflow.
MAESTRO Analysis Engine
The key concept for precise buffer access counts and runtime modeling is the data reuse discussed in Section 2.4. If the accelerator does not utilize any spatial or temporal reuse, the analysis is trivial. However, understanding data reuse and applying it to cost estimation is a complex problem. MAESTRO analyzes temporal and spatial reuse by identifying the total number of mapped variables and the number of uniquely mapped variables for each spatial iteration. For simplicity, we restrict the number of spatially mapped loops to one. Multiple spatially mapped loops can be analyzed by recursively applying the method we discuss.
Spatially and temporally mapped volume analysis
To compute the number of data points per tile at the spatially mapped loop, which we refer to as the mapped volume, MAESTRO simply multiplies the numbers of mapped values of the corresponding variables. This volume does not consider any reuse but represents the total number of data points accessed in each spatial iteration. MAESTRO identifies the number of spatially, temporally, and spatio-temporally unique data points by analyzing the map size of each variable, the offset of each spatial/temporal map pragma, the bounds of the outer loops, and the correlation of each variable with a data class presented in Fig. 8. Fig. 9 presents the mapped volume analysis model of MAESTRO. MAESTRO first processes each pragma and extracts the numbers of total, spatially unique, and temporally unique mapped variables, as shown in Fig. 9 (a). Based on the mapped variable analysis, MAESTRO identifies the number of total data points (mapped volume, MV), the number of spatially unique data points (spatially unique volume, MSUV), the number of temporally unique data points (temporally unique volume, MTUV), and the number of spatially and temporally unique data points (spatially and temporally unique volume, MSTUV), as presented in Fig. 9 (b). The difference between the mapped volume and a unique volume represents the reused volume. Each unique volume represents the net amount of data that needs to be processed in each spatial or temporal iteration.
//MSUV: Mapped spatially unique volume
MSUV[Weights] = GetSpUSz(K) x GetSpUSz(C) x GetSpUSz(R) x GetSpUSz(S)
MSUV[Inputs] = GetSpUSz(C) x GetSpUSz(Y) x GetSpUSz(X)
MSUV[Outputs] = GetSpUSz(K) x GetSpUSz(C) x GetSpUSz(Y') x GetSpUSz(X')
//MTUV: Mapped temporally unique volume
MTUV[Weights] = TU(K) x TU(C) x TU(R) x TU(S)
MTUV[Inputs] = TU(C) x TU(Y) x TU(X)
MTUV[Outputs] = TU(K) x TU(C) x TU(Y') x TU(X')
//MSTUV: Mapped spatially and temporally unique volume
MSTUV[Weights] = GetSTpUSz(K) x GetSTpUSz(C) x GetSTpUSz(R) x GetSTpUSz(S)
MSTUV[Inputs] = GetSTpUSz(C) x GetSTpUSz(Y) x GetSTpUSz(X)
MSTUV[Outputs] = GetSTpUSz(K) x GetSTpUSz(C) x GetSTpUSz(Y') x GetSTpUSz(X')
* GetSpUSz(V) = (V.
TemporalMap (3,1) Y -> TemporalMap (3,1) X -> TemporalMap (1,1) K -> SpatialMap (1,1) C -> Unroll R -> Unroll S
<Input stationary 1>
TemporalMap (3,1) Y -> TemporalMap (1,1) K -> TemporalMap (3,1) X -> SpatialMap (1,1) C -> Unroll R -> Unroll S
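The volume products listed above all share one shape: a product of per-variable sizes over the variables correlated with a data class. The following is a minimal sketch under that reading. The correlation table is copied from the products as printed in the text, while the function names and the example sizes are hypothetical.

```python
from math import prod

# Variables each data class depends on, as printed in the products above
# (Y' and X' denote the output row and column variables).
CORRELATED = {
    "Weights": ("K", "C", "R", "S"),
    "Inputs": ("C", "Y", "X"),
    "Outputs": ("K", "C", "Y'", "X'"),
}


def volume(sizes, data_class):
    """Product of the per-variable sizes correlated with a data class."""
    return prod(sizes[v] for v in CORRELATED[data_class])


def reused_volume(mapped_sizes, unique_sizes, data_class):
    """Mapped volume minus unique volume for one data class."""
    return volume(mapped_sizes, data_class) - volume(unique_sizes, data_class)


# Hypothetical per-variable sizes: a 3x3 input window with unrolled 3x3
# filters, where only one new X column arrives per temporal step.
mapped = {"K": 1, "C": 1, "R": 3, "S": 3, "Y": 3, "X": 3, "Y'": 1, "X'": 1}
temporally_unique = dict(mapped, X=1)

mv_inputs = volume(mapped, "Inputs")
mtuv_inputs = volume(temporally_unique, "Inputs")
reused_inputs = reused_volume(mapped, temporally_unique, "Inputs")
```

In this made-up case nine input points are mapped per step but only three are new, so six are reused from the previous step.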
Spatial and temporal data reuse pattern analysis
As recent CNN accelerator works have shown [4, 9], buffer accesses consume the majority of energy. Therefore, modeling the buffer size needed to support all of the potential data reuse implied by a dataflow, together with the buffer access counts, is crucial for estimating the energy consumption of a dataflow with a given hardware configuration on an input layer. Those buffer costs depend on multiple factors, including the mapping size, spatial and temporal data reuse, multicasting capability, the number of active tiles at the edge of layer dimensions, the frequency of data updates from upper temporal loops, and so on. These factors are not only numerous but also intertwined in a complicated manner. In particular, buffer analysis requires a deep understanding of spatial reuse, temporal reuse, and their combination, because reuse in the L1 buffer reduces the number of accesses to the L2 buffer. Also, we need to skip inactive PEs at the edge of spatial iterations, which are deactivated because their variable indices are out of range. MAESTRO considers all of these aspects, and we describe them in the detailed model description in Fig. 12 and Fig. 13. In addition to these details, we illustrate the core aspects with intuitive examples in Fig. 10.
Spatial reuse. Fig. 10 (a) shows an example of a spatial reuse pattern with a simplified dataflow. For simplicity, we omit all the loop variables except X. Because no other data points have yet been read in the PE array, the first PE (PE0) needs to read all of its mapped data from L2. From the second PE (PE1) onward, however, only the unique data is read. This implies that we need to treat the first PE as an exception in our cost model. The data read pattern depends on the spatial offset. In Fig. 13, however, we present the case where the offset is one, both for simplicity and because most recent CNNs use a stride of 1 in sliding windows. Fig. 10 (b) presents another exceptional case of spatial iteration: the edge of the spatially mapped dimension (X). In such a case, some PEs are inactive because the variables mapped to those PEs are out of range. The cost model needs to precisely compute the number of inactive PEs to avoid overestimating buffer accesses. Depending on the layer dimensions and the number of PEs, these exceptional cases can be the common case, and the resulting error is amplified at scale if they are not handled correctly; precisely modeling such cases is therefore critical to the correctness of the model.
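The first-PE exception and the inactive PEs at the edge can both be seen in a small brute-force sketch. It assumes a multicast-capable NoC, a single spatially mapped variable, and that a PE is active only when its whole window fits inside the loop bound. The function name and the `base` parameter (the first index of the current spatial iteration) are illustrative, not MAESTRO's model from Fig. 13.

```python
def new_l2_reads_per_pe(size, offset, loop_bound, num_pes, base=0):
    """New data points each PE needs from L2 in one spatial iteration.

    PE p is mapped the window [base + p*offset, base + p*offset + size).
    Data already fetched for an earlier PE in the same iteration is not
    counted again. Inactive PEs (window out of range) are reported as None.
    """
    reads = []
    already_fetched = set()
    for pe in range(num_pes):
        first = base + pe * offset
        if first + size > loop_bound:
            reads.append(None)  # inactive PE at the edge
            continue
        window = set(range(first, first + size))
        reads.append(len(window - already_fetched))
        already_fetched |= window
    return reads


# Interior iteration: PE0 reads its full window, the others one new point.
interior = new_l2_reads_per_pe(size=3, offset=1, loop_bound=100, num_pes=4)

# Edge iteration: the last two PEs fall outside the loop bound.
edge = new_l2_reads_per_pe(size=3, offset=1, loop_bound=6, num_pes=4, base=2)
```

Counting four full windows in the edge case would charge twelve reads where only four are needed, which illustrates why the edge cases matter for accuracy.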
Temporal reuse. Fig. 10 (c) presents the data reuse pattern when we consider temporal reuse. In the first temporal iteration, PEs cannot exploit temporal reuse because no temporally mapped data has been read yet. However, as the second case in Fig. 10 (c) shows, when the PE array moves on to the next temporal iteration, PEs start to see temporally staged data that can be reused, which leaves only three unique data points to read.
The reuse pattern changes immediately after the first PE in the second temporal iteration, as shown in the third case of Fig. 10 (c). In that case, because both spatial and temporal reuse can be exploited, a PE reads only one data point in the example. The unique data volume in each case depends on the map sizes and offsets, the unrolled loops, and the number of PEs. The spatial volume analysis (SV_(F/S)TP_(S/L)SP values) in Fig. 13 precisely models all the aspects discussed in the examples in Fig. 10.
In addition to aspects discussed in Fig. 10 , we also need to correctly infer the lifetime of stationary data.
Lifetime of temporally reused (stationary) data. As presented in Fig. 8, because each data class has a different set of correlated variables, the data volume mapped on each PE does not necessarily change at every spatial or temporal iteration. This gives rise to the stationary data class introduced in Eyeriss [4]. However, even for the same stationary data class, how long the data stays stationary depends on the details of a dataflow (map size, loop order, and so on), and this has not been well studied in the previous literature. Fig. 11 shows an example of how the lifetime of stationary data differs based on the dataflow.
In the example, the only difference between the two dataflows is the order of the pragmas for variables X and K. However, this difference results in a LoopSz(K) times difference in the accessed weight volumes before the loop iteration breaks the stationary boundary, which implies that the stationary data in input stationary 2 needs to be updated LoopSz(K) times as frequently as that of input stationary 1. This rate also implies the refetch factor of inputs, which eventually increases the number of L2 buffer accesses by LoopSz(K) times. However, the increased cost due to the short lifetime of stationary data, or equivalently its update frequency, needs to be considered together with the changes in the costs of other data classes. Therefore, we still need to analyze the cost of all data classes to classify a dataflow as efficient or inefficient. MAESTRO models the lifetime of stationary data as the update frequency, as illustrated in Fig. 12 (c), and it is used in the buffer cost analysis and runtime estimation illustrated in Fig. 13 and Fig. 14. Based on the core concepts we introduced so far, we describe the buffer cost and runtime (latency) model of MAESTRO.
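The effect of swapping the X and K pragmas can be reproduced by brute force: walk the temporal loop nest in order and count how often the tuple of variables that a data class depends on changes. This is only a sketch under simplifying assumptions (loop sizes count iterations rather than raw dimensions, sliding-window overlap and the spatial loop are ignored), and the function name is hypothetical.

```python
from itertools import product


def count_updates(loop_order, loop_sizes, correlated):
    """Count how often the data for one data class must be replaced.

    loop_order lists the temporal loops from outermost to innermost,
    loop_sizes gives the iteration count of each loop, and correlated
    names the loop variables that the data class depends on.
    """
    positions = [loop_order.index(v) for v in correlated]
    updates = 0
    previous = None
    # itertools.product varies the last loop fastest, like a loop nest.
    for point in product(*(range(loop_sizes[v]) for v in loop_order)):
        key = tuple(point[p] for p in positions)
        if key != previous:
            updates += 1
            previous = key
    return updates


sizes = {"Y": 2, "X": 3, "K": 4}

# Inputs depend on Y and X but not on K.
x_outside_k = count_updates(("Y", "X", "K"), sizes, ("Y", "X"))
k_outside_x = count_updates(("Y", "K", "X"), sizes, ("Y", "X"))
```

With K inside X the inputs stay put while K iterates; with K outside X the same inputs are replaced once per K iteration, so the update count grows by a factor of the K loop size.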
Buffer Analysis
Buffer size requirement. MAESTRO defines the required L1 buffer size as the sum of the mapped volumes of all data classes. This size is required to exploit the data reuse opportunities implied by each dataflow description. If the L1 buffer size is not sufficient, MAESTRO outputs an error stating that the hardware cannot support the input dataflow correctly. When the user enables double buffering, the L1 buffer requirement doubles. At the cost of buffer size, double buffering improves runtime when the total delay is dominated by communication delay.
The runtime analysis module also models double buffering and provides meaningful information about it. MAESTRO defines the required L2 buffer size as the sum of the uniquely mapped volumes over one spatial iteration, so that the dataflow can be supported without being bottlenecked by DRAM accesses. Unlike L1 buffers, MAESTRO always applies double buffering to the L2 buffer because of the significant DRAM access delay.
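The two sizing rules can be written down directly. The sketch below counts sizes in data points, takes the per-class volumes as given, and uses hypothetical function names; it mirrors the rules as stated in the text rather than MAESTRO's actual implementation.

```python
def l1_requirement(mapped_volume, double_buffering=False):
    """Required L1 size: sum of mapped volumes, doubled if double-buffered."""
    total = sum(mapped_volume.values())
    return 2 * total if double_buffering else total


def l2_requirement(unique_volume_per_spatial_iteration):
    """Required L2 size: unique volumes of one spatial iteration.

    The L2 buffer is always double-buffered, hence the factor of two.
    """
    return 2 * sum(unique_volume_per_spatial_iteration.values())


def check_l1(l1_size, mapped_volume, double_buffering=False):
    """Raise an error when the L1 buffer cannot hold the mapped volumes."""
    needed = l1_requirement(mapped_volume, double_buffering)
    if l1_size < needed:
        raise ValueError(
            f"L1 buffer of {l1_size} data points cannot support this "
            f"dataflow; {needed} are required"
        )
    return needed


# Hypothetical per-class volumes for one PE and one spatial iteration.
mapped_volume = {"Weights": 9, "Inputs": 9, "Outputs": 1}
unique_volume = {"Weights": 9, "Inputs": 12, "Outputs": 4}

l1_single = l1_requirement(mapped_volume)
l1_double = l1_requirement(mapped_volume, double_buffering=True)
l2_size = l2_requirement(unique_volume)
```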
Buffer access counts. Buffer accesses on a buffer smaller than the minimum size needed to support all the reuse depend on dynamic aspects such as the buffer entry replacement policy, the communication schedule, and so on. MAESTRO therefore analyzes the minimum buffer access counts for the given input dataflow on the input layer, which indicates the energy-saving potential of the dataflow. Modeling L2 buffer writes and L1 buffer reads is relatively simple. For L2 buffer writes, because we estimate the minimum number of buffer accesses, we apply the algorithmic minimum number of buffer writes: the number of data points of each data class (weight, input, and output). For L1 buffer reads, because local reads are not affected by any data reuse, we can simply multiply the mapped volume, the number of temporal/spatial iterations, and the number of spatial tiles (PEs in the base case without tile pragmas). However, we still need to consider the number of active tiles at the edge: when the number of tiles does not divide the spatially mapped dimension evenly, some PEs are inactive in the last spatial iteration. L2 read and L1 write counts are closely related because they are the sender and receiver of the same data. The model of L2 read counts is one of the most complicated parts of the MAESTRO model. In addition to the exceptional cases we illustrated in Fig. 10, we also need to consider the multicast capability of the network-on-chip, which enables spatial data reuse, and the update frequency of temporally mapped variables. All of those aspects, including those introduced in Section 3.2.2, are encapsulated in the model we present in Fig. 13.
The difference between the counts for L2 reads and L1 writes is whether a multicast is counted as one access (L2's perspective) or as the number of destinations (L1's perspective). Therefore, the L1 write count is the L2 read count multiplied by the corresponding multicast factor if the network-on-chip supports multicasting. When the NoC does not support multicasting, the L1 write count matches the L2 read count, because the L2 read count model already accounts for the individual delivery of data that could otherwise be multicast.
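The simpler two of these counts can be sketched as follows. The function names are hypothetical, the number of temporal iterations is assumed to be the same in every spatial iteration, and the L2 read count is taken as an input because its model (Fig. 13) is not reproduced here.

```python
def l1_read_count(mapped_volume, temporal_iterations,
                  active_tiles_per_spatial_iteration):
    """L1 reads: every active tile reads its mapped volume each iteration.

    active_tiles_per_spatial_iteration lists the number of active tiles
    in each spatial iteration, so a partially filled last iteration is
    not over-counted.
    """
    return (mapped_volume * temporal_iterations
            * sum(active_tiles_per_spatial_iteration))


def l1_write_count(l2_reads, multicast_factor, noc_supports_multicast):
    """L1 writes derived from L2 reads.

    With multicast, one L2 read is written into multicast_factor L1
    buffers. Without it, the L2 read count already contains one read per
    destination, so the two counts match.
    """
    if noc_supports_multicast:
        return l2_reads * multicast_factor
    return l2_reads


# Ten indices over four tiles: the last spatial iteration has two active tiles.
l1_reads = l1_read_count(mapped_volume=9, temporal_iterations=10,
                         active_tiles_per_spatial_iteration=[4, 4, 2])
l1_writes_multicast = l1_write_count(100, multicast_factor=4,
                                     noc_supports_multicast=True)
l1_writes_unicast = l1_write_count(100, multicast_factor=4,
                                   noc_supports_multicast=False)
```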
Performance (runtime) analysis
Runtime analysis requires full consideration of not only the aspects we highlighted in Section 3.2.1 but also hardware execution-specific details such as pipelining, double buffering, communication delay, and so on. The runtime analysis model of MAESTRO presented in Fig. 14 models all of those aspects. By default, the runtime model applies double buffering, which prefetches the data volume for the next spatial iteration and hides communication latency. This technique avoids a long critical path (data fetch delay + compute delay + data commit delay) by overlapping the data fetch delay with the rest, which results in a total delay of max(data fetch delay, compute delay + data commit delay). Also, MAESTRO takes the number of ALUs within a PE as a parameter, which helps in modeling accelerators with fine-grained PEs such as SCNN [20].
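The effect of double buffering on the per-iteration delay follows directly from the two expressions above. This is a minimal sketch with hypothetical function names; the steady-state total ignores the first fetch, which cannot be overlapped, and the full model in Fig. 14 covers more than this.

```python
def iteration_delay(fetch, compute, commit, double_buffering=True):
    """Delay of one spatial iteration, in cycles."""
    if double_buffering:
        # The fetch for the next iteration overlaps compute and commit.
        return max(fetch, compute + commit)
    # Without overlap the three phases are serialized.
    return fetch + compute + commit


def steady_state_delay(iterations, fetch, compute, commit,
                       double_buffering=True):
    """Total delay assuming every iteration costs the same."""
    return iterations * iteration_delay(fetch, compute, commit,
                                        double_buffering)


communication_bound = iteration_delay(fetch=10, compute=6, commit=2)
compute_bound = iteration_delay(fetch=4, compute=6, commit=2)
serialized = iteration_delay(fetch=10, compute=6, commit=2,
                             double_buffering=False)
```

In the communication-bound case the iteration is limited by the fetch, so double buffering saves the compute and commit time that would otherwise be serialized after it.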
Supported dataflows by analytic model
Although MAESTRO can model a large collection of dataflows, it currently does not support arbitrary tiling that splits a loop into smaller ones and places them in arbitrary positions. The MAESTRO DSL can still describe such a dataflow as multiple map pragmas placed across different tile borders, with an additional syntax that specifies the split loop sizes. Also, MAESTRO currently supports only convolutions, but it receives the data class names and the correlated variable list as inputs, and the implementation is designed to support arbitrary correlations between data classes and variables. Therefore, with some changes in APIs, MAESTRO can support other problems such as fully connected layers or LSTMs.
[Table fragment] VC1 and VC11), and PWCnet [23] conv1 (PWC1), 4 (PWC4), and 6 (PWC6).
[Table fragment] Network-on-Chip: Crossbar switch from OpenSMART [13] (CB), Microswitch NoC [14] (MS), a two-level hierarchical bus with eight sub-clusters following SCNN and Eyeriss style [20, 4] (HB), mesh NoC [13] (MH), and FLICR (FR, this work)
Evaluation
We validated MAESTRO's runtime results against MAERI [15, 1], an open-source, cycle-accurate environment based on hardware RTL source code. On average, MAESTRO's results showed a 94.1% match with the cycle-accurate simulation results, as shown in Fig. 15. Fig. 16 presents the bandwidth and L1 memory requirements of the five dataflows discussed in Table 1. Throughput is measured for a hypothetical 64-PE architecture running in steady state (non-edge regions). Fig. 17 plots the energy consumption across the MAC, L1 and L2 for the same dataflows. We perform this analysis for two CONV layers of VGG16. We emphasize that this is an evaluation of these dataflows' applicability to this hypothetical architecture, and is not meant as a comparison of the original systems, which vary widely in number of PEs, buffer sizes, network topology, and so on.
Case study 1: Dataflow Comparison
We gather useful insights across the dataflows and across layers. Between dataflows, we observe, as expected, that NLR has the smallest L1 memory requirement (as it does not perform temporal reuse at the PE), and therefore has significant L2 energy consumption, as presented in Fig. 17. For CONV1, the NVDLA dataflow consumes 98% of the average energy. However, for CONV11, this trend changes: NVDLA consumes 63% of the average energy, which is 2× lower than NLR, WS, and Shi on average. This is because the ratio of input feature map size to weight size is dramatically different in CONV1 (input-dominated) and CONV11 (weight-dominated), and the NVDLA dataflow is tuned to work efficiently in weight-dominated layers. In detail, CONV1 has just 3 input channels, while CONV11 has 512; NVDLA is tuned for operating on layers with many input channels (as TEMPORAL_MAP (64,64) on variable C of the NVDLA dataflow in Table 1 shows), making it inefficient for early layers since it still needs to pay the energy cost of vector reads, but much more efficient than other dataflows in later layers. For the same reason, NVDLA requires notably high NoC bandwidth in CONV11 (compared to CONV1), since more partial sums get mapped onto each PE of NVDLA with CONV11, leading to more L1-to-L2 communication for partial sums and outputs. The RS dataflow is observed to be the most energy-efficient because it performs very few L2 reads, demonstrating the best input and weight reuse. Compared to NVDLA, it has much worse roofline throughput in CONV1, but slightly better throughput in CONV11. The Shi dataflow has the highest L1 buffer requirement among all dataflows, as it spatially replicates variable X across 3 PEs and two variables (R and S) are unrolled in each X iteration.
Fig. 18 plots the MAC and buffer access energy of the five dataflows on two convolutional layers with 16, 32, 64, 128, and 256 PEs. Please note that the number of PEs varies within each dataflow bucket.
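To make the input-dominated versus weight-dominated distinction concrete, the following sketch counts unique input and weight elements for the two layers. The layer shapes are assumptions: standard VGG16 with a 224×224 RGB input, with CONV11 taken to be the eleventh convolution (conv5_1). The `ConvLayer` class is a hypothetical helper, not part of MAESTRO, and padding is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    name: str
    c: int  # input channels (C)
    k: int  # output channels (K)
    y: int  # input rows (Y)
    x: int  # input columns (X)
    r: int  # filter rows (R)
    s: int  # filter columns (S)

    def input_elems(self) -> int:
        # Unique input feature map elements.
        return self.c * self.y * self.x

    def weight_elems(self) -> int:
        # Unique filter weights.
        return self.k * self.c * self.r * self.s


# Assumed shapes (standard VGG16); not taken from the paper's tables.
layers = [
    ConvLayer("CONV1", c=3, k=64, y=224, x=224, r=3, s=3),
    ConvLayer("CONV11", c=512, k=512, y=14, x=14, r=3, s=3),
]

for layer in layers:
    ratio = layer.input_elems() / layer.weight_elems()
    print(f"{layer.name}: inputs={layer.input_elems()}, "
          f"weights={layer.weight_elems()}, input/weight={ratio:.3f}")
```

Under these assumed shapes, CONV1 has far more input elements than weights, while CONV11 has far more weights than input elements, which is consistent with the input-dominated and weight-dominated labels used above.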
When we increase the number of PEs, the scalability of energy consumption depends on both the target layer and the dataflow, as Fig. 18 presents. The row-stationary dataflow scales well in an early layer of VGG16 (CONV1); however, its energy consumption in a late layer of VGG16 (CONV11) increases super-linearly. This sharp increase occurs because the characteristics of late layers (small inputs and a large number of channels) do not work well with spatially mapped input columns. Because of the under-utilization of PEs and the halo in the Y dimension (TEMPORAL_MAP (3,1) Y in Table 1), the row-stationary dataflow needs to read the same input data over small tiles, which results in a large number of L2 reads. Because of its good scalability, the DLA dataflow performs better with a large number of PEs on CONV11. However, the DLA dataflow performs worst in CONV1 because of the lack of input/output channels in early layers. Therefore, we need to perform a careful study considering layer dimensions, dataflow characteristics, and scalability before we select a dataflow.
Case study 2: Benefits from Adaptive Dataflows
The analysis from MAESTRO demonstrates quantitatively that no single dataflow is the best across all layers, nor is the same dataflow the best from both a throughput and an energy-efficiency point of view. Note that automatically finding the best dataflow for a specific CNN layer is beyond the scope of MAESTRO; its purpose is to serve as a cost model for both manual exploration and future automatic tools. For this work, the point is to conclusively demonstrate the opportunity afforded by being able to use different dataflows for different layers. This requires a NoC flexible enough to dynamically reconfigure traffic patterns, without being so costly as to overwhelm the area/power benefits [14, 15]. To analyze the hardware cost of NoCs that support adaptive dataflows, we design a NoC named FLICR (Fig. 19) that has full flexibility (multicasting capability and full connectivity) and sufficient bandwidth to support various dataflows, based on NoCs proposed in recent works [14, 15]. More details are provided in Table 3. We use the evaluation environment described in Table 2 for the hardware cost analysis of supporting adaptive dataflows with various NoCs.
Performance benefits
Fig. 20 presents the performance benefits of adaptive dataflow and the impact of flexible NoCs on adaptivity. For fixed dataflow, we select the DLA and row-stationary (Eyeriss) dataflows with the Eyeriss NoC (two-level hierarchical bus). For adaptive dataflow, we select a throughput-optimized dataflow for the early and late layers of VGG-16, which are weight-stationary and DLA, respectively. We run the selected dataflow on a rigid NoC (Eyeriss) and a flexible NoC (FLICR). Compared to the two fixed dataflows, adaptive dataflow reduces runtime by 60.5% on average. We can observe that adaptive dataflow on FLICR provides the lowest runtime among all combinations.
Although the Eyeriss NoC also provides relatively low runtime with adaptive dataflow compared to fixed dataflows, it requires 70% more runtime than FLICR. Also, we can observe that the runtime of the row-stationary dataflow does not scale in the late layer. This is because the row-stationary dataflow is not optimal in late layers and its bandwidth requirement increases sharply, so NoC congestion introduces enormous delay when we increase the number of PEs.
Runtime impact of NoCs
Fig. 21 [...] performed much worse than other options. In contrast, FLICR has shown the second-best runtime for most of the test cases. In some cases, such as PWC6 in Fig. 21 (b), the Microswitch NoC outperformed FLICR by 2% because FLICR is internally pipelined while the Microswitch NoC pursues single-cycle communication. However, because the pipelined design of FLICR is better suited to increasing the clock frequency, we expect such a minor difference can be covered in practice.
Hardware cost of NoCs
Fig. 22 (a) and (b) present the post-PnR area and power of each NoC with different numbers of PEs. Although a crossbar is a good option for providing high throughput, the cost of that throughput is significant. Compared to a hierarchical bus, which is popular in recent DNN accelerators [4, 20], a crossbar requires 1.56 times more area and consumes 3.59 times more power. Mesh required the largest area and the second-highest power, while it performs worst in all the test cases we compare. Hierarchical bus requires moderate area and power with average runtime. Microswitch NoC is a lightweight structure compared to the three designs (crossbar, mesh, and hierarchical bus), with low runtime. However, it has less flexibility for different dataflows than FLICR. Furthermore, FLICR reduces the area by 7% with almost the same power consumption. Fig. 22 (a) and (b) also present a scalability study of the NoCs. As presented, FLICR scales better than the other NoC options in both area and power.
With 32 PEs, FLICR consumes 20% more power and 8% more area compared to the Microswitch NoC. However, with 256 PEs, FLICR requires 56% and 48% less area and 59% and 28% less energy compared to the h-bus (Eyeriss NoC) and the Microswitch NoC, respectively. Therefore, FLICR not only provides high bandwidth to achieve high accelerator throughput but also requires low hardware cost. Fig. 22 (c) presents the link utilization of the distribution network connected to 64 PEs and four shared buffer channels with the Eyeriss-style (row-stationary) dataflow. Based on the heatmap and the area/power evaluation, we can conclude that FLICR provisions its hardware resources very efficiently for CNN dataflows.
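The per-layer selection behind the adaptive-dataflow comparison in Case study 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the cycle counts are placeholders rather than results from this work, and reconfiguration overhead and NoC congestion are ignored.

```python
def pick_adaptive(runtime_by_dataflow):
    """Return {layer: best_dataflow} given {dataflow: {layer: cycles}}."""
    layers = next(iter(runtime_by_dataflow.values())).keys()
    return {
        layer: min(runtime_by_dataflow,
                   key=lambda df: runtime_by_dataflow[df][layer])
        for layer in layers
    }


def total_cycles(runtime_by_dataflow, choice):
    """Total runtime when each layer runs with its chosen dataflow."""
    return sum(runtime_by_dataflow[df][layer] for layer, df in choice.items())


# Placeholder cycle counts, NOT results from the paper.
runtimes = {
    "WS":  {"early": 100, "late": 400},
    "DLA": {"early": 300, "late": 150},
    "RS":  {"early": 200, "late": 500},
}

choice = pick_adaptive(runtimes)
adaptive = total_cycles(runtimes, choice)
fixed = {df: sum(per_layer.values()) for df, per_layer in runtimes.items()}
print(choice, adaptive, fixed)
```

By construction, the adaptive total can never exceed the best fixed-dataflow total in this simplified model; a cost model such as MAESTRO would supply the per-layer runtimes that the placeholder table stands in for.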
Related Works
Dataflow Analysis. Eyeriss [4] categorized dataflows into four classes (weight-stationary, output-stationary, row-stationary, and no-local-reuse) based on the temporal data reuse pattern, or stationary data class. Flexflow [16] suggested three dataflow categories (feature map, neuron, and synapse parallelism) based on the spatial data reuse pattern. MAESTRO provides a more fine-grained dataflow description than those two related works, which enables users to describe any dataflow that exploits spatial data reuse, temporal data reuse, or both at the same time. On the compiler side, several works studied data reuse in the caches of CMPs and the costs and benefits of loop permutations [12, 17]. Although they provided a thorough analysis of loop nests in the CMP domain, they cannot be directly applied to accelerators because CMPs and accelerators have different architectures (e.g., CMPs use cache memory but accelerators use scratchpad memory).
Flexible Dataflow CNN Accelerators. Table 3 compares prior approaches to flexible dataflow support in CNN accelerators with ours.
Conclusion
Dataflows play a first-order role in determining the performance, energy consumption, and required hardware resources for accelerators. We present MAESTRO, an analytic model for formally describing and analyzing CNN dataflows using five pragmas that capture various re-use opportunities within CNN accelerators. MAESTRO can be used by architects and hardware designers at design-time to determine the most area/energy/performance efficient design for the target dataflows; it can also be used by compiler writers to determine the most efficient mapping for a CNN layer on the fabric. Extending MAESTRO to model more complex dataflows (such as sparse DNNs, LSTMs) is part of our future work.
