For convolutional neural networks, we propose a simple algorithm that reduces off-chip memory accesses by maximally utilizing the on-chip memory of a neural processing unit (NPU). In particular, the algorithm provides an effective way to process a module that consists of multiple branches and a merge layer. For Inception-V3 on Samsung's NPU in Exynos, our evaluation shows that the proposed algorithm reduces the number of off-chip memory accesses to 1/50 and accordingly achieves a 97.59% reduction in the amount of feature-map data transferred to and from off-chip memory.
Introduction
Recent achievements in image processing tasks such as image recognition, object detection, and scene segmentation have been coupled with the application of deep convolutional networks (Szegedy et al., 2015; Ren et al., 2015; Long et al., 2015). As the need for more complex networks increases, we face several implementation issues, namely real-time processing, a limited power budget, and memory bandwidth. To resolve these issues, various approaches have been investigated for both cloud and mobile applications: low precision (Courbariaux et al., 2014; Hubara et al., 2016; Gupta et al., 2015; Gysel et al., 2016; Judd et al., 2015; Lin et al., 2016; Kim et al., 2018), network compression (Han et al., 2015; Han et al., 2016b), and small network design (Iandola et al., 2016; Howard et al., 2017).
1 Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country. Correspondence to: Anonymous Author <anon.email@domain.com>.
Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.
Another notable trend is to execute deep convolutional networks on mobile platforms, which is becoming important because of concerns about response time, dependency on an internet connection, privacy, and security. Many companies and research groups have recently been developing hardware accelerators called neural processing units (NPUs) (Song et al.; Zhang et al., 2016; ARM-ML-processor). They have tried to develop energy-efficient NPUs based on novel algorithms, such as exploiting network sparsity for high utilization of multiply/accumulate (MAC) units or quantizing networks to reduce the power of MAC units. Running a CNN on an NPU requires a large number of memory accesses to read and write weights and feature-map data. In (Han et al., 2016a), it was shown that the total energy is dominated by the required memory accesses if there is no data reuse, and that the energy cost of accessing on-chip memory (SRAM in (Han et al., 2016a)) is 128 times lower than that of accessing off-chip memory (DRAM in (Han et al., 2016a)). Because NPUs for mobile platforms have a limited amount of on-chip memory, however, it is not easy to maintain NPU efficiency for most applications. Therefore, we reasoned that reducing off-chip memory accesses by maximally utilizing on-chip memory can be one of the most powerful ways to increase the efficiency of NPUs.
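The energy argument above can be made concrete with a back-of-the-envelope model. The sketch below is only illustrative: it normalizes the cost of one on-chip access to 1 and uses the 128x ratio quoted from (Han et al., 2016a); the function name `relative_energy` and the access counts in the example are assumptions, not values from the evaluation.

```python
# Normalized energy model: one on-chip (SRAM) access costs 1 unit.
SRAM_COST = 1.0
DRAM_COST = 128.0  # ratio quoted from (Han et al., 2016a)


def relative_energy(total_accesses: int, off_chip_accesses: int) -> float:
    """Energy of a workload whose remaining accesses are served on-chip."""
    if not 0 <= off_chip_accesses <= total_accesses:
        raise ValueError("off_chip_accesses must lie in [0, total_accesses]")
    on_chip_accesses = total_accesses - off_chip_accesses
    return on_chip_accesses * SRAM_COST + off_chip_accesses * DRAM_COST


# Illustrative counts only: every access off-chip vs. 2% of accesses off-chip.
baseline = relative_energy(1000, 1000)
improved = relative_energy(1000, 20)
print(f"energy ratio: {baseline / improved:.1f}x")
```

Under this simplified model, the saving is governed almost entirely by how many accesses remain off-chip, which is why the rest of the paper concentrates on that quantity.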
To achieve high utilization of on-chip memory, we needed to know which series of operations an NPU requires to process a convolution layer and to choose the most efficient one among the candidates. We focused on a series of operations that fetches weights in minimal increments and executes convolution with those weights over the feature-maps of all input channels. Such a series of operations simplifies the realization of convolution because it removes the need to consider how weights are placed in on-chip memory. Figure 1 shows the required memory size of weights and feature-maps as a function of layer index for Inception-V3 and ResNet-50. The size of weights increases with the layer index, whereas the size of feature-maps decreases. From these patterns, we reasoned that it would be beneficial to keep feature-maps entirely in on-chip memory and to manage them efficiently whenever they are small enough to fit. Moreover, recent convolutional neural network architectures for computer vision have introduced the concept of a module or block to increase the representational power of the network. An important observation is that modules or blocks are used repeatedly in a network just after feature-maps are scaled down to reduce computational cost. Based on the size of feature-maps and the characteristics of modules, we were therefore able to find a simple way to minimize off-chip memory accesses by efficiently managing on-chip memory for modules.
In this paper, we propose a simple method to support energy-efficient and real-time processing on NPUs through the reduction of off-chip memory accesses. First, the algorithm detects certain types of modules or blocks by interpreting the graph of the whole network. Then, several regions of on-chip memory are assigned so that feature-maps can be reused while a module or block is processed. To utilize on-chip memory maximally, we also propose a branch-reordering algorithm and two branch-processing algorithms. By combining the proposed algorithms, we can effectively cut down off-chip memory accesses for convolutional neural networks such as Inception-V3 and ResNet. The rest of the paper is organized as follows. Section 2 describes the module we define. Section 3 introduces the proposed algorithms to reduce off-chip memory accesses. Section 4 shows evaluation results for a representative network, Inception-V3 (Szegedy et al., 2015). Finally, Section 5 concludes the paper.
Definition of Module
After Network-in-Network was proposed in (Lin et al., 2013) to increase the representational power of neural networks, the concept of a module or block has become popular in convolutional neural networks for computer vision applications, such as the Inception network (Szegedy et al., 2015), ResNet (He et al., 2016), MobileNet V2 (Sandler et al., 2018), SqueezeNet (Iandola et al., 2016), ShuffleNet (Ma et al., 2018), and MnasNet (Tan et al., 2018). In this section, we define a module and specify which of the modules defined for various networks can be used by the algorithms we propose.
Figure 2. Illustration of various modules satisfying the four conditions in Section 2. Here, b and k denote the indexes of a branch and a layer in the module, and $d_i(k)$ is the maximum depth of the module at the k-th layer.
In general, a module used in a convolutional neural network can be described as a directed acyclic graph (DAG), that is, a finite directed graph without directed cycles. It consists of a finite number of layers and edges, with each edge directed from one layer to another, such that there is no way to start at any layer A and follow a consistently directed sequence of edges that eventually loops back to A. Moreover, a convolutional neural network that includes many modules can itself be viewed as a large DAG containing multiple smaller DAGs.
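As a small illustration of the DAG property described above, the following sketch checks whether a layer graph contains a directed cycle using Kahn's topological-sort algorithm. The integer layer indices and the function name `is_dag` are assumptions made for this example; the paper does not prescribe a particular check.

```python
from collections import deque
from typing import List, Tuple


def is_dag(num_layers: int, edges: List[Tuple[int, int]]) -> bool:
    """Return True if the directed layer graph has no directed cycle."""
    successors: List[List[int]] = [[] for _ in range(num_layers)]
    in_degree = [0] * num_layers
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Repeatedly remove layers whose inputs have all been processed.
    ready = deque(i for i in range(num_layers) if in_degree[i] == 0)
    visited = 0
    while ready:
        layer = ready.popleft()
        visited += 1
        for nxt in successors[layer]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Layers on a cycle never reach in-degree zero, so they stay unvisited.
    return visited == num_layers


# Two branches (layers 1 and 2) between an input (0) and a merge layer (3).
two_branch_module = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(is_dag(4, two_branch_module))
print(is_dag(4, two_branch_module + [(3, 0)]))
```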
In this paper, we consider only a limited class of DAGs that satisfy the following conditions: 1) the module has multiple branches, 2) the module includes at least one merge layer, and 3) each merge layer is either a concatenation or an element-wise summation. In addition, 4) we do not cover large modules formed by a long skip-connection or containing many layers, which are frequently used in neural networks to extract multi-scale features, because there seems to be no efficient way to utilize on-chip memory for them. By focusing only on modules that satisfy these four conditions, we could reach a sub-optimal but simple solution that reduces off-chip memory accesses, even though the proposed algorithms do not cover all kinds of neural networks. Figure 2 shows examples of modules satisfying the required conditions and introduces the indices used to explain the proposed algorithms: b denotes the index of a branch in a module, k is the index of a layer in a branch, and $d_i(k)$ denotes the maximum depth of modules for the k-th layer. In other words, the k-th layer is included in the $d_i(k)$-th module. With these parameters, we describe two types of modules, each with a different kind of merge layer:
Concatenation based module: this type of module includes the k-th layer on the b-th branch and a concatenation layer at the end of the module, as shown in the left plot of Figure 2. Sometimes the module also includes a sub-module, that is, a module within a module, as in the Inception-C type. Representative networks with concatenation based modules are the several versions of Inception networks (Szegedy et al., 2015) and SqueezeNet (Iandola et al., 2016). In Figure 2, Conv and Pool denote a convolutional layer and a pooling layer, respectively.
Element-wise adder based module: this type of module has two branches, multiple layers, and an element-wise adder at the end. Representative networks with element-wise adder based modules are ResNet (He et al., 2016) and MobileNet V2 (Sandler et al., 2018).
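The four conditions and the two module types can be captured in a small data structure. The sketch below is a hypothetical representation: the `Module` class, the merge-type strings, and in particular the `max_layers` threshold standing in for condition 4 are assumptions, since the paper does not state a numeric limit for a "large" module.

```python
from dataclasses import dataclass
from typing import List, Optional

# Merge types allowed by condition 3 (string labels are assumed names).
VALID_MERGES = {"concat", "eltwise_sum"}


@dataclass
class Module:
    branches: List[List[str]]  # layer names per branch, in execution order
    merge: Optional[str]       # merge-layer type, or None if there is none


def is_supported(module: Module, max_layers: int = 16) -> bool:
    """Check the four module conditions; max_layers is an assumed threshold."""
    has_branches = len(module.branches) >= 2       # condition 1
    has_merge = module.merge is not None           # condition 2
    merge_ok = module.merge in VALID_MERGES        # condition 3
    total_layers = sum(len(branch) for branch in module.branches)
    small_enough = total_layers <= max_layers      # condition 4 (assumed form)
    return has_branches and has_merge and merge_ok and small_enough


# Element-wise adder based module: a three-layer branch plus an identity shortcut.
residual_block = Module([["conv1x1", "conv3x3", "conv1x1"], []], "eltwise_sum")
# Concatenation based module with four branches.
inception_like = Module(
    [["conv1x1"], ["conv1x1", "conv3x3"], ["conv1x1", "conv5x5"], ["pool", "conv1x1"]],
    "concat",
)
print(is_supported(residual_block), is_supported(inception_like))
```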
Algorithm for Effective Module Processing
The flowchart in Figure 3 shows the overall compiler optimization process, including the algorithm proposed for module processing. The compiler interprets graphs and determines a policy for executing a neural network effectively on an NPU. The neural network source code is a file that contains the graph information of a neural network, for example in prototxt or TFLite format. The parse unit extracts network parameters such as kernel size, stride, pad, layer type, feature-map size, and module parameters. With the extracted parameters, the proposed algorithm runs in the optimization unit, which includes four phases: module detection, micro-instruction generation, branch reordering, and branch processing (I/II). Module detection finds the modules in a neural network that satisfy the required conditions, and micro-instruction generation finds the best sequence of micro-instructions defined in the NPU for each module. Branch reordering decides a new processing order for the branches in a module based on a proposed criterion, and branch processing I/II decides an effective policy. Then, the memory allocation unit allocates on-chip memory resources based on the policy produced by the optimization unit. Finally, the object code generation unit outputs the neural network object code.
Since our main proposal is the optimization unit in Figure 3, we explain it in detail. Algorithm 1 shows the whole procedure of the optimization unit, which has six steps beginning with module detection and produces two types of outputs: the best sequence of micro-instructions and the resource allocation information in on-chip memory for each module. Each step of Algorithm 1 is described in detail in Algorithm 2 through Algorithm 6.
First of all, the proposed algorithm has to detect all valid modules within a network. Different kinds of modules are designed for well-known neural network models, but we focus only on those modules that are suitable for efficient use of on-chip memory. Algorithm 2 describes how to detect modules that meet the conditions given in the previous section. First, we skip modules formed by a long skip-connection, since it is extremely difficult to manage them efficiently within on-chip memory. Then, the algorithm detects modules in a modified graph from which all long skip-connections have been erased. Figure 4 shows the result of applying the module detection algorithm to Inception-V3, where Algorithm 2 detects 11 modules. After detecting modules, the optimization unit generates a sequence of micro-instructions for all layers of each module. Since this step is highly hardware-dependent, its details are beyond the scope of this paper. In any case, the optimization unit should find the best sequence of micro-instructions so that the layers of each module are processed efficiently on the specific hardware.
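The first part of module detection, erasing long skip-connections, can be sketched as follows. The paper does not give a numeric definition of a "long" skip-connection, so the `max_span` threshold on the layer-index distance is purely an assumption, as are the `Edge` type and the function name; layers are assumed to be indexed in topological order.

```python
from typing import List, NamedTuple, Optional


class Edge(NamedTuple):
    src: Optional[int]  # producing layer index; None means "read from off-chip memory"
    dst: int            # consuming layer index


def effective_edges(edges: List[Edge], max_span: int = 8) -> List[Edge]:
    """Build the effective edge set by disconnecting long skip-connections."""
    result: List[Edge] = []
    for edge in edges:
        # Assumed criterion: an edge is "long" if it jumps over many layers.
        is_long_skip = edge.src is not None and edge.dst - edge.src > max_span
        if is_long_skip:
            # Disconnected edge: the consumer now takes this input from off-chip memory.
            result.append(Edge(None, edge.dst))
        else:
            result.append(edge)
    return result


graph = [Edge(0, 1), Edge(1, 2), Edge(0, 20)]
print(effective_edges(graph))
```

Modules would then be searched for in the graph formed by the returned edges, so that a long skip-connection never forces a large feature-map to stay resident on-chip.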
Algorithm 2 (fragment)
Input: Network graph (Ω) with a layer set (Ψ) and an edge set (Θ).
for ν in Θ do
  if ν is not a long skip-connection then
    ν ∈ Θ_e
  else
    ν_e = Disconnect(ν)
    ν_e ∈ Θ_e, where ν_e is originated from off-chip memory
  end if
end for
Input: Effective network graph (Ω_e) with a layer set (Ψ) and an effective edge set (Θ_e).
As the 3rd step, we re-order the branches of all modules through Algorithm 3. First, we calculate the required memory size of each branch within a module. The required memory size of a layer is calculated in the function CalcSizeReqMem by accumulating the sizes of the IFM, OFM, internal working memories, and weights, excluding MIFM and MOFM. The required memory size of a branch is the largest of the memory sizes required by the layers on that branch. Finally, we change the processing order of the branches to descending order of required memory size. Figure 5 shows an example of a module with four branches. Assuming $size_{req(1)} < size_{req(3)} < size_{req(2)} < size_{req(4)}$, BrReordering changes the order of branches to Br(4), Br(2), Br(3), and Br(1), as shown in the right plot of Figure 5. Through BrReordering, we can allocate more of the available on-chip memory to the branch that requires a larger memory size.
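The reordering rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions: each layer is a dictionary with hypothetical keys `ifm`, `ofm`, `wm`, and `weights` (sizes in arbitrary units), and the caller is assumed to have already set to zero any IFM or OFM that lives in the MIFM or MOFM region.

```python
from typing import Dict, List

Layer = Dict[str, int]


def layer_required_size(layer: Layer) -> int:
    """IFM + OFM + working memory + weights for one layer.

    MIFM/MOFM are excluded: the caller passes 0 for an IFM or OFM
    that is held in the shared MIFM or MOFM region.
    """
    return layer["ifm"] + layer["ofm"] + layer["wm"] + layer["weights"]


def branch_required_size(branch: List[Layer]) -> int:
    """A branch needs as much memory as its most demanding layer."""
    return max((layer_required_size(layer) for layer in branch), default=0)


def reorder_branches(branches: Dict[int, List[Layer]]) -> List[int]:
    """Return branch indices in descending order of required memory size."""
    return sorted(branches, key=lambda b: branch_required_size(branches[b]), reverse=True)


def single_layer(size: int) -> List[Layer]:
    # Helper for the example: one layer whose total requirement equals `size`.
    return [{"ifm": 0, "ofm": size, "wm": 0, "weights": 0}]


# Sizes chosen so that size_req(1) < size_req(3) < size_req(2) < size_req(4).
example = {1: single_layer(10), 2: single_layer(30), 3: single_layer(20), 4: single_layer(40)}
print(reorder_branches(example))
```

With these sizes the sketch yields the order Br(4), Br(2), Br(3), Br(1), matching the example given for Figure 5.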
At the next step, we need to calculate the size of MIFM and to allocate an offset so that a memory region can be shared without any collision during module processing. As shown in Figure 6, we take the red-colored box as an example of the MIFM calculation algorithm; it is the second branch of the module and has three layers. For the 1st layer, only $size_{mifm(1,1)}$ is considered because the layer belongs only to the 1st-depth module. Because $size_{mifm(1,1)}$ is exactly the same as for the previous branch, the memory region is handed over directly from the previous branch. For the 2nd and 3rd layers, which are within the 2nd-depth module, we need to consider the second shared memory region given by $size_{mifm(2,2)}$ and $offset_{mifm(2,2)}$. After the 3rd layer has been processed, we can release this second shared memory region because the following branches do not need it. As the 5th step, we calculate the occupied memory size within MOFM at the b-th branch in Algorithm 5. It is calculated simply by accumulating the output sizes of the last layers on all previous branches. That is, $size_{occu(b)}$ is the size of the occupied region of the MOFM for the current branch, as shown in Figure 7. It is important to calculate the size of the occupied region exactly because that region of on-chip memory (the green-colored region in Figure 7) cannot be used for the operations of the current branch.
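The occupied-size calculation described above amounts to an exclusive prefix sum over the last-layer output sizes of the branches, taken in processing order. The function name and the sizes in the example are assumptions for illustration.

```python
from typing import List


def occupied_sizes(last_ofm_sizes: List[int]) -> List[int]:
    """Occupied MOFM size seen by each branch.

    last_ofm_sizes[i] is the output size of the last layer on the i-th
    branch in processing order. A branch sees only the outputs already
    written by the branches processed before it.
    """
    occupied: List[int] = []
    running_total = 0
    for size in last_ofm_sizes:
        occupied.append(running_total)
        running_total += size
    return occupied


# Illustrative sizes in arbitrary units for a four-branch module.
print(occupied_sizes([64, 96, 32, 48]))
```

The first branch sees an empty MOFM region, and each later branch sees a larger occupied region, which is why the available memory shrinks as the branch index grows.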
Algorithm 6 (fragment)
fwd(0) = false;
% estimating the forwarding status of OFM for the k-th layer on the branch
for k = 1 to len(layers of branch) do
  % calculating the available memory size
Branch processing is conducted at the last step, as shown in Algorithm 6. We propose two types of branch processing: (I) a default algorithm that considers both MIFM and MOFM, and (II) an optional algorithm that considers only MIFM. The optimization unit adaptively chooses between (I) and (II) according to the required memory size of a module. Branch processing (I) supports an operation in which the MOFM of the former module is directly forwarded to the MIFM of the latter module, that is, memory is shared between consecutive modules. However, less memory ($size_{avail}$ in Figure 8) is available to the branch operation because regions must be kept occupied for both MIFM and MOFM in the module. On the other hand, branch processing (II) is an optional algorithm applicable when the required memory size $size_{req(b)}$ of the b-th branch is larger than the available on-chip memory size $size_{avail(b,k)}$. It helps because the additional memory region of $size_{occu(b)}$ can be used for branch processing and because more memory becomes available by considering $size_{OFM.Partial}$ instead of $size_{OFM.Full}$ at the last layer on the branch. That is, we give up the shared memory region of MOFM in order to increase $size_{avail}$ and benefit only from sharing MIFM on the branch. More detailed operations are given in Algorithm 6, where $fwd(k)$ denotes the OFM forwarding flag of the k-th layer. Figure 8 shows an example of branch processing (I) for the second branch of a module, which includes three layers. At the 1st layer, $size_{mifm(1,1)}$ is the shared memory for MIFM in the module, and $size_{occu(2)}$ corresponds to the region occupied by the OFM of the last layer in the 1st branch. That is, we ... Figure 8 for the 1st layer.
If the total size including OFM, working memory (WM), and weights (W) is equal to or less than $size_{avail}$, the OFM can be directly forwarded to the IFM of the 2nd layer, as shown in Figure 8. At the 3rd layer, by the same procedure, the IFM inherits the region of the OFM of the 2nd layer, and the OFM is stored in the green-colored region within the MOFM region. By using branch processing (I), therefore, we can completely eliminate off-chip memory accesses during processing within a module. The algorithm also removes the off-chip accesses between consecutive modules by forwarding the MOFM of the former module to the MIFM of the latter module, as shown in the second plot of Figure 8.
Figure 9. Comparison of the total data size accessed in off-chip memory for the Naive and Proposed algorithms, where the total data size is summed over both weights and feature-maps.
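The choice between branch processing (I) and (II) and the forwarding test for each layer can be sketched as follows. This is a simplified illustration, not Algorithm 6 itself: the available size per layer is taken as an input rather than derived from the MIFM and MOFM layout, and all function names and sizes are assumptions.

```python
from typing import List, Tuple


def choose_processing(size_req: int, size_avail: int) -> str:
    """Default to (I); fall back to (II) when the branch does not fit."""
    return "I" if size_req <= size_avail else "II"


def forwarding_flags(layers: List[Tuple[int, int, int]], size_avail: List[int]) -> List[bool]:
    """Forwarding flag per layer.

    layers[k] = (ofm, wm, weights) sizes of the k-th layer on the branch, and
    size_avail[k] is the free on-chip size when that layer runs. The OFM can be
    forwarded on-chip to the next layer if OFM + WM + W fits in the free size.
    """
    return [ofm + wm + w <= avail for (ofm, wm, w), avail in zip(layers, size_avail)]


# Illustrative sizes in arbitrary units.
print(choose_processing(size_req=100, size_avail=128))
print(choose_processing(size_req=200, size_avail=128))
print(forwarding_flags([(40, 8, 16), (80, 8, 64)], [128, 128]))
```

A layer whose flag is false in this sketch would have to write its OFM to off-chip memory, which is the situation branch processing (II) tries to avoid by reclaiming the occupied MOFM region.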
Evaluation
To evaluate the proposed algorithm, we selected a representative CNN model, Inception-V3 (Szegedy et al., 2016), with an input image of 299 × 299. The target neural processor and its features are as follows: 1) Samsung's NPU (Song et al.) in Exynos has 1024 multiply/accumulate (MAC) units on 16 multiply/accumulate arrays (MAAs) and 1,024 Kbyte of on-chip memory, which holds IFMs, OFMs, weights, and temporary WMs. The NPU also exploits three forms of parallelism. First, IFMs are divided into four chunks along the channel dimension and fetched. Second, OFMs in the form of a 4x4 patch are computed in parallel in an MAA. Lastly, a weight kernel is copied to 16 kernels for parallel operation on the 16 MAAs.
2) It is not essential for all weight kernels of a layer to be in on-chip memory. The NPU can produce 16 OFM channels in parallel if the partial weights for the 16 MAAs are in on-chip memory. The partial weights can be read and written in a double-buffering manner to effectively hide memory access time. 3) We assume that all weights and feature-maps are 8-bit quantized in the evaluation, even though the NPU supports other precisions. Figure 9 shows the total amount of data accessed in off-chip memory for two algorithms, Naive and Proposed. Here, the Naive algorithm is assumed to access off-chip memory to read and write the feature-maps of every layer. The results show that the amount of feature-map data accessed in off-chip memory almost vanishes, while the amount of weight data accessed in off-chip memory remains the same. Therefore, the proposed algorithm reduces the total amount of data accessed in off-chip memory by almost half. With the proposed algorithm, only reduction-a and Inception-b1 among the 11 modules in Inception-V3 retain a small amount of feature-map data accessed in off-chip memory. The reason is as follows. In the reduction-a module, we applied BrProcessing(b, II), which needs to access off-chip memory three times at the end of branches because $size_{req}$ exceeded $size_{avail}$ for the module, as shown in Figure 10. That is why the off-chip accesses in reduction-a occur. As a result of the former module, reduction-a, there is then one off-chip memory access at the start of the Inception-b1 module. Figure 10 shows the important information related to $size_{req(b,k)}$ and $size_{avail(b,k)}$ in the proposed algorithm. First, the branches in each module are re-ordered by $size_{req(b,k)}$. In addition, $size_{avail(b,k)}$ decreases as the branch index in the module increases because $size_{occu(b)}$ increases.
Moreover, we can see the relationship between $size_{req(b,k)}$ and $size_{avail(b,k)}$, which is the criterion used to select BrProcessing(b, I) or BrProcessing(b, II).
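To give a feel for the sizes involved under the 8-bit assumption, the sketch below computes the footprint of a feature-map and compares it with the 1,024 Kbyte on-chip capacity stated above. It is only a rough illustration: it assumes 1 Kbyte = 1024 bytes, a three-channel 299 × 299 input, and it ignores weights and working memory, which the real allocation must also accommodate. The function names are hypothetical.

```python
ON_CHIP_BYTES = 1024 * 1024  # 1,024 Kbyte, assuming 1 Kbyte = 1024 bytes


def feature_map_bytes(height: int, width: int, channels: int, bits: int = 8) -> int:
    """Size of one feature-map in bytes for the given element precision."""
    total_bits = height * width * channels * bits
    return (total_bits + 7) // 8  # round up to whole bytes


def fits_on_chip(num_bytes: int, capacity: int = ON_CHIP_BYTES) -> bool:
    """True if the data alone fits; weights and working memory are ignored."""
    return num_bytes <= capacity


# The 299 x 299 input image, assuming three channels and 8-bit elements.
input_bytes = feature_map_bytes(299, 299, 3)
print(input_bytes, fits_on_chip(input_bytes))
```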
Consequently, Table 1 summarizes the overall amount of off-chip accesses for every module in Inception-V3, and the reduction ratio is calculated as 47.14 % (= 21673.5 / 45977.5). The table also presents the amount and the number of off-chip accesses for feature-maps only. The proposed algorithm achieves a reduction of up to 97.59 % in the amount of feature-map data accessed off-chip and reduces the number of accesses to 1/50 (= 4/200) over all modules.
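The two ratios quoted above can be checked directly from the numbers given in the text. The sketch below only reproduces that arithmetic; the units of the two amounts are not stated in this section, so they are treated as unitless, and the function name is an assumption.

```python
from fractions import Fraction


def ratio_percent(part: float, whole: float) -> float:
    """Ratio of part to whole, expressed as a percentage."""
    return 100.0 * part / whole


# Overall reduction ratio from the two amounts quoted for Table 1.
overall = ratio_percent(21673.5, 45977.5)
print(f"overall reduction ratio: {overall:.2f} %")

# Number of feature-map accesses: 4 remaining out of 200.
access_ratio = Fraction(4, 200)
print(f"access count ratio: {access_ratio}")
```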
Conclusion
In this paper, we proposed a simple method for energy-efficient and real-time processing on NPUs through the reduction of off-chip memory accesses. To achieve this, we focused on the modules used in convolutional neural networks that have multiple branches and a merge layer. The key ideas of the algorithm are module detection that ignores long skip-connections, branch re-ordering to utilize the available memory maximally, assignment of MIFM and MOFM to share memory between modules, and branch processing. For Inception-V3 on Samsung's NPU in Exynos, we showed that the proposed algorithm achieves a 97.59 % reduction in the amount of feature-map data requiring off-chip access and reduces the number of off-chip accesses to 1/50. Finally, we believe the proposed algorithm can be a powerful solution for increasing the efficiency of NPUs when processing various convolutional neural networks.
