This paper presents NeuroTrainer, an intelligent memory module with in-memory accelerators that forms the building block of a scalable architecture for energy-efficient training of deep neural networks (DNNs). The proposed architecture integrates a homogeneous computing substrate, composed of multiple processing engines, in the logic layer of a 3D memory module. NeuroTrainer uses a programmable data flow based execution model to optimize memory mapping and data re-use during the different phases of training. A programming model and supporting architecture use this flexible data flow to efficiently accelerate training of various types of DNNs. Cycle-level simulation and a synthesized design in 15nm FinFET show a power efficiency of 500 GFLOPS/W and nearly uniform throughput across a wide range of DNNs, including convolutional, recurrent, multi-layer-perceptron, and mixed (CNN+RNN) networks.
INTRODUCTION
Hardware acceleration of inference for Deep Neural Networks (DNNs), including convolutional (CNN), recurrent (RNN), and multi-layer-perceptron (MLP) networks, has received considerable attention in the recent past [1, 2, 3, 4, 5, 6, 7, 8, 9]. In contrast, training has largely been accelerated by software implementations executing on clusters of graphics processing units (GPUs). As DNNs become larger and more complex, the time and energy costs of training limit the application of DNNs to more complex problems. For example, a DGX-1 with 8 GPUs consumes more than 3 kW and is limited to training only 1,000 ImageNet images per second for VGG 16 [10]. Hence, a specialized modular architecture for energy-efficient scaling of training performance will be critical to the feasibility of future large-scale DNNs.
The training of a DNN is composed of three primary steps: forward propagation (FP), which is identical to inference, back propagation (BP), and parameter update (UP) (see Section 2 for details). Accelerating training faces major additional challenges over accelerating inference, as discussed below.
1) Most DNN accelerators for inference are optimized for convolutions with small kernels and matrix-matrix multiplication (for fully connected layers). However, accelerating BP and UP involves the following additional operations: i) convolution with very large kernels, ii) matrix transpose, iii) vector-vector outer product, iv) loss function computation, v) a pooling layer and its derivative, and vi) the derivative of non-linear activation functions.
2) Training operates over very large data sets and employs mini-batch processing, thereby requiring larger on-chip storage to increase effective memory bandwidth. In contrast, inference operates over a single sample. Further, while inference only reads weights, training reads and writes both weights and their gradients, increasing memory traffic.
3) The computation of gradients in back propagation and weight update requires higher bit precision to account for small gradient values (the vanishing gradient issue [11, 12]). Therefore, low bit precision (8/16 bit) arithmetic, often used during inference for energy efficiency, is not suitable for training.
This paper presents NeuroTrainer, an intelligent memory module with in-memory accelerators that forms the building block of a scalable architecture for energy-efficient training. The key distinguishing feature of NeuroTrainer is its programmable data flow execution model. We observed that the distinct computational kernels used in training different networks share common arithmetic operations (e.g., multiply-and-accumulate) but differ in their memory usage and data flow patterns. Hence, NeuroTrainer uses a homogeneous architecture with an execution model in which memory mapping, re-use, and data flow are programmed per kernel to match its data usage and flow pattern, and thereby optimize the performance of each kernel. This contrasts with a recently reported multi-chip-module-based training accelerator in which each chip is independently optimized for a specific operation, creating a heterogeneous architecture [13]. A major advantage of programmable data-flow based execution on a homogeneous substrate, compared to customized hardware solutions, is stable performance across many applications.
NeuroTrainer is evaluated using cycle-level simulation and synthesized in 15nm FinFET technology. NeuroTrainer demonstrates 25x higher efficiency than a GPU (Fig. 1). Moreover, NeuroTrainer shows higher average power efficiency than prior training accelerators [4, 6, 13] and, more importantly, demonstrates the ability to train different types of benchmark networks (CNN, RNN, and CNN + RNN) with similar power efficiency. We further demonstrate a scalable system of multiple interconnected NeuroTrainers that scales training performance for very large DNNs. Hence, NeuroTrainer can be used as the building block of a large-scale DNN training platform.
This paper makes the following contributions.
1) We present a programmable data flow based execution model that enables the use of a homogeneous computing architecture to efficiently train diverse DNNs. This flexible data flow programmability makes it possible to efficiently accelerate training of various types of DNNs, including CNNs, RNNs, MLPs, and hybrid networks (CNN+RNN).
2) We present the NeuroTrainer as a 3D memory module with an integrated in-memory accelerator. The architecture of the memory module is patterned after the Hybrid Memory Cube (HMC) which is composed of stacked DRAM partitioned across multiple independently controlled vaults. Each vault has an independent vault controller on the logic die; therefore multiple partitions in a DRAM die can be accessed simultaneously.
3) We present an in-memory accelerator composed of an array of interconnected processing engines (PEs), implemented on the logic die with precision-configurable arithmetic and support for data flow based execution. All but one of the memory vaults are connected to a dedicated PE; the remaining vault is connected to all the PEs through a shared bus. Each vault controller is augmented with a Programmable Memory Address Generator (PMAG).
4) We develop a programming model for the NeuroTrainer. Compilation involves optimized mapping of data (inputs, parameters, and gradients) into memory vaults, and programming the PMAGs to orchestrate an efficient flow of data between the DRAM and the logic layer in a manner that optimizes bandwidth, exploits re-use, maximizes concurrency, and minimizes data movement and buffering in the PEs.
The rest of the paper is organized as follows. Section 2 introduces DNN training; Section 3 illustrates the proposed architecture; Section 4 explains the programming model; Section 5 presents simulation results, followed by related work and conclusion.
PRELIMINARIES
In this section, we explain how a DNN is trained with gradient descent, which in recent DNNs is composed of three steps: feedforward, backpropagation, and weight update [14]. Fig. 2 shows a simple DNN and its feedforward, backpropagation, and weight update operations for the different types of layers in the network.
Feedforward (FF)
Feedforward is the propagation of neuron activations from the i-th layer to the (i+1)-th layer through the weights between the two layers. The output of a neuron (its state) is the weighted sum of the activations of all connected neurons in the previous layer (and, for an RNN, in the current layer as well [15]). It is the only phase required during inference. Fig. 2 shows feedforward through a convolution layer and a fully connected layer.
Backpropagation (BP)
For a given input, at the end of the feedforward operation, the output of the last layer is compared with the ground truth (i.e., the desired output for this input) to compute the loss (L). The loss can be defined by a simple mean squared error (MSE) or by a combination of a softmax layer and a cross-entropy layer [16]. Backpropagation is the phase that finds the impact of each state on the loss, i.e., the gradient ∂L/∂X (dX), by propagating from the last layer. Since the loss (L) is not defined in the hidden layers, ∂L/∂X (dX) is computed from the dX of the (i+1)-th layer (∂L/∂Y (dY)) rather than directly. Most of the arithmetic in backpropagation is similar to that of feedforward in convolution and fully connected layers, except that the kernel is transposed (Fig. 2).
Weight Updates (UP)
Based on dX, ∂L/∂W (dW) must be generated to reduce L in the next iteration (epoch). The new W for the next iteration is determined as W_new = W_old − η · dW, where η is the learning rate. Recently, additional terms such as momentum [17] have been added to the update. For a convolution layer, the update requires a convolution between X and dY. As the dimension of dY is smaller than that of X only by the radius of the kernel (W), this is a convolution with a very large kernel. For a fully connected layer, the update requires a vector-vector outer product. Weight update therefore needs additional operations and data flows to run efficiently.
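As an illustration of the update rule above, the following minimal Python sketch applies W_new = W_old − η · dW element-wise. The momentum variant shown is one common formulation and is an assumption, since the text only cites momentum [17] without giving its form; the names `sgd_update`, `momentum_update`, `eta`, and `mu` are illustrative.

```python
def sgd_update(w, dw, eta):
    """Plain gradient descent: W_new = W_old - eta * dW (element-wise)."""
    return [wi - eta * gi for wi, gi in zip(w, dw)]


def momentum_update(w, dw, velocity, eta, mu):
    """One common momentum variant (an assumption, not taken from the paper):
    v_new = mu * v - eta * dW, then W_new = W_old + v_new."""
    v_new = [mu * vi - eta * gi for vi, gi in zip(velocity, dw)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


if __name__ == "__main__":
    weights = [1.0, -2.0]
    gradients = [0.5, -1.0]
    print(sgd_update(weights, gradients, eta=0.5))
```

With `mu = 0` and a zero initial velocity, the momentum variant reduces to the plain update.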
Data Preparation (Prep)
For each operation, the input data need to be pre-loaded into the memory to improve the data flow between the memory and the processing engines. If the layout of the output generated by layer i does not match the data layout required by layer i + 1, the data must be re-arranged across multiple memory banks. In addition, for convolution, the input needs dummy zeros on its boundary to make the size of the output the same as the size of the input.
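The zero-boundary step can be sketched as follows. This is a minimal single-channel illustration, assuming a square kernel of size 2r + 1 (r being the kernel radius) and unit stride; `zero_pad` and `valid_output_size` are hypothetical helper names.

```python
def zero_pad(image, r):
    """Surround a 2D list `image` (H rows by W columns) with r rows and
    r columns of zeros on every side."""
    width = len(image[0]) + 2 * r
    padded = [[0] * width for _ in range(r)]
    for row in image:
        padded.append([0] * r + list(row) + [0] * r)
    padded.extend([0] * width for _ in range(r))
    return padded


def valid_output_size(n, k):
    """Output length of an unpadded, unit-stride convolution of n inputs
    with a kernel of length k."""
    return n - k + 1


if __name__ == "__main__":
    r = 1
    padded = zero_pad([[1, 2], [3, 4]], r)
    # A (2r + 1)-wide kernel over the padded input restores the original size.
    print(padded, valid_output_size(len(padded), 2 * r + 1))
```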
Minibatch Training
Mini-batch training updates the weights after training on a small set (K) of training samples. If the total number of training samples is N, it iterates N/K times per epoch. It is faster than training with large batch sizes and shows smoother convergence than training on individual images. Moreover, it can reuse the weights K times, improving computing efficiency [14]. However, it requires more on-chip memory to store the temporary data of K samples.
The multiple mini-batches are trained in parallel us- 
PROPOSED ARCHITECTURE
In this paper, NeuroTrainer is designed around a Hybrid Memory Cube (HMC), in which a 3D memory stack is partitioned into multiple parallel vaults. For example, HMC 1.0 is composed of 4 DRAM dies partitioned into 16 vaults; each vault has an independent memory controller (vault controller, VC) and is connected to 4 external (off-chip) links via an AXI interface. The computation fabric of NeuroTrainer is composed of multiple processing engines (PEs), where each PE contains clusters of computation units and local buffers for inputs (states) and parameters. Each PE has one high-bandwidth connection to its local vault and an interface to a broadcast bus connected to a shared vault. The vault controllers are augmented with a programmable memory address generator (PMAG), a state machine that maps the different types of data (inputs, parameters, and gradients) to different vaults and controls the data flow between the memory and the PEs.
Hybrid Data Flow
NeuroTrainer is designed to provide two different data movements between the vaults and the PEs, depending on the operation type. Consider convolution and matrix-matrix multiplication. During convolution (Fig. 3 (a)), the kernel (W) is shared by the PEs while the input is partitioned across the PEs. For matrix-matrix multiplication, the weight matrix (W) is partitioned while the input (X) is shared by the PEs. We note that in both cases one of the inputs can be shared by the PEs. Based on the size of this common input, the operations in Fig. 2 can be classified as small common data (e.g., convolution with small kernels) and large common data (e.g., matrix-matrix multiplication). For example, the weights of the small kernels in a convolution layer are small common data, whereas the large weight matrix in a fully connected layer corresponds to large common data. The approach used in NeuroTrainer is to buffer copies of small common data across the PEs and to stream partitions of large data (e.g., the inputs to a layer) from the local vault. Alternatively, large common weight matrices can be broadcast from a shared vault to all PEs, while partial weight matrices are stored across the PEs. These two classes of data flows are illustrated in Fig. 3.
Data rearrangement among the vaults is required to dynamically change the data flow from one type of layer to another. However, in a DNN, a set of convolution layers is followed by a set of fully connected layers; therefore, rearrangement is required only once in each of feedforward and backpropagation.
Programmable Memory Address Generator (PMAG)
The programmable memory address generator (PMAG) controls the data flow by providing memory addresses to the vault controller for reads and writes, and by pushing the data through the NeuroTrainer. The PMAG is composed of 7-level nested counters (r1...r7), combinational logic to generate the address, and decoders that assign counter values as inputs of the combinational logic (Fig. 4). The PMAG also computes the non-linear function (and its derivative) by using look-up tables [LUTs, for f(x) and f'(x)] for (a) the activation function (ReLU, tanh, etc.) or (b) the exponential/logarithm for the softmax and cross-entropy layers.
Convolution Feedforward / Backpropagation. Fig. 5 shows the input X partitioned into 4 pX with boundary overlap for convolution (assuming 4 PEs). As the kernel size is small, the kernels are duplicated into the buffers of all PEs. For each kernel (the outermost loop is N_O), N_MAC inputs are processed in parallel (SIMD). For backpropagation, the transpose of W (W^T) is required; it can be handled in the PE without reshaping the data in the PE buffer, as explained in Section 3.3.4.
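The nested-counter address generation in the PMAG can be sketched in Python as below. This is only an illustration: the actual combinational logic of Fig. 4 is not reproduced here, so the sketch assumes the address is a base plus a weighted sum of counter values, that the counter maxima are inclusive, and that the first counter is the outermost loop. The name `pmag_addresses` and the `strides` parameter are hypothetical.

```python
from itertools import product


def pmag_addresses(max_values, strides, base=0):
    """Yield one address per step of the nested counters.

    max_values: inclusive maximum of each counter (minimum is zero),
                listed from the outermost to the innermost loop.
    strides:    assumed per-counter multipliers standing in for the
                combinational address logic.
    """
    counters = [range(m + 1) for m in max_values]
    # product() advances the last counter fastest, like nested loops.
    for values in product(*counters):
        yield base + sum(s * v for s, v in zip(strides, values))


if __name__ == "__main__":
    # Two active levels: 2 rows of 3 elements, row pitch of 10 words.
    print(list(pmag_addresses([1, 2], [10, 1])))
```

Unused levels can be programmed with a maximum of zero, so the same generator covers operations that need fewer than seven loops.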
Convolution Weight-update. After generating dY, dW is needed to update the weights. Fig. 6 shows the convolution weight update when N_I is 2. For each sample (X_i), dW_i = X_i * dY_i is computed, and the final dW is computed by averaging all dW_i. Although weight update is also a convolution between X and dY, the kernel size (W_O by H_O) is similar to the input size (W_I by H_I). Because the kernel (dY) is large, partitioning the input (X) with boundary overlap (Fig. 5) is inefficient, and duplicating dY into all PEs is impractical. Therefore, we convert the convolution with a large kernel into a matrix-matrix multiplication by lowering the convolution, similar to how cuDNN performs convolution [20] (Fig. 6 (b)). Although the drawback of lowering is that the memory requirement increases from X_i to X^M_i, in-memory computation in NeuroTrainer can resolve this memory challenge.
Matrix-matrix multiplication. The main operation of a fully connected layer or a recurrent layer is matrix-matrix multiplication (A × X = AX) [21, 22]. Fig. 7 shows that A is divided row-wise into 4 pA_i (i: PE index) and how a single pA_i is partitioned into small blocks (each of size L×P), each of which fits into half of the buffer in the PE (double buffering). As explained in Section 3.1, two data paths operate in matrix-matrix multiplication (large common data), and the PMAG of the common data vault and the PMAGs of the independent data vaults are programmed separately. Fig. 7 shows pA_i partitioned into 3 by 2 blocks. After the first 3 blocks of pA_i and X are processed, a block of pAX_i is generated (size = N_MAC by H). The pAX_i needs to be delivered to the common data vault.
Figure 7: Matrix-matrix multiplication using 4 PEs. Each PE computes pA × X = pAX, where pA (H by W) × X (W by N_MAC) = pAX (H by N_MAC).
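The lowering step used for the convolution weight update can be illustrated with a small single-channel Python sketch. It assumes unit stride and no padding, so that dW[a][b] is the sum over (i, j) of X[i+a][j+b] · dY[i][j]; the function names `lower` and `conv_weight_gradient` are illustrative, and the exact layout of the lowered matrix in NeuroTrainer may differ.

```python
def lower(x, out_h, out_w):
    """Lower input x into a matrix with one row per dW element and one
    column per dY element (an im2col-style layout)."""
    k_h = len(x) - out_h + 1
    k_w = len(x[0]) - out_w + 1
    rows = []
    for a in range(k_h):
        for b in range(k_w):
            rows.append([x[i + a][j + b]
                         for i in range(out_h) for j in range(out_w)])
    return rows, (k_h, k_w)


def conv_weight_gradient(x, dy):
    """Compute dW = X * dY as a matrix-vector product on the lowered input."""
    out_h, out_w = len(dy), len(dy[0])
    x_m, (k_h, k_w) = lower(x, out_h, out_w)
    dy_flat = [v for row in dy for v in row]
    flat = [sum(p * q for p, q in zip(row, dy_flat)) for row in x_m]
    return [flat[a * k_w:(a + 1) * k_w] for a in range(k_h)]


if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    dy = [[1, 0], [0, 1]]
    print(conv_weight_gradient(x, dy))
```

In this toy case the lowered matrix holds 16 values for a 9-element input, which shows the memory growth that lowering trades for a regular matrix product.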
Vector-Vector Outer product. For weight update in an FC layer, the input (X) and the gradient (dY) of each sample in the batch need to be multiplied to generate dW. In contrast to matrix-matrix multiplication, the N_i samples cannot be unrolled at the SIMD level. In other words, this operation must be repeated N_i times and dW needs to be averaged. Fig. 8 shows the vector-vector outer product using 4 PEs. Vector A is divided across 4 vaults (pA_i, size = H), and B is stored in the common data vault and broadcast. The operation inside the PE is similar to that of matrix-matrix multiplication; however, the output (pA pB^T) does not need to be merged into the common vault, since it is the gradient of the weights in the FC layer. It is therefore written back to the dedicated vault.
Data Preparation. Fig. 5 (a) shows that convolution with 4 PEs generates 4 pY in parallel. If it is the last convolution layer before a fully connected layer, the outputs of the convolution layer must be merged into the common data vault before being broadcast for matrix-matrix multiplication (Fig. 3). The order in which the PEs send data is pre-determined in the BUS. Based on this order, the PMAG connected to the common data vault also knows the position of each portion in the merged data (P_W, P_H). In a similar way, data from the common data vault can be partitioned across all other vaults.
Add/remove zero boundary. Before convolution, the input needs to be zero padded on the boundary, based on the kernel radius r, so that the output has the same size as the input.
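The partitioned outer product described above can be sketched as follows. The sketch assumes the length of A is divisible by the number of PEs and models each PE's dedicated vault as a separate block of the result; `partition` and `averaged_outer_product` are hypothetical names, and the loop over samples reflects that the operation is repeated once per sample rather than unrolled.

```python
def partition(vec, n_pe):
    """Split vec into n_pe contiguous slices of equal length."""
    h = len(vec) // n_pe
    return [vec[i * h:(i + 1) * h] for i in range(n_pe)]


def averaged_outer_product(a_samples, b_samples, n_pe=4):
    """Return one (h by len(b)) block per PE holding the batch-averaged
    outer product pA * B^T for that PE's slice of A."""
    n = len(a_samples)
    h = len(a_samples[0]) // n_pe
    width = len(b_samples[0])
    blocks = [[[0.0] * width for _ in range(h)] for _ in range(n_pe)]
    for a, b in zip(a_samples, b_samples):        # repeated once per sample
        for pe, p_a in enumerate(partition(a, n_pe)):
            for r, a_val in enumerate(p_a):
                for c, b_val in enumerate(b):     # b is broadcast to every PE
                    blocks[pe][r][c] += a_val * b_val / n
    return blocks


if __name__ == "__main__":
    a_batch = [[1, 2, 3, 4], [3, 2, 1, 0]]
    b_batch = [[1, 1], [1, -1]]
    for pe, block in enumerate(averaged_outer_product(a_batch, b_batch)):
        print(pe, block)
```

Each block stays with its PE, mirroring the write-back to the dedicated vault instead of a merge into the common vault.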
Processing Elements (PE)
A processing element is composed of an array of k MACs, k comparators, and three local buffers: two input buffers and one output buffer (partial sum) (Fig. 9). Similar to the PMAG, the PE also needs to be programmed before the main computation.
Local Buffers
To avoid stalling the PE (idle mode) due to a lack of operands (inputs), we use two input buffers and an output buffer in the PE. For each buffer, while half of the memory is consumed by the MACs (computing operation), the other half can be filled simultaneously (double buffering). All local buffers have an address generator based on nested counters. In Fig. 9, CNT2 is a two-level nested 16-bit counter and CNT1 is a single-level 16-bit counter. A data stream between the DRAM and the PE ends with an END-MARK (0xFFFF for the 16-bit case, 0xFFFFFFFF for the 32-bit case). Computing in the PE starts only when both input buffers are ready (half filled).
Multiply and Accumulate Units
As the primitive arithmetic operator, a row of k multiplier and accumulator (MAC) units is placed in a PE. Although reduced precision (16-bit fixed point) is acceptable for inference (forward propagation), even 32-bit fixed point in backpropagation and parameter update may result in inaccurate training in deeper networks, in particular recurrent networks [25], as illustrated in Fig. 10. Stochastic rounding (SR) can be applied to overcome the quantization error in fixed point [26, 25]. We note that there is no accuracy degradation between SR and SR LO.
Therefore, we design the MAC to operate on 1) two pairs of 16-bit operands or 2) a single pair of 32-bit operands (Fixed 32/16). To add stochastic rounding, 64 random number generators are added [25] (Fixed 32/16 + SR). To reduce the power/area overhead, we propose to use a single random number generator that generates a single bit every clock (Fixed 32/16 + SR LO, Fig. 11). Synthesis in 15nm FinFET [23] shows that the proposed design provides higher energy efficiency (Table 1) while providing training accuracy similar to that of a floating-point design. The MAC operates in the 16-bit mode without SR during inference (forward propagation).
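For readers unfamiliar with stochastic rounding, the following Python sketch shows the basic idea on a fixed-point grid: a value is rounded up with probability equal to its fractional remainder, so the rounding error averages out to zero. It is a software illustration only and does not model the random number generators of the SR or SR LO hardware designs; `stochastic_round` and `frac_bits` are hypothetical names.

```python
import math
import random


def stochastic_round(x, frac_bits, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with a
    probability equal to the fractional remainder."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    remainder = scaled - lower          # in [0, 1)
    if rng.random() < remainder:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their mean approaches 0.3, whereas
    # round-to-nearest on the same grid would always return 0.25.
    print(sum(samples) / len(samples))
```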
Comparator Unit
Since the MAX operation is required only for max pooling during inference (forward propagation), 16-bit fixed-point comparators are placed in the PE. Based on the pooling radius (r), r^2 data are streamed into the controller, and the comparator unit returns the maximum value and its ID for backpropagation.
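A minimal sketch of why the comparator returns an ID along with the maximum: the index recorded in the forward pass tells backpropagation which input position receives the gradient. The function names and the tie-breaking rule (first maximum wins) are assumptions for illustration.

```python
def max_with_index(window):
    """Return (maximum value, its index) over a flat window of streamed
    values; the first occurrence wins on ties."""
    best = 0
    for i in range(1, len(window)):
        if window[i] > window[best]:
            best = i
    return window[best], best


def pool_backward(grad, index, size):
    """Route the output gradient to the position that held the maximum;
    every other position in the window receives zero."""
    out = [0] * size
    out[index] = grad
    return out


if __name__ == "__main__":
    value, idx = max_with_index([3, 9, 2, 9])   # a 2 x 2 window, flattened
    print(value, idx, pool_backward(5, idx, 4))
```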
PE Operation
After the two input buffers (BUF Input 1 and BUF Input 2) are filled, BUF Input 1 pushes one 32-bit input (one 32-bit operand or two 16-bit operands) while BUF Input 2 pushes k (N_MAC) 32-bit inputs. For the MAC operation (all cases except max pooling), the k MAC arrays compute y = ax + y, where x and y are vectors of length k (32 bit) or 2k (16 bit). For the MAX operation (max pooling), the k comparators return the maximum value as y = max(x, y).
Convolution. In convolution, k inputs are processed by k MACs in parallel (SIMD level). Therefore, the kernels are stored in BUF Input 1 and the k inputs are stored in BUF Input 2. If the k inputs cannot be stored in BUF Input 2 due to its capacity, k subsets of the k inputs are stored, and the newly required input (k × H_k) is updated during the operation, similar to [3]. For convolution backpropagation, W^T is easily obtained by sweeping the counter values in CNT2 attached to BUF Input 1.
Matrix-Matrix multiplication. As in convolution, k inputs are processed in parallel (SIMD level). Therefore, a partial weight matrix is loaded in BUF Input 1 and k partial inputs are stored in BUF Input 2. After one partial weight matrix [a,b] (P by L in Fig. 7) is consumed, the next partial weight matrix [a,b+1] is processed. As in convolution, W^T is obtained by sweeping the counter values in CNT2 attached to BUF Input 1.
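The idea of obtaining W^T by changing the counter sweep, rather than moving data, can be sketched for the matrix case as follows. The sketch assumes a row-major buffer and a two-level counter whose levels swap roles for the transposed read; the actual sweep programmed into CNT2 is not specified in the text, and `read_order` is a hypothetical name.

```python
def read_order(rows, cols, transpose=False):
    """Addresses into a row-major buffer holding a rows x cols matrix.

    With transpose=True the column counter becomes the outer loop, so
    the same buffer is read out in the order of the transposed matrix.
    """
    if not transpose:
        return [r * cols + c for r in range(rows) for c in range(cols)]
    return [r * cols + c for c in range(cols) for r in range(rows)]


if __name__ == "__main__":
    buf = [1, 2, 3, 4, 5, 6]                    # W = [[1, 2, 3], [4, 5, 6]]
    print([buf[a] for a in read_order(2, 3)])
    print([buf[a] for a in read_order(2, 3, transpose=True)])
```

The second read yields the elements of W^T = [[1, 4], [2, 5], [3, 6]] in row-major order without rewriting the buffer.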
Vector-Vector outer product. In the fully connected weight update, k inputs cannot be processed in parallel. In computing AB^T, A is loaded in BUF Input 1 and B is loaded in BUF Input 2. In other words, k elements of B are delivered to the k MAC units in a single clock (Fig. 8).
BUS Interface
The bus interface has two operation modes controlled by the common data vault: broadcasting to all PEs, and merging data from all PEs into the common data vault. The BUS and the PEs communicate using three-way handshaking (REQ-ACK-SEND) for both operations. Broadcasting mode is set when all PEs can take data (their input buffers are ready); during broadcasting, any REQ from a PE is ignored (broadcasting takes priority over merging). During merging mode, all PEs send REQ and receive ACK from the bus based on predetermined priorities among the PEs. Although all 15 PEs request the BUS for write-back simultaneously, the impact of the total write-back latency (for 15 PEs) on throughput can be minimized, because the PE computing latency dominates the overall latency. The bus architecture is designed and synthesized to guarantee the same bandwidth as a single vault (10 GBps). We use a 4-stage pipelined BUS interface [27]; it takes 4 clock cycles between a vault and any PE.
PROGRAMMING
Following the discussions in Section 3.2, Table 2 and Table 3 summarize the PMAG programming, which includes setting the ranges of the 7 nested loops (r1 ∼ r7) and connecting the counter values to the combinational logic for different operations. For matrix-matrix multiplication (FC-FF/BP) and vector outer product (FC-UP), C.Vault is the programming value for the PMAG attached to the common data vault and I.Vault is the programming value for the PMAGs attached to the independent data vaults. Similar to the PMAG, the PE needs to be programmed to set: 1) the operation type: MAC or MAX, 2) the bit precision mode for the MAC: 16 bit or 32 bit, with or without SR, and 3) the loop ranges for the address generators of the local buffers. Based on the discussions in Section 3.3.4, Table 4 summarizes the inputs for PE programming for different operations. In essence, the preceding three tables define the instruction set architecture of the NeuroTrainer.
Table notes: a, b, c, d, p, q, s, t, u, and v are labels used in Fig. 4. R1 ∼ R7: maximum values of the r1 ∼ r7 loops (the minimum values are all zero). h: length of partial vector (Fig. 8).
Table 3: Programming PMAG for data rearranging and data preparation.
Given a DNN, the host first generates the preceding three tables. Fig. 12 illustrates the programming process of the NeuroTrainer. To enable autonomous operation of the NeuroTrainer, we embed an on-chip instruction buffer (iBuffer) that stores the three tables (Figure 1(a)); the host loads them into the iBuffer after generating them. During execution, the layer-wise operation is controlled by the iBuffer (using a layer counter). To estimate the size of the iBuffer, consider that for a network with N layers, we need to program the NeuroTrainer about 4N times (feedforward, backpropagation, weight update, and data preparation if needed). Each time, the amount of programming data is 22 bytes (18 bytes for the PMAG and 4 bytes for the PE). Therefore, a 16KB memory is sufficient as the iBuffer, and it can cover 186 layers. The latency of programming the iBuffer through the HMC external interface is negligible compared to that of loading the input data.
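The iBuffer sizing above can be checked with a few lines of Python. The sketch assumes 16KB means 16 × 1024 bytes and that every layer needs all four programming steps, which is the worst case described in the text; the constant and function names are illustrative.

```python
PMAG_BYTES = 18           # programming data for the PMAG, per step
PE_BYTES = 4              # programming data for the PE, per step
PROGRAMS_PER_LAYER = 4    # feedforward, backpropagation, update, preparation


def layers_covered(ibuffer_bytes):
    """Number of layers whose programming data fits in the iBuffer."""
    bytes_per_layer = (PMAG_BYTES + PE_BYTES) * PROGRAMS_PER_LAYER
    return ibuffer_bytes // bytes_per_layer


if __name__ == "__main__":
    print(layers_covered(16 * 1024))
```

With 88 bytes per layer, a 16 × 1024 byte buffer holds the programming data of 186 layers, matching the figure quoted above.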
SIMULATION RESULTS

Performance Analysis
The performance of the NeuroTrainer is simulated using a cycle-level simulator. All simulation results are based on a minibatch size of 32, which is the recommended minimum minibatch size [28]. All MACs, comparators, PE buffers, the BUS interface, and the PMAG are synthesized to operate at 2.5GHz to maximize the bandwidth of a single vault. Fig. 13 shows the throughput (TOPS/s) and latency (seconds). In the feedforward phase, all convolution and fully connected layers show similar throughput above 4.0 TOPS/s (4.2 TOPS/s ∼ 4.7 TOPS/s), which is close to the theoretical maximum for 16-bit operation of our MAC (2.5GHz × 15 PEs × 32 MACs × 2 pairs of inputs × 2 (multiplication and addition) = 4.8 TOPS/s).
For backpropagation and weight update, 32 bit with stochastic rounding is used. The theoretical maximum throughput, computed in the same manner, is 2.4 TOPS/s. In backpropagation, the FC3 layer (1.61 TOPS/s) and the C1 layer (1.19 TOPS/s) show lower throughput than the others. For FC3 backpropagation, the size of dY is not large enough to hide the latency of writing back from all PEs to a single vault. In other words, the latency to generate the output by iterating dY times is shorter than that of writing the output back to the common data vault, so write-back becomes the bottleneck in the system. For the C1 layer, the input dimension is 55 × 55 × 96 and the kernel dimension is 11×11×3. It can be processed as a convolution since the kernel is small enough to fit in the local memory, but inefficiently because the kernel is large compared to the input.
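The peak-throughput arithmetic can be reproduced with the short Python sketch below. The parameter values come from the text; the function name is illustrative, and treating the 32-bit mode as a single operand pair per MAC follows the MAC description in Section 3.3.

```python
def peak_ops_per_second(clock_hz, n_pe, macs_per_pe, operand_pairs,
                        ops_per_mac=2):
    """Peak operations per second; each MAC counts as a multiply plus an add."""
    return clock_hz * n_pe * macs_per_pe * operand_pairs * ops_per_mac


if __name__ == "__main__":
    peak_16bit = peak_ops_per_second(2.5e9, 15, 32, operand_pairs=2)
    peak_32bit = peak_ops_per_second(2.5e9, 15, 32, operand_pairs=1)
    print(peak_16bit / 1e12, peak_32bit / 1e12)
    # Fraction of the 16-bit peak reached by the reported feedforward range.
    print(4.2e12 / peak_16bit, 4.7e12 / peak_16bit)
```

The reported feedforward range of 4.2 to 4.7 TOPS/s therefore corresponds to roughly 88% to 98% of the 16-bit peak.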
In weight update, C1 ∼ C5 show about 1.98 TOPS/s by translating the large-kernel convolution into matrix multiplication. However, the FC layers (vector-vector outer product) show about 1.02 TOPS/s, which is the worst case, due to the high traffic between the PEs and the independent vaults caused by the lack of data re-use.
To see the performance on a more complex and deeper network, we evaluate a DNN for generating image descriptions [29]. The image feature extraction part is implemented as AlexNet, and an RNN (GRU) is attached after the 5th convolution layer (Fig. 14). A single GRU is composed of six fully connected layers for the hidden neurons and one fully connected layer for the output neurons. The number of input neurons in the GRU is 43,264 and the number of hidden neurons in the GRU is 10,000. We assume T for the DNN is 100. Fig. 15 shows the latency of each layer in this DNN. For the recurrent layers (in the dashed box), the latency is computed over the time window (the latency across all T time-unfolded layers), which is why it is higher than that of the other layers.
Figure 15: Latency analysis of each layer in the DNN [29]. Dashed box: layers in the GRU; latency is reported for T (= 100) iterations.
Fig. 16 shows the throughput for various benchmarks, including ResNet 152 [30], VGG 16, VGG 19 [31], Inception V3 [32], GRU [22], the DNN for image description [29], and MLP0 [9]. The Y-axis represents the throughput (TOPS/s), and the numbers on the X-axis represent the number of inputs that can be trained per second for each benchmark. For all benchmarks, inference shows 4.0 ∼ 4.7 TOPS/s and training shows 1.9 TOPS/s. Further, NeuroTrainer shows stable training throughput (standard deviation less than 6% of the average) across all benchmarks of varying complexity.
Synthesis and Power Analysis
The computation fabric of the NeuroTrainer, including the PEs, the bus interface, and the PMAG with the vault controller, is synthesized using 15nm FinFET [23]. As the vault controller is a proprietary design, a 32-bit SDRAM controller [27] is adopted as a reference vault controller. Table 5 summarizes the average power across 8 different benchmarks and the area overhead of each module in the system. The total power consumption of the logic layer is 2.64W and the area overhead (including the vault controller) is 1.17mm^2. Even when scaled up for comparison with a previous result synthesized in 28nm CMOS [33], the total area is less than 5% of the footprint of the fabricated HMC (68mm^2). The average DRAM die power is computed during the simulation using 3.7pJ/bit from [33] and the actual DRAM access pattern. The power densities of the logic die (0.039W/mm^2) and the DRAM dies (0.030W/mm^2) in NeuroTrainer are well within the acceptable power density of 3D stacked systems (1.5W/mm^2 [34]).
From the DRAM power consumption (2.03W), the average memory bandwidth can be computed as 68.5GByte/sec (2.03W divided by 3.7pJ/bit), which is lower than the total aggregated memory bandwidth of the HMC (16 vaults × 10GByte/sec). With a batch size of 32, weights are reused 32 times. DRAM utilization can be increased by 1) more MACs per PE, which requires a larger partial-sum SRAM and is less efficient for small batches (the minimum batch size for most DNNs is 32), or 2) more PEs, which requires a larger network among the PEs and vaults.
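The bandwidth estimate follows directly from power divided by energy per bit, as the short Python sketch below shows. The input values are taken from the text; the function name is illustrative, and a GByte is assumed to be 10^9 bytes.

```python
def bandwidth_gbytes_per_s(dram_power_w, energy_pj_per_bit):
    """Average bandwidth implied by DRAM power and access energy per bit."""
    bits_per_s = dram_power_w / (energy_pj_per_bit * 1e-12)
    return bits_per_s / 8 / 1e9


if __name__ == "__main__":
    used = bandwidth_gbytes_per_s(2.03, 3.7)
    aggregate = 16 * 10.0                 # 16 vaults x 10 GByte/sec
    print(used, used / aggregate)
```

The result is approximately 68.6 GByte/sec, in line with the 68.5 GByte/sec quoted above, or a little over 40% of the aggregate vault bandwidth.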
On average, NeuroTrainer consumes 4.64W and delivers 1.89 TFLOPS throughput and 406 GFLOPS/W of efficiency during training (32bit) while maintaining high training accuracy.
For HMC 2.0 [35], the performance is estimated (Table 6). With 32 vaults, 31 PEs can be placed; therefore, the throughput and the logic power approximately double. However, the power of the DRAM dies stays the same, since the total memory access is constant. As a result, the estimate shows a 39% gain in efficiency.
Scalability to Multiple NeuroTrainers
Multiple NeuroTrainers can be used in parallel for scalable training performance, as illustrated in Fig. 17 (a). As all NeuroTrainers take the same latency (T_1) to train a minibatch, we propose to perform synchronous training [18]. After training a single input batch, the N NeuroTrainers deliver dW to a central unit (latency = T_2). The central unit takes dW from all N NeuroTrainers and generates the new W (W') as W' = W − η × average(dW), where η is the learning rate. This computation can be performed by another NeuroTrainer. However, to cover more generic approaches to weight update (e.g., AdaGrad [36] or Adam [37]), a software implementation, for example using a Tegra K1 (326 GFLOPS, 10W, 28nm) [38], can also be considered.
Fig. 17 (b) shows the estimated training performance (number of images per second) on VGG 16 [31] for different numbers of NeuroTrainers and two different types of central core. The training performance of a high-end GPU system (NVIDIA DGX-1) is also reported. This estimated performance is computed from peak FLOPS. For the same power budget, 64 NeuroTrainers can operate in parallel and train 1,900 images per second, delivering a 13x speedup. The additional power consumption due to off-chip data movement is estimated using an HMC access energy of 10pJ/bit [33]. Ultimately, the performance scaling of NeuroTrainer is limited by the off-chip latency, showing the need for a better system architecture and a faster off-chip network.
RELATED WORK
Table 6 compares NeuroTrainer with previously reported DNN training accelerators. NeuroCube [4] and NeuroStream [6] present inference engines using in-memory accelerators, which can also perform training. Their results demonstrate higher efficiency than a GPU baseline, showing the promise of hardware acceleration. However, the performance gain is nominal, as no hardware was optimized for training.
ScaleDeep [13] proposes specialized hardware for different computation kernels. A multi-chip module is synthesized using five different tiles (a heterogeneous architecture), and layers are allocated to different tiles based on their properties (such as Byte/Ops). The design demonstrates better power efficiency than GPUs.
The main difference between NeuroTrainer and ScaleDeep lies in their orthogonal approaches to optimizing the efficiency of different kernels. Rather than changing the data flow in the hardware for different operations, as NeuroTrainer does, ScaleDeep decides the tile distribution at design time. Consequently, if the layer distribution of a DNN architecture does not match the tile distribution, for example if one kind of layer (convolution or fully connected) dominates the entire network, the tile utilization and efficiency are low. This effect is evident in [13] (see Fig. 20), which shows that even across various CNN benchmarks the standard deviation of efficiency is about 28% of the average; this is expected to increase further if recurrent networks are considered.
In contrast, the NeuroTrainer uses a homogeneous architecture and dynamically changes the data flow and data mapping to optimize the performance of individual layers. This dynamic optimization, instead of design-time decisions, allows NeuroTrainer to maintain similar throughput for much wider classes of benchmarks, including RNNs. The homogeneous architecture also makes the design easier to scale for parallel training. A secondary difference comes from the use of 3D in-memory acceleration in NeuroTrainer to reduce data movement power, and of fixed-point arithmetic with stochastic rounding for higher efficiency (compared to floating point in ScaleDeep).
CONCLUSION
We have presented NeuroTrainer, an intelligent memory module with in-memory accelerators for energy-efficient training of different classes of DNNs. The NeuroTrainer uses a programmable data flow based execution model to optimize memory mapping and data reuse during the different phases of training. The simulation results demonstrate the potential for appreciable performance and power-efficiency gains over a baseline GPU or alternative accelerators. The NeuroTrainer can form the building block of a scalable architecture for energy-efficient training of deep neural networks. Ultimately, the performance scaling of a training platform built from NeuroTrainers is limited by the off-chip latency, showing the need for future research on better system architectures and faster off-chip networks.
