Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors reducing offloading overhead by 7× over previously published results;
INTRODUCTION
Modern Deep Neural Networks (DNNs) have to be trained on clusters of GPUs and millions of sample images to be competitive [1]. Complex networks can take weeks to converge, during which the involved compute machinery consumes megajoules of energy to perform the exascale number of operations required. Inference, i.e., evaluating a network for a given input, provides many knobs for tuning and optimization. Substantial research has been performed in this direction and many good hardware accelerators have been proposed to improve inference speed and energy efficiency [2]. Training DNNs is much harder, and many of these optimizations no longer apply. Stochastic Gradient Descent (SGD) is the standard algorithm used to train such deep networks [3]. Consider Fig. 1, which shows the data dependencies when training a simple neural network. While inference is concerned only with finding y, training aims at finding the parameter gradients (Δu), which introduces a data dependency that requires us to temporarily store the outputs x_i, y of every layer. This also prevents optimizations such as fusing activation or sub-sampling functions with the preceding layer.
While it has been shown that inference is robust to lowered arithmetic precision [2], the impact of fixed-point or reduced-precision floating-point (FP) arithmetic on training is not yet fully understood (see Section 5). Until additional light is shed on the topic, a training accelerator must support 32 bit FP arithmetic to be able to compete with the ubiquitous GPU. Existing accelerators require a significant amount of custom silicon and often additional memory and computational resources to function. In this paper we show that a processing system embedded in the Logic Base (LoB) of a Hybrid Memory Cube (HMC) is a competitive and scalable option for training DNNs in the data center. The proposed architecture is based on the earlier NeuroStream (NS) [7] inference engine, which introduced the concept of streaming co-processors based on nested hardware loops and address generators to the HMC. We show that such co-processors can be extended to training workloads and be made more efficient by increasing their level of autonomy. High overall data-center-level energy efficiency can be achieved by distributing training over multiple such HMCs. The key contributions of this paper are:
1) A compute architecture featuring a few RISC-V cores loosely coupled with several NTX co-processors (1:8 ratio) capable of managing computation and L1 memory access. One RISC-V core can manage 8 NTX with a reduced number of instructions, hence the von Neumann bottleneck is relaxed without compromising flexibility (Section 2).
2) An optimized data path in the NTX for high-precision convolutions and gradient propagation, coupled with an effective TCDM/DMA data transfer hardware that eliminates the area and power overhead of large caches by leveraging the predictability of DNN memory patterns (Section 3).
3) Significant computational capabilities at no additional silicon area cost in the LoB of an HMC, which we show to outperform GPUs and other accelerators in terms of silicon and energy efficiency (Section 4).
4) Competitive scaling to meshes of HMCs that can replace existing GPU-based solutions in a data center setting, where the improved efficiency translates into an increase in computational power and significant savings in power, cooling, and equipment cost (Section 4).
The remainder of this paper is organized as follows: Section 2 describes the proposed hardware architecture and Section 3 shows the execution model of DNN layers. Section 4 presents experimental results and comparisons to other accelerators. The remaining sections describe related and future work, and provide a conclusion.
ARCHITECTURE
The LoB of an HMC offers a unique opportunity to introduce a Processor-in-Memory (PiM) as depicted in Fig. 2. The memory dies are subdivided into vertically connected vaults, with individual memory controllers on the LoB. Traffic between the serial links and vaults is transported by means of an efficient all-to-all network [4], [5]. Our architecture consists of multiple processing clusters attached to a crossbar, which thus gain full access to the entire memory space of the HMC. The interconnect was dimensioned to offer the same bandwidth as in [7], i.e., the full bandwidth required by the aggregate cluster ports. The memory cube is attached to a host CPU or other cubes via the four serial links. The on-chip network is responsible for arbitration of traffic between the serial links, the DRAM, and the PiM. This arbitration can be prioritized such that external memory accesses from the serial links are given priority over internal ones originating in the processing system. It also allows requests from the PiM to be routed to the serial links for inter-HMC communication.
Processing Cluster
We combine a general-purpose RISC-V processor core [6] with multiple NTX FP streaming co-processors. Both operate on a 128 kB TCDM which offers a shared memory space with single-cycle access. The memory is divided into 32 banks that are connected to the processors via a low-latency logarithmic interconnect. These form a cluster which also contains a DMA engine capable of transferring two-dimensional planes of data between the TCDM and the HMC's memory space. This solution has proven to be more area- and energy-efficient than implicit caches, and the DMA can anticipate and time block data transfers precisely, thereby hiding latency. The RISC-V processors perform address calculation and control data movement via the DMA. Actual computation is performed on the data in the TCDM by the NTX co-processors, which we describe in the next section.
Fig. 1. Data dependency graph of the forward pass (above) and backward pass (below). f, g, h are DNN layers, L is the loss function, x_1, x_2, x_3 and u_1, u_2, u_3 are the layer activations and parameters. The backward pass introduces a data dependency between the first node f and the last node df. Thus intermediate activations need to be stored.
Fig. 2. Top-level block diagram of one HMC enhanced with m processing clusters. The LoB contains the vault controllers, main interconnect, and the four serial links that lead off-cube. The proposed processing clusters attach directly to the main interconnect and gain full access to the HMC's memory space and the serial links. Each cluster consists of a DMA unit, a Tightly-Coupled Data Memory (TCDM), and one or more RISC-V processor cores augmented with NTX streaming co-processors. We designed the NTX to operate at 1.5 GHz, while the remaining additions to the system operate at 750 MHz.
Address translation is performed either in software or via a lean Memory Management Unit (MMU) with Translation Look-aside Buffer (TLB) as described in [4] . This allows the PiM to directly operate on virtual addresses issued by the host. If there are multiple HMCs attached to the host, care must be taken since the PiMs can only access the memory in the HMC that they reside in. An additional explicitly managed memory outside the clusters, labeled "L2" in Fig. 2 , holds the RISC-V binary executed by the processors and additional shared variables. The binary is loaded from DRAM.
Network Training Accelerator (NTX)
The computations involved in training DNNs are highly regular. To leverage this feature we developed NTX, a FP streaming co-processor that operates directly on the TCDM. Conceptually the NTX co-processor is similar to the one presented in [7] , but it is a complete redesign optimized for performance and training. The streaming nature of the co-processor alleviates the need for a register file and corresponding load/store instructions. The architecture of NTX is depicted in Fig. 3 . It consists of four main blocks: (i) the FPU containing the main data path, (ii) the register interface for command offloading, (iii) the controller that decodes the commands and issues micro-instructions to the FPU, and (iv) the address generators and hardware loops.
FMAC and FPU
The FPU in NTX can perform fast FMAC operations with single-cycle throughput. It is based on a Partial Carry-Save (PCS) accumulator which aggregates the 48 bit multiplication result at full fixed-point precision (≈ 300 bit). After accumulation, the partial sums are reduced in multiple pipelined segments; two segments are sufficient to reach an operating frequency above 1.5 GHz in 28 nm (SS corner, 125 °C, 1.0 V). The employed format is aligned with IEEE 754 32 bit floats. The wide accumulator and deferred rounding allow NTX to achieve higher precision than conventional floating-point units (FPUs) for reduction operations such as convolutions. Analysis has shown that in a full 3 × 3 convolution layer of GoogLeNet [1], the Root Mean Squared Error (RMSE) of NTX is 1.7× lower than that of a 32 bit FPU, with respect to a common baseline (64 bit float). See Table 1.
The FMAC unit allows NTX to compute inner/outer product and vector addition/multiplication. An additional comparator, index counter, and ALU register enable various additional commands such as finding minima/maxima, ReLU, thresholding and masking, and copy/memset.
Hardware Loops and Address Generation
At the core of address generation in NTX are the five Hardware Loops (HWLs). Each loop is managed by a 16 bit counter that has a programmable maximum count register N i . Additionally, the counter can be explicitly enabled or disabled, and it has a signal indicating whether the counter has reached its maximum value and is about to reset to zero. To support nesting, each counter is enabled by the previous counter's "done" signal. The first counter (L0) is only disabled upon a pipeline stall. The "done" signal of the last counter (L4) indicates that all loop iterations have been performed. The enable signals of all counters are concatenated into a 5 bit output.
Three Address Generation Units (AGUs) allow NTX to keep track of three pointers into memory. Each unit consists of a 32 bit register holding the address and an adder. The address is incremented by one of five programmable step sizes p_i, each of which corresponds to one of the hardware loops. The enabled counter with the highest index dictates the chosen stride. This allows addresses of the form

$$a(n_0, \dots, n_4) = a_0 + \sum_{i=0}^{4} n_i\, s_i$$

to be calculated, but using only one addition per cycle. The conversion from strides s_i to step sizes p_i is trivial:

$$p_i = s_i - \sum_{j=0}^{i-1} (N_j - 1)\, s_j,$$

where N_i is the iteration limit of an HWL. This conversion can be performed by the controlling CPU core when programming a command, for example as part of a driver library. Fig. 5a shows the pseudo-code structure of nested loops that NTX can natively perform. The number of loops (outer level), position of the accumulator initialization (init level), and position of the accumulator write back (store level) are fully programmable. The AGUs provide addresses for the memory reads and writes depicted. The operation performed by the FPU always occurs in the innermost loop body and can be configured to be one of the commands listed in Fig. 5b.
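To make this conversion concrete, the following sketch shows how a driver routine on the RISC-V core might derive the step sizes p_i from the loop bounds N_i and byte strides s_i before writing them to the NTX configuration registers. It assumes the reconstructed conversion above and the five hardware loops; the function name and interface are illustrative, not the actual driver API.

    #include <stdint.h>

    #define NTX_NUM_LOOPS 5

    /* Convert per-loop strides (in bytes) into the step sizes the AGU
     * adds when the corresponding loop is the highest one to increment.
     * When loop i advances, loops 0..i-1 wrap from N_j-1 back to 0, so
     * their accumulated offset (N_j-1)*s_j must be subtracted again.   */
    static void ntx_strides_to_steps(const uint32_t N[NTX_NUM_LOOPS],
                                     const int32_t  s[NTX_NUM_LOOPS],
                                     int32_t        p[NTX_NUM_LOOPS]) {
        int32_t wrap = 0;                 /* sum of (N_j-1)*s_j for j < i */
        for (int i = 0; i < NTX_NUM_LOOPS; ++i) {
            p[i] = s[i] - wrap;           /* step applied when loop i ticks */
            wrap += (int32_t)(N[i] - 1) * s[i];
        }
    }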
Offloading Support
Offloading to NTX has been enhanced with respect to the earlier NeuroStream [7] inference engine to significantly improve efficiency on training workloads. The three improvements are: (i) a command staging area, (ii) an increased number of hardware loops, and (iii) a third address generator. The following paragraphs briefly explain the impact of each.
(i) In NS, configuration of addresses, strides, loops, and the initialization of the accumulator are performed via a command register. This register is mapped into the controlling CPU's memory space and the written commands are pushed into a FIFO. This allows the CPU to enqueue configuration updates while a computation is still ongoing. The computation itself is also a command, which is popped off the FIFO only upon completion. This also implies that the FIFO needs to be deep enough to hold all commands necessary to configure the next computation, lest the CPU has to stall. NS used a depth of 8 for these FIFOs, causing the CPU to stall frequently. NTX improves on this by exposing the configuration registers as a memory-mapped "staging area". As such, the CPU can directly address and modify the registers. A computation is launched by writing to the command register, which is special in the sense that it causes the entire configuration to be copied to an internal "shadow" register. This allows the CPU to immediately go ahead and configure the next operation without disturbing the current one. Furthermore, parts of the configuration that do not change between commands need not be written again, since the staging area is persistent. It is worth noting that the size of the staging area and its shadow copy in NTX is roughly the same as that of the command FIFO and the corresponding registers in NS, but the former offers significantly higher ease of use. All NTX controlled by a core are also accessible via a broadcast address, which further reduces offloading time when configuring common parameters; a sketch of the resulting offload sequence is shown below.
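As an illustration of the offloading flow, the sketch below configures one NTX through its memory-mapped staging area and launches a command; a broadcast alias is used for parameters shared by all eight co-processors. All register offsets, field names, base addresses, and opcodes are hypothetical placeholders, since the actual memory map is not specified here.

    #include <stdint.h>

    /* Hypothetical staging-area layout; the real offsets differ. */
    typedef volatile struct {
        uint32_t agu_base[3];    /* base addresses of the three AGUs      */
        int32_t  agu_step[3][5]; /* per-loop step sizes p_i for each AGU  */
        uint32_t loop_count[5];  /* iteration limits N_i of the HWLs      */
        uint32_t levels;         /* outer / init / store loop levels      */
        uint32_t cmd;            /* writing here copies the staging area
                                    to the shadow registers and starts    */
    } ntx_regs_t;

    #define NTX0_REGS  ((ntx_regs_t *)0x10200000u)  /* placeholder address */
    #define NTX_BCAST  ((ntx_regs_t *)0x102F0000u)  /* placeholder address */
    #define NTX_CMD_MAC 0x1u                        /* placeholder opcode  */

    static void offload_mac(uint32_t src0, uint32_t src1, uint32_t dst,
                            const uint32_t N[5], const int32_t p0[5],
                            const int32_t p1[5], const int32_t p2[5]) {
        /* Common parameters can be broadcast to all NTX at once. */
        for (int i = 0; i < 5; ++i)
            NTX_BCAST->loop_count[i] = N[i];

        /* Per-NTX parameters are written directly; unchanged fields
         * persist in the staging area and need not be rewritten.     */
        NTX0_REGS->agu_base[0] = src0;
        NTX0_REGS->agu_base[1] = src1;
        NTX0_REGS->agu_base[2] = dst;
        for (int i = 0; i < 5; ++i) {
            NTX0_REGS->agu_step[0][i] = p0[i];
            NTX0_REGS->agu_step[1][i] = p1[i];
            NTX0_REGS->agu_step[2][i] = p2[i];
        }

        /* Launch: the configuration is copied into the shadow registers,
         * so the core can immediately prepare the next command.          */
        NTX0_REGS->cmd = NTX_CMD_MAC;
    }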
(ii) We observe that a convolution as it appears in DNNs has six nested loops: three that iterate over each output pixel, and three to perform the per-pixel reduction over the input dimensions (3D input and output, 4D weights). NS offers three hardware loops, which allows the 3D per-pixel reduction to be expressed in one command. To compute the first convolution of GoogLeNet [1], which has a 7 × 7 × 3 kernel and yields a 112 × 112 × 64 output, 802,816 offloads need to be issued by the CPU, each of which ideally takes only 147 cycles. This leaves only a few cycles to coordinate DMA transfers and configure the next command, thus limiting the number of NeuroStreams that can be controlled by one CPU. NTX improves on this by increasing the number of hardware loops to five. This allows multiple output pixels to be calculated with one offload. For the aforementioned convolution, this translates to only 64 offloads that need to be issued, each of which ideally takes 1,843,968 cycles; a worked example follows below. In practice the size of the offloaded computation is now bounded by the tile size that fits into the TCDM, thus the CPU only needs to issue one offload per NTX per tile. This reduces the control overhead of NTX to almost zero. The CPU is now free to do more elaborate data transfers, for example issuing multiple small DMA transfers to copy slices of a tensor and performing zero-padding, thus not requiring that the data be laid out in memory in a zig-zag tiling fashion as described in [7]. This is an important improvement, since such a tiling cannot be maintained during training without significant data reshuffling between layers, which would severely reduce energy efficiency and inflate bandwidth. Table 2 shows this effect for select convolutions in GoogLeNet [1].
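As a sanity check on these numbers (our own arithmetic, not part of the original text), the offload counts follow directly from the loop structure: with three hardware loops NS covers only the 7 × 7 × 3 reduction per offload, i.e., one offload per output pixel, whereas with five loops NTX additionally covers the 112 × 112 output plane, leaving one offload per output channel:

$$\text{NS: } 112 \cdot 112 \cdot 64 = 802{,}816 \text{ offloads of } 7 \cdot 7 \cdot 3 = 147 \text{ cycles each},$$
$$\text{NTX: } 64 \text{ offloads of } 112 \cdot 112 \cdot 147 = 1{,}843{,}968 \text{ cycles each}.$$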
(iii) To allow NTX to calculate multiple pixels of the output image with one offload, we added a third address generator that maintains a pointer for autonomously writing back multiple results to memory. This is in contrast to NS [7], which requires an explicit command from the CPU to store the accumulated value.
NTX EXECUTION
TABLE 2
Offloads and Cycles per Offload for Select GoogLeNet Convolutions

  Kernel     Output        Offloads (NS)  Offloads (NTX)  Cycles/Offload (NS)  Cycles/Offload (NTX)
  7x7x3      112x112x64    802,816        64              147                  1,843,968
  3x3x64     56x56x192     602,112        192             576                  1,806,336
  1x1x256    28x28x64      50,176         64              256                  200,704
  1x1x512    14x14x192     37,632         192             512                  100,352

NS requires one offload per output pixel, whereas the increased number of hardware loops and the third address generation unit of NTX allow it to compute many output pixels per offloaded command.

The combination of RISC-V processors and dedicated FP streaming co-processors makes our architecture very flexible. It is a many-core platform with explicitly managed scratchpad memories, where data copies are performed by a DMA engine and bulk computations by NTX co-processors in parallel to a running program. The proposed architecture allows for entire DNN training batches to be performed
completely in memory, without intervention from a host outside the HMC, as follows. Starting from a reference implementation of a training step in C or C++, nested loops of the form described in Section 2 are amenable to acceleration on NTX. This includes the bulk operations of all DNN layers. These loops are replaced by an offload sequence consisting of writes to the eight staging areas. Furthermore, the input and output data of each loop nest must be tiled and the data movement appropriately scheduled, as described below. Fig. 4 shows the execution of one tile of a 3 × 3 convolution on NTX. The RISC-V core first configures and launches the main computation on NTX, then controls the DMA to write back the output of the previous tile and read the input of the next one; a sketch of this per-tile control flow is shown after this paragraph. Additional tasks such as zero padding and address computation are performed in the background. The short period of NTX idleness in between tiles is due to the core using the NTX to initialize the next tile, which is a quick command that terminates faster than the core can configure the next big computation.
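The per-tile control flow on the RISC-V core can be sketched as follows. This is a simplified, hypothetical skeleton assuming double buffering in the TCDM, with each buffer holding separate input and output regions; the helper functions are placeholders rather than the actual runtime API.

    #include <stdio.h>

    /* Placeholder runtime hooks; the real driver API is not shown here.
     * The stubs only print the schedule so the skeleton can be run.      */
    static void dma_copy_in (int buf, int tile) { printf("DMA in : tile %d -> buf %d\n", tile, buf); }
    static void dma_copy_out(int buf, int tile) { printf("DMA out: tile %d <- buf %d\n", tile, buf); }
    static void dma_wait(void)                  {}
    static void ntx_offload_conv(int buf)       { printf("NTX conv on buf %d\n", buf); }
    static void ntx_wait(void)                  {}
    static void prepare_next_tile(int tile)     { (void)tile; /* zero padding, addresses */ }

    enum { BUF_A = 0, BUF_B = 1 };

    /* Double-buffered per-tile control flow executed by the RISC-V core. */
    static void run_layer_tiles(int num_tiles) {
        int cur = BUF_A, nxt = BUF_B;

        dma_copy_in(cur, 0);                       /* fetch first input tile  */
        dma_wait();

        for (int t = 0; t < num_tiles; ++t) {
            ntx_offload_conv(cur);                 /* 1. launch bulk compute  */

            /* 2. Overlap DMA with the running computation: drain the
             *    previous output and fetch the next input tile.            */
            if (t > 0)             dma_copy_out(nxt, t - 1);
            if (t + 1 < num_tiles) dma_copy_in (nxt, t + 1);

            prepare_next_tile(t + 1);              /* 3. padding, addresses   */

            dma_wait();
            ntx_wait();                            /* tile finished           */
            int tmp = cur; cur = nxt; nxt = tmp;   /* swap double buffers     */
        }

        dma_copy_out(cur, num_tiles - 1);          /* drain last output tile  */
        dma_wait();
    }

    int main(void) { run_layer_tiles(4); return 0; }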
Memory and Tiling
As described in Section 2 and [4], [7] , the core and NTX operate directly on a scratchpad memory inside the cluster. A DMA unit in conjunction with a lean MMU is used to copy data from DRAM into cluster memory, where the accelerators operate on them. This mechanism is similar to how caching works on CPUs/GPUs, but is explicitly managed by the programmer.
The scratchpad memory in the cluster is limited in size. To evaluate an entire convolution layer for example, the input and output data are tiled to fit into memory. The tiles need to overlap in the convolution case. The DMA unit can run in parallel to the computation and is used to write back previous results and read next inputs while a computation is ongoing (double buffering), as can be seen in Fig. 4 . In [7] the authors made use of 4D tiling, which requires that the data is already laid out in such a tiled fashion in DRAM, including replication of the overlapping areas. This allows the DMA to copy a tile in a single consecutive transfer, requiring little control by the core. For training this scheme is infeasible, since forward and backward passes require different tile sizes, and the data would need to be retiled after several subsampling layers to maintain a sufficient tile size. This retiling translates into no-op movement of data, which wastes bandwidth and energy.
The improved offloading scheme described in Section 2.5 and increased independence of NTX compared to [7] frees up significant RISC-V core resources. This allows us to now store tensors in memory as dense chunks of FP values, without any replication or tiling pre-applied. To transfer tiles, we task the core with issuing multiple DMA transfers, each of which copies one consecutive stripe of data. Zero padding can also be performed by the core in this way. Hence we can drop the requirement of the data being laid out in memory in a pre-tiled fashion.
Strided Stencil Operations
A stride greater than one in stencil operations such as convolution and pooling causes an irregularity during training. For example, a strided convolution can be thought of as a regular convolution where a subset of the output pixels is discarded. The backward pass correspondingly can be represented roughly as a sparse convolution where the discarded pixels are 0. For efficiency reasons we would like to skip multiplications with 0, effectively leveraging the sparsity of the problem. However, NTX cannot change the number of summands within the course of one operation, so we must perform convolutions that have a constant number of operations per pixel. This does not hold for strided convolutions in the backward pass, where the input derivative contains contributions from a varying number of output pixels; see Fig. 6. We observe that we can subdivide the pixels of the input derivative into different categories: each pixel subset can be computed as a regular convolution with a subset of the filter weights, and the overall result can be found by interleaving the subset results. This scheme allows us to decompose a sparse convolution (as found in the derivative of strided convolutions) into multiple dense convolutions, each contributing a subset of the result pixels; a small one-dimensional example follows below.
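To illustrate the decomposition on a minimal case (our own example, not taken from the paper), consider a one-dimensional convolution with kernel size 3 and stride 2. In the backward pass, input-derivative pixels at even positions receive contributions only from weights w[0] and w[2], while odd positions receive only w[1], so the sparse backward convolution splits into two dense convolutions whose results are interleaved.

    #include <stdio.h>

    /* 1-D example: forward pass y[j] = sum_k w[k] * x[2*j + k], kernel 3,
     * stride 2, no padding. With NY output pixels, x and dx need
     * NX = 2*(NY-1) + 3 = 2*NY + 1 elements.                              */
    #define NY 4
    #define NX (2 * NY + 1)

    int main(void) {
        const float w[3]   = {0.5f, -1.0f, 2.0f};
        const float dy[NY] = {1.0f, 2.0f, 3.0f, 4.0f};
        float dx[NX] = {0};

        /* Dense sub-convolution 1: even dx positions only ever see the
         * taps w[0] and w[2] of the filter.                              */
        for (int j = 0; j < NY; ++j) {
            dx[2 * j]     += w[0] * dy[j];
            dx[2 * j + 2] += w[2] * dy[j];
        }

        /* Dense sub-convolution 2: odd dx positions only see tap w[1]. */
        for (int j = 0; j < NY; ++j)
            dx[2 * j + 1] += w[1] * dy[j];

        /* Interleaved, the two subsets form the full input derivative. */
        for (int i = 0; i < NX; ++i)
            printf("dx[%d] = %g\n", i, dx[i]);
        return 0;
    }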
Special Functions (exp, log, div, sqrt)
There is no dedicated hardware to evaluate special functions such as division, exp, log, square roots, or arbitrary powers. These are needed for the softmax layer or various forms of normalization. As the number of such operations is typically very low (on the order of a few thousand per training step), it is feasible to implement them using iterative algorithms on the NTX, calculating multiple results in parallel; a sketch of such an iteration is shown below. We found that for tens to hundreds of inputs, pipeline latency can be hidden and the evaluation takes on the order of 30 to 100 cycles per element.
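As an example of such an iterative scheme (a generic Newton-Raphson iteration, not necessarily the exact algorithm used on NTX), the reciprocal 1/d can be refined using only multiplies and multiply-adds, which maps well onto the FMAC data path and can be applied to many elements in parallel to hide pipeline latency.

    #include <stdio.h>

    /* Newton-Raphson reciprocal: x_{n+1} = x_n * (2 - d * x_n).
     * Each step uses only multiply-accumulate-style operations.
     * A crude initial guess converges in a handful of iterations for
     * normalized inputs; real implementations seed it more carefully. */
    static float recip_nr(float d, float x0, int iters) {
        float x = x0;
        for (int i = 0; i < iters; ++i) {
            float e = 2.0f - d * x;   /* one FMA: e = -d*x + 2 */
            x = x * e;                /* one multiply          */
        }
        return x;
    }

    int main(void) {
        float d = 3.0f;
        float x = recip_nr(d, 0.3f, 5);   /* seed near 1/d */
        printf("1/%g ~= %g (exact %g)\n", d, x, 1.0f / d);
        return 0;
    }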
Communication Across HMCs
The serial links in the HMC are accessible to the processor cores and DMA units in each cluster. This allows a mesh of HMCs to be programmed in a similar way as a two-tiered network of compute nodes. Within the HMC, clusters may exchange data via the DRAM and L2. Across HMCs, the processor cores may cooperate to perform complex systolic operations via the serial links. Section 4.9 provides an example of this.
EXPERIMENTAL RESULTS AND ANALYSIS
In this section we evaluate the silicon and energy efficiency of our proposed architecture and compare it against NeuroStream, the most closely related other accelerator [7] . Furthermore we investigate the effects of voltage and frequency scaling and the impact of multiple logic dies per memory cube. We conclude by comparing different NTX configurations against existing accelerators and evaluate the data center scale impact of our architecture.
Methodology

DRAM Power
We model the power consumption of the vault controllers, DRAM dies, and HMC interconnect as the following relationship:

$$P_{DRAM}(B) = 7.9\,\mathrm{W} + 21.5\,\tfrac{\mathrm{mW\,s}}{\mathrm{GB}} \cdot B,$$

where B is the requested bandwidth. We call this the "DRAM" power. This model is based on the observation in [7] that 7.9 W are consumed in a 1 GB cube under no traffic into DRAM (50 nm, see Section 4.1.6). Under an average traffic of 51.2 GB/s caused by their investigated workloads, this increases to the reported 9.0 W, a bandwidth-dependent power increase of 21.5 mW s/GB. These estimates are conservative and do not consider further power-saving measures, such as Voltage and Frequency Scaling (VFS) or power gating of HMC components.
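As a quick consistency check (our own arithmetic), the model reproduces the reported operating point of [7]:

$$7.9\,\mathrm{W} + 21.5\,\tfrac{\mathrm{mW\,s}}{\mathrm{GB}} \cdot 51.2\,\tfrac{\mathrm{GB}}{\mathrm{s}} \approx 9.0\,\mathrm{W}.$$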
Cluster Power
We have synthesized our design for a 28 nm Fully Depleted Silicon On Insulator (FD-SOI) technology using Synopsys Design Compiler, which we also use to estimate power based on simulation traces. The Register Transfer Level (RTL) model was back-annotated with timing information obtained from the synthesized design at 125 °C / 1.0 V (slow-slow corner). We execute the worst-case kernel in an RTL simulation of the cluster, which gives us a cycle-accurate picture of the computation. This is depicted in Fig. 4, and furthermore gives us an estimate of the cluster's power consumption, 165 pJ per clock cycle in this case. The worst-case kernel is a convolution which makes full use of the FPUs in the NTXs and has a high utilization of the DMA. Memory-bound kernels consume less power, since the FPU utilization is lower, reducing its power contribution. This simulation also gives us realistic utilization efficiencies of η_c = 84% and η_d = 87% for NTX and TCDM, respectively.
Network Layer Energy
To evaluate applications, we model the execution of individual network layers. The computation and data movement performed by a cluster is very predictable. For each network layer we therefore compute the number of FP operations necessary, as well as the amount of data that needs to be transferred. The latter we further split into data that must be moved before computation can start (head), data that can be moved in parallel to the computation, and data that must be moved once the computation completes (tail). This closely models the double buffering possible by overlapping operation of the DMA and NTX within the cluster. For each kernel we determine the execution time of the computation (T_c) and DMA transfers (T_d,par, T_d,seq) as

$$T_c = \frac{N_c}{\eta_c\, r_c}, \qquad T_{d,par} = \frac{D_{par}}{\eta_d\, r_d}, \qquad T_{d,seq} = \frac{D_{seq}}{\eta_d\, r_d},$$

where T_d,par represents DMA transfers that can run in parallel with computation and T_d,seq those that need to happen before and after. In more detail, N_c and D_dma are the total number of compute operations performed and bytes transferred by the kernel; r_c is the peak number of compute operations per cycle of the cluster; and r_d is the peak bandwidth of the DMA per cycle. For the architecture with 8 NTXs presented in Section 2, r_c = 8 op and r_d = 4 B. η_c and η_d account for inefficiencies such as interconnect contention and are determined empirically from simulations. We then formulate the execution time, requested bandwidth, and power consumption of the kernel as

$$T = T_{d,seq} + \max(T_c,\, T_{d,par}), \qquad B = \frac{D_{dma}}{T}, \qquad P = P_{cl} + P_{DRAM}(B),$$

where P_cl is the cluster power determined in the previous section.
See Fig. 7 for a visual explanation. Note that we issue DMA transfers in chunks of multiple kB and the engine is capable of having multiple simultaneous transfers in flight. This allows us to hide the latency into DRAM, which we estimate to be on the order of 40 core cycles. It is crucial that we fully saturate the precious bandwidth into DRAM when performing strided memory accesses, e.g., when transferring a tile of a tensor. There the length of the tile's innermost dimension is critical, as it determines the length of one burst access. Since we have full control over the tiling, we can ensure that a tile has at least 8 elements along its shortest dimension. This yields consecutive accesses of at least 32 B, which is the minimum block size in an HMC [5].
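The layer-level model above is simple enough to evaluate in a few lines. The sketch below computes the execution time and requested bandwidth of one kernel from the reconstructed formulas and the stated cluster parameters (r_c = 8 op/cycle, r_d = 4 B/cycle, η_c = 0.84, η_d = 0.87); it is our own illustration, not the authors' tooling, and times are kept in clock cycles to avoid assumptions about clock domains.

    #include <math.h>
    #include <stdio.h>

    /* Cluster parameters from Sections 4.1.2 and 4.1.3. */
    #define R_C   8.0    /* peak compute ops per cycle    */
    #define R_D   4.0    /* peak DMA bytes per cycle      */
    #define ETA_C 0.84   /* measured NTX utilization      */
    #define ETA_D 0.87   /* measured TCDM/DMA utilization */

    /* Returns the modeled execution time of one kernel in cycles and the
     * requested DMA bandwidth in bytes per cycle (written to *b_req).    */
    static double model_kernel(double n_ops, double d_par, double d_seq,
                               double *b_req) {
        double t_c    = n_ops / (ETA_C * R_C);
        double t_dpar = d_par / (ETA_D * R_D);
        double t_dseq = d_seq / (ETA_D * R_D);
        double t      = t_dseq + fmax(t_c, t_dpar);   /* double buffering */
        *b_req = (d_par + d_seq) / t;
        return t;
    }

    int main(void) {
        /* Example tile: 10 Mop of compute, 1 MB overlapped, 64 kB sequential. */
        double b;
        double t = model_kernel(10e6, 1e6, 64e3, &b);
        printf("T = %.3g kcycles, B = %.3g B/cycle\n", t / 1e3, b);
        return 0;
    }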
Cube Power
Based on the above, we model the requested bandwidth, power consumption, and energy efficiency of a kernel parallelized on an HMC with K clusters as

$$B_{cube} = K \cdot B, \qquad P_{cube} = K \cdot P_{cl} + P_{DRAM}(B_{cube}), \qquad \eta_{cube} = \frac{K \cdot N_c / T}{P_{cube}}.$$
The parallelization is achieved by distributing the tiles of computation described in Section 3 across the clusters of a cube.
Network Training Energy
We then model different layers of DNNs as the amount of computation and data transfers necessary. This also gives us a per-layer estimate of the number of parameters and intermediate activations. We proceed to model each of the investigated networks as a sequence of these layers, giving us a realistic estimate for the execution time of the inference and training steps of one image on the proposed architecture, together with the associated bandwidth requirement. We then further use the execution time and the cluster and bandwidth-dependent DRAM/LoB power determined above to estimate the overall energy required to process one image.
Technology Scaling
We use internal comparisons and publicly available information to estimate the effect of scaling down the technology node of the LoB from the 28 nm FD-SOI process investigated by us to a more modern 14 nm FinFET node [8], [9]. For this change we observed across several designs a 1.4× increase in speed, a scaling of area by 0.4×, and a scaling of dynamic power dissipation by 0.7×.
To our knowledge there is no publicly available information on the DRAM characteristics of HMCs. "SMCSim" [4] assumes them to be similar to the MT41J512M8 device by Micron, which is based on a 50 nm process. Given the manufacturer and [10] , the device seems to be a reasonable reference for early HMCs. We estimate the DRAM technology scaling factor for power consumption to be 0.87, by comparing the supply currents and voltages of this device to the newer 30 nm MT40A512M8.
GPU Efficiency Estimation
We estimate GPU efficiency based on the training time per image measured by [11] , [12] . For each network we compute the amount of flop necessary per image based on our model of the network. This yields an estimate of the actual throughput in flop/s achieved. Assuming the GPU can reach its TDP under such highly optimized workloads (e.g., cuDNN), we determine the energy efficiency as the ratio between that throughput and the TDP. We do not assume optimizations such as Winograd to be performed on the GPU, and as such overestimate the number of FP operations performed, making the estimated energy efficiency optimistic. Furthermore, this excludes the power consumed by the CPU to constantly push training data into GPU memory.
Precision, Sparsity, Compression
Training a DNN with reduced FP precision or even fixed-point arithmetic is much harder than doing the same for inference. The intuition here is that the SGD algorithm performs smaller and smaller changes to the parameters as training progresses. If these changes fall beneath the numeric precision, the algorithm effectively stops converging. There is no a priori obvious range of magnitudes within which parameters fall, thus the arithmetic must support a significant dynamic range without additional prior analysis. NTX employs 32 bit FP arithmetic, which is commonly used in deep learning frameworks and CPUs/GPUs, rendering such analysis unnecessary. Note that there is evidence that training is possible in fixed-point arithmetic with little accuracy loss in some cases [13]. However, results tend to be limited to specific networks, and other work suggests that reducing precision may not be feasible at all without incurring significant accuracy loss [14].
Fig. 7. T_d,seq corresponds to memory transfers that need to happen before and after the main computation, e.g., first data fetch and last data store. T_d,par corresponds to transfers that can happen in parallel to the computation T_c. Shown are a compute-bound case where T_c dominates, and a memory-bound case where T_d,par dominates.
Recent work on network compression and pruning techniques has shown promising results in terms of reducing computational overhead [15] . The general purpose nature of the RISC-V processors in our architecture allows some of these schemes to be implemented. For example entire convolutions may be skipped or certain forms of decompression and re-compression may be performed on the processor cores. The NTX has not been optimized for sparse tensor operations however, and we leave their detailed analysis for future work.
Voltage and Frequency Scaling (VFS)
In this section we assess the efficiency of NTX at different operating points. We vary the supply voltage between 0.6 V and 1.2 V; and the operating frequency between 0.1 GHz and 2.5 GHz for the 28 nm process and 0.14 GHz and 3.5 GHz for the 14 nm process. The voltage is assumed to scale linearly with frequency [16] and is thus varied in proportion to the frequency. Fig. 8 plots the energy efficiency of HMCs with different NTX configurations against the operating frequency. Two counteracting effects lead to a tradeoff between efficiency and frequency: On one hand DRAM consumes significant static power, making it beneficial to operate at a higher frequency to decrease the time to solution. On the other hand the NTX power consumption increases quadratically with voltage and thus frequency. For larger configurations, the internal bandwidth limit of the HMC is reached at a certain frequency, visible as a dent in the efficiency. The points of highest efficiency are listed in Table 5 . Fig. 9 shows a breakdown of the power consumption at these operating points. Fig. 10 provides a more detailed power breakdown of the NTX 64 configuration in 14 nm.
All configurations remain within a power budget of 25 W, which according to [17] is feasible for an HMC with active cooling and keeps DRAM temperature within nominal refresh limits. If the static power of the DRAM decreases, e.g., by switching to a different memory technology, these optimal operating points will change.
Fig. 9. Power dissipation of different configurations, evaluated at their most-efficient operating point in Fig. 8. Note that even the massively parallel configurations with more than 64 clusters are below a TDP of 25 W.
Fig. 10. Power breakdown of the processing clusters in an NTX 64 in 14 nm performing the 3 × 3 convolution described in Fig. 4. 47 percent of the HMC's power is consumed by the clusters. More precisely, 76 percent of the cluster power is dedicated to computation (NTX FPUs, TCDM, and TCDM interconnect), while 21 percent is consumed by control logic (DMA, ICACHE, NTX controller, RISC-V core) and 3 percent by cluster peripherals. The DRAM, memory controllers, and interconnect in the HMC consume another 11.6 W, 53 percent of the total cube power. Notably, 14 percent of the total HMC power is spent in the FPUs.
Multiple Logic Layers
Table 5 shows the area occupied by different NTX configurations. The unoccupied area on the LoB is not precisely known, and estimates range from 10 mm² [7] to 50 mm² [18]. In the following we assume that the LoB has an area of 50 mm², of which 25 mm² are unused and thus available to custom logic. This allows configurations of up to 64 clusters per HMC. For larger configurations, we propose the use of multiple stacked logic dies such as the 3D Logic in Memory (LiM) proposed in [19]. While additional layers increase the complexity of the die stack, they allow for a significant increase in parallelism and efficiency. Furthermore, the use of LiM layers for custom accelerator logic has the additional benefit of decoupling the LoB manufacturing process from the accelerator, thus allowing modular assembly of "Application Specific Memory Cubes (ASMCs)". We expect this concept to be relevant for High Bandwidth Memory (HBM) as well.
Memory
Table 3 summarizes our estimates for the memory occupied by the parameters and the intermediate activations of the networks investigated in this paper. We derive these from the network structure outlined in the corresponding papers. For training with a batch size of 1 the footprint amounts to 239 MB, 73.2 MB, and 461 MB for AlexNet, GoogLeNet, and ResNet-152, respectively. For batch sizes greater than 1, where the gradient of each image is computed separately and added to a weighted average, another set of parameters is needed to hold the accumulated gradient. This amounts to 471 MB, 99.8 MB, and 767 MB, respectively. Note that the memory footprint then remains constant for all batch sizes. 1 This leaves 0.5 GB to 7 GB for training data depending on network and HMC size, around 3,550 to 48,600 sample images (227 × 227 × 3), equating to 31 s to 247 s of independent training operation on NTX 64.
1. Images in a batch may still be processed in sequence before a weight update is performed, keeping the memory requirement constant.
To fully utilize the bandwidth into DRAM, it is paramount that the accesses emitted by the DMA occur in sufficiently long bursts and have high locality with respect to DRAM pages to reduce overhead. In the case of 4D tiling [7], this is given by the fact that the pre-tiled data lies in DRAM as a dense consecutive sequence. In the case of on-the-fly tiling, the DMA has to issue more and smaller bursts, since the required data does not lie in DRAM consecutively. The tile dimensions however offer multiple degrees of freedom to adjust the access patterns generated by the clusters. For example, HMCs [5] use an internal bus width of 32 B and a maximum DRAM page size in the range of 32 B to 256 B. In the aforementioned 3 × 3 convolution most data transfers occur as bursts of 72 B, 88 B, or 96 B, and 92 percent of all data is transferred in bursts above 32 B. Fig. 11 shows a histogram of the burst lengths issued by the DMA into the DRAM. The few small bursts are due to convolution weight transfers, which can be cached to improve burst length further. We thus conclude that our architecture is capable of fully utilizing DRAM bandwidth by emitting sufficiently large accesses.
Comparison with NeuroStream
NeuroStream [7] was aimed primarily at efficient inference and requires data to be very carefully laid out in memory (4D tiling). This constraint on data layout makes training very inefficient, since intermediate activations after each layer need to be re-tiled when storing them back to DRAM. This puts a significant workload on the RISC-V processor cores and causes additional traffic into memory. The processors are under high load to keep the NS saturated with FP operations, such that spending compute cycles on re-tiling also means stalling the NS co-processors. Our architecture does not depend on such a tiling.
In Table 4 we compare NTX to NS, both implemented in 28 nm. The much improved offloading scheme allows us to increase the ratio of co-processors to control cores from 2:1 to 8:1. The fast FMAC allows us to operate the NTX at twice the frequency of the rest of the cluster, leading to an increase of peak performance from 256 Gop/s to 384 Gop/s for the 16 cluster version. The increased number of hardware loops and operations supported by NTX, together with the improved performance, allow us to increase the energy efficiency of a training step from 15 Gop/s W to 21 Gop/s W.
The 16 cluster configuration requests a peak bandwidth of 57.6 GB/s, which does not saturate the internal bandwidth of up to 320 GB/s available inside the HMC. We can improve the energy efficiency to 38.3 Gop/s W by increasing the number of clusters to 64.
Comparison with other Accelerators
To compare against other accelerators, we use one training step of AlexNet [20], GoogLeNet [1], Inception v3 [21], three variants of ResNet [22], and a Long Short-Term Memory (LSTM) with 512 inputs and hidden states as workload. Table 5 and Fig. 12 provide an overview of the compared architectures.
To our knowledge there are two other custom accelerators besides NeuroStream [7] that claim support for training at precisions similar to ours: DaDianNao [13] and ScaleDeep [23]. Both provide much less memory relative to their computational power than GPUs, NeuroStream, and NTX. To compare on a system level, we estimate the efficiency of these accelerators including additional DRAM to hold training data, as described in Section 5.2. In this case, DaDianNao has an efficiency of 65.8 Gop/s W with fixed-point arithmetic, which is identical to the computationally equivalent NTX 128. ScaleDeep has an efficiency of 100.8 Gflop/s W, which is 1.3× higher than NTX 512, the largest configuration considered by us. GPUs are currently the accelerator of choice to train DNNs. Our architecture can achieve significantly higher energy efficiency than a GPU at a comparable technology node (see Fig. 12). Considering the largest NTX configurations that do not require additional LiMs, we achieve an efficiency increase of 2.5× from 11.8 Gop/s W to 29.9 Gop/s W in 28 nm, and an increase of 2.7× from 20.4 Gop/s W to 54.9 Gop/s W in 14 nm. Compared to the GPU power analysis and model published in [26], NTX spends a larger fraction of power in the FPUs, namely 14 versus 4.8 percent. Assuming an FMA requires the same energy in similar technology nodes, this difference corresponds to the observed efficiency increase and gives an intuition of why NTX outperforms GPUs. This is in part due to the absence of caches in NTX and the GPU's significant idle power. See Fig. 10.
Deployed Silicon
One unique key benefit of our architecture is that it leverages existing unused silicon area. This incurs almost no additional cost, since we assume the HMCs to be already present in the system as main memory of the CPU, and the manufacturing cost of the spare silicon area is the same regardless of whether it is being used. This allows us to deploy up to 32 processing clusters in 28 nm and 64 processing clusters in 14 nm with no additional silicon needed. Fig. 13 compares the Gop/s of compute performance per deployed amount of silicon for NTX and GPUs. Our solution requires 4.4× less area to achieve the same compute performance as a GPU. Even more so when one considers that the chosen 32 and 64 cluster configurations fit into the aforementioned unused silicon, their cost is virtually zero. This sets our solution apart from ScaleDeep, DaDianNao, and GPUs, which require significant silicon overhead.
Fig. 12. Comparison of the energy efficiency for the training workloads of Table 5 (geometric mean), with GPUs, NS [7], and the largest NTX configurations that do not require additional LiMs. NTX 32 in 28 nm achieves a 2.5× increase, and NTX 64 in 14 nm a 2.7× increase in efficiency over GPUs in similar technology nodes.
Fig. 13. Comparison of the Gop/s of compute performance per deployed area of silicon, for GPUs, NS [7], and the largest NTX configurations that do not require additional LiMs. NTX 32 in 28 nm achieves a 2.7× increase, and NTX 64 in 14 nm a 4.4× increase in area efficiency over GPUs in similar technology nodes.
Table 5 footnotes: ... [11]; GoogLeNet: batch size 128 with Torch/cuDNN 5.1 [12]. ‡ All nets: batch size 16, Torch/cuDNN 5.1 [24]. § 512 inputs and hidden states, batch size 32 for NTX and 64 for GPU [25]. × GDDR5, GDDR5X, and HBM2; process node estimated based on GPU release year. * Estimated system efficiency including DRAM, see Section 5.
Scaling to Multiple HMCs
We investigate the scaling behavior of NTX 64 and organize the HMCs in a square mesh of different side lengths N as depicted in Fig. 14a. Each link operates at 60 GB/s [5]. We leverage data-parallel training to distribute computation across the HMCs in the mesh, as is also commonly done on GPUs [27]. Each HMC computes its local weight update first. The global update is then performed in four waves, as a horizontal followed by a vertical systolic average which can be performed in a streaming fashion.
We assume the weight update to be 300 MB, which takes T_tx = 4.88 ms to transmit. Each cube takes 104 µs to compute the average, which is negligible compared to T_tx. Furthermore, the internal bandwidth of the HMC is much larger than the 120 GB/s required by the two serial links active in parallel during streaming operation. We assume a latency of T_lat = 20 µs inside the cube, which is a very conservative estimate. The time taken for one of the four waves described above is then

$$T_{pass} = T_{tx} + N \cdot T_{lat}.$$

Since T_lat is small relative to T_tx, the number of cubes in the mesh has only little influence on this time. For a very large mesh of N = 16 (256 HMCs), T_pass = 5.20 ms. Since four such passes are necessary, the total time to perform the weight update across the mesh is

$$T_{update} = 4 \cdot T_{pass}.$$

A time diagram of such an update is depicted in Fig. 14b. The time required by the mesh to calculate the local weight update is

$$T_{step} = \frac{L_B}{N^2} \cdot 8.69\,\mathrm{ms},$$

where L_B is the total batch size across all cubes. This yields a total execution time for one batch across the mesh of T_total = T_update + T_step. A single HMC would perform the same computation in T_single = 8.69 ms · L_B. Fig. 14c
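As a worked example using the reconstructed formulas above (our own arithmetic), consider the batch size of 8,192 images used below on a mesh of 64 HMCs (N = 8):

$$T_{step} = \frac{8192}{64} \cdot 8.69\,\mathrm{ms} \approx 1.11\,\mathrm{s}, \qquad T_{update} = 4 \cdot (4.88\,\mathrm{ms} + 8 \cdot 20\,\mathrm{\mu s}) \approx 20.2\,\mathrm{ms},$$

so the global weight update adds less than 2 percent to the per-batch time, compared to T_single = 8.69 ms · 8192 ≈ 71.2 s on a single cube.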
where L B is the total batch size across all cubes. This yields a total execution time for one batch across the mesh of T total ¼ T update þ T step . A single HMC would perform the same computation in T single ¼ 8:69 ms Á L B . Fig. 14c Regarding energy efficiency we consider two operating modes of the cubes: During the global mesh update the serial links and clusters are active. We assume the four serial links to consume P link ¼ 8 W [7] . The energies to compute a wave pass and to power-cycle the serial links [5] are
The energies spent per HMC for the global and local weight updates are
and the overall energy for one batch across the mesh requires
A single HMC would require E_single = T_single · 21 W for the same task. At a batch size of 8,192 as above, 64 HMCs achieve an energy efficiency of 94.3 percent, and 144 HMCs achieve 88.1 percent.
Savings at Data Center Scale
Computing at data center scale incurs a significant energy and cost overhead over the raw hardware's power consumption. This is among other factors due to the required air conditioning and cooling. A standard measure for this overhead is the Power Usage Effectiveness (PUE) [28], the ratio of the power consumed by a data center to the power consumed solely by its compute units:

$$\eta_{pue} = \frac{P_{data\ center}}{P_{compute}}.$$
Data centers are reported to have η_pue = 1.12 [29]. The figure depends heavily on the local climate, and usually only the winter months' numbers are published. We therefore assume an average η_pue = 1.2. We consider an NVIDIA DGX-1 server with two Intel Xeon CPUs and eight Tesla P100 cards. One such unit consumes 3.2 kW of power, 2.4 kW of which are due to the GPUs. We assume DDR4 DRAM to consume 6 W per 16 GB of storage under full load [30]. We investigate two different approaches of replacing the GPUs of the system with NTX-augmented HMCs. Consider that the 512 GB of system memory require 256 chips distributed across the DIMM modules if built from 16 Gbit DRAM chips. An 8 GB HMC is roughly equivalent to 4 such chips, so the same system built from HMCs would comprise 64 memory cubes.
Same Peak Compute
The P100 cards achieve a combined peak compute of 84.8 Tflop/s. Fig. 15 shows the number of HMCs required to match this performance, and the achievable energy savings, with different NTX configurations per cube. The 43 HMCs with NTX 128 required to achieve the same compute power consume only 860 W, saving 2.4 kW of GPU power and an additional 128 W of DRAM power, for an overall reduction of 2.1×. With a PUE of 1.2 this translates to 1868 W of saved power, which at an energy price of 0.1104 $/kWh [31] is 1808 $ per year and server.
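The annual figure follows directly from the saved power, assuming continuous operation (our own arithmetic):

$$1868\,\mathrm{W} \cdot 8760\,\tfrac{\mathrm{h}}{\mathrm{yr}} \cdot 0.1104\,\tfrac{\$}{\mathrm{kWh}} \approx 1807\,\tfrac{\$}{\mathrm{yr}}.$$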
Same Thermal Design Power
RELATED WORK
Acceleration of DNNs, in particular the forward pass, is a well researched field with a rich literature. Goodfellow, et al. [3] provide a good coverage of the mathematical background of Deep Learning. An overview of techniques for efficient DNN inference and the involved challenges can be found in [2] .
Accelerators for Inference
Architectures to accelerate the inference process of Convolutional Neural Networks (CNNs) have been studied extensively in the literature. FPGA-based accelerators report energy efficiencies on the order of 10 Gop/s W and usually rely on fixed-point arithmetic with less than 32 bit precision [33]. ASIC-based accelerators provide efficiencies on the order of 1000 Gop/s W at reduced precision, for example Google's TPU [34] which uses 8 bit arithmetic, or [35] with 1 bit. Near-memory inference architectures embedded in the logic die of an HMC have also been investigated, for example the 2D accelerator array presented in [18] which uses 16 bit fixed-point arithmetic and achieves up to 450 Gop/s W, or the clustered many-core architecture [7] which is based on 32 bit FP co-processors and achieves up to 22.5 Gflop/s W. Certain architectures such as [36] employ a distributed memory model, where the entire network's parameters are stored on chip. This becomes increasingly difficult with modern networks that require hundreds of MB [1], [20], [21], [22], and the network to be trained is tightly bound to the number of chips that can be interconnected. We furthermore observe that due to the vast difference in energy spent for computation and data transfer, it is only meaningful to compare architectures that use the same arithmetic precision and bit width.
Accelerators for Training
We observe that much fewer architectures have been proposed to cover the training aspect of DNNs. Many of the aforementioned architectures are not suitable for this since they lack the ability or memory capacity to store intermediate activations, e.g., due to optimizations in the data path, or the precision and dynamic range for the training to converge.
The NeuroCube [37] is based on 16 bit fixed-point MAC units which are capable of performing the necessary computations, but it is unclear if training of modern deep networks converges at this precision and dynamic range. NTX surpasses NeuroCube's efficiency of 7.63 Gop/s W because we focus on maximizing the energy spent in the FPU, e.g., by doubling its clock frequency. We furthermore use VFS to increase efficiency.
DaDianNao [13] uses 64 chips to perform training at 32 bit fixed-point precision with 2.3 GB of distributed memory. A single chip with 2.1 Top/s and 36 MB is roughly equivalent to NTX 128 with 2 Tflop/s and 16 MB, but without the HMC. In this setting NTX achieves a 1.9× better core efficiency of 250.9 Gflop/s W despite its FP arithmetic. Considering the estimated 15.8 GB of DRAM needed to match the memory-to-compute density of the DGX-1 puts DaDianNao's system energy efficiency at 65.8 Gflop/s W, assuming that the overall power cost of the memory (DRAM modules, memory controller, interconnects) is comparable to that of an 8 GB HMC (1 W/GB).
ScaleDeep [23] supports training in 32 bit FP precision at 14 nm. We estimate its total die area to be 2800 mm² based on its TDP and the power density of a GPU, which yields 243 Gflop/s per mm², around 2.3× more than NTX 64. An entire node achieves 680 Tflop/s but has only 1.17 GB of distributed memory. As such it excludes any form of system DRAM, and its core efficiency of 420.9 Gflop/s W is on par with the 417.0 Gflop/s W achieved by NTX 512. The estimated 5.13 TB of DRAM needed to match the memory-to-compute density of the DGX-1 puts ScaleDeep's system power consumption at 6.75 kW under the same assumptions as for DaDianNao, with a system energy efficiency of 100.8 Gflop/s W. It is unclear how much additional energy would be consumed by the processing power required to feed these accelerators with data.
GPUs
GPUs can be seen as the main workhorse of Deep Learning and are commonly used for both inference and training due to their flexibility. Recent implementations on the GTX 780 and GTX Titan (both featuring the Kepler microarchitecture) reach 1,650 Gflop/s at 250 W and 999 Gflop/s at 240 W, which corresponds to 6.6 and 4.2 Gflop/s W, respectively [7], [38]. Embedded GPUs like the Tegra K1 have lower absolute throughput, but reach a similar energy efficiency of around 7 Gflop/s W [38]. The Pascal generation of GPUs offers several features beneficial to DNNs, such as HBM and 16 bit FP support. Compared to previous generations, the P100 achieves a 2× higher peak throughput of 10.6 Tflop/s and a significantly higher energy efficiency of around 20 Gflop/s W [11], [24]. The recently introduced Volta generation offers tensor cores, a new compute element able to perform a 4 × 4 matrix Fused Multiply-Add (FMA) in 16 bit FP, with 16 or 32 bit outputs, in one cycle. These cores promise a 5× increase in Deep Learning performance compared to previous GPU generations [39]. Furthermore, GPUs have been shown to be amenable to near-memory processing as well [41].
FUTURE WORK
Our architecture is inherently scalable, since the HMC standard allows memory cubes to be interconnected via the serial links [5]. Mesh arrangements of HMCs offer many opportunities, and different parallelization techniques [40] for training DNNs should be explored. Moving to HBM promises further energy efficiency gains and brings new challenges and design constraints in using the bottom memory controller die. Improvements to the DMA engine in the compute clusters would allow for even more efficient offloading and further ease the load on the RISC-V processor core. The applicability of transprecision and compression techniques offers other interesting angles to be investigated for further gains.
CONCLUSION
We have presented the streaming FP co-processor NTX with a decisive focus on training DNNs. Its data path is built around a fast fused accumulator with full 32 bit precision, which gives it a key advantage over architectures based on fixed-point arithmetic or lower FP precision. The co-processor is capable of generating three independent address streams from five nested hardware loops, allowing it to traverse structures with up to five dimensions in memory independently. A rich set of arithmetic and logic commands allows it to perform the reductions and matrix/vector operations commonly found in the forward pass, but also the threshold, mask, and scatter operations encountered during the backward pass. We combine eight such co-processors with memory, a control processor, and a DMA unit into a cluster. An efficient offloading scheme frees up resources on the control processor to exert fine-grained control over data movement. The data therefore does not need to be laid out in memory in a specific, pre-tiled pattern, but can be operated on directly in its canonical and dense form. Integrated into the LoB of an HMC, multiple clusters can exploit the high bandwidth and low access latency into DRAM in this near-memory setting. Configurations which fit into the unused area on the LoB incur virtually zero additional manufacturing cost. NTX scales well to large meshes of HMCs and can provide the same compute capability at less power, or more compute capability at the same power.
