Technical Disclosure Commons
Defensive Publications Series
August 2021

Processing-in-Memory for Weight Updates in a Neural Network
Accelerator
N/A

Follow this and additional works at: https://www.tdcommons.org/dpubs_series

Recommended Citation
N/A, "Processing-in-Memory for Weight Updates in a Neural Network Accelerator", Technical Disclosure
Commons, (August 13, 2021)
https://www.tdcommons.org/dpubs_series/4537

This work is licensed under a Creative Commons Attribution 4.0 License.
This Article is brought to you for free and open access by Technical Disclosure Commons. It has been accepted for
inclusion in Defensive Publications Series by an authorized administrator of Technical Disclosure Commons.

Processing-in-Memory for Weight Updates in a Neural Network Accelerator
ABSTRACT
Leveraging the vectorizability of deep-learning weight-updates, this disclosure describes
processing-in-memory (PIM) techniques for weight-updates in a large class of deep learning
networks. Rather than importing the state of the deep-learning optimizers to the computational
die, the techniques send gradients to a die of a high-bandwidth memory (HBM) stack and
perform the modest number of optimizer updates in compute units located in the die. Since reads
and writes are done inside the HBM stack, the techniques can substantially reduce the CPU-HBM
bandwidth requirements. Weight-related memory traffic, dominant for multilayer
perceptrons and transformers, is also reduced.
KEYWORDS
● Processing-in-memory (PIM)
● Compute-in-memory
● High bandwidth memory (HBM)
● Weight update sharding
● Inference embeddings
● Neural network
● Machine learning
● Deep learning (DL)

Published by Technical Disclosure Commons, 2021

Defensive Publications Series, Art. 4537 [2021]

BACKGROUND
Weight update in deep-learning neural networks
Deep learning optimizers are techniques that reduce the number of epochs required to
train a deep neural network by maintaining additional state variables. Weight-update equations in
deep-learning neural networks have some common features: all weight updates, be it stochastic
gradient descent (SGD), Momentum, Nesterov, RMS-propagation, Adam, NAdam, etc., compute
new weights (θt+1) at iteration t+1 based on the old weights (θt), the internal state variables, the
hyper-parameters, and the gradient ∇l( θt ). The weight-update equations, which are equivalent
to optimizers, comprise a relatively small number of vectorizable elements and operations:
● The gradient, accepted as an input.
● One to three state variables, typically the old weights and the previous values of one or
two momentum terms, read from memory.
● The gradient, the state variables, and the hyperparameters, used to perform a small set of
scalar operations to produce new weight and state variables. The operations are limited to
addition/subtraction, multiplication, division, and square root.
● The state variables, stored back in memory.
A given weight interacts with its gradient, its associated state variables, and a handful of
hyperparameters, each of which is a scalar for a given time-step t.
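The structure listed above can be sketched with one concrete optimizer. The following is a minimal illustration using the standard Adam update in plain Python; the function name `adam_update` and the default hyperparameter values are assumptions chosen for the example, not taken from this disclosure.

```python
import math

def adam_update(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step over flat lists; returns new (theta, m, v).

    Element i touches only theta[i], m[i], v[i], grad[i] and the scalar
    hyperparameters, which is what makes the update vectorizable.
    """
    # Bias-correction factors are scalars for a given time-step t (t >= 1).
    c1 = 1.0 - beta1 ** t
    c2 = 1.0 - beta2 ** t
    new_theta, new_m, new_v = [], [], []
    for th, mi, vi, g in zip(theta, m, v, grad):
        mi = beta1 * mi + (1.0 - beta1) * g        # first-moment state
        vi = beta2 * vi + (1.0 - beta2) * g * g    # second-moment state
        step = lr * (mi / c1) / (math.sqrt(vi / c2) + eps)
        new_theta.append(th - step)                # new weight
        new_m.append(mi)                           # state stored back
        new_v.append(vi)
    return new_theta, new_m, new_v

# Illustrative call with made-up values.
theta, m, v = adam_update([1.0, -2.0], [0.0, 0.0], [0.0, 0.0], [0.5, -0.25], t=1)
print(theta, m, v)
```

The loop body uses only addition/subtraction, multiplication, division, and a square root, and every iteration is independent of the others.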
High-bandwidth memory (HBM)
High-bandwidth memory (HBM) is the main memory technology for graphics processing
units (GPUs) and machine learning or neural network accelerators. HBM is built out of the same
kind of memory cells as standard DRAM (Dynamic Random-Access Memory), but is organized
differently to trade off reduced total memory capacity (GiB) for bandwidth (GiB/s).
As stated in [2], “HBM is a memory chip with low power consumption and ultra-wide
communication lanes. It uses vertically stacked memory chips interconnected by through-silicon
vias (TSVs).” HBM is considered a 2.5D technology because although it uses the third
dimension to integrate compute and memory dies, it doesn’t go all the way to true 3D fabrication
techniques. An HBM stack typically comprises 4-to-8 DRAM dies on top of a logic die. The
logic die gathers the signals from the stack and transfers values through a silicon interposer to the
main GPU or compute die. The logic or base die is relatively simple and is built in a CMOS
compute process to enable the efficient construction of drivers and of a small amount of logic to
interface between the stack of DRAMs and the main compute die.
Processing-in-Memory (PIM)
Processing-in-memory is the integration of a processor, e.g., compute functions, with
memory on a single chip. PIM is an approach to overcoming the von Neumann bottleneck, a
limitation on throughput caused by the latency inherent in standard (von Neumann) computer
architecture. In the standard architecture, programs and data are held in memory; the processor
and memory are separate; and data moves between the two. The processor can spend time
waiting for data to be fetched from memory; this is known as the von Neumann bottleneck. In
PIM, logic devices and memory cells are tightly coupled. Processor logic is directly connected to
the memory stack. Such integration of processing and memory decreases latency and power
consumption, and increases processing speed and memory transfer rate.
It is possible to add compute functions to HBM, thus achieving PIM, without unduly
stressing current cooling solutions. For example, power-intensive, matrix computations can be
moved into the logic die of the HBM to leverage the very large memory bandwidth between the
logic/base and the DRAM die. Weight update in neural networks is an application well-suited for
PIM: weight-update computations are memory intensive; the memory accesses are sequential
and local to the memory block; and the computation rate fits within the power budget of a PIM
module.
DESCRIPTION
Leveraging the vectorizability of deep-learning weight-updates, this disclosure describes
processing-in-memory techniques for weight-updates in a large class of deep learning networks.
Rather than importing the state of the optimizers to the computational die, the techniques send
gradients to a die of an HBM stack (e.g., the base die or other suitable die) and perform the
modest number of optimizer updates in compute units located in the die, close to the HBM
memory. For the optimizer equations, the techniques can substantially reduce the CPU-HBM
bandwidth since the reads and writes are done inside the HBM stack. Weight-related memory
traffic, dominant for multilayer perceptrons and transformers, can also be reduced substantially.
Models with more activations, such as convolutional neural networks, may see smaller, but still
significant gains.
Per the techniques, the optimizer equations are written as a small number of per-weight
scalar operations that can be vectorized into elementwise operations. The operations to
implement the optimizer all have low arithmetic intensity: for each byte read from memory, there
are only a handful of ALU operations that use that byte. The optimizers store additional
floating-point values per weight, increasing the storage required per parameter/weight in the model.
However, the only input the optimizer formulas need from the compute die is the gradient; the
state variables they operate on already reside in memory.

Fig. 1: Processing-in-memory (PIM) for weight updates in a neural network accelerator
Fig. 1 illustrates PIM optimization calculations for neural network training, per the
techniques of this disclosure. Although Fig. 1 uses NAdam to illustrate PIM weight-update, the
techniques are applicable across different types of weight-update equations, e.g., stochastic
gradient descent (SGD), Momentum, Nesterov, RMS-propagation, Adam, NAdam, etc. The
computationally intensive steps of forward propagation, backpropagation, and weight update are
indicated in red.
Forward and backward propagation each load the weights from memory once. Weight
update outputs the gradients, indicated in orange. State variables m, v, and θ, residing in HBM,
are indicated in light blue. The region between the dashed and dotted thick red lines is where the
optimizer calculations take place. The eight load/store (ld/st) operations to the state variables are
indicated in yellow. Hyperparameters are indicated in light green, and compute operations are
indicated in light red.
The thick dotted line shows the current separation between operations performed on the
main computational die (e.g., the matrix multiplications and the optimizer calculations) and the
state variables held in HBM. The thick dashed line shows the re-partitioning of tasks described
herein, where optimizer calculations are done in the HBM logic/base die (or other suitable die).
Instead of performing eight load/store operations, the described techniques perform just three
transfers: a weight load for forward propagation, a weight load for backward propagation, and
sending the gradients to the HBM for the optimizer work.
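The re-partitioning can be sketched as a toy model in which the optimizer state never leaves the memory stack. The class `PimHbmStack` and its methods are hypothetical names introduced for illustration only, and a classical momentum update stands in for NAdam to keep the example short.

```python
class PimHbmStack:
    """Toy model of an HBM stack whose logic die runs the optimizer.

    Hypothetical interface: names are illustrative, not from the disclosure.
    """

    def __init__(self, theta, lr=0.1, beta=0.9):
        self.theta = list(theta)        # weights, resident in HBM
        self.m = [0.0] * len(theta)     # momentum state, resident in HBM
        self.lr, self.beta = lr, beta   # scalar hyperparameters
        self.transfers = 0              # weight-sized vectors crossing to/from the compute die

    def load_weights(self):
        """Compute die reads the weights (forward or backward pass)."""
        self.transfers += 1
        return list(self.theta)

    def apply_gradients(self, grad):
        """Compute die sends gradients; the update runs inside the stack."""
        self.transfers += 1
        for i, g in enumerate(grad):
            self.m[i] = self.beta * self.m[i] + g    # in-stack read and write of m
            self.theta[i] -= self.lr * self.m[i]     # in-stack read and write of theta


hbm = PimHbmStack([1.0, 2.0])
w_fwd = hbm.load_weights()          # forward propagation
w_bwd = hbm.load_weights()          # backward propagation
hbm.apply_gradients([0.5, -1.0])    # gradients are the only optimizer traffic
print(hbm.transfers)                # 3
```

In this model the reads and writes of `m` and `theta` during the update never increment the transfer counter, mirroring the claim that they stay inside the HBM stack.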
Activations also consume memory bandwidth, so by Amdahl’s Law the end-to-end benefit is
likely to be somewhat lower than the reduction in weight traffic alone, but the overall gain in
computational efficiency is still significant. The PIM calculation model requires elementwise
operations of modest complexity.
For example, a floating-point ALU that supports addition, multiplication, and inverse square root
suffices to express all weight-update formulas.
Let reading all the weights of the model once consume a memory bandwidth B. Currently, a
memory bandwidth of 8B is expended on forward propagation (1B), backward propagation
(1B), and weight update (6B from 3 reads and 3 writes). The described processing in the HBM
logic die reduces the memory bandwidth consumed to 3B, i.e., a reduction in total memory
bandwidth due to parameter traffic by a factor of 8/3 ≈ 2.7. For weight-intensive models, such a
reduction can significantly improve the operational intensity of the computation. In such models,
forward and backward propagation have many more operations per byte fetched (again, 6X: the
FLOPS are the same for each; what differs is the memory bandwidth used). The reduction in
memory traffic also improves energy efficiency, saving the power needed to move five
weight-images' worth of data from HBM to the compute processor and back.
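The bandwidth accounting above can be checked with a few lines of arithmetic. In this sketch the helper `weight_traffic` is a hypothetical name, and traffic is counted in units of B, one weight-sized vector crossing between the HBM and the compute die.

```python
def weight_traffic(num_state_vectors, pim):
    """Weight-sized vector transfers between HBM and compute die per step.

    num_state_vectors counts every vector the optimizer reads and writes
    back, including the weights themselves (3 in the text: m, v, theta).
    """
    propagation = 2                                # one weight load each for forward and backward
    if pim:
        return propagation + 1                     # only the gradients are sent to the HBM
    return propagation + 2 * num_state_vectors     # each state vector is read and written

baseline = weight_traffic(3, pim=False)
with_pim = weight_traffic(3, pim=True)
print(baseline, with_pim, round(baseline / with_pim, 1))  # 8 3 2.7
print(baseline - with_pim)                                # 5
```

The difference of 5 corresponds to the five weight images that no longer travel between the HBM and the compute processor.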
Instruction set architecture
Given the large and increasing number of optimizers, a programmable accelerator in the
die can be optimal. Because the operations are vectors-in-memory (VIM), an off-the-shelf vector
architecture, e.g., the RISC-V vector extension, is suitable. A vector processor needs a scalar core
to handle overhead operations and to issue the compute-intensive vector instructions. The neural
network optimizers are varied but simple; hence a relatively small instruction memory is sufficient.
The host CPU can load the instruction memory, and the programmed kernels in the VIM accelerator
can then be invoked as necessary.
Computation and energy calculations
The NAdam formulas have eight multiplications, nine additions/subtractions, and three
operations that map to the inverse square root unit, for a total of 18 operations/weight. For a
neural network with 100 million weights, that amounts to 1.8 GFLOP (total operations, not
operations per second) over a training step. The 1.8 GFLOP figure is not affected by the
batch size, as the batch is fully reduced as part of gradient calculation. Assuming the bandwidth
of an HBM stack is about 400 GB/s, and given that NAdam must read and write 4 vectors (8 vector
transfers in all), each having 4 bytes per element, the performance requirement for the PIM die
at full speed is
$$\frac{18 \times 400\ \text{GB/s}}{8 \times 4\ \text{B}} = 225\ \text{GFLOP/s}.$$

Assuming a die power consumption of 8 watts and the aforementioned target of 225
GFLOP/s, the peak efficiency required of the PIM is 225÷8 ≅ 28 GFLOP/s/watt. Typical
machine learning processors can achieve energy efficiencies (GFLOP/s per watt) that are two to
ten times this number, such that the PIM techniques described herein are plausible and impose no
unduly large compute requirement on the HBM die.
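The estimate above can be reproduced step by step. This is a back-of-the-envelope sketch that takes every input from the text as given (including the per-weight total of 18 operations); the constant names are introduced here for illustration.

```python
OPS_PER_WEIGHT = 18       # per-weight NAdam total used in the text
NUM_WEIGHTS = 100e6       # example network size from the text
HBM_BANDWIDTH = 400e9     # bytes/s, assumed stack bandwidth from the text
VECTORS_MOVED = 8         # 4 vectors read + 4 vectors written
BYTES_PER_ELEMENT = 4
DIE_POWER_W = 8           # assumed die power from the text

total_ops = OPS_PER_WEIGHT * NUM_WEIGHTS                  # operations per training step
bytes_per_weight = VECTORS_MOVED * BYTES_PER_ELEMENT      # in-stack traffic per weight
weights_per_second = HBM_BANDWIDTH / bytes_per_weight     # rate the memory can feed
required_flops = OPS_PER_WEIGHT * weights_per_second      # compute needed to keep up
flops_per_watt = required_flops / DIE_POWER_W

print(total_ops / 1e9)        # 1.8   (GFLOP per step)
print(required_flops / 1e9)   # 225.0 (GFLOP/s)
print(flops_per_watt / 1e9)   # 28.125 (GFLOP/s per watt)
```

The requirement scales linearly with the assumed bandwidth and operation count, so other optimizers or HBM generations can be evaluated by changing the constants.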
In this manner, the PIM optimizers described herein can enable substantial reduction in
the memory bandwidth required for training neural networks. The operations are not
arithmetically intensive matrix multiplications; rather, they are cheap element-wise operations
performed on large state variables. For example, for a network with 100 million weights training
under Adam, there are 400MB worth of weights and another 800MB of additional state
variables. Such a large footprint for variables doesn’t typically fit into any on-chip SRAM
available today, which illustrates the efficacy of staging them from the HBM, performing
optimizer calculations on a vector processing unit (VPU), and moving the state variables back to
HBM for storage.
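The footprint figures follow directly from the parameter count. A minimal sketch, assuming 4 bytes per element (as in the bandwidth estimate) and Adam's two extra state vectors per weight; `footprint_mb` is a hypothetical helper name.

```python
def footprint_mb(num_weights, extra_state_vectors, bytes_per_element=4):
    """Return (weights_MB, extra_state_MB), using 1 MB = 1e6 bytes."""
    weights = num_weights * bytes_per_element / 1e6
    state = extra_state_vectors * weights
    return weights, state

print(footprint_mb(100_000_000, 2))   # (400.0, 800.0)
```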
Weight update is a central operation in neural network training. Modern training
optimizers (Momentum, Adam, etc.) multiply the weight state by a factor of 2-3 and require a
handful of low-arithmetic-intensity operations per weight in each training step. By moving those
updates to a lightweight processor that is nearer to or in the main memory (typically HBM) of a
training accelerator, the requirement for memory bandwidth, a critical performance bottleneck in
neural network training, is reduced substantially. For Adam, a conventional weight update needs
three reads and three writes for each weight in the model. A calculation facility in the HBM logic
die, as described herein, replaces these with a single transfer of gradients to the HBM, for a 6X
reduction in weight-update bandwidth. The bandwidth reduction helps the weight update part of
the training time step; forward and backward propagation still each read the weights once.

The PIM neural network optimizers described herein have a wide range of applications,
e.g., image/video classification and segmentation, suggestion engines, natural language
understanding and translation, etc.
CONCLUSION
Leveraging the vectorizability of deep-learning weight-updates, this disclosure describes
processing-in-memory (PIM) techniques for weight-updates in a large class of deep learning
networks. Rather than importing the state of the deep-learning optimizers to the computational
die, the techniques send gradients to a die of a high-bandwidth memory (HBM) stack and
perform the modest number of optimizer updates in compute units located in the die. Since reads
and writes are done inside the HBM stack, the techniques can substantially reduce the CPU-HBM
bandwidth requirements. Weight-related memory traffic, dominant for multilayer
perceptrons and transformers, is also reduced.
REFERENCES
[1] Choi, Dami, Christopher J. Shallue, Zachary Nado, Jaehoon Lee, Chris J. Maddison, and
George E. Dahl. "On empirical comparisons of optimizers for deep learning." arXiv preprint
arXiv:1910.05446 (2019).
[2] https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average accessed Jun.
17, 2021.
[3] https://pcper.com/2018/12/jedec-updates-hbm-standard-with-24gb-capacity-and-faster-speed/
accessed Jun. 17, 2021.
[4] https://ruder.io/optimizing-gradient-descent/index.html#nadam accessed Jun. 21, 2021.
[5] The Berkeley IRAM Project http://iram.cs.berkeley.edu/

[6] Wang, Ke, Kevin Angstadt, Chunkun Bo, Nathan Brunelle, Elaheh Sadredini, Tommy Tracy,
Jack Wadden, Mircea Stan, and Kevin Skadron. "An overview of micron's automata processor."
In Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software
Codesign and System Synthesis, pp. 1-3. 2016.
[7] Kwon, Youngeun, Yunjae Lee, and Minsoo Rhu. "TensorDIMM: A practical near-memory
processing architecture for embeddings and tensor operations in deep learning." In Proceedings
of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 740-753.
2019.


