An Inference and Learning Engine for Spiking Neural Networks in Computational RAM (CRAM)

HÜSREV CILASUN, SALONIK RESCH, ZAMSHED IQBAL CHOWDHURY, ERIN OLSON, MASOUD ZABIHI, ZHENGYANG ZHAO, THOMAS PETERSON, KESHAB PARHI, JIAN-PING WANG, SACHIN S. SAPATNEKAR, and ULYA KARPUZCU, University of Minnesota
Spiking Neural Networks (SNN) represent a biologically inspired computation model capable of emulating neural computation in the human brain and brain-like structures. Their main promise is very low energy consumption. Unfortunately, classic von Neumann architecture based SNN accelerators often fail to address demanding computation and data transfer requirements efficiently at scale. In this work, we propose a promising alternative, an in-memory SNN accelerator based on Spintronic Computational RAM (CRAM), to overcome scalability limitations, which can reduce the energy consumption by up to 164.1× when compared to a representative ASIC solution.
1 INTRODUCTION
Spiking Neural Networks (SNN) are neural networks mimicking the impulse-based neural transmission in the brain. Recently, as a computational model, SNNs have emerged as biologically more realistic, yet tractable alternatives to artificial neural networks. Although there are topological similarities to a fully-connected neural network, the main difference lies in the spike event, an electrical discharge triggered by a series of chemical reactions. Spike trains coming from presynaptic neurons are processed in the postsynaptic neuron, which enables computational tasks such as classification to be performed in an SNN.
As a biologically inspired computational model capable of emulating neural computation in the human brain and brain-like structures [12, 23], the inherent SNN architecture poses as an ideal computing medium for event-driven processing [24]. As a result, SNNs can perform a variety of computational tasks with significantly lower power consumption, such as prosthetic brain-machine interface control [8] and speech recognition [28]. Recent efforts [14] focus on flexible and efficient hardware emulation of different biologically accurate SNN models, while others exploit the computational power resulting from the high number of relatively simple neurons in SNNs. Low-energy hardware SNN accelerators such as ODIN [11] try to minimize the energy per spiking operation. SNN hardware solutions include IBM’s TrueNorth [17], Intel’s Loihi [6], University of Manchester’s SpiNNaker [12], and the Human Brain Project’s BrainScaleS [20]. The highest number of spiking neurons in a hardware SNN implementation to date is 1 billion real-time neurons in SpiNNaker [13], only a fraction of the average count of 86 billion neurons in the human brain [15].
Large-scale SNNs needed to perform useful computational tasks inevitably come with increased data access and parallelism demands which often exceed the capabilities of traditional hardware. When the neuron count increases significantly, more space is required to store synaptic parameters. Data access and retrieval become a burden at the same time, mainly because any spiking information emerging in a neuron should be accessible to all other neurons as an input. Various solutions based on traditional CMOS as well as emerging spintronic and resistive technologies have been proposed to overcome this bottleneck [18]. For example, recent Magnetic Tunneling Junction (MTJ) based spintronic SNN accelerators [2, 21, 27] multiply weights by spike trains using MTJs in a crossbar setting. These and similar designs typically only address a single
stage of SNN computation, i.e., the multiplication of weights by spike trains. Flexibility is also a concern, as the SNN model as well as the parameter resolution is hardwired. Spintronic (MTJ based) devices have also been proposed as analog SNN building blocks [31]; however, analog implementations by construction suffer from process variability constraints [25].
Putting it all together, large scale SNN computations are massively parallel and induce high-intensity memory accesses for data retrieval. As a result, the energy consumption skyrockets at scale. This is why true in-memory computing substrates such as the recently proposed spintronic Computational RAM (CRAM) [26], which enable massively parallel and reconfigurable logic operations in situ without compromising energy efficiency, represent especially promising platforms for SNN hardware, and they form the focus of this paper. We start our discussion with background information covering the basics of CRAM and the SNN model in Section 2. Section 3 details the proposed SNN architecture based on CRAM. After the quantitative evaluation in Section 4, we conclude the paper in Section 5.
2 BACKGROUND
After covering CRAM basics, we continue with the feedforward Leaky Integrate-and-Fire SNN model in Section 2.2. Although the feedforward model alone suffices for many computational tasks, applications such as brain simulation need the parameters to be learned online. To this end, we also include the pairwise Spike-Time Dependent Plasticity (STDP) learning algorithm for updating parameters during computation, as need be.
2.1 CRAM Basics
Essentially, CRAM [26] augments conventional spintronic memory arrays with compute capability while retaining seamless memory access. As long as there is no computation, CRAM reduces to an ordinary memory. When computation is enabled, CRAM performs logic operations directly inside the memory array. Spin Torque Transfer (STT) and Spin-Hall Effect (SHE), or Spin-Orbit Torque (SOT), based CRAM variants exist [4, 19, 29, 30]. CRAM can perform one logic operation (Boolean gate) in a column at a time, but all (or a desired subset of) columns can perform the same operation in parallel. Each cell in a column can serve as an input or output of a Boolean gate. Logic operations are reconfigurable in time and space. At the same time, multiple CRAM arrays can perform the same computation (across all of their columns) in parallel.
CRAM’s main storage element is a Magnetic Tunneling Junction (MTJ), which comprises a fixed-polarity magnetic layer, a variable-polarity magnetic layer, and an insulating layer in between. When the polarities of the magnetic layers (mis)match, the MTJ is in the (Anti)Parallel, or (A)P, state, which corresponds to logic (1) 0 and exhibits a (high) low resistance.
Cell structures for STT- and SHE-CRAM are provided in Figures 1(a) and 1(b), respectively. STT-CRAM features even and odd bitlines (BLE/O), which are used to sense (change) the state of the MTJ in read (write) mode. For logic operations, BLE/O are used to determine which MTJ serves as an input or output. Logic lines (LL) connect input and output cells to perform Boolean operations. STT-CRAM also has wordlines (WL), which are used to select the rows for both memory and logic operations.
Specifically, in order to perform a logic gate in Figure 1(a), the output MTJ is first preset to a known logic value depending on the type of the gate. If the inputs reside in even rows, the output should be in an odd row, and vice versa. By imposing a specific voltage difference between BLE and BLO, a current is induced through (a parallel connection of) the inputs, in series with the output. The magnitude and direction of this current depend on the voltage difference between BLE and BLO, which in turn is a function of the type of gate to be performed. For a given voltage difference, the current through the output evolves as a function of the parallel equivalent resistance of the inputs, and may or may not be enough to switch the output state. This is the main principle behind how CRAM implements truth tables.
Fig. 1. (a) 1T1M STT-CRAM; (b) SHE-CRAM; (c) equivalent resistive network for a NAND gate; (d) NAND truth table: inputs (Rin0, Rin1) of (P(0), P(0)), (AP(1), P(0)), and (P(0), AP(1)) yield Rout = AP(1), while (AP(1), AP(1)) yields Rout = P(0); (e) 3-step full adder implementation. MAJ3 and MAJ5 are 3-input and 5-input majority gates. INV1-2 is an inverter which writes the output to two different cells simultaneously.
If the combined current is high enough to change the state of the output cell(s), the state changes, effectively implementing a logic gate. The equivalent resistive network for the universal NAND gate is given in Figure 1(c), along with the truth table in Figure 1(d). CRAM supports a rich set of (universal) gates beyond NAND, each characterized by specific bitline voltage and output preset values. More complex logic functions such as full adders can be synthesized as a sequence of basic gate operations, as shown in Figure 1(e) [3].
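To make the switching mechanism concrete, here is a minimal behavioral sketch in Python of a single-column CRAM gate: the inputs sit in parallel, in series with the preset output, and the output flips when the induced current exceeds the switching threshold. The resistance, threshold, and bias voltage values are illustrative placeholders, not device-calibrated figures.

```python
# Behavioral sketch of a single-column CRAM gate (illustrative values).
R_P, R_AP = 3.15e3, 7.34e3   # parallel / antiparallel MTJ resistance (ohms)
I_SWITCH = 40e-6             # current assumed sufficient to flip the output

def cram_gate(inputs, preset, v_bias):
    """inputs: logic values (0 = P, 1 = AP) stored in the input MTJs;
    preset: logic value the output MTJ is preset to before the operation;
    v_bias: voltage imposed between BLE and BLO (gate-type dependent)."""
    r_in = [R_AP if b else R_P for b in inputs]
    r_parallel = 1.0 / sum(1.0 / r for r in r_in)  # inputs in parallel
    r_out = R_AP if preset else R_P                # preset output state
    i_out = v_bias / (r_parallel + r_out)          # series with the output
    # Enough current flips the preset output; otherwise it is retained.
    return 1 - preset if i_out >= I_SWITCH else preset

# NAND: preset the output to 0 and pick v_bias so that only the (1, 1)
# input combination (both MTJs antiparallel) fails to switch the output.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", cram_gate([a, b], preset=0, v_bias=0.24))
```

With these placeholder values, the four input combinations reproduce the truth table of Figure 1(d).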
SHE-CRAM, shown in Figure 1(b), on the other hand, features a four-terminal cell element which enables faster memory and logic operations while consuming significantly less energy. This is because SHE-CRAM separates the read and write paths, making independent optimization of each possible (they otherwise impose conflicting constraints). Accordingly, SHE-CRAM has separate read (WLR) and write (WLW) wordlines, and two access transistors per cell. SHE-CRAM hence has a higher area footprint than STT-CRAM, which is compensated by lower energy and faster operation. Memory and logic operations otherwise follow the same principles.
To ease illustration, in the following, we will use transposed CRAM arrays, and abstract out the interleaved placement
where inputs (outputs) reside in even (odd) rows or vice versa.
2.2 Leaky Integrate-and-Fire SNN Model
Although there exist more complex and biologically accurate models for SNNs, the Leaky Integrate-and-Fire model, as described in [6], is widely used due to its relative simplicity. The general model assumes that each neuron is connected to presynaptic neurons which broadcast their own spike trains. Each connection from neuron $j$ to neuron $i$ is characterized by a weight $\omega_{ij}$ and a delay value $d_{ij}$ corresponding to the actual transmission delay in the brain for more accurate modeling. Each spike train (or spike function) can be expressed as $\sigma(t) = \sum_k \delta(t - t_k)$, where $\delta$ is the unit impulse function; $t$, the discrete time variable; and $t_k$, the time of the $k$th spike. Table 1 provides the definitions of all model parameters.
The weighted sum of filtered presynaptic spike trains plus a bias is called the synaptic response current, as depicted in Equation (1), where $u_i(t)$ represents the synaptic response current of neuron $i$:

$$u_i(t) = \sum_{j \neq i} \omega_{ij}\,(\alpha_u * \sigma_j)(t) + b_i \qquad (1)$$

Here $\alpha_u(t) = \frac{1}{\tau_u} e^{-t/\tau_u} H(t)$, where $H(t)$ is the unit step function; $\tau_u$, a time constant; and $b_i$, the bias.
Table 1. Parameter Definitions

$t$: discretized time variable
$\sigma_j(t)$: spike train at the output of neuron $j$
$\tau_u$: neural time constant
$\tau_v$: synaptic time constant
$\theta_i$: spiking threshold constant of neuron $i$
$\omega_{ij}$: weight variable between neurons $i$ and $j$
$d_{ij}$: delay variable between neurons $i$ and $j$
$b_i$: bias variable of neuron $i$
$u_i(t)$: synaptic response current of neuron $i$
$v_i(t)$: membrane potential of neuron $i$
$A_\pm$: time constant
$t_{pre/post}$: presynaptic/postsynaptic spike time
$\alpha_u(t)$: $\frac{1}{\tau_u} e^{-t/\tau_u} H(t)$
$F(t)$: $e^{-t/\tau_u} H(t)$
$L_f$: number of entries in the lookup table of $\alpha_u(t)$
$S$: weight bit size
$t_{max}$: maximum time difference for STDP
$r(t)$: pseudorandom noise
The voltage difference between the inside and the outside of the neuron is called the membrane potential. The update function for the membrane potential, where $v_i(t)$ captures the membrane potential of neuron $i$, is

$$v_i(t) = -\frac{1}{\tau_v} v_i(t) + u_i(t) - \theta_i \sigma_i(t) \qquad (2)$$

Here $\tau_v$ is a time constant; $\theta_i$, the threshold value for neuron $i$. If the neuron spikes, $v_i$ is reset to zero.
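As a concrete reference, the following is a minimal discrete-time sketch of Equations (1) and (2) for a single neuron, written in plain Python/NumPy rather than the bit-level CRAM mapping of Section 3; all parameter values are illustrative, and the optional noise argument stands in for the r(t) term introduced below.

```python
import numpy as np

Lf, tau_u, tau_v = 32, 8.0, 16.0                          # illustrative constants
alpha_u = (1.0 / tau_u) * np.exp(-np.arange(Lf) / tau_u)  # alpha_u(l) table

def lif_step(spike_hist, w, b, v, theta, r=0.0):
    """spike_hist: (j, Lf) binary array; column 0 holds the newest spikes."""
    filtered = spike_hist @ alpha_u    # Eq. (1): (alpha_u * sigma_j)(t)
    u = w @ filtered + b + r           # synaptic response current u_i(t)
    v = v - v / tau_v + u              # Eq. (2): leak, then integrate
    spike = int(v >= theta)            # threshold comparison
    v = 0.0 if spike else v            # reset the potential on a spike
    return v, spike
```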
Fig. 2. Data layout (transposed).
Figure 3 depicts an overview of the model. Putting it all together, each neuron calculates its output spike train from the presynaptic spike trains. Variations of this basic algorithm include the Random Sampling algorithm from Loihi [6], which our design is based on, where the synaptic response current and membrane potential are optionally incremented by a pseudorandom number; in the following discussion we encapsulate this in a noise term r(t).
2.3 Pairwise STDP based Parameter Learning
Learning entails adjusting the synaptic weights (ωij) and delays (dij) to dynamically optimize the neural network for solving a specific problem. It models the actual synaptic changes through which the brain adapts to new constraints.
The STDP model features a simple weight learning rule, as described in [5, 6], which can be summarized as follows:
$$\Delta\omega_{i,j} = \begin{cases} A_- F(t - t_i^{post}), & \text{on presynaptic spike} \\ A_+ F(t - t_j^{pre}), & \text{on postsynaptic spike} \end{cases} \qquad (3)$$

where $F(t) = e^{-t/\tau} H(t)$ and $\tau$, $A_-$, and $A_+$ are constants, as defined in Table 1. $\Delta\omega_{i,j}$ is added to the weight in each update.
Figure 8 provides a block diagram description of the basic STDP algorithm. We define $\Delta t_{pre(post)}$ as $t - t^{pre(post)}_{j(i)}$. The algorithm continuously checks whether the current neuron or a presynaptic neuron spikes. If so, it resets the corresponding $\Delta t_{pre(post)}$ counter by multiplying its former value by zero. Otherwise, it checks for overflow and increments $\Delta t_{pre(post)}$, which is then fed to the $F(\cdot)$ function and multiplied by $A_{+(-)}$.
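A functional sketch of this update and counter logic, with illustrative constants (including the sign convention for A−), might look as follows; the saturating counters mirror the overflow check described above.

```python
import math

tau, A_plus, A_minus, t_max = 20.0, 0.01, -0.012, 255   # illustrative values

def F(dt):
    return math.exp(-dt / tau)        # F(t) = e^(-t/tau) for t >= 0

def stdp_step(w, dt_pre, dt_post, pre_spike, post_spike):
    """One Eq. (3) update for a single synapse from neuron j to neuron i."""
    # Reset a counter when its neuron spikes; otherwise saturate at t_max.
    dt_pre = 0 if pre_spike else min(dt_pre + 1, t_max)
    dt_post = 0 if post_spike else min(dt_post + 1, t_max)
    dw = 0.0
    if pre_spike:                     # presynaptic spike arrived
        dw += A_minus * F(dt_post)
    if post_spike:                    # the neuron itself spiked
        dw += A_plus * F(dt_pre)
    return w + dw, dt_pre, dt_post
```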
Fig. 3. Leaky Integrate-and-Fire model [6].
3 SNN IN CRAM
3.1 Leaky Integrate-and-Fire Model in Memory
CRAM supports universal Boolean gates, hence it can handle any type of computation, including the Leaky Integrate-and-Fire model described in Section 2.2. The mapping is straightforward, as shown in Figure 2, where each neuron is processed inside a single CRAM array to exploit array-level parallelism.
Table 2. CRAM array utilization by constants, parameters, and lookup table for different S and Lf values.

S | Lf | Array Size | Utilization
1–4 | 32 | 256 × 256 | 14.06%–56.25%
5–8 | 32 | 512 × 512 | 35.15%–56.25%
9–16 | 32 | 1024 × 1024 | 31.64%–56.25%
1–2 | 64 | 256 × 256 | 26.55%–53.13%
3–4 | 64 | 512 × 512 | 39.84%–53.13%
5–8 | 64 | 1024 × 1024 | 33.2%–53.13%
9–16 | 64 | 2048 × 2048 | 29.88%–49.8%
3.1.1 Initialization. Our design incorporates several lookup tables (LUT) and dedicated storage for parameters and constants,
which are initialized before feedforward computations start. The initialization is a one-time procedure, as the corresponding
values do not change during computation. We also initialize an S-bit local delay value for each synapse, which serves to model synaptic delays between neurons.
Precomputed αu values reside in a lookup table, which consists of Lf different S-bit values (Table 1). Similarly, the constant 1/τv is initialized once, as are the weights ωij and the bi and θi values. For the αu lookup table, Table 2 shows the utilization of different CRAM array sizes for various S and Lf. We choose the array size so that the utilization does not exceed approximately half of the array, leaving enough space to perform the arithmetic/logic operations in the remaining CRAM cells.
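As an illustration, such a lookup table could be populated as below; the fixed-point scaling to S bits is an assumption of the sketch, since the exact quantization scheme is not spelled out above.

```python
import numpy as np

S, Lf, tau_u = 8, 32, 8.0                                 # illustrative values
alpha_u = (1.0 / tau_u) * np.exp(-np.arange(Lf) / tau_u)  # alpha_u(l) samples
scale = (2**S - 1) / alpha_u.max()                        # map peak to full scale
alpha_lut = np.round(alpha_u * scale).astype(int)         # Lf entries, S bits each
```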
3.1.2 Data Flow. Once a spike train of the presynaptic neurons is received, it is written into memory as shown in the first column of Step 1. These spikes, along with all the previously received (and similarly placed) spikes, are multiplied by the αu(s) values, where s ∈ [0, Lf − 1] and Lf is the predetermined filter length. This filtering operation effectively corresponds to the convolution (∗) in Equation (1). Since the spikes are binary, this operation reduces to S × Lf AND operations, where S is the bit length of each entry in the αu(s) lookup table, which keeps precomputed values of the αu(s) function. Once the ANDs are computed, the resulting Lf entries are added together and rounded (Section 3.1.3) to S bits to obtain the column denoted by 2 in Figure 2. To prepare the spike train for the next spike computation cycle, the spike train is next shifted by reading the columns obtained in Step 1 and writing them to another row. Then, the resulting S-bit values are multiplied by the S-bit weights, which corresponds to S² full adder operations and results in the column marked by Step 3. The 2S-bit multiplication outcome is next rounded to S bits, i.e., an (S − 1)-bit rounding factor is added to it and the result is truncated to S bits. Addition is again performed as a cascade of full adder operations. Truncation has practically no overhead, as it translates into simply ignoring the unused bits. The operations covered so far, spanning Step 1 to Step 3, implement Equation (1).
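Functionally, Steps 1 through 3 amount to the following sketch; the shapes, random contents, and rounding details are illustrative stand-ins for the AND-based filtering and full-adder-based multiplication performed inside the array.

```python
import numpy as np

S, Lf, j = 8, 32, 4                              # illustrative sizes
rng = np.random.default_rng(0)
spikes = rng.integers(0, 2, size=(j, Lf))        # sigma_j(t - l), binary
alpha_lut = rng.integers(1, 2**S, size=Lf)       # S-bit alpha_u(l) entries
w = rng.integers(1, 2**S, size=j)                # S-bit weights

# Steps 1-2: binary spikes turn each multiply into an S-bit AND; the Lf
# products are then summed and rounded back down to S bits.
sums = (spikes * alpha_lut).sum(axis=1)          # per-neuron filter output
shift = int(np.log2(Lf))
filtered = np.minimum((sums + (1 << (shift - 1))) >> shift, 2**S - 1)

# Step 3: S x S multiply (a cascade of full adders in CRAM), then round
# the 2S-bit product to S bits with an (S - 1)-bit rounding factor.
prod = w * filtered
step3 = np.minimum((prod + ((1 << (S - 1)) - 1)) >> S, 2**S - 1)
```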
In the following step, all rows from Step 3 are added by reading half of the rows and writing them back to the adjacent rows. After each addition, a rounding operation is performed. This takes log₂ j stages, where j is the predetermined maximum number of presynaptic neurons. Once the addition is done, the result is reduced to a single row as shown in Step 4. After the bias and the noise are added to the outcome of Step 4, the synaptic response current shown in Step 5 is obtained. We keep the old membrane potential in the same row; it is next scaled by multiplication with τv⁻¹. The previous spike train is multiplied by the threshold value θ, which reduces to an AND operation over S bits. Equation (2) spans Step 4 to Step 6. The current membrane potential is obtained after the current synaptic response current, the scaled old membrane potential, the noise, and the scaled old spike train are added, as shown in Step 6. The current membrane potential then overwrites the old membrane potential. Finally, the current spike value is calculated by thresholding the current membrane potential in Step 6, which corresponds to an S-bit comparison. The current spike value is written as shown in Step 7 and copied to the old spike train column. To reset the membrane potential, we invert the spike bit and AND it with the membrane potential. The output spike is then read and broadcast as we explain in Section 3.2.
Before starting the computation, the S-bit local delay value in each column is incremented and compared to di,j (for the corresponding synaptic connection). If the comparison yields true, the local delay value is reset to zero. The comparison output is then read by the array controller and used as a column enable. This yields a practical synaptic delay implementation as well as energy savings, since the disabled columns do not participate in the computations described above for Equations (1) and (2).
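A functional sketch of this column-enable mechanism, assuming one saturating counter per column (the helper name and return convention are hypothetical):

```python
def delay_gate(counter, d_ij):
    """Increment the S-bit local delay counter; enable the column only
    when it reaches d_ij, then reset it. A disabled column skips the
    Eq. (1)-(2) computations for this time step entirely."""
    counter += 1
    if counter == d_ij:
        return 0, True     # reset, and enable the column this step
    return counter, False  # column stays disabled this step
```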
3.1.3 Low Level Operations. For addition operations, we use the full adder from Figure 1. Multiplication also uses the same full adder design, N² times for N-bit numbers. In the convolution with αu(t), the multiplications reduce to AND operations since the spikes are binary. Comparison with the threshold value is implemented as a cascade of 6N − 3 NAND gates for N-bit numbers. We increase the bit size conservatively until Step 3, and then perform a rounding operation by adding rounding factors in each arithmetic operation, which ensures that no overflow happens. Rounding to S bits entails adding an (S − 1)-bit rounding factor and then keeping only the S most significant bits.
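For instance, rounding a 2S-bit product back to S bits can be written as below; the saturation clamp at full scale is an assumption of the sketch.

```python
def round_to_s(x, s):
    """Round a 2s-bit value to its s most significant bits: add an
    (s - 1)-bit rounding factor, then drop the low s bits (truncation
    is free in CRAM, since the unused bits are simply ignored)."""
    factor = (1 << (s - 1)) - 1                 # (s - 1)-bit rounding factor
    return min((x + factor) >> s, (1 << s) - 1)

print(round_to_s(200 * 150, 8))                 # 30000 -> 117
```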
Fig. 4. Linear Feedback Shift Register (LFSR) based pseudorandom number generation in CRAM using feedback polynomial x⁹ + x⁵ + 1. Input (output) cells are highlighted with green (red).
Loihi [16] features a Pseudorandom Noise Generator (PRNG) to support several variants of the basic algorithm, such as neural sampling. As a proof of concept, we demonstrate how to implement a PRNG in CRAM using a basic linear feedback shift register in Figure 4. First, an XOR gate is applied as a cascade of four NAND gates, and then as many COPY operations are performed as the length of the feedback polynomial. For the example 9-bit polynomial, this operation takes 13 cycles. We denote the generated noise by r(t) and update it each time it is used.
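In software terms, the LFSR of Figure 4 behaves as in the sketch below: the new bit is the XOR of taps 9 and 5 (realized in CRAM as the four-NAND cascade), followed by the shifting COPY chain.

```python
def nand(a, b):
    return 1 - (a & b)

def lfsr_step(state):
    """state: 9 bits for x^9 + x^5 + 1; state[0] = b1, ..., state[8] = b9."""
    b5, b9 = state[4], state[8]
    a0 = nand(b5, b9)            # four-NAND XOR cascade:
    a1 = nand(b5, a0)            #   t = NAND(NAND(b5, a0), NAND(b9, a0))
    a2 = nand(b9, a0)
    t = nand(a1, a2)             # t = b5 XOR b9
    return [t] + state[:-1]      # COPY chain shifts every bit toward b9

state = [1, 0, 0, 1, 0, 1, 1, 0, 1]   # any nonzero seed
for _ in range(3):
    state = lfsr_step(state)
```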
Our lookup table based implementation of the αu(t) function, and the inevitably limited bit length of parameters such as weights, by construction affect synaptic accuracy. Section 4 provides a quantitative characterization of the accuracy impact of precision.
3.2 Routing and Connectivity
As one CRAM array is allocated to process each neuron, the basic connectivity between neurons directly translates into connections between arrays to transfer spikes. Ideally, each neuron in an SNN would be connected to all of the other neurons. However, such a fully connected scheme requires $\binom{N}{2}$ connections between neurons, which is not feasible in hardware when the number of neurons is very high (i.e., on the order of billions), due to the limited area. To eliminate the connectivity burden, a possible solution is using a 2-dimensional network as implemented in [6, 12, 17]. In [12], the 2-dimensional topology is further extended to a 3-dimensional torus alongside additional diagonal connections and emergency routes. Such implementations require each spike event to be transmitted as a packet in the mesh. This and similar solutions with more involved networks are subject to congestion in the packet traffic, eventually leading to packet drops.
Instead, we propose to use the Generalized De Bruijn Graph (GDBG) topology [7] to connect CRAM arrays. GDBG has been used in High Performance Computing and has been shown to be near-optimal for load-balanced networks [10].
Fig. 5. Generalized De Bruijn Graph for an example 16-array network; self connections are not shown.
The first advantage of GDBG is that each vertex has at most four edges, which is comparable to a 2-dimensional mesh. However, the connected vertices are not necessarily in each other's neighborhood, which is defined in a lexicographical ordering sense. The second advantage is that the shortest path between any two vertices is at most log₂ N for a graph with N vertices.
Since the connection pattern in GDBG is fixed, we can think of it as the same repeated pattern allowing each neuron to be connected to all other neurons in log₂ N steps. Figure 5 depicts an example connection scheme for 16 neurons (labeled 1-16) in the upper-left. When the connection scheme is expanded to log₂ 16 = 4 time steps, we obtain the FFT-like diagram in the upper-right. Indeed, this connectivity scheme is similar to Singleton's FFT [22], which has a repetitive fixed pattern for each FFT stage, i.e., a set of operations with no internal data dependency. The upper-left portion of Figure 5 also shows the resulting hierarchical architecture, where neurons are grouped in subsets of 4, as indicated with colored borders. The bottom portion of Figure 5 visualizes such an implementation, where each edge corresponds to a set of wires between neurons. The connections between 4-neuron subsets appear on the bottom-left; each neuron and its connections within the subset, on the bottom-right. Applying such a quadtree decomposition, it is possible to map any N-neuron network in log₄ N steps.
In our design, each CRAM array, which corresponds to a neuron, is connected to the others in the GDBG topology. Each connection consists of j wires, where j equals the maximum number of presynaptic neurons. Once the computation is done, the routing operations are initiated. The routing consists of log₂ N stages. If the current stage c ∈ [1, log₂ N] satisfies c ≤ log₂ j, where j is the maximum number of presynaptic neurons, then each array (i.e., neuron) concatenates the input spike trains and transmits them to the connected arrays for the next stage. This operation involves only reading the incoming spike trains and writing them to predetermined locations; therefore no extra logic is involved. However, if c > log₂ j, each array combines the input spike trains using stored address values. Each array therefore reads a log₂ j-bit address value for each spike and copies the spike train to the corresponding addresses. Hence, upon initialization, each array has to store (log₂ j)(log₂ N − log₂ j) bits of address values. After routing is complete, the next spike computation starts, as discussed in Section 3.1.
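The following sketch illustrates the connectivity and the log₂ N-hop property for the power-of-two case, using the standard binary De Bruijn successor rule (the generalized form in [7, 10] also covers other N); the helper names are ours.

```python
def gdbg_out_neighbors(v, n):
    """Out-neighbors of vertex v in a binary De Bruijn graph on n vertices."""
    return [(2 * v) % n, (2 * v + 1) % n]

def route(src, dst, n):
    """Shift-register routing: feed dst's bits MSB-first; for n = 2^m the
    destination is always reached in exactly m = log2(n) hops."""
    m = n.bit_length() - 1
    v, path = src, [src]
    for k in range(m - 1, -1, -1):
        v = (2 * v + ((dst >> k) & 1)) % n   # shift in the next bit of dst
        path.append(v)
    return path

print(route(3, 11, 16))   # e.g. [3, 7, 14, 13, 11]: 4 = log2(16) hops
```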
Figure 6 demonstrates the 4 routing stages of an example 16-neuron architecture, i.e., j = 8 and N = 16. In Stage 1 (c = 1), the two spikes (highlighted in green and blue) coming from two neurons are combined and transmitted.
Fig. 6. Routing stages for an example 16-neuron (-array) architecture with maximum 8 presynaptic spikes.
In Stage 2 (c = 2), the number of incoming spikes doubles and the combined spikes are forwarded to Stage 3. When c = 3, each incoming spike train has a size of four, and they are forwarded to the next stage. In Stage 4, however, c = 4 is greater than log₂ j = 3, so half of the incoming spike trains should be discarded and the remaining spikes forwarded. For this purpose, we use a bit indicator for each stage, as shown in Figure 6. The bit indicators select one of the input spike trains (highlighted in green and blue). The selected input spike train is read and then written to the ordered output at the corresponding address specified in the current stage's reordering (bit)mask. For this example, there is only one set of reordering masks and bit indicators; in general there can be log₂ N − log₂ j different sets of (log₂ j)-bit masks and bit indicator sets.
Our GDBG based design is efficient since we need 2N connections for N neurons instead of $\binom{N}{2}$; we avoid the traffic handling required in NoC based architectures; and we can synchronize the whole system in deterministic time, as each routing cycle entails a fixed number of routing steps, log₂ N for N neurons.
Fig. 7. Data layout for STDP engine in CRAM (transposed).
Fig. 8. Basic STDP diagram.
3.3 STDP Learning Engine
3.3.1 Initialization. Initialization steps for the STDP engine are similar to the feedforward case (Section 3.1.1). Note that F(·) is only a scaled version of αu(t) (Table 1). Therefore we re-use the lookup table for αu and scale any value fetched from it by multiplying it with τu. We also initialize all parameters and constants from Equation (3), such as A±.
3.3.2 Data flow. In the STDP engine, every time spike distribution is completed, the presynaptic (postsynaptic) part of the operations is executed, as shown in the left (right) half of Figure 8, where the inputs are the spikes and the output is the weight. This effectively implements Equation (3). Note that the data dependency on presynaptic and postsynaptic spikes does not induce additional memory accesses, since both are already stored in the same array in our mapping from Section 3.1. In the transposed layout, 1 corresponds to the presynaptic spikes. 2 is the incremented time difference since the arrival of the latest presynaptic spike, and 3 resets ∆tpre if the input neuron spikes. Then 5 is obtained by a lookup table implementation of the F(·) function, which is fed by 4. These steps are similar for 7, 8, 9, and 10, except that these calculations are performed in only one row and the output is copied to all rows, as shown in the layout in Figure 7. Finally, ∆ω is added to ω as shown in 11 and 12.
3.3.3 Low level operations. Addition and multiplication operations in the STDP engine are similar to the feedforward case, as discussed in Section 3.1.3. Comparators are implemented using a cascade of AND gates.
4 EVALUATION
In order to evaluate our design, we perform energy and performance analysis based on low-level elementary operations, such as Boolean gate implementations, in the context of the proposed CRAM-based SNN architecture. Configuration parameters for the simulations are given in Table 3. For the experiments, we have four different configurations, where STT-M and SHE-M correspond to current conservative estimates, while STT-F and SHE-F reflect near-term future expectations for STT and SHE based MTJs. For array characteristics such as read/write timing, we use NVSim [9]. Each array has an array controller which is responsible for driving logic lines, bitlines, and wordlines; we take the overhead of driving these lines into consideration. For synaptic events, we assume a conservative setup where each presynaptic neuron spikes in each time step and each neuron is connected to the maximum possible number of neurons. Since we base our neuron model on Loihi's implementation, we compare our results with the energy and time figures reported in [6] and [16], using L = 1024 and F = 1024 for 1-bit synaptic weights (in their design, L is the neuron count per core; F, the postsynaptic fanout).
Table 3. Configuration Parameters

Parameter | STT-M, SHE-M | STT-F, SHE-F
P state resistance | 3.15 kΩ | 7.34 kΩ
AP state resistance | 7.34 kΩ | 76.39 kΩ
Switching Time | 3 ns | 1 ns
Switching Current | 40 µA | 3 µA
Bulk preset current limit | 30 mA | 30 mA
Fig. 9. (a) RMSE vs. bit length for an example 10-neuron CRAM-SNN compared to 9-bit Loihi. (b) The spike-time diagram.
For accuracy analysis, we use the Loihi library of the Nengo framework [1] with the Loihi configuration. Modifying Nengo Loihi's implementation, we analyze the spiking accuracy of our lookup table based, limited-precision design. This analysis is only used for the Leaky Integrate-and-Fire feedforward implementation. As shown in Figure 9, the RMSE is comparable to Loihi if at least 9 bits are allocated for each lookup table entry. Although the accuracy loss is significant at lower bit lengths, Figure 9(a) shows that a bit length similar to the Loihi implementation provides similar accuracy to Loihi. Figure 9(b) shows the 10-neuron spike-time diagram used for the analysis.
Table 4 summarizes the results for 1 billion neurons, where the maximum number of presynaptic neurons is 1024, the bit length is 1, and the filter (lookup table) size is 64. Each CRAM array has 1024 × 512 cells.
Although all CRAM configurations provide enough performance to implement SNN operations within a biologically plausible time budget, i.e., a spiking rate of several kHz, the SHE-F configuration provides the lowest energy consumption and the fastest operation.
Fig. 10. Sensitivity to number of neurons, weight bit length and αu table size.
Table 4. Evaluation Results

Metric | STT-M | STT-F | SHE-M | SHE-F
Execution Time (µs) | 6.342 | 0.825 | 6.342 | 0.825
Maximum Spiking Rate (kHz) | 157.7 | 1212.3 | 157.7 | 1212.3
Energy (J) | 54.51 | 0.989 | 20.61 | 0.234
EDP (µJs) | 345.72 | 0.816 | 130.74 | 0.193
Fig. 11. Overall energy breakdown of operations in a single instance of spiking events.
Figure 11 shows the energy breakdown for a single spike operation with 1-bit weights. We observe that preset and addition operations dominate the overall energy, followed by multiplication (i.e., AND) operations, pointing to further hardware optimization opportunities.
Figure 10 shows execution time, maximum spiking rate, energy, and energy-delay product (EDP) for the case where the maximum number of presynaptic neurons is 1024 and the bit length is 8. Figures 10(a) and 10(c) are complementary figures for the -M and -F variants compared to the Loihi model under similar parameters. Although Loihi is faster at larger neuron counts because of its faster computation, the CRAM variants surpass Loihi as the neuron count approaches a hundred billion, thanks to our routing architecture. Figures 10(b) and 10(d) capture energy and EDP sensitivity to neuron count for several Loihi models at different voltages. For very large neuron counts, the best performing CRAM variants have lower energy
consumption than Loihi, as well as higher energy efficiency. We also sweep the bit length of the weights and the table size for αu, as shown in Figures 10(e)-(h). Figures 10(e) and 10(f) show processing time and energy consumption for different weight bit lengths; Figures 10(g) and 10(h) capture the processing time and energy for various lookup table sizes in the feedforward design. In all cases, the STT/SHE-F and SHE-F configurations provide the lowest processing time and energy compared to the other CRAM variants, emphasizing the effect of the cell technology.
Overall, when compared to the best performing Loihi baseline (0.55 V) with a fanout of 1024 and 1-bit synaptic weights, the SHE-F configuration consumes 26.13× less energy. At the same time, SHE-F is 3.99× more energy efficient in terms of EDP (energy-delay product) for the feedforward implementation, as captured in Figure 10.
Table 5. Energy, latency, and EDP for a single spike operation.

Metric | SHE-F | Loihi
Spike Operation Energy | 143.81 fJ | 23.6 pJ
Spike Operation Time | 499.86 ns | 3.5 ns
Spike Operation EDP | 7.19e-20 Js | 8.26e-20 Js
STDP Update Energy | 0.33 pJ | 120 pJ
STDP Update Time | 1321.7 ns | 6.1 ns
STDP Update EDP | 4.38e-19 Js | 7.32e-19 Js
Table 5 summarizes the low-level performance parameters of our best case CRAM-SNN design for 1-bit weights, where Lf = 32 for feedforward computations. Loihi's corresponding low-level performance parameters are also tabulated for comparison. STDP operations are performed at 10-bit precision. Lf = 32 is chosen to capture enough accuracy to model αu(t) while lowering logic complexity to maintain low energy consumption. 10-bit precision suffices to provide enough bandwidth for updating up to 9-bit weights, which is meaningful for a fair comparison. Although slower, our design consumes significantly less energy, which leads to a lower EDP for a single spike operation with a single presynaptic connection and 1-bit weights.
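As a quick consistency check, EDP is simply the product of energy and latency; the Table 5 entries can be reproduced as follows (the small discrepancy in the STDP row comes from rounding of the tabulated energy figure):

```python
# EDP = energy x latency, for the SHE-F column of Table 5.
spike_edp = 143.81e-15 * 499.86e-9   # ~7.19e-20 Js, as tabulated
stdp_edp = 0.33e-12 * 1321.7e-9      # ~4.36e-19 Js (tabulated as 4.38e-19)
print(f"{spike_edp:.3g} Js, {stdp_edp:.3g} Js")
```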
Fig. 12. Single-spike energy breakdown of STDP weight update operation.
Figures 13(a) and 13(b) show the STDP processing time and energy for larger bit lengths. Processing times in Figure 13(a) increase irregularly around 6 bits as CRAM preset operations become overwhelming; however, this does not affect the energy consumption pattern in Figure 13(b), where the SHE-F configuration outperforms the other CRAM variants. The energy breakdown of STDP operations is given in Figures 12(a) and 12(b). Similar to the feedforward case, energy consumption is mostly dominated by preset, addition, and multiplication operations in the STDP weight update. The low-level performance results further show that our gains are due not only to the proposed connectivity scheme, but also to the low-energy, massively parallel logic operations that CRAM enables.
Fig. 13. (a) Processing time vs. weight bit length. (b) STDP update energy vs. weight bit length.
5 CONCLUSION
We introduce a CRAM-based SNN architecture, where each neuron is processed inside a single CRAM array while the arrays are connected to each other in a GDBG topology. We thereby achieve a limited full connectivity using only 2N connections in an N-neuron network, instead of $\binom{N}{2}$, while exploiting the massive intra- and inter-array parallelism and energy efficiency of CRAM for logic operations. Our best configuration results in 164.1× less energy consumption and similar EDP for feedforward operations, and 361.77× less energy consumption and 1.66× lower EDP for learning (via STDP), when compared to the alternative Loihi architecture.
REFERENCES
[1] Trevor Bekolay, James Bergstra, Eric Hunsberger, Travis DeWolf, Terrence C Stewart, Daniel Rasmussen, Xuan Choo, Aaron Voelker, and Chris Eliasmith.
2014. Nengo: a Python tool for building large-scale functional brain models. Frontiers in neuroinformatics 7 (2014), 48.
[2] Mei-Chin Chen, Abhronil Sengupta, and Kaushik Roy. 2018. Magnetic skyrmion as a spintronic deep learning spiking neuron processor. IEEE Transactions
on Magnetics 54, 8 (2018), 1–7.
[3] Z. Chowdhury, J. D. Harms, S. K. Khatamifard, M. Zabihi, Y. Lv, A. P. Lyle, S. S. Sapatnekar, U. R. Karpuzcu, and J. Wang. 2018. Efficient In-Memory
Processing Using Spintronics. IEEE Computer Architecture Letters 17, 1 (Jan 2018), 42–46. DOI:http://dx.doi.org/10.1109/LCA.2017.2751042
[4] Z. I. Chowdhury, S. K. Khatamifard, Z. Zhao, M. Zabihi, S. Resch, M. Razaviyayn, J. Wang, S. Sapatnekar, and U. R. Karpuzcu. 2019. Spintronic In-Memory
Pattern Matching Using Computational RAM (CRAM). IEEE Journal on Exploratory Solid-State Computational Devices and Circuits (2019), 1–1. DOI:
http://dx.doi.org/10.1109/JXCDC.2019.2951157
[5] Khanh N Dang and Abderazek Ben Abdallah. 2019. An Efficient Software-Hardware Design Framework for Spiking Neural Network Systems. 2019
International Conference on Internet of Things, Embedded Systems and Communications (IINTEC) (2019).
[6] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam,
Shweta Jain, and others. 2018. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 1 (2018), 82–99.
[7] N. G. DE BRUIJN. 1946. A combinatorial problem. Proc. Koninklijke Nederlandse Academie van Wetenschappen 49 (1946), 758–764. https://ci.nii.ac.jp/naid/
10019660672/en/
[8] Julie Dethier, Paul Nuyujukian, Chris Eliasmith, Terrence C Stewart, Shauki A Elasaad, Krishna V Shenoy, and Kwabena A Boahen. 2011. A brain-machine
interface operating with a real-time spiking neural network control algorithm. In Advances in neural information processing systems. 2213–2221.
[9] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi. 2012. NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems 31, 7 (July 2012), 994–1007. DOI:http://dx.doi.org/10.1109/TCAD.2012.2185930
[10] P. Faizian, M. A. Mollah, X. Yuan, Z. Alzaid, S. Pakin, and M. Lang. 2018. Random Regular Graph and Generalized De Bruijn Graph with k -Shortest Path
Routing. IEEE Transactions on Parallel and Distributed Systems 29, 1 (Jan 2018), 144–155. DOI:http://dx.doi.org/10.1109/TPDS.2017.2741492
[11] Charlotte Frenkel, Martin Lefebvre, Jean-Didier Legat, and David Bol. 2018. A 0.086-mm2 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital
spiking neuromorphic processor in 28-nm CMOS. IEEE transactions on biomedical circuits and systems 13, 1 (2018), 145–158.
[12] Steve B Furber, Francesco Galluppi, Steve Temple, and Luis A Plana. 2014. The SpiNNaker project. Proc. IEEE 102, 5 (2014), 652–665.
[13] Xin Jin, Mikel Lujan, Luis A Plana, Sergio Davies, Steve Temple, and Steve B Furber. 2010. Modeling spiking neural networks on SpiNNaker. Computing
in science & engineering 12, 5 (2010), 91–97.
[14] Dayeol Lee, Gwangmu Lee, Dongup Kwon, Sunghwa Lee, Youngsok Kim, and Jangwoo Kim. 2018. Flexon: a flexible digital neuron for efficient spiking
neural network simulations. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 275–288.
[15] Roberto Lent, Frederico AC Azevedo, Carlos H Andrade-Moraes, and Ana VO Pinto. 2012. How many neurons do you have? Some dogmas of quantitative
neuroscience under revision. European Journal of Neuroscience 35, 1 (2012), 1–9.
[16] Andrew Lines, Prasad Joshi, Ruokun Liu, Steve McCoy, Jonathan Tse, Yi-Hsin Weng, and Mike Davies. 2018. Loihi asynchronous neuromorphic research
chip. In 2018 24th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC). IEEE, 32–33.
[17] Paul A Merolla, John V Arthur, Rodrigo Alvarez-Icaza, Andrew S Cassidy, Jun Sawada, Filipp Akopyan, Bryan L Jackson, Nabil Imam, Chen Guo, Yutaka
Nakamura, and others. 2014. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345, 6197 (2014),
668–673.
[18] D. E. Nikonov and I. A. Young. 2019. Benchmarking Delay and Energy of Neural Inference Circuits. IEEE Journal on Exploratory Solid-State Computational
Devices and Circuits 5, 2 (2019), 75–84.
[19] Salonik Resch, S. Karen Khatamifard, Zamshed Iqbal Chowdhury, Masoud Zabihi, Zhengyang Zhao, Jian-Ping Wang, Sachin S. Sapatnekar, and Ulya R.
Karpuzcu. 2019. PIMBALL: Binary Neural Networks in Spintronic Memory. ACM Trans. Archit. Code Optim. 16, 4, Article 41 (Oct. 2019), 26 pages. DOI:
http://dx.doi.org/10.1145/3357250
[20] S. Schmitt, J. Klähn, G. Bellec, A. Grübl, M. Güttler, A. Hartel, S. Hartmann, D. Husmann, K. Husmann, S. Jeltsch, V. Karasenko, M. Kleider, C. Koke, A. Kononov, C. Mauch, E. Müller, P. Müller, J. Partzsch, M. A. Petrovici, S. Schiefer, S. Scholze, V. Thanasoulis, B. Vogginger, R. Legenstein, W. Maass, C. Mayr, R. Schüffny, J. Schemmel, and K. Meier. 2017. Neuromorphic hardware in the loop: Training a deep spiking network on the BrainScaleS wafer-scale system. In 2017 International Joint Conference on Neural Networks (IJCNN). 2227–2234. DOI:http://dx.doi.org/10.1109/IJCNN.2017.7966125
[21] Abhronil Sengupta, Aparajita Banerjee, and Kaushik Roy. 2016. Hybrid spintronic-CMOS spiking neural network with on-chip learning: Devices, circuits,
and systems. Physical Review Applied 6, 6 (2016), 064003.
[22] R Singleton. 1967. A method for computing the fast Fourier transform with auxiliary memory and limited high-speed storage. IEEE Tran. on Audio and
Electroacoustics 15, 2 (1967).
[23] Terrence Stewart, Feng-Xuan Choo, and Chris Eliasmith. 2012. Spaun: A perception-cognition-action model using spiking neurons. In Proceedings of the
Annual Meeting of the Cognitive Science Society, Vol. 34.
[24] Evangelos Stromatias, Miguel Soto, Teresa Serrano-Gotarredona, and Bernabé Linares-Barranco. 2017. An event-driven classifier for spiking neural
networks fed with synthetic or dynamic vision sensor data. Frontiers in neuroscience 11 (2017), 350.
[25] E. I. Vatajelu and L. Anghel. 2017. Fully-connected single-layer STT-MTJ-based spiking neural network under process variability. In 2017 IEEE/ACM
International Symposium on Nanoscale Architectures (NANOARCH). 21–26. DOI:http://dx.doi.org/10.1109/NANOARCH.2017.8053727
[26] Jian-Ping Wang and Jonathan D Harms. 2015. General structure for computational random access memory (CRAM). (Dec. 29 2015). US Patent 9,224,447.
[27] Kezhou Yang, Akul Malhotra, Sen Lu, and Abhronil Sengupta. 2020. All-Spin Bayesian Neural Networks. IEEE Transactions on Electron Devices 67, 3
(2020), 1340–1347.
[28] Shihui Yin, Deepak Kadetotad, Bonan Yan, Chang Song, Yiran Chen, Chaitali Chakrabarti, and Jae-sun Seo. 2017. Low-power neuromorphic speech
recognition engine with coarse-grain sparsity. In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 111–114.
[29] M. Zabihi, Z. I. Chowdhury, Z. Zhao, U. R. Karpuzcu, J. Wang, and S. S. Sapatnekar. 2019. In-Memory Processing on the Spintronic CRAM: From
Hardware Design to Application Mapping. IEEE Trans. Comput. 68, 8 (Aug 2019), 1159–1173. DOI:http://dx.doi.org/10.1109/TC.2018.2858251
[30] M. Zabihi, Z. Zhao, D. Mahendra, Z. I. Chowdhury, S. Resch, T. Peterson, U. R. Karpuzcu, J. Wang, and S. S. Sapatnekar. 2019. Using Spin-Hall MTJs
to Build an Energy-Efficient In-memory Computation Platform. In 20th International Symposium on Quality Electronic Design (ISQED). 52–57. DOI:
http://dx.doi.org/10.1109/ISQED.2019.8697377
[31] Deming Zhang, Lang Zeng, Youguang Zhang, Weisheng Zhao, and Jacques Olivier Klein. 2016. Stochastic spintronic device based synapses and spiking
neurons for neuromorphic computation. In 2016 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). IEEE, 173–178.
