A Supervised Learning Algorithm for Multilayer Spiking Neural Networks
  Based on Temporal Coding Toward Energy-Efficient VLSI Processor Design by Sakemi, Yusuke et al.
A Supervised Learning Algorithm for Multilayer Spiking Neural
Networks Based on Temporal Coding Toward Energy-Efficient
VLSI Processor Design
Yusuke Sakemi, Kai Morino, Takashi Morie, and Kazuyuki Aihara,
January 16, 2020
Abstract
Spiking neural networks (SNNs) are brain-inspired mathematical models with the ability to process
information in the form of spikes. SNNs are expected to provide not only new machine-learning algo-
rithms, but also energy-efficient computational models when implemented in VLSI circuits. In this paper,
we propose a novel supervised learning algorithm for SNNs based on temporal coding. A spiking neuron
in this algorithm is designed to facilitate analog VLSI implementations with analog resistive memory, by
which ultra-high energy efficiency can be achieved. We also propose several techniques to improve the
performance on a recognition task, and show that the classification accuracy of the proposed algorithm is
as high as that of the state-of-the-art temporal coding SNN algorithms on the MNIST dataset. Finally,
we discuss the robustness of the proposed SNNs against variations that arise from the device manufac-
turing process and are unavoidable in analog VLSI implementation. We also propose a technique to
suppress the effects of variations in the manufacturing process on the recognition performance.
1 Introduction
Machine learning is a paradigm according to which algorithms perform tasks by learning from data instead
of exploiting domain knowledge. Recently, machine-learning algorithms that employ artificial neural net-
works (ANNs) have been attracting much interest from researchers and engineers in diverse fields owing to
their astonishing performance on various tasks such as image recognition, natural language processing, and
time series forecasting [1, 2]. These machine-learning algorithms are known as deep learning because of the
characteristic multilayer structures of ANNs. In deep learning, stochastic gradient descent (SGD) and back
propagation (BP) are important techniques that enable multilayer ANNs to perform effective learning [3,4].
In addition, recent advances in algorithms such as convolutional neural networks (CNNs) [5] and skip con-
nections [6] have led to the success of deep learning. Moreover, the evolution of very large-scale integration
(VLSI) technologies, such as Moore’s law [7] and Dennard scaling [8], have enabled deep learning to be
implemented in real-world applications.
However, the computational burden of deep learning often hinders its application in the real world.
Therefore, researchers and engineers often use high-performance processors such as graphical processing
units, tensor processing units [9], and application-specific integrated circuits (ASICs) [10] to compensate for
this shortcoming. Despite these developments, it remains difficult to enable the learning to converge in an
acceptable time, or to perform the tasks in real time. This is even more difficult when performing tasks with
Y. Sakemi and K. Morino, and K. Aihara are with Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505,
Japan e-mail: sakemi@sat.t.u-tokyo.ac.jp.
Y. Sakemi is with NEC Corporation, Kanagawa 211-8666, Japan.
K. Morino is with Interdisciplinary Graduate School of Engineering Sciences, Kyushu University, 816-8580, Japan
T. Morie is with Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, Kitakyushu,
808-0196, Japan.
K. Aihara is with International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo Institutes for
Advanced Study, The University of Tokyo, Tokyo 113-0033, Japan.
1
ar
X
iv
:2
00
1.
05
34
8v
1 
 [c
s.N
E]
  8
 Ja
n 2
02
0
processors placed near the end-users (edge computing) owing to the limited battery capacity. Therefore, for
edge computing, high computational energy efficiency is required.
In this regard, finding a solution for the energy-efficiency problem by looking toward the human brain
would only seem natural, because the human brain is capable of processing complex information in real
time by consuming only 20-30 W of energy. An ANN is mostly a mathematical model inspired by the
physiological observation of spike frequencies in biological neurons, whereas a spiking neural network (SNN)
is a mathematical model that is directly concerned with the spike generation [11]. Therefore, SNNs would
be expected to perform tasks in a way similar to the brain with low energy consumption. Until recently,
SNNs did not perform as well as ANNs on tasks such as image recognition. However, it has been shown
that the performance of SNNs could reach that of ANNs by exploiting the techniques used in deep learning
(multilayer structures, SGD, BP, etc.) [12,13]. Thus, it seems reasonable to find a way to really improve the
energy efficiency of SNNs. In fact, not only have SNNs demonstrated their high algorithmic performance but
also their advantages in terms of energy efficiency especially when implemented in ASICs by exploiting the
asynchronous property and binary (spike) communications [14]. Many research groups have manufactured
various types of ASICs for SNNs, which include, for example, mixed-signal ASICs [15–18], and fully digital
ASICs [19,20]. Other ASICs were also reported [21,22]. These ASICs are able to execute SNNs with complex
and flexible networks in a highly energy efficient manner. However, it is still unclear whether ASICs for SNNs
are more energy efficient per operation or per task than ASICs for ANNs. This comparison is blurred by a
common bottleneck preventing computational energy efficiency, which is caused by data movement between
processors and memory [23]. Because of this common bottleneck, the advantages of employing SNNs in a
hardware implementation such as the asynchronous property and spike communications are hardly visible in
terms of the computational energy efficiency of ASICs. One promising approach to overcome this bottleneck
is in-memory computing.
In-memory computing is a concept that involves performing the computing where the memorized data
are located [24]. Especially, in-memory computing schemes that use analog resistive memory have being
attracting hardware researchers’ attention [25–27]. Analog resistive memory stores information in the form
of resistance. The application of voltage across the two terminals of an analog resistive memory cell results in
an electric current, which is proportional to the voltage, and which flows through the memory cell by obeying
Ohm’s law. Currents flowing through memory cells can easily be summed up on the basis of Kirchhoff’s
current law. Note that this computing scheme does not require information to be stored elsewhere (e.g.,
static random access memory or dynamic random access memory). Based on the above principle, computing
such as matrix-vector multiplications (MVMs) can be performed in a highly energy efficient way.
This motivated us to propose a novel supervised learning algorithm for multilayer SNNs to achieve
high algorithmic performance and high energy efficiency. It is also possible to implement our algorithm in
VLSI circuits with analog resistive memory. Our algorithm is based on the following ideas: (i) SNNs are
composed of spiking neuron models which facilitate VLSI implementations; (ii) SNNs are trained with a
temporal coding scheme, thereby enabling SNNs to perform tasks with fewer spikes than SNNs based on
rate coding. We also propose novel learning techniques to improve the learning performance and evaluate
the performance on the MNIST dataset. Furthermore, for real VLSI implementations, robustness against
manufacturing variations in devices is crucial. Therefore, we investigated the effects of these manufacturing
variations on the learning performance of the proposed algorithm. In addition, we propose a technique to
suppress the effect of manufacturing variations.
This paper consists of the following sections: section 2 describes related work; section 3 proposes an
SNN and its learning algorithm; section 4 introduces the experimental setup for evaluating the algorithm
on the MNIST dataset, and for investigating the effects of manufacturing variations; section 5 presents the
experimental results; section 6 concludes this paper.
2 Related Work
2.1 Supervised learning algorithms for multilayer SNNs
A number of studies have shown that multilayer SNNs can be trained with the BP technique [12, 13, 28].
The application of the BP to SNNs is not straightforward as in the case of ANNs because SNNs have non-
differentiable processes such as spike generations. Furthermore, the dynamics and the learning performance
2
Figure 1: Different representations of information in SNNs. The arrows represent the directions of informa-
tion propagation. The widths of spikes do not carry information. (a) Information is contained in the spike
frequency (rate coding). (b) Information is contained in the spike timing (temporal coding).
depend on the coding schemes of SNNs. Fig. 1 shows SNNs based on two different coding schemes: rate
coding and temporal coding. In rate coding, information is contained in the spike frequency of neurons,
whereas in temporal coding, information is contained in the spike timing of neurons [29].
In rate coding, BP algorithms can be obtained by approximations such as linear neuron approximation
[30], and by considering spike generation as a probabilistic process [31]. In [28,32,33], the BP algorithms were
obtained by using the derivative of the membrane potential with respect to time. In addition, it has been
shown that trained ANNs can be converted into rate-coding-based SNNs [34,35]. However, these algorithms
need to average a number of spikes to propagate information forward or backward. Because the energy
consumption of ASICs increases as the number of spikes increases, performing tasks with fewer spikes is
preferable.
Temporal coding enables SNNs to perform tasks with fewer spikes compared to rate coding. SpikeProp is
the first algorithm that enabled multilayer SNNs to learn by using temporal coding [36]. SpikeProp obtains
the BP algorithm by directly differentiating spike timing with respect to weights. Mostafa [37] improved the
learning performance of SNNs in temporal coding on recognition tasks by using non-leaky spiking neurons.
Comsa et al. [38] used an alpha synaptic function and employed an evolutionary algorithm to search for the
hyperparameter to improve the learning performance.
In this paper, we propose a supervised learning algorithm for SNNs based on temporal coding, which is
similar to those mentioned above [37,38]. The novelties of our work are as follows: we use a different neuron
model to simplify the hardware implementation and modify the learning algorithm to improve the learning
performance.
2.2 Computing with analog resistive memory
Analog resistive memory such as resistive RAM and phase change memory is considered important build-
ing blocks for next-generation artificial-intelligence processors [25–27]. Many research groups have already
designed circuits with analog resistive memory to efficiently compute MVMs [39–43]. Among various VLSI
implementations with analog resistive memory, circuits that process information by using the timing of
electronic pulses are expected to be advantageous in terms of energy efficiency because of their simple im-
plementation [43–51]. For example, circuits using the timing difference between two pulses [48,49] and those
using the pulse-width difference between two pulses [50,51] are able to compute MVMs efficiently. Nandaku-
mar et al. [52] developed circuits with phase change memory that is capable of computing single-layer SNNs
based on rate coding.
In this paper, we show that SNNs based on temporal coding can be straightforwardly mapped to the
above VLSI implementation schemes.
3 Methods
This section introduces our proposed SNNs and their learning algorithms. The SNNs described here are
feedforward networks of spiking neurons as shown in Fig. 1(b), where spikes propagate from one layer to
3
Figure 2: (a) Schematic diagram of a multilayer SNN. The kth neuron in the lth layer receives spikes from
neurons in the (l − 1)th layer at t(l−1)1 , t(l−1)2 , t(l−1)3 , and fires at t(l)k . Then, a spike is sent to neurons in the
subsequent layer. (b) Schematic diagram of time evolution of the membrane potential v
(l)
k of the kth neuron
in the lth layer.
the next layer without skipping layers. The SNNs are to be trained such that the network can convert a
given input spike pattern into a specific output spike pattern. We also present several examples of VLSI
implementations of the SNNs where analog resistive memory is used as network weights.
3.1 Models
The time evolution of the membrane potential of a spiking neuron that we employed is given by
d
dt
v
(l)
i (t) =
N(l−1)∑
j=1
w
(l)
ij θ(t− t(l−1)j ) (1)
θ(x) =
{
0, (x < 0)
1, (x ≥ 0) (2)
where we define v
(l)
i (t) as the membrane potential of the ith spiking neuron in the lth layer, and define t
(l)
i
as the spike timing generated by the same neuron to the present input pattern. We define w
(l)
ij as the weight
of the connection from the jth neuron in the (l − 1)th layer to the ith neuron in the lth layer. N (l) is the
number of neurons constituting the lth layer. The zeroth layer and the Mth layer are referred to as the
input and output layers, respectively. The other layers are hidden layers. This network is an M -layer SNN.
The analytical solution of (1) is given by
v
(l)
i (t) =
N(l−1)∑
j=1
w
(l)
ij (t− t(l−1)j )θ(t− t(l−1)j ). (3)
The neuron generates (fires) a spike when its membrane potential reaches the firing threshold Vth. Then, the
membrane potential is fixed to the reset voltage (v
(l)
i = 0), which prevents the neuron from firing twice for a
4
given input spike pattern. Fig. 2(b) shows a schematic diagram to indicate the dependence of the evolution
of the membrane potential on the timing of received spikes and the connection weights.
The timing of spikes can be obtained as follows. The spike timing t
(l)
i satisfies the equation
Vth =
∑
j∈Γ(l)i
w
(l)
ij (t
(l)
i − t(l−1)j ) (4)
where Γ
(l)
i is a set of indices of neurons in the (l − 1)th layer that sends spikes before the ith neuron in the
lth layer fires. With (4), we obtain the spike timing as follows:
t
(l)
i =
Vth +
∑
j∈Γ(l)i
w
(l)
ij t
(l−1)
j∑
j∈Γ(l)i
w
(l)
ij
. (5)
3.2 Supervised learning of SNNs
For the supervised learning of SNNs, we define the following functions
C(t(M),κ;w) := L(t(M),κ;w) +
γ
2
R(t(M)) (6)
L(t(M),κ;w) := −
N(M)∑
i=1
κi ln
(
Si(t
(M))
)
(7)
Si(t
(M)) :=
exp
(
−t(M)i
)
∑N(M)
j=1 exp
(
−t(M)j
) (8)
R(t(M)) :=
N(M)∑
i=1
(
t
(M)
i − t(Ref)
)2
(9)
where C(·, ·; ·), L(·, ·; ·), Si(·), and R(·) are the cost function, loss function, softmax function, and temporal
penalty term, respectively. Teacher labels are represented as a vector κ such that κi is 1 when the ith label
is given, and 0, otherwise. The weights of the entire network are represented as a vector w. The temporal
penalty term R(t(M)) is defined by the total difference between the spike timing of the output neurons t(M)
and the timing of a reference spike t(Ref). Note that t(M) is the vector of spike timing of the neurons in the
output layer. As explained later, the temporal penalty is intended to circumvent learning difficulties. The
coefficient γ(> 0) controls the effect of the temporal penalty term.
The network is trained by updating the weights by using SGD on a dataset as follows:
∆w
(l)
jk = −η
∂C(t(M),κ;w)
∂w
(l)
jk
(10)
where ∆w
(l)
jk is the update of w
(l)
jk and η(> 0) is the learning rate. The derivative of the cost function with
respect to weights is given by
∂C
∂w
(l)
ij
=
∂t
(l)
i
∂w
(l)
ij
δ
(l)
i (11)
where we define the propagation error δ
(h)
i as
δ
(h)
i =

∑N(h+1)
k=1
∂t
(h+1)
k
∂t
(h)
i
δ
(h+1)
k , for h = 1, 2, . . . ,M − 1
∂C
∂t
(M)
i
, for h = M.
(12)
5
By using the following derivatives
∂t
(l)
i
∂w
(l)
ij
= − t
(l)
i − t(l−1)j∑
k∈Γ(l)i
w
(l)
ik
(13)
∂t
(l)
i
∂t
(l−1)
j
=
w
(l)
ij∑
k∈Γ(l)i
w
(l)
ik
(14)
∂C
∂t
(M)
i
= κi − Si + γ
(
t
(M)
i − t(Ref)
)
(15)
the updating difference (10) can be obtained.
The derivative of the cost function is rigorous unless a neuron ceases to fire because of the update.
However, we found that the network experience two difficulties during training. These difficulties are related
to destructively large weight updates and non-unique spike timing, respectively. The destructively large
weight updates are caused by infinitesimal values in the denominator of (13) or (14). Our solution to prevent
these unacceptable large weight updates is to redefine (13) and (14) as
∂t
(l)
i
∂w
(l)
ij
= − t
(l)
i − t(l−1)j
+
∑
k∈Γ(l)i
w
(l)
ik
(16)
∂t
(l)
i
∂t
(l−1)
j
=
w
(l)
ij
+
∑
k∈Γ(l)i
w
(l)
ik
(17)
where  is a positive constant.
Next, the loss function shown in (7) is invariant against an arbitrary value α such that
L(t(M),κ;w) = L(t(M) + αI,κ;w) (18)
where I ∈ RN(M) is a vector of which the elements are all unity. Owing to its invariant nature, the loss
function does not uniquely specify the timing of output spikes, which forces the timing of the output spike to
become large, resulting in the disappearance of spikes. The temporal penalty term, which is introduced to
solve this problem, enables the cost function to uniquely specify the timing of the output spike to approximate
t(Ref) of the reference spike.
Here, we summarize the difference between our algorithm and the algorithms reported before [37, 38].
These previous studies were also concerned with the supervised learning of SNNs based on temporal coding.
They used almost the same loss function like an exponential form of spike timing in [37]. Their approach
to avoid the destructively large weight updates was to normalize the gradient when it became excessively
large [37], or to clip the gradient to a fixed value [38]. We use a simple method (16) and (17), which
can prevent the magnitude of the updated weights from becoming too large. They also reported the spike
disappearance problem as explained above. Mostafa [37] avoided this problem by introducing a penalty term
that increased the sum of values of weights and the L2 regularization term, but the two terms need to be
balanced appropriately. Comsa et al. [38] also avoided this problem by adding a small penalty that increased
the sum of weights when a neuron did not fire. On the other hand, we avoid this problem by introducing the
temporal penalty term, which is a more direct way to stabilize the timing of an output spike. Our approach
enhances the performance compared to that of Mostafa [37], and is slightly superior to that of Comsa et
al. [38].
3.3 Deep learning with SNNs
Because the important properties of deep learning are sequential nonlinear transformations over layers and
end-to-end learning, the proposed algorithm can be considered to be a form of deep learning. The proposed
SNN processes information by converting the pattern of an input spike t(0) in a layer-by-layer manner as
follows:
t(0) → t(1) → · · · → t(M) (19)
6
Figure 3: Examples of implementations of analog circuits of spiking neurons. Dashed squares are integrators,
which convert summed currents into voltages: circuit implementation (a) with and (b) without operational
amplifiers.
where t(l) := {t(l)1 , t(l)2 , . . . , t(l)N(l)}. Note that t
(l)
i is considered to be a sufficiently large constant if the ith
neuron in the lth layer does not fire at the end of the simulations. It is worth noting that the layer-to-layer
nonlinear transformation in SNNs is different from that in ANNs. According to (5), the spike timing t
(l)
i is
a linear function of the presynaptic spike timing t
(l−1)
j if t
(l)
i > t
(l−1)
j , whereas it does not depend on t
(l−1)
j if
t
(l)
i ≤ t(l−1)j . This function seems to be similar to a rectified linear unit (ReLU) [53], which is commonly used
as an activation function for ANNs. However, unlike ReLUs, the turning point, t
(l)
i = t
(l−1)
j , is determined
by other spikes and their weights. In other words, the information is processed by using the timing order of
spikes, which is the most unique characteristic of SNNs based on temporal coding.
3.4 Consideration of VLSI implementation
For analog VLSI implementations, the dynamics of SNNs must be physically realized in circuits that are
carefully designed using either the nonlinear or linear properties of transistors. Here, we show that the time
evolution of the membrane potential of neurons as expressed by (1) can be mapped to circuits with analog
resistive memory in a simple and straightforward manner.
Fig. 3(a) shows an example of such implementations including the analog resistive memory, operational
amplifiers, capacitors, and comparators. The membrane potential v
(l)
i is represented as the voltage at the
capacitor C, which is determined by the charge at the node Q. A spike is generated by a comparator when
7
the membrane voltage exceeds the firing threshold voltage Vth. The existence of the spike is represented
with the output voltage vˆ
(l)
i and
¯ˆv
(l)
i as follows:
(
vˆ
(l)
i (t),
¯ˆv
(l)
i (t)
)
=

(
V −pulse, V
+
pulse
)
, (t < t
(l)
i )(
V +pulse, V
−
pulse
)
, (t ≥ t(l)i )
(20)
where V ±pulse are constants and V
−
pulse < 0 < Vth < V
+
pulse is assumed.
In Fig. 3, σ
(l)+
ij and σ
(l)−
ij represent the conductance (the inverse of the resistance) from the jth neuron
in the (l − 1)th layer to the ith neuron in the lth layer. This combination can represent positive weights
and negative weights as explained later. Currents flow through the conductance σ
(l)±
ij to the ith neuron in
the lth layer, which causes the membrane potential v
(l)
i to change. It should be noted that, in Fig. 3(a),
the voltage at node Q is virtually grounded because of the operational amplifiers. The time evolution of the
membrane potential v
(l)
i is obtained by Kirchhoff’s current law and Ohm’s law as follows:
C
d
dt
v
(l)
i (t) =
N(l−1)∑
j=1
(
σ
(l)+
ij V
+
pulse + σ
(l)−
ij V
−
pulse
)
θ(t− t(l−1)j ). (21)
By setting C = 1 and σ
(l)+
ij V
+
pulse + σ
(l)−
ij V
−
pulse = w
(l)
ij , (21) agrees with (1).
We introduce another simpler circuit configuration, as shown in Fig. 3(b), to allow us to investigate the
effect of an approximation of the neuron model (1) on the learning performance. This circuit implementation
is similar to that in [49], which was proposed for calculating vector-matrix operations. Because this circuit
(Fig. 3(b)) uses no operational amplifiers, which often consume the largest amount of energy in analog
circuits, this implementation is expected to be more energy efficient. Similar to (21), the time evolution of
the membrane potential v
(l)
i is performed by using Kirchhoff’s current law and the Ohm’s law as follows:
C
d
dt
v
(l)
i (t) =
N(l−1)∑
j=1
σ
(l)+
ij (V
+
pulse − v(l)i )θ(t− t(l−1)j )
+ σ
(l)−
ij (V
−
pulse − v(l)i )θ(t− t(l−1)j )
=
N(l−1)∑
j=1
w
(l)
ij δw(l)ij ≥0
(
1− v
(l)
i
V +pulse
)
θ(t− t(l−1)j )
+ w
(l)
ij δw(l)ij <0
(
1− v
(l)
i
V −pulse
)
θ(t− t(l−1)j )
=
N(l−1)∑
j=1
w
(l)
ij
(
1− v(l)i V ijpulse
)
θ(t− t(l−1)j ) (22)
where we define
δs :=
{
0, if a condition s is false
1, if a condition s is true
(23)
V ijpulse :=
δ
w
(l)
ij ≥0
V +pulse
+
δ
w
(l)
ij <0
V −pulse
. (24)
Note that the following relationships
σ
(l)+
ij =
w
(l)
ij δw(l)ij ≥0
V +pulse
σ
(l)−
ij =
w
(l)
ij δw(l)ij <0
V −pulse
(25)
8
were used. By setting C = 1 and letting V ijpulse → 0, (22) converges to (1), where V ijpulse → 0 is realized
by |V ±pulse| → ∞. However, smaller values of |V ±pulse| are more preferable in terms of energy efficiency. The
effects of the finite values of |V ±pulse| on the learning performance were not previously considered [49]. Thus,
we investigated the effects by numerical simulations. Except when explicitly pointed out, simulations were
conducted with the neuron model expressed by (1). Details of the simulation method and the learning
methods for the neuron model expressed by (22) are provided in the appendix. We note that if current
sources (field-effect transistors) are used instead of resistors, as employed by others [51], the time evolution
equation becomes similar to (21) corresponding to the case of Fig. 3(a).
Finally, we explain the advantages of SNNs over time-domain circuits of ANNs [46, 49–51] in terms of
the energy efficiency of circuits. First, the circuits of Marinella et al. [46] need Analog-to-Digital Converters
(ADCs) to convert the signal for each layer of ANNs, which are known as energy-hungry processes. Actually,
it is reported that ADCs were the energy bottleneck in the systems [46]. Circuit implementations requiring
no ADCs were proposed [49–51]. These circuits compute the positive and negative weights independently,
and subsequently take the difference of the two. As a result of the above principles, these circuits require
two circuits to compute the positive and negative weights, and the dynamic range decreases significantly
because the difference is taken (corresponding to digit loss in digital computing). The advantage of using
SNNs over these circuits is that our circuits shown in Fig. 3 require no full ADCs because the input and
output of the circuits consist of binary spikes, and that the circuits can compute the positive and negative
weights at the same time.
4 Experimental setup
In this section, we describe the experimental setup for evaluating the learning performance of the proposed
algorithm with the MNIST dataset [54]. We also present the experimental setup for evaluating the robustness
of the algorithms against manufacturing variations in the devices, which are unavoidable in analog VLSI
implementations. In all the experiments, we set Vth = 1. We again note that neuron model (1) was used
unless otherwise noted.
4.1 Model evaluation with MNIST dataset
The MNIST dataset [54] consists of 70,000 images of handwritten digits, each of which is represented as a
784 (= 28×28)-dimensional vector with a label of a corresponding integer (from 0 to 9). In our experiments,
60,000 images were used for training, and the remaining 10,000 images were used for testing, as in [54].
To process data in SNNs, the vector data are converted into spike patterns as follows: we normalize all
the elements (pixels) of the vector x as 0 ≤ xi ≤ 1. Then, the corresponding input spikes are generated at
t
(0)
i = τ
(in) (1− xi) , if xi > 0 (26)
where τ (in) is the maximum time window of the input spikes, and was set at 5 ms in our experiments. A
spike is not generated when xi = 0. At the training phase, to improve the generalization performance, noise
was introduced according to a Gaussian distribution N(0, σt) with zero mean and standard deviation σt.
4.2 Robustness against device manufacturing variations
In analog VLSI implementations, variations in the manufactured devices are unavoidable, and must be taken
into account during circuit design. In this experiment, we investigated the effect of manufacturing variations
on the learning performance, which has yet not been reported for SNNs based on temporal coding, to the
best of our knowledge. To suppress these effects, we employed a learning technique that is described later.
In the VLSI implementations of Fig. 3(a) and Fig. 3(b), various parameters, such as weights, capacitance,
spike delays, and firing threshold, need to be taken into account. Owing to the computational complexity, we
investigated the effects of variations on the learning performance of only the firing threshold and spike delays
in this study. We chose these two parameters because they affect the network dynamics in a different way:
the firing threshold changes the generation timing of spikes, whereas the spike delays shift the arrival timing
of spikes. The other parameters can be investigated in the same way we explained below. The remainder
9
Figure 4: (a) Schematic diagram of the spike delays. Each connection has its own delay such as τ
(l)
k1 . (b)
The probabilistic distribution of spike delays given by (29). Assuming that the mean τ
(l)
offset of the Gaussian
distribution is much larger than the standard deviation στ , the probability of drawing negative values from
(29) is negligible.
of this section describes the mathematical representations of the variations and then the evaluations of the
effects of the variations.
First, we mathematically model the variations of the firing threshold Vth. We define the firing threshold
of the ith neuron in the lth layer as V
(l,i)
th . For simplicity, we assume that the probability density distribution
of the variation is a clipped Gaussian distribution as follows:
V
(l,i)
th = max
(
Vˆ
(l,i)
th , 0
)
(27)
Vˆ
(l,i)
th ∼ N(Vth, σVth) (28)
where N(Vth, σVth) represents a Gaussian distribution with mean Vth and standard deviation σVth and the
operation ‘∼’ means drawing a value from the distribution.
Second, we mathematically model the variation of spike delays. As shown in Fig. 4(a), spike delays from
the jth neuron in the (l − 1)th layer to the ith neuron in the lth layer are defined as τ (l)ij . We assume a
Gaussian probability density distribution of the spike delays
τ
(l)
ij ∼ N(τ (l)offset, στ ). (29)
We also assume
τ
(l)
offset  στ ∀ l (30)
such that the condition τ
(l)
ij > 0 is always satisfied as shown in Fig. 4 (b). The offset τ
(l)
offset can be removed
from the simulation as follows. Consider the modification of τ
(l)
ij in the SNNs. If we add a constant α
(l) to
τ
(l)
ij for all i and j, all the spike timings of the neurons in the lth-layer are delayed by α
(l), e.g., for the first
layer the delays are α(1) and for the second layer the delays are α(1) + α(2). Therefore, the spike timing in
the output layer is delayed as
∑M
l=1 α
(l). We denote the delayed spike timing by tˆ(l) for the lth layer. Note
that the loss function shown in (7) is invariant against the delays as
L(t(M),κ;w) = L(tˆ(M),κ;w) (31)
because the effects of the difference
∑M
l=1 α
(l) between t(M) and tˆ(M) are cancelled out in (8). Moreover, in
(9), the temporal penalty term R(t(M)) is invariant against the delays if we add the delay
∑M
l=1 α
(l) to the
reference spike timing as
tˆ(Ref) = t(Ref) +
M∑
l=1
α(l). (32)
10
Figure 5: Raster plots of the spike timing of the neurons in the input, hidden, and output layers, from top to
bottom, respectively, for a 2-layer SNN (784-800-10). The hyperparameters are η = 1500, t(Ref) = 21 ms, γ =
100, and  = 4.
Therefore, the cost function shown in (6) becomes invariant against the delays {α(l)}Ml=1. This invariance
indicates that a constant shift of the delay τ
(l)
offset does not affect the learning results. Thus, our numerical
simulations were conducted with τ
(l)
offset = 0. Note that a situation in which the spike timing t
(l)
i causes the
spike to arrive earlier than the arrival time t
(l−1)
j + τ
(l)
ij may occur in the numerical simulation. However,
information processing in feedforward SNNs is valid because the differences among the arrival times within
each layer are maintained.
Using the variation models for the firing threshold Vth and for spike delays τ
(l)
ij , we investigated the
effects of the variations on the learning performance in the following two cases: (i) the exact values of the
parameters (Vth and τ
(l)
ij ) are not known, but those distributions are known; (ii) the exact values of the
variation parameters are known.
In case (i), the dynamics in the SNNs is black boxed, thus direct training is not possible. To overcome this
problem, we employed a learning technique in the training phase. We extract the values of these parameters
from (27) or (29) with estimated standard deviations σ
(train)
Vth(τ)
. The same techniques were reported for training
binarized neural networks [55]. In case (ii), training the SNNs can be carried out in a straightforward manner.
In the test phase, we evaluated the performance of the SNNs with various standard variations σ
(test)
Vth(τ)
.
Note that the experimental setup described in section 4.1 corresponds to the case of σ
(train)
Vth(τ)
= σ
(test)
Vth(τ)
= 0 in
case (i).
5 Results
5.1 Performance on the MNIST dataset
Fig. 5 shows the spike timing pattern (a raster plot) of a 2-layer SNN (784-800-10) after training for the input,
the hidden, and output layers, from top to bottom, respectively. The horizontal axis represents the time in
milliseconds and the vertical axis represents the index of neurons for each layer. These results show that the
broad distribution of the spike timing in the hidden layer is such that a large portion of neurons in the hidden
layer fire after the output neurons fire. In the output layer, the ninth output neuron fires the earliest, which
corresponds to the correct label (= 8). Fig. 6 shows the time evolution of the membrane potential v
(l)
i (t) of
the neurons in each layer. The membrane potentials evolve as (1) such that their behavior can be expressed
11
Figure 6: Time evolution of the membrane potentials of the neurons in the hidden layer (top) and the output
layer (bottom) for a 2-layer SNN (784-800-10). The horizontal dashed lines represent the value of Vth (= 1).
The hyperparameters are η = 1500, t(Ref) = 21 ms, γ = 100, and  = 4.
Table 1: Performance comparisons on the MNIST dataset
Network Coding Accuracy
784-800-10 [56] ANN 98.4 % [56]
784-800-800-10 [57] ANN 98.8 % [57]
784-800-10 [32] SNN (rate coding) 98.64 % [32]
784-300-300-10 [32] SNN (rate coding) 98.77 % [32]
784-800-10 [37] SNN (temporal coding) 97.55 % [37]
784-400-400-10 [37] SNN (temporal coding) 97.14 % [37]
784-340-10 [38] SNN (temporal coding) 97.96% [38]
784-500-10 (This work) SNN (temporal coding) 97.83±0.07 %
784-800-10 (This work) SNN (temporal coding) 98.04±0.05 %
784-300-300-10 (This work) SNN (temporal coding) 97.73±0.08 %
784-400-400-10 (This work) SNN (temporal coding) 97.90±0.12 %
by a piecewise linear function. For the hidden layer, because the neurons receive no spikes from the input
layer after 5 ms, their membrane potentials evolve linearly after the time 5 ms. For the output neurons,
their membrane potential varies nonlinearly because they receive many spikes from the hidden layer. The
membrane potential of the output neurons was observed to be suppressed by spikes from the hidden layer
at approximately 10 ms except that of the output neuron corresponding to the correct label. One neuron
fired at approximately 12 ms and the others at approximately tRef. These results indicate that our proposed
temporal penalty term is effective.
Similar but slightly different network dynamics is observed for 3-layer SNNs (784-400-400-10). Fig. 7
shows the raster plots of the spikes in the SNN. The distribution of spikes in the first hidden layer was
broad, whereas that in the second hidden layer was relatively narrow. A large portion of neurons in the
second hidden layer are observed not to have fired. This reduction in the number of firing neurons may
imply that the neurons needed for information processing are selected for each data. Fig. 8 shows the time
evolution of the membrane potential of the neurons in each layer. The results show that the membrane
potentials of neurons in the second and output layers evolve nonlinearly, whereas those in the first layer
evolve linearly after 5 ms.
Table 1 shows the results of the classification accuracy on the MNIST dataset. The standard errors ‘±’
12
Figure 7: Raster plots for a 3-layer SNN (784-400-400-10). The hyperparameters are η = 200, t(Ref) =
60 ms, γ = 100, and  = 1.
Figure 8: Time evolution of the membrane potentials of the neurons in the first layer (top), in the second layer
(middle), and in the output layers (bottom), for a 3-layer SNN (784-400-400-10). The horizontal dashed lines
represent the values of Vth(= 1). The hyperparameters are η = 1500, t
(Ref) = 60 ms, γ = 100, and  = 1.
13
Figure 9: Time evolution of the membrane potentials of the neurons for 2-layer SNN (169-300-10) for Vpulse.
(a) Vpulse = 128, (b) Vpulse = 2. In both (a) and (b), the upper and lower panels represent the first and the
output layer, respectively. In the upper panel of (b), thee membrane potentials converge to some intermediate
constants. The hyperparameters are η = 1500, t(Ref) = 21 ms, γ = 8, and  = 10.
for our results were obtained from 10 trials with different initial weights. The SNNs based on rate-coding [32]
performed as good as ANNs [56, 57], whereas the SNNs based on temporal coding [37] performed slightly
worse than ANNs. However, the improvements in our learning algorithm increased the performance of our
algorithm compared with previous research [37], and the performance was slightly higher than [38]. Our
best result is highlighted in bold in Table 1.
Next, we show the results of the learning performance when we replace the original neuron model (1)
with the alternative neuron model (22). We again note that the model (22) is considered to be more energy
efficient. To reduce the computational costs, we focused on a small-scale network (169-300-10) and used a
shrunk version of the MNIST dataset. A shrunk MNIST image xˆ was obtained by convoluting a 4×4 kernel
of which the elements are all 1 onto an original 28 × 28 image x∗ with a stride of 2. The (i, j) pixel of the
shrunk image xˆij is obtained by
xˆij =
1
16
3∑
k=0
3∑
m=0
x∗2i+k−1 2j+m−1, (i, j = 1, 2, . . . , 13). (33)
Then, we obtained spike patterns by using (26) with the flatted shrunk two-dimensional data xˆij . In
14
Figure 10: Classification accuracy on the shrunk MNIST dataset with various values of Vpulse (=
2, 4, 8, . . . , 128) for 2-layer SNN (169-300-10). Each plot is the average of the results of five trials. The
standard errors are smaller than the symbol size. The hyperparameters are η = 1500, t(Ref) = 21 ms, γ =
8, and  = 10.
the experiment, we experienced difficulties in achieving learning convergence because of destructively large
weight updates which had their origins in the temporal penalty term (9). We circumvented this difficulty by
replacing the temporal penalty term (9) with
R(t(M)) :=
N(M)∑
i=1
(
t
(M)
i − t(Ref)
)3/2
(34)
to soften the effect of its large weight updates when the timing of the output spike is far from t(Ref).
Fig. 9 shows the time evolution of the membrane potentials of the neurons for different values of Vpulse
(= |V +pulse| = |V −pulse|). When Vpulse = 128 (Fig. 9(a)), the membrane potentials of the neurons in the hidden
layer evolve linearly after 5 ms and those in the output layer evolve nonlinearly, which is similar to the case
of Fig. 6. On the other hand, when Vpulse = 2 (Fig. 9(b)), the membrane potentials of the neurons in the
hidden evolve nonlinearly, which is caused by the term v
(l)
i V
ij
pulse on the right-hand side of (22). Moreover,
the membrane potentials of the neurons are stabilized at some intermediate value, because (22) has an
equilibrium point with
v
(l)
i =
∑
j∈Γ(l)i
w
(l)
ij∑
j∈Γ(l)i
w
(l)
ij V
ij
pulse
. (35)
Fig. 10 shows the results of the classification accuracy for various values of Vpulse (= 2, 4, 8, . . . , 128). As
the value of Vpulse decreases, the classification accuracy decreases monotonically. As shown in Fig. 9, there
exists an equilibrium point in the membrane potential lower than Vth when the value of Vpulse is small, which
may complicate the training.
5.2 Effects of manufacturing variations on learning performance
We present the results of the numerical simulation of the experiments described in 4.2. Fig. 11 shows the
results of the classification accuracy for various values of the standard deviations of the firing threshold in
the test phase σ
(test)
Vth
. When no variation is assumed at the training phase (σ
(train)
Vth
= 0), the classification ac-
curacy decreases conspicuously in the test phase as the standard deviation σ
(test)
Vth
increases. When variations
15
Figure 11: Classification accuracy of 2-layer SNN (784-500-10) with various variations of the firing threshold
σ
(train)
Vth
at the training phase. The horizontal axis represents the value of σ
(test)
Vth
at the test phase. Each
plotted point and error bar are the average and standard deviation of the results of ten trials, respectively.
The hyperparameters are η = 1500, t(Ref) = 21 ms, γ = 100, and  = 1.
are assumed in the training phase (σ
(train)
Vth
= 0.05, 0.1, 0.15) corresponding to case (i) in 4.2, the degradation
of the classification accuracy can be suppressed to an extent. Especially, the suppression is effective when
the variation in the test phase is almost the same as that assumed in the training phase (σ
(train)
Vth
; σ(test)Vth ).
When the values of all Vth are known corresponding to case (ii) in 4.2, the degradation of the classification
accuracy can be well suppressed.
Next, the results of classification accuracy for various standard deviations of spike delays at the test
phase σ
(test)
τ are shown in Fig. 12. As in the case of variations of firing threshold Vth in Fig. 11, when
no variation is assumed at the training phase (σ
(train)
τ = 0), the classification accuracy decreases as the
standard deviation σ
(test)
τ at the test phase increases. While, when variations are assumed at the training
phase (σ
(train)
τ = 0, 1 ms, 2 ms) corresponding to case (ii) in 4.2, the degradation of classification accuracy
can be suppressed to some extent. Especially, the suppression efficiently works when the variation at the
test phase is almost the same as that assumed at the training phase (σ
(train)
τ ; σ(test)τ ). When values of
all τ
(l)
ij are known corresponding to case (ii) in 4.2, the degradation of classification accuracy can be well
suppressed.
The series of numerical simulations shown above indicates that our proposed algorithm is robust against
the device manufacturing variations. This fact paved a way to manufacturing VLSIs implementing the
proposed algorithm.
6 Concluding remarks
In this study, we pursued the goal of achieving both high algorithmic performance and high energy efficiency
by proposing a novel supervised learning algorithm for multilayer SNNs which can be implemented in VLSI
circuits with analog resistive memory. We also proposed novel learning techniques such as adding a temporal
penalty term to improve the learning performance.
Although the performance of the proposed algorithm on the MNIST dataset was slightly worse than SNNs
based on rate coding and ANNs, the performance was confirmed to be as high as the state-of-the-art SNNs
based on temporal coding. We showed that VLSI circuits can be designed with analog resistive memory
in a simple and straightforward manner. We used numerical simulation to investigate the extent to which
the learning performance depended on the circuit design. We also showed that the effects of manufacturing
16
Figure 12: Classification accuracy of 2-layer SNN (784-500-10) with various variations of spike delays σ
(train)
τ
at the training phase. The horizontal axis represents the value of standard deviation of σ
(test)
τ at the test
phase. Each plot and error bar are the average and standard deviation of the results of ten trials, respectively.
The hyper parameters are η = 1500, t(Ref) = 21 ms, γ = 100, and  = 10.
variations, which are unavoidable for real VLSI implementations, can be suppressed by estimating the
magnitude of variations or by observing the exact values of these parameters. Therefore, we achieved our
goal to a certain extent. The next step will be the fabrication of VLSI chips based on the circuits and
algorithms proposed in this paper.
In this study, we focused on the development of energy-efficient algorithms for feedforward networks as
a first step. Recently, it was shown that SNNs with recurrent connections can process time-series data with
high performance [58–60]. For various applications of edge computing, the development of energy-efficient
and hardware-friendly learning algorithms for SNN with recurrent connections is strongly anticipated. Thus,
the development of these algorithms should be the next goal. Finally, we would like to remark that recent
advances in biologically plausible and spike-based BP [31, 61–65] are foreseen to be beneficial to construct
energy-efficient systems including the training process [66].
Appendix: Back propagation for case (22)
Here, we redefine the indices of the neurons to satisfy their indices to be sorted in the order of their spike
time as
tˆ
(l−1)
1 ≤ tˆ(l−1)2 ≤ · · · ≤ tˆ(l−1)N(l−1) . (36)
Then, we obtain the corresponding sorted weights
wˆ
(l)
i1 , wˆ
(l)
i2 , . . . , wˆ
(l)
iN(l)
. (37)
The time evolutions of the membrane potentials are piecewise such that they can be represented as the
corresponding equations. For the ith neuron in the lth layer, from the moment when it receives the kth
earliest spike to when it receives the (k + 1)th earliest spike, the time evolution of its membrane potential
v
(l),k
i is described as
d
dt
v
(l),k
i (t) =
k∑
j=1
wˆ
(l)
ij
(
1− v(l),ki V ijpulse
)
, (38)
for tˆ
(l−1)
k ≤ t < tˆ(l−1)k+1 , (k = 1, 2, . . . , N (l−1)). (39)
17
Figure 13: Schematic diagram of the time evolution of the membrane potential.
The terminal values of v
(l),k
i are denoted by v¯
(l),k
i as follows:
v¯
(l),k
i :=
v
(l),k
i
(
tˆ
(l−1)
k+1
)
, for k < G
(l)
i
v
(l),G
i
(
t
(l)
i
)
, for k = G
(l)
i
(40)
where G
(l)
i represents the number of received spikes before the ith neuron fires, which can take different
values for different neurons. Using the above definitions, we can obtain an analytical form of the membrane
potential as follows:
v¯
(l),k
i =

A
(l)
i1
B
(l)
i1
− A
(l)
i1
B
(l)
i1
exp
(
−B(l)i1 ∆tˆ(l)1
)
, for k = 1
A
(l)
ik
B
(l)
ik
−
(
A
(l)
ik
B
(l)
ik
− v¯(l),k−1i
)
exp
(
−B(l)ik ∆tˆ(l)k
)
,
for k = 2, 3, . . . , G
(l)
i
(41)
where
A
(l)
ik :=
k∑
j=1
wˆ
(l)
ij B
(l)
ik :=
k∑
j=1
wˆ
(l)
ij V
ij
pulse (42)
∆tˆ
(l−1)
k :=
{
tˆ
(l−1)
k+1 − tˆ(l−1)k , for k = 1, 2, . . . , G(l)i − 1
t
(l)
i − tˆ(l−1)k , for k = G(l)i .
(43)
Furthermore, the spike timing of the ith neuron in the lth layer is obtained as follows:
t
(l)
i =

tˆ
(l−1)
G
(l)
i
− 1
B
(l)
iG
(l)
i
ln
(
1−
B
(l)
iG
(l)
i
A
(l)
iG
(l)
i
Vth
)
, for G
(l)
i = 1
tˆ
(l−1)
G
(l)
i
− 1
B
(l)
iG
(l)
i
ln
A
(l)
iG
(l)
i
B
(l)
iG
(l)
i
−Vth
A
(l)
iG
(l)
i
B
(l)
iG
(l)
i
−v¯(l),G
(l)
i
−1
i

, for G
(l)
i > 1.
(44)
Fig. 13 shows a schematic diagram of the time evolution of the membrane potential v
(l),k
i (t) .
18
Similar to (10)–(15), the back propagation rule is then obtained by
∂C
∂w
(l)
jk
=
∂v¯
(l),G
(l)
i
j
∂w
(l)
jk
δ
(l)
j (45)
δ(h)s =

∂t(h)s
∂v¯
(h),G
(l)
i
s
∑N(h+1)
i
∂v¯
(h+1),G
(l)
i
i
∂t
(h)
s
δ
(h+1)
i ,
for h = 1, 2, . . . ,M − 1
∂t(M)s
∂v¯
(M),G
(l)
i
s
∂C
∂t
(M)
s
, for h = M
(46)
∂C
∂t
(M)
i
= κi − Si + γ
(
t
(M)
i − t(Ref)
)
. (47)
Replacing {t(l)i } and {w(l)ij } with the sorted expressions {tˆ(l)i } and {wˆ(l)ij } given by (36) and (37), the derivatives
on the right-hand sides of the above equations are obtained as follows:
for (45),
∂v¯
(l),G
(l)
i
i
∂wˆ
(l)
ij
=

∑G(l)i
P=j
∂v¯
(l),P
i
∂wˆ
(l)
ij
δ
v
(l)
i
P , for j = 1, 2, . . . , G
(l)
i
0, for j = G
(l)
i + 1, G
(l)
i + 2, . . . , N
(l−1).
(48)
In addition, for (46),
δ
v
(l)
i
j =

∏G(l)i −1
q=j
∂v¯
(l),q+1
i
∂v¯
(l),q
i
=
∂v¯
(l),j+1
i
∂v¯
(l),j
i
δ
v
(l)
i
j+1
for j = 1, 2, . . . , G
(l)
i − 1
1, for j = G
(l)
i
(49)
∂v¯
(l),G
(l)
i
i
∂tˆ
(l−1)
j
=

∂v¯
(l),j
i
∂tˆ
(l−1)
j
δ
v
(l)
i
j , for j = 1, 2, . . . , G
(l)
i
0, for j = G
(l)
i + 1, G
(l)
i + 2, . . . , N
(l−1)
(50)
∂t
(l)
i
∂v¯
(l)
i
= −
∂v¯(l),G(l)ii
∂t
−1
=

−1
A
(l)
iG
(l)
i
−B(l)
iG
(l)
i
Vth
, when G
(l)
i ≥ 1
−1
A
(l)
iG
(l)
i
−B(l)
iG
(l)
i
Vth
, when G
(l)
i = 0.
(51)
Note that the above equations take the value of 0 when the ith neuron does not fire and (51) is derived from
SpikeProp [36]. For the calculation of (48), we obtain the derivatives as follows:
∂v¯
(l),k
i
∂wˆ
(l)
ij
=

D
(l),k
ij
(
1− exp
(
−B(l)ik ∆tˆ(l−1)k
))
+
(
A
(l)
ik
B
(l)
ik
− v¯(l),k−1i
)
∆tˆ
(l−1)
k V
ij
pulse
× exp
(
−B(l)ik ∆tˆ(l−1)k
)
,
for k = 1, 2, . . . , G
(l)
i
D
(l),k
ij
(
1− exp
(
−B(l)ik ∆tˆ(l−1)k
))
+
A
(l)
ik
B
(l)
ik
∆tˆ
(l−1)
k V
ij
pulse
× exp
(
−B(l)ik ∆tˆ(l−1)k
)
,
for k = 0
(52)
19
where D
(l),k
ij is defined as
D
(l),k
ij :=
∂
∂wˆ
(l)
ij
(
A
(l)
k
B
(l)
k
)
=
B
(l)
k −A(l)k V ijpulse
B
(l)2
k
,
for j = 1, 2, . . . , k. (53)
Further, for (49),
∂v¯
(l),k+1
i
∂v¯
(l),k
i
= exp
(
−∆tˆ(l−1)k+1 B(l)k+1
)
,
for k = 1, 2, . . . , G
(l)
i − 1. (54)
Furthermore, (50) and (51) can be calculated with
∂v¯
(l),k
i
∂tˆ
(l−1)
k
=

∂v¯
(l),k−1
i
∂tˆ
(l−1)
k
exp
(
−B(l)ik ∆tˆ(l−1)k
)
−
(
A
(l)
ik −B(l)ik v¯(l),k−1i
)
× exp
(
−B(l)ik ∆tˆ(l−1)k
)
,
for k = 1, 2, . . . , G
−A(l)i0 exp
(
−B(l)i0 ∆tˆ(l−1)0
)
,
for k = 0
(55)
∂v¯
(l),k−1
i
∂tˆ
(l−1)
k
=

(
A
(l)
ik−1 −B(l)ik−1v¯(l),k−2i
)
× exp
(
−B(l)ik−1∆tˆ(l−1)k−1
)
,
for k = 2, . . . , G
A
(l)
i0 exp
(
−B(l)i0 ∆tˆ(l−1)0
)
,
for k = 1.
(56)
As described above, the BP for case (22) is obtained.
Acknowledgment
This work is partially supported by the “Brain-Morphic AI to Resolve Social Issues” project and by NEC
Corporation.
References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. 2012.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, Vol. 521, pp. 436–444,
2015.
[3] S. Amari. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, Vol.
EC-16, No. 3, pp. 299–307, June 1967.
[4] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-
propagating errors. Nature, Vol. 323, pp. 533–536, 1986.
[5] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hub-
bard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural
computation, Vol. 1, No. 4, pp. 541–551, 1989.
20
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016.
[7] Gordon E Moore. Cramming more components onto integrated circuits. Electron. Mag, Vol. 38, No. 8,
pp. 114–117, April 1965.
[8] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of ion-implanted
MOSFET’s with very small physical dimensions. IEEE Journal of Solid-State Circuits, Vol. 9, No. 5,
pp. 256–268, Oct 1974.
[9] N. P. Jouppi, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE
44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12, June 2017.
[10] V. Sze, Y. Chen, T. Yang, and J. S. Emer. Efficient processing of deep neural networks: A tutorial and
survey. Proceedings of the IEEE, Vol. 105, No. 12, pp. 2295–2329, Dec 2017.
[11] W. Maass. To spike or not to spike: That is the question. Proceedings of the IEEE, Vol. 103, No. 12,
pp. 2219–2224, Dec 2015.
[12] Amirhossein Tavanaei, Masoud Ghodrati, Saeed Reza Kheradpisheh, Timothe Masquelier, and Anthony
Maida. Deep learning in spiking neural networks. Neural Networks, Vol. 111, pp. 47–63, 2019.
[13] Michael Pfeiffer and Thomas Pfeil. Deep learning with spiking neurons: Opportunities and challenges.
Frontiers in Neuroscience, Vol. 12, No. 774, pp. 1–18, 2018.
[14] Shih-Chii Liu, Tobi Delbruck, Giacomo Indiveri, Adrian Whatley, and Rodney Douglas. Event-based
neuromorphic systems. John Wiley & Sons, 2015.
[15] Ning Qiao, Hesham Mostafa, Federico Corradi, Marc Osswald, Fabio Stefanini, Dora Sumislawska, and
Giacomo Indiveri. A reconfigurable on-line learning spiking neuromorphic processor comprising 256
neurons and 128k synapses. Frontiers in Neuroscience, Vol. 9, No. 141, pp. 1–17, 2015.
[16] S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri. A scalable multicore architecture with heterogeneous
memory structures for dynamic neuromorphic asynchronous processors (dynaps). IEEE Transactions
on Biomedical Circuits and Systems, Vol. 12, No. 1, pp. 106–122, Feb 2018.
[17] J. Schemmel, D. Briiderle, A. Griibl, M. Hock, K. Meier, and S. Millner. A wafer-scale neuromorphic
hardware system for large-scale neural modeling. In Proceedings of 2010 IEEE International Symposium
on Circuits and Systems, pp. 1947–1950, May 2010.
[18] S. Friedmann, J. Schemmel, A. Grbl, A. Hartel, M. Hock, and K. Meier. Demonstrating hybrid learning
in a flexible neuromorphic hardware system. IEEE Transactions on Biomedical Circuits and Systems,
Vol. 11, No. 1, pp. 128–142, Feb 2017.
[19] Paul A. Merolla, et al. A million spiking-neuron integrated circuit with a scalable communication
network and interface. Science, Vol. 345, No. 6197, pp. 668–673, 2014.
[20] M. Davies, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro,
Vol. 38, No. 1, pp. 82–99, January 2018.
[21] Catherine D. Schuman, et al. A survey of neuromorphic computing and neural networks in hardware.
arXiv:1705.06963, May 2017.
[22] Maxence Bouvier, et al. Spiking neural networks hardware implementations and challenges: A survey.
J. Emerg. Technol. Comput. Syst., Vol. 15, No. 2, pp. 22:1–22:35, April 2019.
[23] M. Horowitz. Computing’s energy problem (and what we can do about it). In 2014 IEEE International
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14, Feb 2014.
21
[24] N. Verma, H. Jia, H. Valavi, Y. Tang, M. Ozatay, L. Chen, B. Zhang, and P. Deaville. In-memory
computing: Advances and prospects. IEEE Solid-State Circuits Magazine, Vol. 11, No. 3, pp. 43–55,
Summer 2019.
[25] Hsinyu Tsai, Stefano Ambrogio, Pritish Narayanan, Robert M Shelby, and Geoffrey W Burr. Recent
progress in analog memory-based accelerators for deep learning. J. Phys. D: Appl. Phys., Vol. 51, No. 28,
p. 283001, Jun 2018.
[26] Daniele Ielmini and H-S Philip Wong. In-memory computing with resistive switching devices. Nature
Electronics, Vol. 1, pp. 333–343, June 2018.
[27] Qiangfei Xia and J Joshua Yang. Memristive crossbar arrays for brain-inspired computing. Nature
materials, Vol. 18, No. 4, pp. 309–323, March 2019.
[28] Emre Ozgur Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking
neural networks. arXiv:1901.09948, Jan 2019.
[29] W. Maass. “Computing with Spiking Neurons,” in Pulsed Neural Networks, chapter 2, pp. 55–85. MIT
Press, 1999.
[30] Filip Ponulak and Andrzej Kasiski. Supervised learning in spiking neural networks with ReSuMe:
Sequence learning, classification, and spike shifting. Neural Computation, Vol. 22, No. 2, pp. 467–510,
February 2010.
[31] Emre O. Neftci, Charles Augustine, Somnath Paul, and Georgios Detorakis. Event-driven random back-
propagation: Enabling neuromorphic deep learning machines. Frontiers in Neuroscience, Vol. 11, No.
324, pp. 1–16, June 2017.
[32] Jun Haeng Lee, Tobi Delbruck, and Michael Pfeiffer. Training deep spiking neural networks using
backpropagation. Frontiers in Neuroscience, Vol. 10, No. 508, pp. 1–12, November 2016.
[33] Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multilayer spiking neural
networks. Neural computation, Vol. 30, No. 6, pp. 1514–1541, June 2018.
[34] P. U. Diehl, D. Neil, J. Binas, M. Cook, S. Liu, and M. Pfeiffer. Fast-classifying, high-accuracy spiking
deep networks through weight and threshold balancing. In 2015 International Joint Conference on
Neural Networks (IJCNN), pp. 1–8, July 2015.
[35] Steve K Esser, Rathinakumar Appuswamy, Paul Merolla, John V. Arthur, and Dharmendra S Modha.
Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Pro-
cessing Systems, pp. 1117–1125. 2015.
[36] Sander M. Bohte, Joost N. Kok, and Han La Poutr. Error-backpropagation in temporally encoded
networks of spiking neurons. Neurocomputing, Vol. 48, No. 1, pp. 17–37, October 2002.
[37] H. Mostafa. Supervised learning based on temporal coding in spiking neural networks. IEEE Transac-
tions on Neural Networks and Learning Systems, Vol. 29, No. 7, pp. 3227–3235, July 2018.
[38] Iulia M Comsa, Krzysztof Potempa, Luca Versari, Thomas Fischbacher, Andrea Gesmundo, and Jyrki
Alakuijala. Temporal coding in spiking neural networks with alpha synaptic function. arXiv:1907.13223,
August 2019.
[39] G. W. Burr, et al. Experimental demonstration and tolerancing of a large-scale neural network (165 000
synapses) using phase-change memory as the synaptic weight element. IEEE Transactions on Electron
Devices, Vol. 62, No. 11, pp. 3498–3507, November 2015.
[40] Shinhyun Choi, Jong Hoon Shin, Jihang Lee, Patrick Sheridan, and Wei D Lu. Experimental demon-
stration of feature extraction and dimensionality reduction using memristor networks. Nano letters,
Vol. 17, No. 5, pp. 3113–3118, April 2017.
22
[41] F Merrikh Bayat, M. Prezioso, B. Chakrabarti, H. Nili, I. Kataeva, and D. Strukov. Implementation
of multilayer perceptron network with highly uniform passive memristive crossbar circuits. Nature
communications, Vol. 9, No. 2331, June 2018.
[42] Can Li, et al. Analogue signal and image processing with large memristor crossbars. Nature Electronics,
Vol. 1, pp. 52–59, June 2018.
[43] Fuxi Cai, et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–
accumulate operations. Nature Electronics, Vol. 2, pp. 290–299, July 2019.
[44] Vishnu Ravinuthula, Vaibhav Garg, John G. Harris, and Jos A. B. Fortes. Time-mode circuits for
analog computation. Int. J. Circ. Theor. Appl., Vol. 37, No. 5, pp. 631–659, 2009.
[45] Ali Shafiee, et al. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in
crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture, pp. 14–26,
Piscataway, NJ, USA, 2016. IEEE Press.
[46] M. J. Marinella, et al. Multiscale co-design analysis of energy, latency, area, and accuracy of a ReRAM
analog neural training accelerator. IEEE Journal on Emerging and Selected Topics in Circuits and
Systems, Vol. 8, No. 1, pp. 86–101, March 2018.
[47] Stefano Ambrogio, et al. Equivalent-accuracy accelerated neural-network training using analogue mem-
ory. Nature, Vol. 558, pp. 60–67, June 2018.
[48] Quan Wang, Hakaru Tamukoh, and Takashi Morie. Time-domain weighted-sum calculation for ulti-
mately low power VLSI neural networks. In Neural Information Processing, Vol. 9947, pp. 240–247,
September 2016.
[49] Quan Wang, Hakaru Tamukoh, and T Morie. A time-domain analog weighted-sum calculation model for
extremely low power VLSI implementation of multi-layer neural networks. arXiv:1810.06819, October
2018.
[50] M. Bavandpour, M. R. Mahmoodi, and D. B. Strukov. Energy-efficient time-domain vector-by-matrix
multiplier for neurocomputing and beyond. IEEE Transactions on Circuits and Systems II: Express
Briefs, Vol. 66, No. 9, pp. 1512–1516, September 2019.
[51] Masatoshi Yamaguchi, Goki Iwamoto, Hakaru Tamukoh, and Takashi Morie. An energy-efficient
time-domain analog VLSI neural network processor based on a pulse-width modulation approach.
arXiv:1902.07707, February 2019.
[52] S. R. Nandakumar, Irem Boybat, Manuel Le Gallo, Evangelos Eleftheriou, Abu Sebastian, and Bipin
Rajendran. Supervised learning in spiking neural networks with phase-change memory synapses.
arXiv:1905.11929, May 2019.
[53] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In
Proceedings of the 27th International Conference on International Conference on Machine Learning, pp.
807–814, USA, June 2010. Omnipress.
[54] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recog-
nition. Proceedings of the IEEE, Vol. 86, No. 11, pp. 2278–2324, November 1998.
[55] D. Miyashita, S. Kousai, T. Suzuki, and J. Deguchi. A neuromorphic chip optimized for deep learning
and CMOS technology with time-domain analog and digital mixed-signal processing. IEEE Journal of
Solid-State Circuits, Vol. 52, No. 10, pp. 2679–2689, October 2017.
[56] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, Vol. 15, No. 1, pp. 1929–1958, January 2014.
23
[57] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks
using dropconnect. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28,
pp. 1058–1066, Jun 2013.
[58] Wilten Nicola and Claudia Clopath. Supervised learning in spiking neural networks with force training.
Nature communications, Vol. 8, No. 1, p. 2208, 2017.
[59] Priyadarshini Panda and Narayan Srinivasa. Learning to recognize actions from limited training ex-
amples using a recurrent spiking neural model. Frontiers in Neuroscience, Vol. 12, No. 126, pp. 1–14,
2018.
[60] Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long
short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Infor-
mation Processing Systems, pp. 787–797. 2018.
[61] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic
feedback weights support error backpropagation for deep learning. Nature communications, Vol. 7,
No. 1, p. 13276, November 2016.
[62] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in
Neural Information Processing Systems, pp. 1037–1045. 2016.
[63] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated
dendrites. eLife, Vol. 6, pp. 1–37, 2017.
[64] Guillaume Bellec, Franz Scherr, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass.
Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets.
arXiv:1901.09049, January 2019.
[65] James C.R. Whittington and Rafal Bogacz. Theories of error back-propagation in the brain. Trends in
Cognitive Sciences, Vol. 23, No. 3, pp. 235–250, 2019.
[66] J. Park, J. Lee, and D. Jeon. A 65nm 236.5nj/classification neuromorphic processor with 7.5% energy
overhead on-chip learning using direct spike-only feedback. In 2019 IEEE International Solid-State
Circuits Conference, pp. 140–142, Feb 2019.
24
