Training spiking multi-layer networks with surrogate gradients on an
  analog neuromorphic substrate by Cramer, Benjamin et al.
B. Cramer*1, S. Billaudelle*1, S. Kanya1, A. Leibfried1,
A. Grübl1, V. Karasenko1, C. Pehle1, K. Schreiber1, Y. Stradmann1, J. Weis1,
J. Schemmel1, and F. Zenke2
1 Kirchhoff-Institute for Physics, Heidelberg University, Germany
2 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
{benjamin.cramer,sebastian.billaudelle,schemmel}@kip.uni-heidelberg.de,
friedemann.zenke@fmi.ch
* These authors contributed equally.
Abstract
Spiking neural networks are nature’s solution for parallel information processing
with high temporal precision at a low metabolic energy cost. To that end, biological
neurons integrate inputs as an analog sum and communicate their outputs digitally
as spikes, i.e., sparse binary events in time. These architectural principles can
be mirrored effectively in analog neuromorphic hardware. Nevertheless, training
spiking neural networks with sparse activity on hardware devices remains a major
challenge. Primarily, this is due to the lack of suitable training methods that
efficiently take into account device-specific imperfections and operate at the level
of individual spikes instead of firing rates. To tackle this issue, we developed a
hardware-in-the-loop strategy to train multi-layer spiking networks using surrogate
gradients on the BrainScaleS-2 analog neuromorphic platform. Specifically, we
used the hardware to compute the forward pass of the network, while the backward
pass was computed in software. We evaluated our approach on downscaled 16×16
versions of the MNIST and the Fashion MNIST datasets in which spike latencies
encoded pixel intensities. The analog neuromorphic substrate closely matched
the performance of equivalently sized networks implemented in software. It is
capable of processing 70 k patterns per second with a power consumption of less
than 300 mW. Added activity regularization resulted in sparse network activity
with about 20 spikes per input, at little to no reduction in classification performance.
Thus, overall, our work demonstrates low-energy spiking network processing on
an analog neuromorphic substrate and sets several new benchmarks for hardware
systems in terms of classification accuracy, processing speed, and efficiency. Importantly,
our work emphasizes the value of hardware-in-the-loop training and paves
the way toward energy-efficient information processing on non-von-Neumann
architectures.
Preprint. Under review. arXiv:2006.07239v1 [cs.NE] 12 Jun 2020
1 Introduction
The brain processes information reliably in a parallel and energy-efficient way in vast, intricately
connected spiking neural networks (SNNs). Despite tremendous progress in building computationally
powerful deep artificial neural networks (ANNs), the biological wetware remains unchallenged
in terms of energy-efficiency and fault tolerance. A key architectural property facilitating these
capabilities is sparseness [1, 2], which is expressed at the population level and in time. The underlying
communication is based on binary messages. Neurons receive and integrate these so-called spikes on
their analog membrane potentials and emit a spike themselves, as soon as a certain firing threshold is
reached. By combining the efficiency of analog computation with binary communication, they foster
sparseness and robustness. Analog neuromorphic devices emulate SNNs in an attempt to mirror
these two aspects of neurobiology [3–8]. Training SNNs to perform complex information processing,
however, has remained a major obstacle [9–11]. This difficulty is mainly due to the binary nature of
spikes, which precludes the use of standard gradient-based optimization [12, 13]. Moreover, when a
multi-layer SNN is first trained in software and then uploaded to an analog neuromorphic substrate,
it normally suffers from a significant loss of performance. Among other factors, this is caused by
device mismatch, the combined effect of small manufacturing imperfections in the analog circuitry,
which has to be taken into account when training in software.
In this article, we solve this problem with hardware in-the-loop (ITL) training whereby we emulate the
forward pass of an SNN in dedicated mixed-signal neuromorphic hardware. The weight updates for
supervised learning are then calculated in software using surrogate gradients. Our contributions are
the following: First, we explain our efficient ITL training approach to build temporally coding SNNs
on analog neuromorphic substrates. Second, we demonstrate the effectiveness of our approach on
the BrainScaleS-2 neuromorphic hardware system and show that ITL training mitigates the problem
of device mismatch almost entirely. Finally, we show that the resulting SNNs perform with high
accuracy on standard classification problems while processing tens of thousands of inputs per second
on a power budget of less than 300 mW.
2 Previous work
Building deep SNNs that approximate the sophisticated computational capabilities of biological
neural networks while mirroring their sparse activity and energy efficiency is a great outstanding
challenge. Emulating such networks on non-von-Neumann architectures requires addressing additional
hurdles. Tackling these problems calls for both suitable training algorithms and adapting them to
neuromorphic hardware.
Training algorithms for deep SNNs. The problem of building deep SNNs has been approached
from different angles. A plethora of studies focused on unsupervised learning rules inspired by
biology [9]. However, their success in comparison to supervised learning has remained limited.
Supervised learning in deep SNNs requires solving both a spatial and temporal credit assignment
problem. To that end, many training strategies attempt to translate the success of gradient-based
learning to the realm of SNNs [10, 11]. We distinguish between direct and indirect methods. The most
successful indirect approach is network translation in which one first trains a non-spiking ANN and
then translates it into an SNN [14–16]. However, the resulting networks typically rely on rate-coding
and do not take advantage of spike timing. Direct training refers to methods that operate on the
parameters of the SNN themselves. We further subdivide them into smoothing, firing-time gradient,
and surrogate gradient approaches. First, smoothing approaches attempt to render the forward model
differentiable by either introducing graded spikes [17] or stochasticity [18, 12, 19–22]. However,
adding stochasticity typically requires averaging to reduce the variance of the gradient estimates by
relying on firing rates or population coding [23–26, 22, 27]. Second, firing-time gradients can be
adopted in cases in which the neuronal membrane potential [28, 29] or firing times can be expressed
analytically in closed form [30–34]. This method leaves the forward model unchanged and permits
exploiting temporal coding, but needs additional mechanisms to deal with quiescent neurons or multiple
spikes. In multi-layer networks, it has only been demonstrated in conjunction with time-to-first-spike
coding schemes. Finally, surrogate gradient approaches do not add such constraints while also
leaving the forward model unchanged. Instead, they directly apply approximations to the gradients, a
notion inspired by work on binary ANNs [12, 13]. Surrogate gradient approaches are surprisingly
successful in training SNNs that take advantage of spike timing [35–40, 11] without explicitly having
to specify a coding strategy. In this article, we report the first application of surrogate gradients to an
analog neuromorphic substrate.
Neuromorphic substrates for SNNs. A body of work has attempted to emulate the key architectural
components of biological SNNs in electronics. Such hardware platforms can be coarsely divided
into digital [41–44] and analog [45, 7] architectures, although the latter typically involve mixed-signal
interfaces and peripheral logic. Digital systems attempt to efficiently simulate neuronal dynamics by
optimizing numerical integration steps and the data flow between individual components. Analog
designs, on the other hand, exploit the particular properties of individual devices such as transistors
to emulate the underlying model dynamics physically. Thus, neuronal state variables are explicitly
represented in the form of voltages and currents evolving in continuous time. In this article, we
employed the BrainScaleS-2 system, which is an analog neuromorphic hardware platform.
[Figure 1 image; panels A to E; labels: CADC, PPU, signed synapse, ∆w, forward, backward, input spikes, analog traces and spikes; layers: input, hidden, output; image]
Figure 1: Surrogate gradient learning on BrainScaleS-2. (A) Close-up of the BrainScaleS-2
ASIC. (B) Photograph of the full system and an oscilloscope showing a membrane trace during
training of the SNN. (C) Implementation of a multi-layer network on the analog neuromorphic core.
Input spike trains are injected via synapse drivers (triangles) and relayed to the hidden layer neurons
(green circles) via the synapse array. Spikes in the hidden layer are routed on-chip to the output
units (gray circles). Each connection is represented by a pair of excitatory and inhibitory hardware
synapses, which holds a signed weight value. The analog membrane potentials are read out via the
CADC and further processed by the PPU. (D) Illustration of our ITL training scheme. The forward
pass is emulated on the BrainScaleS-2 ASIC. Observables from the neuromorphic substrate as well
as the input spike trains are processed on a conventional computer to perform the backward pass. The
calculated weight updates are then written to the neuromorphic system. (E) Network structure of the
multi-layer SNN in which input spike latencies were derived from image pixel intensities (bottom).
3 Neuromorphic spiking neural network model
We implemented multi-layer SNNs on the analog BrainScaleS-2 single chip system (Fig. 1A), which
features 512 analog neuron circuits emulating the leaky integrate-and-fire (LIF) dynamics
$$C \frac{dV}{dt} = -g_l (V - E_l) + I_\text{syn} , \tag{1}$$
where the membrane potential V corresponds to the neuron’s internal state. V is explicitly represented
on chip as an analog voltage measured across a capacitor and evolves continuously in time (Fig. 1B).
The leak conductance gl causes an exponential decay to the leak potential El (in the following
set to zero without loss of generality) with the time constant τmem ≡ C/gl ≈ 6 µs. Due to the
substrate’s small intrinsic capacitances and comparatively large gl, the dynamics of the spiking
neurons implemented on the BrainScaleS-2 ASIC evolve 10^3 times faster than biological neurons. For
fine-grained control, all parameters such as reference potentials and time constants can be individually
configured and tuned on a per neuron basis (Appendix, Tab. 2). Finally, in accordance with the LIF
model, the membrane potential V is reset and an outgoing spike is emitted, when V crosses the
firing threshold ϑ. Additional features of the circuit, such as adaptation currents or an exponential
non-linearity were disabled throughout all our experiments for the sake of simplicity.
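The LIF dynamics of Eq. 1 can be illustrated with a short forward-Euler simulation. This is a minimal software sketch and not a model of the BrainScaleS-2 circuits: with El = 0, dividing Eq. 1 by C gives dV/dt = −V/τmem + Isyn/C, and only τmem ≈ 6 µs is taken from the text. The step size, threshold, reset value, and constant drive are illustrative assumptions.

```python
def simulate_lif(drive, tau_mem=6e-6, dt=0.1e-6, threshold=1.0, v_reset=0.0):
    """Forward-Euler integration of Eq. 1 with E_l = 0.

    drive[k] is I_syn / C at step k (volts per second).
    All values except tau_mem are illustrative assumptions.
    Returns the membrane trace and the step indices of emitted spikes.
    """
    v = 0.0
    trace, spikes = [], []
    for step, i_over_c in enumerate(drive):
        # dV/dt = -V / tau_mem + I_syn / C
        v += dt * (-v / tau_mem + i_over_c)
        if v >= threshold:
            # threshold crossing: emit a spike and reset the membrane
            spikes.append(step)
            v = v_reset
        trace.append(v)
    return trace, spikes


# Constant drive whose steady state (drive * tau_mem = 2.0) lies above threshold,
# so the neuron fires repeatedly within the 20 µs window.
trace, spikes = simulate_lif([2.0 / 6e-6] * 200)
```

Without input the membrane simply stays at the leak potential, so the sketch emits no spikes for a zero drive.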
Each neuron circuit receives stimuli via an attached column of 256 input synapses, each of which has
a weight with 6 bit resolution. The resulting postsynaptic currents Isyn, which are integrated on the
membrane capacitor, also follow an exponential time course similar to the membrane itself. In the
synaptic connectivity matrix, the sign of a synapse is determined as a row-wise property. To allow
for a continuous transition between positive and negative weights during training, we represent each
connection through two merged synapse circuits of opposing signs (Fig. 1C). This choice resulted in
a reduction to 128 possible presynaptic partners per neuron, which could be mitigated by combining
two neuron circuits to form a single, larger logical neuron with a restored fan-in of 256 presynaptic
partners [46]. The propagation of spikes is achieved by an on-chip event routing system, which
allows connecting external stimuli as well as neuromorphic neurons in feed-forward and recurrent
topologies.
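The signed-weight representation described above can be sketched as a mapping from one signed integer to a pair of non-negative 6 bit values. The convention used here (only one synapse of the pair is non-zero, and out-of-range values are clipped) is an assumption for illustration; the text does not specify how the two merged synapse circuits share a weight, and the function names are hypothetical.

```python
W_MAX = 2**6 - 1  # largest value of a 6 bit synaptic weight (63)


def split_signed_weight(w):
    """Map a signed weight to an (excitatory, inhibitory) pair of 6 bit values.

    Assumed convention: positive weights use the excitatory synapse,
    negative weights the inhibitory one, and the other synapse is zero.
    """
    w = max(-W_MAX, min(W_MAX, int(round(w))))
    return (w, 0) if w >= 0 else (0, -w)


def merge_pair(exc, inh):
    """Effective signed weight represented by a synapse pair."""
    return exc - inh
```

For example, `split_signed_weight(-5)` yields `(0, 5)`, and `merge_pair(0, 5)` recovers `-5`, so a weight can cross zero during training without reconfiguring the row sign.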
In addition, the BrainScaleS-2 ASIC houses freely programmable on-chip plasticity processing units
(PPUs) [47] to implement learning rules. To that end, the PPUs have access to various observables
that can be incorporated into weight update algorithms [48–50]. Access to analog state variables is
provided by column-parallel analog-to-digital converters (CADCs), which, most notably, allowed us
to digitize the analog membrane traces.
The full neuromorphic setup consists of the neuromorphic system itself and a field-programmable
gate array (FPGA) serving as an interface to a cluster of host computers (Fig. 1B). A custom software
stack provides device drivers and multiple layers of hardware abstraction for seamless interaction
with the neuromorphic substrate [51].
4 Surrogate gradient learning with hardware in the loop
We sought to efficiently solve spike-timing-based classification problems by training SNNs implemented
on the hardware with surrogate gradients. In software, surrogate gradients can be computed
with backpropagation through time (BPTT) by modifying auto-differentiation libraries to use a
surrogate for the derivative of the spike [37, 4]. However, current analog neuromorphic substrates
do not offer direct support for BPTT [11]. This lack motivated our ITL approach in which the
hardware is effectively used to implement the forward pass of training, whereas the backward pass
is evaluated using software on the host computer (Fig. 1D). There are two possible complications
that arise within this approach. First, computing surrogate gradients requires a model that closely
matches the behavior of the hardware. But, due to device mismatch, small deviations between the
two are unavoidable. Second, computing surrogate gradients for spiking neurons requires knowledge
of the temporal evolution of all membrane potentials V . Recording these analog traces from the
neuromorphic chip and transferring them to the host computers is therefore an essential step of ITL
training with surrogate gradients.
Reading out analog membrane traces. Since analog neuromorphic systems represent membrane
potentials as physical voltages, they are not directly accessible for further numerical computation.
Most platforms do not allow digitizing the required data in an efficient manner. On BrainScaleS-2,
the CADCs can be used to simultaneously digitize the membrane potentials of all neurons (Fig. 1C)
with a conversion rate of approximately 2 MSample/s. Thus, they allow capturing the membrane
dynamics with a high level of temporal detail despite the accelerated emulation speed. However, the
resulting data rates of up to 8 Gbit/s for the full chip exceed the bandwidth between the FPGA and
the host. To ensure precise timing and to fully exploit the available conversion rate, we employed
the on-chip PPU to read out and package the CADC samples, instead of doing so directly from the
host computer. This embedded processor was programmed to trigger a predefined number of CADC
conversions in direct succession, concurrently with the presentation of an input pattern, and to
transfer the results to its main memory. This procedure reduces the communication overhead during
recording. After each presented pattern, the host machine asynchronously reads the samples for
further processing. In this setup, the sample duration is limited by the available memory of the PPU.
We hence presented inputs to the system individually or in small batches. It is important to note that
this limitation does not apply during inference, for which only the network’s final decision of the
readout layer has to be read out from the chip. Therefore, the SNN can take full advantage of the
system’s acceleration during inference.
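The order of magnitude of the full-chip data rate quoted above can be checked with a short calculation. The neuron count and conversion rate are taken from the text; the sample width of 8 bit is an assumption, since it is not stated in this section.

```python
n_neurons = 512        # analog neuron circuits on the chip (from the text)
sample_rate = 2e6      # approximate CADC conversions per second and neuron (from the text)
bits_per_sample = 8    # assumption: the sample width is not stated in this section

data_rate_bit_per_s = n_neurons * sample_rate * bits_per_sample
print(f"{data_rate_bit_per_s / 1e9:.1f} Gbit/s")  # prints 8.2 Gbit/s
```

Under this assumption the result is consistent with the stated rate of up to 8 Gbit/s, which motivates buffering the samples on the PPU instead of streaming them to the host.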
Computing hardware weight updates in software. We sought to calculate weight updates for
the hardware system based on BPTT and PyTorch’s auto-differentiation framework [52]. For this
reason, we require a differentiable computational graph representing the state and evolution of the
neuromorphic platform. To construct such a graph, a model of the system’s behavior is required
in which we can dynamically inject state variables such as spikes and membrane traces recorded
from the hardware substrate. For this model we relied on the regular LIF equation (Eq. 1), which
BrainScaleS-2 closely emulates. Since the CADCs sample the voltage V at discrete time intervals,
we formulated our model in discrete time with an equivalent time step ∆t. In the following, we
indicate modelled states with a tilde (e.g. $\tilde{V}$) to differentiate them from recorded quantities such as $V$.
The model of the membrane potential can be computed recursively by taking into account its temporal
decay and the calculated synaptic currents $\tilde{I}_\text{syn}[t]$ based on the presynaptic spikes $\tilde{S}_j[t]$ of neuron $j$:
$$\tilde{V}[t+1] = \tilde{V}[t] \cdot e^{-\Delta t/\tau_\text{mem}} + \tilde{I}_\text{syn}[t] , \qquad \tilde{I}_\text{syn}[t+1] = \tilde{I}_\text{syn}[t] \cdot e^{-\Delta t/\tau_\text{syn}} + \sum_j W_j \tilde{S}_j[t] . \tag{2}$$
However, $\tilde{V}$ and $\tilde{S}$ may deviate from their actual values $V$ and $S$ on the hardware due to noise
or device mismatch. Therefore it is desirable to use the actual values recorded from hardware
whenever possible and only rely on the model to calculate derivatives. To achieve this in PyTorch’s
auto-differentiation framework, we introduced an auxiliary identity function $f(x, \tilde{x}) \equiv x$ and defined
its surrogate derivatives $\partial f/\partial x = 0$ and $\partial f/\partial \tilde{x} = 1$. Eq. 2 can now be modified to
$$\tilde{V}[t+1] = f\left( V[t+1],\; \tilde{V}[t] \cdot e^{-\Delta t/\tau_\text{mem}} + \tilde{I}_\text{syn}[t] \right) . \tag{3}$$
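The auxiliary function f of Eq. 3 maps naturally onto a custom PyTorch autograd function. The following is a minimal sketch of that idea and not the authors' implementation: the class and function names are hypothetical, the tensor layout (time, neurons) is an assumption, and the step ∆t = 0.5 µs is assumed from the conversion rate of roughly 2 MSample/s.

```python
import math

import torch


class InjectRecorded(torch.autograd.Function):
    """f(x, x_model) = x with df/dx = 0 and df/dx_model = 1."""

    @staticmethod
    def forward(ctx, recorded, modelled):
        # the forward value is the quantity recorded from hardware
        return recorded.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # no gradient for the recording, pass-through for the model
        return None, grad_output


def model_membrane(v_recorded, i_syn, dt=0.5e-6, tau_mem=6e-6):
    """Recursion of Eq. 3 with recorded potentials injected at every step.

    v_recorded, i_syn: tensors of shape (time, neurons).
    dt is an assumed sampling interval; tau_mem follows the text.
    """
    decay = math.exp(-dt / tau_mem)
    v = v_recorded[0]
    trace = [v]
    for t in range(v_recorded.shape[0] - 1):
        v = InjectRecorded.apply(v_recorded[t + 1], v * decay + i_syn[t])
        trace.append(v)
    return torch.stack(trace)
```

In this sketch the returned trace is numerically identical to the recording, while gradients with respect to the synaptic currents propagate through the modelled decay term only.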
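A similar approach was taken for spike times by defining $\tilde{S}_j[t](S_j[t], \tilde{V}_j[t]) \equiv S_j[t]$ with associated
surrogate derivatives
$$\frac{\partial \tilde{S}_j[t]}{\partial S_j[t]} = 0 , \qquad \frac{\partial \tilde{S}_j[t]}{\partial \tilde{V}_j[t]} = \left( \beta \cdot |\tilde{V}_j[t] - \vartheta| \right)^{-2} , \tag{4}$$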
where β describes the steepness of the surrogate gradient [37]. Recalling Eq. 3, the actual value of
the potential is used to evaluate this expression, but the calculation of derivatives relies on the model.
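The spike surrogate can be sketched in the same way as a custom autograd function. This is an illustrative sketch under stated assumptions and not the authors' code: the names, the steepness value, and the threshold value are hypothetical. Note one deliberate deviation: the derivative below uses the fast-sigmoid form (β·|V − ϑ| + 1)^−2, which adds 1 inside the parentheses so that the value stays bounded at threshold, whereas Eq. 4 as printed omits that offset.

```python
import torch

BETA = 10.0       # steepness of the surrogate (illustrative value)
THRESHOLD = 1.0   # firing threshold in model units (illustrative value)


class SpikeWithSurrogate(torch.autograd.Function):
    """Forward: recorded spikes. Backward: surrogate derivative w.r.t. the potential."""

    @staticmethod
    def forward(ctx, spikes_recorded, v_model):
        ctx.save_for_backward(v_model)
        # the forward value is the spike train observed on hardware
        return spikes_recorded.clone()

    @staticmethod
    def backward(ctx, grad_output):
        (v_model,) = ctx.saved_tensors
        # assumed fast-sigmoid surrogate; the "+ 1" is not part of Eq. 4 as printed
        surrogate = (BETA * (v_model - THRESHOLD).abs() + 1.0) ** -2
        # zero derivative for the recorded spikes, surrogate for the potential
        return None, grad_output * surrogate
```

Because the modelled potential carries the recorded value in the forward pass (Eq. 3), the surrogate is evaluated at the measured membrane potential, as described in the text.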
Input spike-latency code. To feed standard vision benchmark data to the chip, we converted them
to input spikes on the host computer. To that end, we interpreted the normalized pixel grayscale
values xi as input currents to LIF neurons. If the current is strong enough to reach the firing threshold
of neuron i, the latency to the first spike is given by $t_i = \tau_\text{in} \log\left( x_i / (x_i - \vartheta_\text{in}) \right)$ with the membrane
time constant $\tau_\text{in}$ and the firing threshold $\vartheta_\text{in} = 0.2$ of the input neuron i. Specifically, we only used
the first firing time of each neuron as the input to the network resulting in input spike trains that were
sparse both in time and space (Fig. 2A).
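The latency code can be sketched in a few lines. The threshold of 0.2 is taken from the text; the time constant is set to 1 in arbitrary units as an illustrative assumption, and representing silent inputs by `None` is a choice made for this sketch only.

```python
import math


def spike_latency(x, tau_in=1.0, theta_in=0.2):
    """First-spike latency t = tau_in * log(x / (x - theta_in)).

    x is a normalized pixel intensity in [0, 1]. Pixels at or below the
    threshold never reach it and therefore emit no spike (None).
    tau_in = 1.0 is an arbitrary-unit assumption.
    """
    if x <= theta_in:
        return None
    return tau_in * math.log(x / (x - theta_in))


# bright, medium, and sub-threshold pixel
latencies = [spike_latency(x) for x in (1.0, 0.5, 0.1)]
```

Brighter pixels spike earlier, and dark pixels stay silent, which is what makes the resulting input spike trains sparse in both time and space.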
Max-over-time loss. We configured the readout units, which were emulated on hardware, as
non-spiking leaky integrators by disabling their spiking mechanism (Fig. 2A). This allowed us to define a
max-over-time loss based on their membrane potentials $V_i^L[t]$
$$L = \text{NLL}\left( \text{softmax}\left( \max_t V_i^L[t] \right), y^\star \right) , \tag{5}$$
with the negative log-likelihood NLL and the true labels $y^\star$ [53]. This loss was augmented by
regularization terms taking into account the absolute weights as well as the hidden layer spiking
activity (Appendix, A).
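For a single input, the max-over-time loss of Eq. 5 reduces to a few operations on the readout traces. The sketch below uses plain Python for clarity and omits the regularization terms; in an actual training loop the same computation would be expressed on differentiable tensors. The function name and data layout are assumptions.

```python
import math


def max_over_time_loss(traces, label):
    """Eq. 5 for one sample: NLL(softmax(max_t V_i[t]), label).

    traces[i][t] is the membrane potential of readout unit i at step t.
    """
    peaks = [max(trace) for trace in traces]
    # log-sum-exp with max subtraction for numerical stability
    shift = max(peaks)
    log_norm = shift + math.log(sum(math.exp(p - shift) for p in peaks))
    return -(peaks[label] - log_norm)


example_traces = [[0.0, 2.0, 1.0], [0.0, 0.5, 0.3]]
loss_correct = max_over_time_loss(example_traces, label=0)
loss_wrong = max_over_time_loss(example_traces, label=1)
```

The loss is small when the unit of the true class reaches the largest peak at any point in time, independent of when that peak occurs.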
5 Results: Training sparse spiking networks on analog hardware
We first evaluated our ITL training framework on a downscaled version of MNIST. To accommodate
the data to the available fan-in of 256 inputs, we reduced the original 28×28 images to 16×16 pixels
by discarding the two outermost rows and by rescaling the remaining pixels. The resulting images
were then converted into a spike-latency code of the 256 input units as explained above. In addition
to the 256 input units, our network had a single hidden layer of 118 neurons and
10 output neurons (Fig. 1E). Starting with a quiescent hidden layer, i.e. exhibiting zero spikes, we
trained the network ITL using the Adam optimizer (Appendix, B,C) [54]. After 50 epochs, the
network on the neuromorphic substrate developed useful hidden layer representations, as evidenced
by a reduction of the loss, and the fact that the maximally responsive output units encoded the correct
class membership (Fig. 2A,B). Note that the inhibition of neurons corresponding to incorrect labels
does not directly follow from the loss function (Eq. 5), but rather emerges as a consequence.
[Figure 2 image; panel A columns: images, input spikes, hidden spikes, output traces; panel B: loss and accuracy over epochs; panel C: test error (logarithmic axis) over time T (µs)]
Figure 2: Classification of the 16×16 MNIST dataset. (A) Three snapshots of the SNN activity,
consisting of the input images (left), spike raster of both the input spike trains and hidden layer
activity (middle), and readout neuron traces (right). The latter show a clear separation, and hence a
correct classification of the presented images. The model’s predicted class labels are indicated in the
colored circles. (B) Loss and accuracy over the course of 50 training epochs. (C) The classification
latency is determined by iteratively re-evaluating the max-over-time for output traces (see panel A)
restricted to a limited interval [0, T ].
On held out test data, the trained network achieved an overall accuracy of (96.2± 0.2) % (Table 1),
which closely approximated the performance of a software simulated SNN with the same number of
hidden neurons ((96.6± 0.1) %). To check whether data augmentation could further improve upon
these results, we generated augmented training data by randomly rotating images by up to ±10°. We
found that this indeed increased the performance of the hardware network to (96.7± 0.1) %.
The above results were obtained with a downscaled 16× 16 version of MNIST, instead of the original
dataset. As a baseline, we therefore wanted to assess the performance of a conventional non-spiking
ANN with rectified linear units (ReLUs) on the same data. Hence, we trained a larger 256-512-512-10
network in Keras [55] using the Adam optimizer, which resulted in (98.3± 0.1) % test accuracy. Similarly,
we trained and evaluated an ANN with the same number of neurons as our SNN (256-118-10), without
((97.8± 0.2) %) and with ((98.1± 0.1) %) data augmentation. Thus the hardware SNN performance
is only slightly lower than the performance of a deterministic ANN which uses graded activation
functions and connection weights with floating-point precision. However, we expect that this gap can
be further reduced by future hardware implementations with increased network size and additional
hidden layers.
Next, we wanted to check that ITL training also works on a more difficult task. Therefore, we
trained the same 256-118-10 SNN on a 16 × 16 version of Fashion MNIST [56]. On this dataset,
the SNN on BrainScaleS-2 reached 83.8 % accuracy, whereas the ANN reference network achieved
(87.4± 0.2) %. In summary, SNNs on analog hardware can closely approximate the performance of
ANNs computed in software.
Low latency and low power information processing with spikes. A key advantage of SNNs
relying on spike-latency coding (in contrast to rate-based codes) is their short decision time. To
test after which time our SNNs reached their final decisions, we analyzed their inference latency by
artificially restricting output traces to the time interval [0, T ]. This analysis showed that the peak
accuracy was already reached 10 µs after the first input spike (Fig. 2C). Based on this number, we
estimated the theoretical upper bound for the maximum inference rate of this SNN as 100 k images
per second. However, achieving this rate in practice requires a forced reset of all neuronal state
variables, which can be achieved on the BrainScaleS-2 systems by clamping all membrane and current
variables to their respective resting values. On BrainScaleS-2 this step takes about 4 µs, thus limiting
the maximum inference rate to approximately 70 k images per second.
Table 1: Comparison of results achieved with networks trained on BrainScaleS-2 and in software.
Dataset                   Substrate            Remarks            Train accuracy (%)  Test accuracy (%)
16×16 MNIST (i)           SNN, BrainScaleS-2                      98.9 ± 0.1          96.2 ± 0.2
                          SNN, BrainScaleS-2   data augmentation  97.5 ± 0.0          96.7 ± 0.1
                          SNN, software                           99.8 ± 0.0          96.7 ± 0.1
16×16 Fashion MNIST (ii)  BrainScaleS-2                           85.4                83.8
                          SNN, software                           91.0                84.9
(i) A non-spiking ANN (256-512-512-10) reached (98.3 ± 0.1) %. An ANN of the same size as the neuromorphic SNN reached (97.8 ± 0.2) %
without and (98.1 ± 0.1) % with data augmentation.
(ii) An ANN (256-512-512-10) reached (88.5 ± 0.3) %. An ANN of the same size as the
neuromorphic SNN reached (87.4 ± 0.2) %.
To assess the system’s energy efficiency, we measured the ASIC’s power consumption. For emulating
the trained SNN, BrainScaleS-2 used approximately 285 mW which includes the power consumption
of the analog core, the plasticity processor, and all surrounding periphery. The overall figure was
dominated by the idle power spent on clock generation and distribution as well as the communication
links to and from the ASIC. Yet, assuming the maximum inference rate of 70 k images per second,
this number translates into an estimated energy consumption of only 4 µJ per image.
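The throughput and energy figures quoted above can be reproduced with a short calculation. All input numbers come from the text; reading the practical rate as the inverse of decision time plus reset time is an interpretation that is consistent with the stated value of approximately 70 k images per second.

```python
decision_time = 10e-6   # s, time to peak accuracy after the first input spike
reset_time = 4e-6       # s, forced reset of membrane and current variables
power = 285e-3          # W, measured power consumption of the ASIC

upper_bound_rate = 1 / decision_time               # images per second, no reset
practical_rate = 1 / (decision_time + reset_time)  # images per second, with reset
energy_per_image = power / 70e3                    # J, at the quoted 70 k images per second

print(f"{upper_bound_rate / 1e3:.0f} k images/s upper bound")
print(f"{practical_rate / 1e3:.1f} k images/s with reset")
print(f"{energy_per_image * 1e6:.1f} µJ per image")
```

Since the total power is dominated by idle consumption, the energy per image in this estimate scales inversely with the achieved inference rate.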
Hardware ITL training compensates for analog device mismatch. Although BrainScaleS-2 has
a favorably high signal-to-noise ratio and can be calibrated to produce reliable neuronal parameters,
some remaining inter-neuron variability caused by device mismatch is unavoidable. By training
ITL, these deviations are taken into account implicitly. To illustrate the benefit of ITL, we compared
ITL-trained networks with networks that were pre-trained purely in software and then uploaded to
the device. For instance, a simulated network trained in software performed at 96.7 % accuracy on
MNIST (Fig. 3). When the same SNN was uploaded to the chip, its accuracy plummeted to 83.8 %.
An accuracy of 96.2 % was recovered through subsequent ITL training, which was comparable to the
accuracy of a network trained on the hardware from scratch (96.2 %, Fig. 3B). Thus, ITL training is
indispensable to achieve high accuracy, irrespective of pre-training in software.
[Figure 3 image; panel A: loss and classification error (train, test; logarithmic error axis) over epochs 0 to 50 for pre-training (software) and training (BrainScaleS-2), with and without pre-training; panel B: test accuracy (%) of 96.7 for pre-training (software), 83.8 for loaded (BSS-2), 96.2 for loaded + ITL (BSS-2), and 96.2 for only ITL (BSS-2)]
Figure 3: Hardware ITL training compensates for analog device mismatch. (A) Loss and
classification error during pre-training (left) and ITL training on the hardware (right). The classification
accuracy drops when a pre-trained network is loaded onto the neuromorphic substrate. ITL training
is required to reinstate maximum performance, which makes the benefits of pre-training almost
negligible. (B) Comparison of the classification accuracy of the different training approaches.
[Figure 4 image; train and test accuracy (0.92 to 1.00) over the average number of hidden layer spikes per image (logarithmic axis, 10^1 to 10^2); color: regularization strength]
Figure 4: Sparse spiking activity is sufficient for accurate classification. Both train and test
accuracy depend on the sparsity imposed by the burst regularization. Choosing the strength of
this term (color coded) allows trading accuracy against sparsity, although a state of high performance
is reached for a broad range of hidden layer activities, measured by the average number
of spikes in the hidden layer per image. This plateau coincides for training and testing images.
Computational efficiency with sparse spiking activity. One of the most remarkable phenomena
in neurobiology is sparse spiking activity [1]. To assess whether we could attain similar sparse activity
levels on the BrainScaleS-2 system without compromising classification performance, we added an
adjustable sparseness penalty to the total training loss Ltot:
$$L_\text{tot} = L + \lambda \frac{1}{N_H} \sum_{i=1}^{N_H} \left( \sum_t S_i^H[t] \right)^2 ,$$
where $L$ denotes the classification loss (Eq. 5), $\lambda$ the strength parameter, $N_H$ the hidden layer size,
and $S_i^H$ the corresponding hidden layer spike trains. We trained SNNs on the BrainScaleS-2 system
for a range of different values λ and measured both their accuracy and average hidden layer spike
counts per input. We found that a large fraction of the resulting network configurations achieved
an accuracy above 96 % with only 20 or more hidden layer spikes on average (Fig. 4). Below 20
hidden layer spikes, the accuracy degraded monotonically. These data show that the performance of
SNNs depends only weakly on the number of hidden layer spikes down to a critical sparseness level
below which performance degrades. Thus, efficient computation in SNNs on analog neuromorphic
substrates is possible with only a few spikes per input.
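The sparseness penalty defined above is straightforward to evaluate from binned hidden layer spike trains. The following sketch computes the penalty term for one input; the function name and the list-of-lists layout are assumptions made for illustration.

```python
def sparseness_penalty(hidden_spikes, strength):
    """strength / N_H * sum_i (sum_t S_i[t])^2 for one input.

    hidden_spikes[i][t] is 1 if hidden neuron i spiked at step t, else 0.
    """
    n_hidden = len(hidden_spikes)
    squared_counts = sum(sum(train) ** 2 for train in hidden_spikes)
    return strength * squared_counts / n_hidden


# Five spikes in total in both cases, but concentrated versus spread out.
bursty = [[1, 0, 1], [0, 0, 0], [1, 1, 1]]   # spike counts 2, 0, 3
spread = [[1, 1, 0], [1, 1, 0], [1, 0, 0]]   # spike counts 2, 2, 1
penalty_bursty = sparseness_penalty(bursty, strength=0.3)
penalty_spread = sparseness_penalty(spread, strength=0.3)
```

Because the per-neuron spike count enters quadratically, the same total number of spikes is penalized more strongly when it is concentrated in few neurons than when it is distributed across the layer.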
6 Discussion
We have developed an efficient ITL training method for multi-layer SNNs on analog neuromorphic
hardware. Specifically, we have used the BrainScaleS-2 neuromorphic system to emulate the forward
pass of an SNN while sampling its analog voltage traces. These traces permitted us to compute
surrogate gradients in software for supervised learning. Using this framework, we trained SNNs to
rapidly classify sparse spike-timing-dependent input data derived from standard vision benchmark
datasets. The resulting SNNs exhibited sparse spiking activity, were minimally affected by device
mismatch, and achieved state-of-the-art performance compared to other neuromorphic platforms.
Crucially, the hardware acceleration factor allows classifying up to 70 k images per second on a power
budget of less than 300 mW. To the best of our knowledge, this is the first instance of a hardware
SNN trained on a real-world dataset that efficiently uses spike timing for information processing.
The classification of vision datasets on neuromorphic hardware has been tackled on both digital
and analog systems in the past. However, most, if not all, digital neuromorphic architectures have
employed rate-based coding schemes [57–59, 44] which typically entail higher activity levels and
longer latencies. In the analog setting, Schmitt et al. [60] have used a comparable ITL approach
to train rate-coding SNNs. Similar to the work of Göltz et al. [33], which so far only has been
applied to artificial datasets, the present article expands upon this work to the realm of processing
with sparse, precisely timed spikes. Nevertheless, a detailed quantitative comparison between these
various approaches remains difficult because both suitable metrics and standardized benchmarks are
lacking [61, 53]. This especially applies to figures concerning the net energy footprint, which often
have to be estimated or rely on simulations [62, 63].
In summary, our work gives a glimpse of the extensive opportunities that analog SNN hardware offers
for energy-efficient ultra-low latency information processing while simultaneously highlighting the
importance of on-device training methods to reap the full benefits of this emergent technology.
Acknowledgments and Disclosure of Funding
We express our gratitude towards O. Breitwieser, C. Mauch, E. Müller, and P. Spilger for their work
on the software environment, B. Kindler, F. Kleveta, and S. Schmitt for their helpful support, A.
Baumbach for his valuable feedback during the early commissioning phase of the system, and J.
Göltz and L. Kriener for helpful discussions. We thank the whole Electronic Vision(s) group for the
inspirational work environment.
This work has received funding from the European Union Sixth Framework Programme ([FP6/2002-2006])
under grant agreement no 15879 (FACETS), the European Union Seventh Framework Programme
([FP7/2007-2013]) under grant agreement no 604102 (HBP), 269921 (BrainScaleS) and
243914 (Brain-i-Nets) and the Horizon 2020 Framework Programme ([H2020/2014-2020]) under
grant agreement no 720270 and 785907 (HBP). This work was supported by the Novartis Research
Foundation.
References
[1] Sterling P. and Laughlin S.. Principles of neural design. MIT Press, 2015.
[2] LeCun Y.. Deep learning hardware: Past, present, and future. In 2019 IEEE International
Solid-State Circuits Conference-(ISSCC), pages 12–19. IEEE, 2019.
[3] Boahen K.. A neuromorph’s prospectus. Computing in Science & Engineering, 19(2):14–28,
2017.
[4] Neftci E. O.. Data and power efficient intelligence with neuromorphic learning machines.
iScience, 5:52–68, 2018.
[5] Roy K., Jaiswal A., and Panda P.. Towards spike-based machine intelligence with neuromorphic
computing. Nature, 575(7784):607–617, 2019.
[6] Mahowald M. and Douglas R.. A silicon neuron. Nature, 354(6354):515–518, 1991.
[7] Schemmel J., Brüderle D., Grübl A., Hock M., Meier K., and Millner S.. A wafer-scale
neuromorphic hardware system for large-scale neural modeling. Proceedings of the 2010 IEEE
International Symposium on Circuits and Systems (ISCAS'10), pages 1947–1950, 2010.
[8] Chicca E., Stefanini F., Bartolozzi C., and Indiveri G.. Neuromorphic electronic circuits for
building autonomous cognitive systems. Proceedings of the IEEE, 102(9):1367–1388, 2014.
[9] Tavanaei A., Ghodrati M., Kheradpisheh S. R., Masquelier T., and Maida A.. Deep learning in
spiking neural networks. Neural Networks, 111:47–63, 2019.
[10] Pfeiffer M. and Pfeil T.. Deep learning with spiking neurons: opportunities and challenges.
Frontiers in neuroscience, 12:774, 2018.
[11] Neftci E. O., Mostafa H., and Zenke F.. Surrogate gradient learning in spiking neural networks:
Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal
Processing Magazine, 36(6):51–63, 2019.
[12] Bengio Y., Léonard N., and Courville A.. Estimating or propagating gradients through stochastic
neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[13] Courbariaux M., Hubara I., Soudry D., El-Yaniv R., and Bengio Y.. Binarized neural networks:
Training deep neural networks with weights and activations constrained to +1 or -1. arXiv
preprint arXiv:1602.02830, 2016.
[14] Rückauer B., Känzig N., Liu S.-C., Delbruck T., and Sandamirskaya Y.. Closing the accuracy
gap in an event-based visual recognition task. arXiv preprint arXiv:1906.08859, 2019.
[15] Zambrano D., Nusselder R., Scholte H. S., and Bohte S.. Efficient computation in adaptive
artificial spiking neural networks. arXiv preprint arXiv:1710.04838, 2017.
[16] Stöckl C. and Maass W.. Classifying images with few spikes per neuron. arXiv preprint
arXiv:2002.00860, 2020.
[17] Huh D. and Sejnowski T. J.. Gradient descent for spiking neural networks. In Advances in
Neural Information Processing Systems, pages 1433–1443, 2018.
[18] Ackley D. H., Hinton G. E., and Sejnowski T. J.. A learning algorithm for boltzmann machines.
Cognitive science, 9(1):147–169, 1985.
[19] Brea J., Senn W., and Pfister J.-P.. Matching recall and storage in sequence learning with spiking
neural networks. Journal of neuroscience, 33(23):9565–9575, 2013.
[20] Gardner B. and Grüning A.. Supervised learning in spiking neural networks for precise temporal
encoding. PloS one, 11(8):e0161335, 2016.
[21] Jimenez Rezende D. and Gerstner W.. Stochastic variational learning in recurrent spiking
networks. Frontiers in Computational Neuroscience, 8:38, 2014. doi: 10.3389/fncom.2014.00038.
[22] Guerguiev J., Lillicrap T. P., and Richards B. A.. Towards deep learning with segregated
dendrites. Elife, 6:e22901, 2017.
[23] Hunsberger E. and Eliasmith C.. Spiking deep networks with lif neurons. arXiv preprint
arXiv:1510.08829, 2015.
[24] O’Connor P. and Welling M.. Deep spiking networks. arXiv preprint arXiv:1602.08323, 2016.
[25] Lee J. H., Delbruck T., and Pfeiffer M.. Training deep spiking neural networks using
backpropagation. Frontiers in neuroscience, 10:508, 2016.
[26] Neftci E. O., Augustine C., Paul S., and Detorakis G.. Event-driven random back-propagation:
Enabling neuromorphic deep learning machines. Frontiers in neuroscience, 11:324, 2017.
[27] Payeur A., Guerguiev J., Zenke F., Richards B., and Naud R.. Burst-dependent synaptic plasticity
can coordinate learning in hierarchical circuits. bioRxiv, 2020.
[28] Gütig R. and Sompolinsky H.. The tempotron: a neuron that learns spike timing-based decisions.
Nat Neurosci, 9(3):420–428, March 2006. doi: 10.1038/nn1643.
[29] Memmesheimer R.-M., Rubin R., Ölveczky B., and Sompolinsky H.. Learning Precisely Timed
Spikes. Neuron, 82(4):925–938, May 2014. doi: 10.1016/j.neuron.2014.03.026.
[30] Bohte S. M., Kok J. N., and La Poutre H.. Error-backpropagation in temporally encoded
networks of spiking neurons. Neurocomputing, 48(1):17–37, 2002.
[31] Mostafa H.. Supervised Learning Based on Temporal Coding in Spiking Neural Networks. Trans
Neural Netw Learn Syst, 29(7):3227–3235, July 2018. doi: 10.1109/TNNLS.2017.2726060.
[32] Mozafari M., Ganjtabesh M., Nowzari-Dalini A., and Masquelier T.. SpykeTorch: Efficient
Simulation of Convolutional Spiking Neural Networks with at most one Spike per Neuron.
arXiv:1903.02440 [cs, q-bio], March 2019.
[33] Göltz J., Baumbach A., Billaudelle S., Breitwieser O., Dold D., Kriener L., Kungl A. F., Senn
W., Schemmel J., Meier K., et al. Fast and deep neuromorphic learning with time-to-first-spike
coding. arXiv preprint arXiv:1912.11443, 2019.
[34] Comsa I. M., Fischbacher T., Potempa K., Gesmundo A., Versari L., and Alakuijala J.. Temporal
Coding in Spiking Neural Networks with Alpha Synaptic Function. In ICASSP 2020 - 2020
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
8529–8533, May 2020. doi: 10.1109/ICASSP40776.2020.9053856.
[35] Bohte S. M.. Error-backpropagation in networks of fractionally predictive spiking neurons. In
International Conference on Artificial Neural Networks, pages 60–68. Springer, 2011.
[36] Esser S. K., Merolla P. A., Arthur J. V., Cassidy A. S., Appuswamy R., Andreopoulos A., Berg
D. J., McKinstry J. L., Melano T., Barch D. R., et al. Convolutional networks for fast,
energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences,
113(41):11441–11446, Sep 2016. doi: 10.1073/pnas.1604850113.
[37] Zenke F. and Ganguli S.. Superspike: Supervised learning in multilayer spiking neural networks.
Neural computation, 30(6):1514–1541, 2018.
[38] Shrestha S. B. and Orchard G.. Slayer: Spike layer error reassignment in time. In Advances in
Neural Information Processing Systems, pages 1412–1421, 2018.
[39] Bellec G., Salaj D., Subramoney A., Legenstein R., and Maass W.. Long short-term memory and
learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing
Systems, pages 787–797, 2018.
[40] Wozniak S., Pantazi A., and Eleftheriou E.. Deep Networks Incorporating Spiking Neural
Dynamics. arXiv:1812.07040 [cs], December 2018.
[41] Akopyan F., Sawada J., Cassidy A., Alvarez-Icaza R., Arthur J., Merolla P., Imam N., Nakamura
Y., Datta P., Nam G.-J., et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron
programmable neurosynaptic chip. IEEE transactions on computer-aided design of integrated
circuits and systems, 34(10):1537–1557, 2015.
[42] Furber S. B., Lester D. R., Plana L. A., Garside J. D., Painkras E., Temple S., and Brown A. D..
Overview of the spinnaker system architecture. IEEE Transactions on Computers, 62(12):
2454–2467, 2012.
[43] Davies M., Srinivasa N., Lin T.-H., Chinya G., Cao Y., Choday S. H., Dimou G., Joshi P., Imam
N., Jain S., et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE
Micro, 38(1):82–99, 2018.
[44] Frenkel C., Lefebvre M., Legat J.-D., and Bol D.. A 0.086-mm2 12.7-pj/sop 64k-synapse
256-neuron online-learning digital spiking neuromorphic processor in 28-nm cmos. IEEE
transactions on biomedical circuits and systems, 13(1):145–158, 2018.
[45] Moradi S., Qiao N., Stefanini F., and Indiveri G.. A scalable multicore architecture with
heterogeneous memory structures for dynamic neuromorphic asynchronous processors (dynaps).
IEEE transactions on biomedical circuits and systems, 12(1):106–122, 2018.
[46] Aamir S. A., Müller P., Kiene G., Kriener L., Stradmann Y., Grübl A., Schemmel J., and
Meier K.. A mixed-signal structured adex neuron for accelerated neuromorphic cores. IEEE
Transactions on Biomedical Circuits and Systems, 12(5):1027–1037, Oct 2018.
doi: 10.1109/tbcas.2018.2848203.
[47] Friedmann S., Schemmel J., Grübl A., Hartel A., Hock M., and Meier K.. Demonstrating
hybrid learning in a flexible neuromorphic hardware system. IEEE Transactions on Biomedical
Circuits and Systems, 11(1):128–142, 2017. doi: 10.1109/TBCAS.2016.2579164.
[48] Cramer B., Stöckel D., Kreft M., Schemmel J., Meier K., and Priesemann V.. Control of
criticality and computation in spiking neuromorphic networks with plasticity. arXiv preprint
arXiv:1909.08418, 2019.
[49] Billaudelle S., Stradmann Y., Schreiber K., Cramer B., Baumbach A., Dold D., Göltz J., Kungl
A. F., Wunderlich T. C., Hartel A., et al. Versatile emulation of spiking neural networks on an
accelerated neuromorphic substrate. arXiv preprint arXiv:1912.12980, 2019.
[50] Billaudelle S., Cramer B., Petrovici M. A., Schreiber K., Kappel D., Schemmel J., and Meier K..
Structural plasticity on an accelerated analog neuromorphic hardware system. arXiv preprint
arXiv:1912.12047, 2019.
[51] Müller E., Mauch C., Spilger P., Breitwieser O. J., Klähn J., Stöckel D., Wunderlich T., and
Schemmel J.. Extending brainscales os for brainscales-2. arXiv preprint arXiv:2003.13750,
2020.
[52] Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein
N., Antiga L., Desmaison A., Kopf A., Yang E., DeVito Z., Raison M., Tejani A., Chilamkurthy
S., Steiner B., Fang L., Bai J., and Chintala S.. Pytorch: An imperative style, high-performance
deep learning library. In Wallach H., Larochelle H., Beygelzimer A., d’Alché Buc F., Fox
E., and Garnett R., editors, Advances in Neural Information Processing Systems 32, pages
8024–8035. Curran Associates, Inc., 2019.
[53] Cramer B., Stradmann Y., Schemmel J., and Zenke F.. The heidelberg spiking datasets for the
systematic evaluation of spiking neural networks. arXiv preprint arXiv:1910.07407, 2019.
[54] Kingma D. P. and Ba J.. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[55] Chollet F. et al. Keras. https://github.com/fchollet/keras, 2015.
[56] Xiao H., Rasul K., and Vollgraf R.. Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[57] Chen G. K., Kumar R., Sumbul H. E., Knag P. C., and Krishnamurthy R. K.. A 4096-neuron
1m-synapse 3.8-pj/sop spiking neural network with on-chip stdp learning and sparse weights in
10-nm finfet cmos. IEEE Journal of Solid-State Circuits, 54(4):992–1002, 2018.
[58] Stromatias E., Neil D., Pfeiffer M., Galluppi F., Furber S. B., and Liu S.-C.. Robustness of
spiking deep belief networks to noise and reduced bit precision of neuro-inspired hardware
platforms. Frontiers in Neuroscience, 9:222, 2015. doi: 10.3389/fnins.2015.00222.
[59] Lin C.-K., Wild A., Chinya G. N., Cao Y., Davies M., Lavery D. M., and Wang H.. Programming
spiking neural networks on intel’s loihi. Computer, 51(3):52–61, 2018.
[60] Schmitt S., Klähn J., Bellec G., Grübl A., Guettler M., Hartel A., Hartmann S., Husmann D.,
Husmann K., Jeltsch S., et al. Neuromorphic hardware in the loop: Training a deep spiking
network on the brainscales wafer-scale system. In 2017 International Joint Conference on
Neural Networks (IJCNN), pages 2227–2234. IEEE, 2017.
[61] Davies M.. Benchmarks for progress in neuromorphic computing. Nature Machine Intelligence,
1(9):386–388, 2019.
[62] Yin S., Venkataramanaiah S. K., Chen G. K., Krishnamurthy R., Cao Y., Chakrabarti C., and Seo
J.-s.. Algorithm and hardware design of discrete-time spiking neural networks based on back
propagation with binary activations. In 2017 IEEE Biomedical Circuits and Systems Conference
(BioCAS), pages 1–5. IEEE, 2017.
[63] Yin B., Corradi F., and Bohté S. M.. Effective and efficient computation with multiple-timescale
spiking recurrent neural networks. arXiv preprint arXiv:2005.11633, 2020.
[64] He K., Zhang X., Ren S., and Sun J.. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. In Proceedings of the IEEE international conference
on computer vision, pages 1026–1034, 2015.
Appendix
A Loss term and regularization
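The loss function $\mathcal{L}_{\mathrm{tot}}$ used for optimization in the manuscript is composed of the classification
loss $\mathcal{L}$ and three regularization terms:
$$\mathcal{L}_{\mathrm{tot}} = \mathcal{L} + \mathcal{L}_{\mathrm{reg,burst}} + \mathcal{L}^{H}_{\mathrm{reg,W}} + \mathcal{L}^{L}_{\mathrm{reg,W}} . \tag{6}$$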
The classification loss term depends on the maxima of the membrane potentials $V^L_i$ of the label units:
$$\mathcal{L} = \mathrm{NLL}\left( \mathrm{softmax}\left( \max_t V^L_i[t] \right), y^{*} \right) , \tag{8}$$
with the negative log-likelihood NLL, the softmax function, and the ground truth labels $y^{*}$. The first
regularization term $\mathcal{L}_{\mathrm{reg,burst}}$ aims for biologically plausible firing behavior by penalizing
spike bursts in particular:
$$\mathcal{L}_{\mathrm{reg,burst}} = \lambda \frac{1}{N^H} \sum_{i=1}^{N^H} \left( \sum_t S^H_i[t] \right)^2 , \tag{10}$$
where $\lambda$ determines the strength of the regularization, $N^H$ is the number of hidden units, and
$S^H_i$ is the spike train of the $i$th hidden neuron. Finally, the other two terms attempt to prevent the
saturation of the weights $W^{H,L}_{ij}$ in the hidden and label layer, respectively:
$$\mathcal{L}^{H}_{\mathrm{reg,W}} = \rho^{H} \frac{1}{N^I N^H} \sum_{i,j} \left| W^H_{ij} \right|^2 , \qquad \mathcal{L}^{L}_{\mathrm{reg,W}} = \rho^{L} \frac{1}{N^H N^L} \sum_{i,j} \left| W^L_{ij} \right|^2 , \tag{12}$$
with the regularization strengths $\rho^{H,L}$ and the numbers of units in the input ($N^I$), hidden ($N^H$), and label
($N^L$) layers. Indices $i$ and $j$ iterate over the input and output dimensions of the weight matrices.
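The following is a minimal single-sample sketch of how the four loss terms could be evaluated, not the training code used in this work. It assumes plain Python lists for membrane traces, spike trains, and weights, and it reads Eq. (12) as the mean of the squared weights. All function and argument names are hypothetical.

```python
import math


def classification_loss(v_label, target):
    """NLL of the softmax over each label unit's maximum membrane potential.

    v_label: one membrane trace (list of samples over time) per label unit.
    target:  index of the ground-truth label unit.
    """
    v_max = [max(trace) for trace in v_label]
    # log-sum-exp, shifted by the largest value for numerical stability
    shift = max(v_max)
    log_norm = shift + math.log(sum(math.exp(v - shift) for v in v_max))
    return log_norm - v_max[target]


def burst_regularizer(spikes, lam):
    """lam times the mean over hidden units of the squared spike count."""
    counts = [sum(train) for train in spikes]
    return lam * sum(c * c for c in counts) / len(counts)


def weight_regularizer(weights, rho):
    """rho times the mean squared weight of one layer (assumed reading)."""
    flat = [w for row in weights for w in row]
    return rho * sum(w * w for w in flat) / len(flat)


def total_loss(v_label, target, spikes, w_hidden, w_label, lam, rho_h, rho_l):
    """Sum of the classification loss and the three regularization terms."""
    return (classification_loss(v_label, target)
            + burst_regularizer(spikes, lam)
            + weight_regularizer(w_hidden, rho_h)
            + weight_regularizer(w_label, rho_l))
```

Because the burst term squares the per-neuron spike count, a neuron that fires twice contributes four times as much as a neuron that fires once, which is what makes bursts comparatively expensive.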
We made use of a decaying learning rate with a decay of $\eta_{e+1}/\eta_e = 1 - \gamma_\eta$ per epoch.
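As a small worked example, the per-epoch decay unrolls to a geometric schedule. The sketch below assumes that the learning rate and decay listed in Table 2 are the values at epoch 0. The function name is hypothetical.

```python
def learning_rate(epoch, eta0=1.5e-3, gamma=0.075):
    """Learning rate after `epoch` epochs of multiplicative decay.

    Unrolls eta_{e+1} / eta_e = 1 - gamma to eta_e = eta0 * (1 - gamma) ** e.
    The defaults are taken from Table 2; treating eta0 as the epoch-0 value
    is an assumption of this sketch.
    """
    return eta0 * (1.0 - gamma) ** epoch
```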
B Initialization
We used the Kaiming initialization scheme [64] for both the hidden and label layer weight matrices.
Specifically, weights were drawn from a normal distribution with zero mean and a standard deviation
of $\hat{\sigma}_w / \sqrt{N^{H,L}}$.
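A minimal sketch of this initialization, using only the Python standard library, is given below. The function name, the matrix layout (inputs by outputs), and the explicit `n_units` argument are assumptions made for illustration. `n_units` stands for the layer size that appears under the square root in the text.

```python
import math
import random


def init_weights(n_in, n_out, sigma_w, n_units, seed=None):
    """Draw an n_in x n_out weight matrix from N(0, (sigma_w / sqrt(n_units))^2).

    n_units is the unit count used for normalization, passed explicitly
    because this sketch does not assume which layer dimension it refers to.
    """
    rng = random.Random(seed)
    std = sigma_w / math.sqrt(n_units)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]
```

With the spread of 0.24 from Table 2 and a hypothetical layer of 100 units, the standard deviation of the initial weights would be 0.024.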
C Weight scaling for the neuromorphic substrate
When writing the weights to the neuromorphic system, their values had to be scaled, rounded, and
clipped to 7-bit signed integers. The resulting scaling factor strongly depends on bias currents and
other technical parameters of the neuromorphic system. Due to the absence of a threshold for the
non-spiking label layer, its membrane traces can be scaled arbitrarily. For this reason, we were able
to adopt a dynamic weight scaling by aligning the largest absolute weight value as represented in
software to the maximum weight possible on the substrate.
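The scale, round, and clip steps can be sketched as follows. This is an illustration under stated assumptions and not the interface of the hardware software stack. The symmetric integer range of ±63 is an assumed interpretation of "7 bit signed", and the optional fixed `scale` argument is a hypothetical stand-in for the hardware-dependent factor. With `scale=None`, the sketch applies the dynamic scaling described for the label layer.

```python
def quantize_weights(weights, w_max_hw=63, scale=None):
    """Scale, round, and clip software weights to signed hardware integers.

    weights:  nested list of float weights as represented in software.
    w_max_hw: largest weight magnitude on the substrate (assumed to be 63).
    scale:    fixed scaling factor; if None, the largest absolute software
              weight is aligned with w_max_hw (dynamic scaling).
    Returns the integer weight matrix and the scaling factor that was used.
    """
    if scale is None:
        w_abs_max = max(abs(w) for row in weights for w in row)
        # an all-zero matrix needs no rescaling
        scale = w_max_hw / w_abs_max if w_abs_max > 0 else 1.0
    quantized = [
        # round() maps exact halves to the nearest even integer
        [max(-w_max_hw, min(w_max_hw, round(w * scale))) for w in row]
        for row in weights
    ]
    return quantized, scale
```

Dynamic scaling uses the full integer range for every weight matrix, whereas a fixed factor can saturate large weights at the clipping bounds.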
Table 2: Parameters and hyperparameters of the neuromorphic system and the training framework.

Group                                 Parameter                                           Symbol    Value
neuromorphic neuron: potentials       leak potential                                      E_l       550 mV
                                      firing threshold                                    θ         850 mV
neuromorphic neuron: time constants   membrane time constant                              τ_mem     6 µs
                                      synaptic time constant                              τ_syn     6 µs
input spike conversion                input unit time constant                            τ_in      8 µs
                                      input unit threshold                                θ_in      0.2
hyperparameters                       surrogate gradient steepness                        β         50
                                      learning rate                                       η         1.5 × 10^-3
                                      learning rate decay (per epoch)                     γ_η       0.075
                                      burst regularization strength                       λ         0.0 to 10.0
                                      hidden weight regularization strength               ρ^H       0.2
                                      output (label layer) weight regularization strength ρ^L       0.0
miscellaneous                         time step/sample period                             ∆t        2.5 µs
initialization                        weight initialization spread                        σ̂_w       0.24
