hxtorch: PyTorch for BrainScaleS-2 -- Perceptrons on Analog Neuromorphic
  Hardware by Spilger, Philipp et al.
hxtorch: PyTorch for ANNs on BrainScaleS-2
Philipp Spilger*, Eric Müller*, Arne Emmel, Aron Leibfried, Christian Mauch,
Christian Pehle, Johannes Weis, Oliver Breitwieser, Sebastian Billaudelle,
Sebastian Schmitt, Timo C. Wunderlich, Yannik Stradmann, and
Johannes Schemmel
Kirchhoff-Institute for Physics
Ruprecht-Karls-Universität Heidelberg, Germany
{pspilger,mueller}@kip.uni-heidelberg.de
Abstract. We present software facilitating the usage of the BrainScaleS-2
analog neuromorphic hardware system as an inference accelerator for ar-
tificial neural networks. The accelerator hardware is transparently inte-
grated into the PyTorch machine learning framework using its extension
interface. In particular, we provide accelerator support for vector-matrix
multiplications and convolutions; corresponding software-based autograd
functionality is provided for hardware-in-the-loop training. Automatic
partitioning of neural networks onto one or multiple accelerator chips is
supported. We analyze implementation runtime overhead during training
as well as inference, provide measurements for existing setups and eval-
uate the results in terms of the accelerator hardware design limitations.
As an application of the introduced framework, we present a model that
classifies activities of daily living with smartphone sensor data.
Keywords: Machine Learning · Analog Accelerator · Neuromorphic ·
Convolutional Neural Networks · PyTorch · Human Activity Recognition
1 Introduction
Modern machine learning (ML) frameworks such as PyTorch or TensorFlow provide
interfaces and tools to easily define, build and evaluate machine learning
models [1, 2]. They aim to reduce development effort and increase
interoperability by providing a large collection of operators, algorithms,
optimizers and analysis tools. Using a high-level specification of the complete
model, these packages construct a corresponding computational graph that is
fundamentally agnostic to the substrate on which it is executed and for which
it is optimized. State-of-the-art machine learning frameworks support CPU- and
GPU-based backends that accelerate inference and training. Compared to typical
CPUs, GPUs are capable of processing data in a highly parallel fashion and are
well-suited to the abundance of vector-matrix multiplications in artificial
neural networks.
While GPUs represent a traditional approach to accelerating the execution of
artificial neural networks, neuromorphic chips aim to mimic biological neural
networks in terms of their fundamental structure and functionality.
BrainScaleS-2 is a mixed-signal accelerated neuromorphic system targeted for
research in the fields of computational neuroscience and beyond-von-Neumann
computing that follows this paradigm. It supports spiking and non-spiking
operation. The latter provides analog accelerator functionality for
vector-matrix multiplications. In this work, we focus on the PyTorch
integration of this accelerator platform. We use the embedded SIMD
microprocessors for additional digital operations.
* contributed equally.
arXiv:2006.13138v1 [cs.NE] 23 Jun 2020
Fig. 1: BSS-2 setup (left); a white dust cap covers the chip (right).
Fig. 1 shows the BrainScaleS-2 (BSS-2) hardware setup providing 512 neurons per
chip, see [3]. Synapses are arranged in a matrix above the neurons, containing
256 rows for inputs, and connecting to the neurons in columns. Vector entries
control how long a synapse row is activated. The charge reaching the neurons is
then the product of the vector entry and the matrix weight, as it is the
product of time and current. Neurons accumulate the charge emitted by a column
of synapses on the membrane capacitance, which completes the vector-matrix
multiplication. All neuron membrane voltages are digitized in parallel.
The resulting operation is $x^T W = y$, where the transposed input vector with
entries $x_i \in [0, 31]$ is multiplied by the weight matrix with entries
$W_{ij} \in [-63, 63]$ for signed and $W_{ij} \in [0, 63]$ for unsigned
weights, resulting in outputs $y_j \in [-128, 127]$. The available resolution
is therefore 5 bits for vector entries, 6 bits for synapse weights (plus a sign
for signed weights) and 8 bits for the result. In the current lab setup, host
connectivity is provided by a 1 Gbit Ethernet (GbE) link to an FPGA providing
buffer memory and managing real-time access to the BSS-2 chip. Details on the
hardware architecture can be found in [4].
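The value ranges above can be illustrated with a small, noise-free integer
model of the operation. This is only a sketch under assumptions: the function
name mock_vmm and the scaling factor gain (which maps the accumulated sum onto
the 8-bit readout range) are hypothetical and not part of the hardware
description.

```python
# Value ranges are taken from the text; `gain` is a hypothetical scaling factor.
X_MAX = 31                 # 5-bit unsigned inputs
W_MAX = 63                 # 6-bit weight magnitude (signed weights: [-63, 63])
Y_MIN, Y_MAX = -128, 127   # 8-bit digitized result


def mock_vmm(x, w, gain=1.0 / 64):
    """Compute y = x^T W with the hardware value ranges (idealized, noise-free)."""
    if any(not 0 <= xi <= X_MAX for xi in x):
        raise ValueError("inputs must lie in [0, 31]")
    if any(abs(wij) > W_MAX for row in w for wij in row):
        raise ValueError("weights must lie in [-63, 63]")
    y = []
    for j in range(len(w[0])):
        # Each neuron accumulates the contributions of one synapse column.
        acc = sum(xi * row[j] for xi, row in zip(x, w))
        # Scale (assumed gain) and saturate to the 8-bit readout range.
        y.append(max(Y_MIN, min(Y_MAX, round(acc * gain))))
    return y
```

With many active rows and large weights the accumulated value saturates at the
readout limits, which is the kind of boundary effect that matters when larger
operations are split across arrays.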
2 Methods and Tools
Simple and user-friendly interfaces are key components affecting the success of
custom hardware accelerators. The PyTorch and TensorFlow machine learning
frameworks in particular are widely known and established in both research and
industry, cf. [5]. Compared to TensorFlow, we found PyTorch easier to integrate
into our software environment: its build flow is simpler and relies on standard
CMake, and out-of-tree dependencies are tracked using standard mechanisms.
Together with sufficient documentation, this is the reason we chose PyTorch [1]
as a frontend for our accelerator hardware. However, a similar integration into
TensorFlow is possible.
In terms of artificial neural networks, the BSS-2 system provides support for a
limited set of operations: matrix multiplications and convolutions can be mapped
onto analog multiplication and accumulation units. The embedded SIMD processor
can be used to perform additional operations. However, the embedded processor
is not optimized for number crunching but rather serves as a programmable
controller for the analog compute units.
The software we present in the next section builds upon the “BrainScaleS
Operating System”, a software stack providing multiple hardware access layers.
We make use of a register-level abstraction layer; other components provide a
reliable GbE-based communication channel to the hardware system or integrate
multiple hardware systems into the SLURM resource manager. An overview of the
software components is given in [6, 7].
3 Implementation
In this section we give an overview of our software implementation providing a
BSS-2 hardware backend for PyTorch. We start with the PyTorch integration
and considerations regarding data and computational flow. Afterwards, we de-
scribe our graph-based approach for describing and handling control and data
flow on-chip as well as off-chip. We partition operations into chunks fitting onto
the available hardware and schedule hardware using just-in-time-based (JIT)
execution. Finally, we focus on hardware-specific aspects.
3.1 PyTorch integration
PyTorch provides an extension interface which allows us to implement new
custom operations without modifying the core source code or build system.
Integration of the computation executed on the accelerator hardware into
PyTorch can be split into two parts. Firstly, low-level operator primitives
written in C++ and wrapped to Python, implementing the torch.autograd.Function
interface, provide means to execute the operation directly. Since BSS-2 is
geared towards accelerating two-dimensional matrix multiplication, the
low-level operator primitives implemented with direct hardware access adhere to
the interfaces of the matmul and conv{1,2,3}d operations. Secondly, layers
using the torch.nn.Module interface provide the ability to integrate the
computation into an abstract model, a representation of the computation to be
done without eagerly executing it.
Forward pass The forward pass of the matmul operation comprises a linear
sequence of transformations to and from the hardware. Fig. 2 shows its internal
implementation. Tensors from PyTorch are preprocessed to match types and
shape of the hardware. A partitioner then places the operation onto available
hardware resources, cf. sec. 3.3. The resulting custom data flow graph is then
traversed by JIT execution, cf. sec. 3.2. Measured activations are postprocessed
to resemble the expected shape and type and returned.
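The type conversion steps of the forward pass can be sketched as follows. This
is a minimal illustration, assuming a simple linear scaling followed by
rounding and clipping; the function names and the scale arguments are
hypothetical, and the actual conversion used by the software may differ.

```python
def quantize_inputs(x, scale):
    """Map floats to 5-bit unsigned integers in [0, 31] (assumed linear scaling)."""
    return [max(0, min(31, round(v * scale))) for v in x]


def quantize_weights(w, scale):
    """Map floats to signed weights in [-63, 63] (assumed linear scaling)."""
    return [[max(-63, min(63, round(v * scale))) for v in row] for row in w]


def dequantize_outputs(y, scale):
    """Convert digitized 8-bit results back to floats for the caller."""
    return [float(v) / scale for v in y]
```

Values outside the representable range are clipped, so the choice of scale
determines how much of the dynamic range of inputs and weights is used.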
[Figure 2: block diagram of the matmul operation. PyTorch inputs (x and w as
float32) pass through preprocessing (reshape, convert to x as u5 and w as s6*),
are partitioned into a data flow graph and executed just in time on BSS-2; the
results (y as s8) pass through postprocessing (convert, reshape) and are
returned to PyTorch (y as float32).]
Fig. 2: Implementation of the matmul operation executed on BSS-2. Inputs and
weights of type float32 are reshaped such that the input batch space is
one-dimensional; types are converted to 5-bit unsigned inputs and 6-bit “signed”
weights ([−63, 63]). A partitioner performs splitting and placement of the weight
matrix onto the hardware resources. The resulting data flow graph is then used
for JIT execution, which constructs an instruction stream sent to the hardware,
see [4], and decodes the result data received from the hardware using the software
architecture described in [7]. The resulting digitized 8-bit neuron membrane
potential values are converted to float32, reshaped to match the originally
provided type/shape from PyTorch and returned as the result of the operation.
Backward pass: hardware in the loop PyTorch features intrinsic support for
automatic differentiation of a sequence of operations via the
torch.autograd.Function interface, which is used for the low-level operations
on BSS-2. Supplying a backward() function allows integration of custom
operations into this framework. The output gradient as well as the saved state
of the forward pass of the operation are used to compute a gradient for the
inputs. The analog accelerator’s usability for computing the gradient of an
operation is limited due to fixed-pattern and temporal noise. However, using
the results obtained from the hardware execution of the forward pass together
with a software model to calculate the gradients allows mitigation of hardware
distortions and resolution limitations during training. This method is adapted
from [8], where a combination of a software model and the forward pass data
from BSS-1 hardware runs is used to train a spiking network, thereby training
with hardware in the loop. This approach is also suited for spike-based
modeling, e.g., [9]. The software model employed here is much simpler in that
it only resembles the backward pass of the conventional matmul operation. The
assumption is that the direction of the gradient is correct and potential
scaling between the software model and the hardware execution will be
mitigated by adjusting the learning rate.
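The division of labor between hardware and software model can be sketched in
plain Python. This is an illustration under assumptions, not the hxtorch
implementation: mock_hardware_forward is a hypothetical stand-in that adds
Gaussian noise to an ideal product, and the squared-error loss in train_step is
chosen only to keep the example short.

```python
import random


def matmul(a, b):
    """Plain software matrix product of two nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mock_hardware_forward(x, w, rng, noise=0.05):
    """Hypothetical stand-in for the analog forward pass: ideal product plus noise."""
    return [[v + rng.gauss(0.0, noise) for v in row] for row in matmul(x, w)]


def software_backward(x, w, grad_y):
    """Backward pass of the conventional matmul, computed in software."""
    grad_x = matmul(grad_y, transpose(w))   # dL/dx = dL/dy * W^T
    grad_w = matmul(transpose(x), grad_y)   # dL/dW = x^T * dL/dy
    return grad_x, grad_w


def train_step(x, w, target, lr, rng):
    """One hardware-in-the-loop step for the loss 0.5 * sum((y - target)^2)."""
    y = mock_hardware_forward(x, w, rng)    # measured activations
    grad_y = [[yi - ti for yi, ti in zip(yr, tr)] for yr, tr in zip(y, target)]
    _, grad_w = software_backward(x, w, grad_y)
    return [[wij - lr * g for wij, g in zip(wr, gr)] for wr, gr in zip(w, grad_w)]
```

The measured, noisy activations enter only through the loss gradient, while the
gradient with respect to the weights is propagated by the noise-free software
model.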
Mapping convolutions In order to support convolutions, a convNd operation is
implemented using the matmul operation. The convolution is transformed to a
matrix multiplication by unrolling the kernel into the vertical dimension,
placing all kernel channels horizontally next to each other and traversing the
input such that each operation is equivalent to the original kernel applied at
a certain position. Fig. 3 shows the transformation for a conv2d operation as
an example. The transformations are implemented using the C++ API of PyTorch,
which allows most operations to be in-place modifications of tensor shapes only.
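For the one-dimensional case, the transformation can be sketched as follows.
This is a pure-Python illustration with hypothetical helper names; it uses
nested lists instead of tensors and assumes no padding or dilation.

```python
def conv1d_direct(x, kernel, stride):
    """Reference: x[c][t], kernel[o][c][k] -> y[o][p] (no padding)."""
    size = len(kernel[0][0])
    positions = (len(x[0]) - size) // stride + 1
    return [[sum(kernel[o][c][k] * x[c][p * stride + k]
                 for c in range(len(x)) for k in range(size))
             for p in range(positions)]
            for o in range(len(kernel))]


def unroll_kernel(kernel):
    """Unroll the kernel of each output channel into one weight-matrix column."""
    columns = [[v for channel in out for v in channel] for out in kernel]
    return [list(row) for row in zip(*columns)]    # shape: (C_in * K, C_out)


def unroll_input(x, size, stride):
    """Collect the input patch seen at every kernel position as one row."""
    positions = (len(x[0]) - size) // stride + 1
    return [[x[c][p * stride + k] for c in range(len(x)) for k in range(size)]
            for p in range(positions)]


def conv1d_as_matmul(x, kernel, stride):
    """Evaluate the convolution as patches times the constant weight matrix."""
    rows = unroll_input(x, len(kernel[0][0]), stride)
    w = unroll_kernel(kernel)
    y = [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in rows]
    return [list(col) for col in zip(*y)]          # back to y[o][p]
```

The weight matrix is built once and stays constant for all kernel positions;
only the input patches change, and neighboring patches overlap whenever the
stride is smaller than the kernel size.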
[Figure 3: panels (a) and (b). Left in each panel: the kernel k00, k01, k10, k11
placed on the inputs xij. Right in each panel: the kernel entries of all three
channels unrolled into matrix columns and multiplied with the input patch, in
(a) x00, x01, x10, x11 yielding y00 and in (b) x01, x02, x11, x12 yielding y01.]
Fig. 3: Transformation of a 2-d convolution ((a), (b) left) of inputs $x_{ij}$
with kernel $k_{ij}$, three channels denoted by different shades and a stride
of 1 (moved from (a) to (b)) to a multiplication. The resulting matrix ((a),
(b) right) is constant for all kernel positions, which is efficient in terms of
reconfiguration of the weights while leading to overlapping inputs for
different kernel positions.
3.2 Graph Representation and Just-in-time Execution
A major task of neural network accelerator operation is the tracking of data
flow to and from the host as well as on the hardware substrate itself. In the
case of BSS-2 and in addition to general input and output properties,
heterogeneous entities, e.g., synapses and neuron circuits, need to be
configured. Limited hardware resources require a temporal reuse of hardware
substrate to compute a larger operation over the course of multiple
inter-dependent executions.
[Figure 4: three execution instances. Two of them each contain the vertices
external load, synapse matrix, neurons, digitize and store; the third contains
two load vertices, an addition (+) and an external store.]
Fig. 4: Graph representation of two matrix multiplications (left) followed by an
addition (right). Each multiplication as well as the addition are separately
executed. The data flow between execution instances (gray boxes) comprises
stores and loads of measured neuron activations. Vertices within the individual
execution instances represent the on-chip data flow and hardware configuration.
These demands are met by a hardware-centric data flow graph. It describes
the on-chip data flow as well as input and output. While the former is used for
hardware configuration, the latter links individual execution instances in a
dependency graph. Vertices represent statically configurable computation or
hardware circuits. A heterogeneous set of edge types allows analog and digital
signal/data flow to be expressed in a unified interface. Support of batched
input data mitigates comparatively long static configuration times. A
static-single-assignment builder pattern facilitates correct configuration.
Fig. 4 shows an exemplary graph.
Individual execution instances in the dependency graph are executed just in
time. Every instance can be split into a sequence of tasks: preprocessing, build of
the instruction stream, execution on the accelerator hardware and postprocessing
of the received results. Assuming that concurrency on the host computer is not
a limiting factor, pre/postprocessing of non-sequentially-dependent execution
instances can be parallelized, leading to a saturation of the accelerator
hardware usage. Fig. 5 exemplifies a dependency graph and a possible execution
flow.
[Figure 5: left, dependency graph of the execution instances 1 to 4; right,
timeline (t) showing preprocessing, execution and postprocessing of the four
instances.]
Fig. 5: Just-in-time execution (right) of a dependency graph (left) consisting of
four execution instances. Four runs are scheduled onto the accelerator hardware.
Pre- and postprocessing of independent execution instances can overlap in time.
[Figure 6: input x, weight matrix w and result y; the partial results are
combined by digital summation (+) and concatenation.]
Fig. 6: Partitioning an operation too large to fit on a single synapse array. Top
left: the input x is multiplied with the weight matrix w. Inputs and weights are
split at the black boundaries representing the shape of a hardware synapse array.
Middle: as split operations are independent, they are allocated and executed
individually. Right: split results in the row dimension are summed digitally,
results in the column dimension are concatenated leading to the result y (bottom
left) of the equivalent original operation.
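The scheduling idea can be sketched with the topological sorter from the Python
standard library (Python 3.9 or later). The example is an assumption-laden
illustration: it runs everything sequentially on the host, the instance names
and the arithmetic stand in for real execution instances, and the graph follows
the structure of Fig. 4 (two multiplications feeding one addition).

```python
from graphlib import TopologicalSorter


def run_just_in_time(dependencies, execute):
    """Run each instance as soon as all instances it depends on have finished.

    `dependencies` maps every execution instance to the set of its predecessors.
    Returns the order in which the instances were run.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Instances returned together are mutually independent, so their
        # pre- and postprocessing could overlap on the host.
        for instance in sorted(sorter.get_ready()):
            execute(instance)
            order.append(instance)
            sorter.done(instance)
    return order


# Hypothetical instances: two multiplications followed by an addition.
results = {}
operations = {
    "mul_a": lambda: 2 * 3,
    "mul_b": lambda: 4 * 5,
    "add": lambda: results["mul_a"] + results["mul_b"],
}


def execute(name):
    results[name] = operations[name]()


order = run_just_in_time(
    {"mul_a": set(), "mul_b": set(), "add": {"mul_a", "mul_b"}}, execute
)
```

The addition becomes ready only after both multiplications have stored their
results, which mirrors the stores and loads between execution instances.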
3.3 Partitioning
The physical dimensions of each of the two synapse arrays on the hardware are
fixed to N = 256 rows (128 for signed weights) and M = 256 columns, limiting
the shape of a single multiply-accumulate (MAC) operation. Temporal reuse
of the synapse array allows larger operations. The operation is split into parts
which fit on a single synapse array and are placed individually with a
round-robin allocation scheme on the available synapse arrays. Fig. 6
visualizes the partitioning scheme. In order to support more columns, the
results of the split operations are concatenated. Conversely, to support more
rows, the results of the split operations are summed up digitally:
$$
y_j = \sum_{i}^{N} x_i w_{ij}
    = \left( \sum_{i}^{N_1} x_i w_{ij} \right) + \dots
    + \left( \sum_{i}^{N_R} x_i w_{ij} \right),
\qquad N = \sum_{r}^{R} N_r \tag{1}
$$
where the input size $N$ is split into $R$ ranges $N_r$ of analog computation
$\sum_{i}^{N_r} x_i w_{ij}$, which are then summed digitally. This approach is
expected to be comparable to computation on a larger physical synapse array if
no boundary effects (like analog saturation or digital overflow) occur.
For weight matrices large in both dimensions compared to the synapse array,
this partitioning scheme leads to an optimal chip area usage, because the number
of partial synapse array allocations scales with the edges like O(N + M) while
the number of full synapse array allocations scales with the area like O(N · M).
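The splitting rule can be checked with a short sketch. It is an idealized
illustration with hypothetical function names: each tile is evaluated exactly
in software, so saturation, noise and the round-robin placement onto physical
arrays are not modeled.

```python
def matmul(x, w):
    """Plain software matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def partitioned_matmul(x, w, max_rows=128, max_cols=256):
    """Split w into array-sized tiles, evaluate each tile, then recombine."""
    n, m = len(w), len(w[0])
    y = [[0] * m for _ in x]
    allocations = 0
    for c0 in range(0, m, max_cols):          # column splits are concatenated
        for r0 in range(0, n, max_rows):      # row splits are summed digitally
            tile = [row[c0:c0 + max_cols] for row in w[r0:r0 + max_rows]]
            part = matmul([row[r0:r0 + max_rows] for row in x], tile)
            allocations += 1
            for b, part_row in enumerate(part):
                for j, value in enumerate(part_row):
                    y[b][c0 + j] += value
    return y, allocations
```

For a 5 × 3 weight matrix and a tile size of 2 × 2, this yields six
allocations, of which only two are full tiles; for matrices that are large
compared to the array, the share of partially filled tiles shrinks accordingly.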
3.4 Parallel execution of convolutional layers
[Figure 7: input x, expanded weight matrix w and output y.]
Fig. 7: Expanded Conv1d layer maximizing usage of the synapse array.
In convolutional layers, the size of the transformed weight matrix (cf. fig. 3)
is often much smaller than the synapse array of our accelerator. Especially in
the one-dimensional case, it is possible to perform multiple such operations in
one step; a possible layout on the chip is sketched in fig. 7.
The downside of this approach is the increase in independent parameters, since
the weights of each expansion have to be learned individually due to
fixed-pattern noise and other deviations, cf. [4]. To overcome this problem,
we add modified versions of the data set to the training data, shifted by the
stride of the convolution operation. However, this is only applicable for a
single layer or an equal number of parallel executions, since it can only be
tuned to the hyperparameters of one layer. Nevertheless, the execution time
while evaluating the model is reduced by up to a factor equal to the number of
parallel executions, and the energy consumption decreases by nearly the same
factor.
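One way to picture the expanded layout is a block structure in which each copy
of the unrolled kernel matrix is shifted by the stride along the input rows.
The sketch below is an assumption about this layout for a single input channel;
the helper names are hypothetical, and all copies share the same values here,
whereas on hardware each copy is learned individually.

```python
def expand_weights(w, copies, stride):
    """Place `copies` shifted replicas of the unrolled kernel matrix w (K x C_out)."""
    size, channels = len(w), len(w[0])
    rows = size + (copies - 1) * stride
    expanded = [[0] * (copies * channels) for _ in range(rows)]
    for p in range(copies):
        for k in range(size):
            for o in range(channels):
                # Copy p sees the input shifted by p * stride positions.
                expanded[p * stride + k][p * channels + o] = w[k][o]
    return expanded


def vmm(x, w):
    """Vector-matrix product y_j = sum_i x_i * w_ij."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]
```

A single multiplication with the expanded matrix then returns the outputs of
several neighboring kernel positions at once.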
3.5 Handling hardware setup, initialization and parameters
The accelerator hardware setup and initialization routine is time-consuming
compared to a single computation. Hence, initialization of the hardware is only
performed once. We also allow users to modify the hardware initialization pro-
cess. Exclusive access to hardware resources is handled via free functions utiliz-
ing a singleton pattern. Inside the execution of an operation, the JIT executor,
cf. sec. 3.2, then uses available hardware acquired previously via the singleton.
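The access pattern can be sketched with module-level free functions around a
single shared instance. All names in this example (MockConnection,
init_hardware, get_connection, release_hardware) are hypothetical and chosen
for illustration only; they are not the hxtorch interface.

```python
_connection = None  # module-level singleton holding the acquired hardware


class MockConnection:
    """Hypothetical stand-in for an initialized hardware connection."""

    def __init__(self):
        self.initialized = True   # the time-consuming setup would happen here


def init_hardware():
    """Acquire and initialize the hardware once; later calls reuse the instance."""
    global _connection
    if _connection is None:
        _connection = MockConnection()
    return _connection


def get_connection():
    """Used inside operations to access the previously acquired hardware."""
    if _connection is None:
        raise RuntimeError("hardware not initialized; call init_hardware() first")
    return _connection


def release_hardware():
    """Give up exclusive access so that other users can acquire the hardware."""
    global _connection
    _connection = None
```

Because operations only look up the shared instance, the costly initialization
is paid once per session rather than once per operation.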
Using BSS-2 involves a choice of parameters affecting the available dynamic
range. Two such parameters, described in [4], are the interval between
successive input events and the number of times the same input is sent. Both
parameters were introduced to increase the precision of the analog computation
on first-version chips. As such parameters possibly have to be tuned for each
layer, we supply them alongside the other operation parameters already present
in the equivalent CPU operation. The following listing shows the implemented
scheme using the example of a conv1d operation:
conv1d(x, kernel, stride=1, num_sends=6, wait_between_events=25)
While these parameters are in principle differentiable, a model is yet to be de-
veloped. Therefore they are treated as non-differentiable hyperparameters.
Since the hardware specific parameters are not present for CPU/GPU op-
erations, care has to be taken when converting such operations to accelerator
hardware execution. We provide replacements for the PyTorch layers Linear
and Conv{1,2}d, which take these additional parameters as keyword arguments.
Since these layers use the same state as their counterparts, pre-trained weights
can easily be loaded into a model that uses them.
4 Results
We first look at performance figures for speed of execution and utilization of
the accelerator hardware in order to evaluate imposed software overhead. Af-
terwards, we demonstrate usability of the presented software framework by an
application on the human activity recognition dataset [10].
4.1 Performance Evaluation
Hardware Limitations and Measurement Setup The currently used first hardware
version contains a bug that requires rewriting the synapse array for each sent
input. As a consequence, the input data volume increases by a factor of ≈ 100. To
increase the analog precision of the calculations, the same input is sent
multiple times and the sends are spaced in time. The second hardware version
(v2) is currently in the commissioning process and we expect it to be free of
these limitations. Performance estimations for v2 are obtained by disabling
these workarounds on the first hardware version. This only affects the quality
of the computation.
The available hardware setup consists of the BSS-2 chip connected to an
FPGA which provides one GbE link to the host computer. This connection poses
a severe communication limitation as the chip features full-duplex 8Gbit/s inter-
connects. To evaluate software performance against the chip hardware design
limitation, we use a software simulation providing a fast mock-up communica-
tion partner. This is a valid approach for software performance evaluation since
the postprocessing of response data is content-agnostic.
Results We evaluate the performance in terms of multiply-accumulate (MAC)
operations per time in fig. 8, both for square-shaped weight matrices with a
fixed batch size and for varying batch sizes with a fixed weight matrix. The
left panel shows that the current implementation is able to saturate v2 in
combination with a 1-GbE host connection in the limit of large matrix
multiplications. Furthermore, the current implementation reaches the hardware
design limitation to within a factor of two. Given the static configuration
necessary for a matrix multiplication, the right panel shows that the achieved
performance for v2 and for the design limitation simulation is within 50% of
the saturation speed for batch sizes larger than ≈ 200.
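The dependence on the batch size can be made plausible with a simple timing
model. The following is a sketch under assumptions and is not derived from the
measurements: it supposes a constant static configuration time $t_s$ per weight
matrix and a constant time $t_d$ per input vector; both symbols, as well as
$R_{\max}$ and $B_0$, are introduced here for illustration.

```latex
% Assumed timing model: processing a batch of B input vectors on an
% N x M weight matrix takes T(B) = t_s + B t_d.
\begin{align}
  R(B) &= \frac{B \, N M}{t_s + B \, t_d}
        = R_{\max} \, \frac{B}{B + B_0},
  & R_{\max} &= \frac{N M}{t_d},
  & B_0 &= \frac{t_s}{t_d}, \\
  R(B_0) &= \tfrac{1}{2} \, R_{\max} .
\end{align}
```

In this model, half of the saturation rate is reached exactly at the batch size
where the static configuration time equals the time spent on the variable input
data.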
[Figure 8: three panels. Left: MAC operation rate [op/s] versus order of square
weight matrix; legend: 8GBit/s, 1GBit/s, v1, v2, sim., full chip. Center:
relative time consumption versus order of square weight matrix; legend: full
chip, hardware, graph exec, preprocessing, graph build, postprocessing. Right:
MAC operation rate [op/s] versus batch size; legend: static config size, v1, v2,
sim.]
Fig. 8: Performance measurement for square matrices with sizes ranging from
$1^2$ to $(2^{14})^2$ elements (batch size: 2000). The dotted line marks the
matrix size of a single BSS-2 chip (left and center). Left: Rate of MAC
operations for real and simulated hardware: the yellow line illustrates the
limiting speed for setups using 1 Gbit/s links (host link); the gray line shows
this for 8 Gbit/s links (chip I/O); disregarding a constant configuration
overhead, the rate of operations increases linearly for matrices smaller than a
chip; for matrices larger than the full chip area, individual runs can overlap
in the pre- and postprocessing step, thereby increasing the speed further; the
red dots and blue crosses show measurements using the real hardware setup.
Center: Distribution of execution times for the 8 Gbit/s case (cf. fig. 2 for
details on the categories); the chip utilization increases for matrices larger
than the full chip area. Right: Measurement for varying batch sizes (matrix
size: $256^2$); rates are plotted for different chip versions and simulated
design-goal hardware; the dotted line marks the batch size where the static
configuration overhead matches the variable data volume; this coincides with the
point where ≈ 50% of the maximum performance is reached.
4.2 Application example: Human Activity Recognition
Our application example uses the smartphone acceleration data from [10]. The
dataset contains recordings of 30 subjects carrying a waist-mounted smartphone
while walking straight ahead, upstairs or downstairs, sitting, standing or
lying on their backs. The data is already split into training and test data;
each sample contains 9 channels of sensor data with a length of 2.56 s, sampled
at 50 Hz.
The model We use a 1-d convolution layer with kernel size 32, stride 6 for feature
detection, followed by two dense layers. The model topology is defined by tab. 1.
The hyper-parameters of all layers are optimized for the dimensions of the analog
substrate to achieve a balance between accuracy, energy efficiency and execution
speed. We do not use additive biases in any layer.
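The input length and the parameter counts listed in tab. 1 follow directly from
the stated dimensions, as the short check below illustrates. The helper
functions are hypothetical and only count weights, since no biases are used.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Number of weights of a 1-d convolution layer without bias."""
    return in_channels * out_channels * kernel_size


def linear_params(in_features, out_features):
    """Number of weights of a dense layer without bias."""
    return in_features * out_features


# 2.56 s of sensor data sampled at 50 Hz.
samples_per_window = round(2.56 * 50)

params = {
    "Conv1d": conv1d_params(9, 16, 32),
    "Linear-1": linear_params(256, 125),
    "Linear-2": linear_params(125, 6),
}
total_params = sum(params.values())
```

The 256 input features of the first dense layer correspond to the 16 × 16
output of the convolution layer.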
Table 1: Model parameters for the “Human Activity Recognition” example.
Layer     Activation  Input Shape   Output Shape  # of Params
Conv1d    ReLU        [-1, 9, 128]  [-1, 16, 16]  4608
Linear-1  ReLU        [-1, 256]     [-1, 125]     32000
Linear-2  Softmax     [-1, 125]     [-1, 6]       750
Training First, we trained the model for 50 epochs in software without our
accelerator. In this step, the weights and inputs were already quantized and
scaled to the dynamic range of the hardware, and Gaussian noise was added
to the output of each layer. The resulting confusion matrix is shown in fig. 9a.
Running the very same model on our analog substrate yields results that deviate
strongly (cf. fig. 9b), which can be explained by fixed-pattern noise and
non-linearities. One additional epoch of hardware-in-the-loop training is
sufficient to come significantly closer to our software results, as depicted in
fig. 9c.
[Figure 9: three confusion matrices (true label versus obtained label) over the
classes Walking (W), W. Upstairs (WU), W. Downstairs (WD), Sitting (Si),
Standing (St) and Lying (L); color scale: # of samples, 10^0 to 10^2. Recall
values in the listed class order: (a) in software: 95.4, 96.8, 99.8, 89.0, 78.2,
98.7; (b) on-chip: 96.6, 2.5, 85.2, 49.3, 0.4, 4.8; (c) after retraining: 95.8,
85.6, 97.4, 40.9, 81.6, 93.7.]
Fig. 9: Confusion matrices and recall accuracies for the separate test set, (a) after
50 epochs training in software, (b) executed on BSS-2 without retraining, and
(c) after one additional epoch training with hardware in the loop.
Results The achieved overall accuracy is 92.7% in software and 82.3% on-chip.
As already found in [10, 11] there is a noticeable misclassification between sitting
and standing. This can be explained by the similar orientation of the smartphone
and the non-dynamic nature of these activities.
5 Discussion and Outlook
This work presents software developments integrating BSS-2 into the PyTorch
machine learning framework. Using the PyTorch extension interface, hxtorch
provides support for convolutional and dense layers. The hardware backend
builds upon the BrainScaleS operating system software, cf. [6, 7], and utilizes a
configuration and runtime flow that is essentially identical to the spiking opera-
tion of the system. We describe the underlying implementation comprising a data
flow representation, operator mapping, hardware partitioning, and system setup
and execution. We evaluate end-to-end runtime performance and compare it with
hardware design limitations. In particular, we show that the software is able to
reach the end-to-end runtime performance of the hardware design limitations
within a factor of two for sufficiently large matrix multiplications. We obtain
14.7Gop/s for a single simulated accelerator. In conjunction with the single GbE
host link, this surpasses the speed of v2 by a factor of ≈ 5. Starting with a batch
size of ≈ 200 the performed computation becomes significant compared to the
static configuration. Finally, we demonstrate an end-to-end application example
on “Human Activity Recognition” with software results comparable to [11] but
obtain a drop in the classification quality for the hardware backend. We expect
the result to improve for v2.
Although we focused here on an interface to the non-spiking functionality of
the accelerator, it is possible to extend PyTorch to model the spiking
operation of the chip as well. In the simplest case, input spikes are
represented as sparse binary tensors, with one of the axes being the time
dimension. For efficient hardware operation, this time dimension is identified
with the time dimension of the intrinsic temporal dynamics of the analog
circuitry. This implies that recurrent spiking neural networks with up to
N = 512 neurons per chip can be implemented. A simulation can be run
concurrently to estimate the gradients. By performing the forward pass through
the chip and the backward pass through the simulation as in [8, 9, 12],
hardware parameters can be adjusted based on simulation parameter updates.
Ideally, gradient estimation would be independent of analog measurements on the
chip, which introduce additional memory bandwidth requirements. Finally, deeper
integration into ML frameworks facilitates compute graph analysis.
Consequently, network topologies can be described natively; the compute graph
still allows for the specification of arbitrary computation (e.g., plasticity
rules in spiking neural networks) that can be offloaded to the embedded
processors.
Contributions and Acknowledgments
PS is the main developer of the software extensions of this work. EM is the lead
developer and architect of the BSS-2 software stack. AE contributed to the Py-
Torch extension module and the model application. AE, AL, CM, OB, TCW and
YS contributed to the software implementation. CP contributed to the initial
implementation of the PyTorch extension module. JW is a main contributor to
hardware commissioning and contributed to the PyTorch extension module. SB
contributed to the hardware design, commissioning and the software implemen-
tation. SS contributed to core software components and the software design. JS
is the lead designer and architect of the BSS-2 neuromorphic system. All authors
discussed and contributed to the manuscript.
The authors wish to thank all present and former members of the Electronic
Vision(s) research group contributing to the BrainScaleS-2 hardware platform.
The authors express their special gratitude towards J. Klähn, D. Stöckel and
S. Friedmann for earlier software contributions. We especially express our
gratitude to the late Karlheinz Meier, who initiated and led the project for
most of its time. This work has received funding from the EU
([H2020/2014-2020]) under grant agreements 720270, 785907 (HBP) as well as from
the BMBF (16ES1127).
References
1. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T.,
Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z.,
Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and
Chintala, S.: PyTorch: An Imperative Style, High-Performance Deep Learning
Library. In: Advances in Neural Information Processing Systems 32. Ed. by
H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and
R. Garnett, pp. 8024–8035. Curran Associates, Inc. (2019)
2. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.,
Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving,
G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané,
D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner,
B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas,
F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X.:
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
(2015)
3. Schemmel, J., Billaudelle, S., Dauer, P., and Weis, J.: Accelerated Analog Neuro-
morphic Computing. arXiv preprint (2020). arXiv: 2003.11996 [cs.NE]
4. Weis, J., Spilger, P., Billaudelle, S., Stradmann, Y., Emmel, A., Müller, E.,
Breitwieser, O., Grübl, A., Ilmberger, J., Karasenko, V., Kleider, M., Mauch, C.,
Schreiber, K., and Schemmel, J.: Inference with Artificial Neural Networks on
the Analog BrainScaleS-2 Hardware. In: Proceedings of the Workshop on IoT,
Edge, and Mobile for Embedded Machine Learning (ITEM)/ECML-PKDD 2020
(submitted) (2020)
5. He, H.: The State of Machine Learning Frameworks in 2019. The Gradient (2019)
6. Müller, E., Schmitt, S., Mauch, C., Schmidt, H., Montes, J., Ilmberger, J., Klähn,
J., Passenberg, F., Koke, C., Kleider, M., Jeltsch, S., Güttler, M., Husmann,
D., Billaudelle, S., Müller, P., Grübl, A., Kaiser, J., Weidner, J., Vogginger, B.,
Partzsch, J., Mayr, C., and Schemmel, J.: The Operating System of the Neuro-
morphic BrainScaleS-1 System. arXiv preprint (2020). arXiv: 2003.13749 [cs.NE]
7. Müller, E., Mauch, C., Spilger, P., Breitwieser, O.J., Klähn, J., Stöckel, D., Wun-
derlich, T., and Schemmel, J.: Extending BrainScaleS OS for BrainScaleS-2. arXiv
preprint (2020). arXiv: 2003.13750 [cs.NE]
8. Schmitt, S., Klähn, J., Bellec, G., Grübl, A., Güttler, M., Hartel, A., Hartmann,
S., Husmann, D., Husmann, K., Jeltsch, S., Karasenko, V., Kleider, M., Koke,
C., Kononov, A., Mauch, C., Müller, E., Müller, P., Partzsch, J., Petrovici, M.A.,
Vogginger, B., Schiefer, S., Scholze, S., Thanasoulis, V., Schemmel, J., Legenstein,
R., Maass, W., Mayr, C., and Meier, K.: Classification With Deep Neural Networks
on an Accelerated Analog Neuromorphic System. Proceedings of the 2017 IEEE
International Joint Conference on Neural Networks (2017).
doi: 10.1109/IJCNN.2017.7966125
9. Cramer, B., Billaudelle, S., Kanya, S., Leibfried, A., Grübl, A., Karasenko, V.,
Pehle, C., Schreiber, K., Stradmann, Y., Weis, J., Schemmel, J., and Zenke, F.:
Training spiking multi-layer networks with surrogate gradients on an analog neu-
romorphic substrate. arXiv preprint (2020). arXiv: 2006.07239 [cs.NE]
10. Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J.L.: A public domain
dataset for human activity recognition using smartphones. In: ESANN (2013)
11. Ronao, C.A., and Cho, S.-B.: Human activity recognition with smartphone sensors
using deep learning neural networks. Expert systems with applications 59, 235–244
(2016)
12. Le Gallo, M., Sebastian, A., Mathis, R., Manica, M., Giefers, H., Tuma, T., Bekas,
C., Curioni, A., and Eleftheriou, E.: Mixed-precision in-memory computing. Nature
Electronics 1(4), 246–253 (2018). doi: 10.1038/s41928-018-0054-8
