With more and more event-based neuromorphic hardware systems being developed at universities and in industry, there is a growing need for assessing their performance with domain specific measures. In this work, we use the methodology of converting pre-trained nonspiking to spiking neural networks to evaluate the performance loss and measure the energy-per-inference for three neuromorphic hardware systems (BrainScaleS, Spikey, SpiNNaker) and common simulation frameworks for CPU (NEST) and CPU/GPU (GeNN). For analog hardware we further apply a re-training technique known as hardware-in-the-loop training to cope with device mismatch. This analysis is performed for five different networks, including three networks that have been found by an automated optimization with a neural architecture search framework. We demonstrate that the conversion loss is usually below one percent for digital implementations, and moderately higher for analog systems with the benefit of much lower energy-per-inference costs.
Introduction
Diverse event-based neuromorphic hardware systems promise the accelerated execution of so-called spiking neural networks (SNN), also referred to as the third generation of neural networks [14]. The most prominent representatives of this class of hardware accelerators include the platforms Braindrop [16], BrainScaleS [22], DYNAPs [15], Loihi [5], SpiNNaker [8] and Truenorth [1]. With the diversity of hardware accelerators comes a problem for potential end-users: which platform is best suited for a given spiking neural network algorithm, possibly respecting inherent resource requirements for embedding in mobile robots or smart devices? Usually, this question is answered by evaluating a set of benchmarks on all qualified systems, which measure the state-of-the-art and quantify progress in future hardware generations (see e.g. [4]). Here, we face two major challenges with neuromorphic hardware. First, there is no universal interface to all hardware/software simulators despite some projects like PyNN [6]. Second, there are quite a few promising network models and learning strategies, but "the" algorithm for spiking neural networks is still missing. One recent network that has been implemented across several systems is the cortical microcircuit model [2, 13]. A follow-up publication [21] shows how this benchmark has driven platform-specific optimization that, in the end, improves the execution of various networks on the SpiNNaker platform, confirming the value of benchmarks. However, it is also an example of a platform-specific implementation to reach maximal performance on a given system.
One commonly agreed application for spiking neural networks is the conversion of conventionally trained artificial neural networks (ANN) to rate-based SNNs [7]. Although this does not use SNNs in their most efficient way, it is a pragmatic approach that can be ported to different accelerators, independent of their nature. In this work, we use this approach for evaluating five distinct networks, either defined by hardware restrictions, by already published work, or by employing neural architecture search (NAS) with Lamarck ML [11] to optimize the network topology. We evaluate these networks on BrainScaleS, Spikey [20], and SpiNNaker as well as the CPU simulator NEST [9] and the CPU/GPU code-generation framework GeNN [25]. Furthermore, we use a retraining approach with neuromorphic hardware-in-the-loop (HIL) proposed in [23] to unlock the full potential of the analog neuromorphic hardware systems. Section 2 outlines the target systems, the software environment, and the methods used. Section 3 presents the results, including neuron parameter optimization, and accuracy along with energy measurements for all target platforms.
Methods
In the following we introduce all target systems and the software environment as well as the methodology followed.
Target Systems and Software
All target systems in this work support the simulation or emulation of leaky integrate-and-fire neurons with conductance-based synapses, although especially analog systems are limited to specific neuron models. NEST is a scalable software simulator suited to simulate small as well as extensive networks on compute clusters. It is used in version 2.18 [12], executed with four threads on an Intel Core i7-4710MQ mobile processor. GeNN [25] is a code generation framework for the simulation of SNNs. In its current release version (4.1.0), it supports generating code for a single-threaded CPU simulation or for graphics processing units (GPU) supporting NVIDIA CUDA. Networks are evaluated on an NVIDIA GeForce 1080 TI GPU; runtimes are measured for networks without recording any spikes, because getting spikes back from the GPU effectively stops the simulation at every time step to copy the data between GPU and CPU. For this publication we use single precision, and all simulators use a time step of 1 ms; NEST, however, uses an adaptive time step to integrate the neuron model. The fully digital many-core architecture SpiNNaker [8] comes in two different sizes, which are both used in this work. The smaller SpiNN3 system is composed of four chips; the larger SpiNN5 board consists of 48 chips. A single chip comprises 18 ARM968 general purpose CPU cores, with each simulating up to 255 IF_cond_exp neurons. The system runs in real-time, simulating 1 ms of model time in 1 ms wall clock time. SpiNNaker is used with the latest released software version 5.1.0 using PyNN 0.9.4. Finally, we make use of two mixed-signal (analog neural circuits, digital interconnect) systems. First, the Spikey system [20] supports the emulation of 384 neurons with 256 synapses each. The emulated neuron model is subject to restricted parameter ranges (e.g. four bit weights, limited time constants) with some parameters prescribed by the hardware (e.g. the membrane capacitance). The system runs at a speedup of 10,000, therefore taking only 0.1 µs to emulate 1 ms of model time. Second, Spikey's successor BrainScaleS [22] shares many of Spikey's properties. Most notable are the now fully parameterizable neuron model and the usage of wafer-scale integration, combining 384 accessible HICANN chips on a single wafer for a full system. Each chip implements 512 neuron circuits with 220 synapses each, where up to 64 circuits can be combined to form a single virtual neuron, allowing more robust emulations and a higher synapse fan-in.
While all of these platforms formally support the PyNN API [6], the supported API versions differ between simulators, impeding the portability of code. Cypress [24] is a C++ framework abstracting away these differences. For NEST, Spikey and SpiNNaker the framework makes use of their PyNN interfaces; for BrainScaleS and GeNN, however, a lower-level C++ interface is used. Furthermore, the proposed networks studied below are part of the Spiking Neural Architecture Benchmark Suite (SNABSuite) [17, 18], which also covers benchmarks like low-level synthetic characterizations and application-inspired (sub-)tasks with an associated framework for automated evaluation.
Energy measurements have been taken with a Ruideng UM25C power meter (SpiNNaker, Spikey), with a PeakTech 9035 for CPU simulations, or with the NVIDIA nvidia-smi tool for GPU simulations. There is no possibility for remote energy measurements on the BrainScaleS system. Thus, the values have been estimated from the number of pre-synaptic events using published data in [23].
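To make the energy-per-inference measure concrete, the following minimal sketch divides the energy drawn during a run by the number of classified samples. The function name, the optional idle-power subtraction and all numbers in the usage note are illustrative assumptions and are not taken from the measurements reported here.

```python
def energy_per_inference(avg_power_w, runtime_s, n_inferences, idle_power_w=0.0):
    """Return the energy per inference in joules.

    avg_power_w:  average power drawn during the run (e.g. read from a power meter)
    runtime_s:    wall clock duration of the run in seconds
    n_inferences: number of samples classified during the run
    idle_power_w: optional idle power to subtract (assumption: defaults to 0,
                  i.e. the full measured power is attributed to the run)
    """
    if n_inferences <= 0:
        raise ValueError("n_inferences must be positive")
    if runtime_s < 0:
        raise ValueError("runtime_s must be non-negative")
    # Energy = power * time, distributed evenly over all inferences.
    return (avg_power_w - idle_power_w) * runtime_s / n_inferences
```

As a hypothetical example, a real-time system that presents 10,000 samples for 200 ms each runs for 2,000 s; at an assumed average power of 3 W this gives energy_per_inference(3.0, 2000.0, 10000), i.e. 0.6 J per inference.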
Converting DNNs to SNNs
This work is based on the idea of [3, 7], where a pre-trained artificial neural network is converted into an SNN. In this case, we train several multi-layer perceptrons that differ in size to classify MNIST handwritten digits. The training uses standard batch-wise gradient descent in combination with error backpropagation. The conversion exploits the fact that the activation curve of a LIF neuron resembles the ReLU activation curve, such that float (analog) values of the ANN become spike rates in the SNN. All weights of the ANN are normalized to the maximal weight of the full network, and then scaled to a maximal value either given by restrictions of the hardware platform (e.g. 4 bit weights on Spikey/BrainScaleS) or determined by parameter optimization (see below for details). Similarly, other parameters of the SNN are found by extensive parameter tuning or are fixed due to hardware constraints. Neuron biases are not easily and efficiently mapped to SNNs, which is why we set all bias terms to zero in the training process of the ANN. In contrast to [7], we found that using a softmax layer as the last layer in the ANN for training does not necessarily decrease the performance of the SNN. However, using softmax will lead to an increased number of spikes for all rejected classes (cf. Figure 1).
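The weight normalization and scaling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: weights are given as nested lists (layers of rows), and the optional quantization rounds the magnitude to a uniform grid of 2^n - 1 steps, which is only one plausible way to model four bit hardware weights and is not necessarily the scheme used by the platforms.

```python
def scale_weights(layers, max_weight, n_bits=None):
    """Normalize ANN weights by the global maximum and scale them to max_weight.

    layers:     list of weight matrices, each a list of rows of floats
    max_weight: target value for the largest absolute weight in the SNN
    n_bits:     if given, round each weight to a uniform grid with
                2**n_bits - 1 non-zero magnitude levels (assumed scheme)
    """
    global_max = max(abs(w) for layer in layers for row in layer for w in row)
    if global_max == 0:
        raise ValueError("all weights are zero")

    def convert(w):
        # Normalize to the largest weight of the full network, then rescale.
        value = w / global_max * max_weight
        if n_bits is not None:
            step = max_weight / (2 ** n_bits - 1)
            value = round(value / step) * step
        return value

    return [[[convert(w) for w in row] for row in layer] for layer in layers]
```

Because a single global maximum is used, the ratio between weights of different layers is preserved; only the optional rounding changes relative magnitudes.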
As the Spikey platform is very limited in size and connectivity, the smallest and simplest network (referred to as the Spikey network) consists of a single hidden layer with 100 neurons and no inhibitory connections. Spikey requires separation of excitation and inhibition at the neuron level and consists of two separate chips with limited connectivity between them. Thus, we only used positive weights and achieved the best performance using a hinge loss, which increases the weights for the winner neurons and decreases weights for the second place neuron only. Due to the acceleration factor of Spikey and BrainScaleS, communication bandwidth limits the usable spike rates. Rates that are too high (input as well as inter-neuron rates) will inevitably lead to spike loss, which reduces the performance of the network. This naturally restricts the parameter space to be evaluated. Still, there is a significant performance loss when applying the conversion process to analog systems. Perfect conversion requires that all synapses with the same weight and all neurons behave identically, i.e. that they have identical activation curves. On analog systems, however, we have to deal with temporal noise perturbing the membrane voltage, trial-to-trial variation and analog mismatch between circuits [19]. As shown in [24], such a hardware network will perform at roughly 60-70% accuracy compared to a simulator, even after platform-specific parameter tuning. The authors of [23] proposed to train the pre-trained neural network again while replacing the outputs of the ANN with spike rates recorded from hardware, employing back-propagation to train a device-specific network. All details can be found in [23] (Figure 7).
Neural Architecture Search (NAS)
Lamarck ML [11] is a modular and extensible Python library for application-driven exploration of network architectures. This library allows the user to define a class of network architectures to be examined and operations to modify and combine those architectures. These definitions are then used by a search algorithm to explore and evaluate network architectures in order to maximize an objective function. For this work, the limitations of the neuromorphic hardware systems compared to state-of-the-art processing units are the leading motivation for the applied restrictions. The applied layer types are limited to fully connected layers, which may be arranged in a non-sequential manner resulting in an acyclic directed graph structure. To preserve the structural information of a single neural network in the exploration process, a meta graph is created to contain the current network and the meta graph of the networks which were involved in creating it. This process is unbounded and accumulates structural information over several generations in the meta graph. To forget unprofitable information, the meta graph is designed to dismiss structural information that has not been used in the last five exploration steps. One exploration step consists of combining the meta graphs of two network architectures and sampling a new path in this meta graph in order to create an improved architecture. A new architecture is created by sampling a path based on the quality of its best architecture and amending it with elements that have not been examined before.
The exploration procedure is performed by a genetic algorithm configured with a generation size of 36 network architectures, of which 20 are selected based on an exponential ranking to create new architectures for the next generation. This next generation is created with an elitism replacement scheme that preserves the best two network architectures of the previous generation. In total, 75 generations have been created in the NAS to find an architecture that achieves at least 97% evaluation accuracy. Above this threshold, an architecture is defined to be better if it requires fewer than 100 neurons for increasing the accuracy by 1%.
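The selection scheme of the genetic algorithm can be illustrated with a short, library-independent sketch. It does not use the Lamarck ML API; the function names, the base of the exponential ranking (0.9) and sampling with replacement are assumptions made for illustration. Only the generation size of 36, the 20 selected architectures and the two elites are taken from the text.

```python
import random


def exponential_ranking_probs(n, base=0.9):
    """Selection probability per rank (rank 0 is the best individual).

    The probability decays as base**rank; the base value is an assumption.
    """
    raw = [base ** rank for rank in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def select_parents(population, fitness, n_select=20, n_elite=2, base=0.9, rng=None):
    """Return (parents, elite) for creating the next generation.

    parents: n_select individuals drawn by exponential ranking
             (with replacement, which is an assumption)
    elite:   the n_elite best individuals, carried over unchanged
    """
    rng = rng or random.Random()
    ranked = sorted(population, key=fitness, reverse=True)
    probs = exponential_ranking_probs(len(ranked), base)
    parents = rng.choices(ranked, weights=probs, k=n_select)
    return parents, ranked[:n_elite]
```

Ranking-based selection depends only on the order of the architectures, not on the absolute accuracy differences, which keeps selection pressure stable once most candidates reach similar accuracies.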
Results
The first two parts of this section present the parameter tuning process used for the converted SNNs. Details of four different networks are shown: the smallest one was defined by the restrictions of the Spikey platform, while the remaining networks were picked from the neural architecture search. The final part gathers the results for all networks, including one model taken from literature.
The Spikey Network and Parameter Optimization
This is the simplest network used in this work. As described above, it is motivated by the hardware restrictions of the Spikey neuromorphic hardware system and uses an 89 × 100 × 10 layout, which requires images to be scaled down using 3 × 3 average pooling (cf. Fig. 2). These restrictions limit the test accuracy of the pre-trained network to only 90.13%. This serves as the baseline for the following optimizations of the most relevant SNN parameters. [...] high relative accuracy is rather narrow. Therefore, careful parameter tuning has to be done.
Taking a look at the most relevant conversion parameters, Figure 4 shows the accuracy in relation to the sample presentation time and the maximal spike input frequency. First, simulating more than 200 ms will result in minor improvements only. Analog platforms converge a bit slower (which is partially caused by different neuron parameters used in the simulation), and the benefits of using presentation times larger than 200 ms are minor again. However, prolonged presentation times can cancel out some of the temporal noise on membrane voltages and synapses. Second, all platforms gain significantly from frequencies larger than 40 Hz. However, due to communication constraints in the accelerated analog platforms, the accuracy decreases for values above 60 Hz. Here, two bandwidth restrictions may play a major role. First, input spikes are inserted into the digital network using FPGAs. Any spike loss is usually reported by the respective software layer. However, on the wafer, there might be additional loss in the internal network, which is not reported. Second, the output rates of hidden and output layers are a source of potential spike loss, which is only partially reported for the Spikey system (by monitoring spike buffers), but happens silently on the BrainScaleS system. The Spikey system reports full buffers for larger frequencies, which is why we assume that this is the major cause for spike loss on both systems.
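The interplay of presentation time and maximal input frequency can be illustrated with a small rate-coding sketch. It maps a normalized pixel intensity to a regular spike train; whether the input spike trains are regular or Poisson distributed is not stated above, so the regular spacing, the function name and the defaults (60 Hz, 200 ms) are assumptions chosen to match the values discussed in the text.

```python
def pixel_to_spike_times(intensity, max_rate_hz=60.0, duration_ms=200.0):
    """Return regular spike times in ms for a pixel intensity in [0, 1].

    The firing rate scales linearly with the intensity, up to max_rate_hz.
    Spikes are placed at the centre of equally sized intervals (assumption).
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    rate_hz = intensity * max_rate_hz
    if rate_hz == 0.0:
        return []
    interval_ms = 1000.0 / rate_hz
    # Number of full inter-spike intervals that fit into the presentation time.
    n_spikes = int(duration_ms * rate_hz / 1000.0)
    return [(k + 0.5) * interval_ms for k in range(n_spikes)]
```

With these defaults a fully bright pixel emits 12 spikes per sample and a pixel at half intensity emits 6, which shows how few spikes are available to encode a value when either the presentation time or the maximal frequency is reduced.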
To reach a high efficiency on larger systems, like SpiNN5 or GPUs, it is crucial to fully utilize them. Therefore, we used several parallel instances of the same network, each classifying a separate portion of the data. In our setup this is controlled by choosing the batch size: a smaller batch size leads to more independent batches processed in parallel. On SpiNNaker the hardware size and the required number of processor cores per network instance determine the parallelism. On GeNN the working memory required to compile the GPU code is the determining factor. The latter is a limitation caused by using separate populations per layer, which could be merged to possibly lead to an increased parallelism of the networks, but not necessarily to increased efficiency.
Fig. 5. Results of the optimization process. Highlighted are three candidate networks at the Pareto front with their respective network layout.
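How hardware size and cores per network instance bound the parallelism on SpiNNaker can be estimated with a short calculation. The sketch assumes that every layer, including the input layer, is placed on its own cores and that 16 of the 18 cores per chip are available for neuron simulation; both are illustrative assumptions, as is the function naming. The limit of 255 neurons per core and the chip counts are taken from the system description above.

```python
import math


def cores_per_instance(layer_sizes, neurons_per_core=255):
    """Number of cores one network instance occupies (one population per layer)."""
    return sum(math.ceil(n / neurons_per_core) for n in layer_sizes)


def parallel_instances(n_chips, layer_sizes, neurons_per_core=255,
                       usable_cores_per_chip=16):
    """Upper bound on network instances that fit on the machine.

    usable_cores_per_chip=16 is an assumption (cores reserved for system
    tasks are excluded); routing and memory limits are ignored.
    """
    total_cores = n_chips * usable_cores_per_chip
    return total_cores // cores_per_instance(layer_sizes, neurons_per_core)
```

Under these assumptions the 89 × 100 × 10 Spikey network occupies 3 cores, so at most 21 instances fit on the four chips of SpiNN3 and 256 on the 48 chips of SpiNN5. Reducing the number of neurons per core, as discussed for the larger networks, lowers these bounds accordingly.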
NAS Optimized Networks
The optimization process was driven by two major goals: to reach an accuracy larger than 97% and at the same time to reduce the network size in terms of the number of neurons. Results in Figure 5 reveal that this does not necessarily lead to networks with a single hidden layer. Furthermore, the sequential neural networks outperformed all evaluated non-sequential architectures. We have chosen three candidates on the Pareto front for evaluation on neuromorphic hardware:
- the network with the highest evaluation accuracy (NAStop, 97.71%)
- the optimal network with the best trade-off (NAS129, 97.53%)
- a small network with still sufficient accuracy (NAS63, 96.76%)
Table 3.3 collects the results for all target platforms. Most striking is the energy efficiency of the analog platforms, which is two orders of magnitude higher than on the other platforms. Furthermore, HIL training recovers most of the conversion losses found for these platforms (despite the four bit weight accuracy). Larger networks have not been evaluated either due to size restrictions, or because combined spike rates of input pixels are too high to get any reasonable results. The SpiNNaker system, in both variants, performs on the same efficiency level as the CPU/GPU implementations although its technology is much older (130 nm vs. 22 nm CPU vs. 16 nm GPU). Furthermore, there is less than one percent loss in accuracy due to the conversion in almost all cases. However, for the large networks the system was performing at its limits, and we had to reduce the maximal number of neurons per core. Of course, this can be mitigated by further reducing the number of neurons per core or slowing down the system, with respective negative impacts on the energy per inference. Interesting differences have been found for NEST: the accuracy is a bit lower, and the energy per inference is one order of magnitude higher than for the GeNN CPU simulation. The latter is mainly due to the more accurate integrator employed by the NEST simulator (especially the adaptive time step in the integrator), which is also responsible for the significant energy gap between the two CPU simulators NEST and GeNN. Furthermore, the multi-threaded execution of NEST does not reduce the computation time compared to GeNN.
With the increase of network complexity there is next to no increase in GPU execution time, indicating that despite parallelization of the networks, the GPU is still not utilized fully for the smaller networks (there are 2100-48,200 simultaneously simulated neurons for the GPU depending on the network). Still, for the larger networks, the GPU implementation is the fastest simulation available.
Benchmark results
The last network in Table 3.3 is taken from [7], as the network weights are published within the respective repository. The layout is 784 × 1200 × 1200 × 10, and thus it is significantly larger. The results show that the SpiNN3 system still operates at its limits (as reported by the software stack) despite the used slow-down. The other platforms show nearly the same accuracy with next to no loss in the conversion process. Concerning the energy per inference, the larger SpiNNaker platform is slightly better than the CPU implementation.
Conclusion and Outlook
We have demonstrated the capability of all target platforms to simulate converted deep neural networks. The loss in the conversion process is negligible in many cases, and for analog platforms we successfully employed retraining to reach high accuracy. Furthermore, we calculated the used energy-per-inference for all networks and platforms, quantifying the efficiency vs. accuracy trade-off of analog platforms. The digital SpiNNaker platform is highly efficient if fully utilized. If primarily simulation time needs to be optimized, GeNN and its GPU backend allow fast and efficient simulation of SNNs. The approach used in this work is not the most efficient way of using spiking neural networks. However, the rate-coding applied here can be replaced with a more efficient time-to-first-spike encoding, using only a few spikes with much faster response times, which has recently been demonstrated on analog hardware [10] . Therefore, the results from this work must be seen as a conservative measure for the relative efficiency of SNNs on neuromorphic hardware. Furthermore, we did not make use of convolutional networks, because these currently cannot be mapped well to neuromorphic hardware.
Funding/Acknowledgment

