Abstract: Despite being originally inspired by the central nervous system, artificial neural networks have diverged from their biological archetypes as they have been remodeled to fit particular tasks. In this paper, we review several possibilities for reverse-mapping these architectures to biologically more realistic spiking networks with the aim of emulating them on fast, low-power neuromorphic hardware. Since many of these devices employ analog components, which cannot be perfectly controlled, finding ways to compensate for the resulting effects represents a key challenge. Here, we discuss three different strategies to address this problem: the addition of auxiliary network components for stabilizing activity, the utilization of inherently robust architectures, and a training method for hardware-emulated networks that functions without perfect knowledge of the system's dynamics and parameters. For all three scenarios, we corroborate our theoretical considerations with experimental results on accelerated analog neuromorphic platforms.
I. INTRODUCTION
Artificial neural networks (ANNs) rank among the most successful classes of machine learning models, but are, superficial similarities to sensory processing pathways in cortex notwithstanding, difficult to map to biologically realistic spiking neural networks. Nevertheless, we argue that such a reverse mapping is worthwhile for two reasons. First, it could help us understand information processing in the brain, assuming that it follows similar computational principles. Second, it enables machine learning applications on fast, low-power neuromorphic architectures that are specifically developed to mimic biological neuro-synaptic dynamics. In this manuscript, we discuss several ways to answer what we consider to be a key challenge for neuromorphic architectures with analog components: Is it possible to design spiking architectures and training methods that are amenable to neuromorphic implementation and remain functionally performant despite substrate-inherent imperfections?
More specifically, we review three different approaches [1]-[3]. The first two are based on recent insights about how networks of spiking neurons can be constructed to sample from predefined joint probability distributions [4], [5]. When these distributions are learned from data, these networks automatically build an internal, generative model, which is then straightforward to use for pattern recognition and memory recall [6]. Practical problems arise when the hardware dynamics and parameter ranges are incompatible with the target specifications of the network, as these inevitably distort the sampled distribution. The first approach involves the addition of auxiliary network components in order to make the network robust to hardware-induced distortions (Sec. II). The second one restricts the network topology in a way that endows it with immunity to some of these effects (Sec. III). We demonstrate the effectiveness of both these approaches on the Spikey neuromorphic system [7].
The third strategy maps traditional feedforward architectures, trained offline with a backpropagation algorithm, to a network of spiking neurons on the neuromorphic device (Sec. IV). Here, the key to good performance is an additional learning phase where parameters are trained on hardware in the loop, while using the abstract network description as an approximation for the parameter updates. We show how this approach can restore network functionality despite having incomplete knowledge about the gradient along which the parameters need to descend. These experiments are performed on the BrainScaleS neuromorphic system [8].
While our networks are small compared to those used in contemporary machine learning applications, they showcase the potential of using accelerated analog neuromorphic systems for pattern representation and recognition. In particular, the neuromorphic systems used here operate $10^4$ times faster than their biological archetypes, thereby significantly speeding up both training and practical application.
II. FAST SAMPLING WITH SPIKES
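Following [4], [5], neural network activity can be interpreted as sampling from an underlying probability distribution over binary random variables (RVs). The mapping from spikes to states $z = (z_1, \dots, z_k)$ is defined by
$$z_k(t) = \begin{cases} 1 & \text{if } \exists\, t_k^s \in (t - \tau_\mathrm{ref},\, t], \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$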
where $t_k^s$ are the spike times of the kth neuron and $\tau_\mathrm{ref}$ is its absolute refractory period (Fig. 1 A). When using leaky integrate-and-fire (LIF) neurons, Poisson background noise is used to achieve a high-conductance state, in which the stochastic response of a single neuron is well approximated by a logistic activation function
$$p(z_k = 1) = \sigma\!\left(\frac{\bar{u}_k - \bar{u}_k^0}{\alpha}\right), \qquad (2)$$
where $\sigma(\cdot)$ is the logistic function and $\bar{u}_k$ represents the noise-free membrane potential of the kth neuron. The parameters $\bar{u}_k^0$ (bias parameter determining the inflection point) and $\alpha$ (slope) are controlled by the intensity of the background noise. With appropriate settings of synaptic weights $w_{ij}$ and bias parameters $\bar{u}_k^0$, these networks can be trained to sample from Boltzmann distributions
$$p(z) \propto \exp[-E(z)] = \exp\!\left(\tfrac{1}{2} z^T W z + z^T b\right), \qquad (3)$$
where the weight matrix $W$ and the bias vector $b$ can be chosen freely. This enables the emulation of Boltzmann machines (BMs) with networks of LIF neurons (Fig. 1 B). A core assumption of the neural sampling framework is that the membrane potential $u_k$ of a neuron reflects the state $z_{\setminus k}$ of all presynaptic neurons at any moment in time:
$$u_k(z_{\setminus k}) = b_k + \sum_{j \neq k} W_{kj} z_j. \qquad (4)$$
In particular, this requires that all neurons instantaneously transmit their states (spikes) to all their postsynaptic partners. In any physical system, this assumption is necessarily violated to some degree, since signal transmission can never be instantaneous. In the particular case of accelerated neuromorphic hardware, synaptic transmission delays become even more problematic, as they can be of the same order of magnitude as the state-encoding refractory times themselves. Furthermore, the required equivalence between post-synaptic potential (PSP) durations and refractory states (1, 4) is difficult to guarantee on such a substrate.
Here, we alleviate the issue of substrate-induced timing mismatches by using a recurrent network structure that represents each RV with a small subnetwork, called a sampling unit. The subnetworks are built such that refractory times can be well controlled and, in addition, intra-unit refractory states and inter-unit state communication across the network are inseparably coupled (Fig. 1 C).
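The spike-to-state mapping introduced at the beginning of this section can be illustrated with a short NumPy sketch. The function name `spikes_to_states`, the time grid, and the example spike trains are illustrative assumptions rather than part of the original experiments; the only rule taken from the text is that a neuron is in state 1 for a duration τ_ref after each of its spikes.

```python
import numpy as np

def spikes_to_states(spike_times, t_grid, tau_ref):
    """Binary state z_k(t): 1 if neuron k spiked within (t - tau_ref, t], else 0.

    spike_times: list of 1-D arrays, one per neuron (same time unit as t_grid).
    Returns an integer array of shape (len(t_grid), n_neurons).
    """
    t_grid = np.asarray(t_grid, dtype=float)
    z = np.zeros((t_grid.size, len(spike_times)), dtype=int)
    for k, times in enumerate(spike_times):
        times = np.sort(np.asarray(times, dtype=float))
        if times.size == 0:
            continue  # a silent neuron stays in state 0
        # Index of the most recent spike at or before each time point.
        idx = np.searchsorted(times, t_grid, side="right") - 1
        has_spike = idx >= 0
        last = times[np.clip(idx, 0, None)]
        z[:, k] = has_spike & (t_grid - last < tau_ref)
    return z

# Hypothetical example: two neurons, tau_ref = 10 (arbitrary time units).
example_spikes = [np.array([1.0, 12.0]), np.array([5.0])]
example_grid = np.arange(0.0, 20.0, 1.0)
z_example = spikes_to_states(example_spikes, example_grid, tau_ref=10.0)
```

Counting how often each row of `z_example` occurs yields the sampled joint distribution over network states, which is the quantity compared against the target distribution.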
Sampling units consist of a single principal neuron (PN) and a small synfire chain of excitatory (EPs) and inhibitory populations (IPs). The EPs of each stage project to both populations in the following stage, thereby relaying an activity pulse in the forward direction. The IPs project backwards, ensuring that neurons from previous stages only spike once. Additionally, all IPs and the last EP also project onto the PN with large weights. Therefore, after the PN elicits a spike, the IPs sequentially pull its membrane potential close to the inhibitory reversal potential, prohibiting it from firing as long as the synfire chain is active (Fig. 1 D). When the pulse has reached the final synfire stage, its EP pulls the PN's membrane potential back to its equilibrium value. The total duration of this pseudo-refractory period can then be controlled by the synfire chain length and parameters.
In addition to controlling refractoriness, the synfire chains also mediate the interaction between PNs. The connections from a synfire chain to other PNs simply mirror its connections to its own PN. This guarantees a match between effective interaction durations and pseudo-refractory periods. The correct synapse parameter settings (weights, time constants) are determined in an iterative training procedure [1] .
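The wiring of a sampling unit described above can be written down as a connection list. The following is only a structural sketch: the population labels, the function `sampling_unit_connections`, the assumption that the PN ignites the first synfire stage, and the choice that each IP inhibits the EP of the immediately preceding stage are illustrative assumptions, not the hardware configuration from [1]. Connections to other PNs are tagged `"mirror"` because their parameters are set by the iterative training procedure and are not modeled here.

```python
def sampling_unit_connections(unit, n_stages, other_units=()):
    """Return (source, target, kind) triples for one sampling unit.

    kind is "exc", "inh", or "mirror" (a copy of a chain-to-own-PN
    projection that targets another unit's PN).
    """
    pn = f"PN{unit}"
    ep = [f"EP{unit}_{s}" for s in range(n_stages)]
    ip = [f"IP{unit}_{s}" for s in range(n_stages)]

    # Assumption: a PN spike ignites the first stage of the chain.
    conns = [(pn, ep[0], "exc"), (pn, ip[0], "exc")]

    # EPs project to both populations of the following stage.
    for s in range(n_stages - 1):
        conns.append((ep[s], ep[s + 1], "exc"))
        conns.append((ep[s], ip[s + 1], "exc"))

    # IPs project backwards (assumed here: onto the previous stage's EP).
    for s in range(1, n_stages):
        conns.append((ip[s], ep[s - 1], "inh"))

    # All IPs and the last EP project onto the unit's own PN.
    to_own_pn = [(p, pn, "inh") for p in ip] + [(ep[-1], pn, "exc")]
    conns.extend(to_own_pn)

    # Projections to other PNs mirror the projections to the own PN.
    for other in other_units:
        for src, _, _ in to_own_pn:
            conns.append((src, f"PN{other}", "mirror"))
    return conns

# Hypothetical unit 0 with a three-stage chain, coupled to unit 1.
example_conns = sampling_unit_connections(0, n_stages=3, other_units=(1,))
```

Because the mirrored projections reuse the same source populations, the interaction seen by another PN lasts exactly as long as the pseudo-refractory period of the sending unit, which is the property the text relies on.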
The results of a hardware emulation can be seen in Fig. 1 E, F. A network of four sampling units was trained on Spikey to sample from a target Boltzmann distribution. After training, the network needs about $10^4$ ms of biological time to achieve a good match between the sampled and the target distribution. Considering the hardware acceleration factor of $10^4$, this happens in 1 ms of wall-clock time.
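To make the comparison between sampled and target distribution concrete, the sketch below evaluates a four-unit Boltzmann distribution exactly and compares it with samples from an abstract Gibbs sampler that uses the logistic activation rule. This is not an LIF or hardware simulation; the random weights, the seed, the number of sweeps, and the use of the Kullback-Leibler divergence as a measure of the match are all assumptions made for illustration.

```python
import itertools
import numpy as np

def boltzmann_distribution(W, b):
    """Exact p(z) proportional to exp(0.5 z^T W z + z^T b) over all binary states."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    log_p = 0.5 * np.einsum("si,ij,sj->s", states, W, states) + states @ b
    p = np.exp(log_p - log_p.max())  # subtract the maximum for numerical stability
    return states, p / p.sum()

def gibbs_sample(W, b, n_steps, rng):
    """Abstract Gibbs sampler; W must be symmetric with a zero diagonal."""
    n = len(b)
    z = np.zeros(n, dtype=int)
    counts = np.zeros(2 ** n)
    powers = 2 ** np.arange(n - 1, -1, -1)  # matches itertools.product ordering
    for _ in range(n_steps):
        for k in range(n):
            u = b[k] + W[k] @ z  # abstract membrane potential of unit k
            z[k] = rng.random() < 1.0 / (1.0 + np.exp(-u))
        counts[z @ powers] += 1
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) with a small constant guarding against log(0)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.5, size=(4, 4))
W = (A + A.T) / 2.0          # symmetric couplings
np.fill_diagonal(W, 0.0)     # no self-connections
b = rng.normal(0.0, 0.5, size=4)

states, p_target = boltzmann_distribution(W, b)
p_sampled = gibbs_sample(W, b, n_steps=20000, rng=rng)
dkl = kl_divergence(p_sampled, p_target)

# Wall-clock time for 1e4 ms of biological time at a speed-up of 1e4.
wall_clock_ms = 1e4 / 1e4
```

The last line restates the arithmetic from the text: $10^4$ ms of biological time divided by an acceleration factor of $10^4$ gives 1 ms of wall-clock time.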
III. ROBUST HIERARCHICAL NETWORKS
As discussed in the previous section, sampling LIF networks are ostensibly sensitive to different types of hardware-induced timing mismatch. In this section, we discuss how a sampling network model can be made robust by imposing a hierarchy onto the network structure [2]. This is the equivalent of moving from general BMs to restricted BMs (RBMs). In addition to making their operation more robust, as we discuss below, this hierarchization has the distinct advantage of significantly speeding up training. To emulate an RBM, we construct a hierarchical LIF network model with 3 layers: a visible layer representing the data, a hidden layer that learns particular motifs in the data, and a label layer for classification (Fig. 2 A). The network was trained with a contrastive learning rule
$$\Delta w_{ij} \propto \langle z_i z_j \rangle_\mathrm{data} - \langle z_i z_j \rangle_\mathrm{model}, \qquad (5)$$
$$\Delta b_i \propto \langle z_i \rangle_\mathrm{data} - \langle z_i \rangle_\mathrm{model} \qquad (6)$$
on a modified subset of the MNIST dataset ($\langle \cdot \rangle_\mathrm{data}$ and $\langle \cdot \rangle_\mathrm{model}$ represent expectation values when clamping training data and when the network samples freely, respectively). Due to hardware limitations, we used a small network and dataset (6 digits, 12×12 pixels, each with 20 training and 20 test samples) for this proof-of-principle experiment. The specific influence of various hardware-induced distortion mechanisms was first studied in complementary software simulations. These simulations show that the classification accuracy of the network is essentially unaffected by the types of timing mismatch discussed above, even when their amplitudes are much larger than those measured on our neuromorphic substrate (Fig. 2 B, C). In order to facilitate a meaningful comparison with hardware experiments, two further distortion mechanisms were studied. An upper limit to the membrane conductance can prevent neurons from entering a high-conductance state, thereby distorting their activation functions away from their ideal logistic shape (2) and consequently modifying the sampled distribution. However, within the range achievable on Spikey, the effect on the classification accuracy remains small (Fig. 2 D). The largest effect (a reduction of about 5.6 % in classification accuracy compared to ideal software simulations) stems from the discretization of synaptic weights, which have a resolution of 4 bits on Spikey (Fig. 2 E).
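The contrastive learning rule can be sketched in a few lines of NumPy: weight and bias updates are proportional to the difference between state statistics measured with clamped data and those measured while the network samples freely. The function name `contrastive_update`, the learning rate `eta`, and the optional connectivity `mask` (which would restrict updates to the allowed inter-layer connections of a hierarchical network) are assumptions for illustration, not the training code used in [2].

```python
import numpy as np

def contrastive_update(z_data, z_model, mask=None, eta=0.1):
    """Contrastive weight and bias updates from sampled binary states.

    z_data:  (n_samples, n_units) states recorded with clamped training data.
    z_model: (n_samples, n_units) states recorded while sampling freely.
    mask:    optional (n_units, n_units) 0/1 matrix of allowed connections.
    """
    z_data = np.asarray(z_data, dtype=float)
    z_model = np.asarray(z_model, dtype=float)
    # Pairwise correlations <z_i z_j> in the two phases.
    corr_data = z_data.T @ z_data / len(z_data)
    corr_model = z_model.T @ z_model / len(z_model)
    dW = eta * (corr_data - corr_model)
    np.fill_diagonal(dW, 0.0)  # no self-connections
    if mask is not None:
        dW = dW * mask
    # Mean activities <z_i> in the two phases.
    db = eta * (z_data.mean(axis=0) - z_model.mean(axis=0))
    return dW, db

# Toy check: units always co-active in the data phase, silent in the model phase.
dW_example, db_example = contrastive_update(
    np.ones((2, 2)), np.zeros((2, 2)), eta=0.1
)
```

When the data-phase and model-phase statistics coincide, both updates vanish, which is the fixed point the rule converges towards.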
The robustness of this hierarchical architecture to timing mismatches is a consequence of both the training procedure and the information flow within the network. Training has the effect of creating a steep energy landscape E(z) (3), for which deep energy minima, corresponding to particular learned digits, represent strong attractors, in which the system is placed during classification by clamping of the visible layer. Throughout the duration of such an attractor, visible neurons represent pixels of constant intensity encoded in their spiking probability, thereby entering a quasi-rate-based information representation regime. Therefore, the information they provide to the hidden layer is unaffected by temporal shifts or zero-mean noise. As they outnumber the hidden neurons 24:1, they effectively control the state of the hidden layer. The hidden layer neurons themselves are unaffected by timing mismatches because they are not interconnected. Second-order (hidden→label→hidden) lateral interactions are indeed distorted, but as they are mediated by only a few label neurons, their relative strength is too weak to play a critical role.
These findings are corroborated by experiments on Spikey (Fig. 2 F). Due to the system's limitations, we used a hybrid approach, with the visible and label layers implemented in software and the hidden layer running on Spikey. In the ideal, undistorted case, the LIF network had a classification performance of 86.6 ± 1.7 % (93.4 ± 0.9 %) on the test (training) set. This was reduced to 78.1 ± 1.5 % (90.7 ± 1.7 %) when all distortive effects were simultaneously present in software simulations. In comparison, the hybrid emulation showed a performance of 80.7 ± 2.3 % (89.8 ± 1.8 %), which closely matched the software results within the trial-to-trial variability. We stress that this was a result of direct-to-hardware mapping, with no additional training to compensate for hardware-induced distortions (as compared to Sec. IV).
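Of the distortion mechanisms included in the software simulations, the 4-bit weight resolution is the easiest to reproduce in isolation. The sketch below rounds continuous weights to a uniform grid; the assumptions that the sign is handled separately (by the synapse type) and that the 4 bits encode 16 equidistant magnitudes between zero and `w_max` are illustrative and may differ from the actual Spikey weight encoding.

```python
import numpy as np

def discretize_weights(w, n_bits=4, w_max=None):
    """Round weight magnitudes to 2**n_bits uniform levels in [0, w_max]."""
    w = np.asarray(w, dtype=float)
    if w_max is None:
        w_max = np.abs(w).max()
    if w_max == 0:
        return w.copy()
    step = w_max / (2 ** n_bits - 1)          # 15 nonzero levels for 4 bits
    magnitude = np.clip(np.abs(w), 0.0, w_max)
    return np.sign(w) * np.round(magnitude / step) * step

# Hypothetical continuous weights, as they might come out of software training.
rng = np.random.default_rng(1)
w_cont = rng.normal(0.0, 1.0, size=(20, 10))
w_disc = discretize_weights(w_cont, n_bits=4)
max_error = np.max(np.abs(w_disc - w_cont))
```

With this scheme the rounding error of every weight is bounded by half a discretization step, so its impact grows as the weight range `w_max` is stretched to accommodate a few large weights.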
IV. IN-THE-LOOP TRAINING
In Sec. II, we used a training procedure based on (5, 6) to optimize the hardware-emulated sampling network. Such simple contrastive learning rules can yield very good classification performance in networks of spiking neurons [6]. Another class of highly successful learning algorithms is based on error backpropagation. This, however, requires precise knowledge of the gradient of a cost function with respect to the network parameters, which is difficult to achieve on analog hardware. We propose a training method for hardware-emulated networks that circumvents this problem by using the cost function gradient with respect to the parameters of an ANN as an approximation of the true gradient with respect to the hardware parameters [3]. A similar method has previously been used for network training on a digital neuromorphic device [9].
Our training schedule consisted of two phases. In the first phase, an ANN was trained in software on a modified subset of the MNIST dataset (5 digits, 10×10 pixels, with a total of 30690 training and 5083 test samples) using a simple cost function with regularization
and backpropagation with momentum [10]
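One form of such a cost function and update rule that is consistent with the symbols defined below is given here as an illustration. It assumes a squared-error loss with L2 weight regularization and classical momentum; the regularization strength $\lambda$, learning rate $\eta$, momentum coefficient $\gamma$, and velocity $v_{kl}$ are assumed symbols, and the exact expressions used in [3] may differ.

```latex
C(W) = \sum_{s \in S} \frac{1}{2} \left\| \tilde{y}_s - \hat{y}_s \right\|^2
       + \frac{\lambda}{2} \sum_{k,l} w_{kl}^2 ,
\qquad
v_{kl} \leftarrow \gamma \, v_{kl} - \eta \, \frac{\partial C}{\partial w_{kl}} ,
\qquad
w_{kl} \leftarrow w_{kl} + v_{kl} .
```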
Here, $\tilde{y}_s$ and $\hat{y}_s$ denote the target and network state of the label layer, respectively, and the sum runs over all samples within a minibatch $S$. The learned parameters were then translated to a feed-forward spiking neural network (Fig. 3 A). Here, the BrainScaleS wafer-scale system [8] was used for network emulation. Due to hardware imperfections, the ANN classification accuracy of 97 % dropped to $72^{+12}_{-10}$ % after mapping the network to the hardware substrate.
In the second training phase, the hardware-emulated network was trained in the loop (Fig. 3 B) for several iterations. Parameter updates were calculated using the same gradient descent rule as in the ANN, but the activation of all layers was measured on the hardware. The rationale behind this approach is that the activation function of an ANN unit is sufficiently similar to that of an LIF neuron to allow using the computed gradient as an approximation of the true hardware gradient. As seen in Fig. 3 C, this assumption is validated by the post-training performance of the hardware-emulated network: after 40 training iterations, the classification accuracy increased back to $95^{+1}_{-2}$ %.
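The in-the-loop scheme can be sketched as follows: activations are measured on the (imperfect) substrate, while the parameter update is computed with the gradient formulas of the ideal ANN. Since no hardware is available here, `mock_hardware_forward` stands in for the device with a fixed multiplicative weight distortion and trial-to-trial noise; this stand-in, the sigmoid units, the squared-error cost normalized by minibatch size, the toy data, and all hyperparameters are assumptions for illustration and do not model BrainScaleS.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mock_hardware_forward(x, weights, gains, rng, noise=0.1):
    """Stand-in for the device: distorted weights plus trial-to-trial noise."""
    acts = [x]
    for W, g in zip(weights, gains):
        pre = acts[-1] @ (W * g)
        pre = pre + rng.normal(0.0, noise, size=pre.shape)
        acts.append(sigmoid(pre))
    return acts

def ann_gradients(acts, weights, target, lam=1e-4):
    """Backpropagation formulas of the ideal ANN, fed with measured activations."""
    grads = [None] * len(weights)
    delta = (acts[-1] - target) * acts[-1] * (1.0 - acts[-1])
    for l in reversed(range(len(weights))):
        grads[l] = acts[l].T @ delta / len(target) + lam * weights[l]
        if l > 0:
            delta = (delta @ weights[l].T) * acts[l] * (1.0 - acts[l])
    return grads

def in_the_loop_step(x, target, weights, velocity, gains, rng,
                     eta=0.5, gamma=0.9):
    acts = mock_hardware_forward(x, weights, gains, rng)  # measured on "hardware"
    grads = ann_gradients(acts, weights, target)          # approximate gradient
    for l in range(len(weights)):
        velocity[l] = gamma * velocity[l] - eta * grads[l]  # momentum update
        weights[l] += velocity[l]
    return 0.5 * np.mean(np.sum((acts[-1] - target) ** 2, axis=1))

# Toy setup: 8 inputs, 5 hidden units, 2 label units.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(64, 8)).astype(float)
labels = (x[:, 0] > 0.5).astype(int)
target = np.eye(2)[labels]
weights = [rng.normal(0.0, 0.5, size=(8, 5)), rng.normal(0.0, 0.5, size=(5, 2))]
gains = [rng.uniform(0.5, 1.5, size=W.shape) for W in weights]  # fixed-pattern distortion
velocity = [np.zeros_like(W) for W in weights]

losses = [in_the_loop_step(x, target, weights, velocity, gains, rng)
          for _ in range(40)]
```

The gradient is never exact in this loop, because the distortion `gains` and the noise are unknown to `ann_gradients`; the update nevertheless uses the measured activations, which is what allows the training to adapt to the substrate rather than to the idealized model.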
V. DISCUSSION
We have reviewed three strategies for emulating performant spiking network models in analog hardware. The proposed methods tackled the problems induced by substrate-inherent imperfections from different (and complementary) angles. The three strategies were implemented and evaluated with two different analog hardware systems.
An essential advantage of the employed neuromorphic platforms is provided by their accelerated dynamics. Despite possible losses in performance compared to precisely tunable software solutions, accelerated analog neuromorphic systems have the potential to vastly outperform classical simulations of neural networks in terms of both speed and energy consumption [3], an invaluable advantage for online learning of complex, real-world data sets. The network in Sec. II, for example, is already faster than equivalent software simulations (NEST 2.2.2 default build, single-threaded, Intel Core i7-2620M) by several orders of magnitude.
The studied networks serve as a proof of principle and are scalable to larger network sizes. Future research will have to address whether the results obtained for these small networks still hold as training tasks increase in complexity. Furthermore, the generative properties of the described hierarchical LIF networks remain to be studied. Another major step forward will be taken once training can take place entirely on the hardware, thereby rendering sequential reconfigurations between individual experiments unnecessary. Future generations of the used systems will feature onboard plasticity processor units, with early-stage experiments already showing promising results [11] .
