. Example SbS network (see (Rotermund and Pawelzik, 2019a ) for more details on this network) for analyzing handwritten digits (MNIST benchmark, http://yann.lecun.com/exdb/mnist/). The input image (28x28 gray value pixels) is represented by a 576 node input population and 802 parallel SbS IPs with different number of neurons (with 10, 32, 64, or 1024 neurons per SbS inference population IP, colored columns in layers H1-H4). Thus this network consists of 1378 independent computational elements. The network is organized in one input layer, two convolution layers, two pooling layers, one fully connected layer, and one output layer. All the layers, except the input layer, use the same update dynamic (i.e. updating the internal state of a SbS inference population with an incoming spike). In contrast to usual deep convolutional networks, there is no algorithmic difference for pooling layers, they only have a special arrangement of their weight matrices. Information between elements (SbS inference population and input populations) are exchanged only by spikes. Depending on the architecture of the network, the information can flow either only forward or bi-directionally (as indicated by the blue arrows). The latter can be used to implement local learning rules. In this specific network there are no interactions in between SbS inference populations within the same layer, but in a more complex network they could be helpful. .
Besides updating the latent variables h(i), it is necessary to optimize suitable weights p(s|i) from 148 training data for allowing the network to perform the desired function (e.g. pattern recognition). Two 149 different approaches were found useful for learning weights. One is based on changing weights based on 150 only single spikes observed during processing the actual pattern. This procedure is called online learning.
151
The other approach utilizes information gathered over many spikes and several patterns, that is batch 152 learning.
153
For online learning we focus on a multiplicative learning rule for the weights p(s|i): 
where γ is a learning parameter which can change during learning.
155
For batch learning, a variety of implementations exit, we focus on batch learning rules that base on
where µ identifies the training pattern, and where the sum may run over the whole or only parts of the complete set of training patterns.p µ (s) denotes the input probability distribution of the incoming spikes 158 into the SbS inference population and is approximated by analyzing (counting) the spikes processed by 159 the SbS inference population. For allowing a more flexible implementation of these type of rules, ∆p(s|i) 160 is handed over by the FPGA to a CPU, which allows to implement batch learning rules based on
A simple example for an update rule falling into this category is
.
U could also be realized by Adam (Kingma and Ba, 2014) or L4 (Rolinek and Martius, 2018) using 163 mini-batches. Since ∆p(s|i) is based on a multitude of patterns, offloading ∆p(s|i) to a separate CPU 164 and performing the weight update on the CPU occurs at a lower rate than all other operations. Thus the 165 reduction in performance by using a CPU for handling the weight updates is outweighed by the gain of 166 flexibility.
167
Furthermore, we need the means to generate spikes from probability distributions or the latent variables 168 using random numbers. Also specialized circuits are required for observing the spikes entering a
169
SbS inference population and calculating the corresponding probability distributionp µ (s) from these 170 observations.
171

Non-negative numbers
172
Investigating the three main equations 1, 2 and 3 reveals that no subtractions are required and that ).
179
In more detail, we designed the 36 bit floating point numbers as follows: Since only non-negative 180 numbers are used, the usual sign bit was not necessary anymore. 189 Figure S1 shows the coding for the two types of floating point numbers used in the design. Analyzing the three equations 1, 2 and 3 reveals that all have common terms. Especially they share the
ζ can be ǫ, γ orp(s). All three variants of ζ have in common that they change slower than i, which means operation Y /X is known to be computational demanding, the following approach was chosen: take longer than two clock cycles. Figure S5a shows the steps necessary for adding these numbers and 211 every step requires one clock cycle in our design. Thus we found that it is beneficial to implement the 212 summation operation via an addition pipeline, where the output of the addition pipeline is feedback to its 213 input (this procedure will be explained in detail later). 
220
For reducing the amount of overall delay, it is beneficial to start step 3.) some time before step 2.) is 221 fully complete. The timing has to selected such that
result leaves the first multiplication pipeline.
223
For the implementation of online learning, it is self-evident that the results from the circuits for step 224 0.) and 1.), calculated during the update of the latent variables, should be reused. Adding an additional 225 sequential division unit and an additional multiplication pipeline allows to calculate step 2.) and 3.) fully 226 in parallel to the ongoing update of the latent variables.
227
Since the calculation of Ω(i; s; ζ) of the batch learning update for the actually processed pattern isn't 228 done at the same time as the update of the latent variable, it is beneficial to use the same circuits for these 229 two tasks. and the result A · B leaves the module. In the case of fixed point numbers, the output has twice the number
247
of bits compared to the input.
248
Module + (figure 2b) represents a simple adding pipeline. Two inputs A and B enter the module and
249
A + B leaves the module. In the case of fixed point numbers, the output has one bit more than the input.
250
Module # ( figure 2c) before the results exiting the module is cut down to the original number of bits for A.
256
Module MEM (figure 2d) stores the latent variables h(i) and the corresponding weights p(s|i). The and output of the module, the first bit of the mantissa is removed or re-added (see figure S1 ). This bit is 270 necessary for calculations but is directly defined by the exponent and hence doesn't need to be stored in 271 memory.
272
Module NORM-MULTI (figure 2e) is a modification of Module * . In addition to the multiplication received (defined by an external constant), stage 3 is entered. In stage 3 A and B is set to 0 by default.
283
The output of the adder pipeline is collected until two valid outputs are available. These two values are 284 send as an A-B-pair through the adder pipeline. This is done several times and thus combines more and 285 more remaining pairs until one value is left, which is the desired output of the cumulative sum. Finally,
286
an external value (in figure 2e X is a placeholder for ǫ orp(s)) is divided by the actual cumulative sum in 
292
Besides the basic computation building blocks for calculating the three main equations, additional 293 modules are required for realizing the network. In particular a module that allows to generate spikes from 294 a probability distribution and a random number, a module that analyzes spikes and calculates a ratep(s) 295 out of them, and a module that offsets the weights in a normalized fashion during learning. The figures 296 S2, S3 and S4 show simplified schematics for these modules. Table S3 shows the amount of components 297 necessary to build them as well as the number of clock cycles they require.
298
Module Spike Generator (figure S2a) converts a probability distribution and one random number into 299 one spike. A spike is an index describing a position in the probability distribution that elicited the spike.
300
The values of the probability distribution (e.g. h(i) or normalized input pattern) are presented to the 301 module sequentially.
302
The module sequentially calculates the cumulative sums over the observed part of the probability 303 distribution and compares it to the random number. Is the actual value of the partial cumulative sum equals 304 or is larger than the random number, the index of the probability distribution value that just contributed 305 to the sum is the desired index. If every value of the probability distribution was shown but the last value 306 and no index was found yet, then the wanted index is the last position in the probability distribution.
307
In addition, for this design the values of the probability distribution are converted into a fixed point 308 representation, since adding two fixed point numbers can be done in one clock cycle and the available 309 random numbers are in a fixed point representation anyway.
310
Module Spike Generator with offset (figure S4a): Sometimes, especially during annealing while 311 learning, it is helpful to add an offset α to a probability distribution and to fade out this offset over time.
312
One way to generate spikes with an offset is to add the offset to the probability distribution, re-normalize 313 it and then draw spikes from it. In the context of these circuit designs, a more efficient way is to rely on a 314 double stochastic process using two random numbers. One random number is used to draw one spike from 315 the original probability distribution without offset and one spike from a uniform probability distribution.
316
The second random number is compared to α. The outcome of this comparison decides which one of the 317 two spikes is used. calculates the initial table after reset given a seed value.
327
Module Rate Calculator (figure S2b): For batch learning the weights, the input ratep(s) is necessary.
328
For calculating the ratep(s), the following is required: The total number of spikes c All observed and the 329 number of spikes seen in each channel s counted as c(s). After an externally defined number of spikes 330 has been counted, the rate is calculated and recalled by an external module. This is realized by calculating 
Circuits for updating h(i), online and batch learning p(s|i)
343
With the presented building blocks it is now possible to implement the three equations 1, 2 and 3 as Table   357 S4 shows the number of components required to map this circuit onto the Xilinx Virtex 6 FPGA as well 358 as how many clock cycles of latency the modules * , + and # adds to the processing time. 
363
In parallel to the h(i) update, an online update step for the weights p(s|i) can be performed. however, would cost much more components.
387
Another approach to learn weights is to perform batch learning. In this learning mode, for a given Rate Calculator and then used instead of ǫ. As result we getp Typically, a network consists of more than one Spike-By-Spike inference population which need to 428 exchange spikes. Or the task requires input patterns that need to be converted into spikes, which can 429 be accomplished by an input population (a simple combination of block RAM for the input probability producing and spike received network elements is required. Figure 6 shows the communication fabric that 433 was implemented into the presented design. 
495
On the incoming data bus there are two sources for data: external data and internal data generated and puts these messages on the data bus which programs the network elements with these random numbers.
c.) It contains a mailbox system that is controlled by the number of already processed spikes. This allows 503 to store every possible -otherwise externally provided -data control sequence in this mailbox together 504 with a number of processed spikes that will release the message at that moment on the data bus. E.g. this 505 message system can be used to change parameters like ǫ during processing the pre-defined number of 506 spikes without waiting of any external just-in-time changes of parameters.
507
One arbiter joins the two data streams from the incoming data. Another arbiter combines all data sources 508 that are destined for an external data receiver. In firmware when the circuit is alone and the IO pins can be selected by the software.
520
However, we weren't able to achieve these high clock rate when the network was embedded into the 521 third party firmware which is required for communicating with the Virtex 6 FPGA on the 4DSP FM680
522
PCIe card. We used the 'adder' example from the training materials and simply replaced the 'adder' core 523 with our network. We allowed for a separate clock domain for the network by insulating the data-flow by 524 dual clock FIFOs on the input and output side. Nevertheless, Xilinx ISE wasn't able to provide us with a 525 working firmware if we set the clock speed for the network to 200MHz. Thus we were forced to use the 526 overall 125MHz clock provided by the 4DSPs example design also for our network.
527
With the 125MHz clock speed were were able to run the network on 4DSP FM680 cards under Linux 528 (Centos 7.5 64bit with driver version 04.05.2018). Results were identically to the Xilinx ISim simulations.
529
However, the overall design of the 4DSP FM680 doesn't fit for our application very well. This type 530 of card is designed for high data bandwidth applications where 64bit words in blocks of 1024bits are 531 continuously exchanged. On one side we were forced to incooperate our 32bits into block of 1024bits. On 532 the other side we had problems receiving data from the card because our network doesn't produce data 533 continuously. Thus the Linux driver and our software had to recover from trying to read from the card 534 when it had no data for us. All this combined, slows down the communication with our network on the 535 FPGA significantly.
536
In the supplemental materials we examine how the presented design fairs in comparison to Intel CPUs. In neurons with stochastic inputs in a network, which is a hard problem. For SbS networks, the stochasticity 568 rather is feature because it corresponds to importance sampling of the input as well as the latent variables.
569
This acts as a filter for capturing the more dominant information in the network and suppress noise. SbS inference populations can cause problems during learning, especially with larger numbers of neurons.
584
If a spike s t is received by a population of neurons, for every neuron in this population an update of it's 585 latent variables is calculated through
We presented computational circuitry for this equation that only requires to read the involved h(i) 587 and p(s t |i) value pairs twice during the update process. For floating point numbers a number of 588 179 + 2 · N H clock cycles and 118 + 2 · N H clock cycles for the fixed point numbers were measured.
589
The constant numbers of clock cycles in these two equations don't scale with the number of neurons 590 because they represent the time which the information needs to travel through the pipeline structure of 591 the computational circuitry as well as setting up the circuits for incoming data. These amounts of clock 592 cycles can be reduced if optional features are removed, like e.g. adding an normalized offset to the weights.
593
Auxiliary circuits were designed such that, given a random number, they draw a new spike from h(i) while are too long for even 125MHz on this FPGA from 2009. We didn't pursued the increase in the number 663 of SbS inference populations any further, because we found that the 4DSP firmware, driver and software 664 framework for this card is optimized for continuously streaming of large bandwidth of data. However, in 665 our application the data is send in a stop-and-go fashion which doesn't ensure that data is always available.
666
Thus it was necessary to probe the cards for available data via blindly trying to read data from the card 667 and to rely on slow timeouts of the Linux driver if there was no data available at that exact moment.
668
Furthermore we had to fill up our data streams with zeros (up to 31x bit in zeros for filling up compared 669 to the payload) to conform to the required data format. In the end we decided to focus on transferring the 670 design onto a custom ASIC instead of optimizing the FPGA firmware.
671
The computationally attractive aspect of a SbS neuron based network is the property that populations In future we will design a custom ASIC for networks with larger numbers of SbS inference populations.
685
The desire for using ASICs compared to continuing using FPGAs stems from several reasons: The goal is with FPGAs is the amount of available memory (block RAM) and the corresponding routing problems.
689
Every SbS inference population needs its own dual port memory, at least for its latent variables. Intel Xeon E5-2640v3 with a combined memory bandwidth of 531 gigabyte per second. Even though the 812 memory on the GPU is a bit faster, the problems encountered with the multi-core CPUs can be assumed 813 to occur with GPUs too. Even more so if the complex requirements of GPUs for memory alignment can 814 not be fulfilled for all 72 simulations processes at the same time.
815
The proposed design for the SbS IPs is optimized with respect to the number of memory accesses. The 816 pipelined structure of the computations requires only a minimum of memory read and write transactions. 
829
The question remains for which number of neurons per SbS IPs, the propose design shines. 
