This paper presents the concepts behind the BrainScaleS (BSS) accelerated analog neuromorphic computing architecture. It describes the second-generation BrainScaleS-2 (BSS-2) version and its most recent in-silico realization, the HICANN-X Application Specific Integrated Circuit (ASIC), as it has been developed as part of the neuromorphic computing activities within the European Human Brain Project (HBP). While the first generation is implemented in a 180 nm process, the second generation uses 65 nm technology. This allows the integration of a digital plasticity processing unit, a highly parallel microprocessor specially built for the computational needs of learning in accelerated analog neuromorphic systems.
Fig. 1: Basic elements of the BrainScaleS architecture: wafer, BSS-1 ASIC, BSS-2 neuron and exemplary membrane voltage trace.
Introduction
The basic concept of the BrainScaleS systems is the emulation of biologically inspired neural networks with physical models [1]. It differs from comparable neuromorphic approaches based on continuous-time analog circuits [2, 3, 4] in many aspects: the high acceleration factor [5, 6], the usage of wafer-scale integration [7], calibratability towards biologically sound neuron parameters [8, 9], a software interface based on the simulator-agnostic description language PyNN [10, 11], support for non-linear dendrites and structured neurons [12], as well as on-chip support for complex plasticity rules based on a combination of analog measurements, internal analog-to-digital conversion and built-in microprocessors.
The first generation, BrainScaleS 1, has been completed [13] and is used mostly for research on connectivity aspects of large accelerated analog neural networks and for the further development of wafer-scale integration technology. The main shortcoming of the BrainScaleS 1 system is the rather inflexible implementation of long-term plasticity based solely on spike-timing dependent plasticity (STDP), which has been taken over from its predecessor [6]. Already at the very beginning of the BrainScaleS project this was considered a conceptual weakness, and an upgrade path was devised to implement the more flexible hybrid plasticity scheme [14] in future revisions. Due to the 180 nm process technology used for BrainScaleS 1, it was not feasible to integrate the necessary standard cell logic without sacrificing too much area to digital circuits in relation to the analog neurons and synapses. Therefore, the decision was made to develop a second BrainScaleS generation, BrainScaleS 2, which is based from the beginning on a smaller process technology, namely 65 nm. Fig. 1 shows the main elements of the BSS architecture. At the very left, a BSS-1 wafer containing approx. 500 interconnected ASICs is shown. To its right, a BSS chip illustrates the characteristic layout of BSS neuromorphic chips: a central neuron area surrounded by two large synapse blocks. The sketched overlay shows the orthogonal arrangement of input (pre-synaptic) and output (post-synaptic) signals: the input is routed horizontally through the synapse array, while the output of the synapses connects them vertically to the neurons in the center. Next to it, the graphical representation of an emulated structured neuron is shown above a measured voltage trace from the membrane capacitor of a neuron.
One major improvement is the inclusion of a digital plasticity processor in the BSS-2 ASIC [14]. This specialized, highly parallel Single Instruction Multiple Data (SIMD) microprocessor adds an additional layer of modeling capabilities, covering all aspects of structural and parameter changes during network operation. By including the necessary logic directly within the analog network core, a communication bottleneck to the host system is avoided. This allows all novel plasticity features to be scaled up for wafer-scale integration within the BrainScaleS 2 system. In the final multi-wafer version of the BrainScaleS 2 system, which is planned to be capable of extending experiments across several hundreds of wafers, the distributed local compute capability will be even more essential. It will not only perform all levels of plasticity calculations, but also the initialization and calibration of the numerous analog mixed-signal circuits within the ASIC. The role of the analog neural network block changes with the transition from BSS-1 to BSS-2: the analog part becomes an attachment to the CPU cores, similar to a complex accelerator. Fig. 2 illustrates this architecture.
The remainder of this publication is organized as follows: Section 2 gives an overview of the BSS-2 architecture. Section 3 presents the current prototype, the single-chip variant of BSS-2, called HICANN-X. Section 4 shows some examples of the complex calibrated Monte Carlo simulations used to verify that the analog neuron circuits are always capable of correctly emulating their biological counterparts, i.e. their calibratability under all process and device variations. The paper closes with a conclusion in Section 5.
Overview of the BSS neuromorphic architecture
As shown in Fig. 2, the BSS architecture is based on the close interaction of digital and analog circuit blocks. Because of their primary intended function, the digital processor cores are called Plasticity Processing Units (PPUs). As the main neuromorphic component, the analog core contains synapse and neuron circuits [15, 16], analog parameter memories, PPU interfaces as well as all event-related interface components.
The PPU is an embedded microprocessor core with a highly parallel SIMD unit optimized for the calculation of plasticity rules in conjunction with the analog core [17]. In the current incarnation of the BSS architecture, BSS-2, two PPUs share an analog core. This allows the most efficient arrangement of the neuron circuits in the center of the analog core. Fig. 3 depicts the individual function blocks located within the analog network core (ANNCORE):
synapse arrays
The synapses are split into four equally sized blocks to keep the vertical and horizontal lines traversing the sub-arrays as short as possible, thereby reducing their parasitic capacitances (see [17, 16]). Each synapse array resembles a block of static memory, with 16 memory cells located in each synapse, organized in two words of eight bits each. A synapse array also contains the sense amplifiers, precharge and write control circuits as well as word-line decoders and buffers. It can therefore be connected directly to the digital, standard-cell-based parts of the chip. Two PPUs connect to the static memory interfaces of the two adjacent synapse arrays, using a fully parallel connection to the 8×256 data lines.
neuron compartment circuits
Four rows of neuron compartment circuits are located at the edges of the synapse blocks. Each pair of dendritic input lines of a neuron compartment is connected to a column of 256 synapses. Each neuron compartment implements the Adaptive-Exponential Integrate-and-Fire (AdEx) neuron model. The compartments can be connected to form larger neurons, emulating either point or structured neurons. See [12] for more details about the multi-compartment capabilities.
analog parameter memories
Adjacent to each row of neuron compartments is a row of analog parameter storages, implemented as capacitive memories [18].
synapse drivers with short term plasticity
The pre-synaptic events are fed into the array via the synapse drivers. Besides timing control and buffering, they contain short-term plasticity circuits emulating a simplified Tsodyks-Markram model [19, 6]. The synapse drivers can handle single- or multi-valued input signals, depending on the current operation mode of the synapse row, which may be either rate or spike based.
random event generators
The random generators produce random background events fed directly into the synapse array via the synapse drivers, strongly reducing the external bandwidth usage when stochastic models [20, 21] are used.
digital neuron control
The digital neuron control is split into eight blocks controlling 64 neuron compartments each. Four blocks are located in the left and four in the right half of the ANNCORE. Each block contains the so-called neuron builder logic, which allows interconnecting the analog membrane and digital spike output signals of neuron compartments that are either vertically or horizontally adjacent to each other. To serialize the up to 64 spike outputs, each digital neuron control block contains priority encoder circuits that arbitrate the access to the output bus. It also contains an 8 × 64 neuron source address memory [22].
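The serialization performed by a digital neuron control block can be illustrated with a minimal Python sketch. It assumes a fixed-priority scheme in which the lowest compartment index wins; the actual arbitration policy of the chip is not specified here, and all function names are hypothetical.

```python
from typing import List, Optional

NEURONS_PER_BLOCK = 64  # one digital neuron control block serves 64 compartments


def priority_encode(pending: List[bool]) -> Optional[int]:
    """Return the index of the highest-priority pending spike, or None.

    Assumption of this sketch: the lowest index has the highest priority.
    """
    for index, is_pending in enumerate(pending):
        if is_pending:
            return index
    return None


def serialize_spikes(pending: List[bool]) -> List[int]:
    """Drain all pending spikes, granting the output bus to one per cycle."""
    pending = list(pending)  # work on a copy
    order = []
    while (winner := priority_encode(pending)) is not None:
        order.append(winner)
        pending[winner] = False  # this spike has been sent on the bus
    return order


# Three compartments fire in the same cycle.
requests = [False] * NEURONS_PER_BLOCK
for n in (40, 3, 17):
    requests[n] = True
bus_order = serialize_spikes(requests)
```

In this sketch, simultaneous spikes leave the block in index order, so a spike can be delayed by up to one bus cycle per higher-priority request.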
The pre-synaptic input for the synapse drivers of one chip half comes from a set of local event input buses driven by the central event router. The event router within the ANNCORE mixes global, local and random event sources. As shown in Fig. 4, the synapses are arranged in a two-dimensional array between the PPU and the neuron compartment circuits. Pre-synaptic input enters the synapse array at the left edge. For each row, a set of signal buffers transmits the pre-synaptic pulses to all synapses in the row. The post-synaptic side of the synapses, i.e. the equivalent of the dendritic membrane of the target neuron, is formed by wires running vertically through each column of synapses. At each intersection between pre- and post-synaptic wires, a synapse is located. To avoid that all neuron compartments share the same set of pre-synaptic inputs, each pre-synaptic input line transmits, in a time-multiplexed fashion, the pre-synaptic signals of up to 64 different pre-synaptic neurons. Each synapse stores a pre-synaptic address that determines the pre-synaptic neuron it responds to. The main functional blocks of a synapse are the address comparator, the Digital to Analog Converter (DAC) and the correlation sensor. Each of these circuits has its associated memory block. The address comparator receives a 6 bit address and a pre-synaptic enable signal from the periphery of the synapse array as well as a locally stored 6 bit neuron number. If the address matches the programmed neuron number, the comparator circuit generates a pre-synaptic enable signal local to the synapse (pre), which is subsequently used in the DAC and correlation sensor circuits. Each time the DAC circuit receives a pre signal, it generates a current pulse. The height of this pulse is proportional to the stored weight, while the pulse width is typically 4 ns. This matches the maximum pre-synaptic input rate of the whole synapse row, which is limited to 125 MHz, i.e. one event every 8 ns. The remaining 4 ns are necessary to change the pre-synaptic address.
The current pulse can be shortened below the 4 ns maximum pulse length to emulate short-term synaptic plasticity [6, 23] .
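The behavior of a single synapse described above can be summarized in a short Python sketch. It models only the address comparison and the weight-proportional pulse; the class and function names as well as the arbitrary charge unit are assumptions made for illustration.

```python
from dataclasses import dataclass

ADDRESS_BITS = 6            # pre-synaptic address width
MAX_PULSE_NS = 4.0          # maximum current pulse width
ROW_EVENT_PERIOD_NS = 8.0   # 4 ns pulse plus 4 ns to change the address


@dataclass
class Synapse:
    address: int   # locally stored pre-synaptic neuron number
    weight: int    # stored weight, sets the pulse height

    def charge(self, event_address: int, pulse_ns: float = MAX_PULSE_NS) -> float:
        """Charge delivered for one row event, in arbitrary units.

        The comparator enables the DAC only on an address match. The pulse
        height is proportional to the weight, and the width is capped at
        4 ns (shorter pulses emulate short-term plasticity).
        """
        if event_address != self.address:
            return 0.0
        width = min(max(pulse_ns, 0.0), MAX_PULSE_NS)
        return self.weight * width


def max_row_rate_mhz() -> float:
    """Maximum event rate of one synapse row."""
    return 1e3 / ROW_EVENT_PERIOD_NS


synapse = Synapse(address=5, weight=10)
full_pulse = synapse.charge(5)          # address match, full 4 ns pulse
shortened = synapse.charge(5, 2.0)      # same weight, pulse shortened to 2 ns
ignored = synapse.charge(6)             # address mismatch, no pulse
```

With these numbers one row can address 2^6 = 64 pre-synaptic neurons at an aggregate rate of 125 MHz.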
Each neuron compartment has two inputs, labeled A and B in Fig. 5. Usually, the neuron compartment uses A as excitatory and B as inhibitory input. Each row of synapses is statically switched to either input A or B, meaning that all pre-synaptic neurons connected to this row act either as excitatory or inhibitory inputs to their target neurons. Due to the address width of 6 bit, the maximum number of different pre-synaptic neurons is 64 [24]. The output currents of all synapses discharge the synaptic input capacitance $C_\mathrm{syn}$, which is realized predominantly by the shielding capacitance of the long synaptic input wires. An adjustable Metal-Oxide-Semiconductor (MOS) resistor, $R_\mathrm{syn}$, restores the charge. Because the synaptic input pulse is three orders of magnitude shorter than the time constant of the synaptic input line, $\tau_\mathrm{input} = C_\mathrm{syn} R_\mathrm{syn}$, the voltage trace $V_\mathrm{input}(t)$ is a single exponential.
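The single-exponential shape follows when the short current pulse is treated as an instantaneous removal of charge from the input line. As a sketch, let $V_0$ denote the quiescent voltage of the input line, $I_w$ the weight-dependent pulse current and $t_\mathrm{pulse}$ the pulse width (these three symbols are introduced here for illustration only):

```latex
\begin{align}
  \Delta V &= \frac{I_w \, t_\mathrm{pulse}}{C_\mathrm{syn}}, \\
  V_\mathrm{input}(t) &= V_0 - \Delta V \, e^{-t/\tau_\mathrm{input}},
  \qquad \tau_\mathrm{input} = C_\mathrm{syn} R_\mathrm{syn}, \quad t \ge 0 .
\end{align}
```

The approximation holds as long as $t_\mathrm{pulse} \ll \tau_\mathrm{input}$, which is the case for a 4 ns pulse and a time constant three orders of magnitude longer.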
The ion-channel circuits in BrainScaleS should implement the full AdEx neuron model, as is the case in the BSS-1 system [25, 26, 16]. In BSS-2 some terms are still under development at the time of this writing. The minimum configuration available in all prototype versions of BSS-2 is a set of two current-based inputs, one for excitatory synaptic input (input A in Fig. 5) and one for inhibitory synaptic input (input B), in combination with a leak circuit and spike and reset generation [15]. Therefore, the membrane voltage is given by the standard Integrate-and-Fire (I&F) neuron model [27]. Typically, the membrane time constant set by the leakage term is another order of magnitude above the time constant of the synaptic input. These temporal relationships are visualized in the small timing diagram inserts in Fig. 5.
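The minimum neuron configuration can be sketched as a leaky integrate-and-fire membrane driven by one excitatory and one inhibitory current. The following Python example uses forward-Euler integration; all parameter values and names are placeholders chosen for illustration and are not chip parameters.

```python
import numpy as np


def simulate_lif(i_exc, i_inh, dt=1e-8, tau_m=1e-5, c_m=1e-12,
                 v_leak=0.0, v_thresh=0.2, v_reset=0.0):
    """Forward-Euler integration of a leaky integrate-and-fire membrane.

    dV/dt = -(V - v_leak) / tau_m + (I_exc(t) - I_inh(t)) / c_m
    A spike is recorded and V is set to v_reset when V reaches v_thresh.
    """
    v = v_leak
    trace = np.empty(len(i_exc))
    spikes = []
    for k, (ie, ii) in enumerate(zip(i_exc, i_inh)):
        v += (-(v - v_leak) / tau_m + (ie - ii) / c_m) * dt
        if v >= v_thresh:
            spikes.append(k * dt)   # spike time in seconds
            v = v_reset
        trace[k] = v
    return trace, spikes


n_steps = 5000  # 50 microseconds at dt = 10 ns
# Sub-threshold drive: the membrane settles near I * tau_m / c_m = 0.1 V.
quiet_trace, quiet_spikes = simulate_lif(np.full(n_steps, 10e-9), np.zeros(n_steps))
# Supra-threshold drive: the membrane reaches threshold and fires repeatedly.
active_trace, active_spikes = simulate_lif(np.full(n_steps, 50e-9), np.zeros(n_steps))
```

Equal excitatory and inhibitory currents cancel in this model, so the membrane then stays at the leak potential.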
The remaining functional block of the synapse shown in Fig. 5 is the correlation sensor. Its task is the measurement of the time difference between pre- and post-synaptic spikes. To determine the time of the pre-synaptic spike, it is connected to the pre signal. The post-synaptic spike time is determined by a dedicated signaling line running from each neuron compartment vertically through the synapse array to connect to all synapses projecting to inputs A or B of the compartment [17].
The HICANN-X chip
Although the target of the BSS architecture is wafer-scale integration, which offers a cost-effective possibility to build brain-size spiking neural network models, smaller solutions based upon single ASICs are needed to develop and debug the final design. They also shorten the time to first experiments, a significant proportion of which need only hundreds up to a few thousand neurons and therefore do not necessarily rely on wafer-scale integration. Depending on the complexity of the neuron model they utilize, a few tens of interconnected BSS ASICs might be sufficient. To support these goals, an intermediate version of the second-generation BSS technology has been developed: suited for single- or multi-chip operation, but simultaneously prepared for later wafer-scale integration. This section introduces said single-chip version of BSS-2, called HICANN-X, in more detail. Fig. 6 shows a block diagram of HICANN-X. In total, the HICANN-X chip uses 16 differential Low Voltage Differential Signalling (LVDS) lines for the host communication. A single chip has the same bandwidth as the full BSS-1 reticle built from eight individual chips. Using this link arrangement, the HICANN-X chip can be directly connected to one communication module of the BrainScaleS system, providing an easy upgrade path [28]. The layout and photograph of the chip are shown in Fig. 7.
Event-routing within HICANN-X
HICANN-X uses the same two-level communication infrastructure as the first BSS generation [29]: a real-time address-event layer without handshake, called Event Link Layer 1 (Layer1), and a second layer using time-stamped event packets. Fig. 8 shows the implementation of the central Layer1 digital event routing network. There are two main sources and sinks for event data: the analog network core, which has eight input event and eight output event buses, as well as the Event Link Layer 2 (Layer2)→Layer1 converter, which provides four links in each direction. With the exception of the analog core input buses, each link can handle one event per clock cycle of 4 ns. The ANNCORE input buses are limited to one event every two cycles.
All eight High Input Count Analog Neural Network (HICANN) compatible links are used for Layer2 based event transport. An event is encoded as a combination of neuron address and time stamp. The conversion between time-stamped Layer2 data and real-time Layer1 data is performed inside the Layer2→Layer1 converter located in the digital core logic. It uses a globally synchronized system time counter for this purpose. The routing of all Layer1 events is done within the router matrix. Inside this module are several columns of buffered n-to-1 event merger stages that combine the data of a set of inputs into one Layer1 output channel. All eight physical links of the chip can be simultaneously used for neuron event data (Layer2), slow control and PPU global memory accesses. The number of active links can be statically programmed to be any number between one and the maximum of eight. This is useful if several chips are to be connected to a single host with a limited number of available links. All events transferred via Layer2 are protected against undetected bit errors by Cyclic Redundancy Check (CRC) fields.
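The conversion from time-stamped Layer2 events to real-time Layer1 events can be sketched in Python as follows. The sketch assumes that an event is released once the system time counter reaches its time stamp and that at most one event leaves per clock cycle; how the hardware treats late or colliding events is not described here, so that part is an assumption, as are all names.

```python
import heapq
from typing import Iterable, List, Tuple

Event = Tuple[int, int]  # (time stamp in clock cycles, neuron address)


def layer2_to_layer1(events: Iterable[Event], n_cycles: int) -> List[Tuple[int, int]]:
    """Release time-stamped events against a running system time counter.

    Returns (release_cycle, address) pairs. At most one event is released
    per cycle, so events sharing a time stamp spill into following cycles.
    """
    queue = list(events)
    heapq.heapify(queue)  # earliest time stamp first
    released = []
    for system_time in range(n_cycles):
        if queue and queue[0][0] <= system_time:
            _, address = heapq.heappop(queue)
            released.append((system_time, address))
    return released


# Events may arrive out of order; two of them share time stamp 5.
layer1_stream = layer2_to_layer1([(5, 12), (2, 7), (5, 3)], n_cycles=10)
```

In this model the second event with time stamp 5 is emitted one cycle late, which illustrates why the per-link event rate bounds the achievable timing precision under load.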
Analog Inference: Rate-based Extension of HICANN-X
One of the first neuromorphic systems built in Heidelberg was the Heidelberg AnaloG Evolvable Neural network (HAGEN), a fast analog Perceptron-based network chip optimized for hardware-in-the-loop training [30]. Owing to parallel activities within the Heidelberg Electronic Vision(s) research group [31], it was mainly trained by evolutionary algorithms [32], which explains the acronym. Nevertheless, it was perfectly usable for other hardware-in-the-loop based algorithms, similar to the deep-learning results that have been achieved more recently by other neural network chips used in a Perceptron-like fashion [33]. Although the HICANN architecture has been successfully used to implement deep multi-layer networks using rate-based spiking models [34] and back-propagation based training, it loses some of its power efficiency by emulating a Perceptron model. Encoding the activation in the time between spikes can enhance the efficiency significantly [35]. In all spiking solutions the network operates in continuous time, and therefore the size of the network is limited to the number of neurons and synapses available on the chip. The HAGEN extension, which is part of the HICANN-X chip, allows a seamless mixture of spiking and non-spiking operation within a single chip. Since this rate-based operation is based on discrete-time analog vector-matrix multiplication, a time-multiplexing scheme can be employed, similar to digital accelerators for deep convolutional networks [36]. In this case the size of the network is limited only by the size of any external memory. Fig. 9 visualizes the differences between standard spiking mode and HAGEN mode, which eliminates all temporal dynamics from the neuron. With the leakage term of the neuron disabled, the membrane simply sums up the synaptic input. The excitatory input is added with a positive and the inhibitory input with a negative sign.
All input is applied during the time interval $\Delta t_\mathrm{input}$, after which the membrane voltage is digitized by the Correlation-readout ADC (CADC) and the neuron is set to the reset voltage $V_\mathrm{reset}$ by a reset signal from the PPU. $\Delta t_\mathrm{input}$ can be as short as 100 ns. It depends on the bandwidth of the synaptic input and the number of synaptic rows used, i.e. the total time required to transfer all input events to the synapses. Since the minimum time is at least a few synaptic time constants and nothing is gained by setting the integration time shorter than the conversion time of the CADC, a typical value for $\Delta t_\mathrm{input}$ is about 500 ns. Thereby, the network can evaluate $2 \cdot 10^{6} \times 256 \times 512 = 2.62 \cdot 10^{11}$ multiply-accumulate operations per second. By shortening the conversion time of the CADC, further speed improvements are possible. Since the reset voltage of the neuron membrane can be aligned with the lower bound of the CADC conversion range, the neuron acts like a ReLU unit in this setting [37].
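The throughput estimate and the ReLU-like readout can be reproduced with a few lines of Python. This is an idealized sketch: the upper ADC bound `adc_max`, the unit gain and the function names are assumptions, and signed weights stand in for the combination of excitatory and inhibitory synapse rows.

```python
import numpy as np

SYNAPSES_PER_NEURON = 256   # synapses in one column
NEURONS = 512               # neuron compartments evaluated in parallel
DT_INPUT_S = 500e-9         # typical integration interval


def mac_per_second(dt_input_s=DT_INPUT_S):
    """Multiply-accumulate operations per second, one pass per interval."""
    return SYNAPSES_PER_NEURON * NEURONS / dt_input_s


def hagen_layer(x, w, adc_max=255):
    """Idealized HAGEN-mode pass: sum on the membrane, then digitize.

    x: non-negative input activations, shape (inputs,)
    w: signed weights, shape (inputs, neurons)
    Values below the reset level clip to 0, which yields the ReLU behavior;
    adc_max models saturation at the upper end of the ADC range.
    """
    membrane = x @ w
    return np.clip(np.rint(membrane), 0, adc_max).astype(int)


throughput = mac_per_second()
outputs = hagen_layer(np.array([1, 2]), np.array([[3, -4], [1, -1]]))
```

The first neuron sums to 5 and is passed through, while the second sums to -6 and is clipped to 0, as a ReLU would do.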
A standard synapse within BrainScaleS reacts to a pre-synaptic event in a digital fashion: the arrival of a pre-synaptic event generates a fixed current pulse. When short-term facilitation or depression [23] is enabled, the synaptic strength depends on the pre-synaptic firing history. This is achieved by modulating the length of the pulse generated by the synapse. Instead of using the firing history, in HAGEN mode the pulse length is transmitted together with the pre-synaptic spike and converted into variable-length pulses by the existing Short-Term Plasticity (STP) pulse-length modulation circuits. The digital pulse length information is transmitted by reusing the 5 lower address bits of the Layer1 event data, since in HAGEN mode the network structure is much more regular and not all pre-synaptic address bits are needed. Fig. 10 shows some early results using the activity-based Perceptron mode of HICANN-X for analog vector-matrix multiplication. In the left part of the figure, 127 neurons are measured simultaneously. Their synaptic weights increase linearly from -63 to 63, i.e. all synapses connected to a single neuron are set to the same weight while the weights increase from neuron to neuron. All synapses receive the same input: 0, 3 or 7 for the black, red and blue traces, respectively. The outputs of all 127 neurons are digitized simultaneously by the CADC and the digital values are plotted over the weight values of the neurons. Although the neuron circuits are calibrated, some fixed-pattern noise remains visible. The temporal variations are caused by a well-understood circuit flaw that will be removed in future iterations.
The chip has subsequently been used to perform inference on the MNIST dataset [38]. A three-layer network has been trained in TensorFlow [39] to reach a classification rate of 97.43%. The weights and input activations of this network have been quantized to 6 bit weight and 5 bit input resolution to fit the trained network to the dynamic range of the analog circuits. The inference on the test data set has been repeated using the HICANN-X chip. The resulting classification accuracy was 92.48%. The corresponding confusion matrix is shown in the right panel of Fig. 10. The deterioration is most likely caused by the remaining fixed-pattern noise. In the future we will include the hardware in the forward path of the training loop, similar to the approach followed in [34], which will most likely improve the accuracy significantly.
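The quantization step can be illustrated with a small NumPy sketch. The quantization scheme actually used for the experiment is not described here; the example assumes a symmetric linear mapping to the signed weight range -63 to 63 and to unsigned 5 bit inputs, and both function names are hypothetical.

```python
import numpy as np


def quantize_weights(w, bits=6):
    """Map float weights linearly to integers in [-(2**bits - 1), 2**bits - 1].

    With bits=6 this gives the range -63..63. Returns the integer weights
    and the scale factor needed to map them back to floats.
    """
    w_max = 2 ** bits - 1
    peak = np.max(np.abs(w))
    scale = peak / w_max if peak > 0 else 1.0
    return np.clip(np.rint(w / scale), -w_max, w_max).astype(int), scale


def quantize_inputs(x, bits=5):
    """Map non-negative activations linearly to integers in [0, 2**bits - 1]."""
    x_max = 2 ** bits - 1
    peak = np.max(x)
    scale = peak / x_max if peak > 0 else 1.0
    return np.clip(np.rint(x / scale), 0, x_max).astype(int), scale


q_weights, w_scale = quantize_weights(np.array([-1.0, 0.0, 0.25, 1.0]))
q_inputs, x_scale = quantize_inputs(np.array([0.0, 0.2, 1.0]))
```

Multiplying the integer results by the two scale factors recovers an approximation of the floating-point product, which is how a digital post-processing step could rescale the analog output.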
Analog Verification of Complex Neuron Circuits
The BrainScaleS systems feature complex mixed-signal circuits to emulate the rich properties of their biological counterparts. Our neuron circuits, implementing the AdEx equations [25], possess a multitude of individual subcomponents, such as a leak or adaptation term. Each of these units is parameterized through a number of digital controls as well as analog voltage and current biases. Designed to support a variety of different tasks, ranging from biologically realistic firing patterns to analog matrix multiplication, these circuits have to be operated at widely different operating points. The correct behavior has to be ensured prior to fabrication. Individual components can often be unit-tested in isolation, making use of conventional simulation strategies. The assessability of a complete design is, however, limited due to error propagation and inter-dependencies of parameters.
Fig. 11: Structure of a teststand-based simulation highlighting the interaction with the Cadence Design Suite. Image taken from [40].
for pre-tapeout verification. To ensure the required degree of precision over larger arrays of analog circuits, mismatch effects introduced through imperfections in the production process have to be covered through Monte Carlo (MC) simulations. Different incarnations of a circuit can be obtained by individually fixing the MC seed. These virtual instances can then be characterized in much the same way as their fabricated siblings. Similarly, the worst-case behavior can be characterized for the process corners.
In the following paragraphs, we present our simulation strategy and a custom library to aid software-driven simulations within the rich ecosystem of the Python programming language. We then walk through the benchmarking flow for our current generation of AdEx neurons. Similar approaches have successfully been taken for the verification of plasticity circuits and vector-matrix multiplication circuits.
Interfacing analog simulations from Python
Our custom Python module teststand provides a tight integration between analog circuit simulations and the ecosystem of the programming language [40] . It mainly consists of a software layer to interface with the Cadence Spectre simulator and other tools from the Cadence Design Suite.
Teststand extracts the testbench's netlist directly from the target cell view as available in the design library. The data is accessed by querying the database via an OCEAN script executed as a child process. Teststand then reads the netlist and modifies it according to the user's specification. In addition to the schematic description, Spectre netlists also contain simulator instructions. The simulate() call executes Spectre as a child process. Basic parallelization features are natively provided via the multiprocessing library. Scheduling can be trivially extended to support custom compute environments. The simulation log is parsed and potential error messages are presented to the user as Python exceptions.
Results are read and provided to the user as structured NumPy arrays. This makes the large body of data processing libraries available in the Python ecosystem usable for processing and evaluating recorded data. Most notably, this includes NumPy [41], SciPy [42], and Matplotlib [43]. As a side effect, the latter makes it possible to generate rich, publication-ready figures directly from analog circuit simulations.
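To illustrate how results in the form of structured NumPy arrays can be evaluated, the following sketch builds a synthetic trace and extracts threshold crossings from it. The field names `time` and `v_mem` and both helper functions are hypothetical and do not reflect the actual teststand result layout.

```python
import numpy as np

# Hypothetical result layout: one record per simulation time step.
trace_dtype = np.dtype([("time", "f8"), ("v_mem", "f8")])


def make_trace(time, v_mem):
    """Pack two equally long sequences into one structured array."""
    trace = np.empty(len(time), dtype=trace_dtype)
    trace["time"] = time
    trace["v_mem"] = v_mem
    return trace


def upward_crossings(trace, threshold):
    """Times at which v_mem crosses the threshold from below."""
    above = trace["v_mem"] >= threshold
    idx = np.flatnonzero(~above[:-1] & above[1:]) + 1
    return trace["time"][idx]


# Synthetic stand-in for a simulated membrane trace: three sine periods.
t = np.linspace(0.0, 1.0, 1001)
trace = make_trace(t, np.sin(2 * np.pi * 3 * t))
crossings = upward_crossings(trace, 0.5)
```

Because fields are addressed by name, the same evaluation code keeps working when further signals are added to the record.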
Monte Carlo calibration of AdEx neuron circuits
As shown in Fig. 12A, we used the teststand library, inter alia, for the verification of our AdEx design. The model equations feature a high-dimensional parameter space, allowing for a wide range of behaviors. Our circuit, on the other hand, is parameterized through 24 individual analog bias sources and a set of digital controls. Starting from first-order models of the utilized subcomponents, we characterized the circuit's dynamics through a set of measurements on the full neuron circuit. With the results stored in a database, we established a transformation between the circuit's and the model's parameter spaces. The influence of mismatch effects manifests itself in deviations in these calibration curves for individual neuron instances. We applied the above framework to a large number of neuron incarnations, obtained by fixing the respective MC seeds.
The circuit was benchmarked against multiple firing patterns, such as transient spiking, regular bursting, and initial bursting [44]. For each of these targets, a set of biases, corresponding to the respective parameter set from literature, was determined through a reverse lookup based on the above transformations. Exemplary results for a single neuron simulation are shown in Fig. 12B. The presented approach enforces the development of calibration algorithms before tape-out. Especially for circuits with large parameter spaces, multi-dimensional dependencies might occur which can be hard to resolve. The strategy might also reveal an insufficient parametrization not necessarily apparent from individual unit tests. In order to uncover potential regressions due to modifications to a circuit, simulations based on teststand can easily be automated and allow continuous integration testing for full-custom designs.
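The calibration flow, consisting of one calibration curve per Monte Carlo instance followed by a reverse lookup, can be sketched as follows. The circuit simulation is replaced by a hypothetical first-order model in which a time constant is inversely proportional to a bias current and a seed-dependent gain mimics device mismatch; none of the numbers or names correspond to the actual design.

```python
import numpy as np


def simulated_time_constant(bias, seed):
    """Stand-in for a circuit simulation of one Monte Carlo instance.

    The gain is drawn from a generator seeded with the MC seed, so the
    same seed always yields the same virtual circuit instance.
    """
    rng = np.random.default_rng(seed)
    gain = 1.0 + 0.1 * rng.standard_normal()
    return gain * 100.0 / bias


def calibrate(seed, biases):
    """Record the calibration curve bias -> tau for one virtual instance."""
    taus = np.array([simulated_time_constant(b, seed) for b in biases])
    return biases, taus


def reverse_lookup(target_tau, biases, taus):
    """Find the bias that produces target_tau by interpolating the curve."""
    order = np.argsort(taus)  # np.interp requires increasing x values
    return float(np.interp(target_tau, taus[order], biases[order]))


biases = np.linspace(1.0, 100.0, 200)
settings = {seed: reverse_lookup(10.0, *calibrate(seed, biases))
            for seed in range(5)}
```

Each seed ends up with its own bias setting for the same target value, which is the per-instance compensation of mismatch that the calibration is meant to provide.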
Conclusion
The development and implementation of the presented second-generation BrainScaleS architecture will hopefully continue during the next years. The outcome we hope for is a multi-wafer system, constructed from hundreds of 30 cm silicon wafers, each one directly embedded in a printed circuit board (PCB) and all of them interconnected to form a novel large-scale analog neuromorphic platform: a system capable of answering questions about learning and development in large-scale, biologically realistic neural networks.
Utilizing standard Complementary Metal-Oxide-Semiconductor (CMOS) technology to build large-scale analog accelerated neuromorphic hardware systems places our approach in the middle between the two major research directions for AI circuits: digital accelerators and novel persistent memory devices. It presents a complementary option to these technologies. Compared to systems based on novel device technology, it has advantages such as the high operational speed, low energy requirements for learning, the possibility to use any standard CMOS process without regard to back-end-of-line compatibility, and the capability to replicate relevant biological structures more easily. In comparison to digital implementations like Loihi or SpiNNaker [45, 46], the fully analog implementation of complex neural structures combined with true in-memory computing allows for time-continuous emulation of neural dynamics and much higher emulation speed at similar energy efficiencies. Most importantly, analog CMOS implementations might be the essential step to uncover the learning rules needed to cope with substrate variations. In our systems the local learning rules do not only train the system to perform a certain task, but simultaneously adjust the operating point of the circuits and compensate fixed-pattern noise [23]. This will be an essential property for future novel computing systems based on advanced device technologies as well, since they are all expected to have substantially increased device-to-device variations. We hope that our BSS platform will help to gain insight into the necessary algorithms in the future.
In the short term, the BSS system allows the combination of energy- and cost-efficient analog inference with local learning rules for a multitude of practical applications, scaling from small systems for edge computing up to high-performance neuromorphic cloud computing.
