Abstract-We present results from a new approach to learning and plasticity in neuromorphic hardware systems: to enable flexibility in implementable learning mechanisms while keeping high efficiency associated with neuromorphic implementations, we combine a general-purpose processor with full-custom analog elements. This processor is operating in parallel with a fully parallel neuromorphic system consisting of an array of synapses connected to analog, continuous time neuron circuits. Novel analog correlation sensor circuits process spike events for each synapse in parallel and in real-time. The processor uses this pre-processing to compute new weights possibly using additional information following its program. Therefore, to a certain extent, learning rules can be defined in software giving a large degree of flexibility. Synapses realize correlation detection geared towards Spike-Timing Dependent Plasticity (STDP) as central computational primitive in the analog domain. Operating at a speed-up factor of 1000 compared to biological time-scale, we measure time-constants from tens to hundreds of micro-seconds. We analyze variability across multiple chips and demonstrate learning using a multiplicative STDP rule. We conclude that the presented approach will enable flexible and efficient learning as a platform for neuroscientific research and technological applications.
I. INTRODUCTION

IN THE modern landscape of information technology, machine learning is steadily gaining importance. Major companies use artificial intelligence for their products [1]. This development is driven by advancements in methods such as deep learning [2], [3] that were originally inspired by concepts from neuroscience. Together with the availability of substantial computational performance, these methods enable complex machine learning applications, such as image [4] or speech recognition [5]. Specialized hardware can lower the cost of these methods in terms of energy, time, and therefore money [6], enabling either a scaling to larger problem sizes or the use in new devices outside of data centers.
On the other hand, using simulations of neural networks as a major tool for research in neuroscience depends on efficient simulators for large-scale networks. This opens the opportunity to build specialized hardware systems that serve as efficient platforms for research as well as technology. Multiple systems with this goal have been proposed, e.g., [7] - [10] .
While the problem can be approached in different ways, the concept of analog neuromorphic hardware [11] , [12] promises especially area and energy efficient solutions as demonstrated by, e.g., [13] - [15] . These systems use the concept of a physical model to emulate neural networks: the temporal development of the membrane voltages of the neurons is emulated by custom analog circuits, representing the neuron and synapses of the emulated network. However, neurons and synapses built this way are limited to at best a family of models that are compatible with their physical realization. On the other end of the spectrum, software allows the simulation of arbitrary models by solving numerical equations.
In particular, a large set of different models for learning and plasticity exists, so that a flexible hardware implementation is desirable. This is true for technical applications, where one network is often trained with different methods for pretraining and fine-tuning [3], as well as for biology, where different plasticity rules are found depending on cell type and brain region [16], [17]. But besides flexibility, efficiency is a key concern in both domains. Large-scale simulations have been demonstrated in the past [18]-[20], but, especially with plasticity, simulation time quickly becomes a limiting factor even on medium-sized networks [21]. Similarly, in the technical domain, significant effort is put into accelerating learning, including the use of Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) [6], [22].
For this study, we follow a novel hybrid approach to learning as a trade-off between efficiency and flexibility: we use full-custom analog circuits for real-time and parallel processing of spikes in the emulated synapses. These circuits serve as sensors for an embedded general-purpose processor that implements the learning rule in software. This way, we offer a solution that allows biologically realistic plasticity while emulating networks a thousand times faster than in biology. Because physical models are used for the core components, this speed-up is not affected by network size or activity. In this study we present results from a scaled-down prototype that demonstrates for the first time plasticity in such a hybrid system using analog components together with an embedded Plasticity Processing Unit (PPU).
The study starts with a description of analog circuits and the architecture of the PPU in Section II. After that, we introduce the theoretical background and methods in Section III. Then, results are presented for simulations in Section IV and for experiments in Section V. Finally, Section VI discusses results, followed by conclusion and outlook in Sections VII and VIII.
II. DESCRIPTION OF CIRCUITS
The circuits presented in this paper are part of a prototype ASIC for the next generation of a large neuromorphic hardware system [7]. All results have been measured using the setup shown in Fig. 1. The individual components of the chip and their functional relations are depicted in Fig. 2. The central elements are an array of 2048 synapses and 64 neuron compartment circuits, which implement the analog, continuous-time emulation of their biological counterparts. Similar to the predecessor system described in [7], the presented chip operates faster than wall-clock time. To simplify the calibration of the analog elements to the model equations, the acceleration factor is fixed at $10^3$. Therefore, one second in the model time scale is emulated in one millisecond by the presented system.
The focus of this paper is the plasticity sub-system, which observes the activity of the emulated neural network and modifies its parameters in reaction to these observations depending on the configured plasticity rule. The neuron circuits are not covered in this publication.
The plasticity sub-system is a mixed-signal, highly parallel control loop that simultaneously monitors the temporal correlation between all pre- and post-synaptic firing times. The plasticity rule itself is implemented as software running on an embedded micro-processor, the PPU. It evaluates the signals from the analog correlation sensors located within the synapses and computes weight updates. Besides the synapses, it can observe firing rates of neurons and modify parameters of the emulated neurons as well as the topology of the network. A connection to the outside world allows the integration of third factors, for example a reward signal [23].
Fig. 2. Block diagram of the presented system. The prototype ASIC is shown to the left. A photograph of the system can be seen in Fig. 1. "CD" stands for "correlation data" and "PSC" for "post-synaptic current."
The parallel analog implementation of the correlation sensors in every synapse allows the plasticity sub-system to handle the high rate of simultaneous events. The circuit maintains a local eligibility trace that depends on the relative timing of pre- and post-synaptic firing.
A 128 channel single-slope Analog to Digital Converter (ADC) digitizes the stored trace information for the PPU.
A. Synapse
1) Basic Operation Principles:
In Fig. 2 the synapses are arranged in a two-dimensional array between the PPU and the neuron compartment circuits. Pre-synaptic input enters the synapse array at the left edge. For each row, a set of signal buffers transmit the pre-synaptic pulses to all synapses in the row. The post-synaptic side of the synapses, i.e., the equivalent of the dendritic membrane of the target neuron, is formed by wires running vertically through each column of synapses.
At each intersection between pre- and post-synaptic wires, a synapse is located. So that the neuron compartments do not all have to share the same set of pre-synaptic inputs, each pre-synaptic input line transmits, in a time-multiplexed fashion, the pre-synaptic signals of up to 64 different pre-synaptic neurons. Each synapse stores a pre-synaptic address that determines the pre-synaptic neuron it responds to. Fig. 3 shows a block diagram of the synapse circuit. The main functional blocks are the address comparator, the DAC, and the correlation sensor. Each of these circuits has its associated memory block.
The address comparator receives a 6 bit address and a presynaptic enable signal from the periphery of the synapse array as well as a locally stored 6 bit neuron number. If the address matches the programmed neuron number, the comparator circuit generates a pre-synaptic enable signal local to the synapse (pre), which is subsequently used in the DAC and correlation sensor circuits.
Each time the DAC circuit receives a pre signal, it generates a current pulse. The height of this pulse is proportional to the stored weight, while the pulse width is typically 4 ns. This matches the maximum pre-synaptic input rate of the whole synapse row which is limited to 125 MHz. The remaining 4 ns are necessary to change the pre-synaptic address. The current pulse can be shortened below the 4 ns maximum pulse length to emulate short-term synaptic plasticity [24] .
Each neuron compartment has two inputs, labeled A and B in Fig. 3. Usually, the neuron compartment uses A as excitatory and B as inhibitory input. Each row of synapses is statically switched to either input A or B, meaning that all pre-synaptic neurons connected to this row act either as excitatory or inhibitory inputs to their target neurons. Due to the address width of 6 bit, the maximum number of different pre-synaptic neurons is 64.
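The address matching and weight-proportional current pulse described above can be illustrated with a small behavioral model. This is only a sketch: the class `Synapse`, the function `row_event`, and the unit current `i_lsb` are hypothetical names introduced here, and the 6 bit weight range is taken from the weight storage described later in the text.

```python
from dataclasses import dataclass


@dataclass
class Synapse:
    address: int  # stored 6 bit pre-synaptic neuron number (0..63)
    weight: int   # stored 6 bit weight (0..63)


def row_event(row, address, i_lsb=1.0):
    """Return the current pulse height per column for one pre-synaptic event.

    Only synapses whose stored address matches the broadcast address
    generate a pulse; its height is proportional to the stored weight.
    i_lsb is a hypothetical unit current, not a chip parameter.
    """
    return [s.weight * i_lsb if s.address == address else 0.0 for s in row]


row = [Synapse(3, 10), Synapse(5, 20), Synapse(3, 63)]
pulses = row_event(row, address=3)
```

In this model the 4 ns pulse width is not represented; only the pulse height per column is returned.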
The remaining block shown in Fig. 3 is the correlation sensor, which has a 4 bit static memory associated with it. Its task is the measurement of the time difference between pre- and post-synaptic spikes. To determine the time of the pre-synaptic spike it is connected to the pre signal. The post-synaptic spike time is determined by a dedicated signaling line running from each neuron compartment vertically through the synapse array to connect to all synapses projecting to inputs A or B of the compartment. This signal, which is called post in the following, has a similar pulse length as the pre signal.
The correlation sensor measures the causal (pre before post) and anti-causal (post before pre) time differences and stores them as exponentially weighted sums within the synapse circuit. In comparison to earlier implementations [14] by the authors the circuit has been improved in two main aspects: first, only one instance of the time measurement circuit is now re-used for causal as well as anti-causal time difference measurements, resulting in strongly reduced mismatch between the causal and anti-causal branches of the activation function. Second, the time-constant of the exponential is now truly adjustable over more than two orders of magnitude to fit most biological models of spike-time dependent plasticity [25] .
Due to the implementation in a much smaller process feature size, 65 nm instead of 180 nm, four static memory bits could be allocated for additional calibration of transistor variations.
2) Correlation Sensor Circuit: The structure of the correlation sensor is shown in Fig. 4. The input stage receives pre and post signals and uses them to generate the internal timing. A time to voltage conversion circuit generates a voltage representing the elapsed time between the most recent pre and post events. This voltage is scaled by the storage gain parameter and the result is used as argument to an exponential function. This exponentially weighted time difference is added to one of two storage circuits. The selection of the storage circuit depends on whether the last input event seen has been a pre or post signal. Pre before post is stored in the causal storage, post before pre in the anti-causal one.
To counteract the effects of fixed-pattern noise created by transistor variations, the time to voltage as well as the storage gain stages have two digital calibration inputs each. The four calibration bits are stored locally in each synapse. The time constant of the time to voltage conversion can be set for one row of synapses by a control voltage. The same applies to the storage gain stage, where the storage gain control signal adjusts one row of synapses. In the prototype chip the gain and time constant input signals of each row are shorted and connected to two external input pins.
The values stored in the causal and the anti-causal storage cells can be read out simultaneously for all synapses in a row. A parallel single-slope ADC at the top of the synapse array converts the analog values read out from the storage cells into digital words for the PPU (see Fig. 2 ).
Fig. 5 depicts the correlation sensor circuit. To enhance the readability of the circuit diagram, the individual blocks of Fig. 4 are not marked. See the caption for assignments of the components to the different functional blocks.
As stated above, the correlation sensor monitors the temporal correlation between pre- and post-synaptic firing events. This is accomplished by charging the capacitors $C_\text{causal}$ and $C_\text{anti-causal}$ with a constant current. The selection of the capacitor depends on the temporal order of the pre and post signals. As can be seen in Fig. 6, the arrival of a pre pulse starts the charging of $C_\text{causal}$ after discharging it quickly to its initial value, while $C_\text{anti-causal}$ starts charging after a post pulse. Two or more pre or post pulses in succession would only restart the discharge/charge process without changing the capacitor. Therefore, the correlation sensor only supports plasticity rules based on nearest neighbor schemes [26].
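The nearest-neighbor behavior described above can be illustrated with an idealized software model. The sketch below is not the circuit: it assumes ideal exponentials, a single time constant `tau` and accumulation rate `eta` (both hypothetical values), and it only reproduces the rule that a measurement is stored when the event type changes, while repeated events of the same type merely restart the ramp.

```python
import math


def correlation_traces(pre_times, post_times, tau=20e-6, eta=1.0):
    """Accumulate idealized causal and anti-causal traces.

    A contribution exp(-dt / tau) is stored only when a post follows a pre
    (causal) or a pre follows a post (anti-causal). dt is measured from the
    most recent event, because consecutive events of the same type restart
    the ramp without storing anything.
    """
    events = sorted(
        [(t, "pre") for t in pre_times] + [(t, "post") for t in post_times]
    )
    causal = 0.0
    anti_causal = 0.0
    last_kind, last_time = None, None
    for t, kind in events:
        if last_kind is not None and kind != last_kind:
            contribution = eta * math.exp(-(t - last_time) / tau)
            if kind == "post":
                causal += contribution
            else:
                anti_causal += contribution
        last_kind, last_time = kind, t
    return causal, anti_causal


# Two pre spikes followed by one post spike: only the later pre is paired.
c, a = correlation_traces([0.0, 10e-6], [20e-6])
```

With the default `tau` of 20 µs the example pairs the post spike with the pre spike 10 µs earlier and ignores the first pre spike.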
To determine the temporal order, the input stage of the correlation sensor utilizes a D-latch formed by I1 and I2. Each time a post follows a pre or vice-versa, the D-latch gets toggled by M1 or M2, respectively. To orchestrate the precise discharging and switching of capacitors within the limited area of the synapse, the circuit makes use of the delays of the individual components. In Fig. 8 a subset of the relevant signals is shown. The inverters I1 and I2, which form the D-latch, have a very low drive strength. This leads to a significant delay between the internal node being discharged by an external pre or post pulse, and the respective inverted internal node (storeAntiCausal in case of a pre pulse or storeCausal after a post signal).
This time difference is used by G2 to produce a short pulse at the gate of M12 to precharge $C_\text{transfer}$ (see below). The current charging the capacitors $C_\text{causal}$ and $C_\text{anti-causal}$, and therefore controlling the time constant of the correlation sensor, is generated by an adjustable current sink M4. The gate voltage of M4 is shared by all synapses of a row. To reduce the fixed pattern noise within a synapse row, the length of M4 can be digitally controlled in four steps by approximately 20%. This makes it possible to select, for each synapse, the setting that minimizes synapse-to-synapse variation within the row. Fig. 7 shows the results of a Monte-Carlo simulation demonstrating the effectiveness of this approach.
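The per-synapse selection of a calibration setting can be sketched as a simple search. The example assumes that a time constant has been measured for each of the four settings of every synapse in a row and that the common target is the median of all measured values; both the data and the choice of target are assumptions of this sketch, not the calibration procedure used for the chip.

```python
from statistics import median


def select_calibration_codes(measured):
    """Pick, per synapse, the calibration code closest to a common target.

    measured[i][c] is the (hypothetical) measured time constant of synapse i
    at calibration code c in 0..3. The target is the median over all values.
    """
    target = median(v for row in measured for v in row)
    codes = [
        min(range(len(row)), key=lambda c: abs(row[c] - target))
        for row in measured
    ]
    return target, codes


# Made-up time constants in microseconds for three synapses.
measured = [
    [8.0, 9.0, 10.0, 11.0],
    [10.0, 11.0, 12.0, 13.0],
    [7.0, 8.0, 9.0, 10.0],
]
target, codes = select_calibration_codes(measured)
```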
In the full-size neural network chip each row will have an individual bias generation for M4, which allows different time constants in different rows, as well as the calibration of the row mean of the time-constant. The presented prototype chip directly connects all time-constant inputs to an external input pin which is driven by the test controller (see Fig. 2 ).
The state of the D-latch determines whether $C_\text{causal}$ or $C_\text{anti-causal}$ is charged by M4 through the inverter chains formed by I3 to I6 and M7 as well as M10.
The subsequent discussion is based on the temporal relations depicted in Fig. 8. As can be seen in Fig. 6, the charging process of $C_\text{causal}$ or $C_\text{anti-causal}$ starts after it has been discharged to the ramp reset voltage by the pre or post event. In the case shown in Fig. 8, $C_\text{anti-causal}$ is discharged. The initial discharge is initiated in two steps: first, after arrival of a post pulse, $C_\text{anti-causal}$ is connected to M3 by enabling M5. Second, the enabling of M3 is delayed to make sure the other capacitor, $C_\text{causal}$, is disconnected from M3 by M8. This is essential since at this moment $C_\text{causal}$ holds the last causal time-difference measurement result, which should not be altered by the discharge of $C_\text{anti-causal}$. At the beginning of the post pulse the voltage on $C_\text{causal}$ is as follows:
After a pre pulse, a similar equation holds for the voltage on $C_\text{anti-causal}$.
The initial discharge process finishes within the time interval set by the length of the post pulse. After post becomes inactive, M3 is deactivated and the charging of $C_\text{anti-causal}$ by the current flowing through M7 and M4 starts. Simultaneously, the transfer of the causal result from $C_\text{causal}$ to the storage capacitor $C_\text{storage causal}$ is initiated. In Fig. 5 only one of the two identical storage circuits is drawn. Depending on the state of the storeCausal and storeAntiCausal signals, M14 or M15 connects one of the storage circuits to M13. The timing of these signals ensures that M14 and M15 are never activated simultaneously.
The charge transfer starts by enabling M9, thereby connecting $C_\text{causal}$ to $C_\text{transfer}$. To avoid any crosstalk from the previous transfer process, M12 is always activated prior to M9 and charges $C_\text{transfer}$ to $V_{dd}$. After M9 is enabled, charge sharing between $C_\text{causal}$ and $C_\text{transfer}$ starts. The charging process will be completed before post becomes inactive, but $C_\text{causal}$ and $C_\text{transfer}$ will stay connected until the end of the storage cycle.
After the post pulse, before the charging of $C_\text{transfer}$ and $C_\text{causal}$ starts, the voltage on $C_\text{transfer}$ can be calculated as follows:
Since $C_\text{transfer}$ has been charged by M12 at the very beginning of the post pulse, $V^{\text{begin post}}_{C_\text{transfer}}$ is zero and (3) simplifies to
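As a sketch of the relations referred to as (3) and (4), assume ideal charge conservation between the two capacitors and write $V^{\text{begin post}}_{C_\text{causal}}$ for the voltage on $C_\text{causal}$ at the beginning of the post pulse (a symbol chosen here for illustration). Under these assumptions, which may differ in detail from the original equations, charge sharing gives:

```latex
\begin{align}
V^{\text{end post}}_{C_\text{transfer}}
  &= \frac{C_\text{causal}\, V^{\text{begin post}}_{C_\text{causal}}
         + C_\text{transfer}\, V^{\text{begin post}}_{C_\text{transfer}}}
          {C_\text{causal} + C_\text{transfer}}, \\
V^{\text{end post}}_{C_\text{transfer}}
  &= \frac{C_\text{causal}}{C_\text{causal} + C_\text{transfer}}\,
     V^{\text{begin post}}_{C_\text{causal}}
  \qquad \text{for } V^{\text{begin post}}_{C_\text{transfer}} = 0 .
\end{align}
```

In this idealized form the ratio $C_\text{causal} / (C_\text{causal} + C_\text{transfer})$ acts as a fixed attenuation of the measured time-difference voltage.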
The capacitance of $C_\text{transfer}$ is adjustable in four steps to allow the reduction of synapse-to-synapse variations. Fig. 9 shows a simulation of the charging process of $C_\text{causal}$ and $C_\text{transfer}$ by M11. After the post pulse, as long as the storeCausal signal is active, $C_\text{causal}$ and $C_\text{transfer}$ are connected by M9 and their voltages are equal. The charging current is set by the gate voltage of M11. In the presented prototype chip, this voltage is directly connected to an analog input pin and set by the test controller (see Fig. 2).
Before the time difference is stored for a causal or anti-causal measurement, its exponential value has to be calculated. This is accomplished by M13. While M13 is connected to one of the storage capacitors $C_\text{storage}$ by M14 or M15, it discharges the respective storage capacitor. The amount of charge it can remove from $C_\text{storage}$ depends on its gate voltage, which follows the time course shown in Fig. 9.
The purpose of the charge sharing between $C_\text{causal}$ and $C_\text{transfer}$ is the reduction of the voltage representing the measured time difference below the threshold voltage of M13. This ensures the operation of M13 in weak inversion. Therefore, we can use the sub-threshold model to calculate the current through M13 at any time $t$
Since $V_{C_\text{transfer}}(t)$ changes after $C_\text{transfer}$ has been discharged to its initial voltage during the post pulse, $V^{\text{end post}}_{C_\text{transfer}}$, (5) has to be integrated over the time interval from the post pulse, $t_p$, to $t_e$, the point in time when $C_\text{transfer}$ has been charged completely, i.e., $V_{C_\text{transfer}}(t)$ is close to zero
To solve this integral a simple linear model is used for the charging of $C_\text{transfer}$ from $V^{\text{end post}}_{C_\text{transfer}}$ to zero
The time difference $t_e - t_p$ can be calculated from the current through M11 and the involved capacitances as follows:
Solving (8) gives
Using the result of (12) the change in the voltage stored on $C_\text{storage}$ can be calculated
For typical values of the transfer gain, which controls $t_e - t_p$ by setting $I_{DS}$ of M11, the deviation between (12) and the ideal exponential activation function is below 1%. Also, due to the exponential decay of $I_{C_\text{storage}}(t)$, only the very first part of the charging of $C_\text{transfer}$ contributes significantly to $\Delta V_{C_\text{storage}}$. If the discharge of $C_\text{transfer}$ is interrupted by an arriving pre pulse, the resulting error is minimal. No control signal is needed to end the charging of $C_\text{transfer}$, avoiding any distortions caused by clock-feedthrough. The current $I_{C_\text{storage}}(t)$ is reduced to the minimum sub-threshold current without negative gate overdrive as $V_{GS}$ of M13 approaches 0 V. Since M13 is a thick oxide transistor with a long gate, this current is below 1 nA. Measured total leakage on $C_\text{storage}$ was 1.7 mV/ms at 50 °C and only 0.14 mV/ms at room temperature (approx. 25 °C). The usable dynamic range of $V_{C_\text{storage}}$ is 1.3 V. M13 together with M14 or M15, respectively, also protects the thin oxide transistors used in the time difference measurement circuits from the higher supply voltage of the storage circuits. To reach sufficient storage times, the utilization of thick oxide transistors is necessary to avoid gate tunneling currents. The gate voltage of M14 and M15 comes from the thin oxide supply voltage, thereby limiting their source voltages to safe values.
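The near-exponential dependence discussed for (12) can be checked numerically under simplifying assumptions. The sketch below assumes an idealized sub-threshold model $I = I_0 \exp(V / (n U_T))$ and a linear ramp of the gate voltage from its initial value to zero at a constant slew rate. The constants `I0`, `N_UT`, and `SLEW` are illustrative values, not chip parameters, and signs are chosen for convenience rather than to match the circuit polarity.

```python
import math

I0 = 1e-12          # hypothetical current at zero gate voltage (A)
N_UT = 1.5 * 0.026  # hypothetical slope factor times thermal voltage (V)
SLEW = 1e6          # hypothetical ramp slew rate, current / capacitance (V/s)


def transferred_charge(v_end, steps=10000):
    """Midpoint-rule integral of I0 * exp(V(t) / N_UT) over a linear ramp.

    V(t) falls linearly from v_end to 0; the ramp duration is v_end / SLEW.
    """
    duration = v_end / SLEW
    dt = duration / steps
    total = 0.0
    for k in range(steps):
        v = v_end * (1.0 - (k + 0.5) / steps)
        total += I0 * math.exp(v / N_UT) * dt
    return total


def closed_form(v_end):
    """Analytic result of the same integral."""
    return I0 * N_UT / SLEW * (math.exp(v_end / N_UT) - 1.0)


q_numeric = transferred_charge(0.3)
q_analytic = closed_form(0.3)
# Relative deviation of the analytic result from a pure exponential.
deviation = math.exp(-0.3 / N_UT)
```

In this idealized model the only deviation from a pure exponential is the constant term in `closed_form`, whose relative size `deviation` shrinks as the initial voltage grows.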
As a second function, M14 and M15 act as cascodes to limit the voltage swing at the drain of M13, thereby reducing the variation of $\Delta V_{C_\text{storage}}$ as a function of the stored voltage on $C_\text{storage}$.
The storage circuits themselves use MIM-capacitors as storage cells, sitting on top of each synapse, whereas $C_\text{causal}$, $C_\text{anti-causal}$, and $C_\text{transfer}$ are implemented as MOS-capacitors. $C_\text{transfer}$ uses several individual transistors to accomplish the digital calibration feature.
Each storage circuit uses a source follower (M18) for the readout of the stored correlation results. A pass-transistor (M19) connects the source follower to the correlation readout line if the readback enable signal of the row is active. There are two readout lines per synaptic column, so that causal and anti-causal data of every synapse in one row can be simultaneously connected to the inputs of the correlation ADC at the top of the synapse array.
Each storage capacitor of the synapse array can be cleared individually by activating a causal or anti-causal column correlation reset signal together with a row-wise correlation reset enable. During network operation the PPU generates a pattern on the correlation reset inputs, depending on the results of the plasticity calculations, before it applies the row reset enable. The reset voltage can be adjusted, as can the bias current of the readback source followers, to match the readback voltage range to the input range of the correlation ADCs.
B. Plasticity Processing Unit
Fig. 10 shows an overview of the PPU. It is a general-purpose micro-processor extended with a functional unit specialized for parallel processing of synapses. The general-purpose part implements the Power ISA 2.06 [27] in order to be compatible with existing compilers. We have chosen a 32 bit embedded implementation. Instructions are issued in order and can retire out of order. The core does not have a floating-point unit, but includes a fixed-point hardware multiplier and divider. In the presented chip it has access to 16 kiB of main memory with a direct-mapped instruction cache of 4 kiB. The SystemVerilog source code of the implementation is available as open source from [28].
Fig. 10. The PPU is part of the plasticity sub-system and computes weight updates. It consists of a general-purpose part implementing the Power ISA and a special-function unit to accelerate computations using Single Instruction Multiple Data (SIMD) operations. The processor has access to 16 kiB of on-chip memory and uses a 4 kiB direct-mapped instruction cache. The special-function unit consists of a shared control unit for multiple datapath slices operating on 128 bit vectors. See Fig. 11 for details of the vector unit.
The special-purpose functional unit implements an instruction set extension for accelerated processing of synapses. Following the SIMD principle, a single control unit operates multiple datapath slices (two in the presented chip). Each slice operates on 128 bit wide vectors of either eight or sixteen elements. Of these vectors, 32 can be stored in a dedicated register file in each slice. Fig. 11 shows a block diagram of the unit.
The vector unit is organized as a weakly coupled coprocessor with five functional units that have their own reservation stations. Upon encountering vector instructions, the general-purpose part sends them to a queue, which completes execution on the general-purpose side. The vector unit takes instructions in order from this queue, decodes them and distributes them to the appropriate functional units.
Fig. 11. Detailed view of the special-function unit for SIMD operations within the PPU. The general-purpose part sends instructions with optionally a 32 bit operand from the general-purpose register file via a queue. The decoding unit dispatches instructions to the respective reservation station upon availability. When operands are available and the execution unit is ready, the reservation station issues the operation to the control unit, which controls the multiple parallel datapaths. The vector register file has a single port for reading and writing. Access is arbitrated between reservation stations using a pseudo-random fair scheme. The serial load/store unit accesses main memory through a shared datapath.
TABLE II IMPLEMENTED OPERATIONS
The five functional units provide operations for arithmetic, comparison, permutation, load/store from main memory, and load/store from synapses. Table II lists what types of operations are implemented. All operations are available in two modes treating their operands either as vectors of sixteen 8 bit or eight 16 bit elements. This allows trade-offs between throughput and accuracy and is also necessary to support the capability of combining synapses to achieve weights of higher resolution. A minimum of 8 bit is required, since the ADC uses that particular resolution. In addition to the two modes of different size, vector elements can be treated as being in either signed integer or signed fractional representation. For the latter, saturating arithmetic is used, while integers always use modular arithmetic. The arithmetic functional unit is centered around a fused multiply-add data path, which also executes instructions for simple addition, subtraction, and multiplication.
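The difference between saturating and modular arithmetic on 8 bit elements can be made concrete with a short sketch. The function names are hypothetical and the code models only the element-wise addition of two signed 8 bit values, not the PPU instruction set.

```python
def saturating_add_8(a, b):
    """Add two signed 8 bit values, clamping the result to [-128, 127]."""
    return max(-128, min(127, a + b))


def modular_add_8(a, b):
    """Add two signed 8 bit values with wrap-around (two's complement)."""
    return ((a + b + 128) % 256) - 128


def vector_add(xs, ys, add):
    """Apply an element-wise add to two equally long vectors."""
    return [add(x, y) for x, y in zip(xs, ys)]


sat = vector_add([100, -100, 5], [100, -100, 6], saturating_add_8)
mod = vector_add([100, -100, 5], [100, -100, 6], modular_add_8)
```

For a weight update, saturation keeps an overflowing result at the largest representable value, whereas wrap-around would turn a large positive value into a negative one.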
The comparison unit writes results to a vector condition register holding flags for equality, less than, and greater than for each byte. These flags can be used by a select operation provided by the permutation unit to selectively combine two registers into one depending on a previous compare operation. Also, arithmetic and load/store operations support conditional execution using the vector condition register. Further operations provided by the permutation unit are bit-shifting, loading vectors from general-purpose registers, and conversion between fractional 16 bit and storage representation.
The two load/store units serve different purposes: one is meant for initialization of vector registers by sequentially loading words of 32 bit length from main memory. The other uses a fully parallel bus for accesses on synapses and the ADC. In the presented chip this bus has a width of 256 bit.
C. Input/Output With Analog Part
A specialized Input/Output (IO) unit translates the load and store operations on the parallel bus into transactions to the appropriate blocks based on the used address. Potential targets are synapse memory, ADC, and correlation readout. Typically, the PPU will iterate over all rows of synapses sequentially, reading weights and correlation data and writing back updated weights. Therefore, the access unit allows multiple transactions to be in progress simultaneously. For example, a Static Random Access Memory (SRAM) read operation can be performed while an analog-to-digital conversion of correlation data is ongoing.
The presented chip can process 32 synapses in parallel, when using byte-mode operations. Therefore, it takes two steps to compute updates for a full row of 64 synapses. Since IO operations work on full rows, the access unit supports buffering: results are kept in the output registers of analog blocks after a read transaction completes. If the next read refers to the same row, the buffered results are returned immediately.
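The effect of the row buffering can be sketched with a small model in which a full row of 64 values is fetched once and then served in two halves of 32. The class `RowReadBuffer` and the callback `convert_row` are hypothetical; the model counts conversions only and ignores timing.

```python
class RowReadBuffer:
    """Keep the result of the last full-row read and reuse it."""

    def __init__(self, convert_row):
        self._convert_row = convert_row  # returns 64 values for a row index
        self._row = None
        self._data = None
        self.conversions = 0

    def read(self, row, half):
        """Return 32 values: half 0 is columns 0..31, half 1 is 32..63."""
        if row != self._row:
            self._data = self._convert_row(row)
            self._row = row
            self.conversions += 1
        start = half * 32
        return self._data[start:start + 32]


buf = RowReadBuffer(lambda r: [r * 100 + c for c in range(64)])
first_half = buf.read(1, 0)
second_half = buf.read(1, 1)   # served from the buffer, no new conversion
conversions_after_row_1 = buf.conversions
buf.read(2, 0)                 # new row, triggers another conversion
```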
The access unit also executes requests from outside of the chip performed through a 32 bit wide bus. Arbitration with PPU accesses uses a pseudo-random fair scheme: a flip-flop indicates which requester is favored upon conflict. For every conflict the state of the flip-flop is inverted.
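The flip-flop based arbitration can be sketched as follows. The class name and the initial preference for the PPU are assumptions of this sketch; the text only specifies that the favored requester alternates on every conflict.

```python
class ToggleArbiter:
    """Two-requester arbiter whose preference flips on every conflict."""

    def __init__(self):
        self.favor_ppu = True  # assumed initial state of the flip-flop

    def grant(self, ppu_req, ext_req):
        """Return 'ppu', 'ext', or None for the current request pair."""
        if ppu_req and ext_req:
            winner = "ppu" if self.favor_ppu else "ext"
            self.favor_ppu = not self.favor_ppu
            return winner
        if ppu_req:
            return "ppu"
        if ext_req:
            return "ext"
        return None


arb = ToggleArbiter()
conflict_winners = [arb.grant(True, True) for _ in range(4)]
```

Requests without a conflict are granted directly and leave the flip-flop unchanged.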
D. Considerations for Plasticity Processing
The architecture includes several design decisions geared towards the main use-case of computing weight updates. Synaptic plasticity models from biology are typically local to the synapse, i.e., synapses can be computed independently. This is true for classical Spike-timing dependent plasticity (STDP) models [17] , [26] and many phenomenological models [29] - [33] . Therefore, parallel processing of synapses is viable and we realize this using the SIMD approach.
The vector unit is weakly coupled to the general-purpose part of the processor: the two parts do not synchronize instruction execution or share instruction tracking logic. The general-purpose part stalls only when the instruction queue is full. This allows execution in both parts to overlap to a large extent. The general-purpose part is primarily concerned with control flow and sends the plasticity kernel to the vector unit as a stream of instructions.
For the execution of the plasticity kernel it is important that IO accesses and computation are pipelined to achieve good performance. While new weights for the current row of synapses are computed, the ADC should simultaneously convert analog values for the next row. To achieve this in an efficient and automatic way, we use reservation stations for out of order execution of vector operations. Each functional unit shown in Fig. 11 has a reservation station (shown in green). Within one reservation station instructions are issued in order.
Implementing several reservation stations is more costly than following a simpler scheme for in-order issue as it is done in the general-purpose part. Because control logic is shared for all vector slices, this additional cost does not impinge on scalability to larger synapse arrays. On the other hand, the area of the vector slices themselves has to be minimized. This is reflected, for example, in the use of a single-port register file instead of a more typical three-port variant. As a consequence, register access is a bottleneck for execution that has to be minimized: an instruction will typically read two operands and write one result, requiring three cycles on the register file. Therefore, we opted for a multiply-accumulate unit with internal accumulator, so that multiplication and summation can be done in one instruction and instructions can be chained without dependency on the register file.
Apart from that, we selected a minimal set of instructions focusing on fixed-point arithmetic and IO operations to save area in the vector slices. The only concessions are pack and unpack operations as part of the permute unit to efficiently convert between weight representations for storage and computation (see Section III-B).
To save power while plasticity is not needed at all or while waiting for the next update cycle, the clock of the PPU is gated. The clock is disabled when the PPU enters the sleep state by executing the Power ISA wait instruction. Any interrupt request, for example from a timer or an external request, re-enables the clock and wakes the processor up.
... and $A_-$, respectively. These values together with the synaptic weight $w$ are the input for the PPU that computes the new weight $w'$.
III. THEORY AND METHODS
For this study we use a multiplicative STDP rule as reference model (see, for example, [26])
Here, λ is a scaling parameter, w_max is the maximum weight, δ is the time difference between pre- and postsynaptic firing (δ > 0 if the presynaptic event occurs before the postsynaptic one), τ± are time constants, and α controls the asymmetry between the pre-before-post (δ > 0) and post-before-pre (δ < 0) branches.
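As a reading aid, one common way to write a multiplicative rule with the symbols defined above is sketched below. It follows the standard multiplicative STDP formulation and is an assumption about the form of the reference model, not a verbatim reproduction of it. In particular, the placement of α on the depression branch and the sign convention in the exponents are assumed.

```latex
w' =
\begin{cases}
w + \lambda\,(w_{\max} - w)\,\exp\!\left(-\dfrac{\delta}{\tau_+}\right), & \delta > 0,\\[1ex]
w - \lambda\,\alpha\,w\,\exp\!\left(\dfrac{\delta}{\tau_-}\right), & \delta < 0.
\end{cases}
```

In this form, potentiation scales with the remaining headroom w_max − w and depression scales with the current weight w, which is what makes the rule multiplicative.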
The exponential term in (14) is realized by the synapse circuit itself (see Section II-A) and accumulated on the local traces a±. The a± correspond to the voltage on C_storage in the synapse circuit. We use a different symbol here to refer to the value visible to the PPU, i.e., including the offset from the source follower of the readout circuit. The a± are also inverted compared to the physical voltage, so that a± = 0 V corresponds to the reset value on C_storage. In an idealized model of the actual circuit, these traces are given by summing over previously observed spike pairs
with the analog accumulation rates η±. The summed pairs are selected according to a reduced symmetric nearest-neighbor pairing rule as defined in [26]. This is the same pairing scheme as used in [14]. To approximate the rule described by (14), the PPU uses the converted digital values A± to compute
After the update, the accumulation traces a± are reset to zero.
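The division of labor described above (exponential weighting and accumulation in the synapse, weight computation in the PPU) can be illustrated with a minimal Python sketch. The functional forms are assumptions chosen to be consistent with the surrounding text. The function names, the normalization of the traces to a unit range, and the use of the difference A+ − A− as the update signal are all hypothetical and not taken from the actual update program.

```python
import math

def accumulate_traces(deltas, tau_plus, tau_minus, eta_plus=1.0, eta_minus=1.0):
    """Idealized correlation traces (assumed form).

    deltas: time differences t_post - t_pre of the selected spike pairs,
            so delta > 0 means pre-before-post.
    Returns (a_plus, a_minus), each a sum of exponentially weighted pairs.
    """
    a_plus = sum(eta_plus * math.exp(-d / tau_plus) for d in deltas if d > 0)
    a_minus = sum(eta_minus * math.exp(d / tau_minus) for d in deltas if d < 0)
    return a_plus, a_minus

def ppu_update(w, A_plus, A_minus, lam=1.0, alpha=1.0, w_max=1.0):
    """Hypothetical multiplicative update from digitized traces.

    Potentiation scales with (w_max - w), depression with w.
    The result is clamped to [0, w_max], mimicking saturating arithmetic.
    """
    A = A_plus - A_minus
    if A > 0:
        w_new = w + lam * (w_max - w) * A
    else:
        w_new = w + lam * alpha * w * A  # A <= 0: depression or no change
    return min(max(w_new, 0.0), w_max)
```

In a complete update cycle, the traces would be read, the weight updated, and the traces reset to zero before the next accumulation period.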
A. STDP Interaction Box
To quantify the measured STDP curves we extract two measures from the observed a±(Δt) dependency: the amplitude â± and the full width at half maximum (FWHM) τ̂±. For illustration they are plotted together as a box with height â± and width τ̂± in Fig. 15. The amplitude is given as
The FWHM is given as the range where a± is below (1/2)(max a± − min a±) + min a±.
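The two box measures can be extracted from a sampled curve as in the following minimal sketch. It assumes that the amplitude is the difference between maximum and minimum of the curve, which matches the half-level used in the FWHM definition but is not stated explicitly here. The function name is hypothetical, and no interpolation between samples is performed, so the width is resolved only to the sampling step.

```python
def interaction_box(dts, a):
    """Return (amplitude, fwhm) for one branch of a sampled curve a(dt).

    dts: sample positions in ascending order.
    a:   readout values at those positions.
    """
    a_max, a_min = max(a), min(a)
    amplitude = a_max - a_min  # assumed definition of the amplitude
    half_level = 0.5 * (a_max - a_min) + a_min
    # The width is the span of sample positions where the curve is below half level.
    below = [dt for dt, v in zip(dts, a) if v < half_level]
    fwhm = (max(below) - min(below)) if below else 0.0
    return amplitude, fwhm
```

The "below" comparison follows the definition in the text, where larger accumulated correlation corresponds to a lower readout value.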
B. Bit-Representation of Weights
Each synapse provides 6 bit of SRAM memory for weight storage. Two synapses can be combined to increase the effective weight to 12 bit. The PPU uses either 8 bit or 16 bit operations, giving some freedom in how weights are represented for computation. For this study, we use a fractional number format with saturating arithmetic, i.e., over- and underflows are prevented by saturating to maximum and minimum values [34]. Weights are aligned to use the range from 0 to 1, i.e., one zero bit is added to the right for 8 bit computations as follows:
Here, the w_i are the individual bits of the weight with w_5 being the most significant bit (MSB). For 12 bit weights the representation is as follows:
Since for this study synaptic transmission of events to the neuron is not used, weights are permanently kept in a vector register, so no IO operations are performed.
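The alignment of a 6 bit weight into an 8 bit fractional value can be illustrated as follows. This is a sketch under the assumption that the 8 bit word consists of a zero sign bit, the six weight bits, and one zero padding bit on the right. The function names are hypothetical and do not correspond to PPU instructions.

```python
def pack6_to_q7(w6):
    """Align a 6 bit weight into a signed 8 bit fraction (assumed layout: 0 w5..w0 0)."""
    assert 0 <= w6 < 64
    return w6 << 1  # the sign bit remains zero

def q7_to_pack6(x):
    """Clamp to the non-negative 8 bit range and drop the padding bit."""
    x = min(max(x, 0), 127)
    return x >> 1

def sat_add_q7(x, y):
    """Saturating addition in the signed 8 bit range [-128, 127]."""
    return min(max(x + y, -128), 127)

def q7_to_float(x):
    """Interpret a signed 8 bit fraction as a real number in [-1, 1)."""
    return x / 128.0
```

With this layout the largest 6 bit weight, 63, maps to 126/128, so stored weights cover the range from 0 to just below 1, and a sum that would overflow is clamped instead of wrapping around.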
IV. SIMULATIONS
To quantify the inaccuracies added by weight resolution and numerical precision of computations performed by the PPU, we simulate the protocol outlined in Fig. 12 and Section III with an idealized synapse circuit and ADC. This means that accumulation by the synapse follows equations (15) and (16) exactly. The PPU computes weight updates according to (17) and (18) using 8 bit mode for 6 bit weight resolution and 16 bit mode for 6 bit and 12 bit weight resolutions.
A. Numerical Accuracy
[...] Fig. 13(a), below 1.9 × 10^−2 for Fig. 13(b), and below 1.5 × 10^−2 for Fig. 13(c). Notably, 6 bit weights are systematically smaller than predicted. According to these results, the use of the 16 bit mode for 6 bit synapses reduces the error, especially for large updates, i.e., small |Δt|.
B. Updating Performance
The simulation used for the previous section also provides performance results in terms of achievable update rates. Depending on the learning task, a minimum update rate may be required for correct functionality [35]. The classical model of STDP assumes immediate updates to the weight, so any delay can lead to a mismatch with software simulations. Table III shows performance results for four different scenarios with and without ADC conversions and for different weight resolutions. The number of cycles represents the total time to update the full array of synapses. Row time is the resulting duration for a single row assuming a clock frequency of 500 MHz. The biological update rate shows the frequency of updates as seen by a single synapse translated into the biological time domain. The latter number assumes that the update program iterates over all rows, updating synapses in turn, and is therefore a worst-case estimate.
The update frequencies are in all cases high compared to spike frequencies in the range of approximately 1-15 Hz expected from biology [36]-[38]. A previous study has identified 1 Hz as a lower threshold for a particular correlation detection task [35]. However, the updating mechanism in that study did not use an ADC but only employed a threshold comparison, leading to larger errors on the accumulation traces a± for longer delays. It is, therefore, conceivable that for the same task the PPU-based approach is less sensitive to update frequency.
The ADC requires 560 ns for the conversion of one row of synapses. Rows 1 and 2 in Table III show that all other operations can execute in less time. Therefore, conversion by the ADC limits the update rate. Updates for 12 bit weights are generally faster, because two rows of synapse circuits are combined into one logical row. This leads to half the number of ADC conversions and computational operations. The additionally required pack and unpack operations to convert between stored and logical representation (see Section III-B) do not impact performance.
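The relation between row time and biological update rate can be made concrete with a short calculation. The sketch below assumes that ADC conversion and computation overlap perfectly when pipelined, so the slower of the two sets the row time. The function names, the compute time of 400 ns, and the row count of 32 (taken from the prototype described in Section V) are illustrative assumptions and are not values from Table III.

```python
def row_time_s(t_adc_s, t_compute_s, pipelined=True):
    """Duration of one row update; with pipelining the slower stage dominates."""
    if pipelined:
        return max(t_adc_s, t_compute_s)
    return t_adc_s + t_compute_s

def biological_update_rate_hz(n_rows, t_row_s, speedup=1000.0):
    """Worst-case update rate seen by one synapse, in biological time.

    Assumes the update program visits all rows in turn before returning
    to the same synapse.
    """
    t_array_hw = n_rows * t_row_s          # hardware time for the full array
    return 1.0 / (t_array_hw * speedup)    # translate to biological time

# Hypothetical example: ADC-limited rows on a 32-row array.
t_row = row_time_s(560e-9, 400e-9)
rate = biological_update_rate_hz(32, t_row)
```

Under these assumptions, a full pass takes 32 × 560 ns = 17.92 μs in hardware time, which corresponds to roughly 56 Hz in the biological time domain.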
V. EXPERIMENTS
Fig. 1 shows the produced chip and the test setup used for experiments. The chip contains 64 neurons with 32 synapses each for a total of 2048 synapses. A single-ended SerDes link provides communication with a Xilinx Spartan-6 FPGA for control and event data. Link and internal logic operate with the same clock signal provided via a chip pin. The system is designed for frequencies up to 500 MHz and operated at 97.5 MHz in this study.
The FPGA is equipped with 512 MiB of DDR3-SDRAM and communicates with a PC via USB 2.0. Due to the real-time nature of neuromorphic hardware and the small time-scales involved, communication with the chip is buffered in the on-board SDRAM attached to the FPGA and played back under precise timing control. The FPGA uses a byte-code with instructions of variable length to provide efficient coding with 64 bit effective time stamp resolution. The byte-code is executed at a clock frequency of 97.5 MHz, leading to a best-case temporal precision of 10.26 ns. Responses and events are recorded with annotated timing information using the same byte-code representation.
A. Weight Linearity
We first analyze the DAC within the synapse. Fig. 14 shows the average output current for a total of 96 synapses on one chip over the full range of 64 possible weight values. The current was measured by sending a high-frequency input spike train to the synapse and measuring the resulting current using a readout pin and an external current measurement device (Keithley SourceMeter 2635). A linear fit yields an offset of 22.79 nA and a value of 11.52 nA for one least significant bit (LSB). With these values the maximal integral nonlinearity (INL) is 4.83 LSB, while the mean INL is 1.06 LSB. The systematic shift at the transition from code 31 to 32 is caused by well-proximity effects: two fingers of the MSB transistor of the DAC are too close to an adjacent well. This was only discovered after tape-out.
B. Variability
Fig. 15 shows the measured dependency a±(Δt) using the experimental protocol illustrated in Fig. 12. The curves were measured using N = 32 spike pairs and analog parameters V_ramp = 250 mV and V_store = 350 mV. For all following experiments shown in this study ambient temperature was kept at 25 °C. The data shown in Fig. 15 is corrected for different offsets of the readout on different chips. All curves are shifted vertically, so that without stimulation the average a± lies at 1.00 V. This way the curves can be shown and compared in one plot. For learning applications the offset is determined on program startup by the PPU. The results show biologically realistic time-constants of approximately 20 μs to be achievable. Here, we use a speed-up factor of 10^3 to convert from biological time-constants of approximately 20 ms given in [39]-[41]. The average time-constants in Fig. 15 are τ± = 30 μs with a standard deviation of 10 μs for Fig. 15(a) and (b) and 8 μs for Fig. 15(c) and (d). The achievable ranges are discussed later (see Fig. 16).
Trial-to-trial variability for individual synapses is generally small. The mean trial-to-trial standard deviation for all four plots is equal within errors at 8 ± 5 mV. Therefore, the variation between synapses that can be seen in the plots is due to device mismatch within the synapse circuit itself and mismatch within the readout channels of the ADC. Fig. 15(c) and (d) show only data for a single channel each. Concerning amplitude, standard deviations for the multi- and single-channel cases are comparable: â± = 400 ± 140 mV for Fig. 15(a) and (c), â± = 600 ± 180 mV for Fig. 15(b) and (d). For the time-constants, single-channel data exhibit slightly less variability [see Fig. 15(c) and (d)]. However, differences are small and overall variability can be assumed to be dominated by mismatch between the synapse circuits themselves.
C. Achievable Ranges
To configure the shape of the STDP curve the circuit provides two primary configuration parameters: V_store and V_ramp (see Section II-A). V_store controls the storage gain and V_ramp the time constant (see Fig. 5). We measured 192 synapses on three different chips, sweeping both parameters to find the achievable amplitudes â± and widths τ̂±. Fig. 16 shows the results while using N = 32 spike pairs (see Section III-A for the definition of the plotted quantities "width" and "amplitude").
The usable range is the parameter range for which V_store controls the amplitude and V_ramp controls the width, while the respective other property, i.e., width for V_store and amplitude for V_ramp, remains flat. Within this range, the shape of the STDP curve can therefore be tuned with the given parameters (see Fig. 16). Table IV gives mean and standard deviation at start and stop of this range for â and τ̂. The amplitude covers nearly the full 1 V dynamic range of the ADC input. Time-constants show a large configurable range from tens to hundreds of micro-seconds. Even lower values down to 2 μs are configurable, but the error will stay at 4 μs, so we have excluded these values from the usable range. The amplitude can at most be as large as the available input range of the ADC, which is evident in the measured data.
D. Full-System Experiments
With the individual channels characterized, the next step is to look at the full signal processing chain. We use the experimental protocol described in Section III and illustrated in Fig. 12. The PPU performs weight updates according to (18). To eliminate trial-to-trial noise on the analog readout and to remove systematic offset between the two channels of one synapse, we modify (17) to
Here, A_off is determined at program startup after reset of the accumulation storage as the difference A+ − A−. Equation (21) implements thresholding using the user-selected parameter θ. The PPU performs updates at regular intervals of 10 μs during stimulation. The source code for the update program actually used is available from [42]. Fig. 17 shows results when using 8 bit resolution for arithmetic. For analysis, two functions f+ and f− are individually fitted to pre-before-post (Δt > 0) and post-before-pre (Δt < 0) pairings
Here, b± and c± are the fit parameters, while initial weight w and maximum weight w_max are the same as those used by the update program. In all experiments we set w_max, λ, and α to 1.0. The threshold θ was set to 10. Since discretization of the weight removes the long tail of the exponential, the fit is restricted to points where the weight was actually changed (w′ ≠ w). Fig. 17(a)-(c) demonstrate different combinations of N and w. Especially for small updates, the discretization of the weight to 6 bit is apparent. Results exhibit the expected dependency on w for a multiplicative rule.
Fig. 17. Results from experiments using the full signal chain including PPU, ADC, and correlation sensor in the synapse. Synapses are stimulated according to the protocol outlined in Fig. 12. The PPU computes a multiplicative update rule according to (21) and (18) [...]
Fig. 17(d) and (e) plot the fitted parameters for amplitude (b±) and time-constant (c±) over the number of spike pairs N. As expected, amplitude increases linearly with the number of pairs, and for the chosen initial weight w = 0.25 positive changes are larger than negative ones. The process that measures the timing of spike pairs in the synapse operates on individual pairs and is therefore independent of N. Also, the circuit for time measurement is shared for pre-before-post and post-before-pre pairs within one synapse. Therefore, time-constants should be identical on both sides and independent of N. Experimental data is compatible with these expectations, as Fig. 17(e) shows. For small N the fit is not reliable due to discretization [see Fig. 17(b)].
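The offset correction and thresholding applied to the digitized traces before the weight update, as described at the beginning of this subsection, can be sketched as follows. The exact form of the modified equation is not reproduced here, so the function below is an assumption based on the verbal description: the startup offset is subtracted from the difference of the two channels, and results whose magnitude does not exceed θ are treated as zero. The function names and the strict comparison are hypothetical.

```python
def measure_offset(A_plus_reset, A_minus_reset):
    """Offset between the two channels, read right after resetting the traces."""
    return A_plus_reset - A_minus_reset

def corrected_difference(A_plus, A_minus, A_off, theta):
    """Offset-corrected, thresholded correlation signal (assumed form).

    Differences with magnitude at or below theta are suppressed,
    which removes small fluctuations caused by readout noise.
    """
    d = A_plus - A_minus - A_off
    return d if abs(d) > theta else 0
```

With θ = 10 as used in the experiments, a corrected difference of 5 ADC counts would be ignored under this sketch, whereas a difference of 25 counts would be passed on to the weight update.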
The plots in Fig. 17(f)-(i) give a hint of the achievable flexibility. They were produced with the same stimulation protocol, solely by changing the software running on the PPU. Fig. 17(f) and (g) show symmetrical Hebbian and anti-Hebbian rules. Fig. 17(h) is only sensitive to pre-before-post pairings. Fig. 17(i) realizes bi-stable learning.
E. Power Consumption
During execution of the experiment described in the previous section, digital logic consumes below 32 mW of power as measured on the power supply pins of the chip. With the clock disabled for the PPU, power consumption drops below 10 mW, so that 22 mW can be attributed to the PPU. Holding the PPU in reset reduces power consumption by only 2 mW. Therefore, the power consumption of the PPU is largely due to clock distribution.
VI. DISCUSSION OF RESULTS
The two overarching goals in the development of neuromorphic hardware are to provide a platform for neuroscientific experiments and to find new ways of computation for technical applications. For both these goals we believe reliability, scalability, and flexibility to be enabling factors besides efficiency in terms of power, area, and speed. Therefore, the presented results focus on these aspects.
A. Reliability
To assess reliability we characterized the synapse behavior across three different chips (see Figs. 15 and 16). Results show substantial variation due to device mismatch within the analog circuits. Note that the configuration bits of the synapse circuit were not used for these measurements (see Section II-A), so there is room for improvement through calibration. On the other hand, trial-to-trial variability of individual components is small. This is, for example, illustrated in Fig. 16, which shows multiple trials from a single synapse against the background of the overall distribution of synapses. A small trial-to-trial variability was also measured for individual channels in Section V-B. This makes off-line calibration possible; it is also conceivable that an emulated network calibrates itself through the use of plasticity. Indeed, the robustness of reward-based learning to device mismatch on the correlation detection within the processor-based approach presented here has been shown in previous work [23]. Self-tuning has also been shown to be feasible through the use of short-term plasticity [43].
In general, a plasticity mechanism can compensate inhomogeneities if there is a feedback loop for the parameter subject to variation. This is typically the case for outputs, e.g., synaptic weights. An STDP rule will modify the weight according to the timing behavior it observes, which is an effect of the weight including variation. Variation on the input, however, in this case the signal a± from the correlation sensor, is invisible to the rule. It can be compensated by introducing additional information about the behavior of the system, for example through a reward signal. This addition of complementary information is why reward-based learning rules are well suited for analog neuromorphic hardware systems. The alternative is to use redundancy of analog components so that the average behavior is reliable.
B. Scalability
Scalability can of course only be shown by actually scaling the system, which we plan to do in the future. Nonetheless, the plasticity system is designed to scale well: the only part for which area scales linearly with the number of synapses is the correlation sensor that resides within the synapse circuit. Therefore, we have chosen to use an area-optimized circuit realized as analog full-custom design. The ADC scales with the number of columns in the synapse array, which typically grows with the square root of the number of synapses. Most parts of the PPU are required only once per array and only the number of vector slices scales with the number of columns. To keep these slices as lightweight as possible, all control logic is shared and a single-port vector register file is used. A scaled system will feature arrays of 256 × 256 synapses with a dedicated PPU using 8 vector slices.
C. Flexibility
The whole approach presented here has a strong emphasis on flexibility, compared to our previous implementation [14] and considering the constraints of an analog, accelerated neuromorphic system. By this we mean that a large number of plasticity rules should be implementable in the hardware system. Introducing the PPU sacrifices area and power in order to have as much freedom as possible while not sacrificing speed. To achieve this latter point in the 65 nm technology, we consider a combination of analog and software-based processing, as shown in this study, to be necessary. At a speed-up factor of 10^3 and array sizes of 65 k synapses it is not feasible to process individual spike events in software. This of course limits flexibility, as the functionality of the correlation sensor is fixed in hardware. Therefore, this functionality should at least operate over a wide range of parameters, as demonstrated by the results shown in Fig. 16 and Table IV. In the biological time domain, the design covers ranges from tens to hundreds of milliseconds, fitting typical ranges found in biology [39]-[41]. The amplitude is also tunable over a large range, so that the sensitivity of the correlation sensor can be matched to the network activity.
In general, every plasticity model that depends only on observables visible to the PPU and affects only parameters accessible by the PPU is implementable in this system. Observables are the weight w, the correlation signals a±, a firing rate sensor not discussed in this study, and signals from outside the chip such as reward. All parameters of the chip that can be modified at all can also be modified by the PPU. This includes the synaptic weight w, neuron parameters, and the topology of the network. For this prototype chip, control over the topology is limited to the addresses stored in the synapses.
In future realizations it is feasible to increase the number of observables of the PPU. It is planned to include a fast ADC in a forthcoming chip which will give the PPU access to membrane voltages. It is also feasible to add synapse correlation measurement circuits with novel properties, if there are plasticity models demanding them.
Here we only show the simple STDP rule given in (14) as proof of concept. Fig. 17(f)-(i) show simple examples of modifications of the plasticity model purely realized in software running on the PPU. Beyond that, the reward-based learning rule R-STDP and a learning rule for spike-based expectation maximization have been ported to the system, but not yet tested in hardware [23], [44].
VII. CONCLUSION
In this study we have presented a new approach to plasticity in neuromorphic hardware: the combination of dedicated analog circuits in every synapse with a shared digital processor. It represents a trade-off between flexibility of implementable plasticity models and efficiency of the implementation in terms of area, speed, and energy. The presented results demonstrate the viability of this approach for plasticity.
The more classical approach taken for neuromorphic hardware, for example by [10] or [45], is to implement a single plasticity mechanism that can be used to solve a range of network learning tasks. Analog continuous-time implementations of neuromorphic circuits can be combined with floating-gate technology to achieve persistence of the learned synaptic weights. By modifying the control signals, the precise learning rules can also be tuned [46]. Our approach not only aims for flexibility in the learning task, but also in the mechanism itself. Together with the speed-up factor this enables experimental analysis of long-term effects of such mechanisms. In the classical approach it is essential to have a detailed understanding of the mechanism prior to production of hardware. In our approach the hardware system can help to gain this understanding. This is an important aspect when designing a system intended as a neuroscientific platform.
In [47] this approach is taken even further: neuronal dynamics as well as detection of correlations and weight update are performed by general-purpose processors in software. Specialized hardware is only used for event communication. This maximizes flexibility but further sacrifices efficiency, so that operation is only possible without speed-up. Another mixed-mode approach is reported in [48]. Here, the authors also perform the full plasticity operation in software, achieving maximum flexibility, while the synapses and neurons are full-custom analog implementations.
Our approach of using dedicated hardware for the most expensive part, the processing of spikes, enables faster operation. Using an on-die PPU local to the synapse circuits also facilitates scaling of the system, since no communication to off-chip components is necessary. Since learning and development in biology are processes spanning many time-scales, platforms for accelerated simulation or emulation are important. In the domain of software simulations on general-purpose computers, accelerated operation with plasticity is currently not possible even for medium-sized networks [21].
VIII. OUTLOOK
The chip presented here is still an early prototype that, for example, lacks on-chip networking capabilities. However, using the experimental setup described here, a wide range of plasticity mechanisms can already be implemented and analyzed in hardware. Obvious candidates are the models already prepared for implementation [23], [44]. Future prototypes will add the ability to include neuronal and structural plasticity, opening the door for a large set of learning mechanisms. It will then also be possible to execute learning tasks involving networks of neurons with the system.
In the long run, the focus will be on scaling the system in size. As an intermediate step we plan to build chip-scale variants with two 256 × 256 synapse arrays and two PPUs. Eventually, the goal is to go to wafer-scale [7] . It will then replace the first generation of the neuromorphic platform (NM-PM-1) of the Human Brain Project [49] .
We also hope that the release of the PPU design (the Nux processor [28]) as open source will turn out to be a valuable contribution to open source hardware.
