Abstract-Recent trends in the field of artificial neural networks (ANNs) and convolutional neural networks (CNNs) investigate weight binarization for full on-chip weight storage to minimize circuit resources and to avoid the high energy cost of off-chip memory accesses. In parallel, spiking neural network (SNN) architectures are explored to further reduce power when processing sparse event-based data streams, while on-chip spike-based online learning targets applications constrained in power and resources during the training phase. However, leveraging high-density on-chip online learning in binary-weight SNNs is still an open challenge. In this work, we demonstrate MorphIC, a quad-core binary-weight digital neuromorphic processor embedding a stochastic version of the spike-driven synaptic plasticity (S-SDSP) learning rule and a hierarchical routing fabric for large-scale chip interconnection. The MorphIC SNN processor embeds a total of 2k leaky integrate-and-fire (LIF) neurons and more than two million plastic synapses for an active silicon area of 2.86mm² in 65nm CMOS, achieving a high density of 738k synapses/mm².
I. INTRODUCTION
The massive deployment of neural network accelerators as inference devices is currently hindered by the memory footprint and power consumption required for high-accuracy classification [1]. Two trends are currently being explored in order to solve this issue. The first trend consists in optimizing current artificial neural network (ANN) and convolutional neural network (CNN) architectures. Weight quantization down to binarization is a promising approach as it simplifies the operations and minimizes the memory footprint, thus avoiding the high energy cost of off-chip memory accesses if all the weights can be stored in on-chip memory [2]. The accuracy drop induced by quantization can be mitigated to acceptable levels for many applications with the use of quantization-aware training techniques that propagate binary weights during the forward pass and keep full-resolution weights for backpropagation updates [3]. The associated off-chip learning setup for quantization-aware training is shown in Fig. 1(a): this strategy allows binary-weight on-chip neural networks to perform inference with a favorable energy-accuracy tradeoff, as recently demonstrated by binary CNN chips (e.g., [4]-[6]).
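The quantization-aware training principle described above can be illustrated with a minimal sketch for a single linear neuron. It assumes a sign-based binarization to −1/+1 and a straight-through gradient (the gradient computed with the binary weights is applied directly to the full-resolution weights), which is a common choice and not necessarily the exact scheme of [3]. The names `binarize` and `qat_step` and the learning rate are hypothetical.

```python
def binarize(weights):
    """Map full-resolution weights to binary values in {-1, +1}."""
    return [1 if w >= 0 else -1 for w in weights]


def qat_step(latent, x, target, lr=0.1):
    """One training step on a single linear neuron with a squared-error loss.

    The forward pass uses the binarized weights; the update is applied to the
    full-resolution (latent) weights, which are clipped to [-1, 1].
    """
    binary = binarize(latent)
    y = sum(b * xi for b, xi in zip(binary, x))   # forward pass, binary weights
    err = y - target                              # dLoss/dy for 0.5 * (y - t)^2
    # Straight-through assumption: dLoss/d(binary_i) = err * x_i is used
    # as the gradient for the latent weight.
    updated = [w - lr * err * xi for w, xi in zip(latent, x)]
    return [max(-1.0, min(1.0, w)) for w in updated]


latent = qat_step([0.1, -0.2], x=[1.0, 2.0], target=3.0)
binary_for_inference = binarize(latent)  # only these 1-bit values go on chip
```

Only the binarized weights would need to be stored in the inference device; the full-resolution copy stays in the off-chip optimizer.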
The second trend consists in changing the neural network architecture and data representation, which is currently being explored with bio-inspired spiking neural networks (SNNs) as a power-efficient processing alternative for sparse event-based data streams [7]. Embedded online learning is a key feature in SNNs as it enables on-the-fly adaptation to the environment [8]. Moreover, by avoiding the use of an off-chip optimizer, on-chip online learning allows SNNs to target applications that are power- and resource-constrained during both the training and the inference phases, as shown in Fig. 1(b). Spike-based online learning is an active research area, both in the development of new rules for high-accuracy learning in multi-layer networks (e.g., [9]-[12]) and in the demonstration of silicon implementations in applications such as unsupervised learning for image denoising and reconstruction [13], [14]. However, these approaches currently rely on multi-bit weights. These two trends mostly evolve in parallel, as only three chips have previously been proposed to leverage the density and power advantage of binary weights with SNNs. First, the IBM TrueNorth is the largest-scale neuromorphic chip with 1M neurons and 256M 1-bit synapses; however, it does not embed online learning [15]. Second, the recently-proposed Intel Loihi has a configurable synaptic resolution that can be reduced to 1 bit and embeds a programmable co-processor for on-chip learning, though on-chip learning has not been demonstrated with a binary synaptic resolution [16]. Finally, Seo et al. propose a stochastic version of the spike-timing-dependent plasticity (S-STDP) rule for online learning in binary synapses [17]. However, S-STDP requires the design of a custom SRAM memory with both row and column accesses, which severely degrades the density advantage of their approach.

Fig. 1: Learning strategies for binary-weight neural networks. (a) Quantization-aware off-chip learning setup: binary weights are used during the forward pass while full-resolution weights are kept for backpropagation updates [3]. Training is carried out in an off-chip high-performance optimizer, while inference is carried out in the power- and resource-constrained device. (b) On-chip online learning setup, where data-driven weight updates are carried out in parallel with inference in the power- and resource-constrained device.
It has been suggested in [18] that the spike-driven synaptic plasticity (SDSP) learning rule proposed by Brader et al. in [19] allows for a more efficient resource usage than STDP: all the information necessary for learning is available in the post-synaptic neuron at pre-synaptic spike time. SDSP requires neither an expensive local synaptic storage of spike timings nor a custom SRAM with both row and column accesses. Therefore, in this work, we propose an efficient stochastic implementation of SDSP compatible with a standard high-density foundry SRAM in order to leverage embedded online learning in binary-weight SNNs. We demonstrate this approach with MorphIC, a quad-core digital neuromorphic processor with a hierarchical routing fabric for large-scale chip interconnection. MorphIC was prototyped in 65nm CMOS and embeds 2k neurons and more than 2M synapses in an active silicon area of 2.86mm², therefore achieving a high density of 738k 1-bit online-learning synapses per mm². This is an order-of-magnitude density improvement over the binary-weight online-learning SNN processor from Seo et al. [17].

978-1-7281-0397-6/19/$31.00 ©2019 IEEE
The remainder of this paper is structured as follows. The architecture and implementation of the MorphIC SNN processor are provided in Section II. The specifications, measurements and benchmarking results are then presented in Section III.
II. ARCHITECTURE AND IMPLEMENTATION
A block diagram of the MorphIC quad-core spiking neuromorphic processor is shown in Fig. 2, illustrating its hierarchical routing fabric for large-scale chip interconnection: level-0 (L0) routers handle intra-core connectivity, level-1 (L1) routers handle inter-core connectivity and level-2 (L2) routers handle inter-chip connectivity (Section II-A). A block diagram of the MorphIC core is shown in Fig. 3: each core embeds 512 leaky integrate-and-fire (LIF) neurons, 256k L0 synapses configured as a crossbar array, 256k L1 synapses and 16k L2 synapses. Time multiplexing is used to increase the neuron and synapse densities by using shared update circuits and offloading neuron and synapse states to local SRAM memory, based on the strategy we previously proposed for the ODIN SNN in [20]. Each synapse embeds online learning with a stochastic implementation of the spike-driven synaptic plasticity (S-SDSP) learning rule (Section II-B).
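As a quick consistency check, the per-core synapse counts quoted above can be summed and divided by the active silicon area. The sketch below works in units of "k" synapses as quoted in the text (the per-neuron products show that k stands for 1024 here); the variable names are illustrative.

```python
# Per-core resources as quoted in the text ("k" synapses).
NEURONS_PER_CORE = 512
L0_K = 256   # crossbar array: 512 x 512 synapses
L1_K = 256
L2_K = 16    # 32 L2 synapses per neuron
CORES = 4
ACTIVE_AREA_MM2 = 2.86

per_core_k = L0_K + L1_K + L2_K          # 528k synapses per core
total_k = CORES * per_core_k             # 2112k synapses, i.e. more than 2M
density_k_per_mm2 = total_k / ACTIVE_AREA_MM2

print(f"{total_k}k synapses, {density_k_per_mm2:.1f}k synapses/mm^2")
```

The result rounds down to the 738k synapses/mm² quoted in the abstract.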
A. Hierarchical event routing
Clustering groups of neurons with dense local and sparse long-range connectivity minimizes memory requirements while keeping flexibility and scalability [21]. This organization is found in the brain and is known as small-world networks [22]. Hierarchy is therefore a key concept in SNN event routing infrastructures for large-scale networks [15], [16], [21], [23], [24]. MorphIC uses a heterogeneous hierarchical routing fabric with different router types at each level:

1) L0 - A crossbar array provides all-to-all connectivity of the 512 neurons inside each core. 256k L0 synapses are thus available per core.
2) L1 - A star router ensures low-latency local broadcasting of events inside MorphIC chips. L1 connectivity information is stored in the source neuron: a 3-bit word allows independently enabling source-based routing to each of the three other local cores. 256k L1 synapses are thus available per core.
3) L2 - A mesh router with dimension-ordered routing (i.e. x direction before y direction) manages inter-chip connectivity. Each neuron has 32 L2 synapses, for a total of 16k per core. L2 connectivity information is stored in the source neuron and is destination-based: each neuron can target a specific L2 synapse address in any combination of cores in another MorphIC chip, at a distance of up to 4 chips in any direction. Address-event representation (AER) transactions [25] are used for L2 event routing.

This combination of routers is deadlock-free and allows reaching a fan-in of 512 (L0) + 512 (L1) + 32 (L2) and a fan-out of 512 (L0) + 3×512 (L1) + 4 (L2). A key feature of MorphIC is that synapses at all routing levels embed online learning (Section II-B): distance information is contained in event packets and modulates the probability of synaptic weight update as a function of distance, in accordance with small-world network models. To the best of our knowledge, no other previously-proposed SNN features online hierarchical learning.
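The dimension-ordered routing used at L2 and the fan-in and fan-out totals can be sketched as follows. This is an illustration only: the hop labels, the function name `xy_route` and the interpretation of the 4-chip limit as a bound on each of the x and y offsets are assumptions, not details taken from the chip.

```python
MAX_CHIP_DISTANCE = 4  # assumed to bound |dx| and |dy| separately


def xy_route(dx, dy):
    """Return the hop sequence for dimension-ordered routing.

    All hops along x are taken before any hop along y. The labels
    "E", "W", "N" and "S" are hypothetical names for the mesh directions.
    """
    if max(abs(dx), abs(dy)) > MAX_CHIP_DISTANCE:
        raise ValueError("destination chip is out of range")
    x_hops = ["E" if dx > 0 else "W"] * abs(dx)
    y_hops = ["N" if dy > 0 else "S"] * abs(dy)
    return x_hops + y_hops


# Per-neuron connectivity totals quoted in the text.
FAN_IN = 512 + 512 + 32        # L0 + L1 + L2 = 1056
FAN_OUT = 512 + 3 * 512 + 4    # L0 + L1 + L2 = 2052

print(xy_route(2, -1), FAN_IN, FAN_OUT)
```

Fixing the order of dimensions is a classic way of avoiding cyclic channel dependencies in a mesh, which is consistent with the deadlock-free property stated above.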
B. Stochastic spike-driven synaptic plasticity (S-SDSP)
As the spike-timing-dependent plasticity (STDP) learning rule relies on the relative timing between pre- and post-synaptic spikes, it requires a local synaptic buffering of spike timings, which leads to critical overheads as buffering circuitry has to be replicated inside each synapse [18]. In order to avoid this problem, the stochastic binary approach proposed by Seo et al. in [17] involves the design of a custom SRAM with both row and column accesses to carry out STDP updates each time pre- and post-synaptic spikes occur. However, beyond increasing the design time, custom SRAMs do not benefit from DRC pushed rules for foundry bitcells and induce a strong area penalty compared to single-port high-density foundry SRAMs [18]. Therefore, STDP cannot be implemented efficiently in silicon. The spike-driven synaptic plasticity (SDSP) learning rule [19] avoids this drawback: the synaptic weight w is updated each time a pre-synaptic event occurs, according to Eq. (1). The update depends solely on the state of the post-synaptic neuron at the time of the pre-synaptic spike, i.e. the membrane potential V_mem compared to threshold θ_m and the Calcium concentration Ca compared to thresholds θ_1, θ_2 and θ_3. The Calcium concentration represents an image of the recent firing activity of the neuron: it disables SDSP updates for high and low post-synaptic neuron activities and helps prevent overfitting [19]. A single-port high-density foundry SRAM can therefore be used for high-density time-multiplexed implementations. However, as SDSP relies on discrete positive and negative steps, it cannot be applied directly to binary weights.
Senn and Fusi proposed a stochastic learning rule for binary synapses in [26]. However, the update conditions rely on the total synaptic input of the post-synaptic neuron at the time of the pre-synaptic spike, which is difficult to extract in time-multiplexed implementations. Therefore, we propose a stochastic spike-driven synaptic plasticity (S-SDSP) learning rule suitable for binary weights, as formulated in Eq. (2). It results from the fusion of the stochastic mechanism proposed in [26] with the SDSP update conditions. ζ+ and ζ− are binary random variables with probability q+ and q− of being at 1, respectively. The synaptic weight w_b therefore goes from 0 to 1 (resp. 1 to 0) with probability q+ (resp. q−), depending on the update conditions.
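The behavior described above can be sketched in a few lines of Python. The exact update conditions are an assumption here: the sketch uses the usual three-threshold SDSP form, with potentiation when V_mem ≥ θ_m and θ_1 ≤ Ca < θ_3, and depression when V_mem < θ_m and θ_1 ≤ Ca < θ_2. The function name, argument names and the use of Python's `random` module in place of the hardware random source are hypothetical.

```python
import random


def s_sdsp_update(w_b, v_mem, ca, theta_m, theta_1, theta_2, theta_3,
                  q_plus, q_minus, rng=random):
    """Return the binary weight after one pre-synaptic spike.

    Only the state of the post-synaptic neuron (v_mem, ca) is needed.
    The inequality forms below are assumed, not taken from the text.
    """
    if v_mem >= theta_m and theta_1 <= ca < theta_3:
        zeta_plus = 1 if rng.random() < q_plus else 0
        return w_b + zeta_plus * (1 - w_b)      # 0 -> 1 with probability q_plus
    if v_mem < theta_m and theta_1 <= ca < theta_2:
        zeta_minus = 1 if rng.random() < q_minus else 0
        return w_b - zeta_minus * w_b           # 1 -> 0 with probability q_minus
    return w_b                                  # calcium out of range: no update


# Example with arbitrary threshold values.
thresholds = dict(theta_m=5, theta_1=1, theta_2=4, theta_3=8)
new_w = s_sdsp_update(0, v_mem=10, ca=5, q_plus=0.25, q_minus=0.25, **thresholds)
```

Because the weight is only ever set to 0 or 1, a single bit per synapse suffices, and the learning rate is controlled entirely through q+ and q−.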
The proposed S-SDSP update logic is shown in Fig. 4. The binary random variables ζ± can be generated with q± probabilities using linear feedback shift register (LFSR)-based pseudo-random number generation. In order to generate the q± probabilities with a resolution similar to those used in [26] and to leave sufficient margin for distance-based modulation of q± (Section II-A), a 9-bit resolution is required. As S-SDSP updates must be computed in a single clock cycle, 9 successive iterations of a given LFSR can be parallelized by using the unfolding algorithm from [27] with an unfolding factor of 9, as suggested in [28] to avoid instantiating parallel LFSRs and save switching power. The unfolding process and the resulting unfolded LFSR are illustrated in Fig. 5. Unfolding multiplies the combinational logic resources (here, a single XOR gate) by the unfolding factor, while the LFSR period is divided by the unfolding factor. In order to avoid inducing correlation between synapses, the unfolded LFSR period must be one order of magnitude higher than the number of synapses per neuron. A 17-bit LFSR depth has thus been selected. The overhead incurred by the resulting S-SDSP update logic is negligible as it is shared with time multiplexing for all the L0, L1 and L2 synapses in a MorphIC core.
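A software model of this random source is sketched below. The feedback polynomial is not given in the text: the taps at bits 17 and 14 are an assumption, chosen because they are a standard maximal-length configuration for a 17-bit LFSR that needs a single XOR gate. In hardware the 9 iterations are unrolled into combinational logic and evaluated in one clock cycle; the loop here only models the resulting state transition. All names are hypothetical.

```python
LFSR_BITS = 17
UNFOLD = 9
MASK = (1 << LFSR_BITS) - 1


def lfsr_step(state):
    """One iteration of a 17-bit Fibonacci LFSR (assumed taps: bits 17 and 14)."""
    feedback = ((state >> 16) ^ (state >> 13)) & 1
    return ((state << 1) | feedback) & MASK


def lfsr_unfolded(state):
    """State after UNFOLD iterations, i.e. one clock cycle of the unfolded LFSR."""
    for _ in range(UNFOLD):
        state = lfsr_step(state)
    return state


def draw_zeta(state, q_threshold):
    """Draw a binary variable equal to 1 with probability about q_threshold / 512.

    The 9 freshly generated bits form a number in [0, 511] that is compared
    with a 9-bit threshold.
    """
    state = lfsr_unfolded(state)
    fresh_bits = state & ((1 << UNFOLD) - 1)
    return state, int(fresh_bits < q_threshold)


base_period = (1 << LFSR_BITS) - 1           # 131071 for a maximal-length LFSR
unfolded_period = base_period // UNFOLD      # 14563
synapses_per_neuron = 512 + 512 + 32         # 1056
margin = unfolded_period / synapses_per_neuron
print(f"unfolded period = {unfolded_period}, margin = {margin:.1f}x")
```

Under these assumptions the unfolded period is about 13.8 times the number of synapses per neuron, in line with the order-of-magnitude criterion stated above.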
III. MEASUREMENTS AND BENCHMARKING RESULTS
MorphIC was prototyped in an 8-metal 65-nm low-power (LP) CMOS process. A chip microphotograph is presented together with specifications and die measurement results at 25°C in Figs. 6(a) and 6(b), respectively. The power consumption P of digital SNNs can be modeled by Eq. (3) [20]:

$$P = P_{leak} + P_{idle} \times f_{clk} + E_{SOP} \times r_{SOP} \qquad (3)$$

where P_leak is the leakage power, P_idle is the idle power (i.e. active clock, without network activity), E_SOP is the energy per synaptic operation (SOP), f_clk is the clock frequency and r_SOP is the SOP processing rate. E_SOP is an incremental definition of the energy per SOP as it does not include contributions from leakage and idle power. At 0.8V, when including the leakage and idle power contributions at maximum f_clk and r_SOP, i.e. 55MHz and 110MSOP/s (27.5MSOP/s/core) [20], MorphIC consumes a total energy of 51pJ per SOP. Offline learning performance with quantization-aware training is demonstrated on the MNIST dataset of handwritten digits [29]. Using the four cores of MorphIC and all the available neuron resources with the network topology shown in Fig. 7, an accuracy of 97.8% can be reached.
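The figures quoted above can be cross-checked with a few lines of arithmetic. The helper `snn_power` assumes that Eq. (3) combines the five quantities defined above as P = P_leak + P_idle × f_clk + E_SOP × r_SOP, with P_idle expressed per unit of clock frequency; that form, the function name and the argument names are assumptions made for this sketch.

```python
F_CLK_HZ = 55e6               # maximum clock frequency
R_SOP_MAX = 110e6             # maximum SOP rate for the four cores
CORES = 4
E_TOTAL_PER_SOP_J = 51e-12    # includes leakage and idle contributions


def snn_power(p_leak_w, p_idle_w_per_hz, e_sop_j, f_clk_hz, r_sop):
    """Power model under the assumed form of Eq. (3)."""
    return p_leak_w + p_idle_w_per_hz * f_clk_hz + e_sop_j * r_sop


# Total power implied by 51 pJ/SOP at the maximum SOP rate.
total_power_w = E_TOTAL_PER_SOP_J * R_SOP_MAX            # about 5.61 mW

# Throughput per clock cycle.
sop_per_cycle = R_SOP_MAX / F_CLK_HZ                     # 2 SOPs per cycle, chip-wide
cycles_per_sop_per_core = F_CLK_HZ / (R_SOP_MAX / CORES)  # 2 cycles per SOP in a core

print(f"{total_power_w * 1e3:.2f} mW, {sop_per_cycle:.0f} SOP/cycle")
```

The 51pJ figure is therefore a total energy per SOP at full load, which is necessarily higher than the incremental E_SOP of the model.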
Fig. 7: MNIST classification setup. Input images are split with interleaved sub-sampling into four independent 14×14 images. The sub-image pixels are converted to rate-based Poisson-distributed spike trains and sent to four multi-layer perceptrons (MLPs) resulting from quantization-aware training in Keras, following [2], [3]. Layer-wise inhibitory neurons are used to compensate for rescaling of synaptic weights trained with −1 and +1 values in Keras to values of 0 and 1 in MorphIC. Average-pooling the core activities into a global output sum layer leads to a classification accuracy of 97.8%.

S-SDSP online learning is demonstrated in Fig. 8, where we reproduce the benchmark proposed in [30] for an analog SDSP implementation. Eight patterns are classified by a spiking CNN. Each MorphIC core implements a fixed-weight convolutional layer with a line detection kernel followed by an average pooling layer. The pooling layers are connected with plastic weights to an 8-neuron fully-connected (FC) output layer in core 0. The resulting weights allow classifying with 100% accuracy a test set consisting of 100 different Poisson realizations of each input pattern. Future work should focus on extending S-SDSP for multi-layer training.
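The input encoding used in the MNIST setup of Fig. 7 can be sketched as follows. The sketch assumes 28×28 input images with pixel intensities normalized to [0, 1], and approximates a Poisson spike train by one independent Bernoulli draw per time step; the function names and the `max_spike_prob` parameter are hypothetical.

```python
import random


def interleaved_subsample(image):
    """Split an H x W image (nested lists, even H and W) into four
    (H/2) x (W/2) sub-images by taking every other row and column."""
    return [
        [row[col_offset::2] for row in image[row_offset::2]]
        for row_offset in (0, 1)
        for col_offset in (0, 1)
    ]


def poisson_spike_train(intensity, n_steps, max_spike_prob=0.5, rng=random):
    """Rate-coded spike train: at each step, a spike is emitted with a
    probability proportional to the pixel intensity in [0, 1]."""
    p = intensity * max_spike_prob
    return [1 if rng.random() < p else 0 for _ in range(n_steps)]


# A 28 x 28 dummy image: each of the four sub-images would feed one core.
image = [[(r * 28 + c) / 783 for c in range(28)] for r in range(28)]
sub_images = interleaved_subsample(image)
spikes = poisson_spike_train(sub_images[0][7][7], n_steps=100)
```

With interleaved sub-sampling, each sub-image covers the whole digit at half resolution, so every core sees a complete but coarser view of the input.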
Finally, a comparison of MorphIC with the three previously-proposed binary SNNs is provided in Table I. While Loihi has a programmable learning engine but does not demonstrate online learning with a binary-weight configuration, MorphIC and the chip from Seo et al. are the only ones to demonstrate embedded online learning with binary weights. The high-density claim of S-SDSP online learning on binary weights is demonstrated with an order-of-magnitude density advantage compared to the S-STDP rule from Seo et al., which is further emphasized when considering process normalization to 65nm. Regarding power, MorphIC has a competitive energy per SOP despite using a less advanced CMOS process.
IV. CONCLUSION
In this paper, we presented the MorphIC quad-core spiking neuromorphic processor to leverage the power and density advantages of binary weights with online-learning SNNs. Using the proposed stochastic spike-driven synaptic plasticity (S-SDSP) learning rule, we demonstrated this claim with a density of 738k synapses per mm² in 65nm CMOS. This is an order-of-magnitude improvement over the only previously-proposed binary SNN with demonstrated online learning, from Seo et al. MorphIC also integrates a hierarchical routing fabric for large-scale chip interconnection, where distance information allows modulating the synaptic update probabilities, in accordance with small-world brain network models.

Table I note: [...] synapses by the chip area, excluding pads. As the obtained raw density performances are strongly dependent on the selected technology node, values normalized to a 65nm technology node are provided. Normalization is carried out by using the node factor (e.g., a (65/45)²-fold reduction for normalizing 45nm to 65nm), except for Loihi where we used the 13.5 factor from [31] for 14nm FinFET normalization to a bulk 65nm node.
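The technology-node normalization used for the Table I density comparison can be sketched as a small helper. It assumes that the quoted factors are applied as a division of the raw density, including the 13.5 factor used for Loihi; the function name, its arguments and the example values are hypothetical.

```python
def normalize_density(raw_density, node_nm, target_nm=65.0, factor=None):
    """Normalize a synaptic density to the target technology node.

    By default the node factor (target_nm / node_nm) ** 2 is used; an explicit
    factor can be passed when quadratic scaling does not apply.
    """
    if factor is None:
        factor = (target_nm / node_nm) ** 2
    return raw_density / factor


# A 65nm design is unchanged; a 45nm design is reduced by (65/45)^2, about 2.09.
morphic = normalize_density(738.0, node_nm=65)
example_45nm = normalize_density(100.0, node_nm=45)   # illustrative raw value
print(f"{morphic:.0f}, {example_45nm:.1f}")
```

Normalizing to a common node separates the gain due to the learning rule and memory organization from the gain due to process scaling alone.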
