Advances in non-volatile resistive switching random access memory (RRAM) have made it a promising memory technology with potential applications in low-power and embedded in-memory computing devices owing to a number of advantages such as low-energy consumption, low area cost and good scaling. There have been proposals to employ RRAM in architecting chips for neuromorphic computing and artificial neural networks where matrix-vector multiplication can be computed in the analog domain in a single timestep. However, it is challenging to employ RRAM devices in neuromorphic chips owing to the non-ideal behavior of RRAM. In this article, we propose a cycle-accurate and scalable system-level simulator that can be used to study the effects of using RRAM devices in neuromorphic computing chips. The simulator models a spatial neuromorphic chip architecture containing many neural cores with RRAM crossbars connected via a Network-on-Chip (NoC). We focus on system-level simulation and demonstrate the effectiveness of our simulator in understanding how non-linear RRAM effects such as stuck-at-faults (SAFs), write variability, and random telegraph noise (RTN) can impact an application's behavior. By using our simulator, we show that RTN and write variability can have adverse effects on an application. Nevertheless, we show that these effects can be mitigated through proper design choices and the implementation of a write-verify scheme.
Concurrent with the developments in neuromorphic computing, advances in non-volatile resistive switching random access memory (RRAM) have made it a suitable memory technology for realizing neuromorphic computing architectures [11]. For instance, RRAM-based neuromorphic computing hardware has been proposed in [19, 23, 25]. Apart from advantages such as low operating power, high speed and density, memristive and RRAM-based crossbars have been proposed as energy-efficient dot-product engines. These can be used to perform matrix-vector multiplication operations efficiently in the analog domain through current sums [4, 6, 15]. Such approaches are suitable for low-power embedded devices targeting neuromorphic or neural network applications.
A System-Level Simulator for RRAM-Based Neuromorphic Computing Chips 64:3
• By using our simulator, we discovered that RTN, which is present in RRAM read-outs, is an important factor that could contribute to degradation in application accuracy. To our knowledge, the impact of RTN on applications running on a many-core RRAM neuromorphic chip has not been investigated in previous works.
• We also use the simulator to study the accuracy-rewrite trade-offs when using a write-verify scheme as a mitigation strategy for write variability.
• Our simulator is highly configurable and scalable. Various settings, such as core size, noise model, and network topology, can be configured. Large applications using more than 3000 cores can be simulated and are limited only by CPU and memory resources.
The rest of the article is organized as follows. Section 2 motivates the need for a system-level simulator and provides background information on SNNs, RRAM and the neuromorphic computing architecture. Section 3 details the design of our simulator, with a focus on its configurability and scalability. It also discusses which non-ideal effects of the RRAM are simulated and introduces the NoC model. Section 4 demonstrates and discusses the experimental results on the simulator. Section 5 reviews previous related work and Section 6 presents our conclusions.
BACKGROUND AND MOTIVATION
2.1 Spiking Neural Networks
SNNs are generally considered to be the third-generation neural network model. They differ from traditional second-generation artificial neural networks (ANNs) in their closeness to biological realism [39]. One of the most distinguishing features of SNNs is that the inputs and outputs of the neurons are series of spikes with unit amplitude. Figure 1 shows the structure of an SNN. Figure 1(a) shows an example of a fully connected SNN. Its high-level structure does not differ from that of a fully connected ANN. However, the computation performed is different, as depicted in Figure 1(b), which shows the computational model of an SNN neuron. In an SNN, when a neuron receives spikes from its dendritic inputs, the membrane voltage of the neuron changes according to their combined synaptic weights. When the membrane voltage exceeds a given threshold, the neuron fires a spike to its axon output.

Various neuron models have been proposed to describe the changing of membrane voltage and the firing of spikes in SNNs [32]. In our simulator, we adopt the leaky-integrate-and-fire (LIF) neuron model because it achieves a good balance between accuracy and complexity and is easy to implement in hardware. Nevertheless, the simulator is flexible enough to be extended with other neuron models as required by users. The LIF model is also commonly used in previous works on SNNs [4, 22].

Equation (1) describes how the membrane voltage in an LIF neuron is updated in our simulator. A single neuron begins a timestep with membrane voltage $v_t$. The change in membrane voltage due to input synapse activity is $\Delta v_t$, and a constant leak $v_{leak}$ is subtracted. The voltage after the leak and integrate phase is denoted by $v'_t$. For a given LIF neuron, the change in membrane voltage is defined in Equation (2), where $x_{i,t}$ is the $i$th input channel to the neuron at timestep $t$, and $w_i$ is the weight of the input channel to the neuron. Note that $x_{i,t}$ is a binary variable, that is, $x_{i,t} = 1$ when there is a spike received on the $i$th input channel at timestep $t$ and $x_{i,t} = 0$ otherwise.

$$v'_t = v_t + \Delta v_t - v_{leak} \qquad (1)$$

$$\Delta v_t = \sum_i w_i \, x_{i,t} \qquad (2)$$

Equation (3) is used to determine when an LIF neuron will fire. Essentially, a spike is fired during timestep $t$ if $v'_t \geq v_{th}$, where $v_{th}$ is the given threshold voltage of the neuron. A spike event sets the indicator variable $s_t$ to 1 and the post-firing voltage $v_{t+1}$ to $v'_t - v_{th}$. Otherwise, no spike is emitted ($s_t = 0$) and $v_{t+1} = v'_t$.

$$(s_t,\; v_{t+1}) = \begin{cases} (1,\; v'_t - v_{th}) & \text{if } v'_t \geq v_{th} \\ (0,\; v'_t) & \text{otherwise} \end{cases} \qquad (3)$$
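The LIF update can be illustrated with a short Python sketch. It is only a minimal illustration of the integrate, leak, and fire steps described above, not the simulator's own code; the class name `LIFNeuron` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LIFNeuron:
    """Hypothetical LIF neuron following the leak-integrate-fire steps."""
    v_th: float      # firing threshold
    v_leak: float    # constant leak subtracted every timestep
    v: float = 0.0   # membrane voltage at the start of the timestep

    def step(self, weights, spikes):
        # Integrate: sum the weights of the input channels that spiked.
        dv = sum(w * x for w, x in zip(weights, spikes))
        # Leak: subtract the constant leak to get the post-integration voltage.
        v_int = self.v + dv - self.v_leak
        # Fire: emit a spike and subtract the threshold, or carry the voltage over.
        if v_int >= self.v_th:
            self.v = v_int - self.v_th
            return 1
        self.v = v_int
        return 0


neuron = LIFNeuron(v_th=1.0, v_leak=0.25)
first = neuron.step([0.75, 0.5], [1, 0])   # 0.0 + 0.75 - 0.25 = 0.5, below threshold
second = neuron.step([0.75, 0.5], [1, 1])  # 0.5 + 1.25 - 0.25 = 1.5, fires
```

Note that the sketch subtracts the threshold on firing rather than resetting the voltage to zero, which matches the post-firing rule stated above.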
Non-Volatile Resistive RAM
The RRAM device is a metal-insulator-metal (MIM) configuration that exhibits hysteretic resistive switching behavior. In this section, we describe the operating principle of the metal-oxide RRAM, which is a structure where a metal-oxide layer is sandwiched between two electrodes [11]. This structure exhibits a resistive switching phenomenon between a low-resistance state (LRS) and a high-resistance state (HRS), which is caused by oxygen migration processes. Different resistance values can be written to the device to represent different weight values. As shown in Figure 2, conductive filaments appear during the Set process due to oxygen drift to the metal electrode-oxide interfacial reservoir, giving rise to an LRS. When a Reset voltage is applied, oxygen ions drift back and partially fill up the vacancies, causing a gap in the conductive filament and resulting in an HRS. Non-volatile RRAM technology offers several advantages: high speed and density, energy efficiency, and compatibility with CMOS technology. In this article, we consider the use of RRAM memory technology for storing synaptic weights as well as for performing weight integration in the analog domain using the sum of currents through the RRAM devices.

RRAM-Based Neuromorphic Computing Architecture
Figure 3 shows the classic architecture of a neurosynaptic core employing an RRAM crossbar array, which is also used in previous work [4, 6, 15, 19]. The RRAM cells are aligned in a crossbar array, where the word lines (horizontal lines) represent axons and the bit lines (vertical lines) are the dendrites. Each RRAM cell represents a synapse and its conductance value represents the weight of the synapse. When a spike arrives at an axon, the voltage of the corresponding word line is set to high. The current flowing through the bit lines is determined by the conductance of the RRAM cells. The current of each bit line, which represents an aggregation of the input currents to the destination neuron, is measured and converted to a digital signal by an analog-digital (A/D) converter. The digital value is then forwarded to the CMOS neuron, which realizes the LIF neuron model discussed in Section 2.1.
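As a rough illustration of the crossbar read-out, the sketch below sums the currents on each bit line for a given set of spiking word lines and scales the result by a linear A/D gain. It assumes an idealized crossbar (no wire resistance, no quantization, no clipping); the function name `crossbar_read` and its parameters are hypothetical.

```python
def crossbar_read(conductances, spikes, v_read, adc_gain):
    """Return one digital value per bit line (column).

    conductances[i][j] is the conductance (in siemens) of the cell at
    word line i (axon) and bit line j (dendrite). spikes[i] is 1 when
    axon i received a spike in this timestep, otherwise 0.
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    outputs = []
    for j in range(n_cols):
        # Each active word line contributes a current of v_read * conductance.
        current = sum(v_read * spikes[i] * conductances[i][j] for i in range(n_rows))
        # Idealized linear A/D conversion (assumption: no quantization).
        outputs.append(adc_gain * current)
    return outputs


g = [[1e-4, 2e-4],
     [3e-4, 4e-4]]
values = crossbar_read(g, spikes=[1, 0], v_read=0.5, adc_gain=1e4)
```

All bit lines are evaluated from the same set of driven word lines, which is why a matrix-vector product takes a single timestep in the analog domain.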
Since the conductance of an RRAM cell is always positive, it is impossible to store both positive and negative weights on one bit line. Therefore, a common solution is to assign two distinct bit lines, called a differential pair, to each neuron [16]. In a differential pair, one line is for the storage of positive weights and the other is for negative weights. As shown in Figure 3, $W^{+}_{i,j}$ represents the positive weight and $W^{-}_{i,j}$ represents the negative weight. In this work, we also adopt the neurosynaptic core architecture shown in Figure 3. Many such cores are connected with an NoC.

Motivation
Figure 4 demonstrates the need for a system-level simulator for a neuromorphic chip based on the RRAM crossbar array described earlier. It shows the influence of core size, noise model, and network topology on the performance of an application. In this example, a 3-layer fully connected SNN with 512, 256, and 10 neurons in each respective layer is trained on the MNIST benchmark. The trained SNN is executed on our simulator using three different chip configurations, as shown in Figure 4. In Figure 4(a), noiseless cores of size 1024×1024 were used. The large size allowed the SNN to fit and be mapped into three separate cores, with each hidden layer occupying one core. However, the drawback of the large core size is that hardware resources in the neurosynaptic cores are poorly utilized. Core resource utilization, defined as the fraction of RRAM crossbar cells being used, is 19% in this configuration.
The configuration in Figure 4(b) uses a smaller core size of 256×256. Under this smaller fan-in constraint, the network cannot be directly mapped into the cores. Instead, the original neural network is split in order to fit into the smaller neurosynaptic cores, with additional cores needed to merge the results from the split cores (the network structure is shown in Figure 8). As a result, the accuracy suffers a drop of 3.8%. However, core utilization is improved to ∼40%. Compared with Figure 4(a), the ring topology used in Figure 4(b) improved the latency of the network slightly due to better balancing of the network traffic.
Finally, Figure 4 (c) shows the effect of adding RTN to the SNN shown in Figure 4 (b). In addition, a 2D-mesh network topology was used. Accounting for noise in the RRAM devices resulted in a further accuracy drop of 2%. However, the mesh network used in this configuration reduced the latency by half and achieved the lowest latency of the three configurations.
These examples demonstrate that a system-level simulator is a useful tool for evaluating different design choices and device characteristics for an application.
SIMULATOR DESIGN
To assist in both the hardware and software designs of RRAM-based neuromorphic computing chips, it is useful to have a cycle-accurate system-level simulator that can simulate the behavior of the RRAM-based neurosynaptic cores, including the non-ideal effects of the RRAM memory crossbars and the NoC interconnect. Without a system-level simulator, it would be challenging to assess the impact of hardware architectural designs on end-user applications.

Overview
Figure 5 shows a high-level architectural view of our simulator framework. The cycle-accurate simulator models a neuromorphic chip comprising many neurosynaptic cores connected with an interconnect as shown in Figure 5(a). The neuromorphic Chip class is the highest-level container and holds N Cores. These cores are connected by a Network interface that can be configured with various topologies. An input spike Source can be interfaced to the Chip to drive the computations, and a Sink can be connected to the output neurons to collect simulation statistics. Figure 5(b) shows the detailed structure of each core. Within each Core is a Crossbar memory interface to store the synaptic weights of a population of neurons. The Neuron interface is responsible for integrating the incoming signals with the respective synaptic weights to produce an outgoing spike if the integrated signal exceeds the firing threshold of the neuron. In the simulator, we also include an optional module that models the non-ideal effects of the RRAM crossbar.
The Network interface implements the NoC that connects all the cores. Figure 5(c) shows the internal structure of each router inside the NoC. Within the router, there are 5 input/output channels (C0 to C4). The first channel, C0, connects to the local core and the others connect to neighboring routers. Incoming packets are received and sent to the decoder. The decoder analyzes the source and destination of each packet based on the header data; the arbiter controls the switch accordingly. Outgoing packets traverse the switch and are stored in the FIFO buffers of the output channels, waiting to be sent to the next hop.
The simulator is time driven; each class or interface shares the same global clock. In addition, power consumption can also be simulated. For each component of the simulator, a trace-based power model is implemented to estimate the power of the system. The RRAM crossbar power model and the A/D converter power model are based on [37] and [38] , respectively. The NoC power model is based on Orion 2 [35] .
Non-Ideal Effects of RRAM
In this section, we consider several of the non-ideal effects of RRAM devices in the Crossbar array. A study of the non-ideal effects is important because the membrane voltage of each neuron is determined by the analog current sum of the RRAM devices connected to that neuron (i.e., in a given column). Any effect that changes the resistance value of an RRAM device may therefore have an impact on the overall SNN output.
The non-ideal effects of RRAM can be classified into static defects and variability as well as dynamic variability. Static defects and variability are mainly caused by process variations when the RRAM devices are fabricated, and do not change during runtime. SAF defects are one example. Unlike static defects and variability, dynamic variability changes from cycle to cycle during runtime and is caused by the driving circuits or is the result of electronic noise. In our simulator, we consider the following defects and variability.
Stuck-at-Faults.
This is a static defect that can occur in an RRAM device. Due to limitations in the fabrication process and poor control of the forming process, the resistance value of an RRAM device could be stuck to an HRS or LRS, regardless of the set or reset operations. SAF defects not only lead to degradation in the yield of the fabricated chips but also cause severe failures if they are not detected at the test stage, which we will show later. Two SAF effects are possible: either stuck-at-open, where the RRAM device is stuck at the HRS, resulting in abnormally low current flow; or stuck-at-short, where it is stuck at an LRS, resulting in abnormally high current flow. According to studies on RRAM devices and in our RRAM model, SAFs are assumed to be uniformly distributed among the devices in the crossbar array.
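A minimal Python sketch of how such faults could be injected is shown below. It assumes independent, uniformly distributed faults as stated above; the function names and the parameters `p_saf` (probability that a device is faulty) and `p_open_given_saf` (probability that a faulty device is stuck-at-open) are hypothetical and are not taken from the simulator.

```python
import random


def sample_fault_map(rows, cols, p_saf, p_open_given_saf, rng):
    """Mark every device as 'ok', 'open' (stuck at HRS) or 'short' (stuck at LRS)."""
    fault_map = []
    for _ in range(rows):
        row = []
        for _ in range(cols):
            if rng.random() < p_saf:
                # The device is faulty: decide which kind of fault it has.
                row.append("open" if rng.random() < p_open_given_saf else "short")
            else:
                row.append("ok")
        fault_map.append(row)
    return fault_map


def apply_faults(resistances, fault_map, r_hrs, r_lrs):
    """Override the programmed resistance of every stuck device."""
    stuck = {"open": r_hrs, "short": r_lrs}
    return [
        [stuck.get(state, r) for r, state in zip(r_row, f_row)]
        for r_row, f_row in zip(resistances, fault_map)
    ]


rng = random.Random(0)
faults = sample_fault_map(4, 4, p_saf=0.1, p_open_given_saf=0.5, rng=rng)
programmed = [[5e4] * 4 for _ in range(4)]
effective = apply_faults(programmed, faults, r_hrs=1e6, r_lrs=1e4)
```

Because the fault map is static, it would be sampled once per simulated chip and reapplied after every write, regardless of the value that was programmed.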
Random Telegraph Noise.
Due to the random trapping and release of charge carriers along the conductive filaments owing to the presence of lattice defects, the current flow through an RRAM device randomly fluctuates between two stable states. This manifests as a source of noise known as random telegraph noise (RTN). This dynamic variation is stochastic in nature and can be modeled as a bistable fluctuating behavior between two resistance levels. In our simulator, the Monte Carlo approach described in [21] is used to model the random switching between the two resistance levels. The amplitude of fluctuation depends on the RRAM material type as well as the resistance range and is approximated with curves fitted to data from [21], as shown in Figure 6.
Compared with static defects and variability, dynamic variability is much more computationally intensive to model because it varies from cycle to cycle and must be evaluated in every computation. Among the dynamic variations, RTN modeling takes up the most computational time because, for each crossbar computation, every read operation needs to be simulated.
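The bistable behavior can be sketched in Python as a two-state Monte Carlo process. The sketch is an assumption-laden illustration rather than the model from [21]: it assumes exponentially distributed dwell times, so that the probability of leaving the current level during one read interval is 1 − exp(−t_sample/τ), where τ is the mean dwell time of that level. The class `RTNDevice` and its parameter names are hypothetical.

```python
import math
import random


class RTNDevice:
    """Hypothetical two-level RTN model: resistance is r or r + delta_r."""

    def __init__(self, r, delta_r, t_low, t_high, rng):
        self.r = r              # lower resistance level
        self.delta_r = delta_r  # RTN amplitude
        self.t_low = t_low      # mean dwell time in the lower level
        self.t_high = t_high    # mean dwell time in the higher level
        self.rng = rng
        # Start in either level with equal probability.
        self.high = rng.random() < 0.5

    def read(self, t_sample):
        # Assumed switching probability for exponential dwell times.
        tau = self.t_high if self.high else self.t_low
        p_switch = 1.0 - math.exp(-t_sample / tau)
        if self.rng.random() < p_switch:
            self.high = not self.high
        return self.r + self.delta_r if self.high else self.r


device = RTNDevice(r=1e4, delta_r=500.0, t_low=1e-3, t_high=1e-3,
                   rng=random.Random(0))
trace = [device.read(t_sample=1e-4) for _ in range(1000)]
```

One random draw is needed per device per read, which illustrates why RTN dominates the simulation time for large crossbars.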
Write Variability.
Set and reset variability of the RRAM is another important dynamic variation. During the set and reset process, the migration of ionized defects changes the shape and size of the conductive filament in an RRAM device. Because of the discrete nature of the ionized defects, the resistance values of set and reset states or, equivalently, LRS and HRS, show stochastic variability. For an RRAM-based crossbar array, the set and reset variability is generally exhibited as write variability, which affects the resistance values written to the crossbar array [20] . In order to mitigate write variability, a write-verify scheme may be used to reduce the variability at the expense of increased write latency. In our simulator, write variability is modeled using lognormal statistics in the quasi-classical limit for both on and off states owing to the exponential dependencies of the current distribution on a large number of random fluctuations in the gap domains of the non-crystalline oxide structure [36] . Data from [20] is used in our modeling of resistance variations. It is important to note that write variability increases with higher target resistance values.
RRAM Models
In this section, we will describe the details of the non-linear RRAM models currently built into the simulator. Note that the framework is flexible and allows users to extend or replace the models with their own.
Stuck-at-open and stuck-at-short faults refer to faults where an RRAM device in the crossbar is stuck in HRS or LRS, respectively. A probability distribution decides whether each RRAM device in the crossbar is randomly marked as being stuck-at-open, stuck-at-short, or operating normally. For cases where empirical data is not available, stuck-at-faults are usually assumed to be independent and uniformly distributed. Otherwise, a user-defined probability distribution can be used. The procedure for simulating stuck-at-faults is shown by the following equations. Equations (5) and (6) describe the probability of an RRAM device stuck at HRS or LRS conditioned on the stuck-at-fault occurring in the device (Equation (4)). The probability of a device operating normally is the complement of the stuck-at-fault probability.
RTN in a device's reading process is simulated using a Monte Carlo approach. The RTN fluctuates between two stable states: R (lower resistance level) and R + ΔR (higher resistance level). We used a ΔR that is determined from curves fitted to the empirical data shown in Figure 6 . The listing in Algorithm 1 shows the Monte Carlo procedure used to simulate RTN behavior. Initially, each device is set to one of the resistance levels randomly with equal probability. Subsequently, during each timestep when a device is being read, a random number is drawn from the uniform distribution to determine if the device remains in the current resistance level or switches to the other resistance level. The switching behavior is described in Equations (7) and (8),
where $r_t$ is the resistance in the current timestep, $r_{t-1}$ is the resistance in the previous timestep, $t_{sample}$ is the duration of current flow through the RRAM device between consecutive reads, and $t_{off}$ and $t_{on}$ are the average times that the device stays in the higher and lower resistance levels, respectively.

To simulate write variability, the simulator implements stochastic write behavior based on empirical data [20] that provides the standard deviation ($\sigma_R$) of the targeted resistance value ($\mu_R$). Applying a log-normal distribution on write variability [36], the model currently implemented is described by Equations (9) 
where $R$ is the actual resistance value written to the device and $Z$ is a random number drawn from the standard normal distribution. In short, when an RRAM device is written with the target value, a random number is generated from a normal distribution with mean $m$ and standard deviation $s$. The actual resistance value written is the exponential of the generated random number.
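The write procedure can be sketched in Python as follows. The sketch assumes that the mean and standard deviation of the underlying normal distribution are obtained by the standard log-normal moment matching, so that the written resistance has mean equal to the target and the empirical standard deviation; whether the simulator uses exactly this parameterization is an assumption, and the function names are hypothetical.

```python
import math
import random


def lognormal_params(mu_r, sigma_r):
    """Return (m, s) of the underlying normal distribution.

    Chosen so that exp(N(m, s)) has mean mu_r and standard deviation
    sigma_r (standard moment matching, assumed here).
    """
    s_squared = math.log(1.0 + (sigma_r / mu_r) ** 2)
    m = math.log(mu_r) - 0.5 * s_squared
    return m, math.sqrt(s_squared)


def write_resistance(mu_r, sigma_r, rng):
    """Return the resistance actually written for a target of mu_r."""
    m, s = lognormal_params(mu_r, sigma_r)
    z = rng.gauss(0.0, 1.0)     # standard normal sample
    return math.exp(m + s * z)  # log-normally distributed resistance


rng = random.Random(42)
written = [write_resistance(1e4, 1e3, rng) for _ in range(5)]
```

Because the result is an exponential, the written resistance is always positive, and a larger ratio of standard deviation to target produces a heavier upper tail.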
Configurability and Scalability Design
To allow for flexibility in evaluating the influence of different factors on application performance on a neuromorphic chip, our simulator is built with configurability and scalability in mind. We implemented the simulator in a modularized manner; it is easy to plug in or unplug the optional features from the simulator, for example, the non-ideal RRAM effects module.
Interfaces. Key interfaces defined for the simulator are:
Chip: The Chip interface is a high-level abstraction of a hardware chip. The neuromorphic Chip instance comprises a collection of symmetric cores that are connected by an interconnect. It also provides communication interfaces for sending data into the chip and receiving data out from the chip.
Core: This interface represents a functional computational block for a set of neurons. The number of input axons into the Core is defined as the fan-in degree; the number of neuronal output axons is defined as the fan-out degree. Each Core is driven by the global clock that is fed to the Neuron interface.
Crossbar: This interface is responsible for storing the synaptic weights of each neuron in a memory crossbar array. To support an RRAM-based crossbar array, an RRAM Crossbar instance is constructed using this interface, and RRAM effects such as write variability are modeled.
Neuron: The Neuron interface encapsulates the neuron model and the type of computation that is performed at each clock cycle. For an LIF Neuron instance, the computations in Equations (1) and (3) are performed. Output spikes that are generated are forwarded to the Network interface to be sent to their destination axons.
Network: The Network interface is responsible for providing the on-chip interconnect that supports communication between different Cores. Different topologies can be configured, including parameters such as link bandwidth and buffer size. Section 3.4.2 provides more detail.
Source: The spike Source interface provides the spike data input into the neuromorphic Chip. For instance, it can be used to convert an input image into spike trains to be delivered to the input neurons.
Sink: The Sink interface is used to monitor and interpret spikes from output neurons. For example, a classifier is a sink that translates output spikes into their respective class labels.

Figure 7 shows the different configurations that are supported by these interfaces. In addition, many of the interfaces of the simulator are parameterized, as shown in Table 1, so simulator settings can be changed simply by changing the parameters.
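To illustrate how interface-based modularity allows a non-ideal model to be plugged in or removed, the following Python sketch defines a hypothetical crossbar interface with an ideal implementation and a toy non-ideal one, and a core that works with either. The class and method names are invented for illustration and do not reflect the simulator's actual API; the core omits the leak term for brevity.

```python
from abc import ABC, abstractmethod


class Crossbar(ABC):
    """Stores synaptic weights; subclasses decide how writes behave."""

    @abstractmethod
    def write(self, row, col, weight):
        ...

    @abstractmethod
    def column_sum(self, col, spikes):
        ...


class IdealCrossbar(Crossbar):
    def __init__(self, rows, cols):
        self.cells = [[0.0] * cols for _ in range(rows)]

    def write(self, row, col, weight):
        self.cells[row][col] = weight

    def column_sum(self, col, spikes):
        return sum(row[col] * s for row, s in zip(self.cells, spikes))


class OffsetCrossbar(IdealCrossbar):
    """Toy non-ideal variant: every write lands a fixed fraction off target."""

    def __init__(self, rows, cols, error):
        super().__init__(rows, cols)
        self.error = error

    def write(self, row, col, weight):
        super().write(row, col, weight * (1.0 + self.error))


class Core:
    """Integrate-and-fire core that accepts any Crossbar implementation."""

    def __init__(self, crossbar, n_neurons, threshold):
        self.crossbar = crossbar
        self.threshold = threshold
        self.voltages = [0.0] * n_neurons

    def tick(self, spikes):
        out = []
        for j in range(len(self.voltages)):
            v = self.voltages[j] + self.crossbar.column_sum(j, spikes)
            fired = v >= self.threshold
            self.voltages[j] = v - self.threshold if fired else v
            out.append(1 if fired else 0)
        return out


xbar = IdealCrossbar(rows=2, cols=1)
xbar.write(0, 0, 0.75)
xbar.write(1, 0, 0.5)
core = Core(xbar, n_neurons=1, threshold=1.0)
outputs = [core.tick([1, 0]), core.tick([1, 1])]
```

Swapping `IdealCrossbar` for `OffsetCrossbar` changes the stored weights without touching `Core`, which is the kind of substitution the interfaces are meant to support.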
Network-On-Chip.
Apart from the RRAM neurosynaptic cores, our simulator explicitly integrates an NoC model through the Network interface. As increasing core counts are needed to support larger SNNs, the NoC becomes critical in determining the performance of the system. Factors that may affect the delivery of spikes between neurons include (a) bandwidth of the physical links, (b) network topology and routing algorithm, (c) buffer resources, and (d) router clock frequency.

The building block of our NoC model is the router. Each neurosynaptic core is connected to a router. The router internals include multiple buffers to hold spikes, route compute logic, interfaces to receive spikes from the connected core or other routers, and a switch for spikes to traverse to destination buffers. We model a classic pipelined router as shown in Figure 5(c). The router has 5 read/write channels: one in each of the North, South, East, and West directions, and a local channel connected to the core. Each channel has associated FIFO buffers. When a core sends out a spike from one of its neurons through a router, the destination information is obtained from the routing table and a packet is constructed. The arbiter and switch components are responsible for selecting packets and forwarding them to the appropriate channels to be sent out through the connected links. At the destination router, the packet is stripped of its routing information and the spike is sent to the corresponding axon in the core. In our model, the buffer sizes can be configured by users. Other configurable parameters, including frequency, topology, and routing algorithm, are shown in Table 1.

Frequency: In our simulator, the core and NoC can run at different relative clock frequencies. The clock frequency is configured as a router-to-core clock ratio. One of the important metrics that determines the performance of the system is the minimum clock ratio required to deliver all the spikes in the system. The lower this minimum clock ratio, the higher the performance of the NoC.
Topology: Our NoC model supports the following topologies: 2D mesh, 2D torus, ring, and double ring. Any number of cores can be simulated for each topology.
Buffer Size: The buffers within the router are all configurable. The size of the buffers may have an impact on the performance of the NoC.
Routing Algorithm: We implement a routing table-based approach, as in [42] . Currently, we support only dimension-based routing for mesh and torus topology. For ring and double ring, we use shortest path-based routing.
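The two routing rules can be sketched in Python as next-hop functions, which is also how the entries of a routing table could be precomputed. The sketch assumes X-first dimension-order routing on a mesh without wrap-around links and a bidirectional ring; both the function names and these details are assumptions for illustration.

```python
def xy_next_hop(cur, dst):
    """Dimension-order routing on a 2D mesh: resolve X first, then Y."""
    cx, cy = cur
    dx, dy = dst
    if cx != dx:
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:
        return (cx, cy + (1 if dy > cy else -1))
    return cur  # already at the destination


def xy_route(src, dst):
    """List every router visited from src to dst, inclusive."""
    path = [src]
    while path[-1] != dst:
        path.append(xy_next_hop(path[-1], dst))
    return path


def ring_next_hop(cur, dst, n):
    """Shortest-path routing on a bidirectional ring of n routers."""
    if cur == dst:
        return cur
    forward = (dst - cur) % n            # hops when travelling clockwise
    step = 1 if forward <= n - forward else -1
    return (cur + step) % n


mesh_path = xy_route((0, 0), (2, 1))
ring_hop = ring_next_hop(0, 6, 8)
```

Here `mesh_path` visits (0, 0), (1, 0), (2, 0), and (2, 1), and `ring_hop` is router 7 because going counter-clockwise from 0 to 6 takes two hops instead of six.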
Flow Control: We implement the credit-based flow control in our simulator.
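Credit-based flow control can be illustrated with the following Python sketch of a single link. The sender holds one credit per free slot in the downstream FIFO, spends a credit for each packet sent, and regains it when the receiver removes a packet. The sketch assumes that credits return instantly, which a cycle-accurate model would not; the class name `CreditLink` is hypothetical.

```python
from collections import deque


class CreditLink:
    """One sender-to-receiver link with a downstream FIFO of fixed depth."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth  # free slots known to the sender
        self.fifo = deque()          # downstream input buffer

    def try_send(self, packet):
        """Send only if a credit is available; otherwise the sender stalls."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.fifo.append(packet)
        return True

    def drain_one(self):
        """Receiver forwards one packet and returns a credit to the sender."""
        if not self.fifo:
            return None
        packet = self.fifo.popleft()
        self.credits += 1
        return packet


link = CreditLink(buffer_depth=2)
sent = [link.try_send(p) for p in ("a", "b", "c")]  # third send is refused
forwarded = link.drain_one()                        # frees one slot
retry = link.try_send("c")
```

Because a packet is sent only when a slot is known to be free, the downstream buffer can never overflow, so no packet is dropped.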
CASE STUDIES
Our simulator can be used to quantitatively measure how various parameters affect how well an SNN application executes on a given hardware configuration. In this section, we present case studies of using the simulator to explore how various settings, such as RRAM material type, noise model, and NoC topology, affect application behavior. The experiments demonstrate the utility of the simulator in exploring issues that have not been well studied in previous works, in particular:
• the extent to which RRAM material and non-ideal effects of RRAM can affect the classification accuracy of an SNN,
• the traffic pattern and impact of the NoC in an RRAM-based neuromorphic chip, and
• the scalability and performance of executing large-scale, real-world, neural-network-based applications on the simulator.
Experimental Setup
We first describe the experimental setup, the SNN applications that are used, how the weights of the SNN are mapped to the neurosynaptic cores, and the NoC configuration, followed by a presentation of the simulation results. The experiments were performed in single-threaded mode on an 18-core 2.3GHz Intel Xeon machine with 128GB of memory.
SNN-Based Applications.
Two SNN applications are used in the following experiments. The first application is a fully connected SNN (MNIST-MLP) for digit recognition trained and tested using the MNIST dataset. This is used to evaluate the influence of non-ideal RRAM effects and the impact of the NoC. The second application (CIFAR10-CNN) is a large-scale SNN converted from a deep convolutional neural network (CNN) for image recognition, which is trained and tested using the CIFAR-10 dataset. This is used to demonstrate the scalability of our simulator for large-scale SNN applications. Note that these SNN applications were not tuned to achieve state-of-the-art accuracies. Their purpose is to illustrate the usefulness of the simulator in evaluating design choices as well as its configurability and scalability.
MNIST-MLP:
This SNN application is converted from a three-layer fully connected neural network that is pretrained on the MNIST dataset. The three layers contain 512, 256, and 10 neurons, respectively. In the experiments, because the size of each neurosynaptic core is set to 256 × 256, this network is modified to meet the fan-in constraints. This is achieved in the following manner, similar to [14]. Assuming that the fan-in degree of a core is $F_{in}$, each neuron with an in-degree $d_{in} > F_{in}$ in the original graph is replaced by a neuron that receives input from 
CIFAR10-CNN:
This SNN application is implemented based on the CNN network architecture in [22]. It contains a total of 15 layers, including both convolutional layers and pooling layers. The original CNN is trained on the CIFAR-10 dataset and converted to an SNN using model-based normalization [18]. As the network architecture already takes into account the fan-in constraints, it can be directly mapped onto the neuromorphic chip that we simulate. The input to the SNN is also converted into spike trains via a Poisson process (see Section 4.1.2). For the classification output, a majority voting scheme with votes accumulated across timesteps is used. This SNN application uses 3680 neurosynaptic cores.
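One plausible reading of the voting scheme is sketched below in Python: every output spike counts as one vote for its class, votes are accumulated over all timesteps, and the class with the most votes wins. This interpretation, the tie-breaking rule (lowest class index), and the function name `classify` are assumptions for illustration.

```python
def classify(spike_trains):
    """Return (winning class, votes per class).

    spike_trains[t][k] is 1 if output neuron k fired at timestep t.
    """
    votes = [0] * len(spike_trains[0])
    for step in spike_trains:
        for k, spike in enumerate(step):
            votes[k] += spike
    # max() returns the first maximum, so ties go to the lowest index.
    winner = max(range(len(votes)), key=votes.__getitem__)
    return winner, votes


label, counts = classify([
    [0, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
])
```

In this usage example the neuron for class 1 fires in all three timesteps, so `label` is 1 and `counts` is `[1, 3, 1]`.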
Rate-Based Conversion.
Each input image is converted into a rate-based representation using Poisson statistics. For every pixel of the image, a spike train is generated using the gray level of that pixel as the mean value of a Poisson distribution. In other words, the spike train for each pixel is generated as follows:
where $\mathrm{pixel}_i$ is the $i$th pixel with gray value between 0 and 255, and $r$ is a random number drawn from a uniform distribution.
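A minimal Python sketch of such a rate-based encoder follows. It assumes that, in every timestep, a spike is emitted when a uniform random number in [0, 1) falls below the gray value normalized by 255; this per-timestep comparison is an assumed form of the conversion, and the function names are hypothetical.

```python
import random


def encode_pixel(pixel, n_steps, rng):
    """Return a spike train of length n_steps for one gray value (0 to 255)."""
    rate = pixel / 255.0  # assumed normalization of the gray value
    return [1 if rng.random() < rate else 0 for _ in range(n_steps)]


def encode_image(pixels, n_steps, rng):
    """Return one spike train per pixel."""
    return [encode_pixel(p, n_steps, rng) for p in pixels]


rng = random.Random(7)
trains = encode_image([0, 128, 255], n_steps=100, rng=rng)
```

Under this assumption a black pixel never spikes, a white pixel spikes in every timestep, and intermediate gray values spike with a proportional average rate.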
Weight Mapping.
The weights of the SNN in each application are mapped into resistances (or, equivalently, conductances) in the following manner. Let the gain of the A/D converter (ADC) be $\alpha$ and let the voltage applied to the row when there is a spike be $\beta$ (the units of $\alpha$ and $\beta$ are A$^{-1}$ and V, respectively). Assuming that the ADC is operating in its linear regime so that no clipping occurs and ignoring quantization effects, the result of the weighted accumulation can be simplified as

$$\Delta y = \alpha\beta \sum_{i \in F_{in}} \left(g_i^{+} - g_i^{-}\right) x_i$$

where $g_i^{+}$ and $g_i^{-}$ are a pair of conductances, and $x_i$ is an indicator function denoting the occurrence of a spike. In order to compute $\Delta y = \sum_{i \in F_{in}} w_i x_i$, we have the mapping $w_i = \alpha\beta\,(g_i^{+} - g_i^{-})$. When $w_i$, $\alpha$, and $\beta$ are known, there are many $(g_i^{+}, g_i^{-})$ pairs that can represent $w_i$, subject to the range of resistances that can be written to the RRAM. Being conductances, $g^{+}$ 
NoC Configuration.
We configured the NoC as a 5 × 5 2D-mesh to simulate a 25-core neuromorphic chip for the MNIST-MLP SNN application. Static XY routing with priority-based arbitration was used for routing. The FIFO buffer depth is set to 2, and the NoC clock frequency is set to 1MHz, which is 1000× the global clock frequency of 1kHz.

Evaluation of Non-Ideal RRAM Effects
RTN.
Figure 9 shows the influence of RTN on the accuracy of MNIST-MLP. The intensity of RTN is affected by the resistance range and the RRAM material (see Section 3.2.2). In the experiment, we keep the ratio of $R_{high}$ to $R_{low}$ at 10 and evaluate MNIST-MLP across different resistance ranges. We also evaluate different RRAM materials in the simulation. Average accuracies are shown by the main bars and the standard deviation is denoted by the error bars. In general, we can observe that RTN degrades application performance as the resistance range increases. When the resistance range is above $10^{5}$ Ω, the accuracy of NiO- and HfO$_2$-based RRAM is unacceptably poor. As shown in the figure, among the different RRAM materials, the Cu-based RRAM (specifically Cu-doped Ge$_{0.3}$Se$_{0.7}$ [21]) shows the highest robustness under RTN. When the resistance range is above $10^{6}$ Ω, the Cu-based RRAM still achieves over 85% accuracy whereas the accuracy of the other materials is below 30%. There are two implications to this result. First, for hardware designers of RRAM-based neuromorphic chips, this experiment shows that in order to minimize the influence of RTN, either the operating resistance range should be decreased or other materials such as Cu-based RRAM should be used. Second, for algorithm developers, it is important to design SNN algorithms that are robust under RTN in order to use such hardware.
Write Variability.
In this experiment, we explore the effects of write variability. Figure 10(a) shows the accuracy of MNIST-MLP under write variability for different resistance ranges. Like RTN, adverse effects of write variability become more prominent when the resistance range increases. This result indicates that write verification is critical for an RRAM-based neuromorphic chip. For higher resistance ranges, the accuracy is consistently below 10%. Even at the lowest resistance range, the highest accuracy achieved is only 61.4%.
Spurred by this result, we implemented and tested a write-verification scheme to explore the cost of mitigating write variability. In the write-verification scheme, we read back the RRAM cell value immediately after writing into that cell. If the difference between the read value and the designated write value exceeds a predefined tolerance range, we rewrite the value to the cell until the difference is within the tolerance range. Figure 10(b) shows the change in accuracy under different resistance ranges with the write-verification scheme applied. The dots in each line represent the accuracy and the average number of writes to an RRAM cell when write verification is enabled with a decreasing tolerance range. As the tolerance range decreases, the number of writes needed to mitigate the effects of write variability increases. For the resistance range under $10^4$ Ω, the accuracy improves significantly to about 90% at a cost of 0.47 rewrites on average. In the resistance range from $10^4$ Ω to $10^5$ Ω, the average number of rewrites required to achieve over 90% accuracy is 5.31. This number increases to over 24 for the $10^5$ Ω to $10^6$ Ω resistance range. The above experiments show the influence of write variability along with the trade-off between accuracy and the cost of the write-verify-write scheme. These results suggest that keeping to a lower resistance range can help minimize the effects of write variability and also reduce write latency when write verification is enabled.
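The write-verify loop described above can be sketched as follows. This is an illustrative model only: the Gaussian relative-error model for write variability, the parameter values, and the function name are assumptions and are not taken from the simulator.

```python
import random


def write_with_verify(target, tolerance, sigma, rng, max_rewrites=1000):
    """Write `target`, read it back, and rewrite until it is within `tolerance`.

    Assumed variability model: each write lands at target * (1 + N(0, sigma)).
    `tolerance` is a relative bound. Returns (stored_value, number_of_rewrites).
    """
    rewrites = 0
    value = target * (1.0 + rng.gauss(0.0, sigma))       # initial write
    while abs(value - target) > tolerance * target and rewrites < max_rewrites:
        value = target * (1.0 + rng.gauss(0.0, sigma))   # rewrite after a failed verify
        rewrites += 1
    return value, rewrites


rng = random.Random(0)
results = [
    write_with_verify(1e-5, tolerance=0.05, sigma=0.1, rng=rng)
    for _ in range(10_000)
]
avg_rewrites = sum(r for _, r in results) / len(results)
print(f"average rewrites per cell: {avg_rewrites:.2f}")
```

Under this assumed model, tightening `tolerance` relative to `sigma` lowers the probability that any single write passes verification, so the average rewrite count grows, which mirrors the accuracy versus rewrite-cost trade-off reported for Figure 10(b).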
Although write-verify-write is a simple scheme, it has been widely adopted in state-of-the-art non-volatile memory implementations [41]. While other schemes have been proposed to reduce the number of rewrites to reduce power consumption and improve performance, they are circuit-level optimizations [29]. Since our simulator focuses on system-level simulation instead of circuit-level simulation, the effects of such circuit-level optimization techniques simply show up as different numbers of rewrites per write operation. Once the high-level characteristics of the circuits are empirically obtained, the simulator can be extended by users to simulate their proposed write-verify schemes more accurately.
Fig. 11. Impact of stuck-at-faults on MNIST-MLP accuracy. Stuck-at-short defects have a greater impact than stuck-at-open defects.
Stuck-at-Faults (SAFs).
There are two forms of SAF, stuck-at-open when an RRAM cell has abnormally high resistance, and stuck-at-short, when the cell is stuck at a low-resistance state.
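To make the two fault types concrete, the sketch below injects uniformly distributed SAFs into a conductance matrix. The helper name, the 256 × 256 crossbar size, the fault rates, and the stuck conductance values are assumptions chosen for illustration rather than the simulator's actual fault model.

```python
import random


def inject_stuck_at_faults(conductances, open_rate, short_rate, g_open, g_short, seed=0):
    """Return a copy of a 2D conductance list with uniformly placed SAFs.

    Hypothetical helper: each cell independently becomes stuck-at-open with
    probability `open_rate` or stuck-at-short with probability `short_rate`.
    """
    rng = random.Random(seed)
    faulty = [row[:] for row in conductances]
    for row in faulty:
        for j in range(len(row)):
            u = rng.random()
            if u < open_rate:
                row[j] = g_open       # abnormally high resistance, very low conductance
            elif u < open_rate + short_rate:
                row[j] = g_short      # stuck in a low-resistance state, high conductance
    return faulty


# Assumed 256 x 256 crossbar with a uniform healthy conductance.
crossbar = [[5e-5] * 256 for _ in range(256)]
faulty = inject_stuck_at_faults(
    crossbar, open_rate=0.01, short_rate=0.0002, g_open=1e-9, g_short=1e-3
)
n_open = sum(cell == 1e-9 for row in faulty for cell in row)
n_short = sum(cell == 1e-3 for row in faulty for cell in row)
print(n_open, n_short)
```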
In the experiments, SAF occurrence is uniformly distributed among all the RRAM cells. Figure 11 shows the accuracy of MNIST-MLP under different SAF rates. In the figure, average accuracies are denoted by solid lines; the shaded region shows the variation across different simulation instances. Results show that SAFs can severely affect accuracy even at low rates. Compared to stuck-at-open, stuck-at-short has a more adverse effect. Accuracy of the SNN drops by over 20% when only 0.02% of the RRAM cells are stuck-at-short. In comparison, accuracy remains higher than 70% when 1% of the RRAM cells are stuck-at-open. This is because stuck-at-short changes the spiking rate of a neuron much more drastically than stuck-at-open. A short cell in the crossbar produces high current in the bit line and causes the neuron to fire more frequently. On the other hand, an open cell tends to reduce the firing rate of the neuron. If the weight of the synapse is low or the synapse does not receive spikes frequently, then the influence of an open cell would be small. These results indicate that it is critical to minimize SAF defects during fabrication, especially stuck-at-short defects. Alternatively, axons containing defective cells could be bypassed during mapping of a network to the hardware.
Evaluation of NoC Impact
Figure 12(a) shows the impact of different topologies on MNIST-MLP. For all of the topologies, we use Sequential Mapping, that is, the 21 nodes of MNIST-MLP are mapped onto the first 21 hardware cores in increasing order. The y-axis shows performance normalized to the 2D torus topology. The performance metric is defined as the minimum number of router clock cycles required to deliver all of the spikes across all cores in the chip. We use an exhaustive search scheme to measure the performance metric for a given topology. Figure 12(a) shows that the ring topology performs worst owing to its having the lowest bisection bandwidth. It is interesting to note that the 2D torus outperforms the 2D mesh by 7.6% for a 25% increase in the number of links.
Figure 12(b) shows the impact of changing the mapping of MNIST-MLP nodes onto a 2D mesh network, with the mappings sorted in increasing order of the performance metric. For 25 hardware cores and 21 MNIST-MLP nodes, the number of possible mappings is $^{25}P_{21}$. We simulate a randomly chosen subset of 10,000 different mappings and measure the performance metric for each mapping. As shown in Figure 12(b), the mapping of an SNN onto the chip has a significant impact on the performance of the system: the metric ranges from 174 router cycles to 1487 router cycles. Previous works, such as [1, 33, 34], have proposed different mapping algorithms based on variations of the Kernighan-Lin algorithm. As our focus is on developing a system-level simulator for RRAM chips, optimization of mapping algorithms is not addressed in this work.
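Two of the quantities in the NoC evaluation can be checked with a few lines of arithmetic: the 25% increase in links when moving from a 5 × 5 mesh to a 5 × 5 torus, and the size of the mapping space $^{25}P_{21}$. The sketch below counts bidirectional links; the function names are illustrative and `math.perm` requires Python 3.8 or later.

```python
import math


def mesh_links(rows, cols):
    """Bidirectional links in a 2D mesh (no wrap-around)."""
    return rows * (cols - 1) + cols * (rows - 1)


def torus_links(rows, cols):
    """Bidirectional links in a 2D torus; assumes rows >= 3 and cols >= 3."""
    return 2 * rows * cols


mesh, torus = mesh_links(5, 5), torus_links(5, 5)
extra_links = (torus - mesh) / mesh       # fraction of additional links in the torus
num_mappings = math.perm(25, 21)          # ordered placements of 21 nodes on 25 cores
print(mesh, torus, extra_links, num_mappings)
```

The mapping space is far too large to enumerate, which is consistent with evaluating only a random sample of 10,000 mappings.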
Power Simulation
To further illustrate the utility of our simulator, we used it to estimate the power consumption of each component in the neuromorphic chip when executing the MNIST-MLP SNN. For the subsequent evaluations, the resistance range of the RRAM crossbars is assumed to be 10 kΩ to 100 kΩ. The signal frequency of the A/D converter is 100 MHz, and the technology node of the CMOS components is assumed to be 45 nm. Note that we used this technology node to render our results comparable to many reported in the literature. It is fairly easy to extend the simulator for other technology nodes. The results for power consumption are shown in Figure 13. Figure 13(a) shows that the total power consumption of the chip is around 401 μW, and that the NoC dominates the power consumption (∼90%) over the cores. Figure 13 neuromorphic chip, the CMOS components will be the key bottleneck in terms of power consumption, especially the NoC. Overall, the power efficiency is about $2.67 \times 10^{3}$ GOps/W, which, although lower, is comparable (∼0.65× lower) to other state-of-the-art RRAM crossbar-based computing architectures, such as [24]. In comparison to [15], which does not include an NoC, power efficiency will be more than 10 times lower. The high power consumption of the NoC reported by our simulator shows that optimizing the design of the NoC will be crucial for any low-power neuromorphic chip if the advantages of SNNs and RRAM are not to be marginalized.
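As a rough back-of-the-envelope check of the reported figures, the following sketch splits the total power between the NoC and the cores and derives the throughput implied by the power efficiency. It assumes that power efficiency is defined as throughput divided by total chip power and that the NoC share is exactly 90%; both are simplifications, so the derived values are indicative only.

```python
total_power_w = 401e-6           # total chip power from the text (about 401 uW)
noc_share = 0.90                 # approximate NoC share from the text (assumed exact here)
efficiency_gops_per_w = 2.67e3   # reported power efficiency in GOps/W

noc_power_w = total_power_w * noc_share
core_power_w = total_power_w - noc_power_w
# Assumption: efficiency = throughput / total chip power.
implied_throughput_gops = efficiency_gops_per_w * total_power_w

print(f"NoC power:   {noc_power_w * 1e6:.1f} uW")
print(f"Core power:  {core_power_w * 1e6:.1f} uW")
print(f"Implied throughput: {implied_throughput_gops:.2f} GOps")
```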
Simulation of Large-Scale SNNs
In this section, we study the scalability of our simulator using the CIFAR10-CNN SNN application. The 15-layer SNN is mapped onto 3680 cores connected by a 2D mesh NoC. Results are shown in Figure 14 . Non-ideal RRAM effects are not considered in this experiment. Figure 14(a) shows the change in accuracy with an increase in the timesteps of the simulation. As mentioned in Section 4.1, the SNN is converted from a deep CNN [22] , whose original accuracy of 75% is shown by the dotted line. The classification output of CIFAR10-CNN is a majority voting scheme and the votes accumulate across timesteps. If all classes have zero votes, no prediction is made, denoted by "??." The figure shows that the accuracy gradually grows as the timestep advances and after 100 steps achieves ∼60% accuracy. The confusion matrix in Figure 14(b) shows that the network tends to mis-classify categories 3 and 5 for the first 100 timesteps. Notwithstanding the fact that the SNN was not explicitly tuned to achieve the original CNN accuracy, this example adequately demonstrates that the simulator can be used to collect runtime statistics and is useful for determining the performance of large-scale SNN applications. Figure 15 shows the network traffic generated by CIFAR10-CNN, where the solid line denotes the average spike count and the shaded region shows the variations under different runs. In the beginning, the number of spikes in the NoC is low because the input spike trains have just entered the chip. As time advances, more spikes are generated as neurons in the later layers begin to spike. The network traffic, after 30 timesteps, stabilizes at a level that is, on average, 7.67% of the traffic if all neurons of the CIFAR10-CNN were to spike. This result shows that the simulator can be used to shed light on the NoC bandwidth requirement for a large-scale SNN and can help in the hardware design of NoCs for the neuromorphic chip.
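The vote-accumulation readout described above can be illustrated with a short sketch. The input format (a list of spiking class indices per timestep), the function name, and the tie-breaking rule (lowest class index wins) are assumptions; the text does not specify how ties are resolved.

```python
def classify_by_votes(output_spikes_per_timestep, num_classes):
    """Accumulate per-class votes over timesteps and return the prediction after each step.

    `output_spikes_per_timestep` is a list of lists of class indices that
    spiked in each timestep (an assumed input format). The prediction is "??"
    while all classes still have zero votes.
    """
    votes = [0] * num_classes
    predictions = []
    for spikes in output_spikes_per_timestep:
        for cls in spikes:
            votes[cls] += 1
        if max(votes) == 0:
            predictions.append("??")
        else:
            predictions.append(votes.index(max(votes)))  # ties go to the lowest index
    return predictions


print(classify_by_votes([[], [3], [5, 5], [3, 3]], num_classes=10))
# prints ['??', 3, 5, 3]
```

Because votes only accumulate, early predictions can flip as more output spikes arrive, which is one reason accuracy improves gradually with the number of timesteps.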
Finally, for the purpose of scalability analysis, we simulate various chip sizes from 25 cores up to 4096 cores, as shown in Table 2 . We assume a 2D mesh NoC with the configuration shown in the table. In this experiment, we simulate all cores generating spikes for all neurons every core clock cycle (which is invoked every 1000 router cycles). For example, in a 5 × 5 mesh, each of the 25 cores generates 256 spikes per core cycle. These spikes are transmitted to randomly generated destination cores in 1000 router cycles before the next core cycle. All experiments were run for 35 core cycles (equivalent to 35,000 router cycles). The table shows that the simulation time increases linearly with the number of cores, demonstrating the scalability of our simulator.
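The synthetic workload used for the scalability study can be sketched as below: every core emits one spike per neuron in each core cycle, each addressed to a random destination core. The function name, the uniform choice of destination (which may coincide with the source core), and the seed are assumptions for illustration.

```python
import random


def generate_core_cycle_traffic(num_cores, neurons_per_core=256, seed=0):
    """Return (source_core, destination_core) pairs for one core cycle.

    Assumed model: one spike per neuron per core cycle, destination drawn
    uniformly at random from all cores.
    """
    rng = random.Random(seed)
    return [
        (src, rng.randrange(num_cores))
        for src in range(num_cores)
        for _ in range(neurons_per_core)
    ]


packets = generate_core_cycle_traffic(25)       # 5 x 5 mesh
core_cycles = 35
total_spikes = len(packets) * core_cycles
print(len(packets), total_spikes)
# prints 6400 224000
```

The number of injected spikes grows linearly with the core count under this workload, so a linear growth in simulation time indicates that the per-spike simulation cost stays roughly constant as the chip scales.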
Discussion
In our experiments, we first found that RTN degrades accuracy at high operating resistance ranges. Through simulating with empirical data, our results indicate that (1) RRAM-based neuromorphic chips should employ materials such as Cu-based RRAM or use lower operating resistance ranges, and (2) it is important for algorithm developers to develop SNN training and weight mapping methods that are robust under RTN in order to use RRAM-based neuromorphic hardware. Second, we showed that the effect of write variability is pronounced and adversely affects application accuracy, even at low-resistance ranges. However, a write-verify scheme can be effective in restoring overall SNN accuracies at the expense of higher rewrite latencies, a trade-off that can be explored with our simulator. Third, we found that NoC topology and the mapping of the SNN to cores both have a significant impact on the router cycles needed to deliver all spikes, a major performance metric. This highlights the need for a good mapping algorithm and the importance of choosing a good network topology for neuromorphic chip designers.
RELATED WORK
Neuromorphic Hardware. Research into neuromorphic hardware initially focused on designing circuits that model a single neuron or a small set of neurons accurately. Studies such as [2, 45] proposed analog circuits to emulate the biological structures in neurons. These neuromorphic chips contain a limited number of neurons and are capable of running only trivial neural networks. Large-scale neuromorphic chips such as HiCANN [12] and NeuroGrid [7] with greater power and area efficiency were developed later. The SpiNNaker project developed ARM-based multi-core systems that can be used to simulate the human brain using SNN. IBM has released TrueNorth, a large-scale neuromorphic computing chip that integrated 4096 digital 256 × 256 neurosynaptic cores [17] capable of running state-of-the-art CNN applications [22] .
Advances in RRAM offered potentially higher density and lower energy consumption for neuromorphic chips. Jo et al. showed that a memristor-based crossbar memory architecture combined with CMOS neurons can support important synaptic functions [23]. Indiveri et al. demonstrated a design that integrated memristor-based neurosynaptic architectures into large-scale neural networks [10]. Park et al. proposed a neuromorphic computing circuit design based on TiN RRAM and applied it to speech processing [25]. Prezioso et al. proposed the first transistor-free design of a metal-oxide-based memristor crossbar that supported fully operative neural networks [16]. Hu et al. proposed the dot-product engine (DPE) based on an RRAM crossbar as an efficient way to perform matrix-vector multiplication [15]. They also developed techniques to ensure compatibility of the DPE with software-level accuracy [31]. Chakrabarti et al. later proposed a 3D DPE based on hybrid CMOS and memristor circuits that led to increased density and bandwidth [3].
Inspired by TrueNorth, Yao et al. developed a fully functional large-scale neuromorphic chip based on RRAM [19]. Shafiee et al. proposed an accelerator based on RRAM crossbars [6]. A recent architecture proposed for SNNs, INXS, also uses memristor crossbar arrays and is shown to be more computationally efficient and energy efficient than TrueNorth [24]. However, the impact of various non-ideal RRAM characteristics was not considered in these works.
Simulators for Neuromorphic Computing. Many simulators for neuromorphic computing were initially designed for neuroscience research [5, 28, 30] . Other simulators, including [8, 27] , focused on using GPUs to accelerate the performance of SNN simulators. They were primarily used for computational modeling of neural networks and did not consider any hardware.
Other simulators for neuromorphic hardware have been developed to assist in hardware development. For example, COMPASS is a hardware simulator built by IBM for the digital TrueNorth chip [43]. HRLSim was released with more generic assumptions on the hardware architecture and can support more neuron models [13]. Both COMPASS and HRLSim are used for fully digital CMOS designs, whereas our simulator can cater to memristor- or RRAM-based designs. A memristor-based SNN simulator was proposed by Querlioz et al. [44]. However, it does not consider the RRAM effects that we have studied. Moreover, it supports simulation of only a single neurosynaptic core. Tang et al. proposed a circuit-level simulator of RRAM-based SNNs [26]. Their simulator considers detailed circuit-level features of memristor devices, but it supports the simulation of only synthetic neural networks and does not have an integrated NoC model. Xia et al. proposed a simulator called MNSIM that simulates memristor-based neuromorphic chips [14]. Unlike our simulator, MNSIM does not simulate RTN and does not have an NoC model.
CONCLUSION
In this article, we introduced a cycle-accurate, system-level simulator for RRAM-based neuromorphic computing chips comprised of many neurosynaptic cores interconnected through an NoC. The neurosynaptic core architecture contains an RRAM crossbar array for storing synaptic weights and is able to perform weight accumulation through a current sum in the analog domain. Our full-scale neuromorphic chip simulator integrates both an RRAM model and an NoC model. In addition, many non-ideal characteristics of the RRAM, such as SAFs, write variability, and RTN, are also integrated into the simulator to allow for the accurate simulation of applications at scale.
