Deep Neural Networks (DNNs) are widely used to perform machine learning tasks in speech, image, video, and natural language processing. The high computation and storage demands of DNNs have led to a need for energy-efficient implementations. Resistive crossbar systems have emerged as promising candidates due to their ability to compactly and efficiently realize the primitive DNN operation, viz., vector-matrix multiplication. However, in practice, the functionality of resistive crossbars may deviate considerably from the ideal abstraction due to device and circuit level non-idealities such as driver resistance, sensing resistance, sneak paths, interconnect parasitics, non-linearities in the peripheral circuits, imperfect write operations, and process variations. Although DNNs are somewhat tolerant to errors in their computations, it is still essential to evaluate the impact of the errors introduced by crossbar non-idealities on DNN accuracy. Unfortunately, device and circuit-level models are not feasible to use in the context of large-scale DNNs with 1.9-13.8 million neurons and 2.6-15.5 billion synaptic connections.
I. Introduction
Deep neural networks (DNNs) have gained tremendous popularity in the past decade, and are currently used in several real-world products and services for speech recognition (Apple Siri, Google Assistant, Amazon Alexa), image analysis (Google+ image search, Facebook DeepFace), natural language processing (Google Translate, Facebook DeepText), search engines, recommendation systems, and more [1], [2]. However, the large and rapidly growing computation requirements of DNNs pose severe challenges to performance and energy efficiency in the systems on which they are deployed.
This work was supported in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
Resistive crossbar systems have garnered significant interest for realizing DNNs due to their ability to perform the underlying computational kernel, viz. vector-matrix multiplication, efficiently. They may be designed using a range of emerging devices, including Resistive RAM (ReRAM), Phase Change Memory (PCM), and Spintronics [3] - [6]. These devices have several desirable characteristics such as high density, non-volatility, low leakage, and low voltage operation. Consequently, they offer the prospect of highly compact and energy-efficient DNN implementations. Several research efforts have explored the design of crossbar based neuromorphic systems at the device, circuit, architecture, and algorithmic levels [4], [7] - [29].
The abstraction of vector-matrix multiplication (voltages applied at the inputs of a resistive crossbar are multiplied with the weights stored as conductances in synaptic elements, resulting in output currents that represent the product of the input vector and the weight matrix) is only an approximation of the functionality of a resistive crossbar. In practice, various device and circuit level non-idealities such as driver resistance, sensing resistance, sneak paths, interconnect parasitics, process and programming variations, and non-linearities in the peripheral circuits such as ADCs and DACs, lead to deviations from this idealized abstraction. These deviations or "errors" can degrade the overall application-level accuracy of DNNs. Although DNNs are resilient to some inaccuracy in their computations [30] - [32], this resilience is not unlimited. Therefore, it is necessary to evaluate the impact of non-idealities present in imperfect computational fabrics such as resistive crossbars.
Most previous efforts that target resistive crossbar based realization of DNNs do not explicitly consider non-idealities, or model non-idealities in a very limited manner (e.g., as limited precision). Moreover, they focus their analysis on very small networks and datasets (e.g., CIFAR-10 [33] and MNIST [34]). Thus, they leave open the question of how non-idealities impact the accuracy of large-scale neural networks (viz., ResNet [35], GoogleNet [36], VGG [37], OverFeat [38], Network-in-Network (NiN) [39], AlexNet [40], etc.) realized on resistive crossbars. Answering this question requires a fast and scalable, yet accurate simulation framework for resistive crossbars that can be integrated into state-of-the-art DNN software frameworks. Unfortunately, such a framework is currently unavailable. Device and circuit simulation (SPICE) models of resistive crossbars are accurate but extremely slow and infeasible for large-scale training and evaluation. On the other hand, architectural models of resistive crossbars [23], [24] are targeted at design space exploration, and use simplified error models that are reasonable for their context, but inadequate for our purpose as they do not consider the dependence of errors on several key factors: (i) the applied inputs, (ii) the crossbar state, i.e., the values of all conductances, and (iii) the crossbar columns that perform the computation. In this work, we address the need for a fast and accurate simulation framework to enable the functional evaluation of large-scale DNNs realized on resistive crossbars.
We first study the impact of non-idealities present in resistive crossbars to characterize errors incurred in the realized vector-matrix multiplications. We observe that the errors in these computations show significant dynamism and data-dependence, which is not captured by prior models [23], [24]. Next, we analyze the circuit model of a resistive crossbar with non-idealities and formulate a Fast Crossbar Model (FCM) that captures its operation using simple linear algebra operations, to achieve orders-of-magnitude faster simulation compared to SPICE. We realize FCM using the well-known BLAS (Basic Linear Algebra Subprograms) library and develop Rx-Caffe, an enhanced version of the popular Caffe [41] deep learning software framework, to evaluate DNNs realized on resistive crossbar systems. We use Rx-Caffe to evaluate six popular large-scale DNNs for classifying the ImageNet [42] dataset. We also evaluate three small networks for classifying the CIFAR-10 and MNIST datasets. We observe that non-idealities can result in significant accuracy loss for large-scale DNNs realized on resistive crossbar systems, motivating the need for further research in cross-layer mitigation and compensation techniques.
In summary, the key contributions of this work are:
• We evaluate the impact of crossbar non-idealities to characterize errors incurred in the realized vector-matrix multiplications. We observe that the errors show significant dynamism and data-dependence that should be considered for accurately capturing the impact of non-idealities on DNNs.
• We propose FCM, i.e., a fast and accurate functional crossbar model to capture the effects of non-idealities.
• We present Rx-Caffe, a software framework to enable training and evaluation of DNNs realized on resistive crossbar systems.
• We evaluate the impact of non-idealities on the application-level accuracy of six state-of-the-art large-scale DNNs, viz. ResNet-50, VGG-16, GoogleNet, AlexNet, OverFeat, and NiN. Our evaluation reveals that the degradation in accuracy due to non-idealities can be significant (9.6%-32%) for these large-scale DNNs.
The rest of the paper is organized as follows. Section II overviews the prior research efforts related to crossbar based systems. Section III provides the necessary background on resistive crossbars. Section IV discusses crossbar non-idealities and their impact on the desired vector-matrix multiplication functions. Section V describes the proposed FCM model. Section VI outlines the software framework Rx-Caffe. Section VII details the experimental methodology. Experimental results are presented in Section VIII. Section IX concludes the paper.
II. Related Work
Resistive crossbar based systems have attracted significant research interest in recent years due to their ability to efficiently realize the primitive machine learning operations, i.e., vector-matrix multiplications [43] - [46]. Prior efforts in this area span the computing stack from devices to algorithms and can be broadly classified into specialized hardware accelerators [7] - [12], non-ideality mitigation schemes [13] - [22], and design tools for crossbar based systems [23] - [25].
Specialized hardware accelerators. Several resistive crossbar based specialized hardware systems have been proposed to accelerate the inference [7] - [9], [12] and the training [10], [11] operations of neural networks. These efforts focus on the evaluation of the proposed architecture for performance, energy, and area, and either do not explicitly consider non-idealities or model only the limited-precision aspect of non-idealities.
Non-ideality mitigation schemes. Many efforts have addressed different forms of non-idealities present in resistive crossbars. These efforts can be broadly classified based on the type of solution, i.e., hardware or software. The proposed software approaches include a methodology for addressing device level non-idealities [17], a training method to overcome the effects of process variations [15], a modified backpropagation algorithm for neural networks [13], a weight to conductance conversion algorithm [14], a rank clipping method to reduce the effects of non-idealities by lowering crossbar dimensions [18], and a defect rescuing scheme to alleviate the effect of bit failures [19]. In addition, proposed hardware solutions address the impact of low-voltage induced drift [16], eliminate programming errors [20], and resolve the effects of IR drop [22]. The focus of all the above efforts has been the evaluation and mitigation of errors due to non-idealities. However, these works are restricted to small networks and small datasets.
Our work complements the above efforts on hardware accelerators and non-ideality mitigation schemes. We focus on the functional evaluation of large-scale DNNs including winners from previous ImageNet challenges [42]. To that end, we propose a fast and accurate functional crossbar model that captures the effects of device and circuit-level non-idealities, and a software framework to train and evaluate these networks.
Design tools. Recent efforts have proposed tools to aid the design of crossbar systems [23] - [25]. MNSIM [23] presents an architectural simulation platform to evaluate crossbar architectures, P. Gu et al. [24] propose a technological exploration tool to optimize the trade-offs for resistive crossbars, and AutoNCS [25] optimizes the utilization and efficiency of these systems. We differ from these efforts [23], [24] as they estimate the application-level accuracy using simple error estimation models that are reasonable for their context, but inadequate for evaluating the impact of non-idealities on DNN accuracy, since they do not consider the dependence of errors on several factors. Further, we focus on the evaluation of large-scale DNNs realized on crossbar systems using the proposed software framework.
III. Preliminaries
In this section, we provide a brief background on resistive crossbar arrays and the computation that they perform. For the sake of illustration, we use a resistive crossbar that is designed with spintronic devices [5] as synaptic and neuronal elements. However, our methodology is generic and applicable to other devices (any device that can be used to realize a programmable resistance at each crossbar element) as well.
Spintronic synapse/neuron device. Spintronic devices [5] have emerged as promising candidates for designing synaptic and neuronal devices. They have several key advantages over 2-terminal memristive devices [3], viz., low-voltage write operations, high endurance, immunity to drift (i.e., change in synaptic conductance during the read operations [3]), and no read-write conflicts due to separate read and write paths. Figure 1 (a) shows a spintronic device [5] that can be used as a building block for resistive crossbar arrays. It is a 3-terminal device consisting of a Magnetic Tunneling Junction (MTJ) and an underlying Heavy Metal (HM) layer. An MTJ is composed of a fixed Ferro-Magnetic (FM) layer, a tunneling oxide, and a free FM layer. A write operation is performed by applying a voltage at the two ends of the HM layer, i.e., between terminals T2 and T3. The resulting current flowing through the HM exerts a spin orbit torque on the free layer that causes domain wall motion. The position of the domain wall in the free layer determines the conductance of the MTJ, which lies between G_MIN (when the domain wall is to the far right) and G_MAX (when the domain wall is to the far left). Moreover, the number of unique locations at which the domain wall can reside determines the precision of the device. A read operation in this device is performed by applying a voltage across terminals T1 and T3, and the resultant current is proportional to the MTJ's conductance. In summary, the device in Figure 1 (a) operates as a programmable resistor wherein the resistance is programmed through the HM layer and sensed via the MTJ.
Crossbar Array. Figure 1 (b) shows a crossbar array design for realizing vector-matrix multiplications. It supports two main operations: (i) Programming, i.e., a write operation performed sequentially on a set of synaptic devices, and (ii) Evaluation, i.e., the vector-matrix multiply operation. The synaptic element at the intersection of each row and column is programmed by enabling the corresponding write circuits along the Write Wordline (WWL) and the bitline (BL), to apply the necessary current and set it to the desired conductance. A vector-matrix multiplication is performed by first resetting all the neurons to the minimum conductance state, G_MIN. Subsequently, Digital-to-Analog converters (DACs) convert digital inputs into voltages on the Read Wordlines (RWL). The resulting current flowing through each BL determines the end position of the domain wall in the corresponding neuron, and thereby its conductance. The neurons' conductances (G_neuron) represent the outputs of the vector-matrix multiply operation and are subsequently converted to digital numbers using Analog-to-Digital converters (ADCs).
Equation 1 specifies the ideal vector-matrix multiply operation for an MxN dimensional crossbar. Vin_ideal is a 1xM vector consisting of the input voltages, G is an MxN matrix comprising the synaptic conductances, and Iout_ideal is a 1xN vector containing the output currents.
Iout_ideal = Vin_ideal · G    (1)
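As a minimal sketch of the ideal abstraction in Equation 1, the product of a 1xM voltage vector and an MxN conductance matrix can be written in a few lines of NumPy. The function name and the example values (a 3x2 array with all devices at 20kΩ) are illustrative assumptions and not part of the original framework.

```python
import numpy as np

def ideal_crossbar_vmm(v_in, g):
    """Ideal crossbar output currents, Iout_ideal = Vin_ideal @ G."""
    v_in = np.asarray(v_in, dtype=float)   # shape (M,), row voltages in volts
    g = np.asarray(g, dtype=float)         # shape (M, N), conductances in siemens
    if v_in.shape[0] != g.shape[0]:
        raise ValueError("Vin length must equal the number of crossbar rows")
    return v_in @ g                        # shape (N,), column currents in amperes

# Illustrative values: 3x2 crossbar, every device at 20 kOhm (50 uS).
g = np.full((3, 2), 1.0 / 20e3)
v_in = np.array([0.2, 0.01, 0.2])
i_out = ideal_crossbar_vmm(v_in, g)
print(i_out)
```

Every deviation discussed in the following sections is measured against this reference product.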
IV. Crossbar Non-Idealities
In this section, we analyze non-idealities in resistive crossbars and examine their impact on vector-matrix multiplications.
A. Crossbar Non-idealities
To illustrate the device and circuit level non-idealities in resistive crossbars, we present the resistive equivalence of the crossbar array and the peripheral (DAC and ADC) circuits in Figure 2 (a). The key sources of non-idealities are: (1) wire resistances of the crossbar interconnects, (2) sensing resistances of the circuits that sense the output currents, (3) driver resistances of the circuits that drive the crossbar rows, (4) sneak paths, (5) variance in synaptic conductance due to process variations and imperfect programming, and (6) non-ideal DACs. While we consider all these non-idealities in subsequent sections, we select the non-idealities due to DACs and sneak paths for a more detailed treatment below, in order to illustrate the error modeling complexity.
Non-ideal DAC. Figure 2 (a) shows the resistive equivalence of a DAC circuit. As shown, the DAC can be represented as a resistive divider circuit, where one of the resistances (R_DAC) varies with the applied digital input, and the other resistance (R_PD) is fixed. An applied digital input determines the value of R_DAC and thereby the DAC's output voltage (DAC_out). Note that DAC_out also depends on the effective load resistance (R_Load) seen at the DAC output. Figure 3 (a) shows the error incurred due to DAC non-idealities, which is a function of both the applied inputs (R_DAC) and the synaptic conductances (R_Load). To illustrate this dependence, Figure 3 (a) plots the outputs of the non-ideal DAC (Non-ideal DAC_OUT) for two load resistances, 3.2kΩ and 32kΩ. As evident, the output voltages for the two R_Load values differ across the applied digital inputs.
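The load dependence of a resistive-divider DAC can be illustrated with a short sketch. The divider topology assumed here (a reference voltage feeding R_DAC, with R_PD and R_Load in parallel to ground at the output), the reference voltage, the R_PD value, and the R_DAC values per input code are all hypothetical. Only the two load resistances, 3.2kΩ and 32kΩ, are taken from the text.

```python
import numpy as np

def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)

def dac_out(v_ref, r_dac, r_pd, r_load=np.inf):
    """Output of a loaded divider: v_ref -> R_DAC -> out -> (R_PD || R_Load) -> gnd.

    This topology is an assumption made for illustration.
    """
    r_bottom = r_pd if np.isinf(r_load) else parallel(r_pd, r_load)
    return v_ref * r_bottom / (r_dac + r_bottom)

# Hypothetical DAC parameters; one R_DAC value per digital input code.
v_ref, r_pd = 1.0, 10e3
r_dac = np.array([40e3, 20e3, 10e3, 5e3])

v_unloaded = dac_out(v_ref, r_dac, r_pd)        # ideal, no crossbar attached
v_light = dac_out(v_ref, r_dac, r_pd, 32e3)     # R_Load = 32 kOhm
v_heavy = dac_out(v_ref, r_dac, r_pd, 3.2e3)    # R_Load = 3.2 kOhm

for code, (vu, vl, vh) in enumerate(zip(v_unloaded, v_light, v_heavy)):
    print(f"code {code}: unloaded={vu:.3f} V, 32k={vl:.3f} V, 3.2k={vh:.3f} V")
```

Under these assumptions the deviation from the unloaded output changes with both the input code (through R_DAC) and the load, which is the data-dependence the paragraph above describes.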
Fig. 3: Example of non-idealities in resistive crossbar
Sneak paths during vector-matrix multiplication. Ideally, currents in resistive crossbars would be expected to flow from left to right along the rows and from top to bottom through the columns. However, due to the non-idealities described above (specifically, wire resistances), internal node voltages within the crossbar may vary, resulting in additional current paths, which we refer to as sneak paths. Figure 3 (b) illustrates sneak paths during vector-matrix multiplications for a 3x2 crossbar array. We consider a crossbar state with all synaptic devices programmed to 20kΩ, and the applied input voltages at the rows are 0.2V, 0.01V, and 0.2V, respectively. For this crossbar state, we observe that the direction of current between nodes a_2,2 and b_2,2 is flipped, i.e., the current flows from b_2,2 towards the input (Vin_2), instead of the expected direction. Sneak paths are a function of both the crossbar state and the applied inputs, and therefore further contribute to the overall dynamism in errors due to non-idealities.
B. Errors due to Non-Idealities
Next, we analyze the impact of non-idealities on the computational accuracy of the vector-matrix multiplication realized on resistive crossbars. Toward this end, we compare the outputs of the vector-matrix multiplications obtained from HSPICE simulations of non-ideal crossbar arrays with the ideal computations (Equation 1) and analyze the sensitivity of the error to various parameters.
Sensitivity to crossbar size. We first examine how the errors incurred due to the individual non-idealities (WIRE, SENSE), combinations of non-idealities (DAC+DRIVER, WIRE+SENSE), and the cumulative effect of all non-idealities (ALL) vary with the crossbar dimension. Figures 2(b) and 2(c) show the errors incurred during the vector-matrix multiplication realized on crossbars, with all synaptic conductances programmed to G_MIN and G_MAX, respectively. In both graphs, the Y-axis represents the error in the last (N-th) column of an NxN crossbar, and the X-axis represents the crossbar dimension (N). In both cases, we observe that the overall errors due to all non-idealities (ALL), as well as due to individual non-idealities, increase with the crossbar dimension. This is expected because: (i) the overall wire resistances increase with crossbar array size, (ii) the sensing resistance contribution to the overall bitline resistance increases, and (iii) the DAC non-ideality increases due to a decrease in the effective load resistance (see footnote 1). Further, we also observe that for smaller crossbars, the non-ideality due to the DAC is predominant, whereas, for larger crossbars, the wire and sensing resistance effects become equally significant.
Footnote 1: Higher crossbar dimensions have more columns, leading to an increase in parallel paths and consequently lowering the effective load resistance.
Sensitivity to crossbar state. Next, we analyze the errors' dependence on the crossbar state, i.e., the conductances of synaptic devices. To this end, we fixed the inputs to a 64x64 crossbar array and varied the conductances of the synaptic devices to obtain different crossbar states. Figure 2(d) shows the maximum (MAX), minimum (MIN), and average (AVG) errors across columns of the crossbar over 1000 random crossbar states. We observe that the errors show significant dynamism across these states. Further, we also plot the error for a sample crossbar state (Sample-Run) to demonstrate the irregular pattern that the errors show across crossbar columns. Moreover, this irregular pattern also deviates notably from the patterns observed for MAX, MIN, and AVG errors.
Sensitivity to crossbar inputs. To analyze the errors' dependence on the applied inputs, we fixed the conductances of all synaptic devices and varied the inputs. Figure 2 (e) shows the variations in errors across inputs. We observe that the variance across inputs (MAX and MIN) for a particular column is noticeable, but small in comparison to the variance across crossbar states. However, the dynamism in errors across columns is significant.
Sensitivity to crossbar columns. Figures 2(d) and 2(e) depict how errors vary across crossbar columns. While there is a slight trend of increase in error as we go from the first to the last column, it is not always the last column that incurs the maximum error. Rather, any column can incur the maximum error depending on the crossbar state and the applied inputs.
Sensitivity to process variation and imperfect programming. Finally, we also evaluate the impact of variations by performing Monte-Carlo simulation on a sample set of 10,000 crossbar states obtained by considering variations in synaptic conductances (σ/µ = 10%) [47] . Figure 2 (f) shows the maximum, minimum, and average error observed on a 64x64 crossbar array across these samples. The variations in synaptic conductances can occur due to two prominent reasons: (i) Process variations and (ii) Imperfect programming, i.e., errors during write operations.
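The structure of such a Monte-Carlo experiment can be sketched as follows. This sketch perturbs only the conductances of an otherwise ideal crossbar (Equation 1) with Gaussian noise at σ/µ = 10%; it does not reproduce the HSPICE simulations, and the conductance range, input range, and sample count are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = N = 64
g_min, g_max = 1e-6, 1e-5                             # hypothetical conductance range (S)
g_nominal = rng.uniform(g_min, g_max, size=(M, N))    # one nominal crossbar state
v_in = rng.uniform(0.0, 0.2, size=M)                  # one fixed input vector (V)
i_ideal = v_in @ g_nominal                            # reference column currents

def sample_relative_errors(n_samples, sigma_over_mu=0.10):
    """Relative column-current error for n_samples perturbed crossbar states."""
    errs = np.empty((n_samples, N))
    for k in range(n_samples):
        noise = sigma_over_mu * rng.standard_normal((M, N))
        g_var = np.clip(g_nominal * (1.0 + noise), 0.0, None)  # keep conductances >= 0
        errs[k] = (v_in @ g_var - i_ideal) / i_ideal
    return errs

errors = sample_relative_errors(1000)
print("max error per column :", errors.max(axis=0)[:4])
print("min error per column :", errors.min(axis=0)[:4])
print("mean error per column:", errors.mean(axis=0)[:4])
```

Reporting the maximum, minimum, and average per column mirrors the presentation described for Figure 2 (f), although the magnitudes here reflect only the assumed variation model.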
In summary, non-idealities in resistive crossbars can have a significant impact on the computations that they perform, and the errors due to non-idealities are highly dependent on various factors, including the conductances and the applied inputs. In order to accurately capture the impact of non-idealities on application-level accuracy, a crossbar model should consider these factors.
V. Crossbar Modeling
In this section, we present the proposed Fast Crossbar Model (FCM) that captures the data-dependence of errors arising due to various crossbar non-idealities. We also provide a mathematical proof for the exact crossbar array model, a key component of FCM, using an example 4x4 crossbar array.
Fig. 4: FCM: Overview

A. FCM Overview
The key idea behind FCM is to transform an ideal conductance matrix (G_ideal) that represents the programmed synaptic conductances and an ideal input vector (Vin_ideal), into a non-ideal conductance matrix (G_non-ideal) and a non-ideal input vector (Vin_non-ideal) that reflect the impact of all non-idealities. We leverage circuit laws (Kirchhoff's laws and Ohm's law) and linear algebraic operations (direct sum, row switching, vector concatenation, row reduction, etc.) to abstract non-idealities and achieve this transformation. Figure 4 provides an overview of FCM, which comprises three models: (i) the crossbar array model, (ii) the DAC model, and (iii) the ADC model. The crossbar array model, i.e., the core array model, generates a G_non-ideal matrix using the G_ideal matrix and the crossbar parameters (R_sense, r_col, r_row). Conceptually, the G_non-ideal matrix represents the conductances of a crossbar array in which the non-idealities have been abstracted. The transformation from G_ideal to G_non-ideal is exact, and we provide the mathematical proof in Section V-B. The embedded DAC model generates the non-ideal input voltages (Vin_non-ideal) by incorporating the DAC non-idealities. It is composed of a resistive divider circuit, where one of the resistances (R_DAC) is dependent on the applied digital input (Din), and the other resistance is fixed (R_PD). The resistive divider is connected to a variable effective load conductance (G_Load) whose value is dependent on the crossbar state (programmed weights). We use the equation shown in Figure 4 to compute Vin_non-ideal. In this equation, R_DAC is obtained using Din, and G_Load is computed using G_non-ideal. Since R_DAC and G_Load are dependent on the applied inputs and the crossbar state, respectively, Vin_non-ideal captures the data-dependence of the errors arising due to the non-ideal DAC.
Subsequently, using the matrices G_non-ideal and Vin_non-ideal, FCM computes the vector-matrix multiplication realized in crossbars to obtain the non-ideal output currents (Iout_non-ideal). The ADC model shown in Figure 4 is then used to convert Iout_non-ideal to the digital output vector (Dout). Note that FCM realizes linear algebraic operations using well-optimized BLAS (Basic Linear Algebra Subprograms) routines to enable faster simulations.
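Because a crossbar with wire and sensing resistances is a linear resistive network, an effective conductance matrix that maps row voltages to column currents can also be obtained by brute-force nodal analysis, which offers a reference for checking any closed-form G_non-ideal. The sketch below is not the FCM derivation. It assumes a particular topology: one r_row segment between each pair of adjacent row nodes and between the source and the first row node, one r_col segment between adjacent column nodes, R_sense from the last column node to ground, and ideal voltage sources at the rows. The parameter values are hypothetical.

```python
import numpy as np

def brute_force_g_nonideal(g, r_row, r_col, r_sense):
    """Effective (M, N) matrix G_eff such that Iout = Vin @ G_eff.

    Solves the full nodal system of the assumed crossbar topology
    for a unit voltage on each row in turn.
    """
    m, n = g.shape
    size = 2 * m * n
    y = np.zeros((size, size))      # nodal conductance matrix
    rhs = np.zeros((size, m))       # current injected per unit row voltage

    def a(i, j):                    # index of row-side node a[i, j]
        return i * n + j

    def b(i, j):                    # index of column-side node b[i, j]
        return m * n + i * n + j

    def stamp(p, q, cond):          # conductance between nodes p and q
        y[p, p] += cond
        y[q, q] += cond
        y[p, q] -= cond
        y[q, p] -= cond

    g_row, g_col = 1.0 / r_row, 1.0 / r_col
    for i in range(m):
        for j in range(n):
            stamp(a(i, j), b(i, j), g[i, j])            # synaptic device
            if j == 0:                                  # segment to source Vin_i
                y[a(i, j), a(i, j)] += g_row
                rhs[a(i, j), i] += g_row
            else:                                       # row wire segment
                stamp(a(i, j - 1), a(i, j), g_row)
            if i + 1 < m:                               # column wire segment
                stamp(b(i, j), b(i + 1, j), g_col)
            else:                                       # sense resistor to ground
                y[b(i, j), b(i, j)] += 1.0 / r_sense

    v_nodes = np.linalg.solve(y, rhs)                   # column k: unit input on row k
    sense_nodes = [b(m - 1, j) for j in range(n)]
    return v_nodes[sense_nodes, :].T / r_sense          # current through R_sense per volt

rng = np.random.default_rng(1)
g = rng.uniform(1e-5, 1e-4, size=(4, 4))                # hypothetical conductances (S)
g_eff = brute_force_g_nonideal(g, r_row=1.0, r_col=1.0, r_sense=100.0)
print(np.abs(g_eff - g).max() / g.max())                # deviation from the ideal matrix
```

Solving the full nodal system scales poorly with crossbar size, which is one reason a compact matrix formulation is attractive for large networks.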
B. Crossbar Array Model
In this section, we present the mathematical formulation of G_non-ideal using Figure 5, which shows the resistive equivalence of an MxN crossbar array. Vin_i represents the input voltage at the i-th row of the crossbar, Va_i,j denotes the voltage at the node a_i,j, and V_i,j is the voltage difference between the node a_i,j and the node b_i,j. G_i,j is the conductance of the synaptic device at the i-th row and the j-th column. R_sense denotes the sensing resistance, r_row and r_col denote the distributed wire resistances, and Iout_j indicates the output current of the j-th column. In Figure 5, vertical and horizontal slices of the crossbar array are referred to as Column Linear Systems (LSCols) and Row Linear Systems (LSRows), respectively. To demonstrate the formulation of G_non-ideal, we employ 6 major steps involving Equations 3 to 15. Figure 6 illustrates these steps and equations using matrices from an example 4x4 crossbar array. Note that the equations in Figure 6 are generic and applicable to crossbars of any size. We describe these steps in turn below.
Fig. 5: Resistive Equivalence of MxN crossbar array
Step 1: Formulate column linear systems. We first formulate column linear systems (LSCol_1 to LSCol_N) using each vertical slice of the crossbar, shown in Figure 5. Let us consider the j-th vertical slice, which forms the LSCol_j system (Equations 3 to 5). Using Kirchhoff's Current Law (KCL) at all nodes b_i,j present in the j-th column, we obtain Equations 3 and 4 as shown in Figure 6. We then combine Equations 3 and 4 to form the linear system in Equation 5. In the case of an MxN crossbar, we have N such linear systems (LSCol_1 to LSCol_N).
Step 2: Merge column linear systems. Next, the column linear systems (LSCol_1 to LSCol_N) are merged to form a larger Column Linear System (merged-LSCol) as shown in Equations 6 and 7. We achieve this by using the direct sum (⊕) matrix operation on matrices (A_j + J*K_j) and K_j to obtain block matrices COLmat and Gmat, respectively. In Equation 6, CVcol and CVAcol are vectors formed by concatenating the Vcol_j and VAcol_j vectors, respectively. Note that we obtained vectors Vcol_j and VAcol_j in Step 1 (Equations 3 and 5). Further, Iout_non-ideal in Equation 7 is a vector representing the output currents.
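The direct sum used in Step 2 places the per-column matrices along the diagonal of a larger block matrix and fills the rest with zeros. The following sketch shows the operation itself. The example blocks, which use the diagonal matrix of each column's conductances as a stand-in for K_j, are an assumption made for illustration, since the exact definitions of A_j, J, and K_j are given in Figure 6.

```python
import numpy as np

def direct_sum(blocks):
    """Direct sum of matrices: a block-diagonal matrix with zeros elsewhere."""
    blocks = [np.atleast_2d(blk) for blk in blocks]
    rows = sum(blk.shape[0] for blk in blocks)
    cols = sum(blk.shape[1] for blk in blocks)
    out = np.zeros((rows, cols), dtype=np.result_type(*blocks))
    r = c = 0
    for blk in blocks:
        out[r:r + blk.shape[0], c:c + blk.shape[1]] = blk
        r += blk.shape[0]
        c += blk.shape[1]
    return out

# Illustrative 3x2 conductance matrix; one stand-in block per column.
g = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
k_blocks = [np.diag(g[:, j]) for j in range(g.shape[1])]   # hypothetical K_j
gmat = direct_sum(k_blocks)                                # shape (6, 6)
print(gmat)
```

Since the merged matrix is block diagonal, the column systems remain decoupled at this stage; the coupling between columns enters only through the row systems formulated in the next steps.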
Step 3: Formulate row linear systems. Similar to Step 1, we formulate row linear systems (LSRow_1 to LSRow_M) considering horizontal slices of the crossbar. We use KCL at nodes a_i,j present in the i-th horizontal slice to obtain Equation 8, which forms the LSRow_i system. In the case of an MxN crossbar, we have M such row linear systems (LSRow_1 to LSRow_M).
Step 4: Merge row linear systems. We now merge the row linear systems obtained in Step 3 to get a larger Row Linear System (merged-LSRow) as shown in Equation 9. ROWmat is a block matrix obtained by performing the direct sum (⊕) matrix operation on the matrices B_i. Moreover, CVrowIN, CVrow, and CVArow are vectors formed by concatenating the vectors (obtained in Step 3) VrowIN_i, Vrow_i, and VArow_i, respectively.
Step 5: Eliminate internal variables. Next, the vectors CVAcol and CVcol, comprising the internal variables Va_i,j and V_i,j, respectively, are eliminated. In order to eliminate these variables, we use the merged-LSCol and merged-LSRow systems obtained in Steps 2 and 4, respectively. However, the merged-LSCol and merged-LSRow equations cannot be used directly due to the mismatch in their Right-Hand Sides (RHS) (CVAcol ≠ CVArow). We resolve this mismatch by performing elementary row operations on the merged-LSRow system (Equation 9) to obtain Equation 10. Note that the CVrowINA vector and the ROWmatA matrix are obtained by performing row switching, i.e., an elementary row operation, on the CVrowIN vector and the ROWmat matrix, respectively. We then eliminate the CVAcol vector using Equations 6 and 10 to obtain Equation 11. Subsequently, we eliminate the CVcol vector using Equations 7 and 11 to yield Equation 12. Note that Equation 13 details the NETmat matrix introduced in Equation 12.
Step 6: Reduce matrix dimension. Finally, we reduce the size of the matrices NETmat and CVrowINA by leveraging a key property of the CVrowINA vector, i.e., it contains repeated elements. Recall that the CVrowIN vector is formed by concatenating the VrowIN_i vectors (Step 4), and the CVrowINA vector is obtained by performing row switching operations on the CVrowIN vector. Since each VrowIN_i vector (Step 3) has repeated elements, the vectors CVrowIN and CVrowINA also have repeated elements. Exploiting this property, the columns of the NETmat matrix that are to be multiplied by the same elements in CVrowINA can be summed using elementary column operations to yield a compressed NETmatC matrix (shown in Equation 14). Moreover, removing redundancies in the vector CVrowINA leads to the Vin_non-ideal^T vector. Further, Equation 14 can be rewritten as Equation 15 to obtain the G_non-ideal matrix. Note that G_non-ideal is a function of G_ideal, R_sense, r_col, and r_row, and therefore can be constructed using the intermediate matrices COLmat, ROWmat, and Gmat as shown in Figure 4.
VI. Rx-Caffe
In this section, we present a software framework, Rx-Caffe, that enables training and evaluation of large-scale DNNs on resistive crossbar systems. Rx-Caffe is a functional simulator obtained by modifying a popular deep learning framework, i.e., Caffe [41], to mimic the vector-matrix multiplications realized on resistive crossbars. Caffe models the convolutional and fully-connected layers of DNNs as vector-matrix and matrix-matrix multiplications. Rx-Caffe takes the description of a crossbar based architecture as input, maps the DNN onto this system, and subsequently simulates the vector-matrix multiplications using the FCM model described in the previous section. Rx-Caffe's primary objective is to evaluate the application-level accuracy of DNNs; however, it is also capable of generating execution traces to enable energy estimations.
Figure 7 depicts the Rx-Caffe flow, which is composed of 5 major steps. Step 1 maps the neural network to the specified target architecture. In Step 2, the weights from the trained Caffe model are read and virtually programmed into the crossbar arrays in the specified architecture to generate the G_ideal matrices. Subsequently, in Step 3, each G_ideal matrix is transformed into the corresponding G_non-ideal matrix, which is incorporated into Caffe's original weight data structure (Step 4). Note that Rx-Caffe transparently utilizes Caffe's underlying data structures and optimized BLAS libraries, which is key to its performance and scalability. We note that Steps 1-4 are performed only once for a given DNN and architecture template. In Step 5, the DNN is evaluated for the given inputs using the G_non-ideal matrices. During network evaluation, the DAC/ADC models are invoked as pre- and post-processing steps on the inputs/outputs of each convolutional and fully-connected layer.
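Steps 1 and 2 can be pictured with a small sketch that converts a layer's weight matrix to conductances and partitions it into crossbar-sized tiles. This is not Rx-Caffe code. The linear weight-to-conductance mapping, the padding of unused devices with G_MIN, the function names, and the numeric values are all hypothetical choices for illustration; in particular, the handling of signed weights in the actual framework is not described in the text.

```python
import numpy as np

def weights_to_conductance(w, g_min, g_max):
    """Linearly map weights in [w.min(), w.max()] onto [g_min, g_max] (assumed mapping)."""
    w = np.asarray(w, dtype=float)
    span = w.max() - w.min()
    if span == 0.0:
        return np.full_like(w, g_min)
    return g_min + (w - w.min()) * (g_max - g_min) / span

def tile_matrix(mat, size, fill):
    """Split mat into size x size tiles keyed by (tile_row, tile_col).

    Edge tiles are padded with `fill` so that every tile is a full crossbar.
    """
    rows, cols = mat.shape
    tiles = {}
    for r in range(0, rows, size):
        for c in range(0, cols, size):
            tile = np.full((size, size), fill, dtype=float)
            block = mat[r:r + size, c:c + size]
            tile[:block.shape[0], :block.shape[1]] = block
            tiles[(r // size, c // size)] = tile
    return tiles

rng = np.random.default_rng(2)
weights = rng.standard_normal((100, 70))           # hypothetical layer weights
g_min, g_max = 1e-6, 1e-5                          # hypothetical conductance range (S)
g = weights_to_conductance(weights, g_min, g_max)
tiles = tile_matrix(g, size=64, fill=g_min)        # one G_ideal matrix per 64x64 crossbar
print(sorted(tiles.keys()))
```

Each tile plays the role of one G_ideal matrix, which Step 3 would then transform into its G_non-ideal counterpart.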
Fig. 7: Rx-Caffe Overview
Next, we describe training/re-training with Rx-Caffe to obtain DNN models for crossbar based systems. The major challenges that arise during training/re-training of DNNs for these architectures are: (i) the data structures (inputs, outputs, weights) should abide by the range and resolution constraints at all times, and (ii) errors and gradients computed during backpropagation should be appropriately scaled to ensure network convergence. Rx-Caffe meets these constraints by using a crossbar abstracted forward pass and a floating-point based backward pass. It appropriately converts and scales the data structures between abstractions to ensure that the network trains with minimal impact on the overall training speed, which is extremely critical in the context of large-scale DNNs.
VII. Experimental Methodology
In this section, we provide the experimental setup for evaluating FCM and the software framework Rx-Caffe. We also detail the benchmark DNNs used in our experiments. Device/Circuit simulation. We used an in-house device model of the synaptic/neuron spintronic device [5] that is based on the solution of Landau-Lifshitz-Gilbert (LLG) magnetization dynamics and Non-Equilibrium Green's Function (NEGF) electron transport. Circuit-level simulations were performed in HSPICE using 45nm bulk CMOS technology and our device model. Our simulations use the ADC and DAC circuit designs proposed in [48], [49]. The circuit parasitics used in our simulations were extracted from the device and crossbar array layouts. The layouts, shown in Figure 8, were created using the design rules specified in [50]. The table on the right in Figure 8 details the device, technology [51], and variation parameters [47] used in our experiments. We also characterized the energy of the crossbar arrays, which is used as a technology parameter in Rx-Caffe to estimate energy consumption.
Fig. 8: Device and Technology Parameters
Application-Level simulation. We evaluated the application-level accuracy and energy of several popular DNNs realized on a crossbar based architecture using Rx-Caffe. Table I details the benchmark DNNs, listing the target dataset, the number of convolution and fully-connected layers, and the number of synapses and neurons. We also present the relative model size to show the size difference between our benchmark DNNs. To evaluate the application-level accuracy and energy of DNNs using Rx-Caffe, we used the architecture presented in [7]. We realize this architecture using crossbar arrays of size 16x16 (Cross16), 32x32 (Cross32), or 64x64 (Cross64). Note that operations other than vector-matrix multiplications (max-pooling, normalization, etc.) are performed on the host processor.
VIII. Results
In this section, we present the results of experiments that evaluate the modeling accuracy and speedups achieved by FCM over the HSPICE model. We also evaluate the impact of non-idealities on various DNNs realized on resistive crossbar systems using Rx-Caffe.
A. FCM merits
Modeling Accuracy. Figure 9 shows the errors for a 64x64 non-ideal crossbar with respect to an ideal vector-matrix multiplication using 3 different models, viz., HSPICE, FCM, and MNSIM. HSPICE is the baseline model, whereas MNSIM represents the crossbar model proposed in [23]. The X-axis represents the crossbar column, and the Y-axis depicts the error incurred during the vector-matrix multiplication. We observe that simple error models such as MNSIM, which assume a constant error regardless of the applied inputs, crossbar configuration, and crossbar column, deviate considerably from the HSPICE model. In contrast, the proposed FCM model considers these dynamic factors and is therefore able to closely match the errors observed in the HSPICE model. The maximum deviation between the errors estimated by MNSIM and the actual errors observed in HSPICE is about 3.51%. For FCM, the maximum deviation is 0.28%, which is significantly smaller.
Fig. 9: Computation Errors observed in crossbar for various crossbar models
Speedup. To evaluate the speedup of FCM over HSPICE, we measured the execution time of FCM and HSPICE for various crossbar sizes. Table II details the speedup achieved using FCM with respect to HSPICE. We observe a speedup of about 5 orders of magnitude across the crossbar sizes. In addition, as expected, the speedup increases for larger crossbar arrays.
Modeling Overhead. Recall that FCM performs a set of one-time processing steps to convert the ideal conductance matrix (G ideal) into an equivalent non-ideal matrix (G non-ideal) for each crossbar. Table II presents the one-time modeling overhead of FCM for various crossbar sizes. While considerable for larger crossbars, these one-time overheads are amortized over the large number of inputs evaluated by the DNN model.
B. Application-Level Evaluation
We next evaluate the accuracy degradation due to non-idealities at the application level for the benchmark DNNs using Rx-Caffe. We realized these networks on resistive crossbar based systems designed with crossbar arrays of size 16x16 (Cross16), 32x32 (Cross32), and 64x64 (Cross64). Figure 10(a) shows the accuracy degradation for these designs with respect to our baseline, i.e., an ideal crossbar. We first compare the accuracy degradation of the Cross64 design across different benchmark networks. We observe that for small networks the accuracy degradation due to non-idealities is quite small. For example, the LeNet and ConvNet networks suffer accuracy degradation of 0.05% and 2.2%, respectively. In contrast, the accuracy loss due to non-idealities is considerably higher for large-scale DNNs. For instance, the VGG-16, OverFeat, and ResNet-50 networks incur accuracy losses of 25.6%, 27.8%, and 32%, respectively. We observe a similar trend in accuracy degradation across applications for the Cross16 and Cross32 designs as well.
Next, we compare the accuracy degradation across designs (Cross16, Cross32, and Cross64). As evident from Figure 10(a), the Cross16 design suffers less accuracy degradation than the Cross32 and Cross64 designs. We expect this trend because the impact of non-idealities is lower for smaller crossbar arrays (Section IV-B). However, the Cross16 design consumes more energy than the Cross32 and Cross64 designs. Since the major components of the energy consumed in crossbar based vector-matrix multiplications are the peripherals (ADCs and DACs), larger crossbar arrays, which amortize the energy cost of the ADCs and DACs over a larger number of columns and rows, have superior energy efficiency. Figure 10(b) depicts the normalized energy consumption per image for the Cross16, Cross32, and Cross64 designs. Note that in the LeNet and ConvNet networks, the energy of the Cross64 design is higher than that of the Cross32 design. This is because the Cross64 design is underutilized in these networks and therefore suffers from energy overheads due to redundant computations performed in unmapped columns. To further illustrate the energy consumption of DNNs on crossbar based systems, we present the energy breakdown of three networks, viz., VGG-16, GoogleNet, and AlexNet, realized on the Cross64 design. Figure 11 shows the energy breakdown of these networks into read energy for inputs (CMOS-Mem-Read), write energy for outputs (CMOS-Mem-Write), and computation energy for vector-matrix multiplications (Cross-Computation). We observe that the major energy component is the vector-matrix multiplications (Cross-Computation), which is dominated by the ADCs and DACs.
Fig. 11: Energy Breakdown for Cross64 implementation
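The amortization and underutilization effects discussed above can be made concrete with a toy energy model. This is a sketch under loudly stated assumptions: the function `energy_per_useful_mac` is hypothetical, the per-component energies are arbitrary placeholder units rather than measured values, and the model assumes every row is driven and every column is sensed even when only part of the array holds mapped weights.

```python
def energy_per_useful_mac(n, used_rows, used_cols,
                          e_dac=1.0, e_adc=1.0, e_cell=0.01):
    """Energy per useful multiply-accumulate for one n x n crossbar evaluation.

    e_dac, e_adc, e_cell are placeholder energies in arbitrary units
    (assumptions for illustration, not values from the experiments).
    """
    # All n DACs and n ADCs fire, and all n*n cells conduct, regardless of mapping.
    total = n * e_dac + n * e_adc + n * n * e_cell
    return total / (used_rows * used_cols)


# Fully utilized arrays: peripheral cost is shared by n*n operations.
full = {n: energy_per_useful_mac(n, n, n) for n in (16, 32, 64)}

# A small layer that fills only a 20 x 20 region of the array.
small_on_32 = energy_per_useful_mac(32, 20, 20)
small_on_64 = energy_per_useful_mac(64, 20, 20)
```

Under this model the fully utilized energy per operation falls as the array grows, whereas the 20 x 20 workload costs more on the 64x64 array than on the 32x32 array, which mirrors the trend reported for the LeNet and ConvNet networks.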
We also evaluated the slowdown of Rx-Caffe with respect to Caffe, which amounts to 2.5X and 2.75X for inference and training, respectively, across our benchmark applications. We believe this is a reasonable overhead given the highly optimized nature of Caffe, and the fact that much like Caffe, Rx-Caffe can also leverage multi-cores, GPUs, and clusters for increased processing throughput.
In summary, there exists a fundamental trade-off between the application-level accuracy and the system energy which needs to be examined, in order to determine the architectures for future resistive crossbar systems. Rx-Caffe intends to drive these decisions by providing a software platform that can precisely evaluate crossbar architectures executing large-scale DNNs. 
C. Accuracy's sensitivity to non-idealities
To further illustrate the impact of non-idealities on the application-level accuracy, we present a sensitivity analysis in Figure 12. We plot the accuracies of 3 networks, viz., AlexNet, VGG-16, and GoogleNet, for implementations that differ in their degree of non-idealities. The implementations that we use are: (i) a floating-point implementation realized on an x86 CPU architecture (FP32), (ii) a 6-bit ideal crossbar based design (Cross-ideal), and (iii) 6-bit non-ideal crossbar based designs with and without variations (Cross64-nonideal). FP32 is a CMOS based digital implementation that does not use crossbars. As shown in Figure 12, the accuracy drops from left to right as more non-idealities are incorporated. We observe two significant accuracy drops, one between the FP32 and Cross-ideal implementations, and the other between the Cross-ideal and Cross64-nonideal implementations. The degradation between FP32 and Cross-ideal is due to the limited precision of the synaptic devices, ADCs, and DACs. This drop in accuracy can be reduced by re-training using Rx-Caffe, as discussed in Section VIII-D. In contrast, the drop in accuracy from the Cross-ideal to the Cross64-nonideal implementations is due to the device and circuit-level non-idealities.
Next, we show the effectiveness of Rx-Caffe in re-training large-scale DNNs for crossbar systems. To that end, we re-trained three networks, viz., AlexNet, VGG-16, and GoogleNet, as shown in Figure 13. Our experiments show that with only 150 iterations of re-training, Rx-Caffe can achieve ∼9%, ∼8%, and ∼26% improvement in accuracy for AlexNet, VGG-16, and GoogleNet, respectively.
E. Impact of non-idealities: Insight
To provide further insights into the impact of non-idealities at the application level, Figure 14 compares the output features obtained from ideal crossbars and non-ideal crossbars for two convolution layers (Conv1 and Conv3) of the ConvNet network running on the CIFAR-10 dataset. Some of the significant distortions in the features are identified in the figure using circles. We observe that the impact of non-idealities increases noticeably as we go deeper into the network (the Conv3 layer outputs show more artefacts than the Conv1 layer outputs in Figure 14). This is consistent with the observation from Figure 10(a) that deeper DNNs show greater degradation in accuracy due to crossbar non-idealities.
In summary, our results underscore the utility of Rx-Caffe in evaluating and training large-scale DNNs on resistive crossbar architectures. They also motivate the need for further research into techniques to mitigate and compensate for the effects of crossbar non-idealities in the context of large-scale DNNs.
IX. Conclusion
Resistive crossbar systems are a promising solution to the energy-efficient realization of DNNs. In this work, we evaluated the impact of various device and circuit non-idealities that are present in crossbars on the overall accuracy of large-scale DNNs. We proposed FCM, a fast, scalable, and accurate functional crossbar model to evaluate vector-matrix multiplications realized on crossbars. We also presented a software simulation framework to enable evaluation and training of large-scale DNNs on crossbar systems. Our evaluations show that the errors due to non-idealities can degrade the overall accuracy of large-scale DNNs considerably, necessitating error correction and compensation schemes.
Prof. Raghunathan has been a member of the technical program and organizing committees of several leading conferences and workshops, chaired premier IEEE/ACM conferences (CASES, ISLPED, VTS, and VLSI Design), and served on the editorial boards of various IEEE and ACM journals in his areas of interest. He received the IEEE Meritorious Service Award and Outstanding Service Award. He is a Fellow of the IEEE and Golden Core Member of the IEEE Computer Society. Prof. Raghunathan received the B. Tech. degree from the Indian Institute of Technology, Madras, and the M.A. and Ph.D. degrees from Princeton University.
