Abstract-Recent advances in deep learning have accelerated the growth of machine learning and artificial intelligence across a variety of cognitive tasks. Deep learning relies on dense connections of artificial neurons and synapses that form deep neural networks (DNNs). However, DNNs are compute- and memory-intensive, and consume high energy on standard von-Neumann systems. Thus, there is widespread interest in emerging technologies, especially resistive crossbars, for accelerating DNNs. Resistive crossbars offer a highly parallel and efficient matrix-vector multiplication (MVM) operation. Because MVM is the dominant operation in DNNs, crossbars are ideally suited to them. However, various device and circuit non-idealities lead to errors in the MVM output, thereby reducing DNN accuracy. To address this, we propose crossbar re-mapping strategies that mitigate line-resistance induced accuracy degradation in DNNs without re-training the learned weights, unlike most prior works. Line-resistances degrade the voltage levels along the crossbar columns, thereby inducing more errors at the columns farther from the drivers. We rank the DNN weights and kernels based on a sensitivity analysis, and re-arrange the columns such that the most sensitive kernels are mapped closer to the drivers, thereby minimizing the impact of errors on the overall accuracy. We propose two algorithms, the static remapping strategy (SRS) and the dynamic remapping strategy (DRS), to optimize the crossbar re-arrangement of a pre-trained DNN. We demonstrate the benefits of our approach on a standard VGG16 network trained on the CIFAR10 dataset. Our results show that SRS and DRS limit the accuracy degradation to 2.9% and 2.1%, respectively, compared to a 5.6% drop for an as-is mapping of weights and kernels to crossbars.
We believe this work brings an additional aspect for optimization, which can be used in tandem with existing mitigation techniques, such as in-situ compensation, technology aware training and re-training approaches, to enhance system performance.
I. INTRODUCTION
Although artificial intelligence (AI) has been around for decades, recent advancements in deep learning (DL) have enabled machine learning and AI to find value in many applications [1], [2]. DL is based on deep neural networks (DNNs). DNNs are a biologically inspired class of algorithms that have shown state-of-the-art results for various cognitive tasks, even surpassing human performance in certain tasks [3]. However, DNNs consist of a dense connection of artificial neurons and synapses, making them memory- and compute-intensive. Current computing systems are based on the well-known von-Neumann architecture, which consists of physically separate memory and compute units. Running DNN algorithms on such machines is limited by the von-Neumann bottleneck [4], since the compute patterns of DNNs are inherently different. The bottleneck arises from repeated data transfers from the off-chip memory, which incur large overheads in energy and latency. With energy efficiency being a primary concern, especially for battery-operated edge devices, exploring new computing paradigms is of great importance.
In-memory computing (IMC) is one approach to overcome the von-Neumann bottleneck. IMC embeds computing within memory arrays, enabling some computations to be performed locally where the data is stored. There have been many previous proposals for IMC in CMOS-based memories, especially SRAMs [5]-[12]. However, since most DNNs are memory-intensive, SRAM caches large enough to store all weights incur large area overheads, so off-chip memory accesses are still required. Embedded non-volatile memories (eNVMs), such as resistive random-access memory (ReRAM), spin-transfer-torque magnetic RAM (STT-MRAM), and phase-change RAM (PCRAM), are emerging memory solutions that offer high-density storage. Moreover, the crossbar structure of such eNVMs can be leveraged to perform massively parallel matrix-vector multiplication (MVM) operations [13]-[16]. Resistive crossbars compute the MVM operation directly in the analog domain, within the memory array itself. This makes these architectures well suited for DNNs, since most of the computations in DNNs can be converted to MVM operations. Moreover, the high-density storage of eNVMs can accommodate the large weights and kernels of DNNs on-chip. Multilevel resistive crossbars, which can store data in multiple conductance states, have been shown to effectively perform MVM operations for DNNs [17]-[23].
The analog nature of the computations in resistive crossbars induces errors and approximations in the MVM output. The sources of these errors include device and circuit non-idealities, such as device variations, line resistances, and non-idealities in the analog-to-digital and digital-to-analog converters [24]. These errors pose an even bigger challenge for DNNs, since the errors accumulate across deeper layers. Thus, once a trained network is mapped to the crossbars, it may not give the desired accuracy. Many mitigation techniques have been proposed in the literature to overcome these challenges, such as training on the hardware itself, or re-training the weights after they have been mapped onto crossbars [24]-[27]. The neural network captures the error patterns and 'learns' them, thereby improving the accuracy.
However, these techniques require multiple writes into eNVM devices. This is a power-hungry process since eNVM writes are energy-expensive [28] . Moreover, low endurance of eNVM devices, especially ReRAMs, limits the number of writes into the device [28] . Another mitigation strategy is to lower the crossbar dimension [29] . However, this limits the benefits of parallelism and energy-efficiency offered by crossbars.
In this work, we tackle the non-idealities by rearranging crossbar columns, without having to re-train the learned weights. We observe a pattern in the line-resistance induced errors, which can be exploited to re-arrange the crossbar columns based on a sensitivity analysis of the weights and kernels. Line-resistances degrade the voltage levels along the crossbar columns, thereby inducing more errors at columns farther from the drivers. We propose to remap these columns based on a sensitivity analysis of the outputs. In other words, the DNN weights and kernels to which the final output is more sensitive are given a higher rank and are mapped to columns closer to the drive source, where the errors are lower. We propose two algorithms, which take a pre-trained DNN and optimize the crossbar rearrangement such that the overall accuracy degradation is reduced. Note that in our work we analyze the spatial dependency among columns of the crossbar, which is induced by line-resistances. Other non-idealities, such as the source and sink resistances of the peripheral circuitry, affect each column equally and do not introduce this spatial dependency. Thus, our work complements previous efforts to mitigate crossbar non-idealities by bringing in another aspect for optimization, which can be used in tandem with existing techniques to enhance system performance.
In summary, the key highlights of this work are: 1) We study the impact of line-resistance induced errors and spatial dependency in MVM computations in resistive crossbars, and develop a statistical model to characterize these errors. 2) We propose two crossbar re-arrangement strategies: the static remapping strategy (SRS) and the dynamic remapping strategy (DRS). In both strategies, the crossbar arrangement of a pre-trained DNN is optimized through a sensitivity analysis of its weights and kernels. 3) We evaluate the effects of line-resistance induced errors on a standard VGG16 network trained on the CIFAR10 dataset, and demonstrate that the proposed mapping strategies reduce the accuracy degradation.
II. PRELIMINARIES
In this section, we provide a brief background on resistive crossbar arrays, including their structure and operation for performing matrix-vector multiplication (MVM), and the sources of error due to parasitic line-resistances. We also briefly illustrate how large-scale DNNs are typically mapped to crossbar arrays.
A. Crossbar structure and operation
Fig. 1(a) shows a schematic of a crossbar array. It consists of a mesh of cells connected through bit-lines (BLs) running horizontally and source-lines (SLs) running vertically. Each cell is a non-volatile memory device, for example, a memristor, a phase-change material, or a magnetic tunnel junction. Each cell also contains a selector device or a transistor, which enables reads and writes to individual cells and also helps block sneak-current paths [30]. In this work, we chose a one-transistor one-resistor (1T1R) cell. To perform a matrix-vector operation, the input vector is translated to analog voltages using a digital-to-analog converter (DAC) and applied to the BLs. The matrix data is stored in the form of the conductance states of the resistive elements. Each resistive element of the crossbar stores a matrix entry. The resulting current output from each SL represents the matrix-vector multiplication output obtained from Kirchhoff's laws:
$$I_j = \sum_i v_i G_{ij} \qquad (1)$$
where $v_i$ is the analog voltage applied to the i-th BL, $G_{ij}$ is the conductance of the resistive element at the crosspoint of the i-th BL and j-th SL, and $I_j$ is the current output obtained at the j-th SL. Thus, the crossbar structure inherently performs an MVM operation by exploiting Kirchhoff's current law. Since most neural network computations heavily involve MVM operations, crossbars have been shown to be effective for such workloads. In that case, the input activations at each layer of the neural network are mapped to analog voltages, while the resistive devices store the learned weights of the deep neural network.
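The ideal read-out, I_j = sum_i v_i G_ij, can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic the crossbar performs in the analog domain; the function name `crossbar_mvm` and the 2x2 values are hypothetical and not taken from the paper.

```python
def crossbar_mvm(voltages, conductances):
    """Ideal crossbar read-out: I_j = sum_i v_i * G_ij.

    voltages:     BL voltages, one per row i (volts)
    conductances: list of rows; conductances[i][j] is G_ij (siemens)
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    assert len(voltages) == n_rows, "one voltage per bit-line"
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            # Ohm's law per cell, Kirchhoff's current law per source-line
            currents[j] += voltages[i] * conductances[i][j]
    return currents


# Hypothetical 2x2 example; the values are chosen only for illustration
v = [0.5, 0.25]            # volts
G = [[1e-5, 2e-5],         # siemens
     [4e-5, 0.0]]
I = crossbar_mvm(v, G)
```

Each entry of `I` is one column current, so a single read produces every output of the matrix-vector product at once.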
B. Impact of line-resistances on crossbar operation
The BL and SL metal lines, running horizontally and vertically, respectively, have a finite resistance contribution over the length and width of the cell layout. This is depicted schematically in Fig. 1(b), where these resistances ($r_L$) are lumped at every node of the crossbar array. First, let us consider the horizontal lines. When an input voltage is applied at the BLs, voltage drops are induced along the horizontal lines due to the lumped line resistances. In other words, the input voltage seen by the cells degrades from left to right. In the example shown, $v_1$ and $v_2$ applied at Rows 1 and 2, respectively, degrade to $v'_1$ and $v'_2$ at the second column, due to the voltage drop across $r_L$. Moreover, the amount of voltage drop at every node ($i_h r_L$) depends on the current being drawn by that column, making it highly dependent on the states and the arrangement of all storage devices and input voltages in the crossbar. Next, let us consider the vertical lines. Note that as we go from the bottom to the top, the source connection of the transistor sees a higher resistance, thereby increasing the effects of source degeneration. This causes the transistor conductance to reduce, leading to errors. These errors are also data-dependent, as the voltage drops along the line resistances ($i_v r_L$) depend on the current being drawn by that column. Intuitively, the minimum errors would occur at the bottom-left corner of the array, while the highest errors would occur at the top-right corner. The data dependency and spatial dependency of these errors make them difficult to estimate quantitatively, due to the large number of permutations and combinations of input voltages and memristor states. However, by using a few key properties of DNNs, we can approximately quantify these errors, as we will show later.
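To make the left-to-right degradation concrete, the following sketch solves a single bit-line as a resistive ladder. It is a deliberately simplified model and not the H-SPICE setup used in this work: it assumes every source-line is an ideal ground, ignores the access transistor and the vertical line resistances, and uses hypothetical cell conductances. The function name `bitline_node_voltages` is likewise an illustrative choice.

```python
def bitline_node_voltages(v_in, cell_conductances, r_line):
    """Node voltages along one bit-line modeled as a resistive ladder.

    Assumed topology: driver -> r_line -> node 0 -> r_line -> node 1 ...
    with cell k connecting node k to an ideal ground.
    """
    n = len(cell_conductances)
    # g_right[k]: total conductance seen at node k, looking away from the driver
    g_right = [0.0] * n
    g_right[n - 1] = cell_conductances[n - 1]
    for k in range(n - 2, -1, -1):
        # r_line in series with everything to the right of node k
        downstream = g_right[k + 1] / (1.0 + r_line * g_right[k + 1])
        g_right[k] = cell_conductances[k] + downstream
    # Walk outwards from the driver; each segment is a voltage divider
    voltages = []
    v_prev = v_in
    for k in range(n):
        v_prev = v_prev / (1.0 + r_line * g_right[k])
        voltages.append(v_prev)
    return voltages


# Hypothetical row: 128 cells of 100 kOhm each, 2 Ohm per line segment
cells = [1.0 / 100e3] * 128
v_nodes = bitline_node_voltages(0.5, cells, r_line=2.0)
```

In this toy model the node voltage decreases monotonically with distance from the driver, and the drop at each segment scales with the current drawn by all cells further along the line, which is the data dependence described above.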
C. Mapping large-scale DNNs to crossbars
DNNs consist of convolutional (conv-) layers and fully-connected (fc-) layers. A conv-layer consists of multiple 3-dimensional kernels. Each kernel is flattened to a column vector, and the vectors are stacked side by side to create a big matrix. Thus, each column of the big matrix stores one kernel. The big matrix can further be divided into multiple smaller matrices corresponding to the crossbar size. Fig. 2 illustrates this process of mapping kernels to crossbars. Thus, the output of each column corresponds to one output feature map. Since deeper layers of DNNs may have a large number of weights, typically more than fit in one crossbar, the weights are mapped to multiple crossbars, where each crossbar generates a partial output. The outputs from multiple crossbars are summed to obtain the final result. Note that fc-layers can be configured as conv-layers with kernel size = input feature size, and number of kernels = number of output neurons. Thus, the proposed mapping is general to conv- and fc-layers.
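The flatten-and-tile procedure can be sketched as follows. The helper names `kernels_to_matrix` and `tile_matrix`, the nested-list kernel layout (channel, row, column), and the tiny sizes are assumptions made for illustration; the paper itself does not specify an implementation.

```python
def kernels_to_matrix(kernels):
    """Flatten each 3-D kernel into one column of a 2-D matrix."""
    columns = []
    for kernel in kernels:
        flat = [w for channel in kernel for row in channel for w in row]
        columns.append(flat)
    n_rows = len(columns[0])
    # Transpose so that matrix[r][c] is weight r of kernel c
    return [[col[r] for col in columns] for r in range(n_rows)]


def tile_matrix(matrix, xbar_size):
    """Split a matrix into blocks of at most xbar_size x xbar_size."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    tiles = {}
    for r0 in range(0, n_rows, xbar_size):
        for c0 in range(0, n_cols, xbar_size):
            tiles[(r0 // xbar_size, c0 // xbar_size)] = [
                row[c0:c0 + xbar_size] for row in matrix[r0:r0 + xbar_size]
            ]
    return tiles


# Hypothetical layer: 3 kernels, each with 2 channels of 2x2 weights
kernels = [[[[k * 100 + c * 10 + r * 2 + col for col in range(2)]
             for r in range(2)] for c in range(2)] for k in range(3)]
matrix = kernels_to_matrix(kernels)      # 8 rows x 3 columns
tiles = tile_matrix(matrix, xbar_size=4) # two 4x3 crossbar tiles
```

Tiles that share a column index hold different row ranges of the same kernels, so their column currents are partial sums that must be added to obtain the output feature map.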
In general, the weights of a DNN can have both positive and negative values. However, since memristor conductances are positive, we use a differential architecture proposed in many previous works [31] to map both positive and negative weights to memristor conductances. In differential form, each weight $w$ can be represented as $w = w^+ - w^-$, where both $w^+$ and $w^-$ are non-negative numbers that can be separately mapped to crossbar conductances $G^+$ and $G^-$, respectively. The output current from the negative crossbar is subtracted from that of the positive crossbar to obtain the final result. Thus, Equation 1 can be written as:
$$I_j = \sum_i v_i \left(G^+_{ij} - G^-_{ij}\right) \qquad (2)$$
where $G^+_{ij}$ and $G^-_{ij}$ are the conductances at the crosspoint of the i-th BL and j-th SL in the positive and negative crossbars, respectively.
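A minimal sketch of the differential scheme follows. The linear weight-to-conductance map (zero weight at the off-state conductance, largest weight at the on-state conductance) is an assumption made for illustration, as are the function names and the numeric values; only the decomposition w = w+ - w- and the subtraction of the two column currents come from the text.

```python
def split_weight(w):
    """Return (w_plus, w_minus), both non-negative, with w = w_plus - w_minus."""
    return (w, 0.0) if w >= 0 else (0.0, -w)


def to_conductance(w_part, w_max, g_off, g_on):
    """Assumed linear map: 0 -> g_off (R_OFF state), w_max -> g_on (R_ON state)."""
    return g_off + (w_part / w_max) * (g_on - g_off)


def differential_current(voltages, weights, w_max, g_off, g_on):
    """Column current sum_i v_i * (G+_i - G-_i) for one column of weights."""
    total = 0.0
    for v, w in zip(voltages, weights):
        w_plus, w_minus = split_weight(w)
        g_plus = to_conductance(w_plus, w_max, g_off, g_on)
        g_minus = to_conductance(w_minus, w_max, g_off, g_on)
        total += v * (g_plus - g_minus)
    return total


# Illustrative values: R_ON = 100 kOhm and R_OFF = 1 MOhm
g_on, g_off = 1.0 / 100e3, 1.0 / 1e6
weights = [0.5, -0.25, 0.0]
voltages = [0.2, 0.2, 0.4]
i_col = differential_current(voltages, weights, w_max=0.5, g_off=g_off, g_on=g_on)
```

Under this assumed map the off-state offset cancels in the subtraction, so the differential current is proportional to the signed dot product of the voltages and weights.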
III. CROSSBAR REMAPPING STRATEGIES
In this section, we propose two crossbar re-mapping algorithms, which minimize the impact of line-resistance induced errors of crossbar arrays on the system-level classification accuracy of the neural network. The idea is to map 'sensitive' weights and kernels as close to the voltage drivers as possible, to have minimum output quality impact of line-resistance induced voltage drops. Thus, if all the 'sensitive' weights and kernels contribute the least line-resistance induced errors in computations, the impact on final system-level classification accuracy would also be minimal.
SRS procedure, steps 4 to 8:
4: δ_total = δ_total + δ_i
5: end for
6: Rank_total = EvaluateRank(δ_total)
7: NN_SRS = MapCrossbar(NN, Rank_total)
8: return NN_SRS
A. SRS: Static Re-mapping Strategy
To characterize how sensitive the final output quality is to individual weights and kernels, we use the backpropagation technique [32], following the approach of [33], to calculate the (local) error gradients, i.e., the derivatives of the loss function with respect to the output of each neuron. Through backpropagation, one can estimate the contribution of each neuron's output to the final output error. As asserted in [33], sensitive neurons contribute more to the final output error (quality) than less sensitive ones. Thus, the error gradients provide a measure of each neuron's sensitivity with respect to the neural network output quality. Based on this observation, the error gradients at each neuron are averaged over all training samples. Neurons with higher accumulated error gradients are considered more sensitive (or important), while neurons with lower values are considered resilient (or less important). Once we obtain the local error gradients for each neuron, we rank the neurons of each layer, giving a higher rank to sensitive neurons and a lower rank to resilient neurons. Now that we have ranked the neurons of each layer, let us discuss how to map the weights and kernels to crossbars using the evaluated ranks. Recall from Section II-C that, in an fc-layer, each column of the crossbar holds the weights of a particular output neuron, while in a conv-layer each column holds a particular kernel, which corresponds to an output feature map. For fc-layers, we directly assign crossbar columns to each output neuron based on its rank. The weights corresponding to that neuron occupy the assigned column of the crossbar. Note that the weights might span multiple crossbars, but we ensure that all weights corresponding to a particular output neuron are mapped to the same column number in all crossbars.
Thus, the highest ranked neuron's weights are mapped to the first column, while the least ranked neuron's weights are mapped to the last column. For conv-layers, we take an average of the δ of all neurons corresponding to an output feature map. Next, the rankings are ascertained for each output feature map using these averaged error gradients. Since each kernel corresponds to an output feature map, the kernels are assigned crossbar columns in accordance with the ranks, similar to the fc-layer case.
DRS procedure, steps 7 to 13:
7: Acc_val = Validation(NN_DRS)
8: if (Acc_val > Acc_best) then
9: Rank_best, Acc_best = Rank_i, Acc_val
10: end if
11: end for
12: NN_DRS = MapCrossbar(NN_DRS, Rank_best)
13: return NN_DRS
We call this a static remapping strategy (SRS) since the whole analysis of calculating the ranks can be done offline, before the final mapping of conductances onto crossbars. This is a one-step mapping procedure, which requires only a one-time write operation to the crossbar arrays after all training examples have been evaluated.
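The ranking and column assignment for an fc-layer can be sketched as below. This is a hedged illustration rather than the implementation used in this work: the gradients are supplied as plain lists, accumulating their magnitudes is an assumption about how sensitivity is scored, and the names `srs_column_order` and `remap_columns` are hypothetical.

```python
def srs_column_order(per_sample_deltas):
    """Return neuron indices ordered from most to least sensitive.

    per_sample_deltas: iterable of lists; entry n of each list is the error
    gradient (delta) at output neuron n for one training sample.
    """
    delta_total = None
    for deltas in per_sample_deltas:
        if delta_total is None:
            delta_total = [0.0] * len(deltas)
        for n, d in enumerate(deltas):
            # Assumption: the gradient magnitude measures sensitivity
            delta_total[n] += abs(d)
    # Largest accumulated delta first, i.e. the column nearest the drivers
    return sorted(range(len(delta_total)), key=lambda n: -delta_total[n])


def remap_columns(weight_matrix, order):
    """Permute columns so that column c holds the weights of neuron order[c]."""
    return [[row[n] for n in order] for row in weight_matrix]


# Hypothetical gradients for 3 output neurons over 2 samples
deltas = [[0.1, -0.5, 0.2],
          [0.0, -0.3, 0.4]]
order = srs_column_order(deltas)     # neuron 1 is the most sensitive
W = [[1, 2, 3],
     [4, 5, 6]]
W_srs = remap_columns(W, order)
```

Because column c now carries neuron `order[c]`, a consumer of the crossbar outputs would need the same permutation to route each current back to its neuron; that bookkeeping is outside this sketch.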
B. DRS: Dynamic Re-mapping Strategy
We propose another re-mapping strategy by introducing stochasticity into the SRS method. As previously mentioned, SRS is a one-step mapping strategy, in which the local gradients averaged over all training examples are used to rank the sensitivity of neurons and then map the weights to the crossbar accordingly. In contrast, the dynamic remapping strategy (DRS) varies the number of training examples used to calculate the error gradients before mapping the weights to the crossbar. In DRS, we evaluate the rank of neurons based on mini-batches of training samples, instead of the entire training data at once. Once the ranks are evaluated for a mini-batch of training images, the weights are mapped to crossbars according to the ranks of neurons in each layer. Then, the next mini-batch of training images is used to evaluate the ranks again, and the process is repeated. Thus, the crossbars are dynamically remapped in this strategy. Note that this mapping strategy is analogous to the stochastic gradient descent (SGD) technique used for training neural networks with mini-batches of the training data. However, the system-level performance (e.g., classification accuracy) of the neural network does not converge while wandering through the possible crossbar configurations during DRS. To address this problem, we evaluate the validation accuracy on validation examples each time the crossbars are remapped with updated ranks. Hence, we can keep track of the best neuronal ranks found in the large search space. After executing DRS on the last mini-batch, we finally re-map the crossbars with the ranks that showed the best validation performance. As a result, the validation step selects the best neuronal rank configuration encountered during the iterative search.
TABLE I (partial): resistance ranges of memristive technologies
Technology    R_ON (Ω)    R_OFF (Ω)
[15]          60k         600k
Ag/Si [35]    100k        1M
Fig. 3. Distribution of positive and negative weights ($w^+$ and $w^-$) for a pre-trained neural network to be mapped to crossbars.
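The DRS search loop can be sketched as a best-so-far selection over mini-batch rankings. Everything concrete here is a stand-in: `rank_from_batch` and `validate` are hypothetical callables, the toy `validate` simply scores agreement with a made-up reference order instead of running a validation set through the crossbar model, and starting the best accuracy at negative infinity is an assumption.

```python
def drs_search(mini_batches, rank_from_batch, validate):
    """Keep the column ranking with the best validation accuracy."""
    best_rank, best_acc = None, float("-inf")
    for batch in mini_batches:
        rank = rank_from_batch(batch)   # ranks from this mini-batch only
        acc = validate(rank)            # accuracy after remapping with rank
        if acc > best_acc:
            best_rank, best_acc = rank, acc
    return best_rank, best_acc


# --- Toy stand-ins, purely hypothetical ---
def rank_from_batch(batch):
    n_neurons = len(batch[0])
    totals = [sum(abs(sample[n]) for sample in batch) for n in range(n_neurons)]
    return tuple(sorted(range(n_neurons), key=lambda n: -totals[n]))


def validate(rank, pretend_best=(0, 2, 1)):
    matches = sum(1 for a, b in zip(rank, pretend_best) if a == b)
    return matches / len(pretend_best)


mini_batches = [
    [[0.1, 0.5, 0.2]],
    [[0.9, 0.1, 0.3]],
    [[0.2, 0.2, 0.7]],
]
best_rank, best_acc = drs_search(mini_batches, rank_from_batch, validate)
```

The loop never requires the accuracy to converge; it only remembers the best ranking seen, which is why the final remap uses the stored ranking rather than the last one.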
IV. EVALUATION METHODOLOGY
In this section, we describe the simulation methodology developed to evaluate the system-level effectiveness of the proposed crossbar mapping algorithms. First, we provide a detailed analysis of circuit-level crossbar modeling. Second, we describe the system-level simulation framework used to evaluate the proposed mapping techniques on a benchmark image-recognition task using DNNs.
A. Crossbar Modeling
We use TSMC's 65nm PDK for the access transistor and an equivalent resistor to model the non-volatile memory element. The resistance values were chosen based on various memristor and phase-change material devices in the literature. Different memristive technologies have different resistance ranges, from the low-resistance state ($R_{ON}$) to the high-resistance state ($R_{OFF}$), as shown in Table I. A crossbar of size 128×128 was simulated in H-SPICE with each cell connected in a 1T-1R fashion.
We add lumped resistors along the horizontal and vertical lines at each crossbar node, to model the line-resistances. As discussed earlier, the line resistances arise due to the physical length and width of each crossbar cell, through which the BLs and SLs need to route. The layout for the 1T-1R configuration was taken from [36] . Since the nonvolatile memory element is fabricated at the back-end-of-line (BEOL), the BLs and SLs running horizontally and vertically are usually routed on metal-2 and metal-3 layers. We used a 2Ω lumped resistor at each node, which was calculated from typical BEOL resistances and the cell area.
We use a statistical modeling approach to estimate the errors induced in the output crossbar currents by parasitic line resistances. We saw in Section II-B that the crossbar errors increase as we go from left to right, since the line resistance induces voltage drops along the horizontal lines. We also saw that the current output at every column has a spatial dependence on the resistances of all devices in the crossbar and their respective permutations, making an exact analysis extremely difficult. To simplify the analysis, we employ a few key properties of DNNs. Profiling a pre-trained DNN gives us information about the distribution of weights in every layer. Fig. 3 plots the weight distributions for both the positive and negative crossbars, for a particular layer in the neural network. More details on the network architecture and training are discussed in the next sub-section. We observe that the weights are highly skewed towards 0, which is mapped to $R_{OFF}$, for both positive and negative crossbars. In other words, most of the devices in the crossbar array would be in the $R_{OFF}$ state. We therefore assume that, when evaluating a column of interest, all other devices can be treated as being in the $R_{OFF}$ state. We verified this assumption by taking random snippets of size 128×128 from the learned kernels of the neural network, and comparing the output currents of the column of interest from H-SPICE with and without replacing all other devices by $R_{OFF}$. We observed a maximum error of only ∼0.1%, thereby justifying our assumption.
We randomly choose thousands of vectors V and R of size 128 each, from the uniform distributions [0 V, 0.5 V] and [$R_{ON}$, $R_{OFF}$], respectively. For each of these cases, R was mapped to the conductances of the devices in one column of the crossbar, while all other devices were kept at $R_{OFF}$. The voltages V were applied to the BLs. The resulting current from the mapped column ($\hat{I}_j$) was recorded from H-SPICE. This was repeated for all 128 columns, by mapping R to that column and all other devices to $R_{OFF}$, generating $\hat{I}_1, \hat{I}_2, \ldots, \hat{I}_{128}$. Fig. 4(a) shows a scatter plot illustrating the correlation between the ideal current $I$ and the observed currents $\hat{I}_i$ from non-ideal crossbars, for various columns. Taking one random case for the current, Fig. 4(b) shows how the output current deviates from the ideal current as we go from the left-most column to the right-most column. A few key observations can be made from the figures. 1) At lower currents, the observed currents closely match the ideal currents, while at higher currents the errors are higher. This makes sense because lower currents induce lower IR voltage drops along the line-resistances. 2) As the column number to which R is mapped increases, the slope of the scatter plot increases. In other words, as we go from the left-most column to the right-most column, the errors in the output current increase, which is expected due to the cumulative effect of line resistances. This behavior was abstracted into a crossbar model using a linear fitting:
$$\hat{I}_i = m_i I + c_i + N(0, \sigma_i) \qquad (3)$$
where the index $i$ denotes the column number, $m$, $c$ and $\sigma$ are fitting parameters, $I$ is the ideal current output (without errors), $\hat{I}$ is the non-ideal current output, and $N$ is a normally distributed random variable with zero mean and standard deviation $\sigma$. Fig. 5(a) plots the fitting parameters $m$, $c$ and $\sigma$ as a function of crossbar column number. The value of $m$ drops as the crossbar column number increases, denoting the fact that the non-ideal current $\hat{I}_i$ deviates more from the ideal current $I$ as the column number increases. A similar trend is observed for the parameters $c$ and $\sigma$.
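Applying the fitted model to an ideal column current is a one-line computation, sketched below. The parameter tables are invented for illustration (a slope falling linearly from 1.0 to 0.9 over 128 columns, zero offset, and a small noise term); they are not the fitted values from Fig. 5, and the function name `nonideal_current` is hypothetical.

```python
import random


def nonideal_current(ideal_current, column, m, c, sigma, rng=random):
    """Linear error model: m[column] * I + c[column] + Gaussian noise.

    m, c, sigma are per-column lists of fitting parameters; rng is any
    object with a gauss(mu, sigma) method, such as random.Random(seed).
    """
    noise = rng.gauss(0.0, sigma[column])
    return m[column] * ideal_current + c[column] + noise


# Hypothetical parameter tables, NOT the fitted values from this work
n_cols = 128
m = [1.0 - 0.1 * j / (n_cols - 1) for j in range(n_cols)]
c = [0.0] * n_cols
sigma = [1e-8] * n_cols

rng = random.Random(0)               # fixed seed for repeatability
i_ideal = 1e-4                       # amperes
i_near = nonideal_current(i_ideal, 0, m, c, sigma, rng)
i_far = nonideal_current(i_ideal, n_cols - 1, m, c, sigma, rng)
```

In a system-level simulation, a function of this kind would be applied to every column output of every crossbar tile before the partial sums are accumulated.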
Various memristor and phase-change technologies have been proposed in the literature, spanning different process techniques, materials and physics of operation. Some of these technologies are highlighted in Table I, along with their $R_{ON}$ and $R_{OFF}$ values. We therefore analyze the effects of line resistances for different $R_{ON}$ and $R_{OFF}$ values. Note that the parasitic line resistances are a function of the cell size, which is typically governed by the size of the access transistors and the metal pitch. Assuming the cell size remains the same for all these technologies, we expect more pronounced effects of line resistances for lower values of $R_{ON}$ and $R_{OFF}$. We repeated the above analysis for the different $R_{ON}$ and $R_{OFF}$ values listed in Table I, and obtained the fitting parameters. Fig. 5(b) plots the fitting parameter $m$ as a function of crossbar column number for the various cases. It can be observed that the drop in $m$ is larger for lower resistances. This is expected, because the lower the device resistances, the higher the current that flows through the wires, causing larger IR drops. A similar trend is observed for the other fitting parameters.
B. System-level simulation framework
To analyze the effects of line-resistances at the system level, we integrate the developed crossbar model into the PyTorch deep learning framework [37]. We train a VGG16 network [38] using the backpropagation algorithm [32] on the CIFAR-10 dataset [39]. Note that we split the dataset into three sections (training, validation and testing). Then, we apply the proposed crossbar re-mapping algorithms (SRS and DRS) to minimize the impact of line-resistance induced errors of crossbar arrays on the system-level performance (classification accuracy). For SRS, the local gradients were averaged over all training examples. Then, the weights were mapped to the crossbar according to the evaluated ranks of neurons. For DRS, a batch size of 8 was chosen, as it showed the best results. In this case, the local gradients were averaged after each mini-batch iteration to evaluate the neuronal ranks. Then, the crossbar is mapped accordingly, and the next batch is shown. This process is repeated until all training examples have been used. At each step, the validation accuracy and the ranks are recorded. We finally remap the crossbar with the column ranks that showed the best validation performance.
V. RESULTS AND DISCUSSION
The baseline accuracy of the trained network was observed to be 89.29%. This is the accuracy without considering any hardware errors of the crossbars. Next, the testing data was run on the network with the developed crossbar model. In this case, the accuracy dropped to 83.70%, a drop of 5.6% from the baseline due to the parasitic line-resistance induced errors.
To evaluate the SRS mapping strategy, the crossbar columns were assigned to neurons based on the δ's, as described in Section III-A. The testing accuracy after the rearrangement was observed to be 86.37%. We can clearly see an improvement in the accuracy. This is due to the fact that all the sensitive neurons, which have a higher impact on the neural network output, are mapped to crossbar columns producing the least errors. Thus, we see an overall improvement in the system accuracy.
In the DRS mapping strategy, the crossbars are re-mapped after every mini-batch of the training set, and evaluated on a validation set. The ranking scheme that gives the highest validation accuracy is saved. The best test accuracy we obtained in this case was 87.18%. This scheme performs better than SRS, since it involves multiple remapping steps, enabling it to explore a larger design space. Moreover, the mini-batch approach adds stochasticity, helping the system reach different minima. The system-level accuracies of the proposed approaches are summarized in Fig. 6.
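For reference, the degradation figures follow directly from the reported test accuracies, taking the 89.29% baseline as the reference. The recovered fractions in the last two lines are derived here from those numbers and are not stated in the original results.

```latex
\begin{align*}
\text{as-is mapping:}\quad & 89.29\% - 83.70\% = 5.59\% \approx 5.6\% \\
\text{SRS:}\quad & 89.29\% - 86.37\% = 2.92\% \approx 2.9\% \\
\text{DRS:}\quad & 89.29\% - 87.18\% = 2.11\% \approx 2.1\% \\
\text{SRS recovers:}\quad & (86.37 - 83.70)/5.59 \approx 48\% \text{ of the lost accuracy} \\
\text{DRS recovers:}\quad & (87.18 - 83.70)/5.59 \approx 62\% \text{ of the lost accuracy}
\end{align*}
```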
Let us now discuss the advantages and disadvantages of the two proposed mapping strategies. DRS appears to be superior to SRS for ranking the sensitivity of neurons. However, it comes at a cost. To implement DRS, one requires multiple write steps into the crossbar arrays. This might be unsuitable depending on the eNVM technology being used and whether the system is being deployed on a battery-operated edge device. Writing into most eNVMs is energy-expensive, and the limited endurance of eNVM devices limits the number of updates. On the other hand, SRS is an off-line approach and requires only a one-time write into the crossbars.
VI. CONCLUSION
Resistive crossbars have been shown to effectively accelerate DNNs, owing to their highly parallel analog-domain MVM operation. However, various device and circuit non-idealities in crossbars induce errors in the output, which accumulate across the deeper layers. In this work, we analyzed the line-resistance induced errors in crossbars and developed a statistical model to characterize them. We proposed two algorithms to optimize the crossbar mapping such that the effects of these line-resistances are minimized. In the first approach (SRS), we rank the weights and kernels of a pre-trained DNN using a sensitivity analysis over the entire training dataset, and assign crossbar columns according to the ranks. In the second approach (DRS), we use an iterative process of ranking and remapping the crossbar columns, using a mini-batch of the training dataset in every iteration. We integrated the statistical crossbar model into a system-level framework to analyze the accuracy degradation of a VGG16 network trained on the CIFAR10 dataset. We demonstrated that the accuracy degradation was limited to only 2.9% and 2.1% for SRS and DRS, respectively, compared to a 5.6% degradation for an as-is mapping of weights and kernels to crossbars. We believe that our work brings in another aspect for optimization, which can be used in tandem with existing mitigation techniques to further enhance system performance.
