Introduction -As transistors approach fundamental limits and Moore's law slows down, new devices and architectures are needed to enable continued performance gains. New approaches based on RRAM (resistive random access memory) or memristor crossbars can enable the processing of large amounts of data [1, 2]. One of the most promising applications for RRAM crossbars is brain-inspired, or neuromorphic, computing [3, 4].
We will show that performing certain analog computations on a crossbar provides fundamental energy scaling advantages over a digital SRAM (or any digital memory) based implementation when doing finite precision computations. There are two key computational kernels that are more efficient on a crossbar. First, the crossbar can perform a "parallel read" or a vector matrix multiply as illustrated in Fig 1. Second, the crossbar can perform a "parallel write" or a rank 1 update where every weight is programmed based on the outer product of the row and column inputs as shown in Fig 2. These two kernels form the basis of most neuromorphic algorithms.
Noise Limited Parallel Read -An RRAM crossbar can be used to perform a parallel analog vector matrix multiply as illustrated in Fig 1. Each column of the crossbar performs a vector dot product: $\sum_i w_{ij} x_i$ for column $j$. The inputs, $x_i$, are represented by either an analog voltage or the length of a voltage pulse. The weights, $w_{ij}$, are represented by the RRAM conductances. The multiplication is performed by using $I = G \times V$ and the sum is performed by simply summing currents (or integrating the total current if the input, $x_i$, is encoded in the pulse length).
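The column-wise dot product can be sketched in a few lines of Python. This is an idealized illustration only: the function name `crossbar_read` and the conductance and voltage values are hypothetical, and wire resistance, noise and device nonlinearity are ignored.

```python
def crossbar_read(conductances, voltages):
    """Return the column currents I_j = sum_i G[i][j] * V[i].

    conductances: list of rows, G[i][j] in siemens (the weights w_ij)
    voltages:     list of row input voltages V[i] in volts (the inputs x_i)
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for i, row in enumerate(conductances):
        for j in range(n_cols):
            # Ohm's law does the multiply; the shared column wire does the sum.
            currents[j] += row[j] * voltages[i]
    return currents


# Hypothetical 2x2 crossbar with microsiemens-scale conductances.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
V = [0.1, 0.2]
I = crossbar_read(G, V)
print(I)
```

In the physical crossbar all of these multiply-accumulate steps happen at once, which is the property the energy argument relies on; the nested loop here only mirrors the arithmetic.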
The absolute minimum energy to read the crossbar will be determined by the thermal noise in each RRAM. For many computations we only need to know the result with some finite precision. Taking advantage of this allows the minimum energy to compute the vector dot product to be O(1), independent of the size of the vector. (On the other hand, if the full precision of every input is needed, the analog energy will scale worse than digital.)
To understand the scaling advantage, consider the minimum energy required to measure the current through N resistors with some signal to noise ratio (SNR). The SNR corresponds to some fixed precision. The strength of the signal (total current) scales with $N$, while the thermal noise scales as $\sqrt{N \, \Delta f \, G_o k_b T}$, where $\Delta f$ is the operation speed, $G_o$ is the conductance of each resistor, $k_b$ is the Boltzmann constant and $T$ is the temperature. Consequently, if we double $N$, we can also double the speed, $\Delta f$, and keep the same SNR ($\mathrm{SNR} \propto N / \sqrt{N \, \Delta f} = \sqrt{N / \Delta f}$). By doubling $N$ we double the current and therefore power through the resistors, but doubling the speed halves the measurement time and gives the same total energy. Consequently, the energy is independent of $N$ for a given SNR.
The energy to read the resistors at voltage $V$ is: Energy = Power per resistor × N Resistors × Time $= V^2 G_o \times N \times (1 / \Delta f)$. The operation speed, $\Delta f$, is determined by the thermal noise. We need to integrate the current for long enough to get the SNR we want. Using the thermal noise, $\sqrt{4 k_b T N G_o \Delta f}$, to solve for the speed in terms of the SNR, $\Delta f = N G_o V^2 / (4 k_b T \, \mathrm{SNR}^2)$, and plugging it back in for the energy gives: Energy $= 4 k_b T \, \mathrm{SNR}^2$.
Thus the total noise limited energy is the same regardless of the crossbar size! As we increase the number of resistors and therefore signal strength, we can measure each device faster and with less precision to get the same precision on the output.
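As a numerical check of this argument, the sketch below solves for the speed that gives a target SNR and then evaluates the read energy for several values of N. It assumes the Johnson-Nyquist current noise $\sqrt{4 k_b T N G_o \Delta f}$ and a measurement time of $1/\Delta f$; the function name, the 1 µS device conductance and the 300 K temperature are illustrative assumptions, not values from the text.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K


def noise_limited_read(n, g0, v, snr, temp=300.0):
    """Return (bandwidth_hz, achieved_snr, energy_j) for reading n resistors.

    Assumes thermal current noise sqrt(4*k_b*T*n*g0*bandwidth) and a
    measurement time of 1/bandwidth (both assumptions of this sketch).
    """
    signal = n * g0 * v                                      # total current, A
    bandwidth = n * g0 * v**2 / (4 * K_B * temp * snr**2)    # speed for target SNR, Hz
    noise = math.sqrt(4 * K_B * temp * n * g0 * bandwidth)   # rms noise current, A
    energy = v**2 * g0 * n / bandwidth                       # power x time, J
    return bandwidth, signal / noise, energy


for n in (100, 200, 1000):
    bw, achieved, energy = noise_limited_read(n, g0=1e-6, v=0.1, snr=100)
    print(f"N={n:5d}  bandwidth={bw:.3e} Hz  SNR={achieved:.1f}  energy={energy:.3e} J")
```

Under these assumptions the allowed bandwidth grows linearly with N while the energy stays at $4 k_b T \, \mathrm{SNR}^2$, roughly 1.66e-16 J for an SNR of 100 at 300 K.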
Capacitance Limited Read -The previous analysis is only valid when the read energy is limited by noise and not capacitance. In particular, this is when the noise limited energy is greater than the energy to charge the RRAM and wire capacitance: $4 k_b T \, \mathrm{SNR}^2 > N C V^2$, where $C$ is the capacitance per RRAM (device plus wire) along a column.
If we assume we have a 1000 × 1000 crossbar, want an SNR of 100, and have an RRAM dominated capacitance of 18 aF (20 nm × 20 nm area, 5 nm thick capacitor with $\varepsilon_r$ of 25), we would need to perform the read at 100 mV or less to be noise limited. If we use a larger crossbar or a higher voltage due to access devices, the energy will instead be capacitance limited.
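The numbers in this example can be reproduced with a short calculation. The sketch below assumes a parallel-plate capacitor, a noise limited energy of $4 k_b T \, \mathrm{SNR}^2$ per column, a charging energy of $N C V^2$ per column, and a temperature of 300 K; the last three are assumptions of this sketch rather than values stated in the text, and the function names are hypothetical.

```python
import math

EPS_0 = 8.8541878128e-12  # vacuum permittivity, F/m
K_B = 1.380649e-23        # Boltzmann constant, J/K


def parallel_plate_capacitance(area_m2, thickness_m, eps_r):
    """Capacitance of an ideal parallel-plate capacitor, in farads."""
    return EPS_0 * eps_r * area_m2 / thickness_m


def max_noise_limited_voltage(n, cap_f, snr, temp=300.0):
    """Largest read voltage for which 4*k_b*T*SNR^2 >= n*C*V^2 (assumed form)."""
    return math.sqrt(4 * K_B * temp * snr**2 / (n * cap_f))


cap = parallel_plate_capacitance(20e-9 * 20e-9, 5e-9, 25)
v_max = max_noise_limited_voltage(1000, cap, 100)
print(f"C = {cap * 1e18:.1f} aF, V_max = {v_max * 1e3:.0f} mV")
```

With these assumptions the capacitance comes out near 18 aF and the crossover voltage just under 100 mV, consistent with the figures quoted above.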
For a capacitance limited read energy, the crossbar will still be O(N) times more energy efficient than an SRAM memory. The most energy efficient way to organize a digital memory for performing vector matrix multiply is to have the matrix stored in an on chip SRAM array. To perform a vector matrix multiply, we have to read out one row/wordline at a time. This means that the columns/bitlines and associated circuitry will need to be charged N times for N rows. In an analog crossbar, everything can be done in parallel and so the columns/bitlines and associated circuitry are only charged once. Thus the crossbar is O(N) times more energy efficient.
For each row in an N × N SRAM array, there are N memory cells. To read each cell in a row, we need to charge each bitline/column and run the read electronics/sense amp for each bitline. The energy to charge each bitline is proportional to the capacitance and therefore the length of the bitline: Energy ∝ N rows × N cells per row × N bitline length = $O(N^3)$. In an analog RRAM crossbar, all of the rows are charged in parallel and so the total energy to charge N rows plus N columns (with wire capacitances proportional to N) scales as $O(N^2)$. The crossbar is therefore O(N) times more energy efficient than an SRAM memory.
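The counting argument can be made concrete with a toy cost model. The sketch below counts wire-charging events in arbitrary units, where one unit is the energy to charge one cell-length of wire; the function names are hypothetical and all constant factors (sense amps, drivers, actual capacitance per length) are ignored.

```python
def sram_read_cost(n):
    """Charging cost for a vector matrix multiply from an n x n SRAM array.

    n rows are read one at a time; each read charges n bitlines,
    and each bitline has length (capacitance) proportional to n.
    """
    return n * n * n


def crossbar_read_cost(n):
    """Charging cost for one parallel read of an n x n crossbar.

    n rows plus n columns are each charged once, and each wire
    has length (capacitance) proportional to n.
    """
    return (n + n) * n


for n in (10, 100, 1000):
    ratio = sram_read_cost(n) / crossbar_read_cost(n)
    print(f"N={n:5d}  SRAM={sram_read_cost(n):.1e}  crossbar={crossbar_read_cost(n):.1e}  ratio={ratio:.0f}")
```

In this simplified model the ratio grows as N/2, which is the O(N) advantage described above.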
Parallel Write Energy -The energy scaling to write an SRAM array will be identical to the energy scaling to read it. N rows must each be written one at a time and each row has N cells. When writing each cell, the energy to charge the bitline will be proportional to N. Consequently, the energy to write the array will scale as $O(N^3)$.
On an analog crossbar, we can perform a "parallel write" or a rank 1 update to change every weight based on the outer product of the row and column inputs. An example of a parallel write is illustrated in Fig 2. The goal is to adjust the weight, $W_{ij}$, by the product of the inputs on the row, $x_i$, and column, $y_j$, of the weight: $W_{ij} \rightarrow W_{ij} + x_i \times y_j$.
An analog value for the row inputs, $x_i$, can be encoded by the length of the pulse. The longer the pulse the more the weight will change. The analog column inputs, $y_j$, can be encoded in the height of the pulse in order to achieve a multiplicative effect. The larger the voltage the more the weight will change for a given pulse duration. The exact write voltages will need to be adjusted to account for any nonlinearities in the device.
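The intended effect of the parallel write can be sketched as an outer-product update. This is an idealized model that assumes each device changes linearly with pulse length times pulse height; real devices are nonlinear, as noted above, and the function name `rank1_update` is hypothetical.

```python
def rank1_update(weights, x, y):
    """Return new weights W[i][j] + x[i] * y[j] (idealized rank 1 update).

    x: row inputs, standing in for pulse lengths
    y: column inputs, standing in for pulse heights
    """
    return [
        [weights[i][j] + x[i] * y[j] for j in range(len(y))]
        for i in range(len(x))
    ]


W = [[0.0, 0.0],
     [0.0, 0.0]]
W_new = rank1_update(W, x=[1.0, 2.0], y=[3.0, 4.0])
print(W_new)  # [[3.0, 4.0], [6.0, 8.0]]
```

On the crossbar every device sees its own row and column pulse simultaneously, so all N × N weights are updated in one step rather than one row at a time.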
If the write energy is limited by the capacitance of the lines, the energy formula will be the same as in the read case and will scale as $O(N^2)$, so the crossbar is again O(N) times more energy efficient than an SRAM memory. However, each RRAM will also typically require a fixed amount of current to program. If the energy is limited by the program current, the total energy will be given by the number of RRAMs times the energy to program one: Energy = $N^2$ × (energy to program one RRAM).
If the write current or time is too large, it is possible that there will be a large constant factor that would make the energy scaling irrelevant. Fortunately, energies to fully write an RRAM cell as low as 6 fJ have been demonstrated [5]. Furthermore, since we are operating the RRAM as an analog memory with many levels, we do not want to fully write the cell. Rather, we only want to change the state by 1% or less, resulting in a corresponding reduction in the write energy per RRAM. In this case the RRAM energy will be on the same order of magnitude as the energy to charge the wires. (1% of 6 fJ is 60 aJ. The wire capacitance per RRAM in a scaled technology node is likely to be on the order of tens of attofarads [6]. At 1 V, that corresponds to tens of attojoules as well.)
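The order-of-magnitude comparison can be written out as a short calculation. The 6 fJ full-write energy, the 1% state change and the 1 V write voltage come from the text; the 30 aF wire capacitance is a hypothetical stand-in for "tens of attofarads", the charging energy is taken as $C V^2$, and summing the two contributions over a 1000 × 1000 array is only an illustration.

```python
def partial_write_energy(full_write_j, fraction):
    """Energy to change the device state by a fraction of a full write."""
    return full_write_j * fraction


def wire_charge_energy(capacitance_f, voltage_v):
    """Order-of-magnitude energy to charge a wire segment, taken as C * V^2."""
    return capacitance_f * voltage_v**2


def array_write_energy(n, per_device_j):
    """Total energy when every device in an n x n array costs per_device_j."""
    return n * n * per_device_j


rram = partial_write_energy(6e-15, 0.01)   # 1% of 6 fJ
wire = wire_charge_energy(30e-18, 1.0)     # hypothetical 30 aF at 1 V
total = array_write_energy(1000, rram + wire)
print(f"RRAM: {rram * 1e18:.0f} aJ, wire: {wire * 1e18:.0f} aJ, array: {total * 1e12:.0f} pJ")
```

Both per-device terms land in the tens of attojoules, so under these assumptions neither the program current nor the wire capacitance dominates by a large factor.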
