Resistive memory crossbars can reduce the energy required to perform computations in neural algorithms by three orders of magnitude compared to an optimized digital ASIC [1]. For data-intensive applications, the computational energy is dominated by moving data between the processor, SRAM, and DRAM. Analog crossbars overcome this by allowing data to be processed directly at each memory element. They accelerate three key operations that make up the bulk of the computation in a neural network, as illustrated in Fig 1: vector matrix multiplies (VMM), matrix vector multiplies (MVM), and outer product rank 1 updates (OPU) [2]. For an NxN crossbar, the energy for each operation scales as the number of memory elements, O(N^2) [2]. This is because the crossbar performs its entire computation in one step, charging all the capacitances only once. Thus the CV^2 energy of the array scales as the array size. This is fundamentally better than trying to read or write a digital memory. Each row of any NxN digital memory must be accessed one at a time, resulting in N columns of length O(N) being charged N times and requiring O(N^3) energy to read the memory. Thus an analog crossbar has a fundamental O(N) energy scaling advantage over a digital system. Furthermore, if the read operation is done at low voltage and is therefore noise limited, the read energy can even be independent of the crossbar size, O(1) [2].
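To make the three kernels and the scaling argument concrete, the following is a minimal sketch in plain Python. The function names and the unit-cost energy model are assumptions made for illustration; they are not taken from [1] or [2], and all constants (capacitance, voltage) are set to 1 so that only the scaling with N is visible.

```python
# Illustrative sketch only: plain-Python versions of the three kernels and a
# scaling-only energy count. All names here are assumptions for illustration.

def vmm(x, W):
    """Vector-matrix multiply: y[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def mvm(W, x):
    """Matrix-vector multiply: y[i] = sum_j W[i][j] * x[j]."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def opu(W, a, b, lr=1.0):
    """Rank-1 outer product update: returns W + lr * outer(a, b)."""
    return [[W[i][j] + lr * a[i] * b[j] for j in range(len(b))]
            for i in range(len(a))]

def energy_units(n):
    """Scaling-only energy counts in arbitrary units (all constants set to 1)."""
    analog = n ** 2   # every element's capacitance is charged once
    digital = n ** 3  # N columns of length N, each charged N times
    return analog, digital
```

For example, `energy_units(1024)` gives a digital-to-analog ratio of 1024, which is the O(N) advantage stated in the text. On a crossbar, each of the three functions would complete in a single parallel step rather than the nested loops shown here.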
Many different algorithms can be built on these kernels (VMM, MVM, OPU), including backpropagation [3], sparse coding [2], liquid state machines, restricted Boltzmann machines and more. The key design considerations for a neural algorithm accelerator are that it should reduce the computational energy and delay by orders of magnitude, and that it should be flexible enough to run many different neural algorithms. Many algorithms differ in their implementation only in how the inputs and outputs of a crossbar are processed. An NxN crossbar accelerates O(N^2) operations, while it has only O(N) inputs or outputs. This means that processing an input or output can cost O(N) times more energy than reading or writing a single resistive memory element without significantly increasing the system energy. This key insight allows us to optimize the tradeoff between energy efficiency and system flexibility. A crossbar-based neural core should be used to perform the parallel vector matrix multiply and outer product update, while a more general purpose digital core can be used to process the inputs and outputs of the crossbar, as illustrated in Fig 2. To interface between analog and digital logic, analog to digital converters (ADCs) and digital to analog converters (DACs) are needed at the inputs and outputs of the crossbar, as shown in Fig 2(b). As the energy and delay of these converters increase exponentially with the number of bits, it is important to understand exactly what impact the bit precision has on the energy, delay and accuracy of an algorithm. For instance, in [1] an analog accelerator core is designed with all the required ADCs and DACs for 8 bit, 4 bit and 2 bit precision. It is then compared against an optimized digital accelerator with a multiplier placed directly next to a digital cache. The results of the energy, latency and area analysis are plotted in Fig 3.
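The input/output budget argument above can be checked with a few lines of arithmetic. The sketch below is an assumption-laden illustration, not the energy model of [1]: `e_cell` and `e_io` are hypothetical per-element and per-input/output energies in arbitrary units, and the crossbar is assumed to have N inputs and N outputs.

```python
# Illustrative sketch only: a hypothetical energy budget for one crossbar
# operation. e_cell and e_io are assumed per-event energies in arbitrary units.

def system_energy(n, e_cell, e_io):
    """Total energy: n*n memory-element operations plus n inputs and n outputs."""
    return n * n * e_cell + 2 * n * e_io

def io_overhead(n, io_cost_ratio):
    """Energy added by I/O, as a fraction of the array energy.

    io_cost_ratio is e_io / e_cell, i.e. how many times more expensive one
    input or output is than one memory-element operation.
    """
    return 2 * io_cost_ratio / n
```

With N = 1000 and an input or output that costs 10 times a memory-element operation, `io_overhead(1000, 10)` is 0.02, so the converters and digital processing add about 2% to the array energy. The overhead reaches the same order as the array energy only when the ratio approaches N.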
At 8 bits, the analog accelerator has a 430X gain in energy and a 35X gain in latency. Reducing the ADC and DAC precision from 8 bits to 4 bits can give an additional 10X gain in energy and latency. In order to enable these energy gains, a ) ReRAM cell needs to be developed. Fig 4(a) shows that reducing from 8 bits of precision to 4 bits of precision reduces the accuracy from 98% to 94% when training a 784×300×10 neural network on MNIST [4]. Probabilistically rounding quantized numbers to the nearest value [5] significantly increases the accuracy of the 4 bit system, to 96.6%.
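The difference between ordinary and probabilistic rounding can be sketched as follows. This is a generic illustration of stochastic rounding on a uniform grid, not necessarily the exact scheme of [5]; the function name `quantize`, the unsigned range [0, x_max], and the parameter names are assumptions made for the example.

```python
import math
import random

# Illustrative sketch only: nearest versus probabilistic rounding onto a
# uniform grid. The range [0, x_max] and all names are assumptions.

def quantize(x, bits, x_max=1.0, stochastic=False, rng=random):
    """Quantize x in [0, x_max] onto 2**bits evenly spaced levels."""
    top = 2 ** bits - 1                      # index of the highest level
    step = x_max / top                       # spacing between adjacent levels
    scaled = min(max(x, 0.0), x_max) / step  # clip, then express in level units
    lower = math.floor(scaled)
    frac = scaled - lower                    # distance above the lower level
    if stochastic:
        # Round up with probability frac, so the expected value equals x.
        round_up = rng.random() < frac
    else:
        round_up = frac >= 0.5
    return min(lower + int(round_up), top) * step
```

With 2 bits, nearest rounding always maps 0.3 to 1/3, a fixed bias of about 0.033. Probabilistic rounding returns 0 or 1/3 at random, but the average over many roundings converges to 0.3, so small values are not systematically lost when many quantized numbers are accumulated.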
Any analog device that behaves like a programmable resistor can be used to build the crossbar. For instance, resistive memories will change resistance when a large write voltage is applied, allowing the resistance to be programmed. At lower voltages, the state does not change and the device can be read out. The resistance acts like a weight that modulates the voltage applied to it.
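The read and write behavior described above can be sketched with a toy device model. Everything in this example is an assumption made for illustration: the class name, the threshold voltage, the fixed conductance step, and the conductance limits do not describe any measured device. The model stores a conductance (the inverse of resistance), so a read returns the current I = G·V.

```python
# Illustrative sketch only: a toy programmable resistor. The threshold, step
# size and conductance limits are assumed values, not measured device data.

class ToyResistiveCell:
    def __init__(self, g, g_min=1e-6, g_max=1e-4, v_write=1.0, dg=1e-6):
        self.g = g              # conductance in siemens (1 / resistance)
        self.g_min = g_min      # lowest programmable conductance
        self.g_max = g_max      # highest programmable conductance
        self.v_write = v_write  # voltages at or above this magnitude program the cell
        self.dg = dg            # conductance change per write pulse

    def apply(self, v):
        """Apply v volts and return the current in amperes.

        Small voltages only read the cell. Large voltages also move the
        conductance up (positive v) or down (negative v) by one step.
        """
        if abs(v) >= self.v_write:
            step = self.dg if v > 0 else -self.dg
            self.g = min(self.g_max, max(self.g_min, self.g + step))
        return self.g * v       # Ohm's law: I = G * V
```

For example, `ToyResistiveCell(g=5e-5).apply(0.1)` returns about 5 µA and leaves the state unchanged, while `apply(1.5)` raises the conductance by one step. In an array, the currents of all cells on a column add, which is what produces the weighted sum of a vector matrix multiply.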
Unfortunately, analog devices are noisy and suffer from many non-idealities, including read noise, write noise and write nonlinearity. Analog arrays also suffer from parasitic voltage drops. Furthermore, analog systems tend to have limited bit precision on the inputs and outputs of a crossbar: the fewer bits used, the faster and more energy efficient the analog system is. All of these issues impact the final classification accuracy of a neural network. To compensate for them and take advantage of the large gains in energy and latency enabled by analog systems, neural algorithms need to be designed specifically to overcome the hardware limitations. This requires new co-design tools in which the impact of device-level properties on algorithmic performance can be evaluated, and in which the requirements for new analog devices are driven by algorithmic considerations.
Consequently, we have developed a new open source simulation tool called CrossSim [6] that allows the impact of device-level properties on algorithmic performance to be quantified. For algorithm designers, the tool abstracts the crossbar into three key noisy mathematical operations: VMM, MVM and outer product update (OPU). The impact of different device properties or crossbar designs can then be studied by simply changing a few input parameters. Similarly, device designers can specify their measured device properties as input parameters and quickly see how a new device would impact the accuracy of a neural network algorithm. CrossSim focuses on modelling the crossbar while allowing for arbitrary neuron models. This allows many different algorithms to be built on top of the key computation kernels. The exact result of a matrix vector multiplication can be passed into a user-specified neuron model, or analog to digital converter models can be used to quantize the output. The process of measuring a TaOx device, gathering statistics on it and simulating training is illustrated in Fig 5 [7]. This methodology can be used to evaluate new programmable memory devices [8, 9].
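The idea of abstracting a crossbar into three noisy operations controlled by a few parameters can be sketched as below. This is explicitly not the CrossSim API: the class `NoisyCrossbar`, its parameters, and the Gaussian noise models (additive read noise, write noise proportional to the update size, hard clipping to a weight range) are all hypothetical choices made to keep the example short.

```python
import random

# Illustrative sketch only: a toy stand-in for a crossbar simulator.
# This is NOT the CrossSim API; every name and noise model is an assumption.

class NoisyCrossbar:
    def __init__(self, weights, read_noise=0.0, write_noise=0.0,
                 w_min=-1.0, w_max=1.0, seed=0):
        self.w = [list(row) for row in weights]  # private copy of the weights
        self.read_noise = read_noise             # std. dev. added to each read
        self.write_noise = write_noise           # std. dev. relative to update size
        self.w_min, self.w_max = w_min, w_max    # representable weight range
        self.rng = random.Random(seed)

    def _read(self, w):
        """Return one noisy observation of a stored weight."""
        return w + self.rng.gauss(0.0, self.read_noise)

    def vmm(self, x):
        """Noisy vector-matrix multiply: y[j] = sum_i x[i] * w[i][j]."""
        rows, cols = len(self.w), len(self.w[0])
        return [sum(x[i] * self._read(self.w[i][j]) for i in range(rows))
                for j in range(cols)]

    def mvm(self, x):
        """Noisy matrix-vector multiply: y[i] = sum_j w[i][j] * x[j]."""
        return [sum(self._read(w) * xj for w, xj in zip(row, x))
                for row in self.w]

    def opu(self, a, b, lr=1.0):
        """Noisy rank-1 update w += lr * outer(a, b), clipped to the weight range."""
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                dw = lr * ai * bj
                dw += self.rng.gauss(0.0, self.write_noise * abs(dw))
                self.w[i][j] = min(self.w_max, max(self.w_min, self.w[i][j] + dw))
```

A training loop written against `vmm`, `mvm` and `opu` can then be rerun with different `read_noise` and `write_noise` values to see how accuracy degrades, without changing the algorithm code. A realistic model would replace the Gaussian terms with statistics gathered from measured devices.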
Using CrossSim, new algorithmic techniques that compensate for device nonidealities, such as periodic carry, can be developed. Periodic carry uses multiple devices to represent a weight. Each device represents a digit of increasing significance in a place-value number (e.g. base 10) that represents the weight, while maintaining the benefit of a parallel update [10]. Using periodic carry allows an analog TaOx ReRAM to reach within 1% of numerical accuracy, as shown in Fig 6.
Overall, we see that analog crossbars can provide a fundamental O(N) energy scaling advantage over digital memories. Analyzing a particular 8-bit design shows that an analog accelerator has a 430X gain in energy and a 35X gain in latency over an optimized digital ASIC. In order to model the impact of non-ideal analog devices, we have developed an open source simulation tool, CrossSim, that can be used to quickly compare new devices or test new algorithms. This enabled the development of a new technique called periodic carry, which allows noisy TaOx ReRAM devices that would otherwise train to only 80% accuracy to train to 97% accuracy, just 1% away from the ideal accuracy of 98%.
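The place-value idea behind periodic carry can be sketched for a single weight as follows. The details here are assumptions made for illustration and may differ from the procedure in [10]: updates are applied only to the least significant device, each device saturates at a hypothetical range of ±`d_max`, and a periodic carry step moves the accumulated value into the next more significant device.

```python
# Illustrative sketch only: one weight stored across several devices in a
# place-value representation. Ranges and the carry rule are assumptions.

class PeriodicCarryWeight:
    def __init__(self, n_devices=2, base=10.0, d_max=1.0):
        self.devices = [0.0] * n_devices  # index 0 is the least significant
        self.base = base                  # significance ratio between devices
        self.d_max = d_max                # each device stores [-d_max, d_max]

    def _clip(self, d):
        return min(self.d_max, max(-self.d_max, d))

    def value(self):
        """Weight = sum over devices of device_value * base**position."""
        return sum(d * self.base ** k for k, d in enumerate(self.devices))

    def update(self, delta):
        """Training updates touch only the least significant device."""
        self.devices[0] = self._clip(self.devices[0] + delta)

    def carry(self):
        """Move each device's content into the next more significant device."""
        for k in range(len(self.devices) - 1):
            target = self._clip(self.devices[k + 1] + self.devices[k] / self.base)
            moved = target - self.devices[k + 1]  # amount actually accepted
            self.devices[k + 1] = target
            self.devices[k] -= moved * self.base  # keep the total weight unchanged
```

Applying thirty updates of 0.1 with a carry after every fifth update yields a weight of about 3.0, whereas the same updates without any carry saturate the single low-order device at 1.0. Because ordinary updates touch only one device per weight, they can still be applied to the whole array in parallel.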
Acknowledgment:
This work was funded by Sandia National Laboratories Hardware Acceleration of Adaptive Neural Algorithms (HAANA) Grand Challenge Laboratory Directed Research and Development (LDRD) Project. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. 
