The conventional von Neumann architecture has been revealed as a major performance and energy bottleneck for emerging data-intensive applications. The decades-old idea of leveraging in-memory processing to eliminate substantial data movement has returned and led to extensive research activities. The effectiveness of in-memory processing relies heavily on memory scalability, which cannot be satisfied by traditional memory technologies. Emerging non-volatile memories (eNVMs), which possess appealing qualities such as excellent scaling and low energy consumption, have therefore been heavily investigated for realizing in-memory processing architectures. In this paper, we summarize the recent research progress in eNVM-based in-memory processing from various aspects, including the adopted memory technologies, the location of the in-memory processing in the system, the supported arithmetic, and the targeted applications.
INTRODUCTION
In the current era of data explosion, deep neural network (DNN) models are used in various applications that explore a large amount of information in different data formats. Executing such data-intensive applications on conventional von Neumann systems causes massive data movement between processors and memory elements and induces significant performance and energy overheads. Decades after it was first proposed [1, 2], the concept of in-memory processing has returned and inspired many innovative solutions. Unlike the conventional computing paradigm, where data and computing are decoupled, an in-memory processing architecture pulls data close to the processing elements to reduce the amount of data movement and minimize the computation cost.
* This author is supported by NAS Associate Fellowship Award.
Benefiting from the recent advances in processing integration and memory technologies, many in-memory processing architectures have been developed. These attempts can be categorized into three groups: (1) processing close to memory, also known as near-data processing (NDP); (2) processing in traditional memory; and (3) processing in emerging non-volatile memory (eNVM).
Three-dimensional (3D) integration is a key technology in enabling NDP. For example, 3D DRAM is constructed by vertically stacking a set of DRAM dies on top of a CMOS logic die, connected by through-silicon vias (TSVs). A number of 3D DRAM-based NDP platforms exploit the logic die to perform simple but common operations in data-intensive applications [16, 29-34]. TSVs substantially shorten the distance between the logic and memory dies, increasing the data bandwidth and improving overall performance.
Processing in memory directly performs computations in memory arrays so as to reduce data movement to a large extent. Prior works exploit traditional memory technologies such as DRAM and SRAM to complete the frequent operations appearing in data-intensive applications [35-39]. However, as the scaling of traditional memory technologies is approaching the physical limit, it is difficult to provide sufficient computing and storage capacity for data-intensive applications. Moreover, the large cell size and high leakage power of traditional memory lead to large design area and energy consumption [40-42].
Table 2 (excerpt): ReRAM-based designs.
Work | Year | Location | Design level | Function | Application
[24] | 2016 | Co-processor | System | MVM | Boltzmann machine
ISAAC [25] | 2016 | Co-processor | System | MVM | CNN
PipeLayer [26] | 2017 | Co-processor | System | MVM | CNN
AtomLayer [27] | 2018 | Co-processor | System | MVM | CNN
GraphR [28] | 2018 | Co-processor | System | MVM | Graph
Note: MVM - Matrix-Vector Multiplication; DNN - Deep Neural Network.
In recent years, eNVMs that demonstrate excellent scaling and near-zero leakage power have emerged as promising candidates. Table 1 compares traditional DRAM and SRAM with a few popular eNVM technologies, including spin-transfer torque RAM (STT-RAM), phase-change memory (PCM) and resistive RAM (ReRAM). Among these eNVM technologies, STT-RAM shows the fastest access speed and the lowest energy consumption, although its cell area is relatively large [41, 43]. Both PCMs [44, 45] and ReRAMs [46-48] can store multiple logic bits in a single memory cell, demonstrating superior density with technology scaling. In addition, they inherently support parallel data processing, which is uniquely beneficial to the aforementioned data-intensive applications such as DNNs. For this reason, extensive research efforts have been devoted to building in-memory processing using eNVMs.
In this paper, we survey the recent progress in developing in-memory processing by leveraging the three mainstream eNVM technologies (STT-RAM, PCM and ReRAM). We present and discuss the differences and similarities of these works in terms of the supported functions, the location in the architecture, the targeted applications, etc.
DESIGN OVERVIEW
Table 2 presents a summary of the latest eNVM-based in-memory processing designs reviewed in this paper. Each of them is classified according to the following five main categories.
• Type - The type of memory technology adopted in a work: STT-RAM, PCM or ReRAM. Since the features of these memories differ from one another, the selection of memory type determines the types of computation to some extent.
• Location - Where the memory is located in the computing architecture: cache, main memory or scratchpad. eNVMs can be used as storage and/or computing units. Some works treat the eNVM only as a co-processor, while others also designate its location in the memory hierarchy.
• Design Level - The techniques in these works are carried out at different levels, such as device, circuit or system. Some works propose novel writing methods to perform calculations in memory cells [13-16]. Some techniques are achieved through modifications of the readout or write circuits associated with memory arrays. System-level techniques provide the interface and connection between the memory array and the operating system so that the processing can be controlled by applications.
• Functions - The function types in these works can be divided into the following groups: logic, arithmetic, associative, vector and matrix-vector multiplication (MVM). A type of operation can be realized by different eNVMs, but the implementation details could vary significantly.
• Application - The targeted applications range from specific to generic. Some works carry out only the core operation of a number of applications such as encryption, database, and CNN. A few works complete simple and basic operations that are not designated for any targeted application.
Figure 1 depicts a high-level view of the classifications based on the location of the in-memory processing in the system (eNVM core, cache, scratchpad memory, or main memory) and the type of supported functions (bitwise logic, arithmetic, or associative).
ENVM-BASED IN-MEMORY PROCESSING
3.1 Spin-Transfer Torque RAM (STT-RAM)
An STT-RAM cell is built on a magnetic tunnel junction (MTJ) device, which presents two resistance states depending on the relative magnetization orientation of the fixed and free ferromagnetic layers. Compared to other resistive memory devices, STT-RAM has faster write speed, lower write energy, and higher write endurance (see Table 1). Because the resistance difference between the two states of an MTJ is limited, it is hard to implement multi-bit storage in STT-RAM cells. Therefore, most STT-RAM-based in-memory processing designs focus on bitwise operations.
3.1.1 Associative and combinational logic. Early works exploit the high density of STT-RAM to implement associative computing and combinational logic [7, 8]. These works achieve reduced cost relative to traditional memory technologies. For instance, Guo et al. [7] employ STT-RAM to construct look-up tables (LUTs) and realize computation by cascading multiple LUTs. As such, the floating-point units are replaced by STT-RAM-based LUTs. The work demonstrates the power and performance improvements brought by STT-RAM technology compared to a multi-core CPU platform.
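The idea of realizing logic by cascading LUTs can be sketched in software. The following is only an illustration under stated assumptions, not the design of Guo et al. [7]: the 4-input parity table and the names `PARITY4_LUT` and `parity16` are hypothetical, and a simple combinational function stands in for the floating-point units mentioned above.

```python
from itertools import product

# Hypothetical 4-input LUT: every input pattern maps to a stored output bit.
# Here the stored function is the parity (XOR) of the four inputs.
PARITY4_LUT = {bits: sum(bits) & 1 for bits in product((0, 1), repeat=4)}


def parity16(word: int) -> int:
    """Parity of a 16-bit word computed by two cascaded LUT stages."""
    bits = [(word >> i) & 1 for i in range(16)]
    # Stage 1: four LUTs, each looking up the parity of one 4-bit group.
    stage1 = tuple(PARITY4_LUT[tuple(bits[i:i + 4])] for i in range(0, 16, 4))
    # Stage 2: one LUT combines the four intermediate results.
    return PARITY4_LUT[stage1]
```

No logic gate is evaluated at run time; every stage is a memory read, which is what makes a dense memory attractive for this style of computing.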
3.1.2 Bitwise logic operations. Recent works [9-11, 49] explore the use of STT-RAM in accomplishing bitwise logic operations. More advanced operations are then built on top of the basic logic functions realized in STT-RAM. Kang et al. [9], STT-CiM [10] and HieIM [11] are introduced here as examples.
Kang et al. [9] propose an STT-RAM chip that can both process bitwise logic and store information. The operands reside in different rows of the same array. By simultaneously activating multiple rows, the bitwise operations are enabled and the results are obtained through modified readout periphery circuits, i.e. sense amplifiers (SAs). The logic operation to perform can be selected by modifying the bit in a control row. The chip can benefit applications that involve intensive bitwise logic operations, such as bitmap processing. This work focuses on circuit design and functional evaluation.
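A behavioral sketch may help to see why activating two rows at once yields a logic result. It is a generic model of multi-row sensing, not the circuit of [9]: the conductance values, the read voltage, the mapping of bit 1 to the high-conductance state, and the two reference currents are all assumptions made for illustration.

```python
# Assumed device parameters (arbitrary units); bit 1 is mapped to high conductance.
G_LOW, G_HIGH = 1.0, 2.5
V_READ = 1.0


def cell_conductance(bit: int) -> float:
    return G_HIGH if bit else G_LOW


def bitline_current(bit_a: int, bit_b: int) -> float:
    """Current on a bitline when the two rows holding bit_a and bit_b are activated together."""
    return V_READ * (cell_conductance(bit_a) + cell_conductance(bit_b))


# Three current levels are possible: "00", "01/10" and "11".
# One reference sits between each pair of neighbouring levels.
HALF_STEP = V_READ * (G_HIGH - G_LOW) / 2
I_REF_OR = V_READ * 2 * G_LOW + HALF_STEP            # separates "00" from the rest
I_REF_AND = V_READ * (G_LOW + G_HIGH) + HALF_STEP    # separates "11" from the rest


def sense(bit_a: int, bit_b: int, op: str) -> int:
    """Compare the bitline current against the reference(s) selected by `op`."""
    current = bitline_current(bit_a, bit_b)
    or_out = int(current > I_REF_OR)
    and_out = int(current > I_REF_AND)
    if op == "OR":
        return or_out
    if op == "AND":
        return and_out
    if op == "XOR":
        return or_out & (1 - and_out)
    raise ValueError(f"unsupported operation: {op}")


def bitwise(row_a, row_b, op):
    """Apply the operation to every column of two rows in parallel."""
    return [sense(a, b, op) for a, b in zip(row_a, row_b)]
```

For example, `bitwise([0, 0, 1, 1], [0, 1, 0, 1], "AND")` evaluates to `[0, 0, 0, 1]`; changing only the reference selection turns the same read into OR or XOR.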
STT-CiM [10] extends the supported functions from bitwise logic to basic arithmetic and vector operations. At the circuit level, the row decoders and SAs are enhanced to enable logic functions. Additional logic gates are integrated into the sense circuits to realize arithmetic operations. Two row decoders are used to activate the multiple rows where the operands are located. The connected multiplexer circuits are controlled by select signals to determine the desired operation type. When processing a vector function, the vector outputs from the STT-RAM arrays are fed into reduction units and reduced to a scalar value. At the array level, the authors analyze the impact of process variation on the computing results and deploy an error correction scheme to enhance reliability. At the architectural and system levels, this work extends the instruction set to convey the operation commands from applications to the memory array. Through these extensions across multiple levels, STT-CiM can be placed next to the processor as an on-chip scratchpad memory and applied to various data-intensive applications such as text processing, data compression, and digit recognition.
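The step from bitwise sensing to arithmetic, followed by a reduction unit, can be sketched as below. This is a minimal software analogy, assuming that the sense circuits deliver per-bit AND and XOR of two activated rows and that a small carry chain is added next to them; the function names `cim_add` and `cim_vector_sum` are hypothetical and do not come from STT-CiM [10].

```python
def cim_add(word_a: int, word_b: int, width: int = 8) -> int:
    """Add two unsigned words using only per-bit AND/XOR outputs plus a carry chain."""
    and_bits = word_a & word_b   # stands in for the AND output of the sense circuit
    xor_bits = word_a ^ word_b   # stands in for the XOR output of the sense circuit
    carry = 0
    result = 0
    for i in range(width):
        generate = (and_bits >> i) & 1
        propagate = (xor_bits >> i) & 1
        result |= (propagate ^ carry) << i
        carry = generate | (propagate & carry)
    return result  # the last carry is dropped, i.e. the sum is modulo 2**width


def cim_vector_sum(vec_a, vec_b, width: int = 8) -> int:
    """Element-wise addition followed by a reduction that folds the vector into a scalar."""
    partial = [cim_add(a, b, width) for a, b in zip(vec_a, vec_b)]
    return sum(partial)  # role of the reduction unit
```

With these definitions, `cim_vector_sum([1, 2, 3], [4, 5, 6])` returns `21`, so only one scalar has to leave the memory instead of the whole vector.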
HieIM [11] also implements bulk bitwise operations in STT-RAM arrays. Unlike the above two designs, HieIM is more flexible and allows computation between any cells within the same array. Moreover, a data encryption engine based on HieIM is demonstrated, which consumes 51.5% less energy than its CMOS-based ASIC counterpart.
3.1.3 Neural networks. Thanks to the evolution of DNN models, convolution in a binary convolutional neural network (BCNN) can be replaced with bitwise operations such as XNOR and bit-count [50]. Pan et al. [12] build a BCNN accelerator based on multilevel STT-RAM (i.e. two-bit cells). The STT-RAM arrays are programmable and can be switched between memory mode and bitwise operation mode. Thereby, multi-functional STT-RAM arrays are exploited to process convolutional layers. In this design, the two bits of one cell are associated with inputs and weights, respectively, so that the logic and add operations are carried out within one cell. The accelerator integrates the computational STT-RAM array with an auxiliary processing unit (APU), which processes the other computational layers in CNNs. Figure 2 illustrates the computing array and execution flows. Compared to other eNVM-based counterparts, the STT-RAM-based accelerator achieves significant performance and energy improvement.
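The replacement of multiply-accumulate by XNOR and bit-count can be checked with a few lines of code. The sketch assumes the common encoding in which bit 1 stands for +1 and bit 0 for -1; the function names are illustrative and not taken from [12] or [50].

```python
def binary_dot(inputs, weights) -> int:
    """Dot product of two {-1, +1} vectors stored as bits (1 -> +1, 0 -> -1)."""
    n = len(inputs)
    # XNOR is 1 where the two bits agree; the bit-count adds up the agreements.
    matches = sum(1 - (x ^ w) for x, w in zip(inputs, weights))
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n


def reference_dot(inputs, weights) -> int:
    """Same result computed with explicit +1/-1 multiplications, for comparison."""
    to_signed = lambda bit: 1 if bit else -1
    return sum(to_signed(x) * to_signed(w) for x, w in zip(inputs, weights))
```

For instance, `binary_dot([1, 0, 1, 1], [1, 1, 0, 1])` gives `0`, the same value as `reference_dot` on these inputs, without any multiplication.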
3.2 Phase Change Memory (PCM)
PCM can store more than one bit of data per cell by dividing the overall resistance range into a few levels. Moreover, the cell conductance increases linearly with the number of programming (more precisely, SET) pulses [51]. These key attributes of PCM are exploited to implement more complicated computations, such as the training of neural networks.
3.2.1 Logic and basic arithmetic operations. Phase change material exhibits different physical behaviors under various pulse amplitudes or durations, which have been exploited to realize computation. For example, Cassinerio et al. [13] leverage the resistance transition of phase change material and propose an initialize-compute-confirm scheme to implement Boolean logic operations within a single PCM cell. Wright et al. [14, 15] and Hosseini et al. [16] exploit the accumulative behavior of PCM material during programming and build an accumulator for arithmetic computations such as addition, subtraction, and parallel factoring. In these works, the partial and final results can be stored where the computations are carried out. A single operation takes multiple cycles to complete because the input operands enter sequentially. PCM cells are used as substitutes for logic gates without targeting specific applications of interest.
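The accumulation-based arithmetic can be pictured with a toy model. It is a sketch under strong assumptions, not the device behavior reported in [14-16]: every pulse is taken to advance the cell by one identical step, the cell is assumed to switch after a fixed number of pulses (`pulses_to_switch`, chosen as 10 here), and a switching event is counted as a carry before the cell is reset.

```python
class PCMAccumulator:
    """Toy accumulator: each pulse moves the cell one step toward its switching threshold."""

    def __init__(self, pulses_to_switch: int = 10):
        self.pulses_to_switch = pulses_to_switch  # assumed base of the counter
        self.state = 0     # pulses accumulated since the last reset
        self.carries = 0   # number of threshold crossings observed so far

    def apply_pulses(self, count: int) -> None:
        for _ in range(count):
            self.state += 1
            if self.state == self.pulses_to_switch:
                # The cell switches: record a carry and reset the cell.
                self.carries += 1
                self.state = 0

    def value(self) -> int:
        return self.carries * self.pulses_to_switch + self.state


def pcm_add(a: int, b: int, base: int = 10) -> int:
    """Add two non-negative integers by feeding them sequentially as pulse trains."""
    acc = PCMAccumulator(base)
    acc.apply_pulses(a)  # first operand
    acc.apply_pulses(b)  # second operand enters afterwards, on the same cell
    return acc.value()
```

`pcm_add(7, 8)` returns `15`: the cell switches once (one carry) and is left five steps into the next cycle, so the result is stored where it was computed, at the price of one cycle per pulse.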
Sebastian et al. [20] exploit the physical dynamics of PCM material and propose computational PCM to perform temporal correlation detection between stochastic binary processes. Each process is assigned to a PCM device and encoded into SET pulses whose amplitude or duration is proportional to the instantaneous sum of all processes. By comparing the conductance of each device, the correlated processes can be identified.
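A small simulation illustrates why the devices of correlated processes end up with higher conductance. It is a sketch, not the experiment of [20]: the conductance is assumed to grow linearly with pulse strength, the synthetic processes (four correlated, twelve independent) and the midpoint threshold are illustrative choices, and all function names are hypothetical.

```python
import random


def make_processes(n_corr=4, n_uncorr=12, n_steps=2000, p=0.1, seed=0):
    """Synthetic binary processes; the first n_corr mostly copy a shared hidden event."""
    rng = random.Random(seed)
    procs = [[0] * n_steps for _ in range(n_corr + n_uncorr)]
    for t in range(n_steps):
        common = int(rng.random() < p)
        for i in range(n_corr):
            copies_common = rng.random() < 0.9
            procs[i][t] = common if copies_common else int(rng.random() < p)
        for i in range(n_corr, n_corr + n_uncorr):
            procs[i][t] = int(rng.random() < p)
    return procs


def accumulate_conductance(processes):
    """One device per process; a pulse is applied only when that process fires."""
    conductance = [0.0] * len(processes)
    for t in range(len(processes[0])):
        total = sum(p[t] for p in processes)  # instantaneous sum of all processes
        for i, p in enumerate(processes):
            if p[t]:
                conductance[i] += total       # pulse strength proportional to the sum
    return conductance


def detect_correlated(conductance):
    """Devices above the midpoint between the extreme conductances are flagged."""
    threshold = (max(conductance) + min(conductance)) / 2
    return [i for i, g in enumerate(conductance) if g > threshold]
```

Correlated processes tend to fire together, so their pulses arrive when the instantaneous sum is large, and their devices accumulate conductance faster than those of the independent processes.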
3.2.2 Matrix-vector multiplications & machine learning. Arranged in a crossbar structure, PCM devices can process analog matrix-vector multiplications, which have been intensively investigated [4, 18, 19, 21, 22, 52]. Each element of a matrix is mapped to the conductance of a PCM device. With a PCM crossbar representing the matrix, the vector is encoded into the amplitudes or durations of voltage pulses applied along the rows. The currents along the columns are then proportional to the results. Positive and negative elements of the matrix can be stored in a pair of PCM devices. When input signals are applied to the columns instead, the currents along the rows give the product of the vector with the transposed matrix. A 3-layer perceptron using PCMs, trained with backpropagation on the MNIST database of handwritten digits, can achieve accuracy comparable to the software model [18, 19]. Moreover, PCM-based in-memory processing has been demonstrated for other complicated tasks, such as compressed sensing recovery [22] and transfer learning [52].
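The crossbar computation described above can be written out as an idealized model. The sketch ignores device noise, wire resistance and converter resolution; the differential-pair mapping, the scaling to a maximum conductance `g_max`, and the function names are assumptions made for illustration, and the matrix is assumed to contain at least one non-zero element.

```python
def program_crossbar(matrix, g_max: float = 1.0):
    """Map each signed element to a (G_plus, G_minus) pair of device conductances."""
    scale = g_max / max(abs(v) for row in matrix for v in row)
    g_plus = [[max(v, 0) * scale for v in row] for row in matrix]
    g_minus = [[max(-v, 0) * scale for v in row] for row in matrix]
    return g_plus, g_minus, scale


def drive_rows(g_plus, g_minus, voltages):
    """Voltages on the rows; column current I_j = sum_i V_i * (G+_ij - G-_ij)."""
    n_cols = len(g_plus[0])
    return [
        sum(v * (g_plus[i][j] - g_minus[i][j]) for i, v in enumerate(voltages))
        for j in range(n_cols)
    ]


def drive_columns(g_plus, g_minus, voltages):
    """Voltages on the columns; row current I_i = sum_j (G+_ij - G-_ij) * V_j."""
    return [
        sum(v * (gp - gm) for v, gp, gm in zip(voltages, row_p, row_m))
        for row_p, row_m in zip(g_plus, g_minus)
    ]
```

Dividing the measured currents by `scale` recovers the numerical result. Driving the rows and driving the columns use the same programmed devices, which is why the forward and transposed products needed by backpropagation come from one array.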
3.2.3 System-level bitwise operations. Pinatubo [17] proposes a mechanism to perform bulk bitwise operations in PCM main memory. The read circuits and write drivers are modified so that Pinatubo can process logic functions. The operands are all stored in different rows of the memory arrays. Depending on where the operands reside, Pinatubo has three computation modes: intra-subarray, inter-subarray and inter-bank (Figure 3). The rows associated with the operands are activated simultaneously when computing. Sense amplifiers are enhanced with additional reference circuits to obtain the logic outputs, which are sent to the I/O bus or to another memory row. To bridge the operating system and the logic operations inside PCM, Pinatubo develops a programming model and run-time support to ensure that operands are allocated to different memory rows. The design achieves 1.12× overall speedup and 1.11× overall energy saving over a conventional CPU.
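How the computation mode follows from operand placement can be expressed as a short decision routine. This is a hypothetical sketch: the `RowAddr` address format and the function `computation_mode` are invented for illustration and do not reflect the actual address mapping or run-time interface of Pinatubo [17].

```python
from typing import NamedTuple


class RowAddr(NamedTuple):
    """Hypothetical physical location of one operand row."""
    bank: int
    subarray: int
    row: int


def computation_mode(operands) -> str:
    """Pick the computation mode from where the operand rows reside."""
    if len(set(operands)) != len(operands):
        raise ValueError("operands must be allocated to different memory rows")
    if len({addr.bank for addr in operands}) > 1:
        return "inter-bank"
    if len({addr.subarray for addr in operands}) > 1:
        return "inter-subarray"
    return "intra-subarray"
```

For example, `computation_mode([RowAddr(0, 3, 10), RowAddr(0, 3, 11)])` returns `"intra-subarray"`, the case in which both rows share the same sense amplifiers.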
3.3 Resistive RAM (ReRAM)
The attractive features of high resistance and multi-level cell storage make ReRAM stand out from other emerging memory technologies for constructing dense and low-power computing systems [53]. A variety of ReRAM-based computing systems have been proposed to demonstrate superior performance in different applications where reducing memory access is central.
3.3.1 Logic and arithmetic operations. MAGIC [23] proposes mechanisms to perform bitwise operations with the aid of binary ReRAM. These schemes enable the integration of fundamental logic gates as well as complex arithmetic units, e.g. multi-bit full adders, within a ReRAM array.
3.3.2 Matrix-vector multiplications. ISAAC [25] is a neural network accelerator based on a ReRAM dot-product engine. In ISAAC, ReRAM crossbar arrays both store the weights of the DNN and perform MVM with analog currents and digital-to-analog and analog-to-digital converters (DACs and ADCs). Figure 4 shows the top-down view of the ISAAC chip. A group of tiles is connected through an on-chip network. Every tile is composed of eDRAM buffers, several in-situ multiply-accumulators (IMAs), output registers and shift-adders. Pooling and activation units in the tiles are dedicated to the pooling and activation operations of neural networks. Each IMA consists of a number of ReRAM crossbars, ADCs, input/output registers, and shift-adders. ISAAC exploits this integration of storage and computation to reduce data movement, especially that of filter weights. The deeply pipelined flow of ISAAC focuses on optimizing neural network inference.
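The role of the shift-adders next to the ADCs in a ReRAM dot-product engine can be illustrated with a bit-serial sketch. It rests on an assumption not spelled out in the text above: the inputs are taken to be streamed one bit per cycle, so that each cycle produces a partial sum that must be weighted by its bit position. The function name and the restriction to non-negative integers are illustrative as well.

```python
def bit_serial_dot(inputs, weights, input_bits: int = 8) -> int:
    """Dot product of non-negative integers, computed one input bit per cycle."""
    acc = 0
    for b in range(input_bits):
        # Cycle b: every input contributes only its b-th bit (0 or 1) to the crossbar.
        # The column sum models the analog current after digitization by the ADC.
        column_sum = sum(((x >> b) & 1) * w for x, w in zip(inputs, weights))
        # Shift-adder: weight the partial sum by 2**b and accumulate.
        acc += column_sum << b
    return acc
```

`bit_serial_dot([3, 5, 7], [2, 4, 6])` returns `68`, equal to 3·2 + 5·4 + 7·6, while each cycle only needs to distinguish the presence or absence of an input bit.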
A number of ReRAM-based CNN accelerators have been proposed to boost system performance in training [26, 27, 54]. For example, PipeLayer [26] balances the parallelism and throughput in training and inference based on both parallelism granularity and weight duplication. By eliminating the potential stalls found in ISAAC, PipeLayer yields an average speedup of 42.45× and reduces computation energy by 7.17× on average, compared to a massively parallel GPU platform. AtomLayer [27] attempts to provide a universal solution that improves efficiency during both training and inference. In this scheme, one network layer is executed at a time (an atomic layer) to solve the issues brought by highly pipelined operation, such as pipeline bubbles, long single-layer latency, and the high cost of data buffers. AtomLayer revises the data reuse scheme and the mapping of weights to ReRAM arrays, which reduces on-chip data buffer accesses in addition to memory accesses. AtomLayer achieves 1.1× higher power efficiency than ISAAC in inference and 1.6× higher than PipeLayer in training, and its footprint shrinks by 15× on average with the reduction of on-chip buffers.
The parallel computing nature of ReRAM arrays also lends itself to accelerators for computing models other than neural networks. GraphR [28] is a ReRAM-based graph processing accelerator that addresses the poor locality and high bandwidth requirement of graph processing. Its ReRAM crossbar-based graph engines offer a low-cost hardware implementation for power-efficient graph processing acceleration.
Beyond convolutional computing engines, a number of works utilize ReRAM crossbars to support different computations and applications [24, 55-57]. For instance, Bojnordi et al. [24] implement the restricted Boltzmann machine with ReRAM arrays. The Boltzmann machine model has been used to train deep neural networks with vast numbers of training samples. With the help of a current summation circuit and a reduction unit, large networks are reshaped to fit into ReRAM arrays, where in-situ computing operations are executed. Compared with conventional multi-core systems, the ReRAM-based Boltzmann machine achieves 57× higher performance and 25× lower energy consumption without degrading the quality of solutions to optimization problems.
CONCLUSION
In this work, we gave an overview of recent works on eNVM-based in-memory processing, which minimizes the cost of memory access and is expected to meet the requirements of data-intensive applications. Emerging non-volatile memories (eNVMs) have the advantages of low power, high density, superior scaling and inherent computing capability. Hence, numerous research works have been carried out to develop eNVM-based in-memory processing architectures. We summarized and discussed the types of eNVMs that have been adopted in in-memory processing designs, as well as the variety of implemented functions and supported applications. Because each type of eNVM has distinct strengths and weaknesses, the selection of eNVM technology should consider the specific requirements of the application. Following the progress of materials science and device processing techniques, we anticipate continuous improvement in the reliability, read/write speed, and energy efficiency of eNVM technologies. We believe that collaborative research across various levels, including device, circuit, system and application, is essential to move eNVM-based in-memory processing towards commercial production.
