An Overview of In-memory Processing with Emerging Non-volatile Memory
  for Data-intensive Applications by Li, Bing et al.
Bing Li∗
bing.li.ece@duke.edu
ECE Dept., Duke University, US
Army Research Office, Research
Triangle Park, US
Bonan Yan
ECE Dept., Duke University
Durham, North Carolina, US
bonan.yan@duke.edu
Hai “Helen” Li
ECE Dept., Duke University
Durham, North Carolina, US
hai.li@duke.edu
ABSTRACT
The conventional von Neumann architecture has become a major
performance and energy bottleneck for emerging data-intensive
applications. The decades-old idea of leveraging in-memory processing
to eliminate substantial data movement has returned and led to
extensive research activity. The effectiveness of in-memory processing
relies heavily on memory scalability, which traditional memory
technologies cannot provide. Emerging non-volatile memories (eNVMs),
which offer appealing qualities such as excellent scaling and low
energy consumption, have therefore been heavily investigated for
realizing in-memory processing architectures. In this paper, we
summarize recent research progress in eNVM-based in-memory processing
from various aspects, including the adopted memory technologies, the
location of in-memory processing in the system, the supported
arithmetic functions, and the target applications.
CCS CONCEPTS
• Hardware → Memory and dense storage; Spintronics and magnetic
technologies; • Computer systems organization → Other architectures.
KEYWORDS
Data-intensive applications, emerging non-volatile memory, in-
memory processing
ACM Reference Format:
Bing Li, Bonan Yan, and Hai “Helen” Li. 2019. An Overview of In-memory
Processing with Emerging Non-volatile Memory for Data-intensive Ap-
plications. In Great Lakes Symposium on VLSI 2019 (GLSVLSI ’19), May
9–11, 2019, Tysons Corner, VA, USA. ACM, New York, NY, USA, 6 pages.
https://doi.org/10.1145/3299874.3319452
∗This author is supported by NAS Associate Fellowship Award.
© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6252-8/19/05.
1 INTRODUCTION
In the current era of data explosion, deep neural network (DNN)
models are used in various applications that explore a large amount
of information in different data formats. Executing such
data-intensive applications on conventional von Neumann systems
causes massive data movement between processors and memory elements
and induces significant performance and energy overheads. Decades
after it was first proposed [1, 2], the concept of in-memory
processing has returned and inspired many innovative solutions.
Unlike the conventional computing paradigm, in which data and
computing are decoupled, an in-memory processing architecture brings
data close to the processing elements to reduce the amount of data
movement and minimize the computation cost.
Benefiting from recent advances in processing integration and memory
technologies, many in-memory processing architectures have been
developed. These attempts can be categorized into three groups:
(1) processing close to memory, also known as near-data processing
(NDP); (2) processing in traditional memory; and (3) processing in
emerging non-volatile memory (eNVM).
Three-dimensional (3D) integration is a key enabling technology for
NDP. For example, 3D DRAM is built by vertically stacking a set of
DRAM dies on top of a CMOS logic die and connecting them with
through-silicon vias (TSVs). A number of 3D DRAM-based NDP platforms
exploit the logic die to perform simple but common operations in
data-intensive applications [16, 29–34]. TSVs substantially shorten
the distance between the logic and memory dies, increasing the data
bandwidth and improving overall performance.
Processing in memory performs computations directly in the memory
arrays, which reduces data movement to a large extent. Prior works
exploit traditional memory technologies such as DRAM and SRAM to
carry out the frequent operations of data-intensive applications
[35–39]. However, as the scaling of traditional memory technologies
approaches its physical limit, it is
Table 1: Emerging Non-volatile Memory Comparison
                      | SRAM    | DRAM    | STT-RAM | PCM        | ReRAM
Cell Size (F^2)       | >100    | 6-10    | 6-50    | 4-30       | ≤2
Multibit              | 1       | 1       | 1       | >2         | 2-7
Endurance             | >10^16  | >10^16  | >10^15  | 10^8-10^15 | 10^8-10^12
Read Time (ns)        | ∼1      | ∼10     | <10     | <10        | <10
Write Time (ns)       | ∼1      | ∼10     | <10     | 50         | <10
Write Energy (J/bit)  | ∼10^-15 | ∼10^-14 | ∼10^-13 | ∼10^-11    | ∼10^-13
Source: [3–6]. Note: F represents the feature size.
arXiv:1906.06603v1 [cs.AR] 15 Jun 2019
Table 2: An Overview of In-memory Processing Designs
Works                  | Year       | Type    | Location     | Design Level    | Functions               | Applications
Guo et al. [7]         | 2010       | STT-RAM | Cache        | Circuit; System | Logic; Arithmetic       | Generic
AC-DIMM [8]            | 2013       | STT-RAM | Main Memory  | Circuit; System | Associative             | Generic
Kang et al. [9]        | 2017       | STT-RAM | -            | Circuit         | Logic                   | Bitmap
STT-CiM [10]           | 2018       | STT-RAM | Scratchpad   | Circuit; System | Logic; Addition; Vector | Generic
HieIM [11]             | 2018       | STT-RAM | -            | Circuit; System | Logic                   | Encryption, Database
Pan et al. [12]        | 2018       | STT-RAM | Co-processor | Circuit; System | Logic                   | Binary CNN
Cassinerio et al. [13] | 2013       | PCM     | -            | Device          | Logic                   | -
Wright et al. [14, 15] | 2011, 2013 | PCM     | -            | Device          | Arithmetic              | -
Hosseini et al. [16]   | 2015       | PCM     | -            | Device          | Arithmetic              | -
Pinatubo [17]          | 2015       | PCM     | Main Memory  | Circuit; System | Logic                   | Generic
Burr et al. [18, 19]   | 2015       | PCM     | -            | Circuit         | MVM                     | DNN
Sebastian et al. [20]  | 2017       | PCM     | -            | Circuit         | MVM                     | Unsupervised Learning
Le et al. [21, 22]     | 2017, 2018 | PCM     | -            | Circuit         | MVM                     | Transfer Learning
MAGIC [23]             | 2014       | ReRAM   | Co-processor | Circuit         | Logic; Arithmetic       | Adder
Bojnordi et al. [24]   | 2016       | ReRAM   | Co-processor | System          | MVM                     | Boltzmann machine
ISAAC [25]             | 2016       | ReRAM   | Co-processor | System          | MVM                     | CNN
PipeLayer [26]         | 2017       | ReRAM   | Co-processor | System          | MVM                     | CNN
AtomLayer [27]         | 2018       | ReRAM   | Co-processor | System          | MVM                     | CNN
GraphR [28]            | 2018       | ReRAM   | Co-processor | System          | MVM                     | Graph
Note: MVM – Matrix-Vector Multiplication; DNN – Deep Neural Network.
difficult to provide sufficient computing and storage capacity for
data-intensive applications. Moreover, the large cell size and high
leakage power of traditional memory lead to large design area and
high energy consumption [40–42].
In recent years, eNVMs that demonstrate excellent scaling and
near-zero leakage power have emerged as promising candidates.
Table 1 compares traditional DRAM and SRAM with a few popular eNVM
technologies, including spin-transfer torque RAM (STT-RAM),
phase-change memory (PCM) and resistive RAM (ReRAM). Among these eNVM
technologies, STT-RAM shows the fastest access speed and the lowest
energy consumption, although its cell area is relatively large
[41, 43]. Both PCM [44, 45] and ReRAM [46–48] can store multiple
logic bits in a single memory cell, demonstrating superior density
with technology scaling. In addition, they inherently support
parallel data processing, which is particularly beneficial to the
aforementioned data-intensive applications such as DNNs. For this
reason, extensive research efforts have been devoted to building
in-memory processing using eNVM.
In this paper, we survey the recent progress in developing in-memory
processing with the three mainstream eNVM technologies (STT-RAM, PCM
and ReRAM). We present and discuss the differences and similarities
of these works in terms of the supported functions, the location in
the architecture, the targeted applications, etc.
2 DESIGN OVERVIEW
Table 2 summarizes the latest eNVM-based in-memory processing designs
reviewed in this paper. Each design is classified along the following
five categories.
• Type - The memory technology adopted in the work: STT-RAM, PCM or
ReRAM. Because the memories differ in their features, the choice of
memory type determines the feasible types of computation to some
extent.
• Location - Where the memory is located in the computing
architecture: cache, main memory or scratchpad. eNVMs can be used as
storage and/or computing units. Some works treat the eNVM only as a
co-processor, while others also designate its location in the memory
hierarchy.
• Design Level - The techniques in these works are carried out at
different levels, such as device, circuit or system. Some works
propose novel writing methods to perform calculations in memory cells
[13–16]. Some techniques are realized by modifying the readout or
write circuits associated with the memory arrays. System-level
techniques provide the interface and connection between the memory
array and the operating system so that the processing can be
controlled by applications.
• Functions - The function types in these works can be divided into
the following groups: logic, arithmetic, associative, vector and
matrix-vector multiplication (MVM). The same type of operation can be
realized by different eNVMs, but the implementation details can vary
significantly.
• Applications - Most works provide advanced functions to support
most data-intensive applications, which are grouped
Figure 1: Processing locations and function types. [Labels recovered from the figure: cores, cache, scratchpad, main memory, and eNVM; function types bitwise logic (e.g., A or B; A xnor B), arithmetic (e.g., A +/- B; A ⨀ B), and associative (key, value).]
into the generic category. Some works carry out only the core
operation of particular applications, such as encryption, database,
and CNN. A few works implement simple, basic operations that are not
tied to any target application.
Figure 1 depicts a high-level view of the classification based on the
location of the in-memory processing in the system (eNVM core, cache,
scratchpad memory, or main memory) and the type of supported
functions (bitwise logic, arithmetic, or associative).
3 ENVM-BASED IN-MEMORY PROCESSING
3.1 Spin-Transfer Torque RAM (STT-RAM)
An STT-RAM cell consists of a magnetic tunnel junction (MTJ) device,
which presents two resistance states depending on the relative
magnetization orientation of its fixed and free ferromagnetic layers.
Compared to other resistive memory devices, STT-RAM has faster write
speed, lower write energy, and higher write endurance (see Table 1).
Because the resistance difference between the two states of an MTJ is
limited, it is hard to implement multi-bit storage in STT-RAM cells.
Most STT-RAM-based in-memory processing designs therefore focus on
bitwise operations.
3.1.1 Associative and combinational logic. Early works exploit the
high density of STT-RAM to implement associative computing and
combinational logic [7, 8], achieving lower cost than designs based
on traditional memory technologies. For instance, Guo et al. [7] use
STT-RAM to construct look-up tables (LUTs) and realize computation by
cascading multiple LUTs. In this way, the floating-point units are
replaced by STT-RAM-based LUTs. The work demonstrates the improved
power and performance brought by STT-RAM technology compared to a
multi-core CPU platform.
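The idea of computing by cascading LUTs can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the circuit of [7]: the choice of a 1-bit full adder as the stored function, the table layout, and all names (`SUM_LUT`, `CARRY_LUT`, `cascaded_add`) are hypothetical.

```python
from itertools import product

# Hypothetical 3-input LUTs; index = (a << 2) | (b << 1) | carry_in.
# In an STT-RAM LUT, each entry would be one stored bit.
SUM_LUT = [a ^ b ^ c for a, b, c in product((0, 1), repeat=3)]
CARRY_LUT = [int(a + b + c >= 2) for a, b, c in product((0, 1), repeat=3)]


def lut_read(lut, a, b, c):
    """Return one stored bit; the function is looked up, not evaluated."""
    return lut[(a << 2) | (b << 1) | c]


def cascaded_add(x, y, width=8):
    """Add two unsigned integers by cascading one LUT pair per bit position."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= lut_read(SUM_LUT, a, b, carry) << i
        carry = lut_read(CARRY_LUT, a, b, carry)  # feeds the next LUT stage
    return result, carry


print(cascaded_add(100, 57))  # (157, 0)
```

Each stage only reads memory, so the cost of the computation is the cost of a few table reads per bit.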
3.1.2 Bitwise logic operations. Recent works [9–11, 49] explore the
use of STT-RAM for bitwise logic operations and build more advanced
operations on top of these basic logic functions. We introduce
Kang et al. [9], STT-CiM [10] and HieIM [11] as examples.
Kang et al. [9] propose an STT-RAM chip that can both process bitwise
logic and store information. The operands reside in different rows of
the same array. Simultaneously activating multiple rows enables the
bitwise operations, and the results are obtained through modified
readout periphery circuits, i.e., the sense amplifiers (SAs). The
logic function to perform can be selected by modifying a bit in a
control row. The chip can benefit applications that involve intensive
bitwise logic operations, such as bitmaps. This work focuses on
circuit design and functional evaluation.
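A simple behavioral model may help to see why activating two rows at once yields a logic result. The sketch below is our own idealization, not the circuit of [9]: the resistance values, the read voltage, the encoding (bit 0 = low resistance, bit 1 = high resistance), and the `op` argument that stands in for the control row are all assumptions.

```python
R_P, R_AP = 2000.0, 4000.0   # assumed parallel / anti-parallel resistances (ohms)
V_READ = 0.1                 # assumed read voltage (volts)


def cell_current(bit):
    """Current through one cell; bit 0 = low resistance, bit 1 = high resistance."""
    return V_READ / (R_AP if bit else R_P)


# Two activated cells give three possible bitline currents.
I_00 = 2 * cell_current(0)
I_01 = cell_current(0) + cell_current(1)
I_11 = 2 * cell_current(1)
# Reference currents placed midway between neighboring levels.
REF_OR = (I_00 + I_01) / 2
REF_AND = (I_01 + I_11) / 2


def sense(bit_a, bit_b, op):
    """Activate two rows on one bitline and compare the summed current to a reference."""
    i_bl = cell_current(bit_a) + cell_current(bit_b)
    if op == "or":
        return int(i_bl < REF_OR)            # anything but (0, 0)
    if op == "and":
        return int(i_bl < REF_AND)           # only (1, 1)
    if op == "xor":
        return int(REF_AND < i_bl < REF_OR)  # exactly one cell holds 1
    raise ValueError(op)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, sense(a, b, "and"), sense(a, b, "or"), sense(a, b, "xor"))
```

Changing the reference, rather than the array, changes the logic function, which is why the readout circuit is the part that needs modification.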
STT-CiM [10] extends the supported functions from bitwise logic to
basic arithmetic and vector operations. At the circuit level, the row
decoders and SAs are enhanced to enable logic functions, and
additional logic gates are integrated into the sense circuits to
realize arithmetic operations. Two row decoders are used to activate
the multiple rows where the operands are located. The connected
multiplexer circuits are controlled by select signals to determine
the desired operation type. When processing a vector function, the
vector outputs from the STT-RAM arrays are fed into reduction units
and reduced to a scalar value. At the array level, the authors
analyze the impact of process variation on the computing results and
deploy an error correction scheme to enhance reliability. At the
architectural and system levels, this work extends the instruction
set to convey operation commands from applications to the memory
array. With these extensions across multiple levels, STT-CiM can be
placed next to the processor as on-chip scratchpad memory and applied
to various data-intensive applications such as text processing, data
compression, and digit recognition.
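To illustrate how a few extra gates after the sense amplifiers can turn sensed logic outputs into arithmetic, the following sketch derives an addition from per-bitline OR and AND values and then reduces a vector result to a scalar. It is a functional illustration under our own assumptions, not the STT-CiM circuit; the function names and the 8-bit word width are hypothetical.

```python
def sensed_outputs(word_a, word_b):
    """Bitwise OR and AND, as two-row activation would deliver them per bitline."""
    return word_a | word_b, word_a & word_b


def cim_add(word_a, word_b, width=8):
    """Ripple-carry addition built only from the sensed OR/AND bits."""
    or_bits, and_bits = sensed_outputs(word_a, word_b)
    xor_bits = or_bits & ~and_bits        # XOR = OR and not AND
    carry, total = 0, 0
    for i in range(width):
        x = (xor_bits >> i) & 1
        g = (and_bits >> i) & 1
        total |= (x ^ carry) << i         # sum bit
        carry = g | (x & carry)           # carry into the next bit position
    return total                          # result modulo 2**width


def reduce_sum(vector_a, vector_b, width=8):
    """Element-wise in-memory add followed by a reduction to one scalar."""
    return sum(cim_add(a, b, width) for a, b in zip(vector_a, vector_b))


print(cim_add(100, 57))                   # 157
print(reduce_sum([1, 2, 3], [4, 5, 6]))   # 21
```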
HieIM [11] also implements bulk bitwise operations in STT-RAM arrays.
Unlike the above two designs, HieIM is more flexible and allows
computation between any cells within the same array. Moreover, a data
encryption engine based on HieIM is demonstrated, which consumes
51.5% less energy than its CMOS-based ASIC counterpart.
3.1.3 Neural networks. Thanks to the evolution of DNN models,
convolution in a binary convolutional neural network (BCNN) can be
replaced with bitwise operations such as XNOR and bit-count [50].
Pan et al. [12] build a BCNN accelerator based on multilevel STT-RAM
(i.e., two-bit cells). The STT-RAM arrays are programmable and can be
switched between memory mode and bitwise operation mode, so the
multi-functional arrays can be used to process convolutional layers.
In this design, the two bits of a cell are associated with an input
and a weight, respectively, so that the logic and add operations are
carried out within one cell. The work integrates the computational
STT-RAM array with an auxiliary processing unit (APU), which
processes the other computational layers in CNNs, to implement the
BCNN accelerator. Figure 2 illustrates the computing array and
execution flow. Compared to other eNVM-based counterparts, the
STT-RAM-based accelerator achieves significant performance and energy
improvements.
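The replacement of a binary dot product by XNOR and bit-count can be checked with a short sketch. It assumes the common encoding in which bit 1 stands for +1 and bit 0 for -1; the function names are ours, and the code models only the arithmetic identity, not the multilevel cell of [12].

```python
def binary_dot(input_bits, weight_bits, n):
    """Dot product of two {-1, +1} vectors packed as n-bit integers (1 -> +1, 0 -> -1)."""
    mask = (1 << n) - 1
    xnor = ~(input_bits ^ weight_bits) & mask   # 1 wherever the two bits agree
    matches = bin(xnor).count("1")              # bit-count
    return 2 * matches - n                      # agreements minus disagreements


def reference_dot(input_bits, weight_bits, n):
    """Same dot product computed directly on the +1/-1 values."""
    def to_pm1(word, i):
        return 1 if (word >> i) & 1 else -1
    return sum(to_pm1(input_bits, i) * to_pm1(weight_bits, i) for i in range(n))


print(binary_dot(0b1011, 0b1101, 4))      # 0
print(reference_dot(0b1011, 0b1101, 4))   # 0
```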
3.2 Phase Change Memory (PCM)
PCM can store more than one bit of data per cell by dividing the
overall resistance range into a few levels. Moreover, the cell
conductance increases linearly with the number of programming (more
exactly, SET) pulses [51]. These key attributes of PCM are exploited
to implement more complicated computations, such as the training of
neural networks.
3.2.1 Logic and basic arithmetic operations. Phase change material
manifests many physical attributes under various pulse amplitudes or
durations, which have been exploited to realize computation. For
example, Cassinerio et al. [13] leverage the resistance transition of
phase change material and propose an initialize-compute-confirm
scheme to implement Boolean logic operations within a single PCM
cell. Wright et al. [14, 15] and Hosseini et al. [16] exploit the
accumulative behavior of PCM material during programming and build an
accumulator for arithmetic computations such as addition,
subtraction, and parallel factoring. In these works, the partial and
final results are stored where the computations are carried out. A
single operation takes multiple cycles to complete because the input
operands enter sequentially. These works use PCM cells as substitutes
for logic gates and do not target specific applications.
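The accumulation-based arithmetic can be pictured with an idealized cell that switches after a fixed number of pulses. The sketch below is a behavioral model under our own assumptions (ten pulses per switching event, a RESET after each switch, one cell per decimal digit); it does not reproduce the device physics or the exact schemes of [14–16], and the names `PCMAccumulator` and `add_decimal` are hypothetical.

```python
class PCMAccumulator:
    """Idealized cell that switches after a fixed number of SET pulses."""

    def __init__(self, pulses_to_switch=10):
        self.pulses_to_switch = pulses_to_switch  # assumed base of the number system
        self.level = 0                            # accumulated crystallization steps

    def pulse(self, count=1):
        """Apply `count` pulses one by one; return how many times the cell switched."""
        carries = 0
        for _ in range(count):
            self.level += 1
            if self.level == self.pulses_to_switch:
                self.level = 0                    # RESET back to the amorphous state
                carries += 1
        return carries


def add_decimal(x, y, digits=4):
    """Add two non-negative integers digit by digit, one cell per digit."""
    cells = [PCMAccumulator() for _ in range(digits)]
    for operand in (x, y):                        # operands enter sequentially
        for position in range(digits):
            digit = (operand // 10 ** position) % 10
            carry = cells[position].pulse(digit)
            higher = position + 1
            while carry and higher < digits:      # propagate the carry upward
                carry = cells[higher].pulse(carry)
                higher += 1
    # The result is read from the same cells that performed the computation.
    return sum(cell.level * 10 ** i for i, cell in enumerate(cells))


print(add_decimal(278, 945))  # 1223
```

The loop structure mirrors the two properties noted above: the result stays in the cells that computed it, and one addition takes several pulse cycles.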
Sebastian et al. [20] exploit the physical dynamics of PCM material
and propose computational PCM to detect temporal correlations between
stochastic binary processes. Each
Figure 2: The execution flow, computing array, and multilevel STT-RAM cell for convolutional layers of Binary CNN in [12]. [Labels recovered from the figure: APU (batch normalization, binary operation, scaling factor multiplier, pooling); MLC array with row decoder, controller, modified sensing circuit, and mode controller (logic & add); input activations and weight data stored as Bit1 and Bit2; binary AND and bit-count; multi-level STT-RAM cell with small MTJ and large MTJ connected to bitline, wordline, and sourceline.]
Figure 3: Pinatubo [17] and PCM cell architecture. [Labels recovered from the figure: chip with controller, output buffer, banks, and inter-bank operations; bank with mats, row buffer, global wordline decoder, and inter-subarray operations; mat with subarrays, local wordline decoder, multi-row activation, sense amplifiers, write drivers, MUXes, and intra-subarray operations with in-place update; PCM cell with top electrode, amorphous and crystalline regions, and bottom electrode.]
process is encoded into a SET pulse, whose amplitude or duration is
proportional to the instantaneous sum of all processes, and applied
to its assigned PCM device. Comparing the conductances of the devices
then identifies the correlated processes.
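The principle can be reproduced with a small simulation. The model below is a deliberately simplified sketch under our own assumptions: conductance grows linearly with the pulse strength, a device is pulsed only when its own process fires, and the correlated processes follow a hidden common source with probability `coupling`. None of the parameter values come from [20].

```python
import random


def detect_correlated(num_processes=10, num_correlated=4, steps=2000,
                      rate=0.1, coupling=0.9, seed=0):
    """Return the per-device conductance after all processes have been applied."""
    rng = random.Random(seed)
    conductance = [0.0] * num_processes
    for _ in range(steps):
        reference = rng.random() < rate           # hidden common source
        events = []
        for i in range(num_processes):
            if i < num_correlated and rng.random() < coupling:
                events.append(reference)          # follow the common source
            else:
                events.append(rng.random() < rate)
        total = sum(events)                       # instantaneous sum of all processes
        for i, fired in enumerate(events):
            if fired:
                conductance[i] += total           # SET pulse scaled by the sum
    return conductance


g = detect_correlated()
print("correlated:  ", g[:4])
print("uncorrelated:", g[4:])
```

Correlated processes tend to fire together, so their devices receive stronger pulses on average and end with a visibly higher conductance than the uncorrelated ones.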
3.2.2 Matrix-vector multiplications & machine learning. Arranged in a
crossbar structure, PCM devices can perform analog matrix-vector
multiplications, which have been intensively investigated
[4, 18, 19, 21, 22, 52]. Each element of a matrix is mapped to the
conductance of a PCM device. With a PCM crossbar representing the
matrix, the vector is encoded into the amplitudes or durations of
voltage pulses applied along the rows. The currents along the columns
are then proportional to the results. Positive and negative matrix
elements can be stored in a pair of PCM devices. When the input
signals are applied to the columns instead, the currents along the
rows give the product of the vector with the transposed matrix. A
3-layer perceptron using PCMs, trained with backpropagation on the
MNIST database of handwritten digits, achieves accuracy comparable to
the software model [18, 19]. Moreover, PCM-based in-memory processing
has been demonstrated for other complicated tasks, such as compressed
sensing recovery [22] and transfer learning [52].
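The crossbar operation described above amounts to Ohm's law per device and Kirchhoff's current law per line. The sketch below models it with ideal, noise-free conductances; the differential mapping G+ - G- is one common way to realize the device pair mentioned in the text, and the function names are ours.

```python
def program_pairs(matrix):
    """Map each signed weight to a (G_plus, G_minus) pair of non-negative conductances."""
    g_plus = [[max(w, 0.0) for w in row] for row in matrix]
    g_minus = [[max(-w, 0.0) for w in row] for row in matrix]
    return g_plus, g_minus


def forward(g_plus, g_minus, row_voltages):
    """Drive the rows, read the columns: I_j = sum_i V_i * (G+_ij - G-_ij)."""
    cols = len(g_plus[0])
    return [sum(v * (g_plus[i][j] - g_minus[i][j]) for i, v in enumerate(row_voltages))
            for j in range(cols)]


def backward(g_plus, g_minus, col_voltages):
    """Drive the columns, read the rows: the same array multiplies by the transpose."""
    return [sum(v * (gp - gm) for v, gp, gm in zip(col_voltages, gp_row, gm_row))
            for gp_row, gm_row in zip(g_plus, g_minus)]


weights = [[1.0, -2.0], [3.0, 4.0], [-5.0, 6.0]]   # 3 rows x 2 columns
gp, gm = program_pairs(weights)
print(forward(gp, gm, [1.0, 2.0, 3.0]))   # [-8.0, 24.0]
print(backward(gp, gm, [1.0, 1.0]))       # [-1.0, 7.0, 1.0]
```

Reading the same array in both directions is what makes the crossbar attractive for backpropagation, which needs products with both the weight matrix and its transpose.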
3.2.3 System-level bitwise operations. Pinatubo [17] proposes a
mechanism to perform bulk bitwise operations in PCM main memory. The
read circuits and write drivers are modified so that Pinatubo can
process logic functions. The operands are stored in different rows of
the memory arrays. Depending on where the operands reside, Pinatubo
has three computation modes: intra-subarray, inter-subarray and
inter-bank (Figure 3). The rows associated with the operands are
activated simultaneously during computation. The sense amplifiers are
enhanced with additional reference circuits to obtain the logic
outputs, which are sent to the I/O bus or written to another memory
row. To bridge the operating system and the logic operations inside
PCM, Pinatubo develops a programming model and run-time support to
ensure that operands are allocated to different memory rows. The
design achieves 1.12× overall speedup and 1.11× overall energy saving
over the conventional CPU.
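The choice among the three computation modes depends only on where the operand rows reside, which the following sketch makes explicit. The `RowAddr` tuple and both function names are hypothetical; the actual address mapping and controller logic of Pinatubo are not described here.

```python
from collections import namedtuple

# Hypothetical physical address of a memory row.
RowAddr = namedtuple("RowAddr", "bank subarray row")


def pinatubo_mode(operands):
    """Pick the computation mode from where the operand rows reside."""
    banks = {op.bank for op in operands}
    subarrays = {(op.bank, op.subarray) for op in operands}
    if len(subarrays) == 1:
        return "intra-subarray"   # rows share bitlines, so they can be activated together
    if len(banks) == 1:
        return "inter-subarray"
    return "inter-bank"


def bulk_or(rows):
    """Functional result of OR-ing any number of equally sized rows."""
    result = 0
    for word in rows:
        result |= word
    return result


print(pinatubo_mode([RowAddr(0, 1, 5), RowAddr(0, 1, 9)]))   # intra-subarray
print(pinatubo_mode([RowAddr(0, 1, 5), RowAddr(0, 2, 9)]))   # inter-subarray
print(pinatubo_mode([RowAddr(0, 1, 5), RowAddr(3, 1, 5)]))   # inter-bank
print(bin(bulk_or([0b0011, 0b0100, 0b1000])))                # 0b1111
```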
3.3 Resistive RAM (ReRAM)
High resistance and multi-level cell storage make ReRAM stand out
from other emerging memory technologies for constructing dense and
low-power computing systems [53]. A variety of ReRAM-based computing
systems have been proposed and demonstrate superior performance in
applications where reducing memory accesses is central.
3.3.1 Logic and arithmetic operations. MAGIC [23] proposes mechanisms
to perform bitwise operations with binary ReRAM. These schemes enable
the integration of fundamental logic gates as well as complex
arithmetic units, e.g., multi-bit full adders, within a ReRAM array.
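As an illustration of how a single in-array gate can grow into an arithmetic unit, the sketch below builds a full adder and a small ripple-carry adder from NOR gates only. The NOR model is idealized (the output cell starts at logic 1 and is switched to logic 0 when any input holds logic 1), and the nine-gate decomposition is a textbook construction chosen for this example, not necessarily the gate arrangement used in [23].

```python
def magic_nor(*inputs):
    """Idealized in-array NOR gate."""
    output = 1                 # output cell initialized to logic 1
    if any(inputs):
        output = 0             # any logic-1 input switches the output cell
    return output


def full_adder(a, b, carry_in):
    """1-bit full adder using nine NOR evaluations."""
    n1 = magic_nor(a, b)
    n2 = magic_nor(a, n1)
    n3 = magic_nor(b, n1)
    xnor_ab = magic_nor(n2, n3)
    n5 = magic_nor(xnor_ab, carry_in)
    n6 = magic_nor(xnor_ab, n5)
    n7 = magic_nor(carry_in, n5)
    total = magic_nor(n6, n7)          # a xor b xor carry_in
    carry_out = magic_nor(n1, n5)      # majority(a, b, carry_in)
    return total, carry_out


def ripple_add(x, y, width=4):
    """Multi-bit adder formed by chaining the NOR-only full adder."""
    carry, result = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result | (carry << width)


print(ripple_add(9, 5))    # 14
print(ripple_add(15, 1))   # 16
```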
3.3.2 Matrix-vector multiplications. ISAAC [25] is a neural network
accelerator based on a ReRAM dot-product engine. In ISAAC, ReRAM
crossbar arrays both store the DNN weights and perform MVM using
analog currents together with digital-to-analog and analog-to-digital
converters (DACs and ADCs). Figure 4 shows a top-down view of the
ISAAC chip. A group of tiles is connected through an on-chip network.
Every tile is composed of eDRAM buffers, several in-situ
multiply-accumulators (IMAs), output registers and shift-adders.
Pooling and activation units in the tiles are dedicated to the
pooling and activation operations of the neural network. Each IMA
consists of a number of ReRAM crossbars, ADCs, input/output
registers, and shift-adders. ISAAC exploits this integration of
storage and computation to save data movement, especially for the
weights in filters. The deeply pipelined flow of ISAAC focuses on
optimizing neural network inference.
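The role of the shift-adders can be seen in a small numerical sketch: a multi-bit dot product is assembled from many low-precision partial sums. The bit widths used here (2-bit cells, four cells per weight, 8-bit inputs applied one bit per cycle) are assumptions made for the example, values are unsigned, and the analog summation and ADC are modeled as exact integer arithmetic.

```python
def slice_weight(weight, bits_per_cell=2, num_cells=4):
    """Split an unsigned weight across cells, least significant slice first."""
    mask = (1 << bits_per_cell) - 1
    return [(weight >> (bits_per_cell * k)) & mask for k in range(num_cells)]


def crossbar_dot(inputs, weights, input_bits=8, bits_per_cell=2, num_cells=4):
    """Dot product assembled from 1-bit input phases and multi-bit weight slices."""
    sliced = [slice_weight(w, bits_per_cell, num_cells) for w in weights]
    total = 0
    for t in range(input_bits):                    # one input bit per cycle
        x_bits = [(x >> t) & 1 for x in inputs]
        for k in range(num_cells):                 # one crossbar column per weight slice
            # Column current: sum of input bit times cell value, then digitized.
            column_sum = sum(b * s[k] for b, s in zip(x_bits, sliced))
            total += column_sum << (t + bits_per_cell * k)   # shift-and-add
    return total


inputs, weights = [3, 200, 17], [255, 1, 42]
print(crossbar_dot(inputs, weights))                  # 1679
print(sum(x * w for x, w in zip(inputs, weights)))    # 1679
```

Each column sum stays small, which keeps the required ADC resolution low at the cost of more cycles and more shift-and-add steps.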
A number of ReRAM-based CNN accelerators have been proposed to boost
system performance in training [26, 27, 54]. For example,
PipeLayer [26] balances parallelism and throughput in training and
inference based on both parallelism granularity and weight
duplication. By eliminating the potential stalls as in
Figure 4: ISAAC [25], ReRAM crossbar, and ReRAM cell architecture. [Labels recovered from the figure: chip with tiles and an IO interface; tile with eDRAM buffer, IMAs, shift-and-add unit, pooling/activation unit, and output register; IMA with crossbars, DACs, ADCs, shift-and-hold, shift-and-add, and input/output registers; ReRAM crossbar with bitlines and wordlines; TaOx ReRAM cell.]
ISAAC, PipeLayer achieves an average speedup of 42.45× and reduces
computation energy by 7.17× on average compared to a massively
parallel GPU platform. AtomLayer [27] attempts to provide a universal
solution that improves efficiency during both training and inference.
In this scheme, one network layer is executed at a time, i.e., as an
atomic layer, to solve the issues caused by highly pipelined
operation, such as pipeline bubbles, long single-layer latency, and
the high cost of data buffers. AtomLayer revises the mapping of
weights to ReRAM arrays and the data reuse scheme, which reduces
on-chip data buffer accesses in addition to memory accesses.
AtomLayer achieves 1.1× higher power efficiency than ISAAC in
inference and 1.6× higher than PipeLayer in training, and its
footprint shrinks by 15× on average thanks to the reduction of
on-chip buffers.
The parallel computing nature of ReRAM arrays also makes them
suitable for building accelerators for computing models other than
neural networks. GraphR [28] is a ReRAM-based graph processing
accelerator that addresses the poor locality and high bandwidth
requirement of graph processing. Its ReRAM crossbar-based graph
engines offer a low-cost hardware implementation of power-efficient
graph processing acceleration.
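Many graph algorithms can be phrased as repeated matrix-vector products over the adjacency matrix, which is what makes crossbars applicable. The sketch below shows one PageRank iteration computed tile by tile, where each small dense tile plays the role of a crossbar. It is a generic illustration under our own assumptions (PageRank as the workload, a tile size of 2, no duplicate edges) and does not reproduce GraphR's actual dataflow or data format.

```python
def build_blocks(edges, num_vertices, block=2):
    """Group edges into dense block x block tiles; empty tiles are never stored."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1
    tiles = {}
    for src, dst in edges:
        key = (dst // block, src // block)
        tile = tiles.setdefault(key, [[0.0] * block for _ in range(block)])
        tile[dst % block][src % block] = 1.0 / out_degree[src]
    return tiles


def pagerank_step(tiles, rank, block=2, damping=0.85):
    """One iteration: each stored tile performs a small matrix-vector product."""
    n = len(rank)
    acc = [0.0] * n
    for (bi, bj), tile in tiles.items():
        for r in range(block):
            for c in range(block):
                if bi * block + r < n and bj * block + c < n:
                    acc[bi * block + r] += tile[r][c] * rank[bj * block + c]
    return [(1.0 - damping) / n + damping * a for a in acc]


edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # a 4-vertex cycle
tiles = build_blocks(edges, 4)
print(pagerank_step(tiles, [0.25] * 4))       # stays (approximately) uniform
```

Only the tiles that contain edges are stored and processed, which is one way to cope with the sparsity that causes poor locality in conventional graph processing.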
Beyond convolutional computing engines, a number of works use ReRAM
crossbars to support other computations and applications
[24, 55–57]. For instance, Bojnordi et al. [24] implement the
restricted Boltzmann machine with ReRAM arrays. The Boltzmann machine
model has been used to train deep neural networks with vast training
samples. With the help of a current summation circuit and a reduction
unit, large networks are reshaped to fit into ReRAM arrays, where
in-situ computing operations are executed. Compared with conventional
multi-core systems, the ReRAM-based Boltzmann machine achieves 57×
higher performance and 25× lower energy consumption without degrading
the quality of solutions to optimization problems.
4 CONCLUSION
In this work, we gave an overview of recent works on eNVM-based
in-memory processing, which minimizes the cost of memory access and
is expected to meet the requirements of data-intensive applications.
Emerging non-volatile memories (eNVMs) offer low power, high density,
superior scaling and inherent computing capability. Hence, numerous
research works have been carried out to develop eNVM-based in-memory
processing architectures. We summarized and discussed the types of
eNVMs adopted in in-memory processing designs, as well as the variety
of implemented functions and supported applications. Because each
type of eNVM has distinct strengths and weaknesses, the selection of
an eNVM technology should consider the specific requirements of the
application. Following the progress of material science and device
processing techniques, we anticipate continuous improvement in the
reliability, read/write speed, and energy efficiency of eNVM
technologies. We believe that collaborative research across various
levels, including device, circuit, system and application, is
essential to move eNVM-based in-memory processing towards commercial
production.
ACKNOWLEDGMENTS
Bing Li acknowledges the National Academy of Sciences (NAS),
USA for awarding the NRC research fellowship. Any opinions,
findings and conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect the
views of NAS or their contractors.
REFERENCES
[1] David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly
Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick. A case for
intelligent RAM. IEEE Micro, 17(2):34–44, 1997.
[2] Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff La-
Coss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, et al. The
architecture of the DIVA processing-in-memory chip. In Proceedings of the 16th
international conference on Supercomputing (ICS), pages 14–25, 2002.
[3] Micron Technology Inc. Ddr4 sdram system-power calculator, June 2016.
[4] Geoffrey W Burr, Matthew J Brightsky, Abu Sebastian, Huai-Yu Cheng, Jau-Yi
Wu, Sangbum Kim, Norma E Sosa, Nikolaos Papandreou, Hsiang-Lan Lung,
Haralampos Pozidis, et al. Recent progress in phase-change memory technology.
IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 6(2):146–162,
2016.
[5] Sparsh Mittal, Jeffrey S Vetter, and Dong Li. A survey of architectural approaches
for managing embedded dram and non-volatile on-chip caches. IEEE Transactions
on Parallel and Distributed Systems, page 14, 2015.
[6] Shimeng Yu and Pai-Yu Chen. Emerging memory technologies: recent trends
and prospects. IEEE Solid-State Circuits Magazine, 8(2):43–56, 2016.
[7] Xiaochen Guo, Engin Ipek, and Tolga Soyata. Resistive computation: avoiding
the power wall with low-leakage, stt-mram based computing. ACM SIGARCH
Computer Architecture News, 38(3):371–382, 2010.
[8] Qing Guo, Xiaochen Guo, Ravi Patel, Engin Ipek, and Eby G Friedman. Ac-dimm:
associative computing with stt-mram. ACM SIGARCH Computer Architecture
News, 41(3):189–200, 2013.
[9] Wang Kang, Haotian Wang, Zhaohao Wang, Youguang Zhang, and Weisheng
Zhao. In-memory processing paradigm for bitwise logic operations in stt–mram.
IEEE Transactions on Magnetics, 53(11):1–4, 2017.
[10] Shubham Jain, Ashish Ranjan, Kaushik Roy, and Anand Raghunathan. Computing
in memory with spin-transfer torque magnetic ram. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 26(3):470–483, 2018.
[11] Farhana Parveen, Zhezhi He, Shaahin Angizi, and Deliang Fan. HieIM: Highly
flexible in-memory computing using stt mram. In Design Automation Conference
(ASP-DAC), 2018 23rd Asia and South Pacific, pages 361–366. IEEE, 2018.
[12] Yu Pan, Peng Ouyang, Yinglin Zhao, Wang Kang, Shouyi Yin, Youguang Zhang,
Weisheng Zhao, and Shaojun Wei. A multilevel cell stt-mram-based computing
in-memory accelerator for binary convolutional neural network. IEEE Transactions
on Magnetics, 54(99):1–5, 2018.
[13] M Cassinerio, N Ciocchini, and D Ielmini. Logic computation in phase change
materials by threshold and memory switching. Advanced Materials, 25(41):5975–
5980, 2013.
[14] C David Wright, Peiman Hosseini, and Jorge A Vazquez Diosdado. Beyond von-
neumann computing with nanoscale phase-change memory devices. Advanced
Functional Materials, 23(18):2248–2254, 2013.
[15] C David Wright, Yanwei Liu, Krisztian I Kohary, Mustafa M Aziz, and Robert J
Hicken. Arithmetic and biologically-inspired computing using phase-change
materials. Advanced Materials, 23(30):3408–3413, 2011.
[16] Peiman Hosseini, Abu Sebastian, Nikolaos Papandreou, C David Wright, and Har-
ish Bhaskaran. Accumulation-based computing using phase-change memories
with fet access devices. IEEE Electron Device Letters, 36(9):975–977, 2015.
[17] Shuangchen Li, Cong Xu, Qiaosha Zou, Jishen Zhao, Yu Lu, and Yuan Xie.
Pinatubo: A processing-in-memory architecture for bulk bitwise operations
in emerging non-volatile memories. In Proceedings of the 53rd Annual Design
Automation Conference, page 173. ACM, 2016.
[18] Geoffrey W Burr, Robert M Shelby, Severin Sidler, Carmelo Di Nolfo, Junwoo Jang,
Irem Boybat, Rohit S Shenoy, Pritish Narayanan, Kumar Virwani, Emanuele U
Giacometti, et al. Experimental demonstration and tolerancing of a large-scale
neural network (165 000 synapses) using phase-change memory as the synaptic
weight element. IEEE Transactions on Electron Devices, 62(11):3498–3507, 2015.
[19] GW Burr, P Narayanan, RM Shelby, Severin Sidler, Irem Boybat, Carmelo di Nolfo,
and Yusuf Leblebici. Large-scale neural networks implemented with non-volatile
memory as the synaptic weight element: Comparative performance analysis
(accuracy, speed, and power). In Electron Devices Meeting (IEDM), 2015 IEEE
International, pages 4–4. IEEE, 2015.
[20] Abu Sebastian, Tomas Tuma, Nikolaos Papandreou, Manuel Le Gallo, Lukas Kull,
Thomas Parnell, and Evangelos Eleftheriou. Temporal correlation detection using
computational phase-change memory. Nature Communications, 8(1):1115, 2017.
[21] M Le Gallo, A Sebastian, G Cherubini, H Giefers, and E Eleftheriou. Compressed
sensing recovery using computational memory. In Electron Devices Meeting
(IEDM), 2017 IEEE International, pages 28–3. IEEE, 2017.
[22] Manuel Le Gallo, Abu Sebastian, Roland Mathis, Matteo Manica, Heiner Giefers,
Tomas Tuma, Costas Bekas, Alessandro Curioni, and Evangelos Eleftheriou.
Mixed-precision in-memory computing. Nature Electronics, 1(4):246, 2018.
[23] Shahar Kvatinsky, Dmitry Belousov, Slavik Liman, Guy Satat, Nimrod Wald,
Eby G Friedman, Avinoam Kolodny, and Uri C Weiser. Magic-memristor-aided
logic. IEEE Transactions on Circuits and Systems II: Express Briefs, 61(11):895–899,
2014.
[24] Mahdi Nazm Bojnordi and Engin Ipek. Memristive boltzmann machine: A hard-
ware accelerator for combinatorial optimization and deep learning. In High
Performance Computer Architecture (HPCA), 2016 IEEE International Symposium
on, pages 1–13. IEEE, 2016.
[25] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian,
John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. Isaac:
A convolutional neural network accelerator with in-situ analog arithmetic in
crossbars. ACM SIGARCH Computer Architecture News, 44(3):14–26, 2016.
[26] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. Pipelayer: A pipelined reram-
based accelerator for deep learning. In High Performance Computer Architecture
(HPCA), 2017 IEEE International Symposium on, pages 541–552. IEEE, 2017.
[27] Ximing Qiao, Xiong Cao, Huanrui Yang, Linghao Song, and Hai Li. Atomlayer:
a universal reram-based cnn accelerator with atomic layer computation. In
Proceedings of the 55th Annual Design Automation Conference, page 103. ACM,
2018.
[28] Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. Graphr:
Accelerating graph processing using reram. In High Performance Computer
Architecture (HPCA), 2018 IEEE International Symposium on, pages 531–543. IEEE,
2018.
[29] Seth H Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalak-
shmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. Ndc: Analyzing
the impact of 3d-stacked memory+logic devices on mapreduce workloads. In 2014
IEEE International Symposium on Performance Analysis of Systems and Software
(ISPASS), pages 190–200. IEEE, 2014.
[30] Amin Farmahini-Farahani, Jung Ho Ahn, Katherine Morrow, and Nam Sung Kim.
Nda: Near-dram acceleration architecture leveraging commodity dram devices
and standard memory modules. In High Performance Computer Architecture
(HPCA), 2015 IEEE 21st International Symposium on, pages 283–295. IEEE, 2015.
[31] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L Greathouse,
Lifan Xu, and Michael Ignatowski. Top-pim: throughput-oriented programmable
processing in memory. In Proceedings of the 23rd international symposium on
High-performance parallel and distributed computing, pages 85–98. ACM, 2014.
[32] Mingyu Gao and Christos Kozyrakis. Hrl: Efficient and flexible reconfigurable
logic for near-data processing. In High Performance Computer Architecture (HPCA),
2016 IEEE International Symposium on, pages 126–137. IEEE, 2016.
[33] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris:
Scalable and efficient neural network acceleration with 3d memory. ACM SIGOPS
Operating Systems Review, 51(2):751–764, 2017.
[34] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal
Mukhopadhyay. Neurocube: A programmable digital neuromorphic architecture
with high-density 3d memory. In Computer Architecture (ISCA), 2016 ACM/IEEE
43rd Annual International Symposium on, pages 380–392. IEEE, 2016.
[35] Shuangchen Li, Dimin Niu, Krishna T Malladi, Hongzhong Zheng, Bob Brennan,
and Yuan Xie. Drisa: A dram-based reconfigurable in-situ accelerator. In Proceed-
ings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture,
pages 288–301. ACM, 2017.
[36] Quan Deng, Lei Jiang, Youtao Zhang, Minxuan Zhang, and Jun Yang. Dracc: a
dram based accelerator for accurate cnn inference. In Proceedings of the 55th
Annual Design Automation Conference, page 168. ACM, 2018.
[37] Jintao Zhang, Zhuo Wang, and Naveen Verma. In-memory computation of a
machine-learning classifier in a standard 6t sram array. IEEE Journal of Solid-State Circuits,
52(4):915–924, 2017.
[38] Amogh Agrawal, Akhilesh Jaiswal, Chankyu Lee, and Kaushik Roy. X-sram: En-
abling in-memory boolean computations in cmos static random access memories.
IEEE Transactions on Circuits and Systems I: Regular Papers, 65(99):1–14, 2018.
[39] Mingu Kang, Sujan K Gonugondla, Ameya Patil, and Naresh R Shanbhag. A
multi-functional in-memory inference processor using a standard 6t sram array.
IEEE Journal of Solid-State Circuits, 53(2):642–655, 2018.
[40] Minesh Patel, Jeremie S Kim, and Onur Mutlu. The reach profiler (reaper):
Enabling the mitigation of dram retention failures via profiling at aggressive
conditions. ACM SIGARCH Computer Architecture News, 45(2):255–268, 2017.
[41] Sang Phill Park, Sumeet Gupta, Niladri Mojumder, Anand Raghunathan, and
Kaushik Roy. Future cache design using stt mrams for improved energy efficiency:
devices, circuits and architecture. In Proceedings of the 49th Annual Design
Automation Conference, pages 492–497. ACM, 2012.
[42] Sparsh Mittal. A survey of architectural techniques for improving cache power
efficiency. Sustainable Computing: Informatics and Systems, 4(1):33–43, 2014.
[43] Nour Sayed, Rajendra Bishnoi, Fabian Oboril, and Mehdi B Tahoori. A cross-
layer adaptive approach for performance and power optimization in stt-mram. In
2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages
791–796. IEEE, 2018.
[44] T Nirschl, JB Philipp, TD Happ, Geoffrey W Burr, B Rajendran, M-H Lee, A Schrott,
M Yang, M Breitwisch, C-F Chen, et al. Write strategies for 2 and 4-bit multi-
level phase-change memory. In Electron Devices Meeting, 2007. IEDM 2007. IEEE
International, pages 461–464. IEEE, 2007.
[45] Bing Li, Yu Hu, Ying Wang, Jing Ye, and Xiaowei Li. Power-utility-driven write
management for mlc pcm. ACM Journal on Emerging Technologies in Computing
Systems (JETC), 13(3):50, 2017.
[46] H-S Philip Wong, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi Wu, Pang-Shiu
Chen, Byoungil Lee, Frederick T Chen, and Ming-Jinn Tsai. Metal–oxide rram.
Proceedings of the IEEE, 100(6):1951–1970, 2012.
[47] Cong Xu, Dimin Niu, Naveen Muralimanohar, Rajeev Balasubramonian, Tao
Zhang, Shimeng Yu, and Yuan Xie. Overcoming the challenges of crossbar
resistive memory architectures. In 2015 IEEE 21st International Symposium on
High Performance Computer Architecture (HPCA), pages 476–488. IEEE, 2015.
[48] Bing Li, Bonan Yan, Chenchen Liu, and Hai Helen Li. Build reliable and efficient
neuromorphic design with memristor technology. In Proceedings of the 24th Asia
and South Pacific Design Automation Conference, pages 224–229. ACM, 2019.
[49] Akhilesh Jaiswal, Amogh Agrawal, and Kaushik Roy. In-situ, in-memory stateful
vector logic operations based on voltage controlled magnetic anisotropy. Scientific
Reports, 8(1):5738, 2018.
[50] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-
net: Imagenet classification using binary convolutional neural networks. In
European Conference on Computer Vision, pages 525–542. Springer, 2016.
[51] Abu Sebastian, Manuel Le Gallo, Geoffrey W. Burr, Sangbum Kim, Matthew
BrightSky, and Evangelos Eleftheriou. Tutorial: Brain-inspired computing using
phase-change memory devices. Journal of Applied Physics, 124(11):111101, 2018.
[52] Stefano Ambrogio, Pritish Narayanan, Hsinyu Tsai, Robert M Shelby, Irem Boybat,
Carmelo Nolfo, Severin Sidler, Massimo Giordano, Martina Bodini, Nathan CP
Farinha, et al. Equivalent-accuracy accelerated neural-network training using
analogue memory. Nature, 558(7708):60, 2018.
[53] Bonan Yan, Chenchen Liu, Xiaoxiao Liu, Yiran Chen, and Hai Li. Understanding
the trade-offs of device, circuit and application in reram-based neuromorphic
computing systems. In Electron Devices Meeting (IEDM), 2017 IEEE International,
pages 11–4. IEEE, 2017.
[54] Ming Cheng, Lixue Xia, Zhenhua Zhu, Yi Cai, Yuan Xie, Yu Wang, and Huazhong
Yang. Time: A training-in-memory architecture for memristor-based deep neural
networks. In Proceedings of the 54th Annual Design Automation Conference 2017,
page 26. ACM, 2017.
[55] Bing Li, Linghao Song, Fan Chen, Xuehai Qian, Yiran Chen, and Hai Helen Li.
Reram-based accelerator for deep learning. In 2018 Design, Automation & Test in
Europe Conference & Exhibition (DATE), pages 815–820. IEEE, 2018.
[56] Shihui Yin, Xiaoyu Sun, Shimeng Yu, Jae-sun Seo, and Chaitali Chakrabarti. A
parallel rram synaptic array architecture for energy-efficient recurrent neural
networks. In 2018 IEEE International Workshop on Signal Processing Systems (SiPS),
pages 13–18. IEEE, 2018.
[57] Zichen Fan, Ziru Li, Bing Li, Yiran Chen, and Hai Helen Li. Red: A reram-based
deconvolution accelerator. In 2019 Design, Automation & Test in Europe Conference
& Exhibition (DATE), 2019.
