Crossbar-Constrained Technology Mapping for ReRAM based In-Memory
  Computing by Bhattacharjee, Debjyoti et al.
Debjyoti Bhattacharjee, Yaswanth Tavva, Arvind Easwaran, Anupam Chattopadhyay
School of Computer Science and Engineering,
Nanyang Technological University, Singapore - 639798
Email: {debjyoti001,yaswanth001,arvinde,anupam}@ntu.edu.sg
Abstract—In recent times, Resistive RAMs (ReRAMs) have
gained significant prominence due to their unique feature of sup-
porting both non-volatile storage and logic capabilities. ReRAM
is also reported to provide extremely low power consumption
compared to the standard CMOS storage devices. As a result,
researchers have explored the mapping and design of diverse
applications, ranging from arithmetic to neuromorphic comput-
ing structures to ReRAM-based platforms. ReVAMP, a general-
purpose ReRAM computing platform, has been proposed recently
to leverage the parallelism exhibited in a crossbar structure.
However, the technology mapping on ReVAMP remains an open
challenge. Though technology mapping with device/area
constraints has been proposed, crossbar constraints have not
been considered so far. In this work, we address this problem. Two
technology mapping flows are proposed, considering different
runtime-efficiency trade-offs. Both the mapping flows take cross-
bar constraints into account and generate feasible mapping for
a variety of crossbar dimensions. Our proposed algorithms are
highly scalable and reveal important design hints for ReRAM-
based implementations.
I. INTRODUCTION
Traditional computing platforms require transfer of data
along energy-hungry buses between the compute cores and
the memory hierarchy. This has resulted in performance degra-
dation (memory wall) leading to challenges while processing
big-data [1], [2]. Data transfer between cores and memory is
often costlier than computing itself [3]. Such challenges can be
mitigated by logic-in-memory (LiM) enabled devices, which
can perform simple Boolean operations within the memory
or very close to the memory itself. Efficient algorithms for
LiM can lead to considerable improvements in performance of
applications that require large memory bandwidth to process
inputs [4], [5], [6], [7]. Therefore, we need to build mapping
tools to leverage the benefits of LiM architectures.
One of the most promising emerging non-volatile memories
with computation capabilities is Resistive RAM (ReRAM).
ReRAMs offer fast read/write speeds [8], high endurance [9],
and long retention times [10], along with the scope of 3D
fabrication [11]. Large passive crossbar arrays can be enabled
by preventing parasitic currents by means of devices such as
a select device in series to a switch (1S1R) or a Complementary
Resistive Switch (CRS) [12]. Unlike CRS devices,
1S1R devices offer non-destructive readouts, making them
suitable for logic-in-memory operations. ReRAMs are fast
gaining popularity for use as computation devices. Recently,
multiple realizations of arithmetic blocks using
ReRAMs have been proposed [13], [14]. In addition, efficient
implementations of encryption, data compression and linear
algebra algorithms have also been mapped to ReRAMs [15],
[16], [17]. ReRAMs have also been used for neuromorphic
computation [18]. Analog non-volatile ReRAM based
synapses have been used for gray-scale face classification
for energy savings [19]. Even emulation of metaplasticity
has been demonstrated using analog ReRAMs [20]. Analog
memristor crossbar arrays have also been used for sparse
encoding of input data, which can be extended for image
processing applications [21]. Further, to enable uniform analog
switching and fast speed, along with excellent retention properties,
a thermal enhanced layer has been proposed to confine heat
in the switching layer [22]. Multi-state memristors have also been
used for ternary arithmetic as well as native multi-valued
logic implementation [23], [24]. It has been experimentally
demonstrated that 3D fabrication is feasible for resistive RAM
arrays [25].
This work is an extension of the following publication: Bhattacharjee D,
Devadoss R, Chattopadhyay A. ReVAMP: ReRAM based VLIW architecture
for in-memory computing. In 2017 Design, Automation & Test in Europe
Conference & Exhibition (DATE), 2017 Mar 27 (pp. 782-787). IEEE.
From the perspective of computing arbitrary Boolean func-
tions, a preliminary method for computing using memristors
realizing material implication, was presented by Lehtonen et
al. [26]. Further, it was shown that any arbitrary Boolean ex-
pression can be computed using two working memristors that
realize material implication [27]. Logic synthesis flows have
been proposed using Imply Sequence Diagram and Or-Inverter
Graph for memristors realizing material implication [28],
[29]. Optimal technology mapping has been investigated for
ReRAM devices that realize three-input Boolean majority
with a single input inverted [30]. In
addition, area-constrained technology mapping for individual
ReRAM devices using Integer Linear Programming, along
with scalable heuristics have been proposed [31]. A general
purpose bit-serial Programmable Logic in Memory (PLiM)
architecture was proposed [32] that uses ReRAM crossbar for
data storage as well as computation. A compiler for the same
was developed by Soeken et al. [33]. However, these works
either consider independent devices or use serial operations
on ReRAM crossbar arrays. A transpose resistive memory
with additional controller circuitry, was proposed by Nishil
et al. [34], for which a technology mapping was proposed
recently [35]. Inherently, ReRAM arrays support operations
on multiple devices that are on the same wordline, allowing
parallel operations. The ReVAMP architecture allows harnessing
this parallelism by means of VLIW instructions [36].
arXiv:1809.08195v1 [cs.ET] 21 Sep 2018
A ReRAM crossbar array consists of multiple ReRAM
devices that share wordlines and bitlines. In this paper, we
address the problem of technology mapping for computation
using ReRAM crossbar array, by using ReVAMP as the
target logic-in-memory architecture. The main challenge is
to efficiently harness the bit-level parallelism offered by the
crossbar arrays. The key contributions of the paper are as
follows.
• Any arbitrary AIG/MIG with k-levels can be mapped
with 2(k + 1) devices, arranged as a crossbar with at
least two bitlines.
• Any Boolean expression, expressed as an
Exclusive-Sum-Of-Product (ESOP), can be computed on a crossbar with
three wordlines and at least two bitlines.
• We present two technology mapping approaches for
ReVAMP in-memory computing platform.
• The area-constrained technology mapping approach uses
And-Inverter Graph for logic representation and then uses
a hierarchical method for generating ReVAMP instruc-
tions, aware of the crossbar dimensions. The method sup-
ports mapping to a wide variety of crossbar dimensions.
• The delay-constrained mapping approach relies on har-
nessing bit-level parallelism of the ReRAM crossbar array
by maximizing parallel operations across multiple devices
that share the same wordline. This method achieves
significantly lower delay compared to the existing
ReRAM-based serial logic-in-memory architecture.
The rest of the paper is organized as follows. In section II,
we present an introduction to ReVAMP, along with a brief
introduction to Boolean logic networks. Section III formally
presents the technology mapping problem followed by outline
of the solution approaches. Section IV describes the solu-
tion for the area-constrained technology mapping. Section V
presents a technology mapping solution for fast mapping by
exploiting inherent crossbar parallelism. Benchmarking results
are presented in section VI. Section VII concludes the paper.
II. PRELIMINARIES
In this section, we present the details of logic operations
using ReRAM crossbar arrays, followed by ReVAMP — a
ReRAM based general purpose computing architecture. We
also summarily present the details of Boolean logic networks
which will be used for technology mapping.
A. Logic in memory operations using ReRAM crossbar arrays
The ReRAM device model proposed in [37] was fitted to a
Pt/(11nm)TaOx/Ta cell. The selector device used is the
Pt/TaOx/TiO2/TaOx/Pt crested barrier device proposed
in [38], [39]. Both devices were implemented in VerilogA
and simulated using Cadence Spectre. The ReRAM
model used considers a filamentary region in which the switching
takes place by a redistribution of ionic defects, i.e., oxygen
vacancies. The filament is modeled by three lumped circuit
elements: a Schottky-type diode representing the current flow
through the Pt/TaOx interface, a disc resistance describing
the region close to the Schottky-type interface and a resistance,
which comprises the plug resistance describing the remaining
part of the filament and the resistance of the electrodes.
The state variable of the resistive switching model is the
oxygen vacancy concentration N close to the active electrode
interface, which modulates the disc resistance and the electron
transport through the Schottky-type diode.
Fig. 1: Logic operation using 1S1R device. (a) FSM. Wordline
input 1 and bitline input 0 change the device state to logic
1 whereas wordline input 0 and bitline input 1 change the
device state to logic 0. Other input combinations do not change
the internal device state Z. (b) Truth Table of the intrinsic
function Zn.
For logic operations, each ReRAM device can be interpreted
as a finite-state-machine (FSM), as shown in Fig. 1. Each
device has two input terminals—the wordline wl and the
bitline bl. The internal resistive state Z of the ReRAM acts
as a third input and the stored bit. If the state Z is in High
Resistive State (HRS), it is interpreted as logic 0, while Low
Resistive State (LRS) is interpreted as logic 1. As shown in
following equation, the next state of the device Zn is expressed
as a 3-input majority function, with the bitline input inverted.
$Z_n = M_3(Z, wl, \overline{bl})$ (1)
This forms the fundamental logic operation that can be realized
using ReRAM devices. The inversion operation is equivalent
to using the intrinsic function Zn with one input (wordline or
state) as 0, the second input (state or wordline) at 1 and the
bitline input as the variable to be inverted.
$\overline{v} = M_3(0, 1, \overline{v}) = M_3(1, 0, \overline{v})$ (2)
Since majority and inversion operations form a functionally
complete set, any Boolean function can be realized using only
Zn operations.
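The intrinsic function and the inversion identity above can be checked exhaustively. The following is a minimal Python sketch, not part of the original work; the names `m3` and `next_state` are illustrative choices, and logic values are modeled as the integers 0 and 1.

```python
def m3(a: int, b: int, c: int) -> int:
    """3-input Boolean majority over values in {0, 1}."""
    return int(a + b + c >= 2)


def next_state(z: int, wl: int, bl: int) -> int:
    """Intrinsic device function: majority of state, wordline and inverted bitline."""
    return m3(z, wl, 1 - bl)


for z in (0, 1):
    assert next_state(z, 1, 0) == 1  # wl = 1, bl = 0 sets the device
    assert next_state(z, 0, 1) == 0  # wl = 0, bl = 1 resets the device
    assert next_state(z, 0, 0) == z  # other combinations hold the state
    assert next_state(z, 1, 1) == z

for v in (0, 1):
    # Inversion: one of (state, wordline) at 0, the other at 1, v on the bitline.
    assert next_state(0, 1, v) == 1 - v
    assert next_state(1, 0, v) == 1 - v
```

The first loop reproduces the state transitions described for Fig. 1, and the second confirms that a device can store the complement of a bitline input.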
A ReRAM crossbar memory consists of multiple 1S1R
ReRAM devices, arranged in the form of a crossbar array [40].
Multiple devices share wordlines and bitlines. Fig. 2 shows
a ReRAM crossbar array with 6 devices arranged in 2 × 3
configuration i.e. 2 wordlines and 3 bitlines. The internal state
of device Dij at wordline i and bitline j is referred as Sij .
The devices D00, D01 and D02 share wordline 0 whereas
the devices D10, D11 and D12 share wordline 1. Similarly,
the devices D00 and D10 share bitline 0 and so on. Like
conventional RAM arrays, ReRAM memories are accessed
as words. It should be noted that all the devices in a word
share a common wordline. For example, word 0 has devices
D00, D01 and D02. The ReRAM array is programmed using
a V/2 scheme, with V = 4.8V. Logic 1 and 0 are realized by
voltage pulses of 2.4V and −2.4V respectively. Unselected
lines are kept grounded. In a readout phase, the presence of
a current greater than 4µA implies logic 1 while its absence
implies logic 0.
Fig. 2: A 2 × 3 ReRAM crossbar array (a) Six 1S1R devices
arranged as crossbar. (b) The crossbar represented as a
schematic. Sij represents internal state of device at wordline
i and bitline j. wli represents the ith wordline input while blj
represents the jth bitline input.
wl1 | S12 S11 S10
wl0 | S02 S01 S00
    | bl2 bl1 bl0
Fig. 3: Cadence simulation of a 1S1R device. Vwl and Vbl
indicate the applied wordline and bitline voltages. I indicates
the read out current, observed at the bitline. The three panels
plot Vwl (V), Vbl (V) and I against time (ns) over cycles t1 to t5.
Fig. 3 shows the Cadence simulation for a single device.
In cycle t1, 0 and 1 are applied to the wordline and bitline,
respectively to set the logic state to 0 (HRS). In cycle t2, the
device state is read out. The read out current is less than 5µA,
confirming the device is in logic state 0. In the next cycle t3,
1 and 0 are applied to the wordline and bitline respectively,
to set the logic state to 1 (LRS). In t4, the device is read out
and the read out current is greater than 4µA, indicating the
logic state to be 1.
B. ReVAMP architecture
We briefly present the ReRAM based VLIW Architecture for
in-Memory comPuting (ReVAMP), depicted in Fig. 4. The
architecture uses two ReRAM crossbar memories — the In-
struction Memory (IM) and the Data Storage and Computation
Memory (DCM). The IM is a regular instruction memory
accessed using the program counter (PC). The DCM hosts
data and in-memory computation. All the devices in a single
word of the DCM can be operated in parallel, with each
operation being the intrinsic Zn function. Since multiple Zn
operations operate in parallel, the proposed architecture is
VLIW in nature. Splitting the instruction and data memory
allows reduction in overall execution time, by pipelining
instruction fetch and computation. The ReVAMP architecture
is parameterized as shown in Table I, and can be configured
as necessary.
TABLE I: ReVAMP parameters.
Parameter | Description
SD | Number of words in the DCM
wD | Number of bits in a word in DCM
wD | Number of primary input lines
SI | Number of words in the IM
wI | Number of bits in a word in IM
The ReVAMP architecture has a three-stage pipeline with
instruction fetch (IF), instruction decode (ID) and exe-
cute (EX) stages. In the IF stage, the instruction at the address
held by the program counter (PC) is fetched from the IM
and loaded into the instruction register (IR) before the PC is
updated. In the ID stage, the instruction is read from IR to
determine the control inputs for the source select multiplexer,
the crossbar interconnect and the write circuit.
The data memory register (DMR) stores the data read out
from the DCM. The primary input register (PIR) buffers
the primary input data. Both DMR and PIR are wD bits
wide. Depending on the control input Mc, the source se-
lect multiplexer selects either the DMR or the PIR as the
data source. Thereafter, the crossbar-interconnect is used to
generate the wordline and wD number of bitline inputs by
appropriate permutation of the input data, as per the control
signals stored in Cc. The crossbar-interconnect is basically a
set of multiplexers, one per output, each of which selects one of the
input wD bits. The write circuit reads the value of the target
wordline from the register Wc and the output of the
crossbar-interconnect to determine and apply the inputs to the row and
column decoder of the DCM.
ReVAMP Instruction Set: The ReVAMP architecture sup-
ports two instructions—Read and Apply, in the formats shown
in Fig. 5. The Read instruction reads the word at the address
wl from the DCM and stores it in the DMR. Now available
in the DMR, this word can be used as input by the following
instructions.
The Apply instruction is used for computation in the DCM.
The address w specifies the word in the DCM that will be
computed upon. A bit flag s chooses whether the inputs will
be from primary input (PIR) or DMR. A two-bit flag ws
specifies the wordline input: 00 selects logic 0, 01 selects
logic 1, 11 selects the input specified by the wb flag and 10 is
not a valid input. The wb bit-vector is used to specify the
bit within the chosen data source for use as wordline input.
(v val) pairs are used to specify bitline inputs. The bit
flag v indicates if the input is NOP or a valid input. Similar
to wb, the bit-vector val specifies the bit within the chosen
data source for use as bitline input.
In each instruction, one bit is required to specify the opcode,
and log2(SD) bits are required to select the word. One bit is
required for s flag and two bits are required for the wordline
source select flag ws. Each (v val) pair requires one bit for the
v flag and log2(wD) bits for specifying the bit in the selected
input source. The field wb also requires log2(wD) bits. Thus,
the lengths of these instructions are:
$IL_{Read} := 1 + \log_2(S_D)$ (3)
$IL_{Apply} := 3 + \log_2(S_D) + (1 + w_D)(1 + \log_2(w_D))$ (4)
The word length wI of the IM should be greater than or equal
to max(ILRead, ILApply).
Fig. 4: ReVAMP architecture, with the Instruction Fetch,
Instruction Decode and Execute stages.
Read w
Apply w s ws wb (v val1) (v val2) . . . (v valwD )
Fig. 5: ReVAMP instruction format.
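As a worked illustration of Eq. (3) and Eq. (4), the sketch below computes the instruction lengths for a given crossbar size. It is a hedged example rather than the authors' tooling: the function names are hypothetical, and rounding log2 up to a whole number of bits for sizes that are not powers of two is an assumption that the text does not state.

```python
def bits_needed(n: int) -> int:
    """Bits required to index n items, i.e. ceil(log2(n)) (assumed rounding)."""
    if n < 1:
        raise ValueError("n must be positive")
    return (n - 1).bit_length()


def read_length(s_d: int) -> int:
    """Eq. (3): one opcode bit plus the word address."""
    return 1 + bits_needed(s_d)


def apply_length(s_d: int, w_d: int) -> int:
    """Eq. (4) in its closed form."""
    return 3 + bits_needed(s_d) + (1 + w_d) * (1 + bits_needed(w_d))


def apply_length_by_field(s_d: int, w_d: int) -> int:
    """Same length, summed field by field as described in the text."""
    opcode, s_flag, ws_flag = 1, 1, 2
    wb = bits_needed(w_d)
    pairs = w_d * (1 + bits_needed(w_d))  # one (v val) pair per bitline
    return opcode + bits_needed(s_d) + s_flag + ws_flag + wb + pairs


def min_im_word_length(s_d: int, w_d: int) -> int:
    """Smallest admissible IM word length wI."""
    return max(read_length(s_d), apply_length(s_d, w_d))


# The 3 x 2 DCM used in the XOR walkthrough.
assert read_length(3) == 3
assert apply_length(3, 2) == apply_length_by_field(3, 2) == 11
```

Summing the fields individually and comparing against the closed form is a quick way to confirm that Eq. (4) accounts for the opcode, s, ws, wb and all wD (v val) pairs.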
We demonstrate the working of the ReVAMP architecture.
Let us consider a 3×2 crossbar as the DCM for realizing a
two-bit XOR function for operands p1p0 and q1q0. To compute the
XOR, we use the following equation:
$p_i \oplus q_i = \overline{p_i}.q_i + p_i.\overline{q_i} = p_i.\overline{q_i} + \overline{p_i + \overline{q_i}}$ (5)
Fig. 6a shows the sequence of operations performed to realize
a 2-bit XOR function and the steps are described below.
• Step 1: Inputs p0 and p1 are loaded to wordline 0 in
inverted form via the PIR, since $M_3(0, 1, \overline{p_i}) = \overline{p_i}$.
• Step 2: Wordline 0 is read out using a Read instruction.
The read out values $\overline{p_0}$ and $\overline{p_1}$ are stored in the DMR.
• Step 3-4: The read out value is loaded to wordlines 1
and 2 using two Apply instructions via the bitlines, as
$M_3(0, 1, \overline{\overline{p_i}}) = p_i$.
• Step 5: Inputs q0 and q1 are ANDed with the values in
wordline 2 in inverted form by using 0 as wordline input,
since $M_3(p_i, 0, \overline{q_i}) = p_i.\overline{q_i}$.
• Step 6: Inputs q0 and q1 are ORed with the values in
wordline 1 in inverted form by using 1 as wordline input,
since $M_3(p_i, 1, \overline{q_i}) = p_i + \overline{q_i}$.
• Step 7: The ORed values available in wordline 1 are read
out, using a Read instruction.
• Step 8: The values in the DMR are ORed in inverted form with the
contents of wordline 2 to complete the XOR operations,
as $p_i.\overline{q_i} + \overline{p_i + \overline{q_i}}$.
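The eight steps can be replayed in software to confirm that the final contents of wordline 2 are the bitwise XOR. The sketch below is illustrative only: the DCM is modeled as a list of words, `apply` is a hypothetical helper standing in for the Apply instruction, and the crossbar interconnect is assumed to pass bit j of the source straight to bitline j.

```python
from itertools import product


def m3(a: int, b: int, c: int) -> int:
    """3-input Boolean majority."""
    return int(a + b + c >= 2)


def apply(dcm, word, wl, bitline_inputs):
    """Every device in `word` computes M3(Z, wl, NOT bl) in parallel."""
    dcm[word] = [m3(z, wl, 1 - bl) for z, bl in zip(dcm[word], bitline_inputs)]


def xor_two_bit(p, q):
    """Replay Steps 1 to 8 on a 3 x 2 crossbar whose devices all start at 0."""
    dcm = [[0, 0] for _ in range(3)]
    apply(dcm, 0, 1, p)    # Step 1: wordline 0 holds NOT p
    dmr = list(dcm[0])     # Step 2: Read 0
    apply(dcm, 2, 1, dmr)  # Step 3: wordline 2 holds p
    apply(dcm, 1, 1, dmr)  # Step 4: wordline 1 holds p
    apply(dcm, 2, 0, q)    # Step 5: wordline 2 holds p AND NOT q
    apply(dcm, 1, 1, q)    # Step 6: wordline 1 holds p OR NOT q
    dmr = list(dcm[1])     # Step 7: Read 1
    apply(dcm, 2, 1, dmr)  # Step 8: OR in NOT(p OR NOT q)
    return dcm[2]


for p0, p1, q0, q1 in product((0, 1), repeat=4):
    assert xor_two_bit([p1, p0], [q1, q0]) == [p1 ^ q1, p0 ^ q0]
```

Each call to `apply` corresponds to one Apply instruction and each copy into `dmr` to one Read, so the sequence uses eight instructions in total, matching the step count.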
The set of instructions corresponding to the steps is shown
in Fig. 6b. This concludes the description of the ReVAMP
architecture. In the following subsection, we briefly describe the
structural representation of Boolean functions.
C. Logic representation
For representation of Boolean functions, we use two structural
representations namely And Inverter Graph (AIG) [41] and
Majority Inverter Graph (MIG) [42]. An AIG (MIG) is a
directed acyclic graph where each node is 2-input (3-input)
representing Boolean AND (Boolean Majority). A directed
edge i→ j exists if the output of the (parent) node i is an input
to the (child) node j. Each edge is marked as either regular or
inverted. A Primary Input (PI) node is either a logic constant
0/1 or a Boolean variable. If a node is not a PI, then it is
an internal node. A Primary Output (PO) node represents the
output of the function. An AIG (MIG) can have one or more
PO nodes. We define the level of a node n as follows.
Definition 1. The level of a node n, written as level(n), is
defined as the length of the longest path from any PI node to
the node n. The level of the PI nodes is zero.
Fig. 6: XOR computation of two-bit vectors p1p0 and q1q0.
(a) The steps of computation are shown graphically using a
crossbar schema. The read out steps are not shown explicitly.
The word highlighted in green represents the read out word.
(b) The corresponding instructions for the ReVAMP architecture.
The inputs p1p0 and q1q0 are made available on the PIR
during Step 1 and Step 5-6 respectively.
Step | Instruction
I1 | Apply 0 0 01 1 0 1 1
I2 | Read 0
I3 | Apply 2 1 01 1 0 1 1
I4 | Apply 1 1 01 1 0 1 1
I5 | Apply 2 0 00 1 0 1 1
I6 | Apply 1 0 01 1 0 1 1
I7 | Read 1
I8 | Apply 2 1 01 1 0 1 1
Example 1. Fig. 7a and Fig. 7b show an AIG and
a MIG respectively. In both the graphs, the primary in-
puts (a0, a1, a, b, . . .) are shown in square boxes and the
internal nodes (n1, n2, S1, S2, . . .) are shown in circles. The
output nodes (n5,S4) are shown in double lined circles. The
inverted edges (n2 → n4, S2 → S3, . . .) are marked using
dots.
Fig. 7: Logic representation. (a) An AIG (b) A MIG.
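Definition 1 translates directly into a longest-path computation over the graph. The sketch below assumes a hypothetical adjacency format in which `fanins` maps each internal node to the list of nodes feeding it; any name that is not a key is treated as a primary input. Edge inversions do not affect levels and are ignored. The example network is made up for illustration and is not the AIG of Fig. 7a.

```python
def node_levels(fanins):
    """Return {internal node: level}, with primary inputs at level 0."""
    levels = {}

    def level(n):
        if n not in fanins:  # primary input
            return 0
        if n not in levels:  # memoize so shared nodes are visited once
            levels[n] = 1 + max(level(i) for i in fanins[n])
        return levels[n]

    for n in fanins:
        level(n)
    return levels


# Hypothetical 4-node AIG over the primary inputs a, b and c.
example = {
    "n1": ["a", "b"],
    "n2": ["b", "c"],
    "n3": ["n1", "n2"],
    "n4": ["n3", "c"],
}
assert node_levels(example) == {"n1": 1, "n2": 1, "n3": 2, "n4": 3}
```

The recursion is adequate for small examples; a production flow over large benchmarks would more likely use an explicit topological traversal to avoid deep recursion.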
III. PROBLEM DEFINITION AND SOLUTION
In this section, we present the technology mapping problem for
the ReVAMP architecture along with overview of the proposed
solutions.
A. Problem definition
Area-constrained technology mapping: Given a Boolean
function represented as a Boolean logic network G and
crossbar dimension SD×wD, determine a sequence of instructions
I1, I2, ..., IT, where It ∈ {Read, Apply} and 1 ≤ t ≤ T, and PIR
inputs for the ReVAMP architecture that computes the output
nodes of the network G.
Delay-focused technology mapping: Given a Boolean
function represented as a Boolean logic network G and crossbar
width wD, determine a sequence of instructions I1, I2, ..., IT,
where It ∈ {Read, Apply} and 1 ≤ t ≤ T, PIR inputs and number
of words SD for the ReVAMP architecture that computes the
output nodes of the network G.
The quality of the solution is measured in terms of the delay
and the total number of devices required for the mapping. The
delay of a solution is equal to the number of instructions (T ).
The total number of devices is equal to SD × wD.
Fig. 8: Technology mapping flow for the ReVAMP architecture
B. Solution approach
In this paper, we propose two different approaches to the
problem. Fig. 8 shows the overall flowchart for the technology
mapping problem. In the first approach, we consider the area
constrained version of the problem, where we represent the
Boolean function as an AIG. We begin by partitioning the
AIG into k-input Look-up Tables (LUTs). A k-input LUT
is basically a function with at most k inputs and a single
output. Once the graph has been partitioned, the LUTs for
computation are scheduled in topological ordering, i.e. the
LUTs close to the primary input are scheduled first and so
on, till the output LUTs are computed. In order to compute a
LUT, we express the functionality of the LUT using Exclusive
Sum-Of-Products (ESOP) [43]. Any arbitrary ESOP can be
computed on the DCM with at least 3-wordlines and 2-
bitlines (explained in detail in Theorem IV.2) — the variables
which have to be used in inverted form are negated first (to
be applied via bitlines), followed by computing the product
terms and finally XORing them. To reduce the delay, the AND
computation for realizing the ESOP needs to minimize number
of reads performed and maximize number of AND operations
that can be done in parallel. Thereafter, we perform the XOR
of computed AND terms by means of an XOR reduction tree
of logarithmic depth in the number of AND terms.
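The logarithmic depth of the XOR reduction can be seen in a short sketch that combines terms pairwise, round by round. This models only the logical structure of the tree; the function name is hypothetical, and a round here is a level of the tree, not a single ReVAMP instruction.

```python
def xor_reduce(terms):
    """XOR all terms pairwise; return (result, number of tree levels)."""
    if not terms:
        raise ValueError("at least one term is required")
    terms = list(terms)
    rounds = 0
    while len(terms) > 1:
        # Pair up neighbours; an unpaired last term moves up unchanged.
        nxt = [terms[i] ^ terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            nxt.append(terms[-1])
        terms = nxt
        rounds += 1
    return terms[0], rounds


# Five AND terms need ceil(log2(5)) = 3 levels.
assert xor_reduce([1, 0, 1, 1, 0]) == (1, 3)
```

Since every level halves the number of remaining terms (rounding up), n terms are reduced in ceil(log2(n)) levels, and all XORs within one level are independent of each other.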
In the second approach, we focus on minimizing the delay
of the mapping, without any constraints on the number of
words. We use MIGs for logic representation in this approach.
We propose an algorithm with four phases — assignment of
nodes as host or input for computation, grouping nodes to
blocks, packing blocks to words followed by generation and
scheduling of instructions. We explain both the technology
mapping solutions in detail in the following sections.
IV. AREA-CONSTRAINED TECHNOLOGY MAPPING
In this section, we establish an upper bound on the number of
devices required to map any arbitrary AIG (MIG). Thereafter,
we present a scalable technique for area-constrained technology
mapping.
Theorem IV.1. Any AIG or MIG with k levels can be mapped
using 2(k + 1) devices, arranged as a crossbar with at least
two bitlines.
Proof: Since any AIG can be expressed as MIG, we prove
the theorem for MIG by means of an inductive proof. Before
explaining the proof, we describe a transformation to the input
MIG and prove the theorem on the transformed MIG. We
transform the MIG such that
• Each internal node has a single child. Nodes with multiple
fanout can be replicated bottom-up i.e., from the output
to the primary inputs.
• Each node has two non-inverted inputs and a single
inverted input. This can be realized by propagating the
inverts across nodes or by creating an inverted copy of the
node as required, using the following axiom for Boolean
majority.
$\overline{M_3(a, b, c)} = M_3(\overline{a}, \overline{b}, \overline{c})$ (6)
Now, we present the inductive proof for the transformed MIG.
The device at wordline 0 bitline 0 is used for inverting any
input v as needed by applying the input via the bitline with
1 as wordline input and 0 as internal state, i.e. $M_3(0, 1, \overline{v})$.
The inverted value $\overline{v}$ can be read out in the next cycle
and used in the following cycles using Apply instructions.
Any device is reset, i.e. its internal state Z is set to logic 0,
by applying 0 and 1 as wordline and bitline input respectively.
Base Case: An MIG with 1 level basically implies that the inputs
act as outputs and hence does not require any devices for
computation. Therefore, we consider the MIG in Fig. 9a with
2 levels as the base case. One of the non-inverted inputs W and
the inverted input B can be loaded to wordline 1. The second
non-inverted input H is loaded to wordline 0. Wordline 1
can be read out and in the next cycle, W and B are applied as
wordline and bitline inputs of the device holding H to compute
S.
Inductive Case: Let us assume that for an MIG with k levels,
the theorem holds true. Now, consider an MIG with (k + 1)
levels, as shown in Fig. 9b. The subtrees MIGwk, MIGbk
and MIGhk have k levels. Therefore, these MIGs can be
computed using 2(k + 1) devices. Let the subtree MIGwk
be computed on wordlines 1 to (k + 1) and the result Wk be
stored at wordline 1 bitline 1. All the devices, except the device
holding Wk, are reset. Similarly, let the subtree MIGbk be computed
on wordlines 1 to (k + 1) and the result Bk be stored at wordline 1
bitline 0, followed by reset of all the devices, except wordline
1. The last subtree MIGhk is computed using wordlines 0, 2
to (k + 1) with the result Hk stored at wordline 0 and bitline
1. Therefore, to compute the final output Tk+1, wordline 1 is
read out and then Wk and Bk are applied to the wordline and
bitline of the device holding Hk to compute Tk+1. This completes
the proof.
Fig. 9: (a) Mapping an MIG with 2-levels (b) Mapping an
MIG with (k + 1)-levels.
Fig. 10: A portion of the LUT graph. Each node in this
partitioned graph represents a LUT.
A. Logic network partitioning and scheduling
We represent the Boolean function as an AIG. We partition
the graph into k-input LUTs using ABC [44]. From here on,
we refer to the partitioned graph as LUT graph and each node
in the partitioned graph represents a LUT.
Example 2. For k = 4, the AIG in Fig. 7a can be partitioned
into two LUTs, as shown by dotted lines.
Bound on number of devices required: To determine the bound
on number of devices required for the storage of intermediate
results, we define a transient node.
Definition 2. In a LUT graph, a node n is termed a transient
node in level l if level(n) < l and there exists an edge
n→ n′ such that level(n′) > l.
Example 3. In Fig. 10, LUT L2 in level l − 1 has an edge
to LUT N1 in level l + 1, therefore it is a transient node for
level l.
Let the number of nodes, including transient nodes in a
level l be Nl. We can schedule the nodes of the LUT Graph
in topological ordering, i.e all nodes at level l−1 are scheduled
before any node in level l is scheduled. A node in level l is
dependent only on the nodes (including transient nodes) that
are present in level l − 1. Therefore, once all the nodes in
level l have been scheduled, the nodes in level l − 1 can be
reset. Doing this iteratively, the number of devices required
for scheduling a LUT graph is
$MinDev = \max_{0 \le l \le L_{max}-1} (N_l + N_{l+1})$ (7)
Fig. 11: Memory layout of the ReRAM crossbar array with r
wordlines and c bitlines. The three wordlines e0, e1 and e2
are used for computation. The rest of the wordlines s0, . . . , st
are used for storage of the intermediate results. 3 + t = SD.
The memory layout of the crossbar, with
SD (= t+ 3) wordlines and wD bitlines is shown in
Fig. 11. The top t wordlines are used for storing the output
of each LUT. The bottom three wordlines e0, e1 and e2 are
reserved for computation of each LUT. For the scheduling to
be feasible, MinDev should be less than or equal to (t×wD).
If the scheduling condition is not feasible, a different value
of k is used to partition the graph and feasibility is checked.
Once the scheduling condition is satisfied for a given crossbar
size, nodes are scheduled in topological order. The device
where the output of an LUT (node in the LUT graph) would
be stored, is determined according to the best fit method.
The wordline in the crossbar with minimum number of free
devices is chosen if the number of nodes to schedule is less
than or equal to the number of free devices in that wordline.
If no such wordline exists, a wordline with maximum number
of free devices is chosen iteratively, till all the nodes have
been allocated a device. A device storing a node n is marked
dirty if all the successors of n have already been allocated.
If none of the devices are free, then the wordline with
maximum dirty bits is reset and allocation starts. This process
is repeated till all the nodes have been scheduled, along
with target device allocation. The overall technique has been
shown in Algorithm 1.
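The device bound of Eq. (7) and the feasibility check against the t storage wordlines can be sketched as follows. This is an illustrative reading of Definition 2 and Eq. (7), not the authors' implementation: the input format (`level` as a node-to-level mapping, `edges` as parent-child pairs) and the function names are assumptions, and the caller decides which nodes occupy devices (the example passes LUT nodes only).

```python
def min_devices(level, edges):
    """Eq. (7): max over l of N_l + N_{l+1}, counting transient nodes."""
    l_max = max(level.values())
    count = [0] * (l_max + 1)
    for l in level.values():
        count[l] += 1
    # Farthest successor level of each node decides where it is transient.
    farthest = {}
    for n, succ in edges:
        farthest[n] = max(farthest.get(n, level[n]), level[succ])
    for n, far in farthest.items():
        for l in range(level[n] + 1, far):  # level(n) < l < level(n')
            count[l] += 1
    return max((count[l] + count[l + 1] for l in range(l_max)), default=count[0])


def is_feasible(level, edges, t, w_d):
    """Scheduling condition: MinDev must fit in the t storage wordlines."""
    return min_devices(level, edges) <= t * w_d


# Hypothetical LUT graph: L2 (level 1) feeds N1 (level 3), so L2 is
# transient in level 2.
level = {"L1": 1, "L2": 1, "M1": 2, "N1": 3}
edges = [("L1", "M1"), ("L2", "N1"), ("M1", "N1")]
assert min_devices(level, edges) == 4
assert is_feasible(level, edges, t=2, w_d=2)
```

When `is_feasible` fails, the text prescribes repartitioning the AIG with a different value of k and checking again.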
Example 4. We explain the device allocation and scheduling
technique, presented in Algorithm 1 using a representative
LUT graph, shown in Fig. 12. The nodes are scheduled
in topological ordering. Nodes in level 1, n1 and n2, are
allocated to wordline 5, as shown in Fig. 13. Node n3, in
level 2, is assigned another device in wordline 5, using the
Best-fit allocation strategy. Since the only successor of n1 has
been allocated, device allocated to n1 is now marked dirty. In
level 3, there are 3 nodes (n4, n5 and n6). Since there is only
a single device free in wordline 5, it is not possible to allocate
these nodes together. Therefore, these nodes are allocated to
wordline 4. All the successors of node n2 have been allocated,
hence the corresponding device is marked as dirty. Finally, the
node n7 in level 4 is allocated to the free device in wordline
5. This completes the allocation and scheduling of the LUT
nodes.
Algorithm 1: LUT Graph Scheduling
Data: LUT Graph G, SD, wD
for l = 1; l ≤ lmax; l++ do
    Allocate unscheduled nodes in level l to free devices in a wordline
    considering Best-fit;
    if s.scheduled = True : ∀s ∈ succ(n) then
        // Device allocated to node n is marked dirty
        dev(n).dirty = True;
    while level(n) = l and n.scheduled = False : ∃n ∈ G(V) do
        // No free device is available
        if ∄free(D) then
            w = wordline with maximum number of dirty devices;
            Reset the dirty devices in wordline w;
            Allocate unscheduled nodes of level l to the free devices in
            wordline w;
Fig. 12: An LUT graph with five primary inputs (a0, . . . , a5)
with LUT nodes (n1, . . . , n7). The output of LUT n7 is the
output of the LUT graph.
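The wordline selection used in Example 4 can be sketched in isolation. The code below covers only the best-fit choice and assumes a particular reading of the description: among wordlines that can hold all nodes of a level, the one with the fewest free devices is taken; otherwise nodes are spread over the wordlines with the most free devices. Dirty-device tracking and resets are left out, ties are broken by lowest index, and all names are hypothetical.

```python
def best_fit(free, k):
    """Place k nodes; `free` lists free devices per storage wordline.

    Returns a list of (wordline index, number of nodes placed there).
    """
    free = list(free)  # work on a copy
    fitting = [w for w, f in enumerate(free) if f >= k]
    if fitting:
        # Tightest wordline that still holds all k nodes.
        return [(min(fitting, key=lambda w: free[w]), k)]
    placements = []
    while k > 0:
        if not free or max(free) == 0:
            raise RuntimeError("no free device; dirty devices must be reset first")
        w = max(range(len(free)), key=lambda i: free[i])
        take = min(k, free[w])
        placements.append((w, take))
        free[w] -= take
        k -= take
    return placements


# Replaying Example 4 with two 4-device storage wordlines
# (index 0 stands for wordline 5, index 1 for wordline 4).
assert best_fit([4, 4], 2) == [(0, 2)]  # level 1: n1, n2
assert best_fit([2, 4], 1) == [(0, 1)]  # level 2: n3
assert best_fit([1, 4], 3) == [(1, 3)]  # level 3: n4, n5, n6
assert best_fit([1, 1], 1) == [(0, 1)]  # level 4: n7
```

The four calls reproduce the placements described in Example 4, including the move to the second storage wordline when the three level-3 nodes do not fit in the single remaining device.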
B. ESOP computation
Each function realized by the LUT can be expressed as
an Exclusive Sum-Of-Product (ESOP). For many Boolean
functions, minimal ESOPs have fewer cubes compared
to Sum-Of-Products [45]. In addition, there are multiple
ESOP minimizers available which can be used to reduce
the ESOP size [46], [47], [48], [49]. Before presenting the
ESOP computation algorithm on ReVAMP, we present a brief
description of the related terms.
Definition 3. A literal is a Boolean variable either in inverted
or non-inverted form.
Definition 4. A cube is a product term composed of literals
using Boolean AND.
Example 5. The ESOP abc ⊕ abc has two cubes, abc and
abc. The cube abc has literal a in inverted form and b,c in
non-inverted form.
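To make Definitions 3 and 4 concrete, the following minimal sketch evaluates an ESOP in software. The representation (a cube as a dictionary from variable name to polarity) and the function names are assumptions made for illustration, and the two-cube ESOP used here is a hypothetical example, not necessarily the one of Example 5.

```python
def eval_cube(cube, assignment):
    """cube maps a variable to True (non-inverted literal) or False (inverted)."""
    return all(assignment[var] == polarity for var, polarity in cube.items())


def eval_esop(cubes, assignment):
    """XOR together the values of all cubes."""
    result = False
    for cube in cubes:
        result ^= eval_cube(cube, assignment)
    return result


# Hypothetical ESOP: (NOT a AND b AND c) XOR (a AND NOT b AND NOT c)
esop = [
    {"a": False, "b": True, "c": True},
    {"a": True, "b": False, "c": False},
]
print(eval_esop(esop, {"a": False, "b": True, "c": True}))
```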
Initial State     Level 1           Level 2
0  0  0  0        0  0  n2 n1       0  n3 n2 n1
0  0  0  0        0  0  0  0        0  0  0  0
0  0  0  0        0  0  0  0        0  0  0  0
0  0  0  0        0  0  0  0        0  0  0  0
0  0  0  0        0  0  0  0        0  0  0  0
Level 3           Level 4
0  n3 n2 n1       n7 n3 n2 n1
0  n6 n5 n4       0  n6 n5 n4
0  0  0  0        0  0  0  0
0  0  0  0        0  0  0  0
0  0  0  0        0  0  0  0
Fig. 13: Scheduling using Algorithm 1 for the LUT graph in
Fig. 12. The wordlines 0-3 (colored in teal) are reserved for
XOR computation and the remaining wordlines 4-5 (colored
in orange) are used for storage of the LUT outputs.
[Fig. 14, panels (a)-(f): successive states of the crossbar while the cubes c1 and c2 are computed and XORed.]
Fig. 14: ESOP computation on a 3 × 2 crossbar. (a) Cubes
are computed on devices at wordline e2. (b-e) Some of the
intermediate steps of computing the XOR of the two cubes are
shown. (f) All devices, except the device holding the XOR of the
cubes, are reset to 0. If the ESOP has more cubes, the next cube
c3 would be computed at wordline e2 and bitline b0, and the XOR
would be computed for c3 and c2 ⊕ c1, followed by reset. This
process is repeated until the entire ESOP has been computed.
Theorem IV.2. Any Boolean function, expressed as an ESOP,
can be computed using three wordlines and at least two
bitlines.
Proof : We present a constructive proof for the theorem. Let
us consider three wordlines, e0, e1 and e2 with bitlines b0 and
b1. We consider two cases.
Case 1: The ESOP has a single cube, say l1.l2...ln. If a literal
li is inverted, it is applied via bitline b0 with ‘0’ as input to
wordline e2. Else, the literal is applied via bitline b0 and ‘1’
as input to wordline e0 to store in non-inverted form. Then,
wordline e0 is read out and li is applied via the bitline with ‘0’
as wordline input to wordline e2. The wordline e0 is reset. The
process is repeated till all the literals have been ANDed and
the computed cube is available at wordline e2 and bitline b0.
Case 2: The ESOP has more than one cube, say c1, c2, ..., cm.
The cube c1 can be computed, as stated in Case 1. Similarly,
c2 can be computed at wordline e2 and bitline b1 by applying
the bitline inputs via bitline b1. The cubes c1 and c2 can be
XORed as shown in Fig. 14 with the result stored at wordline
e2 and bitline b1. The rest of the devices are reset to 0 by applying 0
as the wordline input and 1 as the bitline input. Now, the third cube c3
can be computed, using steps identical to Case 1, and XORed
with the result c1 ⊕ c2. This process can be
repeated until the entire ESOP has been computed. ∎
Theorem IV.2 guarantees that any ESOP can be com-
puted in a crossbar with three wordlines and two bitlines.
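Case 1 of the proof can be checked with a small software model. The sketch below assumes that a device updates its state as M3(state, wordline input, NOT bitline input), which matches the resets and inverted loads described in the text. The function names and the use of a single scratch variable to stand for the device on wordline e0 are illustrative assumptions.

```python
def maj3(x, y, z):
    """Boolean majority of three bits."""
    return (x & y) | (y & z) | (x & z)


def apply(state, wl, bl):
    """Assumed device update: next state = M3(state, wl, NOT bl)."""
    return maj3(state, wl, 1 - bl)


def compute_cube(literals, assignment):
    """literals: list of (variable, inverted) pairs; returns the cube value."""
    acc = apply(0, 1, 0)                      # set the device at (e2, b0) to 1
    for var, inverted in literals:
        value = assignment[var]
        if inverted:
            acc = apply(acc, 0, value)        # acc AND (NOT value)
        else:
            scratch = apply(0, 1, value)      # device on e0 now holds NOT value
            acc = apply(acc, 0, scratch)      # acc AND value
            scratch = apply(scratch, 0, 1)    # reset the device on e0 to 0
            assert scratch == 0
    return acc


# Cube with a inverted and b, c non-inverted.
cube = [("a", True), ("b", False), ("c", False)]
print(compute_cube(cube, {"a": 0, "b": 1, "c": 1}))
```

A non-inverted literal costs extra steps because it must first be stored in inverted form and read back, which is why the proof needs the additional wordline e0.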
Fig. 15: Computation of cubes of the ESOP abc⊕ abc.
Fig. 16: An XOR-reduction tree for 4 terms x1, ..., x4.
x12 = x1 ⊕ x2, x34 = x3 ⊕ x4 and x14 = x12 ⊕ x34
If the number of bitlines is greater, it is possible to reduce
the delay by parallelizing operations. Boolean AND of two
literals a and b can be expressed as M3(a, 0, b). Since 0 can be
used as a common wordline input, computing multiple cubes
in parallel is feasible. Fig. 15 shows the computation of the cubes
of an ESOP. Due to the crossbar constraints, all the bitline-applied
literals must be available simultaneously via either the PIR or the DMR.
This implies that all the applied literals either
have to be primary inputs or must reside on the same wordline
for parallel computation of the cubes.
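The benefit of a shared wordline input can be illustrated with a word-level model. The sketch below is hedged: it assumes the device update M3(state, wl, NOT bl), assumes the complement of every literal is available as a bitline input, and uses hypothetical helper names. Each cube occupies one bitline, and all cubes advance by one literal per step with 0 on the common wordline.

```python
def maj3(x, y, z):
    return (x & y) | (y & z) | (x & z)


def apply_word(states, wl, bls):
    """One step on a whole wordline: shared wl, one bitline input per device."""
    return [maj3(s, wl, 1 - b) for s, b in zip(states, bls)]


def and_cubes_parallel(cubes):
    """cubes: one list of literal values (0/1, polarity already applied) per bitline."""
    states = [1] * len(cubes)                 # every accumulator starts at 1
    steps = max(len(c) for c in cubes)
    for i in range(steps):
        # The bitline carries the complement of the literal, so the device
        # computes state AND literal; an input of 0 leaves a device unchanged.
        bls = [1 - c[i] if i < len(c) else 0 for c in cubes]
        states = apply_word(states, 0, bls)
    return states, steps


print(and_cubes_parallel([[1, 1, 1], [1, 0], [1]]))
```

In this model the number of steps equals the length of the longest cube rather than the total number of literals.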
Once the cubes have been computed, they have to be XORed.
Each XOR can be performed using
steps similar to the example shown in Fig. 6a. Multiple
XORs can be performed by means of an XOR reduction tree
with depth logarithmic in the number of terms to be XORed.
In Fig. 16, there are four terms xi to be XORed. The XOR
of x1 and x2 can proceed in parallel with the XOR of x3 and
x4. Thereafter, the results x12 and x34 are XORed. The
number of cubes in an ESOP may be greater than
the number of available bitlines in the crossbar. In that case,
the computation of the cubes, followed by XOR reduction, has
to be iterated.
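The reduction tree and the batched iteration can be sketched as follows. This is an illustrative software model only: `xor_reduction` and `esop_in_batches` are hypothetical names, the batch size is simply taken to be the number of bitlines, and the way the running result is folded in is an assumption rather than the exact behaviour of the XorReduction step used in Algorithm 2.

```python
def xor_reduction(terms):
    """Pairwise XOR tree over a non-empty list; returns (result, tree depth)."""
    level = list(terms)
    depth = 0
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])     # odd term is carried to the next level
        level = nxt
        depth += 1
    return level[0], depth


def esop_in_batches(cube_values, num_bitlines):
    """XOR all cube values when only num_bitlines cubes fit at a time."""
    result = 0
    for start in range(0, len(cube_values), num_bitlines):
        partial, _ = xor_reduction(cube_values[start:start + num_bitlines])
        result ^= partial
    return result


print(xor_reduction([1, 0, 1, 1]))        # four terms need two tree levels
print(esop_in_batches([1, 0, 1, 1, 1], 2))
```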
The technique for ESOP computation is presented in Algo-
rithm 2. Once the ESOP has been evaluated, the result is
written back to the position in the working area, as determined
by the scheduling algorithm.
Discussion: The proposed approach provides a novel solution
to the area-constrained technology mapping problem. The
target Boolean function is represented as an AIG, followed
by partitioning into k-input LUTs and finally scheduling and
computing these LUTs on the crossbar. The approach allows
a feasible mapping for a variety of crossbar sizes, with some
portion of the crossbar reserved for computation of ESOPs.
Instead of using AIGs for representing the functions, it is
also feasible to represent the function using Majority Inverter
Graph (MIG). The native function realized by ReRAM devices
is Boolean Majority three (M3) with an input inverted. There-
fore, MIGs have been used heavily in synthesis [50], [51] and
technology mapping [30] flows for ReRAM crossbar array. In
the next section, we discuss another approach to the technol-
ogy mapping problem using MIGs for logic representation and
constrained by only the word length of the DCM with focus
on reducing the delay of mapping.
Algorithm 2: ESOP computation
Data: E, CrossbarState, Loc
result = null;
do
    activeCubes = set();
    currentLiterals = list();
    v = max_{l ∈ C} occ(l);
    currentLiterals.append(v);
    activeCubes.add(cube(v));
    while occ(currentLiterals) ≤ wD do
        allowedLiterals = findAllowed(E, currentLiterals);
        Choose v′ ∈ allowedLiterals with max occurrence in
            cube(v′) − activeCubes and occ(currentLiterals + v′) ≤ wD;
        currentLiterals.append(v′);
        activeCubes.add(cube(v′));
    And(currentLiterals);
    do
        activeCubes′ = set();
        currentLiterals′ = list();
        v = max_{l ∈ C} occ(l, activeCubes);
        while occ(currentLiterals) ≤ wD do
            allowedLiterals = findAllowed(E, currentLiterals);
            Choose v′ ∈ allowedLiterals with max occurrence in
                cube(v′) − activeCubes and occ(currentLiterals + v′) ≤ wD;
            currentLiterals.append(v′);
            activeCubes′.add(cube(v′));
        And(currentLiterals);
    while All the activeCubes have not been computed;
    result = XorReduction(activeCubes, result);
while Some cube has not been computed;
V. DELAY-CONSTRAINED TECHNOLOGY MAPPING
In this section, we present a method to generate instructions
for the ReVAMP architecture that focuses on reducing the
delay of mapping without constraints on the number of words
required for mapping. In this method, we still consider the
constraint on word length wD during mapping.
A. Assign Host and Inputs to Nodes
A ReRAM device has an internal state Z, and two inputlines—
the wordline and bitline. A computation on it updates its
internal state Z, in effect making the device the host for the
computation. For each internal node in an MIG, one of its
parents hosts the computation and the remaining parents act
as wordline and bitline inputs. The computation of multiple
independent nodes can be grouped into an Apply instruction
if they have a common wordline input. Based on this, we
present a few rules to assign the host and the inputs of the
nodes of an MIG.
• If a node has multiple children in the same level, then
it can be used as common wordline input for computing
those nodes. For instance, in Fig. 7b, input b can be used
as common wordline input to compute S1 and S2.
• If an incoming edge to a node is marked inverted, then
the corresponding parent can be used as the bitline input.
In Fig. 7b, c and S2 are used as bitline inputs to compute
S1 and S3 respectively.
• If there are no inverted incoming edges to a node, then
a negated parent is used as input to that node. For node
S2 in Fig. 7b, input c is used as bitline input.
• The remaining parent is used as host for the node. The
nodes a and S1 act as host to compute S1 and S3
respectively in Fig. 7b.
These rules ensure that nodes with common inputs can
share wordline inputs, which is exploited when scheduling the
computation. We mark these assignments on the edges of the MIG, as
shown in Fig. 7b.
B. Group Nodes to Blocks
To compute an internal node in an MIG, we need to read out
the wordline and bitline inputs of the node and then apply
these inputs to the host. Given that only a single word can be
read out in a clock cycle, the wordline and bitline inputs of
the node must reside on the same wordline to allow efficient
computation of the node. This creates a constraint that for each
node in an MIG — the wordline and the bitline inputs should
be placed in the same word. We call this grouping a block.
Further, as read-outs are non-destructive, blocks can be
merged if they have common inputlines. This reduces the
number of devices required, with the merged block having
only one copy of the common inputline. Note that blocks can
be merged only if the number of inputs in the resultant block
does not exceed the word length.
Also, a pair of blocks in the same level that have hosts
which share a wordline input should be merged. This host-
based merge along with merge of the corresponding blocks
with the inputlines of these hosts permits computation of the
nodes in the same level with shared wordline in a single cycle,
thereby reducing delay.
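The inputline-based merge can be sketched as a greedy procedure. The sketch below is deliberately simplified and rests on assumptions: it models only the merge on common inputlines, not the host-based merge, it represents a block as a list of input names, and `merge_blocks` is a hypothetical helper rather than the mergeBlock method of the paper.

```python
def merge_blocks(blocks, word_length):
    """Greedily merge blocks that share an input line while the merged
    block still fits into one word of word_length devices."""
    merged = [list(dict.fromkeys(b)) for b in blocks]   # one copy per input
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                common = set(merged[i]) & set(merged[j])
                union = merged[i] + [x for x in merged[j] if x not in merged[i]]
                if common and len(union) <= word_length:
                    merged[i] = union      # keep a single copy of common inputs
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


# "~c" stands for a negated copy of c (hypothetical naming).
print(merge_blocks([["d", "e"], ["b", "c"], ["b", "~c"]], word_length=3))
```

Here the two blocks sharing input b are merged into one block of three inputs, while the block holding d and e stays separate because it shares no input with the others.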
Algorithm 3: Block Formation Algorithm
Data: G, pi, po
Result: blockList
1 blockCount = 0;
2 for nodeout ∈ po do
3 addBlock([(nodeout.host)],blockCount);
4 addBlock([(nodeout.wl, nodeout.bl)],blockCount);
5 addInversionBlock(el);
6 mergeBlock();
7 for l = levelmax; l > 0; l = l− 1 do
8 for block ∈ blockList do
9 for el ∈ block do
10 if el /∈ pi and el.level == l then
11 replace(el, el.host);
12 addBlock([el.wl, el.bl], blockCount);
13 addInversionBlock(el);
14 mergeBlock();
The block formation algorithm is shown in Algo-
rithm 3. Lines 2-5 create the blocks considering the
placement constraint on the input lines of the output nodes.
The addInversionBlock method adds the positive nodes as
blocks to the blockList, if the added blocks have inverted
values. Only a single positive node is added to blockList,
corresponding to multiple copies of a negated node. The
mergeBlock method merges blocks based on the input line and
host based merge constraints. The replace method replaces a
node in a block with its host node.
Example 6. For a word length (wD) of 3, Table II shows
the working of the block formation algorithm on the MIG of
Fig. 7b. Starting at the output node, blockList has a single
block. At level 3, node S4 is replaced with its host and
TABLE II: BlockList update with BlockMerge algorithm for
MIG of Fig. 7b
Level   BlockList
Output  [[(1,S4)]]
3       [[(1,S3,h)],[(2,d,i),(2,e,i)]]
2       [[(1,S1,h)],[(2,d,i),(2,e,i)],[(3,c,i),(3,S2,i)]]
1       [[1:(a,h)],[2:(d,i),(e,i)],[3:(c,i),(a,h)],[4:(b,i),(c,i)],[5:(b,i),(c,i)]]
1       [[1:(a,h)],[2:(d,i),(e,i)],[3:(c,i),(a,h)],[4:(b,i),(c,i),(c,i)]]
1       [[1:(a,h),(c,i),(a,h)],[2:(b,i),(e,i)],[4:(b,i),(c,i),(c,i)]]
inputlines. Since these two blocks do not have any common
inputlines or hosts, they cannot be merged. At level 2, node
S3 gets replaced and the inputlines are added to a new block.
At level 1, nodes S1 and S2 are replaced by their hosts a, and
the inputlines are inserted in two new blocks. Blocks 4 and
5 have a common inputline b and are hence merged. Blocks
2 and 4 have common inputs, but cannot be merged as the
length (four) of the resultant block will exceed the given word
length. Thereafter, since the two a host nodes have the same
wordline, blocks 1 and 3 get merged, but both copies of the
host are retained, using the host-merge constraint.
C. Pack Blocks in Words
At the end of scheduling computation, we have blocks of
elements, which have to be placed in the same wordline. The
number of elements in each block is less than or equal to
wD, the number of bits in a word. Now, these blocks have
to be packed in the DCM using the minimum number of words.
The problem can be formulated as a bin packing problem as
defined below.
Algorithm 4: First-fit Algorithm
Data: blockList, wordlength
Result: wordToBlock
wordToBlock = HashMap();
wc = 0;
for block ∈ blockList do
    assigned = False;
    for w ∈ wordToBlock do
        if wordToBlock[w].occupied + block.length ≤ wordlength then
            wordToBlock[w].occupied = wordToBlock[w].occupied + block.length;
            wordToBlock[w].append(block);
            assigned = True;
            break;
    if assigned == False then
        wc = wc + 1;
        wordToBlock[wc].append(block);
        wordToBlock[wc].occupied = block.length;
Consider each word in the DMR as a bin, with capacity
wD. Each block bi has a value vi, vi > 0. Each block must be
assigned to a bin such that the total value of the blocks assigned
to the bin is less than or equal to wD. The objective is to
minimize the number of bins required to assign all the blocks,
without violating the capacity constraint.
This first-fit algorithm provides a 2-factor approximation,
i.e., the number of words required by the algorithm is at most
twice the number of words required by the optimal solution.
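A runnable version of first-fit packing is sketched below. It works on block lengths only, which is an assumption made to keep the example short; the function name `first_fit` and the returned data layout are likewise illustrative.

```python
def first_fit(block_lengths, word_length):
    """Place each block into the first word with enough free devices.

    Returns (assignment, occupied): the word index chosen for every block
    and the number of occupied devices in every word.
    """
    occupied = []
    assignment = []
    for length in block_lengths:
        if length > word_length:
            raise ValueError("block does not fit into a single word")
        for w, used in enumerate(occupied):
            if used + length <= word_length:
                occupied[w] += length
                assignment.append(w)
                break
        else:
            # No existing word has room: open a new word.
            occupied.append(length)
            assignment.append(len(occupied) - 1)
    return assignment, occupied


# Blocks of lengths 3, 2 and 3 with a word length of 3 each need their own word.
print(first_fit([3, 2, 3], word_length=3))
```

With a longer word, several blocks share a word: for block lengths 4, 8, 4, 8, 8 and a word length of 16, the sketch uses two fully occupied words.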
Example 7. For the example, the blocks determined by the
Block Formation algorithm are placed in a separate wordline,
as shown in Fig. 17 (a).
D. Generation and Scheduling of Instructions
The primary inputs have to be loaded into the DCM before
computation of the internal nodes of the MIG can begin.
In each clock cycle, wD primary inputs can be read. The
primary inputs are loaded via the bitline and hence the inverted
values are stored in a single clock cycle. To store non-inverted
primary inputs, the primary inputs are written to a wordline,
thereby storing it in inverted form. Then, the inverted value is
read out and applied via the bitline to store the non-inverted
value to the required wordline. A single extra wordline is used
for storage of the intermediate inverted primary input, and this
wordline is reset, after each use.
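The two loading paths can be modelled at word level. The sketch assumes the device update M3(state, wl, NOT bl) and that the target wordline is initially reset to 0; `apply_word` and `load_inputs` are hypothetical helpers, and cycle counts are intentionally not modelled.

```python
def maj3(x, y, z):
    return (x & y) | (y & z) | (x & z)


def apply_word(states, wl, bls):
    """Assumed update of all devices on one wordline."""
    return [maj3(s, wl, 1 - b) for s, b in zip(states, bls)]


def load_inputs(inputs, inverted):
    """Return the word stored in the target wordline after loading inputs."""
    width = len(inputs)
    if inverted:
        # Writing through the bitline stores the complement directly.
        return apply_word([0] * width, 1, inputs)
    # Non-inverted: store the complement in a spare wordline first ...
    scratch = apply_word([0] * width, 1, inputs)
    # ... then read it out and write it through the bitline again.
    target = apply_word([0] * width, 1, scratch)
    # The spare wordline is reset after use (wl = 0, bl = 1).
    scratch = apply_word(scratch, 0, [1] * width)
    assert scratch == [0] * width
    return target


print(load_inputs([1, 0, 1, 1], inverted=True))
print(load_inputs([1, 0, 1, 1], inverted=False))
```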
All the nodes in level i are scheduled for computation before
any node at level i + 1 is scheduled. The nodes in the same
level can be scheduled in any order as they do not have any
data dependencies. Nodes in a level whose hosts are in the
same block, and whose corresponding inputlines
are also placed together in the same block, are scheduled
for computation together. Once all the nodes in a level have
been computed, we determine whether any inverted copies
of the nodes are required for computation of nodes present
at a higher level. If inverted copies are needed, the node is
read out and stored in inverted form in the required block by
writing through the bitline. Each computation is expressed as
an Apply instruction and read operations are expressed as Read
instructions.
TABLE III: Instruction sequence to compute MIG in Fig. 7b.
Instruction
I1 Read 0
I2 Apply 2 11 0 1 1 0 0 1 2
I3 Read 2
I4 Apply 2 11 1 1 2 0 0 0 0
I5 Read 1
I6 Apply 2 11 0 1 1 0 0 0 0
Example 8. Table III shows the sequence of instructions used
to compute the example MIG, and Fig. 17 shows the changes
in DCM state on application of the Apply instructions. Note
that the additional instructions needed to initialize the DCM
are not shown. The inputs to compute nodes S1 and S2 are
in word 0 and are read out. The hosts of nodes S1 and S2
are in word 2, and therefore I2 computes these nodes in word
2. The inputs to compute S3 are in word 2, and are read out
by I3. I4 computes S3 in host S1. Finally, to compute S4, I5
reads out word 1 and I6 applies the required inputs to S3.
Discussion: Even though the two approaches have been
discussed with AIG and MIG as the input data structures,
the data structures can be used interchangeably. To use MIG
in the area-constrained mapping approach, the MIG can be
directly partitioned into LUTs and the rest of the mapping flow
can be used. Similarly, the AIG can be converted to an MIG
by introducing constant ‘0’ as the third input to each node,
and the rest of the delay-constrained mapping flow can be
used. Due to the inherently sequential nature of the computation
and the crossbar constraints, employing traditional synthesis
optimization techniques, such as depth reduction, does not
directly translate into lower delay after technology mapping.
[Fig. 17, panels (a)-(e): DCM contents, with transitions labelled by the instruction pairs I1,I2; I3,I4; and I5,I6.]
Fig. 17: DCM state transition during computation. (a) DCM state after loading the primary inputs. (b-d) The intermediate
DCM states during computation. (e) The final DCM state. The green coloured row represents the read out wordline.
However, it is possible to make the synthesis optimization
technique technology-aware to aid the technology mapping
flow, as demonstrated recently by Bhattacharjee et al. [52].
VI. EXPERIMENTAL RESULTS
We have implemented the proposed compilation flow for
the ReVAMP architecture using Python. The algorithm was
evaluated using the EPFL benchmarks¹. For area-constrained
mapping, we used ABC for generating the initial AIG and
also for ESOP expansion [44]. Each run is limited to 2 hours,
after which the program is terminated. Most of the
mapping time is spent in ESOP expansion.
For all the EPFL benchmarks, Table IV presents the results
of the area-constrained mapping for varying number of LUT
inputs k for fixed crossbar dimension of 64×64. With increase
in k, the number of LUTs (#NLUT ) in the LUT graph
reduces, along with reduction in the number of levels (#L).
For the given crossbar dimensions, 61 words are available
for storing the intermediate results and 3 are reserved for
computing the ESOPs. Some of the benchmarks could not
be mapped (marked by ××) due to violation of the feasibility
criteria (MinDev > 3904), presented in Equation (7).
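The feasibility threshold quoted above follows from simple arithmetic, sketched below. The helper names are hypothetical, and the check only restates the condition MinDev > 3904 given in the text; it is not a reproduction of Equation (7).

```python
def storage_devices(num_wordlines, num_bitlines, reserved_wordlines=3):
    """Devices left for intermediate results after reserving ESOP wordlines."""
    return (num_wordlines - reserved_wordlines) * num_bitlines


def is_feasible(min_dev, num_wordlines, num_bitlines, reserved_wordlines=3):
    return min_dev <= storage_devices(num_wordlines, num_bitlines, reserved_wordlines)


print(storage_devices(64, 64))        # 61 words of 64 devices each
print(is_feasible(7006, 64, 64))      # aes_core, k = 4 in Table IV
print(is_feasible(3823, 64, 64))      # DSP, k = 8 in Table IV
```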
To analyze the impact of increasing the number of LUT
inputs (k) on delay and MinDev in detail, we consider four
large benchmarks from the EPFL benchmark suite for crossbar
dimension 64×64. The results are shown in Fig. 18. The effect
of k on MinDev depends on the benchmark itself. For
example, with increase in the value of k, MinDev for the
benchmark mul32 decreases, but for mem_ctrl, MinDev increases
for larger values of k, as evident from Fig. 18. The delay
of the mapping (in terms of number of cycles #C) closely
follows the trend of MinDev, i.e., with increase in MinDev,
the overall number of cycles required for mapping increases.
This is because with increase in MinDev, fewer
crossbar devices can be reset at any given time, which
reduces the parallelization of operations during the ESOP
computation. For the benchmark mul32, notice the sharp rise
in delay of mapping on changing k from 13 to 16. The number
of cubes in the ESOP expressions increased consistently,
resulting in increased time for computing the ESOP
expressions, which increases the overall delay of mapping. Also,
for large values of k (k ≥ 28), the time required for each ESOP
expansion increases considerably (> 2 hours), which leads to
long execution time for mapping an entire benchmark.
We analyze the impact of crossbar dimension on the delay of
mapping. Keeping the overall number of devices fixed at 4096,
we vary the number of bitlines from 4 to 1024. The results
are shown in Fig. 19. With increase in the number of bitlines,
the crossbar permits a greater number of parallel operations
to be carried out in a word. This parallelism is harnessed by
the ESOP computation technique, which leads to reduction in
delay of mapping for the entire benchmark.
¹ http://lsi.epfl.ch/benchmark
The results of delay constrained mapping are presented in
Table V for word length (wD) of 16-bits. For most of the
benchmarks, the compilation time to generate the instructions,
was a few seconds while for the larger benchmarks, the
compilation process finished in under 20 minutes. The numbers of
Apply and Read instructions are shown in columns IA and IR,
respectively, while the total number of instructions is Itotal.
The number of blocks created by mapping is #B. The total
delay (#C) of the mapping solution is the number of cycles to
complete computation of the benchmark by the ReVAMP ar-
chitecture. Gaillardon et al. [32] proposed the PLiM computer,
which has a single instruction — RM3 A,B,Z. Assuming
16-bit words, each instruction results in the following micro
operations on the memory array: Read @A (32 bits), Read
@B (32 bits), Read @Z (32 bits), Read A (1 bit), Read
B (1 bit), Write @Z (1 bit). This corresponds to 9 R/W cycles
on the considered machine. Therefore, the minimum number of
cycles DP∗ required by PLiM [32] to compute any MIG is
9 × #N, where #N is the number of nodes in the MIG. This delay
does not include the additional delay required for computing
the negated values of the nodes. For each benchmark, #C
is significantly lower than the DP∗ achieved by the PLiM
computer. This is a fair comparison since PLiM computer
also used a word-length of 16-bits. The ReVAMP architecture
outperforms PLiM computer for the same 16-bit word by
a factor of 4.38× on average and 9.5× at the maximum.
For the ReVAMP architecture, on average, almost 30% of
the computation time is spent in computing negated value
of the nodes. Thus, the ReVAMP architecture would further
outperform the PLiM computer when the actual number of
cycles required by the PLiM computer for computation with
negations is considered. Synthesis techniques can be used
to reduce the number of nodes in the MIG for reducing the
delay of executing a benchmark on the PLiM, as suggested
in [33], but similar techniques can also be used for optimizing
the input data structures of the proposed technology mapping
flow [52].
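The PLiM bound and the resulting speedup can be reproduced from Table V with a few lines of Python. The sketch assumes that a 32-bit read takes two cycles with 16-bit words, which is how the nine R/W cycles quoted above are obtained; the function names are illustrative, and only three rows of Table V are used.

```python
WORD_BITS = 16


def plim_cycles_per_instruction(word_bits=WORD_BITS):
    """Three 32-bit address reads plus three single-bit accesses."""
    return 3 * (32 // word_bits) + 3


def plim_min_cycles(num_nodes):
    """Lower bound DP* on PLiM cycles, ignoring negations."""
    return plim_cycles_per_instruction() * num_nodes


def speedup(num_nodes, revamp_cycles):
    return plim_min_cycles(num_nodes) / revamp_cycles


# (NMaj, #C) pairs taken from Table V.
rows = {"ac97_ctrl": (15253, 16325), "comp": (18967, 63592), "ss_pcm": (604, 572)}
for name, (n_maj, cycles) in rows.items():
    print(name, plim_min_cycles(n_maj), round(speedup(n_maj, cycles), 2))
```

For ss_pcm the ratio is about 9.5, which corresponds to the maximum speedup reported in the text.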
In Fig. 20a, the speed up achieved by the ReVAMP ar-
chitecture against the PLiM computer is presented for various
word lengths. Even for a small word length of 4, the ReVAMP
architecture gains in performance over the PLiM architecture
by a factor of 2.9× on average. This shows that harnessing the
inherent parallelism of ReRAM crossbar arrays for computa-
tion provides considerable performance gains. This justifies the
VLIW nature of the ReVAMP architecture and demonstrates
the effectiveness of the delay constrained mapping.
[Fig. 18 plots, panels (a) ac97, (b) mem_ctrl, (c) mul32, (d) revx. Horizontal axis: LUT size (0 to 16); vertical axes: log10(#C) and Mindev.]
Fig. 18: Impact of LUT size (k) on delay (#C) for crossbar dimensions 64 × 64. The blue + symbol indicates the delay of
mapping #C in the log scale. The violet squares denote the minimum number of devices Mindev for a feasible mapping.
TABLE IV: Performance of area-constrained mapping on crossbar size ( SD × wD )=(64× 64). ×× denote the benchmarks
that cannot be mapped with the current crossbar due to MinDev > 3904. – – indicate the benchmarks that did not complete
mapping within the time limit (2 hours).
Benchmark | k = 4: #NLUT #L MinDev #C | k = 8: #NLUT #L MinDev #C | k = 16: #NLUT #L MinDev #C
ac97 ctrl.v 3900 4 2743 88893 2630 3 2102 83464 2409 2 2409 87908
aes core.v 8978 8 7006 ×× 902 3 890 33804 816 2 816 40091
comp.v 8918 50 2624 188156 5695 29 1987 179905 4169 19 2143 301379
des area.v 2020 10 1085 44303 699 5 505 29885 884 3 859 57017
des perf.v 34260 6 26309 ×× 11500 3 11476 ×× 9713 2 9713 ××
diffeq1.v 6652 72 1736 164355 3949 38 1247 169057 2710 22 1079 414193
div16.v 1033 85 151 23557 710 39 131 27276 507 20 155 51407
DSP.v 14679 22 4513 ×× 9025 11 3823 323789 7195 7 3733 416307
ethernet.v 20178 10 11855 ×× 14385 6 10462 ×× 12113 3 9869 ××
hamming.v 696 24 243 16676 498 13 221 19841 413 8 220 – –
i2c.v 347 7 255 7264 219 3 157 5929 156 2 156 5921
log2.v 10331 107 2996 257920 4415 48 1208 240998 2859 25 723 – –
MAC32.v 2828 29 1347 75746 1624 13 979 72908 967 7 775 110020
max.v 1023 54 735 25815 687 24 658 30267 615 12 615 45412
mem ctrl.v 3523 15 1252 71790 2372 7 1156 64689 2038 4 1371 84581
MUL32.v 2458 18 1103 66538 1580 9 942 72954 879 6 617 189105
mult64.v 7438 87 3105 229381 3937 40 1418 208960 2860 20 1109 468718
pci bridge32.v 6384 10 2435 143697 4585 6 2294 142557 4189 3 2927 – –
pci spoci ctrl.v 373 7 296 7294 242 4 197 7144 119 3 96 4859
revx.v 2659 61 252 62304 1644 34 267 71856 559 15 234 153372
sasc.v 207 3 161 4876 147 2 147 4411 132 1 132 5038
simple spi.v 279 6 182 6002 187 3 137 5641 150 2 150 5762
spi.v 1211 11 656 27253 906 5 692 30900 491 3 359 36109
sqrt32.v 264 112 83 8103 206 37 79 10899 158 15 70 33350
square.v 5721 83 3128 154151 3466 36 1894 131707 2557 17 1672 193112
ss pcm.v 127 3 109 3040 100 2 100 3281 98 1 98 3243
systemcaes.v 3255 11 1830 79168 1565 6 693 59813 1033 4 692 49199
systemcdes.v 1109 9 576 26040 516 3 497 15857 290 2 290 19303
tv80.v 2822 19 1150 64901 1523 10 802 54089 769 5 562 49165
usb funct.v 4678 12 2330 110240 2765 6 1383 91873 2177 3 1455 126515
usb phy.v 181 4 151 3382 120 2 120 3061 111 1 111 3046
The number of words (SD) used determines the area of
the mapping solution. To determine the effectiveness of the
packing algorithm to pack blocks into words, we utilize the
word utilization (WUtil) metric. WUtil is the percentage of
total number of bits in SD words, that are used by the mapping
solution. For the example MIG, out of 9 bits (3 words, each
with 3 bits), 8 bits are used and therefore WUtil is 88.8%.
The proposed packing algorithm achieves more than 97%
utilization for all the benchmarks, when wD = 16, including
100% utilization for the revx benchmark. Fig. 20b shows that
increasing the word length from 4 to 8 leads to considerable
improvement in WUtil. However, WUtil is comparable for
the word lengths 8, 16 and 32, and approaches 100%. This
shows the effectiveness of delay-constrained mapping in packing
the blocks into words.
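The word utilization metric is a simple ratio and can be computed as in the following sketch. The function name and the use of block lengths as input are assumptions made for illustration; the example values are those of the example MIG, with blocks of 3, 2 and 3 elements packed into three 3-bit words.

```python
def word_utilization(block_lengths, num_words, word_length):
    """Percentage of the bits in num_words words that hold block elements."""
    bits_used = sum(block_lengths)
    total_bits = num_words * word_length
    if bits_used > total_bits:
        raise ValueError("blocks do not fit into the given words")
    return 100.0 * bits_used / total_bits


# Example MIG: 8 of 9 bits are used, i.e. roughly 89%.
print(word_utilization([3, 2, 3], num_words=3, word_length=3))
```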
VII. CONCLUSION
In this work, we presented two approaches to the technol-
ogy mapping problem for logic-in-memory computation using
ReVAMP. The area-constrained method offers the flexibility
to map to a variety of crossbar
dimensions while harnessing the available parallelism of the
ReRAM crossbar array. The delay-constrained method reduces
the overall delay of mapping by using a multi-step approach
that takes into account crossbar constraints while placing the
operands. The proposed approach outperforms the state-of-the-
art serial logic in memory approach using ReRAMs.
The synthesis approaches used in the technology mapping
flow, such as partitioning algorithm used for LUT mapping,
are not aware of the crossbar constraints. The LUT partitioning
algorithm should ideally try to partition the graph, so that
TABLE V: Performance of the ReVAMP architecture on EPFL benchmarks for wD=16.
ID Benchmark NMaj IA IR Itotal #B SD WUtil #C DP∗
1 ac97 ctrl 15253 8803 7520 16323 2330 933 99.88 16325 137277
2 comp 18967 32297 31293 63590 3114 1182 99.96 63592 170703
3 des area 4629 5971 5639 11610 1073 305 99.92 11612 41661
4 div16 4440 8375 7825 16200 702 342 99.96 16202 39960
5 hamming 2280 3603 3450 7053 623 180 99.58 7055 20520
6 i2c 1263 1450 1247 2697 297 83 99.32 2699 11367
7 MAC32 9489 15980 15363 31343 3045 784 99.85 31345 85401
8 max 4854 4431 4184 8615 990 314 99.2 8617 43686
9 mem ctrl 9569 9497 8405 17902 2032 625 99.65 17904 86121
10 MUL32 9226 14047 13389 27436 2406 718 99.28 27438 83034
11 pci bridge32 25653 23826 21914 45740 6535 1546 99.82 45742 230877
12 pci spoci ctrl 1096 1523 1328 2851 196 85 99.04 2853 9864
13 revx 7563 14575 14004 28579 1703 611 100 28581 68067
14 sasc 889 602 514 1116 170 57 99.01 1118 8001
15 simple spi 1135 930 794 1724 209 75 99.17 1726 10215
16 spi 3890 4615 4301 8916 885 267 99.98 8918 35010
17 sqrt32 2206 4216 3948 8164 442 170 99.85 8166 19854
18 square 18080 30988 29880 60868 5640 1454 99.78 60870 162720
19 ss pcm 604 313 257 570 97 31 98.59 572 5436
20 systemcaes 11299 11100 10229 21329 2602 721 99.84 21331 101691
21 systemcdes 3028 3312 3090 6402 627 195 99.49 6404 27252
22 tv80 8177 11219 10368 21587 1672 578 99.73 21589 73593
23 usb funct 16704 16054 14269 30323 2943 1057 99.77 30325 150336
24 usb phy 599 538 460 998 140 37 97.47 1000 5391
[Fig. 19 plot. Horizontal axis: crossbar size (SD, wD) from (1024, 4) to (4, 1024); vertical axis: log10(#C); LUT size k = 16; series: ac_97, mul32, mem_ctrl, revx.]
Fig. 19: Impact of crossbar dimensions on delay (#C) of
area-constrained mapping. 4096 devices are used for all the
mappings and the number of bitlines (wD) is increased from
4 to 1024.
each ESOP expression corresponding to an LUT
has roughly the same number of cubes, instead of solely
minimizing the number of LUTs covering the graph. Also, the
initial representation of the Boolean functions as AIG/MIG
is not explicitly optimized w.r.t. the quality of the resulting
mapping. We believe optimizing the synthesis algorithms w.r.t.
the crossbar constraints would allow further reduction
in delay of mapping, when combined with the proposed
technology mapping approaches.
REFERENCES
[1] O. O. Babarinsa and S. Idreos, “Jafar: near-data processing for
databases,” in Proceedings of the 2015 ACM SIGMOD International
Conference on Management of Data, pp. 2069–2070, ACM, 2015.
[2] G. Koo, K. K. Matam, H. Narra, J. Li, H.-W. Tseng, S. Swanson,
M. Annavaram, et al., “Summarizer: trading communication with com-
puting near storage,” in Proceedings of the 50th Annual IEEE/ACM
International Symposium on Microarchitecture, pp. 219–231, ACM,
2017.
[3] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions: a low-
overhead, locality-aware processing-in-memory architecture,” in Com-
puter Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International
Symposium on, pp. 336–348, IEEE, 2015.
[Fig. 20 plots. (a) Speedup w.r.t. PLiM per benchmark ID (1 to 24), series s4, s8, s16, s32. (b) WUtil (%) per benchmark ID (1 to 24), series 4, 8, 16, 32.]
Fig. 20: For varying word length wD = {4, 8, 16, 32}
(a) Speedup achieved by the delay constrained mapping on
the ReVAMP architecture against PLiM. (b) Word utilization
achieved by the delay constrained mapping.
[4] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-
in-memory accelerator for parallel graph processing,” ACM SIGARCH
Computer Architecture News, vol. 43, no. 3, pp. 105–117, 2016.
[5] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vi-
jaykumar, O. Mutlu, and S. W. Keckler, “Transparent offloading and
mapping (tom): Enabling programmer-transparent near-data processing
in gpu systems,” ACM SIGARCH Computer Architecture News, vol. 44,
no. 3, pp. 204–216, 2016.
[6] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim,
M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-
memory accelerator for bulk bitwise operations using commodity dram
technology,” in Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 273–287, ACM, 2017.
[7] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan,
O. Ergin, C. Alkan, and O. Mutlu, “Grim-filter: fast seed filtering in
read mapping using emerging memory technologies,” arXiv preprint
arXiv:1708.04329, 2017.
[8] A. C. Torrezan, J. P. Strachan, G. Medeiros-Ribeiro, and R. S. Williams,
“Sub-nanosecond switching of a tantalum oxide memristor,” Nanotech-
nology, vol. 22, no. 48, p. 485203, 2011.
[9] M.-J. Lee, C. B. Lee, D. Lee, S. R. Lee, M. Chang, J. H. Hur, Y.-
B. Kim, C.-J. Kim, D. H. Seo, S. Seo, et al., “A fast, high-endurance
and scalable non-volatile memory device made from asymmetric
Ta2O5−x/TaO2−x bilayer structures,” Nature materials, vol. 10, no. 8,
pp. 625–630, 2011.
[10] Z. Wei, T. Takagi, Y. Kanzawa, Y. Katoh, T. Ninomiya, K. Kawai,
S. Muraoka, S. Mitani, K. Katayama, S. Fujii, et al., “Retention model
for high-density ReRAM,” in Memory Workshop (IMW), 2012 4th IEEE
International, pp. 1–4, IEEE, 2012.
[11] W. Chien, F. Lee, Y. Lin, M. Lee, S. Chen, C. Hsieh, E. Lai, H. Hui,
Y. Huang, C. Yu, et al., “Multi-layer sidewall WOx resistive memory
suitable for 3D ReRAM,” in VLSI Technology (VLSIT), 2012 Symposium
on, pp. 153–154, IEEE, 2012.
[12] E. Linn, R. Rosezin, C. Kügeler, and R. Waser, “Complementary
resistive switches for passive nanocrossbar memories,” Nature materials,
vol. 9, no. 5, pp. 403–406, 2010.
[13] A. Siemon, S. Menzel, R. Waser, and E. Linn, “A complementary
resistive switch-based crossbar array adder,” Emerging and Selected
Topics in Circuits and Systems, IEEE Journal on, vol. 5, no. 1, pp. 64–
74, 2015.
[14] D. Bhattacharjee, F. Merchant, and A. Chattopadhyay, “Enabling
in-memory computation of binary BLAS using ReRAM crossbar arrays,” in
2016 IFIP/IEEE International Conference on Very Large Scale
Integration (VLSI-SoC), pp. 1–6, Sept 2016.
[15] D. Bhattacharjee, V. Pudi, and A. Chattopadhyay, “SHA-3 implementation
using ReRAM based in-memory computing architecture,” in Quality
Electronic Design (ISQED), 2017 18th International Symposium on,
pp. 325–330, IEEE, 2017.
[16] D. Bhattacharjee and A. Chattopadhyay, “In-memory data compression
using ReRAMs,” in Emerging Technology and Architecture for Big-data
Analytics, pp. 275–291, Springer, 2017.
[17] L. Xia, P. Gu, B. Li, T. Tang, X. Yin, W. Huangfu, S. Yu, Y. Cao,
Y. Wang, and H. Yang, “Technological exploration of RRAM crossbar
array for matrix-vector multiplication,” Journal of Computer Science
and Technology, vol. 31, no. 1, pp. 3–19, 2016.
[18] K.-H. Kim, S. Gaba, D. Wheeler, J. M. Cruz-Albrecht, T. Hussain,
N. Srinivasa, and W. Lu, “A functional hybrid memristor
crossbar-array/CMOS system for data storage and neuromorphic applications,”
Nano Letters, vol. 12, no. 1, pp. 389–395, 2011.
[19] P. Yao, H. Wu, B. Gao, S. B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang,
N. Deng, L. Shi, H.-S. P. Wong, et al., “Face classification using
electronic synapses,” Nature communications, vol. 8, p. 15199, 2017.
[20] X. Zhu, C. Du, Y. Jeong, and W. D. Lu, “Emulation of synaptic
metaplasticity in memristors,” Nanoscale, vol. 9, no. 1, pp. 45–51, 2017.
[21] P. M. Sheridan, F. Cai, C. Du, W. Ma, Z. Zhang, and W. D. Lu, “Sparse
coding with memristor networks,” Nature nanotechnology, vol. 12, no. 8,
p. 784, 2017.
[22] W. Wu, H. Wu, B. Gao, N. Deng, S. Yu, and H. Qian, “Improving Analog
Switching in HfOx-Based Resistive Memory With a Thermal Enhanced
Layer,” IEEE Electron Device Letters, vol. 38, no. 8, pp. 1019–1022,
2017.
[23] W. Kim, A. Chattopadhyay, A. Siemon, E. Linn, R. Waser, and V. Rana,
“Multistate memristive tantalum oxide devices for ternary arithmetic,”
Scientific reports, vol. 6, 2016.
[24] D. Bhattacharjee, W. Kim, A. Chattopadhyay, R. Waser, and V. Rana,
“Multi-valued and fuzzy logic realization using TaOx memristive
devices,” Scientific reports, vol. 8, no. 1, p. 8, 2018.
[25] Y. Bai, H. Wu, R. Wu, Y. Zhang, N. Deng, Z. Yu, and H. Qian, “Study
of multi-level characteristics for 3D vertical resistive switching memory,”
Scientific reports, vol. 4, p. 5780, 2014.
[26] E. Lehtonen and M. Laiho, “Stateful implication logic with memristors,”
in NanoArch, pp. 33–36, IEEE Computer Society, 2009.
[27] E. Lehtonen, J. Poikonen, and M. Laiho, “Two memristors suffice
to compute all Boolean functions,” Electronics letters, vol. 46, no. 3,
pp. 239–240, 2010.
[28] A. Raghuvanshi and M. Perkowski, “Logic Synthesis and a Generalized
Notation for Memristor-Realized Material Implication Gates,” ICCAD,
pp. 470–477, 2014.
[29] A. Chattopadhyay and Z. Rakosi, “Combinational logic synthesis for
material implication,” in 2011 IEEE/IFIP 19th International Conference
on VLSI and System-on-Chip, pp. 200–203, Oct 2011.
[30] D. Bhattacharjee and A. Chattopadhyay, “Delay-optimal technology
mapping for in-memory computing using ReRAM devices,” in Proceedings
of the 35th International Conference on Computer-Aided Design, p. 119,
ACM, 2016.
[31] D. Bhattacharjee, A. Easwaran, and A. Chattopadhyay,
“Area-constrained technology mapping for in-memory computing using ReRAM
devices,” in 22nd Asia and South Pacific Design Automation Conference,
ASP-DAC 2017, 2017.
[32] P. E. Gaillardon, L. Amarù, A. Siemon, E. Linn, R. Waser,
A. Chattopadhyay, and G. D. Micheli, “The programmable logic-in-memory
(PLiM) computer,” in 2016 Design, Automation & Test in Europe Conference
& Exhibition (DATE), pp. 427–432, March 2016.
[33] M. Soeken, S. Shirinzadeh, P.-E. Gaillardon, L. G. Amarù, R. Drechsler,
and G. De Micheli, “An MIG-based compiler for programmable
logic-in-memory architectures,” in Design Automation Conference (DAC),
2016 53rd ACM/EDAC/IEEE, pp. 1–6, IEEE, 2016.
[34] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, “Logic design within
memristive memories using memristor-aided logic (MAGIC),” IEEE
Transactions on Nanotechnology, vol. 15, no. 4, pp. 635–650, 2016.
[35] R. B. Hur, N. Wald, N. Talati, and S. Kvatinsky, “SIMPLE MAGIC:
Synthesis and in-memory mapping of logic execution for memristor-
aided logic,”
[36] D. Bhattacharjee, R. Devadoss, and A. Chattopadhyay, “ReVAMP: ReRAM
based VLIW architecture for in-memory computing,” in 2017 Design,
Automation & Test in Europe Conference & Exhibition (DATE), pp. 782–
787, IEEE, 2017.
[37] A. Siemon, S. Menzel, A. Marchewka, Y. Nishi, R. Waser, and E. Linn,
“Simulation of TaOx-based complementary resistive switches by a
physics-based memristive model,” in Circuits and Systems (ISCAS), 2014
IEEE International Symposium on, pp. 1420–1423, IEEE, 2014.
[38] S. Kim, W. Lee, and H. Hwang, “Selector devices for cross-point ReRAM,”
in 2012 13th International Workshop on Cellular Nanoscale Networks
and their Applications.
[39] W. Lee, J. Park, S. Kim, J. Woo, J. Shin, G. Choi, S. Park, D. Lee,
E. Cha, B. H. Lee, et al., “High current density and nonlinearity
combination of selection device based on TaOx/TiO2/TaOx structure for
one selector–one resistor arrays,” ACS nano, vol. 6, no. 9,
pp. 8166–8172, 2012.
[40] E. Linn, R. Rosezin, S. Tappertzhofen, U. Böttger, and R. Waser,
“Beyond von Neumann-logic operations in passive crossbar arrays alongside
memory operations,” Nanotechnology, vol. 23, no. 30, 2012.
[41] A. Mishchenko, J. S. Zhang, S. Sinha, J. R. Burch, R. Brayton, and
M. Chrzanowska-Jeske, “Using simulation and satisfiability to compute
flexibilities in Boolean networks,” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 5,
pp. 743–755, 2006.
[42] L. Amarù, P.-E. Gaillardon, and G. De Micheli, “Majority-inverter
graph: A new paradigm for logic optimization,” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 35,
no. 5, pp. 806–819, 2016.
[43] A. Mishchenko and M. Perkowski, “Fast heuristic minimization of
exclusive-sums-of-products,” 2001.
[44] “Berkeley Logic Synthesis and Verification Group, ABC: A System for
Sequential Synthesis and Verification, Release YMMDD.”
http://www.eecs.berkeley.edu/~alanmi/abc/. Accessed: 2017-10-31.
[45] T. Sasao, M. Fujita, et al., Representations of discrete functions, vol. 242.
Springer, 1996.
[46] T. Kozlowski, E. L. Dagless, and J. M. Saul, “An enhanced algorithm
for the minimization of exclusive-or sum-of-products for incompletely
specified functions,” in Computer Design: VLSI in Computers and
Processors, 1995. ICCD’95. Proceedings., 1995 IEEE International
Conference on, pp. 244–249, IEEE, 1995.
[47] R. Drechsler and B. Becker, “Sympathy: fast exact minimization of fixed
polarity Reed-Muller expressions for symmetric functions,” in European
Design and Test Conference, 1995. ED&TC 1995, Proceedings., pp. 91–
97, IEEE, 1995.
[48] A. Zakrevskij, “Minimum polynomial implementation of systems of
incompletely specified Boolean functions,” in Proc. Reed-Muller, vol. 95,
pp. 250–256, 1995.
[49] R. Drechsler, “Pseudo-Kronecker expressions for symmetric functions,”
IEEE Transactions on Computers, vol. 48, no. 9, pp. 987–990, 1999.
[50] S. Shirinzadeh, M. Soeken, P. E. Gaillardon, and R. Drechsler, “Fast
logic synthesis for RRAM-based in-memory computing using
majority-inverter graphs,” in Proceedings of DATE, 2016.
[51] D. Bhattacharjee, L. Amarù, and A. Chattopadhyay, “Technology-aware
logic synthesis for ReRAM based in-memory computing,” in 2017 Design,
Automation & Test in Europe Conference & Exhibition (DATE), March 2017.
[52] D. Bhattacharjee, L. Amarù, and A. Chattopadhyay, “Technology-aware
logic synthesis for ReRAM based in-memory computing,” in Design,
Automation & Test in Europe Conference & Exhibition (DATE), 2018,
pp. 1435–1440, IEEE, 2018.
