X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random
  Access Memories by Agrawal, Amogh et al.
Amogh Agrawal*, Akhilesh Jaiswal*, Chankyu Lee and Kaushik Roy, Fellow, IEEE
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN-47907, USA
(* Equal Contributors)
Email: {agrawa64, jaiswal, lee2216, kaushik}@purdue.edu
Abstract—Silicon-based Static Random Access Memories
(SRAM) and digital Boolean logic have been the workhorse of the
state-of-the-art computing platforms. Despite tremendous strides
in scaling the ubiquitous metal-oxide-semiconductor transistor,
the underlying von-Neumann computing architecture has remained unchanged. The limited
throughput and energy-efficiency of state-of-the-art computing systems result, to a
large extent, from the well-known von-Neumann bottleneck. The energy and throughput
inefficiency of von-Neumann machines has been accentuated in recent times by the present
emphasis on data-intensive applications such as artificial intelligence, machine learning
and cryptography. A possible approach towards mitigating the overhead associated with the
von-Neumann bottleneck is to enable in-memory Boolean computations. In this manuscript, we
present an augmented version of the conventional SRAM bit-cells, called the X-SRAM, with
the ability to perform in-memory, vector Boolean computations, in addition to the usual
memory storage operations. We propose at least six different schemes for enabling
in-memory vector computations including NAND, NOR, IMP (implication) and XOR logic gates
with respect to different bit-cell topologies: the 8T cell and the 8+T Differential cell.
In addition, we also present a novel ‘read-compute-store’ scheme, wherein the computed
Boolean function can be directly stored in the memory without the need of latching the
data and carrying out a subsequent write operation. The feasibility of the proposed
schemes has been verified using predictive transistor models and detailed Monte-Carlo
variation analysis. As an illustration, we also present the efficacy of the proposed
in-memory computations by implementing the AES (advanced encryption standard) algorithm
on a non-standard von-Neumann machine wherein the conventional SRAM is replaced by
X-SRAM. Our simulations indicated that up to 75% of memory accesses can be saved using
the proposed techniques.
Index Terms—In-memory computing, SRAM, sense amplifier,
von Neumann bottleneck.
I. INTRODUCTION
Since the invention of transistor switches [1], there has been an ever-increasing demand
for speed and energy-efficiency in computing systems. Almost all state-of-the-art
computing platforms are based on the well-known von-Neumann architecture, which is
characterized by decoupled memory storage and computing cores. Running data-intensive
applications, such as artificial intelligence, search engines, neural networks,
biological systems and financial analysis, on such von-Neumann machines is limited by the
von Neumann bottleneck [2]. This bottleneck results from frequent and large amounts of
data transfer between the physically separate memory units and compute cores. Moreover,
frequent to-and-fro data transfers incur large energy overheads in addition to limiting
the overall throughput.
[Figure 1 diagram: a Processor and a Memory Array separated by the von-Neumann
bottleneck, with ‘in-memory’ computing indicated at the memory array.]
Fig. 1. Illustration of the von-Neumann bottleneck. Frequent to-and-fro data transfers
between the processor and memory units incur large energy consumption and limit the
throughput. Computing within the memory array enhances the memory functionality, thereby
reducing the number of unnecessary data transfers for certain classes of operations such
as vector bit-wise Boolean logic.
In order to overcome the von-Neumann bottleneck, there have been many efforts to develop
new computing paradigms. One of the most promising approaches is in-memory computing,
which aims to embed logic within the memory array in order to reduce memory-processor
data transfers. Conceptually, the in-memory compute paradigm is illustrated in Fig. 1. It
shows two physically separated blocks, the processor and the memory unit, and the
associated computing bottleneck. In-memory techniques tend to bypass the von-Neumann
bottleneck by accomplishing computations right inside the memory array, as shown in the
figure. In other words, in-memory-compute blocks store data exactly like a standard
memory; however, they enable additional operations without expensive area or energy
overheads. By enabling logic computations in-memory, significant improvements in both
energy efficiency and throughput are expected [3]–[6].
arXiv:1712.05096v2 [cs.ET] 18 Jun 2018
[Figure 2 diagram: Proposed ‘In-Memory’ Techniques. 8T SRAM: Skewed Inverter Sensing
(NAND, NOR, XOR); Voltage Divider Scheme (IMP, XOR, possible 2 bit read operation).
8+T SRAM: Asymmetric Sense Amplifier, Differential Sensing (NAND, NOR, XOR).
Proposal for ‘Read-Compute-Store’ Operation.]
Fig. 2. A summary of In-Memory computing schemes proposed in this work.
With respect to the 8T cell, we present bit-wise NAND, NOR and XOR
operations using skewed inverter sensing. Further, we present the voltage-divider
based operation of 8T-cells for IMP and XOR gates. With respect to
the 8+T-cells, we present bit-wise NAND, NOR and XOR operations using
asymmetric differential SAs. Moreover, a ‘read-compute-store’ operation has
been presented for both types of bit-cells.
Due to the potential impact of in-memory computing on
future computing platforms, various proposals spanning right
from conventional complementary metal-oxide semiconductor
(CMOS) to beyond-CMOS technologies can be found in the
literature. For example, Ref. [7] proposed integrating an ALU (arithmetic-logic-unit)
close to the memory unit to exploit the wide memory bandwidth, while Ref. [3]
reconfigures standard 6-transistor (6T) static random-access memory (SRAM) cells as
content addressable memories (CAMs) and enables bit-wise logical operations. 6T-SRAM
cells have also been used to implement machine learning classifiers [8], and dot-products
in the analog domain for pattern recognition [5]. The underlying idea is to enable
multiple rows of memory bit-cells and directly read out a voltage at the pre-charged
bit-lines corresponding to the desired operation. However, the 6T-SRAM bit-cells have a
coupled read-write path that imposes conflicting constraints on the design of the 6T
cell, thereby raising issues of read-disturb failures. Moreover, activating multiple
word-lines may cause short-circuit paths, thereby flipping the cell states
nondeterministically. The read-disturb failure is further accentuated by the fact that
once the BL has discharged, activating subsequent word-lines performs a pseudo-write
operation on the 6T cell, given the shared read-write path. A 6T-SRAM based on the deeply
depleted channel (DDC) technology [9], which had decoupled read-write paths, was recently
proposed for searching and in-memory computing applications. However, all of these
proposals perform the computation in the peripheral circuits and read out the data. A
subsequent memory-write operation is required to store the data back in the memory array.
Thus, in our work, we use standard CMOS 8T- and 8+T Differential SRAM cells, due to their
decoupled read-write mechanisms, for performing in-memory computations. Moreover, we go a
step further and propose the novel ‘read-compute-store’ scheme, where the computed result
can be stored in-situ, within the memory array, without the need for latching the result
and performing a subsequent memory-write instruction. In addition, memristor-like
multi-bit dot product computations using 8T cells have recently been proposed in [10].
The present work differs from the work in [10] since the computations presented in [10]
are analog-like computations in SRAM arrays and require more complex peripheral
circuitry, whereas the focus of the present work is purely digital vector computations in
the SRAM arrays.
In addition, almost all beyond CMOS non-volatile technolo-
gies have been extensively explored for possible applications
to in-memory computing [11]. These include works based on
resistive RAMs [12], spin-based magnetic RAMs [13]–[15],
and phase change materials [16]. Such emerging non-volatile
technologies promise denser integration, energy-efficient op-
erations and non-volatility as compared to the CMOS based
memories, and are suitable for in-memory computations [17].
However, these emerging technologies are still in the research and development phase,
and their large-scale commercialization for on-chip memories remains distant.
In this work, we explore in-memory vector operations in
standard CMOS 8T- and 8+T Differential SRAM cells with
minimal modifications in the peripheral circuitry. We call
the augmented version of the SRAM bit-cells with extra in-
memory compute features as the X-SRAM. We propose at
least six different techniques to enable Boolean computations.
The 8T and 8+T cells lend themselves easily for enabling in-
memory computations because of the following three factors.
1) The read ports of the 8T and 8+T cells are isolated and
can be easily configured to enable in-memory operations. 2)
Also, in sharp contrast to the 6T cells, 8T and 8+T cells do
not suffer from read disturb and hence multiple read word-
lines within the memory array can be simultaneously activated.
3) In addition, in this manuscript, we exploit the two port
structure of the 8T and 8+T cells to propose a novel read-
compute-store operation, wherein, the computed Boolean data
can be stored into the memory array without actually latching
the data followed by a subsequent memory write-operation.
Later, in the Appendix, we describe in-memory computations
in standard 6T-SRAMs using staggered activation of word-lines,
as was presented for analog computing in Ref. [5].
Some of the key highlights of the present work in compar-
ison to previous works are enumerated below.
1) We first leverage the fact that two simultaneously activated read word-lines for the
standard 8T cells are inherently ‘wire NORed’ through the read bit-line. By using a
skewed inverter at the sensing output, we demonstrate that the NOR operation can be
easily achieved. Further, we also show that NAND logic can similarly be accomplished
using another skewed inverter. Note, unlike 6T cells, simultaneous activation of two read
word-lines does not raise any read-disturb concerns, thereby opening up a wider design
space for optimization.
2) Further, by applying appropriate voltages, we show that
two activated read ports of the 8T cell can be configured
as a voltage divider. Based on such voltage divider
scheme we present in-memory vector IMP as well as
XOR logic gates. The voltage divider scheme not only
allows in-memory computations, but also augments the
read mechanism by allowing a possible two bit-read
operation under specific conditions.
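[Figure 3 diagrams: a) 8T bit-cell with WWL, RWL, WBL/WBLB, RBL, storage nodes Q/QB and
read transistors M1, M2; b) Cell 1 (RWL1) and Cell 2 (RWL2) sharing RBL, sensed by
inverters INV1, INV2 and INV3, INV4 to produce the NOR and NAND outputs; c), d) waveforms
of RWL1/RWL2 and RBL for Case 00, Case 01/10 and Case 11.]
Truth table shown in b):
A B | NAND NOR XOR
0 0 |  1    1   0
0 1 |  1    0   1
1 0 |  1    0   1
1 1 |  0    0   0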
Fig. 3. a) Schematic of a standard 8T-SRAM bit-cell. In addition to the standard 6T cell, two additional transistors form the read path using a separate read
bit-line (RBL). b) Single ended sensing of NAND/NOR using gated skewed inverters. Figure also shows the truth table for NAND/NOR/XOR operations. c)
Timing diagram for reading NOR output of Cell 1 and Cell 2. d) Timing diagram for reading NAND output of Cell 1 and Cell 2.
3) Subsequently, we also present in-memory NAND and
NOR computations (along with XOR) in the recently
proposed 8+T cells [18], using asymmetric sense am-
plifiers (SA). The 8+T cells are more robust since
they allow differential read sensing as opposed to the
standard 8T cells that are characterized by single ended
sensing. The usual memory read/write functionality of
the SRAM cell is not disturbed due to the use of
asymmetric sense amplifiers. We also show that the same
hardware, including the SA, can be shared for an in-
memory operation and also for the normal memory read
operation. Moreover, the extra hardware enhances the
memory read operation, by acting as a check for read
failures.
4) We propose a novel ‘read-compute-store’ scheme for
the 8T and 8+T bit-cells, wherein the computed data
can directly be written into the desired memory loca-
tion, without having to latch the output and perform a
subsequent memory write operation. This exploits the
decoupled read-write paths of the 8T and 8+T bit-cells.
5) We perform Monte-Carlo simulations including voltage and temperature variations to
verify the robustness of the proposed in-memory operations for the 8T and the 8+T
bit-cells. Energy, delay and area numbers have been presented for each of the proposed
schemes.
6) We demonstrate the effectiveness of using in-memory bitwise computations in a typical
von-Neumann machine, wherein the conventional SRAM is replaced by the proposed X-SRAM,
for the Advanced Encryption Standard (AES) algorithm. Our system level simulations
indicate a 75% reduction in memory accesses, thereby saving energy-expensive data
transfers.
II. IN-MEMORY COMPUTATIONS IN 8-TRANSISTOR
SRAM BIT-CELLS
As discussed in the introduction, 8T cells have a favorable bit-cell structure to enable
in-memory computing. Specifically, we exploit the isolated read mechanism and the
two-port cell topology to embed NAND, NOR, IMP and XOR logic within the memory array.
Further, by leveraging the separate read and write ports of the 8T cell, we also propose
a ‘read-compute-store’ scheme, wherein, with minimal changes in the peripheral circuits,
the computed Boolean results can be stored in the desired row of the memory array in the
same cycle without the need of latching the results and performing a subsequent write
operation.
For each proposal, we first describe the circuit operation using representative
illustrations of the transient waveforms, followed by actual SPICE based transient
simulations under Monte-Carlo analysis. Further, we also present a distribution graph for
the key voltages that represent worst case scenarios including temperature as well as
voltage variations. Note, in general, global process variations can be taken care of by
proper calibration; therefore, we concentrate on intra-die threshold voltage variation
along with variations in temperature and supply voltage. Towards the end of the
manuscript, we tabulate the pros and cons of the proposed techniques in a comparative
manner.
A. 8-Transistor SRAM: NOR operation
The 8T SRAM cell is shown in Fig. 3(a). It consists of the usual 6T cell augmented by an
additional read port constituted by transistors M1-M2. The write operation is similar to
that of the 6T cell, whereas for the read operation, RWL is activated (WWL is low). The
RBL is initially pre-charged; if Q = ‘1’ the RBL discharges, otherwise it stays at its
initial precharged condition. This decoupled read port of the 8T cell allows a large
voltage swing (almost rail-to-rail) on the RBL during the read operation without any
concerns of read disturb failure.
The output of a NOR operation is ‘1’ only if both the inputs are ‘0’. Suppose we activate
two RWLs corresponding to the rows storing vector operand ‘A’ and vector operand ‘B’,
respectively, as shown in Fig. 3(b). Due to the decoupled read ports, both the RWLs can
be activated simultaneously without any read disturb concerns, as opposed to the 6T cell.
The precharged RBL line retains its precharged state if and only if both the bits Q
corresponding to operands ‘A’ and ‘B’ are ‘0’.
[Figure 4 plots: SPICE transient waveforms of RBL together with the NAND and NOR outputs
for cases ‘00’, ‘01/10’ and ‘11’; axes: Voltage (V), 0 to 1.5, versus Time (s), 0 to 4n.]
Fig. 4. Monte-Carlo simulations in SPICE for NAND and NOR outputs for all possible input cases − ‘00,01,10,11’, in presence of 30mV sigma variations
in threshold voltage.
 
[Figure 5 plots: distributions of VRBL (V) for Case 11 and Case 01/10 at the TT and SS
corners, for T=298K and T=353K, in three supply-voltage panels (a), (b) Nominal VDD
and (c).]
Fig. 5. Monte-Carlo simulations across process corners (TT corner and SS corner shown) under voltage and temperature variations for NAND outputs for the
borderline cases − ‘01/10’ and ‘11’. The distribution of RBL voltage is plotted under 30mV sigma threshold voltage variations for two different temperatures
and ±10% variation in nominal VDD .
In other words, as shown in Fig. 3(c), RBL remains high only if Q = ‘0’ for both ‘A’ and
‘B’. Thus, merely by activating the two RWLs, data stored in the two bit-cells are ‘wire
NORed’. A gated inverter (INV1) is connected to the RBL such that the inverter output
goes low if the RBL remains high. Thereby, the output of the cascaded inverter (INV2)
mimics the NOR operation. Note, the NOR operation is the same as the usual read operation
except that we have turned ON two RWLs instead of one. Thus, NOR can be easily achieved
in the 8T bit-cell without any significant overhead. The timing diagram for the NOR
operation is shown in Fig. 3(c). It is also interesting to observe that although we have
discussed the NOR operation for two inputs, the proposed scheme can in fact be extended
to n-input NOR operations. For an n-input NOR operation, n read word-lines can be
simultaneously activated and RBL would remain high only if all the corresponding operands
are ‘0’, which represents the n-input NOR truth table.
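The wired-NOR behavior can be illustrated with a purely behavioral sketch. The Python below is not a circuit model: the function names `rbl_stays_high`, `nor_readout` and `vector_nor` are hypothetical, and each read port is treated as an ideal switch, with timing and analog effects ignored.

```python
def rbl_stays_high(q_bits):
    """Precharged RBL keeps its level only if no activated cell stores Q = 1."""
    return not any(q_bits)


def nor_readout(q_bits):
    """INV1 goes low when RBL stays high; the cascaded INV2 then gives NOR."""
    inv1 = not rbl_stays_high(q_bits)
    inv2 = not inv1
    return int(inv2)


def vector_nor(row_a, row_b):
    """Bit-wise NOR of two stored rows, one RBL column per bit position."""
    return [nor_readout([a, b]) for a, b in zip(row_a, row_b)]


if __name__ == "__main__":
    print(vector_nor([0, 0, 1, 1], [0, 1, 0, 1]))  # [1, 0, 0, 0]
    print(nor_readout([0, 0, 0, 0]))               # 4-input NOR of zeros: 1
```

Because `nor_readout` accepts any number of bits, the same function covers the n-input extension mentioned in the text.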
B. 8-Transistor SRAM: NAND operation
Let us consider that we activate two RWLs corresponding to vector operands ‘A’ and ‘B’,
respectively. The precharged RBL will eventually go to 0V if Q for any one of the input
operands is ‘1’. However, the fall time of the signal at RBL from the precharged value to
0V depends strongly on whether only one Q is high or both the Q bits are high
simultaneously. In other words, only if both the Qs are ‘1’ is the discharge of the
precharged RBL line fast enough. In Fig. 3(d), we have shown schematically the state of
the RBL for all input cases. In order to exploit the different discharge rates of the
RBL, the RWL signal has to be timed such that the RBL does not discharge completely in
cases ‘01/10’. This allows a difference in voltage levels on RBL in the two cases
(‘01/10’ and ‘11’). The trip point of the inverter INV3 is chosen such that it goes high
only for the case ‘11’; thus the output of inverter INV4 mimics the NAND operation.
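The timing-based NAND sensing can be sketched with a first-order RC discharge model. This is an assumption-laden illustration, not the paper's SPICE setup: the supply voltage, path resistance `R_PATH`, pulse width and trip point are hypothetical values chosen only to make the two discharge rates visible, while the 10 fF capacitance follows the bit-line capacitance assumed in the paper's simulations.

```python
import math

VDD = 1.0        # assumed supply voltage (V), illustrative only
C_RBL = 10e-15   # bit-line capacitance (F), 10 fF as in the paper's simulations
R_PATH = 20e3    # hypothetical on-resistance of one read path (ohm)


def rbl_voltage(num_on_paths, t):
    """RBL voltage after time t with num_on_paths parallel discharge paths."""
    if num_on_paths == 0:
        return VDD  # no discharge path: RBL keeps its precharged level
    tau = R_PATH * C_RBL / num_on_paths
    return VDD * math.exp(-t / tau)


def nand_sense(a, b, t_pulse=200e-12, v_trip=0.25):
    """Sample RBL when RWL is pulled low; INV3 trips only for case '11'."""
    v_rbl = rbl_voltage(a + b, t_pulse)
    inv3 = v_rbl < v_trip   # high only if RBL fell below the trip point
    inv4 = not inv3         # INV4 inverts INV3, giving NAND
    return int(inv4)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, nand_sense(a, b))
```

With these assumed values the single-path case leaves RBL above the trip point while the two-path case falls below it, which is the separation the skewed inverter relies on.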
Fig. 4 shows the SPICE transient simulation for the NAND
and NOR proposals, under 30mV sigma threshold voltage
variation in transistors. We used 45-nm Predictive Technology
Models (PTM) [19] for simulating the circuits. A BL and
BLB capacitance of 10fF was assumed for all the simulations.
As discussed earlier, the NAND computation has a narrower
design margin due to its timing critical operation as opposed
to the NOR logic. Specifically, for the NAND operation
a discharge path with two parallel transistors needs to be
distinguished from the discharge path with one transistor. To
analyze the robustness and the design margin, we performed
a rigorous variation analysis across process corners including
voltage and temperature variations for the NAND operation, as
shown in Fig. 5. Monte-Carlo simulations with 30mV sigma
threshold voltage variation were performed along with a ±10%
variation in nominal VDD (∆VDD). The simulations were
repeated for two different temperatures. The figure shows the
resultant distribution of voltage on RBL for the borderline
cases ‘11’ and ‘01/10’, at the instant when RWL is pulled
LOW.
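The structure of such a variation analysis can be sketched with a toy Monte-Carlo loop. The model below is an assumption, not the paper's SPICE setup, and does not reproduce its results: path conductance is taken as proportional to gate overdrive, one effective threshold voltage is sampled per read path, and `VT0`, `R_NOM` and `T_PULSE` are hypothetical. Only the 30 mV sigma and the 10 fF capacitance come from the text.

```python
import math
import random

VDD = 1.0          # assumed nominal supply voltage (V)
VT0 = 0.4          # assumed nominal threshold voltage (V)
SIGMA_VT = 0.03    # 30 mV sigma threshold variation, as in the text
C_RBL = 10e-15     # 10 fF bit-line capacitance, as in the text
R_NOM = 20e3       # hypothetical nominal read-path resistance (ohm)
T_PULSE = 200e-12  # hypothetical RWL pulse width (s)


def path_conductance(vt, vdd):
    """Toy model: conductance scales linearly with overdrive (vdd - vt)."""
    return (vdd - vt) / (R_NOM * (VDD - VT0))


def rbl_after_pulse(conductances, vdd):
    """RBL voltage at the instant RWL is pulled low (first-order RC)."""
    return vdd * math.exp(-T_PULSE * sum(conductances) / C_RBL)


def worst_case_margin(n_samples, sigma_vt, vdd=VDD, seed=0):
    """Smallest RBL level for case '01/10' minus largest level for case '11'."""
    rng = random.Random(seed)
    v01, v11 = [], []
    for _ in range(n_samples):
        g1 = path_conductance(rng.gauss(VT0, sigma_vt), vdd)
        g2 = path_conductance(rng.gauss(VT0, sigma_vt), vdd)
        v01.append(rbl_after_pulse([g1], vdd))
        v11.append(rbl_after_pulse([g1, g2], vdd))
    return min(v01) - max(v11)


if __name__ == "__main__":
    for vdd in (0.9 * VDD, VDD, 1.1 * VDD):
        margin = worst_case_margin(1000, SIGMA_VT, vdd)
        print(f"VDD = {vdd:.2f} V, worst-case margin = {margin:.3f} V")
```

The quantity returned by `worst_case_margin` plays the role of the sense margin between the borderline cases; a positive value means a single trip point can still separate them in this toy model.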
In order to study the effect of variations due to different process corners, we also
performed simulations assuming global variations in the threshold voltage. The
simulations were performed for all possible corners including SS (Slow NMOS, Slow PMOS),
SF (Slow NMOS, Fast PMOS), FS (Fast NMOS, Slow PMOS) and FF (Fast NMOS, Fast PMOS). The
threshold voltages were globally shifted in the appropriate directions for each of the
process corners for both the PMOS and the NMOS transistors. For example, for the SS and
FF corners, the threshold voltages were increased or decreased, respectively, by ∼90mV to
imitate the effect of process corners. These global shifts in threshold voltages were
then superimposed with random VT variations to evaluate the cumulative effect. In Fig. 5,
we have shown the Monte-Carlo results for two different process corners, the nominal case
(TT) and the SS corner, for two different temperatures including ±10% variation in supply
voltage. Note, similar results were obtained for the other process corners as well;
however, to avoid clutter, we have shown two representative results. It can be observed
that we obtain a 50mV worst-case sense margin in Fig. 5(a) for a −10% variation in
nominal VDD. Also, the timing for the NAND operation can be controlled by a digitally
programmable delay based control signal for tuning the pulse activation of the RWL [20].
Such a programmable delay path would require a one-time calibration depending on the
process corner, for proper functionality.
In addition to the NAND and NOR operations, by NORing the outputs of the AND (INV3) and
the NOR (INV2) gates together, the XOR operation can be easily achieved. In summary, we
have shown that the very bit-cell topology of the 8T cell can be exploited to accomplish
in-memory NOR, NAND and XOR computations. In the next sub-section, we discuss another
proposal for embedding IMP as well as XOR gates within the 8T SRAM array by utilizing the
proposed voltage divider scheme.
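The XOR construction can be checked exhaustively with a small Boolean sketch. The function name `xor_from_and_nor` is hypothetical; the code only mirrors the logic identity, with the AND and NOR signals standing in for the INV3 and INV2 outputs.

```python
def xor_from_and_nor(a, b):
    """XOR obtained by NORing the AND and NOR signals of the same inputs."""
    and_out = a & b                 # INV3 output: high only for case '11'
    nor_out = 1 - (a | b)           # INV2 output: high only for case '00'
    return 1 - (and_out | nor_out)  # NOR of the two signals


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_from_and_nor(a, b))
```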
C. 8 Transistor SRAM: Voltage Divider Scheme for IMP and
XOR gates
In this sub-section, we present a method of implementing IMP and XOR operations using the
8T cell by exploiting the voltage divider principle. Consider the circuit shown in Fig.
6(a). Let us assume the first operand is stored in the upper bit-cell corresponding to
the line RWL1, while the second operand is stored in the lower bit-cell corresponding to
RWL2. In the conventional 8T cell, the sources of transistors M1 and M4 are connected to
ground. In the presented circuit, the sources of the transistors M1 and M4 are connected
to respective source lines (SL1 and SL2, shared along respective rows). During normal
operation, the SLs can be grounded, thereby accomplishing the usual 8T SRAM read and
write operations. During the in-memory computation mode, SL1 is pulled to VDD, while SL2
is grounded. RWL1 and RWL2 are initially grounded and RDBL is pre-charged to a voltage
Vpre (chosen to be 400mV). After the pre-charge phase, the access transistors M2 and M3
are switched ON by activating RWL1 and RWL2; thereby M1 − M2 − M3 − M4 form a voltage
divider and RDBL forms the middle node of the voltage divider structure (see Fig. 6(b)).
Note, in the voltage divider configuration, M1 and M2 are strongly source degenerated. In
order to make sure M1 and M2 are sufficiently ON, we boosted the VDD of ‘Cell 1’ and RWL1
such that the gates of M1 and M2 have enough overdrive when ‘Cell 1’ is storing a digital
‘1’ (Q = ‘1’ and QB = ‘0’).
In the voltage divider configuration M1 − M2 − M3 − M4,
RDBL retains its precharged voltage Vpre if both the bit-cells
are storing digital ‘0’ (i.e. M1 and M4 are OFF ). Similarly,
if both the cells are storing a digital ‘1’ (i.e. M1 and M4 are
ON), the voltage at RDBL stays close to its precharged value
(400mV) due to the voltage divider effect. Thus, when the
cells store (0,0) or (1,1) (where the first (second) number in
the bracket indicates the data stored in Cell 1 (2)), the voltage
at RDBL stays close to the precharged voltage. On the other
hand, if the data stored is (1,0), then M1 is ON while M4 is
OFF. As such, RDBL will charge to VDD through transistors
M1 and M2. In contrast, if the data stored is (0,1), M4 is
ON while M1 is OFF. Therefore, RDBL will discharge to 0V
through transistors M3 and M4. In summary, the voltage on
RDBL stays close to Vpre when both the cells store same data.
RDBL charges to VDD for data (1,0) and discharges to 0V
for data (0,1).
The state of the data stored in the two cells can be sensed
through two skewed inverters. INV2 is skewed such that it
goes high only when RDBL is much lower than Vpre and is
close to 0V, while INV1 is skewed so that it goes low only
when RDBL is higher than Vpre and is close to VDD. In
other words, high output at INV2 indicates data (0,1) while
high output at INV3 indicates data (1,0). Interestingly, INV1
implements ‘A IMP B’. By ORing the output of INV2 and
INV3 we can obtain the XOR of inputs A and B.
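The sensing logic of the voltage divider scheme can be summarized in a behavioral sketch. It idealizes RDBL as one of three levels and is not a circuit simulation. The names `rdbl_level` and `sense` are hypothetical, and the code assumes that INV3 is simply the inverse of INV1, which the text implies but does not state explicitly.

```python
def rdbl_level(q1, q2):
    """Idealized steady-state RDBL level for stored bits (Cell 1, Cell 2)."""
    if q1 == q2:
        return "VPRE"  # (0,0) and (1,1): RDBL stays near its precharge value
    return "VDD" if q1 == 1 else "GND"  # (1,0) charges, (0,1) discharges


def sense(q1, q2):
    """Outputs of the skewed inverters and the derived IMP and XOR results."""
    level = rdbl_level(q1, q2)
    inv1 = int(level != "VDD")  # goes low only when RDBL is close to VDD
    inv2 = int(level == "GND")  # goes high only when RDBL is close to 0 V
    inv3 = 1 - inv1             # ASSUMPTION: INV3 inverts INV1
    return {"IMP": inv1, "XOR": inv2 | inv3, "INV2": inv2, "INV3": inv3}


if __name__ == "__main__":
    for q1 in (0, 1):
        for q2 in (0, 1):
            print((q1, q2), rdbl_level(q1, q2), sense(q1, q2))
```

Here ‘A IMP B’ is taken with A in Cell 1 and B in Cell 2, so the only input giving a low IMP output is (1,0).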
[Figure 6 diagrams: a) Cell 1 (RWL1, SL1 at VDD) and Cell 2 (RWL2, SL2) sharing RDBL,
with read transistors M1 to M4 and sensing inverters INV 1 and INV 2 (cases ‘10’ and
‘01’ marked); b) equivalent stack M1, M2, M3, M4 between VDD and ground with gates Q1,
RWL1, RWL2, Q2 and RDBL as the middle node; c) inverter output waveforms (labeled INV 2
and INV 3) for Case ‘00’, Case ‘01’, Case ‘10’ and Case ‘11’; axes: Voltage (V), 0 to
1.5, versus Time (s), 0 to 4n.]
Fig. 6. a) Circuit schematic of the 8T-SRAM for implementing the voltage-divider scheme. b) Equivalent circuit traced by transistors M1−M4 while data
is read from Cell 1 and Cell 2. c) Monte-Carlo simulations in SPICE for all possible input cases, showing the output of the two asymmetric inverters.
 
[Figure 7 plots: distributions of Verror (V) at the TT and SS corners, for T=298K and
T=353K, in three supply-voltage panels (a), (b) Nominal VDD and (c).]
Fig. 7. Monte-Carlo simulations with variations in supply voltage and temperature across
process corners for the voltage-divider scheme for the case (1,1). Verror is defined as
the difference between the RDBL voltage (when both the operands are ‘1’) and the initial
pre-charge voltage Vpre. The distribution of Verror is plotted under 30mV sigma threshold
voltage variations for two different temperatures and ±10% variation in nominal VDD. The
variations in Vpre are also accounted for.
Fig. 7 shows the distribution of Verror (defined below)
under VT variations in addition to variations in temperature
and supply-voltage. Monte-Carlo simulations across process
corners, similar to the ones performed for NAND in the
previous sub-section, were performed in this case. Note, when
exactly one of the two operands Q1 or Q2 is low, the circuit in
Fig. 6(b) reduces to an RC charging or discharging circuit.
As such, if only one of Q1 and Q2 is low, the RDBL
would either charge up to VDD or discharge to ground even
under variations. The critical case arises when both Q1 and Q2
are low or high, simultaneously. For robust operation, ideally
we want the RDBL voltage to stay at Vpre for both the cases
(Q1 = Q2 = low or Q1 = Q2 = high). Therefore, we analyze
the difference in voltages on the RDBL in the two cases ‘Q1
= Q2 = low’ versus ‘Q1 = Q2 = high’. We define Verror
as the difference between the RDBL voltage for case (1,1)
and case (0,0). In other words, Verror denotes the variation of
RDBL voltage when the voltage divider is active with respect
to Vpre. The variations in Vpre are also considered in the
Monte-Carlo simulations. We observe that Verror is close to
zero, making this configuration robust to variations, as shown
in Fig. 7. Intuitively, the robustness of the proposed scheme
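[Figure 8 diagrams: a) column circuit with RWL1, RWL2, RBL, a Compute block, a Sel input
choosing between the computed output and Write Data, a Write Driver on WBL/WBLB, and
WWL3; b) memory array with RCS blocks, operand rows A (RWL1) and B (RWL2) and the result
row (WWL3); c) waveforms of RBL, NAND, Q3 and QB3 for RCS to Cell 3 (Case ‘11’); axes:
Voltage (V), 0 to 1.5, versus Time (s), 0 to 4n.]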
Fig. 8. a) Proposed ‘read-compute-store’ (RCS) scheme. RWL1 and RWL2 are enabled, corresponding to the data to be computed. The computation output
is selectively passed to the write-driver of that column, while simultaneously enabling the WWL3, where data is to be stored. b) Block diagram showing the
RCS blocks in the memory array. The NAND of row 1 and row 2 is to be stored in row 3. c) Monte-Carlo simulations in SPICE, showing the final state of
Cell 3 stores the desired output.
stems from the fact that changes in voltage and temperature affect all the four
transistors of Fig. 6(b) in a similar manner, thereby reducing any variations in the
voltage at node RDBL. Moreover, since we use the static voltage developed at RDBL, unlike
the time-sensitive discharge in the earlier scheme, the voltage-divider scheme is robust
to process corners as well. The voltage at RDBL depends on the relative strengths of the
four transistors. Since process corners induce global VT shifts, all NMOS transistors are
equally affected, leaving the voltage-divider ratio largely unaffected. This is evident
from the two representative process corner simulations shown in Fig. 7.
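The ratio argument can be made concrete with a linear-resistor sketch. Real transistors are nonlinear, so this is only an approximation under stated assumptions: each transistor is replaced by a fixed on-resistance, the supply is 1.0 V, and the resistance values are hypothetical, picked so that the mid-node sits at 400 mV.

```python
def rdbl_divider_voltage(vdd, r_m1, r_m2, r_m3, r_m4):
    """Mid-node voltage of a resistive divider between SL1 (vdd) and SL2 (0 V)."""
    r_up = r_m1 + r_m2    # pull-up branch through M1 and M2
    r_down = r_m3 + r_m4  # pull-down branch through M3 and M4
    return vdd * r_down / (r_up + r_down)


if __name__ == "__main__":
    nominal = (30e3, 30e3, 20e3, 20e3)      # hypothetical on-resistances (ohm)
    slow = tuple(1.2 * r for r in nominal)  # global shift: all devices 20% weaker
    print(rdbl_divider_voltage(1.0, *nominal))
    print(rdbl_divider_voltage(1.0, *slow))
```

A common scaling factor cancels in the ratio, so both calls give the same mid-node voltage; only mismatch between the branches moves it.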
Some key features of the voltage divider logic scheme are: 1) IMP, combined with a
constant logic ‘0’ (FALSE), forms a functionally complete set, and hence any arbitrary
Boolean function can be implemented using the proposed scheme. 2) If either of the
inverter outputs (INV2 or INV3) is high, it indicates that the data stored is (0,1) or
(1,0), thereby allowing a two-bit read operation in addition to the desired in-memory
computation. However, if neither inverter output is high, a subsequent read operation
would be required to ascertain whether the stored data is (0,0) or (1,1). As such, in the
50% of cases where the data stored is (0,1) or (1,0), we can accomplish a two-bit read
operation along with the in-memory compute operation.
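Both features can be illustrated in a few lines. The sketch assumes a constant logic ‘0’ is available alongside IMP, since IMP alone cannot produce NOT (1 IMP 1 is always 1). All function names are hypothetical, and `two_bit_read` takes the INV2 and INV3 outputs as plain 0/1 values.

```python
def imp(a, b):
    """Material implication: false only when a = 1 and b = 0."""
    return int((not a) or b)


def not_(a):
    return imp(a, 0)        # a IMP FALSE = NOT a


def nand(a, b):
    return imp(a, not_(b))  # (NOT a) OR (NOT b)


def or_(a, b):
    return imp(not_(a), b)  # a OR b


def two_bit_read(inv2, inv3):
    """Decode both stored bits when exactly one inverter output is high."""
    if inv2 and not inv3:
        return (0, 1)
    if inv3 and not inv2:
        return (1, 0)
    return None  # (0,0) or (1,1): a further read is needed


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, imp(a, b), nand(a, b), or_(a, b))
```

Since NAND is itself universal, building it from IMP and a constant ‘0’ is enough to show that arbitrary Boolean functions are reachable.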
D. Proposed ‘read-compute-store’ (RCS) scheme
We have seen that basic Boolean operations like NAND,
NOR, IMP and XOR can be computed using 8T cells. We
would now show that the decoupled read and write ports of
the 8T bit-cell can be used for enabling ‘read-compute-store’
(RCS) scheme. The RCS scheme implies that while the data
is being read from the two activated RWLs (corresponding to
the two input operands), simultaneously the WWL of a third
row can be activated such that the computed data gets stored
in the third row at the same time while the actual Boolean
computation is in progress. As such, the computed data is not
required to be latched first, then written subsequently, in a
multi-cycle fashion. Note, writing into 8T bit-cells is much
easier due to the fact that the write port of the 8T cell is
specifically optimized for the write operation.
Let us understand how the RCS scheme can be implemented
with reference to Fig. 8. Assume that the input operands
correspond to rows 1 and 2, while the result of the Boolean
computation has to be stored in row 3. Note that this Boolean
computation can be any of NAND/NOR/IMP/XOR. Let us
take the NAND operation as an example. As shown in Fig.
8(a), the two read word-lines RWL1 and RWL2 are activated, and
the compute block, which is an abstracted view of
the skewed inverters of Fig. 3(b), performs the logic
computation. Now, since the read and write ports of the 8T cell
are decoupled, we can simultaneously activate a third word-line,
in this case the write word-line WWL3. The computed
output is selected through a multiplexer and fed to the
write drivers, which directly store the Boolean result in the
bit-cells corresponding to WWL3. Thus, the fact that 8T
cells have decoupled read-write ports can be leveraged to
accomplish the proposed ‘read-compute-store’ scheme. Fig.
8(b) shows schematically the array-level block diagram where
the three word-lines RWL1, RWL2 and WWL3 are activated
simultaneously. In Fig. 8(c) we show the Monte-Carlo results
for storing the computed NAND output into Cell3. Note that a
‘copy’ operation can also be performed using the RCS scheme,
by activating the RWL of the source row and the WWL of the
destination row. In this case, the input to the RCS block will
simply be the SA output, which corresponds to the data stored
in the bit-cells of the source row.
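The RCS flow described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class `RcsArray`, its method names, the operand order chosen for IMP, and the rule that the destination row must differ from the operand rows are all hypothetical choices made for illustration, and the model ignores timing, sensing and write-driver details.

```python
from typing import Callable, Dict, List

# Bit-wise functions named in the text. Each maps two stored bits to one bit.
OPS: Dict[str, Callable[[int, int], int]] = {
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "XOR": lambda a, b: a ^ b,
    "IMP": lambda a, b: (1 - a) | b,  # "A implies B"; operand order is an assumption
}


class RcsArray:
    """Hypothetical behavioral model of an 8T array with decoupled read/write ports."""

    def __init__(self, rows: int, cols: int) -> None:
        self.cells: List[List[int]] = [[0] * cols for _ in range(rows)]

    def write(self, row: int, bits: List[int]) -> None:
        """Ordinary memory write of one row."""
        self.cells[row] = list(bits)

    def rcs(self, op: str, src1: int, src2: int, dest: int) -> None:
        """Model RWL[src1] and RWL[src2] read together, result stored via WWL[dest]."""
        if dest in (src1, src2):
            # Assumption of this sketch: the result goes to a third row.
            raise ValueError("destination row must differ from the operand rows")
        func = OPS[op]
        self.cells[dest] = [func(a, b) for a, b in zip(self.cells[src1], self.cells[src2])]

    def rcs_copy(self, src: int, dest: int) -> None:
        """Copy: the sensed data of the source row is written to the destination row."""
        if src == dest:
            raise ValueError("source and destination rows must differ")
        self.cells[dest] = list(self.cells[src])


arr = RcsArray(rows=4, cols=4)
arr.write(0, [0, 0, 1, 1])   # operand row 1
arr.write(1, [0, 1, 0, 1])   # operand row 2
arr.rcs("NAND", 0, 1, 2)     # result stored in a third row in the same step
print(arr.cells[2])          # prints [1, 1, 1, 0]
```

In this model a single `rcs` call stands for what would otherwise be a read, a latch and a separate write, which is the saving the scheme is aiming at.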
III. 8+ TRANSISTOR DIFFERENTIAL READ SRAM
Recently, an 8+T Differential SRAM design was proposed
in [18] to overcome the single ended sensing of the conven-
tional 8T-SRAM cell. 8+T Differential SRAM has decoupled
read-write paths with an added advantage of a differential read
mechanism through the read bit-lines RBL/RBLB (see Fig.
9(a)), as opposed to the single-ended read mechanism of 8T-
SRAM. The ninth transistor, whose gate is connected to RWL
in Fig. 9(a) is shared by all the bit cells in the same row. The
differential read operation is very similar to the read operation
of a standard 6T-SRAM. The usual memory read operation is
performed by pre-charging the bit-lines (RBL and RBLB) to
VDD, and subsequently enabling the word-line corresponding
to the row to be read out. Depending on whether the bit-cell
stores ‘1’ or ‘0’, RBL or RBLB discharges. The difference
in voltages on RBL and RBLB is sensed using a differential
sense amplifier.
Fig. 9. a) Circuit schematic of an 8+T Differential SRAM bit-cell [18]. b) Timing diagram used for in-memory computations on the 8+T Differential SRAM
(RWL1, RWL2 and the resulting RBL/RBLB voltages for the cases ‘00’, ‘11’ and ‘01/10’). c) Circuit schematic of the proposed asymmetric differential sense amplifier.
Fig. 10. Monte-Carlo simulations in SPICE for SA outputs for all possible input cases ‘00,01,10,11’, in presence of 30mV sigma variations in threshold
voltage. (Panels: CASE ‘00’, CASE ‘11’, CASE ‘01/10’; traces: OUT and OUTB of SANAND and SANOR; axes: Voltage (V) versus Time (s), 0 to 2 ns.)
Let us consider words ‘A’ and ‘B’ stored in two rows of
the memory array. Note that we can simultaneously enable
the two corresponding RWLs without worrying about read-
disturbs, since the bit-cell has decoupled read-write paths. The
RBL/RBLB are pre-charged to VDD. For the case ‘AB’=‘00’
(‘11’), RBL (RBLB) discharges to 0V, but RBLB (RBL)
remains in the precharged state. However, for cases ‘10’ and
‘01’, both RBL and RBLB discharge simultaneously. The four
cases are summarized in Fig. 9(b).
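The four cases can be written down as a small lookup. The sketch below follows the polarity stated in the paragraph above (a stored ‘0’ discharges RBL, a stored ‘1’ discharges RBLB); the function name `read_bitlines` is hypothetical, and the model only records which line discharges, not how far or how fast.

```python
from typing import Tuple


def read_bitlines(a: int, b: int) -> Tuple[bool, bool]:
    """Return (rbl_discharges, rblb_discharges) when both RWLs are enabled.

    Polarity follows the paragraph above: '00' discharges RBL only,
    '11' discharges RBLB only, and '01'/'10' discharge both lines.
    """
    bits = (a, b)
    rbl_discharges = 0 in bits    # at least one cell stores '0'
    rblb_discharges = 1 in bits   # at least one cell stores '1'
    return rbl_discharges, rblb_discharges


for a in (0, 1):
    for b in (0, 1):
        print(a, b, read_bitlines(a, b))
```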
Now, in order to sense bit-wise NAND and NOR operation
of ‘A’ and ‘B’, we propose an asymmetric SA (see Fig. 9(c)),
by skewing one of the transistors. Skewing the transistors
can be done in multiple ways, for example, transistor sizing,
threshold voltage, body bias etc. In Fig. 9(c), if the transistor
MBL is deliberately sized bigger compared to MBLB , its
current carrying capability increases. For cases ‘01’ and ‘10’,
both RBL and RBLB discharge simultaneously. However,
since the current carrying capability of MBL is more than
MBLB , SAout node discharges faster, and the cross-coupled
inverter pair of the SA stabilizes with SAout=‘0’. For the case
‘11’, RBL starts to discharge, while RBLB is at VDD. The
SA amplifies the voltage difference between RBL and RBLB,
resulting in SAout=‘1’. Whereas for the case ‘00’, RBLB
starts to discharge, while RBL is at VDD, giving SAout=‘0’.
Thus it can be observed that SAout realizes the AND function
(and SAoutb the NAND function). Similarly, by sizing
MBLB bigger than MBL, OR/NOR gates can be obtained
at the SA outputs. Finally, two SAs in parallel (one with
MBL up-sized, SANAND, and one with MBLB up-sized,
SANOR) enable bit-wise AND/NAND and OR/NOR logic
gates. Moreover, an XOR gate can be obtained by combining
the AND/NAND and OR/NOR outputs using an additional
NOR gate. Thus, in a single memory read cycle, we obtain a
class of Boolean bitwise operations, read directly from the
asymmetrically sized SA outputs. SPICE transient simulations
with 30mV sigma variations in the threshold voltage
for all input data cases are summarized in Fig. 10. Monte-Carlo
analysis across process corners with VT variations and
variations in supply voltage and temperature for the 8+T
Differential SRAM is shown in Fig. 11. Vdiff is defined as
the absolute difference between the RBL and RBLB voltages
at the instant when the sense amplifier is enabled. For the case
‘01/10’, Vdiff should be close to 0V to allow the asymmetry
in the SA to determine the output. For the case
‘11/00’, in contrast, Vdiff should be large enough that the output is
driven by the voltage difference between RBL and
RBLB, and not by the asymmetry in the SA. The difference
in RBL and RBLB voltages for the various cases is shown in
Fig. 11. Global VT variations across process corners
affect the discharge on both RBL and RBLB in the same
manner. Moreover, since we use a differential voltage sense
amplifier, these global variations cancel, thereby making this
scheme robust to process corners. Note that even in the worst case
the voltage difference between the ‘01/10’ and ‘11/00’ cases is
sufficient for proper differential SA operation. Interestingly,
this difference also increases with increasing VDD; hence
voltage boosting can be easily employed to increase the design
margin.
Fig. 11. Monte-Carlo simulations across process corners under VT and temperature and supply-voltage variations for the 8+T SRAM configuration for the
cases ‘01/10’ and ‘11/00’. Vdiff is defined as the absolute difference between the RBL and RBLB voltages at the instant when the sense amplifier is enabled.
The distribution of Vdiff is plotted under 30mV sigma threshold voltage variations for two different temperatures and ±10% variation in nominal VDD.
(Panels: TT and SS corners; T=298K and T=353K; (a) to (c) different supply voltages, with (b) at nominal VDD; x-axis: Vdiff (V).)
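The decision rule of the two skewed sense amplifiers can be summarized in a short behavioral sketch. It is deliberately abstract: it does not model bit-line voltages or Vdiff, only the outcome described above (equal bits are resolved by the differential input, unequal bits by the skew). The function names are hypothetical, and forming XOR as NOR(AND, NOR) is one consistent reading of "combining the AND/NAND and OR/NOR outputs using an additional NOR gate", not necessarily the authors' exact wiring.

```python
from typing import Dict


def sense(a: int, b: int, upsized: str) -> int:
    """Return SAout for the stored bits a and b.

    upsized="MBL"  models SANAND: mixed cases resolve to 0, so SAout = AND.
    upsized="MBLB" models SANOR:  mixed cases resolve to 1, so SAout = OR.
    """
    if upsized not in ("MBL", "MBLB"):
        raise ValueError("upsized must be 'MBL' or 'MBLB'")
    if a == b:
        # Cases '00' and '11': large Vdiff, the differential input decides.
        return a
    # Cases '01' and '10': Vdiff close to 0 V, the skew decides.
    return 0 if upsized == "MBL" else 1


def bitwise_outputs(a: int, b: int) -> Dict[str, int]:
    """Outputs available in one read cycle from the two parallel SAs."""
    and_out = sense(a, b, "MBL")
    or_out = sense(a, b, "MBLB")
    nand_out = 1 - and_out
    nor_out = 1 - or_out
    xor_out = 1 - (and_out | nor_out)  # additional NOR gate on the AND and NOR outputs
    return {"AND": and_out, "NAND": nand_out, "OR": or_out, "NOR": nor_out, "XOR": xor_out}


for a in (0, 1):
    for b in (0, 1):
        print(a, b, bitwise_outputs(a, b))
```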
It is worthwhile to note that the two SAs can be used for
regular memory read operations as well. The two cases of a
typical memory read operation are similar to the cases ‘11’
and ‘00’ in Fig. 9(b). Both SAs will generate the same output
corresponding to the bit stored in the cell. Moreover, the output
of the XOR gate inherently acts as an in-memory check for
possible read failures. The RCS scheme described in Section
II can also be applied to 8+T Differential SRAMs due to
decoupled read-write paths. Along with the two RWLs from
where the input operands are read, a WWL can also be enabled
which would eventually store the Boolean output within the
memory array in the same cycle.
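The read-check idea can be sketched as follows, assuming the XOR output is formed as NOR(AND, NOR) from the two SA outputs; that wiring and the function name `read_with_check` are assumptions of this sketch. In a healthy single-row read both SAs agree and the XOR output is 0. If the bit-line differential is too small, each SA would be expected to fall back to its skew (SANAND towards 0, SANOR towards 1), which raises the XOR output.

```python
from typing import Tuple


def read_with_check(sanand_out: int, sanor_out: int) -> Tuple[int, bool]:
    """Return (bit, failure_flag) for a regular read sensed by both SAs.

    sanand_out is SAout of the SA with MBL up-sized, and sanor_out is SAout
    of the SA with MBLB up-sized. The flag is the XOR output, modeled here
    as NOR(AND, NOR).
    """
    nor_out = 1 - sanor_out
    xor_out = 1 - (sanand_out | nor_out)
    return sanand_out, bool(xor_out)


print(read_with_check(1, 1))  # both SAs agree on '1', no flag
print(read_with_check(0, 0))  # both SAs agree on '0', no flag
print(read_with_check(0, 1))  # each SA followed its skew, flag raised
```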
Using 8+T cells is advantageous over the conventional 8T
cells for in-memory bit-wise logic operations because of better
robustness due to the differential read operation, in contrast to
the single ended read in 8T-SRAM cells.
IV. DISCUSSIONS
In Sections II and III, we have seen various ways of
implementing basic Boolean operations using the 8T and the
8+T bit-cells. Table I presents the average energy per bit
and the latency for each of the proposed in-memory compute
techniques. The 8T cell provides separate read and write ports,
thereby alleviating any possible read-disturb failure concerns.
In addition, it also supports the proposed RCS scheme. However,
the 8T cell suffers from robustness concerns due to its
single-ended sensing.
Using the 8+T cell, on the other hand, allows differential
sensing like the conventional 6T cell, while also allowing
separate read and write ports. It thus combines the benefits
of both the standard 6T and the 8T cells. The thin cell layouts
of the standard 8T bit-cell and the 8+T bit-cell are shown in Fig.
12. The standard 8T cell requires five diffusion tracks, while the
8+T cell requires six. However, the left- and right-most diffusion
tracks are shared with adjacent bit-cells, thereby achieving
a similar area per bit as the standard 8T cell [18].
Note, since the differential read scheme for the 8+T cell
is functionally similar to the conventional 6T cell, NOR
and NAND gates (along with the XOR gate) can also be
implemented in the 6T based memory array. However, due
to the shared read-write paths of the 6T cell, the word-lines
cannot be simultaneously activated and require sequential
activation. In addition, 6T cells are prone to read disturbs and
hence would exhibit much lower robustness than the proposed
8T and 8+T cells. Nevertheless, in the Appendix we have
included a description of how 6T cells can be used to
accomplish NOR, NAND and XOR operations. We also show
that an in-memory ‘copy’ operation can be easily achieved
in the 6T cell due to its shared read/write paths.
Finally, it is worth noting that although we have proposed
multiple in-memory techniques in this manuscript, the choice
of the bit-cell and the associated Boolean function would heav-
ily depend on the target application. Thus, Table I summarizes
the pros and cons of each proposal. The aim of the present
manuscript is to demonstrate various possible techniques that
can be utilized in conventional CMOS based memories for
accomplishing in-memory Boolean computations. Since the
present proposal augments the functionality of the memory
arrays without changing the basic circuitry, it has wide
applications in diverse computing systems, a few of which are:
1. A standard von-Neumann general-purpose processor with
SRAM replaced by X-SRAM. 2. A modified GPU, wherein
the SRAM based register files are replaced by X-SRAM arrays.
3. A machine learning or artificial intelligence processor, for
example, a binary neural network accelerator. As an example,
in the next section we present an encryption accelerator
using the proposed X-SRAM.
TABLE I
SUMMARY OF PROPOSALS DESCRIBED IN THE MANUSCRIPT. THE TABLE SHOWS AVERAGE ENERGY CONSUMPTION PER-BIT AND LATENCY FOR THE
IN-MEMORY OPERATIONS ON VARIOUS BIT-CELLS. PROS AND CONS OF EACH PROPOSAL ARE ALSO LISTED.
Bit-Cell: 8T-SRAM
  Operations: NAND, NOR, XOR, RCS
  Latency (ns): 3
  Avg. Energy/Bit (fJ): 17.25
  Pros:
  • NOR operation is very robust and can be seamlessly extended to more than two operands.
  • Uses simple skewed inverter based sensing.
  Cons:
  • Requires timing control for NAND operation.
  • Low sense margin for NAND operation.
Bit-Cell: 8T-SRAM (Voltage Divider)
  Operations: IMP, XOR, RCS
  Latency (ns): 1
  Avg. Energy/Bit (fJ): 11.22
  Pros:
  • Better robustness towards global variations including voltage and temperature, since global variation affects both branches of the voltage divider in similar fashion.
  • Static design, since the critical functionality is based on a stable voltage dictated by the voltage-divider effect.
  • Possible 2 bit read operation.
  Cons:
  • Requires voltage boosting for proper functionality.
  • Vpre for logic functionality is different from VDD.
Bit-Cell: 8+T-SRAM (Differential Cell)
  Operations: NAND, NOR, XOR, RCS
  Latency (ns): 1
  Avg. Energy/Bit (fJ): 29.67
  Pros:
  • Differential operation and hence improved robustness with respect to global variations including temperature and voltage.
  • The two sense amplifiers can also be used as a sanity check for read operation.
  Cons:
  • Requires two skewed sense amplifiers.
Fig. 12. a) Thin cell layout for the standard 8T-SRAM bit-cell shown in Fig. 3(a). b) Thin cell layout for the 8+T Differential SRAM bit-cell [18] illustrated
in Fig. 9(a). Left- and right-most diffusion tracks are shared with adjacent bit-cells. The ninth transistor in Fig. 9(a) is common for the row and is connected
at the periphery to the node ‘VX’.
V. X-SRAM BASED NON-STANDARD VON-NEUMANN
COMPUTING FOR AES ENCRYPTION
In this section, we evaluate the system-level implications
of using X-SRAMs instead of conventional SRAMs as the
memory blocks in a typical von-Neumann based architecture,
taking the Advanced Encryption Standard (AES) as a case study.
X-SRAMs enable extra functionalities within the memory
block, as described in previous sections, through massively
parallel vector Boolean operations. By utilizing such in-memory
computations, we expect a reduction in energy-expensive
data movements over the bus between the processor and
the memory blocks.
Fig. 13. (a) System-level implementation of a typical von-Neumann architecture with X-SRAM as the memory block. The processor, data-memory and the
instruction-memory blocks are connected via a shared system bus. (b) Illustration of custom in-memory instructions added to the instruction set of the Nios-II
processor. Substituting in-memory instructions reduces unnecessary read-writes into the memory.
(Blocks in (a): NIOS-II processor with in-memory custom instructions, X-SRAM in-memory compute block, Instruction Memory, Avalon Memory Mapped Bus.)
(Instruction sequences in (b):
Conventional instructions: Load REG1 [op1]; Load REG2 [op2]; XOR REG1 REG2 REG3; Store REG3 [dest]
Custom in-memory instruction: RCS-XOR [op1] [op2] [dest]
Conventional instructions: Load REG1 [source]; Store REG1 [dest]
Custom in-memory instruction: RCS-Copy [source] [dest])
 
Fig. 14. Normalized number of memory accesses for various AES encryption and decryption modes and two different key-sizes, with and without using
X-SRAM custom in-memory instructions. The total memory transactions are split into memory read instructions, memory write instructions and custom
in-memory instructions.
(Panels: 128b key and 256b key; x-axis: AES mode (CBCen, CBCde, CTRen, CTRde, ECBen, ECBde); y-axis: normalized memory accesses, 0 to 1; legend: Read, Write, In-Mem, for SRAM and X-SRAM.)
A. Simulation Methodology
A typical von-Neumann system implementation is shown
in Fig. 13(a). It consists of a processor, data-memory and an
instruction-memory, connected by a system bus. For our sim-
ulations, we use Intel’s programmable Nios-II processor [21],
and extend the associated instruction set (ISA) to incorporate
new custom-instructions enabled by our proposed X-SRAM
(see Fig. 13(b)). The system bus follows the Avalon memory-
mapped protocol, with enhanced architecture to enable passing
three addresses at a time. Note that this is not a huge
overhead since in-memory instructions do not pass the data
operands, and thus the data-channel along with the address-
channel can be used to pass three memory addresses over the
bus. This methodology is similar to the work presented in
[13]. A complete RTL model of the proposed X-SRAM was
developed using the circuit parameters summarized in Table
I, incorporating the in-memory computation capabilities. We
perform cycle-accurate RTL simulations to run the benchmark
AES application [22] on the architecture described above.
AES encryption algorithm heavily relies on substitution-and-
permutation operations that utilize several bit-wise Boolean
operations such as XORs, which makes X-SRAM custom
instructions suitable for this application. We identified pieces
of code constituting 92% of the entire runtime which can be
mapped using the custom instructions RCS-XOR and RCS-Copy
(shown in Fig. 13(b)), along with usual memory read-write
instructions. The software was modified by replacing
repetitive Boolean operations with our custom instruction
macros.
Fig. 15. (a) Realistic scenario for a typical system with multiple masters over
a shared bus. An arbiter keeps track of the memory traffic and controls which
master has access to the bus at a given point in time. (b) Data-parallelism in
memory arrays. X-SRAM performs bit-wise operations throughout the row,
where each row may store multiple data words. Thus multiple computations
occur in parallel.
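The effect of the substitution on memory traffic can be illustrated by counting memory transactions for the instruction sequences listed in Fig. 13(b). The sketch below is not the authors' toolchain: the tuple encoding of instructions and the helper names are hypothetical, and each in-memory instruction is counted as a single memory transaction, in line with how Fig. 14 splits accesses into reads, writes and in-memory instructions.

```python
from collections import Counter
from typing import List, Tuple

Instr = Tuple[str, ...]

# Opcodes that touch the memory, and how they are classified.
MEMORY_ACCESS_KIND = {
    "Load": "read",
    "Store": "write",
    "RCS-XOR": "in-mem",
    "RCS-Copy": "in-mem",
}


def conventional_xor(op1: str, op2: str, dest: str) -> List[Instr]:
    return [("Load", "REG1", op1), ("Load", "REG2", op2),
            ("XOR", "REG1", "REG2", "REG3"), ("Store", "REG3", dest)]


def in_memory_xor(op1: str, op2: str, dest: str) -> List[Instr]:
    return [("RCS-XOR", op1, op2, dest)]


def conventional_copy(source: str, dest: str) -> List[Instr]:
    return [("Load", "REG1", source), ("Store", "REG1", dest)]


def in_memory_copy(source: str, dest: str) -> List[Instr]:
    return [("RCS-Copy", source, dest)]


def memory_accesses(program: List[Instr]) -> Counter:
    """Count memory transactions by kind; register-only instructions are ignored."""
    return Counter(MEMORY_ACCESS_KIND[ins[0]] for ins in program
                   if ins[0] in MEMORY_ACCESS_KIND)


print(memory_accesses(conventional_xor("op1", "op2", "dest")))  # 2 reads, 1 write
print(memory_accesses(in_memory_xor("op1", "op2", "dest")))     # 1 in-memory access
```

Under this counting, an XOR goes from three memory transactions to one and a copy from two to one, before any row-level parallelism is taken into account.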
B. Results and Discussion
We evaluate three modes of AES encryption and decryption,
namely CBC, CTR and ECB [23], for two different key sizes:
128 bits and 256 bits. We plot the total number of memory
accesses (memory read instructions, memory write instructions
and custom in-memory instructions) required for each mode in
Fig. 14. The results are normalized to the corresponding memory
accesses required with a conventional SRAM memory block
(no in-memory custom instructions). The plots show that the
memory accesses can be reduced by up to 74.7% and 74.6%
in ECB mode for the 128b and 256b keys, respectively, by using
X-SRAM in-memory instructions. The implications of this
are threefold. 1) Since memory transactions are expensive,
we directly save ∼75% of the memory access energy
by reducing the number of accesses to the memory. The
total energy consumption in the peripheral circuitry is
thereby also reduced. 2) In a realistic scenario, shown in Fig. 15(a),
multiple masters access the shared system bus, thereby causing
large arbitration delays. Reducing the total number of memory
accesses allows the system bus to cater to other masters,
thereby reducing arbitration wait times over the shared bus and
hence improving overall system performance. 3) The decrease
in the data transfer volume between the processor and memory
alleviates the problems associated with limited bus bandwidth
while providing enough memory bandwidth for parallelism.
Fig. 15(b) shows how data can be mapped to the X-SRAM
to exploit data-parallelism. Since X-SRAM can
operate on two physical rows at a time, Data1-4, Data2-5 and
Data3-6 can be computed in parallel with a single in-memory
instruction, thereby improving throughput.
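The row-level parallelism can be illustrated by packing several data words into one row and applying a single bit-wise operation to two rows. This is a minimal sketch: the 8-bit word width, the example values, and the helpers `pack_row` and `unpack_row` are hypothetical, and a Python integer stands in for a physical row.

```python
from typing import List


def pack_row(words: List[int], width: int) -> int:
    """Pack words side by side into one row, first word in the most significant bits."""
    row = 0
    for word in words:
        if not 0 <= word < (1 << width):
            raise ValueError("word does not fit in the given width")
        row = (row << width) | word
    return row


def unpack_row(row: int, width: int, count: int) -> List[int]:
    """Split a row back into `count` words of `width` bits each."""
    mask = (1 << width) - 1
    return [(row >> (width * (count - 1 - i))) & mask for i in range(count)]


WIDTH = 8                                          # assumed word width
row1 = pack_row([0x3C, 0xA5, 0x0F], WIDTH)         # Data1, Data2, Data3
row2 = pack_row([0xFF, 0x5A, 0xF0], WIDTH)         # Data4, Data5, Data6

row3 = row1 ^ row2                                 # one row-wide in-memory XOR
print([hex(w) for w in unpack_row(row3, WIDTH, 3)])
```

The single row-wide XOR yields Data1 XOR Data4, Data2 XOR Data5 and Data3 XOR Data6 at once, which is the mapping shown in Fig. 15(b).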
VI. CONCLUSION
Von-Neumann machines have fueled the computing era for
the past few decades. However, the recent emphasis on data-intensive
applications like artificial intelligence, image recognition,
cryptography etc. requires novel computing paradigms
in order to fulfill the energy and throughput requirements.
‘In-memory’ computing has been proposed as a promising
approach that could sustain the throughput and energy
requirements of future computing platforms. In this paper,
we have proposed multiple techniques to enable in-memory
computing in standard CMOS bit-cells: the 8T cell and
the 8+T cell. We have shown that Boolean functions like
NAND, NOR, IMP and XOR can be obtained with minimal
changes in the peripheral circuits and the associated read
operation. Further, we have also proposed a ‘read-compute-store’
scheme by leveraging the decoupled read and write ports
of the 8T and 8+T cells, wherein the computed logic data
can be directly stored in the desired row of the memory array.
Our results are supported by rigorous Monte-Carlo simulations
performed using predictive transistor models. Moreover, taking
the AES encryption algorithm as an example, we demonstrate
that up to 75% of memory transactions can be avoided, thereby
allowing energy and performance improvements.
REFERENCES
[1] J. Bardeen and W. H. Brattain, “The transistor, a semi-conductor triode,”
Physical Review, vol. 74, no. 2, p. 230, 1948.
[2] J. Backus, “Can programming be liberated from the von neumann style?:
A functional style and its algebra of programs,” Commun. ACM, vol. 21,
no. 8, pp. 613–641, Aug. 1978.
[3] “A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-
rule 6t bit cell enabling logic-in-memory,” IEEE Journal of Solid-State
Circuits, vol. 51, no. 4, pp. 1009–1021, apr 2016.
[4] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and
R. Das, “Compute caches,” in 2017 IEEE International Symposium on
High Performance Computer Architecture (HPCA). IEEE, feb 2017.
[5] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz,
“An energy-efficient VLSI architecture for pattern recognition via deep
embedding of computation in SRAM,” in 2014 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, may 2014.
[6] M. Kang, E. P. Kim, M. sun Keel, and N. R. Shanbhag, “Energy-efficient
and high throughput sparse distributed memory architecture,” in 2015
IEEE International Symposium on Circuits and Systems (ISCAS). IEEE,
may 2015.
[7] W. M. Snelgrove, M. Stumm, D. Elliott, R. McKenzie, and C. Cojo-
caru, “Computational ram: Implementing processors in memory,” IEEE
Design & Test of Computers, vol. 16, pp. 32–41, 1999.
[8] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a
machine-learning classifier in a standard 6t SRAM array,” IEEE Journal
of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, apr 2017.
[9] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada,
S. Miyoshi, D. Blaauw, and D. Sylvester, “A 0.3v VDDmin 4+2t SRAM
for searching and in-memory computing using 55nm DDC technology,”
in 2017 Symposium on VLSI Circuits. IEEE, jun 2017.
[10] A. Jaiswal, I. Chakraborty, A. Agrawal, and K. Roy, “8t sram cell as a
multi-bit dot product engine for beyond von-neumann computing,” arXiv
preprint arXiv:1802.08601, 2018.
[11] H.-S. P. Wong and S. Salahuddin, “Memory leads the way to better
computing,” Nature Nanotechnology, vol. 10, no. 3, pp. 191–194, mar
2015.
[12] S. Shirinzadeh, M. Soeken, P.-E. Gaillardon, and R. Drechsler, “Fast
logic synthesis for RRAM-based in-memory computing using majority-
inverter graphs,” in Proceedings of the 2016 Design, Automation &
Test in Europe Conference & Exhibition (DATE). Research Publishing
Services, 2016.
[13] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory
with spin-transfer torque magnetic ram,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 26, no. 3, pp. 470–483,
March 2018.
[14] W. Kang, H. Wang, Z. Wang, Y. Zhang, and W. Zhao, “In-memory
processing paradigm for bitwise logic operations in stt-mram,” IEEE
Transactions on Magnetics, 2017.
[15] D. Lee, X. Fong, and K. Roy, “R-MRAM: A ROM-embedded STT
MRAM cache,” IEEE Electron Device Letters, vol. 34, no. 10, pp. 1256–
1258, oct 2013.
[16] A. Sebastian, T. Tuma, N. Papandreou, M. L. Gallo, L. Kull, T. Parnell,
and E. Eleftheriou, “Temporal correlation detection using computational
phase-change memory,” Nature Communications, vol. 8, no. 1, oct 2017.
[17] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo,” in
Proceedings of the 53rd Annual Design Automation Conference on -
DAC16. ACM Press, 2016.
[18] J. P. Kulkarni, A. Goel, P. Ndai, and K. Roy, “A read-disturb-free,
differential sensing 1r/1w port, 8t bitcell array,” IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, vol. 19, no. 9, pp. 1727–
1730, sep 2011.
[19] Predictive Technology Models. [Online] http://ptm.asu.edu/, 2016.
[20] M. Maymandi-Nejad and M. Sachdev, “A monotonic digitally controlled
delay element,” IEEE Journal of Solid-State Circuits, vol. 40, no. 11,
pp. 2212–2219, 2005.
[21] “Nios II processor overview,” in Embedded SoPC Design with Nios II
Processor and VHDL Examples. John Wiley & Sons, Inc., sep 2011,
pp. 179–188.
[22] Advanced Encryption Standard. [Online] https://github.com/kokke/tiny-AES-c/, 2016.
[23] M. Dworkin, “Recommendation for block cipher modes of operation.
methods and techniques,” National Inst of Standards and Technology
Gaithersburg MD Computer Security Div, Tech. Rep., 2001.
Amogh Agrawal received his B.Tech degree in
Electrical Engineering from Indian Institute of Tech-
nology (Ropar), India in 2016. He was a research
intern at University of Ulm, Germany in 2015, under
the DAAD (German Academic Exchange Service)
fellowship. He joined the Nanoelectronics Research
Lab in 2016 and is currently pursuing Ph.D. degree
at Purdue University under the guidance of Prof.
Kaushik Roy. His primary research interests include
enabling in-memory computations for neuromorphic
systems using CMOS and beyond-CMOS memories.
He is also looking into modeling and simulation of emerging spintronic
devices for applications in neuromorphic computing. He was the recipient
of Directors Gold Medal for his all-round performance, and Institute Silver
Medal for his academic achievements at IIT Ropar. He is also the recipient
of the Andrews Fellowship from Purdue University since 2016.
Akhilesh Jaiswal received the B.Tech degree from
Shri Guru Gobind Singhji Institute of Engineering
and Technology, Nanded, India, in 2011 and the
M.S. degree in electrical engineering from the Uni-
versity of Minnesota, Minneapolis, MN, USA, in
2014. He joined Nano-electronics Research Lab at
Purdue University in the Fall of 2014, where he is
currently pursuing doctoral degree. He was an intern
at Globalfoundries Lab, Malta, USA during sum-
mer 2017. His research interests include in-memory
CMOS and beyond-CMOS computing, exploration
and modeling of spin devices for on-chip memory/logic/neuromorphic
applications.
Chankyu Lee received B.S. in Electrical and Elec-
tronics Engineering from Sungkyunkwan University,
Korea, in 2015. Currently, he is pursuing PhD degree
in Electrical and Computer Engineering at Purdue
University, West Lafayette, IN, USA. His primary
research lies in the area of brain-inspired (neuro-
morphic) computing and event-driven deep learning,
low power and high performance VLSI design for
machine learning hardware.
Kaushik Roy received the BTech degree in elec-
tronics and electrical communications engineering
from the Indian Institute of Technology, Kharagpur,
India, and the PhD degree from the Department
of Electrical and Computer Engineering, University
of Illinois at Urbana-Champaign in 1990. He was
with the Semiconductor Process and Design Cen-
ter of Texas Instruments, Dallas, where he worked
on FPGA architecture development and low-power
circuit design. He joined the electrical and com-
puter engineering faculty at Purdue University, West
Lafayette, IN, in 1993, where he is currently Edward G. Tiedemann Jr. Distin-
guished Professor. His research interests include neuromorphic and cognitive
computing, spintronics, device-circuit co-design for nano-scale Silicon and
non-Silicon technologies, low-power electronics for portable computing and
wireless communications, and new computing models enabled by emerging
technologies. He has published more than 600 papers in refereed journals and
conferences, holds 15 patents, graduated 70+ PhD students, and is coauthor
of two books on Low Power CMOS VLSI Design (Wiley & McGraw Hill).
He received the US National Science Foundation Career Development Award
in 1995, IBM faculty partnership award, ATT/Lucent Foundation award, 2005
SRC Technical Excellence Award, SRC Inventors Award, Purdue College
of Engineering Research Excellence Award, Humboldt Research Award in
2010, 2010 IEEE Circuits and Systems Society Technical Achievement
Award, Distinguished Alumnus Award from Indian Institute of Technology,
Kharagpur, Fulbright Nehru Distinguished Chair, and Best Paper Awards at
1997 International Test Conference, IEEE 2000 International Symposium on
Quality of IC Design, 2003 IEEE Latin American Test Workshop, 2003
IEEE Nano, 2004 IEEE International Conference on Computer Design, 2006
IEEE/ACM International Symposium on Low Power Electronics & Design,
and 2005 IEEE Circuits and System Society Outstanding Young Author Award
(Chris Kim), 2006 IEEE Transactions on VLSI Systems Best Paper Award,
2012 ACM/IEEE International Symposium on Low Power Electronics and
Design Best Paper Award, 2013 IEEE Transactions on VLSI Best Paper
Award. He was a Purdue University Faculty scholar (1998-2003). He was
a Research Visionary board member of Motorola Labs (2002) and held the
M.K. Gandhi Distinguished Visiting faculty at Indian Institute of Technology
(Bombay). He has been on the editorial board of IEEE Design and Test, IEEE
Transactions on Circuits and Systems, IEEE Transactions on VLSI Systems,
and IEEE Transactions on Electron Devices. He was the guest editor for
Special Issue on Low-Power VLSI in the IEEE Design and Test (1994) and
IEEE Transactions on VLSI Systems (June 2000), IEE Proceedings - Computers
and Digital Techniques (July 2002), and IEEE Journal on Emerging and
Selected Topics in Circuits and Systems (2011). He is a fellow of the IEEE.
Fig. 16. Schematic of a 6T-SRAM array along with two asymmetric SAs in
parallel for reading bitwise NAND/NOR/XOR operation. (Labels: cells A0, A1 on WL1 and B0, B1 on WL2, access transistors AXL/AXR, bit-lines BL/BLB, sense amplifiers SANOR and SANAND, XOR outputs.)
APPENDIX
A. 6-Transistor SRAM: bit-wise NOR/NAND/XOR Operation
The most popular and widely used SRAM design is the
standard 6T bit-cell, shown in Fig. 16. However, 6T bit-cells
are inherently design constrained due to the shared read and
write paths. Nevertheless, by proper design choices, 6T cells
can still be used to perform in-memory computations although
at reduced robustness due to the conflict between read and
write operations in a standard 6T cell. The usual memory
read operation in a 6T cell is performed by pre-charging
the bit-lines (BL and BLB) to VDD, and enabling the word-
line corresponding to the row to be read out. Depending on
whether the bit-cell stores ‘1’ or ‘0’, BL or BLB discharges,
as illustrated in Fig. 17(a). The difference in voltages on BL
and BLB is sensed using a differential sense amplifier.
Consider a typical memory array shown in Fig. 16, with
two words ‘A’ and ‘B’ stored in rows 1 and 2, respectively.
Simultaneously enabling WL1 and WL2 introduces read-
disturbs due to possible short-circuit paths. Hence, we employ
a sequentially pulsed WL technique as a workaround, similar
to the proposal in [5]. The address decoder sequentially turns
WL1 and WL2 ON, corresponding to the rows storing ‘A’ and
‘B’, respectively, as illustrated in Fig. 17(b).
The WL pulse duration is chosen such that with the application
of one WL pulse, BL/BLB drops to ∼VDD/2. If bits ‘A’
and ‘B’ both store ‘0’ (‘1’), BL (BLB) will finally discharge
to 0V after the two consecutive pulses, whereas BLB (BL)
remains at VDD. On the other hand, for cases where ‘AB’ =
‘10’ and ‘01’, the final voltages at BL and BLB would be
approximately the same (∼VDD/2). Thus, for the cases ‘01’ and
‘10’ both BL and BLB would have a voltage of ∼VDD/2, while
for ‘00’ BL would be lower than BLB by ∼VDD and for the
case of ‘11’ BLB would be lower than BL by ∼VDD.
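The bit-line voltages after the two sequential pulses can be reproduced with an idealized model in which each WL pulse lowers the discharging bit-line by exactly VDD/2. This is only a sketch: the linear, equal-step discharge, the default `vdd=1.0` and the function name `pulsed_bitlines` are assumptions made for illustration, and a real bit-line would not discharge in such clean steps.

```python
from typing import Tuple


def pulsed_bitlines(a: int, b: int, vdd: float = 1.0) -> Tuple[float, float]:
    """Return idealized (V_BL, V_BLB) after pulsing WL1 (bit a) and then WL2 (bit b).

    Both bit-lines start pre-charged to vdd. A stored '0' lowers BL and a
    stored '1' lowers BLB, by vdd/2 per pulse (idealized assumption).
    """
    v_bl = v_blb = vdd
    step = vdd / 2
    for bit in (a, b):
        if bit == 0:
            v_bl = max(0.0, v_bl - step)
        else:
            v_blb = max(0.0, v_blb - step)
    return v_bl, v_blb


for a in (0, 1):
    for b in (0, 1):
        print(a, b, pulsed_bitlines(a, b))
```

For ‘00’ and ‘11’ the two lines end up a full VDD apart, while for ‘01’ and ‘10’ they end at the same voltage, which is the condition under which the skew of the asymmetric SAs decides the output.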
 
Fig. 17. a) Timing diagram for a typical memory read operation. The BL/BLB
is pre-charged to Vdd, and the final voltage is shown for the two cases when
the bit-cell stores ‘0’ or ‘1’. b) Timing diagram for the proposed sequentially
pulsed WL activation, and the resulting BL/BLB voltages for the four cases
when bit-cells store ‘00,01,10,11’. c) Truth table for NAND/NOR/XOR
operation:
A  B  NAND  NOR  XOR
0  0  1     1    0
0  1  1     0    1
1  0  1     0    1
1  1  0     0    0
The four cases are summarized in Fig. 17(b). Using two
asymmetric SAs connected in parallel (in a fashion similar to
that proposed in Section III), we can obtain NAND/AND, NOR/OR
and XOR bit-wise operations on ‘A’ and ‘B’. The average energy
consumption per bit and the latency of in-memory operations
in 6T-SRAM cells are 29.3fJ and 3ns, respectively. Note that
although the sensing operation for in-memory computing with
6T cells seems similar to that of the 8+T cell, there are two
key differences. Firstly, two word-lines cannot be activated
simultaneously in 6T cells; therefore, the WL pulses have to
be properly timed and the pulse duration has to be
appropriately selected to achieve the desired functionality.
Secondly, unlike in the 6T cells, the read bit-lines of the
8+T cells can have a much larger voltage swing without any
concern of read disturb failures, thereby relaxing the
constraints on the sense amplifier.
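As a rough illustration of how two skewed sense amplifiers can separate the three bit-line outcomes, the sketch below models each asymmetric SA as an ideal comparator with a signed input offset. This is a simplification assumed here, not the transistor-level design of Section III: the function name `sense_outputs`, the `offset` value, and the idealized VDD/2 discharge per pulse are hypothetical. XOR is formed from the Boolean identity XOR = NAND AND (NOT NOR).

```python
from typing import Tuple


def sense_outputs(a: int, b: int, vdd: float = 1.0,
                  offset: float = 0.25) -> Tuple[int, int, int]:
    """Return (NAND, NOR, XOR) from idealized bit-line voltages.

    Assumptions (illustrative only):
      * each stored 0 discharges BL by VDD/2, each stored 1 discharges BLB
        by VDD/2, starting from a VDD pre-charge;
      * each asymmetric SA is an ideal comparator with a signed offset, so
        that the BL == BLB case resolves in opposite directions in the two SAs.
    """
    bl = vdd - (vdd / 2) * ((a == 0) + (b == 0))
    blb = vdd - (vdd / 2) * ((a == 1) + (b == 1))
    diff = blb - bl  # +VDD for '00', 0 for '01'/'10', -VDD for '11'

    nand = int(diff > -offset)  # low only when BLB is far below BL ('11')
    nor = int(diff > offset)    # high only when BL is far below BLB ('00')
    xor = nand & (1 - nor)      # XOR = NAND AND (NOT NOR)
    return nand, nor, xor


if __name__ == "__main__":
    print("A B NAND NOR XOR")
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, *sense_outputs(a, b))
```

Any offset strictly between 0 and VDD gives the same logical result in this idealized model; in a real design the usable offset is bounded by device variations and the actual bit-line swing.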
A Monte-Carlo simulation with a 30mV sigma variation in the
threshold voltage was performed to demonstrate the
functionality and robustness of the proposal. Fig. 18 shows
the outputs of the asymmetric SAs, SANAND and SANOR,
for the four possible input cases ‘00,01,10,11’ in the presence
of variations.
B. 6T SRAM: Copy Operation
In this section, we describe a method of implementing
‘copy’ functionality within the 6T bit-cell. To copy data from
one memory location to another, a typical instruction sequence
performed by the processor would be to do a memory read
from the source location, followed by a memory write to the
destination. Thus, two memory transactions are performed. We
exploit the coupled read-write paths of the 6T cell to perform
a data copy operation from one row to another, since the same
set of bit-lines BL/BLB are used to read from and write into
the cell.
[Fig. 18 panels: CASE ‘00’, CASE ‘11’ and CASE ‘01/10’. Each panel plots the OUT and OUTB outputs of SANAND and SANOR as Voltage (V), 0.0 to 1.5, versus Time (s), 0 to 4n.]
Fig. 18. Monte-Carlo simulations in SPICE of the SA outputs for all possible input cases − ‘00,01,10,11’, in presence of 30mV sigma variations in threshold
voltage.
 
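[Fig. 19 panels: (a) Cell 1 and Cell 2 with storage nodes Q1/QB1 and Q2/QB2, access transistors AXL/AXR, word-lines WL1/WL2 and shared bit-lines BL/BLB. (b) Waveforms of WL1/WL2, BL/BLB, Q1/QB1 and Q2/QB2. (c) ‘Copy Cell1 to Cell2’: Q1/QB1 (Cell1) and Q2/QB2 (Cell2) plotted as Voltage (V), 0.0 to 1.5, versus Time (s), 0 to 4n.]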
Fig. 19. a) Schematic of 6T-SRAM bit-cells − Cell 1 and Cell 2. b) Timing diagram for performing a copy operation to copy data from Cell 1 to Cell 2. c)
Monte-Carlo simulations in SPICE showing the final state of the Cell 2.
Let us consider bit-cells 1 and 2, connected to WL1 and
WL2, respectively (see Fig. 19(a)). To copy data from cell 1
to cell 2, we perform two steps illustrated in Fig. 19(b). Let us
assume cell 1 stores a ‘1’, while cell 2 stores ‘0’. The bit-lines
BL/BLB are pre-charged to VDD, as usual. In step 1, WL1 is
enabled, thereby turning the access transistors of cell 1 ON.
Since node Q1 is connected to VDD and QB1 is connected
to 0V (cell 1 stores ‘1’), BL remains at VDD, while BLB
discharges to 0V. The pulse width is long enough for BLB
to discharge fully to 0V. In step 2, WL2 is enabled, turning
access transistors of cell 2 ON, with BL at VDD and BLB
at 0V. Since cell 2 stores a ‘0’, charge flows from BL to Q2,
and from QB2 to BLB, thereby flipping the state of cell 2,
such that cell 2 now stores a ‘1’. If cell 2 initially stored a
‘1’, nothing happens in step 2, and the state remains the same.
Thus, we have implemented a data copy from cell 1 to cell 2,
in a single memory transaction.
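The two-step sequence above can be summarized with a logic-level sketch. It assumes that the bit-lines discharge fully in step 1 and that the write in step 2 always succeeds; the function name `copy_cell` and the list-based representation of a bit-line column are hypothetical and introduced only for illustration.

```python
from typing import List


def copy_cell(cells: List[int], src: int, dst: int) -> List[int]:
    """Copy the bit at row `src` to row `dst` over shared bit-lines.

    `cells` holds one stored bit (the value of node Q) per row of a single
    column. Logic-level model only: full discharge in step 1 and a
    successful write in step 2 are assumed.
    """
    # Pre-charge: both bit-lines start at logic high (VDD).
    bl, blb = 1, 1

    # Step 1: enable the source word-line. The side of the cell storing 0
    # discharges its bit-line fully.
    if cells[src] == 1:
        blb = 0  # QB is at 0V, so BLB discharges
    else:
        bl = 0   # Q is at 0V, so BL discharges

    # Step 2: enable the destination word-line. The bit-line levels left
    # over from step 1 now act as the write data for the destination cell.
    cells[dst] = 1 if (bl, blb) == (1, 0) else 0
    return cells


if __name__ == "__main__":
    column = [1, 0]  # cell 1 stores '1', cell 2 stores '0'
    print(copy_cell(column, src=0, dst=1))
```

The model makes the key point explicit: no data leaves the array between the read and the write, because the bit-line state produced by the source cell is reused directly as the write stimulus for the destination cell.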
Note that step 1 is a usual memory read operation, while
step 2 is similar to a memory write operation. However, step 2
is a weak write mechanism since the charge stored on BL/BLB
is used to switch the cell state. This may cause write failures.
Thus, a boosted voltage on WL2 is required to ensure correct
data is written into cell 2. A Monte-Carlo simulation with a
30mV sigma variation in the threshold voltage, using 45-nm PTM
models, was performed to test the proposal, as shown in Fig.
19(c).
The proposed copy scheme does not work in 8T and 8+T
Differential SRAMs because their read and write paths are
decoupled. However, the RCS scheme proposed in Section II can
be used instead. The RWL of the source row and the WWL of the
destination row are enabled, and the output of the sense
amplifier is fed to the RCS block. This copies the data from
the source row to the destination row in a single cycle.
