Computing-in-Memory for Performance and Energy Efficient Homomorphic
  Encryption by Reis, Dayane et al.
Computing-in-Memory for Performance and Energy
Efficient Homomorphic Encryption
Dayane Reis, Student Member, IEEE, Jonathan Takeshita, Taeho Jung, Member, IEEE, Michael Niemier, Senior
Member, IEEE and Xiaobo Sharon Hu, Fellow, IEEE
Abstract—Homomorphic encryption (HE) allows direct com-
putations on encrypted data. Despite numerous research efforts,
the practicality of HE schemes remains to be demonstrated. In
this regard, the enormous size of ciphertexts involved in HE com-
putations degrades computational efficiency. Near-memory Pro-
cessing (NMP) and Computing-in-memory (CiM) — paradigms
where computation is done within the memory boundaries —
represent architectural solutions for reducing latency and energy
associated with data transfers in data-intensive applications such
as HE. This paper introduces CiM-HE, a Computing-in-memory
(CiM) architecture that can support operations for the B/FV
scheme, a somewhat homomorphic encryption scheme for general
computation. CiM-HE hardware consists of customized peripher-
als such as sense amplifiers, adders, bit-shifters, and sequencing
circuits. The peripherals are based on CMOS technology, and
could support computations with memory cells of different
technologies. Circuit-level simulations are used to evaluate our
CiM-HE framework assuming a 6T-SRAM memory. We compare
our CiM-HE implementation against (i) two optimized CPU
HE implementations, and (ii) an FPGA-based HE accelerator
implementation. When compared to a CPU solution, CiM-HE
obtains speedups between 4.6x and 9.1x, and energy savings
between 266.4x and 532.8x for homomorphic multiplications
(the most expensive HE operation). Also, a set of four end-to-
end tasks, i.e., mean, variance, linear regression, and inference
are up to 1.1x, 7.7x, 7.1x, and 7.5x faster (and 301.1x, 404.6x,
532.3x, and 532.8x more energy efficient). Compared to CPU-
based HE in a previous work, CiM-HE obtains a 14.3x speed-up and
>2600x energy savings. Finally, our design offers a 2.2x speed-up
with 88.1x energy savings compared to a state-of-the-art FPGA-
based accelerator.
Index Terms—Homomorphic encryption, Computing-in-
memory, encrypted data processing, cloud computing.
I. INTRODUCTION
There are growing concerns regarding the security and
privacy of clients’ data stored in the cloud. Due to the
nature of the cloud environments where computations need
to be performed on outsourced data, Fully Homomorphic
Encryption (FHE) [1], [2] may be a suitable solution for
data security in the scenarios of data/computation outsourcing,
because it enables any number of additions and multiplications
on ciphertexts directly. However, additions and multiplications
on ciphertexts accumulate noise, and a time-consuming bootstrapping
mechanism is required to reduce the noise so that
the final decryption contains the correct result.
This work was supported in part by ASCENT, one of six centers in
JUMP, a Semiconductor Research Corporation (SRC) program sponsored by
DARPA. Date of publication xx xx, xxxx; date of current version xx xx, xxxx.
(Corresponding author: Xiaobo Sharon Hu.)
D. Reis, J. Takeshita, T. Jung, M. Niemier, and X. S. Hu are with the
Department of Computer Science and Engineering, University of Notre Dame,
Notre Dame, IN, 46556, USA. e-mail: shu@nd.edu.
Somewhat Homomorphic Encryption (SHE) [3], [4] enables
a limited number of additions and multiplications on encrypted
data without losing the ability to decrypt the results of en-
crypted computation. Notably, multiplications cause the noise
of ciphertexts to grow at a much faster pace than additions. For
this reason, the depth of a SHE implementation is defined in
terms of the number of multiplications it can support without
invoking bootstrapping, which is denoted as multiplicative
depth. The multiplicative depth in the arithmetic circuits is
limited to a constant number in SHE; however, as highlighted
in [5], the number of operations on encrypted data is finite in
many real-life applications. Therefore, the application of SHE
instead of FHE may be sufficient and more practical because
computationally intensive bootstrapping can be avoided.
While SHE enables secure and private computing in the
cloud, the large size of ciphertexts (hundreds of KB) may
result in low speeds due to long computation times as well
as the need to constantly transfer data between memory and
the CPU. New computing paradigms such as Near-Memory
Processing (NMP) and Computing-in-Memory (CiM) are po-
tential solutions for reducing the overhead of data transfers
between memory and the CPU [6]–[10]. NMP reduces the
energy and latency associated with memory accesses by plac-
ing processing units close to the memory. Alternatively, CiM
lowers the number of overall memory accesses by integrating
certain logic and arithmetic operations directly in either the
memory cells or memory peripherals. Compounded with the
benefit of fewer data transfers, CiM may offer further speedups
in computation due to the inherently high internal bandwidth
of the memory [11]. In other words, CiM-based solutions
enable a high level of parallelism in their operations by pro-
cessing many words simultaneously without data movement
[11], which could reduce the time overheads in HE.
Recent research efforts have investigated whether compute-
and data-intensive applications can be accelerated with NMP
and CiM architectures based on different technologies, e.g.,
in [6], [11]–[19]. However, most of the previous work on
CiM/NMP targets general purpose computation or specific
machine learning tasks. When compared to traditional appli-
cations, HE (i) has much longer data words — in the order of
hundreds or even thousands of bits, and (ii) features previously
unsupported operations such as modulo reduction and scaling
(division with rounding). That said, the CiM infrastructure in
previous works has not been designed to perform multiplication
of long words and polynomials efficiently (as required
by some HE schemes). Furthermore, the integer X modulo
$q = 2^k$ operation and the PolyScale primitive present in HE
require conditional execution of steps by the CiM components,
as well as arbitrary division and multiplication by powers of
two, which cannot be supported in previous works.
arXiv:2005.03002v1 [cs.CR] 5 May 2020
In this work, we introduce a CiM architecture to support
the processing of encrypted data by employing the well-
known Brakerski/Fan-Vercauteren (B/FV) SHE scheme that
is designed with polynomial rings [3]. While we focus on the
B/FV cryptosystem in this work, our architecture is directly
applicable to other encryption schemes that incorporate similar
operations (e.g., the BGV [20], GSW [21], TFHE [22], and
CKKS [23] schemes). Our proposed CiM-HE targets four primitive
polynomial operations used in the B/FV scheme: polynomial
addition (PolyAdd), polynomial subtraction (PolySub),
rounded polynomial scaling (PolyScale), and polynomial mul-
tiplication (PolyMult). Furthermore, we propose restrictions
on the parameter settings to facilitate CiM-friendly solutions.
Namely, CiM-HE employs moduli of $2^k$ for some integer k.
Also, the PolyScale primitive assumes a divisor of the form
$2^{k'}$ for some integer k′. Our parameter settings do not impact
the security of HE, but rather make it CiM-friendly. To enable
full support to all of the essential operations of HE, we make
the following circuit/architecture contributions:
Division by arbitrary powers of two in memory: The
execution of PolyScale requires division by arbitrary powers
of two in memory, which is achieved with a large and varied
number of bit shifts enabled by the bi-directional logarithmic
bit shifter in our CiM architecture. Importantly, previous
work [19] performs bit shifting in memory with configurable
interconnects. However, only a small number of unidirectional
shifts were demonstrated in their architecture, which are not
sufficient to meet the requirements of the HE operations.
Conditional execution by the CiM architecture: The execution
of PolyScale and the integer X modulo $q = 2^k$
operation require conditional execution of atomic operations
in the CiM architecture, which has not been done in previous
work. For instance, the result of PolyScale might need to be
rounded up depending on the remainder of the division. Similarly,
modular reduction needs to ensure that $[x]_q \in [-q/2, q/2)$. These
conditions can only be checked at runtime by our CiM-HE. To
support such conditional execution, we introduce a horizontal
OR operation performed at the sense amplifier level. The result
of this operation generates a flag that is interpreted by our CiM
sequencing circuit. Namely, the sequencing circuit examines
the flag value and takes the appropriate action while executing
the CiM operation.
Polynomial multiplication in CiM: The execution of Poly-
Mult needs (i) integer multiplication and (ii) an efficient
procedure to multiply polynomials of very large order in
memory. We rely on carry select adders (CSLA) [24], in-
place copy buffers [12] and logarithmic bit shifters to perform
integer multiplication of long words. To realize polynomial
multiplication with CiM-HE, we implement a CiM-friendly
Karatsuba multiplication algorithm enabled by in-place move
buffers (IPMB). The IPMBs allow us to move data from
column i to column i+F in an array, where F is a pre-defined
column offset. By performing this operation in c cycles, it is
possible to move data with cF different offsets.
We perform simulations using Verilog and SPICE to eval-
uate speed and energy consumption of CiM-HE. We employ
the BSIM-CMG FinFET model from [25] for a 14nm tech-
nology node, and compare the runtime of three HE primitives
(HomAdd, HomSub, and HomMult, based on the four primitive
polynomial operations) in our CiM-HE design — without
algorithmic optimizations — against a CPU-based implemen-
tation based on the widely-used Microsoft Simple Encrypted
Arithmetic Library (SEAL) [26] with various algorithmic opti-
mizations enabled. Our results indicate speedups of >10000x
(4.6x) for homomorphic addition (multiplication), and energy
savings for addition (multiplication) of >290x (>530x) for a
single execution of each homomorphic operation.
We also demonstrate the benefits of CiM-HE by evaluat-
ing widely used operations in data analytics, i.e., arithmetic
mean, variance and linear regression, as well as inference, on
encrypted data using HE primitives. The speedups (and energy
savings) of CiM-HE vary by task. In our experiments with a
memory size of 4 MB (8 MB), we obtain maximum (minimum)
speedups of >26000x (1.1x) and maximum (minimum) energy
savings of 532.8x (299.7x). Better speedups are obtained
for tasks whose execution time is dominated by additions,
e.g., in the arithmetic mean. Furthermore, compared to an
alternative CPU-based HE implementation from [27], we obtain a
14.3x speedup and >2600x energy savings for homomorphic
multiplications. Finally, compared to a state-of-the-art FPGA-
based SHE accelerator [5], our design is 88.1x more energy
efficient with speedup of 2.2x for a security level of 80 bits
and a multiplicative depth of 4.
II. BACKGROUND AND PRELIMINARIES
A. B/FV scheme based on RLWE and its polynomial primitives
SHE schemes based on polynomial rings and the presumed-hard
Ring Learning-With-Errors (RLWE) problem [3], [28] have
become popular recently due to their security against quantum
algorithms. Their implementations are widely available and
actively maintained as well [22], [26], [29]. We focus on the
well-known B/FV scheme [3] in this paper, but our design
is easily extensible to the BGV [20], GSW [21], TFHE [22],
and CKKS [23] schemes. The B/FV scheme takes in plaintext
messages from $R_t := \mathbb{Z}_t[X]/\Phi_M(X)$, i.e., the set of residue
classes of the polynomial $\Phi_M(X)$ whose coefficients are
from $\mathbb{Z}_t$, and operates on ciphertext polynomials in $R_q :=
\mathbb{Z}_q[X]/\Phi_M(X)$. These sets form rings of polynomials under
polynomial addition and multiplication. We provide a brief
overview of B/FV SHE below. Details with respect to key
generation, encryption, and decryption are omitted as the focus
of this work is on the acceleration of homomorphic operations
on encrypted data 1.
A plaintext message (e.g., numbers) can be encoded into a
plaintext polynomial in Rt with standard encoding techniques
(e.g., integer encoding, fraction encoding [30]), which is then
encrypted into a pair of polynomials in $R_q^2$, which is the
ciphertext c = (c[0], c[1]) of the plaintext polynomial.
1 In the applications of homomorphic encryption, key generation, encryption
and decryption are one-time processes, and the bottleneck lies in the homomorphic
operations. The details can be found in the original publication [3].
Given two ciphertexts $c_1$ and $c_2$, which encrypt
two messages $m_1$ and $m_2$ respectively, homomorphic addition is
defined as:
$$\mathrm{HomAdd}(c_1, c_2) = ([c_1[0] + c_2[0]]_q, [c_1[1] + c_2[1]]_q)$$
where q is the modulus of the ciphertext polynomials and $[\cdot]_q$ for
an integer denotes the reduction modulo q defined as $[x]_q =
x - q \lfloor x/q \rceil$, with $\lfloor \cdot \rceil$ denoting rounding to the nearest
integer. When the input of $[\cdot]_q$ is a polynomial, the reduction
modulo q is applied to all coefficients. The resulting ciphertext
is the encrypted addition $m_1 + m_2$.
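As a rough illustration of the definition above, the following Python sketch applies the centered reduction $[x]_q$ to each coefficient and adds two ciphertexts component-wise. The list-of-coefficients representation and the names `centered_mod` and `hom_add` are assumptions made for this example; they are not taken from SEAL or from CiM-HE.

```python
def centered_mod(x, q):
    """Reduce the integer x modulo q into the interval [-q/2, q/2)."""
    r = x % q                 # Python's % yields a value in [0, q) for q > 0
    if 2 * r >= q:            # upper half of the range maps to negative residues
        r -= q
    return r


def poly_add(p1, p2, q):
    """Coefficient-wise addition of two polynomials, reduced modulo q."""
    return [centered_mod(a + b, q) for a, b in zip(p1, p2)]


def hom_add(c1, c2, q):
    """Add two ciphertexts, each given as a pair (c[0], c[1]) of coefficient lists."""
    return (poly_add(c1[0], c2[0], q), poly_add(c1[1], c2[1], q))


# Toy parameters chosen only for readability (q = 2^4).
result = hom_add(([7, 3], [1, -8]), ([2, 5], [1, -1]), 16)
```

Homomorphic subtraction follows the same pattern with `a - b` in place of `a + b`.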
In Microsoft’s SEAL [26] that implements the B/FV
scheme, homomorphic subtraction between two ciphertexts c1
(minuend) and c2 (subtrahend) is implemented as:
$$\mathrm{HomSub}(c_1, c_2) = ([c_1[0] - c_2[0]]_q, [c_1[1] - c_2[1]]_q)$$
Similarly, the resulting ciphertext is the encrypted subtraction
$m_1 - m_2$.
The homomorphic multiplication between two ciphertexts
c1 and c2 is defined as a series of polynomial operations. First,
the following polynomials are computed:
$$c_x = [\lfloor t (c_1[0] \cdot c_2[0]) / q \rceil]_q$$
$$c_y = [\lfloor t (c_1[0] \cdot c_2[1] + c_1[1] \cdot c_2[0]) / q \rceil]_q$$
$$c_z = [\lfloor t (c_1[1] \cdot c_2[1]) / q \rceil]_q$$
Here, t is the modulus of the plaintext polynomials. Then, the
relinearization key $rlk = (rlk_0, rlk_1)$ generated from the key
generation algorithm (which is a pair of polynomials) is used
to generate the final ciphertext, also a pair of polynomials:
$$\mathrm{HomMult}(c_1, c_2) = (c_x + \langle rlk_0, c_z \rangle, c_y + \langle rlk_1, c_z \rangle)$$
Here, $\langle poly_x, poly_y \rangle$ represents the inner product between two
polynomials (i.e., the sum of coefficient-wise multiplications).
The resulting ciphertext is the encrypted multiplication
$m_1 \cdot m_2$.
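The first stage of the multiplication above (computing $c_x$, $c_y$, and $c_z$) can be sketched in Python as follows. The sketch assumes $\Phi_M(X) = X^n + 1$, a common power-of-two cyclotomic choice that the text does not fix, uses schoolbook multiplication rather than any optimized algorithm, and omits the relinearization step. All function names are hypothetical.

```python
def centered_mod(x, q):
    """Reduce x modulo q into [-q/2, q/2)."""
    r = x % q
    return r - q if 2 * r >= q else r


def negacyclic_mul(a, b):
    """Schoolbook product of two length-n coefficient lists modulo X^n + 1 (assumed ring)."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < n:
                out[i + j] += ai * bj
            else:
                out[i + j - n] -= ai * bj   # X^n wraps around as -1
    return out


def scale_round(poly, t, q):
    """Scale each coefficient by t/q and round to nearest, ties rounded up."""
    return [(2 * t * x + q) // (2 * q) for x in poly]


def scale_and_reduce(poly, t, q):
    return [centered_mod(x, q) for x in scale_round(poly, t, q)]


def hom_mult_tensor(c1, c2, t, q):
    """Return (c_x, c_y, c_z) for ciphertexts c1 and c2; relinearization is not included."""
    p00 = negacyclic_mul(c1[0], c2[0])
    cross = [x + y for x, y in zip(negacyclic_mul(c1[0], c2[1]),
                                   negacyclic_mul(c1[1], c2[0]))]
    p11 = negacyclic_mul(c1[1], c2[1])
    return (scale_and_reduce(p00, t, q),
            scale_and_reduce(cross, t, q),
            scale_and_reduce(p11, t, q))


# Toy parameters (n = 2, t = 4, q = 8) chosen only to keep the numbers small.
cx, cy, cz = hom_mult_tensor(([1, 1], [0, 1]), ([1, 1], [1, 0]), 4, 8)
```

With q and t both powers of two, the division inside `scale_round` reduces to a right shift plus a rounding decision, which is the property the CiM-friendly parameter choice relies on.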
The computations above comprise two types of operations.
The first type, (i) ring operations, consists of polynomial additions,
subtractions, and multiplications, which in turn consist of modular
additions and multiplications closed in $\mathbb{Z}_q$. For ring operations,
we support three primitive operations on polynomials: polynomial
addition (PolyAdd), polynomial subtraction (PolySub),
and polynomial multiplication (PolyMult). The second type, (ii)
rounded polynomial scaling, scales a polynomial by t/q and
returns the closest polynomial in the ring, i.e., coefficients
are rounded to the nearest integer, with ties broken by
rounding up. For this step, our CiM-HE implements
rounded polynomial scaling (PolyScale) as the final primitive.
B. Related Work
1) CiM and NMP designs for HE: CiM enables significant
speedups and energy savings for data-intensive problems in
multiple application domains, e.g., [6]–[11], [13]–[18]. CiM
designs targeting SRAM-based caches [6] or main memory
[7] have been fabricated. With respect to security-centric
applications, CiM-based engines have been used for realizing
encryption schemes such as AES [31], [32]. However, there is
limited work regarding "in-memory" HE due to the complexity
of operations involved in these encryption schemes (e.g.,
multiplication between ciphertexts that are hundreds of KB in
size). In this regard, SCAM [33] proposes a modification to
the homomorphic XOR and AND operations as defined in the
FHEW scheme [34] in order to perform search operations on
encrypted data using a fully-additive search function. CiM and
NMP implementations of this search function were presented
in [24] and [10], respectively. CiM and NMP approaches
enable speedups and energy savings compared to [33]. To the
best of our knowledge, there are no other works on the imple-
mentation of both homomorphic additions and homomorphic
multiplications in either CiM or NMP.
2) Hardware Accelerators: Several hardware accelerators
for HE have been proposed [5], [35]–[40]. Most designs
are based on Field Programmable Gate Arrays (FPGAs),
which provide an option for implementation of varied HE
circuits. Some designs, e.g., [37]–[39], focus on the design
of multiplier units for very large numbers, which are useful
for performing homomorphic multiplications. For instance,
references [39], [41] focus on the implementation of polynomial
multiplication and the Fast Fourier Transform/Number
Theoretic Transform (FFT/NTT), which are building blocks
commonly used for fast multiplication. However, these works
do not propose/evaluate a complete solution that enables the
realization of a complete set of HE primitive operations, i.e.,
homomorphic additions and homomorphic multiplications, in
a scenario with massive amounts of data transfer.
Accelerators for general HE computation are proposed in
[5], [35], [36], [39], [40]. References [35], [36] and [41]
propose accelerating the YASHE [4] and CKKS schemes, re-
spectively. The YASHE scheme has prohibitively large cipher-
text sizes, which aggravates the impact of memory transfers
on performance due to memory bandwidth limitations. This
fact is not accounted for in [36]. Furthermore, the YASHE
scheme is proven insecure as it is subject to a subfield lattice
attack [42]. As for CKKS, this scheme is also based on
polynomial operations (like B/FV), but it can operate
on floating-point numbers, which is amenable to machine
learning applications. In [41], improvements of two orders of
magnitude in homomorphic multiplication time are reported
when compared to a CPU solution that implements CKKS.
However, the work does not provide an evaluation for a larger
workload, i.e., operations with a set of tens or hundreds of
ciphertexts as likely to occur in a real-life setting are not
included in the evaluation.
In this work we evaluate the functionality of CiM-HE for
the B/FV scheme. Likewise, [5], [40] can perform the essen-
tial operations of the B/FV SHE scheme, i.e., homomorphic
multiplications and additions. To ensure the fairest comparison
possible between different works, we do not compare CiM-HE
to others that are based on different schemes, e.g., YASHE,
CKKS, etc. Regarding prior work that focuses on the B/FV
scheme, reference [40] reports a homomorphic multiplication
time of 463.9 ms for 80-bit security, which is ∼92.8x more
than the time reported in [5] for the same operation with the same
security level. As [5] represents a state-of-the-art FPGA-based
HE accelerator, we choose to compare CiM-HE with that work
in our evaluation (see Sec. IV).
III. CIM-HE FRAMEWORK
In this section, we describe our proposed CiM-HE frame-
work, which consists of CiM hardware as well as the mapping
of primitives (PolyAdd, PolySub, PolyScale, and PolyMult) to a
sequence of CiM-friendly steps. We also discuss HE parameter
settings used by the CiM-HE framework.
A. Parameter settings for CiM-friendly HE
Per Sec. II, polynomial primitives are always computed
over polynomial rings $R_q$ with modulus q for coefficients. To
expedite the integer reduction modulo q in memory, we set
q to be a power of two. We also set the divisor (for the
PolyScale) to be a power of two. Thus, the polynomial ring of
interest is $R_{2^k} := \mathbb{Z}_{2^k}[X]/\Phi(X)$ for some positive integer k.
When divisions are needed during the polynomial scaling, we
divide the coefficients by powers of two, which can be done by
performing bit shifting in memory. This choice of parameters
does not affect the security of HE schemes [3], [43]–[45],
because the RLWE problem is hard regardless of whether
the moduli of the polynomial coefficients are powers of two
instead of prime numbers or products of coprime numbers.
We have followed the widely-accepted analysis of Lindner
and Peikert [46] to choose the parameters n (ciphertext polyno-
mial degree) and log2(q) (ciphertext modulus size) to achieve
an acceptable level of security in CiM-HE, i.e., ≥ 128 bits. In
Sec. IV, we compare the performance and energy consumption
of CiM-HE to a CPU-based implementation with four [n,
log2(q)] pairs to achieve different multiplicative depths and
levels of security. Furthermore, in Sec. IV, we also compare
CiM-HE with previous work, assuming the same parameter
setting as [5], [27], which enables us to achieve 80-bit security
and a multiplicative depth of 4.
B. CiM hardware
Fig. 1 illustrates our CiM hardware architecture. A CiM-HE
bank (Fig. 1(a)) consists of multiple CiM-HE arrays of M×N
size where ciphertexts are stored. When designing a bank, we
should choose the number and dimension of the arrays based
on (i) the number of ciphertexts we wish to store in a single
CiM-HE bank, (ii) the coefficient modulus ($q = 2^k$), and (iii)
the degree of polynomials (n). When positioning polynomials
in the CiM-HE banks before an HE operation, we have to write
the polynomials in a column-aligned fashion (so coefficients of
the same degree are mapped to the same columns in the CiM
arrays) to facilitate the execution of polynomial primitives by
our CiM-HE hardware.
Figs. 1(a) and 1(b) illustrate the mapping of polynomial
coefficients to the memory arrays in a CiM-HE bank. Recall
that the ciphertext has the form c = (c[0], c[1]). In Fig. 1(a),
we define arrays numbered 1 to n1 to store the first polynomial
(c[0]), while arrays numbered n1 + 1 to n2 store the second
polynomial (c[1]). The coefficient mappings depicted in Fig.
1(b) are for arrays 1 and n1 + 1. M and N are the number
of rows and columns in an array, respectively. M − M ′ is
the number of rows reserved as a scratch space to store
intermediate results of CiM-HE operations.
Fig. 1: (a) A CiM-HE bank, where (b) the mapping of coefficients to the CiM-
HE bank and (c) individual arrays are highlighted. Each CiM array
consists of memory cells and peripherals that support the execution
of polynomial primitives. [Figure labels: row decoders A and B; bitline
drivers; customized sense amplifiers and in-memory adders (1:N);
operation selectors (ADD, horizontal OR, OR (READ), NOR (NOT)); bit
shifters with shift mask S (levels 1 to 5); IPCB (copy data to same
column); IPMB (copy data to a different column); sequencing circuit
(controller); scratch space.]
The coefficients
of the two polynomials c[0] and c[1] are labeled as $c[0]_y[z]$
and $c[1]_y[z]$, where y is the index for the coefficients, and
z is the index for ciphertexts that are stored in memory. As
such, each array holds up to C = N/(w × 64) coefficients of
M′ ciphertexts. In this equation, w represents the number of
standard-sized 64-bit words that are necessary to store each
k-bit coefficient. Namely, w × 64 ≥ k. The coefficients of
M different ciphertexts of the same degree are placed in
the same column to facilitate in-memory logic and arithmetic
operations such as AND, OR, addition, and subtraction. The
in-memory computation of polynomial primitives in each array
is supported by a set of customized memory peripherals,
including a sequencing circuit (controller), customized sense
amplifiers, and in-memory adders and bit shifters, which are
described below.
Fig. 2: Log shifter implemented in CiM-HE. [Figure labels: column i
with signals Ri and OUTi; five cascaded levels (Shift 1 L/R, Shift 4 L/R,
Shift 16 L/R, Shift 32 L/R, Shift 64 L/R) controlled by shift mask bits
S1 to S15, three bits per level.]
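The coefficient layout described above (each k-bit coefficient occupying w standard 64-bit words, and C = N/(w × 64) coefficients per array row) can be sketched in Python as follows. The sketch assumes that w is chosen as the smallest integer with w × 64 ≥ k, which the text implies but does not state, and the function names and the example values N = 1024 and k = 218 are purely illustrative.

```python
WORD_BITS = 64  # standard word size used in the text


def words_per_coefficient(k):
    """Smallest w such that w * 64 >= k (assumed choice of w)."""
    return -(-k // WORD_BITS)  # ceiling division on integers


def coefficients_per_array(n_columns, k):
    """C = N / (w * 64): number of coefficients that fit in one array row."""
    return n_columns // (words_per_coefficient(k) * WORD_BITS)


def coefficient_columns(slot, k):
    """Columns occupied by the coefficient in 0-based slot `slot` of a row."""
    width = words_per_coefficient(k) * WORD_BITS
    return range(slot * width, (slot + 1) * width)


# Hypothetical array with N = 1024 columns holding 218-bit coefficients.
w = words_per_coefficient(218)
capacity = coefficients_per_array(1024, 218)
```

Because every ciphertext row uses the same slot boundaries, coefficients of the same degree end up column-aligned, which is what the in-memory adder requires.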
1) Controller: The controller activates appropriate clock
signals so the customized sense amplifier and in-memory
adder circuits can perform operations at the correct times.
In addition to operation sequencing, the controller receives
flag inputs from the customized sense amplifiers that allow
for the conditional execution of steps in the reduction of integer
X modulo $q = 2^k$ and in the PolyScale primitive (see
Sec. III-C1 and Sec. III-C4).
2) Customized sense amplifiers/in-memory adders: The
customized sense amplifiers and in-memory adders employed
in this work perform bitwise logic and wordwise addition
between two ciphertexts stored in memory. The in-memory
adder is also used to implement subtraction and coefficient
multiplication in PolySub and PolyMult primitives. We employ
circuits similar to those in [12], [24], i.e. an in-memory carry
select adder and a customized sense amplifier that enables
bitwise logic between two memory words. We also introduce
a new operation, “horizontal OR”, in which the bitwise AND
result is ORed horizontally, i.e., within all the bits of a word.
The “horizontal OR” operation is used to generate modulo
reduction and rounding flags for the execution of the PolyScale
primitive (see Sec. III-C1 and Sec. III-C4). At the end of each
computation, we keep the output (computation result) stored
in a temporary latch, so it can be copied/moved to an address
in memory in the subsequent cycle.
3) Bit shifters: The bit shifters implemented in CiM-HE
can perform right bit shifts that enable divisions by powers
of two for the PolyScale primitive. We choose to implement
a logarithmic (log) shifter in the memory (Fig. 2). Compared
to other bit shifters such as a barrel shifter, the log shifter
employed in CiM-HE offers flexibility in terms of the number
of shifts possible to achieve in a single round of computation
with less area overhead. For instance, a log shifter with L
levels can perform $2^L$ combinations of right/left shifts in a
single round of bit shifting, and multiple rounds can be used
to achieve any desired number of right/left bit shifts. It is
possible to arbitrarily choose the number of bit shifts each
level performs to favor either larger or smaller shifts in one
cycle. For instance, Fig. 2 depicts the log shifter in our CiM-
HE. We have a 5-level log shifter; each level enables 1, 4, 16,
32, or 64 bit shifts, which can be combined in a single cycle to
achieve greater numbers of shifts, e.g., “64+32+16+4+1=117”,
or “64+4+1=69”. It is possible to activate only certain levels
of the log shifter and keep others deactivated, i.e., performing
a shift by 0. The shift mask S1−S15, a 15-bit number, is used
to activate left/right or no shifts in every level L of the log
shifter. Note that, when setting a mask, we must make sure
that one (and only one) transistor is selected per level.
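The behavior of the five-level log shifter can be modeled in Python as in the sketch below. The one-hot (left, none, right) ordering of the three mask bits per level is an assumption made for illustration, since the text does not specify which of the bits S1 to S15 selects which direction. The model also treats the word as unsigned and uses logical shifts, which may differ from the actual circuit.

```python
LEVEL_SHIFTS = (1, 4, 16, 32, 64)  # shift amount of each level, as in Fig. 2

# Assumed one-hot encoding of the three mask bits of a level.
LEFT, NONE, RIGHT = (1, 0, 0), (0, 1, 0), (0, 0, 1)


def log_shift(word, mask, width):
    """Pass an unsigned `width`-bit word through all five shifter levels.

    `mask` holds one (left, none, right) triple per level, 15 bits in total.
    """
    if len(mask) != len(LEVEL_SHIFTS):
        raise ValueError("expected one mask triple per level")
    limit = (1 << width) - 1
    for amount, triple in zip(LEVEL_SHIFTS, mask):
        triple = tuple(triple)
        if sum(triple) != 1:
            raise ValueError("exactly one selection per level is required")
        if triple == LEFT:
            word = (word << amount) & limit   # bits shifted past the word are dropped
        elif triple == RIGHT:
            word >>= amount                   # logical right shift
    return word


# 64 + 4 + 1 = 69 right shifts in a single pass (levels 1, 4, and 64 active).
mask_69 = [RIGHT, RIGHT, NONE, NONE, RIGHT]
shifted = log_shift(1 << 69, mask_69, 128)
```

Activating all five levels in the same direction gives the largest single-pass shift of 117 positions; larger shifts need several passes.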
4) In-place copy buffers/In-place move buffers: In-place
copy buffers (IPCBs) based on [12] are used to copy CiM
outputs (i.e., the results of a bitwise logic, addition, or right/left
shift) to an address defined in the memory scratch space.
IPCBs are always placed in alignment with each sense am-
plifier column in the memory array. As such, they do not
allow data movement between column i and i + F (F is
a pre-defined offset). As CiM-HE needs both (i) a copy of
outputs to same column in the array, and (ii) data movement
between distinct columns for the execution of polynomial
operations, we introduce in-place move buffers (IPMBs) in our
architecture. While IPMBs have a similar schematic as IPCBs,
we use a different routing scheme in IPMBs that allows us to
move data from column i to column i+F in an array, where
F is a pre-defined column offset. Note that by performing
this operation in c cycles, it is possible to move data with cF
different offsets.
The described architecture is used to implement steps that
allow for the execution of polynomial primitives in support of
HE operations in the B/FV scheme. The mapping between our
CiM-HE hardware and such polynomial primitives is detailed
in Sec. III-C.
C. Mapping operations in SHE to CiM hardware
We now describe how CiM-HE components can be used
to execute polynomial primitives that are used to support
the computational requirements of the B/FV scheme. Each
primitive is broken down into a sequence of CiM-friendly steps
that require one or more cycles to be executed. Note that we
reduce integers X modulo q (as described below) after the
execution of every polynomial primitive, to ensure that our
results are in the ring Rq .
1) Reduction of integer X modulo $q = 2^k$: Because all coefficients
of polynomials are closed in $\mathbb{Z}_q$, the first operation we
need to support is the integer reduction modulo q. We compute
the reduction of integer X modulo $q = 2^k$ (i.e., $[x]_q \in [-q/2, q/2)$)
after the execution of polynomial primitives. Integer reduction
modulo $q = 2^k$ in CiM-HE is facilitated by the fact that the
kth bit (and all other bits to its left) represent the integer
part when performing division by q. As such, these bits can
be ignored if we are only interested in the residue. While
simple, the aforementioned method only returns residues in the
interval [0, q − 1]. To find the residue in the correct interval,
i.e., $[x]_q \in [-q/2, q/2)$, additional CiM steps are required. We
compare half of q (the bit in the (k − 1)th position) to the
value of our current residue via a three-step process.
• Step 1: We store a mask with all ‘0’s and a single ‘1’
in the (k − 1)th position in a scratch space of the CiM
memory. We perform a bitwise in-memory AND between
this mask and the residues from [0, q − 1], which results
in a k-bit number (an intermediate result).
• Step 2: At the customized sense amplifier, we perform
a horizontal OR between the k bits of the intermediate
result. The horizontal OR operation generates a flag value,
either ‘0’ or ‘1’, which is used by the CiM controller to
indicate the next operation.
• Step 3: If the flag is ‘0’, the current residue is a
non-negative number in the interval [0, q/2), which satisfies our
desired form for modulo reduction. If the flag is ‘1’, a
PolySub operation between the current residue and q has
to be performed, so that a residue in the interval [−q/2, 0) can
be found (q is stored in the scratch space of the CiM
memory).
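The three steps above can be mirrored in software as in the following Python sketch. It assumes that bit positions are counted from 0 at the least significant bit, so that "the (k − 1)th position" is the most significant of the k retained bits. The function names are hypothetical, and Python integers stand in for the memory words.

```python
def horizontal_or(word):
    """OR all bits of a word together: 1 if any bit is set, else 0."""
    return 1 if word != 0 else 0


def reduce_mod_pow2(x, k):
    """Reduce x modulo q = 2^k into the interval [-q/2, q/2)."""
    q = 1 << k
    residue = x & (q - 1)                # keep the k low-order bits: value in [0, q - 1]

    half_mask = 1 << (k - 1)             # Step 1: all '0's and a single '1' at bit k-1
    intermediate = residue & half_mask   #         bitwise AND with the mask

    flag = horizontal_or(intermediate)   # Step 2: horizontal OR produces the flag

    if flag:                             # Step 3: flag '1' means residue >= q/2,
        residue -= q                     #         so subtract q (a PolySub in CiM-HE)
    return residue
```

For example, with k = 4 the value 9 has bit 3 set and is mapped to −7, while 5 is returned unchanged.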
2) Homomorphic Addition via PolyAdd: Homomorphic ad-
dition (Sec. II-A) requires polynomial additions, i.e., PolyAdd.
As long as coefficients of the same order of two (or more)
polynomials are placed in the memory in a column-aligned
fashion, this can be directly realized by the in-memory adder
described in Sec. III-B.
3) Homomorphic Subtraction via PolySub: The homomor-
phic subtraction (Sec. II-A) requires polynomial subtractions,
i.e., PolySub. This can also be realized by the in-memory adder
described in Sec. III-B, but requires an extra step before the
addition itself, which consists of finding the 2’s complement
of the subtrahend. The PolySub primitive can be performed as
a two-step procedure.
• Step 1: Performing a NOT operation on the subtrahend
with our customized sense amplifier, and copying the
result to the scratchpad space in the memory array.
• Step 2: Adding the minuend with inverted subtrahend,
while setting the carry-in bit of the in-memory adder to
‘1’. (This is equivalent to performing the subtraction via
2’s complement arithmetic.)
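The two-step subtraction above corresponds to the following Python sketch, which models fixed-width words with bit masks. The word width, the helper `to_signed`, and the function names are assumptions made for illustration.

```python
def sub_twos_complement(minuend, subtrahend, width):
    """Compute (minuend - subtrahend) on `width`-bit words via NOT and add."""
    mask = (1 << width) - 1
    inverted = ~subtrahend & mask       # Step 1: bitwise NOT of the subtrahend
    total = minuend + inverted + 1      # Step 2: addition with the carry-in set to '1'
    return total & mask                 # the carry-out beyond `width` bits is dropped


def to_signed(word, width):
    """Interpret a `width`-bit word as a two's complement signed integer."""
    return word - (1 << width) if word >> (width - 1) else word


def poly_sub(p1, p2, width):
    """Coefficient-wise subtraction of two column-aligned polynomials."""
    return [sub_twos_complement(a, b, width) for a, b in zip(p1, p2)]


# 8-bit toy example: 3 - 5 wraps to 254, which reads as -2 in two's complement.
difference = to_signed(sub_twos_complement(3, 5, 8), 8)
```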
4) PolyScale for Homomorphic Multiplication: The homo-
morphic multiplication (Sec. II-A) involves rounded scaling.
Therefore, before introducing HomMult, we describe how our
CiM-HE architecture supports scaling. Our CiM-HE can perform
rounded scaling by powers of two. Given a polynomial c
and a divisor in the form $D = 2^{k'}$, we perform rounded scaling
by performing a division by $2^{k'}$ for every coefficient, which
returns the integer quotient for every coefficient. Divisions by
powers of two in memory can be implemented with our CiM
log-shifter circuit (described in Sec. III-B).
The log shifter in CiM-HE has 5 levels that allow us to
have either 0 or a pre-defined number of right shifts in each
level. The pre-defined shifts in each level are 64, 32, 16, 4,
and 1. Therefore, for a division by $D = 2^{127}$, we break down
the division into three right-shift rounds:
• Round 1: 64 + 32 + 16 + 4 + 1 = 117;
• Round 2: 0 + 0 + 0 + 4 + 1 = 5;
• Round 3: 0 + 0 + 0 + 4 + 1 = 5.
After each right-shift division round, the result is stored in
the memory through IPCBs (Fig. 1(c)). To determine which
CiM log-shifter levels must be activated in the next round, we
perform a subtraction between k′ and the value of activated
levels (pre-determined values that can be stored in the scratch
space of the CiM array). If the result of this subtraction is
negative (which can be verified by looking at its sign bit),
we know that the number of bit-shifts is beyond what is needed
by the division. Hence, we make the highest active level equal
to 0 for the next right-shift round and continue the process
iteratively until we reach a number of accumulated right-shifts
equal to k′.
After performing the division with right-shifts, we need to
round the quotient z to the nearest integer (round up/down).
This can be done by a two-step procedure as follows:
• Step 1: We compare the bits to the right of k′ in the
dividend (i.e., the remainder) to half of the divisor. We use
a bit mask with all ‘0’s and a single ‘1’ in the (k′ − 1)th
position. We perform an in-memory AND followed by
horizontal OR operations to generate a rounding flag
(similar to the reduction of integer X modulo q).
• Step 2: If the result of the AND-OR flag is ‘1’, rounding
up is performed by adding ‘1’ to the quotient z. Otherwise,
z should be rounded down (the bit shifts perform
this type of rounding by default).
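The PolyScale procedure for one coefficient can be sketched in Python as follows. The round-selection loop is one reading of the text (start with all levels active and deactivate the highest active level until the shift amount fits); the names `LEVELS`, `shift_rounds`, and `rounded_scale` are illustrative, and the sketch assumes a non-negative dividend and k′ ≥ 1.

```python
LEVELS = (64, 32, 16, 4, 1)  # pre-defined right shifts of the 5 log-shifter levels

def shift_rounds(k_prime):
    """Split a right shift by k_prime bits into log-shifter rounds (sketch)."""
    rounds = []
    remaining = k_prime
    while remaining > 0:
        active = list(LEVELS)
        # Deactivate the highest active level while the round would overshoot.
        while sum(active) > remaining:
            active.pop(0)
        rounds.append(sum(active))
        remaining -= sum(active)
    return rounds

def rounded_scale(x, k_prime):
    """Divide x by 2**k_prime and round to the nearest integer (sketch)."""
    quotient = x
    for shift in shift_rounds(k_prime):
        quotient >>= shift                   # one right-shift round
    mask = 1 << (k_prime - 1)                # single '1' at position k' - 1
    flag = int((x & mask) != 0)              # AND followed by horizontal OR
    return quotient + flag                   # flag '1' -> round up
```

For k′ = 127, `shift_rounds` reproduces the three rounds listed above (117, 5, and 5 shifts).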
5) Homomorphic Multiplication via all polynomial primi-
tives: Homomorphic multiplication (Sec. II-A) involves mul-
tiple polynomial primitives. For example, to calculate cy ,
one needs to perform two multiplications of two polynomials
(PolyMult), add the two resulting polynomials (PolyAdd), and
then perform rounded scaling for the resulting polynomial
(PolyScale) to round the polynomial after scaling it by t/q.
Implementing PolyMult in the memory using the naive
“schoolbook” multiplication method would require $O(n^2)$
multiplications between pairs of coefficients, which could
trigger impractically high data movement between CiM arrays.
Thus, we employ the Karatsuba multiplication algorithm [47],
and execute it with our CiM design. The Karatsuba algorithm
recursively breaks a multiplication of two polynomials into
multiplications of polynomials with half the number of terms.
Compared to a “schoolbook” method, the data movement
between arrays — and thus the complexity of implementation
— is significantly lower with Karatsuba multiplication. The
steps for the Karatsuba multiplication with CiM are described
in Algorithm 1.
The CiM implementation of the Karatsuba algorithm is
executed recursively. The base case consists of a multiplication
between two coefficients (in our case, k-bit numbers), i.e.,
a product that can be computed by Shift-Add operations
(Algorithm 2). To perform these operations in memory, we
initially (i) reserve a memory address to store the accumulation
of partial products (out), (ii) copy the value of multiplicand
to the scratch space in memory (so we can shift it in each
iteration), and (iii) store the value of the multiplier in a
temporary register in the controller (b′). We access the bits
of b′ in k subsequent cycles to determine, in each cycle,
whether a′ (the copy of the multiplicand) is added to the
contents of out (bit ‘1’) or only shifted 1 bit to the left (bit ‘0’).
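The base-case multiplication can be sketched in Python as follows. The sketch uses the textbook shift-and-add formulation, in which the multiplier bits are scanned from the least significant bit and the multiplicand copy is shifted left after every bit; this scan order and the function name `shift_add` are assumptions made for illustration.

```python
def shift_add(a, b, k):
    """Multiply two k-bit non-negative integers with shifts and additions."""
    out = 0          # accumulator of partial products (reserved in memory)
    a_copy = a       # copy of the multiplicand (scratch space)
    b_copy = b       # multiplier (held in the controller)
    for i in range(k):
        if (b_copy >> i) & 1:   # bit i of the multiplier is '1'
            out += a_copy       # add the shifted multiplicand to out
        a_copy <<= 1            # 1-bit left shift for the next bit
    return out
```

With the 3-bit operands of the toy example, `shift_add(0b101, 0b011, 3)` returns 15, i.e., the product R1 before the subtraction step.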
The mapping of Karatsuba multiplication operations for
execution in memory is described with a toy example in
Fig. 3. In our example, we want to multiply $A = 11_{10}$ and
$B = 6_{10}$ ($A = 1011_2$ and $B = 0110_2$). In the first step of
the algorithm, we move the high part of both operands
($High_A = 10_2$ and $High_B = 01_2$) so that they are aligned
to ($Low_A = 11_2$ and $Low_B = 10_2$), i.e., the high and low
parts of the coefficients map to the same columns,
respectively. This step is illustrated in Fig. 3(a). We use IPMBs
(described in Sec. III-B4) to move data from column i to
column i + F in an array.

Algorithm 1 Karatsuba algorithm [47], [48]
Input: Polynomials A, B of degree n and k-bit coefficients
Output: A × B
KARATSUBA(A, B)
  if n == 1 then
    return SHIFT-ADD(A, B)
  HighA, LowA := SPLIT(A, ⌊n/2⌋);    ▷ IPMBs
  HighB, LowB := SPLIT(B, ⌊n/2⌋);    ▷ IPMBs
  R1 := KARATSUBA(HighA + LowA, HighB + LowB);
  R2 := KARATSUBA(HighA, HighB);
  R3 := KARATSUBA(LowA, LowB);
  R1 := R1 − R2 − R3;
  return R2 × 2^{2nk} + R1 × 2^{nk} + R3    ▷ shifts/additions

Algorithm 2 Shift-Add algorithm
Input: Multiplicand a and multiplier b, which are k-bit coefficients
Output: out = a × b
SHIFT-ADD(a, b)
  out := 0;    ▷ reserve in memory
  a′ := a;    ▷ copy to memory
  b′ := b;    ▷ copy to controller
  for i = 1 : k do
    if b′(i) == 1 then
      out := ADD(out, a′);
    a′ := SHIFT(a′);    ▷ 1-bit left shift
  return out
The second step consists of computing the additions
$Low_A + High_A = 11_2 + 10_2$ and $Low_B + High_B = 10_2 + 01_2$
(Fig. 3(b)) with an in-memory adder (described in Sec. III-B).
The results of the additions, i.e., $Low_A + High_A = 101_2$ and
$Low_B + High_B = 011_2$, are copied to scratchpad addresses
in memory with IPCBs (represented in Fig. 1(c)).
Next, we compute the product R1 between $Low_A + High_A = 101_2$
and $Low_B + High_B = 011_2$ using the Shift-Add
operation (Fig. 3(c)). The product $R1 = 1111_2$ is copied
to the scratchpad space in memory with IPCBs. Similarly, we
compute the other two products, $R2 = 10_2$ (between the high
parts of A and B) and $R3 = 0110_2$ (between the low parts of
A and B), using the same procedure as for computing R1 (Fig. 3(d)).
The fifth and sixth steps of the in-memory Karatsuba
multiplication (in Figs. 3(e) and 3(f)) are for computing
the subtraction $R1 - R2 - R3 = 0111_2$. The subtraction
is performed in two parts: first, we compute the two-term
subtraction $R1 - R3 = 1001_2$ using the in-memory subtraction
described in Sec. III-C3. We use IPMBs to move the result
of this subtraction so that it is aligned with product $R2 = 10_2$
(Fig. 3(e)). Next, we perform another two-term subtraction,
i.e., between $R1 - R3 = 1001_2$ and $R2 = 10_2$. The result of
the subtraction R1 − R2 − R3, which we rename as R1, is
copied to memory with IPCBs (Fig. 3(f)).
After we compute the subtraction, we shift (i) $R1 = 0111_2$
by nk bits, and (ii) $R2 = 10_2$ by 2nk bits using the log
shifter (Fig. 3(g)). In this example, k = 2 and n = 1,
i.e., the smallest product to be computed by the algorithm
has 2-bit terms (base case). The results of the shifts are
copied to scratchpad addresses in memory with IPCBs. Next,
we perform an addition between $R1 \times 2^{nk} = 11100_2$ and
$R2 \times 2^{2nk} = 100000_2$, and move the result ($111100_2$) so
that it is aligned to $R3 = 0110_2$. The steps described are
illustrated in Fig. 3(h). Last, we perform the final addition
($R2 \times 2^{2nk} + R1 \times 2^{nk} + R3 = 1000010_2$) (Fig. 3(i)). This
result is equivalent to the product A × B.
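The arithmetic of the toy example can be checked with a short Python sketch of a single Karatsuba level. It operates on plain integers and uses Python's `*` operator in place of the in-memory Shift-Add base case; the function name `karatsuba_one_level` and the parameter `half_bits` (the width of the low part, equal to nk = 2 in the example) are illustrative assumptions.

```python
def karatsuba_one_level(a, b, half_bits):
    """One Karatsuba level on integers; returns the product and (R1, R2, R3)."""
    mask = (1 << half_bits) - 1
    high_a, low_a = a >> half_bits, a & mask     # split A into high/low parts
    high_b, low_b = b >> half_bits, b & mask     # split B into high/low parts
    r1 = (high_a + low_a) * (high_b + low_b)     # product of the two sums
    r2 = high_a * high_b                         # product of the high parts
    r3 = low_a * low_b                           # product of the low parts
    r1 = r1 - r2 - r3                            # middle term
    product = (r2 << (2 * half_bits)) + (r1 << half_bits) + r3
    return product, (r1, r2, r3)
```

Calling `karatsuba_one_level(0b1011, 0b0110, 2)` yields the middle term 7, the partial products 2 and 6, and the final product 66, matching the values traced through Fig. 3.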
The dominant part of the execution time for this procedure
is the multiplications between coefficients, i.e., the computa-
tion of products R1, R2 and R3. The other operations are
additions, subtractions and copies (movement of data either
within the same or different columns in memory). Due to the
parallelism in CiM, we can compute the products R2 and R3
in parallel. In this scenario, the complexity of the Karatsuba
algorithm can be reduced from $O(n^{\log_2 3})$ to $O(n \log_2 n)$ [48].
Finally, note that the support of polynomial primitives that
form the basis for the execution of HomAdd, HomSub, and HomMult
operations enables CiM-HE to perform all computations
required by HE.
IV. EVALUATION
We evaluate our CiM-HE architecture and compare the
runtime and energy consumption of the three operations
(HomAdd, HomSub and HomMult in Sec. III) with a CPU-
based HE implementation based on SEAL [26]. Furthermore,
we evaluate three end-to-end tasks (arithmetic mean, variance,
and linear regression) performed with both CiM-HE and the
CPU. Finally, we compare the runtime and energy of HomMult
(the most expensive HE operation) with a state-of-the-art
FPGA-based accelerator [5].
A. Experimental setup
The operations in the B/FV scheme are computed over a
polynomial ring $R_q$. We take q to be a 218-bit number (i.e.,
$\lceil \log_2 q \rceil = 218$) and the degree of ciphertext polynomials to
be $n = 2^{13} = 8192$. We choose these parameters to match
the 128-bit security level, one of the default settings of Microsoft’s
SEAL [26], which exceeds the minimum security requirement
of NIST [49], which is 112 bits. The security afforded by these
parameters can be shown to be 128 bits with the estimator
[50]. Consequently, each ciphertext contains two polynomials
of degree 8192 with 218-bit coefficients, and the ciphertext
size is 436 KB. Furthermore, the multiplicative depth of the
supported functions is L = 5, which can be shown by the
following inequality [3]:

$$L < \frac{\log\left(\frac{\lfloor q/B \rfloor}{4}\right) + \log(t) - \log(\delta_R + 1.25)}{\log(\delta_R) + \log(\delta_R + 1.25) + \log(t)}$$

In our setting, the bound of the error distribution B is 1, the
expansion factor $\delta_R$ is n, and the plaintext modulus t is $2^{10}$.
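The depth bound can be evaluated numerically with the short Python sketch below. It takes q = 2^218 (the CiM-friendly choice of a 218-bit modulus), B = 1, δ_R = n, and t = 2^10 as stated in the text; the function name `max_mult_depth` and its default arguments are illustrative assumptions.

```python
import math

def max_mult_depth(log2_q, n, log2_t=10, B=1):
    """Largest integer L strictly below the depth bound (illustrative sketch)."""
    q = 1 << log2_q
    t = 1 << log2_t
    delta = n                                   # expansion factor delta_R = n
    numerator = (math.log2((q // B) / 4)
                 + math.log2(t)
                 - math.log2(delta + 1.25))
    denominator = (math.log2(delta)
                   + math.log2(delta + 1.25)
                   + math.log2(t))
    bound = numerator / denominator
    return math.ceil(bound) - 1                 # strict inequality L < bound
```

For log q = 218 and n = 8192 the bound evaluates to roughly 5.9, which gives L = 5. Assuming the same t and B for the other parameter sets, the sketch gives depths of 11, 4, and 6 for (n, log q) = (16384, 438), (8192, 152), and (16384, 237), respectively.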
Fig. 3: Mapping the Karatsuba multiplication for execution in memory.
A toy example for the multiplication of $A = 11_{10}$ and $B = 6_{10}$
($A = 1011_2$ and $B = 0110_2$) is presented. Blue arrows (and
shades) indicate the storage of intermediate computation results by
IPCBs and IPMBs. Different colors, i.e., purple/red, green, orange
and pink, are used to represent additions, coefficient-wise
multiplications, subtractions and shifts, respectively. The flow of the
algorithm execution is presented in nine parts. (a) Alignment of high and
low parts of coefficients A and B. (b) Computation of additions
$Low_A + High_A$ and $Low_B + High_B$. (c) Computation of product
R1. (d) Computation of products R2 and R3, which are performed
in parallel. (e) Subtracting R3 from R1. (f) Subtracting R2 from
R1 − R3. (g) Left-shifting R1 = R1 − R2 − R3 by nk and R2
by 2nk. (h) Adding up the shift results. (i) Computation of the final
addition $R2 \times 2^{2nk} + R1 \times 2^{nk} + R3$, which gives
$1011_2 \times 0110_2 = 1000010_2$ (in decimal: $11_{10} \times 6_{10} = 66_{10}$).
1) CPU-based HE: To evaluate HE primitives on a CPU,
we run HomAdd, HomSub, and HomMult using the Microsoft
SEAL library [26]. CPU-based HE performs best when the
modulus q is a product of distinct coprime numbers, as
this allows residue number system (RNS) decomposition of
coefficients as in [51] or [52]. Because $2^k$ cannot be factored
as a product of coprimes, using $2^k$ as a modulus does not
allow for RNS decomposition of ciphertext coefficients. Thus,
using $2^k$ as a modulus would slow down computation on non-CiM
systems. SEAL therefore automatically chooses q to be a
218-bit number that is a product of 5 coprime numbers of
43 or 44 bits. SEAL uses several algorithmic optimizations
for HE, including coefficient-wise RNS decomposition as in
[51], and the Number-Theoretic Transform (NTT) for fast
polynomial multiplication (concisely described in [53]). In
our CiM-HE system, we have not implemented any of these
optimizations. Thus, our basis for comparison for CiM-HE is
the best case for the CPU. Despite the differences in the choice
of the parameter q (CiM-HE uses moduli in the form $2^k$, and
SEAL uses a product of coprimes), the security level of both
implementations is the same at 128 bits.

TABLE I: Specifications of CPU-based HE
CPU Model: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
L1 cache: 32 KB (L1i), 32 KB (L1d)
L2 cache: 256 KB
L3 cache: 3 MB
RAM: 8 GB DDR3

The configuration of our test machine is listed in Table I.
To account for runtime, we use the C++ std::chrono library.
We use the Powerstat tool [54] to measure the power of
the CPU while running the HE primitives. To offset the
power consumption of system tasks not related to HE, we
measure the idle power for the same amount of time as the
execution of HE primitives and subtract it from the total power
consumption. Finally, we calculate energy consumption as the
product between HE net power and operation runtime.
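The energy calculation described above amounts to the following Python sketch. The function name `net_energy_joules` and the power readings used in the usage note are hypothetical and serve only to illustrate the arithmetic; they are not measurements from the paper.

```python
def net_energy_joules(total_power_w, idle_power_w, runtime_s):
    """Energy attributed to an HE primitive: (total - idle) power times runtime."""
    net_power_w = total_power_w - idle_power_w   # remove unrelated system load
    return net_power_w * runtime_s
```

For instance, hypothetical readings of 15.0 W under load and 2.5 W at idle over a 30 ms run would give about 0.375 J.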
2) CiM-HE: The time and energy of HE primitives implemented
in CiM-HE are evaluated based on simulations of
the architecture described in Sec. III. The modulus q is set
as a power of two, i.e., $q = 2^{218}$, which is CiM-friendly.
Memory arrays and peripherals are simulated in HSPICE
[55] using the 14nm BSIM-CMG FinFET model [25] (the
same technology node as the CPU). The CiM controller and
decoders are synthesized in Verilog with the Open Cell Library
[56] at the same FinFET technology node. In circuit-level
simulations, we measure the time and energy of executing
each CiM step required by the polynomial operations (as
described in Sec. III-C). Then, we determine the complete
sequence of steps needed for each polynomial primitive. Based
on this sequence, we compute the time and energy of all the
polynomial primitives, which will ultimately be used to obtain
the time and energy of each HE operation running on CiM-HE.
Our CiM architecture is implemented at the L3 cache level,
which consists of 1 bank with 8192 arrays of size 8×1024
6T-SRAM cells. Each array holds up to 32 coefficients of 218 bits,
each of which fits in 4 standard-sized words of 64 bits. 0.25 KB
per array is reserved as scratch space to store intermediate
results of CiM-HE operations. The total size of a CiM-HE
bank is 4MB. To minimize the number of cycles needed for
each CiM-HE primitive, we assume that coefficients of the
same degree (of different ciphertexts) are stored in a column-aligned
fashion (see Fig. 1(b) in Sec. III-B for more details).
With the column-aligned placement scheme described in Sec.
III-B, as well as the 8192 arrays in our CiM architecture,
we are able to perform 2 simultaneous PolyAdd operations,
i.e., 16384 coefficient-wise additions in parallel for our HE
parameter settings.

TABLE II: Time and Energy of Single Execution of HE Operations
| Primitive | Time (s), CPU | Time (s), CiM | Time Imp. | Energy (J), CPU | Energy (J), CiM | Energy Imp. |
|---|---|---|---|---|---|---|
| HomAdd | 8.2E-5 | 7.9E-9 | 10426.5x | 1.1E-3 | 3.6E-6 | 296.9x |
| HomSub | 8.7E-5 | 8.9E-9 | 9790.6x | 1.4E-3 | 3.9E-6 | 364.5x |
| HomMult | 3.0E-2 | 6.6E-3 | 4.6x | 3.7E-1 | 7.0E-4 | 532.8x |

B. Single Execution of HE Operations
We evaluate the single execution of HE operations
(HomAdd, HomSub and HomMult) on CiM-HE using the
polynomial primitives described in Sec. III. In Table II, we
summarize the energy and runtime of these operations for
CiM-HE and CPU-based implementations. We observe high
speedups (up to 10426.5x) for operations that require only the
execution of polynomial primitives between coefficients of the
same degree, i.e. HomAdd and HomSub, as these operations
can all be executed in parallel by CiM-HE components de-
scribed in Sec. III-B.
A smaller speedup is observed for HomMult, since this
operation requires many steps to execute the required number
of polynomial multiplications (PolyMult) just as with the
CPU. Note that the runtime of PolyMult is proportional to
both log(q) and n, and this polynomial primitive requires (i)
multiplication between coefficients of the same degree
(implemented with Shift-Add operations), and (ii) multiplications,
additions and subtractions between coefficients of different (n)
degrees. While CiM-HE is able to perform many coefficient-
wise operations in parallel because of the inherently high
internal bandwidth of the memory, the PolyMult algorithm
implemented on CiM relies on Karatsuba multiplication and
does not employ more advanced optimizations such as NTT
or RNS.
CiM-HE enables energy savings of 296.9x for HomAdd and
532.8x for HomMult. Besides in-memory additions, multiplications
involve numerous shifts and copies, which consume
much less energy than additions when performed in memory.
As such, shifts and copies lower the average power of HomMult
in comparison to HomAdd. The high energy savings for
multiplication with CiM-HE reflect its lower average power.
CiM-HE can support various parameter settings for the
B/FV encryption scheme, which in turn result in different
security levels and multiplicative depths. Table III shows the
time and energy improvements for a single execution of
HomAdd, HomSub, and HomMult when compared to a CPU-based
implementation with the same parameters. The improvements
are in the same order of magnitude as reported in Table
II for all four settings investigated (A through D).
C. Area evaluation
We estimate the area of one CiM-HE array of 8×1024 size
based on the modular layout of an 8×16 tile, as depicted in Fig.
4. We use Cadence Virtuoso with the FreePDK15 design kit [57]
to construct the layout. Furthermore, we estimate the area of
the sequencing circuit from the synthesized Verilog netlist with
Cadence Encounter. The area of row decoders is not included
in our evaluation, as they are standard elements in a memory
and not exclusive to CiM-HE. Based on the modular design,
the area of 1 CiM-HE array corresponds to 64× the area of one
8×16 tile. When the sequencing circuits (and their respective
routing overhead) are included, the resulting area is 33,804.2
µm². Per Fig. 4, the CiM components of CiM-HE occupy
70.7% of the array tile area. Considering a 4MB CiM-HE
bank with 8192 8×1024 arrays, the CiM components occupy
an area of 194 mm².

TABLE III: Time and Energy Improvements of HE Operations for different parameter settings
| Setting | n | log q | Time Imp., HomAdd | Time Imp., HomSub | Time Imp., HomMult | Energy Imp., HomAdd | Energy Imp., HomSub | Energy Imp., HomMult |
|---|---|---|---|---|---|---|---|---|
| A* | 8192 | 218 | 10426.5x | 9790.6x | 4.6x | 296.9x | 364.5x | 532.8x |
| B† | 16384 | 438 | 11598.1x | 10552.5x | 1.5x | 450.5x | 435.4x | 425.0x |
| C‡ | 8192 | 152 | 8002.8x | 7308.6x | 5.8x | 273.3x | 263.0x | 783.6x |
| D× | 16384 | 237 | 10551.5x | 9959.5x | 2.6x | 355.3x | 349.1x | 700.1x |
* Setting A results in 128-bit security and multiplicative depth of 5;
† Setting B results in 128-bit security and multiplicative depth of 11;
‡ Setting C results in 192-bit security and multiplicative depth of 4;
× Setting D results in 256-bit security and multiplicative depth of 6.

Fig. 4: Area evaluation of a CiM-HE array. [Figure: layout of an 8×1024 CiM-HE array (annotated width 1318.4 µm, length 22.2 µm, area 33,804.2 µm²) composed of 8×16 array tiles (annotated width 20.8 µm), with row decoders, the CiM-HE sequencing circuit (301.6 µm²), and routing (3,950.0 µm²). Area distribution of a tile: 6T-SRAM 29.3%, CSLA 22.1%, log shifter 15.8%, op selector 14.0%, sense amp/Boolean 13.1%, IPCB/IPMB 5.9%.]

We compare the area overhead of CiM-HE to a naive
approach of placing multiplier units near the SRAM array.
As reported in [58], a single 32×32 Booth-Wallace multiplier
constructed in the 14nm FinFET technology node occupies
an area of 2621.4 µm². For the multiplication of coefficients
with up to 256 bits using 32×32 multipliers, we
need either (i) to serialize the operation, or (ii) to construct
a larger multiplier of size 256×256. Assuming the second
approach, we roughly estimate the area of such a multiplier as
8×2,621.4 µm² = 21,131.2 µm². To process the same number
of integer multiplications as CiM-HE in one shot, i.e., two
multiplications of n coefficients, assuming the parameter setting
as in Sec. IV-B, we would need 2×8192 multipliers that
occupy a total area of 346.2 mm² (a 78% increase in area
overhead when compared to CiM-HE).
D. Multiple Executions of HE Operations
Here, we consider multiple executions of HE operations for
different numbers of CiM-HE banks and different sizes of a
CPU’s L3 cache to evaluate the impact of data transfers on
the runtime and energy efficiency of HE. Our evaluation is
divided into two scenarios as described below:
• Scenario 1: We consider a single CiM-HE bank of size
4MB, and further assume a CPU’s L3 cache of 4MB.²
We operate on N ciphertexts. Only 6 (8) ciphertexts are
present in CiM-HE (CPU’s L3 cache) initially. The remaining
N−6 (N−8) ciphertexts must be fetched from DRAM.³
• Scenario 2: We consider two CiM-HE banks of size 4MB
(a total size of 8MB), and further assume a CPU’s L3
cache of 8MB.⁴ We operate on N ciphertexts. Only 12
(16) ciphertexts are present in CiM-HE (CPU’s L3 cache)
initially. The remaining N−12 (N−16) ciphertexts must be
fetched from DRAM.
These two different scenarios are evaluated with an increasing
number of HE operations on ciphertexts, i.e., $2^0$ to $2^9$.
CiM-HE speedup (and energy savings) for Scenarios 1
and 2 are depicted in Fig. 5(a) (and Fig. 5(b)). The CiM-HE-based
L3 cache is filled with 6 ciphertexts on which
we initially operate (i.e., the cache is at its full capacity,
considering that scratch space is needed to store intermediate
results from the execution of HE algorithms). Likewise,
the CPU’s L3 cache of 4MB (iso-capacity) stores 8
ciphertexts as no scratch space is needed. We define the
speedup as $(T_{CPU} + MT_{CPU})/(T_{CiM} + MT_{CiM})$, where
$T_{CPU}$ ($T_{CiM}$) refers to the execution time on the CPU (CiM) and
$MT_{CPU}$ ($MT_{CiM}$) refers to the time required to transfer data
from DRAM to the L3 cache/CiM.
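The speedup definition can be explored with the Python sketch below. The single-operation times come from Table II and the DRAM parameters from footnote 3; the transfer model (one 100 ns access per 64-byte block, with no overlap between accesses or with computation), the linear scaling of execution time with the number of operations, and all function and variable names are simplifying assumptions made for illustration, not the authors' simulation model.

```python
import math

DRAM_ACCESS_S = 100e-9                     # DRAM access time per block
BLOCK_BYTES = 64                           # data block size
CIPHERTEXT_BYTES = 2 * 8192 * 218 // 8     # two polynomials, n = 8192, 218-bit coefficients

def transfer_time(num_ciphertexts):
    """Time to fetch ciphertexts from DRAM, one block per access (assumed model)."""
    blocks = math.ceil(CIPHERTEXT_BYTES / BLOCK_BYTES)
    return num_ciphertexts * blocks * DRAM_ACCESS_S

def speedup(t_cpu, mt_cpu, t_cim, mt_cim):
    """Speedup as defined in the text: (T_CPU + MT_CPU) / (T_CiM + MT_CiM)."""
    return (t_cpu + mt_cpu) / (t_cim + mt_cim)

# 8 operations in Scenario 1: CiM-HE fetches 2 ciphertexts, the CPU fetches none.
ops = 8
add_speedup = speedup(ops * 8.2e-5, 0.0, ops * 7.9e-9, transfer_time(2))
mult_speedup = speedup(ops * 3.0e-2, 0.0, ops * 6.6e-3, transfer_time(2))
```

Under these assumptions the HomAdd speedup falls below 1 once CiM-HE has to fetch ciphertexts, whereas the HomMult speedup stays above 4, because the multiplication time is much larger than the transfer time.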
With a single CiM-HE bank of 4MB and an equivalent size
of the CPU’s L3 cache (Scenario 1), the maximum speedups (and
energy savings) depicted in Fig. 5(a) (Fig. 5(b)) are 12308.2x
(301.2x), 11449.9x (368.4x) and 4.6x (532.8x) for Add, Sub,
and Mult, respectively. When the number of operations is
larger than 6 for CiM-HE (and 8 for CPU), more ciphertexts
need to be fetched from DRAM to replace the ciphertexts
already present in the cache. For instance, when we perform
8 operations, CiM needs to fetch 2 ciphertexts from main
memory, while the CPU’s L3 cache already contains all the
ciphertexts needed for computation. The difference in the
number of ciphertexts present in the two L3 caches causes the
CiM-HE speedup to drop below 1 for all operations (except
Mult) when the number of operations is equal to 8. For a
larger number of operations, the speedup of HomAdd and
HomSub stays close to 1, because $MT_{CPU} \gg T_{CPU}$ and
$MT_{CiM} \gg T_{CiM}$ for these primitives. HomMult can sustain
its speedups at the same level because its computation time is
more significant than the time spent on data transfers, i.e., for
HomMult, $T_{CPU} \gg MT_{CPU}$ and $T_{CiM} \gg MT_{CiM}$. Unlike
speedup, the energy savings of CiM-HE for all operations
(depicted in Fig. 5(b)) do not suffer a significant drop when
there are data transfers from DRAM to cache. This is because
the energy spent on data transfers is much smaller than the
energy spent on HE computations.

² We estimate 4MB SRAM access time and energy with NVSim [59].
Cache hit, miss, and write latencies (energies) are 1.189ns (0.949nJ), 0.286ns
(0.949nJ), and 0.621ns (0.903nJ) per access, respectively.
³ We consider a DRAM access time of 100ns and 80pJ/bit energy consumption
[60]. Data blocks are 64B.
⁴ We estimate 8MB SRAM access time and energy with NVSim [59].
Cache hit, miss, and write latencies (energies) are 2.075ns (1.230nJ), 0.339ns
(1.230nJ), and 1.173ns (1.156nJ) per access, respectively.

Fig. 5: CiM-HE (a) speed-ups and (b) energy savings with respect to
CPU for ADD, SUB, and MULT, plotted on a logarithmic scale against the
number of operations ($2^0$ to $2^9$). Solid (dashed) lines represent
Scenario 1 (Scenario 2). For Scenario 2, CiM-HE banks perform operations
in parallel, increasing the throughput of CiM operations.

With an 8MB L3 cache (Scenario 2), the CPU stores 16
ciphertexts (twice as many as in Scenario 1), while CiM-
HE stores 12 ciphertexts (6 in each CiM-HE bank of 4MB).
Therefore, the CPU has an advantage of 4 more ciphertexts
that are cached before the need for fetching from DRAM. By
using 2 CiM-HE banks, two HE operations can be performed
in parallel (a similar scenario as in [5]), which leads to
further improvements with CiM-HE when compared to the
CPU (dashed lines in Fig. 5(a)). Namely, we achieved maximum
speedups (and energy savings) of 26386.7x (301.1x),
24461.1x (368.3x) and 9.1x (532.8x) for Add, Sub and Mult,
respectively.
As in Scenario 1, in Scenario 2 we observe a drop in
speedup (but not in energy savings) when we start fetching
data from DRAM, as the runtime is dominated by data
transfers. Using more than one CiM-HE bank (Scenario 2)
does not improve the runtime or energy of each HE operation,
and may not be advantageous if the system always performs
one HE operation at a time (albeit unlikely in real applications).
However, as an extra CiM-HE bank nearly doubles
the throughput of the system because of the higher level of
parallelism, the speedup for multiple HE operations is also
improved. For instance, the maximum throughput for Mult (the
most expensive HE operation) rises from 151 multiplications
per second with 1 CiM-HE bank to 302 multiplications per
second with 2 CiM-HE banks working in parallel. The energy
consumption for 2 CiM-HE banks (Scenario 2) remains at the
same level as with a single CiM-HE bank (Scenario 1) when
we perform multiple HE operations.

TABLE IV: Evaluation of Mean, Variance, and LinReg tasks over (a) 6 and (b) 60 ciphertexts for Scenario 1. Number of dimensions for LinReg is 4.
(a) Scenario 1, file size of 6 ciphertexts
| Task | Time (s), CPU | Time (s), CiM | Time Imp. | Energy (J), CPU | Energy (J), CiM | Energy Imp. |
|---|---|---|---|---|---|---|
| Mean | 4.7E-4 | 3.9E-8 | 11985.4x | 5.4E-3 | 1.8E-5 | 300.1x |
| Variance | 1.5E-1 | 3.3E-2 | 4.6x | 2.1E+0 | 5.3E-3 | 404.9x |
| LinReg | 4.5E+0 | 9.9E-1 | 4.6x | 5.6E+1 | 1.1E-1 | 532.3x |
(b) Scenario 1, file size of 60 ciphertexts
| Task | Time (s), CPU | Time (s), CiM | Time Imp. | Energy (J), CPU | Energy (J), CiM | Energy Imp. |
|---|---|---|---|---|---|---|
| Mean | 4.7E-2 | 4.3E-2 | 1.1x | 6.4E-2 | 2.1E-4 | 301.1x |
| Variance | 2.1E+0 | 4.8E-1 | 4.4x | 2.6E+1 | 6.4E-2 | 404.6x |
| LinReg | 6.7E+2 | 1.7E+2 | 4.1x | 8.1E+3 | 1.5E+1 | 532.4x |

TABLE V: Evaluation of Mean, Variance, and LinReg tasks over (a) 6 and (b) 60 ciphertexts for Scenario 2. Number of dimensions for LinReg is 4.
(a) Scenario 2, file size of 6 ciphertexts
| Task | Time (s), CPU | Time (s), CiM | Time Imp. | Energy (J), CPU | Energy (J), CiM | Energy Imp. |
|---|---|---|---|---|---|---|
| Mean | 5.2E-4 | 2.0E-8 | 26386.7x | 5.4E-3 | 1.8E-5 | 299.7x |
| Variance | 1.5E-1 | 1.6E-2 | 9.2x | 2.1E+0 | 5.3E-3 | 404.9x |
| LinReg | 4.5E+0 | 4.9E-1 | 9.1x | 5.6E+1 | 1.1E-1 | 532.3x |
(b) Scenario 2, file size of 60 ciphertexts
| Task | Time (s), CPU | Time (s), CiM | Time Imp. | Energy (J), CPU | Energy (J), CiM | Energy Imp. |
|---|---|---|---|---|---|---|
| Mean | 4.1E-2 | 3.9E-2 | 1.1x | 6.4E-2 | 2.1E-4 | 301.1x |
| Variance | 2.1E+0 | 2.8E-1 | 7.7x | 2.6E+1 | 6.4E-2 | 404.6x |
| LinReg | 6.7E+2 | 9.5E+1 | 7.1x | 8.1E+3 | 1.5E+1 | 532.4x |

Note that the runtimes of HomMult are much higher than
other primitives, and the overall runtimes are dominated by
PolyMult. In complex computation tasks with various opera-
tions, we expect to have a meaningful speedup even if we have
to bring a large number of ciphertexts into the cache, and will
show this in the next subsection.
E. End-to-End Tasks
One application of CiM-HE is in secure computation on
private data. One use case is that several parties jointly
compute a function on their secret inputs (e.g., hospitals jointly
computing statistics on patient data). Another use case is that
a client employs a machine learning model in a trusted cloud
computing service for inference without disclosing the private
data.
Fig. 6: (a) Structure of the 3-layer MLP neural network (input layer
with 784 neurons, hidden layer with 64 neurons, and output layer with
10 neurons). (b) Original sigmoid activation function $y = 1/(1 + e^{-x})$
and adjustments (scaling and approximation*, coefficient rounding**) to
enable inference on encrypted data with the B/FV scheme.
* Three polynomial approximations (with degrees 2, 3, and 5) are found:
poly1: −7.7e−19x^2 + 0.8x + 900
poly2: −3.4e−7x^3 + 1.7e−18x^2 + 1.4x + 900
poly3: 2.6e−13x^5 − 2.5e−24x^4 − 1.3e−6x^3 + 6.1e−18x^2 + 2.1x + 899
** After the polynomial coefficients are rounded to the nearest integer, the
approximations become:
poly1, poly2: 1x + 900
poly3: 2x + 899

TABLE VI: Inference evaluation
| Cache size | Time (s), CPU | Time (s), CiM | Time Imp. | Energy (J), CPU | Energy (J), CiM | Energy Imp. |
|---|---|---|---|---|---|---|
| 4MB | 1.6E+3 | 3.8E+2 | 4.2x | 1.9E+4 | 3.6E+1 | 532.8x |
| 8MB | 1.6E+3 | 2.1E+1 | 7.5x | 1.9E+4 | 3.6E+1 | 532.8x |

In all of the aforementioned examples, the parties can
encrypt their data, allow the cloud computing service to
perform the computation homomorphically, then decrypt the
results locally. To evaluate the use cases of CiM-HE in real-
life situations, we homomorphically compute the functions
described below.
1) Arithmetic mean $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$: CiM-HE computes
and returns the encrypted sum, which is decrypted and divided
by n by the client.
2) Variance $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$: CiM-HE computes the
value of $\sigma^2$ equivalently as
$\sigma^2 = \frac{1}{n^3}\sum_{i=1}^{n} \left(n \cdot x_i - \sum_{j=1}^{n} x_j\right)^2$.
The client performs the final decryption and division by $n^3$.
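The equivalence between the two variance expressions can be checked on plaintext integers with the Python sketch below. The function names are illustrative, and exact rational arithmetic (`fractions.Fraction`) stands in for the client-side division; no encryption is modeled.

```python
from fractions import Fraction

def variance_direct(xs):
    """Variance from the definition (1/n) * sum((x_i - mu)^2)."""
    n = len(xs)
    mu = Fraction(sum(xs), n)
    return sum((x - mu) ** 2 for x in xs) / n

def variance_integer_form(xs):
    """Variance as (1/n^3) * sum((n*x_i - sum_j x_j)^2)."""
    n = len(xs)
    total = sum(xs)
    # Integer-only accumulation: the part that would be evaluated homomorphically.
    acc = sum((n * x - total) ** 2 for x in xs)
    # Final division by n^3: the part left to the client after decryption.
    return Fraction(acc, n ** 3)
```

The integer-only form avoids divisions on the server side, which is why only additions, subtractions, and multiplications are needed before decryption.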
3) Arithmetic operations of linear regression: Given a set
of arguments $X \in F^{N \times D}$ and target values $t \in F^{1 \times D}$ over a
field F, the optimal weights for linear regression are calculated
by $w = (X^T X)^{-1} X^T t$. We examine only the arithmetic
operations required in calculating w, and do not consider the
matrix transpositions and inversions.
4) Inference with an MLP neural network: Given a 3-layer
MLP (Fig. 6(a)), CiM-HE performs inference on the
encrypted MNIST data set of handwritten numbers [61]. The
MLP neural network was previously trained with unencrypted
data. The sigmoid activation function, i.e., $Y = 1/(1 + e^{-x})$,
was employed for training. This baseline, 3-layer MLP neural
network achieves 93.1% accuracy at inference.⁵ The fact that
the B/FV SHE scheme operates on polynomials with integer
coefficients requires the weights and the activation function to
be adjusted for inference on encrypted data. The adjustments
are listed below. Note that after we perform these adjustments,
the 3-layer MLP neural network can perform inference on
the encrypted MNIST data set without the need for bootstrapping.
• Adjustment 1: A scaling factor⁶ $F_w$ was applied to
the weights post training to ensure that their values are
integers;
• Adjustment 2: A scaling factor⁶ $F_a$ was applied to the
activation function, which becomes $Y = F_a \times (1/(1 + e^{-x}))$.
The scaled sigmoid function is approximated by
polynomial functions (Fig. 6(b)). The coefficients of the
polynomials must be rounded to the nearest integer. The
linear function y = 1x + 900 is found to be a possible
solution to the approximation, and it is used in our
evaluation. The function yields 84% testing accuracy
versus 77% when using another option available, i.e.,
y = 2x + 899.
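The coefficient-rounding step of Adjustment 2 can be reproduced with the Python sketch below. The polynomial coefficients are the ones listed in Fig. 6(b) and F_a = 1800 is the value given in the footnote on scaling factors; the function names and the highest-degree-first list representation are illustrative assumptions.

```python
import math

F_A = 1800  # activation scaling factor

def scaled_sigmoid(x):
    """Scaled sigmoid F_a * 1 / (1 + e^(-x))."""
    return F_A / (1.0 + math.exp(-x))

def round_coefficients(coeffs):
    """Round polynomial coefficients (highest degree first) to integers."""
    return [round(c) for c in coeffs]

# Approximations from Fig. 6(b), highest-degree coefficient first.
poly1 = [-7.7e-19, 0.8, 900]
poly2 = [-3.4e-7, 1.7e-18, 1.4, 900]
poly3 = [2.6e-13, -2.5e-24, -1.3e-6, 6.1e-18, 2.1, 899]

rounded = [round_coefficients(p) for p in (poly1, poly2, poly3)]
```

After rounding, poly1 and poly2 both collapse to 1x + 900 and poly3 to 2x + 899; the intercept of 900 matches the scaled sigmoid evaluated at x = 0.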
5) Result Discussion: The results of our evaluation for the
Mean, Variance, and LinReg tasks are presented in Tables IV
and V. Two different situations are considered regarding file
sizes, which represent the number of ciphertexts (inputs) to
be processed in each one of these tasks. Namely, we evaluate
Mean, Variance, and LinReg tasks with CiM-HE and CPU
considering:
• A file size of 6 ciphertexts (Table IV(a) for Scenario 1
and Table V(a) for Scenario 2)
• A file size of 60 ciphertexts (Table IV(b) for Scenario 1
and Table V(b) for Scenario 2)
The Inference task has a fixed input size, i.e., an image
with 28 × 28 pixels; we therefore present the results of our
evaluation for this task in a separate table (Table VI). When
encrypted, the input corresponds to 784 ciphertexts that encode
1 handwritten digit. Note that the 4MB and 8MB caches are
not large enough to hold all the ciphertexts at inference time,
so data transfers occur for the two different CiM size scenarios
during inference.
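A rough capacity check makes the data-transfer argument concrete. The sketch below assumes, purely for illustration, that a ciphertext consists of two polynomials with the n = 4096 and log q = 180 parameters of Table VII; the parameters actually used for the end-to-end tasks may differ, and the space needed for intermediate results is ignored. The helper names are hypothetical.

```python
def ciphertext_bytes(n=4096, log_q=180, polys=2):
    """Bytes for one ciphertext: `polys` polynomials, n coefficients, log_q bits each."""
    return polys * n * log_q // 8


def fits_in_cache(num_ciphertexts, cache_mib, n=4096, log_q=180):
    """True if the input ciphertexts alone fit in a cache of `cache_mib` MiB."""
    needed = num_ciphertexts * ciphertext_bytes(n, log_q)
    return needed <= cache_mib * 1024 * 1024


if __name__ == "__main__":
    for count in (6, 60, 784):
        print(count, fits_in_cache(count, 4), fits_in_cache(count, 8))
```

Under these assumptions, 6 ciphertexts fit in either cache size, whereas 60 and 784 ciphertexts fit in neither, which is consistent with the fetch behavior described for the tasks.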
Computing the Mean, Variance, and LinReg with a file size
of 6 ciphertexts requires no data fetches in either Scenario 1
or Scenario 2, as all ciphertexts are present in the L3 caches
per Sec. IV-D. A speedup of more than four orders of magnitude
(>11,000x with Scenario 1 and >26,000x with Scenario 2) is
observed for the Mean task, as only HomAdd operations are
executed. The execution time of HomMult dominates the
Variance and LinReg tasks, so the speedup with
Scenario 1 for Variance and LinReg is 4.6x, which matches
the speedup of HomMult alone.
5Higher baseline accuracies might be possible by fine adjusting the MLP
network and its parameters, e.g., the learning rate, epochs, number of neurons
in the hidden layer etc. However, in this study we primarily focus on
establishing an initial baseline that is meant to guide future study at scale
for SHE applied to machine learning problems.
6Our evaluation assumes Fw = 100 and Fa = 1800, which yielded the
best training accuracy among several different combinations tested.
On the other hand, executing Mean, Variance, and LinReg
for a file size of 60 ciphertexts causes the improvement
for the Mean task to drop to 1.1x in either scenario as
the time to fetch ciphertexts dominates computational time.
As expected, for other tasks where multiplication operations
dominate runtime, we observed a minimum speedup of 4.4x
and 4.1x for Variance and LinReg with Scenario 1.
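The effect of fetch time on the overall speedup can be sketched with a simple two-term runtime model. The model and all numbers below are hypothetical: it assumes the CPU and CiM-HE pay the same fetch cost and differ only in compute time, which is a simplification and not a description of the measurement setup.

```python
def end_to_end_speedup(t_fetch, t_cpu_compute, t_cim_compute):
    """Speedup of CiM-HE over CPU when both pay the same fetch time."""
    return (t_fetch + t_cpu_compute) / (t_fetch + t_cim_compute)


if __name__ == "__main__":
    # Arbitrary time units; compute-only speedup is 11,000x in this example.
    print(end_to_end_speedup(0, 11_000, 1))        # no fetches
    print(end_to_end_speedup(100_000, 11_000, 1))  # fetch-dominated
```

With no fetches the model returns the compute-only speedup, and once the fetch term is much larger than either compute term the ratio approaches 1, which mirrors the drop reported for the Mean task.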
The runtime of the Inference task is also dominated by
multiplications, which explains the 4.2x (7.5x) runtime im-
provements of CiM-HE when compared to CPU for cache
sizes of 4MB (8MB). While doubling the memory size does
not change the runtime of the CPU-based solution, it allows
for more parallel computations to happen in two distinct CiM
banks (similar to the two co-processors solution in [5]).
Energy improvements reach two orders of magnitude for all
the tasks evaluated, i.e., we obtain a minimum of >290x en-
ergy savings for Mean and maximum of >500x energy savings
for LinReg and Inference. CiM-HE consumes more energy in
Mean than in other tasks. Calculating a sum homomorphically
with HomAdd requires only additions between coefficients of
the same degree, which can be performed with the fast in-
memory carry select adders in our CiM-HE architecture. Other
tasks, e.g., Variance, LinReg, and Inference are dominated by
multiplications (HomMult). As shown in Sec. IV-B, HomAdd
has higher power consumption than HomMult, hence the lower
energy improvement for tasks that involve more additions.
F. Comparison with Related Works
Here, we compare the performance and energy efficiency
of HomMult running on CiM-HE with previous works that
propose HE implementations [5], [27]. As highlighted in [5],
homomorphic multiplications are the primary bottleneck of HE
due to their extremely long runtime. As such, a comparison
for this primitive alone is sufficient to assess the benefits
of our proposed CiM-HE when compared to existing work.
Reference [27] is a CPU implementation based on the NFLlib
[62] that employs the B/FV scheme. Reference [5] proposes
an accelerator that employs up to 2 FPGA-based co-processors
for HE, and implements NTT and RNS for homomorphic
multiplications that are also based on the B/FV scheme.
An important challenge of making a fair comparison be-
tween CiM-HE and other works is that each HE implementa-
tion uses a different set of parameters, which can significantly
impact security, multiplicative depth and runtime of HE opera-
tions. For this reason, the parameters log q and n, as well as the
use of NTT and RNS optimizations in [5], [27] and this work
are listed in Table VII. The security level resulting from these
parameters is only 80 bits, which has not been considered
acceptable by NIST since 2015 [49]. Nevertheless, we use
a similar parameter setting in our design so as to make a fair
comparison between CiM-HE and the existing work [5], [27].
Table VIII presents the homomorphic multiplication time
and energy for [5], [27] and CiM-HE. Roy, et al. employ
their faster configuration with 2 FPGA-based co-processors.
Bos, et al. run their HE implementation on an Intel Core
i5-3427 CPU at 1.8 GHz. Per [63], a latest-generation Intel
i5 reaches up to 40 W under heavy load, which we
assume when comparing the energy consumption of [27] with our
proposed CiM-HE. The same assumption was made by [5].
TABLE VII: Parameters* used in HE Implementations
HE implementation     log q   n      NTT/RNS optimizations
Bos, et al. [27]      186     4096   Yes
Roy, et al. [5]       180     4096   Yes
CiM-HE (this work)    180     4096   No
*Note: These parameter settings enable a security level
of 80 bits and a multiplicative depth of 4
TABLE VIII: Runtime and energy of HomMult running on different
HE platforms
Figure of Merit (HomMult)   Bos, et al. [27]   Imp.      Roy, et al. [5]   Imp.    CiM-HE
Runtime                     33 ms              14.3x     5 ms              2.2x    2.3 ms
Energy                      1.3 J              2632.4x   43.5 mJ           88.1x   494.0 µJ
Execution of a homomorphic multiplication takes 5.0 ms
and 33.0 ms in [5] and [27], respectively. CiM-HE performs
the same operation in 2.3 ms, which represents a speedup
of 14.3x with energy savings of >2600x when compared to
[27]. The use of 2 FPGA-based co-processors in [5] allows for
a throughput of 400 multiplications per second, whereas 1
CiM-HE bank that processes two HE operations in parallel
reaches 861 multiplications per second. Furthermore, CiM-HE
achieves 88.1x energy savings over [5] with the same security
level and multiplicative depth.
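The improvement factors can be re-derived from the rounded runtime and energy values of Table VIII. The following Python sketch is only an arithmetic check; the dictionary keys and helper name are hypothetical, and because the inputs are rounded, the results agree with the reported factors only to within rounding. The CiM-HE throughput of 861 multiplications per second is not recomputed here, since it presumably derives from the unrounded runtime.

```python
RUNTIME_S = {"bos": 33e-3, "roy": 5e-3, "cim": 2.3e-3}      # seconds per HomMult
ENERGY_J = {"bos": 1.3, "roy": 43.5e-3, "cim": 494.0e-6}    # joules per HomMult


def improvement(baseline, ours):
    """Ratio of a baseline figure of merit to the CiM-HE figure of merit."""
    return baseline / ours


# Energy assumed for [27]: 40 W CPU power times 33 ms runtime.
cpu_energy_j = 40.0 * RUNTIME_S["bos"]

# Throughput of [5]: two co-processors, each finishing one HomMult in 5 ms.
roy_throughput = 2 / RUNTIME_S["roy"]

if __name__ == "__main__":
    for name in ("bos", "roy"):
        print(name,
              round(improvement(RUNTIME_S[name], RUNTIME_S["cim"]), 1),
              round(improvement(ENERGY_J[name], ENERGY_J["cim"]), 1))
    print(cpu_energy_j, roy_throughput)
```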
The energy efficiency of CiM-HE is due to two main factors:
(i) CiM-HE does not use algorithmic optimizations like NTT
or RNS, and (ii) computation is performed in memory. The impact
of these factors is explained below. For (i), the PolyMult prim-
itive that relies on additions and shifts (Karatsuba) requires
simpler hardware when compared to the hardware required for
NTT in [5]. The latter requires the design of large multiplier
units and special modules for performing polynomial lift and
scaling (which are not used/needed in our design). The modulo
reduction circuitry is also more complicated than the bit
shifters in CiM-HE, which employs moduli q that are powers
of 2. For (ii), CiM avoids large amounts of data transfers to
processing units while performing HE operations. CiM-HE
takes advantage of data placement (as described in Sec. III)
to perform Boolean logic at the bitline level with the use
of customized sense amplifiers. Boolean operations can be
leveraged to implement more complex functions, e.g., arithmetic
additions, with a lower area overhead when compared
to implementing arithmetic logic units (ALUs) from scratch
[12], [24].
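Factor (i) can be illustrated in software. The Python sketch below is not the CiM-HE hardware algorithm; it is a minimal model, under stated assumptions, of a Karatsuba polynomial multiplication whose coefficient products use only shifts and additions, followed by a reduction modulo q = 2^k that keeps the k low bits instead of dividing. The ring Z_q[x]/(x^n + 1) is the standard B/FV choice and is assumed here; inputs are assumed to be non-negative coefficient lists of equal power-of-two length, and all function names are hypothetical.

```python
def shift_add_mul(x, y):
    """Multiply two non-negative integers using only shifts and additions."""
    acc = 0
    while y:
        if y & 1:          # add the shifted multiplicand when this bit is set
            acc += x
        x <<= 1
        y >>= 1
    return acc


def karatsuba(a, b):
    """Unreduced product (2n-1 coefficients) of two length-n coefficient lists."""
    n = len(a)
    if n == 1:
        return [shift_add_mul(a[0], b[0])]
    h = n // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    z0 = karatsuba(a0, b0)                                 # low halves
    z2 = karatsuba(a1, b1)                                 # high halves
    z1 = karatsuba([x + y for x, y in zip(a0, a1)],
                   [x + y for x, y in zip(b0, b1)])        # sums of halves
    mid = [m - lo - hi for m, lo, hi in zip(z1, z0, z2)]   # cross terms
    res = [0] * (2 * n - 1)
    for i, c in enumerate(z0):
        res[i] += c
    for i, c in enumerate(mid):
        res[i + h] += c
    for i, c in enumerate(z2):
        res[i + 2 * h] += c
    return res


def poly_mult_mod(a, b, k):
    """Product in Z_q[x]/(x^n + 1) with q = 2**k, reduced by bit masking."""
    n = len(a)
    full = karatsuba(a, b)
    mask = (1 << k) - 1
    out = []
    for i in range(n):
        wrapped = full[i + n] if i + n < len(full) else 0  # x^n = -1 wrap-around
        out.append((full[i] - wrapped) & mask)
    return out


if __name__ == "__main__":
    print(poly_mult_mod([1, 2, 3, 4], [5, 6, 7, 8], 8))
```

Because q is a power of two, the final reduction is a mask over the low k bits, which corresponds to the shifter-based reduction described above rather than a general modular reduction circuit.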
Note that if one were to decide not to leverage optimizations
such as NTT and RNS in a design that performs conven-
tional data-processing, i.e., not in-memory, we expect item (ii)
above to significantly influence the associated speedups/energy
savings when compared to CiM-HE implementations. This is
because it is not easy to design processing units that can match
the processing power of CiM, which performs logic at the
sense amplifier level and takes advantage of the inherently high
internal bandwidth of the memory [11].
V. CONCLUSION AND FUTURE WORK
We propose a CiM architecture that realizes essential op-
erations for the B/FV scheme, a well-known SHE scheme.
Our CiM-HE architecture consists of customized CMOS pe-
ripherals such as sense amplifiers, adders, bit-shifters, and
sequencing circuits. The peripherals and memory cells are
based on a 14nm FinFET technology. Circuit-level evaluation
of the CiM-HE design indicates maximum (minimum)
speedups of 9.1x (4.6x) and maximum (minimum) energy
savings of 532.8x (266.4x) for homomorphic multiplications
(the most expensive HE operation). We also evaluate
arithmetic mean, variance, linear regression, and inference
tasks using CiM-HE. The speedups (and energy savings)
are associated with the dominant HE operation required by
each task. Furthermore, our results support the idea that
using multiple CiM banks can improve CiM-HE speedups by
allowing them to operate on different ciphertexts in parallel,
taking advantage of internal memory bandwidth. However, a
more efficient multi-bank approach for CiM-HE may require
larger memory sizes. Therefore, we plan to study the use of
CiM-HE in the main memory as an alternative to the cache by
employing denser memory cells based on CMOS (DRAM) or
emerging technologies. We also plan to integrate algorithmic
optimization techniques into our design to further increase the
speedup of CiM-HE.
REFERENCES
[1] C. Gentry, “Fully homomorphic encryption using ideal lattices,” in Pro-
ceedings of the 41st Annual ACM Symposium on Theory of Computing,
ser. STOC ’09. New York, NY, USA: ACM, 2009, pp. 169–178.
[Online]. Available: http://doi.acm.org/10.1145/1536414.1536440
[2] L. Ducas and D. Micciancio, “FHEW: bootstrapping homomorphic
encryption in less than a second,” in Eurocrypt. Springer, 2015, pp.
617–640.
[3] J. Fan and F. Vercauteren, “Somewhat practical fully homomorphic
encryption.” IACR Cryptology ePrint Archive, 2012.
[4] J. W. Bos, K. Lauter, J. Loftus, and M. Naehrig, “Improved security for a
ring-based fully homomorphic encryption scheme,” in IMA International
Conference on Cryptography and Coding. Springer, 2013, pp. 45–64.
[5] S. Sinha Roy, F. Turan, K. Jarvinen, F. Vercauteren, and I. Verbauwhede,
“Fpga-based high-performance parallel architecture for homomorphic
computing on encrypted data,” in HPCA, Feb 2019, pp. 387–398.
[6] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, “A 28 nm config-
urable memory (tcam/bcam/sram) using push-rule 6t bit cell enabling
logic-in-memory,” IEEE Journal of Solid-State Circuits, vol. 51, pp.
1009–1021, April 2016.
[7] J. T. Pawlowski, “Hybrid memory cube (hmc),” in IEEE HCS. IEEE,
2011, pp. 1–24.
[8] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-
in-memory accelerator for parallel graph processing,” in ISCA, June
2015, pp. 105–117.
[9] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions:
A low-overhead, locality-aware processing-in-memory architecture,” in
ISCA, June 2015, pp. 336–348.
[10] A. O. Glova, I. Akgun, S. Li, X. Hu, and Y. Xie, “Near-data acceleration
of privacy-preserving biomarker search with 3d-stacked memory,” in
DATE, March 2019, pp. 800–805.
[11] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in Memory
With Spin-Transfer Torque Magnetic RAM,” TVLSI, vol. PP, pp. 1–14,
2017.
[12] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and
R. Das, “Compute caches,” in HPCA, Feb 2017, pp. 481–492.
[13] M. Imani, S. Gupta, Y. Kim, and T. Rosing, “Floatpim: In-memory
acceleration of deep neural network training with high precision,” in
ISCA. New York, NY, USA: ACM, 2019, pp. 802–815.
[14] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie,
“PRIME: A Novel Processing-in-Memory Architecture for Neural Net-
work Computation in ReRAM-Based Main Memory,” in ISCA, June
2016, pp. 27–39.
[15] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, “Logic design within
memristive memories using memristor-aided logic (magic),” TNANO,
vol. 15, pp. 635–650, 2016.
[16] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A
processing-in-memory architecture for bulk bitwise operations in emerg-
ing non-volatile memories,” in DAC, 2016, pp. 1–6.
[17] D. Reis, M. Niemier, and X. S. Hu, “Computing in Memory with
FeFETs,” in ISLPED. New York, NY, USA: ACM, 2018, pp. 24:1–24:6.
[Online]. Available: http://doi.acm.org/10.1145/3218603.3218640
[18] A. F. Laguna, X. Yin, D. Reis, M. Niemier, and X. S. Hu, “Ferro-
electric FET Based In-Memory Computing for Few-Shot Learning,” in
GLSVLSI. New York, NY, USA: ACM, 2019, pp. 373–378. [Online].
Available: http://doi.acm.org/10.1145/3299874.3319450
[19] S. Gupta, M. Imani, H. Kaur, and T. S. Rosing, “Nnpim: A processing
in-memory architecture for neural network acceleration,” IEEE Trans-
actions on Computers, vol. 68, pp. 1325–1337, Sep. 2019.
[20] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, “(leveled) fully ho-
momorphic encryption without bootstrapping,” ACM Transactions on
Computation Theory (TOCT), vol. 6, pp. 1–36, 2014.
[21] C. Gentry, A. Sahai, and B. Waters, “Homomorphic encryption
from learning with errors: Conceptually-simpler, asymptotically-faster,
attribute-based,” in Annual Cryptology Conference. Springer, 2013, pp.
75–92.
[22] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, “TFHE:
Fast fully homomorphic encryption library,” August 2016,
https://tfhe.github.io/tfhe/.
[23] J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption
for arithmetic of approximate numbers,” in Eurocrypt. Springer, 2017,
pp. 409–437.
[24] D. Reis, M. Niemier, and X. S. Hu, “A Computing-in-Memory Engine
for Searching on Homomorphically Encrypted Data,” IEEE Journal on
Exploratory Solid-State Computational Devices and Circuits, pp. 1–1,
2019.
[25] J. P. Duarte, S. Khandelwal, A. Medury, C. Hu, P. Kushwaha, H. Agar-
wal, A. Dasgupta, and Y. S. Chauhan, “Bsim-cmg: Standard finfet
compact model for advanced circuit design,” in ESSCIRC, Sep. 2015,
pp. 196–201.
[26] H. Chen, K. Laine, and R. Player, “Simple encrypted arithmetic library -
SEAL v2.1,” in Eurocrypt. Springer, 2017, pp. 3–18.
[27] J. W. Bos, W. Castryck, I. Iliashenko, and F. Vercauteren, “Privacy-
friendly forecasting for the smart grid using homomorphic encryption
and the group method of data handling,” in International Conference on
Cryptology in Africa. Springer, 2017, pp. 184–201.
[28] Z. Brakerski, A. Langlois, C. Peikert, O. Regev, and D. Stehlé, “Classical
hardness of learning with errors,” in Proceedings of the forty-fifth annual
ACM symposium on Theory of computing. ACM, 2013, pp. 575–584.
[29] S. Halevi and V. Shoup, “Design and implementation of a homomorphic-
encryption library,” IBM Research (Manuscript), vol. 6, pp. 12–15, 2013.
[30] K. Laine, “Simple encrypted arithmetic library 2.3.1,” Microsoft Research,
https://www.microsoft.com/en-us/research/uploads/prod/2017/11/sealmanual-2-3-1.pdf,
2017.
[31] Y. Wang, H. Yu, D. Sylvester, and P. Kong, “Energy efficient in-memory
aes encryption based on nonvolatile domain-wall nanowire,” in DATE,
March 2014, pp. 1–4.
[32] M. Xie, S. Li, A. O. Glova, J. Hu, Y. Wang, and Y. Xie, “Aim: Fast
and energy-efficient aes in-memory implementation for emerging non-
volatile main memory,” in DATE, March 2018, pp. 625–628.
[33] S. Bian, M. Hiromoto, and T. Sato, “Scam: Secured content addressable
memory based on homomorphic encryption,” in DATE, March 2017, pp.
984–989.
[34] L. Ducas and D. Micciancio, “Fhew: bootstrapping homomorphic en-
cryption in less than a second,” in Eurocrypt. Springer, 2015, pp.
617–640.
[35] T. Pöppelmann, M. Naehrig, A. Putnam, and A. Macias, “Accelerat-
ing homomorphic evaluation on reconfigurable hardware,” in Interna-
tional Workshop on Cryptographic Hardware and Embedded Systems.
Springer, 2015, pp. 143–163.
[36] S. Sinha Roy, K. Järvinen, F. Vercauteren, V. Dimitrov, and I. Verbauwhede,
“Modular hardware architecture for somewhat homomorphic
function evaluation,” in CHES 2015, T. Güneysu and H. Handschuh,
Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 164–184.
[37] W. Wang, X. Huang, N. Emmart, and C. Weems, “VLSI Design of a
Large-Number Multiplier for Fully Homomorphic Encryption,” TVLSI,
vol. 22, pp. 1879–1887, 2014.
[38] X. Cao, C. Moore, M. O’Neill, N. Hanley, and E. O’Sullivan, “High-
speed fully homomorphic encryption over the integers,” 03 2014, pp.
169–180.
[39] A. Cilardo and D. Argenziano, “Securing the cloud with reconfigurable
computing: An fpga accelerator for homomorphic encryption,” in DATE,
March 2016, pp. 1622–1627.
[40] D. B. Cousins, K. Rohloff, and D. Sumorok, “Designing an fpga-
accelerated homomorphic encryption co-processor,” IEEE Transactions
on Emerging Topics in Computing, vol. 5, pp. 193–206, April 2017.
[41] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “Heax: An architecture
for computing on encrypted data,” in Proceedings of the Twenty-Fifth
International Conference on Architectural Support for Programming
Languages and Operating Systems, 2020, pp. 1295–1309.
[42] M. Albrecht, S. Bai, and L. Ducas, “A subfield lattice attack on
overstretched ntru assumptions,” in Annual International Cryptology
Conference. Springer, 2016, pp. 153–178.
[43] A. Langlois and D. Stehlé, “Hardness of decision (r) lwe for any
modulus,” Citeseer, Tech. Rep., 2012.
[44] C. Peikert, “Public-key cryptosystems from the worst-case shortest
vector problem,” in STOC. ACM, 2009, pp. 333–342.
[45] O. Regev, “On lattices, learning with errors, random linear codes, and
cryptography,” Journal of the ACM (JACM), vol. 56, p. 34, 2009.
[46] R. Lindner and C. Peikert, “Better key sizes (and attacks) for lwe-based
encryption,” in Cryptographers Track at the RSA Conference. Springer,
2011, pp. 319–339.
[47] A. A. Karatsuba and Y. P. Ofman, “Multiplication of many-digital
numbers by automatic computers,” in Doklady Akademii Nauk, vol. 145,
no. 2. Russian Academy of Sciences, 1962, pp. 293–294.
[48] G. Cesari and R. Maeder, “Performance analysis of the parallel karatsuba
multiplication algorithm for distributed memory architectures,” Journal
of Symbolic Computation, vol. 21, pp. 467–473, 1996.
[49] E. Barker and A. Roginsky, “Transitioning the use of cryptographic algo-
rithms and key lengths,” National Institute of Standards and Technology,
Tech. Rep., 2018.
[50] M. R. Albrecht, R. Player, and S. Scott, “On the concrete hardness of
learning with errors,” Journal of Mathematical Cryptology, vol. 9, pp.
169–203, 2015.
[51] J.-C. Bajard, J. Eynard, M. A. Hasan, and V. Zucca, “A full rns variant of
fv like somewhat homomorphic encryption schemes,” in International
Conference on Selected Areas in Cryptography. Springer, 2016, pp.
423–442.
[52] S. Halevi, Y. Polyakov, and V. Shoup, “An improved rns variant of the
bfv homomorphic encryption scheme,” in Cryptographers Track at the
RSA Conference. Springer, 2019, pp. 83–105.
[53] E. Öztürk, Y. Doröz, B. Sunar, and E. Savas, “Accelerating somewhat
homomorphic evaluation using fpgas.” IACR Cryptology ePrint Archive,
vol. 2015, p. 294, 2015.
[54] “Powerstat, a power consumption calculator for Ubuntu,” Online.
http://launchpad.net/ubuntu/xenial/+package/powerstat, date: 2019-09-08.
[55] Synopsys Inc., “HSPICE,” Version O-2018.09-1, 2018.
[56] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech,
and J. Michelsen, “Open Cell Library in 15Nm FreePDK Technology,”
in ISPD. New York, NY, USA: ACM, 2015, pp. 171–178. [Online].
Available: http://doi.acm.org/10.1145/2717764.2717783
[57] K. Bhanushali and W. R. Davis, “FreePDK15: An open-source predictive
process design kit for 15nm FinFET technology,” in Proceedings of the
2015 Symposium on International Symposium on Physical Design, 2015,
pp. 165–170.
[58] K. Vaidyanathan, “Exploiting challenges of sub-20 nm cmos for afford-
able technology scaling,” arXiv preprint arXiv:1509.00885, 2015.
[59] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “Nvsim: A circuit-level
performance, energy, and area model for emerging nonvolatile memory,”
TCAD, vol. 31, pp. 994–1007, July 2012.
[60] S. Borkar, “The exascale challenge,” in VLSI-DAT. IEEE, 2010, pp.
2–3.
[61] L. Deng, “The mnist database of handwritten digit images for machine
learning research [best of the web],” IEEE Signal Processing Magazine,
vol. 29, pp. 141–142, 2012.
[62] CryptoExperts, “FV-NFLlib,” https://github.com/CryptoExperts/FV-NFLlib,
online. Year: 2016.
[63] “Intel Kaby Lake Core i7-7700K, i7-7700, i5-7600K, i5-7600 Review.”
http://www.tomshardware.com/reviews/intel-kaby-lake-core-i7-7700k-i7-7700-i5-7600k-i5-7600,4870.html,
online. Accessed on 11/2019.
