HEAAN Demystified: Accelerating Fully Homomorphic Encryption Through
  Architecture-centric Analysis and Optimization by Jung, Wonkyung et al.
HEAAN Demystified: Accelerating Fully Homomorphic Encryption
Through Architecture-centric Analysis and Optimization
Wonkyung Jung∗, Eojin Lee∗, Sangpyo Kim∗, Keewoo Lee∗, Namhoon Kim∗, Chohong Min†,
Jung Hee Cheon∗, and Jung Ho Ahn∗
∗Seoul National University, † Ewha Woman’s University
gajh@snu.ac.kr
Abstract—Homomorphic Encryption (HE) draws a significant
attention as a privacy-preserving way for cloud computing
because it allows computation on encrypted messages called
ciphertexts. Among numerous HE schemes proposed, HE for
Arithmetic of Approximate Numbers (HEAAN) is rapidly gaining
popularity across a wide range of applications (e.g., machine
learning) because it supports messages that can tolerate approx-
imate computation with no limit on the number of arithmetic
operations applicable to the corresponding ciphertexts.
A critical shortcoming of HE is the high computation complex-
ity of ciphertext arithmetic; especially, HE multiplication (HE
Mul) is more than 10,000 times slower than the corresponding
multiplication between unencrypted messages. This leads to a
large body of HE acceleration studies, including ones exploiting
FPGAs; however, those did not conduct a rigorous analysis of
computational complexity and data access patterns of HE Mul.
Moreover, the proposals mostly focused on designs with small
parameter sizes, making it difficult to accurately estimate the
performance of the HE accelerators in conducting a series of
complex arithmetic operations.
In this paper, we first describe how HE Mul of HEAAN is
performed in a manner friendly to computer architects. Then we
conduct a disciplined analysis on its computational and memory-
access characteristics, through which we (1) extract parallelism in
the key functions composing HE Mul and (2) demonstrate how to
effectively map the parallelism to the popular parallel processing
platforms, multicore CPUs and GPUs, by applying a series of
optimization techniques such as transposing matrices and pinning
data to threads. This leads to the performance improvement of
HE Mul on a CPU and a GPU by 42.9× and 134.1×, respectively,
over the single-thread reference HEAAN running on a CPU. The
conducted analysis and optimization would set a new foundation
for future HE acceleration research.
I. INTRODUCTION
As cloud computing becomes an increasingly dominant way
of providing computing resources, numerous computations
are performed on datacenter servers rather than on personal
devices [5], [35]. It enables the client without expensive hard-
ware to receive services that require complex computations.
However, security and privacy issues are also emerging with
the growth of cloud computing [23], [51]. When a client sends
private data to a server, security issues in data transfers can
be resolved by sending the data after encryption. However,
the data encoded by a conventional encryption method must be
decrypted to perform the computation in the server. Therefore,
a user have no choice but to use the cloud service with a risk
of security or privacy attack (e.g., abusing) that occurs during
the computation of unencrypted data.
Homomorphic Encryption (HE) [48], an encryption scheme
that enables computation between encrypted data, draws sig-
nificant attention as a solution to this privacy problem. By
adopting HE, service providers no longer need to decrypt the
clients’ private data for computation. The concept of HE was
first suggested in 1978 [48]. However, the early proposals of
HE were either unsafe [48] or support only one type of HE
operation, namely HE addition (HE Add) or HE multiplication
(HE Mul) (e.g., ElGamal [29] and Paillier [46]). In this aspect,
it was difficult to put HE into serious applications for a
while. However, fully HE (FHE) [31] proposed in 2009 made
a major breakthrough by supporting both HE Add and HE
Mul. Moreover, FHE supports bootstrapping, a method of
initializing noise in encrypted data, enabling an unbounded
number of HE Add and Mul without decryption.
Among numerous FHE schemes to date [12], [17], [18],
[28], [30], [40], HE for Arithmetic of Approximate Numbers
(HEAAN [17]), also known as CKKS (Cheon-Kim-Kim-
Song) is rapidly gaining popularity [36] as it supports the
approximate computation of real numbers. HEAAN enables
HE Add and Mul of approximate data, where the result is
almost the same as that of the original operation with a tiny
error. Using HEAAN, computations can be performed without
data decryption in the datacenter. However, the execution time
for computation on encrypted data (ciphertext) increases by
from 100s to 10,000s of times compared to that on native,
unencrypted messages. Therefore, it is highly desired to reduce
the computation time of HE operations to use HE practically.
There have been a large body of research on accelerating
HE operations using FPGAs or GPUs. However, FPGA-based
acceleration studies targeted the HE schemes (e.g., BGV [12],
LTV [40], and BFV [30]) that only support computations of
integer numbers [21], [25], [45], [49], [50], or operate with
only limited parameter sizes [47]. They all target performing
a small number of HE Mul without bootstrapping, inhibiting
their applicability to a wide range of applications requiring
hundreds to thousands of multiplication be performed (e.g.,
deep learning). GPU implementation studies [4], [7], [8], [22]
do not take advantage of the algorithm’s internal parallelism
sufficiently, operate on only small or limited parameters, or do
not consider the cost of modulo operations.
In this paper, we demystify HEAAN, a representative FHE
scheme, by describing, analyzing, and optimizing it in a
manner friendly to computer architects. We first explain the
ar
X
iv
:2
00
3.
04
51
0v
1 
 [c
s.D
C]
  1
0 M
ar 
20
20
pertinent details of the encryption, decryption, and compu-
tation aspects of HEAAN, and identify that the following
four functions take more than 95% of HE Mul, the most
computationally expensive operation of HE: CRT (Chinese
Remainder Theorem), NTT (Number Theoretic Transform),
iNTT (inverse NTT), and iCRT (inverse CRT). we conduct an
in-depth and disciplined analysis of the aforementioned pri-
mary functions to understand their computational complexity
and access patterns on input, output, and precomputed data,
which are critical for operation (e.g., modular multiplication)
strength reduction, across a range of key HE parameters.
The parallelism exposed through the analysis is exploited to
accelerate HE Mul on CPU and GPU, the most popular com-
puting platforms, which are already equipped with hundreds to
thousands of ALUs. In CPU, we utilize multiple cores (inter-
core parallelism) and AVX-512 instructions supported by the
latest Intel architectures (intra-core parallelism). In GPU, we
utilize massive thread-level parallelism expressible through the
CUDA programming model. We further improve performance
by proposing a series of architecture-centric optimizations,
such as matrix transposition to better exploit memory access
locality, loop reordering to expose more parallelism, and
taking a synergy between precomputation and delayed modulo
operations; and we estimate how much more performance
gains are attainable through natively support currently em-
ulated instructions. We achieve 42.9× and 134.1× speedup
of HE Mul on CPU and GPU, respectively, compared to the
single-thread reference HEAAN [1], setting a new baseline for
the future HE acceleration research.
II. BACKGROUND: COMPUTATIONAL CHALLENGES OF
HOMOMORPHIC ENCRYPTION (HE)
HE can be categorized into two groups, somewhat HE
(SHE) and fully HE (FHE), by whether there is a limitation
on the number of arithmetic operations applicable to the
ciphertext. In a HE scheme, noise is accumulated during each
operation; this makes the ciphertext of a SHE scheme inde-
cipherable after performing a certain number of operations.
On the contrary, FHE schemes support a bootstrapping algo-
rithm [16], which refreshes the accumulated noise. Therefore,
although there is an upper bound in the number of arithmetic
operations that can be consecutively applied to a ciphertext,
by periodically bootstrapping it, we can continue manipulating
the ciphertext with no need for decrypting it. This property
makes FHE well-tailored to meet the demands of a wide range
of general applications (e.g., training/inference of deep neural
networks [6], [11], [19], [27]), which require a massive number
of operations applied to encrypted data.
Representative FHE schemes include Brakerski-Gentry-
Vaikuntanathan (BGV) [12], Lopez-Alt, Tromer, and Vaikun-
tanathan (LTV) [40], Brakerski/Fan-Vercauteren (BFV) [30],
fast FHE over the torus (TFHE) [18], and Cheon-Kim-Kim-
Song (CKKS) [17]. Among these, only CKKS supports ap-
proximate computation on real numbers, and hence it is a
top candidate for many real-world applications requiring a
bunch of operations on data that can tolerate tiny errors due to
TABLE I: Execution time for addition and multiplication of
messages vs. ciphertexts in HEAAN.
Operation type Message Ciphertext Slowdown
Addition 2.1 ns 348.2 ns 168.2×
Multiplication 4.3 ns 155883.8 ns 36112.7×
approximate computation. CKKS is rapidly gaining popularity
in a wide range of applications exploiting HE, such as machine
learning [39]. For example, the winner and the most runner-
ups of a recent HE challenge about secure genome analysis
competition (iDASH 2018 [36] and 2019 [37]) used CKKS
or its hybrid versions. Therefore, we investigate HEAAN (HE
for Arithmetic of Approximate Numbers) scheme [1], which
is developed by the authors of CKKS.
HEAAN is able to perform arbitrary computation types by
combining HE multiplication (simply HE Mul) and HE addi-
tion (simply HE Add), which are multiplication and addition
on ciphertexts. However, the execution time of HE operations
increases significantly compared to the corresponding ones
on the original unencrypted messages. Table I compares the
execution time for addition/multiplication of original messages
and ciphertexts using a single core from the system specified
in Section VI. We measure the average execution time of
addition/multiplication of a complex number in a message
consisting of 32,768 complex numbers. HE Mul (Add) is
36,112× (168×) slower than native message operation. When
the message consists of fewer numbers, the slowdown is even
higher. Considering that most approximate number operations
consist of multiplication and addition, the long execution
time of HE operations is an obstacle to the practical use
of HE. Therefore, it is essential to accelerate HE operations,
especially HE Mul as it is 448× slower than HE Add.
III. A BRIEF INTRODUCTION TO HEAAN
Prior to accelerating the most compute-intensive HEAAN
operation, HE Mul, we introduce the pertinent details of
HEAAN, focusing on how to convert an input message to
a ciphertext through encoding/encryption steps and how to
perform arithmetic operations on ciphertext in HEAAN.
A. HEAAN encryption
HEAAN converts an input message to a ciphertext through
encoding and encryption steps. An input message consists of
n complex numbers, each composed of a double-type real
and imaginary number. The encoding step first converts an
input message to a plaintext (t), a polynomial of at most
degree (N -1) with N integer coefficients. t is placed in
a cyclotomic polynomial ring (R = Z[X]/(XN + 1) =
c0X
0+c1X
1+ ...+c(N−1)X(N−1)) space with the magnitude
of each coefficient bounded by ciphertext modulus, integer q
(t ∈ R/qR). Therefore, each coefficient (ck) is the residue
number by q, where q is a BigInt (big integer) much larger than
264. The encoding step converts the floating-point numbers of
an message to integer numbers after multiplying with a scaling
2
TABLE II: Multiplicative depth (L) and required N over logQ
to guarantee 80-bit security level (an attacker needs 280× of
modular mul & add to break with the current best algorithm.)
logQ Multiplicative depth (L) required N
300 10 214
600 20 215
1,200 40 216
2,400 80 217
factor (∆) and then rounding down the remaining fraction
numbers. Each coefficient is a logq-bit BigInt, and n ≤ N2 .
Then, the encryption step converts a plaintext to a ciphertext
consisting of a pair of polynomials c.ax and c.bx using a public
key pair (pk0 and pk1) as follows:
c.ax = u× pk1 + e1
c.bx = u× pk0 + e0 + t
The public key pair is generated from a secret key (sk).
In the equations above, u is a polynomial of at most degree
(N -1), where each coefficient is either -1, 0, or 1 following a
distribution described in [17]. e0 and e1 are also polynomials
of at most degree (N -1) polynomials with random error values
to ensure security, which follow a Gaussian distribution with
a small standard deviation value (e.g., σ = 3.2 in [17]).
To extract the original message from a ciphertext, we first
convert the ciphertext to the plaintext exploiting the following
relationship between c.ax and c.bx:
c.bx = c.ax× sk + t + e’
Then, the plaintext t can be returned to the original message
through decoding; in this case, the inverse of scaling factor
(1/∆) is multiplied to get the approximate values.
HEAAN limits the maximum size of the ciphertext modulus
q to a constant value Q. HEAAN chooses pL for Q, where
L is multiplicative depth, the number of consecutive HE Mul
operations applicable to a ciphertext before it loses encrypted
data, and p is rescaling factor. The size of the message
contained in a ciphertext increases exponentially as the ci-
phertext is multiplied repeatedly. To prevent the explosion of
message size, HEAAN performs rescaling after each HE Mul
by dividing the coefficients of the output ciphertext by p. Then
the size of q, the ciphertext modulus, is adjusted to q′ where
logq′ = logq - logp. Therefore, logq of a ciphertext that is just
encrypted starts at logQ, decreases by logp every time an HE
Mul is applied, and becomes 0 after experiencing L HE Mul
operations, losing data. When p is fixed, more HE Mul can
be applied to a ciphertext with a larger Q value.
To apply more HE Mul operations to a ciphertext, FHE re-
initialize the ciphertext through bootstrapping. However, boot-
strapping is a costly operation (reported to take in the order of
minutes [14]) because it consists of dozens of HE Mul1 and
shift operations. To reduce the overhead of bootstrapping for
practical use of HE, we should use large Q values to increase
L. Also, N should increase as Q increases to guarantee a
certain level of security in HE (see Table II). As larger Q and
1This again reinforces the importance of accelerating HE Mul.
bits
qLimbs
N
 c
oe
ffi
ci
en
ts
Number of primes (np)
CRT
N
 c
oe
ffi
ci
en
ts
Number of primes (np) bits
bits
bits
Fig. 1: Input and output data structure of CRT.
N values require more computation and data storage costs per
HE Mul, it is not efficient to use too large Q; we discuss a
further detail of this trade off in Section VIII. In this paper, we
use (p, L, Q, N ) of (230, 40, 21200, 216), respectively, which
are the default values the official HEAAN repository [1] uses.
B. HEAAN computation
Arithmetic operations in HEAAN, HE Add and Mul, are
through computation between the polynomials of operand
ciphertexts. Here we assume that the two operand ciphertexts
in HE operation have the same ciphertext modulus value q.
HE Add computes an output ciphertext (c3) from two input
ciphertexts (c1 and c2) through the following operations:
c3.ax = mod(c1.ax + c2.ax, q)
c3.bx = mod(c1.bx + c2.bx, q)
HE Add is relatively simple because it performs the element-
wise addition of BigInt coefficients and then the modulo of
q for each output coefficient; mod(x, y) means x modulo
y. Here the modulo operation is implemented simply by
subtracting q when each output coefficient is larger than q
because the result of each addition is smaller than 2q.
HE Mul is much more costly than HE Add because the
former requires multiplying BigInt coefficients by N2 times.
In this paper, we assume that β is the size of an integer
data type that a computer natively supports with high perfor-
mance (e.g., β=264 for 64-bit CPUs), which is often called
a word size. A logq-bit BigInt is represented as qLimbs
(=dlogq/logβe) words (see Figure 1). Then, one BigInt mul-
tiplication (simply mul) consists of (qLimbs)2 logβ-bit word
mul and 2(qLimbs)2−1 word addition (simply add) operations.
For example, because qLimbs is 19 when using the representa-
tive parameters (N=216, logq=1,200, logβ=64) summarized in
Table III, 361 64-bit mul and 721 64-bit add operations are re-
quired besides carry propagation per BigInt mul. As described
above, polynomial mul requires N2 (=4.3 billion) BigInt mul,
which requires at least 4.6 Tera 64-bit operations. To reduce
the complexity of this polynomial mul, HEAAN and other
HE schemes use Chinese Remainder Theorem (CRT [24]) and
Number Theoretic Transform (NTT [20]).
CRT for reducing the complexity of BigInt mul: CRT states
that for np integers {mi|0≤ i<np} which are coprime with
each other, the residue set {xi = mod(X,mi)|0≤ i<np} of
any integer 0 ≤X <∏np−1i=0 mi is unique. HEAAN exploits
CRT by defining a set of np integers {p0, p1, ..., pnp−1},
where each modulus pi is a prime number smaller than β
and
∏np−1
i=0 pi = P ≥ q2. Then, a logq-bit BigInt number B,
3
TABLE III: Representative parameters in Homomorphic Encryption (HE)
Symbol Description Representative values
∆ Scaling factor is multiplied to the floating-point number of message to convert to integer number. log ∆ = 30
p Rescaling factor linearly reduces the size of messages that grow exponentially during computation. log p = 30
L Multiplicative depth is the maximum number of possible mul of a ciphertext without bootstrapping. 40
Q The maximum ciphertext modulus is equal to the initial ciphertext modulus after encryption. logQ =1,200
q Ciphertext modulus starts from Q and divided by a rescaling factor p at each mul. log q = 1,200, 1,170,. . . , 0
N The number of coefficients of ciphertext polynomial. The degree of ciphertext polynomial is at most N -1. 216 = 65,536
n The number of messages. n messages are encrypted in one ciphertext. 32, 64,. . . , 32,768
β Word size. It is machine-dependent (264 for CPU, 232 for GPU). 264 or 232
qLimbs The number of limbs of q. To represent log q-bit integer, qLimbs limbs are required. d1,200/64e = 19
np The number of prime numbers. We use np prime numbers to represent big integer in RNS domain. d2,400/58e = 42
P Product of prime numbers that are used to represent big integer. P =
∏
j pj
PLimbs The number of limbs of P . To represent log(P/pj)-bit integer, PLimbs limbs are used. max
j
(log(P/pj)/ log β)
which is the coefficient of the ciphertext polynomial, can be
represented in the residue number system (RNS) by the set of
remainders {b0, b1, ..., bnp−1} where bi =mod(B, pi).
A key property of RNS is that for adding, subtracting, and
multiplying numbers represented in RNS, it is sufficient to
perform the same modular operation on each pair of residues
(called a congruence relation). That is, for a pair of logq-
bit BigInt numbers (A, B) and their corresponding RNS
representations ({ai|0≤ i< np}, {bi|0≤ i< np}), the product
of A and B is C represented by {ci|0 ≤ i < np} such that
ci =mod(ai · bi, pi). This relationship holds because we set
P not smaller than q2 and the product of two logq-bit BigInt
numbers are smaller than q2.
Therefore, a logq-bit BigInt number is converted into np
logβ-bit data (see Figure 1) in an RNS domain, where we call
the conversion a CRT function or simply CRT, and a BigInt
mul is changed into np logβ-bit modular mul; and hence, the
time complexity per BigInt mul is changed from O(qLimbs2)
to O(np). In general, qLimbs2  np (see Table III), so
the number of operations required for BigInt mul can be
greatly reduced by CRT. However, multiplying two BigInt
polynomials still has a complexity of O(np · N2) because
polynomial mul requires N2 coefficient mul operations.
NTT for reducing the complexity of polynomial mul: NTT
is a discrete Fourier transform over a finite field (integer). It
is well known that fast polynomial mul can be implemented
by using Fast Fourier Transform (FFT) [10]. Therefore, we
can translate polynomial mul with O(N2) complexity into
element-wise mul with O(N) complexity by using fast NTT,
a variant of FFT limited to integer values. Although fast NTT
(or simply NTT) requires transformation cost with O(N logN )
complexity, it is beneficial to use NTT when N is large enough.
The overall flow of HE Mul in HEAAN, including CRT
and NTT, is depicted in Figure 2. It consists of 5 polynomial
mul, where each polynomial mul performs (1) CRT, (2)
NTT, and (3) element-wise modular mul, followed by (4)
inverse fast NTT (iNTT) and (5) RNS-to-BigInt conversion
(iCRT) to return to the polynomial with BigInt coefficients.
We divide the whole process into region 1 and region 2;
the former multiplies and adds input ciphertexts whereas
c1.ax
c1.bx
c2.ax
c2.bx
×
×
+
+
×
×
×
≫ + − −
≫ +
c3.ax
c3.bx
evk.ax
evk.bx
Region1 Region2
Fig. 2: The overall flow of HE Mul in HEAAN. A white
(black) filled symbol represents an operation conducted in an
RNS (BigInt) domain. Region 1 and region 2 use different
moduli because the former multiplies two logq-bit numbers
whereas the latter multiplies a logq-bit number with an eval-
uation key polynomial composed of logQ2-bit numbers.
the latter transforms c1.ax × c2.ax, which is a region 1’s
byproduct that is decrypted with sk2, into a form that is
decrypted with sk (called key-switching). The evaluation key
(evk) used in region 2 consists of two polynomials (evk.ax
and evk.bx) where each coefficient is logQ2-bit (= 2·logQ-bit)
long, encrypting the square of sk multiplied by a constant Q.
The key-switching procedure with evk is mandatory to decrypt
correctly, as it removes the need for skn in decryption stage,
which, without the key-switching, would be required after n
sequential multiplications.
In region 1, np is configured to deal with logq2-bit BigInt,
the intermediate result of polynomial mul between two input
ciphertexts whose coefficient size is logq-bit. By contrast, in
region 2, np is set larger to represent (logq+ logQ2)-bit BigInt
because the coefficient size of evk is logQ2-bit long. The
shift operations in region 2 reduce the amount of error that
was accumulated during mul. The following additions and
subtractions between the results of polynomial mul produce
the result of HE Mul (c3.ax and c3.bx); we can get an
approximate value of mul between two original messages by
decrypting the result using sk.
Figure 3 shows the execution time breakdown of HE Mul
4
0% 20% 40% 60% 80% 100%
Execution time
(ms)
CRT NTT iNTT iCRT Extras
Fig. 3: HE Mul execution time breakdown (total 5,108 ms).
Algorithm 1 CRT
Input: INCRT(N, qLimbs), TBCRT(np, qLimbs)
Output: OUTCRT(N,np)
1: for (i = 0; i < N ; i← i+ 1) do
2: for (j = 0; j < np; j ← j + 1) do
3: accum← 0
4: for (k = 0; k < qLimbs; k ← k + 1) do
5: accum += INCRT[i][k]×TBCRT[j][k]
6: OUTCRT[i][j] = mod(accum, pj)
using a single-threaded reference HEAAN in the system and
configuration described in Section VI. CRT, NTT, iNTT, and
iCRT account for 95.8% of the total execution time. The
remaining operations, such as element-wise modular mul,
account for only 4.2% of execution time. The total execution
time is 5,108 ms, which is about 36,000× slower than the
original message mul, as discussed in Section II. Therefore,
accelerating HE Mul is essential for practical use of HE, and
it is necessary to accelerate CRT, NTT, iNTT, and iCRT.
IV. AN IN-DEPTH ANALYSIS OF MAJOR FUNCTIONS IN
HEAAN MULTIPLICATION
To accelerate the primary functions (CRT, NTT, iNTT, and
iCRT) in HE Mul, we first conduct an in-depth analysis
of how each function works. In the following descriptions,
IN/OUTfunction(X,Y ) represents an X by Y matrix used
as input/output of a function, while TBfunction(X,Y ) repre-
sents a precomputed table of X by Y matrix.
CRT (Algo. 1) takes INCRT(N, qLimbs) representing N
log q-bit BigInt numbers and produces OUTCRT(N,np), the
result of modulo operation on each BigInt with np different
primes {pj |0 ≤ j < np}. The operation consists of two stages:
(1) computing matrix-matrix mul of INCRT with TBCRTᵀ
and (2) applying modulo operations to each output element.
We first explain how to perform a modulo operation on a
BigInt with a modulus smaller than β. A BigInt A is expressed
by qLimbs log β-bit words, i.e.
∑qLimbs−1
k=0 ak · βk where
{ak|0 ≤ k < qLimbs}. Then the modulo operation on the
BigInt is as follows:
mod(A, pj) = mod(
∑qLimbs−1
k=0 ak · βk, pj)
= mod(
∑qLimbs−1
k=0 ak · mod(βk, pj), pj)
Here because β and pj are independent of the input, HEAAN
precomputes TBCRT, mod(βk, pj) for all k and j. There-
fore,
∑qLimbs−1
k=0 ak · mod(βk, pj) is performed by multiply-
ing INCRT and TBCRT.
We can exploit Shoup’s modular mul (Shoup’s Mod-
Mul [50]) for the modulo operation in line 6 of Algo. 1.
Shoup’s ModMul (Algo. 2) computes mod(X · Y, pj) with 3
Algorithm 2 Shoup’s modular multiplication (ModMul)
Input: X,Y, pj , YShoup
Output: r = mod(X × Y , pj)
1: Quhi = (X × YShoup) logβ
2: r = X × Y −Quhi × pj
3: if r > pj then
4: r = r − pj
Algorithm 3 NTT
Input: INNTT(np,N)← OUTCRTᵀ,TBW(np,N)
Output: OUTNTT(np,N)
1: for (i = 0; i < np; i← i+ 1) do
2: t = N
3: IN = INNTT[i]
4: TBW = TBW[i]
5: for (m = 1; m < N ; m← m× 2) do
6: t = t / 2
7: for (j = 0; j < m; j ← j + 1) do
8: for (k = j × 2t; k < j × 2t+ t; k ← k + 1) do
9: butt(IN[k], IN[k + t], pi,TBW[m+ j])
10: OUTNTT[i] = IN
Algorithm 4 butt
Input: A,B, pi,W
Output: A,B
1: U = mod(B ×W, pi)
2: B = A − U
3: A = A + U
muls and a single correction step if a value YShoup (= bY ·βpj c)
is known in advance. It replaces a costly division operation
with relatively cheaper mul, comparison, and subtraction oper-
ations. We apply the algorithm for the modular mul on accum
spanning up to 3 limbs (accum0 +accum1 ·β+accum2 ·β2),
using precomputed YShoup values on Y = {1, β, β2}. The
operations of CRT can be performed in parallel for each
coefficient (total N ) and each prime number (total np).
NTT implements Cooley-Tukey algorithm [20], which re-
cursively divides an N -point FFT to k N/k-point FFTs and
combines their results (called radix-k FFT). An exemplar
radix-2 NTT in Algo. 3 takes a matrix INNTT(np,N) as
an input and performs a butterfly algorithm butt. It uses
a precomputed table (TBW) of powers of the 2N -th root
of unity for all np prime numbers. For each prime, butt
(Algo. 4) is called logN · N2 times. As butt requires modular
mul, it also uses Shoup’s ModMul as was done in CRT.
iNTT is slightly different from NTT. It has a different loop
order, calls inverse butterfly (ibutt) instead of butt, deals
with a different precomputed table (consisting of the inverse
of powers of the primitive root of unity (TBinvW)), and
finally divides each element by N . However, except for the
last element-wise division by N , iNTT is symmetric to NTT in
terms of the number and the kind of operations. Both NTT and
iNTT are completely parallelizable for each prime number.
iCRT converts the matrix OUTiNTT(np,N), where each
element is a remainder smaller than β, back to N log q-
bit BigInts (see Algo. 5). It starts with (1) the Hadamard
5
TABLE IV: The number of arithmetic operations and computational complexity of major functions of HE Mul.
CRT NTT iNTT iCRT
Multiplication N × qLimbs× np - - N × np× PLimbs
Modular mul N × np np×N/2× logN np× (N/2× logN + N) 2×N × np
ADC (add with carry) N × qLimbs× np - - N × np× PLimbs
Add, Sub - np×N × logN np×N × logN -
Computation complexity O(N × qLimbs× np) O(N × logN × np) O(N × logN × np) O(N × np× PLimbs)
Algorithm 5 iCRT
Input: INiCRT(np,N)← OUTiNTT(np,N)
TBinvP(np),TBPdivp(np, PLimbs)
Output: OUTiCRT(N,np)
1: for (i = 0; i < N ; i← i+ 1) do
2: for (j = 0; j < np; j ← j + 1) do
3: temp[j][i] = mod(INiCRT[j][i]×TBinvP[j], pj)
4: for (i = 0; i < N ; i← i+ 1) do
5: accum = 0
6: for (j = 0; j < np; j ← j + 1) do
7: for (k = 0; k < PLimbs; k ← k + 1) do
8: accum+ = temp[j][i]×TBPdivp[j][k]× βk
9: OUTiCRT[i] = mod(mod(accum,P),q)
TABLE V: Input and precomputed data size of major functions
of HE Mul. The unit of data size is β. Precomputed data for
Shoup’s ModMul is excluded.
CRT NTT & iNTT iCRT
Input data N × qLimbs N × np N × np
Precomputed
np× qLimbs N × np np,data np× PLimbs
product between an input matrix and a precomputed table
TBinvP whose elements are modular inverses of P/pj for
all pj , followed by an element-wise modular mul with each
pj . Shoup’s ModMul can also be applied here for efficient
modular mul. (2) Then, each output element of (1), a scalar
value, is multiplied by a BigInt P/pj according to its j and
accumulated to a temporary BigInt accum. Here each P/pj
is precomputed and stored in table TBPdivp(np, PLimbs),
where PLimbs = max
j
(log(P/pj)/ log β). (3) Finally, a
reduction of accum modulo P and q (q · Q in Region 2)
is performed.
We summarize the number of arithmetic operations needed
for each major function in Table IV and the size of input and
precomputed data for each function in Table V.
V. ARCHITECTURE-AWARE OPTIMIZATIONS TO MAXIMIZE
HE MUL PERFORMANCE ON CPUS AND GPUS
Previous HE studies [47], [50] pursued proposing new hard-
ware architectures (e.g., through FPGA implementation) for
performance improvement. By contrast, we first improve the
performance of HE by utilizing the most popular computation
platforms, CPUs and GPUs, which are already equipped with
hundreds to thousands of ALUs.
Algorithm 6 iCRT algorithm in the matrix-matrix mul form.
The first three lines are the same as Algo. 5.
...
4: for (i = 0; i < N ; i← i+ 1) do
5: accum = 0
6: for (k = 0; k < PLimbs; k ← k + 1) do
7: accumsmall = 0
8: for (j = 0; j < np; j ← j + 1) do
9: accumsmall+ = temp[j][i]×TBPdivp[j][k]
10: accum = accum+ accumsmall × βk
11: OUTiCRT[i] = mod(mod(accum,P),q)
All the major functions (CRT, iCRT, NTT, and iNTT)
of HE Mul of HEAAN have massive parallelism that can
be exploited by CPUs and GPUs. All the residual numbers
(N × np) can be computed in parallel on CRT. NTT and
iNTT perform np independent transformations and leverage
the algorithmic optimization of FFT, where N/k groups can
be computed in parallel at each individual stage during FFT.
Henceforth, we identify key challenges and solutions we
devise in accelerating HE Mul on CPUs and GPUs.
A. Loop reordering to expose massive parallelism in iCRT
iCRT recombines the residual numbers into the integers of
size logq for each coefficient of the resulting ciphertext and
hence it might be regarded that the degree of parallelism is
smaller than CRT (N vs. N·np). However, we discover that the
limited N -degree parallelism can be expanded to N·PLimbs-
degree parallelism by reordering two loops in iCRT (line 6
and 7 in Algo. 5); the modified algorithm is shown in Algo. 6.
After reordering, the sequence of original mul between a scalar
and a BigInt now becomes a matrix-matrix mul between a
temp matrix and a TBPdivp matrix (line 9 in Algo. 6). Then,
iCRT should be modified such that the partial sum in the
inner-most loop is accumulated into accumsmall (double or
triple words), rather than accum (BigInt), which is aggregated
to accum at the end of the loop.
The range of modulus and np determine whether
accumsmall should be a double word or a triple word; with our
representative parameters where β = 264, using a double word
is sufficient. With our loop reordering, the resulting matrix-
matrix mul exposes a massive parallelism of degree N·PLimbs
in iCRT, providing abundant parallelization opportunities to
contemporary hardware platforms.
B. Accelerating HE Mul on CPUs
The strategies a modern CPU takes to exploit parallelism
from an application are twofold: populating (1) multiple cores
6
and (2) ALUs supporting short-SIMD instructions within each
core. For example, the Skylake-based Intel Xeon CPU we use
has 24 cores per socket and each core support AVX-512 in-
structions, each executing eight 64-bit integer operations [26],
which could lead to over 100× performance improvement
compared to the baseline implementation not exploiting these
parallelism. To achieve a near-maximum performance CPU
provides, we exploit intra-core parallelism via utilizing AVX-
512 instructions and inter-core parallelism via multi-threading.
We carefully distribute operations to multiple threads and
AVX-512 SIMD lanes per thread to minimize performance
degradation by an inferior cache performance caused by poor
data access patterns; in this, whether an input matrix follows
a column- or a row-major order can vastly affect performance.
During CRT, a CPU thread takes responsibility of a portion
of the N coefficients (line 1 in Algo. 1), whereas each lane
of an AVX-512 port performs operations on different prime
numbers (line 2 in Algo. 1). During NTT and iNTT, a thread
does its job on a portion of the prime numbers, (line 1 in
Algo. 3), whereas each lane of AVX-512 computes a part of
the coefficients (line 8 in Algo. 3). In case of iCRT, we take
different approaches for the two iteration phases. During the
first phase (line 1 to 3 in Algo. 5), each thread and a AVX-
512 lane performs computation on a part of the N coefficients.
During the second iteration phase (line 4 to 11 in Algo. 6) each
thread also computes on a part of the coefficients (line 4), but
each lane of an AVX-512 port is on different k, the positional
index on the limbs of P/pj (line 6).
Although the reference HEAAN library [1] supports multi-
threading and exploits the same parallelism type on NTT,
iNTT, and iCRT with our work, (in terms of distributing
works to threads; the reference HEAAN does not implements
SIMD instructions) CRT takes a different strategy.
Emulating arithmetic operations: However, AVX-512 does
not support parallel 64-bit mul and 64-bit ADC (addition with
carry) yet. Therefore, 64-bit mul is emulated with four parallel
32-bit mul, five 64-bit add, and five 64-bit shift instructions.
Also, one 64-bit compare and one additional 64-bit add are
required to handle carry per addition. This emulation narrows
the performance gap between the reference HEAAN and the
AVX-512 implementation. To further improve performance
under this constraint, we modify Shoup’s ModMul as follows.
The original Shoup’s ModMul algorithm requires three
operations: one 64-bit mulhi operation to compute Quhi,
and two 64-bit mullo operations to compute Quhi · pj and
X · Y , where Qu is an estimation of a quotient, and 64-bit
mulhi (mullo) returns the upper (lower) 64-bit of a mul result
which is 128-bit long. A single 64-bit mulhi operation can be
emulated with four 32-bit mul (hi · hi, hi · lo, lo · hi, lo · lo),
four 64-bit add, and five 64-bit shift operations. In this case,
the estimated remainder can be either r or r + pj , which lies
in a range of [0, 2pj).
However, one of the 32-bit mul (lo · lo) for emulating 64-bit
mulhi is used only for computing a carry from low 64-bit of
mul. We can remove this lo · lo mul if the carry is ignored,
and produce an approximated 64-bit mulhi (Quhi′ ) with only
three mul instead of four. By applying this optimization, the
estimated remainder lies in a range of [0, 4pj). As the upper
bound of a remainder grows to 4pj , one more correction step
(conditional subtraction) is needed, but the number of more
expensive operations are reduced: one 32-bit mul, two 64-bit
add, and one 64-bit shift instruction.
Matrix transposition in iCRT: SIMD instructions might lead
to a poor cache utilization if they access a matrix storing
elements in a row-major order by column direction (or vice
versa), as this demands multiple cache lines at once. The
resulting performance degradation is more prominent when
the access stride is too large for a hardware prefetcher to be
effective. iCRT experiences this issue because the matrix-
matrix mul (line 9 in Algo. 6) accesses the temp matrix
storing elements in a row-major order by column direction.
We implicitly transfer the temp matrix using the scatter
instructions in AVX-512 to address this issue.
Reducing the size of β to 232: We can remove the overhead of
emulating 64-bit operations by using β of 232 instead of 264
as AVX-512 naturally supports 32-bit mul and ADC. Using
the smaller beta changes the number of instructions for mul
and ADC to one, and reduces the number of instructions for
modular mul to below the half. However, it also has short-
comings that qLimbs should roughly be doubled to express
numbers whose size is up to q. np also needs to be doubled
because the upper bound of each prime number is β. Larger
qLimbs and np imply bigger precomputed tables and more
iterations for CRT, NTT, iNTT, and iCRT. It also increases
the number of operations other than mul, modular mul, and
ADC. A preliminary implementation shows no improvement
in performance by using the small β for CPUs.
C. Accelerating HE Mul on GPUs
Modern GPUs such as Volta [41] and Turing [42]
have as many integer (INT32) processing units as single-
precision floating-point (FP32) processing units, resulting in
the multiply-accumulate throughput for INT32 numbers the
same as that for FP32 numbers. Such massive throughput
of integer operations makes GPU an attractive candidate for
accelerating HE operations. We identify the following key
points in fully exploiting the performance potential of GPUs
when accelerating HE operations.
Parallelization strategies: CUDA programming model [43]
for GPU has the following hierarchical structure of threads:
multiple GPU threads are grouped to form a thread block
and multiple thread blocks comprise a grid. A thread block
is allocated to one of Streaming Multiprocessors (SMs). The
threads in a thread block share the resources (e.g., shared
memory) of the SM. Each thread block in a grid is allocated
to each SM in a round-robin fashion, and the number of thread
blocks (grid dimension) and the number of threads in a block
(block dimension) are configured at each GPU kernel launch.
The basic parallelization strategy is to assign each inde-
pendently computable output element in a function to a GPU
thread. We launch N ·np threads for CRT so that each thread
computes one output element, which is a residue. NTT and
7
iNTT launch N/2·np threads each, where one thread performs
one butterfly operation (Algo. 4) for each butterfly step using
a simple radix-2 iterative NTT algorithm [34] that is also used
for CPU implementation.
In case of iCRT, a naı¨ve parallelism strategy uses N -degree
parallelism where one thread handles one output BigInt type
coefficient. Prior studies [8], [22] took the same strategy.
However, by changing the loop order as described above,
we transform its core operation into a matrix-matrix mul
operation, thereby taking advantage of N · PLimbs-degree
parallelism to maximize TLP. We can also exploit various
optimization strategies developed for matrix-matrix mul in
GPU (e.g., tiling principles in NVIDIA’s cutlass library [38]).
64-bit emulation vs. 32-bit word: As opposed to most CPUs
that natively support 64-bit words, modern GPUs natively
support 32-bit words and emulate 64-bit integer operations.
To avoid the overhead of 64-bit emulation (whose throughput
is more than an order of magnitude lower than that of the 32-
bit counterpart), prior studies accelerating HE on GPU [8],
[22] use 32-bit words (β = 232) and the prime numbers
smaller than β. We also use 32-bit words and operations.
Another advantage of using 32-bit words on GPU is that the
operations with carry-in and carry-out can be used without
emulation; modern NVIDIA GPUs support carry operations
(e.g., addc, subc, and madc in assembly-like virtual ISA,
PTX [44] where the operations are called extended-precision
integer arithmetic instructions). The throughput of these in-
structions is the same as the instructions without carry in recent
GPU architectures [41], [42], enabling efficient computation
on large integers without emulation.
Different strategies for BigInt modulo in CRT: A naı¨ve
BigInt modulo is done by repetitive logβ-bit shift, add, and
modulo operation; for example, cuHE [22], takes this ap-
proach. By contrast, HEAAN accumulates the result of modulo
on each ak ·βk using a precomputed table in CRT. In this case,
the BigInt variable accum (line 3 in Algo. 1) can span two
or three words depending on np and the size of each prime
number. In the CPU implementation, accum is guaranteed
to span two words when using the representative parameters
specified in Table VI, as an overflow does not happen for
np ≤ 264−58 = 64 with prime numbers smaller than 258. To
guarantee accum to be two-word long, we use 257 as a lower
bound of a prime for AVX-512 implementation instead of 259
which is the default value of the reference HEAAN. However
in GPU with a 32-bit β, with primes smaller than 230,2 only
up to 4 (= 232−30) accumulation is allowed to guarantee that
the overflow does not happen, which is nearly impossible as
np is 90 or higher when using the representative parameters.
To prevent the overflow, one might (1) use three-word
accum with an additional ADC operation added in the inner-
most loop to avoid expensive modulo operations, or (2)
do modulo operations intermittently in the inner-most loop
2We used 227 as a lower bound of a prime when β = 232. If the lower
bound is excessively reduced (e.g., using 221) to avoid any overflow, the
overhead from a growing np becomes too high, negating the advantage of
removing overflows.
(e.g., for every 4 accumulation in our case), to ensure that
accum spans only two words. We compare these strategies in
Section VII.
Per-thread storage for accumulation in iCRT: The baseline
implementation of iCRT with N-degree parallelism allocates
a BigInt accum (line 8 in Algo. 5) as a long array, in a per-
GPU-thread manner. If accum is not carefully allocated to
fast storage, frequent cache thrashing might occur, leading
to a significant performance degradation. The latest NVIDIA
GPUs [41], [42] have a variety of storage types including
register, L1 cache, L2 cache, device memory, and read-only
constant memory. In the algorithm of original iCRT (Algo. 5),
temp[j][i] is stored in register memory, so that it can be loaded
quickly in a single cycle. On the other hand, because accum
is declared as a thread-local array and also it is dynamically
indexed in the algorithm (e.g., used as accum[idx] where idx
is a variable), it is not stored in register, which is the fastest
storage on GPU. Instead, CUDA compiler stores it in global
memory and caches into L1 and L2 (in CUDA programming
model this is called local memory).
However, heavily using local memory can lead to cache
thrashing when the grid dimension and block dimension
increase, leading to a number of threads competing for cache,
degrading overall performance. In order to mitigate the cache
miss penalty, we suggest two different optimizations when
using N parallelism in iCRT: (1) using fewer threads by
simply reducing block dimension and grid dimension, using
the grid-stride loop method [33], or (2) pinning each accum
array in L1 cache through allocating the array in shared
memory; this is possible as the shared memory shares capacity
with the L1 unified cache. (2) is similar to the implementation
of CRT kernel in cuHE [22], which uses the shared memory
for storing thread-local arrays. We compare the methods on
cuHE’s iCRT kernel [22] which implements Algo. 5 with
N parallelism, along with loop reordering with N · PLimbs
parallelism which is explained in Section V-A.
High-radix NTT (iNTT): Radix-2 NTT is memory-bound;
GPU reads and writes a large input INNTT(np,N) (dozens of
megabytes with typical np and N values specified in Table III,
exceeding the size of L1 and L2 caches of GPU) by log2N
times. At each butt function in Algo. 3, a GPU thread reads
two values of IN from the device memory and writes two
output values back to the device memory.
Using high radix NTT (radix-k with k > 2) can mitigate
the memory bandwidth bottleneck because each GPU thread
reads and writes k values within IN in the butt operation
performing k-point NTT. It changes the number of transferring
IN from log2N above to logkN , reducing the number of
main memory accesses needed for NTT. However, increasing
the radix is not always beneficial because, as each thread takes
more than two inputs, register pressure on each GPU thread
increases. Using registers more than the register file size of
a streaming multiprocessor causes register spilling to local
memory, leading to performance degradation with additional
data loads from the main memory. Given the constraint, we
use an appropriate size of radix (radix-32) to alleviate the high
8
TABLE VI: HEAAN parameter settings for CPU and GPU.
Parameters CPU (AVX-512) GPU
N 216
Q & q 21200
β & qLimbs 264 & 19 232 & 38
Prime number size 257 < pi < 260 227 < pi < 230
np (Region 1) 42 (43) 90
np (Region 2) 63 (64) 134
pressure of main memory bandwidth and conduct a sensitivity
study in Section VII.
VI. EXPERIMENTAL SETUP
We compared the performance of the reference HEAAN [1],
our AVX-512 implementation with multi-threading, and GPU
implementation. We used Intel Xeon CPU (Skylake-based
Xeon Platinum 8160 operating at 2.1 GHz) and NVIDIA GPU
(Turing Titan RTX operating at 1.35 GHz). The CPU system
consists of 24 cores per socket and each core has two AVX-
512 FMA units, achieving a peak 64-bit integer performance of
1.61 TOPS per socket. Each socket has six memory channels,
each equipped with DDR4-2666 DRAM modules. We did not
use HyperThreading; the number of cores utilized was the
same as the number of CPU threads populated. The GPU
system consists of 72 streaming multiprocessors (SMs), each
with 64 CUDA cores, performing up to 4,608 32-bit integer
operations per cycle. The size of each L1 unified cache and
L2 shared cache in the GPU is 128 KB (per SM) and 6 MB,
respectively. Even if the CPU system has two CPU sockets,
we only used one socket to compare it with the GPU system
with a single discrete GPU.
We tabulate the key parameters for HE Mul of HEAAN on
CPU and GPU in Table VI. We measured the execution time
of HE Mul, excluding time for memory operations, such as
malloc, free, and data transfers from host to the device for
GPU. We conducted each experiment 32 times and reported
the average.
VII. EVALUATION
We evaluated the effectiveness of the proposed optimiza-
tions in accelerating HEAAN mul by comparing against the
performance of the reference HEAAN (Ref). For CPU, our
optimizations exploit both intra-core and inter-core paral-
lelism. We compared the basic implementation utilizing AVX-
512 instructions (AVX), the one with the modified Shoup’s
ModMul on top of AVX (AVX-M), and the one transposing the
temp matrix on top of AVX-M (AVX-MT). In the basic GPU
implementation (GPU), we adopted radix-2 iterative NTT for
NTT and iNTT. We modified the CRT kernel of cuHE [22],
which only exploits N -degree parallelism, to exploit N ·np-
degree parallelism. Also, we used the iCRT kernel of cuHE.
We compared GPU with the followings: the implementation
optimizing CRT by using ADC instead of intermittently con-
ducting modulo operations (GPU-C), the one adjusting the
number of launching threads on top of GPU-C (GPU-CT),
the one using shared memory to pin the arrays of each thread
to L1 unified cache on top of GPU-C (GPU-CP), the one
applying loop reordering (Algo. 6) to translate a majority of
iCRT computation into matrix-matrix mul to use N ·PLimbs-
degree parallelism on top of GPU-C (GPU-CL), and the one
implemented with high-radix NTT and iNTT to reduce main
memory accesses and utilize GPU’s computing power more
efficiently on top of GPU-CL (GPU-CLH).
We made the following key observations. First, exploiting
the massive parallelism supported by modern CPUs and GPUs
gives even more than 100× performance improvement in HE
Mul. Table VII shows the execution time and the relative
speedup of the CPU and GPU implementations after applying
a series of architecture-aware optimizations. AVX-MT and
GPU-CLH, the implementations giving the best performance
for CPU and GPU, achieve 42.9× and 134.1× speedup,
respectively, compared to the single-thread Ref. GPU-CLH
performs 4.3× and 2.3× better than AVX-MT on CRT and
iCRT thanks to more ALUs populated on GPU. Also, by
reducing the main memory accesses through increasing the
radix, GPU-CLH achieves 2.35× and 2.17× performance
improvement in NTT and iNTT, respectively, compared to
AVX-MT’s implementation.
Second, our CPU implementations are highly scalable
across both intra-core and inter-core dimensions. AVX is
effective regardless of the number of CPU threads populated,
providing 2.0× and 2.7× performance gain over Ref when a
single and 24 threads are utilized, respectively (see Figure 4(a)
and (b)). Among the primary functions, NTT is the best in
scalability, leading to 7.7× speedup for AVX over Ref when
24 threads are populated. Overall, AVX experiences 19.4×
speedup when the number of populated threads increases
from 1 to 24, exhibiting a better scalability than Ref (14.8×)
because of the following reasons. AVX and Ref exploit par-
allelism in different ways for CRT as described in Section V.
In Ref, each thread operates on different prime numbers
(np-degree parallelism) where np is not large (e.g., 42 or
63), and hence Ref is more susceptible to a load imbalance
across threads. By contrast, each thread operates on different
coefficients (N -degree parallelism) in AVX, exhibiting better
scalability. For iCRT, data accesses for the matrix occur in
column direction during matrix-matrix mul, causing its perfor-
mance memory-bound because hardware prefetching becomes
ineffective. However, with 24 threads being utilized, hardware
prefetching hits more frequently because a thread might access
the data in the adjacent columns that are prefetched by other
threads, leading to better performance.
The additional optimizations applied to the AVX-512 im-
plementation are effective as well. Figure 4(c) shows the
impact of these optimizations on each major function when
24 threads are used. In AVX-M, both NTT and iNTT are 10%
faster than AVX because these functions compute modular mul
frequently. iCRT experiences a 18% speedup in AVX-MT
compared to AVX-M because the matrix transposition alle-
viates the memory-bound issue.
Third, the performance of GPU reaches close to full
potential through our microarchitecture-aware optimizations.
9
TABLE VII: Comparing the execution time of HE Mul among a single- and 24-thread reference HEAAN (Ref-1 and Ref-24),
a 24-thread optimized AVX-512 implementation (AVX-MT-24), and an optimized GPU implementation (GPU-CLH).
Execution time (ms) and [Relative speedup]
Function Ref-1 (baseline) Ref-24 AVX-MT-24 GPU-CLH
CRT 639.9 40.4 [15.8×] 17.5 [36.6×] 4.1 [156.1×]
NTT 1541.0 73.3 [21.0×] 8.7 [177.1×] 3.7 [416.5×]
iNTT 584.6 28.3 [20.7×] 10.2 [57.3×] 4.7 [124.4×]
iCRT 2126.7 130.1 [16.3×] 45.4 [46.8×] 19.4 [109.6×]
Extra 215.7 73.1 [3.0×] 37.3 [5.8×] 6.2 [34.8×]
Total 5108.0 345.3 [14.8×] 119.1 [42.9×] 38.1 [134.1×]
0
1000
2000
3000
4000
5000
6000
Ref AVX AVX-M AVX-MT
CRT NTT iNTT iCRT etc
Ex
ec
ut
io
n 
tim
e 
(m
s)
(a) Execution time on single-thread
0
50
100
150
200
250
300
350
400
Ref AVX AVX-M AVX-MT
Ex
ec
ut
io
n 
tim
e 
(m
s)
Extras
iCRT
iNTT
NTT
CRT
(b) Execution time on 24-threads
0.95
1
1.05
1.1
1.15
0.85
0.9
0.95
1
1.05
AV
X
AV
X-
M
AV
X-
M
T
AV
X
AV
X-
M
AV
X-
M
T
AV
X
AV
X-
M
AV
X-
M
T
AV
X
AV
X-
M
AV
X-
M
T
CRT NTT iNTT iCRT
Sp
ee
du
p
R
el
at
iv
e 
Ex
ec
ut
io
n 
tim
e Relative Execution time Speedup
(c) Relative execution time and speedup per function
Fig. 4: Comparing HE Mul execution time among Ref, AVX-512 implementation (AVX), and optimized AVX (AVX-M and
AVX-MT) when (a) one and (b) 24 threads are utilized, and (c) per-function speedup when using 24 threads.
0
50
100
150
200
250
300
350
400
Ref GPU GPU-C GPU-CL GPU-CLH
Ex
ec
ut
io
n 
tim
e 
(m
s) ExtrasiCRT
iNTT
NTT
CRT
(a) HE Mul execution time breakdown
0
1
2
3
4
5
6
0
3
6
9
12
15
18
GPU GPU-C
Sp
ee
du
p
Ex
ec
ut
io
n 
tim
e 
(m
s)
(b) CRT optimization
0
1
2
3
4
5
6
7
8
9
10
0
20
40
60
80
100
120
140
160
180
200
GPU GPU-CT GPU-CP GPU-CL
Sp
ee
du
p
Ex
ec
ut
io
n 
tim
e 
(m
s)
Execution time Speedup
(c) iCRT optimization
0.5
1
1.5
2
2.5
2
4
6
8
10
GPU
-CL
GPU
-CLH
GPU
-CL
GPU
-CLH
NTT INTT
Sp
ee
du
p
Ex
ec
ut
io
n 
tim
e 
(m
s)
(d) NTT/iNTT optimization
Fig. 5: Comparing HE Mul execution time (a) among Ref-24, the baseline GPU (GPU), and optimized GPU (GPU-C, GPU-CL
and GPU-CLH), and relative speedup of (b) GPU and GPU-C for CRT, (c) GPU and GPU-C[T/P/L] for iCRT, and (d)
GPU-CL and GPU-CLH for NTT/iNTT.
Figure 5(a) shows the execution time of HE Mul on vari-
ous GPU implementations compared to that of the reference
HEAAN running on CPU with 24 threads (Ref-24). Even the
baseline GPU implementation (GPU) outperforms Ref-24 by
1.52×. Most of the speedup comes from accelerating NTT and
iNTT. iCRT in GPU, whose implementation we adopt from
cuHE [22], performs poorly; it is even 1.43× slower than that
in Ref-24 and takes 81.2% of total HE Mul execution time.
To reduce the execution time of iCRT, we devised the
following optimizations and compared their performance in
Figure 5(c). By adjusting the number of launching threads,
we reduced the degree of performance impact due to cache
thrashing, achieving a speedup of 4.22× (GPU-CT) compared
to that of GPU. Pinning thread-local arrays to L1 cache
(GPU-CP) performs better than GPU-CT, reaching 6.79×
of speedup. Finally, GPU-CL was the best among all the
iCRT optimizations (9.58× of speedup), by effectively ex-
TABLE VIII: Comparing the execution time of CRT when
applying different strategies to RNS conversion. GPU-Modx
means one modulo operation is applied at every x iteration.
Time for CRT (ms) Speedup
GPU 14.95 1.00×
GPU-Mod1 16.80 0.89×
GPU-Mod2 10.47 1.43×
GPU-Mod4 7.56 1.98×
GPU-C 4.11 3.64×
ploiting N ·np-degree parallelism through the loop reordering
explained in Section V.
For CRT, performing fewer modulo operations led to a better
performance. Table VIII compared the execution time of CRT
when using different strategies to convert a BigInt number
to an array of residue numbers in an RNS domain. GPU
10
TABLE IX: Comparing the execution time of NTT and iNTT
when altering their radix values.
NTT iNTT
Time (ms) Speedup Time (ms) Speedup
Radix-2 8.73 1.00 9.76 1.00
Radix-4 4.74 1.84 5.47 1.78
Radix-16 3.65 2.39 4.88 2.00
Radix-32 3.72 2.35 4.67 2.09
TABLE X: The number of required AVX-512 instructions
(add, sub, mul, shift, and cmp) for each function with (np,
qLimbs) of (43, 19). We compared the cases where each of
64-bit mul, modular mul, and ADC is supported by emulation
and by a single native instruction.
CRT NTT iNTT iCRT
by emulation 155M 48M 47M 319M
by single native instr. 27M 17M 17M 51M
does not precompute mod(βk, pj) and performs a modulo
operation on every limb (one limb for β) of the BigInt.
GPU-Modx means applying modulo operation every x iter-
ation in the inner-most loop of line 5 in Algo. 1. Even if
using the precomputed table, conducting a modulo operation
per iteration (GPU-Mod1) performs even worse than GPU by
1.1×. This is because both implementations compute the same
number of modulo operations whereas GPU-Mod1 imposes
more pressure on local memory utilization. Fewer modulo
operations led to better performance; GPU-Mod4 performs
1.98× better than GPU. By letting the partial sum (accum)
span three words instead of two words utilizing ADC in every
iteration, GPU-C performs best with the speedup of 3.64×.
Figure 5(d) shows the performance improvement of NTT
and iNTT by increasing their radix values. Because NTT and
iNTT algorithms are mostly symmetrical, they experience a
similar degree of performance improvement with 2.35× (NTT)
and 2.09× (iNTT). iNTT improves slightly less because
iNTT includes additional computations such as element-wise
division by N , which does not gain speedup from using
higher radix. Table IX shows the execution time of NTT and
iNTT when varying the radix values. As the radix grows
from 2 to 16, the performance of NTT and iNTT increases,
taking advantage of reducing the memory accesses and hence
alleviating the memory bandwidth bottleneck. However, the
speedup saturates when the radix rises from 16 to 32 because
of the register spilling to local memory; when radix increases
beyond 32, the performance deteriorates due to the higher
register pressure.
VIII. DISCUSSION
The impact of supporting the emulated operations natively
on AVX-512: As described in Section V, the cost of emulating
64-bit mul, modular mul, and ADC operations is significant
in AVX-512 implementations. If some future CPUs support
these instructions natively, the execution time of HE Mul could
0.0001
0.001
0.01
0.1
1
10
0%
20%
40%
60%
80%
100%
150 300 600 1200 2400 R
el
at
iv
e 
# 
of
 o
pe
ra
tio
ns
O
pe
ra
tio
n 
ra
tio
 (%
)
logQ
CRT NTT iNTT iCRT Relative # of operations
Fig. 6: Distribution on the number of operations across func-
tions and the relative number of operations for HE Mul on
various logQ values. The reference logQ is 1,200.
be reduced substantially. Table X summarizes the number of
AVX-512 instructions required to perform each major function
by comparing the cases where each of 64-bit mul, modular
mul, and ADC is supported by either emulation and by a single
native instruction. CRT and iCRT require just 17.3% and
15.8% of AVX-512 instructions if a future CPU supports these
instructions natively, not through emulation. NTT and iNTT
require one third of instructions by the instruction extension.
Because HE Mul is mostly computation-bound, the significant
reduction in the number of instructions would lead to a similar
degree of performance improvement.
Impact of Q on the characteristics of HE Mul: Q determines
multiplicative depth L; a larger depth requires a bigger Q.
However, N must increase proportionally to logQ to ensure
a certain level of security (see Table II). Also, qLimbs,
np, and PLimbs increase in proportion to logQ. Based on
these relationships, the computational complexity of Table IV
can be expressed in terms of logQ. The complexity of CRT
and iCRT is O((logQ)3) whereas that of NTT and iNTT
is O(log(logQ) · (logQ)2). Figure 6 shows the estimated
number of operations for HE Mul according to logQ. When
Q is small (e.g., logQ=150), CRT, NTT, iNTT, and iCRT
require a similar number of operations, but as Q increases,
CRT and iCRT become more dominant. Overall, the total
number of operations for HE Mul is proportional to (logQ)3.
When an application requires a large number (e.g., billions)
of HE Mul, using large Q amortizes the cost of the expensive
bootstrapping operation. However, using too large Q is costly
because the maximum number of messages (n) that can be
multiplied together by a HE Mul is N/2, where the complexity
of a HE Mul is super-linear, O((logQ)3) = O(N3). The Q
value we mainly target is 1,200, which is large enough to
amortize the cost of bootstrapping; other HE accelerators [47],
[50] focused on much smaller Q values. For example, [50]
used the Q value of 180 without considering bootstrapping.
Impact of q on the characteristics of HE Mul: As de-
scribed in Section III, rescaling, which decreases logq by
logp, is performed after each HE Mul to prevent the amount
of message information in the ciphertext from increasing
exponentially. As HE Mul is repeated, q decreases, so does
qLimbs and np. In region 1 of HE Mul, np decreases linearly
with logq because two logq2-bit BitInt numbers are multiplied
11
015
30
45
60
75
90
0
0.2
0.4
0.6
0.8
1
1.2
12
00
11
40
10
80
10
20 96
0
90
0
84
0
78
0
72
0
66
0
60
0
54
0
48
0
42
0
36
0
30
0
24
0
18
0
12
0 60
N
um
be
r o
f p
rim
es
R
el
at
iv
e 
# 
of
 o
pe
ra
tio
ns
log q
CRT NTT iNTT iCRT np (Region 1) np (Region 2)
Fig. 7: The size of np and the relative number of operations
per function according to logq.
TABLE XI: Comparing CKKS schemes
CKKS type Full-RNS Non-RNS[2], [3], [13] [17], [32]
Parameter (p and Q) selection Rigid Flexible
Multiplicative depth Lower Deeper
HE operation speed Faster Slower
Main function of HE Mul NTT, iNTT, CRT, iCRT,mod up/down NTT, iNTT
per ciphertext coefficient. By contrast, in region 2, np should
be set to represent (logq + logQ2)-bit BigInt (to multiply
over the evaluation key polynomial). PLimbs has the same
trend as np. Figure 7 shows the computation amount of HE
Mul according to the logq in the AVX-512 configuration of
Section VI calculated based on Table IV. As np is proportional
to (logq + logQ2) in region 2, the number of operations for
HE Mul when logq becomes 30 (the smallest number where
no more HE Mul is applicable) is still 24% of that when logq
is 1,200. Also, iCRT is dominant regardless the size of q.
The applicability of optimization techniques to other li-
braries supporting CKKS: In addition to HEAAN, there are
several other libraries [2], [3], [13], [32] that support CKKS.
The libraries fall into one of the two groups: libraries sup-
porting full-RNS and non-RNS CKKS, as shown in Table XI.
Non-RNS types manage each coefficient of polynomials of
ciphertext in a big integer domain. By contrast, full-RNS types
let the coefficient stay in an RNS domain during the whole
procedure of homomorphic operations, resulting in faster oper-
ations. However, using full-RNS is less flexible because there
exist rigid limitations while choosing the rescaling factor p and
the modulus Q. Each prime composing an RNS representation
of Q in full-RNS CKKS should be set to be close to the
rescaling factor p [15]. In non-RNS CKKS, by contrast, one
can freely choose a rescaling factor p independent of Q
without considering the approximation error existing in full-
RNS CKKS. Moreover, due to these parameter limitations,
the multiplicative depth of full-RNS variants is lower than an
non-RNS scheme for a given security bit and error bound.
For the libraries supporting CKKS other than HEAAN, our
optimizations described in Section V are partially applicable.
Other non-RNS CKKS libraries can apply our optimization
techniques because they use the same primary functions as
HEAAN. Also, although full-RNS variants do not require CRT
and iCRT, the optimization techniques for NTT and iNTT
can be applied. Also, we can partially apply our techniques
in CRT to the functions that change the number of primes
in the RNS domain (mod up/down in Table XI). mod up
increases the number of primes of an RNS representation of
a given big integer, whereas mod down decreases it with
an additional division operation [9], [15]. Both functions can
take advantage of our optimizations as their core functions are
similar to applying CRT right after iCRT.
IX. RELATED WORK
FPGA-based HE accelerators: There have been numerous
studies [21], [25], [45], [47], [49], [50] to accelerate HE oper-
ations using FPGA. [21], [25], [45] accelerate LTV-based FHE
schemes whereas [49], [50] acclerate FV-based FHE schemes.
However, LTV and FV schemes have a limitation in practical
use because they cannot perform approximate computations.
HEAX [47] uses FPGA to accelerate Microsoft SEAL, which
supports a full-RNS variant of the CKKS scheme; however,
HEAX considers only small parameter sizes (Q ≤ 438 and
N ≤ 214), and the full-RNS variant it targets is not as versatile
as the original HEAAN we accelerate in this paper due to the
limitations in choosing rescaling factors and prime numbers.
GPU libraries for HE: [4], [7], [8], [22] propose to
accelerate the HE operations using GPUs. However, they either
do not take advantage of the algorithm’s internal parallelism
sufficiently, operate on only small or limited parameters, or
do not consider the cost of modulo operations in GPUs.
Moreover, all the aforementioned studies did not conduct a
rigorous analysis of computational complexity and data access
patterns of HE Mul, making it hard to assess the effectiveness
of the proposed accelerators compared to the CPU or GPU
implementations applying architecture-aware optimizations.
X. CONCLUSION
We have demystified the key operations of HEAAN, a
representative and popular FHE scheme, from a computer
architect’s perspective. After identifying that multiplying the
encrypted data (ciphertext) is the most computationally de-
manding, we accelerated the major functions of HE Mul
(CRT, NTT, iNTT, and iCRT) on CPU and GPU. To ac-
celerate the major functions on CPU, we populate multiple
cores by using multi-threading (inter-core parallelism) and
AVX-512 instructions (intra-core parallelism). We accelerate
HE Mul on GPU by effectively exploiting massive thread-
level parallelism. Moreover, based on the in-depth analysis
of the major functions for HE Mul, we introduced a series
of architecture-aware optimization techniques such as loop
reordering and matrix transposition for iCRT and taking a
synergy between precomputation and delayed modulo opera-
tions for CRT. Our accelerated HEAAN on CPU and GPU
outperforms the reference single-thread HEAAN on CPU by
42.9× and 134.1×, respectively, in HE Mul, setting a new
baseline for HE acceleration studies targeting practical usage.
12
REFERENCES
[1] “HEAAN with Faster Multiplication,” https://github.com/snucrypto/
HEAAN/releases/tag/2.1, Sep. 2018.
[2] “Lattigo 1.3.1,” Online: http://github.com/ldsec/lattigo, Feb. 2020, ePFL-
LDS.
[3] “PALISADE Lattice Cryptography Library (release 1.7.4),” https://
palisade-crypto.org/, Jan. 2020.
[4] P. Alves and D. Aranha, “Efficient GPGPU Implementation of the
Leveled Fully Homomorphic Encryption Scheme YASHE,” Tech. Rep.,
Jun. 2016.
[5] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski,
G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “A View of
Cloud Computing,” Communications of the ACM, vol. 53, no. 4, 2010.
[6] A. A. Badawi, J. Chao, J. Lin, C. F. Mun, S. J. Jie, B. H. M. Tan,
X. Nan, K. M. M. Aung, and V. R. Chandrasekhar, “The AlexNet
moment for homomorphic encryption: HCNN, the first homomorphic
CNN on encrypted data with GPUs,” arXiv:1811.00778, 2018.
[7] A. A. Badawi, Y. Polyakov, K. M. M. Aung, B. Veeravalli, and
K. Rohloff, “Implementation and Performance Evaluation of RNS Vari-
ants of the BFV Homomorphic Encryption Scheme,” IEEE Transactions
on Emerging Topics in Computing, 2019.
[8] A. A. Badawi, B. Veeravalli, C. F. Mun, and K. M. M. Aung,
“High-Performance FV Somewhat Homomorphic Encryption on GPUs:
An Implementation Using CUDA,” The International Association for
Cryptologic Research Transactions on Cryptographic Hardware and
Embedded Systems, vol. 2018, no. 2, 2018.
[9] J.-C. Bajard, J. Eynard, M. A. Hasan, and V. Zucca, “A Full RNS
Variant of FV like Somewhat Homomorphic Encryption Schemes,” in
International Conference on Selected Areas in Cryptography, 2016.
[10] D. J. Bernstein, “Fast Multiplication and Its Applications,” International
Algorithmic Number Theory Symposium, vol. 44, 2008.
[11] F. Boemer, A. Costache, R. Cammarota, and C. Wierzynski, “nGraph-
HE2: A High-Throughput Framework for Neural Network Inference on
Encrypted Data,” in Proceedings of the 7th ACM Workshop on Encrypted
Computing & Applied Homomorphic Cryptography, 2019.
[12] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, “(Leveled) Fully Ho-
momorphic Encryption without Bootstrapping,” ACM Transactions on
Computation Theory (TOCT), vol. 6, no. 3, 2014.
[13] H. Chen, K. Laine, and R. Player, “Simple Encrypted Arithmetic
Library-SEAL v2.1,” in International Conference on Financial Cryp-
tography and Data Security, 2017.
[14] J. Cheon, K. Han, and M. Hhan, “Faster Homomorphic Discrete Fourier
Transforms and Improved FHE Bootstrapping,” The International Asso-
ciation for Cryptologic Research Cryptology ePrint Archive, vol. 2018,
2018.
[15] J. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, “A Full RNS Variant
of Approximate Homomorphic Encryption,” in International Conference
on Selected Areas in Cryptography, 2018.
[16] J. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, “Bootstrapping for
Approximate Homomorphic Encryption,” in Annual International Con-
ference on the Theory and Applications of Cryptographic Techniques,
2018.
[17] J. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic Encryption
for Arithmetic of Approximate Numbers,” in International Conference
on the Theory and Application of Cryptology and Information Security,
2017.
[18] I. Chillotti, N. Gama, M. Georgieva, and M. Izabache`ne, “TFHE: Fast
Fully Homomorphic Encryption Over the Torus,” Journal of Cryptology,
2018.
[19] E. Chou, J. Beal, D. Levy, S. Yeung, A. Haque, and L. Fei-Fei, “Faster
CryptoNets: Leveraging Sparsity for Real-World Encrypted Inference,”
arXiv:1811.09953, 2018.
[20] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine
Calculation of Complex Fourier Series,” Mathematics of computation,
vol. 19, no. 90, 1965.
[21] D. B. Cousins, K. Rohloff, and D. Sumorok, “Designing an FPGA-
Accelerated Homomorphic Encryption Co-Processor,” IEEE Transac-
tions on Emerging Topics in Computing, vol. 5, no. 2, 2016.
[22] W. Dai and B. Sunar, “cuHE: A Homomorphic Encryption Accelerator
Library,” in International Conference on Cryptography and Information
Security, 2015.
[23] T. Dillon, C. Wu, and E. Chang, “Cloud Computing: Issues and
Challenges,” in IEEE International Conference on Advanced Information
Networking and Applications, 2010.
[24] P. Dingyi, S. Arto, and D. Cunsheng, Chinese Remainder Theorem:
Applications in Computing, Coding, Cryptography. World Scientific,
1996.
[25] Y. Doro¨z, E. O¨ztu¨rk, E. Savas¸, and B. Sunar, “Accelerating LTV Based
Homomorphic Encryption in Reconfigurable Hardware,” in International
Workshop on Cryptographic Hardware and Embedded Systems, 2015.
[26] J. Doweck, W. Kao, A. K. Lu, J. Mandelblat, A. Rahatekar, L. Rap-
poport, E. Rotem, A. Yasin, and A. Yoaz, “Inside 6th-Generation
Intel Core: New Microarchitecture Code-Named Skylake,” IEEE Micro,
vol. 37, no. 2, 2017.
[27] N. Dowlin, R. Gilad-Bachrach, K. Laine, K. Lauter, M. Naehrig, and
J. Wernsing, “CryptoNets: Applying Neural Networks to Encrypted
Data with High Throughput and Accuracy,” in Proceedings of the
33rd International Conference on International Conference on Machine
Learning, 2016.
[28] L. Ducas and D. Micciancio, “FHEW: Bootstrapping Homomorphic
Encryption in Less than a Second,” in Annual International Conference
on the Theory and Applications of Cryptographic Techniques, 2015.
[29] T. ElGamal, “A Public Key Cryptosystem and a Signature Scheme Based
on Discrete Logarithms,” IEEE transactions on information theory,
vol. 31, no. 4, 1985.
[30] J. Fan and F. Vercauteren, “Somewhat Practical Fully Homomorphic
Encryption,” The International Association for Cryptologic Research
Cryptology ePrint Archive, vol. 2012, 2012.
[31] C. Gentry, “A fully homomorphic encryption scheme,” Ph.D. disserta-
tion, 2009.
[32] S. Halevi and V. Shoup, “HElib: An Implementation of homomor-
phic encryption,” https://github.com/homenc/HElib/releases/tag/v1.0.0,
Jan. 2020.
[33] M. Harris, “CUDA Pro Tip: Write Flexible Kernels with Grid-
Stride Loops,” https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-
kernels-grid-stride-loops, Apr. 2013.
[34] D. Harvey, “Faster Arithmetic for Number-Theoretic Transforms,” Jour-
nal of Symbolic Computation, vol. 60, 2014.
[35] B. Hayes, “Cloud Computing,” Communications of the ACM, vol. 51,
no. 7, 2008.
[36] iDASH Privacy Protection Challenge 2018, “Secure Genome Analysis
Competition,” http://www.humangenomeprivacy.org/2018/, 2018.
[37] iDASH Privacy Protection Challenge 2019, “Secure Genome Analysis
Competition,” http://www.humangenomeprivacy.org/2019/, 2019.
[38] A. Kerr, D. Merrill, J. Demouth, and J. Tran, “CUTLASS: Fast Lin-
ear Algebra in CUDA C++,” https://devblogs.nvidia.com/cutlass-linear-
algebra-cuda, Dec. 2017.
[39] A. Kim, Y. Song, M. Kim, K. Lee, and J. Cheon, “Logistic Regression
Model Training Based on the Approximate Homomorphic Encryption,”
BioMed Central medical genomics, vol. 11, no. 4, 2018.
[40] A. Lo´pez-Alt, E. Tromer, and V. Vaikuntanathan, “On-the-Fly Multiparty
Computation on the Cloud via Multikey Fully Homomorphic Encryp-
tion,” in Proceedings of the forty-fourth annual ACM symposium on
Theory of computing, 2012.
[41] NVIDIA Corporation, “NVIDIA Tesla V100 GPU Architecture,” Tech.
Rep., Aug. 2017.
[42] NVIDIA Corporation, “NVIDIA Turing GPU Architecture,” Tech. Rep.,
Sep. 2018.
[43] NVIDIA Corporation, “CUDA C++ Programming Guide,” https://docs.
nvidia.com/cuda/cuda-c-programming-guide, Aug. 2019.
[44] NVIDIA Corporation, “Parallel Thread Execution ISA,” https://docs.
nvidia.com/cuda/parallel-thread-execution, Aug. 2019.
[45] E. O¨ztu¨rk, Y. Doro¨z, E. Savas¸, and B. Sunar, “A Custom Accelerator
for Homomorphic Encryption Applications,” IEEE Transactions on
Computers, vol. 66, no. 1, 2016.
[46] P. Paillier, “Public-Key Cryptosystems Based on Composite Degree
Residuosity Classes,” in International Conference on the Theory and
Applications of Cryptographic Techniques, 1999.
[47] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “HEAX: High-
Performance Architecture for Computation on Homomorphically En-
crypted Data in the Cloud,” arXiv:1909.09731, 2019.
[48] R. L. Rivest, L. Adleman, and M. L. Dertouzos, “On Data Banks and
Privacy Homomorphisms,” Foundations of Secure Computation, vol. 4,
no. 11, 1978.
13
[49] S. S. Roy, K. Ja¨rvinen, J. Vliegen, F. Vercauteren, and I. Verbauwhede,
“HEPCloud: An FPGA-Based Multicore Processor for FV Somewhat
Homomorphic Function Evaluation,” IEEE Transactions on Computers,
vol. 67, no. 11, 2018.
[50] S. S. Roy, F. Turan, K. Ja¨rvinen, F. Vercauteren, and I. Verbauwhede,
“FPGA-Based High-Performance Parallel Architecture for Homomor-
phic Computing on Encrypted Data,” in IEEE International Symposium
on High Performance Computer Architecture (HPCA), 2019.
[51] S. Subashini and V. Kavitha, “A Survey on Security Issues in Service
Delivery Models of Cloud Computing,” Journal of Network and Com-
puter Applications, vol. 34, no. 1, 2011.
14
