Fast Arithmetic Hardware Library For RLWE-Based Homomorphic Encryption by Agrawal, Rashmi et al.
arXiv:2007.01648v1 [cs.CR] 3 Jul 2020
Fast Arithmetic Hardware Library For
RLWE-Based Homomorphic Encryption
Rashmi Agrawal1, Lake Bu2, Alan Ehret1 and Michel A. Kinsy1
1 Adaptive and Secure Computing Systems (ASCS) Laboratory, Boston University
rashmi23,ehretaj,mkinsy@bu.edu
2 The Charles Stark Draper Laboratory Inc., USA, lbu@draper.com
Abstract. With billions of devices connected over the internet, the rise of sensor-
based electronic devices has led to the use of cloud computing as a commodity tech-
nology service. These sensor-based devices are often small and limited by power,
storage, or compute capabilities; hence, they obtain these capabilities via cloud services.
However, this heightens data privacy concerns, as sensitive data is stored and
processed in the cloud, which is usually a shared resource. Homomorphic
encryption can be used along with cloud services to perform computations on
encrypted data, guaranteeing data privacy. While work on improving homomorphic
encryption has made it practical, it is still several orders of magnitude too slow
to be cost effective and feasible. In this work, we propose an open-source,
first-of-its-kind arithmetic hardware library with a focus on accelerating the arithmetic
operations involved in Ring Learning with Errors (RLWE)-based somewhat
homomorphic encryption (SHE). We design and implement a hardware accelerator
consisting of submodules like Residue Number System (RNS), Chinese Remainder
Theorem (CRT), NTT-based polynomial multiplication, modulo inverse, modulo re-
duction, and all the other polynomial and scalar operations involved in SHE. For
all of these operations, wherever possible, we include a hardware-cost efficient serial
and a fast parallel implementation in the library. A modular and parameterized
design approach helps in easy customization and also provides flexibility to extend
these operations for use in most homomorphic encryption applications that fit well
into emerging FPGA-equipped cloud architectures. Using the submodules from the
library, we prototype a hardware accelerator on FPGA. The evaluation of this hard-
ware accelerator shows a speed up of approximately 4200× and 2950× to evaluate a
homomorphic multiplication and addition respectively when compared to an existing
software implementation.
Keywords: Homomorphic Encryption, RNS, CRT, Modulo Reduction, Barrett Re-
duction, NTT, Relinearisation.
1 Introduction
As the internet becomes easily accessible, almost all electronic devices collect enormous
amounts of private and sensitive data from routine activities. These electronic devices
may be as small as wearable electronics [PJ03], like a smart watch collecting personal
health information or a cell phone collecting location information, or may be as large as
an IoT-based smart home [BFZY11] [KJBB16] collecting routine information like room
temperature, door status (open or closed), smart meter reading, and other such details.
These electronic devices have limited power, storage, and compute capabilities and often
need external support to process the collected information. Cloud computing [MG+11]
[Hay08] provides a convenient means not only to store the collected information but also
to apply various compute functions to this stored data. The processed information can
be used easily for various purposes, like machine learning predictions.
As cloud computing services become readily available and affordable, many industrial
sectors have begun to use cloud services instead of setting up their own infrastructure.
Sectors like automotive production, education, finance, banking, health care, manufactur-
ing, and many more leverage cloud services for some or all of their storage and computing
needs. This, in turn, leads to the storage of a lot of sensitive data in the cloud. Hence,
while cloud computing provides the convenience of sharing resources and compute capabili-
ties for individuals and business owners equally, it brings its own challenges in maintaining
data privacy [PH10] [KIA11]. A cloud owner has access to all of the private data pertain-
ing to the clients and can also observe the computations being carried out on this private
data. Moreover, cloud services are shared among many clients; therefore, even if a client
may assume that the cloud service provider is honest and will ensure that there is no data
breach in their environment, the chances of data leakage remain high, due to the shared
storage space or compute node on the cloud.
Homomorphic encryption [G+09] is a ground-breaking technique to enable secure pri-
vate cloud storage and computation services. Homomorphic encryption allows evaluating
functions on encrypted data to generate an encrypted result. This result, when decrypted,
matches the result of the same operations performed on the unencrypted data. Thus, a
data owner can encrypt the data and then send it to the cloud for processing. The cloud
running the homomorphic encryption-based services will perform computations on the
encrypted data and send the results back to the data owner. The data owner, having
access to the private key, performs the decryption and obtains the result. The cloud does
not have access to the private key or the plain data, and, hence, the security concerns
related to private data processing on the cloud can be mitigated. An illustrative scenario
is shown in Figure 1.
[Figure: the data owner (client) encrypts data with their own key and sends it to a third-party cloud service provider; the cloud server, running a homomorphic encryption-based data processing platform, processes the data in encrypted form and returns it to the client; the client decrypts and reads the computation results.]
Figure 1: Third-party cloud service provider with Homomorphic Encryption.
The idea of homomorphic encryption was first proposed in 1978 by Rivest et al.
[RAD+78]. In 2009, Gentry’s seminal work [G+09] provided a framework to make fully
homomorphic encryption feasible, and almost a decade’s work has now made it practical
[NLV11]. While homomorphic encryption has become realistic, it still remains several
orders of magnitude too slow, making it expensive and resource intensive. There are no existing
homomorphic encryption schemes with performance levels that would allow large-scale
practical usage. Substantial efforts have been put forward to develop full-fledged soft-
ware libraries for homomorphic encryption. Such libraries include SEAL [CLP17], Pal-
isade [Tec19], cuHE [DS15], HElib [HS14], NFLLib [AMBG+16], Lattigo [LT219], and
HEAAN [Kim18]. All of these libraries are built on RLWE-based encryption,
and they generally implement the Brakerski-Gentry-Vaikuntanathan (BGV) [BGV14],
Fan-Vercauteren (FV) [FV12], and Cheon-Kim-Kim-Song (CKKS) [CKKS17] homomorphic
encryption schemes with very similar parameters.
Although the software implementations are impressive, they still cannot achieve
the required performance, as they are limited by the underlying hardware. For example,
Gentry et al. [GHS12], in their homomorphic evaluation of an AES circuit, reported
approximately 48 hours of execution time on an Intel Xeon CPU running at 2.0GHz. Even
their parallel SIMD-style implementation took around 40 minutes per block to evaluate
54 AES blocks. A modern Intel Xeon CPU takes about 20ns to perform a regular AES
block encryption; hence, homomorphic evaluation of an AES block is
about $1.2 \times 10^{11}$ times slower than a regular evaluation. Similarly, logistic regression, a
popular machine learning tool, is often used to make predictions using a client’s private data
in the cloud. A software-based homomorphic logistic regression prediction takes about 1.6
hours, while a regular logistic prediction takes about 95ns.
If homomorphic encryption’s full potential and power can be unleashed by realizing
the required performance levels, it will make cloud computing more reliable via enhanced
trust of service providers and their mechanisms for protecting users’ data. Hence, there
is a need to accelerate the homomorphic encryption operation directly on the hardware
to achieve maximum throughput with a low latency. With this in mind, we propose
an arithmetic hardware library that includes the major arithmetic operations involved
in homomorphic encryption. A hardware accelerator designed using the modules from
this library can reduce the computational time for HE operations. To lower the power
usage and improve performance, new cloud architectures integrate FPGAs to offload and
accelerate compute tasks such as deep learning, encryption, and video conversion. The
FPGA-based design and optimization approach introduced in this work fits into this class
of FPGA-equipped cloud architectures.
The key contributions of the work are as follows:
• A fast and hardware-cost efficient hardware arithmetic library to individually acceler-
ate all operations within homomorphic encryption. A speedup of 4200× and 2950×
is observed to evaluate homomorphic multiplication and addition respectively.
• An open-source, FPGA-board agnostic, parameterized design implementation of the
modules to provide flexibility to adjust parameters so as to meet the desired security
levels, hardware cost and multiplication depth.
• A modular and hierarchical implementation of a hardware accelerator using the
modules of the proposed arithmetic library to demonstrate the speedup achievable
in hardware.
The rest of the paper is organized as follows. In Section 2, we briefly present the
underlying scheme and discuss the required arithmetic operations. Section 3 introduces
these arithmetic operations and their efficient implementation. In Section 4, we evaluate
the associated hardware cost and latency and then conclude the paper in Section 5 along
with future work.
2 Homomorphic Encryption
To present the hardware library, we first start by introducing the underlying RLWE-based
homomorphic encryption scheme. For this purpose, we chose the Fan-Vercauteren (FV)
[FV12] scheme, as it has more controlled noise growth while performing homomorphic op-
erations when compared to approaches like the BGV (Brakerski-Gentry-Vaikuntanathan)
scheme [BGV14]. Moreover, Costache et al. [CLP] presented results showing that the FV scheme
outperforms BGV for large plaintext moduli.
2.1 FV Scheme
The FV scheme operates in the ring $R = \mathbb{Z}_Q[x]/\langle f(x) \rangle$, with $f(x) = \phi_d(x)$ the $d$th
cyclotomic polynomial. The plaintext $m$ is chosen in the ring $R_t$ for some small $t$ and a
ciphertext consists of only one element in the ring $R_Q$ for a large integer $Q$. The security
of the scheme is governed by the degree of this polynomial $f(x)$ and the size of $Q$.
The secret key, $s_k$, is sampled from the ring $R_2$ or a Gaussian distribution, $\chi_k$. The
public key, $a$, is sampled from the ring $R_Q$ and the error vector, $e$, is sampled from a
second Gaussian distribution, $\chi_{err}$. The other public key, $b$, is computed as follows:
$$b = [-(a \cdot s + e)]_{R_Q} \tag{1}$$
Encryption of the plaintext message yields a pair of ciphertexts as follows:
$$ct = \left([b \cdot r_0 + r_2 + t \cdot m]_{R_Q},\ [a \cdot r_0 + r_1]_{R_Q}\right) \tag{2}$$
The homomorphic addition operation adds two such pairs of ciphertexts:
$$(c_0, c_1) = \left([ct_1[0] + ct_2[0]]_{R_Q},\ [ct_1[1] + ct_2[1]]_{R_Q}\right) \tag{3}$$
After the addition operation, decryption is done as:
$$m_1 + m_2 = \left[\left\lfloor \frac{[c_0 + c_1 \cdot s_k]}{t} \right\rfloor\right]_{R_Q} \tag{4}$$
The homomorphic multiplication operation multiplies two such pairs of ciphertexts
using the following equations:
$$\begin{aligned}
c_0 &= [ct_1[0] \cdot ct_2[0]]_{R_Q} \\
c_1 &= [ct_1[0] \cdot ct_2[1] + ct_1[1] \cdot ct_2[0]]_{R_Q} \\
c_2 &= [ct_1[1] \cdot ct_2[1]]_{R_Q}
\end{aligned} \tag{5}$$
Decryption, after the multiplication operation, using the secret key is computed as:
$$m_1 \cdot m_2 = \left[\left\lfloor \frac{[c_0 + c_1 \cdot s_k + c_2 \cdot s_k^2]}{t} \right\rfloor\right]_{R_Q} \tag{6}$$
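The reason equation (6) needs the $s_k^2$ term can be checked numerically. The Python sketch below is an illustration only: the toy ring $\mathbb{Z}_{97}[x]/\langle x^4 + 1 \rangle$, the secret key, and the two "ciphertexts" are hypothetical values chosen for readability (they are arbitrary ring elements, not real encryptions, and no noise or scaling by $t$ is modelled). It implements equations (3) and (5) with schoolbook arithmetic and shows that evaluating the degree-2 result at the key gives the product of the two individual evaluations.

```python
# Toy ring R_Q = Z_Q[x]/(x^N + 1). Q and N are illustrative assumptions,
# far too small to offer any security.
Q, N = 97, 4

def poly_add(a, b):
    """Coefficient-wise addition in R_Q."""
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_mul(a, b):
    """Schoolbook multiplication modulo x^N + 1 (x^N wraps around as -1)."""
    out = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + x * y) % Q
            else:
                out[k - N] = (out[k - N] - x * y) % Q
    return out

def he_add(ct1, ct2):
    """Equation (3): component-wise addition of two ciphertext pairs."""
    return (poly_add(ct1[0], ct2[0]), poly_add(ct1[1], ct2[1]))

def he_mul(ct1, ct2):
    """Equation (5): the three components of the degree-2 product."""
    c0 = poly_mul(ct1[0], ct2[0])
    c1 = poly_add(poly_mul(ct1[0], ct2[1]), poly_mul(ct1[1], ct2[0]))
    c2 = poly_mul(ct1[1], ct2[1])
    return (c0, c1, c2)

def evaluate_at_key(ct, s):
    """Compute ct[0] + ct[1]*s + ct[2]*s^2 + ... in R_Q."""
    acc = [0] * N
    power = [1] + [0] * (N - 1)  # the constant polynomial 1
    for component in ct:
        acc = poly_add(acc, poly_mul(component, power))
        power = poly_mul(power, s)
    return acc

s = [1, 0, 1, 1]                          # hypothetical binary secret key
ct1 = ([5, 17, 42, 8], [90, 3, 11, 60])   # arbitrary stand-ins for ciphertexts
ct2 = ([33, 7, 1, 12], [4, 88, 25, 19])

lhs = evaluate_at_key(he_mul(ct1, ct2), s)
rhs = poly_mul(evaluate_at_key(ct1, s), evaluate_at_key(ct2, s))
print(lhs == rhs)
```

The same identity is what relinearisation has to preserve when it removes the $c_2$ component.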
Since after the multiplication operation a degree 2 ciphertext is obtained, to continue
further multiplication operations this degree 2 ciphertext needs to be reduced to a degree
1 ciphertext. In the FV scheme, this is achieved by performing a relinearisation operation.
The scheme facilitates two different approaches for performing relinearisation. Relineari-
sation version 1 operation consists of generating the relinearisation key, decomposing c2 to
limit the noise explosion, and then conversion to a degree 1 ciphertext by using generated
relinearisation keys and the decomposed c2. The relinearisation keys are generated as
follows:
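$$rlk = \left[\left([-(a_i \cdot s + e_i) + T^i \cdot s^2]_{R_Q},\ a_i\right) : i \in [0..\ell]\right] \tag{7}$$
Here, $T$ is independent of $t$ and $\ell = \lfloor \log_T(Q) \rfloor$. Decomposition of $c_2$ involves rewriting
$c_2$ in base $T$ and can be computed using the following equation:
$$c_2 = \sum_{i=0}^{\ell} T^i \cdot c_2^{(i)} \tag{8}$$
Next the relinearisation operation can be performed as follows:
$$\begin{aligned}
c'_0 &= \left[c_0 + \sum_{i=0}^{\ell} rlk[i][0] \cdot c_2^{(i)}\right]_{R_Q} \\
c'_1 &= \left[c_1 + \sum_{i=0}^{\ell} rlk[i][1] \cdot c_2^{(i)}\right]_{R_Q}
\end{aligned} \tag{9}$$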
Since we have obtained a degree 1 ciphertext after the relinearisation operation, the
decryption can be performed without the $s_k^2$ term that appears in equation 6. Therefore, the
decryption operation simplifies to the equation:
$$m_1 \cdot m_2 = c'_0 + c'_1 \cdot s_k \tag{10}$$
Note that the choice of T will determine the size of relinearisation keys and the noise
growth during the relinearisation operation. The larger the value of T , the smaller the
relinearisation keys will be, but the noise introduced by relinearisation will be higher. And
the smaller the value of T , the larger the relinearisation keys will be, with smaller noise
introduction. So the value of T must be picked in a balanced way.
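The trade-off in the choice of $T$ can be made concrete with the base-$T$ decomposition of equation (8). The following Python sketch is only an illustration: the modulus `Q`, the coefficient list `c2`, and the helper names are hypothetical, and coefficients are treated as plain non-negative integers below $Q$. It splits each coefficient into $\ell + 1$ digits and shows that a larger $T$ yields fewer components (hence fewer relinearisation keys) but larger digits.

```python
def floor_log(Q, T):
    """Largest ell with T**ell <= Q, i.e. floor(log_T(Q)), using integers only."""
    ell = 0
    while T ** (ell + 1) <= Q:
        ell += 1
    return ell

def decompose(poly, T, ell):
    """Rewrite every coefficient in base T; returns ell + 1 digit polynomials."""
    parts = []
    rest = list(poly)
    for _ in range(ell + 1):
        parts.append([c % T for c in rest])   # current base-T digit
        rest = [c // T for c in rest]          # shift to the next digit
    return parts

def recompose(parts, T):
    """Equation (8): sum of T^i times the i-th digit polynomial."""
    n = len(parts[0])
    return [sum(p[j] * T ** i for i, p in enumerate(parts)) for j in range(n)]

Q = 2 ** 20 + 7                    # hypothetical small modulus
c2 = [1048582, 123456, 0, 77]      # hypothetical coefficients, all below Q

for T in (2 ** 4, 2 ** 10):
    ell = floor_log(Q, T)
    parts = decompose(c2, T, ell)
    largest_digit = max(max(p) for p in parts)
    print(T, len(parts), largest_digit, recompose(parts, T) == c2)
```

Because $T^{\ell+1} > Q$ by the definition of $\ell$, the $\ell + 1$ digits are always enough to represent any coefficient below $Q$.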
Relinearisation version 2 is a modified form of modulus switching and hence requires
choosing a second modulus $p$ such that $p \geq Q^3$ for small enough error samples. Now, the
relinearisation keys can be generated as follows:
$$rlk = \left([-(a \cdot s + e) + p \cdot s^2]_{R_{p \cdot Q}},\ a\right) \tag{11}$$
Here, $a \in R_{p \cdot Q}$ and $e \leftarrow \chi'_{err}$. We can perform the relinearisation using the following
computation:
$$\begin{aligned}
c'_0 &= c_0 + \left[\left\lfloor \frac{c_2 \cdot rlk[0]}{p} \right\rceil\right]_{R_Q} \\
c'_1 &= c_1 + \left[\left\lfloor \frac{c_2 \cdot rlk[1]}{p} \right\rceil\right]_{R_Q}
\end{aligned} \tag{12}$$
Once the $c_2$ component is removed, we can perform decryption using equation 10.
2.2 Required Operations
Our proposed arithmetic library includes highly optimized hardware-based implementations
of Residue Number System (RNS), Chinese Remainder Theorem (CRT), modulo
inverse, fast polynomial multiplication using Number Theoretic Transform (NTT), polynomial
addition, modulo reduction, Gaussian noise sampler and relinearisation operations.
While implementing these operations for our arithmetic library, the design choices are
highly motivated by the parameter selection. This is because an RLWE-based encryption
scheme requires adding a small noise vector to obfuscate the plaintext message as shown
previously in equations 1 and 2. While performing homomorphic addition and multiplication,
the noise present in the ciphertexts gets doubled and squared respectively. Due to
this noise growth, both the modulus $Q$ of the ring $R_Q$ and the degree of the polynomial $f(x)$
need to be large in order to compute a circuit of a certain depth and still enable successful
decryption of the result.
The operations on this large parameter set not only increase the hardware cost but also
slow down the homomorphic encryption.
[Figure: an n-bit polynomial coefficient C0 is split by the Residue Number System, using prime moduli q1, q2, q3, q4, into n/4-bit residues C01, ..., C04; each residue passes through HE Add / HE Mul + Relinearisation, giving C'01, ..., C'04; the Chinese Remainder Theorem recombines them into the n-bit coefficient C'0.]
Figure 2: Illustrative sequence of operations.
[Figure: block diagram with two sides connected through input/output buses and network interfaces. Client FPGA Logic: Key Generation, Relinearisation Version 1 Key Generation, Relinearisation Version 2 Key Generation, Encryption, Degree-1 Ciphertext Decryption, Degree-2 Ciphertext Decryption, RNS, and CRT. Cloud Provider FPGA Shell: Homomorphic Addition, Homomorphic Multiplication, Relinearisation Version 1, and Relinearisation Version 2. The blocks are built from Poly MUL, Poly Add, Mod Redu, Scalar MUL, Noise Sampler, TRNG, Decomp, Inner Product, and Div & Round submodules.]
Figure 3: Core building blocks of RLWE-based Somewhat Homomorphic Encryption.
We use the concepts of modular arithmetic to speed up HE computations. The underlying
FV scheme does not restrict $Q$ to a prime number; instead, $Q$ can be the product
of small primes. When working modulo a product of numbers, say $Q = q_1 \times q_2 \times \cdots \times q_k$,
the Residue Number System (RNS) reduces the coefficients into each of the $R_{q_i}$, and the
Chinese Remainder Theorem (CRT) lets us work in each modulus $q_i$ separately. Since the
computational cost is proportional to the size of operands, this is faster than working in
the full modulus $Q$. Moreover, breaking down the coefficients into smaller integers using
RNS also limits the noise expansion. Figure 2 illustrates the sequence of operations that
will be required in HE using the FV scheme, for $Q = q_1 \times q_2 \times q_3 \times q_4$.
In our implementation of the arithmetic library, we consider $Q$ to be 1200 bits and
the degree of the polynomial, $n$, to be 1024, with a 128-bit security level. With these values of
$Q$ and $n$, we can evaluate a binary circuit of depth 56 using somewhat homomorphic
encryption. The RNS module will take a 1200-bit wide integer coefficient $x$ as an input,
perform $x_i = x \bmod q_i$, and thus break $x$ into 40 small integers of 30 bits each. This
enables us to set up 40 pipelines to perform 40 operations in parallel, providing the
required performance boost. Once the homomorphic addition or multiplication operation is
done on these small integers, CRT can combine them to map back to the original 1200-bit
width. Note that the bit width selection for the small integers is a design decision one can
make based on available resources. Our implementations take $q_i$ as a parameter, which
facilitates different bit width selections.
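The RNS/CRT round trip described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's hardware design: the four small prime moduli, the operand values, and the function names are hypothetical (the implementation described here uses 40 moduli of 30 bits each), and the modular inverse relies on the three-argument `pow` with exponent `-1`, available in Python 3.8 and later.

```python
from math import prod

def rns_decompose(x, moduli):
    """RNS: represent x by its residues x_i = x mod q_i."""
    return [x % q for q in moduli]

def crt_reconstruct(residues, moduli):
    """CRT: recover x mod Q from its residues, with Q the product of the moduli."""
    Q = prod(moduli)
    x = 0
    for r, q in zip(residues, moduli):
        Q_i = Q // q                       # product of all other moduli
        x += r * Q_i * pow(Q_i, -1, q)     # pow(.., -1, q) is the inverse mod q
    return x % Q

moduli = [97, 101, 103, 107]               # toy pairwise-coprime prime moduli
Q = prod(moduli)
x, y = 12345678, 87654321                  # hypothetical coefficients below Q

xs, ys = rns_decompose(x, moduli), rns_decompose(y, moduli)

# Each residue channel is independent, so these could run in parallel pipelines.
sums = [(a + b) % q for a, b, q in zip(xs, ys, moduli)]
products = [(a * b) % q for a, b, q in zip(xs, ys, moduli)]

print(crt_reconstruct(sums, moduli) == (x + y) % Q)
print(crt_reconstruct(products, moduli) == (x * y) % Q)
```

Every per-channel operation works on operands no wider than the corresponding $q_i$, which is where the reduction in operand size comes from.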
Figure 3 shows all the core building blocks required to perform somewhat homomorphic
encryption (SHE) addition and multiplication operations using the FV scheme. The
client-side building blocks include key generation, relinearisation (versions 1 and 2) key
generation, encryption, decryption (for both degree-1 and degree-2 ciphertexts), residue
number system (RNS) along with modular reduction, and Chinese remainder theorem
(CRT) along with modulo inversion. The cloud provider has blocks to perform homomor-
phic addition, homomorphic multiplication, and relinearisation (versions 1 and 2). While
RNS and CRT are standalone modules, the rest of the main blocks share the following
submodules: polynomial multiplication, polynomial addition, modular reduction, scalar
multiplication, scalar division to nearest binary integer, noise sampler, and true random
number generator. Certain operations like decompositions, powers-of-2 computations, di-
vide and round operations are specifically required for the purpose of relinearisation, which
we will discuss in detail in Section 3.10.
3 Arithmetic Hardware Library For HE
We start by discussing the standalone modules (i.e., RNS and CRT), along with
the supplemental operations to these modules. Then, we will describe other arithmetic
operations shared between all the core building blocks, along with their design implemen-
tations. All the operations are customized for hardware-based implementation, and we
include both hardware-cost efficient serial and fast parallel implementations in the library.
It is worth noting that, in all the algorithms, Q will denote the large integer and qi or q
will denote a prime factor of Q.
3.1 Residue Number System
A residue number system (RNS) [Gar59] [SJJT86] [ST67] is a mathematical way of representing
an integer by its value modulo a set of $k$ integers $q_1, q_2, q_3, \ldots, q_k$, called the
moduli, which generally should be pairwise coprime. An integer, $x$, can be represented in
the residue number system by a set of its remainders $x_1, x_2, x_3, \ldots, x_k$ under Euclidean
division by the respective moduli. That is, $x_i = x \pmod{q_i}$ and $0 \leq x_i < q_i$ for every $i$.
The serial implementation of RNS is shown in Figure 4. Each 1200-bit coefficient
is modulo reduced by a qi and stored in the respective BRAM. For k moduli, it takes
k cycles to perform all the computations. When modulo reductions are performed in
parallel, all the computations can be performed in a single clock cycle instead. The parallel
implementation is shown in Figure 4. Since the mod operation is the key operation in
RNS, we next optimize the modulo reduction operation. This will allow us to reduce the
hardware cost substantially.
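[Figure: two panels. Serial RNS: a single modulo (%) unit reads a coefficient C0 from the coefficient store (BRAM), reduces it by one modulus q_i (selected by index i from q1-4) at a time, and writes C0i to a coefficient store (BRAM). Parallel RNS: four modulo (%) units reduce C0 by q1, q2, q3, and q4 simultaneously, producing C01, C02, C03, and C04.]
Figure 4: Serial and parallel implementation of RNS.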
3.2 Modular reduction
Modular reduction is not only at the core of many asymmetric cryptosystems, it is the
most performed operation in encryption schemes based on RLWE. This is because, in
RLWE, all the operations are required to be performed over large finite rings. The function
of the modular reduction operation is to compute the remainder of an integer division.
Mathematically, it is written as r = a (mod q).
While the modular reduction operation sounds relatively simple, the division of two
large integers is very costly. Moreover, the moduli in RLWE-based schemes are prime
numbers and not power-of-2 numbers, which makes the operation non-trivial. Therefore,
the hardware implementation of the modulo operation is quite expensive. For example,
the use of the built-in modulo operator, %, in Verilog for 30-bit operands utilizes about
800 LUTs, and when there are many such modular operations involved, the hardware
cost quickly adds up. Hence, optimization of modulo operation can lead to significant
hardware cost reductions.
One well-known modular reduction optimization algorithm is Barrett reduction [Bar87].
It is preferred over Montgomery reduction [Mon85], as it operates on the given integer
number directly, while Montgomery reduction requires numbers to be converted into and
out of Montgomery form, which is expensive in itself. We will discuss the Barrett reduc-
tion next, and then later, we propose some modifications to the existing Barrett reduction
algorithm to reduce the hardware cost further.
3.2.1 Barrett reduction
The Barrett reduction algorithm was introduced by P. D. Barrett [Bar87] to optimize the
modular reduction operation by replacing divisions with multiplications, so as to avoid the
slowness of long divisions. The key idea behind the Barrett reduction is to precompute a
factor using division for a given prime modulus, q, and thereafter, the computations only
involve multiplications, subtractions and shift operations. These operations are faster than
the division operation. The Barrett reduction algorithm steps are shown in Algorithm 1
and works as described below.
Since the modulus $q$ is known in advance and the factor $r$ depends only on this modulus,
it can be precomputed and stored. Then, the reduction function requires only computing
the remainder value, $t$. While computing the $t$ value, the division by $4^k$, a
power of 2, can be performed using the right-shift operation. Hence, the entire computation
reduces to just two multiplications, one right-shift, and one subtraction operation.
Furthermore, the computation is performed in one step and, thus, is performed in constant
time. The hardware implementation circuit is as shown in Figure 5.

Algorithm 1: Barrett Reduction Algorithm
  Let $a \in \mathbb{Z}_q[x]/\langle f(x) \rangle$, and $q$ is a prime number.

  Precomputation:
    /* Choose $k \in \mathbb{N}$ such that $2^k > q$ */
    assign $k = \lceil \log_2 q \rceil$;
    /* Compute the factor */
    assign $r = \lfloor 4^k / q \rfloor$;

  Reduction Function:
    compute $t = a - \lfloor a r / 4^k \rfloor q$;
    if $t < q$ then
      Return $t$ as $\langle a \bmod q \rangle$.
    end
    else
      Return $t - q$ as $\langle a \bmod q \rangle$.
    end

For our hardware-based Barrett reduction implementation, we specifically included
some additional optimizations. One such optimization is a careful bit width analysis. Say
that the modulus requires exactly $k$ bits; then the product $\lfloor a r / 4^k \rfloor q$ fits in $2k$ bits. We also
observed that the computed values $t$ do not need more than $k + 1$ bits. The advantage
of this observation is that we can safely ignore the upper $k - 1$ bits of the product. This
in turn reduces the size of the registers to $k + 1$ bits while performing computations.
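[Figure: two circuit panels, "Barrett Reduction" and "Modified Barrett Reduction", built from multipliers (X), right shifts (>>), subtractors (-), and, in the Barrett panel only, a comparator (<) that selects between t and t - q.]
Figure 5: Hardware implementation of Barrett reduction and Modified Barrett reduction.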
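Algorithm 1 translates almost line for line into software. The Python sketch below is an illustration, not the Verilog module from the library: the function names are hypothetical, `int.bit_length()` is used as a stand-in for $\lceil \log_2 q \rceil$ (the two agree whenever $q$ is not a power of two, which holds for any odd prime), and the input is assumed to satisfy $0 \leq a < 4^k$, which covers any product of two values already reduced modulo $q$.

```python
def barrett_precompute(q):
    """Precomputation of Algorithm 1: k with 2**k > q, and r = floor(4**k / q)."""
    k = q.bit_length()
    r = (1 << (2 * k)) // q        # the only true division, done once per modulus
    return k, r

def barrett_reduce(a, q, k, r):
    """Reduction function of Algorithm 1 for 0 <= a < 4**k."""
    t = a - ((a * r) >> (2 * k)) * q   # division by 4**k is a right shift by 2k
    return t if t < q else t - q        # t is always in [0, 2q)

q = 12289                               # a small prime used here for illustration
k, r = barrett_precompute(q)
a = 12288 * 12288                       # product of two reduced values
print(barrett_reduce(a, q, k, r), a % q)
```

Since $r > 4^k/q - 1$ and $a < 4^k$, the quotient estimate is at most one below the true quotient, which is why a single conditional subtraction suffices.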
3.2.2 Modified Barrett reduction
Hasenplaugh et al. [HGG07] introduced an iterative folding method as a modification
to the Barrett reduction method. This method not only reduces the number of required
multiplications via an increased number of precomputations, but also reduces the bit width
of the operations performed. We modify their proposed approach and propose Algorithm
2, which computes modulo reduction in a single fold.
When compared to Barrett reduction, the proposed algorithm precomputes $k$ with
half the bit width and $r$ with one third the bit width. This significantly reduces the
multiplication bit width while performing actual computations after the coefficient integers
are known. Moreover, we were able to get rid of an additional check, $t < q$, which is
required in case of Barrett reduction. Hence, the result for modulo reduction is available
with minimal operations using minimal bit width. The modified Barrett reduction’s
implementation is shown in Figure 5.

Algorithm 2: Modified Barrett Reduction Algorithm
  Let $a \in \mathbb{Z}_q[x]/\langle f(x) \rangle$, and $q$ is a prime number.

  Precomputation:
    /* Choose $k \in \mathbb{N}$ such that $2^{3k} > q$ */
    assign $k = \lfloor \frac{\log_2 q}{2} \rfloor$;
    /* Compute the factor */
    assign $r = \lceil 2^{3k} / q \rceil$;

  Reduction Function:
    compute $t = \lfloor \lfloor a / 2^{2k} \rfloor \cdot r / 2^k \rfloor$;
    Return $R = a - tq$ as $\langle a \bmod q \rangle$.
3.3 Polynomial Multiplication using NTT
Polynomial multiplication is the most performed operation in homomorphic encryption
and has the highest implementation complexity. Therefore, the latency of the polynomial
multiplication module will govern the efficiency of the entire implementation. Hence,
it is critical to design an efficient polynomial multiplication module. A conventional
approach to implement a polynomial multiplier is to use the convolution method. However,
this approach is expensive to implement in hardware, as it requires performing $O(n^2)$
multiplications for a degree $n$ polynomial. This complexity can be reduced to $O(n \log_2 n)$
multiplications instead by using NTT combined with negative wrapped convolution to
perform polynomial multiplication. We leverage the NTT-based multiplication algorithm
proposed by Chen et al. [CMV+15] in our implementation. The steps involved in this
algorithm are described in Algorithm 3.
Algorithm 3: NTT-based Polynomial Multiplier
1 Setup: Let $a = \{a_0, \cdots, a_{n-1}\}$ and $b = \{b_0, \cdots, b_{n-1}\} \in \mathbb{Z}_q[x]/\langle f(x) \rangle$ be two polynomials of length $n$ (with $n$ coefficients), where $f(x) = x^n + 1$ is an irreducible polynomial with $n$ a power of 2, and $q \equiv 1 \bmod 2n$ is a large prime number.
2 Let $\omega$ be the $n$-th root of unity, $\omega^{-1}$ the inverse of $\omega$, $\varphi^2 = \omega \bmod q$, and $\varphi^{-1}$ the inverse of $\varphi$.
3
4 Precompute: $\{\omega^i, \omega^{-i}, \varphi^i, \varphi^{-i}\}$ for all $i \in \{0, 1, \cdots, n-1\}$
/* negative wrap convolution of a and b */
5 for i = 0 to n-1 do
6     $\bar{a}_i \leftarrow a_i \varphi^i$
7     $\bar{b}_i \leftarrow b_i \varphi^i$
8 end
/* number-theoretic transforming a and b */
9 $\bar{A} \leftarrow NTT_\omega^n(\bar{a})$
10 $\bar{B} \leftarrow NTT_\omega^n(\bar{b})$
/* component-wise multiplying A and B */
11 $\bar{C} = \bar{A} \cdot \bar{B}$
12 $\bar{c} \leftarrow iNTT_\omega^n(\bar{C})$
13 for i = 0 to n-1 do
14     $c_i \leftarrow \bar{c}_i \varphi^{-i}$
15 end
16 Return c
3.3.1 Number Theoretic Transform
The Number Theoretic Transform (NTT) is a generalization of the Fast Fourier Transform (FFT) over a finite ring $R_q = R/\langle q \rangle = \mathbb{Z}_q[x]/\langle f(x) \rangle$. The equation of the NTT is as follows:

$$X_i = \sum_{k=0}^{n-1} x_k \cdot \omega^{ik} \tag{13}$$

where $\omega$ is the $n$-th root of unity in the corresponding polynomial field. For a ring $R_q$, where $q$ is a prime number, the $n$-th root of unity $\omega$ must satisfy two conditions:
1. $\omega^n \equiv 1 \pmod q$,
2. The period of $\omega^i$ for $i \in \{0, 1, 2, \cdots, n-1\}$ is exactly $n$.
One of the efficient ways to compute $\omega$ is by using the following approach:
1. First compute a primitive root $\alpha$ of $q$, which must satisfy:
• $\alpha^{q-1} \equiv 1 \pmod q$
• The period of $\alpha^i$ for $i \in \{0, 1, 2, \cdots, q-1\}$ is exactly $q-1$.
2. And since $\omega^n \equiv \alpha^{q-1} \pmod q$, we can compute:
$\omega = \alpha^{(q-1)/n} \bmod q$
3. As a final step, verify that this $\omega$ meets both the conditions mentioned above.
Applying the inverse NTT (iNTT) is straightforward and can be performed using the existing NTT module by replacing $\omega$ with $\omega^{-1}$, where $\omega^{-1} = \omega^{n-1} \bmod q$. The iNTT computation also requires the inverse of $n$, i.e., the value $n^{-1}$ satisfying $n^{-1} \cdot n \equiv 1 \pmod q$.
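The procedure for obtaining $\omega$, $\omega^{-1}$, and $n^{-1}$ can be sketched in a few lines of Python. This is an illustrative software sketch under assumed toy parameters ($q = 17$, $n = 8$); the helper names are hypothetical, and the brute-force search for a primitive root is only practical for small or easily factored $q - 1$.

```python
def prime_factors(m):
    """Return the set of prime factors of m by trial division."""
    factors = set()
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def primitive_root(q):
    """Smallest alpha whose powers have period exactly q - 1 modulo the prime q."""
    factors = prime_factors(q - 1)
    for alpha in range(2, q):
        if all(pow(alpha, (q - 1) // p, q) != 1 for p in factors):
            return alpha
    raise ValueError("no primitive root found; is q an odd prime?")


def nth_root_of_unity(n, q):
    """Compute omega = alpha^((q-1)/n) mod q and verify both conditions."""
    if (q - 1) % n != 0:
        raise ValueError("n must divide q - 1")
    omega = pow(primitive_root(q), (q - 1) // n, q)
    assert pow(omega, n, q) == 1                               # condition 1
    assert all(pow(omega, i, q) != 1 for i in range(1, n))     # condition 2
    return omega


q, n = 17, 8                       # toy parameters, assumed for illustration
omega = nth_root_of_unity(n, q)
omega_inv = pow(omega, n - 1, q)   # omega^(n-1) is the inverse of omega
n_inv = pow(n, q - 2, q)           # inverse of n via Fermat, since q is prime
print(omega, omega_inv, n_inv)
```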
Although hardware implementations of the NTT exist, they are quite expensive because of the way they compute the indices of the points and the corresponding $\omega^i$. Investing a large number of multiplications and divisions in these computations may not be an issue for a software implementation; however, they lead to higher resource consumption in the hardware counterpart. Therefore, in our implementation of the NTT algorithm, Algorithm 4, we perform the index computation using only shift and xor operations. The benefit of doing so is that shift and xor are not only inexpensive to implement but also conveniently replace the large multiplication and division circuits. By leveraging this highly optimized NTT implementation, we implement the fast polynomial multiplication algorithm (Algorithm 3) very efficiently. Figure 6 shows a high-level circuit for polynomial multiplication and the operations within the NTT block.
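The following Python sketch illustrates the idea in software: an iterative radix-2 NTT in which the partner index comes from one XOR and the twiddle index from one shift, wrapped in the negative wrapped convolution steps of the polynomial multiplier. It is a behavioural model under assumed toy parameters, not the Verilog design; the function names are hypothetical, and the exact index expressions here are one standard formulation that may differ in detail from the hardware.

```python
def bit_reverse(j, bits):
    """Reverse the lowest `bits` bits of j."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (j & 1)
        j >>= 1
    return r


def ntt(a, omega, q):
    """Iterative radix-2 NTT: X[i] = sum_k a[k] * omega^(i*k) mod q."""
    n = len(a)
    log_n = n.bit_length() - 1
    twiddle = [pow(omega, k, q) for k in range(max(n // 2, 1))]
    A = [a[bit_reverse(j, log_n)] for j in range(n)]
    for stage in range(log_n):
        half = 1 << stage
        for i in range(n):
            if i & half:
                continue  # lower input of a butterfly, handled with its partner
            i_corr = i ^ half                             # partner index: one XOR
            k = (i & (half - 1)) << (log_n - stage - 1)   # twiddle index: one shift
            u = A[i]
            v = A[i_corr] * twiddle[k] % q
            A[i] = (u + v) % q
            A[i_corr] = (u - v) % q
    return A


def ntt_negacyclic_mul(a, b, phi, q):
    """Multiply in Z_q[x]/(x^n + 1); phi is a primitive 2n-th root of unity mod prime q."""
    n = len(a)
    omega = phi * phi % q
    phi_inv = pow(phi, q - 2, q)      # inverses via Fermat, valid because q is prime
    omega_inv = pow(omega, q - 2, q)
    n_inv = pow(n, q - 2, q)
    a_bar = [x * pow(phi, i, q) % q for i, x in enumerate(a)]
    b_bar = [x * pow(phi, i, q) % q for i, x in enumerate(b)]
    A, B = ntt(a_bar, omega, q), ntt(b_bar, omega, q)
    C = [x * y % q for x, y in zip(A, B)]                  # component-wise product
    c_bar = [x * n_inv % q for x in ntt(C, omega_inv, q)]  # inverse NTT
    return [x * pow(phi_inv, i, q) % q for i, x in enumerate(c_bar)]


# Toy parameters (assumed): n = 4, q = 17, phi = 2 since 2^4 = -1 mod 17
print(ntt_negacyclic_mul([1, 2, 3, 4], [5, 6, 7, 8], phi=2, q=17))
```

In this model the twiddle factors are looked up from a precomputed table, which mirrors the precomputation step of the algorithm; a hardware design would hold the same table in on-chip memory.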
3.4 Polynomial Addition
Polynomial addition is the second most frequently used operation after polynomial multiplication. The schematic of the hardware implementation for polynomial addition is shown in Figure 7. The implementation performs a component-wise addition operation on the coefficients of the polynomial. Note that the results are wrapped either within the small modulus $q$ or the large modulus $Q$, depending on which main module is utilizing this submodule to perform polynomial addition.
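A software sketch of this operation is given below. It assumes that both inputs are already reduced modulo the chosen modulus, so one comparison and one conditional subtraction replace a general modular reduction; the function name is hypothetical.

```python
def poly_add(a, b, modulus):
    """Component-wise addition of two coefficient lists, wrapped within `modulus`."""
    result = []
    for x, y in zip(a, b):
        s = x + y
        # Inputs are assumed to lie in [0, modulus), so s < 2 * modulus
        # and a single conditional subtraction is enough.
        result.append(s - modulus if s >= modulus else s)
    return result


print(poly_add([3, 16, 0], [15, 16, 0], 17))
```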
3.5 Scalar Multiplication
Since the message space is binary, a conditional assignment operator can be used to implement the scalar multiplication operation. As shown in Figure 7, $m$, the plaintext message, is an $n$-bit vector. Thus, computing $t \cdot m$ essentially requires choosing $t$ or 0 according to each bit of $m$. Since we avoid performing actual multiplication operations, hardware cost is greatly reduced.
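In software terms, the conditional assignment amounts to a per-bit select. The sketch below is illustrative only; the function name and the example value of $t$ are assumptions.

```python
def scalar_mul_binary(m_bits, t):
    """Compute t * m for a binary message vector using a select, not a multiplier."""
    return [t if bit == 1 else 0 for bit in m_bits]


t = 512  # example value, assumed for illustration
print(scalar_mul_binary([1, 0, 1, 1], t))
```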
Algorithm 4: Number-Theoretic Transform
1 Let $a = \{a_0, \cdots, a_{n-1}\} \in \mathbb{Z}_q[x]/\langle f(x) \rangle$ where $n$ is a power of 2, and $q \equiv 1 \bmod 2n$ is a prime number. Let $\omega$ be the $n$-th root of unity for $q$. Let $\omega^{-1}$ be the inverse of $\omega$. Precompute $\omega^i$ and $\omega^{-i}$ for $i \in \{0, 1, \cdots, n/2 - 1\}$, and store them in the array elements ω[i] and iω[i] respectively.
2 Let a be swapped to A so that A[j] = a[$j_{rev}$], where $j_{rev}$ is a binary vector bit-reversed from j. Let i, Stage both be initialized to 0.
3
4 Index Computation:
/* calculate the corresponding point’s index icorr to the ith point */
5 assign icorr = i XOR (1 << Stage);
/* calculate the twiddle factor k for both icorr and i */
6 assign k0 = (Stage == 0)?0 : (i << (log n − Stage));
7 assign k = k0[log q − 1 : 1];
8
9 Shared Variable Computation:
10 assign v = A[icorr] ∗ ω[k] mod q;
11
12 NTT Function:
13 if Stage < log n then
14 if i < n then
15 i = i + 1;
16 if i == n − 1 then
17 Stage = Stage + 1;
18 end
19 if i[Stage] == 0 then
20 A[i] = A[i] + v − (A[i] + v ≥ q?q : 0);
21 A[icorr] = A[i] − v + (A[i] ≥ v?0 : q);
22 end
23 end
24 end
25 else
26 Return A as the transformed polynomial of a.
27 end
3.6 Scalar Division to the Nearest Binary Integer
The scheme parameter $t = \lfloor Q/2 \rfloor$ is already published, and hence, it is known. From equations 4 and 6 in the scheme, if we denote $u = (c_0 + s \cdot c_1)$, then to decrypt the message $m$ correctly, all we need to do is to compute $m = \lceil u/t \rfloor$. The nearest binary integer equivalent of $\lceil u/t \rfloor$ can be computed by measuring the distance between $u$ and $t$ as (Absolute(u − t) < t/2) ? 1 : 0. If this distance is larger than half of $t$, it indicates that $u$ and $t$ are far from each other, and thus, the nearest integer of the quotient $u/t$ must be 0. But if this distance is less than half of $t$, then the nearest integer of the quotient $u/t$ must be 1. Thus, we implement the scalar division hardware circuit as shown in Figure 7, without using any hardware division circuit.
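The distance test can be sketched as follows. The comparison is written as $2|u - t| < t$ so that no fraction is formed; that rewriting, the function name, and the toy modulus are assumptions made for the example.

```python
def decode_bit(u, t):
    """Return the binary value of round(u / t) using only subtract and compare."""
    # |u - t| < t / 2 is checked as 2 * |u - t| < t to stay in integers.
    return 1 if 2 * abs(u - t) < t else 0


Q = 1024          # toy modulus, assumed for illustration
t = Q // 2
for u in (10, 300, 500, 700, 1020):
    print(u, decode_bit(u, t))
```

Values of $u$ close to 0 or close to $Q$ decode to 0, while values close to $t$ decode to 1, which matches the behaviour described above.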
3.7 Chinese Remainder Theorem
The Chinese remainder theorem (CRT) [KI07] states that if we know the residues of an integer $a$ modulo two primes $q_1$ and $q_2$, it is possible to reconstruct $\langle a \rangle_{q_1 q_2}$ as follows. Let $\langle a \rangle_{q_1} = a_1$ and $\langle a \rangle_{q_2} = a_2$; then the value of $a \pmod Q$, where $Q = q_1 \cdot q_2$, can be found by

$$a = \langle q_1 t_1 a_2 + q_2 t_2 a_1 \rangle_Q \tag{14}$$

where $t_1$ is the multiplicative inverse of $q_1 \pmod{q_2}$ and $t_2$ is the multiplicative inverse of $q_2 \pmod{q_1}$. This is feasible as the inverses $t_1$ and $t_2$ always exist, since $q_1$ and $q_2$ are coprime. Mathematically, $\langle a \rangle_{q_1 q_2}$ can also be represented by a set of congruent equations as follows:

$$a \equiv a_1 \pmod{q_1}, \qquad a \equiv a_2 \pmod{q_2} \tag{15}$$

Figure 6: Polynomial Multiplication and Operation within NTT

Figure 7: Polynomial Addition, Scalar Multiplication, and Scalar Division submodules.

Using the CRT, we combine all the small integers back into one large integer, so
as to generate the required final result. A naive approach to implement CRT would
compute pairwise multiplicative inverse or modulo inverse for two given moduli and then
use equation 14 to merge values of a1 and a2 to get a. This process can be carried out
recursively until the final coefficient value is obtained.
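The naive pairwise merge can be sketched in Python as follows. The sketch applies equation 14 to two residues at a time and folds over the list of moduli; the function names are hypothetical, and the modular inverses are obtained with Python's built-in three-argument `pow` (negative exponents require Python 3.8 or later).

```python
from functools import reduce


def crt_pair(a1, q1, a2, q2):
    """Merge a = a1 (mod q1) and a = a2 (mod q2) into a (mod q1*q2) via equation 14."""
    t1 = pow(q1, -1, q2)  # inverse of q1 modulo q2, computed at run time
    t2 = pow(q2, -1, q1)  # inverse of q2 modulo q1, computed at run time
    return (q1 * t1 * a2 + q2 * t2 * a1) % (q1 * q2)


def crt_naive(residues, moduli):
    """Fold the pairwise merge over all (residue, modulus) pairs."""
    def merge(acc, nxt):
        (a1, q1), (a2, q2) = acc, nxt
        return crt_pair(a1, q1, a2, q2), q1 * q2
    value, _ = reduce(merge, zip(residues, moduli))
    return value


print(crt_pair(2, 3, 3, 5))
print(crt_naive([1, 2, 3], [3, 5, 7]))
```

Every merge step recomputes two modular inverses, which is the run-time cost that motivates precomputation.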
Algorithm 5: Chinese Remainder Theorem LUT-based Algorithm
1 Let $a_1, \cdots, a_k$ with $a_i \in \mathbb{Z}_{q_i}[x]/\langle f(x) \rangle$ ($1 \le i \le k$), where $q_1, \cdots, q_k$ are pairwise relatively prime and the large modulus is $Q = q_1 \times \cdots \times q_k$
2
3 Precomputation:
/* Compute $Q_i$ where $1 \le i \le k$ */
4 assign $Q_i = Q / q_i$;
/* Compute the multiplicative inverse using corresponding moduli */
5 assign $Q_i^{-1}$ = moduloInverse($Q_i$);
/* Compute the multiplicative constant */
6 assign $c_i = Q_i \times Q_i^{-1}$;
7
8 CRT Function:
9 Return $x = (a_1 \times c_1 + a_2 \times c_2 + \cdots + a_k \times c_k) \bmod Q$
The problem with this approach is that computing the multiplicative inverse at runtime increases latency. Therefore, a better approach is to precompute the modulo inverse values, since all the moduli are known in advance. The actual computation then reduces to just a single step, which can be performed in one clock cycle. The precomputation and computation steps involved are shown in Algorithm 5. We call this approach LUT-based, since the precomputed values are stored in LUTs; its hardware implementation circuit is shown in Figure 8. The hardware cost for LUT-based CRT can be further optimized by breaking down the single-step multiplication and addition operation into several steps. This will enable the reuse of multipliers and adders during these steps. We leave this as future work for now.

Figure 8: Hardware implementation of CRT.
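A software analogue of the LUT-based approach is sketched below: the constants $c_i$ are computed once from the known moduli, and each reconstruction is then a single multiply-accumulate followed by one reduction modulo $Q$. The function names and the toy moduli are assumptions; `math.prod` and the negative-exponent form of `pow` require Python 3.8 or later.

```python
from math import prod


def crt_precompute(moduli):
    """Precompute Q and the constants c_i = Q_i * (Q_i^-1 mod q_i)."""
    Q = prod(moduli)
    consts = []
    for q_i in moduli:
        Q_i = Q // q_i
        consts.append(Q_i * pow(Q_i, -1, q_i))
    return Q, consts


def crt_lut(residues, Q, consts):
    """Reconstruct x from its residues using only the stored constants."""
    return sum(a * c for a, c in zip(residues, consts)) % Q


moduli = [3, 5, 7]                 # toy moduli, assumed for illustration
Q, consts = crt_precompute(moduli)
x = 52
print(crt_lut([x % q for q in moduli], Q, consts))
```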
3.8 Modulo Inverse
A modulo inverse or multiplicative inverse is the main computation involved in CRT.
Moreover, while working with homomorphic encryption, due to large parameters, the
amount of storage available can be a concern. Thus, instead of using LUT-based CRT,
we may need to use a regular CRT implementation with the modulo inverse computed on
the fly.
The multiplicative inverse of a (mod q) exists if and only if a and q are relatively prime
(i.e., if gcd(a, q) = 1). Given two integers a and q, the modulo inverse is defined by an
integer p such that
a · p ≡ 1 (mod q) (16)
Here, the value of $p$ should be in $\{0, 1, 2, \ldots, q-1\}$, i.e., in the range of integers modulo $q$.
In our case, we need to compute the multiplicative inverse between pairwise moduli, $q_i$.
There are two primary algorithms used in computing a modulo inverse. We discuss both
algorithms in detail next.
3.8.1 Fermat’s Little Theorem
Fermat’s little theorem [Vin16] is typically used to simplify the process of modular exponentiation. But since we know $q$ is prime, we can also use Fermat’s little theorem to find the modulo inverse. According to this theorem, for any $a$ not divisible by $q$:

$$a^{q-1} \equiv 1 \pmod q \tag{17}$$

If we multiply both sides of this equation by $a^{-1}$, we get

$$a^{-1} \equiv a^{q-2} \pmod q \tag{18}$$

Equation 18 is the only computation required to obtain the value of $a^{-1}$, and this is what is shown in Algorithm 6. The algorithm has a time complexity of $O(\log_2 q)$. For our hardware-based implementation, we precompute the power factors from 1 to $q - 2$ to save the computation cost. This not only speeds up computation but also significantly reduces the hardware cost. The hardware implementation is shown in Figure 9.
Algorithm 6: Fermat’s Little Theorem
1 Let $a \in \mathbb{Z}_q[x]/\langle f(x) \rangle$, and $q$ is a prime number.
2
3 Precomputation:
/* Compute power factors */
4 assign $y_i = q - 2, \frac{q-2}{2}, \cdots, 1$ for $1 \le i \le \lfloor q/2 \rfloor - 1$;
/* Use a temporary intermediate storage */
5 assign p = 1;
6
7 Inverse Function:
8 for $i \leftarrow 1$ to $\lfloor q/2 \rfloor - 1$ do
9     if $y_i$ == 1 then
10         assign p = 1
11     end
12     compute p = (p × p) mod q.
13     if $y_i$ % 2 ≠ 0 then
14         compute p = (a × p) mod q.
15     end
16 end
17 Return p
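For comparison, equation 18 can be evaluated in software with standard left-to-right square-and-multiply exponentiation. This is a generic sketch of that well-known technique under the assumption that $q$ is prime and $a$ is not a multiple of $q$; it is not a line-by-line transcription of Algorithm 6, and the function name is hypothetical.

```python
def fermat_inverse(a, q):
    """Return a^(q-2) mod q, the inverse of a modulo the prime q."""
    exponent = q - 2
    p = 1
    for bit in bin(exponent)[2:]:   # scan exponent bits, most significant first
        p = (p * p) % q             # square for every bit
        if bit == "1":
            p = (a * p) % q         # multiply when the bit is set
    return p


print(fermat_inverse(3, 17))
```

The loop runs once per bit of $q - 2$, which is where the $O(\log_2 q)$ complexity comes from.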
3.8.2 Extended Euclidean Algorithm
The extended Euclidean algorithm [Vin16] is an extension to the classic Euclidean algorithm that is used for finding the greatest common divisor (GCD). According to this algorithm, if $a$ and $q$ are relatively prime, there exist integers $x$ and $y$ such that $ax + qy = 1$, and such integers may be found using the Euclidean algorithm. Considering this equation modulo $q$, it follows that $ax \equiv 1 \pmod q$; i.e., $x \equiv a^{-1} \pmod q$. The algorithm used for our implementation is shown in Algorithm 7.

The algorithm works as follows. Given two integers $0 < a < q$, using the classic Euclidean algorithm equations, one can compute $\gcd(a, q) = r_j$, where $r_j$ is the last nonzero remainder. In the classic Euclidean algorithm, we start by dividing $q$ by $a$ (integer division with remainder), then repeatedly divide the previous divisor by the previous remainder until there is no remainder. The last remainder we divided by is the greatest common divisor. To avoid division operations, the classic Euclidean algorithm equations can be rewritten as follows:

$$\begin{aligned} r_1 &= q - a \cdot x_1, \\ r_2 &= a - r_1 \cdot x_2, \\ r_3 &= r_1 - r_2 \cdot x_3, \\ &\;\;\vdots \\ r_j &= r_{j-2} - r_{j-1} \cdot x_j \end{aligned}$$

Then, in the last of these equations, $r_j = r_{j-2} - r_{j-1} \cdot x_j$, replace $r_{j-1}$ with its expression in terms of $r_{j-3}$ and $r_{j-2}$ from the equation immediately above it. Continue this process successively, replacing $r_{j-2}, r_{j-3}, \ldots$, until we obtain the final equation $r_j = ax + qy$, with $x$ and $y$ integers. In our special case that $\gcd(a, q) = 1$, the integer equation reads as $1 = ax + qy$, and therefore we deduce $1 \equiv ax \pmod q$, so that the residue of $x$ is the multiplicative inverse of $a \pmod q$. The time complexity of this algorithm is $O(\log(\min(a, q)))$.
Algorithm 7: Extended Euclidean Algorithm
1 Let $a \in \mathbb{Z}_q[x]/\langle f(x) \rangle$, and $q$ is a prime number.
2
3 Initialization:
4 assign tempa = q;
5 assign tempb = a % q;
6 assign m = tempa / tempb;
7 assign x = 0, prevx = 1;
8 assign y = 1, prevy = 0;
9
10 Inverse Function:
11 while tempb ≠ 0 do
12     compute x = prevx − m × x
13     assign prevx = x
14     compute y = prevy − m × y
15     assign prevy = y
16     assign tempa = tempb
17     assign tempb = tempa % tempb
18     assign m = tempa / tempb;
19 end
20 Return prevx mod q
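A compact software version of the iterative extended Euclidean inverse is sketched below. It tracks only the coefficient of $a$ and updates remainder and coefficient pairs simultaneously; the variable names are chosen for the example and do not correspond one-to-one to those in Algorithm 7.

```python
def ext_euclid_inverse(a, q):
    """Return x with a * x = 1 (mod q), or raise if gcd(a, q) != 1."""
    r_prev, r = q, a % q
    x_prev, x = 0, 1   # invariant: r_prev = x_prev * a and r = x * a (mod q)
    while r != 0:
        m = r_prev // r                    # integer quotient
        r_prev, r = r, r_prev - m * r      # next remainder
        x_prev, x = x, x_prev - m * x      # matching coefficient of a
    if r_prev != 1:
        raise ValueError("a is not invertible modulo q")
    return x_prev % q


print(ext_euclid_inverse(3, 17))
```

The tuple assignments update both members of each pair from their old values in one step, which is the software counterpart of registers being updated together on a clock edge.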
Figure 9: Hardware implementation of Fermat’s little and extended Euclidean theorem.
The hardware implementation is shown in Figure 9. The algorithm is simplified by
removing unnecessary variables and computations to make it more suitable for hardware
implementation. The implementation is done in an iterative fashion so that the input
parameters gradually decrease while keeping the GCD of the parameters unchanged.
3.9 Gaussian Noise Sampler
The security of the RLWE-based encryption scheme is governed by small error samples generated from a Gaussian distribution. Hence, a Gaussian noise sampler lies at the core of maintaining the required security level. However, it is critical to select a sampling algorithm with a high sampling efficiency and throughput so that the key generation and the encryption operations, at the client side, still remain efficient. We leverage the implementation of a Ziggurat-based Gaussian noise sampler done by the authors in [ABK]. Due to space constraints, we do not present the implementation details here, but interested readers can refer to the actual paper.
3.10 Relinearisation
We discuss relinearisation version 1 implementation details first. In version 1, the key generation will reuse most of the existing submodules except for the powers-of-2 computation. This operation is indicated by the PowersOf2 submodule in Figure 10. The values of $T^i$ are not precomputed and stored, to reduce the memory overhead. Instead, we take the vector $s^2$ and perform a left shift operation on all of the elements of this vector. The first set of left shift operations is by 0 bits to realize the multiplication by $2^0$, then by 1 bit for the multiplication by $2^1$, and so on until the multiplication by $2^\ell$ is performed.
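The shift-based generation of the scaled copies of $s^2$ can be sketched as follows. The reduction modulo $q$ after each shift, the function name, and the toy values are assumptions made for the example.

```python
def powers_of_two(s2, ell, q):
    """Return [2^i * s2 mod q for i = 0..ell], scaling each coefficient by a left shift."""
    return [[(coeff << i) % q for coeff in s2] for i in range(ell + 1)]


# Toy values (assumed): two coefficients of s^2, ell = 2, q = 17
for row in powers_of_two([1, 3], 2, 17):
    print(row)
```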
Figure 10: PowersOf2, Inner Product, and Div&Round submodules.
Note that we generate the relinearisation keys in $R_{q_1}, \ldots, R_{q_k}$ rather than $R_Q$. This facilitates the routing of the relinearisation keys correctly to the corresponding $q_i$ operation pipeline without the need to perform modular reductions. Although we perform $k$ times more operations, these operations are significantly faster. Additionally, the output of this submodule is arranged in such a fashion that the elements having the same index, from key rlk[0], are treated as a single output. A similar output format holds true for rlk[1] as well. This helps in faster indexing of the relinearisation keys while computing the inner product with $c_2$. The schematic of the PowersOf2 submodule implementation is shown in Figure 10.
In the relinearisation version 1 module in Figure 10, the decomposition submodule’s task is to represent the ciphertext $c_2$ at bit level, i.e., to convert the coefficients from $R_q$ to $R_T$, where $T = 2$. In hardware, bit-level operations can be performed readily, and hence, this operation becomes trivial. Therefore, we do not specifically provide an implementation of this submodule, and it is shown for completeness in Figure 3. Next, we describe the implementation of the inner product submodule. To avoid performing actual multiplication operations, we leverage our scalar multiplication module (Section 3.5) within this submodule, since the decomposed $c_2$ is binary. Hence, a conditional operator does the work of multiplying elements of $c_2$ and the relinearisation keys. We just need to use adders to compute the summation to finish the inner product computations. The implementation of this submodule is shown in Figure 10.
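The select-and-add structure can be illustrated at the level of a single coefficient. The sketch below is a deliberate simplification: it treats $c_2$ and $s^2$ as scalars rather than polynomials and omits the noise and masking terms of real relinearisation keys, so it only demonstrates that a bit decomposition combined with shifted keys reproduces the product without a multiplier. All names and values are assumptions.

```python
def bit_decompose(c, ell):
    """Return the bits of c for positions 0..ell (base T = 2)."""
    return [(c >> i) & 1 for i in range(ell + 1)]


def inner_product_binary(bits, keys, q):
    """Sum the keys selected by the bits; a select and an adder replace each multiply."""
    acc = 0
    for bit, key in zip(bits, keys):
        if bit:
            acc += key
            if acc >= q:      # keys are assumed reduced, so one subtraction suffices
                acc -= q
    return acc


q, s2, c2, ell = 17, 5, 11, 4                      # toy scalars, assumed
keys = [(s2 << i) % q for i in range(ell + 1)]     # shifted copies of s2
print(inner_product_binary(bit_decompose(c2, ell), keys, q), (c2 * s2) % q)
```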
We now explore the relinearisation version 2 submodules. For key generation, we pick the largest $q_i$ from the moduli set, compute $q_i^3$, and then set the immediate next power of two as the value of $p$. This $p$ is the scaling factor. Since we choose a power of 2 as $p$, we can simply perform shift left operations to emulate the multiplication of $p$ with $s^2$ while generating the relinearisation keys. Note that rlk[1] or $a$ is sampled from $R_{p \cdot q_i}$. Additionally, to maintain the required security, the error samples need to be generated from a different noise sampler. Hence, a second instance of the noise sampler is used here with the required parameter settings. The rest of the submodules are as previously discussed.
While performing the relinearisation operation in version 2, the ciphertext needs to be scaled down. This task is accomplished by using the Div&Round submodule shown in Figure 10. As the scale factor $p$ is a power of 2, division operations can be avoided, and shift right operations can be performed instead. Most other existing implementations (both software and hardware) precompute $1/p$, round it down, and perform multiplication operations instead of division. There are two disadvantages to this approach. First, rounding leads to loss of precision, generating approximate results and magnifying the errors in decryption as the levels of operations increase. Second, even though the expensive division operations are avoided, multiplications are still costly, requiring large multipliers which are not only expensive but also lead to a lower operating frequency. Figure 10 shows the Div&Round submodule circuit.
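A software sketch of division by a power-of-two scale factor follows. Adding half of $p$ before the shift to obtain round-to-nearest is an assumption made for this example, as are the function name and the restriction to non-negative inputs; the text above only states that a right shift replaces the division.

```python
def div_round_pow2(x, log2_p):
    """Round x / 2^log2_p to the nearest integer (ties rounded up) for x >= 0, log2_p >= 1."""
    half = 1 << (log2_p - 1)       # p / 2, added so the shift rounds instead of truncating
    return (x + half) >> log2_p


print(div_round_pow2(19, 3), div_round_pow2(20, 3))
```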
4 Performance Evaluation
We evaluate the performance of all the design implementations through synthesis on a
Xilinx Zynq-7000 family xc7z020clg400-1 FPGA. The tool used for synthesis is ISE design
suite 14.7, with all designs implemented in Verilog 2001. To generate synthesis results,
the input coefficient bit width is considered 1200 bits and there are 40 coprime moduli, q,
having 30 bits each. The degree of the polynomial is taken as 1024.
4.1 Hardware Cost and Latency
We start by discussing the hardware cost and latency of the individual operations listed in Tables 1 and 2. The hardware cost depends on the size of each coefficient (either 1200 bits or 30 bits) and the number of portions into which a 1200-bit coefficient is divided. The latency computation factors in the number of portions, $k$, or the polynomial degree, $n$, as required by the implementation of a module. That is why, in Table 2, the latencies are represented as a factor of $k$ or $n$.
Table 1: Hardware cost of the individual operations.

| Operation | LUT Slices | Registers | DSP | BRAM |
|---|---|---|---|---|
| Mod Operator (%) | 798 | 0 | 0 | 0 |
| Barrett Reduction | 71 | 0 | 0 | 0 |
| Modified Barrett Reduction | 23 | 0 | 3 | 0 |
| RNS (serial) | 7592 | 90 | 0 | 1.5 |
| RNS (serial modified) | 145 | 56 | 3 | 1.5 |
| RNS (parallel) | 88353 | 1242 | 0 | 2 |
| RNS (parallel modified) | 133 | 86 | 3 | 2 |
| CRT | 3883 | 2408 | 20 | 6 |
| CRT (LUT-based) | 1274 | 301 | 4 | 6 |
| Modulo Inverse (Fermat’s little) | 1889 | 120 | 14 | 1 |
| Modulo Inverse (Extended Euclidean) | 3993 | 154 | 3 | 1 |
| NTT | 6188 | 1291 | 0 | 3 |
| NTT-based Polynomial Multiplication | 8261 | 162 | 30 | 6 |
| Polynomial Addition | 1185 | 56 | 0 | 1 |
| Scalar Multiplication | 118 | 10 | 0 | 3 |
| Scalar Division to nearest integer | 672 | 14 | 0 | 3 |
| PowersOf2 | 113 | 20 | 0 | 1 |
| Inner Product | 8961 | 796 | 0 | 3 |
| Div&Round | 0 | 30 | 0 | 1 |

The built-in Verilog mod operator is very expensive and takes about 800 LUTs to perform a 30-bit modulo reduction. For the same bit width, the classic Barrett reduction reduces the hardware cost by almost 11 times, and our proposed modified Barrett reduction reduces the cost by about 35 times. Since modular reduction is performed very frequently and used by all the modules, using the modified Barrett reduction method substantially reduces the hardware resources required for implementing the entire homomorphic encryption scheme. Note that the latency of the modular reduction module is not shown, as the implementation comprises combinational logic only. We observe that the RNS parallel implementation utilizes about 12 times more LUT slices than the serial implementation, while the latency of the serial implementation is about 40 times higher than that of the parallel one. The RNS serial and parallel modified implementations listed in the table use the modified Barrett reduction to perform the modulo reduction operation, instead of the built-in Verilog mod operator. The hardware resource utilization is dramatically reduced by this modification; however, the latency remains the same.
Table 2: Latency and frequency of the individual operations.

| Operation | Latency (clock cycles) | Frequency (MHz) |
|---|---|---|
| RNS (serial) | 120n | 260.5 |
| RNS (serial modified) | 120n | 265.3 |
| RNS (parallel) | 3n | 314.1 |
| RNS (parallel modified) | 3n | 316.4 |
| CRT | 1404n | 132.4 |
| CRT (LUT-based) | 156n | 134.5 |
| Modulo Inverse (Fermat’s little) | 3240n | 113.7 |
| Modulo Inverse (Extended Euclidean) | 360n | 117.4 |
| NTT | 10240k | 218.1 |
| NTT-based Polynomial Multiplication | 20480k | 121.9 |
| Polynomial Addition | 3072k | 144.4 |
| Scalar Multiplication | 2048k | 243.8 |
| Scalar Division to nearest integer | 2048k | 129.5 |
| PowersOf2 | 163840 | 349.1 |
| Inner Product | 153600 | 124.3 |
| Div&Round | k | 204.3 |

The regular CRT implementation is about 3 times more expensive than a LUT-based CRT. This difference arises because the regular CRT spends a lot of hardware resources on computing the multiplicative inverse, while the LUT-based CRT, with all its precomputations, not only requires fewer hardware resources but also performs computations in 9 times fewer clock cycles. For computing multiplicative inverses, Fermat’s little theorem facilitates a low hardware cost implementation, with about half the hardware cost of the widely used extended Euclidean method. However, the extended Euclidean method performs computations about 9 times faster. An NTT-based polynomial multiplication cuts down the latency from $n^2$ to $n \log n$. The polynomial addition, scalar multiplication, and scalar division submodules avoid the usage of modular reduction, multiplication, and division operations, respectively. Hence, these submodules are implemented using minimal hardware resources and have a low latency. For the remaining submodules involved in relinearisation, because of all the optimizations in the implementation, we observe a low hardware cost and latency.
4.2 Hardware library vs Software library Speedup
Table 3 provides the time, in clock cycles, for computing various homomorphic encryption operations. We report time in clock cycles because the FPGA and a general-purpose CPU operate at different frequencies.
Table 3: Time required for homomorphic encryption operations.

| Operation | Time (in clock cycles) |
|---|---|
| Homomorphic addition | 3072 |
| Homomorphic multiplication | 71338 |
| Relinearisation KeyGen (version 1) | 86698 |
| Relinearisation (version 1) | 18432 |
| Relinearisation KeyGen (version 2) | 72362 |
| Relinearisation (version 2) | 112298 |
| RNS + CRT | 23259 |
| Encryption | 75434 |
| Decryption (Degree-1) | 73386 |
| Decryption (Degree-2) | 141653 |

As seen in the table, a single HE addition is about 23× faster than an HE multiplication. Moreover, if one has to choose between relinearisation versions 1 and 2, then version 1 is the clear choice, as it is about 6× faster than version 2, even though key generation takes almost the same time in both. It is worth noting that, although the table lists the relinearisation key generation time for both versions, these keys can be precomputed, and hence, the time required for key generation need not be included in the overall time required for HE operations. Based on the parameters that are used in the implementation, we can evaluate a circuit of depth 56, with relinearisation performed after every multiplication operation. Therefore, using the data from Table 3, we can compute the number of clock cycles required for this entire set of operations. We observe that when the circuit evaluation is done with relinearisation version 1, the total cycles required are 5,031,984, while evaluation done with relinearisation version 2 takes 10,194,651 cycles.
Table 4: Hardware speedup for homomorphic encryption operations.

| Operation | Palisade library (Time in clock cycles) | Our Hardware library (Time in clock cycles) | Speedup |
|---|---|---|---|
| Encryption | 119700000 | 75434 | 1500× |
| Homomorphic mult. | 299729520 | 71338 | 4200× |
| Homomorphic add. | 9070884 | 3072 | 2950× |
| Decryption | 22400640 | 73386 | 300× |

Next, we present the speedup obtained by the hardware accelerator, which is built from
the modules in our hardware library, over its software counterpart. For this comparison,
we recorded the number of clock cycles required for encryption, HE multiplication, HE
addition and decryption in the Palisade software library, using the same underlying
scheme with an RNS implementation and the same parameters. Table 4 lists the observed
speedup. This evaluation assumes that we utilize the maximum possible resources available
on the FPGA used for our evaluation. There is scope to further enhance the speedup using
more hardware resources.
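The speedup column of Table 4 can be reproduced directly from the two cycle counts. The following is a minimal sketch of that arithmetic; the names `TABLE4_CYCLES`, `speedup` and `speedups` are illustrative and not part of the library. The ratios are in clock cycles only and say nothing about wall-clock time, which also depends on the clock frequency of each platform.

```python
# Cycle counts copied from Table 4: (Palisade software, our hardware library).
TABLE4_CYCLES = {
    "Encryption": (119_700_000, 75_434),
    "Homomorphic mult.": (299_729_520, 71_338),
    "Homomorphic add.": (9_070_884, 3_072),
    "Decryption": (22_400_640, 73_386),
}


def speedup(software_cycles: int, hardware_cycles: int) -> float:
    """Return the ratio of software cycles to hardware cycles."""
    return software_cycles / hardware_cycles


def speedups(table=TABLE4_CYCLES) -> dict:
    """Compute the cycle-count ratio for every operation in the table."""
    return {op: speedup(sw, hw) for op, (sw, hw) in table.items()}


if __name__ == "__main__":
    for op, ratio in speedups().items():
        print(f"{op:<18} {ratio:8.1f}x")
```

The exact ratios come out slightly above the values reported in Table 4, which appear to be rounded down to convenient figures.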
Finally, we evaluate the time required to make a prediction using logistic regression,
a common tool used in machine learning for binary classification problems. Making
predictions using a logistic regression model with features $x_0, x_1, \ldots, x_n$
requires computing the logistic regression equation, $Y = e^{X} / (1 + e^{X})$, where
$X = \sum_{i=0}^{n} b_i x_i$. We first compute $X$ and then use this value in the
polynomial obtained from the Remez algorithm (an iterative minimax approximation
algorithm [Fra65]), $Y(X) = -0.004X^{3} + 0.197X + 0.5$, to compute the probability.
Using this polynomial avoids the exponentiation and division that the logistic regression
equation requires. However, these equations require working over floating point numbers,
and the FV scheme does not support the encryption of floating point numbers. Hence, we
scale them to integers and perform fixed point operations instead. We observe that our
hardware accelerator requires 470,064 clock cycles to perform one logistic regression
prediction using homomorphic encryption, providing a speedup of around 2650× over a
software-based prediction, as listed in Table 5.
Table 5: Hardware speedup for Logistic Regression prediction.
Operation                         Speedup (in clock cycles)
Logistic Regression prediction    2650×
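The approximation step described in this section can be sketched in plaintext arithmetic as follows. This is an illustration under stated assumptions, not the accelerator's implementation: the scale factor `SCALE`, the function names, and the choice to carry the result at scale `SCALE**4` (so that only integer multiplications and additions are needed) are all hypothetical, and the real computation operates on ciphertexts rather than plain integers.

```python
import math

SCALE = 1000  # hypothetical fixed-point scale factor (assumption)


def linear_combination(b, x):
    """Compute X = sum_i b_i * x_i."""
    return sum(bi * xi for bi, xi in zip(b, x))


def sigmoid(x: float) -> float:
    """Exact logistic function Y = e^X / (1 + e^X), for reference only."""
    return math.exp(x) / (1.0 + math.exp(x))


def poly_sigmoid(x: float) -> float:
    """Degree-3 approximation Y(X) = -0.004 X^3 + 0.197 X + 0.5."""
    return -0.004 * x**3 + 0.197 * x + 0.5


def poly_sigmoid_fixed(x_scaled: int) -> int:
    """Integer-only version of poly_sigmoid.

    x_scaled is X * SCALE. Coefficients are scaled by SCALE as well, and
    every term is brought to the common scale SCALE**4, so the result is
    Y * SCALE**4 and no division is needed.
    """
    a3, a1, a0 = -4, 197, 500  # -0.004, 0.197, 0.5 multiplied by SCALE
    return a3 * x_scaled**3 + a1 * x_scaled * SCALE**2 + a0 * SCALE**3


if __name__ == "__main__":
    b = [0.5, -0.25, 1.0]  # hypothetical model weights
    x = [1.0, 2.0, 1.5]    # hypothetical feature values
    X = linear_combination(b, x)
    y_fixed = poly_sigmoid_fixed(round(X * SCALE)) / SCALE**4
    print(X, sigmoid(X), poly_sigmoid(X), y_fixed)
```

Keeping all terms at a common scale means the final descaling can be deferred until after decryption, at the cost of larger intermediate integers.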
5 Conclusion
We presented a fast arithmetic hardware library with a focus on accelerating the key
arithmetic operations involved in RLWE-based somewhat homomorphic encryption. For all of
these operations, we include a hardware-cost-efficient serial implementation and a fast
parallel implementation in the library. We also presented a modular and hierarchical
implementation of a hardware accelerator, built from the modules of the proposed
arithmetic library, to demonstrate the speedup achievable in hardware. The parameterized
design of the modules and the hardware accelerator provides the flexibility to extend the
use of the modules to other schemes, such as BGV, and the use of the accelerator to many
applications, especially in the FPGA-centric cloud computing environment. Evaluation of
the implementation shows that speedups of about 4200× and 2950× for homomorphic
multiplication and addition, respectively, are achievable in hardware when compared to a
software implementation.
As future work, we would like to optimize and implement the arithmetic operations
involved in bootstrapping as well. The bootstrap operation is one of the key functions
in achieving fully homomorphic encryption, but it remains very expensive to perform.
Optimizing the bootstrap operation will render it more practical to use. We are also
actively working on integrating other RLWE-based homomorphic encryption schemes,
like BGV, into our library so as to leverage inherent advantages that these schemes offer.
Once we have the required operations and schemes implemented, we will open-source the
arithmetic library and FPGA design examples.
References
[ABK] Rashmi Agrawal, Lake Bu, and Michel A Kinsy. A post-quantum secure
discrete gaussian noise sampler. 2020 IEEE International Symposium on
Hardware Oriented Security and Trust (HOST).
[AMBG+16] Carlos Aguilar-Melchor, Joris Barrier, Serge Guelton, Adrien Guinet, Marc-
Olivier Killijian, and Tancrede Lepoint. NFLlib: NTT-based fast lattice library.
In Cryptographers’ Track at the RSA Conference, pages 341–356. Springer,
2016.
[Bar87] Paul Barrett. Implementing the Rivest Shamir and Adleman public key
encryption algorithm on a standard digital signal processor. In Andrew M.
Odlyzko, editor, Advances in Cryptology — CRYPTO’ 86, pages 311–323,
Berlin, Heidelberg, 1987. Springer Berlin Heidelberg.
[BFZY11] Kang Bing, Liu Fu, Yun Zhuo, and Liang Yanlei. Design of an internet
of things-based smart home system. In 2011 2nd International Conference
on Intelligent Control and Information Processing, volume 2, pages 921–924.
IEEE, 2011.
[BGV14] Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. (leveled) fully
homomorphic encryption without bootstrapping. ACM Transactions on
Computation Theory (TOCT), 6(3):13, 2014.
[CKKS17] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. Homomor-
phic encryption for arithmetic of approximate numbers. In International
Conference on the Theory and Application of Cryptology and Information
Security, pages 409–437. Springer, 2017.
[CLP] Anamaria Costache, Kim Laine, and Rachel Player. Evaluating the
effectiveness of heuristic worst-case noise analysis in FHE.
[CLP17] Hao Chen, Kim Laine, and Rachel Player. Simple encrypted arithmetic
library - SEAL v2.1. In International Conference on Financial Cryptography
and Data Security, pages 3–18. Springer, 2017.
[CMV+15] Donald Donglong Chen, Nele Mentens, Frederik Vercauteren, Sujoy Sinha
Roy, Ray CC Cheung, Derek Pao, and Ingrid Verbauwhede. High-speed
polynomial multiplication architecture for ring-LWE and SHE cryptosystems.
IEEE Transactions on Circuits and Systems I: Regular Papers, 62(1):157–
166, 2015.
[DS15] Wei Dai and Berk Sunar. cuHE: A homomorphic encryption accelerator
library. In International Conference on Cryptography and Information
Security in the Balkans, pages 169–186. Springer, 2015.
[Fra65] W. Fraser. A survey of methods of computing minimax and near-minimax
polynomial approximations for functions of a single independent variable. J.
ACM, 12(3):295–314, July 1965.
[FV12] Junfeng Fan and Frederik Vercauteren. Somewhat practical fully homomor-
phic encryption. IACR Cryptology ePrint Archive, 2012:144, 2012.
[G+09] Craig Gentry et al. Fully homomorphic encryption using ideal lattices. In
STOC, volume 9, pages 169–178, 2009.
[Gar59] Harvey L Garner. The residue number system. In Papers presented at the
March 3-5, 1959, western joint computer conference, pages 146–153. ACM,
1959.
[GHS12] Craig Gentry, Shai Halevi, and Nigel P. Smart. Homomorphic evaluation of
the AES circuit. In Reihaneh Safavi-Naini and Ran Canetti, editors, Advances
in Cryptology – CRYPTO 2012, pages 850–867, Berlin, Heidelberg, 2012.
Springer Berlin Heidelberg.
[Hay08] Brian Hayes. Cloud computing. Communications of the ACM, 51(7):9–11,
[HGG07] William Hasenplaugh, Gunnar Gaubatz, and Vinodh Gopal. Fast modu-
lar reduction. In Proceedings of the 18th IEEE Symposium on Computer
Arithmetic, ARITH ’07, pages 225–229, Washington, DC, USA, 2007. IEEE
Computer Society.
[HS14] Shai Halevi and Victor Shoup. Algorithms in HElib. In Juan A. Garay and
Rosario Gennaro, editors, Advances in Cryptology – CRYPTO 2014, pages
554–571, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
[KI07] Victor J Katz and Annette Imhausen. The Mathematics of Egypt,
Mesopotamia, China, India, and Islam: A Sourcebook. Princeton Univer-
sity Press, 2007.
[KIA11] SO Kuyoro, F Ibikunle, and O Awodele. Cloud computing security issues and
challenges. International Journal of Computer Networks (IJCN), 3(5):247–
255, 2011.
[Kim18] Andrey Kim. HEAAN, 2018.
[KJBB16] Ravi Kishore Kodali, Vishal Jain, Suvadeep Bose, and Lakshmi Boppana. IoT
based smart security and home automation system. In 2016 international
conference on computing, communication and automation (ICCCA), pages
1286–1289. IEEE, 2016.
[LT219] Lattigo 1.3.0. Online: http://github.com/ldsec/lattigo, December
2019. EPFL-LDS.
[MG+11] Peter Mell, Tim Grance, et al. The NIST definition of cloud computing. 2011.
[Mon85] Peter L Montgomery. Modular multiplication without trial division. Math-
ematics of computation, 44(170):519–521, 1985.
[NLV11] Michael Naehrig, Kristin Lauter, and Vinod Vaikuntanathan. Can homo-
morphic encryption be practical? In Proceedings of the 3rd ACM workshop
on Cloud computing security workshop, pages 113–124. ACM, 2011.
[PH10] Krešimir Popović and Željko Hocenski. Cloud computing security issues and
challenges. In The 33rd International Convention MIPRO, pages 344–349.
IEEE, 2010.
[PJ03] Sungmee Park and Sundaresan Jayaraman. Enhancing the quality of life
through wearable technology. IEEE Engineering in medicine and biology
magazine, 22(3):41–48, 2003.
[RAD+78] Ronald L Rivest, Len Adleman, Michael L Dertouzos, et al. On data banks
and privacy homomorphisms. Foundations of secure computation, 4(11):169–
180, 1978.
[SJJT86] Michael A Soderstrand, W Kenneth Jenkins, Graham A Jullien, and Fred J
Taylor. Residue number system arithmetic: modern applications in digital
signal processing. IEEE press, 1986.
[ST67] Nicholas S Szabo and Richard I Tanaka. Residue arithmetic and its applica-
tions to computer technology. McGraw-Hill, 1967.
[Tec19] Duality Technologies. PALISADE library. 2019.
[Vin16] Ivan Matveevich Vinogradov. Elements of number theory. Courier Dover
Publications, 2016.

