HEPCloud: An FPGA-based Multicore Processor for FV Somewhat Homomorphic Function Evaluation by Roy, Sujoy Sinha et al.
1HEPCloud: An FPGA-based Multicore
Processor for FV Somewhat Homomorphic
Function Evaluation
Sujoy Sinha Roy, Kimmo Ja¨rvinen, Jo Vliegen, Frederik Vercauteren, Ingrid Verbauwhede
Abstract—In this paper we present an FPGA based hardware accelerator ‘HEPCloud’ for homomorphic evaluations of medium depth
functions which has applications in cloud computing. Our HEPCloud architecture supports the polynomial ring based homomorphic
encryption scheme FV for a ring-LWE parameter set of dimension 215, modulus size 1228-bit and a standard deviation 50. This
parameter-set offers a multiplicative depth 36 and at least 85 bit security. The processor of HEPCloud is composed of multiple parallel
cores. To achieve fast computation time for such a large parameter-set, various optimizations in both algorithm and architecture levels
are performed. For fast polynomial multiplications, we use CRT with NTT and achieve two dimensional parallelism in HEPCloud. We
optimize the BRAM access, use a fast Barrett like polynomial reduction method, optimize the cost of CRT, and design a fast
divide-and-round unit. Beside parallel processing, we apply pipelining strategy in several of the sequential building blocks to reduce the
impact of sequential computations. Finally we implement HEPCloud on a medium-size Xilinx Virtex 6 FPGA board ML605 board and
measure its on-board performance. To store the ciphertexts during a homomorphic function evaluation, we use the large DDR3
memory of the ML605 board. Our FPGA-based implementation of HEPCloud computes a homomorphic multiplication in 26.67 s, of
which the actual computation takes only 3.36 s and the rest is spent for off-chip memory access. It requires about 37551 s to evaluate
the SIMON-64/128 block cipher, but the per-block timing is only about 18 s because HEPCloud processes 2048 blocks simultaneously.
The results show that FPGA-based acceleration of homomorphic function evaluations is feasible, but fast memory interface is crucial
for the performance.
Index Terms—Homomorphic encryption, FV, lattice-based cryptography, ring-LWE, polynomial multiplication, number theoretic
transform, hardware implementation.
F
1 INTRODUCTION
THE CONCEPT of fully homomorphic encryption (FHE),a form of encryption that allows evaluating arbitrary
functions on encrypted data, was introduced by Rivest,
Adleman, and Dertouzos [32] already in 1978. Constructing
FHE schemes proved to be a difficult problem that remained
unsolved until 2009 when Gentry [17] proposed the first
FHE scheme by using ideal lattices. Despite its groundbreak-
ing nature, Gentry’s proposal did not provide a practical
solution because of its low performance. Since then, many
researchers have developed more efficient schemes to im-
prove the performance of FHE [6], [7], [12], [16], [18], [19],
[27], [37]. Despite these major advances, FHE schemes are
too slow to be used in practical applications. Even some-
what homomorphic encryption (SHE) schemes, which can
perform a limited number of operations on encrypted data,
are also very slow. Software implementations require min-
utes or hours to evaluate even rather simple functions. For
e.g., evaluating the decryption of a lightweight block cipher
• S. Sinha Roy, F. Vercauteren, J. Vliegen and I. Verbauwhede are with
KU Leuven ESAT/COSIC and imec, Kasteelpark Arenberg 10, B-3001
Leuven-Heverlee, Belgium
E-mail: firstname.lastname@esat.kuleuven.be
• K. Ja¨rvinen is with University of Helsinki, Department of Computer
Science, Gustaf Ha¨llstro¨min katu 2b, FI-00014, Helsinki, Finland
E-mail: kimmo.u.jarvinen@helsinki.fi
• F. Vercauteren is also with Open Security Research, FangDa Building 704,
Kejinan-12th, Nanshan, Shenzhen, 518000, China
Date of the manuscript: October 27, 2017.
SIMON-64/128 (block/key size 64/128 bits) [3] requires
4193 s (an hour and 10 minutes) on a 4-core Intel Core-
i7 processor [23]. If FHE could achieve performance levels
that would permit large-scale practical use, it would have
a drastic effect on cloud computing: users could outsource
computations to the cloud without the need to trust service
providers and their mechanisms for protecting users’ data
from outsiders.
Hardware accelerators have been successfully used for
accelerating performance-critical computations in cryptol-
ogy already for several decades. Hence, it is not surprising
that during the recent years several publications [8], [13],
[14], [15], [26], [28], [31], [33], [40], [41] have reported re-
sults on hardware-based acceleration of different FHE and
SHE schemes or their central operations. Technical maturity
and suitability for practical deployment differ significantly
between the published implementations. The parts of the
schemes that are implemented in the publications range
from only central operations such as large integer multi-
plications (see, e.g., [8]) through full implementations of
the computational parts of the schemes (see, e.g., [33]) to
complete implementations in real hardware that include
also all memory handling, interfacing with a host processor,
etc. (see, e.g., [31]).
The published implementations focus on many different
FHE and SHE schemes. In this paper, we focus on the Fan-
Vercauteren (FV) SHE scheme [16], which is based on the
Ring Learning with Errors (Ring-LWE) problem. To the best
of our knowledge, there are no published hardware imple-
mentations of FV prior to this paper; a very recent paper by
Cathe´bras et al. [9] analyses parameter selection of FV in the
light of hardware efficiency but does not provide any actual
implementation results. From the implementation point-of-
view, the FV scheme is close to the Yet-Another-Somewhat-
Homomorphic-Encryption (YASHE) scheme by Bos et al. [6]
because the FV scheme can be implemented with certain
minor modifications and additions to a YASHE architecture.
Hence, the closest counterparts in the literature are the two
implementations of the YASHE scheme that were published
by Po¨ppelmann et al. in [31] and by us in [33]. Recently in
2016, Albrecht et al. [25] published a subfield lattice attack
that runs in sub-exponential time on overstretched NTRU
assumptions. Since the key generation part of the YASHE
scheme relies on a mildly overstretched NTRU assumption,
the subfield attack makes YASHE insecure. However, the FV
scheme remains secure as this attack does not apply to it.
In this paper, we present HEPCloud, an FPGA-based
hardware accelerator for homomorphic evaluations of
medium depth functions, which supports the FV SHE
scheme. It is designed primarily for speeding up homo-
morphic evaluations in cloud computing: the cloud service
provider installs our HEPCloud accelerator on its servers
and delegates the heavy homomorphic function evaluations
to this FPGA-based accelerator. Our goal in designing HEP-
Cloud is to provide a proof-of-concept implementation of a
solution that is complete and mature enough for practical
use. We implement all required components for homomor-
phic function evaluations in real hardware. The architecture
of HEPCloud is based on the SHE accelerator architecture
for the YASHE scheme that we introduced in our earlier
work [33]. The contributions of this paper compared to
previous works and, especially, [33] can be summarized as
follows:
• We introduce an efficient hardware implementation
of the FV SHE scheme that, to the best of our knowl-
edge, is the first FPGA-based accelerator for FV.
• We improve the architecture from [33] to support FV
by designing a unified architecture for two lifting
operations and an architecture for the residue poly-
nomial computation, and by fine-tuning the design
decisions.
• We implement HEPCloud using a Xilinx Virtex-6
ML605 evaluation board and verify its operation
and performance via on-board performance mea-
surements; [33] presented only after place-and-route
results and did not provide any evaluation results on
real hardware.
• We discuss the feasibility of FPGA-based acceleration
of homomorphic function evaluation. We conclude
that despite certain obstacles, in particular, with the
speed of the memory interface, FPGA-based accel-
eration and HEPCloud are feasible solutions for re-
ducing the large overhead of homomorphic function
evaluations in cloud computing environments.
The paper is structured as follows. Section 2 describes
the FV SHE scheme as well as the system setup and the
parameter set that we use. Section 3 contains a high level
description of known optimization techniques to speed-up
computations in modular polynomial rings and describes
how we represent polynomials using the Chinese Remain-
der Theorem (CRT) in order to parallelize computations.
We present our hardware architecture for FV in Section 4.
Section 5 shows the performance results and we end the
paper with the summary in Sect. 6.
2 THE FV HOMOMORPHIC ENCRYPTION SCHEME
In this section we briefly describe the FV somewhat homo-
morphic encryption scheme. The FV scheme was introduced
by Fan and Vercauteren [16] in 2012. It uses a basic ring-LWE
public-key encryption scheme and two additional functions
Add and Mult to perform arithmetic operation on encrypted
data. The polynomial ring is R = Z[x]/〈f(x)〉 with f(x) =
Φd(x), the d-th cyclotomic polynomial of degree n = ϕ(d).
The key generation and the encryption operations in FV
require sampling from two probability distributions defined
on R, namely χkey and χerr respectively. The security is
determined by the degree n of f , the size of the ciphertext
modulus q, and by the probability distributions. Following
[24] one may sample the key and the error polynomials
from a common distribution χ. Typically χ is a discrete
Gaussian distribution χσ with a small standard deviation
σ. However in practice the authors of FV took the private
key as a polynomial with coefficients from a narrow set like
{−1, 0, 1}. In the following we introduce two functions that
are used to describe the FV scheme.
Definition 2.1. (WordDecompw,q(a)). This function is used
to decompose a ring element a ∈ Rq in base w by slicing
each coefficient of a. For v = dlogw(q)e, this function
returns ai ∈ R with coefficients in (−w/2, w/2], where
a =
∑v−1
i=0 aiw
i.
Definition 2.2. (PowersOfw,q(a)). This function scales an
element a ∈ Rq by the different powers ofw. It is defined
as PowersOfw,q(a) = (awi)v−1i=0 . The two functions can be
used to perform a polynomial multiplication in Rq as
〈WordDecompw,q(a),PowersOfw,q(b)〉 = a · b mod q .
This expression has the advantage of reducing the noise
during homomorphic multiplications, as the first vector
contains small elements (in base w).
Now we enumerate the functions used in the FV scheme.
For details of the functions, interested readers may follow
the original paper [16] or the presentation [38].
1) FV.ParamsGen(λ): For a given security parameter
λ, choose a polynomial Φd(x), ciphertext mod-
ulus q and plaintext modulus t, and distribu-
tions χerr and χkey . Also choose the base w for
WordDecompw,q(·). Return the system parameters
(Φd(x), q, t, χerr, χkey, w). Following [16] we use a
uniform signed binary distribution for χkey .
2) FV.KeyGen(Φd(x), q, t, χerr, χkey, w): Sample poly-
nomial s from χkey , sample a ← Rq uniformly
at random, and sample e ← χerr. Compute b =
[−(as+ e)]q . The public key consists of two polyno-
mials pk = {b, a} and the secret key is sk = s. The
scheme uses another key called relinearisation key or
rlk in the function ReLin. This key is computed as
2
follows: first sample a← Rlq uniformly, then sample
e ← χlerr, and then compute rlk = {rlk0, rlk1} =
{[PowersOfw,q(s2)− (e + a · s)]q,a} ∈ {Rlq, Rlq}.
3) FV.Encrypt(pk,m): First encode the input message
m ∈ Rt into a polynomial ∆m ∈ Rq with
∆ = bq/tc. Next sample the error polynomi-
als e1, e2 ← χerr, sample u uniformly from the
signed binary distribution, and, compute the two
polynomials c0 = [∆m + bu + e1]q ∈ Rq and
c1 = [au + e2]q ∈ Rq . The ciphertext is the pair
of polynomials c = {c0, c1}.
4) FV.Decrypt(sk, c): Recover the message m =
[b tq ·
[c0 + sc1]qe
]
t
.
5) FV.Add(c1, c2): For two ciphertexts c1 = {c1,0, c1,1}
and c2 = {c2,0, c2,1}, return c˜add = {c1,0 +
c2,0, c1,1 + c2,1}.
6) FV.Mult(c1, c2, rlk): Compute c˜mult = {c˜0, c˜1, c˜2}
where c˜0 = [b tq · c1,0 · c2,0e]q , c˜1 = [b tq · (c1,0 · c2,1 +
c1,1 · c2,0)e]q , and c˜2 = [b tq · c1,1 · c2,1e]q . Next call
the function ReLin(c˜mult, rlk).
7) FV.Relin(c˜mult, rlk): Compute a relinearised ci-
phertext is c′ = {c′0, c′1}, where c′0 = [c˜0 +
〈WordDecompw,q(c˜2), rlk0〉]q and c′1 = [c˜1 +
〈WordDecompw,q(c˜2), rlk1〉]q .
2.1 System setup and parameter set
The polynomial ring used in the FV scheme is of the form
R = Z[x]/〈f(x)〉 where f(x) is a monic irreducible poly-
nomial of degree n. We put no restriction on f(x), which
allows us to deal with any cyclotomic polynomial Φd(x)
and thus to utilize single instruction multiple data (SIMD)
operations [35], [36]. The SIMD feature embeds multiple
plaintexts into different “slots” in a single ciphertext and
allows evaluating a function on all of them in parallel with
a single execution. Indeed to exploit the SIMD feature, we
choose an irreducible polynomial f(x) such that f(x) mod 2
splits into many different irreducible factors, each factor
corresponding to “one slot” in the SIMD representation. It
is easy to see that this excludes f(x) = xn + 1 with n a
power of two, since it results in only one irreducible factor
modulo 2. For ring-LWE based public-key encryption and
signature schemes [29], [30], [34] it is very common to take
f(x) = xn + 1 with n a power of two. This particular
choice enables polynomial multiplications without reduc-
tions modulo f(x). With our current choice, we achieve
SIMD, but we pay in modular reductions by f(x).
We use a parameter set with d = 65535 (and thus the
degree of f(x) is 32768 = 215), log2(q) = 1228 and χerr
a discrete Gaussian distribution with parameter σ = 50.
We choose the plaintext modulus t = 2, i.e., we evaluate
bit-level operations. The irreducible polynomial f(x) splits
modulo 2 in 2048 different irreducible polynomials, which
implies that we can work on 2048 bits in parallel using the
SIMD method first outlined in [35]. Following the recent
work [5] the multiplicative depth of the parameter set is
36. To estimate the security of the parameter set we took
help of the tool developed by Albrecht [1]. The run time
of the tool increases with the size of the parameter set,
and for the chosen parameter set (which is very large)
we were not able to calculate the security directly in a
Input: Polynomial a(x) ∈ Zq[x] of degree N − 1 and
N -th primitive root ωN ∈ Zq of unity
Output: Polynomial A(x) ∈ Zq[x] = NTT(a)
begin1
A← BitReverse(a)2
/* m doubles each iteration */
for m = 2 to N by m = 2m do3
ωm ← ωN/mN4
ω ← 15
for j = 0 to m/2− 1 do6
for k = 0 to N − 1 by m do7
/* Butterfly operation */
t← ω ·A[k + j +m/2]8
u← A[k + j]9
A[k + j]← u+ t10
A[k + j +m/2]← u− t11
ω ← ω · ωm12
end13
Algorithm 1: Iterative NTT [11]
reasonable time from the tool. The tool can calculate the
security of smaller LWE instances with (n, log2(q), σ) =
{(2048, 100, 50), (4096, 188, 50), (8192, 352, 50)}. These val-
ues are 79, 81, and 85-bits respectively. If we extrapolate in
the same way, then for the parameter set of HEPCloud we
get at least 85-bit security.
3 HIGH LEVEL OPTIMIZATIONS
To efficiently implement FV we have to analyze the two
main operations in detail, namely homomorphic addition
and homomorphic multiplication. Homomorphic addition
is easy to deal with since this simply corresponds to polyno-
mial addition in Rq . Homomorphic multiplication is much
more involved and is the main focus of this paper. As can
be seen from the definition of FV.Mult in Sec. 2, we first need
to multiply the polynomials (i.e. c1,0, c2,0 etc.) of the input
ciphertexts over integers (i.e. without reduction modulo q),
then scale by t/q and round, before mapping back into the
ring Rq . The fact that one first has to compute the result
over the integers (to allow for the scaling and rounding) has
a major influence on how elements of Rq are represented
and on how the multiplication has to be computed. Since
each element in Rq is a polynomial of degree n − 1, the
result of a polynomial multiplication (without reduction
modulo f(x)) will have degree 2n − 2. As such we choose
the smallest N = 2k > 2n− 2, and compute the product of
the two polynomials in the ring Zq[x]/(xN −1) by applying
the N -point NTT (see Alg. 1). The NTT requires the N -
th roots of unity to exist in Zq , so we either choose q a
prime with q ≡ 1 mod N or q a product of small primes
qi with each qi ≡ 1 mod N . It is the latter choice that will
be used throughout this work. The product of two elements
a, b ∈ Rq is then computed in two steps: firstly, the product
modulo xN − 1 (note that there will be no reduction, since
the degree of the product is small enough) is computed
using two NTT’s, N pointwise multiplications modulo q
and then finally, one inverse NTT. To recover the result in
3
Rq , we need a reduction modulo f(x). For this purpose, we
use the Newton iteration method [39].
Note that the polynomial multiplication in FV.Mult are
performed over integers. To get the benefit of NTT, we
perform these multiplications in a larger ring RQ where Q
is a sufficiently large modulus of size ∼ 2 log2(q) such that
the coefficients of the result polynomials are in Z.
3.1 CRT representation of polynomials
The biggest challenge while designing a homomorphic pro-
cessor is the complexity of computation. During a homo-
morphic operation, computations are performed on polyno-
mials of degree 215 or 216 and coefficients of size ∼1,200 or
∼2,500 bits. To tackle the problem of long integer arithmetic,
we take inspiration from the application of the CRT [4] in the
RSA cryptosystems. We choose the moduli q and Q as prod-
ucts of many small prime moduli qi, such that q =
∏l−1
0 qi
and Q =
∏L−1
0 qi, where l < L. Thus any operation modulo
q or Q maps into small computations moduli qi. We use the
term small residue to represent coefficients modulo qi and the
term large residue to represent coefficients modulo q or Q.
3.2 FV.Mult in residue domain
Let the two input ciphertexts be c1 = {c1,0, c1,1} and
c2 = {c2,0, c2,1}. The homomorphic multiplication steps are
described below.
1) Liftq→Q: Lift c1,0, c1,1, c2,0 and c2,1 to RQ from
Rq , i.e., compute the additional residue polynomials
moduli qj for j ∈ [l, L− 1]. Since the ciphertexts are
represented as residue polynomials moduli qi for
i ∈ [0, l − 1] in Rq , we first need to compute the
coefficients modulo q in (−q/2, q/2) by applying
the CRT, and then compute the additional residue
polynomials.
2) PolyArithmeticQ: Compute the product polynomials
c˜0 = c1,0·c2,0, c˜1 = c1,0·c2,1+c1,1·c2,0 and c˜2 = c1,1·
c2,1 by computing multiplications and additions of
the residue polynomials moduli qj for j ∈ [0, L−1].
3) LiftQ→q : Apply CRT on the coefficients of the residue
polynomials of c˜0, c˜1 and c˜2 to get the coeffi-
cients modulo Q in (−Q/2, Q/2). Now compute the
division-and-rounding operations to c˜0 = b t·c˜0q e,
c˜1 = b t·c˜1q e and c˜2 = b t·c˜2q e. Next, reduce the
coefficients modulo q in (−q/2, q/2).
4) ResPolq→qi : Compute the residue polynomials of c˜0
and c˜1 modulo qi for i ∈ [0, l − 1].
5) WordDecomp: Split the coefficients of c˜2 into w-bit
words to get the vector c˜2 of dlg q/we polynomials.
6) FV.Relin: Compute the residue polynomials for each
member of c˜2 and then compute c′0 = c0+〈c˜2, rlk0〉
and c′1 = c1+〈c˜2, rlk1〉 by performing arithmetic on
the residue polynomials modulo qi for i ∈ [0, l − 1].
This step outputs the result of the homomorphic
multiplication c′ = {c′0, c′1} as a set of residue
polynomials in R2q .
The polynomial arithmetic on the residue polynomials
can be performed in parallel. The size of the moduli qi is an
important design decision and depends on the underlying
DDR
Polynomial 
multiplication
and addition
CRT, division and
reduction
rounding, modular
FPGA
Fig. 1. High-level view of HEPCloud architecture
platform. We implement the hardware accelerator on the
Xilinx ML605 board, which has a Virtex-6 FPGA. The FPGA
provides 24×17-bit unsigned DSP multipliers to perform in-
teger multiplications. We could implement a slightly larger
integer multiplier by combining a DSP multiplier with LUT-
based logic. In this work we choose 30-bit prime qi that
satisfy qi ≡ 1 mod N . The reasons for selecting only 30-
bit of primes are: 1) there are sufficiently many primes of
size 30-bit to compose 1,228-bit q and 2,517-bit Q, 2) the
data-paths for performing computations modulo qi become
symmetric, and 3) the basic computation blocks, such as
adders and multipliers of size 30-bit can be implemented
efficiently using the available DSP slices and a few LUTs.
4 ARCHITECTURE
In this section we design a hardware architecture of HEP-
Cloud to accelerate FV.Add and FV.Mult. In Fig. 1 a high-
level view of our architecture is shown. The computation
core of HEPCloud is hosted on a medium size Xilinx Virtex-
6 XC6VLX240T FPGA. The design decisions take account of
the resources available on the board. Since the ciphertexts
are large, of size 2 × 4.8 MB, we use the DDR memory of
the board to store the ciphertexts. During a computation,
portions of the ciphertext(s) are read from the DDR memory
and stored in the on-FPGA BRAMs. After the computation,
the result is written back in the DDR memory. The speed
of the communication between the DDR memory and the
FPGA has a major impact on the performance. In this work
we restrict the data-size to 256 bits per DDR memory access.
In the remaining part of this section, we describe our design
decisions and optimization tricks.
4.1 Architecture for polynomial arithmetic
As discussed in Sec. 3.2, FV.Mult requires arithmetic with
polynomials that have high degrees and large coefficient
sizes. For faster multiplications of the residue polynomials
in the steps PolyArithmeticQ and FV.Relin of Sec. 3.2, we
use the NTT-based polynomial multiplication algorithm.
However a sequential implementation of Alg. 1 will not
be enough to accelerate FV.Mult since the parameter set
(Sec. 2.1) that we use in this paper is very large. E.g., if
we use the sequential NTT core of [34], then one N -point
NTT of a residue polynomial will consume more than 524K
cycles. Note that the NTT algorithm (Alg. 1) is amicable
to parallelism. Hence we use parallel cores to reduce the
number of cycles.
4
4.1.1 Optimization in the routing
Using v parallel cores, where v|N , we can split an N -point
NTT into v parallel butterfly computation threads. Let the
N coefficients be stored in b BRAMs and v|b. There are two
main technical issues related to the memory access that af-
fect the performance of the NTT computation. The first one
is: all the parallel cores access the BRAMs simultaneously.
Since a simple dual port BRAM has one port for reading and
one port for writing, it can support only one read and write
in a cycle. This puts the restriction that a BRAM should be
read (or written) by one core in a cycle, i.e., the generation
of the BRAM-addresses by the parallel cores should be free
from conflicts.
The second issue is related to the routing complexity. A
residue polynomial is stored in many BRAMs, and hence,
if a core needs to access a BRAM that is far from it, then
the routing of wires will be very long. Note that in the basic
NTT (Alg. 1) we see that the maximum difference between
the indexes of the two coefficients is N/2. In our parameter-
set N = 216 and this results in a long critical path.
We address these two technical issues by connecting the
/* Core with index c for butterfly computations */
module butterfly-core(parameter c; output: m, address1, address2, s1,c,1
s2,c, s3,c, s4,c)
begin2
(Itwiddle, offset)← (N/2, 1)3
form = 0 to logN − 1 do4
ωm ← 2m-th primitiveroot(1)5
Nbutterfly ← 0 ; /* Counts butterfly op. in m-loop */6
ω ← ωm,c ; /* Power of ωm for a core-index c */7
for base = 0 to base < offset do8
increment← 09
while base + offset + increment < N2v do10
address1 ← base + increment11
address2 ← base + offset + increment12
/* Reads only c-th group of BRAMs */
(t1, u1)←MEMORYc[address1]13
(t2, u2)←MEMORYc[address2]14
ifm < logN − 1 then15
/* Comp. two butterfly op. */
(t1, t2)← (ω · t1, ω · t2)16
(s1,c, s2,c)← (u1 + t1, u1 − t1)17
(s3,c, s4,c)← (u2 + t2, u2 − t2)18
Nbutterfly ← Nbutterfly + 219
increment = increment + 2 · offset20
if Itwiddle|Nbutterfly then ω ← ω · ωv/2m21
else22
t1 ← ω · t1; ω ← ω · ωv/2m23
t2 ← ω · t2; ω ← ω · ωv/2m24
(s1,c, s2,c, s3,c, s4,c)←25
(u1 + t1, u1 − t1, u2 + t2, u2 − t2)
Nbutterfly ← Nbutterfly + 226
increment = increment + 2 · offset27
Itwiddle ← Itwiddle/228
if offset < v/2 then offset← 2 · offset29
end30
/* Assembles output coefficients from two butterfly
cores and writes in the c-th group of BRAMs */
module memory-write(parameter c; input: m, address1, address2, s1,0,31
. . . s4,v−1)
begin32
if 2m < v2 then gap← 2m33
else gap← v2 ; /* Index gap between two cores */34
if c < v/2 then35
MEMORYc[address1]← (s2,c, s1,c)36
MEMORYc[address2]← (s2,c+gap, s1,c+gap)37
else38
MEMORYc[address1]← (s4,c, s3,c)39
MEMORYc[address2]← (s4,c+gap, s3,c+gap)40
end41
Algorithm 2: Routing efficient parallel NTT [33]
read ports of a group of BRAMs to only one butterfly core.
This dedicated read prevents any sort of conflict during the
memory read operations. Additionally this helps the design
tool to place the butterfly core adjacent to the proper group
of BRAMs in the FPGA, thus reducing routing complexity.
Memory access by a butterfly core with core-index c during
NTT is shown in Alg. 2.
In Alg. 2 MEMORYc is the group of BRAMs that are
connected to the input ports of the c-th butterfly core. Each
memory word contains two residue (2× 30-bit) coefficients.
In line 13-14 two coefficient-pairs (hence four coefficients)
are read from the memory and then the ‘butterfly steps’
(line 16-18) are computed on them. This gives four new
coefficients s1,c, s2,c, s3,c, and s4,c.
The twiddle factor ω, that is used in the butterfly
steps, is initialized to a constant value ωm,c in line 7. The
value of ωm,c is actually an exponent of ωm, where the
exponentiation-magnitude depends on the core index c. The
values of ωm,cs can be stored in a small table or a ROM. The
counter Itwiddle denotes the interval at which ω should be
updated with a new value (line 21). Whenever the number
of butterfly operations (Nbutterfly) becomes a multiple of
Itwiddle, a new ω is computed in line 21. The addresses
registers address1 and address2 are computed from the
counters: base, increment, and offset, that represent the start-
ing memory address, the increment value, and the current
difference between address1 and address2 respectively.
The memory-write module in line 31 collects the coeffi-
cients generated by several of the parallel butterfly cores and
writes the ‘proper’ coefficients in the c-th group of BRAMs.
By ‘proper’ we mean the coefficients that will be read by
the c-th butterfly core in the next iteration of the m-loop in
Alg. 2. The proper coefficients are filtered by observing the
value of a variable gap that depends on c, m and v.
Discussion: Since a butterfly core is attached to a fixed
set of BRAMs, our Alg. 2 minimizes the critical paths (and
hence routing) that lie between the output ports of the
BRAMs and the input ports of the butterfly core. However
the input ports of the BRAMs still remain connected to
multiple butterfly cores. The routing delay at the input
ports might be reduced by inserting pipeline-registers in
the critical paths that start from the output ports of the
butterfly cores and end at the input ports of the memories.
We consider this as a future work. Note that if Alg. 1 is
used to implement HEPCloud then long critical paths would
appear at both the input and output ports of the BRAMs.
4.1.2 Architecture of polynomial arithmetic unit (PAU)
In Fig. 2 we show the internal architecture of the cores that
we use to perform arithmetic on the residue polynomials.
The cores have been designed following the footprints of
the polynomial arithmetic core of [34]. The input register
bank contains registers to store data from the BRAMs. In
addition, the register bank also contains shift registers to
delay the input coefficients in a pipeline during an NTT
computation. The register bank has several ports to provide
data to several other components present in the core. We use
the common name Dregbank to represent all data-outputs
from the register bank. The small ROM block in Fig. 2
contains the twiddle factors and the value ofN−1 to support
the computation of NTT and INTT.
5
...
D RegBank
D RegBank
...
...
D RegBank D
at
a
to
B
R
A
M
Data during liftingData from BRAM
Twiddle
Factors
Reduction
Barrett
Register Bank
Output
30−by−59 bit Multiplication
Small
ROM for
NTT
Constants in
Input
Register Bank
Cache for
Integer Multiplier
Polynomial Arithmetic Unit (PAU)
Fig. 2. Architecture for the vertical cores [33]
The integer multiplier (shown as a circle in Fig. 2) is
a 30 × 30-bit multiplier. We maintain a balance between
area and speed by combining two DSP multipliers and ad-
ditional LUT based small multipliers to form this multiplier.
After an integer multiplication, the result is reduced using
the Barrett reduction circuit [2] shown in Fig. 2. We use the
Barrett reduction technique due to two reasons. The first
reason is that the primes used in this implementation are
not of pseudo-Mersenne type which support fast modular
reduction technique [20]. The second reason is that the cores
are shared by all the prime moduli, and hence, a generic
reduction circuit is more preferable than several dedicated
reduction circuits. The Barrett reduction circuit is bit parallel
to process the outputs from the bit-parallel multiplier in
a flow. The reduction consists of three 31 × 31-bit mul-
tipliers and additional adders and subtractors. The multi-
pliers are implemented by combining two DSP multipliers
with additional LUTs. Thus in total, the Barrett reduction
block consumes six DSP multipliers. Beside performing the
modular reduction operations, the multipliers present in the
Barrett reduction circuit can be reused to perform 30×59-bit
multiplications during the CRT computations.
The adder/subtracter circuits after the Barrett reduction
block in Fig. 2 are used to compute the butterfly operations
during an NTT computation and to perform coefficient-wise
additions and subtractions of polynomials. Finally, the re-
sults of a computation are stored in the output register bank
and then the registers are written back in the memory. To
achieve high operating frequency, we put pipeline registers
(shown as magenta colored lines) in the data paths of the
computation circuits.
4.1.3 External memory access during NTT
During an NTT, the coefficients of the residue polynomial
are read sequentially from the DDR memory and then
loaded in the 16 internal memory blocks. For this purpose
the 256-bit DDR3 interface is used to receive four coefficient
pairs (i.e. eight coefficients) in a burst. However Alg. 2
generates the output coefficient pairs in a permutation that
is different from their initial arrangement. The coefficients
Input: An index of a coefficient pair where index ∈ [0, 215 − 1]
Output: Reverse of the index reverse index ∈ [0, 215 − 1]
begin1
c← index 11 ; /* index of the memory element */2
low← index&31 ; /* least five bits */3
high← (i 5)&63 ; /* next six bits */4
lsb← low[0]5
low← low− lsb6
low reverse← bitreverse(low) ; /* bits in reverse order */7
high reverse← bitreverse(high)8
bias← BiasTable[c]9
reverse index← bias+ (high reverse 5) + low reverse+ (lsb 14)10
end11
Algorithm 3: Calculation of the reverse of an index. A
polynomial of 216 coefficients is stored as 215 coefficient
pairs in 16 memory elements. BiasTable contains 16 bias
values {0, 16, 8192, 8208, 4096, 4112, 12288, 12304, 2048,
2064, 10240, 10256, 6144, 6160, 14336, 14352} correspond-
ing to the 16 memory elements.
pairs are written back in the DDR memory in the right
arrangement using Alg. 3. For a write address index, the
coefficient pair from the address reverse index should be
read from the internal memory. Note that we perform this
rearrangement of the coefficients after the completion of an
NTT following Alg. 2; whereas in the traditional NTT Alg. 1
this rearrangement is performed in the beginning using the
bitreverse function.
4.2 Architecture for lifting back and forth in Rq ↔ RQ
In Sec. 3 we described the lifting operations that we need
to perform during FV.Mult. In this section we describe the
computational steps that we follow to implement the lifting
operations, and then we describe the hardware architectures
of the building blocks. In the end we design a unified
architecture for computing the two lifting operations.
4.2.1 Computation steps for Liftq→Q
Let for an integer a mod q, the residues be [a]qi for i ∈
[0, l − 1]. So we are interested in computing [a]qj for j ∈
6
[l, L − 1]. We first compute the sum of products for i ∈
[0, l − 1] as follows.
asp =
∑
[a]qi · (
q
qi
) · [( q
qi
)−1]qi =
∑
[a]qi · bi (1)
where bi = [(
q
qi
)−1]qi . Next we compute [a
′]qj for j ∈ [l, L−
1] using
[a′]qj =
[∑
[a]qi · [bi]qj
]
qj
. (2)
Note that [bi]qj are 30-bit integers. Finally, we compute the
residues [a]qj for j ∈ [l, L− 1] using the following equation:
[a]qj =
[
[a′]qj − [basp/qc]qj · [q]qj − sign · [q]qj
]
qj
(3)
This computation involves a division of asp by q. The sign
takes a value 0 or 1 depending on asp−basp/qc ·q is smaller
than q/2 or not.
4.2.2 Computation steps for LiftQ→q
We compute the sum of products asp from the residue
polynomials moduli qj for j ∈ [0, L− 1].
asp =
∑
[a]qj · [(
Q
qj
)−1]qj · (
Q
qj
) =
∑
[a′]qj · bj (4)
Here the values [(Qqj )
−1]qj are 30-bit integers and hence the
computation [a′]qj = [a]qj · [(Qqj )−1]qj is a 30-bit modular
multiplication. Next we reduce asp by Q and get aQ in
(−Q/2, Q/2). Then the division and rounding operation is
performed on aQ and the result is reduced modulo q to a
value in (−q/2, q/2).
4.3 Unified architecture
Note that Liftq→Q and LiftQ→q operations involve similar
computation steps such as sum of products in Eq. 1, 2 and 4,
and divisions by q. Hence we design a unified architecture
to compute both Liftq→Q and LiftQ→q . The architecture is
composed of four blocks: 1) sum of products, 2) reduction
moduloQ, 3) division-and-rounding, and 4) reduction mod-
ulo q. The blocks are described as follows.
4.3.1 Sum of products block
Fig. 8 shows a multiply-and-accumulate (MAC) core to com-
pute the sum of products in Eq. 1, 2 and 4. In the figure, the
‘multiplier’ block is borrowed from the PAU (Fig. 2). Since
there are 16 PAU cores in the HE-processor, we instantiate 16
MAC cores. These cores are divided into two parallel MAC-
groups: MAC-0 to MAC-7 form the first group, and MAC-
8 to MAC-15 form the second group. Each MAC-group is
responsible for computing one sum of products.
The ROM block in the MAC core is a loadable memory
and is used to store the constants for Liftq→Q or LiftQ→q . We
set the word size of the ROM to 59 bits. Note that Eq. 1
and 4 require multiplications of 30-bit coefficients [a]qi by
long bi. The MAC cores compute these long multiplications
word-serially using the 31 × 59-bit integer multipliers that
are present inside the multiplier blocks. Alg. 4 shows the
word-serial computation of the sum of products by the 0-th
MAC core. In the algorithm we have assumed that the MAC
core is responsible for the accumulation of first m products
and each bi has w 59-bit words in the ROM. The k-loop
a[ ]
ROM
MAC
Multiplier
CR
T 
Co
ns
ta
nt
s
q
BRAM
re
sid
ue
 c
oe
ffi
ci
en
ts
 c
accc
partial sum from the previous MAC
partial sum from the next MAC
qa[ ] j +1
j +2
Fig. 3. Architecture for computing sum of products
computes the kth 59-bit word of the partial result in sum0.
Whenever a word is computed in sum0, it is forwarded to
MAC-1. Now MAC-1 computes Alg. 4 with the initialization
of acc1 to sum0 and computes the words of the partial sum-
of-products in sum1. Following the same sequence MAC-
2 computes sum2 and then MAC-3 computes sum3. In
parallel to this computation-flow, MAC-7 downto MAC-4
compute sum4. Finally sum4 is added with sum3 in MAC-
3 to get a word of the final sum of products. In Fig. 4 we
show the timing diagram for the pipeline processing.
The computation of Eq. 2 requires sum of modular mul-
tiplications. For this purpose the modular-multiplier circuit
from PAU is used.
Input: Residue coefficients [a]qj and constants bj
Output: Partially accumulated sum of products
begin1
acc0 ← 02
for k = 0 to w − 1 do3
for j = 0 to m− 1 do4
acc0 ← acc0 + [a]qj · bj [k]5
sum0 ← least59bits(acc0)6
acc0 ← acc0  597
sum0 ← least59bits(acc0)8
end9
Algorithm 4: Partial sum in Eq. 1 by MAC core-0
sp
sp
1
2
end of sp end of sp1 2
MAC7
MAC3
MAC0
Time
Fig. 4. Timing diagram for pipeline processing of two consecutive sum-
of-products (sp) by the first MAC-group.
7
4.3.2 Reduction modulo Q block
Let asp be the sum of the products in Eq. 4. For the chosen
parameter set, asp is 7 bits larger than Q. We first reduce asp
to a value in [0, Q − 1] and then central-lift the result to a
value in the range (−Q/2, Q/2). To reduce asp in [0, Q− 1],
we sequentially reduce the extra bits of asp from the most
significant side. The steps are shown in Alg. 5.
Input: An integer asp that is 7-bit larger than Q
Output: Integer asp in [0, Q− 1]
begin1
Qn ← (Q 7)2
for i = 0 to 6 do3
if msb(asp) > 0 then4
temp← asp −Qn5
if temp ≥ 0 then6
asp ← temp7
asp ← (asp  1)8
asp ← (asp  7)9
if asp ≥ Q then10
asp ← asp −Q11
end12
Algorithm 5: Reduction modulo Q
A word-serial architecture for computing the reduction
modulo Q is shown in Fig. 5. The architecture has three
addressable memory components: M and Mt are used to
store the computational data and MQ is used to keep the
modulus Q. These memory components are distributed
RAMs of word size 59-bits and depth 64. At the beginning
of a computation, the input number asp is loaded in M .
Then within the for-loop of Alg. 5, the words of asp are
left-shifted by one position and then stored in Mt. Note
that, line 5 has a conditional subtraction operation. In our
implementation, the subtraction is performed word-serially
using the subtraction circuit, and the result words are left-
shifted by one position and then stored in M . Based on the
sign of the subtraction, either M or Mt is used as the source
of asp for the next computations.
reg
M
reg01
t
M
In
pu
t w
or
ds
O
ut
pu
t w
or
ds
MQ
Fig. 5. Architecture for reduction modulo Q
4.3.3 Division and rounding unit (DRU)
The DRU computes btc/qe during LiftQ→q where t = 2, c is a
coefficient computed from the reduction modulo Q, and b·e
denotes rounding towards the nearest integer. The division
is carried out by precomputing the reciprocal r = 2/q and
then computing r × c. The word size of the DRU is 118 bits
(2× 59) as a compromise between area and latency.
To round a division of two k-bit integers correctly to
k-bits, the quotient must be computed correctly to 2k + 1
bits [22, Theorem 5.2]. In our case, the computation of btc/qe
requires a division of a k1-bit dividend by a k2-bit divisor.
The precision that we will need in this case to guarantee
correct rounding, based on the above, is k1+k2+1 bits. The
divisor q is a 1228-bit constant integer and the dividend c
is an at most 2517-bit integer, which gives a bound of 3746
bits. Hence, the reciprocal r is computed up to a precision
of 32 118-bit words, of which 22 words are nonzero.
Fig. 6. Division and rounding unit (DRU) [33]
Fig. 6 shows the architecture of the DRU. The multiplica-
tion r×c is computed by using a 118×118-bit multiplier that
computes 222 = 484 partial multiplications. This multiplier
performs a 118-bit Karatsuba multiplication by using three
59 × 59-bit multipliers generated with the Xilinx IP Core
tool (which supports only up to 64-bit multipliers). The 59-
bit multipliers each require 16 DSP blocks giving the total
number of 48 DSP blocks. In order to achieve a high clock
frequency, the 118-bit multiplier utilizes a 23-stage pipeline,
of which 18 stages are in the 59-bit multipliers (the optimal
number according to the tool).
The partial products from the 118-bit multiplier are accu-
mulated into a 241-bit (2×118+5) register using the Comba
Alg [10]. These additions are computed in a 4-stage pipeline
with three 59-bit adders and one 64-bit adder, which are all
implemented with LUTs. Whenever all partial products of
an output word have been computed, the register is shifted
to the right by 118 bits and the overflowing bits are given at
the output of the DRU. Once the computation proceeds to
the first word after the fractional point, then the MSB of the
fractional part is added to the register in order to perform
the rounding. The DRU has a constant latency of 687 clock
cycles per coefficient.
8
The DRU is reused for computing basp/qc during the
Liftq→Q. The computation proceeds analogously to the
above. The differences are that the reciprocal is now r = 1/q
and it needs to be computed only to a precision of 2493 bits
(12 nonzero words) because c can be only 36 bits longer than
q. The computation has a latency of 246 clock cycles.
4.3.4 Reduction modulo q block
This block reduces 1291-bit output from the DRU by 1228-
bit modulus q. Since the input data is 63-bit larger than the
modulus, a bit-by-bit modular reduction architecture similar
to Fig. 5 will be slow. Hence, we use a word-serial Barrett
reduction algorithm to perform the reduction in [0, q − 1]
and then we center-lift the reduced data to (−q/2, q/2). The
architecture of this block is shown in Fig. 7
+
Quotient
64−bit integer multiplier
RAM
ROM
59 64
65
123
59 59
R
R
 0
 1
Barrett
Constant
Fig. 7. Architecture for reduction modulo q
The input data is stored in the RAM and the modulus
q is kept in the ROM. In the first step, the quotient is
computed in the Quotient register by multiplying the RAM
content by the Barrett constant. For this purpose the 64-bit
multiplier (Fig. 7) is used. To save area, we designed this
64-bit multiplier circuit using a 32-bit multiplier. The 64-bit
multiplication is computed in four cycles. The accumulation
of the word-serial 64-bit multiplication results is done by the
adder block which uses a 64-bit adder circuit to compute the
addition in two cycles.
After the quotient is computed in the Quotient register,
it is multiplied with the words of the modulus q from the
ROM block and then the result words (in theR0 register) are
subtracted from the words of the input data (from the RAM
block). The result of the subtraction, which is the partially
reduced result, is then written back in the RAM module.
After the partially reduced result is compared with the
modulus q by performing word-serial subtractions. Based
on the comparison, the conditional subtraction of the modu-
lus q is computed. With this, we get the modulus q-reduced
result in the RAM block. Next the result is center-lifted to
(−q/2, q/2).
4.3.5 Integration of the building blocks
Now we describe how we integrate of the building blocks
to compute the Liftq→Q and LiftQ→q operations. Note that
in the first MAC-group, MAC-3 computes the final sum of
products. So the reduction and division blocks are attached
to the MAC-3 core. In Fig. 8 we show the connection of
MAC-3 core with the remaining three blocks. Similarly in
the second MAC-group, MAC-11 core is accompanied by
the reduction and division blocks. The four blocks are in a
pipeline during LiftQ→q to achieve optimum computation
time. The division block takes the maximum cycles and
hence determines the throughput of the entire pipeline.
Every block contains additional memory elements to enable
the pipeline processing: while one memory element is read
by the next block in the pipeline, the other memory element
is used to store the new results.
During Liftq→Q operation, the sum of products asp in
Eq. 1 is computed by a MAC-group, and then it is passed
to the DRU for the computation of basp/qc. In parallel to
this division, the MAC-group computes [a′]qj in Eq. 2. For
the computation of Eq. 3, a small computation block (con-
sisting of a multiplier, subtracter and some small memory
components) is used in the pipeline. The sign is computed
by performing arithmetic on the most significant words
of asp and q. This block is common to both the MAC-
groups as the amount of computation in Eq. 3 is small. The
throughput of the pipeline during Liftq→Q is determined by
the ‘computation of asp followed by the division basp/qc’.
4.3.6 External memory access
The DDR memory access during the Liftq→Q and LiftQ→q
is more complicated than the memory access during NTT.
Here we need to fetch the residue coefficients for different
moduli, whereas during an NTT we fetch coefficients from a
single moduli. So we design a customized DDR memory ac-
cess interface for the lifting operations. Since the DDR-burst
data length is 256 bits, at a time we read eight coefficients for
a single residue from the DDR memory and copy them in
the BRAM. Eight lifting operations are computed by the two
MAC-groups, i.e., four lifting operations per MAC-group,
before writing back the result in the DDR memory.
After eight Liftq→Q, the result is a collection of 43 × 8
coefficients. This is because there are extra 43 moduli in Q
and for each moduli there are eight coefficients. Hence 43
DDR-write operations (each copying eight coefficients) are
performed to copy the result to the memory. After every
eight LiftQ→q during the computation of c˜2, the result is a
collection of eight coefficients, each of size 1228 bits. Now
WordDecomp slices each 1228-bit coefficient into 21 59-bit
coefficients. Since a single DDR-write operation copies four
sliced coefficients, a total of 42 DDR-write operations are
performed.
4.4 Architecture of the ResPolq→qi block
Step 4 in Sec. 3.2 computes the residue polynomials for
c˜0 and c˜1. This step is performed by reducing the 1228-
bit coefficients of c˜0 and c˜1 by the 30-bit moduli qi for
i ∈ [0, l − 1]. In Alg. 6 we show the steps that we follow
to reduce a 1228-bit coefficient by a 30-bit qi.
In line 2 of Alg. 6 the input 1228-bit coefficient a is split
into 30-bit words. The 41 words of a are stored in the array
A with the least significant word in the index position 0
and the most significant word in the index position 40.
Next the for-loop in line 4 multiplies the words A[j] with
9
tM
MQ
regreg
a[ ]
j +1q
a[ ]
j +2
R0
R101ROM
CR
T 
Co
ns
ta
nt
s
partial sum from MAC−4
partial sum from MAC−2
MAC3 Modulo Q
Division
Rounding
c
Multiplier
acc
q
BRAM
re
sid
ue
 c
oe
ffi
ci
en
ts
+
Barrett
Constant
Quotient
59
64−bit integer multiplier
64
123
RAM
ROM
65
5959
Modulo q
Fig. 8. Unified architecture for Liftq→Q and LiftQ→q .
Input: 1228-bit coefficient a mod q
Output: a mod qi
begin1
A← [a0, . . . , a40] ; /* 30-bit words of a */2
R← 03
for j = 0 to 40 do4
R← R+A[j] · (230·j mod qi)5
/* R contains a 66-bit value */
R← R[57 : 0] +R[65 : 58] · (258 mod qi)6
/* Now R contains a 59-bit value */
R← Barrett Reduction59bit(R, qi)7
return R8
end9
Algorithm 6: Reduction of 1228-bit coefficient by 30-bit qi
the constants 230·j mod qi and the multiplication results
are accumulated in the register R. After the completion of
the for-loop, R contains a partially reduced result which
is of size 66 bits. Next the most significant 8 bits of R is
multiplied with 258 mod qi and the result is added with
the least significant 58 bits of R to get a 59-bit partially
reduced value in R. This value is reduced by a Barrett
reduction circuit of 59-bit input size to get the final 30-bit
result modulo qi.
The architecture of the ResPolq→qi block that computes
Alg. 6 is shown in Fig. 9. The input to the ResPolq→qi block
is the output of the Reduction modulo q block in Fig. 8. Since
the Reduction modulo q block outputs in 59-bit words, we use
a BRAM RAM-A of word size 59 and depth 32 to store
the 59-bit words of the input. The constants that are used
in Alg. 6 are kept in ROM. During the execution of the
for-loop in Alg. 6, a 59-bit word is fetched from RAM-A,
then split into two 30-bit chunks. The first chunk is then
multiplied with a 30-bit constants from ROM using the 30-
by-30 bit integer multiplier and result is accumulated in R.
Next the second chunk is multiplied by a constant and then
accumulated in R. Following a similar way, all the words
of the 1228-bit coefficient are processed. This gives a 66-
bit partially reduced result in R. Now the most significant
8 bits of R are multiplied by 258 mod qi and then added
with the least significant 58-bits of R. The output of the
addition is then reduced using the Barrett Reduction circuit
to obtain the final 30-bit modulo qi reduced result. Since
the Barrett Reduction circuit is used only once in Alg. 6, the
30-bit integer multiplier in the figure is actually borrowed
R BarrettReduction
59−by−32 bit RAM
59
30
30
30
30 30
6066
−
+
0
ROM
59
30
qi
60
66
R[65:58]
R
66
58
M0
M1
M2 M3
RAM−A
30−bit integer multiplier
Fig. 9. Architecture of the ResPolq→qi block to compute Alg. 6
from the Barrett Reduction circuit (which contains three such
multipliers).
Note that ResPolq→qi gets its input from the Reduction
modulo q. Hence the input has a sign. To keep the description
simple, we do not sign of the input in Alg. 6. But in the
actual implementation the sign of the input is taken care
of: depending on the sign bit, the architecture either adds
the output of the Barrett Reduction circuit to 0 or subtracts it
from qi.
Since the ResPolq→qi block processes the output coeffi-
cients from the architecture of Fig. 8, the best computation
time can be achieved if it is kept in a pipeline. In that
case, the ResPolq→qi block should be fast enough to reduce
the 1228-bit input coefficient by all of the 41 moduli qi for
i ∈ [0, l − 1] before a new 1228-bit coefficient arrives in the
input. We observed that a single instance of the architecture
shown in Fig. 9 is not fast enough to meet the throughput
of the pipeline. So we replicate the part of the architecture
that is present inside the green dashed-block of Fig. 9 for
four times. Since we borrow three 30-bit multipliers from
the Barrett Reduction circuit, we instantiate one extra multi-
plier. The four instances run in parallel and distribute the
computation job: the first one reduces the input coefficient
by first 11 moduli, and the remaining three instances reduce
10
TABLE 1
Area results on Xilinx Virtex-6 XC6VLX240T-1FF1156 FPGA
Resource Used Avail. Percentage
Slice Registers 63,086 301,440 20.9 %
Slice LUTs 72,613 150,720 48.2 %
BlockRAM 84 BRAM36, 22 BRAM18 416 22.8 %
DSP48 250 768 32.5 %
TABLE 2
Latencies of the building blocks without DDR access overhead
Operation Clocks
N -point NTT 47,795
N -point INTT 51,909
N -point-wise add/sub/mult 4,096
LiftQ→q (per coeff)† 687
LiftQ→q followed by ResPolq→qi (per coeff)† 687
Liftq→Q (per coeff)† 401
Poly mult in Rqj 361,376
† Assuming pipeline processing of many coefficients
the input coefficient by 10 moduli each. We keep two dual-
port ROM blocks in the architecture: the first (second) ROM
block stores the constants required by the first (last) two
instances.
5 RESULTS
We compiled the processor for the ML605 board which has
a Virtex-6 FPGA XC6VLX240T-1FF1156. Different clock do-
mains are used in the design: communication with the DDR
memory is performed at 200 MHz, whereas computations
are performed using a 100 MHz clock. The HEPCloud has
v = 16 parallel cores for performing polynomial arithmetic,
and two cores for computing the lifting operations. The area
counts of our HEPCloud, including the DDR interface, are
shown in Table 1.
Table 2 gives the latencies of the building blocks ex-
cluding the cost of DDR memory access. NTT and INTT
computations are performed on polynomials of N = 216
coefficients. To save memory requirement, we compute the
twiddle factors on the fly at the cost of N integer multipli-
cations. One NTT computation using v = 16 cores requires
(N + N2 log2(N))/16 = 36, 864 multiplications. However
the computation of the twiddle factors in the pipelined data
path of the PAU (Fig. 2) has data dependencies and thus
causes bubbles in the pipeline. We use a small register-
file that stores four consecutive twiddle factors, and reduce
the cycles spent in the pipeline bubbles to around 10,000.
In the case of an INTT, the additional cycles are spent
during scaling operation by N−1. To compute N -point-wise
addition/subtraction/multiplication we need slightly more
than 4,096 cycles.
5.1 Computation cost of the lifting operations
The cycle requirement for LiftQ→q is determined by the
division-and-rounding operation, since it is the costliest
computation in the pipeline of Fig. 8. If we assume that
many LiftQ→q operations are performed in pipeline, then
the cycle requirement per coefficient will be 687. However,
due to the restrictions put by the DDR interface, we process
only four LiftQ→q in pipeline (see Sec. 4.2). As a consequence
4,744 cycles are needed to process four coefficients by a
single LiftQ→q core. Similarly when ResPolq→qi is computed
in pipeline with LiftQ→q , 5,387 cycles are spent per four
coefficients. Similarly, when we assume that many Liftq→Q
operations are performed in pipeline, cycle requirement per
coefficient is 401. In practice, we can compute only four
Liftq→Q in pipeline, and thus it takes total 2,016 cycles for
computing four Liftq→Q operations.
5.2 Computation cost of the residue polynomial multi-
plication
To multiply two residue polynomials modulo qj , we com-
pute two NTTs, then N -point-wise multiplications, and one
INTT. The reduction of the result modulo f(x) follows the
Newton iteration method Newton iteration method [39]. In
this step, two NTTs, two N -point-wise multiplications, one
N/2-point-wise subtraction and two INTTs are computed.
Hence the computation of a polynomial multiplication in
Rqj requires four NTTs, three N -point-wise multiplications,
oneN/2-point-wise subtraction and three INTTs. This trans-
lates into 361,376 cycles.
5.3 Computation cost of FV.Mult
The cycle counts for the steps (see Sec. 3.2) are as follows.
1) Liftq→Q: To lift c1,0, c1,1, c2,0 and c2,1 (each hav-
ing N/2 coefficients) from Rq to RQ, we compute
Liftq→Q operations on 2N coefficients. This takes
total 33,030,144 cycles using the two lifting cores, as
each takes 2016 cycles to process four coefficients.
2) PolyArithmeticQ : Here we compute c˜0, c˜1 and c˜2 by
performing four multiplications and one addition
over RQ. Since an element in RQ consists of 84
residue polynomials, the four multiplications re-
quire 121,422,336 cycles and the addition requires
172,032 cycles. Hence the total computation cost of
PolyArithmeticQ is 121,594,368 cycles.
3) LiftQ→q and ResPolq→qi : To bring c˜0 and c˜1 back
to the residue representation over Rq , we apply
LiftQ→qfollowed by ResPolq→qi (in a pipeline) on
the coefficients. Using two lifting cores, the total
number of cycles required is 44,130,304. For c˜2 we
do not need to compute ResPolq→qi . Hence only
LiftQ→q is performed on the N/2 coefficients of c˜2.
Since each lifting core takes 4,744 cycles to process
four coefficients, it takes 19,431,424 cycles.
4) WordDecomp: This only splits the large coefficients
into words and does not have any computation cost.
In HEPCloud, this happens automatically when data
is copied from the FPGA to the DDR memory.
5) FV.Relin: Each of 〈c˜2, rlk0〉 and 〈c˜2, rlk1〉 requires
summation of 21 polynomial multiplications in Rq .
Note that rlk0 and rlk1 are constant, and hence
they can be kept in the NTT domain to reduce the
computation cost. For a single moduli it requires
21 NTTs, 21 N -point-wise multiplications and 20
N -point-wise additions to compute the summation
〈c˜2, rlk0〉 in the NTT domain. An INTT is needed
to get the f(x)-unreduced result. Next to reduce
modulo f(x), two NTTs, two N -point-wise multi-
plications, one N/2-point-wise subtraction and two
11
TABLE 3
Latencies and timings at 100/200 MHz computation/DDR clock
Operation Computation cycles DDR cycles Total time
FV.Add 83,968 9,740,288 0.050s
FV.Mult 335,978,912 4,663,738,368 26.67s
TABLE 4
Comparison of homomorphic multiplication timings
Implementation Scheme Dimension Coeff-size Time
[31]HW YASHE 16,384 512 48.67 ms
[23]SW YASHE 32,768 1,225 ≈ 2.98s
Our HW FV 32,768 1,228 26.67s
INTTs are computed. Finally the computation of
c′0 for a single moduli requires one N/2-point-
wise addition. Hence in total 23 NTTs, 23 N -point-
wise multiplications, 20 N -point-wise additions,
one N/2-point-wise subtraction, 3 INTTs and one
N/2-point-wise addition are performed to compute
c′0 for a single moduli. The computation cost is same
for computing c′1 for a single moduli. In this way
the total cost of computing c′0 and c
′
1 for all the 41
moduli translates into 117,792,672 cycles.
Overall 335,978,912 cycles are spent in FV.Mult. At 100 MHz
clock frequency, this corresponds to 3.36 seconds.
5.4 Overhead of the DDR memory access
To evaluate our proof of concept implementation, we use a
DDR interface that reads or writes 256 bits in a burst. For
FV.Mult, DDR memory accesses take around 4,663,738,368
cycles at 200 MHz. For FV.Add, the number of cycles for the
memory access is around 9,740,288.
Table 3 shows the timing requirement for computing
FV.Add and FV.Mult operations including the overhead of
DDR memory access. Based on the timing of FV.Mult, we see
that the designed architecture would take roughly 37551s
(11h and 26 min) to evaluate SIMON-64/128 (44 rounds
with 32 ANDs). Since the SIMD feature processes 2048 slots,
the per-block timing will be roughly 18.34s.
5.5 Comparison and discussion
Since implementations of the FV scheme are largely missing
from the literature, in Table 4, we compare HEPCloud to
YASHE implementations on both hardware and software,
which are the closest counterparts.
Po¨ppelmann et al. [31] presented an FPGA implemen-
tation of YASHE, which computes a homomorphic multi-
plication in just 48.67 ms. However, these timings cannot
be compared one-to-one because YASHE is computationally
lighter than FV, but also insecure. Their parameter set also
offers lower security, supports only multiplicative depth up
to 9, and cannot take advantage of the SIMD feature. They
implemented their accelerator using Catapult, an FPGA-
based datacenter accelerator with very fast memory access.
Consequently, they were able to solve the problem of slow
memory access, which is the main problem of HEPCloud
that was implemented on a generic ML605 FPGA devel-
opment board. Indeed, if one observes only the latency of
computation, then HEPCloud is faster than Po¨ppelmann
et al.’s design (taking SIMD feature into account) despite
implementing FV with a larger parameter set.
Lepoint and Naehrig [23] presented C++ implementa-
tions of YASHE for homomorphic evaluations of SIMON
64/128 with YASHE running on a 4-core Intel Core i7-2600
at 3.4 GHz. They reported computation times of 4193s for
SIMON-64/128 using all 4 cores. If we use HEPCloud to
compute YASHE (which is lighter than FV), then it would
take roughly 12000s to evaluate SIMON-64/128. With re-
spect to their implementation, HEPCloud is 2.8 times slower.
Again, the difference is caused by the memory access.
In this work our focus was on designing the computation
core of the FV; the DDR memory interface is a proof of
concept implementation. With 256-bit burst data width, the
DDR interface offers a only 1.97Gb/s read speed and hence
becomes the main bottleneck in our implementation. Desk-
top computers have industry-optimized DDR interface, and
the Intel Core i7-2600 processor has 8MB cache memory [21].
Since a polynomial in Rq is of size 4.8MB, the overhead of
memory access in [23] would be much lower than ours.
Using a faster DDR interface with 2048-bit burst data
length, one can achieve 10Gb/s read and 27Gb/s write
speed. With this interface, the overhead of memory ac-
cess would become roughly equal to the computation cost.
Hence the time for a homomorphic multiplication could be
reduced significantly by performing the memory access and
computation in parallel using two sets of BRAMs: when one
set is used for the computation, the other set is used for the
memory access. We consider integration of a faster DDR
interface in the HEPCloud as a future work. This would
make HEPCloud a practical solution for accelerating SHE
function evaluations in cloud computing.
6 SUMMARY
In this work we designed the hardware building blocks for
homomorphic evaluation of medium depth functions using
the FV scheme. We showed that FPGAs can accelerate the
computation intensive operations of homomorphic function
evaluations. Despite this, we found that a massive amount
of data exchange takes place between the FPGA and the
external DDR memory because only a part of the ciphertext
can be fit in the internal memory of the FPGA. The interface
with the DDR memory plays a very important role in the
performance and becomes a bottleneck unless it is imple-
mented with special care.
We introduced HEPCloud, a single-FPGA design of ho-
momorphic evaluation with FV. We presented a proof-of-
concept implementation of HEPCloud that implements all
hardware components required for homomorphic function
evaluations in cloud computing environments. We demon-
strated that HEPCloud is a feasible solution for accelerating
the very expensive homomorphic function evaluations, par-
ticularly, when fast external memory is available.
Even with the state-of-the-art FPGA acceleration, homo-
morphic function evaluations remain very expensive and
further improvements are still needed. We see several par-
allelization approaches to accelerate HEPCloud. An obvi-
ous way to improve the performance would be to use a
multi-FPGA design (a cluster) where each FPGA computes
12
different homomorphic evaluations independently of each
other. This approach improves throughput, but the latency
of an individual evaluation remains the same. The second
approach is to reduce latency by using parallel FPGAs for
independent FV.Add and FV.Mult inside a single homomor-
phic evaluation. While this is conceptually simple, it may
still face difficulties because data needs to be transferred be-
tween multiple FPGAs. The third option is to distribute the
residue polynomial arithmetic into several FPGAs since they
can be computed independently. However, the lifting oper-
ations need coefficients from different residue polynomials
and require inter-FPGA communication. The fourth option
is to divide different parts of a homomorphic multiplication
to different FPGAs and perform them in a pipelined fashion
in order to increase throughput. The fifth option is to mix the
other options which may lead to good tradeoffs that avoid
the disadvantages. The techniques represented in this paper
can be extrapolated to support these options.
ACKNOWLEDGEMENTS
This work was supported in part by the Research Council
KU Leuven: C16/15/058. In addition, this work was sup-
ported in part by the Flemish Government, by the Hercules
Foundation AKUL/11/19, and by the European Commis-
sion through ICT programme under contract through the
Horizon 2020 research and innovation programme under
contract No H2020-ICT-2014-644371 WITDOM, H2020-ICT-
2014-644209 HEAT, H2020-ICT-2014-645622 PQCRYPTO
and Cathedral ERC Advanced Grant 695305. The work of
K. Ja¨rvinen was supported in part by Academy of Finland
with the projects 283250 (while working in Aalto University)
and 303578 (while working in University of Helsinki).
REFERENCES
[1] M. R. Albrecht. Complexity estimates for solving lwe. https://
bitbucket.org/malb/lwe-estimator/raw/HEAD/estimator.py.
[2] P. Barrett. Implementing the Rivest Shamir and Adleman public
key encryption algorithm on a standard digital signal processor.
In A. M. Odlyzko, editor, Advances in Cryptology — CRYPTO’
86: Proceedings, pages 311–323, Berlin, Heidelberg, 1987. Springer
Berlin Heidelberg.
[3] R. Beaulieu, D. Shors, J. Smith, S. Treatman-Clark, B. Weeks, and
L. Wingers. The SIMON and SPECK families of lightweight block
ciphers. Cryptology ePrint Archive, Report 2013/404, 2013. http:
//eprint.iacr.org/.
[4] D. Boneh. Twenty years of attacks on the RSA cryptosystem.
NOTICES OF THE AMS, 46:203–213, 1999.
[5] J. W. Bos, W. Castryck, I. Iliashenko, and F. Vercauteren. Privacy-
friendly forecasting for the smart grid using homomorphic en-
cryption and the group method of data handling. In Progress in
Cryptology - AFRICACRYPT 2017: 9th International Conference on
Cryptology in Africa, Dakar, Senegal, May 24-26, 2017, Proceedings,
pages 184–201, Cham, 2017. Springer International Publishing.
[6] J. W. Bos, K. Lauter, J. Loftus, and M. Naehrig. Improved secu-
rity for a ring-based fully homomorphic encryption scheme. In
M. Stam, editor, Proceedings of the 14th IMA International Conference
on Cryptography and Coding (IMACC 2013), volume 8308 of Lecture
Notes in Computer Science, pages 45–64. Springer, 2013.
[7] Z. Brakerski. Fully homomorphic encryption without modu-
lus switching from classical GapSVP. In R. Safavi-Naini and
R. Canetti, editors, Advances in Cryptology — CRYPTO 2012, vol-
ume 7417 of Lecture Notes in Computer Science, pages 868–886.
Springer, 2012.
[8] X. Cao, C. Moore, M. O’Neill, E. O’Sullivan, and N. Hanley. Opti-
mised multiplication architectures for accelerating fully homomor-
phic encryption. IEEE Transactions on Computers, 65(9):2794–2806,
Sept. 2016.
[9] J. Cathe´bras, A. Carbon, R. Sirdey, and N. Ventroux. An analysis
of FV parameters impact towards its hardware acceleration. Cryp-
tology ePrint Archive, Report 2017/246, 2017. http://eprint.iacr.
org/2017/246, to appear in the Proc. of WAHC’17.
[10] P. G. Comba. Exponentiation cryptosystems on the IBM PC. IBM
Systems Journal, 29(4):526–538, 1990.
[11] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduc-
tion to Algorithms. McGraw-Hill Higher Education, 2nd edition,
2001.
[12] J.-S. Coron, T. Lepoint, and M. Tibouchi. Scale-invariant fully ho-
momorphic encryption over the integers. In H. Krawczyk, editor,
Public-Key Cryptography — PKC 2014, volume 8383 of Lecture Notes
in Computer Science, pages 311–328. Springer, 2014.
[13] D. Cousins, K. Rohloff, and D. Sumorok. Designing an FPGA-
accelerated homomorphic encryption co-processor. IEEE Transac-
tions on Emerging Topics in Computing, to appear.
[14] Y. Doro¨z, E. O¨ztu¨rk, E. Savas, and B. Sunar. Accelerating LTV
based homomorphic encryption in reconfigurable hardware. In
Cryptographic Hardware and Embedded Systems - CHES 2015 - 17th
International Workshop, Saint-Malo, France, September 13-16, 2015,
Proceedings, pages 185–204, 2015.
[15] Y. Doro¨z, E. O¨ztu¨rk, and B. Sunar. Evaluating the hardware
performance of a million-bit multiplier. In Proceedings of the 16th
Euromicro Conference on Digital System Design (DSD 2013), pages
955–962, 2013.
[16] J. Fan and F. Vercauteren. Somewhat practical fully homomorphic
encryption. Cryptology ePrint Archive, Report 2012/144, 2012.
http://eprint.iacr.org/.
[17] C. Gentry. Fully homomorphic encryption using ideal lattices.
In Proceedings of the 41st ACM Symposium on Theory of Computing
(STOC 2009), pages 169–178, 2009.
[18] C. Gentry, S. Halevi, and N. P. Smart. Homomorphic evaluation of
the AES circuit. In R. Safavi-Naini and R. Canetti, editors, Advances
in Cryptology — CRYPTO 2012, volume 7417 of Lecture Notes in
Computer Science, pages 850–867. Springer, 2012.
[19] C. Gentry, A. Sahai, and B. Waters. Homomorphic encryption from
learning with errors: Conceptually-simpler, asymptotically-faster,
attribute-based. In R. Canetti and J. A. Garay, editors, Advances
in Cryptology — CRYPTO 2013, volume 8042 of Lecture Notes in
Computer Science, pages 75–92. Springer, 2013.
[20] D. Hankerson, A. J. Menezes, and S. Vanstone. Guide to Elliptic
Curve Cryptography. Springer-Verlag New York, Inc., Secaucus, NJ,
USA, 2003.
[21] Intel. CoreTMi7-2600 processor. https://ark.intel.com/products/
52213/Intel-Core-i7-2600-Processor-8M-Cache-up-to-3 80-GHz.
[22] A. H. Karp and P. Markstein. High-precision division and square
root. ACM Transactions on Mathematical Software, 23(4):561–589,
1997.
[23] T. Lepoint and M. Naehrig. A comparison of the homomorphic
encryption schemes FV and YASHE. In D. Pointcheval and
D. Vergnaud, editors, Progress in Cryptology — AFRICACRYPT
2014, volume 8469 of Lecture Notes in Computer Science, pages 318–
335. Springer, 2014.
[24] V. Lyubashevsky, C. Peikert, and O. Regev. On Ideal Lattices
and Learning with Errors over Rings. In Advances in Cryptology
EUROCRYPT 2010, volume 6110 of Lecture Notes in Computer
Science, pages 1–23. Springer Berlin Heidelberg, 2010.
[25] L. D. Martin Albrecht, Shi Bai. A subfield lattice attack on
overstretched ntru assumptions: Cryptanalysis of some fhe and
graded encoding schemes. Cryptology ePrint Archive, Report
2016/127, 2016. http://eprint.iacr.org/2016/127.
[26] C. Moore, N. Hanley, J. McAllister, M. O’Neill, E. O’Sullivan, and
X. Cao. Targeting FPGA DSP slices for a large integer multiplier
for integer based FHE. In A. Adams, M. Brenner, and M. Smith,
editors, Financial Cryptography and Data Security Workshops, the
1st Workshop on Applied Homomorphic Cryptography and Encrypted
Computing (WAHC 2013), volume 7862 of Lecture Notes in Computer
Science, pages 226–237. Springer, 2013.
[27] M. Naehrig, K. Lauter, and V. Vaikuntanathan. Can homomorphic
encryption be practical? In Proceedings of the 3rd ACM Workshop on
Cloud Computing Security Workshop (CCSW 2011), pages 113–124.
ACM, 2011.
[28] E. O¨ztu¨rk, Y. Doro¨z, E. Savas¸, and B. Sunar. A custom accelerator
for homomorphic encryption applications. IEEE Transactions on
Computers, 66(1):3–16, Jan. 2017.
13
[29] T. Po¨ppelmann, L. Ducas, and T. Gu¨neysu. Enhanced lattice-
based signatures on reconfigurable hardware. Cryptology ePrint
Archive, Report 2014/254, 2014. http://eprint.iacr.org/.
[30] T. Po¨ppelmann and T. Gu¨neysu. Towards Practical Lattice-Based
Public-Key Encryption on Reconfigurable Hardware. In Selected
Areas in Cryptography – SAC 2013, Lecture Notes in Computer
Science, pages 68–85. Springer Berlin Heidelberg, 2014.
[31] T. Po¨ppelmann, M. Naehrig, A. Putnam, and A. Macias. Accel-
erating homomorphic evaluation on reconfigurable hardware. In
T. Gu¨neysu and H. Handschuh, editors, Cryptographic Hardware
and Embedded Systems – CHES 2015: 17th International Workshop,
Saint-Malo, France, September 13-16, 2015, Proceedings, pages 143–
163, Berlin, Heidelberg, 2015. Springer Berlin Heidelberg.
[32] R. L. Rivest, L. Adleman, and M. L. Dertouzos. On data banks
and privacy homomorphisms. Foundations of secure computation,
4(11):169–180, 1978.
[33] S. Sinha Roy, K. Ja¨rvinen, F. Vercauteren, V. Dimitrov, and I. Ver-
bauwhede. Modular hardware architecture for somewhat ho-
momorphic function evaluation. In Cryptographic Hardware and
Embedded Systems – CHES 2015: 17th International Workshop, Saint-
Malo, France, September 13-16, 2015, Proceedings, pages 164–184,
Berlin, Heidelberg, 2015. Springer Berlin Heidelberg. Revised
version https://eprint.iacr.org/2015/337.pdf.
[34] S. Sinha Roy, F. Vercauteren, N. Mentens, D. D. Chen, and I. Ver-
bauwhede. Compact ring-lwe cryptoprocessor. In L. Batina and
M. Robshaw, editors, Cryptographic Hardware and Embedded Systems
CHES 2014, volume 8731 of Lecture Notes in Computer Science, pages
371–391. Springer Berlin Heidelberg, 2014.
[35] N. Smart and F. Vercauteren. Fully homomorphic encryption
with relatively small key and ciphertext sizes. In Public Key
Cryptography PKC 2010, volume 6056 of Lecture Notes in Computer
Science, pages 420–443. Springer Berlin Heidelberg, 2010.
[36] N. Smart and F. Vercauteren. Fully homomorphic SIMD opera-
tions. Designs, Codes and Cryptography, 71(1):57–81, 2014.
[37] M. van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan. Fully
homomorphic encryption over the integers. In H. Gilbert, editor,
Advances in Cryptology — EUROCRYPT 2010, volume 6110 of
Lecture Notes in Computer Science, pages 24–43. Springer, 2010.
[38] F. Vercauteren. Fully Homomorphic Encryption. Presenta-
tion at Katholieke Universiteit Leuven, 2014. https://www.esat.
kuleuven.be/cosic/publications/talk-320.pdf.
[39] J. von zur Gathen and J. Gerhard. Modern Computer Algebra.
Cambridge University Press, New York, NY, USA, 1999.
[40] W. Wang and X. Huang. FPGA implementation of a large-number
multiplier for fully homomorphic encryption. In IEEE International
Symposium on Circuits and Systems (ISCAS 2013), pages 2589–2592,
2013.
[41] W. Wang and X. Huang. VLSI design of a large-number multiplier
for fully homomorphic encryption. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 22(9):1879–1887, 2014.
Dr. Sujoy Sinha Roy received his M.S degree
in computer science and engineering from the
Indian Institute of Technology Kharagpur, and
the Ph.D. degree in electrical engineering from
the Katholieke Universiteit Leuven, Belgium. His
research area has been broadly in the field of
efficient implementation of public key cryptogra-
phy.
Dr. Kimmo Ja¨rvinen received the M.Sc. (Tech.)
and D.Sc. (Tech.) degrees in electrical engineer-
ing from the Helsinki University of Technology
(TKK) in Finland in 2003 and 2008, respectively.
From 2008 to 2013 and from 2015 to 2016, he
was a postdoctoral researcher in the Department
of (Information and) Computer Science in Aalto
University in Finland. From 2014 to 2015, he
was with the COSIC Group in KU Leuven ESAT
in Belgium. Since 2016, he has been a Senior
Researcher with the Department of Computer
Science in University of Helsinki in Finland. His research interests lie in
the domains of security and cryptography and, especially, in developing
efficient and secure hardware implementations of cryptosystems.
Dr. Jo Vliegen received his Master degree
in Engineering Technology in 2005, from the
Catholic University College Limburg (Diepen-
beek, Belgium). After three years in industry, Jo
returned to the university college in 2008 and
started his research on the reconfigurability of
FPGAs. In 2014 Jo received his Ph.D. degree
in Engineering Technology from the KU Leuven
(Leuven, Belgium). Since 2014, Jo is a postdoc-
toral researcher in the COSIC research group
at the KU Leuven. His main research activities
focus both on the implementation of cryptographic primitives on FPGAs,
and on the use of (fine-grained) reconfigurability of FPGAs.
Dr. Frederik Vercauteren is a Professor in the
research group COSIC of the Electrical Engi-
neering Department of the KU Leuven in Bel-
gium. He received the M.Sc. degree in com-
puter science, the M.Sc. degree in pure math-
ematics, and the Ph.D. degree in electrical engi-
neering from the Katholieke Universiteit Leuven,
Belgium. Previously, he was a lecturer at the
Department of Computer Science, University of
Bristol, U.K. His research interests include ap-
plications of computational number theory and
arithmetic geometry in cryptography, in particular post-quantum cryp-
tography and homomorphic encryption.
Dr. Ingrid Verbauwhede is a Professor in the re-
search group COSIC of the Electrical Engineer-
ing Department of the KU Leuven in Belgium. At
COSIC, she leads the embedded systems and
hardware group. She is also adjunct professor
at the EE department at UCLA, Los Angeles,
CA. She is as member of the Royal Academy of
Belgium for Science and the Arts. She is a recip-
ient of an ERC Advanced Grant in 2016. She will
receive the IEEE 2017 Computer Society Tech-
nical Achievement Award. She is a pioneer in the
field of efficient and secure implementations of cryptographic algorithms
in embedded context on ASIC, FPGA and embedded SW. She is the
author and co-author of more than 300 publications at conferences,
journals, book chapters and books.
14
