Abstract-In this paper, we present an FPGA based hardware accelerator 'HEPCloud' for homomorphic evaluations of medium depth functions which has applications in cloud computing. Our HEPCloud architecture supports the polynomial ring based homomorphic encryption scheme FV for a ring-LWE parameter set of dimension 2 15 , modulus size 1,228-bit, and a standard deviation 50. This parameter-set offers a multiplicative depth 36 and at least 85 bit security. The processor of HEPCloud is composed of multiple parallel cores. To achieve fast computation time for such a large parameter-set, various optimizations in both algorithm and architecture levels are performed. For fast polynomial multiplications, we use CRT with NTT and achieve two dimensional parallelism in HEPCloud. We optimize the BRAM access, use a fast Barrett like polynomial reduction method, optimize the cost of CRT, and design a fast divide-and-round unit. Beside parallel processing, we apply pipelining strategy in several of the sequential building blocks to reduce the impact of sequential computations. Finally, we implement HEPCloud on a medium-size Xilinx Virtex 6 FPGA board ML605 board and measure its on-board performance. To store the ciphertexts during a homomorphic function evaluation, we use the large DDR3 memory of the ML605 board. Our FPGA-based implementation of HEPCloud computes a homomorphic multiplication in 26.67 s, of which the actual computation takes only 3.36 s and the rest is spent for off-chip memory access. It requires about 37,551 s to evaluate the SIMON-64/128 block cipher, but the per-block timing is only about 18 s because HEPCloud processes 2,048 blocks simultaneously. The results show that FPGA-based acceleration of homomorphic function evaluations is feasible, but fast memory interface is crucial for the performance.
Index Terms-Homomorphic encryption, FV, lattice-based cryptography, ring-LWE, polynomial multiplication, number theoretic transform, hardware implementation
Ç

INTRODUCTION
T HE concept of fully homomorphic encryption (FHE), a form of encryption that allows evaluating arbitrary functions on encrypted data, was introduced by Rivest, Adleman, and Dertouzos [32] already in 1978. Constructing FHE schemes proved to be a difficult problem that remained unsolved until 2009 when Gentry [17] proposed the first FHE scheme by using ideal lattices. Despite its groundbreaking nature, Gentry's proposal did not provide a practical solution because of its low performance. Since then, many researchers have developed more efficient schemes to improve the performance of FHE [6] , [7] , [12] , [16] , [18] , [19] , [27] , [37] . Despite these major advances, FHE schemes are too slow to be used in practical applications. Even somewhat homomorphic encryption (SHE) schemes, which can perform a limited number of operations on encrypted data, are also very slow. Software implementations require minutes or hours to evaluate even rather simple functions. For e.g., evaluating the decryption of a lightweight block cipher SIMON-64/128 (block/key size 64/128 bits) [3] requires 4,193 s (an hour and 10 minutes) on a 4-core Intel Core-i7 processor [23] . If FHE could achieve performance levels that would permit large-scale practical use, it would have a drastic effect on cloud computing: users could outsource computations to the cloud without the need to trust service providers and their mechanisms for protecting users' data from outsiders.
Hardware accelerators have been successfully used for accelerating performance-critical computations in cryptology already for several decades. Hence, it is not surprising that during the recent years several publications [8] , [13] , [14] , [15] , [26] , [28] , [31] , [33] , [40] , [41] have reported results on hardware-based acceleration of different FHE and SHE schemes or their central operations. Technical maturity and suitability for practical deployment differ significantly between the published implementations. The parts of the schemes that are implemented in the publications range from only central operations such as large integer multiplications (see, e.g., [8] ) through full implementations of the computational parts of the schemes (see, e.g., [33] ) to complete implementations in real hardware that include also all memory handling, interfacing with a host processor, etc. (see, e.g., [31] ).
The published implementations focus on many different FHE and SHE schemes. In this paper, we focus on the FanVercauteren (FV) SHE scheme [16] , which is based on the Ring Learning with Errors (Ring-LWE) problem. To the best of our knowledge, there are no published hardware implementations of FV prior to this paper; a very recent paper by Cath ebras et al. [9] analyses parameter selection of FV in the light of hardware efficiency but does not provide any actual implementation results. From the implementation point-ofview, the FV scheme is close to the Yet-Another-SomewhatHomomorphic-Encryption (YASHE) scheme by Bos et al. [6] because the FV scheme can be implemented with certain minor modifications and additions to a YASHE architecture. Hence, the closest counterparts in the literature are the two implementations of the YASHE scheme that were published by P€ oppelmann et al. in [31] and by us in [33] . Recently in 2016, Albrecht et al. [25] published a subfield lattice attack that runs in sub-exponential time on overstretched NTRU assumptions. Since the key generation part of the YASHE scheme relies on a mildly overstretched NTRU assumption, the subfield attack makes YASHE insecure. However, the FV scheme remains secure as this attack does not apply to it.
In this paper, we present HEPCloud, an FPGA-based hardware accelerator for homomorphic evaluations of medium depth functions, which supports the FV SHE scheme. It is designed primarily for speeding up homomorphic evaluations in cloud computing: the cloud service provider installs our HEPCloud accelerator on its servers and delegates the heavy homomorphic function evaluations to this FPGA-based accelerator. Our goal in designing HEPCloud is to provide a proof-of-concept implementation of a solution that is complete and mature enough for practical use. We implement all required components for homomorphic function evaluations in real hardware. The architecture of HEPCloud is based on the SHE accelerator architecture for the YASHE scheme that we introduced in our earlier work [33] . The contributions of this paper compared to previous works and, especially, [33] can be summarized as follows:
We introduce an efficient hardware implementation of the FV SHE scheme that, to the best of our knowledge, is the first FPGA-based accelerator for FV. We improve the architecture from [33] to support FV by designing a unified architecture for two lifting operations and an architecture for the residue polynomial computation, and by fine-tuning the design decisions. We implement HEPCloud using a Xilinx Virtex-6 ML605 evaluation board and verify its operation and performance via on-board performance measurements; [33] presented only after place-and-route results and did not provide any evaluation results on real hardware. We discuss the feasibility of FPGA-based acceleration of homomorphic function evaluation. We conclude that despite certain obstacles, in particular, with the speed of the memory interface, FPGA-based acceleration and HEPCloud are feasible solutions for reducing the large overhead of homomorphic function evaluations in cloud computing environments. The paper is structured as follows. Section 2 describes the FV SHE scheme as well as the system setup and the parameter set that we use. Section 3 contains a high level description of known optimization techniques to speed-up computations in modular polynomial rings and describes how we represent polynomials using the Chinese Remainder Theorem (CRT) in order to parallelize computations. We present our hardware architecture for FV in Section 4. Section 5 shows the performance results and we end the paper with the summary in Section 6.
THE FV HOMOMORPHIC ENCRYPTION SCHEME
In this section we briefly describe the FV somewhat homomorphic encryption scheme. The FV scheme was introduced by Fan and Vercauteren [16] in 2012. It uses a basic ring-LWE public-key encryption scheme and two additional functions Add and Mult to perform arithmetic operation on encrypted data. The polynomial ring is R ¼ Z½x=hfðxÞi with fðxÞ ¼ F d ðxÞ, the dth cyclotomic polynomial of degree n ¼ 'ðdÞ. The key generation and the encryption operations in FV require sampling from two probability distributions defined on R, namely x key and x err respectively. The security is determined by the degree n of f, the size of the ciphertext modulus q, and by the probability distributions. Following [24] one may sample the key and the error polynomials from a common distribution x. Typically x is a discrete Gaussian distribution x s with a small standard deviation s. However in practice the authors of FV took the private key as a polynomial with coefficients from a narrow set like fÀ1; 0; 1g. In the following we introduce two functions that are used to describe the FV scheme. This expression has the advantage of reducing the noise during homomorphic multiplications, as the first vector contains small elements (in base w). Now we enumerate the functions used in the FV scheme. For details of the functions, interested readers may follow the original paper [16] or the presentation [38] . 
System Setup and Parameter Set
The polynomial ring used in the FV scheme is of the form R ¼ Z½x=hfðxÞi where fðxÞ is a monic irreducible polynomial of degree n. We put no restriction on fðxÞ, which allows us to deal with any cyclotomic polynomial F d ðxÞ and thus to utilize single instruction multiple data (SIMD) operations [35] , [36] . The SIMD feature embeds multiple plaintexts into different "slots" in a single ciphertext and allows evaluating a function on all of them in parallel with a single execution. Indeed to exploit the SIMD feature, we choose an irreducible polynomial fðxÞ such that fðxÞ mod 2 splits into many different irreducible factors, each factor corresponding to "one slot" in the SIMD representation. It is easy to see that this excludes fðxÞ ¼ x n þ 1 with n a power of two, since it results in only one irreducible factor modulo 2. For ring-LWE based public-key encryption and signature schemes [29] , [30] , [34] it is very common to take fðxÞ ¼ x n þ 1 with n a power of two. This particular choice enables polynomial multiplications without reductions modulo fðxÞ. With our current choice, we achieve SIMD, but we pay in modular reductions by fðxÞ.
We use a parameter set with d ¼ 65535 (and thus the degree of fðxÞ is 32768 ¼ 2 15 ), log 2 ðqÞ ¼ 1228 and x err a discrete Gaussian distribution with parameter s ¼ 50. We choose the plaintext modulus t ¼ 2, i.e., we evaluate bit-level operations. The irreducible polynomial fðxÞ splits modulo 2 in 2,048 different irreducible polynomials, which implies that we can work on 2,048 bits in parallel using the SIMD method first outlined in [35] . Following the recent work [5] the multiplicative depth of the parameter set is 36. To estimate the security of the parameter set we took help of the tool developed by Albrecht [1] . The run time of the tool increases with the size of the parameter set, and for the chosen parameter set (which is very large) we were not able to calculate the security directly in a reasonable time from the tool. The tool can calculate the security of smaller LWE instances with ðn; log 2 ðqÞ; sÞ ¼ fð2;048; 100; 50Þ; ð4;096; 188; 50Þ; ð8;192; 352; 50Þg. These values are 79, 81, and 85-bits respectively. If we extrapolate in the same way, then for the parameter set of HEPCloud we get at least 85-bit security.
HIGH LEVEL OPTIMIZATIONS
To efficiently implement FV we have to analyze the two main operations in detail, namely homomorphic addition and homomorphic multiplication. Homomorphic addition is easy to deal with since this simply corresponds to polynomial addition in R q . Homomorphic multiplication is much more involved and is the main focus of this paper. As can be seen from the definition of FV:Mult in Section 2, we first need to multiply the polynomials (i.e., c 1;0 , c 2;0 etc.) of the input ciphertexts over integers (i.e., without reduction modulo q), then scale by t=q and round, before mapping back into the ring R q . The fact that one first has to compute the result over the integers (to allow for the scaling and rounding) has a major influence on how elements of R q are represented and on how the multiplication has to be computed. Algorithm 1. Iterative NTT [11] Input: Polynomial aðxÞ 2 Z q ½x of degree N À 1 and Nth primitive root v N 2 Z q of unity Output: Polynomial AðxÞ 2 Z q ½x = NTT(a) 1 begin 2 A BitReverseðaÞ / * m doubles each iteration
Since each element in R q is a polynomial of degree n À 1, the result of a polynomial multiplication (without reduction modulo fðxÞ) will have degree 2n À 2. As such we choose the smallest N ¼ 2 k > 2n À 2, and compute the product of the two polynomials in the ring Z q ½x=ðx N À 1Þ by applying the N-point NTT (see Algorithm 1) . The NTT requires the Nth roots of unity to exist in Z q , so we either choose q a prime with q 1 mod N or q a product of small primes q i with each q i 1 mod N. It is the latter choice that will be used throughout this work. The product of two elements a; b 2 R q is then computed in two steps: first, the product modulo x N À 1 (note that there will be no reduction, since the degree of the product is small enough) is computed using two NTT's, N pointwise multiplications modulo q and then finally, one inverse NTT. To recover the result in R q , we need a reduction modulo fðxÞ. For this purpose, we use the Newton iteration method [39] .
Note that the polynomial multiplication in FV:Mult are performed over integers. To get the benefit of NTT, we perform these multiplications in a larger ring R Q where Q is a sufficiently large modulus of size $ 2 log 2 ðqÞ such that the coefficients of the result polynomials are in Z.
CRT Representation of Polynomials
The biggest challenge while designing a homomorphic processor is the complexity of computation. During a homomorphic operation, computations are performed on polynomials of degree 2 15 or 2 16 and coefficients of size $1,200 or $2,500 bits. To tackle the problem of long integer arithmetic, we take inspiration from the application of the CRT [4] in the RSA cryptosystems. We choose the moduli q and Q as products of many small prime moduli q i , such that
Thus any operation modulo q or Q maps into small computations moduli q i . We use the term small residue to represent coefficients modulo q i and the term large residue to represent coefficients modulo q or Q.
FV:Mult in Residue Domain
Let the two input ciphertexts be c 1 ¼ fc 1;0 ; c 1;1 g and c 2 ¼ fc 2;0 ; c 2;1 g. The homomorphic multiplication steps are described below. 1) Lift q!Q : Lift c 1;0 , c 1;1 , c 2;0 and c 2;1 to R Q from R q , i.e., compute the additional residue polynomials moduli q j for j 2 ½l; L À 1. Since the ciphertexts are represented as residue polynomials moduli q i for i 2 ½0; l À 1 in R q , we first need to compute the coefficients modulo q in ðÀq=2; q=2Þ by applying the CRT, and then compute the additional residue polynomials. 2) PolyArithmetic Q : Compute the product polynomials
1 by computing multiplications and additions of the residue polynomials moduli q j for j 2 ½0; L À 1. 3) Lift Q!q : Apply CRT on the coefficients of the residue polynomials ofc 0 ;c 1 andc 2 to get the coefficients modulo Q in ðÀQ=2; Q=2Þ. Now compute the division-and-rounding operations toc 0 ¼ b The polynomial arithmetic on the residue polynomials can be performed in parallel. The size of the moduli q i is an important design decision and depends on the underlying platform. We implement the hardware accelerator on the Xilinx ML605 board, which has a Virtex-6 FPGA. The FPGA provides 24 Â 17-bit unsigned DSP multipliers to perform integer multiplications. We could implement a slightly larger integer multiplier by combining a DSP multiplier with LUT-based logic. In this work we choose 30-bit prime q i that satisfy q i 1 mod N. The reasons for selecting only 30-bit of primes are: 1) there are sufficiently many primes of size 30-bit to compose 1,228-bit q and 2,517-bit Q, 2) the data-paths for performing computations modulo q i become symmetric, and 3) the basic computation blocks, such as adders and multipliers of size 30-bit can be implemented efficiently using the available DSP slices and a few LUTs.
ARCHITECTURE
In this section we design a hardware architecture of HEPCloud to accelerate FV:Add and FV:Mult. In Fig. 1a high-level view of our architecture is shown. The computation core of HEPCloud is hosted on a medium size Xilinx Virtex-6 XC6VLX240T FPGA. The design decisions take account of the resources available on the board. Since the ciphertexts are large, of size 2 Â 4:8 MB, we use the DDR memory of the board to store the ciphertexts. During a computation, portions of the ciphertext(s) are read from the DDR memory and stored in the on-FPGA BRAMs. After the computation, the result is written back in the DDR memory. The speed of the communication between the DDR memory and the FPGA has a major impact on the performance. In this work we restrict the data-size to 256 bits per DDR memory access. In the remaining part of this section, we describe our design decisions and optimization tricks.
Architecture for Polynomial Arithmetic
As discussed in Section 3.2, FV:Mult requires arithmetic with polynomials that have high degrees and large coefficient sizes. For faster multiplications of the residue polynomials in the steps PolyArithmetic Q and FV:Relin of Section 3.2, we use the NTT-based polynomial multiplication algorithm. However a sequential implementation of Algorithm 1 will not be enough to accelerate FV:Mult since the parameter set (Section 2.1) that we use in this paper is very large. E.g., if we use the sequential NTT core of [34] , then one N-point NTT of a residue polynomial will consume more than 524K cycles. Note that the NTT algorithm (Algorithm 1) is amicable to parallelism. Hence we use parallel cores to reduce the number of cycles. 
Optimization in the Routing
Using v parallel cores, where vjN, we can split an N-point NTT into v parallel butterfly computation threads. Let the N coefficients be stored in b BRAMs and vjb. There are two main technical issues related to the memory access that affect the performance of the NTT computation. The first one is: all the parallel cores access the BRAMs simultaneously. Since a simple dual port BRAM has one port for reading and one port for writing, it can support only one read and write in a cycle. This puts the restriction that a BRAM should be read (or written) by one core in a cycle, i.e., the generation of the BRAM-addresses by the parallel cores should be free from conflicts.
The second issue is related to the routing complexity. A residue polynomial is stored in many BRAMs, and hence, if a core needs to access a BRAM that is far from it, then the routing of wires will be very long. Note that in the basic NTT (Algorithm 1) we see that the maximum difference between the indexes of the two coefficients is N=2. In our parameter-set N ¼ 2 16 and this results in a long critical path. We address these two technical issues by connecting the read ports of a group of BRAMs to only one butterfly core. This dedicated read prevents any sort of conflict during the memory read operations. Additionally this helps the design tool to place the butterfly core adjacent to the proper group of BRAMs in the FPGA, thus reducing routing complexity. Memory access by a butterfly core with core-index c during NTT is shown in Algorithm 2.
In Algorithm 2 MEMORY c is the group of BRAMs that are connected to the input ports of the cth butterfly core. Each memory word contains two residue (2 Â 30-bit) coefficients. In lines 13-14 two coefficient-pairs (hence four coefficients) are read from the memory and then the 'butterfly steps' (lines [16] [17] [18] are computed on them. This gives four new coefficients s 1;c , s 2;c , s 3;c , and s 4;c .
The twiddle factor v, that is used in the butterfly steps, is initialized to a constant value v m;c in line 7. The value of v m;c is actually an exponent of v m , where the exponentiation-magnitude depends on the core index c. The values of v m;c s can be stored in a small table or a ROM. The counter I twiddle denotes the interval at which v should be updated with a new value (line 21). Whenever the number of butterfly operations (N butterfly ) becomes a multiple of I twiddle , a new v is computed in line 21. The addresses registers address 1 and address 2 are computed from the counters: base, increment, and offset, that represent the starting memory address, the increment value, and the current difference between address 1 and address 2 respectively.
The memory-write module in line 31 collects the coefficients generated by several of the parallel butterfly cores and writes the 'proper' coefficients in the cth group of BRAMs. By 'proper' we mean the coefficients that will be read by the cth butterfly core in the next iteration of the m-loop in Algorithm 2. The proper coefficients are filtered by observing the value of a variable gap that depends on c, m and v.
Discussion. Since a butterfly core is attached to a fixed set of BRAMs, our Algorithm 2 minimizes the critical paths (and hence routing) that lie between the output ports of the BRAMs and the input ports of the butterfly core. However the input ports of the BRAMs still remain connected to multiple butterfly cores. The routing delay at the input ports might be reduced by inserting pipeline-registers in the critical paths that start from the output ports of the butterfly cores and end at the input ports of the memories. We consider this as a future work. Note that if Algorithm 1 is used to implement HEPCloud then long critical paths would appear at both the input and output ports of the BRAMs. 
Architecture of Polynomial Arithmetic Unit (PAU)
In Fig. 2 we show the internal architecture of the cores that we use to perform arithmetic on the residue polynomials. The cores have been designed following the footprints of the polynomial arithmetic core of [34] . The input register bank contains registers to store data from the BRAMs. In addition, the register bank also contains shift registers to delay the input coefficients in a pipeline during an NTT computation. The register bank has several ports to provide data to several other components present in the core. We use the common name D regbank to represent all data-outputs from the register bank. The small ROM block in Fig. 2 contains the twiddle factors and the value of N À1 to support the computation of NTT and INTT.
The integer multiplier (shown as a circle in Fig. 2 ) is a 30 Â 30-bit multiplier. We maintain a balance between area and speed by combining two DSP multipliers and additional LUT based small multipliers to form this multiplier. After an integer multiplication, the result is reduced using the Barrett reduction circuit [2] shown in Fig. 2 . We use the Barrett reduction technique due to two reasons. The first reason is that the primes used in this implementation are not of pseudo-Mersenne type which support fast modular reduction technique [20] . The second reason is that the cores are shared by all the prime moduli, and hence, a generic reduction circuit is more preferable than several dedicated reduction circuits. The Barrett reduction circuit is bit parallel to process the outputs from the bit-parallel multiplier in a flow. The reduction consists of three 31 Â 31-bit multipliers and additional adders and subtractors. The multipliers are implemented by combining two DSP multipliers with additional LUTs. Thus in total, the Barrett reduction block consumes six DSP multipliers. Beside performing the modular reduction operations, the multipliers present in the Barrett reduction circuit can be reused to perform 30 Â 59-bit multiplications during the CRT computations.
The adder/subtracter circuits after the Barrett reduction block in Fig. 2 are used to compute the butterfly operations during an NTT computation and to perform coefficient-wise additions and subtractions of polynomials. Finally, the results of a computation are stored in the output register bank and then the registers are written back in the memory. To achieve high operating frequency, we put pipeline registers (shown as magenta colored lines) in the data paths of the computation circuits. 
External Memory Access During NTT
During an NTT, the coefficients of the residue polynomial are read sequentially from the DDR memory and then loaded in the 16 internal memory blocks. For this purpose the 256-bit DDR3 interface is used to receive four coefficient pairs (i.e., eight coefficients) in a burst. However Algorithm 2 generates the output coefficient pairs in a permutation that is different from their initial arrangement. The coefficients pairs are written back in the DDR memory in the right arrangement using Algorithm 3. For a write address index, the coefficient pair from the address reverse index should be read from the internal memory. Note that we perform this rearrangement of the coefficients after the completion of an NTT following Algorithm 2; whereas in the traditional NTT Algorithm 1 this rearrangement is performed in the beginning using the bitreverse function. 
Architecture for Lifting Back and
Forth in R q $ R Q In Section 3 we described the lifting operations that we need to perform during FV:Mult. In this section we describe the computational steps that we follow to implement the lifting operations, and then we describe the hardware architectures of the building blocks. In the end we design a unified architecture for computing the two lifting operations.
Computation Steps for Lift q!Q
Let for an integer a mod q, the residues be ½a q i for i 2 ½0; l À 1. So we are interested in computing ½a q j for j 2 ½l; L À 1. We first compute the sum of products for i 2 ½0; l À 1 as follows:
Next we compute ½a 0 q j for j 2 ½l; L À 1 using
Note that ½b i q j are 30-bit integers. Finally, we compute the residues ½a q j for j 2 ½l; L À 1 using the following equation:
This computation involves a division of a sp by q. The sign takes a value 0 or 1 depending on a sp À ba sp =qc Á q is smaller than q=2 or not.
Computation Steps for Lift Q!q
We compute the sum of products a sp from the residue polynomials moduli q j for j 2 ½0; L À 1
Here the values ½ð Q q j Þ À1 q j are 30-bit integers and hence the computation ½a 0 q j ¼ ½a q j Á ½ð Q q j Þ À1 q j is a 30-bit modular multiplication. Next we reduce a sp by Q and get a Q in ðÀQ=2; Q=2Þ. Then the division and rounding operation is performed on a Q and the result is reduced modulo q to a value in ðÀq=2; q=2Þ.
Unified Architecture
Note that Lift q!Q and Lift Q!q operations involve similar computation steps such as sum of products in Eqs. (1), (2) and (4), and divisions by q. Hence we design a unified architecture to compute both Lift q!Q and Lift Q!q . The architecture is composed of four blocks: 1) sum of products, 2) reduction modulo Q, 3) division-and-rounding, and 4) reduction modulo q. The blocks are described as follows. Fig. 3 shows a multiply-and-accumulate (MAC) core to compute the sum of products in Eqs. (1), (2) and (4). In the figure, the 'multiplier' block is borrowed from the PAU (Fig. 2) . Since there are 16 PAU cores in the HE-processor, we instantiate 16 MAC cores. These cores are divided into two parallel MAC-groups: MAC-0 to MAC-7 form the first group, and MAC-8 to MAC-15 form the second group. Each MAC-group is responsible for computing one sum of products. The ROM block in the MAC core is a loadable memory and is used to store the constants for Lift q!Q or Lift Q!q . We set the word size of the ROM to 59 bits. Note that Eqs. (1) and (4) require multiplications of 30-bit coefficients ½a q i by long b i . The MAC cores compute these long multiplications wordserially using the 31 Â 59-bit integer multipliers that are present inside the multiplier blocks. Algorithm 4 shows the word-serial computation of the sum of products by the 0th MAC core. In the algorithm we have assumed that the MAC core is responsible for the accumulation of first m products and each b i has w 59-bit words in the ROM. The k-loop computes the kth 59-bit word of the partial result in sum 0 . Whenever a word is computed in sum 0 , it is forwarded to MAC-1. Now MAC-1 computes Algorithm 4 with the initialization of acc 1 to sum 0 and computes the words of the partial sum-ofproducts in sum 1 . Following the same sequence MAC-2 computes sum 2 and then MAC-3 computes sum 3 . In parallel to this computation-flow, MAC-7 downto MAC-4 compute sum 4 . Finally sum 4 is added with sum 3 in MAC-3 to get a word of the final sum of products. In Fig. 4 we show the timing diagram for the pipeline processing.
Sum of Products Block
The computation of Eq. (2) requires sum of modular multiplications. For this purpose the modular-multiplier circuit from PAU is used. 
Reduction Modulo Q Block
Let a sp be the sum of the products in Eq. (4). For the chosen parameter set, a sp is 7 bits larger than Q. We first reduce a sp to a value in ½0; Q À 1 and then central-lift the result to a value in the range ðÀQ=2; Q=2Þ. To reduce a sp in ½0; Q À 1, we sequentially reduce the extra bits of a sp from the most significant side. The steps are shown in Algorithm 5.
Algorithm 5. Reduction Modulo Q
Input: An integer a sp that is 7-bit larger than Q Output: Integer a sp in ½0; A word-serial architecture for computing the reduction modulo Q is shown in Fig. 5 . The architecture has three addressable memory components: M and M t are used to store the computational data and M Q is used to keep the modulus Q. These memory components are distributed RAMs of word size 59-bits and depth 64. At the beginning of a computation, the input number a sp is loaded in M. Then within the for-loop of Algorithm 5, the words of a sp are left-shifted by one position and then stored in M t . Note that, line 5 has a conditional subtraction operation. In our implementation, the subtraction is performed word-serially using the subtraction circuit, and the result words are leftshifted by one position and then stored in M. Based on the sign of the subtraction, either M or M t is used as the source of a sp for the next computations.
Division and Rounding Unit (DRU)
The DRU computes tc=q b e during Lift Q!q where t ¼ 2, c is a coefficient computed from the reduction modulo Q, and bÁe denotes rounding towards the nearest integer. The division is carried out by precomputing the reciprocal r ¼ 2=q and then computing r Â c. The word size of the DRU is 118 bits (2 Â 59) as a compromise between area and latency.
To round a division of two k-bit integers correctly to k-bits, the quotient must be computed correctly to 2k þ 1 bits [22, Theorem 5.2] . In our case, the computation of tc=q b e requires a division of a k 1 -bit dividend by a k 2 -bit divisor. The precision that we will need in this case to guarantee correct rounding, based on the above, is k 1 þ k 2 þ 1 bits. The divisor q is a 1,228-bit constant integer and the dividend c is an at most 2,517-bit integer, which gives a bound of 3,746 bits. Hence, the reciprocal r is computed up to a precision of 32 118-bit words, of which 22 words are nonzero. Fig. 6 shows the architecture of the DRU. The multiplication r Â c is computed by using a 118 Â 118-bit multiplier that computes 22 2 ¼ 484 partial multiplications. This multiplier performs a 118-bit Karatsuba multiplication by using three 59 Â 59-bit multipliers generated with the Xilinx IP Core tool (which supports only up to 64-bit multipliers). The 59-bit multipliers each require 16 DSP blocks giving the total number of 48 DSP blocks. In order to achieve a high clock frequency, the 118-bit multiplier utilizes a 23-stage pipeline, of which 18 stages are in the 59-bit multipliers (the optimal number according to the tool).
The partial products from the 118-bit multiplier are accumulated into a 241-bit (2 Â 118 þ 5) register using the Comba Alg [10] . These additions are computed in a 4-stage pipeline with three 59-bit adders and one 64-bit adder, which are all implemented with LUTs. Whenever all partial products of an output word have been computed, the register is shifted to the right by 118 bits and the overflowing bits are given at the output of the DRU. Once the computation proceeds to the first word after the fractional point, then the MSB of the fractional part is added to the register in order to perform the rounding. The DRU has a constant latency of 687 clock cycles per coefficient.
The DRU is reused for computing ba sp =qc during the Lift q!Q . The computation proceeds analogously to the above. The differences are that the reciprocal is now r ¼ 1=q and it needs to be computed only to a precision of 2,493 bits (12 nonzero words) because c can be only 36 bits longer than q. The computation has a latency of 246 clock cycles.
Reduction Modulo q Block
This block reduces 1,291-bit output from the DRU by 1,228-bit modulus q. Since the input data is 63-bit larger than the modulus, a bit-by-bit modular reduction architecture similar to Fig. 5 will be slow. Hence, we use a word-serial Barrett reduction algorithm to perform the reduction in ½0; q À 1 and then we center-lift the reduced data to ðÀq=2; q=2Þ. The architecture of this block is shown in Fig. 7 .
The input data is stored in the RAM and the modulus q is kept in the ROM. In the first step, the quotient is computed in the Quotient register by multiplying the RAM content by the Barrett constant. For this purpose the 64-bit multiplier (Fig. 7) is used. To save area, we designed this 64-bit multiplier circuit using a 32-bit multiplier. The 64-bit multiplication is computed in four cycles. The accumulation of the word-serial 64-bit multiplication results is done by the adder block which uses a 64-bit adder circuit to compute the addition in two cycles.
After the quotient is computed in the Quotient register, it is multiplied with the words of the modulus q from the ROM block and then the result words (in the R 0 register) are subtracted from the words of the input data (from the RAM block). The result of the subtraction, which is the partially reduced result, is then written back in the RAM module.
After the partially reduced result is compared with the modulus q by performing word-serial subtractions. Based on the comparison, the conditional subtraction of the modulus q is computed. With this, we get the modulus q-reduced result in the RAM block. Next the result is center-lifted to ðÀq=2; q=2Þ.
Integration of the Building Blocks
Now we describe how we integrate of the building blocks to compute the Lift q!Q and Lift Q!q operations. Note that in the first MAC-group, MAC-3 computes the final sum of products. So the reduction and division blocks are attached to the MAC-3 core. In Fig. 8 we show the connection of MAC-3 core with the remaining three blocks. Similarly in the second MAC-group, MAC-11 core is accompanied by the reduction and division blocks. The four blocks are in a pipeline during Lift Q!q to achieve optimum computation time. The division block takes the maximum cycles and hence determines the throughput of the entire pipeline. Every block contains additional memory elements to enable the pipeline processing: while one memory element is read by the next block in the pipeline, the other memory element is used to store the new results.
During Lift q!Q operation, the sum of products a sp in Eq. (1) is computed by a MAC-group, and then it is passed to the DRU for the computation of ba sp =qc. In parallel to this division, the MAC-group computes ½a 0 q j in Eq. (2) . For the computation of Eq. (3), a small computation block (consisting of a multiplier, subtracter and some small memory components) is used in the pipeline. The sign is computed by performing arithmetic on the most significant words of a sp and q. This block is common to both the MAC-groups as the amount of computation in Eq. (3) is small. The throughput of the pipeline during Lift q!Q is determined by the 'computation of a sp followed by the division ba sp =qc'. 
External Memory Access
The DDR memory access during the Lift q!Q and Lift Q!q is more complicated than the memory access during NTT. Here we need to fetch the residue coefficients for different moduli, whereas during an NTT we fetch coefficients from a single moduli. So we design a customized DDR memory access interface for the lifting operations. Since the DDRburst data length is 256 bits, at a time we read eight coefficients for a single residue from the DDR memory and copy them in the BRAM. Eight lifting operations are computed by the two MAC-groups, i.e., four lifting operations per MACgroup, before writing back the result in the DDR memory.
After eight Lift q!Q , the result is a collection of 43 Â 8 coefficients. This is because there are extra 43 moduli in Q and for each moduli there are eight coefficients. Hence 43 DDR-write operations (each copying eight coefficients) are performed to copy the result to the memory. After every eight Lift Q!q during the computation ofc 2 , the result is a collection of eight coefficients, each of size 1,228 bits. Now WordDecomp slices each 1,228-bit coefficient into 21 59-bit coefficients. Since a single DDR-write operation copies four sliced coefficients, a total of 42 DDR-write operations are performed.
Architecture of the ResPol q!q i Block
Step 4 in Section 3.2 computes the residue polynomials for c 0 andc 1 . This step is performed by reducing the 1,228-bit coefficients ofc 0 andc 1 by the 30-bit moduli q i for i 2 ½0; l À 1. In Algorithm 6 we show the steps that we follow to reduce a 1,228-bit coefficient by a 30-bit q i .
Algorithm 6. Reduction of 1,228-Bit Coefficient by 30-Bit q i
Input: 1228-bit coefficient a mod q Output: a mod q i 1 begin 2 A ½a 0 ; . . . ; a 40 / * 30-bit words of a * / 3 R 0 4 for j ¼ 0 to 40 do 5 R R þ A½j Á ð2 30Áj mod q i Þ / * R contains a 66-bit value * / 6 R R½57 : 0 þ R½65 : 58 Á ð2 58 mod q i Þ / * Now R contains a 59-bit value * / 7 R Barrett Reduction59bitðR; q i Þ 8 return R 9 end
In line 2 of Algorithm 6 the input 1,228-bit coefficient a is split into 30-bit words. The 41 words of a are stored in the array A with the least significant word in the index position 0 and the most significant word in the index position 40. Next the for-loop in line 4 multiplies the words A½j with the constants 2 30Áj mod q i and the multiplication results are accumulated in the register R. After the completion of the forloop, R contains a partially reduced result which is of size 66 bits. Next the most significant 8 bits of R is multiplied with 2 58 mod q i and the result is added with the least significant 58 bits of R to get a 59-bit partially reduced value in R. This value is reduced by a Barrett reduction circuit of 59-bit input size to get the final 30-bit result modulo q i .
The architecture of the ResPol q!q i block that computes Algorithm 6 is shown in Fig. 9 . The input to the ResPol q!q i block is the output of the Reduction modulo q block in Fig. 8 .
Since the Reduction modulo q block outputs in 59-bit words, we use a BRAM RAM-A of word size 59 and depth 32 to store the 59-bit words of the input. The constants that are used in Algorithm 6 are kept in ROM. During the execution of the for-loop in Algorithm 6, a 59-bit word is fetched from RAM-A, then split into two 30-bit chunks. The first chunk is then multiplied with a 30-bit constants from ROM using the 30-by-30 bit integer multiplier and result is accumulated in R. Next the second chunk is multiplied by a constant and then accumulated in R. Following a similar way, all the words of the 1,228-bit coefficient are processed. This gives a 66-bit partially reduced result in R. Now the most significant 8 bits of R are multiplied by 2 58 mod q i and then added with the least significant 58-bits of R. The output of the addition is then reduced using the Barrett Reduction circuit to obtain the final 30-bit modulo q i reduced result. Since the Barrett Reduction circuit is used only once in Algorithm 6, the 30-bit integer multiplier in the figure is actually borrowed from the Barrett Reduction circuit (which contains three such multipliers).
Note that ResPol q!q i gets its input from the Reduction modulo q. Hence the input has a sign. To keep the description simple, we do not sign of the input in Algorithm 6. But in the actual implementation the sign of the input is taken care of: depending on the sign bit, the architecture either adds the output of the Barrett Reduction circuit to 0 or subtracts it from q i .
Since the ResPol q!q i block processes the output coefficients from the architecture of Fig. 8 , the best computation time can be achieved if it is kept in a pipeline. In that case, the ResPol q!q i block should be fast enough to reduce the 1,228-bit input coefficient by all of the 41 moduli q i for i 2 ½0; l À 1 before a new 1,228-bit coefficient arrives in the input. We observed that a single instance of the architecture shown in Fig. 9 is not fast enough to meet the throughput of the pipeline. So we replicate the part of the architecture that is present inside the green dashed-block of Fig. 9 for four times. Since we borrow three 30-bit multipliers from the Barrett Reduction circuit, we instantiate one extra multiplier. The four instances run in parallel and distribute the computation job: the first one reduces the input coefficient by first 11 moduli, and the remaining three instances reduce the input coefficient by 10 moduli each. We keep two dual-port ROM blocks in the architecture: the first (second) ROM block stores the constants required by the first (last) two instances.
RESULTS
We compiled the processor for the ML605 board which has a Virtex-6 FPGA XC6VLX240T-1FF1156. Different clock domains are used in the design: communication with the DDR memory is performed at 200 MHz, whereas computations are performed using a 100 MHz clock. The HEPCloud has v ¼ 16 parallel cores for performing polynomial arithmetic, and two cores for computing the lifting operations. The area counts of our HEPCloud, including the DDR interface, are shown in Table 1 . Table 2 gives the latencies of the building blocks excluding the cost of DDR memory access. NTT and INTT computations are performed on polynomials of N ¼ 2 16 coefficients. To save memory requirement, we compute the twiddle factors on the fly at the cost of N integer multiplications. One NTT computation using v ¼ 16 cores requires ðN þ N 2 log 2 ðNÞÞ=16 ¼ 36;864 multiplications. However the computation of the twiddle factors in the pipelined data path of the PAU (Fig. 2) has data dependencies and thus causes bubbles in the pipeline. We use a small register-file that stores four consecutive twiddle factors, and reduce the cycles spent in the pipeline bubbles to around 10,000. In the case of an INTT, the additional cycles are spent during scaling operation by N À1 . To compute N-point-wise addition/subtraction/multiplication we need slightly more than 4,096 cycles.
Computation Cost of the Lifting Operations
The cycle requirement for Lift Q!q is determined by the division-and-rounding operation, since it is the costliest computation in the pipeline of Fig. 8 . If we assume that many Lift Q!q operations are performed in pipeline, then the cycle requirement per coefficient will be 687. However, due to the restrictions put by the DDR interface, we process only four Lift Q!q in pipeline (see Section 4.2). As a consequence 4,744 cycles are needed to process four coefficients by a single Lift Q!q core. Similarly when ResPol q!q i is computed in pipeline with Lift Q!q , 5,387 cycles are spent per four coefficients. Similarly, when we assume that many Lift q!Q operations are performed in pipeline, cycle requirement per coefficient is 401. In practice, we can compute only four Lift q!Q in pipeline, and thus it takes total 2,016 cycles for computing four Lift q!Q operations.
Computation Cost of the Residue Polynomial Multiplication
To multiply two residue polynomials modulo q j , we compute two NTTs, then N-point-wise multiplications, and one
INTT. The reduction of the result modulo fðxÞ follows the Newton iteration method Newton iteration method [39] . In this step, two NTTs, two N-point-wise multiplications, one N=2-point-wise subtraction and two INTTs are computed.
Hence the computation of a polynomial multiplication in R q j requires four NTTs, three N-point-wise multiplications, one N=2-point-wise subtraction and three INTTs. This translates into 361,376 cycles.
Computation Cost of FV:Mult
The cycle counts for the steps (see Section 3.2) are as follows. 
Overhead of the DDR Memory Access
To evaluate our proof of concept implementation, we use a DDR interface that reads or writes 256 bits in a burst. For FV:Mult, DDR memory accesses take around 4,663,738,368 cycles at 200 MHz. For FV:Add, the number of cycles for the memory access is around 9,740,288. Table 3 shows the timing requirement for computing FV:Add and FV:Mult operations including the overhead of DDR memory access. Based on the timing of FV:Mult, we see that the designed architecture would take roughly 37551 s (11 h and 26 min) to evaluate SIMON-64/128 (44 rounds with 32 ANDs). Since the SIMD feature processes 2,048 slots, the per-block timing will be roughly 18.34 s.
Comparison and Discussion
Since implementations of the FV scheme are largely missing from the literature, in Table 4 , we compare HEPCloud to YASHE implementations on both hardware and software, which are the closest counterparts.
P€ oppelmann et al. [31] presented an FPGA implementation of YASHE, which computes a homomorphic multiplication in just 48.67 ms. However, these timings cannot be compared one-to-one because YASHE is computationally lighter than FV, but also insecure. Their parameter set also offers lower security, supports only multiplicative depth up to 9, and cannot take advantage of the SIMD feature. They implemented their accelerator using Catapult, an FPGAbased datacenter accelerator with very fast memory access. Consequently, they were able to solve the problem of slow memory access, which is the main problem of HEPCloud that was implemented on a generic ML605 FPGA development board. Indeed, if one observes only the latency of computation, then HEPCloud is faster than P€ oppelmann et al.'s design (taking SIMD feature into account) despite implementing FV with a larger parameter set.
Lepoint and Naehrig [23] presented C++ implementations of YASHE for homomorphic evaluations of SIMON 64/128 with YASHE running on a 4-core Intel Core i7-2600 at 3.4 GHz. They reported computation times of 4193 s for SIMON-64/128 using all 4 cores. If we use HEPCloud to compute YASHE (which is lighter than FV), then it would take roughly 12000s to evaluate SIMON-64/128. With respect to their implementation, HEPCloud is 2.8 times slower. Again, the difference is caused by the memory access.
In this work our focus was on designing the computation core of the FV; the DDR memory interface is a proof of concept implementation. With 256-bit burst data width, the DDR interface offers a only 1.97 Gb/s read speed and hence becomes the main bottleneck in our implementation. Desktop computers have industry-optimized DDR interface, and the Intel Core i7-2600 processor has 8 MB cache memory [21] . Since a polynomial in R q is of size 4.8 MB, the overhead of memory access in [23] would be much lower than ours.
Using a faster DDR interface with 2,048-bit burst data length, one can achieve 10 Gb/s read and 27 Gb/s write speed. With this interface, the overhead of memory access would become roughly equal to the computation cost. Hence the time for a homomorphic multiplication could be reduced significantly by performing the memory access and computation in parallel using two sets of BRAMs: when one set is used for the computation, the other set is used for the memory access. We consider integration of a faster DDR interface in the HEPCloud as a future work. This would make HEPCloud a practical solution for accelerating SHE function evaluations in cloud computing.
SUMMARY
In this work we designed the hardware building blocks for homomorphic evaluation of medium depth functions using the FV scheme. We showed that FPGAs can accelerate the computation intensive operations of homomorphic function evaluations. Despite this, we found that a massive amount of data exchange takes place between the FPGA and the external DDR memory because only a part of the ciphertext can be fit in the internal memory of the FPGA. The interface with the DDR memory plays a very important role in the performance and becomes a bottleneck unless it is implemented with special care.
We introduced HEPCloud, a single-FPGA design of homomorphic evaluation with FV. We presented a proof-ofconcept implementation of HEPCloud that implements all hardware components required for homomorphic function evaluations in cloud computing environments. We demonstrated that HEPCloud is a feasible solution for accelerating the very expensive homomorphic function evaluations, particularly, when fast external memory is available.
Even with the state-of-the-art FPGA acceleration, homomorphic function evaluations remain very expensive and further improvements are still needed. We see several parallelization approaches to accelerate HEPCloud. An obvious way to improve the performance would be to use a multi-FPGA design (a cluster) where each FPGA computes different homomorphic evaluations independently of each other. This approach improves throughput, but the latency of an individual evaluation remains the same. The second approach is to reduce latency by using parallel FPGAs for independent FV:Add and FV:Mult inside a single homomorphic evaluation. While this is conceptually simple, it may still face difficulties because data needs to be transferred between multiple FPGAs. The third option is to distribute the residue polynomial arithmetic into several FPGAs since they can be computed independently. However, the lifting operations need coefficients from different residue polynomials and require inter-FPGA communication. The fourth option is to divide different parts of a homomorphic multiplication to different FPGAs and perform them in a pipelined fashion in order to increase throughput. The fifth option is to mix the other options which may lead to good tradeoffs that avoid the disadvantages. The techniques represented in this paper can be extrapolated to support these options.
