Abstract-After the introduction of first fully homomorphic encryption scheme in 2009, numerous research work has been published aiming at making fully homomorphic encryption practical for daily use. The first fully functional scheme and a few others that have been introduced has been proven difficult to be utilized in practical applications, due to efficiency reasons. Here, we propose a custom hardware accelerator, which is optimized for a class of reconfigurable logic, for L opez-Alt, Tromer and Vaikuntanathan's somewhat homomorphic encryption based schemes. Our design is working as a co-processor which enables the operating system to offload the most compute-heavy operations to this specialized hardware. The core of our design is an efficient hardware implementation of a polynomial multiplier as it is the most compute-heavy operation of our target scheme. The presented architecture can compute the product of very-large polynomials in under 6.25 ms which is 102 times faster than its software implementation. In case of accelerating homomorphic applications; we estimate the per block homomorphic AES as 442 ms which is 28.5 and 17 times faster than the CPU and GPU implementations, respectively. In evaluation of Prince block cipher homomorphically, we estimate the performance as 52 ms which is 66 times faster than the CPU implementation.
INTRODUCTION
F ULLY homomorphic encryption (FHE) schemes are introduced to enable blinded server-side computations. Although the idea was proposed in 1978 [1] , first working FHE scheme was constructed by Gentry in 2009 [2] , [3] . Despite heavy efforts to develop practical implementations of this scheme since its construction, such as rendering expensive bootstrapping evaluations obsolete [4] and parallel processing through batching of multiple data bits into a ciphertext [5] , [6] , [7] , it was not possible to realize an efficient hardware or software implementation. For instance, an implementation by Gentry et al. [8] homomorphically evaluates the AES circuit in about 36 hours resulting in an amortized per block evaluation time of 5 minutes. In [9] and later in [10] Dor€ oz et al. present an architecture for ASIC that implements a full set of FHE primitives including bootstrapping. Another implementation by Dor€ oz et al. [11] manages to evaluate AES roughly an order of magnitude faster than [8] . Dor€ oz et al. implemented Prince encryption algorithm which is more suitable for homomorphic encryption with low depth circuit and less computation complexity in [12] with a runtime of 57 minutes. Cousins et al. report the first reconfigurable logic implementations in [13] , [14] , in which Matlab Simulink was used to design the FHE primitives. This was followed by further FPGA implementations [15] , [16] , [17] , [18] . Cao et al. [16] proposed a number theoretical transform (NTT)-based large integer multiplier combined with Barrett reduction to alleviate the multiplication and modular reduction bottlenecks required in many FHE schemes. Parallel to these efforts on Gentry scheme, new FHE schemes, suchs as lattice-based [19] , [20] , [21] , integerbased [22] , [23] , [24] and learning-with-errors (LWE) or (ring) learning with errors ((R)LWE) based encryption [25] , [26] , [27] schemes, were introduced. The encryption step in the proposed integer based FHE schemes by Coron et al. [23] , [24] were designed and implemented on a Xilinx Virtex-7 FPGA. The synthesis results show speed up factors of over 40 over existing software implementations of this encryption step [16] .
It is clear that these implementations are not practical FHE solutions. Exhausting efforts targeting software and hardware implementations, researchers began to investigate GPUS as alternative platforms for FHE applications. Using GPUs, Wang et al. [28] managed to accelerate the recryption primitive of Gentry and Halevi [26] by roughly an order of magnitude. In [18] , Wang et al. present an optimized version of their result [17] , which achieves speed-up factors of 174, 7.6 and 13.5 for encryption, decryption and the recryption operations on an NVIDIA GTX 690, respectively, when compared to results of the implementation of Gentry and Halevi's FHE scheme [19] that runs on an Intel Core i7 3770K machine. A more recent work by Dai et al. [29] , [30] reports GPU acceleration for NTRU based FHE evaluating Prince and AES block ciphers, with 103 times and 7.6 times speedup values, respectively, over an Intel Xeon software implementation.
In Table 1 , we summarize previous FHE implementations. As can be seen from Table 1 , FPGA and ASIC implementations are promising for significant performance gains. Much of the development so far has focused on the Gentry-Halevi FHE [19] , which intrinsically works with very large integers. Therefore, a good number of research work focused on developing FFT/NTT based large integer multipliers [9] , [10] , [28] . Currently, the only full-fledged (with bootstrapping) FHE hardware implementation is the one reported by Dor€ oz et al. [10] , which also implements the Gentry-Halevi FHE. At this time, there is a lack of hardware implementations of the more recently proposed FHE schemes, i.e., Coron et al.'s FHE schemes [23] , [24] , BGV-style FHE schemes [4] , [19] and NTRU based FHE, e.g., [31] , [32] . We, therefore, focus on providing hardware acceleration support for one particular family of FHE's: NTRU-based FHE schemes, where arithmetic with very large polynomials (both in degree and coefficient size) is crucial for performance.
Our Contribution. In this work, we present an FPGA architecture to accelerate NTRU based FHE schemes. Our architecture may be considered as a proof-of-concept implementation of an external FHE accelerator that will speed up homomorphic evaluations taking place on a CPU. Specifically, the architecture we present manages to evaluate a full polynomial multiplication efficiently, for large degrees 2 14 and 2 15 , by utilizing a number theoretical transform based approach. Using this FPGA core we can evaluate multiplication of 2 14 degree polynomial 72 times faster than a CPU and 25:7 times faster than a GPU implementations. In case of 2 15 degree polynomials, it can evaluate the multiplications 102 and 36:5 times faster than a CPU and a GPU, respectively. Furthermore, by facilitating efficient exchange using a PCI Express connection, we evaluate the overhead incurred in a sustained homomorphic computations of deep circuits. For instance, by including data transfer clock cycles, our hardware can evaluate a full 10 round AES circuit in under 440 ms per block. In case of Prince, our hardware achieves amortized run time of 52 ms per block.
BACKGROUND
In this section we briefly outline the primitives of the L opezAlt, Tromer and Vaikuntanathan's fully homomorphic encryption based schemes, and later discuss the arithmetic operations that will be necessary in its hardware realization.
LTV-Based Fully Homomorphic Encryption
While the arithmetic and homomorphic properties of NTRU have been long known by the research community, a full-fledged fully homomorphic version was proposed only very recently in 2012 by L opez-Alt, Tromer and Vaikuntanathan (LTV) [31] . The LTV scheme is based on a variant of NTRU introduced earlier by Stehl e and Steinfeld [32] . The LTV scheme uses a new operation called relinearization and existing technique modulus switching for noise control. While the LTV scheme can support homomorphic evaluation in a multi-key setting where each participant is issued their own keys, here we focus only on the single user case for brevity.
The primitives of the LTV scheme operate on polynomials in R q ¼ Z q ½x=hx N þ 1i, i.e., with degree N, where the coefficients are processed using a prime modulus q. In the scheme an error distribution function x -a truncated discrete Gaussian distribution -is used to sample random, small B-bounded polynomials. The scheme consists of four primitive functions:
KeyGen. We select decreasing sequence of primes [19] , [23] , [24] ; NTRU Based FHE, e.g. [31] , [32] ðiÞ ðxÞe 2 to cut the noise level by log ðq i =q iÀ1 Þ bits. The operation bÁe 2 is matching the parity bits.
Arithmetic Operations
To implement the costly large polynomial multiplication and relinearization operations we follow the strategy of Dai et al. [29] . For instance, in the case of polynomial multiplication we first convert the input polynomials using the Chinese Remainder Theorem (CRT) into a series of polynomials of the same degree, but with much smaller word-sized coefficients. Then, pairwise product of these polynomials is computed efficiently using Number Theoretical Transform -based multiplier as explained in subsequent sections. Finally, the resulting polynomial is recovered from the partial products by the application of the inverse CRT (ICRT) operation.
CRT Conversion
As an initial optimization we convert all operand polynomials with large coefficients into many polynomials with small coefficients by a direct application of the Chinese Remainder Theorem on the coefficients of the polynomials: CRT : A j À!fA j mod p 0 ; A j mod p 1 ; . . . ; A j mod p lÀ1 g; where p i 's are selected small primes, l is the number of these small primes, and A j is a coefficient of the original polynomial. A side benefit of using the CRT is that it allows us to accommodate the change in the coefficient size during the levels of evaluation, thereby yielding more flexibility. When the circuit evaluation level increases, since q i gets smaller, we can simply decrease the number of primes l. Therefore, both multiplication and relinearization become faster as we proceed through the levels of evaluation. After the operations are completed, a coefficient of the resulting polynomial, CðxÞ is computed by the Inverse CRT (ICRT):
where q ¼ Q i¼lÀ1 i¼0 p i . Note that we will drop the superscript notation used for the reduced polynomials by the CRT for clarity of writing since we will deal with mostly the reduced polynomials henceforth in this paper.
Polynomial Multiplication
The fundamental operation in the LTV scheme, during which the majority of execution time is spent, is the multiplication of two polynomials of very large degrees. More specifically, we need to multiply two polynomials, AðxÞ and BðxÞ over the ring of polynomials Z p ½x=ðFðxÞÞ, where p is an odd integer and degree of FðxÞ is N ¼ 2 n . 
Namely
The NTT can essentially be considered as a Discrete Fourier Transform defined over the ring of polynomials Z p ½x=ðFðxÞÞ. Simply speaking, the forward NTT takes a polynomial AðxÞ of degree N À 1 over Z p ½x=ðFðxÞÞ and yields another polynomial of the form AðxÞ ¼
The coefficients A i 2 Z p are defined as [35] , described in Algorithm 1, is a very efficient method of computing forward and inverse NTT. The permutation in Step 2 of Algorithm 1 is implemented by simply reversing the indexes of the coefficients of A i . The new position of the coefficient A i where i ¼ ði n ; i nÀ1 ; . . . ; i 1 ; i 0 Þ is determined by reversing the bits of i, namely ði 0 ; i 1 ; . . . ; i nÀ1 ; i n Þ. For example, the new position of A 12 when N ¼ 16 is 3. The inverse NTT can also be computed with Algorithm 1, using the inverse of the twiddle factor, i.e., w À1 mod p. Therefore, we can use the same circuit for both forward and inverse NTT. Note that the NTT-based multiplication technique returns a polynomial of degree 2N À 1, which should be reduced to a polynomial of degree N À 1 by diving it by FðxÞ and keeping the remainder of the division operation. When the reduction polynomial FðxÞ is of a special form such as x N þ 1, the NTT is known as Fermat Theoretic Transform (FTT) [36] and the polynomial reduction can be performed easily as described in [37] and [38] .
Relinearization
Relinearization takes a ciphertext and set of evaluation keys (EK i;j ) as inputs, where i 2 ½0; l À 1 and j 2 ½0; dlog ðqÞ= re À 1, l is the number of small prime numbers and r is the level index. Algorithm 2 describes relinearization as implemented in this work. We pre-compute the CRT and NTT of the evaluations keys (since they are fixed) and in the computations we perform the multiplications and additions in the NTT domain. The result is evaluated by taking l INTT and one ICRT at the end. An r-bit windowed relinearization involves dlog ðqÞ=re polynomial multiplications and additions, which are performed again in the NTT domain. Since operand coefficients are kept in residue form, before relinearization we need to compute the inverse CRT ofc t .
Algorithm 2. Relinearization with r Bit Windows
input: Polynomial c with ðn; log ðqÞÞ output: Polynomial d with ð2n; log ðnqlog ðqÞÞÞ fc t g ¼ CRTðcÞ;
The performance of the NTRU based FHE scheme heavily depends on the speed of the large degree polynomial multiplication and relinearization operations. Since the relinearization operation is reduced to the computation of many polynomial multiplications, a fast large degree polynomial multiplication is the key to achieve a high performance in the NTRU-FHE scheme. Having a large degree N increases the computation requirements significantly, therefore a standalone software implementation on a general-purpose computing platform fails to provide a sufficient performance level for polynomial multiplications. The NTT-based polynomial multiplication algorithm is highly suitable for parallelization, which can lead to performance boost when implemented in hardware. On the other hand, the overall scheme is a complex design demanding prohibitively huge memory requirements (e.g., in LTV-AES [11] key requirements exceed 64-GB of memory). Therefore, a standalone architecture for SWHE fully implemented in hardware is not feasible to meet the requirements of the scheme.
In order to cope with the performance issues we designed the core NTT-based polynomial multiplication in hardware, where the polynomials have relatively small coefficients (i.e., 32-bit integers) to use it in more complicated polynomial multiplications and relinearization evaluations. The designed hardware is implemented in an FPGA device, which is connected to a PC with a high speed interface, e.g., PCI Express (PCIe). The PC handles simple and non-costly computations such as memory transactions, polynomial additions and etc. In case of a large polynomial multiplication or NTT conversion (in case of relinearization), the PC using the CRT technique, computes an array of polynomials whose coefficients are 32-bit integers from the input polynomials of much larger coefficients. The array of polynomials with small coefficients are sent in chucks to the FPGA via the high-speed PCIe bus. The FPGA computes the desired operation: polynomial multiplications or only NTT conversion. Later, the PC receives the resulting polynomials from the FPGA and if necessary, i.e., before modulus switching or relinearization, evaluates the inverse-CRT to compute the result.
PCIe Interface
The PCIe is a serial bus standard used for high speed communication between devices which in our case are PC and the FPGA board. As the target FPGA board, we use Virtex-7 FPGA VC709 Connectivity Kit and can operate at 8 GT/s, per lane, per direction with each board having eight lanes. The system is capable of sending the data packets in bursts. This allows us to achieve real time data transaction rate close to the given theoretical transaction rate as the packet sizes become larger.
Arithmetic Core Units
In order to achieve multiplication of two large degree polynomials, we designed hardware implementations for basic arithmetic building blocks to perform operations on the polynomial coefficients such as modular addition, modular subtraction and modular multiplication.
For compute-heavy operations using a large number of multiplication operations such as modular exponentiation and polynomial multiplication, it is a common practice, especially on word-oriented architectures, to perform partial reduction for the intermediate operations [39] . For example, when multiplying two 32-bit numbers with respect to a 32-bit modulus p, it is sufficient to achieve a result that is 32 bits in length, which can still be larger than the modulus p. This increases complexity of modular addition and modular subtraction operations because of the massive number of operations realized in a single clock cycle for multiplication of two polynomials of degree 2 14 and 2 15 . Therefore, we conclude that the most efficient method for the these modular operations is to achieve full modular reduction, and we design our building blocks to work with only fully reduced integers. Also, we base our design on an architecture to perform modular arithmetic operations for 32-bit numbers.
32-Bit Modular Addition
The modular addition circuit, which is illustrated in Fig. 1b , takes one clock cycle to perform one modular addition operation where operands A, B and the modulus p are all 32-bit integers and A; B < p. As noted before, it is guaranteed that the result will be less than the modulus p. Since the largest values of A and B are p À 1, and thus the largest value of A þ B is 2p À 2, at most one final subtraction of the modulus p from A þ B will be sufficient to achieve full modular reduction after addition operation.
32-Bit Modular Subtraction
The modular subtraction circuit, which is designed in a similar manner to modular addition circuit, is illustrated in Fig. 1a . Similarly the subtraction unit is optimized to take one clock cycle to finish one modular subtraction operation on a target device. Since the largest values of A and B are p À 1, and the smallest values of A and B are 0, the largest value of their subtraction can be p À 1, and the smallest value can be Àp þ 1, which indicates that one final addition of the modulus p will be sufficient to achieve full modular reduction after subtraction operation.
Integer Multiplication
The target FPGA device features many DSP units that are capable of performing very fast multiply and accumulate operations. A DSP unit takes three inputs A, B and C, which are 18 bits, 25 bits and 48 bits, respectively. A and B are multiplicand inputs, and C is the accumulate input. The output is a 48-bit integer, which can be defined as D ¼ A Â B þ C. Therefore, we can accumulate the results of many 18 Â 25-bit multiplications without overflow. Since our operands are 32 bits in length, first we need to perform a full multiplication operation of 32-bit numbers. The operand lengths of the DSP units dictate that we need to perform four 16 Â 16-bit multiplication operations to achieve a 32-bit multiplication operation. Utilizing four separate DSP slices, we could perform a 32-bit multiplication with one clock cycle throughput. However, this brings additional complexity to the hardware and because of the overall structure of the polynomial multiplication algorithm, one-cycle throughput is not crucial for our design. Therefore, we decided to utilize a single DSP unit and perform the required multiplication operations to achieve a 32-bit multiplication operation on the same DSP unit. This results in a four-cycle throughput as explained below.
In our design, however, we use Barrett's algorithm [40] for modular reduction, which requires 33 Â 33-bit multiplication operations, for which the utilized method is described in Algorithm 3. Therefore, we use DSP slices to perform 17 Â 17-bit integer multiplications at a time as illustrated in Fig. 2 , instead of 16 Â 16-bit multiplications, where both operations have exactly the same complexity. To minimize critical path delays, we utilize the optional registers for the multiplicand inputs and the accumulate output ports of the DSP unit as shown in Fig. 2 . These registers increase the latency of a single 33 Â 33-bit multiplication to six clock cycles. On the other hand, the throughput is still four clock cycles, which allows the multiplier unit to start a new multiplication every four clock cycles. 
We use classical multiplication algorithm and accumulate the result of the previous multiplication immediately after a 17 Â 17-bit multiplication operation. The result will be in the registers T 1 ; T 0 ; T À1 ; T À2 . Note that the wire widths in Fig. 2 indicate the sizes of the operands and the intermediate values in our application, not the actual widths of the corresponding wires in the DSP units.
32-Bit Modular Multiplication
We use Barrett's modular reduction algorithm [40] to perform modular multiplication operations. The Montgomery reduction algorithm [41] , which is a plausible alternative to the Barrett reduction, can also be used for modular multiplication of 32-bit integers. However, the Montgomery arithmetic requires transformations to and from the residue domain, which can lead to complications in the design. Therefore, we prefer using the Barrett's algorithm in our implementation to alleviate the mentioned complications in the design. We use the algorithm adapted for 32-bit modular multiplication operations as illustrated in Algorithm 4. The comparison operation (and associated addition with 2 33 ) in Step 9 is not needed in hardware implementation, as it is equivalent to checking the carry output of addition of U and 2's complement of V after Step 8. More specifically, when the operation U À V results in a negative number, the actual operation in hardware, where two's complement arithmetic is used, produces no carry. Consequently, if we use exactly the 33 bits of the result ignoring whether there is a carry or not, we will always obtain the correct result.
The subtraction W U À V in Step 8 can be at most a 33-bit number, more precisely 3p À 1 as explained in [42] . Therefore, two subtractions in Steps 10 and 11 can be necessary to obtain the final complete result at the end. As we want to finish Steps 10 and 11 in a single clock cycle, we perform both subtractions in the hardware implementations simultaneously, namely W À 2p and W À p, and select the correct result using the carry bits of the subtraction results and a multiplexer as illustrated in Fig. 3 . If W À 2p is positive, it is guaranteed that it is a number in the range 0 W < p, and we select this result as the output. However, if W À 2p is a negative number and W À p is a positive number, we select W À p as the correct output. If both subtractions yield negative results, we select W as the output. 
Our implementation of the Barrett algorithm, which is illustrated in Fig. 3 takes 19 clock cycles to complete one modular multiplication of 32-bit integers whereas its throughput is four clock cycles. We will refer the first four clock cycles as the warm up cycles of the multiplier and the last 15 clock cycles as the tail cycles. These periods of clock cycles are important for the first and last multiplication operations performed in the pipeline architecture in Fig. 3 . We will need these pieces of information to accurately estimate the number of clock cycles needed in our computations in subsequent sections. 
2
n Â 2 n POLYNOMIAL MULTIPLIER
We implemented a 2 n Â 2 n polynomial multiplier, with 32-bit coefficients. Throughout the paper, we will use the term 2 n to denote the 2 n Â 2 n polynomial multiplier. We do not utilize any special modulus, to achieve a generic and robust polynomial multiplier as we use Barrett's reduction algorithm for coefficient arithmetic. Instead of the classical schoolbook method for polynomial multiplication, we utilized the NTT-based multiplication algorithm, as explained in Section 2.2 and described in Algorithm 5. It should be noted that Step 5 of Algorithm 5 is implemented by coefficient-wise 32-bit modular multiplications.
Algorithm 5. NTT-Based 2
n Polynomial Multiplication 
NTT Operation

NTT Algorithm
We apply the NTT operation on a polynomial AðxÞ of degree 2 n À 1 over Z p ½x=ðFðxÞÞ. Since the result of the NTTbased multiplication will be of degree 2 ðnþ1Þ , we need to zero-pad the polynomial AðxÞ to make it also a polynomial of degree 2 ðnþ1Þ as follows AðxÞ ¼ A j Á w ij mod p, and w 2 Z p is referred as the twiddle factor. Since the size of the NTT operation is actually 2 ðnþ1Þ , we need to choose a twiddle factor w which satisfies the property w 2 ðnþ1Þ 1 mod p and 8i < 2 ðnþ1Þ w i 6 ¼ 1 mod p. To achieve fast NTT operations, we utilize the CooleyTukey approach, as explained in Section 2.2. CooleyTukey approach works by splitting up the NTT-transform into two parts, performing the NTT operation on the smaller parts, and performing a final reconstruction to combine the results of the two half-size NTT transform results into a full-sized NTT operation. If the NTT operation is defined as
we can split up this operation as follows:
which can also be expressed as
, where E i and O i represent the ith coefficients of the 2 n NTT operation on the even and odd coefficients of the polynomial AðxÞ, respectively. It is important to note that if the twiddle factor of the 2 ðnþ1Þ NTT operation is w, the twiddle factor of the smaller 2 n operation will be w 2 . Because of the periodicity of the NTT operation, we know that
. For the twiddle factor, it holds that w iþ2 n ¼ w i Á w 2 n ¼ Àw i . Consequently, we can achieve a full 2 ðnþ1Þ NTT operation with two small 2 n NTT operations utilizing the following reconstruction operation
The reconstruction operation is performed iteratively over very large number of coefficients. To better explain the iterative Cooley-Tukey approach, we would like to give a toy example of the NTT operation. First, we show the smallest NTT-transform circuit used in our design, which is shown in Fig. 4a . Here, the NTT operation is applied over a polynomial of degree 1, with w 2 1 mod p. Therefore, the two outputs of the circuit are A þ B and A þ wB A À B mod p. Utilizing the 2 Â 2 NTT circuit, we can perform a 4 Â 4 NTT operation as shown in Fig. 4b . Here, since we are constructing a 4 Â 4 NTT circuit, we have w 4 1 mod p. In a similar fashion, we can achieve an 8 Â 8 NTT operation utilizing two 4 Â 4 NTT operations, as shown in Fig. 5 . Here, since we are constructing an 8 Â 8 NTT circuit, we have w 8 1 mod p. Also in Fig. 5 , we can see that if the twiddle factor of the 8 Â 8 NTT operation is w, the twiddle factor of the 4 Â 4 NTT operation is w 2 . The overall architecture for iterative computation of NTT is shown in Fig. 6 . Note that, in a full 2 ðnþ1Þ NTT circuit, the twiddle factor w 16484 is used in 8 Â 8 NTT circuits.
Coefficient Multiplication and Accumulation
In order to parallelize multiplication and accumulation operations we utilize 3 Á K DSP units to achieve K modular multiplications in parallel, with a four-cycle throughput, where K is a design parameter that depends on the number of available DSP units in the target architecture. In our design, K is chosen as a power of 2. To be able to feed the DSP units with correct polynomial coefficients during multiplication cycles, we utilize K separate Block RAMs (BRAM) to store the polynomial coefficients as shown in Fig. 7 (e.g., K ¼ 128) . The algorithm used to access the polynomial coefficients in parallel is described in Algorithm 6. The algorithm takes the BRAM content (i.e., the coefficients of AðxÞ), the degree N ¼ 2 n , the current level m, and the number of modular multipliers K ¼ 2 k as input, and generates the indexes in a parallel manner. Every four clock cycles, we try to feed modular multipliers the number of coefficients which is as close to K as possible. Ideally, it is desirable to perform exactly K modular multiplications in parallel, which is not possible due to the access pattern to the powers of w. Algorithm 6, on the other hand, achieves a good utilization of modular multiplication units.
For level m, we use the 2 m Â 2 m NTT circuit. The coefficients are arranged in 2 m Â 2 m blocks. For example when K ¼ 256, for the first level of the NTT operation, where m ¼ 2, we need to multiply every fourth coefficient of the polynomial with w 2 ¼ w 16384 . Since the coefficients are perfectly dispersed, we can read 256 coefficients to feed the 256 multipliers in four clock cycles. This is perfect as the throughput of our multipliers are also four cycles. When the multiplication operations are complete, with an offset of 19 cycles (four clock cycles are for the warm up of the pipeline whereas 15 clock cycles are the tail cycles necessary in a pipelined design to finish the last operation), the results are written back to the same address of the RAM block as the one the coefficients are read from.
We provide formulae for the number of multiplications in each level and an estimate of the number of clock cycles needed for their computation in our architecture. Suppose N ¼ 2 n and K ¼ 2 k (n > k) are the number of coefficients in our polynomial and the number of modulo multipliers in our target device, respectively. The coefficients are stored in BRAMs, with a word size of 32 bits and an address length of 10 bits (1024 coefficients per BRAM). For ideal case, the number of modular multipliers should be four times the number of BRAMS required to store a single polynomial. The formula for the number of multiplications for the level m > 1 can be given as M ¼ 2 nþ1Àm Á ð2 mÀ1 À 1Þ: Also, using K ¼ 2 k multipliers, the number of clock cycles to compute all multiplications in a given level 1 < m n þ 1 can be formulated as
where a ¼ 2 kÀm Á ð2 mÀ1 À 1Þ and b ¼ 2 mÀ1 À 2 k . In the formula, the first (4) and the last terms (15) account for the warm up and the tail cycles.
As an example, Table 3 shows the number of multiplication operations required for each stage of the iterative Cooley-Tukey NTT operation, for a 32,768-coefficient (64K-point) NTT operation, when the number of modular multipliers is 256 (i.e., N ¼ 2 15 and K ¼ 256). As mentioned before, the modulo multipliers are not always fully utilized during the NTT computation. For example when K ¼ 2 8 and N ¼ 2 15 , for m ¼ 2, we have to read every fourth coefficient from the BRAMs. Because the coefficients are perfectly dispersed throughout the 64 BRAMS, we can only read 16 Á 2 ¼ 32 coefficients every clock cycle, which yields a number of 128 concurrent multiplications every four clock cycles. Consequently, we can finish all the modular multiplications in the first level in 4 þ 128 Á 4 þ 15 ¼ 531 clock cycles. Since we can use half the modular multipliers, we achieve half utilization in the first level. However, when m ¼ 3, we have to read every sixth, seventh and eighth out of every eight coefficients. We can read 24 Á 2 ¼ 48 coefficients every clock cycle from the BRAMs. This means we can only utilize 192 out of 25 modular multipliers since the irregularity of the access to the polynomial coefficients. This, naturally, results in a slightly low utilization. However, since we can read two coefficients from each BRAM every clock cycle, we are at almost perfect utilization, resulting in 4 þ 128 Á 4 þ 15 ¼ 531 clock cycles for this and the rest of the stages.
Since the operands of the both operations are accessed in a regular manner, the number of clock cycles spent on modular additions and subtractions are calculated as 2 ðnþ1Þ Áðnþ1Þ 2 t , when there are 2 t modular adders and 2 t subtractors.
w Generation
Theoretically, we need an N-th root of unity in F p for NTT of polynomials of degree N. Due to the polynomial padding in our case, we need an 2Nth root of unity w 2 F p such that . We precompute and store these powers of w in block RAMs in a distributed fashion similar to the coefficients of the polynomials as illustrated in Fig. 7 . Alternatively, the powers of w can be computed on-the-fly for area efficiency. However, since we have sufficient number of block RAMs in the target reconfigurable device, we prefer the precomputation approach.
Reconstruction
Once we are done with the multiplications, we utilize 64 modular adders and 64 modular subtracters to realize the addition and subtraction operations as shown in Equation (1).
Inner Multiplication
Inner multiplication of two 2 n polynomials is trivial for our hardware design. We can load 256 coefficients from each polynomial every four cycles and feed the multipliers, without increasing the four-cycle throughput. For a 2 n polynomial inner multiplication we spend 2 ðnþ1Þ Á 4=256 þ 15 clock cycles.
Inverse NTT
The Inverse NTT operation is identical to the NTT operation, except that instead of the twiddle factor w, we use the twiddle factor w i ¼ w À1 mod p. The precomputed twiddle factors of the inverse NTT are stored in the same block RAMs as the forward NTT twiddle factors, with an address offset. Therefore, the same control block can be utilized with a simple address change for the w coefficients for the inverse NTT operation.
Final Scaling
Final scaling is similar to the inner multiplication phase. We load each coefficient of the resulting polynomial, and multiply them with the precomputed scaling factor. Similar to the inner multiplication phase, we can load 256 coefficients from the resulting polynomial in four cycles cycle and feed the multipliers, without increasing the 4-cycle throughput. For a 2 n polynomial final scaling operation, we spend 2 ðnþ1Þ Á 4=256 þ 15 clock cycles.
IMPLEMENTATION RESULTS
We developed the architecture described in the previous section into Verilog modules and synthesized it using Xilinx Vivado tool for the Virtex 7 XC7VX690T FPGA family. The synthesis results are summarized in Table 4 . We synthesized the design and achieved an operating frequency of 250 MHz for multiplication of polynomials of degrees N ¼ 16; 384 for Prince and N ¼ 32; 768 for AES with a small word size of log p ¼ 32 bit. 1 In Table 5 we summarize the timing results of the synthesized small word size polynomial multiplier.
Although we can scale our architecture for larger parameters, it becomes hard to synthesize, since we are using 50 percent of the LUTs already. Another problem is that with larger hardware it is harder to do the routing because of the butterfly circuit mapping at each level. Also, it becomes harder to fit all the necessary components, i.e., polynomials, powers of v and resulted polynomial in the FPGA. Therefore, it becomes impossible to process a multiplication without extra I/O transactions when computing the NTT conversions.
The FPGA multiplier is used to process each component of the CRT representation of our large coefficient ciphertexts with log q ¼ 500 bits for Prince and log q ¼ 1271 bits for AES implementation. In fact we keep all ciphertexts in CRT representation and only compute the polynomial form when absolutely necessary, e.g., for parity correction during modulus switching and before relinearization. We assume any data sent from the PC through the PCIe interface to the FPGA is stored in onboard BRAM units. Total clock cycles 7,709
1. We use the same hardware architecture for both applications. The only difference is that compared to N ¼ 16; 384 case, the architecture is used almost twice many times in N ¼ 32; 768.
CRT Computation Cost. To facilitate efficient computation of multiplication and relinearization operations we use a series of equal sized prime numbers to construct a CRT conversion. In fact, we chose the primes p i 's such that q ¼ Q l i¼0 p i . During the levels of homomorphic evaluation, this representation allows us to easily switch modulus by simply dropping the last p i following by a parity correction. Also, since we have an RNS representation on the coefficients we no longer need to reduce by q. This also eliminates the need to consider any overflow conditions. Thus, l ¼ log ðqÞ=log ðp i Þ is 25 and 41 for Prince and AES implementations, respectively. We efficiently compute the CRT residue in software on the CPU for each polynomial coefficient as follows:
Precompute and store t k ¼ 2 64Ák ðmod p i Þ where k 2 ½0; dlog ðq=64Þ À 1e. Given a coefficient of c, we divide it into 64-bit blocks as c ¼ f. . . ; w k ; . . . ; w 0 g. We compute the CRT result by evaluating
The CRT computation cost for 41 primes p i per ciphertext polynomial is in the order of 89 ms and for 25 primes p i per ciphertext polynomial is in the order of 14:5 ms on the CPU. The CRT inverse is similarly computed (with the addition of a word carry) before each modulus switching operation at essentially the same cost. Communication Cost. The PCIe bus is only used for transactions of input/output values, NTT constants and transport of evaluation keys to the FPGA board. With eight lanes each capable of supporting eight GT/s transport speed the PCIe is capable to transmit a 1 MB ciphertext in about 0:13 ms. Note that the NTT parameters used during multiplication also need to be transported since we do not have enough room in the BRAM components to keep them permanently. We have two cases to consider: Multiplication: We transport two polynomials of 5 MB/1 MB each along with the NTT parameters of 5 MB/1 MB and receive a polynomial of 10 MB/ 2 MB, which costs about 3:25 ms/0.65 ms per multiplication for AES/Prince implementation. Relinearization: We need to transport the ciphertext we want to relinearize, the NTT parameters and a set of The PCIe transaction of the two input polynomials, the NTT coefficients and the double sized output polynomial is 3:25 ms/0.64 ms for AES/Prince implementation. Thus, the total latency for large polynomial multiplication in the CRT representation is computed in 9:51 ms and 2:48 ms for AES and Prince implementations respectively.
Polynomial Modular Reduction. Since all operations are computed in a polynomial ring with a characteristic polynomial as modulus without any special structure, we use Barrett's reduction technique to perform the reductions. Note that precomputing the constant polynomial x 2N =FðxÞ (truncated division) in the CRT representation we do not need to compute any CRT or inverse CRT operations during modular reduction. Thus we can compute the reduction using two product operations in about 19 and 4:9 ms for AES and Prince implementations respectively.
Modulus Switching. We realize the modulus switching operation by dropping the last CRT coefficient followed by parity correction. To compute the parity of the cut polynomial we need to compute an inverse CRT operation. The following parity matching and correction step takes negligible time. Therefore, modulus switching can be realized using one inverse CRT computation in 89 and 14:5 ms for AES and Prince implementations respectively.
Relinearization Cost. To realinearize a ciphertext polynomial
We need to convert the ciphertext polynomial coefficients into integer representation using one inverse CRT operation, which takes 89 ms/14.5 ms for AES/ Prince implementation. The evaluation keys are kept in NTT representation, therefore we only need to compute two NTT operations for one operand and the result. For l ¼ 41=25 primes and log ðqÞ 16 % 80=32 products the NTT operations take 331 ms/38 ms for AES/Prince implementation. We need to transport the ciphertext, the NTT parameters and 80=32 evaluation keys (ciphertexts) resulting in a 52 ms/4 ms delay for AES/Prince implementation. The summation of the partial products takes negligible time compared to the multiplications and the PCIe communication cost.
Then, the total relinearization operation takes 526 and 61:2 ms for AES and Prince implementation respectively. With the current implementation, the actual NTT computations still dominate over the other sources of latency such as PCIe communication latency and the CRT computations. However, if the design is further optimized, e.g., by increasing the number of processing units on the FPGA or by building custom support for CRT operations on the FPGA, then the PCIe communication overhead will become more dominant. The timing results are summarized in Table 6 .
COMPARISON
To understand the improvement gained by adding custom hardware support in leveled homomorphic evaluation of a deep circuit, we estimate the homomorphic evaluation time
ET AL.: A CUSTOM ACCELERATOR FOR HOMOMORPHIC ENCRYPTION APPLICATIONS
for the AES and Prince circuits and compare it with a similar software implementations by Dor€ oz et al [11] , [12] and by Wei et al [29] , [30] . Homomorphic AES evaluation. Using the NTRU primitives we implemented the depth 40 AES circuit following the approach in [11] . The tower field based AES SBox evaluation is completed using 18 Relinearization operations and thus 2,880 Relinearizations are needed for the full AES. The AES circuit evaluation requires 5,760 modular multiplications. During the evaluation we also compute 6,080 modulus switching operations. This results in a total AES evaluation time of 15 minutes. Note that during the homomorphic evaluation with each new level the operands shrink linearly with the levels thereby increasing the speed. We conservatively account for this effect by dividing the evaluation time by half. With 2,048 message slots, the amortized AES evaluation time becomes 439 ms.
We have also modified Dor€ oz et al.'s homomorphic AES evaluation code to compute relinearization with 16-bits windows (originally single bit). This simple optimization dramatically reduces the evaluation key size and speeds up the relinearization. The results are given in Table 7 . We also included the GPU optimized implementation by Dai et al. [29] on an NVIDIA GeForce GTX 680. With custom hardware assistance we obtain a significant speedups in both multiplication and relinearization operations. The estimated AES block evaluation is also improved significantly where some of the efficiency is lost to the PC to FPGA communication and CRT computation latencies.
Homomorphic Prince evaluation. Using the NTRU primitives we implemented the depth 24 Prince circuit following the approach in [12] . The algorithm is completed using 1,152 relinearizations, 1,920 multiplications, 3,072 modular reductions and 2,688 modular switch operations. An important thing to note that as we did in AES implementation, we divide the evaluation time by half. The reason is that since during the homomorphic evaluation with each new level, the operands shrink linearly so the evaluation speed increases linearly. These results in a total time of 53 seconds and an amortized time of 52 ms with batching 1,024 messages. Here in Table 8 , we compare the results of homomorphic Prince implementation of Dor€ oz et al. [12] which is implemented using a CPU. Also, we include the homomorphic Prince implementations of Dai et al. [29] , [30] on GPUs which are significantly faster compared to the CPU implementation.
CONCLUSIONS
We presented a custom hardware design to address the performance bottleneck in leveled somewhat homomorphic encryption evaluations. For this, we design a large NTT based multiplier, which is able to compute large degree polynomial multiplications using the Cooley-Tukey FTT technique. We extend the support of the custom core to be capable of multiplying large degree polynomials with large coefficients by using CRT representation on the coefficients. Using numerous techniques the design is highly optimized to speedup the NTT computations, and to reduce the burden on the PC/FPGA interface. Our design achieves remarkable improvements in speed of modular multiplication and relinearization of the LTV SWHE scheme compared to the previous software implementations. In order to show the acceleration that our architecture may provide, we estimated the homomorphic AES and Prince evaluation performances and determined a speedup of about 28 and 66 times respectively. Finally, we would like to note that these estimates are only to get a sense of the improvement that our architecture brings in. This custom accelerator architecture can be more useful in many other practical homomorphic evaluation applications in practice. [30] n/a n/a n/a n/a 0:032 103Â " For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
