With the rapid growth of cloud computing, concerns surrounding data privacy, security, and confidentiality have also increased significantly. Not only are cloud providers susceptible to internal and external hacks, but in some scenarios data owners cannot outsource computation at all due to privacy laws such as GDPR, HIPAA, or CCPA. Fully Homomorphic Encryption (FHE) is a groundbreaking invention in cryptography that, unlike traditional cryptosystems, enables computation on encrypted data without ever decrypting it. However, the most critical obstacle to deploying FHE at large scale is its enormous computation overhead.
INTRODUCTION
Cloud computing has, in a short time, fundamentally changed the economics of computing. It allows businesses to quickly and efficiently scale to almost arbitrary-sized workloads; small organizations no longer need to own, secure, and maintain their own servers. However, cloud computing comes with significant risks that have been analyzed in the literature over the last decade (see [26, 40, 70]). Specifically, many of these risks revolve around data security and privacy. For example, data in cloud storage might be exposed to both outsider and insider threats, and be prone to both intentional and unintentional misuse by the cloud provider. Recently, the European Union and the State of California have passed strong data privacy regulations. In this light, companies and organizations that possess highly private data are hesitant to migrate to the cloud, and cloud providers are facing increasing liability concerns.
To mitigate security and privacy concerns, cloud providers should keep customers' data encrypted at all times. Symmetric-key encryption schemes, such as the Advanced Encryption Standard (AES) [23], allow private data to be stored securely in a public cloud indefinitely. However, unless the customers share their secret keys with the cloud, the cloud becomes merely a storage provider. In other cases, the customers may want to securely transmit data to the cloud using protocols such as Transport Layer Security (TLS), but give the cloud access to the unencrypted data for outsourced processing. While this model unlocks the key benefits of cloud computing, it at the same time weakens the customer's data privacy compared to a cloud that only provides an encrypted storage functionality.
In 2009, a new class of cryptosystems, called Fully Homomorphic Encryption (FHE) [34], was introduced that allows arbitrary computation on encrypted data. This ground-breaking invention enables clients to encrypt data and send ciphertexts to a cloud that can evaluate functions on ciphertexts. Final and intermediate results are encrypted, and only the data owner who possesses the secret key can decrypt, providing end-to-end encryption for the client.
FHE provides provable security guarantees without any trust assumptions on the cloud provider, and it can be used to enable several secure and privacy-preserving cloud-based solutions. For instance, in the context of Machine Learning as a Service (MLaaS), FHE can be used to perform oblivious neural network inference [36]: clients send the encrypted version of their data, the cloud server runs ML models on the encrypted queries and returns the results to the clients. All intermediate and final results are encrypted and can only be decrypted by the clients. Perhaps the most critical obstacle today to deploying FHE at large scale is the enormous computation overhead compared to the plaintext counterpart, in which the data is not kept confidential.
Most FHE schemes, e.g., BGV [11], BFV [31], and TFHE [18], perform exact computation on encrypted data. A recently proposed FHE scheme, called CKKS [17], performs approximate computation on real numbers and supports efficient truncation of encrypted values. Several works [51, 47] have shown the benefits of choosing the CKKS scheme over other schemes when an approximate computation is sufficient, e.g., in Machine Learning applications. Therefore, we focus on the CKKS scheme in this paper, even though our core modules are applicable to most FHE schemes.
In this paper, we introduce HEAX (short for Homomorphic Encryption Acceleration), a novel high-performance architecture for computing on (homomorphically) encrypted data. We design several optimized core computation blocks for fast modular arithmetic. HEAX introduces a new architecture for high-throughput Number-Theoretic Transform (NTT). NTT is a ubiquitous operation in FHE as well as in many other lattice-based cryptographic systems. An efficient NTT engine directly improves the performance of these cryptosystems since the NTT operation is usually the computational bottleneck. Building on top of the NTT module, we design modules that perform the high-level operations supported by FHE schemes, thus accelerating any secure and privacy-preserving application based on FHE.
Prior Art and Challenges. In FHE schemes, the ciphertext is a set (usually a pair) of polynomials of degree n − 1 (vectors of n integers) modulo a big integer. One of the main challenges of designing an architecture for FHE is that homomorphic operations on ciphertexts involve computationally intensive modular arithmetic on big integers (with several hundred bits). These operations have convoluted data dependencies among different parts of the computation, making it challenging to design a high-throughput architecture. Moreover, the degree of the underlying polynomials is enormous (on the order of several thousand). Storing all intermediate results on the FPGA chip is prohibitive for small encryption parameters and infeasible for large parameters, given that memory consumption grows as O(n³).
Prior works that propose customized hardware for non-CKKS schemes have taken one of the following approaches: (i) designing co-processors that only accelerate certain lower-level ring operations [21, 20, 74, 14, 30, 45]; high-level operations are performed on the CPU side, which makes the co-processors of limited practical use. (ii) Storing intermediate results in off-chip memory, which significantly degrades the performance [60] to the extent that it can be worse than naive software execution [66]. (iii) Designing hardware for a fixed, modest-sized parameter, e.g., n = 2^12 [67]. However, encryption parameters determine the security level and the maximum number of consecutive multiplications that one can perform on a ciphertext, both of which are application-dependent. One of our primary design goals in HEAX is to have an architecture that can be readily used for a wide range of encryption parameters. In addition, we propose several techniques to efficiently store and access data from on-chip memory and to minimize (or, for some parameter sets, eliminate) off-chip memory accesses.
Client-Side and Server-Side Computation. The homomorphic property of FHE enables the cloud server to perform various operations without having access to the secret key, i.e., without ever decrypting the data. In other words, the cloud server can perform certain transformations on ciphertexts that are equivalent to performing computation on their corresponding plaintext values. For example, adding two ciphertexts results in the encryption of the sum of the two underlying plaintext values. Multiplication, however, is significantly more complicated: it increases the number of polynomials in the ciphertext, requiring an operation called relinearization to transform the ciphertext back into a pair of polynomials. In order to benefit from SIMD-style operations, an encoding step is performed by the client to embed many numbers in a single ciphertext. The CKKS scheme supports rotation, in which the numbers embedded in the ciphertext can be rotated circularly (without decrypting the ciphertext).
Note that encoding, decoding, encryption, and decryption are performed on the client side. These operations are not computationally expensive; thus, we do not implement customized hardware for them. Besides, it may not be realistic to assume that all clients have access to reconfigurable hardware. The operations that are performed by the server for evaluating a function on ciphertexts, i.e., multiplication, relinearization, and rotation, are computationally intensive and are the focus of our work.
Contributions.
In what follows, we elaborate on our major contributions in more detail:
• We design a novel architecture for the number-theoretic transform (NTT), which is a fundamental building block (and usually the computation bottleneck) for many lattice-based cryptosystems, including all FHE schemes. Our design is generic and can process arbitrary-sized polynomials with an adjustable throughput. We develop several techniques to overcome the challenges posed by the complex data dependency and convoluted access patterns within NTT.
• We introduce the first architecture for CKKS homomorphic encryption. We leverage a multi-layer parallelism design, from the ciphertext level down to fine-grained optimized modular arithmetic engines. In contrast to the prior art for other FHE schemes, our architecture can be scaled for different FPGA chips due to its modularity. Moreover, HEAX is not custom-designed for a specific FHE parameter set and can be used for a broad range of parameters.
• We provide a proof-of-concept implementation on two Intel FPGAs that represent two different classes of FPGAs in terms of available resources. We implement all high-level operations supported by CKKS and evaluate our design for three parameter sets which account for the vast majority of practical applications. Our experimental results demonstrate more than two orders of magnitude performance improvement compared to the heavily-optimized Microsoft SEAL library running on a CPU.
Paper Outline. Section 2 introduces the notation used in the paper as well as brief background on a few concepts. In Section 3, a detailed description of the algorithms in the CKKS scheme is provided. We describe our proposed architectures in Section 4, and in Section 5 we elaborate on the system view and data flow. In Section 6, the resource utilization and performance of our proof-of-concept implementation are provided. We categorize and explain prior art in Section 7.
PRELIMINARIES
Notation. Throughout the paper, integers and real numbers are written in normal case, e.g., q. Polynomials and vectors are written in bold, e.g., a. Vectors of polynomials and matrices are written in upper-case bold, e.g., A. We use subscripts to denote indices, e.g., a_i is the i-th polynomial or row of A.
We assume that n is a power-of-two integer and define a polynomial ring R = Z[X]/(X^n + 1) whose elements have degree at most n − 1 since X^n = −1 in R. We write R_q = R/qR for the residue ring of R modulo an integer q, whose elements have coefficients in [−⌊(q−1)/2⌋, ⌊q/2⌋] ∩ Z. In the actual computation, we represent coefficients in [0, q−1] ∩ Z. We denote by u · v the multiplication of two polynomials, where the product is reduced modulo X^n + 1 in R and further reduced modulo q in R_q. We denote by ⟨u, v⟩ the dot product of two vectors, which gives ∑_i u_i · v_i.
For a real number r, ⌊r⌉ denotes the nearest integer to r (rounding upwards in case of a tie), and ⌊r⌋ is the largest integer smaller than or equal to r. For an integer a, [a]_p denotes the reduction of a modulo an integer p to [0, p−1] ∩ Z. These operations are also applied to vectors of real/integer numbers.
All logarithms are in base two unless otherwise indicated. We use a ← χ to denote sampling a according to distribution χ. For a finite set S, U(S) denotes the uniform distribution on S, e.g., a ← U(R_q) denotes sampling a uniformly from the elements of R_q. All appearances of p denote a word-sized prime number; q denotes a product of word-sized prime numbers.

Residue Number System (RNS). There is a well-known technique to achieve asymptotic and practical improvements in polynomial arithmetic over R_q with an RNS by choosing q = ∏_{i=0}^{L} p_i, where the p_i's are pairwise coprime integers, based on the ring isomorphism R_q → ∏_{i=0}^{L} R_{p_i}. We denote the RNS representation of an element a ∈ R_q by A = (a_i = [a]_{p_i})_{0≤i≤L} ∈ ∏_{i=0}^{L} R_{p_i}. The inverse mapping is defined by the formula a = ∑_{i=0}^{L} a_i · π_i · [π_i^{-1}]_{p_i} (mod q), where π_i = q/p_i. Multiplications or additions in R_q, denoted by c = Func(a, b), can be performed on the RNS representation: c_i = Func(a_i, b_i) for all i = 0, 1, ..., L.
There exist full-RNS variants of HE schemes, such as those in [7, 37, 16], which do not require any RNS conversions except in decryption and show significant improvement over non-RNS variants. On platforms such as GPUs and FPGAs, the advantage of full-RNS variants is even more significant: we can execute c_i = Func(a_i, b_i) for all i's in parallel, and the amount of memory needed to store temporary data is reduced to the size of R_{p_i} elements.

Gadget Decomposition. Let g ∈ Z^d be a gadget vector and q an integer. The gadget decomposition, denoted by g^{-1}, is a function from R_q to R^d which transforms an element a ∈ R_q into A ∈ R^d, a vector of small polynomials, such that a = ⟨g, A⟩ (mod q). We integrate the RNS-friendly gadget decomposition from [7, 37].
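To make the RNS representation concrete, the following Python sketch (illustrative only; the helper names and toy moduli are ours, not taken from HEAX or SEAL) decomposes an integer into its residues and reconstructs it with the inverse mapping above; multiplications can then be carried out independently per residue.

```python
from math import prod

def to_rns(a, primes):
    """RNS representation: the residues [a]_{p_i} for pairwise-coprime moduli."""
    return [a % p for p in primes]

def from_rns(residues, primes):
    """Inverse mapping: a = sum_i a_i * pi_i * [pi_i^{-1}]_{p_i} (mod q), pi_i = q / p_i."""
    q = prod(primes)
    a = 0
    for a_i, p in zip(residues, primes):
        pi = q // p
        a += a_i * pi * pow(pi, -1, p)   # pow(x, -1, p) is the modular inverse
    return a % q

# additions and multiplications act component-wise on the residues
primes = [97, 193, 769]                  # toy pairwise-coprime moduli
x, y = 123456, 654321
cz = [(a * b) % p for a, b, p in zip(to_rns(x, primes), to_rns(y, primes), primes)]
assert from_rns(cz, primes) == (x * y) % prod(primes)
```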
METHODOLOGY and ALGORITHMS
HEAX targets a full-RNS variant of the CKKS scheme. Full-RNS support was introduced in [16] and provides a significant performance boost. Three implementations of the CKKS scheme are currently available, namely the Microsoft SEAL [69], HEAAN [39], and HElib [41] libraries. We choose Microsoft SEAL because the other two are not fully based on RNS, i.e., they have dependencies on multi-precision integer operations.
In this section, we summarize the algorithms in HEAX that implement the CKKS scheme's evaluation primitives as implemented in Microsoft SEAL [69]. Throughout the section, "CKKS." is a prefix for CKKS-specific methods. For example, symmetric-key encryption SymEnc is not prefixed since it is not specific to CKKS but rather a standard operation in cryptosystems based on Ring Learning with Errors [54]. In the text, methods are described without their RNS or NTT forms for readability. In the implementation and algorithms, polynomials are represented in RNS throughout evaluation and in NTT representation whenever possible. For completeness, we briefly cover the non-evaluation primitives of CKKS here; the evaluation primitives, which are our focus, are explained in the following subsections.
• CKKS.Setup(λ): For a security parameter λ, set a ring dimension n, a ciphertext modulus q, a special modulus p coprime to q, a key distribution χ, and an error distribution Ω over R.
• CKKS.Enc(m, pk): Let m ∈ R be an input plaintext and pk = (b, a) ∈ R²_qp be a public key. Sample u ← χ and e_0, e_1 ← Ω. Compute (c′_0, c′_1) = u · (b, a) + (e_0, e_1) (mod qp). Return the ciphertext ct = (c_0, c_1) = (m, 0) + ⌊(c′_0, c′_1)/p⌉ ∈ R²_q.
• CKKS.Dec(ct, sk): Let ct = (c_0, c_1) ∈ R²_{q_ℓ} be a ciphertext at the ℓ-th level, where q_ℓ = ∏_{i=0}^{ℓ} p_i. Return c_0 + c_1 · s (mod q_ℓ).
• KskGen(sk′, sk): Let sk = s ∈ R_qp be the generated secret key, sk′ = s′ ∈ R_qp be a different key, and g ∈ Z^d a gadget vector. Return a key switching key ksk = (D_0 | D_1) ∈ R^{d×2}_qp, a gadget encryption of s′ under s that is used by KeySwitch below.
• CKKS.RlkGen(sk): Let sk = s ∈ R_qp be the generated secret key and g ∈ Z^d a gadget vector. Return a relinearization key rlk ← KskGen(s², s).
• CKKS.GlkGen(sk, step): Let sk = s ∈ R_qp be the generated secret key, step ∈ Z the number of steps by which a ciphertext is to be shifted (positive for left rotation, negative for right rotation), and g ∈ Z^d a gadget vector. Compute a key s′ ∈ R_qp based on s and step (see [10] for details). Return a Galois (or rotation) key glk ← KskGen(s′, s).
Algorithm 1 Standard Barrett Reduction
Input: an integer x ∈ [0, (p−1)²], a modulus p with at most w bits, and the precomputed constant u = ⌊2^{2w}/p⌋. Output: z = x (mod p). The algorithm multiplies x by u, takes the upper word of the product as an estimate of ⌊x/p⌋, subtracts that estimate times p from x (keeping only the lower word), and applies a final conditional subtraction of p.
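For reference, here is a minimal Python sketch of standard Barrett reduction together with the constant-operand variant used by MulRed (Algorithm 2, described below); it is written for clarity with arbitrary-precision integers and is not intended to mirror the exact step ordering of HEAX's algorithms.

```python
def barrett_reduce(x, p, w):
    """Standard Barrett reduction: z = x mod p for x in [0, (p-1)^2] and a w-bit
    modulus p, using the precomputed u = floor(2^(2w)/p); no runtime division by p."""
    u = (1 << (2 * w)) // p              # precomputed once per modulus
    t = (x * u) >> (2 * w)               # estimate of floor(x/p), off by at most one
    z = x - t * p                        # lies in [0, 2p)
    return z - p if z >= p else z

def mul_red(x, y, y_prime, p, w):
    """Constant-operand modular multiplication: x*y mod p, with p < 2^(w-2) and the
    precomputed y_prime = floor(y * 2^w / p)."""
    t = (x * y_prime) >> w               # quotient estimate, off by at most one
    z = (x * y - t * p) & ((1 << w) - 1) # keep only the lower word
    return z - p if z >= p else z

w, p = 16, 12289                         # toy parameters; HEAX uses w = 54 and <=52-bit primes
x, y = 11000, 7777
y_prime = (y << w) // p
assert barrett_reduce(x * y, p, w) == (x * y) % p
assert mul_red(x, y, y_prime, p, w) == (x * y) % p
```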
For a large n, which is at least 4096 in our implementation, it is asymptotically better to use the convolution theorem and perform a specific form of fast Fourier transform, i.e., NTT, over a finite field. Polynomials are kept in NTT form to reduce the number of NTT and inverse NTT (INTT) conversions. Fast NTT algorithms are well studied in lattice-based cryptography. We adapt the algorithms in [52] (Section 2.2), which analyze fast NTT algorithms and introduce specific optimizations for negacyclic convolution. For a ring degree n, we choose a prime number p ≡ 1 (mod 2n) such that there exists a 2n-th primitive root of unity ψ, i.e., ψ^n ≡ −1 (mod p).
There are optimizations to integer modular multiplication in the Microsoft SEAL library which differ from those in [52].
• Mod(x, p): For a modulus p with at most w bits, given an integer x ∈ [0, (p−1)²], precompute u = ⌊2^{2w}/p⌋ and compute z = x (mod p). We omit u from the function inputs for simplicity. This follows the Barrett reduction algorithm in [8]. It can be used to reduce a single-word or double-word integer. Mod(a, p) performs Mod(a_i, p) for all i = 0, 1, ..., n − 1.
• MulRed(x, y, y′, p): For w-bit words and a modulus p < 2^{w−2}, given x, y ∈ Z_p and the precomputed y′ = ⌊y · 2^w/p⌋, compute x · y (mod p) according to Algorithm 2. MulRed optimizes modular multiplications when one of the operands is a known constant, e.g., a power of the root of unity, that can be precomputed. Compared to Mod, this algorithm uses fewer multi-word operations and hence is faster.
The value of w is chosen as 54 in HEAX, as opposed to 64 in Microsoft SEAL; we explain the reason in more detail in Section 4. The modulus p has at most 52 bits.

Algorithm 3 NTT
Input: a ∈ Z_p^n, p ≡ 1 (mod 2n), Y ∈ Z_p^n storing powers of ψ in bit-reversed order, and Y′ = ⌊Y · 2^w/p⌋. Output: ã ← NTT_p(a) in bit-reversed ordering. The algorithm runs log n stages; at stage m = 1, 2, 4, ..., n/2, it loops over m groups of coefficient pairs and applies a butterfly to each pair using MulRed with the corresponding entries of Y and Y′, writing the results back in place.

• NTT_p(a): Given a ∈ Z_p^n, compute ã ∈ Z_p^n such that ã_j = ∑_{i=0}^{n−1} a_i ψ^{(2j+1)i}, according to Algorithm 3.
• INTT_p(ã): Given ã ∈ Z_p^n, compute a ∈ Z_p^n such that a_j = (1/n) ∑_{i=0}^{n−1} ã_i ψ^{−(2i+1)j}, according to Algorithm 4.
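As a software reference for the transform that the NTT/INTT modules implement, the following Python sketch follows the iterative in-place formulation of [52] (Cooley-Tukey for the forward transform, Gentleman-Sande for the inverse); it models the arithmetic only, not HEAX's memory layout or pipelining, and the helper names are ours.

```python
def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    return int(format(x, f"0{bits}b")[::-1], 2)

def ntt(a, psi, p):
    """Forward negacyclic NTT: input in normal order, output in bit-reversed order.
    psi is a primitive 2n-th root of unity modulo p (psi^n = -1 mod p)."""
    a, n = list(a), len(a)
    bits = n.bit_length() - 1
    Y = [pow(psi, bit_reverse(i, bits), p) for i in range(n)]   # twiddle table (Y in Alg. 3)
    t, m = n, 1
    while m < n:                              # log n stages
        t //= 2
        for i in range(m):                    # i-th group of butterflies in this stage
            S = Y[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                U, V = a[j], a[j + t] * S % p
                a[j], a[j + t] = (U + V) % p, (U - V) % p
        m *= 2
    return a

def intt(a_hat, psi, p):
    """Inverse negacyclic NTT: input in bit-reversed order, output in normal order."""
    a, n = list(a_hat), len(a_hat)
    bits = n.bit_length() - 1
    psi_inv = pow(psi, -1, p)
    Yi = [pow(psi_inv, bit_reverse(i, bits), p) for i in range(n)]
    t, m = 1, n
    while m > 1:
        h = m // 2
        for i in range(h):
            S = Yi[h + i]
            for j in range(2 * i * t, 2 * i * t + t):
                U, V = a[j], a[j + t]
                a[j], a[j + t] = (U + V) % p, (U - V) * S % p
        t, m = 2 * t, h
    n_inv = pow(n, -1, p)
    return [x * n_inv % p for x in a]

# toy check: n = 4, p = 17, psi = 2 is a primitive 8th root of unity (2^4 = -1 mod 17)
p, psi, a = 17, 2, [1, 2, 3, 4]
assert intt(ntt(a, psi, p), psi, p) == a
```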
Addition and Multiplication
These two operations are performed in RNS and NTT form. Homomorphic addition of two ciphertexts is a component-wise modular addition of their polynomials. Homomorphic multiplication of two ciphertexts ct_0 and ct_1, whose underlying plaintexts are pt_0 and pt_1, generates a three-component ciphertext ct ∈ R³_q according to Algorithm 5, which encrypts pt_0 · pt_1 ∈ R.
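A minimal sketch of the ciphertext-ciphertext multiplication behind Algorithm 5, assuming the standard CKKS tensor product of two two-component ciphertexts; polynomials are taken to be lists of coefficients already in NTT form, so every polynomial product reduces to the dyadic (element-wise) products that the MULT module computes, independently for each RNS component.

```python
def dyadic(a, b, p):
    """Element-wise product of two polynomials in NTT form modulo p."""
    return [x * y % p for x, y in zip(a, b)]

def ckks_mult(ct0, ct1, primes):
    """Multiply ct0 = (c00, c01) and ct1 = (c10, c11), each component stored as one
    NTT-form polynomial per RNS prime. Returns the three-component ciphertext
    (c00*c10, c00*c11 + c01*c10, c01*c11), which decrypts under (1, s, s^2)."""
    (c00, c01), (c10, c11) = ct0, ct1
    d0, d1, d2 = [], [], []
    for i, p in enumerate(primes):          # the work is independent per RNS component
        d0.append(dyadic(c00[i], c10[i], p))
        cross0 = dyadic(c00[i], c11[i], p)
        cross1 = dyadic(c01[i], c10[i], p)
        d1.append([(x + y) % p for x, y in zip(cross0, cross1)])
        d2.append(dyadic(c01[i], c11[i], p))
    return d0, d1, d2
```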
Rescaling
In CKKS, a plaintext is the encoding of message slots multiplied by a scale ∆. Note that when multiplying two plaintexts, two ciphertexts, or a plaintext and a ciphertext, their scales are also multiplied. Therefore, the scale grows exponentially in the number of multiplications and quickly becomes larger than q_L, which causes decryption failure. Besides, adding two operands with different scales produces an incorrect result. Rescaling changes (mostly reduces) the scale of a ciphertext. In order to support full-RNS, rescaling changes the scale by a set of fixed ratios, i.e., one of the RNS prime numbers.

Algorithm 5 Ciphertext Multiplication
Input: two ciphertexts in RNS and NTT form. Output: the three-component product ciphertext, computed by dyadic multiplications on the RNS components of the inputs.
• Floor(C̃, p): Given C̃, the RNS and NTT form of c ∈ R_{q′p}, generate C̃′, the RNS and NTT form of c′ = ⌊p^{-1} · c⌋ ∈ R_{q′}, according to Algorithm 6.
• CKKS.Rescale(ct): Given a ciphertext ct = (C̃_0, C̃_1) ∈ R²_{q_ℓ} with scale ∆, generate a new ciphertext ct′ = (Floor(C̃_0), Floor(C̃_1)) ∈ R²_{q_{ℓ−1}} with scale ∆/p in RNS and NTT form.
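The RNS flooring can be illustrated with the standard full-RNS identity ⌊c/p_last⌋ = (c − [c]_{p_last})/p_last, evaluated independently on each remaining residue; the sketch below (on plain coefficient values, not NTT form, and with names of our choosing) is meant to convey the idea rather than reproduce Algorithm 6.

```python
def rns_floor(residues, primes):
    """Given the residues of an integer coefficient c modulo primes = [p_0, ..., p_k, p_last],
    return the residues of floor(c / p_last) modulo [p_0, ..., p_k]."""
    *q_primes, p_last = primes
    *c_q, c_last = residues
    out = []
    for c_i, p_i in zip(c_q, q_primes):
        inv = pow(p_last, -1, p_i)               # p_last^{-1} mod p_i
        out.append((c_i - c_last) * inv % p_i)   # (c - [c]_{p_last}) / p_last  mod p_i
    return out

primes = [97, 193, 257]                          # the last prime plays the role of p
c = 123456                                       # toy coefficient, 0 <= c < prod(primes)
res = [c % p for p in primes]
assert rns_floor(res, primes) == [(c // 257) % p for p in primes[:-1]]
```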
Key Switching
Key switching is a technique to homomorphically make a ciphertext decryptable under a different secret key. Various gadget decomposition methods can be adopted to balance noise growth and execution time. Given q_{d−1}, the product of coprime integers p_0, ..., p_{d−1}, such that q divides q_{d−1}, define the gadget decomposition R_q → R^d as g^{-1}(a) = ([a]_{p_i})_{0≤i≤d−1}, and the gadget vector as g = (π_i · [π_i^{-1}]_{p_i})_{0≤i≤d−1}, where π_i = q_{d−1}/p_i.
This choice of gadget decomposition yields fast key switching at the cost of high noise growth. With the special modulus p and a rescaling at the end of key switching, as explained in [15], key switching becomes almost noise-free.
• KeySwitch(ct, ksk): Given a ciphertext ct = (c_0, c_1) ∈ R²_q decryptable with secret key s′, and a key switching key ksk = (D_0 | D_1), where | appends one column vector to another, generate a new ciphertext ct′ = (c′_0, c′_1) ∈ R²_q decryptable with secret key s, according to Algorithm 7.
Algorithm 7 Key Switching
The input and output ciphertexts and the key switching key are in RNS representation and in NTT form. Algorithm 7 describes the full-RNS variant.
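For intuition, the computation carried out by Algorithm 7 can be summarized in its non-RNS form as follows; this paraphrase assumes the standard special-modulus key switching and is not a verbatim transcription of the algorithm:

(c̃_0, c̃_1) = ⟨g^{-1}(c_1), (D_0 | D_1)⟩ (mod qp),
c′_0 = c_0 + ⌊p^{-1} · c̃_0⌉ (mod q),   c′_1 = ⌊p^{-1} · c̃_1⌉ (mod q).

The first line is where the NTT conversions and dyadic multiplications of Algorithm 7 are spent; the second line is the Modulus Switching (Floor) step discussed later in the KeySwitch Module section.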
Relinearization. Multiplying two ciphertexts generates a ciphertext ct ∈ R³_q with three components (Algorithm 5). ct can be decrypted as ⟨ct, (1, s, s²)⟩ in R_q, where s is the secret key. To stop ciphertexts from expanding, we relinearize the ciphertext to transform it back to two components. Relinearization is fundamentally a key-switching operation: it transforms a ciphertext decryptable with s² into a new ciphertext decryptable with s using the relinearization key.
• CKKS.Relin(ct, rlk): Given a ciphertext ct ∈ R³_q and a relinearization key rlk (which can be seen as a ksk), generate a ciphertext (c′_0, c′_1) ∈ (∏_i R_{p_i})² by key switching the third component of ct with rlk and adding the result to the first two components.
Rotation. A plaintext is the encoding of a vector of messages that are evaluated independently. Rotation is applied to a ciphertext to move messages across slots. Each rotation pattern has a corresponding secret key s′ derived from the original secret key s. Rotation generates a ciphertext that, after being decrypted with s, is the encoding of the rotated message vector. The single costly part of rotation is also KeySwitch. We refer readers to [17] and [10] for more details.
ARCHITECTURE
In this section, we describe our proposed architectures for NTT/INTT, Multiplication, and Key Switching modules.
Word Size and Native Operations. The Microsoft SEAL library [69] is developed for x86 architectures with 64-bit native operations. However, on FPGAs, the bit-width of the Digital Signal Processing (DSP) units that perform multiplication may vary, and hence it can be more efficient to use a different bit-width for native operations. For example, the two FPGA chips on which we have implemented our architecture have 27-bit DSP units. Choosing 27-bit or 54-bit multiplications as our native operation enables us to use fewer DSP units for the same computation: a naive construction of a 64-bit multiplier requires nine 27-bit DSPs, whereas a 54-bit multiplier requires only four. However, by reducing the bit-width of the RNS components, one may need to increase the number of such components, roughly speaking by a factor of 64/54 ≈ 1.2. In practice, small ciphertext moduli are usually less than 54 bits, and thus we do not need to increase the number of moduli. It is worth mentioning that by leveraging more sophisticated multi-word multiplication algorithms such as Toom-Cook, one can implement a 64-bit multiplication using five 27-bit multipliers together with more bit-level operations and additions. Overall, by switching from 64-bit native operations to 54-bit, we observe between 1.4× and 2.25× reduction in the number of DSP units needed (depending on the HE parameters). Please note that with a 54-bit word size, we need to make sure all of the ciphertext moduli (p_i) are (i) less than 52 bits, to ensure the correctness of Algorithm 2, and (ii) congruent to 1 mod 2n, to support NTT as described in Section 3.1. We have precomputed all such moduli for different parameters.
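As a quick back-of-the-envelope check of the DSP counts quoted above (schoolbook decomposition into 27-bit limbs; the Toom-Cook saving mentioned above is not modeled, and the helper name is ours):

```python
from math import ceil

def schoolbook_dsps(operand_bits, dsp_bits=27):
    """Number of dsp_bits x dsp_bits partial products for a schoolbook
    operand_bits x operand_bits multiplication."""
    limbs = ceil(operand_bits / dsp_bits)
    return limbs * limbs

print(schoolbook_dsps(64))   # 9 -> a naive 64-bit multiplier needs nine 27-bit DSPs
print(schoolbook_dsps(54))   # 4 -> a 54-bit multiplier needs only four
```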
MULT Module
We first describe the architecture of the MULT module, which can process both ciphertext-ciphertext (C-C) and ciphertext-plaintext (C-P) homomorphic multiplications. In what follows, we discuss the architecture for C-C multiplication, as C-P is a special case of C-C. In Microsoft SEAL, to achieve superior performance, ciphertexts and plaintexts are in NTT form by default and are converted back to the original form only if needed. Therefore, homomorphic multiplication is simply a series of dyadic multiplications on different components of the two input ciphertexts, as described in Algorithm 5.
The MULT module, depicted in Figure 1, encompasses nc_DYD-many Dyadic cores (nc stands for number of cores). Therefore, the MULT module can compute nc_DYD dyadic multiplications at each clock cycle. Each Dyadic core takes as input two polynomial coefficients (Op_1 and Op_2), two precomputed constant values (R_1 and R_2), and a one-word prime p, and outputs the multiplication result.
In the general case, each ciphertext can have more than two components. For example, consider a scenario where ct_1 (or ct_2) is the output of a previous C-C Mult (not relinearized) and now has three components. Let us denote the number of components in ct_1 and ct_2 by α and β, respectively. The outcome of homomorphic multiplication is a ciphertext with α + β − 1 components. Each ciphertext component is represented in RNS form. Recall that in homomorphic multiplication (Algorithm 5), the computation can be carried out independently on each RNS basis of the two ciphertexts, and we leverage this property to reduce BRAM utilization. Minimum BRAM utilization is achieved by storing one residue of each ciphertext component at a time. However, this approach significantly increases the data transfer from CPU to FPGA from (α + β) · n words to (α · β + min(α, β)) · n words, since we need to compute all pairwise combinations of ct_1 and ct_2 components. Thus, we allocate α-many memories of size n for ct_1 and β-many memories for ct_2 to hold one residue of all ciphertext components. As a result, we achieve O((α + β) · n) data transfer and BRAM consumption.
In order to fully utilize all nc_DYD Dyadic cores, regardless of the values of α and β, we read nc_DYD coefficients from one of the polynomials of ct_1 and ct_2 at every clock cycle. However, each unit of on-chip memory, i.e., a Block RAM (BRAM), only supports one read and one write per clock cycle. In order to read many coefficients from one polynomial at each cycle, we store each polynomial across nc_DYD-many parallel memory blocks that share common read/write address signals, as depicted in Figure 1. Let us call the aggregation of one row across different BRAMs a memory element (ME). Therefore, at every cycle, one memory element (ME1/ME2) is read from the ct_1/ct_2 memory banks and the result (ME3) is written to a separate output memory.
NTT Module
NTT/INTT is the most computationally intensive low-level operation in homomorphic encryption schemes. In what follows, we use NTT to refer to both the NTT and INTT modules; at the end of this section, we discuss the differences between the two. As can be seen from Algorithm 7, in KeySwitch, NTT/INTT is computed many times on different intermediate polynomials. However, the number of times that we need to compute NTT/INTT differs in distinct parts of the algorithm. In order to have a fully pipelined architecture, we allocate one NTT module per NTT operation in Algorithm 7. The relative throughput rate among different NTT instances also depends on the selected FHE parameters, which are application-dependent. As a result, we need a generic architecture for the NTT module such that the throughput can be adjusted as needed; this, in turn, translates to the number of NTT cores dedicated to a given NTT module. Figure 3 shows the internal architecture of a single NTT core, which accepts two coefficients, one twiddle factor, one precomputed value, and a prime as inputs and computes two transformed coefficients as outputs. From the functionality perspective, the architecture follows Algorithms 3 and 4.
The throughput of the NTT module is proportional to the number of NTT cores that it encompasses. We denote the number of NTT cores by nc_NTT. Ideally, at each clock cycle and given full utilization of the NTT cores, 2 nc_NTT coefficients are transformed. Similar to the MULT module, we store each polynomial across many parallel BRAMs that share common read/write address signals, as depicted in Figure 3. This is possible thanks to the aligned access pattern in NTT: while the access pattern changes during NTT, the number of consecutive accesses to the polynomial is always a power of two.

Access Pattern. At a high level, the NTT module computes the NTT of a polynomial of size n in log n stages. In each stage, the module computes the transformed result of 2 nc_NTT coefficients per step, thus requiring n/(2 nc_NTT) steps to finish one stage. However, the access pattern changes from one stage to another. Figure 2 illustrates the two types of access patterns in NTT. During the first log n − log nc_NTT − 1 stages, the paired coefficients are stored in different MEs; let us call these Type 1 stages. For instance, consider n = 4096 and nc_NTT = 8: during the first step of the first stage of NTT, x[0] (in ME 0) and x[2048] (in ME 256) should be passed to the first NTT core. More precisely, polynomial coefficient x[j] (j = 0, 1, ..., n/2 − 1) is passed together with x[j + n/2] to a given NTT core. In general, during the i-th stage, x[j] is paired with the coefficient located n/2^{i+1} positions away.
The address of the ME that is fetched at stage i and step j in Type 1 stages is computed in the Address Logic from i and j using bit-wise AND ("&"), right-shift (>>), and left-shift (<<) operations, with the shift amount s = log n − log nc_NTT − 2 − i. As soon as n/2^i = 2 nc_NTT, the inter-ME data dependency no longer exists, and pairs of coefficients are selected from within each ME independently, i.e., Type 2 stages.
In Type 1 stages, coefficients within the two fetched MEs are always accessed in the same order. For example, the second coefficient in each ME is always passed to the second NTT core. However, in Type 2 stages, a coefficient at a specific location of an ME is passed to a different NTT core, or even to different inputs of an NTT core (c_in.a or c_in.b), during consecutive stages. Therefore, coefficients have to be reordered before being passed to the NTT cores. Later in this section, we discuss an efficient method for reordering the coefficients. After the computations in the NTT cores, coefficients have to be reordered again and stored in the same positions from which they were accessed. The access pattern for twiddle factors, i.e., Y and Y′ in Algorithm 3, is different from the access pattern of coefficients. At stage i, only 2^i unique values of twiddle factors, starting at index 2^i of the twiddle polynomial, are used. Since in the worst-case scenario nc_NTT unique twiddle factors are used in a single step of NTT, we store twiddle factors in batches of size nc_NTT in parallel, as shown in Figure 3 (note that twiddle factors' MEs are half the size of polynomial coefficients' MEs). As a result, we can divide the access pattern of twiddle factors into four groups: (i) in the first group of stages, where 2^i < nc_NTT, only ME 0 (the first ME) is accessed throughout the stage computation and one (or more) twiddle factor(s) is (are) broadcast to different NTT cores; (ii) at stage log nc_NTT, only ME 1 is accessed but no broadcast is needed, as each twiddle factor inside the ME is passed to a separate NTT core; (iii) during stages where log nc_NTT < i < log n − 1, 2^{i − log nc_NTT} unique MEs are fetched and passed to the NTT cores; and finally, (iv) at stage log n − 1, a new ME of twiddle factors is read from BRAM at every step. The address of the twiddle factor ME that is fetched at stage i and step j is computed in the Address Logic accordingly.
Reordering Coefficients and Customized Multiplexers. During Type 1 stages, once the ME is fetched, passing each coefficient in the ME to the right NTT core (and the right input wire) is straightforward: the ℓ-th NTT core receives c_in.a = ME_e[ℓ] and c_in.b = ME_o[ℓ], where ME_e (respectively ME_o) is the memory element fetched at "even" ("odd") read cycles, i.e., j mod 2 = 0 (j mod 2 = 1), with j the step number. In Type 2 stages, c_in.a (or c_in.b) should receive data from one of the coefficients in the ME depending on the values of ℓ and i. The naive approach is to use one multiplexer (MUX) per coefficient input of every NTT core that selects one number from the 2 nc_NTT fetched numbers. We denote such a multiplexer by MUX_{2nc_NTT}. As a result, we would need 2 nc_NTT-many MUX_{2nc_NTT} to pass coefficients to the NTT cores and the same number of MUXs to reorder them before writing back to memory. These MUXs not only make the process of placement and routing more challenging but also consume an enormous number of registers and logic blocks. Moreover, scaling the NTT module to a higher number of cores (> 32) is inefficient due to the super-linear resource consumption with respect to nc_NTT. In many cases, synthesis tools failed to place and route the resources required to realize these MUXs.
In contrast, we take advantage of the fact that different NTT core inputs have different numbers of possibilities from which they select the correct coefficient at a given stage. For example, during Type 2 stages, the c_in.a input of the first NTT core only receives coefficients from the first word of ME_s, regardless of the stage or step number.
In the worst-case scenario, there are log nc_NTT different indices from which a coefficient should be accessed from ME_s for a particular NTT core input. Therefore, instead of using (4 · nc_NTT)-many MUX_{2nc_NTT}, we instantiate (4 · nc_NTT)-many MUXs with at most log(2 nc_NTT) inputs. The selection signal of these MUXs is set to s = log n − 1 − i (i being the stage number), and the α-th input wire of the MUX that selects the corresponding input coefficient from ME_s for the ℓ-th NTT core (MUX{c_in.a}(α) or MUX{c_in.b}(α), with 0 ≤ α < log nc_NTT) is wired according to the stage-dependent access pattern described above. Creating customized multiplexers for twiddle factors is significantly simpler. As discussed previously, during group (i) stages, only ME 0 is accessed and, depending on the stage number, one or more of its twiddle factors are selected and broadcast to different NTT cores. Therefore, from the perspective of each w_in (or wp_in), there are log nc_NTT possibilities from which the twiddle factor can be selected; this is implemented as a MUX of size log(2 nc_NTT). During group (ii)-(iv) stages, w_in (respectively wp_in) is selected as ME_w[ℓ] (respectively ME_wp[ℓ]).
Two-Stage Read, Compute, and Write. Storing polynomial coefficients across different memory blocks solves the problem of simultaneous memory accesses. However, the NTT cores cannot be fully utilized at all times, for the following reason. During Type 1 stages, the coefficients that should be passed to each NTT core are not located in the same memory element. Therefore, two different memory elements have to be read (Read_α1 and Read_α2) before the computation (Comp_α) can start, which introduces a 50% bubble in the NTT core pipeline (see Figure 4). More precisely, the first log n − log nc_NTT − 1 stages face this problem. Given that the NTT modules consume most of the FPGA resources, this in turn means that the throughput of the entire design would be reduced to (log n − log nc_NTT − 1)/log n.
In order to solve this problem, we propose to double the size of the MEs and store twice as many consecutive coefficients, i.e., 2 nc_NTT, in each memory element. Meanwhile, we reduce the depth of the memories that store the polynomial by half. The modified pipeline is shown in Figure 4. Even though it is still necessary to read two MEs (during two consecutive cycles) before starting the computation, we can now transform two MEs in the next two cycles (Comp_α1 and Comp_α2) and store them back in memory (Write_α1 and Write_α2). Our solution does not reduce the latency of the computation, but it gives us full utilization of the NTT cores over time. As we will discuss in Section 6, BRAM is the most constrained resource. In order to have minimal BRAM usage, all of the reads and writes during different NTT stages are in-place, and no additional BRAM is used to store intermediate values.

Putting it all together. Figure 3 illustrates the full architecture of the NTT module. On the three corners reside the data memory, the twiddle factors' memories, and the output memory. Based on the values of the step and stage counters, the corresponding addresses of the MEs for the data and twiddle factor memories are generated. At every cycle, one ME is fetched from data memory and is stored in the ME_e or ME_o registers on alternating cycles. For each input coefficient of the NTT cores, i.e., c_in.a and c_in.b, a set of multiplexers selects the correct coefficient from ME_e and ME_o (depicted as light blue boxes in Figure 3). During Type 1 stages, c_in.a is selected from one of two positions in ME_e and c_in.b from ME_o, respectively; this selection is performed by MUX3. During Type 2 stages, first one of ME_e or ME_o is selected using 2 nc_NTT-many two-to-one multiplexers (MUX1) and is stored in the ME_s registers. Next, as discussed previously, a customized multiplexer is assigned to each specific operand (c_in.a or c_in.b) of a given NTT core (MUX2) to select one of the coefficients within ME_s. Finally, MUX4 selects one of these two results based on which stage type is being processed. A similar set of MUXs (MUX6 and MUX7) is used to reorder the data back to the original locations before storing the result in memory.
Once the final results are ready to be written in the memory (ME4 and ME5), they will be stored in data memory during two consecutive clock cycles. In the last stage, however, ME4 and ME5 are stored in output memory in order to keep the data available for the next module(s).
Memory Utilization and Word-Packing. Storing multiple polynomial coefficients (or twiddle factors) in multiple parallel memory units (M20K) can cause memory block underutilization, both depth-wise and width-wise. Let us consider a general scenario where β-many numbers are stored in parallel, whether they are polynomial coefficients or twiddle factors.
Depth-wise: Each M20K memory unit holds 512-many 40-bit wide words, and at any cycle one word can be read from or written into the memory. When fewer than 512 words are stored in the memory, the rest of the memory rows cannot be used to store a secondary polynomial, since at any point in time we are reading/writing one word associated with the first polynomial. As long as n/β ≥ 512, the M20K is fully utilized. This inequality generally holds in our hardware architecture except when n = 2^12 (the smallest polynomial size) and nc_NTT = 16, which makes the M20Ks half-utilized. However, this is not an issue since our design is not BRAM-constrained when n = 2^12.
Width-wise: As the polynomial size (n) grows, our design becomes more and more BRAM-constrained, to the extent that at n = 2^14 there is not enough BRAM on the chip; thus, we have to use DRAM as well (we discuss this in more detail in Section 5). Therefore, it is essential that the polynomials are stored efficiently in memory. By storing each coefficient in a separate physical BRAM, we would only reach 54/(2 · 40) ≈ 68% utilization. In contrast, we pack multiple coefficients and store them in fewer M20K units, as shown in Figure 3, reaching a memory utilization of β · 54/(⌈β · 54/40⌉ · 40). For β = 8, BRAM utilization reaches more than 98%.
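The packing arithmetic above can be checked directly (40-bit M20K words and 54-bit coefficients as stated; the helper name is ours):

```python
from math import ceil

def width_utilization(beta, word_bits=54, m20k_width=40):
    """Width-wise BRAM utilization when beta words are packed across adjacent M20K units."""
    used = beta * word_bits
    allocated = ceil(used / m20k_width) * m20k_width
    return used / allocated

print(f"{width_utilization(1):.1%}")   # 67.5% -> one coefficient per physical BRAM
print(f"{width_utilization(8):.1%}")   # 98.2% -> packing eight coefficients together
```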
Performance. The NTT module processes one stage in n/(2 nc_NTT) cycles. Computing the NTT of a polynomial takes log n stages, and hence it takes (n log n)/(2 nc_NTT) cycles to compute one NTT.

INTT Module. Our NTT architecture can be used for both NTT and INTT with only a few modifications: (i) the NTT core is replaced by an INTT core (see Figure 3), (ii) the control unit operates in the reverse order of stage numbers, and (iii) the twiddle factor memories store the corresponding precomputed values. The rest of the architecture remains unchanged.
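Plugging numbers into the cycle count above gives a feel for the per-polynomial latency; the example parameters below are illustrative rather than taken from the result tables.

```python
def ntt_cycles(n, nc_ntt):
    """Cycles for one NTT: log2(n) stages, each taking n/(2*nc_ntt) steps."""
    log_n = n.bit_length() - 1          # n is a power of two
    return n * log_n // (2 * nc_ntt)

print(ntt_cycles(2**13, 16))            # 3328 cycles for n = 2^13 with a 16-core module
```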
KeySwitch Module
KeySwitch is the most computationally intensive high-level operation in the CKKS scheme. It has several important roles, including relinearization and ciphertext rotation. In this section, we describe our proposed architecture for this operation as well as the challenges we faced during the design process. Figure 5 illustrates the KeySwitch architecture. From the functionality perspective, this architecture corresponds to Algorithm 7. To reduce on-chip memory usage, the design takes one polynomial (one RNS component) at a time and outputs two polynomials. Recall that in CKKS, all polynomials are in NTT form by default. Thus, once the input polynomial is written into the input memory, it has to be converted back to the original domain. This process is performed by the first INTT module (INTT_0). Next, the polynomial is transformed to NTT form for all other primes (including the special prime). Since per each INTT computation we have to perform k NTTs, the throughput of the NTT module(s) has to be k times the throughput of INTT_0. Here, k is the number of RNS components of the ciphertext modulus, i.e., L + 1. This requirement can be realized in two different ways: (i) having one NTT module with k times more cores than INTT_0, or (ii) having multiple NTT modules with fewer cores per module. We denote this NTT module (or set of modules) as NTT_0. We will discuss the trade-offs later in this section. In Figure 5, the second approach (using more than one NTT module) is chosen for the n = 2^13 and k = 4 parameter set.
In the NTT/INTT module, the intermediate results are read/written from internal memory and the final results are written to the output memory. Once the NTT computations are finished, the DyadMult module computes the dyadic product between the output of NTT modules and the relinearization/Galois keys according to Algorithm 7. Recall that a dyadic product on the original input polynomial is also needed in KeySwitch; thus, a separate Dyadic module is used. After dyadic product computation, the result is stored in the corresponding memory banks. There are two sets of BRAM banks, each bank containing the RNS components of one polynomial.
The computation flow described above repeats k-many times (once per RNS component). The result is accumulated in the BRAM banks. After k iterations, the second part of the computation, usually referred to as Modulus Switching (developed in [12]), is performed. In Modulus Switching, which executes Floor, the polynomial corresponding to the special prime has to be transformed back to the time domain (by INTT_1) and then transformed using each of the other k primes (by NTT_1). The aforementioned process is performed independently for both sets of banks. Next, the polynomial is multiplied by the inverse value of the associated prime and subtracted from the result of the first half of the KeySwitch computation. The MS module embeds the multiplication and subtraction operations. The output of KeySwitch is stored as two sets of k polynomials, referred to as "Output Poly 0/1" in Figure 5.

Balancing Throughput. Our primary goal in designing the KeySwitch architecture is to have a fully end-to-end pipelined module that can process many key switching operations simultaneously without any (or excessive) FIFOs between different components. Thus, we have to tune the throughput of each component carefully. As we discussed in Section 4.2, this is one of the reasons to design a flexible architecture for NTT, whose throughput can be adjusted based on the number of cores that it encompasses. According to Algorithm 7, per each initial INTT, we have to compute k NTTs. In the next part, we discuss two main approaches to realize this requirement.
Number of Cores vs. Number of Modules:
To balance the throughput of INTT_0 and NTT_0, one can allocate a single NTT module with nc_NTT0 = k · nc_INTT0. However, in practice, having a very large NTT module with more than 32 cores results in place-and-route failures during hardware synthesis. Moreover, recall from Section 4.2 that the ALM consumption of the NTT module grows faster than linearly with respect to the number of cores: O(nc_NTT0 · log nc_NTT0). Therefore, breaking the total number of NTT cores into separate modules results in less ALM consumption and a more robust hardware synthesis. The downside of this approach is higher BRAM utilization, since each NTT module has its own internal data memory (for holding intermediate values) as well as its own output memory.
Considering all other restrictions and requirements, we choose multiple NTT modules over a single big one. In general, let us denote the number of NTT_0 modules by m_0 (assumed to be a power of two). Thus, each NTT_0 module needs nc_NTT0 = (k · nc_INTT0)/m_0 cores.
Next, we can compute the number of cores needed for the DyadMult module. Recall from Section 4.2 that it takes (n log n)/(2 nc_NTT0) cycles for the NTT module to finish its computation. The DyadMult module has to compute the product of the NTT output with two different sets of keys (ksk = D_0 | D_1); it takes (2n)/nc_DYD cycles to perform the dyadic multiplication on the output of the NTT module. Since in general log n is not a power of two, the throughputs do not match perfectly. We make sure that the throughput of the Dyadic module is greater than or equal to that of the NTT module by satisfying (2n)/nc_DYD ≤ (n log n)/(2 nc_NTT0), i.e., nc_DYD ≥ (4 · nc_NTT0)/log n.
The throughput of the INTT_1 modules can be adjusted by assigning nc_INTT1 = nc_INTT0/k. One can also determine nc_NTT1 = nc_INTT0 and nc_MS = ⌈(2 nc_NTT1)/log n⌉. For the two FPGA chips on which we have implemented HEAX, the optimal architecture parameters are computed and summarized in Table 5.
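The relations above can be collected into a small sizing helper. The rounding of core counts up to the next power of two is our assumption, made so that the derived values line up with the configurations in Table 5; the function and variable names are ours.

```python
from math import ceil, log2

def next_pow2(x):
    return 1 << ceil(log2(x))

def size_keyswitch(n, k, nc_intt0, m0):
    """Derive per-module core counts from the throughput-balancing relations."""
    log_n = int(log2(n))
    nc_ntt0 = k * nc_intt0 // m0                  # m0 NTT_0 modules keep up with one INTT_0
    nc_dyd = next_pow2(4 * nc_ntt0 / log_n)       # from (2n)/nc_dyd <= (n log n)/(2 nc_ntt0)
    nc_intt1 = nc_intt0 // k
    nc_ntt1 = nc_intt0
    nc_ms = next_pow2(2 * nc_ntt1 / log_n)
    return dict(NTT0=nc_ntt0, Dyad=nc_dyd, INTT1=nc_intt1, NTT1=nc_ntt1, MS=nc_ms)

# Set-B on Stratix 10: n = 2^13, k = 4, a 16-core INTT_0 split across 4 NTT_0 modules
print(size_keyswitch(2**13, 4, 16, 4))
# {'NTT0': 16, 'Dyad': 8, 'INTT1': 4, 'NTT1': 16, 'MS': 4}
```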
Simultaneous KeySwitch Ops. and Synchronization. Figure 6 shows the high-level pipeline of the KeySwitch module for n = 2^13 (third row of Table 5). All of the modules, and their internal components, are pipelined, and the throughput is balanced. As can be seen, multiple key switching operations are computed simultaneously in different pipeline stages (in lighter colors). The fifth Dyadic module, which operates on the input polynomial BRAM, is synchronized with the rest of the Dyadic modules even though its computation could start as soon as the input polynomial is available. The reason is that during each of the k iterations of the dyadic product, each module computes and accumulates the results by reading/writing from/to a separate BRAM bank; this enables us to avoid any memory replication, considering that these memory banks are prohibitively large. However, this delayed computation introduces a dependency problem in the pipeline, referred to as "Data Dependency 1": by the time the k-th Dyadic module starts its computation, the content of the input polynomial memory has been overwritten by the next key switching operation. As a result, we allocate enough BRAM to hold f_1-many polynomials, where f_1 depends on the architecture parameters. Similarly, one of the inputs of the MS module comes from the output of the DyadMult modules, which will be overwritten by subsequent results; this is marked as "Data Dependency 2" in Figure 6. Thus, we need to allocate more memory to store the outcome of the DyadMult modules in f_2 different buffers, where f_2 is likewise determined by the architecture parameters.
Recall that we pack and store multiple words in parallel to minimize BRAM utilization while supporting simultaneous accesses. Given that each module consumes and produces a different number of words at a time, we need to make any two subsequent modules compatible in terms of memory read/write rate.
Down-Scale Conversion: This procedure is needed when the number of words that have to be read by the next module is lower than in the current module. For example, going from NTT_0 to DyadMult, we need to reduce the number of values we store in parallel. We satisfy this constraint by writing the output into multiple M20K units without overlap. This inevitably results in lower BRAM saturation, both width-wise and depth-wise.
Up-Scale Conversion: This procedure is used, for example, between INTT_1 and NTT_1. Consider a conversion rate of r, meaning that every r cycles the result of the previous block has to be written into one memory row. Therefore, during r-many cycles, the results are temporarily stored in registers, and they are then written into a certain number of M20Ks in parallel. We leverage the data compaction approach described in Section 4.2 in order to increase BRAM saturation.
SYSTEM-VIEW and DATA FLOW
In this section, we discuss a higher-level view of the computation and elaborate on the data flow. Figure 7 shows a system view comprising the host CPU and the FPGA board, which are connected via a Peripheral Component Interconnect express (PCIe) bus. On the FPGA board reside the FPGA chip as well as off-chip DRAM connected via a DDR interface.
On-Chip vs. Off-Chip Memory Accesses
There are two main ways to store data on the FPGA board: (i) off-chip DRAM, with several gigabytes of capacity but high response delay, and (ii) on-chip BRAM, with a few megabits of capacity but very fast response time and high throughput.
As has been shown by prior art [66, 67], leveraging off-chip memory to store intermediate results significantly reduces the overall performance due to the high delays between subsequent reads and writes. One of our primary design goals is to avoid off-chip memory accesses as much as possible. We have introduced several techniques to use minimal on-chip memory and to re-use many BRAM units, together with data compaction (see Section 4.2 and Section 4.3). As a result, no off-chip memory access is performed for the n = 2^12 parameter set on both Arria 10 and Stratix 10 FPGAs, which is one of the main reasons for our unprecedented performance improvements. For the n = 2^13 parameter set, there is sufficient on-chip memory on the Stratix 10 chip. Unfortunately, for n = 2^14, there is not enough BRAM available for our design, and as a result, we have to move some part of the data to off-chip memory. In order to minimize the effect of off-chip memory accesses, we choose to put the key switching keys (ksk) on DRAM for two main reasons: (i) the size of these keys grows very rapidly with the HE parameters. In general, the size of ksk grows as O(n · k²), and roughly speaking, k grows linearly with n, which results in (almost) O(n³) growth. This is the highest growth rate among all memory components, including the twiddle factors, which grow as O(n · k). (ii) ksk is only read once per KeySwitch. Note that each unique element of the twiddle factors is read k times during one KeySwitch operation; thus, twiddle factors are less suitable candidates.
We distribute the ksk among four different DRAM banks such that at any point in time the full capacity of the off-chip memory bandwidth is used. In order to further mask the effect of DRAM accesses, we leverage the burst mode, in which a long sequence of data is read at once on each channel. The entire process of reading ksk from DRAM is pipelined to minimize the drop in the throughput of KeySwitch. It is worth mentioning that the DRAM bandwidth is sufficient to match the throughput of KeySwitch. Per each KeySwitch, two sets of ksk have to be streamed to the FPGA chip. Each of these sets holds k · (k + 1)-many vectors of size n. Substituting n = 2^14, k = 8, and 64 bits per word results in ≈ 151 Megabits. We have to stream this volume of data in 383 microseconds (please see Table 8). Therefore, the DRAM bandwidth should be higher than 49.28 GBps, which is indeed lower than the measured bandwidth of all four channels combined.
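The quoted bandwidth requirement follows directly from these numbers:

```python
n, k, word_bits = 2**14, 8, 64
sets = 2                                     # two sets of ksk streamed per KeySwitch
bits = sets * k * (k + 1) * n * word_bits    # data volume per KeySwitch
seconds = 383e-6                             # KeySwitch latency for this parameter set
print(bits / 1e6)                            # ~151.0 megabits
print(bits / 8 / seconds / 1e9)              # ~49.3 GB/s of required DRAM bandwidth
```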
In addition to storing ksk, we use DRAM for one more purpose. In some applications, it is more efficient to store the result of computation on DRAM instead of sending them back to CPU in case these results are going to be used later on. The address at which the result is stored is held on the CPU side and is shown as "Memory Map". The memory map is used to point to the ciphertext(s) that are stored on DRAM to be used during the rest of the computation without involving PCIe.
Data Transfer on PCIe
In order to maximize the utilization of the computation blocks on the FPGA, we need to interleave computation and data transfer between the FPGA and the CPU. We divide this design process into two parts: the CPU side and the FPGA side. On the CPU side, we need to sequence and batch multiple operations in the program (which uses SEAL) and start the data transfer process on PCIe using multiple threads. On the FPGA side, we need to allocate the necessary buffers to store the data received from the CPU.
Sequencing and Batching. Transferring data on PCIe involves three main steps: (i) a memcpy is issued to copy the content of the polynomial to pinned memory pages, (ii) the CPU signals the FPGA that the data is ready, and (iii) the FPGA reads the data from PCIe. In order to reduce the time it takes to copy the data, Direct Memory Access (DMA) is used. However, even when relying on DMA, the maximum throughput that PCIe can provide depends on the message size and the number of simultaneous data transfer requests. Therefore, we transfer (at least) a complete polynomial (2^15 to 2^17 bytes) in each request. Moreover, we implement a multi-threaded data transfer mechanism that uses eight threads to interleave eight separate polynomials at a time to maximize the PCIe throughput and avoid unnecessary bubbles in the computation pipeline.
Double and Quadruple Buffering. For the MULT module, it suffices to double-buffer the input such that the CPU writes to one of the buffers while the FPGA reads from the other one. For the KeySwitch module, however, we need to perform quadruple buffering due to the data dependency on the input polynomial, as discussed in Section 4.3. In order to make sure buffers are not overwritten before they are read, we stall the writing process if the buffer has not been read yet.
IMPLEMENTATION and EXPERIMENTS

Experimental Setup
In this section, we elaborate in detail on the resource consumption of each component of our design starting from computation cores and moving to basic and high-level modules. We also provide our experimental results and compare the performance of HEAX to Microsoft SEAL execution on CPU.
We implement HEAX on two Intel FPGAs considering two different scales to illustrate the adaptability of HEAX:
• Board-A: with an Intel Arria 10 GX 1150 chip, 4 GB of DRAM memory with two independent DRAM channels, and PCIe Gen3 with eight lanes providing 7.88 GBps of bandwidth in each direction between the host CPU and the FPGA.
• Board-B: with an Intel Stratix 10 chip and four independent DRAM channels (see Table 1 for the detailed resource breakdown).
There are three major types of resources available on the FPGA chip:
• Digital Signal Processing (DSP) units that are able to perform one 27-bit or two 18-bit multiplications.
• Adaptive Logic Modules (ALM) that represent core logic units and provide two combinational adaptive look-up tables, a two-bit full adder, and four registers (REG), each capable of holding a 1-bit value.
• Block RAM (BRAM) units that represent on-chip memories for fast read and write operations. BRAM consists of M20K units that hold 512-many 40-bit values.
Table 1 summarizes the breakdown of the resources available on each FPGA chip as well as the connections to DRAM. We evaluate our design on a wide range of HE parameters: from a ciphertext polynomial size (n) of 2^12 with a 109-bit ciphertext modulus (⌊log qp⌋ + 1) up to 2^14 with a 438-bit ciphertext modulus. We refer to these parameter sets as Set-A, Set-B, and Set-C, respectively (summarized in Table 2). Recall that k is the number of small RNS components of the ciphertext modulus.
Table 2: Description of HE parameter sets. n is the ciphertext polynomial size, qp is the ciphertext modulus, and k is the number of RNS components of q.
These parameters are selected based on the HE security standard [1] for 128-bit classical security. Parameters with 128-bit post-quantum security require slightly smaller ciphertext moduli and hence are easier to implement on an FPGA. We selected as few prime moduli for RNS as possible, since they are critical to performance according to the analysis in [37]. Note that parameter sets corresponding to 2^11 (or lower) are almost never used in practice due to the multiplication depth of 1 (or zero). Choosing 2^15 (or higher) results in an enormous computation blow-up and is also rarely used in practice.
We compare the performance of HEAX with Microsoft SEAL v3.3 [69], an FHE library for the BFV and CKKS schemes that has undergone several years of performance optimizations. We measure the performance of SEAL on a single-threaded Intel Xeon(R) Silver 4108 running at 1.80 GHz, which is similar to the CPU used in prior art [67].
Basic Modules. Table 4 provides the detailed resource consumption of the shell as well as the different modules for various numbers of cores within each module. The BRAM utilization results are reported for Set-B parameters (n = 2^13). As can be seen, the BRAM bit usage in each module does not depend on the number of cores. However, as the number of cores grows, more coefficients are stored in parallel, which results in consuming more M20K units. In the last column, we report the number of cycles it takes for each module to process a polynomial (or a pair of polynomials in the case of the MULT module).
Resource Consumption
Computation Cores.
Table 5: KeySwitch instantiations (modules × cores per module) for each FPGA and parameter set.
Arria 10, n = 2^12 (Set-A): 1 × INTT(8) → 2 × NTT(8) → 3 × Dyad(4) → 2 × INTT(4) → 2 × NTT(8) → 2 × Mult(2)
Stratix 10, n = 2^12 (Set-A): 1 × INTT(16) → 2 × NTT(16) → 3 × Dyad(8) → 2 × INTT(8) → 2 × NTT(16) → 2 × Mult(4)
Stratix 10, n = 2^13 (Set-B): 1 × INTT(16) → 4 × NTT(16) → 5 × Dyad(8) → 2 × INTT(4) → 2 × NTT(16) → 2 × Mult(4)
Stratix 10, n = 2^14 (Set-C): 1 × INTT(8) → 4 × NTT(16) → 5 × Dyad(8) → 2 × INTT(1) → 2 × NTT(8) → 2 × Mult(4)
Performance
Critical Paths and Maximum Clock Frequency. The FPGA operating clock frequency directly affects performance; thus, we analyzed the critical paths of our design and eliminated them over many design iterations. These modifications include, but are not limited to: altering the data flow of the computation, reducing fan-out and data dependencies among registers, breaking large multiplexers into smaller ones, and minimizing the number of levels of logic per pipeline stage. The final maximum clock frequencies achieved for the Arria 10 and Stratix 10 FPGA chips are 275 MHz and 300 MHz, respectively.
Scalability. One of the desirable features of HEAX is that it can automatically be instantiated at different scales (with no manual tuning) based on the available hardware resources. To illustrate this scalability and adaptability, we instantiated HEAX for the same HE parameters (Set-A) at two different scales (see Table 5). As Table 6 shows, the up-scaled instantiation on Stratix 10 consumes close to twice the resources of the Arria 10 instantiation and provides twice the throughput (see Table 8). This property enables cloud providers to seamlessly instantiate HEAX based on the underlying application, the HE parameters, and the available FPGA resources.
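The sketch below illustrates one possible sizing heuristic of the kind such automatic instantiation could use: pick the largest power-of-two core count whose estimated DSP/ALM/BRAM demand still fits the target chip. The per-core costs are placeholders and the device budgets are approximate public figures; neither reflects the exact procedure or numbers used by HEAX (see Table 1 for the actual resource counts).

```python
# Hypothetical sizing heuristic (a sketch only; HEAX's actual instantiation
# procedure may differ): choose the largest power-of-two core count whose
# estimated resource demand still fits the target FPGA.

PER_CORE_COST = {"dsp": 96, "alm": 30_000, "m20k": 160}   # placeholder costs

CHIP_BUDGET = {  # approximate public device figures; see Table 1 for exact ones
    "Board-A (Arria 10)":   {"dsp": 1_518, "alm": 427_200, "m20k": 2_713},
    "Board-B (Stratix 10)": {"dsp": 5_760, "alm": 933_120, "m20k": 11_721},
}

def max_cores(budget: dict, per_core: dict) -> int:
    # Double the core count while the doubled design still fits the budget.
    cores = 1
    while all(2 * cores * per_core[r] <= budget[r] for r in per_core):
        cores *= 2
    return cores

for chip, budget in CHIP_BUDGET.items():
    print(chip, "->", max_cores(budget, PER_CORE_COST), "cores per module")
```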
Performance Comparison. Note that we report the performance results for low-level operations merely for completeness; these operations are rarely used in isolation and are instead invoked as part of high-level operations. For high-level operations, i.e., Rotation and Relinearization (using KeySwitch) and a complete ciphertext multiplication (using MULT and KeySwitch), the performance improvements are more pronounced, because many such operations can be executed in parallel, in contrast to the CPU execution. Table 8 shows the performance results as well as the speedup for high-level operations in the CKKS scheme. As can be seen, HEAX achieves close to two orders of magnitude performance improvement on the Arria 10 FPGA (Board-A) compared to the CPU (first row of Table 8). On the more powerful Intel Stratix 10 FPGA (Board-B), HEAX achieves 164-268× performance improvements across the HE parameter sets compared to the CPU (second to fourth rows of Table 8).
RELATED WORK
Homomorphic encryption can fundamentally change the trust and computation model of many applications and industries. For example, in Machine Learning as a Service (MLaaS), the privacy of users' data as well as the confidentiality of the cloud server's ML model can be preserved. Privacy-preserving MLaaS has been studied rigorously in recent years, and several protocol-level and algorithmic improvements have been proposed [36, 48, 63, 46, 19, 64, 62, 61, 42]. However, the computation overhead compared to traditional plaintext execution remains a standing challenge.
The CKKS scheme is one of the most recently proposed FHE schemes and allows homomorphic operations on fixed-point numbers, making it the prime candidate for machine learning applications. To the best of our knowledge, no hardware architecture has been proposed for the CKKS scheme, and in this paper we propose the first of its kind. As a result, it is not fair to compare the performance of HEAX with previous designs that focus on non-CKKS schemes. In what follows, we briefly review the research efforts related to FPGA-, ASIC-, and GPU-based acceleration of non-CKKS schemes as well as (non-HE) secure computation protocols.
Hardware Acceleration for non-CKKS Schemes. In [66], an FPGA-based system is proposed for the BFV scheme to process ciphertext polynomial sizes of 2^15. However, due to the massive off-chip data transfer, their design does not yield superior performance compared to CPU execution.
Perhaps the closest work to ours is by Roy et al. [67], in which the authors propose an architecture for the BFV scheme and implement their design on a Xilinx Zynq UltraScale+ MPSoC ZCU102. To avoid off-chip memory accesses, the authors focus on n = 2^12 ciphertext sizes and report a 13× speed-up (using two instances of their proposed processors) over FV-NFLlib [33] executing on an Intel i5 processor running at 1.8 GHz. However, compared to the more optimized Microsoft SEAL library [68], FV-NFLlib is 1.2× slower [6]. In addition, our design is significantly more modular and scalable: we have instantiated HEAX for three different sets of HE parameters with no manual tuning (polynomial sizes of 2^12, 2^13, and 2^14). Moreover, HEAX has a multi-layer pipelined design and is drastically more efficient, offering more than two orders of magnitude performance improvement over Microsoft SEAL running on an Intel Xeon Silver 4108 at 1.8 GHz (a processor similar to the one used in [67], running at an identical frequency).
FPGA-based Co-Processors. Designing co-processors has also been studied in the literature. These co-processors work in conjunction with CPUs and accelerate one or more of the homomorphic operations [45, 55, 21, 38, 56, 49]. In [55] and [38, 56], the authors focus on designing hardware architectures for the encryption operation only, leveraging the Karatsuba and Comba multiplication algorithms, respectively. In [21], a Homomorphic Encryption Processing Unit (HEPU) is proposed for the LTV scheme [53]. The authors focus on accelerating the Chinese Remainder Transform (CRT) for power-of-2 cyclotomic rings and report 3.2-4.4× performance improvements for homomorphic multiplication using a Xilinx Virtex-7 FPGA.
Large-Integer Multiplication Hardware Acceleration. A line of research focuses on designing very large integer multipliers (based on FPGAs or ASICs) that can be used to accelerate homomorphic operations [74, 75, 28, 29, 13]. These architectures support 768K-bit to 1.18M-bit multiplications.
In [14], a large-integer multiplier and a Barrett modular reduction are proposed that can accelerate HE operations by 11×. However, the performance degradation due to off-chip memory accesses is not considered in this work, even though such accesses can be the main bottleneck, as reported by prior art [67, 60]. In contrast, HEAX does not rely on large-integer multiplication and instead leverages RNS-based parallelism to achieve efficient routing and superior performance.
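To make the contrast concrete, the following toy Python example sketches the RNS idea: a multiplication modulo a large q = q1·q2·q3 is replaced by independent word-sized multiplications modulo each qi, which can run in parallel, with CRT reconstruction only when a result modulo q is needed. The primes below are toy values, not the moduli used by HEAX.

```python
# Toy illustration of RNS arithmetic: one multiplication modulo a large q
# becomes independent small multiplications modulo each prime factor of q,
# with CRT reconstruction performed only when a value in Z_q is required.
from math import prod

primes = [97, 101, 103]          # toy pairwise-coprime RNS primes
q = prod(primes)

def to_rns(x):                   # residues of x modulo each prime
    return [x % p for p in primes]

def crt(residues):               # reconstruct x mod q from its residues
    x = 0
    for r, p in zip(residues, primes):
        m = q // p
        x += r * m * pow(m, -1, p)
    return x % q

a, b = 123_456 % q, 654_321 % q
rns_prod = [(ra * rb) % p for ra, rb, p in zip(to_rns(a), to_rns(b), primes)]
assert crt(rns_prod) == (a * b) % q   # matches the single large multiplication
print(crt(rns_prod), (a * b) % q)
```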
GPU-based HE Acceleration. Graphics Processing Units (GPUs) are an alternative computing platform for accelerating homomorphic evaluation [25, 22, 57, 5, 50, 72]. Wang et al. [72] proposed the first GPU acceleration of FHE, targeting the Gentry-Halevi scheme [35]. Subsequent improvements are reported in [73]. In [71], a GPU-based implementation of the BGV scheme [11] is introduced. In [5], a comprehensive study of multi-threaded CPU execution as well as a GPU implementation of the BFV scheme is reported. To the best of our knowledge, there is no GPU-accelerated implementation of the CKKS scheme. Moreover, GPUs normally offer less performance per watt than FPGAs by design; therefore, FPGAs are more suitable candidates for high-performance, low-power secure computation in the cloud.
Acceleration of YASHE and LTV Schemes. Several works [58, 24, 27, 59, 20, 21] focus on improving the performance of the YASHE [9] and LTV [53] schemes or their variants. These constructions, which are based on an overstretched NTRU assumption, are subject to a subfield lattice attack [2] and are no longer considered secure. Nevertheless, their contributions in optimization techniques remain valuable.
In what follows, we briefly review two of the designs most relevant to ours. In [65], an architecture for the YASHE scheme is proposed that provides a 25× performance improvement over the CPU. However, the authors assume unlimited memory bandwidth, which renders off-chip memory accesses free of cost and is not a realistic assumption. Pöppelmann et al. [60] have also proposed an architecture for the YASHE scheme. Since ciphertexts are prohibitively large to be stored in on-chip memory, the authors leverage the idea of Cached-NTT [3, 4]. Cached-NTT formulates the NTT of a large polynomial as many invocations of a fixed computation such that loading/saving intermediate results from/to off-chip memory is minimized (hence the notion of "cached"); this reduces the number of off-chip memory transactions, which are significantly costlier than on-chip memory accesses. In contrast, in HEAX, we rely on the ring isomorphism property and perform independent computation on each RNS component of a large ciphertext. This, in turn, allows us to avoid off-chip memory accesses for small-size HE parameters and to minimize such accesses for larger parameters.
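The following sketch illustrates, under assumed (not measured) storage parameters, why per-RNS-component processing keeps the working set small: each residue polynomial is an independent length-n vector of word-sized values, so it can be transformed on its own, and at an assumed 64 bits per coefficient a single component occupies only tens to a few hundred kilobytes.

```python
# Sketch of per-RNS-component processing (an illustration of the idea, not
# HEAX's implementation): each of the k residue polynomials of a ciphertext is
# an independent length-n vector, so it can be processed on its own without
# touching the other components.

def component_bytes(n: int, bytes_per_coeff: int = 8) -> int:
    # assumed 64-bit (8-byte) storage per coefficient
    return n * bytes_per_coeff

for n in (2**12, 2**13, 2**14):
    print(f"n = {n}: {component_bytes(n) / 1024:.0f} KB per residue polynomial")

def process_ciphertext(rns_components, transform):
    # Each component is processed independently; on hardware these iterations
    # can be streamed through the same pipeline back-to-back.
    return [transform(c) for c in rns_components]

# toy usage: double every coefficient of two tiny "residue polynomials"
print(process_ciphertext([[1, 2, 3], [4, 5, 6]], lambda c: [x * 2 for x in c]))
```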
Hardware Acceleration of Secure Computation Protocols. Secure Multi-Party Computation (SMPC) protocols can also be used for the task of privacy-preserving MLaaS. The Garbled Circuits protocol [76] is one of the secure two-party computation protocols for which FPGA-based accelerators have been proposed [43, 32, 44]. However, compared to HE, the communication between the parties is significantly higher, making the end-to-end protocol mostly bandwidth-constrained. Thus, accelerating the computation portion of SMPC protocols does not necessarily reduce the overall execution time of the protocol. In contrast, the main bottleneck of HE is the computation overhead, and reducing the homomorphic evaluation time directly increases the practicality of HE-based computation.
CONCLUSION
In this paper, we introduced a novel set of architectures for Fully Homomorphic Encryption (FHE). To the best of our knowledge, HEAX is the first architecture and fully implemented hardware accelerator for the CKKS FHE scheme. CKKS is the prime candidate for machine learning on encrypted data due to this scheme's floating-point support. The components designed in HEAX can also be used for other lattice-based cryptosystems and other FHE/HE schemes. The proposed architecture provides a unique degree of flexibility and can be readily adjusted for various FPGA chips. As a proof of concept, we have implemented HEAX on two different FPGAs with contrasting hardware resources. Moreover, unlike prior FPGA-based acceleration for the BFV scheme, our design is not tied to a specific FHE parameter set. We evaluated HEAX on a wide range of FHE parameters, demonstrating more than two orders of magnitude performance improvement. We hope that HEAX paves the way for large-scale deployment of privacy-preserving computation in clouds.
