Public key cryptography protocols, such as RSA and elliptic curve cryptography, will be rendered insecure by Shor's algorithm when large-scale quantum computers are built. Cryptographers are working on quantum-resistant algorithms, and lattice-based cryptography has emerged as a prime candidate. However, the high computational complexity of these algorithms makes it challenging to implement lattice-based protocols on low-power embedded devices. To address this challenge, we present Sapphire, a lattice cryptography processor with configurable parameters. Efficient sampling with a SHA-3-based PRNG provides two orders of magnitude energy savings; a single-port RAM-based number theoretic transform memory architecture provides 124k-gate area savings; and a low-power modular arithmetic unit accelerates polynomial computations. Our test chip was fabricated in the TSMC 40nm low-power CMOS process, with the Sapphire cryptographic core occupying 0.28 mm$^2$ and consisting of 106k logic gates and 40.25 KB SRAM. Sapphire can be programmed with custom instructions for polynomial arithmetic and sampling, and it is coupled with a low-power RISC-V micro-processor to demonstrate NIST Round 2 lattice-based CCA-secure key encapsulation and signature protocols Frodo, NewHope, qTESLA, CRYSTALS-Kyber and CRYSTALS-Dilithium, achieving up to an order of magnitude improvement in performance and energy-efficiency compared to state-of-the-art hardware implementations. All key building blocks of Sapphire are constant-time and secure against timing and simple power analysis side-channel attacks. We also discuss how masking-based DPA countermeasures can be implemented on the Sapphire core without any changes to the hardware.
Introduction
Modern public key cryptography relies on hard mathematical problems such as integer factorization, discrete logarithms over finite fields and discrete logarithms over elliptic curve groups. However, these problems can be solved by a large-scale quantum computer in polynomial time using Shor's algorithm [1] , thus making today's public key protocols like RSA and ECC vulnerable to quantum attacks. Given the rapid advancement in quantum computing technology over the past few years, cryptographers are developing quantum-secure public key algorithms to protect today's data from tomorrow's threats. Lattice-based cryptography is being considered one of the most promising candidates for post-quantum cryptographic protocols because of its extensive security analysis as well as small public key and signature sizes.
The National Institute of Standards and Technology (NIST) formally initiated the process of standardizing post-quantum cryptography in 2016 [2]. The first round of candidates was announced in late 2017, with lattice-based cryptography accounting for 48% of the public-key encryption and key encapsulation (PKE/KEM) schemes and 25% of the signature schemes. In early 2019, the candidates moving on to the second round were announced [3], and lattice-based cryptography accounts for 53% (9 out of 17) and 33% (3 out of 9) of the candidates for PKE/KEM and signature schemes respectively. The theoretical foundation of several of these lattice-based protocols lies in the learning with errors (LWE) problem [4] and its variants such as Ring-LWE [5] and Module-LWE [6], and the hardness of LWE has been well-studied in the presence of both classical and quantum adversaries [7, 8]. This has been accompanied by several software and hardware implementations [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] of LWE and Ring-LWE-based public key encryption and key encapsulation protocols, each supporting specific lattice parameters chosen for increased performance and efficiency. Existing lattice-based cryptography implementations, both in software and hardware, have been thoroughly surveyed in [21]. Most of the hardware implementations focus on FPGA demonstration in order to support reconfigurability of lattice parameters, which is especially important for a fast-evolving field like lattice-based cryptography, while existing ASIC implementations either lack configurability or have power and area overheads. Some of the key challenges of implementing lattice-based cryptography in ASICs have been discussed in [22], and this work presents a solution using a combination of architectural and algorithmic techniques.
arXiv:1910.07557v2 [cs.CR] 25 Oct 2019
Our contributions: In this work, we present Sapphire, a configurable lattice cryptography processor which combines low-power modular arithmetic, area-efficient memory architecture and fast sampling techniques to achieve high energy-efficiency and low cycle count, ideal for securing low-power embedded systems. The key technical aspects of our work are as follows:
1. A low-power modular arithmetic core, with configurable prime modulus, is used to accelerate polynomial arithmetic operations; a pseudo-configurable modular multiplier is also implemented, which provides up to 3× improvement in energy-efficiency.
2. A single-port SRAM-based number theoretic transform (NTT) memory architecture provides 124k-gate area savings without any loss in performance or energy-efficiency.
3. An efficient Keccak core is combined with fast sampling techniques to speed up polynomial sampling, while supporting a wide variety of discrete distribution parameters suitable for lattice-based schemes.
4. These efficient hardware building blocks are integrated together with an instruction memory and decoder to build our crypto-processor, which can be programmed with custom instructions for polynomial sampling and arithmetic.
5. The Sapphire crypto-processor is coupled with an efficient RISC-V micro-processor to demonstrate several NIST Round 2 lattice-based key encapsulation and signature protocols such as Frodo [23], NewHope [24], qTESLA [25], CRYSTALS-Kyber [26] and CRYSTALS-Dilithium [27], achieving more than an order of magnitude improvement in performance and energy-efficiency compared to state-of-the-art assembly-optimized software and hardware implementations.
6. All the key building blocks, such as NTT, polynomial arithmetic and binomial sampling, are constant-time and secure against timing and simple power analysis attacks. While our baseline protocol implementations are not secure against differential power analysis attacks, we discuss how the programmability of our crypto-processor can be utilized to implement masking-based countermeasures.
7. Our ASIC implementation was fabricated in the TSMC 40nm low-power CMOS process, and all protocol-level demonstrations and side-channel measurements have been conducted on our test chip.
The rest of the paper is organized as follows: Section 2 provides a brief mathematical background on LWE and associated computations; in Section 3, we present our implementation of energy-efficient modular arithmetic along with an area-efficient NTT memory architecture; in Section 4, we describe our discrete distribution sampler accelerated by a low-power SHA-3 core; Section 5 describes the overall chip architecture; Section 6 presents detailed measurement results obtained from evaluating lattice-based protocols on our test chip, a comparison with state-of-the-art software and hardware implementations, and side-channel analysis; a summary of our key conclusions along with future research directions is given in Section 7. This version is the same as our CHES 2019 paper except for fixed typos and the addition of Frodo-1344 implementation results.
Background
In this section, we provide a brief introduction to LWE, Ring-LWE and Module-LWE along with the associated computations. We use bold lower-case symbols to denote vectors and bold upper-case symbols to denote matrices. The symbol lg is used to denote all logarithms with base 2. The set of all integers is denoted as $\mathbb{Z}$ and the quotient ring of integers modulo q is denoted as $\mathbb{Z}_q$. For two n-dimensional vectors a and b, their inner product is written as $\langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=0}^{n-1} a_i \cdot b_i$. The concatenation of two vectors a and b is written as a || b.
LWE and Related Lattice Problems
The Learning with Errors (LWE) problem [4] acts as the foundation for several modern lattice-based cryptography schemes. The LWE problem states that given a polynomial number of samples of the form $(\mathbf{a}, \langle \mathbf{a}, \mathbf{s} \rangle + e)$, it is difficult to determine the secret vector $\mathbf{s} \in \mathbb{Z}_q^n$, where the vector $\mathbf{a} \in \mathbb{Z}_q^n$ is sampled uniformly at random and the error e is sampled from the appropriate error distribution χ. Examples of secure LWE parameters are $(n, q) = (640, 2^{15})$, $(n, q) = (976, 2^{16})$ and $(n, q) = (1344, 2^{16})$ for Frodo [23].
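As a rough illustration of the sample structure $(\mathbf{a}, \langle \mathbf{a}, \mathbf{s} \rangle + e)$, the following Python sketch draws one toy LWE sample. The function names, the toy dimension and the small centred error are illustrative assumptions; Python's `random` module is not cryptographically secure and is used here only to keep the example short.

```python
import random

def inner_product(a, b, q):
    """<a, b> = sum_i a_i * b_i, reduced modulo q."""
    return sum(x * y for x, y in zip(a, b)) % q

def small_error(rng, eta=2):
    """Toy zero-mean error in [-eta, eta] (an assumed stand-in for chi)."""
    plus = sum(rng.getrandbits(1) for _ in range(eta))
    minus = sum(rng.getrandbits(1) for _ in range(eta))
    return plus - minus

def lwe_sample(s, q, rng):
    """Return (a, b, e) with b = <a, s> + e mod q; e is exposed only for checking."""
    a = [rng.randrange(q) for _ in s]
    e = small_error(rng)
    return a, (inner_product(a, s, q) + e) % q, e

rng = random.Random(1)            # fixed seed for reproducibility, NOT secure
n, q = 16, 2**15                  # toy dimension; Frodo-640 uses n = 640, q = 2^15
s = [rng.randrange(q) for _ in range(n)]
a, b, e = lwe_sample(s, q, rng)
```

Without knowledge of e, recovering s from many such pairs (a, b) is the LWE problem.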
LWE-based cryptosystems involve large matrix operations which are computationally expensive and also result in large key sizes. To solve this problem, the Ring-LWE problem [5] was proposed, which uses ideal lattices. Let $R_q = \mathbb{Z}_q[x]/(x^n + 1)$ be the ring of polynomials where n is a power of 2. The Ring-LWE problem states that given samples of the form $(a, a \cdot s + e)$, it is difficult to determine the secret polynomial $s \in R_q$, where the polynomial $a \in R_q$ is sampled uniformly at random and the coefficients of the error polynomial e are small samples from the error distribution χ. Examples of secure Ring-LWE parameters are $(n, q) = (512, 12289)$ and $(n, q) = (1024, 12289)$ for NewHope [24].
Module-LWE [6] provides a middle ground between LWE and Ring-LWE. By using module lattices, it reduces the algebraic structure present in Ring-LWE and increases security while not compromising too much on the computational efficiency. The Module-LWE problem states that given samples of the form $(\mathbf{a}, \mathbf{a}^T \mathbf{s} + e)$, it is difficult to determine the secret vector $\mathbf{s} \in R_q^k$, where the vector $\mathbf{a} \in R_q^k$ is sampled uniformly at random and the coefficients of the error polynomial e are small samples from the error distribution χ. Examples of secure Module-LWE parameters are $(n, k, q) = (256, 2, 7681)$, $(n, k, q) = (256, 3, 7681)$ and $(n, k, q) = (256, 4, 7681)$ for CRYSTALS-Kyber [26].
Number Theoretic Transform
While the protocols based on standard lattices (LWE) involve matrix-vector operations modulo q, all the arithmetic is performed in the ring of polynomials $R_q = \mathbb{Z}_q[x]/(x^n + 1)$ when working with ideal and module lattices. There are several efficient algorithms for polynomial multiplication [28], and the Number Theoretic Transform (NTT) is one such technique widely used in lattice-based cryptography.
The NTT is a generalization of the well-known Fast Fourier Transform (FFT) where all the arithmetic is performed in a finite field instead of complex numbers. Instead of working with powers of the n-th complex root of unity $\exp(-2\pi j/n)$, the NTT uses the n-th primitive root of unity $\omega_n$ in the ring $\mathbb{Z}_q$, that is, $\omega_n$ is an element in $\mathbb{Z}_q$ such that $\omega_n^n = 1 \bmod q$ and $\omega_n^i \neq 1 \bmod q$ for $0 < i < n$. In order to have elements of order n, the modulus q is chosen to be a prime such that $q \equiv 1 \bmod n$. A polynomial $a(x) \in R_q$ with coefficients $a(x) = (a_0, a_1, \cdots, a_{n-1})$ has the NTT representation $\hat{a}(x) = (\hat{a}_0, \hat{a}_1, \cdots, \hat{a}_{n-1})$, where
$$\hat{a}_i = \sum_{j=0}^{n-1} a_j \, \omega_n^{ij} \bmod q \quad \forall\, i \in [0, n-1]$$
The inverse NTT (INTT) operation converts $\hat{a}(x) = (\hat{a}_0, \hat{a}_1, \cdots, \hat{a}_{n-1})$ back to $a(x)$ as
$$a_i = n^{-1} \sum_{j=0}^{n-1} \hat{a}_j \, \omega_n^{-ij} \bmod q \quad \forall\, i \in [0, n-1]$$
Note that the INTT operation is similar to the NTT, except that $\omega_n$ is replaced by $\omega_n^{-1} \bmod q$ and the final result is divided by n. An iterative in-place version of the NTT algorithm is provided in Algorithm 1 [29, 30].
Algorithm 1 Iterative In-Place NTT [29]
Require: Polynomial $a(x) \in R_q$ and n-th primitive root of unity $\omega_n \in \mathbb{Z}_q$
Ensure: Polynomial $\hat{a}(x) \in R_q$ such that $\hat{a}(x) = \mathrm{NTT}(a(x))$
1: $\hat{a}$ ← PolyBitRev(a)
2: for (s = 1; s ≤ lg n; s = s + 1) do
3:   m ← $2^s$
4:   $\omega_m$ ← $\omega_n^{n/m}$
5:   for (k = 0; k < n; k = k + m) do
6:     ω ← 1
7:     for (j = 0; j < m/2; j = j + 1) do
8:       t ← ω · $\hat{a}$[k + j + m/2] mod q
9:       u ← $\hat{a}$[k + j]
10:       $\hat{a}$[k + j] ← u + t mod q
11:       $\hat{a}$[k + j + m/2] ← u − t mod q
12:       ω ← ω · $\omega_m$ mod q
13:     end for
14:   end for
15: end for
16: return $\hat{a}$
The PolyBitRev function performs a permutation on the input polynomial a such that $\hat{a}[i] = \mathrm{PolyBitRev}(a)[i] = a[\mathrm{BitRev}(i)]$, where BitRev is formally defined as $\mathrm{BitRev}(i) = \sum_{j=0}^{\lg n - 1} (((i \gg j) \,\&\, 1) \ll (\lg n - 1 - j))$ (for non-negative integer i and power-of-two n), that is, bit-wise reversal of the binary representation of the index i. Since there are lg n stages in the NTT outer loop, with O(n) operations in each stage, its time complexity is O(n lg n). The factors ω are called the twiddle factors, similar to FFT.
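The iterative in-place transform can be sketched in Python as follows. This is a minimal software model for checking the data flow against the defining sum, not a description of the hardware; the toy parameters q = 17, n = 8 and $\omega_n$ = 9 are assumptions chosen so that the example runs instantly.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for j in range(bits):
        r |= ((i >> j) & 1) << (bits - 1 - j)
    return r

def ntt_iterative(a, omega_n, q):
    """Iterative NTT with bit-reversed input ordering (n must be a power of two)."""
    n = len(a)
    lg_n = n.bit_length() - 1
    a_hat = [a[bit_reverse(i, lg_n)] for i in range(n)]
    for s in range(1, lg_n + 1):
        m = 1 << s
        omega_m = pow(omega_n, n // m, q)      # m-th primitive root of unity
        for k in range(0, n, m):
            omega = 1
            for j in range(m // 2):
                t = omega * a_hat[k + j + m // 2] % q
                u = a_hat[k + j]
                a_hat[k + j] = (u + t) % q
                a_hat[k + j + m // 2] = (u - t) % q
                omega = omega * omega_m % q
    return a_hat

def ntt_naive(a, omega_n, q):
    """Direct O(n^2) evaluation of the defining sum, used as a reference."""
    n = len(a)
    return [sum(a[j] * pow(omega_n, i * j, q) for j in range(n)) % q
            for i in range(n)]

# Toy parameters (assumed): q = 17, n = 8, and 9 has order 8 modulo 17.
coeffs = [1, 2, 3, 4, 5, 6, 7, 8]
result = ntt_iterative(coeffs, 9, 17)
```

Comparing `result` with `ntt_naive(coeffs, 9, 17)` is a quick sanity check that the butterfly indexing is right.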
The NTT provides a fast multiplication algorithm in $R_q$ with time complexity O(n lg n) instead of O($n^2$) for schoolbook multiplication. Given two polynomials $a, b \in R_q$, their product $c = a \cdot b \in R_q$ can be computed as
$$c = \mathrm{INTT}(\,\mathrm{NTT}(a) \odot \mathrm{NTT}(b)\,)$$
where $\odot$ denotes coefficient-wise multiplication of the polynomials. Since the product of a and b, before reduction modulo $f(x) = x^n + 1$, has 2n coefficients, using the above equation directly to compute $a \cdot b$ will require padding both a and b with n zeros. To eliminate this overhead, the negative-wrapped convolution [31] is used, with the additional requirement $q \equiv 1 \bmod 2n$ so that both the n-th and 2n-th primitive roots of unity modulo q exist, respectively denoted as $\omega_n$ and $\psi = \sqrt{\omega_n} \bmod q$. By multiplying a and b coefficient-wise by powers of ψ before the NTT computation, and by multiplying $\mathrm{INTT}(\mathrm{NTT}(a) \odot \mathrm{NTT}(b))$ coefficient-wise by powers of $\psi^{-1} \bmod q$, no zero padding is required and the n-point NTT can be used directly. Similar to FFT, the NTT inner loop involves butterfly computations. There are two types of butterfly operations: Cooley-Tukey (CT) and Gentleman-Sande (GS) [32]. The CT butterfly-based NTT requires inputs in normal order and generates outputs in bit-reversed order, similar to the decimation-in-time FFT. The GS butterfly-based NTT requires inputs to be in bit-reversed order while the outputs are generated in normal order, similar to the decimation-in-frequency FFT. Using the same butterfly for both NTT and INTT requires a bit-reversal permutation. However, the bit-reversal can be avoided by using CT for NTT and GS for INTT, as proposed in [32].
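The negative-wrapped convolution can be checked numerically with a short Python sketch. To stay compact it uses a direct O($n^2$) transform rather than a fast NTT, and the toy parameters (q = 17, n = 8, ψ = 3) are assumptions; `pow(x, -1, q)` requires Python 3.8 or later.

```python
def ntt(a, w, q):
    """Direct O(n^2) transform with root w (stands in for a fast NTT)."""
    n = len(a)
    return [sum(a[j] * pow(w, i * j, q) for j in range(n)) % q for i in range(n)]

def negacyclic_mul(a, b, psi, q):
    """Multiply a and b in Z_q[x]/(x^n + 1) without zero padding."""
    n = len(a)
    w = psi * psi % q                                   # omega_n = psi^2
    psi_inv, w_inv, n_inv = pow(psi, -1, q), pow(w, -1, q), pow(n, -1, q)
    a_s = [a[i] * pow(psi, i, q) % q for i in range(n)]  # pre-scale by psi^i
    b_s = [b[i] * pow(psi, i, q) % q for i in range(n)]
    c_hat = [x * y % q for x, y in zip(ntt(a_s, w, q), ntt(b_s, w, q))]
    c_s = ntt(c_hat, w_inv, q)                           # inverse transform, unscaled
    # post-scale by n^-1 * psi^-i
    return [c_s[i] * n_inv * pow(psi_inv, i, q) % q for i in range(n)]

def schoolbook_negacyclic(a, b, q):
    """Reference O(n^2) product with the wrap-around sign flip x^n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c

# Toy parameters (assumed): q = 17 satisfies q = 1 mod 2n for n = 8; psi = 3 has order 16.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
c = negacyclic_mul(a, b, 3, 17)
```

The result agrees with the schoolbook product reduced modulo $x^n + 1$, which is what the pre- and post-scaling by powers of ψ is meant to guarantee.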
Sampling
In lattice-based protocols, the public vectors a are generated from the uniform distribution over $\mathbb{Z}_q$ through rejection sampling. The secret vectors s and error terms e are sampled from the distribution χ, typically with zero mean and appropriate standard deviation σ. Accurate sampling of s and e is critical to the security of these protocols, and the sampling must be constant-time to prevent side-channel leakage of the secret information. Although the original LWE proof used discrete Gaussian distributions for sampling the error terms, several lattice-based schemes use binomial, uniform and ternary distributions for efficiency. A detailed survey of different sampling techniques is available in [21].
Modular Arithmetic and NTT
The core arithmetic and logic unit (ALU) of Sapphire consists of a 24-bit data-path, with modular operations in $\mathbb{F}_q$ for configurable q. In this section, we describe the details of our energy-efficient modular arithmetic implementation, the ALU design and our area-efficient NTT memory architecture.
Modular Arithmetic Implementation
The modular arithmetic core consists of a 24-bit adder, a 24-bit subtractor and a 24-bit multiplier along with associated modular reduction logic. Our modular adder and subtractor designs are shown in Fig. 1, and the corresponding pseudo-codes are shown in Algorithms 2 and 3. Both designs use an adder-subtractor pair, with the sum, carry bit, difference and borrow bit denoted as s, c, d and b respectively. Modular reduction is performed using conditional subtraction and addition, which are computed in the same cycle to avoid timing side-channels. The synthesized adder and subtractor each occupy around 550 GE (gate equivalents).
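The carry/borrow-based selection described above can be modelled in a few lines of Python. This is a sketch under assumptions: the selection conditions are inferred from the description (both candidate results are always computed and one is selected) and are not a transcription of Algorithms 2 and 3; the names `WIDTH`, `mod_add` and `mod_sub` are illustrative. A Python conditional is also not constant-time, unlike the single-cycle hardware.

```python
WIDTH = 24                      # data-path width in bits
MASK = (1 << WIDTH) - 1

def mod_add(x, y, q):
    """x + y mod q for x, y in [0, q), q < 2^24."""
    full = x + y
    c, s = full >> WIDTH, full & MASK        # carry bit and 24-bit sum
    diff = s - q
    b, d = int(diff < 0), diff & MASK        # borrow bit and 24-bit difference
    # Overflow (c = 1) or s >= q (b = 0) means the subtracted value is correct.
    return d if (c == 1 or b == 0) else s

def mod_sub(x, y, q):
    """x - y mod q for x, y in [0, q), q < 2^24."""
    diff = x - y
    b, d = int(diff < 0), diff & MASK        # borrow bit and 24-bit difference
    s = (d + q) & MASK                       # candidate with q added back
    return s if b == 1 else d

example_sum = mod_add(7000, 5000, 7681)      # 12000 - 7681
example_diff = mod_sub(100, 200, 7681)       # -100 + 7681
```

The carry bit matters for moduli close to $2^{24}$, where x + y can overflow the 24-bit sum even though the reduced result fits.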
For modular multiplication, we use a 24-bit multiplier followed by Barrett reduction [33] modulo a prime q of size up to 24 bits. Barrett reduction does not exploit any special property of the modulus q, thus making it ideal for supporting configurable moduli. Let z be the 48-bit product to be reduced to $\mathbb{Z}_q$, then Barrett reduction computes z mod q by estimating the quotient $\lfloor z/q \rfloor$ without performing any division, as shown in Algorithm 4. Barrett reduction involves two multiplications, one subtraction, one bit-shift and one conditional subtraction. The value of 1/q is approximated as $m/2^k$, with the error of approximation being $e = 1/q - m/2^k$, therefore the reduction is valid as long as $ze < 1$. Since $z < q^2$, k is set to be the smallest number such that $e = 1/q - (\lfloor 2^k/q \rfloor / 2^k) < 1/q^2$. Typically, k is very close to $2 \lceil \lg q \rceil$, that is, the bit-size of $q^2$.
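The parameter choice and the reduction itself can be sketched in Python as follows. The function names are illustrative assumptions; the condition $e < 1/q^2$ is rewritten in exact integer form as $q \cdot (2^k - mq) < 2^k$ to avoid floating-point error.

```python
def barrett_params(q):
    """Smallest k (with m = floor(2^k / q)) such that 1/q - m/2^k < 1/q^2."""
    k = 1
    while True:
        m = (1 << k) // q
        # e = (2^k - m*q) / (q * 2^k) < 1/q^2  <=>  q * (2^k - m*q) < 2^k
        if q * ((1 << k) - m * q) < (1 << k):
            return m, k
        k += 1

def barrett_mul(x, y, q, m, k):
    """x * y mod q using two multiplications, a shift and one conditional subtraction."""
    z = x * y                 # up to 48-bit product
    t = (z * m) >> k          # quotient estimate, at most one too small
    z = z - t * q             # now 0 <= z < 2q
    if z >= q:
        z -= q
    return z

m_kyber, k_kyber = barrett_params(7681)
product = barrett_mul(7680, 7680, 7681, m_kyber, k_kyber)   # (-1) * (-1) mod q
```

For q = 7681 this search returns k = 21 and m = 273, and for q = 8380417 it returns k = 46 and m = 8396807.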
Algorithm 2 Modular Addition
In order to understand the trade-offs between flexibility and efficiency in modular multiplication, we have implemented two different architectures of Barrett reduction logic: (1) with fully configurable modulus (q can be an arbitrary prime) and (2) with pseudo-configurable modulus (q belongs to a specific set of primes), as shown in Fig. 2 .
Apart from the prime q (which can be up to 24 bits), the fully configurable version requires two additional inputs m and k such that $m = \lfloor 2^k/q \rfloor$ (m and k are allowed to be up to 24 bits and 6 bits respectively). It consists of a total of 3 multipliers, as shown in Fig. 2a, the first two being used to compute $z = x \cdot y$ and $z \cdot m$ respectively. For obtaining $t = (z \cdot m) \gg k$, the bit-wise shift is implemented purely using combinational logic (multiplexers) because shifting bits sequentially in registers can be extremely inefficient in terms of power consumption. We assume that $16 \le k \le 48$ since q is not larger than 24 bits, q is typically not smaller than 8 bits and we know that $k \approx 2 \lceil \lg q \rceil$. The third multiplier is used to compute $t \cdot q$, and a pair of subtractors is used to calculate $z - (t \cdot q)$ and perform the final reduction step. All the steps are computed in a single cycle to avoid any potential timing side-channels. The design was synthesized at 100 MHz (with near-zero slack) and occupies around 11k GE area, which includes the area (around 4k GE) of the 24-bit multiplier used to compute $z = x \cdot y$.
Algorithm 4 Modular Multiplication with Barrett Reduction [33]
Require: $x, y \in \mathbb{Z}_q$, m and k such that $m = \lfloor 2^k/q \rfloor$
Ensure: $z = x \cdot y \bmod q$
1: z ← x · y
2: t ← (z · m) ≫ k
3: z ← z − (t · q)
4: if z ≥ q then
5:   z ← z − q
6: end if
7: return z
The pseudo-configurable modular multiplier implements Barrett reduction logic for the following primes used by NIST Round 1 lattice-based candidates: 7681 (CRYSTALS-Kyber) [26], 12289 (NewHope) [24], 40961 (R.EMBLEM) [34], 65537 (pqNTRUSign) [35], 120833 (Ding Key Exchange) [36], 133121 / 184321 (LIMA) [37], 8380417 (CRYSTALS-Dilithium) [27], 8058881 (qTESLA v1.0) and 4205569 / 4206593 / 8404993 (qTESLA v2.0) [25]. As shown in Fig. 2b, there is a dedicated reduction block for each of these primes, and the q_SEL input is used to select the output of the appropriate block while the inputs to the other blocks are data-gated to save power. Since the reduction blocks have the parameters m, k and q coded in digital logic and do not require explicit multipliers, they involve less computation than the fully configurable reduction circuit from Fig. 2a, albeit at the cost of some additional area and decrease in flexibility. The reduction becomes particularly efficient when at least one of m and q, or both, can be written in the form $2^{l_1} \pm 2^{l_2} \pm \cdots \pm 1$, where $l_1, l_2, \cdots$ are not more than four positive integers. For example, we consider the CRYSTALS primes: for $q = 7681 = 2^{13} - 2^9 + 1$ we have k = 21 and $m = 273 = 2^8 + 2^4 + 1$, and for $q = 8380417 = 2^{23} - 2^{13} + 1$ we have k = 46 and $m = 8396807 = 2^{23} + 2^{13} + 2^3 - 1$. Therefore, the multiplications by q and m can be converted to significantly cheaper bit-shifts and additions / subtractions, as shown in Algorithms 5 and 6. Implementation details and reduction parameters for each customized modular reduction block are provided in Appendix A. This design also performs modular multiplication in a single cycle. It was synthesized at 100 MHz (with near-zero slack) and occupies around 19k GE area, including the area of the 24-bit multiplier.
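To illustrate how the multiplications by m and q collapse into shifts and additions, the following Python sketch reduces a product modulo 7681 using only the decompositions quoted above. It is an assumed software rendering of the idea, not a transcription of the paper's dedicated reduction block; the function names are illustrative.

```python
Q = 7681        # 2^13 - 2^9 + 1
K = 21          # Barrett shift for q = 7681; m = 273 = 2^8 + 2^4 + 1

def times_m(z):
    """z * 273 using two shifts and two additions."""
    return (z << 8) + (z << 4) + z

def times_q(t):
    """t * 7681 using two shifts, one subtraction and one addition."""
    return (t << 13) - (t << 9) + t

def reduce_7681(z):
    """Reduce z in [0, Q^2) modulo Q without any general multiplier."""
    t = times_m(z) >> K          # quotient estimate
    r = z - times_q(t)           # 0 <= r < 2Q
    return r - Q if r >= Q else r

reduced = reduce_7681(7680 * 7680)   # (-1)^2 mod Q
```

The same pattern applies to 8380417 with k = 46 and the four-term decomposition of m given in the text.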
In Fig. 3, we compare the simulated energy consumption of the fully configurable and pseudo-configurable modular multiplier architectures for all the primes mentioned earlier. As expected, the multiplication itself consumes the same energy in both cases, but the modular reduction energy is up to 6× lower for the pseudo-configurable design. The overall decrease in modular multiplication energy, considering both multiplication and reduction together, is up to 3×, clearly highlighting the benefit of the dedicated modular reduction data-paths when working with prime moduli. For reduction modulo $2^m$ (m < 24), e.g., in the case of Frodo, the output of the 24-bit multiplier is simply bit-wise AND-ed with $2^m - 1$, implying that the modular reduction energy is negligible.
Algorithm 5 Reduction mod 7681
Butterfly Unit and ALU
Next, we elaborate how the modular arithmetic units described earlier are integrated together to build the butterfly module. As discussed in Section 2, NTT computations involve butterfly operations similar to the Fast Fourier Transform, with the only difference being that all arithmetic is performed modulo q instead of complex numbers. There are two butterfly configurations -Cooley-Tukey (or DIT) and Gentleman-Sande (or DIF). In terms of arithmetic, the DIT butterfly computes (a + ωb mod q, a − ωb mod q) and the DIF butterfly computes (a + b mod q, (a − b)ω mod q), where a and b are the inputs to the butterfly and ω is the twiddle factor. The DIT butterfly requires inputs to be in bit-reversed order and the DIF butterfly generates outputs in bit-reversed order, thus making DIF and DIT suitable for NTT and INTT respectively. While software implementations have the flexibility to program both configurations, hardware designs typically implement either DIT or DIF, thus requiring bit-reversals. To solve this problem, we have implemented a unified butterfly architecture [38] which can be configured as both DIT and DIF, as shown in Fig. 4 . It consists of two sets of modular adders and subtractors along with some multiplexing circuitry to select whether the multiplication with ω is performed before or after the addition and subtraction. Since the critical path of the design is inside the modular multiplier, there is no impact on system performance. The associated area overhead is also negligible.
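A behavioural model of the configurable butterfly is shown below as a Python sketch. The `mode` argument plays the role of the multiplexer select in Fig. 4; the function name and the string encoding of the mode are assumptions made for illustration.

```python
def butterfly(a, b, w, q, mode):
    """Unified butterfly: multiply by the twiddle factor before (DIT) or after (DIF)."""
    if mode == "DIT":                       # Cooley-Tukey
        t = w * b % q
        return (a + t) % q, (a - t) % q
    if mode == "DIF":                       # Gentleman-Sande
        return (a + b) % q, (a - b) * w % q
    raise ValueError("mode must be 'DIT' or 'DIF'")

q = 7681
w = 3383                                    # arbitrary nonzero value standing in for a twiddle factor
w_inv = pow(w, -1, q)                       # Python 3.8+
u, v = butterfly(10, 20, w, q, "DIT")
# A DIF butterfly with the inverse twiddle undoes the DIT one up to a factor of 2.
a2, b2 = butterfly(u, v, w_inv, q, "DIF")
```

Here `a2` and `b2` equal 2·10 and 2·20 modulo q, which is why the DIF configuration pairs naturally with the inverse transform and its final division by n.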
The modular arithmetic blocks inside the butterfly are re-used for coefficient-wise polynomial arithmetic operations as well as for multiplying polynomials with the appropriate powers of ψ and $\psi^{-1}$ during negative-wrapped convolution. Apart from butterfly and arithmetic modulo q, the Sapphire ALU also supports the following bit-wise operations: AND, OR, XOR, left shift and right shift.
NTT Memory Architecture
Hardware architectures for polynomial multiplication using NTT consist of memory banks for storing the polynomials along with the ALU which performs butterfly computations. Since each butterfly needs to read two inputs and write two outputs all in the same cycle, these memory banks are typically implemented using dual-port RAMs [9, 39, 30, 19] or four-port RAMs [17] . Although true dual-port memory is easily available in state-of-the-art commercial FPGAs in the form of block RAMs (BRAMs), use of dual-port SRAMs in ASIC can pose large area overheads in resource-constrained devices. Compared to a simple single-port SRAM, a dual-port SRAM has double the number of row and column decoders, write drivers and read sense amplifiers. Also, the bit-cells in a low-power dual-port SRAM consist of ten transistors (10T) compared to the usual six transistor (6T) bit-cells in a single-port SRAM [40] . Therefore, the area of a dual-port SRAM can be as much as double the area of a single-port SRAM with the same number of bits and column muxing. To reduce this area overhead, we implement an area-efficient NTT memory architecture [38] which uses the constant-geometry FFT data-flow [41] and consists of single-port SRAMs only.
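Algorithm 7 Constant Geometry Out-of-Place NTT [42]
Require: Polynomial $a(x) \in R_q$ and n-th primitive root of unity $\omega_n \in \mathbb{Z}_q$
Ensure: Polynomial $\hat{a}(x) \in R_q$ such that $\hat{a}(x) = \mathrm{NTT}(a(x))$
1: a ← PolyBitRev(a)
2: for (s = 1; s ≤ lg n; s = s + 1) do
3:   for (j = 0; j < n/2; j = j + 1) do
4:     k ← $\lfloor j / 2^{\lg n - s} \rfloor \cdot 2^{\lg n - s}$
5:     $\hat{a}$[j] ← a[2j] + a[2j + 1] · $\omega_n^k$ mod q
6:     $\hat{a}$[j + n/2] ← a[2j] − a[2j + 1] · $\omega_n^k$ mod q
7:   end for
8:   if s ≠ lg n then
9:     a ← $\hat{a}$
10:   end if
11: end for
12: return $\hat{a}$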
The constant geometry NTT is described in Algorithm 7 [42, 39]. Clearly, the coefficients of the polynomial are accessed in the same order for each stage, thus simplifying the read/write control circuitry. For constant geometry DIT NTT, the butterfly inputs are a[2j] and a[2j + 1] and the outputs are $\hat{a}$[j] and $\hat{a}$[j + n/2], while the inputs are a[j] and a[j + n/2] and the outputs are $\hat{a}$[2j] and $\hat{a}$[2j + 1] for DIF NTT. However, the constant geometry NTT is inherently out-of-place, therefore requiring storage for both polynomials a and $\hat{a}$. For our hardware implementation, we create two memory banks, left and right, to store these two polynomials while allowing the butterfly inputs and outputs to ping-pong between them during each stage of the transform. Although out-of-place NTT requires storage for both the input and output polynomials, this does not affect the total memory requirements of the crypto-processor because the total number of polynomials required to be stored during the protocol execution is greater than two, e.g., four polynomials are involved in any computation of the form $b = a \cdot s + e$.
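The constant-geometry DIT data flow can be modelled in Python as below. This is a sketch under assumptions: the twiddle index $k = \lfloor j / 2^{\lg n - s} \rfloor \cdot 2^{\lg n - s}$ is the standard choice for this access pattern and is checked here against the direct transform, and the two lists `src` and `dst` stand in for the left and right memory banks. The toy parameters q = 17, n = 8, $\omega_n$ = 9 are also assumed.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for j in range(bits):
        r |= ((i >> j) & 1) << (bits - 1 - j)
    return r

def ntt_constant_geometry(a, omega_n, q):
    """Out-of-place DIT NTT reading (2j, 2j+1) and writing (j, j+n/2) in every stage."""
    n = len(a)
    lg_n = n.bit_length() - 1
    src = [a[bit_reverse(i, lg_n)] for i in range(n)]
    for s in range(1, lg_n + 1):
        span = 1 << (lg_n - s)
        dst = [0] * n                          # the "other" memory bank
        for j in range(n // 2):
            k = (j // span) * span             # twiddle exponent for this butterfly
            t = src[2 * j + 1] * pow(omega_n, k, q) % q
            dst[j] = (src[2 * j] + t) % q
            dst[j + n // 2] = (src[2 * j] - t) % q
        src = dst                              # banks swap roles (ping-pong)
    return src

def ntt_naive(a, omega_n, q):
    """Direct O(n^2) reference transform."""
    n = len(a)
    return [sum(a[j] * pow(omega_n, i * j, q) for j in range(n)) % q
            for i in range(n)]

coeffs = [1, 2, 3, 4, 5, 6, 7, 8]
result = ntt_constant_geometry(coeffs, 9, 17)
```

The read and write index pattern is identical in every stage, which is the property the memory architecture exploits.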
Next, we describe how these memory banks are constructed using single-port SRAMs so that each butterfly can be computed in a single cycle without causing read/write hazards. As shown in Fig. 5a , each polynomial is split among four single port SRAMs Mem 0-3 on the basis of the least and most significant bits (LSB and MSB) of the coefficient index (or address addr). This allows simultaneously accessing coefficient index pairs of the form (2j, 2j + 1) and (j, j + n/2). Our NTT memory architecture is shown in Fig. 5b , which consists of two such memory banks labelled as LWE Poly Cache. In every cycle, the butterfly inputs are read from two different single-port SRAMs (out of four SRAMs in the input memory bank) and the outputs are also written to two different single-port SRAMs (out of four SRAMs in the output memory bank), thus avoiding hazards. The data flow in the first two cycles of NTT is shown in Fig. 6 , where the input polynomial a is stored in the left bank and the output polynomialâ is stored in the right bank. As the input and output polynomials exchange their memory banks from one stage to the next, our NTT control circuitry ensures that the same data-flow is maintained. To illustrate this, the memory access patterns for all three stages of an 8-point NTT are shown in Fig. 7 for both decimation-in-time and decimation-in-frequency.
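The hazard-free property of this split can be verified with a small Python sketch. The bank numbering used here (MSB as the high bit of the bank index, LSB as the low bit) is an assumption for illustration; the text only states that the split is based on the LSB and MSB of the coefficient index.

```python
def bank_of(addr, n):
    """Map a coefficient index to one of four single-port SRAMs (assumed encoding)."""
    lg_n = n.bit_length() - 1
    msb = (addr >> (lg_n - 1)) & 1
    lsb = addr & 1
    return (msb << 1) | lsb

def conflict_free(n):
    """Check that both access patterns always touch two different SRAMs."""
    for j in range(n // 2):
        if bank_of(2 * j, n) == bank_of(2 * j + 1, n):      # pair (2j, 2j+1)
            return False
        if bank_of(j, n) == bank_of(j + n // 2, n):         # pair (j, j+n/2)
            return False
    return True

ok_small = conflict_free(8)
ok_large = conflict_free(1024)
```

Indices 2j and 2j + 1 differ in the LSB and indices j and j + n/2 differ in the MSB, so each pair always lands in two distinct SRAMs and can be accessed in the same cycle.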
The two memory banks consist of four 1024 × 24-bit single-port SRAMs each (24 KB total). Together they store 8192 entries, which can be split into four 2048-dimension polynomials or eight 1024-dimension polynomials or sixteen 512-dimension polynomials or thirty-two 256-dimension polynomials or sixty-four 128-dimension polynomials or one-hundred-twenty-eight 64-dimension polynomials. By constructing this memory using single-port SRAMs (and some additional read-data multiplexing circuitry), we have achieved area savings equivalent to 124k GE compared to a dual-port SRAM-based implementation. This is particularly important since SRAMs account for a large portion of the total hardware area in ASIC implementations of lattice-based cryptography [17, 43]. In order to allow configurable parameters, our NTT hardware also requires additional storage (labelled as NTT Constants RAM in Fig. 5) for the pre-computed twiddle factors:
$$\omega_n^{i},\ \omega_n^{-i},\ \psi^{i},\ n^{-1}\psi^{-i} \bmod q \quad \text{for } i \in [0, n)$$
Since $n \le 2048$ and $q < 2^{24}$, this would require another 24 KB of memory. To reduce this overhead, we exploit the following properties of ω and ψ: $\omega_{n/2} = \omega_n^2$, $\omega_n^{-j} = \omega_n^{n-j}$ and $\omega_n = \psi^2$ [30]. Then, it is sufficient to store only $\omega_n^j$ for $j \in [0, n/2)$ and $\psi^i$, $n^{-1}\psi^{-i} \bmod q$ for $i \in [0, n)$, thus reducing the twiddle factor memory size by 37.5% down to 15 KB.
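The storage figures can be reproduced with a few lines of Python. The entry counts are assumptions inferred from the stated sizes: 4n entries before the optimization (four tables of n values each) and n/2 + 2n entries afterwards, each 24 bits wide.

```python
def twiddle_bytes(n, word_bits=24, reduced=False):
    """Bytes needed for the pre-computed constants of an n-point NTT (assumed table layout)."""
    if reduced:
        entries = n // 2 + 2 * n      # omega^j for j < n/2, plus psi^i and n^-1 * psi^-i
    else:
        entries = 4 * n               # omega^i, omega^-i, psi^i, n^-1 * psi^-i
    return entries * word_bits // 8

full_kb = twiddle_bytes(2048) / 1024                   # 24.0
reduced_kb = twiddle_bytes(2048, reduced=True) / 1024  # 15.0
saving = 1 - reduced_kb / full_kb                      # 0.375
```

For n = 2048 this gives 24 KB before and 15 KB after the reduction, matching the 37.5% saving quoted above.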
Finally, we compare the energy-efficiency and performance of our NTT with state-of-the-art software and ASIC hardware implementations in Table 1. For the software implementation, we have used assembly-optimized code for ARM Cortex-M4 from the PQM4 crypto library [44], and measurements were performed using the NUCLEO-F411RE development board [45]. The total cycle count of our NTT is $(\frac{n}{2} + 1) \lg n + (n + 1)$, including the multiplication of polynomial coefficients with powers of ψ. All measurements for our NTT implementation were performed on our test chip operating at clock frequency 72 MHz and nominal supply voltage 1.1 V. Our hardware-accelerated NTT is up to 11× more energy-efficient than the software implementation, after accounting for voltage scaling. It is 2.5× more energy-efficient compared to the fast NTT design from [14] with similar cycle count, and 1.5× more energy-efficient compared to the slow NTT design from [14] with 4× cycle count. Our NTT is almost twice as fast as [43], since our memory architecture allows computing one butterfly per cycle even with single-port SRAMs, while having similar energy consumption. The energy-efficiency of our NTT implementation is largely due to the careful design of low-power modular arithmetic, as discussed earlier, which decreases overall modular reduction complexity and simplifies the logic circuitry. However, our NTT is still about 4× less energy-efficient compared to [17], primarily due to the fact that [17] uses 16 parallel butterfly units along with dedicated four-port scratch-pad buffers to achieve higher parallelism and lower energy consumption at the cost of significantly larger chip area (2.05 mm$^2$) compared to our design (0.28 mm$^2$). As will be discussed in Section 6, sampling accounts for the majority of the computational cost in Ring-LWE and Module-LWE schemes, therefore justifying our choice of area-efficient NTT architecture at the cost of some energy overhead.
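The cycle-count expression is easy to evaluate for the polynomial sizes of interest. The Python sketch below simply computes $(\frac{n}{2} + 1) \lg n + (n + 1)$ and converts it to latency at the 72 MHz clock quoted above; the function names are illustrative.

```python
def ntt_cycles(n):
    """Cycle count (n/2 + 1) * lg n + (n + 1) for an n-point NTT, n a power of two."""
    lg_n = n.bit_length() - 1
    return (n // 2 + 1) * lg_n + (n + 1)

def ntt_latency_us(n, f_hz=72e6):
    """Latency in microseconds at clock frequency f_hz."""
    return ntt_cycles(n) / f_hz * 1e6

cycles_256 = ntt_cycles(256)      # 129 * 8 + 257
cycles_1024 = ntt_cycles(1024)    # 513 * 10 + 1025
```

For example, n = 256 gives 1289 cycles and n = 1024 gives 6155 cycles.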
Discrete Distribution Sampler
Hardness of the LWE problem is directly related to statistical properties of the error samples. Therefore, an accurate and efficient sampler is a critical component of any lattice cryptography implementation. Sampling accounts for a major portion of the computational overhead in software implementations of ideal and module lattice-based protocols [46] . A cryptographically secure pseudo-random number generator (CS-PRNG) is used to generate uniformly random numbers, which are then post-processed to convert them into samples from different discrete probability distributions. In this section, we describe our design of energy-efficient CS-PRNG along with fast sampling techniques for configurable distribution parameters.
Energy-Efficient CS-PRNG
Some of the standard choices for CS-PRNG are SHA-3 in the SHAKE mode [47], AES in counter mode [48] and ChaCha20 [49]. In order to identify the most efficient among these, we have compared them in terms of area, pseudo-random bit generation performance and energy consumption, as shown in Table 2. Only place-and-route area and measured energy are considered for all analysis, and synthesis area is reported for reference. For fair comparison, all three primitives (SHA-3, AES and ChaCha20) were implemented as full data-path architectures. From Fig. 8, we observe that although all three primitives have comparable area-energy product, SHA-3 is 2× more energy-efficient than ChaCha20 and 3× more energy-efficient than AES; this is largely due to the fact that SHA-3 generates the highest number of pseudo-random bits per round. The basic building block of SHA-3 is the Keccak permutation function [50]. Therefore, our PRNG consists of a 24-cycle Keccak-f[1600] core [38] which can be configured in different SHA-3 modes and consumes 2.33 nJ per round at nominal voltage of 1.1 V (and 0.89 nJ per round at 0.68 V). Its 1600-bit state is processed in parallel, thus avoiding expensive register shifts and multiplexing required in serial architectures. Fig. 9 shows the overall architecture of our discrete distribution sampler with the energy-efficient SHA-3 core. Pseudo-random bits generated by SHAKE-128 or SHAKE-256 are stored in the 1600-bit Keccak state register, and shifted out 32 bits at a time as required by the sampler.
The sampler then feeds these bits, AND-ed with the appropriate bit mask to truncate them to the desired size, to the post-processing logic, which performs one of the following five types of operations: rejection sampling in [0, q), binomial sampling with standard deviation σ, discrete Gaussian sampling with standard deviation σ and desired precision up to 32 bits, uniform sampling in [−η, η] for η < q, and trinary sampling in {−1, 0, +1} with specified weights for the +1 and −1 samples.
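The word-wise PRNG interface described above can be modelled in software. The following is a minimal sketch, assuming Python's hashlib SHAKE implementation as a stand-in for the hardware Keccak core; the function name masked_words and the little-endian byte order are assumptions made for illustration, not details of the chip.

```python
import hashlib

def masked_words(seed: bytes, count: int, bits: int, xof=hashlib.shake_128):
    """Return `count` PRNG words, each truncated to `bits` bits (bits <= 32)."""
    if not 1 <= bits <= 32:
        raise ValueError("bits must be in [1, 32]")
    stream = xof(seed).digest(4 * count)   # 32 bits are consumed per word
    mask = (1 << bits) - 1                 # bit mask applied before post-processing
    # Assumption: words are assembled in little-endian byte order.
    return [int.from_bytes(stream[4 * i:4 * i + 4], "little") & mask
            for i in range(count)]
```

Passing xof=hashlib.shake_256 selects the other SHAKE mode; the masked words would then be handed to whichever post-processing step is configured.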
Rejection Sampling
The public polynomial a in Ring-LWE and the public vector a in Module-LWE have their coefficients uniformly drawn from $\mathbb{Z}_q$ through rejection sampling, where uniformly random numbers of desired bit size are obtained from the PRNG as candidate samples and only numbers smaller than q are accepted. The probability that a random number is not accepted is known as the rejection probability. For prime q, the rejection probability is calculated as $1 - q/2^{\lceil \lg q \rceil}$. In Table 3, we list the rejection probabilities for primes mentioned earlier in Section 3. Clearly, different primes have very different rejection probabilities, often as high as 50%, which can be a bottleneck in lattice-based protocols. To solve this problem, we refer to [51] where pseudo-random numbers smaller than 5q are accepted for q = 12289, thus reducing the rejection probability from 25% to 6%. We extend this technique for any prime q by scaling the rejection bound from q to kq, for appropriate small integer k, so that the rejection probability is now $1 - kq/2^{\lceil \lg kq \rceil}$. We list these scaling factors for the primes in Table 3 along with the corresponding decrease in rejection probability.
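The effect of scaling the rejection bound can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, secrets.randbits stands in for the SHAKE-based PRNG, and the final reduction uses Python's % operator rather than the hardware reduction logic.

```python
import secrets

def rejection_probability(q: int, k: int = 1) -> float:
    """1 - kq / 2^ceil(lg(kq)) for the scaled rejection bound kq."""
    bound = k * q
    bits = (bound - 1).bit_length()      # ceil(lg(bound))
    return 1 - bound / (1 << bits)

def sample_mod_q(q: int, k: int = 1, randbits=secrets.randbits) -> int:
    """Draw candidates of ceil(lg(kq)) bits and accept those below kq."""
    bound = k * q
    bits = (bound - 1).bit_length()
    while True:
        candidate = randbits(bits)
        if candidate < bound:
            return candidate % q         # stand-in for the modular reduction step
```

For q = 12289 this gives roughly 25% for k = 1 and roughly 6% for k = 5, in line with the figures quoted above.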
Although this method reduces rejection rates, the output samples now lie in [0, kq) instead of [0, q). In [51], for q = 12289 and k = 5, the accepted samples are reduced to $\mathbb{Z}_q$ by subtracting q from them up to four times. Since k is not fixed for our rejection sampler, we employ Barrett reduction [33] for this purpose. Unlike modular multiplication, where the inputs lie in $[0, q^2)$, the inputs here are much smaller; so the Barrett reduction parameters are also quite small, therefore requiring little additional logic. In Table 4, we compare our 
Binomial Sampling
For binomial sampling, we take two k-bit chunks from the PRNG and compute the difference of their Hamming weights, as proposed in [24]. The resulting samples follow a binomial distribution with standard deviation $\sigma = \sqrt{k/2}$. We allow configuring k to any value up to 32, thus providing the flexibility to support different standard deviations. We compare our binomial sampling performance (SHAKE-256 used as PRNG) with state-of-the-art software and hardware implementations in Table 5. Our sampler is more than two orders of magnitude more energy-efficient than the software implementation on ARM Cortex-M4, which uses assembly-optimized Keccak [44]. It is also 14× more efficient than [17], which uses Knuth-Yao sampling [52] for binomial distributions with ChaCha20 as PRNG.
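A software model of this sampler is short. The sketch below assumes hashlib's SHAKE-256 as the PRNG and consumes the output stream as consecutive k-bit chunks; the function name and the chunk ordering are assumptions for illustration.

```python
import hashlib

def binomial_samples(seed: bytes, n: int, k: int = 8) -> list[int]:
    """n samples, each the Hamming-weight difference of two k-bit chunks."""
    if not 1 <= k <= 32:
        raise ValueError("k must be in [1, 32]")
    nbytes = (2 * k * n + 7) // 8
    stream = int.from_bytes(hashlib.shake_256(seed).digest(nbytes), "little")
    mask = (1 << k) - 1
    out = []
    for _ in range(n):
        a = stream & mask                 # first k-bit chunk
        b = (stream >> k) & mask          # second k-bit chunk
        stream >>= 2 * k
        out.append(bin(a).count("1") - bin(b).count("1"))
    return out
```

Each sample lies in [−k, k], and for k = 8 the empirical variance should be close to k/2 = 4.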
Discrete Gaussian Sampling
Our discrete Gaussian sampler implements the inversion method of sampling [53] from a discrete symmetric zero-mean distribution χ on Z with small support which approximates a rounded continuous Gaussian distribution, e.g., in Frodo [23] and R.EMBLEM [34] . For a distribution with support S χ = {−s, · · · , −1, 0, 1, · · · , s}, where s is a small positive integer, the probabilities Pr(z) for z ∈ S χ , such that Pr(z) = Pr(−z) can be derived from the cumulative distribution table ( 
Pr(i) for z ∈ [1, s] for precision r. Given random inputs $r_0 \in \{0, 1\}$, $r_1 \in [0, 2^r)$ and distribution table $T_\chi$, a sample e ∈ Z from χ can be obtained using Algorithm 8 [23].
The sampling must be constant-time in order to eliminate timing side-channels, therefore the algorithm does a complete loop through the entire table T χ . The comparison r 1 > T χ [z] must also be implemented in a constant-time manner. Our implementation adheres to these requirements and uses a 64 × 32 RAM to store the CDT, allowing the parameters s ≤ 64 and r ≤ 32 to be configured according to the choice of the distribution. In Table 6 , we have compared our Gaussian sampler performance (SHAKE-256 used as PRNG) with software implementation on ARM Cortex-M4 using assembly-optimized Keccak [44] , and we observe up to 40× improvement in energy-efficiency after accounting for voltage scaling. Hardware architectures for Knuth-Yao sampling have been proposed by [9] and [17] , but they are for discrete Gaussian distributions with larger standard deviation and higher precision, which we do not support.
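The full-table scan can be sketched as follows. This is a hedged illustration in the style of the inversion sampler of [23], not a transcription of Algorithm 8: the function name and the toy table are hypothetical, and Python itself gives no constant-time guarantee, so the branch-free arithmetic only mirrors the structure a hardware or C implementation would use.

```python
def cdt_sample(r0: int, r1: int, table: list[int]) -> int:
    """Inversion sampling over a CDT; always scans the whole table."""
    e = 0
    for t in table:
        # (t - r1) is negative exactly when r1 > t; the arithmetic shift turns
        # that sign into a 0/1 increment without a data-dependent branch.
        e += ((t - r1) >> 63) & 1
    return (e ^ -r0) + r0                 # negate e when the sign bit r0 is 1

# Hypothetical 4-bit-precision table, for illustration only.
TOY_TABLE = [4, 11, 14]
```

With this toy table, r1 values 0 to 4 map to magnitude 0, 5 to 11 to magnitude 1, 12 to 14 to magnitude 2, and 15 to magnitude 3, with r0 choosing the sign.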
Other Distributions
Several lattice-based protocols, such as CRYSTALS-Dilithium [27] and qTESLA [25], require polynomials to be sampled with coefficients uniformly distributed in the range [−η, η] for a specified bound η < q. For this, we again use rejection sampling. Unlike rejection sampling from $\mathbb{Z}_q$, we do not require any special techniques since η is typically small or an integer close to a power of two. Finally, we have also implemented a trinary sampler for polynomials with coefficients from {−1, 0, +1}. We classify these polynomials into three categories: (1) with m non-zero coefficients, (2) with $m_0$ +1's and $m_1$ −1's, and (3) with coefficients distributed as Pr(x = 1) = Pr(x = −1) = ρ/2 and Pr(x = 0) = 1 − ρ for ρ ∈ {1/2, 1/4, 1/8, · · · , 1/128}. Their implementations are described in Algorithms 9, 10 and 11. For the first two cases, we start with a zero-polynomial s of size n. Then, uniformly random coefficient indices in [0, n) are generated, and the corresponding coefficients are replaced with −1 or +1 if they are zero [25, 35]. For the third case, sampling of the coefficients is based on the observation [54] that for a uniformly random number $x \in [0, 2^{k+1})$ we have $\Pr(x = 0) = 1/2^{k+1}$, $\Pr(x = 1) = 1/2^{k+1}$ and $\Pr(x \in [2, 2^{k+1})) = 1 - 1/2^k$. Therefore, for the appropriate value of k ∈ [1, 7], we can generate samples from the desired trinary distribution with $\rho = 1/2^k$. For all three algorithms, the symbol $\in_R$ denotes pseudo-random number generation using the PRNG.
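Cases (1) and (3) can be sketched in a few lines. This is an illustration under stated assumptions rather than a transcription of Algorithms 9 and 11: secrets.randbelow stands in for the PRNG, the sign in case (1) is taken from one extra random bit, and case (3) draws a (k+1)-bit value so that the outcomes 0 and 1 each occur with probability ρ/2 for ρ = 1/2^k. The function names are hypothetical.

```python
import secrets

def trinary_fixed_weight(n: int, m: int, rand_below=secrets.randbelow) -> list[int]:
    """Case (1): exactly m non-zero coefficients, each +1 or -1."""
    if not 0 <= m < n:
        raise ValueError("require m < n")
    s = [0] * n
    i = 0
    while i < m:
        pos = rand_below(n)               # uniformly random index in [0, n)
        if s[pos] == 0:                   # only replace coefficients that are still zero
            s[pos] = 1 if rand_below(2) else -1   # assumption: sign from one random bit
            i += 1
    return s

def trinary_rho(n: int, k: int, rand_below=secrets.randbelow) -> list[int]:
    """Case (3): Pr(+1) = Pr(-1) = rho/2, Pr(0) = 1 - rho, with rho = 2**-k."""
    if not 1 <= k <= 7:
        raise ValueError("k must be in [1, 7]")
    out = []
    for _ in range(n):
        x = rand_below(1 << (k + 1))      # assumption: (k+1)-bit uniform draw
        out.append(1 if x == 0 else -1 if x == 1 else 0)
    return out
```

The while loop in the first function retries whenever a chosen position is already occupied, so the number of PRNG calls varies while the output weight is always exactly m.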
Algorithm 9 Trinary Sampling with m non-zero coefficients (+1's and −1's)
Require: m < n and a PRNG
Ensure: s = (s_0, s_1, · · · , s_{n−1})
    s ← (0, 0, · · · , 0) ; i ← 0
    while i < m do
        pos

The top-level architecture of Sapphire is shown in Fig. 10. The efficient building blocks described in Sections 3 and 4 are integrated with a 1 KB instruction memory and an instruction decoder to form the core of our crypto-processor. It can be programmed using 32-bit custom instructions to perform different polynomial arithmetic, transform and sampling operations, as well as simple branching. For example, the following instructions generate polynomials a, s, e ∈ R_q, and calculate a · s + e, which is a typical computation in the Ring-LWE-based scheme NewHope-1024:

    config (n = 1024, q = 12289)
    # sample_a
    rej_sample (prng = SHAKE-128, seed = r0, c0 = 0, c1 = 0, poly = 0)
    # sample_s
    bin_sample (prng = SHAKE-256, seed = r1, c0 = 0, c1 = 0, k = 8, poly = 1)
    # sample_e
    bin_sample (prng = SHAKE-256, seed = r1, c0 = 0, c1 = 1, k = 8, poly = 2)
    # ntt_s

The config instruction is first used to configure the protocol parameters n and q which, in this example, are the parameters from NewHope-1024. For n = 1024, the polynomial cache is divided into 8 polynomials, which are accessed using the poly argument in all instructions. For sampling, the seed can be chosen from a pair of 256-bit registers r0 and r1, while two 16-bit registers c0 and c1 are used as counters for sampling multiple polynomials from the same seed. For coefficient-wise operations poly_op, the poly_src argument indicates the first source polynomial while the poly_dst argument is used to denote the second source (and destination) polynomial. Similarly, the following set of instructions is used to generate a matrix of polynomials A ∈ R^{2×2}

In this example, parameters from CRYSTALS-Kyber-512 have been used. For n = 256, the polynomial cache is divided into 32 polynomials, which are again accessed using the poly argument.
The init instruction is used to initialize a specified polynomial with all zero coefficients. The matrix A is generated one row at a time, following a just-in-time approach [55] instead of generating and storing all the rows together, to save memory, which becomes especially useful when dealing with larger matrices such as in CRYSTALS-Kyber-1024 and CRYSTALS-Dilithium-IV. We have written a Perl script to parse such plain-text programs and convert them into 32-bit binary instructions which can be decoded by the Sapphire crypto-processor. A complete list of supported instructions is provided in Appendix B.
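To illustrate the first stage of such a translation, the sketch below tokenises the plain-text syntax used in the examples above into an opcode and named arguments. It is written in Python rather than Perl, it is not the authors' script, and it stops short of the 32-bit binary encoding, which depends on the instruction formats in Appendix B. Treating lines that start with # as labels to be skipped is an assumption.

```python
import re

_INSTR = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")

def parse_instruction(line: str) -> tuple[str, dict[str, str]]:
    """Split 'op (key = value, ...)' into an opcode and its named arguments."""
    match = _INSTR.match(line)
    if match is None:
        raise ValueError(f"not an instruction: {line!r}")
    opcode, arg_text = match.groups()
    args = {}
    for field in filter(None, (f.strip() for f in arg_text.split(","))):
        key, _, value = field.partition("=")
        args[key.strip()] = value.strip()
    return opcode, args

def parse_program(text: str) -> list[tuple[str, dict[str, str]]]:
    """Parse a plain-text program, skipping blank lines and '#' lines."""
    return [parse_instruction(line) for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
```

A real assembler would additionally resolve labels and pack each opcode and argument into fixed bit fields.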
We use dedicated clock gates for fine-grained power savings during program execution, and an interrupt pin is used to indicate completion of the program. The memory and data registers of the Sapphire core can be accessed through a simple memory-mapped interface. Using the same interface, it is also coupled with a low-power RISC-V micro-processor [56], with 32 KB instruction memory and 64 KB data memory, which implements the RV32IM instruction set [57] and has Dhrystone performance similar to ARM Cortex-M0. When executing cryptographic workloads in the Sapphire core, the RISC-V core can be clock-gated using the wait-for-interrupt (wfi) instruction. The processor is woken up by a dedicated interrupt from the Sapphire core, which is raised when the cryptographic operation is complete. Using the memory-mapped interface ensures that the cryptographic core can be accessed through simple load and store instructions, without requiring any custom instructions or changes to the compilation toolchain. While the cryptographic core is used to accelerate all lattice cryptography computations, the RISC-V processor is used for scheduling the cryptographic workloads as well as for compression and decompression of public keys and ciphertexts. The Keccak-f[1600] core inside Sapphire can be accessed standalone through RISC-V software, and is used to accelerate SHA-3 hashing and extendable output functions according to the requirements of the protocol.
Our test chip was fabricated in the TSMC 40nm LP CMOS process, and the chip micrograph is shown in Fig. 11 with the key design components highlighted. The final placed-and-routed design of our Sapphire core consists of 106k logic gates (76 kGE for synthesized design) and 40.25 KB SRAM, with a total area of 0.28 mm² (logic and memory combined). Our test chip supports supply voltage scaling from 0.68 V to 1.1 V. Although one of our key design objectives was to demonstrate a configurable lattice cryptography processor, our architecture can be easily scaled for more specific parameter sets. For example, in order to accelerate only NewHope-512 (n = 512, q = 12289), size of the polynomial cache can be reduced to 6.5 KB (= 8 × 512 × 13 bits) and the pre-computed NTT constants can be hard-coded in logic or stored in a 2.03 KB ROM (= 2.5 × 512 × 13 bits) instead of the 15 KB SRAM. Also, the modular arithmetic logic in the ALU can be simplified significantly to work with a single prime only.
We use the on-chip software-configurable clock gates (shown in Fig. 10 ) to accurately measure power consumption of different sub-modules inside the Sapphire core, e.g., sampling, NTT, arithmetic, etc. For example, the following instructions are executed to measure the average power consumption of NTT over 1000 executions: The clock_config instruction is used to control the clock gates, e.g., the PRNG and sampler clocks are gated when measuring NTT power (the RISC-V core is clock-gated using wfi as explained earlier). A simple loop is implemented using labels, comparison and conditional jump instructions, similar to assembly programs in general-purpose micro-controllers (please refer to Appendix B for details of our custom instructions). One of the chip GPIO pins is kept high during the execution of this program to indicate the measurement window, and the power consumption is measured using a source meter. This still includes leakage power from the rest of the chip, but it is only a small fraction of the total power compared to the dynamic power of the operation being measured. Similarly, power consumption of the RISC-V core is measured by clock-gating the Sapphire cryptographic core through software. Finally, leakage power of the chip is measured by externally gating the clock signal being supplied to the chip, so that all logic inside the chip is inactive. The RISC-V processor consumes 45 µW/MHz at 1.1 V (18 µW/MHz at 0.68 V) when running the Dhrystone 2.1 benchmark. Power consumption of the cryptographic core is a strong function of the protocols being executed along with the associated parameters. Average power consumption of the lattice crypto-processor was measured to be around 8 mW at 1.1 V and 72 MHz (520 µW at 0.68 V and 12 MHz). Total leakage power of the chip was measured to be 391 µW at 1.1 V (70 µW at 0.68 V). Since our chip operates on a single power domain, it is not possible to measure leakage power of different components of the chip. 
We report the individual module-wise leakage and dynamic power consumption, as obtained from post-place-and-route simulations of our design operating at 1.1 V and 72 MHz.

Before moving on to the protocol implementations and measurements, we summarize some key architectural design techniques we have used to achieve energy-efficiency:
• We have employed increased parallelism in the modular arithmetic and CS-PRNG modules in the form of single-cycle butterfly computation and 1600-bit 24-cycle Keccak data-path respectively. This reduces cycle count as well as data movement and control circuitry, thus decreasing overall energy consumption.
• Based on overall computational complexity, we know that additions are much cheaper than multiplications. Therefore, we have exploited special properties of prime q and parameter m, wherever possible, during Barrett reduction to convert expensive multiplications into cheaper bit-shifts and additions / subtractions.
• Reading data from registers involves much smaller energy consumption compared to reading from SRAMs. We have used registers for storing PRNG seeds, temporary values and the Keccak state, and SRAMs are used to store only the polynomials. This significantly reduces overall energy consumption, especially for the Keccak core.
• Software-controlled clock gates (explicitly inserted in RTL, apart from tool-inserted clock gates) for the sampler, PRNG and NTT allow fine-grained dynamic power savings by gating inactive modules as required during program execution.
• The crypto-processor internal memory is efficiently utilized to store polynomials during protocol execution, thus avoiding access to the main processor's data memory as much as possible and reducing energy consumption.
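The second technique above can be illustrated for q = 12289 = 2^13 + 2^12 + 1, where multiplication by q reduces to two shifts and two additions. The sketch below is a software model under stated assumptions, not the ALU design: the constant names are hypothetical, the multiplication by the Barrett constant m is left as a general multiply, and the final conditional subtraction would be implemented branch-free in constant-time code.

```python
Q = 12289                      # = 2**13 + 2**12 + 1
K = 2 * Q.bit_length()         # 28 bits, enough for any x < Q*Q
M = (1 << K) // Q              # Barrett constant m = floor(2**K / Q)

def mul_q(t: int) -> int:
    """t * 12289 using only shifts and additions."""
    return (t << 13) + (t << 12) + t

def barrett_reduce(x: int) -> int:
    """Reduce 0 <= x < Q*Q modulo Q."""
    t = (x * M) >> K           # quotient estimate, at most one below floor(x / Q)
    r = x - mul_q(t)           # remainder estimate in [0, 2Q)
    return r - Q if r >= Q else r
```

Because the quotient estimate is off by at most one for inputs below 2^K, a single conditional subtraction is enough to bring the result into [0, q).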
Figure 12: Measurement setup with our test chip.
Protocol Implementations and Measurement Results
To measure the efficiency of our design, we have implemented the following NIST Round 2 lattice-based cryptography protocols on our test chip: Frodo, NewHope, qTESLA, CRYSTALS-Kyber and CRYSTALS-Dilithium (including Dilithium-II, Dilithium-III and Dilithium-IV), each at the NIST security level of its parameter set, where NIST security levels 1-6 indicate brute-force security matching or exceeding that of AES-128, SHA3-256, AES-192, SHA3-384, AES-256 and SHA3-512 respectively. Fig. 12 shows our test board and measurement setup. The test chip is housed in a QFN64 socket soldered to the board, an Opal Kelly XEM7001 FPGA development board is used to interface with the chip, and a Keithley 2602A source meter supplies power to the chip. Both the FPGA and the source meter are controlled from a host computer through USB and GPIB interfaces respectively. The FPGA is used to transfer programs from the host computer to the instruction memory of our test chip. Also, a small ring-oscillator-based true random number generator [58] implemented on the FPGA is connected to our test chip through GPIO pins for providing fresh random inputs to the randombytes function which is part of the NIST API. All lattice cryptography programs are written using custom instructions and compiled with our script, while all RISC-V software is written in C and compiled using the riscv-gcc toolchain.
Protocol Implementations and Evaluation Results
Next, we describe some key aspects of our protocol implementations along with timing and energy profiling results. All polynomial arithmetic, transforms and sampling operations are accelerated using custom programs running in the Sapphire core, and all SHA-3 computations utilize the Keccak core inside Sapphire. The RISC-V processor is used only to read / write data and programs from / to the cryptographic core (both when executing polynomial computations and when utilizing the fast Keccak core for SHA-3 operations), generate initial randomness using the randombytes function, encode / decode messages and compress / decompress public keys and ciphertexts. For polynomials which need to be read from the polynomial cache and encoded (or decoded and written to the polynomial cache), we directly post-process the outputs (or pre-process the inputs) of the crypto-processor's internal memory, instead of first storing the data in intermediate temporary arrays and then processing them. This saves around 10-20% cycles in overall protocol run-time. Also, the internal clock gates are strategically enabled and disabled during program execution using the clock_config instruction (please refer to Appendix B for details of our custom instructions) to reduce overall energy consumption. For the NewHope and CRYSTALS-Kyber key exchange schemes, each of the CPA-secure public key encryption functions -CPA-PKE.KeyGen, CPA-PKE.Encrypt and CPA-PKE.Decrypt -has been written entirely (excluding the encoding and decoding operations) using Sapphire custom instructions with each of the corresponding programs fitting completely in its 1 KB instruction memory. The CCA-secure key encapsulation functions -CCA-KEM.KeyGen, CCA-KEM.Encaps and CCA-KEM.Decaps -involve calls to SHA-3 and the CPA-PKE functions (according to the Fujisaki-Okamoto transform [59] ), which are implemented in software. 
Since the signature schemes qTESLA and CRYSTALS-Dilithium both involve probabilistic rejection of intermediate values, the associated polynomial computations are split into multiple custom programs instead of one each for the KeyGen, Sign and Verify functions. These blocks of code are scheduled using RISC-V software, which also handles encoding and decoding operations. The only exception is the KeyGen step in qTESLA, where high-precision discrete Gaussian sampling using large CDT tables is implemented in software, with the SHA-3 functions accelerated in hardware.
Since Module-LWE algorithms involve working with vectors or matrices of polynomials, it is particularly important to ensure that these polynomials fit inside the crypto-processor memory as much as possible (because reads and writes to the internal memory through software are not cheap). When multiplying the public matrix A with the secret vector s, the matrix A is generated through rejection sampling, one row at a time, following the just-in-time approach from [55] . This reduces memory footprint so that the entire computation can fit in the polynomial cache.
In Table 7 , we compare cycle count and energy consumption of our implementations of the Ring-LWE and Module-LWE CPA-PKE schemes with assembly-optimized software on ARM Cortex-M4 micro-processor (from PQM4 [44] ), with average cycle counts for 100 executions. The energy consumption of our test chip has been measured at 1.1 V and 72 MHz, while the energy consumption of the Cortex-M4 processor is estimated from cycle counts using average power (61.5 mW or 615 pJ/cycle at 3.0 V and 100 MHz) measured on NUCLEO-F411RE operating at 100 MHz. The cycle count and energy consumption for our implementation include program execution as well as the additional overhead of writing inputs to and reading outputs from the Sapphire cryptographic core. For both NewHope and CRYSTALS-Kyber, we observe up to an order of magnitude improvement in energy-efficiency compared to software, after accounting for voltage scaling. Fig. 13 shows how configurability of the Sapphire polynomial cache is utilized to support different ring dimensions.
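The conversion from cycle counts to energy used for the Cortex-M4 estimate is simple arithmetic, sketched below. The function names are hypothetical, and the one-million-cycle workload in the usage note is an arbitrary illustration, not a measured result.

```python
def energy_per_cycle_pj(power_watts: float, freq_hz: float) -> float:
    """Average energy per clock cycle in picojoules."""
    return power_watts / freq_hz * 1e12

def energy_uj(cycles: int, power_watts: float, freq_hz: float) -> float:
    """Energy of a workload of `cycles` cycles in microjoules."""
    return cycles * power_watts / freq_hz * 1e6
```

For the Cortex-M4 figures quoted above, energy_per_cycle_pj(61.5e-3, 100e6) evaluates to about 615 pJ per cycle, so a hypothetical workload of one million cycles would cost about 615 µJ.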
Although our lattice crypto-processor architecture primarily targets Ring-LWE and Module-LWE schemes, we also implement the LWE-based Frodo KEM protocol to demonstrate its flexibility. Since LWE-based algorithms require large matrix multiplications, the arithmetic operations dominate total computation cost unlike Ring-LWE and Module-LWE where sampling is the most expensive operation. Since the matrix dimensions are not powers of two, we tile the rows or columns so that we can use the crypto-processor's array operations effectively, as shown in Fig. 14. For Frodo-640, we split each 640-element array into two arrays of size 512 and 128. For Frodo-976, we simply use arrays of size 1024 with the last 48 elements zeroed out or ignored, as applicable. For Frodo-1344, we use arrays of size 1536, formed by splitting them into two arrays of size 1024 and 512, with the last 192 elements (of the 512-dimension array) zeroed out or ignored, as applicable. However, this tiling scheme makes our version of Frodo incompatible with the reference software implementation.
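The tiling described above can be written down as a small lookup plus zero-padding. This is a sketch for illustration; the table name and helper function are hypothetical, and the array sizes are taken directly from the text.

```python
# Array sizes used for each Frodo dimension, as described in the text.
FRODO_TILES = {640: (512, 128), 976: (1024,), 1344: (1024, 512)}

def tile_row(row: list[int]) -> list[list[int]]:
    """Split a Frodo row into the array sizes above, zero-padding the tail."""
    sizes = FRODO_TILES[len(row)]
    padded = row + [0] * (sum(sizes) - len(row))
    tiles, start = [], 0
    for size in sizes:
        tiles.append(padded[start:start + size])
        start += size
    return tiles
```

The padding amounts work out to 0, 48 and 192 elements for n = 640, 976 and 1344 respectively, matching the counts given above.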
Frodo involves three large matrix multiplications: AS, S′A and S′B, where A, S, S′ and B have dimensions n × n, n × n̄, m̄ × n and n × n̄ respectively with n ∈ {640, 976, 1344} and m̄ = n̄ = 8. We ensure that S′ is stored in row-major form and B is stored in column-major form, which simplifies calculating S′B using the schoolbook matrix multiplication technique. The poly_op instruction is used to coefficient-wise multiply a row of the multiplier matrix with a column of the multiplicand matrix, and the sum_elems instruction computes the sum of its elements to generate one element of the output matrix (please refer to Appendix B for details of our custom instructions). For calculating the matrix AS, we generate A in row-major form (using rejection sampling, with zero chance of rejection since q is a power of two) and S in column-major form (using CDT-based discrete Gaussian sampling) so that the same techniques still work. For n ∈ {640, 976}, the matrix S is generated two columns at a time to reduce the number of outer loop iterations, as illustrated in the pseudo-code below:

Since both matrices S′ and A are generated on-the-fly in row-major fashion, this makes calculating S′A a bit complicated. We multiply each element of the i-th row of A with the i-th element of the j-th row of S′ to generate a partial sum. These partial sums are incrementally added together to compute the j-th row of the output matrix S′A. Once again, we generate S′ two columns at a time to reduce the number of outer loop iterations. The corresponding pseudo-code is shown below:

where the reg = (poly)[i] instruction is used to save the i-th element of the array in the 24-bit internal register reg, the init (poly) instruction creates an array of zeros and the CONST_MUL operation multiplies each element of an array with the value stored in reg (please refer to Appendix B for details of our instructions).
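The two access patterns can be modelled in plain Python. This is a functional sketch of the data flow described above, not the Sapphire instruction sequence: the function names are hypothetical, and the comments only indicate which custom operation each step loosely corresponds to.

```python
def row_times_col(row, col, q):
    """One output element: coefficient-wise product, then a sum of elements."""
    return sum(a * b for a, b in zip(row, col)) % q

def s_prime_times_b(s_rows, b_cols, q):
    """S'B with S' held row-major and B held column-major."""
    return [[row_times_col(r, c, q) for c in b_cols] for r in s_rows]

def s_prime_times_a(s_rows, a_rows, q):
    """S'A with both operands row-major: accumulate S'[j][i] * (row i of A)."""
    n = len(a_rows[0])
    out = []
    for s_row in s_rows:
        acc = [0] * n                       # array of zeros (cf. init)
        for i, a_row in enumerate(a_rows):
            scale = s_row[i]                # scalar held in a register (cf. reg)
            # scale the i-th row of A and add it to the running sum (cf. CONST_MUL)
            acc = [(x + scale * a) % q for x, a in zip(acc, a_row)]
        out.append(acc)
    return out
```

Both routines compute an ordinary matrix product modulo q; they differ only in which operand is streamed row by row, which is what determines the memory layout on the chip.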
The AS + E and S′A + E′ computations require 10.9M and 9.9M cycles respectively for Frodo-640, 25.3M and 23.2M cycles respectively for Frodo-976, and 67.1M and 62.7M cycles respectively for Frodo-1344, which constitute the majority of the total cycle count. This is quite different from the Ring-LWE and Module-LWE schemes, where polynomial sampling accounts for 60-70% of the total computation cost. Please note that memory usage of Frodo-1344-CCA-KEM-Decaps exceeds the 64 KB processor data memory on our test chip; hence it was evaluated only in simulation, with power consumption extrapolated from measured power for Frodo-640 and Frodo-976.
In Tables 8 and 9, we have compared cycle count and energy consumption of assembly-optimized Cortex-M4 software [44] with our hardware-accelerated implementation on our test chip operating at 1.1 V and 72 MHz, with average cycle counts for 100 executions. Clearly, our design achieves up to an order of magnitude improvement in energy-efficiency and performance compared to state-of-the-art software. We note that Module-LWE schemes, although a bit slower compared to Ring-LWE, offer parameters with better scalability in terms of security and efficiency compared to Ring-LWE schemes. Among the key encapsulation schemes, NewHope and CRYSTALS-Kyber are two orders of magnitude more efficient than Frodo, owing to the inherent structure in ideal and module lattices where the key operation is polynomial multiplication as opposed to matrix multiplication in standard lattices. Among the digital signature schemes evaluated, qTESLA allows faster signature generation and verification compared to CRYSTALS-Dilithium. However, our implementation of the key generation step in qTESLA is quite expensive since it uses CDT-based discrete Gaussian sampling with large tables and high precision. This is not a big concern since signature key-pairs are generated infrequently; also, more specialized hardware can be added to our architecture to support such distribution parameters, albeit at the cost of logic area.
In Fig. 15, we plot the measured energy consumption of the Ring-LWE and Module-LWE-based CCA-KEM-Encaps and Sign algorithms at different post-quantum security levels, as implemented on our test chip operating at 1.1 V and 72 MHz. Due to the configurability of our lattice crypto-processor, we are able to implement all these different modes and achieve energy scalability through efficiency versus security trade-offs. In Table 10, we compare our design with existing hardware-accelerated implementations of NIST Round 2 lattice-based protocols. Our crypto-processor is significantly smaller than the multiple designs generated using high-level synthesis in [20], and is also more flexible and energy-efficient. Our Kyber implementation is faster than [18], which uses RSA, AES and SHA hardware accelerators on the SLE 78 security controller platform to accelerate lattice cryptography. Efficiency of our design is greater than or comparable to state-of-the-art FPGA implementations of Ring-LWE [13, 60]. Notably, [60] also uses a RISC-V processor with NTT and SHA accelerators to implement the NewHope protocol. However, our implementation of Frodo, which re-purposes the Ring/Module-LWE hardware for LWE computations, is not as efficient as the dedicated LWE accelerator in [16]. Finally, we also compare our design with state-of-the-art pre-quantum elliptic curve cryptography hardware [56, 61], and we observe our implementation of CCA-secure lattice-based key encapsulation using NewHope-512 to be around 5× more efficient compared to elliptic curve Diffie-Hellman key exchange using the NIST P-256 curve at comparable pre-quantum security level.
Side-Channel Analysis
Side-channel security is an important aspect of all public-key cryptography implementations and lattice-based cryptography is not an exception. In order to prevent information leakage through timing side channels, the most important requirement is to ensure that the timing and memory access patterns of underlying computations are independent of the secret data being computed upon. In our implementation, this is achieved either by making the computations constant-time, e.g., binomial sampling, discrete Gaussian sampling, NTT and polynomial arithmetic, or by using rejection sampling, e.g., sampling numbers from [0, q) or [−η, η] or probabilistic rejection during signature schemes. Since our cryptographic core and RISC-V processor both have a single-level memory hierarchy, the possibility of cache timing attacks is also eliminated.
Our power side-channel measurement setup is shown in Fig. 17. Our test board has an 18 Ω resistor connected in series between the power supply and the VDD pin of our test chip. The voltage across this resistor, proportional to the chip's current draw, is magnified using a non-inverting differential amplifier (consisting of an AD8001 op-amp chip, with 6 dB flat gain up to 100 MHz, in the non-inverting configuration with resistors of appropriate sizes) and then observed through a 2.5 GS/s Tektronix MDO3024 mixed domain oscilloscope.
The execution times of binomial sampling, discrete Gaussian sampling, NTT, polynomial coefficient-wise multiplication and addition (with n = 1024 and q = 12289) were measured for 10,000 random executions.

Typical simple power analysis (SPA) attacks on lattice cryptography implementations exploit information leakage through conditional branching or data-dependent execution times during the modular arithmetic computations in NTT or polynomial coefficient-wise multiplication [62, 63, 64]. As explained in Fig. 16, our implementation of polynomial arithmetic is constant-time. To quantitatively evaluate SPA resistance of our design, we perform a difference-of-means test [65, 64, 66] on three polynomial operations (NTT, coefficient-wise multiplication and coefficient-wise addition) which are traditionally used as attack points. In this test, we try to differentiate two sets of measurements, those with a particular coefficient ('0'-th coefficient in our case) in the input polynomial set to 0 (denoted as set '0' or $S_0$) versus the same coefficient set to q − 1 (denoted as set '1' or $S_1$), by comparing their means separately for each point in the mean power trace. The difference-of-means is calculated for increasing number of measurements and plotted as a function of the number of traces N. The corresponding 99.99% confidence interval for having a zero difference of means between these two sets is calculated as $t_c \cdot \sqrt{(\sigma_0^2 + \sigma_1^2)/N}$, where $\sigma_0$ and $\sigma_1$ are the standard deviations of the two sets $S_0$ and $S_1$ respectively and $t_c$ is the critical t-statistic for N − 1 degrees of freedom and cumulative probability = 1 − (1 − 0.9999)/2 = 0.99995. As long as the absolute difference-of-means is smaller than the confidence interval, it is a strong indicator that the sets $S_0$ and $S_1$ are indistinguishable.
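The per-point test can be sketched with the standard library alone. This is an illustration under stated assumptions: the function name is hypothetical, each trace is a list of samples of equal length, and the critical value is taken from the normal distribution, which approximates the t-statistic closely for the large N used in practice.

```python
from statistics import NormalDist, mean, stdev

def difference_of_means(set0, set1, confidence=0.9999):
    """Per time point: (|mean0 - mean1|, confidence bound) for two trace sets."""
    n = len(set0)
    assert n == len(set1) and n > 1
    # Assumption: normal quantile used in place of the t quantile (large N).
    t_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    results = []
    for p0, p1 in zip(zip(*set0), zip(*set1)):      # iterate over time points
        diff = abs(mean(p0) - mean(p1))
        bound = t_c * ((stdev(p0) ** 2 + stdev(p1) ** 2) / n) ** 0.5
        results.append((diff, bound))
    return results
```

A time point is flagged as distinguishable when its difference exceeds its bound; when the difference stays below the bound at every point, the two sets cannot be told apart at the chosen confidence level.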
In Figures 18, 19 and 20, we provide preliminary difference-of-means test results for three polynomial operations (with n = 1024 and q = 12289) as measured from our test chip operating at 1.1 V and 10 MHz. Sampling rate of the oscilloscope was set to 500 MS/s for NTT and 2.5 GS/s for coefficient-wise multiplication and addition. The red lines denote measured difference-of-means, and the dashed lines mark the 99.99% confidence interval for ideal zero difference-of-means. These results validate that our design is secure against SPA side-channel attacks.
The protocol implementations discussed earlier do not have any explicit countermeasures against differential power analysis (DPA) attacks. Although DPA attacks can be mitigated by using ephemeral keys, it is still important to analyze how these protocols can be made DPA-secure. Masking-based countermeasures have been proposed in [67, 68, 46] for Ring-LWE encryption. Since our crypto-processor is programmable, such masked protocols can be implemented using the right mix of software and hardware acceleration. For example, we consider NewHope-CPA-PKE and discuss how the masked decryption algorithm, inspired by [67, 68, 46] , can be implemented using our hardware. A simplified version of the CPA-PKE scheme, excluding any key / ciphertext compression / decompression and encoding / decoding and implementation-specific details, is provided below:
function NewHope-CPA-PKE.KeyGen(seed):
    Sample â, s, e ∈ R_q
    b̂ ← â ⊙ ŝ + ê
    return (pk = (â, b̂), sk = ŝ)

function NewHope-CPA-PKE.Encrypt(pk, coin, µ ∈ {0, · · · , 255}^32):

where µ is the 32-byte message to be encrypted, x̂ is the NTT representation of polynomial x ∈ R_q, ⊙ denotes coefficient-wise multiplication (in the transform domain) and · denotes polynomial multiplication in R_q. The Encode function converts message µ into a polynomial in R_q. To allow robustness against errors, each bit of the 256-bit message is encoded into n/256 coefficients. For example, for n = 1024, the i-th, (256 + i)-th, (512 + i)-th and (768 + i)-th coefficients are set to 0 or q/2 depending on whether the i-th bit in µ is 0 or 1 respectively, for i ∈ {0, · · · , 255}. The Decode function maps n/256 coefficients of a polynomial back to the original message bit. For example, for n = 1024, it takes the i-th, (256 + i)-th, (512 + i)-th and (768 + i)-th coefficients (each in the range {0, · · · , q − 1}), subtracts q/2 from each of them, accumulates their absolute values, and finally sets the i-th message bit to 0 if the sum is larger than q or to 1 otherwise, for i ∈ {0, · · · , 255}. Further details about these functions are available in the NewHope specification document [24]. The Decrypt algorithm requires one polynomial coefficient-wise multiplication û ⊙ ŝ, one inverse NTT (including multiplication with $n^{-1}\psi^{-i}$) to compute u · s, and one polynomial coefficient-wise subtraction v − u · s. Figure 21 shows the corresponding measured power waveform for n = 1024. Similar to the encryption scheme studied in [68], we note that NewHope-CPA-PKE is also additively homomorphic, that is, if $c_1 = (\hat{u}_1, v_1)$ and $c_2 = (\hat{u}_2, v_2)$ are the ciphertexts corresponding to messages $\mu_1$ and $\mu_2$ respectively, under the same key-pair, then $(\hat{u}_1 + \hat{u}_2, v_1 + v_2)$ will be the ciphertext corresponding to $\mu_1 \oplus \mu_2$.
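The Encode and Decode steps for n = 1024 can be sketched directly from the description above. This is a simplified model, not the reference implementation of [24]: the bit ordering within each byte is an assumption, and q/2 is taken as the integer floor of q/2.

```python
Q, N = 12289, 1024

def encode(msg: bytes) -> list[int]:
    """Spread each of the 256 message bits over N/256 coefficients."""
    assert len(msg) == 32
    poly = [0] * N
    for i in range(256):
        bit = (msg[i // 8] >> (i % 8)) & 1        # assumption: LSB-first bit order
        for j in range(N // 256):
            poly[256 * j + i] = bit * (Q // 2)
    return poly

def decode(poly: list[int]) -> bytes:
    """Recover each bit from the accumulated distance to q/2."""
    msg = bytearray(32)
    for i in range(256):
        total = sum(abs(poly[256 * j + i] - Q // 2) for j in range(N // 256))
        if not total > Q:                         # bit is 0 if the sum exceeds q, else 1
            msg[i // 8] |= 1 << (i % 8)
    return bytes(msg)
```

Adding two encoded polynomials coefficient-wise modulo q and decoding the result yields the XOR of the two messages in this noise-free setting, which is the encoding-level view of the additive homomorphism noted above.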
Following the works of [67, 68, 46], the additive homomorphism of NewHope-CPA-PKE can be exploited to randomize the decryption algorithm as a first-order DPA countermeasure: a random message $\mu_r$ is generated and encrypted into a ciphertext $c_r$, which is added to the received ciphertext $c$; decrypting $c + c_r$ yields $\mu \oplus \mu_r$, from which $\mu$ is recovered by an XOR with $\mu_r$. Therefore, the masked decryption now requires generation of a random message along with invocations of both the Encrypt and Decrypt functions. As explained earlier, these functions can be implemented entirely using Sapphire custom programs, so the masking involves minimal software overheads. Referring to the cycle counts and energy consumption of NewHope-1024-CPA-PKE in Table 7, we note that the masked decryption is about 3× more expensive than the unmasked version, both in terms of energy and performance. Since $\mu_r$ is independent of the original message $\mu$, the ciphertext $c_r$ can be pre-computed offline in order to reduce online computation time and energy consumption. As explained in [68], this technique does not require any modifications to the Decode function.
However, addition of ciphertexts increases the noise in them, thus increasing the decryption failure rate. Each of the two polynomials in the ciphertext contains one noise term whose coefficients are drawn from the zero-mean binomial distribution with support $[-k, k]$ and standard deviation $\sigma = \sqrt{k/2}$ ($k = 8$ for NewHope). When two such ciphertexts are added, the resulting noise distribution (still binomial) has support $[-2k, 2k]$ and standard deviation $\sigma = \sqrt{2k/2} = \sqrt{k}$, that is, the noise variance is doubled. For the resulting effective $k = 16$, which is also the value used in NewHope-Simple, the decryption failure probability goes up from $2^{-216}$ [24] to $2^{-60}$ [69]. As discussed in [68], the standard deviation of the error distribution can be decreased to allow correct decryptions at the cost of a minor deterioration in security. So, one possibility is to set $k = 4$ in the unmasked scheme (so that $k = 8$ for masked decryption and the failure probability remains $2^{-216}$).
For $k = 4$, the corresponding decrease in security level is from 289 bits to 268 bits, as obtained from the LWE hardness estimator [70] using the following Sage module:
    load("https://bitbucket.org/malb/lwe-estimator/raw/HEAD/estimator.py")
    n = 1024; q = 12289; stddev = sqrt(4/2); alpha = sqrt(2*pi)*stddev/q
    _ = estimate_lwe(n, alpha, q, reduction_cost_model=BKZ.sieve)
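The masked decryption flow discussed in this section can be illustrated with a toy Ring-LWE encryption scheme in Python. This is only a sketch under several simplifying assumptions that are not part of the actual design: $n = 256$ with one coefficient per message bit, schoolbook multiplication in place of the NTT, a simple $q/4$ decoding threshold, and Python's non-cryptographic `random` module in place of the SHA-3-based PRNG. All function names are hypothetical.

```python
import random

Q, N, K = 12289, 256, 8   # toy parameters: one coefficient per message bit

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_sub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]

def poly_mul(a, b):
    """Schoolbook multiplication in Z_q[x]/(x^N + 1), a stand-in for the NTT."""
    res = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < N:
                res[i + j] += x * y
            else:
                res[i + j - N] -= x * y   # wrap-around with sign flip: x^N = -1
    return [c % Q for c in res]

def binomial_poly(rng):
    """Coefficients from the zero-mean binomial distribution on [-K, K]."""
    return [(bin(rng.getrandbits(K)).count("1")
             - bin(rng.getrandbits(K)).count("1")) % Q for _ in range(N)]

def encode(bits):
    return [b * (Q // 2) for b in bits]

def decode(poly):
    # Toy threshold decoder: coefficients near Q/2 map to 1, near 0 map to 0.
    return [1 if abs(c - Q // 2) < Q // 4 else 0 for c in poly]

def keygen(rng):
    a = [rng.randrange(Q) for _ in range(N)]
    s, e = binomial_poly(rng), binomial_poly(rng)
    return (a, poly_add(poly_mul(a, s), e)), s

def encrypt(pk, bits, rng):
    a, b = pk
    s1, e1, e2 = binomial_poly(rng), binomial_poly(rng), binomial_poly(rng)
    u = poly_add(poly_mul(a, s1), e1)
    v = poly_add(poly_add(poly_mul(b, s1), e2), encode(bits))
    return u, v

def decrypt(ct, sk):
    u, v = ct
    return decode(poly_sub(v, poly_mul(u, sk)))

def masked_decrypt(ct, pk, sk, rng):
    mask = [rng.getrandbits(1) for _ in range(N)]   # random message mu_r
    c_r = encrypt(pk, mask, rng)                    # could be precomputed offline
    summed = (poly_add(ct[0], c_r[0]), poly_add(ct[1], c_r[1]))
    masked = decrypt(summed, sk)                    # equals mu XOR mu_r
    return [m ^ r for m, r in zip(masked, mask)]    # remove the mask
```

In this sketch the secret key is only ever multiplied with the randomized ciphertext, so the intermediate values of the decryption depend on the fresh mask rather than directly on the received ciphertext. The masked path performs one extra encryption and therefore three polynomial multiplications instead of one, which is in line with the roughly 3× overhead noted above.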
Conclusion and Future Work
In this work, we have presented a configurable lattice cryptography processor supporting different parameters for NIST Round 2 lattice-based key encapsulation and digital signature protocols such as NewHope, qTESLA, CRYSTALS-Kyber, CRYSTALS-Dilithium and Frodo. Efficient modular arithmetic, sampling and NTT memory architectures together provide an order of magnitude improvement in performance and energy-efficiency compared to state-of-the-art software and hardware implementations. Our ASIC implementation was fabricated in a 40nm low-power CMOS process, and all measurement results were obtained from our test chip operating at 1.1 V and 72 MHz. Our protocol implementations are secure against timing and simple power analysis attacks, and we have also discussed how masking countermeasures against differential power analysis can be implemented using the programmability of our crypto-processor.
Since our design supports configurable lattice parameters, it will be interesting to explore other lattice-based protocols such as Saber [71] and Round5 [72] , which are based on the LWR (learning with rounding) problem [73] . More concrete analysis of DPA-secure masked implementations, for CPA-PKE, CCA-KEM and signature schemes, along with leakage tests and impact on performance and energy-efficiency, will also be performed in the future. Finally, non-lattice-based post-quantum protocols can also be implemented on our platform, using a mix of hardware acceleration and software, since they can still benefit from our efficient implementation of modular arithmetic and SHA-3 computations.
Sampling: polynomial sampling from various distributions
    bin_sample (prng, seed, c0, c1, k, poly)
    cdt_sample (prng, seed, c0, c1, r, s, poly)
    rej_sample (prng, seed, c0, c1, poly)
    uni_sample (prng, seed, c0, c1, eta, bitlen, poly)
    tri_sample_1 (prng, seed, c0, c1, m, poly)
    tri_sample_2 (prng, seed, c0, c1, m0, m1, poly)
    tri_sample_3 (prng, seed, c0, c1, rho, poly)
where prng can be SHAKE-128 or SHAKE-256, seed can be r0 or r1, and k, r, s, eta, bitlen, m, m0, m1, rho are the distribution parameters

    if (flag == / != -1 / 0 / +1) goto <label>
where the flag register stores -1, 0 and +1 for the register comparison result being "less than", "equal to" and "greater than" respectively, and it stores 1 or 0 depending on whether the equality check or infinity norm check has passed or failed respectively

SHA-3 Computations: hashing operations
    sha3_init
    sha3_256_absorb (poly)
    sha3_512_absorb (poly)
    sha3_256_absorb (r0 / r1)
    sha3_512_absorb (r0 / r1)
    r0 / r1 = sha3_256_digest
    r0 || r1 = sha3_512_digest
where the seed registers are used to store the hash outputs: either r0 or r1 for SHA-3-256, and both r0 and r1 together for SHA-3-512
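As an illustration of what a sampling instruction such as bin_sample computes, the following Python sketch models binomial sampling driven by a SHAKE-based PRNG. It is a hypothetical software model, not a description of the hardware: the way c0 and c1 are combined with the seed (appended here as two bytes), the order in which PRNG bits are consumed, and the replacement of the poly destination operand by a length argument n are all assumptions made for this example.

```python
import hashlib

Q = 12289  # modulus used to store negative coefficients (assumed NewHope q)

def bin_sample(prng, seed, c0, c1, k, n):
    """Sample n coefficients from the zero-mean binomial distribution on [-k, k]."""
    xof = hashlib.shake_128 if prng == "SHAKE-128" else hashlib.shake_256
    # Assumption: c0 and c1 are single bytes appended to the seed for domain separation.
    nbytes = (2 * k * n + 7) // 8
    stream = int.from_bytes(xof(seed + bytes([c0, c1])).digest(nbytes), "little")
    mask = (1 << k) - 1
    poly = []
    for _ in range(n):
        a = stream & mask            # first k pseudo-random bits
        b = (stream >> k) & mask     # next k pseudo-random bits
        stream >>= 2 * k
        # Difference of Hamming weights, stored modulo Q.
        poly.append((bin(a).count("1") - bin(b).count("1")) % Q)
    return poly
```

For example, `bin_sample("SHAKE-128", bytes(32), 0, 0, 8, 1024)` returns 1024 coefficients, each either at most 8 or at least Q - 8, and the same inputs always reproduce the same polynomial.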
