Lattice-based cryptography (LBC) is one of the most promising classes of post-quantum cryptography (PQC) being considered for standardization. This brief proposes an optimized schoolbook polynomial multiplication (SPM) for compact LBC. We exploit the symmetric nature of Gaussian noise for bit reduction. Additionally, a single field-programmable gate array (FPGA) DSP block is used for two parallel multiplication operations per clock cycle. These optimizations enable a significant 2.2× speedup along with reduced resources for dimension n = 256. The overall efficiency (throughput per slice) is 1.28× higher than that of the conventional SPM, which also contributes to a more compact LBC system compared with previously reported designs. The results targeting the FPGA platform show that the proposed design can achieve high hardware efficiency with reduced hardware area costs.
I. INTRODUCTION
Traditional public key cryptography algorithms, including RSA and elliptic-curve cryptography (ECC), will no longer be secure in the near future due to advancements in quantum computing. The National Institute of Standards and Technology (NIST) called for proposals of post-quantum cryptographic (PQC) algorithms [1] and received 70 PQC algorithm submissions. Among the potential PQC algorithms to be standardized, lattice-based cryptography (LBC) is one of the most promising types. Almost half of the PQC candidates in round 2 of the PQC standardization process are lattice-based [2]. LBC algorithms are based on the hardness of finding the shortest or closest vector in a lattice (the shortest vector problem, SVP, or the closest vector problem, CVP). These problems are believed to be hard for both classical and quantum computers.
Polynomial multiplication plays a critical role in LBC and is typically carried out by schoolbook or number theoretic transform (NTT) multiplication. Schoolbook polynomial multiplication (SPM) is a naive method, requiring direct multiplication and subsequent accumulation of results. Although it is slow, it offers simple implementation and low hardware resource cost. NTT is a much faster alternative that comes with additional hardware resource costs and complexity in terms of operations (pre-computation, array [...]) [2]. NTT suits parameter sets defined on specific modular rings, while schoolbook multiplication is a more generic approach. Therefore, it is important to explore how to efficiently implement schoolbook multiplication. Ring-learning with errors (R-LWE) is a widely investigated algorithm that is based on a hard lattice problem. The most critical operation in the R-LWE schemes is polynomial multiplication on the ring $\mathbb{Z}_q[x]/(x^n + 1)$, where q is the prime modulus.
This brief proposes a compact and efficient hardware design for R-LWE encryption/decryption based on SPM. We exploit the distribution symmetry of Gaussian noise to achieve a reduced bit width and full utilization of DSP blocks. A compact SPM is designed with approximately 2× speedup without additional hardware resource consumption. A comparison with the existing R-LWE designs is provided, which highlights the efficiency of our proposed design. The proposed design optimizations can also be applied to other LBC schemes and other FPGA families.
1063-8210 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. PRELIMINARY BACKGROUND
Table I details the R-LWE-based public key encryption (PKE) scheme. We focus on encryption and decryption, assuming that key generation can be carried out infrequently and offline. The hardware resource requirements for R-LWE encryption and decryption are mainly due to the SPM. $D_\sigma$ is the Gaussian distribution with standard deviation $\sigma$, and U is the uniform distribution. Polynomials $c_1$ and $c_2$ are the ciphertext results. The decryption procedure also performs polynomial multiplication and addition and decodes the polynomial to plaintext. Please refer to [8] for more details on the R-LWE scheme. It is evident that polynomial multiplication is the most computationally intensive part of the cryptographic scheme. The typical SPM algorithm used in R-LWE can be expressed as in (1) [4].
$$c(x) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (-1)^{\lfloor (i+j)/n \rfloor} a_i b_j x^{(i+j) \bmod n} \pmod{q} \quad (1)$$
Because $x^n \equiv -1$ on this ring, the product c(x) = a(x) × b(x) is not an ordinary circular convolution: each accumulated term carries a sign factor $(-1)^{\lfloor (i+j)/n \rfloor}$, which is 1 if i + j < n and −1 otherwise. With the dimension denoted as n, this method has $O(n^2)$ complexity.
Algorithm 1: Schoolbook PMA for Encryption or Decryption
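The sign rule described above can be illustrated with a short software model. The following Python sketch is not the hardware design of this brief; the function name `schoolbook_negacyclic_mul` and the tiny n = 4 example are assumptions chosen only to show how the wrap-around terms are subtracted rather than added.

```python
def schoolbook_negacyclic_mul(a, b, q):
    """Multiply a(x) * b(x) in Z_q[x]/(x^n + 1) with the O(n^2) schoolbook method.

    a and b are coefficient lists of equal length n, lowest degree first.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("a and b must have the same length")
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            prod = (a[i] * b[j]) % q
            if k < n:
                # No wrap-around: sign factor is +1.
                c[k] = (c[k] + prod) % q
            else:
                # x^k = x^(k-n) * x^n = -x^(k-n): sign factor is -1.
                c[k - n] = (c[k - n] - prod) % q
    return c


# Illustrative call with a toy dimension (n = 4) and q = 7681.
result = schoolbook_negacyclic_mul([1, 2, 3, 4], [5, 6, 7, 8], 7681)
```

In this toy case the coefficients of degree 4 to 6 of the ordinary product fold back with a negative sign, which is exactly the difference from a circular convolution.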
Polynomial addition is an ordered sequential addition, well suited for a low-cost hardware implementation. To evaluate the encryption/decryption performance of R-LWE, we implement both polynomial multiplication and addition (PMA) on FPGA, as shown in Algorithm 1. This algorithm calculates d = a × b + c. First, the products of the elements of a and b are calculated; then, the Barrett reduction algorithm is used to perform the modular operation with the prime q; finally, the result is assigned to d. The modular reduction in line 11 only requires a multiplexer due to the small bit width.
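As a rough software model of the PMA flow just described (multiply, Barrett-reduce, then accumulate with a single conditional correction), consider the sketch below. The shift width `K = 26`, the helper names, and the loop structure are assumptions made for illustration; they are not taken from Algorithm 1 itself.

```python
Q = 7681                 # prime modulus used in the paper's parameter set
K = 26                   # assumed shift: a product of two values below Q fits in 26 bits
M = (1 << K) // Q        # precomputed Barrett constant floor(2^K / Q)


def barrett_reduce(x):
    """Reduce 0 <= x < 2^K modulo Q using a multiply, a shift, and corrections."""
    t = (x * M) >> K     # estimate of floor(x / Q); never too large
    y = x - t * Q        # approximate result, may still be >= Q
    while y >= Q:        # final correction by subtraction
        y -= Q
    return y


def schoolbook_pma(a, b, c):
    """Compute d = a*b + c in Z_Q[x]/(x^n + 1); all inputs have entries in [0, Q)."""
    n = len(a)
    d = list(c)
    for i in range(n):
        for j in range(n):
            p = barrett_reduce(a[i] * b[j])
            k = i + j
            if k < n:
                s = d[k] + p
                d[k] = s - Q if s >= Q else s          # one conditional subtract
            else:
                s = d[k - n] - p
                d[k - n] = s + Q if s < 0 else s       # one conditional add
    return d
```

The accumulation step only ever needs one conditional add or subtract because both operands are already below Q, which mirrors the remark that this reduction can be done with a multiplexer.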
In most of the reported hardware designs [8]-[10] with a medium security level, the R-LWE parameter set (n = 256, q = 7681, and s = 11.31) is used. For the reduction modulo q, [11] introduces an algorithm that uses shift, add, and subtract operations to accomplish the modular reduction. For the noise, a zero-centered discrete Gaussian distribution (such as $r_2$) with a standard deviation of $\sigma = s/\sqrt{2\pi} = 4.51$ is considered. On the modular ring, the Gaussian distribution is shown in Fig. 1. As most of the lattice-based cryptosystems among the NIST PQC Round 2 candidates require a Gaussian or binomial distribution, the proposed methods in Section III can be extended to such distributions that have a bounded interval around 0.
III. PROPOSED OPTIMIZED POLYNOMIAL MULTIPLICATION
This section proposes two novel techniques for efficient hardware implementation of R-LWE encryption/decryption modules.
A. Reduced Bit Width Due to the Noise Distribution Symmetry
The discrete Gaussian noise distribution, as shown in Fig. 1, is naturally symmetric on [0, q−1]. Without loss of generality, for σ = 4.51, the number range is limited to [0, 31] and [−31, −1] (i.e., [7650, 7680] on the modular integer ring if represented as an unsigned number). Opting for a signed number representation instead of the naïve 13-bit representation saves hardware resources. For the required number range, one sign bit and five data bits (6 bits in total) are enough to represent the input data instead of a 13-bit unsigned representation, as shown in Fig. 2. This reduced bit-width technique can be applied to all polynomial multiplications in various R-LWE-based PKE schemes.
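To make the mapping between the 13-bit unsigned form and the 6-bit signed form concrete, here is a minimal Python sketch. The function names and the exact bit layout (sign in bit 5, magnitude in bits 4 to 0) are assumptions for illustration; the paper's Fig. 2 defines the actual format.

```python
Q = 7681       # prime modulus
BOUND = 31     # largest noise magnitude assumed for sigma = 4.51


def encode_noise(x):
    """Map a noise coefficient in [0, Q-1] to a 6-bit sign-magnitude word."""
    if 0 <= x <= BOUND:
        sign, mag = 0, x              # non-negative sample
    elif Q - BOUND <= x < Q:
        sign, mag = 1, Q - x          # negative sample stored as its magnitude
    else:
        raise ValueError("coefficient outside the assumed noise range")
    return (sign << 5) | mag          # assumed layout: 1 sign bit + 5 data bits


def decode_noise(word):
    """Map a 6-bit sign-magnitude word back to its residue in [0, Q-1]."""
    sign = (word >> 5) & 1
    mag = word & 0x1F
    return (Q - mag) % Q if sign else mag
```

For example, the residue 7650 (that is, −31) becomes the word `0b111111`, so only the 5-bit magnitude needs to enter the multiplier while the sign is handled separately.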
The reduced signed representation reduces the data for multiplication as well as memory accesses, and the modular operation described in Algorithm 2 is also simplified. The multiplication product width is reduced from 26 bits (13 bit × 13 bit) to 18 bits (13 bit × 5 bit), as the sign bit is not used during the modular multiplication. The sign bit is used for number inversion, as shown in line 7 of Algorithm 1. Compared to the original modular reduction in [11], it saves one addition, one multiplexer, and one subtraction. In line 2 of Algorithm 2, tq is the product of t and q. In line 3, y is an approximate modular result, which requires extra subtractions. Furthermore, the reduced bit-width multiplication makes it possible to perform two multiplications on a single DSP block in FPGA, which will be further discussed in Section III-B.
Table II: Hardware implementation results of different SPMA designs
Table III: ENS on FPGA
Algorithm 3 Two Multiplications Within One DSP Block

B. Full Utilization of FPGA DSP Blocks
In Xilinx 7 series FPGAs, a single DSP block can support a 25 × 18 bit multiplication. For R-LWE, due to the reduced size of the multiplication required (13 × 5), we can efficiently pack two multiplicands to perform two multiplications using one DSP block on the FPGA. This bit packing is elaborated in Algorithm 3, where two multiplications m = a × c and n = b × c are depicted. First, in line 1, a and b are concatenated with 13 inserted zeros in the middle to form a new multiplicand tmp_ab. tmp_ab is 23 bits in size: the most significant 5 bits are b, the least significant 5 bits are a, and 13 zeros lie in between. Then, in line 2, a 23 × 13 multiplication is carried out. In lines 3 and 4, the results m and n are separated out, yielding the two parallel products. The whole process is presented in detail in Fig. 3. The product a × c is an 18-bit result that does not overlap with the product b × c. This packing enables two simultaneous multiplications via one DSP slice per cycle. This trick can be extended to newer Xilinx FPGA families (including UltraScale and UltraScale+), which come with DSP multiplier slices of similar or wider dimensions, e.g., the 27 × 18 multiplier in UltraScale+.
Algorithm 4: Optimized Schoolbook PMA for Encryption or Decryption
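The packing steps described for Algorithm 3 can be checked with plain integer arithmetic. The sketch below is a software model only, assuming unsigned 5-bit magnitudes for a and b and a 13-bit c; the function name `packed_mul` and the constant `GAP` are illustrative.

```python
GAP = 18  # 5 bits of a plus 13 zero bits, so b starts at bit 18


def packed_mul(a, b, c):
    """Return (a*c, b*c) using a single wide multiplication.

    a, b: unsigned 5-bit magnitudes; c: unsigned 13-bit operand.
    """
    if not (0 <= a < 32 and 0 <= b < 32 and 0 <= c < (1 << 13)):
        raise ValueError("operand out of range for the packing")
    tmp_ab = (b << GAP) | a            # line 1: b, 13 zeros, a (23 bits in total)
    prod = tmp_ab * c                  # line 2: one 23 x 13 multiplication
    m = prod & ((1 << GAP) - 1)        # line 3: low 18 bits hold a*c
    n = prod >> GAP                    # line 4: remaining high bits hold b*c
    return m, n


m, n = packed_mul(31, 17, 7680)        # m == 31 * 7680, n == 17 * 7680
```

The split is exact because a × c is at most 31 × 8191 = 253921, which is below 2^18, so the low product never carries into the field that holds b × c.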
The optimized schoolbook PMA, presented in Algorithm 4, uses both optimization techniques. First, it uses the reduced bit-width representation to save hardware resources, which simplifies the modular reduction and reduces the critical path delay. In Algorithm 4, the elements of polynomial b are the discrete Gaussian distribution samples, each of length 5 bits. Second, by fully utilizing the DSP blocks, the system carries out two multiplications per cycle, boosting performance without extra DSP resources. In lines 14 and 17, sign() denotes the MSB of the signed number (sig1/sig2 = 1 for a negative number). For negative numbers, the result ab_m should be subtracted from the modulus q.
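A software model that combines both ideas, signed small coefficients for b and two products per multiplication, might look like the sketch below. It is written under stated assumptions and is not a transcription of Algorithm 4: b is given as Python integers in [−31, 31], n is assumed even, Python's `%` stands in for the hardware modular reduction, and all names are hypothetical.

```python
Q = 7681
GAP = 18
LOW_MASK = (1 << GAP) - 1


def optimized_pma(a, b_signed, c):
    """Compute d = a*b + c in Z_Q[x]/(x^n + 1).

    a, c: coefficients in [0, Q); b_signed: small signed ints in [-31, 31].
    Two coefficients of b are processed per (modelled) DSP multiplication.
    """
    n = len(a)
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even dimension n")
    d = list(c)
    for i in range(n):
        for j in range(0, n, 2):
            b1, b2 = b_signed[j], b_signed[j + 1]
            packed = (abs(b2) << GAP) | abs(b1)    # pack the two 5-bit magnitudes
            prod = packed * a[i]                   # one wide multiplication
            p1 = (prod & LOW_MASK) % Q             # |b1| * a[i] mod Q
            p2 = (prod >> GAP) % Q                 # |b2| * a[i] mod Q
            for p, bj, k in ((p1, b1, i + j), (p2, b2, i + j + 1)):
                # Subtract when exactly one of (negative b, wrap-around) holds.
                negate = (bj < 0) != (k >= n)
                idx = k % n
                d[idx] = (d[idx] - p) % Q if negate else (d[idx] + p) % Q
    return d
```

The XOR of the coefficient sign with the wrap-around condition decides between adding and subtracting, so the multiplier itself only ever sees non-negative 5-bit magnitudes.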
IV. HARDWARE IMPLEMENTATION RESULTS
A. Hardware Design Structure
The high-level hardware block diagram of the optimized SPMA is shown in Fig. 4. Four data inputs are read from the BRAMs that store the three polynomials a, b, and c; polynomial b allows two parallel accesses per clock cycle. Right after the multiplication, the data are split into two parallel pipelined parts, followed by the modular reduction operations. The control signals "sig1" XOR "b1.sign()" and "sig2" XOR "b2.sign()" are used to determine the sign of the accumulated data. Finally, the results d1 and d2 are written to the BRAMs. The BRAMs are controlled by the control address unit.
B. SPMA Performance Results
The designs are synthesized and implemented using Xilinx Vivado 2016.4 targeting a Kintex-7 FPGA (KC705), and the post-place-and-route results are presented in Table II. SPMA-1 refers to the naïve design with no optimizations, as described in Algorithm 1. SPMA-2 exploits the reduced bit-width technique, while SPMA-3 additionally uses the DSP bit packing technique. The SPMA-2 design requires around 15.2% fewer FPGA resources and achieves a higher operating frequency. The SPMA-3 design achieves the highest throughput for two reasons: first, its shorter critical path enables the highest operating frequency; second, it almost halves the computation cycles required and consequently achieves approximately a 2× speedup compared with SPMA-1. The efficiency (denoted as throughput per slice) of SPMA-3 is 2.28× that of SPMA-1.
C. R-LWE Cryptography Implementation Results
In the context of R-LWE-based PKE, SPMA-3 can be used in all three modules: key generation, encryption, and decryption. The encryption module consists of three Gaussian samplers, two SPMA-3 units, and one polynomial addition. The implementation of the Gaussian sampler is based on the cumulative distribution table (CDT) sampling design, which resists the threat of timing attacks by inherently running in constant time [12]. The overall R-LWE cryptosystem hardware block diagram is shown in Fig. 5.
To ensure a fair comparison of our optimized SPMA-3-based R-LWE design with the earlier reported FPGA implementations (on Spartan and Virtex families), the following method of equivalence conversion is proposed for design evaluation. We first convert the DSP blocks and BRAMs used in a given design into an equivalent number of slices (ENS). For Xilinx 7 series, a single DSP block can be replaced by 128 slices for a 25 × 18 multiplier using the built-in IP core. However, not every design fully uses the DSP block; therefore, a weight of 0.8 is assigned to the DSP block (one DSP block = 128 × 0.8 = 102.4 slices). Similarly, each BRAM (18k) can be substituted by 116.2 (166 × 0.7) slices, and each BRAM (8k) can be replaced by 56 (70 × 0.8) slices. The BRAMs are reconstituted by the slice memory using two dual-ported RAM modes. Table III shows the detailed ENS on FPGA.
Table IV compares the proposed design with the previous R-LWE implementations using the same parameter set (n = 256, q = 7681, and s = 11.31), except for [6]. The design in [6] also uses SPMAs, but it has lower frequency and throughput, and its efficiency is much lower than that of the proposed design. Furthermore, it uses a parameter set of (256, 4093, and 8.35), which is considered to be less secure than those of the other designs in Table IV. The latest design [10] claims resistance against timing attacks due to the usage of CDT-based noise sampling. However, it is much slower than the other hardware designs. Pöppelmann and Güneysu [8] proposed a fast R-LWE cryptographic processor at the cost of substantially more resources. The most efficient design [9] uses only NTT (without inverse NTT) for encryption and only inverse NTT for decryption. Meanwhile, NTT computation requires twiddle factors, which need RAM storage when precomputed. However, these RAMs have not been included in the comparison. As mentioned in Section I, half of the 12 LBC contestants in round 2 of the NIST PQC initiative do not use NTT.
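The ENS conversion described above is simple weighted arithmetic, and a small helper makes the weights explicit. The function name and the resource counts in the example call are hypothetical and are not taken from Table III; only the per-block weights come from the text.

```python
# Slice equivalents stated in the text (Xilinx 7 series).
DSP_SLICES = 128 * 0.8        # 102.4 slices per DSP block
BRAM18K_SLICES = 166 * 0.7    # 116.2 slices per 18k BRAM
BRAM8K_SLICES = 70 * 0.8      # 56 slices per 8k BRAM


def equivalent_slices(slices, dsps=0, bram18k=0, bram8k=0):
    """Convert a design's resource usage into an equivalent number of slices (ENS)."""
    return (slices
            + dsps * DSP_SLICES
            + bram18k * BRAM18K_SLICES
            + bram8k * BRAM8K_SLICES)


# Hypothetical design: 100 slices, 1 DSP block, 1 BRAM (18k).
ens = equivalent_slices(100, dsps=1, bram18k=1)
```

Efficiency figures such as throughput per slice can then be computed against this single ENS value instead of three separate resource counts.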
Due to the two proposed novel techniques for optimized SPMA, we achieve an efficient design for R-LWE encryption and decryption. Our design only requires 69 654 clock cycles (0.229 ms) for encryption and 34 436 clock cycles (0.114 ms) for decryption, which makes our proposed design the optimal choice for resource-constrained devices, achieving both high hardware efficiency and performance.
V. CONCLUSION
This brief proposes novel optimizations for the most computationally intensive part of LBC constructions, i.e., the polynomial multiplier, targeting the high-speed FPGA platform. We exploit the noise distribution symmetry to reduce the dynamic range and bit width of the discrete Gaussian data samples. This simplification also enables efficient packing of data and the full utilization of the DSP block to gain a 2× speedup.
