Abstract. Compact implementations of the ring variant of Learning with Errors (Ring-LWE) on embedded processors have been actively studied because of the potential threat posed by quantum computers. Previous Ring-LWE implementations focused mainly on optimization techniques that reduce execution time and memory consumption. As a result, they did not provide protection against general side-channel attacks such as timing attacks. In this paper, we present a secure Ring-LWE encryption implementation on low-end 8-bit AVR processors that is also the fastest reported so far. We target the most expensive operation, Number Theoretic Transform (NTT) based polynomial multiplication, to provide countermeasures against timing attacks together with the best performance among comparable implementations. Our contributions are as follows: (1) we propose Look-Up Table (LUT) based fast reduction techniques that speed up modular coefficient multiplication in a regular fashion, and (2) we use modular addition and subtraction operations that execute in constant time. With these optimizations, the proposed NTT implementation is 18.3% to 22.0% faster than previous works. Finally, our Ring-LWE encryption implementations require only 680,796 and 1,754,064 clock cycles for the 128-bit and 256-bit security levels, respectively.
Introduction
Classic public-key cryptography algorithms such as RSA and Elliptic Curve Cryptography (ECC) are built on the integer factorization and discrete logarithm problems, which are believed to be secure against classical computers when parameters are chosen properly. For this reason, a number of works focused on compact implementations of RSA and ECC [17, 22, 5, 21, 13, 23, 9, 12, 24, 10, 8, 6]. However, these hard problems can be solved in polynomial time by Shor's algorithm on a sufficiently large quantum computer [25]. To defeat such potential attacks, lattice-based cryptography is considered one of the most promising candidates for post-quantum cryptography. Lattice-based cryptography is built on worst-case computational assumptions in lattices that are expected to remain hard even for quantum computers. Furthermore, the emerging Internet of Things (IoT) technology introduces new computing environments that include all kinds of sensors, actuators, meters, consumer electronics, medical monitors, household appliances and vehicles. Since these devices are very resource-constrained in terms of computing power, power supply and memory, implementing public-key cryptographic algorithms on low-end 8-bit processors poses a big challenge. Therefore, it is necessary to further study post-quantum cryptosystems on low-end IoT devices.
The introduction of the Learning with Errors (LWE) problem and its ring variant (Ring-LWE) [18, 14] provides efficient ways to build lattice-based public-key cryptosystems. The following software implementations of Ring-LWE based public-key encryption or digital signature schemes improved performance and memory requirements. Oder et al. presented an efficient implementation of the Bimodal Lattice Signature Scheme (BLISS) on a 32-bit ARM Cortex-M4F microcontroller [15]. De Clercq et al. implemented the Ring-LWE encryption scheme on the same ARM processor [3]. They used each 32-bit register to hold two 13-bit or 14-bit coefficients. Boorghany et al. implemented a lattice-based cryptographic scheme on an 8-bit processor for the first time in [1, 2]. The authors evaluated four lattice-based authentication protocols on both 8-bit AVR and 32-bit ARM processors. In particular, the Fast Fourier Transform (FFT) and the Gaussian sampler function are implemented. In LATINCRYPT'15, Pöppelmann et al. studied and compared implementations of Ring-LWE encryption and BLISS on an 8-bit Atmel ATxmega128 microcontroller [16]. In CHES'15, Liu et al. optimized implementations of Ring-LWE encryption by presenting efficient modular multiplication, NTT computation and refined memory access schemes to achieve high performance and low memory consumption [11]. They presented two implementations of the Ring-LWE encryption scheme, for medium-term and long-term security levels, on an 8-bit AVR processor. Liu et al. presented the first Ring-LWE encryption and BLISS signature implementations that are secure against timing attacks [7]. The NTT and sampling computations are implemented in constant time to prevent timing attacks. In particular, modular reduction is performed with Montgomery reduction to reduce computational complexity. Recently, highly efficient implementations on ARM-NEON and MSP430 processors were also presented in [4, 9].
Contributions
This paper continues the line of research on secure and compact implementations of the Ring-LWE encryption scheme on low-end 8-bit AVR processors. The core contributions are techniques that prevent information leakage and optimizations that improve the real-world performance of the Ring-LWE encryption scheme.
In particular, we focus on optimizing Number Theoretic Transform (NTT) based polynomial multiplication, which is the most expensive computation in Ring-LWE. The NTT computation requires a large number of modular arithmetic operations, so its performance depends strongly on how well modular reduction is optimized. To accelerate it, we use Look-Up Table (LUT) based fast reduction techniques for modular coefficient multiplication. Modular addition and subtraction operations are also implemented in constant time and in the incomplete representation. To optimize performance at the assembly level, the NTT routines make full use of the general-purpose registers of the target processor.
Based on the above NTT optimization techniques, we present secure and compact implementations of the Ring-LWE encryption scheme on a low-end 8-bit AVR processor. All operations are designed to prevent timing attacks. The implementation requires only 681K and 1,754K clock cycles for encryption at the 128-bit and 256-bit security levels, respectively.
The rest of this paper is organized as follows. In Section 2, we recall the background of the Ring-LWE encryption scheme, the NTT algorithm, and previous implementation techniques for the NTT algorithm. In Section 3, we present optimization techniques for the NTT on low-end 8-bit AVR processors. In particular, we propose techniques that prevent information leakage through timing and reduce the execution time of the NTT algorithm. In Section 4, we report the performance of our implementation and compare it with state-of-the-art NTT and Ring-LWE encryption implementations on low-end 8-bit AVR platforms. Finally, we conclude the paper in Section 5.
Background

Ring-LWE encryption scheme
In 2010, Lyubashevsky et al. proposed an encryption scheme based on a more practical algebraic variant of the LWE problem defined over polynomial rings $R_q = \mathbb{Z}_q[x]/\langle f \rangle$ with an irreducible polynomial $f(x)$ and a modulus $q$. In the Ring-LWE problem, the elements $a$, $s$ and $t$ are polynomials in the ring $R_q$. The Ring-LWE encryption scheme proposed by Lyubashevsky et al. was later optimized in [20]. Roy et al.'s variant aims at reducing the cost of polynomial arithmetic. In particular, the polynomial arithmetic during a decryption operation requires only one Number Theoretic Transform (NTT) operation. Besides this computational optimization, the scheme samples from the discrete Gaussian distribution using a Knuth-Yao sampler. We now describe the steps of the encryption scheme proposed by Roy et al. [20]; the NTT itself is covered in the next subsection. We denote the NTT of a polynomial $a$ by $\tilde{a}$.
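- Key generation stage Gen($\tilde{a}$): Two error polynomials $r_1, r_2 \in R_q$ are sampled from the discrete Gaussian distribution $X_\sigma$ by applying the Knuth-Yao sampler twice. Their transforms are computed as
$$\tilde{r}_1 = \mathrm{NTT}(r_1), \qquad \tilde{r}_2 = \mathrm{NTT}(r_2),$$
and then the operation $\tilde{p} = \tilde{r}_1 - \tilde{a} \cdot \tilde{r}_2 \in R_q$ is performed. The public key is the polynomial pair $(\tilde{a}, \tilde{p})$ and the private key is the polynomial $\tilde{r}_2$.
- Encryption stage Enc($\tilde{a}, \tilde{p}, M$): The input message $M \in \{0, 1\}^n$ is a binary vector of $n$ bits. This message is first encoded into a polynomial $\bar{M}$ in the ring $R_q$ by multiplying the bits of the message by $q/2$. Three error polynomials $e_1, e_2, e_3 \in R_q$ are sampled from $X_\sigma$. With $\tilde{e}_1 = \mathrm{NTT}(e_1)$ and $\tilde{e}_2 = \mathrm{NTT}(e_2)$, the ciphertext is computed as a set of two polynomials $(\tilde{C}_1, \tilde{C}_2)$:
$$(\tilde{C}_1, \tilde{C}_2) = \left(\tilde{a} \cdot \tilde{e}_1 + \tilde{e}_2,\ \tilde{p} \cdot \tilde{e}_1 + \mathrm{NTT}(e_3 + \bar{M})\right).$$
- Decryption stage Dec($\tilde{C}_1, \tilde{C}_2, \tilde{r}_2$): One inverse NTT is performed to recover $M'$:
$$M' = \mathrm{INTT}(\tilde{r}_2 \cdot \tilde{C}_1 + \tilde{C}_2),$$
and then a decoder is used to recover the original message $M$ from $M'$.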
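To make the three stages concrete, the following is a minimal toy sketch, not the implementation evaluated in this paper. It works directly on coefficient vectors with schoolbook multiplication instead of the NTT, which gives the same algebra because the NTT is linear and maps ring products to pointwise products. Several choices are assumptions made only for illustration: the ring is taken as $\mathbb{Z}_q[x]/(x^n + 1)$, the dimension is shrunk to n = 16, errors are drawn uniformly from {-1, 0, 1} as a stand-in for the Knuth-Yao Gaussian sampler, and the decoder is a simple threshold test around q/2. All function names are hypothetical.

```python
import random

Q = 7681   # modulus of the 128-bit parameter set
N = 16     # toy dimension (assumption; the paper uses n = 256)

def poly_mul(a, b):
    """Schoolbook product in Z_q[x]/(x^N + 1) (assumed ring)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:  # x^N = -1, so wrapped terms change sign
                out[k - N] = (out[k - N] - ai * bj) % Q
    return out

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_sub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]

def small_error(rng):
    """Stand-in for the discrete Gaussian sampler (assumption)."""
    return [rng.choice((-1, 0, 1)) % Q for _ in range(N)]

def keygen(a, rng):
    r1, r2 = small_error(rng), small_error(rng)
    p = poly_sub(r1, poly_mul(a, r2))      # p = r1 - a * r2
    return p, r2                           # public part p, private key r2

def encrypt(a, p, bits, rng):
    e1, e2, e3 = (small_error(rng) for _ in range(3))
    m_bar = [bit * (Q // 2) for bit in bits]            # encode bits as 0 or q/2
    c1 = poly_add(poly_mul(a, e1), e2)                  # a * e1 + e2
    c2 = poly_add(poly_mul(p, e1), poly_add(e3, m_bar)) # p * e1 + e3 + m_bar
    return c1, c2

def decrypt(c1, c2, r2):
    m_prime = poly_add(poly_mul(r2, c1), c2)            # r2 * c1 + c2
    # Threshold decoder (assumption): values near q/2 decode to 1.
    return [1 if Q // 4 < c < 3 * Q // 4 else 0 for c in m_prime]
```

Expanding the decryption expression gives $r_2 e_2 + r_1 e_1 + e_3 + \bar{M}$, so decoding succeeds as long as the accumulated error stays below q/4, which always holds for the toy error bound used here.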
Number Theoretic Transform
We use the Number Theoretic Transform (NTT) to perform polynomial multiplication. The NTT can be seen as a discrete variant of the Fast Fourier Transform (FFT) that operates in a finite ring $\mathbb{Z}_q$. Instead of using the complex roots of unity, the NTT evaluates a polynomial $a(x) = \sum_{i=0}^{n-1} a_i x^i \in \mathbb{Z}_q[x]$ at the n-th roots of unity $\omega_n^i$ for $i = 0, \ldots, n-1$, where $\omega_n$ denotes a primitive n-th root of unity. Algorithm 1 shows the iterative version of the NTT algorithm.

Algorithm 1 Iterative Number Theoretic Transform
Require: A polynomial a(x) ∈ Z_q[x] of degree n − 1 and a primitive n-th root of unity ω_n ∈ Z_q
Ensure: Polynomial a(x) = NTT(a) ∈ Z_q[x]
1: a = BitReverse(a)
2: for i from 2 by i = 2i to n do
3:     ω_i = ω_n^(n/i), ω = 1
4:     for j from 0 by 1 to i/2 − 1 do
5:         for k from 0 by i to n − 1 do
6:             U = a[k + j]
7:             V = ω · a[k + j + i/2]
8:             a[k + j] = U + V
9:             a[k + j + i/2] = U − V
10:        ω = ω · ω_i
11: return a
The iterative NTT algorithm consists of three nested loops. The outermost loop (i-loop) starts from i = 2 and doubles i in each iteration until i = n, so it has only $\log_2 n$ iterations. In each iteration, the value $\omega_i$ is computed by the power operation $\omega_i = \omega_n^{n/i}$, and the twiddle factor $\omega$ is initialized to 1. The j-loop executes more iterations: their total number is the sum of the geometric progression $2^t$ for $t$ from 0 to $\log_2 n - 1$, so the j-loop has $n - 1$ iterations in total. In each iteration of the j-loop, the twiddle factor $\omega$ is updated by one coefficient modular multiplication. The innermost loop (k-loop) accounts for most of the execution time of the NTT algorithm, since its body is executed roughly $\frac{n}{2} \log_2 n$ times.
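The three-loop structure described above can be sketched in a few lines of Python. This is an illustrative reference model only: it uses plain `%` reductions rather than the constant-time AVR routines developed later, and the helper names (`find_root_of_unity`, `bit_reverse`, `naive_transform`) are hypothetical. The root search assumes that n is a power of two dividing q − 1, which holds for q = 7681 with n = 256.

```python
Q = 7681  # q - 1 = 7680 = 256 * 30, so 256-th roots of unity exist

def find_root_of_unity(n, q=Q):
    """Return an element of order exactly n in Z_q (n a power of two, n | q - 1)."""
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)
        if pow(w, n // 2, q) == q - 1:   # w^(n/2) = -1 implies order n
            return w
    raise ValueError("no primitive n-th root of unity found")

def bit_reverse(a):
    """Permute a (length a power of two, at least 2) into bit-reversed order."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for idx, value in enumerate(a):
        rev = int(format(idx, "0{}b".format(bits))[::-1], 2)
        out[rev] = value
    return out

def ntt(a, omega_n, q=Q):
    """Iterative NTT with the i-, j- and k-loops of Algorithm 1."""
    n = len(a)
    a = bit_reverse(a)
    i = 2
    while i <= n:                          # i-loop: log2(n) iterations
        omega_i = pow(omega_n, n // i, q)
        omega = 1
        for j in range(i // 2):            # j-loop: n - 1 iterations in total
            for k in range(0, n, i):       # k-loop: butterfly operations
                u = a[k + j]
                v = omega * a[k + j + i // 2] % q
                a[k + j] = (u + v) % q
                a[k + j + i // 2] = (u - v) % q
            omega = omega * omega_i % q    # twiddle factor update
        i *= 2
    return a

def naive_transform(a, omega_n, q=Q):
    """Direct evaluation of a(x) at the powers of omega_n, for cross-checking."""
    n = len(a)
    return [sum(a[j] * pow(omega_n, i * j, q) for j in range(n)) % q
            for i in range(n)]
```

The direct evaluation costs on the order of n^2 multiplications, whereas the iterative version performs one coefficient multiplication per butterfly, which is why the k-loop body dominates the running time.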
Previous Implementations of NTT
In LATINCRYPT'15, Pöppelmann et al. optimized the NTT operation by merging the inverse NTT with the multiplication by powers of $\psi^{-1}$. Furthermore, the bit-reversal step is removed by manipulating the standard iterative algorithms. In CHES'15, Liu et al. proposed high-speed NTT operations with efficient coefficient modular multiplication [11]. They presented the Move-and-Add (MA) method to perform the 16-bit coefficient multiplication and the Shift-Add-Multiply-Subtract-Subtract (SAMS2) technique, which replaces expensive reduction operations based on MUL instructions with cheaper shift and addition instructions. In TECS'17, Liu et al. improved the modular reduction by using Montgomery reduction [7]. This improves on the previous SAMS2 technique in cases where it requires a large number of shift and addition operations on low-end devices. The new technique ensures constant-time computation together with high performance.
Proposed Methods
The NTT computation spends the majority of its execution time on modular multiplication, since this operation is performed in the innermost k-loop. The 16-bit multiplication requires only four 8-bit multiplication operations, and this is already well covered in previous works [11]. Thus, optimizing the fast reduction operation is a prerequisite for a high-speed implementation of the NTT algorithm. We chose the prime moduli q = 7681 (0x1e01 in hexadecimal representation) and q = 12289 (0x3001 in hexadecimal representation) as the target parameters, which are also used in previous works [11, 7].
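As a small illustration of why four 8-bit multiplications suffice, the sketch below assembles a 16-bit by 16-bit product from byte-wise partial products, as a processor with an 8-bit multiplier has to do. It does not reproduce the instruction scheduling of the Move-and-Add method from [11]; the function name and the returned byte order (least significant byte first) are assumptions chosen for illustration.

```python
def mul16_from_bytes(a, b):
    """Multiply two 16-bit values using four 8x8-bit partial products.

    Returns the 32-bit product as four bytes, least significant first.
    """
    a0, a1 = a & 0xFF, (a >> 8) & 0xFF
    b0, b1 = b & 0xFF, (b >> 8) & 0xFF
    p00 = a0 * b0            # weight 2^0
    p01 = a0 * b1            # weight 2^8
    p10 = a1 * b0            # weight 2^8
    p11 = a1 * b1            # weight 2^16
    product = p00 + ((p01 + p10) << 8) + (p11 << 16)
    return [(product >> (8 * i)) & 0xFF for i in range(4)]
```

For coefficients below 2^13 (q = 7681) the product fits in 26 bits, and for coefficients below 2^14 (q = 12289) it fits in 28 bits, so the most significant byte is only partially used.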
Unlike the previous SAMS2 method of [11, 7], we propose an optimized Look-Up Table (LUT) based fast reduction technique for the mod 7681 and mod 12289 operations. The main idea is to first reduce the upper bytes of the product by looking up pre-computed reduced values indexed by 8 bits, and then to perform a tiny fast reduction step on the resulting short coefficient. The results are kept in the incomplete representation in order to reduce the number of subtractions in the reduction step. For the prime modulus q = 7681, the variables are always kept in the range $[0, 2^{13} - 1]$. Since $2^{13} \equiv 2^9 - 1 \bmod 7681$, the fast reduction can be performed with a 16-bit addition ($2^9$) and an 8-bit subtraction ($-1$). The detailed method is illustrated in Figure 1. We keep the product in four registers (r3, r2, r1, r0), which are marked with different colors. Each of the registers (r3, r2, r1, r0) is 8 bits long. A colored part means that the bit is occupied, while a white part means that the bit is empty. The reduction modulo 7681 using the LUT approach can be performed as follows:
1. LUT access. We first perform the LUT access with the variable (r2) to get the 13-bit reduced result (s1 and s0). Then the variable (r3) is also reduced, giving the result (t1 and t0). Both results are 13 bits long and each is stored in two 8-bit registers.
2. Addition. We then compute the sum (r1, r0) + (s1, s0) + (t1, t0). The sum is less than 18 bits long, so it can be kept in three registers (k2, k1, k0).
3. Shifting. We shift (k2, k1, k0) right by 13 bits to get the result (u0). Afterwards, the value (u0) is shifted left by 9 bits to get (d1, d0).
4. Modulo. Thereafter, the lower 13 bits of the intermediate result (k2, k1, k0) are extracted to obtain (w1, w0).
5. Addition and subtraction. Finally, we compute (w1, w0) + (d1, d0) − u0.

The two LUTs require only 1KB ($2^8 \times 2 + 2^8 \times 2$ bytes) and are stored in the ROM. Considering that AVR platforms provide ROM sizes of 128, 256, and 384 KB, the ROM consumption of the LUTs is negligible.
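The five steps above can be modelled in Python as follows. This is a sketch under stated assumptions rather than the assembly routine of Algorithm 2: the two tables are assumed to hold `(i << 16) mod q` and `(i << 24) mod q` for every byte value `i`, the names `LUT1`, `LUT2` and `lut_reduce_7681` are hypothetical, and Python integers stand in for the register groups.

```python
Q = 7681  # 0x1E01 = 2^13 - 2^9 + 1, hence 2^13 = 2^9 - 1 (mod Q)

# Assumed table contents: reduced contribution of each upper byte.
LUT1 = [(i << 16) % Q for i in range(256)]   # indexed by r2
LUT2 = [(i << 24) % Q for i in range(256)]   # indexed by r3

def lut_reduce_7681(x):
    """Reduce a 32-bit product x to a short value congruent to x mod 7681."""
    low = x & 0xFFFF                 # (r1, r0)
    r2 = (x >> 16) & 0xFF
    r3 = (x >> 24) & 0xFF
    s = LUT1[r2]                     # step 1: (s1, s0), 13 bits
    t = LUT2[r3]                     # step 1: (t1, t0), 13 bits
    k = low + s + t                  # step 2: (k2, k1, k0), below 2^17
    u = k >> 13                      # step 3: u0
    d = u << 9                       # step 3: (d1, d0)
    w = k & 0x1FFF                   # step 4: (w1, w0)
    return w + d - u                 # step 5: w + u * (2^9 - 1)
```

Every step is a table look-up, addition, shift or mask with no data-dependent branch, which is what makes the approach suitable for a constant-time implementation. In this model the output is congruent to the input modulo 7681 but is not fully reduced; it stays below 2^14.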
Algorithm 2 describes the LUT based modular reduction at the source code level. In Steps 1 to 13, the Move-and-Add (MA) multiplication is used to perform the 16-bit multiplication. The 32-bit intermediate result is stored in four 8-bit registers (R18, R19, R20, R21). In Steps 14 and 15, the address of LUT 1 is loaded into two registers (R30, R31). Then bits 17 to 24 (R20) are added to the address. Once the address pointer is ready, the LUT access is performed. From Step 22 to Step 29, bits 25 to 28 (R21) are used to access LUT 2. Afterwards the results are reduced. In Steps 30 and 31, the two 13-bit LUT results are added, and the sum is then added to the intermediate result. From Step 35 to Step 45, the tiny fast reduction is performed on the 17-bit intermediate result with 16-bit addition and 8-bit subtraction operations.
Since the LUT approach is generic and works for any prime, it also applies to the mod 12289 case. The two differences are the LUT values and the final step (tiny fast reduction). We need to construct the LUT for mod 12289. For the final step, we perform the tiny fast reduction using $2^{14} \equiv 2^{12} - 1 \bmod 12289$, as illustrated in Figure 2. We execute two LUT accesses and one tiny final reduction. After the tiny fast reduction, the output is a 16-bit result, which can cause an overflow in the following operations. We therefore perform the fast reduction once more to fit the result within 15 bits. Because the most significant bit of the register is left free, addition and subtraction operations do not need to check whether the intermediate results overflow or underflow.
Constant Modular Addition and Subtraction
To prevent timing attacks, modular addition and subtraction operations should be implemented in constant time. We use the incomplete representation and an unsigned type as the variable format, so the results are always kept in 2 bytes and are never negative. The detailed description is given in Algorithm 3. First, the addition or subtraction operation is performed. In particular, the subtraction is performed together with an addition of the variable (q 2 ) to avoid the underflow condition. From Step 6 to Step 9, the tiny fast reduction operation is performed. However, the result obtained in Step 9 may still be larger than the modulus (q = 7681), so we apply a correction by subtracting the modulus (q). If this subtraction underflows, we add the modulus (q) back by using the mask variable (P). Finally, the results (R) are always kept within 0x2000 in the incomplete representation.
For the case of 12289, we can adopt the same constant-time modular addition and subtraction techniques as in Algorithm 3. Only the parameters are different. The detailed description is given in Algorithm 4. First, the addition or subtraction operation is performed. Afterwards, the fast reduction is performed. The obtained results (R) are always kept within 0x4000 in the incomplete representation.
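The masked last correction can be illustrated with the Python model below for q = 7681. Only the final three operations (borrow, mask P, masked add-back) follow the surviving steps of Algorithm 3; the earlier steps are a reconstruction under assumptions, namely that the subtraction adds 2q to stay non-negative and that a single tiny reduction with $2^{13} \equiv 2^9 - 1$ precedes the correction. Python itself gives no constant-time guarantee, so the sketch only mirrors the branch-free data flow.

```python
Q = 0x1E01        # 7681
MASK16 = 0xFFFF   # results live in a 2-byte register pair

def tiny_reduce(r):
    """Fold bits above bit 12 using 2^13 = 2^9 - 1 (mod 7681)."""
    u = r >> 13
    return (r & 0x1FFF) + (u << 9) - u

def last_correction(r):
    """Branch-free conditional subtraction of Q, mirroring Steps 11 to 13."""
    diff = (r - Q) & 0x1FFFF          # 17-bit view; bit 16 acts as the borrow flag
    borrow = diff >> 16
    r = diff & MASK16                 # {Borrow, R} <- R - Q
    p = (0x0000 - borrow) & MASK16    # P = 0xFFFF on borrow, 0x0000 otherwise
    return (r + (Q & p)) & MASK16     # R <- R + (Q & P)

def mod_add(a, b):
    """a + b mod Q for inputs below 0x2000."""
    return last_correction(tiny_reduce(a + b))

def mod_sub(a, b):
    """a - b mod Q for inputs below 0x2000; adding 2Q (assumption) avoids underflow."""
    return last_correction(tiny_reduce(a + 2 * Q - b))
```

The same sequence of operations is executed for every input, so the running time does not depend on whether the correction is actually needed.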
Performance Evaluation
This section presents the performance results of our implementation. We first describe the experimental platform in Section 4.1. Afterwards, we compare our results with previous modular multiplication and NTT implementations in Section 4.2. Finally, we compare our results with previous Ring-LWE implementations in Section 4.3.
Algorithm 3 Constant modular addition/subtraction for q = 7681 (0x1E01)
...
11: {Borrow, R} ← R − 0x1E01 {last correction}
12: P ← 0x0000 − Borrow
13: R ← R + (0x1E01 & P)
14: return R

Algorithm 4 Constant modular addition/subtraction for q = 12289 (0x3001)
...
11: {Borrow, R} ← R − 0x3001 {last correction}
12: P ← 0x0000 − Borrow
13: R ← R + (0x3001 & P)
14: return R
Experimental platform
Our implementation uses an ATxmega128A1 processor on an Xplain board as the target platform. This processor has a maximum frequency of 32 MHz, 128 KB of flash program memory, and 8 KB of SRAM. It provides an AES crypto-accelerator and can be used in a wide range of applications, such as industrial and hand-held battery-powered applications as well as some medical devices. The implementation is written in a mix of ANSI C and assembly. In particular, the main structure and interface are written in C, while the core operations such as modular arithmetic are implemented in assembly. For the LUT based approach, the constant LUT values are stored in flash program memory, which requires 0.5KB for saving the parameters and 3 clock cycles for each byte access. We compiled our implementation with the speed optimization option '-O3' in Atmel Studio 6.2. To obtain accurate timings, we executed each operation at least 1000 times and report the average cycle count for each operation.

Liu et al. [7] perform Montgomery reduction to reduce the 28/30-bit variables to 14/15-bit results. However, the complexity of n-word Montgomery reduction is generally $n^2 + n$, which is still a high overhead on low-end devices. Unlike previous approaches, we use a LUT based approach to achieve both high performance and a secure implementation.
Comparison of modular multiplication and NTT
As shown in Table 1, the proposed modular multiplication with 7681 and 12289 requires only 57 and 66 clock cycles, which are 16 and 4 clock cycles fewer than the previous approach [7], respectively. The proposed NTT operation also shows higher performance than previous works. The NTT operation requires only 158,607 clock cycles for the 128-bit security implementation and 403,224 clock cycles for the 256-bit security implementation. The NTT results for medium-term and long-term security are 18.3% and 22.0% faster than previous works, respectively.
Comparison of Ring-LWE
With the optimized NTT implementation, we evaluated the Ring-LWE encryption scheme with the parameter sets $(n, q, \sigma) = (256, 7681, 11.31/\sqrt{2\pi})$ and $(512, 12289, 12.18/\sqrt{2\pi})$ for the 128-bit and 256-bit security levels, respectively. The tail of the discrete Gaussian sampler is cut at $12\sigma$, which keeps the statistical difference from the theoretical distribution below $2^{-90}$. These parameter sets were also used in most of the previous software implementations, e.g., [1-3, 11, 7].
Discrete Gaussian sampling is an integral part of the Ring-LWE algorithm. However, previous implementations are not secure against timing and simple power analysis: the Knuth-Yao sampler uses a bit/byte scanning operation in which the generated sample is related to the number of probability bits/bytes scanned during a sampling operation, so its timing reveals secret information about the value of the sample to an adversary. In [19], Roy et al. suggested a random shuffling method to protect the Gaussian distributed polynomial against such attacks. The random permutation is performed after all samples have been generated. The random shuffle operation swaps all samples randomly, which removes the timing information associated with individual samples. In our implementation, we adopt the previous Knuth-Yao sampler with byte-scanning [19, 11]. Afterwards, all generated samples are randomly mixed using random numbers.

Table 2 compares software implementations of 128-bit and 256-bit security lattice-based cryptosystems on 8-bit AVR processors. We compare the previous works [1, 2, 16, 11, 7] with ours. The proposed 128-bit security implementation requires 159K, 35K, and 681K cycles for NTT, sampling and encryption, respectively. Compared to the recent work [7], the NTT operation is significantly improved because we use a compact modular multiplication routine. For secure sampling, we adopted a lightweight random shuffling technique, which shows better performance than previous works. The proposed implementations run in constant time, which protects the computation against simple power analysis and timing attacks. A similar performance improvement is observed in the 256-bit case.
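The random shuffling countermeasure for the Knuth-Yao samples, described earlier in this subsection, can be sketched as follows. The source does not specify which permutation algorithm is used, so the Fisher-Yates shuffle and the use of Python's `secrets` module as the randomness source are assumptions made purely for illustration, and `shuffle_samples` is a hypothetical name.

```python
import secrets

def shuffle_samples(samples):
    """Return a randomly permuted copy of the sampled error coefficients.

    Fisher-Yates shuffle (assumption): after shuffling, the position of a
    coefficient no longer reveals when it was generated, so per-sample timing
    cannot be matched to a coefficient index.
    """
    out = list(samples)
    for i in range(len(out) - 1, 0, -1):
        j = secrets.randbelow(i + 1)     # uniform index in [0, i]
        out[i], out[j] = out[j], out[i]
    return out
```

The shuffle only permutes the coefficients, so the multiset of sampled values, and therefore the error distribution of the polynomial, is unchanged.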
Conclusion
This paper presents optimization techniques for efficient and secure implementations of the NTT and its application to Ring-LWE encryption on the low-end 8-bit AVR platform. For the secure Knuth-Yao (KY) sampler, we use the random shuffling technique to prevent side-channel attacks. The combination of the NTT and KY sampler implementations achieves new speed records for secure 128-bit and 256-bit Ring-LWE encryption on low-end 8-bit AVR platforms.
Our future work is twofold. First, we will apply the proposed techniques to other low-end IoT devices, such as 8-bit PIC and 16-bit MSP processors. These platforms likewise provide a very limited Arithmetic Logic Unit (ALU) and little memory. Second, we will further investigate side-channel attacks on implementations of Ring-LWE. Unlike for traditional RSA and ECC, only a few works have explored potential threats to implementations of Ring-LWE.
