Over recent years, lattice-based cryptography has received much attention due to versatile average-case problems like Ring-LWE or Ring-SIS that appear to be intractable by quantum computers. In this work, we evaluate and compare implementations of Ring-LWE encryption and the bimodal lattice signature scheme (BLISS) on an 8-bit Atmel ATxmega128 microcontroller. Our implementation of Ring-LWE encryption provides comprehensive protection against timing side-channels and takes 24.9 ms for encryption and 6.7 ms for decryption. Our software computes a BLISS signature in 317 ms and verifies it in 86 ms. These results underline the feasibility of lattice-based cryptography on constrained devices.
INTRODUCTION
The Rivest-Shamir-Adleman cryptosystem (RSA) and elliptic curve cryptography (ECC)-based schemes are the most popular asymmetric cryptosystems to date, being deployed in billions of security systems and applications. Despite their predominance, they are known to be susceptible to attacks using quantum computers [Shor 1994], on whose further development significant resources are spent [Rich and Gellman 2013]. Moreover, even standardization bodies like NIST [2015] and secret services like NSA's IAD [NSA 2015] acknowledge the need for discussions on the development, standardization, and deployment of efficient post-quantum public-key cryptography. Additionally, RSA and ECC have been shown to be quite inefficient on very small and constrained devices like 8-bit AVR microcontrollers [Gura et al. 2004; Hutter and Schwabe 2013]. A possible alternative is asymmetric cryptosystems based on hard problems on ideal lattices. The special algebraic structure of ideal lattices [Lyubashevsky et al. 2010a] defined in $R = \mathbb{Z}_q[x]/\langle x^n + 1\rangle$ allows a significant reduction of key and ciphertext sizes and enables efficient arithmetic using the number theoretic transform (NTT) [Nussbaumer 1982; Winkler 1996; Blahut 2010]. To realize lattice-based public key encryption, several proposals exist (see Cabarcas et al. [2014] for a comparison), like classical NTRU [Hoffstein et al. 1998] (defined in $\mathbb{Z}_q[x]/\langle x^n - 1\rangle$), provably secure NTRU [Stehlé and Steinfeld 2011] (defined in $\mathbb{Z}_q[x]/\langle x^n + 1\rangle$), or a scheme based on the ring learning with errors (RLWE) problem [Lindner and Peikert 2011; Lyubashevsky et al. 2010a] (from now on referred to as RLWEenc). From an implementation perspective, the RLWEenc scheme is currently one of the best-studied lattice-based public key encryption schemes (see de Clercq et al. [2014]; Chen et al. [2015]; Roy et al. [2014b]; Güneysu [2013, 2014]; Liu et al. [2015a]).
The scheme is also highly related to recently proposed key exchange protocols [Peikert 2014; Bos et al. 2015, 2016; Alkim et al. 2016a], which basically use key generation, encryption, and decryption of RLWEenc. When using an RLWEenc-based key exchange scheme, it is possible to rely on a so-called error reconciliation mechanism for further efficiency. This way a key can be derived that has not explicitly been chosen by one party. However, using a public key encryption scheme for key transport (i.e., one party explicitly chooses a symmetric key) is a standard technique and thus RLWEenc could also be used for post-quantum key exchange.
Concerning signature schemes, several proposals exist like GLP [Güneysu et al. 2012] (derived from Lyubashevsky [2012]), BG [Bai and Galbraith 2014], PASSSign [Hoffstein et al. 2014], a modified NTRU signature scheme [Melchor et al. 2014], or a signature scheme derived from a recently proposed IBE scheme. However, so far instantiations of the bimodal lattice signature schemes (BLISS) [Ducas et al. 2013a] seem superior in terms of signature size, performance, and security. Despite their popularity, implementation efforts so far mainly led to very efficient hardware designs for RLWEenc [Chen et al. 2015; Roy et al. 2014b] and BLISS, and to fast software on 32-bit microcontrollers [Oder et al. 2014] and 64-bit microprocessors [Ducas et al. 2013a], but only few works cover constrained 8-bit architectures [Liu et al. 2015a]. Additionally, current works usually rely on the straightforward Cooley-Tukey radix-2 decimation-in-time algorithm (e.g., Roy et al. [2014b] and de Clercq et al. [2014]) to implement the NTT and thus to realize polynomial multiplication $c = a \cdot b$ for $a, b, c \in R$ as $c = \mathrm{INTT}(\mathrm{NTT}(a) \circ \mathrm{NTT}(b))$. However, a closer look at works on the implementation [Crandall and Pomerance 2001; Chu and George 2000] of the highly related fast Fourier transform (FFT) shows that the sole focus on Cooley-Tukey radix-2 decimation-in-time algorithms prevents further optimizations of the NTT, especially given the constraints of an 8-bit architecture.
Contribution. In this work, we present different optimization techniques to increase the performance of RLWEenc and BLISS. We review different approaches and varieties of NTT algorithms. Improvements compared to previous work are mainly achieved by merging certain operations into the NTT itself (multiplication by $n^{-1}$ and by powers of $\psi$ and $\psi^{-1}$) and by removing the expensive bit-reversal step. Additionally, we propose two methods to improve the performance of modular coefficient multiplication. For $q = 7681$, we describe the "MOV-and-ADD" technique for coefficient multiplication and the SAMS2 technique for modular reduction. For $q = 12289$, we propose a constant-time tiny Montgomery multiplication based on the observation that the modulus belongs to the family of optimal prime fields (OPFs).
This work is based on two previous conference versions [Liu et al. 2015b; Pöppelmann et al. 2015]. The main additional contributions are constant-time NTTs and noise samplers as components of the (to our knowledge) first RLWEenc implementation with comprehensive protection against timing attacks. Due to our improvements using Montgomery multiplication and the SAMS2 technique, our constant-time NTTs achieve similar performance as the unprotected implementations in Liu et al. [2015b] and Pöppelmann et al. [2015]. Moreover, we show a new approach for constant-time noise generation based on a low-precision cumulative distribution table (CDT) sampler that combines the advantages of relieved precision requirements for Ring-LWE (see Saarinen [2015]) with the performance of a table-based sampling approach. This sampler even outperforms the recently proposed very simple approach using the binomial distribution (see Alkim et al. [2016a]), as it deals better with larger standard deviations. All in all, our results show that lattice-based cryptography can be used to realize the two most basic asymmetric primitives (public key encryption and signatures) on very constrained devices with high performance.
Outline. In Section 2, we review NTT, RLWEenc, and BLISS. We then show NTT algorithms that are better suited for polynomial multiplication in Section 3. Our AVR ATxmega128 implementation is described in detail in Section 4 and we discuss our results in Section 5.
BACKGROUND
In this section, we introduce the NTT and explicitly describe its application in the algorithms of the RLWEenc public key encryption scheme and the BLISS signature scheme.
The NTT and Negacyclic Convolutions
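The NTT [Nussbaumer 1982; Winkler 1996; Blahut 2010] is similar to the discrete Fourier transform (DFT), but all complex roots of unity are exchanged for integer roots of unity and arithmetic is carried out modulo an integer $q$ in the field $GF(q)$. The forward transformation $\tilde{a} = \mathrm{NTT}(a)$ of a length-$n$ sequence $(a_0, \ldots, a_{n-1})$ with elements in $\mathbb{Z}_q$ is defined as

$$\tilde{a}_i = \sum_{j=0}^{n-1} a_j \omega^{ij} \bmod q, \quad i = 0, 1, \ldots, n-1,$$

where $\omega$ is an $n$-th primitive root of unity. The inverse transform $b = \mathrm{INTT}(\tilde{a})$ is defined as

$$b_i = n^{-1} \sum_{j=0}^{n-1} \tilde{a}_j \omega^{-ij} \bmod q, \quad i = 0, 1, \ldots, n-1,$$

where $\omega$ is exchanged by $\omega^{-1}$ and the final result is scaled by $n^{-1}$. For an $n$-th primitive root of unity $\omega_n$ it holds that $\omega_n^n = 1 \bmod q$, $\omega_n^{n/2} = -1 \bmod q$, $\omega_{n/2} = \omega_n^2$, and $\omega_n^i \neq 1 \bmod q$ for any $i = 1, \ldots, n-1$.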
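The main operation in ideal lattice-based cryptography is polynomial multiplication. Schemes are usually defined in $R = \mathbb{Z}_q[x]/\langle x^n + 1\rangle$ with modulus $x^n + 1$, where $n$ is a power of 2, and one can make use of the negacyclic convolution property of the NTT, which allows carrying out a polynomial multiplication in $\mathbb{Z}_q[x]/\langle x^n + 1\rangle$ using length-$n$ transforms and no zero padding. Let $\psi$ be a square root of $\omega$ modulo $q$ (so that $\psi^n = -1 \bmod q$) and let $\mathrm{PMul}_\psi(a) = (a_0, \psi a_1, \ldots, \psi^{n-1} a_{n-1})$ denote the coefficient-wise multiplication by powers of $\psi$. The product $c = a \cdot b \in R$ is then obtained as $c = \mathrm{PMul}_{\psi^{-1}}(\mathrm{INTT}(\mathrm{NTT}(\mathrm{PMul}_\psi(a)) \circ \mathrm{NTT}(\mathrm{PMul}_\psi(b))))$ [Winkler 1996; Crandall and Fagin 1994; Crandall and Pomerance 2001], where $\circ$ denotes point-wise multiplication. For simplicity, we do not always explicitly apply $\mathrm{PMul}_\psi$ or $\mathrm{PMul}_{\psi^{-1}}$ when it is clear from the context that a negacyclic convolution is computed.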
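The transform definitions and the negacyclic convolution property can be checked with a minimal sketch. It uses a naive quadratic-time transform rather than a fast algorithm, and the toy parameters q = 17, n = 8, ψ = 3 as well as all function names are assumptions chosen for illustration; they are not parameters or code from this work.

```python
# Toy parameters (assumption, for illustration only): q = 17, n = 8.
Q = 17
N = 8
PSI = 3                    # 3^8 = -1 mod 17, so PSI is a primitive 2n-th root of unity
OMEGA = PSI * PSI % Q      # primitive n-th root of unity

def ntt(a, omega, q):
    """Naive forward transform: A_i = sum_j a_j * omega^(i*j) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(A, omega, q):
    """Naive inverse transform: omega is replaced by omega^-1, result scaled by n^-1."""
    n = len(A)
    omega_inv = pow(omega, -1, q)   # modular inverse (Python 3.8+)
    n_inv = pow(n, -1, q)
    return [n_inv * sum(A[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]

def pmul(a, psi, q):
    """PMul_psi: multiply coefficient i by psi^i."""
    return [a[i] * pow(psi, i, q) % q for i in range(len(a))]

def negacyclic_mul(a, b, psi=PSI, q=Q):
    """Multiply a and b in Z_q[x]/(x^n + 1) with length-n transforms, no zero padding."""
    omega = psi * psi % q
    fa = ntt(pmul(a, psi, q), omega, q)
    fb = ntt(pmul(b, psi, q), omega, q)
    prod = [x * y % q for x, y in zip(fa, fb)]
    return pmul(intt(prod, omega, q), pow(psi, -1, q), q)

def schoolbook_negacyclic(a, b, q=Q):
    """Reference implementation: schoolbook product reduced with x^n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c
```

Multiplying x by x^(n-1) with `negacyclic_mul` yields the constant -1 mod q, which is the wrap-around behavior that distinguishes the negacyclic from the cyclic convolution.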
The RLWEenc Cryptosystem
The semantically secure public key encryption scheme RLWEenc was proposed in Lindner and Peikert [2011] and Lyubashevsky et al. [2010a, 2010b] and is also used as a building block in an identity-based encryption (IBE) scheme. We provide the key generation procedure RLWEenc GEN in Algorithm 1, the encryption procedure RLWEenc ENC in Algorithm 2, and the decryption procedure RLWEenc DEC in Algorithm 3. All algorithms explicitly use calls to the NTT and the function names used later during the evaluation of our implementation (see Section 5). The exact placement of NTT transformations is slightly changed compared to Roy et al. [2014b], which saved one transformation compared to earlier work, as $c_2$ is not transmitted in NTT form and thus removal of least significant bits is still possible.

The main idea of the scheme is that during encryption the $n$-bit encoded message $\bar{m} = \mathrm{Encode}(m)$ is added to $p e_1 + e_3$ (in NTT notation $\mathrm{INTT}(h_2) + e_3$), which is uniformly random and thus hides the message. Decryption is only possible with knowledge of $r_2$ since otherwise the large term $a e_1 r_2$ cannot be eliminated when computing $c_1 r_2 + c_2$. The encoding of the message of length $n$ is necessary as the noise term $e_1 r_1 + e_2 r_2 + e_3$ is still present after calculating $c_1 r_2 + c_2$ and would prohibit the retrieval of the binary message after decryption. With the simple threshold encoding $\mathrm{Encode}(m) = \lfloor q/2 \rfloor \cdot m$, the value $\lfloor q/2 \rfloor$ is assigned to each binary one of $m$. Decoding tests whether a received coefficient $z \in [0, q-1]$ lies in the interval $\lfloor q/4 \rfloor \le z < \lfloor 3q/4 \rfloor$, which is interpreted as 1, and as 0 otherwise. As a consequence, the maximum error added to each coefficient must not be larger than $\lfloor q/4 \rfloor$ in order to decrypt correctly. The probability for decryption errors depends on the shape (i.e., standard deviation) of the RLWEenc error/noise distribution that is sampled by Sampler.
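The threshold encoding can be illustrated with a short sketch. It assumes that a message bit 1 is mapped to ⌊q/2⌋ and that decoding treats the interval [⌊q/4⌋, ⌊3q/4⌋) as 1; the function names and the choice q = 7681 are illustrative and not taken from the implementation described here.

```python
Q = 7681  # illustrative modulus

def encode(bits, q=Q):
    """Threshold encoding (assumed form): bit 1 -> floor(q/2), bit 0 -> 0."""
    return [(q // 2) * b for b in bits]

def decode(coeffs, q=Q):
    """A coefficient in [floor(q/4), floor(3q/4)) decodes to 1, everything else to 0."""
    lo, hi = q // 4, (3 * q) // 4
    return [1 if lo <= (z % q) < hi else 0 for z in coeffs]

def add_noise(coeffs, noise, q=Q):
    """Model the residual noise term that is still present after decryption."""
    return [(c + e) % q for c, e in zip(coeffs, noise)]
```

As long as every noise coefficient stays below ⌊q/4⌋ in absolute value, `decode(add_noise(encode(m), e))` returns `m`; a noise value just above the threshold flips the affected bit.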
The usage of the discrete Gaussian distribution for RLWEenc, which comes naturally from certain worst-case to average-case reductions, has led to some controversy regarding the required precision and tail-cut for practical and theoretical security. The motivation to move away from high-precision discrete Gaussians is that the implementation of a high-precision discrete Gaussian sampler is costly (see Bos et al. [2015]), especially when constant-time operation is required. Moreover, from a practical perspective it is extremely unlikely that any instantiation of RLWEenc using a high-precision sampler will ever sample a value close to the tail cut (i.e., $10\sigma$), as the probability, e.g., to sample 45 for $\sigma = 4.51$ is $< 2^{-75}$. Furthermore, the concrete shape of a distribution is not exploited by lattice reduction algorithms or other approaches to cryptanalysis of RLWEenc; the main parameters of a distribution are only its variance and entropy. These observations have also been made in other works [Saarinen 2015; Alkim et al. 2016a]. As an additional safeguard, Alkim et al. [2016a] provide a reduction using Rényi divergence which shows that the usage of a binomial distribution instead of a rounded continuous Gaussian will not lead to a significant advantage of an attacker when RLWEenc is used as a key exchange scheme. The security of the key exchange scheme from Alkim et al. [2016a] is also based on the RLWE problem. It is also worth noting that decreasing $\sigma$ reduces the error probability but also negatively affects the security of the scheme [Göttert et al. 2012; Lindner and Peikert 2011]. Increasing $q$, on the other hand, increases the key size and ciphertext expansion and reduces performance (on certain devices). To support the NTT, Göttert et al. [2012] proposed parameter sets $(n, q, s)$, where $\sigma = s/\sqrt{2\pi}$, denoted as RLWEenc-Ia (256, 7681, 11.31) and RLWEenc-IIa (512, 12289, 12.18).
Lindner and Peikert [2011] originally proposed the parameter sets RLWEenc-Ib (192, 4093, 8.87), RLWEenc-IIb (256, 4093, 8.35), and RLWEenc-IIIb (320, 4093, 8.00). The security levels of RLWEenc-Ia and RLWEenc-IIb are roughly comparable and RLWEenc-IIb provides 105.5 bits of pre-quantum security, according to a refined security analysis by Liu and Nguyen [2013] for standard LWE and the original parameter sets. The RLWEenc-IIa parameter set uses a larger dimension $n$ and should thus achieve even higher security than the 156.9 bits obtained by Liu and Nguyen for RLWEenc-IIIb. For the IBE scheme, the parameters $n = 512$, $q \approx 2^{23}$, and a trinary error/noise distribution are used. An approach to estimate the security of Ring-LWE based encryption against quantum computers (in a pessimistic model) is provided in Alkim et al. [2016a]. Applying this approach to our parameter sets, the estimated post-quantum security of the scheme is approximately 46 bits for RLWEenc-Ia and 104 bits for RLWEenc-IIa. Thus, for long-term security applications we recommend the parameter set RLWEenc-IIa. RLWEenc-Ia is still useful for applications that do not require post-quantum security. For RLWEenc-IIa the ciphertext size is 14,336 bits.
The BLISS Cryptosystem
In this work, we only consider the efficient ring-based instantiation of BLISS [Ducas et al. 2013a]. We recall the key generation procedure BLISS GEN in Algorithm 4, the signing procedure BLISS SIGN in Algorithm 5, and the verification procedure BLISS VER in Algorithm 6. Key generation requires uniform sampling of sparse and small polynomials $f, g$, rejection sampling ($N_\kappa(S)$), and computation of an inverse. To sign a message, two masking polynomials $y_1, y_2 \leftarrow D_{\mathbb{Z}^n,\sigma}$ are sampled from a discrete Gaussian distribution using the SampleGauss$_\sigma$ function. The computation of $a y_1$ is performed using the NTT, and the compressed $u$ is then hashed together with the message $\mu$ by Hash. The resulting binary string $c'$ is used by GenerateC to generate a sparse polynomial $c$. Polynomials $y_1, y_2$ then hide the secret key, which is multiplied with the sparse polynomial using the SparseMul function. This function exploits that only $\kappa$ coefficients in $c$ are set and only $d_1 + d_2$ coefficients in $s_1$ and $s_2$. After a rejection sampling and compression step, the signature $(z_1, z_2^{\dagger}, c)$ is returned. The verification procedure BLISS VER in Algorithm 6 just checks norms of signature components and compares the hash output with $c$ in the signature.
In this work, we focus on the 128-bit pre-quantum secure BLISS-I parameter set, which uses $n = 512$ and $q = 12289$ (same base parameters as RLWEenc-IIa). The density of the secret key is $\delta_1 = 0.3$ and $\delta_2 = 0$, the standard deviation of the coefficients of $y_1$ and $y_2$ is $\sigma = 215.73$, and the repetition rate is 1.6. The number of dropped bits in $z_2$ is $d = 10$, $\kappa = 23$, and $p = \lfloor 2q/2^d \rfloor$. The final size of the signature is 5,600 bits with Huffman encoding and approximately 7,680 bits without Huffman encoding.
FASTER NTTS FOR LATTICE-BASED CRYPTOGRAPHY
In this section, we examine fast algorithms for the computation of the NTT and show techniques to speed up polynomial multiplication for lattice-based cryptography. The most straightforward implementation of the NTT is a Cooley-Tukey radix-2 decimation-in-time (DIT) approach [Cooley and Tukey 1965] that requires a bit-reversal step, as the algorithm takes bit-reversed input and produces naturally ordered output (from now on referred to as $\mathrm{NTT}^{CT}_{bo \to no}$). To compute the NTT as defined in Section 2.1, the $\mathrm{NTT}^{CT}_{bo \to no}$ algorithm applies the Cooley-Tukey (CT) butterfly, which computes $a' \leftarrow a + \omega b$ and $b' \leftarrow a - \omega b$ for some values of $\omega, a, b \in \mathbb{Z}_q$, overall $\frac{n \log_2(n)}{2}$ times. The biggest disadvantages of relying solely on the $\mathrm{NTT}^{CT}_{bo \to no}$ algorithm are the need for bit-reversal and for multiplication by constants, and that it is impossible to merge the final multiplication by powers of $\psi^{-1}$ into the twiddle factors of the inverse NTT (see Roy et al. [2014b]). With the assumption that twiddle factors (powers of $\omega$) are stored in a table and thus not computed on-the-fly, it is possible to further simplify the computation, to remove bit-reversal, and to merge certain steps. This assumption makes sense on constrained devices like the ATxmega, which have a rather large flash memory.
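ALGORITHM 4: BLISS Key Generation (listing incomplete; one surviving step: choose uniform polynomials $f, g$ with $\delta_1 n$ entries in $\{\pm 1\}$ and $\delta_2 n$ entries in $\{\pm 2\}$)
ALGORITHM 5: BLISS Signing (listing incomplete; surviving steps: choose a random bit $b$; continue with a rejection probability)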
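Previous work uses the $\mathrm{NTT}^{CT}_{bo \to no}$ algorithm for a hardware implementation and shows how to merge the multiplication by powers of $\psi$ (see Section 2.1) into the twiddle factors of the forward transformation. However, this approach does not work for the inverse transformation due to the way the computations are performed in the CT butterfly, as the multiplication is carried out before the addition. In this section, we show that it is possible to merge the multiplication by powers of $\psi^{-1}$ during the inverse transformation using a fast decimation-in-frequency (DIF) algorithm [Gentleman and Sande 1966]. The DIF NTT algorithm splits the computation into a sub-problem on the even outputs and a sub-problem on the odd outputs of the NTT and has the same complexity as the $\mathrm{NTT}^{CT}_{bo \to no}$ algorithm. It requires usage of the so-called Gentleman-Sande (GS) butterfly, which computes $a' \leftarrow a + b$ and $b' \leftarrow (a - b)\omega$ for some values of $\omega, a, b \in \mathbb{Z}_q$. Following Chu and George [2000, Section 3.2], where $\omega_n$ is an $n$-th primitive root of unity, and ignoring the multiplication by the scalar $n^{-1}$, the inverse NTT and application of $\mathrm{PMul}_{\psi^{-1}}$ can be defined as

$$a[r] = \psi^{-r} \sum_{\ell=0}^{n-1} A[\ell]\, \omega_n^{-r\ell} = \psi^{-r} \sum_{\ell=0}^{n/2-1} \left( A[\ell] + A[\ell + \tfrac{n}{2}]\, \omega_n^{-rn/2} \right) \omega_n^{-r\ell}, \quad r = 0, 1, \ldots, n-1.$$

When $r$ is even ($r = 2k$) this results in

$$a[2k] = (\psi^2)^{-k} \sum_{\ell=0}^{n/2-1} \left( A[\ell] + A[\ell + \tfrac{n}{2}] \right) (\omega_n^2)^{-k\ell}$$

and for odd $r$ ($r = 2k+1$) in

$$a[2k+1] = (\psi^2)^{-k} \sum_{\ell=0}^{n/2-1} \left( \left( A[\ell] - A[\ell + \tfrac{n}{2}] \right) \psi^{-1} \omega_n^{-\ell} \right) (\omega_n^2)^{-k\ell},$$

since $\omega_n^{-rn/2}$ equals $1$ for even $r$ and $-1$ for odd $r$.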
The two new half-size sub-problems, where $\psi$ is exchanged by $\psi^2$, can now again be solved using the recursion. As a consequence, when using an in-place radix-2 DIF algorithm it is necessary to multiply all twiddle factors in the first stage by $\psi^{-1}$, all twiddle factors in the second stage by $\psi^{-2}$, and in general by $\psi^{-2^s}$ for stage $s \in \{0, 1, \ldots, \log_2(n) - 1\}$ to merge the multiplication by powers of $\psi^{-1}$ into the inverse NTT (see Figure 1 for an illustration). In case the $\mathrm{PMul}_\psi$ or $\mathrm{PMul}_{\psi^{-1}}$ operation is merged into the NTT computation, we denote this by an additional superscript $\psi$ or $\psi^{-1}$, e.g., as $\mathrm{NTT}^{CT,\psi}_{bo \to no}$.
Removing Bit-Reversal
For memory-efficient and in-place computation, a reordering or so-called bit-reversal step is usually applied before or after an NTT/FFT transformation due to the required reversed input ordering of the NTT (see, e.g., de Clercq et al. [2014]). However, by manipulation of the standard iterative algorithms and independently of the used butterfly (CT or GS), it is possible to derive natural order to bit-reversed order ($no \to bo$) as well as bit-reversed to natural order ($bo \to no$) forward and inverse algorithms. The derivation of FFT algorithms with a desired ordering of inputs and outputs is described in Chu and George [2000] and we followed this description to derive the NTT algorithms $\mathrm{NTT}^{CT}_{bo \to no}$, $\mathrm{NTT}^{CT}_{no \to bo}$, $\mathrm{NTT}^{GS}_{bo \to no}$, and $\mathrm{NTT}^{GS}_{no \to bo}$, as well as their respective inverse counterparts. It is also possible to construct self-sorting NTTs ($no \to no$), but in this case the structure becomes irregular and temporary memory is required (see Chu and George [2000]).
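The combination of both ideas, merging the powers of ψ into the twiddle factors and avoiding bit-reversal, can be sketched as follows. The forward transform uses CT butterflies on naturally ordered input and leaves its output in bit-reversed order; the inverse uses GS butterflies on bit-reversed input and returns natural order. This is a well-known formulation of such transforms written for illustration; the table layout, the helper names, and the root search are assumptions and do not reproduce the assembly-optimized routines of this work.

```python
def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of k."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def find_psi(n, q):
    """Smallest psi with psi^n = -1 mod q (a primitive 2n-th root, n a power of 2)."""
    for g in range(2, q):
        if pow(g, n, q) == q - 1:
            return g
    raise ValueError("q does not support a negacyclic NTT of length n")

def make_tables(n, q):
    """Twiddle tables holding psi^brv(k) and psi^-brv(k); powers of psi are merged in."""
    bits = n.bit_length() - 1
    psi = find_psi(n, q)
    psi_inv = pow(psi, -1, q)
    fwd = [pow(psi, bit_reverse(k, bits), q) for k in range(n)]
    inv = [pow(psi_inv, bit_reverse(k, bits), q) for k in range(n)]
    return fwd, inv

def ntt_ct_no_bo(a, fwd, q):
    """Forward transform: CT butterflies, natural-order input, bit-reversed output."""
    a, n = list(a), len(a)
    t, m = n, 1
    while m < n:
        t //= 2
        for i in range(m):
            s = fwd[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * s % q          # multiply, then add/subtract
                a[j], a[j + t] = (u + v) % q, (u - v) % q
        m *= 2
    return a

def intt_gs_bo_no(a, inv, q):
    """Inverse transform: GS butterflies, bit-reversed input, natural-order output."""
    a, n = list(a), len(a)
    t, m = 1, n
    while m > 1:
        h = m // 2
        for i in range(h):
            s = inv[h + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % q, (u - v) * s % q   # add/subtract, then multiply
        t *= 2
        m = h
    n_inv = pow(n, -1, q)   # could instead be folded into a pre-transformed constant
    return [x * n_inv % q for x in a]

def negacyclic_mul(a, b, q):
    """Product in Z_q[x]/(x^n + 1) without any explicit PMul or bit-reversal step."""
    fwd, inv = make_tables(len(a), q)
    fa, fb = ntt_ct_no_bo(a, fwd, q), ntt_ct_no_bo(b, fwd, q)
    return intt_gs_bo_no([x * y % q for x, y in zip(fa, fb)], inv, q)
```

Because the point-wise product is order-agnostic, the bit-reversed output of the forward transform can be fed directly into the inverse transform, so no reordering pass is needed anywhere in the multiplication.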
Tuning for Lattice-Based Cryptography
The optimizations discussed in this section so far can be used to generically optimize polynomial multiplication in $\mathbb{Z}_q[x]/\langle x^n + 1\rangle$. However, for lattice-based cryptography there are special conditions that hold for most practical algorithms: in the NTT-enabled algorithms of RLWEenc and BLISS, every point-wise multiplication (denoted by $\circ$) is performed with a constant and a variable, usually a randomly sampled polynomial. Thus, the most common operation in lattice-based cryptography is not simple polynomial multiplication but multiplication of a (usually random) polynomial by a constant polynomial (i.e., global constant or public key). Consequently, the scaling factor $n^{-1}$ can be multiplied into the pre-computed and pre-transformed constant.
Taking into account that we also want to remove the need for bit-reversal and want to merge the multiplication by powers of $\psi$ into the forward and inverse transformation (as discussed in Section 3.1), we propose to use an $\mathrm{NTT}^{CT,\psi}_{no \to bo}$ for the forward transformation and an $\mathrm{NTT}^{GS,\psi^{-1}}_{bo \to no}$ for the inverse transformation. A further alternative was not considered for our implementation, given its performance on 32-bit platforms (and thus even more on 8-bit ones) as stated in Alkim et al. [2016b].
IMPLEMENTATION OF LATTICE-BASED CRYPTOGRAPHY ON ATXMEGA128
In this section, we provide details on our implementation of the NTT as well as RLWEenc and BLISS on the ATxmega128. Our implementation of RLWEenc has a constant execution time and is therefore protected against timing side-channels, which is required for most practical and interactive applications. However, sampling in ideal lattice-based signature schemes like BLISS requires a much higher standard deviation and sampling precision. It is still an open question how to efficiently perform constant-time sampling from these distributions. For BLISS, we therefore focus solely on performance and, to a lesser extent, on a small memory and code footprint in this work.

For the NTT, we use the $\mathrm{NTT}^{CT,\psi}_{no \to bo}$ and $\mathrm{NTT}^{GS,\psi^{-1}}_{bo \to no}$ transformations as described in Section 3. We implemented both algorithms in C and optimized the modular multiplication and butterfly operations using assembly language.
MOV-and-ADD + SAMS2.
We propose an optimized technique for performing coefficient multiplication called "MOV-and-ADD." The MOV-and-ADD multiplication technique resembles the traditional double-precision multiplication, but calculates the byte products in a more efficient order to reduce the number of adc instructions. As shown in Figure 3, we split the computation process into four blocks of byte multiplications.

Inspired by prior work, we also propose an optimized reduction technique called "SAMS2," which consists of only four different basic operations, namely shifting, addition, multiplication, and subtraction. We use Figure 4 to describe the process of SAMS2. The input of SAMS2 is the product $t = a \cdot b$ obtained from the MOV-and-ADD method, where $t$ is kept in four 8-bit registers (r3, r2, r1, r0), marked by different colors in the figure. The reduction modulo 7681 using the SAMS2 approach is performed as follows. In step 1, we obtain the values $t \gg 13$, $t \gg 17$, and $t \gg 21$ by shifting: we right shift r3, r2, r1 by one bit and store the intermediate result of r3, r2 in t0 (i.e., the value of $t \gg 17$); after that, we right shift by 4 bits to get the results t1 (i.e., the value of $t \gg 13$) and t2 (i.e., the value of $t \gg 21$). In step 2, we obtain the approximate quotient $h$ by the addition t0 + t1 + t2. The sum fits into 16 bits and can thus be kept in two registers. In step 3, we calculate $h \cdot q$ by multiplication. Since the lower byte of 7681 is 1, a 16 × 8-bit multiplication and several addition operations are sufficient. In step 4, we subtract both $h$ and the product obtained in step 3 from $t$. The result of step 4 may still be larger than $q = 7681$; thus, we finally subtract the modulus $q$ at most two times.
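The arithmetic behind SAMS2 can be sketched in a few lines. The sketch follows the shift, add, multiply, subtract sequence described above for q = 7681 = 2^13 - 2^9 + 1, but it operates on Python integers instead of 8-bit registers, is not constant time, and uses a loop for the final correction instead of relying on a fixed number of subtractions. The function name is an assumption.

```python
Q = 7681  # = 2^13 - 2^9 + 1 = 0x1E01

def sams2_reduce(t, q=Q):
    """Reduce a product t < 2^26 modulo 7681 (illustrative, not constant time)."""
    # Shifting and addition: 1/q is approximated by 2^-13 + 2^-17 + 2^-21,
    # so h slightly underestimates the true quotient floor(t / q).
    h = (t >> 13) + (t >> 17) + (t >> 21)
    # Multiplication: the low byte of q is 0x01, hence h*q = ((h * 0x1E) << 8) + h.
    high_part = (h * 0x1E) << 8
    # Subtraction: remove h*q from t by subtracting h and the shifted product.
    r = t - h - high_part
    # Final correction: h never overestimates, so r >= 0 and only
    # subtractions of q are needed to reach the range [0, q - 1].
    while r >= q:
        r -= q
    return r
```

Since the quotient estimate is always a lower bound, no correction by adding q is ever required, which is what keeps the final step down to a small number of subtractions.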
A constant-time modular coefficient multiplication using a combination of MOV-and-ADD and SAMS2 requires 63 and 94 cycles for $q = 7681$ and $q = 12289$, respectively.
Tiny Montgomery Multiplication.
We propose a constant-time tiny Montgomery multiplication for computing coefficient multiplication. As shown in Algorithm 7, the tiny Montgomery multiplication can be computed in four steps. The first step (line 2) is to compute the product of coefficients $t = a \cdot b$ by using the proposed MOV-and-ADD method. Then, we only compute the lower 16 bits of $t \cdot q'$ and get the value of $s$ by a masking operation (line 3). In line 4, we first compute the coefficient multiplication $s \cdot q$ and add the result to $t$ (kept in four registers); after that, we compute $z$ by using several shifting operations. Finally, in lines 5 and 6, two conditional subtractions are necessary to make sure the result is in $[0, q - 1]$. In summary, the tiny Montgomery multiplication in Algorithm 7 requires 73 clock cycles and runs in constant time. This result outperforms the Barrett reduction [Barrett 1987] (roughly 600 cycles) and the shift-and-add approach (roughly 216 cycles) by a factor of roughly 8.2 and 3, respectively.
ALGORITHM 7: Tiny Montgomery Multiplication for q = 7681
Precondition: 13-bit modulus $q = 7681$, Montgomery radix $R = 2^{13}$, (incomplete) coefficients $a, b \in [0, 2^{13} - 1]$, pre-computed constant $q' = -q^{-1} \bmod 2^{13} = 7679$
1: function TinyMontgomeryMul(a, b)
2: $t \leftarrow a \cdot b$
3: $s \leftarrow (t \cdot q') \bmod 2^{13}$
4: $z \leftarrow (t + s \cdot q) / 2^{13}$
5: if $z \ge q$ then $z \leftarrow z - q$
6: if $z \ge q$ then $z \leftarrow z - q$
7: return z
8: end function

Since the Montgomery multiplication needs one transformation into the Montgomery domain at the beginning of the $\mathrm{NTT}^{CT}_{no \to bo}$ and one backwards transformation at the end of the $\mathrm{NTT}^{GS}_{bo \to no}$, the MOV-and-ADD multiplication with subsequent SAMS2 reduction is still faster for $q = 7681$. For $q = 12289$, a Montgomery coefficient multiplication can be computed within 70 cycles; three cycles are saved because fewer shift operations are needed. For this modulus, MOV-and-ADD with SAMS2 (94 cycles) is therefore outperformed even when the transformations to and from the Montgomery domain are taken into account.
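The four steps of the tiny Montgomery multiplication can be modeled as follows, assuming the radix R = 2^13 and the constant q' = 7679 from the precondition of Algorithm 7. The sketch works on Python integers, so it illustrates the arithmetic only and makes no timing guarantees; the mask-based corrections merely mimic how branch-free conditional subtractions are usually written. All function names are hypothetical.

```python
Q = 7681
R_BITS = 13                 # Montgomery radix R = 2^13 (as stated in the precondition)
R_MASK = (1 << R_BITS) - 1
Q_PRIME = 7679              # -q^-1 mod 2^13

def tiny_montgomery_mul(a, b):
    """Return a * b * R^-1 mod q for a, b in [0, 2^13 - 1]."""
    t = a * b                          # product, at most 26 bits
    s = (t * Q_PRIME) & R_MASK         # only the low bits of t * q' are needed
    z = (t + s * Q) >> R_BITS          # t + s*q is divisible by R, so the shift is exact
    for _ in range(2):                 # z < 3q, so two corrections suffice
        z -= Q
        z += Q & -(z < 0)              # add q back if the subtraction went negative
    return z

def to_montgomery(a):
    """Map a to a * R mod q (done once before the forward NTT)."""
    return (a << R_BITS) % Q

def from_montgomery(a):
    """Map a * R mod q back to a (done once after the inverse NTT)."""
    return tiny_montgomery_mul(a, 1)
```

A product of two values in the Montgomery domain stays in the Montgomery domain, which is why only one conversion at the start of the forward transform and one at the end of the inverse transform are needed.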
Usage of Look-Up Tables for Narrow Input Distributions.
As discussed in Section 3.3, it is common in lattice-based cryptography to apply forward transformations mostly to values sampled from a narrow binomial error/noise distribution (other polynomials are usually constants and pre-computed). In this case, only a limited number of possible inputs to the butterfly of the first stage of the $\mathrm{NTT}^{CT,\psi}_{no \to bo}$ transformation exist and it is possible to pre-compute look-up tables to speed up the modular multiplication. Especially for RLWEenc, the range of possible input coefficients to the first stage butterfly is rather small, since they are Gaussian distributed with a low standard deviation.

Implementation of LP-RLWE

The main operations of RLWEenc are the NTT transformations ($\mathrm{NTT}^{CT,\psi}_{no \to bo}$ and $\mathrm{NTT}^{GS,\psi^{-1}}_{bo \to no}$), Gaussian sampling (Sampler), and point-wise multiplication (PwMulFlash, also denoted as $\circ$). We assume that secret and public keys are stored in the flash memory, but loading from RAM would also be possible, and probably even faster. Due to the usage of the NTT, the $\mathrm{NTT}^{CT,\psi}_{no \to bo}$ transformation is only applied to the small noise polynomials $e_1, e_2$. Thus, we can optimize the transformation for this input distribution and it is possible to substitute all multiplications in stage one of the $\mathrm{NTT}^{CT,\psi}_{no \to bo}$ by table look-ups.

4.2.1. Sampler CDT-high: Non-Constant Time High-Precision CDT. For the sampling of the Gaussian distributed polynomials with high precision, our first approach is to use a CDT [Devroye 1986; Peikert 2010]. We construct the table $M$ with entries $p_z = \Pr(x \le z : x \leftarrow D_\sigma)$ for $z \in [0, \tau\sigma]$ with a precision of $\lambda = 96$ bits. The tail-cut factor $\tau$ determines the number of lines $\lceil \tau\sigma \rceil$ of the table and reduces the output to the range $x \in \{-\lceil \tau\sigma \rceil, \ldots, \lceil \tau\sigma \rceil\}$. To sample a value, we choose a uniformly random $y$ from the interval $[0, 1)$ and a bit $b$ and return the integer $(-1)^b z \in \mathbb{Z}$ such that $y \in [p_{z-1}, p_z)$. We store only the positive half of the table and sample the sign bit separately. For this, the probability of sampling zero has been pre-halved when constructing the table.

The constant CDF matrix $M$ is stored in the flash memory with $k = \lceil \sigma\tau \rceil$ rows and $l = \lambda/8$ columns. Implementing the CDT approach with high precision to run in constant time is quite expensive, as shown in Bos et al. [2015].

A second non-constant time sampler is based on the Knuth-Yao algorithm [Knuth and Yao 1976]. In order to achieve a precision of $2^{-90}$ for dimension $n = 256$, the Knuth-Yao algorithm requires a probability matrix $P_{mat}$ of 55 rows and 109 columns [de Clercq et al. 2014]. Our implementation stores each column (i.e., 55 bits) in 7 bytes and the probability matrix occupies 763 bytes in total. A straightforward implementation using the bit-scanning method requires one to check each bit and to decrease the distance ($d$) whenever the bit is set. Instead, we perform the algorithm in a byte-scanning fashion, which compares the eight concatenated bits of each byte and only performs the scanning operations for non-zero bytes. To speed up the computation, we apply a look-up table (LUT)-based approach to our byte-wise scanning implementation [Roy et al. 2014a]. This is done in three levels. The first level is an LUT that maps eight random bits into a sample or into an intermediate node. If the LUT outputs a sample point, then the sampling operation stops. Otherwise, the sampling operation visits the second level, which is another LUT. This second LUT maps five random bits into a sample point with high probability, or into an intermediate node with small probability. When the second LUT returns an intermediate node, the slow byte-scanning operation is performed. However, this happens only with a small probability.
Sampler Bio-low : Constant Time Binomial Sampler.
In Alkim et al. [2016a] it was shown that the high-precision discrete Gaussian distribution used in a key exchange scheme similar to RLWEenc can be replaced by a centered binomial distribution without significantly hurting security. It is possible to sample from the centered binomial distribution by computing $\sum_{i=0}^{k-1} (b_i - b'_i)$, where the $b_i, b'_i \in \{0, 1\}$ are uniform independent bits. In this case, the variance is $k/2$, and to achieve the required standard deviation of $\sigma = 4.51$ (RLWEenc-Ia) or $\sigma = 4.86$ (RLWEenc-IIa), $k$ has to be set to 41 or 48, respectively. The implementation of the binomial sampler is very straightforward. For $k = 48$ we need exactly 12 bytes of randomness to sample one coefficient. This huge randomness consumption unfortunately leads to a performance penalty. On the other hand, a constant execution time and no need for precomputed tables are advantages of the binomial sampler.
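A centered binomial sampler of this kind can be sketched as follows. The bit layout (the low k bits as the b_i, the next k bits as the b'_i), the function names, and the use of `os.urandom` as the randomness source are assumptions made for illustration; the implementation described in this work draws its randomness from an AES-based PRNG instead.

```python
import os

def centered_binomial(rand_bytes, k):
    """Return sum_{i<k} (b_i - b'_i) computed from the first 2k bits of rand_bytes."""
    if len(rand_bytes) * 8 < 2 * k:
        raise ValueError("need at least 2k random bits")
    x = int.from_bytes(rand_bytes, "little")
    mask = (1 << k) - 1
    b = x & mask                 # bits b_0 .. b_{k-1}
    b_prime = (x >> k) & mask    # bits b'_0 .. b'_{k-1}
    # The difference of the two Hamming weights is centered at 0 with variance k/2.
    return bin(b).count("1") - bin(b_prime).count("1")

def sample_poly(n, k):
    """Sample n coefficients; k = 48 consumes exactly 12 bytes per coefficient."""
    nbytes = (2 * k + 7) // 8
    return [centered_binomial(os.urandom(nbytes), k) for _ in range(n)]
```

For k = 48 each coefficient consumes 96 random bits, which makes the randomness cost mentioned above concrete: a polynomial with n = 512 coefficients needs 6,144 bytes.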
Sampler CDT-low : Constant Time Medium-Precision CDT.
A more efficient approach to sample in constant time is a cumulative distribution table with two bytes of precision. The same precision is used in the lattice-based key exchange called Frodo [Bos et al. 2016]. As elaborated in Section 2.2, and described in works like Saarinen [2015], a sampler with such a precision is not expected to lower the security of the scheme significantly. To avoid comparing the input with each entry in the CDT, we also make use of a guide table [Devroye 1986] with a size of 256 bytes. This table maps one input byte onto an entry in the CDT. Now we can reduce the number of comparisons to the maximum number of CDT entries that share the same first byte value. The result of the guide table look-up is then used as a starting point for those comparisons. For $n = 256$, we can reduce the number of comparisons from 20 to 7 by this technique, since seven table entries have a first byte value of 0xFF.
The sampler must also keep track of how many bytes of randomness were used. The sampler always needs one bit for the sign and one byte for the comparisons with the table entries. In case this input byte matches one of the table entries, the sampler needs another byte. Therefore, after the first comparison, we check whether the first input byte and the starting table entry were equal or not. Since we are using a guide table approach, it suffices to perform this check after the first comparison only because subsequent table entries can only be equal to the input if the first table entry already was.
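The basic idea of a two-byte CDT sampler can be sketched as below. The sketch builds a 16-bit table for the positive half of the distribution with the probability of zero pre-halved, and always compares the input against every table entry so that the number of comparisons does not depend on the input. It omits the guide table and the randomness bookkeeping described above, Python itself gives no timing guarantees, and the table size of 20 entries for σ = 4.51 is an assumption for illustration.

```python
import math

def build_cdt(sigma, z_max, bits=16):
    """Cumulative thresholds for z = 0 .. z_max - 1, scaled to `bits` bits."""
    weights = [math.exp(-z * z / (2 * sigma * sigma)) for z in range(z_max + 1)]
    weights[0] /= 2              # zero is reachable with either sign bit
    total = sum(weights)
    table, acc = [], 0.0
    for w in weights[:-1]:       # the last cumulative value would be 2^bits
        acc += w
        table.append(int(acc / total * (1 << bits)))
    return table

def sample_cdt(table, y, sign_bit):
    """Map a uniform 16-bit value y and a sign bit to a sample.

    Every table entry is compared, so the work is independent of y.
    """
    z = 0
    for threshold in table:
        z += int(y >= threshold)  # count how many thresholds y has passed
    return -z if sign_bit else z
```

A guide table as described above would replace the full scan by a look-up on the first input byte followed by a short, fixed-length run of comparisons.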
Random numbers for the Gaussian sampling are obtained from a pseudo-random number generator (PRNG) using the hardware AES-128 engine running in counter mode. The PRNG is seeded by noise from the least significant bit of the analog-to-digital converter. For the state (key, plaintext) 32 bytes of statically allocated memory are necessary. At the beginning of each encryption we fill a randomness buffer by running the PRNG multiple times in succession. Our Gaussian sampler needs one bit to determine the sign and, with a probability of 242/256 (n = 256), one byte of randomness for the table look-up and two bytes otherwise. Therefore, we have to set the buffer size such that the probability of not providing enough randomness is negligible. In case every sampler run only requires one byte of randomness, the total amount of randomness used is $3n + 3n/8$ bytes. To leave room for the runs that need a second byte, we set the buffer size to $3n + 3n/8 + 144 = 1{,}008$ bytes for n = 256. For n = 512, the buffer size is $3n + 3n/8 + 224 = 1{,}952$ bytes. Even though in many cases we do not need a second byte, we always perform a two-byte comparison to make the sampler run in constant time. In case the output of the sampler is only determined by the first byte and the second byte has no influence on the result, we reuse the second byte in the next sampling operation.
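The buffer sizes can be checked with a short calculation. The sketch below is an illustration under stated assumptions: it treats the 3n sampler runs as independent, takes 14/256 as the probability that a run consumes a second byte (the complement of 242/256 for n = 256), and interprets the margin of 144 bytes as the number of second bytes that can be served. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def base_bytes(n):
    """Bytes consumed if every run needs one table byte plus one sign bit."""
    return 3 * n + (3 * n) // 8

def buffer_bytes(n, margin):
    """Buffer size: base consumption plus a margin for second bytes."""
    return base_bytes(n) + margin

def overflow_probability(n, margin, p_num=14, p_den=256):
    """P[more than `margin` of the 3n runs need a second byte], assuming independence."""
    runs = 3 * n
    hits = sum(comb(runs, k) * p_num**k * (p_den - p_num)**(runs - k)
               for k in range(margin + 1, runs + 1))
    return Fraction(hits, p_den**runs)
```

For example, `buffer_bytes(256, 144)` and `buffer_bytes(512, 224)` reproduce the two buffer sizes stated above, and `overflow_probability(256, 144)` gives an exact rational bound under the independence assumption.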
4.2.5. Constant-Time Arithmetic. To create a constant-time and still efficient implementation, we implemented all core operations in assembly. Additions and subtractions are computed mod q and therefore usually contain a conditional subtraction or addition of q. To make this operation unconditional, we follow the standard approach of constant-time implementation: we subtract or add q, check whether a carry appeared, and create a mask out of the carry. We then compute the AND-conjunction of the mask and q and, where necessary, reverse the subtraction/addition by adding/subtracting the masked value of q. By doing so, the executed operations are always the same and the result is still correctly reduced mod q. For the multiplication in the NTT butterfly operation, we used the SAMS2 algorithm for q = 7,681 and the Montgomery multiplication for q = 12,289. Besides the multiplication, the butterfly operations of the NTT always consist of one addition and one subtraction with subsequent reduction. For efficiency reasons we also created constant-time assembly functions for threshold encoding and decoding. For the encoding, we load one byte of the message, shift it to the right, and again use the carry to create a mask. We then proceed with the standard method as described above and repeat the procedure for each message bit. For decoding, we load one element of the ciphertext polynomial and subtract the lower decoding threshold. If a carry appears, we already know the decoded message bit has to be 0. If no carry appears, we proceed with subtracting a second threshold value. This time a carry indicates that the decoded message bit has to be 1.
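The mask-based reduction and the threshold encoding can be sketched in Python by emulating 16-bit registers. This is an illustration, not the assembly code: the top bit of the 16-bit result stands in for the AVR carry flag, and the thresholds ⌊q/4⌋ and ⌊q/2⌋ as well as all function names are assumptions chosen to match common threshold decoding.

```python
Q = 7681
MASK16 = 0xFFFF

def _fix_borrow(t):
    """Add q back iff the 16-bit value t wrapped around (top bit set)."""
    borrow = (t >> 15) & 1
    mask = (-borrow) & MASK16        # 0xFFFF on borrow, 0x0000 otherwise
    return (t + (Q & mask)) & MASK16

def add_mod_q(a, b):
    """a + b mod q for a, b in [0, q): always subtract q, then undo via the mask."""
    return _fix_borrow((a + b - Q) & MASK16)

def sub_mod_q(a, b):
    """a - b mod q for a, b in [0, q): always subtract, then add masked q."""
    return _fix_borrow((a - b) & MASK16)

def encode_bit(bit):
    """Encode a message bit as 0 or floor(q/2) without branching."""
    mask = (-bit) & MASK16
    return (Q // 2) & mask

def decode_coeff(c):
    """Decode to 1 iff floor(q/4) <= c < floor(q/4) + floor(q/2) (assumed thresholds)."""
    t1 = (c - Q // 4) & MASK16
    below = (t1 >> 15) & 1           # borrow: c lies below the lower threshold
    t2 = (t1 - Q // 2) & MASK16
    inside = (t2 >> 15) & 1          # borrow: c lies below the upper threshold
    return inside & (below ^ 1)
```

Every call executes the same sequence of operations regardless of its inputs, which is the point of the masking idiom.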
Implementation of BLISS
Besides polynomial arithmetic, the most expensive operation for BLISS is the sampling of y_1 and y_2 from a discrete Gaussian distribution (SampleGauss_σ). For the rather large standard deviation of σ = 215.73 (RLWEenc requires only σ = 4.86), a straightforward CDT sampling approach, even with binary search, would lead to a large table with roughly τσ = 2,798 entries of approximately 30-40 kB overall. Another option for embedded devices would be the Bernoulli approach from Ducas et al. [2013a], but the performance of 13,151,929 cycles to sample one polynomial reported for an existing implementation would cause a massive performance penalty. As a consequence, we implemented a hardware-optimized sampler from prior work on the ATxmega. It uses the convolution property of Gaussians combined with Kullback-Leibler divergence and mainly exploits that it is possible to sample a Gaussian distributed value with variance σ^2 by sampling x_1, x_2 from a Gaussian distribution with smaller standard deviation σ' such that $\sigma'^2 + k^2\sigma'^2 = \sigma^2$ and combining them as x_1 + kx_2 (for BLISS-I, k = 11 and σ' = 19.53). Additionally, the performance of the σ'-sampler is improved by the use of short-cut intervals where each possible value of the first byte of the uniformly distributed input is assigned to an interval that specifies the range of the possible sampled values. This approach reduces the number of necessary comparisons and nearly compensates for the additional costs incurred by the requirement to sample two values (x_1, x_2 with σ' = 19.53) instead of one directly (with σ = 215.73). The sampling is again performed in a lazy manner and we use the same PRNG based on AES-128 as for RLWEenc.
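The convolution idea can be checked numerically with a small Python sketch. The rejection sampler `sample_small` is only a stand-in for the table-based sampler with short-cut intervals and is neither constant time nor efficient; all names are hypothetical. The sketch shows that combining two samples with standard deviation 19.53 as x_1 + 11·x_2 yields a standard deviation close to 215.73.

```python
import math
import random

SIGMA_PRIME = 19.53
K = 11

def combined_sigma(sigma_prime, k):
    """Standard deviation of x1 + k*x2 for independent x1, x2 with std sigma_prime."""
    return sigma_prime * math.sqrt(1 + k * k)

def sample_small(rng, sigma=SIGMA_PRIME, tail=10):
    """Stand-in discrete Gaussian sampler (plain rejection sampling)."""
    bound = math.ceil(tail * sigma)
    while True:
        x = rng.randint(-bound, bound)
        if rng.random() < math.exp(-x * x / (2 * sigma * sigma)):
            return x

def sample_large(rng):
    """One sample with the large standard deviation via x1 + k*x2."""
    return sample_small(rng) + K * sample_small(rng)

def empirical_std(samples):
    """Root mean square of zero-centered samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))
```

A call such as `empirical_std([sample_large(random.Random(1)) ...])` over a few thousand samples can be compared against `combined_sigma(SIGMA_PRIME, K)`.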
To implement the NTT^{CT}_{no→bo} we did not use a look-up table for the first stage as the input range [−στ, στ] of y_1 or z_1 is too large. For the instantiation of the random oracle (Hash) that is required during signing and verification, we have chosen the official AVR implementation of Keccak [Bertoni et al. 2012]. From the output of the hash function the sparse polynomial c with κ coefficients equal to one is generated by the GenerateC routine (see Ducas et al. [2013b, Section 4.4]). We store only the κ indices where a coefficient of c is one. This reduces the dynamic RAM consumption and allows a more efficient implementation of the multiplication of c by s_1 and s_2 using the SparseMul routine. By using column-wise multiplication and by ignoring all zero coefficients, the multiplication can be performed more efficiently than with the NTT.
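A column-wise sparse multiplication can be sketched as follows, assuming that the product is taken in Z[x]/(x^n + 1) and that c is given only by the indices of its nonzero (unit) coefficients. The function names are hypothetical, and the dense routine serves only as a reference for checking the sparse one.

```python
def sparse_mul(c_indices, s):
    """Multiply sparse c (ones at c_indices) by s in Z[x]/(x^n + 1), column by column."""
    n = len(s)
    out = []
    for k in range(n):                 # one output coefficient per column
        acc = 0
        for i in c_indices:            # only the nonzero coefficients of c
            j = k - i
            # x^i * x^j wraps around with a sign flip because x^n = -1
            acc += s[j] if j >= 0 else -s[j + n]
        out.append(acc)
    return out

def negacyclic_mul(a, b):
    """Dense schoolbook reference multiplication in Z[x]/(x^n + 1)."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < n:
                out[i + j] += ai * bj
            else:
                out[i + j - n] -= ai * bj
    return out
```

The sparse routine needs κ·n additions and no multiplications, which is why it can compete with an NTT-based product when κ is small.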
RESULTS AND COMPARISON
All implementations are measured on an 8-bit ATxmega128A1 microcontroller running at 32MHz and featuring 128kB flash memory, 8kB RAM, and 2kB EEPROM. Cycle-accurate performance measurements were obtained using two coupled 16-bit timer/counters, and dynamic RAM consumption is measured using stack canaries (see Pöppelmann et al. [2015] for more details). All public and private keys are assumed to be stored in the flash of the microcontroller and we consider the .text + .data + .bootloader sections to determine the flash memory utilization. For our implementation we used no calls to the standard library, the avr-gcc compiler in version 4.7.0, and the following compiler options (shortened): -Os -fpack-struct -ffunction-sections -fdata-sections -flto.
RLWEenc. Detailed cycle counts for the encryption and decryption as well as the most expensive operations are given in Table I. Note that the stack usage is divided into a fixed amount of memory necessary for plaintext, ciphertext, and additional components (like random number generation) and the dynamic consumption of the encryption and decryption routine. We encrypt a message of n bits. Among the listed operations are the pointwise multiplication with a key stored in the flash (PwMulFlash) and message encoding (Encode). Decryption is extremely simple and fast: it basically calls INTT_{bo→no}, the decoding, and an addition, so that roughly 149 decryption operations could be performed per second on the ATxmega128.
The NTT can be performed in place so that no additional large temporary memory on the stack is needed. But storing the NTT twiddle factors for forward and inverse transforms in flash consumes 2n words = 4n bytes, which is around 16% of the allocated flash memory for q = 7,681 and around 22% for q = 12,289.

BLISS. In Table II, we present detailed cycle counts for signing and verifying as well as for the most expensive operations in BLISS-I. Due to the rejection sampling and the chosen parameter set, 1.6 signing attempts are required on average to create one signature. One attempt requires 6,341,763 cycles on average and only a small portion of the computation, i.e., the hashing of the message, does not have to be repeated in case of a rejection. During a signing attempt the most expensive operation is the sampling of two Gaussian distributed polynomials, which takes 2 × 1,141,007 = 2,282,014 cycles (36% of the overall cycles). The calls to NTT_{bo→no} account for 15.2% of the overall cycles of one attempt. In contrast to the RLWEenc implementation, we do not use the constant-time Montgomery approach and therefore the performance of the NTT^{CT,ψ}_{no→bo} is slightly different compared to RLWEenc. Hashing the compressed u and the message μ is time-consuming and accounts for roughly 21% of the overall cycles during one attempt. Savings would definitely be possible by using a different hash function (see Balasch et al. [2012] for an evaluation of different functions), but Keccak appears to be a conservative choice that matches the 128-bit security target very well. The sparse multiplication takes only 504,023 cycles for one multiplication, which is preferable to computing the product with multiple NTT transformations. The flash memory consumption includes 2n words, which equals 4n = 2,048 bytes, for the NTT twiddle factors and 3,374 bytes for look-up tables of the sampler.

Note on Table II: the stack usage is divided into a fixed amount of memory necessary for message, signature, and additional components (like random number generation) and the dynamic consumption of the signing and verification routine. We sign a message of n bits.
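The cost figures quoted for one BLISS-I signing attempt can be cross-checked with a few lines of Python. The sketch takes 6,341,763 cycles per attempt, 1.6 attempts on average, and a 32MHz clock from the text; the product slightly overestimates the real cost because the hashing of the message is not repeated after a rejection. The constant and function names are hypothetical.

```python
F_CPU_HZ = 32_000_000
CYCLES_PER_ATTEMPT = 6_341_763
AVG_ATTEMPTS = 1.6
CYCLES_GAUSS_POLY = 1_141_007
CYCLES_SPARSE_MUL = 504_023

def percent(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole

def cycles_to_ms(cycles, f_cpu_hz=F_CPU_HZ):
    """Convert a cycle count into milliseconds at the given clock frequency."""
    return 1000.0 * cycles / f_cpu_hz

# Rough expected cost of one signature, ignoring the non-repeated message hashing.
expected_sign_cycles = AVG_ATTEMPTS * CYCLES_PER_ATTEMPT
gauss_share = percent(2 * CYCLES_GAUSS_POLY, CYCLES_PER_ATTEMPT)
sparse_mul_share = percent(CYCLES_SPARSE_MUL, CYCLES_PER_ATTEMPT)
```

Here `cycles_to_ms(expected_sign_cycles)` gives the approximate signing time in milliseconds, and `gauss_share` reproduces the share attributed to Gaussian sampling.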
Comparison. A detailed comparison of our implementation with related work that also targets the AVR platform and provides 80 to 128 bits of security is given in Table III. Our implementation of RLWEenc-Ia encryption outperforms two earlier AVR implementations by factors of 3.8 and 6.3, respectively, in terms of cycle counts. Decryption is 6.4 times and 11.5 times faster, respectively. For comparison, we also present the cycle counts for the previous versions of this work. We highlight that these implementations were not protected against timing side-channels and that constant-time implementations are usually expensive to achieve (especially for the Gaussian sampling operation). Due to our optimizations presented in this article, we are still able to achieve similar or even better performance numbers.
A comparison between our implementation of BLISS-I and two other implementations is difficult since their authors implemented the signature scheme as an authentication protocol. Therefore, they only provide the runtime of a complete protocol run that corresponds to one signing operation and one verification in our results, but without the expensive hashing, as the sparse polynomial c is not obtained from a random oracle but randomly generated by the other protocol party. However, our implementations of BLISS_SIGN and BLISS_VER still require fewer cycles than the implementation of BLISS-I. As our implementation of BLISS-I needs 18,401 bytes of flash memory, it is also smaller than these two implementations, which require 66.5kB and 25.1kB of flash memory, respectively.
Compared with different classes of schemes, like QC-MDPC McEliece, RSA, and ECC, our implementations are at least one order of magnitude faster. A comparison with NTRU implementations is currently not easily possible due to lack of published results for the AVR platform. We also refer to de Clercq et al. [2014] for an implementation of RLWEenc-Ia and RLWEenc-IIa on an ARM Cortex-M4 (32-bit, 168MHz). However, as the Cortex-M4 is a 32-bit processor, a comparison across architectures with different bit-widths is naturally hard.

Table III. Comparison with related work. Cycles and operations per second are given as Enc/Dec or Sign/Ver pairs; [...] marks entries missing from the source.

Scheme                     Reference                  Device  MHz  Operation  Cycles                   Op/s
[...]                      [Liu et al. 2015b]         AX128   32   Enc/Dec    671,628 / 275,646        47.6 / 116.1
RLWEenc-IIa (n = 512)      [Liu et al. 2015b]         AX128   32   Enc/Dec    2,617,459 / 686,367      12.2 / 46.6
RLWEenc-Ia (n = 256)       [Pöppelmann et al. 2015]   AX128   32   Enc/Dec    874,347 / 215,863        36.6 / 148.2
RLWEenc-IIa (n = 512)      [Pöppelmann et al. 2015]   AX128   32   Enc/Dec    2,196,945 / 600,351      14.6 / 53.3
RLWEenc-Ia (n = 256)       [...]                      AT64    8    Enc/Dec    3,042,675 / 1,368,969    2.6 / 5.8
RLWEenc-Ia (n = 256)       [...]                      AX64    32   Enc/Dec    5,024,000 / 2,464,000    6.4 / 13.0
BLISS-I                    (this work)                AX128   32   Sign/Ver   10,156,247 / 2,760,244   3.2 / 12.0
BLISS-I                    [Pöppelmann et al. 2015]   AX128   32   Sign/Ver   10,537,981 / 2,814,118   3.0 / 11.4
BLISS-I (Bernoulli)        [...]                      [...]   [...] [...]     [...]                    [...]
NTT^{CT}_{bo→no} (n = 512) [Liu et al. 2015b]         AX128   32   NTT        441,572                  72.5
NTT^{CT}_{bo→no} (n = 512) [Pöppelmann et al. 2015]   AX128   32   NTT        521,872                  61.3
[...]                      [Düll et al. 2015]         AT2560  16   Point mul  13,900,397               1.2
ECC-ecp160r1               [Gura et al. 2004]         AT128   8    Point mul  6,480,000                1.2

Comparison of Samplers. Table IV shows a comparison of the different sampling approaches for RLWEenc investigated in this work. It is important to note that only Sampler Bio-low and Sampler CDT-low run in constant time. At the same time, these samplers provide less precision than Sampler KY-high and Sampler CDT-high. Among the high-precision samplers, Sampler KY-high is 3.1 times faster for σ = 4.51. On the other hand, Sampler CDT-high needs only half as much memory. Guide tables could be used to trade memory space for cycle counts for Sampler CDT-high. Since Sampler CDT-low already features guide tables, it needs more memory, but is still faster than Sampler CDT-high. Sampler Bio-low does not need any precomputed tables, but its cycle counts are 2.6/2.8 times higher than for Sampler CDT-low. For constant-time implementations we therefore recommend using Sampler CDT-low.

Table IV. Comparison of samplers for RLWEenc (cycles and memory for the two parameter sets; only the rows of the high-precision samplers survive in the source).

Sampler            Reference                  First set (σ = 4.51)   Second set
Sampler KY-high    [Liu et al. 2015b]         26,763   1.2kB         255,218   1.4kB
Sampler CDT-high   [Pöppelmann et al. 2015]   84,001   0.6kB         170,861   0.7kB

