0000−0001−9294−5594] , and Arnaud TISSERAND 1[0000−0001−7042−3541]
Introduction
The algorithms currently used in public-key cryptography (PKC) in most applications are not secure against quantum computers using Shor's algorithm [28]. Post-quantum cryptography (PQC) relies on mathematical problems for which known quantum algorithms offer no significant speed-up. Lattice problems such as learning with errors (LWE) and ring-LWE (RLWE) have received a lot of attention. A standardization project for post-quantum encryption and signatures was launched by NIST in 2016 [8]. Its goal is to select and standardize post-quantum solutions to replace RSA and ECC. The second round started in January 2019; it includes 17 public-key encryption (PKE) submissions [1], 9 of which are based on lattice problems.
While PQC is resistant against quantum computers, its implementation must be protected against physical attacks. Side channel attacks (SCAs) exploit the leakage of secret information through the analysis of the power consumption, electromagnetic radiation or computation timings of the cryptographic device. For instance, correlation power analysis (CPA) uses a set of traces obtained by measuring the power consumption of the device for different inputs.
The secret key in RLWE based cryptosystems consists of a polynomial in a finite ring. Decryption involves a multiplication with the secret polynomial. This multiplication is the ideal target for SCAs. One way to prevent such attacks is to randomize the operands to remove the correlation between the power traces and the secret polynomial coefficients.
In this work, we implement in hardware various countermeasures from the state of the art against SCAs. We also improve some of them and propose a new one. We consider the masking scheme from [25], for which we propose a new masked decoder. Our decoder is deterministic and does not add to the decryption failure probability, as opposed to the one from [25]. We also implement two randomization techniques proposed by [27], blinding and shifting, as well as a combination of the two. To the best of our knowledge, these are the first FPGA implementations of these techniques. We also propose a new countermeasure based on a redundant representation of the ring elements to randomize the computations during the decryption algorithm. Our new protection leads to a small overhead in terms of time and area. We also propose two methods to shuffle the operations during the point-wise multiplications. Finally, we compare all those solutions implemented on the same FPGA and using the same high-level synthesis (HLS) tools for fair comparison. HLS allows us to quickly evaluate several architectures and parameters. Our results show that HLS tools lead to implementations with similar, or even better, performance than VHDL or Verilog ones from the literature, but for a significantly reduced design cost.
State of the Art Analysis
Definitions and Notations

Learning with Errors based PKE
The LWE problem and LWE based cryptography have been introduced by Regev in [24] , where it is shown that LWE is at least as hard as some worst case lattice problems. The introduction of RLWE by [17] gives rise to more efficient cryptographic applications by adding an algebraic structure to the lattices. The matrices and vectors from Regev's cryptosystem are replaced by polynomials in finite rings, reducing the size of the public key and allowing fast multiplication algorithms. This speed-up has been studied by [11] in hardware implementation. The definition of RLWE we give here is a practical instantiation of the more general definition from [17] .
Definition 1 (RLWE [17]). For some (secret key) polynomial $s \xleftarrow{\$} B_\lambda(R_q)$, an RLWE sample is generated by sampling a polynomial $a$ from the uniform distribution over $R_q$, sampling $e \xleftarrow{\$} B_\lambda(R_q)$, and computing the output $(a, b)$ where $b = a \cdot s + e$. The search variant of the RLWE problem is to find $s$ given a number of samples for $s$.
We describe the framework used for instance by NewHope. Other schemes may use deterministic errors ("Learning with Rounding") or Gaussian noise instead of using the binomial distribution. RLWE based submissions still present in the second round of the NIST standardization project include NewHope [2] , LAC [15] and Round5 [4] . 
closer to 0 than to $q/2$, else decode to 1.
Components of the Cryptosystem
Encoding/Decoding. The maps between the message space $\{0, 1\}^n$ and $R_q$ are called Encode and Decode. A string of bits is encoded by mapping 0 to 0 and 1 to $q/2$, resulting in a polynomial with coefficients in $\mathbb{Z}_q$. To decode a polynomial in $R_q$, a coefficient is mapped to 0 if it is closer to 0 than to $q/2$ in $\mathbb{Z}_q$, else it is mapped to 1. That is, if $c \in [q/4, 3q/4]$ then $c \mapsto 1$, else $c \mapsto 0$.
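The two maps can be illustrated with a minimal Python sketch. It assumes that $q/2$ is rounded down to an integer and uses an integer comparison for the decision interval; the function names and the noise values are illustrative only and do not come from the implementation described in this paper.

```python
def encode(bits, q):
    """Map each message bit to a coefficient: 0 -> 0, 1 -> floor(q/2)."""
    return [(q // 2) * b for b in bits]


def decode_coeff(c, q):
    """Return 1 if c lies strictly between q/4 and 3q/4, else 0.

    The test q < 4c < 3q avoids fractions; for odd q the bounds q/4 and
    3q/4 are never integers, so open and closed intervals coincide.
    """
    c %= q
    return 1 if q < 4 * c < 3 * q else 0


def decode(poly, q):
    return [decode_coeff(c, q) for c in poly]


q = 7681
message = [1, 0, 1, 1, 0, 0, 1, 0]
noise = [5, -3, 100, -250, 7, 0, -1, 300]  # small illustrative errors
noisy = [(c + e) % q for c, e in zip(encode(message, q), noise)]
recovered = decode(noisy, q)
```

As long as the absolute error on every coefficient stays below $q/4$, `recovered` equals `message`.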
Binomial sampling. The $B_\lambda$ distribution over $\mathbb{Z}_q$ is sampled by generating $2\lambda$ uniformly random bits $a_1, \ldots, a_\lambda, b_1, \ldots, b_\lambda$ and returning $\sum_{i=1}^{\lambda} (a_i - b_i) \bmod q$.
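A minimal sketch of this sampler in Python follows. It assumes the usual centered binomial form, in which the sample is the sum of the differences $a_i - b_i$ reduced modulo $q$. Python's `secrets` module stands in for the PRNG, and the helper `centered` is introduced only to inspect the result.

```python
import secrets


def sample_binomial(lam, q):
    """Centered binomial sample: sum of lam bit differences, reduced mod q."""
    a = [secrets.randbits(1) for _ in range(lam)]
    b = [secrets.randbits(1) for _ in range(lam)]
    return sum(ai - bi for ai, bi in zip(a, b)) % q


def sample_poly(n, lam, q):
    """Sample n independent coefficients from B_lambda."""
    return [sample_binomial(lam, q) for _ in range(n)]


def centered(x, q):
    """Representative of x mod q in the range (-q/2, q/2]."""
    return x - q if x > q // 2 else x


q, lam, n = 7681, 3, 256
e = sample_poly(n, lam, q)
```

Every coefficient of `e` has a centered representative between $-\lambda$ and $\lambda$.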
Polynomial multiplication and NTT. The encryption and decryption functions both rely on polynomial multiplication in $R_q$. The polynomial multiplication is a costly operation. Hardware implementations of RLWE schemes such as [20] tend to use the Number Theoretic Transform (NTT) to compute this operation. It has also been suggested to use the schoolbook algorithm for area optimization [21], but in general the NTT seems to yield better performance, and highly area-optimized implementations exist [26].
To compute a polynomial multiplication using the NTT, the polynomials should be mapped to the NTT domain where the polynomial multiplication is a point-wise operation taking only n modular multiplications in Z q . Addition and subtraction can also be performed point-wise in the NTT domain. The inverse NTT is applied to bring the result back in the time domain.
Definition 2 (NTT).
Let $\omega \in \mathbb{Z}_q$ be a primitive $n$-th root of unity and $a(x) = \sum_{i=0}^{n-1} a_i x^i$ an element of $R_q$. Then the image of $a$ under the NTT is given by $\hat{a} = \sum_{j=0}^{n-1} \hat{a}_j x^j$, where $\hat{a}_j = a(\omega^j)$. To use the NTT for multiplication in $R_q$, the polynomials have to be preprocessed using the negative wrapped convolution (NWC) [16]
for some $c(x)$ of degree smaller than $n$. Let $\phi \in \mathbb{Z}_q$ be a primitive $2n$-th root of unity such that $\phi^n = -1 \bmod q$. Then, in $\mathbb{Z}_q[x]$ and for $i \ge 0$, one has:
This means that $\mathrm{NTT}(a(\phi x)) \odot \mathrm{NTT}(b(\phi x)) = \mathrm{NTT}(c(\phi x))$, where $\odot$ denotes the point-wise product. In other words, one gets the reduction mod $(x^n + 1)$ for free by applying the NTT to $a(\phi x)$ instead of $a(x)$. To obtain the correct result from the polynomial multiplication, the inverse of the NWC should be applied to $\mathrm{NTT}^{-1}(\mathrm{NTT}(c(\phi x)))$. That is, each coefficient has to be multiplied by a power of $\phi^{-1}$. Therefore, the values of $\phi^i \bmod q$ have to be precomputed for $0 < i < n$ and $-n < i < 0$.
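The following Python sketch illustrates the NWC on toy parameters. It uses a naive quadratic-time transform for readability rather than a fast algorithm, and the parameters $n = 4$, $q = 17$, $\phi = 2$ (so that $\phi^4 = -1 \bmod 17$) are assumptions chosen for the example, not parameters used in this work.

```python
def ntt(a, omega, q):
    """Naive O(n^2) NTT: a_hat[j] = a(omega^j) mod q."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]


def intt(a_hat, omega, q):
    """Inverse NTT: same transform with omega^-1, scaled by n^-1."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in ntt(a_hat, pow(omega, -1, q), q)]


def negacyclic_mul_ntt(a, b, phi, q):
    """Multiply a and b in Z_q[x]/(x^n + 1) using the NWC."""
    n = len(a)
    omega = phi * phi % q                        # primitive n-th root of unity
    a_s = [a[i] * pow(phi, i, q) % q for i in range(n)]    # a(phi x)
    b_s = [b[i] * pow(phi, i, q) % q for i in range(n)]    # b(phi x)
    c_hat = [x * y % q for x, y in zip(ntt(a_s, omega, q), ntt(b_s, omega, q))]
    c_s = intt(c_hat, omega, q)                  # coefficients of c(phi x)
    phi_inv = pow(phi, -1, q)
    return [c_s[i] * pow(phi_inv, i, q) % q for i in range(n)]  # undo the NWC


def negacyclic_mul_schoolbook(a, b, q):
    """Reference product in Z_q[x]/(x^n + 1), using x^n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c


q, phi = 17, 2
a, b = [1, 2, 3, 4], [5, 6, 7, 8]
c_ntt = negacyclic_mul_ntt(a, b, phi, q)
c_ref = negacyclic_mul_schoolbook(a, b, q)
```

Both routes give the same coefficient vector, which is the point of the NWC: no explicit reduction modulo $x^n + 1$ is needed on the NTT side.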
The transform is efficiently computed in $\log_2(n)$ stages of $n$ multiplications using the Cooley-Tukey algorithm [9]. Several optimizations have been proposed to accelerate this computation. The multiplication by the powers of the $2n$-th root of unity can be merged with the twiddle factors in the first stage, or with the scaling multiplication by $n^{-1} \bmod q$ [26]. Instead of precomputing $n^{-1}$ and the powers of $\phi^{-1}$, the values of $n^{-1}\phi^{-i}$ for $0 \le i < n$ can be precomputed directly. A similar result merging the NWC with the final stage of the inverse NTT was described by [22]. By making clever use of the Decimation-In-Time (DIT) and Decimation-In-Frequency (DIF) transforms, [22] shows that the bit reversal can be avoided. The DIT NTT is used to compute the inverse transformation. The DIT NTT takes an input in bit-reversed order and returns the output in the original order. The bit reversal resulting from the DIF forward transformation is thus automatically undone by the inverse NTT. All the operations in the NTT domain are computed on the bit-reversed coefficient vectors. The public and private keys are therefore stored in bit-reversed order in the NTT domain. To limit the number of modular reductions during the NTT, [14] allows variables to grow slightly larger than $q$.
Not counting symmetric primitives, the NTT is the most expensive operation in the scheme, with a complexity of $O(n \log n)$. To reduce the number of NTTs to be computed, the public and private keys can be stored in the NTT domain. The ciphertext part $c_1$ must also be sent in the NTT domain. During the encryption, 2 forward NTTs and 1 inverse NTT have to be computed, and during the decryption only 1 inverse NTT is needed.
Using the constant geometry variant [19] of the NTT algorithm, the memory access pattern is independent of the stage. The values are not read from the same memory as the one that the updated variables are written to, therefore 2 BRAMs are needed in the implementation. At each iteration of the stage loop, 2 values are read from the memory, a butterfly operation is computed and the 2 results are written to the memory. A detailed description of the stage is given by Algorithm 1. All arithmetic operations are performed in Z q .
Algorithm 1 i-th stage of the NTT [19]
function stage(X, i)
    for j ← 0 to n/2 − 1 do
        Get twiddle factor from memory
        …
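As an illustration of a stage-independent access pattern, the Python sketch below implements one common constant-geometry DIT formulation: each stage reads the pair $(X_{2j}, X_{2j+1})$ and writes to positions $j$ and $j + n/2$ of a second array. It is a sketch under stated assumptions and not a transcription of Algorithm 1. The twiddle indexing, the bit-reversed input order and the toy parameters $n = 8$, $q = 17$, $\omega = 2$ are choices made for this example.

```python
def bit_reverse(x, k):
    """Reverse the k-bit binary representation of x."""
    return int(format(x, f"0{k}b")[::-1], 2)


def cg_stage(X, s, omega, q):
    """Stage s (0-based) of a constant-geometry DIT NTT.

    Reads from X and writes to a separate array Y, mirroring the use of
    two memories. The read/write addresses do not depend on s.
    """
    n = len(X)
    k = n.bit_length() - 1
    shift = k - 1 - s
    Y = [0] * n
    for j in range(n // 2):
        w = pow(omega, (j >> shift) << shift, q)   # twiddle factor
        u, v = X[2 * j], X[2 * j + 1] * w % q
        Y[j] = (u + v) % q                         # butterfly, upper output
        Y[j + n // 2] = (u - v) % q                # butterfly, lower output
    return Y


def cg_ntt(a, omega, q):
    """Full transform: bit-reversed input order, natural output order."""
    n = len(a)
    k = n.bit_length() - 1
    X = [a[bit_reverse(i, k)] for i in range(n)]
    for s in range(k):
        X = cg_stage(X, s, omega, q)
    return X


def naive_ntt(a, omega, q):
    """Reference transform: a_hat[j] = a(omega^j) mod q."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]


q, omega = 17, 2            # 2 has multiplicative order 8 modulo 17
a = [3, 1, 4, 1, 5, 9, 2, 6]
fast = cg_ntt(a, omega, q)
slow = naive_ntt(a, omega, q)
```

The only quantity that changes from stage to stage is the twiddle factor, which is what makes the memory addressing logic simple.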
Modular reduction. Modular reduction for moduli of the form $q = 2^{l_1} - 2^{l_2} + 1$ can be efficiently computed using algorithms in the style of [29]. Using the fact that $2^{l_1} \equiv 2^{l_2} - 1 \bmod q$, a modular reduction can be computed using only bitwise shifts, additions and subtractions.
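A minimal Python sketch of this folding idea is given below. It is not the reduction circuit of this work: it loops until the value fits in $l_1$ bits and is not constant time. It uses the identities $7681 = 2^{13} - 2^{9} + 1$ and $12289 = 2^{14} - 2^{12} + 1$; the function name is illustrative.

```python
def reduce_shift_add(x, l1, l2):
    """Reduce a non-negative x modulo q = 2^l1 - 2^l2 + 1.

    Writes x = hi * 2^l1 + lo and replaces 2^l1 by 2^l2 - 1, which only
    needs shifts, one addition and one subtraction per folding step.
    """
    q = (1 << l1) - (1 << l2) + 1
    mask = (1 << l1) - 1
    while x >> l1:                       # fold while x has more than l1 bits
        hi, lo = x >> l1, x & mask
        x = (hi << l2) - hi + lo         # hi * (2^l2 - 1) + lo
    if x >= q:                           # x < 2^l1 < 2q: one subtraction suffices
        x -= q
    return x


r1 = reduce_shift_add(7680 * 7680, 13, 9)       # product of two values mod 7681
r2 = reduce_shift_add(12288 * 12288, 14, 12)    # product of two values mod 12289
```

Each folding step strictly decreases the value, so the loop terminates after a few iterations for products of two reduced operands.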
Side Channel Analysis
Power analysis attacks on cryptographic implementations were first described by [12]. Side channel attacks on LWE cryptography exploit vulnerabilities in Gaussian sampling algorithms [6], [10], polynomial multiplication [3] or the NTT [23]. The decryption algorithm makes use of the secret key and is therefore vulnerable to statistical and machine learning attacks on side channels such as power consumption (for instance differential power analysis, DPA) or electromagnetic radiation (differential electromagnetic analysis, DEMA).
Protections
All the operations in the decryption handle inputs that depend on the secret key. To protect it against DPA, these inputs should be randomized at the start of each decryption. The decryption algorithm decodes the coefficients of some polynomial $d$, where $d$ is defined by $d = c_2 - \mathrm{NTT}^{-1}(c_1 \odot s)$, with $\odot$ the point-wise product. It should be noted that knowledge of the coefficients of $d$ leads to complete key recovery in the chosen plaintext attack model. Since $c_1$ and $c_2$ are known inputs and $c_1$ is invertible in $R_q$ with high probability, one can compute $s = c_1^{-1} \cdot (c_2 - d)$.
To prevent SCAs on the coefficients of d, these coefficients should not be computed directly. Instead, a randomized or masked version of d is used. The decoder should therefore be able to decode randomized or masked inputs. We now present the main countermeasures from literature against statistical attacks on RLWE.
Masking. In [25] the secret key is split in two shares: $s = s' + s''$ for some uniformly random $s'$ at the start of each decryption. The linear part of the decryption function is computed twice: first the ciphertext is decrypted (but not decoded) using secret key $s'$ and then using secret key $s''$, yielding two polynomials $d'$ and $d''$. The final step consists of decoding the coefficients of $d = d' + d''$ to bits. This is a non-linear operation, that is, in general $\mathrm{Decode}(d' + d'') \neq \mathrm{Decode}(d') + \mathrm{Decode}(d'')$.
This means that one cannot simply apply the decoder to the coefficients of $d'$ and $d''$ separately and then add the results in $\mathbb{Z}_2$ to obtain the correct plaintext.
Because of the DPA scenario mentioned above, the two shares $d'$ and $d''$ should not be recombined before decoding to bits. Let $d'$ and $d''$ denote a coefficient (of some fixed index) of the first and second share polynomial respectively. A masked decoder takes as input two coefficients $(d', d'') \in \mathbb{Z}_q^2$ and computes the value of $\mathrm{Decode}(d' + d'')$ without computing $d' + d''$. The solution from [25] makes use of the fact that for some $(d', d'') \in \mathbb{Z}_q^2$ it is easy to deduce the value of
, therefore the coefficient decodes to 1. Similar "easy cases" exist, but not all $(d', d'') \in \mathbb{Z}_q^2$ can be resolved in this way. If both $d'$ and $d''$ lie between 0 and $q/4$, all we know is that $0 \le d' + d'' < q/2$, and this can decode to either 0 or 1.
The idea from [25] to solve the hard cases is to reshare the two shares: for
It is therefore possible to add any constant to one of the shares and subtract the same constant from the other one. However, there is no guarantee that $(d' + \delta, d'' - \delta)$ is an easy case. If the new shares still do not form an easy case, the shares have to be reshared again. In [25] a list of constants $\{\delta_1, \ldots, \delta_{16}\}$ is presented that is supposed to minimize the number of resharings to be performed. Their implementation refreshes the shares 16 times, such that with high probability an easy case is obtained in at least one of the 16 iterations.
The computation time overhead due to the 16 iterations and the additional decoding failures are important drawbacks to this solution. [18] propose an alternative masked decoding. Their method effectively decodes without additional decoding failures. The comparison that they make between this decoder and their re-implementation of the one from [25] however shows only a very limited improvement in terms of performance. Their masked decryption takes over 3 times more cycles to compute than the unmasked version. The same implementation also uses a blinding countermeasure proposed by [27] .
Blinding. With the blinding countermeasure [27], the polynomials $s$ and $c_1$ are multiplied by scalars $a$ and $b$ in $\mathbb{Z}_q$ respectively. The blinded polynomial multiplication is then computed: $(as) \cdot (bc_1) = (ab)\, s \cdot c_1$. The inverse $(ab)^{-1}$ should be computed to obtain $s \cdot c_1$. [27] suggested using (pre-computed) powers of $\omega$ and $\omega^{-1}$ as blinding factors to avoid the modular inversion. The decoding process cannot be protected from DPA with the scalar blinding method: the blinding multiplication has to be inverted before the coefficients can be decoded.
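The blinding and unblinding steps can be sketched in Python as follows, on vectors that are assumed to be already in the NTT domain. The toy parameters ($q = 17$, $n = 8$, $\omega = 2$), the sample vectors and the function names are assumptions made for the example. The point of the sketch is that $\omega^{-(i+j)}$ is itself a table entry, so no modular inversion is needed.

```python
import secrets

q, n, omega = 17, 8, 2                            # toy values: omega has order n mod q
POWERS = [pow(omega, e, q) for e in range(n)]     # precomputed powers of omega


def pointwise(u, v):
    return [x * y % q for x, y in zip(u, v)]


def blinded_pointwise(s_hat, c1_hat):
    """Compute s_hat * c1_hat point-wise through blinded operands."""
    i, j = secrets.randbelow(n), secrets.randbelow(n)
    a, b = POWERS[i], POWERS[j]                   # blinding factors omega^i, omega^j
    s_bl = [a * x % q for x in s_hat]             # a * s
    c_bl = [b * x % q for x in c1_hat]            # b * c1
    prod = pointwise(s_bl, c_bl)                  # (ab) * s * c1
    unblind = POWERS[-(i + j) % n]                # omega^-(i+j), read from the table
    return [unblind * x % q for x in prod]


s_hat = [1, 5, 16, 3, 0, 9, 12, 7]
c1_hat = [2, 11, 4, 8, 15, 6, 1, 13]
result = blinded_pointwise(s_hat, c1_hat)
```

The unblinding multiplication is exactly the step that exposes the unblinded product before decoding, which is why blinding alone does not protect the decoder.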
Shifting. It is also suggested in [27] to apply a random anti-cyclic shift to the coefficient vectors of the polynomials before multiplying. Due to the ring structure, this anti-cyclic shift corresponds to a multiplication by some power of $x$. For some random $i, j < n$, $(x^j s(x)) \cdot (x^i c_1(x)) = x^{i+j} s(x) c_1(x)$ is computed, and $s(x) c_1(x)$ can be recovered by inverting the shift.
In practice it is not possible to obtain $x^i s$ and $x^j c_1$ by applying anti-cyclic shifts to their coefficient vectors, because they are represented in the NTT domain. To multiply by $x^i$ in the NTT domain, observe that
and $\mathrm{NTT}((\phi x)^i) = \phi^i \cdot \mathrm{NTT}(x^i)$. All of the coefficients of $\mathrm{NTT}(x^i)$ are already pre-computed, since they are exactly the $n$ powers of $\omega$. Multiplication by $x^i$ can thus be done by a point-wise multiplication with the powers of $\omega$ and $\phi^i$ (for the NWC). This multiplication has to be performed in bit-reversed order, since $s$ and $c_1$ are in the NTT domain.
Shuffling. Masking, blinding and shifting offer little to no protection against single trace attacks. The single trace attack by [23] exploits leakage from the operations performed during the NTT. In that paper, it is suggested to counter the attack by randomizing the order in which the butterfly operations are computed. During each stage, the $n/2$ butterfly operations can be computed in a random order. The same shuffling methods can also be applied to all the point-wise operations in the decryption.
Unprotected FPGA Implementation of RLWE
In this section, we present our implementation of the encryption and decryption algorithms described in the previous section on an Artix-7 XC7A200 FPGA using Vivado HLS (version 2018.1). Our decryption architecture is the basis for the protected implementations proposed in the next sections. We compare our unprotected RLWE implementations with results from the literature. One of our goals is to show that competitive results can be obtained using HLS from C code for a reduced design cost compared to VHDL or Verilog design.
Figure 1 presents the high-level architecture of our accelerator. For encryption, the public keys are first loaded into the local RAM, then the computations are performed by the functional units (FUs, see below). During encryption/decryption our accelerator is isolated for security reasons: it does not take any input or generate any output. After encryption/decryption, the result is sent out through the interface. All the communications through the interface are included in the results reported in this paper. Depending on the parameter $n$, the time spent on interfacing typically represents about 12% to 21% of the total encryption/decryption time.
Parameters. In order to compare with results from the literature, we implement RLWE for the parameter sets $(n, q, \lambda) = (1024, 12289, 8)$ and $(256, 7681, 3)$. For $n = 1024$, we use a simplification of the CPA-secure version of NewHope1024 PKE with key reuse. We do not implement the key refreshing, ciphertext compression/decompression and key encoding/decoding in this paper. For simplicity we use the Trivium stream cipher as PRNG.
Encryption and Decryption. Following [22], we avoid the bit-reversal step by implementing both the DIF and DIT NTT. The stage loop is fully pipelined, such that it takes just over $n/2$ clock cycles to complete one stage. The complete forward transformation is computed in slightly more than $\frac{n}{2} \log_2 n$ cycles. The error polynomials $e_1$, $e_2$ and $e_3$ are sampled from the binomial distribution $B_\lambda(R_q)$. The required random bits are provided by the PRNG. Since the ciphertext part $c_1 = a e_1 + e_2$ will be sent in the NTT domain, the NTT has to be applied to both $e_1$ and $e_2$. The NWC must be computed for both polynomials. To compute these multiplications, we use the fact that the coefficients are sampled from $B_\lambda$ and are therefore bounded by $-\lambda$ and $\lambda$. The multiplications can be computed using only a few shifts and additions, without using a DSP block. The NTT is then applied to $e_1$ and $e_2$ simultaneously, using two parallel NTT units, each consisting of one butterfly unit and two BRAMs. The architecture for sampling $e_1$ (or $e_2$) and mapping it to the NTT domain is shown in Figure 2. The architecture for decryption is shown in Figure 3. The area and timing implementation results for RLWE are shown in Table 1, together with similar solutions from the literature.
Our small implementation is denoted by V1. This implementation with only 1 DSP block is comparable in size and speed to [20] but is larger and 15 to 20% slower than the cryptoprocessor from [26]. By computing the forward NTTs in parallel in V2, we are faster than both, but more DSP blocks are needed. For $n = 1024$, the key exchange implementation by [13] is comparable with our V3 results in terms of speed, but the V3 implementation uses 50% fewer DSP blocks and BRAMs. We conclude that results obtained using HLS are comparable to or, in the best case, even better in terms of speed (up to 25%) and/or area (up to 50%) than results from works based on VHDL or Verilog implementations.
New Variants of State of the Art Protections
The protections proposed in this section and the next one are implemented by modifying our base architecture from Figure 3 for n = 256 and q = 7681. In real-world applications these protections should be part of an architecture implementing the CCA2-secure version of the scheme, including a re-encryption of the decrypted ciphertext and several evaluations of some hash function. 
Masking with a New Masked Decoder
We implement a variant of the masking scheme described in the state of the art [25], improving the masked decoding process. We propose a simple masked decoder that does not need 16 iterations and that does not increase the decoding failure probability. Let $d', d'' \in [0, q/4)$, then $d' + d'' \in [0, q/2)$. If either $d'$ or $d''$ were to be shifted by exactly the right amount (cf. 4), then we would be able to for $0 \le i \le 3$, that we will refer to as "quadrants". The property that allows us to solve the easy cases is the following:
In the remainder of this section, we let $i$ and $j$ be the quadrant indices of $d'$ and $d''$ respectively. For $i + j = 1 \bmod 4$ it follows from Property 1 that
In other words, the sum lies in the left half of $\mathbb{Z}_q$ and therefore decodes to 1. Similarly, the $(d', d'') \in \mathbb{Z}_q^2$ for which $i + j = 3 \bmod 4$ are easy cases and decode to 0.
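The easy cases can be checked exhaustively with a short Python sketch. It assumes that quadrant $i$ is the interval $[iq/4, (i+1)q/4)$, which is an assumption of this example, and it uses a small odd modulus ($q = 257$) to keep the exhaustive loop fast. The function names are illustrative, and the sketch recombines the shares only to verify the claim, which a masked decoder must not do.

```python
def quadrant(d, q):
    """Index i such that i*q/4 <= d < (i+1)*q/4 (assumed quadrant definition)."""
    return (4 * d) // q


def decode_coeff(c, q):
    """Unmasked reference decoder: 1 if c lies between q/4 and 3q/4."""
    c %= q
    return 1 if q < 4 * c < 3 * q else 0


def easy_case(d1, d2, q):
    """Decode from quadrant indices only; None means a hard case."""
    t = (quadrant(d1, q) + quadrant(d2, q)) % 4
    if t == 1:
        return 1
    if t == 3:
        return 0
    return None


q = 257
mismatches = 0
hard_cases = 0
for d1 in range(q):
    for d2 in range(q):
        guess = easy_case(d1, d2, q)
        if guess is None:
            hard_cases += 1
        elif guess != decode_coeff(d1 + d2, q):   # recombination for checking only
            mismatches += 1
```

No mismatch occurs, while a substantial fraction of the pairs are hard cases, which is why an additional resolution step is needed.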
The hard cases are given by $(d', d'') \in \mathbb{Z}_q^2$ for which $i + j = 0 \bmod 4$ or $i + j = 2 \bmod 4$, that is,
To reduce to an easy case, it suffices to move either (but not both) $d'$ or $d''$ to an adjacent quadrant. Then for the new pair $(d' + \delta, d'' - \delta)$ exactly $1 \bmod 4$ is added to or subtracted from the sum $i + j$. Then for the updated $i, j$ it holds that $i + j = 1 \bmod 4$ or $i + j = 3 \bmod 4$, and Property 1 applies. It is always possible to modify the sum $i + j$ for the $i, j$ corresponding to the shares by exactly 1.
, then there is a $\delta$ such that $d' + \delta \in Q_{i+1}$ and $d'' - \delta \in Q_j$. If the opposite holds, then $d''$ can be moved to $Q_{j-1}$ by subtracting a $\delta$ while $d'$ stays in $Q_i$. The new pair $(d' + \delta, d'' - \delta)$ forms an easy case. This method does not work when the distance $\delta''$ between $d''$ and $\frac{(j-1)q}{4}$ is equal to the distance $\delta'$ between $d'$ and $\frac{iq}{4}$. However, these are exactly the cases for which $d' + d''$ is equal to either $q/4$ or $3q/4$. This means that even an unmasked decoder would not be able to decode these cases correctly. The parameters in LWE-based cryptoschemes are usually chosen such that such cases appear with negligible probability.
The comparison operation $\delta' < \delta''$ has to be implemented with caution. Generally, comparisons are performed by checking the sign bit of the difference of the operands. Since $\delta' - \delta'' = -(d' + d'') + \frac{kq}{4}$ for some integer $k < 4$, this operation leaks information about the unmasked value of $d$. Instead of implementing a combinational circuit, we have implemented successive accesses to a look-up table to perform the comparison. The implemented algorithm is described in Algorithm 2, where the bits of $a$ and $b$ are denoted $(a_0, \ldots, a_{w-1})$ and $(b_0, \ldots, b_{w-1})$ respectively. The look-up table implements the function $T$ defined by
Note that it is not necessary to assume that $d' > d''$. Given $(d', d'')$ and their corresponding quadrant indices $(i, j) = \mathrm{index}(d', d'')$, the distances $\delta'$ and $\delta''$ are computed and compared. We have that:
Algorithm 2 Returns True if and only if a > b
In order to make this masked decoder compatible with CCA2-secure implementations, the output is also masked. Instead of returning the plaintext bit, a random bit is generated and XORed with the unmasked decoding result. The decoder returns both the mask and the masked value.
A total of $2^{n \log q} = 2^{3328}$ different masks can be obtained. The security of the masking scheme with its original decoder is experimentally evaluated by [25]. They also mention the (small) possibility of horizontal DPA attacks targeting the 16 iterations of their masked decoder. Our proposed decoder does not have this vulnerability since it does not use 16 iterations.
Shifting
In [27] there is no mention of any masked decoder. To secure the complete decryption function, we propose to apply the (normal) decoder to the shifted polynomial $x^{i+j} c_2(x) - x^{i+j} s(x) c_1(x)$, meaning that $c_2(x)$ should be shifted separately. The plaintext can then be obtained by applying the inverse shift to the decoded polynomial. The minus sign that comes with the anti-cyclic shift does not change the value of the decoded coefficient, because for all $a \in \mathbb{Z}/q\mathbb{Z}$ it holds that $\mathrm{Decode}(a) = \mathrm{Decode}(-a)$. The decryption procedure for a ciphertext $(c_1, c_2)$ can then be described as follows:
1. Generate random $i, j < n$.
2. Compute the point-wise products $\mathrm{NTT}(x^i) \odot s$ and $\mathrm{NTT}(x^j) \odot c_1$ by multiplying $s$ and $c_1$ by the powers of $\omega$ and $\phi$ in an order determined by $i$ and $j$ respectively.
3. Compute the point-wise product to obtain $x^{i+j} s \cdot c_1$ and apply the inverse NTT.
4. Apply the anti-cyclic $(i + j)$-shift to $c_2$ and obtain $x^{i+j} c_2$.
5. Compute the subtraction $x^{i+j} c_2 - x^{i+j} s \cdot c_1 = x^{i+j}(c_2 - s \cdot c_1)$.
6. Decode to obtain the shifted plaintext. Shift $i + j$ positions to the left.
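The procedure above can be checked with a small Python sketch that works directly on coefficient vectors. It uses schoolbook multiplication in place of the NTT-domain computation, which is a simplification made for this example, together with a toy size $n = 8$. The function names are illustrative.

```python
import random


def negacyclic_mul(a, b, q):
    """Schoolbook product in Z_q[x]/(x^n + 1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % q
    return c


def mul_by_x_power(a, k, q):
    """Anti-cyclic shift: multiply a(x) by x^k modulo x^n + 1."""
    n = len(a)
    out = [0] * n
    for idx, coeff in enumerate(a):
        e = (idx + k) % (2 * n)                   # x^(2n) = 1 in the ring
        out[e % n] = (coeff if e < n else -coeff) % q
    return out


def decode(poly, q):
    return [1 if q < 4 * (c % q) < 3 * q else 0 for c in poly]


def decrypt_plain(s, c1, c2, q):
    prod = negacyclic_mul(s, c1, q)
    return decode([(x - y) % q for x, y in zip(c2, prod)], q)


def decrypt_shifted(s, c1, c2, q, i, j):
    n = len(s)
    prod = negacyclic_mul(mul_by_x_power(s, i, q), mul_by_x_power(c1, j, q), q)
    c2_shift = mul_by_x_power(c2, i + j, q)
    bits = decode([(x - y) % q for x, y in zip(c2_shift, prod)], q)
    return [bits[(p + i + j) % n] for p in range(n)]   # undo the shift on the bits


rng = random.Random(1)
n, q = 8, 7681
s = [rng.randrange(q) for _ in range(n)]
c1 = [rng.randrange(q) for _ in range(n)]
c2 = [rng.randrange(q) for _ in range(n)]
same = all(decrypt_shifted(s, c1, c2, q, i, j) == decrypt_plain(s, c1, c2, q)
           for i in range(n) for j in range(n))
```

The sign flips introduced by the wrap-around never matter here because the decoder gives the same bit for a coefficient and its negation.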
Blinding
The blinding countermeasure is implemented by generating two random indices $0 \le i, j < n$ and multiplying $c_1$ and $s$ by $\omega^i$ and $\omega^j$ respectively.
Shifting and Blinding combined
Both shifting and blinding involve multiplication by the powers of $\omega$ and $\phi$. To shift the polynomial $s(x)$ by $i < n$ positions, we compute $\phi^i \cdot \mathrm{NTT}(x^i) \odot s(x)$. With almost no additional cost, this operation can be combined with the blinding operation by simply modifying the exponents of $\omega$. To shift the polynomial by $i$ positions and blind using $\omega^{-j}$ for some $j < n$, we use:
Both s and c 1 are shifted and blinded. The combined blinding factor has to be removed before the decoding. The combination of the two countermeasures is therefore somewhat more expensive than shifting alone. The decoding is performed in the shifted order.
Both shifting and blinding use two log(n)-bit randomization factors. As pointed out by [27] , the total amount of added noise entropy for shifting and blinding combined is 4 log(n) bits. For n = 256 this is equal to 32.
New Protections
Shuffling
The first of the two shuffling methods proposed in this paper consists of replacing loop counters by linear feedback shift registers (LFSR). An LFSR is parametrized by an irreducible polynomial $f \in \mathbb{F}_2[x]$ and its degree $k$. It computes $x^i a \bmod f$ for $0 \le i < 2^k - 1$ and some initial state $a \in \mathbb{F}_2[x]/f$. The computed values are exactly all the $2^k - 1$ invertible elements of the finite field $\mathbb{F}_2[x]/f$ (this requires $f$ to be primitive, so that $x$ generates the multiplicative group). The order in which they are computed is determined by the initial state $a$. Multiplication by $x$ in $\mathbb{F}_2[x]/f$ is very fast and can be computed using only 1 shift and a XOR on bit positions depending on $f$. Our second shuffling method consists of generating a random permutation using a permutation network in the style of [5].
LFSR method. Let an LFSR be parametrized by some irreducible polynomial $f$ of degree $k = \log_2(n) - 1$, so that $2^k = n/2$. We consider the coefficient vectors of polynomials in $\mathbb{F}_2[x]/f$ to be the binary representations of integers ranging from 0 to $n/2 - 1$. The LFSR thus generates a sequence of $n/2 - 1$ integers that will serve as indices for the loop counter in Algorithm 1. Instead of computing the $i$-th butterfly operation at the $i$-th loop iteration, we compute the butterfly operation that is on the $j$-th position, where $j$ is the index corresponding to the $i$-th element generated by the LFSR. In other words, the normal loop counter is replaced by an LFSR. The LFSR has only $2^k - 1$ outputs, whereas we need $2^k$ for the $n/2$ butterfly operations. Therefore the first operation of each stage is not shuffled: it is always computed in the first loop iteration of the stage.
To obtain a meaningful permutation, we use the PRNG to generate a new initial state $a$ at the start of each stage. Since $a = 0$ is not allowed as initial state, we set $a = 1$ as the default state in the case that the PRNG outputs 0. The initial state is thus set to the default state with probability $4/n$. All the other initial states appear with probability $2/n$. This slight bias could be reduced by having the PRNG generate multiple initial states and selecting a non-zero state. The $2^k - 1$ possible initial states determine $2^k - 1$ unique sequences. The operations of a complete $\log_2(n)$-stage NTT can then be computed in $(2^k - 1)^{\log_2(n)}$ different ways. For $n = 256$ and $k = 7$ this is more than $2^{55}$. With the LFSR method applied to the point-wise operations outside the NTT as well, the total number of random bits added is equal to 71. A single trace attack like [23] that requires all of the $\log_2(n)$ stages seems unlikely to succeed on an implementation using the LFSR countermeasure as described.
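The LFSR loop counter can be sketched in Python as follows, for $n = 256$ and $k = 7$. The feedback polynomial $x^7 + x + 1$ is an assumption made for the example (it is primitive, but it is not stated to be the one used in this work), and `secrets.randbits` stands in for the PRNG.

```python
import secrets

K = 7                 # k = log2(n) - 1 for n = 256
POLY = 0b10000011     # x^7 + x + 1, assumed feedback polynomial


def lfsr_order(k, poly, seed):
    """Order in which the 2^k butterfly indices of one stage are visited.

    Index 0 is always processed first; the remaining 2^k - 1 indices follow
    the LFSR sequence a, x*a, x^2*a, ... in F_2[x]/f.
    """
    state = seed if seed != 0 else 1     # 0 is not a valid state: use default 1
    order = [0]
    for _ in range((1 << k) - 1):
        order.append(state)
        state <<= 1                      # multiply by x
        if state >> k:                   # degree reached k: reduce modulo f
            state ^= poly
    return order


order = lfsr_order(K, POLY, secrets.randbits(K))
```

Different seeds give rotations of one and the same cyclic sequence, which illustrates the limited permutation space of this method.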
We use an LFSR of degree $k = \log_2(n)$ in a similar manner to shuffle the order of the $n$ point-wise multiplications outside the NTT.
Drawbacks to the LFSR loop counter include a limited permutation space, a slightly biased outcome and the fact that the first element is not permuted.
Permutation Network Method. We propose to use a permutation network generator in the style of [5]. Their permutation generator is designed for use in AES and is impractical for larger ($N = 256$) permutations. It is also biased. We simplify their permutation network to obtain a permutation generator that can generate $N^{N/2}$ permutations and that is uniform on its range. In the remainder of this section, the parameter $N$ is the size of the permutation, which is equal to $n$ for the shuffling of the point-wise operations. To shuffle the butterfly operations during the computation of the NTT, $N$ is substituted by $n/2$.
Then $T_0$ is a bitwise shift erasing the least significant bit (LSB), and $T_1$ applies the same shift and sets the MSB to 1.
The permutation network consists of $k = \log_2(N)$ stages and takes as input $(x_0, \ldots, x_{N-1}) = (0, \ldots, N - 1)$. During each stage, $N/2$ random bits $b_1, \ldots, b_{N/2}$ are generated and for all pairs $(x_{2i}, x_{2i+1})$ the images $T_{b_i}(x_{2i})$ and $T_{\bar{b}_i}(x_{2i+1})$ are computed. In other words, for each pair $(x_{2i}, x_{2i+1})$, one is sent to position $i$, while the other is mapped to $i + N/2$. This is equivalent to writing one bit of the image of $x_{2i}$ under the generated permutation and writing its complement to the image of $x_{2i+1}$. The network is shown for $N = 8$ in Figure 5. It is exactly the same as the computation scheme of the constant geometry NTT, in which the butterfly operators are replaced by controlled swap operators.
For any integer $0 \le x < N$ the image of $x$ under the generated permutation can be written as $T_{b_1} \circ \cdots \circ T_{b_k}(x)$ and is equal to the value corresponding to the binary representation $(b_1, \ldots, b_k)_2$. The $kN/2$ control bits thus determine the image of each index under the generated permutation. By uniqueness of binary representation it follows that any modification to any subset of the $kN/2$ control bits would modify the generated permutation as well. The permutation generator is therefore an injective map from $\{0, 1\}^{Nk/2}$ into the set of all permutations $\Sigma_N$. This means that the number of possible configurations of the $kN/2$ control bits, which is equal to $N^{N/2}$, is exactly the number of permutations that can be generated by the network. The number of different permutations that can be obtained is $2^{1024}$ for the point-wise operations and $2^{256}$ for the NTT. Moreover, the output of the permutation network is uniform on its range given uniformly random input.
Since the permutation space is much larger than the one we obtain with the LFSR, we will only generate one $\{0, \ldots, n/2 - 1\} \to \{0, \ldots, n/2 - 1\}$ permutation for the NTT at the start of each decryption. Each stage is then computed in the order defined by this permutation. We also compute only one permutation of size $n$ that will be used for all the point-wise operations during one decryption.
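A minimal Python sketch of such a network of controlled swaps is shown below. The convention used here (a control bit of 0 sends the even-indexed element of a pair to position $i$ and the odd-indexed one to $i + N/2$, a control bit of 1 swaps them) is an assumption of the example, and `secrets.randbits` stands in for the PRNG. The exhaustive loop for $N = 8$ illustrates the injectivity argument.

```python
import itertools
import secrets


def permutation_network(N, bits=None):
    """Generate a permutation of range(N) with log2(N) stages of N/2 swaps.

    `bits` is the list of k*N/2 control bits; fresh random bits are drawn
    when it is omitted.
    """
    k = N.bit_length() - 1
    if bits is None:
        bits = [secrets.randbits(1) for _ in range(k * N // 2)]
    control = iter(bits)
    x = list(range(N))
    for _ in range(k):
        y = [0] * N
        for i in range(N // 2):
            first, second = x[2 * i], x[2 * i + 1]
            if next(control):                  # controlled swap
                first, second = second, first
            y[i] = first                       # sent to position i
            y[i + N // 2] = second             # sent to position i + N/2
        x = y
    return x


perm = permutation_network(256)                # one permutation of size n = 256

# Exhaustive check for N = 8: 2^12 control words, all permutations distinct.
distinct = {tuple(permutation_network(8, list(b)))
            for b in itertools.product((0, 1), repeat=12)}
```

The set `distinct` has $8^{4} = 4096$ elements, matching the count $N^{N/2}$ for $N = 8$.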
Randomization using Redundant Number Representation
In RSA and ECC, some exponent or scalar randomization countermeasures have been proposed against SCAs (see for instance [7] ). A secret exponent or scalar can be randomized without loss of information by adding a random multiple of the group order to it. The corresponding power traces are thus randomized, removing correlation between the side channel traces and the secret key. A similar concept can be applied to RLWE.
We can add random multiples of the modulus $q$ to the secret key coefficients without invalidating the secret key. This is done at the start of each decryption. The PRNG is used to generate small $r$-bit random numbers for some integer parameter $r$. These numbers are multiplied by $q$ and then added to the input and to the secret key. We then continue using arithmetic operations in $\mathbb{Z}/(2^r q)\mathbb{Z}$ instead of in $\mathbb{Z}_q$. The fact that for all $a, b \in \mathbb{Z}$ we have $(ab \bmod 2^r q) \equiv ab \pmod{q}$ ensures that the result is in the correct equivalence class.
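The arithmetic fact behind this countermeasure can be checked with a few lines of Python. The sketch assumes that the random multiple of $q$ is drawn with `secrets.randbelow` in place of the PRNG; the function names and the sample values are illustrative.

```python
import secrets


def randomize(x, q, r):
    """Redundant representative of x: add a random r-bit multiple of q."""
    return x % q + secrets.randbelow(1 << r) * q


def mul_redundant(a, b, q, r):
    """Multiplication in Z/(2^r q)Z, with no reduction modulo q."""
    return a * b % ((1 << r) * q)


q, r = 7681, 4
c, s = 1234, 5678                              # input and secret coefficients
c_red, s_red = randomize(c, q, r), randomize(s, q, r)
product = mul_redundant(c_red, s_red, q, r)    # changes from run to run
```

The value of `product` differs between executions, while its residue modulo $q$ is always $c \cdot s \bmod q$.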
The redundancy is not removed for the decoding. Instead we modify the algorithm to decode the coefficients directly from $\mathbb{Z}/(2^r q)\mathbb{Z}$ to $\{0, 1\}$. The new decoder returns 0 if the input lies in the union of sets
Validation through Correlation Power Analysis Simulations. We evaluate the robustness of our countermeasure based on a redundant representation by simulating CPAs. The polynomial multiplication in the NTT domain consists of $n$ independent multiplications in $\mathbb{Z}/q\mathbb{Z}$. They are of the form $c \cdot s \bmod q$, where $s$ is a coefficient of the secret key and $c$ is a coefficient of the input ciphertext. We simulate correlation attacks on one modular multiplication of a known input coefficient $c$ with an unknown secret key coefficient $s$. We assume that the attacker observes the modular multiplication $c \cdot s \bmod q$ for a number of different (known) inputs $c$. For each modular multiplication she/he obtains the exact Hamming weight (HW) of the result. The attacker computes the "predictions": the HW of the product of $c$ with each subkey candidate in $\mathbb{Z}_q$, reduced modulo $q$, for all inputs $c$. For each subkey candidate, she/he then computes Pearson's correlation coefficient between the observed HW and the predictions. Without countermeasures, the highest correlation is obtained for the correct subkey guess.
With our countermeasure, the inputs are randomized by adding a multiple of q and are then used in computations in Z/(2^r q)Z for some redundancy parameter r. The impact of our countermeasure on the effectiveness of the CPA can be seen (for q = 7681) in Figure 7. Without redundancy (r = 0), the attacker observes the exact HW of the value c · s mod q for different values of c. These HWs coincide with the predictions for the correct subkey guess, resulting in a correlation coefficient of 1. For higher levels of redundancy, the average of the correlation coefficient for the correct subkey guess decreases.
The right side of Figure 7 shows that the maximum correlation is obtained for incorrect subkey guesses for all r ≥ 1. We refer to subkey guesses that yield a higher correlation coefficient than the correct subkey guess as false positives. The number of these false positives increases with the redundancy level. For r = 8 and r = 9 there are on average around q/2 subkey guesses, for one coefficient of the secret key, that yield a higher correlation coefficient than the correct key guess. Our countermeasure ensures that an exponential number of up to (q/2)^n guesses have to be tested to recover the complete secret key.
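The effect of the redundancy on the correct-guess correlation and on the number of false positives can be explored with a small simulation. The sketch below is an assumption-laden illustration, not the code behind Figure 7. It supposes that the attacker observes the HW of the redundant product in Z/(2^r q)Z while still predicting HW(c · s mod q), and it tests only a random subset of subkey candidates to keep the run time short. All names and parameter values are hypothetical.

```python
import math
import random

Q = 7681  # modulus used in the CPA simulations of this section


def hw(x):
    """Hamming weight of a non-negative integer."""
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation of two integer lists (0.0 if one is constant)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    den2 = (n * sxx - sx * sx) * (n * syy - sy * sy)
    if den2 == 0:
        return 0.0
    return (n * sxy - sx * sy) / math.sqrt(den2)


def observe(c, s, r, rng, q=Q):
    """HW of the product of freshly randomized operands in Z/(2^r q)Z."""
    c_red = c + rng.randrange(1 << r) * q
    s_red = s + rng.randrange(1 << r) * q
    return hw(c_red * s_red % ((1 << r) * q))


def experiment(r, secret=1234, traces=200, n_candidates=500, seed=7):
    """Return (correlation of the correct guess, number of false positives)."""
    rng = random.Random(seed)
    inputs = [rng.randrange(1, Q) for _ in range(traces)]
    observed = [observe(c, secret, r, rng) for c in inputs]

    def corr_for(guess):
        return pearson([hw(c * guess % Q) for c in inputs], observed)

    correct = corr_for(secret)
    wrong = [g for g in rng.sample(range(1, Q), n_candidates) if g != secret]
    # A false positive is a wrong guess that correlates better than the secret.
    false_pos = sum(1 for g in wrong if corr_for(g) > correct + 1e-12)
    return correct, false_pos
```

Comparing `experiment(0)` with `experiment(r)` for increasing r gives a rough view of the trend discussed above. The exact figures depend on the seed, the number of traces and the candidate subset, so they should not be read as a reproduction of the reported results.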
Comparison of all Protections
FPGA implementation results for RLWE solutions with various countermeasures are presented in Table 2. Results from [25] are reported, and we also reimplemented their solution on an Artix-7 XC7A200 (denoted "A7") to provide fair comparisons. We also implemented the blinding and shifting methods from [27] and our shuffling methods. To the best of our knowledge, these are the first FPGA implementations for these countermeasures. Finally, the results for masking with our new masked decoder and our redundant randomized countermeasures are reported. The amount of randomness added for each countermeasure is specified in the second column of the table.

We cannot directly compare our re-implementation of the masking from [25] and their original results on a Virtex-II XC2VP7 (denoted "V2"). However, it can be seen that the impact of masking on the performance of their V2 implementation is very high compared to our A7 re-implementation. The computation time for decryption is tripled. This is probably because the number of arithmetic operations in Z_q is doubled while no parallelism is used. Moreover, it seems that their masked decoder is implemented sequentially. In our re-implementation of the masked decoder from [25], we use parallelism to significantly reduce the performance penalty of their 16-step decoder. This increases the area.
Our new masked decoder is relatively simple and requires a small area (about 20% reduction compared to the re-implementation of the decoder from [25]), while nearly matching the performance of the unprotected implementation. Compared to the unprotected solution, we use extra DSP blocks and BRAMs to compute the decryptions of the two shares in parallel.
The blinding implementation gives a slightly slower solution. Its area overhead is smaller than for both masking techniques. However, we stress that this blinding countermeasure should be used in combination with another countermeasure (as specified in [18]), since the blinding factor is removed before the decoding step. The shifting implementation yields a similar overhead (although with a lower frequency) and its combination with blinding seems to be worthwhile. The permutation network is relatively costly in area. The LFSR loop counter is cheaper and slightly faster.
Finally, our redundant randomized countermeasure does not need additional DSPs or BRAMs for small redundancy parameters (r ≤ 4) and can therefore be used as a cheap way to secure the decryption. For higher redundancy levels, the multiplication cannot be computed within a single 18 × 25-bit multiplier such as those hardwired in the Artix DSP blocks, and a few additional DSP blocks and BRAMs are needed.
Conclusion
In this work, we compared several countermeasures against SCAs for RLWE from [25] and [27], and proposed new ones. Our first proposed countermeasure is an adaptation of [25] with a new masked decoder which is deterministic. Our second one uses a redundant representation to randomize polynomial coefficients. We also implemented two different methods for shuffling. All the countermeasures (from the literature and ours) have been implemented on FPGA to evaluate the overhead compared to a common reference implementation on the same FPGA. Our new decoder uses over 20% fewer slices and LUTs than the one from [25]. To the best of our knowledge, we also present the first FPGA implementations of the blinding and shifting countermeasures from [27], and of a combination of the two. Finally, our protection based on redundancy at ring level provides a cheap randomization method with an adjustable security/overhead trade-off.
In the future, we will explore other types of architectures, operators, algorithms and countermeasures (e.g. at architecture level). We also plan to use our solutions in the context of application benchmarks and evaluate their security against SCAs using a hardware setup under development in our research group.
