Abstract-As the development of a viable quantum computer nears, existing widely used public-key cryptosystems, such as RSA, will no longer be secure. Thus, significant effort is being invested into post-quantum cryptography (PQC). Lattice-based cryptography (LBC) is one such promising area of PQC, which offers versatile, efficient, and high-performance security services. However, the vulnerability of LBC implementations to side-channel attacks (SCA) remains significantly understudied. Most, if not all, lattice-based cryptosystems require noise samples generated from a discrete Gaussian distribution, and a successful timing analysis attack can break the whole cryptosystem, making the discrete Gaussian sampler the module most vulnerable to SCA. This research proposes countermeasures against timing information leakage, with FPGA-based designs of CDT-based discrete Gaussian samplers with constant response time, targeting encryption and signature scheme parameters. The proposed designs are compared against the state of the art and are shown to significantly outperform existing implementations. For encryption, the proposed sampler is 9x faster than the only other existing time-independent CDT sampler design. For signatures, the first time-independent CDT sampler in hardware is proposed.
I. INTRODUCTION
Cryptography is one of the most important tools for protecting information sent across public networks, using digital signatures and encryption. This security, currently supported by the hardness of factoring large integers (RSA) and the discrete logarithm problem (elliptic-curve cryptography), may soon be under threat from the possible construction of quantum computers. Indeed, the NSA and CESG have both indicated a need to transition towards quantum-resistant algorithms [1], [2]. Protecting secure communications against quantum attacks is vital, and thus several post-quantum (or quantum-resilient) constructions have been proposed to protect technologies such as cloud security and the Internet of Things. Lattice-based cryptography (LBC) is arguably the most promising when compared to other post-quantum cryptosystems, as it offers extended functionality and worst-case to average-case hardness, whilst being more efficient for both encryption and digital signature schemes [3].
However, the real-world practicality of LBC should be considered, including suitable countermeasures against side-channel analysis (SCA). A NIST call [4] requests new quantum-resilient algorithms that offer resistance to SCA. The most vulnerable component of lattice-based cryptosystems is the generation of randomness, typically discrete Gaussian randomness, used to mask computations involving the secret key and plaintext data. Unfortunately, discrete Gaussian samplers are highly susceptible to timing-analysis attacks, due to their non-constant run-time [5]. There has been little research into the resilience of lattice-based cryptographic implementations to physical attacks; only Roy et al. have investigated masking [6] and side-channel secure discrete Gaussian sampling [7].
This research proposes a timing-attack resilient hardware design of a discrete Gaussian sampler, adopting the cumulative distribution table (CDT) [8] technique. Practical FPGA designs of novel CDT-based constant response time samplers, with appropriate practical parameters for both encryption and signatures, are presented.
II. BACKGROUND
A. Lattice-based Cryptography
One foundational work that underpins LBC is the learning with errors (LWE) problem [9]. Cryptosystems based upon the LWE problem enjoy worst-case to average-case hardness; they are provably infeasible to break unless all instances of certain lattice problems are easy to solve [10], [11]. Lattice-based cryptosystems built on LWE are thus at least as hard to break as the problem in Definition 1 is to solve.
Definition 1 (LWE): For positive integers $n$ and $q \geq 2$, a secret $\mathbf{s} \in \mathbb{Z}_q^n$, and a probability distribution $\chi$ on $\mathbb{Z}_q$, let $A_{\mathbf{s},\chi}$ be the LWE distribution obtained by choosing $\mathbf{a} \xleftarrow{\$} \mathbb{Z}_q^n$ uniformly at random and a noise term $e \leftarrow \chi$, and outputting $(\mathbf{a}, b = \langle \mathbf{a}, \mathbf{s} \rangle + e) \in \mathbb{Z}_q^n \times \mathbb{Z}_q$. The decisional LWE problem is defined as follows: given access to $m$ independent samples, distinguish with noticeable probability between samples chosen according to $A_{\mathbf{s},\chi}$ and samples chosen uniformly at random.
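As an illustration of Definition 1, the following minimal Python sketch draws one LWE sample. The names `lwe_sample` and `toy_noise`, and the toy parameters, are assumptions made for this example only; in particular, `toy_noise` is a stand-in for $\chi$ and is not a discrete Gaussian.

```python
import secrets

def lwe_sample(s, q, noise):
    """Return one LWE sample (a, b) with b = <a, s> + e mod q."""
    n = len(s)
    a = [secrets.randbelow(q) for _ in range(n)]   # a uniform in Z_q^n
    e = noise()                                    # e drawn from chi
    b = (sum(ai * si for ai, si in zip(a, s)) + e) % q
    return a, b

def toy_noise():
    """Hypothetical placeholder for chi: uniform on {-1, 0, 1}."""
    return secrets.randbelow(3) - 1

# Toy parameters (assumed for illustration, far too small to be secure)
q, n = 257, 8
s = [secrets.randbelow(q) for _ in range(n)]
a, b = lwe_sample(s, q, toy_noise)
```

In a real scheme the noise callable would be a discrete Gaussian sampler, which is exactly the component whose timing behaviour this paper addresses.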
B. Discrete Gaussian Sampling
The error distribution $\chi$ in Definition 1 is almost always defined as a discrete Gaussian distribution†. The centered discrete Gaussian distribution $D_{\mathbb{Z},\sigma}$ over $\mathbb{Z}$, with standard deviation $\sigma$, is defined such that a value $x \in \mathbb{Z}$ is sampled from $D_{\mathbb{Z},\sigma}$ with probability $\rho_\sigma(x)/\rho_\sigma(\mathbb{Z})$, where:
$$\rho_\sigma(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right), \qquad \rho_\sigma(\mathbb{Z}) = \sum_{x \in \mathbb{Z}} \rho_\sigma(x).$$
Writing $S_\sigma = \rho_\sigma(\mathbb{Z}) \approx \sqrt{2\pi}\,\sigma$, the probability of sampling $x \in \mathbb{Z}$ from the distribution $D_{\mathbb{Z},\sigma}$ is calculated as $\rho_\sigma(x)/S_\sigma$.
† With the exception of key-exchange protocols such as [12], which are able to employ the slightly less "normal" binomial distribution.
978-1-5090-5602-6/16/$31.00 © 2016 IEEE
1) Exploiting Symmetry: To halve the initial memory requirement, one can consider the distribution over $\mathbb{Z}^+$, proportional to $\rho_\sigma(x)$ for all $x > 0$. For $x = 0$, $\rho_\sigma(0)$ is halved, since zero would otherwise be counted twice once a sign is attached. The full distribution is recovered by adding a random sign bit after sampling.
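The definitions above can be checked numerically. The sketch below uses ordinary floating-point arithmetic (not the $\lambda$-bit precision a real sampler needs) and helper names (`rho`, `half_distribution`) chosen purely for illustration. It assumes $\sigma = 3.33$ and truncates the support at $\lceil \tau\sigma \rceil$ with $\tau = 9.42$, the encryption parameters used in this paper.

```python
import math

def rho(x, sigma):
    """Gaussian weight rho_sigma(x) = exp(-x^2 / (2 sigma^2))."""
    return math.exp(-(x * x) / (2.0 * sigma * sigma))

def half_distribution(sigma, tau):
    """Probabilities over {0, ..., ceil(tau*sigma)}, with the weight at 0 halved."""
    bound = math.ceil(tau * sigma)
    weights = [rho(x, sigma) for x in range(bound + 1)]
    weights[0] /= 2.0          # x = 0 would otherwise be counted twice
    total = sum(weights)
    return [w / total for w in weights]

sigma, tau = 3.33, 9.42        # assumed: encryption parameters from this paper
bound = math.ceil(tau * sigma)

# S_sigma = rho_sigma(Z), truncated at the tail bound; close to sqrt(2*pi)*sigma
S_sigma = 1 + 2 * sum(rho(x, sigma) for x in range(1, bound + 1))
probs = half_distribution(sigma, tau)
```

Under these assumptions, `probs[0]` equals $1/S_\sigma$ up to rounding, which is the probability of zero in the full two-sided distribution, so attaching a random sign afterwards does not distort the weight at $x = 0$.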
2) Practical Discrete Gaussian Parameters: The statistical distance between the "perfect" theoretical discrete Gaussian distribution and the "practical" sampled distribution should be no greater than $2^{-\lambda}$. It is recommended [13] that the precision need be no greater than $\lambda/2$ bits for a target security level of $\lambda$ bits, as it is argued that no algorithm can distinguish between a "perfect" sampler and one with statistical distance $2^{-\lambda/2}$. Two important cryptographic applications for the samplers are targeted: encryption and digital signature schemes. The parameters are taken from the encryption scheme by Lindner and Peikert [14] (LP)‡, where $(\sigma, \lambda, \tau) = (3.33, 64, 9.42)$, and the digital signature scheme by Ducas et al. [15] (BLISS), where $(\sigma, \lambda, \tau) = (215, 64, 9.42)$. Here $\tau$ is the tail-cut factor, so that samples are bounded in magnitude by $\tau\sigma$.
3) Gaussian Convolution: The standard deviation required of the sampler can be significantly decreased by using Peikert's convolution lemma [18], as adapted by Pöppelmann et al. [19]. Referring to [8] and [19] for the formal definitions of the smoothing parameter $\eta$ and the Kullback-Leibler divergence respectively, the adaptation states:
Lemma 1: [...] for some positive reals $\sigma_1, \sigma_2$ and let $\sigma^{-2}$ [...]
Proof: See [19]. Lemma 1 holds for $\sigma_{BLISS} = 215$ by setting $k = 11$, such that $\sigma' = \sigma/\sqrt{1 + k^2} \approx 19.47$. By sampling twice, $x_1, x_2 \leftarrow D_{\mathbb{Z},\sigma'}$, a value $x \leftarrow D_{\mathbb{Z},\sigma}$ can be built as $x = x_1 + k x_2$. Although an additional sample is required, a smaller standard deviation means that the memory consumption of the precomputed tables is significantly reduced: from 130kb for $\sigma_{BLISS} = 215$ to 11.74kb for $\sigma' = 19.47$.
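The arithmetic behind these figures can be reproduced with a short Python sketch. The variable names are illustrative. The memory estimate assumes 64-bit table entries, $\tau\sigma$ entries per table, and 1 kb = 1000 bits; these are assumptions that happen to reproduce the quoted numbers, not a statement of how the authors computed them.

```python
import math

sigma_bliss = 215.0
k = 11
tau = 9.42

# Reduced standard deviation: sigma' = sigma / sqrt(1 + k^2)
sigma_prime = sigma_bliss / math.sqrt(1 + k * k)

# Standard deviation of x = x1 + k*x2 for independent x1, x2, each treated
# as having variance sigma_prime^2 (an approximation for discrete Gaussians)
combined_sigma = math.sqrt(sigma_prime ** 2 + (k * sigma_prime) ** 2)

# Table sizes with and without convolution
entries_direct = math.ceil(tau * sigma_bliss)
entries_conv = math.ceil(tau * sigma_prime)

# Assumed memory model: tau*sigma entries of 64 bits each, 1 kb = 1000 bits
kb_direct = tau * sigma_bliss * 64 / 1000
kb_conv = tau * sigma_prime * 64 / 1000
```

With these assumptions the convolved table has fewer than 256 entries, which is consistent with the 8-bit address port used by the signature sampler in Section IV.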
III. THE CUMULATIVE DISTRIBUTION TABLE SAMPLER
The cumulative distribution table (CDT) sampler requires a precomputed table of discrete Gaussian cumulative distribution function (CDF) values [8]. CDT sampling is arguably more promising than other discrete Gaussian sampling schemes, as the distribution parameters are known in advance. The CDF values lie in the range $[0, 1]$ and are stored in a look-up table $S[x]$, where the total number of table entries is at least:
$$N = \tau \times \sigma.$$
CDT sampling works as follows: sample $r$ uniformly at random from $[0, 1)$, with $\lambda$ bits of precision. $r$ is compared against the CDF table contents to find an interval such that:
$$S[x] \leq r < S[x + 1],$$
where $x$ is output as the required discrete Gaussian sample, occurring with probability $S[x + 1] - S[x]$.
‡ The same parameters are used for implementing the ring-LWE encryption scheme of Lyubashevsky et al. [16]; see [17] for a hardware implementation.
For comparisons, binary search is chosen, as detailed in Algorithm 1. Pointers min, max, and mid point to the first, last, and middle entries of the search space, respectively. $r$ is iteratively compared to the middle value of the search space, S[mid], and the upper or lower half of the search space is discarded depending on the comparison result. For a finite search space with $N$ entries, the number of comparisons required before a match does not exceed $\lceil \log_2(N) \rceil$.
Algorithm 1 CDT Sampling from D Z,σ via Binary Search
Require:
1: Three integers min, mid, and max. 
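A minimal Python sketch of CDT sampling by binary search, in the spirit of Algorithm 1 but not the authors' listing, is given here. The function names, the use of the `decimal` module for table precision, and the convention that `table[x]` holds $\lfloor 2^\lambda \cdot \Pr[X < x] \rfloor$ for the halved distribution are all assumptions of this example. Note that this plain version makes no attempt to run in constant time.

```python
import math
import secrets
from decimal import Decimal, getcontext

getcontext().prec = 50   # assumed: ample decimal digits for 64-bit entries

def build_cdt(sigma, tau, lam):
    """Return S with S[x] = floor(2^lam * P(X < x)) for the halved distribution."""
    n = math.ceil(tau * sigma)
    s2 = 2 * Decimal(str(sigma)) ** 2
    weights = [(-(Decimal(x) ** 2) / s2).exp() for x in range(n)]
    weights[0] /= 2                  # halve rho(0); the sign bit restores symmetry
    total = sum(weights)
    table, acc = [], Decimal(0)
    for w in weights:
        table.append(int(acc / total * (1 << lam)))
        acc += w
    return table                     # table[0] == 0, entries non-decreasing

def cdt_search(table, r):
    """Return the largest x with table[x] <= r (so S[x] <= r < S[x+1])."""
    lo, hi = 0, len(table) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if table[mid] <= r:
            lo = mid                 # discard the lower half
        else:
            hi = mid - 1             # discard the upper half
    return lo

def cdt_sample(table, lam):
    """Draw a lam-bit uniform r, search the table, then attach a random sign."""
    r = secrets.randbits(lam)
    x = cdt_search(table, r)
    return -x if secrets.randbits(1) else x

table = build_cdt(3.33, 9.42, 64)    # encryption parameters from the text
x = cdt_sample(table, 64)
```

Representing $r$ and the table entries as $\lambda$-bit integers is equivalent to comparing values in $[0, 1)$ scaled by $2^\lambda$, which is how a hardware comparator would see them.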
A. Previous Work
The use of CDT sampling was proposed by Peikert [8] , and adapted by Ducas et al. [15] . Pöppelmann et al. [19] implemented BLISS on reconfigurable hardware and suggested optimisations for the CDT sampler, including hashing to reduce the search space and skipping the leading zero storage to reduce the table size by a factor of 2. Du and Bai [20] further optimised hardware area by using piecewise comparison and hashing. These optimisations reduce the precomputed table size, and improve throughput. However, hashing divides the search intervals into irregular sizes, meaning the binary search has non-constant execution time, making it susceptible to timing analysis attacks.
The only CDT sampler design on an FPGA with constant-time throughput is by Pöppelmann and Güneysu [21]. This fully pipelined design offers a throughput of one sample per cycle, but the large number of parallel comparisons renders it impractical. Roy et al. [7] presented a hardware design of a discrete Gaussian sampler resistant against timing attacks, using a Knuth-Yao sampler that generates a batch of samples, which are subsequently shuffled to disassociate the related timing information. However, this design is non-constant time and only suitable for small standard deviations.
B. Timing Attack Vulnerabilities
Side-channel attacks are physical attacks, based on information gained from the physical implementation of a cryptosystem. To date, little research has been conducted on the vulnerabilities of LBC implementations to physical attacks; efforts so far are summarised by Hodgers et al. [22]. Timing analysis attacks are highly algorithm-specific in nature: they exploit the dependency between the execution time of an algorithm and its secret internal states. Attacks on LBC constructions are emerging [5], [23]. The standard countermeasure against timing attacks is to guarantee an execution time independent of the secret values [24]. This can be achieved by ensuring constant response time [21] or by subsequent random shuffling of the secret values [7]. The following definition of time independence is used in this research:
Definition 2 (Time independence): A program achieves the property of independent-time when no information about the secret value(s) is leaked by the timing of the program.
IV. CONSTANT-TIME CDT HARDWARE ARCHITECTURES
A constant-time implementation of a CDT sampler is achievable by searching the table in a fixed number of clock cycles. Algorithm 1 shows that an early termination is possible if the comparison of the uniformly sampled $r$ and S[mid] returns an equality. This exact match occurs with a small probability of $2^{-\lambda} N$. Early termination is avoided by not checking for an exact match of $r$ and S[mid] separately. Hence, the binary search is always bounded between $\lfloor \log_2(N) \rfloor$ and $\lceil \log_2(N) \rceil$ iterations of the for loop. Consequently, where $N$ is a power of two, the algorithm executes in exactly constant time; for all other $N$, the algorithm is tweaked to occasionally perform an extra iteration, so that the iteration count is fixed at $\lceil \log_2(N) \rceil$.
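The fixed-iteration idea can be modelled in software as follows. This sketch uses a bitwise variant of binary search that always performs exactly $\lceil \log_2(N) \rceil$ comparisons with no early exit. It is an illustrative substitute, not the authors' state machine, and the function name is hypothetical. Python itself gives no timing guarantees, so the sketch only demonstrates that the iteration count is independent of $r$.

```python
def fixed_iter_search(table, r):
    """Binary search with exactly ceil(log2(N)) comparisons and no early exit.

    Returns (x, iterations), where x is the largest index with table[x] <= r.
    Assumes table[0] == 0 and that the table is non-decreasing.
    """
    n = len(table)
    steps = (n - 1).bit_length()          # equals ceil(log2(n)) for n >= 1
    x = 0
    iterations = 0
    for i in reversed(range(steps)):
        candidate = x | (1 << i)          # tentatively set bit i of the index
        in_range = candidate < n
        # Out-of-range candidates are treated as "too large"; a dummy read
        # keeps one table access per iteration
        probe = table[candidate] if in_range else table[n - 1]
        take = in_range and probe <= r
        x = candidate if take else x
        iterations += 1
    return x, iterations

# Toy table with N = 8 entries (assumed values, for illustration only)
toy_table = [0, 10, 20, 30, 40, 50, 60, 70]
index, count = fixed_iter_search(toy_table, 35)
```

For the encryption table with $N = 32$ this loop runs 5 times for every input, and for a table of up to 256 entries it runs 8 times, matching the cycle counts quoted for the two hardware designs.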
A. Trivium as a PRNG
The CDT sampler requires uniformly distributed samples, for which Trivium [25] is selected, due to its versatility. It is a synchronous, binary stream cipher with a 288-bit internal state. To achieve a large number of uniformly random bits per clock cycle, the Trivium modules are unrolled. The additional resources for unrolled designs compared to standard Trivium×1 are small: 28 additional LUTs, 63 additional flip-flops, and 15 additional slices for Trivium×8, and 26 additional LUTs, 147 additional flip-flops, and 21 additional slices for Trivium×32, on a Spartan-6 LX25-3 device, post place-and-route.
B. Proposed Constant-Time CDT Sampler For Encryption
Figure 1a illustrates the proposed constant-time CDT sampler for encryption. The CDF table S[·] consists of N = 32 (τ × σ) entries, with λ = 64. A single-ported ROM with a 5-bit address port and a 64-bit data port is employed. Trivium×64 is used for uniform sample generation, with module initialisation (key setup, IV setup, and the randomisation phase) handled externally at startup, and thereafter controlled by the binary search state-machine, BinSearch. Uniform samples are only generated when required, saving circuit power.
The BinSearch state-machine begins in the SET state, resetting the three pointers (min, mid, and max) to their initial values (as in Algorithm 1). It transitions unconditionally to the SEARCH state in the next clock cycle, where the three pointers are updated according to the result of the 64-bit comparison. After exactly 5 cycles the search completes, and the state-machine asserts a single-bit hit signal. The Trivium module is activated by this hit to supply a new uniformly random 64-bit value. The buffered mid is combined with a random bit b, which attaches a sign to the generated discrete Gaussian sample x.
C. Proposed Constant-Time CDT Sampler For Signatures
The proposed CDT sampler for signatures uses two state-machines, BinSearch0 and BinSearch1, to parallelise two independent searches (see Figure 1b). Since most FPGA devices have dual-ported BRAMs, the CDF table for both state-machines can be accessed from one BRAM. Each state-machine has an 8-bit address port and a 64-bit data port. The two state-machines each receive a 64-bit uniformly random number, r0 and r1 respectively, from the PRNG, in two consecutive clock cycles. During the next 8 clock cycles, r0 and S[mid0] are compared in BinSearch0 and the three pointers are updated. Simultaneously, r1 is processed by BinSearch1. The state-machines work independently to generate two independent random samples $x_0$ and $x_1$ in 8 clock cycles; the two samples are then combined as $x = x_0 + 11 x_1$, and finally a sign bit is assigned.
V. RESULTS, EVALUATION AND COMPARISON
This section presents the post place-and-route performance results and a comparison of the proposed samplers with existing sampler implementations for encryption and signatures. Xilinx ISE 14.7 is used, and where possible comparable implementations have been re-run on the same FPGA device in order to compare the results fairly. Throughput and throughput per area (TPAR) have been evaluated for all schemes, in terms of sampling operations per second (Ops/s) and sampling operations per second per slice (Ops/s/S).
Table I gives the CDT sampler resource consumption for both encryption and signature parameters. A low-cost Spartan-6 FPGA is targeted, and low-area and balanced results are presented, where the area-optimised designs employ BRAM, unlike the balanced designs, which offer higher running frequencies.
Table II shows performance results of the samplers, compared with existing CDT hardware samplers. For encryption parameters, a single-ported distributed ROM composed of LUTs is used in the design without BRAM. The slice resources can be significantly reduced if BRAMs are utilised; however, this reduction comes at the cost of a lower operating frequency. The only other constant-time CDT implementation for encryption is by Pöppelmann and Güneysu [21], which generates a single sample per cycle. However, it runs at a 4x lower clock frequency and uses 5x as many slices, and thus this research offers a more lightweight, constant-time alternative. The CDT sampler design by Du and Bai [20] is lightweight, but it targets encryption only and does not run in independent-time. For signature parameters, the implementation by Pöppelmann et al. [19] operates in non-constant time, yet has a lower throughput per slice than this work. Thus the CDT sampler proposed in this research is preferable for practical implementations.
VI. CONCLUSION
In this research, two independent-time hardware designs of a discrete Gaussian CDT sampler are proposed, suitable for encryption and signature applications, with a focus on low area footprint and high throughput. Resistance against timing attacks is achieved by ensuring constant execution time. Moreover, the proposed encryption sampler is 9x faster than the only other existing time-independent CDT sampler design, and the proposed signature sampler is the first time-independent CDT sampler in hardware.
