Abstract-Lattice-based cryptography is one of the most promising branches of quantum resilient cryptography, offering versatility and efficiency. Discrete Gaussian samplers are a core building block in most, if not all, lattice-based cryptosystems, and optimised samplers are desirable both for high-speed and low-area applications. Due to the inherent structure of existing discrete Gaussian sampling methods, lattice-based cryptosystems are vulnerable to side-channel attacks, such as timing analysis. In this paper, the first comprehensive evaluation of discrete Gaussian samplers in hardware is presented, targeting FPGA devices. Novel optimised discrete Gaussian sampler hardware architectures are proposed for the main sampling techniques. An independent-time design of each of the samplers is presented, offering security against side-channel timing attacks, including the first proposed constant-time Bernoulli, KnuthYao, and discrete Ziggurat sampler hardware designs. For a balanced performance, the Cumulative Distribution Table (CDT) sampler is recommended, with the proposed hardware CDT design achieving a throughput of 59.4 million samples per second for encryption, utilising just 43 slices on a Virtex 6 FPGA and 16.3 million samples per second for signatures with 179 slices on a Spartan 6 device.
INTRODUCTION
E SSENTIALLY all asymmetric cryptographic primitives used for secure Internet communications, such as RSA and ECDSA, will be rendered completely insecure with the practical development of a quantum computer, by virtue of Shor's algorithm [1] . Even symmetric-key encryption schemes, such as the advanced encryption standard (AES), will have a quadratic brute-force speed-up via Grover's algorithm [2] , [3] -decreasing a search space of Oð2 n Þ to Oð2 n=2 Þ-making AES-128 as secure as AES-64. There are now emerging branches of cryptography resistant to these quantum reductions, namely, quantum-resilient or postquantum cryptography.
Post-quantum cryptography deals with non-quantum operations but is theoretically strong against cryptanalysis on classical and quantum computers. Recently, post-quantum cryptography has seen significant advancements in many areas, due to progress in advanced frameworks such as the Internet of Things as well as the need for long-term, highly secure, quantum-resilient cryptographic primitives. Such progress has led to growth in novel cryptographic constructions such as encryption schemes and signature schemes, as well as more advanced schemes such as fully homomorphic encryption, which can process on encrypted data and can be used to secure practical cloud infrastructures. Furthermore, with the announcement that the governmental agencies are planning transitions towards quantum-resistant algorithms [4] , with CESG focusing on post-quantum cryptography as opposed to quantum technologies [5] , it is clear how important these drop-in replacements are for classical cryptosystems.
Lattice-based cryptography (LBC) [6] , [7] , a subtype of post-quantum cryptography, bases its hardness assumption on finding the shortest (or closest) vector in a lattice, which is to date resilient to attacks by a quantum computer. LBC is very promising as it offers extended functionality whilst being more efficient than ECC and RSA based primitives of public-key encryption [8] and digital signature schemes [9] , [10] , at the cost of larger key sizes. Lattice-based cryptoschemes are usually founded on either the learning with errors problem (LWE) [7] or the short integer solution problem (SIS) [6] , which are based on standard lattices, or their equivalent ring variants ring-LWE and ring-SIS [11] , [12] , [13] . These ring-variants enable the use of number theoretic transform (NTT) operations for polynomial multiplication, as opposed to matrix-vector multiplications, resulting in significant savings in computational efficiency. For nearly a decade the concepts remained solely academic, due to inefficiencies such as large key sizes. However, LBC primitives are now viable for practical implementation (see the FPGA design [9] of BLISS [14] ) and many compete well with equivalent, classical schemes.
As LBC starts to become feasible for being deployed in the real world, suitable countermeasures against physical cryptanalysis, including side-channel analysis (SCA), become a requirement, which is echoed by a recent call by NIST for quantum-resistant algorithms resistant to SCA attacks [15] .
The key building blocks in LBC include matrix-vector multiplication for standard lattice schemes, polynomial multiplication for ideal lattice schemes, and discrete Gaussian sampling. Arguably the most vulnerable module within modern lattice-based constructions is the discrete Gaussian sampler, which, when successfully attacked, renders the whole cryptosystem broken [16] . To date, there has been little research into the SCA-resilience of lattice-based cryptographic implementations (with only Roy et al. [17] , [18] as a preface). The rationale for employing discrete Gaussian samplers (as opposed to another probability distribution) is that they allow for more efficient implementations, with smaller output sizes such as ciphertexts or signatures. Moreover, they are used to add "noise" onto values that would otherwise give away secret information via Gaussian elimination. They are, however, highly susceptible to timing analysis attacks due to their inherent non-constant run-time (due to the normalised structure and/or speed optimisations).
This research proposes practical and time-independent hardware implementations of all discrete Gaussian samplers used in lattice-based encryption and signature schemes 1 , that is, the Bernoulli sampler [14] , the cumulative distribution table (CDT) sampler [19] , the Knuth-Yao sampler [20] , and the discrete Ziggurat sampler [21] , [22] . The major contributions of this research are as follows:
1) The first comprehensive evaluation of all aforementioned discrete Gaussian samplers, which are a core component of lattice-based encryption and signature schemes, with mathematical descriptions of all prominent sampling techniques, discussions of their inherent limitations and strengths, as well as previous comparable hardware implementations. 2) Practical hardware FPGA designs of discrete Gaussian samplers are presented, compared with current state-of-the-art implementations, for appropriate practical parameters, throughput, memory consumption, and resource count. Novel optimisation strategies are proposed for the hardware architectures of the sampling techniques, where the results compete with, and in many cases, significantly outperform, previous implementations.
3) The proposed hardware designs operate in independent-time and therefore provide resistance against timing analysis attacks. 4) Based on the performance results, recommendations are given for the most appropriate sampler to use in particular applications. The paper is outlined as follows: first, prerequisites are given on the discrete Gaussian distribution. Section 3 presents previous work on discrete Gaussian sampling techniques. Section 4 addresses the desirable features each sampler requires to operate in constant-time, and in Section 5 the hardware architectures of each constant-time discrete Gaussian sampler are described. Section 6 presents the performance results of the samplers, compared to previous implementations, and targeted recommendations for specific implementations are given.
PREREQUISITES/PRELIMINARIES

The Discrete Gaussian Distribution
The discrete Gaussian distribution or discrete normal distribution D Z;s over Z with mean 0 and standard deviation s is defined to have a weight proportional to r s ðxÞ ¼ expð r s ðkÞ % ffiffiffiffiffi ffi 2p p s the probability of sampling x 2 Z from the distribution D Z;s is calculated as r s ðxÞ=S s . In this research, the standard deviation s is assumed fixed, thus, it is sufficient to sample from Z þ proportional to rðxÞ for all x > 0 and to set rð0Þ=2 for x ¼ 0, where a sign bit used to output values over Z.
Practical Discrete Gaussian Parameters
The parameters for discrete Gaussian sampling depend significantly on the application. To appropriately evaluate the performance of the samplers, two of the most common parameter sets are considered. These are encryption parameters from Lindner and Peikert [24] (LP) 2 with s ¼ 3:33 (denoted as s LP ) and digital signature parameters from Ducas et al. [14] (BLISS) with s ¼ 215 (denoted as s BLISS ); this choice allows for a more comprehensive analysis, covering both encryption and digital signature schemes. Table 1 shows the parameters required for discrete Gaussian sampling in practice.
Due to the nature of the discrete Gaussian distribution, that is with infinitely long tails and infinitely high precision, it is essential to make practical compromises which also do not hinder the integrity of the scheme. The discrete Gaussian parameters needed are ðs; ; tÞ, representing the sampler's standard deviation, precision and tail-cut respectively, and are detailed as follows:
Å The standard deviation ðsÞ controls the distribution's shape by quantifying the dispersion of data from the mean. The standard deviation depends on the modulus used within in the LWE or SIS scheme. For instance in LWE, should s be too small the hardness assumption may become easier than expected, and if s is too large the problem may not be as welldefined as required.
Å The precision parameter ðÞ governs the level of precision required for an implementation, exacting the statistical distance between the "perfect" theoretical discrete Gaussian distribution and the "practical" to be no greater than 2 À , corresponding directly to the scheme security level. Saarinen [26] recommends that for a scheme with target security level -bits, precision need be no greater than =2, arguing that there exists no algorithm that can distinguish between [14] ð215; 64; 9:42Þ
1. With the Binomial sampler used in [23] currently only applicable for key-exchange protocols.
2. The same parameters are used for implementing the ring-LWE encryption scheme of Lyubashevsky et al. [25] , see [8] for a hardware implementation.
a "perfect" sampler and one with statistical distance 2 À=2 .
Å The tail-cut parameter ðtÞ administers how much of the less-heavy tails can be excluded in the practical implementation, for a given security level. That is, given a target security level of s-bits, the target distance from "perfect" need be no less than 2
Às . Therefore, instead of considering samples in the range jxj 2 f0; 1g, they can be considered in the range jxj 2 f0; stg. Applying the reduction in precision also affects the tail-cut parameter, which is calculated as t ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Â 2 lnð2Þ p .
Gaussian Convolution
Further reductions in memory are possible by virtue of Peikert's convolution lemma [27] , adapted by P€ oppelmann et al. [9] using the Kullback-Leibler divergence. Referring to [9] , [19] for the formal definitions of the smoothing parameter h and Kullback-Leibler divergence respectively, the adaption in [9] states: 
Proof. The proof of this lemma is referred to in [9] . 
SAMPLING TECHNIQUES-BACKGROUND & PREVIOUS WORK
Lattice-based cryptoschemes, based on LWE and SIS problems, require discrete Gaussian noise to mask the scheme's secret key. In this section, the main techniques for generating discrete Gaussian noise are surveyed.
Rejection Sampling
Using rejection sampling or the acceptance-rejection method [28] , it is possible to sample from an arbitrary target distribution f when given access to a bounded probability distribution g. To sample from f, a sample from g is accepted with probability fðxÞ=ðM Á gðxÞÞ, where M is some positive real. When fðxÞ M for all x, the sampler produces the exact distribution f. To use rejection sampling to draw values from the positive half of a discrete Gaussian distribution, a uniformly random integer x 2 f0; . . . ; tsg is chosen and accepted with a probability proportional to r s ðxÞ ¼ exp ðÀx 2 =2s 2 Þ (in this case M ¼ 1). A uniformly random value x is selected, a random value u chosen from ½0; 1Þ and it is checked whether u < r s ðxÞ and thus whether a random u is under (acceptance) or above the curve (rejection) of the probability mass function of the Gaussian distribution. Sampling from the full range of the discrete Gaussian distribution involves rejection of an accepted x ¼ 0 with probability 1 2 and sampling of a sign bit. On average the method requires 2t= ffiffiffiffiffi ffi 2p p trials until a sample is accepted. As this method requires many rejections to achieve an accepted value (an average of % 8 trials) and the costly computation of the exponential function to high precision, rejection sampling is considered inefficient for practical hardware instantiations. The approach in Section 3.1 can be optimised to reduce the amount of rejections. This is achieved in the scheme by Ducas et al. [14] , where Bernoulli distributed variables B c are used, outputting one with probability c 2 ½0; 1, and zero otherwise. The number of rejections are reduced by using a distribution g called binary Gaussian distribution where x is proportional to r s bin ðxÞ ¼ 2
Bernoulli Sampling
with parameter s bin ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1=ð2 ln 2Þ p % 0:849 (see Algorithm 1) which can be constructed on-the-fly. Using the binary Gaussian, an intermediate distribution k Á D Z þ ;s bin þ Uðf0k À 1gÞ is constructed and the correct Gaussian distribution D Z þ ;ks bin is shaped using rejection sampling (Algorithm 2), which reduces the number of rejections to % 1:47 (compared to 8 for classical rejection sampling). For rejection sampling, a value is accepted with probability fðxÞ=ðM Á gðxÞÞ which usually requires explicit computation of fðxÞ. However, it holds for an integer x ¼ P 'À1 i¼0 x i 2 i with x i 2 f0; 1g that:
B expðÀ2 i =fÞ ;
meaning the final rejection step is performed independently, by evaluation of Bernoulli trials, which is efficient given precomputed biases c i for every x i . The method only requires log 2 ð2:4ts 2 Þ bits of storage, and it is shown in [14] Bernoulli sampling has been implemented without using the binary Gaussian distribution [8] for the small standard deviation necessary for lattice-based public-key encryption. A complete hardware implementation of the sampler can be found in the BLISS signature scheme by P€ oppelmann et al. [9] , where the binary Gaussian distribution r s bin ðf0; 1; . . . ; jgÞ ¼ P j i¼0 2 Ài 2 ¼ 1:1001000010000001 . . . is not constructed on-the-fly. Instead they use two 64-bit shift registers to store the expansion precomputed up to a precision of 128 bits, which outputs an x 2 D s bin . Uniformly random values y 2 f0; 1; . . . ; k À 1g are sampled which may require rejection sampling (for instance, the probability of a rejection for k ¼ 254 is 2=256). The pipelined Bernoulli calculation stage takes a ðy; xÞ tuple as input and computes t ¼ kx and outputs z ¼ t þ y as well as j ¼ yðy þ 2tÞ. While z is retained in a register, the Bernoulli evaluation module evaluates the Bernoulli distribution of b B expðj=2s 2 Þ . If and only if b ¼ 1 the value z is passed to the output, and discarded otherwise. The evaluation of B expðx=fÞ requires independent evaluations of Bernoulli variables. Sampling from B c is done by evaluating s < c for a uniformly random s 2 ½0; 1Þ and a precomputed c. The precomputed tables store values c i ¼ expð2i=fÞ, for 0 i l, f ¼ 2s 2 where l ¼ dlog 2 ðmaxðjÞÞe, are stored in a distributed RAM. The B expðÀx=fÞ module then searches for one-bit positions u in j and evaluates the Bernoulli variable B c u . They do this in a lazy manner; the evaluation aborts when the first bit has been found that differs between a random s and c. The technique saves randomness and runtime but incurs non-constant runtime. The chance of rejection is larger for the most significant bits, therefore these are scanned first to abort as quickly as possible. Lastly, a random one-bit is used to "sign" the samples and reject half of the samples where z ¼ 0.
The Bernoulli sampler is suitable for hardware implementations as most operations work on single bits. However, due to the non-constant time behaviour of rejection sampling, buffers were introduced in [9] between each element to allow parallel execution and maximum utilisation of every component. This includes the distribution and buffering of random bits. To reduce the impact of buffering on resource consumption they included LIFO buffers that solely require a single port RAM and a counter, as the ordering of independent random elements does not need to be preserved by the buffer (as in the case with a FIFO). For maximum utilisation they evaluated optimal combinations of sub-modules and implemented two B expðÀx=fÞ modules fed by two instantiations of the Trivium stream cipher to generate pseudo-random bits. To date, no time-independent implementations of the Bernoulli sampler have been reported.
Cumulative Distribution (CDT) Sampling
The Cumulative Distribution 
Once the CDF values S½Á are computed, a sample r is drawn uniformly, r 2 ½0; 1Þ, with bits of precision, where the desired sample, x, is found satisfying interval S½x r < S½x þ 1, occurring with probability r½x ¼ S½x þ 1 À S½x.
Initial table values (close to x ¼ 0) are more probable than values near the end.
As the discrete Gaussian CDF is a sorted table, the binary search algorithm is used to find the position of the target value, (shown in Algorithm 4). The search space, comprising initially of the entire CDT (N samples), is dichotomously exhausted in every iteration of the algorithm. Pointers min and cur point to the first and the middle of the search space, respectively, while jmp maintains the number of search space samples reduced by half. In every iteration of the while loop, r is compared to the middle value (cur) of search spaces, whose upper or lower half is discarded depending on the comparison result. For a bounded interval, the number of comparisons required before a match is found does not exceed dlog 2 ðNÞe. A uniformly sampled bit b is used to dictate the sign of result x. The use of discrete Gaussian sampling based on large pre-computed CDTs was first proposed by Peikert [19] , and adapted by Ducas et al. [14] for use in their signature scheme BLISS. Several reductions to the table size were suggested by P€ oppelmann et al. [9] for a reconfigurable hardware implementation of BLISS. First, the most significant m bits of r can be hashed to reduce the search space according to precomputed guide tables. Choosing m ¼ 8 cuts the 2891 entries initially required for BLISS into 2 m intervals requiring guide tables holding the min and cur pointers. Thus, the complexity of the binary search is reduced from [11, 12] searches to [1.3,1.7] searches. Second, the adoption of Lemma 1 (Section 2.3) on BLISS parameters significantly reduces standard deviation for CDT, accompanied by a table size reduction by a factor of 11. Third, a floating point representation of CDT samples with a variable mantissa size skips leading zero storage, halving the table size.
These optimisations reduce the precomputed CDT sizes, consequently reducing search space and improving design throughput. However, hashing divides the search intervals into irregular sizes, meaning the binary search within intervals has non-constant bounds, making it susceptible to timing analysis attacks. The only CDT-based discrete Gaussian sampler promising a constant-time throughput for small standard deviations relies on a array of N parallel comparators, of -bits each, comparing against the uniform number r [29] . Each comparator returns a binary answer, the first comparator x satisfying S½x r < S½x þ 1 is the required result. This fully pipelined design results in a single cycle per sample throughput but the large number of parallel comparisons renders it inefficient in terms of hardware resources. Du and Bai [30] optimise the hardware area by employing a piecewise comparison instead of a full -bit comparison in the binary search algorithm. To compare any -bit table sample and a uniform value of the same size, comparing the first m bits can give a definite result of being greater or smaller with probability ð2 m À 1Þ=2 m (99.6 percent for m ¼ 8) and increases for larger m. In the case of an equality, m is increased and a larger comparison must be carried out. This lazy comparison scheme not only reduces the need for large comparators, but also economises the need for uniform sample bits required per output. Hashing is used to further narrow down the binary search time resulting in an areaefficient and yet high average throughput performance.
Algorithm 5. Discrete Gaussian Sampling using Discrete Ziggurat [22] Require: The number of rectangles, m, the lower right corners of rectangles ( x i b c, y i b c ) for i 2 m, precision ; 1: while true do 2: Choose rectangle, i f1; . . . ; mg, choose sign bit s f0; 1g, choose random bit b f0; 1g choose random value x f0; . . .
return sx; 9: else 10:
return sx; 14: 
Discrete Ziggurat Sampling
Discrete Ziggurat sampling is adapted by Buchmann et al. [22] from an original method of continuous Ziggurat sampling, proposed by Marsaglia and Tsang [21] . It is an optimised form of rejection sampling, which divides the area under the curve into several rectangles (REC NUM ) of equal area, where the left hand side of each rectangle is aligned with the y-axis and the right hand side of each rectangle aligns with the curve of the discrete Gaussian distribution. This structure can be used to optimise rejection sampling of random points from a uniform distribution. The increase in the number of rectangles thus decreases the probability of rejection. Currently, there are no hardware implementations of this sampler. However, Buchmann et al. [22] propose a C++ implementation with several optimisations, depicted in Algorithm 5, and also a comparison of discrete Ziggurat with alternative sampling methods to show that this method is suitable for use when large standard deviations are required. In discrete Ziggurat sampling [22] , first a random value, x, is generated and then x is checked for cases when x equals zero or lies comfortably within a given rectangle (i.e., x x iÀ1 b c). Random samples including a random sign bit s are then generated. Otherwise, the sample is found to lie close to the curve and further calculations are required to establish whether x is above or below this curve (see lines 10 onwards in Algorithm 5).
Algorithm 6. sLine [22] for use in Algorithm 5
Require:
Knuth-Yao Sampling
Knuth and Yao [20] present a tree based algorithm for sampling from non-uniform distributions by using a minimal number of input uniform bits, close to the entropy of the probability distribution. Given a probability distribution represented by p 0 ; p 1 ; . . . ; p N , while each sample is represented by a maximum of bits, the probability matrix P can be represented as a N Â binary matrix (after adding leading/trailing zeros). A binary tree, called the discrete distribution generating (DDG) tree, can be constructed, with Àlevels, comprising of two types of nodes: internal nodes (called I nodes) having two children each or terminal nodes (called T nodes) without children. The DDG is constructed so that the number of terminal nodes at the ith level of the tree equals the number of non-zero bits in the ith column of the probability matrix P . Each terminal node is marked by the row number of P . The sampling process is a random walk starting from the tree root moving down, consuming a single uniform bit at each step, taking the left node of the next level when encountering a 0 and right node otherwise. Hitting a terminal node outputs the integer label associated with it, generating a successful sample. Fig. 1 illustrates a DDG tree for a toy example. The tree levels equal the number of columns of P , while the number of terminal nodes match the non-zeros in P . A DDG tree can have at most ðN Â Þ terminal nodes and ðN Â Þ À 1 internal vertices. Fig. 1 shows equivalent tree and matrix representations of P , where each of the ith columns of the table has no more than ð2 Â iÞ ðN Â Þ entries, since that is the upper bound on the number of terminal nodes. The entire table does not need to be stored; instead P and the local information of one column suffices so that the information of the next column can be constructed at runtime.
Roy et al. [31] proposed a hardware friendly random walk for the discrete Gaussian sampler, based on the exploitation of ith column/tree level information to interpret the next. At any level i of the tree traversal, let d denote the distance between the visited node and the right most node of that tree level. At the DDG tree root (the 0th level), d ¼ 0. Depending on a 0 or a 1 being encountered, the distance becomes 2 Â d þ 1 or 2 Â d, respectively, after consuming one uniform bit. Next, they exploit the fact that in the DDG tree, all the I nodes are on the left and all the T nodes (that are non-zero and contribute to the Hamming weight of that particular M column) are on the right. Hence, subtracting the Hamming weight of the current column from the distance d at the ith level of the tree will show if the current node is an I node or a T node, if the result is positive or negative respectively. 1: Discrete samples of Gaussian distribution as matrix P with N Â dimension and N ¼ t Â s; 2: Sample bits uniformly in f0; 1g, store in array r; 3: Column-wise Hamming distance of P, i.e., h dist½j ¼
for (int row 0 : row < N; row row þ 1) do 8: Algorithm 7 gives the procedural details of the KnuthYao sampling algorithm. Assertion of the hit signal indicates a successful sample and ctr is the iteration over the random binary samples array r. h_dist is a -element vector holding the column-wise Hamming distances of the matrix P. The loop at line four iterates over the columns, starting from the most significant bit position, and checks if a terminal node is hit in that level of the DDG tree, by checking the sign bit of d; each iteration consumes one uniformly sampled bit. In case a column is localised for a hit, the loop in line 7 iterates over all the rows in that column, until a row is localised where a hit is found, terminating the two loops. No uniform random bits are consumed during these iterations. A signed bit is attached to the sample, based on the uniformly sampled bit.
Devroye [32] shows that the required uniformly generated bits for a sample generation is at most two more than the entropy of the distribution.
Dwarakanath and Galbraith [33] suggest a compression of the probability matrix, as most of the distribution values close to the tail have an increasingly large number of leading zeros. A block variant of the Knuth-Yao algorithm was proposed that divides consecutive probabilities in different blocks with roughly the same number of leading zeros resulting in reasonable storage savings of P . Roy et al. [31] exploited this block based compression for the Knuth-Yao sampler in an FPGA implementation with s LP ¼ 3:33. Instead of keeping a h_dist vector for column Hamming weight, they iterate over the columns of the P matrix, bit by bit. The iteration proceeds from bottom to top, consuming one cycle for each bit of the probability matrix. Since the discrete Gaussian values close to the centre are more likely, working from top to bottom instead significantly accelerates the algorithm's average speed (as in Algorithm 7). Consequently, the implementation [31] , though very lightweight, requires on average 17 clock cycles to generate one discrete Gaussian sample. Subsequent work by Roy et al. [17] improved the speed up to 2.5 cycles per sample, by table hashing in multiple stages. As these implementations were non-constant time, a random shuffle method is used to protect the discrete Gaussian distributed polynomial against timing analysis attacks, at the cost of additional FPGA slices. Clerq et al. [34] present a software implementation of ring-LWE encryption on a 32-bit ARM Cortex-M4F microcontroller using a Knuth-Yao based fast discrete Gaussian sampler, bettering all existing encryption implementations in software, where discrete Gaussian sampling requires an average 28.5 cycles per sample.
TIMING ATTACKS AND DESIGN GOALS
Timing attacks were proposed for the first time by Kocher [35] . These attacks exploit the time difference between operations to gain information about the secret key. Exploitable timing differences can be caused by routines whose execution time depends on the secret data or by the difference in data access pattern caused by memory hierarchy. These attacks have been successfully applied against several cryptographic schema, including lattice-based schemes (in particular NTRU [36] ). Thus in the context of LBC, it is crucial to implement schemes resistant against timing attacks.
The need for a discrete Gaussian sampler resistant against timing attacks is emphasised by the NIST call for implementations of encryption and signature schemes (of which discrete Gaussian samplers are core components) explicitly requiring resistance against side-channel attacks [15] . Software implementation of Gaussian samplers have been shown to leak information through the timing channel [26] . Bruinderink et al., in particular, presented a time side-channel attack, exploiting the cache patterns [16] , on the CDT sampler. The majority of hardware designs reported so far [9] , [31] , [37] are susceptible to timing attacks. Resistance against timing attacks is achieved when all the operations involving secret information are executed in a time which is independent from the secret data.
Time independence is a property which implies that the computation time of an algorithm does not depend on the input data. As a consequence, when the computed data involve secret information, a time independent computation guarantees that no information about it is leaked via the timing channel. This property can be achieved in several ways, the most common methods are ensuring constant execution time [38] or random shuffling [17] of the secret values.
The discrete Gaussian sampler proposed by P€ oppelmann and G€ uneysu [38] , which targets constant execution time, implements a time-independent CDT sampler for encryption. The area overhead of that design is however too high to be practical. Roy et al. [17] also presented a discrete Gaussian sampler resistant against timing attacks. The design implements a fast Knuth-Yao based non-constant time Gaussian sampler which generates a batch of samples, which is subsequently shuffled to disassociate the related timing information.
Time independence is not the only desired feature: implementations of discrete Gaussian samplers should also ensure fast response time (for high throughput) and require minimal area occupation (for constrained devices). In this paper these design goals are considered and the limitations of previous work are overcome by proposing independenttime and practical implementations of all four discrete Gaussian sampling schemes currently used in LBC. However, throughput and area are considered as secondary variables to be optimised, since the main objective is to ensure time-independence of the proposed designs.
GAUSSIAN SAMPLER ARCHITECTURES
Constant-Time Bernoulli Sampling
Discrete Gaussian sampling using Bernoulli is performed in constant-time when the table comparison (of values c i , seen in Section 3.2) is always completely evaluated. Thus a comparison would not be aborted if one mismatch between the randomly generated bit r and the table value c is found. However, if a rejection of the Bernoulli variable leads to the rejection of a complete sample, is r ! c an abort can be tolerated. For acceptance ðr < cÞ instead, the full comparison has to be made. The dependency on the input value of x would in fact leak information. The leakage can be mitigated by evaluating the whole table in a pipelined fashion, so that no secret addresses are used (the whole table has to be read). This however requires a large number of random bits, generated using Trivium Â32 as a PRNG. The PRNG produces 32 random bits, sufficient for the whole table comparison, and they are stored in a register, ready to be used when j ¼ yðy þ 2kxÞ is calculated. In the case of a rejected value, the evaluation aborts early, since rejection does not leak any secret information. Fig. 2 illustrates the high-level architecture of the proposed constant-time Bernoulli sampler. The binary Gaussian module includes the reduced precision ¼ 64-bits and uses a 64-bit shift register (implemented in a LUTM on the target FPGA) to store the precomputed expansion. This operation, as well as the uniformly random value y 2 f0; 1; . . . ; k À 1g, are also improved by incorporating an unrolled Â8 Trivium as a PRNG. Previous designs [9] required 12-bits of randomness for the binary Gaussian component and 8-bits of randomness for the uniform value k. This proposed design, which combines a Â8 unrolled Trivium with a reduced precision allows the computation in two clock cycles instead of % 20 clock cycles using a single bit per clock PRNG. The values of j and z are used in the evaluation, j is used to evaluate b B expðÀj=2s 2 Þ where if b ¼ 1 the value z is accepted and output, otherwise it is rejected.
Since Gaussian convolutions are integrated (Section 2.3), employing two table evaluation components decreases the clock cycle count per sample. Each evaluation is able to compute a 128-bit comparison on every clock cycle, meaning per clock cycle two rows can be compared. Once the pre-calculation components are completed and the pipeline is full, input values to the tables will be ready after every clock cycle, meaning a discrete Gaussian sample takes exactly ddlog 2 ðmaxðjÞÞe=2e þ 2 cycles, which includes ddlog 2 ðmaxðjÞÞe=2e for x 1 D s 0 , þ1 more clock cycle for x 2 D s 0 , and þ1 more clock cycle for them to be combined as calculated as log 2 2:4ts 2 which for s BLISS ¼ 215:73 requires 784 bits and improves on [9] by % 48 percent in table space, and for s LP ¼ 3:39 requires 400 bits. Furthermore, the proposed Bernoulli sampler generates a Gaussian sample in 13 clock cycles for BLISS signature parameters and 7 clock cycles for LP encryption parameters, which as well as being the first Bernoulli sampler to operate in constant-time, also betters the speed of the previous hardware implementations of [9] , and competes well with [37] .
Constant-Time CDT Sampling
The binary search for a desired sample can result in an early termination, when the equality comparison of uniformly sampled r and the S½cur results in an exact match, though this early termination is only likely with a very small probability. Algorithm 4 for CDT sampling qualifies the inequality (instead of 64-bit equality comparison), hence avoiding early termination bounding the algorithm for Â blog 2 ðNÞc; dlog 2 ðNÞe Ã search iterations before generating a result every time. For CDT where N is a power of two, the algorithm performs inherently in constant-time; for all other N, the algorithm is tweaked to occasionally perform an extra iteration to ensure the algorithm complexity is fixed to dlog 2 ðNÞe.
For the generation of discrete Gaussian samples with s LP ¼ 3:33, the CDF table S consists of N ¼ 32 (dt Â s LP e) entries, each with ¼ 64 bits of precision. Hence, a single ported BRAM, with a 5-bit address and 64-bit data ports suffices for the design. A 64Â unrolled Trivium is used as a PRNG for the generation of the uniform samples r, where the initialisation of the module is handled externally at the startup and controlled by the binary search state machine (referred to as BinSearch). The Trivium module is activated by the enable output pulse signal from BinSearch, so the uniform samples are only generated when required, saving circuit power. state-machine instance would take 2 Â dlog 2 ð184Þe ¼ 16 cycles for generation of a single valid sample. However, a better throughput model uses two instances of state-machine to parallelise two independent searches (seen in Fig. 3 ). The two state machines BinSearch0 and BinSearch1 each get a 64-bit uniformly random number from the Trivium-based PRNGs, in two consecutive clock cycles. The BinSearch state-machines initiate their processing from the START state, resetting three pointers, namely, min, cur, and jmp, with initial values, as given in Algorithm 4. They transition unconditionally to the SEARCH state in the next clock cycle that updates their three pointers as per the result from their 64-bit comparison operation. After exactly eight cycles an appropriate x is found, and the state generates a single bit hit to signal this occurrence. This hit also activates the Trivium module to request a new uniformly random 64-bit value. BinSearch0 and BinSearch1 share the same dual ported BRAM, each port having 8-bit address and 64-bit data ports. The state machines work independently to generate two random samples x 1 ; x 2 D s 0 in 8 clock cycles (shown in Fig. 3) ; the two samples are buffered on their respective hit signals into registers. The samples are then combined as x ¼ x 1 þ 11x 2 where a sign bit is also attached.
Time-Independent Knuth-Yao Sampling
Inherently, the Knuth-Yao sampler operates in non-constant time. Considering the column and row localisation phases take one clock cycle to jump from one column/row to the other, successful sampling requires at least 2 cycles if the 0th row and 0th column are localised for a hit, with the maximum being ð Â NÞ when the last row and last column get a hit. The average response time is % 10:6 cycles since the column and row localisation each require % 5:3 cycles. Fig. 4 gives an architectural description of the proposed Knuth-Yao sampler in hardware, given a probability matrix P . Trivium is used as a PRNG for the generation of the uniformly sampled bit r that is inverted to get r_n. The initialisation of Trivium is handled externally at startup, which afterwards is controlled via the KYSearch state-machine. An en signal is used to activate the PRNG to produce randomness only when required. The KYSearch state-machine initiates its processing from the START state, resetting output registers, row, col, d, en and hit with initial values, as given in Algorithm 7. It transitions unconditionally to the SEARCH state that updates the output registers as per the state of the d pointer. As long as d is a non-zero positive number, the statemachine keeps changing columns, consuming random bits and updating d accordingly. If d is a negative number, then the column for a hit is already localised and the row search starts without changing column values. Since no new random bits are required, en is kept low and d is updated by adding the P matrix values until d reduces to 0. Assertion of a hit ends the sampling process, where the row number is assigned a sign bit based on a uniformly sampled bit.
To ensure ease of access, the P matrix transpose is stored as a distributed ROM or in a BRAM in the FPGA implementation. When using a BRAM, storing a mere 64 Â 32-bit P matrix is excessive. For better resource utilisation, the h dist vector of Hamming distances is also stored in the same BRAM. Additionally, hashing is employed to boost the throughput, since the response time is non-constant and the same BRAM holding the P matrix can also store the hash tables. To consider 8 bits of uniform random numbers at a time, a hash table with 256 entries is easily accommodated in BRAM, for the case where s LP ¼ 3:33, 249 out of these 256 entries result in a hit (97.26 percent). Otherwise, the hash table entries keep the d value to be loaded in the KYSearch state-machine directly and scanning of the first 8 columns is skipped, improving throughput.
To make the sampler independent-time, operations could be made constant-time, that is the largest possible cycle count ð Â NÞ must be ensured before a discrete Gaussian sample is generated as an output. For s LP ¼ 3:33, it turns out to be too slow to be practical (2,000 cycles per valid sample generation, and worse for signature parameters). Hence, to achieve timing independence, the discrete Gaussian samples are instead shuffled after the generation of a complete block. Considering a block size of n samples, if n 1 are generated by hashing then the remaining n 2 ¼ n À n 1 are generated by the KYSearch state-machine. The samples are stored in another BRAM such that n 1 samples are accommodated at the top of the BRAM with the remaining at the bottom. A simplistic shuffling algorithm is implemented, the Fisher-Yates shuffle [39] (also used in [17] ), which goes over each of the n 2 samples, swapping these within randomly generated locations ½0; n 1 À 1. A simple shuffler state-machine enables each swap in two clock cycles considering a dual-ported BRAM. Due to the small percentage of n 2 samples in BRAM (2.7 percent), shuffling does not significantly exacerbate the sample generation throughput. However, the state-machine and the BRAM require extra FPGA resources.
The Knuth-Yao implementation relates the number of random bits required (and consequently the running time of the algorithm) to the distribution entropy (or standard deviation). For s BLISS ¼ 215, even after using Lemma 1, the running time will be more than 20 clock cycles per sample, which is far too slow for practical purposes. Hashing would also require much larger tables to be effective. Consequently, no further work is undertaken on the Knuth-Yao sampler for signature parameters.
Constant-Time Discrete Ziggurat Sampling
To date, there are no existing hardware designs of the discrete Ziggurat Gaussian sampler. There are several possible reasons for this: first, the discrete Ziggurat sampling algorithm requires several costly computations, second, the generation of suitable rectangles for sampling is non-trivial and can require multiple iterations, and third, there are several alternative sampling algorithms which are better suited to the hardware platform.
In this research, the first hardware design of a discrete Gaussian Ziggurat sampler on FPGA is presented. To adapt the algorithm to suit the hardware platform, the rectangle generation and several pre-computations are completed offline. This is a reasonable assumption to make, as previously mentioned in other sampling algorithms (such as the precomputed lookup tables used in Bernoulli and CDT samplers). The number of rectangles (REC NUM ) used in the proposed design is set to eight. Such parameters would benefit from increased flexibility or on-the-fly calculation in future implementations. Parameters are stored as binary vectors, and fixed point precision ( bit) is used to minimise memory consumption on the FPGA. Two's complement representation is used for the binary sample values, where the most significant bit represents the sign of the given sample. The floored x coordinates and bit precision y coordinates of the rectangles and a table of precomputed probability values (that is, ðr s ðx i Þ À y i Þ Ã 2 ) are stored on the FPGA, called STORE X, STORE Y, and STORE RHO respectively. Additionally, SLINE COMP stores the precomputed fractional division required in the sLine operation. These pre-computations enable efficient hardware implementation at the cost of flexibility. For this reason, software implementations of Ziggurat are preferred to any hardware implementation.
The design follows Algorithm 5, where a series of comparisons are used to generate discrete Gaussian samples. Fig. 5 depicts the proposed hardware architecture at a high level. An unrolled Trivium Â8 is employed and a buffer is used to store intermediate values. A counter controlled by a state-machine is used to ensure the buffer is full before initial sample generation. The design require n uniformly random bits per sample, where n ¼ 2 þ þ ð2 Â REC NUM Þ given the number of rectangles REC NUM . For sampling, a random integer value x and a rectangle value i are randomly chosen using Trivium outputs. If x is less than the floored x coordinate of i, x i is stored in STORE X, it is output along with a sign bit, s. Otherwise, several other comparisons are carried out to verify if x falls under the discrete Gaussian curve. In this case, a y dash value is uniformly generated with -bits of precision, and subsequently used to generate y, such that
y is used within three inequalities to decide whether to reject the sample x, that is, to decide in which part of the rectangle the randomly generated integer x lies (as seen in Algorithm 5). For further details see the original proposed algorithm and software design by Buchmann et al. [22] .
For the signature parameters, the same design is used twice and the sampling process is repeated to generate two samples, and combined in the same way as in the other sampling techniques. To make the sampler perform in constanttime, the state-machine controls signals and ensures all comparisons are carried out, regardless of success or failure, within a set number of clock cycles. This design choice brings a small cost of additional clock cycles per sample generated.
COMPARISON & RESULTS
In this section the performance results are described and compared against the same implementations for encryption (for Table 2 shows the performance of the constant-time Bernoulli samplers in hardware. The results are compared to other implementations which target lattice-based encryption and signature schemes but without protection against timing attacks. To achieve constant computation time, the proposed Bernoulli samplers need to perform many more comparisons than the previously proposed designs [9] , [37] . For this reason, the number of flip-flops in this design is larger than others. However, the needed comparisons can be performed in a single clock cycle, where instead LUT and slice usage has decreased. This has mainly been achieved by incorporating Â8 and Â32 unrolled Trivium components, which simplify the overall design and alleviate the need for excess data buffers. The halving of the precision parameter also contributed to the reduction of the LUT and Table 2 that, despite the increase in flip-flop consumption, the Bernoulli samplers for both standard deviations improve on previous designs. [17] . The first implementation (without BRAMs and hashing) compares to the existing work by [17] (without hashing). In the second implementation, hashing is undertaken; the BRAM is used to store the P matrix, the h dist and the hashing table. At the cost of one additional BRAM, this implementation improves on the Ops/s/S of the best existing Knuth-Yao implementation by a factor of 2. The last implementation also includes a shuffler for timing independence; hence another BRAM for generated discrete Gaussian samples is added and the slice budget increases to accommodate the shuffler state machine. A consequent decrease in Ops/s/S follows.
Bernoulli Results
Knuth-Yao Results
Cumulative Distribution Table Results
For the CDT sampler, FPGA implementations with and without BRAMs are proposed. For s LP ¼ 3:33, a single port distributed ROM is used. The number of slices can be significantly reduced if BRAMs are utilised, as shown in Table 4 , where one instance of RAMB36 is used in a 64 Â 32 configuration. However, in this configuration the maximum operable frequency is halved. The reduction in slices is higher for s BLISS ¼ 215 due to the larger distribution table being shifted to BRAM. Table 4 compares the proposed CDT samplers with stateof-the-art. The only other constant-time CDT implementation for s LP ¼ 3:33, proposed by P€ oppelmann and G€ uneysu [38] , operates at a frequency of more than 4Â lower. The slice count is also 5Â larger, contributed primarily by as many parallel comparators as the S table words. Hence, despite the reported CDT implementation generating a single sample per cycle [38] , the design in this research proves to be a very lightweight, constant-time alternative, outperforming it by a factor of around 5Â in terms of Ops/s/S. The CDT sampler by Du and Bai [30] is lightweight and achieves good average throughput. However, their proposed design is not an independent time implementation. For s BLISS ¼ 215, the implementation by P€ oppelmann et al. [9] also operates in non-constant time, is costlier in terms of slice consumption, and slower in terms of throughput per slice. Since their reported design consumes BRAM, when compared to this implementation (with BRAM), their design remains % 5Â inferior in terms of Ops/s/S. However it requires around 3Â less uniform random bits per sample, compared to the proposed constant-time designs. Table 5 shows the performance of the proposed constanttime discrete Ziggurat. There are no previous hardware designs of this sampler; thus the discrete Ziggurat sampler can be compared with the other samplers (Tables 2, 3 , and 4). sampler surpasses all others in terms of an overall balanced performance with area, throughput, and timing independence. If the use of additional BRAMs is considered, the Knuth-Yao time-independent implementation (KY_Enc_RAM) has the best overall performance in terms of low area, high throughput, and also the lowest number of bits required per sample. For signatures, the RAM-free CDT implementation (CDT_Sign) proves to be an overall winner, followed by the Bernoulli sampler (Ber_Sign), being around 2Â more expensive in terms of slices. In conclusion, this research provides a thorough investigation of the practical discrete Gaussian samplers (CDT, Knuth-Yao, Bernoulli, and discrete Ziggurat) used in lattice-based cryptosystems. Novel time-independent implementations are presented, ensuring resistance against timing attacks. Moreover, the proposed hardware sampler designs clearly outperform most of those previously proposed. . She has authored two research books and has more than 100 peer-reviewed conference and journal publications. Her research interests include hardware cryptographic architectures, side channel analysis, physical unclonable functions and post-quantum cryptography. In 2014 she was awarded a Royal Academy of Engineering Silver Medal, which recognises outstanding personal contribution by an early or mid-career engineer that has resulted in successful market exploitation. She is a member of the IEEE.
Discrete Ziggurat Results
RECOMMENDATIONS AND CONCLUSION
" For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
