Abstract. The recent Bimodal Lattice Signature Scheme (Bliss) showed that lattice-based constructions have evolved into practical alternatives to RSA and ECC. Besides reasonably small signatures of 5600 bits at a 128-bit security level, Bliss enables extremely fast signing and signature verification in software. However, due to the complex sampling of Gaussian noise with high precision, it has been unclear whether this scheme can be mapped efficiently to embedded devices. Even though the authors of Bliss also proposed a new sampling algorithm using Bernoulli variables, this approach is more complex than previous methods using large precomputed tables. The clear disadvantage of large tables for high performance is that they cannot be used in constrained computing environments, such as FPGAs, with limited memory. In this work we thus present techniques for an efficient Cumulative Distribution Table (CDT) based Gaussian sampler on reconfigurable hardware involving Peikert's convolution lemma and the Kullback-Leibler divergence. Based on our enhanced sampler design, we provide a first Bliss architecture for Xilinx Spartan-6 FPGAs that integrates fast FFT/NTT-based polynomial multiplication, sparse multiplication, and a Keccak hash function. Additionally, we compare the CDT with the Bernoulli approach and show that for the Bliss-I parameter set the improved CDT approach is faster with lower area consumption. Our core uses 2,431 slices, 7.5 BRAMs, and 6 DSPs and performs a signing operation in 126 µs on average; verification takes even less time, at 70 µs.
Introduction and Motivation
Virtually all currently deployed digital signature schemes rely either on the factoring problem (RSA) or on the discrete logarithm problem (DSA/ECDSA). However, with Shor's algorithm [39], sufficiently large quantum computers can solve these problems in polynomial time, which potentially puts billions of devices and users at risk. Although powerful quantum computers will certainly not become available soon, various organizations are investing significant resources to advance their development [35]. Also motivated by further advances in classical cryptanalysis (e.g., [4, 5, 20]), it is important to investigate potential alternatives now, in order to have secure constructions and implementations at hand when they are finally needed.
In this work we deal with such a promising alternative, namely the Bimodal Lattice Signature Scheme (Bliss) [12], and specifically address implementation challenges for constrained devices and reconfigurable hardware. First efforts in this direction were made in 2012 by Güneysu et al. [16] (GLP). Their scheme was based on work by Lyubashevsky [26] and tuned for practicality and efficiency in embedded systems. This was achieved by a new signature compression mechanism, a more "aggressive", non-standard hardness assumption, and the decision to use uniform (as in [25]) instead of Gaussian noise to hide the secret key contained in each signature via rejection sampling. While GLP allows high performance on low-cost FPGAs [16] and CPUs [17], it later turned out that the scheme is suboptimal in terms of signature size and claimed security level compared to Bliss. The main reason is that Gaussian noise, which is prevalent in almost all lattice-based constructions, allows more efficient, more secure, and also smaller signatures. However, while other techniques relevant for lattice-based cryptography, like fast polynomial arithmetic on ideal lattices, have received some attention [1, 32, 36], it is currently not clear how efficiently Gaussian sampling can be done on reconfigurable and embedded hardware for large standard deviations. Results from electrical engineering (e.g., [19, 41]) are not directly applicable, as they target continuous Gaussians. Applying these algorithms to the discrete case is not trivial (see, e.g., [8] for a discrete version of the Ziggurat algorithm). First progress was recently made by Roy et al. [37], based on work by Galbraith and Dwarakanath [13], providing results for a low-resource Gaussian sampler for lattice-based encryption. We would also like to note that the large tables used in performance-optimized implementations of lattice-based digital signature schemes might give the impression that Gaussian-noise-based schemes are a suboptimal choice on constrained embedded systems. A recent example is a microcontroller implementation of Bliss [7] that requires tables of roughly 40 to 50 KB for the Gaussian sampler on an ATxmega64A3. Other lattice-based signatures with explicit reductions to standard lattice problems [14, 24, 28] are also inefficient in terms of practical signature and public key sizes (see [3] for an implementation of [28]). Thus, despite the necessity of improving Gaussian sampling techniques (which is one contribution of this work), Bliss currently seems to be the most promising scheme, with a signature length of 5600 bits, equally large public keys, and 128 bits of equivalent symmetric security. There surely is some room for theoretical improvement, as suggested by the new compression ideas developed by Bai and Galbraith [2]; one can hope that all those techniques can be combined to further improve lattice-based signatures.
Contribution. One contribution of this work is a set of improved techniques for efficient sampling of Gaussian noise that supports parameters required for digital signature schemes such as Bliss and similar constructions. First, we detail how to accelerate the binary search on a cumulative distribution table (CDT) using a shortcut table of intervals (also known as a guide table [9, 11]) and develop an optimized data structure that saves roughly half of the table space by exploiting properties of the Kullback-Leibler divergence. Furthermore, we apply a convolution lemma [29] for discrete Gaussians that results in even smaller tables of less than 2.1 KB for Bliss-I parameters. Based on these techniques we provide an implementation of the Bliss-I parameter set on reconfigurable hardware that is tuned for performance and offers 128 bits of security. For practical evaluation we compare our improvements for the CDT-based Gaussian sampler to the Bernoulli approach presented in [12]. Our implementation includes an FFT/NTT-based polynomial multiplier (contrary to the schoolbook approach from [16]), more efficient sparse multiplication, and the KECCAK-f[1600] hash function, to provide the full picture of the performance achievable by the latest lattice-based signature schemes on reconfigurable hardware. Our implementation on a Xilinx Spartan-6 FPGA supports up to 7958 signatures per second using 7,491 LUTs, 7,033 flip-flops, 6 DSPs, and 7.5 block RAMs and outperforms previous work [16] both in speed and area.
In order to allow third-party evaluation of our results, source code, testbenches, and documentation are available on our website³.
The Bimodal Lattice Signature Scheme
The most efficient instantiation of the Bliss signature scheme [12] is based on ideal lattices [27] and operates on polynomials over the ring R_q = Z_q[x]/⟨x^n + 1⟩. For quick reference, the Bliss key generation, signing, and verification algorithms are given in Figure 1, and implementation-relevant parameters as well as achievable signature and key sizes are listed in Table 1. Note that for the remainder of this work we focus solely on Bliss-I. Bliss key generation basically involves uniform sampling of two small and sparse polynomials f, g, computation of a certain rejection condition N_κ(S), and computation of an inverse. For signature generation, two polynomials y_1, y_2 of length n are sampled from a discrete Gaussian distribution with standard deviation σ. Note that the computation of ay_1 can still be performed in the FFT-enabled ring R_q instead of R_2q. The result u is then hashed together with the message µ. The output of the hash function is interpreted as a sparse polynomial c. The polynomials y_{1,2} are then used to mask the secret key polynomials s_{1,2}, which are multiplied with the polynomial c and thus "sign" the hash of the message. In order to prevent any leakage of information on the secret key, rejection sampling is performed and signing might restart. Finally, the signature is compressed and (z_1, z_2^†, c) is returned. For verification, the norms of the signature are first validated, then the input to the hash function is reconstructed and it is checked whether the corresponding hash output matches c from the signature.
Figure 1: The Bliss key generation, signing, and verification algorithms; the full listings are not reproduced here (see [12]). Key generation chooses f, g as uniform polynomials with exactly d_1 = ⌈δ_1·n⌉ entries in {±1} and d_2 = ⌈δ_2·n⌉ entries in {±2} and sets S = (s_1, s_2)^t ← (f, 2g + 1); signing continues with a probability determined by the rejection step.
Improving Gaussian Sampling for Lattice-Based Digital Signatures
Target distribution. We recall that the centered discrete Gaussian distribution D_{Z,σ} assigns to every integer x a weight proportional to ρ_σ(x) = exp(−x²/(2σ²)). Our goal is to efficiently sample from that distribution for the constant value σ ≈ 215.73 specified in Bliss-I (precisely σ = 254 · σ_bin, where σ_bin = √(1/(2 ln 2)) is the parameter of the so-called binary Gaussian; see [12]). This can easily be reduced to sampling from a distribution over Z⁺ proportional to ρ_σ(x) for all x > 0 and to ρ_σ(0)/2 for x = 0.
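As a concrete reference for the rest of this section, the following minimal Python sketch (plain floating point, for illustration only; the actual implementation works with fixed-point values at the required precision) spells out the target weights and the folding onto Z⁺:

```python
from math import exp

SIGMA = 215.73  # Bliss-I standard deviation (sigma = 254 * sqrt(1/(2 ln 2)))

def rho(x, sigma=SIGMA):
    """Unnormalized Gaussian weight rho_sigma(x) = exp(-x^2 / (2 sigma^2))."""
    return exp(-x * x / (2.0 * sigma * sigma))

def half_weight(x, sigma=SIGMA):
    """Folded distribution over Z+: weight rho(x) for x > 0 and rho(0)/2 for
    x = 0; combined with a uniform sign bit this reproduces D_{Z,sigma}."""
    return rho(x, sigma) / 2.0 if x == 0 else rho(x, sigma)
```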
Overview. Gaussian sampling using a large cumulative distribution table (CDT) has been shown to be an efficient strategy for the software implementation of Bliss given in [12]. In this section, we further enhance CDT-based Gaussian sampling for use on constrained devices. For simplicity, we explicitly refer to the parameter set Bliss-I, although our enhancements transfer to any other parameter set as well. To increase performance, we first analyze and improve the binary search step to reduce the number of comparisons (cf. Section 3.1). Secondly, we decrease the size of the precomputed tables. In Section 3.3 we apply a convolution lemma for discrete Gaussians adapted from [30] that enables the use of a sampler with a much smaller standard deviation σ' ≈ σ/11, reducing the table size by a factor of 11. In Section 3.4 we finally reduce the size of the precomputed table by roughly another factor of two using a floating-point representation with an adaptive mantissa size.
For those last two steps we require a "measure of distance"⁴ between distributions, the Kullback-Leibler divergence [10, 23], which offers tighter proofs than the usual statistical distance (cf. Section 3.2). The Kullback-Leibler divergence is a standard notion in information theory and has already played a role in cryptography, mostly in the context of symmetric cryptanalysis [6, 42].
Binary Search with Shortcut Intervals
The CDT sampling algorithm uses a precomputed table 0 = T[0] < T[1] < · · · < T[S] = 1 to sample from a uniform real r ∈ [0, 1). The output x is the unique index satisfying T[x] ≤ r < T[x + 1], and it is obtained via a binary search; each output x thus has probability T[x + 1] − T[x]. For Bliss-I we need a table with S = 2891 ≈ 13.4σ entries to dismiss only a portion of the tail of less than 2^−128. As a result, a naive binary search requires C ∈ [⌊log₂ S⌋, ⌈log₂ S⌉] = [11, 12] comparisons on average.
As an improvement, we propose to combine the binary search with a hash map based on the first bits of r to narrow down the search interval in a first step (an idea that is not exactly new [9, 11], also known as guide tables). For the given parameters, and for memory-alignment reasons, we choose the first byte of r for this hash map: the unique v ∈ {0, . . . , 255} such that v/256 ≤ r < (v + 1)/256 selects a precomputed interval I_v containing all indices x that can satisfy T[x] ≤ r < T[x + 1] for some r ∈ [v/256, (v + 1)/256), and the binary search is then restricted to I_v. For this distribution, this gives C ∈ [1.3, 1.7] comparisons on average.
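A minimal Python model of this guide-table search is sketched below (floating-point r for readability; the hardware instead compares r lazily, byte by byte, against the table entries, and the table construction here is an illustrative assumption):

```python
from math import exp

def build_cdt(sigma, tailcut=13.4):
    """Cumulative table T[0..S] for the folded half-Gaussian, with
    0 = T[0] < ... < T[S] = 1."""
    S = int(tailcut * sigma)
    w = [exp(-x * x / (2 * sigma * sigma)) for x in range(S)]
    w[0] /= 2.0                      # x = 0 carries half weight after folding
    total = sum(w)
    T, acc = [0.0], 0.0
    for wx in w:
        acc += wx
        T.append(acc / total)
    return T

def build_guide(T, bits=8):
    """For each value v of the first byte of r, precompute the interval of
    indices x that can satisfy T[x] <= r < T[x+1] for such an r."""
    G = []
    for v in range(1 << bits):
        lo_r, hi_r = v / 256.0, (v + 1) / 256.0
        lo = max(x for x in range(len(T)) if T[x] <= lo_r)
        hi = min(x for x in range(1, len(T)) if T[x] >= hi_r)
        G.append((lo, hi))
    return G

def sample(T, G, r):
    """Binary search restricted to the guide interval selected by the
    first byte of r; invariant: T[lo] <= r < T[hi]."""
    lo, hi = G[int(r * 256)]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if T[mid] <= r:
            lo = mid
        else:
            hi = mid
    return lo
```

Because the guide interval typically contains only a couple of indices, most samples finish after one or two comparisons, which is exactly what the hardware design later exploits.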
Preliminaries on the Kullback-Leibler Divergence
We now present the notion of Kullback-Leibler (KL) divergence that is later used to further reduce the table size. Detailed proofs of the following lemmata are given in the full version [31].
Definition 1 (Kullback-Leibler Divergence). Let P and Q be two distributions over a common countable set Ω, and let S ⊂ Ω be the strict support of P (P(i) > 0 iff i ∈ S). The Kullback-Leibler divergence of Q from P, denoted D_KL(P‖Q), is defined as

D_KL(P‖Q) = Σ_{i∈S} ln(P(i)/Q(i)) · P(i),

with the convention that ln(x/0) = +∞ for any x > 0.
The Kullback-Leibler divergence shares many useful properties with the more usual notion of statistical distance. First, it is additive: D_KL(P₀ × P₁ ‖ Q₀ × Q₁) = D_KL(P₀‖Q₀) + D_KL(P₁‖Q₁).
An important difference, though, is that it is not symmetric. Choosing parameters so that the theoretical distribution Q is at a KL-divergence of about 2^−128 from the actually sampled distribution P, the next lemma lets us conclude the following⁵: if the ideal scheme S^Q (i.e., Bliss with a perfect sampler) has about 128 bits of security, so has the implemented scheme S^P (i.e., Bliss with our imperfect sampler).
Lemma 1 (Bounding Success Probability Variations). Let E^P be an algorithm making at most q queries to an oracle sampling from a distribution P and returning a bit. Let ε ≥ 0, and let Q be a distribution such that D_KL(P‖Q) ≤ ε. If x (resp. y) denotes the probability that E^P (resp. E^Q) outputs 1, then |x − y| ≤ √(qε/2).
In certain cases, the KL-divergence can be as small as the square of the statistical distance. For example, denoting by B_c the Bernoulli variable that returns 1 with probability c, the distributions B_{c(1+δ)} and B_c are at statistical distance cδ = O(δ) but at KL-divergence only O(δ²). In such a case, one requires q = O(1/δ²) samples to distinguish those two distributions with constant advantage. Hence, we obtain higher security from the KL-divergence than from the statistical distance, for which the typical argument would only prove security up to q = O(1/δ) queries. Intuitively, the statistical distance is the sum of absolute errors, while the KL-divergence is about the sum of squared relative errors.
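The following small Python computation illustrates this quadratic gap on the Bernoulli example (the concrete values of c and δ are arbitrary choices for illustration):

```python
from math import log

def kl(P, Q):
    """Kullback-Leibler divergence D_KL(P || Q) over a common support."""
    return sum(p * log(p / q) for p, q in zip(P, Q) if p > 0)

def sd(P, Q):
    """Statistical distance (total variation)."""
    return sum(abs(p - q) for p, q in zip(P, Q)) / 2

c, delta = 0.5, 1e-6
P = [c * (1 + delta), 1 - c * (1 + delta)]   # B_{c(1+delta)}
Q = [c, 1 - c]                               # B_c

print(sd(P, Q))   # ~ c * delta = 5e-7, linear in delta
print(kl(P, Q))   # ~ O(delta^2), around 1e-12, quadratic in delta
```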
Lemma 2 (Kullback-Leibler divergence for bounded relative error). Let P and Q be two distributions with the same countable support S. Assume that for any i ∈ S there exists some δ(i) ∈ (0, 1/4) such that we have the relative error bound |P(i) − Q(i)| ≤ δ(i)·P(i). Then

D_KL(P‖Q) ≤ 2 Σ_{i∈S} δ(i)² · P(i).

⁵ Apply the lemma to an attacker with success probability 3/4 against S^P making fewer than 2^127 queries (amplifying the success probability by repeating the attack if necessary), and deduce that it also succeeds against S^Q with probability at least 1/4.
Using a floating-point representation, it now seems possible to halve the storage by ensuring a relative precision of 64 bits instead of an absolute precision of 128 bits. Indeed, storing data with slightly more than 64 bits of relative precision (that is, a 64-bit mantissa in floating-point format), one can reasonably hope to obtain relative errors δ(i) ≤ 2^−64, resulting in a KL-divergence of less than 2^−128. We further exploit this idea in Section 3.4. But first, we also use the KL-divergence to improve the convolution lemma of Peikert [30] and construct a sampler using convolutions.
Reducing Precomputed Data by Gaussian Convolution
Given two (continuous) Gaussian variables x₁ and x₂ of standard deviations σ₁ and σ₂, their combination x₁ + c·x₂ is a Gaussian of standard deviation √(σ₁² + c²σ₂²) for any c. While this is not generally the case for discrete Gaussians, similar convolution properties hold under some smoothing condition, as proved in [29, 30]. Yet those lemmata were designed with asymptotic security in mind; for practical purposes it is in fact possible to improve the O(ε) statistical distance bound to an O(ε²) KL-divergence bound. We refer to [30] for the formal definition of the smoothing parameter η_ε; for our purpose it only matters that η_ε(Z) ≤ √(ln(2 + 2/ε)/π), and thus our adapted lemma allows us to decrease the smoothing condition by a factor of about √2.
Lemma 3 (Adapted from Thm. 3.1 of [30]). Let x₁ ← D_{Z,σ₁} and x₂ ← D_{kZ,σ₂} for some positive reals σ₁, σ₂, and let σ₃⁻² = σ₁⁻² + σ₂⁻² and σ² = σ₁² + σ₂². For any ε ∈ (0, 1/2), if σ₁ ≥ η_ε(Z)/√(2π) and σ₃ ≥ η_ε(kZ)/√(2π), then the distribution P of x₁ + x₂ verifies

D_KL(P ‖ D_{Z,σ}) ≤ 2(1 − ((1−ε)/(1+ε))²)² ≈ 32ε².
Remark. The factor 1/√(2π) in our version of this lemma is due to the fact that we use the standard deviation σ as the parameter of Gaussians, and not the renormalized parameter s = √(2π)·σ often found in the literature.
Proof. The proof is similar to the one of [30], with Λ₁ = Z, Λ₂ = kZ, c₁ = c₂ = 0, except for the last step, where we replace the statistical distance by the KL-divergence. As in [30], we first establish that for any x̄ ∈ Z one has the relative error bound |P(x̄) − D_{Z,σ}(x̄)| ≤ (1 − ((1−ε)/(1+ε))²) · D_{Z,σ}(x̄). It remains to conclude using Lemma 2.
To exploit this lemma for Bliss-I, we set k = 11 and σ' = σ/√(1 + k²) ≈ 19.53, and sample x = x₁ + k·x₂ for x₁, x₂ ← D_{Z,σ'} (equivalently, k·x₂ = x₂' ← D_{kZ,kσ'}). The smoothness conditions are verified for ε = 2^−128/32 and η_ε(Z) ≤ 3.92. Due to the use of the much smaller σ' instead of σ, the size of the precomputed table shrinks by a factor of about k = 11 at the price of sampling twice. However, the running time does not double in practice, since the enhancement based on the shortcut intervals keeps the average number of comparisons per sample close to one for the smaller table (cf. Section 3.1).
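A sketch of the resulting two-stage sampler, reusing the build_cdt, build_guide, and sample helpers from the sketch in Section 3.1 (again plain floating point for readability; parameter values follow the text):

```python
import random
from math import sqrt

K = 11
SIGMA = 215.73
SIGMA_P = SIGMA / sqrt(1 + K * K)   # sigma' ~ 19.53

# Small table for sigma' instead of a large one for sigma.
T = build_cdt(SIGMA_P)
G = build_guide(T)

def sample_signed():
    """Half-Gaussian CDT sample plus a uniform sign bit; since x = 0 already
    carries half weight in the folded table, no rejection step is needed."""
    x = sample(T, G, random.random())
    return -x if random.getrandbits(1) else x

def sample_gauss():
    """x1 + 11*x2 is distributed as D_{Z,sigma} up to a negligible
    KL-divergence by Lemma 3 (smoothing conditions as in the text)."""
    return sample_signed() + K * sample_signed()
```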
Asymptotic cost. Considering the asymptotic cost in σ, our method allows one to sample using a table of size Θ(√σ) rather than Θ(σ) at the price of doubling the computation time. For much larger σ, one could even use O(log σ) samples of constant standard deviation and thus achieve a table size of O(1) at a computational cost of O(log σ).
CDT Sampling with Reduced Table Size
We recall that when doing floating-point error analysis, the relative error of a computed value v is defined as |v − v_e|/v_e, where v_e is the exact value that was meant to be computed. Comparing the estimated relative-error blow-up induced by the subtractions for a CDT stored in reverse order (i.e., as tail probabilities) with the corresponding estimate for the CDT in natural order shows that the reverse order behaves better, so we store the table in reverse order. We aim for a variable precision of the table entries T[i] such that δ(i)²·P(i) is roughly constant around 2^−128/|S|, as suggested by Lemma 2, where δ(i) denotes the relative error δ(i) = |P(i) − Q(i)|/P(i). As a trade-off between optimal variable precision and hardware efficiency, we propose the following data structure. We define 9 byte tables M₀, . . . , M₈ for the mantissa, with respective lengths ℓ₀ ≥ ℓ₁ ≥ · · · ≥ ℓ₈, and another byte table E for the exponents, of length ℓ₀. The value of T[i] is defined as

T[i] = 2^{−E[i]} · Σ_{k=0}^{8} M_k[i] · 2^{−8(k+1)},

where M_k[i] is defined as 0 when the index is out of bound (i ≥ ℓ_k). Thus, the value of T[i] is stored with p(i) = #{k : ℓ_k > i} bytes of precision.
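The decoding of such an entry can be sketched as follows (a minimal Python model of the encoding reconstructed above; table contents and lengths are placeholders):

```python
def T_value(i, M, E):
    """T[i] = 2^(-E[i]) * sum_k M_k[i] * 2^(-8(k+1)), where M is a list of
    9 byte tables of decreasing lengths and E is the exponent byte table;
    M_k[i] counts as 0 once i is out of bounds (i >= len(M[k]))."""
    mantissa = 0.0
    for k in range(9):
        byte = M[k][i] if i < len(M[k]) else 0
        mantissa += byte * 2.0 ** (-8 * (k + 1))
    return 2.0 ** (-E[i]) * mantissa

def precision_bytes(i, M):
    """Number of mantissa bytes actually stored for index i."""
    return sum(1 for k in range(9) if i < len(M[k]))
```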
Implementation on Reconfigurable Hardware
In this section we provide details on our implementation of the Bliss-I signature scheme on a Xilinx Spartan-6 FPGA. We include the enhancements from the previous section to obtain a design that is tuned for high performance at moderate resource cost. For details on the implementation of the Bernoulli sampler proposed in [12] we refer to the full version [31].
Enhanced CDT Sampling.
Along the lines of the previous section, our hardware implementation operates on bytes in order to use the 1024x8-bit mode of operation of the Spartan-6 block RAMs. The design of our CDT sampler is depicted in Figure 3. For random byte generation we use three instantiations of the Trivium stream cipher (each Trivium instance outputs one bit per clock cycle) to generate a uniformly random byte every third clock cycle and store spare bits in a LIFO for later use as sign bits. The random values r_j are stored in a 128x8-bit ring buffer realized as simple dual-port distributed RAM. The idea is that the sampler may request a large number of random bytes in the worst case but usually finishes after one or two comparisons due to the lazy search. As the BinSearch component keeps track of the maximum number of accessed random bytes, it allows the Uniform sampler to refresh only the used max(j) + 1 bytes in the buffer. In case the buffer is empty, we stall the Gaussian sampler until a sufficient amount of randomness becomes available. In order to compute the final sample x, we determine the sign bits of two samples x₁, x₂ and finally output x = x₁ + 11·x₂.
To achieve a high clock frequency, a comparison in the binary search step could not be performed in a single cycle due to the number of tables and range checks involved. We therefore allow two cycles per search step, which are carefully balanced. For example, we precompute the two possible next search indices (min + i)/2 and (max + i)/2 in the cycle prior to a comparison to relax the critical paths. We further merged the block memory B (port A) and the exponent table E (port B) into one 18K block memory and optimized the memory alignment accordingly. Note also that we still access the two ports of the block RAM holding B and E only every two clock cycles, which would enable another sampler to operate on the same table using time-multiplexing.
Signing and Verification Architecture
The architecture of our implementation of a high-speed Bliss signing engine is given in Figure 4. Similar to the GLP design [16], we implemented a two-stage pipeline where the polynomial multiplication a₁y₁ runs in parallel to the hashing H(⌈u⌋_d mod p, µ) and the sparse multiplication z_{1,2} = s_{1,2}c + y_{1,2}⁶. For polynomial multiplication [1, 32, 36] of a₁y₁ we rely on a publicly available FFT/NTT-based polynomial multiplier [33] (PolyMul). The public key a₁ is already stored in NTT format, so only one forward and one backward transform are required. The multiplier also instantiates either the Bernoulli or the CDT Gaussian sampler (configurable by a VHDL generic) and an intermediate FIFO for buffering. When a new triple (a₁y₁, y₁, y₂) is available, the data is transferred into the block memories BRAM-U, BRAM-Y1, and BRAM-Y2, and the small polynomial u = ζa₁y₁ + y₂ is computed on the fly and stored in BRAM-U for later use. The higher-order bits ⌈u⌋_d mod p of u are saved in RAM-U. As random oracle we have chosen the KECCAK-f[1600] hash function for its security and speed in hardware [22, 38]. A configurable hardware implementation⁷ is provided by the KECCAK project, and the mid-range core is parametrized so that the KECCAK state is split into 16 pieces (N_b = 16). To simplify control logic and padding, we hash only multiples of 1024-bit blocks and rehash in case of a rejection. Storing the state of the hash function after hashing the message (and before hashing ⌈u⌋_d mod p) would be possible but is not done due to the large state size of KECCAK. After hashing, the ExtractPos component extracts from the binary hash output the κ positions of c that are one and stores them in the 23x9-bit memory RAM-Pos.
For the computation of s₁c and s₂c we exploit that c has mostly zero coefficients and only κ = 23 coefficients set to one. Moreover, only d₁ = ⌈δ₁·n⌉ = 154 coefficients in s₁ are ±1, and s₂ has d₁ entries in {±2}, where the first coefficient is from {−1, 1, 3}. The simplest and, in this case, also best-suited algorithm for sparse polynomial multiplication is the row- or column-wise schoolbook algorithm. While row-wise multiplication would benefit from the sparsity of s_{1,2} and c, more memory accesses are necessary to add and store inner products. Since memory with more than two ports is extremely expensive, this also prevents, or at least limits, efficient and configurable parallelization. As a consequence, our implementation consists of a configurable number of cores (C) which perform column-wise multiplication to compute z₁ and z₂, respectively, as modeled by the sketch below. Each core stores the secret key (either s₁ or s₂) efficiently in a distributed RAM and accumulates inner products in a small multiply-accumulate unit (MAC). Positions of c are fed simultaneously into the cores. Another advantage of our approach is that we can compute the norms and scalar products for rejection sampling in parallel to the sparse multiplication. In Figure 4 a configuration with C = 2 is shown for simplicity, but our experiments show that C = 8 leads to an optimal trade-off between speed and resource consumption. Our verification engine uses only the PolyMul component (without a Gaussian sampler) and the Hash component and is thus much more lightweight than signing. The polynomial c, stored as (unordered) positions, is expanded into a 512x1-bit distributed RAM, and the input to the hash function is computed in a pipelined manner while PolyMul outputs a₁y₁.
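For illustration, a functional Python model of the column-wise sparse multiplication (the hardware distributes the columns over C parallel MAC cores; all names here are illustrative):

```python
N = 512   # polynomial degree n for Bliss-I

def sparse_mul(s, c_pos):
    """Column-wise schoolbook multiplication z = s*c mod (x^N + 1), where the
    sparse polynomial c is given as the (unordered) list of its kappa = 23
    one-positions, as stored in RAM-Pos."""
    z = [0] * N
    for col in range(N):          # one column per output coefficient
        acc = 0
        for p in c_pos:           # positions are fed to all cores in parallel
            # coefficient of x^col in s * x^p, with the negacyclic
            # wrap-around x^N = -1
            idx = col - p
            acc += s[idx] if idx >= 0 else -s[idx + N]
        z[col] = acc
        # norms and scalar products for rejection sampling can be
        # accumulated alongside each finished column, as in the hardware
    return z
```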
Results and Comparison
In this section we discuss our results which were obtained post place-and-route (PAR) on a Spartan-6 LX25 (speed grade -3) with Xilinx ISE 14.6.
Gaussian Sampling. Detailed results on the area consumption and timing of the CDT and Bernoulli Gaussian sampler designs are given in Table 2. The results show that the enhanced CDT sampler consumes fewer logic resources than the Bernoulli sampler (described in the full version [31]) at the cost of one 18K block memory to store the tables E and B. This is a significant improvement in storage size compared to a naive implementation without the Kullback-Leibler and Gaussian convolution techniques: a standard CDT implementation would require at least στλ = 370 kbits (about 23 18K block RAMs) for the defined parameters, matching a standard deviation σ = 215.73, tailcut τ = 13.4, and precision λ = 128. Regarding randomness consumption, the CDT sampler needs on average 21 bits per sample (using two smaller samples and the convolution lemma), generated by three instantiations of Trivium. The Bernoulli sampler, on the other hand, consumes 33 bits on average, generated by two instantiations of Trivium. With respect to average performance, the CDT and the Bernoulli sampler require 7.4 and 18.5 cycles per sample, respectively.
As a consequence, by combining the convolution lemma and the KL-divergence we were able to maintain the advantages of the CDT, namely high speed and a relatively simple implementation, while significantly reducing the memory requirements (from ≈ 23 18K block RAMs to one 18K block RAM). The convolution lemma works especially well in combination with the reverse tables, as the overall table sizes shrink and thus the number of comparisons is reduced. We therefore do not expect a CDT sampler that samples directly from standard deviation σ to be significantly faster. Additionally, larger tables would require more complex address generation, which might lower the achievable clock frequency. The Bernoulli approach, on the other hand, does not seem as suitable for an application of the convolution lemma as the CDT: its tables are already very small, so a reduction would not significantly reduce the area usage.
Previous implementations of Gaussian sampling for lattice-based public-key encryption can be found in [34, 37]. However, both works target a much smaller standard deviation of σ = 3.3. The work of Roy et al. [37] uses the Knuth-Yao algorithm (see [13] for more details), is very area-efficient (47 slices on a Virtex-5), and consumes little randomness, but requires 17 clock cycles for one sample. In [34] Bernoulli sampling is used to optimize simple rejection sampling by using Bernoulli evaluation instead of computing exp(). However, without use of the binary Gaussian distribution (see [12]) the rejection rate is high, and one sample requires 96 random bits and 144 cycles. This is acceptable for a relatively slow encryption scheme, and possible due to the high output rate (one bit per cycle) of the stream cipher used, but it is not a suitable architecture for Bliss. The discrete Ziggurat [8] performs well in software and might also profit from the techniques introduced in this work, but it does not seem to be a good target for a hardware implementation due to its infrequent rejection sampling operations and its costly requirement for high-precision floating-point arithmetic.
BLISS Operations. Results for the Bliss signing and verification engines and their sub-modules can be found in Table 2, including averaged cycle counts for successfully producing a signature. Note that the final slice, LUT, and FF counts of the signing engine cannot directly be computed as the sum of the sub-modules due to cross-module optimizations, timing optimization, and additional control logic between modules. One signing attempt takes roughly 10k cycles, and on average 1.6 trials are necessary using the Bliss-I parameter set. To evaluate the impact of the sampler used in the design, we instantiated two signing engines, one employing a CDT sampler and the other two Bernoulli samplers to match the speed of the multiplier. For a similar performance of roughly 8,000 signing operations per second, the signing instance based on the Bernoulli sampler has a significantly higher resource consumption (about 470 extra slices). Due to the two pipeline stages involved, the runtime of both instances is determined by max(Cycles(PolyMul), Cycles(Hash)) + Cycles(SparseMul), where the rejection sampling in Compression is performed in parallel. Further design-space exploration (e.g., evaluating the impact of a different number of parallel sparse multiplication operations or a faster configuration of KECCAK) always identified the PolyMul component as the performance bottleneck, or did not provide significant resource savings for reduced versions. In order to further increase the clock rate it would of course also be possible to instantiate the Gaussian sampler in a separate clock domain. The verification runtime is determined by Cycles(PolyMul) + Cycles(Hash), as no pipelining is used, and PolyMul is slightly faster than for signing since no Gaussian sampling is needed.

Table 2: Performance and resource consumption of the full Bliss-I signing engine using the CDT sampler or two parallel Bernoulli samplers (Ber) on the Spartan-6 LX25 for a small 1024-bit message.
The structural advantages of Bliss over GLP are a smaller polynomial modulus (GLP: q = 8383489 / Bliss-I: q = 12289), fewer iterations needed for a valid signature (GLP: 7 / Bliss-I: 1.6), and a higher security level (GLP: 80 bit / Bliss-I: 128 bit). Furthermore, and contrary to [16], we remark that our implementation takes the area cost and timing of a hash function (KECCAK) into account. In summary, our implementation of Bliss is superior to [16] in almost all aspects.

Table 3: Signing and verification speed of comparable signature scheme implementations. The GLP implementation was measured on a Spartan-6 device, the B-163 ECDSA one on a Cyclone II, and the other implementations on a Virtex-5.
