Abstract. Nearly all of the currently used and well-tested signature schemes (e.g. RSA or DSA) are based either on the factoring assumption or the presumed intractability of the discrete logarithm problem. Further algorithmic advances on these problems may lead to the unpleasant situation that a large number of schemes have to be replaced with alternatives. In this work we present such an alternative: a signature scheme whose security is derived from the hardness of lattice problems. It is based on recent theoretical advances in lattice-based cryptography and is highly optimized for practicability and use in embedded systems. The public and secret keys are roughly 12000 and 2000 bits long, while the signature size is approximately 9000 bits for a security level of around 100 bits. The implementation results on reconfigurable hardware (Spartan/Virtex 6) are very promising and show that the scheme is scalable, has low area consumption, and even outperforms some classical schemes.
Introduction
Due to the yet unpredictable but possibly imminent threat of the construction of a quantum computer, a number of alternative cryptosystems to RSA and ECC have gained significant attention during the last years. In particular, it has been widely accepted that relying solely on asymmetric cryptography based on the hardness of factoring or the (elliptic curve) discrete logarithm problem is certainly not sufficient in the long term [7]. This has been mainly due to the work of Shor [34], who demonstrated that both classes of problems can be efficiently attacked with quantum computers. As a consequence, first steps towards the required diversification and investigation of alternative fundamental problems and schemes have been taken. This has already led to efficient implementations of various schemes based on multivariate quadratic systems [5, 3] and the code-based McEliece cryptosystem [10, 35].
Another promising alternative to number-theoretic constructions are lattice-based cryptosystems because they admit security proofs based on well-studied problems that currently cannot be solved by quantum algorithms. For a long time, however, lattice constructions have only been considered secure for inefficiently large parameters that are well beyond practicability or were, like GGH [14] and NTRUSign [17], broken due to flaws in the ad-hoc design approach [30]. This has changed since the introduction of cyclic and ideal lattices [26] and related computationally hard problems like Ring-SIS [31, 22, 24] and Ring-LWE [25], which enabled the construction of a great variety of theoretically elegant and efficient cryptographic primitives.
In this work we try to further close the gap between the advances in theoretical lattice-based cryptography and real-world implementation issues by constructing and implementing a provably-secure digital signature scheme based on ideal lattices. While maintaining the connection to hard ideal lattice problems we apply several performance optimizations for practicability that result in moderate signature and key sizes as well as performance suitable for embedded and hardware systems.
Digital Signatures and Related Work. Digital signatures are arguably the most used public-key cryptographic primitive in practical applications, and a lot of effort has gone into trying to construct such schemes from lattice assumptions. Due to the success of the NTRU encryption scheme, it was natural to try to design a signature scheme based on the same principles. Unlike the encryption scheme, however, the proposed NTRU signature scheme [18, 16] has been completely broken by Nguyen and Regev [30]. Provably-secure digital signatures were finally constructed in 2008, by Gentry, Peikert, and Vaikuntanathan [13], and, using different techniques, by Lyubashevsky and Micciancio [23]. The scheme in [13] was very inefficient in practice, with outputs and keys being megabytes long, while the scheme in [23] was only a one-time signature that required the use of Merkle trees to become a full signature scheme. The work of [23] was extended by Lyubashevsky [20, 21], who gave a construction of a full-fledged signature scheme whose keys and outputs are currently on the order of 15000 bits each, for an 80-bit security level. The work of [13] was also recently extended by Micciancio and Peikert [27], where the size of the signatures and keys is roughly 100,000 bits.
Our Contribution. The main contribution of this work is the implementation of a digital signature scheme from [20, 21] optimized for embedded systems. In addition, we propose an improvement to the above-mentioned scheme which preserves the security proof, while lowering the signature size by approximately a factor of two. We demonstrate the practicability of our scheme by implementing a scalable and efficient signing and verification engine. For example, on the low-cost Xilinx Spartan-6 we are 1.5 times faster and use only half of the resources of the optimized RSA implementation of Suzuki [38]. With more than 12000 signatures and over 14000 signature verifications per second, we can satisfy even high-speed demands using a Virtex-6 device.
Outline. The paper is structured as follows. First we give a short overview on our hardness assumption in Section 2 and then introduce the highly efficient and practical signature scheme in Section 3. Based on this description, we introduce our implementation and the hardware architecture of the signing and signature verification engine in Section 4 and analyze its performance on different FPGAs in Section 5. In Section 6 we summarize our contribution and present an outlook for future work.
Preliminaries

Notation
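Throughout the paper, we will assume that $n$ is an integer that is a power of 2, $p$ is a prime number congruent to 1 modulo $2n$, and $R^{p^n}$ is the ring $\mathbb{Z}_p[x]/(x^n + 1)$. Elements of $R^{p^n}$ are represented by polynomials of degree $n - 1$ with coefficients in the range $[-(p-1)/2, (p-1)/2]$, and $R^{p^n}_k$ denotes the subset of $R^{p^n}$ consisting of all polynomials with coefficients in the range $[-k, k]$.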
Hardness Assumption
In a particular version of the Ring-SIS problem, one is given an ordered pair of polynomials $(a, t) \in R^{p^n} \times R^{p^n}$ where $a$ is chosen uniformly from $R^{p^n}$ and $t = as_1 + s_2$, where $s_1$ and $s_2$ are chosen uniformly from $R^{p^n}_k$, and is asked to find an ordered pair $(s_1, s_2)$ such that $as_1 + s_2 = t$. It can be shown that when $k > \sqrt{p}$, the solution is not unique and finding any one of them, for $\sqrt{p} < k \ll p$, was proven in [31, 22] to be as hard as solving worst-case lattice problems in ideal lattices. On the other hand, when $k < \sqrt{p}$, it can be shown that the only solution is $(s_1, s_2)$ with high probability, and there is no classical reduction known from worst-case lattice problems to finding this solution. In fact, this latter problem is a particular instance of the Ring-LWE problem. It was recently shown in [25] that if one chooses the $s_i$ from a slightly different distribution (i.e., a Gaussian distribution instead of a uniform one), then solving the Ring-LWE problem (i.e., recovering the $s_i$ when given $(a, t)$) is as hard as solving worst-case lattice problems using a quantum algorithm. Furthermore, it was shown that solving the decision version of Ring-LWE, that is, distinguishing ordered pairs $(a, as_1 + s_2)$ from uniformly random ones in $R^{p^n} \times R^{p^n}$, is still as hard as solving worst-case lattice problems.
In this paper, we implement our signature scheme based on the presumed hardness of the decision Ring-LWE problem with particularly "aggressive" parameters. We define the $\mathrm{DCK}_{p,n}$ problem (Decisional Compact Knapsack problem) to be the problem of distinguishing between the uniform distribution over $R^{p^n} \times R^{p^n}$ and the distribution $(a, as_1 + s_2)$ where $a$ is uniformly random in $R^{p^n}$ and the $s_i$ are uniformly random in $R^{p^n}_1$. As of now, there are no known algorithms that take advantage of the fact that the distribution of the $s_i$ is uniform (i.e., not Gaussian) and consists of only $-1/0/1$ coefficients, and so it is very reasonable to conjecture that this problem is still hard. In fact, this is essentially the assumption that the NTRU encryption scheme is based on. Due to lack of space, we direct the interested reader to Section 3 of the full version of [21] for a more in-depth discussion of the hardness of the different variants of the SIS and LWE problems.
Cryptographic Hash Function H with Range $D^n_{32}$
Our signature scheme uses a hash function, and it is quite important for us that the output of this function is of a particular form. The range of this function, $D^n_{32}$, for $n \ge 512$ consists of all polynomials of degree $n - 1$ that have all zero coefficients except for at most 32 coefficients that are $\pm 1$.
We denote by H the hash function that first maps $\{0, 1\}^*$ to a 160-bit string and then injectively maps the resulting 160-bit string $r$ to $D^n_{32}$ via an efficient procedure we now describe. To map a 160-bit string into the range $D^n_{32}$ for $n \ge 512$, we look at 5 bits of $r$ at a time and transform them into a 16-digit string with at most one non-zero coefficient as follows: let $r_1 r_2 r_3 r_4 r_5$ be the five bits we are currently looking at. If $r_1$ is 0, then put a $-1$ in position number $r_2 r_3 r_4 r_5$ (where we read the 4-digit string as a number between 0 and 15) of the 16-digit string. If $r_1$ is 1, then put a 1 in position $r_2 r_3 r_4 r_5$. This converts a 160-bit string into a 512-digit string with at most 32 $\pm 1$'s. We then convert the 512-digit string into a polynomial in the natural way by assigning the $i$-th digit of the string to the $i$-th coefficient of the polynomial. If $n > 512$, then all of the higher-order coefficients are 0.
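The mapping just described can be sketched in a few lines of Python. This is only an illustration: the use of SHA-1 as the 160-bit hash, the big-endian bit order, and the name `hash_to_d32` are assumptions that the text does not fix.

```python
import hashlib


def hash_to_d32(data: bytes, n: int = 512) -> list[int]:
    """Map data to n coefficients, 32 of which are +1 or -1 (illustrative sketch)."""
    if n < 512:
        raise ValueError("n must be at least 512")
    # 160-bit string r; SHA-1 is an assumed stand-in for the 160-bit hash.
    r = int.from_bytes(hashlib.sha1(data).digest(), "big")
    coeffs = [0] * n
    for block in range(32):                      # 32 groups of 5 bits
        five = (r >> (160 - 5 * (block + 1))) & 0b11111
        sign = 1 if five >> 4 else -1            # r1 = 1 gives +1, r1 = 0 gives -1
        pos = five & 0b1111                      # r2 r3 r4 r5 read as 0..15
        coeffs[16 * block + pos] = sign          # one entry per 16-digit block
    return coeffs                                # coefficients above 511 stay 0
```

Each 5-bit group selects one position inside its own 16-digit block, so distinct 160-bit strings always yield distinct polynomials.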
The Signature Scheme
In this section, we will present the lattice-based signature scheme whose hardware implementation we describe in Section 4. This scheme is a combination of the schemes from [20] and [21] as well as an additional optimization that allows us to reduce the signature length by almost a factor of two. In [20] , Lyubashevsky constructed a lattice-based signature scheme based on the hardness of the Ring-SIS problem, and this scheme was later improved in two ways [21] .
The first improvement results in signatures that are asymptotically shorter, but unfortunately involves a somewhat more complicated rejection sampling algorithm during the signing procedure, involving sampling from the normal distribution and computing quotients to a very high precision, which would not be very well supported in hardware. We do not know whether the actual savings achieved in the signature length would justify the major slowdown incurred, and we leave the possibility of efficiently implementing this rejection sampling algorithm to future work. The second improvement from [21], which we do use, shows how the size of the keys and the signature can be made significantly smaller by changing the assumption from Ring-SIS to Ring-LWE.
The Basic Signature Scheme
For ease of exposition, we first present the basic combination scheme of [20] and [21] in Figure 1 , and sketch its security proof. Full security proofs are available in [20] and [21] . We then present our optimization in Sections 3.2 and 3.3.
[Fig. 1. Basic signature scheme: signing key $s_1, s_2$ and algorithm Sign($\mu$, $a$, $s_1$, $s_2$), which samples $y_1, y_2$ in step 1.]
The parameter $k$ in our scheme, which first appears in line 1 of the signing algorithm, controls the trade-off between the security and the runtime of our scheme. The smaller we take $k$, the more secure the scheme becomes (and the shorter the signatures get), but the time to sign will increase. We explain this as well as the choice of parameters below.
To sign a message $\mu$, we pick two "masking" polynomials $y_1, y_2$ uniformly at random from $R^{p^n}_k$ and compute $c \leftarrow H(ay_1 + y_2, \mu)$ and the potential signature $(z_1, z_2, c)$ where $z_1 \leftarrow s_1 c + y_1$, $z_2 \leftarrow s_2 c + y_2$. But before sending the signature, we must perform a rejection-sampling step where we only send if $z_1, z_2$ are both in $R^{p^n}_{k-32}$. This part is crucial for security and it is also where the size of $k$ matters. If $k$ is too small, then $z_1, z_2$ will almost never be in $R^{p^n}_{k-32}$, whereas if it is too big, it will be easy for the adversary to forge messages. To verify the signature $(z_1, z_2, c)$, the verifier simply checks that $z_1, z_2 \in R^{p^n}_{k-32}$ and that $c = H(az_1 + z_2 - tc, \mu)$.
Our security proof follows that in [21] except that it uses the rejection sampling algorithm from [20]. Given a random polynomial $a \in R^{p^n}$, we pick two polynomials $s_1, s_2$ from $R^{p^n}_k$ and return $(a, t = as_1 + s_2)$ as the public key. By the $\mathrm{DCK}_{p,n}$ assumption (and a standard hybrid argument), this looks like a valid public key (i.e., the adversary cannot tell that the $s_i$ are chosen from $R^{p^n}_k$ rather than from $R^{p^n}_1$). When the adversary gives us signature queries, we appropriately program the hash function outputs so that our signatures are valid even though we do not know a valid secret key (in fact, a valid secret key does not even exist). When the adversary successfully forges a new signature, we then use the "forking lemma" [33] to produce two signatures of the message $\mu$, $(z_1, z_2, c)$ and $(z_1', z_2', c')$, such that
$$az_1 + z_2 - tc = az_1' + z_2' - tc', \qquad (2)$$
which implies that
$$a(z_1 - z_1') + (z_2 - z_2') + tc' - tc = 0,$$
and because we know that $t = as_1 + s_2$, we can obtain
$$a(z_1 - z_1' + s_1 c' - s_1 c) + (z_2 - z_2' + s_2 c' - s_2 c) = 0.$$
Because $z_i$, $z_i'$, $s_i$, $c$, and $c'$ have small coefficients, we found two polynomials $u_1, u_2$ with small coefficients such that $au_1 + u_2 = 0$. By [21, Lemma 3.7], knowing such small $u_i$ allows us to solve the $\mathrm{DCK}_{p,n}$ problem.
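The signing and verification steps of the basic scheme can be modelled in software roughly as follows. This is a toy sketch under stated assumptions, not the implementation evaluated in this paper: SHA-1 stands in for the 160-bit hash, the byte serialization of polynomials is hypothetical, $k = 2^{14}$ is inferred from the size calculations for parameter set I, and nothing here is constant-time or hardened.

```python
import hashlib
import secrets

N, P, K = 512, 8383489, 2**14    # K = 2^14 is an inferred value, not stated explicitly


def center(x):
    """Representative of x mod P in [-(P-1)/2, (P-1)/2]."""
    h = (P - 1) // 2
    return (x + h) % P - h


def mul(a, b):
    """Schoolbook product in Z_P[x]/(x^N + 1); put the sparser operand first."""
    res = [0] * (2 * N)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                res[i + j] += ai * bj
    return [(res[i] - res[i + N]) % P for i in range(N)]   # x^N = -1


def add(a, b):
    return [(x + y) % P for x, y in zip(a, b)]


def sub(a, b):
    return [(x - y) % P for x, y in zip(a, b)]


def uniform(bound):
    """Polynomial with coefficients uniform in [-bound, bound]."""
    return [secrets.randbelow(2 * bound + 1) - bound for _ in range(N)]


def H(poly, msg):
    """Hash (poly, msg) to a polynomial with 32 entries +-1 (assumed encoding)."""
    data = b"".join(x.to_bytes(3, "big") for x in poly) + msg
    r = int.from_bytes(hashlib.sha1(data).digest(), "big")
    c = [0] * N
    for blk in range(32):
        five = (r >> (155 - 5 * blk)) & 31
        c[16 * blk + (five & 15)] = 1 if five >> 4 else -1
    return c


def keygen():
    a = [secrets.randbelow(P) for _ in range(N)]
    s1, s2 = uniform(1), uniform(1)             # secret key with -1/0/1 coefficients
    return (a, add(mul(a, s1), s2)), (s1, s2)   # public key (a, t = a*s1 + s2)


def sign(msg, a, s1, s2):
    while True:
        y1, y2 = uniform(K), uniform(K)                     # masking polynomials
        c = H(add(mul(a, y1), y2), msg)
        z1 = [center(x) for x in add(mul(c, s1), y1)]
        z2 = [center(x) for x in add(mul(c, s2), y2)]
        if all(abs(x) <= K - 32 for x in z1 + z2):          # rejection sampling
            return z1, z2, c


def verify(msg, sig, a, t):
    z1, z2, c = sig
    if any(abs(x) > K - 32 for x in z1 + z2):
        return False
    return c == H(sub(add(mul(a, z1), z2), mul(c, t)), msg)  # a*z1 + z2 - t*c
```

Verification succeeds because $az_1 + z_2 - tc = ay_1 + y_2$ whenever $t = as_1 + s_2$, so the verifier hashes the same polynomial as the signer.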
We now explain the trick that we use to lower the size of the signature as returned by the optimized scheme presented in Section 3.3. Notice that if Equation (2) does not hold exactly, but only approximately (i.e., $az_1 + z_2 - tc - (az_1' + z_2' - tc') = w$ for some small polynomial $w$), then we can still obtain small $u_1, u_2$ such that $au_1 + u_2 = 0$, except that the value of $u_2$ will be larger by at most the norm of $w$. Thus if $az_1 + z_2 - tc \approx az_1' + z_2' - tc'$, we will still be able to produce small $u_1, u_2$ such that $au_1 + u_2 = 0$. This could make us consider only sending $(z_1, c)$ as a signature rather than $(z_1, z_2, c)$, and the proof will go through fine. The problem with this approach is that the verification algorithm will no longer work, because even though $az_1 + z_2 - tc \approx az_1 - tc$, the output of the hash function H will be different. A way to go around the problem is to only evaluate H on the "high order bits" of the coefficients comprising the polynomial $az_1 + z_2 - tc$, which we could hope to be the same as those of the polynomial $az_1 - tc$. But in practice, too many bits would be different (because of the carries caused by $z_2$) for this to be a useful trick. What we do instead is send $(z_1, z_2', c)$ as the signature, where $z_2'$ only tells us the carries that $z_2$ would have created in the high order bits in the sum of $az_1 + z_2 - tc$, and so $z_2'$ can be represented with much fewer bits than $z_2$. In the next subsection, we explain exactly what we mean by "high-order bits" and give an algorithm that produces a $z_2'$ from $z_2$, and then provide an optimized version of the scheme in this section that uses the compression idea.
The Compression Algorithm
For every integer y in the range − 
The Lemma below states that given two vectors $y, z \in R^{p^n}$ where the coefficients of $z$ are small, we can replace $z$ by a much more compressed vector $z'$ while keeping the higher order bits of $y + z$ and $y + z'$ the same. The algorithm that satisfies this lemma is presented in Figure 5 in Appendix A.
2. $z'$ can be represented with only $2n + \log(2k + 1) \cdot \frac{6kn}{p}$ bits.
A Signature Scheme for Embedded Systems
We now present the version of the signature scheme that incorporates the compression idea from Section 3.2 (see Figure 2 ). We will use the following notation that is similar to the notation in Section 3.2: every polynomial Y ∈ R p n can be written as
where
and k corresponds to the k in the signature scheme in Figure  2 . Notice that there is a bijection between polynomials Y and this representation (Y (1) , Y (0) ) where
and
Intuitively, $Y^{(1)}$ is comprised of the higher order bits of $Y$. The secret key in our scheme consists of two polynomials $s_1, s_2$ sampled uniformly from $R^{p^n}_1$, and the public key consists of the polynomials $(a, t)$, where $a$ is uniformly random in $R^{p^n}$ and $t = as_1 + s_2$.
[Fig. 2. Optimized Signature Scheme. Signing key: $s_1, s_2$. Signing algorithm Sign($\mu$, $a$, $s_1$, $s_2$). Verification on input ($\mu$, $z_1$, $z_2'$, $c$, $a$, $t$): accept iff $z_1, z_2' \in R^{p^n}_{k-32}$ and $c = H((az_1 + z_2' - tc)^{(1)}, \mu)$.]
In step 1 of the signing algorithm, we choose the "masking polynomials" $y_1, y_2$ from $R^{p^n}_k$. In step 2, we let $c$ be the hash function value of the high order bits of $ay_1 + y_2$ and the message $\mu$. In step 3, we compute $z_1, z_2$ and proceed only if they fall into a certain range. In step 5, we compress the value $z_2$ using the compression algorithm implied in Lemma 3.1, and obtain a value $z_2'$ such that $(az_1 + z_2' - tc)^{(1)} = (az_1 + z_2 - tc)^{(1)}$. The verifier checks that $z_1, z_2' \in R^{p^n}_{k-32}$ and that $c = H((az_1 + z_2' - tc)^{(1)}, \mu)$. The running time of the signature algorithm depends on the relationship of the parameter $k$ with the parameter $p$. The larger the $k$, the more chance that $z_1$ and $z_2$ will be in $R^{p^n}_{k-32}$ in step 4 of the signing algorithm, but the easier the signature will be to forge. Thus it is prudent to set $k$ as small as possible while keeping the running time reasonable.
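The dependence of the running time on $k$ can be made concrete with a small calculation. The sketch below assumes that every coefficient of $s_i c$ has absolute value at most 32 and that the coefficients of $y_i$ are independent and uniform in $[-k, k]$; under these assumptions each of the $2n$ coefficients of $(z_1, z_2)$ lands in $[-(k-32), k-32]$ with probability $(2(k-32)+1)/(2k+1)$. The function name and the value $k = 2^{14}$ are illustrative choices.

```python
from fractions import Fraction


def acceptance_probability(n: int, k: int, weight: int = 32) -> float:
    """Probability that all 2n coefficients of (z1, z2) pass the range check."""
    # For a fixed shift d with |d| <= weight, exactly 2(k - weight) + 1 of the
    # 2k + 1 possible values of y give |y + d| <= k - weight.
    per_coefficient = Fraction(2 * (k - weight) + 1, 2 * k + 1)
    return float(per_coefficient ** (2 * n))


prob = acceptance_probability(n=512, k=2**14)
expected_attempts = 1 / prob   # mean of the geometric number of signing attempts
```

For $n = 512$ and $k = 2^{14}$ this evaluates to roughly 0.135, i.e. about seven to eight signing attempts on average; halving $k$ roughly squares the acceptance probability, which illustrates the trade-off described above.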
Concrete Instantiation
We now give some concrete instantiations of our signature scheme from Figure 2. The security of the scheme depends on two things: the hardness of the underlying $\mathrm{DCK}_{p,n}$ problem and the hardness of finding pre-images in the random oracle H. For simplicity, we fixed the output of the random oracle to 160 bits and so finding pre-images is 160 bits hard. Judging the security of the lattice problem, on the other hand, is notoriously more difficult. For this part, we rely on the extensive experiments performed by Gama and Nguyen [12] and Chen and Nguyen [8] to determine the hardness of lattice reductions for certain classes of lattices. The lattices that were used in the experiments of [12] were a little different than ours, but we believe that barring some unforeseen weakness due to the added algebraic structure of our lattices and the parameters, the results should be quite similar. We consider it somewhat unlikely that the algebraic structure causes any weaknesses since for certain parameters, our signature scheme is as hard as Ring-LWE (which has a quantum reduction from worst-case lattice problems [25]), but we do encourage cryptanalysis for our particular parameters because they are somewhat smaller than what is required for the worst-case to average-case reduction in [37, 25] to go through. The methodology for choosing our parameters is the same as in [21], and so we direct the interested reader to that paper for a more thorough discussion. In short, one needs to make sure that the length of the secret key $[s_1 | s_2]$ as a vector is not too much smaller than $\sqrt{p}$ and that the allowable length of the signature vector, which depends on $k$, is not much larger than $\sqrt{p}$. Using these quantities, one can perform the now-standard calculation of the "root Hermite factor" that lattice reduction algorithms must achieve in order to break the scheme (see [12, 28, 21] for examples of how this is done).
According to experiments in [12, 8] a factor of 1.01 is achievable now, a factor of 1.007 seems to have around 80 bits of security, and a factor of 1.005 has more than 256-bit security. In Figure 1 , we present two sets of parameters. According to the aforementioned methodology, the first has somewhere around 100 bits of security, while the second has more than 256.
We will now explain how the signature, secret key, and public key sizes are calculated. We will use the concrete numbers from set I as example. The signature size is calculated by summing the bit lengths of $z_1$, $z_2'$, and $c$. Since $z_1$ is in $R^{p^n}_{k-32}$, it can be represented by $n \log(2(k - 32) + 1) \le n \log k + n = 7680$ bits. From Lemma 3.1, we know that $z_2'$ can be represented with $2n + \log(2(k - 32) + 1) \cdot \frac{6(k-32)n}{p} \le 2n + 6 \log(2k) = 1114$ bits. And $c$ can be represented with 160 bits, for a total signature size of 8954 bits. The secret key consists of polynomials $s_1, s_2 \in R^{p^n}_1$, and so they can be represented with $2 \lceil n \log 3 \rceil = 1624$ bits, but a simpler representation can be used that requires 2048 bits. The public key consists of the polynomials $(a, t)$, but the polynomial $a$ does not need to be unique for every secret key, and can in fact be some randomness that is agreed upon by everyone who uses the scheme. Thus the public key can be just $t$, which can be represented using $n \lceil \log p \rceil = 11776$ bits.
We point out that even though the signature and key sizes are larger than in some number-theory based schemes, the signature scheme in Figure 2 is quite efficient (in software and in hardware), with all operations taking quasi-linear time, as opposed to at least quadratic time for number-theory based schemes. The most expensive operation of the signing algorithm is in step 2 where we need to compute $ay_1 + y_2$, which also could be done in quasi-linear time using FFT. In step 3, we also need to perform polynomial multiplication, but because $c$ is a very sparse polynomial with only 32 non-zero entries, this can be performed with just 32 vector additions. And there is no multiplication needed in step 5 because $az_1 - tc = ay_1 + y_2 - z_2$.
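The arithmetic behind these sizes can be reproduced directly. In the sketch below, $n = 512$ and $p = 8383489$ are the values used by the implementation, while $k = 2^{14}$ is an inferred value (it is the $k$ for which $n \log k + n = 7680$); the variable names are illustrative.

```python
import math

n, p = 512, 8383489
k = 2**14                                    # inferred from n*log(k) + n = 7680

z1_bits = n * int(math.log2(k)) + n          # upper bound n*log(k) + n
z2_bits = 2 * n + 6 * int(math.log2(2 * k))  # upper bound 2n + 6*log(2k)
c_bits = 160                                 # output length of the hash
signature_bits = z1_bits + z2_bits + c_bits

secret_key_packed = 2 * math.ceil(n * math.log2(3))   # two ternary polynomials
secret_key_simple = 2 * n * 2                          # 2 bits per coefficient
public_key_bits = n * math.ceil(math.log2(p))          # t only, 23 bits per coefficient
```

The "simpler representation" of the secret key is modelled here as two bits per ternary coefficient, which is an assumption consistent with the stated 2048 bits.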
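The remark that multiplying by the sparse polynomial $c$ costs only 32 vector additions can be illustrated as follows. This is a plain software sketch over the integers (no reduction modulo $p$), and the helper names `negacyclic_shift` and `sparse_mul` are hypothetical.

```python
def negacyclic_shift(s, j):
    """Coefficients of s * x^j in Z[x]/(x^n + 1), for 0 <= j < n."""
    n = len(s)
    # Coefficients pushed past x^(n-1) wrap around with a sign flip (x^n = -1).
    return [-x for x in s[n - j:]] + s[:n - j]


def sparse_mul(s, c):
    """Product s * c where c has only a few entries, each +1 or -1."""
    acc = [0] * len(s)
    for j, cj in enumerate(c):
        if cj:                                   # one vector addition per non-zero c_j
            shifted = negacyclic_shift(s, j)
            acc = [x + cj * y for x, y in zip(acc, shifted)]
    return acc
```

With 32 non-zero entries in $c$, the loop body runs 32 times, each time adding or subtracting a rotated copy of $s$.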
Implementation
In this section we provide a detailed description of our FPGA implementation of the signature scheme's signing and verification procedures for parameter set I with about 100 bits of equivalent symmetric security. In order to improve the speed and resource consumption on the FPGA, we utilize internal block memories (BRAM) and DSP hardcores spanning over three clock domains. We designed dedicated implementations of the signing and verification operation that work with externally generated keys.
Roughly speaking, the signing engine is composed of a scalable number of area-efficient polynomial multipliers to compute $ay_1 + y_2$. Fresh randomness for $y_1, y_2$ is supplied each run by a random number generator (in this prototype implementation an LFSR). To ensure a steady supply of fresh polynomials from the multiplier for the subsequent parts of the design and the actual signing operation, we have included a buffer of a configurable size that pre-stores pairs $(ay_1 + y_2, y_1 || y_2)$. The hash function H saves its state after the message has been hashed and thus prevents rehashing of the (presumably long) message in each new rejection-sampling step. The sparse multiplication of $sc$ works coefficient-wise and thus allows immediate testing for the rejection condition. If an out-of-bound coefficient occurs (lines 4 and 6 of Figure 2), the multiplication and compression are immediately interrupted and a new polynomial pair is retrieved from the buffer. For the verification engine, we use the polynomial multiplier that computes $ay_1 + y_2$ twice: we compute $az_1 + z_2$ first, maintain the internal state, and then add $t(-c)$ in a second round to produce the input for the hash function. When signatures are fed into or returned by both engines, they are encoded in order to meet the signature size (see Lemma A.2 for a detailed algorithm).
Message Signing
The detailed top-level design of the signing engine is depicted in Figure 3. The computation of $ay_1 + y_2$ is implemented in clock domain (1) and carried out by a number of PolyMul units (three units are shown in the depicted setup). The BRAMs storing the initial parameters $y = y_1 || y_2$ are refilled by a random number generator (RNG) running independently in clock domain (3), and the constant polynomial $a$ is loaded during device initialization. When a PolyMul unit has finished the computation of $r = ay_1 + y_2$, it requests exclusive access to the Buffer and stores $r$ and $y$ when free space is available. Internally the Buffer consists of the two configurable FIFOs FIFO(r) and FIFO(y). As all operations in clock domains (1) and (3) are independent of the secret key or message, they are triggered when space in the Buffer becomes available. As described in Section 3.4, the polynomial $r = ay_1 + y_2$ is needed as input to the hashing as well as for the compression components and is thus stored in BRAM BUF(r), while the coefficients of $y_1, y_2$ are only needed once and are therefore taken directly out of the FIFOs.
When a signature for a message stored in FIFO(m) is requested, the rejection sampling is repeated in clock domain (2) until a valid signature has been written into FIFO($\sigma$). The message to be signed is first hashed and its internal state saved. Therefore, it is only necessary to rehash $r$ in case the computed signature is rejected (but not the message again). When the hash $c$ is ready, the Compression component is started. In this component, the values $z_1 = s_1 c + y_1$ and $z_2 = s_2 c + y_2$ are computed column/coefficient-wise with a Comba-style sparse multiplier [9] followed by an addition so that coefficients of $z_1$ or $z_2$ are sequentially generated. Rejection sampling is directly performed on these coefficients and the whole pair $(r, y)$ is rejected once a coefficient is encountered that is not in the desired range. The secret key $s = s_1 || s_2$ is stored in the block RAM BRAM(s), which can be initialized during device initialization or set from the outside during runtime. The whole signature $\sigma = (z_1, z_2', c)$ is encoded by the Encoder component in order to meet the desired signature size (max. 8954 bits) and then written into the FIFO FIFO($\sigma$). The usage of FIFOs and BRAMs as I/O ports allows easy integration of our engine into other designs and provides the ability for clock domain separation.
Polynomial Multiplication. The most time-consuming operation of the signature scheme is the polynomial multiplication $a \cdot y_1$ (with the addition of $y_2$ being rather simple). Recall that $a \in R^{p^n}$ has 512 23-bit wide coefficients and that $y_1 \in R^{p^n}_k$ consists of 512 16-bit wide coefficients. We are aware that the selected schoolbook algorithm (complexity of $O(n^2)$) is theoretically inferior compared to Karatsuba [19] ($O(n^{\log 3})$) or the FFT [29] ($O(n \log n)$). However, its regular structure and iterative nature allow very high clock frequencies and an area-efficient implementation on very small and cheap devices. The polynomial reduction with $f = x^n + 1$ is performed in place, which leads to the negacyclic convolution
$$r = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (-1)^{\lfloor (i+j)/n \rfloor} a_i y_j x^{(i+j) \bmod n}$$
of $a$ and $y_1$. The data path for the arithmetic is depicted in Figure 4(a). The computation of $a_i y_j$ is realized in a multiplication core. We avoid dealing with signed values by determining the sign of the value added to the intermediate coefficient from the MSB sign bit of $y_j$ and from whether a reduction modulo $x^n + 1$ is necessary. As all coefficients of $a$ are stored in the range $[0, p - 1]$, they do not affect the sign of the result. Modular reduction (see Figure 4(b)) by $p = 8383489$ is implemented based on the idea of Solinas [36], as $2^{23} \bmod 8383489 = 5119$ is very small. For the modular addition of $y_2$, the multiplier's arithmetic pipeline is reused in a final round in which the output of BRAM(a) is set to 1 and the coefficients of $y_2$ are fed into the BRAM(y) port. Each PolyMul unit also acts as an additional buffer, as it can hold one complete result of $r$ in its internal temporary BRAM and thus reduces latency further in a scenario with precomputation. All in all, one PolyMul unit requires 204 slices, 3 BRAMs, and 4 DSPs, and is able to generate approx. 1130 pairs of $(r, y)$ per second at a clock frequency of 300 MHz on a Spartan-6.
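A software model of this data path may help to follow the description. The sketch below is an illustration only and does not reproduce the pipeline of Figure 4: it folds the bits above position 23 back using $2^{23} \equiv 5119 \pmod{p}$ and derives the sign of each partial product from the sign of $y_j$ and from the wrap-around modulo $x^n + 1$. The function names are hypothetical.

```python
P = 8383489
assert (1 << 23) % P == 5119          # the congruence used for the reduction


def reduce_mod_p(x: int) -> int:
    """Reduce a non-negative integer modulo P by folding the high bits."""
    while x >> 23:
        # x = hi * 2^23 + lo  is congruent to  hi * 5119 + lo  (mod P)
        x = (x >> 23) * 5119 + (x & 0x7FFFFF)
    return x - P if x >= P else x     # here x < 2^23 < 2P, one subtraction suffices


def negacyclic_schoolbook(a, y):
    """r = a * y in Z_P[x]/(x^n + 1); a has entries in [0, P-1], y is signed."""
    n = len(a)
    r = [0] * n
    for i in range(n):
        for j in range(n):
            prod = reduce_mod_p(a[i] * abs(y[j]))          # unsigned multiply
            negative = (y[j] < 0) != (i + j >= n)          # sign bit XOR wrap-around
            idx = (i + j) % n
            r[idx] = reduce_mod_p(r[idx] + (P - prod if negative else prod))
    return r
```

Keeping all intermediate values non-negative, as in this model, mirrors the idea of avoiding signed arithmetic: a negative contribution is added as $p - \text{prod}$ instead of being subtracted.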
Signature Verification
In the previous sections we discussed the details of the signing algorithm. When dealing with the signature verification, we can reuse most of the previously described components. In particular, the PolyMul component only needs a slight modification in order to compute $az_1 + z_2 - tc$, which allows efficient resource sharing for both operations. It is easy to see that we can split the computation of the input to the hash instantiation into $t_1 = az_1 + z_2$, $t_2 = t(-c) + 0$ and $t = t_1 + t_2$. We see that the first equation can be performed by the PolyMul core as $a \in R^{p^n}$. As a consequence, PolyMul supports a special flag that triggers a multiply-accumulate behavior in which the content of BRAM(r) is preserved after a full run of the schoolbook multiplication ($ay_1$) and an addition of $y_2$. Therefore, the intermediate values $t_1$ and $t_2$ are summed up in BRAM(r) and we do not need the final addition. This enabled us to design a verification engine that performs its arithmetic operations with just two runs of the PolyMul core.
Results and Comparison
All presented results below were obtained after post-place-and-route (PAR) and were generated with Xilinx ISE 13.3. We have implemented the signing and verification engine (parameter set I, buffer of size one) on two devices of the low-cost Spartan-6 device family and on one high-speed Virtex-6 (all speed grade −3). Detailed information regarding performance and resources consumption is given in Table 2 and Table 3 , respectively. For the larger devices we instantiate multiple distinct engines as the Compression and Hash components become the bottleneck when a certain amount of PolyMul components are instantiated. Note also that our implementation is small enough to fit the signing (two PolyMul units) or verification engine on the second-smallest Spartan-6 LX9.
When comparing our results to other work as given in Table 4, we conservatively assume that RSA signatures (one modular exponentiation) with a key size of 1024 bit and ECDSA signatures (one point multiplication) with a key size of 160 bit are comparable to our scheme in terms of security (see Section 3.4 for details on the parameters). In comparison with RSA, our implementation on the low-cost Spartan-6 is 1.5 times faster than the high-speed implementation of Suzuki [38], which still needs twice as many device resources and runs on the more expensive Virtex-4 device. Note, however, that ECC over binary curves is very well suited for hardware and even implementations on old FPGAs like the Virtex-2 [1] are faster than our lattice-based scheme. For the NTRUSign lattice-based signature scheme (introduced in [17] and broken by Nguyen [30]) and the XMSS [6] hash-based signature scheme we are not aware of any implementation results for FPGAs. Hardware implementations of Multivariate Quadratic (MQ) cryptosystems [5, 3] show that these schemes are faster (factor 2-50) than ECC but also suffer from impractical key sizes for the private and public key (e.g., 80 Kb for Unbalanced Oil and Vinegar (UOV)) [32]. While implementations of the McEliece encryption scheme offer good performance [10, 35], the only implementation of a code-based signature scheme [4] is extremely slow with a runtime of 830 ms for signing.
Conclusion
In this paper we presented a provably secure lattice based digital signature scheme and its implementation on a wide scale of reconfigurable hardware. With moderate resource requirements and more than 12,000 and 14,000 signing and verification operations per second on a Virtex-6 FPGA, our prototype implementation even outperforms classical and alternative cryptosystems in terms of signature size and performance. Future work consists of optimization of the rejection-sampling steps as well as evaluation of different polynomial multiplication methods like the FFT. We also plan to investigate practicability of the signature scheme on other platforms like microcontrollers or graphic cards.
A Compression Algorithm
In this section we present our compression algorithm. For two vectors y, z, the algorithm first checks whether the coefficient y[i] of y is greater than (p − 1)/2 − k in absolute value. If it is, then there is a possibility that y[i] + z[i] will need to be reduced modulo p and in this case we do not compress z[i]. Ideally there should not be many such elements, and we can show that for the parameters used in the signature scheme, there will be at most 6 (out of n) with high probability. It's possible to set the parameters so that there are no such elements, but this decreases the efficiency and is not worth the very slight savings in the compression.
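The claim that only a handful of coefficients fall into the uncompressed case can be checked with a short binomial calculation. The sketch assumes that the coefficients of $y$ are independent and uniform over the $p$ values in $[-(p-1)/2, (p-1)/2]$, so that each one exceeds $(p-1)/2 - k$ in absolute value with probability $2k/p$; the function name and the choice $k = 2^{14}$ are illustrative.

```python
from math import comb


def prob_more_than(limit: int, n: int, k: int, p: int) -> float:
    """P[more than `limit` of n uniform coefficients exceed (p-1)/2 - k in absolute value]."""
    q = 2 * k / p                       # 2k of the p possible values are "large"
    at_most = sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(limit + 1))
    return 1.0 - at_most


tail = prob_more_than(limit=6, n=512, k=2**14, p=8383489)
```

Under these assumptions the expected number of such coefficients is $2kn/p \approx 2$, and the probability of seeing more than 6 of them is well below 2 percent.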
Assuming 
