Abstract. In recent years, code-based cryptosystems have been established as promising alternatives for asymmetric cryptography, since they base their security on well-known NP-hard problems and still show decent performance on a wide range of computing platforms. The main drawback of code-based schemes, including the popular proposals by McEliece and Niederreiter, is their large keys, whose size is inherently determined by the underlying code. In a very recent approach, Misoczki et al. proposed to use quasi-cyclic MDPC (QC-MDPC) codes that allow for a very compact key representation. In this work, we investigate novel implementations of the McEliece scheme using such QC-MDPC codes tailored for embedded devices, namely a Xilinx Virtex-6 FPGA and an 8-bit AVR microcontroller. In particular, we evaluate and improve different approaches to decode QC-MDPC codes. Besides competitive performance for encryption and decryption on the FPGA, we achieved a very compact implementation on the microcontroller using only 4,800 and 9,600 bits for the public and secret key, respectively, at 80 bits of equivalent symmetric security.
Introduction
Nearly all established asymmetric cryptosystems rely on two classes of fundamental problems, namely the factoring problem and the (elliptic curve) discrete logarithm problem. Because Shor's algorithm [37] solves both problems efficiently on quantum computers, it has become evident that a larger diversification of public key primitives is urgently required to be prepared in case quantum computers enter the scene. In this context, IBM announced two improvements in quantum computing [11] and estimated that such systems might become practical and available within the next 15 years.
The most promising alternatives are currently classified into code-based, lattice-based, multivariate-quadratic (MQ-), and hash-based cryptography. A major drawback of many proposed cryptosystems within these classes is their low efficiency and practicability due to large key sizes or complex computations compared to classical RSA and ECC cryptosystems. This is particularly considered an issue for small and embedded systems where memory and processing power are scarce resources. Nevertheless, it was shown that code-based cryptosystems such as the well-established proposals by McEliece and Niederreiter can significantly outperform classical asymmetric cryptosystems on embedded systems [13, 16, 20, 32], at the cost of very large keys (often more than 50 kByte). Therefore, current research is targeting alternative codes that allow more compact key representations but still preserve the security properties of the cryptosystem.
Background on MDPC Codes
In the following we introduce (QC-)MDPC codes, closely following the description given in [28]. (QC-)MDPC codes are a special variant of linear codes and are defined as follows:
Definition 1 (Linear codes). A binary (n, r)-linear code C of length n, dimension n − r and co-dimension r, is an (n − r)-dimensional vector subspace of $\mathbb{F}_2^n$. It is spanned by the rows of a matrix $G \in \mathbb{F}_2^{(n-r) \times n}$, called a generator matrix of C. The code C is the kernel of a matrix $H \in \mathbb{F}_2^{r \times n}$, called a parity-check matrix of C. The codeword c ∈ C of a vector $m \in \mathbb{F}_2^{(n-r)}$ is given by c = mG. Given a vector $e \in \mathbb{F}_2^n$, we obtain the syndrome $s = He^T \in \mathbb{F}_2^r$. The dual $C^\perp$ of C is the linear code spanned by the rows of any parity-check matrix of C.
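As a small illustration of Definition 1, the following sketch builds a toy (7, 3)-linear code from a systematic generator matrix G = (I | Q) and the matching parity-check matrix H = (Q^T | I). The matrix Q and all helper names are illustrative assumptions and are unrelated to the parameters used later in this work.

```python
# Toy (n, r) = (7, 3) binary linear code; Q is an arbitrary illustrative choice.
N, R = 7, 3
K = N - R  # dimension n - r

Q = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]

# Generator matrix G = (I | Q), one row per message bit.
G = [[int(i == j) for j in range(K)] + Q[i] for i in range(K)]
# Parity-check matrix H = (Q^T | I), one row per parity-check equation.
H = [[Q[i][j] for i in range(K)] + [int(j == l) for l in range(R)]
     for j in range(R)]


def encode(m):
    """Return the codeword c = mG over F_2."""
    return [sum(m[i] * G[i][col] for i in range(K)) % 2 for col in range(N)]


def syndrome(v):
    """Return the syndrome s = H v^T over F_2."""
    return [sum(H[row][col] * v[col] for col in range(N)) % 2
            for row in range(R)]


codeword = encode([1, 0, 1, 1])
print(codeword, syndrome(codeword))  # the syndrome of a codeword is all-zero
```

Since the syndrome map is linear, the syndrome of a codeword plus an error vector depends only on the error vector, which is the property the decoders in this work rely on.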
A linear code can be quasi-cyclic according to the following definition:
Definition 2 (Quasi-cyclic code). An (n, r)-linear code is quasi-cyclic (QC) if there is some integer $n_0$ such that every cyclic shift of a codeword by $n_0$ positions is again a codeword. When $n = n_0 p$, for some integer p, it is possible and convenient to have both generator and parity-check matrices composed of p × p circulant blocks. A circulant block is completely described by its first row (or column), and the algebra of p × p binary circulant matrices is isomorphic to the algebra of polynomials modulo $x^p - 1$ over $\mathbb{F}_2$.
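The isomorphism mentioned in Definition 2 can be checked on a small example. The sketch below assumes that row i of a circulant block is the first row rotated right by i positions (the source does not fix a convention), and compares the matrix product of two circulant blocks with the product of their first rows taken as polynomials modulo $x^p - 1$. All function names are hypothetical.

```python
def circulant(first_row):
    """p x p circulant block whose row i is first_row rotated right by i."""
    p = len(first_row)
    return [[first_row[(j - i) % p] for j in range(p)] for i in range(p)]


def matmul_gf2(a, b):
    """Product of two square binary matrices over F_2."""
    size = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(size)) % 2
             for j in range(size)] for i in range(size)]


def polymul_mod(a, b):
    """Product of two coefficient lists modulo x^p - 1 over F_2."""
    p = len(a)
    out = [0] * p
    for i in range(p):
        if a[i]:
            for j in range(p):
                if b[j]:
                    out[(i + j) % p] ^= 1
    return out


a = [1, 1, 0, 0, 0]  # 1 + x
b = [0, 1, 0, 1, 0]  # x + x^3
# Multiplying the blocks matches multiplying the polynomials.
print(matmul_gf2(circulant(a), circulant(b)) == circulant(polymul_mod(a, b)))
```

This is why a circulant block never has to be stored as a matrix: its first row, treated as a polynomial, carries all the information.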
On top we can define the MDPC codes:
Definition 3 (MDPC codes). A (n, r, w)-MDPC code is a linear code of length n and co-dimension r admitting a parity check matrix with constant row weight w.
When MDPC codes are quasi-cyclic, they are called (n, r, w)-QC-MDPC codes. LDPC codes typically have small constant row weights (usually less than 10). For MDPC codes, row weights scaling in $O(\sqrt{n \log(n)})$ are assumed.
McEliece Based on QC-MDPC Codes
We now present a variant of the McEliece cryptosystem based on (n, r, w)-QC-MDPC codes with $n = n_0 p$ and r = p. To obtain such a code, we first pick a word $h \in \mathbb{F}_2^n$ of length $n = n_0 p$ and weight w at random. Then, the QC-MDPC code is defined by a quasi-cyclic parity-check matrix $H \in \mathbb{F}_2^{r \times n}$ with first row h; the other r − 1 rows are obtained from the r − 1 quasi-cyclic shifts of h. The parity-check matrix then has the form $H = [H_0 | H_1 | \ldots | H_{n_0-1}]$. Each block $H_i$ has row weight $w_i$, such that $w = \sum_{i=0}^{n_0-1} w_i$ with a smooth distribution of the $w_i$'s. Finally, the generator matrix G in row reduced echelon form can be easily derived from the $H_i$ blocks. Assuming that $H_{n_0-1}$ is non-singular (this particularly implies that $w_{n_0-1}$ is odd, otherwise the rows of $H_{n_0-1}$ would sum up to 0), we compute G of the form (I|Q), where I is the identity matrix and, following [28], $Q = \begin{pmatrix} (H_{n_0-1}^{-1} \cdot H_0)^T \\ (H_{n_0-1}^{-1} \cdot H_1)^T \\ \vdots \\ (H_{n_0-1}^{-1} \cdot H_{n_0-2})^T \end{pmatrix}$.
In the following we detail the key-generation as well as encryption and decryption for McEliece based on QC-MDPC codes.
- Key-Generation: The public and private keys are generated as follows. First generate a parity-check matrix $H \in \mathbb{F}_2^{r \times n}$ of a t-error-correcting (n, r, w)-QC-MDPC code. Then generate its corresponding generator matrix $G \in \mathbb{F}_2^{(n-r) \times n}$ in row reduced echelon form. The public key is G and the private key is H. Since quasi-cyclic matrices are used, it suffices to store the first rows g and h of the circulant blocks, which significantly reduces storage requirements.
- Encryption: To encrypt a plaintext $m \in \mathbb{F}_2^{(n-r)}$ into $x \in \mathbb{F}_2^n$, first generate an error vector $e \in \mathbb{F}_2^n$ of wt(e) ≤ t at random. Then compute x ← mG + e.
- Decryption: Let $\Psi_H$ be a t-error-correcting LDPC/MDPC decoding algorithm equipped with the sparse parity-check matrix H. To decrypt $x \in \mathbb{F}_2^n$ into $m \in \mathbb{F}_2^{(n-r)}$, compute $mG \leftarrow \Psi_H(mG + e)$. Finally, extract the plaintext m from the first (n − r) positions of mG.
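The key-generation step for $n_0 = 2$ can be sketched with polynomial arithmetic modulo $x^p - 1$, since circulant blocks correspond to such polynomials. The following is a minimal sketch under assumptions: polynomials are stored as Python integers (bit i is the coefficient of $x^i$), the toy values p = 11, $h_0 = 1 + x + x^3$ and $h_1 = 1 + x^2 + x^7$ are made up for illustration, and all function names are hypothetical. It derives the polynomial of the block $H_1^{-1} H_0$; the transpose needed to obtain the stored first row g depends on an index convention that the text leaves open.

```python
def poly_mul(a, b):
    """Carry-less product of two F_2[x] polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_divmod(a, b):
    """Quotient and remainder of a / b in F_2[x]; b must be nonzero."""
    quotient = 0
    while a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        quotient ^= 1 << shift
        a ^= b << shift
    return quotient, a


def poly_mulmod(a, b, p):
    """Product modulo x^p - 1 (equal to x^p + 1 over F_2)."""
    return poly_divmod(poly_mul(a, b), (1 << p) | 1)[1]


def poly_inverse(a, p):
    """Inverse of a modulo x^p - 1 via the extended Euclidean algorithm."""
    modulus = (1 << p) | 1
    r0, r1 = modulus, a
    t0, t1 = 0, 1
    while r1:
        q, r = poly_divmod(r0, r1)
        r0, r1 = r1, r
        t0, t1 = t1, t0 ^ poly_mul(q, t1)
    if r0 != 1:
        raise ValueError("polynomial is not invertible modulo x^p - 1")
    return poly_divmod(t0, modulus)[1]


def derive_public_poly(h0, h1, p):
    """Polynomial of the circulant block H_1^{-1} * H_0."""
    return poly_mulmod(poly_inverse(h1, p), h0, p)


P = 11
H0 = 0b1011                  # 1 + x + x^3 (toy value)
H1 = (1 << 7) | (1 << 2) | 1  # 1 + x^2 + x^7 (toy value)
q = derive_public_poly(H0, H1, P)
print(poly_mulmod(H1, q, P) == H0)  # h1 * q = h0, hence H * G^T = 0
```

A polynomial of even weight is divisible by x + 1 and therefore not invertible, which matches the remark that $w_{n_0-1}$ has to be odd.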
Security of QC-MDPC
The description of McEliece based on QC-MDPC codes in Section 3.1 eliminates the scrambling matrix S and the permutation matrix P usually used in the McEliece cryptosystem. The use of a CCA2-secure conversion (e.g., [24]) allows G to be in systematic form without introducing any security flaws. Note that [28] states that a quasi-cyclic structure, by itself, does not imply a significant improvement for an adversary. All previous attacks on McEliece schemes are based on the combination of a quasi-cyclic/dyadic structure with some algebraic code information. To resist the best currently known attack of [5] and also the improvements achieved by the DOOM-attack [36], the authors of [28] suggest parameters as given in Table 1.
Decoding (QC-)MDPC Codes
For code-based cryptosystems, decoding a codeword (i.e., the syndrome) is usually the most complex task. Decoding algorithms for LDPC/MDPC codes are mainly divided into two families. The first family (e.g., [7]) offers a better error-correction capability but is computationally more complex than the second family. Especially when handling large codes, the second family, called bit-flipping algorithms [15], seems to be more appropriate. In general, they are all based on the following principle:
1. Compute the syndrome s of the received codeword x.
2. Check the number of unsatisfied parity-check equations $\#_{upc}$ associated with each codeword bit.
3. Flip each codeword bit that violates more than b equations.
This process is iterated until either the syndrome becomes zero or a predefined maximum number of iterations is reached. In the latter case a decoding error is returned. The main difference between the bit-flipping algorithms is how the threshold b is computed. In the original algorithm of Gallager [15], a new b is computed at each iteration.
In [22], b is taken as the maximum number of unsatisfied parity-check equations $\mathit{Max}_{upc}$, and the authors of the QC-MDPC scheme propose to use $b = \mathit{Max}_{upc} - \delta$, for some small δ.
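The principle above, with the threshold derived from $\mathit{Max}_{upc}$, can be sketched as follows. This is a minimal illustration on a dense parity-check matrix, not the implementation evaluated in this work: the function names, the toy 7 × 7 circulant matrix in the usage line, and the choice to flip bits whose count is at least $\mathit{Max}_{upc} - \delta$ are assumptions.

```python
def syndrome(H, x):
    """s = H x^T over F_2 for a dense binary matrix H."""
    return [sum(h & b for h, b in zip(row, x)) % 2 for row in H]


def upc_counts(H, s):
    """Unsatisfied parity-check equations touching each codeword bit."""
    return [sum(s[i] for i in range(len(H)) if H[i][j])
            for j in range(len(H[0]))]


def decode_max_upc(H, x, delta=0, max_iterations=10):
    """Bit flipping with threshold b = Max_upc - delta, recomputed per pass."""
    x = list(x)
    for _ in range(max_iterations):
        s = syndrome(H, x)
        if not any(s):
            return x                      # all parity checks satisfied
        counts = upc_counts(H, s)         # first pass: find the maximum
        b = max(counts) - delta
        # second pass: flip every bit that reaches the threshold
        x = [bit ^ (count >= b) for bit, count in zip(x, counts)]
    return x if not any(syndrome(H, x)) else None  # None = decoding failure


# Toy usage: circulant matrix with first row 1 + x + x^3 and p = 7.
first_row = [1, 1, 0, 1, 0, 0, 0]
H = [[first_row[(j - i) % 7] for j in range(7)] for i in range(7)]
received = [0, 0, 0, 0, 1, 0, 0]          # all-zero codeword, one error
print(decode_max_upc(H, received))
```

Note that the syndrome is recomputed from scratch in every pass, which is the cost that the optimized decoders discussed later avoid.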
Since estimating the error-correction capability of LDPC and MDPC codes generally is a hard task and is also influenced by the choice of threshold b, we derive different versions of the bit-flipping algorithm, evaluate their error-correcting capability and count how many iterations are required on average to decode a codeword. Because we are targeting embedded systems, we omit the variant storing $n_0$ counters for $\#_{upc}$ for each ciphertext bit. This would allow skipping the second computation of $\#_{upc}$ in some variants, but would blow up memory consumption to an unacceptable amount. We now introduce the different decoders under investigation:
Decoder A is given in [28] and computes the syndrome, then checks the number of unsatisfied parity-check equations once to compute the maximum $\mathit{Max}_{upc}$ and afterwards a second time to flip all codeword bits that violate at least $\mathit{Max}_{upc} - \delta$ equations. Afterwards the syndrome is recomputed and compared to zero.
Decoder B is given in [15] and computes the syndrome, then checks the number of unsatisfied parity-check equations once per iteration i and directly flips the current codeword bit if $\#_{upc}$ is larger than a precomputed threshold $b_i$. Afterwards the syndrome is recomputed and compared to zero.
We noticed that the previously proposed bit-flipping decoders recompute the syndrome after every iteration. Since this is quite costly, we propose an optimization based on the following observation: If the number of unsatisfied parity-check equations exceeds threshold b, the corresponding bit in the codeword is flipped and the syndrome changes. We would like to stress that the syndrome does not change arbitrarily, but the new syndrome is equal to the old syndrome accumulated with the row $h_j$ of the parity-check matrix that corresponds to the flipped codeword bit j. By keeping track of which codeword bits are flipped and updating the syndrome accordingly, the syndrome recomputation can be omitted. Hence, we propose and evaluate the following decoders:
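The observation can be verified on a small example. The sketch below uses a generic dense matrix and the convention $s = Hx^T$, under which the vector added to the syndrome when bit j is flipped is column j of H. The matrix, the vectors and the function names are illustrative assumptions.

```python
def syndrome(H, x):
    """Full syndrome computation s = H x^T over F_2."""
    return [sum(h & b for h, b in zip(row, x)) % 2 for row in H]


def flip_and_update(H, x, s, j):
    """Flip bit j of x and patch the syndrome s in place."""
    x[j] ^= 1
    for i, row in enumerate(H):
        s[i] ^= row[j]  # add the part of H that belongs to bit j


# Arbitrary toy parity-check matrix and received word.
H = [
    [1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 1],
]
x = [1, 0, 1, 1, 0, 1]
s = syndrome(H, x)

flip_and_update(H, x, s, 4)
print(s == syndrome(H, x))  # the patched syndrome equals a full recomputation
```

The update touches only as many syndrome bits as the flipped position has parity checks, whereas a full recomputation has to process the whole codeword.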
Decoder $C_1$ computes the syndrome, then checks the number of unsatisfied parity-check equations once to compute the maximum $\mathit{Max}_{upc}$ and afterwards a second time to flip all codeword bits that violate at least $\mathit{Max}_{upc} - \delta$ equations. If a codeword bit j is flipped, the corresponding row $h_j$ of the parity-check matrix is added to a temporary syndrome. At the end of each iteration the temporary syndrome is added to the syndrome, directly resulting in the syndrome of the new codeword without requiring a full recomputation.
Decoder $C_2$ computes the syndrome, then checks the number of unsatisfied parity-check equations once to compute the maximum $\mathit{Max}_{upc}$ and afterwards a second time to flip all codeword bits that violate at least $\mathit{Max}_{upc} - \delta$ equations. If a codeword bit j is flipped, the corresponding row $h_j$ of the parity-check matrix is added directly to the current syndrome. Using this method we always work with an up-to-date syndrome and not with the one from the last iteration.
Decoder D is similar to Decoder B with precomputed thresholds $b_i$, but uses the direct update of the syndrome as done in Decoder $C_2$.
Decoder E is similar to Decoder $C_2$ but compares the syndrome to zero after each flipped bit and aborts the current bit-flipping iteration immediately if the syndrome becomes zero.
Decoder F is similar to Decoder D and in addition uses the same early exit trick as Decoder E.
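To make the difference between decoders D and F concrete, the following sketch implements bit flipping with precomputed thresholds and direct syndrome update on a quasi-cyclic parity-check matrix given by the supports of its generating polynomials. It is a minimal sketch under assumptions: the function names, the index convention for circulant rows, the `early_exit` flag that switches from D-like to F-like behavior, and the toy code (p = 11, supports {0, 1, 3} and {0, 2, 7}) are all illustrative and not taken from the evaluated implementations.

```python
def column_support(j, supports, p):
    """Row indices of the ones in column j of H = [H_0 | H_1 | ...]."""
    block, pos = divmod(j, p)
    return [(pos - s) % p for s in supports[block]]


def syndrome(x, supports, p):
    """Syndrome as the XOR of the columns selected by the set bits of x."""
    s = [0] * p
    for j, bit in enumerate(x):
        if bit:
            for i in column_support(j, supports, p):
                s[i] ^= 1
    return s


def decode(x, supports, p, thresholds, early_exit=False):
    """One syndrome computation, then in-place updates (decoder D or F)."""
    x = list(x)
    s = syndrome(x, supports, p)
    if not any(s):
        return x
    for b in thresholds:                 # one precomputed b_i per iteration
        for j in range(len(x)):
            col = column_support(j, supports, p)
            if sum(s[i] for i in col) > b:
                x[j] ^= 1
                for i in col:            # direct syndrome update
                    s[i] ^= 1
                if early_exit and not any(s):
                    return x             # decoder F: stop immediately
        if not any(s):
            return x                     # decoder D: test once per iteration
    return None                          # decoding failure


P = 11
SUPPORTS = [[0, 1, 3], [0, 2, 7]]        # toy generating polynomials
received = [0] * (2 * P)
received[15] = 1                         # all-zero codeword with one error
print(decode(received, SUPPORTS, P, thresholds=[2], early_exit=True))
```

With `early_exit=False` the loop over all codeword bits always runs to completion, which mirrors the behavior that keeps rotating registers aligned in a hardware design.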
The average number of iterations required to decode a codeword and the decoding failure rate for the different decoders with different numbers of errors are shown in Table 6 in the appendix for a QC-MDPC code with parameters $n_0 = 2$, n = 9600, r = 4800, w = 90 (cf. first row of Table 1). All measurements are taken for 1000 random codes and 100,000 random decoding tries per code on an Intel Xeon E5345 CPU running at 2.33 GHz. For versions with precomputed thresholds $b_i$ we used the formula given in Appendix A of [28] to precompute the most suitable $b_i$'s for every iteration. For versions using $b = \mathit{Max}_{upc} - \delta$, we found by exhaustive experiments that the smallest number of iterations is required for δ = 5. A decoding failure is returned when the decoder does not succeed within ten iterations.
The timings given in Table 6 should only be used to compare the decoders among each other. The evaluation was done in software and is not optimized for speed. It is designed to keep only the generating polynomial h in memory rather than the whole parity-check matrix H; storing H in full would allow for a time/memory trade-off and faster computations. The row required at each step is derived at runtime by rotating the polynomial.
Our evaluations clearly show the superior error-correcting capability of decoders D and F, which in addition require the lowest number of iterations when compared to the other decoders (cf. Table 6). Decoders A and $C_1$ are least efficient with an average of more than 5 bit-flipping iterations. Our new decoders D and F on average save 2.9 iterations compared to decoder A and 0.7 iterations compared to B. This directly relates to the required time for decoding, which is up to 4 times faster.
The small timing advantage of decoder F over D is due to the immediate termination if the syndrome becomes zero. Another interesting observation we made for all decoders is that if a codeword is decodable, then this is achieved after a small number of iterations. We noticed that if a codeword is not decoded within 4-6 iterations, a higher number of iterations does not lead to a successful decoding. Therefore, an early detection of a decoding failure is possible.
Implementation
In this section we discuss decoder and parameter selections and explain the design choices for our QC-MDPC McEliece implementations on reconfigurable hardware and microcontrollers. The primary goal for the hardware design is high performance, while the microcontroller implementation aims for a low memory footprint. Note that the implementations of a CCA2-secure conversion and true random number generation are out of the scope of this work.
Decoder and Parameter Selection
Our implementations aim for a security level of 80 bits, comparable to ECC-160 and RSA-1024. Hence, we select the following QC-MDPC code parameters that provide an 80-bit security level according to Table 1: $n_0 = 2$, n = 9600, r = 4800, w = 90, t = 84.
Using these parameters we have a 4800-bit public key and a 9600-bit sparse secret key with 90 set bits. Such key sizes are only a fraction of the key sizes of other code-based public-key encryption schemes. During encryption a 4800-bit plaintext is encoded into a 9600-bit codeword and 84 errors are added to it. It follows from $n_0 = 2$ that the 9600-bit codeword and secret key consist of two separate 4800-bit codewords/secret keys, respectively.
As shown in Section 3, our decoders D and F require only one syndrome computation in the beginning and update the syndrome directly in the bit-flipping step. Furthermore, due to the precomputed thresholds $b_i$ the computation of the maximum number of unsatisfied parity-check equations can be omitted. The decoders only differ in the way they check whether the syndrome is zero. While decoder F checks the syndrome every time it is changed in the bit-flipping step, decoder D tests the syndrome at the end of each bit-flipping iteration. Note that the decoding behavior of both decoders is the same, i.e., they require the same number of bit-flipping iterations, with the difference that decoder F exits as soon as the syndrome is equal to zero.
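The key sizes follow directly from the selected parameters, as the short calculation below shows. The constant and function names are illustrative; the conversion assumes 1 kByte = 1024 bytes, which is an assumption chosen because it reproduces the 0.59 kByte figure quoted in the comparison later in this work.

```python
N0, P, W = 2, 4800, 90          # blocks, block length, total row weight
N, R = N0 * P, P                # code length n and co-dimension r

public_key_bits = (N0 - 1) * P  # first row of the non-identity part of G
secret_key_bits = N0 * P        # first row h of H in dense form
plaintext_bits = N - R          # message length per encryption


def kbyte(bits):
    """Convert bits to kByte, assuming 1 kByte = 1024 bytes."""
    return bits / 8 / 1024


print(public_key_bits, secret_key_bits, round(kbyte(public_key_bits), 2))
```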
We base our QC-MDPC McEliece decryption implementation on decoder D in hardware and on decoder F for the microcontroller. The reason for choosing decoder D to be implemented in hardware is that we sequentially rotate the codewords and secret keys in every cycle of the bit-flipping iterations. If the syndrome becomes zero during a bit-flipping iteration and we skip further computations immediately, the secret polynomials and the codewords would be misaligned. To fix this we would have to rotate them manually into their correct position which would take roughly the same amount of time as just letting the decoder finish the current iteration.
Both implementations use a maximum of five iterations before returning a decoding error, and the corresponding precomputed $b_i$ are (28, 26, 24, 22, 20), which are computed using the formula in the appendix of [28].
FPGA Implementation
For our evaluation of QC-MDPC in reconfigurable hardware we use Xilinx's Virtex-6 FPGA device family as target platform. Virtex-6 devices are powerful FPGAs offering thousands of slices, where each slice contains four 6-input lookup tables (LUT), eight flip-flops (FF), and surrounding logic. In addition, embedded resources such as block memories (BRAM) and digital signal processors (DSP) are available. In the following we explain our design choices and describe the implementations of the QC-MDPC-based McEliece encryption and decryption.
Design Considerations Because of their relatively small size, the public and secret key do not have to be stored in external memory as it was necessary in earlier FPGA implementations of McEliece and Niederreiter using, e.g., Goppa codes. Since we aim for high-speed, we store all operands directly in FPGA logic and refrain from loading/storing them from/to internal block memories or other external memory as this would affect performance. Reading a single 4800-bit vector from a 32-bit BRAM interface would consume 150 clock cycles. However, if maximum performance is not required, the use of BRAMs could certainly reduce resource consumption significantly.
In contrast to the microcontroller implementation we do not exploit the sparsity of the secret polynomials in our FPGA design. Using a sparse representation of the secret polynomials would require implementing w = 90 counters with 13 bits, each indicating the position of a set bit in one of the two secret polynomials. To generate the next row of the secret key, all counters have to be increased, and if they exceed 4799 they have to be reset to 0. If a bit in the codewords $x_0$ or $x_1$ is set, we have to build a 4800-bit vector from the counters belonging to the corresponding secret polynomial and XOR this vector to the current syndrome. The alternative is to read out the content of each counter belonging to the corresponding secret polynomial and flip the corresponding bit in the syndrome. These tasks, however, are time and/or resource consuming in hardware.
Implementation We use a Virtex-6 XC6VLX240T FPGA as target device for a fair comparison with previous work, although all our implementations would fit smaller devices as well.
The encryption and decryption units are equipped with a simple I/O interface. Messages and codewords are sent and received bit by bit to keep the I/O overhead of our implementation small and thus get as close as possible to the actual resource consumption of the en-/decoder.
QC-MDPC Encryption:
In order to implement a QC-MDPC encoder we need a vector matrix multiplication to multiply message m with the public key matrix G to retrieve a codeword c = mG and then add an error vector with hw(e) ≤ 84 to get the ciphertext x = c + e. We are given a 4800-bit public key g which is the first row of matrix G. Rotating g by one bit position yields the next row of G and so forth. Since G is of systematic form the first half of c is equal to m. The second half, called redundant part, is computed as follows.
We iterate over the message bit by bit and XOR the current public polynomial to the redundant part if the current message bit is set. To implement this in hardware we need three 4800-bit registers to hold the public polynomial, the message, and the redundant part. Since only one bit of the message has to be accessed in every clock cycle, we store the message in a circulant shift register which can be implemented using shift register LUTs.
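The encoding loop described above can be modeled in software as follows. This is a behavioral sketch, not the hardware design: registers are modeled as Python lists, the rotation direction of the public polynomial and the function names are assumptions, and the error positions are passed in explicitly because random number generation is out of scope.

```python
def encode(message, g):
    """Systematic QC encoding: codeword = message || redundant part."""
    p = len(g)
    redundant = [0] * p
    row = list(g)                        # current row of the public block
    for bit in message:                  # one message bit per "clock cycle"
        if bit:
            redundant = [a ^ b for a, b in zip(redundant, row)]
        row = [row[-1]] + row[:-1]       # rotate by one bit position
    return list(message) + redundant


def encrypt(message, g, error_positions):
    """Ciphertext x = c + e for an error vector given by its positions."""
    x = encode(message, g)
    for pos in error_positions:
        x[pos] ^= 1
    return x


g = [1, 0, 1, 1, 0]                      # toy public polynomial, p = 5
print(encrypt([1, 1, 0, 0, 0], g, error_positions=[7]))
```

For the parameters of this work the loop runs over 4800 message bits, which matches the 4800 clock cycles per encryption reported for the FPGA encoder.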
QC-MDPC Decryption:
Decryption is performed by decoding the received ciphertext; the first half of the decoded codeword is the plaintext. As QC-MDPC decoder we implement the bit-flipping decoder D as described in Section 3.3. In the first step we need to compute the syndrome $s = Hx^T$ by multiplying the parity-check matrix $H = [H_0 | H_1]$ with the ciphertext x. Given the first 9600-bit row $h = [h_0 | h_1]$ of H and the 9600-bit codeword $x = [x_0 | x_1]$, we compute the syndrome as follows. We sequentially iterate over every bit of the codewords $x_0$ and $x_1$ in parallel and rotate h by rotating $h_0$ and $h_1$ accordingly. If a bit in $x_0$ and/or $x_1$ is set, we XOR the current $h_0$ and/or $h_1$ to the intermediate syndrome, which is set to zero in the beginning. The syndrome computation is finished after every bit of the ciphertext has been processed.
Next we need to check if the syndrome is zero. We implement this as a logical OR tree. Since the FPGA offers 6-input LUTs, we split the syndrome into 6-bit chunks and compute their logical OR on the lowest level of the tree. The results are fed into the next level of 6-bit LUTs which again compute the logical OR of the inputs. This is repeated until we are left with a single bit that indicates if the syndrome is zero or not. In addition, we add registers after the second layer of the tree to minimize the critical path.
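The depth of such an OR tree follows from repeatedly dividing the input width by the LUT fan-in. The sketch below models the reduction in software; the function name is hypothetical, and the model ignores the pipeline registers mentioned above.

```python
def or_tree(bits, fan_in=6):
    """Reduce bits with layers of fan_in-input ORs; return (result, layers)."""
    level = list(bits)
    layers = 0
    while len(level) > 1:
        # Each LUT computes the OR of up to fan_in signals of the layer below.
        level = [int(any(level[i:i + fan_in]))
                 for i in range(0, len(level), fan_in)]
        layers += 1
    return level[0], layers


syndrome = [0] * 4800
syndrome[4799] = 1
print(or_tree(syndrome))  # result bit is 1 when the syndrome is nonzero
```

For a 4800-bit syndrome the layer widths are 800, 134, 23, 4 and 1, so five layers of 6-input LUTs suffice under this model.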
If the syndrome is zero, the decryption is finished. Otherwise we have to compute the number of unsatisfied parity-check equations for each row $h = [h_0 | h_1]$. We therefore compute the Hamming weight of the logical AND of the syndrome and $h_0$ and $h_1$, respectively. If the Hamming weight exceeds the threshold $b_i$ for the current iteration i, the corresponding bit in the codeword $x_0$ and/or $x_1$ is flipped and the syndrome is directly updated by XORing the current secret polynomial $h_0$ and/or $h_1$ to it. Then $h_0$ and $h_1$ are rotated by one bit and the process is repeated until all rows of H have been checked.
Since the computation of the number of unsatisfied parity-check equations for $h_0$ and $h_1$ can be performed independently, we have two options for implementation. Either we compute the parity-check violations of the first and second secret polynomial iteratively, or we instantiate two Hamming weight computation units and process the polynomials in parallel. The iterative version takes twice the time but uses fewer resources. We explore both versions to evaluate this time/resource trade-off.
Computing the Hamming weight of a 4800-bit vector efficiently is a challenge of its own. Similar to the zero comparator, we split the input into 6-bit chunks and determine their Hamming weight. We then compute the overall Hamming weight by building an adder tree with registers on every layer to minimize the critical path. After all rows of H have been processed, the syndrome is again compared to zero. If the syndrome is zero, the first 4800 bits of the updated codeword (i.e., $x_0$) are equal to the decoded message m and are returned. Otherwise the bit-flipping is repeated with the next $b_i$ until either the syndrome becomes zero or the maximum number of iterations is exceeded.
Microcontroller Implementation
As implementation platform we choose an ATxmega256A3 microcontroller for straightforward comparison with previous work. The microcontroller provides 16 kByte SRAM and 256 kByte program memory and can be clocked at up to 32 MHz. The main parts are written in C, and we pay careful attention to implementing timing-critical routines such as the polynomial rotation and addition using inline assembly.
The encoding operation is straightforward. Since G is of systematic form, the first r ciphertext bits are the message itself and are simply copied. For the multiplication with the redundant part Q, the message bits are parsed and the corresponding rows of G are summed up. Afterwards the current row is rotated by one bit position to generate the next row. We implemented two different versions of the encoder, which differ in the way the public polynomial rotation is implemented. In one version we use a loop to rotate the bytes of the public polynomial, and in the other version we unroll this process.
Usually, smartcard devices communicate over a very slow interface, e.g., 106 kByte/s [40]. In contrast to cryptosystems such as RSA and ECC, we do not need the message as a whole to start with the encryption. Therefore, an interesting option is to directly encode a byte of the message as soon as it arrives while the next message byte is still in transfer. To some extent, this allows hiding the computation time within the latency required to transfer the message.
For decoding, recall that the $n_0 = 2$ involved secret polynomials are sparse and only 45 out of 4800 bits are set in each. Instead of saving 4800 coefficients in 4800/8 = 600 bytes, it is sufficient to save the indices of the $w_i = 45$ bits that are set. Each secret polynomial therefore requires only $\lceil \log_2(4800)/8 \rceil \cdot 45 = 2 \cdot 45 = 90$ bytes. Additionally, rotating a polynomial by one bit position means incrementing the 45 indices by one and handling the overflow from $x^{4800}$ to $x^0$. We developed a vector-(sparse-matrix) multiplication, which adds a sparse row to the syndrome by flipping the 45 indexed bits in the 4800-bit syndrome. The update of the syndrome can also be handled this way when a ciphertext bit is flipped. In order to keep the memory consumption low while still achieving good performance we use decoder F, as described in Section 3. Since we store the bit positions in counters, an early exit of the decoding phase can be implemented, unlike in our hardware implementation. The complete secret key therefore requires only 2 · (2 · 45) bytes for the secret polynomials and additionally ten bytes for the precomputed thresholds $b_i$.
Note that the precomputed thresholds $b_i$ can be treated as a public system parameter. In contrast to the encoding process, every ciphertext byte is accessed multiple times during decoding, so that the "process-while-transfer" method described above is not applicable. Also note that during decoding no additional memory is required to store the plaintext, as the first half of the ciphertext is equal to the plaintext after successful decoding.
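The sparse representation can be sketched as follows. The actual implementation is written in C with inline assembly; this Python model only illustrates the data layout, and the function names as well as the bit order inside a syndrome byte are assumptions.

```python
import math

P = 4800                                         # bits per secret polynomial
WEIGHT = 45                                      # set bits per polynomial
BYTES_PER_INDEX = math.ceil(math.log2(P) / 8)    # 13-bit index in 2 bytes


def rotate_sparse(indices, p=P):
    """Rotate by one position: increment each index, wrap x^p to x^0."""
    return [(i + 1) % p for i in indices]


def add_sparse(syndrome, indices):
    """XOR a sparse polynomial into a packed syndrome by flipping bits."""
    for i in indices:
        syndrome[i >> 3] ^= 1 << (i & 7)         # byte i // 8, bit i % 8


syndrome = bytearray(P // 8)                     # 600-byte dense syndrome
poly = [0, 9, 4799]                              # toy polynomial, weight 3
add_sparse(syndrome, poly)
print(BYTES_PER_INDEX * WEIGHT, rotate_sparse(poly))
```

Adding a sparse polynomial touches only 45 syndrome bytes instead of all 600, which is where the cycle savings of the sparse representation come from.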
Results
In the following we present our QC-MDPC implementation results in reconfigurable hardware and in software on an 8-bit microcontroller. Afterwards we give an overview of existing public key encryption implementations for similar platforms and compare them to our results.
FPGA Results
All our results are obtained post place-and-route (PAR) for a Xilinx Virtex-6 XC6VLX240T FPGA using Xilinx ISE 14.5. For the throughput figures we assume a fast enough I/O interface is provided.
In hardware, our QC-MDPC encoder runs at 351.3 MHz and encodes a 4800-bit message in 4800 clock cycles which results in 351.3 Mbit/s. The iterative version of our QC-MDPC decoder runs at 222.5 MHz. Since the decoder does not run in constant time, we calculate the average required cycles for iterative decoding as follows. Computing the syndrome for the first time needs 4800 clock cycles and comparing the syndrome to zero takes another 2 clock cycles. For every following bit-flipping iteration we need 9620 plus again 2 clock cycles for checking the syndrome. As shown in Table 6 , decoder D needs 2.4002 bit-flipping iterations on average. Thus, the average cycle count for our iterative decoder is 4800 + 2 + 2.4002 · (9620 + 2) = 27896.7 clock cycles.
Our non-iterative decoder processes both secret polynomials in the bit-flipping step in parallel and runs at 190.6 MHz. We calculate the average cycles as before with the difference that every bit-flipping iteration now takes 4810 + 2 clock cycles. Thus, the average cycle count for our non-iterative decoder is 4800 + 2 + 2.4002 · (4810 + 2) = 16351.8 clock cycles.
The non-iterative decoder operates 46% faster than the iterative version while occupying 40-65% more resources. Compared to the decoders, the encoder runs 6-9 times faster and occupies 2-6 times fewer resources. Table 2 summarizes our results.
Using the formerly proposed decoders that work without our syndrome computation optimizations (i.e., decoders A and B) would result in much slower decryptions. Decoder A would need 4802 + 5.2964 · (2 · 9620 + 4802) = 132138.0 cycles in an iterative and 4802 + 5.2964 · (2 · 4810 + 4802) = 81186.7 cycles in a non-iterative implementation. Decoder B saves cycles by skipping the $\mathit{Max}_{upc}$ computation but would still need 4802 + 3.1425 · (9620 + 4802) = 50123.1 cycles in an iterative and 4802 + 3.1425 · (4810 + 4802) = 35007.7 cycles in a non-iterative implementation.
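The cycle counts above can be reproduced with a few lines of arithmetic. The sketch below uses only the cycle figures, average iteration counts and clock frequencies stated in the text; the function and constant names are illustrative.

```python
SYNDROME_CYCLES = 4800   # initial syndrome computation
ZERO_CHECK_CYCLES = 2    # comparing the syndrome to zero


def cycles_direct_update(avg_iterations, flip_cycles):
    """Decoders that update the syndrome in place (e.g. decoder D)."""
    start = SYNDROME_CYCLES + ZERO_CHECK_CYCLES
    return start + avg_iterations * (flip_cycles + ZERO_CHECK_CYCLES)


def cycles_recompute(avg_iterations, flip_cycles, passes):
    """Decoders that recompute the syndrome after every iteration (A, B)."""
    start = SYNDROME_CYCLES + ZERO_CHECK_CYCLES
    return start + avg_iterations * (passes * flip_cycles + start)


def microseconds(cycles, mhz):
    """Latency in microseconds for a given clock frequency in MHz."""
    return cycles / mhz


d_iterative = cycles_direct_update(2.4002, 9620)
d_parallel = cycles_direct_update(2.4002, 4810)
a_iterative = cycles_recompute(5.2964, 9620, passes=2)
b_iterative = cycles_recompute(3.1425, 9620, passes=1)

speedup = microseconds(d_iterative, 222.5) / microseconds(d_parallel, 190.6)
print(round(d_iterative, 1), round(d_parallel, 1), round(speedup, 2))
```

The ratio of the two latencies is about 1.46, consistent with the 46% speed advantage stated for the non-iterative decoder despite its lower clock frequency.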
Comparison A comparison with previously published FPGA implementations of code-based (McEliece, Niederreiter), lattice-based (Ring-LWE, NTRU), and standard public key encryption schemes (RSA, ECC) is given in Table 3. The most relevant metric for comparing the performance of public key encryption schemes often depends on the application. For key exchange it is the required time per operation, given that the symmetric key size is smaller than or equal to the bit size that can be transmitted in one operation. For data encryption (i.e., much more than one block), throughput in Mbit/s is typically the most interesting metric.
Table 2. Implementation results of our QC-MDPC implementations with parameters n0 = 2, n = 9600, r = 4800, w = 90, t = 84 on a Xilinx Virtex-6 XC6VLX240T FPGA.
A hardware McEliece implementation based on Goppa codes including CCA2 conversion was presented for a Virtex5-LX110T FPGA in [38, 39]. Comparing their performance to our implementations shows the advantage of QC-MDPC McEliece in both time per operation and Mbit/s. The occupied resources are similar to our resource requirements, but in addition 75 block memories are required for storage. Even more important for real-world applications is the public key size. QC-MDPC McEliece requires 0.59 kByte, which is only a fraction of the 100.5 kByte public key of [38].
A McEliece co-processor was recently proposed for a Virtex5-LX110T FPGA [16] . Their design goal was to optimize the speed/area ratio while we aim for high performance. With respect to decoding performance, our implementations outperform their work in both time/operation and Mbit/s. But the co-processor needs much less resources and can also be implemented on low-cost devices such as Spartan-3 FPGAs. The public keys in this work have a size of 63.5 kByte which is still much larger than the 0.59 kByte of QC-MDPC McEliece.
The Niederreiter public key scheme was implemented in [21] for a Virtex6-LX240T FPGA. The work shows that Niederreiter encryption can provide high performance with a moderate amount of resources. Decryption is more expensive both in computation time and in required resources. The Niederreiter encryption is the superior choice for a minimum time per operation, but concerning raw throughput QC-MDPC achieves better results. Furthermore, the public key with 63.5 kByte of the Niederreiter encryption using binary Goppa codes might be too large for real-world applications.
FPGA implementations of lattice-based public key encryption were proposed in [17] for Ring-LWE and in [23] for NTRU. The Ring-LWE implementation requires a huge amount of resources (in particular, exceeding the resources provided by their Virtex6-LX240T FPGA). On the other hand, NTRU as implemented in [23] shows that lattice-based cryptography can provide high performance at moderate resource requirements. Note further that the results are reported for an outdated Virtex-E FPGA, which is hardly comparable to modern Virtex-5/-6 devices.
Efficient ECC hardware implementations for curves over GF(p) and GF($2^m$) are presented in [12, 18, 34, 35], all of which yield good performance at moderate resource requirements. The most efficient RSA hardware implementation to date was proposed in [42, 41]. Both the time to encrypt and decrypt one block and the throughput are considerably worse than for QC-MDPC McEliece.
Microcontroller Results
Our QC-MDPC encryption requires 606 byte SRAM and 3,705 byte flash memory for the iterative design and 606 byte SRAM and 5,496 byte flash memory in the unrolled version. Both versions already include the public key. The decryption unit requires 198 byte SRAM and 2,218 byte flash memory including the secret key, which is copied to SRAM at start-up for faster access. The encoder requires 26,767,463 cycles on average or 0.8 seconds at 32 MHz. Most cycles are consumed when adding a row of G to the ciphertext (∼ 6000 cycles each) and when rotating a row to generate the next one (∼ 2400 cycles). The decoder requires 86,874,388 cycles on average or 2.7 seconds at 32 MHz. Rotating a polynomial in sparse representation takes 720 cycles and adding a sparse polynomial to the syndrome requires 2,285 cycles which clearly shows the advantage of a sparse representation. Nevertheless, computing a syndrome using the vector-(sparse-matrix)-multiplication on average requires 10,379,351 cycles. Because syndrome, ciphertext and the current row of H (even in sparse form) are too large to be held in registers, they have to be stored in SRAM and are continuously loaded and stored.
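The runtimes quoted above follow from the cycle counts and the 32 MHz clock, as the short check below shows. It uses only figures stated in this section; the names are illustrative, and the share of the initial syndrome computation is a derived value rather than a reported measurement.

```python
CLOCK_HZ = 32_000_000
ENCODE_CYCLES = 26_767_463
DECODE_CYCLES = 86_874_388
SYNDROME_CYCLES = 10_379_351


def seconds(cycles, clock_hz=CLOCK_HZ):
    """Runtime in seconds at the given clock frequency."""
    return cycles / clock_hz


# Fraction of the average decoding time spent on the initial syndrome.
syndrome_share = SYNDROME_CYCLES / DECODE_CYCLES

print(round(seconds(ENCODE_CYCLES), 1), round(seconds(DECODE_CYCLES), 1))
```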
Comparison Table 4 compares our results with other implementations of McEliece and with implementations of the classical cryptosystems RSA and ECC on a similar microcontroller. For the code-based schemes, the flash memory usage includes the public and secret key, respectively. For RSA and ECC, [19] does not clearly state if the key size is included.
The main advantage of our implementations compared to other code-based schemes is the small memory footprint. Especially our decoder requires much less memory than other McEliece decoders because we only need to store the bit positions of the sparse secret polynomials instead of the full secret key.
We use the cycles/byte metric to compare our results to other implementations that handle different plaintext/ciphertext sizes. Our iterative encoder outperforms the encoders of [10] and [13]. Our unrolled version is nearly as fast as [20] with only half the amount of flash memory and one sixth of the SRAM. Only the quasi-dyadic McEliece implementation of [20] outperforms our implementation, but it requires much more SRAM and flash memory.
Conclusions
In this work we presented implementations for the McEliece cryptosystem over QC-MDPC codes for Xilinx Virtex-6 FPGAs and AVR microcontrollers. Our implementations were primarily designed for high throughput and low memory consumption. Since decoding is generally the most expensive operation in code-based cryptography, we analyzed existing decoders and proposed several optimized decoders. We evaluated all decoders and selected the most suitable ones for the corresponding platforms. In addition, we showed that it is indeed possible to realize alternative public-key cryptosystems with moderate key size requirements and high performance or low memory on embedded systems. By demonstrating the excellent properties of this novel construction for embedded applications, we hope to have provided another incentive for further cryptanalytical investigation of QC-MDPC codes in the context of code-based cryptography.
