Abstract. This paper presents extremely fast algorithms for code-based public-key cryptography, including full protection against timing attacks. For example, at a 2 128 security level, this paper achieves a reciprocal decryption throughput of just 60493 cycles (plus cipher cost etc.) on a single Ivy Bridge core. These algorithms rely on an additive FFT for fast root computation, a transposed additive FFT for fast syndrome computation, and a sorting network to avoid cache-timing attacks.
Introduction
This paper presents new software speed records for public-key cryptography: for example, more than 400000 decryptions per second at a 2 80 security level, or 200000 per second at a 2 128 security level, on a $215 4-core 3.4GHz Intel Core i5-3570 CPU. These speeds are fully protected against simple timing attacks, cache-timing attacks, branch-prediction attacks, etc.: all load addresses, all store addresses, and all branch conditions are public.
The public-key cryptosystem used here is a code-based cryptosystem with a long history, a well-established security track record, and even post-quantum security: namely, Niederreiter's dual form [59] of McEliece's hidden-Goppa-code cryptosystem [56] . This cryptosystem is well known to provide extremely fast encryption and reasonably fast decryption. Our main contributions are new decryption techniques that are (1) much faster and (2) fully protected against timing attacks, including the attacks by Strenzke in [74] , [75] , and [76] .
The main disadvantage of this cryptosystem is that public keys are quite large: for example, 64 kilobytes for the 2 80 security level mentioned above. In some applications the benefits of very fast encryption and decryption are outweighed by the costs of communicating and storing these keys. We comment that our work allows a tradeoff between key size and decryption time: because decryption is so fast we can afford "combinatorial list decoding", using many trial decryptions to guess a few error positions, which allows the message sender to add a few extra error positions (as proposed by Bernstein, Lange, and Peters in [17] ), which increases security for the same key size, which allows smaller keys for the same security level.
We also present new speed records for generating signatures in the CFS codebased public-key signature system. Our speeds are an order of magnitude faster than previous work. This system has a much larger public key but is of interest for its short signatures and fast verification.
We will put all software described in this paper into the public domain.
To bitslice, or not to bitslice. A basic fact in both hardware design and computational complexity theory is that every function from b bits to c bits can be expressed as a fully unrolled combinatorial circuit consisting solely of two-input NAND gates. Converting these NAND gates into "logical" CPU instructions produces fully timing-attack-protected software that computes the same function. One can save a small constant factor by paying attention to the actual selection of logical instructions on each CPU: for example, some CPUs offer single-cycle instructions as complex as a MUX gate. Even better, the same CPU instructions naturally operate on w-bit words in parallel, so in the same time they carry out w separate bitsliced copies of the same computation; common values of w on current CPUs include 8, 16, 32, 64, 128 , and 256. For a server bottlenecked by cryptographic computations there is no trouble finding 256 separate computations to carry out in parallel. However, this approach makes only very limited use of the instruction set of a typical CPU, and it is completely unclear that this approach can produce competitive speeds for typical cryptographic computations. Almost all cryptographic software takes advantage of more complicated CPU instructions: for example, the fastest previous software [23] for McEliece/Niederreiter decryption uses inputdependent table lookups for fast field arithmetic, uses input-dependent branches for fast root-finding, etc.
There are a few success stories for bitsliced cryptographic computations, but those stories provide little reason to believe that bitslicing is a good idea for code-based cryptography:
-Biham in [21] achieved a 2× speedup for DES on an Alpha CPU using bitslicing with w = 64. Non-bitsliced implementations of DES used only 64-entry table lookups and were slowed down by frequent bit-level rearrangements.
For comparison, code-based cryptography naturally uses much larger tables without such obvious slowdowns. -Bernstein in [11] , improving upon previous work by Aoki, Hoshino, and Kobayashi in [3] , set speed records for binary-field ECC on a Core 2 CPU using bitslicing with w = 128. This paper took advantage of fast multiplication techniques for a large binary field, specifically F 2 251 . For comparison, code-based cryptography uses medium-size fields (such as F 2 11 ) with relatively little possibility of savings from fast multiplication. -Käsper and Schwabe in [46] , improving upon previous work by Matsui and Nakajima in [55] , achieved a 1.4× speedup for AES on a Core 2 CPU using bitslicing with w = 128. Non-bitsliced implementations of AES use only 256-entry table lookups. -Bos, Kleinjung, Niederhagen, and Schwabe in [26] optimized both bitsliced and non-bitsliced implementations of an ECC attack on a Cell CPU, and found that bitslicing was 1.5× faster. Like [11] , this computation took advantage of fast multiplication techniques for a large binary field, specifically F 2 131 . -Matsuda and Moriai in [54] reported fast bitsliced implementations of the PRESENT and Piccolo ciphers on recent Intel CPUs. Non-bitsliced implementations of these ciphers use only 16-entry table lookups.
To summarize, all of these examples of bitsliced speed records are for small Sboxes or large binary fields, while code-based cryptography relies on medium-size fields and seems to make much more efficient use of table lookups. Despite this background we use bitslicing for the critical decoding step inside McEliece/Niederreiter decryption. Our central observation is that this decoding step is bottlenecked not by separate operations in a medium-size finite field, but by larger-scale polynomial operations over that finite field; state-of-the-art approaches to those polynomial operations turn out to interact very well with bitslicing. Our decoding algorithms end up using a surprisingly small number of bit operations, and as a result a surprisingly small number of cycles, setting new speed records for code-based cryptography, in some cases an order of magnitude faster than previous work.
The most important steps in our decoding algorithm are an "additive FFT" for fast root computation (Section 3) and a transposed additive FFT for fast syndrome computation (Section 4). It is reasonable to predict that the additive FFT will also reduce the energy consumed by hardware implementations of codebased cryptography. We also use a sorting network to efficiently simulate secretindex lookups in a large table (Section 5); this technique may be of independent interest for other computations that need to be protected against timing attacks.
Results: the new speeds. To simpify comparisons we have chosen to report benchmarks on a very widely available CPU microarchitecture, specifically the Ivy Bridge microarchitecture from Intel, which carries out one 256-bit vector arithmetic instruction per cycle. We emphasize, however, that our techniques are not limited to this platform. Older Intel and AMD CPUs perform two or three 128-bit vector operations per cycle; common tablet/smartphone ARMs with NEON perform one or two 128-bit vector operations per cycle (exploited by Bernstein and Schwabe in [18] , although not with bitslicing); the same techniques will also provide quite respectable performance using 64-bit registers, 32-bit registers, etc. Table 1 .1 reports our decoding speeds for various code parameters. Decoding time here is computed as 1/256 of the total latency measured for 256 simultaneous decoding operations. Decryption time is slightly larger, because it requires hashing, checking a MAC, and applying a secret-key cipher; see Section 6. We comment that the software supports a separate secret key for each decryption (although many applications do not need this), and that the latency of 256 decryptions is so small as to be unnoticeable in typical applications.
We use the usual parameter notations for code-based cryptography: q = 2 m is the field size, n is the code length, t is the number of errors corrected, and k = n − mt. "Bytes" is the public-key size k(n − k)/8 ; the rows are sorted by this column. "Total" is our cycle count (measured by the Ivy Bridge cycle counter with Turbo Boost and hyperthreading disabled) for decoding, including overhead beyond vector operations. This cycle count is partitioned into five stages: "perm" for initial permutation (Section 5), "synd" for syndrome computation (Section 4), "key eq" for solving the key equation (standard Berlekamp-Massey), "root" for root-finding (Section 3), and "perm" again for final permutation. Some of the parameters in this table are taken from [17] , which says that these parameters were designed to optimize security level subject to key sizes of 2 16 , 2 17 , 2 18 , 2 19 , and 2 20 bytes. Some parameters are from [44] . Some parameters are from [23] , and for comparison we repeat the Core 2 cycle counts reported in [23] . (We comment that the "cycles/byte" in [23] are cycles divided by (k + lg n t )/8.) Our speedups are much larger than any relevant differences between the Core 2 and the Ivy Bridge that we used for benchmarking; we will report Core 2 cycle counts for our software in a subsequent online update of this paper. "Sec" is the approximate security level reported by the https://bitbucket. org/cbcrypto/isdfq script from Peters [65] , rounded to the nearest integer.
Some of the parameter choices from [23] are uninteresting in all of our metrics: they are beaten by other parameter choices in key size, speed, and security level. For these parameter choices we mark our cycle count in gray. Note that we have taken only previously published parameter sets; in particular, we have not searched for parameters that sacrifice key size to improve speed for the same security level.
Previous speeds for public-key cryptography. The eBATS benchmarking system [16] includes seven public-key encryption systems: mceliece, a McEliece implementation from Biswas and Sendrier (with n = 2048 and t = 32, slightly above a 2 80 security level); ntruees787ep1, an NTRU implementation (2 256 security) from Mark Etzel; and five sizes of RSA starting from ronald1024 (2 80 security). None of these implementations claim to be protected against timing attacks. On h9ivy, an Ivy Bridge CPU (Intel Core i5-3210M), the fastest encryption (for 59-byte messages) is 46940 cycles for ronald1024 followed by 61440 cycles for mceliece, several more RSA results, and finally 398912 cycles for ntruees787ep1. The fastest decryption is 700512 cycles for ntruees787ep1, followed by 1219344 cycles for mceliece and 1340040 cycles for ronald1024.
A followup paper [23] by Biswas and Sendrier reports better decryption performance, 445599 cycles on a Core 2 for n = 2048 and t = 32. Sendrier says (private communication) that he now has better performance, below 300000 cycles. However, our speed of 26544 cycles for n = 2048 and t = 32 improves upon this by an order of magnitude, and also includes full protection against timing attacks.
eBATS also includes many Diffie-Hellman systems. One can trivially use Diffie-Hellman for public-key encryption; the decryption time is then the DiffieHellman shared-secret time plus some fast secret-key cryptography, and the encryption time is the same plus the Diffie-Hellman key-generation time. When we submitted this paper the fastest Diffie-Hellman shared-secret time reported from h9ivy was 182632 cycles (side-channel protected), set by the curve25519 implementation from Bernstein, Duif, Lange, Schwabe, and Yang in [14] . The fastest time now is 77468 cycles (not side-channel protected), set by gls254 from Oliveira, López, Aranha, and Rodríguez-Henríquez; see [60] . Our software takes just 60493 cycles (side-channel protected) for decryption with n = 4096 and t = 41 at the same 2 128 security level. We have found many claims that NTRU is orders of magnitude faster than RSA and ECC, but we have also found no evidence that NTRU can match our speeds. The fastest NTRU decryption report that we have found is from Hermans, Vercauteren, and Preneel in [43] : namely, 24331 operations per second on a GTX 280 GPU.
Heyse and Güneysu in [44] report 17012 Niederreiter decryption operations per second on a Virtex6-LX240T FPGA for n = 2048 and t = 27. The implementation actually uses only 10% of the FPGA slices, so presumably one can run several copies of the implementation in parallel without running into place-androute difficulties. A direct speed comparison between such different platforms does not convey much information, but we point out several ways that our decryption algorithm improves upon the algorithm used in [44] : we use an additive FFT rather than separate evaluations at each point ("Chien search"); we use a transposed additive FFT rather than applying a syndrome-conversion matrix; we do not even need to store the syndrome-conversion matrix, the largest part of the data stored in [44] ; and we use a simple hash (see Section 6) rather than a constant-weight-word-to-bit-string conversion.
Field arithmetic
We construct the finite field F 2 m as F 2 [x]/f , where f is a degree-m irreducible polynomial. We use trinomial choices of f when possible. We use pentanomials for F 2 13 and F 2 16 .
Addition. Addition in F 2 m is simply a coefficient-wise xor and costs m bit operations.
Multiplication.
A field multiplication is composed of a multiplication in F 2 [x] and reduction modulo f . We follow the standard approach of optimizing these two steps separately, and we use standard techniques for the second step. Note, however, that this two-step optimization is not necessarily optimal, even if each of the two steps is optimal.
For the first step we started from Bernstein's straight-line algorithms from http://binary.cr.yp.to/m.html. The mth algorithm is a sequence of XORs and ANDs that multiplies two m-coefficient binary polynomials. The web page shows algorithms for m as large as 1000; for McEliece/Niederreiter we use m between 11 and 16, and for CFS (Section 7) we use m = 20. These straightline algorithms are obtained by combining different multiplication techniques as explained in [11] ; for 10 ≤ m ≤ 20 the algorithms use somewhat fewer bit operations than schoolbook multiplication. We applied various scheduling techniques (in some cases sacrificing some bit operations) to improve cycle counts.
Squaring. Squaring of a polynomial does not require any bit operations. The square of an m-coefficient
. The only bit operations required for squaring in F 2 m are thus those for reduction. Note that half of the high coefficients are known to be zero; reduction after squaring takes only about half the bit operations of reduction after multiplication.
Inversion. We compute reciprocals in F 2 m as (2 m −2)nd powers. For F 2 20 we use an addition chain consisting of 19 squarings and 6 multiplications. For smaller fields we use similar addition chains.
Finding roots: the Gao-Mateer additive FFT
This section considers the problem of finding all the roots of a polynomial over a characteristic-2 finite field. This problem is parametrized by a field size q = 2 m where m is a positive integer. The input is a sequence of coefficients
of degree at most t. The output is a sequence of q bits b α indexed by elements α ∈ F q in a standard order, where b α = 0 if and only if f (α) = 0.
Application to decoding. Standard decoding techniques have two main steps: finding an "error-locator polynomial" f of degree at most t, and finding all the roots of the polynomial in a specified finite field F q . In the McEliece/Niederreiter context it is traditional to take the field size q as a power of 2 and to take t on the scale of q/ lg q, typically between 0.1q/ lg q and 0.3q/ lg q; a concrete example is (q, t) = (2048, 40). In cases of successful decryption this polynomial will in fact have exactly t roots at the positions of errors added by the message sender.
Multipoint evaluation. In coding theory, and in code-based cryptography, the most common way to solve the root-finding problem is to simply try each possible root: for each α ∈ F q , evaluate f (α) and then OR together the bits of f (α) in a standard basis, obtaining 0 if and only if f (α) = 0.
The problem of evaluating f (α) for every α ∈ F q , or more generally for every α in some set S, is called multipoint evaluation. Separately evaluating f (α) by Horner's rule for every α ∈ F q costs qt multiplications in F q and qt additions in F q ; if t is essentially linear in q (e.g., q or q/ lg q) then the total number of field operations is essentially quadratic in q. "Chien search" is an alternative method of evaluating each f (α), also using qt field additions and qt field multiplications.
There is an extensive literature on more efficient multipoint-evaluation techniques. Most of these techniques (for example, the "dcmp" method recommended by Strenzke in [76] ) save at most small constant factors. Some of them are much more scalable: in particular, a 40-year-old FFT-based algorithm [25] by Borodin and Moenck evaluates an n-coefficient polynomial at any set of n points using only n 1+o(1) field operations. On the other hand, the conventional wisdom is that FFTs are particularly clumsy for characteristic-2 fields, and in any case are irrelevant to the input sizes that occur in cryptography.
Additive FFT: overview. For multipoint evaluation we use a characteristic-2 "additive FFT" algorithm introduced in 2010 [39] by Gao and Mateer (improving upon an algorithm by von zur Gathen and Gerhard in [40] , which in turn improves upon an algorithm proposed by Wang and Zhu in [77] and independently by Cantor in [29] ), together with some new improvements described below. This algorithm evaluates a polynomial at every element of F q , or more generally every element of an F 2 -linear subspace of F q . The algorithm uses an essentially linear number of field operations; most of those operations are additions, making the algorithm particularly well suited for bitslicing.
The basic idea of the algorithm is to write f in the form
; this is handled efficiently by the "radix conversion" described below. This form of f shows a large overlap between evaluating f (α) and evaluating f (α + 1). Specifically, (α + 1)
Evaluating both f 0 and f 1 at α 2 − α produces both f (α) and f (α + 1) with just a few more field operations: multiply the f 1 value by α, add the f 0 value to obtain f (α), and add the f 1 value to obtain f (α + 1).
The additive FFT applies this idea recursively. For example, if β 2 − β = 1 then evaluating f at α, α + 1, α + β, α + β + 1 reduces to evaluating f 0 and f 1 at α 2 − α and α 2 − α + 1, which in turn reduces to evaluating four polynomials at α 4 − α. One can handle any subspace by "twisting", as discussed below. For comparison, a standard multiplicative FFT writes f in the form f 0 (x 2 ) + xf 1 (x 2 ) (a simple matter of copying alternate coefficients of f ), reducing the computation of both f (α) and f (−α) to the computation of f 0 (α 2 ) and f 1 (α 2 ). The problem in characteristic 2 is that α and −α are the same. The standard workaround is a radix-3 FFT, writing f in the form f 0 (
, but this is considerably less efficient.
We comment that the additive FFT, like the multiplicative FFT, is suitable for small hardware: it can easily be written as a highly structured iterative algorithm rather than a recursive algorithm, and at a small cost in arithmetic it can be written to use very few constants.
Additive FFT: detail. Consider the problem of evaluating a 2 m -coefficient polynomial f at all subset sums (F 2 -linear combinations) of β 1 , . . . , β m ∈ F q : i.e., computing f (0), f (β 1 ), f (β 2 ), f (β 1 + β 2 ), etc. Gao and Mateer handle this problem as follows.
If m = 0 then the output is simply f (0). Assume from now on that m ≥ 1. If β m = 0 then the output is simply two copies of the output for β 1 , . . . , β m−1 . (The algorithm stated in [39] is slightly less general: it assumes that β 1 , . . . , β m are linearly independent, excluding this case.) Assume from now on that β m = 0.
Assume without loss of generality that β m = 1. To handle the general case, compute g(x) = f (β m x), and observe that the output for f, β 1 , β 2 , . . . , β m is the same as the output for g, β 1 /β m , β 2 /β m , . . . , 1. (This is the "twisting" mentioned above. Obviously the case β m = 1 is most efficient; the extent to which this case can be achieved depends on how many powers of 2 divide lg q.) Apply the radix conversion described below to find two 2 m−1 -coefficient poly-
. Recursively evaluate f 0 at all subset sums of δ 1 , . . . , δ m−1 , where δ i = β 2 i −β i . Also recursively evaluate f 1 at all subset sums of δ 1 , . . . , δ m−1 .
Observe that each subset sum α = i∈S β i with S ⊆ {1, 2, . . . , m − 1} has α 2 − α = γ where γ = i∈S δ i . Compute f (α) as f 0 (γ) + αf 1 (γ), and compute f (α + 1) as f (α) + f 1 (γ). Note that these evaluation points α and α + 1 cover all subset sums of β 1 , β 2 , . . . , β m , since β m = 1.
The radix-conversion subroutine. Here is how to write a 2 m -coefficient poly-
, where f 0 and f 1 are 2 m−1 -coefficient polynomials. If m = 1, simply take f 0 = c 0 and f 1 = c 1 . Assume from now on that m ≥ 2. Abbreviate 2 m−2 as n; then f = c 0 + c 1 x + · · · + c 4n−1 x 4n−1 . Divide f by the power (x 2 − x) n = x 2n − x n , obtaining a quotient Q and a remainder R: explicitly,
This takes 2n = 2 m−1 additions; note that c 2n + c 3n etc. from Q are reused in R.
Recursively write Q in the form
, and recursively write R in the form R 0 (
This procedure is a special case of a general radix-conversion method credited to Schönhage in [49, page 638] . The standard method to convert an integer or polynomial to radix r is to divide it by r, output the remainder, and recursively handle the quotient. Schönhage's method is to divide by a power of r and handle both the quotient and remainder recursively. The division is particularly efficient when the power of r is sparse, as in the case of (
Improvement: 1-coefficient polynomials. Gao and Mateer show that for q = 2 m this additive-FFT algorithm uses 2q lg q − 2q + 1 multiplications in F q and (1/4)q(lg q) 2 + (3/4)q lg q − (1/2)q additions in F q . The β m = 1 optimization removes many multiplications when it is applicable.
We do better by generalizing from one parameter to two, separating the maximum polynomial degree t from the number 2 m of evaluation points. Our main interest is not in the case t + 1 = 2 m , but in the case that t is smaller than 2 m by a logarithmic factor. The adjustments to the algorithm are straightforward. We begin with a polynomial having t + 1 coefficients. If t = 0 then the output is simply 2 m copies of f (0), which we return immediately without any additions or multiplications. If t ≥ 1 then we continue as in the algorithm above; f 0 has (t + 1)/2 coefficients, and f 1 has (t + 1)/2 coefficients. Note that t + 1 and 2 m each drop by a factor of approximately 2 in the recursive calls.
It is of course possible to zero-pad a (t + 1)-coefficient polynomial to a 2 mcoefficient polynomial and apply the original algorithm, but this wastes considerable time manipulating coefficients that are guaranteed to be 0. Improvement: 2-coefficient and 3-coefficient polynomials. We further accelerate the case that t is considerably smaller than 2 m , replacing many multiplications with additions as follows.
Recall that the last step of the algorithm involves 2 m−1 multiplications of the form αf 1 (γ). Here α runs through all subset sums of β 1 , β 2 , . . . , β m−1 , and γ = α 2 − α. The multiplication for α = 0 can be skipped but all other multiplications seem nontrivial. Now consider the case that t ∈ {1, 2}. Then f 1 has just 1 coefficient, so the recursive evaluation of f 1 produces 2 m−1 copies of f 1 (0), as discussed above. The products αf 1 (γ) = αf 1 (0) are then nothing more than subset sums of
Results. Table 3 .1 displays the speed of the additive FFT, including these improvements, for an illustrative sample of field sizes q = 2 m and degrees t taken from our applications to decoding.
Other algorithms. We briefly mention a few alternative root-finding algorithms.
In the standard McEliece/Niederreiter context, f is known in advance to have t distinct roots (for valid ciphertexts). However, in the signing context of Section 7 and the "combinatorial list decoding" application mentioned in Section 6, one frequently faces, and wants to discard, polynomials f that do not have t distinct roots. One can usually save time by checking whether x q − x mod f = 0 before applying a root-finding algorithm. There are other applications where one wants all the roots of a polynomial f that has no reason to have as many as deg f distinct roots; for such applications it is usually helpful to replace f with gcd {f, x q − x}. There are other root-finding techniques (and polynomial-factorization techniques) that scale well to very large finite fields F q when t remains small, such as Berlekamp's trace algorithm [7] . If t is as large as q then all of these techniques are obviously slower than multipoint evaluation with the additive FFT, but our experiments indicate that the t cutoff is above the range used in code-based signatures (see Section 7) and possibly within the range used in code-based encryption. Our main reason for not using these methods is that they involve many data-dependent conditional branches; as far as we can tell, all of these methods become much slower when the branches are eliminated.
There is a generalization of the additive FFT that replaces x 2 − x with x t − x if q is a power of t. Gao and Mateer state this generalization only in the extreme case that lg q and lg t are powers of 2; we are exploring the question of whether the generalization produces speedups for other cases.
Syndrome computation: transposing the additive FFT
Consider the problem of computing the vector ( α r α , α r α α, . . . , α r α α d ), given a sequence of q elements r α ∈ F q indexed by elements α ∈ F q , where q = 2 m . This vector is called a "syndrome". One can compute α r α α i separately for each i with approximately 2dq field operations. We do better in this section by merging these computations across all the values of i.
Application to decoding. The standard Berlekamp decoding algorithm computes the syndrome shown above, and then solves a "key equation" to compute m and various parameters t. The total number of field additions is q times "adds"; the total number of field multiplications is q times "mults". For comparison, Horner's rule uses qt additions and qt multiplications; i.e., for Horner's rule, "adds" and "mults" are both t. Chien search also uses qt additions and qt multiplications.
the error-locator polynomial mentioned in Section 3. When Berlekamp's algorithm is applied to decoding Goppa codes using a degree-t polynomial g as described in Section 6, the inputs r α are a received word divided by g(α)
2 , and d is 2t − 1. Many other decoding algorithms begin with the same type of syndrome computation, often with d only half as large.
Note that there are only n ≤ q bits in the received word. The (d + 1)m = 2tm syndrome bits are F 2 -linear functions of these n input bits. Standard practice in the literature is to precompute the corresponding 2tm × n matrix (or a tm × n matrix for Patterson's algorithm), and to multiply this matrix by the n input bits to obtain the syndrome. These 2tmn bits are by far the largest part of the McEliece/Niederreiter secret key. Our approach eliminates this precomputed matrix, and also reduces the number of bit operations once t is reasonably large.
Syndrome computation as the transpose of multipoint evaluation. Notice that the syndrome (c 0 , c 1 , . . . , c d ) is an F q -linear function of the inputs r α . The syndrome-computation matrix is a "transposed Vandermonde matrix": the coefficient of r α in c i is α i . For comparison, consider the multipoint-evaluation problem stated in the previous section, producing f (α) for every α ∈ F q given a polynomial
The multipoint-evaluation matrix is a "Vandermonde matrix": the coefficient of c i in f (α) is α i . To summarize, the syndrome-computation matrix is exactly the transpose of the multipoint-evaluation matrix. We show below how to exploit this fact to obtain a fast algorithm for syndrome computation.
Transposing linear algorithms. A linear algorithm expresses a linear computation as a labeled acyclic directed graph. Each edge in the graph is labeled by a constant (by default 1 if no label is shown), multiplies its incoming vertex by that constant, and adds the product into its outgoing vertex; some vertices without incoming edges are labeled as inputs, and some vertices without outgoing edges are labeled as outputs. of a + 4b, 10a + 41b given a, b, using constants 4 and 10; and a computation of
The transposition principle states that if a linear algorithm computes a matrix M (i.e., M is the matrix of coefficients of the inputs in the outputs) then reversing the edges of the linear algorithm, and exchanging inputs with outputs, computes the transpose of M . This principle was introduced by Bordewijk in [24] , and independently by Lupanov in [53] for the special case of Boolean matrices. This reversal preserves the number of multiplications (and the constants used in those multiplications), and preserves the number of additions plus the number of nontrivial outputs, as shown by Fiduccia in [36, Theorems 4 and 5] after preliminary work in [35] .
For example, Figure 4 .2 displays the reversals of the linear algorithms in Transposing the additive FFT. In particular, since syndrome computation is the transpose of multipoint evaluation, reversing a fast linear algorithm for multipoint evaluation produces a fast linear algorithm for syndrome computation.
We started with our software for the additive FFT, including the improvements discussed in Section 3. This software is expressed as a sequence of additions in F q and multiplications by various constants in F q . We compiled this sequence into a directed acyclic graph, automatically renaming variables to avoid cycles.
We then reversed the edges in the graph and converted the resulting graph back into software expressed as a sequence of operations in F q , specifically C code with vector intrinsics.
This procedure produced exactly the desired number of operations in F q but was unsatisfactory for two reasons. First, there were a huge number of nodes in the graph, producing a huge number of variables in the final software. Second, this procedure eliminated all of the loops and functions in the original software, producing a huge number of lines of code in the final software. Consequently the C compiler, gcc, became very slow as m increased and ran out of memory around m = 13 or m = 14, depending on the machine we used for compilation.
We then tried the qhasm register allocator [9] , which was able to produce working code for larger values of m using the expected number of variables (essentially q), eliminating the first problem. We then wrote our own faster straight-line register allocator. We reduced code size by designing a compact format for the sequence of F q operations and interpreting the sequence at run time. There was, however, still some performance overhead for this interpreter.
We considered more advanced compilation techniques to reduce code size: the language introduced in [33] , for example, and automatic compression techniques to recognize repeated subgraphs of the reversed graph. In the end we eliminated the compiler, analyzed the interaction of transposition with the structure of the additive FFT, and designed a compact transposed additive FFT algorithm.
The original additive FFT algorithm A has steps of the form B, A 1 , A 2 , C, where A 1 and A 2 are recursive calls. The transpose A has steps C , A 2 , A 1 , B , preserving the recursions. The main loop in the additive FFT takes a pair of variables v, w (containing f 0 (α 2 − α) and f 1 (α 2 − α) respectively), operates in place on those variables (producing f (α) and f (α + 1) respectively), and then moves on to the next pair of variables; transposition preserves this loop structure and simply transposes each operation. This operation replaces v by v + w · α and then replaces w by w + v; the transposed operation replaces v by v + w and then replaces w by w + v · α.
Improvement: transposed additive FFT on scaled bits. Recall that, in the decoding context, the inputs are not arbitrary field elements: r α is a received bit divided by g(α)
2 . We take advantage of this restriction to reduce the number of bit operations in syndrome computation.
The first step of the transposed additive FFT operates on each successive pair of inputs v, w as described above The same idea can be used for more levels of recursion, although the number of required constants grows rapidly. Using this idea for all levels of recursion is tantamount to the standard approach mentioned earlier, namely precomputing a 2tm × n matrix.
Secret permutations without secret array indices:
odd-even sorting Section 3 presented an algorithm that, given a polynomial f , outputs bits b α for all α ∈ F q in a standard order (for example, lexicographic order using a standard basis), where b α = 0 if and only if f (α) = 0. However, in the McEliece/Niederreiter context, one actually has the elements (α 1 , α 2 , . . . , α q ) of F q in a secret order (or, more generally, (α 1 , . . . , α n ) for some n ≤ q), and one needs to know for each i whether f (α i ) = 0, i.e., whether b α i = 0. These problems are not exactly the same: one must apply a secret permutation to the q bits output by Section 3. Similar comments apply to Section 4: one must apply the inverse of the same secret permutation to the q bits input to Section 4. This section considers the general problem of computing a permuted q-bit string b π(0) , b π(1) , . . . , b π(q−1) , given a q-bit string b 0 , b 1 , . . . , b q−1 and a sequence of q distinct integers π(0), π(1), . . . , π(q − 1) in {0, 1, . . . , q − 1}. Mapping the set {0, 1, . . . , q − 1} to F q in a standard order, and viewing α i+1 as either π(i) or π −1 (i), covers the problems stated in the previous paragraph. The obvious approach is to compute b π(i) for i = 0, then for i = 1, etc. We require all load and store addresses to be public, so we cannot simply use the CPU's load instruction (with appropriate masking) to pick up the bit b π(i) . Bitslicing can simulate this load instruction, essentially by imitating the structure of physical RAM hardware, but this is very slow: it means performing a computation involving every element of the array. We achieve much better bitslicing speeds by batching all of the required loads into a single large operation as described below.
Sorting networks. A "sorting network" uses a sequence of "comparators" to sort an input array S. A comparator is a data-independent pair of indices (i, j); it swaps
. This conditional swap is easily expressed as a data-independent sequence of bit operations: first some bit operations to compute the condition S
[i] > S[j], then some bit operations to overwrite (S[i], S[j]) with (min {S[i], S[j]}, max {S[i], S[j]}).
There are many sorting networks in the literature. We use a standard "oddeven" sorting network by Batcher [4] , which uses exactly (m 2 − m + 4)2 m−2 − 1 comparators to sort an array of 2 m elements. This is more efficient than other sorting networks such as Batcher's bitonic sort [4] or Shell sort [72] . The oddeven sorting network is known to be suboptimal when m is very large (see [2] ), but we are not aware of noticeably smaller sorting networks for the range of m used in code-based cryptography.
Precomputed comparisons. We treat this section's b π(i) computation as a sorting problem: specifically, we use a sorting network to sort the key-value
. . according to the keys. Note that computing (π −1 (0), π −1 (1), . . .) from (π(0), π(1), . . .) can be viewed as another sorting problem, namely sorting the key-value pairs (π(0), 0), (π (1), 1) , . . . according to the keys.
We do better by distinguishing between the b-dependent part of this computation and the b-independent part of this computation: we precompute everything b-independent before b is known. In the context of code-based cryptography, the permutations π and π −1 are known at key-generation time and are the same for every use of the secret key. The only computations that need to be carried out for each decryption are computations that depend on b.
Specifically, all of the comparator conditions S Permutation networks. A "permutation network" (or "rearrangeable permutation network" or "switching network") uses a sequence of conditional swaps to apply an arbitrary permutation to an input array S. Here a conditional swap is a data-independent pair of indices (i, j) together with a permutation-dependent bit c; it swaps S[i] with S[j] if c = 1.
A sorting network, together with a permutation, produces a limited type of permutation network in which the condition bits are computed by dataindependent comparators; but there are other types of permutation networks in which the condition bits are computed in more complicated ways. In particular, the Beneš permutation network [5] uses only 2 m (m − 1/2) conditional swaps to permute 2 m elements for m ≥ 1. The main challenge in using the Beneš permutation network is to compute the condition bits in constant time; see Section 6 for further discussion of timingattack protection for key generation. We have recently completed software for this condition-bit computation but have not yet integrated it into our decoding software. We will report the details of this computation, and the resulting speeds, in an online update of this paper.
Alternative: random condition bits. In code-based cryptography we choose a permutation at random; we then compute the condition bits for a permutation network, and later (during each decryption) apply the conditional swaps. An alternative is to first choose a random sequence of condition bits for a permutation network, then compute the corresponding permutation, and later apply the conditional swaps.
This approach reduces secret-key size but raises security questions. By definition a permutation network can reach every permutation, but perhaps it is much more likely to reach some permutations than others. Perhaps this hurts security. Perhaps not; perhaps a nearly uniform distribution of permutations is unnecessary; perhaps it is not even necessary to reach all permutations; perhaps a network half the size of the Beneš network would produce a sufficiently random permutation; but these speculations need security analysis. Our goals in this paper are more conservative, so we avoid this approach: we are trying to reduce, not increase, the number of questions for cryptanalysts.
A complete code-based cryptosystem
Code-based cryptography is often presented as encrypting fixed-length plaintexts. McEliece encryption multiplies the public key (a matrix) by a k-bit message to produce an n-bit codeword and adds t random errors to the codeword to produce a ciphertext. The Niederreiter variant (which has several well-known advantages, and which we use) multiplies the public key by a weight-t n-bit message to produce an (n − k)-bit ciphertext. If the t-error decoding problem is difficult for the public code then both of these encryption systems are secure against passive attackers who intercept valid ciphertexts for random plaintexts.
What users want, however, is to be able to encrypt non-random plaintexts of variable length and to be secure against active attackers who observe the receiver's responses to forged ciphertexts. The literature contains several different ways to convert the McEliece encryption scheme into this more useful type of encryption scheme, with considerable attention paid to -the ciphertext overhead (ciphertext length minus plaintext length) and -the set of attacks that are proven to be as difficult as the t-error decoding problem (e.g., generic-hash IND-CCA2 attacks in [50] ).
However, much less attention has been paid to -the cost in encryption time, -the cost in decryption time, and -security against timing attacks.
The work described in previous sections of this paper, speeding up t-error decoding and protecting it against timing attacks, can easily be ruined by a conversion that is slow or that adds its own timing leaks. We point out, for example, that straightforward implementations of any of the decryption procedures presented in [50] would abort if the "D McEliece " step fails; the resulting timing leak allows all of the devastating attacks that [50] claims to eliminate.
This section specifies a fast code-based public-key encryption scheme that provides high security, including security against timing attacks. This section also compares the scheme to various alternatives.
Parameters. The system parameters are positive integers m, q, n, t, k such that n ≤ q = 2 m , k = n − mt, and t ≥ 2. For example, one can take m = 12, n = q = 4096, t = 41, and k = 3604.
Key generation. The receiver's secret key has two parts: first, a sequence (α 1 , α 2 , . . . , α n ) of distinct elements of F q ; second, a squarefree degree-t polynomial g ∈ F q [x] such that g(α 1 )g(α 2 ) · · · g(α n ) = 0. These can of course be generated dynamically from a much smaller secret.
The receiver computes the t × n matrix
over F q . The receiver then replaces each entry in this matrix by a column of m bits in a standard basis of F q over F 2 , obtaining an mt × n matrix H over F 2 . The kernel of H, i.e., the set of c ∈ F n 2 such that Hc = 0, is a vector space of dimension at least n − mt = k, namely the Goppa code Γ = Γ 2 (α 1 , . . . , α n , g).
At this point one can compute the receiver's public key K by applying Gaussian elimination (with partial pivoting) to H. Specifically, K is the result of applying a sequence of elementary row operations to H (adding one row to another row), and is the unique result in systematic form, i.e., the unique result whose left tm × tm submatrix is the identity matrix. One can trivially compress K to (n − mt)mt = k(n − k) bits by not transmitting the identity matrix; this compression was introduced by Niederreiter in [59] , along with the idea of using a systematic parity-check matrix for Γ instead of a random parity-check matrix for Γ . If Gaussian elimination fails (i.e., if the left tm × tm submatrix of H is not invertible) then the receiver starts over, generating a new secret key; approximately 3 tries are required on average.
The standard approach to Gaussian elimination is to search for a 1 in the first column (aborting if there is no 1), then swap that row with the first row, then subtract that row from all other rows having a 1 in the first column, then continue similarly through the other columns. This approach has several timing leaks in the success cases. (It also takes variable time in the failure cases, but those cases are independent of the final secret.) We eliminate the timing leaks in the success cases as follows, with only a small constant-factor overhead. We add 1 − b times the second row to the first row, where b is the first entry in the first row; and then similarly (with updated b) for the third row etc. We then add b times the first row to the second row, where b is the first entry in the second row; and then similarly for the third row etc. We then continue similarly through the other columns.
An alternate strategy is to first apply a reasonably long sequence of elementary row operations to H, using a public sequence of rows but secret random multiples. Here "reasonably long" is chosen so that the output is negligibly different from a uniform random parity-check matrix for the same code. That parity-check matrix can safely be made public, so one can feed it to any Gaussian-elimination routine to obtain K, even if the Gaussian-elimination routine leaks information about its input through timing.
One can argue that key generation provides the attacker only a single timing trace (for the secret key that ends up actually being used), and that this single trace is not enough information to pinpoint the secret key. However, this argument relies implicitly on a detailed analysis of how much information the attacker actually obtains through timing. By systematically eliminating all timing leaks we eliminate the need for such arguments and analyses.
Encryption. To encrypt a variable-length message we generate a random 256-bit key for a stream cipher and then use the cipher to encrypt the message. AES-CTR has fast constant-time implementations for some platforms but not for others, so we instead choose Salsa20 [10] as the stream cipher. To eliminate malleability we also generate a random 256-bit key for the Poly1305 MAC [8] , which takes time dependent only on the message length, and use this MAC to authenticate the ciphertext.
To generate these two secret keys we generate a random weight-t vector e ∈ F n 2 and then hash the vector to 512 bits. For the moment we use SHA-512 as the hash function; according to [19] it is still not yet clear exactly which Keccak variants will be specified for SHA-3. All of these hash functions take constant time for fixed n.
To transmit the vector e to the receiver we compute and send w = Ke ∈ F tm 2 . The ciphertext overhead is tm bits for w, plus 128 bits for the authenticator.
Note that we are following Shoup's "KEM/DEM" approach (see [73] ) rather than the classic "hybrid" approach. The hybrid approach (see, e.g., [61, Section 5.1]) is to first generate random secret keys, then encode those secret keys (with appropriate padding) as a weight-t vector e. The KEM/DEM approach is to first generate a weight-t vector e and then hash that vector to obtain random secret keys. The main advantage of the KEM/DEM approach is that there is no need for the sender to encode strings injectively as weight-t vectors, or for the receiver to decode weight-t vectors into strings. The sender does have to generate a random weight-t vector, but this is relatively easy since there is no requirement of injectivity.
A security proof for Niederreiter KEM/DEM appeared very recently in Persichetti's thesis [64] . The proof assumes that the t-error decoding problem is hard; it also assumes that a decoding failure for w is indistinguishable from a subsequent MAC failure. This requires care in the decryption procedure; see below.
Decryption. A ciphertext has the form (a, w, c) where a ∈ F 128 2 , w ∈ F tm 2 , and c ∈ F * 2 . The receiver decodes w (as discussed below) to obtain a weight-t vector e ∈ F n 2 such that w = Ke, hashes e to obtain a Salsa20 key and a Poly1305 key, verifies that a is the Poly1305 authenticator of c, and finally uses Salsa20 to decrypt c into the original plaintext.
Our decoding procedure is a constant-time sequence of bit operations and always outputs a vector e, even if w does not actually have the form Ke. With a small extra cost we also compute, in constant time, an extra bit indicating whether decoding succeeded. We continue through the hashing and authenticator verification in all cases, mask the authenticator-valid bit with the decoding-succeeded bit, and finally return failure if the result is 0. This procedure rejects all forgeries with the same sequence of bit operations; there is no visible distinction between decoding failures and authenticator failures.
Finding a weight-t vector e given w = Ke is the problem of syndrome decoding for K. We follow one of the standard approaches to syndrome decoding: first compute some vector v ∈ F n 2 such that w = Kv, and then find a codeword at distance t from v; this codeword must be v −e, revealing e. We use a particularly simple choice of v, taking advantage of K having systematic form: namely, v is w followed by n − mt zeros. (This choice was recommended to us by Nicolas Sendrier; we do not know where it was first used in code-based cryptography.) This choice means that the receiver does not need to store K. We also point out that some of the conditional swaps in Section 5 are guaranteed to take 0, 0 as input and can therefore be skipped.
There are two standard methods to find a codeword at distance t from v: Berlekamp's method [6] and Patterson's method [63] . To apply Berlekamp's method one first observes that Γ = Γ 2 (α 1 , . . . , α n , g
2 ), and then that Γ is the F 2 -subfield subcode of the generalized Reed-Solomon code Γ q (α 1 , . . . , α n , g 2 ). Berlekamp's method decodes generalized Reed-Solomon codes by computing a syndrome (Section 4), then using the Berlekamp-Massey algorithm to compute an error-locator polynomial, then computing the roots of the error-locator polynomial (Section 3).
Many authors have stated that Patterson's method is somewhat faster than Berlekamp's method. Patterson's method has some extra steps, such as computing a square root modulo g, but has the advantage of using g instead of g 2 , reducing some computations to half size. On the other hand, Berlekamp's method has several advantages. First, as mentioned in Section 1, combinatorial list-decoding algorithms decode more errors, adding security for the same key size, by guessing a few error positions; in this case most decoding attempts fail (as in Section 7), and the analysis in [52] suggests that this makes Berlekamp's method faster than Patterson's method. Second, Berlekamp's method generalizes to algebraic list-decoding algorithms more easily than Patterson's method; see, e.g., [12] . Third, Berlekamp's method is of interest in a wider range of applications. Fourth, Berlekamp's method saves code size. Finally, Berlekamp's method is easier to protect against timing attacks.
New speed records for CFS signatures
CFS is a code-based public-key signature system proposed by Courtois, Finiasz, and Sendrier in [31] . The main drawbacks of CFS signatures are large public-key sizes and inefficient signing; the main advantages are short signatures, fast verification, and post-quantum security. This section summarizes the CFS signature system and reports our CFS speeds.
Review of CFS. System parameters are m, q, n, t, k as in Section 6, with two extra requirements: n = q, and g is irreducible. Key generation works as in the encryption scheme described in Section 6.
The basic idea of signing is simple. To sign a message M , first hash this message to a syndrome. If this syndrome belongs to a word at distance ≤ t from a codeword, use the secret decoding algorithm to obtain the error positions and send those positions as the signature. The verifier simply adds the columns of the public-key matrix indexed by these positions and checks whether the result is equal to the hash of M .
Unfortunately, a random syndrome has very low chance of being the syndrome of a word at distance ≤ t from a codeword. CFS addresses this problem using combinatorial list decoding: guess δ error positions and then proceed with decoding. If decoding fails, guess a different set of δ error positions. Finding a decodable syndrome requires many guesses; as shown in [31] the average number of decoding attempts is very close to t!. The decoding attempts for different guesses are independent; we can thus make efficient use of bitslicing in a single signature computation.
We actually use parallel CFS, a modification of CFS proposed by Finiasz in [37] . The idea is to compute λ different hashes of the message M and compute a CFS signature for each of these hashes. This increases the security level of CFS against a 2004 Bleichenbacher attack; see generally [61] and [37] .
Previous CFS speeds. Landais and Sendrier in [52] describe a software implementation of parallel CFS with various parameters that target the 80-bit security level. Their best performance is for parameters m = 20, t = 8, δ = 2 and λ = 3. With these parameters they compute a signature in 1.32 seconds on average on an Intel Xeon W3670 (Westmere microarchitecture) running at 3.2GHz, i.e., 4.2 · 10 9 cycles per signature on average.
New CFS software. Our CFS software uses the same set of parameters. For most of the computation we also use the same high-level algorithms as the software described in [52] : in particular, we use the Berlekamp-Massey algorithm to compute the error-locator polynomial f , and we test whether this polynomial splits into linear factors by checking whether x 2 m ≡ x (mod f ).
The most important difference in our implementation is the bitsliced field arithmetic. This has two advantages: it is faster and it does not leak timing information. Some parts of the computation are performed on only one stream of data (since we sign one message at a time), but even in those parts we continue using constant-time field arithmetic rather than the lookup-table-based arithmetic used in [52] .
We do not insist on the entire signing procedure taking constant time, but we do guarantee that the signing time (and all lower-level timing information) is independent of all secret data. Specifically, to guarantee that an attacker has no information about the guessed error positions that did not allow successful decoding, we choose δ = 2 random elements of F 2 m and compute the corresponding public-key columns, rather than running through guesses in a predictable order. These columns are at some positions in the public key; we compute these positions (in constant time) if decoding is successful.
There are three main bottlenecks in generating a signature:
-pick e 1 , e 2 ∈ F 2 m at random and compute the corresponding public-key columns; -use Berlekamp-Massey to obtain an error-locator polynomial f ; -test whether x 2 m ≡ x (mod f ).
Once such a polynomial f has been found, we multiply it by (x − e 1 )(x − e 2 ) to obtain a degree-10 error-locator polynomial. We then find all roots of this polynomial and output the set of corresponding support positions as the signature. We split the root-finding problem into 256 separate 2 12 -point evaluation problems, again allowing fast constant-time bitsliced arithmetic for a single signature.
New CFS speeds. Our software signs in than 0.425 · 10 9 Ivy Bridge cycles on average; the median is 0.391·10
9 Ivy Bridge cycles. This cycle count is an order of magnitude smaller than the cycle count in [52] . We measured this performance across 100000 signature computations on random 59-byte messages on one core of an otherwise idle Intel Core i5-3210M with Turbo Boost and hyperthreading disabled.
It is common to filter out variations in cycle counts by reporting the median cycle count for many computations. Note, however, that the average is noticeably higher than the median for this type of random process. Similar comments apply to, e.g., RSA key generation.
Most of the 0.425 · 10 9 cycles are used by the three steps described above:
-picking e 1 and e 2 and computing the corresponding columns takes 52792 cycles for a batch of 256 iterations; -the Berlekamp-Massey step takes 189900 cycles for a batch of 256 iterations; -testing whether x These computations account for (52792 + 189900 + 436008)(t!λ + 128)/256 ≈ 0.32 · 10 9 cycles on average. Root-finding, repeated λ times, accounts for another 0.05 · 10 9 cycles. A small number of additional cycles are consumed by hashing, converting to bitsliced form, multiplying the degree-8 error-locator polynomial f by (x − e 1 )(x − e 2 ), et al.
We also have extremely fast software for signature verification, taking only 2176 cycles. This count is obtained as the median of 1000 signature verifications for 59-byte messages. Furthermore we have software for Intel and AMD processors that do not feature the AVX instruction set and that instead uses SSE instructions on 128-bit vectors. This software generates a signature in 0.658 · 10 9 cycles on average and verifies a signature in only 2790 cycles on one core of an Intel Core 2 Quad Q6600 CPU.
