Multivariate Public Key Cryptosystems (MPKCs) are often touted as future-proong the advent of the Quantum Computer. It also has been known for eciency compared to traditional alternatives. However, this advantage seems to be eroding with the increase of arithmetic resources in modern CPUs and improved algorithms, especially with respect to ECC. We show that the same hardware advances do not necessarily just favor ECC. The same modern commodity CPUs also have an overabundance of small integer arithmetic/logic resources, embodied by SSE2 or other vector instruction set extensions, that are also useful for MPKCs. On CPUs supporting Intel's SSSE3 instructions, we achieve a 4× speed-up over prior implementations of Rainbow-type systems (such as the ones implemented in hardware by Bogdanov et al. at CHES 2008) in both public and private map operations. Furthermore, if we want to implement MPKCs for all general purpose 64-bit CPUs from Intel and AMD, we can switch to MPKC over elds of relatively small odd prime characteristics. For example, by taking advantage of SSE2 instructions, Rainbow over F 31 can be up to 2× faster than prior implementations of same-sized systems over F 16 . A key advance is in implementing Wiedemann instead of Gaussian system solvers. We explain the techniques and design choices in implementing our chosen MPKC instances, over representative elds such as F 31 , F 16 and F 256 . We believe that our results can easily carry over to modern FPGAs, which often contain a large number of multipliers in the form of DSP slices, oering superior computational power to odd-eld MPKCs.
Introduction
Multivariate public-key cryptosystems (MPKCs) [12, 36] form a genre of PKCs dating from 1988 [31] , whose public keys represent multivariate polynomials in many variables over a small eld K = F q : P : w = (w 1 , w 2 , . . . , w n ) ∈ K n → z = (p 1 (w), p 2 (w), . . . , p m (w)) ∈ K m .
Our Answers and Contributions
We will show that advances in chip building also benet MPKCs. First, vector instructions available on many modern CPUs can provide signicant speed-ups for MPKCs over binary elds. Secondly, we can derive an advantage for MPKCs by using as the base eld F q for q equal to a small odd prime such as 31 on most of today's CPUs. This may sound somewhat counterintuitive, since for binary elds addition can be easily accomplished by the logical exclusive-or (XOR) operation, while for odd prime elds, costly reductions modulo q are unavoidable. Our reasoning and counter arguments are detailed as follows.
• all current Intel x86 CPUs of the Core and Atom lines with the Intel SSSE3 instruction PSHUFB. The Atom CPUs are specially designed for low-power and embedded applications; • all future AMD CPUs, which supports the SSE5 instruction PPERM, which can do everything PSHUFB does; • many Power derivatives PowerPC G4/G5, Power6, Xenon (in XBOX 360) and Cell (in PS3) CPUs having the AltiVec or VMX instruction PERMUTE, essentially identical to PSHUFB.
Scalar-to-vector multiplication in F 16 and F 256 can thus be sped up roughly tenfold. Using this, MPKC schemes such as TTS and Rainbow get a speed-up by a factor of 4× or higher.
2. Many CPUs today do not support the simultaneous table look-ups but instead have some vector instructions, e.g., virtually all desktop CPUs today support SSE2 instructions, which can pack eight 16-bit integer operands in its 128-bit xmm registers and hence dispatch eight simultaneous integer operations per cycle in an SIMD style.
• Using as the base eld of MPKCs F q with a small odd prime q instead of F 256 or F 16 enables us to take advantage of SSE2 and equivalent hardware. Even considering the overhead of conversion between base q and binary, schemes over F 31 is usually faster than an equivalent scheme over F 16 or F 256 (that does not use SSSE3).
• MPKCs over F q are still substantially faster than ECC or RSA. While algebra tells us that q can be any prime power, it pays to specialize to a small set of carefully chosen instances. In most of our implementations, we will be using q = 31, which experiments show to be particularly computationally ecient.
In this work, we will demonstrate that the advances in chip architecture do not leave MPKCs behind while improving traditional alternatives. Furthermore, we list a set of counter-intuitive techniques we have discovered during the course of implementing nite eld arithmetic using vector instructions.
1. Sometimes it is still much better when solving a small and dense matrix equation to use iterative methods like Wiedemann instead of straight Gaussian elimination.
2. To invert a vector in F * q element by element, it can be better to raise to the (q − 2)-th power.
3. For big-eld MPKCs, elds of certain sizes such as F 31 15 are much better than other comparable sizes because arithmetic operations are much faster. For such elds, computing the multiplicative inverse is again fastest by raising to a high power.
4. It is important to manage numerical ranges to avoid overow, for which certain instructions are unexpectedly useful. For example, the PMADDWD or packed multiply-add word to double word instruction, which computes from two 8-long vectors of 16-bit signed words (x 0 , . . . , x 7 ) and (y 0 , . . . , y 7 ) the 4-long vector of 32-bit signed words (x 0 y 0 + x 1 y 1 , x 2 y 2 + x 3 y 3 , x 4 y 4 + x 5 y 5 , x 6 y 6 + x 7 y 7 ), avoids many carries when evaluating a matrix-vector product (mod q).
Finally, we reiterate that like most implementation works such as the one by Bogdanov et al [5] , we only discuss implementation issues and do not concern ourselves with the security of MPKCs in this paper. Those readers interested in the security and design of MPKCs are instead referred to the MPKC book [12] and numerous research papers in the literature.
2 Background on MPKCs
In this section, we summarize the MPKC instances that we will be investigating. We will use the notation in Sec. 1, so we only need to describe the central map Q since M S and M T are always square and invertible matrices, usually of dimension n and m, respectively. To execute a private map, we replace the minus components if necessary, invert T , invert central map Q, invert S, and check that the prex or perturbation is good.
Notes on Odd-eld MPKCs
Some MPKCs TTS, Rainbow, even oil-and-vinegar [10, 11, 16, 30] can be implemented verbatim over small odd prime elds just like over F 2 k . For the remaining ones, e.g., IC-derivatives, an odd-characteristic version is inconvenient but not impossible. Most attacks on and their respective defenses of MPKCs are fundamentally independent of the base eld. What is even better is that some attacks are known or conjectured to be easier over binary elds than over small odd prime elds [4, 8, 14, 20] , but never vice versa. For example, the powerful Gröbner basis methods [18, 19] are in this last category.
C * and HFE in an odd-characteristic eld were mentioned by Wolf and Preneel [36] , but not much researched until recently.
Rainbow and TTS Families of Digital Signatures
Rainbow(F q , o 1 , . . . , o ) is characterized as follows [13, 16] as u stages of UOV:
• The segment structure is given by a sequence 0 < v 1 
• The central map Q has components
where
• In every q k , where 
Hidden Field Equation (HFE) Encryption Schemes
HFE is a big-eld variant of MPKC. We identify L, a degree n extension of the base eld
we have a quadratic Q, invertible via the Berlekamp algorithm with x, y as elements of (F q ) n .
Solving HFE directly is considered to be sub-exponential [23] , and a standard HFE implementation for 2 80 security works over F 2 103 with degree d = 129. We know of no timings below 100 million cycles on a modern processor like a Core 2. Modiers like vinegar or minus cost extra.
A recent attempt to make HFE faster uses the following multi-variable construction [4] .
is also a randomlychosen quadratic for = 1, . . . , h. When h is small, this Q can be easily converted into an equation in one of the X i using Gröbner basis methods at degree no longer than 2 h , which is good since solving univariate equations is cubic in the degree. The problem is that the authors also showed that these schemes are equivalent to the normal HFE and hence are equally (in-)secure.
Given the recent conjecture [14] that for odd characteristic, the usual Gröbner basis attacks on HFE does not work as well. We will try our hands at multivariate HFEs over F q for an odd q. To be conservative, we will enforce one prexed zero block (the prex or p modier, to block structural attacks) at a q times speed penalty.
C *
, -Invertible Cycles ( IC) and Minus-p Schemes C * is the original Matsumoto-Imai scheme [31] , also a big-eld variant of MPKC. We identify a larger eld L with K n with a K-linear bijection φ :
The -invertible cycle [15] can be considered as an improved extension of C * . We use the simple = 3. In 3IC we also uses an intermediate eld
. 3IC and C * maps have lots in common [10, 15, 21] . To sign, we do minus on r variables and use s prexes (set one or more of the variables to zero), to defend against all known attacks against C * schemes [10] . This is written as C * − p(q, n, α, r, s) or 3IC-p(q, n, r, s). Ding et al. recommend C * − p(2 4 , 74, 22, 1) or pFLASH [10] .
The following is a sketch of how to invert 3IC-p over a eld like
Note: for odd q, square roots are non-unique and slow, and α = 0 (square function) is a valid C * .
Background on x86 Vector Instruction Set Extensions
The seminal [31] of Matsumoto-Imai notes that bit slicing is useful for MPKCs over F 2 as a form of SIMD. Berbain et al. [2] pointed out that bit slicing can be extended appropriately for F 16 to evaluate public maps of MPKCs, as well as to run the QUAD stream cipher. Chen et al. extended this further to Gaussian elimination in F 16 , to be used for TTS [7] . To our best knowledge, the only mention of more advanced vector instructions in the MPKC literature is T. Moh's suggestion to use AltiVec instructions (only available then in the PowerPC G4) in his TTM cryptosystem [32] . This fell into obscurity after TTM was cryptanalyzed [22] .
In this section, we describe one of the most widely deployed vector instruction sets, namely, the x86 SIMD extensions. The assembly language mnemonics and code in this section are given according Intel's naming conventions, which is supported by both gcc and Intel's own compiler icc. We have veried that the two compilers give similar performance results for the most part. Load/Store: To and from xmm registers from memory (both aligned and unaligned) and traditional registers (using the lowest unit in an xmm register and zeroing the others on a load). 
Integer instructions in the SSE2

SSSE3 (Supplementary SSE3) instructions
SSSE3 adds a few very useful instructions that assist with our vector programming.
PALIGNR (packed align right, really a byte-wise shift): PALIGNR xmm (i), xmm (j), k shifts xmm (j) right by k bytes, and insert the k rightmost bytes of xmm (i) in the space vacated by the shift, with the result placed in xmm (i). Can be used to rotate an xmm register by bytes.
PHADDx,PHSUBx H means horizontal. Take as an example, the PHADDW instruction. If destination register xmm (i) starts out as (x 0 , x 1 , . . . , x 7 ), the source register xmm (j) as (y 0 , y 1 , . . . , y 7 ), then after PHADDW xmm (i), xmm (j), xmm (i) will hold PMULHRSW gives the rounded higher word of the product of two signed words in each of 8 positions.
[SSE2 only has PMULHW, which gives the higher word in the product.]
The source register xmm (j) of each instruction above can be replaced by a 16-byte-aligned memory region. The interested reader is referred to Intel's own reference manual for further information on optimizing for the x86-64 architecture [27] . To our best knowledge SSE4 and SSE5 do not improve the matter greatly for us (except for SSE5's version of PSHUFB), so we skip their descriptions here. Both SSSE3 and bitslicing then accomplishes matrix-to-vector multiplication in the same way, and even the evaluation of MPKCs' public maps in the same way, requiring column-rst matrices.
Evaluating public maps: We normally do
In theory, it is good to bitslice in F 16 when multiplying a scalar to a vector that is a multiple of 64 in length. Our tests show bitslicing a F 16 scalar-to-64-long-vector to take a tiny bit less than 60 cycles on a core of a newer (45nm) Core 2 CPU. The corresponding PSHUFB code takes close to 48 cycles. For 128-long vectors, we can still bitslice using xmm registers. It comes out to around 70 cycles with bitslicing, against 60 cycles using PSHUFB. This demonstrate the usefulness of SSSE3 since these should be optimal cases for bitslicing. 
Elimination or Solving a Matrix Equation
Data Conversion between F 2 and F q
The rst problem with MPKCs over odd prime elds is the conversion between binary and base-q data. Suppose the public map is P : F n q → F m q . For digital signatures, we need to have q m > 2 , where is the length of the hash, so that all hash digests of the appropriate size t into F q blocks. For encryption schemes that pass an -bit session key, we need q n > 2 .
Quadword (8-byte) unsigned integers in [0, 2 64 − 1] t decently into 13 blocks in F 31 . So to transfer 128-, 192-, and 256-bit AES keys, we need at least 26, 39, and 52 F 31 blocks, respectively.
Packing F q -blocks into binary can be more wasteful in the sense that it can use more bits than necessary, as long as the map is injective and convenient to compute. For example, we have opted for a very simple packing strategy in which every three F 31 blocks are t in a 16-bit word.
Basic Arithmetic Operations and Inversion mod q
In F q , the addition, subtraction, and multiplication operations are simply integer arithmetic operations mod q. The division instruction is always particularly time-consuming, so we replace it with multiplication using the following proposition [24] :
Note that the second equality, since often M > 2 n .
Most of the time there is an arithmetic instruction that returns the top n bits of the product of n-bit unsigned integers x and y, which achieves + x = x div 31, R = x − 31 Q, for an unsigned integer x < 2 64 . To invert one element in F q , we usually use a look-up table. In some cases, we need to invert many F q elements at the same time. As will be described later, we vectorize most arithmetic operations using SSE2 and hence need to store the operands in xmm registers. Although eight 7 table look-ups usually do not take long, getting the operands out of and into xmm registers can be troublesome. Instead, we can use a (q − 2)-th power (patched inverse) to get the inverse for a vector of elements. For example, after taking into account the possibility of overow, we do the following to get a 29-th power to compute multiplicative inverses in F 31 using 16-bit integers (short int):
y = x * x * x mod 31; y = x * y * y mod 31; y = y * y mod 31; y = x * y * y mod 31.
Finally, if SSSE3 is available, inversion in a F q for q < 16 is possible using one PSHUFB, and for 31 ≥ q > 16 using two PSHUFB's and some masking. Overall, the most important optimization is avoiding unnecessary modulo operations by delaying them as much as possible. To achieve this goal, we need to carefully track the size of the operands. SSE2 uses xed 16-or 32-bit operands for most of its integer vector operations. In general, the use of 16-bit operands, either signed or unsigned, gives the best trade-o between modulo reduction frequency (wider operands allow for less frequent modulo operations) and parallelism (narrower operands allow more vector elements packed in an xmm register).
Vectorizing mod q using SSE2
Using the vectorized integer addition, subtraction, and multiplication provided by SSE2, we can easily execute multiple integer arithmetic operations simultaneously. Now the question is how we implement vectorized modulo operations described in Sec. 4.2. While SSE2 does provide instructions returning the upper word of a 16-bit-by-16-bit product, there are no facilities for carries, and hence it is dicult to guarantee a range of size q for a general q. It is then important to realize that we do not always need the tightest range. Minus signs are okay, as long as the absolute values are relatively small to avoid non-trivial modulo operations.
, is guaranteed to return a value y ≡ x (mod q) such that |y| ≤ q for general b-bit word arithmetic, where IMULHIb returns the upper half in a signed product of two b-bit words, for
• For specically q = 31 and b = 16, we can do better and get a y ≡ x (mod 31), −16 ≤ y ≤ 15, for any −32768 ≤ x ≤ 32752 by: y = x − 31 · IMULHI16 (2114, x + 15) . Here IMULHI16 is implemented via the Intel intrinsic of __mm_mulhi_epi16.
• For I/O in F 31 , from y above we get a principal value between 0 and 30:
, where & is the logical AND and ≫ arithmetically shifts in the sign bit.
• There is a slightly faster version using rounding, when SSSE3 (with PMULHRSW) is available.
Matrix-vector Multiplication and Polynomial Evaluation
Core 2 and newer Intel CPUs have SSSE3 and can add horizontally within an xmm register, c.f., Sec. 3.2. Specically, the matrix M can be stored in the row-major order. Each row is multiplied component-wise to the vector v. We then use PHADDW to add horizontally and arrange the elements at the same time. Surprisingly, the convenience of having PHADDW available only makes at most a 10% dierence for q = 31.
If we are restricted to using just SSE2, then it is advisable to store M in the column-major order and treat the matrix-to-vector product as taking a linear combination of the column vectors. For q = 31, each 16-bit component in v is copied eight times into every 16-bit word in an xmm register using an __mm_set1 intrinsic, which takes three data-moving (shue) instructions, but still avoids the penalty for accessing the L1 cache. Finally we multiply this register into one column of M, eight components at a time, and accumulate.
Public maps are evaluated as in Sec. 3.3, except that We may further exploit PMADDWD as mentioned in Sec. 1.2, which computes (x 0 y 0 + x 1 y 1 , x 2 y 2 + x 3 y 3 , x 4 y 4 + x 5 y 5 , x 6 y 6 + x 7 y 7 ) given (x 0 , . . . , x 7 ) and (y 0 , . . . , y 7 ). We interleave one xmm with two monomials (32-bit load plus a single __mm_set1 call), load a 4 × 2 block in another, PMADDWD, and continue in 32-bits until the eventual reduction mod q. This way we are able to save a few mod-q operations.
The Special Case of F 31 . We also pack keys (c.f., Sec. 4.1) so that the public key is roughly mn(n + 3)/3 bytes, which holds mn(n + 3)/2 F 31 entries. For F 31 , we avoid writing the data to memory and execute the public map on the y as we unpack to avoid cache contamination. It turns out that it does not slow things down too much. Further, we can do the messier 32-bit mod-q reduction without __mm_mulhi_epi32 via shifts as 2 5 = 1 mod 32.
Solving Systems of Linear Equations
Solving systems of linear equations are involved directly with TTS and Rainbow, as well as indirectly in the other schemes through taking inverses. Normally, one runs a Gaussian elimination, in which the elementary row operations can be sped up by SSE2.
One must notice here that during a Gaussian elimination, one needs to do frequent modular reductions, which rather slows things down from the otherwise expected speed. To elaborate, say we have an augmented matrix [A|b] modulo 31. For ease of doing elementary row operations, we naturally store the matrix in the row-major order. Now suppose we have done elimination on the rst column. Each entry in the remaining columns will now be be of size up to about 1000 (31 2 ), or at least up to around 250 if signed representation is used. So, to do the second column of eliminations, we need to reduce that column mod 31 before we can look up the correct coecients. Note that reducing a single column by table look-up is no less expensive than reducing the entire matrix when the latter is not too large due to the overhead associated with moving data in and out of the xmm registers, so we end up reducing the entire matrix many times.
What can we do then? Well, we can switch to an iterative method like Wiedemann or Lanczos. To solve by Wiedemann a n×n system Ax = b, one computes zA i b for i = 1 . . . 2n for some given z. Then computes the minimal polynomial from these elements in F q , using the Berlekamp-Massey algorithm. Assuming that this is the minimal polynomial of A, we have a solution.
It looks very counter-intuitive, since a Gaussian elimination does around n 3 /3 multiplications in F q but Wiedemann takes 2n 3 for a dense matrix for the matrix-vector products alone, then need extra memory/time to store the partial results and run Berlekamp-Massey. Yet, each iteration we only need to reduce a single vector, not a matrix. That is the key observation and the tests show that Wiedemann (Lanczos too often fails) is signicantly faster for convenient sizes and odd q.
Recommendation: q = 31
Clearly, we need to avoid having too large q (too many reductions mod q) and too small q (too large an array). The choice of q = 31 seems the best compromise, since it also allows us several convenient tower elds and easy packing conversions (close to 2 5 = 32). This is veried empirically.
Arithmetic in F q k
In a big-eld or two-eld variant of MPKC, we need to handle L = F q k ∼ = F q [t]/ (p(t)), where p is an irreducible polynomial of degree k. Any irreducible p results in an isomorphic representation of the same eld, but often it is of paramount importance to pick a p that makes for ecient computations. It would be convenient if p(t) = t k − a for a small positive a. When k|(q − 1) and in a few other cases, such a suitable a can be found.
When p is in a convenient form, the map X → X q in L, as a precomputable linear map over K = F q , becomes nearly trivial, and multiplication/division/inversion become much easier, which is evident from the example timings for a tower eld in Tab. 
Multiplication and Squaring
When we have
The straightforward way to multiply is to take each x i , copy it eight times, multiply by the correct y i 's using PMULLW, and then shift the result by the appropriate distances using PALIGNR (if SSSE3 is available) or unaligned load/stores/shifts (otherwise), depending on the architecture and compiler. For inconvenient cases like when k = 9, we need to tune the code somewhat to the occasion. As an example, for k = 9, we would multiply the x-vector by y 8 and the y-vector by x 8 , leaving the rest in a convenient 8 × 8 pattern for access.
For very large elds, we can use Karatsuba [29] or other more advanced multiplication algorithms. For example, we can treat F 31 30 as F 31 15 [u]/(u 2 − t), where
We often say squaring is 0.8 times of a multiplication. Here we simply skip some of the iterations in the loops used for multiplication and weigh some of the other iterations double. Due to architectural dierences, the ratio of the costs of squaring and multiplication measured anywhere from as high as 0.92 to as low as 0.75 for elds in the teens of F 31 blocks.
Square and Other Roots
Today there are many ways to compute square roots in a nite eld [3] . For eld sizes q = 3 (mod 4), it is easy to compute the square root in F q via √ y = ±y q+1 4 . Here we implement the Tonelli-Shanks method for 1 (mod 4) eld sizes, as working with a xed eld we can include precomputated tables with the program for free. To recap, assume that we want to compute square roots in the eld L, where |L| − 1 = 2 k a, with a being odd. 0. Compute a primitive solution to g 2 k = 1 in L. We only need to take a random x ∈ L and compute g = x a , and it is almost even money (i.e., x is a non-square) that g 2 k−1 = −1, which means we have found a correct g. Start with a precomputed table of (j, g j ) for 0 ≤ j < 2 k .
1. We wish to compute an x such that x 2 = y. First compute v = y a−1 2 . 2. Look up in our table of 2 k -th roots yv 2 = y a = g j . If j is odd then y is a non-square. If j is even, then
Since we implemented mostly mod 31, for F 31 k taking a square root is easy when k is odd and not very hard when k is even. For example, in F 31 9 , we compute There are several ways to do multiplicative inverses in F q k . The classical one is an extended Euclidean Algorithm; another is to solve a system of linear equations; the last one is to invoke Fermat's little theorem and raise to the power of q k − 2.
For our specialized tower elds of characteristic 31, the extended Euclidean Algorithm is slower because after one division the sparsity of the polynomial is lost. Solving every entry in the inverse as a variable and running a elimination is about 30% better; even though it is counter-intuitive to compute X 31 15 −2 to get 1/X, it ends up fastest by a factor of 2×3×.
Finally, we note that when we compute √ X and 1/X as high powers at the same time, we 
Concluding Remarks
Given the results in Sec. 6 and the recent interest into the theory of algebraic attacks on oddcharacteristic HFE, we think that odd-eld MPKCs merit more investigation. Furthermore, today's FPGAs have many built-in multipliers and intellectual properties (IPs), as good integer multipliers are common for application-specic integrated circuits (ASICs). One excellent example of using the multipliers in FPGAs for PKCs is the work of Güneysu and Paar [25] . We believe our results can easily carry over to FPGAs as well as any other specialized hardware with a reasonable number of small multipliers. There are also a variety of massively parallel processor architectures on the rise, such as NVIDIA's and AMD/ATI's graphics processors, as well as Intel's upcoming Larrabee [33] . The comparisons herein must of course be re-evaluated with each new instruction set and new silicon implementation, but we believe that the general trend stands on our side.
