This paper investigates ways to efficiently implement public-key schemes based on Multivariate Quadratic polynomials (MQ-schemes for short). Such schemes are claimed to resist quantum computer attacks. It is shown that they can have a much better time-area product than elliptic curve cryptosystems. For instance, an optimised FPGA implementation of amended TTS is estimated to be over 50 times more efficient with respect to this parameter. Moreover, a general framework for implementing small-field MQ-schemes in hardware is proposed, which includes a systolic architecture performing Gaussian elimination over composite binary fields.
Introduction
Efficient implementations of public key schemes play a crucial role in numerous real-world security applications: some of them require messages to be signed in real time (as in safety-enhancing automotive applications such as car-to-car communication), others have to generate thousands of signatures per second (e.g. high-performance security servers using so-called HSMs, Hardware Security Modules). In this context, software implementations even on high-end processors often cannot provide the performance level needed, so hardware implementations are the only option. In this paper we explore approaches to implementing Multivariate Quadratic-based public-key systems in hardware that meet the requirements of efficient high-performance applications.

The security of the public key cryptosystems in widespread use at the moment is based on the difficulty of solving a small class of problems: the RSA scheme relies on the difficulty of factoring large integers, while the hardness of computing discrete logarithms provides the basis for ElGamal, the Diffie-Hellman scheme and elliptic curve cryptography (ECC). Given that the security of all public key schemes used in practice relies on such a limited set of problems that are currently considered to be hard, research on new schemes based on other classes of problems is necessary, as such work will provide greater diversity and hence force cryptanalysts to spend additional effort concentrating on completely new types of problems. Moreover, it ensures that not all "crypto-eggs" are in one basket. In this context, we want to point out that important results on the potential weaknesses of existing public key schemes are emerging. In particular, techniques for factorisation and solving discrete logarithms improve continually. For example, polynomial time quantum algorithms can be used to solve both problems.
Therefore, the existence of quantum computers in the range of a few thousand qubits would be a real-world threat to systems based on factoring or the discrete logarithm problem. This emphasises the importance of research into new algorithms for asymmetric cryptography.
One proposal for secure public key schemes is based on the problem of solving Multivariate Quadratic equations (MQ-problem) over a finite field $F$: finding a solution vector $x \in F^n$ for a given system of $m$ polynomial equations in $n$ variables each,

\[
\begin{cases}
y_1 = p_1(x_1, \dots, x_n) \\
y_2 = p_2(x_1, \dots, x_n) \\
\quad\vdots \\
y_m = p_m(x_1, \dots, x_n)
\end{cases}
\]

for given $y_1, \dots, y_m \in F$ and unknown $x_1, \dots, x_n \in F$, is difficult, namely NP-complete. An overview of this field can be found in [14]. Roughly speaking, most work on public-key hardware architectures either tries to optimise the speed of a single instance of an algorithm (e.g., high-speed ECC or RSA implementations) or to build the smallest possible realization of a scheme (e.g., lightweight ECC engine). A major goal in high-performance applications is, however, in addition to pure time efficiency, an optimised cost-performance ratio. In the case of hardware implementations, which are often the only solution in such scenarios, cost (measured in chip area and power consumption) is roughly proportional to the number of logic elements (gates, FPGA slices) needed. A major finding of this paper is that MQ-schemes have a better time-area product than established public key schemes. Interestingly, this also holds in comparison to elliptic curve schemes, which have the reputation of being particularly efficient.
The first public hardware implementation of a cryptosystem based on multivariate polynomials we are aware of is [17] , where enTTS is realized. A more recent result on the evaluation of hardware performance for Rainbow can be found in [2] .
Our Contribution
Our contribution is manifold. First, a clear taxonomy of secure multivariate systems and existing attacks is given. Second, we present a systolic architecture implementing Gauss-Jordan elimination over GF($2^k$) which is based on the work in [13]. The performance of this central operation is important for the overall efficiency of multivariate-based signature systems. Then, a number of concrete hardware architectures with a low time-area product are presented. Here we address both rather conservative schemes such as UOV and more aggressively designed proposals such as Rainbow or amended TTS (amTTS). For instance, an optimised implementation of amTTS is estimated to have a TA-product over 50 times lower than some of the most efficient ECC implementations. Moreover, we suggest a generic hardware architecture capable of computing signatures for the wide class of multivariate polynomial systems based on small finite fields. This generic hardware design allows us to achieve a time-area product for UOV which is somewhat smaller than that for ECC, and considerably smaller for the short-message variant of UOV.
Foundations of MQ-Systems
In this section, we introduce some properties and notations useful for the remainder of this article. After briefly introducing MQ-systems, we explain our choice of signature schemes and give a brief description of them. 
Mathematical Background
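Let $F$ be a finite field with $q := |F|$ elements and define Multivariate Quadratic (MQ) polynomials $p_i$ of the form

\[
p_i(x_1, \dots, x_n) := \sum_{1 \le j \le k \le n} \gamma_{i,j,k}\, x_j x_k + \sum_{j=1}^{n} \beta_{i,j}\, x_j + \alpha_i
\]

for $1 \le i \le m$ and $\alpha_i, \beta_{i,j}, \gamma_{i,j,k} \in F$ (constant, linear, and quadratic terms). We now define the polynomial-vector $P := (p_1, \dots, p_m)$ which yields the public key of these Multivariate Quadratic systems. This public vector is used for signature verification. Moreover, the private key (cf. Fig. 1) consists of the triple $(S, P', T)$ where $S \in \mathrm{Aff}(F^n)$, $T \in \mathrm{Aff}(F^m)$ are affine transformations and $P' \in \mathrm{MQ}(F^n, F^m)$ is a polynomial-vector $P' := (p'_1, \dots, p'_m)$ in the variables $x'_1, \dots, x'_n$.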
Throughout this paper, we will denote components of this private vector $P'$ by a prime $'$. The affine transformations $S$ and $T$ can be represented in the form of invertible matrices $M_S \in F^{n \times n}$, $M_T \in F^{m \times m}$, and vectors $v_S \in F^n$, $v_T \in F^m$, i.e. we have $S(x) := M_S x + v_S$ and $T(x) := M_T x + v_T$, respectively. In contrast to the public polynomial vector $P \in \mathrm{MQ}(F^n, F^m)$, our design goal is that the private polynomial vector $P'$ does allow an efficient computation of $x'_1, \dots, x'_n$ for given $y'_1, \dots, y'_m$. At least for secure MQ-schemes, this is not the case if the public key $P$ alone is given. The main difference between MQ-schemes lies in their special construction of the central equations $P'$ and consequently the trapdoor they embed into a specific class of MQ-problems.
In this kind of scheme, the public key $P$ is computed as the function composition of the affine transformations $S : F^n \to F^n$, $T : F^m \to F^m$ and the central equations $P' : F^n \to F^m$, i.e. $P = T \circ P' \circ S$.
To fix notation further, we note that we have P, P ′ ∈ MQ(F n , F m ), i.e. both are functions from the vector space F n to the vector space F m . By construction, we have ∀x ∈ F n : P(x) = T (P ′ (S(x))).
Signing
To sign a given $y \in F^m$, we observe that we have to invert the computation of $y = P(x)$. Using the trapdoor information $(S, P', T)$, cf. Fig. 1, this is easy. First, we observe that the transformation $T$ is a bijection. In particular, we can compute $y' = T^{-1}(y) = M_T^{-1}(y - v_T)$. The same is true for given $x' \in F^n$ and $S \in \mathrm{Aff}(F^n)$. Using the LU-decomposition of the matrices $M_S$, $M_T$, this computation takes time $O(n^2)$ and $O(m^2)$, respectively. Hence, the difficulty lies in evaluating $x' = P'^{-1}(y')$. We will discuss strategies for different central systems $P'$ in Sect. 2.4.
Verification
In contrast to signing, the verification step is the same for all MQ-schemes and also rather cheap, computationally speaking: given a pair $x \in F^n$, $y \in F^m$, we evaluate the polynomials $p_1(x_1, \dots, x_n), \dots, p_m(x_1, \dots, x_n)$.
Then, we verify that $p_i(x) = y_i$ holds for all $i \in \{1, \dots, m\}$. Obviously, all operations can be efficiently computed. The total evaluation takes time $O(mn^2)$.
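The verification step can be sketched in software as follows. This is a minimal illustration and not the hardware design of this paper: it assumes the field GF($2^8$) with the AES reduction polynomial 0x11B (the text does not fix a field representation here), and the names `gf_mul`, `eval_mq` and `verify` as well as the coefficient layout are hypothetical.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two GF(2^8) elements; the reduction polynomial is an assumption."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def eval_mq(gamma, beta, alpha, x):
    """Evaluate one quadratic polynomial at x.

    gamma[j][k] holds the quadratic coefficient of x_j * x_k and is only
    read for j <= k; beta holds the linear coefficients, alpha the constant.
    """
    n = len(x)
    acc = alpha
    for j in range(n):
        acc ^= gf_mul(beta[j], x[j])          # addition in GF(2^k) is XOR
        for k in range(j, n):
            acc ^= gf_mul(gamma[j][k], gf_mul(x[j], x[k]))
    return acc


def verify(public_key, x, y):
    """Accept iff p_i(x) == y_i for every polynomial of the public key.

    public_key is a list of m triples (gamma, beta, alpha).
    """
    if len(public_key) != len(y):
        return False
    return all(eval_mq(g, b, a, x) == y_i
               for (g, b, a), y_i in zip(public_key, y))
```

The two nested loops over $j \le k$ for each of the $m$ polynomials are where the $O(mn^2)$ cost comes from.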
Description of the Selected Systems
Based on [14] 
Unbalanced Oil and Vinegar (UOV).
Unbalanced Oil and Vinegar Schemes were introduced in [10, 11] . Here we have γ ∈ F, i.e. the polynomials p are over the finite field F. In this context, the variables x q m − q i invertible ones [14] .
Taking the currently known attacks into account, we derive the following secure choice of parameters for a security level of 2 80 :
• Small datagrams: m = 10, n = 30, τ ≈ 0.003922 and one K = 10 solver
• Hash values: m = 20, n = 60, τ ≈ 0.003922 and one K = 20 solver
The security has been evaluated using the formula
. Note that the first version (i.e. m = 10) can only be used with messages of less than 80 bits. However, such datagrams occur frequently in applications with power or bandwidth restrictions, hence we have noted this special possibility here.
Rainbow.
Rainbow is the name for a generalisation of UOV [7] . In particular, we do not have one layer, but several layers. This way, we can reduce the number of variables and hence obtain a faster scheme when dealing with hash values. The general form of the Rainbow central map is given below.
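We have the coefficients $\gamma \in F$, the number of layers $L \in \mathbb{N}$ and the vinegar splits $v_1 < \dots < v_{L+1} \in \mathbb{N}$ with $n = v_{L+1}$. To invert Rainbow, we follow the strategy for UOV, but now layer by layer, i.e. we pick random values for $x_1, \dots, x_{v_1}$, solve the first layer with a $(v_2 - v_1) \times (v_2 - v_1)$ solver for $x_{v_1+1}, \dots, x_{v_2}$, insert the values $x_1, \dots, x_{v_2}$ into the second layer, solve the second layer with a $(v_3 - v_2) \times (v_3 - v_2)$ solver for $x_{v_2+1}, \dots, x_{v_3}$, and continue in this way up to the last layer $L$.
The probability that we do not obtain a solution for this system is

\[
\tau_{\mathrm{Rainbow}} = 1 - \prod_{l=1}^{L} \prod_{i=0}^{v_{l+1}-v_l-1} \frac{q^{\,v_{l+1}-v_l} - q^{i}}{q^{\,v_{l+1}-v_l}}
\]

using a similar argument as in Sec. 2.4.1.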
Taking the latest attack from [3] into account, we obtain the parameters $L = 2$, $v_1 = 18$, $v_2 = 30$, $v_3 = 42$ for a security level of $2^{80}$, i.e. a two-layer scheme with 18 initial vinegar variables and 12 equations in the first layer, and 12 new vinegar variables and 12 equations in the second layer. Hence, we need two K = 12 solvers and obtain τ ≈ 0.007828.
amended TTS (amTTS).
The central polynomials P ′ ∈ MQ(F n , F m ) for m = 24, n = 34 in amTTS [6] are defined as given below:
We have α, γ ∈ F and σ, π permutations, i.e. all polynomials are over the finite field F. We see that they are similar to the equations of Rainbow (Sec. 2.4.2) -but this time with sparse polynomials. Unfortunately, there are no more conditions given on σ, π in [6] -we have hence picked one suitable permutation for our implementation.
To invert amTTS, we follow the same ideas as for Rainbow, except that we have to invert a 10 × 10 system twice (i = 10 … 19 and 24 … 33) and a 4 × 4 system once, i.e. we have K = 10 and K = 4. Due to the structure of the equations, the probability of not getting a solution here is the same as for a 3-layer Rainbow scheme with $v_1 = 10$, $v_2 = 20$, $v_3 = 24$, $v_4 = 34$ variables, i.e. $\tau_{\mathrm{amTTS}} = \tau_{\mathrm{Rainbow}}(10, 20, 24, 34) \approx 0.011718$.
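The failure probabilities quoted for UOV, Rainbow and amTTS can be cross-checked numerically. The sketch below rests on two assumptions: each solver sees a uniformly random $K \times K$ matrix over GF($q$) with $q = 2^8$, of which $\prod_{i=0}^{K-1}(q^K - q^i)$ out of $q^{K^2}$ are invertible, and the layers fail independently. The function names are hypothetical.

```python
from fractions import Fraction


def singular_probability(q: int, k: int) -> Fraction:
    """Probability that a uniformly random k x k matrix over GF(q) is singular."""
    invertible = Fraction(1)
    for i in range(k):
        # Row i must avoid the q^i vectors spanned by the previous rows.
        invertible *= Fraction(q**k - q**i, q**k)
    return 1 - invertible


def failure_probability(q: int, solver_sizes) -> Fraction:
    """Probability that at least one of the successive solvers hits a singular system."""
    success = Fraction(1)
    for k in solver_sizes:
        success *= 1 - singular_probability(q, k)
    return 1 - success


q = 2**8
tau_uov_short = float(failure_probability(q, [10]))        # UOV, m = 10
tau_uov_long = float(failure_probability(q, [20]))         # UOV, m = 20
tau_rainbow = float(failure_probability(q, [12, 12]))      # splits (18, 30, 42)
tau_amtts = float(failure_probability(q, [10, 4, 10]))     # splits (10, 20, 24, 34)
```

Under these assumptions the four values agree with the figures given in the text (about 0.003922, 0.003922, 0.007828 and 0.011718) to six decimal places. Exact rational arithmetic is used only to avoid rounding questions; floating point would serve equally well.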
enhanced TTS (enTTS).
The overall idea of enTTS is similar to amTTS, with m = 20, n = 28. For a detailed description of enTTS see [15, 16]. According to [6], enhanced TTS is broken, hence we do not advocate its use, nor do we give a detailed description in the main part of this article. However, it was implemented in [17], so we have included it here to allow the reader a comparison between the previous implementation and ours.
Building Blocks for MQ-Signature Cores
Considering Section 2 we see that in order to generate a signature using an MQ-signature scheme we need the following common operations:
• computing affine transformations (i.e. vector addition and matrix-vector multiplication),
• (partially) evaluating multivariate polynomials over GF(2 k ),
• solving linear systems of equations (LSEs) over GF(2 k ).
In this section we describe the main computational building blocks for realizing these operations. Using these generic building blocks we can compose a signature core for any of the presented MQ-schemes (cf Section 4).
A Systolic Array LSE Solver for GF(2 k )
In 1989, Hochet et al. [9] proposed a systolic architecture for Gaussian elimination over GF(p). They considered an architecture of simple processors, used as systolic cells that are connected in a triangular network. They distinguish two different types of cells, main array cells and the boundary cells of the main diagonal.
Figure 3: Pivot Cell of the Systolic Array LSE Solver
Wang and Lin followed this approach and proposed an architecture in 1993 [13] for computing inverses over GF($2^k$). They provided two methods to efficiently implement the Gauss-Jordan algorithm over GF(2) in hardware. Their first approach was the classical systolic array approach similar to the one of Hochet et al. It features a critical path that is independent of the size of the array. A full solution of an m × m LSE is generated after 4m cycles and every m cycles thereafter. The solution is computed in a serial fashion.
The other approach, which we call a systolic network, allows signals to propagate through the whole architecture in a single clock cycle. This allows the initial latency to be reduced to 2m clock cycles for the first result. Of course, the critical path now depends on the size of the whole array, slowing the design down for huge systems of equations. Systolic arrays can be derived from systolic networks by putting delay elements (registers) into the signal paths between the cells.
We followed the approach presented in [13] to build an LSE solver architecture over GF($2^k$). The biggest advantage of systolic architectures with regard to our application is the small number of cells compared to other architectures like SMITH [4]. For solving an m × m LSE, a systolic array consisting of only m boundary cells and m(m + 1)/2 main cells is required.
An overview of the architecture is given in Figure 2. The boundary cells shown in Figure 3 mainly comprise one inverter that is needed for pivoting the corresponding line. Furthermore, a single 1-bit register is needed to store whether a pivot was found. The main cells shown in Figure 4 comprise one GF($2^k$) register, a multiplier and an adder over GF($2^k$). Furthermore, a few multiplexers are needed. If the row is not initialised yet ($T_{in} = 0$), the entering data is multiplied by the inverse of the pivot ($E_{in}$) and stored in the cell. If the pivot was zero, the element is simply stored and passed to the next row in the next clock cycle. If the row is initialised ($T_{in} = 1$), the data element $a_{i,j+1}$ of the entering line is reduced with the stored data element and passed to the following row. Hence, one can say that the k-th row of the array performs the k-th iteration of the Gauss-Jordan algorithm.
The inverters of the boundary cells contribute most of the delay time t delay of the systolic network. Instead of introducing a full systolic array, it is already almost as helpful to simply add delay elements only between the rows. This seems to be a good trade-off between delay time and the number of registers used. This approach we call systolic lines.
As described earlier, the LSEs we generate are not always solvable. We can easily detect an unsolvable LSE by checking the state of the boundary cells after 3m clock cycles (m clock cycles for a systolic network, respectively). If one of them is not set, the system is not solvable and a new LSE needs to be generated. However, as shown in Table 1 , this happens very rarely. Hence, the impact on the performance of the implementation is negligible. Table 2 shows implementation 
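As a functional reference for what the systolic solver computes, the following sketch performs Gauss-Jordan elimination over GF($2^8$) in software and reports an unsolvable system, mirroring the check of the boundary cells. It does not model the cell structure or the timing of the array. The reduction polynomial 0x11B and all function names are assumptions made for illustration.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two GF(2^8) elements; the reduction polynomial is an assumption."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Invert a nonzero GF(2^8) element as a^254 (the multiplicative group has order 255)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result, base, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exponent >>= 1
    return result


def solve_lse(A, b):
    """Return x with A x = b over GF(2^8), or None if A is singular."""
    m = len(A)
    # Work on the augmented matrix [A | b] so the inputs stay untouched.
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        # Pivot search: the role of the boundary cell of this row.
        pivot = next((r for r in range(col, m) if M[r][col] != 0), None)
        if pivot is None:
            return None          # no pivot found: the system is not solvable
        M[col], M[pivot] = M[pivot], M[col]
        inv = gf_inv(M[col][col])
        M[col] = [gf_mul(inv, v) for v in M[col]]        # normalise the pivot row
        for r in range(m):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                # Subtraction equals addition equals XOR in characteristic 2.
                M[r] = [v ^ gf_mul(f, p) for v, p in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]
```

When `solve_lse` returns `None`, a signing routine would draw fresh random vinegar values and generate a new LSE, as described above.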
Matrix-Vector Multiplier and Polynomial Evaluator
For performing matrix-vector multiplication, we use the building block depicted in Figure 5. In the following we call this block a t-MVM. As can be seen, a t-MVM consists of t multipliers, a tree of adders of depth about $\log_2(t)$ to compute the sum of all products $a_i \cdot b_i$, and an extra adder to recursively add up previously computed intermediate values that are stored in a register. Using the RST-signal we can initially set the register content to zero. To compute the matrix-vector product $A \cdot b$ of a matrix $A = (a_{i,j})$ with $u$ columns and a vector $b = (b_1, \dots, b_u)^T$
using a t-MVM, where t is chosen such that it divides u, we proceed row by row as follows: We set the register content to zero by using RST. Then we feed the first t elements of the first row of A into the t-MVM, i.e. we set $a_1 = a_{1,1}, \dots, a_t = a_{1,t}$, as well as the first t elements of the vector b. After the register content is set to $\sum_{i=1}^{t} a_{1,i} b_i$, we feed the next t elements of the row and the next t elements of the vector into the t-MVM. This leads to a register content corresponding to $\sum_{i=1}^{2t} a_{1,i} b_i$. We go on in this way until the last t elements of the row and the vector are processed and the register content equals $\sum_{i=1}^{u} a_{1,i} b_i$. Thus, at this point the data signal c corresponds to the first component of the matrix-vector product. Proceeding in an analogous manner yields the remaining components of the desired vector. Note that the $u/t$ parts of the vector b are re-used in a periodic manner as input to the t-MVM. In Section 3.4 we describe a building block, called word rotator, which provides these parts in the required order to the t-MVM without re-loading them each time and hence avoids a waste of resources.
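The row-by-row procedure can be mirrored in a short behavioural model. This is a sketch under assumptions: it works over GF($2^8$) with reduction polynomial 0x11B, counts one clock cycle per chunk of t products, and uses the hypothetical names `gf_mul` and `t_mvm_product`. It models the data flow only, not the adder tree or any pipelining.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two GF(2^8) elements; the reduction polynomial is an assumption."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def t_mvm_product(A, b, t):
    """Compute A * b the way a t-MVM would; return (product, cycle_count)."""
    u = len(b)
    if u % t != 0:
        raise ValueError("this sketch assumes that t divides u")
    product, cycles = [], 0
    for row in A:
        register = 0                              # RST clears the accumulator
        for start in range(0, u, t):
            partial = 0
            # t multipliers feeding the adder tree
            for a_i, b_i in zip(row[start:start + t], b[start:start + t]):
                partial ^= gf_mul(a_i, b_i)
            register ^= partial                   # extra adder plus register
            cycles += 1
        product.append(register)                  # signal c after u/t cycles
    return product, cycles
```

For a matrix with v rows the model needs v · u/t cycles, which is the counting rule behind the cycle figures used for the affine transformations later in the paper.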
Therefore, using a t-MVM (and an additional vector adder) it is clear how to implement the affine transformations S : F n → F n and T : F m → F m which are important ingredients of an MQ-scheme. Note that the parameter t has a significant influence on the performance of an implementation of such a scheme and is chosen differently for our implementations (as can be seen in Section 4).
Besides realizing the required affine transformations, a t-MVM can be re-used to implement (partial) polynomial evaluation. It is quite obvious that evaluating the polynomials p 
We immediately obtain the coefficients of the non-constant part of this linear polynomial, i.e. β i,n−m+1 , . . . , β i,n , by computing the following matrix-vector product:
Also the main step for computing β i,0 can be written as a matrix-vector product: 
Of course, we can exploit the fact that the above matrix is a lower triangular matrix and we actually do not have to perform a full matrix-vector multiplication. This must simply be taken into account when implementing the control logic of the signature core. In order to obtain $\beta_{i,0}$ from $(\alpha_{i,1} \dots \alpha_{i,n-m})^T$ we have to perform the following additional computation:

\[
\beta_{i,0} = \sum_{j=1}^{n-m} \alpha_{i,j}\, b_j
\]
This final step is performed by another unit called equation register which is presented in the next section.
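The partial evaluation can be illustrated with a small software model. It assumes that a UOV central polynomial has the usual oil-vinegar shape $p'(x') = \sum_{j=1}^{n-m} \sum_{k=j}^{n} \gamma_{j,k}\, x'_j x'_k$ (no products of two oil variables), which is not spelled out in the surviving text, and that the field is GF($2^8$) with reduction polynomial 0x11B. All function names are hypothetical and indices are 0-based.

```python
from functools import reduce
from operator import xor


def gf_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two GF(2^8) elements; the reduction polynomial is an assumption."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def xor_sum(values):
    """Sum in GF(2^k): XOR of all values."""
    return reduce(xor, values, 0)


def linearise(gamma, b, n, m):
    """Fix the n - m vinegar variables to b; return (beta0, oil coefficients)."""
    v = n - m
    # Dense matrix-vector product: coefficients of the m oil variables.
    beta = [xor_sum(gf_mul(gamma[j][k], b[j]) for j in range(v))
            for k in range(v, n)]
    # Lower triangular matrix-vector product: alpha_k = sum_{j <= k} gamma[j][k] * b_j.
    alpha = [xor_sum(gf_mul(gamma[j][k], b[j]) for j in range(k + 1))
             for k in range(v)]
    # Final dot product, the step assigned to the equation register.
    beta0 = xor_sum(gf_mul(alpha[k], b[k]) for k in range(v))
    return beta0, beta


def evaluate(gamma, x, n, m):
    """Direct evaluation of the assumed oil-vinegar polynomial, for comparison."""
    v = n - m
    return xor_sum(gf_mul(gamma[j][k], gf_mul(x[j], x[k]))
                   for j in range(v) for k in range(j, n))
```

For any choice of oil values, `beta0` plus the sum of `beta[k]` times the k-th oil value equals `evaluate` on the concatenated vector. This is the sense in which fixing the vinegar variables turns the polynomial into a linear one.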
Equation Register
The Equation Register building block is shown in Figure 6. A w-ER essentially consists of w + 1 register blocks each storing k bits as well as one adder and one multiplier. It is used to temporarily store parts of a linear equation until this equation has been completely generated and can be transferred to the systolic array solver. For instance, in the case of UOV we consider linear equations of the form

\[
\sum_{j=n-m+1}^{n} \beta_{i,j}\, x'_j + \sum_{j=1}^{n-m} \alpha_{i,j}\, b_j - y'_i = 0
\]

where we used the notation from Section 3.2. To compute and store the constant part $\sum_{j=1}^{n-m} \alpha_{i,j} b_j - y'_i$ of this equation, the left-hand part of an m-ER is used (see Figure 6): The respective register is initially set to $y'_i$. Then the values $\alpha_{i,j}$ are computed one after another using a t-MVM building block and fed into the multiplier of the ER. The corresponding values $b_j$ are provided by a t-WR building block which is presented in the next section. Using the adder, $y'_i$ and the products can be added up iteratively. The coefficients $\beta_{i,j}$ of the linear equation are also computed consecutively by the t-MVM and fed into the shift-register that is shown on the right-hand side of Figure 6.
Word Rotator
A word cyclic shift register will in the following be referred to as word rotator (WR). A (t, r)-WR, depicted in Figure 7, consists of r register blocks storing the $u/t$ parts of the vector b involved in the matrix-vector products considered in Section 3.2. Each of these r register blocks stores t elements from GF($2^k$), hence each register block consists of t k-bit registers. The main task of a (t, r)-WR is to provide the correct parts of the vector b to the t-MVM at all times. The r register blocks can be serially loaded using the input bus x. After loading, the r register blocks are rotated at each clock cycle. The cycle length of the rotation can be modified using the multiplexers by providing appropriate control signals. This is especially helpful for the partial polynomial evaluation where, due to the triangularity of the matrix in Equation (2), numerous operations can be saved. Here, the cycle length is $\lceil j/t \rceil$, where j is the index of the processed row. The possibility to adjust the cycle length is also necessary in the case $r > u/t$, which appears frequently if we use the same (t, r)-WR, i.e., fixed parameters t and r, to implement the affine transformation T, the polynomial evaluations, and the affine transformation S. Additionally, the WR provides $b_j$ to the ER building block, which is needed by the ER at the end of each rotation cycle. Since this $b_j$ value always occurs in the last register block of a cycle, the selector component (right-hand side of Figure 7) can simply load it and provide it to the ER.
Performance Estimations of Small-Field MQ-Schemes in Hardware
We implemented the most crucial building blocks of the architecture as described in Section 3 (systolic structures, word rotators, matrix-vector multipliers of different sizes). In this section, the hardware performance of the whole architecture is estimated based on those implementation results. The power of the approach and the efficiency of MQ-schemes in hardware are demonstrated using the examples of UOV, Rainbow, enTTS and amTTS as specified in Section 2.
Side-Note: The volume of data that needs to be imported into the hardware engine for MQ-schemes may seem too high to be realistic in some applications. However, the contents of the matrices and the polynomial coefficients (i.e. the private key) do not necessarily have to be imported from the outside world or from a large on-board memory. Instead, they can be generated online in the engine using a cryptographically strong pseudo-random number generator, requiring only a small, cryptographically strong secret, i.e. some random bits.
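The idea in the side-note can be illustrated in software. The text does not name a generator, so the sketch below uses SHAKE-256 from Python's standard `hashlib` purely as a stand-in. The function name `expand_key_material`, the domain-separation labels and the example sizes are all hypothetical.

```python
import hashlib


def expand_key_material(seed: bytes, label: bytes, count: int) -> list[int]:
    """Derive `count` GF(2^8) elements (one byte each) from a short secret seed.

    SHAKE-256 is an assumption of this sketch, not the generator of the paper.
    The label separates the streams used for different parts of the key.
    """
    stream = hashlib.shake_256(label + b"|" + seed).digest(count)
    return list(stream)


seed = bytes(range(16))                      # placeholder secret, 128 bits
coefficients = expand_key_material(seed, b"central-map", 64)
matrix_entries = expand_key_material(seed, b"affine-S", 9)
```

The same seed and label always reproduce the same coefficients, so only the seed has to be stored. Note that the matrices $M_S$ and $M_T$ must be invertible, so a real design would need an additional check or a sampling method that guarantees invertibility.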
UOV
We treat two parameter sets for UOV as shown in Table 3: n = 60, m = 20 (long-message UOV) as well as n = 30, m = 10 (short-message UOV). In UOV signature generation, there are three basic operations: linearising polynomials, solving the resulting equation system, and an affine transform to obtain the signature. The most time-consuming operation of UOV is the partial evaluation of the polynomials $p'_i$, since their coefficients are nearly random. However, as already mentioned in the previous section, for some polynomials approximately one half of the coefficients are zero. This somewhat simplifies the task of linearization.
For the linearization of polynomials in the long-message UOV, 40 random bytes are generated to invert the central mapping first. To do this, we use a 20-MVM, a (20,3)-WR, and a 20-ER. For each polynomial one needs about 100 clock cycles (40 clocks to calculate the linear terms and another 60 ones to compute the constants, see (1) and (2)) and obtains a linear equation with 20 variables. As there are 20 polynomials, this yields about 2000 clock cycles to perform this step.
After this, the 20 × 20 linear system over GF($2^8$) is solved using a 20 × 20 systolic array. The solution is returned after about 4 × 20 = 80 clock cycles. Then, the 20-byte solution is concatenated with the randomly generated 40 bytes and the result is passed through the affine transformation, whose major part is a matrix-vector multiplication with a 60 × 60-byte matrix. To perform this operation, we re-use the 20-MVM and the (20,3)-WR. This requires about 180 cycles of the 20-MVM, with 20 bytes of the matrix entries being input in each cycle.
For the short-message UOV, one has a very similar structure. More precisely, one needs a 10-MVM, a (10,3)-WR, a 10-ER and a 10×10 systolic array. The design requires approximately 500 cycles for the partial evaluation of the polynomials, about 40 cycles to solve the resulting 10×10 LSE over GF (2 8 ) as well as another 90 cycles for the final affine map. Note that the critical path of the Gaussian elimination engine is much longer than that for the remaining building blocks. So this block represents the performance bottleneck in terms of frequency and hardware complexity. For this reason we decided to clock different components of the design with different frequencies. For the XC5VLX50-3 device the Gaussian elimination engine is clocked with 200 MHz and the rest with 400 MHz. Alternatively, for the XC3S1500 device the Gaussian elimination component is clocked with about 80 MHz, the remaining engines with 160 MHz. See Table 3 for our estimations.
Rainbow
In the version of Rainbow we consider, the message length is 24 bytes. That is, a 24-byte matrix-vector multiplication has to be performed first. One can take a 6-MVM and a (6,7)-WR, which require about 96 clock cycles to perform the computation. Then the first 18 variables $x'_i$ are randomly fixed and the first 12 polynomials are partially evaluated. This requires about 864 clock cycles. The results are stored in a 12-ER. After this, the 12 × 12 system of linear equations is solved. This requires a 12 × 12 systolic array over GF($2^8$) which outputs the solution after 48 clock cycles. Then the last 12 polynomials are linearised using the same matrix-vector multiplier and word rotator, based on the 18 random values previously chosen and the 12-byte solution. This needs about 1800 clock cycles. This is followed by another run of the 12 × 12 systolic array with the same execution time of about 48 clock cycles. At the end, roughly 294 more cycles are spent performing the final affine transform on the 42-byte vector. See Table 3 for some concrete performance figures in this case.
enTTS and amTTS
As in Rainbow, for enTTS two vector-matrix multiplications are needed, at the beginning and at the end of the operation, with 20- and 28-byte vectors, respectively. We take a 10-MVM and a (10,3)-WR for this. The operations require 40 and 84 clock cycles, respectively. One 9-ER is required. Two 10 × 10 linear systems over GF($2^8$) need to be solved, requiring about 40 clock cycles each. The linearization of the polynomials can be significantly optimised in terms of time compared to the generic UOV or Rainbow, which can drastically reduce the time-area product. This behaviour is due to the special selection of polynomials, where only a small proportion of coefficients is non-zero. After choosing 7 variables randomly, 10 linear equations have to be generated. For each of these equations, one has to perform only a few multiplications in GF($2^8$), which can be done in parallel. This requires about 20 clock cycles. After this, another variable is fixed and a further set of 10 polynomials is partially evaluated. This requires about 20 further cycles.
In amTTS, which is quite similar to enTTS, two affine maps with 24- and 34-byte vectors are performed with a 12-MVM and a (12,3)-WR, yielding 48 and 102 clock cycles, respectively. Two 10 × 10 and one 4 × 4 linear systems have to be solved, requiring a 10 × 10 systolic array (twice 40 and once 16 clock cycles). Moreover, a 10-ER is needed. The three steps of the partial evaluation of polynomials require roughly 40 clock cycles in this case. See Table 3 for our estimations on enTTS and amTTS.
# For comparison purposes we assume that the design can be clocked with up to 80 MHz.
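The cycle counts quoted in this section for UOV and Rainbow, and for the affine maps of enTTS and amTTS, follow a simple counting rule that can be written down as a small model. This is an interpretation of the figures rather than a description of the control logic. It assumes one cycle per chunk of t products, $\lceil j/t \rceil$ cycles for row j of the triangular matrix, 4K cycles for a K × K systolic array, and no overlap between the steps. All names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def mvm_cycles(rows: int, cols: int, t: int) -> int:
    """Cycles for a dense rows x cols matrix-vector product on a t-MVM."""
    return rows * ceil_div(cols, t)


def linearisation_cycles(equations: int, vinegar: int, oil: int, t: int) -> int:
    """Cycles to linearise `equations` polynomials with dense coefficients."""
    linear = mvm_cycles(oil, vinegar, t)                          # oil coefficients
    constant = sum(ceil_div(j, t) for j in range(1, vinegar + 1))  # triangular part
    return equations * (linear + constant)


def solver_cycles(k: int) -> int:
    """Latency of a k x k systolic array (4k cycles, as stated for [13])."""
    return 4 * k


# Long-message UOV (n = 60, m = 20, t = 20) and short-message UOV (n = 30, m = 10, t = 10)
uov_long = (linearisation_cycles(20, 40, 20, 20), solver_cycles(20), mvm_cycles(60, 60, 20))
uov_short = (linearisation_cycles(10, 20, 10, 10), solver_cycles(10), mvm_cycles(30, 30, 10))

# Rainbow with splits (18, 30, 42) and t = 6, step by step
rainbow = [
    mvm_cycles(24, 24, 6),                  # initial affine map
    linearisation_cycles(12, 18, 12, 6),    # first layer
    solver_cycles(12),
    linearisation_cycles(12, 30, 12, 6),    # second layer
    solver_cycles(12),
    mvm_cycles(42, 42, 6),                  # final affine map
]
rainbow_total = sum(rainbow)
```

Under these assumptions the model reproduces the figures in the text: 2000, 80 and 180 cycles for long-message UOV, 500, 40 and 90 for short-message UOV, and 96, 864, 48, 1800, 48 and 294 for Rainbow. The sparse polynomials of enTTS and amTTS are deliberately not covered by `linearisation_cycles`, since their evaluation is optimised separately.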
Comparison and Conclusions
Our implementation results (as well as the estimations for the optimisations in the case of enTTS and amTTS) are compared to the scalar multiplication in the group of points of elliptic curves with field bit lengths in the range of 160 bits (corresponding to the security level of $2^{80}$) over GF($2^k$), see Table 3. A good survey on hardware implementations for ECC can be found in [5].
Even the most conservative design, i.e. long-message UOV, can outperform some of the most efficient ECC implementations in terms of TA-product on some hardware platforms. More hardware-friendly designs such as the short-message UOV or Rainbow provide a considerable advantage over ECC. The more aggressively designed enTTS and amTTS allow for extremely efficient implementations with a TA-product more than 70 or 50 times lower, respectively. Though the metric we use is not optimal, the results indicate that MQ-schemes perform better than elliptic curves in hardware with respect to the TA-product and are hence an interesting option in cost- or size-sensitive areas.
