Abstract - This is the first FPGA implementation of MQQ, a recently published class of public key algorithms based on quasigroup string transformations. Our implementation achieves a decryption throughput of 399 Mbps on a Xilinx Virtex-5 FPGA running at 249.4 MHz, and an encryption throughput of 44.27 Gbps on a Xilinx Virtex-5 chip running at 276.7 MHz. Compared to an RSA implementation on the same FPGA platform, this implementation of MQQ is 10,000 times faster in decryption and more than 17,000 times faster in encryption. The main goal of this work was to build hardware that performs operations with the public and the private key at the highest possible speed. Our main comparison is with RSA of similar cryptographic strength, because we want to emphasize that RSA, being an essentially sequential algorithm, cannot benefit from the parallel capabilities that modern FPGAs offer, while MQQ can.
I. INTRODUCTION
The most popular Public Key Cryptosystem (PKC) schemes are the Diffie and Hellman (DH) key exchange scheme, based on the hardness of the discrete logarithm problem [4], the Rivest, Shamir and Adleman (RSA) scheme, based on the difficulty of integer factorization [20], and the Koblitz and Miller Elliptic Curve Cryptography (ECC) scheme, based on the discrete logarithm problem in an additive group of points defined by elliptic curves over finite fields [13], [16]. These well-known PKCs (DH, RSA and ECC) share two characteristics: 1. their speed, which is frequently a thousand times lower than that of symmetric cryptographic schemes; 2. their security, which relies on one of two hard mathematical problems: efficient computation of discrete logarithms and factorization of integers.
Several other ideas have been proposed during the last 30 years, such as:
• PKCs based on lattice reduction problems [1], [7] and on lattice problems over rings such as NTRU [9];
• PKCs based on braid groups [12].
Recently, a new public key scheme called MQQ, based on multivariate quadratic polynomials and quasigroup string transformations, was proposed by Gligoroski et al. [6]. According to the authors of MQQ, the scheme has the potential to be as fast as a typical block cipher. In this paper we describe an implementation on a Xilinx Virtex-5 FPGA that confirms the original claims of the authors of MQQ.
The paper is organized as follows. In Section II we give a brief description of the MQQ algorithm. The hardware implementation of MQQ encryption is described in Section III, and the hardware implementation of MQQ decryption is described in Section IV. A comparative analysis of our implementation with RSA and AES is given in Section V. Conclusions are given in Section VI.
II. PRELIMINARIES
In this section we briefly describe the MQQ algorithm. A more detailed description can be found in [6]. Since the algorithm is based on quasigroups, we first give several definitions about quasigroups. Additional information about quasigroups can be found in [2], [3], [14], [22].
Definition 1: A quasigroup (Q, *) is a groupoid satisfying the law
$$(\forall u, v \in Q)(\exists!\, x, y \in Q)\quad u * x = v \ \ \&\ \ y * u = v. \qquad (1)$$
It follows from (1) that for each a, b ∈ Q there is a unique x ∈ Q such that a * x = b. We then denote $x = a \backslash_{*} b$, where $\backslash_{*}$ is a binary operation on Q (called a left parastrophe of *), and the groupoid $(Q, \backslash_{*})$ is a quasigroup too. The algebra $(Q, *, \backslash_{*})$ satisfies the identities
$$x \backslash_{*} (x * y) = y, \qquad x * (x \backslash_{*} y) = y. \qquad (2)$$
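As a small illustration of the left parastrophe and of the two identities it satisfies together with *, the following Python sketch builds the parastrophe of a toy quasigroup of order 4. The Cayley table `TABLE` and the helper names are hypothetical and chosen only for this example; they are not quasigroups used by MQQ.

```python
# Toy quasigroup of order 4 given by its Cayley table (a Latin square).
# This table is an illustrative assumption, not one of the MQQ quasigroups.
Q = range(4)
TABLE = [
    [2, 1, 0, 3],
    [3, 0, 1, 2],
    [1, 2, 3, 0],
    [0, 3, 2, 1],
]


def op(a, b):
    """The quasigroup operation a * b as a table lookup."""
    return TABLE[a][b]


def is_quasigroup(table):
    """A finite table is a quasigroup iff every row and column is a permutation."""
    n = len(table)
    full = set(range(n))
    rows_ok = all(set(row) == full for row in table)
    cols_ok = all({table[r][c] for r in range(n)} == full for c in range(n))
    return rows_ok and cols_ok


def left_parastrophe(table):
    """Build the table of the left parastrophe: a \\ b = x  iff  a * x = b."""
    n = len(table)
    para = [[None] * n for _ in range(n)]
    for a in range(n):
        for x in range(n):
            para[a][table[a][x]] = x
    return para


PARA = left_parastrophe(TABLE)


def lpar(a, b):
    """The left parastrophe a \\ b as a table lookup."""
    return PARA[a][b]


# Both identities hold for every pair (x, y).
identities_hold = all(
    lpar(x, op(x, y)) == y and op(x, lpar(x, y)) == y for x in Q for y in Q
)
print(is_quasigroup(TABLE), is_quasigroup(PARA), identities_hold)
```

Because each row of the table is a permutation, inverting a row is all that is needed to obtain the parastrophe, which is why it can also be stored as a plain lookup table.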
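Consider an alphabet (i.e., a finite set) Q, and denote by $Q^+$ the set of all nonempty words (i.e., finite strings) formed by the elements of Q. In this paper, depending on the context, we will use two notations for the elements of $Q^+$: $a_1 a_2 \ldots a_n$ and $(a_1, a_2, \ldots, a_n)$, where $a_i \in Q$. Let * be a quasigroup operation on the set Q. For each $l \in Q$ we define two functions $e_{l,*}, d_{l,*} : Q^+ \to Q^+$ as follows: for $M = a_1 a_2 \ldots a_n$,
$$e_{l,*}(M) = b_1 b_2 \ldots b_n \iff b_1 = l * a_1,\ b_2 = b_1 * a_2,\ \ldots,\ b_n = b_{n-1} * a_n,$$
$$d_{l,*}(M) = c_1 c_2 \ldots c_n \iff c_1 = l * a_1,\ c_2 = a_1 * a_2,\ \ldots,\ c_n = a_{n-1} * a_n.$$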
The functions $e_{l,*}$ and $d_{l,*}$ are called the e-transformation and the d-transformation of $Q^+$ based on the operation * with leader l, respectively. Their graphical representations are shown in Fig. 1.
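The e- and d-transformations can be sketched in a few lines of Python. The sketch below uses a hypothetical toy quasigroup of order 4 (not one from MQQ) and checks that the d-transformation taken with the left parastrophe undoes the e-transformation taken with *, and vice versa; the function names are assumptions made for this example.

```python
# Toy quasigroup of order 4 (illustrative assumption, not an MQQ quasigroup).
TABLE = [
    [2, 1, 0, 3],
    [3, 0, 1, 2],
    [1, 2, 3, 0],
    [0, 3, 2, 1],
]


def op(a, b):
    """Quasigroup operation a * b."""
    return TABLE[a][b]


def lpar(a, b):
    """Left parastrophe: the unique x with a * x = b."""
    return TABLE[a].index(b)


def e_transform(leader, word, operation):
    """b_1 = l op a_1, b_i = b_{i-1} op a_i (each output feeds the next step)."""
    out, prev = [], leader
    for a in word:
        prev = operation(prev, a)
        out.append(prev)
    return out


def d_transform(leader, word, operation):
    """c_1 = l op a_1, c_i = a_{i-1} op a_i (only inputs are used, no feedback)."""
    out, prev = [], leader
    for a in word:
        out.append(operation(prev, a))
        prev = a
    return out


leader = 1
message = [0, 1, 2, 3, 3, 1]
encoded = e_transform(leader, message, op)
decoded = d_transform(leader, encoded, lpar)
print(encoded, decoded == message)
```

Note the structural difference between the two loops: the e-transformation carries a data dependency from one output symbol to the next, whereas every output of the d-transformation depends only on two adjacent input symbols and can therefore be computed in parallel.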
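Lemma 1: If (Q, *) is a finite quasigroup, then $e_{l,*}$ and $d_{l,\backslash_{*}}$ are mutually inverse permutations of $Q^+$, i.e.,
$$d_{l,\backslash_{*}}(e_{l,*}(M)) = M = e_{l,*}(d_{l,\backslash_{*}}(M))$$
for each leader l ∈ Q and for every string M ∈ $Q^+$.
The authors of MQQ noticed in [6] that when a quasigroup is represented as a Boolean function, there exists a special class of quasigroups, called multivariate quadratic quasigroups (MQQs). Those MQQs can be of different types.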
and if
TABLE I: Algorithm for decryption with the private key $(T, S, *_1, \ldots, *_8)$
Several algorithms are defined in [6]: for generating MQQs, for key generation, for encryption and for decryption. In this paper, however, we present only the algorithms (and parts of algorithms) that are important for implementing decryption and encryption. We refer the reader to [6] for the other algorithms and for a more detailed description of MQQ.
One important part of the MQQ algorithm is the use of the bijection of Dobbertin. Dobbertin has proved [5] that the function $\mathrm{Dob}(X) = X^{2^{m+1}+1} + X^3 + X$ is a bijection in $GF(2^{2m+1})$. Moreover, it is multivariate quadratic too. In our implementation of the MQQ public key cryptosystem we have used the bijection of Dobbertin for m = 6, i.e., a bijection in $GF(2^{13})$. From the point of view of hardware resources, this means that implementing the bijection of Dobbertin for m = 6 (actually its inverse) requires $2^{13} \times 13 = 106496$ bits of ROM. The algorithm for decryption/signing with the private key $(T, S, *_1, \ldots, *_8)$ is defined in Table I.
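To make the ROM size concrete, the following Python sketch tabulates a Dobbertin-type function $X^{2^{m+1}+1} + X^3 + X$ for m = 6 and inverts it into a lookup table with $2^{13}$ entries of 13 bits each. The field representation is an assumption: the source does not state which irreducible polynomial is used, so the sketch picks $x^{13} + x^4 + x^3 + x + 1$, and all function names are hypothetical.

```python
M = 6
N = 2 * M + 1          # field GF(2^13)
# Assumed field polynomial x^13 + x^4 + x^3 + x + 1; the source does not
# specify the representation of GF(2^13).
MODULUS = 0x201B


def gf_mul(a, b):
    """Carry-less multiplication of a and b modulo MODULUS (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:          # degree reached 13: reduce
            a ^= MODULUS
    return r


def gf_pow(a, e):
    """Square-and-multiply exponentiation in GF(2^13)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def dob(x):
    """Dob(X) = X^(2^(m+1)+1) + X^3 + X, with '+' being XOR."""
    return gf_pow(x, 2 ** (M + 1) + 1) ^ gf_pow(x, 3) ^ x


# Forward table, then the inverse table that a decryption ROM would hold.
DOB_TABLE = [dob(x) for x in range(1 << N)]
INV_ROM = [0] * (1 << N)
for x, y in enumerate(DOB_TABLE):
    INV_ROM[y] = x

rom_bits = len(INV_ROM) * N
print(len(set(DOB_TABLE)), rom_bits)
```

Storing the inverse as a table trades memory for speed: inverting the function becomes a single ROM read, which matches the lookup-based design described for the decryption hardware.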
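The algorithm for encryption with the public key is a straightforward application of the set of n multivariate polynomials $P = \{P_i(x_1, \ldots, x_n) \mid i = 1, \ldots, n\}$ over a vector $x = (x_1, \ldots, x_n)$, i.e., y = P(x).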
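In order to implement MQQ encryption in FPGA we have represented the mapping y = P(x) as a matrix-vector multiplication, where the operation "+" is the logical XOR operation and the multiplication of two variables is the logical AND operation. Namely, every $P_i(x_1, \ldots, x_n)$ can be represented as:
$$P_i(x_1, \ldots, x_n) = a_{i,0,0} + \sum_{j=1}^{n} a_{i,j,0}\, x_j + \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} a_{i,j,k}\, x_j x_k,$$
where $a_{i,j,k} \in \{0, 1\}$. That means that we can represent the encryption as:
$$y = A \cdot X \equiv \begin{pmatrix} a_{1,0,0} & a_{1,1,0} & \ldots & a_{1,n,0} & a_{1,1,2} & a_{1,1,3} & \ldots & a_{1,n-1,n} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ a_{n,0,0} & a_{n,1,0} & \ldots & a_{n,n,0} & a_{n,1,2} & a_{n,1,3} & \ldots & a_{n,n-1,n} \end{pmatrix} \cdot \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_n \\ x_1 x_2 \\ x_1 x_3 \\ \vdots \\ x_{n-1} x_n \end{pmatrix}.$$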
Note that the Boolean matrix A is the public key and it is an $n \times (1 + n + \frac{n(n-1)}{2})$ matrix, and the vector X is a $(1 + n + \frac{n(n-1)}{2}) \times 1$ vector obtained from the vector $x = (x_1, \ldots, x_n)$.
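The matrix-vector form of encryption can be sketched in software as follows. This is a minimal Python model under stated assumptions: the function names (`expand`, `encrypt`, `num_columns`) and the tiny 2-variable matrix are hypothetical and serve only to show the column ordering and the XOR/AND arithmetic; a real public key would be the 160-row matrix A.

```python
from itertools import combinations


def num_columns(n):
    """Width of A: one constant, n linear terms, n(n-1)/2 quadratic terms."""
    return 1 + n + n * (n - 1) // 2


def expand(x):
    """Build X = (1, x_1..x_n, x_1x_2, x_1x_3, ..., x_{n-1}x_n) from the bits x."""
    # combinations() yields pairs in the order (1,2), (1,3), ..., (n-1,n).
    return [1] + list(x) + [a & b for a, b in combinations(x, 2)]


def encrypt(A, x):
    """y = A . X over GF(2): AND for products, XOR (parity of the sum) for '+'."""
    X = expand(x)
    return [sum(a & v for a, v in zip(row, X)) & 1 for row in A]


# Toy public key for n = 2 (columns: 1, x1, x2, x1*x2); purely illustrative.
A_toy = [
    [1, 0, 1, 1],   # P_1 = 1 + x2 + x1*x2
    [0, 1, 0, 1],   # P_2 = x1 + x1*x2
]
print(encrypt(A_toy, (1, 1)), num_columns(160))
```

For n = 160 the expanded vector has 12881 components, which is the column count of the public-key matrix quoted for the 160-bit instance.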
III. HARDWARE DESIGN OF THE ENCRYPTION PROCEDURE
In this and in the following section we describe a hardware implementation of 160-bit MQQ encryption and decryption. Our goal was to prove or disprove the claim of the authors of MQQ that its operational speed can match that of block ciphers. To achieve that goal, we have implemented 160-bit MQQ in VHDL and have constructed a parallel design in which every component completes its computation in as few clock cycles as possible, with as short a clock period as possible. We have used the Xilinx Virtex-5 FPGA family and its synthesis tool "ISE Foundation 10.1".
One of the biggest problems that we faced in this part was the size of the public key (the matrix A), together with the goal of finishing the matrix-vector multiplication as fast as possible. For 160-bit MQQ, the Boolean matrix A has 160 rows and 12881 columns. Although it is theoretically possible to perform the matrix-vector operation in one cycle, performing operations on the whole matrix in a single FPGA was a source of constant compiler warnings and outright compiler failures. Since our work can be seen as a proof of concept, we decided to simplify the computations by putting the public key into a ROM. For real operational and flexible use of MQQ, Virtex-5 would offer plenty of RAM.
Figure 2 shows the complete architecture for the entire encryption process, which includes four identical hierarchies. Each one contains three main hardware operative parts, named "Divider", "Hybrid ROM" and "Hybrid DEMux". The public key A was divided into four parts, i.e., split across four FPGA chips that work in parallel, in order to overcome the synthesis difficulties caused by placing the big public key A in the ROM of one chip.
So, the idea is to implement the matrix-vector multiplication A · X in a classical block manner, where we represent the matrix A as
$$A = \begin{pmatrix} A_1 \\ A_2 \\ A_3 \\ A_4 \end{pmatrix},$$
with each submatrix $A_i$ consisting of 40 rows of A.
The component "Divider" consists of four operative parts and is shown in Figure 3. The role of this part is to take the 160 bits of input data from "REG 1" and to expand them into 12881 bits by using the "Combinational AND Gates" block. Those 12881 bits then go to "Register Y" to keep the data bits synchronized (a technique described in the Xilinx Virtex-5 user guide [23]). After this, the data are divided into two parts by the multiplexer "Hybrid Mux". This multiplexer has two kinds of inputs: the first 80 inputs have a data width of 160 bits and the last input has 81 bits. The same holds for its output, which has two branches: the first has 160 bits and the second has 81 bits.
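The block decomposition and the data widths handled by the "Divider" can be checked with a short software model. The Python sketch below is only an illustration under assumptions: the helper names are hypothetical, the small random matrix (n = 8) stands in for the real public key, and the code models the arithmetic, not the timing, of the hardware.

```python
import random
from itertools import combinations


def expand(x):
    """AND-gate expansion: (1, x_1..x_n, all products x_j x_k with j < k)."""
    return [1] + list(x) + [a & b for a, b in combinations(x, 2)]


def gf2_matvec(rows, X):
    """Boolean matrix-vector product: AND for products, XOR for sums."""
    return [sum(a & v for a, v in zip(row, X)) & 1 for row in rows]


def split_rows(A, parts=4):
    """Split A into `parts` horizontal blocks A_1..A_parts of equal height."""
    step = len(A) // parts
    return [A[i * step:(i + 1) * step] for i in range(parts)]


def chunk(bits, width=160):
    """Cut the expanded vector into words of `width` bits (last may be shorter)."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


# Small random stand-in for the public key (n = 8), seeded for repeatability.
rng = random.Random(0)
n = 8
cols = 1 + n + n * (n - 1) // 2
A = [[rng.randint(0, 1) for _ in range(cols)] for _ in range(n)]
X = expand([rng.randint(0, 1) for _ in range(n)])

# Each block computes its own slice of y; concatenation gives the full result.
y_full = gf2_matvec(A, X)
y_blocks = [bit for A_i in split_rows(A) for bit in gf2_matvec(A_i, X)]

# Word widths seen by the multiplexer for the 160-bit instance.
widths = [len(c) for c in chunk(expand([0] * 160))]
print(y_full == y_blocks, len(widths), widths[-1])
```

Because the rows of A are independent, the four blocks share the same expanded vector X and need no communication with each other, which is what allows them to run in parallel.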
The second component, "Hybrid ROM", contains 44 operative parts, and its implementation is shown in Figure 4. This part has 40 ROMs (corresponding to the 40 rows of the submatrix $A_i$) and a component "Parallel ROM". All of them are used in parallel. The outputs of these ROMs, together with the output of the "Divider", go directly into the components "Combinational Logic Gates 1" and "Combinational Logic Gates 2", as shown in Figure 4. The matrix-vector multiplication is realized in these two components by applying AND and XOR gates to the outputs of the ROMs and of the "Divider". The output of the component "Hybrid ROM" consists of two sequences of 40 bits each.
The role of the third component, "Hybrid DEMux", is to finalize the matrix-vector multiplication that is started in the component "Hybrid ROM". It has four operative parts, and its implementation is shown in Figure 5. The output from the "Hybrid ROM" component goes directly to "Hybrid DEMux". The term "Hybrid" comes from the specifics of the demultiplexer "DEMux", which has two inputs. The output of the DEMux goes to the component "Combinational Logic Gates 3" through the "Register Z" component. This register is used to keep the data synchronized (see [23]). The output of "Combinational Logic Gates 3" then goes to another synchronization register, "Register W", in order to keep the final output synchronized.
This implementation of the MQQ encryption is fully pipelined. It initially takes 82 cycles to encrypt 160 input bits, but after that the encryption engine can output 160 bits in every cycle.
IV. HARDWARE DESIGN OF THE DECRYPTION PROCEDURE
Figure 6 shows the complete architecture for the entire decryption procedure described in Table I. It has four main hardware components: "Private Matrix $T^{-1}$", "Sequencer", "Dobbertin" and "Private Matrix $S^{-1}$". The first and the fourth component are structurally the same (with different Boolean matrices T and S).
The structure of the first and the fourth component, "Private Matrix $T^{-1}$" and "Private Matrix $S^{-1}$", is shown in Figure 7. It implements Step 1 and Step 7 of the decryption algorithm described in Table I. It has 160 parts, and each part computes a dot product between two Boolean vectors of length 160. One vector is the input vector of 160 bits and the other vector is one row of the private matrix $T^{-1}$ or $S^{-1}$, stored as a fixed ROM. In this computation of the dot product, the operation "+" is a logical XOR (represented by the "Bit by bit XOR" components in Figure 7), and the multiplication is the logical AND operation (represented by the "AND Gate" parts in Figure 7). The outputs of these 160 parallel dot products go to the next phase through the component "Register X", which keeps the data synchronized (see [23]).
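The dot products computed by the private-matrix components can be modelled in software with bit-packed rows: an AND selects the terms and the parity of the result plays the role of the XOR tree. The following Python sketch is a minimal model under assumptions; the function names, the bit ordering (bit i of an integer is component i) and the tiny matrices are hypothetical and are not the private matrices of MQQ.

```python
def parity(v):
    """XOR of all bits of v (1 if the number of set bits is odd)."""
    return bin(v).count("1") & 1


def gf2_linear(rows, vec):
    """Apply a Boolean matrix to a bit vector.

    rows[i] is row i of the matrix packed into an integer; bit i of the
    result is the GF(2) dot product of rows[i] with vec.
    """
    out = 0
    for i, row in enumerate(rows):
        out |= parity(row & vec) << i
    return out


# Identity matrix of size 160: every output bit copies the matching input bit.
IDENTITY_160 = [1 << i for i in range(160)]

# A 2x2 matrix that is its own inverse over GF(2):
#   y0 = x0 + x1,  y1 = x1
SELF_INVERSE = [0b11, 0b10]

v = (1 << 159) | 0b1011
print(gf2_linear(IDENTITY_160, v) == v, gf2_linear([0b011, 0b110, 0b101], 0b101))
```

In hardware all rows are evaluated at once; the loop over `rows` in this model only stands in for that parallelism.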
The second component, "Dobbertin ROM", is shown in Figure 8. It takes the data that come from "Private Matrix $T^{-1}$". It changes just 13 bits, whose positions are determined in Step 2, Step 3 and Step 4 of the decryption algorithm described in Table I, and computes the inverse Dobbertin function in $GF(2^{13})$. This component is realized as a simple lookup in a ROM with $2^{13}$ entries, and it outputs 13 bits. The output of this component goes to the third component, "Sequencer".
The third component, "Sequencer", is shown in Figure 9. This is the most complex part, and it implements Step 5 and Step 6 of the decryption algorithm described in Table I. This part contains 32 5-bit registers, 2 multiplexers, a Master ROM, a control component, a "DEMux 1 × 32" component and two counters. Since there is feedback from the Master ROM to the multiplexer "Mux 2 × 1" when the selector of the multiplexer is 1, we cannot apply pipelining in this stage of the decryption.
The internal structure of the "Master ROM" component is shown in Figure 10. It has 8 ROMs that work in parallel, one control component, one 5-bit counter and one 8 × 1 multiplexer.
This implementation of the MQQ decryption takes 100 cycles to decrypt 160 input bits.
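The throughput figures quoted in the abstract follow directly from the cycle counts and clock frequencies reported in this paper. The short Python calculation below reproduces them; the function name is hypothetical, and the encryption figure assumes the steady state of the pipeline, where one 160-bit block leaves the engine per cycle.

```python
def throughput_bps(bits_per_block, cycles_per_block, clock_hz):
    """Sustained throughput in bits per second."""
    return bits_per_block * clock_hz / cycles_per_block


# Encryption: fully pipelined, one 160-bit block per cycle at 276.7 MHz.
enc_bps = throughput_bps(160, 1, 276.7e6)

# Decryption: not pipelined, 100 cycles per 160-bit block at 249.4 MHz.
dec_bps = throughput_bps(160, 100, 249.4e6)

# Initial latency of the encryption pipeline: 82 cycles at 276.7 MHz.
enc_latency_s = 82 / 276.7e6

print(f"encryption: {enc_bps / 1e9:.2f} Gbps")
print(f"decryption: {dec_bps / 1e6:.0f} Mbps")
print(f"encryption pipeline latency: {enc_latency_s * 1e9:.0f} ns")
```

The two throughputs differ by roughly a factor of 100 almost entirely because the feedback loop in the "Sequencer" prevents pipelining of the decryption.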
V. PERFORMANCE COMPARISON
We have realized the MQQ public key scheme in a Virtex-5 chip: xc5vfx70t-2-ff1136. The synthesis summary of our VHDL implementation of the 160-bit MQQ, produced by the Xilinx tool "ISE Foundation 10.1", is given in Table II. The MQQ public key scheme can be highly parallelized, and that property is clearly demonstrated by its hardware realization. On the other hand, all popular public key algorithms (RSA, DH, ECC, DSA, ECDSA) are essentially sequential in nature. Thus, on highly parallel hardware such as an FPGA, the speed difference between MQQ and those popular public key algorithms is about four orders of magnitude. More concretely, implemented in FPGA, MQQ is more than 10,000 times faster than DSA, RSA or ECDSA, and is comparable to or even faster than the symmetric block cipher AES.
In Table III we compare the speed of 160-bit MQQ with the speed of 1024-bit RSA realized in a Xilinx Virtex-5 FPGA chip by the company "Helion Technology Limited" [8]. In the same table we also give the speed of AES (128-bit key) realized by the same company in a Xilinx FPGA chip of the same Virtex-5 family.
VI. CONCLUSIONS
We have implemented in FPGA a 160-bit instance of the newly published public key scheme MQQ. The results of our implementation show that, in hardware, the MQQ public key algorithm can be as fast as a typical block cipher in encryption and decryption (and therefore also in verification and signing), and is several orders of magnitude faster than the most popular public key algorithms such as RSA, DH or ECC.
