Abstract: For many applications in cryptography and coding, finite field multiplication is the most resource- and time-consuming operation. In this paper, optimized high-performance parallel $GF(2^{233})$ multipliers for FPGA realization are designed, and their time and area complexities are analyzed. One of the multipliers uses a new hybrid structure to implement the Karatsuba algorithm. To increase performance, we make extensive use of pipelining and efficient control techniques and use a state-of-the-art FPGA technology. The result is, to our knowledge, the first hardware realization of subquadratic arithmetic and currently the fastest and most efficient implementation of 233-bit finite field multipliers.
INTRODUCTION
Arithmetic operations in finite fields are mainly used in cryptography and error control coding. Addition and multiplication are the two basic operations in the finite field $GF(2^m)$. Addition in $GF(2^m)$ is easily realized using $m$ two-input XOR gates, while multiplication is costly in terms of gate count and time delay. The other finite field operations, such as exponentiation, division, and inversion, can be performed by repeated multiplications. As a result, there is a need for a fast multiplication architecture with low complexity. The efficiency of a hardware or software implementation of finite field arithmetic is measured in terms of its space and time complexities. The space complexity is defined as the number of XOR and AND gates needed to implement the circuit, whereas the time complexity is the total gate delay of the circuit. The space and time complexities of a multiplier depend heavily on how the field elements are represented. An element of $GF(2^m)$ is usually represented with respect to one of three popular bases: the polynomial (canonical or standard) basis (PB), the dual basis (DB), and the normal basis (NB).
Especially in cryptography, where the extension degree of the finite field $GF(2^m)$ is fairly large, say $m > 160$, the selection of the multiplication algorithm has a major impact on the overall system performance. The selection of the finite field is based on the FIPS 186-2 standard for digital signature algorithms issued by NIST. This standard suggests five binary fields, namely those with extension degrees 163, 233, 283, 409, and 571, which are all prime extensions. We have selected $GF(2^{233})$ to satisfy the security requirements of elliptic curve cryptography for the next years, but our results can be adapted to finite fields with other prime extension degrees as well.
For cryptography, the requirements with respect to performance and security may change depending on the application. For this reason we use FPGAs as the target technology, since they offer the flexibility that ASIC designs lack. It turns out that many optimizations of field multipliers proposed for ASIC design do not hold for FPGAs. The main differences are:
• Influence of routing on the FPGA performance.
• 4-input lookup table technology instead of 2-input logic gates.
• Treatment of high-fanout nets on FPGAs.
We therefore decided to create completely new FPGA-optimized designs for the multipliers.
The paper is organized as follows. The next section gives an overview of related work. Section 3 gives a short introduction to the theory of operation of the classical, Karatsuba, and Massey-Omura multipliers. The architecture and FPGA implementation of these multipliers are described in detail in Section 4. The performance results and a comparison are given in Section 5.
RELATED WORK
Several works compare different hardware-based multiplier architectures for binary finite fields. The authors of [3] compared three known serial multipliers, namely the Berlekamp, the Massey-Omura, and a polynomial basis multiplier, and implemented them in VLSI for the small finite field $GF(2^8)$. [4] considers the VLSI implementation of parallel multipliers for a class of finite fields $GF(2^m)$ with extension degrees $m = 8, 16, 24,$ and $32$, which are not prime extension degrees and are believed to have security weaknesses.
[6] considers different parallel multipliers in $GF(2^4)$, which is suitable for coding applications. This work also considers hardware optimization techniques to improve the performance of the multipliers and makes some estimates that hold only for small finite fields. [7] gives a detailed comparison of different VLSI implementations of parallel multipliers in $GF(2^4)$. Indeed, all of the above works (except [4]) address small finite fields, and their results cannot easily be extended to larger fields.
With the development of new FPGA families with large gate counts, however, it is possible to realize parallel finite field multipliers on a single chip that perform the complete multiplication in a few clock cycles. It has therefore become necessary to analyze the performance of the multipliers for large field extensions (with $m > 160$) in order to select the best multiplier for a given application.
MULTIPLICATION IN THE BINARY FINITE FIELDS
There are several algorithms to multiply two finite field elements, and each of them has its benefits depending on the finite field size, the implementation type (hardware or software), and the time and area requirements. One of the main differences between these algorithms is the basis used to represent the finite field. In this section we give a brief introduction to different hardware-based finite field multipliers in $GF(2^m)$ along with their space and time complexities. Where the Hamming weight of the irreducible polynomial plays a significant role, we assume the existence of an irreducible trinomial of degree $m$ when considering multiplication in $GF(2^m)$. This is a reasonable assumption since our finite field is $GF(2^{233})$, and the polynomial $x^{233} + x^{74} + 1$ is irreducible. Moreover, it is conjectured that an irreducible trinomial of degree $m$ exists for a large number of values of $m$. The multipliers are categorized according to the finite field basis.
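The irreducibility assumption can be checked numerically. The following Python sketch applies Rabin's irreducibility test, specialised to prime degree, to a binary polynomial stored as an integer bit mask (bit i holds the coefficient of x^i). The function names and the encoding are choices made for this illustration and are not taken from the paper.

```python
def gf2_mul(a: int, b: int) -> int:
    """Carry-less product of two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_mod(a: int, f: int) -> int:
    """Remainder of a modulo f over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def gf2_gcd(a: int, b: int) -> int:
    """Greatest common divisor of two polynomials over GF(2)."""
    while b:
        a, b = b, gf2_mod(a, b)
    return a


def is_irreducible_prime_degree(f: int) -> bool:
    """Rabin's test for a polynomial f whose degree m is a prime >= 2.

    f is irreducible iff x^(2^m) = x (mod f) and gcd(x^2 + x, f) = 1.
    """
    m = f.bit_length() - 1
    x = 0b10
    t = x
    for _ in range(m):                    # m repeated squarings give x^(2^m)
        t = gf2_mod(gf2_mul(t, t), f)
    if t != x:
        return False
    return gf2_gcd(f, 0b110) == 1         # 0b110 encodes x^2 + x


# Trinomial x^233 + x^74 + 1 used for GF(2^233) in the text.
TRINOMIAL_233 = (1 << 233) | (1 << 74) | 1
```

Calling `is_irreducible_prime_degree(TRINOMIAL_233)` runs 233 modular squarings; the test as written is only valid when the degree is prime, which holds for all five NIST binary field degrees.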
Normal Basis Multipliers
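An element $\beta$ in $GF(2^m)$ is called a normal element when the elements of the set $\Gamma = \{\beta^{2^i} \mid 0 \le i < m\}$ are linearly independent over $GF(2)$. In this case, the set $\Gamma$ is called a normal basis (NB). One great advantage of normal bases is that squaring consists of only a cyclic shift (which requires no logic elements and can be done in nearly zero time). There are two types of normal bases for which effective multiplication methods exist, namely optimal normal bases of type I and type II. It is well known that a normal basis of $GF(2^m)$ over $GF(2)$ exists for all positive integers $m$. By finding an element $\beta \in GF(2^m)$ such that $\{\beta, \beta^2, \ldots, \beta^{2^{m-1}}\}$ is a basis of $GF(2^m)$ over $GF(2)$, any element $A \in GF(2^m)$ can be represented as
$$A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = a_0\beta + a_1\beta^2 + \cdots + a_{m-1}\beta^{2^{m-1}}, \tag{1}$$
where $a_i \in GF(2)$, $0 \le i \le m-1$, is the $i$th coordinate of $A$ with respect to the NB. In short, the normal basis representation of $A$ will be written as
$$A = (a_0, a_1, \ldots, a_{m-1}).$$
In vector notation, however, (1) can be written as
$$A = \mathbf{a}\,\boldsymbol{\beta}^T = \boldsymbol{\beta}\,\mathbf{a}^T,$$
where $\mathbf{a} = [a_0, a_1, \ldots, a_{m-1}]$, $\boldsymbol{\beta} = [\beta, \beta^2, \ldots, \beta^{2^{m-1}}]$, and $T$ denotes vector transposition.
The main advantage of the NB representation is that an element $A$ can be easily squared by applying a right cyclic shift to its coordinates, since
$$A^2 = a_{m-1}\beta + a_0\beta^2 + \cdots + a_{m-2}\beta^{2^{m-1}} = (a_{m-1}, a_0, \ldots, a_{m-2}).$$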
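The shift property of squaring can be verified exhaustively in a small field. The sketch below uses the toy field GF(2^3) built from x^3 + x^2 + 1, in which the root x is a normal element. The field, the helper names, and the list encoding of coordinates are assumptions made for this illustration only; the paper itself works in GF(2^233).

```python
M = 3            # toy extension degree (the paper targets m = 233)
F = 0b1101       # x^3 + x^2 + 1, irreducible over GF(2)
BETA = 0b010     # beta = x, a normal element of this toy field


def pb_mul(a: int, b: int) -> int:
    """Multiply two field elements given in the polynomial basis."""
    r = 0
    for i in range(M):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * M - 2, M - 1, -1):   # reduce modulo F, top bit first
        if (r >> i) & 1:
            r ^= F << (i - M)
    return r


# Normal basis [beta, beta^2, beta^4], each entry stored in the polynomial basis.
NORMAL_BASIS = [BETA]
for _ in range(M - 1):
    NORMAL_BASIS.append(pb_mul(NORMAL_BASIS[-1], NORMAL_BASIS[-1]))


def nb_to_pb(coords):
    """Convert normal basis coordinates [a_0, ..., a_{m-1}] to the polynomial basis."""
    value = 0
    for bit, basis_element in zip(coords, NORMAL_BASIS):
        if bit:
            value ^= basis_element
    return value


def nb_square(coords):
    """Squaring in the normal basis: right cyclic shift of the coordinates."""
    return coords[-1:] + coords[:-1]


def squaring_is_cyclic_shift() -> bool:
    """Check A^2 against the shifted coordinates for every field element."""
    for n in range(1 << M):
        coords = [(n >> i) & 1 for i in range(M)]
        value = nb_to_pb(coords)
        if pb_mul(value, value) != nb_to_pb(nb_square(coords)):
            return False
    return True
```

In hardware the shift is pure wiring, which is why the text describes normal basis squaring as nearly free.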
3.1.1. Massey-Omura Parallel Multiplier
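The Massey-Omura multiplier is one of the best-known multipliers working in the normal basis representation. It consists of identical blocks that work in parallel to generate output bits simultaneously. One great advantage of this multiplier is its flexibility as a serial-parallel multiplier: the designer can select an arbitrary number of blocks to obtain a different number of output bits per clock cycle, depending on the given area constraints. Let $A$ and $B$ be two elements of $GF(2^m)$ with NB coordinate vectors $\mathbf{a}$ and $\mathbf{b}$, and let $C$ be their product. Then
$$C = A \cdot B = (\mathbf{a}\,\boldsymbol{\beta}^T)(\boldsymbol{\beta}\,\mathbf{b}^T) = \mathbf{a}\,\mathbf{M}\,\mathbf{b}^T, \tag{4}$$
where the multiplication matrix $\mathbf{M}$ is defined by
$$\mathbf{M} = \boldsymbol{\beta}^T \boldsymbol{\beta} = \left[\beta^{2^i + 2^j}\right]_{i,j=0}^{m-1}. \tag{5}$$
If all entries of $\mathbf{M}$ are written with respect to the NB, then the following is obtained:
$$\mathbf{M} = \mathbf{M}_0\,\beta + \mathbf{M}_1\,\beta^2 + \cdots + \mathbf{M}_{m-1}\,\beta^{2^{m-1}}, \tag{6}$$
where the $\mathbf{M}_i$ are $m \times m$ matrices whose entries belong to $GF(2)$. By substituting (6) into (4), the coordinates of $C$ are found as follows:
$$c_i = \mathbf{a}\,\mathbf{M}_i\,\mathbf{b}^T = \mathbf{a}^{(i)}\,\mathbf{M}_0\,\mathbf{b}^{(i)T}, \quad 0 \le i \le m-1, \tag{7}$$
where
$$\mathbf{a}^{(i)} = [a_i, a_{i+1}, \ldots, a_{i-1}] \quad \text{and} \quad \mathbf{b}^{(i)} = [b_i, b_{i+1}, \ldots, b_{i-1}]$$
are, respectively, the $i$-fold left cyclic shifts of $\mathbf{a}$ and $\mathbf{b}$. It is not difficult to verify that the number of 1s in each $\mathbf{M}_i$, $0 \le i \le m-1$, is the same; this number is hereafter denoted by $C_N$. Since these nonzero entries of $\mathbf{M}_i$ determine the gate count of the normal basis multiplier, $C_N$ is referred to as the complexity of the NB.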
The coordinate $c_i$ in (7) can be written as a modulo-2 sum of exactly $C_N$ terms. Each of these terms is a modulo-2 product of exactly two coordinates (one of $A$ and one of $B$). Thus, the generation of $c_i$ requires $C_N$ multiplications and $C_N - 1$ additions over $GF(2)$. In hardware, this corresponds to $C_N$ AND gates and $C_N - 1$ XOR gates, assuming that all gates have two inputs. If these XOR gates are arranged as a binary tree, then the total gate delay to generate $c_i$ is $T_A + \lceil \log_2 C_N \rceil T_X$, where $T_A$ and $T_X$ are the delays of one AND gate and one XOR gate, respectively. For parallel generation of all $c_i$, $i = 0, 1, \ldots, m-1$, one needs $mC_N$ AND gates and $m(C_N - 1)$ XOR gates. The number of AND gates can be reduced to $m^2$ by reusing multiplication terms over $GF(2)$. Thus, to reduce the number of XOR gates, we have to choose a normal basis such that $C_N$ is minimal. It was proven that $C_N \ge 2m - 1$. If $C_N = 2m - 1$, then the NB is called an optimal normal basis (type I or type II).
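The gate counts above are simple arithmetic and can be tabulated directly. The following Python sketch evaluates them for a given $m$ and $C_N$; the function name and the dictionary keys are illustrative choices. The example value takes $C_N$ at its lower bound $2m - 1$, which presumes that an optimal normal basis is available for the chosen $m$.

```python
def nb_parallel_cost(m: int, c_n: int) -> dict:
    """Gate counts of a fully parallel normal basis multiplier (2-input gates)."""
    return {
        "and_gates": m * c_n,                    # one AND per term of every c_i
        "and_gates_shared": m * m,               # after reusing products a_j * b_k
        "xor_gates": m * (c_n - 1),              # one XOR tree of C_N inputs per c_i
        "xor_levels": (c_n - 1).bit_length(),    # ceil(log2(C_N)) tree levels
    }


# m = 233 with C_N = 2m - 1 = 465 (assumes an optimal normal basis).
COST_233 = nb_parallel_cost(233, 2 * 233 - 1)
```

With these inputs the XOR trees dominate: each of the 233 output bits needs a 465-input tree with 9 levels, while the AND count drops from 108345 to 54289 once shared products are reused.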
Polynomial Basis Multipliers
In this basis, each element is represented as a linear combination of different powers of a root of an irreducible polynomial. Multiplication in this basis consists of a polynomial multiplication followed by a modular reduction. There are different ways to multiply two elements in this basis, such as the Mastrovito, the classical, and the Karatsuba multipliers. Since there is only a small difference between the time and space complexities of the Mastrovito and the classical multipliers, we select the classical multiplier because of its regular structure and the possibility of pipelining, which is difficult to apply to the Mastrovito multiplier.
Classical Multiplier
The most straightforward method to perform finite field multiplication is to multiply the polynomials and then reduce the result modulo an irreducible polynomial. The school method polynomial multiplication requires $n^2$ AND gates and $(n-1)^2$ XOR gates (2-input each). The combinatorial propagation delay across a school method multiplier is $T = T_{AND} + \lceil \log_2 n \rceil T_{XOR}$. Reduction modulo the polynomial $f(x)$ can be done using $(r-1)(n-1)$ two-input XOR gates, where $r$ and $n$ are the Hamming weight and the degree of $f(x)$, respectively.
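A software model of the two steps, polynomial multiplication followed by modular reduction, may help to make the structure concrete. The Python sketch below assumes the trinomial $x^{233} + x^{74} + 1$ mentioned earlier in the text and stores polynomials as integers with bit i holding the coefficient of x^i. The function names are illustrative, and the code models the arithmetic only, not the gate-level circuit.

```python
M = 233                 # field degree
K = 74                  # middle exponent of the trinomial x^233 + x^74 + 1
MASK = (1 << M) - 1


def poly_mul(a: int, b: int) -> int:
    """School method product over GF(2): AND for partial products, XOR to add."""
    result = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            result ^= a << shift
        shift += 1
    return result


def reduce_trinomial(c: int) -> int:
    """Reduce c modulo x^233 + x^74 + 1 using x^233 = x^74 + 1."""
    while c >> M:
        high = c >> M                          # coefficients of x^233 and above
        c = (c & MASK) ^ high ^ (high << K)    # fold them back into the low part
    return c


def field_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^233) in the polynomial basis."""
    return reduce_trinomial(poly_mul(a, b))
```

For a trinomial the Hamming weight is $r = 3$, so the reduction formula in the text gives $(3-1)(233-1) = 464$ XOR gates, small compared with the $232^2 = 53824$ XOR gates of the polynomial multiplier.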
3.2.2. Karatsuba Multiplier
One approach to reducing the number of gates in polynomial basis multipliers is the Karatsuba method. In this method the number of multiplications is reduced, at the cost of increasing the number of additions and the total propagation delay. The method decreases the total number of gates from $O(n^2)$ to $O(n^{1.59})$, which is very effective when the polynomials become large. To achieve a trade-off between area and the propagation delay, which is long in Karatsuba multipliers, we use a hybrid structure that applies the Karatsuba multiplication formulas (see [10]) for polynomials of degree 1 and 2 in a hierarchical manner on top of school method multipliers of degree 39. This structure requires 28800 AND and 31183 XOR gates and has a total propagation delay of $T_{AND} + 14\,T_{XOR}$. The costs for a pure Karatsuba multiplier are 6561 AND gates, 37320 XOR gates, and a delay of $T_{AND} + 26\,T_{XOR}$; those for a school method multiplier are 54289 AND gates, 53824 XOR gates, and a delay of $T_{AND} + 8\,T_{XOR}$.
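Let the field $GF(2^m)$ be constructed using the irreducible polynomial $P(x)$ of degree $m = rn$, with $r = 2^k$, $k$ an integer. Let $A$, $B$ be two elements in $GF(2^m)$. Both elements can be represented in the polynomial basis as
$$A = \sum_{i=0}^{m-1} a_i x^i = x^{m/2} A^H + A^L, \qquad B = \sum_{i=0}^{m-1} b_i x^i = x^{m/2} B^H + B^L,$$
where $A^H$, $B^H$ collect the upper $m/2$ coefficients and $A^L$, $B^L$ the lower $m/2$ coefficients. Then, using the last two equations, the polynomial product is given as
$$C = x^m A^H B^H + x^{m/2}\left(A^H B^L + A^L B^H\right) + A^L B^L. \tag{3}$$
The Karatsuba algorithm is based on the idea that the product in the last equation can be equivalently written as
$$C = x^m A^H B^H + x^{m/2}\left(A^H B^H + A^L B^L + (A^H + A^L)(B^H + B^L)\right) + A^L B^L. \tag{4}$$
Let us define
$$M_A := A^L B^L, \qquad M_B := A^H B^H, \qquad M_C := (A^H + A^L)(B^H + B^L).$$
Using equation (4), and taking into account that the polynomial product $C$ has at most $2m - 1$ coordinates, we can classify its coordinates as
$$C^H = [c_{2m-2}, \ldots, c_m], \qquad C^L = [c_{m-1}, \ldots, c_0].$$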
Although (4) seems to be more complicated than (3), it is easy to see that (4) can be used to compute the product at a cost of four polynomial additions and three polynomial multiplications. In contrast, when using (3), one needs to compute four polynomial multiplications and three polynomial additions. Since polynomial multiplications are in general much more expensive than polynomial additions, it is valid to conclude that (4) is computationally simpler than (3). In this case it has been assumed that the block selected to implement the $GF(2^n)$ arithmetic has a gate delay of $T_{delay2^n}$ associated with it.
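The trade of one multiplication for extra additions can be sketched as a recursive routine that splits each operand into a high and a low half and falls back to the school method below a threshold. The Python code below is a minimal software model under stated assumptions: polynomials over GF(2) are integers (bit i is the coefficient of x^i), addition is XOR, and the names `schoolbook`, `karatsuba`, and the `threshold` parameter are chosen for this illustration rather than taken from the paper.

```python
def schoolbook(a: int, b: int) -> int:
    """School method product of two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba(a: int, b: int, bits: int, threshold: int = 40) -> int:
    """Product of two polynomials with at most `bits` coefficients each.

    Uses three half-size products per level instead of four and switches
    to the school method once the operands are no wider than `threshold`.
    """
    if bits <= max(threshold, 1):
        return schoolbook(a, b)
    half = (bits + 1) // 2
    mask = (1 << half) - 1
    a_low, a_high = a & mask, a >> half
    b_low, b_high = b & mask, b >> half
    low = karatsuba(a_low, b_low, half, threshold)                      # A^L * B^L
    high = karatsuba(a_high, b_high, half, threshold)                   # A^H * B^H
    mid = karatsuba(a_low ^ a_high, b_low ^ b_high, half, threshold)    # (A^H + A^L)(B^H + B^L)
    # Over GF(2), subtraction equals addition, so the middle term is a plain XOR.
    return (high << (2 * half)) ^ ((low ^ high ^ mid) << half) ^ low
```

Setting `threshold` to 1 gives a pure Karatsuba recursion, while a larger value corresponds to the hybrid idea of stopping the recursion early and using school method multipliers at the bottom.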
As mentioned above, the hybrid approach proposed here requires an efficient multiplier algorithm to perform the $n$-bit polynomial multiplications. It can be shown that the space and time complexities of the classical $n$-bit multiplier are given by
$$\#\text{XORs} = (n-1)^2; \qquad \#\text{ANDs} = n^2; \qquad \text{Delay} \le T_{AND} + T_X \lceil \log_2 n \rceil. \tag{8}$$
Combining the complexities given in equation (8) with the complexities of equation (7), we conclude that the space and time complexities of the hybrid $m$-bit Karatsuba multiplier truncated at the $n$-bit multiplicand level are upper bounded by
$$\#\text{XORs} \le \left(\frac{m}{n}\right)^{\log_2 3}\left(n^2 + 6n - 1\right) - 8m + 2;$$
$$\#\text{ANDs} \le 3^{\log_2 r} M_{and2^n} = \left(\frac{m}{n}\right)^{\log_2 3} n^2;$$
$$\text{Delay} \le T_{AND} + T_X\left(\log_2 n + 4\log_2 r\right). \tag{9}$$
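The bounds in (9) can be evaluated with integer arithmetic, because $(m/n)^{\log_2 3} = 3^{\log_2 r}$ when $r = m/n$ is a power of two. The following Python sketch does this; the function name and the returned keys are illustrative, and the delay term uses $\lceil \log_2 n \rceil$ so that it stays an integer when $n$ is not a power of two, which is an assumption of this sketch.

```python
def hybrid_karatsuba_bounds(m: int, n: int) -> dict:
    """Upper bounds (9) for an m-bit Karatsuba multiplier truncated at n bits."""
    r = m // n
    k = r.bit_length() - 1                  # k = log2(r) when r is a power of two
    if n * r != m or (1 << k) != r:
        raise ValueError("m must equal r * n with r a power of two")
    branches = 3 ** k                       # (m/n)^(log2 3) = 3^(log2 r)
    return {
        "xor_gates": branches * (n * n + 6 * n - 1) - 8 * m + 2,
        "and_gates": branches * n * n,
        "xor_levels": (n - 1).bit_length() + 4 * k,   # ceil(log2 n) + 4 log2 r
    }


# Pure Karatsuba on 256 bits: recursion down to single-bit operands.
PURE_256 = hybrid_karatsuba_bounds(256, 1)
```

For $m = 256$ and $n = 1$ the sketch yields 6561 AND gates and 37320 XOR gates, the figures quoted in the text for the pure Karatsuba multiplier. The delay term evaluates to 32 XOR levels, which is an upper bound and is looser than the $T_{AND} + 26\,T_{XOR}$ quoted there. For $r = 1$ the XOR bound collapses to $(n-1)^2$, in agreement with (8).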
FPGA IMPLEMENTATIONS OF PARALLEL MULTIPLIERS
This section presents the architectures of the parallel multipliers. The interface logic is the same for all multipliers, so the same test bench can be used and the designs are interchangeable.
Massey-Omura Multiplier
If implemented fully in parallel, the resource requirements of the Massey-Omura multiplier are very large (exceeding the LUT resources of our FPGA by about 7 percent), but it can be realized with any degree of parallelism between fully parallel and fully serial. We therefore use a semi-parallel implementation in which a multiplication is performed in two steps. As shown in Figure 1, the Massey-Omura multiplier consists of two cycshift stages with 117 outputs each. Output $n$ is the same as output $n - 1$ but cyclically rotated by one bit. The 117 rotated operand pairs are passed in parallel to 117 identical XOR trees (XOR-1 ... XOR-117) that compute the lower 117 bits of the result. The last outputs of the cycshift stages are fed back to the inputs via an operand register, so the second set of rotated operands, and with it the higher part of the result, is generated one clock cycle later.
4.2. The Classical Multiplier
The implemented classical multiplier consists of a polynomial multiplier followed by the modular reducer as shown in Figure 2 . Assuming that the polynomial 
Each of the rows of (1) has some elements which must be combined in an XOR tree to generate a single bit of the result. The rows $c_i$ and $c_{2n-2-i}$ for $0 \le i < n - 1$ are generated with tree-structured XOR circuits of identical length, but with different inputs. So we have a total of 465 XOR trees, of which 464 are pairwise equal in size.
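The tree sizes follow from counting the partial products $a_i b_j$ with $i + j = k$ that contribute to each product coefficient $c_k$. The Python sketch below lists these counts for $n = 233$; the function name is illustrative, and the assumption that each row of the product corresponds to one coefficient $c_k$ is inferred from the description above.

```python
def product_row_sizes(n: int) -> list:
    """Number of partial products a_i * b_j feeding each coefficient c_k."""
    return [min(k, 2 * n - 2 - k) + 1 for k in range(2 * n - 1)]


ROW_SIZES = product_row_sizes(233)

# A tree with t inputs needs t - 1 two-input XOR gates.
TOTAL_XOR_GATES = sum(size - 1 for size in ROW_SIZES)
```

The list has 465 entries and is symmetric around the middle row $c_{232}$, which has 233 inputs and no partner. The totals reproduce the school method figures: 54289 partial products and 53824 XOR gates.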
An earlier design of ours, which used only 233 trees but an additional 232 multiplexers, required more clock cycles and exceeded the FPGA resources, so we decided to use the full number of XOR trees.
4.3. The Hybrid Karatsuba Multiplier
We use a hybrid structure that combines the Karatsuba algorithms for 2 and 3 coefficients to obtain a Karatsuba algorithm for 6 coefficients. Furthermore, we use a new distributed control structure to implement the polynomial multiplication. The combination of these two Karatsuba methods has already been proposed for composite extension finite fields and for optimal extension fields, but to our knowledge this is the first time that such a combination has been implemented in hardware for prime extension finite fields. The block diagram of the complete multiplier is shown in Figure 3. The multiplier at the upper level consists of three 80-bit multipliers, adders, and an overlap circuit. Each of the multipliers is used twice during a polynomial multiplication to cover the total of six 80-bit × 80-bit multiplications. The control circuit starts the multipliers at the appropriate time to make use of the pipeline stages in the multipliers. It also controls the timing of the adders. Since the outputs of the different multipliers have some powers of z in common, the overlap circuit XORs the overlapping powers.
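The 6-coefficient structure can be modelled in software as a three-term Karatsuba step on 80-bit parts, each of which is computed by a two-term Karatsuba step on 40-bit school method products. The Python sketch below uses the well-known three-term Karatsuba identity with six products; the paper does not spell out its exact formulas, so the identity, the function names, and the call counter are assumptions of this illustration. Operands are taken to be padded to 240 bits.

```python
W = 40  # width of the base school method multipliers (degree 39)


def school40(a: int, b: int, counter: list) -> int:
    """40-bit school method product over GF(2); counts how often it is used."""
    counter[0] += 1
    result = 0
    for i in range(W):
        if (b >> i) & 1:
            result ^= a << i
    return result


def karatsuba2(a: int, b: int, counter: list) -> int:
    """80-bit product from three 40-bit products (two-term Karatsuba)."""
    mask = (1 << W) - 1
    a0, a1 = a & mask, a >> W
    b0, b1 = b & mask, b >> W
    d0 = school40(a0, b0, counter)
    d1 = school40(a1, b1, counter)
    d01 = school40(a0 ^ a1, b0 ^ b1, counter)
    return d0 ^ ((d01 ^ d0 ^ d1) << W) ^ (d1 << (2 * W))


def karatsuba3(a: int, b: int, counter: list) -> int:
    """240-bit product from six 80-bit products (three-term Karatsuba)."""
    w = 2 * W
    mask = (1 << w) - 1
    a0, a1, a2 = a & mask, (a >> w) & mask, a >> (2 * w)
    b0, b1, b2 = b & mask, (b >> w) & mask, b >> (2 * w)
    d0 = karatsuba2(a0, b0, counter)
    d1 = karatsuba2(a1, b1, counter)
    d2 = karatsuba2(a2, b2, counter)
    d01 = karatsuba2(a0 ^ a1, b0 ^ b1, counter)
    d02 = karatsuba2(a0 ^ a2, b0 ^ b2, counter)
    d12 = karatsuba2(a1 ^ a2, b1 ^ b2, counter)
    # Overlapping powers are combined by XOR, the role of the overlap circuit.
    return (d0
            ^ ((d01 ^ d0 ^ d1) << w)
            ^ ((d02 ^ d0 ^ d1 ^ d2) << (2 * w))
            ^ ((d12 ^ d1 ^ d2) << (3 * w))
            ^ (d2 << (4 * w)))


def reference_mul(a: int, b: int) -> int:
    """Plain school method product, used only to cross-check the result."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result
```

One full multiplication uses 6 × 3 = 18 base products, which matches the 18 × 40² = 28800 AND gates quoted for the hybrid structure in Section 3.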
5. CONCLUSION
This section presents a performance comparison based on the FPGA synthesis results. All multipliers were synthesized for a Xilinx xc2v-6000-ff1517-4 FPGA without pin mapping and area constraints. In subsequent synthesis iterations, we specified timing constraints of gradually increasing stringency in order to converge to an optimal timing. It should be noted that the clock cycle time includes the pad delays, since all multipliers are implemented as stand-alone designs. Table I compares the number of 4-input LUTs, the number of flip-flops, the equivalent gate count, and the clock period for each multiplier.
