Abstract: The efficient design of digit-serial multipliers for special binary composite fields $F_{2^{nm}}$, where $\gcd(n, m) = 1$, is presented. These composite fields can be constructed via an irreducible pentanomial of degree $nm$ but not via an irreducible trinomial of degree $nm$. The conventional construction method for such digit-serial multipliers exploits the simplicity of pentanomials to obtain efficient linear feedback shift registers together with AND-XOR arrays. In the approach presented here, these binary fields are instead constructed via irreducible trinomials of degree $m$ with respect to $F_{2^n}$, which in turn is constructed via an irreducible trinomial (Hybrid I) or pentanomial (Hybrid II) over $F_2$. The hybrid architecture is obtained by applying the bit-serial structure to the tower field and the bit-parallel structure to the ground field. Three kinds of multipliers (conventional, Hybrid I and Hybrid II) are implemented on the same FPGA device. Since at least one level is constructed via a trinomial instead of a pentanomial, the hybrid multipliers are 10-33% more efficient than the conventional ones according to post-place-and-route timing analysis with Xilinx ISE 7.1.
Introduction
In recent years, finite-field arithmetic has attracted increasing attention from researchers because of its extensive applications in error correction and in cryptographic algorithms adopted in the Internet and wireless communication systems. In particular, finite-field multiplication plays a central role in determining the efficiency of public key schemes such as elliptic curve cryptosystems (ECC). It is therefore imperative to design highly efficient multipliers. Our approach focuses on digit-serial multipliers over the binary composite fields $F_{2^{nm}}$ with polynomial basis representation. It is well known that composite fields, namely $F_{2^{nm}}$, yield efficient implementations if the basis is chosen wisely. They can be constructed via low Hamming weight irreducible polynomials of degree $nm$, such as trinomials or pentanomials, so that the field elements are represented via the corresponding polynomial basis over $F_2$. Alternatively, they can be constructed via a trinomial or pentanomial of degree $m$, so that the field elements are represented via the corresponding polynomial basis over the ground field $F_{2^n}$, which in turn can be constructed by a simple irreducible polynomial of degree $n$. One can derive either the traditional digit-serial multiplier with digit size $n$ according to the first method or the composite field multiplier according to the second [1].
FPGA technology provides designers with more flexibility and manoeuvrability at the bit level and is particularly suitable for applications with wide operand sizes. For an efficient FPGA implementation of a binary field multiplier, the designer needs to consider not only the circuit complexity in terms of gate count but also issues such as wire density and regularity of the circuit structure. Several published studies discuss the circuit complexity and latency of both multipliers from the point of view of ASICs [1-5], whereas we are the first to compare these two architectures in terms of FPGA implementations. We derive tables for selected composite fields $F_{2^{nm}}$ which can be constructed via a pentanomial of degree $nm$ but not via a trinomial. However, for these composite degrees, there exists at least one irreducible trinomial for either $F_{2^m}$ or $F_{2^n}$. Accordingly, we implement both conventional multipliers and hybrid multipliers for the composite fields listed in the tables. Both designs are ported to the same FPGA device, a Xilinx xc2v1000-5-ff896, for comparison in terms of timing and area. Section 2 briefly reviews the structure of the binary composite fields and the construction method. Section 3 introduces both the traditional digit-serial multiplier and the composite field multiplier. Section 4 focuses on the FPGA implementations of both designs and compares their performance and cost. Finally, conclusions are given in Section 5.
Mathematical background
We first introduce some notation (Fig. 1). Let $\alpha$ denote the generator of the polynomial basis of $F_{2^{nm}}$ with respect to $F_2$, let $\beta$ denote the generator of the polynomial basis of $F_{2^n}$ with respect to $F_2$, and let $\gamma$ denote the generator of the polynomial basis of $F_{2^m}$ with respect to $F_2$. Irreducible trinomials or pentanomials exist for every degree $n \le 10\,000$ and can be exploited to construct the binary fields with respect to $F_2$ [6]. In our approach, we assume that $\alpha$ is the root of an irreducible pentanomial of degree $nm$. The conventional representation uses the standard basis generated by $\alpha$, namely $B_s = \{1, \alpha, \alpha^2, \ldots, \alpha^{nm-1}\}$. For the composite field $F_{2^{nm}}$ where $\gcd(n, m) = 1$, the field elements can be represented in a different way. Let $\beta$ and $\gamma$ be roots of irreducible trinomials or pentanomials of degrees $n$ and $m$, respectively. Then the powers of $\gamma$ generate the polynomial basis $B_t = \{1, \gamma, \gamma^2, \ldots, \gamma^{m-1}\}$ of the tower field $F_{(2^n)^m}$ with respect to $F_{2^n}$, and the powers of $\beta$ generate the polynomial basis $B_g = \{1, \beta, \beta^2, \ldots, \beta^{n-1}\}$ of the ground field $F_{2^n}$ with respect to $F_2$. The proof of this theorem can be found in [7]. For instance, the elements of $F_{2^{2 \times 5}}$ can be represented via $B_s$ generated by $\alpha$, where $\alpha$ is the root of
Alternatively, the field elements can be represented via $B_t$ generated by $\gamma$ and $B_g$ generated by $\beta$, where $\gamma$ is the root of
For some degrees $nm$, pentanomials are the lowest Hamming weight irreducible polynomials available to construct $F_{2^{nm}}$. However, it is still possible to obtain a more efficient digit-serial multiplier by adopting the second representation, because there may exist at least one irreducible trinomial of degree $n$ or $m$. In Table 1, we list generating pentanomials $f_s(x)$ for the large fields and generating trinomials $f_t(x)$ and $f_g(x)$ for both the tower fields and the ground fields. In Table 2, the generating polynomial for $F_{2^n}$ is a pentanomial. The composite exponents are bounded by $80 \le nm \le 300$ because of the operand sizes required by cryptographic applications. These tables are organised as follows: a trinomial $x^n + x^k + 1$, where $n > k > 0$, is represented by the pair $n, k$, and a pentanomial $x^n + x^{k_3} + x^{k_2} + x^{k_1} + 1$, where $n > k_3 > k_2 > k_1 > 0$, is represented by the quadruple $n, k_3, k_2, k_1$ [6]. For example, the field $F_{2^{205}}$ has no trinomial basis and a pentanomial basis can be used in this case. The simplest pentanomial basis is generated by the roots of
However, since $205 = 5 \cdot 41$ and trinomial bases exist for both fields $F_{2^5}$ and $F_{2^{41}}$, we may realise efficient field arithmetic using the trinomial bases. For $80 \le nm \le 300$, there are approximately 30 composite fields which can be constructed in the first way and approximately 34 which can be constructed in the second way.
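Existence claims of this kind can be checked with a short script. The following is a minimal sketch, not the procedure used to build Tables 1 and 2: it stores a polynomial over $F_2$ as a Python integer (bit $i$ is the coefficient of $x^i$) and applies Ben-Or's irreducibility test. All function names are illustrative assumptions.

```python
# Polynomials over F_2 are stored as Python ints: bit i is the coefficient of x^i.

def poly_mod(a, f):
    """Remainder of a divided by f over F_2 (f must be nonzero)."""
    df = f.bit_length()
    while a.bit_length() >= df:
        a ^= f << (a.bit_length() - df)
    return a

def poly_mulmod(a, b, f):
    """Product a*b reduced modulo f, computed by shift-and-add."""
    a = poly_mod(a, f)
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a = poly_mod(a << 1, f)
    return r

def poly_gcd(a, b):
    """Greatest common divisor of two polynomials over F_2."""
    while b:
        a, b = b, poly_mod(a, b)
    return a

def is_irreducible(f):
    """Ben-Or test: f is irreducible iff it has no factor of degree <= deg(f)/2."""
    n = f.bit_length() - 1
    t = 2  # the polynomial x
    for _ in range(n // 2):
        t = poly_mulmod(t, t, f)        # t = x^(2^i) mod f
        if poly_gcd(f, t ^ 2) != 1:     # gcd(f, x^(2^i) + x) must be trivial
            return False
    return True

def irreducible_trinomials(n):
    """All k with 0 < k < n such that x^n + x^k + 1 is irreducible over F_2."""
    return [k for k in range(1, n)
            if is_irreducible((1 << n) | (1 << k) | 1)]

print(irreducible_trinomials(5))   # [2, 3]
print(irreducible_trinomials(41))
```

Calling `irreducible_trinomials(205)` runs the same search for the degree discussed above; in pure Python it takes noticeably longer than the small cases.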
Conventional and hybrid multipliers
A bit-parallel multiplier, which completes one multiplication in one clock cycle, achieves high operation speed. However, it is not suitable for direct use in public key cryptosystems such as ECC because its circuit complexity becomes large for large operand sizes. A bit-serial multiplier with an iterative structure sacrifices operation speed to gain area efficiency. By combining both structures, we can derive the digit-serial hybrid multiplier for some composite fields as claimed previously [1]; that is, the bit-serial structure is applied to the tower field and the bit-parallel structure to the ground field. On the other hand, we can also derive the digit-serial multiplier using the conventional polynomial basis representation [8-12]. We analyse these two architectures using concrete examples.
Conventional digit-serial multiplier
The standard polynomial basis generated by $\alpha$, $B_s$ (Fig. 1), can be used to represent the elements of $F_{2^{nm}}$. $\alpha$ is usually chosen as the root of an irreducible trinomial or pentanomial of degree $nm$. In our approach (Tables 1 and 2), $\alpha$ is the root of an irreducible pentanomial, that is

$$f_s(x) = x^{nm} + x^{k_3} + x^{k_2} + x^{k_1} + 1, \quad nm > k_3 > k_2 > k_1 > 0.$$

Therefore the standard technique for reduction, replacing $\alpha^{nm}$ with $\alpha^{k_3} + \alpha^{k_2} + \alpha^{k_1} + 1$, can be exploited to decrease the circuit complexity.
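The reduction rule can be illustrated in software. The sketch below is an assumption-laden illustration rather than the circuit of this paper: it uses the small, well-known irreducible pentanomial $x^8 + x^4 + x^3 + x + 1$ (not one of the polynomials in Tables 1 and 2), stores polynomials as integers with bit $i$ holding the coefficient of $x^i$, and uses hypothetical function names.

```python
def clmul(a, b):
    """Carry-less (polynomial) product of a and b over F_2, without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_pentanomial(c, n, k3, k2, k1):
    """Reduce c (degree <= 2n-2) modulo x^n + x^k3 + x^k2 + x^k1 + 1."""
    for i in range(2 * n - 2, n - 1, -1):    # highest degree first
        if (c >> i) & 1:
            c ^= 1 << i                      # remove x^i ...
            for k in (k3, k2, k1, 0):        # ... and add x^(i-n) * (x^k3 + x^k2 + x^k1 + 1)
                c ^= 1 << (i - n + k)
    return c

# Illustrative field: F_{2^8} with x^8 + x^4 + x^3 + x + 1.
product = clmul(0x57, 0x83)
print(hex(reduce_pentanomial(product, 8, 4, 3, 1)))  # 0xc1
```

Because every replacement term has a lower degree than the term it replaces, one pass from the highest degree downwards is sufficient.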
We use the left-to-right multiplication algorithm to develop the most significant digit-serial multiplier where the digit size is $n$. In Fig. 2, the two operands $a$ and $b$ and the product $c$ can be represented as follows

$$a = \sum_{i=0}^{nm-1} a_i \alpha^i, \quad b = \sum_{i=0}^{nm-1} b_i \alpha^i, \quad c = \sum_{i=0}^{nm-1} c_i \alpha^i,$$

where $a_i$, $b_i$ and $c_i \in F_2$. AND-XOR arrays are adopted to perform the computations of Steps 3-7 in Fig. 2 in parallel. There are two ways to develop such AND-XOR arrays. The first is to skip the modular operation for each $d_j$, keep the partial sum as $n + m$ bits and perform the modular operation only once at the end. The second is to perform the modular operation for each $d_j$ so that the partial sum is kept as $m$ bits instead of $n + m$. Even though more XOR gates are used, the wire density is decreased, which is better for applications with large operand sizes. We adopt the second structure. The linear feedback shift register (LFSR) structure is derived from Step 8.
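The following sketch models a generic most-significant-digit-first multiplication in software. It is not a transcription of Fig. 2: the function names, the integer encoding (bit $i$ is the coefficient of $\alpha^i$) and the small test field $F_{2^8}$ with $x^8 + x^4 + x^3 + x + 1$ are assumptions made for illustration. Each partial product is reduced before accumulation, which corresponds to the second of the two structures described above.

```python
def poly_mod(a, f):
    """Remainder of a divided by f over F_2."""
    df = f.bit_length()
    while a.bit_length() >= df:
        a ^= f << (a.bit_length() - df)
    return a

def digit_serial_mul(a, b, f, digit):
    """MSD-first digit-serial product of a and b modulo f over F_2."""
    deg = f.bit_length() - 1
    num_digits = -(-deg // digit)             # ceil(deg / digit)
    mask = (1 << digit) - 1
    c = 0
    for j in range(num_digits - 1, -1, -1):   # most significant digit first
        a_digit = (a >> (j * digit)) & mask
        c = poly_mod(c << digit, f)           # shift accumulator by one digit (LFSR role)
        part = 0
        for i in range(digit):                # AND-XOR array: a_digit * b
            if (a_digit >> i) & 1:
                part ^= b << i
        c ^= poly_mod(part, f)                # reduce the partial product, then accumulate
    return c

# The result should not depend on the digit size.
for d in (1, 2, 4, 8):
    print(d, hex(digit_serial_mul(0x57, 0x83, 0x11B, d)))
```

One loop iteration corresponds to one clock cycle of a digit-serial circuit, so a larger digit trades a wider AND-XOR array for fewer cycles.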
For clarity, we provide a concrete example, which is also used in the derivation of the hybrid digit-serial multiplier. We choose $n = 5$, $m = 9$ and the generating polynomial of (3), which can be used to derive the AND-XOR arrays as well as the LFSR structure for reduction.
The architecture is described in Fig. 3, where $\oplus$ denotes XOR gate arrays and $\otimes$ denotes AND gate arrays.
Hybrid digit-serial multiplier
According to the theorem stated in Section 2, we can develop a hybrid digit-serial multiplier by applying the bit-serial structure to the tower field and the bit-parallel structure to the ground field. In this paper, the tower field is constructed via an irreducible trinomial and the ground field is constructed via an irreducible trinomial or pentanomial. We illustrate the derivation via the same example used in Section 3.1.
The two operands $a$ and $b$ and the product $c$ in $F_{2^{nm}}$ can be represented as

$$a = \sum_{i=0}^{m-1} a_i \gamma^i, \quad b = \sum_{i=0}^{m-1} b_i \gamma^i, \quad c = \sum_{i=0}^{m-1} c_i \gamma^i,$$

where $a_i$, $b_i$ and $c_i \in F_{2^n}$. These coefficients can in turn be represented as

$$a_i = \sum_{j=0}^{n-1} a_{ij} \beta^j, \quad b_i = \sum_{j=0}^{n-1} b_{ij} \beta^j, \quad c_i = \sum_{j=0}^{n-1} c_{ij} \beta^j,$$

where $a_{ij}$, $b_{ij}$ and $c_{ij} \in F_2$. In our example, $m = 9$ and $n = 5$.
Derivation of the bit-serial structure:
The generating polynomial for the tower field is chosen as
According to this trinomial together with Fig. 2, we can derive the bit-serial structure for the tower field shown in Fig. 4, where $\oplus$ denotes the adder over $F_{2^5}$, that is five bitwise XORs, and $\otimes$ denotes the bit-parallel multiplier over $F_{2^5}$. Three modifications of Fig. 2 are needed: the coefficients of $a$, $b$ and $c$ lie in $F_{2^n}$ instead of $F_2$, the generating polynomial is $f_t(x)$ instead of $f_s(x)$, and the digit size is 1 instead of $n$.
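The bit-serial tower structure can be modelled in a few lines of software. The sketch below is illustrative only and makes several assumptions not taken from this paper: it uses the toy parameters $n = 2$, $m = 3$ with ground polynomial $x^2 + x + 1$ and tower trinomial $x^3 + x + 1$ (so that $\gcd(n, m) = 1$) instead of the $n = 5$, $m = 9$ example, represents a tower element as a list of $m$ ground-field elements, and uses hypothetical function names.

```python
def ground_mul(a, b, n, f_g):
    """Bit-parallel style product in F_{2^n}: full product, then reduction by f_g."""
    r = 0
    for i in range(n):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * n - 2, n - 1, -1):
        if (r >> i) & 1:
            r ^= f_g << (i - n)       # f_g includes its x^n term, so bit i is cleared
    return r

def tower_mul(a, b, m, k, n, f_g):
    """Bit-serial product over F_{2^n} modulo the tower trinomial x^m + x^k + 1.

    a and b are lists of m ground-field elements, index i = coefficient of gamma^i.
    """
    c = [0] * m
    for i in range(m - 1, -1, -1):    # one coefficient of a per clock cycle
        top = c[m - 1]
        c = [top] + c[:m - 1]         # multiply c by gamma; gamma^m -> gamma^k + 1
        c[k] ^= top
        for j in range(m):            # m ground-field multipliers working in parallel
            c[j] ^= ground_mul(a[i], b[j], n, f_g)
    return c

# Toy example: gamma * gamma^2 = gamma^3 = gamma + 1.
print(tower_mul([0, 1, 0], [0, 0, 1], m=3, k=1, n=2, f_g=0b111))  # [1, 1, 0]
```

The inner loop over `j` corresponds to the $m$ bit-parallel ground-field multipliers, and the shift with feedback at positions 0 and `k` corresponds to the trinomial feedback network.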
Derivation of the bit-parallel structure:
The generating polynomial for the ground field is chosen as
Let $a$, $b$ and $c = a \cdot b$ denote elements of $F_{2^n}$ in the polynomial basis generated by $\beta$. Let $a_s$, $b_t$ and $d_j$ denote the corresponding coefficients in $F_2$, where $0 \le s, t \le 4$ and $0 \le j \le 8$. By Mastrovito's method [13, 14]

$$c = a \cdot b = \sum_{j=0}^{8} d_j \beta^j, \quad d_j = \sum_{s+t=j} a_s b_t,$$

where $d_j \in F_2$. By (7), we can then obtain the coefficients $c_j$ as follows.
The bit-parallel multiplier over $F_{2^5}$ is shown in Fig. 5, where $\odot$ denotes an AND gate and $\oplus$ denotes an XOR gate. More details can be found in [2, 4].
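A Mastrovito-style multiplier can be sketched in software as a binary matrix-vector product. The code below is an illustration under stated assumptions: it takes $x^5 + x^2 + 1$ as the generating trinomial of $F_{2^5}$ purely as an example (the polynomial chosen in this paper is not reproduced here), and the function names are hypothetical. Column $t$ of the product matrix holds $a \cdot \beta^t$ reduced modulo the generating polynomial, so that $c = M(a)\, b$.

```python
N = 5
F_G = 0b100101   # x^5 + x^2 + 1, assumed for illustration only

def mul_by_beta(a):
    """Multiply a by beta and reduce modulo F_G."""
    a <<= 1
    if a >> N:
        a ^= F_G
    return a

def mastrovito_matrix(a):
    """Rows of the N x N product matrix M(a); bit t of row j is M[j][t]."""
    cols = []
    col = a
    for _ in range(N):
        cols.append(col)              # column t = a * beta^t mod F_G
        col = mul_by_beta(col)
    return [sum(((cols[t] >> j) & 1) << t for t in range(N)) for j in range(N)]

def mastrovito_mul(a, b):
    """c = M(a) * b over F_2: AND gates per row followed by an XOR tree."""
    c = 0
    for j, row in enumerate(mastrovito_matrix(a)):
        c |= (bin(row & b).count("1") & 1) << j
    return c

def schoolbook_mul(a, b):
    """Reference: shift-and-add product with interleaved reduction."""
    r = 0
    for t in range(N):
        if (b >> t) & 1:
            r ^= a
        a = mul_by_beta(a)
    return r

# beta^4 * beta = beta^5 = beta^2 + 1
print(bin(mastrovito_mul(0b10000, 0b00010)))  # 0b101
```

In hardware the matrix entries are fixed XOR combinations of the bits of $a$, which is why the whole product is available in a single clock cycle.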
Next we derive the general pentanomial bit-parallel multiplier, since it is the most complicated case for our hybrid multipliers. Suppose that the partial product $d = \sum_{i=0}^{2n-2} d_i \beta^i$ has been computed via Mastrovito's method. Then we need to perform reductions for the degrees $i \ge n$. If $i - n + k_3 < n$ then we have

$$\beta^i = \beta^{i-n+k_3} + \beta^{i-n+k_2} + \beta^{i-n+k_1} + \beta^{i-n}.$$

If $i - n + k_2 < n \le i - n + k_3$ then two reductions are necessary.
If $i - n + k_1 < n \le i - n + k_2$ then one more reduction is needed.
Similarly, if $n \le i - n + k_1$ then four reductions in total are necessary.
Since $k_3 < n/2$, at most four reductions are needed. On the basis of the above equations, we can derive the bit-parallel pentanomial multiplier. The same method can also be applied to the bit-parallel trinomial multiplier.
Complexity analysis for both architectures
We first define the notation used in the following analysis. $T_A$ denotes the delay of a two-input AND gate and $T_X$ denotes the delay of a two-input XOR gate. The complexity comparisons in terms of ASIC estimation are summarised in Table 3, where Hybrid I denotes the hybrid multiplier in which the ground field is constructed via a trinomial and Hybrid II denotes the one in which the ground field is constructed via a pentanomial. For the conventional architecture, the total number of two-input AND gates is $mn^2$ ($n$ is the digit size) and the number of two-input XOR gates is determined by three parts: the feedback network, the modular components for $a_i b$ and the partial sum in Step 8 of Fig. 2. For Hybrid I, a bit-parallel multiplier in $F_{2^n}$ over $F_2$ can be built with at most $n^2$ AND gates and $n^2 - 1$ XOR gates [2]. For Hybrid II, a bit-parallel multiplier can be built with at most $n^2$ AND gates and $n^2 + 2n - 3$ XOR gates [5]. The hybrid multiplier contains $m$ bit-parallel multipliers together with $(m + 1)n$ XOR gates, given that the tower field is constructed via an irreducible trinomial in our approach.
The difference in latency between the conventional and hybrid architectures is not large. If $m \ll n$, the hybrid architecture is more efficient in terms of area. The wire density due to the feedback network is lower in the hybrid multipliers. Additionally, the hybrid multipliers are more regular, since all bit-parallel components have the same structure.
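The hybrid gate-count bounds stated above can be evaluated directly. The helper below is a small sketch whose function name and interface are assumptions; it encodes only the upper bounds quoted in the text ($m$ bit-parallel multipliers plus $(m + 1)n$ XOR gates) and does not reproduce the entries of Table 3.

```python
def hybrid_gate_bounds(n, m, ground="trinomial"):
    """Upper bounds on two-input AND and XOR gates for the hybrid multiplier."""
    and_gates = m * n * n                     # m bit-parallel multipliers, n^2 ANDs each
    if ground == "trinomial":                 # Hybrid I
        xor_per_multiplier = n * n - 1
    elif ground == "pentanomial":             # Hybrid II
        xor_per_multiplier = n * n + 2 * n - 3
    else:
        raise ValueError("ground must be 'trinomial' or 'pentanomial'")
    xor_gates = m * xor_per_multiplier + (m + 1) * n
    return and_gates, xor_gates

# Example parameters n = 5, m = 9 from Section 3.
print(hybrid_gate_bounds(5, 9, "trinomial"))    # (225, 266)
print(hybrid_gate_bounds(5, 9, "pentanomial"))  # (225, 338)
```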
FPGA implementations
We choose parameters according to Tables 1 and 2 for our experiments. All multipliers are implemented on a Xilinx xc2v1000-5-ff896, which contains 15 360 slice flip-flops, 15 360 4-input look-up tables (LUTs) and 7680 configurable logic blocks (CLBs). Both synthesis and place-and-route are completed with Xilinx ISE 7.1.
The performance comparisons of the FPGA implementation results after placing and routing are summarised in Tables 4 and 5. Comparisons of the product of latency and area are shown in Figs. 6 and 7. The optimisation goal for synthesis is speed instead of area, so the registers are replicated about three times on average to shorten the critical path for both architectures. However, we find that fewer registers are used in the hybrid multipliers than in the conventional ones. This is because the hybrid multiplier is more regular and the wire density in its feedback network is lower. In Figs. 6 and 7, we can see that the hybrid multipliers are 10-33% more efficient than the conventional ones in terms of the product of latency and area.
Conclusions
Efficient realisation of digit-serial multipliers for binary fields is important to applications such as cryptography and coding theory. It has been claimed that the properties of composite fields $F_{2^{nm}}$ can be used to derive more efficient multipliers; however, to date there has been very little empirical evidence, in particular for FPGA implementations, to support this view. In our approach, we concentrate on those composite fields which can be constructed via a pentanomial of degree $nm$ but not via a trinomial of degree $nm$, and in which the tower field is constructed via a trinomial of degree $m$ and the ground field is constructed via either a trinomial or a pentanomial of degree $n$. We investigate both conventional and hybrid digit-serial multipliers for these special binary composite fields. When $m \ll n$, the hybrid multiplier is more efficient than the conventional one in terms of gate count. Furthermore, both architectures are implemented on the same FPGA device. Fewer registers are replicated to shorten the critical path in the hybrid multiplier than in the conventional one, because the hybrid architecture is more regular and the wire density in its feedback network is considerably lower. The hybrid multipliers are generally 10-33% more efficient than the conventional ones in terms of the product of latency and area. We therefore conclude that, for such special binary composite fields, the hybrid architecture is more efficient in terms of area, regularity and wire density, which makes it more suitable for FPGA realisations than the conventional architecture.
