Abstract: Hardware implementations of arithmetic operations over binary finite fields $GF(2^m)$ are widely used in several important applications, such as cryptography, digital signal processing and error-control codes. In this paper, efficient reconfigurable implementations of bit-parallel canonical basis multipliers over binary fields generated by type II irreducible pentanomials $f(y) = y^m + y^{n+2} + y^{n+1} + y^n + 1$ are presented. These pentanomials are important because all five binary fields recommended by NIST for ECDSA can be constructed using such polynomials. In this work, a new approach for $GF(2^m)$ multiplication based on type II pentanomials is given, and several post-place and route implementation results on a Xilinx Artix-7 FPGA are reported. Experimental results show that the proposed multiplier implementations improve the area×time metric when compared with similar multipliers found in the literature.
I. INTRODUCTION
Galois or finite fields $GF(2^m)$ have been widely studied due to their use in important applications, such as cryptography, digital signal processing and error-control codes. These applications frequently require efficient hardware implementations of $GF(2^m)$ arithmetic operations, of which multiplication is the most complex and important one. Efficient multiplication methods and architectures have been proposed for different representation bases, of which the canonical or polynomial basis is the most widely used. Apart from the basis selection, the complexity of the multiplication also depends on the defining irreducible polynomial $f(y)$ selected for the field. For $GF(2^m)$ hardware implementation, irreducible trinomials and pentanomials are normally used. Classic two-step polynomial basis multiplication in $GF(2^m)$ involves a polynomial multiplication followed by a reduction modulo an irreducible polynomial. An efficient bit-parallel canonical multiplier was proposed by Mastrovito [1], in which a product matrix combines these two steps [2], [3], [4], [5]. A new polynomial basis multiplication scheme was proposed in [6], where the decomposition of a product matrix led to the introduction of the functions $S_i$ and $T_i$, each given by a sum of product terms. The addition of these functions is used to compute the product of two $GF(2^m)$ elements. The sums of products in $S_i$ and $T_i$ were split in [7] into sums of $2^j$ product terms, implemented as binary trees of XOR gates with depth $j$. Adding binary trees of the same depth in pairs leads to a reduction of the multiplication delay.
In this paper, efficient Xilinx FPGA implementations of $GF(2^m)$ bit-parallel canonical basis multipliers based on type II pentanomials $f(y) = y^m + y^{n+2} + y^{n+1} + y^n + 1$ are presented. In order to optimize the synthesis, a new approach for the product is considered in which the splitting of the $S_i$ and $T_i$ terms is retained, but the restriction imposed by adding binary trees of the same depth in pairs is removed. In this case, the Xilinx XST tool is free to optimize the synthesis of the multiplier. Several $GF(2^m)$ multipliers, including the specific field $GF(2^8)$, have been described in VHDL, and post-place and route implementation results on Xilinx Artix-7 are reported. The field $GF(2^8)$ is especially important because it has been standardized for space communication by NASA and ESA and is used in CD players and the Advanced Encryption Standard (AES). Experimental results show that the proposed approach for multiplication improves the area×time complexity when compared with similar multiplication methods found in the literature. Furthermore, the new approach also presents the lowest delay for most of the multipliers implemented here.
II. BACKGROUND
Any element $A$ of the binary field $GF(2^m)$ can be represented in the canonical or polynomial basis $\{1, x, \ldots, x^{m-1}\}$, where $x$ is a root of the irreducible polynomial $f(y) = \sum_{i=0}^{m} f_i y^i$. Canonical basis multiplication in $GF(2^m)$ involves a polynomial multiplication followed by a reduction modulo the irreducible polynomial. An efficient bit-parallel canonical multiplication method was proposed by Mastrovito [1], in which a product matrix combines these two steps. In [6], a new $GF(2^m)$ polynomial basis multiplication approach was used. In order to compute the product $C = A \cdot B$, this method introduced the functions $S_i$ and $T_i$ given by the addition of terms
, respectively. These functions are implemented as binary trees of 2-input XOR gates with a lower level of 2-input AND gates (corresponding to the a i b j products). The expressions for [6] :
where p = i/2 and q = ( m/2 + i/2 ). In (1), the term x p = a p b p only appears for i odd and x q only appears for (m and i even) or for (m and i odd). In this case, r = q. Otherwise, i.e., for (m even and i odd) or for (m odd and i even), the term x q does not appear and r = ( m/2 + i/2 ). For example, using (1), the terms S i and
The product C = A · B can then be computed as the addition of these terms.
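The idea of building the product from $S_i$ and $T_i$ terms can be sketched in Python. This is only an illustration under stated assumptions: the indexing used here ($S_i$ collects the products $a_j b_k$ with $j + k = i - 1$, and $T_i$ those with $j + k = m + i$) is an assumption chosen to be consistent with the parity remarks above, not a quotation of the expressions in [6]. The helper names are hypothetical, and the example field is $GF(2^8)$ with $f(y) = y^8 + y^4 + y^3 + y^2 + 1$.

```python
M = 8
F = 0b100011101  # f(y) = y^8 + y^4 + y^3 + y^2 + 1; bit k is the coefficient of y^k


def s_t_terms(a, b, m=M):
    """Return (S, T) under the ASSUMED indexing:
    S[i] = XOR of a_j b_k with j + k = i - 1, for 1 <= i <= m
    T[i] = XOR of a_j b_k with j + k = m + i, for 0 <= i <= m - 2
    """
    abit = [(a >> k) & 1 for k in range(m)]
    bbit = [(b >> k) & 1 for k in range(m)]
    d = [0] * (2 * m - 1)  # coefficients of the unreduced polynomial product
    for j in range(m):
        for k in range(m):
            d[j + k] ^= abit[j] & bbit[k]
    S = {i: d[i - 1] for i in range(1, m + 1)}
    T = {i: d[m + i] for i in range(0, m - 1)}
    return S, T


def reduce_power(e, f=F, m=M):
    """Return x^e mod f(y) as a bit mask."""
    r = 1
    for _ in range(e):
        r <<= 1
        if (r >> m) & 1:
            r ^= f
    return r


def multiply_from_terms(a, b, f=F, m=M):
    """Add the S_i terms directly and the T_i terms after reducing x^(m+i)."""
    S, T = s_t_terms(a, b, m)
    c = 0
    for i, bit in S.items():
        if bit:
            c ^= 1 << (i - 1)
    for i, bit in T.items():
        if bit:
            c ^= reduce_power(m + i, f, m)
    return c


def multiply_reference(a, b, f=F, m=M):
    """Shift-and-add multiplication, used only as an independent check."""
    c = 0
    for k in range(m):
        if (b >> k) & 1:
            c ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return c
```

For instance, `multiply_from_terms(0x02, 0x80)` multiplies $x$ by $x^7$ and returns `0x1D`, the bit mask of $x^4 + x^3 + x^2 + 1$, which is $x^8$ reduced modulo $f(y)$.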
For hardware implementation of $GF(2^m)$ multiplication, low Hamming weight irreducible polynomials, such as trinomials and pentanomials, are normally used. Type II irreducible pentanomials [5] $f(y) = y^m + y^{n+2} + y^{n+1} + y^n + 1$, with $2 \le n \le m/2 - 1$, are important because they are abundant and all five binary fields recommended by NIST for ECDSA can be constructed using such irreducible polynomials. Canonical basis multiplication for these pentanomials was studied in [6], where expressions for the coefficients of the product were given using $S_i$ and $T_i$ terms. For the specific field $GF(2^8)$ generated by the polynomial $f(y) = y^8 + y^4 + y^3 + y^2 + 1$, with $(m, n) = (8, 2)$, the coefficients computed as in [6] are given in Table I.

TABLE I
COEFFICIENTS OF THE PRODUCT
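As a hedged illustration of the polynomial family described above, the following Python sketch builds the bit mask of a type II pentanomial from $(m, n)$ and uses it in a product-matrix style multiplication. All function names are hypothetical. The range check assumes the bound on $n$ is meant as $\lfloor m/2 \rfloor - 1$, and the constructor does not test irreducibility; the separate trial-division check is only practical for small $m$.

```python
def type2_pentanomial(m, n):
    """Bit mask of f(y) = y^m + y^(n+2) + y^(n+1) + y^n + 1 (irreducibility NOT checked)."""
    if not 2 <= n <= m // 2 - 1:  # assumed reading of the bound on n
        raise ValueError("expected 2 <= n <= floor(m/2) - 1")
    return (1 << m) | (1 << (n + 2)) | (1 << (n + 1)) | (1 << n) | 1


def poly_mod(p, d):
    """Remainder of the GF(2) polynomial p divided by d, both given as bit masks."""
    dd = d.bit_length()
    while p.bit_length() >= dd:
        p ^= d << (p.bit_length() - dd)
    return p


def is_irreducible_small(f):
    """Trial division by every polynomial of degree 1..deg(f)//2 (small degrees only)."""
    deg = f.bit_length() - 1
    return all(poly_mod(f, d) != 0 for d in range(2, 1 << (deg // 2 + 1)))


def product_matrix(a, f, m):
    """Columns of a product matrix for a: column j is the bit mask of a * x^j mod f."""
    cols = []
    for _ in range(m):
        cols.append(a)
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return cols


def matrix_multiply(cols, b):
    """Compute c = M * b over GF(2): XOR the columns selected by the bits of b."""
    c = 0
    for j, col in enumerate(cols):
        if (b >> j) & 1:
            c ^= col
    return c
```

With `f = type2_pentanomial(8, 2)`, the mask equals `0x11D`, matching $y^8 + y^4 + y^3 + y^2 + 1$. Folding the reduction into the matrix columns is what lets multiplication and reduction be treated as a single step.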
One of the obstacles to reducing the delay of the product is the monolithic construction of the $S_i$ and $T_i$ functions, given by a sum of terms $x_k$ and z Table II.

TABLE II

From Table II, it can be observed that the terms $S_i^j$ and $T_i^j$ perform the addition of $2^j$ products, so they can be implemented as $j$-level complete binary trees of XOR gates. Using these terms, expressions for the $GF(2^8)$ multiplier based on the splitting method introduced in [7] for type II irreducible pentanomials are given in Table III, in Table III , that appears in the coefficients $c_0$ and $c_2$).
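The observation that a sum of $2^j$ products maps onto a $j$-level complete binary tree of XOR gates can be illustrated with a short Python sketch. The grouping rule used here, which splits a list of product terms into consecutive blocks whose sizes follow the binary expansion of the list length, is an assumption made for illustration; the exact grouping used in [7] is not reproduced here. Function names are hypothetical.

```python
def split_into_power_of_two_blocks(terms):
    """Split terms into consecutive blocks of size 2^j, largest block first.

    Block sizes follow the binary expansion of len(terms) (ASSUMED grouping).
    Returns a list of (j, block) pairs.
    """
    blocks, start, n = [], 0, len(terms)
    for j in reversed(range(n.bit_length())):
        if (n >> j) & 1:
            blocks.append((j, terms[start:start + (1 << j)]))
            start += 1 << j
    return blocks


def xor_tree(values):
    """Evaluate a complete binary XOR tree over 2^j values; return (result, depth)."""
    depth = 0
    while len(values) > 1:
        values = [values[k] ^ values[k + 1] for k in range(0, len(values), 2)]
        depth += 1
    return values[0], depth
```

For a sum of seven product bits, the split yields blocks of 4, 2 and 1 terms, whose trees have depths 2, 1 and 0. XOR-ing the three block results gives the same bit as XOR-ing all seven terms directly.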
Theoretical complexities of bit-parallel multipliers based on type II pentanomials were given in [7]. For the $GF(2^8)$ multiplier shown in Table III, the delay complexity is $T_A + 5T_X$, where $T_A$ and $T_X$ represent the delay of a 2-input AND gate and a 2-input XOR gate, respectively. This theoretical delay is the lowest among similar $GF(2^8)$ multipliers, such as those given in [6] and [3], with delays $T_A + 6T_X$ and $T_A + 7T_X$, respectively. The space complexity (number of 2-input AND and XOR gates) of the multiplier given in Table III is 64 AND and 87 XOR gates. In this case, the theoretical number of XOR gates is greater than in [6] and [3], which require 80 and 77 XOR gates, respectively, while the number of 2-input AND gates is the same in all approaches.
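The trade-off in these theoretical figures can be tabulated with a small Python sketch. The gate counts and XOR levels are the ones quoted in the paragraph above; the class and variable names are hypothetical, and the unit gate delays in the usage note are placeholders rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complexity:
    label: str
    and_gates: int
    xor_gates: int
    xor_levels: int  # theoretical delay is T_A + xor_levels * T_X

    def total_gates(self):
        return self.and_gates + self.xor_gates

    def delay(self, t_and, t_xor):
        return t_and + self.xor_levels * t_xor


# Theoretical GF(2^8) figures as quoted in the text.
GF256 = [
    Complexity("split S/T terms [7]", 64, 87, 5),
    Complexity("S/T functions [6]", 64, 80, 6),
    Complexity("multiplier of [3]", 64, 77, 7),
]
```

Taking placeholder unit delays `t_and = t_xor = 1`, the three delays evaluate to 6, 7 and 8 gate delays, while the total gate counts are 151, 144 and 141. In theory, the splitting method therefore trades 10 extra XOR gates relative to [3] for two fewer XOR levels.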
The expressions given in Table III for the coefficients of the $GF(2^8)$ polynomial basis multiplier impose hard restrictions (given by the parentheses) on the order in which the different terms are added, in order to reduce the number of XOR levels and therefore the delay of the multiplier. However, these restrictions may prevent a synthesis tool from mapping those expressions efficiently into the FPGA's logic blocks. In such a case, more freedom should be given to the synthesizer to find an optimized implementation of the multiplier.
In Table IV, Table II ) has been used, but the restriction imposed on the product by the parenthesized addition of terms with the same $j$-level has been removed. The coefficients of the product are then given as sums of individual $S_i^j$ and $T_i^j$ terms, and the synthesis tool is free to perform an optimized implementation of the multiplier.
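Removing the parentheses does not change the function being computed, only the structure handed to the synthesizer, because XOR is associative and commutative. The Python sketch below illustrates this with two hypothetical groupings of the same operands: a left-to-right chain and a balanced tree. It makes no claim about how XST actually maps the logic onto LUTs.

```python
from itertools import product


def chain_xor(bits):
    """Left-to-right chain of 2-input XORs; return (result, depth in XOR levels)."""
    acc, depth = bits[0], 0
    for b in bits[1:]:
        acc ^= b
        depth += 1
    return acc, depth


def balanced_xor(bits):
    """Balanced tree of 2-input XORs; return (result, depth in XOR levels)."""
    level, depth = list(bits), 0
    while len(level) > 1:
        nxt = [level[k] ^ level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd operand passes through to the next level
        level, depth = nxt, depth + 1
    return level[0], depth


def groupings_agree(n):
    """Exhaustively check that both groupings give the same bit for all 2^n inputs."""
    return all(
        chain_xor(list(bits))[0] == balanced_xor(list(bits))[0]
        for bits in product((0, 1), repeat=n)
    )
```

For six operands the two groupings agree on all 64 input combinations, yet the chain has depth 5 and the balanced tree depth 3. An unparenthesized sum leaves the choice among such equivalent structures to the tool.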
IV. FPGA IMPLEMENTATION RESULTS
The $GF(2^8)$ polynomial basis multiplier given in Table IV has been implemented on a Xilinx Artix-7 XC7A200T-FFG1156 device. The design entry was behavioral VHDL, and the experimental results are those reported by Xilinx ISE 14.7 using the XST synthesizer. Furthermore, the same pin assignments and speed-high optimization settings were used for all designs. In order to compare the proposed $GF(2^8)$ multiplier with other similar approaches, VHDL descriptions of different multipliers have also been implemented. The methods described and compared are the Mastrovito approaches given in [2] and [3], the bit-parallel version of the multiplier presented in [8], the multiplier given in [6] that introduced the $S_i$ and $T_i$ functions, and the method with split $S_i$/$T_i$ functions and hard parenthesized restrictions presented in [7].
Experimental post-place and route results obtained for the $GF(2^8)$ multipliers are given in Table V, where the area complexity is expressed in terms of the number of LUTs and slices used. Time results (in nanoseconds) represent the critical path delay of the $GF(2^m)$ multipliers. The A×T metric is the product of area and delay, expressed in LUTs × ns, and is used to compare area and delay jointly (lower is better). From Table V, it can be observed that the proposed multiplier exhibits the lowest number of LUTs and the lowest A×T metric among the different approaches, while the lowest number of slices and the lowest delay correspond to the works given in [2] and [8], respectively. In comparison with the splitting method with parenthesized restrictions given in Table III, the new approach is more area and time efficient.
The new approach used for the $GF(2^8)$ multiplier has been applied to the implementation of several type II irreducible pentanomial basis multipliers. The same design methodology and the same methods used for comparison in $GF(2^8)$ have been considered. The post-place and route results are also given in Table V. $GF(2^{113})$ is recommended by SECG (Standards for Efficient Cryptography Group) [9] and $GF(2^{163})$ is recommended by NIST for ECDSA. From the experimental results, it can be observed that the new approach proposed here exhibits the best area×time values among the different methods implemented for most of the binary fields considered (except for NIST (163,68) and SECG (113,34), where the multiplier given in [3] obtains the best values). Furthermore, the new approach also presents the lowest delays for most of the fields implemented (except for NIST (163,66) and SECG (113,34), where the method introduced in [6] achieves the lowest delays). With respect to area complexity, the multiplier given in [3] presents the lowest number of LUTs in most cases. No single method consistently achieves the lowest number of slices. In comparison with the splitting method with hard parenthesized restrictions given in [7], the new approach is more area and time efficient in all implemented fields. Therefore, the hard restrictions imposed in [7] by the parentheses on the addition of the different terms, intended to reduce the number of XOR levels, prevented the synthesizer from performing an optimized mapping into the FPGA's logic blocks. This optimization is possible in the non-parenthesized implementations given in Table V, which offer the synthesis tool more freedom to find an optimized implementation of the $GF(2^m)$ multiplier.
