Abstract: In this paper, a fast implementation of bit-parallel polynomial basis (PB) multipliers over the binary extension field $GF(2^m)$ generated by type I irreducible pentanomials is presented. Explicit expressions for the coordinates of the multipliers and a detailed example are given. Complexity analysis shows that the multipliers presented here have the lowest delay among similar bit-parallel PB multipliers in the literature based on this class of irreducible pentanomials. In order to validate the theoretical complexities, hardware implementations on Xilinx FPGAs have also been performed. Experimental results show that the proposed approach exhibits the lowest delay with a balanced Area × Time complexity when compared with similar multipliers.
INTRODUCTION
Efficient VLSI implementations of high-speed multipliers over binary extension fields $GF(2^m)$ are highly desirable for several applications, such as cryptography, digital signal processing or coding theory [1]. Elements in $GF(2^m)$ are mainly represented in polynomial basis (PB) because it provides more freedom for hardware optimization of arithmetic operations. The efficiency of their hardware implementations is measured in terms of the number of 2-input gates (AND, XOR) and of the gate delays ($T_A$, $T_X$) of the circuit. Many approaches and architectures have been proposed to implement PB multipliers [2], [3], [4], [5], [6]. The complexity of the multiplier mainly depends on the irreducible polynomial $f(y)$ selected for the finite field. For hardware implementations, trinomials [7], [8], [9] and pentanomials are normally used. PB multiplication requires a multiplication of polynomials followed by a modular reduction. Efficient bit-parallel multipliers can be implemented using a product matrix that combines these two steps [10], [11], [12], [13], [14]. A new PB multiplication method based on the decomposition of a product matrix was used in [15]. This method introduced the functions $S_i$ and $T_i$, given by sums of terms $x_k = a_k b_k$ and $z_i^j = a_i b_j + a_j b_i$, where $a_i, b_i \in GF(2)$ are the coefficients of $A, B \in GF(2^m)$. The coefficients of the product can be computed as sums of these functions. The method was applied in [15] to type I irreducible pentanomials, where groups of shared subexpressions were determined in order to reduce the area complexity of the multiplier. In [16], the sums of products given in the $S_i$ and $T_i$ functions were split into sums of $2^j$ product terms that can be implemented as binary trees of XOR gates with depth $j$. Adding binary trees of the same depth in pairs reduces the number of XOR levels needed to compute the product coefficients.
Furthermore, the use of binary trees of XOR gates can minimize power consumption in comparison to the use of linear arrays of XORs [17]. The multiplication approach given in [16] was applied to type II irreducible pentanomials of the form $f(y) = y^m + y^{n+2} + y^{n+1} + y^n + 1$.
In this paper, a new fast bit-parallel $GF(2^m)$ polynomial basis multiplier is presented, where the splitting approach in [16] has been applied to general type I irreducible pentanomials and where the expressions of the product coefficients given in [15] for these pentanomials have been simplified in order to obtain high-speed multipliers. Type I irreducible pentanomials $f(y) = y^m + y^{n+1} + y^n + y + 1$, where $2 \le n \le \lfloor m/2 \rfloor - 1$, are very important because they are abundant (there are 807 different $m$ values in the interval $[8, 1000]$ such that a type I irreducible pentanomial of degree $m$ exists) and they are used in important applications. For example, the arithmetic used in the Advanced Encryption Standard (AES) is based on the binary extension field $GF(2^8)$ generated by the type I irreducible pentanomial $f(y) = y^8 + y^4 + y^3 + y + 1$. Furthermore, three of the five binary fields recommended by the National Institute of Standards and Technology (NIST) for the Elliptic Curve Digital Signature Algorithm (ECDSA), those with $m \in \{163, 233, 283\}$, can be constructed using such pentanomials. The bit-parallel PB multiplier presented here has the lowest delay known to date for similar PB multipliers based on this type of irreducible pentanomials. In order to validate the theoretical complexities, hardware implementations on Xilinx FPGAs have also been performed. $GF(2^m)$ multipliers for the fields recommended by NIST and SECG (Standards for Efficient Cryptography Group) have been described in VHDL, and post-place and route implementation results on Xilinx Artix-7 are reported. Experimental results show that the proposed approach exhibits the lowest delay with a balanced Area × Time complexity when compared with similar multipliers.
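As a point of reference for the field arithmetic discussed above, the following minimal Python sketch models multiplication in $GF(2^m)$ modulo a type I pentanomial in software. It is a plain shift-and-add model, not the bit-parallel architecture of this paper; the function names `type1_pentanomial` and `gf2m_mul` are illustrative, and the code does not check that the chosen $(m, n)$ actually yields an irreducible polynomial.

```python
def type1_pentanomial(m: int, n: int) -> int:
    """Bit mask of f(y) = y^m + y^(n+1) + y^n + y + 1 (bit i stands for y^i).

    Only the range of n is checked; irreducibility is assumed, not verified.
    """
    if not 2 <= n <= m // 2 - 1:
        raise ValueError("type I pentanomials require 2 <= n <= floor(m/2) - 1")
    return (1 << m) | (1 << (n + 1)) | (1 << n) | 0b11


def gf2m_mul(a: int, b: int, m: int, f: int) -> int:
    """Multiply a and b (both below 2^m) in GF(2)[y]/(f) by shift-and-add."""
    acc = 0
    for i in range(m):
        if (b >> i) & 1:
            acc ^= a      # add a * x^i; a has already been multiplied by x i times
        a <<= 1           # multiply a by x
        if (a >> m) & 1:
            a ^= f        # reduce using x^m = x^(n+1) + x^n + x + 1
    return acc


f_aes = type1_pentanomial(8, 3)               # y^8 + y^4 + y^3 + y + 1
print(hex(f_aes))                             # 0x11b
print(hex(gf2m_mul(0x57, 0x83, 8, f_aes)))    # 0xc1
```

The product `0x57 * 0x83 = 0xc1` is the standard worked example for the AES field, which makes it a convenient sanity check for any faster implementation.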
The paper is organized as follows. Section 2 provides notation and mathematical background. Type I irreducible pentanomials are introduced in Section 3, where new reduced expressions for multiplication are given. Section 4 describes the new multiplier, gives an example of multiplication and analyses the theoretical complexity. Comparisons with other similar multipliers are given in Section 5. Hardware implementation results are presented in Section 6. Finally, conclusions are given in Section 7.
BACKGROUND
Let $f(y) = \sum_{i=0}^{m} f_i y^i$ be a monic irreducible polynomial of degree $m$ over $GF(2)$. The elements of the binary extension field $GF(2^m)$ can be represented in the polynomial basis $\{1, x, \ldots, x^{m-1}\}$, where $x$ is a root of the irreducible generating polynomial $f(y)$. Any element $A \in GF(2^m)$ is represented in PB as $A = \sum_{i=0}^{m-1} a_i x^i$, where the $a_i \in GF(2)$ are the coefficients of $A$. In order to compute the coefficients of the product $C = A \cdot B$, a new method was used in [15]. This method introduced the functions $S_i$ and $T_i$, given by sums of terms $x_k = a_k b_k$ and $z_i^j = a_i b_j + a_j b_i$, where $a_i, b_i \in GF(2)$ are the coefficients of $A$ and $B$, respectively. These functions are implemented as binary trees of 2-input XOR gates with a lower level of 2-input AND gates (corresponding to the $a_i b_j$ products). The product $C = A \cdot B$ can be computed as the sum of these functions. The expressions for $S_i$ ($1 \le i \le m$) and $T_i$ are given in (1),
where one $x$ term only appears for $i$ odd and $x_g$ only appears for ($m$ and $i$ even) or for ($m$ and $i$ odd). In this case, $h = g$. Otherwise, i.e., for ($m$ even and $i$ odd) or for ($m$ odd and $i$ even), the term $x_g$ does not appear and $h$ takes a different value. For example, for $GF(2^5)$ the terms $S_i$ and $T_i$ are as follows:
TYPE I IRREDUCIBLE PENTANOMIALS
Type I irreducible pentanomials were defined in [14] as $f(y) = y^m + y^{n+1} + y^n + y + 1$, for $2 \le n \le \lfloor m/2 \rfloor - 1$. These pentanomials are very important because they are abundant and they are used in a wide range of applications. For example, the specific type I pentanomial $f(y) = y^8 + y^4 + y^3 + y + 1$ generates the field $GF(2^8)$ used in the Advanced Encryption Standard (AES).
Polynomial basis multiplication for type I irreducible pentanomials was studied in [15], where expressions of the product coefficients were computed. In these expressions, groups $G_i^j$ of subexpressions given as sums of $j$ terms $T_k$ were also found. These $j$-term groups $G_i^j$ can be shared among different coefficients, leading to a reduction of the area complexity of the multiplier. In this work, it is observed that the common groups found in [15] can be simplified in order to reduce the delay of the multiplier. The simplification is shown in the following example with $(m, n) = (13, 3)$.
$GF(2^{13})$ Multiplier for $f(y) = y^{13} + y^4 + y^3 + y + 1$
The product $C = A \cdot B$ in $GF(2^{13})$ generated by the type I irreducible pentanomial with parameters $(m, n) = (13, 3)$ can be computed using the expressions given in [15]. The coefficients $c_i$ of the product are given in Table 1, where a coordinate $c_i$ is the sum of the $S_l$ and $T_p$ terms in the $i$th row. In Table 1, the above $G_i^j$ groups are not represented and only individual terms $T_k$ are shown. It can also be observed that several $T_i$ terms cancel in some rows.
New General Expressions for the Multiplier
In a similar way to that seen in the previous example, the coordinates of the product $C = A \cdot B$ in PB for general type I pentanomials $f(y) = y^m + y^{n+1} + y^n + y + 1$, with $2 \le n < \lfloor m/2 \rfloor - 1$, are given in Table 2, where $z = m - n$. From the table, it can be observed that several $T_i$ terms cancel, therefore reducing the complexity of the multiplier.
The new general reduced expressions for the coordinates are also given in Table 3. In this table, the coefficients have been divided into eight sections (named A to H), depending on the terms $T_i$ involved and on the number of $S_i$ and $T_i$ terms in the sums for the coefficients. The number of terms in sections A, B, C, D, E, F, G and H is 4, 5, 4, 7, 7, 6, 5 and 4, respectively. It can be observed that coefficients in sections D and E present the maximum number of terms (seven). Furthermore, from equation (1), the term $T_0$ is given by the addition of $h - 1$ terms $z_i^j$ and the term $x_g$ (if it exists), i.e., it sums the maximum number of $z_i^j$ terms and therefore presents the highest delay. From (1), the complexity of the $T_i$ terms decreases as the subindex $i$ increases, so the next most complex $T_i$ term is $T_1$. It must be noted that the coefficient $c_{n+1}$ (in section E) has the maximum number of terms (seven) and includes the two most complex terms $T_0$ and $T_1$ in its sum. Therefore, $c_{n+1}$ is the most complex coefficient and it will be used in the following sections to determine the delay of the new multiplier.
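The term counts quoted above can be reproduced with a short Python sketch. It rests on an assumption not spelled out in the surviving text: that $S_i$ is the coefficient of $x^{i-1}$ and $T_i$ the coefficient of $x^{m+i}$ in the unreduced polynomial product, which is consistent with the AND-gate counts $i$ and $m - i - 1$ stated in the area analysis. The helper names are illustrative.

```python
def reduction_rows(m: int, n: int) -> list[set[int]]:
    """rows[i] = exponents k such that x^(m+i) mod f contains x^k, for i = 0..m-2."""
    f_low = {n + 1, n, 1, 0}              # x^m = x^(n+1) + x^n + x + 1
    rows, cur = [], set(f_low)
    for _ in range(m - 1):
        rows.append(set(cur))
        shifted = {k + 1 for k in cur}    # multiply by x
        if m in shifted:                  # an x^m term appeared: reduce it
            shifted.remove(m)
            shifted ^= f_low              # symmetric difference = addition over GF(2)
        cur = shifted
    return rows


def coefficient_terms(m: int, n: int) -> list[list[str]]:
    """S and T functions summed in each c_k (assumes S_i <-> x^(i-1), T_i <-> x^(m+i))."""
    rows = reduction_rows(m, n)
    return [[f"S{k + 1}"] + [f"T{i}" for i, r in enumerate(rows) if k in r]
            for k in range(m)]


terms = coefficient_terms(13, 3)
print([len(t) for t in terms])   # [4, 5, 4, 7, 7, 6, 6, 5, 5, 5, 5, 5, 4]
print(terms[4])                  # ['S5', 'T0', 'T1', 'T3', 'T4', 'T9', 'T11']
```

For $(m, n) = (13, 3)$ the per-coefficient counts follow the section pattern 4, 5, 4, 7, 7, 6, 5, 4, and $c_4 = c_{n+1}$ indeed collects seven terms including $T_0$ and $T_1$. The symmetric difference in `reduction_rows` is what makes some $T_i$ terms cancel.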
NEW MULTIPLIER FOR TYPE I IRREDUCIBLE PENTANOMIALS
As shown in Table 3, the coefficients of the product $C = A \cdot B$ in PB can be computed as the addition of functions $S_i$ and $T_i$, which are given in (1) by sums of terms $x_k$ and $z_i^j$.
However, the monolithic construction of the $S_i$ and $T_i$ terms can be a problem if low-delay implementations are required. For example, for $GF(2^5)$, an evaluation in which the terms in brackets must be added before the XOR with the other terms results in a 3-level binary tree of 2-input XOR gates. However, the addition $T_1 + T_3$ involves the XOR of four product terms. This sum could be implemented with a 2-level complete binary tree of XOR gates if the additions could be done separately, i.e., if the product $a_3 b_3$ could first be XORed with the term $a_4 b_4$ and the result then XORed with $z_2^4 = a_2 b_4 + a_4 b_2$.
In [16], a new approach was given by considering the functions $S_i$ and $T_i$ as sums of $S_i^j$ and $T_i^j$ terms, respectively, in such a way that each $S_i^j$ or $T_i^j$ term sums $2^j$ product terms and can be implemented as a binary tree of XOR gates with depth $j$ [16]. Furthermore, common terms appearing in several coefficients can be shared in order to reduce the number of XORs. These common terms correspond to the sums ($S_i + S_{i+1}$) and ($T_i + T_{i+1}$). From Table 3, it can be observed that only common additions ($T_i + T_{i+1}$) can be found (with $i = 0, 2, \ldots, m - 4$ for even $m$, and with $i = 1, 3, \ldots, m - 4$ for odd $m$) [16]. The multiplication algorithm given in [16] for this approach adds, level by level, pairs of terms with the same depth, so that the terms at level $l$ are combined into terms at level $l + 1$. In the next section, the representation introduced in [16] is applied to the new reduced expressions given in Table 3 for the type I pentanomial multiplier with $(m, n) = (13, 3)$.
Type I Pentanomial Multiplier for $(m, n) = (13, 3)$
Let us consider the product $C = A \cdot B$ in $GF(2^{13})$ generated by the type I pentanomial $f(y) = y^{13} + y^4 + y^3 + y + 1$. Using equation (1), the $S_i$ and $T_i$ functions are given in Table 4, where $S_i$ and $T_i$ are the XOR of the $x_k$ and $z_i^j$ terms given in their rows. In this table, the columns labeled $2^0$, $2^1$, $2^2$ and $2^3$ represent the number of product terms $a_h b_l$ involved in each column, so the number of product terms of each function can be written in binary. This representation is given in the column labeled binary in Table 4, where it can be observed that the binary vector of $S_i$ is the binary representation of the subindex $i$, while that of $T_i$ is the binary representation of $m - 1 - i$. Table 4 therefore includes the terms $S_i$ and $T_{m-1-i}$ in a row with the same binary representation.
The space complexity of the multiplier can be reduced if common terms that appear in several coefficients are shared. In Table 4, consecutive $S_i$ and $T_i$ terms having $S_i^j$ and $T_i^j$ terms with the same superscript $j$ can be grouped. This is the case for $S_6$ and $S_7$. The same applies to $T_5$ and $T_6$, with ($T_5^1 + T_6^1$) and ($T_5^2 + T_6^2$) that give rise to 2-level and 3-level binary trees of XOR gates, respectively. Therefore, the groups ($S_6 + S_7$) and ($T_5 + T_6$) can reduce the complexity. The groups for this example are represented in Table 4 by shaded cells of the same color. The $S$ groups are ($S_2$, $S_3$), ($S_4$, $S_5$), ($S_6$, $S_7$), ($S_8$, $S_9$), ($S_{10}$, $S_{11}$) and ($S_{12}$, $S_{13}$), while the $T$ groups are ($T_1$, $T_2$), ($T_3$, $T_4$), ($T_5$, $T_6$), ($T_7$, $T_8$) and ($T_9$, $T_{10}$).
Using Tables 1 and 3, the coefficients of the product for this $GF(2^{13})$ multiplier are given in Table 5, where the previous $T$ groups are shaded. From Table 5, it can be observed that the group ($T_9 + T_{10}$) appears in three coefficients ($c_0$, $c_3$ and $c_{10}$), while ($T_1 + T_2$), ($T_3 + T_4$), ($T_5 + T_6$) and ($T_7 + T_8$) are found in two coefficients. Therefore, only one instance of each of these groups must be implemented. The number of $T_i^j$ terms in each group determines the number of XOR gates that can be saved. From Table 4, it can be observed that the group ($T_1 + T_2$) involves two additions of same-level $T_i^j$ terms, therefore requiring 2 XOR gates. Likewise, ($T_3 + T_4$), ($T_5 + T_6$), ($T_7 + T_8$) and ($T_9 + T_{10}$) require 1, 2, 1 and 1 XOR gates, respectively. In addition, ($T_9 + T_{10}$) is found in three different coefficients, so the number of XOR gates saved by this group is $2 \cdot 1 = 2$. Therefore, the number of XOR gates saved by sharing is 2 + 1 + 2 + 1 + 2 = 8. General expressions for the number of XOR gates saved by sharing groups are given in Section 4.2. Using the multiplication algorithm of [16] and the $S_i^j$ and $T_i^j$ terms given in Table 4, the coefficients of the product are shown in the third column of Table 5. The precedence of the sums of terms in Table 5 is indicated with parentheses.
As stated in Section 3, coefficient $c_{n+1}$ is the most complex one for type I pentanomials. For $GF(2^{13})$, this coefficient corresponds to $c_4$, whose implementation is given in Fig. 1. From Table 5, it requires the addition of 7 terms (including the most complex ones, $T_0$ and $T_1$), so it determines the maximum delay of the multiplier. The initial $S_i^j$ and $T_i^j$ terms of Table 5 are represented in Fig. 1 by black and gray circles, respectively. The time complexity of this $GF(2^{13})$ multiplier can be computed taking into account that the most complex coefficient $c_4$ requires a 6-level binary XOR tree, so the delay is given by $T_A + 6T_X$. The $T_A$ delay corresponds to the 0-level $a_i b_j$ products of the coefficients of $A$ and $B$. For area complexity, the number of 2-input AND and XOR gates must be computed. The number of AND gates is given by the products $a_i b_j$, with $i, j \in [0, m - 1]$, and for $GF(2^m)$ multipliers is $m^2$ [16]. Therefore, the number of AND gates is 169 for the $GF(2^{13})$ multiplier. The number of XOR gates can be computed as the sum of the XOR gates in the initial $S_i^j$ and $T_i^j$ terms (as given in Table 4) plus the number of new XOR gates generated in the coefficients (as given in Table 5) minus the number of XOR gates saved by shared groups. The initial $S_i^j$ and $T_i^j$ terms require 66 and 56 XOR gates, respectively.
The number of new XORs generated in the coefficients due to the addition of $S_i^j$ and $T_i^j$ terms is found to be 110 (see Table 5). The number of XORs saved by the shared groups was previously computed (8 XOR). Therefore, the total number of XOR gates of this multiplier is 66 + 56 + 110 − 8 = 224.
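The two figures 66 and 56 can be checked with a few lines of Python. The sketch assumes, as described for Table 4, that a function with $q$ product terms is split according to the binary representation of $q$ (with $q = i$ for $S_i$ and $q = m - 1 - i$ for $T_i$), and that a level-$j$ sub-tree summing $2^j$ products uses $2^j - 1$ XOR gates. The helper names are illustrative.

```python
def levels(q: int) -> list[int]:
    """Levels j of the sub-terms of a function with q product terms (set bits of q)."""
    return [j for j in range(q.bit_length()) if (q >> j) & 1]


def initial_xors(q: int) -> int:
    """XOR gates inside the sub-trees: a level-j tree sums 2^j products with 2^j - 1 XORs."""
    return sum((1 << j) - 1 for j in levels(q))


m = 13
xor_s = sum(initial_xors(i) for i in range(1, m + 1))       # S_1..S_m, S_i has i products
xor_t = sum(initial_xors(m - 1 - i) for i in range(m - 1))  # T_0..T_(m-2)
print(levels(11), xor_s, xor_t)      # [0, 1, 3] 66 56
print(xor_s + xor_t + 110 - 8)       # 224
```

The value 110 (new XORs in the coefficients) and the 8 shared XORs are taken from the text rather than recomputed here.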
Complexity Analysis of the New Multiplier
Time Complexity
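In Section 3.2 it was found that $c_{n+1}$ is the most complex coefficient, so it is used to determine the delay of the new multiplier. To do that, the complexity of the $S_i$ and $T_i$ terms must be determined. As shown in Section 4.1, the numbers of initial terms $S_i^j$ and $T_i^j$ are given by the binary representations of the subindex $i$ for $S_i$ and of the value $m - 1 - i$ for $T_i$, respectively [16]. Therefore, the equivalence (only in relation to the number of terms) $T_i \equiv S_{m-1-i}$ can be used to determine the number of $T_i^j$ terms in $T_i$. This equivalence determines that, for $c_{n+1}$, $T_0 \equiv S_{m-1}$, $T_1 \equiv S_{m-2}$, $T_n \equiv S_{z-1}$, $T_{n+1} \equiv S_{z-2}$, $T_{z-1} \equiv S_n$ and $T_{z+1} \equiv S_{n-2}$, where $z = m - n$, so the number $m_j$ of initial terms of $c_{n+1}$ at level $j$ is determined by the binary representations of these subindexes.
In order to determine the binary representation of these subindexes, the expression $q = \sum_{i=0}^{\lfloor \log_2 q \rfloor} \left( \lfloor q/2^i \rfloor \bmod 2 \right) \cdot 2^i$, giving the binary configuration of a number $q$, can be used [16]. The value $\lfloor q/2^j \rfloor \bmod 2$ is bit $j$ of $q$, so $m_j$ can be computed as [16]
$$m_j = \sum_{q \in Q} \left( \lfloor q/2^j \rfloor \bmod 2 \right),$$
where $Q$ is the set of subindexes of the $S_i$ terms equivalent to the seven terms of $c_{n+1}$.
Using this expression, the number of initial terms for the most complex coefficient $c_4$ given in Section 4.1 can be computed. The number of initial terms in level 3, for example, will be $m_3 = 4$ (see Fig. 1).
The total number of terms $M_{\lfloor \log_2 m \rfloor}$ in level $\lfloor \log_2 m \rfloor$ is the number of initial terms $m_{\lfloor \log_2 m \rfloor}$ plus the terms in that level created by the XOR of terms in lower levels. Using the property of the ceiling function $\lceil q \rceil + n = \lceil q + n \rceil$, with $n$ integer, and taking into account that the total number of terms in level $i$ is $M_i = m_i + \lceil M_{i-1}/2 \rceil$, it can be proved [16] that the number of terms created in level $\lfloor \log_2 m \rfloor$ due to the addition in pairs of terms in level $\lfloor \log_2 m \rfloor - 1$ is $\lceil (\sum_{i=0}^{\lfloor \log_2 m \rfloor - 1} 2^i m_i)/2^{\lfloor \log_2 m \rfloor} \rceil$. Therefore, the total number of terms in level $\lfloor \log_2 m \rfloor$ will be the sum of $m_{\lfloor \log_2 m \rfloor}$ plus the above expression, i.e., $M_{\lfloor \log_2 m \rfloor} = \lceil (\sum_{i=0}^{\lfloor \log_2 m \rfloor} 2^i m_i)/2^{\lfloor \log_2 m \rfloor} \rceil$. In order to compute this expression, the number $m_j$ of initial terms in level $j$ should be known. This number was previously given for $c_{n+1}$. Using the fact that the mod operator is defined by $x \bmod y = x - y \lfloor x/y \rfloor$, for real $x, y$ ($y \ne 0$), the sum $\sum_i 2^i m_i$ equals the sum of the subindexes involved, which gives $M_{\lfloor \log_2 m \rfloor} = \lceil (4m + n - 6)/2^{\lfloor \log_2 m \rfloor} \rceil$.
The number of XOR levels needed to compute the coefficient $c_{n+1}$ will be $\lfloor \log_2 m \rfloor + \lceil \log_2 M_{\lfloor \log_2 m \rfloor} \rceil$, so the highest delay of the multiplier based on type I pentanomials is
$$T_A + \left( \lfloor \log_2 m \rfloor + \left\lceil \log_2 \left\lceil (4m + n - 6)/2^{\lfloor \log_2 m \rfloor} \right\rceil \right\rceil \right) T_X.$$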
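The level-by-level counting used in this subsection can be sketched in Python as follows. The function `xor_levels` is illustrative and models only the counting argument (terms paired within a level, with an unpaired term carried upward); it is not the algorithm of [16] itself. For $c_4$ in the $(13, 3)$ example, the six $T_i$ functions have $m - 1 - i$ product terms each, and the single $S$ function is assumed here to be $S_{n+2} = S_5$, an assumption consistent with the total of $4m + n - 6 = 49$ product terms.

```python
def xor_levels(term_sizes: list[int], m: int) -> tuple[int, int]:
    """Return (M_top, total XOR levels) for a coefficient summing the given functions.

    term_sizes[k] is the number of product terms of the k-th S_i or T_i function.
    """
    top = m.bit_length() - 1                  # floor(log2 m): deepest initial tree
    total = 0                                 # M_(j-1), terms coming from the level below
    for j in range(top + 1):
        initial = sum((q >> j) & 1 for q in term_sizes)   # m_j: functions with bit j set
        total = initial + (total + 1) // 2                # M_j = m_j + ceil(M_(j-1) / 2)
    # The M_top trees of depth `top` still need ceil(log2 M_top) further XOR levels.
    return total, top + (total - 1).bit_length()


# c_4 for (m, n) = (13, 3): S5, T0, T1, T3, T4, T9, T11 -> 5, 12, 11, 9, 8, 3, 1 products
sizes_c4 = [5, 12, 11, 9, 8, 3, 1]
print(sum(sizes_c4))             # 49
print(xor_levels(sizes_c4, 13))  # (7, 6)
```

The result agrees with the closed form: $\lceil 49/8 \rceil = 7$ terms at level 3 and $3 + \lceil \log_2 7 \rceil = 6$ XOR levels, i.e., the delay $T_A + 6T_X$ reported for the example.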
Area Complexity
From (1), the numbers of AND gates in $S_i$ and $T_i$ are $i$ and $m - i - 1$, respectively. Therefore, the total number of AND gates of the multiplier is $m^2$. In order to compute the number of XOR gates of the multiplier, the number $N_1$ of XORs given by $S_i$ and $T_i$ must be determined. From (1), these values are $i - 1$ and $m - i - 2$, respectively, and therefore $N_1$ is $(m - 1)^2$. The number $N_2$ of XOR gates used for the addition of $S_i$ and $T_i$ terms in the product coefficients of Table 3 must also be computed. Functions $S_i$ appear only once in Table 3, while $T_i$ terms appear several times. Therefore, $N_1$ only accounts for the XORs of one occurrence of each $T_i$. If a term $T_i$ appears $p_i$ times in Table 3, then the other $p_i - 1$ occurrences are taken into account by determining the number $Q_i$ of XORs needed for the addition of the $T_i^j$ terms and multiplying it by $p_i - 1$. Therefore, the number $N_3$ of XOR gates given by $\sum_{i=0}^{m-2} (p_i - 1) \cdot Q_i$ must also be computed. Finally, the number $N_4$ of XORs saved by shared groups ($T_i$, $T_j$) should also be determined. The total number of XOR gates of the multiplier will then be $N_1 + N_2 + N_3 - N_4$ [16]. In the Appendix the following values have been computed:
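The number $N_2$ of XORs needed for the sum of $S_i$ and $T_i$ terms in the product coefficients is $4m + 2n - 3$. The number $N_3$ of XOR gates can be given as
$$N_3 = 3\,\mathcal{C}_{m-1} + 2\,\mathcal{C}_n + C_n.$$
In this expression, the number of XOR gates $C_n$ needed for the sum of the $S_n^j$ terms of $S_n$ is $C_n = H_n - 1$ [16], where $H_n$ is the Hamming weight of $n$, and $\mathcal{C}_h = \sum_{i=1}^{h} C_i$.
The number $N_4$ of XOR gates given by shared groups ($T_i$, $T_j$) in the product coefficients is $\left(\sum_{i=k,k+2,\ldots}^{\imath} H_i\right) + H_{n-1}^{\dagger} = D_n + H^{\dagger}$, where $k = 2\lfloor n/2 \rfloor$, $\imath = m - 2$ for even $m$ and $m - 3$ for odd $m$, and $\dagger$ indicates that the term only appears for odd $n$. Therefore, the number of XOR gates of the multiplier given by $N_1 + N_2 + N_3 - N_4$ will be
$$m^2 - m + 3 S_{m-1} + 2 S_n + H_n - D_n - H^{\dagger}, \qquad (4)$$
where $S_h = \sum_{i=1}^{h} H_i$. Using (4), for the example given in Section 4.1 with $m = 13$, $n = 3$, the values $S_{12} = 22$, $S_3 = 4$, $H_3 = 2$, $D_3 = 7$ and $H^{\dagger} = H_2 = 1$ are obtained.
Applying these values to equation (4) we obtain $169 - 13 + 3 \cdot 22 + 2 \cdot 4 + 2 - 7 - 1 = 224$ XOR gates, matching the result given in Section 4.1.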
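The four contributions to the XOR count can be evaluated directly from the quantities stated in this section and in the Appendix. The Python sketch below does so; the function name `xor_gate_count` and the dictionary keys `N1` to `N4` are labels chosen here for the four partial counts, not notation from the paper.

```python
def hamming(i: int) -> int:
    """Hamming weight H_i of the binary representation of i."""
    return bin(i).count("1")


def tree_joins(h: int) -> int:
    """Sum of C_i = H_i - 1 for i = 1..h (XORs joining the sub-terms of S_1..S_h)."""
    return sum(hamming(i) - 1 for i in range(1, h + 1))


def xor_gate_count(m: int, n: int) -> dict[str, int]:
    n1 = (m - 1) ** 2                                   # XORs inside all S_i and T_i
    n2 = 4 * m + 2 * n - 3                              # XORs adding S_i/T_i in the coefficients
    n3 = 3 * tree_joins(m - 1) + 2 * tree_joins(n) + (hamming(n) - 1)  # repeated T_i
    k = 2 * (n // 2)
    last = m - 2 if m % 2 == 0 else m - 3
    n4 = sum(hamming(i) for i in range(k, last + 1, 2))  # D_n
    if n % 2 == 1:
        n4 += hamming(n - 1)                             # extra term, odd n only
    return {"N1": n1, "N2": n2, "N3": n3, "N4": n4, "total": n1 + n2 + n3 - n4}


print(xor_gate_count(13, 3))
# {'N1': 144, 'N2': 55, 'N3': 33, 'N4': 8, 'total': 224}
```

For $(m, n) = (13, 3)$ the sketch returns 224 XOR gates, and $144 + 55 + 33 = 232 = 66 + 56 + 110$ agrees with the breakdown of Section 4.1 before the 8 shared XORs are subtracted.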
COMPARISON WITH OTHER MULTIPLIERS
In Table 6 the theoretical complexities of the proposed multiplier are compared with the best results known to date for bit-parallel PB multipliers over $GF(2^m)$ generated by type I irreducible pentanomials. Simulations done with Maple have confirmed that the delay of our multiplier is less than or equal to the best delay given in [11] and [15], i.e., $\lfloor \log_2 m \rfloor + \lceil \log_2 \lceil (4m + n - 6)/2^{\lfloor \log_2 m \rfloor} \rceil \rceil \le 3 + \lceil \log_2 (m - 1) \rceil$. From these results, it was found that for the 807 different values of the field size $m \in [8, 1000]$ for which a type I pentanomial exists, the proposed multiplier has the smallest delay for 762 different values of $m$. Furthermore, among the 1,974 $(m, n)$ combinations with $m \in [8, 1000]$ for which type I pentanomials exist, there are 187 and 1,787 different pairs $(m, n)$ for which the proposed multiplier has equal and lower delay, respectively, than the multipliers given in [11] and [15]. With respect to area complexity, the proposed multiplier has the same number of AND gates (except for [18]) and a higher number of XOR gates. This increase is due to the splitting of the functions $S_i$ and $T_i$ into $S_i^j$ and $T_i^j$ terms, respectively. In Table 7 the complexities of bit-parallel PB multipliers for the NIST recommended fields $GF(2^m)$, with $m \in \{163, 233, 283\}$, for which type I irreducible pentanomials exist, are presented. It can be observed that the proposed multiplier presents the lowest delay among the analyzed methods. The reductions range from 8.3 percent for $GF(2^{283})$ to 9.1 percent for $GF(2^{163})$ and $GF(2^{233})$ with respect to the best delays found in the literature.
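The comparison above can be reproduced with integer arithmetic. The Python sketch below evaluates both sides of the inequality for the NIST pairs $(163, 59)$, $(233, 25)$ and $(283, 59)$ named in the conclusion; the function names are illustrative. It counts XOR levels only, on the assumption that the single AND level $T_A$ is common to both designs.

```python
def ceil_log2(x: int) -> int:
    """ceil(log2 x) for an integer x >= 1, without floating point."""
    return (x - 1).bit_length()


def proposed_xor_levels(m: int, n: int) -> int:
    top = m.bit_length() - 1                     # floor(log2 m)
    m_top = -(-(4 * m + n - 6) // (1 << top))    # ceil((4m + n - 6) / 2^top)
    return top + ceil_log2(m_top)


def reference_xor_levels(m: int) -> int:
    return 3 + ceil_log2(m - 1)                  # bound quoted for [11] and [15]


for m, n in [(163, 59), (233, 25), (283, 59)]:
    new, ref = proposed_xor_levels(m, n), reference_xor_levels(m)
    print(m, n, new, ref, round(100 * (ref - new) / ref, 1))
# 163 59 10 11 9.1
# 233 25 10 11 9.1
# 283 59 11 12 8.3
```

The percentages match the 9.1 and 8.3 percent reductions quoted in the text, and the same functions give 6 versus 7 XOR levels for the $(13, 3)$ example.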
HARDWARE IMPLEMENTATIONS
In order to further compare the new approach with other similar methods, bit-parallel $GF(2^m)$ PB multipliers based on type I irreducible pentanomials have been described in VHDL, synthesized and implemented on a Xilinx Artix-7 XC7A200T-FFG1156 FPGA. Experimental results are those reported by Xilinx ISE 14.7 using the XST synthesizer. The same pin assignments and high speed optimization settings were used for all designs. Experimental post-place and route results are given in Table 8 for multipliers based on type I irreducible pentanomials for the SECG [20] recommended finite fields $GF(2^m)$, with $(m, n)$ = (113, 8), (113, 24), (113, 40), (131, 59), and for the NIST field (163, 59). Area complexity is expressed in Table 8 in terms of the number of LUTs and Slices used, and time results (in nanoseconds) represent the minimum time needed to perform one $GF(2^m)$ multiplication. The A × T metric is the product of area and delay in Slices × ns, used to compare area and delay jointly (lower is better). From the experimental results, it can be observed that the proposed multiplier exhibits the lowest delay among the different methods. Moreover, the new approach presents the best Area × Time values in three of the five implemented multipliers, therefore also showing restrained area usage in comparison with other methods.
CONCLUSION
In this paper, a new fast bit-parallel $GF(2^m)$ polynomial basis multiplier for type I irreducible pentanomials has been presented. Efficient implementations of high-speed multipliers over binary extension fields are highly desirable for several important applications. Furthermore, type I irreducible pentanomials are abundant and they are used in applications such as the AES. In this work, explicit expressions for the coordinates of the proposed multiplier are given. These expressions are implemented as the addition in pairs of binary trees of XOR gates with the same depth, leading to a reduction of delay. Moreover, the use of binary trees can minimize power consumption in comparison to the use of linear arrays of XOR gates. A detailed multiplication example has also been given. Theoretical complexity analysis has shown that the proposed multiplier presents the lowest delay among the best results known to date for similar multipliers based on this type of irreducible pentanomials. Simulation results have shown that for the 1,974 $(m, n)$ combinations, with $m \in [8, 1000]$ and $2 \le n \le \lfloor m/2 \rfloor - 1$, for which type I irreducible pentanomials exist, there are 187 and 1,787 different pairs $(m, n)$ for which the proposed multiplier has equal and lower delay, respectively, than the best results found in the literature. Furthermore, for the NIST recommended finite fields $GF(2^m)$ with $(m, n) = (163, 59)$, $(233, 25)$ and $(283, 59)$, the proposed multiplier presents a reduction of the delay ranging from 8.3 to 9.1 percent with respect to the best results known to date. In order to validate the theoretical complexities, hardware implementations on Xilinx FPGAs have also been performed. NIST and SECG $GF(2^m)$ multipliers have been described in VHDL, and post-place and route implementation results on Artix-7 have been reported. Experimental results have shown that the proposed multiplier exhibits the lowest delay with a balanced Area × Time complexity when compared with similar multipliers.
APPENDIX AREA COMPLEXITY
In order to determine the number of XOR gates of the multiplier, the quantities $N_2$, $N_3$ and $N_4$ must be computed.
The coefficients in Table 3 have been divided into eight sections. The number of $S_i$ and $T_i$ terms in each section was given in Section 3.2. The XOR gates in the product coefficients are the following: 3 in section A; $4(n - 2)$ in B; 3 and 6 in C and D, respectively; $6(n - 2)$ in E; 10 in F; $4(m - 2n - 2)$ in G; and finally 3 in H. Therefore, the number $N_2$ of XOR gates needed for the addition of $S_i$ and $T_i$ terms in the product coefficients is $4m + 2n - 3$.
The number $p_i$ of times each $T_i$ appears in Table 3 must be determined to compute the number $N_3$ of XOR gates. There are $z - 1$ terms ($T_0, \ldots, T_{z-2}$) that appear 4 times, $n - 1$ terms ($T_z, \ldots, T_{m-2}$) that appear 6 times, and $T_{z-1}$ appears 7 times. One occurrence of each $T_i$ term is already included in $N_1$, so the number of XORs due to the above terms appearing 3, 5 and 6 additional times, respectively, must be determined. If we define $F_{a,b} = \sum_{i=a}^{b} Q_i$, where $Q_i$ is the number of XORs needed for the sum of the $T_i^j$ terms of $T_i$, then the number $N_3$ of XORs is $\sum_{i=0}^{m-2} (p_i - 1) \cdot Q_i = 3 \cdot F_{0,m-2} + 2 \cdot F_{z-1,m-2} + Q_{z-1}$. Using the equivalence $T_i \equiv S_{m-1-i}$ and denoting $\mathcal{C}_h = \sum_{i=1}^{h} C_i$, $N_3$ can be computed as $\sum_{i=0}^{m-2} (p_i - 1) \cdot Q_i = 3 \cdot \mathcal{C}_{m-1} + 2 \cdot \mathcal{C}_n + C_n$. The XORs $C_i$ needed for the addition of the $S_i^j$ terms in $S_i$ can be determined using the number of 1's in the binary configuration of $i$ [16]. For example, $S_{11}$ in Table 4 can be written as the sum of three $S_{11}^j$ terms. The binary configuration of subindex 11 is $(1, 0, 1, 1)_2$, with three 1's, so the number of XOR gates $C_{11}$ will be its Hamming weight $H_{11}$ minus 1. Therefore, $C_i = H_i - 1$. Table 3 is used to compute the number of XOR gates given by shared groups ($T_i$, $T_j$). It can be found that for even $n$, there are $\lfloor z/2 \rfloor$ groups ($T_i$, $T_{i+1}$), $i \in [0, z - 2]$ for even $m$ and $i \in [1, z - 2]$ for odd $m$, that appear in two coefficients. For odd $n$, the group ($T_{z-1}$, $T_z$) appears in three coefficients, while $\lceil z/2 \rceil - 1$ groups ($T_i$, $T_{i+1}$), $i \in [0, z - 3]$ for even $m$ and $i \in [1, z - 3]$ for odd $m$, appear in two coefficients. For these groups, the term with the highest subindex gives the XORs to be shared [16]. The XORs represented by the above groups are therefore given by the Hamming weight of the binary representation of the lowest subindex $i$ of $S_i$ for each group.
The quantities to be computed are $(H_n + H_{n+2} + \cdots + H_{m-2})$ for even $n$ and $m$, $(H_n + H_{n+2} + \cdots + H_{m-3})$ for even $n$ and odd $m$, $(H_{n-1} + H_{n+1} + \cdots + H_{m-2}) + H_{n-1}$ for odd $n$ and even $m$, and $(H_{n-1} + H_{n+1} + \cdots + H_{m-3}) + H_{n-1}$ for odd $n$ and $m$. Therefore the number $N_4$ of XORs given by the shared groups will be $\left(\sum_{i=k,k+2,\ldots}^{\imath} H_i\right) + H_{n-1}^{\dagger} = D_n + H^{\dagger}$, where $k = 2\lfloor n/2 \rfloor$, $\imath = m - 2$ for even $m$ and $m - 3$ for odd $m$, and $\dagger$ indicates that the term only appears for odd $n$.
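As a worked instance of the last expression, consider the example $(m, n) = (13, 3)$ of Section 4.1, where both $m$ and $n$ are odd. The Hamming weights used below follow directly from the binary representations $2 = 10_2$, $4 = 100_2$, $6 = 110_2$, $8 = 1000_2$ and $10 = 1010_2$.

```latex
\begin{align*}
k &= 2\lfloor 3/2 \rfloor = 2, \qquad \imath = m - 3 = 10, \\
D_3 &= H_2 + H_4 + H_6 + H_8 + H_{10} = 1 + 1 + 2 + 1 + 2 = 7, \\
D_3 + H_{2} &= 7 + 1 = 8 .
\end{align*}
```

The result is the 8 XOR gates saved by sharing that were counted by inspection in Section 4.1 (2 + 1 + 2 + 1 + 2 = 8).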
