Abstract. Representing finite field elements with respect to the polynomial (or standard) basis, we consider a bit-parallel multiplier architecture for the finite field $GF(2^m)$. The time and space complexities of such a multiplier depend heavily on the field-defining irreducible polynomial. For a number of important classes of irreducible polynomials, we give exact complexity analyses of the multiplier gate count and time delay. In general, our results match or outperform the best previously known results for similar classes. We also present exact formulations for the coordinates of the multiplier output. Such formulations are expected to be useful for implementing the multiplier efficiently in hardware description languages, such as VHDL and Verilog, without requiring much knowledge of finite field arithmetic.
Introduction
With the rapid expansion of the Internet and wireless communications, more and more digital systems are being equipped with some form of cryptosystem to provide various kinds of data security. Many such cryptosystems rely on computations in very large finite fields and require these computations to be fast [5, 1]. Among the basic arithmetic operations over the finite field $GF(2^m)$, addition is easily realized using $m$ two-input XOR gates, while multiplication is costly in terms of gate count and time delay.
In the past, many bit-parallel multipliers have been proposed (see, for example, [3, 9, 2, 11, 6, 10]). In [4, 3], Mastrovito proposed an algorithm along with its hardware architecture for polynomial basis (PB) multiplication. In his scheme, a binary matrix is first formed and then multiplied with a binary vector to obtain the required result. Halbutogullari and Koc have given a method for constructing the Mastrovito multiplier for arbitrary irreducible polynomials [2]. This method considers general as well as special classes of irreducible polynomials such as trinomials, all-one polynomials (AOPs) and equally-spaced polynomials (ESPs). So far, for these special polynomials, the XOR gate count and time delay of the Halbutogullari-Koc algorithm appear to be the lowest. In [11], Zhang and Parhi give a systematic method to design the Mastrovito multiplier. Moreover, in [11], the method is extended to design the modified Mastrovito multiplication scheme proposed in [8]. They also present new results on the complexities of the Mastrovito multiplier for two classes of irreducible pentanomials. Recently, Rodriguez-Henriquez and Koc [7] have proposed a PB multiplier for a special case of pentanomials and have given its time and gate complexities.
In this article, we first review the multiplication scheme and its bit-parallel architecture presented in [6]. Then, using the reduction matrix $\mathbf{Q}$, the complexities of the multiplier are obtained for a number of irreducible polynomials. We also present explicit formulations for the output coordinates of the multiplier in terms of its inputs. Such formulations can be coded directly in VHDL or Verilog to implement an efficient multiplier, even by someone who is not familiar with finite field arithmetic. It is shown that, for general irreducible polynomials, the space and time complexities of the proposed structure are lower than those available in the literature in terms of combined gate count and time delay. Furthermore, this architecture has fewer signals to be routed, which is advantageous for VLSI implementation.
Polynomial Basis Multiplication over $GF(2^m)$
Let $P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i$ be a monic irreducible polynomial over $GF(2)$ of degree $m$, where $p_i \in GF(2)$ for $i = 0, 1, \ldots, m-1$. Let $\alpha \in GF(2^m)$ be a root of $P(x)$, i.e., $P(\alpha) = 0$. Then the set $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ is referred to as the polynomial or standard basis, and each element of $GF(2^m)$ can be written with respect to (w.r.t.) the polynomial basis (PB). Let $A$ be an element of $GF(2^m)$; then the representation of $A$ w.r.t. the PB is
$$A = \sum_{i=0}^{m-1} a_i \alpha^i, \quad a_i \in GF(2),$$
where the $a_i$'s are the coordinates. For convenience, these coordinates will be denoted in vector notation as
$$\mathbf{a} = [a_0, a_1, \ldots, a_{m-1}]^T,$$
where $T$ denotes transposition. Using this vector notation, the representation of $A$ can be written as
$$A = \boldsymbol{\alpha}^T \mathbf{a}, \quad \boldsymbol{\alpha} = [1, \alpha, \alpha^2, \ldots, \alpha^{m-1}]^T.$$
Let $S$ be the binary polynomial of degree not more than $2m-2$ obtained by the direct multiplication of the PB representations of any two elements $A$ and $B$ of $GF(2^m)$, i.e.,
$$S = \Big(\sum_{i=0}^{m-1} a_i \alpha^i\Big)\Big(\sum_{j=0}^{m-1} b_j \alpha^j\Big) = \sum_{k=0}^{2m-2} s_k \alpha^k,$$
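where the coefficient of $\alpha^k$ in $S$ is
$$s_k = \sum_{\substack{i+j=k \\ 0 \le i, j \le m-1}} a_i b_j, \quad 0 \le k \le 2m-2.$$
Then, the product $C = A \cdot B$ can be obtained by the following modulo reduction:
$$C = S \bmod P(\alpha).$$
The coefficients of $S$ are split into a low-order vector $\mathbf{d}$ and a high-order vector $\mathbf{e}$:
$$\mathbf{d} = [d_0, d_1, \ldots, d_{m-1}]^T, \quad d_j = s_j, \quad 0 \le j \le m-1, \qquad (2)$$
$$\mathbf{e} = [e_0, e_1, \ldots, e_{m-2}]^T, \quad e_i = s_{m+i}, \quad 0 \le i \le m-2. \qquad (3)$$
Definition 1. [3] The reduction matrix $\mathbf{Q}$ is an $(m-1)$ by $m$ binary matrix which is obtained from
$$\alpha^{m+i} \equiv \sum_{j=0}^{m-1} q_{i,j}\, \alpha^j \pmod{P(\alpha)}, \quad 0 \le i \le m-2. \qquad (6)$$
Theorem 1. The coordinate vector $\mathbf{c}$ of the product $C = A \cdot B$ is given by
$$\mathbf{c} = \mathbf{d} + \mathbf{Q}^T \mathbf{e},$$
where $\mathbf{d}$, $\mathbf{e}$ and $\mathbf{Q}$ are defined in (2), (3), and (6), respectively.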
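The two-step scheme above (form the unreduced product, then fold the high-order part back through $\mathbf{Q}$) can be sketched in software. The following Python fragment is only an illustrative model, not the authors' implementation: it assumes that $\mathbf{d}$ and $\mathbf{e}$ are the low-order and high-order coefficient vectors of $S$, that row $i$ of $\mathbf{Q}$ holds the PB coordinates of $\alpha^{m+i}$, and that $\mathbf{c} = \mathbf{d} + \mathbf{Q}^T \mathbf{e}$ over $GF(2)$. The function names and the list-of-bits representation are hypothetical.

```python
def reduction_matrix(p):
    """Q for monic P(x) = x^m + sum(p[i] * x^i), with m = len(p) >= 2.

    Row i holds the PB coordinates of alpha^(m+i), 0 <= i <= m-2.
    """
    m = len(p)
    row, rows = list(p), []          # alpha^m = sum(p[i] * alpha^i)
    for _ in range(m - 1):
        rows.append(row)
        top = row[-1]                # coefficient of alpha^(m-1)
        shifted = [0] + row[:-1]     # multiply the current row by alpha
        row = [s ^ (top & pi) for s, pi in zip(shifted, p)]
    return rows


def pb_multiply(a, b, p):
    """Coordinates of C = A*B, computed as c = d + Q^T e over GF(2)."""
    m = len(p)
    s = [0] * (2 * m - 1)
    for i in range(m):               # unreduced product coefficients s_k
        for j in range(m):
            s[i + j] ^= a[i] & b[j]
    d, e = s[:m], s[m:]              # low part d (m bits), high part e (m-1 bits)
    q = reduction_matrix(p)
    return [d[j] ^ (sum(q[i][j] & e[i] for i in range(m - 1)) & 1)
            for j in range(m)]


# GF(2^4) with P(x) = x^4 + x + 1: alpha^3 * alpha = alpha^4 = alpha + 1.
print(pb_multiply([0, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]))  # [1, 1, 0, 0]
```

Each nonzero entry of $\mathbf{Q}$ corresponds to one term that is XORed into some $c_j$, which is why the structure of $\mathbf{Q}$ governs the cost of the reduction step.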
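The corresponding architecture for polynomial basis multiplication over $GF(2^m)$ is shown in Figure 1. This structure is divided into two parts: the IP-network and the Q-network. The IP-network has $m$ blocks (denoted as $I_0, I_1, \ldots, I_{m-1}$), which generate the vectors $\mathbf{d}$ and $\mathbf{e}$ in accordance with (2) and (3), using $m^2$ AND gates and $(m-1)^2$ XOR gates. Using (2) and (3), the delays for $d_j$, $0 \le j \le m-1$, and $e_i$, $0 \le i \le m-2$, can be calculated from
$$T(d_j) = T_A + \lceil \log_2 (j+1) \rceil\, T_X, \qquad (8)$$
$$T(e_i) = T_A + \lceil \log_2 (m-i-1) \rceil\, T_X, \qquad (9)$$
where $T_A$ and $T_X$ denote the delays of a two-input AND gate and a two-input XOR gate, respectively.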
In Figure 1, the Q-network takes $\mathbf{d}$ and $\mathbf{e}$ as inputs and generates $\mathbf{c}$. Note that the number of lines on the interconnection bus IB is fixed and is equal to the number of $e_j$'s, i.e., $m-1$. In Figure 1, there are three buses, A, B and IB, and the total number of lines on the buses is $3m-1$.
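The IP-network gate counts quoted above can be checked by counting terms. The sketch below is a hypothetical helper that assumes $d_j$ is a sum of $j+1$ partial products, that $e_i$ is a sum of $m-1-i$ partial products, and that each sum is realized as a balanced tree of two-input XOR gates; it is a counting model only, not a netlist generator.

```python
def xor_tree_depth(n_terms):
    """Levels of two-input XORs in a balanced tree: ceil(log2(n_terms))."""
    return (n_terms - 1).bit_length()


def ip_network_profile(m):
    """Return (AND gates, XOR gates, worst-case XOR levels) of the IP-network."""
    d_terms = [j + 1 for j in range(m)]            # products summed in d_j
    e_terms = [m - 1 - i for i in range(m - 1)]    # products summed in e_i
    and_gates = sum(d_terms) + sum(e_terms)        # one AND per partial product
    xor_gates = sum(n - 1 for n in d_terms + e_terms)
    worst_depth = max(xor_tree_depth(n) for n in d_terms + e_terms)
    return and_gates, xor_gates, worst_depth


print(ip_network_profile(163))  # (26569, 26244, 8)
```

For $m = 163$ this gives $163^2$ AND gates and $162^2$ XOR gates, in agreement with the $m^2$ and $(m-1)^2$ counts stated in the text.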
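In the following sections, we attempt to minimise the number of XOR gates of the Q-network for special irreducible polynomials, namely equally-spaced polynomials, trinomials, and pentanomials. We start with equally-spaced polynomials, which are highly structured and will help us present the remaining special cases with less difficulty.
Multipliers Using Equally-Spaced Polynomials
Recall that an $s$-equally-spaced polynomial ($s$-ESP) of degree $m = ns$ has the form $P(x) = x^{ns} + x^{(n-1)s} + \cdots + x^{s} + 1$.
Proof. When $\alpha$ is a root of the $s$-ESP of degree $m$ as defined above, we have
$$\alpha^{m+i} = \begin{cases} \sum_{j=0}^{n-1} \alpha^{i+js}, & 0 \le i \le s-1, \\ \alpha^{i-s}, & s \le i \le m-2, \end{cases} \qquad (10)$$
where the second case follows from $(x^s + 1) P(x) = x^{m+s} + 1$, i.e., $\alpha^{m+s} = 1$.
Using (10), the reduction matrix $\mathbf{Q}$ is obtained as
$$\mathbf{Q} = \begin{bmatrix} \begin{matrix} \mathbf{I}_s & \mathbf{I}_s & \cdots & \mathbf{I}_s \end{matrix} \\ \begin{matrix} \mathbf{I}_{m-s-1} & \mathbf{0}_{s+1} \end{matrix} \end{bmatrix}, \qquad (11)$$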
where $\mathbf{I}_j$ is the $j \times j$ identity matrix and $\mathbf{0}_{s+1}$ is a zero matrix which has $m-s-1$ rows and $s+1$ columns. The graphical representations of $\mathbf{Q}$ in (11) for different values of $s$ are shown in Figure 2. In this figure, non-zero entries of $\mathbf{Q}$ are shown as small squares. In order to obtain exact expressions for $N_X$ and $T_C$, we first obtain the coordinates of $C$. To this end, from Theorem 1 and (11), one can write
Thus, using (12) and (13), the exact XOR gate count for an $s$-ESP based multiplier is $N_X = m^2 - s$. Also, by using (8) and (9), $d_j$ of (13) can be generated with a maximum gate delay of
It is worth mentioning that the resultant number of signal lines on IB reduces from m − 1 to s, which is considerably lower than the s-ESP based Mastrovito multiplier which has
+ m signal lines [4] . Thus, the total number of lines on the buses of the multiplier is 2m + s.
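As a sanity check on the ESP results, the following sketch builds the reduction matrix of an $s$-ESP by repeated multiplication by $\alpha$ and compares it with a block structure in which the first $s$ rows repeat $\mathbf{I}_s$ across the columns and the remaining rows hold $\mathbf{I}_{m-s-1}$ followed by zeros. It then recovers the count $N_X = m^2 - s$ under the assumption that every nonzero entry of $\mathbf{Q}$ costs one two-input XOR gate on top of the $(m-1)^2$ XOR gates of the IP-network. All function names are hypothetical, and the check relies only on the form of the polynomial, not on its irreducibility.

```python
def esp_reduction_matrix(m, s):
    """Q for x^m + x^(m-s) + ... + x^s + 1, where s divides m and s < m."""
    p = [1 if i % s == 0 else 0 for i in range(m)]
    row, rows = list(p), []
    for _ in range(m - 1):
        rows.append(row)
        top = row[-1]                              # coefficient of alpha^(m-1)
        row = [x ^ (top & pi) for x, pi in zip([0] + row[:-1], p)]
    return rows


def esp_block_form(m, s):
    """Expected structure: [I_s ... I_s] stacked on [I_(m-s-1) | 0]."""
    top = [[1 if j % s == i else 0 for j in range(m)] for i in range(s)]
    bottom = [[1 if j == i - s else 0 for j in range(m)]
              for i in range(s, m - 1)]
    return top + bottom


def esp_xor_count(m, s):
    """IP-network XORs plus one XOR per nonzero entry of Q."""
    return (m - 1) ** 2 + sum(map(sum, esp_reduction_matrix(m, s)))


# x^6 + x^3 + 1 is a 3-ESP of degree 6; x^4 + x^3 + x^2 + x + 1 is an AOP (s = 1).
for m, s in [(6, 3), (4, 1)]:
    print(m, s, esp_reduction_matrix(m, s) == esp_block_form(m, s),
          esp_xor_count(m, s))
```

Read column-wise, the block form says that each $c_j$ receives $e_{j \bmod s}$ and, for $0 \le j \le m-s-2$, also $e_{j+s}$.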
Extension to More Generic Polynomials
Here we consider irreducible polynomials of the form
$$P(x) = x^m + x^{k_t} + x^{k_{t-1}} + \cdots + x^{k_1} + 1, \quad 1 \le k_1 < k_2 < \cdots < k_t \le \frac{m}{2}.$$
The Hamming weight of $P(x)$ is $t + 2$ and the degree of the second leading term is less than or equal to $m/2$. All five binary fields recommended by NIST for ECDSA can be constructed using such irreducible polynomials.
In order to apply the general formulation stated in Section 2 to these polynomials, we first obtain the corresponding $\mathbf{Q}$ matrix. Note that the rows of the $\mathbf{Q}$ matrix are the PB representations of
$$\alpha^{m+i} \bmod P(\alpha), \quad 0 \le i \le m-2.$$
Thus, the 0-th row, i.e., $i = 0$, has ones only in these $t + 1$ columns of $\mathbf{Q}$:
$$0, k_1, k_2, \ldots, k_t.$$
The consecutive rows of this matrix can be obtained by using a linear feedback shift register (LFSR). As a result, rows $i = 0$ to $m - k_t - 1$ of $\mathbf{Q}$ each have $t + 1$ ones.
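The LFSR construction of the rows of $\mathbf{Q}$ can be modelled as follows. This is a minimal sketch under the assumption that row $i$ represents $\alpha^{m+i}$ and that each step multiplies the previous row by $\alpha$, substituting $\alpha^m$ whenever the degree overflows. Rows are stored as sets of column indices, and the function name is hypothetical. The reduction polynomials in the example are those specified for the five NIST binary fields.

```python
def q_rows_lfsr(m, ks):
    """Rows of Q for P(x) = x^m + sum(x^k for k in ks) + 1.

    Each row is the set of column indices that hold a one.
    """
    taps = {0, *ks}                       # exponents of alpha^m mod P(alpha)
    row, rows = set(taps), []
    for _ in range(m - 1):
        rows.append(row)
        shifted = {c + 1 for c in row}    # multiply by alpha
        if m in shifted:                  # degree overflow: replace alpha^m
            shifted.remove(m)
            shifted ^= taps               # symmetric difference = GF(2) addition
        row = shifted
    return rows


NIST_FIELDS = {163: (3, 6, 7), 233: (74,), 283: (5, 7, 12),
               409: (87,), 571: (2, 5, 10)}

for m, ks in NIST_FIELDS.items():
    rows = q_rows_lfsr(m, ks)
    full = all(len(rows[i]) == len(ks) + 1 for i in range(m - max(ks)))
    print(m, ks, full)
```

For each of these fields, rows $0$ to $m - k_t - 1$ contain exactly $t + 1$ ones, since no reduction is triggered until the shifted copy of $x^{k_t}$ reaches degree $m$.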
The $\mathbf{Q}$ matrices for $t = 1$ and $t = 3$ (i.e., trinomials and pentanomials, respectively) are shown in the figure below.
[Figure: $\mathbf{Q}$ matrices for trinomials and for pentanomials.]
that there are no previous lines that pass through these columns. If there exists a previous line that passes through column $k_j$, $1 \le j \le t$, then the previous line terminates in column $k_j - 1$ and no new line originates from column $k_j$, owing to the XORing of the two lines. This happens in row
, (see 
and
Proof. Let us denote $\mathbf{e}^{(i)} = [e^{(i)}_0, e^{(i)}_1, \ldots, e^{(i)}_{m-1}]^T = \mathbf{Q}_i^T \mathbf{e}$, $0 \le i \le t$; then, using Theorem 1, we can obtain the coordinates of the pentanomial based multiplication as
First, let us assume $k_1 = 1$. Using $\mathbf{Q}_0$ (see Figure 4(a) for $t = 3$), the elements of $\mathbf{e}^{(0)}$ are as follows:
(15) The total number of XOR gates to form e 
This results in the coordinates of C = AB as
by using (14). To realize (18) in hardware, one requires
gates. Thus, the total XOR gates needed for the multiplier is (m−1)
To obtain the time delay of the proposed multiplier, we use a binary tree for each coordinate in (18). For $j \notin [k_t, m-2]$, it is seen in (18) that $T_C \le \lceil \log_2 (t+1) \rceil\, T_X + T(e^{(0)}_0)$, and the proof is complete by using (16). Now, we need only to obtain the time delay of the $c_j$'s for
Also, using (17) and (16), one can see
which implies that
and the proof is complete.
In addition to the three buses shown in Figure 1, there will now be another bus in the middle of the Q-network for the signals $e^{(0)}_j$, $0 \le j \le k_t - 2$. Thus, the total number of lines on the buses is $3m + k_t - 2$.
Special Classes of Pentanomials
A polynomial with five non-zero coefficients, i.e., $P(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, where $1 \le k_1 < k_2 < k_3 \le m-1$, is called a pentanomial of degree $m$. The non-zero constant term is due to the irreducibility property needed to define the field. In terms of the values of the $k_i$'s, the pentanomials can be divided into a number of different classes. Below we consider two special classes of irreducible pentanomials as proposed in [11].
Class 1:
For this class of irreducible pentanomials, where $k_3 \le m/2$, one can apply $t = 3$ to the complexity results presented in Section 4. This yields the following.
Corollary 2. The gate counts and time delay of the multiplier for the pentanomial
and the number of lines on the buses is 3m + k 3 − 2.
The number of XOR gates can be reduced if we choose a pentanomial such that $k_1 = k_3 - k_2$. To this end, let us introduce the following set of new signals
Equation (19) can be used to generate e (0)
The total number of XOR gates needed to generate the $e^{(0)}_j$'s (see (20)) is $N_1 = k_1 + k_2 + k_3 - 3$, of which $k_2 - 1$ are due to (19). Also, the maximum delay due to the gates in (20) is
(21)
Lemma 1. With symbols defined as above, one has
Let us represent $e^{(01)}_j$, $0 \le j \le m-1$, as the elements of $(\mathbf{Q}_0 + \mathbf{Q}_1)^T \mathbf{e}$, where $\mathbf{Q}_0$ and $\mathbf{Q}_1$ are shown in Figure 4(a) and Figure 4(b), respectively. Then, substituting $t = 3$ in the general case given in (18) and using the above lemma, we can obtain the coordinates of $C = AB$ as follows:
where $e^{(01)}_{j-k_2} = 0$ for $j < k_2$, and
As seen in (23), one has to realize e (0) (22) requires 2m−k 2 XOR gates. Thus, the total number of XOR gates needed for the multiplier is (m−1)
Due to the reuse of terms e j , 0 ≤ j ≤ k 2 − 1, and e (0)
, additional lines needed on the bus in the Q-network are (k 2 − 1) and (m − k 1 − k 2 ), respectively. Thus, the total number of lines on the buses is increased to 4m + k 2 − 3.
To obtain the time delay of the proposed multiplier, we use Table 1, which shows the maximum delay of the signals used in (22) for the given ranges of $j$ in each row. In this table, $i$, $0 \le i \le 4$, represents the time delay $T_A + (i + \lceil \log_2 (m-1) \rceil)\, T_X$, and the numbers inside brackets are for $k_1 = 1$. Also, x determines whether $e^{(01)}_j$ or $e^{(01)}_{j-k_2}$ is added to $d_j$ first to obtain $c_j$. In each row of this table, the delays are obtained for the first value of the given range, because the time delays of the signals used in each row decrease as $j$ increases. As seen in this table, the maximum delay of the multiplier is $T_A + (4 + \lceil \log_2 (m-1) \rceil)\, T_X$. For $k_1 = 1$, only one signal, i.e., $c_{k_3}$, has the delay $T_A + (4 + \lceil \log_2 (m-1) \rceil)\, T_X$. One can reduce this delay to $T_A + (3 + \lceil \log_2 (m-1) \rceil)\, T_X$ by using one extra XOR gate.
Based on the above results, we can state the following.
Theorem 4. The gate counts and time delay of the multiplier based on the pentanomial P
and the number of lines on the buses is 4m + k 2 − 3.
Remark 1.
To verify that class 1 irreducible pentanomials exist, we have used a Maple™ program for $m \in [160, 600]$ and have found that at least one irreducible pentanomial exists for each $m$ in this range. This is of interest to elliptic curve cryptosystem designers. In order to minimise the number of XOR gates of the multiplier, we have obtained irreducible pentanomials such that $k_1$ is minimum. We have also observed that $k_1$ is less than or equal to six for all $m$ in the above-mentioned range.
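The search described in this remark was carried out in Maple; the source does not say which irreducibility test was used. As a hedged Python counterpart, the sketch below uses Rabin's irreducibility test (a swapped-in technique) on polynomials over $GF(2)$ encoded as integers, with bit $i$ holding the coefficient of $x^i$. It returns the first class 1 pentanomial, i.e., one with $k_3 \le m/2$, scanning $k_1$ in increasing order so that $k_1$ is minimal. All function names are hypothetical.

```python
def gf2_mod(a, f):
    """Remainder of a modulo f, both polynomials over GF(2) as integers."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a


def gf2_mulmod(a, b, f):
    """a * b mod f, assuming a is already reduced modulo f."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a = gf2_mod(a << 1, f)
    return r


def gf2_gcd(a, b):
    while b:
        a, b = b, gf2_mod(a, b)
    return a


def prime_factors(n):
    out, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            out.add(p)
            n //= p
        p += 1
    if n > 1:
        out.add(n)
    return out


def is_irreducible(f):
    """Rabin's test for a polynomial f over GF(2) of degree n >= 2."""
    n = f.bit_length() - 1

    def frobenius(k):                     # x^(2^k) mod f by repeated squaring
        r = gf2_mod(2, f)
        for _ in range(k):
            r = gf2_mulmod(r, r, f)
        return r

    if frobenius(n) != gf2_mod(2, f):
        return False
    return all(gf2_gcd(f, frobenius(n // p) ^ 2) == 1
               for p in prime_factors(n))


def first_class1_pentanomial(m):
    """First irreducible (k1, k2, k3) with k3 <= m/2, smallest k1 first."""
    for k1 in range(1, m // 2 - 1):
        for k2 in range(k1 + 1, m // 2):
            for k3 in range(k2 + 1, m // 2 + 1):
                f = (1 << m) | (1 << k3) | (1 << k2) | (1 << k1) | 1
                if is_irreducible(f):
                    return k1, k2, k3
    return None


print(first_class1_pentanomial(8))  # (1, 3, 4), i.e. x^8 + x^4 + x^3 + x + 1
```

The same `is_irreducible` helper confirms that $x^{163} + x^7 + x^6 + x^3 + 1$, the NIST polynomial for $m = 163$, is irreducible and satisfies $k_3 \le m/2$.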
It is noted that the pentanomial presented in [7] is a special case when k 1 = 1.
Class 2:
We refer to polynomials (or $1 \le k_1 \le 5s + 1$) are shown in Figure 5. Based on this figure, we can state the following theorem.
[Figure 5: ... or $2s + 1 < k_1 \le 5s + 1$.]
Theorem 5. The gate counts and the time delay of the multiplier for the pen-
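) $T_X$
Reference | Condition | #XOR | Time delay
This paper | $k_1 = 1$ | $m^2 + 2m - 3$ | $T_A + (3 + \lceil \log_2 (m-1) \rceil)\, T_X$
This paper | $k_3 - k_2 = k_1$ | $m^2 + m + k_1 - 2$ | $T_A + (4 + \lceil \log_2 (m-1) \rceil)\, T_X$
[7] | $k_3 - k_2 = k_1 = 1$ | $m^2 + m + 2k_2$ | $T_A + (3 + \lceil \log_2 (m-1) \rceil)\, T_X$
This paper | $k_3 - k_2 = k_1 = 1$ | $m^2 + m$ | $T_A + (3 + \lceil \log_2 (m-1) \rceil)\, T_X$
This paper, [7]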
Complexity Results and Concluding Remarks
In this article, time and space complexities of bit-parallel multipliers for $GF(2^m)$ have been considered. A comparison of our newly derived gate counts and delays with existing ones is shown in Table 2. As seen in this table, for the trinomial $x^m + x + 1$, the multiplier of Figure 1 has one additional XOR gate delay compared to the best one available in the literature, i.e., [2, 11]. However, our results for the ESPs and trinomials ($k \ne 1$) match the corresponding best results available ([2, 11] and [9]). Also, the resultant gate and time complexities for trinomials match those presented in [10].
For a more generic irreducible polynomial as discussed in Section 4, the multiplier in Figure 1 has the same gate count but a shorter time delay compared to [11]. For class 1 pentanomials, this multiplier is faster than [11] and has fewer XOR gates if the special case of $k_3 - k_2 = k_1$ is used. This proposed special case of class 1 covers the case of pentanomials reported in [7], where $k_1 = 1$. Compared to the multiplier proposed in [7], the multiplier discussed in this paper for the special case of $k_1 = k_3 - k_2 = 1$ has $2k_2$ fewer XOR gates and matches the one proposed in [7] for $k_1 = 1$ and $k_2 = 2$. Also, for class 2 pentanomials, our multiplier is either faster or has the same gate delay, and has at least $1.33m - 7$ fewer XOR gates than the multiplier reported in [11].
In VLSI implementation, in addition to the gate counts, the number of lines on the buses is also an important parameter, which determines the space complexity and consequently the actual time delay. Table 3 compares this metric of the proposed architecture with that of the Mastrovito multiplier [4]. As shown in this table, the architectures discussed here have fewer lines on the buses than the well-known Mastrovito multiplier.
