We present a new sequential normal basis multiplier over $GF(2^m)$. The gate complexity of our multiplier is significantly reduced from that of Agnew et al. and is comparable to that of Reyhani-Masoleh and Hasan, which has the lowest complexity among normal basis multipliers of the same kind. On the other hand, the critical path delay of our multiplier is the same as that of Agnew et al. Therefore it is expected to have a critical path delay shorter than or equal to that of Reyhani-Masoleh and Hasan. Moreover, our method of using a Gaussian normal basis makes it easy to find a basic multiplication table of normal elements. So one can easily construct a circuit array for large finite fields $GF(2^m)$ with $m = 163, 233, 283, 409, 571$, i.e. the five fields recommended by NIST for elliptic curve cryptography.
Introduction
Finite field multiplication has many applications in cryptography, such as ECC and AES. Though one may implement a finite field multiplier in software, a hardware arrangement has a strong advantage when a high speed multiplier is wanted. Moreover, arithmetic in $GF(2^m)$ is easily realized in a circuit design using a few logic gates. A good multiplication algorithm depends on the choice of a basis for a given finite field. In particular, a normal basis is widely used [5, 10, 11] because it has good properties such as simple squaring. Multipliers in $GF(2^m)$ can be classified into two types, parallel (two dimensional) [4, 5, 8, 10] and sequential (linear) [1, 3, 9, 11] architectures.
Though a parallel multiplier is well suited for high speed applications, ECC requires a large $m$ for $GF(2^m)$ (at least $m = 163$) to provide sufficient security. In other words, since the parallel architecture has an area complexity of $O(m^2)$, it is not suited for this application. On the other hand, a sequential multiplier has an area complexity of $O(m)$ and therefore is applicable for ECC. Since it takes $m$ clock cycles to produce one multiplication result using a sequential multiplier, it is slower than a parallel multiplier. Consequently, reducing the total delay time of a sequential multiplier is very important.
The normal basis multiplier of Massey and Omura [7] has a parallel-in, serial-out structure and has a quite long critical path delay proportional to $\log_2 m$. Agnew et al. [1] proposed a sequential multiplier which has a parallel-in, parallel-out structure. It is based on the multiplication algorithm of Massey and Omura; however, the critical path delay of the multiplier of Agnew et al. is significantly reduced from that of Massey and Omura. Recently, Reyhani-Masoleh and Hasan [3] presented two sequential multipliers using a symmetric property of multiplication of normal elements. Both multipliers in [3] have roughly the same area complexity and critical path delay. These multipliers have a reduced area complexity compared with that of Agnew et al., with a slightly increased critical path delay. In fact, the exact critical path delay of the multipliers of Reyhani-Masoleh and Hasan is difficult to estimate in terms of $m$ and is generally believed to be slightly longer than or equal to that of Agnew et al. For example, for the case of a type II ONB, the critical path delay of Reyhani-Masoleh and Hasan [3] is $T_A + 3T_X$ while that of Agnew et al. [1] is $T_A + 2T_X$, where $T_A$, $T_X$ are the delay times of a two input AND gate and a two input XOR gate, respectively. However, since we are dealing with a sequential (linear) multiplier, even a small increment of critical path delay such as $T_X$ results in a total delay of $mT_X$, where $m$ is the size of the field.
Our aim in this paper is to present a sequential multiplier using a Gaussian normal basis in $GF(2^m)$ for odd $m$. Since choosing an odd $m$ is a necessary condition for cryptographic purposes and since a low complexity normal basis is frequently a Gaussian normal basis of type $(m, k)$ for low $k$, our restriction in this paper does not cause any serious problem for practical purposes. In fact all five fields $GF(2^m)$ recommended by NIST [16] for ECC, where $m = 163, 233, 283, 409, 571$, can be handled using our Gaussian normal basis, and the corresponding circuits are easy to construct if one follows a simple arithmetic rule of a Gaussian normal basis. We will show that the area complexity of our sequential multiplier is reduced from that of the multiplier of Agnew et al. [1] and is comparable to that of the multiplier of Reyhani-Masoleh and Hasan [3]. Moreover, the critical path delay of our multiplier is the same as that of Agnew et al. and therefore is believed to be shorter than or equal to that of Reyhani-Masoleh and Hasan.
Review of the Multipliers of Agnew et al. and Reyhani-Masoleh and Hasan
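Let $GF(2^m)$ be a finite field of characteristic two. $GF(2^m)$ is a vector space of dimension $m$ over $GF(2)$. A basis of the form $\{\alpha, \alpha^2, \alpha^{2^2}, \cdots, \alpha^{2^{m-1}}\}$ is called a normal basis for $GF(2^m)$. It is well known [6] that a normal basis exists for all $m \ge 1$. Let $\{\alpha_0, \alpha_1, \cdots, \alpha_{m-1}\}$ be a normal basis in $GF(2^m)$ with $\alpha_i = \alpha^{2^i}$. Let
$$\alpha_i\alpha_j = \sum_{s=0}^{m-1} \lambda_{ij}^{(s)}\alpha_s, \tag{1}$$
where $\lambda_{ij}^{(s)}$ is in $GF(2)$. Then for any integer $t$, we have
$$\alpha_i\alpha_j = (\alpha_{i-t}\alpha_{j-t})^{2^t} = \sum_{s=0}^{m-1} \lambda_{i-t,j-t}^{(s)}\alpha_{s+t} = \sum_{s=0}^{m-1} \lambda_{i-t,j-t}^{(s-t)}\alpha_s, \tag{2}$$
where the subscripts and superscripts of $\lambda$ are reduced (mod $m$). Therefore comparing the coefficients of $\alpha_s$, we find
$$\lambda_{ij}^{(s)} = \lambda_{i-t,j-t}^{(s-t)}. \tag{3}$$
In particular, we have
$$\lambda_{ij}^{(s)} = \lambda_{i-s,j-s}^{(0)}. \tag{4}$$
Let $A = \sum_{i=0}^{m-1} a_i\alpha_i$ and $B = \sum_{j=0}^{m-1} b_j\alpha_j$ be two elements of $GF(2^m)$, so that
$$C = AB = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_i b_j \alpha_i\alpha_j = \sum_{s=0}^{m-1} c_s\alpha_s. \tag{5}$$
Therefore, using (4), we have the coefficients $c_s$ of $C = AB$ as
$$c_s = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_i b_j \lambda_{ij}^{(s)} = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_i b_j \lambda_{i-s,j-s}^{(0)} = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_{i+s} b_{j+s} \lambda_{ij}^{(0)}, \tag{6}$$
where the subscripts of $a$, $b$ and $\lambda$ are reduced (mod $m$). The circuit of Agnew et al. [1] is a straightforward realization of the above equation with the information of the $m$ by $m$ matrix $(\lambda_{ij}^{(0)})$. When there is a type II ONB (optimal normal basis), it is easy to find $\lambda_{ij}^{(0)}$ as is explained in [1]. That is, one finds $\lambda_{ij}^{(0)}$ by following a simple arithmetic rule. A Gaussian normal basis and a type II ONB will be discussed briefly in the following sections.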
Recently, Reyhani-Masoleh and Hasan [3] suggested a new normal basis multiplication algorithm which significantly reduces the area complexity compared with the multiplier of Agnew et al. They used $\alpha\alpha_i$ instead of $\alpha_i\alpha_j$ and exploited the symmetry between $\alpha\alpha_i$ and $\alpha\alpha_{m-i}$. In fact they proposed two sequential multiplication architectures, the so-called XESMPO and AESMPO [3]. Since the hardware complexity of AESMPO is higher than that of XESMPO and both architectures have the same critical path delay, we will sketch the idea in [3, 4] for the case of XESMPO. In [3, 4], the multiplication $C = AB$ is expressed as
When m is odd, the second term of the right side of the above equation is written as
and when m is even, it is written as
where $\nu = \lfloor \frac{m-1}{2} \rfloor$, i.e. $m = 2\nu + 1$ or $m = 2\nu + 2$. Also the second term of (9) and (10) is written as
where the first (resp. second) equality comes from the rearrangement of the summation with respect to $j$ (resp. $i$) and all the subscripts are reduced (mod $m$). Therefore we have the basic multiplication formula of Reyhani-Masoleh and Hasan, depending on whether $m$ is odd or even, as
or
(13)
Using these formulas, they derived a sequential multiplier where the gate complexity is significantly reduced from that of [1]. The circuit of the multiplier is shown in Figure 2 for $m = 5$ where a type II ONB is used.
3 Gaussian Normal Basis of Type k in GF(2^m)
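We will briefly explain the basic multiplication principle in $GF(2^m)$ with a Gaussian normal basis of type $k$ over $GF(2)$ (see [6, 12]). Let $m, k$ be positive integers such that $p = mk + 1$ is a prime $\neq 2$. Let $K = \langle\tau\rangle$ be the unique subgroup of order $k$ in $GF(p)^\times$. Let $\beta$ be a primitive $p$th root of unity in $GF(2^{mk})$. The following element
$$\alpha = \sum_{j=0}^{k-1} \beta^{\tau^j} \tag{14}$$
is called a Gauss period of type $(m, k)$ over $GF(2)$. Let $\mathrm{ord}_p 2$ be the order of 2 (mod $p$) and assume $\gcd(mk/\mathrm{ord}_p 2, m) = 1$. Then it is well known [6] that $\alpha$ is a normal element in $GF(2^m)$. That is, letting $\alpha_i = \alpha^{2^i}$ for $0 \le i \le m-1$, we have a normal basis $\{\alpha_0, \alpha_1, \cdots, \alpha_{m-1}\}$ for $GF(2^m)$ over $GF(2)$. It is called a Gaussian normal basis of type $k$ or $(m, k)$ in $GF(2^m)$. Since $K = \langle\tau\rangle$ is a subgroup of order $k$ in the cyclic group $GF(p)^\times$, the quotient group $GF(p)^\times/K$ is also a cyclic group of order $m$ and the generator of the group is $2K$. Therefore we have a coset decomposition of $GF(p)^\times$ as a disjoint union,
$$GF(p)^\times = K_0 \cup K_1 \cup \cdots \cup K_{m-1}, \tag{15}$$
where $K_i = 2^i K$, $0 \le i \le m-1$, so that every element of $GF(p)^\times$ is uniquely written as $\tau^s 2^t$ for some $0 \le s \le k-1$ and $0 \le t \le m-1$. For each $0 \le i \le m-1$, we have
$$\alpha\alpha_i = \sum_{s=0}^{k-1}\sum_{t=0}^{k-1} \beta^{\tau^s}\beta^{\tau^t 2^i} = \sum_{s=0}^{k-1}\sum_{t=0}^{k-1} \beta^{\tau^s(1+\tau^{t-s}2^i)} = \sum_{s=0}^{k-1}\sum_{t=0}^{k-1} \beta^{\tau^s(1+\tau^{t}2^i)}. \tag{16}$$
From (15), there are unique $0 \le u \le k-1$ and $0 \le v \le m-1$ such that $1 + \tau^u 2^v = 0 \in GF(p)$. If $t \neq u$ or $i \neq v$, then $1 + \tau^t 2^i \in K_{\sigma(t,i)}$ for some $0 \le \sigma(t,i) \le m-1$ depending on $t$ and $i$, so we may write $1 + \tau^t 2^i = \tau^{t'} 2^{\sigma(t,i)}$ for some $t'$. Now when $i \neq v$,
$$\alpha\alpha_i = \sum_{t=0}^{k-1}\sum_{s=0}^{k-1} \beta^{\tau^{s+t'} 2^{\sigma(t,i)}} = \sum_{t=0}^{k-1} \alpha_{\sigma(t,i)}. \tag{17}$$
Also when $i = v$,
$$\alpha\alpha_v = \sum_{t \neq u} \alpha_{\sigma(t,v)} + k. \tag{18}$$
Therefore $\alpha\alpha_i$ is computed by the sum of at most $k$ basis elements in $\{\alpha_0, \alpha_1, \cdots, \alpha_{m-1}\}$ for $i \neq v$, and $\alpha\alpha_v$ is computed by the sum of at most $k-1$ basis elements and the constant term $k \equiv 0, 1 \in GF(2)$.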
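The coset rule behind (17) and (18) is easy to mechanize. The following Python sketch builds the rows of the multiplication table of $\alpha\alpha_i$ for a Gaussian normal basis of type $(m, k)$. The function name `gnb_table` and the brute-force search for $\tau$ are illustrative choices, not taken from the paper, and $p = mk + 1$ is assumed to be prime.

```python
def gnb_table(m, k):
    """table[i][j] = 1 iff alpha_j occurs in the product alpha * alpha_i."""
    p = m * k + 1  # assumed to be prime; not checked here
    # tau: any element of multiplicative order k in GF(p)^x (brute-force search)
    tau = next(t for t in (pow(g, m, p) for g in range(2, p))
               if len({pow(t, s, p) for s in range(k)}) == k)
    K = [pow(tau, s, p) for s in range(k)]
    # coset[x] = i means that x lies in K_i = 2^i K
    coset = {(pow(2, i, p) * x) % p: i for i in range(m) for x in K}
    if len(coset) != p - 1:
        raise ValueError("2K does not generate GF(p)^x / K: no GNB of this type")
    table = []
    for i in range(m):
        row = [0] * m
        for x in K:
            y = (1 + pow(2, i, p) * x) % p
            if y == 0:
                # constant term k of (18); in GF(2^m), 1 = alpha_0 + ... + alpha_{m-1}
                if k % 2:
                    row = [bit ^ 1 for bit in row]
            else:
                row[coset[y]] ^= 1  # addition in GF(2)
        table.append(row)
    return table


# Print, for each i, the indices j of the basis elements in alpha * alpha_i.
for i, row in enumerate(gnb_table(7, 4)):
    print(i, [j for j, bit in enumerate(row) if bit])
```

For type $(5, 2)$ and $(7, 4)$ the rows produced this way agree with the products worked out by hand in Examples 1 and 2.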
The five binary fields $GF(2^m)$ with $m = 163, 233, 283, 409, 571$ suggested by NIST [16] for ECDSA all have prime $m$. Therefore it is not a serious restriction to assume that $m$ is odd for a fast multiplication algorithm if one is interested in this kind of application.
For odd values of $m$, it is well known [15] that a Gaussian normal basis of type $k$ or $(m, k)$ always exists for some $k \ge 1$. Since $mk + 1$ is a prime with $m$ odd, it follows that $k$ is even. Thus it is enough to study the multiplication in $GF(2^m)$ for odd $m$ with a Gaussian normal basis of type $k$ for even $k$. To derive a low complexity architecture, in view of the multiplication formulas (17) and (18), one should choose a small $k$, i.e. a low complexity Gaussian normal basis. The least possible even $k$ is $k = 2$. This is the so-called type II ONB (optimal normal basis), or more specifically a type 2 Gaussian normal basis. Among the five binary fields recommended by NIST, $m = 233$ is the only case where a type II ONB exists. On the other hand, the lowest complexity Gaussian normal bases for the rest of the fields are of type 4 when $m = 163, 409$, of type 6 when $m = 283$, and of type 10 when $m = 571$ (see [12]). Let $\{\alpha_0, \alpha_1, \cdots, \alpha_{m-1}\}$ be any normal basis in $GF(2^m)$ with $\alpha_i = \alpha^{2^i}$ and let
$$\alpha\alpha_i = \sum_{j=0}^{m-1} \lambda_{ij}\alpha_j, \tag{19}$$
where $\lambda_{ij}$ is in $GF(2)$. Taking repeated powers of 2 for both sides of the above equation, one finds
$$\lambda_{ij}^{(s)} = \lambda_{i-j,s-j}, \tag{20}$$
where the subscripts are reduced (mod $m$). Finding $\lambda_{ij}$ may not be so easy unless one has sufficient information on the given normal basis. Also note that $(\lambda_{ij}^{(s)})$ is a symmetric matrix but $(\lambda_{ij})$ is not in general. However, it turns out that $(\lambda_{ij})$ is a symmetric matrix if a Gaussian normal basis of type $k$ with $k$ even is used. More precisely, we have the following.
Lemma 1. If $\{\alpha_0, \alpha_1, \cdots, \alpha_{m-1}\}$ is a Gaussian normal basis of type $k$ where $k$ is even, then we have $\lambda_{ij} = \lambda_{ij}^{(0)}$ for all $0 \le i, j \le m-1$.
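Proof. From (20), it is enough to show that $\lambda_{ij} = \lambda_{i-j,-j}$. From the formulas (17) and (18), it is clear that $\lambda_{ij} = 1$ if and only if there is an odd number of pairs $(s, s')$ (mod $k$) such that
$$1 + \tau^s 2^i = \tau^{s'} 2^j, \tag{21}$$
where $\langle\tau\rangle$ is the unique multiplicative subgroup of order $k$ in $GF(p)^\times$ with $p = mk + 1$. Let $S$ be the set of all pairs $(s, s')$ (mod $k$) satisfying (21) and in the same way define $T$ as the set of all pairs $(t, t')$ (mod $k$) satisfying $1 + \tau^t 2^{i-j} = \tau^{t'} 2^{-j}$.
To prove $\lambda_{ij} = \lambda_{i-j,-j}$, it suffices to show that the sets $S$ and $T$ have the same cardinality. Dividing both sides of the equation (21) by $\tau^{s'} 2^j$, we get
$$\tau^{-s'} 2^{-j} + \tau^{s-s'} 2^{i-j} = 1. \tag{22}$$
Since the order of $\tau$ is $k$ where $k$ is even, we have $-1 = \tau^{k/2}$ and therefore
$$1 + \tau^{s-s'+\frac{k}{2}} 2^{i-j} = \tau^{-s'} 2^{-j}. \tag{23}$$
Since the map $f : S \to T$ defined by $f(s, s') = (s - s' + \frac{k}{2}, -s')$ is a bijection (the same formula applied to $T$ gives its inverse), the sets $S$ and $T$ have the same cardinality.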
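Lemma 1 can be checked numerically for small parameters. The sketch below recomputes the table $(\lambda_{ij})$ from the coset rule and tests both the symmetry $\lambda_{ij} = \lambda_{ji}$ and the relation $\lambda_{ij} = \lambda_{i-j,-j}$ used in the proof. The helper names `gnb_rows` and `check_lemma1` are hypothetical, and the code assumes $k$ even and $p = mk + 1$ prime.

```python
def gnb_rows(m, k):
    """rows[i] = set of j with lambda_ij = 1 (k even, p = mk + 1 assumed prime)."""
    p = m * k + 1
    tau = next(t for t in (pow(g, m, p) for g in range(2, p))
               if len({pow(t, s, p) for s in range(k)}) == k)
    K = [pow(tau, s, p) for s in range(k)]
    coset = {(pow(2, i, p) * x) % p: i for i in range(m) for x in K}
    assert len(coset) == p - 1, "no Gaussian normal basis of this type"
    rows = []
    for i in range(m):
        row = set()
        for x in K:
            y = (1 + pow(2, i, p) * x) % p
            if y:                    # y == 0 gives the constant k = 0 in GF(2)
                row ^= {coset[y]}    # symmetric difference = addition mod 2
        rows.append(row)
    return rows


def check_lemma1(m, k):
    """True iff (lambda_ij) is symmetric and lambda_ij = lambda_{i-j,-j} (mod m)."""
    lam = gnb_rows(m, k)
    pairs = [(i, j) for i in range(m) for j in range(m)]
    symmetric = all((j in lam[i]) == (i in lam[j]) for i, j in pairs)
    shifted = all((j in lam[i]) == ((-j) % m in lam[(i - j) % m]) for i, j in pairs)
    return symmetric and shifted


for m, k in [(3, 2), (5, 2), (7, 4), (9, 2), (11, 2)]:
    print((m, k), check_lemma1(m, k))
```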
Construction of a sequential multiplier and complexity analysis
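Now from (6) and Lemma 1, the coefficients $c_s$ of $C = AB$ are written as
$$c_s = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_{i+s} b_{j+s} \lambda_{ij} = \sum_{t=0}^{m-1}\Big(\sum_{i=0}^{m-1} a_{i+s}\lambda_{it}\Big) b_{t+s}. \tag{24}$$
Let us define an element $x_{st}$, $0 \le s, t \le m-1$, in $GF(2)$ as
$$x_{st} = \Big(\sum_{i=0}^{m-1} a_{i+s}\lambda_{it}\Big) b_{t+s}, \tag{25}$$
with corresponding matrix $X = (x_{st})$. Then the $t$th column vector $X_t$ of $X$ is
$$X_t = (x_{0t}, x_{1t}, \cdots, x_{m-1,t})^T, \tag{26}$$
where $(x_{0t}, x_{1t}, \cdots, x_{m-1,t})^T$ is the transpose of the row vector $(x_{0t}, x_{1t}, \cdots, x_{m-1,t})$. Also the sum of all column vectors $X_t$, $t = 0, 1, \cdots, m-1$, is exactly
$$\sum_{t=0}^{m-1} X_t = (c_0, c_1, \cdots, c_{m-1})^T, \tag{27}$$
because $\sum_{t=0}^{m-1} x_{st} = c_s$. Our purpose is to reduce the gate complexity of our multiplier by rearranging the column vectors $X_t$ and reusing partial sums in the computation. Let $m - 1 = 2\nu$ and define the $m$ by $m$ matrix $Y = (y_{st})$ by the following permutation of the column vectors of $X$. When $\nu$ is odd, define $Y$ as
$$Y = (X_\nu, \cdots, X_3, X_1, X_{m-1}, X_{m-3}, \cdots, X_{m-\nu}, X_{\nu-1}, \cdots, X_2, X_0, X_{m-2}, \cdots, X_{m-\nu+1}), \tag{28}$$
and when $\nu$ is even, $Y$ is defined as
$$Y = (X_\nu, \cdots, X_2, X_0, X_{m-2}, \cdots, X_{m-\nu}, X_{\nu-1}, \cdots, X_3, X_1, X_{m-1}, X_{m-3}, \cdots, X_{m-\nu+1}). \tag{29}$$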
Then the sum of all column vectors $Y_t$, $0 \le t \le m-1$, of $Y$ with $Y_t = (y_{0t}, y_{1t}, \cdots, y_{m-1,t})^T$ is the same as the sum of all column vectors $X_t$, $0 \le t \le m-1$, of $X$, which is $(c_0, c_1, \cdots, c_{m-1})^T$.
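The column permutation defining $Y$ can be written down directly from the two cases above. The following Python sketch (the function name `y_column_order` is hypothetical) returns the list of indices $t$ such that the $j$th column of $Y$ is $X_t$, and checks that $X_t$ and $X_{m-t}$ end up exactly $t$ positions apart, i.e. with $t-1$ columns between them.

```python
def y_column_order(m):
    """Indices t of the columns X_t in the order in which they appear in Y (m odd)."""
    nu = (m - 1) // 2
    if nu % 2 == 1:
        return (list(range(nu, 0, -2))                # nu, ..., 3, 1
                + list(range(m - 1, m - nu - 1, -2))  # m-1, m-3, ..., m-nu
                + list(range(nu - 1, -1, -2))         # nu-1, ..., 2, 0
                + list(range(m - 2, m - nu, -2)))     # m-2, ..., m-nu+1
    return (list(range(nu, -1, -2))                   # nu, ..., 2, 0
            + list(range(m - 2, m - nu - 1, -2))      # m-2, ..., m-nu
            + list(range(nu - 1, 0, -2))              # nu-1, ..., 3, 1
            + list(range(m - 1, m - nu, -2)))         # m-1, m-3, ..., m-nu+1


def gaps_ok(m):
    """True iff X_{m-t} sits exactly t columns to the right of X_t for 1 <= t <= nu."""
    order = y_column_order(m)
    pos = {t: j for j, t in enumerate(order)}
    return all(pos[m - t] - pos[t] == t for t in range(1, (m - 1) // 2 + 1))


print(y_column_order(5))  # [2, 0, 3, 1, 4]
print(y_column_order(7))  # [3, 1, 6, 4, 2, 0, 5]
```

In both cases the order amounts to starting at $\nu$ and stepping by $-2$ modulo $m$, which is a compact way to generate it for large $m$.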
To derive a parallel-in, parallel-out multiplication architecture, we will compute the sum of shifted diagonal vectors of $Y$, instead of computing the sum of column vectors of $Y$. This can be done from the following observations. In the expression of the matrix $Y$, there are exactly $t - 1$ columns between the vectors $X_t$ and $X_{m-t}$. Also, the $s$th entry of $X_t$ and the $(s+t)$th entry of $X_{m-t}$ have the same terms of $a_i$'s in their summands. In other words, from (25), we have
$$x_{s+t,m-t} = \Big(\sum_{i=0}^{m-1} a_{i+s+t}\lambda_{i,m-t}\Big) b_{s+m} = \Big(\sum_{i=0}^{m-1} a_{i+s+t}\lambda_{i,-t}\Big) b_{s} = \Big(\sum_{i=0}^{m-1} a_{i+s}\lambda_{i-t,-t}\Big) b_{s} = \Big(\sum_{i=0}^{m-1} a_{i+s}\lambda_{it}\Big) b_{s}, \tag{30}$$
where the third expression comes from the rearrangement of the summation on the subscript $i$ and the last expression comes from Lemma 1 saying $\lambda_{ij} = \lambda_{i-j,-j}$. Thus $x_{st}$ and $x_{s+t,m-t}$ have the same term $\sum_{i=0}^{m-1} a_{i+s}\lambda_{it}$ in their expression, and this saves XOR gates during the computation of $AB$.
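A small software model makes the sharing of XOR sums concrete. The sketch below works in $GF(2^5)$ with the type II ONB; its table is hard-coded from the rule $\lambda_{ij} = 1$ if and only if $1 \pm 2^i \equiv \pm 2^j \pmod{11}$. The names `LAM`, `a_indices`, `x` and `multiply` are illustrative, and the code models only the arithmetic of (24) and (25), not the register timing of the circuit.

```python
M = 5
# LAM[i] = {j : alpha_j occurs in alpha * alpha_i} for the type II ONB of GF(2^5)
LAM = [{1}, {0, 3}, {3, 4}, {1, 2}, {2, 4}]


def a_indices(s, t):
    """Indices of the a-coefficients XORed inside x_st = (sum_i a_{i+s} lambda_it) b_{t+s}."""
    return {(i + s) % M for i in range(M) if t in LAM[i]}


def x(s, t, a, b):
    """The bit x_st of (25) for coefficient lists a and b."""
    acc = 0
    for idx in a_indices(s, t):
        acc ^= a[idx]
    return acc & b[(t + s) % M]


def multiply(a, b):
    """Coefficients c_s = sum_t x_st over GF(2), as in (24)."""
    c = []
    for s in range(M):
        bit = 0
        for t in range(M):
            bit ^= x(s, t, a, b)
        c.append(bit)
    return c


# x_st and x_{s+t, m-t} use the same XOR of a-coefficients
shared = all(a_indices(s, t) == a_indices((s + t) % M, (M - t) % M)
             for s in range(M) for t in range(M))
print(shared)
print(multiply([1, 0, 0, 0, 0], [0, 1, 0, 0, 0]))  # alpha_0 * alpha_1 = alpha_0 + alpha_3
```

In hardware, each shared set of $a$-indices corresponds to one XOR tree feeding two AND gates, which is where the saving over the multiplier of Agnew et al. comes from.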
1. $A = \sum_{i=0}^{m-1} a_i\alpha_i$ and $B = \sum_{i=0}^{m-1} b_i\alpha_i$ are loaded in $m$-bit registers respectively. Also the intermediate values $D_0, D_1, \cdots, D_{m-1}$ of the multiplication are all set to zero.
2. For $t = 0$ to $m - 1$, do the following;
where the above computation is done in parallel for all 0 ≤ s ≤ m − 1. 
After mth iteration, we have
Note that $y_{s-1,s}$ and $y_{ss}$, $0 \le s \le m-1$, in the equation (32), are from the same column $Y_s$ of the matrix $Y$. Since $Y$ is obtained by a column permutation of the matrix $X$, we conclude that $y_{s-1,s} = x_{s-1,s'}$ and $y_{ss} = x_{ss'}$ for some $s'$ depending on $s$. Moreover from the equation (25), we get
$$x_{s-1,s'} = \Big(\sum_{i=0}^{m-1} a_{i+s-1}\lambda_{is'}\Big) b_{s'+s-1},$$
which implies that $x_{s-1,s'}$ $(= y_{s-1,s})$ is obtained from the expression $x_{s,s'}$ $(= y_{s,s})$ by a right cyclic shift by one position of the vectors of the $a_i$'s and $b_i$'s. Since this can be done without any extra cost, the only gates needed to construct a circuit from the algorithm in Table 1 are those needed to compute the first (i.e. $t = 0$) clock cycle of step 2 of the algorithm,
Recall that, for each $s$, there is a corresponding (because of a permutation) $s'$ such that
$$y_{ss} = x_{ss'} = \Big(\sum_{i=0}^{m-1} a_{i+s}\lambda_{is'}\Big) b_{s'+s}. \tag{36}$$
If $s' \neq 0$, i.e. if $x_{ss'}$ is not in the 0th column of $X$, then from the equations (25) and (30), we find that the XOR gates needed to compute $x_{ss'}$ and $x_{s+s',m-s'}$ (which are diagonal entries of the matrix $Y$) can be shared. Note that $x_{ss'} = (\sum_{i=0}^{m-1} a_{i+s}\lambda_{is'}) b_{s'+s}$ can be computed by one AND gate and at most $k - 1$ XOR gates, since the multiplication matrix $(\lambda_{ij})$ of a Gaussian normal basis of type $k$ has at most $k$ nonzero entries in each column (row) in view of the equation (17). Thus the total number of gates needed to compute all $y_{ss} = x_{ss'}$ with $s' \neq 0$ is $m - 1$ AND gates plus $\frac{m-1}{2}(k-1)$ XOR gates. When $s' = 0$, the number of nonzero entries of $\lambda_{i0}$, $0 \le i \le m-1$, is one because $\alpha\alpha_0 = \alpha^2 = \alpha_1$. Therefore we need one AND gate and no XOR gate to compute $x_{ss'}$ with $s' = 0$. Since the addition $D_s + y_{ss}$, $0 \le s \le m-1$, in (35) needs one XOR gate for each $0 \le s \le m-1$, the total gate complexity of our multiplier is $m$ AND gates plus at most $m + \frac{m-1}{2}(k-1)$ XOR gates.
The critical path delay can also be evaluated easily. It is clear from (35) and (36) that the critical path delay is at most $T_A + (1 + \lceil \log_2 k \rceil) T_X$.
We compare our sequential multiplier with other multipliers of the same kind in Table 2. In the table, $C_N$ denotes the number of nonzero entries in the matrix $(\lambda_{ij}^{(0)})$. It is well known [6] that $C_N \le mk + m - k$ if $k$ is odd and $C_N \le mk - 1$ if $k$ is even. In our case of $GF(2^m)$ with $m$ odd, it is easy to see from (17) and (18) that $C_N$ has the stronger bound $C_N \le mk - k + 1$. Consequently the circuit in [3] and our multiplier have more or less the same hardware complexity.
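The gate count and delay bounds above are simple closed forms, so they can be tabulated for the NIST fields. The sketch below evaluates $m$ AND gates, at most $m + \frac{m-1}{2}(k-1)$ XOR gates, and a critical path of $T_A + (1 + \lceil \log_2 k \rceil) T_X$. The function name `gnb_multiplier_cost` is hypothetical, and the $(m, k)$ pairs are the lowest complexity types quoted in the text.

```python
def gnb_multiplier_cost(m, k):
    """Return (AND gates, XOR gate upper bound, XOR levels on the critical path)."""
    and_gates = m
    xor_gates = m + (m - 1) // 2 * (k - 1)
    xor_levels = 1 + (k - 1).bit_length()  # 1 + ceil(log2 k) for k >= 1
    return and_gates, xor_gates, xor_levels


# Lowest complexity Gaussian normal basis type for each NIST field, as quoted in the text.
NIST_TYPES = {163: 4, 233: 2, 283: 6, 409: 4, 571: 10}

for m, k in NIST_TYPES.items():
    ands, xors, levels = gnb_multiplier_cost(m, k)
    print(f"m={m:3d} k={k:2d}: {ands} AND, <= {xors} XOR, delay T_A + {levels} T_X")
```

For $k = 2$ the XOR bound reduces to $\frac{3m-1}{2}$, the type II figure.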
Gaussian normal basis of type 2 and 4 for ECC
Let $p = 2m + 1$ be a prime such that $\gcd(2m/\mathrm{ord}_p 2, m) = 1$, i.e. either 2 is a primitive root (mod $p$) or $\mathrm{ord}_p 2 = m$ and $m$ is odd. Then the element $\alpha = \beta + \beta^{-1}$, where $\beta$ is a primitive $p$th root of unity in $GF(2^{2m})$, generates a normal basis $\{\alpha_0, \alpha_1, \cdots, \alpha_{m-1}\}$ in $GF(2^m)$, which we call a Gaussian normal basis of type 2 (or a type II ONB). The multiplication matrix $(\lambda_{ij})$ of $\alpha\alpha_i$ has the following property: $\lambda_{ij} = 1$ if and only if $1 \pm 2^i \equiv \pm 2^j \pmod{p}$ for some choice of the $\pm$ signs. This is obvious from the basic properties of a Gaussian normal basis in Section 3. Since $m$ divides $\mathrm{ord}_p 2$, $i = 0$ is the unique value (mod $m$) satisfying $1 \pm 2^i \equiv 0 \pmod{p}$. That is, $\alpha\alpha_0 = \alpha_1$ and the 0th row of $(\lambda_{ij})$ is $(0, 1, 0, \cdots, 0)$. If $i \neq 0$, then $1 \pm 2^i \not\equiv 0 \pmod{p}$ and thus the $i$th ($i \neq 0$) row of $(\lambda_{ij})$ contains exactly two nonzero entries. Therefore for the case of a type II optimal normal basis, we need $m$ AND gates and $m + \frac{m-1}{2} = \frac{3m-1}{2}$ XOR gates. Also the critical path delay is $T_A + 2T_X$, while that of [3] is $T_A + 3T_X$. Let us give a more explicit example for the case $m = 5$.
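Example 1. Let $\beta$ be a primitive 11th root of unity in $GF(2^{10})$ and let $\alpha = \beta + \beta^{-1}$ be a type II optimal normal element in $GF(2^5)$. The computations of $\alpha\alpha_i$, $0 \le i \le 4$, are easily done from the following table. For each block regarding $K$ and $K'$, the $(s, t)$ entry with $0 \le s \le 1$ and $0 \le t \le 4$ has the value $\tau^s 2^t$ and $1 + \tau^s 2^t$ respectively, where $\langle\tau\rangle = \langle -1 \rangle$ is the unique multiplicative subgroup of order 2 in $GF(11)^\times$.
Table 3. Computation of $K_i$ and $K'_i$ using a type II ONB in $GF(2^m)$ for $m = 5$
From the above table, it can be found that $\alpha\alpha_0 = \alpha_1$ and
$$\alpha\alpha_1 = \alpha_0 + \alpha_3, \quad \alpha\alpha_2 = \alpha_3 + \alpha_4, \quad \alpha\alpha_3 = \alpha_1 + \alpha_2, \quad \alpha\alpha_4 = \alpha_2 + \alpha_4. \tag{37}$$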
For example, the computation of $\alpha\alpha_3$ can be done as follows. See the block $K'_3$ and find that $9 \equiv -2 \pmod{11}$ is in $K_1$ and $-7 \equiv 4$ is in $K_2$. Thus we have $\alpha\alpha_3 = \alpha_1 + \alpha_2$. In fact, for the case of a type II ONB, there is a more regular expression called a palindromic representation which enables us to find the multiplication table more easily. However, in order to treat all Gaussian normal bases of type $k$ for arbitrary $k$ uniformly, we follow this rule. Note that for all other type II ONBs with $m \neq 5$, the multiplication table can be derived in exactly the same manner. From (37), the corresponding matrix $(\lambda_{ij})$ for $m = 5$ is
$$(\lambda_{ij}) = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 \end{pmatrix}, \tag{38}$$
and using (24), (25), (28), (29), we find that the multiplication $C = AB = \sum_{i=0}^{4} c_i\alpha_i$ is written as follows.
$c_0 = (a_3 + a_4)b_2 + a_1b_0 + (a_1 + a_2)b_3 + (a_0 + a_3)b_1 + (a_2 + a_4)b_4$
$c_1 = (a_4 + a_0)b_3 + a_2b_1 + (a_2 + a_3)b_4 + (a_1 + a_4)b_2 + (a_3 + a_0)b_0$
$c_2 = (a_0 + a_1)b_4 + a_3b_2 + (a_3 + a_4)b_0 + (a_2 + a_0)b_3 + (a_4 + a_1)b_1$
$c_3 = (a_1 + a_2)b_0 + a_4b_3 + (a_4 + a_0)b_1 + (a_3 + a_1)b_4 + (a_0 + a_2)b_2$
$c_4 = (a_2 + a_3)b_1 + a_0b_4 + (a_0 + a_1)b_2 + (a_4 + a_2)b_0 + (a_1 + a_3)b_3$
From this, one has the shift register arrangement of $C = AB$ using a type II ONB in $GF(2^m)$ for $m = 5$, and it is shown in Figure 3. Note that the underlined entries are the first terms to be computed. Also note that the (shifted) diagonal entries have the common terms of $a_i$'s.
As mentioned before, $GF(2^{233})$ is the only field with a type II ONB among the five fields $GF(2^m)$, $m = 163, 233, 283, 409, 571$, recommended by NIST. Though circuits for multiplication using a type II ONB are presented in many places [1, 3, 10, 11], the authors could not find an explicit example of a circuit design using a Gaussian normal basis of type $k > 2$. Since there are two fields, $GF(2^{163})$ and $GF(2^{409})$, for which a Gaussian normal basis of type 4 exists, it is worthwhile to study the multiplication and the corresponding circuit for this case. For clarity of exposition, we will explain a Gaussian normal basis of type $k = 4$ in $GF(2^m)$ for $m = 7$. Note that the general case can be dealt with in the same manner as in the following example.
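Example 2. Let $p = 29 = mk + 1$ with $m = 7$, $k = 4$, where a Gauss period $\alpha$ of type $(7, 4)$ exists in $GF(2^7)$. In this case, the unique cyclic subgroup of order 4 in $GF(29)^\times$ is $K = \{1, 2^7, 2^{14}, 2^{21}\} = \{1, 12, 28, 17\}$. Let $\beta$ be a primitive 29th root of unity in $GF(2^{28})$. Thus letting $\tau = 12$, a normal element $\alpha$ is written as $\alpha = \beta + \beta^{12} + \beta^{17} + \beta^{28}$ and $\{\alpha_0, \alpha_1, \cdots, \alpha_6\}$ is a normal basis in $GF(2^7)$. The computations of $\alpha\alpha_i$, $0 \le i \le 6$, are done from the following table. For each block regarding $K$ and $K'$, the $(s, t)$ entry with $0 \le s \le 3$ and $0 \le t \le 6$ has the value $\tau^s 2^t$ and $1 + \tau^s 2^t$ respectively. From the above table, we find $\alpha\alpha_0 = \alpha_1$ and
$$\begin{aligned} \alpha\alpha_1 &= \alpha_0 + \alpha_2 + \alpha_5 + \alpha_6, & \alpha\alpha_2 &= \alpha_1 + \alpha_3 + \alpha_4 + \alpha_5, & \alpha\alpha_3 &= \alpha_2 + \alpha_5, \\ \alpha\alpha_4 &= \alpha_2 + \alpha_6, & \alpha\alpha_5 &= \alpha_1 + \alpha_2 + \alpha_3 + \alpha_6, & \alpha\alpha_6 &= \alpha_1 + \alpha_4 + \alpha_5 + \alpha_6. \end{aligned} \tag{40}$$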
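For example, see the block $K'_2$ for the expression of $\alpha\alpha_2$. The entries of $K'_2$ are 5, 20, 26, 11. Now see the blocks of the $K_i$'s and find $5 \in K_1$, $20 \in K_3$, $26 \in K_5$, $11 \in K_4$. Thus we get $\alpha\alpha_2 = \alpha_1 + \alpha_3 + \alpha_4 + \alpha_5$. From (40), the multiplication matrix $(\lambda_{ij})$ is written as
$$(\lambda_{ij}) = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 \end{pmatrix}, \tag{41}$$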
and again using the relations (24), (25), (28), (29), we get the following multiplication result $C = AB = \sum_{i=0}^{6} c_i\alpha_i$. In the following table, $a_{ijkl}$ is defined as $a_{ijkl} = a_i + a_j + a_k + a_l$. For example, we have $c_0 = (a_2 + a_5)b_3 + (a_0 + a_2 + a_5 + a_6)b_1 + (a_1 + a_4 + a_5 + a_6)b_6 + (a_2 + a_6)b_4 + (a_1 + a_3 + a_4 + a_5)b_2 + a_1b_0 + (a_1 + a_2 + a_3 + a_6)b_5$.
$c_0 = (a_2 + a_5)b_3 + a_{0256}b_1 + a_{1456}b_6 + (a_2 + a_6)b_4 + a_{1345}b_2 + a_1b_0 + a_{1236}b_5$
$c_1 = (a_3 + a_6)b_4 + a_{1360}b_2 + a_{2560}b_0 + (a_3 + a_0)b_5 + a_{2456}b_3 + a_2b_1 + a_{2340}b_6$
$c_2 = (a_4 + a_0)b_5 + a_{2401}b_3 + a_{3601}b_1 + (a_4 + a_1)b_6 + a_{3560}b_4 + a_3b_2 + a_{3451}b_0$
$c_3 = (a_5 + a_1)b_6 + a_{3512}b_4 + a_{4012}b_2 + (a_5 + a_2)b_0 + a_{4601}b_5 + a_4b_3 + a_{4562}b_1$
$c_4 = (a_6 + a_2)b_0 + a_{4623}b_5 + a_{5123}b_3 + (a_6 + a_3)b_1 + a_{5012}b_6 + a_5b_4 + a_{5603}b_2$
$c_5 = (a_0 + a_3)b_1 + a_{5034}b_6 + a_{6234}b_4 + (a_0 + a_4)b_2 + a_{6123}b_0 + a_6b_5 + a_{6014}b_3$
$c_6 = (a_1 + a_4)b_2 + a_{6145}b_0 + a_{0345}b_5 + (a_1 + a_5)b_3 + a_{0234}b_1 + a_0b_6 + a_{0125}b_4$
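The rows of this table follow mechanically from (25) and the column order of $Y$, so they can be regenerated as a cross-check. In the sketch below the table `LAM` lists, for each $i$, the indices of the basis elements in $\alpha\alpha_i$; the entries for $i = 4, 5, 6$ are obtained with the same coset rule as those for $i \le 3$. The names `LAM`, `Y_ORDER`, `terms` and `show` are illustrative only.

```python
M = 7
# LAM[i] = {j : alpha_j occurs in alpha * alpha_i} for the type (7, 4) Gaussian normal basis
LAM = [{1}, {0, 2, 5, 6}, {1, 3, 4, 5}, {2, 5}, {2, 6}, {1, 2, 3, 6}, {1, 4, 5, 6}]
Y_ORDER = [3, 1, 6, 4, 2, 0, 5]  # column order of Y for m = 7 (nu = 3 is odd)


def terms(s):
    """Terms of c_s in the column order of Y, as (sorted a-indices, b-index) pairs."""
    out = []
    for t in Y_ORDER:
        a_idx = sorted((i + s) % M for i in range(M) if t in LAM[i])
        out.append((a_idx, (t + s) % M))
    return out


def show(s):
    """Human-readable expansion of c_s."""
    parts = []
    for a_idx, b_idx in terms(s):
        a_sum = " + ".join(f"a{i}" for i in a_idx)
        parts.append(f"({a_sum})b{b_idx}" if len(a_idx) > 1 else f"{a_sum}b{b_idx}")
    return f"c{s} = " + " + ".join(parts)


for s in range(M):
    print(show(s))
```

Each row $c_{s+1}$ is the row $c_s$ with every index increased by one modulo 7, which reflects the cyclic shift of the $a$ and $b$ registers between clock cycles.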
The corresponding shift register arrangement of $C = AB$ using a Gaussian normal basis of type 4 in $GF(2^m)$ for $m = 7$ is shown in Figure 4. Also note that the underlined entries are the first terms to be computed and the (shifted) diagonal entries have the common terms of $a_i$'s. The critical path delay of the circuit using a type 4 Gaussian normal basis is only $T_A + 3T_X$, and the circuit can be effectively realized for $GF(2^{163})$ and $GF(2^{409})$ as well.
Conclusions
In this paper, we proposed a low complexity sequential normal basis multiplier over $GF(2^m)$ for odd $m$ using a Gaussian normal basis of type $k$. Since, in many cryptographic applications, $m$ should be an odd integer or a prime, our assumption on $m$ is not at all restrictive for practical purposes. We presented a general method of constructing a circuit arrangement of the multiplier and showed explicit examples for the cases of type 2 and type 4 Gaussian normal bases. Among the five binary fields $GF(2^m)$ with $m = 163, 233, 283, 409, 571$ recommended by NIST [16] for ECC, our examples cover the cases $m = 163, 233, 409$, since $GF(2^{233})$ has a type II ONB and $GF(2^{163})$, $GF(2^{409})$ have a Gaussian normal basis of type 4. Our general method can also be applied to the other fields $GF(2^{283})$ and $GF(2^{571})$ since they have a Gaussian normal basis of type 6 and 10, respectively. Compared with previously proposed architectures of the same kind, our multiplier has a superior or comparable area complexity and delay time. Thus it is well suited for many applications such as VLSI implementation of elliptic curve cryptographic protocols.
