Abstract: The Massey-Omura multiplier of $GF(2^m)$ uses a normal basis, and its bit-parallel version is usually implemented using $m$ identical combinational logic blocks whose inputs are cyclically shifted from one another. In the past, it was shown that, for a class of finite fields defined by irreducible all-one polynomials, the parallel Massey-Omura multiplier had redundancy, and a modified architecture of lower circuit complexity was proposed. In this article, it is shown that this type of multiplier contains redundancy not only in that special class of finite fields but also in fields $GF(2^m)$ defined by any irreducible polynomial. By removing the redundancy, we propose a new architecture for the normal basis parallel multiplier, which is applicable to any finite field $GF(2^m)$ and has significantly lower circuit complexity than the original Massey-Omura normal basis parallel multiplier. The proposed multiplier structure is also modular and, hence, suitable for VLSI realization. When applied to fields defined by irreducible all-one polynomials, the multiplier's circuit complexity matches the best result available in the open literature.
INTRODUCTION
The arithmetic operations in finite fields are mainly used in cryptography and error control coding [14], [18]. Addition and multiplication are the two basic operations in the finite field $GF(2^m)$. Addition in $GF(2^m)$ is easily realized using $m$ two-input XOR gates, while multiplication is costly in terms of gate count and time delay. The other operations of finite fields, such as exponentiation, division, and inversion, can be performed by repeated multiplications [25], [1], [7]. As a result, there is a need for a fast multiplication architecture with low complexity.
The space and time complexities of a multiplier depend heavily on how the field elements are represented. An element of $GF(2^m)$ is usually represented with respect to one of three popular bases: polynomial (canonical or standard) basis (PB), dual basis (DB), and normal basis (NB). Correspondingly, parallel multipliers are categorized into PB, DB, and NB multipliers [11]. Recently, several architectures for PB and DB multiplication over $GF(2^m)$ have been proposed, for example, [17], [8], [5], [27]. Also, in order to reduce hardware complexity, some PB and DB multipliers have been proposed for specific classes of fields, such as those defined by trinomials [23], [4], all-one polynomials and equally-spaced polynomials [9], [13], [26], and composite fields [20], [21]. It appears that PB multipliers for classes of trinomials and composite fields still achieve the lowest circuit complexity (for examples, see [4], [21]). In a normal basis, squaring of an element of $GF(2^m)$ can be easily performed by a cyclic shift. Although multiplication in this basis appears to be more complex than in the other bases for the general case, it is still desirable in many applications to represent the field elements with respect to a normal basis.
The original normal basis multiplication algorithm was invented by Massey and Omura [15], and its first VLSI implementation (both bit-serial and bit-parallel) was reported by Wang et al. [24]. A normal basis exists for every finite field, and so does this type of multiplier; such multipliers are hereafter referred to as Massey-Omura (MO) multipliers. Hasan et al. [10] proposed a novel architecture that reduces the complexity of the bit-parallel MO multiplier by restricting the irreducible polynomial to be an all-one polynomial (AOP); it is the best known architecture in terms of gate count and time complexity for this class of fields. Recently, Koc and Sunar [13] developed a parallel normal basis multiplier by extending a PB multiplier for the same class of fields generated by AOPs. On the other hand, Mullin et al. [19] gave a lower bound on the complexity of normal bases and defined the normal bases that attain this lower bound as optimal normal bases (ONB). They defined two types of optimal normal bases, type-I and type-II, where the normal bases generated by an irreducible AOP belong to type-I. Gao and Lenstra [6] showed that these two types are all the ONBs in $GF(2^m)$. Also, Ash et al. [2] presented methods to find other low complexity normal bases and techniques to determine their complexities.
In this paper, a generalized procedure and architecture for reducing the complexity of the parallel normal basis multiplier over $GF(2^m)$ are developed. The upper bounds of the gate count and time complexity of the proposed architecture are derived. The proposed procedure is then applied to the two types of optimal normal bases and their architectures are proposed. To further reduce the complexity of the multiplier, the architecture is optimized in terms of gate count by reusing partial sums. The complexities of the proposed architectures are compared with those of the previously reported structures.
The organization of this paper is as follows: In Section 2, normal basis representation and the MO multiplier are briefly introduced. In Section 3, a reduced redundancy MO multiplication scheme is derived and its bit-parallel architecture is considered. This method is applied to two types of ONBs and the results are compared with the previous ones. In Section 4, we present an optimized multiplier based on irreducible all-one polynomials. In Section 5, we apply the technique of signal reuse to further reduce the gate count of the proposed architecture and compare the complexities of a non-ONB with an ONB for the finite field $GF(2^5)$, with and without reusing signals in the proposed architecture. Finally, in Section 6, concluding remarks are made.
PRELIMINARIES

Normal Basis Representation
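It is well known that a normal basis of the field $GF(2^m)$ over $GF(2)$ exists for every positive integer $m$ [14]. By finding an element $\beta \in GF(2^m)$ such that
$$\{\beta, \beta^{2}, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$$
is a basis of $GF(2^m)$ over $GF(2)$, any element $A \in GF(2^m)$ can be represented as
$$A = \sum_{i=0}^{m-1} a_i \beta^{2^i}, \qquad (1)$$
where $a_i \in GF(2)$, $0 \le i \le m-1$, is the $i$th coordinate of $A$ with respect to the NB. In short, the normal basis representation of $A$ will be written as
$$A = (a_0, a_1, \ldots, a_{m-1}).$$
In vector notation, however, (1) can be written as
$$A = \mathbf{a}\,\boldsymbol{\beta}^T = \boldsymbol{\beta}\,\mathbf{a}^T,$$
where $\mathbf{a} = [a_0, a_1, \ldots, a_{m-1}]$, $\boldsymbol{\beta} = [\beta, \beta^{2}, \ldots, \beta^{2^{m-1}}]$, and $T$ denotes vector transposition.
The main advantage of the NB representation is that an element $A$ can be easily squared by applying a right cyclic shift to its coordinates, since
$$A^2 = (a_{m-1}, a_0, a_1, \ldots, a_{m-2}).$$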
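The squaring property can be checked numerically. The following is a minimal sketch, not part of the original paper, under these assumptions: the field is $GF(2^5)$ built from $z^5 + z^2 + 1$ (the polynomial of Example 1), elements are stored as polynomial-basis bit masks, and the normal element is taken to be the cube of the polynomial's root. All function names are hypothetical helpers introduced for illustration.

```python
from itertools import product

M = 5              # field degree: GF(2^5), as in the paper's example
POLY = 0b100101    # z^5 + z^2 + 1; bit k holds the coefficient of z^k
BETA = 0b01000     # assumed normal element: the cube of the root of POLY

def gf_mul(a, b):
    """Multiply two field elements stored as polynomial-basis bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached M: reduce modulo POLY
            a ^= POLY
    return r

def normal_basis(beta):
    """Return [beta, beta^2, beta^4, ..., beta^(2^(M-1))]."""
    basis = [beta]
    for _ in range(M - 1):
        basis.append(gf_mul(basis[-1], basis[-1]))
    return basis

def coordinate_table(basis):
    """Map every field element to its NB coordinates (a_0, ..., a_{M-1})."""
    table = {}
    for coords in product((0, 1), repeat=M):
        elem = 0
        for a_i, b_i in zip(coords, basis):
            if a_i:
                elem ^= b_i
        table[elem] = coords
    if len(table) != 2 ** M:
        raise ValueError("beta does not generate a normal basis")
    return table

def square_by_shift(coords):
    """Right cyclic shift: (a_0, ..., a_{M-1}) -> (a_{M-1}, a_0, ..., a_{M-2})."""
    return coords[-1:] + coords[:-1]

table = coordinate_table(normal_basis(BETA))
# Compare true squaring in the field with the shift, for all 32 elements.
all_match = all(
    table[gf_mul(elem, elem)] == square_by_shift(coords)
    for elem, coords in table.items()
)
print(all_match)
```

The brute-force coordinate table is only practical for small $m$; it is used here to keep the sketch free of linear-algebra helpers.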
Massey-Omura Parallel Multiplier
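Let $A$ and $B$ be two elements of $GF(2^m)$, represented with respect to the NB as $A = \sum_{i=0}^{m-1} a_i \beta^{2^i}$ and $B = \sum_{i=0}^{m-1} b_i \beta^{2^i}$, respectively, with coordinate vectors $\mathbf{a} = [a_0, \ldots, a_{m-1}]$ and $\mathbf{b} = [b_0, \ldots, b_{m-1}]$, and let $\boldsymbol{\beta} = [\beta, \beta^{2}, \ldots, \beta^{2^{m-1}}]$.
Let $C$ denote their product as
$$C = A \cdot B = \mathbf{a}\,\mathbf{M}\,\mathbf{b}^T, \qquad (4)$$
where the multiplication matrix $\mathbf{M}$ is defined by
$$\mathbf{M} = \boldsymbol{\beta}^T \boldsymbol{\beta} = \left[\beta^{2^i + 2^j}\right]_{i,j=0}^{m-1}. \qquad (5)$$
If all entries of $\mathbf{M}$ are written with respect to the NB, then the following is obtained
$$\mathbf{M} = \mathbf{M}_0 \beta + \mathbf{M}_1 \beta^{2} + \cdots + \mathbf{M}_{m-1} \beta^{2^{m-1}}, \qquad (6)$$
where the $\mathbf{M}_i$ are $m \times m$ matrices whose entries belong to $GF(2)$. By substituting (6) into (4), the coordinates of $C$ are found as follows:
$$c_i = \mathbf{a}\,\mathbf{M}_i\,\mathbf{b}^T = \mathbf{a}^{(i)}\,\mathbf{M}_0\,\mathbf{b}^{(i)T}, \quad 0 \le i \le m-1, \qquad (7)$$
where $\mathbf{a}^{(i)}$ and $\mathbf{b}^{(i)}$ denote the $i$-fold left cyclic shifts of $\mathbf{a}$ and $\mathbf{b}$, respectively [10]. It is not difficult to verify that the number of 1s in each $\mathbf{M}_i$, $0 \le i \le m-1$, is the same; this number is hereafter denoted as $C_N$. Since these nonzero entries of $\mathbf{M}_i$ determine the gate count of the normal basis multiplier, $C_N$ is referred to as the complexity of the NB [19]. The coordinate $c_i$ in (7) can be written as a modulo 2 sum of exactly $C_N$ terms. Each of these terms is a modulo 2 product of exactly two coordinates (one each of $A$ and $B$). Thus, the generation of $c_i$ requires $C_N$ multiplications and $C_N - 1$ additions over $GF(2)$. In hardware, this corresponds to $C_N$ AND gates and $C_N - 1$ XOR gates, assuming that all gates have two inputs. If these XOR gates are arranged in binary tree form, then the total gate delay to generate $c_i$ is $T_A + \lceil \log_2 C_N \rceil T_X$, where $T_A$ and $T_X$ are the delays of one AND gate and one XOR gate, respectively. For parallel generation of all $c_i$, $i = 0, 1, \ldots, m-1$, one needs $m C_N$ AND gates and $m(C_N - 1)$ XOR gates (see also [3], [16]). Also, one can reduce the number of AND gates to $m^2$ by reusing multiplication terms over $GF(2)$. Thus, to reduce the number of XOR gates, we have to choose a normal basis such that $C_N$ is minimum. It was proven that $C_N \ge 2m - 1$.
If $C_N = 2m - 1$, then the NB is called an optimal normal basis (type-I or type-II).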
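To make the roles of the matrices $\mathbf{M}_i$ and of the complexity $C_N$ concrete, the sketch below builds them by brute force for $GF(2^5)$ and evaluates each product coordinate from $\mathbf{M}_0$ applied to left-shifted operands. It is an illustration under stated assumptions, not the paper's implementation: the field polynomial $z^5 + z^2 + 1$ and the normal element (cube of its root) follow Example 1, and every function name is a hypothetical helper.

```python
from itertools import product

M = 5
POLY = 0b100101    # z^5 + z^2 + 1
BETA = 0b01000     # assumed normal element (cube of the root of POLY)

def gf_mul(a, b):
    """Polynomial-basis multiplication in GF(2^M) on bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def nb_tables(beta):
    """Return the conjugates of beta and an element -> NB-coordinates map
    (None when the conjugates are linearly dependent)."""
    basis = [beta]
    for _ in range(M - 1):
        basis.append(gf_mul(basis[-1], basis[-1]))
    table = {}
    for coords in product((0, 1), repeat=M):
        elem = 0
        for a_i, b_i in zip(coords, basis):
            if a_i:
                elem ^= b_i
        table[elem] = coords
    return basis, (table if len(table) == 2 ** M else None)

def mult_matrices(basis, table):
    """mats[k][i][j] = k-th NB coordinate of beta^(2^i) * beta^(2^j)."""
    return [[[table[gf_mul(bi, bj)][k] for bj in basis] for bi in basis]
            for k in range(M)]

def complexity(mats):
    """Number of ones in each M_k (identical for every k)."""
    counts = {sum(map(sum, mat)) for mat in mats}
    assert len(counts) == 1
    return counts.pop()

def mo_multiply(m0, a, b):
    """c_i = a^(i) M_0 b^(i)^T, with i-fold left cyclic shifts of a and b."""
    c = []
    for i in range(M):
        sa, sb = a[i:] + a[:i], b[i:] + b[:i]
        c.append(sum(sa[r] & m0[r][s] & sb[s]
                     for r in range(M) for s in range(M)) % 2)
    return tuple(c)

def min_complexity():
    """Smallest C_N over all normal elements of the field."""
    best = None
    for beta in range(1, 2 ** M):
        basis, table = nb_tables(beta)
        if table is not None:
            c = complexity(mult_matrices(basis, table))
            best = c if best is None else min(best, c)
    return best

basis, table = nb_tables(BETA)
mats = mult_matrices(basis, table)
c_n = complexity(mats)
elem_of = {coords: elem for elem, coords in table.items()}
agrees = all(
    mo_multiply(mats[0], a, b) == table[gf_mul(elem_of[a], elem_of[b])]
    for a in table.values() for b in table.values()
)
print(c_n, min_complexity(), agrees)
```

For this basis the count is $C_N = 15$, the value quoted in Example 1, while the minimum over all normal elements is $2m - 1 = 9$, which corresponds to the optimal normal basis of $GF(2^5)$.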
A REDUCED REDUNDANCY MASSEY-OMURA PARALLEL MULTIPLIER
In this section, we present a new low complexity architecture for the bit-parallel Massey-Omura multiplier. The improvement offered by the new architecture is based on a reformulation of the multiplication operation, which is given below.
Formulation of Multiplication
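In (5), the multiplication matrix $\mathbf{M}$ is symmetric and its diagonal entries are the elements of the NB. Thus, we can write
$$\mathbf{M} = \mathbf{U} + \mathbf{U}^T + \mathbf{D},$$
where $\mathbf{D}$ is a diagonal matrix and $\mathbf{U}$ is an upper triangular matrix having zeros at diagonal entries as given below
$$\mathbf{D} = \mathrm{diag}\left[\beta^{2}, \beta^{2^2}, \ldots, \beta^{2^{m-1}}, \beta\right], \qquad \mathbf{U} = [u_{i,j}], \quad u_{i,j} = \begin{cases} \beta^{2^i + 2^j}, & 0 \le i < j \le m-1, \\ 0, & \text{otherwise.} \end{cases}$$
Then, (4) can be written as
$$C = \mathbf{a}\,\mathbf{U}\,\mathbf{b}^T + \mathbf{b}\,\mathbf{U}\,\mathbf{a}^T + \mathbf{a}\,\mathbf{D}\,\mathbf{b}^T. \qquad (11)$$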
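Consider the exponents of $\beta$ in the matrix $\mathbf{U}$. These exponents belong to the ring of integers modulo $2^m - 1$. The binary representation of each such exponent $k$, using $m$ bits, has only two ones and zeros elsewhere. Let us classify these exponents into different subsets such that each element of the $i$th subset is found by consecutive multiplications of $1 + 2^i$ by $2^l$ as
$$\left\{(1 + 2^i)\,2^l \bmod (2^m - 1) : 0 \le l \le m-1\right\}, \quad 1 \le i \le v, \qquad (12)$$
where $v$ is the number of subsets with elements whose binary representations have two 1's. In (12), the $i$th subset is essentially the cyclotomic coset of $1 + 2^i$ modulo $2^m - 1$. Let us define
$$\delta_i = \beta^{1 + 2^i}, \quad 1 \le i \le v. \qquad (13)$$
In order to prove (16), we have to show that, after $m/2$ cyclic shifts of the representation of $\delta_v$, the representation of $\delta_v$ is achieved again, i.e., we have to prove that $\delta_v^{2^v} = \delta_v$. By using the definition of (13), and since $2v = m$ when $m$ is even, we have
$$\delta_v^{2^v} = \beta^{2^v (1 + 2^v)} = \beta^{2^v + 2^m} = \beta^{2^v + 1} = \delta_v.$$
Now, let us denote
$$x_{j,i} = a_j b_{((j+i))} + a_{((j+i))} b_j, \quad 1 \le i \le v, \ 0 \le j \le m-1, \qquad (18)$$
then the multiplication of (11) can be performed by using the following theorem. In (18) and the remainder of the paper, $((k))$ means "$k$ reduced modulo $m$."
Theorem 1. Let $A$ and $B$ be two elements of $GF(2^m)$ and $C$ be their product. Then,
$$C = \sum_{j=0}^{m-1} a_j b_j \beta^{2^{((j+1))}} + \sum_{i=1}^{v} \sum_{j=0}^{m-1} x_{j,i}\, \delta_i^{2^j} \quad \text{for } m \text{ odd},$$
$$C = \sum_{j=0}^{m-1} a_j b_j \beta^{2^{((j+1))}} + \sum_{i=1}^{v-1} \sum_{j=0}^{m-1} x_{j,i}\, \delta_i^{2^j} + \sum_{j=0}^{v-1} x_{j,v}\, \delta_v^{2^j} \quad \text{for } m \text{ even}.$$
Proof. By substituting (9) into (11), we have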
Using (13) in (10), we obtain
REYHANI-MASOLEH AND HASAN: A NEW CONSTRUCTION OF MASSEY-OMURA PARALLEL MULTIPLIER OVER $GF(2^m)$
1. An alternate and more concise proof of (16)
and, by changing $l$ to $l + 1$, the first part of (22) is obtained. A similar method can be used for even $m$, and so the proof is complete. □
Below, we discuss how Theorem 1 and Corollary 1 can be used to implement an efficient architecture for realizing a parallel NB multiplier. We show that Theorem 1 yields circuits with the lowest space and time complexities presented so far for the general case of an arbitrary $GF(2^m)$. For the special case of irreducible all-one polynomials (AOP), our result matches the best known result available in the literature.
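The decomposition behind Theorem 1 can be checked by exhaustive comparison with ordinary field multiplication. The sketch below is a reading of the theorem rather than the authors' code. It assumes $v = \lfloor m/2 \rfloor$, $\delta_i = \beta^{1+2^i}$, and $x_{j,i} = a_j b_{((j+i))} + a_{((j+i))} b_j$: the product is the diagonal part $\sum_j a_j b_j \beta^{2^{j+1}}$ plus the cross terms $x_{j,i}\,\delta_i^{2^j}$, where for even $m$ the block $i = v$ runs only over $j = 0, \ldots, m/2 - 1$. The field polynomials and helper names are illustrative assumptions.

```python
from itertools import product

def gf_mul(a, b, m, poly):
    """Polynomial-basis multiplication in GF(2^m) on bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return r

def conjugates(beta, m, poly):
    """[beta, beta^2, beta^4, ..., beta^(2^(m-1))]."""
    out = [beta]
    for _ in range(m - 1):
        out.append(gf_mul(out[-1], out[-1], m, poly))
    return out

def from_coords(coords, basis):
    """Field element with the given coordinates over the conjugates."""
    elem = 0
    for c, b in zip(coords, basis):
        if c:
            elem ^= b
    return elem

def rr_mo_product(a, b, beta, m, poly):
    """Product of A and B via the diagonal part plus the x_{j,i} cross terms."""
    basis = conjugates(beta, m, poly)
    v = m // 2
    c = 0
    for j in range(m):                       # diagonal part: a_j b_j beta^(2^(j+1))
        if a[j] & b[j]:
            c ^= basis[(j + 1) % m]
    for i in range(1, v + 1):
        delta = gf_mul(beta, basis[i], m, poly)      # delta_i = beta^(1 + 2^i)
        half_block = (m % 2 == 0 and i == v)         # pairs repeat after m/2 steps
        for j in range(m // 2 if half_block else m):
            k = (j + i) % m
            if (a[j] & b[k]) ^ (a[k] & b[j]):        # x_{j,i}
                c ^= delta
            delta = gf_mul(delta, delta, m, poly)    # next power delta_i^(2^(j+1))
    return c

def check(m, poly, betas):
    """Compare the decomposition with direct multiplication for all a, b."""
    for beta in betas:
        basis = conjugates(beta, m, poly)
        for a in product((0, 1), repeat=m):
            for b in product((0, 1), repeat=m):
                direct = gf_mul(from_coords(a, basis), from_coords(b, basis), m, poly)
                if rr_mo_product(a, b, beta, m, poly) != direct:
                    return False
    return True

print(check(5, 0b100101, (0b01000, 0b00010)))   # odd m, z^5 + z^2 + 1
print(check(4, 0b10011, range(1, 16)))          # even m, z^4 + z + 1
```

The identity is a regrouping of $\sum_{i,j} a_i b_j \beta^{2^i + 2^j}$, so it holds for any $\beta$; the normal basis property matters only when the result is read back as coordinates.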
Architecture
Here, we use the results of the previous subsection and present a bit-parallel architecture for the normal basis multiplier. The architecture is shown in Fig. 1 and is hereafter referred to as the reduced redundancy Massey-Omura (RR_MO) multiplier. In this architecture, block $B_0$ generates $\sum_{j=0}^{m-1} a_j b_j \beta^{2^{((j+1))}}$
Since the coordinates of $\delta_i^{2^j}$ are known, the pass-thru module is realized by simply connecting $x_{j,i}$ to the coordinates where the representation of $\delta_i^{2^j}$ has 1's. That is, the single input line of the pass-thru module is directly connected to its $r_i$ output lines, where $r_i$ refers to the Hamming weight, i.e., the number of 1's, in the NB representation of $\delta_i$.
In Fig. 1, the first level of sum blocks, indexed by $1 \le i \le v$, consists of $GF(2^m)$ adders. Each of the $m$ output bits of the $i$th sum block is realized by adding $r_i$ terms. The next level of summation also consists of $GF(2^m)$ adders and has $m$ binary XOR trees, each with $v + 1$ inputs. The details of this architecture are shown with an example in Fig. 2. This multiplier uses a type-II optimal normal basis (ONB) and is implemented in the finite field $GF(2^5)$, where S P I. By using the obtained by a cyclic shift from the previous one. The dotted lines in Fig. 2 are sketched to illustrate this cyclic shift and are not connected to any part of the circuit.
Gate and Time Complexities
In Table 1 , the complexity of the proposed architecture is shown. The number of XOR gates in v is different from the other i when m is an even number. Note that, although HXS for m being an even integer, the number of XOR gates in v is still an integer. Then, from (16) (26) and (15), respectively. In the literature, gate count is often expressed in terms of g x . Towards this effort, we have the following theorem. x w x h Px X QH By writing entries of (21) with respect to NB and noting that the number of ones in The number of XOR gates x as given in Theorem 2 can be reduced by using optimization techniques. In i blocks of Fig. 1 , the number of XOR gates is reduced when the representation of i has more than two consecutive ones or the representation is symmetric for composite values of m, i.e.,
where k is a divisor of m. These techniques will be explained later. Below, we give the complexity of the RR_MO multiplier. , then the time delay of the RR_MO multiplier is g e I log P g x I P $ % Y which reduces to (34) after simplification. t u Table 2 compares gate and time complexities of the proposed architecture with of the MO multiplier of [24] . Since g x ! Pm À I, this table shows the significant reduction in the gate count of the proposed multiplier compared to that of [24] . It is noted that the number of XOR gates in this table can be reduced when more than two consecutive ones or a symmetrical property exist in the representation of i . Therefore, this number in the table is an upper bound.
Corollary 2. The number of XOR gates and the time delay of the type-II optimal normal basis multiplier are given by (35) and (36), respectively.
Proof. For an optimal normal basis (ONB), we have $C_N = 2m - 1$. Substituting this value of $C_N$ into (29) and (34), one obtains (35) and (36). The representation of $\delta_i$, $1 \le i \le v$, with respect to the type-II ONB has only two nonzero coordinates. Therefore, the optimization technique cannot be applied to reduce the number of XOR gates. Hence, the upper bound on the number of XOR gates would be the exact number of XOR gates. □
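The XOR counts discussed in this subsection can be tabulated with a few lines of code. The closed form used below for the RR_MO multiplier is re-derived here from the block description of Fig. 1 for odd $m$ and is an assumption, not a quotation of the paper's numbered equations. It counts $m$ XOR gates in each of the $v = (m-1)/2$ blocks $B_i$, $m(r_i - 1)$ in the $i$th sum block, and $mv$ in the final trees, and uses $\sum_i r_i = (C_N - 1)/2$, which gives $m(C_N + m - 2)/2$. Function names are hypothetical.

```python
def mo_xor_count(m, c_n):
    """XOR gates of the original parallel MO multiplier: m (C_N - 1)."""
    return m * (c_n - 1)

def rr_mo_xor_bound(m, c_n):
    """Upper bound on the XOR gates of the RR_MO multiplier for odd m.

    Assumed closed form m (C_N + m - 2) / 2, obtained by counting the
    B_i blocks, the first-level sum blocks, and the final XOR trees.
    """
    if m % 2 == 0:
        raise ValueError("this sketch covers odd m only")
    return m * (c_n + m - 2) // 2

m = 5
for label, c_n in (("type-II ONB", 2 * m - 1), ("non-optimal NB of Example 1", 15)):
    print(label, mo_xor_count(m, c_n), rr_mo_xor_bound(m, c_n))
```

For $m = 5$ this gives 40 XOR gates for the MO multiplier with the ONB, and 45 for the RR_MO bound with $C_N = 15$. Subtracting the savings reported in Example 1 (10 to 8 in the first sum block, 15 to 8 in the second) leaves 36, the figure the paper compares against 40.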
Remark. For even $m$, the representation of $\delta_v$ is symmetric, i.e., $k = 2$ in (33), and this fact can be used to reduce the number of XOR gates in the RR_MO multiplier. Using (16), one obtains that the upper $m/2$ coordinates of the output signals in the $v$th sum block of Fig. 1 are identical to the lower $m/2$ coordinates. Thus, by reusing these signals, the number of XOR gates needed in the $v$th sum block is reduced to one half of the previous one, i.e., $\frac{m}{2}(0.5\, r_v - 1)$. Therefore, for even values of $m$, the new upper bound for the number of XOR gates in the RR_MO parallel multiplier becomes
In the following, we attempt to reduce the XOR gate count of the proposed architecture by reusing signals for the type-I ONB multiplier and compare it with the previous ones for the same class of finite fields.
AN OPTIMIZED MULTIPLIER USING IRREDUCIBLE ALL-ONE POLYNOMIALS
A type-I ONB is generated by roots of an irreducible all-one polynomial (AOP). An AOP of degree $m$ has all of its $m + 1$ coefficients equal to 1, i.e.,
$$P(z) = z^m + z^{m-1} + \cdots + z + 1. \qquad (37)$$
The AOP is irreducible if $m + 1$ is prime and 2 is primitive modulo $m + 1$ [18]. Thus, the roots of (37) i.e., where $k_i$ is obtained from
Since $m + 1$ is an odd prime, i.e., $m$ is even, $v = m/2$. When $\beta$ is a root of (37), one has $$\beta^{m+1} = \sum_{i=1}^{m} \beta^{i} = 1. \qquad (40)$$ Thus, using (13) and (40), we have
Thus,
In ( Using the generalized architecture and (44), the output of the $B_i$ block in the first row is the $2^{k_i}$th coordinate of $\delta_i$ and the output of the second row is in the $(k_i + 1)$th position, and so on. This is accomplished by rewiring module $i$ as shown at the bottom of Fig. 3.
The total number of AND gates of this circuit is $m^2$, which is the same as in the general case, but the number of XOR gates is reduced to $m^2 - 1$. In order to determine the time delay of the architecture in Fig. 3, we have to determine the longest path from the input to the output, which is the sum of the delays of the $B_i$ block, the summation block that follows it, and the very last XOR gate. Therefore, the time delay of this structure is $T_A + (2 + \lceil \log_2 v \rceil) T_X = T_A + (1 + \lceil \log_2 m \rceil) T_X$. Since $m$ is even and $m > 2$, we have $\lceil \log_2 m \rceil = \lceil \log_2 (m - 1) \rceil$ and, thus, the time delay is $T_A + (1 + \lceil \log_2 (m - 1) \rceil) T_X$. The above gate count and delay can be compared with those of other parallel multipliers of the same class generated by irreducible AOPs. The comparison is shown in Table 3. It is easily seen that the best architectures in terms of area and time complexities are those of Hasan et al. [10] and of Fig. 3 in normal basis, and that of Wu and Hasan [26] in weakly dual basis. Also, the proposed circuit is regular and is derived from the general case. The modularity of the proposed architecture makes it suitable for VLSI implementation.
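The irreducibility condition for an AOP quoted in this section ($m + 1$ prime and 2 primitive modulo $m + 1$) is easy to test for small $m$. The sketch below lists the degrees up to 30 that satisfy it, together with the gate counts stated above ($m^2$ AND gates, $m^2 - 1$ XOR gates). The function names are hypothetical and the trial-division primality test is chosen only for brevity.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    x, k = 2 % p, 1
    while x != 1:
        x = (2 * x) % p
        k += 1
    return k

def has_type1_onb(m):
    """True when m + 1 is an odd prime and 2 is primitive modulo m + 1."""
    p = m + 1
    return p > 2 and is_prime(p) and order_of_two(p) == m

for m in range(2, 31):
    if has_type1_onb(m):
        # Gate counts of the AOP-based multiplier as stated in the text.
        print(m, "AND:", m * m, "XOR:", m * m - 1)
```

Only $m = 2, 4, 10, 12, 18, 28$ qualify in this range, which illustrates how sparse the type-I class is.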
OPTIMIZATION BY SIGNAL REUSE
If the representation of $\delta_i$, $1 \le i \le v$, with respect to the NB has consecutive ones (more than two), then the XOR count of the $i$th sum block of Fig. 1 Tables 1 and 2.
Example 1. Let $GF(2^5)$ be the finite field generated by the irreducible polynomial $P(z) = z^5 + z^2 + 1$ whose root is $\alpha$, i.e., $P(\alpha) = 0$. If we choose $\beta = \alpha^3$, then
$\{\beta, \beta^{2}, \beta^{4}, \beta^{8}, \beta^{16}\}$ is a NB. Using Table 1 of [19], the representation of $\delta_1$ and $\delta_2$ and their consecutive squares are found from Table 4. Let
$x_{j,i} = a_j b_{((j+i))} + a_{((j+i))} b_j$ be the output of the $j$th $B_i$ block and $s_{i,n}$, $0 \le n \le 4$, denote the $2^n$th coordinate of the outputs of the $i$th sum block of Fig. 1. Using Table 4 and (19), the outputs of the first sum block are found as
Since both $s_{1,0}$ and $s_{1,1}$ have the common terms $x_{1,1} + x_{2,1}$, it is not necessary to generate these common terms twice. A similar expression exists for $s_{1,2}$ and $s_{1,3}$ with the common term $x_{4,1} + x_{0,1}$. Therefore, the total number of XOR gates used in the first sum block is reduced from 10 to eight.
A similar optimization is accomplished in the second sum block. Since the representation of $\delta_2$ has one zero in Table 4 Therefore, the total number of XOR gates of this block would be eight instead of 15. This optimization method may, however, increase the time delay of the architecture. Table 5 shows a comparison of this method with the general NB multiplier as well as the type-II ONB. Using (32), $C_N$ for this example and for the type-II ONB are 15 and nine, respectively. It is seen that the multiplier with the greater value of $C_N$ has a lower XOR gate count than the optimal normal basis using the MO multiplier (36 versus 40).
CONCLUSIONS
In this article, a reduced redundancy Massey-Omura parallel multiplier has been proposed. This multiplier reduces the complexity of the parallel Massey-Omura multiplier for any normal basis and is not limited to any special class of finite fields. In particular, the space complexity of the proposed structure is about half that of the other architectures. Also, by reusing signals, the number of XOR gates has been further reduced, and the results of applying this technique have been compared to the original one using an example.
Since only 23 percent of all normal bases in $GF(2^m)$ for $m < 1{,}200$ are optimal [19], the proposed architecture is useful in the design of an efficient multiplier, especially for nonoptimal normal bases.
