Abstract-A reordered normal basis is a certain permutation of a type II optimal normal basis. In this paper, a high speed design of a word level finite field multiplier using a reordered normal basis is presented. The proposed architecture has a very regular structure, which makes it suitable for VLSI implementation. An architectural complexity comparison shows that the new architecture has a smaller critical path delay than other word level multipliers available in the open literature, at the cost of moderately higher area complexity. The new architecture outperforms all other similar proposals when the product of area and delay is taken as the measure of performance.
I. INTRODUCTION
Finite field arithmetic has important applications in cryptography, especially public key cryptography. Elliptic curve and ElGamal cryptosystems are two important examples of public key cryptosystems based completely on finite field arithmetic [1] [2]. Two types of finite fields are commonly used in practice: the prime field $F_p$ and the binary field $F_{2^m}$. The binary field is an extension of the prime field $F_2$ and contains $2^m$ elements. Binary fields are attractive for high speed cryptography applications since they are suitable for hardware implementation [1] [3].
An important factor that greatly affects the efficiency of finite field arithmetic is the basis used to represent the field elements. Common bases used in practice are the polynomial basis (PB) and the normal basis (NB) [4]. The polynomial basis is the most popular and has been widely used for hardware and software implementations [3]. The normal basis, on the other hand, is advantageous for hardware implementation since the squaring operation can be implemented at very small cost. Low cost squaring can be used to speed up exponentiation by repeated squaring and multiplication [5].
Multiplication is considered to be the main operation in finite field arithmetic. In a normal basis, multiplication can be modeled as a matrix multiplication in which two input vectors are multiplied by a multiplication matrix to produce the output product bits. For a binary field the entries of the multiplication matrix are zero or one; consequently, the multiplication complexity depends on the number of ones in the multiplication matrix. This number is referred to as the complexity of the normal basis. For a binary field $F_{2^m}$, it has been proven that the complexity of a normal basis lies between $2m - 1$ and $m^2$. A normal basis whose complexity attains the minimum is referred to as an optimal normal basis (ONB). Two types of optimal normal bases have been found, which are referred to as optimal normal bases of type I and type II [6]. A reordered normal basis is a certain permutation of a type II optimal normal basis [7].
Hardware implementations of binary finite field multipliers can be divided into three categories. The first category consists of bit level multipliers. A bit level multiplier takes $m$ clock cycles to finish one multiplication in a binary field of size $m$. The multipliers in this class are considered to be low power and to occupy a small silicon area. Their main disadvantage is their low multiplication speed for large field sizes. The second category consists of full parallel multipliers. A full parallel multiplier takes one clock cycle to finish the multiplication for any field size. These multipliers are not practical since they require a large silicon area and are not economical to manufacture. The third category consists of word level finite field multipliers, which are the most commonly used in practice. A word level multiplier takes $d$ clock cycles, $1 \le d \le m$, to finish one multiplication in a binary field of size $m$. The value of $d$ can be selected by the designer to set the trade-off between area and speed. Decreasing $d$ results in faster and larger multipliers, while increasing $d$ results in smaller and slower multipliers.
In this work, a new word level finite field multiplier using a reordered normal basis is presented. It is shown that the new design has a higher speed than other similar proposals.
The organization of this paper is as follows. The reordered normal basis and multiplication using this basis are briefly reviewed in Section II. In Section III, a new word level multiplier using a reordered normal basis is proposed. The architectural complexity of the design and a comparison with similar proposals are presented in Section IV. Finally, some concluding remarks are given in Section V.
II. A BRIEF REVIEW OF REORDERED NORMAL BASIS AND ITS ARITHMETIC IN $F_{2^m}$

A. Reordered Normal Basis Definition
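Theorem 1: [7] Let $\beta$ be a primitive $(2m+1)$st root of unity in $F_{2^m}$ ($\beta^{2m+1} = 1$) such that $\gamma = \beta + \beta^{-1}$ generates a type II optimal normal basis. Then $\{\gamma_i, i = 1, 2, \ldots, m\}$ with
$$\gamma_i = \beta^{i} + \beta^{-i}, \quad i = 1, 2, \ldots, m,$$
is also a basis of $F_{2^m}$ over $F_2$.
It has been shown that the basis $\{\gamma_i, i = 1, 2, \ldots, m\}$ is a permutation of the normal basis $\{\gamma^{2^i}, i = 0, 1, \ldots, m-1\}$. We denote the basis $\{\gamma_i, i = 1, 2, \ldots, m\}$ as the reordered normal basis, following [8].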
The reordered normal basis not only offers free squaring but also avoids the modular reduction step in a multiplication operation.
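The permutation between the two bases, and the reason squaring is free, can be illustrated with a short Python sketch. The helper names `fold`, `normal_to_reordered` and `square` are illustrative and do not come from [7] or [8]. The sketch assumes $\gamma_i = \beta^{i} + \beta^{-i}$ with $\beta^{2m+1} = 1$, so that $\gamma_t = \gamma_{-t} = \gamma_{t+2m+1}$, and it uses $m = 5$ only as a small case in which $2m + 1 = 11$ is prime and 2 is a primitive root modulo 11.

```python
def fold(t, m):
    """Map an integer t to the index in 0..m for which gamma_t = gamma_fold(t).

    Assumes gamma_t = beta^t + beta^(-t) with beta^(2m+1) = 1.
    """
    r = t % (2 * m + 1)
    return r if r <= m else 2 * m + 1 - r


def normal_to_reordered(m):
    """Return indices t_0..t_(m-1) such that gamma^(2^i) equals gamma_(t_i)."""
    return [fold(pow(2, i, 2 * m + 1), m) for i in range(m)]


def square(a, m):
    """Square an element given by coefficients a[1..m] (a[0] is unused and 0).

    In characteristic 2, (sum a_i gamma_i)^2 = sum a_i gamma_(2i), so squaring
    only moves coefficient i to position fold(2i); no logic gates are involved.
    """
    c = [0] * (m + 1)
    for i in range(1, m + 1):
        c[fold(2 * i, m)] = a[i]
    return c
```

For $m = 5$, `normal_to_reordered(5)` returns `[1, 2, 4, 3, 5]`, a permutation of the indices 1 to 5, and `square` is a pure rewiring of the coefficient positions.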
B. Reordered Normal Basis Multiplication
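Assume that $A$ and $B$ are two arbitrary elements of $F_{2^m}$ represented with respect to the reordered normal basis:
$$A = \sum_{i=1}^{m} a_i \gamma_i, \qquad B = \sum_{i=1}^{m} b_i \gamma_i, \qquad a_i, b_i \in F_2.$$
To facilitate multiplication, the function $s(i)$ has been defined, mapping the set of integers to the set $\{0, 1, \ldots, m\}$ [8]:
$$s(i) = \begin{cases} i \bmod (2m+1), & \text{if } 0 \le i \bmod (2m+1) \le m, \\ (2m+1) - \big(i \bmod (2m+1)\big), & \text{otherwise.} \end{cases}$$
Next, compute $\gamma_j A$, where $1 \le j \le m$:
$$\gamma_j A = \sum_{i=1}^{m} a_i \gamma_i \gamma_j = \sum_{i=1}^{m} a_i \left( \gamma_{s(i+j)} + \gamma_{s(i-j)} \right).$$
Also note that [8]
$$\gamma_j A = \sum_{i=1}^{m} \left( a_{s(j+i)} + a_{s(j-i)} \right) \gamma_i.$$
Note that it was assumed that $a_0 = 0$ [8]. Exchanging the roles of $A$ and $B$ (so that likewise $b_0 = 0$), the coefficient $c_i$ of $\gamma_i$ in the product $C = AB$ can be calculated as follows:
$$c_i = \sum_{j=1}^{m} a_j \left( b_{s(i+j)} + b_{s(i-j)} \right), \quad i = 1, 2, \ldots, m. \tag{3}$$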
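As an illustration, the bit level product formula can be written directly in Python. This is a minimal sketch under two assumptions: that $s(i)$ folds an index modulo $2m + 1$ into the range $0, \ldots, m$, and that (3) has the form $c_i = \sum_{j=1}^{m} a_j (b_{s(i+j)} + b_{s(i-j)})$ with $b_0 = 0$, as the description of its terms in Section III suggests. The function names and the list layout (index 0 unused and fixed to 0) are choices made for this example only.

```python
def s(i, m):
    """Fold an integer index into 0..m (assumed definition of s(i))."""
    r = i % (2 * m + 1)
    return r if r <= m else 2 * m + 1 - r


def rnb_multiply(a, b, m):
    """Multiply two elements given in a reordered normal basis.

    a and b are lists of length m + 1 holding bits a[1..m], b[1..m];
    a[0] and b[0] must be 0. Returns c in the same layout.
    """
    c = [0] * (m + 1)
    for i in range(1, m + 1):
        acc = 0
        for j in range(1, m + 1):
            # One pair of partial products: a_j * (b_s(i+j) + b_s(i-j)) over F_2.
            acc ^= a[j] & (b[s(i + j, m)] ^ b[s(i - j, m)])
        c[i] = acc
    return c
```

A quick sanity check is that the all-ones vector acts as the multiplicative identity, since the elements of an optimal normal basis sum to 1: `rnb_multiply(a, [0] + [1] * m, m)` returns `a` for any valid `a`.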
III. PROPOSED WORD LEVEL MULTIPLICATION USING REORDERED NORMAL BASIS
A. Word Level Multiplication Using Reordered Normal Basis
From (3) it can be seen that each product bit $c_i$ is a sum of $2m$ terms, where each term is a partial product bit $a_j b_{s(i+j)}$ or $a_j b_{s(i-j)}$. In a bit level multiplier all the output product bits $c_i$ are computed in parallel. The multiplier takes $m$ clock cycles to finish the multiplication, since in each clock cycle it generates two terms, $a_j [b_{s(i+j)} + b_{s(i-j)}]$, of each output product bit.
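Let $w$ denote the word size and $k = \lceil m/w \rceil$ be the number of words. Write the subscript of $a_j$ in (3) as $j = gw + \ell$, with $g = 0, 1, 2, \ldots, k-1$ and $\ell = 1, 2, \ldots, w$. Replacing $j$ in (3) with $gw + \ell$ gives
$$c_i = \sum_{\ell=1}^{w} \sum_{g=0}^{k-1} a_{gw+\ell} \left[ b_{s(i+gw+\ell)} + b_{s(i-gw-\ell)} \right], \quad i = 1, 2, \ldots, m,$$
where $a_j = 0$ for $j > m$. If the inner sum
$$c_i^{(\ell)} = \sum_{g=0}^{k-1} a_{gw+\ell} \left[ b_{s(i+gw+\ell)} + b_{s(i-gw-\ell)} \right], \quad i = 1, 2, \ldots, m,$$
can be implemented and computed in one clock cycle, then the required number of clock cycles to calculate each $c_i$ decreases from $m$ to $w$.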
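The regrouping of (3) by words can be checked numerically. The following Python sketch assumes $k = \lceil m/w \rceil$, $a_j = 0$ for $j > m$, and the folding definition of $s(i)$ into $0, \ldots, m$; the function names are illustrative. Each iteration of the outer loop over `l` stands for one clock cycle, in which $k$ bits of $A$ contribute to every output bit.

```python
def s(i, m):
    """Fold an integer index into 0..m (assumed definition of s(i))."""
    r = i % (2 * m + 1)
    return r if r <= m else 2 * m + 1 - r


def rnb_multiply_bit_level(a, b, m):
    """Reference: one index j per clock cycle, m cycles in total."""
    c = [0] * (m + 1)
    for j in range(1, m + 1):            # clock cycle j
        for i in range(1, m + 1):        # all c_i are updated in parallel
            c[i] ^= a[j] & (b[s(i + j, m)] ^ b[s(i - j, m)])
    return c


def rnb_multiply_word_level(a, b, m, w):
    """Word level schedule: k indices j = g*w + l per clock cycle, w cycles."""
    k = (m + w - 1) // w                 # number of words, ceil(m / w)
    a_pad = a + [0] * (k * w - m)        # bits with subscript > m are zero
    c = [0] * (m + 1)
    for l in range(1, w + 1):            # clock cycle l
        for i in range(1, m + 1):        # all c_i are updated in parallel
            for g in range(k):           # k partial product pairs per cycle
                j = g * w + l
                c[i] ^= a_pad[j] & (b[s(i + j, m)] ^ b[s(i - j, m)])
    return c
```

Both functions return the same product for every word size $1 \le w \le m$; only the number of outer iterations, and hence the modeled cycle count, changes from $m$ to $w$.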
B. Proposed Word Level Architecture for Multiplication
Based on (3), a new architecture for reordered normal basis multiplication is proposed, which is shown in Fig. 1. The architecture contains a $(2m+1)$-bit circular shift register, which should be initialized with one of the input operands. The Expansion/Permutation module is just a reordering and copying module and does not contain any gates. This module accepts $2m + 1$ inputs from the circular shift register and provides $2km$ outputs for the two-input AND gates.
The circular shift register and the Expansion/Permutation module are in charge of generating the $b_{s(i+j)}$ and $b_{s(i-j)}$ terms, which are then multiplied by the appropriate $a_{gw+\ell}$ using two-input AND gates. At the end of the $w$th clock cycle, each of the one bit registers (one bit accumulators) contains the sum of $w$ terms of one of the output product bits. The sum of $2k$ of these $w$-term partial sums gives one output product bit. The $2k$-input XOR gates at the bottom of the architecture add these $2k$ partial sums together for each output product bit. Note that each output product bit is made up of $2kw$ terms.
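The data flow described above can be mimicked in software. The sketch below is a behavioral model under stated assumptions, not a transcription of Fig. 1: it assumes the register is loaded with $b_{s(t)}$ at position $t$, that it rotates by one position per clock cycle, and that the Expansion/Permutation module taps positions $(i + gw) \bmod (2m+1)$ and $(gw - i) \bmod (2m+1)$ for output bit $i$ and word $g$. The names `simulate_multiplier`, `acc_plus` and `acc_minus` are hypothetical.

```python
def s(i, m):
    """Fold an integer index into 0..m (assumed definition of s(i))."""
    r = i % (2 * m + 1)
    return r if r <= m else 2 * m + 1 - r


def simulate_multiplier(a, b, m, w):
    """Behavioral model of the word level multiplier (w clock cycles).

    a, b: lists of length m + 1 with bits at indices 1..m and index 0 set to 0.
    """
    n = 2 * m + 1
    k = (m + w - 1) // w                       # number of words
    a_pad = a + [0] * (k * w - m)              # subscripts above m read as 0
    reg = [b[s(t, m)] for t in range(n)]       # circular shift register R
    # 2k one bit accumulators per output bit, 2km in total.
    acc_plus = [[0] * k for _ in range(m + 1)]
    acc_minus = [[0] * k for _ in range(m + 1)]
    for l in range(1, w + 1):                  # clock cycle l
        reg = reg[1:] + reg[:1]                # rotate: reg[t] now holds b_s(t+l)
        for i in range(1, m + 1):
            for g in range(k):
                a_bit = a_pad[g * w + l]       # comb style input a_(gw+l)
                # Fixed taps: one AND gate followed by one XOR gate each.
                acc_plus[i][g] ^= a_bit & reg[(i + g * w) % n]
                acc_minus[i][g] ^= a_bit & reg[(g * w - i) % n]
    c = [0] * (m + 1)
    for i in range(1, m + 1):                  # final 2k-input XOR per output bit
        for g in range(k):
            c[i] ^= acc_plus[i][g] ^ acc_minus[i][g]
    return c
```

With these assumptions the taps are independent of the clock cycle, which is consistent with an Expansion/Permutation module that contains only wiring, and the per-cycle logic depth is one AND gate plus one XOR gate.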
The input operand $A$ is required to be fed into the multiplier in a comb style, i.e., in the first clock cycle the inputs are $a_1, a_{w+1}, a_{2w+1}, \ldots, a_{(k-1)w+1}$. For the second clock cycle the inputs are $a_2, a_{w+2}, a_{2w+2}, \ldots, a_{(k-1)w+2}$. Finally, in the $w$th clock cycle the inputs are $a_w, a_{2w}, a_{3w}, \ldots, a_{kw}$. Note that if the subscript of an input bit exceeds $m$, then it is replaced by a zero bit. As it is clear from Fig. 1, the
IV. ARCHITECTURE COMPLEXITIES
The area and delay complexity of the proposed design can be easily determined from Fig. 1. Registers appear in two parts of the design. The first part is the circular shift register, R, which contains $2m + 1$ flip-flops. The second part consists of the one bit registers that hold the partial sums of the output product bits; their number is equal to $2km$, since each output bit uses $2k$ registers. The total number of two-input AND gates is equal to $2km$, since each output product bit uses $2k$ two-input AND gates. XOR gates exist in two parts of the architecture. The first part consists of the XOR gates that follow the AND gates. For each output product bit the number of these two-input XOR gates is equal to $2k$, which is the same as the number of AND gates. For $m$ output product bits, a total of $2km$ of these two-input XOR gates exist. The second part consists of the $2k$-input XOR gates, one for each output product bit. The equivalent number of two-input XOR gates needed to implement these $2k$-input XOR gates is equal to $(2k - 1)m$. Consequently, the equivalent number of two-input XOR gates in the architecture is equal to $(4k - 1)m$.
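The counts above are easy to tabulate for a given field size. The helper below is a small sketch that only restates the formulas of this section; the function name and the dictionary keys are illustrative.

```python
def proposed_complexity(m, k):
    """Gate and register counts of the proposed multiplier for field degree m
    and k words, using the formulas stated in the text."""
    return {
        # circular shift register plus the one bit accumulators
        "registers": (2 * m + 1) + 2 * k * m,
        # two-input AND gates, 2k per output bit
        "and_gates": 2 * k * m,
        # two-input XOR equivalents: 2km after the ANDs, (2k - 1)m in the final XORs
        "xor_gates": 2 * k * m + (2 * k - 1) * m,
    }
```

For example, `proposed_complexity(233, 8)` evaluates the formulas for the field degree and one of the values of $k$ used in the numerical comparison.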
The critical path of the proposed multiplier consists of one AND gate and one XOR gate. Note that the $2k$-input XOR gate is not part of the critical path, since the sum of the $2k$ partial sums for each output $c_i$ has to be calculated only once, at the end of the multiplication. The area and timing complexities of the proposed multiplier and of similar architectures are shown in Table I. In this table, the delay of a two-input AND gate is denoted by $T_A$, and the delay of an $n$-input XOR gate is approximated by $\log_2 n \cdot T_X$. The first row of the table presents the well-known Massey-Omura architecture [10], and the second row presents the hybrid PISO architecture proposed in [8], both with $k$ levels of pipelining. The word level serial input parallel output architecture from [11] is shown in the third row. The fourth row presents the bit-serial word-parallel architecture proposed in [9]. The last row of the table presents our proposed architecture. As can be seen, both the critical path delay and the total multiplication delay of the proposed word level architecture are smaller than those of the previously proposed word level architectures.
For the purpose of illustration, we have tabulated the area-delay complexity of the proposed architecture and of the previously proposed multipliers in Table II. We have used the practical field size of $m = 233$, which is a binary field degree recommended by NIST (National Institute of Standards and Technology), with the number of parallel modules $k = 8, 16$, which are practical for VLSI implementation.
In this table, Area Cost is the number of AND gates plus twice the number of XOR gates plus twice the number of registers (we assume that the size of an AND gate is half the size of an XOR gate or of a one bit register). Delay Cost is the delay of the AND gates plus twice the delay of the XOR gates (we assume that the delay of an XOR gate is twice the delay of an AND gate). Area Delay Cost is the product of Area Cost and Delay Cost, which can be used as a measure of performance. As can be seen from the table, the proposed architecture has the best performance among all the multipliers.
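The cost model described above can be expressed as three small functions. This is a sketch of the weighting only: the function names are illustrative, the delay is counted in gate levels (which is an assumption about how the delay entries of Table I are converted to a single number), and the values it produces are evaluations of the stated formulas rather than entries copied from Table II.

```python
def area_cost(and_gates, xor_gates, registers):
    """AND gate = 1 unit; XOR gate and one bit register = 2 units each."""
    return and_gates + 2 * xor_gates + 2 * registers


def delay_cost(and_levels, xor_levels):
    """AND gate delay = 1 unit; two-input XOR gate delay = 2 units."""
    return and_levels + 2 * xor_levels


def area_delay_cost(and_gates, xor_gates, registers, and_levels, xor_levels):
    """Product of area cost and delay cost, used as the performance measure."""
    return area_cost(and_gates, xor_gates, registers) * delay_cost(and_levels, xor_levels)
```

For the proposed design, the gate counts come from the formulas of this section ($2km$ AND gates, $(4k-1)m$ XOR gates, $2m + 1 + 2km$ registers), and the critical path contributes one AND level and one XOR level.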
Taking into account the approximations for area and delay, the average performance increase for the proposed architecture is equal to 35% and 60% for w = 16, 8 respectively, compared to W-P B-S [9] which has the second best performance.
V. CONCLUSIONS
A new high speed word level finite field multiplier using a reordered normal basis has been presented. The complexity comparison and numerical examples show that the new architecture is faster than other similar proposals and performs better when the area-delay product is taken as the measure of performance. The proposed multiplier can be used in public key cryptography applications where high speed multipliers are needed.
