Abstract-A new parallel-in serial-out finite field multiplier using redundant representation is proposed for a class of fields. It is shown that the proposed architecture has significantly lower complexity than previously proposed architectures using the same representation. The new multiplier can be implemented in a hybrid fashion or at digit level, which gives the designer a considerable range of area-speed trade-offs. An FPGA implementation of the proposed multiplier is also presented and compared with an FPGA realization of a previously proposed multiplier.
I. INTRODUCTION
Recently, the use of finite field arithmetic in cryptography has gained increasing importance. The main reason is that two of the three most popular public key cryptosystems, namely El-Gamal and elliptic curve cryptosystems, are based on finite field arithmetic [1], [3]. Research in this area is moving toward finding new architectures that implement the arithmetic operations more efficiently.
One important factor that has great impact on the performance of finite field arithmetic is the basis in which finite field elements are represented [4]. The most commonly used bases are the polynomial (or standard) basis, the normal basis, and the dual basis. Recently, a method of redundant representation of field elements has attracted attention [2], [5]. In this work, we continue the line of research on redundant representation, which was originally derived from the minimal cyclotomic ring [5]. The idea is to embed a field in the minimal cyclotomic ring, in which the arithmetic operations are performed. Redundant representation not only offers a free squaring operation, as the normal basis does, but also accommodates ring-type operations in which modulo reduction can be avoided for a field multiplication operation.
Two different types of architectures were proposed for hardware implementation of redundant basis multipliers [5]: a parallel-in serial-out architecture and a serial-in parallel-out architecture. In this paper we show that the complexity of redundant representation multiplication for a class of fields can be reduced significantly by exploiting a symmetry property of the redundant representation. The architecture for the improved multiplication is presented. It is shown that an FPGA implementation of the proposed multiplier uses fewer cells than the previously proposed architecture. A hybrid or digit-level version of the proposed multiplier is also presented to provide a considerable range of area-timing trade-offs.
The rest of this article is organized as follows. Section II gives a brief review of redundant representation and multiplication. A new multiplication method in redundant representation for a class of fields is proposed in Section III. In Section IV, a new parallel-in serial-out redundant basis multiplier for a class of fields is proposed, and its architectural complexity is compared with that of the previously proposed architectures. Section IV also shows how the proposed architecture can be extended to a hybrid or digit-level fashion. FPGA implementation and comparison of the proposed multiplier and the previous proposal are presented in Section V. Finally, a few concluding remarks are given in Section VI.
II
Let K be a field and $f(x) \in K[x]$ be a polynomial defined over K. Then the field that contains all the roots of $f(x)$ is called the splitting field of the polynomial $f(x)$. The splitting field of $x^n - 1$ is called the nth cyclotomic field, denoted by $K^{(n)}$. Let β be a primitive nth root of unity. Then $K^{(n)}$ is generated by β over K, and elements in $K^{(n)}$ can be represented in the form
$$A = a_0 + a_1 \beta + a_2 \beta^2 + \cdots + a_{n-1} \beta^{n-1}, \quad a_i \in K.$$
Thus the set $[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ can be viewed as a representation 'basis' for $K^{(n)}$ [5]. Since the representation of A is not unique, this set is called a redundant basis for any subfield of $K^{(n)}$. Note that the elements of the redundant basis form a cyclic group of order n and
$$\beta^n = 1.$$
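As a small illustration of why the representation is not unique, the following worked example may help. It is not taken from the original text and assumes the smallest case $K = F_2$, $n = 3$, where β is a primitive cube root of unity in $F_4$.

```latex
% Assumption for illustration only: K = F_2, n = 3, \beta a primitive cube root of unity.
\[
\beta^3 - 1 = (\beta - 1)(1 + \beta + \beta^2) = 0, \quad \beta \neq 1
\;\Longrightarrow\; 1 + \beta + \beta^2 = 0 .
\]
% Over F_2 this gives \beta = 1 + \beta^2, so two coordinate vectors represent \beta:
\[
\beta = 0 + 1\cdot\beta + 0\cdot\beta^2 = 1 + 0\cdot\beta + 1\cdot\beta^2 ,
\qquad\text{i.e. } (0,1,0) \text{ and } (1,0,1).
\]
```

More generally, adding the all-ones vector to any coordinate vector leaves the represented element unchanged, since $1 + \beta + \cdots + \beta^{n-1} = 0$ whenever $\beta \neq 1$ and $\beta^n = 1$.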
We are particularly interested in the following case: let $K = F_2$ and let $K^{(n)}$ be a cyclotomic field in which $F_{2^m}$ can be embedded. The following lemma characterizes the relationship between m and n.
Lemma 1. [4]
Let n be an odd positive integer. Then
$$F_{2^m} \subseteq K^{(n)}$$
if and only if m divides the multiplicative order of 2 modulo n.
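The divisibility condition in Lemma 1 is easy to check numerically. The sketch below is an illustration only; the helper names `multiplicative_order`, `embeds`, and `smallest_n` are hypothetical and do not come from the paper. It searches for the smallest odd n whose multiplicative order of 2 is divisible by a given m.

```python
from math import gcd


def multiplicative_order(a: int, n: int) -> int:
    """Smallest t >= 1 with a**t == 1 (mod n); requires n > 1 and gcd(a, n) == 1."""
    if n <= 1 or gcd(a, n) != 1:
        raise ValueError("a must be a unit modulo n > 1")
    t, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        t += 1
    return t


def embeds(m: int, n: int) -> bool:
    """Condition of Lemma 1: n odd and m divides the order of 2 modulo n."""
    return n > 1 and n % 2 == 1 and multiplicative_order(2, n) % m == 0


def smallest_n(m: int) -> int:
    """Smallest odd n > 1 for which the condition of Lemma 1 holds."""
    n = 3
    while not embeds(m, n):
        n += 2
    return n


if __name__ == "__main__":
    for m in (2, 3, 4, 5, 163):
        print(m, smallest_n(m))
```

For m = 163 the search returns n = 653, the value used for the hardware implementation later in the paper.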
B. Redundant Basis Multiplication in $F_{2^m}$
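Consider the redundant basis of $F_{2^m}$ over $F_2$:
$$I = [1, \beta, \beta^2, \ldots, \beta^{n-1}].$$
Let field elements $A, B \in F_{2^m}$ be represented with respect to I as
$$A = \sum_{i=0}^{n-1} a_i \beta^i, \qquad B = \sum_{i=0}^{n-1} b_i \beta^i, \qquad a_i, b_i \in F_2.$$
Then the product of A and B can be given by
$$C = A \cdot B = \sum_{j=0}^{n-1} c_j \beta^j,$$
where
$$c_j = \sum_{i=0}^{n-1} a_i b_{(j-i)}, \quad j = 0, 1, \ldots, n-1, \tag{2}$$
and the subscript $(j-i)$ is reduced modulo n.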
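Because $\beta^n = 1$, the product coefficients in redundant representation form a cyclic convolution of the two coordinate vectors over $F_2$, with no modular reduction by a field polynomial. A minimal software sketch of this computation follows; the function name `rb_multiply` and the list-of-bits encoding are assumptions made for illustration, not the hardware structure of [5].

```python
def rb_multiply(a, b):
    """Cyclic convolution over F_2: c_j = XOR over i of a_i AND b_((j - i) mod n)."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length n")
    n = len(a)
    c = [0] * n
    for j in range(n):
        for i in range(n):
            # Addition in F_2 is XOR, multiplication is AND.
            c[j] ^= a[i] & b[(j - i) % n]
    return c


if __name__ == "__main__":
    # n = 3: beta is a primitive cube root of unity, so beta * beta^2 = 1.
    print(rb_multiply([0, 1, 0], [0, 0, 1]))  # [1, 0, 0]
    # (1 + beta)^2 = 1 + beta^2 in characteristic 2.
    print(rb_multiply([1, 1, 0], [1, 1, 0]))  # [1, 0, 1]
```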
C. Redundant Basis and Gauss Periods
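Some redundant bases can be easily introduced through Gauss periods. The Gauss periods over $F_2$ are defined as follows: let $m, k \geq 1$ be integers such that $n = mk + 1$ is a prime. Let K be the unique subgroup of order k of the multiplicative group $Z_n^{\times}$ of $Z_n = Z/nZ$. Then, for any primitive nth root of unity β in $F_{2^{mk}}$, the element
$$\gamma = \sum_{a \in K} \beta^a$$
is called a Gauss period of type (m, k) over $F_2$.
Lemma 2. Let $A \in F_{2^m}$ be represented as $A = (a_0, a_1, \ldots, a_{n-1})$ with respect to I. Assume $k \geq 2$ is even, then
$$a_i = a_{n-i}, \quad i = 1, 2, \ldots, n-1.$$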
Proof: This is a direct result of Lemma 2 in [5], noting the relation between the redundant basis $I$ and the basis $I_4$ used in [5].
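The role of the parity of k can be seen from the subgroup K itself: when n is prime and k is even, the order-k subgroup of $Z_n^{\times}$ contains $-1 \equiv n - 1$, so it is closed under $i \mapsto n - i$. The sketch below is an illustration added here, not part of the cited proof; `subgroup_of_order` and `closed_under_negation` are hypothetical helper names, and n is assumed to be prime without being checked.

```python
def subgroup_of_order(k: int, n: int) -> list:
    """Order-k subgroup of the units modulo a prime n, assuming k divides n - 1.

    For prime n the unit group is cyclic, so the solutions of
    x**k == 1 (mod n) are exactly the elements of that subgroup.
    """
    if (n - 1) % k != 0:
        raise ValueError("k must divide n - 1")
    return [x for x in range(1, n) if pow(x, k, n) == 1]


def closed_under_negation(subgroup, n: int) -> bool:
    """True if x in the subgroup implies n - x is in the subgroup."""
    members = set(subgroup)
    return all((n - x) % n in members for x in members)


if __name__ == "__main__":
    K = subgroup_of_order(4, 653)            # type (163, 4): n = 163 * 4 + 1
    print(len(K), closed_under_negation(K, 653))
    print(subgroup_of_order(3, 7))           # odd k: -1 is not a member
```

With an odd k, as in the second call, the subgroup is not closed under negation, which is consistent with Lemma 2 requiring k to be even.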
III. PROPOSED MULTIPLICATION USING REDUNDANT BASIS REPRESENTATION
When there exists a Gauss period of type (m, k) over $F_2$ and k is even, the complexity of the redundant basis multiplier proposed in [5] can be reduced further by utilizing the result in Lemma 2.
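To facilitate multiplication of elements in the redundant basis representation, a new function s(i) mapping the set of integers to the set $\{0, 1, \ldots, \frac{n-1}{2}\}$ is defined as follows:
$$s(i) = \begin{cases} i \bmod n, & 0 \leq i \bmod n \leq \frac{n-1}{2}, \\ n - (i \bmod n), & \text{otherwise}. \end{cases}$$
Taking into account (3) and following (2), the product coefficient $c_j$, $0 \leq j \leq \frac{n-1}{2}$, can be rewritten as
$$c_j = a_0 b_j + \sum_{i=1}^{\frac{n-1}{2}} a_{s(i)} \left( b_{s(j+i)} + b_{s(j-i)} \right). \tag{4}$$
From (4) it is easy to prove that
$$c_{n-j} = c_j, \quad j = 1, 2, \ldots, \frac{n-1}{2}. \tag{5}$$
A proof is given as follows. It is obvious from the definition of the function s(i) that $s(0) = 0$ and $s(i) = s(n-i) = s(-i)$, hence
$$c_{n-j} = a_0 b_j + \sum_{i=1}^{\frac{n-1}{2}} a_{s(i)} \left( b_{s(j+i)} + b_{s(j-i)} \right) = c_j.$$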
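The folded sum and the resulting symmetry of the product can be checked exhaustively for a small n. The following sketch is an illustration under stated assumptions: both operands are taken to satisfy $a_i = a_{n-i}$ and $b_i = b_{n-i}$, s(i) is implemented as the index folding onto $\{0, \ldots, \frac{n-1}{2}\}$, and all function names are hypothetical.

```python
from itertools import product


def s(i: int, n: int) -> int:
    """Fold an integer index onto 0..(n-1)/2, so that s(i) == s(-i) == s(n - i)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def full_product(a, b):
    """Reference: all n coefficients of the cyclic convolution over F_2."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) & 1 for j in range(n)]


def half_product(a, b):
    """Coefficients c_0..c_((n-1)/2) from the folded sum; reads only the lower half of a and b."""
    n = len(a)
    h = (n - 1) // 2
    out = []
    for j in range(h + 1):
        c = a[0] & b[j]
        for i in range(1, h + 1):
            c ^= a[s(i, n)] & (b[s(j + i, n)] ^ b[s(j - i, n)])
        out.append(c)
    return out


def symmetric_vectors(n: int):
    """All length-n bit vectors v with v[i] == v[n - i] for i = 1..n-1 (n odd)."""
    h = (n - 1) // 2
    for half in product((0, 1), repeat=h + 1):
        yield list(half) + [half[n - i] for i in range(h + 1, n)]


if __name__ == "__main__":
    n = 7
    ok = all(
        half_product(a, b) == full_product(a, b)[: (n + 1) // 2]
        for a in symmetric_vectors(n)
        for b in symmetric_vectors(n)
    )
    print(ok)
```

Only $\frac{n+1}{2}$ coefficients are computed by `half_product`; the remaining ones follow from $c_{n-j} = c_j$.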
IV. PARALLEL-IN SERIAL-OUT MULTIPLICATION USING REDUNDANT REPRESENTATION

A. Bit-level Parallel-In Serial-Out Architecture
From (4), a new architecture for bit-level parallel-in serial-out multiplication is proposed. The architecture for an n-bit multiplier is shown in Fig. 1. From (5) it can be seen that only the first $\frac{n+1}{2}$ product coefficients need to be generated. Thus it takes $\frac{n+1}{2}$ clock cycles for the multiplier to finish one operation. A complexity comparison between the proposed architecture and the previously proposed parallel-in serial-out architecture [5] is presented in Table I. In this table, the delay of a two-input AND gate is denoted by $T_A$ and the delay of an n-input XOR gate is approximated by $\log_2 n \, T_X$. As can be seen from the architectural analysis, the proposed architecture takes smaller area while providing the same speed.
The bit-level parallel-in serial-out architecture takes $\frac{n+1}{2}$ clock cycles for an n-bit multiplier to compute all output bits. This makes the parallel-in serial-out architecture impractical for some applications such as elliptic curve cryptography, because in these applications the field sizes are required to be at least 200 and the bit-level multiplier is not fast enough.
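The serial-output behaviour described above can be mimicked in software. The sketch below is a behavioural model only, not a description of the circuit in Fig. 1: it assumes symmetric operands ($a_i = a_{n-i}$, $b_i = b_{n-i}$), emits one coefficient per simulated clock, and fills in the remaining coefficients from $c_{n-j} = c_j$. The names `fold`, `serial_out`, and `multiply` are hypothetical.

```python
def fold(i: int, n: int) -> int:
    """Index folding onto 0..(n-1)/2, playing the role of s(i) in the text."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def serial_out(a, b):
    """Yield (clock, c_j) pairs, one product coefficient per clock cycle."""
    n = len(a)
    h = (n - 1) // 2
    for clock, j in enumerate(range(h + 1), start=1):
        c = a[0] & b[j]
        for i in range(1, h + 1):
            c ^= a[i] & (b[fold(j + i, n)] ^ b[fold(j - i, n)])
        yield clock, c


def multiply(a, b):
    """Return (all n coefficients, number of clock cycles used)."""
    n = len(a)
    half = [c for _, c in serial_out(a, b)]
    # The upper coefficients mirror the lower ones: c_j = c_(n-j).
    full = half + [half[n - j] for j in range(len(half), n)]
    return full, len(half)


if __name__ == "__main__":
    a = [1, 1, 0, 1, 1, 0, 1]
    b = [0, 1, 1, 0, 0, 1, 1]
    coeffs, clocks = multiply(a, b)
    print(coeffs, clocks)
```

For n = 7 the model uses 4 clock cycles, and for n = 653 it would use 327, matching the count of $\frac{n+1}{2}$ stated above.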
One way to speed up the multiplication is to implement the architecture in a hybrid fashion. In a hybrid or digit-level multiplier, the designer can set the trade-off between area and speed during the design process. This feature of digit-level multipliers is very important for VLSI designers, since it is usually impractical to use a fully parallel architecture because of its area requirements, while the parallel-in serial-out architectures are not fast enough. Fig. 2 shows a hybrid version of the proposed multiplier when the number of modules is r = 2. The multiplier is approximately r times faster. Also note that for the multiplier shown in Fig. 2, the input connections for the second module, the ones coming out of the shift register, are a circularly shifted version of the input connections for the first module by $\frac{n-1}{4} + 1$. In the general case in which an n-bit multiplier contains r modules, the input connections for each module are a circularly shifted version of the connections for the previous module by $\lceil \frac{n+1}{2r} \rceil$. The area-timing trade-offs for the proposed digit-level parallel-in serial-out multiplier containing r modules are compared with those of the previously proposed hybrid multiplier in Table II.
[Table residue, row for [5]: $n$, $n$, $\frac{n+1}{2}$, $T_A + \log_2 n \, T_X$]
V. HARDWARE IMPLEMENTATION
The proposed multiplier and the similar one proposed in [5] have been implemented on an FPGA. We used the field size m = 163, for which the smallest cyclotomic ring has n = 653. For the hardware implementation we used a Stratix FPGA from Altera, and simulations were done with the Quartus II software. Hardware implementation results are given in Table III. The results show a 5% decrease in the number of logic elements (LEs) used in the FPGA. Note that the measured critical path delay of the proposed multiplier is slightly smaller than that of the previous proposal. This small difference arises because the architectural analysis approximates the delay of an n-input XOR gate by $\log_2 n \, T_X$.
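To make the hybrid trade-off of Section IV concrete for the field size used here, the sketch below lists which product coefficients r parallel modules could produce in each clock cycle. It is a scheduling illustration under an assumption, not the wiring of Fig. 2: module number q is assumed to start at coefficient index $q \lceil \frac{n+1}{2r} \rceil$, mirroring the circular shift between consecutive modules. The name `hybrid_schedule` is hypothetical.

```python
def hybrid_schedule(n: int, r: int):
    """For each clock cycle, list the coefficient indices j produced by the r modules."""
    if n % 2 == 0 or r < 1:
        raise ValueError("n must be odd and r must be positive")
    total = (n + 1) // 2        # coefficients c_0 .. c_((n-1)/2) are needed
    step = -(-total // r)       # ceil((n + 1) / (2 r)), offset between modules
    schedule = []
    for t in range(step):
        # Assumed: module q handles indices q*step, q*step + 1, ... ; idle once past total.
        schedule.append([q * step + t for q in range(r) if q * step + t < total])
    return schedule


if __name__ == "__main__":
    for r in (1, 2, 4):
        print(r, len(hybrid_schedule(653, r)))
```

For n = 653 the schedule takes 327 cycles with one module and 164 cycles with two, where 164 equals $\frac{n-1}{4} + 1$, so the run time drops by roughly a factor of r at the cost of r copies of the module.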
VI. CONCLUSIONS
A new parallel-in serial-out finite field multiplier using redundant basis representation is proposed. The architectural complexity of the proposed multiplier was compared with that of previously proposed architectures, and it has been shown that the proposed architecture has lower complexity and takes smaller area than the previous proposal while providing the same speed. The proposed architecture can also be implemented in a hybrid fashion. The hybrid architecture gives the designer the ability to set the trade-off between area and speed. This feature makes the hybrid architecture suitable for VLSI implementation of multipliers for large field sizes, since a bit-serial multiplier is usually not fast enough. The proposed architecture and the previous similar proposal were implemented on an FPGA, and their speed-area comparison also shows the advantages of the proposed multiplier.
