This article presents simple and highly regular architectures for finite field multipliers using a redundant representation. The basic idea is to embed a finite field into a cyclotomic ring that has a basis with the elegant multiplicative structure of a cyclic group. An important feature of our architectures is that they provide area-time trade-offs, which allow the multipliers to be implemented in a partial-parallel (hybrid) fashion. This hybrid architecture is particularly significant for VLSI implementation over very large fields. Squaring in the redundant representation is simply a permutation of the coordinates. We show that when there is an optimal normal basis, the proposed bit-serial and hybrid multiplier architectures have very low space complexity. Constant multiplication is also considered and is shown to benefit from the redundant representation.
INTRODUCTION
Efficient computations in finite fields and their architectures are important in many applications, including coding theory, computer algebra systems and public-key cryptosystems (e.g., elliptic curve cryptosystems). Although all finite fields of the same cardinality are isomorphic, their arithmetic efficiency depends greatly on the choice of basis used to represent field elements. The most commonly used bases are polynomial bases (PB) and normal bases (NB), sometimes combined with dual bases (DB) [15]. A major advantage of normal bases in fields of characteristic two is that squaring is simply a cyclic shift of the coordinates, which makes these bases useful for computing large exponentiations and multiplicative inverses [13, 11, 1]. Also, the multiplication table of a normal basis is symmetric, which makes it suitable for hardware implementation. This is the basis for the multiplier of Massey-Omura [16] and that of Onyszchuk et al. [18].
Recently, Gao et al. [7, 8] have proposed a novel method to perform fast multiplication with a normal basis generated by Gauß periods. The main idea is to embed a field in a larger ring, perform multiplication (using the Fast Fourier Transform) there and then convert the result back to the field. The ring they use is referred to as a cyclotomic ring which has an extremely simple basis whose elements form a cyclic group. One purpose of this paper is to make this idea more explicit and present architectures that are suitable for hardware implementation.
We are mainly interested in finite fields of characteristic two, i.e., $\mathbb{F}_{2^m}$, which are one of the two types of fields most commonly used in practice (the other being $\mathbb{F}_p$, where $p$ is a prime). We show how to find the smallest cyclotomic ring in which $\mathbb{F}_{2^m}$ can be embedded. Since the "embedding" is not unique, each field element can be represented in the ring in more than one way, i.e., the representation contains a certain amount of redundancy. In this article, we also discuss how this redundant representation of a field element can be efficiently converted to a normal basis and vice versa.
Another purpose of our paper is to present architectures for arithmetic in $\mathbb{F}_{2^m}$. Both bit-serial and hybrid multipliers using the redundant representation are proposed, and their complexities are discussed. A modified form of the multiplier with reduced complexity is also presented; its bit-serial and hybrid architectures have lower complexity than the previously reported normal basis multipliers. A constant multiplier using the redundant representation is also considered.
We now mention other related work. Itoh and Tsujii [14] constructed a multiplier for a class of fields defined by irreducible all-one polynomials (AOPs) and equally spaced polynomials (ESPs). Wolf [22] found a simple multiplication architecture for irreducible AOPs. Drolet [4] uses maximum subfields in cyclotomic rings. Silverman [19] considered the special case in which there is a type I optimal normal basis; this case is also considered in [7, 8]. A more recent article on redundant representation is [10].
The organization of this paper is as follows. Section 2 shows how the redundant representation of a field element can be derived from cyclotomic rings. In Section 3, multiplication using the redundant representation is discussed, and basis conversions are given. Architectures of bit-serial, bit-parallel, hybrid, and constant multipliers are presented in Section 4. For fields that have a type II ONB, we show in Section 5 that more efficient architectures can be developed using a basis derived from the redundant representation; this multiplier architecture is highly regular and also has low complexity. Finally, a few concluding remarks are given in Section 6.
CYCLOTOMIC FIELDS AND REDUNDANT REPRESENTATION
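Let $\mathcal{K}$ be any field and $n$ a positive integer. The $n$th cyclotomic field over $\mathcal{K}$, denoted by $\mathcal{K}^{(n)}$, is defined to be the splitting field of $x^n - 1$ over $\mathcal{K}$. In particular, for finite $\mathcal{K}$, $n$ divides $|\mathcal{K}|^k - 1$ for some $k$ and is thus coprime to the characteristic. Let $\beta$ be a primitive $n$th root of unity in some extension of $\mathcal{K}$. Then $\mathcal{K}^{(n)}$ is generated by $\beta$ over $\mathcal{K}$, and elements of $\mathcal{K}^{(n)}$ can be written in the form
$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \qquad a_i \in \mathcal{K}. \tag{1}$$
Since $\beta^n = 1$, the elements $1, \beta, \ldots, \beta^{n-1}$ form a cyclic group under multiplication: $\beta^i \cdot \beta^j = \beta^{(i+j) \bmod n}$.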
This simple multiplication table allows us to design efficient architectures of low complexity as shown in Section 3.
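Suppose that $\mathbb{F}_{q^m}$ is embedded in $\mathcal{K}^{(n)}$, where $q$ is a prime power. Then arithmetic in $\mathbb{F}_{q^m}$ using the redundant representation can be performed in the following three steps:
1. Represent elements of $\mathbb{F}_{q^m}$ in the form (1).
2. Perform the arithmetic in $\mathcal{K}^{(n)}$, using $\beta^n = 1$.
3. Convert the result back to the original representation of $\mathbb{F}_{q^m}$.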
The case of a type I optimal normal basis, in which $n = m + 1$, is the one considered by Silverman, Gao, et al. [19, 7, 8].
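As an illustration of the three steps, the Python sketch below works through the type I case with the assumed parameters $m = 4$, $n = 5$: since 2 is primitive modulo 5, $\mathbb{F}_{16} = \mathbb{F}_2(\beta)$ for a primitive 5th root of unity $\beta$, and $\{\beta, \beta^2, \beta^4, \beta^8\}$ is a normal basis. The function names (`nb_to_rb`, `rb_mul`, `rb_to_nb`, `nb_mul`) are hypothetical, and the code models the arithmetic only, not a circuit.

```python
# Illustrative sketch: type I case with assumed parameters m = 4, n = m + 1 = 5.
# The normal basis of F_16 is beta^(2^i), i = 0..3, with beta a primitive 5th root of unity.
M, N = 4, 5


def nb_to_rb(a_nb):
    """Step 1: normal-basis coordinates -> redundant (length-n) coordinates."""
    rb = [0] * N
    for i, bit in enumerate(a_nb):
        rb[pow(2, i, N)] = bit  # beta^(2^i) = beta^(2^i mod n); coordinate 0 stays 0
    return rb


def rb_mul(a, b):
    """Step 2: multiply in the cyclotomic ring, beta^i * beta^j = beta^((i+j) mod n)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] ^= a[i] & b[j]
    return c


def rb_to_nb(c):
    """Step 3: back to the normal basis.

    In characteristic 2, 1 + beta + ... + beta^(n-1) = 0, so complementing all n
    coordinates leaves the element unchanged; do so if the beta^0 coordinate is set.
    """
    if c[0]:
        c = [x ^ 1 for x in c]
    return [c[pow(2, i, N)] for i in range(M)]


def nb_mul(a_nb, b_nb):
    return rb_to_nb(rb_mul(nb_to_rb(a_nb), nb_to_rb(b_nb)))


print(nb_mul([1, 0, 0, 0], [0, 0, 1, 0]))  # beta * beta^4 = 1 -> [1, 1, 1, 1]
```

For $n = m + 1$ the only redundancy is the all-ones vector, so clearing the $\beta^0$ coordinate yields a unique normal basis representation.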
Remark 2 If there is a type II optimal normal basis in $\mathbb{F}_{2^m}$, then there is an RB of size $2m+1$ for $\mathbb{F}_{2^m}$.
This case will be considered in more detail in Section 5. To conclude this section, Table 1 gives the smallest values of $n$ such that $\mathbb{F}_{2^m}$ is contained in $\mathcal{K}^{(n)}$, for the range of $m$ covered by the table.
MULTIPLICATION USING REDUNDANT REPRESENTATION
From now on we only consider fields of characteristic two.
Multiplication Operation
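Consider the basis of our redundant representation for $\mathbb{F}_{2^m}$ over $\mathbb{F}_2$:
$$\mathcal{I}_1 = \{1, \beta, \beta^2, \ldots, \beta^{n-1}\}.$$
Let $A = \sum_{i=0}^{n-1} a_i\beta^i$ and $B = \sum_{i=0}^{n-1} b_i\beta^i$ with $a_i, b_i \in \mathbb{F}_2$. Since $\beta^n = 1$, the product of the field elements $A$ and $B$ can be given by
$$C = A \cdot B = \sum_{i=0}^{n-1} c_i\beta^i, \qquad c_i = \sum_{j=0}^{n-1} a_j\, b_{\langle i-j\rangle}, \tag{3}$$
where $\langle i-j\rangle$ in the subscript denotes that $i-j$ is to be reduced modulo $n$.
Thus a multiplication operation using the redundant representation is determined by expression (3).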
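On the other hand, the squaring of an element $A$ using the basis $\mathcal{I}_1$ can simply be performed as follows:
$$A^2 = \sum_{i=0}^{n-1} a_i\beta^{\langle 2i\rangle},$$
since $a_i^2 = a_i$ and the cross terms vanish in characteristic two; here $\langle\cdot\rangle$ denotes reduction modulo $n$. Note that $n$ is odd, because it is coprime to the characteristic two, so $2$ is invertible modulo $n$ and $A^2$ can be written as
$$A^2 = \sum_{i=0}^{n-1} a_{\langle 2^{-1} i\rangle}\,\beta^i, \qquad 2^{-1} \equiv \frac{n+1}{2} \pmod n.$$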
Clearly, a squaring operation using redundant representation is equivalent to a permutation of the element coordinates.
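Because $2$ is invertible modulo the odd number $n$, squaring only reindexes coordinates. The sketch below (hypothetical names, bits held in Python lists) implements the permutation with $2^{-1} \equiv (n+1)/2 \pmod n$ and includes the general product from (3) as a reference.

```python
def rb_mul(a, b):
    """Reference product in the redundant basis (lists of bits), expression (3)."""
    n = len(a)
    return [sum(a[j] & b[(i - j) % n] for j in range(n)) % 2 for i in range(n)]


def rb_square(a):
    """Squaring as a pure permutation: (A^2)_i = a_<i * 2^-1 mod n>; n must be odd."""
    n = len(a)
    inv2 = (n + 1) // 2  # 2 * (n + 1) / 2 = n + 1 = 1 (mod n)
    return [a[(i * inv2) % n] for i in range(n)]


a = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # n = 11, A = beta + beta^2 + beta^10
print(rb_square(a))  # beta^2 + beta^4 + beta^9 -> [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0]
```

In hardware this permutation is just a rewiring, which is why squaring costs no gates.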
Gauß Period, Normal Basis and Redundant Basis
Some redundant bases arise naturally from normal bases generated by Gauß periods, and this connection yields the relation between the RB and the normal basis, as well as the conversion between them. This is discussed below.
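The Gauß period (GP), introduced by Gauß, is defined as follows. Let $m, k \ge 1$ be integers such that $n = mk + 1$ is a prime, and let $q$ be a prime power with $\gcd(q, n) = 1$.
Let $\mathcal{K}$ be the unique subgroup of order $k$ of the multiplicative group $\mathbb{Z}_n^{\times}$. Then for any primitive $n$th root of unity $\beta$ in $\mathbb{F}_{q^{mk}}$, the element
$$\gamma = \sum_{a \in \mathcal{K}} \beta^{a} = \sum_{j=0}^{k-1} \beta^{\alpha^j} \tag{4}$$
is called a Gauß period of type $(m,k)$ over $\mathbb{F}_q$, where $\alpha$ is a primitive $k$th root of unity in $\mathbb{Z}_{mk+1}^{\times}$. It can be checked that $\gamma \in \mathbb{F}_{q^m}$. For example, when $k = 2$, $\alpha$ is a primitive square root of unity in $\mathbb{Z}_{2m+1}^{\times}$, so $\alpha = -1$ and $\gamma = \beta + \beta^{-1}$. This is the case that will be discussed in Section 5.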
GPs have been used to construct normal bases with low complexity [17, 3]. A GP of type $(m,k)$ over $\mathbb{F}_2$ naturally introduces a normal basis $\mathcal{I}_2 = \{\gamma, \gamma^2, \gamma^{2^2}, \ldots, \gamma^{2^{m-1}}\}$ of $\mathbb{F}_{2^m}$ over $\mathbb{F}_2$ if and only if $\gcd(mk/e, m) = 1$, where $e$ is the order of $2$ modulo $n$. Furthermore, such a normal basis has complexity at most $mk' - 1$, with $k' = k$ if $k$ is even and $k' = k + 1$ otherwise [3, 21, 6]. In particular, GPs of types $(m,1)$ and $(m,2)$ generate optimal normal bases (ONBs) with complexity $2m-1$, which are usually called type-I and type-II ONBs, respectively [17].
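The normality criterion can be explored numerically. The Python sketch below, with hypothetical function names and the assumption (not checked in the code) that $n = mk + 1$ is prime, computes the subgroup $\mathcal{K}$, the order $e$ of 2 modulo $n$, the criterion $\gcd(mk/e, m) = 1$, and the index sets $2^i\mathcal{K}$. When the criterion holds, these $m$ sets are disjoint and cover $\{1, \ldots, n-1\}$.

```python
from math import gcd


def subgroup_of_order(k, n):
    """The unique subgroup K of order k in Z_n^x (n prime, k | n - 1)."""
    return sorted(x for x in range(1, n) if pow(x, k, n) == 1)


def order_of_2(n):
    """Multiplicative order e of 2 modulo the odd number n."""
    e, t = 1, 2 % n
    while t != 1:
        t = (t * 2) % n
        e += 1
    return e


def gp_is_normal(m, k):
    """Criterion from the text: gcd(mk/e, m) == 1 with e = ord_n(2), n = mk + 1."""
    n = m * k + 1
    return gcd(m * k // order_of_2(n), m) == 1


def cosets(m, k):
    """Index sets 2^i * K (mod n), i = 0..m-1: the exponents of beta in gamma^(2^i)."""
    n = m * k + 1
    K = subgroup_of_order(k, n)
    return [sorted(pow(2, i, n) * a % n for a in K) for i in range(m)]


print(cosets(5, 2))                      # [[1, 10], [2, 9], [4, 7], [3, 8], [5, 6]]
print(gp_is_normal(2, 3), cosets(2, 3))  # False [[1, 2, 4], [1, 2, 4]]
```

For $(m,k) = (2,3)$ the two index sets coincide, so the Gauß period does not generate a normal basis, in agreement with the criterion.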
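For a normal basis generated with a GP of type $(m,k)$, from (4) we have
$$\gamma^{2^i} = \sum_{j=0}^{k-1} \beta^{2^i\alpha^j}, \qquad 0 \le i \le m-1,$$
where $\alpha$ is a primitive $k$th root of unity in $\mathbb{Z}_{mk+1}^{\times}$. Note that each element of $\mathcal{I}_2$ is a sum of $k$ powers of $\beta$. Since the exponents $2^i\alpha^j$ ($0 \le i \le m-1$, $0 \le j \le k-1$) run through all of $\mathbb{Z}_n^{\times}$ exactly once when $\mathcal{I}_2$ is a normal basis, these $mk$ elements are exactly $\beta, \beta^2, \ldots, \beta^{mk}$.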
Conversions of Bases
Among the three steps of redundant representation arithmetic, the first and the final steps deal with the change of representations. In this subsection we discuss the conversions between the normal basis and the redundant basis derived from the Gauß period. We show that such conversions can be done in hardware with almost no cost.
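Before giving the conversions between the normal basis $\mathcal{I}_2$ and the RB $\mathcal{I}_1$, we first introduce two intermediate "bases". Following the discussion in the previous subsection, we separate each sum of $k$ terms in $\mathcal{I}_2$ and put the $mk$ elements in an ordered set, denoted by $\mathcal{I}_3$:
$$\mathcal{I}_3 = \{\beta^{\alpha^0}, \ldots, \beta^{\alpha^{k-1}},\ \beta^{2\alpha^0}, \ldots, \beta^{2\alpha^{k-1}},\ \ldots,\ \beta^{2^{m-1}\alpha^0}, \ldots, \beta^{2^{m-1}\alpha^{k-1}}\}.$$
Clearly, $\mathcal{I}_3$ can serve as a "basis" of $\mathbb{F}_{2^m}$. The second intermediate "basis" is given by
$$\mathcal{I}_4 = \{\beta, \beta^2, \ldots, \beta^{mk}\}.$$
From the discussion in the previous subsection, we know that $\mathcal{I}_4$ has exactly the same $mk$ elements as $\mathcal{I}_3$ but in a different order. Moreover, the permutation can be carried out as follows: if $A = \sum_{i=0}^{m-1}\sum_{j=0}^{k-1} a_{i,j}\beta^{2^i\alpha^j}$ with respect to $\mathcal{I}_3$ and $A = \sum_{l=1}^{mk} a'_l\beta^l$ with respect to $\mathcal{I}_4$, then
$$a'_{\langle 2^i\alpha^j\rangle} = a_{i,j}, \qquad 0 \le i \le m-1,\ 0 \le j \le k-1, \tag{5}$$
where $\langle 2^i\alpha^j\rangle$ denotes $2^i\alpha^j$ reduced modulo $n$. In this way, we create a one-to-one correspondence between the $\mathcal{I}_3$-based and $\mathcal{I}_4$-based coordinates.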
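Conversions between the normal basis and the RB can thus be divided into three steps:
(a) between the normal basis $\mathcal{I}_2$ and $\mathcal{I}_3$;
(b) between $\mathcal{I}_3$ and $\mathcal{I}_4$;
(c) between $\mathcal{I}_4$ and the RB $\mathcal{I}_1$.
Step (b) has been solved in (5). It can be implemented as a rewiring of lines and has almost no cost in hardware.
Step (c) is also simple. Since $\mathcal{I}_1 = \{1\} \cup \mathcal{I}_4$, an element $A = \sum_{l=1}^{mk} a'_l\beta^l$ has the RB coordinates $(0, a'_1, \ldots, a'_{mk})$. Conversely, if the RB coordinates $c_0, c_1, \ldots, c_{n-1}$ are given, then, because $1 + \beta + \cdots + \beta^{n-1} = 0$ in characteristic two,
$$\sum_{l=0}^{n-1} c_l\beta^l = \sum_{l=1}^{mk} (c_l + c_0)\,\beta^l,$$
which is the $\mathcal{I}_4$ representation.
In Step (a), the conversion from the normal basis $\mathcal{I}_2$ to the intermediate basis $\mathcal{I}_3$ can be given as follows. If $A = \sum_{i=0}^{m-1} a_i\gamma^{2^i}$, then
$$A = \sum_{i=0}^{m-1}\sum_{j=0}^{k-1} a_i\,\beta^{2^i\alpha^j}, \quad\text{i.e.,}\quad a_{i,j} = a_i,\ 0 \le j \le k-1. \tag{8}$$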
The reverse conversion, however, is much more complicated. Note that it is not possible to convert every redundant representation, since some of them may not represent an element of the field $\mathbb{F}_{2^m}$. Two tasks have to be performed in this step: one is to identify the representation of a field element with respect to $\mathcal{I}_3$, and the other is to convert the identified representation back to the normal basis.
For the purposes of this paper, which deals with finite field multiplication, it suffices to identify the product of two field elements in $\mathcal{I}_3$ and then convert it back to the normal basis. Suppose that the coordinates … A proof of this lemma is given in Appendix A.
The lemma allows us to identify the $\mathcal{I}_3$ representation of the product of two field elements that are also represented in $\mathcal{I}_3$. Once the product is obtained in the $\mathcal{I}_3$ basis, it can be converted to the corresponding normal basis representation using (10).
Thus, Step (a) of basis conversion can be realized with (8) and (10).
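In the forward direction, steps (a) to (c) amount to copying each normal basis coordinate to all $k$ exponents $2^i\alpha^j \bmod n$ and leaving the $\beta^0$ coordinate at zero. A Python sketch, assuming $n = mk + 1$ is prime and using hypothetical function names, together with the reverse of step (c):

```python
def gp_nb_to_rb(a_nb, m, k):
    """NB -> RB for a GP of type (m, k), n = mk + 1 prime (sketch).

    Step (a): copy each coordinate a_i to the k positions (i, j).
    Step (b): position (i, j) goes to exponent 2^i * alpha^j mod n,
              i.e. to every element of the index set 2^i * K.
    Step (c): prepend the coefficient of beta^0, which is 0.
    """
    n = m * k + 1
    K = [x for x in range(1, n) if pow(x, k, n) == 1]  # {alpha^j : 0 <= j < k}
    rb = [0] * n
    for i, bit in enumerate(a_nb):
        for a in K:
            rb[pow(2, i, n) * a % n] = bit
    return rb


def rb_clear_constant(c):
    """Reverse of step (c): since 1 + beta + ... + beta^(n-1) = 0, complement all
    coordinates if the beta^0 coordinate is 1; return the I_4 part (beta^1..beta^(n-1))."""
    if c[0]:
        c = [x ^ 1 for x in c]
    return c[1:]


rb = gp_nb_to_rb([1, 0, 1, 1, 0], m=5, k=2)  # type II example, n = 11
print(rb)  # [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
```

For even $k$, $-1 \in \mathcal{K}$, so the resulting vector is symmetric ($a_l = a_{n-l}$), as the example output shows.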
Further Results on Redundant Basis
Lemma 2 Let $A \in \mathbb{F}_{2^m}$, and let the $\mathcal{I}_4$ basis representation of $A$ be obtained from its normal basis representation by using (8) and (5), and let it be …
Proof:
Let $\alpha$ be a primitive $k$th root of unity. Then …
We can obtain its $\mathcal{I}_4$ representation as follows:
It can be seen that in the $\mathcal{I}_4$ basis representation the first $(n-1)/2$ coordinates are a mirror reflection of the last $(n-1)/2$ coordinates. The corresponding redundant basis representation of $A$ is obtained simply by including a "0" before the first coordinate of the $\mathcal{I}_4$ representation.
Also note that only eight consecutive coordinates of the redundant representation, which include all seven coordinates with respect to the normal basis, are needed to determine the element $A$. This fact can be exploited when a multiplication operation using redundant representations is implemented. Table 2 shows, for the fields in Table 1 that can be generated with a Gauß period of type $(m,k)$, the minimal number of consecutive coordinates of the redundant representation needed to determine an element.
This property will be used later to obtain efficient architectures for finite field multipliers.
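The minimal number of consecutive coordinates can be computed by brute force. This Python sketch (hypothetical name `min_window`) assumes the RB vector was obtained from the normal basis, so its $\beta^0$ coordinate is 0 and one copy from each index set $2^i\mathcal{K}$ suffices to determine the element; treating windows cyclically modulo $n$ is also an assumption.

```python
def min_window(m, k):
    """Smallest w such that some w cyclically consecutive RB coordinates
    (indices mod n) include at least one index from every set 2^i * K."""
    n = m * k + 1
    K = [x for x in range(1, n) if pow(x, k, n) == 1]
    coset_of = [-1] * n  # index 0 (the beta^0 coordinate) belongs to no set
    for i in range(m):
        for a in K:
            coset_of[pow(2, i, n) * a % n] = i
    for w in range(m, n + 1):
        for start in range(n):
            hit = {coset_of[(start + t) % n] for t in range(w)} - {-1}
            if len(hit) == m:
                return w
    return None


print(min_window(4, 1), min_window(5, 2), min_window(7, 4))  # 4 5 8
```

For $(m,k) = (7,4)$, i.e. $n = 29$, the sketch finds a window of eight consecutive coordinates.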
ARCHITECTURES FOR RB MULTIPLICATION
In this section we present architectures for hardware implementation of the multiplication $C = A \cdot B$ based on (3), where $A$ and $B$ are represented with respect to the redundant basis $\mathcal{I}_1$. Conversion between $\mathcal{I}_1$ and the normal basis $\mathcal{I}_2$, as discussed in Subsection 3.3, can be performed without any logic gates.
Bit-Serial Multipliers
Parallel-in serial-out version An architecture for a parallel-in serial-out (PISO) multiplier is shown in Fig. 1. The $n$-bit register, which is initially loaded with $B$, is cyclically shifted once per clock. The contents of this register are bit-wise multiplied with the coordinates of $A$, and the resulting $n$ bits are added using $n-1$ modulo-two adders (arranged in a binary tree for minimum delay). In a straightforward implementation, this PISO multiplier requires $n$ flip-flops, $n$ AND gates and $n-1$ XOR gates, and a multiplication is completed in $n$ clock cycles.
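A cycle-level Python model of the PISO multiplier may help. The register loading order $b_0, b_{n-1}, \ldots, b_1$ is an assumption, chosen so that clock $i$ yields the coordinate $c_i$ of (3); the function name is hypothetical.

```python
def piso_multiply(a, b):
    """Cycle-level model of the PISO multiplier (bits as lists of length n).

    The n-bit register holds B in the assumed order b_0, b_(n-1), ..., b_1 and
    rotates by one position per clock, so in clock i position j holds b_<i-j>.
    n AND gates form a_j & reg[j]; an XOR tree sums them into c_i.
    """
    n = len(a)
    reg = [b[(-j) % n] for j in range(n)]
    c = []
    for _ in range(n):                             # n clock cycles
        terms = [a[j] & reg[j] for j in range(n)]  # n AND gates
        bit = 0
        for t in terms:                            # XOR tree of n - 1 gates in hardware
            bit ^= t
        c.append(bit)                              # one output bit per clock
        reg = [reg[-1]] + reg[:-1]                 # cyclic shift by one position
    return c


print(piso_multiply([0, 1, 0, 0, 0], [0, 0, 0, 0, 1]))  # beta * beta^4 = 1 -> [1, 0, 0, 0, 0]
```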
The PISO multiplier architecture shown in Fig. 1 …

It is also possible to reduce the number of clock cycles needed by the PISO multiplier. To this end, suppose the order of the input bits can be changed so that, in the first clock cycles, the multiplier generates a run of consecutive coordinates of $C$ that contains at least one copy of each normal basis coordinate $c_i$, $i = 0, 1, \ldots, m-1$. Then the computation time is reduced from $n$ ($= mk + 1$) clock cycles to the length of that run, i.e., the minimal number of consecutive coordinates given in Table 2, which can be considerably lower than $n$.

Serial-in parallel-out version A serial-in parallel-out (SIPO) multiplier, which is capable of running at a very high clock rate, is shown in Fig. 2, where the element $A$ is stored in a cyclic shift register and the element $B$ is shifted in bit-serially. Each of the $n$ accumulator units consists of a mod-2 adder and a flip-flop. These flip-flops are initialized to zero and contain the product $C$ after $n$ clock cycles.

Table 3 compares the two multipliers presented here with the parallel-in serial-out polynomial ring multiplier proposed in [4].

Table 3: Comparison of bit-serial multipliers using polynomial ring basis and redundant representation.
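Similarly, a cycle-level sketch of the SIPO multiplier, assuming $b_0$ is shifted in first (the function name is hypothetical). Rotating the register once corresponds to multiplying $A$ by $\beta$, so the accumulators build up $C = \sum_j b_j\,\beta^j A$.

```python
def sipo_multiply(a, b):
    """Cycle-level model of the SIPO multiplier.

    A sits in a cyclic shift register; the bits b_0, b_1, ... enter one per clock.
    In clock j the register holds A * beta^j (A rotated by j positions), and each
    of the n accumulators XORs in (b_j AND its register bit).
    """
    n = len(a)
    reg = list(a)
    acc = [0] * n                              # accumulator flip-flops, initialized to 0
    for j in range(n):                         # n clock cycles
        acc = [acc[l] ^ (b[j] & reg[l]) for l in range(n)]
        reg = [reg[-1]] + reg[:-1]             # multiply by beta: cyclic shift
    return acc


print(sipo_multiply([0, 1, 0, 0, 0], [0, 0, 0, 0, 1]))  # beta * beta^4 = 1 -> [1, 0, 0, 0, 0]
```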
Constant multiplier
For an implementation of the multiplication operation, if one of the inputs (i.e., either $A$ or $B$) is known or fixed, the multiplier is called a constant multiplier. In the past, efficient architectures for such constant multipliers were proposed using polynomial bases and their dual bases. When normal bases are used, however, constant multipliers are not as efficient.
This is mainly because most normal basis multipliers require that both $A$ and $B$ be shifted in …
Proof: The theorem follows by noting that … can be written as …
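The key point for constant multiplication is that the constant's redundant representation can be chosen freely. The Python sketch below assumes the type I setting ($n = m + 1$). A vector and its complement represent the same element, so the constant multiplier only needs one cyclically shifted copy of $B$ (pure wiring) per nonzero coordinate of the lighter representation; for $n = m + 1$ that weight is at most $m/2$, the bound stated in the concluding remarks. The names are hypothetical, and this illustrates the idea rather than reproducing the paper's theorem or proof.

```python
def min_weight_rep(c):
    """c and its complement represent the same element, because
    1 + beta + ... + beta^(n-1) = 0 in characteristic 2; keep the lighter one."""
    comp = [x ^ 1 for x in c]
    return c if sum(c) <= sum(comp) else comp


def constant_multiply(c_const, b):
    """C = c_const * B as an XOR of cyclic shifts of B: one shifted copy of B
    (beta^j * B) per nonzero coordinate j of the minimum-weight representation.
    Returns the product and the weight used."""
    n = len(b)
    rep = min_weight_rep(c_const)
    out = [0] * n
    for j, bit in enumerate(rep):
        if bit:
            shifted = [b[(l - j) % n] for l in range(n)]  # beta^j * B: only rewiring
            out = [x ^ y for x, y in zip(out, shifted)]
    return out, sum(rep)


print(constant_multiply([0, 1, 1, 1, 0], [0, 1, 0, 0, 0]))  # ([1, 1, 0, 0, 0], 2)
```

Here $\beta + \beta^2 + \beta^3 = 1 + \beta^4$, so the weight-2 representation is used and the product with $\beta$ is $\beta + 1$.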
Parallel Architectures
Full parallel version Since the architecture shown in Fig. 1 operates in a parallel-in serial-out fashion, it can be easily parallelized. Fig. 3 shows the circuit (module $\mathcal{M}$) that generates one coefficient of the product $C$. The inputs to module $\mathcal{M}$ are $A$ and an $i$-fold cyclically shifted version of $B$. Clearly, a full bit-parallel multiplier can be obtained by using $n$ such $\mathcal{M}$ modules.
The circuits for module $\mathcal{M}$ can be optimized to save AND gates in the same way as discussed for the PISO multiplier. The number of modules can also be reduced to $m$: since it is sufficient to generate only the $m$ coordinates that correspond to the normal basis, there are only $m$ such $\mathcal{M}$ modules, each requiring $m$ AND and $n-1$ XOR gates. Hence the total number of gates for the bit-parallel multiplier is …

… since it might be difficult to implement a full-scale bit-parallel multiplier when the field is very large. Fig. 4 shows the architecture of a hybrid multiplier using only two $\mathcal{M}$ modules. There are two shift registers, $\mathcal{R}_1$ and $\mathcal{R}_2$. Register $\mathcal{R}_1$ is $(n+1)/2$ bits long and initially loaded with …; register $\mathcal{R}_2$ is $(n-1)/2$ bits long and initially loaded with … The interlacing module combines the outputs of the two registers into one stream: its first bit is the first bit of $\mathcal{R}_1$, the second bit is the first bit of $\mathcal{R}_2$, the third bit is the second bit of $\mathcal{R}_1$, the fourth bit is the second bit of $\mathcal{R}_2$, and so on.
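The area-time trade-off of the hybrid design can be sketched behaviorally: with $p$ copies of module $\mathcal{M}$, $p$ coordinates of $C$ are produced per clock. This Python model (hypothetical names) abstracts away the registers $\mathcal{R}_1$, $\mathcal{R}_2$ and the interlacing module; it only shows that two modules finish in $\lceil n/2 \rceil$ clocks instead of $n$.

```python
def hybrid_multiply(a, b, modules=2):
    """Behavioral model of a hybrid (partial-parallel) multiplier.

    Each module M computes one coordinate c_i = sum_j a_j b_<i-j> (AND gates
    plus an XOR tree). With `modules` copies, `modules` coordinates are
    produced per clock, so ceil(n / modules) clocks are needed.
    """
    n = len(a)

    def module_m(i):  # combinational module M
        bit = 0
        for j in range(n):
            bit ^= a[j] & b[(i - j) % n]
        return bit

    c, clocks = [0] * n, 0
    for start in range(0, n, modules):  # one clock per group of coordinates
        for i in range(start, min(start + modules, n)):
            c[i] = module_m(i)
        clocks += 1
    return c, clocks


print(hybrid_multiply([0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1]))  # ([1, 0, 0, 0, 0, 0, 0], 4)
```

Using more modules trades area for fewer clock cycles; `modules=n` degenerates into the full bit-parallel multiplier.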
ARCHITECTURE FOR TYPE-II ONB MULTIPLIER
In this section we deal with type-II ONB. Extending the work of Gao and Vanstone [6] , we present several bit-serial and bit-parallel multiplier architectures.
Algorithm
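Below we consider in more detail Remark 2 given in Section 2. Let $n = 2m+1$, let $\beta$ be a primitive $n$th root of unity, and let $\gamma_i = \beta^i + \beta^{-i}$, so that $\gamma_i = \gamma_{n-i}$ and $\gamma_0 = 0$. For $A = \sum_{j=1}^{m} a_j\gamma_j$, define $s(0) = 0$ and $s(j) = s(n-j) = a_j$ for $1 \le j \le m$. Then $A = \sum_{j=0}^{2m} s(j)\beta^j$ is a redundant representation of size $2m+1$, and for $1 \le i \le m$
$$\gamma_i \cdot A = \sum_{l=1}^{m}\big(s(\langle l+i\rangle) + s(\langle l-i\rangle)\big)\gamma_l,$$
where $\langle\cdot\rangle$ denotes reduction modulo $n$. This constant multiplication $\gamma_i \cdot A$ was proposed by Gao and Vanstone [6]. In order to obtain a general multiplier, let $B = (b_1, \ldots, b_m)$ be an element of $\mathbb{F}_{2^m}$ with respect to the basis $\{\gamma_1, \gamma_2, \ldots, \gamma_m\}$; then multiplication of $A$ and $B$ can proceed as follows:
$$A \cdot B = \sum_{i=1}^{m} b_i\,(\gamma_i \cdot A).$$
If the product is denoted as $C = (c_1, \ldots, c_m)$, also in the basis $\{\gamma_1, \gamma_2, \ldots, \gamma_m\}$, then
$$c_l = \sum_{i=1}^{m} b_i\big(s(\langle l+i\rangle) + s(\langle l-i\rangle)\big), \qquad 1 \le l \le m.$$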
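Note that $\gamma_1 = \beta + \beta^{-1}$ also generates a normal basis $\{\gamma_1, \gamma_1^2, \gamma_1^{2^2}, \ldots, \gamma_1^{2^{m-1}}\}$.
From $s(i)$ and expression (12), it can be seen that the basis $\{\gamma_1, \gamma_2, \ldots, \gamma_m\}$ is a permutation of the above normal basis; indeed, $\gamma_1^{2^t} = \gamma_{2^t \bmod n}$ and $\gamma_i = \gamma_{n-i}$. Thus, in hardware, a squaring operation using the basis $\{\gamma_1, \gamma_2, \ldots, \gamma_m\}$ costs nothing but a rearrangement of wires.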
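A Python sketch of the type-II multiplication, assuming $n = 2m+1$, $\gamma_i = \beta^i + \beta^{-i}$, and the symmetric sequence $s(0) = 0$, $s(j) = s(n-j) = a_j$ ($1 \le j \le m$) held in the register; these definitions and the function names are our assumptions. The product coordinate is computed as $c_l = \sum_{i=1}^{m} b_i\,(s(l+i) + s(l-i))$, which follows from $\gamma_i A = \sum_l \big(s(l+i) + s(l-i)\big)\beta^l$ and the symmetry of $s$, and squaring is the index map $i \mapsto \pm 2i \bmod n$ folded into $1, \ldots, m$.

```python
def type2_multiply(a, b):
    """Multiply A, B given as (a_1..a_m), (b_1..b_m) w.r.t. {gamma_1..gamma_m},
    gamma_i = beta^i + beta^-i, n = 2m + 1 (sketch)."""
    m = len(a)
    n = 2 * m + 1
    s = [0] * n                       # s(0) = 0, s(j) = s(n - j) = a_j
    for j in range(1, m + 1):
        s[j] = s[n - j] = a[j - 1]
    c = []
    for l in range(1, m + 1):
        bit = 0
        for i in range(1, m + 1):     # m XORs, m ANDs, then an (m-1)-gate XOR tree
            bit ^= b[i - 1] & (s[(l + i) % n] ^ s[(l - i) % n])
        c.append(bit)
    return c


def type2_square(a):
    """Squaring is a permutation: gamma_i^2 = gamma_<2i>, and gamma_j = gamma_(n-j)."""
    m = len(a)
    n = 2 * m + 1
    c = [0] * m
    for i in range(1, m + 1):
        j = (2 * i) % n
        j = min(j, n - j)             # fold the index into 1..m
        c[j - 1] = a[i - 1]
    return c


print(type2_multiply([1, 1, 1, 1, 1], [1, 0, 1, 1, 0]))  # [1, 0, 1, 1, 0]
```

Since $\gamma_1 + \cdots + \gamma_m = \sum_{l=1}^{2m}\beta^l = 1$, multiplying by the all-ones vector returns the other operand, as in the example.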
Architectures
Parallel-in serial-out multiplier An architecture to implement this multiplication is shown in Figure 5. A $(2m+1)$-bit register, which is divided into two parts (left and right) and is shifted cyclically, stores $s(i)$, $0 \le i \le 2m$. A total of $m$ AND gates and $m$ XOR gates are used to generate $m$ terms. Finally, another $m-1$ XOR gates, arranged as a binary tree, take these $m$ terms as inputs and produce one coordinate of $C$. In each clock cycle, the register is shifted once and one coordinate of $C$ is generated at the output port. A multiplication is completed in $m$ clock cycles.
The size complexity of the multiplier in Fig. 5 is $m$ AND gates and $2m-1$ XOR gates, along with a $(2m+1)$-bit shift register. The delay in the critical path is $T_A + (1 + \lceil \log_2 m \rceil)T_X$, where $T_A$ and $T_X$ denote the delays of an AND gate and an XOR gate, respectively. A comparison of the proposed multiplier with some other similar bit-serial normal basis multipliers is shown in Table 4. As can be seen, except for the multiplier of Geiselmann and Gollmann [9], the proposed multiplier has overall space and time complexities better than those of any other multiplier listed in the table. The multiplier of [9] requires about $m/2$ fewer XOR gates; however, the proposed multiplier has a highly regular structure, which makes it attractive for hardware implementation over very large fields.
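A cycle-level sketch of the Fig. 5 style PISO multiplier follows, with $s(0) = 0$ and $s(j) = s(2m+1-j) = a_j$ assumed as the register contents. Loading the register rotated by one position is also an assumption, made so that in clock $l$ the taps at offsets $\pm i$ read $s(l \pm i)$. Per clock it uses $m$ XOR gates, $m$ AND gates and an $(m-1)$-gate XOR tree, i.e., the $2m-1$ XOR gates quoted above; the function name is hypothetical.

```python
def type2_piso(a, b):
    """Cycle-level model of a Fig. 5 style type-II PISO multiplier (sketch)."""
    m = len(a)
    n = 2 * m + 1
    s = [0] * n                                  # s(0) = 0, s(j) = s(n - j) = a_j
    for j in range(1, m + 1):
        s[j] = s[n - j] = a[j - 1]
    reg = s[1:] + s[:1]                          # loaded rotated once: reg[p] = s(p + 1)
    c = []
    for _ in range(m):                           # one output bit per clock, m clocks
        pairs = [reg[i] ^ reg[-i] for i in range(1, m + 1)]  # m XOR gates: s(l+i) + s(l-i)
        terms = [b[i] & pairs[i] for i in range(m)]          # m AND gates
        bit = 0
        for t in terms:                          # XOR tree of m - 1 gates in hardware
            bit ^= t
        c.append(bit)                            # coordinate c_l, l = 1, 2, ..., m
        reg = reg[1:] + reg[:1]                  # cyclic shift by one position
    return c


print(type2_piso([1, 1, 1, 1, 1], [1, 0, 1, 1, 0]))  # all-ones is 1 -> [1, 0, 1, 1, 0]
```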
| Multipliers | #AND | #XOR | #flip-flops | #clk cycles | basis |
|---|---|---|---|---|---|
| Massey-Omura [16] | $2m-1$ | $2m-2$ | $2m$ | $m$ | normal |
| Feng [5] | $2m-1$ | $3m-2$ | $3m-2$ | | |
in $m$ flip-flops. After $m$ clock cycles, the contents of the $m$ flip-flops at the top are the coordinates of the product $C$.
Compared to the multiplier shown in Fig. 5, the high-speed version has a higher complexity. Besides $m$ AND gates, $2m$ XOR gates, and a cyclic shift register of length $2m+1$, it also needs $m$ flip-flops. Its critical path has a delay of $T_A + 2T_X$, which is, however, much shorter than that of the multiplier shown in Fig. 5.
It can be seen from Fig. 7 that module $\mathcal{M}'$ is purely combinational and similar to module $\mathcal{M}$ in Fig. 3. In Fig. 7, two copies of module $\mathcal{M}'$ are used, and each of them generates one product coordinate at a time. The cyclic shift module enables a cyclic shift of the $2m+1$ coefficients $s(0), s(1), \ldots, s(2m)$ and costs no gates or registers. When $m$ is even, the $(2m+1)$-bit register is initially loaded with
When $m$ is odd, the order of the $2m+1$ bits initially loaded into the register is
CONCLUDING REMARKS
In this paper, we have considered multiplication in $\mathbb{F}_{2^m}$ using a redundant representation. The basic idea behind the multiplication is to embed the field $\mathbb{F}_{2^m}$ into the smallest cyclotomic field that contains it. We have also shown that the redundant representation can be used to obtain efficient bit-serial, bit-parallel, and hybrid multiplier structures. Additionally, we have discussed how to reduce the time and space complexities of these multipliers using properties of the redundant representation.
The conversions from the redundant representation to the corresponding normal basis and vice versa have been given. We have shown that these conversions can be implemented in hardware without any logic gates.
When there is a type I ONB in $\mathbb{F}_{2^m}$, it follows from our discussion in Section 4 that the minimal representation of a constant field element always has a Hamming weight not greater than $m/2$. Consequently, the proposed constant multiplier has very low complexity. When there exists a type II ONB, a very simple and highly regular multiplier architecture can be obtained using the redundant representation (refer to Section 5). It has been shown that such multipliers have lower or equivalent complexity compared to most of the previously proposed similar multipliers.
Hybrid or partial parallel architectures have also been presented for this type of ONBs.
One question arising from the work presented here remains: Can the modified redundant representation multiplier described in Section 5 be generalized to any field $\mathbb{F}_{2^m}$?
