Abstract-Elements of a finite field, $GF(2^m)$, are represented as elements in a ring in which multiplication is more time efficient. This leads to faster multipliers with a modest increase in the number of XOR and AND gates needed to construct the multiplier. Such multipliers are used in error control coding and cryptography. We consider rings modulo trinomials and 4-term polynomials. In each case, we show that our multiplier is faster than multipliers over elements in a finite field defined by irreducible pentanomials. These results are especially significant in the field of elliptic curve cryptography, where pentanomials are used to define finite fields. Finally, an efficient systolic implementation of a multiplier for elements in a ring defined by $x^n + x + 1$ is presented.
INTRODUCTION
Finite fields play an important role in coding theory and public-key cryptography [1]. Coding theory has applications in error-free communications and data storage. Public-key cryptography has applications in smart-card technology, e-commerce security, and internet security [2], [3], [4], [5]. Both coding theory and public-key applications use algorithms based on arithmetic of elements over an extension of the field $GF(2)$, denoted by $GF(2^k)$. Public-key cryptographic algorithms [6], [7] usually require very long words (greater than 160 bits). This leads to low-performance cryptographic algorithms, making them impractical for commercial applications. To overcome this deficiency, fast hardware architectures for performing arithmetic in Galois fields $GF(2^k)$ need to be designed. The arithmetic operations normally performed are addition, multiplication, squaring, exponentiation, and inversion. This paper focuses on the multiplication of elements in a Galois field $GF(2^k)$. Finite field multipliers can be expensive, both in terms of gate count and delay.
Multipliers can be classified into bit-parallel and bit-serial multipliers [8], [9], [10], [11], [12]. Bit-parallel multipliers compute the result in one clock cycle but generally have an area requirement proportional to $k^2$, where the elements that are multiplied belong to $GF(2^k)$. Bit-serial multipliers compute the result in $k$ clock cycles, but have an area requirement proportional to $k$. Multipliers can also be classified based on whether the finite field elements are defined by a normal basis or a canonical basis. One advantage of the normal basis representation of field elements is that the squaring of an element can be computed by a cyclic shift of the binary representation. However, canonical bases are widely used and lead to efficient implementations of multipliers. The time and space complexities of bit-parallel canonical basis multipliers are much better than those of multipliers based on the normal basis. In this paper, we present a new ring representation of field elements to design new bit-parallel, canonical basis multipliers that are better in time complexity and almost as good in space complexity compared to existing multipliers. The elements of a field are mapped to elements in a ring, which requires extra bits in the representation of the field elements. This results in a modest increase in space complexity. However, it also results in faster multipliers, as the ring elements are defined using simpler polynomials over the finite field $GF(2)$.
The elements of a finite field are represented as binary words according to one of many traditionally available representations. We propose a new representation of the elements of a finite field inspired by [11]. In some cases, this representation leads to faster multipliers compared to other representations. A finite field, $F_{2^m}$, is defined by an irreducible polynomial, $f(x)$, of degree $m$, and each element in the field can be represented as an $m$-bit binary number or a polynomial of degree $(m - 1)$. Multiplication of two elements in a finite field implies multiplication of the two polynomials that represent the finite-field elements and then taking the modulus of this product with respect to $f(x)$. The complexity of multiplication depends on the number of terms in the polynomial $f(x)$: the fewer the terms, the lower the complexity. A normal basis of $F_{2^m}$ has the form $\{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$ for some element $\beta$ in $F_{2^m}$. Representing field elements with a normal basis leads to an efficient implementation of squaring (squaring is a circular shift of the binary representation of a field element). However, parallel normal basis multipliers are not as efficient as multipliers using the canonical basis.
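The multiply-then-reduce step described above can be sketched in a few lines of Python. This is only an illustration, not the hardware multiplier discussed in this paper: polynomials over $GF(2)$ are stored as integers whose bit $i$ is the coefficient of $x^i$, and the function name `gf2_poly_mulmod` is a hypothetical helper introduced here.

```python
def gf2_poly_mulmod(a: int, b: int, f: int) -> int:
    """Multiply two GF(2) polynomials and reduce the product modulo f.

    Polynomials are Python ints; bit i holds the coefficient of x^i.
    """
    # Carry-less (XOR) schoolbook multiplication.
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    # Reduction: cancel the leading term with a shifted copy of f.
    deg_f = f.bit_length() - 1
    while prod.bit_length() - 1 >= deg_f:
        prod ^= f << (prod.bit_length() - 1 - deg_f)
    return prod


# Example in GF(2^3) defined by the irreducible trinomial x^3 + x + 1.
f = 0b1011
print(bin(gf2_poly_mulmod(0b011, 0b101, f)))  # (x + 1)(x^2 + 1) mod f
```

The reduction loop runs once per excess degree, which is why a modulus with few terms (fewer XORs per step) is cheaper in hardware.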
In [11], a new representation of elements of finite fields $F_{2^m}$ was proposed. This representation was based on the isomorphism from $F_{2^m}$ into the ring $F_2[x]/(x^n + 1)$ for some $n$, with $\gcd(n, 2) = 1$. The polynomial ring, $F_2[x]/(x^n + 1)$, contains an isomorphic copy of $F_{2^m}$ only if $(x^n + 1)$ has an irreducible factor $h(x)$ such that $\deg(h(x)) = m$. Let $(x^n + 1) = g(x)h(x)$. Then, $n = \deg(g(x)) + \deg(h(x)) = \deg(g(x)) + m$. Let the degree of $g(x)$ be $r$. Therefore, $n = r + m$. The number of bits in the new representation has now increased from $m$ to $n = m + r$. However, since the multiplication in the ring is performed modulo $(x^n + 1)$, it turns out that multiplication in the ring is faster than multiplication modulo some degree $m$ polynomial only if $r$ is small. A polynomial of the form
$$f(x) = x^m + x^{m-1} + \cdots + x + 1$$
is the all-1 polynomial and, when it is irreducible, the ring representation is faster than any other representation with a modest increase in the number of XOR and AND gates. However, irreducible all-1 polynomials of degree $m$ exist if and only if $(m + 1)$ is prime and 2 is a generator of the multiplicative group of $GF(m + 1)$. Whenever an irreducible all-1 polynomial cannot be found, the smallest $n$ such that $(x^n + 1)$ has an irreducible factor of degree $m$ must be found in order for the ring $F_2[x]/(x^n + 1)$ to have a copy of the field $F_{2^m}$. In such a case, $n$ may turn out to be much larger than $m$. For example, for $m = 9$, the minimum value of $n$ such that $(x^n + 1)$ has a factor of degree 9 is 73. This gives the field elements of $F_{2^9}$ a 73-bit representation in the ring $F_2[x]/(x^{73} + 1)$, thus making the multiplier that uses this representation very complex both in terms of time and circuit complexity. In this paper, we propose new rings in which we can find an isomorphic copy of $F_{2^m}$. The new rings are of the form
$$F_2[x]/(x^n + x^k + 1).$$
The multipliers that result are faster if $k$ and $n$ are as small as possible [16]. We use a well-known result (Wedderburn's Theorem) in algebra and the software Mathematica to help us find the required rings. These new rings are useful if the least number of terms in an irreducible polynomial is 5 and the ring $F_2[x]/(x^n + 1)$ does not work due to the reasons mentioned above. For example, consider $m = 27$. There is no irreducible trinomial of degree 27 and the least-term irreducible polynomial of degree 27 is a pentanomial. One such pentanomial is $(x^{27} + x^5 + x^2 + x + 1)$. However, the ring $F_2[x]/(x^{29} + x + 1)$ contains an isomorphic copy of the field $F_{2^{27}}$. The ring $F_2[x]/(x^n + 1)$, where $n = 73$, also contains an isomorphic copy of the field $F_{2^{27}}$. Multiplication in the ring $F_2[x]/(x^{29} + x + 1)$ is more efficient than multiplication in the ring $F_2[x]/(x^{73} + 1)$ because multiplication of 29-bit field elements modulo a trinomial is faster than multiplication of 73-bit elements modulo a binomial. Besides, the number of gates needed for 73-bit elements is too high.
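As a quick sanity check of the degree bookkeeping in the $m = 27$ example, the following sketch divides $x^{29} + x + 1$ by $x^2 + x + 1$ over $GF(2)$ and confirms that the remainder is zero, leaving a cofactor of degree 27. It does not test whether that cofactor is irreducible; the helper names (`deg`, `poly_mul`, `poly_divmod`) are assumptions introduced for this illustration.

```python
def deg(p: int) -> int:
    """Degree of a GF(2) polynomial stored as an int (deg(0) is -1)."""
    return p.bit_length() - 1


def poly_mul(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_divmod(a: int, b: int):
    """Quotient and remainder of a divided by a nonzero b over GF(2)."""
    q = 0
    while deg(a) >= deg(b):
        shift = deg(a) - deg(b)
        q ^= 1 << shift
        a ^= b << shift
    return q, a


f = (1 << 29) | 0b11   # x^29 + x + 1
g = 0b111              # x^2 + x + 1
q, r = poly_divmod(f, g)
print("remainder is zero:", r == 0)
print("degree of cofactor:", deg(q))
```

So the ring element needs only $29 - 27 = 2$ extra bits compared with a field element of $F_{2^{27}}$.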
The rest of the paper is organized as follows: Section 2 describes polynomial ring representations and Wedderburn's Theorem. Section 3 describes some new ring representations based on Wedderburn's Theorem that improve multiplication complexity. Section 4 describes and compares the complexity of the multiplier resulting from the new ring representations with other multipliers. Section 5 describes a systolic implementation of a multiplier for elements in the ring modulo $x^n + x + 1$ and Section 6 concludes the paper. The advantage of a systolic implementation is the ease of implementation in VLSI. The complexity of the systolic multiplier presented is the same as that of any other implementation of the multiplier.
POLYNOMIAL RING REPRESENTATION
We begin this section by stating Wedderburn's Theorem and showing how it can be used to obtain a new representation for field elements. The Wedderburn Theorem, in the form required for this paper, states that if $f(x)$ is a polynomial over the field $k$ with no repeated roots (that is, $\gcd(f(x), f'(x)) = 1$), then the ring $R = k[x]/f$ decomposes into a direct sum of ideals of $R$. Therefore,
$$R = I_1 \oplus I_2 \oplus \cdots \oplus I_r.$$
Each of these ideals is generated by a single element and is isomorphic to a finite field extension $K$ of $k$. A more general version of Wedderburn's Theorem can be found in [13]. The decomposition of $R = k[x]/f$ into ideals is effected by constructing a system of generators for these ideals. Such a system of generators $\{e_1, e_2, \ldots, e_r\}$ consists of a complete orthogonal system of idempotents. In $R$, one has $e_i^2 = e_i$, $e_i e_j = 0$ for $i \neq j$, $e_1 + e_2 + \cdots + e_r = 1$, and the ideals $I_i = Re_i$ are all isomorphic to fields. At this point, there is great advantage to working over fields of characteristic two. In characteristic two, the Frobenius map on $R$, the map
$$a \mapsto a^2,$$
is a linear transformation of $k$-vector spaces. As each of the ideals in the decomposition is stable under the Frobenius map, the complete orthogonal system of idempotents providing the decomposition is realized by the eigenvectors of the Frobenius. In general characteristic, computing the idempotents is more difficult [14]. Let $f(x)$ be a polynomial of degree $n$ over $GF(2)$ that can be factorized into $r$ distinct, irreducible factors as follows:
$$f(x) = f_1(x) f_2(x) \cdots f_r(x).$$
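Let the degree of $f_i(x)$ be $p_i$. Then Wedderburn's Theorem [13] states that the ring $F_2[x]/f(x)$ is a direct sum of the fields $F_2[x]/f_i(x)$. Therefore:
$$F_2[x]/f(x) \cong F_2[x]/f_1(x) \oplus F_2[x]/f_2(x) \oplus \cdots \oplus F_2[x]/f_r(x).$$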
Here, $F_2[x]/f(x)$ is a polynomial ring modulo $f(x)$ over $GF(2)$ and $F_2[x]/f_i(x) = F_{2^{p_i}}$ is the finite extension field over $GF(2)$ modulo $f_i(x)$. Wedderburn's Theorem also gives a procedure to map the ring elements to field elements. This is because each of the fields $F_2[x]/f_i(x) = F_{2^{p_i}}$ is generated by an idempotent, $e_i$, that belongs to the set of orthogonal idempotents. Therefore:
$$F_2[x]/f(x) = F_2[x]/f(x) * e_1 \oplus F_2[x]/f(x) * e_2 \oplus \cdots \oplus F_2[x]/f(x) * e_r.$$
Idempotents, $e_i$, can always be found by solving the linear system of equations $e_i^2 = e_i$. From these idempotents, we can choose a subset that is orthogonal. Therefore, each $F_2[x]/f(x) * e_i$ is isomorphic to $F_{2^{p_i}}$. Elements of the ring $F_2[x]/f(x)$ can be expressed as $r$-tuples $(a_1, a_2, \ldots, a_r)$, where each $a_i \in F_2[x]/f(x) * e_i$. To perform an add or multiply operation on two $r$-tuples, we simply perform the operation on their individual elements.
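As a small illustration of this decomposition, the sketch below takes $f(x) = x^3 + 1 = (x + 1)(x^2 + x + 1)$, finds the idempotents of $F_2[x]/f(x)$, and splits ring elements into tuples. Note the swapped-in technique: instead of solving the linear system $e^2 = e$, it simply tries all $2^n$ ring elements, which is only sensible for tiny $n$. All function names are hypothetical helpers.

```python
def mulmod(a: int, b: int, f: int) -> int:
    """Product of GF(2) polynomials a and b modulo f (ints as bit vectors)."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    deg_f = f.bit_length() - 1
    while prod.bit_length() - 1 >= deg_f:
        prod ^= f << (prod.bit_length() - 1 - deg_f)
    return prod


def idempotents(f: int, n: int):
    """All e in F_2[x]/f with e*e == e, found by exhaustive search."""
    return [e for e in range(1 << n) if mulmod(e, e, f) == e]


def primitive_idempotents(f: int, n: int):
    """Nonzero idempotents that do not absorb any other nonzero idempotent."""
    nonzero = [e for e in idempotents(f, n) if e]
    return [e for e in nonzero
            if not any(g != e and mulmod(e, g, f) == g for g in nonzero)]


def components(a: int, f: int, prims):
    """Tuple (a*e_1, ..., a*e_r): the image of a in each ideal."""
    return tuple(mulmod(a, e, f) for e in prims)


F, N = 0b1001, 3                      # f(x) = x^3 + 1
prims = primitive_idempotents(F, N)   # orthogonal idempotents
print([bin(e) for e in prims])
print([bin(c) for c in components(0b100, F, prims)])  # tuple for x^2
```

Adding or multiplying two ring elements then amounts to adding or multiplying their tuples entry by entry, and XOR-ing the entries of a tuple recovers the ring element.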
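We now consider an example that illustrates the above facts. Let $f(x) = x^3 + 1$. Since
$$x^3 + 1 = (x + 1)(x^2 + x + 1),$$
this implies that the following is true from Wedderburn's Theorem:
$$F_2[x]/(x^3 + 1) \cong F_2[x]/(x + 1) \oplus F_2[x]/(x^2 + x + 1).$$
Therefore, an isomorphic copy of $F_2[x]/(x + 1)$ and of $F_2[x]/(x^2 + x + 1)$ can be found in the ring $F_2[x]/(x^3 + 1)$. The sum of two ring elements written as pairs is computed componentwise: $(a_1, a_2) + (b_1, b_2) = (a_1 + b_1, a_2 + b_2)$.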
The product of two elements can be defined in a similar manner.
For example, from Table 1, we see that $x^2 \in F_2[x]/(x^3 + 1)$ can be written as $((x + 1), (x^2 + x + 1))$ and $1 \in F_2[x]/(x^3 + 1)$ can be written as $((x^2 + x), (x^2 + x + 1))$, and the sum of 1 and $x^2$ can be written as
$$((x^2 + x) + (x + 1), (x^2 + x + 1) + (x^2 + x + 1)) = ((x^2 + 1), 0).$$
In [11], multiplication in a field $F_{2^m}$ was performed by finding a ring $F_2[x]/(x^n + 1)$ which contains an isomorphic copy of $F_{2^m}$. In other words, $f(x)$ is chosen to be $(x^n + 1)$ and one of its irreducible factors must be of degree $m$. Therefore, if $(x^n + 1)$ can be factorized as $(x^n + 1) = f_1(x) f_2(x) \cdots f_r(x)$ and $\deg(f_r(x)) = m$, then
$$F_2[x]/(x^n + 1) \cong F_2[x]/f_1(x) \oplus F_2[x]/f_2(x) \oplus \cdots \oplus F_2[x]/f_r(x).$$
Note that $\deg(f_i(x)) = p_i$ and $p_r = m$. This implies that multiplication in $F_2[x]/f_r(x)$ can be performed in the ring $F_2[x]/(x^n + 1)$. The method in [11] relies on the fact that finding the modulus with respect to $(x^n + 1)$ is computationally more efficient than finding the modulus with respect to $f_r(x)$. However,
$$n = p_1 + p_2 + \cdots + p_r.$$
Therefore, a ring element has $p_1 + p_2 + \cdots + p_{r-1}$ more bits than a field element of $F_{2^m}$, which has $p_r = m$ bits. If $p_1 + p_2 + \cdots + p_{r-1}$ is large, then multiplication in the ring is expensive (too many gates) and slow, thus nullifying the advantage of computing the modulus with respect to a simple polynomial $(x^n + 1)$. For example, if $m = 19$, then the smallest suitable $n$ is $2^{19} - 1 = 524{,}287$, so $p_1 + p_2 + \cdots + p_{r-1} = 524{,}268$ and, thus, the technique in [11] fails to be effective. The method in [11] is effective only when $f(x) = f_1(x) f_2(x)$, $f_1(x) = (x + 1)$, and $\deg(f_2(x)) = m$. This implies that $f_2(x)$ is an all-one polynomial (AOP), and irreducible AOPs are few in number: an AOP of degree $n$ is irreducible if and only if $n + 1$ is prime and 2 is a generator of the multiplicative group of $GF(n + 1)$. In the next section, we suggest new rings that overcome this problem.
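The size of $n$ needed by the $(x^n + 1)$ approach can be explored numerically. The sketch below relies on a standard fact about cyclotomic factors over $GF(2)$: for odd $n$, $x^n + 1$ has an irreducible factor of degree $m$ exactly when some divisor $d$ of $n$ has multiplicative order of 2 modulo $d$ equal to $m$, so the smallest usable $n$ is the smallest odd $n$ for which 2 has order $m$. The function names are hypothetical, and $m \ge 2$ is assumed.

```python
def prime_divisors(m: int):
    """Distinct prime divisors of m, by trial division."""
    primes, p = [], 2
    while p * p <= m:
        if m % p == 0:
            primes.append(p)
            while m % p == 0:
                m //= p
        p += 1
    if m > 1:
        primes.append(m)
    return primes


def two_has_order(n: int, m: int) -> bool:
    """True if the multiplicative order of 2 modulo odd n is exactly m."""
    if pow(2, m, n) != 1:
        return False
    return all(pow(2, m // p, n) != 1 for p in prime_divisors(m))


def smallest_binomial_ring(m: int) -> int:
    """Smallest odd n such that x^n + 1 has an irreducible factor of degree m.

    Assumes m >= 2; the search always stops at or before n = 2^m - 1.
    """
    n = 3
    while not two_has_order(n, m):
        n += 2
    return n


for m in (4, 9, 19):
    n = smallest_binomial_ring(m)
    print(f"m = {m}: n = {n}, extra bits = {n - m}")
```

For $m = 4$ the search returns $n = m + 1$, which is the irreducible-AOP case; for $m = 9$ and $m = 19$ it reproduces the values 73 and 524,287 quoted in the text.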
NEW RING REPRESENTATION
In cryptographic applications, fields based on low weight irreducible polynomials are desired. The weight of a polynomial is the number of terms in it. For example, the weight of $(x^7 + x^3 + x + 1)$ is 4. About half the irreducible, minimum weight polynomials of degree less than 10,000 
Note that, in each of the above equations, $f_1(x)$ is different. Therefore, the procedure to find a ring which contains an isomorphic copy of a field, $F_{2^m}$, in which multiplication must be done, is as follows. Assume that no irreducible trinomial of degree $m$ exists.
1. If an irreducible AOP of degree $m$ exists, then perform the multiplication in $F_2[x]/(x^n + 1)$, where $n = m + 1$.
2. If no ring $F_2[x]/(x^n + 1)$ can be found such that $n = (m + 1)$, then find a ring
$$F_2[x]/(x^n + x^k + 1) \cong F_2[x]/f_1(x) \oplus F_2[x]/f_2(x)$$
for some irreducible $f_2(x)$ of degree $m$. Note that $n$ should be as small as possible.
3. Find a ring
$$F_2[x]/(x^n + x^{k_1} + x^{k_2} + 1) \cong F_2[x]/f_1(x) \oplus F_2[x]/f_2(x)$$
for some irreducible $f_2(x)$ of degree $m$.
4. Perform multiplication of elements in a field, $F_{2^m}$, defined by a pentanomial. Note that, up to degree $m = 10{,}000$, there is either a trinomial or a pentanomial that is irreducible.
5. Choose one of the representations from Steps 2, 3, or 4 above for which the multiplier complexity is minimal.
We now consider the structure of irreducible polynomials that will help us find the rings mentioned above.
The proof of the above propositions follows from the structure of the polynomials being considered. Note that it is much easier to find polynomials that satisfy Proposition 2 than those that satisfy Proposition 1. Tables 2 and 3 give some polynomials that satisfy the two propositions above. These tables also contain the percentage increase in the AND and XOR gates for a multiplier implemented using each polynomial as compared to Mastrovito multipliers implemented using the original polynomial of degree $m$. In the tables, we have considered only those polynomials with $k = 1$, $k_1 = 2$, and $k_2 = 1$. For most of the entries in the tables, there does not exist an irreducible all-1 polynomial of degree $m$.
We first consider the polynomial $x^n + x + 1$ and its factors. This polynomial is of special interest because the time taken to multiply two polynomials modulo $x^n + x + 1$ is the same as the time taken to multiply two polynomials modulo $x^n + 1$ [16]. Table 2 gives the degrees of the irreducible factors of $x^n + x + 1$. We consider only those $n$ for which $x^n + x + 1$ has an irreducible factor of degree $m$ and for which no irreducible trinomial of degree $m$ exists. In these cases, multiplication of elements in $F_{2^m}$ can be performed more efficiently in the ring $F_2[x]/(x^n + x + 1)$. Note that the table is not an exhaustive list of such polynomials. In the tables below, $T_A$ is the delay of a 2-input AND gate and $T_X$ is the delay of a 2-input XOR gate. The delays have been computed using Table 4.
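The kind of search behind such a table of factor degrees can be sketched with distinct-degree factorization over $GF(2)$. This is not the Mathematica procedure used for the paper; it is a minimal Python illustration that assumes the input polynomial is square-free (true for $x^n + x + 1$ with odd $n$, since its derivative $x^{n-1} + 1$ shares no root with it). All helper names are hypothetical.

```python
def deg(p: int) -> int:
    return p.bit_length() - 1          # deg(0) == -1


def poly_mul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_divmod(a: int, b: int):
    q = 0
    while deg(a) >= deg(b):
        shift = deg(a) - deg(b)
        q ^= 1 << shift
        a ^= b << shift
    return q, a


def poly_gcd(a: int, b: int) -> int:
    while b:
        a, b = b, poly_divmod(a, b)[1]
    return a


def factor_degrees(f: int):
    """Degrees of the irreducible factors of a square-free f over GF(2)."""
    degrees = []
    h = 0b10                                   # h = x^(2^i) mod f, starting at i = 0
    i = 0
    while deg(f) > 0:
        i += 1
        if 2 * i > deg(f):                     # what is left must be irreducible
            degrees.append(deg(f))
            break
        h = poly_divmod(poly_mul(h, h), f)[1]  # Frobenius step: square modulo f
        g = poly_gcd(f, h ^ 0b10)              # product of all degree-i factors
        if deg(g) > 0:
            degrees += [i] * (deg(g) // i)
            f = poly_divmod(f, g)[0]
            h = poly_divmod(h, f)[1]
    return sorted(degrees)


print(factor_degrees(0b100011))            # x^5 + x + 1
print(factor_degrees((1 << 29) | 0b11))    # x^29 + x + 1
```

For $x^5 + x + 1 = (x^2 + x + 1)(x^3 + x^2 + 1)$ the ring therefore contains a copy of $F_{2^3}$; larger $n$ can be scanned the same way to look for a factor of a target degree $m$.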
Next, we consider the 4-term polynomial $x^n + x^2 + x + 1$ and its factors. This polynomial is the simplest 4-term polynomial for a given $n$. Table 3 gives the degrees of the irreducible factors of $x^n + x^2 + x + 1$. We are interested in finding an isomorphic copy of $F_{2^m}$ in the ring $F_2[x]/(x^n + x^2 + x + 1)$ in order to reduce the multiplication complexity. This can be done if $x^n + x^2 + x + 1$ has an irreducible factor of degree $m$ that is as close to $n$ as possible. Table 3 lists only those $n$ for which $x^n + x^2 + x + 1$ has a factor of degree $m$ and no irreducible trinomial of degree $m$ exists. Again, the times have been computed using Table 4. Table 4 gives the complexity of multiplying two field/ring elements when the field/ring is defined by a degree $m$ binomial, trinomial, 4-term polynomial, and pentanomial [16].
COMPLEXITY
In Table 4, $T_A$ is the delay of a 2-input AND gate and $T_X$ is the delay of a 2-input XOR gate. The second and third columns give the number of 2-input AND and 2-input XOR gates in an implementation of a multiplier and the fourth column gives the time required to perform a multiplication. Our technique performs multiplication in rings based on three- and four-term polynomials instead of a field based on a pentanomial. However, the rings defined by us are based on polynomials of higher degree than the pentanomial. This increases the gate count in the multiplier a little, but the speed of the multiplier increases. Wire length will also increase due to the increase in the gate count, thereby increasing the delay by an additional amount. However, this is not a factor if the increase in the gate count is low.
For example, consider multiplication in the field $F_{2^{91}}$. From the first row of Table 2, we know that the polynomial $x^{99} + x + 1$ has two irreducible factors, of degrees 8 and 91. Therefore, from Wedderburn's Theorem, we know that there exists an isomorphic copy of the field $F_{2^{91}}$ in the ring $F_2[x]/(x^{99} + x + 1)$. Multiplication in the ring takes $T_A + 7T_X$ time units, whereas multiplication in the field $F_{2^{91}}$ defined by an irreducible pentanomial takes $T_A + 14T_X$ time units. Thus, a savings of $7T_X$ time units is achieved using our approach. However, each ring element is 99 bits long and each field element is 91 bits long. Therefore, from Table 4, we can say that multiplication in the ring requires 9,801 2-input AND gates and 9,800 2-input XOR gates. Multiplication in the field requires 8,281 2-input AND gates and 8,460 2-input XOR gates. Thus, our technique requires extra hardware: 1,520 more AND gates and 1,340 more XOR gates. Another advantage of our technique is that a systolic multiplier can be designed for multiplication modulo $x^n + x + 1$. This makes implementation in VLSI very easy.
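The arithmetic in this example can be reproduced with a short script. The formulas below are assumptions chosen to match the figures quoted in the text rather than a transcription of Table 4: $n^2$ AND gates, $n^2 - 1$ XOR gates, and $\lceil \log_2 n \rceil$ XOR levels for the ring modulo $x^n + x + 1$, and $m^2$ AND gates and $m^2 + 2m - 3$ XOR gates for a pentanomial field. The function names are hypothetical.

```python
from math import ceil, log2


def ring_trinomial_cost(n: int):
    """(AND gates, XOR gates, XOR delay levels) for the ring mod x^n + x + 1.

    Assumed formulas, chosen to agree with the numbers quoted in the text.
    """
    return n * n, n * n - 1, ceil(log2(n))


def field_pentanomial_gates(m: int):
    """(AND gates, XOR gates) for a field defined by a pentanomial.

    Assumed formula m^2 + 2m - 3 for XOR gates; it matches the quoted 8,460.
    """
    return m * m, m * m + 2 * m - 3


ring_and, ring_xor, ring_levels = ring_trinomial_cost(99)
field_and, field_xor = field_pentanomial_gates(91)

print("ring :", ring_and, "AND,", ring_xor, "XOR, delay T_A +", ring_levels, "T_X")
print("field:", field_and, "AND,", field_xor, "XOR")
print(f"AND increase: {100 * (ring_and - field_and) / field_and:.1f}%")
print(f"XOR increase: {100 * (ring_xor - field_xor) / field_xor:.1f}%")
```

Under these assumptions the extra hardware is roughly 18 percent more AND gates and 16 percent more XOR gates, in exchange for the shorter critical path.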
SYSTOLIC COMPUTATION
In this section, we describe a systolic implementation of multipliers for elements in the ring defined by the polynomial $x^n + x + 1$. Systolic multipliers are modular and, hence, easy to implement in VLSI. Low complexity systolic multipliers were described in [15]. However, these multipliers were restricted to fields defined by the all-1 polynomial (AOP). Irreducible AOPs of degree $n$ can be found only if $(n + 1)$ is a prime and 2 is a generator of the multiplicative group of $GF(n + 1)$. When AOPs cannot be found, we can use Wedderburn's Theorem to define new rings, based on the polynomial $x^n + x + 1$, which contain the finite field of interest. A systolic implementation of multiplication in this new ring is presented next. We will show that the multiplier is similar to the multiplier of elements in a ring defined by the polynomial $x^n + 1$. We describe the Mastrovito multiplier [16] next and show how it can be used to design a systolic multiplier.
Mastrovito Multiplier for $x^n + x + 1$
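Finite field/ring multiplication using a canonical basis is carried out by polynomial multiplication followed by a modulo operation. Let $c(x)$ be the product of $a(x)$ and $b(x)$, where $a(x), b(x), c(x) \in F_2[x]/(x^n + x + 1)$. The polynomial multiplication $d(x) = a(x)b(x)$ is performed first and can be written in matrix form as
$$d = \begin{bmatrix} A_s \\ A_t \end{bmatrix} b, \qquad A_s = \begin{bmatrix} a_0 & 0 & \cdots & 0 \\ a_1 & a_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1} & a_{n-2} & \cdots & a_0 \end{bmatrix}, \qquad A_t = \begin{bmatrix} 0 & a_{n-1} & a_{n-2} & \cdots & a_1 \\ 0 & 0 & a_{n-1} & \cdots & a_2 \\ \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & a_{n-1} \end{bmatrix}.$$
Note that $a = [a_0, a_1, \ldots, a_{n-1}]^T$ is the coefficient vector of $a(x)$, the matrix $A_s$ is $n \times n$, and $A_t$ is $(n - 1) \times n$. The modulo operation is performed next, as follows:
Let $x^k \bmod (x^n + x + 1)$ be denoted as $u_j(x)$, where $j = k - n + 1$ and $n \le k \le 2n - 2$. Since $x^n = x + 1$ in the ring, $u_j(x)$ is defined as follows:
$$u_j(x) = x^j + x^{j-1}, \qquad 1 \le j \le n - 1.$$
Therefore, $c(x)$ can be written as follows:
$$c(x) = \sum_{i=0}^{n-1} d_i x^i + \sum_{j=1}^{n-1} d_{n-1+j}\, u_j(x).$$
This can be written in matrix form as follows:
$$c = \begin{bmatrix} I_n & U \end{bmatrix} d = \begin{bmatrix} I_n & U \end{bmatrix} \begin{bmatrix} A_s \\ A_t \end{bmatrix} b.$$
In the above equation, $I_n$ is the $n \times n$ identity matrix and $U$ is an $n \times (n - 1)$ matrix whose $j$th column is the coefficient vector of $u_j(x)$:
$$U = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & & \ddots & 1 \\ 0 & 0 & \cdots & 1 \end{bmatrix}.$$
In the above expression, the sizes of the matrices $[I_n \; U]$ and $[A_s; A_t]$ are $n \times (2n - 1)$ and $(2n - 1) \times n$, respectively.
The required product $c$ can now be expressed as follows:
$$c = (A_s + U A_t)\, b.$$
Note that the matrices $A_s$, $U A_t$, and $A_s + U A_t$ are of size $n \times n$.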
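The matrix formulation can be checked numerically. The sketch below builds $A_s$, $A_t$, and $U$ under the conventions assumed here (coefficient vectors listed from $a_0$ upward, and $x^{n-1+j}$ reduced to $x^j + x^{j-1}$), computes $c = A_s b + U (A_t b)$ over $GF(2)$, and compares the result with a direct schoolbook multiplication followed by reduction. It is a software illustration, not a hardware description; all names are hypothetical.

```python
def build_matrices(a):
    """Return A_s (n x n), A_t ((n-1) x n) and U (n x (n-1)) for coefficients a."""
    n = len(a)
    A_s = [[a[i - j] if i >= j else 0 for j in range(n)] for i in range(n)]
    A_t = [[a[n + i - j] if j > i else 0 for j in range(n)] for i in range(n - 1)]
    # Column j of U holds x^(j+1) + x^j, i.e. ones in rows j and j + 1.
    U = [[1 if i in (j, j + 1) else 0 for j in range(n - 1)] for i in range(n)]
    return A_s, A_t, U


def matvec(M, v):
    """Matrix-vector product over GF(2)."""
    return [sum(m & x for m, x in zip(row, v)) % 2 for row in M]


def mastrovito_multiply(a, b):
    """c = A_s.b + U.(A_t.b) in the ring F_2[x]/(x^n + x + 1)."""
    A_s, A_t, U = build_matrices(a)
    low = matvec(A_s, b)            # coefficients d_0 .. d_(n-1)
    high = matvec(A_t, b)           # coefficients d_n .. d_(2n-2)
    folded = matvec(U, high)        # high part reduced with x^n = x + 1
    return [x ^ y for x, y in zip(low, folded)]


def reference_multiply(a, b):
    """Schoolbook product followed by reduction modulo x^n + x + 1."""
    n = len(a)
    d = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            d[i + j] ^= a[i] & b[j]
    for k in range(2 * n - 2, n - 1, -1):   # replace x^k by x^(k-n+1) + x^(k-n)
        if d[k]:
            d[k] = 0
            d[k - n] ^= 1
            d[k - n + 1] ^= 1
    return d[:n]


a = [0, 0, 0, 0, 1]   # x^4
b = [0, 1, 0, 0, 0]   # x
print(mastrovito_multiply(a, b), reference_multiply(a, b))
```

With $n = 5$, the example multiplies $x^4$ by $x$, and both routes give the coefficient vector of $x + 1$.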
Systolic Array
The product $c$ can be written as
$$c = A_c \cdot b + \begin{bmatrix} 0 \\ A_t \cdot b \end{bmatrix},$$
where $A_c$ is the $n \times n$ matrix obtained by adding $A_t$, padded with a zero last row, to $A_s$, and the second term is the $(n - 1)$-element vector $A_t \cdot b$ shifted down by one position, with a zero in the first position.
We construct a systolic array first for computing $A_c \cdot b$.
This systolic array can then be modified easily to compute $c$.
Computing $c$ only requires adding the shifted vector $A_t \cdot b$ to $A_c \cdot b$. Thus, the systolic array for computing $c$ is simply a modification of the array for computing $A_c \cdot b$. This is especially important because the array to compute $A_c \cdot b$ is very simple. This is evident from the following expression for $A_c$:
$$A_c = \begin{bmatrix} a_0 & a_{n-1} & a_{n-2} & \cdots & a_1 \\ a_1 & a_0 & a_{n-1} & \cdots & a_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n-1} & a_{n-2} & a_{n-3} & \cdots & a_0 \end{bmatrix}.$$
Note that every row of $A_c$ is a circular shift of the row above it. Therefore, the rows of $A_c$ consist of all circular shifts of the vector $a$. Let us consider multiplication in the ring $F_2[x]/(x^5 + x + 1)$. The systolic array for computing $A_c \cdot b$ for this ring is shown in Fig. 1. Note that flip-flops are not shown in the figure. In Fig. 1, each cell consists of a 2-input AND gate and a 2-input XOR gate. The cells in the top row of Fig. 1 do not contain an XOR gate. The cells shaded in gray in Fig. 1 compute $A_t \cdot b$. The output of these cells can be reused to compute $c$. An individual cell is shown in Fig. 2. Fig. 3 shows the array to compute $c$.
The only difference between Fig. 3 and Fig. 1 is that $(n - 1)$ cells in Fig. 3 contain an extra 2-input XOR gate. These cells perform the addition of the shifted term $A_t \cdot b$ to $A_c \cdot b$. These $(n - 1)$ cells are shaded gray in Fig. 3. Note that the inputs to the cells have been arranged in such a way that the upper left triangular portion computes $A_t \cdot b$ (these cells have been shaded gray in Fig. 1). This is then reused to perform the addition of the shifted term to $A_c \cdot b$.
The resulting extra connections between cells are shown in Fig. 3 by bold lines. Fig. 4 shows the new cell of Fig. 3. Except for these $(n - 1)$ new cells, all the other cells are the same as in Fig. 2. The systolic array consists of $n^2$ 2-input AND gates and $n^2 - 1$ 2-input XOR gates. Since the top row of cells requires no XOR gates, the array to compute $A_c \cdot b$ contains $n^2 - n$ XOR gates, and the extra $(n - 1)$ XOR gates are necessary for computing $c$.
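The relationship between the circulant product and the final result can be mimicked in software. The sketch below is a functional model only (no cells, clocking, or flip-flops): it computes the circulant product $A_c \cdot b$, recomputes the partial sums $A_t \cdot b$ that the hardware would reuse, adds them shifted down by one position, and checks the result against a direct reduction modulo $x^n + x + 1$. The function names and the gate-count helper are assumptions made for this illustration.

```python
def circulant_product(a, b):
    """A_c . b, where row i of A_c is the vector a circularly shifted i places."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[i] ^= a[(i - j) % n] & b[j]
    return out


def high_part(a, b):
    """A_t . b: coefficients d_n .. d_(2n-2) of the unreduced product."""
    n = len(a)
    t = [0] * (n - 1)
    for i in range(n - 1):
        for j in range(i + 1, n):
            t[i] ^= a[n + i - j] & b[j]
    return t


def ring_multiply(a, b):
    """c = A_c.b + (A_t.b shifted down one position), modulo x^n + x + 1."""
    cyc = circulant_product(a, b)
    t = high_part(a, b)
    return [cyc[0]] + [cyc[i] ^ t[i - 1] for i in range(1, len(a))]


def reference_multiply(a, b):
    """Schoolbook product followed by reduction using x^n = x + 1."""
    n = len(a)
    d = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            d[i + j] ^= a[i] & b[j]
    for k in range(2 * n - 2, n - 1, -1):
        if d[k]:
            d[k] = 0
            d[k - n] ^= 1
            d[k - n + 1] ^= 1
    return d[:n]


def gate_counts(n):
    """(AND, XOR) gates: n^2 ANDs, n(n-1) XORs for A_c.b plus n-1 extra XORs."""
    return n * n, n * (n - 1) + (n - 1)


a, b = [1, 0, 1, 1, 0], [0, 1, 1, 0, 1]
print(ring_multiply(a, b) == reference_multiply(a, b), gate_counts(5))
```

In hardware the values of `high_part` are not recomputed; they are the partial sums already formed in the shaded cells, which is why only $(n - 1)$ additional XOR gates are needed.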
CONCLUSION
In this paper, we have presented a new representation of field elements using Wedderburn's Theorem. This enables us to find rings which have an isomorphic copy of the field we are interested in. The decomposition of a ring into ideals isomorphic to extension fields is done by finding generators of these ideals. These generators are a set of orthogonal idempotents that can be found easily when dealing with fields of characteristic 2. The resulting multipliers are faster, but lead to a modest increase in hardware. In particular, we have considered rings of the form $F_2[x]/(x^n + x + 1)$ and $F_2[x]/(x^n + x^2 + x + 1)$. In both cases, we have provided examples for various values of $n$ when multiplication is faster. It should be noted that this procedure can be used to find new rings that are based on polynomials other than the ones considered in this paper.
We have also shown that a multiplier for elements in $F_2[x]/(x^n + x + 1)$ can be easily implemented as a systolic array. This makes our procedure more attractive as it allows a simplified implementation in VLSI. A major advantage of the method proposed in this paper is that it can be used to improve encryption and decryption algorithms in elliptic curve cryptography.
