This contribution introduces a new class of multipliers for finite fields $GF((2^n)^4)$. The architecture is based on a modified version of the Karatsuba-Ofman algorithm (KOA). By determining optimized field polynomials of degree four, the last stage of the KOA and the modulo reduction can be combined. This saves computation and area in VLSI implementations. The new algorithm leads to architectures which show a considerably improved gate complexity compared to traditional approaches and a reduced delay compared with KOA-based architectures with separate modulo reduction. The new multipliers lead to highly modular architectures and are thus well suited for VLSI implementations. Three types of field polynomials are introduced and conditions for their existence are established. For the small fields where $n = 2, 3, \ldots, 8$, which are of primary technical interest, optimized field polynomials were determined by an exhaustive search. For each field order, exact space and time complexities are provided.
Introduction
Galois fields have gained widespread application in modern communication systems such as computer networks, satellite links, or compact disks. They all use finite field arithmetic for error correction [32] or for cryptographic algorithms [22]. Both issues are central for the provision of reliable and secure communication. Further applications can be found in signal processing [3] and pseudo random number generation [30]. Modern applications in many cases call for VLSI implementations of the arithmetic modules in order to satisfy the high speed requirements. For technical applications, extension fields of $GF(2)$, denoted by $GF(2^k)$, are of primary interest. Assuming a basis representation of the field elements, addition is a relatively inexpensive operation, whereas the other field operation, multiplication, is costly in terms of gate count and delay. Efficient multiplier architectures are especially crucial from a system design point of view since most advanced arithmetic functions such as inversion and exponentiation are based on multiplication. There have been considerable research efforts in the development of VLSI-suited multiplier architectures over the last decade, which is reflected by numerous journal publications [31, 29, 11, 12, 20, 10, 9, 5] and recent dissertations [21, 8, 6, 24, 13]. Multipliers can be classified into bit serial and bit parallel architectures. Since there is a space-time trade-off, bit parallel multipliers tend to be faster, which makes them attractive for many applications. The penalty, however, is a higher hardware cost. This article will focus on bit parallel architectures with an optimized gate count.
There are three different popular basis representations for elements of $GF(2^k)$ [...] A multiplier in $GF(2)$ can be realized by a two-input AND gate and an adder by a two-input XOR gate. In [27] it is shown that the computational complexity is also a good estimate for the chip area needed in VLSI implementations. For this reason it is attractive to provide architectures with low computational complexity for efficient hardware implementations.
More recently there have also been proposals for multipliers with complexities below the $k^2$ bound. These architectures are either based on multiple field extensions [2, 28], on fast convolution algorithms [1], or on both [25] (see [24, Section 3.2] for an overview). One algorithm which was found to be particularly suited in this context is the Karatsuba-Ofman algorithm (KOA), which will be looked at in more detail in the following section.
In this contribution a new architecture will be developed which further improves the computational complexity of the multipliers proposed in [1, 25], which use the standard KOA. The improvement is based on special classes of polynomials of degree four over fields $GF(2^n)$. These polynomials combine the structure of the KOA and the fact that the field characteristic is two. Fields of the type $GF((2^n)^4)$ are of great technical interest since all fields whose elements are represented by multiples of bytes (i.e., $8, 16, 24, \ldots$ bits) are of this form.
The remainder of the paper is organized as follows. In Section 2 the multiplier from [25] will be reviewed for the case where the field has the form $GF((2^n)^4)$. Section 3 develops the underlying idea of the new architecture and shows how suitable irreducible polynomials over $GF(2^n)$ were determined. In Section 4 different classes of efficient field polynomials are introduced together with proofs of existence. Expressions for the corresponding multiplier architectures are derived. In Section 5, the overall gate and time complexities of multipliers in $GF(2^k)$, $k = 8, \ldots, 32$, are provided.
Multiplication in Composite Fields $GF((2^n)^4)$
This section will review the multiplier proposed in [25]. The basic idea is to represent the field $GF(2^k)$ by a so-called composite field [7] of the form $GF((2^n)^m)$ and to apply a fast convolution algorithm to field elements in polynomial (or canonical) representation.
The field $GF((2^n)^4)$ is isomorphic to $GF(2^n)[x]/(P(x))$, where $P(x)$ is an irreducible polynomial of degree four over $GF(2^n)$. In the following, a residue class will be identified with the polynomial of least degree in this class. An element of $GF((2^n)^4)$ can then be represented by a polynomial with a maximum degree of three with coefficients in $GF(2^n)$:
$$A(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0, \quad a_i \in GF(2^n), \quad A \in GF((2^n)^4).$$
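The representation above can be made concrete with a small sketch. It assumes, purely for illustration, the ground field $GF(2^3)$ with $Q(y) = y^3 + y + 1$ (the ground field polynomial that appears in the worked example later in the text). The names `gf_add`, `gf_mul` and `poly_add` are hypothetical helpers and not part of the original architecture. A ground field element is stored as an integer whose bits are its binary coefficients, and an element of $GF((2^3)^4)$ as the list `[a_0, a_1, a_2, a_3]`.

```python
N = 3          # assumed extension degree of the ground field GF(2^N)
Q = 0b1011     # assumed ground field polynomial Q(y) = y^3 + y + 1


def gf_add(a, b):
    """Addition in GF(2^N): bitwise XOR of the coefficient vectors."""
    return a ^ b


def gf_mul(a, b, n=N, q=Q):
    """Multiplication in GF(2^n): shift-and-add with reduction modulo q."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:      # degree reached n, so subtract (XOR) q
            a ^= q
    return result


def poly_add(A, B):
    """Coefficient-wise addition of two elements [a_0, a_1, a_2, a_3]."""
    return [gf_add(a, b) for a, b in zip(A, B)]


A = [1, 0, 5, 3]   # A(x) = 3x^3 + 5x^2 + 1 with coefficients in GF(2^3)
B = [7, 2, 5, 0]
print(poly_add(A, B))          # [6, 2, 0, 3]
print(gf_mul(0b010, 0b100))    # y * y^2 = y^3 = y + 1, i.e. 3
```

Addition costs only XOR operations, whereas each ground field multiplication involves a full shift-and-reduce loop; this asymmetry is what makes trading multiplications for additions attractive.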
Field multiplication $C(x) = A(x)B(x)$, where $A, B, C \in GF((2^n)^m)$, can be performed by
$$C(x) \equiv A(x)\,B(x) \bmod P(x). \qquad (1)$$
The two steps involved in operation (1) are the polynomial multiplication and the reduction modulo $P(x)$. The Karatsuba-Ofman algorithm is applied to the first step of (1), the polynomial multiplication. The algorithm performs polynomial multiplication with a reduced number of coefficient multiplications at the cost of an increased number of coefficient additions. For the case considered here, i.e., polynomials of degree three, the KOA can be performed with $\log_2 4 = 2$ iterations. The complexity of the algorithm is given by the following lemmas.
Lemma 1 Two arbitrary polynomials in one variable of degree less than or equal to three with coefficients in $GF(2^n)$ can be multiplied by means of the Karatsuba-Ofman algorithm with:
[...]
multiplications and additions, respectively, in $GF(2^n)$.
Lemma 2 A parallel realization of the Karatsuba-Ofman algorithm for the multiplication of two arbitrary polynomials in one variable of degree less than or equal to three with coefficients in $GF(2^n)$ can be implemented with a time complexity (or delay) of:
[...]
where $T_\otimes$ and $T_\oplus$ denote the delay of one multiplier and one adder, respectively, in $GF(2^n)$.
Proof. The lemmas follow immediately from Theorems 1 and 2, respectively, [...]
It should be noted that the number of additions in (3) can be reduced to 22 [25]. For the following sections it will be beneficial to look at the exact operations which are performed by the KOA. The KOA operates in three stages. The first two stages split the [...] These equations can be realized with 10 additions and 9 multiplications. A block diagram of a parallel realization of the first two stages of the KOA is given in Figure 1.
In the algorithm's third stage the product polynomial is constructed. If the result of the multiplication is a product polynomial $C'(x)$ of degree six: [...] The equations can be realized with 12 additions if certain terms are precomputed.
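As a sketch of the three KOA stages described above, the listing below multiplies two polynomials of degree three using 9 ground field multiplications, 10 additions in the first two stages and 12 additions in the third. The indexing of the products `d[0]` to `d[8]`, the function names, and the ground field $GF(2^3)$ with $Q(y) = y^3 + y + 1$ are assumptions made for illustration; the original Equations (5) and (6) may number or group the terms differently.

```python
def gf_mul(a, b, n=3, q=0b1011):
    """Multiplication in the assumed ground field GF(2^3), Q(y) = y^3 + y + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= q
    return result


def koa_products(a, b):
    """Stages 1 and 2: 10 additions (XOR) and 9 multiplications."""
    a01, a23, a02, a13 = a[0] ^ a[1], a[2] ^ a[3], a[0] ^ a[2], a[1] ^ a[3]
    b01, b23, b02, b13 = b[0] ^ b[1], b[2] ^ b[3], b[0] ^ b[2], b[1] ^ b[3]
    a0123, b0123 = a02 ^ a13, b02 ^ b13
    return [
        gf_mul(a[0], b[0]), gf_mul(a01, b01), gf_mul(a[1], b[1]),
        gf_mul(a02, b02), gf_mul(a0123, b0123), gf_mul(a13, b13),
        gf_mul(a[2], b[2]), gf_mul(a23, b23), gf_mul(a[3], b[3]),
    ]


def koa_combine(d):
    """Stage 3: build c'_0 .. c'_6 with 12 additions."""
    t1 = d[0] ^ d[2]                      # 1 addition, shared by c'_1 and c'_2
    t2 = d[6] ^ d[8]                      # 1 addition, shared by c'_4 and c'_5
    c1 = t1 ^ d[1]                        # 1 addition
    c5 = t2 ^ d[7]                        # 1 addition
    c2 = t1 ^ d[3] ^ d[6]                 # 2 additions
    c4 = t2 ^ d[2] ^ d[5]                 # 2 additions
    c3 = c1 ^ c5 ^ d[3] ^ d[4] ^ d[5]     # 4 additions
    return [d[0], c1, c2, c3, c4, c5, d[8]]


def schoolbook(a, b):
    """Reference product with 16 multiplications, for comparison only."""
    c = [0] * 7
    for i in range(4):
        for j in range(4):
            c[i + j] ^= gf_mul(a[i], b[j])
    return c


a, b = [1, 0, 5, 3], [7, 2, 5, 0]
print(koa_combine(koa_products(a, b)) == schoolbook(a, b))   # True
```

The shared terms `t1` and `t2` are one way of reaching 12 additions in the third stage; other groupings with the same count exist.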
In order to perform finite field multiplication, the polynomial $C'(x)$ has to be reduced modulo the field polynomial: $C(x) \equiv C'(x) \bmod P(x)$, where $C(x) = c_3 x^3 + c_2 x^2 + c_1 x + c_0$.
where $r_{i-1,j-1} = 0$ if $i = 0$. In [25, Table 1] sub-optimum polynomials are listed which perform the mapping (7) with low complexity. They were determined by an exhaustive search. The overall computational complexity of a multiplier is the sum of the complexities from Lemma 1 and the complexity of a realization of Equation (7).
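The reduction step can be sketched as follows. The listing assumes that Equation (7) has the form $c_i = c'_i + \sum_{j=0}^{2} r_{i,j}\, c'_{4+j}$, where column $j$ of $r$ holds the coefficients of $x^{4+j} \bmod P(x)$, and that the columns are built with the recursion $r_{i,j} = r_{i-1,j-1} + r_{3,j-1}\, p_i$. This form is inferred from the boundary condition $r_{i-1,j-1} = 0$ for $i = 0$ quoted above and is not copied from the original equation. The ground field $GF(2^3)$ and all function names are likewise illustrative assumptions.

```python
def gf_mul(a, b, n=3, q=0b1011):
    """Multiplication in the assumed ground field GF(2^3), Q(y) = y^3 + y + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= q
    return result


def reduction_matrix(p):
    """Return r[i][j], the coefficient of x^i in x^(4+j) mod P(x).

    p = [p_0, p_1, p_2, p_3] describes the monic polynomial
    P(x) = x^4 + p_3 x^3 + p_2 x^2 + p_1 x + p_0. In characteristic two,
    x^4 = p_3 x^3 + p_2 x^2 + p_1 x + p_0 (mod P).
    """
    r = [[0] * 3 for _ in range(4)]
    for i in range(4):
        r[i][0] = p[i]
    for j in range(1, 3):
        for i in range(4):
            shifted = r[i - 1][j - 1] if i > 0 else 0    # r_{-1,j-1} = 0
            r[i][j] = shifted ^ gf_mul(r[3][j - 1], p[i])
    return r


def reduce_mod_p(c_prime, p):
    """Map the seven coefficients c'_0..c'_6 to c_0..c_3."""
    r = reduction_matrix(p)
    return [
        c_prime[i]
        ^ gf_mul(r[i][0], c_prime[4])
        ^ gf_mul(r[i][1], c_prime[5])
        ^ gf_mul(r[i][2], c_prime[6])
        for i in range(4)
    ]


# Hypothetical example polynomial P(x) = x^4 + x^3 + x^2 + x + 1
print(reduction_matrix([1, 1, 1, 1]))
# [[1, 1, 0], [1, 0, 1], [1, 0, 0], [1, 0, 0]]
```

For this example polynomial all entries of $r$ are 0 or 1, so the reduction needs no constant multiplications at all; polynomials with other coefficients introduce constant multiplications by the corresponding $r_{i,j}$.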
A New Approach
In this section a new approach is outlined. The new method combines properties of the KOA over fields of characteristic two and of the modulo reduction process.
In the architecture described above, polynomials were determined which optimize the complexity of the modulo reduction (7), i.e., the mapping from the $c'_k$ to the $c_i$ coefficients. This optimization is independent of the preceding polynomial multiplication by means of the KOA. The key idea of the new approach is to determine optimized mappings from the coefficients $d_i$, $i = 0, \ldots, 8$, to the coefficients $c_i$, $i = 0, \ldots, 3$, of the product polynomial. Such a mapping has the form
$$c_i = \sum_{j=0}^{8} s_{i,j}\, d_j, \quad i = 0, \ldots, 3, \quad s_{i,j} \in GF(2^n). \qquad (9)$$
The mapping includes the last stage of the KOA given in Equations (6) and the modulo reduction as described by Equation (7). Expressions for the coefficients $s_{i,j}$ are obtained by inserting Equations (6) into Equation (7). The potential advantage inherent in this approach is based on the following observation: If at least two coefficients $r_{i,j}$ and $r_{i,k}$ are equal, summations of the form $d_k + d_k$ may occur. Since the characteristic of the field $GF(2^n)$ is two, the equation $d_i + d_i = 0$ holds. Thus cancellations can potentially occur in the four Equations (9), leading to a lower computational complexity.
Example. Let us assume there is a field polynomial $P(x)$ leading to coefficients $r_{3,0} = 1$ and $r_{3,1} = r_{3,2} = 0$ in Equation (7). Then the output coefficient $c_3$ can be expressed as [...] which can be realized with only four additions. In contrast, a straightforward realization of the sum $c'_3 + c'_4$ calls for 12 additions.
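The cancellation in this example can be checked symbolically. The sketch below assumes one possible indexing of the KOA products $d_0, \ldots, d_8$ (the index sets are an assumption, since the original Equations (5) and (6) may order the products differently) and represents each $c'_k$ as the set of indices $j$ of the $d_j$ it sums. Because the characteristic is two, adding two such sums is the symmetric difference of the sets.

```python
# Assumed index sets: c'_k is the sum of d_j for all j in C_PRIME[k].
C_PRIME = {
    0: {0},
    1: {0, 1, 2},
    2: {0, 2, 3, 6},
    3: {0, 1, 2, 3, 4, 5, 6, 7, 8},
    4: {2, 5, 6, 8},
    5: {6, 7, 8},
    6: {8},
}


def add_sums(*sums):
    """Add sums of d_j in characteristic two: terms occurring twice cancel."""
    result = set()
    for s in sums:
        result ^= s          # symmetric difference
    return result


# r_{3,0} = 1 and r_{3,1} = r_{3,2} = 0 give c_3 = c'_3 + c'_4
c3 = add_sums(C_PRIME[3], C_PRIME[4])
naive_additions = len(C_PRIME[3]) + len(C_PRIME[4]) - 1

print(sorted(c3))          # [0, 1, 3, 4, 7]
print(len(c3) - 1)         # 4 additions after cancellation
print(naive_additions)     # 12 additions without cancellation
```

Under this indexing the four products $d_2$, $d_5$, $d_6$ and $d_8$ occur in both $c'_3$ and $c'_4$ and drop out, which reproduces the counts of four and 12 additions stated in the example.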
In order to find expressions (9) with an optimized number of arithmetic operations, two different strategies were applied:
1. An exhaustive search through all irreducible polynomials of degree four over $GF(2^n)$, for $n \le 8$, was performed. For every polynomial the number of additions and constant multiplications required for (9) was evaluated. Besides the cancellation described above, further computational redundancies in the expressions were taken into account as follows: If there were identical $s_{i,j}$ for a given $c_i$, i.e., $s_{i,u} = s_{i,v}$, the number of constant multiplications is further reduced since $s_{i,u} d_u + s_{i,v} d_v = s_{i,u} (d_u + d_v)$.
It was found that for different values of $n$, similar polynomials lead to optimum architectures. By identifying these polynomial types we were able to define classes of multiplier architectures which are computationally very efficient (types A and C in the following sections).
2. Based on the results from the exhaustive search we constructed another polynomial type (type B in the following) for a certain class of fields.
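A minimal sketch of the first strategy, under stated assumptions: the ground field is $GF(2^3)$ with $Q(y) = y^3 + y + 1$, the KOA products follow the hypothetical indexing encoded in `C_PRIME`, and the cost measure only counts the terms that survive cancellation in each $c_i$ plus the distinct constants per $c_i$. It ignores subexpressions shared between different $c_i$ and the gate cost of each individual constant, so it illustrates the mechanics of the search rather than the refined cost used in the original work.

```python
from itertools import product

N, Q = 3, 0b1011        # assumed ground field GF(2^3), Q(y) = y^3 + y + 1
SIZE = 1 << N
# Assumed index sets: c'_k is the sum of the KOA products d_j, j in C_PRIME[k].
C_PRIME = [{0}, {0, 1, 2}, {0, 2, 3, 6}, set(range(9)),
           {2, 5, 6, 8}, {6, 7, 8}, {8}]


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> N) & 1:
            a ^= Q
    return result


def poly_mul(f, g):
    """Product of two coefficient tuples (lowest degree first)."""
    h = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] ^= gf_mul(fi, gj)
    return tuple(h)


def monic(degree):
    return [c + (1,) for c in product(range(SIZE), repeat=degree)]


def irreducible_quartics():
    """A monic quartic is reducible iff it has a monic factor of degree 1 or 2."""
    reducible = {poly_mul(f, g)
                 for d in (1, 2) for f in monic(d) for g in monic(4 - d)}
    return [f for f in monic(4) if f not in reducible]


def mapping(p):
    """Rows s[i] = {j: s_ij} of mapping (9) for P = x^4 + p[3]x^3 + ... + p[0]."""
    col, cols = list(p[:4]), []
    for _ in range(3):                       # x^4, x^5, x^6 modulo P(x)
        cols.append(col)
        top = col[3]
        col = [gf_mul(top, p[0])] + [col[i - 1] ^ gf_mul(top, p[i])
                                     for i in range(1, 4)]
    rows = []
    for i in range(4):
        row = {j: 1 for j in C_PRIME[i]}
        for k in range(3):
            for j in C_PRIME[4 + k]:
                row[j] = row.get(j, 0) ^ cols[k][i]
        rows.append({j: v for j, v in row.items() if v})
    return rows


def cost(p):
    """(additions, constant multiplications) without cross-row sharing."""
    rows = mapping(p)
    additions = sum(len(row) - 1 for row in rows)
    const_mults = sum(len({v for v in row.values() if v != 1}) for row in rows)
    return additions, const_mults


candidates = irreducible_quartics()
best = min(candidates, key=cost)
print(len(candidates), best, cost(best))
```

Counting distinct constants per row implements the grouping $s_{i,u} d_u + s_{i,v} d_v = s_{i,u}(d_u + d_v)$ from item 1: products that share a coefficient are summed first and multiplied once.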
Efficient Multipliers
This section describes different optimized field polynomials and new classes of multipliers based on them.
Low Complexity Field Polynomials
This subsection introduces three types of irreducible polynomials and establishes their existence.
We begin with the definition of field polynomials for $GF((2^n)^4)$. [...] To show that such an element $p$ exists, we can assume that $n$ is even (if $n$ is odd, just take $p = 1$). Half of the elements of $GF(2^n)$ [...]
So far, conditions for the irreducibility of the three types of polynomials have been given. In the following, irreducible type A and type C polynomials will be used for odd $n$, while irreducible type B polynomials will be used for even $n$. The reason that two different types of polynomials are used for odd $n$ is that type C polynomials are potentially primitive while type A polynomials are not. Primitivity is sometimes of interest in practice. For most of the (small) fields $GF(2^n)$ of technical interest we were able to determine primitive polynomials through computer searches. These will be provided in Table 2.
Multiplier Architectures
This subsection derives efficient multiplier architectures from the three types of field polynomials. All multipliers are constructed from the first two stages of the KOA given in Equations (5) and mappings of the form (9).
Type A Multipliers
Type A multipliers are based on the polynomials $P_A(x)$. Their computational complexity is given by the following lemma:
Lemma 7 A type A multiplier in $GF((2^n)^4)$ can be realized with a complexity of
$$\#\oplus = 21, \qquad (13)$$
$$\#\otimes = 9 \qquad (14)$$
additions and multiplications, respectively, in $GF(2^n)$. The delay $T$ of a parallel realization is given by
$$T = 5\,T_\oplus + T_\otimes,$$
where $T_\oplus$ denotes the delay caused by an adder in $GF(2^n)$ and $T_\otimes$ denotes the delay of a multiplier.
[...] The set of equations can be realized with 11 additions when the term $(d_0 + d_5)$ is precomputed. The overall complexity is thus 9 multiplications and $10 + 11 = 21$ additions.
From Figure 1 it follows that the delay introduced by the first two stages of the KOA is $2T_\oplus$ and $1T_\otimes$. Computation of $c_3$ from the $d_i$'s adds another delay of $3T_\oplus$. The overall critical path thus contains $3 + 2 = 5$ adder delays $T_\oplus$ and one multiplier delay $T_\otimes$.
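To illustrate the operation counts of the lemma, the following sketch implements a multiplier with 9 multiplications and $10 + 11 = 21$ additions. Two things are assumed, because the corresponding definitions are not reproduced in this text: the type A polynomial is taken to be $P_A(x) = x^4 + x^3 + x^2 + x + 1$ (irreducible over $GF(2^n)$ for odd $n$ and never primitive, which matches the properties stated for type A), and the products $d_0, \ldots, d_8$ follow a hypothetical indexing. The ground field $GF(2^3)$ with $Q(y) = y^3 + y + 1$ is also chosen only for illustration.

```python
def gf_mul(a, b, n=3, q=0b1011):
    """Multiplication in the assumed ground field GF(2^3), Q(y) = y^3 + y + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= q
    return result


def koa_products(a, b):
    """First two KOA stages: 10 additions and 9 multiplications."""
    a01, a23, a02, a13 = a[0] ^ a[1], a[2] ^ a[3], a[0] ^ a[2], a[1] ^ a[3]
    b01, b23, b02, b13 = b[0] ^ b[1], b[2] ^ b[3], b[0] ^ b[2], b[1] ^ b[3]
    return [
        gf_mul(a[0], b[0]), gf_mul(a01, b01), gf_mul(a[1], b[1]),
        gf_mul(a02, b02), gf_mul(a02 ^ a13, b02 ^ b13), gf_mul(a13, b13),
        gf_mul(a[2], b[2]), gf_mul(a23, b23), gf_mul(a[3], b[3]),
    ]


def type_a_multiply(a, b):
    """Combined last KOA stage and reduction: 11 additions, no constants."""
    d = koa_products(a, b)
    t = d[0] ^ d[5]                           # precomputed term, 1 addition
    c0 = t ^ d[2] ^ d[7]                      # 2 additions
    c1 = t ^ d[1] ^ d[6]                      # 2 additions
    c2 = t ^ d[3] ^ d[8]                      # 2 additions
    c3 = d[0] ^ d[1] ^ d[3] ^ d[4] ^ d[7]     # 4 additions, depth 3
    return [c0, c1, c2, c3]


def reference_multiply(a, b):
    """Schoolbook product followed by separate reduction modulo P_A(x)."""
    c = [0] * 7
    for i in range(4):
        for j in range(4):
            c[i + j] ^= gf_mul(a[i], b[j])
    for k in (6, 5, 4):        # x^k = x^(k-1) + x^(k-2) + x^(k-3) + x^(k-4)
        for i in range(1, 5):
            c[k - i] ^= c[k]
    return c[:4]


a, b = [1, 0, 5, 3], [7, 2, 5, 0]
print(type_a_multiply(a, b) == reference_multiply(a, b))   # True
```

The five-term sum for `c3` needs three adder levels, which is consistent with the $3T_\oplus$ contribution to the critical path described above.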
In order to further illustrate the structure of the multiplier, Figure 2 shows a block diagram of a parallel hardware realization of a type A multiplier. The circuit uses the two building blocks "$GF(2^n)$ adder" and "$GF(2^n)$ multiplier". The complexities from Equations (13), (14) and (7) can easily be verified from the diagram. The inputs are the variables $a_i, b_i$, $i = 0, 1, 2, 3$; the outputs are the $c_i$. It should be noted that the inputs and outputs and all internal connections are $n$-bit wide buses. The figure also underlines the high degree of modularity of the architecture. An entire multiplier in the composite field $GF((2^n)^m)$ can be constructed from only two modules which provide $GF(2^n)$ addition and multiplication, respectively.
We now compare the space and time complexity of the type A multiplier with the complexities of the pure polynomial multiplication. From Lemmas 1 and 2 it can be seen that the entire field multiplication, consisting of polynomial multiplication and modulo reduction, can be performed with a complexity smaller than the complexity for the polynomial multiplication only. This holds for the computational complexity (number of additions and multiplications) as well as for the delay. We would like to stress that the improved complexities are solely due to a reduced number of additions in the subfield $GF(2^n)$. The number of subfield multiplications is identical for both architectures.
Results
This section will summarize the features of the three types of multipliers. Also, space and time complexity expressions on the gate level will be developed. For a number of fields of technical interest, optimized polynomials are presented.
Gate Complexities
In order to obtain gate counts, the three operations in the ground field $GF(2^n)$, namely addition, multiplication and constant multiplication, must be further specified. Addition requires $n$ [...] small fields of technical interest. In particular, this complexity is achieved for $n = 2, 3, \ldots, 7$ [20]. The complexity of constant multiplication $\otimes_{\mathrm{cnst}}$ is heavily dependent on the actual field element multiplied with. The average complexity is given by $(n^2/2) - n$ XOR gates [21].
However, by carefully choosing polynomials with suitable coefficients, the constant multiplication complexity can be considerably below the average [24, 26]. Polynomials optimized this way are listed in Table 2 below. The complexity contribution stemming from the constant multiplications is extremely small in all cases. If the constant multiplications are neglected, our architecture asymptotically improves the $k^2 = 16n^2$ complexity bound for traditional architectures by 44%.
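The 44% figure follows from a short calculation, sketched here under the assumption that each ground field multiplier contributes a dominant term of $n^2$ gates (consistent with the 9 AND gates quoted for a $GF(2^3)$ multiplier in the example of this section) and that the nine ground field multipliers dominate the gate count. The function name is hypothetical.

```python
from fractions import Fraction


def dominant_term_saving(n):
    """Relative saving of 9 * n^2 against the traditional k^2 with k = 4n."""
    traditional = (4 * n) ** 2      # k^2 = 16 n^2
    composite = 9 * n ** 2          # nine ground field multipliers
    return 1 - Fraction(composite, traditional)


print(dominant_term_saving(3))          # 7/16
print(float(dominant_term_saving(3)))   # 0.4375, i.e. about 44%
```

The ratio is independent of $n$, which is why the same percentage appears both for the asymptotic bound and for the AND gate counts of the small fields.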
The architecture limits the speed of multiplier implementations by the delay through the critical path. Since adders in $GF(2^n)$ are realized by $n$ parallel XORs, the delay of an adder is one XOR gate delay. The delay of a multiplier in $GF(2^n)$ is upper bounded by
$$T_\otimes \le T_{and} + 2\,T_{xor} \lceil \log_2 n \rceil, \qquad (17)$$
where $T_{and}$ and $T_{xor}$ are the delays of one AND gate and one XOR gate, respectively.
Table 1 lists for each of the three types of polynomials: the condition for existence, a statement regarding primitivity, the gate complexity, and the number of gate delays along the critical path. The "prim." column provides information about whether the polynomial is potentially primitive.
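Bound (17) is easy to tabulate for the ground fields of interest. The helper below is an illustrative sketch; its name and its return convention (a pair of AND gate delays and XOR gate delays) are assumptions.

```python
def multiplier_delay_bound(n):
    """Upper bound (17) for a GF(2^n) multiplier as (AND delays, XOR delays)."""
    ceil_log2 = (n - 1).bit_length()     # equals ceil(log2(n)) for n >= 1
    return 1, 2 * ceil_log2


for n in range(2, 9):
    and_delays, xor_delays = multiplier_delay_bound(n)
    print(f"n = {n}: T_and * {and_delays} + T_xor * {xor_delays}")
```

Using `bit_length` avoids floating-point logarithms, so the ceiling is exact for every integer $n$.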
Optimized Examples and Comparison
For many applications such as Reed-Solomon coders it is helpful to have type B and C polynomials which fulfill the following criteria:
1. The coefficient $p$ is optimized with respect to the complexities of the constant multiplications occurring in Equations (15) and (16).
2. The polynomials are primitive.
We performed exhaustive searches according to the two optimization criteria for all ground fields $GF(2^n)$, $n = 2, 3, \ldots, 8$. The field polynomials used for the ground fields are listed in the column headed by "Q(y)" in Table 2. The polynomials are given by their binary coefficients, highest coefficient leftmost. All of the $Q(y)$ polynomials are primitive. For the fields with even $n$, optimized type B polynomials were determined. In all but the case $n = 4$, primitive polynomials were found. Among the existing primitive polynomials, the one yielding the lowest constant multiplication complexity was chosen. The corresponding coefficients are provided in an exponent notation "$\omega^i$" in Table 2, where $\omega$ is a primitive element such that $Q(\omega) = 0$. For fields with $n$ odd, primitive type C polynomials were determined by the exhaustive search. Again, for each given $n$, the polynomial with the lowest constant multiplication complexity was chosen. Table 2 gives detailed information about the gate complexity of multipliers in $GF((2^n)^4)$ [...]
Example. We consider the multiplier in $GF((2^3)^4) = GF(2^{12})$ based on a type C polynomial. The ground field $GF(2^3)$ has the primitive field polynomial $Q(y) = y^3 + y + 1$. The extension field has the primitive field polynomial $P(x) = x^4 + x^3 + \omega$, where $Q(\omega) = 0$. Each of the nine multipliers in $GF(2^3)$ has a gate count of 9 AND and 8 XOR gates [20]. Each ground field adder requires 3 XOR gates. Constant multiplication with $p = \omega$ requires one XOR gate. Hence, the overall complexity is $9 \cdot 9 = 81$ AND gates and $9 \cdot 8 + 24 \cdot 3 + 3 \cdot 1 = 147$ XOR gates.
The critical path consists of five ground field adders, one ground field multiplier and one constant multiplier with $\omega$. Each adder introduces one XOR gate delay, the [...]
It is interesting to compare the time and space complexities with other architectures. The traditional architecture [20] for $GF(2^{12})$ requires considerably more gates; our architecture shows an improvement of 44% with respect to the AND gate count and 29% with respect to the XOR gate count. The delay of the traditional architecture, however, is slightly smaller: seven $T_{xor}$ as opposed to nine $T_{xor}$. Comparing the multiplier to the one based on the standard KOA with separate modulo reduction [25] as described in Section 2, it can be seen that the number of XOR gates is reduced from 159 to 135. However, the main advantage seems to be that the delay is reduced from 11 $T_{xor}$ to 8 $T_{xor}$, or 27%, which can be a major factor for applications with high speed requirements.
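The gate count of the $GF((2^3)^4)$ example can be reproduced with a few lines. The sketch below takes the per-module figures from the example (9 AND and 8 XOR gates per ground field multiplier, 3 XOR gates per adder, 24 adders, 3 constant multiplications by $\omega$) and additionally checks that multiplication by $\omega$ costs a single XOR gate when $Q(y) = y^3 + y + 1$. Function and variable names are hypothetical.

```python
def gf8_mul(a, b):
    """Multiplication in GF(2^3) with Q(y) = y^3 + y + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> 3) & 1:
            a ^= 0b1011
    return result


def mul_by_omega(a):
    """Multiply a = a2*y^2 + a1*y + a0 by omega = y using one XOR gate."""
    a0, a1, a2 = a & 1, (a >> 1) & 1, (a >> 2) & 1
    # y * a = a1*y^2 + (a0 + a2)*y + a2, because y^3 = y + 1
    return (a1 << 2) | ((a0 ^ a2) << 1) | a2


# The one-XOR circuit agrees with general multiplication by omega = 0b010.
matches = all(mul_by_omega(a) == gf8_mul(a, 0b010) for a in range(8))

# omega is primitive: its powers run through all seven nonzero elements.
powers, x = [], 1
for _ in range(7):
    powers.append(x)
    x = mul_by_omega(x)

and_gates = 9 * 9                      # nine multipliers, 9 AND gates each
xor_gates = 9 * 8 + 24 * 3 + 3 * 1     # multipliers + adders + constants

print(matches, sorted(powers))         # True [1, 2, 3, 4, 5, 6, 7]
print(and_gates, xor_gates)            # 81 147
```

The single XOR arises because only the middle coefficient of $\omega \cdot a$ is a sum of two input bits; the other two output bits are plain wires.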
Applications
There are two main areas of application for Galois field arithmetic. Certain error correction codes, in particular the popular Reed-Solomon codes, are based on arithmetic in finite fields of characteristic two. The specific field $GF(2^8)$ has been standardized for space communication by ESA and NASA [18], and for use in CD players. Since the latter constitutes a high volume application, the provision of highly optimized architectures is attractive. We would like to stress that the gate count achieved by the new architecture for $GF((2^2)^4) = GF(2^8)$ shows an improvement of 44% for the AND gate count together with a slightly improved XOR gate count compared to traditional approaches. Application of larger Galois fields to Reed-Solomon codes is also possible. The results in Table 2 provide detailed information for designers of corresponding implementations.
The other main area of application for Galois field arithmetic is public-key cryptography. For our architectures, schemes with moderate field order and the possibility of composite extension degrees $k = n \cdot m$ are of importance. Elliptic curve [16] and hyperelliptic curve [17] cryptosystems fulfill both requirements. The latter can be based on field orders as small as $k \approx 30$, whereas the former requires values in the range of $k = 150$ to $250$. In principle our architectures are applicable to both schemes.
Conclusion
New multiplier architectures for composite Galois fields of the form $GF((2^n)^4)$ have been considered. It was shown that the combination of the Karatsuba-Ofman algorithm and the modulo reduction can lead to a reduced number of computations required for a field multiplication. The computational gain is heavily dependent on the field polynomial used. By means of exhaustive searches we determined three classes of polynomials which lead to optimum multipliers with respect to computational complexity. For each field polynomial type, conditions of existence were provided, together with ways of finding these polynomials.
Our approach asymptotically improves the $k^2 = 16n^2$ complexity bound for traditional architectures by 44%. Considerable improvements are also obtained for the small fields which are of primary technical interest. For the fields where $n = 2, 3, \ldots, 8$, field polynomials with optimized coefficients were determined. For each field except for $n = 4$, primitive polynomials are provided. The exact gate complexity and delays are given for each multiplier. A comparison with multipliers over fields $GF(2^k)$, $k = n \cdot m$, shows that the new architectures lead to a considerably improved gate count, whereas the theoretical delay of the new architecture is only slightly larger. Compared to composite field architectures which separate polynomial multiplication and modulo reduction [25], the new approach shows a reduced delay together with a further improved gate count.
The new architectures are naturally modular, which is a very attractive feature for VLSI design and implementation. A multiplier can be realized by using only a small number of modules providing $GF(2^n)$ arithmetic. The architectures have applications in the area of error correction codes as well as in public-key cryptography.
