Abstract. Recently, there has been considerable interest in cryptographic applications based on fields GF(p^m), for p > 2. This contribution presents GF(p^m) multiplier architectures, where p is odd. We present designs which trade area for performance based on the number of coefficients that the multiplier processes at one time. Families of irreducible polynomials are introduced to reduce the complexity of the modulo reduction operation and, thus, improve the efficiency of the multiplier. We then specialize to fields GF(3^m) and provide the first cubing architecture presented in the literature. We synthesize our architectures for the special case of GF(3^97) on the XCV1000-8-FG1156 and XC2VP20-7-FF1156 FPGAs and provide area/performance numbers and comparisons to previous GF(3^m) and GF(2^m) implementations. Finally, we provide tables of irreducible polynomials over GF(3) of degree m with 2 ≤ m ≤ 255.
Introduction
Galois field arithmetic has received considerable attention in recent years due to its application in public-key cryptography schemes and error correcting codes. In particular, two public-key cryptosystems based on finite fields stand out: elliptic curve (EC) cryptosystems, introduced by Miller and Koblitz [24, 19], and hyperelliptic cryptosystems, a generalization of elliptic curves introduced by Koblitz in [20]. Both prime fields and extension fields have been proposed for use in such cryptographic systems. However, until a few years ago the focus was mainly on fields of characteristic 2 due to the straightforward manner in which elements of GF(2) can be represented, i.e., by the logical values "0" and "1". For these types of fields, both software implementations and hardware architectures have been studied extensively. In recent years, GF(p^m) fields, where p is odd, have gained interest in the research community. Mihȃlescu [28] and, independently, Bailey and Paar [3, 4] introduced the concept of Optimal Extension Fields (OEFs) in the context of elliptic curve cryptography. OEFs are fields GF(p^m) where p is odd and both p and m are chosen to match the particular hardware used to perform the arithmetic, thus allowing for efficient field arithmetic. The treatment in [4, 28] and that of other works based on OEFs has only been concerned with efficient software implementations.
In [21, 32], GF(p^m) fields are proposed for cryptographic purposes where p is relatively small. [21] describes an implementation of ECDSA over fields of characteristic 3 and 7. The author in [32] describes a method to implement elliptic curve cryptosystems over fields of small odd characteristic, only considering p < 24 in the results section. More recently, Boneh and Franklin [8] introduced an identity-based encryption scheme based on the use of the Weil and Tate pairings. Similarly, [7] described a short signature scheme based on the Weil and Tate pairings. Other applications include [17, 34]. All of these applications consider elliptic curves defined over fields of characteristic 2 and 3. Because characteristic 2 field arithmetic has been extensively studied in the literature, authors have concentrated their efforts on improving the performance of systems based on characteristic 3 arithmetic. For example, [5, 10] describe algorithms to improve the efficiency of the pairing computations. [10] also introduces some clever tricks to improve the efficiency of the underlying arithmetic in software-based solutions. [29] treats the hardware implementation of fields of characteristic 3. Their design is geared only towards fields of characteristic 3. Moreover, it is acknowledged that the element representation makes their architectures unsuitable to take advantage of operations such as cubing, which are important in Tate pairing computations (i.e., a cubing will only be as fast as a general multiplication, while in other implementations cubing could be more efficient).
Our Contributions
Given the research community's interest in cryptographic systems based on fields of odd characteristic and the lack of hardware architectures for general odd characteristic fields, we try to close this gap. Our approach differs from previous ones in that we propose general architectures which are suitable for fields GF(p^m) with p odd. In particular, we generalize the work in [33] to fields GF(p^m), p odd. We then study carefully the case of GF(3^m) due to its cryptographic significance. In addition, we focus on finding irreducible polynomials over GF(3) to improve the performance of the multiplier. For the problem of efficient GF(p) arithmetic, we refer the reader to [9, 30, 13].
The remainder of this contribution is organized as follows. In Section 2 we survey previous GF(p^m) architectures and discuss certain multiplier architectures for GF(2^k) type fields, which we generalize for the GF(p^m) case in Section 4. In Section 4, we also study both the time and area complexities of the multipliers and give two theorems which help us in choosing irreducible polynomials that will minimize the area and delay of the resulting multipliers. Finally, Section 5 specializes the results of the previous section to the case of GF(3^m) fields, which recently have become of great interest in the research community. We also describe a prototype implementation of our architectures for three different fields on two FPGAs, and we provide tables of irreducible polynomials over GF(3) for degrees between 2 and 255. We end this contribution with some conclusions.
Related Work
In contrast to the GF(p) case, relatively little work has been done on GF(p^m) architectures. Our literature search yielded [31] as the only reference that explicitly treated the general case of GF(p^m) multipliers, p odd. In [31], GF(p^m) multiplication is computed in two stages. First, the polynomial product is computed modulo a highly factorizable degree-S polynomial, M(x), with S ≥ 2m − 1. The product is, thus, computed using a polynomial residue number system. The second step involves reducing modulo the irreducible polynomial p(x) over which GF(p^m) is defined. The method, however, does not seem to apply to field sizes common in cryptographic applications due to certain constraints on the size of m. In particular, for p = 3, there does not exist a suitable M(x) polynomial.
The authors in [25] consider multiplier architectures for composite fields of the form GF((3^n)^3) using Multi-Value Logic (MVL) and a modified version of the Karatsuba algorithm [18] for polynomial multiplication over GF((3^n)^3). Elements of GF((3^n)^3) are represented as polynomials of maximum degree 2 with coefficients in GF(3^n). Multiplication in GF(3^n) is achieved in the obvious way. Karatsuba multiplication is combined with modular reduction over GF((3^n)^m) to reduce the complexity of their design. Because of the use of MVL, no discussion of modulo 3 arithmetic is given. The authors estimate the complexity of their design for arithmetic over GF((3^2)^3) as 56 mod-3 adders and 67 mod-3 multipliers.
To our knowledge, [29] is the first work that describes GF(3^m) architectures for applications of cryptographic significance, thus we describe it in some detail. The authors describe a representation similar to the one used by [10] to represent their polynomials. They combine all the least significant bits of the coefficients of an element, say A, into one value and all the most significant bits of the coefficients of A into a second value (notice the coefficients of A are elements of GF(3) and thus 2 bits are needed to represent each of them). Thus, A = (a_1, a_0) where a_1 and a_0 are m bits long each. Addition of two polynomials A = (a_1, a_0) and B = (b_1, b_0), with result C = (c_1, c_0), is achieved as
$$t = (a_1 \vee b_0) \oplus (a_0 \vee b_1), \qquad c_1 = (a_0 \vee b_0) \oplus t, \qquad c_0 = (a_1 \vee b_1) \oplus t$$
where ∨ and ⊕ mean the logical OR and exclusive OR operations, respectively. The authors of [29] notice that subtraction and multiplication by 2 are equivalent in characteristic 3 and that they can be achieved as 2 · A = 2 · (a_1, a_0) = −A = −(a_1, a_0) = (a_0, a_1). Multiplication is achieved in a bit-serial manner, by repeatedly shifting the multiplier down by one bit position and shifting the multiplicand up by one bit position. The multiplicand is then added or subtracted depending on whether the least significant bit of the first or second word of the multiplier is equal to one. They notice that with this representation a cubing operation is only as fast as a general multiply, whereas with other implementation methods the cubing operation is much faster. Finally, [29] also discusses the implementation of multiplication in GF((3^m)^6) using the irreducible polynomial Q(y) = y^6 + y + 2. They use the schoolbook method to multiply polynomials of degree 5 with coefficients in GF(3^m) and then reduce modulo Q(y) using 10 additions and 4 doublings in GF(3^m). They provide timings which we further discuss in Section 6.2.
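The two-word representation attributed to [29] can be illustrated with a short Python sketch. The encoding (0 as bits 00, 1 as 01, 2 as 10), the helper names `pack` and `unpack`, and the specific OR/XOR formula are assumptions made here for illustration; the formula is one that is consistent with the description (only OR and XOR, negation by swapping the two words) and is not taken verbatim from [29].

```python
def pack(coeffs):
    """Pack GF(3) coefficients into (hi, lo) integers: 0 -> 00, 1 -> 01, 2 -> 10 (assumed encoding)."""
    hi = lo = 0
    for i, c in enumerate(coeffs):
        c %= 3
        lo |= (c & 1) << i
        hi |= (c >> 1) << i
    return (hi, lo)


def unpack(a, m):
    """Recover the list of m GF(3) coefficients from a (hi, lo) pair."""
    hi, lo = a
    return [2 * ((hi >> i) & 1) + ((lo >> i) & 1) for i in range(m)]


def gf3_add_bitsliced(a, b):
    """Coefficient-wise addition mod 3 on all positions at once, using only OR and XOR."""
    a1, a0 = a
    b1, b0 = b
    t = (a1 | b0) ^ (a0 | b1)
    c1 = (a0 | b0) ^ t
    c0 = (a1 | b1) ^ t
    return (c1, c0)


def gf3_neg_bitsliced(a):
    """Negation (equivalently, multiplication by 2) swaps the two words."""
    a1, a0 = a
    return (a0, a1)
```

With this layout an m-coefficient addition costs a handful of word-wide logic operations, which is why the representation is attractive in software and in bit-serial hardware.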
In [33] a new approach for the design of digit-serial/parallel GF (2 k ) multipliers is introduced. Their approach combines both array-type and parallel multiplication algorithms, where the digit-type algorithms minimize the latency for one multiplication at the expense of extra hardware inside each digit cell. In addition, the authors consider special types of polynomials which allow for efficiency in the modulo p(x) reduction operation. These architectures are generalized in Section 4 to the GF (p m ) case, where p is odd.
Mathematical Background
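For a thorough introduction to finite fields, we refer the reader to [22]. Here, we briefly review the theory that we will need to develop the architectures of this paper. In the following, we will consider the field GF(p^m) generated by an irreducible polynomial $p(x) = x^m + \sum_{i=0}^{m-1} p_i x^i$ over GF(p) of degree m. Let α be a root of p(x); then we can represent A ∈ GF(p^m) in polynomial basis as
$$A(\alpha) = \sum_{i=0}^{m-1} a_i \alpha^i, \qquad a_i \in GF(p). \tag{1}$$
Notice that by assumption p(α) = 0 since α is a root of p(x). Therefore,
$$\alpha^m = -\sum_{i=0}^{m-1} p_i \alpha^i \tag{2}$$
gives an easy way to perform modulo reduction whenever we encounter powers of α greater than m − 1.
In what follows, it is assumed that A, B, C ∈ GF(p^m), with $A = \sum_{i=0}^{m-1} a_i \alpha^i$, $B = \sum_{i=0}^{m-1} b_i \alpha^i$, $C = \sum_{i=0}^{m-1} c_i \alpha^i$, and $a_i, b_i, c_i \in GF(p)$. Addition in GF(p^m) is then given by
$$C(\alpha) \equiv A(\alpha) + B(\alpha) = \sum_{i=0}^{m-1} (a_i + b_i)\alpha^i \tag{3}$$
where the addition a_i + b_i is done in GF(p). We write the multiplication of two elements A, B ∈ GF(p^m) as $C(\alpha) = \sum_{i=0}^{m-1} c_i \alpha^i \equiv A(\alpha) \cdot B(\alpha)$, where the multiplication is understood to happen in the finite field GF(p^m) and all $\alpha^t$, with t ≥ m, can be reduced with (2). Much of the remainder of this paper is devoted to efficient ways to perform multiplication in hardware. We abuse our notation and throughout the text we will write A mod p(α) to mean explicitly the reduction step described previously. Finally, we refer to A as the multiplicand and to B as the multiplier.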
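To make the reduction rule concrete, the following Python sketch multiplies two field elements with the schoolbook method and then folds every power α^t with t ≥ m back below degree m using α^m = −Σ p_i α^i. The list layout (index i holds the coefficient of α^i), the name `poly_mulmod` and the argument `mod_low = [p_0, ..., p_{m-1}]` are assumptions made for illustration only.

```python
def poly_mulmod(a, b, mod_low, p):
    """Multiply a and b in GF(p^m), where p(x) = x^m + sum(mod_low[i] * x^i) is monic.

    a and b are coefficient lists of length m (lowest degree first).
    """
    m = len(mod_low)
    prod = [0] * (2 * m - 1)
    # Schoolbook polynomial product with coefficients in GF(p).
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % p
    # Fold the high powers down, highest first: alpha^t = alpha^(t-m) * alpha^m.
    for t in range(2 * m - 2, m - 1, -1):
        c = prod[t]
        prod[t] = 0
        for i, pi in enumerate(mod_low):
            prod[t - m + i] = (prod[t - m + i] - c * pi) % p
    return prod[:m]
```

For example, in GF(3^2) defined by x^2 + 1 (so `mod_low = [1, 0]`), squaring α gives α^2 = −1 = 2.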
General Architectures for GF(p^m)
This section is concerned with hardware architectures for addition and multiplication in GF (p m ). Inversion can be performed through the Euclidean algorithm or by exponentiation based techniques (see for example [12] ) and we do not treat it any further in this paper.
Adders
Addition in GF (p m ) is performed according to (3) . A parallel adder requires m GF (p) adders and its critical path delay is one GF (p) adder.
Multipliers
There are three different types of architectures used to build GF(p^m) multipliers: array-, digit-, and parallel-multipliers [33]. Array-type (or serial) multipliers process all the coefficients of the multiplicand in parallel in the first step, while the coefficients of the multiplier are processed serially. Array-type multiplication can be performed in two different ways, depending on the order in which the coefficients of the multiplier are processed: the Least Significant Element (LSE) first multiplier and the Most Significant Element (MSE) first multiplier. The latter is not described here for lack of space. Digit-multipliers are likewise divided into Most Significant and Least Significant Digit-Element first multipliers, depending on the order in which the coefficients of the polynomial are processed. Parallel-multipliers have a high critical path delay but only require one clock cycle to complete a whole multiplication. Thus, parallel-multipliers exhibit high throughput and are best suited for applications requiring high speed and relatively small finite fields. However, they are expensive in terms of area when compared to serial multipliers and thus, most of the time, prohibitive for cryptographic applications. We do not discuss parallel-multipliers any further in this paper.
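Least Significant Element (LSE) First Multiplier. The LSE scheme first processes coefficient b_0 of the multiplier and continues with the remaining coefficients one at a time in ascending order. Hence, multiplication according to this scheme can be performed in the following way:
$$C \equiv AB \bmod p(\alpha) \equiv b_0 A + b_1 (A\alpha \bmod p(\alpha)) + b_2 (A\alpha^2 \bmod p(\alpha)) + \cdots + b_{m-1} (A\alpha^{m-1} \bmod p(\alpha))$$
Algorithm 1 LSE Multiplier
Require: $A = \sum_{i=0}^{m-1} a_i \alpha^i$, $B = \sum_{i=0}^{m-1} b_i \alpha^i$, where $a_i, b_i \in GF(p)$
Ensure: $C \equiv A \cdot B = \sum_{i=0}^{m-1} c_i \alpha^i$, where $c_i \in GF(p)$
C ← 0
for i = 0 to m − 1 do
    C ← b_i A + C
    A ← Aα mod p(α)
end for
Return (C)
The accumulation of the partial product has to be done with a polynomial adder. This multiplier computes the operation according to Algorithm 1. The adapted substructure sharing technique [16] can be used to compute the LSE multiplication in a more efficient way if the LSE multiplier is used as a building block for a larger system, such as a system with broadcast structure.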
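Reduction mod p(α). In LSE multipliers the quantity Wα, where $W(\alpha) = \sum_{i=0}^{m-1} w_i \alpha^i$ with $w_i \in GF(p)$, has to be reduced mod p(α). Multiplying W by α,
$$W\alpha = \sum_{i=0}^{m-1} w_i \alpha^{i+1} = w_{m-1}\alpha^m + \sum_{i=0}^{m-2} w_i \alpha^{i+1}.$$
Using (2) and re-writing the index of the second summation, Wα mod p(α) can then be calculated as follows:
$$W\alpha \bmod p(\alpha) \equiv (-p_0 w_{m-1}) + \sum_{i=1}^{m-1} (w_{i-1} - p_i w_{m-1})\alpha^i \tag{4}$$
where all coefficient arithmetic is done modulo p. Using (4) we can write expressions for A and C in Algorithm 1 at iteration i as follows:
$$C^{(i)} = b_i A^{(i-1)} + C^{(i-1)}, \qquad A^{(i)} = A^{(i-1)}\alpha \bmod p(\alpha) = \left(-p_0 a^{(i-1)}_{m-1}\right) + \sum_{j=1}^{m-1} \left(a^{(i-1)}_{j-1} - p_j a^{(i-1)}_{m-1}\right)\alpha^j$$
with $C^{(-1)} = 0$ and $A^{(-1)} = A$. As a final remark, notice that if C is initialized to a value different from 0, say I, then Algorithm 1 computes C ≡ A · B + I mod p(α). This multiply-accumulate operation turns out to be very useful in elliptic curve systems and it is obtained at no extra cost.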
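The LSE scheme can be modelled in software as follows. This is a behavioural sketch only, assuming the same list layout as the field definition (index i holds the coefficient of α^i, `mod_low = [p_0, ..., p_{m-1}]`); the names `times_alpha`, `lse_multiply` and the optional `acc` argument are hypothetical and introduced here to show the multiply-accumulate behaviour.

```python
def times_alpha(w, mod_low, p):
    """Return W * alpha mod p(alpha): shift up by one and fold w_{m-1} * alpha^m back."""
    top = w[-1]
    shifted = [0] + w[:-1]          # coefficient i now holds w_{i-1}
    return [(s - pi * top) % p for s, pi in zip(shifted, mod_low)]


def lse_multiply(a, b, mod_low, p, acc=None):
    """Least Significant Element first multiplication; returns A*B + acc mod p(alpha)."""
    m = len(mod_low)
    c = list(acc) if acc is not None else [0] * m
    a = list(a)
    for i in range(m):
        # C <- b_i * A + C  (m GF(p) multiplications and m GF(p) additions)
        c = [(ci + b[i] * ai) % p for ci, ai in zip(c, a)]
        # A <- A * alpha mod p(alpha)
        a = times_alpha(a, mod_low, p)
    return c
```

Passing a non-zero `acc` reproduces the multiply-accumulate behaviour noted above: the loop body is unchanged, only the initial value of C differs.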
Area/Time Complexity of LSE Multipliers. LSE multipliers take m iterations to output the product C ≡ A · B mod p(α). In each iteration, the following operations are performed: 1 multiplication of a GF(p) element by a GF(p^m) element (requires m GF(p) multipliers), 1 GF(p^m) addition (requires m GF(p) adders), 1 multiplication by α (implemented as a GF(p) coefficient shift), and 1 modulo p(α) reduction. This last operation could be implemented according to (4), and thus, it would require (r − 1) GF(p) multipliers and (r − 2) adders for a fixed r-nomial (where r is the number of non-zero coefficients in the irreducible polynomial p(x)). However, it could be desired to load the modulus on demand, in which case one would need m GF(p) multipliers and (m − 1) adders.
Both the area and time complexities of Algorithm 1 are summarized in Table 1 in terms of GF(p) adders and multipliers, for two types of irreducible polynomials. This unusual measure is independent of technology and thus most general. One immediate advantage of estimating area in terms of GF(p) adders (multipliers) is that we do not need to care about the way these are implemented. In particular, there are many implementation choices depending on the application and design criteria [9, 13, 30]. Section 6.1 gives specific complexity numbers for an FPGA implementation of GF(3^m) arithmetic in terms of both Look-Up Tables (LUTs), also known as Configurable Logic Blocks (CLBs), and flip-flops (FFs), thus taking into account the way GF(p) arithmetic is implemented on the target technology. In Table 1, ADD and MUL refer to the area and delay of a GF(p) adder and multiplier, respectively. We have not taken into account the delays or area requirements of storage elements (such as those needed to implement a shift register) or routing elements (such as those used for interconnections in FPGAs). In addition, we do not make any distinction between general and constant GF(p) multipliers, i.e., we assume their complexities are the same. Finally, general irreducible polynomials refer to the case in which one wants to be able to change the irreducible polynomial on demand.
Digit-Serial/Parallel Multipliers. LSE multipliers process the coefficients of A in parallel, while the coefficients of B are processed serially. Hence, these multipliers are area-efficient and suitable for low-speed applications. Digit multipliers, introduced in [33] for fields GF(2^k), are a trade-off between speed, area, and power consumption. This is achieved by processing several of B's coefficients at the same time. The number of coefficients that are processed in parallel is defined to be the digit-size and we denote it by D. For a digit-size D, we can denote by d = ⌈m/D⌉ the total number of digits in a polynomial of degree m − 1. Then, we can re-write the multiplier as $B = \sum_{i=0}^{d-1} B_i \alpha^{Di}$, where
$$B_i = \sum_{j=0}^{D-1} b_{Di+j}\alpha^j, \qquad 0 \le i \le d-1 \tag{5}$$
and we assume that B has been padded with zero coefficients such that $b_i = 0$ for m − 1 < i < d · D.
In the following, a generalized digit-serial/parallel multiplication algorithm is introduced. We named this algorithm Least Significant Digit-Element first multiplier (LSDE), where the word element was introduced to clarify that the digits correspond to groups of GF(p) coefficients, in contrast to [33] where the digits were groups of bits. The LSDE is an extension of the LSE multiplier. Using this digit decomposition of B, the product C ≡ AB mod p(α) in this scheme can be calculated as follows:
$$C \equiv AB \bmod p(\alpha) \equiv \left[ B_0 A + B_1 (A\alpha^{D} \bmod p(\alpha)) + \cdots + B_{d-1} (A\alpha^{D(d-1)} \bmod p(\alpha)) \right] \bmod p(\alpha)$$
This is summarized in Algorithm 2.
Algorithm 2 LSDE Multiplier
Require: $A = \sum_{i=0}^{m-1} a_i \alpha^i$, where $a_i \in GF(p)$, and $B = \sum_{i=0}^{\lceil m/D \rceil - 1} B_i \alpha^{Di}$, where $B_i$ is as defined in (5)
Ensure: $C \equiv A \cdot B = \sum_{i=0}^{m-1} c_i \alpha^i$, where $c_i \in GF(p)$
C ← 0
for i = 0 to ⌈m/D⌉ − 1 do
    C ← B_i A + C
    A ← Aα^D mod p(α)
end for
Return (C mod p(α))
The adapted substructure sharing technique can also be used for the LSDE, as in the case of the LSE multiplier. We end this section by noticing that, as in Algorithm 1, if C is initialized to I in Algorithm 2, we can obtain as an output A · B + I mod p(α) at no additional (hardware or delay) cost. This operation, known as a multiply/accumulate operation, is very useful in elliptic curve based systems.
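A behavioural model of the digit-serial scheme is sketched below. It assumes the same list layout as before (index i holds the coefficient of α^i, `mod_low = [p_0, ..., p_{m-1}]`); the names `reduce_poly` and `lsde_multiply` are hypothetical. The accumulator is kept unreduced with m + D − 1 coefficients and reduced once at the end, which is one possible reading of the final "C mod p(α)" step, not a statement about the actual circuit.

```python
def reduce_poly(w, mod_low, p):
    """Reduce w (any length) modulo the monic p(x) = x^m + sum(mod_low[i] * x^i)."""
    m = len(mod_low)
    w = list(w) + [0] * max(0, m - len(w))
    for t in range(len(w) - 1, m - 1, -1):
        c, w[t] = w[t], 0
        for i, pi in enumerate(mod_low):
            w[t - m + i] = (w[t - m + i] - c * pi) % p
    return w[:m]


def lsde_multiply(a, b, mod_low, p, D, acc=None):
    """Least Significant Digit-Element first multiplication with digit size D."""
    m = len(mod_low)
    d = -(-m // D)                      # d = ceil(m / D) digits
    b = list(b) + [0] * (d * D - m)     # pad B with zero coefficients
    c = (list(acc) if acc is not None else [0] * m) + [0] * (D - 1)
    a = list(a)
    for i in range(d):
        digit = b[D * i: D * (i + 1)]   # B_i: D coefficients of the multiplier
        for j, bj in enumerate(digit):  # C <- B_i * A + C (unreduced)
            for k, ak in enumerate(a):
                c[j + k] = (c[j + k] + bj * ak) % p
        a = reduce_poly([0] * D + a, mod_low, p)   # A <- A * alpha^D mod p(alpha)
    return reduce_poly(c, mod_low, p)
```

Setting `D = 1` makes the loop process one coefficient per iteration, so the model degenerates to the LSE scheme; larger `D` trades more work per iteration for fewer iterations.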
Reduction mod p(α) for Digit Multipliers. In LSDE multipliers the product $W\alpha^D \bmod p(\alpha)$ occurs. As in the LSE multiplier case, one can derive equations for the modular reduction for particular irreducible polynomials p(α). However, it is more interesting to search for polynomials that minimize the complexity of the reduction operation. In coming up with these optimum irreducible polynomials we use two theorems from [33], adapted to the case of GF(p^m) fields with p odd.
Theorem 1. Assume that the irreducible polynomial is of the form $p(\alpha) = \alpha^m + p_k\alpha^k + \sum_{j=0}^{k-1} p_j \alpha^j$, with k < m. For t ≤ m − 1 − k, the degree of $\alpha^{m+t}$ can be reduced to be less than m in one step with the following equation:
$$\alpha^{m+t} \bmod p(\alpha) = -p_k\alpha^{k+t} - \sum_{j=0}^{k-1} p_j \alpha^{j+t}$$
Theorem 2. For digit multipliers with digit-element size D, when D ≤ m − k the degree of the intermediate results in Algorithm 2 can be reduced to be less than m in one step.
Theorems 1 and 2 implicitly say that for a given irreducible polynomial $p(\alpha) = \alpha^m + p_k\alpha^k + \sum_{j=0}^{k-1} p_j \alpha^j$, the digit-element size will depend on the value of k.
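The bound D ≤ m − k can be checked numerically with a small sketch. The function below applies the substitution of Theorem 1 exactly once to every term of degree m or higher; the names `one_step_reduce` and `degree`, and the example polynomial x^5 + 2x + 2 over GF(3) (m = 5, k = 1), are choices made here for illustration.

```python
def one_step_reduce(w, mod_low, p):
    """Replace each alpha^(m+t) in w by -sum_j p_j alpha^(j+t), a single time."""
    m = len(mod_low)
    out = list(w[:m]) + [0] * max(0, len(w) - m)
    for t, c in enumerate(w[m:]):
        for j, pj in enumerate(mod_low):
            out[j + t] = (out[j + t] - c * pj) % p
    return out


def degree(w):
    """Index of the highest non-zero coefficient, or -1 for the zero polynomial."""
    return max((i for i, c in enumerate(w) if c), default=-1)


m, k = 5, 1
mod_low = [2, 2, 0, 0, 0]          # p(x) = x^5 + 2x + 2 over GF(3)
a = [1] * m                        # an arbitrary element with all coefficients set
ok = degree(one_step_reduce([0] * (m - k) + a, mod_low, 3))        # D = m - k
too_big = degree(one_step_reduce([0] * (m - k + 1) + a, mod_low, 3))  # D = m - k + 1
```

With D = m − k = 4 a single substitution already yields a result of degree below m, whereas D = 5 leaves a term of degree m that would need a second reduction step.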
Area/Time Complexity of LSDE Multipliers. Before estimating the complexity of the LSDE multiplier, it is helpful to obtain equations to describe the values of A and C at iteration i in Algorithm 2. Thus, assume that B i is as in (5) 
where C (−1) = 0, A (−1) = A, and
Now it is easy to see that in each iteration one requires mD multipliers in parallel and mD adders to compute (8), with a critical path of D adders and one multiplier. We notice, however, that it is possible to improve the critical path delay of the LSDE multiplier by using a binary tree of adders. Using this technique one would reduce the length of the critical path from D GF(p) adders and one GF(p) multiplier to ⌈log_2(D + 1)⌉ adders and one multiplier. We use this as our complexity for the critical path.
The computation of (9) requires only D(k + 1) multipliers (notice that the second term of (9) looks exactly the same as D^(i) in (10), except that the limits in the summation are changed) and at most Dk adders. In the case in which p(α) is an r-nomial, the complexities reduce to (r − 1)D multipliers and at most (r − 2)D adders (notice that the first summation in (9) starts at j = D, thus, the first D coefficients resulting from the second summation do not need to be added to anything). We say at most because, depending on the values of D and k, some adders may be saved. These results are summarized in Table 2.

Table 2. Area complexity and critical path delay of LSDE multiplier.

Table 2 makes the same assumptions as for the LSE case, i.e., ADD and MUL refer to the area and delay of a GF(p) adder and multiplier, respectively, delay or area of storage elements are not taken into account, and no distinction is made between general and constant GF(p) multipliers. We end by noticing that an LSDE multiplier with D = 1 is equivalent to an LSE multiplier. Our complexity estimates verify this for D = 1 and k = m − 1 in Table 2.
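The multiplier count D(k + 1) quoted above can be seen by expanding the shift-and-reduce step A ← Aα^D mod p(α) with Theorem 1. The following is a derivation sketch under the assumption D ≤ m − k; it is not a reproduction of the paper's numbered equations, and the index names j and t are chosen here for illustration.

```latex
\begin{align*}
A\alpha^{D}
  &= \sum_{j=D}^{m-1} a_{j-D}\,\alpha^{j}
     \;+\; \sum_{t=0}^{D-1} a_{m-D+t}\,\alpha^{m+t} \\
  &\equiv \sum_{j=D}^{m-1} a_{j-D}\,\alpha^{j}
     \;-\; \sum_{t=0}^{D-1} a_{m-D+t}
       \Bigl( p_k\,\alpha^{k+t} + \sum_{j=0}^{k-1} p_j\,\alpha^{j+t} \Bigr)
     \pmod{p(\alpha)}
\end{align*}
```

Each of the D overflowing coefficients is multiplied by the k + 1 low-order coefficients of p(α), which gives D(k + 1) GF(p) multiplications, and since k + t ≤ m − 1 for every t ≤ D − 1, no second reduction step is needed.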
Comments on Irreducible Polynomials for GF (p m )
From Theorems 1 and 2, it is clear that the irreducible polynomial should be chosen carefully. For fields GF(p^m) with odd prime characteristic it is often possible to choose irreducible binomials p(x) = x^m − ω, ω ∈ GF(p). This is particularly interesting since binomials are never irreducible in characteristic 2 fields. Another interesting property of binomials is that they are optimum from the point of view of Theorem 1. In particular, for any irreducible binomial p(x) = x^m − ω, k = 0 and D ≤ m in Theorem 2, which means that even in the degenerate case where D = m (i.e., a parallel multiplier) one is able to perform the reduction in one step. In addition, reduction is virtually free, corresponding to just a few GF(p) multiplications (this follows from the fact that α^m = ω). A specific sub-class of these fields, where q is a prime of the form q = p = 2^n − c, with c "small", has recently been proposed for cryptographic applications in [4]. We notice that the existence of irreducible binomials has been completely established, as Theorem 3 shows. When irreducible binomials cannot be found, one searches in incremental order for irreducible trinomials, quadrinomials, etc. In [15] von zur Gathen and Nöcker conjecture that the minimal number of terms σ_q(m) in irreducible polynomials of degree m in GF(q), q a power of a prime, satisfies, for all m ≥ 1, σ_2(m) ≤ 5 and σ_q(m) ≤ 4 for q ≥ 3. This conjecture has been verified for q = 2 and m ≤ 10000 [6, 11, 15, 36-38] and for q = 3 and m ≤ 539 [14].
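The claim that reduction modulo a binomial costs only a few GF(p) multiplications can be illustrated with the short sketch below. The function name `reduce_binomial` and the example field GF(7^3) with p(x) = x^3 − 2 are assumptions for illustration (2 is not a cube modulo 7, so this cubic has no root in GF(7) and is therefore irreducible).

```python
def reduce_binomial(w, m, omega, p):
    """Reduce w modulo x^m - omega over GF(p), using alpha^m = omega."""
    out = [0] * m
    scale = 1                                 # omega^0 for the low block
    for start in range(0, len(w), m):
        for i, c in enumerate(w[start:start + m]):
            out[i] = (out[i] + scale * c) % p  # one constant multiply per coefficient
        scale = (scale * omega) % p            # next block carries another factor omega
    return out
```

For an unreduced product of degree at most 2m − 2 only the second block is non-empty, so the whole reduction is m − 1 multiplications by the constant ω followed by m − 1 additions.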
By choosing irreducible polynomials with the least number of non-zero coefficients, one can reduce the area complexity of the LSDE multiplier (this follows directly from Table 2). We point out that by choosing irreducible polynomials such that their non-zero coefficients are all equal to p − 1, one can further reduce the complexity, since all the multiplications by −p_j in (9) reduce to multiplications by 1. Notice that there are no general existence criteria for irreducible trinomials over any field GF(p^m). The most recent advances in this area are the results of Loidreau [23], where a table that characterizes the parity of the number of factors in the factorization of a trinomial over GF(3) is given, and the necessary (but not sufficient) irreducibility criteria for trinomials introduced by von zur Gathen in [14]. Neither reference provides tables of irreducible polynomials.
Case Study: GF(3^m) Arithmetic
FPGAs are reconfigurable hardware devices whose basic logic elements are Look-Up Tables (LUTs), also called Configurable Logic Blocks (CLBs), flip-flops (FFs), and, for modern devices, memory elements [1, 2, 35]. The LUTs are used to implement Boolean functions of their inputs, that is, functions traditionally implemented with logic gates. In the particular case of the XCV1000E-8-FG1156 and the XC2VP20-7-FF1156, the basic building blocks are LUTs with 4 input bits and 1 output bit. Since two GF(3) operands encoded with 2 bits each occupy exactly 4 input bits, all basic arithmetic operations in GF(3) (add, subtract, and multiply) can be done with 2 LUTs, where each LUT generates one bit of the output.
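The two-LUT claim can be made concrete by generating the truth tables directly. The sketch below assumes a plain binary encoding of GF(3) elements (0 as 00, 1 as 01, 2 as 10, with 11 unused and mapped to 0); the encoding, the 16-bit integer used to hold each truth table, and the names `lut_pair` and `lut_eval` are illustrative assumptions, not the encoding used in the prototype.

```python
def lut_pair(op):
    """Build two 16-entry truth tables (as 16-bit integers) for a GF(3) operation.

    The 4-bit LUT address is (a1 a0 b1 b0); one table yields the high output
    bit and the other the low output bit.
    """
    hi = lo = 0
    for idx in range(16):
        a, b = idx >> 2, idx & 3
        if a == 3 or b == 3:
            continue                      # unused code points: leave output 0
        r = op(a, b) % 3
        hi |= (r >> 1) << idx
        lo |= (r & 1) << idx
    return hi, lo


def lut_eval(tables, a, b):
    """Look up the result for operands a, b in {0, 1, 2}."""
    idx = (a << 2) | b
    hi, lo = tables
    return 2 * ((hi >> idx) & 1) + ((lo >> idx) & 1)


ADD = lut_pair(lambda a, b: a + b)
SUB = lut_pair(lambda a, b: a - b)
MUL = lut_pair(lambda a, b: a * b)
```

Each returned integer corresponds to the 16-bit initialization content of one 4-input LUT, so every GF(3) operation maps to exactly two such tables.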
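Cubing in GF(3^m)
It is well known that for A ∈ GF(p^m) the computation of A^p is linear. In the particular case of p = 3, we can write the Frobenius map as:
$$A^3 \equiv \left( \sum_{i=0}^{m-1} a_i \alpha^i \right)^{3} \bmod p(\alpha) = \sum_{i=0}^{m-1} a_i \alpha^{3i} \bmod p(\alpha) \tag{11}$$
Equation (11) can in turn be written as the sum of three terms (where we have re-written the indices in the summation):
$$A^3 \equiv T + U + V \bmod p(\alpha), \qquad T = \sum_{\substack{i=0 \\ i \equiv 0 \bmod 3}}^{m-1} a_{i/3}\,\alpha^{i}, \quad U = \sum_{\substack{i=m \\ i \equiv 0 \bmod 3}}^{2m-1} a_{i/3}\,\alpha^{i}, \quad V = \sum_{\substack{i=2m \\ i \equiv 0 \bmod 3}}^{3(m-1)} a_{i/3}\,\alpha^{i}$$
Notice that only U and V need to be reduced mod p(α). We further assume that p(x) = x^m + p_t x^t + p_0 with t < m/3. This assumption is valid in terms of the existence of such irreducible trinomials, as shown in Section 5.2. Thus,
$$U \equiv \sum_{\substack{i=m \\ i \equiv 0 \bmod 3}}^{2m-1} a_{i/3}\,\alpha^{i-m}\left(-p_t\alpha^{t} - p_0\right) \bmod p(\alpha), \qquad V \equiv \sum_{\substack{i=2m \\ i \equiv 0 \bmod 3}}^{3(m-1)} a_{i/3}\,\alpha^{i-2m}\left(\alpha^{2t} - p_t p_0 \alpha^{t} + 1\right) \bmod p(\alpha)$$
where we have made use of the fact that $(-p_t\alpha^t - p_0)^2 = \alpha^{2t} - p_t p_0 \alpha^t + 1$ in GF(3), since $p_t^2 = p_0^2 = 1$ and 2 ≡ −1 mod 3. It can be shown that U and V can be reduced to be of degree less than m in one extra reduction step. To estimate the complexity of this cubing circuit, we assume that p(α) is a fixed irreducible trinomial with t < m/3, i.e., that multiplications in GF(3) (for example multiplying by −p_t p_0) can be handled by adders and subtracters. Then, it can be shown that one needs on the order of 2m adders/subtracters to perform a cubing operation in GF(3^m).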
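The linearity of cubing can be checked with a small software model: cubing only moves coefficient a_i to position 3i and then reduces, with no coefficient multiplications between operands. The sketch below uses a generic reduction loop rather than the two-step U/V structure, and the example field GF(3^5) with p(x) = x^5 + 2x + 2 (a trinomial with t = 1 < m/3) is chosen here for illustration; the function names are hypothetical.

```python
def reduce_poly(w, mod_low, p=3):
    """Reduce w modulo the monic p(x) = x^m + sum(mod_low[i] * x^i)."""
    m = len(mod_low)
    w = list(w) + [0] * max(0, m - len(w))
    for t in range(len(w) - 1, m - 1, -1):
        c, w[t] = w[t], 0
        for i, pi in enumerate(mod_low):
            w[t - m + i] = (w[t - m + i] - c * pi) % p
    return w[:m]


def mulmod(a, b, mod_low, p=3):
    """General field multiplication, used here only as a reference."""
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % p
    return reduce_poly(prod, mod_low, p)


def cube_gf3m(a, mod_low):
    """A^3 in GF(3^m): spread a_i to alpha^(3i) (since a_i^3 = a_i in GF(3)), then reduce."""
    m = len(a)
    spread = [0] * (3 * (m - 1) + 1)
    for i, ai in enumerate(a):
        spread[3 * i] = ai % 3
    return reduce_poly(spread, mod_low, 3)


MOD_LOW = [2, 2, 0, 0, 0]   # x^5 + 2x + 2 over GF(3)
```

Comparing `cube_gf3m(a, MOD_LOW)` against two calls to `mulmod` for the same `a` gives identical results, while the cubing path performs only additions and subtractions of coefficients.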
Irreducible Polynomials over GF (3)
Following the criteria of Section 4.3 for choosing irreducible polynomials, we tried to find irreducible binomials first. Unfortunately, the only irreducible binomial over GF(3) is x^2 + 1, thus we considered irreducible trinomials. Notice that x^m + x^t + 1 is never irreducible over GF(3) since 1 is always a root of it. Thus, we only searched for irreducible trinomials of the following forms: x^m − x^t − 1 or x^m ± x^t ∓ 1. For 2 ≤ m ≤ 255, we exhaustively searched for these trinomials (see Tables 6, 7, and 8 in Appendix A). There are only 23 degrees m in the range above for which we were unable to find trinomials (which agrees with the findings in [14]); for these degrees quadrinomials can be used instead. Of these quadrinomials only 4 correspond to m prime (149, 197, 223, 233). Prime m is the most commonly used degree in cryptographic applications. We notice that of the 50 primes in the above range which had trinomials, we were not able to find trinomials with t < m/3 for 9 of them (18 %).
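A search of this kind can be sketched in a few lines for small degrees. The code below tests irreducibility by brute-force trial division by every monic polynomial of degree at most m/2, which is only practical for small m; covering the full range 2 ≤ m ≤ 255 would need a proper irreducibility test. All function names are hypothetical, and this is not the search procedure used to produce the tables.

```python
from itertools import product


def poly_rem(f, g, p=3):
    """Remainder of f divided by the monic polynomial g over GF(p)."""
    f = list(f)
    dg = len(g) - 1
    for t in range(len(f) - 1, dg - 1, -1):
        c = f[t]
        if c:
            for i, gi in enumerate(g):
                f[t - dg + i] = (f[t - dg + i] - c * gi) % p
    return f[:dg]


def is_irreducible(f, p=3):
    """Trial division by all monic polynomials of degree 1 .. deg(f) // 2."""
    m = len(f) - 1
    for d in range(1, m // 2 + 1):
        for low in product(range(p), repeat=d):
            g = list(low) + [1]
            if not any(poly_rem(f, g, p)):
                return False
    return True


def irreducible_trinomials(m):
    """Return (t, c_t, c_0) for irreducible x^m + c_t x^t + c_0 over GF(3).

    Only the forms x^m - x^t - 1 and x^m +/- x^t -/+ 1 are tried, because
    x^m + x^t + 1 always has the root 1.
    """
    found = []
    for t in range(1, m):
        for ct, c0 in ((2, 2), (1, 2), (2, 1)):
            f = [c0] + [0] * (m - 1) + [1]
            f[t] = ct
            if is_irreducible(f):
                found.append((t, ct, c0))
    return found
```

For instance, `irreducible_trinomials(5)` includes `(1, 2, 2)`, i.e. x^5 − x − 1, which also satisfies the t < m/3 condition used by the cubing circuit.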
GF(3^m) Prototype Implementation and Comparisons
Figure 1 shows a block diagram of the prototyped arithmetic unit (AU). Notice that in Figure 1, all bus-widths correspond to how many GF(3) elements can be carried by the bus, i.e., if we write m, then it is understood that the bus is 2m bits wide. The AU consists of an LSDE multiplier and a cubing circuit.
Efficient GF (p m ) Arithmetic Architectures for Cryptographic Applications 13
The multiplier and the cubing circuit support the computation of field additions, squares, multiplications, and inversions. For addition and subtraction we take advantage of the multiply/accumulate capabilities of the LSDE multiplier and cubing circuit. In other words, the addition C = A + B is done by first computing A · 1 and then adding to it the product B · 1. This takes two clock cycles. However, if operand A is already in the accumulator of the multiplier, one can compute C = B · 1 + A in one clock cycle. This eliminates the need for an adder. Subtractions are computed in a similar manner, i.e., C = A − B is done by first computing A · 1 and then adding to it the product (−1) · B, or alternatively as C = (−1) · B + A.
AU prototypes were developed to verify the suitability of the architecture shown in Figure 1 for reconfigurable FPGA logic and to compare the efficiency of GF(3^m) and GF(2^m) AUs. The prototypes were coded in VHDL at a very low level. The VHDL code was synthesized using Synopsys FPGA Compiler 3.7.1 and the component placement and routing was done using Xilinx Design Manager 4.2.03i. The prototypes were synthesized and routed for the Xilinx XCV1000-8-FG1156 and XC2VP20-7-FF1156 FPGAs. The XCV1000E-8-FG1156 prototype allowed us to compare our AU implementations against the AU for GF(2^m) used in the EC processor (ECP) from [27], which is one of the fastest ECPs implemented in FPGA logic for EC defined over fields GF(2^167). The XC2VP20-7-FF1156 prototype allowed us to verify the speed of our AU for one of the newest families of Xilinx FPGAs. Three implementations were developed, which support the fields GF(3^97), GF(2^151), and GF(2^241). The fields GF(3^97) and GF(2^241) are used in Weil and Tate pairing schemes for systems with comparable degrees of security (see [10, 5, 29]). The field GF(2^151) offers security comparable to that of GF(3^97) for cryptosystems based on the EC discrete logarithm problem.
GF(3^m) Complexity Estimates
Table 3 shows the complexity estimates for the AU shown in Figure 1. The estimates assume the use of optimum irreducible polynomials and give the register complexity in terms of the number of flip-flops. Note that the register estimates do not account for registers used to reduce the critical path delay of a multiplier, a technique known as pipelining. This technique was used to reduce the critical path delay of the prototype implementations. The complexity estimates are based on the following:
1. From Table 2, the digit multiplication core and accumulator circuits require mD GF(3) multipliers and mD GF(3) adders. This circuit stores the result in two (m + D − 1)-bit registers. An m-bit register requires m flip-flops (FFs).
2. The estimates for the Aα^{Di} mod p(α) circuit assume that the circuit contains two m-bit multiplexers that select between the element A and the element Aα^{Di} mod p(α). An m-bit multiplexer requires m LUTs.
3. For programmable optimum irreducible trinomials, the circuitry that generates Aα^{Di} mod p(α) requires 2D GF(3) multipliers and D adders (see Table 2). This circuit stores the result in two m-bit registers and the coefficients of p(x) in two 2r-bit registers (r = 3 for trinomials).
4. The coefficients of B are fed in by two m-bit parallel-in/serial-out shift registers. Each shift register contains m 2:1 multiplexers and m registers.
5. The cubing circuit requires 2m GF(3) adders.
6. The complexity for the GF(2^m) AU is estimated according to [26]. It is also assumed that the GF(2^m) AU contains an LSD multiplier and a squarer.
The estimates that we obtain from our models are very accurate when compared to the actual measured complexities. This validates our models and assumptions.
Results
Table 5 presents the timings obtained for our three prototypes. We have tried to implement our designs in such a way that we can make a meaningful comparison. Thus, although the clock rates are not exactly the same between the different designs (the clock rate depends on the critical path of the AU, which is different for each circuit), they differ by no more than 10 %. The platforms are the same and we chose the same digit sizes for both the GF(2^m) and GF(3^m) architectures. The results are consistent with expectations: for the same digit size (D = 16), the GF(3^97) design is about twice as big as the GF(2^241) design and more than 3 times the size of the GF(2^151) AU. This is offset by the gain in performance. At similar clock rates the GF(3^97) design is 2.7 times faster than the corresponding GF(2^241) AU and 1.4 times faster than the GF(2^151) one. It is clear that by using more hardware resources for GF(3^97) we achieve better performance than with characteristic two fields. In particular, by choosing the same digit size for both fields, we implicitly process twice as many bits of the multiplier in GF(3^97) as in the GF(2^m) case (remember that the E in LSDE refers to elements of GF(p) and not bits as in the GF(2^m) case). Table 5 also includes the results from [29]. We do not think it is possible to make a meaningful comparison, other than to point out that by coding directly in VHDL, one can achieve large improvements in the performance of FPGA-based implementations.
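As a rough sanity check on the relative speeds, one can count the number of digit iterations d = ⌈m/D⌉ that Algorithm 2 needs for each field at D = 16. This sketch counts loop iterations only; it ignores clock-rate differences, operand loading and pipeline stages, so the measured ratios in Table 5 need not coincide with the ratios of these counts. The function name `lsde_iterations` is hypothetical.

```python
def lsde_iterations(m, D):
    """Number of digit iterations of the LSDE loop: ceil(m / D)."""
    return -(-m // D)


fields = {"GF(3^97)": 97, "GF(2^151)": 151, "GF(2^241)": 241}
counts = {name: lsde_iterations(m, 16) for name, m in fields.items()}
```

The extension degree of GF(3^97) is much smaller than that of the binary fields of comparable security, so it needs fewer iterations at the same digit size, at the price of wider (2-bit) coefficients per digit element.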
Conclusions
In this paper, we have generalized the finite field multipliers of [33] to the odd characteristic case. We have also presented multiplication algorithms for both serial and digit-based architectures. Finally, we have presented a general framework to choose irreducible polynomials that reduce the cost of the modulo reduction operation. More importantly, we have shown that it is possible to achieve considerable performance from FPGA implementations of non-binary finite fields. However, from our discussion in the previous section, we conclude that fields of characteristic 2 cannot be surpassed by other fields if one considers both area and time performance measures.
