Abstract: With the explosive growth of electronic commerce, dedicated cryptographic processors are becoming essential, since general-purpose processors cannot provide the performance and functionality that are needed. This paper proposes an architecture for a versatile Galois field $GF(2^m)$ processor for cryptographic applications. The processor uses both canonical and triangular bases for the representation and manipulation of field elements. The variable-dimension datapath of the processor is versatile enough to meet the varying requirements of different applications and environments. To provide flexibility for different cryptographic applications, an instruction set architecture is designed. Finally, a prototype VLSI implementation of the Galois field processor is presented and discussed.

INTRODUCTION
In the next few years, electronic commerce (e-commerce) is expected to grow at an exponential rate to facilitate electronic financial transactions among a wide spectrum of users, including banks, businesses, and individuals, over open networks. The use of open networks, however, poses a variety of security threats concerning authentication, data integrity, confidentiality, etc. Security breaches in e-commerce may cause not only direct financial losses to the parties engaged in financial transactions, but also long-term damage to the users' confidence in the underlying infrastructure. As a result, it is important to use appropriate security schemes to ensure an adequate level of security in e-commerce [1], [2].
Many of the cryptographic security schemes that can be used in e-commerce or other applications rely on computations in the finite (or Galois) field $GF(2^m)$, which has $2^m$ elements and supports basic arithmetic operations under the closure condition [3]. These operations are different from the conventional integer and floating point operations and are not directly supported by today's general-purpose processors. As a result, $GF(2^m)$-based cryptographic schemes are implemented by emulating field operations on general-purpose processors. This does not always give the desired performance in terms of response time, especially when a very large finite field is to be used to achieve an increased level of security [4].
To speed up the computations in finite fields, a few dedicated processors and logic units were developed in the past [5], [6], [7]. Most of these processors operate over a fixed field. In other words, if there is a change in the field parameters (e.g., the field size or the irreducible polynomial defining the representation of the field elements), a new processor is needed. There are situations where different sets of parameters are to be used; for example, cryptographic schemes of different strengths may be required for domestic and international transactions. If the field parameters can be changed without switching to a new processor, users gain the flexibility to use the same processor/device in a number of security environments, and the cost is reduced. The few publications that deal with varying field parameters include [8], [9], [10]. However, they mostly address field multiplication only and do not present a complete processor.
In this paper, we present a hardware architecture and the related VLSI implementation of a versatile processor for computations in the field $GF(2^m)$. This GF processor supports user-specified field dimensions and field defining polynomials and is capable of directly computing various field operations, including inversion, multiplication, accumulation, and basis transformations. It can be used either as a stand-alone arithmetic processor for computations in finite fields or as a coprocessor with other general-purpose or control processors for developing high speed encryption/decryption devices.
The outline of the paper is as follows: In the next section, the representation of the field elements in the GF processor is discussed and the processor core architecture is presented. In Section 3, the computational algorithms that led to the development of the processor and their operation are discussed. Section 4 presents the GF processor datapath optimization techniques and its interface to external devices. An instruction set architecture supporting various finite field operations and a prototype VLSI implementation of the GF processor with some application examples are given in Section 5, while some concluding remarks are given in Section 6.
DATA REPRESENTATION AND PROCESSOR CORE ARCHITECTURE
The main building blocks of the GF processor are its datapath and control unit. The datapath is further divided into two units, namely, the field operation unit (FOU) and the input/output unit (IOU), as shown in Fig. 1. The FOU realizes various arithmetic operations over a range of field dimensions and irreducible polynomials defining the representation of the field elements. The IOU provides a convenient mechanism for loading and unloading data to and from the FOU. The IOU also brings in processor instructions, which are decoded by the control unit.
Data Representation
The area and time complexities of the FOU units are significantly affected by the way the field elements are represented. In the logarithmic representation, field elements are expressed as exponents of a primitive element. While this representation requires only simple integer addition/subtraction modulo $2^m - 1$ for field multiplication/inversion, it requires complicated discrete log and anti-log operations or special-purpose hardware for $GF(2^m)$ addition/subtraction. A log/anti-log module designed for a certain field cannot be directly used when the field parameters change. On the other hand, in the conventional representation, the field elements are expressed as algebraic sums of a set of $m$ linearly independent elements of the field. The set is referred to as the basis of representation. For example, if all $\gamma_i \in GF(2^m)$, for $0 \le i \le m-1$, are linearly independent, then the set $\Gamma = \{\gamma_0, \gamma_1, \ldots, \gamma_{m-1}\}$ forms a basis of $GF(2^m)$ over the ground field $GF(2)$ and any element $\alpha$ of $GF(2^m)$ can be expressed as $\alpha = \sum_{i=0}^{m-1} a_i \gamma_i$, where $a_i \in GF(2)$. The term $a_i$ is referred to as the $i$th coordinate of $\alpha$ with respect to basis $\Gamma$.
The bases which were assumed in the various realizations in the past are normal, canonical (also known as polynomial or standard), triangular, and dual bases. Although normal bases offer an almost cost-free squaring operation, they are not suitable for the design of a flexible processor because of their complex multiplication and inversion circuits for generic field parameters. Triangular and dual bases are quite similar in carrying out field operations and, under certain conditions, these two types of bases are equal [11]. Basis transformations between the canonical basis and its triangular basis can be efficiently realized in hardware. In this paper, the canonical basis and the triangular basis are used together. In the past, the combined use of two bases was found to yield efficient hardware for finite field arithmetic units [8], [11].
For $\omega \in GF(2^m)$, a canonical basis has the form $\{1, \omega, \ldots, \omega^{m-1}\}$. If $F(x) = \sum_{i=0}^{m} f_i x^i$ is an irreducible binary polynomial of degree $m$ (i.e., $f_m = f_0 = 1$ and $f_i \in \{0, 1\}$ for $0 < i < m$) and has a root $\omega$ in $GF(2^m)$, then it is well known that the elements of the set $\{1, \omega, \ldots, \omega^{m-1}\}$ are linearly independent and, hence, form a canonical basis. Using this canonical basis as the primal basis of representation, a triangular basis, hereafter denoted as
While storing these coordinates in a $w$-bit register ($w \ge m$), it is advantageous to align them with the left or the right boundary of the register. In this paper, these coordinates are assumed to be aligned to the left boundary and the rightmost $w - m$ bits are ignored, as shown in Fig. 2. This kind of alignment enables us to serially load the coordinates into the register in $m$ clock cycles, independent of the register size $w$.
Processor Core Architecture
The core of the GF processor is the FOU. The FOU architecture is presented here, while the algorithms and operations which led to the development of this FOU are given in Section 3. The FOU is the most space-consuming unit in the GF processor and consists of a number of registers and logic blocks. The details of these blocks depend on the algorithms used to realize field arithmetic operations. An organization of the FOU using a $(w+1)$-bit bus is shown in Fig. 3. The leftmost 1-bit bus ($Bus_w$) is for bit serial operations, while the $w$-bit bus ($Bus_{w-1,\ldots,0}$) is for parallel operations. The value of $w$ is the largest dimension the GF processor can support.
As shown in Fig. 3, the FOU has seven registers, namely, A1, A2, B1, B2, C, D, and T. All registers, except for D, are $w$ bits long. Register D is $\lceil \log_2 w \rceil$ bits long and holds the binary representation of the current field dimension. Register C is the configuration register and holds the current irreducible polynomial which defines the representation of the field elements. Registers C and D are loaded with appropriate values only when the field parameters change. For $F(x) = x^m + \sum_{i=0}^{m-1} f_i x^i$, the C register contents are as shown in Fig. 4, where the rightmost $w - m$ bits are zero. These zero bits nullify the corresponding bits in other registers during certain arithmetic operations, which will be explained later in this article.
Data can be loaded and unloaded in parallel to and from registers A1, A2, B1, B2, and T. Additionally, registers A1, A2, and T can be shifted in both directions. The FOU also has four logic blocks, namely, XOR-ARRAY, IP1, IP2, and S. The XOR-ARRAY block takes two $w$-bit inputs and produces a $w$-bit output, which is the bit-wise XOR of the inputs. Both IP1 and IP2 are inner product units, each of which takes two $w$-bit vectors and produces a one-bit output, since the inner product is over $GF(2)$. Depending on the direction of the data flow in the T register, IP1 produces the following outputs:
where $T_{-1} = 0$, and $T_i$ and $B2_i$ correspond to the $i$th bit of registers T and B2, respectively. The output of IP2 is similar to the above equation for IP1 except that B2 is replaced by C.
The S block consists of a two-input XOR gate and a two-input multiplexer. It can be configured to provide the input in at the left end of register T and to place a single bit on the 1-bit bus ($Bus_w$). The configuration is controlled by two signals, namely, modeg and inw, generated by the control unit. The operations of the S block, based on these signals, are as follows:
where speil is a control signal and out is the bit serial output of T at the left end.
Registers T and C, along with logic blocks IP2 and S, form two special structures, namely, a linear feedback shift register (LFSR) and a feed-forward shift register (FFSR). When modeg = 0 and the T register is shifted right, an LFSR is formed. On the other hand, when modeg = 1 and the register T is shifted left, an FFSR is formed and the one-bit output of the feed-forward operation is placed on the 1-bit bus, given by
$Bus_w = T_{w-1} + \sum_{i=0}^{w-1} T_{i-1} C_i = out + IP2_{out}$, (7)
where $T_i$ and $C_i$ correspond to the $i$th bit of registers T and C, respectively. Another special structure in the FOU consists of register B1 and its multiplexing-like input logic. Depending on the value of $e \in \{0, 1\}$, this structure operates as a buffer or a shift-right-with-one-and-store unit, i.e.,
COMPUTING ALGORITHMS AND THEIR OPERATIONS
In this section, the algorithms which have led to the development of the FOU organization, along with their operations, are presented. Addition/subtraction over $GF(2^m)$ is a simple bit-wise XOR of the field elements. In the FOU, an addition operation is performed using the XOR-ARRAY, register B2 (which acts as an accumulator), and another register (excluding D) in bit parallel fashion. On the other hand, $GF(2^m)$ inversion, multiplication, and basis transformations are performed in bit serial fashion. The latter is essential to obtain an area-efficient realization of the FOU.
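As a minimal sketch of the bit-parallel addition described above, assume each field element is packed into a Python integer whose bit $i$ holds its $i$th coordinate. The packing and the name `gf2m_add` are illustrative assumptions and are not part of the processor.

```python
def gf2m_add(a: int, b: int) -> int:
    """Add two GF(2^m) elements held as coordinate bit vectors.

    Addition and subtraction coincide over GF(2^m): both are a bit-wise XOR.
    """
    return a ^ b


# Assumed packing: bit i of the integer is the coordinate of omega^i.
a = 0b0101  # omega^2 + 1
b = 0b0011  # omega + 1
total = gf2m_add(a, b)
print(format(total, "04b"))

# Adding the same element twice cancels it, so subtraction is the same XOR.
assert gf2m_add(total, b) == a
```

The same XOR also models accumulation into B2: repeatedly XOR-ing operands into one accumulator variable yields their field sum.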
Inversion
Using the extended Euclidean algorithm, which is a well-known method for area-efficient hardware realization of an inverter over $GF(2^m)$ for an arbitrary $m$, one needs four or more $m$-bit registers. In a large field, the number of registers needed plays an important role in determining the size of the processor, as well as the overall performance of a system which relies on the processor. If the algorithms take fewer registers, then we can either reduce the area of the processor or utilize the unused registers to store intermediate results and, hence, reduce loading and unloading operations. The latter can yield a considerable improvement in the performance of system level computations (e.g., to establish a shared secret key on an elliptic curve defined over a large finite field).
Algorithm
To compute the inverse of $\alpha \in GF(2^m)$, assume that:
Also assume that $\omega \in GF(2^m)$ is a root of the irreducible polynomial $F(x)$, $\beta$ is the inverse of $\alpha$, and $g_i$ is the $i$th coordinate of $g = \beta\omega^m$ with respect to the canonical basis. Then, from [12], one can write:
Equation (10) has a special structure and can be efficiently solved to obtain the $g_i$'s using the Berlekamp-Massey algorithm [13]. We thus have the following algorithm to compute $\alpha^{-1} = g\,\omega^{-m}$.
Algorithm 1: ($GF(2^m)$ Inversion)
Input: $\alpha$ w.r.t. the canonical basis.
Output: $\alpha^{-1}$ w.r.t. the canonical basis.
An example showing the inversion operation using the above algorithm is given in Appendix B.
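The steps of Algorithm 1 are not reproduced in this copy, so the following is only a hedged sketch of the textbook Berlekamp-Massey algorithm over $GF(2)$ that the inversion method builds on; it is not the authors' Algorithm 1. The names `c`, `b`, `L`, and `shift` are illustrative and do not correspond to the paper's variables. The discrepancy `d` is the inner product over $GF(2)$ that the text attributes to an inner product unit.

```python
def berlekamp_massey_gf2(s):
    """Return (c, L): the shortest LFSR that generates the bit list s.

    c[0] == 1 and, for every n >= L,
    c[0]*s[n] ^ c[1]*s[n-1] ^ ... ^ c[L]*s[n-L] == 0.
    """
    n_bits = len(s)
    c = [1] + [0] * n_bits   # current connection polynomial
    b = [1] + [0] * n_bits   # copy saved at the last length change
    L, shift = 0, 1
    for n in range(n_bits):
        # Discrepancy: inner product of the polynomial with the sequence window.
        d = s[n]
        for j in range(1, L + 1):
            d ^= c[j] & s[n - j]
        if d == 0:
            shift += 1
        elif 2 * L <= n:
            saved = c[:]
            for j in range(n_bits + 1 - shift):
                c[j + shift] ^= b[j]
            L, b, shift = n + 1 - L, saved, 1
        else:
            for j in range(n_bits + 1 - shift):
                c[j + shift] ^= b[j]
            shift += 1
    return c[:L + 1], L


# Sequence obeying s[n] = s[n-3] ^ s[n-4], i.e. generated by x^4 + x + 1.
sequence = [0, 0, 0, 1, 0, 0, 1, 1]
poly, length = berlekamp_massey_gf2(sequence)
print(poly, length)
```

Feeding $2m$ sequence bits is enough for the algorithm to settle on a recurrence of length at most $m$, which matches the $2m$-cycle figure quoted for the inversion.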
Operation
In the above algorithm, $s_i$, $d$, and $e$ are either 0 or 1. Also, $P(x)$ and $Q(x)$ are polynomials over $GF(2)$, while $v$ is an integer. Thus, $P(x)$ and $Q(x)$ are updated using modulo 2 arithmetic and $v$ is updated using integer arithmetic. These two polynomials are stored in registers B2 and B1, respectively, in the FOU, while $v$ is stored in the control unit.
Based on the properties of the Berlekamp-Massey algorithm [13], we have $\deg P(x) \le m$. Thus, an $(m+1)$-bit register is needed to store $P(x)$, implying that register B2 should be $w+1$ bits long for the maximum value of $m = w$. This, however, causes an inconsistency in the lengths of the registers connected to the $w$-bit bus of the FOU. To eliminate this inconsistency, coefficient $p_0$ (i.e., the constant term of $P(x)$) is not stored in B2. This does not result in any information loss since, in Step 1.2, $p_0$ is always "1". However, when $e = 1$, polynomial $Q(x)$ is updated to $P(x)$. In this updating process, coefficient $p_0 = 1$ is restored by the multiplexing-like logic at the input of register B1.
At the end of the iteration process in Step 1.2, the coefficients of $P(x)$ are essentially the coordinates of $g \in GF(2^m)$ in some reverse order, specifically,
Thus, $P(x)$ and $Q(x)$ are stored left-adjusted in registers B2 and B1, respectively, starting from their lowest order coefficients (i.e., $p_1$ for $P(x)$ and $q_0$ for $Q(x)$).
In the inversion algorithm, the sequence $\{s_i\}$ is generated from the coordinates of $\alpha$ using the LFSR structure consisting of T, C, S, and IP2 in the FOU. Assuming that the canonical basis coordinates of $\alpha$ are stored in register A$i$, $i = 1, 2$, if we right-shift these coordinates into logic S via the 1-bit bus, then $s_0, s_1, \ldots, s_{m-1}$ are generated at the output of S and enter T in bit serial fashion. In the next $m - 1$ clock cycles, $s_m, s_{m+1}, \ldots, s_{2m-2}$ are generated simply by right-shifting T with speil = 0 and inw = 1 at the inputs of S. The term $s_{2m-1}$ is generated with one more right-shift of T with speil = 1 and inw = 1 at the inputs of S. Since B2 contains the coefficients of $P(x)$ and T contains the sequence $\{s_i\}$, the inner product $d$ of the inversion algorithm is obtained by combining the output of IP1 of the FOU with in.
Remarks
• The first $m$ elements of the $\{s_i\}$ sequence in the above inversion algorithm are essentially the triangular basis coordinates of $\alpha$, i.e.,
Thus, the inversion algorithm and its operation on the FOU are also applicable to the triangular basis representation of $\alpha$ with only minor modifications. In either basis representation, the inverse of $\alpha$ is, however, obtained with respect to the canonical basis.
• The element $\alpha$ whose inverse is to be determined does not need to be in one of the registers of the FOU. Element $\alpha$ can be directly brought into the LFSR structure in bit serial fashion to form the $\{s_i\}$ sequence and to apply the inversion algorithm. In some applications, this will save the $m$ clock cycles needed to buffer $\alpha$.
• As can be seen from the operation of the inversion algorithm, the main components for the inversion operation are three registers¹ (viz., T, B1, and B2) and three logic blocks² (viz., XOR-ARRAY, IP1, and IP2). If, instead of the above Berlekamp-Massey algorithm-based inversion, the extended Euclidean algorithm-based inversion (or a variation of it, such as the almost inverse algorithm [14]) is used, then one will need four (or more) registers and two XOR-ARRAYs. Although both algorithms take $2m$ clock cycles per inversion, the former has a longer critical path of $\lceil \log_2 w \rceil D_X + D_A$ because of the inner product unit ($D_X$ and $D_A$ correspond to the delays due to an XOR gate and an AND gate, respectively). Variations of the Berlekamp-Massey algorithm exist where the inner product can be avoided at the expense of other circuits [15]. The inner product units, however, can be useful in implementing other finite field arithmetic operations, such as multiplication and basis transformations, which are discussed below.
Multiplication
Algorithm
Let $u$ be the product of $\alpha$ and $\beta$, where $\alpha, \beta, u \in GF(2^m)$.
Conventional finite field multipliers, where both the multiplicand ($\alpha$) and the multiplier ($\beta$) are represented w.r.t. the canonical basis, have logic circuits, consisting of XOR and AND gates, at the input of each stage of the register which generates the partial products [16]. The presence of these logic circuits limits the usage of the register as a general-purpose buffer whose contents could otherwise be shifted in both directions, a feature a designer would like to have in a processor on which larger systems may be built. Keeping this in mind, if one applies (9) and (10a) to Proposition 1 of [11], the following is obtained:
Using this type of matrix equation, the coordinates of the product can be generated in two ways: one using two inner product units, where the coordinates are obtained in bit serial fashion, and the other using only one inner product unit, where the coordinates are obtained in bit parallel fashion after $m$ clock cycles. This leads to the following algorithm, which can be implemented using the FOU.
Algorithm 2: ($GF(2^m)$ multiplication)
Input: $\alpha$ in the triangular basis and $\beta$ in the canonical basis.
Output: $u$ in the triangular basis.
Step 2.1.
Operation
The triangular basis coordinates of $\alpha$, indexed by $j$ with $0 \le j \le m-1$, are loaded into register T of the FOU; these are the first $m$ elements of the $\{s_i\}$ sequence. The remaining elements are generated in T in the LFSR mode. The canonical basis coordinates of $\beta$ are assumed to be in A$i$, $i = 1, 2$, in reverse order. In each clock cycle of the multiplication operation, these coordinates are left-shifted one position and, if the leftmost bit of A$i$ is 1, the contents of T are accumulated in B2 to yield the triangular basis coordinates of $u$ in B2 after $m$ clock cycles.
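The shift-and-accumulate operation above can be sketched in software under explicit assumptions. The steps of Algorithm 2 and the definition of the triangular basis are missing from this copy, so the sketch models the second representation as the coordinates $s_j = f(\alpha\omega^j)$ for a fixed linear functional $f$; the functional chosen here (the $\omega^{m-1}$ coordinate) is an assumption and may differ from the paper's triangular basis. The field $GF(2^4)$ with $F(x) = x^4 + x + 1$ is taken from Appendix A, and all function names are hypothetical.

```python
M = 4
F_LOW = 0b0011  # coefficients f_3..f_0 of F(x) = x^4 + x + 1


def mul_omega(a):
    """Multiply a canonical-basis element by omega, reducing with F(x)."""
    a <<= 1
    if a >> M:
        a = (a ^ F_LOW) & ((1 << M) - 1)
    return a


def canonical_mul(a, b):
    """Reference product: shift-and-add in the canonical basis."""
    result = 0
    for i in range(M):
        if (b >> i) & 1:
            result ^= a
        a = mul_omega(a)
    return result


def functional(a):
    """Assumed linear map to GF(2): the omega^(m-1) coordinate."""
    return (a >> (M - 1)) & 1


def star_coords(a):
    """Coordinates s_j = functional(a * omega^j), j = 0..m-1."""
    out = []
    for _ in range(M):
        out.append(functional(a))
        a = mul_omega(a)
    return out


def lfsr_step(t):
    """Drop s_k and append s_(k+m) = XOR of f_j * s_(k+j)."""
    new_bit = 0
    for j in range(M):
        new_bit ^= ((F_LOW >> j) & 1) & t[j]
    return t[1:] + [new_bit]


def serial_mul(alpha_star, beta):
    """Accumulate window T into B2 whenever the current bit of beta is 1."""
    t = list(alpha_star)   # plays the role of register T
    b2 = [0] * M           # plays the role of accumulator B2
    for i in range(M):
        if (beta >> i) & 1:
            b2 = [x ^ y for x, y in zip(b2, t)]
        t = lfsr_step(t)
    return b2


alpha, beta = 0b0101, 0b0110
print(serial_mul(star_coords(alpha), beta))
print(star_coords(canonical_mul(alpha, beta)))  # same coordinates
```

The two printed lists agree for every pair of operands because the functional is linear, so the accumulated windows add up to the same coordinates as those of the canonical-basis product.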
Remarks
• If the coordinates of $\beta$ follow those of $\alpha$ in entering the FOU, then the iterations of Step 2.2 of Algorithm 2 can proceed with the arrival of each coordinate of $\beta$, and the product $u$ is fully available in B2 right after the last coordinate of $\beta$ enters the FOU. This implies that, after the first coordinate of $\alpha$ has entered the FOU, Algorithm 2 is able to generate the product $u$ in $2m$ clock cycles.
• As stated before, both $\alpha$ and $u$ in Algorithm 2 are given w.r.t. the triangular basis. In applications where $\alpha$ is available w.r.t. the canonical basis and/or $u$ is needed w.r.t. the canonical basis, the basis transformations can be performed on-the-fly as the coordinates of $\alpha$ and $u$ enter and leave, respectively, the FOU. This is discussed below.
Basis Transformation
In this subsection, the transformations of the coordinates of an element $\alpha \in GF(2^m)$ from the triangular basis to the canonical basis and vice versa are discussed. To this end, if (2) is used for the forward transformation (i.e., triangular to canonical) in the FOU, then the triangular basis coordinates are to be shifted, in reverse order, into the T register operating in the FFSR mode. The desired canonical basis coordinates would appear at the output of IP2 in bit serial fashion. If these coordinates are to be directly taken to a register (either A1 or A2) for the purpose of storage, then we would have these coordinates stored in reverse order. More importantly, a separate path (perhaps an additional 1-bit bus) would be needed from the IP2 output to the various possible destination registers.
1. We have not counted C since it is in the FOU anyway for configuration purposes.
2. We have ignored the logic block S since it is a simple structure and does not depend on the value of $m$.
As an alternative approach, after a few steps of algebraic manipulation, (2) can be written as follows:
In order to apply (14) to implement the forward transformation in the FOU, the triangular basis coordinates are first loaded into T in correct order in bit parallel fashion. Then, T is left-shifted in the FFSR mode and the canonical basis coordinates appear on the 1-bit bus in bit serial fashion. If these coordinates are to be stored in A1 or A2, they can be directly taken from the bus in correct order. For the backward (i.e., canonical to triangular) transformation in the FOU, (3) can be applied. In this case, the canonical basis coordinates are simply shifted into the T register operating in the LFSR mode and, after $m$ clock cycles, the triangular basis coordinates are formed in the T register.
Comments
• From the algorithms and operations presented above, one can see that the field inversion operation essentially dominates the FOU organization. The important blocks of the FOU, viz., T, B1, B2, IP1, and IP2, are employed during the inversion operation. Subsets of these blocks satisfy the needs of field multiplication and basis transformations. The two registers A1 and A2 are more like general-purpose registers and are used to reduce the number of input/output operations by storing intermediate results. Although only two such registers are shown in the FOU, this number can be increased for improved performance at the application level where this type of processor is used. This is mainly because additional registers can be used to hold intermediate results and, hence, reduce I/O operations, which may constitute a significant portion of the total computation time at the application level.
• When an elliptic curve cryptosystem is to be implemented on the processor, one might attempt to represent the elliptic curve points with respect to various projective coordinate systems. Such representation systems reduce the number of inversions at the expense of an increased number of multiplications and are considered to be advantageous if the inversion is at least three times slower than the multiplication and enough registers are available to hold intermediate results. This is, however, not the case for the processor being presented here. Additionally, the main cost of having an inverter in the FOU, given that the latter already has a multiplier, is only an inner product unit (IP1) and a register (B1).
FOU OPTIMIZATION AND THE I/O UNIT
Based on the algorithms and operations discussed in the previous section, we will give a number of approaches to reduce space and time complexities of the FOU. Then, we will present the IOU which interfaces the FOU to external devices.
FOU Optimization
Bus Reduction
The $w$-bit bus provides fast data transfers among various FOU blocks. For large values of $w$, this bus may constitute a real design challenge at the physical layout level and, if not designed properly, it can account for a significant increase in silicon area. A trade-off can be made by reducing the width of the bus to $t$ bits ($t < w$). This reduction would increase the number of bus transfers to $\lceil m/t \rceil$ for a single $m$-bit operand. The effect of this on the field inversion is as follows: Using Algorithm 1, this $t$-bit bus would be used to update $P(x)$ as follows:
Assuming that the probability of d being nonzero is 0.5, the total number of bus transfers for Step 1.2 is md 
Inner Product Unit Optimization
The two inner product units of the FOU provide the longest critical path. While IP2 is used by inversion, multiplication, and basis transformations, IP1 is used only by inversion to compute d in Step 1.2. Variations of the Berlekamp-Massey algorithms, which do not require the inner product d, employ other circuits such as registers and XOR gates [16] . For large w, when these circuits cannot be added for space limitations, one can proceed as follows to shorten the critical path of IP1:
Assuming that the two-input XOR gates are in binary tree form, 1-bit buffers are placed at a level of $\frac{1}{2}\lceil \log_2 w \rceil$. The total number of such buffers is only $2^{\frac{1}{2}\lceil \log_2 w \rceil} \approx \sqrt{w}$. (In most elliptic curve cryptosystems, $w < 256$; this implies that only 16 1-bit buffers would be needed.) Although these buffers can potentially reduce the critical path by half (and, hence, double the clock frequency), the value of $d$ is available one clock cycle later. In order to proceed with the iterations of Step 1.2, the value of $d$ is predicted to be "0" and the triplet $P(x)$, $Q(x)$, and $v$ is updated accordingly. In the next clock cycle, if the prediction turns out to be correct, the next iteration starts. For a wrong prediction, $P(x)$, $Q(x)$, and $v$ are corrected as follows:
Assume that, at iteration $i$, the correct $P(x)$, $Q(x)$, and $v$ values in Step 1.2 are $P^{(i)}(x)$, $Q^{(i)}(x)$, and $v_i$, respectively. In the next clock cycle, with the prediction of $d = 0$, the triplet is updated as follows:
HASAN AND WASSAL: VLSI ALGORITHMS, ARCHITECTURES, AND IMPLEMENTATION OF A VERSATILE GF(P
where the primes in the superscripts indicate predicted values of the triplet. The correct values of the triplet, however, at iteration $i + 1$ are
Thus, the correct values of the triplets can be completely recovered from the predicted values. The realization of the correction process in the FOU, however, requires a minor change in the input logic of register B1, which holds x, since iI x H iI x if e HX PP
The correction process also requires one extra clock cycle as a penalty for the wrong prediction. Assuming that 50 percent of the predictions are wrong and that the reduction of the critical path doubles the clock frequency, the average speed-up of the inversion operation is approximately $2 \times \frac{2m}{(2+1)m} - 1 \approx 33\%$. Since the other operations, like multiplication and basis transformations, do not use IP1, their speed-ups are expected to be close to the speed-up of the clock.
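The speed-up estimate can be checked with a few lines of arithmetic. This is a sketch under the assumptions stated in the text (one clock cycle per iteration, $2m$ iterations per inversion, one extra cycle per wrong prediction, and a doubled clock); the function name and parameters are illustrative.

```python
def average_inversion_speedup(m, wrong_fraction=0.5, clock_gain=2.0):
    """Relative speed-up of inversion with a pipelined inner product unit.

    Assumes 2m iterations per inversion and one penalty cycle for each
    wrongly predicted discrepancy.
    """
    base_cycles = 2 * m
    penalty_cycles = wrong_fraction * 2 * m
    return clock_gain * base_cycles / (base_cycles + penalty_cycles) - 1.0


for m in (113, 163, 239):
    print(m, round(average_inversion_speedup(m), 3))
```

Under these assumptions the result does not depend on $m$: the penalty cycles scale with the iteration count, so only the misprediction rate and the clock gain matter.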
IP2, the other inner product unit of the FOU, has $F(x)$, the field defining irreducible polynomial stored in register C, as one of its inputs. Assuming that the second leading term of $F(x)$ is $x^k$, i.e.,
the C register in the FOU has zeros in its leftmost $m - 1 - k$ positions. If the FOU is designed to be used for a set of predetermined irreducible polynomials (e.g., those specified by certain standardization committees), one can find $\min\{m - 1 - k\}$ for these polynomials and, hence, can reduce the number of AND and XOR gates by $\min\{m - 1 - k\}$ each. Additionally, for efficient hardware realization, $F(x)$ is usually chosen to be an irreducible trinomial, if one exists. Where there are no such trinomials, irreducible pentanomials are known to exist for all values of $m$ of practical interest [17]. Thus, if the FOU is designed to support a set of $n$ such low-weight irreducible polynomials, then the maximum number of AND and XOR gates in IP2 would be $4n$ and $4n - 1$, respectively. The value of $n$ is expected to be reasonably small for most applications. For example, the value of $n$ can be as small as 4 and still satisfy the $GF(2^m)$ computational requirements for elliptic curve cryptosystems with $m \in \{113, 163, 193, 239\}$, which are of practical interest [18].
Input/Output Unit
Compared to general-purpose processors, a finite field processor for cryptographic applications needs to handle operands of much larger size. For example, using elliptic curve cryptography, which appears to use smaller operand (i.e., key) sizes, a $GF(2^m)$ processor needs to deal with 113-bit long operands for the minimum level of security being recommended by various standards committees. The operands can be of 1,000 bits or more if the cryptosystem is based on the Diffie-Hellman discrete logarithm problem.
True bit-parallel input/output operations for such large operands are difficult, even with today's advanced VLSI technologies. On the other hand, bit-serial input/output appears to be conservative and may cause an unacceptable amount of time delays for loading and unloading the operands to and from the GF processor. A more practical approach to these input/output operations is to split the operand into several blocks. The block size can be 8, 16, 32, or 64 bits to make the processor chip compatible with other devices.
To this end, Fig. 5 shows the structure of the external input/output interface for the GF processor. This interface is based on asynchronous handshaking between the GF processor and other devices. The external bus (X_bus) moves data and instructions to and from the processor in 16-bit blocks to speed up the data transfer. Three asynchronous registers (namely, DATAIN, DATAOUT, and SR) connect to the external bus, while two synchronous ones (viz., IR and DBR) are used for decoding and buffering.
The DATAIN register buffers data and instructions from the external bus. Instructions are copied later to the instruction register (IR) for further decoding and execution, while data are copied to the data buffer register (DBR) for further transfer to the FOU via the 1-bit bus. The DATAOUT register latches data from the D register or from the DBR and sends it out on the external bus via tristate buffers (TSB).
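The block-wise transfer described above can be illustrated with a short sketch that splits an $m$-bit operand into 16-bit blocks and reassembles it. The block order (least significant block first) and the function names are assumptions made for illustration; the source does not specify the order in which blocks cross the external bus.

```python
def split_into_blocks(value, m, block_bits=16):
    """Split an m-bit operand into ceil(m / block_bits) blocks.

    Blocks are returned least significant first (an assumed ordering).
    """
    count = -(-m // block_bits)  # ceiling division
    mask = (1 << block_bits) - 1
    return [(value >> (i * block_bits)) & mask for i in range(count)]


def join_blocks(blocks, block_bits=16):
    """Reassemble an operand from blocks produced by split_into_blocks."""
    value = 0
    for i, block in enumerate(blocks):
        value |= block << (i * block_bits)
    return value


operand = (1 << 112) | 0xBEEF      # an arbitrary 113-bit operand
blocks = split_into_blocks(operand, 113)
print(len(blocks), [hex(b) for b in blocks])
assert join_blocks(blocks) == operand
```

For a 113-bit operand this gives eight bus transfers, compared with 113 clock cycles for a purely bit-serial load.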
INSTRUCTION SET ARCHITECTURE AND VLSI IMPLEMENTATION
In this section, we first present a set of instructions to be executed on the datapath developed in the previous sections. Then, we present a prototype VLSI implementation of the entire processor.
Instruction Set
The instruction set consists of four logical groups. The first group contains those instructions used to clear, load, and unload the registers. They also support transforming the data representation to and from the triangular basis during loading and unloading, respectively. The second group contains shift left and right instructions. Arithmetic operations are carried out by the third group of instructions. These include accumulation, multiplication, and inversion. Moving the data between different registers in various fashions, even with basis transformation, is the responsibility of the last group. Table 1 lists the instructions supported along with their valid operands. To support these instructions, a control unit has been designed in [19]. Using the instruction set, a sample program for computing an inverse over $GF(2^4)$ is given below.
; Result = $000B
Assuming that it is a "cold" start, the program sets up the D and C registers with the field dimension ($m$) and the configuration polynomial ($F(x)$), respectively. The contents of D and C correspond to the field given in Appendix A. The element whose inverse is to be computed is loaded in A1. The "Inv A1" instruction computes the inverse and stores the result in register B2, which is then loaded into A1 in bit parallel fashion. The "MvAr A2,A1" instruction reverses the bit ordering and "UnLd A2" unloads the result.
VLSI Implementation
The GF processor architecture was modeled in VHDL and simulated to verify its functionality. After complete verification of the design functionality, it was synthesized using appropriate time and area constraints. Time constraints in this case were alleviated by the pads, package, bonding, and current limitations of the available prototyping technology, because these limitations already force us to operate the chip at lower frequencies [20]. Area constraints were of higher priority since we target an area-efficient GF processor. Both simulation and synthesis steps were carried out using Synopsys tools [21] and a 0.5 μm CMOS technology optimized for a 3.3 V supply voltage. The prototype can support operations over finite fields up to $GF(2^{64})$. The silicon area used was approximately 3.445 mm × 3.827 mm for the whole chip, packaged in a 68-pin PGA package.
The prototype chip was successfully tested at a clock frequency of 50 MHz. This frequency limitation is due to the packaging technology used; the implemented chip core can run at a frequency of more than 80 MHz, while an implementation with $w = 256$ is estimated to run at a frequency of more than 75 MHz. Approximate times required for some of the important finite field operations, as well as those for elliptic curve point addition and point doubling operations at such a frequency, are compared to a recent fast software implementation [22] in Table 2. The speed-up ratios are also shown and are at least eight, which confirms the importance of the hardware implementation. The elliptic curve operations, however, require as many as six general registers for the operations to be implemented without much loading and unloading of intermediate results. This number can be reduced to two general registers to maintain a reasonable chip area by delegating the field addition operations needed to the main processor at almost no speed penalty (e.g., the times needed for EC addition and doubling become 24.27 μs and 26.79 μs, respectively). Fig. 6 shows the annotated layout along with the micro-photograph of the fabricated and tested chip. Table 3 shows the area used by each of the building blocks along with the input/output pad ring, which is about 0.5 mm wide on each of the four sides. It also shows an estimate of the area required for an FOU that can support finite fields up to a dimension of 256, which is a practical dimension for current cryptographic applications. The core area utilization factors represent the ratio of the core processor area to the area used for buses and interconnects in all of the blocks. These factors do not exceed 0.5, showing the significant area used by the wide parallel interconnects.
CONCLUSIONS
In this paper, the design and implementation of a Galois field processor, proposed as a flexible and configurable module for cryptographic applications, have been presented. The processor computation algorithms make use of both canonical and triangular basis computations. Algorithms for VLSI realization of basis transformations, multiplication, and inversion have been presented and implemented. An extensible instruction set architecture is used to support those computations and some other manipulations. A variable-dimension datapath allows the processor to operate over different finite fields in order to accommodate different applications and different requirements. A VHDL model was used to simulate the processor and verify its functionality. Furthermore, the model was synthesized and a prototype chip was fabricated and tested. The design and the implementation emphasize an area-efficient processor to target embedded systems and smart card applications. However, more registers, more complex controller circuitry, pipelining, and wider internal buses could be used to build a high-performance variation at the cost of increased area. Also, for large composite fields, such as $GF(2^{1024})$, used in the conventional discrete logarithm-based cryptosystems, one can define the arithmetic operations over the subfield $GF(2^{256})$ or $GF(2^{128})$ and use the proposed processor for the subfield.
APPENDIX A $GF(2^4)$ REPRESENTATION
Let $F(x)$ be $x^4 + x + 1$ and $\omega \in GF(2^4)$ satisfy $F(\omega) = 0$. Then, it can be shown that the canonical basis is $\{1, \omega, \omega^2, \omega^3\}$ and
The representation of all of the nonzero elements of $GF(2^4)$ w.r.t. the canonical and triangular bases is given in Table 4. For easy reference, the elements are also represented as powers of $\omega$, which happens to be a primitive element of $GF(2^4)$.
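The canonical-basis column of such a table can be regenerated with a short sketch. It assumes the field polynomial $x^4 + x + 1$ from this appendix and packs coordinates so that bit $i$ corresponds to $\omega^i$; the function name and the packing are illustrative assumptions. The triangular-basis column is not reproduced because its defining equation is missing here.

```python
def powers_of_omega(m=4, f_low=0b0011):
    """List omega^0 .. omega^(2^m - 2) as bit vectors (bit i <-> omega^i).

    f_low holds the coefficients of F(x) below x^m; 0b0011 means x + 1.
    """
    mask = (1 << m) - 1
    elements, a = [], 1
    for _ in range((1 << m) - 1):
        elements.append(a)
        a <<= 1                      # multiply by omega
        if a >> m:                   # reduce using omega^m = f_low(omega)
            a = (a ^ f_low) & mask
    return elements


table = powers_of_omega()
for k, a in enumerate(table):
    print(f"omega^{k:<2} = {a:04b}")

# All 15 nonzero elements appear exactly once, so omega is primitive.
assert len(set(table)) == 15
```

Because the fifteen powers are all distinct, the listing also confirms that $\omega$ is a primitive element for this choice of polynomial.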
APPENDIX B EXAMPLE OF INVERSE OPERATION
Consider the field $GF(2^4)$ given in Appendix A. Let $\alpha = \omega^8 = (a_3, a_2, a_1, a_0) = (0, 1, 0, 1)$. To find the inverse of $\alpha$, below we apply Algorithm 1.
Step 1.1. $P(x) = 1$, $Q(x) = 1$, $v = 0$, $s_0 = 0$.
Step 1.2. The iterations are shown in Table 5.
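The result of this example can be cross-checked independently of Algorithm 1. The sketch below swaps in a different technique, exponentiation to the power $2^m - 2$, purely as a check; it is not the method used by the processor. It assumes the element has canonical coordinates (0, 1, 0, 1) with the lowest-order coordinate rightmost, and the function names are hypothetical.

```python
def gf_mul(a, b, m=4, f_low=0b0011):
    """Multiply two canonical-basis elements modulo x^m + f_low(x)."""
    mask = (1 << m) - 1
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a = (a ^ f_low) & mask
    return result


def gf_inv(a, m=4, f_low=0b0011):
    """Inverse of a nonzero element computed as a^(2^m - 2)."""
    result, base, e = 1, a, (1 << m) - 2
    while e:
        if e & 1:
            result = gf_mul(result, base, m, f_low)
        base = gf_mul(base, base, m, f_low)
        e >>= 1
    return result


alpha = 0b0101                 # omega^8 = omega^2 + 1
inverse = gf_inv(alpha)
print(f"{inverse:04b}")        # 1011, matching the $000B program result
assert gf_mul(alpha, inverse) == 1
```

The value 1011 corresponds to $\omega^3 + \omega + 1 = \omega^7$, which is consistent with $\omega^8 \cdot \omega^7 = \omega^{15} = 1$.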
