We describe a scalable and unified architecture for a Montgomery multiplication module which operates in both types of finite fields $GF(p)$ and $GF(2^m)$. The unified architecture requires only slightly more area than that of the multiplier architecture for the field $GF(p)$. The multiplier is scalable, which means that a fixed-area multiplication module can handle operands of any size, and also, the wordsize can be selected based on the area and performance requirements. We utilize the concurrency in the Montgomery multiplication operation by employing a pipelining design methodology. We also describe a scalable and unified adder module to carry out concomitant operations in our implementation of the Montgomery multiplication. The upper limit on the precision of the scalable and unified Montgomery multiplier is dictated only by the available memory to store the operands and internal results, and the module is capable of performing infinite-precision Montgomery multiplication in both types of finite fields.
Introduction
The basic arithmetic operations (i.e., addition, multiplication, and inversion) in prime and binary extension fields, $GF(p)$ and $GF(2^m)$, have several applications in cryptography, such as the decipherment operation of the RSA algorithm [18], the Diffie-Hellman key exchange algorithm [3], elliptic curve cryptography [7, 12], and the Digital Signature Standard, including the Elliptic Curve Digital Signature Algorithm [15]. The most important of these three arithmetic operations is the field multiplication operation since it is the core operation in many cryptographic functions.
The Montgomery multiplication algorithm [13] is an efficient method for doing modular multiplication with an odd modulus. It is very useful for obtaining fast software implementations of the multiplication operation in prime fields $GF(p)$. The algorithm replaces the division operation with simple shifts, which are particularly suitable for implementation on general-purpose computers. The Montgomery multiplication operation has been extended to the finite field $GF(2^k)$ in [9]. Efficient software implementations of the multiplication operation in $GF(2^k)$ can be obtained using this algorithm, particularly when the irreducible polynomial generating the field is chosen arbitrarily. The main idea of the architecture proposed in this paper is based on the observation that the Montgomery multiplication algorithms for the fields $GF(p)$ and $GF(2^k)$ are essentially the same algorithm. The proposed unified architecture performs the Montgomery multiplication in the field $GF(p)$ generated by an arbitrary prime $p$ and in the field $GF(2^m)$ generated by an arbitrary irreducible polynomial $p(x)$. We show that a unified multiplier performing the Montgomery multiplication operation in the fields $GF(p)$ and $GF(2^k)$ can be designed at a cost only slightly higher than that of the multiplier for the field $GF(p)$, providing significant savings when both types of multipliers are needed.
Several variants of the Montgomery multiplication algorithm [17, 10, 2] have been proposed to obtain more efficient software implementations on specific processors. Various hardware implementations of the Montgomery multiplication algorithm for limited precision operands are also reported [2, 17, 4] . On the other hand, implementations utilizing high-radix modular multipliers have also been proposed [17, 11, 19] . Advantages and disadvantages of using high-radix representation have been discussed in [22, 21] . Because high-radix Montgomery multiplication designs introduce longer critical paths and more complex circuitry, these designs are less attractive for hardware implementations.
A scalable Montgomery multiplier design methodology for $GF(p)$ was introduced in [21] in order to obtain hardware implementations. This design methodology allows a fixed-area modular multiplication circuit to be used for multiplying operands of unlimited precision. The design tradeoffs for best performance in a limited chip area were also analyzed in [21]. We use the same design approach as in [21] to obtain a scalable hardware module. Furthermore, the scalable multiplier described in this paper is capable of performing multiplication in both types of finite fields $GF(p)$ and $GF(2^k)$, i.e., it is a scalable and unified multiplier.
The main contributions of this paper are summarized below.
• We show that a unified architecture for a multiplication module which operates both in $GF(p)$ and $GF(2^m)$ can be designed easily without compromising scalability, time and area efficiency.
• We analyze design considerations such as the effect of the word length, the number of pipeline stages, and the chip area by supplying implementation results obtained with Mentor Graphics synthesis tools.
• We describe the design of a dual-field, scalable adder circuit which is suitable for the pipeline organization of the multiplier. This adder is necessary for the final reduction step in the Montgomery algorithm and in the final addition for converting the result of the multiplication operation (which is in the Carry-Save form) to the nonredundant form. Naturally, the adder operates both in GF (p) and GF (2 m ). We give an analysis of the time and area cost of the adder circuit.
We start with a short discussion of scalability in §2 and explain the main idea behind the unified multiplier architecture in §3. We then present the methodology to perform the Montgomery multiplication operation in both types of finite fields using the unified architecture. We give the original and modified definitions of Montgomery algorithm for GF (p) and GF (2 m ) in §4. We discuss concurrency in the Montgomery multiplication and show the methodology to design a pipeline module utilizing the concurrency in §5. We present the processing unit and the modifications needed to make the unit operate in prime and binary extension fields in §6. We then provide a multi-purpose word adder/subtractor module in §7, which can be integrated into the main Montgomery multiplier module in order to perform the field addition and subtraction operations. In §8, we discuss the area/time tradeoffs and suitable choices for word lengths, the number of pipeline stages, and typical chip area requirements. Finally, we summarize our conclusions in §9.
Scalable Multiplier Architecture
An arithmetic unit is called scalable if it can be reused or replicated in order to generate long-precision results independently of the data path precision for which the unit was originally designed. To speed up the multiplication operation, various dedicated multiplier modules were developed in [19, 1, 14]. These designs operate over a fixed finite field. For example, the multiplier designed for 155 bits [1] cannot be used for any other field of higher degree. When a need for a multiplication of larger precision arises, a new multiplier must be designed. Another way to avoid redesigning the module is to use software implementations and fixed-precision multipliers. However, software implementations are inefficient in utilizing the inherent concurrency of the multiplication because of the inconvenient pipeline structure of the microprocessors being used. Furthermore, software implementations on fixed-digit multipliers are more complex and require an excessive amount of coding effort. Therefore, a scalable hardware module specifically tailored to take advantage of the concurrency of the Montgomery multiplication algorithm becomes extremely attractive.
Unified Multiplier Architecture
Even though prime and binary extension fields, $GF(p)$ and $GF(2^m)$, have dissimilar properties, the elements of either field are represented using almost the same data structures inside the computer. In addition, the algorithms for basic arithmetic operations in both fields have structural similarities allowing a unified module design methodology. For example, the steps of the Montgomery multiplication algorithm for the binary extension field $GF(2^m)$ given in [9] differ only slightly from those of the integer Montgomery multiplication algorithm [13, 10]. Therefore, a scalable arithmetic module, which can be adjusted to operate in both types of fields, is feasible, provided that this extra functionality does not lead to an excessive increase in area or a dramatic decrease in speed. In addition, designing such a module must require only a small amount of extra effort and no major modification in the control logic of the circuit.
Considering the amount of time, money and effort that must be invested in designing a multiplier module or more generally speaking a cryptographic coprocessor, a scalable and unified architecture which can perform arithmetic in two commonly used algebraic fields is definitely beneficial. In this paper, we show the method to design a Montgomery multiplier that can be used for both types of fields following the design methodology presented in [21] . The proposed unified architecture is obtained from the scalable architecture given in [21] after minor modifications. The propagation time is unaffected and the increase in chip area is insignificant.
Montgomery Multiplication
Given two integers $A$ and $B$, and the prime modulus $p$, the Montgomery multiplication algorithm computes
$$C = \mathrm{MonMul}(A, B) = A \cdot B \cdot R^{-1} \pmod{p}, \qquad (1)$$
where $R = 2^m$ and $A, B < p < R$, and $p$ is an $m$-bit number. The original algorithm works for any modulus $n$ provided that $\gcd(n, R) = 1$. In this paper, we assume that the modulus is a prime number, thus, we perform multiplication in the field defined by this prime number. This issue is also relevant when the algorithm is defined for the binary extension fields. The Montgomery multiplication algorithm relies on a different representation of the finite field elements. The field element $A \in GF(p)$ is transformed into another element $\bar{A} \in GF(p)$ using the formula $\bar{A} = A \cdot R \pmod{p}$. The number $\bar{A}$ is called the Montgomery image of the element, or $\bar{A}$ is said to be in the Montgomery domain. Given two elements in the Montgomery domain, $\bar{A}$ and $\bar{B}$, the Montgomery multiplication computes
$$\bar{C} = \mathrm{MonMul}(\bar{A}, \bar{B}) = \bar{A} \cdot \bar{B} \cdot R^{-1} \pmod{p},$$
where $\bar{C}$ is again in the Montgomery domain. The transformation operations between the two domains can also be performed using the MonMul function as
$$\bar{A} = \mathrm{MonMul}(A, R^2) = A \cdot R^2 \cdot R^{-1} = A \cdot R \pmod{p},$$
$$C = \mathrm{MonMul}(\bar{C}, 1) = \bar{C} \cdot 1 \cdot R^{-1} = C \pmod{p}.$$
Provided that $R^2 \pmod{p}$ is precomputed and saved, we need only a single MonMul operation to carry out each of these transformations. Because of these transformation operations, performing a single modular multiplication using MonMul might not be advantageous; however, there is a method to make it efficient for a few modular multiplications by eliminating the need for these transformations [16]. The advantage of the Montgomery multiplication becomes much more apparent in applications requiring multiplication-intensive calculations, e.g., modular exponentiation or elliptic curve point operations. In order to exploit this advantage, all arithmetic operations are performed in the Montgomery domain, including the inversion operation [6, 20]. Furthermore, it is also possible to design cryptosystems in which all calculations are performed in the Montgomery domain, eliminating the transformation operations permanently.
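Input: $A, B \in GF(p)$ and $m = \lceil \log_2 p \rceil$
Output: $C \in GF(p)$
Step 1: $C := 0$
Step 2: for $i = 0$ to $m - 1$
Step 3: $C := C + a_i B$
Step 4: $C := C + c_0 p$
Step 5: $C := C / 2$
Step 6: if $C \geq p$ then $C := C - p$
Step 7: return $C$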
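As an illustration only, the bit-level procedure and the domain transformations can be sketched in Python. The function names `mon_mul`, `to_montgomery`, and `from_montgomery` are hypothetical and do not come from the paper; the sketch assumes an odd modulus $p$ with $A, B < p < 2^m$ and uses ordinary integers rather than a hardware data path.

```python
def mon_mul(a, b, p, m):
    """Return a * b * 2^(-m) mod p for an odd modulus p with a, b < p < 2^m."""
    c = 0
    for i in range(m):
        c += ((a >> i) & 1) * b  # add B when bit a_i of the multiplier is 1
        c += (c & 1) * p         # add p when C is odd, so that C becomes even
        c >>= 1                  # exact division by 2
    if c >= p:                   # C < 2p here, so one subtraction suffices
        c -= p
    return c


def to_montgomery(a, p, m):
    """Map a to its Montgomery image a * R mod p, with R = 2^m."""
    r2 = pow(2, 2 * m, p)        # R^2 mod p; precomputed once in practice
    return mon_mul(a, r2, p, m)


def from_montgomery(a_bar, p, m):
    """Map a Montgomery image back to the ordinary representation."""
    return mon_mul(a_bar, 1, p, m)
```

For example, with `p, m = 13, 4`, transforming 5 and 7 into the Montgomery domain, multiplying them with `mon_mul`, and transforming the result back yields the ordinary product 5 · 7 mod 13.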
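In the case of $GF(2^m)$, the definitions and the algorithms are slightly different since we use polynomials of degree at most $m - 1$ with coefficients from the binary field $GF(2)$ to represent the field elements. Given two polynomials
$$A(x) = a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_1 x + a_0,$$
$$B(x) = b_{m-1}x^{m-1} + b_{m-2}x^{m-2} + \cdots + b_1 x + b_0,$$
and the irreducible monic degree-$m$ polynomial
$$p(x) = x^m + p_{m-1}x^{m-1} + \cdots + p_1 x + p_0$$
generating the field $GF(2^m)$, the Montgomery multiplication of $A(x)$ and $B(x)$ is defined as the field element $C(x)$ which is given as
$$C(x) = A(x) \cdot B(x) \cdot x^{-m} \pmod{p(x)}.$$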
We note that, as compared to Equation (1), $R(x) = x^m$ replaces $R = 2^m$. The representation of $x^m$ in the computer is exactly the same as the representation of $2^m$, i.e., a single 1 followed by $m$ zeros. Furthermore, the elements of $GF(p)$ and $GF(2^m)$ are represented using the same data structures. For example, the elements of $GF(7)$ for $p = 7$ and the elements of $GF(2^3)$ for $p(x) = x^3 + x + 1$ are represented in the computer as follows:
$$GF(7) = \{000, 001, 010, 011, 100, 101, 110\},$$
$$GF(2^3) = \{000, 001, 010, 011, 100, 101, 110, 111\}.$$
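Only the arithmetic operations acting on the field elements differ. The Montgomery image of a polynomial $A(x)$ is given as $\bar{A}(x) = A(x) \cdot x^m \pmod{p(x)}$. Similarly, before performing Montgomery multiplication, the operands must be transformed into the Montgomery domain and the result must be transformed back. These transformations are accomplished using the precomputed variable $R^2(x) = x^{2m} \pmod{p(x)}$ as follows:
$$\bar{A}(x) = \mathrm{MonMul}(A, R^2) = A(x) \cdot R^2(x) \cdot R^{-1}(x) = A(x) \cdot R(x) \pmod{p(x)},$$
$$C(x) = \mathrm{MonMul}(\bar{C}, 1) = \bar{C}(x) \cdot 1 \cdot R^{-1}(x) = C(x) \pmod{p(x)}.$$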
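The bit-level Montgomery multiplication algorithm for the field $GF(2^m)$ is given below:
Input: $A(x)$, $B(x)$, $p(x)$, and $m$
Output: $C(x)$
Step 1: $C(x) := 0$
Step 2: for $i = 0$ to $m - 1$
Step 3: $C(x) := C(x) + a_i B(x)$
Step 4: $C(x) := C(x) + c_0 p(x)$
Step 5: $C(x) := C(x) / x$
Step 6: return $C(x)$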
We note that the extra subtraction operation in Step 6 of the $GF(p)$ algorithm is not required in the case of $GF(2^m)$, as proven in [9]. Also, the addition operations are different. While addition in $GF(2^m)$ is just bitwise mod-2 addition, addition in $GF(p)$ requires carry propagation. Our basic observation is that it is possible to design a unified Montgomery multiplier which can perform multiplication in both types of fields if an adder module, equipped with the property of performing addition with or without carry, is available. The design of an adder with this property is provided in the following sections.
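To make the contrast with the integer case concrete, the following Python sketch performs the same loop with carry-free (XOR) addition. It is a minimal sketch under stated assumptions: the name `mon_mul_gf2` is hypothetical, polynomials are packed into integers with bit $i$ holding the coefficient of $x^i$, and `p` includes the leading $x^m$ term and has constant term 1.

```python
def mon_mul_gf2(a, b, p, m):
    """Return A(x) * B(x) * x^(-m) mod p(x) over GF(2).

    Polynomials are stored as integers: bit i is the coefficient of x^i.
    p must include the x^m term and have constant term 1.
    """
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b       # carry-free addition of B(x)
        if c & 1:
            c ^= p       # clear the constant term by adding p(x)
        c >>= 1          # exact division by x
    return c             # degree is already below m; no final subtraction
```

The loop body is the same as in the integer case except that `+=` becomes `^=`, which is the observation that motivates an adder with selectable carry behavior.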
The algorithms presented in this section require that the operations be performed using full precision arithmetic modules, thus, limiting the designs to a fixed degree. In order to design a scalable architecture, we need modules with the scalability property. The scalable algorithms are word-level algorithms, which we give in the following sections.
The Multiple-Word Montgomery Multiplication Algorithm for GF (p)
The use of fixed precision words alleviates the broadcast problem in the circuit implementation. Furthermore, a word-oriented algorithm allows design of a scalable unit. For a modulus of $m$-bit precision, $e = \lceil m/w \rceil$ words (each of which is $w$ bits) are required. The algorithm proposed in [21] scans the operand $B$ (multiplicand) word-by-word, and the operand $A$ (multiplier) bit-by-bit. The vectors involved in multiplication operations are expressed as
$$B = (B^{(e-1)}, \ldots, B^{(1)}, B^{(0)}),$$
$$A = (a_{m-1}, \ldots, a_1, a_0),$$
$$p = (p^{(e-1)}, \ldots, p^{(1)}, p^{(0)}),$$
where the words are marked with superscripts and the bits are marked with subscripts. For example, the $i$th bit of the $k$th word of $B$ is represented as $B^{(k)}_i$.
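The word and bit indexing can be illustrated with a short Python sketch. The helper names `num_words`, `to_words`, and `word_bit` are hypothetical; words are stored least significant first, so list index $k$ corresponds to the superscript $(k)$.

```python
def num_words(m, w):
    """Number of w-bit words needed for an m-bit operand: ceil(m / w)."""
    return -(-m // w)


def to_words(x, w, e):
    """Split integer x into e words of w bits each, least significant first."""
    mask = (1 << w) - 1
    return [(x >> (w * k)) & mask for k in range(e)]


def word_bit(words, k, i):
    """Return bit i of word k, i.e. the bit written as B^(k)_i."""
    return (words[k] >> i) & 1
```

For instance, a 10-bit operand with $w = 4$ occupies three words, and the top word is padded with zeros.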
Input: $A, B \in GF(p)$ and $p$
Output: $C \in GF(p)$
[...]
Step 6: [...]
parity := $TS^{(0)}_0$
Step 8: [...]
Step 10: for $j = 1$ to $e - 1$
Step 11: [...]
Step 13: [...]
Step 14: [...]
Step 18: [...]
Step 19: end for
Step 20: $TS^{(e-1)}$ [...]
Step 21: end for
Step 22: [...]
Step 23: if $C > p$ then
Step 24: $C := C - p$
Step 25: return $C$

As suggested in [21], we use the Carry-Save form in order to represent the intermediate results in the algorithm. The result of an addition is stored in two variables $(TC^{(j)}, TS^{(j)})$; thus, they can grow as large as $2^{m+1} + 2^m - 3$, which is exactly equal to the result of the addition of the three numbers on the right-hand side of the equations in Steps 4, 8, 11, and 16. Recall that $TC^{(j)}$ and $TS^{(j)}$ are $m$-bit numbers, but $TC^{(j)}$ must be seen as a number multiplied by 2 since it represents the carry vector in the Carry-Save notation. At the end of Step 21, we obtain the result in the Carry-Save form, which needs an extra addition to get the final result in the nonredundant form. If the final result is greater than the modulus $p$, one subtraction operation must be performed as shown in Step 24.
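The Carry-Save representation and the bound quoted above can be checked with a small Python sketch. The function names are hypothetical, and the sketch treats the carry vector as weighted by 2, as described in the text; it models only the three-input addition, not the full multiplier.

```python
def carry_save_add(x, y, z):
    """Compress three numbers into a (carry, sum) pair without carry propagation."""
    ts = x ^ y ^ z                    # bitwise sum vector
    tc = (x & y) | (x & z) | (y & z)  # bitwise carry vector (majority of the inputs)
    return tc, ts


def carry_save_value(tc, ts):
    """Nonredundant value of a Carry-Save pair; the carry vector has weight 2."""
    return 2 * tc + ts
```

With $m$-bit vectors, the largest representable value is $2(2^m - 1) + (2^m - 1) = 2^{m+1} + 2^m - 3$, which equals the sum of three $m$-bit numbers at their maximum.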
Multiple-Word Montgomery Multiplication Algorithm for GF (2 m )
The Montgomery multiplication algorithm for GF (2 m ) is given below. Since there is no carry computation in GF (2 m ) arithmetic, the intermediate addition operations are replaced by bitwise XOR operations, which are represented below using the symbol ⊕.
Step 1: $TS := 0^m$
Step 2: for $i = 0$ to $m$
Step 3: [...]
Step 5: [...]
Step 7: for $j = 1$ to $e - 1$
Step 8: [...]
Step 10: $TS^{(j-1)}$ [...]
Step 11: [...]
Step 12: end for
Step 13: $TS$ [...]
[...]
return $C$

Notice that in the outer loop the index $i$ runs from 0 to $m$. Since $(m + 1)$ bits are required to represent the irreducible polynomial of $GF(2^m)$, we prefer to allocate $(m + 1)$ bits to express the field elements. We can also modify the algorithm for $GF(p)$ accordingly for the sake of uniformity. Therefore, the number of words needed to represent a field element in both cases is $e = \lceil (m + 1)/w \rceil$, where $w$ is the selected wordsize.
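The word-serial organization for $GF(2^m)$ can be illustrated with the following Python sketch. It is not a transcription of the listing above: the name `mon_mul_gf2_words` and the exact order of operations are assumptions made for illustration. It follows the description in the text in that the multiplier is scanned bit by bit for $i = 0, \ldots, m$, the multiplicand and modulus are scanned word by word, and additions are XORs, so the result carries the factor $x^{-(m+1)}$.

```python
def mon_mul_gf2_words(a, b_words, p_words, m, w):
    """Word-serial Montgomery product A(x) * B(x) * x^(-(m+1)) mod p(x).

    b_words and p_words hold e = ceil((m + 1) / w) words of w bits each,
    least significant word first. Returns the result as a list of e words.
    """
    e = len(p_words)
    ts = [0] * e
    for i in range(m + 1):                 # a_m is 0 for a field element
        a_i = (a >> i) & 1
        ts[0] ^= a_i * b_words[0]          # add the first multiplicand word
        parity = ts[0] & 1                 # decides whether p(x) is added
        ts[0] ^= parity * p_words[0]
        ts[0] >>= 1                        # divide by x within the word
        for j in range(1, e):
            ts[j] ^= a_i * b_words[j]
            ts[j] ^= parity * p_words[j]
            ts[j - 1] |= (ts[j] & 1) << (w - 1)  # bit shifted into lower word
            ts[j] >>= 1
    return ts
```

Only one word of each operand is touched per inner iteration, which is the property that lets a fixed-width unit process operands of any length.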
Concurrency in Montgomery Multiplication
In this section, we analyze the concurrency in Montgomery multiplication algorithms as given in the subsections §4.1 and §4.2. In order to accomplish this task, we need to determine the inherent data dependencies in the algorithm and describe a scheme to allow the Montgomery multiplication to be computed on an array of processing units organized in a pipeline.
We prefer to accomplish concurrent computation of the Montgomery multiplication by exploiting the parallelism among the instructions across the different iterations of the $i$-loop of the algorithms, as proposed in [21]. We scan the multiplier one bit at a time, and after the first words of the intermediate variables $(TC, TS)$ are fully determined, which takes two clock cycles, the computation for the second bit of $A$ can start. In other words, after the inner loop finishes the execution for $j = 0$ and $j = 1$ in the $i$th iteration of the outer loop, the $(i + 1)$th iteration of the outer loop starts its execution immediately. The dependency graph shown in Figure 1 illustrates these computations.

Each circle in the graph represents an elementary computation performed in each iteration of the $j$-loop. We observe from this graph that these computations are very suitable for pipelining. Each column in the graph represents operations that can be performed by separate processing units (PU) organized as a pipeline. Each PU takes only one bit from the multiplier $A$ and operates on one word of the multiplicand $B$ each cycle. Starting from the second clock cycle, a PU generates one word of the partial sum $T = (TC, TS)$ in the Carry-Save form at each cycle, and communicates it to the next PU, which adds its contribution to the partial sum when its turn comes. After $e + 1$ clock cycles, the PU finishes its portion of work and becomes available for further computation. In case there is no available PU and there is work to do, the pipeline must stall and wait for the working PUs to finish their jobs. Since the PU at the end of the pipeline has no way of communicating its result to another PU, we need to provide extra buffers for it. In the worst case, which happens when there is only one PU, there must be $2e$ extra buffers, each $w$ bits long, to hold these partial sum words. In the last clock cycle of each column, the PU responsible for this column must receive $p^{(e)} = B^{(e)} = 0$. Elementary computations represented by circles in Figure 1 are performed on the same hardware module. The local control module in the PU must be able to extract $TS^{(0)}_0$ and keep this value for the entire operand scanning. Each PU, in other words, has to obtain this value and use it to decide whether to add the modulus $p$ to the partial sum. This value is determined in the first clock cycle of each stage.
An example of the computation for 7-bit operands is shown in Figure 2 for the word size $w = 1$, provided that there is a sufficient number of PUs to prevent the pipeline from stalling. Note that there is a delay of 2 clock cycles between the stage for $x_i$ and the stage for $x_{i+1}$. The total execution time for the computation is 20 clock cycles in this example. At clock cycles 7 and 15, the pipeline cannot engage a PU, and thus, it must stall for 2 extra cycles. At the 9th and 17th cycles, the first PU becomes available and computation proceeds.
We need a buffer of 4-bit length to store the partial sum bits during the stall. Because 8 is not a multiple of 3, the last two pipeline stages perform extra computations. Since it is a pipeline organization, it is not possible to stop the computations at any time. In [21], these extra cycles are treated as waste cycles. However, it is possible to perform useful computation without complicating the circuit. Recall that $C = A \cdot B \cdot 2^{-m} \pmod{p}$, where $m$ is the number of bits in the modulus $p$. If we continue the computations in these extra pipeline cycles, we calculate $C = A \cdot B \cdot 2^{-n} \pmod{p}$, where $n > m$ is the smallest integer multiple of the number of PUs in the pipeline organization. It is always easy to rearrange the Montgomery settings according to this new Montgomery exponent, namely $R = 2^n$, or $R = x^n$ for the field $GF(2^m)$ case.
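The effect of running the extra pipeline cycles can be illustrated with a short Python sketch. The names `extended_exponent` and `mon_mul_n` are hypothetical; the sketch simply runs the bit-level $GF(p)$ loop for $n$ iterations instead of $m$, relying on the fact that the multiplier bits beyond position $m - 1$ are zero.

```python
def extended_exponent(m, k):
    """Smallest multiple of k (the number of PUs) that is strictly greater than m."""
    return k * (m // k + 1)


def mon_mul_n(a, b, p, n):
    """Return a * b * 2^(-n) mod p by running n >= m bit-level iterations."""
    c = 0
    for i in range(n):
        c += ((a >> i) & 1) * b  # contributes nothing once i exceeds the bits of a
        c += (c & 1) * p         # make C even
        c >>= 1                  # divide by 2
    return c - p if c >= p else c
```

With the Montgomery images defined using $R = 2^n$, the rest of the arithmetic is unchanged.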
The total computation time, CC (clock cycles), is slightly different from the one in [21] and is given as 
Scalable Architecture
An example of a pipeline organization with 2 PUs is shown in Figure 5. An important aspect of this organization is the register file design. The bits $a_i$ of the multiplier are given serially to the PUs; they are not used again in later stages and can be discarded immediately. Therefore, a simple shift register is sufficient for the multiplier. The registers for the modulus $p$ and the multiplicand $B$ can also be shift registers. When there is no pipeline stall, the latches between PUs forward the modulus and multiplicand to the next PU in the pipeline. However, if pipeline stalls occur, the modulus and multiplicand words generated at the end of the pipeline enter the SR-p and SR-B registers. The length of these shift registers is of crucial importance and is determined by the number of pipeline stages ($k$) and the number of words ($e$) in the modulus. Considering that SR-p and SR-B require one extra register to store the all-zero word needed for the last clock cycle in every stage (recall that $p^{(e)} = B^{(e)} = 0$), the length of these registers can be given as
The width of the shift registers is equal to $w$, the wordsize. Once the partial sum $(TC, TS)$ is generated, it is transmitted to the next stage without any delay. However, we need two shift registers, SR-TC and SR-TS, to hold the partial sums from the last stage until the job in the first stage is completed. The length ($L_2$) of the registers SR-TC and SR-TS is equal to $L_1$. We observe that at most one word of each operand is used in every clock cycle. This makes different design options possible. Since we intend to design a fully scalable architecture, we need to avoid restrictions on the operand size or deterioration of the performance. We also assume that no prior knowledge is available about the prospective range of the operand precision. Since the length of the shift registers can vary with the precision, designing full-precision registers within the multiplier might not be a good idea. Instead, one can limit the length of these registers within the chip and use memory for the excess words. If this method is adopted, the length of the registers no longer depends on the precision or the number of stages. The words needed earlier are brought from memory to the registers first, and the successive groups of words are transferred during the computation. If the memory transfer rate is not sufficient, however, the pipeline might stall.
The registers for T C, T S, B, and p must have loading capability which can complicate the local control circuit by introducing several multiplexers (MUX). The delay imposed by these MUXes will not create a critical path in the final circuit. The global control block was not mentioned since its function can be inferred from the dependency graph and the algorithms.
Processing Unit
The processing unit (PU) consists of two layers of adder blocks, which we call dual-field adders. A dual-field adder is basically a full adder which is capable of performing addition both with carry and without carry. Addition with carry corresponds to the addition operation in the field $GF(p)$, while addition without carry corresponds to the addition operation in the field $GF(2^m)$. We give the details about the dual-field adder in the next subsection. The block diagram of a processing unit (PU) for $w = 3$ is shown in Figure 6. The unit receives the inputs from the previous stage and/or from the registers SR-A, SR-B, and SR-p, and computes the partial sum words. It delays $p$ and $B$ for the first cycle, then transmits them to the next stage along with the first partial sum word (which is ready at the second clock cycle) if there is an available PU. The data path for the partial sum $T = (TC, TS)$ (which is expressed in the redundant Carry-Save form) is $2w$ bits long, while it is $w$ bits long for $p$ and $B$ and 1 bit long for $a_i$. At the first cycle, the decision to add the modulus to the partial sum is determined, and this information is kept during the following $e$ clock cycles. The computations in a PU for $e = 5$ are illustrated in Table 1 for both types of fields $GF(p)$ and $GF(2^m)$.

Table 1: Inputs and outputs of the $i$th pipeline stage with $w = 3$ and $e = 5$ for both types of fields $GF(p)$ (top) and $GF(2^m)$ (bottom). Columns: Cycle No, Inputs, Outputs.
Notice that partial sum words in GF (2 m ) case are also in the redundant Carry-Save form. However, one of the components of the Carry-Save representation is always zero and the actual value of the result is the modulo-2 sum of the two. Since consecutive operations are all additions and the Carry-Save form is already aligned by the shift and alignment layer, this does not lead to any problem. We need to recall, however, that one extra addition is necessary at the end of the multiplication process. In the next section, we introduce a multi-purpose word adder/subtractor module which performs this final addition at the cost of an extra clock cycle.
Dual-Field Adder
The dual-field adder (DFA) shown in Figure 7a is, as mentioned before, basically a full adder equipped with the capability of doing bit addition both with and without carry. It has an input called FSEL (field select) that enables this functionality. When FSEL = 1, the DFA performs the bitwise addition with carry, which enables the multiplier to do arithmetic in the field $GF(p)$. When FSEL = 0, on the other hand, the output Cout is forced to 0 regardless of the values of the inputs. In both modes, the output S produces the result of the bitwise modulo-2 addition of the three input values. At most 2 of the 3 input values of the dual-field adder can be nonzero in the $GF(2^m)$ mode.
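The behavior described above can be captured in a few lines of Python. This is a behavioral sketch only, with a hypothetical function name; it models the input/output relation stated in the text and says nothing about the gate-level circuit of Figure 7b.

```python
def dual_field_adder(a, b, c_in, fsel):
    """Behavioral model of a one-bit dual-field adder.

    fsel = 1: ordinary full adder (GF(p) mode).
    fsel = 0: carry output forced to 0 (GF(2^m) mode).
    Returns the pair (c_out, s).
    """
    s = a ^ b ^ c_in                                 # modulo-2 sum of the inputs
    majority = (a & b) | (a & c_in) | (b & c_in)     # full-adder carry
    c_out = fsel & majority                          # gated by the field select
    return c_out, s
```

The sum output does not depend on `fsel`, so in this model the field select affects only the carry path.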
An important aspect of designing the dual-field adder is not to lengthen the critical path of the circuit, since a longer critical path would lower the clock speed, which is against our design goal. However, a small amount of extra area can be sacrificed. We show in the following section that this extra area is insignificant. Figure 7b shows the actual circuit synthesized by Mentor Graphics tools using the 1.2µm CMOS technology. In the circuit, the two XOR gates are dominant in terms of both area and propagation time. As in the standard full-adder circuit, the dual-field adder has two XOR gates connected serially. Thus, the propagation time of the dual-field adder is not larger than that of the full adder. Their areas differ slightly, but this does not cause a major change in the whole circuit.
Multi-purpose Word Adder/Subtractor
The proposed Montgomery multiplier generates results in the redundant Carry-Save form, hence we need to perform an extra addition operation at the end of the calculation to obtain the nonredundant form of the result. Therefore, a field adder circuit that operates in both $GF(p)$ and $GF(2^m)$ is necessary. A full-precision adder would increase the critical path delay and the area, and would also be hard to scale. A word adder of the type given in Figure 8 is suitable for our implementation since the multiplier generates only one word at each clock cycle in the last stage of the pipeline, thus we need to perform one word addition at a time. The word adder has two control inputs, FSEL and A/S, which select the field ($GF(p)$ or $GF(2^k)$) and choose between addition and subtraction when in $GF(p)$ mode, respectively. The adder propagates the carry bit to the next word addition while working in $GF(p)$ mode (i.e., FSEL = 1). Thus, the carry from a word addition operation is delayed using a latch and fed back into the $C_{in}$ input of the adder for the next word addition at the next clock cycle. In the $GF(2^m)$ mode, the module performs only bitwise modulo-2 addition of two input words and the A/S input is ineffective. An addition operation of two $e$-word long numbers takes $e + 1$ clock cycles. The last cycle generates the carry and prepares the circuit for another operation by zeroing the output of the latch. Figure 9 shows an example of an addition operation with operands of 3 words. We added subtraction functionality in the field $GF(p)$ to the word adder because the result might be larger than the modulus, and hence one final subtraction operation is necessary, as shown in Step 23 of the algorithm in Section 4.1. We do not need this reduction in the $GF(2^m)$ case. The final subtraction operation takes place only if the result is larger than the modulus. Thus, a comparison operation, which can also be performed utilizing the multi-purpose word adder/subtractor, is required. However, the control circuitry to perform this conditional subtraction might be complicated; therefore, it might be placed outside of the Montgomery multiplier unit.
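The word-serial operation of such an adder can be sketched in Python as follows. The function name `word_serial_add` is hypothetical. The text does not say how subtraction is realized; the sketch assumes the usual two's-complement approach (complement the second operand and preset the carry to 1), which is an assumption and not taken from the paper. Each loop iteration stands for one word cycle, and the returned carry stands for the extra final cycle.

```python
def word_serial_add(x_words, y_words, w, fsel, subtract=False):
    """Add or subtract two multi-word numbers one w-bit word per cycle.

    fsel = 1: GF(p) mode, the carry is latched between word additions.
    fsel = 0: GF(2^m) mode, bitwise XOR; the subtract flag is ignored.
    Words are least significant first. Returns (result_words, final_carry).
    """
    mask = (1 << w) - 1
    # Assumption: subtraction is x + (~y) + 1, so the carry latch starts at 1.
    carry = 1 if (fsel and subtract) else 0
    out = []
    for xw, yw in zip(x_words, y_words):
        if fsel:
            if subtract:
                yw = ~yw & mask          # one's complement of the word
            total = xw + yw + carry
            out.append(total & mask)
            carry = total >> w           # latched for the next word cycle
        else:
            out.append(xw ^ yw)          # carry-free addition
    return out, carry
```

Under the two's-complement assumption, a final carry of 1 after a subtraction indicates that the first operand was not smaller than the second, so the same unit could also serve for the comparison mentioned above.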
Another reason to include a multi-purpose word adder unit in the multiplier circuit is the fact that the field addition operation is also needed in many cryptographic applications. For example, in elliptic curve cryptosystems, the field addition and multiplication operations are performed successively, hence having the multiplier and adder in the same hardware unit will decrease the communication overhead. A word adder that has these properties was synthesized using the Mentor Graphics tools, and the resulting time and space requirements are given in Table 2. Finally, Figure 10 illustrates what happens in the last stage of the pipeline. A pair of redundant words $(TC^{(j)}, TS^{(j)})$ is generated each cycle for $e$ clock cycles. The word adder can be used to add these pairs in order to obtain the result words $C^{(i)}$. Note that only one extra cycle is needed to convert the result from the Carry-Save form to the nonredundant form.
Design Considerations
In [21], an analysis of the area and time tradeoffs is given for the scalable multiplier. The architecture allows designs with different word lengths and different pipeline organizations for varying values of operand precision. In addition, the area can be treated as a design constraint. Thus, one can adjust the design according to the given area, and choose appropriate values for the word length and the number of pipeline stages accordingly. We give a similar analysis for the scalable and unified architecture. We are targeting two different classes of ranges for operand precision:
• High precision range which includes 512, 768 and 1024, is intended for applications requiring the exponentiation operation.
• Moderate precision range which includes 160, 192, 224, and 256, is typical for elliptic curve cryptography.
The propagation delay of the PU is independent of the wordsize $w$ when $w$ is relatively small, and thus all comparisons among different designs can be made under the assumption that the clock cycle is the same for all cases. The area consumed by the registers for the partial sum, the operands, and the modulus is also the same for all designs, and we do not treat them as parts of the multiplier module. The proposed scheme yields the worst performance for the case $w = m$ in the high precision range, since some extra cycles are introduced by the PU in order to allow word-serial computation, when compared to other full-precision conventional designs. On the other hand, using many pipeline stages with small wordsize values brings no advantage after a certain point. Therefore, the performance evaluation reduces to finding an optimum organization for the circuit.
In order to determine the optimum selection for our organization, we obtain implementation results by synthesizing the circuit with Mentor Graphics tools using 1.2µm CMOS technology. The cell area for a given word size w is obtained as
units, and is slightly different from the one found in [21], where the multiplication factor in the formula is the area cost provided by the synthesis tool for a single bit slice. Note that a 2-input NAND gate takes up 0.94 units of area. In the pipelined organization, the area of the inter-stage latches is also significant; it was measured as $A_{latch}(w) = 8.32w$ (6) units. Thus, the area of a pipeline with k processing elements is given as
units. For a given area, we are able to evaluate different organizations and select the most suitable one for our application. The graphs given in Figure 11 allow such evaluations for a fixed area of 15,000 gates. For both the moderate and high precision ranges, a number of stages between 5 and 10 is likely to give the best performance. For the high precision cases, using fewer than 5 stages yields very poor performance, since the fixed area becomes insufficient for large wordsizes and the performance degradation due to pipeline stalls becomes a major problem. A small number of stages with very long word sizes seems to provide reasonable performance in the moderate range; however, because of incompatibility issues with very long word sizes and the inefficiency that appears as the precision increases, using fewer than 5 stages is not advised. We avoid using many stages for two reasons:
• high utilization of the PUs will be possible only for very high precision, and
• the execution time may have undesirable oscillations.
The behavior mentioned in the latter item results from two facts:
• there are extra stages at the end of the computations, and
• there is not a good match between the number of words e and the number of stages k, causing an underutilization of stages in the pipeline.
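The fixed-area evaluation discussed above can be explored with a small sketch. Only the latch coefficient 8.32 and the NAND gate area 0.94 come from the text. Everything else is an assumption made for illustration: the pipeline is modeled as k cells plus k - 1 inter-stage latches, the cell area is taken to be linear in w with a caller-supplied coefficient `cell_area_per_bit`, the 15,000-gate budget is converted to area units through the NAND gate area, and each stage is assumed to consume one bit of the multiplier operand per pass. The value 50.0 used below is a placeholder, not the coefficient of the cell-area formula.

```python
import math

LATCH_AREA_PER_BIT = 8.32  # from Eq. (6): A_latch(w) = 8.32w
NAND2_AREA = 0.94          # area units of a 2-input NAND gate (from the text)


def pipeline_area(k, w, cell_area_per_bit):
    """ASSUMED model: k processing cells plus (k - 1) inter-stage latches."""
    return k * cell_area_per_bit * w + (k - 1) * LATCH_AREA_PER_BIT * w


def max_wordsize(budget_units, k, cell_area_per_bit):
    """Largest w whose k-stage pipeline fits the budget under the model above."""
    per_bit = k * cell_area_per_bit + (k - 1) * LATCH_AREA_PER_BIT
    return int(budget_units // per_bit)


def last_pass_idle_stages(m, k):
    """Stages left without a multiplier bit in the final pass.

    ASSUMES one multiplier bit per stage per pass, so ceil(m / k) passes.
    """
    passes = math.ceil(m / k)
    return passes * k - m


# Hypothetical budget: 15,000 gates expressed in area units.
budget = 15000 * NAND2_AREA
PLACEHOLDER_CELL_COEFF = 50.0  # illustrative only, not the synthesized value

for k in range(2, 13):
    w_max = max_wordsize(budget, k, PLACEHOLDER_CELL_COEFF)
    idle = last_pass_idle_stages(1024, k)
    print(f"k={k:2d}  max w={w_max:3d}  idle stages in last pass={idle}")
```

Under this model the largest affordable wordsize shrinks roughly in proportion to 1/k, and the idle-stage count changes irregularly with k. That irregularity is one possible reading of the oscillations in execution time mentioned above; the actual timing behavior also depends on how e relates to k.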
From the synthesis tool we obtained a minimum clock cycle time of 11 nanoseconds, which allows a clock frequency of up to 90 MHz with 1.2µm CMOS. Using CMOS technology with a smaller feature size, we can attain much faster clock speeds. It is important to know how fast this hardware organization really is in comparison to a software implementation, since the answer determines whether it is worth designing a hardware module. In general, it is difficult to compare hardware and software implementations. In order to obtain realistic comparisons, a processor which uses a similar clock cycle and technology must be chosen. We selected an ARM microprocessor [5] with an 80 MHz clock, which has a very simple pipeline. We compare the GF(p) multiplication timing on this processor against that of our hardware module. We use the same 80 MHz clock frequency for the hardware module, with a pipeline organization of w = 32 and k = 7. The Montgomery multiplication algorithm, in turn, is written in ARM assembly language using all known optimization techniques [8, 10]. Table 3 shows the multiplication timings and the speedup.
Conclusion
Using the design methodology proposed in [21], we obtained a scalable field multiplier for GF(p) and GF(2^m) in a unified hardware module. The methodology can also be used to design separate modules for GF(p) and GF(2^m) which are fast, scalable, and area-efficient. The fundamental contribution of this research is to show that it is possible to design a dual-field arithmetic unit without compromising scalability, time performance, or area efficiency. We also presented a dual-field addition module which is suitable for the pipeline organization of the multiplier. The adder is scalable and capable of performing addition in both types of fields. Our analysis shows that a pipeline consisting of several stages is adequate and more efficient than a single unit processing very long words. Working with relatively short words diminishes data paths in the final circuit, reducing the required bandwidth. The proposed multiplier was synthesized using the Mentor tools, and a circuit capable of working with clock frequencies up to 90 MHz was obtained. Except for the upper limit on the precision, which is dictated only by the availability of memory to store the operands and internal results, the module is capable of performing infinite-precision Montgomery multiplication in GF(2^m) and GF(p).
