ABSTRACT
INTRODUCTION
As an indispensable component of modern consumer electronics, security applications, such as IC cards used for personal authentication and domestic network applications, play an important role. Data security receives constant attention, since people increasingly communicate with each other through various electronic devices over networks. Security applications are based upon intensive computations of cryptographic algorithms, which generally involve arithmetic operations in large Galois fields.
We classify Galois field architectures by the basis representation of field elements. The most popular representations are the standard (polynomial), normal, and dual bases. The dual basis representation has been used for relatively small Galois field applications, such as 8-bit Reed-Solomon codes. However, the dual basis representation is not suitable for the large Galois fields used for cryptography [1]. Thus, polynomial basis and (optimal) normal basis representations have been used for cryptographic applications to date. The normal basis representation allows very efficient exponentiation, but its regularity and extensibility for hardware implementation are worse than those of the standard basis [2]. The polynomial basis offers good solutions to most GF computational problems and is also the easiest representation to use. Therefore, we focus on the polynomial basis throughout this document.
There are a relatively small number of publications on Galois field architectures specifically designed for cryptographic applications. Previous research on GF multiplication in polynomial basis representation includes general serial multiplier architectures [2], bit-systolic parallel multiplier architectures [3], and the hybrid multiplier architecture [4]. Although the hybrid multiplier architecture is known to exhibit the best throughput versus cost ratio, it cannot be used in the prime exponent field GF(2^m) with m = p (p is prime), but only in the composite exponent field GF(2^m) with m = nk (m is composite). Thus, the security level of applications is reduced [5].
Publications on implementing hardware GF dividers have been rare due to the infrequent use of division in processing finite field applications. Rather, GF division has been implemented in software [6]. Furthermore, if projective coordinates are adopted, division need not be used at all [7]. The projective coordinate scheme avoids the inversion operation at point addition or at point doubling in elliptic curve arithmetic at the cost of more field multiplications. Using projective coordinates is efficient until a polynomial inversion can be accomplished with less than 11 field multiplications, and thus is not within our consideration [8]. Brunner et al. introduced a fundamental algorithm for obtaining fast GF inversion by using the modified Euclidian algorithm [9]. There have been attempts to decrease the number of clock cycles per inversion based on Brunner's work using systolic arrays, but they resulted in a heavy cost of hardware resources [10]. The standard method for point multiplication has been the double-and-add algorithm, which is analogous to the repeated square-and-multiply algorithm used for exponentiation [11].
We studied the previous literature in detail and consolidated the strong points of other studies in order to propose novel types of GF multiplier and GF divider, and an efficient algorithm for point multiplication. The outline of the paper is as follows: We start by introducing the mechanism of the Elliptic Curve Cryptosystem in Chapter 2. Chapter 3 discusses our proposed finite field multiplier and its performance comparison with other GF multiplier architectures, and Chapter 4 describes our finite field divider, which is also compared with other studies. Chapter 5 covers the efficient algorithm for point multiplication that we developed. Our conclusions are given in Chapter 6.
Manuscript received June 25, 2001

ELLIPTIC CURVE CRYPTOSYSTEM
Background
There are mainly two categories of cryptographic methods for enciphering and deciphering data: secret key schemes use the same key for encryption and decryption; public key schemes use different keys. Secret key schemes are represented by DES and public key schemes are represented by RSA [12] [13]. Arithmetic operations for public key schemes are much slower than those of secret key schemes having the same security level; however, public key schemes are essential in that the key of the secret key scheme is changed frequently and is encrypted by the public key cryptosystem before each data transaction. The Elliptic Curve Cryptosystem (ECC), which was first suggested by Victor Miller [14] and Neal Koblitz [15] in the 1980s, has emerged as an alternative to RSA due to its shorter key lengths at the same security level. It is based on a computationally hard problem, the so-called discrete logarithm problem in large finite groups. Given a large finite group G and group elements $P_0, Q_0 \in G$ with $Q_0 = k \cdot P_0$, it is computationally infeasible to compute k given only $P_0$ and $Q_0$.
In the following we will only consider so-called nonsupersingular elliptic curves, which provide the highest security [7]. The elliptic curve E, the irreducible polynomial P(x), and a point P on the curve are publicly known beforehand. Let E be a nonsupersingular elliptic curve:
$y^2 + xy = x^3 + ax^2 + b$ (1)
defined over $F_p$ ($p = 2^m$). Suppose a point P(x, y) is used to generate the whole addition group of E and the message point is represented as $M(m_x, m_y)$. The elements of $F_p$ are represented with the polynomial basis. An irreducible prime polynomial P(x) is selected according to the order of the given finite field. Fig. 1 shows how the ECC-based secure data exchange works. Every field operation in the procedure, such as addition, multiplication, and division, is performed by way of finite field arithmetic modulo P(x). User Receiver randomly chooses an integer $k_a$ and calculates his (her) public key $k_a P$. The integer $k_a$ is user Receiver's secret key.
User Sender also randomly chooses an integer $k_b$, obtains his (her) public key $k_b P$, and calculates the point $k_a k_b P = (x', y')$. Now, multiplying the message M and the point $k_a k_b P$ encrypts the message into $M_{cipher}$. User Receiver refers to user Sender's public key $k_b P$ to calculate the shared secret point $k_a k_b P$. To decipher the encrypted message, user Receiver divides the encrypted point $M_{cipher}$ by the shared secret point $k_a k_b P$ modulo P(x).
Fig. 1 ECC-based secure data exchange between Receiver and Sender
$x_3 = \lambda^2 + \lambda + a, \quad y_3 = x_1^2 + (\lambda + 1)x_3$ (3)
Negation routine:
Let $P(x_1, y_1)$ be a point on the curve.
From the above rules, we can discern the number of field operations required to carry out each routine. In the Add routine, 8 additions, 1 multiplication, 1 division, and 1 squaring are required. We should check that the divisor in $\lambda$, $(x_1 + x_2)$, is not zero. The Double routine requires 4 additions, 1 multiplication, 2 squarings, and 1 division. Here, too, we should check that the divisor in $\lambda$, $x_1$, is not zero. The Negation routine requires one addition. This operation is needed when implementing the fast algorithm for the calculation of kP. As explained later in Chapter 5, the values of (-P) and (-2P) are needed in the algorithm we developed.
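The Add, Double, and Negation routines can be illustrated with a small software model. The sketch below is not the hardware described in this paper; it uses the standard affine formulas for a curve $y^2 + xy = x^3 + ax^2 + b$ over a toy field GF(2^4). The polynomial x^4 + x + 1, the coefficients a = b = 1, and all function names are assumptions chosen only for illustration.

```python
# Toy parameters (assumptions for illustration only).
M = 4
POLY = 0b10011          # x^4 + x + 1
A_COEF, B_COEF = 1, 1   # curve coefficients a and b (b must be nonzero)
O = None                # point at infinity

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M) modulo POLY."""
    z = 0
    while b:
        if b & 1:
            z ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return z

def gf_inv(a):
    """Inverse of a nonzero element as a^(2^M - 2) by repeated squaring."""
    result = 1
    for _ in range(M - 1):
        a = gf_mul(a, a)
        result = gf_mul(result, a)
    return result

def on_curve(pt):
    if pt is O:
        return True
    x, y = pt
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ gf_mul(A_COEF, gf_mul(x, x)) ^ B_COEF
    return lhs == rhs

def negate(pt):
    """-(x, y) = (x, x + y): a single field addition."""
    if pt is O:
        return O
    x, y = pt
    return (x, x ^ y)

def double(pt):
    """2P; the divisor x1 must be nonzero, otherwise the result is O."""
    if pt is O or pt[0] == 0:
        return O
    x1, y1 = pt
    lam = x1 ^ gf_mul(y1, gf_inv(x1))
    x3 = gf_mul(lam, lam) ^ lam ^ A_COEF
    y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
    return (x3, y3)

def add(p, q):
    """P + Q; the divisor (x1 + x2) must be nonzero, otherwise Q = -P."""
    if p is O:
        return q
    if q is O:
        return p
    if p == q:
        return double(p)
    x1, y1 = p
    x2, y2 = q
    if x1 == x2:
        return O
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)
```

Both zero-divisor checks mentioned above appear as explicit branches: `double` returns the point at infinity when x1 is zero, and `add` does so when x1 equals x2 for distinct points.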
As the basic arithmetic of the ECC-based cryptosystem, GF multiplication and GF division occupy the most important positions. In the next two chapters, we present the fast and efficient GF multiplier and GF divider architectures we have developed, and discuss their performance in detail.
FAST FINITE FIELD MULTIPLIER
Multiplier Architecture
This section describes the proposed fast finite field multiplier architecture for Galois fields using the polynomial basis. GF multiplication is considered to be the most critical operation for performance enhancement of practical public-key cryptographic applications, which use finite fields more than 150 bits wide.
We consider arithmetic operations in one of the extension fields of GF(2). The extension degree is denoted by m, so that the field can be represented as GF(2^m). This field is isomorphic to GF(2)[x]/(P(x)), where $P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i$ is an irreducible polynomial of degree m with $p_i \in GF(2)$. The calculation of the product of two arbitrary field elements in GF(2^m), $Z(x) = A(x) \cdot B(x)$, where $A(x) = \sum_{i=0}^{m-1} a_i x^i$ and $B(x) = \sum_{i=0}^{m-1} b_i x^i$, proceeds as follows:
Expression (5) describes the traditional Mastrovito serial multiplier structure [2]. The basic concept of our proposed multiplier architecture is to divide the expression into n parts to obtain an n-times speed-up. We explain our architecture in detail for the case of speed-up factor n = 2. To double the overall throughput, we divide the second row of (5) into even and odd parts.
In (6), the expression of $Z_{even}(x)$ is similar to the traditional serial multiplier, except that the orders are even integers and increase by a step of 2. However, $Z_{odd}(x)$ looks complicated at a glance. We modified the expression by taking the x term out of the brackets. Now, we can handle $Z_{odd}(x)$ in a similar fashion. We perform the modulo reduction using the following property:
Here, $p_{m-1}$ in the second row of equation (7) can be substituted with zero for simplification without any loss of generality, because we use prime polynomials with only low Hamming weights (trinomials or pentanomials having $p_i = 1$ in the lower orders of the polynomial) in real-world applications. Consequently, the modulo reduction equation by $x^2$ is derived as follows:
This expression allows the reduction by $x^2$ to produce the results of $Z_{even}(x)$ and $Z_{odd}(x)$ simultaneously in the same clock cycle, thereby doubling the throughput. With an odd exponent m, $Z_{even}(x)$ and $Z_{odd}(x)$ in the first row of equation (6) can be computed without wasting any cycles. Since the exponents used in high-security applications are always prime numbers, the choice of an odd exponent provides these advantages at one move.

Fig. 2 shows the schematic of the $x^2$-multiplying circuit based on (8). The symbol ⊗ represents the logical AND gate, while the symbol ⊕ represents the logical XOR gate. The $p_i$ components in the block diagram can be thought of as on/off switches. Fig. 3 represents the implementation of $Z_{even}(x)$ in expression (6). The part below the shaded box functions as the $x^2$-multiplying circuit. In the $Z_{odd}(x)$ circuit, however, there should be a part that can be operated as the $x^2$-multiplying circuit, as well as the traditional x-multiplying circuit required for treating the x term taken out of the brackets in the third row of (6). We implement $Z_{odd}(x)$ in expression (6) by simply adding (m-1) 2-to-1 multiplexers (muxes), plus $w_p - 1$ extra muxes for selecting either the x- or the $x^2$-multiplying circuit, where $w_p$ is the Hamming weight of the prime polynomial, whose value is very small relative to the order of the finite field. Fig. 4 represents the implementation of $Z_{odd}(x)$ with the added components. The complementary input '0' in the B(x)-coefficient in the last clock cycle is required for treating the x term taken out of the brackets, as explained above.

This way, the number of clock cycles for one field multiplication is reduced by a factor of two. The most resource-consuming factor is the number of registers. The impact of the other logic gates is negligible if we customize the design of the circuits. Our double fast multiplier consumes approximately 1.5 times as many resources as the traditional multiplier architecture.
We do not have to consume double the resources to obtain two-times the speed, due to sharing the A(x)-coefficient register.
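The even/odd split can be checked in software. The following Python sketch is a behavioral model under stated assumptions, not the circuit of Fig. 3 and Fig. 4: it uses GF(2^7) with the example trinomial x^7 + x + 1, steps one shared A(x) register by $x^2$ per cycle, keeps two accumulators, and applies the factored-out x to the odd part at the end. The function names are hypothetical.

```python
M = 7
POLY = 0b10000011   # x^7 + x + 1, an example trinomial (assumption)

def mul_x(a):
    """Multiply by x and reduce modulo POLY."""
    a <<= 1
    if a >> M:
        a ^= POLY
    return a

def mul_serial(a, b):
    """Traditional bit-serial multiplication: one bit of B(x) per cycle, M cycles."""
    z = 0
    for i in range(M):
        if (b >> i) & 1:
            z ^= a
        a = mul_x(a)
    return z

def mul_double(a, b):
    """Two bits of B(x) per cycle: (M + 1) // 2 cycles with a shared A(x) register."""
    z_even = z_odd = 0
    for k in range((M + 1) // 2):
        if (b >> (2 * k)) & 1:        # even coefficient b_{2k}
            z_even ^= a
        if (b >> (2 * k + 1)) & 1:    # odd coefficient b_{2k+1} (zero beyond degree M-1)
            z_odd ^= a
        a = mul_x(mul_x(a))           # x^2 * A(x) mod P(x)
    return z_even ^ mul_x(z_odd)      # the x term taken out of the brackets
```

In this model the A(x) register is the only state shared by both halves, which mirrors the reason given above for the resource cost staying below a factor of two.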
We can extend this idea to build a multiplier that is 3 times faster than the traditional serial one by splitting the expression (5) into 3 parts: $Z_{3k}(x)$, $Z_{3k+1}(x)$, and $Z_{3k+2}(x)$. The range of the value k should depend on the situation, as we considered in the above case of the double fast multiplier.
Fig. 5 Block diagram of hybrid multiplier
The hardware implementation of $Z_{3k}(x)$ is similar to the traditional serial architecture using an $x^3$-multiplying circuit, except that the orders are multiples of 3. In implementing $Z_{3k+1}(x)$ and $Z_{3k+2}(x)$, support for an $x^3$-multiplying circuit and an x-multiplying circuit with additional selecting muxes is required. Likewise, we can design a multiplier which is t times as fast as the traditional serial one by using an $x^t$-multiplying circuit, while consuming approximately (t+1)/2 times the resources. Since the proposed multiplier architecture basically follows the traditional one, the critical delay path is nearly as short as that of the serial multiplier architecture, independent of the speed-up factor t.
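The generalization to a speed-up factor t can be modeled the same way. The sketch below is an illustrative assumption about how the split may be scheduled in software: it groups the coefficients of B(x) by exponent modulo t, advances a shared A(x) register by $x^t$ per cycle, and recombines the t partial sums with a Horner-style chain of x-multiplications. Function names and the test field are hypothetical.

```python
def mul_x(a, m, poly):
    """Multiply a field element by x and reduce modulo poly (degree m)."""
    a <<= 1
    if a >> m:
        a ^= poly
    return a

def mul_serial(a, b, m, poly):
    """Reference bit-serial multiplication, m cycles."""
    z = 0
    for i in range(m):
        if (b >> i) & 1:
            z ^= a
        a = mul_x(a, m, poly)
    return z

def mul_t_way(a, b, t, m, poly):
    """Process t bits of B(x) per cycle, ceil(m / t) cycles in total."""
    acc = [0] * t
    cycles = -(-m // t)
    for k in range(cycles):
        for j in range(t):
            if (b >> (t * k + j)) & 1:
                acc[j] ^= a               # partial sum for exponents congruent to j mod t
        for _ in range(t):
            a = mul_x(a, m, poly)         # x^t * A(x) mod P(x)
    z = 0
    for j in reversed(range(t)):          # z = acc[0] + x*(acc[1] + x*(acc[2] + ...))
        z = mul_x(z, m, poly) ^ acc[j]
    return z
```

With t = 1 the function degenerates to the serial multiplier, which gives a convenient sanity check.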
Multiplier performance evaluation
The proposed multiplier architecture has been simulated and verified against results obtained from a traditional serial multiplier modeled in HDL. In order to obtain simple and reasonable performance evaluations, we considered an implementation in GF(2^m) with m = 2k and compared it with the hybrid architecture, which is known to show the best throughput versus cost ratio, yet contains the critical defect that it cannot be used in the prime exponent field GF(2^m) with m = p (p is prime), but only in the composite exponent field GF(2^m) with m = nk (m is composite). Fig. 5 is the block diagram of the hybrid multiplier architecture. If n = 1, the structure operates as the traditional bit-serial architecture with all lines having one-bit connections. If n > 1, however, all connections are n-bit-wide buses and all arithmetic is performed in the subfield GF(2^n), thereby producing results in k clock cycles.
We focus our comparison on the shaded area of Fig. 3 and Fig. 4, since the width of the serial B(x)-coefficient register is the same in both architectures and, according to SEC 1 [16], the Hamming weight of the suggested prime polynomials is very small compared to the exponent m and can thus be ignored. Table 1 shows the comparison results. In the proposed architecture, there are m ANDs, m XORs, and m registers in $Z_{even}(x)$. Similarly, there are m ANDs, m XORs, $(m - 1 + w_p - 1 = m + w_p - 2)$ 2-to-1 muxes, and m registers in $Z_{odd}(x)$.
There are also m registers for storing A(x)-coefficients, and m XORs for the final addition. Compared to the traditional serial architecture, the critical delay path lies in $Z_{odd}(x)$ with one additional mux, while the hybrid architecture includes a parallel GF(2^n) multiplier whose delay increases severely as n increases. The resulting cycle count is one clock cycle longer than in the hybrid architecture, because the exponent in the case above is even, which is not the case for prime-exponent finite fields. However, this is insignificant compared to the overall throughput.
In implementing encryption in the elliptic curve cryptosystem, finite field multiplication is the most time-critical operation. We have proposed the above multiplier, which is t times as fast as the traditional serial multiplier and consumes about (t+1)/2 times the resources, and have verified it with numerical expressions and HDL. It can also be used directly as a GF squarer without any additional logic if we control the inputs. In the following chapter, we propose an advanced, area/time-efficient GF divider based on the modified Euclidian algorithm.
ADVANCED FINITE FIELD DIVIDER
Galois field division
Finite field division in GF(2^m) has the form A(x)/B(x) modulo P(x), where the degrees of A(x) and B(x) are always lower than m, and $P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i$ is an irreducible polynomial of degree m with $p_i \in GF(2)$. Because division is used infrequently in finite field applications and has usually been implemented in software, GF dividers have not been studied very actively. However, the need for hardware implementation has risen with the higher level of security required in complicated applications. GF division proceeds through the following two steps:
Step 1: Find the inverse of B(x), $B(x)^{-1}$.
Step 2: Multiply A(x) by $B(x)^{-1}$.
Consequently, GF division can be expressed as GF multiplication by the inverse. The simplest way to find the inverse of a field element is by table lookup. The table-lookup method is efficient on a relatively small finite field such as GF(2^m) (m < 8), but it cannot be effectively implemented in VLSI for large finite fields because the memory requirement increases exponentially as m increases [9]. For this reason, algorithm-based methods are employed for GF division in large finite fields. The two most frequently used algorithms for GF division are as follows:
Fermat's theorem: using the theorem that $B^{2^m} = B$ for any element $B \in GF(2^m)$, we can obtain the inverse of the field element B by recursive squaring and multiplication, since the inverse can be expressed as $B^{-1} = B^{2} \cdot B^{2^2} \cdot B^{2^3} \cdots B^{2^{m-1}}$ in GF(2^m). Many software implementations adopt this method, but it cannot be effectively implemented in VLSI, since the performance versus cost ratio is not suitable [2].
Euclid's algorithm: we can find the inverse of a field element in the course of obtaining the GCD of two polynomials. This method is especially useful when the field elements are represented using polynomial basis.
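Both inversion approaches can be compared in a few lines of Python. This is a minimal sketch, assuming the toy field GF(2^7) with the example polynomial x^7 + x + 1; `inv_fermat` follows the squaring product given above, and `inv_euclid` is a textbook extended Euclidean algorithm on polynomials over GF(2), not the modified variant of [9]. All names are hypothetical.

```python
M = 7
POLY = 0b10000011   # x^7 + x + 1 (example polynomial, an assumption)

def gf_mod(a):
    """Reduce a polynomial over GF(2) modulo POLY."""
    while a.bit_length() > M:
        a ^= POLY << (a.bit_length() - M - 1)
    return a

def gf_mul(a, b):
    """Carry-less multiply followed by reduction."""
    z = 0
    while b:
        if b & 1:
            z ^= a
        a <<= 1
        b >>= 1
    return gf_mod(z)

def inv_fermat(b):
    """B^-1 = B^2 * B^4 * ... * B^(2^(M-1)) by recursive squaring."""
    result, sq = 1, b
    for _ in range(M - 1):
        sq = gf_mul(sq, sq)
        result = gf_mul(result, sq)
    return result

def inv_euclid(b):
    """Extended Euclid: maintain g1*b = u and g2*b = v (mod POLY) until u = 1."""
    u, v = b, POLY
    g1, g2 = 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:
            u, v = v, u
            g1, g2 = g2, g1
            j = -j
        u ^= v << j
        g1 ^= g2 << j
    return gf_mod(g1)
```

The Fermat variant costs M - 1 squarings and M - 1 multiplications, whereas the Euclidean variant needs only shifts and XORs, which is why it maps more naturally to polynomial-basis hardware.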
We propose an efficient and advanced type of GF divider using the Euclidian algorithm. There have been studies on GF dividers based on the Euclidian algorithm [9] [10]. The results in [9] are area-efficient but show relatively low throughput and low extensibility. The work in [10] employs a systolic array with a modular structure to achieve higher throughput using a pipelining technique, but it has a very high area complexity. Our proposed divider architecture is based on the division algorithm in [9]. We improved the reference divider architecture by merging an n-bit look-up table method into a new type of GF divider that improves area efficiency and offers n-times throughput. We present the adopted algorithm and the proposed idea below.
Division algorithm & improved architecture
In [9], Brunner et al. suggested a modified Euclidian algorithm applicable to GF inversion/division. Our divider architecture is based on this modified Euclidian algorithm as follows:
Division algorithm using the modified Euclidian method
For the derivation of this algorithm and the expressions for the temporary values R, S, U, and V, refer to [9]. As the above algorithm executes polynomial division, the degree of R or S decreases by one per iteration during 2m iterations, depending on the values of the MSBs of the R and S registers and a status bit which indicates whether the value of delta is zero or not.
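A bit-level software model helps to follow the roles of R, S, U, V, and delta. The Python sketch below is our reading of how a modified Euclidian division of this kind is commonly formulated (2m iterations, decisions taken on the MSBs of R and S and on whether delta is zero); it may differ in detail from the listing in [9]. The field GF(2^7), the polynomial x^7 + x + 1, and the function names are assumptions for illustration.

```python
M = 7
POLY = 0b10000011   # x^7 + x + 1 (example polynomial, an assumption)

def mul_x(u):
    """U := x * U mod P(x)."""
    u <<= 1
    if u >> M:
        u ^= POLY
    return u

def div_x(u):
    """U := U / x mod P(x); P(x) has constant term 1, so adding it clears bit 0."""
    if u & 1:
        u ^= POLY
    return u >> 1

def gf_div(a, b):
    """Return A(x)/B(x) mod P(x) for nonzero b, using 2*M iterations."""
    r, s = b, POLY        # (M+1)-bit registers
    u, v = a, 0           # M-bit registers
    delta = 0
    for _ in range(2 * M):
        if not (r >> M) & 1:          # MSB of R is zero: shift R up
            r <<= 1
            u = mul_x(u)
            delta += 1
        else:
            if (s >> M) & 1:          # MSB of S is one: cancel it with R
                s ^= r
                v ^= u
            s <<= 1
            if delta == 0:            # status bit: swap the roles of R and S
                r, s = s, r
                u, v = v, u
                u = mul_x(u)
                delta = 1
            else:
                u = div_x(u)
                delta -= 1
    return u

def gf_mul(a, b):
    """Reference multiplication used only to check the quotient."""
    z = 0
    for i in range(M):
        if (b >> i) & 1:
            z ^= a
        a = mul_x(a)
    return z
```

Calling `gf_div(1, b)` yields the inverse of b, so the same loop serves for inversion and for division.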
We improve the division algorithm suggested above by depending on the 2 most significant bits of the R and S registers simultaneously, in order to obtain the division result in m iterations. As a small overhead, we have to prepare a ROM table to store all possible operations, since the possible number of operations increases from $2^3 = 8$ to $2^5 = 32$, which results from depending on the bits $r_m$, $r_{m-1}$, $s_m$, $s_{m-1}$, and the indicator bit. Fig. 6 represents the block diagram of our proposed GF divider. The R, S, U, and V circuit blocks are designed for executing the 32 operations stored in the ROM table. The R and S blocks are (m+1) bits wide, since the irreducible polynomial P(x) is stored in S, and there is an initial assumption that the degree of the polynomial B(x) is m. Table 2 shows all of the operations of our GF divider. The reference bits $r_m$, $r_{m-1}$, $s_m$, $s_{m-1}$, and the indicator bit of delta are combined in each iteration to produce the control signals from the CTL-ROM. The R, S, U, and V cells execute the operation given in Table 2, dependent on the control signals from the CTL-ROM, cycle by cycle. Since we process double the work per iteration in the same division algorithm, a modified structure of the U cell is introduced by deriving the expressions below, as in the design of the double-fast GF multiplier explained in the previous chapter:
$x^2 U(x) = u_0 x^2 + u_1 x^3 + u_2 x^4 + \cdots + u_{m-2} x^m + u_{m-1} x^{m+1}$
Fig. 6 Block diagram of proposed GF divider
We arranged the expressions using the property in (7) and implemented the expressions (9) and (10) just as we did in the design of the $x^2$-multiplying circuit. We finally obtain the division result A(x)/B(x) after m iterations.
Divider performance evaluation
We compare the performance of our proposed GF divider with that of reference dividers implemented using the division algorithm which employs the modified Euclidian algorithm in terms of area, clock cycles, critical delay, and extensibility.
The algorithm that Brunner suggested for designing a GF divider is the basis for GF dividers using the Euclidian algorithm, and it can be extended to produce one division result in 2m/n cycles with an n-folded form. However, Brunner's divider architecture is not fit for extension on a large scale, because the critical path delay and the area of each functional cell, except for the flip-flops, increase proportionally to n in the n-folded form.
According to a paper on GF dividers by Guo, one cycle per division is possible using a systolic array together with a pipelining technique, but this has a weak point in area efficiency, costing as much as $O(m \log_2 m)$ to achieve a throughput of one cycle/division and $O(m^2)$ for a throughput of m cycles/division.
In our modification of Brunner's work, the proposed structure of our GF divider has a throughput of 2m/n clock cycles per division, depending on the n most significant bits of the R and S cells. Table 3 presents the comparative results of our proposed GF divider architecture, implemented to produce one division result in m clock cycles, with previous dividers having the same throughput. Table 3 shows that the performance of Brunner's work and ours does not differ when implementing for a throughput of m clock cycles per division. Considering extensibility, however, our proposed GF divider requires far less overhead than Brunner's. Our proposed divider, which depends on the two most significant bits simultaneously, is more advantageous than an n-folded structure in that the number of cells in ours does not change; it only requires more operations in the R, S, U, and V cells and a larger ROM table, which grows exponentially with n. In Brunner's divider, by contrast, the number of cells and the critical delay increase in proportion to n.
[Table 3: area complexity, critical delay, and extensibility of the compared GF dividers]

A single point multiplication requires multiple computations of point addition (P ≠ Q) and point doubling (P = Q). When implementing encryption systems using ECC, we must use an efficient method to compute the equation (11). The standard method for point multiplication is the double-and-add algorithm, which is analogous to the repeated square-and-multiply algorithm used for exponentiation. If $k = \sum_{i=0}^{m-1} b_i 2^i$, $b_i \in \{0, 1\}$, then $kP = \sum_{i=0}^{m-1} b_i (2^i P)$. The traditional algorithm for computing kP is as follows:
Traditional algorithm for computing kP
It is clear that we need m-1 Double operations and $w_k$ Add operations in the traditional double-and-add algorithm, where $w_k$ is the Hamming weight of k. To reduce the number of iteration steps and field operations, in this chapter we apply the modified Booth's algorithm, which is well known for fast binary multiplication in computer arithmetic.
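The double-and-add procedure can be sketched generically. The Python function below is an illustrative assumption, not the listing of this paper: it scans the bits of k from the most significant end and takes the group operations as parameters, so that it can be exercised with plain integers standing in for curve points. All names are hypothetical.

```python
def double_and_add(k, P, add, double, zero):
    """MSB-first double-and-add; returns (kP, number of doubles, number of adds)."""
    Q = zero
    n_double = n_add = 0
    for bit in bin(k)[2:]:
        Q = double(Q)          # one Double per bit of k
        n_double += 1
        if bit == "1":
            Q = add(Q, P)      # one Add per nonzero bit of k
            n_add += 1
    return Q, n_double, n_add

# Integers under addition serve as a stand-in group: P = 1, so kP = k.
result, doubles, adds = double_and_add(13, 1, lambda a, b: a + b, lambda a: 2 * a, 0)
```

For k = 13 (binary 1101) this variant performs 4 doublings and 3 additions. The first doubling acts on the identity and can be skipped, which is how a count of m-1 doublings arises.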
New algorithm for computing kP using redundant form
We focus on the adoption of redundancy in the modified Booth's algorithm, which is based on the fact that fewer
    if r_i = -P then Q := Add -P to Q
    if r_i = -2P then Q := Add -2P to Q
    End (Q = kP)
(We have employed terminology from radix-4 Booth's recoding in the above algorithm.)
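The radix-4 recoding and the resulting multiplication loop can be sketched as follows. This is a minimal Python model under stated assumptions: the digit rule is the standard radix-4 Booth rule $d_i = b_{2i} + b_{2i-1} - 2 b_{2i+1}$, the Quad step is modeled as two doublings rather than the dedicated Quad routine, and integers again stand in for curve points. Function names are hypothetical.

```python
def booth_radix4(k):
    """Recode k >= 0 into digits in {-2, -1, 0, 1, 2}, least significant first."""
    digits = []
    prev = 0                              # b_{-1} = 0
    while k or prev:
        b0 = k & 1
        b1 = (k >> 1) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
        k >>= 2
    return digits

def booth_multiply(k, P, add, double, neg, zero):
    """Compute kP with one Quad step and at most one Add per radix-4 digit."""
    two_p = double(P)
    table = {1: P, 2: two_p, -1: neg(P), -2: neg(two_p)}   # precomputed P, 2P, -P, -2P
    Q = zero
    for d in reversed(booth_radix4(k)):
        Q = double(double(Q))             # Quad step, modeled as two doublings
        if d:
            Q = add(Q, table[d])
    return Q
```

As a usage example, `booth_radix4(3)` returns `[-1, 1]`, since 3 = -1 + 1·4, and `booth_multiply(3, 1, lambda a, b: a + b, lambda a: 2 * a, lambda a: -a, 0)` returns 3.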
To reduce the number of steps by half, we derive an equation for computing 4P from (3), as seen below:
Quad routine:
Let $P(x_1, y_1)$ be a point on the curve. If $x_1 = 0$, the result of 4P is O. If $x_1 \neq 0$, $4(x_1, y_1) = R(x_3, y_3)$, where
From this formula, we can determine the number of field operations. The Quad routine usually requires 10 additions, 1 multiplication, 2 divisions, and 4 squarings. Fig. 7 shows a simple example of a comparison between the traditional double-and-add algorithm and our proposed new algorithm using radix-4 redundant recoding. The number of iterations decreases from m to [m/2] + 1 steps. Table 4 summarizes the improvement in the number of steps and required EC operations. The number of operations in Table 4 is calculated based on the probability that is dependent on the Hamming weight of the prime polynomial. The probability of the existence of 1 in the binary representation of k during m steps in the double-and-add algorithm is 0.5, and the probability of the existence of a nonzero Booth's recoding term is 6/8 [18]. The new algorithm exhibits a reduction of about 12.5% in handling Add operations. Furthermore, the new algorithm is also advantageous because of its use of Quad operations. The Quad routine is derived by manipulating the expressions in the Double routine, resulting in a reduction of 1 field multiplication, and the proposed algorithm can become far more efficient if the Quad routine is enhanced using higher mathematics in the future. The proposed algorithm additionally requires a Booth's recoding circuit and memory space for storing the values of P, 2P, -P, and -2P. The numbers of field operations for calculating kP are presented in detail in Table 5, which demonstrates the efficiency of our proposed algorithm. We achieved a performance improvement of about 19% in multiplication, considering that our GF multiplier can be used as a GF squarer, and about 9% in GF division.

CONCLUSIONS
First, we designed a new type of fast finite field multiplier. We achieved as much as a t-times speed-up compared to the traditional serial multiplier architecture by manipulating the expression for the serial multiplication algorithm, at a total cost of (t+1)/2 times the resources of the traditional serial GF multiplier. The proposed multiplier architecture has a higher throughput per cost ratio than any other GF multiplier published to date. Additionally, it can be used in the secure large prime-exponent Galois fields.
Secondly, we improved the algorithm introduced in [9] in order to design a fast GF divider architecture. We improved the existing GF division algorithm by depending on the n most significant bits of the temporary polynomials simultaneously. This allows us to obtain GF division results in 2m/n clock cycles using only an increased number of cell operations, which in turn enlarges the operation ROM table. Our GF divider architecture is especially useful when a division time of less than m clock cycles is required with minimal additional resources.
Finally, we developed a novel fast algorithm for calculating kP, which is the most time-consuming operation in the ECC data encryption scheme. We adopted the radix-4 Booth's redundancy concept to decrease the steps required for calculating kP by half compared to that of the traditional double-and-add algorithm. We also decreased the number of field operations by defining and drawing a point quadrupling operation expression.
The GF multiplier and GF divider, together with the algorithm for calculating kP, are the basic components of the ECC cryptosystem for secure data encryption and decryption. In this paper, we aimed to extract the best performance by adding the minimum overhead, so as to exhibit a high performance versus cost ratio. At present, semiconductor manufacturing technology is rapidly developing into nano-technology. Investment in resources to improve performance is essential.
As the Internet and information technology continue to grow, security applications for consumer electronics, such as IC cards for private authentication and domestic network applications, become more and more complicated, thereby requiring intensive computations. Our proposed high-speed cryptographic algorithms for ECC are well suited for such applications and exhibit versatile extensibility.
