Elliptic curve cryptosystems are expected to become the next standard for public-key cryptosystems. Their security level depends on the difficulty of the discrete logarithm problem on elliptic curves. An elliptic curve cryptosystem with a 160-bit public key offers a security level equivalent to that of an RSA system with a 1024-bit public key. We propose an elliptic curve cryptosystem LSI architecture embedding word-based Montgomery multipliers. Montgomery multiplication is an efficient method for finite field multiplication. By selecting the structure of the word-based Montgomery multipliers, we can design a scalable architecture for an elliptic curve cryptosystem. Experimental results demonstrate the effectiveness and efficiency of the proposed architecture. In a hardware evaluation using a 0.18 µm CMOS library, the high-speed design using 126 Kgates with 20 × 8-bit multipliers achieved an operation time of 3.6 ms for a 160-bit point multiplication.
Introduction
Frequent electronic transactions in business and people's everyday purchases of digital data, such as music and video, show how widely services supported by electronic data interchange have spread, even in daily life. Furthermore, concepts such as EDI (Electronic Data Interchange) and CALS (Continuous Acquisition and Life-cycle Support) point to a future with more active and sophisticated interchange of electronic data. Cryptosystem technologies provide the security on which this data interchange is founded.
Among the landmark technologies for cryptosystems, the RSA system has been one of the most commonly used public-key cryptosystems. However, the security level of RSA systems, which rests on the difficulty of factoring large integers into primes, is constantly threatened by the growing performance of microprocessors. Consequently, the recommended size of an RSA public key has been updated frequently: 512 bits in the 1980s, 1024 bits in the late 1990s, and 2048 bits expected in the 2000s. This growing key size inevitably leads to higher energy consumption and larger circuit area, both of which severely degrade the suitability of RSA systems as the standard public-key cryptosystem.
In place of RSA systems, elliptic curve cryptosystems are emerging as the next standard for public-key cryptosystems.
The security level of elliptic curve cryptosystems depends on the difficulty of the discrete logarithm problem on general elliptic curves. Cryptosystems based on elliptic curves are an exciting technology because, while providing the same security level as conventional systems such as RSA, they offer the benefits of smaller key sizes and hence smaller memory and processor requirements. This makes them ideal for use in smart cards and other environments where resources such as storage, time, or power are at a premium. A successful implementation of an elliptic curve cryptosystem in LSIs such as ASICs, FPGAs, and microprocessors is expected to improve considerably on RSA cryptosystem LSIs in circuit size, speed, and energy consumption. Several elliptic curve cryptosystem LSIs have already been proposed [2], [6], [10]. In [2], an elliptic curve cryptosystem LSI embedding a 1024-bit adder was proposed over GF($2^{160}$). However, we consider that it lacks flexibility and incurs a large area overhead because of the embedded 1024-bit adder. In [6], an elliptic curve cryptosystem LSI embedding a Massey-Omura multiplier was proposed. However, it cannot be applied to elliptic curve cryptosystems with different parameters, because the Massey-Omura multiplier, one of the multipliers over finite fields, relies on an optimal normal basis. In [10], a microprocessor that realizes an elliptic curve cryptosystem was proposed. It has a large circuit area because it supports not only functions on elliptic curves but also prime number generation, RSA key generation, and CRT (Chinese Remainder Theorem) operation.
To let elliptic curve cryptosystems realize their advantages with fewer restrictions in smaller LSIs, we propose in this paper an elliptic curve cryptosystem LSI architecture embedding word-based Montgomery multipliers. In the proposed LSIs, multiple operations over finite fields can be executed in parallel by adopting a word-based Montgomery multiplication algorithm. Moreover, by controlling the word size and the number of word-based Montgomery multipliers, we can design elliptic curve cryptosystem LSIs that provide good scalability in terms of speed, area, and operating frequency. This paper is organized as follows: Sect. 2 shows the outline of an elliptic curve cryptosystem algorithm and functions on elliptic curves. 
Elliptic Curve Cryptosystem
In this section, we show the definition and the key properties of elliptic curves for elliptic curve cryptosystems.
Elliptic Curves for Cryptosystems
A Jacobian point $(x, y, z)$ on the elliptic curve satisfies the homogeneous Weierstrass equation

$$y^2 = x^3 + a x z^4 + b z^6 \tag{1}$$

with $a, b \in K$, where $K$ is a field of characteristic greater than 3, and

$$4a^3 + 27b^2 \neq 0. \tag{2}$$

When $z \neq 0$, the Jacobian point corresponds to the affine point $(x/z^2, y/z^3)$. $E(GF(p))$ includes a point at infinity $O = (\infty, \infty)$. When Eq. (2) is satisfied, the cubic on the right-hand side of Eq. (1) does not have a multiple root. A point addition can be done in affine coordinates using field multiplications and inversions. In Jacobian coordinates, however, a point addition can be done using field multiplications only, with no inversions required. In cases where field inversions are significantly more expensive than field multiplications, it is efficient to use Jacobian coordinates [1].
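The coordinate conventions above can be checked numerically. The following is a minimal sketch, assuming a toy curve $y^2 = x^3 + 2x + 2$ over $F_{17}$ that is not taken from the paper; the helper names `is_nonsingular`, `on_curve_jacobian`, and `to_affine` are hypothetical and chosen only for illustration.

```python
# Toy parameters (an assumption for illustration, not from the paper).
P_MOD, A, B = 17, 2, 2


def is_nonsingular(a, b, p):
    """Discriminant condition: 4a^3 + 27b^2 must be nonzero mod p."""
    return (4 * a**3 + 27 * b**2) % p != 0


def on_curve_jacobian(pt, a=A, b=B, p=P_MOD):
    """Check y^2 = x^3 + a*x*z^4 + b*z^6 (mod p) for a Jacobian point."""
    x, y, z = pt
    return (y * y - (x**3 + a * x * z**4 + b * z**6)) % p == 0


def to_affine(pt, p=P_MOD):
    """Map (x, y, z) with z != 0 to (x/z^2, y/z^3); z == 0 is the point at infinity."""
    x, y, z = pt
    if z % p == 0:
        return None
    zi = pow(z, -1, p)  # the single field inversion (Python 3.8+)
    return (x * zi * zi % p, y * zi**3 % p)


# (3, 8, 2) is a Jacobian representative of the affine point (5, 1).
print(is_nonsingular(A, B, P_MOD), on_curve_jacobian((3, 8, 2)), to_affine((3, 8, 2)))
```

Only `to_affine` needs an inversion, which is why a whole point multiplication can defer it to the very end.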
Algorithm of Elliptic Curve Cryptosystem
The basic building blocks of an elliptic curve cryptosystem over $F_q$ are computations of the form

$$Q = [k]P = \underbrace{P + P + \cdots + P}_{k\ \text{times}} \tag{3}$$

where $P$ is a curve point, and $k$ is an arbitrary integer in the range $1 < k < \mathrm{ord}(P)$. For some of the cryptographic protocols, $P$ is a designated fixed point that generates a large, prime-order subgroup of $E(F_q)$, while for others $P$ is an arbitrary point in such a subgroup. The strength of the cryptosystem lies in the fact that, given the curve, the point $P$, and $[k]P$, it is hard to recover $k$. This is the elliptic curve discrete logarithm problem. We refer to the computation of Eq. (3) as point multiplication. The simplest efficient method for point multiplication relies on the binary expansion of $k$. Figure 1 shows the binary method.
The binary method requires l − 1 point doublings and W − 1 point additions (operations involving O are not counted), where l is the length and W the weight (number of ones) of the binary expansion of k.
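The operation counts above can be reproduced with a short sketch of the binary method. This is an illustrative left-to-right variant; the scanning direction used in Fig. 1 is not visible here, so that choice is an assumption. The function name `binary_method` and the `add`/`double` callbacks are hypothetical, and the additive group of integers modulo 101 stands in for a curve so that the block stays self-contained.

```python
def binary_method(k, P, add, double):
    """Compute [k]P by scanning the bits of k from the most significant bit.

    Returns ([k]P, number of doublings, number of additions).
    `add` and `double` are the group operations supplied by the caller.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]
    Q = P  # the leading bit is always 1, so start from P itself
    doublings = additions = 0
    for bit in bits[1:]:
        Q = double(Q)
        doublings += 1
        if bit == "1":
            Q = add(Q, P)
            additions += 1
    return Q, doublings, additions


# Stand-in group: integers modulo 101 under addition (not an elliptic curve).
N = 101
Q, d, a = binary_method(45, 7, add=lambda u, v: (u + v) % N, double=lambda u: 2 * u % N)
print(Q, d, a)  # 45 = 101101b has length 6 and weight 4
```

Starting from `P` rather than from the point at infinity is what keeps operations involving O out of the count.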
Point Addition on an Elliptic Curve
Let A and B be two distinct rational points on an elliptic curve E. The straight line joining A and B must intersect the curve at one further point, say C, since we are intersecting a line with a cubic curve. The point C will also be rational since the line, the curve, and the points A and B are themselves all defined over K. If we then reflect C in the x-axis, we obtain another rational point, which we shall call A + B (Fig. 2).
To add A to itself, or to double A in the jargon, we take the tangent to the curve at A. Such a line must intersect E(K) in exactly one other point, say B, as E is defined by a cubic equation. Again we reflect B in the x-axis to obtain a point which we call [2]A = A + A. If the tangent at A is vertical, it 'intersects' the curve at the point at infinity and A + A = O, i.e., A is a point of order 2.
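The chord-and-tangent construction translates into the usual affine formulas. The sketch below assumes a toy curve $y^2 = x^3 + 2x + 2$ over $F_{17}$ with the point G = (5, 1), which is not from the paper; `affine_add` is a hypothetical helper name, and `None` is used to represent the point at infinity O.

```python
# Toy curve y^2 = x^3 + 2x + 2 over F_17 (an assumption for illustration).
P_MOD, A, B = 17, 2, 2
O = None  # point at infinity


def affine_add(P, Q):
    """Add two affine points by the chord (P != Q) or tangent (P == Q) rule."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O  # vertical line: P + (-P) = O, including points of order 2
    if P == Q:
        # slope of the tangent at P
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        # slope of the chord through P and Q
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD  # reflection in the x-axis is built in
    return (x3, y3)


G = (5, 1)
print(affine_add(G, G), affine_add(G, affine_add(G, G)))
```

Each call performs one modular inversion (`pow(..., -1, P_MOD)`, available from Python 3.8), which is the cost that Jacobian coordinates avoid.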
Point Addition in Projective Coordinates
A point addition can be done in Jacobian coordinates using field multiplications only. Thus, inversions are deferred, and only one need be performed at the end of a point multiplication operation, if it is required that the final result be given in affine coordinates. The cost of eliminating inversions is an increased number of multiplications, so the appropriateness of using Jacobian coordinates is strongly determined by the ratio I : M, where I and M denote, respectively, the cost of field inversion and multiplication.
The sequence in Fig. 3 computes the sum $P_3 = (x_3, y_3, z_3)$ of two points $P_1 = (x_1, y_1, z_1)$ and $P_2 = (x_2, y_2, z_2)$ in Jacobian coordinates. We assume that $P_1, P_2 \neq O$ and that $P_1 \neq \pm P_2$. In Fig. 3, the cost of each step of the computation is noted at the right-hand side of the step. The total cost for general point addition is 16M.
The condition P 1 = ±P 2 is equivalent to λ 3 = 0 in Fig. 3 . Furthermore, given that λ 3 = 0, the condition P 1 = P 2 is equivalent to λ 6 = 0. When this condition is detected,
Fig. 3 Point addition in Jacobian coordinates, characteristic p > 3. 
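Since the body of Fig. 3 is not reproduced here, the following is a sketch of one widely used Jacobian addition sequence (the IEEE P1363-style formulas), which may differ in detail from the figure. The names `jacobian_add` and `to_affine`, the intermediate names `l1` to `l9`, and the toy prime 17 are assumptions for illustration. The comments count field multiplications, with squarings counted as multiplications.

```python
P_MOD = 17  # toy prime (assumption); points lie on y^2 = x^3 + 2x + 2


def jacobian_add(P1, P2, p=P_MOD):
    """Add P1 != +/-P2 (both != O) in Jacobian coordinates without inversion."""
    x1, y1, z1 = P1
    x2, y2, z2 = P2
    z1z1 = z1 * z1 % p                            # 1M
    z2z2 = z2 * z2 % p                            # 1M
    l1 = x1 * z2z2 % p                            # 1M
    l2 = x2 * z1z1 % p                            # 1M
    l3 = (l1 - l2) % p
    l4 = y1 * (z2z2 * z2 % p) % p                 # 2M
    l5 = y2 * (z1z1 * z1 % p) % p                 # 2M
    l6 = (l4 - l5) % p
    if l3 == 0:
        # l6 == 0 means P1 == P2 (doubling needed); otherwise P1 == -P2.
        raise ValueError("P1 = +/-P2: not handled by the general addition")
    l7 = (l1 + l2) % p
    l8 = (l4 + l5) % p
    z3 = z1 * z2 % p * l3 % p                     # 2M
    l3sq = l3 * l3 % p                            # 1M
    t = l7 * l3sq % p                             # 1M
    x3 = (l6 * l6 - t) % p                        # 1M
    l9 = (t - 2 * x3) % p
    y3 = (l9 * l6 - l8 * (l3sq * l3 % p)) % p     # 3M
    y3 = y3 * ((p + 1) // 2) % p                  # halve: multiply by 2^-1 mod p
    return (x3, y3, z3)


def to_affine(pt, p=P_MOD):
    """Single inversion at the very end: (x/z^2, y/z^3)."""
    x, y, z = pt
    zi = pow(z, -1, p)
    return (x * zi * zi % p, y * zi**3 % p)


# (3, 8, 2) represents the affine point (5, 1); (6, 3, 1) is its double.
print(to_affine(jacobian_add((3, 8, 2), (6, 3, 1))))
```

The counted multiplications add up to 16, in line with the total stated for general point addition. As a rough guide, if an affine addition is taken to cost one inversion plus three multiplications (an assumption that counts the squaring as a multiplication), the Jacobian form pays off for additions alone once I/M exceeds about 13.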
Elliptic Curve Cryptosystem LSI Architecture
We propose an elliptic curve cryptosystem LSI architecture over prime fields. The most important issue in implementing the elliptic curve cryptosystem is how to implement the underlying finite field multiplication. Recently, Massey-Omura multiplication [4], [9], [11] and Montgomery multiplication [7], [12], [13] have attracted attention as finite field multiplication algorithms. The Massey-Omura multiplier is a finite field multiplier that uses an optimal normal basis. The optimal normal basis allows efficient bit-serial multipliers, such as the one described by Massey and Omura in [8]. However, an elliptic curve cryptosystem embedding a Massey-Omura multiplier cannot communicate with systems that use different bases without a basis transformation, and thus its application area would be very narrow. Montgomery multiplication is a basic operation for finite field multiplication. The original Montgomery multiplication algorithm fixes the operand size of the multiplier. By adopting the word-based Montgomery multiplication algorithm, the operations over finite fields can be divided into parts for parallel execution.
The proposed elliptic curve cryptosystem LSI applies the binary method for point multiplication and has word-based Montgomery multipliers and a finite field ALU as its finite field arithmetic modules.
Montgomery Multiplier
The application of the Montgomery multiplication algorithm to two integers $X$ and $Y$, with the required parameters for $n$ bits of precision, results in the number

$$Z = X Y r^{-1} \bmod M \tag{4}$$

where $r = 2^n$ and $M$ is an integer in the range $2^{n-1} < M < 2^n$ such that $\gcd(r, M) = 1$. Since $r = 2^n$, it is sufficient that the modulus $M$ be an odd integer. For cryptosystem applications, the modulus $M$ is usually a prime number, so this condition is easily satisfied.
In Eq. (4), when the lower n bits of the product XY are equal to 0, Z can be obtained by an n-bit right shift, without modular arithmetic. The Montgomery multiplication algorithm forces each of the lower n bits of XY to 0 by adding multiples of the modulus M, and can therefore execute a finite field multiplication very quickly. Adding multiples of M does not affect the result Z because the arithmetic is performed modulo M. Figure 5 shows the Montgomery multiplication algorithm [5].
The Montgomery multiplier can easily be implemented in hardware because its algorithm does not include division operations. However, this Montgomery multiplier only works when the operand size is a predefined n bits. To remove this limitation while keeping the simple operations of the algorithm, a modified algorithm, the word-based Montgomery multiplication, was proposed [12].
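The shift-and-add idea can be sketched as a bit-serial (radix-2) Montgomery multiplication. This is a common textbook form and may differ in detail from Fig. 5; the function name `mont_mul` and the sample operands are assumptions for illustration.

```python
def mont_mul(x, y, m, n):
    """Return x*y*2^(-n) mod m for odd m with 2^(n-1) < m < 2^n and 0 <= x, y < m."""
    s = 0
    for i in range(n):
        s += ((x >> i) & 1) * y  # add y when bit i of x is set
        if s & 1:
            s += m               # adding the odd modulus clears the lowest bit
        s >>= 1                  # exact division by 2, no modular division needed
    if s >= m:                   # final reduction: s < 2m holds at this point
        s -= m
    return s


# Usage: a product in the ordinary domain via the Montgomery domain.
m, n = 2**61 - 1, 61
r2 = pow(2, 2 * n, m)            # r^2 mod m, precomputed once per modulus
x, y = 123456789, 987654321
xm = mont_mul(x, r2, m, n)       # x*r mod m
ym = mont_mul(y, r2, m, n)       # y*r mod m
zm = mont_mul(xm, ym, m, n)      # x*y*r mod m
z = mont_mul(zm, 1, m, n)        # back to the ordinary domain: x*y mod m
print(z == x * y % m)
```

The loop uses only additions, a parity test, and one-bit shifts, which is what makes the algorithm attractive for hardware; the operand width n, however, is fixed by the loop bound.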
Word-Based Montgomery Multiplication
Given two operands $Y$ (multiplicand) and $X$ (multiplier) and the modulus $M$, the algorithm presented in this section executes a series of operations to generate $XYr^{-1} \bmod M$, scanning $Y$ and $M$ word-by-word and scanning $X$ bit-by-bit. This characteristic enables us to derive a hardware implementation that is very regular and based on simple operations. The word-based Montgomery multiplication algorithm is presented in Fig. 6 [12]. In Fig. 6, the final reduction step that appears in Fig. 5 was intentionally eliminated. The work in [3] describes the conditions under which the reduction can be eliminated when multiple multiplications are performed.
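The word-by-word scan can be sketched as follows. This is an illustrative model in the style of the multiple-word radix-2 algorithm of [12], not a transcription of Fig. 6: it merges the carries into a single integer `c`, and the function name `word_mont_mul` and the sample modulus are assumptions. As in the text, no final reduction is performed, so the result lies in the range [0, 2M).

```python
def word_mont_mul(x, y, m, n, w):
    """Return S with S = x*y*2^(-n) (mod m) and 0 <= S < 2m, for odd m < 2^n, x, y < m.

    x is scanned bit-by-bit (outer loop i); y and m word-by-word (inner loop j).
    """
    e = -(-(n + 1) // w)                      # ceil((n+1)/w) words
    mask = (1 << w) - 1
    Y = [(y >> (w * j)) & mask for j in range(e + 1)]  # top word is zero padding
    M = [(m >> (w * j)) & mask for j in range(e + 1)]
    S = [0] * (e + 1)
    for i in range(n):
        xi = (x >> i) & 1
        t = xi * Y[0] + S[0]
        q = t & 1                             # parity decides whether M is added
        t += q * M[0]
        c, S[0] = t >> w, t & mask
        for j in range(1, e + 1):
            t = c + xi * Y[j] + q * M[j] + S[j]
            c, S[j] = t >> w, t & mask        # c stays in 0..2
            # one-bit right shift carried across the word boundary
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e] >>= 1                            # shift the top word as well
    return sum(s << (w * j) for j, s in enumerate(S))


# Usage with a 160-bit odd modulus and the word sizes mentioned in the text.
m, n = 2**160 - 47, 160
x, y = pow(3, 100, m), pow(5, 77, m)
expected = x * y * pow(2, -n, m) % m
print(all(word_mont_mul(x, y, m, n, w) % m == expected for w in (8, 16, 32)))
```

Only w-bit words of `Y`, `M`, and `S` are touched in each inner step, so the same datapath serves any operand length by changing the number of words `e`.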
Architecture for the Word-Based Montgomery Multiplier
A proposed architecture for the word-based Montgomery multiplier is shown in Fig. 7. In Fig. 7, the word size of the multiplier is 32-bit. We execute the series of operations in Fig. 6, scanning mul A bit-by-bit and scanning mul B and mul N word-by-word. We can execute the Montgomery multiplication by iterating the operations of Fig. 6 as many times as required.
[Fig. 6: Word-based Montgomery multiplication algorithm [12].]
In Fig. 6, the dependency on the carry bits between operations within the j loop restricts their parallel execution. However, instructions in different i loops may be executed in parallel. The proposed word-based Montgomery multiplier can execute the instructions of one j-loop iteration in one clock cycle. Assume that the Montgomery multiplication is executed by two word-based Montgomery multipliers. One executes the instructions of the j loop. Once it has executed the instructions of the first iteration of the j loop, the other multiplier can start executing the instructions of the next i loop. Thus, the word-based Montgomery multipliers can execute in parallel with a latency of two clock cycles.
The synthesized results of the word-based Montgomery multipliers are shown in Table 1, which gives the estimated area and delay for several configurations. Multipliers with a smaller word size have a shorter delay. However, a single multiplier with a larger word size occupies less area than several multipliers with a small word size that together provide the same total word size.
Architecture for the Elliptic Curve Cryptosystem LSI
A proposed architecture for a 160-bit elliptic curve cryptosystem LSI embedding a 32-bit word-based Montgomery multiplier is shown in Fig. 8. We apply the binary method to the point multiplication in the proposed elliptic curve cryptosystem LSIs. The binary method involves finite field arithmetic. The word-based Montgomery multipliers execute finite field multiplications, and a finite field ALU executes finite field additions and subtractions. The finite field arithmetic modules execute the 160-bit arithmetic on an elliptic curve by iterating as many times as required. In Fig. 8, the controller controls the inputs of the finite field arithmetic modules from the shift registers.
Our proposed architecture uses an 8-bit finite field ALU because even at this width a 160-bit addition or subtraction takes very little time compared to a 160-bit finite field multiplication.
The proposed elliptic curve cryptosystem LSIs execute the point multiplication as follows. First, the inputs are loaded from the shift registers into the registers. The finite field arithmetic modules then execute the point addition following Figs. 3 and 4. Inputs must be loaded from the shift registers into the registers, and operation results must be stored from the registers back into the shift registers. When the LSI has completed the point multiplication, the signal done goes high and the operation result is available.
Design Evaluation

Hardware Implementation Results
The proposed elliptic curve cryptosystem LSI described above was synthesized using Synopsys Design Compiler as the logic-level synthesis tool and the VDEC libraries (CMOS, 0.18-µm technology)†. The proposed architecture offers area/time trade-offs that result from different values of the word size and the number of stages in the pipeline. We synthesized 160-bit elliptic curve cryptosystem LSIs embedding 8-, 16-, and 32-bit word-based Montgomery multipliers and an 8-bit finite field ALU. The performance of the elliptic curve cryptosystem LSIs embedding 8-bit word-based Montgomery multipliers with 8 and 20 stages in the pipeline was estimated from the synthesized results. The elliptic curves used in our LSIs can be selected freely; they only have to satisfy Eqs. (1) and (2).
† The libraries in this study have been developed in the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo with the collaboration by Hitachi Ltd. and Dai Nippon Printing Corporation.
Table 2 shows the synthesized results: area, delay, configuration of multipliers, and operation time. In Table 2, com. area is the size of the operating unit and noncom. area is the size of all registers. In these designs, all inputs and temporary variables are stored in registers, and as a result the registers account for 70% of the total area. It is not realistic to replace all of these registers with memories. However, the input registers can be replaced with memories. The total area is then dramatically reduced without any operation time overhead, because our LSIs can execute the finite field arithmetic while the data in the memory is being loaded and stored. The area of the finite field arithmetic modules is sufficiently small compared to the total area. Apart from the registers, the controller accounts for the majority of the total area. Thus, the elliptic curve cryptosystem LSI embedding a 32-bit word-based Montgomery multiplier has only 0.9% area overhead compared to the LSI embedding an 8-bit word-based Montgomery multiplier.
In Table 2 , delay means a critical path delay. The operation cycle counts can be reduced by increasing the word size of multipliers, but the delay becomes longer, and thus, the operating frequency becomes lower.
In Table 2, operation time means the time of a 160-bit point multiplication. All LSIs in Table 2 operate at their own maximum operating frequency. The high-speed design using 126 Kgates with 20 × 8-bit multipliers achieved an operation time of 3.6 ms for a 160-bit point multiplication. We now consider the parallel execution of multipliers. The operation cycle count of the LSI embedding a 16-bit word-based Montgomery multiplier is smaller than that of the LSI embedding 2 × 8-bit word-based Montgomery multipliers. However, the operation time of the LSI embedding 2 × 8-bit word-based Montgomery multipliers is shorter than that of the LSI embedding a 16-bit word-based Montgomery multiplier, because the maximum operating frequency of the former is higher. The area of the LSI embedding a 16-bit word-based Montgomery multiplier is smaller than that of the LSI embedding 2 × 8-bit word-based Montgomery multipliers.
Performance Comparison with Previous Approaches
Table 3 shows the hardware performance comparison for point multiplication over prime fields. The field sizes are 160-bit except for the hardware of [6]. The field size of the hardware of [6] is 155-bit and it uses an optimal normal basis. In Table 3, the area of our LSIs is the size of the operating unit, because the registers can be replaced with memories. We will embed our LSIs with other system LSIs in portable information devices, where the memories of these system LSIs can be used. The area of the hardware of [6] and [2] is the area of the operating unit, excluding the area of memories.
In [10], an elliptic curve cryptosystem processor architecture was proposed. These designs are evaluated using a 0.13-µm technology. A compact design using 28.3 Kgates with an 8-bit multiplier achieved an operation time of 2.79 ms for a 160-bit point multiplication. This hardware runs at a higher operating frequency because it is a microprocessor. The ecc 8 8, which has almost the same area as this hardware, achieves a 16% reduction in the number of cycles compared to this hardware. Currently, the operating frequency of system LSIs in cellular phones is under 200 MHz, because low-power system LSIs have to be designed. The operating frequency is an important constraint in system LSI design for portable information devices. The operating frequency of ecc 8 8 is 4.4 times lower than that of the compact design of [10]. In [6], an elliptic curve cryptosystem processor embedding a Massey-Omura multiplier was proposed. This hardware used an optimal normal basis in order to simplify and minimize the multiplier. It cannot communicate with systems that use different field parameters, and thus its application area would be very narrow. Reference [2] proposed an adder-based hardware architecture that supports the finite field arithmetic for integer and point multiplication for arbitrary polynomial fields. However, its 1,024-bit adder makes it difficult to increase the operating frequency and also limits the flexibility of the hardware. Our proposed architecture provides good scalability in terms of speed, area, and operating frequency. A high-speed architecture can be designed by changing the word size and the number of stages in the pipeline.
Conclusion
In this paper, we proposed an elliptic curve cryptosystem LSI architecture embedding word-based Montgomery multipliers. This architecture was synthesized for 8-, 16-, and 32-bit multipliers using a 0.18-µm CMOS library. Our proposed LSIs can execute point multiplications efficiently with low area overhead, because they use word-based Montgomery multipliers. For example, ecc 8 20 in Table 3 can execute a 160-bit point multiplication over a prime field in just 3.6 ms.
Our proposed elliptic curve cryptosystem LSI architecture also provides good scalability in terms of speed, area, and operating frequency. It can be used for various applications, such as embedded microcontrollers and high-speed security servers, by changing the configuration of the multipliers. The proposed architecture is therefore both effective and efficient.
We intend to reduce the delay of the multiplier, which would allow a higher operating frequency and thus a faster point multiplication. We also intend to estimate the energy consumption of the proposed architecture.
