Introduction
A scalable architecture for Elliptic Curve Point Multiplication (ECPM) is presented in this paper. ECPM is the basic operation of any elliptic curve cryptosystem, and it is essential that it is implemented efficiently. Several implementations of ECPM have been introduced, e.g., in [2], [5], [6], [9], [11] and [12].
Some of these designs implement only the Galois field arithmetic part of an ECPM, whereas the architecture presented in this paper implements the entire ECPM, which allows better performance.
ECC implementations can be categorized into reconfigurable and non-reconfigurable classes. In a reconfigurable implementation, the Galois field over which the elliptic curve is defined can be changed without changing the design. In a non-reconfigurable design, the FPGA must be reprogrammed in order to change the field. Non-reconfigurable designs are usually significantly faster than reconfigurable ones. Therefore, they should be preferred if very high performance is required. The architecture presented here is non-reconfigurable. Reprogrammability of FPGAs is an advantage for non-reconfigurable implementations, because a non-reconfigurable architecture on the FPGA can be changed by reprogramming the device. Hence, reprogrammability greatly reduces the disadvantages of non-reconfigurable architectures.
Field multiplication is the operation which ultimately defines the performance of an ECPM. Hence, it is essential that it is implemented carefully. The multiplier architecture presented here was designed on the lowest possible level to ensure maximum performance. Flexibility of the multiplier structure was the key issue in the design, because it is important that the implementation can be scaled to meet both speed and area requirements of an arbitrary application.
Elliptic Curve Cryptography
ECC has been of much interest in the cryptography community, because a high level of security can be achieved with short keys and low computational complexity [3]. In this paper, only elliptic curves $E$ over the Galois field $GF(2^m)$ with a polynomial basis are considered. An irreducible polynomial generating $GF(2^m)$ is denoted as $m(z)$. ECPM is defined on $E$, so that $Q = kP = P + P + \dots + P$, i.e., $Q$ is the sum of $k$ copies of the point $P$. Mxy is used for coordinate conversion.
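The definition of Q = kP can be illustrated with a generic left-to-right double-and-add loop. This is a minimal sketch and not the algorithm used in the architecture: the function name `scalar_multiply` and the callbacks `add`, `double` and `identity` are hypothetical, and ordinary integer addition stands in for the elliptic curve group law so that the result is easy to check.

```python
def scalar_multiply(k, point, add, double, identity):
    """Compute k * point with left-to-right double-and-add.

    add, double and identity are supplied by the caller; they are
    placeholders for the group law (an assumption of this sketch).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    result = identity
    for bit in bin(k)[2:]:               # bits of k, most significant first
        result = double(result)          # one doubling per bit of k
        if bit == "1":
            result = add(result, point)  # one addition per set bit of k
    return result


# Stand-in group: integers under addition, so k * P is plain multiplication.
q = scalar_multiply(13, 5, add=lambda a, b: a + b,
                    double=lambda a: 2 * a, identity=0)
print(q)  # 65
```

The loop needs one doubling per bit of k and one addition per set bit, and every such point operation is built from field operations, which is why the field arithmetic dominates the cost of an ECPM.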
Architecture for ECPM
The design presented in this paper implements an entire ECPM. In many publications, an ECC processor, which implements only part of an ECPM, is designed. This approach leads to smaller designs, but better performance is achieved by implementing the entire ECPM. An important benefit of the complete implementation is processor off-load: the processor controlling the ECPM is freed completely from the computation, whereas in ECC processor solutions the controlling processor has to control the use of the ECC processor.
Multiplier Structure
The key objective in the derivation of the field multiplier architecture was that it can be scaled to meet both latency and area requirements. Thus, a scalable architecture was developed. The structure is developed particularly for FPGAs, which commonly use look-up tables (LUTs) as their basic logic element. A LUT is a block which can implement any 4-to-1-bit function. All optimizations are performed so that this structure is used as efficiently as possible. Although the architecture is developed for FPGAs, it can be used for ASIC implementations as well.
A field multiplication includes an algebraic multiplication of polynomials and a reduction modulo an irreducible polynomial m(z). These phases can be calculated simultaneously as performed in many reconfigurable multipliers. If irreducibles are fixed, the reduction can be calculated in one clock cycle and it is beneficial to compute these phases separately.
The reduction is trivial if the irreducible is known a priori. Simple xor-equations can be derived using known algorithms given, e.g., in [3], and they can be hardwired into the design. The reduction can be calculated in one clock cycle with a high enough clock frequency, because irreducible polynomials used in ECC are usually trinomials or pentanomials.
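The separation of a field multiplication into an algebraic multiplication and a reduction with a fixed irreducible can be sketched in software as follows. This is an illustrative model only, not the hardware design: polynomials are stored as Python integers (bit i is the coefficient of x^i), the function names are hypothetical, and the pentanomial x^163 + x^7 + x^6 + x^3 + 1 of the SECG 163-bit binary curves is chosen here as an example. In hardware the loop in `reduce_fixed` would be unrolled into fixed xor-equations.

```python
M = 163
LOW_TERMS = (7, 6, 3, 0)  # x^163 + x^7 + x^6 + x^3 + 1 (example choice)


def poly_mul(a, b):
    """Algebraic (carry-less) multiplication of two binary polynomials."""
    d = 0
    while b:
        if b & 1:
            d ^= a      # add (xor) the shifted copy of a
        a <<= 1
        b >>= 1
    return d


def reduce_fixed(d, m=M, low_terms=LOW_TERMS):
    """Reduce d modulo x^m + sum of x^t for t in low_terms."""
    for i in range(d.bit_length() - 1, m - 1, -1):
        if (d >> i) & 1:
            d ^= 1 << i                   # clear x^i ...
            for t in low_terms:
                d ^= 1 << (i - m + t)     # ... and fold it into lower terms
    return d


def field_mul(a, b, m=M, low_terms=LOW_TERMS):
    """Field multiplication: algebraic multiplication, then reduction."""
    return reduce_fixed(poly_mul(a, b), m, low_terms)


# x^162 * x = x^163, which reduces to x^7 + x^6 + x^3 + 1
print(hex(field_mul(1 << 162, 0b10)))  # 0xc9
```

Because the irreducible has only a few low-order terms, each high coefficient is folded into at most four low positions, which is what keeps the hardwired xor-equations shallow.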
The algebraic multiplication is calculated in structures called LUT-trees, which consist of 4-to-1-bit LUTs. In Fig. 3, there are registers between every LUT-level, but in real applications it is beneficial to compute several LUT-levels in a clock cycle. The best results were achieved if registers were added between every third level, i.e., X = 3.
Let $n$ be the number of LUT-trees and $K_{\max}$ the number of levels in the largest LUT-tree. When the calculation of the coefficients $d_i$ is divided equally among $n$ LUT-trees, the latency of the multiplier, $L_M$, is defined by Eq. (4), where $c$ is a constant determined by design decisions, e.g., possible input/output registers. Eq. (4) can be used for scaling the latency ($L_M$) and area ($n$) of the multiplier to fit the requirements of an application.
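The depth of a LUT-tree can be estimated with a short calculation. The sketch below rests on assumptions that are not stated in the text: a first-level 4-input LUT is assumed to compute the xor of two AND terms a_j b_(i-j), and every later LUT is assumed to xor up to four signals. The function names are hypothetical, and the sketch does not reproduce the latency equation of the paper.

```python
def product_terms(i, m):
    """Number of AND terms a_j * b_(i-j) in coefficient d_i of the
    unreduced product of two polynomials of degree below m."""
    if not 0 <= i <= 2 * m - 2:
        raise ValueError("coefficient index out of range")
    return min(i, 2 * m - 2 - i) + 1


def lut_levels(terms):
    """Levels of 4-input LUTs needed to xor `terms` AND terms.

    Assumption of this sketch: the first level absorbs two AND terms
    per LUT, later levels xor up to four signals per LUT.
    """
    signals = -(-terms // 2)        # ceil(terms / 2) outputs after level 1
    levels = 1
    while signals > 1:
        signals = -(-signals // 4)  # ceil(signals / 4) per further level
        levels += 1
    return levels


def pipeline_stages(levels, levels_per_stage=3):
    """Register stages when registers follow every levels_per_stage levels."""
    return -(-levels // levels_per_stage)


m = 163
largest = product_terms(m - 1, m)   # the middle coefficient has the most terms
print(largest, lut_levels(largest), pipeline_stages(lut_levels(largest)))
```

Under these assumptions the largest tree for m = 163 has 163 terms and five LUT levels, so placing registers after every third level gives two pipeline stages for that tree.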
Other Galois Field Arithmetics
Implementation of the field addition is trivial, as it is only an m-bit xor-operation. Field squaring is easy to perform if irreducible polynomials are fixed. In squaring, the algebraic multiplication reduces to inserting zeros between the bits of the bit vector, so no LUT-tree structure is needed. As mentioned in Sec. 3.1, reduction can be calculated with hardwired xor-equations derived using known algorithms. Field inversion is a more complex operation. The method used in the SIG-ECPM architecture was introduced by Shantz in [13].
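Field addition and squaring can be modelled in a few lines. As before, this is a software sketch under stated assumptions rather than the hardware design: polynomials are Python integers, the function names are hypothetical, and the SECG 163-bit pentanomial x^163 + x^7 + x^6 + x^3 + 1 is used as an example irreducible.

```python
M = 163
LOW_TERMS = (7, 6, 3, 0)  # x^163 + x^7 + x^6 + x^3 + 1 (example choice)


def field_add(a, b):
    """Field addition is a bitwise xor of the coefficient vectors."""
    return a ^ b


def spread_bits(a):
    """Insert a zero between consecutive bits: bit i moves to bit 2*i."""
    out = 0
    i = 0
    while a:
        if a & 1:
            out |= 1 << (2 * i)
        a >>= 1
        i += 1
    return out


def reduce_fixed(d, m=M, low_terms=LOW_TERMS):
    """Reduce d modulo x^m + sum of x^t for t in low_terms."""
    for i in range(d.bit_length() - 1, m - 1, -1):
        if (d >> i) & 1:
            d ^= 1 << i
            for t in low_terms:
                d ^= 1 << (i - m + t)
    return d


def field_square(a, m=M, low_terms=LOW_TERMS):
    """Squaring: zero insertion followed by reduction, no multiplier needed."""
    return reduce_fixed(spread_bits(a), m, low_terms)


print(bin(spread_bits(0b111)))  # 0b10101
```

Zero insertion works because squaring is linear over GF(2): all cross terms appear twice and cancel, so only the terms x^(2i) remain.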
Results
The ECPM architecture presented in Sec. 3 was implemented on a Xilinx Virtex-II XC2V8000-5, which contains logic resources of 46,592 slices. A slice is the basic element of Xilinx FPGA devices and it consists of two LUTs, carry logic and two flip-flops [14]. Several implementations of the architecture were designed using the SIG-ECPM VHDL generator [8] with different parameters. Elliptic curves recommended by the Standards for Efficient Cryptography Group (SECG) in [4] were used in the implementations.
Results of the implementations are presented in Table I. The latency value is calculated on an SECG
Galois field multiplication is the critical operation of an ECPM and it was designed with special care. The design was made on as low a level as possible. Because the basic element of most modern FPGAs is a 4-to-1-bit LUT, the architecture was optimized to exploit this structure as efficiently as possible.
The architecture proved to be very efficient. For most parameters, it is, to the authors' knowledge, the fastest published FPGA-based architecture for ECC at the time of writing this paper. The research was performed in the GO-SEC project at HUT.
