I. Introduction
Recently, elliptic curve (EC) cryptosystems have become more attractive due to their small key sizes and the wide variety of curves available. However, it is not efficient to implement them on a general-purpose microprocessor because of the word-size mismatch, limited parallel computation, the lack of hardware-supported wire permutation, and the mismatch between algorithm and architecture. The solution to this problem is to build a coprocessor, which can be optimized for the algorithm of a particular application to enhance performance. Thus, the total hardware utilization can be kept very high and the computation is sped up.
In this paper, a compact, fast elliptic curve crypto coprocessor with variable key size is introduced, which utilizes the internal SRAM and registers in an FPGA. The generic hardware architecture for the coprocessor is implemented as a VHDL description parameterized in terms of key size, and is synthesized and mapped to a Xilinx FPGA. The algorithms adopted and the architecture developed here are suitable for massively parallel computation. The experimental results show that the design can achieve a high utilization of CLBs for the Xilinx 4000 series.
II. Algorithms for EC cryptosystems
The basic operation in an EC cryptosystem is elliptic curve scalar multiplication. Since EC subtraction is as simple as EC addition and can be computed with one EC addition, the most efficient method is an addition-subtraction method [1], [3], in which the scalar (an integer) is decomposed into non-adjacent form (NAF) and the scalar multiplication is carried out as a series of additions and subtractions of elliptic curve points. The number of non-zero digits in the NAF is about m/3 on average [1], while a binary decomposition can have up to m non-zero bits. In turn, the addition or subtraction of EC points consists of a set of underlying field additions, squarings, multiplications and inversions. When the elliptic curve is defined over GF(2^m) with an optimal normal basis, these underlying field operations have the least complexity. Besides NAF, factorization formats such as the τ-adic format [1] share a similar algorithm structure and thus can be implemented with the same hardware architecture.
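The addition-subtraction method described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `naf` and `scalar_mul` are hypothetical, and the group operations are passed in as callables so that ordinary integers under addition can stand in for elliptic curve points.

```python
def naf(k):
    """Return the non-adjacent form of k >= 0, least-significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d            # k is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mul(k, point, add, neg, zero):
    """Compute k * point by scanning the NAF digits from the most significant end."""
    result = zero
    for d in reversed(naf(k)):
        result = add(result, result)            # doubling
        if d == 1:
            result = add(result, point)         # addition
        elif d == -1:
            result = add(result, neg(point))    # subtraction = adding the negated point
    return result
```

For example, `naf(7)` returns `[-1, 0, 0, 1]`, that is 7 = 8 - 1, so the multiplication needs one subtraction instead of the two additions required by the binary expansion 111.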
III. Implementation of EC cryptosystems
A. General structure
Since the word size or key size for a typical elliptic curve cryptosystem is large, the above algorithm cannot be unfolded. Therefore, a folded hardware architecture is constructed with a controller to sequence the computation. In Fig. 1, the two FIFOs serve as input/output buffers and the dual-port register file is used to save input parameters and intermediate data.
The shaded area provides the GF(2^m) arithmetic units: GF adder, GF squarer, GF multiplier and GF inverter. With the finite field of characteristic 2 as the underlying field, the addition is just a bit-wise XOR, and the squaring is only a simple cyclic right shift for normal basis representations. The structures of the GF multiplier and GF inverter are discussed in the following sections.
B. GF multiplier structure
The GF multiplier is a modified form of a Massey-Omura multiplier [4], which reduces the number of AND gates by 50% and the wire permutations by 50% in the XOR plane / AND plane / XOR tree. The multiplier can be implemented as a bit-serial multiplier, a digit-serial multiplier or a parallel multiplier, depending on the amount of available hardware resources. Each multiplication takes m clock cycles in a serial multiplier and only one clock cycle in a fully parallel multiplier. However, the hardware required for a parallel multiplier will be m times that of the serial multiplier. Pipeline techniques are also applied to the multiplier to reduce the clock cycle time. The modified Massey-Omura serial multiplier takes m AND gates, 2m XOR gates and 3m flip-flops, and has a delay of $m\,(T_{AND} + T_{XOR}\lfloor \log_2(m-1) \rfloor)$ when it is not pipelined. When it is pipelined, the serial multiplier has a delay of $(m + \lfloor \log_2(m-1) \rfloor)\cdot \max(T_{AND}, T_{XOR})$ and a cost of m AND gates, 2m XOR gates and 5m flip-flops.
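To make the bit-serial principle concrete, the following Python sketch models a classical Massey-Omura multiplier, not the modified structure of this paper. It rests on several assumptions: a toy field GF(2^4) with a type I optimal normal basis generated by a primitive 5th root of unity, elements stored as bit lists with the coefficient of beta^(2^i) at index i, and hypothetical function names. A single product function for coordinate 0 is reused for every output bit by cyclically shifting both operands, which is what lets the hardware produce one bit per clock cycle.

```python
M = 4                      # toy field GF(2^4) (assumption); M + 1 = 5 is prime
P = M + 1                  # beta is a primitive P-th root of unity, so beta^P = 1
EXP = [pow(2, i, P) for i in range(M)]     # basis element i is beta^(2^i) = beta^EXP[i]
IDX = {e: i for i, e in enumerate(EXP)}    # exponent -> basis index

# Pairs (i, j) whose product beta^(2^i) * beta^(2^j) has a 1 in coordinate 0.
# beta^0 = 1 is the sum of all basis elements, so it also contributes to coordinate 0.
LAMBDA = [(i, j) for i in range(M) for j in range(M)
          if (EXP[i] + EXP[j]) % P == 0 or IDX.get((EXP[i] + EXP[j]) % P) == 0]


def gf_add(a, b):
    """Field addition: bit-wise XOR."""
    return [x ^ y for x, y in zip(a, b)]


def gf_square(a):
    """Field squaring: cyclic shift by one position."""
    return a[-1:] + a[:-1]


def gf_mul_serial(a, b):
    """One output bit per step: same AND/XOR function, operands rotated by k."""
    c = []
    for k in range(M):
        bit = 0
        for i, j in LAMBDA:
            bit ^= a[(i + k) % M] & b[(j + k) % M]
        c.append(bit)
    return c


def gf_mul_reference(a, b):
    """Independent check: multiply term by term using exponents of beta."""
    c = [0] * M
    for i in range(M):
        for j in range(M):
            if a[i] & b[j]:
                e = (EXP[i] + EXP[j]) % P
                if e == 0:
                    c = [x ^ 1 for x in c]   # add the element 1 (all coordinates set)
                else:
                    c[IDX[e]] ^= 1
    return c
```

In this toy basis `LAMBDA` contains 2m - 1 = 7 pairs, the minimum possible for a normal basis, which is what makes such a basis optimal.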
C. GF inverter structure
The structure of the GF inverter is derived from the method introduced by T. Itoh et al. [2]. The inverse takes $\lfloor \log_2(m-1) \rfloor$ recursive iterations and a total of $\lfloor \log_2(m-1) \rfloor + HW(m-1) - 2$ underlying field multiplications, where $HW(m-1)$ is the Hamming weight of m-1. Though the number of multiplications cannot be reduced any further, the time taken for one inverse can be improved by reducing the latency of the underlying field multiplier. Since the inverse algorithm is recursive, the underlying field multiplier cannot be pipelined across multiple iterations. The only effective way to reduce the time taken for one inverse is to reduce the latency of the underlying field multiplier by pipelining it within one iteration or making it parallel.
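The recursive structure of the inversion can be illustrated with a minimal Python sketch of the textbook Itoh-Tsujii addition chain. Several choices here are assumptions made only to keep the example short and checkable: a polynomial basis instead of a normal basis, the field GF(2^8) with the reduction polynomial 0x11B, and hypothetical function names. In this textbook formulation the multiplication count is floor(log2(m-1)) + HW(m-1) - 1; the constant term depends on what is counted and may differ from the figure quoted above. Squarings are computed with the multiplier in this sketch but are not counted, since in a normal basis they are cyclic shifts.

```python
def gf_mul(a, b, m, poly):
    """Multiply in GF(2^m), polynomial basis; poly includes the x^m term."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly          # reduce as soon as the degree reaches m
    return r


def gf_sqr_n(x, n, m, poly):
    """Square x n times (a free cyclic shift by n in a normal basis)."""
    for _ in range(n):
        x = gf_mul(x, x, m, poly)
    return x


def itoh_tsujii_inverse(a, m, poly):
    """Return (a^-1, multiplication count) using a^-1 = (a^(2^(m-1) - 1))^2."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    mults = 0
    beta, k = a, 1                         # invariant: beta = a^(2^k - 1)
    for bit in bin(m - 1)[3:]:             # bits of m-1 after the leading 1
        beta = gf_mul(gf_sqr_n(beta, k, m, poly), beta, m, poly)   # k -> 2k
        mults += 1
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_sqr_n(beta, 1, m, poly), a, m, poly)  # k -> k + 1
            mults += 1
            k += 1
    return gf_sqr_n(beta, 1, m, poly), mults
```

Each multiplication consumes the result of the previous one, which is why the multiplier cannot be kept busy by overlapping successive iterations and only its own latency can be shortened.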
D. Controller Structure
The controller takes advantage of the abundance of internal SRAM and registers in Xilinx FPGAs. The controller is built as a finite state machine and uses table look-up to implement the logic function. Since the whole look-up table consists of the small look-up tables in each CLB, the controller can be pipelined to have a clock cycle time equal to one CLB delay.
IV. Development of the Multiplier

A. Testing data generation
The above EC scalar multiplication algorithms were implemented in Mathematica. With this implementation, the test data for the VHDL simulations can be computed and the VHDL code can be easily verified.
B. VHDL code description and simulation
The controller is the kernel of the scalar multiplier and takes the form of an FSM. All operations can be categorized as one of the following atomic operations: unconditional jump, conditional jump, operand load, operand store, finite field addition, finite field squaring, finite field multiplication and finite field inversion. Each state of the controller then consists of one or more such atomic operations, because addition, squaring, multiplication/inversion and load/store can be executed concurrently. The execution schedule is optimized to provide the shortest computing time. These atomic operations are represented as macros and are re-used in the VHDL code. The entire VHDL code has been simulated extensively.
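The idea of a table-driven controller whose states bundle several atomic operations can be sketched as a small Python interpreter. Everything specific in this sketch is hypothetical and chosen only for illustration: the 4-bit word size, the operation encoding, the register names and the three-operation program. It is not the microcode of the actual design. All operations in one state read the register values from the start of that state, which models concurrent execution.

```python
from collections import deque

M = 4  # toy word size (assumption); elements are M-bit integers


def gf_add(a, b):
    return a ^ b                                              # bit-wise XOR


def gf_sqr(a):
    return ((a << 1) | (a >> (M - 1))) & ((1 << M) - 1)       # cyclic shift by one


# One table entry per state: the atomic operations issued together and the jump.
# "next" is a state index (unconditional jump) or a pair
# (target_if_input_pending, target_if_input_empty) for a conditional jump.
MICROCODE = [
    {"ops": [("load", "x")], "next": 1},
    {"ops": [("load", "y"), ("sqr", "s", "x")], "next": 2},        # load and square together
    {"ops": [("add", "t", "x", "y"), ("store", "s")], "next": 3},  # add and store together
    {"ops": [("store", "t")], "next": (0, None)},                  # loop while input remains
]


def run(microcode, fifo_in):
    fifo_in, fifo_out = deque(fifo_in), []
    regs, state = {}, 0
    while state is not None:
        entry = microcode[state]
        old = dict(regs)                 # snapshot: every op in a state sees the same values
        for op, dst, *src in entry["ops"]:
            if op == "load":
                regs[dst] = fifo_in.popleft()
            elif op == "store":
                fifo_out.append(old[dst])
            elif op == "add":
                regs[dst] = gf_add(old[src[0]], old[src[1]])
            elif op == "sqr":
                regs[dst] = gf_sqr(old[src[0]])
        nxt = entry["next"]
        if isinstance(nxt, tuple):       # conditional jump on input FIFO status
            nxt = nxt[0] if fifo_in else nxt[1]
        state = nxt
    return fifo_out
```

For each input pair (x, y) the program emits x squared followed by x + y, using four states rather than the five that a strictly sequential schedule of the same operations would need.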
C. Experimental Results
The mapping and layout are done with the Xilinx Design Manager, and the results are shown in Table I.
D. Mapping analysis
In Table II, the actual CLBs figure is the CLB count of the mapping obtained with the Xilinx Design Manager. The minimal/maximal CLBs figures come from an estimation formula and give a lower/upper bound on the total number of CLBs needed for the design. The estimation method is not shown due to limited space.
V. Conclusions
The proposed architecture achieves high area utilization and can provide a high degree of parallelism.
