Abstract - In this paper we propose a low-cost coprocessor architecture for elliptic curve cryptography (ECC) that supports the main mathematical operations needed to compute ECDSA over $GF(2^{193})$, including point doubling, point addition and scalar multiplication. Because the field is fixed, we exploit this property to design specialized operational units and a pipeline structure. These units increase the area by only 4% while making the computation around 40% faster. Our design performs a scalar multiplication over $GF(2^{193})$ in 24 ms at a clock frequency of 10 MHz and is only 11,486 NAND gates in size. Both the performance and the hardware cost compare favorably with previous similar work.
I. INTRODUCTION
Embedded systems with cryptography are widely used in daily life. They enable profitable and legal trading and provide confidentiality, integrity, and non-repudiation of transactions in e-business, e-government, and Internet applications. Elliptic Curve Cryptography (ECC) has the highest security-to-key-length ratio among all known public-key schemes. This advantage makes it especially suitable for embedded systems, which typically offer limited resources and therefore call for small area, low power consumption and acceptable computation time.
Numerous papers deal with the design of ECC on 8-bit CPU platforms [1] - [3], [5], [6]. The performance of the ECC coprocessors in previous work depends strongly on the efficiency of the platform: designs built around an AVR microcontroller [2], [5], [6] are faster than those using an 8051 [1], [3]. Furthermore, it is difficult to port them to other platforms because the instruction sets differ.
The approach we present in this paper differs from previous work in two important aspects. First, we propose a low-cost coprocessor that computes the fundamental functions over $GF(2^{193})$, so the performance of scalar multiplication stays the same regardless of the platform's instruction set. It can easily be applied to any 8-bit or 16-bit platform. Second, because the coprocessor performs arithmetic in a single fixed field, we introduce special blocks to accelerate the calculations and a pipelined, reused structure to minimize the area. Our design therefore achieves a well-balanced tradeoff between small area and fast computation. As far as we know, the hardware cost of our work in terms of area is the smallest among all related work with the same security level.
The structure of this paper is as follows: Section II presents some basic mathematical aspects of ECC, Section III introduces the proposed architecture, and Section IV lists the results and some comparisons. Finally, conclusions are drawn in Section V.
II. MATHEMATICAL BACKGROUND
In this paper we choose elliptic curves over $GF(2^m)$, which allow efficient implementations in terms of silicon area and computing time. An elliptic curve over a field $GF(2^m)$ can be defined by the equation $y^2 + xy = x^3 + ax^2 + b$ (1) with $a, b \in GF(2^m)$ and $b \neq 0$. A pair $(x, y)$ satisfying (1) is called a point $P$ on the curve. The set of all points, together with the point $O$ (referred to as the "point at infinity") as the identity element, constitutes an Abelian group.
The basic operations of ECC are point doubling and point addition within the Abelian group $E$. Two distinct points $P, Q \in E$ can be added to give $R = P + Q$, called point addition. The particular case $P + P = 2P$ is called point doubling. Both point operations involve several sub-operations in the underlying field $GF(2^m)$: addition, multiplication, squaring and inversion. The hierarchical structure of these operations in our architecture is illustrated in Fig. 1. At the top of the hierarchy is the scalar multiplication $Q = kP$ (2), where $k$ is an integer and $P$ is a point on the curve. For our implementation, we use projective coordinates based on the Montgomery scalar multiplication algorithm introduced in [4]. We use the polynomial basis representation $(a_{m-1} \dots a_1 a_0)$ with the irreducible trinomial $F(x) = x^{193} + x^{15} + 1$.
1-4244-1098-3/07/$25.00 ©2007 IEEE
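The hierarchy described above (scalar multiplication built on point addition and doubling, which in turn rely on field operations) can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the textbook affine formulas inside a ladder-style loop rather than the projective-coordinate algorithm of [4], the inversion is a plain square-and-multiply, and the curve parameters `A`, `B` and the point `P` are hypothetical toy values chosen so that `P` satisfies (1). All function names are illustrative.

```python
M = 193                                   # field degree
MASK = (1 << M) - 1

def gf_reduce(v):
    """Reduce modulo F(x) = x^193 + x^15 + 1, using x^193 = x^15 + 1."""
    while v >> M:
        hi = v >> M
        v = (v & MASK) ^ hi ^ (hi << 15)
    return v

def mul(a, b):
    """Shift-and-add multiplication of two field elements (ints as bit vectors)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return gf_reduce(r)

def inv(a):
    """a^(2^M - 2), computed as the product of a^(2^i) for i = 1..M-1."""
    result, t = 1, a
    for _ in range(M - 1):
        t = mul(t, t)
        result = mul(result, t)
    return result

O = None                                  # point at infinity

def point_add(P, Q, a):
    """Affine group law on y^2 + xy = x^3 + a*x^2 + b (covers doubling too)."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:           # Q = -P, or P has order 2
            return O
        lam = x1 ^ mul(y1, inv(x1))       # doubling slope
        x3 = mul(lam, lam) ^ lam ^ a
        return (x3, mul(x1, x1) ^ mul(lam ^ 1, x3))
    lam = mul(y1 ^ y2, inv(x1 ^ x2))      # addition slope
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    return (x3, mul(lam, x1 ^ x3) ^ x3 ^ y1)

def scalar_mult(k, P, a):
    """Ladder-style k*P: the invariant R1 = R0 + P holds after every step."""
    R0, R1 = O, P
    for bit in bin(k)[2:]:
        if bit == "1":
            R0, R1 = point_add(R0, R1, a), point_add(R1, R1, a)
        else:
            R0, R1 = point_add(R0, R0, a), point_add(R0, R1, a)
    return R0

def on_curve(P, a, b):
    """Check equation (1) for an affine point."""
    if P is O:
        return True
    x, y = P
    x2 = mul(x, x)
    return mul(y, y) ^ mul(x, y) == mul(x2, x) ^ mul(a, x2) ^ b

# Hypothetical toy parameters: B is derived so that P lies on the curve.
A = 1
P = (2, 3)
B = mul(3, 3) ^ mul(2, 3) ^ mul(mul(2, 2), 2) ^ mul(A, mul(2, 2))
```

With these toy values, adding `scalar_mult(5, P, A)` and `scalar_mult(9, P, A)` gives the same point as `scalar_mult(14, P, A)`, as the group law requires.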
III. COPROCESSOR ARCHITECTURE DETAILS
In this section, we describe the architecture of the ECC design in a top-down manner. Section III-A describes the overall system structure. Section III-B presents some instructions of the arithmetic unit (AU) and some details of the data path.
A. System Overview
The overall system structure is illustrated in Fig. 2. It consists of three major parts: the 8-bit DW8051 microcontroller, the ECC coprocessor, and the memory storage unit, which comprises the Memory Management Unit and the SRAM. More than 50% extra time is spent on transferring data during a scalar multiplication, which is intolerably slow for the system.
To alleviate the data transfer bottleneck, we introduce the Memory Management Unit (MMU) to help the coprocessor move data. While the coprocessor works, the microcontroller stands idle. Because the MMU allows direct memory access, it makes the data transfers between the ECC coprocessor and the SRAM during a scalar multiplication efficient.
In previous work [1] - [3], [5], [6], the microcontroller issues the commands that perform the field arithmetic operations during a scalar multiplication, so the performance depends strongly on the efficiency of the platform. In our work, the field arithmetic operations are sequenced by fixed finite state machines inside the coprocessor and run without the control of the microcontroller. This gives us higher speed and platform-independent execution. Fig. 2 also illustrates the main internal architecture of the ECC coprocessor, which consists of three parts: the main controller (MC), the arithmetic unit controller (AUC) and the data path. The MC is a finite state machine that directs the AUC to compute point addition, point doubling and scalar multiplication. The AUC, also a finite state machine, controls the data path to perform the field operations.
B. Arithmetic Unit and Data Path
The AU not only performs arithmetic operations but also executes store/load operations, including loading, saving and exchanging the data of internal registers. Each bit $c_i$ of the register has a 4-to-1 multiplexer. The first input of the multiplexer is connected to the output of the register bit 16 positions away, which enables parallel data transfer; the second is connected to the bit's own output, which holds the data; the third is connected to the output of the squarer, to store the result of a squaring; the last is connected to the output of the multiplier, to store its result.
Field addition is the simplest of all operations: it is a bitwise addition without carry, so it needs just 16 XOR gates and completes in 13 clock cycles.
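A minimal sketch of such a word-serial addition, assuming that the 16 XOR gates form a 16-bit datapath, so that the 193-bit operands are processed in ceil(193/16) = 13 steps. The function name and the encoding of field elements as Python integers are illustrative choices, not part of the design.

```python
M = 193      # field degree
WORD = 16    # assumed datapath width: one XOR gate per bit

def gf_add_wordwise(a, b):
    """Add two GF(2^193) elements 16 bits at a time; return (sum, cycles)."""
    mask = (1 << WORD) - 1
    result, cycles = 0, 0
    for shift in range(0, M, WORD):
        # One clock cycle: XOR one 16-bit slice of each operand, no carry.
        chunk = ((a >> shift) & mask) ^ ((b >> shift) & mask)
        result |= chunk << shift
        cycles += 1
    return result, cycles
```

Because there is no carry between slices, the result is simply the bitwise XOR of the two operands, and the loop always takes 13 iterations.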
The field squarer computes a square in just one clock cycle, excluding data input and output. Such a unit is possible only when the finite field is fixed. Our architecture exploits this property to accelerate scalar multiplication.
The square over $GF(2^m)$ has a special feature: for coefficients $a, b \in GF(2)$, $(ax + b)^2 = a^2x^2 + 2abx + b^2 = ax^2 + b$. Set $C(x) = A^2(x)$; then $C(x) = c_{192}x^{192} + c_{191}x^{191} + c_{190}x^{190} + \dots + c_2x^2 + c_1x + c_0$. The squarer can therefore be constructed from XOR gates. Fig. 4 shows the internal connections of the squarer.
Field multiplication is the most important operation in the scalar multiplication process, as it is the most frequently used one. A bit-serial multiplier is the simplest design and needs the least area. A digit-serial multiplier offers another trade-off between speed and area: a $k$-digit multiplier achieves a $k$-fold speedup at the cost of a more complex circuit. Whereas the bit-serial multiplier uses only one bit of operand $B$ in each iteration, the digit-serial multiplier multiplies several bits of $B$ (equal to the digit size) with operand $A$ in each iteration (Fig. 5). We use a digit size of 2, as it gives a good speed-up without drastically increasing the area requirement.
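The two units described above can be modelled in a few lines of Python. This is a functional sketch under stated assumptions, not the circuit of Fig. 4 or Fig. 5: field elements are Python integers used as bit vectors, the multiplier consumes the digits of $B$ most-significant first (one common arrangement), and all function names are illustrative.

```python
M = 193                                   # field degree
MASK = (1 << M) - 1

def gf_reduce(v):
    """Reduce modulo F(x) = x^193 + x^15 + 1, using x^193 = x^15 + 1."""
    while v.bit_length() > M:
        hi = v >> M                       # coefficients of x^193 and above
        v = (v & MASK) ^ hi ^ (hi << 15)
    return v

def gf_square(a):
    """Square by moving coefficient a_i from x^i to x^(2i), then reducing."""
    spread = 0
    for i in range(M):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    return gf_reduce(spread)

def gf_mul_digit2(a, b):
    """Digit-serial multiplication with digit size 2; return (product, iterations)."""
    a_x = gf_reduce(a << 1)               # A * x
    multiples = (0, a, a_x, a ^ a_x)      # digit * A for digit = 0, 1, x, x + 1
    acc, iterations = 0, 0
    for shift in range(192, -1, -2):      # 97 digits cover the 193 bits of B
        digit = (b >> shift) & 0b11
        acc = gf_reduce(acc << 2) ^ multiples[digit]   # acc * x^2 + digit * A
        iterations += 1
    return acc, iterations
```

In this model the squarer involves no multiplications at all, only a fixed rewiring of bits followed by XORs, which is why it can be built for a fixed field. The multiplier needs 97 iterations, about half the 193 of a bit-serial design.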
We use a 2-digit multiplier algorithm; its circuit is shown in Fig. 5.
Field inversion is the most difficult and expensive field operation to implement in hardware. An inversion circuit based on the extended Euclidean algorithm would add complexity to the controller and lead to a larger chip area. Since the design targets low-cost applications, we choose an inversion algorithm that uses only field squaring and field multiplication. Although this costs performance, it does not add significantly to the complexity of the hardware design. The algorithm is based on Fermat's little theorem: $a^{-1} = a^{2^m - 2}$ for any nonzero $a \in GF(2^m)$. Table III shows the process of field inversion using Fermat's theorem, which requires 192 field squarings and 8 field multiplications. The field inversion can thus be carried out by a finite state machine.
Table IV lets us estimate the performance of our coprocessor with and without the field squarer, noting that each operation needs data transfers. It shows that the squarer makes scalar multiplication over 40% faster.
We post-simulated our coprocessor with ModelSim SE 6.1b. A scalar multiplication requires 240,000 cycles. Among the ECC processors whose performance and hardware cost are listed in Table VI, our work needs the least silicon area at the same security level and, in most cases, also computes faster.
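The inversion by Fermat's theorem, $a^{-1} = a^{2^{193} - 2}$, can be sketched as follows. The individual steps of Table III are not given in the text, so the addition chain used here (exponents 1, 2, 3, 6, 12, 24, 48, 96, 192, in the style of Itoh and Tsujii) is an assumption: it is one chain that needs exactly 192 squarings and 8 multiplications and may differ from the authors' sequence. The helper names are illustrative, and `square` is a functional stand-in for the one-cycle squarer.

```python
M = 193
MASK = (1 << M) - 1

def gf_reduce(v):
    """Reduce modulo F(x) = x^193 + x^15 + 1, using x^193 = x^15 + 1."""
    while v >> M:
        hi = v >> M
        v = (v & MASK) ^ hi ^ (hi << 15)
    return v

def mul(a, b):
    """Shift-and-add field multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return gf_reduce(r)

def square(a):
    return mul(a, a)

# Each pair (j, k) builds beta[j + k] from beta[j] and beta[k],
# where beta[n] = a^(2^n - 1) and beta[j + k] = beta[j]^(2^k) * beta[k].
CHAIN = [(1, 1), (2, 1), (3, 3), (6, 6), (12, 12), (24, 24), (48, 48), (96, 96)]

def inverse(a):
    """Return (a^-1, number of squarings, number of multiplications)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    n_sq = n_mul = 0
    beta = {1: a}
    for j, k in CHAIN:
        t = beta[j]
        for _ in range(k):                # k squarings: t = beta[j]^(2^k)
            t = square(t)
            n_sq += 1
        beta[j + k] = mul(t, beta[k])     # one multiplication per chain step
        n_mul += 1
    result = square(beta[192])            # (a^(2^192 - 1))^2 = a^(2^193 - 2)
    n_sq += 1
    return result, n_sq, n_mul
```

The fixed sequence of chain steps maps naturally onto a finite state machine: each state issues a run of squarings followed by a single multiplication.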
V. CONCLUSION
This paper presents a low-cost architecture for elliptic curve cryptography. Compared with other related work, it does not depend strongly on the efficiency of the platform and can much more easily serve as an IP block for different embedded systems. We also introduce the MMU to help the ECC coprocessor store and load data directly. The MMU not only speeds up the coprocessor but also lets it work independently.
At a hardware cost of about 11.5k gates, an ECC scalar multiplication over $GF(2^{193})$ takes 24 ms on our embedded system when clocked at 10 MHz. Although the dedicated squarer adds 4% extra silicon area, it yields much higher performance, around 40% faster than without it. To our knowledge, it is the coprocessor with the smallest area among low-cost ECC implementations with the same security level. In addition, it is relatively fast. All of these features make our work more favorable than previous work on low-cost embedded systems.
