Implementing public-key algorithms with large word lengths on embedded systems such as smartcards, RF-ID tags and mobile terminals is a challenge. This paper presents a HW/SW co-design solution for RSA and Elliptic Curve Cryptography (ECC) over GF(p) on a 12 MHz 8-bit 8051 micro-controller. The hardware coprocessor has a Modular Arithmetic Logic Unit (MALU) whose digit size (d) is variable, so that it can be adapted to the speed and bandwidth of the micro-controller to which it is connected. The HW/SW co-design space exploration is based on the GEZEL system-level design environment, which allows the designer to find the digit size with the best performance-area combination. As a case study in FPGA prototyping, 160-bit ECC over GF(p) (ECC-160p) was implemented on a Xilinx Virtex-II PRO (XC2VP30). The results show that one point multiplication takes only 130 ms, including all communications between the 8051 and the coprocessor. This is about 40 times faster than the most optimized SW implementation on a small CPU in the literature. The speed-up is achieved by using the HW/SW co-design exploration to find the optimized digit size of the MALU. At the same time, the ECC-160p design maintains a high level of flexibility through the use of coprocessor instructions. The proposed architecture shows that HW/SW co-design can provide performance close to ASIC solutions together with the flexibility of SW, even on a small CPU.
Introduction
Public-key cryptosystems form an essential building block for digital communication. Whereas private-key algorithms allow fast encryption of large amounts of data, Public-Key Cryptography (PKC) enables secure communication over insecure channels without prior exchange of a secret key. In addition, PKC enables digital signatures, an important cryptographic service. Diffie and Hellman introduced the idea of PKC in the mid 1970s [1].
0045-7906/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.compeleceng.2007.05.005
Implementing Public-Key Cryptography (PKC) is a challenge in embedded systems such as smartcards, RF-ID tags, and mobile terminals, which have limited silicon resources and a limited power budget. Although software implementations are flexible, PKC is very slow on the small 8-bit or 16-bit CPUs that are often used in power-constrained embedded systems. ASIC implementations of PKC, on the other hand, perform better than software ones, mainly for the following reasons.
(1) PKC systems need repeated modular operations on integers that are much longer than a CPU's native word width (e.g. RSA needs 1024 bits or more and ECC needs 160 bits or more) [2-4]. ASICs do not have this bit-width limitation.
(2) In an ASIC, the memory organization and the memory access latency can be tuned to the application. Dedicated Flip-Flops (F/Fs) or RAMs allow the best use of clock cycles for transferring intermediate data.
(3) A dedicated controller in an ASIC can handle operations in the datapath without any stalls.
Such an ASIC is the best choice for mass production when the trade-off between cost and performance is considered. However, a fixed hardware solution has only a limited application range and cannot be easily reused. One approach to making it reusable is to use re-configurable logic such as FPGAs. The main drawback of FPGAs is that their power consumption is still unacceptable for most small embedded systems. Hence, HW/SW co-designs are attractive, since a proper HW/SW partitioning offers the advantages of both SW flexibility and HW performance. In this paper, a 12 MHz 8051 was chosen as the controller for our proposed HW/SW co-design. The iterative modular multiplications, which are the most critical computation components, are implemented on the coprocessor. The originality of our approach is that we optimize HW and SW simultaneously. This is facilitated by the GEZEL design methodology [13]. GEZEL is a hardware description language that can interface with an Instruction Set Simulator (ISS) of a CPU. The combination of GEZEL and an ISS makes it possible to simulate the system-level functionality quickly, to distribute the computations between HW and SW, and even to optimize the HW coprocessor for a given embedded core. This shortens the design time for both HW and SW.
The remainder of this paper is organized as follows. Section 2 gives a survey of relevant previous work. Section 3 explains the architecture of our proposed coprocessor. The software implementation on the 8051 is explained in Section 4. The performance is evaluated with a system-level simulation, and the results are reported in Section 5. The results of our FPGA implementation are presented in Section 6, and Section 7 concludes the paper.
Related work
The most recent published work is that of Satoh and Takano [8]. They present a dual-field multiplier with high performance in both types of fields. The throughput of an EC point multiplication is maximized by the use of a Montgomery multiplier and an on-the-fly redundant binary converter. The biggest advantages of their design are its scalability in operand size and its flexibility in trading speed against hardware area. They show that an ECC-160p point multiplication can be computed in 1.21 ms with a 0.13-µm CMOS ASIC running at 137.7 MHz. As this is a full-custom ASIC solution, its performance cannot be compared directly with that of our proposed architecture. However, it is a good reference point for evaluating the efficiency of the proposed HW/SW co-design. Örs et al. introduce an FPGA implementation of ECC over GF(p) with a systolic-array Montgomery Modular Multiplier (MMM) [9]. Batina et al. optimize the architecture of [9] and obtain 3.9 ms for ECC-160p on an FPGA operating at 53 MHz [10]. Regarding software design on an embedded CPU, the only relevant work was reported by Gura et al. [11]. They compare ECC and RSA on 8-bit CPUs and show that PKC is viable on small devices. One of their results is an ECC-160p point multiplication computed in 4.58 s on a 14.7 MHz 8051-based CPU.
Coprocessor architecture
In this section, the proposed coprocessor architecture is described in detail.
Modular arithmetic logic unit (MALU)
For the datapath of the coprocessor, we designed a configurable Modular Arithmetic Logic Unit (MALU) based on Montgomery's algorithm [5] . The MALU consists of regularly allocated Carry-Save Adders (CSAs).
As shown on the left side of Fig. 1, it has four input vectors, $X = (x_g \ldots$ […] $/d\rceil$ and $h = \lceil k/d \rceil$. X and Y are the multiplicand and the multiplier, and N is the modulus. The addend vector S is supplied to the MALU d bits per cycle and is eventually added to the result of the modular multiplication of X and Y (modulo N). The intermediate results, the so-called virtual sums and carries, are stored in $V_S = (vs_{i,k+\alpha-1} \ldots vs_{i,1}\, vs_{i,0})_2$ and $V_C = (vc_{i,k+\alpha-1} \ldots vc_{i,1}\, vc_{i,0})_2$. They are reset to zero when a modular multiplication starts (i = 0). After the Montgomery multiplication finishes, the result is output from the right-most cell d bits per cycle as $S_{out} = (sout_g \ldots sout_1\, sout_0)_{2^d}$. As will be explained later, $V_S$ and $V_C$ are fed back as inputs in a modular addition. In order to make $sout_i$ zero, the right-most cell, cell(i,0), determines the vector $m_i$ and provides it to the rest of the cells.
The proposed array is flexible with respect to the digit size d. Performance and cost trade-offs and the type of interface to the µ-controller determine the value of d. For a slow µ-controller with a slow interface it is beneficial to choose a smaller d, since the coprocessor can use the cycles between coprocessor instructions. The MALU has two independent stages. One is the Carry-Save (CS)-stage, which implements Montgomery's algorithm in CS-form. The other, the Carry-Propagate (CP)-stage, converts the CS-form integer into a normal integer by propagating carries. Moreover, the CP-stage is capable of adding S to the result of the CS-stage. To reduce the hardware cost, the CP calculations are executed in the same cells that are used for the CS-stage. This operation is described by Eq. (1).
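Reading the description above as a Montgomery product with an added operand, the combined operation presumably takes the following form. The exact placement of the modular reduction is an assumption made here for illustration; R denotes the Montgomery radix.

```latex
\mathrm{MALU}_N(X, Y, S) \;\equiv\; X \cdot Y \cdot R^{-1} + S \pmod{N} \qquad (1)
```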
Here R is selected as $R = 2^{k+\alpha}$, where k is the bit-length of the secret key and $\alpha$ is a value chosen so that the final reductions can be avoided [12]. In this paper, we choose $\alpha = 4$. Thus, the reduction step required in the original formulation of Montgomery's algorithm is not needed. For iterative use of Eq. (1), operands are kept in the so-called Montgomery form, so that the output is in Montgomery form as well. The clock latencies of the CP-stage and of the MALU (CS-stage + CP-stage) are $\lceil (k+\alpha)/d \rceil$ and $2\lceil (k+\alpha)/d \rceil$ cycles, respectively.
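The functional behaviour described above can be modelled in a few lines of Python. This is a minimal sketch, not the CSA-based hardware: it assumes that X is scanned d bits at a time, that the addend S is simply added to the unreduced result, and that the names `malu`, `alpha` and `n_prime` are illustrative. When d does not divide k + alpha, the effective radix in this model is 2^(d·⌈(k + alpha)/d⌉) rather than 2^(k + alpha).

```python
def malu(x, y, s, n, k, d=4, alpha=4):
    """Functional model: x*y*R^-1 + s, congruent mod n, without final subtraction.

    R = 2**(d * digits) with digits = ceil((k + alpha) / d); n must be odd.
    """
    if n % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    digits = -(-(k + alpha) // d)           # ceil((k + alpha) / d) iterations
    base = 1 << d
    n_prime = (-pow(n, -1, base)) % base    # -n^-1 mod 2^d (Python 3.8+)
    v = 0                                   # plays the role of V_S + V_C
    for i in range(digits):
        x_i = (x >> (d * i)) & (base - 1)   # next d-bit digit of X
        v += x_i * y
        m_i = (v * n_prime) % base          # makes the low d bits of v zero
        v = (v + m_i * n) >> d              # exact division by 2^d
    return v + s                            # assumed CP-stage: add the addend S
```

With d = 4 and k = 160 the loop runs 41 times, which matches ⌈(160 + 4)/4⌉.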
Coprocessor memory
The MALU operations need seven sets of (k + c)-bit F/Fs that store inputs, outputs, and intermediate variables, where $c \in [0, 2]$. The MALU coprocessor will be used repetitively by the 8051, so the values used and generated must be stored somewhere in the system. One possible location is the memory attached to the 8051 (XRAM). This is the simplest and cheapest solution. Another solution is to allocate more sets of F/Fs or SRAMs in the coprocessor itself, called CP-RAM. This is expensive from a cost point of view, especially for large k, but it offers much higher performance, since the overhead of accessing the F/Fs in the coprocessor is much smaller than that of accessing XRAM.
Fig. 1. Schematics of the MALU.
Finite state machine (FSM)
The coprocessor is controlled by opcodes (coprocessor instructions) sent via port P0 of the 8051. The operands are transferred through P1, P2, and P3. The FSM block in the coprocessor decodes the coprocessor instructions and executes the required operations. A 12 MHz 8051 can issue a CPU instruction only once every 12 clock cycles, and port accesses take several CPU instruction cycles. This results in long intervals between consecutive coprocessor instructions when the coprocessor also runs at 12 MHz. Therefore, the FSM should keep the datapath busy computing until the next coprocessor instruction is available. This is another system optimization problem; the details are given in Section 5.2.
Software implementation on the 8051
The performance of PKCs is primarily determined by the efficient realization of the arithmetic operations (addition, multiplication and inversion) in the underlying finite field. If projective coordinates are used for an elliptic curve, the inversion can be neglected, because it is needed only when converting the projective point back to the affine point at the last step of the point multiplication. Therefore, the system architecture for PKCs is normally designed to accelerate the field multiplication and addition. Fig. 2 shows the design hierarchy using the 8051. An 8051 is an 8-bit micro-controller originally designed by Intel that consists of several components: a controller and instruction decoder, an ALU, 128 bytes of internal memory (IRAM), up to 64 Kbytes of external RAM (XRAM) and up to 64 Kbytes of external program memory or 4 Kbytes of internal program memory (PROM). Thus, the targeted PKC system combines high flexibility in programming a large variety of public-key operations with the high performance provided by the coprocessor. In the following sections, the operations handled by SW are explained.
Point multiplication
Although other algorithms (e.g. signed m-ary [6]) can also be applied to the proposed architecture, the simplest algorithm, the binary method [7], is implemented in this work.
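The binary method scans the scalar from its most significant bit, doubling at every bit and adding the base point whenever the bit is 1. A minimal sketch follows; the function name and the callback parameters `point_add` and `point_double` are hypothetical stand-ins for the point operations, and no side-channel protection is attempted.

```python
def binary_point_multiply(k, p, point_add, point_double):
    """Left-to-right binary (double-and-add) method: returns k*p for k >= 1."""
    if k < 1:
        raise ValueError("scalar must be a positive integer")
    q = p                              # accounts for the leading 1 bit of k
    for bit in bin(k)[3:]:             # remaining bits of k, MSB first
        q = point_double(q)            # Q <- 2Q
        if bit == "1":
            q = point_add(p, q)        # Q <- P + Q
    return q
```

For a 160-bit scalar this performs 159 doublings and, on average, about half as many additions, which is why the field multiplication dominates the run time.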

Point addition/doubling
Point addition and doubling can be performed according to the algorithm given in [6]. Here we assume that the two points to be added, $P = (X_1, Y_1, Z_1)$ and $Q = (X_2, Y_2, Z_2)$, are given in weighted projective coordinates (Jacobian representation) and in Montgomery representation, where (X, Y, Z) corresponds to the affine coordinates $(X/Z^2, Y/Z^3)$. The resulting point is stored as Q, i.e. $Q \leftarrow P + Q$. The schedule shown in Table 1 can be used for point addition.
Point doubling is considered as a special case of point addition, i.e. $Q \leftarrow 2Q = Q + Q$. Table 1 also gives a possible schedule for point doubling. For efficient computation of point addition and doubling, four additional registers ($t_1$ to $t_4$) that store intermediate variables are provided. These are allocated in CP-RAM.
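For reference, the textbook Jacobian formulas for a curve y² = x³ + ax + b over GF(p) can be sketched as below. This is an illustration under stated assumptions: it uses plain modular arithmetic rather than Montgomery representation, it does not follow the register schedule of Table 1, and it omits the special cases (point at infinity, P = ±Q in the addition). All function names are hypothetical.

```python
def jacobian_double(pt, a, p):
    """Return 2*pt for pt = (X, Y, Z) in Jacobian coordinates (Y != 0)."""
    x, y, z = pt
    y2 = y * y % p
    s = 4 * x * y2 % p                          # S = 4*X*Y^2
    m = (3 * x * x + a * pow(z, 4, p)) % p      # M = 3*X^2 + a*Z^4
    x3 = (m * m - 2 * s) % p
    y3 = (m * (s - x3) - 8 * y2 * y2) % p
    z3 = 2 * y * z % p
    return x3, y3, z3


def jacobian_add(pt1, pt2, p):
    """Return pt1 + pt2 for distinct finite points with pt1 != -pt2."""
    x1, y1, z1 = pt1
    x2, y2, z2 = pt2
    z1z1, z2z2 = z1 * z1 % p, z2 * z2 % p
    u1, u2 = x1 * z2z2 % p, x2 * z1z1 % p       # X coordinates on a common Z
    s1, s2 = y1 * z2z2 * z2 % p, y2 * z1z1 * z1 % p
    h, r = (u2 - u1) % p, (s2 - s1) % p
    h2 = h * h % p
    h3 = h2 * h % p
    x3 = (r * r - h3 - 2 * u1 * h2) % p
    y3 = (r * (u1 * h2 - x3) - s1 * h3) % p
    z3 = h * z1 * z2 % p
    return x3, y3, z3


def to_affine(pt, p):
    """Convert (X, Y, Z) to (X/Z^2, Y/Z^3); the only field inversion needed."""
    x, y, z = pt
    z_inv = pow(z, -1, p)                       # Python 3.8+
    return x * z_inv**2 % p, y * z_inv**3 % p
```

Only `to_affine` needs an inversion, which is why the inversion cost can be neglected when projective coordinates are used throughout the point multiplication.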
System-level design and simulation
GEZEL system design environment
The specified architecture was designed and simulated with GEZEL. GEZEL provides cycle-accurate co-simulation with the 8051 ISS. The coprocessor is designed in the FSMD style that GEZEL uses for hardware description. The GEZEL code is automatically translated into VHDL code that is synthesizable with existing synthesis tools. GEZEL enables fast system-level estimation for a target system architecture and quick prototyping on an FPGA.
Instruction sets for coprocessor
First, the registers in the coprocessor are initialized via the ports in units of 8 bits. Setting the modulus N in CP-RAM requires 20 executions of the SETN(din) instruction for ECC-160p. After all parameters and initial values have been initialized, $MALU_N$ and $CP_N$ operations are executed according to the sequence in Table 1. The $MALU_N$ operation consists of several instructions. For instance, the second step of point addition in Table 1 starts with a CS(din,din2) instruction. Here, din and din2 are the CP-RAM addresses where the values $X_2$ and $t_1$ are stored. The coprocessor loads the two operands and executes the CS-stage. The succeeding instructions are CS( ) and CP( ), which continue the CS-stage and the CP-stage. The number of CS( ) and CP( ) instructions is determined by the value of d. At the last step, a CS(din) instruction is sent from the 8051, and the coprocessor stores the resulting value of the CP-stage, $t_2$, at address din of CP-RAM.
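Because the ports are 8 bits wide, a 160-bit modulus takes 160/8 = 20 SETN transfers. The helper below, with the hypothetical name `setn_bytes`, sketches how the SW side might slice the modulus; the least-significant-byte-first order is an assumption, since the transfer order is not stated here.

```python
def setn_bytes(n, k=160):
    """Split a k-bit modulus into the byte values for successive SETN calls."""
    count = -(-k // 8)                          # ceil(k / 8) port transfers
    # Assumed order: least significant byte first.
    return [(n >> (8 * i)) & 0xFF for i in range(count)]
```

The same slicing gives 128 transfers for a 1024-bit RSA modulus.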
In our SW implementation, the 8051 needs four port accesses to issue a coprocessor instruction of the type INST(din,din2,din3), for instance. This is illustrated in Fig. 3. The four port accesses correspond to 96 clock cycles or even more for the coprocessor. Therefore, we need to define a new instruction that stands for a series of CS( ) and CP( ) instructions, so that the coprocessor can iteratively execute the CS-stage and the CP-stage. When d is large enough to complete the CS-stage and the CP-stage within 96 clock cycles, only one instruction is needed to execute the $MALU_N$ operation. This implies that a large digit size does not always improve performance in our case, because the 8051 cannot dispatch instructions fast enough to keep the coprocessor busy. On the other hand, with a relatively small digit size, several instructions must be sent. More precisely, for ECC-160p two or more instructions are needed for the $MALU_N$ operation when $2\lceil (160+4)/d \rceil \ge 96$, i.e. $d \le 3$. For the 1024-bit RSA case, $2\lceil (1024+4)/d \rceil \ge 96$ gives $d \le 21$. These values of d indicate that the maximum useful area of the coprocessor (determined by the digit size) is limited by the communication speed between the 8051 and the coprocessor. In other words, no performance improvement can be expected from increasing the digit size once $d \ge 4$ for ECC-160p, as shown in Fig. 4.
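The digit-size bounds quoted above follow directly from comparing the MALU latency with the 96-cycle instruction issue interval. The short check below reproduces them; the function names and the search limit `d_max` are illustrative choices.

```python
def malu_cycles(k, d, alpha=4):
    """Latency of CS-stage plus CP-stage: 2 * ceil((k + alpha) / d) cycles."""
    return 2 * -(-(k + alpha) // d)


def largest_d_needing_several_instructions(k, issue_interval=96, d_max=64):
    """Largest digit size whose MALU latency still reaches the issue interval."""
    return max(d for d in range(1, d_max + 1)
               if malu_cycles(k, d) >= issue_interval)
```

For k = 160 the latency is 110 cycles at d = 3 and 82 cycles at d = 4, so d = 4 is the first digit size at which one instruction per MALU operation suffices.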
However, as long as d satisfies $d \le 3$ for ECC-160p or $d \le 21$ for 1024-bit RSA, a coprocessor with a larger d provides better performance. Moreover, these calculated values of d do not guarantee the best performance because of additional communication between the coprocessor and the 8051: as d becomes smaller, the coprocessor has to communicate more frequently with the 8051 to wait for a new instruction, by flag polling or by an interrupt, both of which need extra cycles. In the next section, the trade-off between cost and performance is discussed using GEZEL.
Cost/performance estimation using GEZEL
Before prototyping on an FPGA, the performance of RSA and ECC is estimated using GEZEL system-level co-simulation with the 8051. As illustrated in Fig. 4, ECC-160p shows the same performance for d = 8 and 16 as for d = 4. For RSA-1024, the performance improves as d increases for $d \le 16$. To make a reasonable and quick estimate of the area, the following gate counts are used for the proposed coprocessor: an F/F, an FA (Full Adder), an HA (Half Adder), a 2-to-1 MUX (Multiplexer), and a 2-bit XOR are counted as 8, 12, 6, 2, and 3 gates, respectively. We use the product of area and performance to find the best trade-off between cost and performance. As a result, the best trade-off is obtained for d = 4 for both RSA and ECC, as shown in Fig. 5.
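The area estimate and the figure of merit described above amount to a weighted sum and a product. A minimal sketch follows, in which the dictionary keys, the function names and any cell counts passed in are hypothetical (the per-design cell counts are not listed in this section), and execution time is assumed to be the performance measure.

```python
# Gate-equivalent weights quoted in the text.
GATE_WEIGHTS = {"ff": 8, "fa": 12, "ha": 6, "mux2": 2, "xor2": 3}


def gate_count(cells):
    """Estimated area in gates for a mapping of cell type to instance count."""
    return sum(GATE_WEIGHTS[name] * count for name, count in cells.items())


def area_time_product(cells, exec_time_ms):
    """Figure of merit: smaller is better (assumes time as the performance)."""
    return gate_count(cells) * exec_time_ms
```

Evaluating `area_time_product` for each candidate digit size and picking the minimum is one way to read the selection of d from Fig. 5.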
FPGA implementation results of ECC-160p
Based on the performance estimation with GEZEL, we implemented ECC-160p and compared it with previous work. The software C code is compiled with µVision2 by Keil Software, Inc. with the Intel 8051AH as the target device. The coprocessor block was synthesized with Project Navigator by Xilinx and implemented on a Virtex-II PRO (XC2VP30). As shown in Table 2, the first three references, which target ASIC solutions, show better performance at 12 MHz operation than this work, except for the design in [8] that uses an 8-bit multiplier. Considering the flexibility of this work, the difference in performance can be regarded as small. On the other hand, our result is about 40 times faster than the optimized software implementation of [11]. Our architecture shows that HW/SW co-design provides high performance close to ASIC solutions together with the flexibility of SW.
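As a rough cross-check of the "about 40 times" figure, the two results quoted in this paper (4.58 s at 14.7 MHz for [11], 130 ms at 12 MHz for this work) can be compared both in wall-clock time and in clock cycles. How the comparison was normalized is not stated here, so both ratios are computed; the function name is illustrative.

```python
def speedup(t_ref_s, f_ref_hz, t_new_s, f_new_hz):
    """Return (wall-clock ratio, clock-cycle ratio) of reference over new."""
    wall = t_ref_s / t_new_s
    cycles = (t_ref_s * f_ref_hz) / (t_new_s * f_new_hz)
    return wall, cycles


wall_ratio, cycle_ratio = speedup(4.58, 14.7e6, 0.130, 12e6)
```

The wall-clock ratio is about 35 and the cycle-count ratio about 43, which brackets the quoted factor.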
Conclusions
We have presented a MALU coprocessor that is scalable in the digit size d. It allows fast execution of modular multiplications and additions on CSA chains. As a case study to demonstrate appropriate HW/SW partitioning, ECC-160p was prototyped on an FPGA. The result shows high performance even with an 8-bit 8051: a point multiplication takes only 130 ms, including all communications between the 8051 and the coprocessor. This is achieved by a HW/SW co-design exploration of the digit size of the MALU. By changing the coprocessor configuration and adapting the SW for a micro-controller, the architecture is well suited to both RSA and ECC algorithms.
