Abstract-This paper presents a microcode instruction set coprocessor that is designed to work with an 8-bit 8051 microcontroller and implements a Hyperelliptic Curve Cryptosystem (HECC). The microcode coprocessor is capable of performing a range of Galois Field operations using a dual-multiplier/dual-adder datapath and storing the intermediate results in the local storage unit of the coprocessor (RAM). The coprocessor is programmed by software routines running on the 8051 microcontroller, which implement the HECC divisor doubling and addition operations. The Jacobian scalar multiplication was computed in 656 msec (7.87 M cycles) at a 12 MHz clock frequency.
I. INTRODUCTION
High-speed implementation of Public Key Cryptography (PKC) is required for providing security in various communication systems. The best-known and most commonly used public-key cryptosystem is RSA [1]. However, it is not a feasible solution for low-power, small-footprint devices. Emerging areas such as RFID tags and sensor networks put new requirements on implementations of PKC algorithms, with firm constraints in terms of number of gates, power, bandwidth, etc. A promising candidate appears to be a Hyper/Elliptic Curve Cryptosystem (H/ECC), but the previously mentioned requirements can probably be met only through the synergy of hardware and software. ECC has already proven its potential, as it offers shorter certificates, lower power consumption and better performance on some platforms. In addition, ECC offers more "security per bit" than RSA, as no sub-exponential algorithm is known that solves the discrete logarithm problem in this group. However, HECC maintains all those advantages with even shorter bit-lengths. More precisely, the operand size for HECC is at least a factor of two smaller than that of ECC at the same level of security. This fact makes HECC a very good choice for platforms with limited resources.
Algorithms for HECC and their implementations have been studied intensively in the past years. A significant amount of work has been performed on optimizing the formulae for the group operation [2, 4, 5, 7]. Explicit formulae for genus 2 curves are given by Lange [2] for arbitrary fields and for various types of coordinates. For embedded processors, a large amount of work has been done for the ARM platform [3, 9, 10]. Pelzl et al. [9] implemented the group operation of genus 2 and 3 HECC on an ARM7 processor. They compared the results with an ECC implementation (with corresponding security) and showed that HECC performance is comparable to that of ECC. The performance of divisor scalar multiplication on the ARM microprocessor for genus 2 was further optimized in [10] and compared to genera 3 and 4. Gura et al. [11] compared ECC and RSA on 8-bit CPUs and showed that Public-Key Cryptography is viable on small devices, with the results favoring ECC substantially.
The first complete hardware implementation of HECC was given by Boston et al. [6]. They used Cantor's algorithm [8] to implement HECC on a Virtex-II FPGA. Wollinger et al. investigated HECC implementation on a VLSI coprocessor [12, 13]. In [14] three different architectures on an FPGA were examined for a wide range of applications. Most of the published work dealt with binary fields. The only exception is the work of Baktır et al. [3], which investigated implementation over an extension field of odd characteristic, i.e. over Optimal Tower Fields (OTF), on an ARM7.
This paper presents a microcode instruction set coprocessor which is designed to work with an 8-bit 8051 microcontroller to implement a Hyperelliptic Curve Cryptosystem. More precisely, we have implemented the HECC divisor multiplication operation on the 8051 microcontroller, which uses a hardware coprocessor to optimize the performance. This extra hardware is a coprocessor with a dual-multiplier/dual-adder datapath, which allows for a speed-up by a factor of 228 when compared with the software-only solution. We have re-written the formulae of Byramjee and Duquesne [7] to facilitate the divisor operations in this special case. In this way we achieved optimized divisor doubling and addition. Namely, we take advantage of the dual-multiplier/dual-adder datapath, which allowed us to exploit the parallelism in field multiplications.
The remainder of this paper is organized as follows. In section 2 some background information on HECC is given. Details of our implementation are specified in section 3. Section 4 gives details of a microcode instruction set coprocessor. Results are listed in section 5 and conclusions are given in section 6.
II. HYPERELLIPTIC CURVE CRYPTOGRAPHY
Hyperelliptic Curve Cryptography was proposed in 1988 by Koblitz [15] as a generalization of Elliptic Curve Cryptography. In particular, elliptic curves can be viewed as a special case of hyperelliptic curves i.e. an EC is an HEC with genus g=1.
A. Hyperelliptic curves
Here we consider a hyperelliptic curve C of genus g=2 over GF(2^m), which is defined by an equation of the form:

$$C:\; y^2 + h(x)\,y = f(x), \qquad h(x) = h_2 x^2 + h_1 x + h_0, \qquad f(x) = x^5 + f_4 x^4 + f_3 x^3 + f_2 x^2 + f_1 x + f_0,$$

with all coefficients h_i and f_i in GF(2^m). For our implementation we used so-called type II curves [7], which are defined by h_2 = 0, h_1 ≠ 0. In particular, the authors of [7] recommend a specific family of type II curves, since they combine simpler arithmetic with a good security level. More precisely, those curves allow for much faster divisor doubling while addition stays the same as for a general curve.
Now we introduce a group structure for specific objects created on a hyperelliptic curve. A divisor D is a formal sum of points on the hyperelliptic curve C. Let Div denote the group of all divisors on C and Div^0 the subgroup of Div of all divisors with degree zero. The Jacobian J of the curve C is defined as the quotient group J = Div^0 / P. Here P is the set of all principal divisors, where a divisor D is called principal if D = div(f) for some element f of the function field of C. In practice, the Mumford representation is typically used; in this representation each divisor is represented as a pair of polynomials [u, v]. Here, u is monic of degree 2, deg v < deg u, and u divides v^2 + hv - f. For implementations of HECC, we need to implement the multiplication of elements of the Jacobian, i.e. divisors, with some scalar.
B. HECC algorithms
Following a top-down approach, the highest-level operation is the divisor scalar multiplication. It is implemented using the so-called non-adjacent form (NAF) of the scalar [17], which has the lowest weight among all signed-digit representations. This representation is beneficial because the subtraction of divisors is only as expensive as the divisor addition. In this way the scalar multiplication is implemented as a sequence of divisor additions/subtractions and doublings. We use projective coordinates, which allow us to complete all divisor operations without inversion. Only one inversion and four multiplications are required at the end to convert back from projective to affine coordinates. We have re-written the formulae from [7] for the doubling to achieve almost full parallelism for field multiplications. We also used the same approach to obtain the formulae for the addition in the case of mixed coordinates. The formulae for the parallelized doubling and addition are given in Tables I and II, respectively.
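The NAF-based scalar multiplication described above can be sketched as follows. This is a minimal illustration, not the paper's C routine: the function names `naf` and `scalar_mul` are hypothetical, and the group operations are passed in as callbacks so that the same loop would apply to divisor addition, subtraction and doubling. The loop is exercised here on plain integers under addition, purely as a stand-in group.

```python
def naf(k):
    """Return the non-adjacent form of k >= 0 as digits in {-1, 0, 1},
    least significant digit first (hypothetical helper, not from the paper)."""
    digits = []
    while k > 0:
        if k & 1:
            # Pick +1 if k = 1 (mod 4), -1 if k = 3 (mod 4), so that the
            # next digit is guaranteed to be zero.
            d = 2 - (k % 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mul(k, P, add, sub, double, zero):
    """Left-to-right signed double-and-add over an abstract group.
    add/sub/double stand in for the divisor operations; zero is the identity."""
    Q = zero
    for d in reversed(naf(k)):
        Q = double(Q)
        if d == 1:
            Q = add(Q, P)
        elif d == -1:
            Q = sub(Q, P)
    return Q


# Stand-in group: integers under addition, so k*P is ordinary multiplication.
result = scalar_mul(
    23, 5,
    add=lambda a, b: a + b,
    sub=lambda a, b: a - b,
    double=lambda a: 2 * a,
    zero=0,
)
```

Because no two adjacent NAF digits are nonzero, on average only about one third of the digits trigger an addition or subtraction, compared with about one half for the plain binary expansion.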
III. COPROCESSOR ARCHITECTURE
This section presents the architecture of the proposed crypto coprocessor. First, the system architecture and the interface of the coprocessor with the 8-bit microcontroller are described. Then the individual units of the crypto coprocessor are presented: the datapath, the storage unit, and the controller.
A. System Architecture
Figure 1 shows the block diagram of the hardware architecture. There are four 8-bit ports that are used for communication between the 8051 microcontroller and the coprocessor. Two of them carry the input and output data, and the other two carry the coprocessor's instruction and the address used to access the local storage. Every data transfer to the local storage (RAM) goes through the input_word and output_word registers, which are 84 bits wide, matching the word length of the operations in the coprocessor's datapath.
The 8051 is an 8-bit microcontroller originally designed by Intel that consists of several components: a controller and instruction decoder, an ALU, 128 bytes of internal memory, up to 64 KB of external memory addressed by a 16-bit DPTR register, and up to 64 KB of external program memory or 4 KB of internal program memory (ROM). The 8051 also has 28 bytes of special function registers (SFRs), which are used to store system values such as timers, serial port controls, input/output registers, etc. In our architecture, which uses the Dalton 8051 core from UC Riverside [18], all four ports are available as a "memory-mapped" interface to the microcode coprocessor.
B. Coprocessor's datapath
Figure 2 shows the coprocessor's datapath, which is built around a dual-multiplier/dual-adder structure in GF(2^83). The main reason for this choice is that the divisor operations in Tables I and II are scheduled so that two multiplications or additions can be performed concurrently in order to increase the overall performance. This means that the datapath has to be capable of performing every line of the schedules in Tables I and II. This is done as follows. Before starting the GF(2^83) operations, the input operands are loaded into the A, B, and D registers. After the completion of the multiplication or addition, the output results can either be sent out to the local storage or be moved from the C registers to the input registers (A, B, D) for further processing. Therefore, a combination of Galois Field operations, including multiple multiplications/additions over multiple input operands, can be performed. In this way every line of the divisor doubling and addition schedules can be implemented on the proposed datapath. The datapath uses bit-serial GF(2^83) multipliers that perform a multiplication in 84 cycles and bit-parallel GF(2^83) adders that perform an addition in a single clock cycle.
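To make the two datapath primitives concrete, the following sketch models a GF(2^83) addition as a bitwise XOR and a bit-serial multiplication as a shift-and-add loop that consumes one bit of the second operand per iteration. The reduction polynomial is not stated in the text; the pentanomial x^83 + x^7 + x^4 + x^2 + 1 used below is an assumption made only for illustration, and `gf_add` and `gf_mul` are hypothetical names. The loop models the arithmetic only and makes no claim about the hardware's exact cycle count.

```python
M = 83  # field degree, GF(2^83)

# ASSUMPTION: the source does not give the reduction polynomial. This value
# (x^83 + x^7 + x^4 + x^2 + 1) is a placeholder; substitute the polynomial
# of the actual design.
POLY = (1 << 83) | (1 << 7) | (1 << 4) | (1 << 2) | 1


def gf_add(a, b):
    """Addition in GF(2^m): coefficient-wise XOR, no carries."""
    return a ^ b


def gf_mul(a, b):
    """MSB-first bit-serial multiplication modulo POLY.
    Each iteration multiplies the accumulator by x, reduces it if the
    degree reached M, and adds a when the current bit of b is set."""
    acc = 0
    for i in reversed(range(M)):
        acc <<= 1
        if acc >> M:          # degree-M term present: subtract (XOR) POLY
            acc ^= POLY
        if (b >> i) & 1:
            acc ^= a
    return acc


# x * x^82 = x^83, which reduces to x^7 + x^4 + x^2 + 1 under the assumed POLY.
example = gf_mul(0b10, 1 << 82)
```

Since addition is a single XOR over all bits, it maps naturally to a one-cycle bit-parallel adder, whereas the multiplication loop explains why a bit-serial multiplier needs on the order of one cycle per operand bit.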
C. Local storage unit
The local storage unit consists of 128 memory locations, each 32 bits wide. To simplify addressing, every four consecutive locations are used to store one temporary variable of the GF(2^83) field. Therefore, the storage unit can hold a total of 32 elements of GF(2^83). The input data is first loaded into addresses 0x00 to 0x10, and the result of each doubling and addition is written back to the same locations at every step of the scalar multiplication algorithm.
In the end, the same memory locations contain the final result, which is sent back to the 8051 microcontroller after the projective-to-affine conversion is performed. The memory address bus is 7 bits wide to cover the 128 locations, and the coprocessor controller asserts the required values on the memory read (rd) and write (wr) signals. Note also that all data going into and out of the local RAM has to pass through the input_word and output_word registers.
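The addressing scheme above (four 32-bit locations per field element, 7-bit addresses) can be illustrated with a small model. The helper names and the least-significant-word-first ordering are assumptions made for this sketch; the text does not specify how an element is split across its four locations.

```python
WORD_BITS = 32       # width of one RAM location
WORDS_PER_VAR = 4    # locations reserved per field element
NUM_WORDS = 128      # total locations, addressed by a 7-bit bus
NUM_VARS = NUM_WORDS // WORDS_PER_VAR  # 32 field elements


def var_base_address(index):
    """Base address of variable `index` (hypothetical helper)."""
    if not 0 <= index < NUM_VARS:
        raise ValueError("variable index out of range")
    return index * WORDS_PER_VAR


def pack(value):
    """Split a value of up to 128 bits into four 32-bit words.
    ASSUMPTION: least significant word first."""
    mask = (1 << WORD_BITS) - 1
    return [(value >> (WORD_BITS * i)) & mask for i in range(WORDS_PER_VAR)]


def unpack(words):
    """Inverse of pack: rebuild the value from its four words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


ram = [0] * NUM_WORDS
element = (1 << 82) | 0xABCDEF
base = var_base_address(3)
ram[base:base + WORDS_PER_VAR] = pack(element)
restored = unpack(ram[base:base + WORDS_PER_VAR])
```

Aligning every variable to a multiple of four means the upper five address bits select the variable and the lower two bits select the word within it.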
D. Coprocessor's controller
The controller takes care of reading the instructions and addressing different locations of the local storage. It also controls the datapath elements in order to implement the microcode instructions.
IV. INSTRUCTION SET
There are two basic types of instructions for the proposed coprocessor: single and microcode instructions, described as follows.
A. Single instructions
Table III shows the single instructions and their definitions. These instructions are used to transfer data between the coprocessor and the 8051, load and store data to the RAM through the input_word and output_word registers, perform single operations using each of the adders and multipliers, and move the contents of the registers in the coprocessor's datapath.

TABLE III. SINGLE INSTRUCTIONS
Instruction      Definition
Do_mult1         Runs the first multiplier (Mult1): C1 = A1 * B1
Do_add1          Runs the first adder (Add1): C1 = A1 + B1
Do_mult2         Runs the second multiplier (Mult2): C2 = A2 * B2
Do_add2          Runs the second adder (Add2): C2 = A2 + B2
Outword_to_A1    Moves the content of output_word to A1
Outword_to_B1    Moves the content of output_word to B1
Outword_to_D1    Moves the content of output_word to D1
C1_to_inword     Moves the content of C1 to input_word
Outword_to_A2    Moves the content of output_word to A2
Outword_to_B2    Moves the content of output_word to B2
Outword_to_D2    Moves the content of output_word to D2
C2_to_inword     Moves the content of C2 to input_word
C1_to_B1         Moves the content of C1 to B1
D1_to_A1         Moves the content of D1 to A1
C2_to_B2         Moves the content of C2 to B2
D2_to_A2         Moves the content of D2 to A2
C1_to_A2         Moves the content of C1 to A2
C2_to_A1         Moves the content of C2 to A1

B. Microcode instructions
The microcode instructions are the main instructions that are used to implement the divisor addition and doubling algorithms. These instructions implement a combination of Galois Field additions and multiplications with multiple input operands. Table IV lists these instructions, their definitions, and their microcode implementations. Any line of the divisor doubling or addition schedules in Tables I and II can be implemented by one of these microcode instructions. It should be noted that before calling these instructions, the input operands (values of A1, B1, D1, A2, B2, and D2) are loaded from the RAM using the single instructions (Read_from_RAM, Outword_to_A1, Outword_to_B1, Outword_to_D1, Outword_to_A2, Outword_to_B2, and Outword_to_D2). After any of the microcode instructions has run, the results (registers C1 and C2) are stored back into the RAM.
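The way a microcode instruction can be composed from the single instructions may be sketched with a small register-level model. The `Datapath` class and the `mul_add_pair` sequence below are hypothetical: the latter computes C1 = A1*B1 + D1 and C2 = A2*B2 + D2 using only the register moves named in the single-instruction list, and it is not claimed to be one of the paper's actual microcode instructions. The reduction polynomial is likewise an assumption, and the two multipliers are modelled sequentially although the hardware runs them concurrently.

```python
M = 83
# ASSUMPTION: placeholder reduction polynomial, not given in the text.
POLY = (1 << 83) | (1 << 7) | (1 << 4) | (1 << 2) | 1


def gf_mul(a, b):
    """Bit-serial GF(2^83) multiplication modulo the assumed POLY."""
    acc = 0
    for i in reversed(range(M)):
        acc <<= 1
        if acc >> M:
            acc ^= POLY
        if (b >> i) & 1:
            acc ^= a
    return acc


class Datapath:
    """Register-level model of the dual-multiplier/dual-adder datapath."""

    def __init__(self):
        names = ("A1", "B1", "D1", "C1", "A2", "B2", "D2", "C2")
        self.reg = {name: 0 for name in names}

    def do_mult1(self):
        self.reg["C1"] = gf_mul(self.reg["A1"], self.reg["B1"])

    def do_mult2(self):
        self.reg["C2"] = gf_mul(self.reg["A2"], self.reg["B2"])

    def do_add1(self):
        self.reg["C1"] = self.reg["A1"] ^ self.reg["B1"]

    def do_add2(self):
        self.reg["C2"] = self.reg["A2"] ^ self.reg["B2"]

    def move(self, src, dst):
        """Models register moves such as C1_to_B1 or D1_to_A1."""
        self.reg[dst] = self.reg[src]


def mul_add_pair(dp):
    """Hypothetical microcode: C1 = A1*B1 + D1 and C2 = A2*B2 + D2."""
    dp.do_mult1()
    dp.do_mult2()
    dp.move("C1", "B1")   # C1_to_B1
    dp.move("D1", "A1")   # D1_to_A1
    dp.move("C2", "B2")   # C2_to_B2
    dp.move("D2", "A2")   # D2_to_A2
    dp.do_add1()
    dp.do_add2()


dp = Datapath()
dp.reg.update(A1=3, B1=5, D1=7, A2=2, B2=1 << 82, D2=1)
mul_add_pair(dp)
```

Keeping the intermediate product in the datapath registers, rather than writing it to RAM and reading it back, is what lets one microcode instruction cover a whole line of a doubling or addition schedule.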
C. Programming from 8051 Micro-controller
In order to program the microcode coprocessor from the 8051, the appropriate instructions (single or microcode) are written to the ports of the 8051 microcontroller. Figure 3 shows an example that implements step 3 of the doubling algorithm (compute k) shown in Table I. Each line of the program in Figure 3 puts the required binary opcode on the ports of the 8051 (P0-P3). This is done in 8051 assembly code.
V. PERFORMANCE RESULTS
The proposed HW/SW co-design of the HECC system was implemented and co-simulated using GEZEL [19]. GEZEL is a design environment for the exploration of domain-specific coprocessor and multiprocessor micro-architectures, which can provide cycle-true HW/SW co-simulation with various embedded core instruction set simulators (ISS). In our application, we used the Dalton 8051 ISS to perform cycle-accurate simulation. The microcode coprocessor is designed in the GEZEL hardware description language, which is based on an FSMD (finite state machine plus datapath) system model. The coprocessor is attached to the input/output ports of the 8051 ISS within the GEZEL design environment, where timing and functional verification are performed. In the end, the GEZEL code was automatically converted to RTL VHDL and synthesized for FPGA.
The detailed timings of different parts of the HECC co-design implementation are presented in Table V. The delay is given in terms of number of cycles and in msec at the 12 MHz clock frequency of the 8051. Sizes of RAM and ROM are given in bytes. Table VI compares the performance of the scalar multiplication of the presented HECC system with related work. Our 83-bit HECC system takes 7.8 M cycles of the 8051 microcontroller, which results in a total delay of 656 msec at a 12 MHz clock frequency. This implementation is more than 228 times faster than the pure software implementation of HECC on the 8051 and is 7 times faster than the 160-bit ECC implementation on the 8051 reported in [11]. Moreover, for the comparison with the 80-bit HECC implementations on the ARM7, the number of clock cycles is a better metric, because the ARM7 is clocked at 80 MHz. In terms of number of clock cycles, our design is of the same order as [10] and around 4 times faster than [3].
VI. CONCLUSIONS
This paper presented a microcode crypto coprocessor that is designed to accelerate the Hyperelliptic Curve scalar multiplication using the 8051 microcontroller. The microcode coprocessor is capable of performing combinations of GF(2^83) operations. The divisor addition and doubling operations are implemented using SW routines based on the coprocessor's microcode instructions. The scalar multiplication is developed in C and compiled into 8051 assembly instructions. A total delay of 656 msec (7.8 Mcycles) was achieved for the 83-bit HECC scalar multiplication at 12 MHz.
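The relation between the reported delay and cycle count can be checked with a short calculation. The variable names are illustrative, and the software-only figure is only a lower bound inferred from the stated speed-up factor of more than 228; it is not a measurement quoted from the tables.

```python
CLOCK_HZ = 12_000_000   # 8051 clock frequency from the text
DELAY_MS = 656          # reported scalar multiplication delay
SPEEDUP = 228           # reported lower bound on the speed-up factor

# Cycles = delay * frequency, kept in integer arithmetic.
cycles = DELAY_MS * CLOCK_HZ // 1000

# Inferred lower bound for the software-only delay, in msec.
software_only_ms = DELAY_MS * SPEEDUP

print(f"{cycles / 1e6:.2f} M cycles")
print(f"software-only: at least {software_only_ms / 1000:.1f} s")
```

The computed cycle count of 7,872,000 agrees with the 7.87 M cycles quoted in the abstract and, truncated to one decimal place, with the 7.8 M cycles quoted in the results.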
