Abstract. In this paper we describe a high performance, area-efficient implementation of Hyperelliptic Curve Cryptosystems over GF(2 m ). A compact Arithmetic Logic Unit (ALU) is proposed to perform multiplication and inversion. With this ALU, we show that divisor multiplication using affine coordinates can be efficiently supported. Besides, the required throughput of memory or Register File (RF) is reduced so that area of memory/RF is reduced. We choose hyperelliptic curves using the parameters h(x) = x and f (x) = x 5 + f3x 3 + x 2 + f0. The performance of this coprocessor is substantially better than all previously reported FPGA-based implementations. The coprocessor for HECC over GF(2 83 ) uses 2316 slices and 2016 bits of Block RAM on Xilinx Virtex-II FPGA, and finishes one scalar multiplication in 311 µs.
Introduction
Public-Key Cryptography (PKC) [10] , introduced in the mid 70's by Diffie and Hellman, ensures a secure communication over an insecure network without prior key agreement. PKC is widely used for digital signatures, key agreement and data encryption. The best-known and most commonly used public-key cryptosystems are RSA [26] and Elliptic Curve Cryptography (ECC) [23, 19] , but recently HyperElliptic Curve Cryptography (HECC) [20] is catching up. The main benefit for curve-based cryptography e.g. ECC and HECC is that they offer equivalent security as RSA for much smaller parameter sizes. The advantages result in smaller data-paths, less memory and lower power consumption.
Implementing HECC on a resource-constrained platform has been a challenge in both area and performance. Over the past few years, HECC have been implemented in both software [25, 27] and hardware [4, 7, 15, 11] . However, the implementations so far failed in reaching the performance of ECC implementations with comparable hardware cost. Table 1 compares the computational complexity of point/divisor operations in ECC and HECC as in [2] . Here I, M and S denote modular inversion, multiplication and squaring, respectively. Note that Table 1 is not exhaustive, and a comprehensive description of different coordinates as well as their computational complexity can be found in [2] . In addition, state of the art regarding various types of coordinates for all types of curve-based cryptosystem can be found in [9] . For example, ECC over GF (2 163 ) and HECC over GF (2 83 ) are supposed to offer equivalent security as 1024-bit RSA [2] . Using projective coordinates, one EC Point Addition (PA) requires 15 multiplications and 3 squarings in GF (2 163 ), while one HEC Divisor Addition (DA) requires 49 multiplications and 4 squarings in GF (2 83 ), which is much more complex even with parameters of half bit-lengths. In order to speed up HECC implementations, parallel multipliers [4, 7] or inverters [15] were used. As a result, an ALU becomes large in the area. In order to efficiently feed data to parallel multipliers and inverters, a high-throughput Register File (RF) with an additional control logic i.e. a MUX array connected to ALU is required. This adds even more area to implementations. In this paper, we describe a compact HECC coprocessor on an FPGA platform. The coprocessor utilizes a unified multiplier/inverter, which supports both multiplication and inversion. This architecture brings three main advantages. First, the fast inverter makes affine coordinates very efficient. Second, as the multiplier and inverter share partial data-path, it is much smaller in area compared to previous implementations. Third, using only one multiplier/inverter, the required throughput of Memory or RF is comparably low. Therefore we can reduce the area of the memory. Note that the architecture proposed here for FPGA design can also lead to an area-efficient design in ASICs. The coprocessor was synthesized with Xilinx ISE8.1i. On Virtex-II FPGA (XC2V4000), this coprocessor finishes one scalar multiplication of HECC over GF (2 83 ) in 311 µs using 2316 slices and 2016 bits memory. To the best of our knowledge, this implementation is faster than all proposed FPGA-based implementations of HECC, while the area is much smaller than that of the fastest reported implementation [15] .
The rest of the paper is organized as follows. Section 2 gives a brief introduction on the previous work. Section 3 describes the mathematical background of HECC and field arithmetic. Section 4 describes the architecture of the proposed HECC coprocessor. In Sect. 5 we show the implementation results. We conclude the paper and give some future work in Sect. 6. In 2001, Wollinger described the first hardware architecture for HECC implementations using Cantor's algorithm [6] in his thesis [32] . However, the architecture was only outlined. The first complete hardware implementation of HECC was presented in [4] . It is also based on Cantor's algorithm, but with improvement on the calculation of Greatest Common Divisor (GCD). This implementation, using 16600 slices on Xilinx Virtex II FPGA, supports a genus-2 HEC over GF (2 113 ). One scalar multiplication takes 20.2 ms on this coprocessor running at 45MHz. This work was further improved in [7] .
In 2002, Lange generalized the explicit formulae for HECC over finite fields with arbitrary characteristic [21] . This was first implemented on 32-bit embedded processors (ARM7TDMI and PowerPC) in [25] . The inversion in this algorithm was performed with Extended Euclidean Algorithm (EEA). The first hardware implementation of HECC using explicit formulae was described in [12] . Further improvement by using mixed coordinates and simplified curves were proposed in [11] . In [11] the coprocessor, running at 45.3MHz, deploys 25272 slices on Xilinx Virtex II FPGA. With this implementation 2.03 ms is required to perform one scalar multiplication of HECC over GF (2 113 ). There are some ASIC implementations of HECC using projective coordinates. For example, Sakiyama proposed a HECC coprocessor [28] using 0.13-µm CMOS technology. The coprocessor runs at 500 MHz, and can perform one scalar multiplication of HECC over GF (2 83 ) in 63 µs. The first hardware implementations of HECC using affine version of explicit formulae were described in [31] , which described so far the fastest FPGA-based HECC coprocessor. This coprocessor uses three modular multipliers and two modular inverters. It uses 7785 slices on Xilinx Virtex II FPGA(XC2V4000), and can reach a clock frequency of 56.7MHz. One scalar multiplication of HECC over GF (2 81 ) takes 415 µs.
Mathematical Background

Hyperelliptic curve cryptography
Hyperelliptic curves are a special class of algebraic curves; they can be viewed as generalization of elliptic curves. Namely, a hyperelliptic curve of genus g = 1 is an elliptic curve, while in general, hyperelliptic curves can be of any genus g ≥ 1.
Let GF(2 m ) be an algebraic closure of the field GF(2 m ). Here we consider a hyperelliptic curve C of genus g = 2 over GF(2 m ), which is given with an equation of the form:
where
is a polynomial of degree at most g (deg(h) ≤ g) and f (x) is a monic polynomial of degree 2g + 1 (deg(f ) = 2g + 1). Also, there are no solutions (x, y) ∈ GF (2 m ) × GF (2 m ) which simultaneously satisfy the equation (1) and the equations:
These points are called singular points. For the genus 2, in the general case the following equation is used
A divisor D is a formal sum of points on the hyperelliptic curve C i.e. D = m P P and its degree is degD = m P . Let Div denotes the group of all divisors on C and Div 0 the subgroup of Div of all divisors with degree zero. The Jacobian J of the curve C is defined as quotient group J = Div 0 /P . Here P is the set of all principal divisors, where a divisor D is called principal if D = div(f ), for some element f of the function field of C (div(f ) = P ∈C ord P (f )P ). The discrete logarithm problem in the Jacobian is the basis of security for HECC. In practice, the Mumford representation according to which each divisor is represented as a pair of polynomials [u, v] is usually used. Here, u is monic of degree 2, degv < degu and u|f − hv − v 2 (so-called reduced divisors). For implementations of HECC, we need to implement the multiplication of elements of the Jacobian i.e. divisors with some scalar.
The main operation in any hyperelliptic curve based primitive is scalar multiplication, i.e. mD where m is an integer and D is a reduced divisor in the Jacobian of some hyperelliptic curve C. The first algorithm for arithmetic in the Jacobian is due to Cantor [6] . However, until "explicit formulae" were invented, the HECC was not considered a suitable alternative to EC based cryptosystems. For geni 2 and 3, there was some substantial work on the formulae and algorithms for computing the group law on the Jacobian have been optimized. Algorithms for the group operation for the case of genus 2 hyperelliptic curves, which we used are due to Lange [22] .
The main operation in any curve-based primitive (ECC or HECC) is the scalar multiplication. Looking at the arithmetic for both ECC/HECC the only difference between ECC and HECC is in the group operations. On this level both ciphers consist of different sequences of operations. Those for HECC are more complex when compared with the ECC point operation, but they use shorter operands. The divisor scalar multiplication is achieved by repeated divisor addition and doubling. Many techniques that help to speed up ECC scalar multiplication are also applicable to HECC. For example, using Non-Adjacent Form (NAF) for scalar representation or window method can also improve HECC performance.
Field Arithmetic
An element α in GF(2 m ) can be represented as a polynomial
where ⊕ is XOR operation.
Multiplication In the literature there are various algorithms and architectures [3, 30] proposed for modular multiplication in GF(2 m ). The bit-serial algorithms can be classified into two categories, the Most Significant Bit (MSB) first algorithms and the Least Significant Bit (LSB) first algorithms. It is important to point out that LSB-first bit-serial multiplier has shorter critical path than MSB-first bit-serial multipliers [3] . In this paper, we use the LSB-first algorithm.
Compared with the other modular operations, modular inversion is considered as a computationally expensive operation. The most commonly used methods to perform the modular inversion are based on Fermat's little theorem [1] , Extended Euclidean Algorithm [18] and Gaussian elimination [16] . EEA is widely used to perform inversion in practice.
The schoolbook EEA-based inversion algorithm in GF(2 m ) is commonly considered inefficient due to the long polynomial division in each iteration. This problem was partially solved by replacing degree comparison with a counter [5] .
In [34] , Yan et al. proposed a modified inversion algorithm based on the EEA. Algorithm 2 shows this inversion algorithm. Here we use S i (x) to denote the value of S(x) after i th iteration, and d
The complement of C 1 is represented asC 1 . Unlike many other EEA variants [14, 5, 18] , this algorithm has no modular operations, thus a short critical path delay can be easily achieved. Besides, with a fixed number of iterations, it is more secure against side-channel analysis.
HECC coprocessor architecture
In this section we describe a compact coprocessor architecture for HECC over GF(2 m ). Two main approaches are used to reduce the area: using compact ALU and reducing memory area. First, we propose a unified digit-serial modular multiplier/inverter, which enables a small ALU. Second, we investigate the characterAlgorithm 2 EEA-Based Inversion Algorithm [34] Input: irreducible binary polynomial P (x) with deg(P (x)) = m, polynomial A(x) with deg(A(x)) < m.
4: end for Return:
istics of the ALU, and reduce area of memory block as well as its interconnecting network.
Modular Multiplier
As shown in Algorithm 1, the main operation in LSB-first multiplication is (bA(x) + C(x)), which can be performed by a row of AND gates and XOR gates shown in Figure 1 (a). Figure 1(b) shows the architecture of a LSB-first bit-serial multiplier. Two (m + 1)-bit registers are used to hold the parameter P (x), A(x) and two m-bit registers to hold B(x) and the partial product C(x). Note that B(x) is shifted to right by one bit in each clock cycle. Here (a m P (x) + A(x)) and (b 0 A(x) + C(x)) is performed on the left and right side, respectively. If low Hamming weight irreducible polynomials are used, the AND-XOR cell on the left side can be simplified. For example, using P (x) = x 83 + x 7 + x 4 + x 2 + 1, only 4 AND gates and 4 XOR gates are required to perform (a m P (x) + A(x)).
It is clear that the critical path delay is T AND + T XOR , where T AND and T XOR denote the delay of a 2-input AND and XOR gate, respectively. One multiplication in GF(2 m ) takes m clock cycles on this bit-serial multiplier. 
Unified Modular Inverter and Multiplier
We propose a unified architecture which can perform both multiplication and inversion. In [8] , Daly et al. have proposed a unified ALU for GF(p). It can perform addition, subtraction, multiplication and inversion. Compared with this ALU, our unified inverter/multiplier in GF(2 m ) has a shorter critical path delay, and can be implemented in a digit-serial manner to achieve a higher throughput. Figure 2 shows the data-path of our proposed bit-serial inverter and multiplier. It realizes both Algorithm 1 and Algorithm 2. The multiplier and the inverter share one AND-XOR cell and three registers. The critical path delay is 2T MUX . Here T MUX denotes the delay of a 2-input multiplexer. This multiplier/inverter finishes one inversion operation in GF(2 m ) in (2m − 1) clock cycles. This data-path supports the following operations:
1. Modular Multiplication.
-
Compact digit-serial Inverter/Multiplier for HECC
In order to achieve higher throughput, a digit-serial inverter/multiplier can be implemented with multiple bit-serial multiplication and inversion units. We propose a flexible architecture which allows us to explore the trade-off between performance and hardware cost. Given a constant w 2 , increasing w 1 will reduce the number of clock cycles required by one inversion. However, it will increase the area as well as the critical path delay. As a result, the multiplication will be slowed down slightly. Therefore, w 1 /w 2 can be chosen for different design targets such as high performance, low hardware cost or smallest area-time product. Theoretical exploration for optimal (w 1 , w 2 ) for a specific design target is out of the scope of this paper. Table 2 shows the performance and area of the proposed ALU with different configurations. Here Xilinx Virtex II (XC2V4000) FPGA is used. In this HECC implementation we choose w 1 = 3 and w 2 = 14 as the best performance/area trade-off for this architecture. With this configuration, one multiplication and one inversion in GF(2 83 ) take 47.9 and 439 ns, respectively.
Memory/RF analysis
Besides ALU, memory/RF is another main component that decides the overall area and performance of a coprocessor. The size, throughput and delay of memory/RF must be chosen according to the requirement of the ALU. We analyze different design strategies of HECC coprocessor here. Both memory and RF have their own advantages and disadvantages. While registers are larger than memory of the same capacity, memory usually has one clock delay in read operation. This delay may cause performance degradation when multiple data-path work in a pipelining mode, see [31] . Thus, HECC coprocessors using multiple data-path [7, 33, 31] require an efficient register file to feed data to parallel multipliers and inverters. The register file and its interconnecting network make a big part of the whole area.
The area of memory/RF is dependent on the size and throughput [24, 29] . Higher throughput results in a more complex decoder and a larger interconnecting network, which cause the area increase. Thus, reducing the memory/RF throughput reduces the area. Table 3 shows the required memory/RF throughput of different ALUs. Note that here we use GF (2 83 ) for all the ALUs, D denotes the delay of multiplication. For example, when using three multipliers, the ALU reads 6 operands from memory/RF and writes 3 data back. In [33, 31] , 3 clock cycles are required for one multiplication. If each operand is 84-bit, then the ALU needs to read 168 bits in each clock cycle. The proposed multiplier/inverter shown in Figure 3 requires 56-bit read and 14-bit write in each clock cycles. The required memory throughput is much smaller than that in [33] and [31] .
Coprocessor architecture
The HECC Coprocessor is shown in Figure 4 . It contains an Instruction ROM, a main controller and a unified modular multiplier/inverter. The Instruction ROM Here one Add instruction takes two cycles. As w 1 = 3, one Inv instruction takes 55 clock cycles. One Mul instruction takes 6 clock cycles. One Mac instruction consists of one Mul and two Add instructions. However, it takes also 6 clock cycles. This is because fetching and adding data Rd and Re are performed during the multiplication. Two Add and one Mul instructions cause 6 operand fetches and 3 result stores, while one Mac instruction requires only 4 operand fetches and one result store. Therefore, the use of Mac instruction reduces the number of memory access and speeds up the scalar multiplication.
In this implementation, we choose hyperelliptic curves with the following parameters: h(x) = x and f (x) = x 5 + f 3 x 3 + x 2 + f 0 . One DA operation consists of 36 instructions, which include 11 Add, 24 Mac and 1 Inv instructions. One DD operation consists of 14 instructions, which includes 2 Add, 11 Mac and 1 Inv instructions. Note that the architecture of the coprocessor can be slightly modified so that it can be integrated into a SoC where memory is shared. The required throughput of memory needs to be further reduced. In the InsRom Mac instruction needs to be replaced by a Mul and two Add instructions, thus only two instead of four operands need to be loaded for each instruction. In this case, the required throughput of memory is 2 * 84 6 = 28 bits, the amount that a 32-bit dual-port SRAM is able to offer. However, the add instruction requires 6 instead of 2 clock cycles, which slightly degrades the performance of the coprocessor.
Implementation Results
In order to check the area and performance of the proposed coprocessor, we implemented the architecture from Figure 4 on a Xilinx Virtex-II (XC2V4000) FPGA. The coprocessor is described with Gezel [13] language and synthesized with Xilinx ISE8.1. It uses 2316 slices and 6 Block RAMs. A clock frequency of 125 MHz can be reached. Table 4 compares the area and performance with previous FPGA-based implementations of HECC in GF(2 m ). The proposed HECC coprocessor in [7] uses Cantor's method to perform divisor addition and doubling. It has two modular multipliers, one inverter, one GCD module and several other logics. Register file is connected to the datapath with MUX arrays. When supporting HECC in GF (2 83 ), it uses 22000 slices on Xilinx Virtex-II FPGA and can finish one scalar multiplication in 10 ms.
The proposed HECC coprocessor in [11] uses the mixed coordinates of explicit formulae proposed in [21] . The ALU contains three modules, namely divisor addition module, divisor doubling module and coordinates conversion module.
Each of them has four field multipliers, while only the coordinates conversion module has a inverter. It supports Right-to-Left binary expansion method, which scans the key from LSB to MSB, and can perform divisor addition and doubling in parallel. It also supports NAF method. Here we list the performance of scalar multiplication using NAF method as it is slightly faster than the binary method.
The HECC coprocessor proposed in [17] uses projective coordinates, and a superscalar architecture is used to support parallel field operations. Several digit-serial (w = 12) multipliers are used. Our coprocessor, using one unified multiplier/inverter, is faster than the coprocessor in [17] that uses three multipliers. The architectures proposed in [31] , however, uses affine coordinates of the explicit formulae. Three different architectures ranging from high speed to low hardware cost are proposed. For the high speed version, with three multipliers and two inverters, only 415 µs is required to finish one scalar multiplication. The area of the coprocessor is also much smaller than that of [7, 11] . The area can be further reduced to 3955 slices but, in that case it requires 831 µs for one scalar multiplication.
Compared with all the previous FPGA-based implementations our implementation has the best performance, to the best of our knowledge. The area reduction is attributed to the use of compact ALU and the reduction of the memory throughput. The ALU in [31] contains two multipliers and one inverter, which in total use 2427 slices. The ALU used in this paper requires only 1500 slices. The performance gain is mainly due to the efficient inverter. When running at 56.7 MHz, the inverter in [31] requires 1570 ns in average for one inversion in GF (2 81 ), while the proposed ALU finishes one inversion in GF(2 83 ) in 439 ns. Though we use only one multiplier, which is also slower than the one in [31] , the overall performance of divisor addition/doubling is better.
Conclusions
We describe a compact architecture for HECC over binary extension field. This architecture uses a unified modular multiplier/inverter, and reduces the throughput of the memory. Thus, the area of the coprocessor is largely reduced. On a Xilinx Virtex II (XC2V4000) FPGA, the proposed coprocessor takes 311 µs to finish one scalar multiplication in HECC over GF (2 83 ). The proposed implementation can be further speeded up by exploring instruction level parallelism. Besides, if more space is available in the data memory, precomputation can be used to drastically improve the performance.
