This paper describes two novel architectures for a unified multiplier and inverter (UMI) in $GF(2^m)$: the UMI merges multiplier and inverter into one unified data-path. As such, the area of the data-path is reduced. We present two options for hyperelliptic curve cryptography (HECC) using UMIs: an FPGA-based high-performance implementation (Type-I) and an ASIC-based lightweight implementation (Type-II). The use of a UMI combined with affine coordinates brings a smaller data-path, smaller memory and faster scalar multiplication.
Introduction
Hyperelliptic curve cryptography (HECC) [22] is an important candidate for public-key cryptography [13]. Like elliptic curve cryptography (ECC) [27, 21], it uses smaller parameter sizes than RSA [29] for an equivalent security level. In constrained devices, HECC enables valuable optimizations in area and speed.
Implementing HECC on a resource-constrained platform is a challenge for both area and performance. Table 1 describes the computational complexity of divisor operations in different coordinate systems [5]. Here I, M and S denote modular inversion, multiplication and squaring, respectively. Over the past few years, HECC has been implemented in both software [28, 30, 3, 4, 6] and hardware [8, 11, 17, 14]. Table 2 summarizes previous FPGA-based HECC implementations.
In 2001, Wollinger described the first hardware architecture for HECC implementations [36] . However, the architecture was only outlined. The first complete hardware implementation of HECC was presented in [8] . This work was improved by Clancy [11] . All of them use Cantor's algorithm [10] for divisor addition and doubling.
In 2002, Lange generalized the explicit formulae for HECC over finite fields with arbitrary characteristic [23]. The first hardware implementation of HECC using explicit formulae was described in [15]. An improved version is proposed in [14]. The first hardware implementation using the affine version of explicit formulae was described in [35], which presents the fastest FPGA-based HECC coprocessor to date.
Some ASIC implementations of HECC using projective coordinates have also been proposed. For example, Sakiyama proposed an HECC coprocessor [31] using 130 nm CMOS technology. The coprocessor is able to run at 500 MHz, and it can perform one scalar multiplication of HECC over $GF(2^{83})$ in 63 ms.
Previous HECC implementations often use multiple multipliers or inverters to speed up the scalar multiplication. Commonly, an architecture shown in Fig. 1 is used. The use of multiple multipliers in parallel demands a high-throughput memory and a complex data bus, which result in further area increase. In this paper, we explore the power of a unified multiplier and inverter (UMI) for area reduction and performance improvement. We consider the architecture shown in Fig. 2 more area-efficient. We show that the use of a UMI brings three main advantages. First of all, the fast inverter makes affine coordinates very efficient, thus the performance is improved. Secondly, it reduces the area of the data-path. Thirdly, using only one data-path simplifies the data-bus and reduces the size of memory.
Our contributions: The contributions of this paper are threefold. Firstly, we propose two novel architectures for a unified multiplier and inverter. We show that merging an inverter into a multiplier results in a substantial area reduction. Secondly, we propose a digit-serial UMI architecture and describe a method to adjust the digit-size. We use this method to explore the best area-time trade-off for HECC. Thirdly, using the proposed UMIs, we implement two HECC processors: a high-throughput design that outperforms all previous FPGA-based HECC implementations and a lightweight design that is suitable for passive RFID tags.
The rest of the paper is organized as follows. Section 2 gives a brief introduction to previous work, including the mathematical background of HECC and the field arithmetic. Section 3 introduces the unified multiplier and inverter. Sections 4 and 5 describe the FPGA-based, high-throughput architecture and the ASIC-based, lightweight architecture for HECC coprocessors, respectively. We conclude the paper in Section 6.
Previous work

Hyperelliptic curve cryptography
Hyperelliptic curves are a special class of algebraic curves; they can be viewed as a generalization of elliptic curves. Namely, a hyperelliptic curve of genus $g = 1$ is an elliptic curve, while in general, hyperelliptic curves can be of any genus $g \geq 1$. However, not all genera are used for cryptography.
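Let $\overline{GF(2^m)}$ be the algebraic closure of the field $GF(2^m)$. Here we consider a hyperelliptic curve $C$ of genus $g = 2$ over $GF(2^m)$, which is given by an equation of the form:
$$C: y^2 + h(x)\,y = f(x),$$
where $h(x) \in GF(2^m)[x]$ is a polynomial of degree at most $g$ ($\deg(h) \leq g$) and $f(x)$ is a monic polynomial of degree $2g + 1$ ($\deg(f) = 2g + 1$). Also, there are no solutions $(x, y) \in \overline{GF(2^m)} \times \overline{GF(2^m)}$ that simultaneously satisfy the curve equation and both of its partial derivative equations, i.e., the curve is non-singular.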
A divisor $D$ is a formal sum of points on the hyperelliptic curve $C$, i.e., $D = \sum m_P P$, where $P$ is a point on $C$, $m_P$ is an integer and $m_P = 0$ for almost all $P$. The degree of $D$ is defined as $\deg D = \sum m_P$. Let $\mathrm{Div}$ denote the group of all divisors on $C$ and $\mathrm{Div}^0$ the subgroup of $\mathrm{Div}$ of all divisors with degree zero. The Jacobian $J$ of the curve $C$ is defined as the quotient group $J = \mathrm{Div}^0 / R$, where $R$ is the set of all principal divisors. A divisor $D$ is called principal if $D = \mathrm{div}(f)$ for some element $f$ of the function field of $C$ ($\mathrm{div}(f) = \sum_{P \in C} \mathrm{ord}_P(f)\,P$). The discrete logarithm problem in the Jacobian is the basis of security for HECC. In practice, the Mumford representation, according to which each divisor is represented as a pair of polynomials $[u, v]$, is commonly used. Here, $u$ is monic and $[u, v]$ satisfy $\deg(u) \leq 2$, $\deg(v) < \deg(u)$ and $u \mid f - hv - v^2$ (so-called reduced divisors).
The main operation in any hyperelliptic curve based primitive is scalar multiplication, i.e., $mD$ where $m$ is an integer and $D$ is a reduced divisor in the Jacobian of $C$. The first algorithm for arithmetic in the Jacobian is due to Cantor [10]. However, until "explicit formulae" were introduced, HECC was not considered a suitable alternative to elliptic curve based cryptosystems. For genera 2 and 3, there has been substantial work on the formulae, and the algorithms for computing the group law on the Jacobian have been optimized. The algorithms we use for divisor operations are due to Lange and Stevens [25].
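To illustrate how a scalar multiplication decomposes into divisor doublings and divisor additions, the following minimal sketch implements the left-to-right binary method over a generic group. It is an illustration under stated assumptions, not the processor's control flow: the name `scalar_multiply` and its callback parameters are hypothetical, and the usage below substitutes integers modulo a small prime for the Jacobian, so it shows only the sequence and count of group operations.

```python
def scalar_multiply(k, d, add, double, zero):
    """Left-to-right binary method: returns (k*d, #doublings, #additions).

    `add`, `double` and `zero` describe an arbitrary additive group; they are
    placeholders for divisor addition, divisor doubling and the neutral element.
    """
    result = zero
    n_doublings = 0
    n_additions = 0
    for bit in bin(k)[2:]:          # scan the scalar from its most significant bit
        result = double(result)
        n_doublings += 1
        if bit == "1":
            result = add(result, d)
            n_additions += 1
    return result, n_doublings, n_additions


# Stand-in group (assumption for illustration): integers modulo a small prime.
N = 1009
result, n_dd, n_da = scalar_multiply(
    45, 7,
    add=lambda a, b: (a + b) % N,
    double=lambda a: (2 * a) % N,
    zero=0,
)
```

With this method the number of doublings equals the bit length of the scalar, and the number of additions equals the number of 1-bits in it.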
Field arithmetic
An element $a$ in $GF(2^m)$ is represented as a polynomial $A(x) = \sum_{i=0}^{m-1} a_i x^i$, where $a_i \in GF(2)$. For the sake of simplicity, we use capital letters to denote polynomials, and small letters with subscripts to denote their coefficients. For example, $A$ stands for $A(x)$, and $a_0$ is the least significant bit of $A$.
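As a small illustration of this representation, the sketch below stores the coefficients of a polynomial over GF(2) in the bits of an integer, so that bit i holds the coefficient of x^i. The helper names (`coeff`, `gf2m_add`, `to_poly_string`) are hypothetical and chosen only for this example.

```python
def coeff(a: int, i: int) -> int:
    """Return a_i, the coefficient of x^i (bit i of the integer a)."""
    return (a >> i) & 1


def gf2m_add(a: int, b: int) -> int:
    """Add two field elements: coefficient-wise addition modulo 2 is XOR."""
    return a ^ b


def to_poly_string(a: int) -> str:
    """Render the bit vector as a polynomial, highest degree first."""
    terms = []
    for i in range(a.bit_length() - 1, -1, -1):
        if coeff(a, i):
            terms.append("1" if i == 0 else "x" if i == 1 else f"x^{i}")
    return " + ".join(terms) if terms else "0"


A = 0b1011                    # x^3 + x + 1
B = 0b0110                    # x^2 + x
SUM = gf2m_add(A, B)          # x^3 + x^2 + 1
```

Because addition is a plain XOR, it needs no carry chain and no modular reduction, which is why only multiplication and inversion require dedicated data-paths.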
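Table 1. Computational complexity of divisor operations in different coordinate systems (only the last entry, I + 5M + 6S (a), and the table notes survive).
Note: This table is not exhaustive. State-of-the-art formulae can be found in [5, 12].
(a) This fast doubling formula only works for curves with deg(h) = 1.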
Algorithm 1. MSB-first multiplication [7].
Input: Polynomials A, B and P.
Output: R = AB mod P.
...

Algorithm 2. Left-shift inversion [20].
Input: Polynomials A and P.
...
4: if c = 1 then
5:   ...
6: else
7:   ...
8: end if
9: if c = 1 then
10:   ...
11: else
12:   ...
13: end if
14: end for
Return: R ← H_{2m}.

Algorithm 3. LSB-first multiplication [7].
Input: Polynomials A, B and P.
Output: R = AB mod P.
1: ...
2: for i = 0 to m − 1 do
3:   ...
...

Algorithm 4. Right-shift inversion [37].
...
4: if c_2 = 1 then
5:   ...
   end if
...
Multiplication
In the literature there are various algorithms and architectures proposed for modular multiplication in $GF(2^m)$ [7, 34]. The bit-serial algorithms can be classified into two categories, the most significant bit (MSB) first algorithms and the least significant bit (LSB) first algorithms. Algorithms 1 and 3 show an MSB-first and an LSB-first multiplication algorithm, respectively. Here we use $C_i$ to denote the value of $C$ after the $i$th iteration, and $b_i$ the $i$th coefficient of $B$.
The MSB-first multiplication scans $B$ from the MSB side. In each iteration, $b_i A$ is added to $C$, which is then shifted to the left and reduced. The LSB-first multiplication scans $B$ from the LSB side. In each iteration, $T$ is updated to $xT$, and $b_i T$ is accumulated in $C$. LSB-first multipliers update $T$ and $C$ in parallel, thus they can achieve a shorter critical path than MSB-first multipliers [7]. On the other hand, they require an extra register to keep $T$.
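The two scanning orders can be sketched in software as follows. This is a minimal bit-serial model, not the paper's algorithm listings: the function names are hypothetical, the MSB-first variant uses the common Horner ordering (shift and reduce, then add), which may differ in detail from Algorithm 1, and the usage example assumes the small field GF(2^8) with the polynomial x^8 + x^4 + x^3 + x + 1 instead of the field used by the processors.

```python
def reduce_once(v: int, p: int, m: int) -> int:
    """Bring a polynomial of degree <= m back to degree < m by adding P."""
    return v ^ p if (v >> m) & 1 else v


def mul_msb_first(a: int, b: int, p: int, m: int) -> int:
    """Scan B from the MSB side; a single register C is shifted and updated."""
    c = 0
    for i in range(m - 1, -1, -1):
        c = reduce_once(c << 1, p, m)   # C <- x*C mod P
        if (b >> i) & 1:
            c ^= a                      # C <- C + b_i*A
    return c


def mul_lsb_first(a: int, b: int, p: int, m: int) -> int:
    """Scan B from the LSB side; T = x^i * A mod P is kept in an extra register."""
    c, t = 0, a
    for i in range(m):
        if (b >> i) & 1:
            c ^= t                      # C <- C + b_i*T
        t = reduce_once(t << 1, p, m)   # T <- x*T mod P (independent of C)
    return c


# Assumed example field: GF(2^8) with P(x) = x^8 + x^4 + x^3 + x + 1.
P8, M8 = 0x11B, 8
PRODUCT_MSB = mul_msb_first(0x57, 0x83, P8, M8)
PRODUCT_LSB = mul_lsb_first(0x57, 0x83, P8, M8)
```

In the LSB-first loop the updates of `c` and `t` do not depend on each other within an iteration, which mirrors the parallel update that shortens the critical path in hardware at the cost of the extra register.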
Inversion
Modular inversion is considered a computationally expensive operation. The most commonly used inversion algorithms are based on Fermat's little theorem [2], the extended Euclidean algorithm (EEA) [20] and Gaussian elimination [18]. EEA is widely used to perform inversion in practice.
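For comparison with the EEA-based inverters discussed in this paper, the following sketch shows the Fermat-based approach, which computes the inverse as a^(2^m − 2) using only multiplications and squarings. It is an illustration under assumptions: the function names are hypothetical, and the example again uses GF(2^8) with the polynomial x^8 + x^4 + x^3 + x + 1 rather than the field of the processors.

```python
def gf_mul(a: int, b: int, p: int, m: int) -> int:
    """Bit-serial LSB-first multiplication in GF(2^m) modulo P."""
    c, t = 0, a
    for i in range(m):
        if (b >> i) & 1:
            c ^= t
        t <<= 1
        if (t >> m) & 1:
            t ^= p
    return c


def gf_inv_fermat(a: int, p: int, m: int) -> int:
    """Return a^(2^m - 2) = a^(-1), using 2^m - 2 = 2 + 4 + ... + 2^(m-1)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    result = 1
    power = a
    for _ in range(m - 1):
        power = gf_mul(power, power, p, m)     # a^(2^k) for k = 1 .. m-1
        result = gf_mul(result, power, p, m)   # accumulate the product
    return result


P8, M8 = 0x11B, 8                              # assumed example field
INV_53 = gf_inv_fermat(0x53, P8, M8)
```

This straightforward version costs m − 1 squarings and m − 1 multiplications, which helps explain why an EEA-based inverter costing a few multiplication times is attractive.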
The schoolbook EEA-based inversion algorithm in $GF(2^m)$ is considered inefficient due to the long polynomial division in each iteration. This problem was partially solved by replacing the degree comparison with a counter [9]. Algorithms 2 and 4 show two variants of EEA, namely, left-shift inversion and right-shift inversion [37], respectively. Here we use $S_i$ to denote the value of $S$ after the $i$th iteration, and s ... The complement of $c_1$ is represented as $\overline{c_1}$.
From an implementation perspective, the right-shift inversion algorithm is preferred for a high-performance inverter. The right-shift inversion algorithm has no modular operations. As a result, a short critical path delay can be easily achieved. The counter $d$ is realized with a ring counter [37]. A ring counter register contains exactly one 1-bit. The value of the counter is defined as $(-1)^{sign} \cdot d$, where $d$ is the number of 0s to the right of the 1 in the register. An $n$-bit ring counter can count up to $n - 1$, thus it is larger than an equivalent counter using a ripple-carry adder. On the other hand, it has a shorter critical path delay since it only has shift operations. The left-shift inversion algorithm uses a ripple-carry adder, and it fits better in area-constrained devices.
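The ring counter described above can be modelled in a few lines. The sketch below is a behavioural illustration only: the class name `RingCounter` and its methods are hypothetical, and how the sign bit is driven by the inversion algorithm is not modelled, so the sign is toggled explicitly by the caller.

```python
class RingCounter:
    """One-hot counter: the magnitude is the position of the single 1-bit."""

    def __init__(self, n: int):
        self.n = n          # register width in bits
        self.reg = 1        # the single 1 starts at position 0 (magnitude 0)
        self.sign = 0

    def magnitude(self) -> int:
        # Number of 0s to the right of the 1 in the register.
        return self.reg.bit_length() - 1

    def value(self) -> int:
        return (-1) ** self.sign * self.magnitude()

    def shift_left(self) -> None:
        """Increase the magnitude by one using only a shift."""
        if self.magnitude() == self.n - 1:
            raise OverflowError("an n-bit ring counter counts up to n - 1")
        self.reg <<= 1

    def shift_right(self) -> None:
        """Decrease the magnitude by one using only a shift."""
        if self.magnitude() == 0:
            raise OverflowError("magnitude is already 0")
        self.reg >>= 1

    def flip_sign(self) -> None:
        self.sign ^= 1


counter = RingCounter(4)
for _ in range(3):
    counter.shift_left()
value_after_shifts = counter.value()
counter.flip_sign()
counter.shift_right()
value_after_flip = counter.value()
```

The model makes the trade-off visible: every update is a single shift, but counting up to n − 1 takes n flip-flops instead of roughly log2(n) for a binary counter.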
Unified multiplier and inverter
The main observation of this paper is that multiplier and inverter can be efficiently merged, which brings a significant reduction in area. For example, Step 3 in Algorithm 3 and Step 10 in Algorithm 4 can be generalized to the following operation:
Another example is Algorithm 1 and Step 12 in Algorithm 2. They can be generalized to
Indeed, a modification of the architecture of a bit-serial multiplier makes it also an inverter.
In the following sections we describe two UMI architectures. Table 3 summarizes the design rationale of these two types of UMI. Let I/M denote the inversion to multiplication ratio in terms of delay. Type-I UMI is optimized for low critical path delay. It realizes the LSB-first multiplication and the Right-shift EEA algorithm. Here one inversion is equivalent to 2 multiplications. Type-I UMI is used to build a high-performance HECC processor. Type-II UMI is targeting ultra-constrained devices. It realizes the MSB-first multiplication and the Left-shift EEA algorithm. Here one inversion is equivalent to 4 multiplications. Type-II UMI is used to build a low footprint HECC processor.
High-throughput UMI and HECC processor
In this section we present the architecture of the Type-I UMI and a high-performance HECC processor. We first describe the architecture of the UMI, then we discuss a method to select the I/M ratio. We also compare the performance of the design with previous implementations at the end of this section.
Type-I UMI architecture
Fig. 6 shows the data-path of the proposed unified inverter and multiplier. The data-path realizes both Algorithms 3 and 4. Table 4 describes how to configure the UMI to perform inversion or multiplication. The goal of this data-path merging is to maximize the hardware sharing and at the same time to minimize the overhead on critical path delay.
Hardware sharing: Three registers (R, S and H) and one AND-XOR cell are shared.
Critical path: The critical path delay is the same as that of a standalone inverter, i.e., $2T_{MUX}$.
Function selection: The selection of a working mode (i.e., multiplication or inversion) is performed on the existing registers at the first cycle. It is also shown in Table 4 .
Throughput: The UMI achieves a throughput of $1/(2m - 1)$ inversions or $1/m$ multiplications per cycle.
The critical path delay of UMI is longer than the one of a multiplier. In other words, merging an inverter into a multiplier slows down the multiplication. However, for divisor additions in HECC, performing one inversion saves 28 multiplications (see Table 1 ). Indeed, having a fast inverter at the cost of slower multiplication may still speed up the divisor addition and doubling. This issue is discussed in the following section.
Digit-serial UMI
While the use of UMI achieves an area reduction of the ALU, it also slows down multiplications. For applications where many more multiplications than inversions are required, maximizing the throughput of an inverter at the cost of a slower multiplier is not always desirable. Therefore, we propose a flexible architecture which enables an arbitrary I/M ratio. Fig. 7 shows a design that replaces two bit-serial UMI with multipliers. We use $w_I$ and $w_M$ to denote the actual digit-size of the inverter and multiplier, respectively. The UMI in Fig. 7 ...
Fig. 3. AND-XOR cell building block.
Fig. 4. LSB-first bit-serial modular multiplier.
Table 4. Configurations and operations of UMI-I (columns: Registers, Multiplication, Inversion).
The w I /w M ratio should be decided by the applications and the design constraints of the circuit. The next section describes an HECC processor built with the UMI.
Type-I HECC processor
The HECC coprocessor is shown in Fig. 8 . It contains an instruction ROM, a main controller and the Type-I UMI. The instruction ROM contains the field operation sequences of divisor addition and doubling. As only a single data-path is used, the coprocessor does not require high-bandwidth register files. Instead, a data RAM is used to keep the curve parameters, base divisor and intermediate data. On FPGAs, block RAMs are used.
For divisor addition and doubling, we use the explicit formulae proposed by Lange and Stevens [25]. One divisor addition takes 1I + 21M + 3S, while one divisor doubling takes 1I + 5M + 6S. We give in the Appendix the explicit formulae in the form of register operations.
The selection of $w_M$ and $w_I$ is decided by the following constraints: speed and area. We choose $w_M = 14$ such that the area meets our constraints, i.e., the overall area should be smaller than the smallest known implementation ([32] in Table 2). We then adjust $w_I$ and measure the performance of the UMI on a Xilinx XC2V4000 FPGA. The following equations are used to estimate the delay of one divisor addition (DA), one divisor doubling (DD) and one scalar multiplication (SM), respectively. Here $T_I$ and $T_M$ denote the delay of one inversion and multiplication, respectively. Note that squaring is also performed with the UMI, thus $T_S = T_M$.
Fig. 9. Area of the UMI and delay for DA, DD and SM.
Table notes: (a) Non-adjacent form. (b) Using binary method for scalar divisor multiplication.
As shown in Fig. 9, when $w_I$ increases, $T_M$ goes up. However, the delay of one inversion goes down. $T_{DA}$ reaches its low point when $w_I = 3$, while $T_{DD}$ stays almost unchanged when $w_I$ goes from 3 to 5. The delay of one scalar multiplication also reaches its low point at $w_I = 3$. Note that the area increases almost linearly when $w_I$ grows. When $w_I > 3$, there is no gain in speed while area goes up. Thus, we choose $w_I = 3$ and $w_M = 14$ as the best performance-area trade-off for this architecture. One multiplication and one inversion in $GF(2^{83})$ take 47.9 and 439 ns, respectively.
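The reported delays can be cross-checked with a short calculation. The sketch below is an estimate under assumptions rather than the paper's own delay model: it assumes that a digit-serial multiplication takes ceil(m / w_M) cycles, that a digit-serial inversion takes ceil((2m − 1) / w_I) cycles, and that the 125 MHz clock reported for the FPGA design applies to both. The divisor delays follow from the operation counts 1I + 21M + 3S and 1I + 5M + 6S with T_S = T_M; all function names are hypothetical.

```python
from math import ceil

M = 83                  # field GF(2^83)
W_M, W_I = 14, 3        # digit sizes chosen in the text
F_CLK_HZ = 125e6        # clock frequency reported for the FPGA design


def mul_cycles(m: int, w_m: int) -> int:
    # Assumption: w_m bits of the m-bit operand are processed per cycle.
    return ceil(m / w_m)


def inv_cycles(m: int, w_i: int) -> int:
    # Assumption: w_i of the 2m - 1 bit-serial iterations run per cycle.
    return ceil((2 * m - 1) / w_i)


def to_ns(cycles: int, f_clk_hz: float) -> float:
    return cycles * 1e9 / f_clk_hz


def divisor_delays(t_i: float, t_m: float):
    t_da = t_i + 24 * t_m   # 1I + 21M + 3S with T_S = T_M
    t_dd = t_i + 11 * t_m   # 1I + 5M + 6S with T_S = T_M
    return t_da, t_dd


t_m_est = to_ns(mul_cycles(M, W_M), F_CLK_HZ)    # 6 cycles
t_i_est = to_ns(inv_cycles(M, W_I), F_CLK_HZ)    # 55 cycles
t_da, t_dd = divisor_delays(439.0, 47.9)         # using the reported delays, in ns
```

Under these assumptions the estimates are 48 ns and 440 ns, close to the reported 47.9 ns and 439 ns, and the reported delays give roughly 1589 ns for one divisor addition and 966 ns for one divisor doubling.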
Results and comparison
We implemented the architecture from Fig. 8 on a Xilinx Virtex-II (XC2V4000) FPGA. The coprocessor is described with the Gezel [33] language and synthesized with Xilinx ISE 8.1. It uses 2316 slices and 6 block RAMs. A clock frequency of 125 MHz can be reached. Table 5 compares the area and performance with previous FPGA-based implementations of HECC in $GF(2^m)$.
Among all the previous implementations, the designs from Sakiyama et al. and Wollinger are of special interest for comparison. They both use explicit formulae, and the designs are much smaller than other implementations. The HECC coprocessor presented in [32] uses projective coordinates and a superscalar architecture. Different numbers of digit-serial ($w = 12$) multipliers are used for different configurations. Our coprocessor, using one unified multiplier/inverter, is faster than the one of [32] using three multipliers.
The architectures proposed in [35], however, use the affine version of the explicit formulae. Three different architectures ranging from high speed to low hardware cost are proposed. The high-speed version uses three multipliers and two inverters, and it takes 415 ms to finish one scalar multiplication. To the best of our knowledge, this is also the fastest HECC implementation on FPGA to date. The low-area version uses 3955 slices. However, it requires 831 ms for one scalar multiplication.
Compared to all the previous implementations, our HECC processor achieves a higher performance at a lower area cost. The area reduction is attributed to the use of a compact ALU and the reduction of the memory throughput. The ALU in [35] contains two multipliers and one inverter, which in total use 2427 slices. The ALU used in this paper requires only 1500 slices. The performance gain is mainly due to the efficient inverter. When running at 56.7 MHz, the inverter in [35] requires 1570 ns on average for one inversion in $GF(2^{81})$, while our high-throughput UMI finishes one inversion in $GF(2^{83})$ in 439 ns.
Although we use only one multiplier, which is also slower than the one in [35] , the divisor addition and doubling are faster.
Lightweight UMI and HECC processor for RFID
In this section, we describe an HECC processor targeting extremely constrained devices such as passive RFID tags. In such applications, area and power consumption are of higher priority than performance. According to [1] , a passive RFID tag should have power consumption less than 15 mW to guarantee 1 m operation range. Some ECC implementations [26, 19] can already fulfill the requirements. We show that HECC can also fulfill the requirements with a comparable performance. 
Type-II UMI
Type-II UMI realizes Algorithms 1 and 2. In this architecture, the bit-serial multiplier is reused for the inversion. The counter d, implemented as a ring counter in Type-I UMI, is replaced with a ripple-carry adder. Fig. 10 shows the data-path of the proposed digit-serial inverter and multiplier.
Multiplication: The data-path performs $C_{i+1} \leftarrow ((C_i + b_i A) \ll 1) \bmod P$. In this case, only two registers (S and H) are used, thus R and J can be used as storage.
Inversion: In this mode, the {R, S} pair and the {H, J} pair are updated alternately. The bit-serial multiplier performs one of the following operations:
Note that R and S are updated first, then H and J are updated accordingly in the next cycle.
Assuming the digit-size is $w$, one multiplication in $GF(2^m)$ takes $\lceil m/w \rceil$ cycles, while one inversion takes $\lceil 4m/w \rceil$ cycles.
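These cycle counts are easy to tabulate for candidate digit sizes. The following sketch evaluates ceil(m / w) and ceil(4m / w) with integer arithmetic; the helper names are hypothetical, and m = 83 is assumed here only for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return (a + b - 1) // b


def type2_mul_cycles(m: int, w: int) -> int:
    return ceil_div(m, w)


def type2_inv_cycles(m: int, w: int) -> int:
    return ceil_div(4 * m, w)


M = 83  # assumed field size for this illustration
CYCLES = {w: (type2_mul_cycles(M, w), type2_inv_cycles(M, w)) for w in (1, 2, 4, 8)}
```

For every digit size the inversion costs about four multiplications, matching the I/M ratio stated for the Type-II UMI.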
Type-II HECC processor: low-footprint
The HECC processor is shown in Fig. 11 . It contains an instruction ROM, a main controller, a Type-II UMI, a register file, and an input/output interface. It differs from the Type-I HECC processor in both the UMI architecture and storage. The Type-I HECC processor uses a dual-port RAM, while the Type-II HECC processor uses a single-port register file.
Besides the multiplier and inverter, the register file makes a big portion of the area. Reducing area of the register file is the key step towards a compact implementation. An HECC processor using affine coordinates requires fewer registers to store intermediate results since no Z coordinates are used. Moreover, it also reduces the number of intermediate results. Lange and Mishra studied the register allocation for parallel multipliers [24] . Our investigation shows that 12 registers are sufficient for scalar multiplication with flexible base divisor D. Note that the Type-II UMI has four registers, among which two can be used for storage when it is not working as an inverter. Thus, we only need 10 registers in the register file. The complete register allocation for divisor doubling and divisor addition is given in the Appendix.
Results and comparison
We synthesized the Type-II HECC processor with a 130 nm standard cell library. Table 6 summarizes the area and power of the proposed design.
Our HECC implementation uses 14.5 kGates and finishes one scalar multiplication in 136,838 clock cycles. The power consumption, estimated with power compiler, is 13.4 mW when running at 300 kHz. The implementation of [31], using projective coordinates, requires 266,133 clock cycles for one scalar multiplication. Note that it is defined on a smaller field and the result does not include data storage. The power and energy consumption of our design is 65% lower while it achieves the same throughput.
There are several ECC implementations proposed for RFID tags. Lee et al. [26] use digit-serial multipliers, while Hein et al. use a 16 × 16 $GF(2)$ multiplier and a 32-bit accumulator. Compared with the implementation in [26], the design using a 16 × 16-bit multiplier requires less area and consumes less power. On the other hand, it requires 296k clock cycles, twice as many as Lee's ECC processor (and our HECC processor), for one scalar multiplication, and its energy consumption is about six times higher.
Our HECC processor can meet the requirements for passive RFID tags in terms of area, power and energy. However, ECC implementations are still leading in terms of the energy efficiency. This is due to the fact that the computational complexity of HECC scalar multiplication is higher than ECC.
Conclusions
We explore the efficiency of a unified multiplier and inverter data-path in HECC implementations. Two types of UMI are proposed. Type-I UMI, which realizes the LSB-first multiplication and right-shift EEA algorithms, achieves a short critical path delay. Using the Type-I UMI results in a high-performance HECC processor on FPGA. The Type-II UMI, which realizes the MSB-first multiplication and the left-shift EEA algorithms, achieves a low footprint. Using the Type-II UMI results in a lightweight HECC processor for constrained devices. The use of UMI brings a substantial improvement in terms of area and performance of HECC implementations.
For applications like RFIDs, physical security is very important. Known HECC implementations are based on either a binary scalar multiplication method or NAF. They might be vulnerable to side-channel attacks. In ECC implementations, the Montgomery ladder is widely used for protection against simple power analysis. A recent work by Gaudry and Lubicz [16] has shown that scalar multiplication of divisors can also use the Montgomery ladder. As future work, we will combine our architecture with the algorithm proposed by Gaudry and Lubicz.
