Modular multiplication and inversion are the essential operations in many Public Key Cryptosystems (PKCs). In this paper, we describe a unified digit-serial inverter/multiplier in $GF(2^m)$. The inversion is based on a modified Extended Euclidean Algorithm (EEA), while the multiplication is based on an LSB-first multiplication algorithm. As the inverter and multiplier share the data-path, the unified design is smaller than Arithmetic Logic Units (ALUs) with separate inverters and multipliers. When choosing digit size to be w, this inverter/multiplier finishes one inversion and one multiplication in
INTRODUCTION
Modular multiplication and inversion are the essential operations in many PKCs [6] such as Elliptic Curve Cryptography (ECC) [18, 14] and HyperElliptic Curve Cryptography (HECC) [15]. The Elliptic Curve (EC) scalar multiplication, which is commonly used in EC-based protocols, is performed as a sequence of Point Additions (PA) and Point Doublings (PD). In HECC, a scalar multiplication is performed as a sequence of Divisor Additions (DA) and Divisor Doublings (DD). Point/divisor operations are in turn performed as a sequence of modular additions, multiplications and inversions. As a result, the performance of ECC and HECC relies on the efficiency of the underlying modular operations. Table 1 shows the number of modular operations required by ECC and HECC with different coordinates. Here I, M and S denote modular inversion, multiplication and squaring, respectively.
There are several constraints in the design of multipliers and inverters for ECC and HECC. First of all, the finite field $GF(2^m)$ used by ECC or HECC must be large enough to ensure a secure cryptosystem. For example, ECC over $GF(2^m)$ is commonly considered secure if m ≥ 163, while HECC over $GF(2^{83})$ offers security equivalent to ECC over $GF(2^{163})$ [2]. Second, as only one or two inversions are required in each point/divisor operation, the latency of the inverter has higher priority than its throughput. Therefore, some architectures for inversion, such as 2-dimensional systolic arrays [22, 9], are not suitable for this application due to either infeasible size or large latency.
Many ALUs have been proposed to support efficient modular multiplication, inversion and squaring. The previously proposed ALUs [7, 10, 17] often consist of an inverter, one or two multipliers, and squarers. ECC or HECC implemented on these ALUs is normally much faster than on multiplier-only ALUs. However, these ALUs are also much larger in area. In [16], a combined multiplication/division architecture was proposed based on Stein's algorithm [20]. It is about 27% smaller than an ALU consisting of a separate multiplier and divider. However, combining the divider and the multiplier increases the critical path delay. As a result, both multiplication and division are slowed down.
In this paper, we propose a digit-serial architecture for inversion and multiplication. The inversion shares the data-path with the multiplication, so the design is much smaller than ALUs with separate inverters and multipliers. Besides, the critical path delay of the inverter/multiplier is the same as that of a standalone inverter. When choosing the digit size to be w, this inverter/multiplier finishes one inversion in
The rest of the paper is organized as follows. Section 2 gives a brief introduction to previous work. In Section 3 we describe the architecture of the digit-serial inverter/multiplier. In Section 4 we show the implementation results. We conclude the paper and outline future work in Section 5.
PREVIOUS WORK
An element α in $GF(2^m)$ can be represented as a polynomial $A(x) = \sum_{i=0}^{m-1} a_i x^i$, where $a_i \in \{0, 1\}$. Addition and subtraction of two elements in $GF(2^m)$ are performed as polynomial addition in $GF(2)$,
$A(x) + B(x) = \sum_{i=0}^{m-1} (a_i \oplus b_i) x^i$,
where ⊕ is the XOR operation. Multiplication in $GF(2^m)$ is defined as polynomial multiplication modulo P(x), the irreducible polynomial by which the field is defined.
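As a small illustration of the definitions above, the following Python sketch stores a polynomial over $GF(2)$ as an integer whose bit i holds $a_i$. This encoding and the function name `gf2m_add` are assumptions made for the example, not part of the proposed architecture.

```python
def gf2m_add(a: int, b: int) -> int:
    """Add (or subtract) two GF(2^m) elements given as bit vectors.

    Bit i of each integer is the coefficient of x^i, so the
    coefficient-wise sum in GF(2) is simply a bitwise XOR.
    """
    return a ^ b


# (x^3 + x + 1) + (x^3 + x^2) = x^2 + x + 1
total = gf2m_add(0b1011, 0b1100)
print(bin(total))  # 0b111
```

Because every element is its own additive inverse, the same function also serves as subtraction.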
Multiplication
Various algorithms and architectures [3, 19] have been proposed in the literature for modular multiplication in $GF(2^m)$. The bit-serial algorithms can be classified into two categories: the Most Significant Bit (MSB) first algorithms and the Least Significant Bit (LSB) first algorithms. It is important to point out that an LSB-first bit-serial multiplier has a shorter critical path than an MSB-first bit-serial multiplier [3]. In this paper, we use the LSB-first algorithm.
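Algorithm 1 LSB-first bit-serial modular multiplication in $GF(2^m)$
Input: $A(x) = \sum_{i=0}^{m-1} a_i x^i$, $B(x) = \sum_{i=0}^{m-1} b_i x^i$, irreducible polynomial P(x) with deg(P(x)) = m.
1: C(x) ← 0;
2: for i = 0 to m − 1 do
3: C(x) ← $b_i$A(x) + C(x);
4: A(x) ← xA(x) mod P(x);
5: end for
Return: C(x).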
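The LSB-first procedure of Algorithm 1 can be sketched in software as follows. This is a minimal model, assuming polynomials are stored as Python integers (bit i is the coefficient of $x^i$) and that `p` includes the leading term $x^m$; the function name `gf2m_mul_lsb_first` is hypothetical.

```python
def gf2m_mul_lsb_first(a: int, b: int, p: int, m: int) -> int:
    """Return a(x) * b(x) mod p(x) in GF(2^m), scanning b from its LSB.

    Assumes deg(a) < m, deg(b) < m and deg(p) == m.
    """
    c = 0
    for i in range(m):
        if (b >> i) & 1:      # C(x) <- b_i * A(x) + C(x)
            c ^= a
        a <<= 1               # A(x) <- x * A(x) ...
        if (a >> m) & 1:      # ... mod P(x): clear the x^m term if present
            a ^= p
    return c


# GF(2^8) with P(x) = x^8 + x^4 + x^3 + x + 1 (the AES field polynomial)
print(hex(gf2m_mul_lsb_first(0x57, 0x83, 0x11B, 8)))  # 0xc1
```

The loop body mirrors the two updates of the algorithm, one on the partial product C(x) and one on the shifted operand A(x), which is why a hardware realization needs only two AND-XOR rows.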
Inversion
If $A(x)A^{-1}(x) \equiv 1 \bmod P(x)$, then $A^{-1}(x)$ is called the multiplicative inverse of A(x). Compared with the other modular operations, modular inversion is computationally expensive. The most commonly used methods to perform modular inversion are based on Fermat's little theorem [1], the Extended Euclidean Algorithm [13] and Gaussian elimination [11]. In practice, the EEA is widely used to perform inversion.
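To make the definition of the inverse concrete, the sketch below computes it with the Fermat-based approach mentioned above, using $A^{-1}(x) = A(x)^{2^m - 2}$ and a plain square-and-multiply loop. It is an illustrative software model only; the helper names and the integer encoding of polynomials (bit i is the coefficient of $x^i$) are assumptions.

```python
def gf2m_mul(a: int, b: int, p: int, m: int) -> int:
    """LSB-first multiplication a(x) * b(x) mod p(x) in GF(2^m)."""
    c = 0
    for i in range(m):
        if (b >> i) & 1:
            c ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= p
    return c


def gf2m_inverse_fermat(a: int, p: int, m: int) -> int:
    """Return a^(2^m - 2) mod p(x), which equals a^(-1) for nonzero a."""
    if a == 0:
        raise ValueError("zero has no multiplicative inverse")
    result, base = 1, a
    e = (1 << m) - 2
    while e:
        if e & 1:
            result = gf2m_mul(result, base, p, m)
        base = gf2m_mul(base, base, p, m)   # repeated squaring
        e >>= 1
    return result


# GF(2^3) with P(x) = x^3 + x + 1: the inverse of x is x^2 + 1
inv = gf2m_inverse_fermat(0b010, 0b1011, 3)
print(bin(inv), gf2m_mul(0b010, inv, 0b1011, 3))  # 0b101 1
```

This route needs on the order of m squarings and up to m multiplications, which illustrates why inversion is regarded as expensive compared with a single multiplication.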
The schoolbook EEA-based inversion algorithm in $GF(2^m)$ is commonly considered inefficient due to the polynomial long division in each iteration. This problem was partially solved in [4], where the degree comparison is replaced by a counter that tracks the degree difference between the polynomials. This method simplifies the control logic and reduces the critical path delay.
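For reference, a degree-comparing EEA-style inversion can be sketched as below. This is a generic textbook shift-and-add variant with an explicit degree comparison in every iteration, i.e. the kind of baseline that counter-based schemes improve on; it is not the method of [4] and not Algorithm 2. Polynomials are assumed to be stored as integers (bit i is the coefficient of $x^i$), and the function names are hypothetical.

```python
def gf2m_mul(a: int, b: int, p: int, m: int) -> int:
    """LSB-first multiplication, used here only to check the result."""
    c = 0
    for i in range(m):
        if (b >> i) & 1:
            c ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= p
    return c


def gf2m_inverse_eea(a: int, p: int) -> int:
    """Invert nonzero a(x) modulo irreducible p(x), with deg(a) < deg(p).

    Loop invariants: g1 * a = u (mod p) and g2 * a = v (mod p).
    """
    if a == 0:
        raise ValueError("zero has no multiplicative inverse")
    u, v = a, p
    g1, g2 = 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()   # deg(u) - deg(v)
        if j < 0:                             # keep deg(u) >= deg(v)
            u, v = v, u
            g1, g2 = g2, g1
            j = -j
        u ^= v << j                           # cancel the leading term of u
        g1 ^= g2 << j
    return g1


inv = gf2m_inverse_eea(0b010, 0b1011)         # x in GF(2^3), P = x^3 + x + 1
print(bin(inv), gf2m_mul(0b010, inv, 0b1011, 3))  # 0b101 1
```

Note that the number of loop iterations depends on the operand, and each iteration needs a degree comparison; both properties are what the counter-based, fixed-iteration variants aim to avoid.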
Algorithm 2 EEA-based inversion algorithm [23]
Input: irreducible binary polynomial P(x) with deg(P(x)) = m, polynomial A(x) with deg(A(x)) < m.
2: for i = 1 to 2m − 1 do
3:
In [23], Yan et al. proposed a modified inversion algorithm based on the EEA. Algorithm 2 shows this inversion algorithm. Here we use $S_i(x)$ to denote the value of S(x) after the i-th iteration, and d
The complement of $C_1$ is denoted $\overline{C_1}$. The MSB of S(x) is evaluated in each iteration. If it is zero, S(x) is left-shifted by one bit. Otherwise, the lengths of S(x) and R(x) are compared; if S(x) is shorter than R(x), then R(x) and S(x) are swapped. In this manner, the length of either S(x) or R(x) decreases by one in each iteration. After 2m − 1 iterations, R(x) becomes $x^m$ and $H(x) = A^{-1}(x)$. Throughout the loop, the difference between the lengths of R(x) and S(x) is tracked by a counter.
Unlike many other EEA variants [9, 4, 13], this algorithm has no modular operation, so a short critical path delay can be easily achieved. Besides, this algorithm has a fixed number of iterations. From a security point of view, inversion algorithms with a fixed number of iterations are more resistant to side-channel analysis. As this algorithm has only bit-by-bit shift operations in each iteration, it can be easily implemented in a systolic architecture. A number of architectures have been proposed to perform this EEA-based inversion. For example, systolic architectures for inverters with an area-time complexity of $O(m^2)$ have been proposed [24, 22]. As this algorithm has no modular operations, the implementations have a shorter critical path delay and a smaller area than other EEA-based implementations [9, 5, 12].
DIGIT-SERIAL INVERTER/MULTIPLIER
Bit-serial Multiplier
As shown in Alg. 1, the main operation in LSB-first multiplication is (bA(x) + C(x)), which can be performed by a row of AND gates and XOR gates as shown in Figure 1(a). Figure 1(b) shows the architecture of an LSB-first bit-serial multiplier. Two (m+1)-bit registers hold P(x) and A(x), and two m-bit registers hold B(x) and the partial product C(x). Here ($a_m$P(x) + A(x)) and ($b_0$A(x) + C(x)) are performed on the left and right side, respectively. The output of the left AND-XOR cell is then left-shifted by one bit and written back to A(x), while the output of the right AND-XOR cell is written back to C(x). In order to supply $b_i$ in the i-th clock cycle, the register containing B(x) is shifted right by one bit in each clock cycle.
It is clear that the critical path delay is $T_{AND} + T_{XOR}$, where $T_{AND}$ and $T_{XOR}$ denote the delay of a 2-input AND gate and a 2-input XOR gate, respectively. One multiplication in $GF(2^m)$ takes m clock cycles on this bit-serial multiplier.
Bit-serial inverter
In Figure 2 we present a bit-serial architecture for inversion in $GF(2^m)$. It is a realization of Algorithm 2. This bit-serial inverter uses two AND-XOR cells. The left cell performs
, and the right cell performs
). Two (m + 1)-bit multiplexers are used to generate $R_i(x)$ and $H_i(x)$. Note that the counter is implemented as a so-called ring counter [24]. The value of the counter is defined as follows. The critical path delay of the bit-serial inverter is $2T_{MUX}$, where $T_{MUX}$ denotes the delay of a 2-input multiplexer. This inverter finishes one inversion in $GF(2^m)$ in (2m − 1) clock cycles. The critical path delay can be further reduced by breaking up the data-path into a two-stage pipeline [22]. However, this will double the latency and require more registers.
Unified bit-serial Inverter and Multiplier
We propose a unified architecture which can perform both multiplication and inversion. It is a combination of the bit-serial multiplier shown in Figure 1 and the bit-serial inverter shown in Figure 2. Figure 3 shows the data-path of our proposed bit-serial inverter/multiplier.
The multiplier and the inverter share one AND-XOR cell and three registers. The combined inverter/multiplier has only one more register and one more AND-XOR cell than the inverter shown in Figure 2. As a result, it is much smaller than ALUs with separate multipliers and inverters. This data-path supports the following operations:
Modular Multiplication
• Return $C_m(x)$.
Modular Inversion
• Return $H_{2m-1}(x)$.
Note that in order to support multiplication, one register C(x) and one AND-XOR cell are attached to the architecture in Figure 3. No multiplexers or other logic are inserted into it. As a result, the critical path delay of this unified inverter and multiplier is the same as that of the inverter alone.
In order to achieve higher throughput, a digit-serial inverter can be implemented with multiple bit-serial inverters. Figure 4 shows the architecture of a digit-serial inverter with w = 3. The output of the bit-serial inverter at the bottom is written back to the registers. When choosing digit size as w, one modular inversion in $GF(2^m)$ takes
Table 2 compares the area and time complexity of the proposed inverter with those of some inverters proposed in the literature. The proposed inverter in Figure 4 requires less area than those proposed in [22] and [9]. The critical path of our inverter is much shorter than that of [9]. However, it is slightly longer than that of [22], which uses a pipeline to reduce the critical path delay; as a result, the latency of [22] is about twice that of our design.
IMPLEMENTATION RESULT
The critical path delay of the inverter in [22] can be further reduced at the cost of increased latency. Multiplication is not supported by the inverters proposed in [22, 9]. Compared with the multiplier/divider proposed in [16], our bit-serial inverter/multiplier has a smaller area and a shorter critical path delay. Note that the divider [21] from which the combined multiplier/divider is derived has a critical path delay of $T_{AND} + T_{XOR3}$. When the functionality of multiplication is added to it, a 2-input MUX gate is inserted in the critical path. As a result, the critical path delay is increased to $T_{AND} + T_{XOR3} + T_{MUX}$. The division operation can be performed as an inversion followed by a multiplication on our data-path. It takes 3m − 1 clock cycles on our bit-serial inverter/multiplier, while the divider from [16] requires 2m − 1 clock cycles. However, since the critical path delay of our inverter is about half that of the divider, the overall delay of a division on the proposed inverter/multiplier is smaller.
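The division comparison above can be checked with a few lines of arithmetic. The sketch below multiplies the cycle counts stated in the text by a critical-path ratio; the ratio of exactly 0.5 is an assumption standing in for "about half", and the function names are hypothetical.

```python
def unified_division_cycles(m: int) -> int:
    """Inversion (2m - 1 cycles) followed by multiplication (m cycles)."""
    return (2 * m - 1) + m


def divider_division_cycles(m: int) -> int:
    """Cycle count of the divider from [16], as stated in the text."""
    return 2 * m - 1


def relative_division_delay(m: int, cp_ratio: float = 0.5) -> float:
    """Delay of the unified design divided by the delay of the divider.

    cp_ratio is the assumed ratio of the two critical path delays
    (unified design over divider); 0.5 models "about half".
    """
    return unified_division_cycles(m) * cp_ratio / divider_division_cycles(m)


m = 163
print(unified_division_cycles(m), divider_division_cycles(m))  # 488 325
print(relative_division_delay(m) < 1.0)                        # True
```

Under this assumption the ratio tends to 3/4 for large m, so the extra m cycles of the separate multiplication are more than offset by the shorter clock period.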
In order to evaluate the area and performance of the proposed inverter/multiplier, we implemented the architecture from Fig. 4 on a Xilinx Virtex-II PRO (XC2VP30) FPGA. The inverter is described in the Gezel [8] language and synthesized with Xilinx ISE 8.1. Table 3 shows the area and clock frequency for different digit sizes. The size of the inverter/multiplier in Table 3 includes the data-path and the ring counter.
CONCLUSIONS
A digit-serial inverter/multiplier based on the Extended Euclidean Algorithm is proposed. The proposed inverter and multiplier share the data-path, thus have a smaller area than
Table 3. Implementation results of the digit-serial inverter/multiplier in $GF(2^m)$. (Data-path only)
