Abstract-Halving methods have been proposed for parallel implementation of ECC primitives on multicore processors. In hardware, they can also provide protection against some side channel attacks (thanks to parallel independent operations). But they require affine coordinates for curve points and costly inversions. We propose a new combined multiplication-inversion unit for binary field extensions and halving based ECC methods optimized for FPGAs. We target small area solutions compared to very fast but costly ones from state-of-art. Our solution is based on permuted normal basis, Massey-Omura multiplication and Itoh-Tsujii inversion algorithms. Our FPGA implementations show better efficiency for large fields.
I. INTRODUCTION
F INITE fields [1] are widely used in cryptography. Efficient arithmetic over finite fields is a key element for implementing public-key cryptosystems. Binary field extensions GF(2 m ) are widely used in hardware implementations of elliptic curve cryptography (ECC [2] ) due to their higher speed and smaller silicon area compared to GF(p) based solutions.
Scalar multiplication using point halving has been proposed in [3] to provide more parallelism. There, operations can be divided into two independent sequences and performed in parallel [4] . See [5] for recent work in this domain. This can also provide higher resilience against side channel attacks (see [6] for instance) since two shorter and independent scalar multiplications are performed in parallel. This requires the curve points to be represented using affine coordinates. Then, more inversion operations have to be computed than in projective coordinates [2, Sec. 3.2] . Representing field elements using normal basis will also provide very efficient square and squareroot operations (i.e. circular shifts) which are also important for halving methods.
In this paper, we present a combined multiplicationinversion unit (MIU). State of art GF(2 m ) inversion and related operations are recalled in Sec. II. Sec. III presents our GF(2 m ) inversion algorithm. It uses a new representation for field elements called permuted normal basis, a modified MasseyOmura multiplication [7] which allows efficient inversion through Itoh-Tsujii algorithm [8] . Our proposed hardware unit presents a higher internal parallelism without the large area penalty used in state-of-art solutions. We target small solutions instead of faster but very costly operators since ECC primitives for high security levels (e.g. m = 571) may not be used frequently. It offers a new speed-area trade-off in arithmetic units with higher throughput for halving based ECC accelerators. We evaluated our solution using GF(2 m ) fields recommended by NIST [9] . Our MIU architecture and its implementation results are presented in Sec. IV. Our implementations have been performed on a Spartan-6 LX75T and Virtex-4 LX100 FPGAs for comparison purpose. Finally, Sec. V concludes the paper.
II. STATE OF THE ART

A. Representations of GF(2 m ) Elements
There exist two popular representations: normal basis and polynomial basis [1] . In normal basis, A ∈ GF(2 m ) is encoded as
where coefficients a i belong to GF(2) and β is a special element. A = [a 0 , a 1 , . . . , a m−1 ] denotes the vector of coefficients. Square and square-root operations can be easily computed using circular shifts on the vector of coefficients. In this work, we use normal basis.
In polynomial basis, A is encoded as
where coefficients a i ∈ GF(2). Additions and multiplications are polynomial operations modulo the irreducible polynomial f in GF(2) [x] . There exist other representations with particular properties such as Dickson basis [10] or dual basis [11] .
B. GF(2 m ) Addition and Multiplication
GF(2 m ) addition can be efficiently performed using parallel XOR gates on all coefficients without any carry propagation. In normal basis, most of multiplication algorithms are based on matrix-vector product formulations of the basic multiplication. This leads to larger multipliers compared to polynomial basis ones. Massey and Omura (MO) proposed a multiplication method (see Algo. 1 from [7] ), which computes P = A×B, based on matrix-vector product where the constant matrix M 0 is only composed of GF(2) elements. Notation ROL(x, n) (ROR) stands for n-bit left (right) circular shift of x. Notation P [i] is the i-th bit of P (starting with LSBs). At line 3 in Algo. 1, M 0 is a m × m binary, constant, symmetric and very sparse matrix. Then multiplying by M 0 just uses XOR trees. A serial-output MO multiplier requires a block of multiplication by M 0 (noted ×M 0 ), two ROL registers and a dot product operator. A last ROL register can be used to provide a parallel output if needed.
We use Gaussian normal basis (GNB, a specific type of normal basis) due to its very low hardware complexity for the implementation of multiplication by M 0 [12] . The number of "1" in M 0 is given by C m ≤ t × m − t + 1 where t is the Algorithm 1: Original Massey-Omura multiplication [7] . Operands: A, B in GF(2 m ) represented in normal basis
A ← ROL (A, 1) ; B ← ROL(B, 1) ; P ← ROL(P, 1) 5 return P The original Massey-Omura multiplier produces one result bit every clock cycle. Using two parallel ×M 0 blocks produces two result bits per clock cycle with an important area overhead. For instance, multiplication in GF (2 233 ) only requires 117 = 233/2 clock cycles with an area overhead close to 2.
In [13] some redundancies in d parallel copies of the ×M 0 block are used to provide multiple bits of the result at each clock cycle without the full area penalty. The number of XOR gates is then
Parallel-output multipliers have been proposed in normal basis [14] , [15] . P is the final accumulation of the partial products which is available after m clock cycles. In [16] , a serial-input/parallel-output multiplier using w-bit sub-words computes the result in m/w clock cycles.
C. GF(2 m ) Inversion Algorithms
Two main methods are used for GF(2 m ) inversion: Euclidean algorithm and Fermat's little theorem (FLT). Up to now, Euclide based solutions are rarely used in hardware [17] . FLT states that A
exponentiation can compute the inverse of A ∈ GF(2 m ) * using the standard square and multiply algorithm.
Itoh and Tsujii (IT) proposed in [8] an efficient way to perform this exponentiation noting that 2 m − 2 = (111 · · · 110) 2 and using addition chains. An addition chain is a sequence of additions where all operands are selected among the previously 
computed terms [18, Sec. 4.6.3] . Each term is the Hamming weight of the exponent written in the binary representation. Tab. II provides an example in GF(2 7 ) with both the addition chain and the corresponding sequence of multiplication and square operations (where T i is the i-th intermediate product). In practice, the efficiency of IT based inversion mostly relies on the multiplier efficiency.
Recently in [19] , a new IT inversion operator has been proposed using the hybrid multiplier from [16] which performs A × B × C in m/w + 1 clock cycles with w-bit sub-words.
In [20] , the authors generalize the concept of addition chains introducing the k-addition chains. They use such a tool to improve the work from [19] . Very recently in [21] , a parallel-IT algorithm has been proposed to speed-up the IT-sequence using a second multiplier.
III. PROPOSED SOLUTION
In ECC halving based scalar multiplication, there is an inversion for each scalar bit equal to "1": on average every 0.5 key bit without key recoding or every 0.33 key bit using non-adjacent form (NAF) recoding, see [2, Sec. 3.3] . Then a standalone inversion unit would lead to underused silicon. We designed a new combined multiplication-inversion unit (MIU) to support both multiplication and inversion GF(2 m ) operations in GNB. It uses the IT method, then only multiplications and squares are required for field inversion. As squaring is very cheap in GNB (i.e. just a ROR), implementing a MIU requires a field multiplier, a ROR, a local register and a small controller. Our MIU has two operation modes: i) standard GF(2 m ) multiplication where operands A and B are multiplied to produce P = A×B; ii) IT based GF(2 m ) inversion where multiplier and ROR blocks are used to compute the multiplications and squares (with intermediate values in the local register). We improved the inversion speed by using optimized multiplications for specific terms of the IT exponentiation. Our MIU produces two result bits per clock cycle with only one MO multiplication block. The obtained speed-up is about 20 % and not 100 % for two reasons: first, not all terms in the IT exponentiation sequence have this specific form; second, some of result bits are produced several times.
The GF(2 m ) operand(s) and result of our MIU, for P = A× B or R = A −1 operations, are represented using a variation of GNB with permuted coefficients proposed in Sec. III-C. Some values are stored using w-bit sub-words.
A. Optimizing Large Shifter in IT Exponentiation on FPGA
During the IT exponentiation [8] , the intermediate products in GF(2 m ) must be shifted by several shift amounts (e.g. for m = 571 there are 13 different shift amounts). This requires a very large m-bit barrel shifter (m ∈ [163, 571] bits in ECC). As an example, for GF(2 571 ) on a Spartan 6 FPGA, a 571-bit shifter (optimized for the 13 shift amounts) requires 3425 LUTs while a Massey-Omura multiplier requires 2246 LUTs. Furthermore, the shift amounts are different for each field size m, then one must use a specific shifter for each m.
We replace all these large shifters by one hardwired block RAM (BRAM) of the FPGA. Instead of storing each bit of Algorithm 2: w-bit words Massey-Omura multiplication.
A ← ROL (A, 1) ; B ← ROL(B, 1) 7 return P the intermediate m-bit product P in a register, we duplicate them w times in the BRAM using the following patterns:
BRAMs in recent FPGAs are large enough to support the m · w bits. For instance, with w = 32 (a typical width for BRAMs), 7456 bits are required for m = 233 and 18272 bits for m = 571. In a low-cost Spartan-6 FPGA, 18Kb BRAMs are available, then for the largest field m = 571, one needs two BRAMs. The MO multiplier is a sequential operator with m clock cycles for producing each intermediate P . We use a w-bit temporary register to store [p i , p i+1 , . . . , p i+w−1 ] to feed the BRAM, and this register is shifted by 1 bit for the next cycle. The total cycle count for each product in the IT sequence is m+w. Using our BRAM based solution, shifting is performed by reading the words at addresses (i + αw) mod m for α ∈ {0, 1, 2, . . . , m/w ]}, where i is the shift amount. In Algo. 2, we adapt the MO multiplication algorithm to our BRAM based shifting solution (where SHL is a left shift).
This method seems to be equivalent to a simple shift register iteratively used over m clock cycles. But it will allow us to provide, for all m parameters, two shifted values for two different shift amounts at each clock cycle using a dual-port BRAM (DP-BRAM).
B. Support of Specific Patterns in the Exponentiation Chain
In the Itoh-Tsujii exponentiation sequence, the specific multiplication pattern (SMP) A T are computed. Since these two values are equal (one is the transpose of the other), we save one matrix-vector multiplication for each iteration of the algorithm. At every clock cycle, we compute where CC(r) is clock cycle number r. We produce the output bits serially, like the original MO, but 2 bits at each clock cycle. This allows to efficiently overlap successive computations as in [19] . Producing most Algorithm 3: Massey-Omura Multiplication in PNB. Operands: A, B ∈ GF(2 m ) in PNB, and k
A ← ROL (A, 1) ; B ← ROL(B, 1); C ← ROL(C, 1) 10 return P of the result bits twice will not be a problem using the trick presented in the next subsection.
C. Multiplication and Inversion using Permuted Normal Basis
During the computation of a SMP in the IT sequence, using the modified MO from Sec. III-B leads to some redundancies in the produced bits. Using the example from last paragraph of Sec. III-B, a SMP with k = 1 produces 2m − 2 bits instead of m for the result of A 2 k ×A (redundancy ≈ 2). This is due to the constant shift amount equal to 1 used in the MO algorithm.
We propose a modified MO algorithm with a new constant shift amount θ > 1 leading to a lower overall redundancy level during a complete IT exponentiation. We wrote a Python program to find θ which minimizes the inversion time for a given field GF(2 m ). We compute an addition chain for m using the basic binary method: Addition chains produced by the basic binary method are not always the shortest ones but they lead to a smaller number of intermediate m-bit registers. In practice, all our chains are the shortest or at most 1-term longer than the shortest chains from [22] .
We introduce the permuted normal basis (PNB) representation where element A = [a 0 , a 1 , a 2 In our modified Algo. 3, matrix M 0 is the adapted matrix M 0 for PNB representation. M 0 is also symmetric and is a permutation of the rows and columns of M 0 . Row i in M 0 is at row iθ −1 mod m in M 0 and column j in M 0 is at column jθ −1 mod m in M 0 (where θ −1 is the modular inverse of θ mod m). An example is given in Fig. 1 for GF(2 7 ). Our algorithm produces the two result bits (p iθ , p iθ+k ) at every clock cycle. Since m is prime, the indexes iθ mod m generate GF(m) seen as an additive group. Consequently, there exists for each i < m an integer j < m such that (iθ + k)mod m = jθ mod m. We sequentially write the result bits into two w-bit words in a DP-
. . , p (j+w−1)θ ] (all the indexes are computed modulo m).
Let us present a complete example for the toy field GF(2 7 ). m = 7 leads to the chain (U ) = (1, 2, 3, 6) (see Tab. II). In this chain, there are two SMPs: k = 1 (for 2=1+1) and k = 3 (for 6 = 3+3). Our program returns θ = 2. For SMP k = 1, the cycles are: CC (0) 
D. Cost Estimation
We denote N SMP the number of SMPs in the IT sequence. We compute the number of clock cycles in an inversion as C inv = C IO + C SMP + C nonSMP where: In Tab. III, we report θ parameters, the estimated inversion time (in clock cycles) of both original and PNB algorithms for finite fields recommended by NIST [9] and parameter w = 32 (selected for our target FPGAs, see Sec. IV). For the largest fields, our Python program generates θ in less than 5 minutes on a mid-range computer (3 GHz Xeon processor with 8 GB).
Using PNB, IT-based inversion and modified MO multiplication from Algo. inversions compared to the original IT [8] . For PNB GF(2 m ) multiplication, our MIU is as fast as the original MO with a small area overhead. Addition in GF(2 m ) has exactly the same time and area costs in both representations (GNB and PNB). Squaring in PNB GF(2 m ) is also a constant shift but where the shift amount is θ instead of 1. Conversion between PNB and GNB (for both directions) is a permutation of the coefficients. In parallel architectures this can be done using routing. In sequential implementations with several chunks per field element, this can be done using several registers accesses and routing. In practice, no conversion is required during the scalar multiplication.
IV. IMPLEMENTATION DETAILS AND RESULTS ON FPGAS
The architecture of our combined multiplication-inversion unit (MIU) is presented at Fig. 2 where = log 2 m is the address size for field elements at bit level. Our MIU supports both SMP A 2 k × A and standard multiplications A×B. In the MIU, the internal address computations (addition and reduction modulo m) are pipelined to reach higher frequencies. Its main computation block is our modified MO multiplier for PNB representation from Algo. 3 and illustrated at Fig. 3 . This block requires two dot product operators (the central rectangles in Fig. 3 ) instead of one (in the original MO multiplier) as well as three m-bit shift registers instead of two. It also requires two small w-bit registers for temporary transfers to the BRAM. In Fig. 2 , the m-bit register REG1 stores A during the complete IT sequence when computing A −1 . REG2, in BRAM, stores all intermediate/final products of the IT sequence. We used w = 32 to fit actual BRAM sizes in our target FPGAs. Our MIU operator is scalable for larger w (for other targets). Our solution is optimized for FPGAs with BRAMs. In case of ASIC implementation, we need to study an appropriate architecture.
All internal operations and transfers in our MIU are scheduled by the high-level finite states machine (FSM) presented at Fig. 4 . It schedules the complete IT algorithm with:
• Initial state LOAD_OP loads the operand(s) A in PNB GF(2 m ) for inversion (and possibly B for multiplication) sequentially using w-bit sub-words;
• MULT starts a standard MO multiplication A×B; • DONE is reached when the computation is finished (and the user can serially read the result); • REG×REG starts the computation of the first product A× A in the IT sequence using the modified MO multiplier from Algo. 3. The result A 2 is stored in the BRAM using the redundant representation from Sec. III-A.
• BRAM×BRAM starts the computation of a SMP A 2 k ×A;
• REG×BRAM starts the computation of a non-SMP multiplication A×B; • WAIT is reached at the end of one step in the IT sequence;
• NEXT determines what is the next step in the IT sequence (or the end of the computation).
Tab. IV and Tab. V report implementation results of GF(2 m ) inversions using our MIU on Spartan-6 LX75T and Virtex-4 LX100 FPGAs respectively. Three inversion algorithms have been fully implemented and validated: MO1 uses a single ×M 0 block with the original MO multiplier from [7] ; RM2 uses two ×M 0 blocks with the modified multiplier proposed in [13] (which eliminates t redundant XOR gates and produces two output w-bit words at addresses (i, i + m/2 )); In Tab. IV, V and VI, lines with a star (*) correspond to state-of-art algorithms for multiplication and inversion from [7] and [13] we implemented and optimized on the same FPGA than our PNB solution. PNB corresponds to our MIU architecture at Fig. 2 with one ×M 0 block from Algo. 3 ( Fig. 3) in PNB representation. The three versions have been implemented using the architecture of Fig. 2 with the corresponding internal multiplier. The inversion duration is reported in column named "Time". Bottom of Tab. V presents comparisons with state-of-art algorithms from [19] and [21] . Our solution is about 10 times slower but it is more than 10 times smaller. On a Virtex-4 LX100 (a mid-range device), there are 98304 LUTs. Then state-of-art solutions from [19] and [21] require 86 and 57 % of the device just for multiplication and inversion respectively.
The RM2 solution (with a large internal multiplier) leads to important frequency reduction compared to the original MO1 solution. Our PNB solution has a smaller impact on the circuit frequency. This can be an advantage when designing complete halving based accelerators. Our MIU with PNB representation seems to be interesting for the larger fields. For instance, in GF(2 571 ), we obtain a better area-speed trade-off than the RM2 for a cost similar to the simple MO1. Comparing various computation units is not an easy task when multiple areaspeed trade-offs are possible. The efficiency of our MIU also depends on the targeted FPGA: the best results are obtained for the most recent one (this is probably due to a better use of 6-input LUTs and more important routing resources).
Our solution actually factorizes some matrix-vector products in the SMPs. For large values of t, our solution leads to more factorizations of sub-expressions. We believe that the PNB solution could be used for fields whose type t is greater or equal to 10.
We estimated the cost and performance of various halving based ECC scalar multiplication algorithms. Paper [3] presents a rough estimation using r-NAF key recoding: mI/(r + 1) + mM (1 + 3/(r + 1)) μs where I is the inversion time and M the multiplication time (given in microseconds). We estimated the area cost and the computation time of all operators used in halving based methods (field addition, field multiplicationinversion, field square and square-root, trace, and other small specific operators). We compared several algorithms, the corresponding results are reported in Tab. VI. To compare the efficiency of various algorithms and area-time trade-offs, we also report the area-time product (ATP): the number of LUTs multiplied by the scalar multiplication time. The area results are very similar for both NAF and 3-NAF key recoding units (the area difference is about a few LUTs). Tab. VI shows that our PNB solution is more efficient when considering both computation time and silicon area. We think our approach can be used in applications where high security level is required on small circuits.
All algorithms have been validated using intensive functional simulation (at bit level) in Maple. All architectures have been validated using VHDL simulations using Modelsim. [21] 1.34 74 Hybrid IT (d=13) [19] 1.40 119
V. CONCLUSION
We proposed a new hardware unit, called MIU, combining GF(2 m ) multiplication and inversion operations for halving based ECC cryptosystems. We proposed a new representation of the field elements called permuted normal basis (PNB), modifications of the Massey-Omura algorithm for multiplication, and the Itoh-Tsujii algorithm for inversion. Our solution leads to about 20 % theoretical speed-up over previous works. It has been implemented on FPGAs and seems to be interesting for the larger fields. It also leads to a higher frequency compared to parallel solutions from state-of-art. On halving based ECC scalar multiplication, our PNB solution leads to a better area-time efficiency for large fields.
In the future, we plan to study extensions with multiple internal multiplication blocks, mixing our PNB and parallel-IT [21] solutions and also try to adapt our algorithm (especially the optimization of very large shifters) to ASIC targets. We believe that PNB representation and associated algorithms could be efficiently used for finite fields with a type t ≥ 10.
