In this paper, we propose fast finite field and elliptic curve (EC) algorithms useful for embedding cryptographic functions on high performance device such that most instructions take just one cycle. In such case, the integer multiplications and additions have the same computational cost so that the computational cost analyses that were previously done in traditional manner may be invalid and in some cases the new algorithms should be introduced for fast computation. In our implementation, column major method for field multiplication and BP inversion algorithm are used for fast field arithmetic, and mixed coordinates method is used for efficient EC exponentiation. We give here analyses on various algorithms that are useful for implementing EC exponentiation on CalmRISC microcontroller with MAC2424 coprocessor, as well as new exact analyses on BP (Bailey-Paar) inversion algorithm and EC exponentiation. Using techniques shown in this paper, we implemented EC exponentiation for various coordinate systems and the best result took 122ms, assuming 50ns clock cycle.
Introduction
Since Koblitz [8] and Miller [11] first introduced elliptic curve cryptography (ECC), many works [7, 10, 2] have shown that ECC can be very efficiently embedded into restricted hardware such as smart cards. During the past few years, most people believed that elliptic curve defined over GF (2 m ) was the only useful one for hardware implementation since it can be implemented with only simple bit operations. GF (p) and GF (p m ) were popular in computer software implementation but they were not in hardware implementation because a math coprocessor is required for its implementation in smart cards and it significantly increases the cost.
However ECC is not restricted to smart cards. There can be many hardware applications that already have a fast microcontroller with math coprocessor. One such application is a portable MP3 player, it needs a high performance microcontroller which supports fast integer multiplication and division to decode MP3 data and it also needs cryptographic services to prevent unauthorized copy of MP3 files. In this case, GF (p) or GF (p m ) is more likely the better than GF (2 m ), since they can utilize the fast multiplication and division instructions. GF (p) and GF (p m ) both are good choices for that kind of application, but GF (p m ) seems to be a better choice because there is no need to implement complex multiple precision routines and it utilizes the full capability of the microcontroller. Not only is it easy to implement but also more efficient, since there are no carry propagation and the inversion in GF (p m ) is far more efficient than that in GF (p). Many works [10, 2, 1] have shown that GF (p m ) is very suitable choice for computer software implementation.
We have implemented EC over GF (p 10 ) in CalmRISC microcontroller with MAC2424 coprocessor. CalmRISC is a very fast 8-bit RISC microcontroller and MAC2424 is a high performance math coprocessor that can compute 24-bit signed multiplication just in one cycle and that provides efficient division step instruction. Since integer multiplication and division are the critical operations in GF (p m ), such devices provide the best platform for implementing EC defined over GF (p m ).
This paper focuses on implementing EC defined over GF (p m ) in CalmRISC microcontroller with MAC2424 math coprocessor. In particular, we have used GF (p 10 ), satisfying OEF (Optimal Extension Field [2] ) conditions, where p = 2 16 − 165 = 0xff5b and the irreducible polynomial being f(x) = x 10 − 2.
Processor Features

CalmRISC Microcontroller
CalmRISC is Samsung's 8-bit low power RISC microcontroller that follows Harvard style. Both instruction and data can be fetched simultaneously without causing a stall using separate paths for memory access. CalmRISC has a 3-stage pipeline:
1. Instruction Fetch (IF) 2. Instruction Decode/Data Memory Access (ID/MEM) 3. Execution/Writeback (EXE/WB) The first stage (or cycle) is IF, where the instruction pointed to by the program counter is read into the instruction register (IR). The second stage is ID/MEM, where the fetched instruction (stored in IR) is decoded and the data memory access is performed, if necessary. The final stage is Execution and Writeback stage (EXE/WB), where the required ALU operation is executed and the result is written back into the destination registers. Since CalmRISC instructions are pipelined, the next instruction fetch is not postponed until the current instruction is completely finished, but it is performed immediately after the current instruction fetch is done.
Most of CalmRISC instructions are 1-word instruction, while branch instructions such as long "call" and "jump" instructions are a 2-word instruction. Thus the number of clocks per instruction (CPI) is 1 except for long branches, which take 2 clock cycles per instruction.
MAC2424 Math Coprocessor
MAC2424 is a 24-bit high performance fixed-point DSP coprocessor for Calm-RISC microcontroller. Main datapaths are constructed to 24-bit width, but it can also perform 16-bit data processing efficiently in 16-bit operation mode.
There are two modes of operation in MAC2424: 24-bit mode operation and 16-bit mode operation.
24-Bit Mode Operation.
-Signed fractional/integer 24 x 24-bit multiplication in single cycle -24 x 24-bit multiplication and 52-bit accumulation in single cycle -24-bit arithmetic operation -Two 48-bit multiplier accumulator with 4-bit guard -Two 32K x 24-bit data memory spaces 16-Bit Mode Operation.
-Four-Quadrant fractional/integer 16 x 16-bit multiplication in single cycle -16 x 16-bit multiplication and 40-bit accumulation in single cycle -16-bit arithmetic operation with 8-bit guard -Two 32-bit multiplier accumulator with 8-bit guard -Two 32K x 16-bit data memory spaces
Programming Environment
CalmSHINE is a C compiler for CalmRISC and MAC2424. It also supports assembly language. Thus architecture specific low-level instructions (such as 24-bit by 24-bit multiplication and accumulation) can be utilized via assembly language. Non-architecture specific functions may be written in C language.
Finite Field Arithmetic
Optimization of finite field arithmetic is very critical to the overall performance of EC operations. In this section, we describe algorithms for implementing efficient finite field arithmetic. In our implementation, we use GF (p m ) where p =0xff5b (16 bits), m = 10 and f(x) = x 10 − 2 as an irreducible polynomial. Although CalmRISC supports 24-bit by 24-bit multiplication, it is signed multiplication and memory access is very inefficient in 24-bit mode due to the memory alignment. This is why we use 16-bit p.
Modular Reduction
Modular reduction is used very frequently and it is the bottleneck of the performance of finite field arithmetic. In most computer software implementations, the modular reduction using simple bit shifts and additions is a popular choice and provides a very good performance when p is a pseudo-Mersenne number. This is due to the fact that the division instruction is very slow for most hardware. However this is not the case for MAC2424, since every operation is simple and it takes only one cycle except long-branch operations. Thus modular reduction using division step instruction is desirable for MAC2424. Moreover it has an advantage that the intermediate values do not need to move around between the registers. During the division steps in MAC2424, the dividend and divisor keep their position until the division ends. In our implementation, the modular reduction by repeated division step instruction takes 39 cycles, while the modular reduction by bit shifts and additions takes 90 cycles.
Field Multiplication and Squaring
We considered three different algorithms for finite field multiplication, Karatsuba-Offman algorithm (KOA), column major method and row major method. First we consider KOA. KOA works by reducing the number of multiplications while increasing the number of cheap additions/subtractions by the recursively. In general, it gives about 10 ∼ 20% performance enhancement for most architecture. However the computational cost for multiplication and addition/subtraction is exactly the same for MAC2424, so reducing the number of multiplication with sacrificing the number of addition/subtraction does not help.
Row major method is just a schoolbook method, so we skip the description here. Column major method is described as follows. This is not a general method but it is for our specific case where f(x) is binomial (f(x) = x m − α), and with this algorithm the polynomial reduction and multiplication can be done simultaneously.
Algorithm 1 (Column Major Multiplication).
Input:
Row major method and column major method both may be good choices because the required number of operation is both equal, but the row major method has the disadvantage of storing full one intermediate row in a temporary memory. Moreover column major is preferable since MAC2424 can multiplyand-accumulate simultaneously in one cycle. So the column major method of field multiplication is definitely a better choice for MAC2424. Algorithm 1 uses m 2 + m − 1 multiplication instructions and m modular reductions with modulus p. Algorithm 1 can be similarly applied to field squaring, so m(m+1) 2 + m − 1 multiplication instructions and m modular reductions with modulus p are needed for field squaring. Modular reduction is performed only m times because product of two subfield elements can be safely accumulated multiple times in MAC2424's accumulator and α is small enough (α = 2) that z in Algorithm 1 never overflows. This means we don't need to reduce the intermediate values, instead we need to reduce just the final values. In our implementation, field multiplication takes 723 cycles and field squaring takes 717 cycles. The ratio of field squaring to field multiplication is almost close to 0.9 in our case. This is due to the fact that the most of the time is taken in modular reduction and that MAC2424 can multiply very fast.
Field Inversion
There have been many research efforts on finite field inversion algorithm. Wellknown algorithms are extended Euclidean algorithm, almost inversion algorithm, and their variants. The efficiency of a finite field inversion algorithm can be roughly measured by counting the subfield inversion it uses since the subfield inversion is the most time consuming job among the subfield arithmetic. Even for MAC2424, subfield inversion could not be done fast. It takes 670 cycles in our implementation. Among the various finite field inversion algorithms we consider IM (Inversion with Multiplication) [10] and BP [1] algorithms since only they require just one subfield inversion. Here we review the IM algorithm and the BP algorithm.
Algorithm 2 (IM Inversion Algorithm
In Algorithm 2, capitalized variables represent the polynomial representation with indeterminate x, deg(F ) denotes the degree of polynomial F and F i is the coefficient of
The BP algorithm is exponentiation-based inversion algorithm using the fact that
The following equation shows that only m−1 subfield multiplications are required for p-th power of a field element. Note that α pi/m mod p and pi mod m (i = 1, . . . , m − 1) should be pre-computed beforehand.
One might have noticed that the p-th power repeated by i times can be collapsed to one p i -th power which have the same computational cost with p-th power. Now we have all apparatus for BP algorithm:
Algorithm 3 (BP Inversion Algorithm). Computes
In Algorithm 3, (A(x) r ) −1 is always a subfield inversion. In this algorithm, computing A(x) r−1 is very critical to the performance of finite field inversion. Since r − 1 is in a special form we can compute A(x) r−1 efficiently by addition chain and p i -th power. The efficient method was already shown in [1] , but we want to show that the analysis shown in [1] and [10] is incorrect. Computing A r−1 where r = p + p 2 + · · · + p 5 Our method
Bailey & Paar's method B ← A p = A (10) B ← A p = A (10) T1 ← AB = A (11) T1
The required number of p i -th power in BP algorithm is not always log 2 (m − 1) + H W (m − 1) where H W (·) is Hamming weight. Instead, the number of p i -th power is at least one less than this when m is even except for m = 2. Table 1 shows that one p i -th power can be reduced for m = 6 considering this fact. In general, if m is even, the exact number of p i -th power is: Table 2 shows the new exact analyses of BP algorithm and it is compared with IM algorithm. Note that we corrected minor counting errors that were shown in previous works [1, 10] . We also show analyses for the 'special case' when the product of two subfield elements can be accumulated without overflow, and when α is small, that is our case. In that case, we can save m − 1 modular reduction in field multiplication and 1 modular reduction in final field multiplication that only computes the constant term of A(x) r from A(x) r−1 and A(x).
In Table 2 , the term "multiplication" means general multiplication performed by the processor, not the field or subfield multiplication. According to Table 2 , the complexity of BP looks greater than that of IM. However if m is not too large, BP can provide better performance. When m = 10, that is the specific case for our implementation, IM requires 283 multiplication and 132 modular reduction, and BP requires 493 multiplication and 87 modular reduction (note that one less number of p i -th power is required for m = 10 than the analysis shown in Table 2 ). Recall that, for MAC2424 the number of modular reduction is more significant (costs much more) than that of multiplication. Hence we conclude that BP algorithm will perform much better in our case (283+132×39 = 5431(IM) > 493 + 87 × 39 = 3886(BP)). In addition, not only does the BP algorithm take fewer cycles but also it is simple to implement as looping or branching is not needed. More improvement is possible for BP algorithm when m has a factor of 2. In our specific case, p i -th power is computed as follows.
Algorithm 4 (p-th Power Algorithm for Our Specific Case). B(x) = A(x) p
1. Array X is pre-computed as X = {1, 0x5AC3, 0x7B13, 0x9EEF, 0x7E9E, 0xFF5A(= -1), 0xA498, 0x8448, 0x606C, 0x80BD} 2. For i = 0 to 9
Algorithm 4 requires 8 multiplications, 8 modular reductions and 1 subtraction. Note that multiplying −1 is equal to subtracting from p. The following is the pre-computed value X for p 2 -th power and p 4 -th power algorithm for our specific case, respectively. To compute p 2 -th or p 4 -th power we only need to substitute the X in Algorithm 4 with the following Xs.
For p 2 -th power: X = {1, 0x7B13, 0x7E9E, 0xA498, 0x606C, 1, 0x7B13, 0x7E9E, 0xA498,0x606C} For p 4 -th power: X = {1, 0x7e9e, 0x606c, 0x7b13, 0xa498, 1, 0x7e9e, 0x606c, 0x7b98, 0xa498}
The above each pre-computed array X has two 1s, so only 8 multiplications and 8 modular reductions are needed to compute p 2 -th or p 4 -th power. And even more, X [1−4] is identical to X [6−9] , thus the memory can be saved. Now we are ready to construct the most efficient method to compute A(x) r−1 for our specific case, which leads to the least number of field multiplications and fully utilizes the above facts. The following algorithm efficiently computes A(x) r−1 and we used this in our actual implementation.
(T2 = A p+p 2 +p 3 +p 4 +p 5 +p 6 +p 7 +p 8 +p 9 ) (mul)
As shown above A(x) r−1 is done by 4 field multiplications and 4 p i -th powers. As it can be seen, since m is even, the number of field multiplication is equal to that of p i -th power. Table 3 shows our finite field GF (p 10 ) implementation results. The cycles for each functions were measured using Samsung's CalmSHINE compiler and all finite field functions were written in assembly. All cycles include the functional overhead such as parameter loading and data movements when entering and leaving the functions. In Table 3 , 'I/M' is the ratio of field inversion to the field multiplication and 'S/M' is the ratio of field squaring to the field multiplication. 
Performance of Field Arithmetic
Elliptic Curve Arithmetic
In this section we discuss the method of optimizing EC exponentiation using mixed coordinate system. We optimize the EC exponentiation by combining mixed coordinate system from [5] and the Lim-Hwang's method [10] .
Signed Window Algorithm for EC Exponentiation
Signed window method is known to be the most efficient method for computing EC exponentiation (EC scalar multiplication) excluding the fixed base exponentiation algorithms. Let k be a positive integer, and suppose we want to compute kP where P is an arbitrary point on an elliptic curve. Then k can be expressed as follows.
where W [i] is odd, −2 w + 1 ≤ W [i] ≤ 2 w − 1 and w ≤ k i . To compute EC exponentiation with signed window method, first pre-compute P i = iP (i = ±1, ±3, · · · , ±(2 w − 1)) and then evaluate the following equation using the precomputed values.
However pre-computation is not needed for −2 w + 1 ≤ W [i] ≤ −1 since negating EC point can be done easily, that is computing P W [i] = −P −W [i] takes negligible time. It is also possible to implement an EC subtraction function to get rid of the redundant EC negating time using a small portion of additional program memory.
The first k v doublings, in case W [v] < 2 w−1 , can be more efficiently computed [5] . There have been an analysis on this in [5] , however it is incorrect and we want to correct it here.
First we need to consider the probability that the bit size of W [v] equals j. This can be easily computed and the following equation shows it.
Use the fact that the above modification reduces w + 1 − j doublings and increases 1 addition, and that we do not apply the above modification when j = w. Then can be easily verified that the average number of doublings reduced is 3 2 − 1 2 w−1 and the average number of addition increased is 1 2 . Note that the average value of k v is w + 2, so w + 2 doublings are needed in an average case if we don't use the above modification.
Mixed Coordinates System
We use mixed coordinates system to speed up computation of EC exponentiations. For a given rational integer k and an elliptic curve point P , we can evaluate the EC exponentiation kP by the following steps.
The EC exponentiation kP is computed by repeating basic step
, the computational cost for a basic step is
In this paper, we denote affine coordinate as A [6, 12] , projective coordinate P [9, 6] , Jacobian coordinate J [3] , modified Jacobian coordinate J m [10] and Chudnovsky-Jacobian coordinate J C [3] . Note that we use a different Modified Jacobian coordinate system. The Modified Jacobian coordinate shown in [10] is better because it reduces one field addition/subtraction in EC addition.
Let us now discuss suitable coordinate systems for C 1 , C 2 and C 3 . Since doublings in C 1 are repeated most frequently, we should choose C 1 such that t(2C 1 ) is the smallest, thus we select J m as C 1 .
Since pre-computation P W [i] is done during online time in signed window method, i.e. the resulting values are not saved in auxiliary memory for another exponentiation, the pre-computation time is included in total elapsed time. Thus we should select coordinate C 3 suitable to compute the values P w [i] . Cohen et al. [5] proposed to use either affine coordinate or Chudnovsky-Jacobian coordinate as C 3 and to select one by comparing t(J C + J C ) and t(A + A). However, since the ratio I/M is relatively small, we chose C 3 = A. Then there are two precomputation methods. First, we can compute P i = iP (i = 1, 3, 5, . . . , 2 w − 1) in affine coordinate by simple method by repeating P i+2 = P i + P for i = 1, 3, 5, . . . , 2 w − 3 where P 1 = P and P = P + P . Here, the total computational cost is t(2A)
To reduce the number of inversion in F (p m ), we can apply 'Montgomery trick of simultaneous inversion [4] ' with sacrificing the number of multiplications and squares. The total cost in that case is wI + (5 · 2 w−1 + 2w − 10)M + (2 w−1 + 2w − 3)S. Table  4 shows the expected computational cost for these two methods. In Table 4 , the computational costs for pre-computation in case of C 3 = J c were also shown for comparison. In Table 4 , Montgomery's trick is shown to be the best choice. However we didn't use the Montgomery trick, since online pre-computation time is just a very small part of EC exponentiation, and it does not significantly improves the EC exponentiation time. In addition, the Montgomery's method requires much more program memory than the simple method without giving much improvement in performance. We chose to use simple method in online pre-computation.
Let us discuss suitable coordinate for C 2 . Since we selected modified Jacobian coordinate and affine coordinate for C 1 and C 3 respectively, coordinate for
. Although there are 5 candidates for C 2 , Table 5 shows computational amounts to compute a basic step (Eqn. 6) using 3 candidates of least cost. In Table 5 , we assumed window size w = 4.
In Table 5 , the 1-bit gap between the two neighboring diminished windows is considered to be the worst case (i.e. k i = w + 1 for i = 1, 2, . . . , v) , and the 2-bit gap is considered to be the average case (i.e. k i = w + 2 for i = 1, 2, . . . , v) . According to Table 5 , we can select either Jacobian coordinate(J) or Chudnovsky-Jacobian coordinate(J c ) for C 2 . Since Chudnovsky-Jacobian coordinate uses 2 more finite field F (p m ) elements than Jacobian, it is inefficient in storage. Thus we select Jacobian coordinate for C 2 . Consequently, for (C 1 , C 2 , C 3 ) = (J m , J, A), w = 4 and |k| = 160, we can compute an EC exponentiation kP with following computational cost in average.
≈ 8I + 849.7M + 763.7S ≈ 1596M (7) In worst case, we can compute kP with the following cost.
Implementation Results
We implemented elliptic curve exponentiation in CalmRISC with MAC2424 coprocessor using all algorithms shown in previous sections. All finite field functions were written in assembly language since time critical low-level instructions cannot be programmed in high-level language, and all elliptic curve functions were written in C language on top of the finite field functions. Table 6 shows our implementation of elliptic curve exponentiation in various coordinate systems. Note that the result shown in Table 6 was measured using CalmSHINE C compiler. CalmSHINE compiler measures the clock cycle for each function exactly, however it can be done only in 'debug build mode' and CalmSHINE compiler does very poor code optimization in 'debug build mode'. This is why the result shown in Table 6 is slower than what is expected. In real implementation with optimized codes, it will perform much better. Referring to Table 6 , mixed coordinates is the best with almost 10% of improvement over fastest single coordinate system (Modified Jacobian). 
Conclusions
In this paper, we proposed optimized algorithms for implementing EC in Calm-RISC with MAC2424 math coprocessor, in which all instructions take just one clock cycle, and we showed implementation results and full analyses on their performances. We also gave new exact alalyses on BP inversion algorithm and EC exponentiation. In our implementation, we used column major method for field multiplication and slightly improved BP algorithm for field inversion. Mixed coordinates using Lim-Hwang's Modified Jacobian coordinate was applied for for efficient EC exponentiation. Our implementation of EC exponentiation took about 122ms (assuming one cycle takes 50ns), which is about 10% of improvement over single coordinate system. This result can be much better in real implementation with CalmSHINE's optimized compile mode. Although the algorithms shown in this paper is focused on our specific case, it can be easily applied to other environments where all basic arithmetic instructions have the same computational cost.
