In this work, an improved algorithm for Montgomery modular inversion over GF(2^m) is proposed. Moreover, a novel scalable hardware architecture for the proposed algorithm is presented, which is parameterizable and amenable to interfacing with special-purpose processors such as microcontrollers. The architecture supports operations over finite fields GF(2^m) for m ≤ 571 without the need to reconfigure the hardware. The results show that this work can be exploited to construct low-resource elliptic curve cryptosystems (ECC).
INTRODUCTION
Since their introduction by Miller and, independently, Koblitz in 1985 (D. Hankerson, 2004), elliptic curve cryptosystems have been considered the best compromise between the required security and the attainable performance for many resource-constrained security systems. Scalability versus performance, in particular for low-resource applications, has always been a challenging trade-off in ECC hardware implementations. The efficiency of this trade-off depends significantly on the efficient implementation and scalability of the modular arithmetic of the underlying field. The computation of the modular inversion is the most challenging operation from this perspective, which motivates the contribution of the work presented in this paper.
In the literature, several algorithms for computing the multiplicative inverse in GF(2^m) have been proposed (D. Hankerson, 2004). Some of them are based on repeated modular multiplication, such as those derived from Fermat's little theorem. In contrast, others apply the greatest common divisor (GCD) algorithm, which has many variants. All these variants can compute the modular inverse in about 2m iterations. However, the Montgomery inversion algorithm (B. Kaliski, 1995) offers better performance and can perform the inversion in fewer than 2m iterations. Consequently, this work investigates Montgomery modular inversion and develops algorithmic modifications that reduce the hardware complexity whilst offering scalable and parameterized inversion with a low-area architecture on FPGAs.
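The two families of methods mentioned above can be contrasted in a short software sketch. The code below is only an illustration under stated assumptions: elements of GF(2^m) are held as Python integers (bit i is the coefficient of x^i), the function names are hypothetical, and the Montgomery routine is the classical two-phase Kaliski-style procedure adapted to binary polynomials, not the modified algorithm proposed in this paper.

```python
# Assumption: NIST degree-163 pentanomial x^163 + x^7 + x^6 + x^3 + 1, used only as sample data.
NIST_P163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2_mul(a: int, b: int, p: int, m: int) -> int:
    """Multiply two GF(2^m) elements (both of degree < m) modulo p(x)."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a              # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1                   # multiply a by x
        if (a >> m) & 1:          # degree reached m: subtract (XOR) p(x)
            a ^= p
    return acc


def fermat_inverse(a: int, p: int, m: int) -> int:
    """Multiplication-based inversion: a^(2^m - 2) by square-and-multiply."""
    result, base, e = 1, a, (1 << m) - 2
    while e:
        if e & 1:
            result = gf2_mul(result, base, p, m)
        base = gf2_mul(base, base, p, m)
        e >>= 1
    return result


def kaliski_montgomery_inverse(a: int, p: int, m: int):
    """Return (a^-1 * x^m mod p, k), where k is the number of phase-1 iterations."""
    if a == 0:
        raise ValueError("zero has no inverse")
    u, v, r, s, k = p, a, 0, 1, 0
    # Phase 1 keeps the invariants a*r = u*x^k and a*s = v*x^k (mod p).
    while v:
        if not u & 1:
            u >>= 1
            s <<= 1
        elif not v & 1:
            v >>= 1
            r <<= 1
        elif u.bit_length() > v.bit_length():   # degree comparison
            u = (u ^ v) >> 1
            r ^= s
            s <<= 1
        else:
            v = (v ^ u) >> 1
            s ^= r
            r <<= 1
        k += 1
    iterations = k
    while r.bit_length() > m:                   # bring r below degree m
        r ^= p << (r.bit_length() - 1 - m)
    # Phase 2: turn a^-1 * x^k into a^-1 * x^m.
    while k > m:                                # divide by x modulo p
        if r & 1:
            r ^= p
        r >>= 1
        k -= 1
    while k < m:                                # multiply by x modulo p
        r <<= 1
        if (r >> m) & 1:
            r ^= p
        k += 1
    return r, iterations
```

Multiplying the returned value by a(x) should give x^m mod p(x), which is the Montgomery-domain representation of one. The iteration count of the first phase depends on the operand, which is the data dependence that makes a direct hardware mapping awkward.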
A modified Montgomery inversion algorithm is therefore proposed and implemented on the smallest and lowest-cost Xilinx FPGA. The architecture is parameterized to support variable word lengths and has been implemented with 8, 16, 32 and 64 bit word lengths for finite field lengths m = 163 and m = 571. The results obtained show that the 32-bit data path designs are the best compromise between the low area requirements and the practical performance in terms of throughput (4.63 Mbps for m = 163).
This paper is organized as follows: Section 2 presents the theoretical background of ECC over GF(2^m). Section 3 gives an overview of Montgomery modular inversion. The proposed improved algorithm is presented in Section 4. The description of the circuit operation and the FPGA implementation are in Section 5. Finally, Section 6 shows the performance and results of the implementation on a state-of-the-art FPGA.
ELLIPTIC CURVE ARITHMETIC OVER GF(2^m)
Briefly, a cryptosystem based on an elliptic curve E over the finite field GF(2^m) is mainly used for the encipherment of a point P by a key K such that Q = K·P.
Thus, we can observe from equations (2) and (3) that one inversion is involved in both point addition and point doubling over the elliptic curve E.
MONTGOMERY INVERSION AND ITS VARIANT
Based on the extended binary GCD algorithm and the Montgomery trick for computing modular multiplication (L. Montgomery, 1985), B. Kaliski was the first to propose the Montgomery inverse algorithm for a given irreducible polynomial p(x) and for any element a(x) ∈ GF(p) or GF(2^m). Although the Kaliski algorithm is simple, it has no fixed number of iterations, which makes it difficult to map into hardware efficiently. M. Shieh, J. Chen, and C. Ming (M. Shieh, 2006) developed a modification of Kaliski's algorithm, shown in Algorithm 1, in which only one phase is required. This improvement eliminates the data dependency between the first and second phases and also avoids the zero-comparison operation required by the original algorithm.
The proposed Algorithm 2 saves the m XORs used to perform the degree comparison. Moreover, these m XORs lie on the critical path of the data path. Hence, there are significant savings not only in area but also in the delay caused by the degree comparison in Algorithm 1.
Algorithm 2 proceeds as follows. At the beginning, the counter, the state bit, and the vectors u, v, s and r are initialized, so that u > v. This means that the degree of u has to be decremented according to the BGCD algorithm. Further, at the start of the algorithm, u_0 is always equal to 1. There are two possible conditions for the vector v. If v_0 = 1, then in the second iteration the counter is incremented by one and the state bit is set to one. The degree of u is decreased by XORing u and v, dividing the result by 2 and saving it in u. In parallel, vector s is XORed with r and vector s is doubled; the results of the two operations are stored in r and s respectively. The other possible condition is v_0 = 0, in which case the vector v is even. The counter is then incremented by one, but the state bit remains zero.
Next, the vector v is divided by two and the vector r is doubled. Accordingly, the state bit is 0 and the counter is greater than 0. For state bit = 1, if u_0 = 1 and v_0 = 1, the degree of v is greater than that of u. Hence, the degree of v has to be reduced. Thus, the vector v is XORed with u and the result is stored back in v. In parallel, vector r is XORed with s and vector r is doubled; the results of the two operations are stored in s and r respectively, and the counter is decremented by one. If the counter becomes zero, the state bit is reset to zero; otherwise it remains one. The algorithm continues to follow the steps of Algorithm 2 for 2m iterations. After 2m iterations, the value of the vector u converges to one, while the values of the vectors v and s converge to zero. Finally, the inverse of a(x), represented in the Montgomery domain, is the value held in the vector r.
As shown in Figure 2 and Figure 3, both the u-v and s-r blocks have a DBRAM that holds the vectors u, v, s, and r. The DBRAM in each block is addressed by a counter controlled by the control block. The counters are scalable and can address the DBRAM up to a memory depth of 2 × ((m − m mod WordLength)/WordLength + 1), where m is the length of the vector a(x). Both the u-v and s-r blocks have two shifting units. In the u-v block, the shifting unit shifts right, whereas in the s-r block it shifts left. Both units load the word to be shifted, store the most significant digit (MSD) for the left shift unit or the least significant digit (LSD) for the right shift unit so that it can be added to the next word, shift left or right by the corresponding number of shift counts, and then write the shifted word to the DBRAM port. The reduction unit is designed to be parameterized and scalable to accommodate finite fields with m ≤ 571 in addition to different data path widths.
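The word-serial storage and shifting described above can be modelled in a few lines of software. This is a behavioural sketch only: the function names are hypothetical, the shift is fixed at one bit per pass (the text allows a configurable shift count), and a Python list stands in for the memory that holds one vector.

```python
def word_count(m: int, w: int) -> int:
    """Words needed for one m-bit vector: (m - m mod w)/w + 1, as in the text."""
    return (m - m % w) // w + 1


def dbram_depth(m: int, w: int) -> int:
    """Each block stores two vectors (u and v, or s and r)."""
    return 2 * word_count(m, w)


def to_words(value: int, m: int, w: int) -> list:
    """Split a vector into w-bit words, least significant word first."""
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(word_count(m, w))]


def from_words(words: list, w: int) -> int:
    return sum(word << (i * w) for i, word in enumerate(words))


def shift_right_words(words: list, w: int) -> list:
    """One-bit right shift, top word first; the saved LSD feeds the next word."""
    out = [0] * len(words)
    carry = 0
    for i in reversed(range(len(words))):
        out[i] = (words[i] >> 1) | (carry << (w - 1))
        carry = words[i] & 1            # least significant digit of this word
    return out


def shift_left_words(words: list, w: int) -> list:
    """One-bit left shift, bottom word first; the saved MSD feeds the next word."""
    mask = (1 << w) - 1
    out = []
    carry = 0
    for word in words:
        out.append(((word << 1) & mask) | carry)
        carry = word >> (w - 1)         # most significant digit of this word
    return out                          # a carry out of the top word is dropped
```

For m = 163 and a 32-bit word, `word_count` gives 6 words per vector and `dbram_depth` gives 12 entries per block. Because the formula always leaves at least one spare bit above position m − 1, a single left shift of an m-bit vector cannot overflow the top word.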
NIST-recommended reduction polynomials (NIST, 2000) are used to implement the reduction unit, as they are designed to provide both security and high performance.
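As an illustration of why these sparse polynomials suit a compact reduction unit, the sketch below folds the high-order bits of a product back into the low-order part using only the few nonzero terms of each polynomial. The table lists the five binary-field reduction polynomials from the NIST recommendation (a trinomial or pentanomial for each m); the bit-serial folding loop and the name `reduce_nist` are assumptions made for this example and do not describe the word-level reduction unit of the proposed architecture.

```python
# Exponents of the low-order terms of each NIST reduction polynomial,
# e.g. m = 163 means p(x) = x^163 + x^7 + x^6 + x^3 + 1.
NIST_TAPS = {
    163: (7, 6, 3, 0),
    233: (74, 0),
    283: (12, 7, 5, 0),
    409: (87, 0),
    571: (10, 5, 2, 0),
}


def reduce_nist(c: int, m: int) -> int:
    """Reduce polynomial c (as an integer bit vector) modulo the degree-m NIST polynomial."""
    taps = NIST_TAPS[m]
    # Work from the top bit downwards; since x^m = sum of x^t over the taps (mod p),
    # a set bit at position i >= m is replaced by bits at positions i - m + t.
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= 1 << i
            for t in taps:
                c ^= 1 << (i - m + t)
    return c
```

Each folded bit touches at most four lower positions, so the reduction needs only a handful of XORs per bit regardless of m, and the same routine serves every field size up to m = 571 by switching the tap set.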
CIRCUIT DESIGN

CONCLUSIONS AND RESULTS
The proposed modified algorithm for Montgomery inversion has been fully modelled in VHDL and implemented on the smallest and lowest-cost chip available from the Xilinx Spartan III family (XC3S50). The proposed architecture is parameterized in order to support variable word lengths. A scalable architecture has been implemented with 8, 16, 32 and 64 bit word lengths. Tables 1 and 2 show the implementation results for the different widths after place and route for finite field lengths m = 163 and m = 571. As expected, the control block and counters dominate the critical path of the design. Thus, increasing the operand size has only a minor effect on the working frequency. The results show that the 32-bit data path designs are the best compromise between the low area requirements and the practical performance in terms of throughput (4.63 Mbps for m = 163). Further, the proposed architecture, with its low hardware resources, is expected to yield correspondingly lower power budgets and would therefore be suited for low-resource ECC implementations.
