To improve the speed of modular multiplication in an ECC processor over GF(p), this paper presents a novel hardware implementation of the Montgomery algorithm. Based on an analysis of the basic Montgomery modular multiplication algorithm, this work applies multi-step operation to the Montgomery algorithm, which accelerates the computation by reducing the number of clock cycles. Simulation with Modelsim indicates that one modular multiplication completes in only 16 clock cycles. The hardware architecture was then evaluated on the Altera Stratix III family. By following the new design concept, the multiplier achieves higher performance. Compared with other modular multipliers, the computation time of the improved modular multiplier is lower, decreasing by 42% to 0.2 µs. It is estimated that a 256-bit scalar point multiplication over GF(p) would take about 0.76 ms.
Introduction
A modular multiplier is very important for public key cryptosystems such as RSA [1] and ECC [2]. However, compared with other public key cipher algorithms, ECC offers higher speed, smaller storage space and lower bandwidth requirements. In particular, because of their limited computing resources, many wireless devices and smart cards implement ECC. It is well known that modular arithmetic largely determines the speed of ECC, so accelerating modular multiplication is particularly critical for ECC.
Various algorithms have been proposed in the literature to speed up modular multiplication. These schemes fall into two general types: multiply-then-reduce and Montgomery's method. The former is efficient when the modulus is a Mersenne prime [3]. Many effective methods have been proposed in previous work. M. Kaihara et al. [4] and M. Schramm et al. [5] present shift-and-add modular multiplication algorithms. In prime field ECC processors, a carry-free structure is necessary to avoid the long data paths caused by carry propagation, and redundant schemes such as Carry Save Arithmetic (CSA) or Redundant Signed Digits (RSD) have been applied in different designs. Tenca and Koc [6] propose a scalable word-based structure. D. Harris et al. [7] present a scheme that uses a left-shifted multiplicand and modulus instead of shifting the intermediate result. Rupali Verma et al. [9] present a compute-early, word-based scalable Montgomery architecture that computes the most significant bit of a word by applying 2 XOR operations. Recently, Shiann-Rong Kuang et al. [10] proposed a simple and efficient Montgomery multiplication algorithm that avoids carry propagation at each addition, so that a low-cost and high-performance Montgomery modular multiplier can be implemented accordingly. Hang Yuan et al. [12] presented a systolic array Montgomery multiplier. Its frequency is higher on a 0.13 µm ASIC process, but it needs a large number of clock cycles to compute a Montgomery modular multiplication.
Although the previous Montgomery algorithms are efficient, most of them occupy many DSP units to reduce computation time, so they are neither flexible nor portable. Therefore, this paper describes the design and implementation of a 256-bit Montgomery modular multiplier that does not occupy internal DSP units. The design can be applied to various hardware platforms such as ASICs, which gives it good portability and flexibility. Section II presents the concept and structure of ECC and the basic architecture of Montgomery modular multiplication. The improved structure of Montgomery modular multiplication is described in Section III. Section IV presents the hardware implementation, simulation results and comparisons with other designs. Finally, the paper is concluded.
ECC and Montgomery modular multiplier

ECC structure
The ECC structure can be described in four layers, as shown in Figure 1. The top layer is the ECC protocol. The protocol layer is realized by the second layer, point multiplication, which in turn is implemented by the third layer, point doubling and point addition. The lowest layer consists of the basic finite field operations: modular addition, modular subtraction, modular multiplication and modular inversion.
Modular addition and modular subtraction can be realized easily. Modular inversion is the most time-consuming operation, but point doubling and point addition can avoid it by transforming into Jacobian coordinates. Therefore, improving the speed of modular multiplication is essential for an ECC scheme. This paper focuses on the modular multiplication algorithm.
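The layering described above can be sketched in Python as follows. This is only an illustrative sketch, not the paper's design: the toy curve y^2 = x^3 + 2x + 3 over GF(97), the base point (3, 6) and all function names are assumptions chosen for readability, and affine coordinates with a modular inverse are used for brevity instead of Jacobian coordinates.

```python
# Toy parameters (assumptions for illustration only, far too small for real use).
P_MOD = 97            # field prime
A, B = 2, 3           # curve y^2 = x^3 + A*x + B over GF(P_MOD)
BASE = (3, 6)         # a point on the toy curve: 6^2 = 36 = 27 + 6 + 3
INF = None            # point at infinity

# Lowest layer: finite field operations.
def mod_add(x, y): return (x + y) % P_MOD
def mod_sub(x, y): return (x - y) % P_MOD
def mod_mul(x, y): return (x * y) % P_MOD
def mod_inv(x): return pow(x, -1, P_MOD)      # requires Python 3.8+

# Third layer: point addition and point doubling (affine coordinates).
def point_add(p, q):
    if p is INF:
        return q
    if q is INF:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2 and mod_add(y1, y2) == 0:
        return INF                             # p + (-p) = infinity
    if p == q:                                 # doubling: slope of the tangent
        num = mod_add(mod_mul(3, mod_mul(x1, x1)), A)
        lam = mod_mul(num, mod_inv(mod_mul(2, y1)))
    else:                                      # addition: slope of the chord
        lam = mod_mul(mod_sub(y2, y1), mod_inv(mod_sub(x2, x1)))
    x3 = mod_sub(mod_sub(mod_mul(lam, lam), x1), x2)
    y3 = mod_sub(mod_mul(lam, mod_sub(x1, x3)), y1)
    return (x3, y3)

# Second layer: point multiplication by double-and-add.
def point_mul(k, p):
    result, addend = INF, p
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result
```

Every call to `point_add` reduces to a handful of `mod_mul` calls, which is why the speed of the modular multiplier dominates the speed of the upper layers.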
Montgomery modular multiplier arithmetic
The algorithm for modular multiplication was proposed by P. L. Montgomery in 1985 [11]. It is considered to be the fastest algorithm for computing AB mod M on computers when the values of A, B and M are large. It is a method for performing modular multiplication without a division operation [12]. The following part gives a brief description of the Montgomery algorithm.
Suppose that using equation (1) 
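For orientation, the textbook form of Montgomery multiplication can be sketched in Python as below. This is a hedged sketch of the standard reduction, not necessarily identical to the paper's Algorithm 1: the choice R = 2^k, the precomputed constant `m_prime` = -m^(-1) mod R, and all function names are assumptions made for illustration.

```python
def montgomery_setup(m, k):
    """Return (R, m_prime) for odd m < 2^k, with R = 2^k and m_prime = -m^-1 mod R."""
    r = 1 << k
    m_prime = (-pow(m, -1, r)) % r     # requires Python 3.8+
    return r, m_prime

def mont_mul(a_bar, b_bar, m, k, m_prime):
    """Return a_bar * b_bar * R^-1 mod m using only multiply, add, mask and shift."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    q = ((t & mask) * m_prime) & mask  # chosen so that t + q*m is divisible by R
    u = (t + q * m) >> k               # exact division by R, no trial division
    return u - m if u >= m else u      # single conditional subtraction

def to_mont(a, m, k):
    """Map a to its Montgomery form a * R mod m."""
    return (a << k) % m

def from_mont(a_bar, m, k, m_prime):
    """Map a Montgomery-form value back to an ordinary residue."""
    return mont_mul(a_bar, 1, m, k, m_prime)
```

A product AB mod M is obtained by converting both operands with `to_mont`, multiplying with `mont_mul`, and converting the result back with `from_mont`; the conversions pay off when many multiplications are chained, as in point multiplication.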
Improved Montgomery modular multiplier
According to Algorithm 1, Montgomery's original algorithm needs $k \times k$-bit multiplications and $2k$-bit additions. Moreover, its intermediate results reach 2n bits, so the hardware cost is very high and the algorithm is impractical to implement directly in hardware. Therefore, Algorithm 1 needs to be optimized appropriately to make it more suitable for hardware implementation.
Hardware structure of Montgomery
If m is odd, then there exists an element $2^{-1}$ of $\mathbb{Z}_m$ such that $2 \cdot 2^{-1} \equiv 1 \pmod{m}$.
Assume m is k-bit number, 2
If a natural number a is multiplied by $2^{-1}$, the result can be expressed in the following form:
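A standard identity for this product, given here as a hedged sketch rather than as the paper's own formula, is that a·2^(-1) mod m equals a/2 when a is even and (a + m)/2 when a is odd, so the multiplication needs only one conditional addition and a right shift. The function name `half_mod` is an assumption for illustration.

```python
def half_mod(a, m):
    """Return a * 2^-1 mod m for odd m and 0 <= a < m, using one add and one shift."""
    if a & 1:
        a += m        # a + m is even because both a and m are odd
    return a >> 1     # exact halving, result stays in [0, m)
```

This is the operation that a radix-2 Montgomery iteration repeats once per multiplier bit, which makes it cheap to realize with adders and wiring only.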
Optimization of hardware implementation
Generally, hardware implementations can be divided into serial structures, parallel structures and hybrid serial-parallel structures. A fully parallel structure completes all operations in one clock cycle. This greatly reduces the computation time, but at the expense of very large hardware resources, so it is mainly of theoretical interest. A fully serial structure completes only one iteration in each clock cycle, so it needs many clock cycles. Compared with the fully parallel structure, it uses fewer hardware resources, but it is quite slow. As a result, a serial-parallel structure is applied to the Montgomery algorithm.
Compared with the original algorithm, our improved algorithm is more flexible. First, we introduce three variables s, t and r, which represent the number of steps performed in one clock, the number of times these s steps are executed, and the number of remaining steps, respectively. With these three variables, a Montgomery computation that originally needed k clocks is reduced to only t clocks (when r = 0) or t + 1 clocks (when r ≠ 0). In addition, the three variables are related by $k = s \cdot t + r$ with $0 \le r < s$.
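The effect of grouping steps into clocks can be illustrated with a small behavioural model in Python. This is a sketch under stated assumptions, not the paper's Algorithm 3: it uses the plain radix-2 Montgomery iteration, treats a "clock" as a counter incremented once per group of steps, does not count the final correction as a clock, and all names are hypothetical.

```python
def mont_step(acc, a_bit, b, m):
    """One radix-2 Montgomery iteration: add a_bit*b, make the sum even with m, halve."""
    acc += a_bit * b
    if acc & 1:
        acc += m
    return acc >> 1

def multistep_montgomery(a, b, m, k, s):
    """Return (a*b*2^-k mod m, clocks) for odd m < 2^k and 0 <= a, b < m."""
    t, r = divmod(k, s)                  # k = s*t + r with 0 <= r < s
    acc, bit, clocks = 0, 0, 0
    groups = [s] * t + ([r] if r else [])
    for steps in groups:
        for _ in range(steps):           # these iterations share one modelled clock
            acc = mont_step(acc, (a >> bit) & 1, b, m)
            bit += 1
        clocks += 1
    if acc >= m:                         # final correction (not counted as a clock here)
        acc -= m
    return acc, clocks
```

With k = 256 and s = 16 the model reports 16 clocks (t = 16, r = 0), matching the cycle count quoted in the text; with s = 15 it reports 18 clocks (t = 17, r = 1).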
Algorithm 3 shows the proposed multi-step structure. It describes how the multi-step iteration and the reducer work together. To illustrate the proposed implementation more clearly, Figure 2 shows the architecture of the multi-step Montgomery multiplier for hardware implementation, and Figure 3 shows the multi-step Montgomery structure. From Figure 2, we can see that the control unit and the CSA_Montgomery unit are the most important parts. The structure consists of two parts, i.e. the main Montgomery structure and the remainder Montgomery structure, which perform the iterations in a clock and the remainder steps, respectively. The Montgomery structure in Figure 2 is detailed in Figure 3.
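Assuming that the "CSA" in the CSA_Montgomery unit refers to the carry-save arithmetic mentioned in the introduction, the basic idea can be sketched in Python as a 3:2 compressor. The sketch below is illustrative only and does not reproduce the unit's actual design; the function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """Compress three operands into a (sum, carry) pair without propagating carries."""
    s = x ^ y ^ z                            # bitwise sum, each bit independent
    c = ((x & y) | (x & z) | (y & z)) << 1   # bitwise majority, weighted one bit higher
    return s, c

def csa_accumulate(operands):
    """Add many operands in carry-save form and resolve the carries only once."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s + c                             # the single carry-propagating addition
```

Because each output bit of `carry_save_add` depends on only three input bits, the delay of one compression does not grow with the operand width, which is what allows several Montgomery steps to be chained within a single clock.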
Result and performance comparison
To better analyze the Montgomery multiplier, this section presents the results of the new architecture. The correctness of the design is verified with Modelsim. We select the Altera Stratix III as the target device; the design is developed in Verilog HDL and synthesized with Quartus II 13.0. Figure 4 gives the circuit structure in RTL view. The improved design implemented on the Altera Stratix III takes 17161 ALUTs, about 37% of all resources in the device, and uses no DSP units. The remaining resources are enough for other IPs to build a full cryptographic system.
Figure 4. RTL view for multi-step Montgomery modular multiplication
The performance comparison with previous relevant designs is given in Table 1. Hang Yuan et al. [12] presented a systolic array Montgomery multiplier. Its frequency is higher on a 0.13 µm ASIC process, but it needs a large number of clock cycles to compute a Montgomery modular multiplication. The designs in [13] and [14] use embedded multipliers: the former uses 256 embedded multipliers and the latter uses 81. Both need fewer clock cycles than [12] and therefore less time, but they occupy DSP units in addition to logic resources: [13] uses 256 18×18 multipliers and 11992 slices, and [14] uses 81 embedded multipliers and 23405 LEs. Therefore, neither design balances speed and hardware resources as well as ours. The fully parallel design presented by Nahed Mulla et al. [15] is similar to ours. Compared with [15], our design may take more time, but their design occupies many more resources, including 23405 LEs and 16 DSP units. Their design is therefore not generic; that is, it does not carry over to other platforms such as ASICs. M. Morales-Sandoval et al. [16] presented the iterative digit-digit Montgomery multiplication (IDDMM) algorithm, in which the input operands (multiplicand, multiplier and modulus) are represented in radix $\beta = 2^k$. Their results show that k = 64 is more efficient than k = 32. However, the method of [16] also occupies many DSP units, so our design offers a better trade-off between speed and hardware resources.
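As a rough consistency check on the timing figures quoted in the abstract (16 clock cycles and 0.2 µs per modular multiplication, about 0.76 ms per 256-bit point multiplication), the following Python sketch derives the clock frequency and the multiplication budget those figures imply. Both derived values are back-of-the-envelope estimates that are not stated in the paper, and the variable names are assumptions.

```python
CYCLES_PER_MODMUL = 16      # from the simulation result
T_MODMUL_S = 0.2e-6         # 0.2 microseconds per modular multiplication
T_POINT_MUL_S = 0.76e-3     # estimated time of a 256-bit point multiplication

# Latency = cycles / frequency, so the implied clock is cycles / latency.
implied_clock_hz = CYCLES_PER_MODMUL / T_MODMUL_S

# Upper bound on back-to-back multiplications that fit into one point
# multiplication, if all of the time were spent in the multiplier.
modmul_budget = T_POINT_MUL_S / T_MODMUL_S

print(f"implied clock: about {implied_clock_hz / 1e6:.0f} MHz")
print(f"multiplication budget: about {modmul_budget:.0f} modular multiplications")
```

Since the quoted times are rounded, the derived values should be read as orders of magnitude rather than exact design parameters.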
Conclusion
In this paper, we put forward a more effective Montgomery multiplication algorithm based on the multi-step method. The proposed architecture can realize 256-bit Montgomery multiplication with an arbitrary number of steps per clock. With 16 steps, it performs a modular multiplication in only 16 cycles. Results on the Altera Stratix III show that the design is faster than previous designs on ASIC or FPGA devices. Compared with previous relevant works, the design not only has higher speed but also occupies fewer hardware resources. Moreover, the design can also be ported to other platforms because it uses no internal DSP units. To better demonstrate the importance of the proposed Montgomery multiplier for ECC, we plan to build a complete ECC structure in future work to evaluate the advantage of the design.
