Abstract: Modular multiplication is a basic operation in public-key cryptosystems such as RSA and elliptic curve cryptography (ECC). Many algorithms have been proposed to speed up its calculation. Among them, the Montgomery algorithm is the most efficient method for avoiding expensive divisions. Recently, due to the increasing use of diverse embedded systems, variable-precision modular multiplication with scalable architectures has attracted growing attention. In this paper, we propose a new word-based implementation of Montgomery modular multiplication. A predict policy is incorporated into a scalable architecture to reduce area cost and time latency. Compared with other scalable designs, our design achieves the best area-time product, with little memory overhead.
I. INTRODUCTION
Modular multiplication is a basic arithmetic operation widely used in public-key cryptosystems such as RSA [1] and ECC [2]. These cryptographic protocols operate on a large modulus M, which is used in repeated modular multiplications. Traditional algorithms use division by M to avoid calculation overflow, but this leads to low performance in both hardware and software implementations. So far, the most efficient and popular modular multiplication method is the Montgomery algorithm [3]. Its modular multiplication starts from the least significant position and uses division by a power of two instead of division by M. As a result, the primary operations of the Montgomery algorithm are simple additions and shifts. For the sake of high performance, several designs [4], [5] have proposed the use of a carry save adder (CSA) to avoid the long carry propagation delay.
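The role of the CSA can be illustrated with a small software model. The sketch below is not taken from the designs in [4], [5]; the function name `carry_save_add` is hypothetical. It only shows the general idea: three operands are compressed into a sum word and a carry word using bitwise logic, so no carry ripples across bit positions until the two words are finally added.

```python
def carry_save_add(a, b, c):
    """Compress three non-negative integers into a (sum, carry) pair.

    Each bit position is handled independently (a 3:2 compressor), which is
    why a hardware CSA has no long carry chain. The invariant is
    partial_sum + carry == a + b + c.
    """
    partial_sum = a ^ b ^ c                      # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, weighted by 2
    return partial_sum, carry


if __name__ == "__main__":
    s, c = carry_save_add(0b1011, 0b0110, 0b1101)
    # One ordinary (carry-propagating) addition recovers the true total.
    print(s + c == 0b1011 + 0b0110 + 0b1101)
```

In an iterative multiplier, the (sum, carry) pair would typically be kept in redundant form across iterations and resolved with a single full addition at the end.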
To speed up the computation of Montgomery multiplication, various techniques have been reported, such as systolic architecture designs [6], [7] that achieve high throughput. However, most improved Montgomery modular multiplication methods were developed for operands of fixed precision and are not applicable to variable-precision multiplication. In [8], Tenca introduced a radix-2 word-based approach to Montgomery modular multiplication, together with a scalable architecture design. However, due to the data dependency among words, each iteration of Tenca's method takes two more clock cycles to complete the multiplication. Several improved techniques based on Tenca's radix-2 word-based Montgomery algorithm have been proposed. In [9], Harris introduced left shifting of Y and M instead of right shifting of S. In [10], Shieh defined a new math function for S' = S -SM' / 2 to solve the data dependency problem. In [11], Huang selected the result from two possible values, based on the most significant bit, to reduce the clock latency. In addition, a high-radix word-based Montgomery algorithm using the Booth encoding technique was proposed to decrease the operation time [12]. Nevertheless, its computation logic requirement increases as well.
In this paper, we optimize the hardware architecture of the word-based Montgomery algorithm. We modify Huang's algorithm and incorporate a predict policy into the proposed scalable architecture to improve overall performance. This paper is organized as follows: Section II introduces Montgomery modular multiplication algorithms. Section III discusses the predict policy and the associated scalable architecture. Section IV shows experimental results, and Section V presents the summary and conclusion.
II. BACKGROUND

A. Montgomery Modular Multiplication
The Montgomery modular multiplication algorithm replaces trial division with simple addition and shift operations. It operates on an n-bit odd modulus M and a radix constant R, defined as r^n mod M, where r is the radix number. Since the intermediate result S stays within the range 0 to 2M, the final subtraction S = S - M of Montgomery modular multiplication can be removed [13].
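As an illustration of the add-and-shift structure, the following is a minimal software sketch of a bit-serial Montgomery multiplication, assuming radix r = 2 and operands X, Y < M. The function name `montgomery_radix2` and the example values are assumptions made for this sketch, not part of the original algorithm listing. The result is left in the range 0 to 2M, without a final subtraction.

```python
def montgomery_radix2(x, y, m, n):
    """Bit-serial Montgomery multiplication sketch (assumes radix 2).

    Returns s such that s * 2**n is congruent to x * y modulo m,
    with 0 <= s < 2*m. Requires m odd, m < 2**n, and x, y < m.
    """
    assert m % 2 == 1 and m < (1 << n)
    assert 0 <= x < m and 0 <= y < m
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1              # scan the multiplier from the LSB
        q = (s + x_i * y) & 1           # q = 1 when the partial sum is odd
        s = (s + x_i * y + q * m) >> 1  # adding q*m makes the sum even, so the
                                        # shift is an exact division by 2
    return s


if __name__ == "__main__":
    m, n = 239, 8                       # illustrative 8-bit odd modulus
    s = montgomery_radix2(200, 123, m, n)
    print((s << n) % m == (200 * 123) % m, 0 <= s < 2 * m)
```

Only additions and one-bit right shifts appear in the loop; the bound s < 2M holds because each update halves a quantity smaller than 4M.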
B. Word-Based Montgomery Modular Multiplication
Tenca [8] proposed a radix-2 word-based Montgomery modular multiplication algorithm (OR2WMM), which takes two more clock cycles to complete multiplication. Huang 
Input: M (odd modulus, n bits), X (multiplier, n bits), Y (multiplicand, n bits), with X, Y < M.
1: for i = 0 to n - 1 do
2:   q = (x_i and Y_0^(0)) xor S_0^(0)
3:   (CO^(1), ...
     ...
5:   if (SM^(0) = 1) then
6:     ...
7:     SR^(0) = (SO_{w-1}^(0), S_{w-2..1}^(0))
8:   else
9:     C^(1) = CE^(1)
10:    ... )
11:  end if
12:  for j = 1 to e - 1 do
13:    ... , (Y + M)^(j).
     ...
17: ...

   (0, 0): ...
4: (1, 0): ...
5: (1, 1): ...
According to the above word-based PP, a predict-policy radix-2 word-based Montgomery modular multiplication algorithm (PP-R2WMM) is proposed. An n-bit modulus M and a multiplicand Y are each partitioned into e = ⌈(n + 1)/w⌉ words of w bits. The dependency graph of PP-R2WMM, shown in Fig. 1, is similar to Huang's graph. The main difference is the replacement of two CSAs with one CSA and one LUT.
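The word partition can be made concrete with a small software model. The sketch below is a generic word-serial radix-2 Montgomery loop, not the PP-R2WMM datapath: it uses ordinary integer additions per word rather than a CSA and LUT, and it does not model the predict policy or clock cycles. The helper names `word_count`, `split_words`, and `word_serial_montgomery` are hypothetical. Only the partition e = ⌈(n + 1)/w⌉ and the quotient bit q = (x_i and Y_0^(0)) xor S_0^(0) are taken from the text.

```python
def word_count(n, w):
    """e = ceil((n + 1) / w): number of w-bit words per operand."""
    return (n + w) // w


def split_words(value, w, e):
    """Split an integer into e little-endian words of w bits each."""
    mask = (1 << w) - 1
    return [(value >> (w * j)) & mask for j in range(e)]


def word_serial_montgomery(x, y, m, n, w):
    """Word-by-word radix-2 Montgomery multiplication (software model).

    Returns s with s * 2**n congruent to x * y modulo m and 0 <= s < 2*m,
    assuming m is odd, m < 2**n, and x, y < m.
    """
    e = word_count(n, w)
    mask = (1 << w) - 1
    Y, M = split_words(y, w, e), split_words(m, w, e)
    S = [0] * e
    for i in range(n):
        x_i = (x >> i) & 1
        q = (x_i & Y[0] & 1) ^ (S[0] & 1)      # parity of S + x_i * Y
        carry, T = 0, []
        for j in range(e):                      # scan words from LSW to MSW
            t = S[j] + x_i * Y[j] + q * M[j] + carry
            T.append(t & mask)
            carry = t >> w
        T.append(carry)                         # top carry (0 or 1 here)
        # Right shift by one bit: each word takes its new MSB from the LSB
        # of the next higher word. This is the inter-word data dependency.
        S = [(T[j] >> 1) | ((T[j + 1] & 1) << (w - 1)) for j in range(e)]
    return sum(word << (w * j) for j, word in enumerate(S))


if __name__ == "__main__":
    m, n, w = 239, 8, 3
    s = word_serial_montgomery(200, 123, m, n, w)
    print(word_count(n, w), (s << n) % m == (200 * 123) % m)
```

The shift step shows why word j cannot be finalized until the least significant bit of word j + 1 is known, which is the dependency that the scalable designs discussed above aim to hide.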
IV. EXPERIMENTAL RESULT
The proposed PP-R2WMM algorithm was coded in the VHDL hardware description language and synthesized using Altera's design platform. The Altera FPGA EP2C70F896C6 is used as the benchmark test chip, and all experiments are conducted on an Altera DE2-70 board. Our code was verified for word size w = 8 bits and tested for operand lengths n = 512, 1024, and 2048 bits. The performance and hardware area of three radix-2 scalable designs are shown in Table I. IR2WMM [11] has a smaller area × time product than OR2WMM [8], mainly because it halves the number of clock cycles; it also uses fewer registers than OR2WMM. Compared with IR2WMM, PP-R2WMM occupies less area, because the kernel of the PE is simplified to one CSA. The extra cost is the memory usage shown in Table II.
TABLE I. PERFORMANCE AND HARDWARE AREA OF THREE RADIX-2 SCALABLE ARCHITECTURES
TABLE II. MEMORY RESOURCE REQUIREMENT OF THREE SCALABLE ARCHITECTURE DESIGNS
V. CONCLUSION
In this paper, we propose a predict policy with a new scalable architecture for word-based Montgomery modular multiplication. The experimental results show that our area-time product is the best among all compared methods, with little memory overhead. In the future, the proposed techniques will be extended to consider other design goals, such as parallel, scalable, and high-radix designs.
