Montgomery Modular Multiplication in hardware is of great importance for the realisation of practical public key systems, so an implementation of modular exponentiation that is efficient in both speed and hardware resources is essential. This paper focuses on the implementation of a fully pipelined SOS-based Montgomery Multiplication algorithm on a Virtex-5 FPGA, using DSP slices to achieve the best area-speed trade-off. Our implementation results and a comparison with other multipliers show that our Multiplier is comparable to known Montgomery Multipliers in terms of area-speed trade-off.
INTRODUCTION
In public key cryptosystems such as ECC and RSA, the arithmetic operations of modular exponentiation and Modular Multiplication are of crucial importance for the performance of the system. Montgomery Multiplication, introduced by Peter L. Montgomery (1985), is an efficient method to perform Modular Multiplication. An overview of different algorithms for Montgomery Modular Multiplication (MMM) using a single b-bit integer multiplier is given by Koc (1996).
In this paper, a hardware architecture for improved SOS-based MMM on an FPGA is presented; it uses dedicated multiplier blocks to achieve a speed and area trade-off. We used Virtex-5 DSP48E slices for the practical realization of the basic steps of SOS, i.e. the 32x32-bit multiplier and the full-length adder.
The remainder of the paper is organized as follows. Section 2 introduces Montgomery's Algorithm. Section 3 gives a summary of previous work. Section 4 presents a detailed description of our Multiplier. Section 5 presents the implementation results and compares them with known implementations. Section 6 concludes the paper.
MONTGOMERY MULTIPLICATION
Montgomery Multiplication is one of the most popular and efficient methods to perform Modular Multiplication. It was introduced by Peter L. Montgomery (1985) and is presented as Algorithm 1 in this paper. Koc (1996) presented an overview of different algorithms for Montgomery Multiplication using a single b-bit integer multiplier: SOS, CIOS, FIOS, FIPS and CIHS. A later work (Oct, 1999) presents an improved MMM algorithm that performs an extra iteration, which avoids the conditional final subtraction. Our work targets a fully pipelined implementation of the improved SOS algorithm only.
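The behaviour described above can be illustrated at the integer level. The following is a minimal Python sketch, not the hardware design of this paper: the function name `montgomery_multiply` and the flag `final_subtraction` are hypothetical, and the sketch assumes an odd modulus n and Python 3.8 or later (for the modular inverse via `pow`). With R = 2^r_bits it returns a value congruent to a*b*R^-1 modulo n. When R > 4n and both inputs are below 2n, the result stays below 2n even if the conditional subtraction is skipped, which is the property the improved algorithm relies on.

```python
def montgomery_multiply(a, b, n, r_bits, final_subtraction=True):
    """Return a value congruent to a*b*R^-1 mod n, with R = 2**r_bits and n odd."""
    r_mask = (1 << r_bits) - 1
    # n' = -n^-1 mod R (exists because n is odd)
    n_prime = (-pow(n, -1, 1 << r_bits)) & r_mask
    t = a * b
    # m is chosen so that t + m*n is divisible by R
    m = ((t & r_mask) * n_prime) & r_mask
    u = (t + m * n) >> r_bits
    if final_subtraction and u >= n:
        u -= n
    return u


# Usage sketch with an illustrative 61-bit modulus and R = 2**64 > 4n.
n = (1 << 61) - 1
a, b = 123456789012345678, 987654321098765432
classic = montgomery_multiply(a, b, n, 64)
# Inputs below 2n, no final subtraction: the output is again below 2n.
relaxed = montgomery_multiply(a + n, b + n, n, 64, final_subtraction=False)
```

Skipping the comparison and subtraction removes a data-dependent step, at the cost of a slightly larger R.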
PREVIOUS WORK
There exists a substantial amount of previous work on the implementation of Montgomery Multipliers. In this section, the most important known implementations of Montgomery Multipliers over GF(P) in FPGAs are discussed.
A scalable systolic array was implemented by Batina (2004). Manochehri (2004) introduced pipelining inside the CSA logic. McIvor (2004) gave a comparison of the algorithms presented by Koc (1996). Bunimov (2002) designed Montgomery Multipliers using carry-save adders, and a practical FPGA implementation of this design is given by Amanor (2005). Kelley (2005) designed a scalable Montgomery Multiplier using two w·v-bit multipliers, two 3-2 carry-save adders and one w+v carry-propagate adder. Nele Mentens (July, 2007) gave a parallel implementation of the algorithms presented by Koc (1996), which is claimed to be the fastest published Montgomery Multiplier on FPGA.
OUR SOS MULTIPLIER
We focused on the implementation of the improved SOS-based Montgomery algorithm using Virtex-5 DSP48E slices. We designed the basic 32x32-bit multiplier and 32-bit adders in DSP48E slices (UG193, April, 2006). The complete 1024-bit SOS-based Montgomery Multiplier was implemented by adopting a pipelined architecture employing dual-port RAMs.
Design Realization
The hardware realization of the improved SOS algorithm is shown in Figure 1. In Step 1, we multiply each 32-bit word of the second variable B with all words of the complete 1056-bit first variable A. Each word multiplication produces a 2*b-bit output, represented as C and S, where C is the upper b-bit word and S is the lower one. The C word is delayed by one clock cycle and added to the next S word computed. In this manner we get the n*b-bit words of T as shown in Figure 1.
In order to form the complete T, we shift the first computation (B_0*A) by one word after extracting the T_0 word and add it to the second n*b-bit result computed from B_1*A, as shown in Figure 1. It is worth noting that the value of "m" required in Step 2 is computed in parallel as soon as T_0 becomes available.
Figure 1: Hardware Flow of Algorithm (Step 1 and Step 2; Step 2 continues till the last word of T_L).
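Step 1 as described above can be sketched in software. The following Python model is an illustration under stated assumptions, not the pipelined hardware: the helper names `to_words`, `from_words` and `sos_step1` are hypothetical, words are stored least significant first, and both operands are assumed to have the same number of 32-bit words. The inner loop mirrors the C/S handling: each word product is split into a lower word S, which is stored, and an upper word C, which is carried into the next word position.

```python
W = 32                      # word size b in bits
MASK = (1 << W) - 1


def to_words(x, n_words):
    """Split integer x into n_words words of W bits, least significant first."""
    return [(x >> (W * i)) & MASK for i in range(n_words)]


def from_words(words):
    """Recombine little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))


def sos_step1(a_words, b_words):
    """Word-serial product T = A * B, following Step 1 of SOS."""
    s = len(a_words)
    t = [0] * (2 * s)
    for i, b_i in enumerate(b_words):
        carry = 0                           # C word from the previous position
        for j, a_j in enumerate(a_words):
            acc = t[i + j] + a_j * b_i + carry
            t[i + j] = acc & MASK           # S: lower word is stored
            carry = acc >> W                # C: upper word moves to the next position
        t[i + s] = carry
    return t


# Usage sketch with illustrative 128-bit operands (4 words each).
a = 0x0123456789ABCDEF0011223344556677
b = 0x0FEDCBA9876543218899AABBCCDDEEFF
t_words = sos_step1(to_words(a, 4), to_words(b, 4))
```

The accumulator never exceeds 2*W bits, because the sum of one word product, one stored word and one carry word is at most 2^(2W) - 1.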
In Step 2 we have to perform two types of iterations. In the first iteration, we compute the new T by multiplying "m" with N_j and adding the old T values to it. The result is a 2*b-bit word formed as C and S, as in Step 1, and C is added to the result as in the previous step. However, the major differences between this step and the previous one are as follows. Instead of the Shift and Add operation of Step 1, the ADD function (refer to Figure 1) is performed. It is carried out upon completion of the multiplication operation on the n lower words of T (i.e. T_L). The ADD function simply adds the carry (C_i) generated from the (T_n + m_i*N_n) words. In hardware, we have implemented it independently. The computation of m_i for each step is done as soon as the T_i word in T_L has been computed. A dedicated Multiplier computes this result in hardware.
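The reduction in Step 2 can be modelled in the same word-serial style. This Python sketch is an assumption-laden illustration rather than the hardware design: the names `sos_step2` and `n0_prime` are hypothetical, `n0_prime` stands for -N_0^-1 mod 2^32 (which requires an odd modulus), and words are stored least significant first. For each lower word T_i, the value m_i is derived from that word alone, m_i*N is added in place, and the remaining carry is propagated into the upper words, which corresponds to the ADD function described above.

```python
W = 32                      # word size b in bits
MASK = (1 << W) - 1


def to_words(x, n_words):
    """Split integer x into n_words words of W bits, least significant first."""
    return [(x >> (W * i)) & MASK for i in range(n_words)]


def from_words(words):
    """Recombine little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))


def sos_step2(t_words, n_words, n0_prime):
    """Montgomery reduction of a 2s-word T; returns s+1 words of (T + m*N) / R."""
    s = len(n_words)
    t = list(t_words) + [0]                 # one spare word for the top carry
    for i in range(s):
        m_i = (t[i] * n0_prime) & MASK      # available as soon as T_i is known
        carry = 0
        for j in range(s):
            acc = t[i + j] + m_i * n_words[j] + carry
            t[i + j] = acc & MASK
            carry = acc >> W
        k = i + s
        while carry:                        # ADD function: propagate the carry upward
            acc = t[k] + carry
            t[k] = acc & MASK
            carry = acc >> W
            k += 1
    return t[s:]                            # lower s words are now zero


# Usage sketch with an illustrative odd 128-bit modulus (4 words).
n = 0xC90FDAA22168C234C4C6628B80DC1CD1
a = 0x0123456789ABCDEF0011223344556677
b = 0x0FEDCBA9876543218899AABBCCDDEEFF
n0_prime = (-pow(n & MASK, -1, 1 << W)) & MASK
u = from_words(sos_step2(to_words(a * b, 8), to_words(n, 4), n0_prime))
```

In this sketch u is congruent to a*b*R^-1 modulo n with R = 2^128, and for inputs below n it is below 2n, so at most one subtraction of n would bring it into the range [0, n).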
Top Level Design
The top-level design of the Multiplier is given in Figure 2. The majority of the components used in the Multiplier are Xilinx Cores. The 32x32-bit Multiplier with 32-bit Adder is implemented using the fully pipelined Multiplier architecture (Xilinx Virtex-4 Handbook, 2004).
A comparison with Kelley (2005) in terms of resource utilization is harder to evaluate, but our Multiplier is comparable to it in terms of area and speed, which is our main objective.
Koc (1996) 60 Not Applicable ---799 Pentium-60
1024 Bit Multiplier Architecture

IMPLEMENTATION RESULTS & COMPARISON
CONCLUSIONS
This paper presented the design methodology for implementing improved SOS MMM for large integers over GF(P) with a 32-bit word size in FPGAs, using DSP slices to achieve an area and speed trade-off. The proposed SOS Montgomery Multiplier was implemented and tested at 269.5 MHz with 160-, 256-, 512- and 1024-bit integers.
The fundamental contribution of this work is to show that it is possible to design efficient Montgomery Multipliers without compromising scalability, portability, time performance and area efficiency. Our multiplier is comparable to known Montgomery Multipliers in terms of area-speed trade-off.
