This paper presents a pipelined architecture of a modular Montgomery multiplier, which is suitable to be used in public key coprocessors. Starting from a baseline implementation of the Montgomery algorithm, a more compact pipelined version is derived. The design makes use of 16-bit integer multiplication blocks that are available on recently manufactured FPGAs. The critical path is optimized by omitting the exact computation of intermediate results in the Montgomery algorithm using a 6-2 carry-save notation. This results in a high-speed architecture, which outperforms previously designed Montgomery multipliers. Because a very popular application of Montgomery multiplication is public key cryptography, we compare our implementation to the state-of-the-art in Montgomery multipliers on the basis of performance results for 1024-bit RSA.
INTRODUCTION
Montgomery multiplication has been shown to be a very efficient way to perform modular multiplication [10]. That is why the algorithm is often used in data paths that provide modular arithmetic. This paper presents an architecture of a Montgomery multiplier that is implemented on an FPGA. The architecture utilizes 16-bit integer multipliers, which are available on recently designed FPGAs.
* Nele Mentens and Kazuo Sakiyama are partially funded by FWO (G.0450.04), IBBT, EU IST FP6 projects ECRYPT and SESOC.
Starting from a rather straightforward implementation of the Montgomery algorithm, an optimization for area and speed is done by downsizing the architecture and introducing more levels of pipelining. Moreover, a substantial reduction of the length of the critical path is achieved by optimizing the speed of the feedback loop in the architecture. This optimization is based on the observation that the exact computation of intermediate values is not necessary and a carry-save approach can be used.
A popular application of modular arithmetic is public key cryptography. That is why we compare our implementation to the state-of-the-art in Montgomery multipliers on the basis of the time to perform an RSA decryption [12]. The results show that our implementation outperforms previously designed Montgomery multipliers on FPGAs.
The organization of the paper is as follows. Section 2 lists the state-of-the-art in Montgomery multipliers. In Sect. 3, the Montgomery algorithm is introduced. Section 4 compares our new architecture to a baseline architecture. Finally, Sect. 5 and Sect. 6 give the implementation results and conclude the paper, respectively.
PREVIOUS WORK
A very good overview of implementation options for Montgomery multipliers is given by Koç et al. in [4] . Together with implementation results on a Pentium processor, they describe many algorithms that implement Montgomery multiplication using a single w-bit integer multiplier. These algorithms can be extended to parallel versions. In a fully parallelized implementation, the Finely Integrated Operand Scanning (FIOS) and Coarsely Integrated Operand Scanning (CIOS) algorithms in [4] lead to the same architecture. Our architecture is also based on these algorithms.
A scalable radix-2 implementation is introduced by Tenca and Koç in [13]. The same notion of scalability is implemented by Batina et al. in [1]. They present a systolic array architecture resulting in a Montgomery-based RSA implementation. More recent hardware implementations of Montgomery multiplication include the work by McIvor et al. [9]. They use Carry Save Adders (CSAs) to perform the large word length additions required for Montgomery multiplication. This idea is related to ours. In [9], the carry-save format is used in between modular multiplications to perform RSA, resulting in 5-2 CSA logic. In our implementation, we instead use 6-2 CSA logic, optimizing the internal loop in the Montgomery multiplication algorithm. The carry-save approach is also used by Bunimov and Schimmler in [2]. They combine CSA logic with a table lookup. In [8], Manochehri and Pourmozafari introduce pipelining inside the CSA logic. This is different from our approach, which pipelines the rest of the architecture.
The implementation of RSA decryption in [9] uses the Chinese Remainder Theorem (CRT), which leads to a speedup factor of almost 4 [11]. To our knowledge, the fastest reported RSA implementation on an FPGA that does not apply CRT is presented by Kelley and Harris in [5]. Similar to our architecture, they also use the dedicated multiplier blocks on the FPGA. We will compare our results to this implementation.
MONTGOMERY MULTIPLICATION
In 1985, Peter Montgomery introduced a method to efficiently perform modular multiplication [10]. Instead of computing $X \cdot Y \bmod M$, the Montgomery algorithm computes $X \cdot Y \cdot R^{-1} \bmod M$, where R is a power of two. In this way, trial division can be avoided and a division by a power of two is executed instead. This comes down to a right-shift operation, which is almost free in hardware. An improvement on Montgomery's algorithm was presented by Colin Walter in [14, 15]. Compared to Montgomery's original algorithm, this algorithm performs one extra iteration, making the conditional final subtraction unnecessary. The improved algorithm therefore runs in constant time and avoids the implementation of a subtractor.
Alg. 1 shows the improved Montgomery algorithm. In [14, 15] , Walter proves that, if the inputs X and Y are smaller than 2 * M , the output is also bounded to 2 * M , while the intermediate result T has a bound of 4 * M . He also shows that, after converting from Montgomery to normal representation, the result is smaller than M again. Note that only the LSB of the b-bit digits Xn and Yn can be equal to one in order to satisfy 0 ≤ X, Y < 2 * M . For digit Tn the two LSBs can be equal to one, while the rest of the bits are zero, because T < 4 * M is always ensured. However, in the implementations that are discussed in Sect. 4, we always consider the complete length of X, Y and T , i.e. b * (n + 1), for ease of notation. To conclude this section, we remark that the division by b in Step 4 of Alg. 1 can be implemented as a shift operation, because the b LSBs of the sum are equal to zero.
Algorithm 1 Improved Montgomery multiplication
Require: integers $M = (M_{n-1} \cdots M_0)_b$, $X = (X_n \cdots X_0)_b$, $Y = (Y_n \cdots Y_0)_b$ with $0 \le X, Y < 2M$, $R = b^{n+1}$ with $\gcd(M, b) = 1$ and $M' = -M^{-1} \bmod b$
Ensure: $X \cdot Y \cdot R^{-1} \bmod M$
1: $T = (T_n \cdots T_0)_b \leftarrow 0$
2: for $i$ from 0 to $n$ do
3:   $U_i \leftarrow (T_0 + X_0 \cdot Y_i) \cdot M' \bmod b$
4:   $T \leftarrow (T + X \cdot Y_i + M \cdot U_i)/b$
5: end for
6: Return $T$
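The steps of Alg. 1 can be checked with a short software model. The following Python sketch is only an illustration, not the hardware design: the function name `montgomery_multiply` and the variable `M_prime` (standing for the constant $-M^{-1} \bmod b$) are assumptions introduced here, and arbitrary-precision integers replace the fixed-width registers.

```python
def montgomery_multiply(X, Y, M, b=2**16, n=64):
    """Software model of Alg. 1 (hypothetical helper, not from the paper).

    Returns a value congruent to X * Y * R^-1 mod M with R = b^(n+1).
    Assumes M is odd, M < b^n and 0 <= X, Y < 2 * M.
    """
    M_prime = (-pow(M, -1, b)) % b          # -M^-1 mod b, computed once per modulus
    T = 0
    for i in range(n + 1):                  # n + 1 iterations, no final subtraction
        Yi = (Y // b**i) % b                # i-th radix-b digit of Y
        Ui = ((T % b + (X % b) * Yi) * M_prime) % b   # Step 3
        total = T + X * Yi + M * Ui         # Ui makes the lowest digit zero
        assert total % b == 0
        T = total // b                      # Step 4: exact division, i.e. a shift
    return T


if __name__ == "__main__":
    # Small usage example with toy parameters (b = 16, n = 2, so R = 16^3).
    M, X, Y = 239, 300, 411
    T = montgomery_multiply(X, Y, M, b=16, n=2)
    print(T, T % M == (X * Y * pow(16**3, -1, M)) % M)
```

The modular inverse is obtained with the three-argument `pow`, which accepts a negative exponent from Python 3.8 onwards.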
MONTGOMERY MULTIPLICATION ARCHITECTURE
This section presents two hardware architectures that implement Alg. 1 with $b = 2^{16}$ and n = 64, i.e. 1024-bit Montgomery multiplication is implemented. The first architecture is rather straightforward. It is described in Sect. 4.2, which also elaborates on the drawbacks of the design. To address these drawbacks, several optimizations were applied in order to obtain a much faster pipelined architecture, presented in Sect. 4.3. Before introducing both architectures, the available resources on our hardware platform are listed in Sect. 4.1.
Hardware Resources
The target implementation platform for the presented architectures is an FPGA. Recent FPGAs contain highly optimized 16-bit integer multipliers, which are implemented as dedicated hardware to be connected to the reconfigurable part of the FPGA [16]. Moreover, highly optimized arbitrary sized adders are available. The circuitry inside both the multiplier blocks and the adders is proprietary information of the FPGA vendors.
Baseline Implementation
The architecture of a more or less straightforward manner of computing Alg. 1 is shown in Fig. 1. The registers are indicated by gray boxes: X and Y are the inputs, M is the modulus and T is the intermediate register where the result will also be stored. The "mult" blocks consist of 16-bit integer multipliers in parallel and output a sum and a carry result, as depicted in Fig. 2(a). The output of the "modmult" block is formed by the 16 LSBs of an integer multiplier. As mentioned in Sect. 4.1, no manual adder optimization is done. Therefore, the "add" block is a black-box 5-input adder provided by the FPGA vendor. Because the modulus M is constant throughout the Montgomery multiplication, the "modinv" operation needs to be computed only once. For this reason, the delay of this operation is not important. The "modinv" block consists of two-level logic situated in the CLBs of the FPGA. Note that the output of the 5-input adder is bounded to 1056 bits, because of the observations made in Sect. 3. To realize the division by $b = 2^{16}$ in Step 4 of Alg. 1, a right shift operation over 16 positions is performed on T. The implementation presented in Fig. 1 computes the same sequence of operations in each clock cycle. This sequence is shown in Table 1, where the subscripts new and old denote the input and the output of a register, respectively, and the numbers in square brackets denote the bit indices. These operations are executed n + 1 = 65 times for one Montgomery multiplication. This means the result will be available in register T after 65 clock cycles.
After implementing the architecture in Fig. 1 on a Xilinx XC2VP30 FPGA, a maximum clock frequency of 30 MHz was found. This frequency is determined by the long critical path, going from register Y (which has a higher fan-out than register X) to register T through three 16-bit integer multipliers, a 16-bit adder and a 5-input 1056-bit adder. In order to reduce the critical path, registers could be introduced in between these five arithmetic units. However, the delay of the 1040-bit 5-input adder is much larger than the delay of the multipliers, which still limits the maximum operating frequency. Inserting registers inside this adder would increase the maximum operating frequency, but would not reduce the delay of the feedback loop, which computes the new value of T from the old value of T. Therefore, we need to optimize this feedback loop in order to reduce the critical path and increase the maximum frequency. These optimizations are discussed in Sect. 4.3.
Another limitation of the architecture in Fig. 1 is that the size of the registers and arithmetic blocks is equal to the full word length of the Montgomery multiplier. Section 4.3 introduces a method to make the architecture more compact.
Downsized Pipelined Implementation
This section describes our new architecture in three steps. In Sect. 4.3.1, the optimization of the 5-input adder in Fig. 1 is discussed. Section 4.3.2 elaborates on the downsizing of the architecture. Finally, the length of the critical path is reduced substantially by introducing pipelining, which is explained in Sect. 4.3.3.
Speeding-up the Adder
To optimize the delay of the 5-input adder in Fig. 1, we use the observation that the exact value of the intermediate result T does not need to be computed in each iteration. Instead, we compute the sum and the carry of T separately using a carry-save adder. The new Montgomery multiplier architecture is shown in Fig. 3, where the sum and the carry of T are denoted by STi and CTi. The alignment of the inputs and outputs of the carry-save adder, denoted by "6-2 compression", is shown in Fig. 4. The carry output of the MSB cell is always zero, because of the bounds explained in Sect. 3. This results in a 1055-bit carry value at the output of the carry-save adder. The architecture of one compression cell is depicted in Fig. 2(b).
The division by $b = 2^{16}$ in Step 4 in Alg. 1 is executed by performing a 16-bit right shift operation on the sum and the carry of T. The bits of STi and CTi that are shifted out add up to 0 as explained in Sect. 3. However, the carry-out of this addition needs to be taken into account and is therefore led back into the 6-2 compression block. The carry bit is stored in register ci in Fig. 3. To be able to perform Step 3 in Alg. 1, the sum of the next 16 bits of STi and CTi, denoted by T0i in Fig. 3, is calculated and added to SXYi[15:0]. After replacing the 5-input adder by a 6-2 compression block, a higher operating frequency can be used. The implementation of the architecture in Fig. 3 on a Xilinx XC2VP30 FPGA shows a maximum operating frequency of 50 MHz, which is substantially faster than the clock frequency reported in Sect. 4.2. However, to compute the final result, STi and CTi need to be added up. This can be done with a sequential adder. To balance the delay of the sequential adder with the critical path of the rest of the architecture, 528 bits are computed per clock cycle, resulting in the computation of T in 2 clock cycles. This makes the total cycle count for one Montgomery multiplication equal to 65 + 2 = 67.
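The carry-save feedback loop described above can be modelled in software as follows. This is a minimal sketch under several assumptions: the 6-2 compression is built here from four layers of 3-2 full-adder stages, which is not necessarily the cell of Fig. 2(b); the products X * Yi and M * Ui are passed as single words rather than as sum/carry pairs; and the carry bit ci occupies one of the six input slots. All function names are hypothetical.

```python
def csa_3_2(a, b, c):
    """One layer of full adders: a + b + c == s + cy, no carry propagation."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def compress_6_2(ops):
    """Reduce six operands to a (sum, carry) pair with the same total."""
    a, b, c, d, e, f = ops
    s1, c1 = csa_3_2(a, b, c)
    s2, c2 = csa_3_2(d, e, f)
    s3, c3 = csa_3_2(s1, c1, s2)
    return csa_3_2(s3, c3, c2)


def montgomery_carry_save(X, Y, M, b_bits=16, n=64):
    """Alg. 1 with T kept as ST + CT + ci instead of a single exact value."""
    mask = (1 << b_bits) - 1
    M_prime = (-pow(M, -1, 1 << b_bits)) & mask
    ST = CT = ci = 0
    for i in range(n + 1):
        Yi = (Y >> (b_bits * i)) & mask
        T0 = (ST + CT + ci) & mask                  # only a 16-bit addition
        Ui = ((T0 + (X & mask) * Yi) * M_prime) & mask
        S, C = compress_6_2([ST, CT, X * Yi, M * Ui, ci, 0])
        # The 16 dropped bits of S and C sum to 0 or 2^16; keep that carry.
        ci = ((S & mask) + (C & mask)) >> b_bits
        ST, CT = S >> b_bits, C >> b_bits           # division by b as a shift
    return ST + CT + ci                             # final carry-propagate addition
```

Only the last line performs a full-width carry-propagate addition, which corresponds to the sequential adder that converts STi and CTi into T after the loop.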
Downsizing the Architecture
The architecture in Fig. 3 employs 65 and 64 16-bit integer multipliers for the computation of X * Yi and M * Ui, respectively. One more 16-bit multiplier is used for computing Ui. This results in 65 + 64 + 1 = 130 multipliers. Although there exist FPGAs that provide this number of multipliers, many FPGAs do not have enough resources to implement this full word length architecture. That is why we introduce a downsized architecture in this section. Although the downsizing factor can be chosen arbitrarily, we stick to a factor of 4, resulting in an architecture with 17 + 16 + 1 = 34 16-bit multipliers. This allows us to do a more or less fair comparison with previously designed Montgomery multipliers in Sect. 5.
The downsized architecture is shown in Fig. 5. The "mult" and "6-2 compression" computation for the evaluation of one 16-bit digit of Y are spread over 4 clock cycles. In each of these 4 clock cycles a partial result of the computation is valid. Input register X is a cyclic shift register. It outputs three times 256 bits and one time 272 bits. Each of these four outputs is multiplied with Yi. In the same way, register M shifts out four times 256 bits that are multiplied with Ui. The 1056-bit 6-2 compression depicted in Fig. 4 is divided into 4 parts. Because some inputs are 272 bits and shifted to the left over 16 positions, the width of the 6-2 compression block is 288 bits. Because the complete 1040-bit sum and carry results of the carry-save addition need to be led back into the 6-2 compression block (as shown in Fig. 4), the registers that store STi and CTi cannot be reduced. They are implemented as full length shift registers. In the first of each four cycles, the division by $2^{16}$ in Step 4 in Alg. 1 is performed and ci is written into a register, as explained in Sect. 4.3.1, where it is held for 4 cycles. T0i is also stored and held for 4 cycles until the next Yi is evaluated. Note that Ui needs to be computed only once every 4 clock cycles. That is why a register is inserted to store Ui. This register is enabled every fourth clock cycle. Because we inserted a register to store Ui, we are splitting the critical path approximately in two, resulting in a maximum clock frequency of 100 MHz on a Xilinx XC2VP30 FPGA. To add up the corresponding parts of X * Yi and M * Ui, registers need to be inserted to store S/CXYi.
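The slicing of X into three 256-bit words and one 272-bit word can be illustrated with a small software model. The sketch below is an assumption-laden illustration: the order of the slices (272-bit slice last) and the name `chunked_product` are chosen here for convenience, and a plain integer accumulation stands in for the 288-bit 6-2 compression block.

```python
def chunked_product(X, Yi, widths=(256, 256, 256, 272)):
    """Accumulate X * Yi from four slices of X, one slice per clock cycle.

    The slice order is an assumption; the source only states that register X
    outputs three 256-bit words and one 272-bit word (1040 bits in total).
    """
    acc, offset = 0, 0
    for w in widths:
        part = (X >> offset) & ((1 << w) - 1)   # word shifted out of register X
        acc += (part * Yi) << offset            # partial product at its weight
        offset += w
    return acc


if __name__ == "__main__":
    X = (1 << 1040) - 12345      # any value that fits in 65 digits of 16 bits
    Yi = 0xBEEF                  # one 16-bit digit of Y
    print(chunked_product(X, Yi) == X * Yi)
```

A 272-bit slice holds 17 digits of 16 bits and a 256-bit slice holds 16, which matches the 17 + 16 multipliers counted for the downsized X * Yi and M * Ui blocks.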
We need one cycle to compute the first Ui and 4 cycles for the computation of each X * Yi. Because the critical path is half compared to the architecture in Fig. 3 , we use a 272-bit adder to perform the final addition. This computes T in 4 clock cycles. As a result, the total cycle count is equal to 1 + 4 * 65 + 4 = 265.
Introducing Pipelining
Because the architecture in Fig. 5 computes a new Ui only once in four clock cycles, we can introduce four levels of pipelining in order to reduce the critical path. The pipelined architecture is shown in Fig. 6. Because SMUi and CMUi are stored 4 cycles after Yi is valid, the registers SXYi and CXYi need to be repeated three times, resulting in 4 pipeline stages. The pipelining schedule of our architecture is shown in Fig. 7. The multiplications X * Yi and M * Ui take 4 clock cycles. The indices a, b, c and d in Fig. 7 indicate the results of the multiplications after the first, second, third and fourth cycle, respectively. However, Hi is already computed when S/CXYi,a is ready. After 1 cycle, S/CXYi,a is ready and 4 cycles are needed to compute the output of the downsized "6-2 compression" block. In total, 65 of these 4-cycle operations need to be computed, resulting in 65 * 4 = 260 clock cycles. Finally, the carry-save form of the result needs to be transformed into the actual result T using the sequential 104-bit adder. As a consequence, the final result T is stored 10 cycles after the computation of the last STi and CTi is finished. This brings the total cycle count for one Montgomery multiplication to 1 + 260 + 10 = 271 cycles. After implementing the architecture in Fig. 6 on a Xilinx XC2VP30 FPGA, a maximum operating frequency of 125 MHz was found. More detailed implementation results of this efficient architecture are discussed in Sect. 5.
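The cycle counts and clock frequencies reported in Sect. 4.2 and Sect. 4.3 can be combined into a latency per Montgomery multiplication. The short script below is a back-of-the-envelope sketch: the dictionary `designs` and the helper `latency_us` are names introduced here, and the computed latencies are derived from the stated figures rather than taken from measurements.

```python
# (cycle count, maximum clock frequency in Hz) as stated in the text
designs = {
    "baseline (Fig. 1)": (65, 30e6),
    "carry-save (Fig. 3)": (65 + 2, 50e6),
    "downsized (Fig. 5)": (1 + 4 * 65 + 4, 100e6),
    "pipelined (Fig. 6)": (1 + 65 * 4 + 10, 125e6),
}


def latency_us(cycles, f_hz):
    """Time for one Montgomery multiplication in microseconds."""
    return cycles / f_hz * 1e6


if __name__ == "__main__":
    for name, (cycles, f_hz) in designs.items():
        print(f"{name:22s} {cycles:4d} cycles  {latency_us(cycles, f_hz):.3f} us")
```

When reading these numbers, note that the two full word length designs use 130 multipliers, while the downsized and pipelined designs use 34.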
IMPLEMENTATION RESULTS
Because very popular applications of modular multipliers are implementations of public key cryptography, the speed of our Montgomery multiplier is evaluated on the basis of the time to compute one 1024-bit RSA decryption, which is based on modular exponentiation [12]. There exist many algorithms for implementing modular exponentiation. The most straightforward algorithm is the square-and-multiply algorithm [12]. When only one Montgomery multiplier is available, this algorithm requires 1024 square and 1024 multiply operations for an exponent with a Hamming weight of 1024. For side-channel security [7], we apply our Montgomery multiplier for both the square and the multiply operation. This results in 2048 Montgomery multiplications. A more efficient way to implement modular exponentiation is the k-ary method [3, 6], in which $2^k - 2$ multiplications are performed in the pre-computation phase and $\lceil 1024/k \rceil \cdot (k + 1)$ in the exponentiation phase. Because recent FPGAs contain a lot of block RAM, the k-ary method, which requires the storage of the pre-computed values, is a very efficient way to perform modular exponentiation. Table 2 shows the results of our implementation in comparison with the fastest reported implementations, to our knowledge, described in literature. The speed of our implementation is given for the square-and-multiply as well as the k-ary method. In the case of 1024-bit exponentiation, k = 6 turns out to be the optimal choice. This corresponds to 1259 modular multiplications for one 1024-bit modular exponentiation.
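The choice k = 6 can be reproduced from the multiplication counts given above. The following sketch assumes that the number of exponent windows is 1024/k rounded up, which is the reading that yields the stated total of 1259; the function name `kary_multiplications` is hypothetical.

```python
def kary_multiplications(k, bits=1024):
    """Montgomery multiplications for one k-ary exponentiation.

    Pre-computation: 2^k - 2 multiplications.
    Exponentiation:  ceil(bits / k) windows, each costing k squarings
                     plus one multiplication.
    """
    windows = -(-bits // k)                 # integer ceiling of bits / k
    return (2**k - 2) + windows * (k + 1)


if __name__ == "__main__":
    for k in range(1, 9):
        print(k, kary_multiplications(k))
    best_k = min(range(1, 13), key=kary_multiplications)
    print("best k:", best_k)
```

For k = 1 the formula gives 2048, the same count as the square-and-multiply algorithm with both operations always executed.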
Although it is very hard to compare implementations on different FPGAs, it is clear that our implementation outperforms the architecture presented in [5], especially when using the k-ary method for modular exponentiation. In terms of resources it should be noted that the k-ary method employs extra block RAM to store $2^k$ pre-computed values. However, in most cases this is not a problem, since recently designed FPGAs provide a lot of block RAM.
CONCLUSION
This paper presented two Montgomery multiplication architectures. The first one is a baseline implementation of the improved Montgomery algorithm. The second one is a downsized pipelined version that includes optimizations to reduce the length of the critical path. These optimizations were achieved by using a carry-save representation for the intermediate results. The performance results show that our downsized pipelined implementation is much faster than the state-of-the-art in Montgomery multipliers. 
