Abstract: Montgomery modular multiplication is one of the fundamental operations used in cryptographic algorithms, such as RSA and Elliptic Curve Cryptosystems. At CHES 1999, Tenca and Koç proposed the Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm and introduced a now-classic architecture for implementing Montgomery multiplication in hardware. With parameters optimized for minimum latency, this architecture performs a single Montgomery multiplication in approximately 2n clock cycles, where n is the size of operands in bits. In this paper, we propose two new hardware architectures that are able to perform the same operation in approximately n clock cycles with almost the same clock period. These two architectures are based on precomputing partial results using two possible assumptions regarding the most significant bit of the previous word. These two architectures outperform the original architecture of Tenca and Koç in terms of the product of latency times area by 23 and 50 percent, respectively, for several of the most common operand sizes used in cryptography. The architecture in radix-2 can be extended to the case of radix-4, while preserving a factor of two speedup over the corresponding radix-4 design by Tenca, Todorov, and Koç from CHES 2001. Our optimization has been verified by modeling it using Verilog-HDL, implementing it on a Xilinx Virtex-II 6000 FPGA, and experimentally testing it using the SRC-6 reconfigurable computer.
the number of scanning steps was reduced, the complexity of control and computational logic increased substantially at the same time. In [7], Harris et al. implemented the MWR2MM algorithm in a quite different way, i.e., left shifting Y and M instead of right shifting S. Their approach was able to process an n-bit precision Montgomery multiplication in approximately n clock cycles, while keeping the scalability and simplicity of the original implementation. In [8] and [9], the left-shifting technique was applied to the radix-2 and radix-4 versions of the parallelized Montgomery algorithm [10], respectively. In [11], Michalski and Buell introduced an MWRkMM algorithm, which is derived from the Finely Integrated Operand Scanning method described in [12]. The MWRkMM algorithm requires the built-in multipliers in the FPGA device to speed up the computation, which makes the implementation expensive. The systolic high-radix design by McIvor et al. described in [13] is also capable of very high-speed operation, but suffers from the same disadvantage of large area requirements for fast multiplier units. A different approach based on processing multiprecision operands in carry-save (CS) form has been presented in [14]. This architecture is optimized for minimum latency and is particularly suitable for repeated sequences of Montgomery multiplications, such as the sequence used in modular exponentiation (e.g., RSA).
In this paper, we focus on the optimization of hardware architectures for MWR2MM and MWR4MM algorithms in order to minimize the number of clock cycles required to compute an n-bit precision Montgomery multiplication. We start with the introduction of Montgomery multiplication in Section 2. Then, the classic MWR2MM architecture is discussed. The new optimized architecture, which is able to perform the n-bit precision MWR2MM algorithm in approximately n clock cycles, is presented in Section 3. In Section 4, we propose an alternative optimized architecture that is able to achieve the same performance goal with simpler logic design. In Section 5, the high-radix version of our new architecture is introduced. In Section 6, we first compare our two optimized architectures with three previous architectures from the conceptual point of view. Then, the hardware implementations of all discussed architectures are presented and contrasted with each other. Finally, in Section 7, we present the summary and conclusions for this work.
MONTGOMERY MULTIPLICATION ALGORITHM
Let $M > 0$ be an odd integer. In many cryptosystems, such as RSA, computing $X \cdot Y \pmod{M}$ is a crucial operation. The reduction of $X \cdot Y \pmod{M}$ is a more time-consuming step than the multiplication $X \cdot Y$ without reduction. In [2], Montgomery introduced a method for calculating products $\pmod{M}$ without the costly reduction $\pmod{M}$, since then known as Montgomery multiplication. Montgomery multiplication of $X$ and $Y$ $\pmod{M}$, denoted by $MP(X, Y, M)$, is defined as $X \cdot Y \cdot 2^{-n} \pmod{M}$ for some fixed integer $n$. Since Montgomery multiplication is not an ordinary multiplication, there is a conversion process between the ordinary domain (with ordinary multiplication) and the Montgomery domain. The conversion between the ordinary domain and the Montgomery domain is given by the relation $X \leftrightarrow X'$, where $X' = X \cdot 2^{n} \pmod{M}$. The corresponding diagram is shown in Table 1. Table 1 shows that the conversion is compatible with multiplications in each domain, since
$$MP(X', Y', M) \equiv X' \cdot Y' \cdot 2^{-n} \equiv (X \cdot 2^{n}) \cdot (Y \cdot 2^{n}) \cdot 2^{-n} \equiv X \cdot Y \cdot 2^{n} \equiv (X \cdot Y)' \pmod{M}.$$
The conversion between the two domains can be done using the same Montgomery operation, in particular $X' = MP(X, 2^{2n} \bmod M, M)$ and $X = MP(X', 1, M)$, where $2^{2n} \pmod{M}$ can be precomputed. Despite the initial conversion cost, we achieve an advantage over ordinary multiplication if we do many Montgomery multiplications followed by an inverse conversion at the end, which is the case, for example, in RSA.
Algorithm 1. Radix-2 Montgomery Multiplication
Algorithm 1 shows the pseudocode for the radix-2 Montgomery multiplication, where we choose $n = \lfloor \log_2 M \rfloor + 1$, i.e., $n$ is the size of $M$ in bits.
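As an illustration of the radix-2 procedure and of the two domain conversions, the following Python sketch works on whole integers, without any word splitting. The function names `mont_mul`, `to_mont`, and `from_mont` are hypothetical and chosen only for this example; the final conditional subtraction is the usual correction step that brings the result below $M$.

```python
def mont_mul(X, Y, M, n):
    """Return X * Y * 2^(-n) mod M for odd M, assuming 0 <= X, Y < M < 2^n."""
    S = 0
    for i in range(n):
        x_i = (X >> i) & 1      # scan X bit by bit, least significant bit first
        S += x_i * Y
        if S & 1:               # add M so that the sum becomes divisible by 2
            S += M
        S >>= 1
    if S >= M:                  # S < 2M here, so one subtraction is enough
        S -= M
    return S


def to_mont(X, M, n):
    """Ordinary domain -> Montgomery domain: X' = MP(X, 2^(2n) mod M, M)."""
    return mont_mul(X, pow(2, 2 * n, M), M, n)


def from_mont(X_prime, M, n):
    """Montgomery domain -> ordinary domain: X = MP(X', 1, M)."""
    return mont_mul(X_prime, 1, M, n)


# Small example values, picked only for illustration.
M = 239
n = M.bit_length()
a, b = 123, 77
a_m, b_m = to_mont(a, M, n), to_mont(b, M, n)
product = from_mont(mont_mul(a_m, b_m, M, n), M, n)
```

In this sketch, `product` is the ordinary modular product of `a` and `b`, obtained with two forward conversions, one Montgomery multiplication, and one inverse conversion.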
The verification of the above algorithm is given below. Let us define $S[i]$ as
$$S[i] \equiv \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} x_j \cdot 2^{j} \right) \cdot Y \pmod{M},$$
with $S[0] = 0$. Then, $S[n] \equiv X \cdot Y \cdot 2^{-n} \pmod{M} = MP(X, Y, M)$. $S[n]$ can be computed iteratively using the following dependence:
$$S[i+1] \equiv \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i} x_j \cdot 2^{j} \right) \cdot Y \equiv \frac{1}{2} \left( S[i] + x_i \cdot Y \right) \pmod{M}.$$
Therefore, depending on the parity of $S[i] + x_i \cdot Y$, we compute $S[i+1]$ as
$$S[i+1] = \frac{S[i] + x_i \cdot Y}{2} \quad \text{or} \quad \frac{S[i] + x_i \cdot Y + M}{2},$$
to make the numerator divisible by 2. Since $Y < M$ and $S[0] = 0$, one has $0 \le S[i] < 2M$ for all $0 \le i < n$. In [15], [16], it is shown that the result of a Montgomery multiplication $X \cdot Y \cdot 2^{-n} \pmod{M} < 2M$ when $X, Y < 2M$ and $2^{n} > 4M$. As a result, by redefining $n$ to be the smallest integer such that $2^{n} > 4M$, the subtraction at the end of Algorithm 1 can be avoided and the output of the multiplication can be directly used as an input for the next Montgomery multiplication.
In Algorithm 2 [4], the operand $Y$ (multiplicand) is scanned word-by-word, and the operand $X$ is scanned bit-by-bit. The operand length is $n$ bits, and the wordlength is $w$ bits. $e = \lceil (n+1)/w \rceil$ words are required to store $S$ since its range is $[0, 2M-1]$. The original $M$ and $Y$ are extended by one extra bit of 0 as the most significant bit. Presented as vectors,
OPTIMIZING MWR2MM ALGORITHM
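Assuming that $C^{(j)} \le 3$, we obtain
$$\left( C^{(j+1)}, S^{(j)} \right) = C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)} + S^{(j)} \le 3 + 3 \cdot (2^{w} - 1) = 3 \cdot 2^{w} \le 2^{w+2} - 1. \quad (5)$$
From (5), we have $C^{(j+1)} \le 3$. By induction, $C^{(j)} \le 3$ is ensured for any $0 \le j \le e - 1$. Additionally, based on the fact that $S \le 2M$, we have $C^{(e)} \le 1$.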
The data dependency graph of the hardware implementation for the MWR2MM algorithm by Tenca and Koç is shown in Fig. 1. Each circle in the graph represents an atomic computation and is labeled according to the type of action performed. Task A consists of computing lines 2.3 and 2.4 in Algorithm 2. Task B corresponds to computing lines 2.6 and 2.7 in Algorithm 2.
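The listing of Algorithm 2 is not reproduced in this text, so the following Python sketch is a reconstruction based only on the surrounding description: the quotient bit $q_i$ is derived from the least significant bits, the words are accumulated with a small carry, and a one-bit right shift is applied across word boundaries. The mapping of the two commented steps to Tasks A and B, and the function name `mwr2mm`, are assumptions made for illustration.

```python
def mwr2mm(X, Y, M, n, w):
    """Word-serial radix-2 Montgomery product (software model, not hardware).

    Returns (S, max_carry) with S congruent to X*Y*2^(-n) mod M and 0 <= S < 2M,
    assuming M is odd and 0 <= X, Y < M < 2^n.
    """
    e = -(-(n + 1) // w)                      # e = ceil((n + 1) / w) words
    mask = (1 << w) - 1
    # Word e of Y and M is zero padding, used only to flush the last carry.
    Yw = [(Y >> (w * j)) & mask for j in range(e + 1)]
    Mw = [(M >> (w * j)) & mask for j in range(e + 1)]
    S = [0] * (e + 1)
    max_carry = 0
    for i in range(n):
        x = (X >> i) & 1
        # Task A (assumed): quotient bit and the least significant word.
        q = (x & Yw[0] & 1) ^ (S[0] & 1)
        t = x * Yw[0] + q * Mw[0] + S[0]
        C, S[0] = t >> w, t & mask
        for j in range(1, e + 1):
            max_carry = max(max_carry, C)
            # Task B (assumed): next word, then shift the previous word right.
            t = C + x * Yw[j] + q * Mw[j] + S[j]
            C, S[j] = t >> w, t & mask
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e] = 0
    result = sum(S[j] << (w * j) for j in range(e))
    return result, max_carry


# Illustrative operands (arbitrary values with Y < M and M odd).
M, X, Y = 0xF123, 0x1234, 0xABCD
n = M.bit_length()
S, max_carry = mwr2mm(X, Y, M, n, w=4)
```

The returned `max_carry` makes it possible to observe the bound on the carry variable for a given set of operands.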
The data dependencies among the operations within the j loop make it impossible to execute the steps in a single iteration of the j loop in parallel. However, parallelism is possible among executions of different iterations of the i loop. In [4], Tenca and Koç suggested that each column in the graph may be computed by a separate processing element (PE), and the data generated from one PE may be passed into another PE in a pipelined fashion. Following this method, all atomic computations represented by circles in the same row can be processed concurrently. The processing of each column takes $e + 1$ clock cycles (one clock cycle for Task A, $e$ clock cycles for Task B). Because there is a delay of two clock cycles between the processing of a column for $x_i$ and the processing of a column for $x_{i+1}$, the minimum computation time is $2n + e - 1$ clock cycles, provided that $P_{max} = \lceil (e+1)/2 \rceil$ PEs are implemented to work in parallel. In this configuration, after $e + 1$ clock cycles, PE #0 switches from executing column 0 to executing column $P_{max}$. After another two clock cycles, PE #1 switches from executing column 1 to executing column $P_{max} + 1$, etc.
The opportunity for improving the implementation performance of Algorithm 2 lies in reducing the delay between the processing of two subsequent iterations of the i loop from two clock cycles to one clock cycle. The two-clock-cycle delay comes from the right shift (division by two) in both Algorithms 1 and 2. Take the first two PEs in Fig. 1 as an example. These two PEs compute the $S$ words in the first two columns. Starting from clock #0, PE #1 has to wait for two clock cycles before it starts the computation of $S^{(0)}(i=1)$ in clock cycle #2. In order to halve this two-clock-cycle delay, we propose an approach of precomputing the partial results using two possible assumptions regarding the most significant bit of the previous word. As shown in Fig. 2, PE #1 can take the $w-1$ most significant bits of $S^{(0)}(i=0)$ from PE #0 at the beginning of clock #1, do a right shift, and compute two versions of $S^{(0)}(i=1)$, based on the two different assumptions about the most significant bit of this word at the start of computations. At the beginning of clock cycle #2, the previously missing bit becomes available as the least significant bit of $S^{(1)}(i=0)$. This bit can be used to choose between the two precomputed versions of $S^{(0)}(i=1)$. Similarly, in clock cycle #2, two different versions of $S^{(0)}(i=2)$ and $S^{(1)}(i=1)$ are computed by PE #2 and PE #1, respectively, based on two different assumptions about the most significant bits of these words at the start of computations. At the beginning of clock cycle #3, the previously missing bits become available as the least significant bits of $S^{(1)}(i=1)$ and $S^{(2)}(i=0)$, respectively. These two bits can be used to choose between the two precomputed versions of these words. The same pattern of computations is repeated in subsequent clock cycles. Furthermore, since $e$ words are enough to represent the values in $S$, $S^{(e)}$ is discarded in our designs.
Therefore, e clock cycles are required to compute one iteration of S.
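The precomputation with two assumptions can be checked in software with a functional model. The Python sketch below is not cycle-accurate and does not model registers or pipelining; it only shows that computing both candidate sums for each word, and then selecting one with the late-arriving bit, yields the same Montgomery product. The names `two_branches` and `mwr2mm_two_branch`, as well as the bookkeeping of unshifted words in `T`, are assumptions introduced for this example, and the model requires $w \ge 2$.

```python
def two_branches(known_bits, addend, w):
    """Both candidate sums for a right-shifted word whose MSB is still unknown."""
    even = known_bits + addend                      # missing MSB assumed to be 0
    odd = ((1 << (w - 1)) | known_bits) + addend    # missing MSB assumed to be 1
    return even, odd


def mwr2mm_two_branch(X, Y, M, n, w):
    """Functional model of the two-assumption scheme (requires w >= 2).

    Returns S congruent to X*Y*2^(-n) mod M with 0 <= S < 2M,
    assuming M is odd and 0 <= X, Y < M < 2^n.
    """
    e = -(-(n + 1) // w)
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e)]
    Mw = [(M >> (w * j)) & mask for j in range(e)]
    T = [0] * e     # unshifted sum words produced by the previous iteration
    top = 0         # carry out of the last word (at most 1)
    for i in range(n):
        x = (X >> i) & 1
        # Bit 0 of the shifted word S^(0) is bit 1 of the unshifted word T[0].
        q = (x & Yw[0] & 1) ^ ((T[0] >> 1) & 1)
        C, new_T = 0, []
        for j in range(e):
            even, odd = two_branches(T[j] >> 1, C + x * Yw[j] + q * Mw[j], w)
            # The missing MSB is the LSB of the next word of the previous iteration.
            late_bit = (T[j + 1] & 1) if j + 1 < e else top
            total = odd if late_bit else even
            C = total >> w
            new_T.append(total & mask)
        T, top = new_T, C
    # Apply the final right shift to obtain the result words.
    words = [((T[j + 1] & 1 if j + 1 < e else top) << (w - 1)) | (T[j] >> 1)
             for j in range(e)]
    return sum(word << (w * j) for j, word in enumerate(words))


# Illustrative operands (arbitrary values with Y < M and M odd).
M, X, Y = 0xF123, 0x1234, 0xABCD
n = M.bit_length()
S = mwr2mm_two_branch(X, Y, M, n, w=4)
```

In hardware, the selection happens one clock cycle after the two candidates are registered; the sequential loop above hides that timing and keeps only the arithmetic.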
The proposed optimization technique can be applied to both nonredundant and redundant representations of the partial sum $S$, as demonstrated in Fig. 3. It is logically straightforward to apply the approach when $S$ is represented in nonredundant form because each digit of $S$ consists of only one bit. When $S$ is represented in redundant Carry-Save form, each digit of $S$ consists of two bits, the sum (SS) bit and the carry (SC) bit. As shown in Figs. 3b and 3c, after the update of $S^{(j)}$, only the sum bit of S , has been already computed and can be forwarded to the next PE together with $S'^{(j)}_{w-1..1}$. Then, the same approach can be applied to update $S^{(j)}$. In the remainder of this paper, we use the nonredundant form in all the diagrams and descriptions for the sake of simplicity. The corresponding diagrams and implementations in redundant format can be derived from the nonredundant case accordingly.
The data dependency of the optimized architecture for implementing the MWR2MM algorithm is shown in Fig. 4. Similar to the original implementation by Tenca and Koç, each circle in the graph of Fig. 4 represents an atomic computation. Task D consists of three steps: the computation of $q_i$, the calculation of two sets of possible results, and the selection between these two sets of results using an additional input, the bit $S^{(1)}_0$. The exact approach to avoiding the extra clock cycle delay due to the right shift is detailed as follows, taking Task E as an example. Each PE first computes two versions of $C^{(j+1)}$ and $S^{(j)}$: one version assumes that the bit $S^{(j+1)}_0$ is equal to one, and the other assumes that this bit is equal to zero. Both results are stored in registers. At the same moment, the bit $S^{(j+1)}_0$ becomes available and this PE can output the correct $C^{(j+1)}$ and $S^{(j)}$. For Task D, the computation of $q_i$ is performed in addition to the computation of $C^{(1)}$ and $S^{(0)}$. The diagram of the PE logic is given in Fig. 5. The signals at the left and right sides are for interconnection purposes. The carry $C$ is fed back to the core logic of the same PE. The signal $x_i$ remains unchanged during the computation of a whole column in Fig. 4. $S^{(j)}$ is a word of the final output at the end of the computation of the whole multiplication.
The core logic in Fig. 5 consists of two parts, the combinational logic and a finite state machine. The multiplications $x_i \cdot Y^{(j)}$ and $q_i \cdot M^{(j)}$ are shown to be carried out using multiplexers. A row of $w$ AND gates is another implementation option. On FPGA devices, the designer may leave the choice of the actual implementation to the synthesis tool for the best trade-off between speed and area. The direct implementation of the two branches (i.e., lines 4.1 and 4.2 in Algorithm 4) requires the use of two ripple-carry adders,$^{1}$ each of which has three $w$-bit inputs and a carry. It is easy to see that these two additions differ only in the most significant bit of the $S$ word and share all remaining operand bits. Therefore, it is desirable to consolidate the shared part of these two additions into one ripple-carry adder with three $(w-1)$-bit inputs and a carry. The remaining separate parts are then carried out using two small adders. With this implementation, the resource requirements increase only marginally while the computation covers two different cases. When $S$ is represented in redundant form (see Fig. 3c), only one additional Full Adder (FA) is required to cover the two possible cases of $SS^{(j)}_{w-1}$.
The optimized architecture keeps the scalability of the original architecture described in [4]. Fig. 6 illustrates how to use $p$ PEs to implement the MWR2MM algorithm. Both $M^{(j)}$ and $Y^{(j)}$ are moved from left to right every clock cycle through registers. $S^{(j)}$ is registered inside each PE. Therefore, it can be passed into the next PE directly. The total computation time $T$, in clock cycles, when $p$ stages are used in the pipeline to process $n$-bit operands, is given by
$$T = \begin{cases} n + e - 1, & \text{if } e \le p, \\ n + k(e - p) + e - 1, & \text{otherwise,} \end{cases} \quad (6)$$
where $k = \lfloor n/p \rfloor$. The first case shown in (6) represents the situation when there are at least as many PEs as words. Then, it takes $n$ clock cycles to scan the $n$ bits in $X$ and another $e - 1$ clock cycles to compute the remaining $e - 1$ words in the last iteration. The second case models the condition when the number of words in the operand is larger than the number of PEs. If we define a kernel cycle as the computation in which $p$ bits of $X$ are processed, then there is an $(e - p)$-clock-cycle extra delay between two kernel cycles. In this case, $k$ complete and one partial kernel cycles are required to process all $n$ bits in $X$. Overall, the new architecture is capable of reducing the processing latency to approximately half of the latency of the Tenca-Koç design, given the maximum number of PEs. Fig. 7 demonstrates these two different cases with a simplified example.
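The two cases of the latency analysis can be explored with a small scheduling model. The Python sketch below is an assumption-laden illustration, not the authors' tool: it assumes that each column starts at least one clock cycle after the previous column, that PE number $i \bmod p$ handles column $i$, and that a PE is occupied for $e$ clock cycles per column. The closed form in `arch1_formula` follows the description in the text ($n + e - 1$ when there are enough PEs, and $k$ extra gaps of $e - p$ cycles otherwise, with $k = \lfloor n/p \rfloor$) and is likewise a reconstruction.

```python
def arch1_latency(n, e, p):
    """Clock cycles to process n bits of X with e words per column and p PEs."""
    start = []
    for i in range(n):
        # Data dependency: one clock cycle after the previous column started.
        ready = start[i - 1] + 1 if i > 0 else 0
        # Resource constraint: the PE that ran column i - p must be free again.
        pe_free = start[i - p] + e if i >= p else 0
        start.append(max(ready, pe_free))
    return start[-1] + e


def arch1_formula(n, e, p):
    """Closed form as described in the text (reconstruction, k = floor(n / p))."""
    k = n // p
    return n + e - 1 if e <= p else n + k * (e - p) + e - 1


# Example: 1024-bit operands, 16-bit words (e = 65), and 20 PEs.
cycles_sim = arch1_latency(1024, 65, 20)
cycles_formula = arch1_formula(1024, 65, 20)
```

Under this model, the simulated schedule and the closed form agree whenever $n$ is not an exact multiple of $p$, which corresponds to the case of $k$ complete kernel cycles followed by one partial kernel cycle.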
If $e > p$, the output from the rightmost PE is fed into a queue and processed by the leftmost PE later. This is the example shown in Fig. 7b. Since there is an $(e - p)$-clock-cycle extra delay between two kernel cycles, the length of the queue $Q$ is determined as
$$Q = \begin{cases} 0, & \text{if } e \le p, \\ e - p, & \text{otherwise.} \end{cases}$$
In order to distinguish this architecture from the other architecture, which is described in Section 4, the architecture discussed in this section is called Architecture 1 hereafter.
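The consolidation of the two branch adders in the PE core logic can be illustrated at the arithmetic level. In the Python sketch below, `branches_direct` performs the two full additions separately, while `branches_shared` uses one shared addition over the $w - 1$ low bits and two small additions for the top part. Both function names and the argument layout (`s_known` holds the $w - 1$ known bits of the shifted $S$ word, `y_word` and `m_word` stand for $x_i \cdot Y^{(j)}$ and $q_i \cdot M^{(j)}$) are assumptions made for this example, and the integers model adder outputs rather than gate-level structure.

```python
def branches_direct(s_known, y_word, m_word, c_in, w):
    """Two independent additions, one per assumption about the missing MSB."""
    even = s_known + y_word + m_word + c_in
    odd = ((1 << (w - 1)) | s_known) + y_word + m_word + c_in
    return even, odd


def branches_shared(s_known, y_word, m_word, c_in, w):
    """Same results with one shared low adder and two small top adders."""
    low_mask = (1 << (w - 1)) - 1
    # Shared part: the low w-1 bits of all three operands plus the carry-in.
    low = s_known + (y_word & low_mask) + (m_word & low_mask) + c_in
    low_bits, low_carry = low & low_mask, low >> (w - 1)
    # Separate parts: only the top bit position differs between the branches.
    top_common = (y_word >> (w - 1)) + (m_word >> (w - 1)) + low_carry
    even = (top_common << (w - 1)) | low_bits
    odd = ((top_common + 1) << (w - 1)) | low_bits
    return even, odd


# Example with w = 4: known bits 0b101, operand words 0b1110 and 0b1011, carry 2.
example = branches_shared(0b101, 0b1110, 0b1011, 2, 4)
```

The two branches differ by exactly $2^{w-1}$, which is why only the small top adders need to be duplicated.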
THE ALTERNATIVE OPTIMIZED HARDWARE ARCHITECTURE OF MWR2MM ALGORITHM
In Section 3, we presented the optimization technique for improving the performance of the original implementation architecture by Tenca and Koç. In this section, we present an alternative optimized hardware architecture for implementing the MWR2MM algorithm. The corresponding data dependency graph is shown in Fig. 8. Similar to the previous data dependency graphs in Figs. 1 and 4, the computation of each column in Fig. 8 can be processed by one separate PE. Similarly to the graph in Fig. 4, there is only one clock cycle latency between the processing of two adjacent columns in this data dependency graph.
1. Ripple-carry adders are used when S is represented in nonredundant form. When S is represented in redundant form, carry-save adders should be used instead.
These three data dependency graphs map Algorithm 2 following different strategies, as shown in Fig. 9. In Figs. 1 and 4, each column corresponds to a single iteration of the i loop and covers all iterations of the j loop, as shown in Figs. 9a and 9b, respectively. In contrast, each column in Fig. 8 corresponds to a single iteration of the j loop and covers all iterations of the i loop, as shown in Fig. 9c.
Following the data dependency graph in Fig. 8, we present an alternative hardware architecture of the MWR2MM algorithm in Fig. 10. This architecture can finish the computation of a Montgomery multiplication of $n$-bit operands in $n + e - 1$ clock cycles. Furthermore, this alternative design is simpler than the approach given in [4] in terms of control logic and data path logic. Hereafter, we call this alternative architecture Architecture 2.
As shown in Fig. 10d, Architecture 2 consists of $e$ PEs forming a computation chain. Each PE focuses on the computation of a specific word in $S$, i.e., PE #$j$ only works on $S^{(j)}$. In other words, each PE corresponds to one fixed value of $j$ in the inner loop of Algorithm 2. Meanwhile, all PEs scan different bits of operand $X$ at the same time. The same optimization technique is applied to avoid the extra clock cycle delay due to the right shift. The pseudocode in Algorithm 4 describes the function and internal logic of PE #$j$. The function of the combinational logic is given by lines 4.1 and 4.2. Lines 4.3 to 4.8 are implemented using two 2-to-1 multiplexers, shown in the diagram to the right of the Register. Fig. 11 demonstrates the computations of the first three PEs in the first three clock cycles. The internal logic of all PEs is the same except for the two PEs residing at the head and tail of the chain. PE #0, shown in Fig. 10a as the cell of type D, is also responsible for computing $q_i$ and has no $C^{(j)}$ input. This PE implements Algorithm 3. PE #$(e-1)$, shown in Fig. 10c as type F, has only one internal branch because the most significant bit of $S^{(e-1)}$ is equivalent to $C^{(e)}_0$, which is determined at the beginning of every clock cycle. This PE implements Algorithm 5.
Algorithm 5. Computations in Task F
Two shift registers parallel to the PEs carry $x_i$ and $q_i$, respectively, and do a right shift every clock cycle. Before the start of multiplication, all registers, including the two shift registers and the internal registers of PEs, should be reset to zeros. All the bits of $X$ will be pushed into the shift register one by one and followed by zeros. The second shift register will be filled with values of $q_i$ computed by PE #0 of type D. All the registers can be enabled at the same time after the multiplication process starts because the additions of $Y^{(j)}$ and $M^{(j)}$ will be nullified by the zeros in the two shift registers before the values of $x_0$ and $q_0$ reach a given stage.
The internal register of PE #$j$ keeps the value of $S^{(j)}$ that should be shifted one bit to the right for the next round of calculations. This feature gives us two options to generate the final product.
1. The contents of $S^{(j)}_{w-1..0}$ can be stored in $e$ clock cycles after PE #0 finishes the calculation of the most significant bit of $X$, i.e., after $n$ clock cycles, and then the circuit can do a right shift on all accumulated bits.
2. One more round of calculation can be performed right after the round with the most significant bit of $X$. In order to do so, one bit of "0" needs to be pushed into the two shift registers to make sure that the additions of $Y^{(j)}$ and $M^{(j)}$ are nullified and the only operation performed by the circuit is the right shift. Then, the contents of $S^{(j)}_{w-1..0}$ are collected in $e$ clock cycles after PE #0 finishes its extra round of calculations. These words are concatenated to form the final product.
After the final product is generated, there are two methods to collect it. If the internal registers of PEs are disabled after the end of computation, the entire result can be read in parallel after $n + e - 1$ clock cycles. Alternatively, the results can be read word by word in $e$ clock cycles by connecting internal registers of PEs into a shift register chain.
The exact way of collecting the results largely depends on the application. For example, in the implementation of RSA, a parallel output would be preferred; while in the ECC computations, reading results word by word may be more appropriate.
HIGH-RADIX ARCHITECTURE OF MONTGOMERY MULTIPLICATION
The concepts illustrated in Figs. 4 and 8 can be adapted to the design of high-radix hardware architectures for Montgomery multiplication. Instead of scanning one bit of $X$ at a time, several bits of $X$ can be scanned together in high-radix cases. Assuming $k$ bits of $X$ are scanned at one time, $2^k$ branches should be covered at the same time to maximize the performance. Considering that the value of $2^k$ increases exponentially with $k$, the design becomes impractical beyond radix-4. Following the same definitions regarding words as in Algorithm 2, the radix-4 version of Montgomery multiplication is shown as Algorithm 6. Two bits of $X$ are scanned in one step this time instead of one bit as in Algorithm 2. When reaching the maximal parallelism, the radix-4 design takes $n/2 + e - 1$ clock cycles to process an $n$-bit Montgomery multiplication.
The carry variable $C$ has three bits, which can be proven in a similar way to the proof for the radix-2 case. The value of $q^{(i)}$ at line 6.3 of Algorithm 6 is defined by a function involving the two least significant bits of $S^{(0)}$, $x^{(i)} \cdot Y^{(0)}$, and $M^{(0)}$, chosen so that the two least significant bits of $S^{(0)} + x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)}$ are zero. The multiplication by 3, which is necessary to compute $x^{(i)} \cdot Y^{(j)}$ and $q^{(i)} \cdot M^{(j)}$, can be done on the fly or avoided by using Booth recoding as discussed in [6]. Using Booth recoding would require adjusting the algorithm and architecture to deal with signed operands.
Furthermore, we can generalize Algorithm 6 to handle the MWR$2^k$MM algorithm. In general, $x^{(i)}$ and $q^{(i)}$ are both $k$-bit variables. $x^{(i)}$ is a $k$-bit digit of $X$, and $q^{(i)}$ is defined by (10).
Nevertheless, the implementation of the proposed optimization for the case of $k > 2$ would be impractical in the majority of applications.
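The effect of scanning $k$ bits of $X$ per step can be illustrated at the integer level. The Python sketch below is not the multiple-word Algorithm 6; it is a whole-integer model in which the digit $q^{(i)}$ is obtained with the standard Montgomery choice $q = -S \cdot M^{-1} \bmod 2^k$, used here as an assumption in place of the paper's own definition of $q^{(i)}$, which is not reproduced in this text. The function name `mont_mul_radix` is hypothetical.

```python
def mont_mul_radix(X, Y, M, n, k):
    """Radix-2^k Montgomery product on whole integers (software model).

    Returns (S, steps) with S congruent to X*Y*2^(-k*steps) mod M, 0 <= S < 2M,
    where steps = ceil(n / k), assuming M is odd and 0 <= X, Y < M < 2^n.
    """
    radix = 1 << k
    steps = -(-n // k)
    # -M^(-1) mod 2^k exists because M is odd (pow with exponent -1: Python 3.8+).
    m_neg_inv = (-pow(M, -1, radix)) % radix
    S = 0
    for i in range(steps):
        x_digit = (X >> (k * i)) & (radix - 1)   # k bits of X per step
        S += x_digit * Y
        q_digit = (S * m_neg_inv) % radix        # clears the low k bits of S + q*M
        S = (S + q_digit * M) >> k
    return S, steps


# Illustrative operands (arbitrary values with Y < M and M odd), radix 4.
M, X, Y = 0xF123, 0x1234, 0xABCD
n = M.bit_length()
S4, steps4 = mont_mul_radix(X, Y, M, n, k=2)
```

With $k = 2$ the number of scanning steps is halved compared with the radix-2 case, at the price of digit products by values up to 3, which matches the multiplication-by-3 issue discussed above.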
HARDWARE IMPLEMENTATION AND COMPARISON OF DIFFERENT ARCHITECTURES
In this section, we compare five major types of architectures for Montgomery multiplication from the point of view of the number of PEs and latency in clock cycles.
In the architecture by Tenca and Koç, the number of PEs can vary between one and $P_{max} = \lceil (e+1)/2 \rceil$. The larger the number of PEs, the smaller the latency, but the larger the circuit area. This feature allows the designer to choose the best possible trade-off between these two requirements.
The architecture by Harris et al. [7] has scalability similar to that of the original architecture by Tenca and Koç [4]. Instead of right shifting the intermediate $S^{(j)}$ values, their architecture left shifts $Y$ and $M$ to avoid the data dependency between $S^{(j)}$ and $S^{(j-1)}$. The data processing diagram of Harris' architecture is shown in Fig. 12. For the design with the number of PEs optimized for minimum latency, the architecture by Harris reduces the number of clock cycles from $2n + e - 1$ (for Tenca and Koç [4]) to $n + 2e - 1$.
Our optimized architecture, Architecture 1, is built using similar concepts to the architecture by Tenca and Koç. However, it is able to reduce the processing latency to approximately half while preserving the scalability of the original architecture.
Our alternative architecture, Architecture 2, and the architecture by McIvor et al. both have fixed size, optimized for minimum latency. Our architecture consists of e PEs, each operating on operands of the size of a single word. The architecture by McIvor et al. consists of just one PE, operating on multiprecision numbers represented in the carry-save form. The final result of the McIvor architecture obtained after n clock cycles is expressed in the carry-save redundant form. In order to convert this result to the nonredundant binary representation, additional e clock cycles are required, which makes the total latency of this architecture comparable to the latency of our architecture. In the sequence of modular multiplications, such as the one required for modular exponentiation, the conversion to the nonredundant representation can be delayed to the very end of computations. Therefore, each subsequent Montgomery multiplication can start every n clock cycles. The similar property can be implemented in our architecture by starting a new multiplication immediately after the first PE, PE #0, has released the first least significant word of the final result.
Architecture 2 can be parameterized in terms of the value of the word size $w$. The larger $w$, the smaller the number of PEs, but the larger the size of a single PE. Additionally, the larger $w$, the smaller the maximum clock frequency, especially in the nonredundant representation. The latency expressed in the number of clock cycles is equal to $n + \lceil (n+1)/w \rceil - 1$, and is almost independent of $w$ for $w \ge 16$. Since actual FPGA-based platforms, such as SRC-6 used in our implementations, have a fixed target clock frequency, this target clock frequency determines the optimum value of $w$. Additionally, the same HDL code can be used for different values of the operand size $n$ and the parameter $w$, with only a minor change in the values of respective constants.
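The latency expressions quoted in this paper for designs optimized for minimum latency can be tabulated with a few lines of Python. The sketch below only evaluates the formulas stated in the text ($2n + e - 1$ for Tenca and Koç, $n + 2e - 1$ for Harris et al., $n + e - 1$ for Architectures 1 and 2, and $n/2 + e - 1$ for the radix-4 version, with $e = \lceil (n+1)/w \rceil$); the function and key names are hypothetical, and clock period and area are not modeled.

```python
def num_words(n, w):
    """e = ceil((n + 1) / w), the number of w-bit words."""
    return -(-(n + 1) // w)


def latency_cycles(n, w):
    """Minimum latency in clock cycles for each design, per the formulas in the text."""
    e = num_words(n, w)
    return {
        "tenca_koc": 2 * n + e - 1,
        "harris": n + 2 * e - 1,
        "arch1_arch2": n + e - 1,
        "arch2_radix4": n // 2 + e - 1,   # assumes an even operand size n
    }


# Example: 1024-bit operands with 16-bit and 32-bit words.
lat_w16 = latency_cycles(1024, 16)
lat_w32 = latency_cycles(1024, 32)
```

For $n = 1024$, moving from $w = 16$ to $w = 32$ changes the latency of Architecture 2 by only 32 clock cycles, which illustrates why the cycle count is almost independent of $w$ for large operands.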
Both optimized architectures, Architecture 1 and Architecture 2, have been implemented in Verilog HDL, and their codes have been verified using reference software implementation. The results completely matched.
We have selected the Xilinx Virtex-II 6000 FF1517-4 FPGA device used in the SRC-6 reconfigurable computer for the prototype implementations. The synthesis tool was Synplify Pro 9.1 and the Place and Route tool was Xilinx ISE 9.1.
We have implemented four different sizes of multipliers, 1,024, 2,048, 3,072, and 4,096 bits, respectively, in the radix-2 case using Verilog-HDL to verify our approach. The resource utilization on a single FPGA is shown in Table 2 . For comparison, we have implemented the multipliers of these four sizes following the hardware architectures by Tenca and Koç and by Harris et al. as well. Additionally, we have implemented the approach based on CSA (Carry-Save Addition) from [14] as a reference. The purpose is to show how the MWR2MM architecture compares with other types of architectures in terms of resource utilization and performance.
The word size $w$ is fixed at 16 bits for most of the architectures implementing the MWR2MM algorithm. Moreover, the 32-bit case of Architecture 2 is tested as well to show the trade-off among clock rate, minimum latency, and area. In order to maximize the performance, we used the maximum number of PEs in the implementation of all three scalable architectures, i.e., the architecture by Tenca and Koç [4], the architecture by Harris et al. [7], and Architecture 1. Therefore, the queue (shown in Fig. 6) is not implemented in any of these three cases. In the implementation of these four architectures, $S$ is represented in nonredundant form. In other words, ripple-carry adders are used in the implementation.
In order to minimize the critical path delay in the ripple-carry addition of $C + S^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$, this three-input addition with carry is broken into two two-input additions. As shown in Fig. 13a, $x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$ is precomputed one clock cycle ahead of its addition with $S^{(j)}$. This technique is applied to the implementation of all four cases to maximize the frequency. This design point is appropriate when the target device is an FPGA device with abundant hardware resources. When the area constraint is of high priority, or $S$ is represented in redundant form (as suggested in [4], [5], [7]), this frequency-oriented technique may become unnecessary. The actual implementation of the second two-input addition with a 2-bit carry in the Xilinx Virtex-II device is illustrated in Fig. 13d. $w + 2$ full adders (FAs) and $w$ half adders (HAs) form two parallel chains to perform the addition. Considering the $w$ FAs used in the first addition, the implementation of the logic in Fig. 13a requires $3w + 2$ FAs or HAs. Compared with the $2w$ FAs used in Fig. 3c, the nonredundant pipelined implementation of Montgomery multiplication consumes approximately 50 percent more hardware resources than the implementation in redundant form on the Xilinx Virtex-II platform.
From Table 2, we can see that both Architecture 1 and Architecture 2 (radix-2 and $w = 16$) give a speedup by a factor of almost two compared with the architecture by Tenca and Koç [4] in terms of latency expressed in the number of clock cycles. The minimum clock period is comparable in both cases and extra propagation delay in our architecture is introduced only by the multiplexers directly following the Registers, as shown in Figs. 5 and 10.
The resource requirements of the PE in three scalable architectures are very close to each other because most of their logic is the same. The implementations of both Harris' architecture and Architecture 1 use twice as many PEs as the architecture by Tenca and Koç. At the same time, they both require only about 44 percent more resources (in LUTs) compared with the Tenca and Koç's architecture. This feature is due to the way LUTs are counted by implementation tools; namely, LUT is counted as one even if not all of its inputs are used. A close observation of the area report by Synplify Pro reveals that in the cases of both Harris' architecture and Architecture 1, the percentage of fully or close-to-fully used LUTs is much higher than in case of Tenca and Koç's architecture.
Architecture 2 occupies 16 percent less resources than the architecture by Tenca and Koç in terms of LUTs, although our Architecture 2 uses almost twice as many PEs. This result is mainly due to the fact that our PE shown in Fig. 10b is substantially simpler than the PE in the architecture by Tenca and Koç [4]. The PE in [4] is responsible for calculating multiple columns of the dependency graph shown in Fig. 1. Therefore, it must switch its function between Tasks A and B, depending on the phase of calculations. In contrast, in our Architecture 2, each PE is responsible for only one column of the dependency graph in Fig. 8 and one Task, either D or E or F. Additionally in [4], the words $Y^{(j)}$ and $M^{(j)}$ must rotate with regard to PEs, which further complicates the control logic.
Compared with the architecture by McIvor et al. [14], our Architecture 2 (radix-2 and $w = 16$) has a comparable latency expressed in the number of clock cycles. In terms of clock frequency, McIvor's architecture is better by 40-47 percent, but in terms of area, our architecture is superior by almost a factor of 2. As a result, Architecture 2 outperforms McIvor's design in terms of the product of latency times area by about 20 percent.
In Table 3 , performance gain of various architectures against the architecture of Tenca and Koç is summarized. Harris' architecture, Architecture 1, and Architecture 2 all consistently outperform the classic architecture by Tenca and Koç in terms of both latency and the product of latency times area, for all four investigated operand sizes. Both Harris' architecture and Architecture 1 achieve a gain of around 20 percent regarding the product of latency times area. Architecture 2 can achieve a gain up to 50 percent due to much smaller resource requirements.
In all investigated architectures, the time between two consecutive Montgomery multiplications can be further reduced by overlapping computations for two consecutive sets of operands. In the original architecture by Tenca and Koç, this repetition interval is equal to 2n clock cycles, and in all other investigated architectures n clock cycles.
For the radix-4 case, we have implemented only Architecture 2, for four different operand sizes (1,024, 2,048, 3,072, and 4,096 bits), as a showcase in Table 4. The wordlength is the same as the one in the radix-2 case, i.e., 16 bits. For all four cases, the maximum frequency is comparable for both radix-2 and radix-4 designs. Moreover, the minimum latency of the radix-4 designs is almost half of that of the radix-2 designs. In the meantime, the radix-4 designs occupy more than twice as many resources as the radix-2 versions. These figures fall within our expectations because the radix-4 PE has four internal branches, twice as many as the radix-2 version, and some small design tweaks were required to compensate for the propagation delay increase caused by more complicated combinational logic. Some of these optimization techniques are listed below. As mentioned at the beginning of Section 5, the hardware implementation of our optimization beyond radix-4 is no longer viable considering the large resource cost for covering all the $2^k$ branches in one clock cycle, and the need to perform multiplications of words by numbers in the range $0..2^k - 1$.
CONCLUSION
In this paper, we present two new hardware architectures for Montgomery multiplication. These architectures are based on the new idea of enhancing parallelism by precomputing partial results using two different assumptions regarding the most significant bit of each partial result word. Additionally, Architecture 2 introduces a new data dependency graph, aimed at significantly simplifying the control logic and the data path of each PE. Architecture 2 also outperforms the architecture by McIvor et al. in terms of the product of latency times area by about 20 percent for all operand sizes. These two new architectures can be extended from radix-2 to radix-4 in order to further reduce their circuit latency at the cost of increasing the product of latency times area. Our architectures have been fully verified by modeling them using Verilog-HDL, and comparing their function with a reference software implementation of Montgomery multiplication based on the GMP library. Our code has been implemented on a Xilinx Virtex-II 6000 FPGA and experimentally tested on the SRC-6 reconfigurable computer. Our architectures can be easily parameterized, so the same generic code with different values of parameters can be easily used for multiple operand and word sizes.
