As the key lengths used in public-key cryptographic algorithms such as RSA and ECC increase, the speed of Montgomery multiplication becomes a bottleneck. This paper proposes a high-speed design of a Montgomery multiplier. Firstly, a modified scalable high-radix Montgomery algorithm is proposed to shorten the critical path. Secondly, a high-radix clock-saving dataflow is proposed to support high-radix operation with a one-clock-cycle delay in the dataflow. Finally, a hardware-reused architecture is proposed to reduce the hardware cost, and a parallel radix-16 data path design is proposed to increase the speed. Implemented with the HHNEC 0.25 µm standard cell library, the Montgomery multiplier has a total cost of 130 KGates, a clock frequency of 180 MHz, and a throughput of 352 kbps for 1024-bit RSA encryption. This design is suitable for high-speed RSA or ECC encryption/decryption. As a scalable design, it supports encryption/decryption with any key length, limited only by the size of the on-chip memory.
Introduction
Public key cryptography plays a very important role in modern information security. It can be used not only to encrypt and decrypt data, as symmetric cryptography does, but also to provide services such as confidentiality, authentication, data integrity checking and non-repudiation. The RSA algorithm [1], proposed by Rivest, Shamir and Adleman in 1977, is the most widely used public key cryptographic algorithm. The ECC algorithm, introduced by Koblitz [2] and Victor S. Miller [3], is another well-known public key cryptographic algorithm.
Both RSA and ECC use modular multiplication as their primary operation. As the key lengths used in these algorithms increase, the speed of modular multiplication becomes a bottleneck. Many papers have been published on accelerating modular multiplication. To date, the Montgomery modular multiplication algorithm [4] is considered the most efficient algorithm, and many hardware implementations are based on it. Some of them focus on scalable design [5], [6], which enables the hardware to handle encryption/decryption with any key length. Some focus on high-radix design [7]-[9], [11], which reduces the total number of clock cycles per multiplication. Some focus on dataflow optimization [10], which reduces the delay cycles in the dataflow. In this paper, a high-speed design of a Montgomery multiplier is presented. Firstly, the proposed modified scalable high-radix Montgomery algorithm parallelizes the data path and shortens the critical path. Secondly, the proposed high-radix clock-saving dataflow achieves a high-radix design with a one-clock-cycle delay in the dataflow. Finally, a compact hardware design of the Montgomery multiplier is proposed to reduce the hardware cost and increase the speed.
The rest of the paper is organized as follows. The Montgomery algorithm and the proposed modified algorithm are introduced in Sect. 2. The proposed clock-saving dataflow is introduced in Sect. 3. The architecture design of the Montgomery multiplier is presented in Sect. 4. The experimental results and analysis are presented in Sect. 5. Finally, the conclusion is given in Sect. 6. Table 1 shows the notations used in this paper.
Algorithms

Previous Algorithms
Algorithm 1: Montgomery Multiplication Algorithm
[The listings of Algorithm 1 and Algorithm 2 are missing from the extracted text.]
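As a point of reference, the textbook bit-serial (radix-2) form of Montgomery multiplication can be sketched in a few lines of Python. This is a generic illustration and not a transcription of Algorithm 1; the function name `mont_mul_radix2` and the convention that `n` is the operand bit length are assumptions made for the example.

```python
def mont_mul_radix2(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 2**(-n) mod m for odd m, with 0 <= x, y < m < 2**n."""
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1      # scan X from the least-significant bit
        s += x_i * y            # add the partial product
        if s & 1:               # S is odd: add the odd modulus to make it even
            s += m
        s >>= 1                 # exact division by 2
    # The loop keeps S < 2M, so one conditional subtraction is enough.
    return s - m if s >= m else s
```

Each iteration consumes one bit of X, which is why higher radices and shorter pipeline delays translate directly into fewer clock cycles.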
The sign ext operation in steps 12 and 13 is a sign-extension operation. The Booth function in step 3 is used to support high-radix operation. The Booth function is defined as:
Algorithm 2 provides some advantages over Algorithm 1. Firstly, word-based operation makes the multiplier scalable to variable key lengths. Secondly, the high-radix design processes multiple bits of X in every loop iteration (Step 3, Algorithm 2), which reduces the number of clock cycles used in a multiplication. Thirdly, a carry-save adder is introduced to shorten the critical path.
However, Algorithm 2 also has some disadvantages. Firstly, the high-radix design makes the calculation of $q_{Y_j}$ and $q_{M_j}$ very complex, and the path delay increases quickly as the radix grows. This problem can be solved by our proposed algorithm (Sect. 2.2) and data path architecture (Sect. 4). Secondly, the scalable design introduces a data dependence in the pipeline, which causes a two-clock-cycle delay in the pipeline. This problem can be solved by our proposed high-radix clock-saving dataflow in Sect. 3.
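The way a high-radix design consumes several bits of X per loop can be illustrated with the standard radix-$2^k$ Booth recoding. The sketch below is a generic version of that recoding, not the paper's own Booth function; the name `booth_digits` and its parameters are hypothetical. Each digit is formed from the k+1 bits $x_{ki+k-1}, \ldots, x_{ki-1}$ and lies in $[-2^{k-1}, 2^{k-1}]$, which is $[-8, 8]$ for radix 16.

```python
def booth_digits(x: int, k: int, n_digits: int) -> list:
    """Recode x >= 0 into n_digits signed radix-2**k Booth digits (LSD first).

    The recoding is exact when x < 2**(k * n_digits - 1).
    """
    padded = x << 1                  # append the implicit bit x_{-1} = 0
    mask = (1 << (k + 1)) - 1        # window of k + 1 bits
    digits = []
    for i in range(n_digits):
        window = (padded >> (k * i)) & mask
        low = window & 1             # overlap bit x_{ki-1}
        body = window >> 1           # bits x_{ki+k-1} .. x_{ki}
        top = body >> (k - 1)        # bit x_{ki+k-1} carries weight -2**(k-1)
        digits.append(body - (top << k) + low)
    return digits


def booth_value(digits: list, k: int) -> int:
    """Recombine Booth digits into the integer they represent."""
    return sum(d << (k * i) for i, d in enumerate(digits))
```

For a 1024-bit operand, radix-16 recoding yields roughly a quarter as many digits as there are bits, which is the source of the cycle reduction mentioned above.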
Proposed Algorithm
Algorithm 3: Modified scalable high-radix Montgomery multiplication algorithm
[The listing of Algorithm 3 is missing from the extracted text.]
For a high-radix design, the path delay of computing $q_{Y_j}$ and $q_{M_j}$ is large. The proposed algorithm supports parallel calculation of $q_{Y_j}$ and $q_{M_j}$, so it achieves a much shorter critical path than Algorithm 2. It is therefore much more suitable for high-speed hardware implementation of the Montgomery algorithm.
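Sect. 3.2 notes that Algorithm 3 initially multiplies the operand Y by $2^k$. One well-known consequence of such pre-scaling is that the quotient digit for M depends only on the running sum S, so it can be computed in parallel with the multiple of Y. The integer-level sketch below illustrates this effect only. It is not a transcription of Algorithm 3: it uses unsigned digits of X instead of Booth recoding, ignores the word-serial carry-save data path, and the function name, the extra final iteration and the final `% m` reduction are assumptions made for the illustration.

```python
def mont_mul_prescaled(x: int, y: int, m: int, k: int, n_digits: int) -> int:
    """Return x * y * 2**(-k * n_digits) mod m, for odd m and x < 2**(k * n_digits)."""
    radix_mask = (1 << k) - 1
    # -M^{-1} mod 2^k; pow() with a negative exponent needs Python 3.8+.
    m_neg_inv = (-pow(m, -1, 1 << k)) & radix_mask
    y_scaled = y << k                 # Y pre-multiplied by 2^k
    s = 0
    # One extra pass (digit of X equal to 0) absorbs the 2^k pre-scaling.
    for i in range(n_digits + 1):
        q_y = (x >> (k * i)) & radix_mask      # digit of X, independent of S
        q_m = (s * m_neg_inv) & radix_mask     # depends on S only, not on q_y
        # The low k bits of q_y * y_scaled are zero, so the sum is divisible by 2^k.
        s = (s + q_m * m + q_y * y_scaled) >> k
    return s % m
```

Because `q_m` is derived from `s` alone, the two coefficient computations have no mutual dependence inside an iteration, which is the property the paragraph above attributes to the proposed algorithm.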
Data Flows
Previous Dataflows
The most frequently used dataflow for the scalable high-radix Montgomery algorithm, proposed by Tenca and Koç in [7], is shown in Fig. 1(a). It is a pipelined dataflow for Algorithm 2. Because of the data dependence in Algorithm 2 (in Steps 9 and 10, the output $(S_{i-1}, C_{i-1})$ needs both $(S_i, C_i)$ and $(S_{i-1}, C_{i-1})$), there is a two-clock-cycle delay in this dataflow. As a result, this dataflow needs more clock cycles to complete one multiplication. To deal with this problem, Herris proposed a new radix-2 dataflow in [10], which is shown in Fig. 1(b). This dataflow achieves a one-clock-cycle delay by removing the data dependence in Algorithm 2. In Algorithm 2, the right shift (Steps 9 and 10) causes the data dependence. In Herris' dataflow, the right shift of the product S is removed. As a result, the product S is effectively multiplied by 2 in every pipeline stage, so the input data (Y, M) of the next stage must be multiplied by 2 as well.
Tenca's dataflow uses a carry-save result and supports high-radix design. Herris' dataflow achieves a one-clock-cycle delay while using radix 2. However, both Tenca's dataflow and Herris' dataflow are based on Algorithm 2, so they cannot be used with the algorithm proposed in this paper.
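The effect of dropping the right shift can be checked at the integer level. The sketch below is an illustration of the idea only, under the assumption of plain integers rather than words in carry-save form; the function name `mont_mul_no_shift` is hypothetical. Instead of shifting S right in each iteration, Y and M are shifted left by one more bit per iteration, and a single shift by n bits at the end recovers the usual Montgomery product.

```python
def mont_mul_no_shift(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 2**(-n) mod m for odd m, without shifting S inside the loop."""
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1
        s += x_i * (y << i)       # Y is multiplied by 2 once per iteration
        if (s >> i) & 1:          # the parity test moves up to bit i
            s += m << i           # M is multiplied by 2 once per iteration
    # All n low bits have been cleared, so the deferred division is exact.
    assert s & ((1 << n) - 1) == 0
    return (s >> n) % m
```

The growing run of zero low-order bits in S is the software counterpart of the all-zero least-significant word that a word-serial pipeline eventually has to discard.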
Proposed High-Radix Clock-Saving Dataflow
The proposed dataflow is shown in Fig. 1(c). It is based on the proposed Algorithm 3 and achieves both a one-clock-cycle delay and a high-radix design.
As shown in Fig. 1(c), operand Y is initially multiplied by $2^k$ as specified in Algorithm 3, so the input data (Y, M) for each stage becomes ($2^k$Y, M). To achieve a one-clock-cycle delay in the dataflow while supporting high-radix design, the input data ($2^k$Y, M) must be multiplied by $2^k$ cumulatively in each stage (except the first stage).
Compared with the other dataflows, the proposed dataflow has some advantages. Firstly, the combination of high radix and a one-clock-cycle delay means that this dataflow needs very few clock cycles per multiplication. Secondly, Algorithm 3, on which this dataflow is based, achieves a much shorter critical path than the other designs. Our dataflow is therefore much more suitable for a high-speed design of the Montgomery multiplier.
Proposed Hardware Architecture
Hardware-Reused Multiplier Architecture
The proposed Montgomery multiplier is shown in Fig. 2(a). The multiplier's data path contains NS MM Cells. The MM Cell is the basic processing element in the pipeline. There are two coefficient processing elements, the $q_{Y_j}$ PE and the $q_{M_j}$ PE, which can be shared by all of the MM Cells in the pipeline. From Fig. 1(c), it can be seen that $q_{Y_j}$ and $q_{M_j}$ are calculated only in the first cycle of each stage (the grey cycles in Fig. 1(c)). The remaining cycles (the white cycles) do not need to calculate $q_{Y_j}$ and $q_{M_j}$. This property makes it possible to reuse the $q_{Y_j}$ PE and the $q_{M_j}$ PE in the data path.
The FIFO in Fig. 2(a) is used to avoid data overflow in the pipeline. When NW (the number of words in the operands) is larger than NS (the number of stages in the dataflow), data overflow occurs in the pipeline. The FIFO is used to store the overflowed data temporarily.
Parallel Radix-16 MM-Cell Design
The design of the MM Cell is shown in Fig. 2(b). It is a high-radix design. In this paper, we implement a radix-16 (k = 4) MM Cell.
As shown in Algorithm 3, the function of the MM Cell is:
When the proposed dataflow in Fig. 1(c) is used, the input data [...] The range of $q_{Y_j}$ is [−8, 8]. Every number in this range can be split into two components, each of which is a power of 2 (up to sign). For $q_{M_j}$ under radix-16, [...]
The range of $q_{M_j}$ is [0, 15]. Unfortunately, 11 and 13 cannot satisfy this requirement. To deal with this problem, we propose in Table 2 a mapping table from [0, 15] to [−7, 8], which can be used equivalently for $q_{M_j}$.
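The two-component claim can be checked by brute force. The sketch below makes three assumptions that are not stated in the text: a component may be zero, components may be negated, and powers of 2 up to $2^4$ are available. The function `remap_q_m` is one mapping consistent with the stated target range [−7, 8], obtained by subtracting 16 from values above 8; it is a guess at the kind of table meant by Table 2, not a copy of it. Such a mapping leaves the result unchanged modulo 16 because adding or removing 16·M does not affect the low four bits of the sum.

```python
def is_two_component(value: int, max_shift: int = 4) -> bool:
    """True if value = a + b with a, b in {0, +-1, +-2, ..., +-2**max_shift}."""
    terms = [0]
    for shift in range(max_shift + 1):
        terms.extend((1 << shift, -(1 << shift)))
    return any(a + b == value for a in terms for b in terms)


def remap_q_m(q_m: int) -> int:
    """Map q_m in [0, 15] to the congruent value (mod 16) in [-7, 8]."""
    return q_m - 16 if q_m > 8 else q_m


# Values in [0, 15] that cannot be written with two components.
not_decomposable = [v for v in range(16) if not is_two_component(v)]
```

Under these assumptions `not_decomposable` contains exactly 11 and 13, and every remapped value is decomposable, for example 11 maps to −5 = −4 − 1 and 13 maps to −3 = −4 + 1.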
Very Low Complexity Implementation of $q_{M_j}$
As shown in Eqs. (3) and (4), $q_{M_j}$ is much more complex than $q_{Y_j}$. Normally, the $q_{Y_j}$ PE and the $q_{M_j}$ PE are implemented directly as lookup tables. The size of the lookup table grows exponentially with the radix, as illustrated in Fig. 3. For radix 16, the table size of $q_{M_j}$ is 4 times that of $q_{Y_j}$. To reduce the cost of the $q_{M_j}$ calculation, we present a very low complexity implementation of $q_{M_j}$.
In the second group:
The difference between group 1 and group 2 is:
Based on the above analysis, $q_{M_j}$ can be implemented as shown in Fig. 4. Because the modulus M is an odd number, Eq. (5) can be further expressed as:
After $q_{M_j}$ is calculated, $q_{M_j}$ sel can be computed with a small lookup table, in the same way as $q_{Y_j}$ sel.
Analysis and Experimental Results
Based on the proposed dataflow in Fig. 1(c), the total number of clock cycles needed for one Montgomery multiplication is given by Eq. (8). The notation used in this equation is defined in Table 1.
$\frac{N}{k \cdot NS}$ is the number of loops in the pipeline. $\frac{k \cdot NS}{BPW}$ and $\frac{k \cdot NW}{BPW}$ are the extra clock-cycle overheads of our clock-saving dataflow. As shown in Fig. 1(c), the product (S, C) is multiplied by $2^k$ in every stage. When it has grown to $2^{BPW}$(S, C), the LSW (least-significant word) of the product is 0, so one extra clock cycle is needed to eliminate this all-zero LSW.
There are two cases for this equation. When NW ≤ NS, all of the words of Y and M can be loaded into the pipeline, and the number of cycles needed is determined mainly by NS. When NW > NS, the operands Y and M overflow the pipeline. In this case, the FIFO (shown in Fig. 2(a)) is used to store the overflowed data, and the number of cycles is determined mainly by NW. Table 4 compares the clock cycles of our dataflow with those of the Tenca-Koç dataflow and the Herris dataflow. It shows that our dataflow needs far fewer clock cycles than the other two.
In this table, different key lengths and numbers of stages are used to calculate the clock cycles for each dataflow. BPW is equal to 32. Because our dataflow and the Tenca-Koç dataflow are high-radix dataflows, radix 16 is used for these two dataflows in Table 4, while the Herris dataflow uses radix 2.
In Table 4, the NS of the one-clock-cycle-delay dataflow is half that of the two-clock-cycle-delay dataflow. The reason is illustrated in Fig. 5. Each dataflow stage of the two-clock-cycle-delay dataflow in Fig. 5(a) [...] in Fig. 5(a) represent the registering stages in the dataflow. For a fair comparison, the NS of the one-clock-cycle-delay dataflow should be two times that of the two-clock-cycle-delay dataflow.
The ASIC implementation of this work uses NS = 32, BPW = 32 and radix 16. We use the HHNEC 0.25 µm CMOS standard cell library and Synopsys EDA tools for the ASIC design. Table 5 compares the performance of this work with that of other works. [8] is a radix-8 design proposed by Tenca and Koç, and [9] is an improved radix-16 design. Owing to the proposed algorithm and the parallel data path design, this design reaches a higher frequency than [9] at the same radix. [10] is an FPGA implementation using the one-clock-cycle-delay dataflow, [11] is a very high radix design (radix $2^{16}$) using the multipliers and RAM embedded in the FPGA, and [12] is a radix-2 scalable design using the two-clock-cycle-delay dataflow.
Normally, a radix-$2^k$ design achieves about k times the performance of a radix-2 design, and a one-clock-cycle-delay dataflow achieves about 2 times the performance of a two-clock-cycle-delay dataflow. As Table 5 shows, our design, which uses radix 16 with the one-clock-cycle-delay dataflow, achieves much higher performance than the other designs.
Conclusion
This paper proposes a high-speed design of a Montgomery multiplier. The proposed algorithm parallelizes the data path and shortens the critical path. The proposed clock-saving dataflow reduces the total number of clock cycles per multiplication to a very small number. Finally, a very compact hardware architecture is proposed to reduce the hardware cost and improve the performance. The experimental results show that this design achieves very high performance with low hardware cost. This design is very suitable for high-speed RSA or ECC implementation.
