The modern era of information and communication technology demands security, which is a prime design parameter alongside other features. Data encryption and decryption algorithms such as RSA and ECC are widely used to achieve the desired level of security. Modular multiplication is an integral part of these algorithms, and Montgomery multiplication is the most efficient algorithm for it. Modular multiplication is slow for large operands, particularly for key sizes above 512 bits. The critical path delay of the architecture determines the iteration delay and hence the overall computation time of encryption and decryption. The slow ripple of carries in Montgomery multiplication is therefore often replaced by carry-save adder architectures. This paper optimizes the traditional approach using a novel combination of an unfolding algorithm and a pre-computation technique. A newer architecture, the MM42_Multiplier, which holds its inputs and outputs in carry-save format and has the smallest critical path, is also modified using the unfolding approach. The novel approach improves the overall computation time by 37.89% and 34.68% for ASIC and FPGA implementations, respectively, compared to the traditional carry-save approach. Further, unfolding the MM42_Multiplier in the present architecture gives an improvement of 10.98% in ASIC and 31.24% in FPGA compared to the original MM42_Multiplier architecture.
Introduction
Modern communication systems are increasingly concerned with the security of information. Cryptography, meaning secret writing, provides encryption and decryption of the information to be transmitted and received, respectively. The time required to encrypt and decrypt the information is a key factor in the efficiency of these algorithms. Using a larger key size for more security is a popular trend, but it increases the overall encryption time. Elliptic curve cryptography (ECC) uses a much smaller key size than the traditional RSA algorithm to achieve the same level of security. For example, the security level achieved with a 2048-bit key in RSA is achieved with a key of only 256 bits in ECC. The basic operation for data encryption in these cryptographic algorithms is a series of modular multiplications of the form given below.
Z = X · Y mod M    (1)
The time required to complete a modular multiplication increases with the key size, which makes the cryptosystem slow. Faster encryption and decryption ease the load on servers and allow web pages to load faster. Various algorithms, such as shift-and-add multiplication, double-add-and-reduce, and Montgomery multiplication, have been developed for fast modular multiplication. Of these, the Montgomery multiplication algorithm is the most efficient and the fastest, and it requires the lowest computational overhead. In Montgomery multiplication, add and shift are the only basic operations. Carry propagation in the adders increases the critical path delay of the circuit, which limits the speed of the operation. The main objective of the proposed work is to decrease the critical path delay and the overall computation time of Montgomery multiplication. To do so, the traditional Montgomery multiplier architecture is modified with unfolding and pre-computation techniques. Techniques based on pre-computation have been reported in the past. We further modify the pre-computation approach by combining it with an unfolding algorithm, which gives better delay performance at the cost of some area. The idea behind unfolding is that it theoretically reduces the overall computation time to half. The next section discusses the previous work on the
Previous Work
Modular arithmetic circuits have undergone several modifications in the past, either to increase speed or to decrease power. Montgomery multiplication computes a product of the form X · Y · R^−1 mod M, where R is a power of two greater than M. The main operations involved in the algorithm are add and shift. Addition is implemented using adders, and carry propagation in these adders creates a long critical path, which limits the speed of the operation.
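To make the add-and-shift structure concrete, the following is a minimal Python sketch of a bit-serial (radix-2) Montgomery multiplication. It assumes an odd modulus M, operands 0 ≤ X, Y < M < 2^K, and R = 2^K; the function name montgomery_multiply and its parameter names are illustrative and do not come from the paper.

```python
def montgomery_multiply(x: int, y: int, m: int, k: int) -> int:
    """Return x * y * 2^(-k) mod m using only additions and one-bit shifts.

    Assumptions (not taken from the paper): m is odd and 0 <= x, y < m < 2**k.
    """
    if m % 2 == 0:
        raise ValueError("the modulus must be odd")
    s = 0
    for i in range(k):
        x_i = (x >> i) & 1
        s += x_i * y      # add the multiplicand when the current bit of x is 1
        if s & 1:         # add the modulus when needed to make the sum even
            s += m
        s >>= 1           # exact division by 2
    if s >= m:            # the running sum stays below 2m, so one subtraction suffices
        s -= m
    return s
```

Each loop iteration contains full-width additions, which is where carry propagation enters the critical path in a hardware implementation. The result carries an extra factor of 2^(-K), which is normally removed by converting operands into and out of the Montgomery domain.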
Proposed Architectural Modifications
Several techniques have been proposed to reduce the carry propagation delay that limits the speed of the circuit. In the modified Montgomery modular multiplication and RSA exponentiation technique of [5], the addition is broken into multiple stages and implemented with carry-save adders. Bunimov et al. [8] proposed a method that uses carry-save adders to reduce the critical path delay. Given below is the Fast Montgomery Multiplication algorithm.
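As an illustration, a Montgomery loop built from two carry-save adders can be modelled in Python as shown here. This is a sketch in the spirit of the carry-save approach, not the listing from [8]: the helper names carry_save_add and montgomery_two_csa are hypothetical, and the model assumes an odd modulus M with 0 ≤ X, Y < M < 2^K.

```python
from typing import Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """Word-level 3:2 compressor: returns (s, cy) with s + cy == a + b + c."""
    s = a ^ b ^ c                                  # bitwise sum without carries
    cy = ((a & b) | (a & c) | (b & c)) << 1        # saved carries, weighted by 2
    return s, cy


def montgomery_two_csa(x: int, y: int, m: int, k: int) -> int:
    """Return x * y * 2^(-k) mod m, keeping the running value as a (sum, carry) pair.

    Assumptions (illustrative): m is odd and 0 <= x, y < m < 2**k.
    """
    s, c = 0, 0
    for i in range(k):
        x_i = (x >> i) & 1
        s, c = carry_save_add(s, c, x_i * y)       # first CSA: add X_i * Y
        q = (s ^ c) & 1                            # parity of the value S + C
        s, c = carry_save_add(s, c, q * m)         # second CSA: add M if the value is odd
        s >>= 1                                    # both words are even here,
        c >>= 1                                    # so the shifts divide exactly by 2
    p = s + c                                      # single carry-propagating addition
    return p - m if p >= m else p
```

No carry ripples inside the loop; the only carry-propagating addition is the final S + C, which is why the per-iteration critical path reduces to that of a full adder.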
The algorithm of [8] uses two carry-save adders, as shown in Fig. 1. The critical path of a carry-save adder equals that of a single full adder, because the carry bits are stored in a register instead of being propagated. A carry-save adder adds three operands and produces two results, a sum and a carry. The final result is obtained by adding these two with a conventional adder. In the architecture of [8], the carry-save adders consume most of the area.

The delay is further reduced by eliminating one of the carry-save adders and pre-computing its result. In the Montgomery multiplication algorithm given in [8], step 3 and step 4 are implemented using carry-save adders. In step 3, when Xi = 0, there is no need to add Y to the sum. In step 4, when S0 = 0 and C0 = 0 (i.e., S0 and C0 are even), there is no need to add M to the sum. Using this observation, the authors of [8] pre-computed the output of the first carry-save adder and stored the values in a lookup table. As shown in Table I, the address (Xi, Y0, S0, C0) is given to the lookup table, and Z is the result it generates. This Z is given as an input to the remaining carry-save adder.

The lookup-table approach of [8], shown in Fig. 2, can be further modified to decrease the delay and the overall computation time of the Montgomery multiplication. We unfold the architecture of the Montgomery multiplication, which requires K iterations to obtain the final modular multiplication result, where K is the number of bits in the inputs A and B. For the circuit shown in Fig. 3, the addition logic is implemented by AND-XOR gates, while the one-bit shift is performed by a shift register.

Copyright © 2018 Helix E-ISSN: 2319-5592; P-ISSN: 2277-3495

However, the addition operations involved in the algorithm remain a problem, because they require conventional adders. Carry ripple propagation in these conventional adders is responsible for the long critical path of the circuit. This bottleneck is resolved by a new Montgomery multiplication design, the MM42_Multiplier [12]. This architecture maintains the inputs A and B and the output Sum in carry-save format, i.e., as (AS, AC), (BS, BC) and (SS, SC), respectively. Two extra registers, D1 and D2, are required to store the pre-computed value (BS + BC + M). This reduces the critical path and accelerates the operation of the Montgomery multiplier.

To further enhance the processing speed of the multiplier, we reduce the delay by applying the unfolding technique. The DFG in Fig. 4 is unfolded using the rule U_i → V_{(i + w) % J}, where w is the delay on the edge, J is the unfolding factor, i = 0, 1, …, J−1, U is the source node and V is the destination node. Fig. 5 shows the unfolded architecture for the unfolding factor J = 2. The data flow graph shows that the hardware required increases by a factor of 2, so the area overhead also increases. The unfolded Montgomery architecture computes the Montgomery multiplication in K/2 cycles, compared to K cycles in the original algorithm, where K is the number of bits in the inputs. This decreases the delay and increases the throughput of the multiplier. Similarly, we unfold the Montgomery multiplier with the pre-computation approach for the unfolding factor 2 and obtain the architecture shown in Fig. 6.

The proposed and the original architectures of the Montgomery multiplier are designed and implemented in VHDL. The designs are synthesized for ASIC technology in Leonardo Spectrum with the TSMC 350 nm technology library.
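The arithmetic of the pre-computation scheme unfolded by a factor of 2 can be checked with a small behavioural model. The Python sketch below is illustrative only and is not the VHDL design: the names build_lut and montgomery_unfolded_lut are hypothetical, the lookup table is addressed by (Xi, Y0, S0, C0) as described in the text, and the model assumes an odd modulus M, 0 ≤ X, Y < M < 2^K, and an even K.

```python
from typing import List, Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """Word-level 3:2 compressor: returns (s, cy) with s + cy == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def build_lut(y: int, m: int) -> List[int]:
    """Pre-compute Z for each address (x_i, y_0, s_0, c_0); Z is 0, Y, M or Y + M."""
    lut = []
    for addr in range(16):
        x_i, y_0 = (addr >> 3) & 1, (addr >> 2) & 1
        s_0, c_0 = (addr >> 1) & 1, addr & 1
        q = s_0 ^ c_0 ^ (x_i & y_0)        # 1 when M must be added to keep S + C + Z even
        lut.append(x_i * y + q * m)
    return lut


def montgomery_unfolded_lut(x: int, y: int, m: int, k: int) -> int:
    """Return x * y * 2^(-k) mod m with one CSA per bit and two bits per loop pass.

    Assumptions (illustrative): m is odd, 0 <= x, y < m < 2**k, and k is even.
    """
    if k % 2:
        raise ValueError("k must be even for an unfolding factor of 2")
    lut = build_lut(y, m)
    y_0 = y & 1

    def stage(s: int, c: int, x_i: int) -> Tuple[int, int]:
        addr = (x_i << 3) | (y_0 << 2) | ((s & 1) << 1) | (c & 1)
        s, c = carry_save_add(s, c, lut[addr])   # the single remaining CSA
        return s >> 1, c >> 1                    # both words are even, so this is exact

    s, c = 0, 0
    for j in range(k // 2):                      # K/2 passes instead of K
        s, c = stage(s, c, (x >> (2 * j)) & 1)       # copy 0 of the loop body
        s, c = stage(s, c, (x >> (2 * j + 1)) & 1)   # copy 1 of the loop body
    p = s + c
    return p - m if p >= m else p
```

The loop runs K/2 times with two copies of the datapath per pass, which mirrors the doubled hardware and halved cycle count of the unfolded architecture. A software model of this kind says nothing about timing; it only confirms that the unfolded datapath computes the same product.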
The functional and post-route simulations of the designs are also carried out in Xilinx ISE 14.5, and the designs are implemented on the Atlys Spartan-6 FPGA development board. The architectures in Table IV are designed using VHDL, and their post-route simulations are verified for functional correctness in the Mentor Graphics tool chain. Table IV gives the area in terms of the number of gates and the number of LUTs for the ASIC and FPGA technologies, respectively. The computation delay is the time required to process one bit of the input to the Montgomery multiplier. The area-delay product is also analyzed. It can be seen from Table IV that pre-computation improves the performance of the two-CSA architecture. Replacing a carry-save adder with the lookup table in the pre-computed architecture reduces both area and delay, as seen in Table IV. The decrease is due to the reduction in the critical path. Although the decrease in area is very small, a significant decrease in delay of 19.05% and 13.87% is observed for ASIC and FPGA, respectively.
Conclusion
This paper studies VLSI implementations of the Montgomery multiplication algorithm that combine the pre-computation and unfolding techniques. With pre-computation, the Montgomery architecture uses only one carry-save adder, compared to two in the traditional design, so the operating frequency increases. We further improve the operating frequency using the unfolding technique. Pre-computation as proposed in [8] improves the speed of the traditional two-CSA architecture by 19%. The unfolded pre-computation architecture improves on the traditional two-CSA architecture by 37.89% and 34.68% for ASIC and FPGA implementations, respectively. Further, the unfolded version of the MM42_Multiplier reduces the delay by 65.47% in ASIC and 40.78% in FPGA compared to the traditional CSA architecture, and it requires the least area.
