Some of the most popular public key encryption algorithms use exponentiation as their core operation, which can largely be broken down into modular squaring operations. In this paper, we present GF(p) modular squaring algorithms and implement them efficiently in hardware. We present three algorithms, two for squaring and one for reduction, which are combined to provide a general modular squaring algorithm. The algorithms are implemented through datapaths that use redundant Carry-Save Adders, making the computation time independent of the operand precision. The proposed algorithms are compared with each other as well as with existing modular squaring algorithms. The experimental results were obtained by synthesizing the hardware designs for an FPGA Virtex5 chip (xc5vlx50 -ff1153 technology), and they show that the proposed designs are attractive.
INTRODUCTION
Sharing and securing data have become basic and essential needs, increasing the demand for faster and more secure encryption algorithms. There are two main branches in cryptography: symmetric key cryptography, where the same key, which must remain secret, is used for encryption and decryption; and public key cryptography (PKC), where two keys are used. The public key, which is publicly shared, is used to encrypt data; the private key, which is kept private, is used to decrypt data. PKC is the most widely used type of cryptography for authentication and digital signatures (Adleman, Rivest, & Shamir, 1978; Digital signature standard, 2000), and depends heavily on modular arithmetic, including addition, multiplication, and exponentiation (ElGamal, 1998; Hellman & Diffie, 1976).
Among modular arithmetic operations, modular exponentiation is the main operation used in many PKC algorithms, such as the El-Gamal cryptosystem (ElGamal, 1998), the Diffie-Hellman key exchange algorithm (Hellman & Diffie, 1976), RSA (Adleman, Rivest, & Shamir, 1978), and the Digital Signature Standard (Digital signature standard, 2000). Since modular exponentiation can be broken down into successive modular squaring and multiplication operations, the performance of PKC algorithms can be enhanced by using fast and efficient modular squaring algorithms (George, 2005; Wu, Lou, & Chang, 2006; Barrett, 1986; Montgomery, 1985). Karatsuba and Ofman (1963) were among the first to show that multiplication can be done in less than $O(n^2)$ time, where n is the number of digits in the operands. They introduced a recursive algorithm that divides each integer into two parts and continues dividing recursively, up to a certain limit. Comba (1990) presented a multiplication algorithm that improves the classical method of multiplication by employing smart programming optimizations. Although the algorithm is mathematically identical to the classical method, it has the advantage of computing partial products directly, which reduces the number of required memory writes. George (2005) compared the classical multiplication method with the Karatsuba algorithm, the Comba algorithm, and a hybrid of the two. He found that the Karatsuba algorithm outperformed both the classical method and the Comba algorithm, but the best result came from combining Karatsuba's and Comba's algorithms up to a certain breakpoint. Koc (1994) presented the standard squaring algorithm. Wu, Lou, and Chang (2006) showed that Yang-Hseih-Laih's squaring algorithm (Hseih & Laih, 2003) has an error-indexing bug and presented an algorithm to fix it.
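To make the decomposition of modular exponentiation into squarings and multiplications concrete, the following is a minimal sketch of the textbook left-to-right binary (square-and-multiply) method. The function name `mod_exp` and the operation counters are illustrative assumptions and are not taken from the paper.

```python
def mod_exp(base, exponent, modulus):
    """Left-to-right square-and-multiply; returns (result, squarings, multiplications)."""
    if modulus < 1 or exponent < 0:
        raise ValueError("modulus must be >= 1 and exponent must be >= 0")
    result = 1 % modulus
    base %= modulus
    squarings = 0
    multiplications = 0
    # Scan the exponent bits from most significant to least significant.
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus      # one modular squaring per bit
        squarings += 1
        if bit == "1":
            result = (result * base) % modulus    # one modular multiplication per 1-bit
            multiplications += 1
    return result, squarings, multiplications
```

In this sketch every exponent bit costs one modular squaring, while only the 1-bits cost a multiplication, which is why a faster modular squaring unit benefits the whole exponentiation.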
In this work, we present the hardware architecture and simulation results for several algorithms and their implementations. We compare their performance with each other and with other known algorithms in this field. The proposed algorithms are compared against four algorithms: the standard squaring algorithm, the Wu-Lou-Chang squaring algorithm (2006), the Barrett modular reduction algorithm (1986), and the Montgomery modular multiplication algorithm (1985). The experimental results were obtained by synthesizing the hardware designs for an FPGA Virtex5 chip (xc5vlx50 -ff1153 technology) for different operand sizes (in bits): 8, 16, 32, 64, 128, 256, 512, and 1024. The obtained results were compared with the results of the other algorithms using the same operand sizes.
PROPOSED ALGORITHMS
We present three algorithms: two for modular squaring and one for modular reduction. The squaring algorithms are denoted PA1 and PA2 (proposed algorithms 1 and 2). The proposed modular reduction algorithm is denoted PMR. PA1 and PA2 use an idea that arose while researching Carry-Save Adders (CSA).
As in CSAs, we divide the squaring result into two results, that is, sum and carry. Both sum and carry can be calculated with a delay of one full adder. The sum can be calculated in one step (wiring in the case of radix-2, or a LUT in the case of radix-4 and radix-8), and the rest of the algorithm's function is to calculate the carry. Instead of starting with an empty vector for the result, we start with the value of the sum. To get the final result, the sum and carry are added. In addition, all intermediate additions are calculated using CSAs.
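The carry-save idea can be illustrated in software. The sketch below models a bitwise 3:2 carry-save adder on non-negative Python integers; the names `carry_save_add` and `csa_sum` are hypothetical, and the code models only the arithmetic, not the hardware datapath.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: returns (sum_vector, carry_vector) with a + b + c == sum + carry."""
    sum_vector = a ^ b ^ c                              # per-bit sum, no carry propagation
    carry_vector = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted one place
    return sum_vector, carry_vector


def csa_sum(values):
    """Accumulate many operands in redundant (sum, carry) form."""
    sum_vector, carry_vector = 0, 0
    for value in values:
        sum_vector, carry_vector = carry_save_add(sum_vector, carry_vector, value)
    # Only this final step needs a carry-propagate addition.
    return sum_vector + carry_vector
```

Each call to `carry_save_add` corresponds to one full-adder delay regardless of operand width, which is the property that makes the computation time independent of the operand precision.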
PA1 Squaring Algorithm
PA1 is a modification of the Wu-Lou-Chang squaring algorithm (2006). The Wu-Lou-Chang algorithm follows the traditional pencil-and-paper method of squaring. Note that the Wu-Lou-Chang algorithm replaced the term $2 \cdot x_i \cdot x_j$ with $x_i \cdot x_j$ to ensure that the intermediate calculations do not exceed two digits. Also, to improve performance, it retrieves the value of $x_i \cdot x_j$ from a look-up table (LUT). The reader is referred to (Wu, Lou, & Chang, 2006) for more details on this algorithm. We modified this algorithm by starting the inner loop at index i+1 instead of 0. We also replaced $(uv)_b = x_i \cdot x_j$ with $(tuv)_b = 2 \cdot x_i \cdot x_j$, which reduces the number of iterations from $n^2$ to $(n^2 - n)/2$. The modified algorithm (PA1) is shown in Figure 1.
PA1 also uses the sum-carry technique discussed earlier. Although this is a simple modification, we found it reduces the number of iterations by half assuming values for x i * x j are precomputed; the proposed algorithm has been implemented in radix-2, radix-4, and radix-8.
• For radix-2: $x_i \cdot x_j = x_i$ AND $x_j$.
• For radix-4 and radix-8, we need to precompute 16 and 64 values, respectively.
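The loop structure described for PA1 can be sketched in software as follows. This is not the algorithm of Figure 1; it is a simplified model that keeps the digit squares as the "sum" part, accumulates the doubled cross products as the "carry" part with the inner loop starting at i+1, and reads digit products from a precomputed table. The names `to_digits` and `pa1_style_square` are assumptions made for illustration.

```python
def to_digits(x, b):
    """Digits of x in radix b, least significant first."""
    digits = []
    while x:
        digits.append(x % b)
        x //= b
    return digits or [0]


def pa1_style_square(x, b=4):
    """Return (x squared, number of inner-loop iterations) for a radix-b operand."""
    digits = to_digits(x, b)
    n = len(digits)
    # Precomputed digit products: b*b entries (16 for radix-4, 64 for radix-8).
    lut = [[u * v for v in range(b)] for u in range(b)]
    # "Sum" part: the squares of the individual digits at even digit positions.
    sum_part = sum(lut[d][d] * b ** (2 * i) for i, d in enumerate(digits))
    # "Carry" part: doubled cross products, inner loop starting at i + 1.
    carry_part = 0
    iterations = 0
    for i in range(n):
        for j in range(i + 1, n):
            carry_part += 2 * lut[digits[i]][digits[j]] * b ** (i + j)
            iterations += 1
    return sum_part + carry_part, iterations
```

For a 4-digit operand the inner loop runs (16 - 4) / 2 = 6 times instead of 16, matching the iteration count stated above.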
PA2 Squaring Algorithm
The PA2 algorithm is based on an equation in (Hong, Oh, & Yoon, 1996) . They present an idea and derive a recursive equation. Figure 2 shows the derivation of the equation.
The relationship between consecutive values of $C^{(i)}$ can be expressed by the following equation:
When we expand the recursive equation (for a 4-bit operand in radix-2, for example) and divide the resulting equation into two parts, we get the sum and carry parts discussed earlier.
The final form for sum and carry can be calculated as follows:
The general formulas for the sum and carry are shown in Figure 3 . Our PA2 squaring algorithm is shown in Figure 4 .
In the carry equation in Figure 3, $b^{2(i-1)} \cdot C / b^i$ cannot be simplified, because $C / b^i$ is an integer division rather than a floating-point division; to get the correct result we must divide first and then multiply. The following example shows the difference between dividing then multiplying and multiplying then dividing.
1. Suppose C = $(001111)_2$ = $(15)_{10}$ and we want to calculate 8 * C/4. The correct way to calculate it for our algorithm is:
• C/4 = $(000011)_2$ = $(3)_{10}$
• 8 * (C/4) = $(011000)_2$ = $(24)_{10}$
2. If we replace it with 2 * C, or if we multiply first and then divide, we get a wrong result:
• 2 * C = $(011110)_2$ = $(30)_{10}$, or
• 8 * C = $(1111000)_2$, (8 * C)/4 = $(011110)_2$ = $(30)_{10}$
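The order-of-operations point in this example can be checked directly with integer arithmetic. The variable names below are illustrative only.

```python
C = 0b001111  # 15

# Divide first (integer division), then multiply: the order the algorithm requires.
divide_then_multiply = 8 * (C // 4)

# Multiply first, then divide: the low bits of C are not discarded, so the value differs.
multiply_then_divide = (8 * C) // 4

# The same two computations expressed as shifts, as they would appear in hardware.
shift_right_then_left = (C >> 2) << 3
shift_left_then_right = (C << 3) >> 2
```

Dividing first drops the two low-order bits of C before scaling, which is exactly the truncation the carry equation relies on.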
To calculate $b \cdot b^{2(i-1)} \cdot C / b^i$ efficiently, the following operation is used:
When we calculate $x_i \cdot 2 \cdot R$, the value of R is shifted to the left by one bit, then CSAs are used instead of a multiplier to perform the multiplication operation. For example, to calculate 3*2*R:
• First we shift R to the left by one bit: R = R << 1. This represents 2*R.
• To calculate 3*R, we add R + 2R, meaning that we add R to a shifted copy of R (i.e., 3*R = R + 2R = R + (R << 1)).
• All the additions are done using CSAs.
Let us give an example of this algorithm by using it to calculate $231^2$ in radix-4. We have X = $(3213)_4$, n = 4 (i.e., X has 4 digits), and b = 4.
Note that $(3^2)_4 = (21)_4$, $(2^2)_4 = (10)_4$, and $(1^2)_4 = (01)_4$.
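The shift-and-add multiplication by a small digit, and the radix-4 representation used in the example, can be sketched as follows. The helper names are hypothetical, and ordinary integer additions stand in for the CSAs used in the hardware design.

```python
def times_digit_times_two(digit, r):
    """Compute digit * 2 * r for a radix-4 digit (0..3) using only shifts and adds."""
    if not 0 <= digit <= 3:
        raise ValueError("digit must be a radix-4 digit")
    doubled = r << 1              # 2 * R
    result = 0
    if digit & 1:
        result += doubled         # contributes 1 * (2R)
    if digit & 2:
        result += doubled << 1    # contributes 2 * (2R)
    return result


def radix4_digits(x):
    """Radix-4 digits of x, most significant first."""
    digits = []
    while x:
        digits.append(x & 3)
        x >>= 2
    return list(reversed(digits)) or [0]


# Digits of the example operand and the radix-4 form of each distinct digit square.
example_digits = radix4_digits(231)
digit_squares = {d: radix4_digits(d * d) for d in sorted(set(example_digits))}
```

Here `example_digits` holds the digits of 231 as written in X = (3213) in radix-4, and `digit_squares` lists each digit square in radix-4 without leading zeros.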
PMR Modular Reduction Algorithm
We require a precalculation of k values, denoted PreMod, as follows:
When we want to reduce an integer X modulo P, we start by taking the value of the first k bits of X (i.e., X(0 to k-1)) and add to it the precalculated values with indices i for which $x_i = 1$, where k ≤ i < 2k. When we add n k-bit values, the number of additional carry bits is:
The result has r extra bits, so the algorithm is applied again to the result. This operation is repeated until r ≤ 2. If we end up with r = 2, then there will be three values to add: X(0 to k-1), $x_k$ * PreMod(k), and $x_{k+1}$ * PreMod(k+1). In other words:
If we add any three k-bit numbers, bits k and k+1 will never be 1 at the same time (i.e., bits k and k+1 will be $(10)_2$, $(01)_2$, or $(00)_2$). To show why this happens, suppose the worst case of adding three 8-bit numbers where all the bits have a value of 1:
From the example we can see that bits 8 and 9 are $(10)_2$. So, the result of the addition of any three k-bit numbers will have bits k and k+1 equal to $(10)_2$, $(01)_2$, or $(00)_2$.
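The worst-case claim can be checked numerically. The sketch below adds three all-ones k-bit values and extracts bits k and k+1 of the result; the function name is an assumption for illustration.

```python
def worst_case_three_operand_sum(k):
    """Return (sum, bit k+1, bit k) when three all-ones k-bit numbers are added."""
    all_ones = (1 << k) - 1
    total = 3 * all_ones
    bit_k = (total >> k) & 1
    bit_k_plus_1 = (total >> (k + 1)) & 1
    return total, bit_k_plus_1, bit_k
```

Because the largest possible sum is 3 * (2^k - 1), which is smaller than 2^(k+1) + 2^k, bits k and k+1 cannot both be set for any three k-bit operands.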
When we get r = 2, in the worst case we only have to add two k-bit numbers: the first is X(0 to k-1), with 0 ≤ X(0 to k-1) < 2P, and the second is either $x_k$ * PreMod(k) or $x_{k+1}$ * PreMod(k+1), each of which is less than P. The sum therefore satisfies 0 ≤ sum < 3P, which means that we may need to subtract P from the sum at most twice.
If r = 1, then we have the same case as with r = 2. If r = 0, then we just check if the sum is larger than P; if so, we subtract P to get the final result.
To reduce an integer of length n with a prime modulus of length k, where k < n < 2k, we need to apply the algorithm twice if 2 ≤ k ≤ 7, three times if 8 ≤ k ≤ 127, and four times if 128 ≤ k ≤ $2^{127} - 1$. Figure 5 shows the PMR modular reduction algorithm.
To give an example of this algorithm at work, we calculate 17156 mod 131; this is shown in Figure 6.
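The folding step described above can be sketched in software. This is a simplified model and not the algorithm of Figure 5: it assumes PreMod(i) = 2^i mod P for k ≤ i < 2k (an assumption chosen to be consistent with adding PreMod(i) wherever bit i of X is 1), it repeats the folding until no bits remain at position k or above instead of stopping at r ≤ 2, and it uses ordinary additions where the hardware uses CSAs. The names `premod_table` and `pmr_style_reduce` are hypothetical.

```python
def premod_table(p, k):
    """Assumed precomputation: PreMod(i) = 2**i mod p for k <= i < 2k."""
    return {i: pow(2, i, p) for i in range(k, 2 * k)}


def pmr_style_reduce(x, p):
    """Reduce x modulo p by folding the high bits of x through the PreMod table."""
    if p < 2:
        raise ValueError("modulus must be at least 2")
    k = p.bit_length()
    if not 0 <= x < (1 << (2 * k)):
        raise ValueError("x must fit in 2k bits")
    table = premod_table(p, k)
    low_mask = (1 << k) - 1
    while x >> k:                      # some bit at position >= k is still set
        acc = x & low_mask             # X(0 to k-1)
        high = x >> k
        i = k
        while high:
            if high & 1:
                acc += table[i]        # add PreMod(i) when bit i of x is 1
            high >>= 1
            i += 1
        x = acc
    while x >= p:                      # final correction by subtracting p
        x -= p
    return x
```

For X = 17156 and P = 131 (k = 8), the first pass gives 4 + PreMod(8) + PreMod(9) + PreMod(14) = 4 + 125 + 119 + 9 = 257, the second pass gives 1 + 125 = 126, and no subtraction is needed.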
RESULTS
The hardware design of the proposed and original algorithms was coded using VHDL. The tool used for simulation and functional correctness was ModelSim Xilinx Edition III v6.2g. The synthesis tool used to get the experimental results was Xilinx ISE 9.2i. The target technology was set to Virtex5 (xc5vlx50 -ff1153 technology). All algorithms were synthesized on a PC with Core 2 Duo E4300 processor 2.7 GHz and 2 GB of RAM.
The proposed squaring algorithms, the standard squaring algorithm, and the Wu-Lou-Chang squaring algorithm were implemented in radix-2, radix-4, and radix-8. The Montgomery multiplication algorithm, Barrett's reduction algorithm, and our proposed modular reduction algorithm were implemented in radix-2 only. All the algorithms were implemented with input lengths of 8, 16, 32, 64, 128, 256, 512, and 1024 bits. The standard squaring algorithm is denoted SA1, and Wu-Lou-Chang's squaring algorithm SA2. Figure 7 compares the total time of PMR with that of Barrett's reduction algorithm.
We can see from Figure 7 that the PMR modular reduction algorithm has a better time curve than Barrett's reduction. This makes PMR a more suitable option for hardware implementation. 
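For context on this comparison, the sketch below shows a common textbook form of Barrett reduction for operands of up to 2k bits. It is a simplified binary variant and not necessarily the version synthesized for these experiments; the function name is an assumption. The two wide multiplications it performs are marked in the comments.

```python
def barrett_reduce(x, p):
    """Textbook-style Barrett reduction of x modulo p for 0 <= x < 2**(2k)."""
    if p < 2:
        raise ValueError("modulus must be at least 2")
    k = p.bit_length()
    if not 0 <= x < (1 << (2 * k)):
        raise ValueError("x must fit in 2k bits")
    mu = (1 << (2 * k)) // p       # precomputed once per modulus
    q = (x * mu) >> (2 * k)        # multiplication 1: quotient estimate
    r = x - q * p                  # multiplication 2: remove the estimated multiple of p
    while r >= p:                  # the estimate can be slightly too small
        r -= p
    return r
```

The quotient estimate never exceeds the true quotient, so the remainder is non-negative and only a small final correction is needed; the cost is dominated by the two multiplications.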
CONCLUSION
This work presented two new squaring algorithms, PA1 and PA2, and one modular reduction algorithm, PMR, for GF(p). Modular squaring algorithms play an important role in PKC. Some of the most popular public key encryption algorithms, such as the El-Gamal cryptosystem, the Diffie-Hellman key exchange algorithm, RSA, and the Digital Signature Standard, use modular exponentiation as their core operation. Since modular exponentiation can be broken down into modular squaring and multiplication operations, the performance of PKC algorithms can be enhanced by using a fast and efficient modular squaring algorithm.
In this paper we implemented the proposed algorithms in hardware and compared their performances with each other and other already known algorithms. First, we presented the algorithms against which we compared our proposed algorithms. Second, we presented the proposed algorithms and have shown their derivations. The hardware architecture for the proposed algorithms was also presented.
The algorithms that we compared consisted of two squaring algorithms, one modular reduction algorithm, and one modular squaring algorithm. The squaring algorithms were the standard squaring algorithm and the Wu-Lou-Chang squaring algorithm. The modular reduction algorithm was the Barrett reduction algorithm, and the modular squaring algorithm was the Montgomery multiplication algorithm.
We also showed the synthesis results for our algorithms for different bit lengths: 8, 16, 32, 64, 128, 256, 512, and 1024 bits. The results were then compared to the synthesis results of the other algorithms using the same bit lengths. The experimental results were obtained by synthesizing the hardware design for an FPGA Virtex5 chip (xc5vlx50 -ff1153 technology).
Montgomery's multiplication algorithm was discussed and compared with different combinations of squaring algorithms and modular reduction algorithms for radix-2 implementations. Radix-2, radix-4, and radix-8 designs were implemented and compared in terms of area and total execution time. The PA1 algorithm, which is a modification of the SA2 algorithm, showed a good improvement: although it introduces a small increase in area, it improves the total execution time. We have shown that the PA2 algorithm, which is an implementation of an equation found in (Hong, Oh, & Yoon, 1996), in combination with PMR has the best area and total execution time results and performs better than Montgomery's multiplication algorithm in radix-2.
Our proposed modular reduction algorithm (PMR) was shown to have low area requirements compared to Barrett's reduction algorithm. This is because Barrett's algorithm requires two multipliers, whereas PMR uses only adders and CSAs. PMR also showed better time performance than Barrett's reduction algorithm. This makes PMR a much better choice to be combined with squaring algorithms to form efficient modular squaring algorithms.
