Abstract
Introduction
Multiplication in a finite field is essential to many encryption algorithms including RSA, Diffie-Hellman key exchange, the Digital Signature Algorithm, and elliptic curve cryptography [1]. The two common finite Galois fields are GF($2^n$), used for elliptic curves, and GF(p), used for most other algorithms. Multiplication in a prime field GF(p) is performed modulo some prime p. Multiplication in a binary extension field GF($2^n$) is performed modulo some irreducible polynomial f(x) of degree n.
It is implemented identically to GF(p) except that carries are not propagated. Therefore addition reduces to the XOR operation.
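As a minimal sketch of this carry-free arithmetic (not taken from the paper), field elements can be modeled in Python as integers whose bit k holds the coefficient of x^k. The function name gf2_mul and the example polynomial 0x11B (x^8 + x^4 + x^3 + x + 1, the AES field polynomial) are illustrative assumptions.

```python
def gf2_mul(a: int, b: int, f: int) -> int:
    """Multiply a and b in GF(2^n) modulo the irreducible polynomial f.

    Bit k of each integer is the coefficient of x^k. f has degree n,
    so bit n of f is set; a and b are assumed to be below 2^n.
    """
    n = f.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition without carry propagation is XOR
        b >>= 1
        a <<= 1              # multiply a by x
        if (a >> n) & 1:
            a ^= f           # reduce modulo f(x)
    return result


# Hypothetical usage with the AES field polynomial (an assumption, see above).
print(hex(gf2_mul(0x57, 0x83, 0x11B)))  # 0xc1
```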
Cryptographic computations are time-consuming because they operate on precisions of 256 to 2048 or more bits and require large numbers of multiplications to perform exponentiation.
The Montgomery multiplication algorithm [2] is commonly used because it avoids division by the modulus. Many software and hardware implementations of Montgomery multiplication have been proposed. Software uses repeated multiplication and addition instructions [3, 4]. Radix-2 hardware designs operate in a word-serial fashion with addition as the basic operation [5]. Higher-radix designs use fewer cycles at the expense of requiring multiplications or memories containing precomputed multiples [6, 7, 8].
Hardware designs are said to be scalable if they can work on variable precision limited only by memory capacity. They are unified if they handle both GF(p) and GF($2^n$) on the same array [9]. This paper proposes an improvement on the Tenca-Koç scalable unified radix-2 Montgomery multiplier [5] with half the latency for small and moderate-precision operands. The paper begins by reviewing Montgomery multiplication and the Tenca-Koç algorithm. It then describes how to left-shift input operands rather than right-shift results to avoid a bottleneck waiting for the most significant bit of each result word. The queue size is also cut in two by converting results to nonredundant format before storing them. Delay, area, and power results for a Verilog implementation synthesized to a Xilinx FPGA are discussed.
Montgomery Multiplication
We would like to compute $Z = X \times Y \bmod M$, where the operands have n bits of precision and M is an odd number in the range $2^{n-1} < M < 2^n$. In GF(p), M is the prime p. In GF($2^n$), M is a binary representation of an irreducible polynomial and carries are not propagated between columns in the multiplication. The modulo operation is expensive because it involves division.
Montgomery [2] observed that the divisions can be converted into simple shifts if multiplication is instead performed on so-called Montgomery residues (M-residues). The M-residue of an integer a ($0 \le a < M$) is defined to be $\bar{a} = ar \bmod M$, where $r = 2^n$. For example, if r = 16 and M = 11, then $\bar{3} = 3 \times 16 \bmod 11 = 4$. There is an isomorphism between integers in this range and their Montgomery residues.
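The worked example above can be checked with a few lines of Python. This is only an illustration: the helper name to_m_residue is an assumption, and the values M = 11 and r = 16 are the ones used in the text.

```python
M = 11          # modulus from the example in the text
n = 4
r = 1 << n      # r = 2^n = 16


def to_m_residue(a: int) -> int:
    """Map an integer 0 <= a < M to its M-residue, a * r mod M."""
    return (a * r) % M


residues = [to_m_residue(a) for a in range(M)]
print(to_m_residue(3))                      # 4, as in the text
print(sorted(residues) == list(range(M)))   # True: the mapping is one-to-one
```

The mapping is one-to-one because r is a power of two and M is odd, so r has an inverse modulo M.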
The modular multiplicative inverse $b^{-1}$ of an integer b is that number such that $b b^{-1} \bmod M = 1$.
It may not be immediately obvious that multiplying by $r^{-1} \bmod M$ is an easier problem than simply multiplying mod M. However, there is a simple radix-2 algorithm for doing so, given in Figure 1. This algorithm computes $S = MM(X, Y) = XYr^{-1} \bmod M$. It uses n steps, as in a word-serial multiplication algorithm. On step i, it adds Y to the running sum if the ith bit of X ($x_i$) is 1. Also on each step, it divides by two. If the running sum is odd, it first adds M so the result can be divided by two without loss of information. This is permissible because adding a multiple of M does not change the result modulo M. After n steps of dividing by two, the algorithm has divided by $r = 2^n$. The algorithm might produce a result as large as 2M-1, so it concludes by subtracting M if necessary to restore the result to the legal range. The final subtraction can be avoided for repetitive multiplications used in exponentiation [10], so we will ignore it through the rest of this paper. In summary, the algorithm computes the Montgomery product using only 2n n-bit additions and n one-bit right shifts, which is substantially simpler than conventional modular multiplication with division.
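The steps described above can be sketched in Python as follows. Figure 1 is not reproduced in this text, so this is a reading of the prose rather than a transcription of the figure; the function name mont_mul and its argument order are assumptions, and y is assumed to be below m.

```python
def mont_mul(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 2^(-n) mod m for odd m with 2^(n-1) < m < 2^n."""
    z = 0
    for i in range(n):
        if (x >> i) & 1:
            z += y       # add Y when bit x_i is set
        if z & 1:
            z += m       # make the sum even without changing it mod m
        z >>= 1          # exact division by two
    if z >= m:           # the running sum can be as large as 2m - 1
        z -= m
    return z


# With the earlier example values r = 16 and M = 11:
print(mont_mul(4, 5, 11, 4))  # 4, since 4 * 5 * 16^(-1) mod 11 = 4
```

Each of the n iterations performs at most two additions and one right shift, which matches the operation count given above.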
Conversion to and from M-residues is accomplished using Montgomery multiplication: $\bar{a} = MM(a, r^2 \bmod M)$ and $a = MM(\bar{a}, 1)$. Note that $r^2 \bmod M$ should be precomputed to make this efficient; this is easy for cryptographic systems that change M infrequently. Now a long sequence of multiplications, like those required in exponentiation, can be performed by converting the operands to M-residues, performing Montgomery multiplications, and converting the result back to an integer.
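A minimal sketch of such a sequence is a square-and-multiply exponentiation carried out entirely on M-residues. The names mont_mul and mont_exp are assumptions, the final subtraction is kept for simplicity, and the M-residue of 1 is computed directly rather than with a Montgomery multiplication. The function also counts Montgomery multiplications.

```python
def mont_mul(x: int, y: int, m: int, n: int) -> int:
    """Radix-2 Montgomery product x * y * 2^(-n) mod m (m odd, y < m)."""
    z = 0
    for i in range(n):
        if (x >> i) & 1:
            z += y
        if z & 1:
            z += m
        z >>= 1
    return z - m if z >= m else z


def mont_exp(base: int, exponent: int, m: int, n: int):
    """Return (base^exponent mod m, number of Montgomery multiplications)."""
    count = 0

    def mm(a: int, b: int) -> int:
        nonlocal count
        count += 1
        return mont_mul(a, b, m, n)

    r2 = pow(2, 2 * n, m)        # r^2 mod m, precomputed once per modulus
    base_bar = mm(base, r2)      # convert the base to its M-residue
    acc = (1 << n) % m           # M-residue of 1
    for i in reversed(range(exponent.bit_length())):
        acc = mm(acc, acc)               # square
        if (exponent >> i) & 1:
            acc = mm(acc, base_bar)      # multiply
    return mm(acc, 1), count     # convert the result back to an integer
```

For an exponent of at most n bits, this sketch uses at most 2n Montgomery multiplications in the loop plus one conversion in each direction.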
Figure 2. Tenca-Koç multiple word radix-2 Montgomery multiplication algorithm
Z is commonly stored in carry-save redundant format for fast addition. In a unified design, we use a modified carry-save adder that forces the carry to zero when operating in GF($2^n$) [9]. Define a bit cell to perform addition for one bit of the partial product of $x_i$ and $y_j$. In a typical design, a bit cell would contain two full adders, two AND gates, and some registers, as will be shown later in Figures 5 and 8. Assume the bit cell requires a full cycle to operate. We will compare the number of bit cells and the number of cycles required by various algorithms. The radix-2 Montgomery multiplication algorithm uses n bit cells and n cycles so its area-delay product is $n^2$.
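The behavior of the modified carry-save adder can be illustrated with a word-level Python model. This is a behavioral sketch under assumptions, not the circuit of [9]: the function name unified_csa and the gf2_mode flag are hypothetical.

```python
def unified_csa(a: int, b: int, c: int, gf2_mode: bool):
    """3:2 carry-save addition of three words, returned as (sum, carry).

    In GF(p) mode, sum + carry equals a + b + c.
    In GF(2^n) mode the carry is forced to zero, so the result is a ^ b ^ c.
    """
    s = a ^ b ^ c                                   # bitwise sum of each column
    if gf2_mode:
        carry = 0                                   # carries are not propagated
    else:
        carry = ((a & b) | (a & c) | (b & c)) << 1  # majority, weighted by 2
    return s, carry


s, c = unified_csa(0b1011, 0b0110, 0b0101, gf2_mode=False)
print(s + c)   # 22, the ordinary sum 11 + 6 + 5
s, c = unified_csa(0b1011, 0b0110, 0b0101, gf2_mode=True)
print(s, c)    # 8 0, the XOR of the three inputs with no carry
```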
The Tenca-Koç Algorithm
The algorithm of Figure 1 requires an adder with a precision of n so it is not scalable to different values of n. Tenca and Koç present a multiple word radix-2 Montgomery multiplication algorithm that uses hardware with a fixed word width w [5]. We will review this algorithm and its performance before describing how to improve it. For n = ew, the hardware is reused e times. Let $M = (M^{e-1}, \ldots, M^1, M^0)$, $Y = (Y^{e-1}, \ldots, Y^1, Y^0)$, and $X = (x_{n-1}, \ldots, x_1, x_0)$, where words are indicated with superscripts and bits with subscripts. M, Y, and Z are zero-extended to e+1 words to avoid overflow.
The algorithm is given in Figure 2. The outer loop iterates over all n bits of X. The inner loop iterates over e+1 words of M, Y, and Z. Z is odd if the least significant bit is 1. Z is right-shifted by one bit at each step to divide by two. Note that the least significant bit of $Z^j$ must be computed before it can be right-shifted into the most significant position of $Z^{j-1}$ on the jth step of the inner loop. This is a critical limitation of the algorithm.
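The loop structure described above can be modeled in software as follows. Figure 2 is not reproduced in this text, so the sketch is a reading of the prose and not a transcription of the figure; the name mwr2mm, the list-of-words representation, and the restrictions w >= 2, n divisible by w, and y < m are assumptions. The final subtraction is omitted, so the result lies below 2m.

```python
def mwr2mm(x: int, y: int, m: int, n: int, w: int) -> int:
    """Word-level radix-2 Montgomery product; returns a value below 2m."""
    e = n // w                           # number of w-bit words (n = e * w)
    mask = (1 << w) - 1
    # Split Y and M into e + 1 words, zero-extended by one word.
    Y = [(y >> (w * j)) & mask for j in range(e + 1)]
    Mw = [(m >> (w * j)) & mask for j in range(e + 1)]
    Z = [0] * (e + 1)
    for i in range(n):                   # outer loop over the bits of X
        xi = (x >> i) & 1
        odd = (Z[0] + xi * Y[0]) & 1     # parity of the sum before adding M
        carry = 0
        for j in range(e + 1):           # inner loop over the words
            t = Z[j] + xi * Y[j] + odd * Mw[j] + carry
            carry, Z[j] = t >> w, t & mask
            if j > 0:
                # Right shift: the LSB of word j becomes the MSB of word j-1,
                # so word j-1 cannot be finished until word j is computed.
                Z[j - 1] = (Z[j - 1] >> 1) | ((Z[j] & 1) << (w - 1))
        Z[e] >>= 1                       # top word shifts in a zero
    return sum(word << (w * j) for j, word in enumerate(Z))


print(mwr2mm(4, 5, 11, 4, 2))  # 4, matching 4 * 5 * 16^(-1) mod 11
```

The update of word j-1 inside the inner loop is where the dependency on the next word's least significant bit appears in this model.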
Observe that the only dependency in the outer loop is that $Z^j$ for iteration i must be known before $Z^j$ for iteration i+1 can be computed. A hardware implementation of the Tenca-Koç algorithm unrolls the outer loop to use a pipelined kernel of p w-bit processing elements (PEs). Each PE contains two w-bit adders and two banks of w AND gates to add $x_i \times Y^j$ and $odd \times M^j$ to $Z^j$ and registers to hold the results. A PE must wait two cycles to kick off after its predecessor until $Z^0$ is available because $Z^1$ must first be computed and shifted. In one kernel cycle, p bits of X are processed. Hence k = n/p kernel cycles are required to do the entire computation. We will assume for simplicity that n is divisible by p and w; this is usually true because all three parameters are typically powers of 2.
Figure 3 shows a block diagram of the scalable Montgomery multiplier. The kernel contains p w-bit PEs for a total of wp bit cells. Z is stored in carry-save redundant form. If PE p completes $Z^0$ before PE1 has finished $Z^{e-1}$, the result must be queued until PE1 becomes available again. The design in [5] queues the results in redundant form, requiring 2w bits per entry. For large n the queue consumes significant area, so we propose converting Z to nonredundant form to save half the queue space, as shown in Figure 4. On the first cycle, Z is initialized to 0. When no queuing is needed, the carry-save redundant Z' is bypassed directly to avoid the latency of the carry-propagate adder. The nonredundant Z result is also an output of the system.
Figure 5 shows a design of the processing element. It uses 2w carry-save adders and 2-input AND gates, a 2:1 multiplexer, and 4w+5 register bits. The odd parity of the least-significant word of Z is stored to determine whether M should be added. On each cycle, Z is right-shifted so that the most significant bit of the previous word becomes the least significant bit of the next word.
Figure 6 shows a pipeline diagram of the Tenca-Koç architecture indicating which bits are processed on each cycle. There are two dependencies for PE1 to begin a kernel cycle, indicated by the gray arrows: PE1 must be finished with the previous cycle, and the $Z_{w-1:0}$ result of the previous kernel cycle must be ready at PE p. We assume that there is a two-cycle latency to bypass the result from PE p to account for the FIFO and routing. The computation time in clock cycles is
$$T = \begin{cases} k(e+1) + 2(p-1), & e > 2p \\ k(2p+1) + e - 2, & e \le 2p \end{cases}$$
The first case corresponds to a large number of words. Each kernel cycle requires e+1 clock cycles for the first PE to handle one bit of X. The output of PE p must be queued until the first PE is ready again. The second case corresponds to lower precision where a small number of words are necessary. Each kernel cycle takes 2p clock cycles before the final PE produces its first word and one more cycle to bypass the result back. k kernel cycles are needed. Finally, e-2 cycles are required to obtain the more significant words at the end of the final kernel cycle.
In other words, if there are relatively few small PEs, the latency is approximately $ke = n^2/wp$, so the area-delay product is $n^2$; the design is efficient. On the other hand, if there are many PEs or the PEs are too wide (wp exceeding approximately n/2), the latency is approximately 2kp = 2n and the area-delay product is $2nwp > n^2$. When large amounts of hardware are available, the minimum latency is a factor of two worse than the simple radix-2 design. In the next section, we will see how to improve the latency.
Improved Scalable Architecture
The fundamental problem with the Tenca-Koç architecture is the two-cycle latency from one PE to the next caused by waiting to right-shift Z. Instead, we propose to begin operating on the least significant bits of Z as soon as they are available. Rather than right-shifting Z, we left-shift Y and M at each step. Now, each PE can begin immediately after its predecessor. At the end of each kernel cycle, p/w additional cycles are necessary to complete the most significant words rather than the single cycle previously used to handle overflow. For convenience, we assume that p is a multiple of w.
Figure 7 shows a pipeline diagram for the improved scalable Montgomery multiplier. The bits with negative indices are insignificant trailing zeros. Each of the first k-1 kernel cycles takes max(e, p) + p/w + 1 clock cycles before the next can begin (cases I and II, respectively). The final kernel cycle takes e + p + p/w cycles to produce the last word. In case I, the system is still limited by the time for PE1 to complete and the latency is only slightly better. In case II, the system is limited by the time for PE p to complete and speeds up significantly. Substituting kp/w = e, the latencies simplify to
$$T = \begin{cases} (k+1)(e+1) + p - 2, & e > p \text{ (case I)} \\ k(p+1) + 2e - 1, & e \le p \text{ (case II)} \end{cases}$$
Again, for case I the latency simplifies to approximately $n^2/wp$ so the area-delay product is $n^2$. In case II, the latency is approximately n so the area-delay product is nwp, which is ideal. Both the latency and area-delay product improve by a factor of two.
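The left-shifting idea can be checked with a purely functional Python model. It is only a sketch under assumptions: the name mont_mul_left_shift is hypothetical, the model works on whole integers, and it does not represent the word-level pipelining or timing of the PEs. It shows that adding left-shifted copies of Y and M and discarding n trailing zeros gives the same Montgomery product as right-shifting Z at every step.

```python
def mont_mul_left_shift(x: int, y: int, m: int, n: int) -> int:
    """Montgomery product computed by left-shifting Y and M instead of
    right-shifting Z. Returns a value below 2m (no final subtraction)."""
    z = 0
    for i in range(n):
        if (x >> i) & 1:
            z += y << i      # Y left-shifted by i bits
        if (z >> i) & 1:
            z += m << i      # shifted M clears bit i of Z because m is odd
    # Bits 0..n-1 of Z are now zero, so the shift below is an exact division.
    assert z & ((1 << n) - 1) == 0
    return z >> n


print(mont_mul_left_shift(4, 5, 11, 4))  # 4, the same result as right-shifting
```

In this model bit i of Z is final after step i, which is why a later step never has to wait for a more significant bit.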
The overall architecture is unchanged from Figure 3; the differences lie in the shifting done by the PEs and the timing from the sequence controller. Figure 8 shows a schematic of the improved w-bit processing element. It contains the same critical path and amount of hardware as the Tenca-Koç design. Table 1 lists the cycle count for various choices of w, p, and n for the Tenca-Koç and improved multipliers. For operand precisions n up to the number of bit cells wp, the new design is about twice as fast.
An n-bit modular exponentiation requires at most 2n + 2 modular multiplications including the conversion to and from M-residues. Table 2 compares the time for 256-bit and 1024-bit exponentiation using various recent hardware and software implementations. For reference, a CLB in a Xilinx 4000XV-series chip contains 32 bits of RAM or two flip-flops and two LUTs.
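Cycle counts for the two multipliers can be approximated from the kernel-cycle timings described in the preceding sections. The sketch below is a model of that prose, not a reproduction of Table 1. The function names are assumptions, and the Tenca-Koç drain term assumes that the last PE starts 2(p - 1) cycles after the first, which is inferred from the two-cycle PE-to-PE delay.

```python
def tenca_koc_cycles(n: int, w: int, p: int) -> int:
    """Clock cycles for the Tenca-Koç multiplier (n divisible by w and p)."""
    e, k = n // w, n // p
    kernel = max(e + 1, 2 * p + 1)   # cycles per kernel cycle
    last_pe_start = 2 * (p - 1)      # assumed two-cycle delay between PEs
    return (k - 1) * kernel + last_pe_start + (e + 1)


def improved_cycles(n: int, w: int, p: int) -> int:
    """Clock cycles for the improved multiplier (p a multiple of w)."""
    e, k = n // w, n // p
    kernel = max(e, p) + p // w + 1  # each of the first k - 1 kernel cycles
    final = e + p + p // w           # final kernel cycle
    return (k - 1) * kernel + final


for n, w, p in [(256, 8, 32), (1024, 16, 64)]:
    print(n, w, p, tenca_koc_cycles(n, w, p), improved_cycles(n, w, p))
```

For n = 256, w = 8, and p = 32, where n does not exceed wp, this model gives 550 and 327 cycles respectively.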
Results
The improved scalable design is significantly faster than the Tenca-Koç design because of both the architecture and the faster clock rate. It appears to be comparable in performance and to use less hardware than the nonscalable radix-2 systolic design of Blum [12]. However, Blum's radix-16 nonscalable design [6] is faster because it processes four times as many bits of X per cycle per PE using 1.5x as much hardware. This suggests that it would be interesting to further investigate higher-radix scalable designs, although [13, 14] did not achieve as dramatic an area-delay improvement.
Mukaida [7] uses an entirely different approach based on a large multiplier for GF(p). The approach scales to multiple word lengths by reusing the multiplier, but does not support GF($2^n$). The paper does not report times for modular exponentiation, but it appears to be extremely fast at generating digital signatures.
A 16×16 Montgomery multiplier has also been simulated in a 90 nm process using $V_{DD}$ = 1.2 V [15]. Static circuits with high-$V_t$ devices are used exclusively. The clock frequency of 2.4 GHz is limited by the critical path through the 16-bit CPA. If this were pipelined, the limiting path through the bit cell would operate at 3.2 GHz. The kernel has an area of 354 µm × 146 µm based on a trial layout. The complete unit draws 69 mW on a random test case, of which 23 mW is leakage power.
Conclusion
This paper has described an improvement on the Tenca-Koç unified scalable radix-2 Montgomery multiplier. The design left-shifts the input operands rather than right-shifting the result to reduce the latency by nearly a factor of two for operand precisions up to the size of the array. It also converts intermediate results to nonredundant form to cut the queue memory requirement in half. The proposed multiplier has been synthesized for a Xilinx Virtex-II FPGA. It is the fastest scalable unified design reported in the literature.
This work might be extended to higher radix multipliers.
It would also be useful to better understand the tradeoffs between architectures with a large number of bit cells and those with a large conventional multiplier array using reduction steps from [3] .
