Abstract. This paper describes an algorithm and architecture based on an extension of a scalable radix-2 architecture proposed in a previous work. The algorithm is proven to be correct and the hardware design is discussed in detail. Experimental results compare a radix-8 implementation with a radix-2 design. The scalable Montgomery multiplier can be adjusted to constrained areas while still working on any given operand precision. Similar to some systolic implementations, this design avoids the high load on signals that broadcast to several components, making the delay independent of the operands' precision.
Introduction
Several applications, such as the RSA algorithm [14], the Diffie-Hellman key exchange algorithm [5], the Digital Signature Standard [12], and elliptic curve cryptography [6, 9], use modular multiplication and modular exponentiation. The Montgomery Multiplication (MM) algorithm [10] provides certain advantages in the implementation of modular multiplication. Multiple software and hardware designs have been developed using the algorithm.
An aspect of cryptographic applications is that very large numbers are used. The precision varies from 128 and 256 bits for elliptic curve cryptography to 1024 and 2048 bits for applications based on exponentiation [15] . Most of the hardware designs for modular multiplication are fixed-precision solutions. That is, the operands cannot exceed a fixed bit-size. Designs that can take operands with an arbitrary precision are researched in the ASIC [18] and the FPGA [2] realms.
It is recognized that hardware design involves an area-time tradeoff [21]. In the general case, "faster means better". However, an application where this rule is not valid can always be found. Therefore, it is important that designers have several options to choose from.
The basic idea of the scalable Montgomery multiplier has been presented in [18]. The main features of this multiplier are (1) the ability to work on any given operand precision at the kernel level, (2) the ability to be adjusted to any chip area, and (3) the use of a pipelined organization that reduces the impact on signal loads that results from high operand precision.
The first feature is unique in comparison to other designs. The ability to handle long-precision numbers with small-precision operations has been achieved using conventional multipliers and a control algorithm that uses these multipliers [7]. The general approach is to reuse a hardware core with a fixed precision, usually at most 32 or 64 bits. Current publications show conventional multipliers that do not exceed a precision of 100 bits [16, 1]. The control algorithm is usually complex in this case, and the increase in parallelism involves multiple datapaths and high complexity at the system level. Other solutions that use systolic array implementations are designed for a fixed precision, and the implementation must be modified if a precision larger than the one originally considered is required.
The second feature comes from the flexibility of the algorithm and hardware to be adjusted in both word size and number of processing elements. The more hardware is available, the better is the performance of the multiplier. Similar adjustment is also possible on algorithms based on conventional multipliers, at the cost already presented above. Beyond any doubt, cryptographic algorithms will be embedded in almost any application involving exchanging of information. Applications, such as smart cards [11] and hand-held devices require hardware designs restricted on area and power resources.
The high load on signals broadcast to several hardware components is an important factor in slowing down high-precision Montgomery multiplier (MM) designs. For this reason, the use of systolic structures has been considered by other researchers. The organization presented in this paper is not purely systolic, and has a flavor of a serial-parallel implementation of the multiplication algorithm.
In this work we present an evolution of the radix-2 algorithm proposed in previous papers, which leads us to a higher-radix design of the system. This paper describes the issues involved in this design and the experimental results used to compare it with the former radix-2 design.
High-radix Word-based Montgomery Algorithm
The notation used throughout this text is shown in Table 1. Figure 1 shows the Multiple-word High-Radix (2^k) Montgomery Multiplication algorithm (MWR2^kMM), a generalization of the MM algorithm presented in [18]. A full-precision High-Radix Montgomery algorithm has been presented and proven to be correct in [8]. To prove the correctness of the algorithm in Figure 1 we show that it is equivalent to the one presented in [8].
• M - modulus for modular multiplication;
• X - multiplier operand for modular multiplication;
• x_j - a single bit of X at position j;
• X_j - a single radix-r digit of X at position j;
• Y - multiplicand operand for modular multiplication;
• N - number of bits in the operands;
• r - radix (r = 2^k);
• S - partial product in the multiplication process;
• k - number of bits per digit in radix r;
• q_Yj - coefficient that determines a multiple of Y which is added to the partial product S in the j-th iteration of the computational loop;
• q_Mj - coefficient that determines a multiple of the modulus M which is added to the partial product S in the j-th iteration of the computational loop;
• BPW - number of bits in a word of either Y, M or S;
• NS - number of stages;
• CS - carry-save;
• S^(i) - the i-th word of S.
Table 1. Notation
The parameter k depends on how many bits of the multiplier X are scanned during each loop iteration, that is, on the radix of the computation (r = 2^k). Each loop iteration (computational loop) scans k bits of X (a radix-r digit X_i) and determines the value q_Y according to Booth encoding [3]. Booth encoding is applied to a bit vector to reduce the complexity of multiple generation in the hardware. For radix-8 the Booth function for each digit is given as:
$q_Y = -4x_{i+2} + 2x_{i+1} + x_i + x_{i-1}$
where X_i = (x_{i+2}, x_{i+1}, x_i) is a radix-8 digit (i = km where m is an integer), x_j ∈ {0, 1}, and x_{i−1} is the most significant bit (MSbit) of the previous digit.
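The radix-8 recoding described here can be sketched in Python. This is only an illustration: it assumes the standard radix-8 Booth recoding, in which each digit is −4·x_{i+2} + 2·x_{i+1} + x_i + x_{i−1}, and the function name `booth_radix8_digits` and the choice of one extra zero bit at the top of X are assumptions made for the sketch, not details taken from the hardware.

```python
def booth_radix8_digits(x: int, n_bits: int) -> list[int]:
    """Recode a non-negative n_bits-wide integer into radix-8 Booth digits.

    Digit j uses bits x[3j+2], x[3j+1], x[3j] and the MSbit of the previous
    digit, x[3j-1] (taken as 0 for j = 0). Each digit lies in [-4, 4].
    """
    def bit(i: int) -> int:
        return (x >> i) & 1 if i >= 0 else 0

    # One extra (zero) bit above X guarantees that the top digit is not
    # left with an uncompensated negative weight.
    n_digits = (n_bits + 1 + 2) // 3  # ceil((n_bits + 1) / 3)
    digits = []
    for j in range(n_digits):
        i = 3 * j
        digits.append(-4 * bit(i + 2) + 2 * bit(i + 1) + bit(i) + bit(i - 1))
    return digits


def booth_value(digits: list[int]) -> int:
    """Recombine radix-8 digits, least significant digit first."""
    return sum(d * 8 ** j for j, d in enumerate(digits))
```

Recombining the digits with `booth_value` gives back the original operand, which is the property the multiplier relies on when it replaces the bits of X by the coefficients q_Yj.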
For Radix-2 computation k = 1 and q Yj = x j are used, making the algorithm equivalent to the one presented in [18] . C a and C b represent two carry bits that are propagated from the computation of one word to the computation of the next word. In order to make the least-significant k-bits of S all zeros, q Mj M is added to the partial product. This is required to avoid losing bits in the shift operation performed in Step 10. The value of q Mj that satisfies this condition is determined by examining the least significant k-bits of S generated at Step 4.
In Steps 11 and 12 the most significant (MS) word of S is generated and sign extended. The use of Booth encoding may cause intermediate values of S to be negative. The final result in S, when Step 13 (the final reduction step) is reached, is always positive and can be greater than the modulus M. The purpose of Step 13 is to reduce the result to a number less than the modulus. M is chosen such that 2^(N−1) < M < 2^N, and the result is bounded as 0 ≤ S < 2M. Therefore, a single subtraction of the modulus, performed only when the final result in S is greater than or equal to the modulus, ensures that S < M.
[Figure 1. The Multiple-word High-Radix (2^k) Montgomery Multiplication algorithm (MWR2^kMM).]
[Figure 2. The R2^kMM algorithm.]
The MWR2^kMM is a multiple-word version of a full-precision algorithm presented in Figure 2, which is called in this work the R2^kMM algorithm. To obtain the R2^kMM algorithm we transform the word-based sequence of operations into full-precision operations. It is shown in [8] that the requirement for q_M is given as:
This requirement can be also rewritten as
The latter equation is another representation of the requirement that the last k bits of S must be zeros. Step 5 is equivalent to this requirement as shown below:
It is also easy to show that
from the Booth encoding properties.
The last two equations show that the coefficients q Yj and q Mj are determined the same way as in [8] , which makes both algorithms equivalent. In [8] there are requirements for X and Y that determine the boundaries for the result S. There are no such requirements in the R2 k MM algorithm. The R2 k MM algorithm inherits the boundaries for the result from the original MM algorithm.
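The role of q_Y and q_M in the full-precision algorithm can be illustrated with a short Python sketch. It is not the R2^kMM listing itself: it assumes that q_M is taken in the range [0, 2^k − 1] and computed as (−S·M^(−1)) mod 2^k, which is one standard way to force the k least significant bits of S + q_M·M to zero, and it uses Booth digits for every k (including k = 1). The names `booth_digits` and `r2k_mm` are hypothetical.

```python
def booth_digits(x: int, n_bits: int, k: int) -> list[int]:
    """Radix-2^k Booth digits of a non-negative n_bits-wide integer."""
    def bit(i: int) -> int:
        return (x >> i) & 1 if i >= 0 else 0

    n_digits = -(-(n_bits + 1) // k)  # ceil((n_bits + 1) / k)
    digits = []
    for j in range(n_digits):
        i = k * j
        # The top bit of each group has negative weight; x[i-1] overlaps.
        d = -(bit(i + k - 1) << (k - 1)) + bit(i - 1)
        d += sum(bit(i + t) << t for t in range(k - 1))
        digits.append(d)
    return digits


def r2k_mm(x: int, y: int, m: int, n_bits: int, k: int) -> tuple[int, int]:
    """Return (X*Y*2^(-k*n) mod M, n), where n is the number of digits.

    Assumes m is odd, 0 <= x < 2^n_bits and 0 <= y < m.
    """
    r = 1 << k
    m_inv = pow(m, -1, r)            # modular inverse, Python 3.8+
    digits = booth_digits(x, n_bits, k)
    s = 0
    for q_y in digits:
        s += q_y * y                 # may make s negative
        q_m = (-s * m_inv) % r       # zeroes the k LSBits of s + q_m*m
        s += q_m * m
        s >>= k                      # exact: the shifted-out bits are zero
    if s >= m:                       # final reduction, 0 <= s < 2m here
        s -= m
    return s, len(digits)
```

With these assumptions the final value satisfies S·2^(k·n) ≡ X·Y (mod M), so a single conditional subtraction is enough to bring the result below M.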
High-radix Montgomery Multiplier -System level
For high-precision computation it is beneficial to divide the multiplicand Y , the modulus M and the result S into words [18] . The approach keeps the gates and the wire delays inside reasonable boundaries. With operands' precision of thousands of bits, a conventional design to multiply all the bits at once would have a high number of pins, increased fan-in for the gates, high gate loads, and gate outputs driving long wires.
The multiplications (q_Y * Y)^(*) and (q_M * M)^(*) shown in the MWR2^kMM algorithm can be implemented by multiplexers (MUXes) and adders. The shifting operation in Step 10 is simple in hardware. Additions can be done using Carry-Save Adders (CSA), keeping S in redundant form. With this approach the carries generated during addition are not propagated but rather stored in a separate bit-vector, alongside a bit-vector for the sum bits. The most complex operations, finding the coefficients q_Y and q_M (Steps 3 and 5), can be executed by table look-up. q_Y is pre-computed before the computational cycle begins since it depends only on the least significant k bits of X. This observation leaves the computation of q_M in the most critical part of the algorithm, as is also pointed out by other authors [13, 20].
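The carry-save form of S can be illustrated with a minimal Python model of a 3-to-2 carry-save adder on non-negative integers. The function name `carry_save_add` is an assumption for the sketch; the hardware discussed in this paper uses four-to-two adders on words, which this model does not reproduce.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3-to-2 compression of three non-negative bit-vectors.

    Returns (sum_vec, carry_vec) such that sum_vec + carry_vec == a + b + c.
    No carry is propagated between bit positions: each carry is simply
    stored one position to the left in carry_vec.
    """
    sum_vec = a ^ b ^ c
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1
    return sum_vec, carry_vec
```

The least significant bit of the carry vector is always zero, and a carry-propagate addition is needed only once, when the redundant pair is converted to a single number.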
The architecture of a Montgomery multiplier implementing the MWR2 k MM algorithm is shown in Fig. 3 . There are two main functional blocks: Kernel and IO. Only the data path is shown. The Kernel's datapath is where the computation takes place according to the algorithm. A control block (not shown) supplies the signals to synchronize the system. Depending on the kernel configuration (number of stages and word size) the operands must pass through the data path several times [18] .
The signal X j is a k-bit signal. It provides the bits of X needed for Step 3 of the MWR2 k MM algorithm. The IO block provides the interface with the user and the memory elements for the operands, modulus, and partial result. This block can be implemented in different ways depending on the application where the multiplier will be used and/or the system's architecture in which the multiplier will be integrated. The solution for this block can be flexible and the only requirement for it is to meet the timing specifications for the kernel. Therefore, the architecture of this functional unit is out of the scope of this work. A detailed description of the signal's timing in the interface between I/O and kernel is presented in [19] .
Kernel Datapath and reduction.
The kernel datapath is organized as a pipeline of cells (MMcell) separated by registers (Fig. 4). A stage consists of an MMcell and a register. The MMcell implements one iteration of the FOR loop (Steps 3 to 12) in the MWR2^kMM algorithm. Each stage receives one word of Y, M, SS and SC every clock cycle. Additionally, (NS * k) bits of X are transferred to the kernel over 2 * NS clock periods, where NS is the number of stages. Depending on the computation's progress, k bits of X are loaded into a different stage every 2 clock cycles. Each stage needs these bits at a different time. Thus, this signal is made common to all stages, with internal control loading it into the right stage at the right time. The MS bit of X_i is used to Booth encode X_{i+1}, as explained in Section 2; thus, a cell must store these two pieces of information in order to properly encode a radix-r digit of X. The datapath outputs one word each of SS and SC every clock cycle.
The reduction block implements the final reduction step in the MWR2^kMM algorithm. The final reduction happens after the last iteration of the loop scanning the bits of X. During the intermediate iterations the final reduction block propagates the signals from the kernel datapath without operating on them. The design takes advantage of the word-serial output of the kernel datapath and implements the final reduction serially, on the fly, as the words of both vectors of the result come out of the kernel datapath. The condition S ≥ M is not known before the last pair of words of S is computed in the datapath. The final reduction block therefore computes the result for both conditions: S − M, for S ≥ M, and S itself, for S < M. In both cases the carry-save to non-redundant conversion is required. Both resulting vectors are stored in the place of SS and SC (the two bit-vectors of the intermediate result) in the IO block.
After the last pair of words of S is processed, a flag is set by the control circuitry indicating which condition is valid, S ≥ M or S < M. The result will be in either SS or SC. A detailed implementation of the final reduction block is presented in [19] .
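The word-serial final reduction can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not the circuit of [19]: words are processed least significant first, the operands are assumed to fit in the given number of words, and the helper names (`to_words`, `from_words`, `serial_final_reduction`) are hypothetical. Instead of leaving the two candidates in SS and SC with a flag, the sketch returns the selected one.

```python
def to_words(value: int, w: int, n_words: int) -> list[int]:
    """Split a non-negative integer into n_words words of w bits, LS word first."""
    mask = (1 << w) - 1
    return [(value >> (w * i)) & mask for i in range(n_words)]


def from_words(words: list[int], w: int) -> int:
    """Recombine words of w bits, LS word first."""
    return sum(word << (w * i) for i, word in enumerate(words))


def serial_final_reduction(ss_words, sc_words, m_words, w):
    """Compute S = SS + SC and S - M word by word, then pick the valid one."""
    mask = (1 << w) - 1
    carry = 0    # carry of the carry-save to non-redundant conversion
    borrow = 0   # borrow of the S - M subtraction
    s_words, d_words = [], []
    for ss, sc, m in zip(ss_words, sc_words, m_words):
        t = ss + sc + carry
        s_word, carry = t & mask, t >> w
        d = s_word - m - borrow
        borrow = 1 if d < 0 else 0
        s_words.append(s_word)
        d_words.append(d & mask)
    # Assumption: the words cover S completely, so the final carry is 0.
    # S >= M is known only now, after the last pair of words.
    s_ge_m = borrow == 0
    return d_words if s_ge_m else s_words
```

Both candidate results are produced in the same pass, which mirrors the observation that the comparison S ≥ M cannot be resolved before the most significant words have been seen.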
Kernel Implementation
The direct design of the kernel processing element leads to the organization shown in Figure 5(a). The figure shows the main blocks in the design: Booth encoding, multiple generation, adders, and registers (shaded boxes). Shifting and alignment are done by proper combination of signals. Multiple generation for high-radix designs is expensive because q_Y and q_M may assume values that are not powers of 2. As an example, the bit-vector 2Y can be produced from Y by left-shifting Y by one bit. However, the bit-vector 3Y is produced by adding Y and 2Y.
The critical path in the basic design is very long and makes the design of such a high-radix circuit less attractive. The high radix increases the delay and size of both the table and the multiple generation. To increase the performance of this system, re-timing was applied, resulting in the design shown in Figure 5(b).
Improving the performance using re-timing
Using re-timing, pieces of combinational logic are relocated to other parts of a sequential system, modifying the critical path. One problem with the first direct implementation of the high-radix algorithm is the long critical path, passing through several modules, as shown in Figure 5, and the coefficients q_Yj. If the word size for S is more than 2k bits, the k LSBits of S for the next pipeline stage will be available well before the whole word S^(0) is available. The idea is to advance the information on the k least-significant bits (LSBits) of the shifted S^(0). In the previous design, these bits were propagated between two registers with no logic operation done on them. Instead of simply propagating the bits, the logic determining q_M is performed on them, as shown in Figure 5(b).
The difference between these cell designs is that a portion of the first adder was moved to before the input registers, and this portion of the adder computes only the k LSBits of the not yet shifted partial product, which is required to compute q M . The k-bit vector addM in the Figure represents these bits in nonredundant form, and is applied to the Table that generates q M in the next clock cycle, considering also k − 1 bits of the modulus M . As a result of this hardware organization, all possible path delays will not exceed the delay of two adders and two MUXes.
The computation done on the LSBits by the leftmost adder is also done for all the other remaining operand words. So, while the leftmost adder works on the LS bits of a word, the topmost adder (after the input register) should be working on the other bits of the same word. There is a one-clock-cycle difference between the two circuits, and therefore this situation must be considered carefully.
A radix-8 design
Without loss of generality, the details of this design will be explained based on a radix-8 implementation. The circuit in Fig. 6 shows the diagram for a Radix-8 MMcell.
One way of implementing the coefficients q_Y and q_M is to split each of them into components that generate simple multiples and to add these multiples in the adder. A negative multiple is obtained by complementing the bit-vector, Y in this case, and introducing a carry-in with a value of '1'. Since each four-to-two adder has only one carry-in input, only one of the components can be negative. Two multiplexers generate the multiples (q1_Yj * Y)^(*) and (q2_Yj * Y)^(*). The Booth encoding is done according to Table 2 in the DEC XJ functional block. As an example, (/2 * Y) means that Y is multiplied by 2 and all the bits are complemented (or negated). Also, one can notice that the values 2 and −2 are formed in two different ways. This approach simplifies the decoding logic for X_j. The outputs of DEC XJ are the control signals for the multiplexers as well as the carry-in bit for the first 4-to-2 adder (during the first computational cycle only).
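The way a negated multiple such as (/2 * Y) combines with a carry-in of '1' can be checked with a small Python model of fixed-width two's-complement arithmetic. The function name `negated_multiple` and the explicit `width` parameter are assumptions for the illustration; in the hardware the carry-in is absorbed by the four-to-two adder rather than added separately.

```python
def negated_multiple(y: int, shift: int, width: int) -> tuple[int, int]:
    """Return (complemented_vector, carry_in) representing -(y << shift).

    The multiple is formed by a left shift, every bit is complemented, and
    the missing '+1' of the two's-complement negation is supplied as a
    carry-in to the adder that consumes the vector.
    """
    mask = (1 << width) - 1
    complemented = ~(y << shift) & mask
    carry_in = 1
    return complemented, carry_in
```

Adding the complemented vector and the carry-in modulo 2^width gives the two's-complement encoding of −2^shift·Y, which is why a single carry-in input per adder allows exactly one negative component.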
Because the coefficients q_Y and q_M are each split into two, the adders need to have an extra input. The two four-to-two adders have a total of two carry-out bits propagating between sequential words of the partial product S. One carry-out is inserted at the LSB position of vector carryA. The other carry-out is introduced back to the same adder as a carry-in bit for the next word of S.
The coefficient q_Mj depends on the three LSBits of the partial product S and the three LSBits of the modulus M. The product is represented by two vectors. There is one additional input bit, the hidden-bit, which affects q_Mj. The hidden-bit is generated by carry propagation in the least significant bits of the least significant word computation, which are zeroed in the process. Knowing that the LSB of M is always '1' and the LSB of carryA is always '0', q_Mj will depend only on eight bits: sumA_{2..0}, carryA_{2..1}, hidden-bit and M^(0)_{2..1}.
In Step 10 of the MWR8MM algorithm the partial product is right-shifted by three bits. Because carry-save representation (CS) is used for S, the LS words of the two bit-vectors (sumB^(0), carryB^(0)) after Step 6 in the algorithm can be, for example, sumB^(0) = ×...×110 and carryB^(0) = ×...×010, where × represents any value of the bit in this position. The last three bits of S are equivalent to zeros when converted to a non-redundant form. However, data will be lost if these bits are shifted out without taking into account the carry propagation (110 + 010 = 1000). The carry bit generated in this case is the hidden-bit.
Instead of using a carry-propagate adder to obtain the hidden-bit, in radix 8 the following observation is made: the last bit of carryB^(0) is always '0'; therefore, to detect a hidden-bit it is enough to test whether there is a 1 in the second or third bit of either carryB^(0) or sumB^(0). The circuit for the hidden-bit detection is reduced to hidden-bit = sumB^(0)_2 OR sumB^(0)_1. These two bits of sumB^(0) are stored into flip-flops; thus, the hidden-bit logic does not stand in the critical path for the whole cell. Since the hidden-bit is found after the operation on the LS word is done, it is transferred from one cell to another as part of the LS word. It can be inserted in the free LSBit position in carryA^(0) and also participates in determining q_M.
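The hidden-bit observation can be verified exhaustively in Python. The sketch assumes that the detection circuit is the OR of bits 2 and 1 of the LS word of sumB, and that the three LSBits of sumB + carryB are zero after q_M·M has been added; the function names are hypothetical.

```python
def hidden_bit(sum_b0: int) -> int:
    """OR of bits 2 and 1 of the least significant word of sumB."""
    return ((sum_b0 >> 2) | (sum_b0 >> 1)) & 1


def check_hidden_bit() -> list[tuple[int, int, int]]:
    """Enumerate all 3-bit (sumB, carryB) pairs that are valid before the shift.

    Valid means: the LSBit of carryB is 0 and the three LSBits of
    sumB + carryB are zero. Returns (sumB, carryB, carry_out) triples.
    """
    cases = []
    for s in range(8):
        for c in range(0, 8, 2):          # LSBit of carryB is always 0
            if (s + c) % 8 == 0:
                carry_out = (s + c) >> 3  # the bit lost by a naive shift
                assert hidden_bit(s) == carry_out
                cases.append((s, c, carry_out))
    return cases
```

Only four pairs are possible, and in each of them the carry out of the three discarded bits equals the OR of the two upper bits of sumB, so no carry-propagate adder is needed.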
If all eight bits are used for a lookup table for q_M, the table will have 256 entries. The number of entries can be reduced by assimilating the carries for sumA_{2..0}, carryA_{2..1}, and hidden-bit with a three-bit adder. The resulting three-bit vector is named addM, which reduces the table for q_M to only 32 entries. The table is implemented by the DEC M functional block according to Table 3. The decoder outputs are the control signals for the multiplexers implementing (q1_Mj * M)^(*) and (q2_Mj * M)^(*). The decoder also has an output which is asserted '1' whenever q1_Mj is negative. This signal becomes a carry-in for the second four-to-two adder.
Table 3. Decoding for q_M.
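The reduction from eight input bits to a 32-entry table can be illustrated as follows. The sketch assumes that addM is the three-bit sum of sumA_{2..0} and the vector formed by carryA_{2..1} with the hidden-bit in its LSBit position, and it stores q_M as a single value in [0, 7]; the split into q1_M and q2_M and the signed encoding of Table 3 are not reproduced. All function names are hypothetical.

```python
def add_m(sum_a_20: int, carry_a_21: int, hidden: int) -> int:
    """Three-bit assimilation of the LSBits of the redundant partial product."""
    return (sum_a_20 + ((carry_a_21 << 1) | hidden)) & 0b111


def build_qm_table() -> dict[tuple[int, int], int]:
    """Map (addM, M[2..1]) to a q_M that zeroes the three LSBits.

    The LSBit of M is always 1, so two bits of M select the column and
    three bits of addM select the row: 8 * 4 = 32 entries.
    """
    table = {}
    for m_21 in range(4):
        m_low = (m_21 << 1) | 1            # three LSBits of M
        m_inv = pow(m_low, -1, 8)          # inverse modulo 8, Python 3.8+
        for a in range(8):
            table[(a, m_21)] = (-a * m_inv) % 8
    return table
```

Every entry satisfies (addM + q_M·M) mod 8 = 0, which is the condition that allows the three least significant bits to be shifted out.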
The multiples of Y and M, like 2Y, 4Y, 2M, 4M, 8M, require that these operands be left-shifted. Because of the word-serial scanning used in this algorithm, this shifting requires some of the MSBits from the previous words of Y and M to be kept when the new words arrive. If it is the first word (first cycle = '1'), then a number of zeros is shifted in to produce the needed multiple. Otherwise, the MSBits of the previous word are shifted in as the LSBits of the current word.
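The word-serial left shift can be sketched in Python as below. The function name `shift_left_word_serial` is hypothetical, words are assumed to arrive least significant first, and the shift amount is assumed not to exceed the word size.

```python
def shift_left_word_serial(words: list[int], shift: int, w: int) -> list[int]:
    """Return the words of (value << shift), LS word first.

    `saved` holds the `shift` MSBits of the previous word; it starts at
    zero, so zeros are shifted into the first word.
    """
    mask = (1 << w) - 1
    saved = 0
    out = []
    for word in words:
        out.append(((word << shift) | saved) & mask)
        saved = word >> (w - shift)   # MSBits kept for the next word
    out.append(saved)                 # bits spilled past the last word
    return out
```

For example, with 8-bit words the operand 0xABCD arrives as [0xCD, 0xAB]; shifting by three bits produces [0x68, 0x5E, 0x05], which is 0xABCD multiplied by 8.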
As described at the end of the previous section, the leftmost adder operates on the LSbits of word j of S and q_Y Y while the topmost adder operates on the MSbits of word j − 1. This arrangement requires that the carry-out propagation among words of the partial sum A (carryA and sumA) be considered carefully. The carry-out of the topmost adder, net spillA2, is introduced immediately as carry-in for the leftmost adder. The carry-out of the leftmost adder is delayed one clock cycle before it is introduced as carry-in to the topmost adder.
Experimental Results and Analysis
This section describes the experimental data obtained with the radix-8 kernel designs and compares them with the radix-2 design. Although both radix-8 designs were implemented, only the results for the re-timed radix-8 design are presented in detail. The complete data is presented in [19].
Synthesis and Simulation Environment.
The Mentor Graphics' package of applications was used to generate this data. The target technology was set to AMI05 slow (0.5µm) provided in the ASIC Design Kit (ADK) from the same company. A data-book for this technology is available at [4] . Before the designs were synthesized, they were simulated in ModelSim for functional correctness. The designs were described in VHDL, synthesized with Leonardo as flattened designs (no hierarchy), and laid-out using ICStation. This last tool provides RC parameter extraction. RC-extraction allows the determination of time delay values for each wire in the design, bringing further simulations closer to the real-silicon simulations. Using the information from ICStation and Leonardo, the designs were back annotated and verified with Velocity. The values presented in this section were obtained from several experiments.
The kernel area depends on the number of stages in the pipeline (NS) and the word size (BP W ). The area for the radix-8 kernel was obtained as:
The total computational time for the kernel is the product of the number of clock cycles (T_CLKs) and the clock period (t_p). The clock period is derived from the synthesis results and depends on the number of stages, the word size, and other parameters. The number of clock cycles needed to complete a computation is obtained from the algorithm.
Table 4 shows the critical path delay (t_p) for the re-timed radix-8 kernel as a function of the number of stages and the number of bits per word in the operands. These two parameters also determine the design area. The boldfaced figures in the table are tested configurations; the remaining figures are produced by linear interpolation. An increase in area leads to an increase in the critical path delay, because of longer wires (parasitic resistance and capacitance) and larger gate fan-outs. A setup time plus clock-to-Q propagation time of 1.2 ns for flip-flops is given for the AMI05-slow technology. The hold time requirement is insignificantly small. The setup and hold time requirements will scale with the technology, giving the same proportional effect on the clock period.
NW represents the number of words in the N-bit operands with a chosen word size of BPW bits [18]. Because of the extra register in the pipeline, a word propagates through the pipeline for (2 * NS + 1) clock cycles. For radix-8, since 3 bits of X are used in each stage,
It can be shown that when NW < 2 * NS, adding more stages to the pipeline has a somewhat unpredictable effect on the total number of clock cycles. In this case the number of words NW has a small effect on the computational time, while the fraction N/(3 * NS) has minimums and maximums as the number of stages NS changes. Thus, a design with more stages may be slower than a design with fewer stages.
Figure 7 shows the total computational time (T_CLKs × t_p) for N = 256 and N = 1024, using designs with different numbers of stages (NS) and word sizes (BPW). The first observable minimum of the computational time occurs when the boundary between NW > 2 * NS and NW ≤ 2 * NS is crossed. With a further increase in the number of pipeline stages the computational time goes through a series of minimal and maximal values. The boundary NW > 2 * NS is crossed at a different number of stages for a different precision of the operands (a different number of words). Operands with a precision of 256 bits require a smaller number of pipeline stages than operands with a precision of 1024 bits in order to execute the operation in minimal time. The goal in choosing a design point is to have the computational time for 256-bit precision close to its absolute minimum and, at the same time, to have as small a computational time for 1024-bit precision as possible. It can be seen from the data obtained in the experiments that the fastest designs are achieved with a word size of 8 bits. For this word size and 256-bit precision, the first optimal design point is NS = 15.
Table 5. Some design points for radix-8 kernel, BPW = 8, N = 256 and N = 1024.
Table 5 compares several design points for the radix-8 kernel with BPW = 8. The table presents the design area and the ratio of the computational time to that of the point NS = 15.
It can be seen that the design point with NS = 22 is very suitable since the computational time for 256-bit precision is very close to its minimal value. At the same time the computational time for 1024-bit precision is improved by 37% as compared to the point with NS = 15. With further increase of the number of stages the computational time for 256-bit precision worsens while the computational time for 1024-bit precision does not improve significantly (only 2% per stage).
A comparison of performance between the radix-2 design ( [18] ) and the radix-8 designs discussed in this paper is shown in Figure 8 . The data shows the time to compute the modular multiplication for 256-bit operands as a function of the design area. For small areas, the radix-2 design (v1) performs as well as the radix-8 design with re-timing (v3). The basic design (v2) is worse than the radix-2 one. For areas of 10,000 gates or more, the radix-8 design with re-timing is better than the other two, which shows that the high-radix design has a better overall performance.
Conclusion
This paper presented the algorithm modifications and hardware implementation details of a high-radix implementation of the scalable modular multiplier presented in [18]. A radix-8 design was used to exemplify the design process and to obtain experimental results that show the viability of this approach. Experimental data shows that the radix-8 scalable multiplier performs as well as the radix-2 design for small areas, and better than the radix-2 design for larger areas. The re-timing technique applied to the high-radix design was critical to obtaining a competitive solution.
