Abstract: Since redundant number systems allow for constant-time addition, they are often at the heart of modular multipliers designed for public-key cryptography (PKC) applications. Indeed, PKC involves large operands (160 to 1,024 bits), and several researchers have proposed carry-save or borrow-save algorithms. However, these number systems do not take advantage of the dedicated carry logic available in modern Field-Programmable Gate Arrays (FPGAs). To overcome this problem, we propose performing modular multiplication in a high-radix carry-save number system, where a sum bit of the carry-save representation is replaced by a sum word. Two digits are then added by means of a small Carry-Ripple Adder (CRA). Furthermore, we propose an algorithm that selects the best high-radix carry-save representation for a given modulus and generates a synthesizable VHDL description of the operator.
Horner's Rule-Based Modular Multiplication
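In order to compute $\langle XY \rangle_M = XY \bmod M$, where $M$ is an $n$-bit integer such that $2^{n-1} < M < 2^n$, our algorithm is described by an iterative procedure based on the celebrated Horner's rule:

$$\langle XY \rangle_M = \left\langle \left( \cdots \left( (x_{r-1} Y) 2 + x_{r-2} Y \right) 2 + \cdots + x_1 Y \right) 2 + x_0 Y \right\rangle_M,$$

where $X = x_{r-1} x_{r-2} \ldots x_1 x_0$ is an unsigned $r$-bit integer, and $Y$ is an $n$-bit integer belonging to $\{0, \ldots, M - 1\}$. This equation can be expressed recursively as follows (we perform a modulo $M$ reduction at each step in order to keep an $n$-bit intermediate result):

$$T[i] = 2Q[i+1] + x_i Y, \qquad Q[i] = \langle T[i] \rangle_M, \tag{1}$$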
where $Q[r] = 0$, and $Q[0] = \langle XY \rangle_M$. Since $Q[i+1]$ and $Y$ are less than or equal to $M - 1$, $T[i] < 3M$, and (1) is implemented by means of a left shift, an addition, a comparator, and up to two subtractions to perform the modulo $M$ reduction [2]. Since public-key cryptography involves large integers, operands are often represented in the carry-save number system, which enables addition in constant time (see, for instance, [3]). However, due to the redundancy of this representation, comparison requires a conversion to a nonredundant number system. This operation involves carry propagations, thus losing the main advantage of the carry-save representation. Several improvements of the algorithm sketched by (1) have therefore been investigated to avoid comparisons. They are based on the following observation: computing a number $P[i]$ congruent to $Q[i]$ modulo $M$ only requires inspecting a few most significant digits of $T[i]$. In order to avoid an expensive final modular reduction, $P[i]$ should be less than a very small multiple of $M$. The iteration described in this paper returns, for instance, $\langle XY \rangle_M$ or $\langle XY \rangle_M + M$ and requires at most one subtraction to get the final result.
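As an illustration, the Horner-based recursion with a full reduction at each step can be sketched in software as follows. This is a minimal model on ordinary Python integers, not a description of any hardware; the function name `horner_modmul` is hypothetical.

```python
def horner_modmul(x: int, y: int, m: int, r: int) -> int:
    """Return (x * y) mod m by scanning the r bits of x from MSB to LSB."""
    assert 0 <= y < m and 0 <= x < (1 << r)
    q = 0  # Q[r] = 0
    for i in range(r - 1, -1, -1):
        x_i = (x >> i) & 1
        t = 2 * q + x_i * y  # T[i] < 3M because Q[i+1] and Y are <= M - 1
        # A comparator and at most two subtractions bring T[i] back below M.
        subtractions = 0
        while t >= m:
            t -= m
            subtractions += 1
        assert subtractions <= 2
        q = t
    return q  # Q[0]
```

The internal assertion makes the "up to two subtractions" claim explicit: it would fail if an intermediate value ever reached $3M$.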
Koç and Hung [4], [5] proposed, for instance, a carry-save algorithm based on a sign estimation technique. At each step, $-M$, $0$, or $M$ is added to $T[i]$ according to a few most significant digits of $Q[i+1]$. Takagi and Yajima [6], [7] applied a similar technique to design signed-digit-based architectures. When the modulus $M$ is known at design time, which is often the case in public-key cryptography, another approach consists of building a table $\psi(a) = \langle a \cdot 2^{\alpha} \rangle_M$ and defining the following iteration:

$$T[i] = 2P[i+1] + x_i Y, \tag{2}$$

$$P[i] = \psi(T[i] \operatorname{div} 2^{\alpha}) + (T[i] \bmod 2^{\alpha}), \tag{3}$$
where $P[r] = 0$, and $\alpha$ is generally chosen to be equal to $n$ or $n - 1$. Since $\psi(T[i] \operatorname{div} 2^{\alpha})$ is an $n$-bit number, $P[i]$ and $T[i]$ are respectively $(n+1)$- and $(n+3)$-bit numbers. Therefore, the algorithm sketched by the above equations requires a small table. Carry-save implementations of (2) and (3) have, for example, been proposed by Jeong and Burleson [8], Kim and Sobelman [9], and Peeters et al. [10]. Since these algorithms depend on the modulus $M$, they seem likely candidates for cryptographic hardware based on FPGAs: the reconfigurability of these devices allows one to optimize the architecture according to some parameters (e.g., the modulus) and to modify the hardware whenever they change.
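The table-based iteration can be checked with a short software model. The sketch below assumes the table maps $a$ to $\langle a \cdot 2^{\alpha} \rangle_M$ and that the iteration adds the table output to the $\alpha$ low-order bits of $T[i]$, with $\alpha \in \{n, n-1\}$; the function name and the table sizing are illustrative choices, not taken from the cited implementations.

```python
def table_modmul(x: int, y: int, m: int, r: int, alpha=None) -> int:
    """Return P[0], a number congruent to x*y mod m that is not fully reduced."""
    n = m.bit_length()
    alpha = n if alpha is None else alpha
    assert alpha in (n, n - 1) and 0 <= y < m
    # T[i] has at most n + 3 bits, hence T[i] div 2^alpha < 2^(n + 3 - alpha).
    psi = [(a << alpha) % m for a in range(1 << (n + 3 - alpha))]
    mask = (1 << alpha) - 1
    p = 0  # P[r] = 0
    for i in range(r - 1, -1, -1):
        t = 2 * p + ((x >> i) & 1) * y      # shift and add the partial product
        p = psi[t >> alpha] + (t & mask)    # table lookup replaces the comparison
    return p
```

No comparison with $M$ occurs inside the loop; the price is that the returned value still needs a final reduction.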
FPGA-Specific Issues
Modern FPGAs are mainly designed for digital signal processing applications involving rather small operands (16 to 32 bits). FPGA manufacturers chose to include dedicated carry logic enabling the implementation of fast carry-ripple adders (CRAs) for such operand sizes. Let us study, for example, the architecture of a Spartan-3 device. Fig. 1 describes the simplified architecture of a slice, which is the main logic resource for implementing synchronous and combinatorial circuits. Each slice embeds two four-input function generators (G-LUT and F-LUT), two storage elements (i.e., flip-flops FFY and FFX), carry logic (CYSELG, CYMUXG, CYSELF, CYMUXF, and CYINIT), arithmetic gates (GAND, FAND, XORG, and XORF), and wide-function multiplexers. Each function generator is implemented by means of a programmable lookup table (LUT).
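Recall that a full-adder (FA) cell has two bits $x_i$ and $y_i$, as well as a carry-in bit $c_{\text{in}}$, as inputs and computes a sum bit $s_i$ and a carry-out bit $c_{\text{out}}$ such that

$$s_i = h_i \oplus c_{\text{in}}, \tag{4}$$

$$c_{\text{out}} = \begin{cases} x_i & \text{if } h_i = 0, \\ c_{\text{in}} & \text{otherwise}, \end{cases} \tag{5}$$

where $h_i = x_i \oplus y_i$.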
Assume that the F-LUT function generator computes $h_i$. Then, the XORF gate implements (4), whereas (5) involves three multiplexers (CYOF, CYSELF, and CYMUXF). $s_i$ is either sent to another slice (output X) or stored in a flip-flop (FFX). The G-LUT function generator allows one to implement a second FA cell within the same slice, which thus embeds a 2-bit CRA. Unfortunately, a conventional carry-save adder (CSA) requires twice as many hardware resources: since each slice has a single carry input CIN, it is impossible to implement two FA cells with independent carry-in bits. Therefore, hardware design tools allocate two function generators when they are provided with a VHDL description of (4) and (5). It is of course possible to write VHDL code that explicitly instantiates F-LUT, XORF, and CYMUXF. In this case, note that G-LUT can only be used to implement the control unit or the $\psi(\cdot)$ table of (3). According to experimental results, G-LUT often remains unused. Although it reduces the number of LUTs of the design, taking advantage of dedicated logic to describe a CSA leads to a larger operator (Table 1). Similar problems arise, for instance, on all Virtex devices (Xilinx) and Cyclone II FPGAs (Altera). It is therefore interesting to investigate modular multiplication algorithms based on FPGA-friendly number systems.
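The structure of such an FA cell can be mimicked in software. The sketch below assumes the usual decomposition in which the LUT produces the propagate signal $h_i = x_i \oplus y_i$, an XOR gate produces the sum bit, and a multiplexer controlled by $h_i$ selects the carry-out; the function names are hypothetical and the code is a behavioural model, not a netlist.

```python
def full_adder(x_i: int, y_i: int, c_in: int):
    """One FA cell: LUT (propagate), XOR gate (sum), multiplexer (carry)."""
    h_i = x_i ^ y_i                 # function generator output
    s_i = h_i ^ c_in                # sum bit
    c_out = c_in if h_i else x_i    # carry multiplexer driven by h_i
    return s_i, c_out


def carry_ripple_add(x: int, y: int, width: int, c_in: int = 0):
    """Chain `width` FA cells; a single carry-in enters the whole chain."""
    s, c = 0, c_in
    for i in range(width):
        s_i, c = full_adder((x >> i) & 1, (y >> i) & 1, c)
        s |= s_i << i
    return s, c  # (sum word, carry-out)
```

The chain has one carry input only, which mirrors why two FA cells sharing a slice cannot receive independent carry-in bits.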
Our Contribution
In [11], we proposed a family of radix-2 algorithms designed for FPGAs embedding four-input LUTs and dedicated carry logic. Table 2 compares our iteration stage against three carry-save schemes. Since these results do not include the conversion from carry-save to unsigned integer that occurs at the end of each multiplication, both the area and delay of carry-save operators are underestimated. According to this experiment, our previous approach is efficient for moduli up to 32 bits. Thus, the aim of this paper is to extend our work to larger moduli. In order to benefit from the dedicated carry logic available in almost all FPGA families, we suggest choosing a high-radix carry-save number system, where each sum bit of the carry-save representation is replaced by a sum word (Section 2). Such a number system allows for the design of a modular multiplication algorithm based on small CRAs (Section 3). The main originality of our approach is to analyze the modulus $M$ in order to select the most efficient high-radix carry-save representation and to automatically generate the VHDL description of the operator (Section 4). Experimental results validate the efficiency of the proposed modular multiplication scheme (Section 5). We proposed a preliminary version of this work based on a different iteration in [12].
HIGH-RADIX CARRY-SAVE NUMBERS
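A $k$-digit high-radix carry-save number $X$ is denoted by

$$X = (x_{k-1}, \ldots, x_1, x_0) = \left( (x^{(c)}_{k-1}, x^{(s)}_{k-1}), \ldots, (x^{(c)}_0, x^{(s)}_0) \right),$$

where the $j$th digit $x_j$ consists of an $n_j$-bit sum word $x^{(s)}_j$ and a carry bit $x^{(c)}_j$ such that $x_j = x^{(s)}_j + x^{(c)}_j 2^{n_j}$.
Let us define

$$X^{(s)} = \sum_{j=0}^{k-1} x^{(s)}_j 2^{n_0 + \cdots + n_{j-1}}$$

and

$$X^{(c)} = \sum_{j=0}^{k-1} x^{(c)}_j 2^{n_0 + \cdots + n_j}.$$

With this notation, we have $X = X^{(s)} + X^{(c)}$. Such a number system has nice properties to deal with large numbers on FPGA: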
- Its redundancy allows one to perform addition in constant time (the critical path only depends on $\max_{0 \le j \le k-1} n_j$).
- The addition of a sum word $x^{(s)}_j$, a carry bit $x^{(c)}_{j-1}$, and an $n_j$-bit unsigned binary number can be performed by a CRA.

A key observation is that the sum words do not all need to have the same width. This peculiarity will allow us to select a number system according to the modulus to optimize our operators (Section 4). In the following, we will assume that the carry bit of the most significant digit is always equal to zero (the weight of the most significant carry bit is therefore equal to $2^{n_0 + n_1 + \cdots + n_{k-2}}$).
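To make the encoding concrete, the following sketch evaluates a high-radix carry-save number from its sum words and carry bits. It assumes that the carry bit of digit $j$ has the same weight as the least significant bit of sum word $j+1$, which is what the CRA-based addition above relies on. The function name and the digit values in the usage note are made up for illustration and are not those of Fig. 2.

```python
def hrcs_value(widths, sums, carries):
    """Value X = X_s + X_c of a high-radix carry-save number.

    widths[j]  : n_j, width of sum word j (j = 0 is least significant)
    sums[j]    : sum word of digit j, an n_j-bit unsigned integer
    carries[j] : carry bit of digit j; the top one is assumed to be zero
    """
    assert len(widths) == len(sums) == len(carries)
    assert carries[-1] == 0
    x_s = x_c = 0
    offset = 0
    for n_j, s_j, c_j in zip(widths, sums, carries):
        assert 0 <= s_j < (1 << n_j) and c_j in (0, 1)
        x_s += s_j << offset
        offset += n_j
        x_c += c_j << offset  # weight 2^(n_0 + ... + n_j)
    return x_s + x_c
```

With widths `[4, 4, 4, 3]`, the encodings `([10, 5, 3, 1], [1, 0, 1, 0])` and `([10, 6, 3, 1], [0, 0, 1, 0])` both evaluate to 9066, which shows the redundancy of the representation.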
Example 1. Fig. 2a describes a four-digit high-radix carry-save number with $n_0 = n_1 = n_2 = 4$ and $n_3 = 3$. According to the first definition of this number system, we have
We can also compute
We obtain $X = X^{(s)} + X^{(c)} = 33{,}626$.
Consider the modular multiplication described by (2) and (3) and assume that both $T[i]$ and $P[i]$ are high-radix carry-save numbers. Each equation now involves the addition of a high-radix carry-save number and an unsigned integer (a partial product $x_i Y$ or a number $\psi(T[i] \operatorname{div} 2^{\alpha})$ stored in the table). Fig. 3 describes how to perform these operations: the integer operand is split into $k$ words of respective lengths $n_0, \ldots, n_{k-1}$; then, each of these words is added to a sum word and a carry bit by means of an $n_j$-bit CRA. Unfortunately, the high-radix carry-save encoding has a drawback: shifting an operand modifies its representation. The following example illustrates this problem, which occurs in the computation of $T[i]$ (2).
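The word-wise addition just described can be sketched as follows. The code assumes that digit $j$ is handled by its own adder, which receives sum word $j$, word $j$ of the integer operand, and the previous carry bit of digit $j-1$; the adder's carry-out becomes the new carry bit of digit $j$. Function names and test values are hypothetical.

```python
def hrcs_value(widths, sums, carries):
    """Value of a high-radix carry-save number (helper for checking)."""
    total, offset = 0, 0
    for n_j, s_j, c_j in zip(widths, sums, carries):
        total += s_j << offset
        offset += n_j
        total += c_j << offset
    return total


def hrcs_add_unsigned(widths, sums, carries, y):
    """Add the unsigned integer y using one independent n_j-bit adder per digit."""
    new_sums, new_carries = [], []
    carry_in = 0  # nothing enters the least significant digit
    for n_j, s_j, c_j in zip(widths, sums, carries):
        y_j = y & ((1 << n_j) - 1)       # word j of the integer operand
        y >>= n_j
        total = s_j + y_j + carry_in     # what an n_j-bit CRA computes
        new_sums.append(total & ((1 << n_j) - 1))
        new_carries.append(total >> n_j)
        carry_in = c_j                   # old carry of digit j feeds digit j + 1
    # The caller must leave enough headroom in the most significant word.
    assert y == 0 and new_carries[-1] == 0
    return new_sums, new_carries
```

No carry ripples from one adder to the next within a single addition, so the delay is set by the widest word only.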
Example 2. Let us consider again the number $X = 33{,}626$, whose format is defined in Fig. 2a. By shifting $X$, we obtain $Z = 2X = 67{,}252$ (Fig. 2b). However, the least significant sum word is now a 5-bit number, and
According to this example, $P[i+1]$ and $P[i]$ do not have the same encoding, and the width of the CRA dealing with the least significant digit of $P$ would increase at each step. The solution consists of converting $T[i]$ to the format of $P[i]$. In the following, we describe a modular multiplication algorithm that minimizes the hardware overhead induced by this conversion.
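The effect of a left shift on the format can be reproduced with a few lines of code. The sketch assumes that shifting only appends a zero to the least significant sum word and doubles the weight of every other word and carry bit; the helper names and digit values are illustrative and do not correspond to Fig. 2.

```python
def hrcs_value(widths, sums, carries):
    """Value of a high-radix carry-save number (helper for checking)."""
    total, offset = 0, 0
    for n_j, s_j, c_j in zip(widths, sums, carries):
        total += s_j << offset
        offset += n_j
        total += c_j << offset
    return total


def hrcs_double(widths, sums, carries):
    """Return the format and digits of 2X: only the first word changes."""
    new_widths = [widths[0] + 1] + list(widths[1:])   # n_0 grows by one bit
    new_sums = [2 * sums[0]] + list(sums[1:])         # zero appended on the right
    return new_widths, new_sums, list(carries)
```

Applying `hrcs_double` repeatedly would keep widening the first word, which is why a conversion back to the original format is needed at each iteration.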
HIGH-RADIX CARRY-SAVE MODULAR MULTIPLICATION
This section describes how to take advantage of a high-radix carry-save number system to perform a modular multiplication. We assume the following:

- The modulus $M$ is an $n$-bit number belonging to $\{2^{n-1} + 1, \ldots, 2^n - 1\}$.
- $X$ is an $r$-bit unsigned integer.
- $Y$ is an unsigned integer smaller than $M$.
- $\alpha$ is a small integer parameter that will determine the size of the table required for the modulo $M$ reduction.
- The most significant sum word of $P[i]$ contains at least $n_{k-1} = 6$ bits if $\alpha = 1$ and $n_{k-1} = 5$ bits if $\alpha = 2$.

These hypotheses guarantee that

- $P^{(c)}[i]$ is smaller than $2^{n-\alpha}$, i.e., $\langle P^{(c)}[i] \rangle_{2^{n-\alpha}} = P^{(c)}[i]$ (we fixed the carry bit of the most significant digit of a high-radix carry-save number to zero in Section 2), and
- $P^{(s)}[i]$ is an $(n+2)$-bit number (a proof is given in Appendix A).

At each iteration, we compute a high-radix carry-save number $P[i]$ congruent to $2P[i+1] + x_i Y$ modulo $M$. According to our hypotheses, we have
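The iteration of our algorithm is slightly different from the one described in Section 1. Let us define $\kappa[i+1] = P^{(s)}[i+1] \operatorname{div} 2^{n-\alpha}$ and write the intermediate result at step $i+1$ as follows:

$$P[i+1] = \kappa[i+1] \cdot 2^{n-\alpha} + \langle P^{(s)}[i+1] \rangle_{2^{n-\alpha}} + P^{(c)}[i+1].$$

It is worth noticing that, according to our hypotheses, $\kappa[i+1]$ is a 3- or 4-bit unsigned number for $\alpha = 1$ or $\alpha = 2$, respectively. Thus, a small table addressed by $\kappa[i+1]$ allows one to efficiently compute a number congruent to $P[i+1]$ modulo $M$:

$$P[i+1] \equiv \langle \kappa[i+1] \cdot 2^{n-\alpha} \rangle_M + \langle P^{(s)}[i+1] \rangle_{2^{n-\alpha}} + P^{(c)}[i+1] \pmod{M}.$$

Note that when $\alpha = 1$, the table can be stored within the LUTs of a CRA on Spartan-3 and Virtex FPGAs [11]. Since we compute a high-radix carry-save number congruent to $XY$ modulo $M$, a conversion and a final modulo $M$ reduction are mandatory. In order to keep the hardware overhead as small as possible, we apply a trick proposed by Peeters et al. [10] in the case of a carry-save implementation. At each step, our algorithm computes

$$P[i] = 2 \left( \langle \kappa[i+1] \cdot 2^{n-\alpha} \rangle_M + \langle P^{(s)}[i+1] \rangle_{2^{n-\alpha}} + P^{(c)}[i+1] \right) + x_i Y.$$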
According to this equation, $P[i]$ is always even when $x_i Y = 0$. Thus, by performing an additional step with $x_{-1} = 0$, we obtain an even number $P[-1]$ congruent to $2XY$ modulo $M$. Note that $P[-1]/2$ is smaller than or equal to $P[0]$ and easy to compute (a right shift of one position involves only wiring). Furthermore, performing the final modulo $M$ correction with $P[-1]/2$ requires fewer hardware resources. Let us define
$\langle j \cdot 2^{n-1} \rangle_M$, if $\alpha = 1$ and $n_{k-1} \ge 6$.
$P[-1]/2$ is a high-radix carry-save number equal to $\langle XY \rangle_M$ or $\langle XY \rangle_M + M$, and the final modulo $M$ reduction requires at most one subtraction in the following cases:
- $\alpha = 1$, $n_{k-1} \ge 6$, and
A proof of correctness of this modular multiplication scheme, summarized by Algorithm 1, is provided in Appendix A. At each iteration, a new intermediate result $P[i]$ is computed in two steps. Let $\tilde{P}[i+1]$ be a high-radix carry-save number such that $\tilde{P}^{(s)}[i+1] = \langle P^{(s)}[i+1] \rangle_{2^{n-\alpha}}$ and $\tilde{P}^{(c)}[i+1] = P^{(c)}[i+1]$. We first carry out the sum of the partial product $x_i Y$ and $2\tilde{P}[i+1]$ by means of small CRAs:

$$T[i] = 2\tilde{P}[i+1] + x_i Y.$$

By shifting the high-radix carry-save number $P[i+1]$, we define a new internal representation for $T[i]$ (Section 2). The second step consists of adding $2\langle \kappa[i+1] \cdot 2^{n-\alpha} \rangle_M$ to $T[i]$ and converting the result to the format of $P[i+1]$.

Algorithm 1. High-radix carry-save modulo $M$ multiplication
Input: An $n$-bit modulus $M$ such that $2^{n-1} < M < 2^n$, an $r$-bit number $X \in \mathbb{N}$, $Y \in \{0, \ldots, M - 1\}$, and a parameter $\alpha \in \{1, 2\}$. $P[i]$ and $T[i]$ are high-radix carry-save numbers.

The main difficulty of the implementation arises from the left shift of the carry bits $P^{(c)}[i+1]$. Since $T[i]$ has a different encoding, it is necessary to perform a conversion. We suggest computing a high-radix carry-save number $U[i]$ that has the same encoding as $P[i+1]$ and is equal to the sum of the carry bits of $T[i]$ and the output of the table (i.e., $2\langle \kappa[i+1] \cdot 2^{n-\alpha} \rangle_M$). Therefore, we perform the following operations at each iteration of Algorithm 1:
The high-radix carry-save number $T[i]$ contains three carry bits of respective weights $2^5$, $2^9$, and $2^{13}$ (recall the constraint introduced in Section 2: the carry bit of the most significant digit is always equal to zero). We split $2\langle \kappa[i+1] \cdot 2^{n-\alpha} \rangle_M$ into four 4-bit words and perform three additions to compute $U[i]$ (Fig. 4).
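The arithmetic of the iteration described in this section can be checked with an integer-level model. The sketch below deliberately ignores the word-level encoding: the intermediate result is kept as a single integer (equivalent to all carry bits being folded into the sum), so it validates the congruence and the final bound but says nothing about the hardware structure. The function name, the use of $\kappa$ for the table address, and the table size $2^{\alpha+2}$ (a 3- or 4-bit address) are assumptions of this sketch.

```python
def hrcs_modmul_model(x: int, y: int, m: int, r: int, alpha: int = 1) -> int:
    """Return P[-1]/2, which should equal x*y mod m or x*y mod m + m."""
    n = m.bit_length()
    assert (1 << (n - 1)) < m and 0 <= y < m and alpha in (1, 2)
    shift = n - alpha
    # Table addressed by the top bits of the intermediate result.
    psi = [(kappa << shift) % m for kappa in range(1 << (alpha + 2))]
    p = 0  # P[r] = 0
    for i in range(r - 1, -2, -1):               # i = r-1, ..., 0, -1
        x_i = (x >> i) & 1 if i >= 0 else 0      # extra step with x_{-1} = 0
        kappa = p >> shift                       # table address
        t = 2 * (p & ((1 << shift) - 1)) + x_i * y
        p = t + 2 * psi[kappa]
    return p >> 1  # P[-1] is even, so this division is exact
```

In this simplified model the returned value stays below $2^{n-\alpha} + M$, so a single conditional subtraction of $M$ completes the reduction.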
CHOICE OF A HIGH-RADIX CARRY-SAVE NUMBER SYSTEM
Let us represent the table involved in the modulo $M$ correction by a matrix $\Psi$. Each line of $\Psi$ stores an $n$-bit number $\langle \kappa[i+1] \cdot 2^{n-\alpha} \rangle_M$. In the following, we will have to consider a subset of consecutive columns of $\Psi$. Let $\Psi$ [...]
It is worth noticing that the amount of hardware required to compute $U[i]$ depends on the modulus $M$ and the encoding of $P[i]$. For instance, if a column of $\Psi$ contains only zeros, it can be replaced by a carry bit at no extra cost. We propose an algorithm that selects a high-radix carry-save number system minimizing the hardware overhead introduced by the computation of $U[i]$ (6). Assume that we want to merge $t^{(c)}_w$ with the $j$th column and $t^{(c)}_{w+1}$ with the $(j+h)$th column of $\Psi$, and recall that the carry bits of $T[i]$ are left-shifted compared to those of $P[i]$ and $U[i]$ (Fig. 5). Therefore, we compute a digit $u_{w+1}$ such that [...]

- If $\ell = h - 1$, the addition requires an $(h-1)$-bit CRA, which generates an output carry bit $u^{(c)}_{w+1}$ (see (8) and Fig. 6b). Since this bit will be added to a few bits of $T^{(s)}[i]$ in order to compute $p^{(s)}_{w+2}[i]$, we raise a flag that indicates this carry propagation.
- When $0 < \ell < h - 1$, an $\ell$-bit CRA computes the sum and generates an output carry $c_{\text{out}}$. If the $(j+\ell)$th column of $\Psi$ stores only zeros, we replace it by $c_{\text{out}}$ (Fig. 6c). Otherwise, we need an OR gate to add $c_{\text{out}}$ to the $(j+\ell)$th column. Since we target FPGA applications, a more efficient solution consists of taking advantage of the dedicated carry logic to perform this operation, and we add $t^{(c)}_w$ to $\psi^{(j+\ell:j)}$ by means of an $(\ell+1)$-bit CRA (Fig. 6d).

Note that $u$ [...] given by (7) and easily determine that $\ell = 2$. Thus, we have to examine the third column of $\Psi^{(6:3)}$ in order to compute the width of the CRA. Since this column contains a non-null element, we need an $(\ell+1)$-bit CRA (see Fig. 6d).
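The observation about all-zero columns is easy to automate. The sketch below builds a correction table with one row per table address and lists the columns that are zero in every row. It assumes a table of $2^{\alpha+2}$ rows holding the residues of the address times $2^{n-\alpha}$, and it numbers columns from 0 at the least significant bit, which may differ from the column numbering used in the figures; both function names are hypothetical.

```python
def correction_table(m: int, alpha: int):
    """Rows of the correction table: (kappa * 2^(n - alpha)) mod m."""
    n = m.bit_length()
    return [(kappa << (n - alpha)) % m for kappa in range(1 << (alpha + 2))]


def zero_columns(table, n: int):
    """Indices (0 = least significant bit) of columns that are zero in every row."""
    return [j for j in range(n)
            if all(((row >> j) & 1) == 0 for row in table)]
```

A code generator could call `zero_columns(correction_table(m, alpha), m.bit_length())` to find the positions where a carry bit can be merged without an adder.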
Let us now build a directed acyclic graph as follows:
- Each node represents a column of the matrix $\Psi$.
- The weight of the edge between nodes $j$ and $j + h$ is given by the width of the CRA responsible for the addition of [...].
- Since we want to perform a modular multiplication by means of small CRAs, we have to provide the algorithm with a constraint on the maximal number of positions $w_{\max}$ between two consecutive carry bits (without this constraint, we would, for instance, have an edge from the first node to the last node).
- The minimal distance between two consecutive carries $w_{\min}$ should be greater than or equal to two. It guarantees that the smallest building block is a 2-bit CRA.

Algorithm 2 describes a way to build this graph. After having computed $\Psi$, we have to determine to which columns the most significant bit of $T[i]$ can be added. We denote by $j_{\max}$ the upper index. Recall that when $\alpha = 2$, we assume that the most significant sum word of $P[i]$ contains at least $n_{k-1} = 5$ bits. Thus, we deduce that $j_{\max} = n - 2$ (Fig. 7a). The most significant sum word of $U[i]$ is computed as follows:
When $\alpha = 1$, we have $n_{k-1} \ge 6$ and $j_{\max} = n - 3$ (Fig. 7b). It is sometimes possible to relax the constraint on $n_{k-1}$: it suffices that the addition of $t^{(c)}_{k-2}$ to $\psi^{(n:j_{\max})}$ does not generate an output carry (see the proof in Appendix A for details). This condition is satisfied if the length of the longest string of consecutive ones starting from the $j_{\max}$ column of $\Psi$ is smaller than or equal to $n - j_{\max}$. We have to distinguish three cases to build the graph:
- The first carry bit $t^{(c)}_0$ can be added to any column whose index belongs to $\{2, \ldots, w_{\max} + 1\}$. We create an edge of weight zero between the first node and the nodes associated with such columns.
- The most significant carry bit $t^{(c)}_{k-2}$ belongs to the set $\{n - w_{\max} + 1, \ldots, j_{\max}\}$. Let $h \in \{w_{\min}, \ldots, w_{\max} - 1\}$. There is therefore an edge between nodes $j$ and $n$ if $j + h \ge n$.
- There is an edge between nodes $j$ and $j + h$, where $h \in \{w_{\min}, \ldots, w_{\max}\}$, if $j + h \le j_{\max}$.
The next step consists of finding a shortest path in the graph. In order to minimize the critical path, we suggest removing all edges whose carry propagation flag is set to one.
If there is no path between nodes 1 and n in this pruned graph, we have to consider the full graph.
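The selection step can be sketched as a shortest-path search on a directed acyclic graph whose edges always go from a lower to a higher node index. The code first tries the graph without flagged edges and falls back to the full graph when no path survives. The edge list format `(source, destination, weight, flag)`, the function names, and the example graph are hypothetical; edge weights would come from the CRA widths discussed above.

```python
def shortest_path(n_nodes, edges):
    """Shortest path from node 1 to node n_nodes; edges are (src, dst, weight)."""
    inf = float("inf")
    best = [inf] * (n_nodes + 1)
    prev = [None] * (n_nodes + 1)
    best[1] = 0
    # Every edge satisfies src < dst, so increasing src is a topological order.
    for src, dst, weight in sorted(edges):
        if best[src] + weight < best[dst]:
            best[dst] = best[src] + weight
            prev[dst] = src
    if best[n_nodes] == inf:
        return None
    path, node = [], n_nodes
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[n_nodes], path[::-1]


def select_number_system(n_nodes, flagged_edges):
    """Prefer paths without carry-propagation flags; otherwise use the full graph."""
    pruned = [(s, d, w) for s, d, w, flag in flagged_edges if not flag]
    result = shortest_path(n_nodes, pruned)
    if result is None:
        result = shortest_path(n_nodes, [(s, d, w) for s, d, w, _ in flagged_edges])
    return result
```

The nodes on the returned path give the columns where carry bits are placed, and hence the widths of the sum words.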
Algorithm 2. Selection of a high-radix carry-save number system.
Input: An $n$-bit modulus $M$, $w_{\max}$, and $w_{\min}$ such that $w_{\max} \ge w_{\min} \ge 2$. A parameter $\alpha \in \{1, 2\}$.
Output: A directed acyclic graph.

    Compute the matrix $\Psi$ according to $\alpha$;
    if $\alpha = 2$ then
        $j_{\max} \leftarrow n - 2$;
    else
        $j_{\max} \leftarrow n - 3$;
        $j \leftarrow n - 2$;
        while $\#(\psi^{(n:j)}) \le n - j$ do
            $j_{\max} \leftarrow j$;
            $j \leftarrow j + 1$;
        end while
    end if
    for $j = 2$ to $w_{\max}$ do
        Create an edge of weight 0 between nodes 1 and $j$;
    end for
    for $j = 2$ to $j_{\max}$ do
        for $h = w_{\min}$ to $w_{\max}$ do
            if $j + h \ge n$ and $h < w_{\max}$ then
                $\ell_{\max} \leftarrow \#(\psi^{(n:j)})$;
                Create an edge between nodes $j$ and $n$;
                Compute the weight of the edge (see Fig. 6);
            end if
            if $j + h \le j_{\max}$ then
                Create an edge between nodes $j$ and $j + h$;
                Compute the weight of the edge (see Fig. 6);
            end if
        end for
    end for

Example 6 (Example 4 continued). Let us apply Algorithm 2 to our 16-bit example. First, we note that adding a carry bit to the 15th column of the matrix does not generate an output carry, and we set $j_{\max} = 15$. Then, we build the graph illustrated in Fig. 8 according to Algorithm 2. The $c$ flag on the edge between nodes $j$ and $j + h$ indicates that adding a carry bit to the $j$th column of $\Psi$ generates an output carry bit $u^{(c)}_j$. After having removed all edges labeled with a $c$ flag and nodes without a predecessor or successor, we obtain a pruned graph (Fig. 9). Thus, $P[i]$ is a four-digit word with $n_0 = n_1 = n_2 = 4$ and $n_3 = 6$ (Fig. 10). Since $n = 16$ and the largest table entry is $\psi_{\max} = (1010110010100101)_2 = 44{,}197$, we have
and $\alpha = 1$ is a valid choice. Note that if the above condition is not satisfied, we have to build a new graph with $\alpha = 2$. Recall that there is always a solution for $\alpha = 2$ and $n_{k-1} \ge 5$.

Once the high-radix carry-save representation is known, the automatic generation of a VHDL description of the modulo $M$ multiplier is rather straightforward. The computation of $T[i]$ involves $k$ CRAs of respective widths $n_0 + 1, n_1, \ldots, n_{k-2}$, and $n - n_{k-2} - \cdots - n_0 - 1$ (Fig. 7). Each edge of the graph encodes the size of the CRA determining a digit of $U[i]$, and the carry propagation flag indicates whether a carry bit $u^{(c)}_j$ is necessary or not. Finally, $k$ CRAs of widths $n_0, \ldots, n_{k-1}$ allow one to add $T^{(s)}[i]$ to $U[i]$.
IMPLEMENTATION RESULTS
In order to compare our algorithm against modular multipliers published in the open literature, we wrote a VHDL code generator whose inputs are a modulus M and w max (maximal number of positions between two consecutive carry bits; see Section 4). Our tool returns a structural VHDL description of a high-radix carry-save multiplier, as well as scripts to automatically place and route the design and collect area and timing information. This tool also generates a VHDL description of two architectures proposed by other researchers. The first one, described by Peeters et al. in [10] , is summarized by Algorithm 3.
Intermediate results are carry-save numbers denoted by $(C[i], S[i])$. At each step, a CRA computes the sum $k$ of the three most significant bits of $C[i+1]$ and $S[i+1]$. This 4-bit word addresses a table storing $\langle k \cdot 2^{n-2} \rangle_M$, $0 \le k \le 15$. Thanks to an additional iteration with $x_{-1} = 0$, this algorithm returns a carry-save number $(C[-1], S[-1])$, which is smaller than $2M$. Since our multiplier satisfies the same property, conversion to a nonredundant number system is performed with the same operator. We will therefore only consider iteration stages in our experiments in order to compare high-radix carry-save and carry-save number systems.

Algorithm 3. Peeters et al.'s modulo $M$ multiplication [10].
Input: An $n$-bit modulus $M$ such that $2^{n-1} < M < 2^n$, an $r$-bit number $X \in \mathbb{N}$, and $Y \in \{0, \ldots, M - 1\}$. We assume that $x_{-1} = 0$.
Amanor et al. introduced a carry-save architecture optimized for modular multiplication on FPGAs in [13] . The authors assume that both M and Y are known at design time. This hypothesis allows for the design of an iteration stage embedding a single CSA and a table addressed by the most significant bit of x i , C½i þ 1, and 
Since $M$ belongs to $\{2^{n-1} + 1, \ldots, 2^n - 1\}$, [...]

Fig. 11 describes place-and-route results on a Xilinx Spartan-3 FPGA. In these experiments, the moduli are 256-bit randomly generated primes. Compared against Algorithm 3, we observe the following:
- Our high-radix carry-save architecture allows us to significantly reduce the number of slices while only slightly lengthening the critical path. At the price of a longer critical path, we are able to further reduce the area by increasing the parameter $w_{\max}$. Note that conversion from (high-radix) carry-save to unsigned binary integer is usually based on pipelined CRAs (see, for instance, [3]). Depending on the trade-off between area and delay, this operator can be slower than an iteration stage based on (high-radix) carry-save arithmetic.
- The area of our operator is less sensitive to the choice of $M$. This is mainly related to the architecture of Xilinx FPGAs: in most cases, $\alpha = 1$, and each operator embeds a table addressed by 3 bits. Since our target FPGA embeds four-input LUTs, this table is embedded within the slices of the adder computing $P[i]$ [11]. Since 4 bits address the table of Algorithm 3, additional LUTs are required. Their number depends on the modulus $M$: if $M$ contains null or identical columns, synthesis tools are able to simplify the design.

For the moduli considered in these experiments, high-radix carry-save multipliers have roughly the same area as the operator proposed by Amanor et al. in [13]. Recall that a CSA requires twice the number of slices of a CRA on our target FPGA family. Since the moduli involved in these experiments require only 3 bits to perform a modulo $M$ reduction, our architecture is mainly based on two CRAs. High-radix carry-save representations here enable the design of a more versatile modular multiplier (both $X$ and $Y$ are inputs) with the same number of slices. Table 3 summarizes further results obtained with a Spartan-3 FPGA. We generated 100 prime moduli for each experiment and reported the intervals in which lie the area and delay ratios between our proposal and Algorithms 3 and 4.
These experiments indicate that our approach always allows one to select a prime number that reduces the circuit area without increasing the critical path. Table 4 digests experimental results involving an Altera Cyclone II FPGA. Fig. 12 describes a Logic Element (LE), which is the smallest unit of configurable logic in the Cyclone II architecture. Each LE includes a four-input LUT, a storage element, and dedicated carry logic and operates in normal mode or arithmetic mode. CRAs are based on LEs in arithmetic mode, in which the LUT implements two three-input function generators. It is therefore impossible to store M within LUTs of a CRA. This explains why our algorithm leads to slightly smaller area savings for this FPGA family. On Cyclone II devices, CSA operators are significantly faster; however, conversion to a nonredundant number system involves pipelined CRAs. If this operator is based on 32-bit blocks, our high-radix carry-save iteration stage has a shorter critical path. In this case, our approach leads to smaller modular multipliers than CSA schemes, without impacting computation time.

TABLE 3. Area and Delay Ratios between Our Proposal and Algorithms 3 and 4 on a Spartan-3 FPGA. One hundred prime moduli were randomly generated for each experiment. We report the intervals in which lie the area and delay ratios.

TABLE 4. Area and Delay Ratios between Our Proposal and Algorithms 3 and 4 on a Cyclone II FPGA. One hundred prime moduli were randomly generated for each experiment. We report the intervals in which lie the area and delay ratios.
CONCLUSION
We proposed an algorithm to automatically generate VHDL descriptions of modular multipliers for FPGAs. The main originality of our approach is the selection of an optimal high-radix carry-save encoding of intermediate results according to a given modulus M. High-radix carry-save number systems take advantage of dedicated carry logic available in almost all FPGA families and reduce the amount of interconnects. Therefore, our approach allows us to significantly reduce the area of modular multipliers.
APPENDIX A
This appendix aims at proving the correctness of Algorithm 1. We proceed in three steps: after establishing a property of the modulo $M$ correction considered in this paper, we show that $P^{(s)}[i]$ is an $(n+2)$-bit number. We conclude by computing a bound on $P[-1]$, which indicates that $P[-1]/2 < 2M$. This proof also provides the reader with all the technical details required to implement the algorithm or an automatic code generator.
A.1 A Property of Modulo $M$ Correction
The first step consists of establishing a property that will allow us to compute bounds on P ½i.
The proof is straightforward if the modulus M is smaller than or equal to . Let us assume now that M ¼ þ , where satisfies the following inequality:
For $k \in \{0, 1, 2, 3\}$, we easily check that
Consequently, we have
For $k = 8$, the following modulo $M$ operation has to be carried out:
Thus, we have
A modulo $M$ reduction is again required for $k = 12$. Since
we obtain
Since ! 1, we conclude the proof by noting that
A.2 Width of $P^{(s)}[i]$
Let us prove by induction that $P^{(s)}[i]$ is an $(n+2)$-bit number. Since $P[r] = 0$, we check that $k = 0$, and $T[r-1] = P[r-1] = x_{r-1} Y$, which is an $n$-bit number. This property holds for $i = r - 1$. Assume now that $P^{(s)}[i+1]$ is an $(n+2)$-bit number. We have to consider two cases according to the parameter $\alpha$:
- Our hypotheses guarantee that $n_{k-1} \ge 5$ for $\alpha = 2$.
Therefore, $2\langle P^{(s)}[i+1] \rangle_{2^{n-2}}$ contains $k$ sum words of respective widths $n'_0 = n_0 + 1, \ldots, n'_{k-2} = n_{k-2}$, and $n'_{k-1} = n_{k-1} - 4$ (Fig. 13a). Let us split the partial product $x_i Y$ into $k$ blocks in order to add it word by word to $2\langle P^{(s)}[i+1] \rangle_{2^{n-2}}$. We know that
Since $x_i Y$ is an $n$-bit integer, we deduce from the above equation that its most significant sum word contains $n''_{k-1} = n_{k-1} - 3$ bits.
Therefore, the sum of the most significant bits of $2\langle P^{(s)}[i+1] \rangle_{2^{n-2}}$, $x_i Y$, and a carry bit is bounded by
which is an $(n_{k-1} - 2)$-bit number. Therefore, since $\sum_{i=0}^{k-1} n_i = n + 2$, $T^{(s)}[i]$ is an $(n+1)$-bit number. Indeed, we have
The four most significant bits of $P^{(s)}[i+1]$ address the table responsible for the modulo $M$ correction (Fig. 13b). Recall that we have to combine the output of this table and the carry bits of $T[i]$ in order to generate a high-radix carry-save number $U[i]$, whose format is the one of $P[i]$. Since $2\langle k \cdot 2^{n-\alpha} \rangle_M$ is an $(n+1)$-bit number, we split it into $k$ words of respective lengths $n_0, n_1, \ldots, n_{k-2}$, and $(n_{k-1} - 1)$. Consider now the addition of the most significant words of $2\langle k \cdot 2^{n-\alpha} \rangle_M$ and $T^{(s)}[i]$ and the most significant carry bit of $T^{(c)}[i]$. According to our hypotheses, $n_{k-1} \ge 5$, and this most significant word contains at least 4 bits. Consider the worst case (Fig. 13b), where $n_{k-1} = 5$ and the weight of the most significant bit of $T^{(c)}[i]$ is equal to $2^{n-2}$. We deduce from (9) that $U^{(s)}[i]$ is an $(n+2)$-bit number and that its most significant sum word is smaller than or equal to $2^4$. The addition of the most significant words of $T^{(s)}[i]$ and $U^{(s)}[i]$ and a carry bit never generates an output carry, and $P^{(s)}[i]$ is therefore an $(n+2)$-bit number (Fig. 13c).
- Assume now that $\alpha = 1$. The same approach allows us to show that $T^{(s)}[i]$ is an $(n+1)$-bit word (Fig. 14a). According to our hypotheses, the most significant word of $P^{(s)}[i-1]$ contains at least 6 bits. Therefore, the weight of the most significant carry bit of $T^{(c)}[i]$ is at most $2^{n-3}$. Since (9) guarantees that $2\langle k \cdot 2^{n-\alpha} \rangle_M < 2^n + 2^{n-1} + 2^{n-2} + 2^{n-3}$, we deduce that $U^{(s)}[i]$ is an $(n+1)$-bit number (Fig. 14b). Note that for some moduli, we can relax the constraint on $n_{k-1}$: the remainder of the proof will only assume that $U^{(s)}[i]$ is an $(n+1)$-bit number. An automatic code generator can check this condition very easily for a given value of $M$. Since the most significant words of $T^{(s)}[i]$ and $U^{(s)}[i]$ have the same size, their addition may generate an output carry, and $P^{(s)}[i]$ is therefore an $(n+2)$-bit number (Fig. 14c).
A.3 Final Modulo $M$ Correction
The last step consists of proving that $P[-1]/2$ is smaller than $2M$. We again have to consider two cases according to $\alpha$:

- Assume that $\alpha = 2$ and consider the last iteration (i.e., $i = -1$). Since the partial product $x_{-1} Y$ is equal to zero, we have

$$T[-1] = 2 \cdot \left( \langle P^{(s)}[0] \rangle_{2^{n-2}} + P^{(c)}[0] \right) \le 2 \cdot (2^{n-2} - 1 + 2^{n-2} - 1) = 2^n - 4.$$

Thus, $P[-1] \le 2^n + 2M - 6$, and $P[-1]/2 \le 2^{n-1} + M - 3$. Since the modulus $M$ is supposed to be greater than $2^{n-1}$, we know that $P[-1]/2$ is smaller than $2M$.
- When $\alpha = 1$, $\langle P^{(s)}[i+1] \rangle_{2^{n-\alpha}}$ is smaller than or equal to $2^{n-1} - 1$. Recall that the weight of the most significant carry bit of $P^{(c)}[i+1]$ is equal to $2^{n_0 + n_1 + \cdots + n_{k-2}}$ (Section 2). Thus

$$T[-1] \le 2^n - 2 + 2^{n_0 + 1} + \cdots + 2^{n_0 + \cdots + n_{k-2} + 1}$$

and

$$P[-1] \le 2^n - 2 + 2^{n_0 + 1} + \cdots + 2^{n_0 + \cdots + n_{k-2} + 1} + 2\psi_{\max}.$$

Therefore, $P[-1]/2$ is smaller than $2M$ if

$$2^{n-1} - 1 + 2^{n_0} + \cdots + 2^{n_0 + \cdots + n_{k-2}} + \psi_{\max} < 2M.$$
