Abstract: Public-key cryptography often involves modular multiplication of large operands (from 160 up to 2048 bits). Several researchers have proposed iterative algorithms whose internal data are carry-save numbers. This number system is unfortunately not well suited to today's Field Programmable Gate Arrays (FPGAs) embedding dedicated carry logic.
I. INTRODUCTION
This paper is devoted to the study of modular multiplication of large operands on Field Programmable Gate Arrays (FPGAs), a crucial operation in many public-key cryptosystems (Elliptic Curve Cryptography, XTR, RSA). In order to compute $\langle XY \rangle_M = XY \bmod M$, where $M$ is an $n$-bit integer such that $2^{n-1} < M < 2^n$, our algorithm is described by an iterative procedure based on Horner's rule:

$$\langle XY \rangle_M = \langle (\ldots((x_{r-1} Y) 2 + x_{r-2} Y) 2 + \ldots + x_1 Y) 2 + x_0 Y \rangle_M,$$
where $X$ is an unsigned $r$-bit number and $Y$ is an $n$-bit number belonging to $\{0, \ldots, M-1\}$. This equation can be expressed recursively as follows:

$$Q[i] = \langle 2Q[i+1] + x_i Y \rangle_M, \qquad (1)$$

with $Q[r] = 0$. Each iteration of Equation (1) is implemented by means of a left shift, an addition, and up to two subtractions to perform the modulo $M$ reduction [1]. Several improvements of the algorithm sketched by Equation (1) have been proposed. The basic idea consists in computing a number congruent with $Q[i]$ modulo $M$, which requires less hardware than a modulo $M$ addition. Koç and Hung proposed a carry-save algorithm based on a sign estimation technique [2], [3]. At each step, $-M$, $0$, or $M$ is added to $2Q[i+1] + x_i Y$ according to a few most significant digits of $Q[i+1]$. When $M$ is known at design time, which is often the case in public-key cryptography, another method consists in building a table $\psi(a) = \langle a \cdot 2^{\beta} \rangle_M$ and in defining the following iteration:

$$T[i] = 2P[i+1] + x_i Y, \qquad (2)$$

$$P[i] = \psi(T[i] \;\mathrm{div}\; 2^{\beta}) + T[i] \bmod 2^{\beta}, \qquad (3)$$
where $P[r] = 0$ and $\beta$ is generally chosen equal to $n$ or $n-1$. Carry-save implementations of Equations (2) and (3) have been proposed (Table I). Beuchat and Muller proposed a family of radix-2 algorithms designed for FPGAs embedding 4-input LUTs and dedicated carry logic [7] (Algorithm 1). The resulting operators are efficient for moduli up to 32 bits. Since the computation of $\psi(a)$ only requires three bits, the reduction table can be stored within the LUTs of the second carry-ripple adder (CRA). However, operators based on carry-save adders are much faster for larger operands (Table II).
Algorithm 1 Radix-2 modulo M multiplication [7].
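The two iterations discussed above can be compared with a short Python sketch. It is an illustration under stated assumptions, not a transcription of Algorithm 1 from [7]: the function names are hypothetical, the table-based variant assumes the iteration $T = 2P + x_i Y$ followed by $P = \psi(T \;\mathrm{div}\; 2^{\beta}) + (T \bmod 2^{\beta})$, and $\psi(a)$ is computed on the fly instead of being read from a table.

```python
def horner_modmul(x, y, m, r):
    """Fully reduced Horner iteration: Q <- (2Q + x_i * Y) mod M, MSB first."""
    q = 0
    for i in range(r - 1, -1, -1):
        q = 2 * q + ((x >> i) & 1) * y
        # With Q, Y < M we have 2Q + x_i * Y < 3M, so at most two
        # subtractions of M bring the value back into [0, M).
        for _ in range(2):
            if q >= m:
                q -= m
    return q


def table_modmul(x, y, m, r, beta):
    """Table-based iteration (assumed form). The result is congruent to
    x * y modulo m but is generally not fully reduced."""
    low_mask = (1 << beta) - 1
    p = 0
    for i in range(r - 1, -1, -1):
        t = 2 * p + ((x >> i) & 1) * y
        a = t >> beta              # the few bits above position beta
        psi = (a << beta) % m      # psi(a); hardware would read a small table
        p = psi + (t & low_mask)
    return p


# Usage with an arbitrary 32-bit modulus and operands (Y must be below M).
M = 4294967111
X, Y = 0xDEADBEEF, 0x12345678
assert horner_modmul(X, Y, M, 32) == (X * Y) % M
assert table_modmul(X, Y, M, 32, beta=32) % M == (X * Y) % M
```

The second function trades the comparison-and-subtract steps for one table lookup and one addition per iteration, at the cost of a result that still needs a final reduction.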
In this paper, we describe an implementation of Algorithm 1 in a high-radix carry-save number system, which is briefly reviewed in Section II. The basic idea consists in replacing the sum bits of the carry-save representation by sum words which are efficiently added by means of small CRAs. Our high-radix carry-save modular multiplication algorithm is described in Section III. Its main originality is to analyze the modulus $M$ in order to choose the best high-radix carry-save representation of $P[i]$ and $T[i]$. Consequently, both the architecture and the performance of our operators depend on $M$. We wrote a VHDL code generator to easily build several modulo $M$ multipliers and to compare them against previously published solutions (Section IV). Finally, we conclude in Section V.
II. HIGH-RADIX CARRY-SAVE NUMBERS
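Definition 1: A $k$-digit high-radix carry-save number $X$ is denoted by

$$X = (x_{k-1}, \ldots, x_0) = \left( (x^{(s)}_{k-1}, x^{(c)}_{k-1}), \ldots, (x^{(s)}_0, x^{(c)}_0) \right),$$

where the $j$th digit $x_j$ consists of an $n_j$-bit sum word $x^{(s)}_j$ and a carry bit $x^{(c)}_j$ such that $x_j = x^{(s)}_j + x^{(c)}_j 2^{n_j}$. Note that this representation does not require a fixed radix. We have:

$$X = \sum_{j=0}^{k-1} x_j \, 2^{\sum_{l=0}^{j-1} n_l} = x_0 + x_1 2^{n_0} + \ldots + x_{k-1} 2^{n_0 + \ldots + n_{k-2}}.$$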
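Example 1: Consider the 5-digit high-radix carry-save number $X = ((5, 1), (0, 0), (3, 1), (6, 1), (4, 0))$ with $n_4 = n_3 = n_0 = 3$, $n_2 = 2$, and $n_1 = 4$ (Figure 1). According to Definition 1, $X$ represents 54324:

$$X = 4 + (6 + 2^4) \cdot 2^3 + (3 + 2^2) \cdot 2^7 + 0 \cdot 2^9 + (5 + 2^3) \cdot 2^{12} = 4 + 176 + 896 + 0 + 53248 = 54324.$$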
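The value of Example 1 can be checked with a few lines of Python. This is a minimal sketch: the function name `hrcs_value` and the list layout (least significant digit first, each digit stored as a `(sum_word, carry_bit)` pair) are choices made for this illustration.

```python
def hrcs_value(digits, widths):
    """Integer represented by a high-radix carry-save number.

    digits[j] = (sum_word, carry_bit) and widths[j] = n_j,
    with index 0 the least significant digit.
    """
    value, shift = 0, 0
    for (s, c), n in zip(digits, widths):
        assert 0 <= s < (1 << n) and c in (0, 1)
        # The carry bit of digit j weighs 2**n_j relative to its sum word.
        value += (s + (c << n)) << shift
        shift += n
    return value


# Example 1, listed here from x_0 up to x_4.
digits = [(4, 0), (6, 1), (3, 1), (0, 0), (5, 1)]
widths = [3, 4, 2, 3, 3]
print(hrcs_value(digits, widths))  # 54324
```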
III. HIGH-RADIX CARRY-SAVE MODULAR MULTIPLICATION

A. High-Radix Carry-Save Iteration Stage
Assume that $P[i+1]$ is a $k$-digit high-radix carry-save number whose sum words are $n_0$-, $\ldots$, $n_{k-1}$-bit unsigned integers such that $n_0 + \ldots$
We deduce from this inequality that the = 2. However, the computation of the remaining digits is defined by:
In the following, we propose a method to efficiently merge the carry bits and $\psi(a)$ so that the computation of $P[i]$ only requires $k$ CRAs. The format of the high-radix carry-save numbers as well as the architecture of the operator will therefore depend on the chosen modulus $M$.
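The role of the small CRAs can be illustrated with a sketch that adds an ordinary binary operand to a high-radix carry-save number, using one short addition per digit. This is not the iteration stage of the paper: the function names, the least-significant-first list layout, and the separate `overflow` bit are assumptions made for the illustration.

```python
def hrcs_value(digits, widths):
    """Integer represented by digits[j] = (sum_word, carry_bit), LSD first."""
    value, shift = 0, 0
    for (s, c), n in zip(digits, widths):
        value += (s + (c << n)) << shift
        shift += n
    return value


def hrcs_add_binary(digits, widths, y):
    """Add the binary integer y using one n_j-bit addition per digit.

    Digit j adds its sum word, the matching n_j-bit field of y, and the
    carry bit of digit j - 1. No carry ripples across digit boundaries.
    Returns (new_digits, overflow); overflow is the incoming carry bit of
    the top digit, which has no digit above it to absorb it.
    """
    total_width = sum(widths)
    assert 0 <= y < (1 << total_width), "y must fit in the sum words"
    new_digits, carry_in, shift = [], 0, 0
    for (s, c), n in zip(digits, widths):
        mask = (1 << n) - 1
        total = s + ((y >> shift) & mask) + carry_in   # fits in n + 1 bits
        new_digits.append((total & mask, total >> n))
        carry_in = c
        shift += n
    return new_digits, carry_in


# Usage with the number of Example 1 (value 54324, 15 sum bits in total).
digits = [(4, 0), (6, 1), (3, 1), (0, 0), (5, 1)]
widths = [3, 4, 2, 3, 3]
result, overflow = hrcs_add_binary(digits, widths, 12345)
assert hrcs_value(result, widths) + (overflow << 15) == 54324 + 12345
```

Each digit's addition is independent of the others, so the delay is set by the widest sum word rather than by the full operand width.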
B. Modulo M Reduction Table
Let us consider the 32-bit modulus $M = 4294967111$ whose correction table is stored in the matrix $\Psi_M$. In this example, $\psi(a)$ is a 10-bit word, $\forall a \in \{0, \ldots, 4\}$. If $n_0 = 9$, we can therefore merge the output of the table and the carry bits $t^{(c)}$
as follows:
otherwise. Figure 3 describes the hardware architecture of an iteration stage. We assume here that $n_1 = 5$; the critical path therefore includes a 10-bit CRA and a 5-bit CRA on Virtex or Spartan FPGAs (a dedicated AND gate is associated with each LUT to compute $x_i Y$, and each bit of $\psi(a)$ is computed within a LUT of the second CRA). It is however possible to further reduce $n_0$ at the price of slightly more complex logic. Suppose for instance that $n_0 = 1$. We have to combine a carry bit $t$ However, these two strategies fail for several moduli whose reduction table is as follows: 1 1 1 1 1 1 1 0 .
Let us consider the nine non-zero columns of this matrix. We define
A half-adder (HA) cell computes $\psi_0(a) + t^{(c)}$ and returns a sum bit and a carry bit, respectively denoted by $\sigma^{(s)}$ and $\sigma^{(c)}$. Consequently, we have
Therefore, the modification of the reduction table only requires an HA cell, an inverter, and an AND gate:
We can apply the same strategy to combine several carry bits, up to $t^{(c)}_{j+q}$, with $\psi(a)$ (Figure 4). Algorithm 2 summarizes the high-radix carry-save modular multiplication method proposed in this paper.
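The reduction table of the 32-bit modulus used at the start of this subsection can be recomputed directly from the definition $\psi(a) = a \cdot 2^{\beta} \bmod M$. The sketch below assumes $\beta = n = 32$, which is the choice consistent with the statement that $\psi(a)$ is a 10-bit word for $a \in \{0, \ldots, 4\}$; the function name `reduction_table` is hypothetical.

```python
def reduction_table(m, beta, max_a):
    """psi(a) = a * 2**beta mod m for a = 0 .. max_a."""
    return [(a << beta) % m for a in range(max_a + 1)]


M = 4294967111                      # equals 2**32 - 185
table = reduction_table(M, beta=32, max_a=4)  # beta = 32 is an assumption
for a, psi in enumerate(table):
    print(a, psi, format(psi, "010b"))
# 0 0 0000000000
# 1 185 0010111001
# 2 370 0101110010
# 3 555 1000101011
# 4 740 1011100100
```

Because $2^{32} \equiv 185 \pmod{M}$, every entry is a small multiple of 185, which is why ten bits suffice for this modulus.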
IV. IMPLEMENTATION RESULTS
In order to compare our algorithm against the one proposed by Peeters et al. [6], which is to our knowledge the best previously published modular multiplier based on Horner's rule, we generated 250 prime numbers of 64, 128, 192, and 256 bits. In all experiments, the maximum width of a sum word was 8 bits. We selected moduli for which the table $\Psi_M$ contains almost only 1s. Such numbers are considered worst-case moduli because they require two levels of logic (a CRA and AND gates) to combine the carry bits and the table (see Figure 4). For such moduli, our approach reduces the area by 35 to 40% (Figure 5) at the price of a slightly longer critical path (Figure 6). For moduli whose table $\Psi_M$ is a sparse matrix, our first experiments indicate that our method roughly halves the number of slices of an iteration stage without increasing the critical path. However, a larger set of moduli must be considered to confirm these results.
Algorithm 2 High-radix carry-save modulo M multiplication.
for j in 1 to k − 1 do
V. CONCLUSION
In this paper, we have proposed a high-radix carry-save FPGA implementation of modular multiplication. Our results indicate that this approach significantly reduces the area of the iteration stage compared to previously published solutions, while only slightly increasing the critical path for some moduli.
Our algorithm returns a high-radix carry-save number $P[0]$, which we have to convert to standard binary representation. This operation requires a modular adder whose architecture depends on the modulus $M$. The second addition of the iteration stage involves the $n$ least significant sum bits of $T[0]$ and $(k-1)$ carry bits combined with $\psi(a)$. Therefore, (ψ(a)).
We can use this equation to find an integer $q$ such that $P[0] < qM$. The conversion is however expensive if $q \geq 2$. We plan to design an algorithm which guarantees that
