Introduction: Modular multiplication is a key step in many cryptographic applications. It is also important for efficiently implementing residue number system (RNS) arithmetic. Various solutions have been suggested for implementing the modular multiplication $\langle XY \rangle_M = XY \bmod M$, where $M$ is an $n$-bit number, among them the use of Montgomery's multiplication algorithm [1] and the interleaving of the multiplication with the residue computation. This Letter is devoted to the second class of methods. More precisely, we are interested in multiplication algorithms that derive from Horner's algorithm:
where r is the number of digits of X. These algorithms (in radix 2) use a recurrence of the form:
where $Q[r] = 0$. Such recurrences can easily be modified to allow the computation of $\langle XY + W \rangle_M$. From (1), we get:
If we add some 'redundancy' to the representation of the residue classes modulo $M$ (i.e. if $Q[r-i]$ belongs to a set of more than $M$ elements), then determining which small multiple of $M$ must be added to or subtracted from $2Q[r-i-1] + x_{r-i}Y$ to keep a bounded value only requires examining a few most significant bits of 2Q[r
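The interleaving idea can be illustrated in software with a minimal sketch. It assumes the textbook radix-2 Horner recurrence with a non-redundant residue ($0 \le Q < M$), not the redundant iteration studied in this Letter, and it assumes that $W$ is simply added after the last step. The function name `horner_modmul_add` and its parameters are hypothetical.

```python
def horner_modmul_add(x: int, y: int, m: int, w: int = 0) -> int:
    """Return (x*y + w) mod m, interleaving shift-add steps with reduction.

    Illustrative sketch only: the bits of x are scanned from the most
    significant one, and the running value is kept in [0, m) by subtracting
    m after every step (assumed scheme, not the Letter's hardware iteration).
    """
    if m <= 0:
        raise ValueError("modulus must be positive")
    if x < 0 or w < 0:
        raise ValueError("x and w must be non-negative")

    y %= m          # ensure 0 <= y < m so that each step stays below 3*m
    q = 0           # Q[r] = 0
    for i in range(x.bit_length() - 1, -1, -1):
        x_i = (x >> i) & 1
        t = 2 * q + x_i * y      # t <= 2(m-1) + (m-1) < 3*m
        while t >= m:            # hence at most two subtractions of m
            t -= m
        q = t

    # Assumed handling of the addend: add w once, after the last iteration.
    return (q + w) % m
```

Because $Q < M$ and $Y < M$, the value $2Q + x_i Y$ is always smaller than $3M$, so the correction at each step is a subtraction of $0$, $M$ or $2M$. A hardware operator selects this multiple from a few leading bits instead of performing full comparisons.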
Depending on the target technology, the $Q[j]$'s can be represented in redundant (e.g. carry-save) form (typically, for an ASIC implementation), or in conventional non-redundant binary form for an FPGA implementation. Different variants have been suggested:
(i) Koç and Hung [2] implement (1) in radix 2 with carry-save adders. They perform up to three subtractions at each step to keep the $Q[j]$'s bounded. The operands are in non-redundant form, and the product is obtained in redundant form. To perform modular exponentiation with their method, one needs to insert conversions that make a pipeline implementation inefficient.
(ii) Takagi and Yajima [3] suggest a radix-2 and a radix-4 implementation of the recurrence. The $Q[j]$'s are represented in a redundant signed digit system.
(iii) Jeong and Burleson [4] suggest two radix-2 carry-save implementations of a recurrence, later on improved by Kim and Sobelman [5].
We start from a non-redundant version of Kim and Sobelman's recurrence. We deal with the FPGA implementation of modular multiply-and-accumulate operations. FPGAs are arrays of logic cells (CLBs). Our main target is the Xilinx Virtex-E family of FPGAs. On such circuits, very efficient ripple carry adders (RCAs) are available, so that using redundant addition is no longer worthwhile. The proofs of the theorems, along with a more detailed presentation of our results, can be found in [6].
Radix-2 modular multiplication-addition: Based on Kim and Sobelman's work [5], the first modulo $M$ multiplication-addition algorithm studied in this Letter consists in computing:
If M is known at design time, the function j can be stored in a small table, the size of which depends only on the maximal value of
is an $(n+1)$-bit number, the range of which is: i M , and a modulo $M$ correction is needed between two consecutive steps. We therefore define a modified iteration:
where P 
The maximal values of T 
We deduce from Theorem 2 that the computation of each bit of c requires the three most significant bits of $T'[r-i]$. If the operator handles a single modulus known at design time, it is again possible to implement an iteration stage in a single CLB column on a Virtex-E device.
Until now, we have assumed that M was a constant and that the values of j or c were pre-computed and included in the VHDL or Verilog code. It is however possible to build the table on-the-fly by means of a subtracter and an adder, and to store the values of j or c in, respectively, three or six registers.
The functions j and c can be computed recursively on N* as follows:
Consider the first algorithm described in this Letter and assume that j(1) is determined in a preprocessing step; j(2)

Table 1 summarises some place-and-route results. Experiment 1 involves an iteration stage which performs a modulo $M$ addition according to (2). At the price of a larger area, we shorten the critical path by computing j on-the-fly and implementing the algorithm defined by (3) (experiment 2). When the modulus is a constant, we take advantage of the architecture illustrated by Fig. 1 and reduce both area and delay (experiment 3). Our results show that fast modular multiplication can be achieved on an FPGA, and that the modulus can be an input to the operator: there is no need to know it at design time.
