Automatic Generation of Modular Multipliers for FPGA Applications by Muller, Jean-Michel & Beuchat, Jean-Luc
Automatic Generation of Modular Multipliers for FPGA
Applications
Jean-Michel Muller, Jean-Luc Beuchat
To cite this version:
Jean-Michel Muller, Jean-Luc Beuchat. Automatic Generation of Modular Multipliers for
FPGA Applications. IEEE Transactions on Computers, Institute of Electrical and Electronics
Engineers, 2008, 57 (12), pp.1600-1613. <10.1109/TC.2008.102>. <ensl-00122716v3>
HAL Id: ensl-00122716
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00122716v3
Submitted on 24 Nov 2008
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
Automatic Generation of Modular Multipliers
for FPGA Applications
Jean-Luc Beuchat and Jean-Michel Muller, Senior Member, IEEE
Abstract—Since redundant number systems allow for constant time addition, they are often at the heart of modular multipliers
designed for public-key cryptography (PKC) applications. Indeed, PKC involves large operands (160 to 1,024 bits), and several
researchers proposed carry-save or borrow-save algorithms. However, these number systems do not take advantage of the
dedicated carry logic available in modern Field-Programmable Gate Arrays (FPGAs). To overcome this problem, we suggest to
perform modular multiplication in a high-radix carry-save number system, where a sum bit of the carry-save representation is replaced
by a sum word. Two digits are then added by means of a small Carry-Ripple Adder (CRA). Furthermore, we propose an algorithm
that selects the best high-radix carry-save representation for a given modulus and generates a synthesizable VHDL description of
the operator.
Index Terms—Modular multiplication, high-radix carry-save number system, FPGA.
Ç
1 INTRODUCTION
THIS paper is devoted to the study of modular multi-plication of large operands on Field-Programmable
Gate Arrays (FPGAs). This operation is crucial in many
public-key cryptosystems (e.g., elliptic curve cryptography,
XTR, and RSA), and various solutions have already been
investigated. Since iterative algorithms offer a good trade-
off between calculation time and circuit area, they have
received considerable attention. Least-significant-digit-first
schemes are often based on Montgomery’s algorithm [1].
However, that approach requires preprocessing and post-
processing and is of interest when a large amount of
consecutive modular multiplications is required (e.g.,
modular exponentiation). In this paper, we will consider a
most-significant-digit-first scheme.
1.1 Horner’s Rule-Based Modular Multiplication
In order to compute hXY iM ¼ XY modM, where M is an
n-bit integer such that 2n1 < M < 2n, our algorithm is
described by an iterative procedure based on the celebrated
Horner’s rule:
hXY iM ¼ . . . ðxr1Y Þ2þ xr2Yð Þ2þ . . .ð Þ2þ x0Yh iM;
where X ¼ xr1xr2 . . .x1x0 is an unsigned r-bit integer,
and Y is an n-bit integer belonging to f0; . . . ;M  1g. This
equation can be expressed recursively as follows (we
perform a modulo M reduction at each step in order to
keep an n-bit intermediate result):
T ½i ¼ 2Q½iþ 1 þ xiY ;
Q½i ¼ hT ½iiM;
ð1Þ
where Q½r ¼ 0, and Q½0 ¼ hXY iM . Since Q½iþ 1 and Y are
less than or equal to M  1, T ½i < 3M, and (1) is
implemented by means of a left shift, an addition, a
comparator, and up to two subtractions to perform the
modulo M reduction [2].
Since public-key cryptography involves large integers,
operands are often represented in the carry-save number
system, which enables addition in constant time (see, for
instance, [3]). However, due to the redundancy of this
representation, comparison requires a conversion in a
nonredundant number system. This operation involves carry
propagations, thus losing the main advantage of the carry-
save representation. Several improvements of the algorithm
sketched by (1) have therefore been investigated to avoid
comparisons. They are based on the following observation:
computing a number P ½i congruent to Q½i moduloM only
requires inspecting a few most significant digits of T ½i. In
order to avoid an expensive final modular reduction, P ½i
should be less than a very small multiple ofM. The iteration
described in this paper returns, for instance, hXY iM or
hXY iM þM and requires at most one subtraction to get the
final result.
Koc¸ and Hung [4], [5] proposed, for instance, a carry-
save algorithm based on a sign estimation technique. At
each step, M, 0, or M is added to T ½i according to a
few most significant digits of Q½iþ 1. Takagi and
Yajima [6], [7] applied a similar technique to design
signed-digit-based architectures. When the modulus M is
known at design time, which is often the case in public-
key cryptography, another approach consists of building
IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 12, DECEMBER 2008 1
. J.-L. Beuchat is with the Graduate School of Systems and Information
Engineering, Laboratory of Cryptography and Information Security,
University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan.
E-mail: beuchat@risk.tsukuba.ac.jp.
. J.-M. Muller is with CNRS, projet INRIA Are´naire, Universite´ de Lyon,
Laboratoire LIP, ENS Lyon, 46 alle´e d’Italie, 69364 Lyon Ce´dex 07, France.
E-mail: Jean-Michel.Muller@ens-lyon.fr.
Manuscript received 1 Jan. 2007; revised 28 Apr. 2008; accepted 19 May
2008; published online 27 June 2008.
Recommended for acceptance by M. Schulte.
For information on obtaining reprints of this article, please send e-mail to:
tc@computer.org, and reference IEEECS Log Number TC-0003-0107.
Digital Object Identifier no. 10.1109/TC.2008.102.
0018-9340/08/$25.00  2008 IEEE Published by the IEEE Computer Society
a table  ðaÞ ¼ ha  2iM and in defining the following
iteration:
T ½i ¼ 2P ½iþ 1 þ xiY ; ð2Þ
P ½i ¼  T ½i div 2 þ T ½ih i2 ; ð3Þ
where P ½r ¼ 0, and  is generally chosen to be equal to n or
n 1. Since  ðT ½i div 2Þ is an n-bit number, P ½i and T ½i
are respectively ðnþ 1Þ- and ðnþ 3Þ-bit numbers. Therefore,
the algorithm sketched by the above equations requires a
small table. Carry-save implementations of (2) and (3) have,
for example, been proposed by Jeong and Burleson [8],
Kim and Sobelman [9], and Peeters et al. [10]. Since these
algorithms depend on the modulus M, they seem likely
candidates for cryptographic hardware based on FPGAs: the
reconfigurability of these devices allows one to optimize the
architecture according to some parameters (e.g., the mod-
ulus) and to modify the hardware whenever they change.
1.2 FPGA-Specific Issues
Modern FPGAs are mainly designed for digital signal
processing applications involving rather small operands
(16 to 32 bits). FPGA manufacturers chose to include
dedicated carry logic enabling the implementation of fast
carry-ripple adders (CRAs) for such operand sizes. Let us
study, for example, the architecture of a Spartan-3 device.
Fig. 1 describes the simplified architecture of a slice, which
is the main logic resource for implementing synchronous
and combinatorial circuits. Each slice embeds two four-
input function generators (G-LUT and F-LUT), two storage
elements (i.e., flip-flops FFY and FFX), carry logic (CYSELG,
CYMUXG, CYSELF, CYMUXF, and CYINIT), arithmetic
gates (GAND, FAND, XORG, and XORF), and wide-
function multiplexers. Each function generator is imple-
mented by means of a programmable lookup table (LUT).
Recall that a full-adder (FA) cell has two bits xi and yi, as
well as a carry-in bit cin, as inputs and computes a sum bit si
and a carry-out bit cout such that 2cout þ si ¼ xi þ yi þ cin.
Let hi ¼ xi  yi. Then, we have
si ¼ hi  cin; ð4Þ
cout ¼ xi; if hi ¼ 0 ði:e:; xi ¼ yiÞ;cin; otherwise:

ð5Þ
Assume that the F-LUT function generator computes hi.
Then, the XORF gate implements (4), whereas (5) involves
three multiplexers (CYOF, CYSELF, and CYMUXF). si is
either sent to another slice (output X) or stored in a flip-flop
(FFX). The G-LUT function generator allows one to imple-
ment a second FA cell within the same slice, which thus
embeds a 2-bit CRA. Unfortunately, a conventional carry-
save adder (CSA) requires twice as many hardware
resources: since each slice has a single-input carry CIN, it is
impossible to implement two FA cells with independent carry-
in bits. Therefore, hardware design tools allocate two
function generators when they are provided with a VHDL
description of (4) and (5). It is of course possible to write a
VHDL code that explicitly instantiates F-LUT, XORF, and
CYMUXF. In this case, note that G-LUT can only be used to
implement the control unit or the  ð:Þ table of (3). According
to experiment results, G-LUT often remains unused. Though
reducing the number of LUTs of the design, taking advantage
of dedicated logic to describe a CSA leads to a larger operator
(Table 1). Similar problems arise, for instance, on all Virtex
devices (Xilinx) andCyclone II FPGAs (Altera). It is therefore
interesting to investigate modular multiplication algorithms
based on FPGA-friendly number systems.
1.3 Our Contribution
We proposed a family of radix-2 algorithms designed for
FPGAs embedding four-input LUTs and dedicated carry
logic in [11]. Table 2 compares our iteration stage against
2 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 12, DECEMBER 2008
Fig. 1. Simplified diagram of a slice of a Spartan-3 FPGA.
TABLE 1
Area and Number of LUTs of Three Carry-Save Iteration Stages
(Spartan-3 FPGA)
TABLE 2
Area and Delay of Carry-Save and Radix-2 Iteration Stages
(Spartan-3 FPGA)
three carry-save schemes. Since these results do not include
the conversion from carry-save to unsigned integer that
occurs at the end of each multiplication, both the area and
delay of carry-save operators are underestimated. Accord-
ing to this experiment, our previous approach is efficient for
moduli up to 32 bits. Thus, the aim of this paper is to extend
our work to larger moduli. In order to benefit from
dedicated carry logic available in almost all FPGA families,
we suggest to choose a high-radix carry-save number
system, where each sum bit of the carry-save representation
is replaced by a sum word (Section 2). Such a number system
allows for the design of a modular multiplication algorithm
based on small CRAs (Section 3). The main originality of
our approach is to analyze the modulusM in order to select
the most efficient high-radix carry-save representation and
to automatically generate the VHDL description of the
operator (Section 4). Experimental results validate the
efficiency of the proposed modular multiplication scheme
(Section 5). We proposed a preliminary version of this work
based on a different iteration in [12].
2 HIGH-RADIX CARRY-SAVE NUMBERS
A k-digit high-radix carry-save number X is denoted by
X ¼ ðxk1; . . . ; x0Þ ¼ xðcÞk1; xðsÞk1
 
; . . . ; x
ðcÞ
0 ; x
ðsÞ
0
  
;
where the jth digit xj consists of an nj-bit sum word x
ðsÞ
j
and a carry bit x
ðcÞ
j such that xj ¼ xðsÞj þ xðcÞj 2nj . According to
this definition, we have
X ¼x0 þ x12n0 þ x22n0þn1 þ    þ xk12n0þþnk2
¼xðsÞ0 þ
Xk2
i¼0
x
ðcÞ
i þ xðsÞiþ1
 
2
Pi
j¼0 ni þ xðcÞk12
Pk1
j¼0 ni :
Let us define
XðsÞ ¼ xðsÞ0 þ xðsÞ1 2n0 þ    þ xðsÞk12n0þþnk2 ;
and
XðcÞ ¼ xðcÞ0 2n0 þ xðcÞ1 2n0þn1 þ    þ xðcÞk12n0þþnk1 :
With this notation, we have X ¼ XðsÞ þXðcÞ. Such a number
system has nice properties to deal with large numbers on
FPGA:
. Its redundancy allows one to perform addition in
constant time (the critical path only depends on
max0jk1 nj).
. The addition of a sum word xðsÞj , a carry bit x
ðcÞ
j1, and
an nj-bit unsigned binary number can be performed
by a CRA.
A key observation is that all sum words do not need to have
the same width. This peculiarity will allow us to select a
number system according to the modulus to optimize our
operators (Section 4). In the following, we will assume that
the carry bit of the most significant digit is always equal to
zero (the weight of the most significant carry bit is therefore
equal to 2n0þn1þþnk2 ).
Example 1. Fig. 2a describes a four-digit high-radix carry-
save number with n0 ¼ n1 ¼ n2 ¼ 4 and n3 ¼ 3. Accord-
ing to the first definition of this number system, we have
X ¼xðsÞ0 þ xðcÞ0 þ xðsÞ1
 
 24 þ xðcÞ1 þ xðsÞ2
 
 28
þ xðcÞ2 þ xðsÞ3
 
 212
¼ 10þ ð1þ 4Þ  24 þ ð0þ 3Þ  28 þ ð1þ 7Þ  212
¼ 33;626:
We can also compute
XðsÞ ¼ 10þ 4  24 þ 3  28 þ 7  212 ¼ 29;514;
and
XðcÞ ¼ 1  24 þ 0  28 þ 1  212 ¼ 4;112:
We obtain X ¼ XðsÞ þXðcÞ ¼ 33;626.
Consider the modular multiplication described by (2)
and (3) and assume that both T ½i and P ½i are high-radix
carry-save numbers. Each equation involves now the
addition of a high-radix carry-save number and an unsigned
integer (a partial product xiY or a number  ðT ½i div 2Þ
stored in the table). Fig. 3 describes how to perform these
operations: the integer operand is split into k words of
BEUCHAT AND MULLER: AUTOMATIC GENERATION OF MODULAR MULTIPLIERS FOR FPGA APPLICATIONS 3
Fig. 2. High-radix carry-save numbers. (a) Encoding of the number X ¼ 33;626. (b) Encoding of Z ¼ 2X.
Fig. 3. Addition of an unsigned binary number and a high-radix carry-
save number.
respective lengths n0; . . . ; nk1; then, each of these words is
added to a sum word and a carry bit by means of an nj-bit
CRA. The high-radix carry-save encoding has unfortunately
a drawback in the sense that shifting an operand modifies its
representation. The following example illustrates this
problem, which occurs in the computation of T ½i (2).
Example 2. Let us consider again the number X ¼ 33;626,
whose format is defined in Fig. 2a. By shifting X, we
obtain Z ¼ 2X ¼ 67;252 (Fig. 2b). However, the least
significant sum word is now a 5-bit number, and
Z ¼ z0 þ z12n0þ1 þ z22n0þn1þ1 þ z32n0þn1þn2þ1
¼ 20þ ð1þ 4Þ  25 þ ð0þ 3Þ  29 þ ð1þ 7Þ  213
¼ 67;252:
According to this example, P ½iþ 1 and P ½i do not
have the same encoding, and the width of the CRA
dealing with the least significant digit of P would increase
at each step. The solution consists of converting T ½i to the
format of P ½i. In the following, we describe a modular
multiplication algorithm that minimizes the hardware
overhead induced by this conversion.
3 HIGH-RADIX CARRY-SAVE MODULAR
MULTIPLICATION
This section describes how to take advantage of a high-
radix carry-save number system to perform a modular
multiplication. We assume the following:
. The modulus M is an n-bit number belonging to
f2n1 þ 1; . . . ; 2n  1g.
. X is an r-bit unsigned integer.
. Y is an unsigned integer smaller than M.
.  is a small integer parameter that will determine the
size of the table required for the modulo M
reduction.
. The most significant sum word of P ½i contains at
least nk1 ¼ 5 bits if  ¼ 1 and nk1 ¼ 6 bits if  ¼ 2.
These hypotheses guarantee that
– P ðcÞ½i is smaller than 2n, i.e. hP ðcÞ½ii2n ¼
P ðcÞ½i (we fixed the carry bit of the most
significant digit of a high-radix carry-save
number to zero in Section 2), and
– P ðsÞ½i is an ðnþ 2Þ-bit number (a proof is given
in Appendix A).
. At each iteration, we compute a high-radix carry-
save number P ½i congruent to 2P ½iþ 1 þ xiY
modulo M.
According to our hypotheses, we have
P ½iþ 1 ¼ P ðsÞ½iþ 1 div 2n
 
 2n
þ P ðsÞ½iþ 1
D E
2n
þ P ðcÞ½iþ 1
D E
2n
¼ P ðsÞ½iþ 1 div 2n
 
 2n
þ P ðsÞ½iþ 1
D E
2n
þP ðcÞ½iþ 1:
The iteration of our algorithm is slightly different from the
one described in Section 1. Let us define ½iþ 1 ¼
P ðsÞ½iþ 1 div 2n and write the intermediate result at
step iþ 1 as follows:
P ½iþ 1 ¼ ½iþ 12n þ P ðsÞ½iþ 1
D E
2n
þP ðcÞ½iþ 1:
It is worth noticing that according to our hypotheses,
½iþ 1 is a 3- or 4-bit unsigned number for  ¼ 1
or  ¼ 2, respectively. Thus, a small table addressed by
½iþ 1 allows one to efficiently compute a number
congruent to P ½iþ 1 modulo M:
P ½iþ 1  ½iþ 12nh iM þ P ðsÞ½iþ 1
D E
2n
þ P ðcÞ½iþ 1 ðmodMÞ:
Note that when  ¼ 1, the table can be stored within the
LUTs of a CRA on Spartan-3 and Virtex FPGAs [11]. Since
we compute a high-radix carry-save number congruent
to XY modulo M, a conversion and a final modulo M
reduction are mandatory. In order to keep the hardware
overhead as small as possible, we apply a trick proposed by
Peeters et al. [10] in the case of a carry-save implementation.
At each step, our algorithm computes
P ½i ¼ xiY þ 2 ½iþ 12nh iM
þ 2 P ðsÞ½iþ 1
D E
2n
þP ðcÞ½iþ 1
 
:
According to this equation, P ½i is always even when
xiY ¼ 0. Thus, by performing an additional step with
x1 ¼ 0, we obtain an even number P ½1 congruent to
2XY moduloM. Note that P ½1=2 is smaller than or equal
to P ½0 and easy to compute (a right shift of one position
involves only wiring). Furthermore, performing the final
modulo M correction with P ½1=2 requires fewer hard-
ware resources. Let us define
 max ¼
max
0j<24
hj  2n2iM; if  ¼ 2 and nk1  5;
max
0j<23
hj  2n1iM; if  ¼ 1 and nk1  6:
8<
:
P ½1=2 is a high-radix carry-save number equal to hXY iM
or hXY iM þM, and the final modulo M reduction requires
at most one subtraction in the following cases:1
.  ¼ 2 and nk1  5.
.  ¼ 1, nk1  6, and
2n1  1þ 2n0 þ    þ 2n0þþnk2 þ  max
2
< M:
A proof of correctness of this modular multiplication
scheme, summarized by Algorithm 1, is provided in
Appendix A. At each iteration, a new intermediate result
P ½i is computed in two steps. Let ~P ½iþ 1 be a high-radix
carry-save number such that ~P ðsÞ½iþ 1 ¼ hP ðsÞ½iþ 1i2n
and ~P ðcÞ½iþ 1 ¼ P ðcÞ½iþ 1. We first carry out the sum
4 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 12, DECEMBER 2008
1. Note that the algorithm described in our preliminary work [12] does
not satisfy this property, and the final modular reduction depends on the
high-radix number system and the modulus M.
of the partial product xiY and 2 ~P ½iþ 1 by means of
small CRAs:
T ½i ¼ 2 ~P ½iþ 1 þ xiY :
By shifting the high-radix carry-save number P ½iþ 1, we
define a new internal representation for T ½i (Section 2). The
second step consists of adding 2h½iþ 1  2niM to T ½i and
converting the result to the format of P ½iþ 1.
Algorithm 1. High-radix carry-save modulo M
multiplication
Input: An n-bit modulus M such that 2n1 < M < 2n, an
r-bit number X 2 IN, Y 2 f0; . . . ;M  1g, and a
parameter  2 f1; 2g. P ½i and T ½i are high-radix
carry-save numbers.
Output: P ¼ hXY iM .
1: P ½r  0;
2: x1  0;
3: for i in r 1 downto 1 do
4: ½iþ 1  P ðsÞ½iþ 1 div 2n;
5: T ½i  2ðhP ðsÞ½iþ 1i2n þ P ðcÞ½iþ 1Þ þ xiY ;
6: P ½i  T ½i þ 2h½iþ 1  2niM ;
7: end for
8: P  P ½1=2;
9: if P > M then
10: P  P M;
11: end if
The main difficulty of the implementation arises from the
left shift of the carry bits P ðcÞ½iþ 1. Since T ½i has a different
encoding, it is necessary to perform a conversion. We
suggest to compute a high-radix carry-save number U ½i
that has the same encoding as P ½iþ 1, and is equal to the
sum of the carry bits of T ½i and the output of the table (i.e.
2h½iþ 1  2niM ). Therefore, we perform the following
operations at each iteration of Algorithm 1:
T ½i  2 ~P ½iþ 1 þ xiY ;
U ½i  2 ½iþ 1  2nh iMþT ðcÞ½i;
P ½i  U ½i þ T ðsÞ½i:
ð6Þ
Example 3. Let n ¼ 16, k ¼ 4, and n0 ¼ n1 ¼ n2 ¼ n3 ¼ 4.
The high-radix carry-save number T ½i contains three
carry bits of respective weights 25, 29, and 213 (recall the
constraint introduced in Section 2: the carry bit of the
most significant digit is always equal to zero). We split
2h½iþ 1  2niM into four 4-bit words and perform
three additions to compute U ½i (Fig. 4).
4 CHOICE OF A HIGH-RADIX CARRY-SAVE NUMBER
SYSTEM
Let us represent the table involved in the modulo M
correction by a matrix . Each line   of  stores an n-bit
number h½iþ 1  2niM . In the following, we will have to
consider a subset of consecutive columns of . Let ðjþh:jÞ
be the matrix constituted by columns j to jþ h of . Each
line of ðjþh:jÞ contains an ðhþ 1Þ-bit number  ðjþh:jÞ .
Example 4. Let us consider the 16-bit modulus M ¼ 54;107
and assume that  ¼ 1. We have
¼
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1
1 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1
0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0
0 0 0 0 0 1 0 1 1 1 1 0 1 1 1 1
1 0 0 0 0 1 0 1 1 1 1 0 1 1 1 1
0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0
2
66666666664
3
77777777775
:
According to our notation, we have, for instance
ð6:3Þ ¼  ð6:3Þ
 
¼
0 0 0 0
0 0 0 0
1 0 0 1
1 0 0 1
0 0 1 0
1 0 1 1
1 0 1 1
0 1 0 1
2
66666666664
3
77777777775
: ð7Þ
It is worth noticing that the amount of hardware
required to compute U½i depends on the modulus M and
the encoding of P ½i. For instance, if a column of  contains
only zeros, it can be replaced by a carry bit at no extra cost.
We propose an algorithm that selects a high-radix carry-
save number system minimizing the hardware overhead
introduced by the computation of U ½i (6). Assume that we
want to merge tðcÞw with the jth column and t
ðcÞ
wþ1 with the
ðjþ hÞth column of , and recall that the carry bits of T ½i
are left-shifted compared to those of P ½i and U½i (Fig. 5).
Therefore, we compute a digit uwþ1 such that
u
ðcÞ
wþ12
h þ uðsÞwþ1 ¼ 2

 ðjþh2:jÞ|ﬄﬄﬄﬄﬄ{zﬄﬄﬄﬄﬄ}
h1 bits
þtðcÞw

þ  ðj1:j1Þ : ð8Þ
This operation involves at most an ðh 1Þ-bit CRA.
However, it is unlikely that there are very long strings
of consecutive ones in matrix  (see below). Let us
denote by #ð ðjþh2:jÞ Þ the length of the longest string
of consecutive ones starting from the least significant bit
of  ðjþh2:jÞ . Then, the following cases may occur
according to ‘ ¼ max#ð ðjþh2:jÞ Þ:
BEUCHAT AND MULLER: AUTOMATIC GENERATION OF MODULAR MULTIPLIERS FOR FPGA APPLICATIONS 5
Fig. 4. Conversion of T ½i by merging its carry bits with
2h½iþ 1  2niM .
. If ‘ ¼ 0, the jth column of  contains only zeros and
can be replaced by tðcÞw at no extra cost. More
formally, we have (Fig. 6a):
u
ðsÞ
wþ1 ¼ 4 ðjþh2:jþ1Þ þ 2tðcÞw þ  ðj1:j1Þ ;
u
ðcÞ
wþ1 ¼ 0:
. If ‘ ¼ h 1, the addition requires an ðh 1Þ-bit
CRA, which generates an output carry bit u
ðcÞ
wþ1 (see
(8) and Fig. 6b). Since this bit will be added to a few
bits of T ðsÞ½i in order to compute pðsÞwþ2½i, we raise a
flag that indicates this carry propagation.
. When 0 < ‘ < h 1, an ‘-bit CRA computes the sum
and generates an output carry cout. If the ðjþ ‘Þth
column of  stores only zeros, we replace it by cout
(Fig. 6c). Otherwise, we need an OR gate to add cout
to the ðjþ ‘Þth column. Since we target FPGA
applications, a more efficient solution consists of
taking advantage of the dedicated carry logic to
perform this operation, and we add tðcÞw to  
ðjþ‘:jÞ
 by
means of an ð‘þ 1Þ-bit CRA (Fig. 6d). Note that uðcÞwþ1
is always equal to zero.
Let us try to estimate what values of ‘ we can expect in
practice. Ifwe consider that the bits inmatrix can be viewed
as “random” bits, with equal probability for zero as for one,
then the average value of ‘will be NðhÞ=2h1, where NðhÞ is
the sumof the lengths of the strings of ones that start from the
rightmost bit in all possible ðh 1Þ-bit strings. Since we
obviously have Nð1Þ ¼ 1 and NðhÞ ¼ 2Nðh 1Þ þ 1, we
immediately deduce that the average value of ‘ is 1 1=2h1
(i.e., less than one).
Example 5 (Example 4 continued). Assume that we want to
add carry bits to the third and eighth columns of  (i.e.,
j ¼ 3 and h ¼ 5). We have to consider the matrix ð6:3Þ
given by (7) and easily determine that ‘ ¼ 2. Thus, we
have to examine the third column of ð6:3Þ in order to
compute the width of the CRA. Since this column
contains a non-null element, we need an ð‘þ 1Þ-bit
CRA (see Fig. 6d).
Let us now build a directed acyclic graph as follows:
. Each node represents a column of the matrix .
. The weight of the edge between nodes j and jþ h is
given by the width of the CRA responsible for the
addition of  ðjþh2:jÞ and t
ðcÞ
w (8), as well as the flag
that indicates a carry propagation.
A shortest path of this graph defines the high-radix carry-
save representation minimizing the hardware overhead
introduced by the computation of U ½i for a given
modulus M. The algorithm requires two parameters to
control the size of the CRAs:
. Since we want to perform a modular multiplication
by means of small CRAs, we have to provide the
algorithm with a constraint on the maximal number
of positions wmax between two consecutive carry bits
(without this constraint, we would, for instance,
have an edge from the first node to the last node).
. The minimal distance between two consecutive
carries wmin should be greater than or equal to
two. It guarantees that the smallest building block is
a 2-bit CRA.
Algorithm 2 describes a way to build this graph. After
having computed , we have to determine to which
columns the most significant bit of T ½i can be added. We
denote by jmax the upper index. Recall that when  ¼ 2, we
assume that the most significant sum word of P ½i contains
at least nk1 ¼ 5 bits. Thus, we deduce that jmax ¼ n 2
6 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 12, DECEMBER 2008
Fig. 5. Cost of the addition of a carry bit (1). In this example, h is
equal to five.
Fig. 6. Cost of the addition of a carry bit (2). (a) Only zeros in the jth column of . (b) ‘ ¼ h 1. (c) ‘ < h 1 and the ðjþ ‘þ 1Þth column of 
contains only zeros. (d) ‘ < h 1.
(Fig. 7a). The most significant sum word of U½i is computed
as follows:
u
ðsÞ
k1 ¼ 2  ðn:jmaxÞ þ tðcÞk2
 
þ  ðjmax1:jmax1Þ :
When  ¼ 1, we have nk1  6 and jmax ¼ n 3 (Fig. 7b). It
is sometimes possible to relax the constraint on nk1: it
suffices that the addition of t
ðcÞ
k2 to  
ðn:jmaxÞ
 does not generate
an output carry (see the proof in Appendix A for details).
This condition is satisfied if the length of the longest string
of consecutive ones starting from the jmax column of  is
smaller than or equal to n jmax. We have to distinguish
three cases to build the graph:
. The first carry bit tðcÞ0 can be added to any column
whose index belongs to f2; . . . ; wmax þ 1g. We create
an edge of weight zero between the first node and
the nodes associated with such columns.
. The most significant carry bit tðcÞk2 belongs to the set
fn wmax þ 1; . . . ; jmaxg. Let h2fwmin; . . . ; wmax  1g.
There is therefore an edge between nodes j and n if
jþ h  n.
. There is an edge between nodes j and jþ h, where
h 2 fwmin; . . . ; wmaxg, if jþ h  jmax.
The next step consists of finding a shortest path in the
graph. In order to minimize the critical path, we suggest to
remove all edges whose carry propagation flag is set to one.
If there is no path between nodes 1 and n in this pruned
graph, we have to consider the full graph.
Algorithm 2. Selection of a high-radix carry-save number
system.
Input: An n-bit modulus M, wmax, and wmin such that
wmax  wmin  2. A parameter  2 f1; 2g.
Output: A directed acyclic graph.
1: Compute the matrix  according to ;
2: if  ¼ 2 then
3: jmax  n 2;
4: else
5: jmax  n 3;
6: j n 2
7: while #ð ðn:jÞ Þ  n j do
8: jmax  j;
9: j jþ 1;
10: end while
11: end if
12: for j ¼ 2 to wmax do
13: Create an edge of weight 0 between nodes 1 and j;
14: end for
15: for j ¼ 2 to jmax do
16: for h ¼ wmin to wmax do
17: if jþ h  n and h < wmax then
18: ‘ max#ð ðn:jÞ Þ;
19: Create an edge between nodes j and n;
BEUCHAT AND MULLER: AUTOMATIC GENERATION OF MODULAR MULTIPLIERS FOR FPGA APPLICATIONS 7
Fig. 7. Cost of the addition of a carry bit (3). (a) Computation ofU ½iwhen  ¼ 2 and jmax ¼ n 2. (b) Computation ofU½iwhen  ¼ 1 and jmax ¼ n 3.
20: Compute the weight of the edge (see Fig. 6);
21: h wmax þ 1 (exit the loop);
22: else if jþ h  jmax then
23: ‘ max#ð ðjþh2:jÞ Þ;
24: Create an edge between nodes j and jþ h;
25: Compute the weight of the edge (see Fig. 6);
26: end if
27: end for
28: end for
Example 6 (Example 4 continued). Let us applyAlgorithm2
to our 16-bit example. First, we note that adding a carry bit
to the 15th column of the matrix does not generate an
output carry, and we set jmax ¼ 15. Then, we build the
graph illustrated in Fig. 8 according to Algorithm 2. The c
flag on the edge between nodes j and jþ h indicates that
adding a carry bit to the jth column of  generates an
output carry bit u
ðcÞ
j . After having removed all edges
labeled with a c flag and nodes without a predecessor or
successor,we obtain a pruned graph (Fig. 9). Thus,P ½i is a
four-digitwordwithn0 ¼ n1 ¼ n2 ¼ 4andn3 ¼ 6 (Fig. 10).
Since n ¼ 16 and  max ¼ ð1010110010100101Þ2 ¼ 44;197,
we have
2n1  1þ 2n0 þ 2n0þn1 þ 2n0þn1þn2 þ  max
2
¼ 2
15 þ 24 þ 28 þ 212 þ 44;197
2
¼ 40;666 < M;
and  ¼ 1 is a valid choice. Note that if the above
equality is not satisfied, we have to build a new graph
with  ¼ 2. Recall that there is always a solution for
 ¼ 2 and nk1  5.
8 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 12, DECEMBER 2008
Fig. 8. Construction of a graph to select a high-radix carry-save representation for the 16-bit modulus M ¼ 54;107 (1).
Fig. 9. Construction of a graph to select a high-radix carry-save representation for the 16-bit modulusM ¼ 54;107 (2). Shaded nodes belong to the
shortest path.
Fig. 10. Choice of a high-radix carry-save representation for the 16-bit
modulus M ¼ 54;107 (3).
Once the high-radix carry-save representation is known,
the automatic generation of a VHDL description of the
modulo M multiplier is rather straightforward. The
computation of T ½i involves k CRAs of respective widths
n0 þ 1; n1; . . . ; nk  2, and n nk2      n0  1 (Fig. 7).
Each edge of the graph encodes the size of the CRA
determining a digit of U ½i, and the carry propagation
flag indicates whether a carry bit u
ðcÞ
j is necessary or not.
Finally, k CRAs of widths n0; . . . ; nk  1 allow one to add
T ðsÞ½i to U ½i.
5 IMPLEMENTATION RESULTS
In order to compare our algorithm against modular multi-
pliers published in the open literature, we wrote a VHDL
code generator whose inputs are a modulus M and wmax
(maximal number of positions between two consecutive
carry bits; see Section 4). Our tool returns a structural
VHDL description of a high-radix carry-save multiplier, as
well as scripts to automatically place and route the design
and collect area and timing information. This tool also
generates a VHDL description of two architectures pro-
posed by other researchers. The first one, described by
Peeters et al. in [10], is summarized by Algorithm 3.
Intermediate results are carry-save numbers denoted by
ðC½i; S½iÞ. At each step, a CRA computes the sum k of the
three most significant bits of C½iþ 1 and S½iþ 1. This 4-bit
word addresses a table storing hk  2n2iM , 0  k  15.
Thanks to an additional iteration with x1 ¼ 0, this
algorithm returns a carry-save number ðC½1; S½1Þ,
which is smaller than 2M. Since our multiplier satisfies
the same property, conversion in a nonredundant number
system is performed with the same operator.2 We will
therefore only consider iteration stages in our experiments
in order to compare high-radix carry-save and carry-save
number systems.
Algorithm 3. Peeters et al.’s modulo M multiplication [10].
Input: An n-bit modulus M such that 2n1 < M < 2n, an
r-bit number X 2 IN, and Y 2 f0; . . . ;M  1g. We
assume that x1 ¼ 0.
Output: P ¼ hXY iM .
1. C½r  0; S½r  0;
2. for i in r 1 downto 1 do
3. k C½iþ 1 div 2n2 þ S½iþ 1 div 2n2;
4. T ½i  xiY þ 2hk  2n2iM ;
5. ðC½i; S½iÞ  2ðhC½iþ 1i2n2 þ hS½iþ 1i2n2Þ þ T ½i;
6. end for
7. P  ðS½1 þ C½1Þ=2;
8. if P > M then
9. P  P M;
10. end if
Amanor et al. introduced a carry-save architecture
optimized for modular multiplication on FPGAs in [13].
The authors assume that both M and Y are known at
design time. This hypothesis allows for the design of an
iteration stage embedding a single CSA and a table
addressed by the most significant bit of xi, C½iþ 1, and
S½iþ 1 (Algorithm 4). They show that the sum of the most
significant bits of C½iþ 1 and S½iþ 1 is always a 2-bit
number. Therefore, the table only contains eight values.
Unfortunately, the authors did not address the final
conversion issue. However, since C½iþ 1 div 2n1 þ
S½iþ 1 div 2n1  3, we deduce that
C½i þ S½i ¼ C½iþ 1 div 2n1 þ S½iþ 1 div 2n1   2n1
þ C½iþ 1h i2n1þ S½iþ 1h i2n1
 3  2n1 þ 2  ð2n1  1Þ ¼ 2nþ1 þ 2n1  2:
SinceM belongs to f2n1 þ 1; . . . ; 2n  1g, C½i þ S½imay be
greater than 2M, and Algorithm 4 requires more hardware
resources than our algorithm or Peeters et al.’s scheme to
perform a conversion.
Algorithm 4. Amanor et al.’s moduloM multiplication [13].
Input: An n-bit modulus M such that 2n1 < M < 2n, an
r-bit number X 2 IN, and Y 2 IN.
Output: P ¼ hXY iM .
1. C½r  0; S½r  0;
2. for i in r 1 downto 0 do
3. k 2ðC½iþ 1 div 2n1 þ S½iþ 1 div 2n1Þ;
4. T ½i  hk  2n1 þ xiY iM ;
5. ðC½i; S½iÞ  2ðhC½iþ 1i2n1 þ hS½iþ 1i2n1Þ þ T ½i;
6. end for
7. P  hC½0 þ S½0iM ;
Fig. 11 describes place-and-route results on a Xilinx
Spartan-3 FPGA. In these experiments, the moduli are
256-bit randomly generated primes. Compared against
Algorithm 3, we observe the following:
. Our high-radix carry-save architecture allows us to
significantly reduce the number of slices while only
slightly augmenting the critical path. At the price of
a longer critical path, we are able to further diminish
the area by increasing the parameter wmax. Note that
conversion from (high-radix) carry-save to unsigned
binary integer is usually based on pipelined CRAs
(see, for instance, [3]). Depending on the trade-off
between area and delay, this operator can be slower
than an iteration stage based on (high-radix) carry-
save arithmetic.
. The area of our operator is less sensitive to the choice
of M. This is mainly related to the architecture of
Xilinx FPGAs: in most cases,  ¼ 1, and each
operator embeds a table addressed by 3 bits. Since
our target FPGA embeds four-input LUTs, this table
is embedded within the slices of the adder comput-
ing P ½i [11]. Since 4 bits address the table of
Algorithm 3, additional LUTs are requested. Their
amount depends on the modulus M: if  M contains
null or identical columns, synthesis tools are able to
simplify the design.
For the moduli considered in these experiments, high-
radix carry-save multipliers have roughly the same area
as the operator proposed by Amanor et al. in [13]. Recall
that a CSA requires twice the number of slices of a CRA
BEUCHAT AND MULLER: AUTOMATIC GENERATION OF MODULAR MULTIPLIERS FOR FPGA APPLICATIONS 9
2. Our approach further reduces the wiring since high-radix carry-save
numbers involve less carry bits than carry-save numbers.
10 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 12, DECEMBER 2008
Fig. 11. Area and delay of modular multipliers on a Spartan-3 FPGA. Fifty prime moduli were randomly generated for each experiment. (a) Area and
delay comparisons for n ¼ 256 and wmax ¼ 8. (b) Area and delay comparisons for n ¼ 256 and wmax ¼ 16. (c) Area and delay comparisons for
n ¼ 256 and wmax ¼ 24. (d) Area and delay comparisons for n ¼ 256 and wmax ¼ 32.
on our target FPGA family. Since the moduli involved in
these experiments require only 3 bits to perform a
modulo M reduction, our architecture is mainly based
on two CRAs. High-radix carry-save representations
enable here the design of a more versatile modular
multiplier (both X and Y are inputs) with the same
number of slices.
Table 3 summarizes further results obtained with a
Spartan-3 FPGA. We generated 100 prime moduli for
each experiment and reported the intervals in which lie
the area and delay ratios between our proposal and
Algorithms 3 and 4. These experiments indicate that our
approach always allows one to select a prime number
that reduces the circuit area without increasing the
critical path.
Table 4 digests experiment results involving an Altera
Cyclone II FPGA. Fig. 12 describes a Logic Element (LE),
which is the smallest unit of configurable logic in the
Cyclone II architecture. Each LE includes a four-input
LUT, a storage element, and dedicated carry logic and
operates in normal mode or arithmetic mode. CRAs are
based on LEs in arithmetic mode, in which the LUT
implements two three-input function generators. It is
therefore impossible to store  M within LUTs of a CRA.
This explains why our algorithm leads to slightly smaller
BEUCHAT AND MULLER: AUTOMATIC GENERATION OF MODULAR MULTIPLIERS FOR FPGA APPLICATIONS 11
TABLE 3
Area and Delay Ratios between Our Proposal and Algorithms 3 and 4 on a Spartan-3 FPGA
One hundred prime moduli were randomly generated for each experiment. We report the intervals in which lie the area and delay ratios.
TABLE 4
Area and Delay Ratios between Our Proposal and Algorithms 3 and 4 on a Cyclone II FPGA
One hundred prime moduli were randomly generated for each experiment. We report the intervals in which lie the area and delay ratios.
Fig. 12. Simplified diagram of an LE in arithmetic mode (Cyclone II
family).
area savings for this FPGA family. On Cyclone-II devices,
CSA operators are significantly faster; however, conver-
sion to a nonredundant number system involves pipe-
lined CRAs. If this operator is based on 32-bit blocks, our
high-radix carry-save iteration stage has a slower critical
path. In this case, our approach leads to smaller modular
multipliers than CSA schemes, without impacting on
computation time.
6 CONCLUSION
We proposed an algorithm to automatically generate
VHDL descriptions of modular multipliers for FPGAs.
The main originality of our approach is the selection of
an optimal high-radix carry-save encoding of intermediate
results according to a given modulus M. High-radix
carry-save number systems take advantage of dedicated
carry logic available in almost all FPGA families and
reduce the amount of interconnects. Therefore, our
approach allows us to significantly reduce the area of
modular multipliers.
APPENDIX A
This appendix aims at proving the correctness of
Algorithm 1. We proceed in three steps: after establishing
a property of the modulo M correction considered in this
paper, we show that P ðsÞ½i is an ðnþ 2Þ-bit number. We
conclude by computing a bound on P ½1, which
indicates that P ½1=2 < 2M. This proof also provides
the reader with all the technical details requested to
implement the algorithm or an automatic code generator.
A.1 A Property of Modulo M Correction
The first step consists of establishing a property that will
allow us to compute bounds on P ½i. Let  ¼ 2n  2n4 ¼
2n1 þ 2n2 þ 2n3 þ 2n4 and M 2 f2n1; . . . ; 2n  1g. Then
hk  2n2iM < ; 8k 2 0; . . . ; 24  1: ð9Þ
The proof is straightforward if the modulusM is smaller
than or equal to . Let us assume now that M ¼  þ ,
where  satisfies the following inequality:
1    2n4  1:
For k 2 f0; 1; 2; 3g, we easily check that hk  2n2iM ¼
k  2n2 < . Since h2niM ¼ 2n M, we obtain
h4  2n2iM ¼ 2n4   < :
Consequently, we have
h5  2n2iM ¼ 2n2 þ 2n4   < ;
h6  2n2iM ¼ 2n1 þ 2n4   < ;
h7  2n2iM ¼ 2n1 þ 2n2 þ 2n4   < :
For k ¼ 8, the following modulo M operation has to be
carried out:
h8  2n2iM ¼ h2n þ 2n4  iM:
SinceM < 2n þ 1  2n þ 2n4    2n þ 2n4  1 < 2M, we
deduce that
h8  2n2iM ¼ 2n þ 2n4   M ¼ 2n3  2:
Thus, we have
h9  2n2iM ¼ 2n2 þ 2n3  2 < ;
h10  2n2iM ¼ 2n1 þ 2n3  2 < ;
h11  2n2iM ¼ 2n1 þ 2n2 þ 2n3  2 < :
A modulo M reduction is again required for k ¼ 12. Since
M < 2n þ 2  2n þ 2n3  2  2n þ 2n3  2 < 2M;
we obtain
h12  2n2iM ¼h2n þ 2n3  2iM
¼ 2n þ 2n3  2 M
¼ 2n3 þ 2n4  3:
Since   1, we conclude the proof by noting that
h13  2n2iM ¼ 2n2 þ 2n3 þ 2n4  3 < ;
h14  2n2iM ¼ 2n1 þ 2n3 þ 2n4  3 < ;
h15  2n2iM ¼ 2n1 þ 2n2 þ 2n3 þ 2n4  3 < :
A.2 Width of P ðsÞ½i
Let us prove by induction that P ½i is an ðnþ 2Þ-bit
number. Since P ½r ¼ 0, we check that k ¼ 0, and T ½r 1 ¼
P ½r 1 ¼ xr1Y , which is an n-bit number. This property
holds for i ¼ r 1. Assume now that P ðsÞ½iþ 1 is an
ðnþ 2Þ-bit number. We have to consider two cases
according to the parameter :
. Our hypotheses guarantee that nk1  5 for  ¼ 2.
Therefore, 2hP ðsÞ½iþ 1i2n2 contains k sum words of
respective widths n00 ¼ n0 þ 1; . . . ; n0k2 ¼ nk2, and
n0k1 ¼ nk1  4 (Fig. 13a). Let us split the partial
product xiY into k blocks in order to add it word by
word to 2hP ðsÞ½iþ 1i2n2 . We know that
Xk1
i¼0
n0i ¼ n 1:
Since xiY is an n-bit integer, we deduce from the
above equation that its most significant sum word
contains n00k1 ¼ n0k1 þ 1 ¼ nk1  3 bits. Therefore,
the sum of the most signif icant bits of
2hP ðsÞ½iþ 1i2n2 , xiY , and a carry bit is bounded by
2n
0
k1þ1  1
 
þ 2n0k1  1
 
þ 1 ¼ 2nk13 þ 2nk14  1
¼ 3  2nk14  1;
which is an ðnk1  2Þ bit number. Therefore, sincePk1
i¼0 ni ¼ nþ 2, T ðsÞ½i is an ðnþ 1Þ-bit number.
Indeed, we have
ðnk1  2Þ þ
Xk2
i¼1
ni þ ðn0 þ 1Þ ¼
Xk1
i¼0
ni  1
¼nþ 1:
12 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 12, DECEMBER 2008
Four most significant bits of P ðsÞ½iþ 1 address the
table responsible for the modulo M correction
(Fig. 13b). Recall that we have to combine the output
of this table and carry bits of T ½i in order to generate
a high-radix carry-save number U½i, whose format is
the one of P ½i. Since 2hk  2niM is an ðnþ 1Þ-bit
number, we split it into k words of respective
lengths n0; n1; . . . ; nk2, and ðnk1  1Þ. Consider
now the addition of the most significant words of
2hk  2niM and T ðsÞ½i and the most significant carry
bit of T ðcÞ½i. According to our hypotheses, nk1  5,
and this most significant word contains at least
4 bits. Consider the worst case (Fig. 13b), where
nk1 ¼ 5 and the weight of the most significant bit of
T ðcÞ½i is equal to 2n2. We deduce from (9) that U ðsÞ½i
is an ðnþ 2Þ-bit number and that its most significant
sum word is smaller than or equal to 24. The
addition of the most significant words of T ðsÞ½i
and U ðsÞ½i and a carry bit never generates an output
carry, and P ðsÞ½i is therefore an ðnþ 2Þ-bit number
(Fig. 13c).
. Assume now that  ¼ 1. The same approach allows
us to show that T ðsÞ½i is an ðnþ 1Þ-bit word
(Fig. 14a). According to our hypotheses, the most
significant word of P ðsÞ½i 1 contains at least 6 bits.
Therefore, the weight of the most significant carry
bit of T ðcÞ½i is at most 2n3. Since (9) guarantees that
2hk  2niM < 2n þ 2n1 þ 2n2 þ 2n3, we deduce
that UðsÞ½i is an ðnþ 1Þ-bit number (Fig. 14b). Note
that for some moduli, we can relax the constraint on
nk1: the remainder of the proof will only assume
that U ðsÞ½i is an ðnþ 1Þ-bit number. An automatic
code generator can check this condition very easily
for a given value of M. Since the most significant
words of T ðsÞ½i and U ðsÞ½i have the same size, their
addition may generate an output carry, and P ðsÞ½i
is therefore an ðnþ 2Þ-bit number (Fig. 14c).
A.3 Final Modulo M Correction
The last step consists of proving that P ½1=2 is smaller than
2M. We have again to consider two cases according to :
. Assume that  ¼ 2 and consider the last iteration
(i.e., i ¼ 1). Since the partial product x1Y is equal
to zero, we have
T ½1 ¼ 2  hP ðsÞ½0i2n2 þ P ðcÞ½0
 
 2  ð2n2  1þ 2n2  1Þ ¼ 2n  4:
Thus, P ½1  2n þ 2M  6, and P ½1=2  2n1 þ
M  3. Since the modulus M is supposed to be
greater than 2n1, we know that P ½1=2 is smaller
than 2M.
BEUCHAT AND MULLER: AUTOMATIC GENERATION OF MODULAR MULTIPLIERS FOR FPGA APPLICATIONS 13
Fig. 13. Proof of Algorithm 1 for  ¼ 2. (a) Computation of T ½i. (b) Modulo M reduction and conversion. (c) Computation of P ½i.
Fig. 14. Proof of Algorithm 1 for  ¼ 1. (a) Computation of T ½i. (b) Modulo M reduction and conversion. (c) Computation of P ½i.
. When  ¼ 1, hP ðsÞ½iþ 1i2n is smaller than or
equal to 2n1  1. Recall that the weight of the
most significant carry bit of P ðcÞ½iþ 1 is equal to
n0 þ n1 þ    þ nk2 (Section 2). Thus
T ½1  2n  2þ 2n0þ1 þ    þ 2n0þþnk2þ1;
and
P ½1  2n  2þ 2n0þ1 þ    þ 2n0þþnk2þ1 þ 2 max:
Therefore, P ½1=2 is smaller than 2M if
2n1  1þ 2n0 þ    þ 2n0þþnk2 þ  max
2
< M:
ACKNOWLEDGMENTS
The authors are grateful to the anonymous referees for their
valuable comments. The work described in this paper has
been supported in part by the New Energy and Industrial
Technology Development Organization (NEDO), Japan,
and by the Swiss National Science Foundation through
the Advanced Researchers program while Jean-Luc Beuchat
was at Ecole Normale Supe´rieure de Lyon (Grant PA002-
101386). The authors would like to thank F. de Dinechin and
J. Detrey for administrating the CAD tool servers on which
experiments described in this paper were performed.
REFERENCES
[1] P. Montgomery, “Modular Multiplication without Trial Division,”
Math. of Computation, vol. 44, no. 170, pp. 519-521, 1985.
[2] G.R. Blakley, “A Computer Algorithm for Calculating the Product
ab Modulo m,” IEEE Trans. Computers, vol. 32, no. 5, pp. 497-500,
May 1983.
[3] M.D. Ercegovac and T. Lang, Digital Arithmetic. Morgan
Kaufmann, 2004.
[4] C.K. Koc¸ and C.Y. Hung, “Carry-Save Adders for Computing
the Product AB Modulo N,” Electronics Letters, vol. 26, no. 13,
pp. 899-900, June 1990.
[5] C.K. Koc¸ and C.Y. Hung, “A Fast Algorithm for Modular
Reduction,” IEE Proc.: Computers and Digital Techniques, vol. 145,
no. 4, pp. 265-271, July 1998.
[6] N. Takagi and S. Yajima, “Modular Multiplication Hardware
Algorithms with a Redundant Representation and Their Applica-
tion to RSA Cryptosystem,” IEEE Trans. Computers, vol. 41, no. 7,
pp. 887-891, July 1992.
[7] N. Takagi, “A Radix-4 Modular Multiplication Hardware
Algorithm for Modular Exponentiation,” IEEE Trans. Computers,
vol. 41, no. 8, pp. 949-956, Aug. 1992.
[8] Y.-J. Jeong and W.P. Burleson, “VLSI Array Algorithms and
Architectures for RSA Modular Multiplication,” IEEE Trans. VLSI
Systems, vol. 5, no. 2, pp. 211-217, June 1997.
[9] S. Kim and G.E. Sobelman, “Digit-Serial Modular Multiplication
Using Skew-Tolerant Domino CMOS,” Proc. IEEE Int’l Conf.
Acoustics, Speech, and Signal Processing (ICASSP ’01), vol. 2,
pp. 1173-1176, 2001.
[10] E. Peeters, M. Neve, and M. Ciet, “XTR Implementation on
Reconfigurable Hardware,” Proc. Workshop Cryptographic Hardware
and Embedded Systems (CHES ’04), M. Joye and J.-J. Quisquater,
eds., pp. 386-399, 2004.
[11] J.-L. Beuchat and J.-M. Muller, “Modulo m Multiplication-
Addition: Algorithms and FPGA Implementation,” Electronics
Letters, vol. 40, no. 11, pp. 654-655, May 2004.
[12] R. Beguenane, J.-L. Beuchat, J.-M. Muller, and S. Simard,
“Modular Multiplication of Large Integers on FPGA,” Proc. 39th
Asilomar Conf. Signals, Systems and Computers, 2005.
[13] D.N. Amanor, C. Paar, J. Pelzl, V. Bunimov, and M. Schimmler,
“Efficient Hardware Architectures for Modular Multiplication on
FPGAs,” Proc. 15th Int’l Conf. Field Programmable Logic and
Applications (FPL ’05), pp. 539-542, 2005.
Jean-Luc Beuchat received the MSc and PhD
degrees in computer science from the Swiss
Federal Institute of Technology, Lausanne, in
1997 and 2001, respectively. He is an associate
professor in the Graduate School of Systems
and Information Engineering, Laboratory of
Cryptography and Information Security, Univer-
sity of Tsukuba, Tsukuba, Japan. His current
research interests include computer arithmetic
and cryptography.
Jean-Michel Muller received the PhD degree
from the Institut National Polytechnique de
Grenoble in 1985. He is the directeur de
recherches (senior researcher) at CNRS,
France, and he is the former head of the LIP
Laboratory (LIP is a joint laboratory of CNRS,
the Ecole Normale Supe´rieure de Lyon, INRIA,
and the Universite´ Claude Bernard Lyon 1). His
research interests are in computer arithmetic.
He is the author of several books, including
Elementary Functions, Algorithms and Implementation (second edition,
Birkha¨user, 2006). He served as an associate editor of the IEEE
Transactions on Computers from 1996 to 2000. He was the co-program
chair of the 13th IEEE Symposium on Computer Arithmetic in 1997 and
the general chair of the 14th IEEE Symposium on Computer Arithmetic
in 1999. He is a senior member of the IEEE and a member of the IEEE
Computer Society.
. For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
14 IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 12, DECEMBER 2008
