Efficient Pipelining for Modular Multiplication Architectures in Prime Fields
Nele Mentens∗, Kazuo Sakiyama∗, Bart Preneel and Ingrid Verbauwhede
Katholieke Universiteit Leuven, ESAT-SCD/COSIC
Kasteelpark Arenberg 10
3001 Heverlee, Belgium
{Nele.Mentens,Kazuo.Sakiyama,Bart.Preneel,Ingrid.Verbauwhede}@esat.kuleuven.be
ABSTRACT
This paper presents a pipelined architecture of a modu-
lar Montgomery multiplier, which is suitable to be used in
public key coprocessors. Starting from a baseline imple-
mentation of the Montgomery algorithm, a more compact
pipelined version is derived. The design makes use of 16-
bit integer multiplication blocks that are available on re-
cently manufactured FPGAs. The critical path is optimized
by omitting the exact computation of intermediate results
in the Montgomery algorithm using a 6-2 carry-save nota-
tion. This results in a high-speed architecture, which out-
performs previously designed Montgomery multipliers. Be-
cause a very popular application of Montgomery multiplica-
tion is public key cryptography, we compare our implemen-
tation to the state-of-the-art in Montgomery multipliers on
the basis of performance results for 1024-bit RSA.
Categories and Subject Descriptors
B.2 [Arithmetic and Logic Structures]: High-Speed Arith-
metic—Cost/performance; E.3 [Data]: Data Encryption—
Public key cryptosystems
General Terms
Security
Keywords
FPGA, Montgomery multiplication, cryptography, public
key coprocessor
∗Nele Mentens and Kazuo Sakiyama are partially funded by
FWO (G.0450.04), IBBT, EU IST FP6 projects ECRYPT
and SESOC.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
GLSVLSI’07, March 11–13, 2007, Stresa-Lago Maggiore, Italy.
Copyright 2007 ACM 978-1-59593-605-9/07/0003 ...$5.00.
1. INTRODUCTION
Montgomery multiplication has been shown to be a very
efficient way to perform modular multiplication [10]. That
is why the algorithm is often used in data paths that provide
modular arithmetic. This paper presents an architecture of
a Montgomery multiplier that is implemented on an FPGA.
The architecture utilizes 16-bit integer multipliers, which are
available on recently designed FPGAs.
Starting from a rather straightforward implementation of
the Montgomery algorithm, an optimization for area and
speed is done by downsizing the architecture and introduc-
ing more levels of pipelining. Moreover, a substantial reduc-
tion of the length of the critical path is achieved by opti-
mizing the speed of the feedback loop in the architecture.
This optimization is based on the observation that the exact
computation of intermediate values is not necessary and a
carry-save approach can be used.
A popular application of modular arithmetic is public key
cryptography. That is why we compare our implementation
to the state of the art in Montgomery multipliers on the basis
of the time to perform an RSA decryption [12]. The results
show that our implementation outperforms previously designed
Montgomery multipliers on FPGAs.
The organization of the paper is as follows. Section 2 lists
the state-of-the-art in Montgomery multipliers. In Sect. 3,
the Montgomery algorithm is introduced. Section 4 com-
pares our new architecture to a baseline architecture. Fi-
nally, Sect. 5 and Sect. 6 give the implementation results
and conclude the paper, respectively.
2. PREVIOUS WORK
A very good overview of implementation options for Montgomery
multipliers is given by Koç et al. in [4]. Together
with implementation results on a Pentium processor, they
describe many algorithms that implement Montgomery mul-
tiplication using a single w-bit integer multiplier. These al-
gorithms can be extended to parallel versions. In a fully
parallelized implementation, the Finely Integrated Operand
Scanning (FIOS) and Coarsely Integrated Operand Scan-
ning (CIOS) algorithms in [4] lead to the same architecture.
Our architecture is also based on these algorithms.
A scalable radix-2 implementation is introduced by Tenca
and Koç in [13]. The same notion of scalability is implemented
by Batina et al. in [1]. They present a systolic
array architecture resulting in a Montgomery based RSA
implementation. More recent hardware implementations of
Montgomery multiplication include the work by McIvor et
al. [9]. They use Carry Save Adders (CSAs) to perform the
large word length additions required for Montgomery
multiplication. This idea is similar to ours. In [9],
the carry-save format is used in between modular
multiplications to perform RSA, resulting in 5-2 CSA logic. In our
implementation, we instead use 6-2 CSA logic, optimizing
the internal loop in the Montgomery multiplication
algorithm. The carry-save approach is also used by Bunimov
and Schimmler in [2]. They combine CSA logic with a
table lookup. In [8], Manochehri and Pourmozafari introduce
pipelining inside the CSA logic. This is different from our
approach, which pipelines the rest of the architecture.
The implementation of RSA decryption in [9] uses the
Chinese Remainder Theorem (CRT), which leads to a speed-up
factor of almost 4 [11]. Up to now, the fastest reported
RSA implementation on an FPGA, to our knowledge, that
does not apply CRT, is presented by Kelley and Harris in
[5]. Similar to our architecture, they also use the dedicated
multiplier blocks on the FPGA. We will compare our results
to this implementation.
3. MONTGOMERY MULTIPLICATION
In 1985, Peter Montgomery introduced a method to efficiently
perform modular multiplication [10]. Instead of computing
$X \cdot Y \bmod M$, the Montgomery algorithm computes
$X \cdot Y \cdot R^{-1} \bmod M$, where $R$ is a power of two. In this
way, trial division can be avoided and a division by a power
of two is executed instead. This comes down to a right-shift
operation, which is almost free in hardware. An
improvement on Montgomery’s algorithm was presented by
Colin Walter in [14,15]. Compared to Montgomery’s original
algorithm, this algorithm performs one extra iteration,
making the conditional final subtraction unnecessary. The
improved algorithm therefore runs in constant time and avoids
the implementation of a subtractor.
Alg. 1 shows the improved Montgomery algorithm. In
[14,15], Walter proves that, if the inputs $X$ and $Y$ are smaller
than $2M$, the output is also bounded by $2M$, while
the intermediate result $T$ is bounded by $4M$. He also
shows that, after converting from Montgomery to normal
representation, the result is smaller than $M$ again. Note
that only the LSB of the most significant digits $X_n$ and $Y_n$
can be equal to one, because $0 \le X, Y < 2M$. For digit $T_n$,
the two LSBs can be equal to one, while the rest of the bits
are zero, because $T < 4M$ is always ensured. However,
in the implementations that are discussed in Sect. 4, we
always consider the complete length of $X$, $Y$ and $T$, i.e.
$(n + 1) \cdot \log_2 b$ bits, for ease of notation. To conclude this
section, we remark that the division by $b$ in Step 4 of Alg. 1
can be implemented as a shift operation, because the
$\log_2 b$ LSBs of the sum are equal to zero.
Algorithm 1 Improved Montgomery multiplication
Require: integers $M = (M_{n-1} \cdots M_0)_b$, $X = (X_n \cdots X_0)_b$,
$Y = (Y_n \cdots Y_0)_b$ with $0 \le X, Y < 2M$, $R = b^{n+1}$ with
$\gcd(M, b) = 1$ and $M' = -M^{-1} \bmod b$
Ensure: $X \cdot Y \cdot R^{-1} \bmod M$
1: $T = (T_n \cdots T_0)_b \leftarrow 0$
2: for $i$ from 0 to $n$ do
3: $U_i \leftarrow (T_0 + X_0 \cdot Y_i) \cdot M' \bmod b$
4: $T \leftarrow (T + X \cdot Y_i + M \cdot U_i)/b$
5: end for
6: Return $T$
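As an illustration, Alg. 1 can be modelled in software with arbitrary-precision integers. The sketch below is not the hardware design of this paper; the function name montgomery_multiply and the parameter w (the digit width, so that b = 2^w) are assumptions introduced here, and the modular inverse relies on Python 3.8 or later.

```python
def montgomery_multiply(x, y, m, n, w=16):
    """Software model of Alg. 1 (hypothetical helper, not the hardware design).

    Returns T with T = x * y * R^-1 (mod m) and T < 2m, where R = b^(n+1),
    b = 2^w, m is odd with m < b^n, and 0 <= x, y < 2m.
    """
    b = 1 << w
    mask = b - 1
    m_prime = (-pow(m, -1, b)) % b          # M' = -M^-1 mod b
    t = 0
    for i in range(n + 1):                  # one extra iteration, no final subtraction
        y_i = (y >> (w * i)) & mask         # digit Y_i
        u_i = (((t & mask) + (x & mask) * y_i) * m_prime) & mask   # Step 3
        t = (t + x * y_i + m * u_i) >> w    # Step 4: the w LSBs of the sum are zero
    return t


# Example with a 1024-bit odd modulus (b = 2^16, n = 64).
modulus = 2**1024 - 105
a = pow(3, 1000, 2 * modulus)
b_in = pow(5, 1000, 2 * modulus)
product = montgomery_multiply(a, b_in, modulus, 64)
```

The result stays below 2M, so it can be fed directly into the next multiplication without a subtraction, which is the property the architectures in Sect. 4 rely on.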
4. MONTGOMERY MULTIPLICATION ARCHITECTURE
This section presents two hardware architectures that implement
Alg. 1 with $b = 2^{16}$ and $n = 64$, i.e. they implement
1024-bit Montgomery multiplication. The first architecture
is rather straightforward. It is described in Sect. 4.2,
which also elaborates on the drawbacks of the design. To
solve these drawbacks, several optimizations were applied in
order to obtain a much faster pipelined architecture,
presented in Sect. 4.3. Before introducing both architectures,
the available resources on our hardware platform are listed
in Sect. 4.1.
[Figure 1, block diagram: registers X, Y, M and T; blocks “mult”
(1040*16 and 1024*16), “modinv” (16*16), “modmult” (16*16),
“modadd” (16+16) and a 5-input “add” computing
1040 + (1040<<16) + 1040 + (1024<<16) + 1024.]
Figure 1: Baseline architecture. The registers are
indicated by gray boxes. A shift register is denoted
by a gray box with an arrow inside.
4.1 Hardware Resources
The target implementation platform for the presented architectures
is an FPGA. Recent FPGAs contain highly optimized
16-bit integer multipliers, which are implemented as
dedicated hardware to be connected to the reconﬁgurable
part of the FPGA [16]. Moreover, highly optimized arbi-
trary sized adders are available. The circuitry inside both
the multiplier blocks and the adders is proprietary informa-
tion of the FPGA vendors.
4.2 Baseline Implementation
A more or less straightforward architecture for computing
Alg. 1 is shown in Fig. 1. The registers
are indicated by gray boxes: X and Y are the inputs, M
is the modulus and T is the intermediate register where the
result will also be stored. The “mult” blocks consist of 16-bit
integer multipliers in parallel and output a sum and a
carry result, as depicted in Fig. 2(a). The output of the
“modmult” block is formed by the 16 LSBs of an integer
multiplier. As mentioned in Sect. 4.1, no manual adder
optimization is done. Therefore, the “add” block is a black-box
5-input adder provided by the FPGA vendor. Because the
modulus M is constant throughout the Montgomery
multiplication, the “modinv” operation needs to be computed
only once. For this reason, the delay of this operation is not
important. The “modinv” block consists of two-level logic
situated in the CLBs of the FPGA. Note that the output
of the 5-input adder is bounded to 1056 bits, because of
the observations made in Sect. 3. To realize the division by
$b = 2^{16}$ in Step 4 of Alg. 1, a right shift operation over 16
positions is performed on T.
[Figure 2, block diagrams: (a) parallel 16*16 multipliers that take the
16-bit digits of in1 and the digit in2i and produce S and C; (b) four full
adders (FA) with inputs i1 to i6, carry inputs ci1 to ci3, carry outputs
co1 to co3 and outputs S and C.]
Figure 2: (a) Architecture of a “mult” block, where
S and C denote the sum and carry in a special carry-save
form (S + (C ≪ 16)). (b) Architecture of one
compression cell pointed at in Fig. 4.
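The special carry-save form produced by a “mult” block can be illustrated with a short software model. This is a sketch under the assumption that S collects the lower 16 bits and C the upper 16 bits of each 16*16 partial product, which is one way to obtain S + (C ≪ 16); the name mult_block and its parameters are introduced here for illustration only.

```python
def mult_block(x, y_i, digits, w=16):
    """Model of a "mult" block: `digits` parallel w-bit multipliers.

    Returns (s, c) such that x * y_i == s + (c << w). No carries are
    propagated between the multipliers (assumed wiring, see lead-in).
    """
    mask = (1 << w) - 1
    s = 0
    c = 0
    for j in range(digits):
        p = ((x >> (w * j)) & mask) * y_i   # one 16*16 multiplier, 32-bit product
        s |= (p & mask) << (w * j)          # lower half goes to the sum word
        c |= (p >> w) << (w * j)            # upper half goes to the carry word
    return s, c


# X * Y_i for a 1040-bit X needs 65 multipliers (65 digits of 16 bits).
x_value = (1 << 1040) - 987654321
s_xy, c_xy = mult_block(x_value, 0xBEEF, 65)
```

Because every partial product is split at a digit boundary, both outputs are as wide as the multiplicand and the only addition left is the one that the adder (or, later, the 6-2 compression) performs.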
The implementation presented in Fig. 1 computes the
same sequence of operations in each clock cycle. This se-
quence is shown in Table 1, where the subscripts new and
old denote the input and the output of a register, respec-
tively, and the numbers in square brackets denote the bit
indices. These operations are executed n+1 = 65 times for
one Montgomery multiplication. This means the result will
be available in register T after 65 clock cycles.
After implementing the architecture in Fig. 1 on a Xilinx
XC2VP30 FPGA, a maximum clock frequency of 30 MHz
was obtained. This frequency is determined by the long critical
path, going from register Y (which has a higher fan-out
than register X) to register T through three 16-bit integer
multipliers, a 16-bit adder and the 5-input adder with its
1056-bit output. In order to reduce the critical path, registers
could be introduced in between these 5 arithmetic units.
However, the delay of the 5-input adder is much larger than
the delay of the multipliers, which still limits the maximum
operating frequency. Inserting registers inside this adder would
increase the maximum operating frequency, but would not reduce the
delay of the feedback loop, which computes the new value of
T from the old value of T. Therefore, we need to optimize
this feedback loop in order to reduce the critical path and
increase the maximum frequency. These optimizations are
discussed in Sect. 4.3.
Another limitation of the architecture in Fig. 1 is that the
size of the registers and arithmetic blocks is equal to the full
word length of the Montgomery multiplier. Section 4.3 in-
troduces a method to make the architecture more compact.
4.3 Downsized Pipelined Implementation
This section describes our new architecture in three steps.
In Sect. 4.3.1, the optimization of the 5-input adder in Fig. 1
is discussed. Section 4.3.2 elaborates on the downsizing of
the architecture. Finally, the length of the critical path is
reduced substantially by introducing pipelining, which is ex-
plained in Sect. 4.3.3.
4.3.1 Speeding-up the Adder
To optimize the delay of the 5-input adder in Fig. 1, we
omit the exact computation of the intermediate values of T.
Instead, we compute the sum and the carry of T separately
using a carry-save adder. The new Montgomery multiplier
architecture is shown in Fig. 3, where the sum and the carry
of T are denoted by STi and CTi. The alignment of the
inputs and outputs of the carry-save adder, denoted by “6-2
compression”, is shown in Fig. 4. The carry output of the
MSB cell is always zero, because of the bounds explained in
Sect. 3. This results in a 1055-bit carry value at the output
of the carry-save adder. The architecture of one compression
cell is depicted in Fig. 2(b).
[Figure 3, block diagram: registers X, Y, M, SXYi, CXYi, STi, CTi, ci,
T0i, Ui and T; blocks “mult” (1040*16 and 1024*16), “modinv”,
“modmult”, “modadd”, “6-2 compression”, “add” (16+(15<<1)),
“add” (16+16) and “seqadd528”.]
Figure 3: Full word length architecture with carry-save
adder.
The division by $b = 2^{16}$ in Step 4 in Alg. 1 is executed by
performing a 16-bit right shift operation on the sum and the
carry of T. The bits of STi and CTi that are shifted out add
up to 0 modulo $2^{16}$, as explained in Sect. 3. However, the
carry-out of this addition needs to be taken into account and is
therefore led back into the 6-2 compression block. The carry-bit is
stored in register ci in Fig. 3. To be able to perform Step 3
in Alg. 1, the sum of the next 16 bits of STi and CTi, denoted
by T0i in Fig. 3, is calculated and added to SXYi[15 : 0].
[Figure 4, alignment diagram: the inputs STi,old, CTi,old, SXYi,
CXYi, SMUi, CMUi and ci of the 6-2 compression block and its outputs
STi,new and CTi,new, with one compression cell marked; the 16 LSBs of
the outputs give SUM = 0 and CARRY = c.]
Figure 4: Alignment of the compression block. The
subscripts new and old indicate the input and output
of a register, respectively.
Table 1: Operations performed in Fig. 1 in each clock cycle. The
subscripts new and old indicate the input and output of a register,
respectively.
SXYi + (CXYi ≪ 16) ⇐ X ∗ Yi
Ui ⇐ ((Told[15 : 0] + SXYi[15 : 0]) ∗ M′)[15 : 0]
SMUi + (CMUi ≪ 16) ⇐ M ∗ Ui
Tnew ⇐ (Told + SXYi + (CXYi ≪ 16) + SMUi + (CMUi ≪ 16))[1055 : 16]
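The carry-save feedback loop described above can be checked with a word-level software model. The sketch below is an assumption-laden illustration, not the circuit of Fig. 3: the 6-2 compression is modelled as a tree of four 3:2 carry-save adders on whole words, the bit ci is merged into a free LSB of an internal carry word, ci is included when forming T0i, and all function names are introduced here.

```python
def mult_block(x, y_i, digits, w=16):
    """Parallel w-bit multipliers: x * y_i == s + (c << w)."""
    mask = (1 << w) - 1
    s = c = 0
    for j in range(digits):
        p = ((x >> (w * j)) & mask) * y_i
        s |= (p & mask) << (w * j)
        c |= (p >> w) << (w * j)
    return s, c


def csa(a, b, c):
    """3:2 carry-save adder on whole words: a + b + c == s + cy."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def compress_6_2(a, b, c, d, e, f, carry_in=0):
    """Six words plus one carry bit in, two words out, sum preserved."""
    s1, c1 = csa(a, b, c)
    s2, c2 = csa(d, e, f)
    s3, c3 = csa(s1, c1 | carry_in, s2)   # bit 0 of c1 is always 0
    return csa(s3, c3, c2)


def montgomery_carry_save(x, y, m, n=64, w=16):
    """Alg. 1 with T kept as sum word ST, carry word CT and carry bit c."""
    mask = (1 << w) - 1
    m_prime = (-pow(m, -1, 1 << w)) & mask
    st = ct = c = 0
    for i in range(n + 1):
        y_i = (y >> (w * i)) & mask
        sxy, cxy = mult_block(x, y_i, n + 1, w)
        t0 = ((st & mask) + (ct & mask) + c) & mask      # T0i, low digit of T
        u_i = ((t0 + (sxy & mask)) * m_prime) & mask     # Step 3
        smu, cmu = mult_block(m, u_i, n, w)
        s, cy = compress_6_2(st, ct, sxy, cxy << w, smu, cmu << w, c)
        c = ((s & mask) + (cy & mask)) >> w   # shifted-out bits sum to 0 or 2^w
        st, ct = s >> w, cy >> w              # division by b as a right shift
    return st + ct + c                        # final addition of sum and carry
```

In this model no carry propagates across the full word inside the loop; the only full-length addition is the final one, which corresponds to the sequential adder discussed next.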
After replacing the 5-input adder by a 6-2 compression
block, a higher operating frequency can be used. The imple-
mentation of the architecture in Fig. 3 on a Xilinx XC2VP30
FPGA shows a maximum operating frequency of 50 MHz,
which is substantially faster than the clock frequency re-
ported in Sect. 4.2. However, to compute the ﬁnal result,
STi and CTi need to be added up. This can be done with
a sequential adder. To balance the delay of the sequential
adder with the critical path of the rest of the architecture, 528
bits are computed per clock cycle, resulting in the computation
of T in 2 clock cycles. This makes the total cycle count
for one Montgomery multiplication equal to 65 + 2 = 67.
4.3.2 Downsizing the Architecture
The architecture in Fig. 3 employs 65 and 64 16-bit integer
multipliers for the computation of X ∗ Yi and M ∗ Ui,
respectively. One more 16-bit multiplier is used for computing
Ui. This results in 130 multipliers. Although there exist FPGAs
that provide this number of multipliers, many FPGAs
do not have enough resources to implement this full word
length architecture. That is why we introduce a downsized
architecture in this section. Although the downsizing factor
can be chosen arbitrarily, we stick to a factor of 4, resulting
in an architecture with 17 + 16 + 1 = 34 16-bit multipliers.
This allows us to do a more or less fair comparison with
previously designed Montgomery multipliers in Sect. 5.
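The multiplier counts quoted above follow from the number of 16-bit digits handled per clock cycle. The short sketch below reproduces them; the function name and the rounding rule for downsizing factors other than 1 and 4 are assumptions made for illustration.

```python
from math import ceil


def multiplier_count(n=64, downsizing=1):
    """16-bit multipliers needed for X * Yi, M * Ui and the Ui computation."""
    x_mults = ceil((n + 1) / downsizing)   # X has n + 1 digits
    m_mults = ceil(n / downsizing)         # M has n digits
    return x_mults + m_mults + 1           # plus one multiplier for Ui


full_width = multiplier_count(64, 1)   # 65 + 64 + 1 = 130
downsized = multiplier_count(64, 4)    # 17 + 16 + 1 = 34
```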
The downsized architecture is shown in Fig. 5. The “mult”
and “6-2 compression” computation for the evaluation of one
16-bit digit of Y is spread over 4 clock cycles. In each of
these 4 clock cycles a partial result of the computation is
valid. Input register X is a cyclic shift register. It outputs
three times 256 bits and one time 272 bits. Each of these
four outputs is multiplied with Yi. In the same way, register
M shifts out four times 256 bits that are multiplied with Ui.
The 1056-bit 6-2 compression depicted in Fig. 4 is divided
into 4 parts. Because some inputs are 272 bits wide and shifted to
the left over 16 positions, the width of the 6-2 compression
block is 288 bits. Because the complete 1040-bit sum and
carry results of the carry-save addition need to be led back
into the 6-2 compression block (as shown in Fig. 4), the
registers that store STi and CTi cannot be reduced. They
are implemented as full length shift registers. In the first
of each four cycles, the division by $2^{16}$ in Step 4 in Alg. 1
is performed and ci is written into a register, as explained
in Sect. 4.3.1, where it is held for 4 cycles. T0i is also stored and
held for 4 cycles until the next Yi is evaluated. Because of
the division by $2^{16}$, only 272 bits are written into STi and
CTi in the first of each four cycles. The 16 MSBs are forced
to zero in order to match the signal width of 288 bits. In
the other three cycles, all 288 bits at the output of “6-2
compression” are shifted into STi and CTi. This results in a
width of 4 ∗ 288 = 1152 bits for STi and CTi. However, because
of the bounds explained in Sect. 3, we only need the 1040
LSBs of STi and CTi to compute the final result T.
[Figure 5, block diagram: cyclic shift registers X and M, registers Y,
Ui, SXYi, CXYi, STi, CTi, ci, T0i and T; blocks “mult” (272*16 and
256*16), “modinv”, “modmult”, “modadd”, 288-bit “6-2 compression”,
“add” (16+(15<<1)), “add” (16+16) and “seqadd272”.]
Figure 5: Downsized architecture with carry-save
adder. The registers that store SXYi and CXYi are
instantiated 2 times. The 16 LSBs from SXYi are taken
after the first register.
Note that Ui needs to be computed only once every 4
clock cycles. That is why a register is inserted to store Ui.
This register is enabled every fourth clock cycle. Because we
inserted a register to store Ui, we are splitting the critical
path approximately in two, resulting in a maximum clock
frequency of 100 MHz on a Xilinx XC2VP30 FPGA. To
add up the corresponding parts of X ∗ Yi and M ∗ Ui,
registers need to be inserted to store S/CXYi.
We need one cycle to compute the first Ui and 4 cycles for
the computation of each X ∗ Yi. Because the critical path is
about half as long as that of the architecture in Fig. 3, we use
a 272-bit adder to perform the final addition. This computes T in 4
clock cycles. As a result, the total cycle count is equal to
1 + 4 ∗ 65 + 4 = 265.
4.3.3 Introducing Pipelining
Because the architecture in Fig. 5 computes a new Ui
only once in four clock cycles, we can introduce four levels of
pipelining in order to reduce the critical path. The pipelined
architecture is shown in Fig. 6. Because SMUi and CMUi
are stored 4 cycles after Yi is valid, the registers SXYi and
CXYi need to be repeated three times, resulting in 4 pipeline
registers for SXYi and CXYi. The 16 LSBs of SXYi, used
for the computation of Ui, are taken after the first register.
[Figure 6, block diagram: cyclic shift registers X and M, registers Y,
Hi, Ui, SXYi and CXYi (4x), SMUi, CMUi, STi, CTi, ci, T0i and T;
blocks “mult” (272*16 and 256*16), “modinv”, “modmult”, “modadd”,
288-bit “6-2 compression”, “add” (16+(15<<1)), “add” (16+16) and
“seqadd104”.]
Figure 6: Downsized pipelined architecture. The
registers that store SXYi and CXYi are instantiated 4
times. The 16 LSBs from SXYi are taken after the
first register.
At the end of the Montgomery multiplication, STi and CTi
need to be added. In this architecture we need a 104-bit
sequential adder to balance the path between STi/CTi and
T with the rest of the architecture. Because of the bounds
explained in Sect. 3, the last carry-out of this addition can
be omitted, resulting in a 1040-bit value for T.
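The sequential adder that converts the carry-save result into T can be modelled as a chunked addition with one carry register. The following is a behavioural sketch only; the function name and parameters are introduced here, and one loop iteration stands for one clock cycle.

```python
def sequential_add(st, ct, total_bits=1040, chunk_bits=104):
    """Add ST and CT chunk by chunk; returns (T, number of cycles).

    The last carry-out is dropped, so T is the sum modulo 2^total_bits.
    """
    mask = (1 << chunk_bits) - 1
    t = 0
    carry = 0
    cycles = 0
    for pos in range(0, total_bits, chunk_bits):
        part = ((st >> pos) & mask) + ((ct >> pos) & mask) + carry
        t |= (part & mask) << pos     # chunk of the result
        carry = part >> chunk_bits    # carry register for the next cycle
        cycles += 1
    return t & ((1 << total_bits) - 1), cycles
```

With the default 104-bit chunks the loop runs 10 times; chunk widths of 272 and 528 bits give 4 and 2 iterations, which matches the cycle counts stated for the architectures in Fig. 5 and Fig. 3.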
The pipelining schedule of our architecture is shown in
Fig. 7. The multiplications X ∗ Yi and M ∗ Ui take 4 clock
cycles. The indices a, b, c and d in Fig. 7 indicate the
results of the multiplications after the first, second, third and
fourth cycle, respectively. However, Hi is already computed
when S/CXYi,a is ready. After 1 cycle, S/CXYi,a is ready
and 4 cycles are needed to compute the output of the downsized
“6-2 compression” block. In total, 65 of these 4-cycle
operations need to be computed, resulting in 65 ∗ 4 = 260
clock cycles. Finally, the carry-save form of the result needs
to be transformed into the actual result T using the sequential
104-bit adder. As a consequence, the final result T is stored
10 cycles after the computation of the last STi and CTi is
finished. This brings the total cycle count for one Montgomery
multiplication to 1 + 260 + 10 = 271 cycles. After implementing
the architecture in Fig. 6 on a Xilinx XC2VP30 FPGA,
a maximum operating frequency of 125 MHz was achieved.
More detailed implementation results of this efficient
architecture are discussed in Sect. 5.
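The cycle counts and clock frequencies reported in Sect. 4.2 and Sect. 4.3 can be combined into the latency of one Montgomery multiplication. The sketch below only restates those reported numbers; the dictionary layout and the helper name are assumptions made for illustration.

```python
# (cycles per Montgomery multiplication, maximum clock frequency in Hz)
DESIGNS = {
    "baseline (Fig. 1)": (65, 30e6),
    "carry-save, full word length (Fig. 3)": (65 + 2, 50e6),
    "downsized (Fig. 5)": (1 + 4 * 65 + 4, 100e6),
    "downsized pipelined (Fig. 6)": (1 + 4 * 65 + 10, 125e6),
}


def latency_us(cycles, f_hz):
    """Latency of one multiplication in microseconds."""
    return cycles / f_hz * 1e6


for name, (cycles, f_hz) in DESIGNS.items():
    print(f"{name}: {cycles} cycles, {latency_us(cycles, f_hz):.2f} us")
```

Under these figures the downsized pipelined design needs about 2.17 µs per multiplication with 34 multipliers, while the full word length carry-save design needs about 1.34 µs with 130 multipliers.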
5. IMPLEMENTATION RESULTS
Because very popular applications of modular multipliers
are implementations of public key cryptography, the
speed of our Montgomery multiplier is evaluated on the basis
of the time to compute one 1024-bit RSA decryption,
which is based on modular exponentiation [12]. There exist
many algorithms for implementing modular exponentiation.
The most straightforward algorithm is the square-and-multiply
algorithm [12]. When only one Montgomery
multiplier is available, this algorithm requires 1024 square
and 1024 multiply operations for an exponent with a Hamming
weight of 1024. For side-channel security [7], we apply
our Montgomery multiplier for both the square and the multiply
operation. This results in 2048 Montgomery multiplications.
A more efficient way to implement modular exponentiation
is the k-ary method [3,6], in which $2^k - 2$ multiplications
are performed in the pre-computation phase and
$\lceil 1024/k \rceil \cdot (k + 1)$ in the exponentiation phase. Because
recent FPGAs contain a lot of block RAM, the k-ary method,
which requires the storage of the pre-computed values, is a
very efficient way to perform modular exponentiation.
Table 2: Comparison of the performance results
of our implementation with the fastest previously
designed implementation. The computation time
for a 1024-bit RSA decryption without the use of
CRT (tRSA) is given using the square-and-multiply
method (a) and the k-ary method (b).
ref.       platform    resources                         freq. (MHz)  tRSA (ms)
this work  XC2VP30     34 mults, 5.5k slices             125          4.4 (a), 2.7 (b)
[5]        XC40250XV   32 mults, 5 kbit RAM, 2.6k LUTs   135          5.0
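The operation count of the k-ary method can be illustrated with a small software model. This sketch uses ordinary modular multiplication instead of Montgomery multiplication, and it assumes that every window performs k squarings followed by one table multiplication (also for the first window and for a zero digit), which is one schedule that reproduces the counts given above; the function name is hypothetical.

```python
def kary_pow(g, e, m, k, e_bits):
    """Fixed-window (k-ary) exponentiation; returns (g^e mod m, #multiplications)."""
    count = 0
    table = [1, g % m]                    # g^0 and g^1 need no multiplication
    for _ in range(2, 1 << k):            # 2^k - 2 pre-computation multiplications
        table.append(table[-1] * g % m)
        count += 1
    windows = -(-e_bits // k)             # ceil(e_bits / k)
    result = 1
    for j in reversed(range(windows)):
        for _ in range(k):                # k squarings per window
            result = result * result % m
            count += 1
        digit = (e >> (k * j)) & ((1 << k) - 1)
        result = result * table[digit] % m   # always multiply, even by table[0]
        count += 1
    return result, count
```

Performing the same sequence of operations for every window keeps the operation count independent of the exponent value, in the same spirit as using the multiplier for both squarings and multiplications in the square-and-multiply case.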
Table 2 shows the results of our implementation in comparison
with the fastest reported implementations, to our
knowledge, described in literature. The speed of our
implementation is given for the square-and-multiply as well as the
k-ary method. In the case of 1024-bit exponentiation, k = 6
turns out to be the optimal choice. This corresponds
to 1259 modular multiplications for one 1024-bit modular
exponentiation.
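The choice k = 6 and the RSA timings can be reproduced from the multiplication counts, the 271 cycles per Montgomery multiplication and the 125 MHz clock frequency reported for the pipelined architecture. The helper names below are assumptions introduced for this sketch.

```python
def kary_mults(k, bits=1024):
    """Multiplications of the k-ary method: 2^k - 2 + ceil(bits/k) * (k + 1)."""
    return (1 << k) - 2 + -(-bits // k) * (k + 1)


def rsa_time_ms(mults, cycles_per_mult=271, f_hz=125e6):
    """Time for `mults` Montgomery multiplications in milliseconds."""
    return mults * cycles_per_mult / f_hz * 1e3


best_k = min(range(1, 11), key=kary_mults)      # 6
t_square_multiply = rsa_time_ms(2048)           # about 4.4 ms
t_kary = rsa_time_ms(kary_mults(best_k))        # about 2.7 ms
```

For k = 5 the count is 1260, so the minimum at k = 6 (1259 multiplications) is shallow.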
Although it is very hard to compare implementations on
different FPGAs, it is clear that our implementation outperforms
the architecture presented in [5], especially when using
the k-ary method for modular exponentiation. In terms of
resources it should be noted that the k-ary method employs
extra block RAM to store $2^k$ pre-computed values. However,
in most cases this is not a problem, since recently designed
FPGAs provide a lot of block RAM.
6. CONCLUSION
This paper presented two Montgomery multiplication ar-
chitectures. The ﬁrst one is a baseline implementation of
the improved Montgomery algorithm. The second one is a
downsized pipelined version that includes optimizations to
reduce the length of the critical path. These optimizations
were achieved by using a carry-save representation for the
intermediate results. The performance results show that our
downsized pipelined implementation is much faster than the
state of the art in Montgomery multipliers.
[Figure 7, timing diagram: per-cycle schedule of S/CXYi, Hi, Ui,
S/CMUi, T0i, ci and S/CTi for the parts a, b, c and d of
iterations i = 0, 1, ..., 65.]
Figure 7: Pipelining schedule for the architecture in Fig. 6. The indices a, b, c and d denote the quarter
word length parts of the sum and carry, where a indicates the LSB part and d the MSB part. Note that the
SXYi and CXYi values pass through a 4-stage pipelining queue in order to be ready at the same time as the
corresponding SMUi and CMUi. The pipelining queue is omitted in this figure.
7. REFERENCES
[1] L. Batina and G. Muurling. Montgomery in Practice:
How to Do It More Efficiently in Hardware. In
B. Preneel, editor, Topics in Cryptology - CT-RSA
- The Cryptographers’ Track at the RSA Conference,
number 2271 in Lecture Notes in Computer Science,
pages 40–52, San Jose, USA, February 18-22 2002.
Springer-Verlag.
[2] V. Bunimov and M. Schimmler. Area and time
efficient modular multiplication of large integers. In
Proceedings of IEEE 14th International Conference on
Application-specific Systems, Architectures and
Processors (ASAP), pages 400–409. IEEE, 2003.
[3] Ç. K. Koç. High-radix and bit recoding techniques for
modular exponentiation. International Journal of
Computer Mathematics, 40(3+4):139–156, 1991.
[4] Ç. K. Koç, T. Acar, and B. S. Kaliski. Analyzing and
comparing Montgomery multiplication algorithms.
IEEE Micro, 16:26–33, 1996.
[5] K. Kelley and D. Harris. Parallelized very high radix
scalable Montgomery multipliers. In Conference
Record of the Thirty-Ninth Asilomar Conference on
Signals, Systems and Computers, pages 1196–1200,
2005.
[6] D. E. Knuth. The Art of Computer Programming:
Seminumerical Algorithms, volume 2. Addison-Wesley,
1981.
[7] P. Kocher, J. Jaffe, and B. Jun. Introduction to
differential power analysis and related attacks.
http://www.cryptography.com/dpa/technical, 1998.
[8] K. Manochehri and S. Pourmozafari. Fast
Montgomery modular multiplication by pipelined CSA
architecture. In Proceedings of International
Conference on Microelectronics (ICM), pages 144–147,
2004.
[9] C. McIvor, M. McLoone, J. McCanny, A. Daly, and
W. Marnane. Fast Montgomery Modular
Multiplication and RSA Cryptographic Processor
Architectures. In Proceedings of 37th Annual Asilomar
Conference on Signals, Systems and Computers, pages
379–384, November 2003.
[10] P. Montgomery. Modular multiplication without trial
division. Mathematics of Computation, 44:519–521,
1985.
[11] J.-J. Quisquater and C. Couvreur. Fast decipherment
algorithm for RSA public-key cryptosystem. Electronics
Letters, 18(21):905–907, 1982.
[12] R. L. Rivest, A. Shamir, and L. M. Adleman. A
method for obtaining digital signatures and public-key
cryptosystems. Communications of the ACM,
21(2):120–126, 1978.
[13] A. Tenca and Ç. K. Koç. A scalable architecture for
Montgomery multiplication. In Ç. K. Koç and C. Paar,
editors, Proceedings of 1st International Workshop on
Cryptographic Hardware and Embedded Systems
(CHES), number 1717 in Lecture Notes in Computer
Science, pages 94–108, Worcester, Massachusetts,
USA, August 12-13 1999. Springer-Verlag.
[14] C. Walter. Montgomery’s multiplication technique:
How to make it smaller and faster. In Ç. K. Koç and
C. Paar, editors, Proceedings of 1st International
Workshop on Cryptographic Hardware and Embedded
Systems (CHES), number 1717 in Lecture Notes in
Computer Science, pages 80–93, Worcester, MA, USA,
August 12-13 1999. Springer-Verlag.
[15] C. D. Walter. Montgomery exponentiation needs no
final subtraction. Electronics Letters, 35(21):1831–1832,
October 1999.
[16] Xilinx. Xilinx: The programmable logic company.
http://www.xilinx.com, 2006.