Montgomery Modular Multiplication Algorithm on Multi-Core Systems by Fan, Junfeng et al.
MONTGOMERY MODULAR MULTIPLICATION ALGORITHM ON MULTI-CORE
SYSTEMS
Junfeng Fan, Kazuo Sakiyama, and Ingrid Verbauwhede
Katholieke Universiteit Leuven, ESAT/SCD-COSIC,
Kasteelpark Arenberg 10
B-3001 Leuven-Heverlee, Belgium
ABSTRACT
In this paper, we investigate efficient software implementations
of the Montgomery modular multiplication algorithm on a
multi-core system. A HW/SW co-design technique is used to
find an efficient system architecture and instruction
scheduling method. We first implement the Montgomery modular
multiplication on a multi-core system with general-purpose
cores. We then speed it up by adopting the Multiply-Accumulate
(MAC) operation in each core. As a result, the performance
is improved by factors of 1.53 and 2.15 for 256-bit and
1024-bit Montgomery modular multiplication, respectively.
Index Terms— Montgomery Modular Multiplication,
Multi-core, Parallel Architectures
1. INTRODUCTION
Modular multiplication is a fundamental operation in many
popular Public Key Cryptography (PKC) algorithms such as
RSA [1] and ECC [2, 3]. As the division operation in modular
reduction is time-consuming, Montgomery [4] proposed
an algorithm in which this division is avoided. An integer X is
represented as X · R mod M, where M is the modulus and
R = 2^r is a radix that is coprime to M. This representation
is called the Montgomery residue. Multiplication is performed
on these residues, and division by M is replaced with division
by R.
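As a minimal illustration of the residue representation described above, the following Python sketch maps two integers into Montgomery form, multiplies them, and maps the result back. The parameter values and the helper names (to_mont, from_mont, mont_mul) are assumptions chosen for this example only; the reduction is written with a plain modular inverse for clarity rather than with the shift-based procedure of a real implementation.

```python
# Illustrative sketch only: M, R and the helper names are assumptions.
M = 239                  # odd modulus (arbitrary example value)
R = 1 << 8               # R = 2^r with r = 8, so gcd(R, M) = 1 and R > M
R_INV = pow(R, -1, M)    # modular inverse of R (needs Python 3.8+)

def to_mont(x):
    """Map x to its Montgomery residue x * R mod M."""
    return (x * R) % M

def from_mont(x_bar):
    """Map a Montgomery residue back to the ordinary representation."""
    return (x_bar * R_INV) % M

def mont_mul(a_bar, b_bar):
    """Return a_bar * b_bar * R^-1 mod M.

    Written with an explicit inverse here; the point of Montgomery's
    method is that this step can be done with divisions by R only.
    """
    return (a_bar * b_bar * R_INV) % M

x, y = 123, 45
z_bar = mont_mul(to_mont(x), to_mont(y))
# The product of two residues is again a residue, namely that of x * y.
assert z_bar == to_mont(x * y)
assert from_mont(z_bar) == (x * y) % M
```

The two assertions show why a chain of multiplications can stay in the residue domain and only needs to be converted back once at the end.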
So far, the Montgomery modular multiplication algorithm
has been widely implemented in both software [5, 6, 7] and
hardware [9, 10, 11]. Compared to software implementations,
hardware implementations are faster because a dedicated
data-path is used. However, their function is fixed and they
cannot be adapted to new algorithms. Software implementations
are flexible and can easily be modified to perform new
algorithms, but they are not fast enough for some real-time
applications. Therefore, it is necessary to combine the
advantages of software and hardware implementations.
In this paper, we investigate the implementation of the
Montgomery modular multiplication on a multi-core coprocessor.
Multi-core processors are chosen as the platform because
they have multiple data-paths and are completely programmable.
We use a Very Long Instruction Word (VLIW)
processor as a prototype. The Montgomery modular multipli-
cation is accelerated by performing parallel computation. The
bottleneck of this implementation is analyzed. We optimize
the platform by deploying multiply-accumulate instruction in
each core.
The rest of the paper is organized as follows. Section 2
briefly reviews previous work on the Montgomery algorithm
and its implementations. In section 3, we describe the archi-
tecture of our platforms. The instruction scheduling method is
proposed in section 4. Section 5 proposes a modified platform
to speed up the computation. Finally, we show the implemen-
tation results in section 6 and conclude the paper including
future work in section 7.
2. PREVIOUS WORK
The Montgomery modular multiplication algorithm was designed
to avoid division in modular multiplications. Given two
n-bit inputs, X and Y, this algorithm gives Z = X · Y · R^{-1}
mod M, where R equals 2^n and M is the n-bit modulus.
Algorithm 1 shows the radix-2^w Montgomery modular
multiplication algorithm in detail. A modified Montgomery
multiplication algorithm was proposed to avoid the conditional
final subtraction by choosing a suitable R [12].
As shown in Algorithm 1, the operands X, Y and M are
divided into w-bit words. At the beginning of each iteration,
X_0 · Y_i is calculated to generate T. After the generation of T,
the multiplication X · Y_i and the reduction are performed
together by computing Z = Z + X · Y_i + M · T. After that, Z_0
is always 0. The division of Z by r is performed by
shifting Z one word to the right. After s iterations and one
conditional subtraction, Z = X · Y · R^{-1} mod M is obtained.
As Algorithm 1 scans the operands X and M from the Least
Significant Bit (LSB) to the Most Significant Bit (MSB)
simultaneously, it is also called Finely Integrated Operand
Scanning (FIOS).
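The iteration described above can be sketched in Python with arbitrary-precision integers. This is a minimal operand-level sketch, not the scheduled multi-core program of this paper; the function name montgomery_multiply and its argument order are assumptions made for the example, and the final comparison uses >= so that a result equal to M is reduced to 0.

```python
def montgomery_multiply(X, Y, M, n, w):
    """Radix-2^w Montgomery multiplication (operand-level sketch).

    Returns X * Y * R^-1 mod M with R = 2^(s*w) and s = ceil(n / w).
    Assumes M is odd and 0 <= X, Y < M.
    """
    r = 1 << w
    s = -(-n // w)                        # s = ceil(n / w)
    M_prime = (-pow(M, -1, r)) % r        # M' = -M^-1 mod r
    Z = 0
    for i in range(s):
        Y_i = (Y >> (i * w)) & (r - 1)    # i-th w-bit word of Y
        # T is chosen so that the lowest word of the new sum is zero.
        T = (((Z & (r - 1)) + (X & (r - 1)) * Y_i) * M_prime) % r
        # The sum is divisible by r, so the right shift is an exact division.
        Z = (Z + X * Y_i + M * T) >> w
    if Z >= M:                            # final conditional subtraction
        Z -= M
    return Z

# Small check with arbitrary example values: n = 8, w = 4, so R = 2^8.
M, X, Y = 239, 123, 45
assert montgomery_multiply(X, Y, M, 8, 4) == (X * Y * pow(1 << 8, -1, M)) % M
```

Because Z stays below 2M after every iteration when X < M, a single conditional subtraction at the end is enough.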
1-4244-1222-6/07/$25.00 ©2007 IEEE SiPS 2007
Algorithm 1 Radix-2^w Montgomery modular multiplication (FIOS) [13]
Input: integers M = (M_{s-1}, ..., M_0)_r, X = (X_{s-1}, ..., X_0)_r,
  Y = (Y_{s-1}, ..., Y_0)_r, where 0 ≤ X, Y < M, r = 2^w, s = ⌈n/w⌉,
  R = r^s with gcd(M, r) = 1 and M' = −M^{-1} mod r.
Output: X · Y · R^{-1} mod M
  Z = (Z_{s-1}, ..., Z_0)_r ← 0
  for i = 0 to s − 1 do
    T ← (Z_0 + X_0 · Y_i) · M' mod r
    Z ← (Z + X · Y_i + M · T)/r
  end for
  if Z ≥ M then
    Z ← Z − M
  end if
  return Z
In recent years, the Montgomery modular multiplication
has been widely implemented in software and hardware. For
example, in [14] it was implemented on an 8-bit microcontroller.
In [6] it was implemented on a high-end TI DSP
(TMS320C6201). Großschädl [7] showed that the software
implementations on general-purpose CPUs can be sped up by
extending the ISA. These software implementations are highly
flexible, but their performance is limited. Hardware
implementations of the Montgomery multiplication have
also been widely investigated. Researchers have deployed various
architectures, such as bipartite multipliers [15] and systolic
arrays [16, 17, 18], to achieve high throughput. In order to obtain
some flexibility, a reconfigurable datapath [11] for the
Montgomery modular multiplication has also been explored. However,
there is still a gap between flexibility and performance.
One way to bridge the gap is to use parallel computation with
programmable devices, e.g., a dual-MAC DSP [6].
In this paper, we investigate the implementation of the
Montgomery modular multiplication on a multi-core system.
A VLIW processor with general purpose cores is proposed.
According to the implementation result, we optimize it by de-
ploying the MAC instruction in each core. A new instruction
scheduling method is also introduced to achieve high paralle-
lism.
3. OUR DESIGN PLATFORM
In order to achieve an efficient and flexible implementation,
the HW/SW co-design method is used. A quick and correct
evaluation of cost and performance for various hardware con-
figurations and software programs is needed during the de-
sign process. Thus, we use a simulation environment, called
GEZEL [20], which allows us to estimate immediate system
performance in a cycle-accurate manner before synthesizing
the entire design. The GEZEL code can be automatically con-
verted to VHDL code and then synthesized.
Our first design platform, referred to as platform-I, is a VLIW
processor with general-purpose cores. As shown in Figure 1,
this platform consists of a main controller, a data memory, an
instruction memory and several cores. Only the main controller
can access the instruction memory and the data memory.
The main controller fetches instructions from the instruction
memory and dispatches them to all cores in parallel via the
instruction bus. Each core executes arithmetic instructions in
parallel, and stores the results in its register file. The data
memory has only one read/write port, so only a single data
memory access is allowed in each cycle.
Fig. 1. Platform-I architecture. (w = 32).
The block diagram of the core is also shown in Figure 1.
We denote by w the operation size of the cores. Each core is a
highly simplified Load/Store CPU. It has an instruction decoder,
a register file with sixteen 32-bit registers and a status register.
The Arithmetic Logic Unit (ALU) includes one 32-bit multiplier
and one 32-bit adder. The core also has an output register to
store the data that will be written to the data memory, and an
input register to buffer the data from the data memory. Both of
them are 32 bits wide. One Write Back (WB) register is also used
to store data from the ALU.
The cores here support a simple Load/Store Instruction
Set Architecture (ISA). As shown in Table 1, this simplified
ISA has only 8 general instructions. Here #Addr denotes a
memory address. Instructions for each core are 16-bit long. All
the arithmetic operations are performed among data stored in
the register file. When data needs to be moved from one core
to another, it is first stored to the data memory, then loaded by
the destination core. Cores in this platform support a 4-stage
instruction pipelining: namely, Instruction fetch and decoding,
Register fetch, Execute and Register write back.
Table 1. Instruction sets for each core.
Opc      Opr 1    Opr 2    Opr 3    Description
(4-bit)  (4-bit)  (4-bit)  (4-bit)
Nop                                 No operation
Load     Ri       #Addr             Load the data from location Addr of the data memory into register Ri
Store    Ri       #Addr             Store the data of register Ri to location Addr of the data memory
Mul      Ri       Rj       Rk       {R(i+1),Ri} = Rj · Rk
Add      Ri       Rj       Rk       {Ca,Ri} = Rj + Rk; Ca is the carry out and is stored in the status register
Adc      Ri       Rj       Rk       {Ca,Ri} = Rj + Rk + Ca
Sub      Ri       Rj       Rk       Ri = Rj - Rk - Ca
Suc      Ri       Rj       Rk       Conditional Sub
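To make the register-level semantics of Table 1 concrete, the following Python sketch models the Mul, Add and Adc instructions of one core and uses Add followed by Adc to add two 64-bit values held in register pairs. The class name Core and its method names are hypothetical, chosen only for this illustration; Load, Store, Sub and Suc are left out because their exact behaviour is not fully specified in the table.

```python
MASK = 0xFFFFFFFF  # w = 32


class Core:
    """Toy model of one platform-I core: sixteen 32-bit registers and a carry flag."""

    def __init__(self):
        self.R = [0] * 16
        self.Ca = 0

    def mul(self, i, j, k):
        # {R(i+1), Ri} = Rj * Rk
        p = self.R[j] * self.R[k]
        self.R[i] = p & MASK
        self.R[i + 1] = p >> 32

    def add(self, i, j, k):
        # {Ca, Ri} = Rj + Rk
        t = self.R[j] + self.R[k]
        self.R[i], self.Ca = t & MASK, t >> 32

    def adc(self, i, j, k):
        # {Ca, Ri} = Rj + Rk + Ca
        t = self.R[j] + self.R[k] + self.Ca
        self.R[i], self.Ca = t & MASK, t >> 32


# 64-bit addition a + b with a in (R1, R0) and b in (R3, R2); result in (R5, R4).
core = Core()
a, b = 0x00000001FFFFFFFF, 0x0000000000000001
core.R[0], core.R[1] = a & MASK, a >> 32
core.R[2], core.R[3] = b & MASK, b >> 32
core.add(4, 0, 2)   # low words, sets Ca
core.adc(5, 1, 3)   # high words plus the carry from the low words
assert (core.R[5] << 32) | core.R[4] == a + b
```

The example shows why the carry flag matters for multi-word arithmetic: the Adc must follow the Add that produced the carry, which is the dependency that Section 4 has to schedule around.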
4. INSTRUCTION SCHEDULING
The Montgomery modular multiplication algorithm is parti-
tioned and mapped to each core. In order to achieve a high
performance, the instructions are manually scheduled so that
all the cores are utilized efficiently. The instruction scheduling
method is the essential part of the software implementation.
The data dependency of the Montgomery algorithm is analyzed
in Figure 2. The main dependency is due to the carries of
additions. Taking FIOS shown in Algorithm 1 as an example,
in each iteration Z_j is replaced by (Z_j + (X · Y_i)_j + (M ·
T)_j + Ca), where Ca is the carry. Clearly, X_j · Y_i, for any
0 ≤ i, j ≤ s − 1, depends only on the operands X and Y.
We can also calculate M_j · T immediately after the generation
of T. The products with the same weight as Z_j and the carry
from Z_{j−1} are accumulated into Z_j, generating a new Z_j and
2-bit carries. As a result, Z_j can only be generated after the
carry from Z_{j−1} is ready.
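The carry chain described above can be made explicit with a word-level Python sketch of one outer iteration. It is a simplified model under stated assumptions: the helper names (fios_iteration, to_words, from_words, montgomery_words) are hypothetical, Z is kept with one extra guard word, and each full product is added at position j so that the carry variable also absorbs the high product words instead of being split as in Figure 2.

```python
def fios_iteration(Z, X, M, Y_i, M_prime, w):
    """One outer iteration of Algorithm 1 on lists of w-bit words (least significant first).

    Z has s + 1 words (one guard word); X and M have s words each.
    """
    r = 1 << w
    s = len(X)
    T = ((Z[0] + X[0] * Y_i) * M_prime) % r
    carry = 0
    out = []
    for j in range(s):
        # Word j cannot be finished before the carry from word j - 1 exists.
        acc = Z[j] + X[j] * Y_i + M[j] * T + carry
        out.append(acc % r)
        carry = acc // r
    acc = Z[s] + carry
    out.append(acc % r)
    out.append(acc // r)
    assert out[0] == 0        # the choice of T always clears the lowest word
    return out[1:]            # dropping the lowest word is the division by r


def to_words(v, s, w):
    return [(v >> (j * w)) & ((1 << w) - 1) for j in range(s)]


def from_words(words, w):
    return sum(x << (j * w) for j, x in enumerate(words))


def montgomery_words(X, Y, M, s, w):
    """Full multiplication built from s word-level iterations (M must be odd)."""
    r = 1 << w
    M_prime = (-pow(M, -1, r)) % r
    Xw, Yw, Mw = (to_words(v, s, w) for v in (X, Y, M))
    Z = [0] * (s + 1)
    for i in range(s):
        Z = fios_iteration(Z, Xw, Mw, Yw[i], M_prime, w)
    z = from_words(Z, w)
    return z - M if z >= M else z
```

In this form the products X[j] * Y_i and M[j] * T are independent of each other for different j and could be computed in parallel; only the running carry forces a sequential order.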
As shown in Figure 2, we need to add Zj with four w-bit
data and 2-bit carries. In hardware implementations, cascaded
Carry Save Adders (CSAs) can be used to construct a 6-to-2
CSA. The carry can also be saved in a 2-bit register or trans-
ferred to another PE. However, in general purpose processors
these special features are not available. Normally only general
adders with a fixed length are used. The carry is saved in the
status register after an Add instruction. In order to keep the
1-bit carry for future use, one instruction is needed to copy
it from the status register to a general register. It will be very
inefficient to use carries generated by another core, since it
needs to be stored to register file first, and then transferred via
the data memory.
Therefore, it is desirable to partition the algorithm so
that a carry is only used in the core where it was generated.
Fig. 2. Data dependency of FIOS Montgomery algorithm.
Note
that in order to generate T, only Z_0 must be ready at the end
of the previous iteration, while (Z_{s−1}, ..., Z_1) can be generated
later. Based on this observation, an instruction scheduling
method is proposed and is shown in Figure 3. In this method,
each iteration in Algorithm 1 is performed by multiple cores.
Here we choose n = 256, w = 32 and s = ⌈n/w⌉ = 8. During
the whole loop, (Z_1, Z_0) is generated and stored in core-1,
(Z_3, Z_2) in core-2, (Z_5, Z_4) in core-3 and (Z_7, Z_6) in core-4.
Each carry is only used in the local core. At the end of each
iteration, Z_1 is sent to core-1, Z_3 is sent to core-2 and Z_5 is sent
to core-3. After 8 iterations and a conditional subtraction,
Z = X · Y · R^{-1} mod M is generated and stored separately
in the four cores. Z can be written to the data memory or can be
used by another modular multiplication.
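The partitioning idea can be explored with a small functional simulation. The sketch below is an assumption about one way the slices could be kept consistent, not a reproduction of the schedule in Figure 3: each simulated core holds a two-word slice of X and M plus a local accumulator, carries never leave a core, and at the end of an iteration every core hands only its lowest word to its lower neighbour. The function name montgomery_partitioned and all variable names are hypothetical, and cores are indexed from 0 here (core 0 corresponds to core-1 in the text).

```python
def montgomery_partitioned(X, Y, M, w=32, cores=4, words_per_core=2):
    """Simulate Algorithm 1 with Z, X and M split into per-core slices (M odd)."""
    r = 1 << w
    s = cores * words_per_core
    slice_bits = w * words_per_core
    slice_mask = (1 << slice_bits) - 1
    M_prime = (-pow(M, -1, r)) % r
    # Core c keeps its own words of X and M for the whole computation.
    Xs = [(X >> (c * slice_bits)) & slice_mask for c in range(cores)]
    Ms = [(M >> (c * slice_bits)) & slice_mask for c in range(cores)]
    # Invariant: Z == sum(A[c] << (c * slice_bits)); overflow stays inside A[c].
    A = [0] * cores
    for i in range(s):
        Y_i = (Y >> (i * w)) & (r - 1)
        # T only needs Z_0, which lives in core 0.
        T = (((A[0] % r) + (Xs[0] % r) * Y_i) * M_prime) % r
        # Every core updates its slice independently (in parallel on hardware).
        A = [A[c] + Xs[c] * Y_i + Ms[c] * T for c in range(cores)]
        # Division by r: shift locally, then pass the lowest word one core down.
        low = [a % r for a in A]
        A = [a >> w for a in A]
        for c in range(1, cores):
            A[c - 1] += low[c] << (slice_bits - w)
        # low[0] is zero by the choice of T and is simply dropped.
    Z = sum(a << (c * slice_bits) for c, a in enumerate(A))
    return Z - M if Z >= M else Z


# Check against the definition for arbitrary 256-bit example operands.
M = (1 << 255) - 19
X = 0x1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF
Y = 0x0FEDCBA9876543210FEDCBA9876543210FEDCBA9876543210FEDCBA987654321
assert montgomery_partitioned(X, Y, M) == (X * Y * pow(1 << 256, -1, M)) % M
```

The simulation only checks functional correctness of the partition; it says nothing about cycle counts, which depend on the actual instruction schedule and the single-port data memory.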
This method has two advantages. First, it utilizes all
four ALUs efficiently by symmetrically partitioning the
Montgomery modular multiplication algorithm. Second, operands
and intermediate data are distributed over the register files of
the cores, so fewer registers are required in each core. According
to Figure 3, core-1 only needs to store (X_1, X_0), (M_1, M_0)
and (Z_1, Z_0). During the whole computation they can stay in
the register file. As a result, the number of load and store
operations is reduced.
When using one core to perform 256-bit Montgomery
modular multiplication, 644 clock cycles are required. When
using 4 cores, we need only 217 clock cycles. That is, the 4-core
based implementation is 2.96 times faster than the single-core
based implementation.
The implementation result is summarized in Table 2. According
to the table, the bottleneck of this implementation is
the addition operations. The number of addition operations is
almost three times the number of multiplications. In
platform-I, one Add instruction consumes one clock cycle,
just as one Mul instruction does. In order to improve the
performance of this implementation, addition operations need
to be accelerated.
Table 2. Number of each operation in one 256-bit Montgomery
modular multiplication: Platform-I
Mul   Add   Sub   Load   Store   Nop   Total (cycles)
136   372   16    113    44      187   217
Fig. 3. Instruction scheduling method. (n = 256, w = 32, s = ⌈n/w⌉ = 8).
5. PERFORMANCE SPEEDUP
As shown in Algorithm 1, in the k-th iteration we perform (Ca,
Z_{i+1}, Z_i) = X_i · Y_k + M_i · T + (Z_{i+1}, Z_i), where 0 ≤
i, k < s. This operation can be efficiently performed with
two MAC operations, (Ca, Z_{i+1}, Z_i) = X_i · Y_k + (Z_{i+1}, Z_i)
and (Ca, Z_{i+1}, Z_i) = M_i · T + (Z_{i+1}, Z_i). Here the Ca from
the first MAC operation needs to be saved before it is replaced
by that of the second one. Based on this observation, we propose
a revised multi-core platform, platform-II. Compared to
platform-I, cores in platform-II have one more 32-bit adder.
The block diagram of the modified core is shown in Figure 4.
Fig. 4. Block diagram of the cores in platform-II. (w = 32).
In platform-II, each core contains a multiplier and two
adders. Besides the ISA shown in Table 1, two more instructions
are supported by platform-II.
MAC Rc,Ra,Rb
Adw Rc,Ra,Rb
Here Rc+1 is specified implicitly. The MAC instruction performs
(Ca,Rc+1,Rc) = (Rc+1,Rc) + Ra*Rb + Ca, and the Adw
instruction performs (Ca,Rc+1,Rc) = (Rc+1,Rc) + (Ra,Rb) + Ca.
When executing MAC and Adw, we need to read four data words
from Rc+1, Rc, Ra and Rb, and write two data words back to
Rc+1 and Rc. As a result, the register file needs
four read ports and two write ports. As increasing the number
of read/write ports causes a drastic increase in area, a register
file with two separate banks, bank odd and bank even, is
used. Each bank contains eight 32-bit registers and has one
write port and two read ports. When performing MAC and Adw
instructions, Ra and Rb are always in different banks, and so
are Rc+1 and Rc.
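The semantics of the two added instructions can be sketched in Python as follows. The functions mac and adw and the register assignment in the example are illustrative assumptions; the carry flag is passed in and returned explicitly so that the value produced by the first MAC can be kept before the second MAC overwrites it, as the text requires.

```python
MASK = 0xFFFFFFFF  # w = 32


def mac(R, c, a, b, Ca):
    """(Ca, R[c+1], R[c]) = (R[c+1], R[c]) + R[a] * R[b] + Ca; returns the new Ca."""
    t = ((R[c + 1] << 32) | R[c]) + R[a] * R[b] + Ca
    R[c], R[c + 1] = t & MASK, (t >> 32) & MASK
    return t >> 64


def adw(R, c, a, b, Ca):
    """(Ca, R[c+1], R[c]) = (R[c+1], R[c]) + (R[a], R[b]) + Ca; returns the new Ca."""
    t = ((R[c + 1] << 32) | R[c]) + ((R[a] << 32) | R[b]) + Ca
    R[c], R[c + 1] = t & MASK, (t >> 32) & MASK
    return t >> 64


# (Ca, Z_{i+1}, Z_i) = X_i * Y_k + M_i * T + (Z_{i+1}, Z_i) with two MACs.
# Assumed register use: (R1, R0) = Z pair, R2 = X_i, R3 = Y_k, R4 = M_i, R5 = T,
# so that both source operands and both destination words sit in different banks.
R = [0] * 16
R[0], R[1] = MASK, MASK
R[2] = R[3] = R[4] = R[5] = MASK
expected = ((R[1] << 32) | R[0]) + R[2] * R[3] + R[4] * R[5]
ca_first = mac(R, 0, 2, 3, 0)    # save this carry before issuing the next MAC
ca_second = mac(R, 0, 4, 5, 0)
assert ((ca_first + ca_second) << 64) | (R[1] << 32) | R[0] == expected
```

With all operands at their maximum value, both MACs produce a carry, which illustrates why the first carry cannot simply be discarded.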
We use the same instruction scheduling method. The
implementation result of the 256-bit modular multiplication is
summarized in Table 3. The number of addition operations on
platform-II is only about 30% of that on platform-I. For one
256-bit modular multiplication, the total number of cycles
is about 42% less than that of the implementation on
platform-I.
Table 3. Number of each operation in one 256-bit Montgomery
modular multiplication: Platform-II
Mac   Mul   Add   Sub   Load   Store   Nop   Total (cycles)
128   8     114   16    93     32      109   125
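The figures quoted for the two platforms can be cross-checked with a few lines of Python using only the numbers in Tables 2 and 3. The interpretation that every core issues exactly one instruction (possibly a Nop) per cycle is an assumption of this sketch; it appears consistent with the tables, since the operation counts add up to four times the cycle count in both cases.

```python
CORES = 4
platform_1 = {"Mul": 136, "Add": 372, "Sub": 16, "Load": 113, "Store": 44, "Nop": 187}
platform_2 = {"Mac": 128, "Mul": 8, "Add": 114, "Sub": 16, "Load": 93, "Store": 32, "Nop": 109}
cycles_1, cycles_2 = 217, 125

# Assumed accounting: issue slots = cores * cycles, one instruction per slot.
assert sum(platform_1.values()) == CORES * cycles_1
assert sum(platform_2.values()) == CORES * cycles_2

add_per_mul = platform_1["Add"] / platform_1["Mul"]    # about 2.7 on platform-I
add_ratio = platform_2["Add"] / platform_1["Add"]      # about 0.31
cycle_saving = 1 - cycles_2 / cycles_1                 # about 0.42
nop_share_1 = platform_1["Nop"] / (CORES * cycles_1)   # idle share of issue slots
nop_share_2 = platform_2["Nop"] / (CORES * cycles_2)
```

Under this accounting, roughly one fifth of the issue slots on either platform are Nops, which is in line with the remark in Section 7 that data transfers between cores are a target for future improvement.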
6. RESULTS
The multi-core platform proposed in section 3 is implemented
with GEZEL. The GEZEL code is automatically converted
to synthesizable VHDL code. The software program of the
Montgomery modular multiplication is stored in the instruction
memory. The operands, X, Y and M, are stored in the data
memory.
Table 4. Performance comparison of modular multiplication.
Reference                  Description                Platform                  Area      Freq.   256-bit     1024-bit
                                                                                (Slices)  (MHz)   time (μs)   time (μs)
This work (Platform-I)     4 cores, 4 32x32 mults     Xilinx XC2VP30            3873      93      2.3         44.0
This work (Platform-II)    4 cores, 4 32x32 mults     Xilinx XC2VP30            4233      81      1.5         20.4
Tenca & Koç [5]            Software implementation    ARM processor             -         80      43          570
Cohen et al. [21]          Software implementation    UltraSPARC, GMP library   -         143     14.6†       -
Itoh et al. [6]            Software implementation    DSP TMS320C6201           -         200     2.68‡       -
Brown et al. [22]          Software implementation    Pentium II                -         400     1.57§       -
Sakiyama et al. [10]       CSAs based Dual-Field      Xilinx XC2VP30            4836      110.4   0.80        -
Kelley et al. [9]          4-PEs, 8 16x16 mults       Xilinx XC2V2000-6         360*      135     0.68        8.3
Mentens et al. [11]        34 16x16 mults             Xilinx XC2VP30            5500      125     0.17*       2.1
* Author’s estimation from the original paper. † 224-bit modular multiplication.
‡ 239-bit Montgomery modular multiplication. § Using fixed modulus for fast reduction.
For the purpose of checking the maximum frequency, the
platform is implemented on Xilinx Virtex-II PRO (XC2VP30)
FPGA. A maximum frequency of 93 MHz could be achieved
for the platform-I and 81 MHz for the platform-II. The in-
struction memory and the data memory are implemented in
the block RAM on the FPGA board. The number of slices
here only includes the main controller and cores. The perfor-
mance comparison between our software implementations and
the state-of-the-art implementations is summarized in Table 4.
As shown in Table 4, the 256-bit modular multiplication
on platform-II is almost 28 times faster than the
implementation on the ARM processor [5] and almost 9 times faster
than the implementation on the UltraSPARC processor [21].
Compared to the implementation on TI’s dual-MAC DSP
(TMS320C6201), our implementation is about 1.78 times
faster. The implementation of [22] obtains a high performance,
but only supports a fixed modulus. Compared to the
state-of-the-art hardware implementations [9, 10, 11], software
implementations are still much slower. This is because a dedicated
datapath is used in hardware. For example, in [11] 34 multipliers
are used and one iteration of Algorithm 1 can be finished in one
clock cycle.
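The ratios quoted above follow from the entries of Table 4, as the short Python calculation below shows. It uses only numbers from the tables; note that the software results being compared are for slightly different operand sizes (224-bit and 239-bit, see the table footnotes), so the factors are indicative only.

```python
# Execution time in microseconds = cycles / frequency in MHz (256-bit case).
t_platform_1 = 217 / 93      # about 2.33
t_platform_2 = 125 / 81      # about 1.54

# Factors computed from the rounded times listed in Table 4.
speedup_256 = 2.3 / 1.5          # platform-II vs platform-I, about 1.53
speedup_1024 = 44.0 / 20.4       # about 2.16 (quoted as 2.15 in the text)
vs_arm = 43 / 1.5                # about 28.7
vs_ultrasparc = 14.6 / 1.5       # about 9.7
vs_dsp = 2.68 / 1.5              # about 1.79
```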
7. CONCLUSIONS
In this paper, we introduced an efficient software implementa-
tion of the Montgomery multiplication algorithm on a multi-
core system. A prototype of general multi-core systems is im-
plemented. We proposed a scheduling method and based on
the implementation result a new platform is proposed to im-
prove the performance. The new platform supports multiply-
accumulate instructions and can accelerate the calculation by
a factor of 1.53 and 2.15 when 256-bit and 1024-bit Montgo-
mery modular multiplication are performed, respectively.
Our future work includes speeding up the data transfers
between different cores and downsizing the whole platform.
We believe that by improving the data transfer scheme a hig-
her performance could be achieved without losing flexibility.
This platform can also be used to perform other algorithms,
e.g., modular inversion using Extended Euclidean Algorithm
(EEA).
Acknowledgments
Junfeng Fan and Kazuo Sakiyama are funded by a research
grant of the Katholieke Universiteit Leuven and FWO projects
(G.0450.04, G.0475.05). This work was supported in part by
the IAP Programme P6/26 BCRYPT of the Belgian State (Bel-
gian Science Policy), by the EU IST FP6 projects (SESOC and
ECRYPT), by the K. U. Leuven, and by the IBBT-QoE project
of the IBBT.
8. REFERENCES
[1] R. L. Rivest, A. Shamir and L. M. Adleman. A method
for obtaining digital signatures and public-key cryptosystems.
Communications of the ACM, 21(2):120-126, 1978.
[2] N. Koblitz. Elliptic curve cryptosystem. Math. Comp.,
48:203-209, 1987.
[3] V. Miller. Uses of elliptic curves in cryptography. In H. C.
Williams, editor, Advances in Cryptology: Proceedings
of CRYPTO’85, number 218 in LNCS, pages 417-426.
Springer-Verlag, 1985.
[4] P. Montgomery. Modular multiplication without trial
division. Mathematics of Computation, 44:519-521, 1985.
[5] A. Tenca and Ç. K. Koç. A scalable architecture for
modular multiplication based on Montgomery’s algorithm.
IEEE Transactions on Computers, 52(9):1215-1221,
September 2003.
[6] K. Itoh, M. Takenaka, N. Torii, S. Temma, and Y. Kurihara.
Fast implementation of public-key cryptography on
a DSP TMS320C6201. Proceedings of Cryptographic
Hardware and Embedded Systems - CHES’99, LNCS
1717, pp. 61-72, Springer-Verlag, 1999.
[7] J. Großschädl, K. C. Posch, and S. Tillich. Architectural
Enhancements to Support Digital Signal Processing and
Public-Key Cryptography. Proceedings of the 2nd Workshop
on Intelligent Solutions in Embedded Systems (WISES 2004),
pp. 129-143, Graz, Austria, June 25, 2004.
[8] S. E. Eldridge and C. D. Walter. Hardware implementation
of Montgomery’s modular multiplication algorithm.
IEEE Transactions on Computers, 42(6):693-699, June
1993.
[9] K. Kelley and D. Harris. Parallelized very high radix
scalable Montgomery multipliers. Conference on Signals,
Systems and Computers, pages 1196-1200, 2005.
[10] K. Sakiyama, B. Preneel and I. Verbauwhede. A fast
dual-field modular arithmetic logic unit and its hardware
implementation. Proceedings of IEEE International
Symposium on Circuits and Systems (ISCAS 2006), pages
787-790, 2006.
[11] N. Mentens, K. Sakiyama, B. Preneel, and I. Verbauwhede.
Efficient Pipelining for Modular Multiplication
Architectures in Prime Fields. Proceedings of the 2007
Great Lakes Symposium on VLSI (GLSVLSI 2007), 2007.
[12] C. D. Walter. Montgomery’s exponentiation needs no
final subtraction. Electronics Letters, 35(21):1831-1832,
October 1999.
[13] Ç. K. Koç, T. Acar and B. S. Kaliski. Analyzing and
comparing Montgomery multiplication algorithms.
IEEE Micro, 16:26-33, 1996.
[14] N. Gura, A. Patel, A. Wander, H. Eberle and S. C. Shantz.
Comparing Elliptic Curve Cryptography and RSA on
8-bit CPUs. Proceedings of Cryptographic Hardware and
Embedded Systems - CHES’04, LNCS 3156, pp. 119-132,
Springer-Verlag, 2004.
[15] M. E. Kaihara and N. Takagi. Bipartite modular
multiplication. Proceedings of Cryptographic Hardware and
Embedded Systems - CHES 2005, number 3659 in Lecture
Notes in Computer Science, pages 201-210, September
2005. Springer-Verlag.
[16] K. Iwamura, T. Matsumoto, and H. Imai. High-speed
implementation methods for RSA scheme. In R. A. Rueppel,
editor, Advances in Cryptology: Proceedings of
EUROCRYPT92, number 658 in Lecture Notes in Computer
Science, pages 221-238. Springer-Verlag, 1992.
[17] L. Batina and G. Muurling. Montgomery in practice:
How to do it more efficiently in hardware. In B. Preneel,
editor, Proceedings of RSA 2002 Cryptographers Track,
number 2271 in Lecture Notes in Computer Science,
pages 40-52, San Jose, USA, February 18-22 2002.
Springer-Verlag.
[18] T. Blum and C. Paar. Montgomery modular exponentiation
on reconfigurable hardware. In Proceedings of 14th
IEEE Symposium on Computer Arithmetic, pages 70-77,
Adelaide, Australia, April 14-16 1999.
[19] S. H. Tang, K. S. Tsui and P. H. W. Leong. Modular
exponentiation using parallel multipliers. Proceedings of
the 2003 IEEE International Conference on Field
Programmable Technology (FPT), Tokyo, 52-59, 2003.
[20] P. Schaumont and I. Verbauwhede. Interactive cosimulation
with partial evaluation. Proc. Design Automation
and Test in Europe (DATE 2004), pp. 642-647, 2004.
[21] H. Cohen, A. Miyaji and T. Ono. Efficient elliptic curve
exponentiation using mixed coordinates. Asiacrypt’98,
LNCS 1514, pp. 51-65, Springer-Verlag, 1998.
[22] M. Brown, D. Hankerson, J. López and A. Menezes.
Software implementation of the NIST elliptic curves over
prime fields. Topics in Cryptology, CT-RSA 2001, LNCS
2020, pp. 250-265, Springer-Verlag, 2001.
