In this paper, we investigate efficient software implementations of the Montgomery modular multiplication algorithm on a multi-core system. A HW/SW co-design technique is used to find an efficient system architecture and instruction scheduling method. We first implement the Montgomery modular multiplication on a multi-core system with general purpose cores. We then speed it up by adopting the Multiply-Accumulate (MAC) operation in each core. As a result, the performance is improved by factors of 1.53 and 2.15 for 256-bit and 1024-bit Montgomery modular multiplication, respectively.
INTRODUCTION
Modular multiplication is a fundamental operation in many popular Public Key Cryptography (PKC) algorithms such as RSA [1] and ECC [2, 3]. As the division operation in modular reduction is time-consuming, Montgomery [4] proposed an algorithm in which division is avoided. An integer X is represented as $X \cdot R \bmod M$, where M is the modulus and $R = 2^r$ is a radix that is coprime to M. This representation is called the Montgomery residue. Multiplication is performed on these residues, and division by M is replaced with division by R.
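The Montgomery representation described above can be illustrated with a minimal Python sketch. The toy modulus, the radix size and the helper names `to_mont`, `from_mont` and `mont_mul` are assumptions chosen for illustration and do not come from the paper. The explicit modular inverse is used only to keep the example short; the point of the algorithm is to replace it with shifts by R.

```python
# Toy parameters chosen for illustration only (not taken from the paper).
M = 97                     # odd modulus, so gcd(M, R) = 1
R_BITS = 8                 # R = 2**R_BITS must exceed M
R = 1 << R_BITS
R_INV = pow(R, -1, M)      # modular inverse of R (Python 3.8+)


def to_mont(x):
    """Map x to its Montgomery residue x * R mod M."""
    return (x * R) % M


def from_mont(x_bar):
    """Map a Montgomery residue back to the ordinary representation."""
    return (x_bar * R_INV) % M


def mont_mul(a_bar, b_bar):
    """Reference Montgomery product a_bar * b_bar * R^-1 mod M.

    Written with an explicit inverse for clarity; a real implementation
    replaces this with word shifts, since R is a power of two.
    """
    return (a_bar * b_bar * R_INV) % M


x, y = 42, 77
z_bar = mont_mul(to_mont(x), to_mont(y))   # product stays in residue form
print(from_mont(z_bar) == (x * y) % M)
```

Because the product of two residues is again a residue, a chain of multiplications (for example in a modular exponentiation) needs the conversions only once at the start and once at the end.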
So far, the Montgomery modular multiplication algorithm has been widely implemented in both software [5, 6, 7] and hardware [9, 10, 11]. Compared to the software implementations, the hardware implementations are faster because a dedicated data-path is used. However, they are fixed in function and cannot be adapted to new algorithms. The software implementations are flexible and can be easily modified to perform new algorithms, but they are not fast enough for some real-time applications. Therefore, it is necessary to combine the advantages of both software and hardware implementations.
In this paper, we investigate the implementation of the Montgomery modular multiplication on a multi-core coprocessor. Multi-core processors are chosen as the platform because they have multiple data-paths and are completely programmable. We use a Very Long Instruction Word (VLIW) processor as a prototype. The Montgomery modular multiplication is accelerated by performing parallel computation. The bottleneck of this implementation is analyzed. We optimize the platform by deploying a multiply-accumulate instruction in each core.
The rest of the paper is organized as follows. Section 2 briefly reviews previous work on the Montgomery algorithm and its implementations. In section 3, we describe the architecture of our platforms. The instruction scheduling method is proposed in section 4. Section 5 proposes a modified platform to speed up the computation. Finally, we show the implementation results in section 6 and conclude the paper including future work in section 7.
PREVIOUS WORK
The Montgomery modular multiplication algorithm was designed to avoid division in modular multiplications. Given two n-bit inputs, X and Y, this algorithm gives $X \cdot Y \cdot R^{-1} \bmod M$, where $R = 2^n$ and M is the n-bit modulus. Algorithm 1 shows the radix-$2^w$ Montgomery modular multiplication algorithm in detail. A modified Montgomery multiplication algorithm was proposed to avoid the conditional final subtraction by choosing a suitable R [12].
As shown in Algorithm 1, the operands X, Y and M are divided into w-bit words. At the beginning of each iteration, $X_0 \cdot Y_i$ is calculated to generate T. After the generation of T, the multiplication $X \cdot Y_i$ and the reduction of C are performed together by computing $Z = Z + X \cdot Y_i + M \cdot T$. After that, $Z_0$ is always 0. The division of Z by r is performed by shifting Z one word to the right. After s iterations and one conditional subtraction, $Z = X \cdot Y \cdot R^{-1} \bmod M$ is obtained. As Algorithm 1 scans the operands X and M from Least Significant Bit (LSB) to Most Significant Bit (MSB) simultaneously, it is also called Finely Integrated Operand Scanning (FIOS).
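The word-level procedure described above can be sketched in Python as follows. This is a minimal illustration, not the paper's Algorithm 1: the function name `mont_mul_fios`, the precomputed constant `m_prime` (the negated inverse of M modulo r, which the text does not mention explicitly) and the exact formula used for T are assumptions based on the standard formulation of the algorithm.

```python
def mont_mul_fios(X, Y, M, n=256, w=32):
    """Word-level Montgomery product X * Y * R^-1 mod M with R = 2**n.

    Assumes M is odd, X < M, Y < M < 2**n and that w divides n.
    """
    r = 1 << w
    mask = r - 1
    s = n // w
    # Assumed constant: m_prime = -M^-1 mod r makes the lowest word of Z vanish.
    m_prime = (-pow(M, -1, r)) % r

    def words(v):
        # Split v into s words of w bits, least significant word first.
        return [(v >> (w * j)) & mask for j in range(s)]

    xs, ys, ms = words(X), words(Y), words(M)
    z = [0] * (s + 1)  # s words plus one extra word for the overflow bit

    for i in range(s):
        # T depends only on the lowest word of Z.
        t = ((z[0] + xs[0] * ys[i]) * m_prime) & mask
        carry = 0
        for j in range(s):
            acc = z[j] + xs[j] * ys[i] + ms[j] * t + carry
            if j > 0:
                z[j - 1] = acc & mask  # store one word lower: division by r
            carry = acc >> w           # the low word for j = 0 is zero and is dropped
        acc = z[s] + carry
        z[s - 1] = acc & mask
        z[s] = acc >> w

    Z = sum(word << (w * j) for j, word in enumerate(z))
    return Z - M if Z >= M else Z      # conditional final subtraction


M = 2**255 - 19   # an odd 255-bit modulus, used here only as test data
print(hex(mont_mul_fios(3, 5, M)))
```

The inner loop makes the carry chain visible: each word of Z can only be finalized once the carry from the previous word is available, which is the dependency the scheduling has to work around.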
In recent years, the Montgomery modular multiplication has been widely implemented in software and hardware. For example, in [14], it was implemented on an 8-bit microcontroller. In [6], it was implemented on a high-end TI DSP (TMS320C6201). Großschädl [7] showed that software implementations on a general purpose CPU can be sped up by extending the ISA. These software implementations are highly flexible, but their performance is limited. Hardware implementations of the Montgomery multiplication have also been widely investigated. Researchers have deployed various architectures, such as bipartite multipliers [15] and systolic arrays [16, 17, 18], to achieve high throughput. In order to obtain some flexibility, a reconfigurable datapath [11] for the Montgomery modular multiplication was also explored. However, there is still a gap between flexibility and performance. One way to bridge the gap is to use parallel computation with programmable devices, e.g., a dual-MAC DSP [6].
Algorithm 1. Radix-$2^w$ Montgomery modular multiplication (input conditions: $r = 2^w$, $s = n/w$, $R = r^s$ with $\gcd(M, r) = 1$ and ...).
In this paper, we investigate the implementation of the Montgomery modular multiplication on a multi-core system. A VLIW processor with general purpose cores is proposed. According to the implementation result, we optimize it by deploying the MAC instruction in each core. A new instruction scheduling method is also introduced to achieve high parallelism.
OUR DESIGN PLATFORM
In order to achieve an efficient and flexible implementation, the HW/SW co-design method is used. A quick and correct evaluation of cost and performance for various hardware configurations and software programs is needed during the design process. Thus, we use a simulation environment, called GEZEL [20] , which allows us to estimate immediate system performance in a cycle-accurate manner before synthesizing the entire design. The GEZEL code can be automatically converted to VHDL code and then synthesized.
Our first design platform, referred to as platform-I, is a VLIW processor with general purpose cores. As shown in Figure 1, this platform consists of a main controller, a data memory, an instruction memory and several cores. Only the main controller can access the instruction memory and the data memory. The main controller fetches instructions from the instruction memory and dispatches them to all cores in parallel via the instruction bus. Each core executes arithmetic instructions in parallel, and stores the results in its register file. The data memory has only one read/write port; therefore, only a single data memory access is allowed in each cycle.
The block diagram of the core is also shown in Figure 1. We denote by w the operation size of a core (a w-bit core). Each core is a highly simplified Load/Store CPU. It has an instruction decoder, a register file with sixteen 32-bit registers and a status register. The Arithmetic Logic Unit (ALU) includes one 32-bit multiplier and one 32-bit adder. The core also has an output register to store the data that will be written to the data memory, and an input register to buffer the data from the data memory. Both are 32 bits wide. One Write Back (WB) register is also used to store data from the ALU.
The cores here support a simple Load/Store Instruction Set Architecture (ISA). As shown in Table 1, this simplified ISA has only 8 general instructions. Here #Addr denotes a memory address. Instructions for each core are 16 bits long. All the arithmetic operations are performed on data stored in the register file. When data needs to be moved from one core to another, it is first stored to the data memory, then loaded by the destination core. Cores in this platform support a 4-stage instruction pipeline: instruction fetch and decoding, register fetch, execute and register write back.
INSTRUCTION SCHEDULING
The Montgomery modular multiplication algorithm is partitioned and mapped to the cores. In order to achieve high performance, the instructions are manually scheduled so that all the cores are utilized efficiently. The instruction scheduling method is the essential part of the software implementation. The data dependency of the Montgomery algorithm is analyzed in Figure 2. The main dependency is due to the carries of additions. Taking FIOS shown in Algorithm 1 as an example, in each iteration, $Z_j$ is replaced by
$(Ca, Z_j) = Z_j + X_j \cdot Y_i + M_j \cdot T + Ca$,
where Ca is the carry. Obviously, $X_j \cdot Y_i$, for any $0 \le i, j \le s - 1$, depends only on the operands X and Y. We can also calculate $M_j \cdot T$ immediately after the generation of T. The products with the same weight as $Z_j$ and the carry from $Z_{j-1}$ are accumulated into $Z_j$, generating a new $Z_j$ and 2-bit carries. As a result, $Z_j$ can only be generated after the carry from $Z_{j-1}$ is ready.
As shown in Figure 2 , we need to add Z j with four w-bit data and 2-bit carries. In hardware implementations, cascaded Carry Save Adders (CSAs) can be used to construct a 6-to-2 CSA. The carry can also be saved in a 2-bit register or transferred to another PE. However, in general purpose processors these special features are not available. Normally only general adders with a fixed length are used. The carry is saved in the status register after an Add instruction. In order to keep the 1-bit carry for future use, one instruction is needed to copy it from the status register to a general register. It will be very inefficient to use carries generated by another core, since it needs to be stored to register file first, and then transferred via the data memory.
Therefore, it is desirable to partition the algorithm so that a carry is only used in the core where it was generated. Note that in order to generate T, only $Z_0$ must be ready at the end of the previous iteration, while $(Z_{s-1} \ldots Z_1)$ can be generated later. Based on this observation, an instruction scheduling method is proposed and is shown in Figure 3. In this method, each iteration in Algorithm 1 is performed by multiple cores. Here we choose n = 256, w = 32 and s = n/w = 8. During the whole loop, $(Z_1, Z_0)$ is generated and stored in core-1, $(Z_3, Z_2)$ in core-2, $(Z_5, Z_4)$ in core-3 and $(Z_7, Z_6)$ in core-4. Carry is only used in the local core. At the end of each iteration, $Z_1$ is sent to core-1, $Z_3$ is sent to core-2 and $Z_5$ is sent to core-3. After 8 iterations and a conditional subtraction, $Z = X \cdot Y \cdot R^{-1} \bmod M$ is generated and stored separately in four cores. Z can be written to the data memory or used by another modular multiplication.
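The word-to-core assignment above can be made concrete with a small Python sketch. It rests on an assumption that is one reading of the text, not a statement from the paper: the per-iteration transfers arise because the one-word right shift moves the value computed at position j + 1 into $Z_j$, so a word changes owner whenever the two positions belong to different cores. The helper names `owner` and `transfers_after_shift` are hypothetical.

```python
S = 8                # number of w-bit words (n = 256, w = 32)
WORDS_PER_CORE = 2   # each core keeps two adjacent words of Z


def owner(j):
    """Core (numbered from 1) that stores word Z_j."""
    return j // WORDS_PER_CORE + 1


def transfers_after_shift():
    """Words that change owner when Z is shifted right by one word.

    After the shift, the value computed at position j + 1 becomes Z_j.
    A transfer is needed whenever positions j + 1 and j belong to
    different cores. Returns (new_index, source_core, destination_core).
    """
    moves = []
    for j in range(S - 1):
        src, dst = owner(j + 1), owner(j)
        if src != dst:
            moves.append((j, src, dst))
    return moves


for j, src, dst in transfers_after_shift():
    print(f"Z_{j}: core-{src} -> core-{dst}")
```

Under this reading, only three words cross a core boundary per iteration, which agrees with the three transfers listed in the text, and every carry stays inside the core that produced it.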
This method has two advantages. First, it utilizes all four ALUs efficiently by symmetrically partitioning the Montgomery modular multiplication algorithm. Second, operands and intermediate data are distributed over the register files of the cores, so fewer registers are required in each core. According to Figure 3, core-1 only needs to store $(X_1, X_0)$, $(M_1, M_0)$ and $(Z_1, Z_0)$. During the whole computation they can stay in the register file. As a result, the number of load and store operations is reduced.
When using one core to perform 256-bit Montgomery modular multiplication, 644 clock cycles are required. When using 4 cores, we need only 217 clock cycles. That is, the 4-core based implementation is 2.96 times faster than the single-core based implementation.
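As a quick check of the figures above, the speedup S and the corresponding parallel efficiency E on four cores can be worked out directly from the two cycle counts. The efficiency is a standard derived measure and is not reported in the paper.

```latex
S = \frac{644}{217} \approx 2.968, \qquad
E = \frac{S}{4} \approx 0.742
```

The gap to the ideal speedup of 4 is consistent with the overheads discussed in the text, such as moving words between cores through the single-port data memory.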
The implementation result is summarized in Table 2. According to the table, the bottleneck of this implementation is the addition operations. The number of addition operations is almost three times larger than the number of multiplications. In the platform-I, one Add instruction consumes one clock cycle, just as one Mul instruction does. In order to improve the performance of this implementation, the addition operations need to be accelerated.
PERFORMANCE SPEEDUP
As shown in section 4, in the k-th iteration we perform
$(Ca, Z_i) = Z_i + X_i \cdot Y_k + M_i \cdot T + Ca$,
where $0 \le i, k < s$. This operation can be efficiently performed with two MAC operations,
$(Ca, Z_i) = Z_i + X_i \cdot Y_k + Ca$ and $(Ca, Z_i) = Z_i + M_i \cdot T + Ca$.
Here the Ca from the first MAC operation needs to be saved before being replaced by the second one. Based on this observation, we propose a revised multi-core platform, platform-II. Compared to the platform-I, cores in platform-II have one more 32-bit adder. The block diagram of the modified core is shown below. In the platform-II, each core contains a multiplier and two adders. Besides the ISA shown in Table 1, two more instructions are supported by the platform-II.
MAC Rc,Ra,Rb
Adw Rc,Ra,Rb
Here Rc+1 is specified implicitly. The MAC instruction performs (Ca,Rc+1,Rc)=(Rc+1,Rc)+Ra*Rb+Ca, and the Adw instruction performs (Ca,Rc+1,Rc)=(Rc+1,Rc)+(Ra,Rb)+Ca. When executing MAC and Adw, we need to read four operands from Rc+1, Rc, Ra and Rb, and write two results back to Rc+1 and Rc. As a result, the register file needs four read ports and two write ports. As increasing the number of read/write ports causes a drastic increase in area, a register file with two separate banks, bank odd and bank even, is used. Each bank contains eight 32-bit registers and has one write port and two read ports. When performing MAC and Adw instructions, Ra and Rb are always in different banks, and so are Rc+1 and Rc.
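The semantics of the two instructions can be modelled with a short Python sketch. This is a behavioural illustration only: the class `Core`, its method names and the mapping of even-numbered registers to bank even and odd-numbered registers to bank odd are assumptions, and no pipeline timing is modelled.

```python
W = 32
MASK = (1 << W) - 1


class Core:
    """Toy model of one platform-II core: 16 registers and a carry flag."""

    def __init__(self):
        self.reg = [0] * 16   # sixteen 32-bit registers
        self.ca = 0           # carry bit kept in the status register

    @staticmethod
    def _check_banks(ra, rb):
        # Assumption: even-numbered registers are in bank even, odd-numbered in bank odd.
        if ra % 2 == rb % 2:
            raise ValueError("Ra and Rb must be in different banks")

    def _pair(self, rc):
        # Read the double word (Rc+1, Rc).
        return (self.reg[rc + 1] << W) | self.reg[rc]

    def _write_back(self, rc, total):
        # Split the result into (Ca, Rc+1, Rc).
        self.reg[rc] = total & MASK
        self.reg[rc + 1] = (total >> W) & MASK
        self.ca = total >> (2 * W)

    def mac(self, rc, ra, rb):
        """(Ca, Rc+1, Rc) = (Rc+1, Rc) + Ra * Rb + Ca"""
        self._check_banks(ra, rb)
        self._write_back(rc, self._pair(rc) + self.reg[ra] * self.reg[rb] + self.ca)

    def adw(self, rc, ra, rb):
        """(Ca, Rc+1, Rc) = (Rc+1, Rc) + (Ra, Rb) + Ca"""
        self._check_banks(ra, rb)
        addend = (self.reg[ra] << W) | self.reg[rb]
        self._write_back(rc, self._pair(rc) + addend + self.ca)


core = Core()
core.reg[2], core.reg[3] = 3, 5
core.mac(0, 2, 3)                       # accumulate 3 * 5 into (R1, R0)
print(core.reg[1], core.reg[0], core.ca)
```

With 32-bit operands and an incoming carry of at most 1, both sums stay below $2^{65}$, so a single carry bit is enough to hold the overflow of either instruction.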
We use the same instruction scheduling method. The implementation result of the 256-bit modular multiplication is summarized in Table 3. The number of addition operations on the platform-II is only 30% of that on the platform-I. For one 256-bit modular multiplication, the total number of cycles is about 42% less than that of the implementation on the platform-I.
RESULTS
The multi-core platform proposed in section 3 is implemented with GEZEL. The GEZEL code is automatically converted to VHDL code. For the purpose of checking the maximum frequency, the platform is implemented on a Xilinx Virtex-II PRO (XC2VP30) FPGA. A maximum frequency of 93 MHz could be achieved for the platform-I and 81 MHz for the platform-II. The instruction memory and the data memory are implemented in the block RAM on the FPGA board. The number of slices here only includes the main controller and cores. The performance comparison between our software implementations and the state-of-the-art implementations is summarized in Table 4.
As shown in Table 4, the 256-bit modular multiplication on the platform-II is almost 28 times faster than the implementation on the ARM processor [5] and almost 9 times faster than the implementation on the UltraSPARC processor [21]. Compared to the implementation on TI's dual-MAC DSP (TMS320C6201), our implementation is about 1.78 times faster. The implementation of [22] obtains a high performance, but supports only a fixed modulus. Compared to the state-of-the-art hardware implementations [9, 10, 11], software implementations are still much slower. This is because the hardware implementations use a dedicated datapath. For example, in [11] 34 multipliers are used, and one iteration of Algorithm 1 finishes in one clock cycle.
CONCLUSIONS
In this paper, we introduced an efficient software implementation of the Montgomery multiplication algorithm on a multi-core system. A prototype of a general multi-core system was implemented. We proposed a scheduling method and, based on the implementation result, a new platform to improve the performance. The new platform supports multiply-accumulate instructions and accelerates the calculation by factors of 1.53 and 2.15 for 256-bit and 1024-bit Montgomery modular multiplication, respectively.
Our future work includes speeding up the data transfers between different cores and downsizing the whole platform. We believe that by improving the data transfer scheme a higher performance could be achieved without losing flexibility. This platform can also be used to perform other algorithms, e.g., modular inversion using the Extended Euclidean Algorithm (EEA).
