Introduction
Fast multiplication is critical for many applications [1] , in particular when many of them need to be computed in sequence as for an exponentiation; an operation that is essential for public-key cryptography (RSA, DH, Elgamal, etc.) with operands ranging from 1000 to 15000+ bits depending on the security level 1 . In these cases, multiplications are performed in finite rings Z/P Z; an operation referred to as modular multiplication. At the same time, current processors include several arithmetic units, allowing to perform several computations in parallel. Since exponentiation cannot be efficiently parallelized, it seems natural to investigate parallel multiplication algorithms and their implementations on multicore processors.
Modular multiplication algorithms fall into two main classes: so-called interleaved methods perform the computation as a combined multiply-and-modulo operation, providing both high regularity and memory efficiency; they are generally preferred for hardware implementation. However, such interleaved methods prevent the use of fast, sub-quadratic or quasi-linear algorithms such as Karatsuba [2] , Toom-Cook [3] , [4] and the FFT-based methods [5] . Hence, when dealing with large operands and when memory is not a major concern, modular multiplication is implemented as two separate operations. We refer the reader to [6, chapter 2] for a recent, excellent survey. 1 . See the ECRYPT II recommendations at http://www.keylength.com/en/3/ The most famous modular reduction algorithms are due to Barrett [7] and Montgomery [8] . Interleaved variants of Montgomery multiplication [9] do exist, but these are not considered here. These two algorithms share many similarities. In particular, they essentially consist of two integer multiplications that are intrinsically sequential: the result of the first one is needed for the second one. Although it is possible to parallelize each integer multiplication, this implies thread synchronizations whose cost should not be underestimated.
To the best of our knowledge, the first modular multiplication/reduction algorithm where parallelism is possible at the modular level was proposed in 2008 by Kaihara and Takagi [10] . The so-called bipartite modular multiplication uses two independent modular reduction algorithms, which reduce the input from both the least and most significant digits at the same time. In [11] , Sakiyama et al. proposed a natural extension of the bipartite algorithm that can be implemented with three processes at the cost of some extra synchronizations. We recall these algorithms and their complexity in Section 2.
The bipartite algorithm is designed for run on two independent units. When more cores are available, it is possible to parallelize the integer multiplications within the bipartite algorithm, but as we shall see, the number of thread synchronizations required by this approach induces an important overhead in practice. In this paper, we propose a novel k-ary multipartite algorithm which allows to split the computations on more than two independent processes, without increasing this parallelism overhead.
In Section 2, we recall the famous algorithms of Barrett and Montgomery and we introduce partial variants which proves useful for the sequel, in particular the bipartite algorithm is very easily described. We present our novel algorithm and its complexity in Section 3 and compare several parallelization strategies in Section 4. We present some timings in Section 5 which show that our algorithm may be of interest for multicore processors.
Background
We consider multiple precision integers in radix β. Given 0 < P < β n and C ≥ 0, a modular reduction algorithm is a method for computing R = C mod P . When C is the result of a product AB we may talk about modular multiplication.
Very often, we want R to be fully reduced, i.e. less than P , but in some cases partial reduction may be sufficient. (We make the concept of partial reduction more precise below.) In the next paragraphs, we recall the algorithms of Montgomery [8] , Barrett [7] , as well as the more recent bipartite [10] and tripartite [11] algorithms. We express their complexity in terms of the number of integer multiplications. Following [6] we use M (m, n) to denote the time to perform an integer multiplication with operands of size m and n respectively, or simply M (n) when both operands have the same size.
Montgomery modular reduction
Given 0 < P < β n and 0 ≤ C < P 2 , Montgomery's algorithm computes the smallest integer Q such that C + QP is a multiple of β n . Hence (C + QP )/β n is exact and congruent to Cβ −n (mod P ). If gcd(P, β) = 1, this Q-value is computed as Q = μC mod β n , where μ = −P −1 mod β n is precomputed. A detailed description is given in [6] . When both the remainder and quotient are needed, Montgomery's method is also known as Hensel's division. Note that μC and QP can be evaluated using a low short product and a high short product respectively. The complexity of Montgomery reduction is 2M (n). By extension, the complexity of Montgomery multiplication is 3M (n).
In order to simplify the presentation of the next algorithms, let us now introduce a partial Montgomery reduction. Roughly speaking, the goal of this partial reduction is to zero out the t least significant digits of C.
In other words, it takes a value C of size m and computes R ≡ Cβ −t (mod P ); a value that is shorter than C by t digits. If t > m − n, the return value R is the proper remainder, i.e. less than P .
Algorithm 1: PMR (Partial Montgomery Reduction)
Input: integers P, C, t, μ such that
The proof of correctness is easily deduced from that of the original Montgomery's reduction algorithm. (See [6, Chapter 2] ). Let us focus on its complexity: step 1 corresponds to the multiplication of the t least significant digits of C with the t least significant digits of μ. This step costs either M (m, t) when m < t or M (t) otherwise.
Step 2 involves a multiplication of P and Q, of size n and t respectively.
As a result, the complexity is:
Barrett modular reduction
Barrett's algorithm [7] computes an approximation of the quotient C/P as Q = C/β n ν/β n , where ν = β 2n /P is precomputed. The remainder is then obtained by computing C − QP , possibly followed by at most three subtractions if one wants the remainder to be fully reduced. The detailed description and proof of correctness can be found in [6] . (See also [12] for a slightly different version.) Assuming full products, the complexity of Barrett's reduction is 2M (n) (or 3M (n) for Barrett's multiplication).
As for Montgomery's algorithm, we introduce a partial Barrett reduction algorithm (see Algorithm 2) which takes C < P 2 , C < β m as input and computes R ≡ C (mod P ) such that 0 ≤ R < β m−t with t ≤ m − n. The idea is to zero out the t most significant digits of C. This is achieved by computing the t leading digits of Q as:
Note that the factor ν = β n+t /P in the numerator replaces Barrett's original precomputed constant.
Algorithm PBR is correct and step 3 is performed at most twice.
Proof: Since all operations are just adding multiples of P to C, it is clear R ≡ C (mod P ). Therefore, we only need to prove 0 ≤ R < β m−t . Since ν < β n+t /P , writing C as
Given the definition of ν and Q, we have ν > β n+t /P − 1 and Q > (C 1 ν/β t − 1)β m−n−t . Thus we also have νP > β n+t −P and
Since we have C 0 < β m−t and C 1 < β t we get
Finally, since P < β n we have β m−n−t P < β m−t . Hence, if R ≥ β m−t then R − β m−n−t P ≥ 0. According to (3) at most two subtractions are required to ensure 0 ≤ R < β m−t . Regarding complexity: since β n−1 < P < β n we have β t < β n+t /P < β t+1 and thus β t ≤ ν < β t+1 . Hence, step 1 is the multiplication of C 1 of size t and ν of size t+1 of asymptotic cost M (t). However, as we shall encounter this situation in the k-ary multipartite multiplication algorithm presented in Section 3, we note that if
if C 1 has only s < t significant digits in the left-most positions) then the cost of step 1 can be reduced to M (s, t).
Step 2 is a multiplication of P and Q, more exactly with the t +1 leading digits of Q (the other ones being all zeros). If one assumes that multiplying by powers of β is free, the cost of step 2 is M (n, t). Hence the complexity:
Bipartite modular multiplication
The bipartite algorithm proposed by Kaihara and Takagi [10] computes AB mod P for 0 ≤ A, B < P < β n and gcd(P, β) = 1 using two independent, parallel processes. One operand, say B, is split into two parts, say B = B 1 β n/2 + B 0 , such that AB mod P = (AB 1 β n/2 mod P + AB 0 mod P ) mod P . However, in that form the sizes of the operands to be reduced are very unbalanced: 2n for AB 1 β n/2 versus 3n/2 for AB 0 . So the computation of the former would take much longer to complete than the latter. Instead, Kaihara and Takagi use a Montgomery-like representation to compute the result of ABβ −n/2 mod P as AB 1 mod P + AB 0 β −n/2 mod P mod P. Using the previously defined PBR for AB 1 mod P and PMR for AB 0 β −n/2 mod P , the complexity of the bipartite algorithm is easily deduced. The two partial products AB 0 and AB 1 cost M (n, n/2) each. Adding the costs of PMR and PBR with t = n/2 from (1) and (4) respectively yields a total cost of 2M (n/2) + 4M (n, n/2).
Tripartite modular multiplication
The tripartite modular multiplication proposed by Sakiyama et. al. in [11] is a natural extension of the bipartite algorithm, where both operands are divided into two parts. Karatsuba's scheme is used to reduce the number of integer multiplications. The tripartite algorithm evaluates ABβ −n/2 mod P as
can be expressed using PBR and PMR respectively. The term X 1 − X 2 − X 0 , of size at most n, is reduced by a small number of subtractions. Thus, the complexity of the tripartite multiplication is 3M (n/2) for X 0 , X 1 , X 2 plus 2M (n/2) + 2M (n, n/2) for the two reductions, leading to an overall complexity of 5M (n/2) + 2M (n, n/2).
A novel k-ary multipartite modular multiplication
In this section we introduce a generalization of the bipartite and tripartite algorithms, which allows to break the computations into an arbitrary number of independent processes, without increasing the parallelization overhead. Let 0 ≤ A, B < P < β n and gcd(P, β) = 1.
, the modular product shifted to the right by n/2 digits rewrites
where
Without loss of generality, we assume through the rest of the paper that k|n; this can always be achieved by padding operands with leading zeros. As in the bipartite and tripartite algorithms, if d i,j < 0, the products A i B j are reduced with PMR yielding
n and no reduction is necessary. Unlike the tripartite algorithm, all the terms are fully independent. Hence the k-ary multipartite algorithm can be implemented without synchronization. In Figure 1 , we illustrate the case where both operands are divided into k = 3 parts. The final result ABβ −n/2 is then obtained by summing up modulo P all the partially reduced terms A i B j β di,j .
Complexity analysis
In the form described above, the complexity of the k-ary multipartite algorithm depends on the number of calls to PMR and PBR as well as the exact number of significant digits of their respective inputs.
Lemma 2: Let k > 0 denote the number of blocks of each operand. Then, the number of calls to PMR (resp. PBR) is
Figure 1: Illustrating the 3-ary multipartite multiplication equal to k(k+2)/8 if k is even and (k+1)(k+3)/8 if k is odd. Moreover, the number of partial products that do not need to be reduced is (3k 2 − 2k)/4 if k is even and (3k
Proof: As explained above, PMR is used when d i,j < 0, which is equivalent to 0 ≤ i + j < k/2. Since i + j is an integer, this is equivalent to 0 ≤ i + j ≤ k/2 − 1. Note also that for every value i + j, there are exactly i + j + 1 partial products A i B j . Similarly, calls to PBR occur for 3k/2 −1 ≤ i + j ≤ 2k − 2 and there are exactly 2k − 2 − (i + j) + 1 partial products for every value i + j. Therefore, the number of calls to PMR (resp. PBR) is equal to 1 + 2 + · · ·+ k/2 = k/2 ( k/2 + 1)/2. Replacing k/2 by k/2 when k is even and (k + 1)/2 when k is odd concludes the proof.
Analyzing those calls to PMR (resp. PBR), we note that some computations are redundant. Indeed, for each A i B j β di,j to be reduced modulo P , we compute a Q-value whose goal is to zero out some digits of that partial product. Thus, for a given weight d i,j , several Q-values are computed to zero out digits of identical weight. Therefore, adding those partial products of identical weight together before computing a single Qvalue is more efficient, at least theoretically. This is illustrated in Figure 2 for k = 3. Note that this approach reduces the number of calls to PMR (resp. PBR) from O(k 2 ) to k/2 . A description of the k-ary multipartite is given in Algorithm 3. Note that the partial products A i B j that need not be reduced can either be added together into C di,j as in Algorithm 3 (line 2) or at the end into R. This latter option will be considered in Section 4.4 to reduce the parallel complexity.
Theorem 1: The total arithmetic cost of the k-ary multipartite modular multiplication is at most
Figure 2: Reducing the number of call to PMR and PBR in the 3-ary multipartite algorithm by accumulating partial products of identical weight
Aiβ in/k and
if di,j < 0 then 4:
else if di,j > n − 2n/k then 6:
where ti,j = di,j + 2n/k − n 7: else 8:
Proof: First, notice that there are k 2 products A i B j with operands of size n/k, whence the term k 2 M (n/k). The other multiplications come from the calls to PMR and PBR. As seen in the proof of Lemma 2, there are exactly k/2 calls to PMR and k/2 calls to PBR. Now, by adding together partial products of identical weight, the operands to be reduced may grow due to carry propagation. However, we note that these carries are not an issue in PMR since they do not change the number of digits to be reduced. For PBR, the problem is easily bypassed by zeroing out the t i,j most significant digits (as before), at the cost of a few extra subtraction at the end. (Note that Barrett's algorithm already requires a few final subtractions anyway.) In practice, this extra number of subtractions is negligible. As illustrated in Figure 2 , the number of digits to zero out with PMR is exactly −d i,j . Therefore, we have (t i,j , 2n/k) + M (n, t i,j ) . Hence the thesis.
Note that the complexity given in (6) does not take into account the cost induced by the parallelism, in particular that induced by thread synchronizations. In the next section, we discuss several parallel implementation strategies.
Parallelization
In this section, we analyze and compare several parallel implementations of the algorithms presented above. We give the parallel complexity together with the number of synchronizations between concurrent threads. The latter should not be underestimated as it largely impacts practical performances. In [13] the overhead due to synchronization is defined as follows: if T s is the sequential time for a section of code, and T p the time for the parallel version of this on p processors, then the overhead is given by Op = T p − T s/p. Using their benchmarking tools, we were able to estimate the number of cycles of one synchronization on our architecture (Intel Xeon X5650 Westmere running at 2.66 GHz). We obtained the following results: 927 cycles for 2 threads, 1303 cycles for 3 threads, 1559 cycles for 4 threads, 2221 for 6 threads and 2571 for 8 threads.
In order to parallelize the algorithms without introducing too much parallelism overhead, we heavily rely on a quadratic scheme. The idea is to split the operands in θ 1 (resp. θ 2 ) chunks and to perform θ 1 θ 2 multiplications in parallel. Thus, one can easily derive a parallel algorithm to multiply two ndigits integers with a parallel complexity of M (n/θ 1 , n/θ 2 ) on T = θ 1 θ 2 cores and 2 thread synchronizations occuring before and after the summation of the partial products. Our parallel complexities are solely based on that quadratic scheme. Note that additions are not performed in parallel as it would require too many synchronizations. Although sub-quadratic schemes such as Karatsuba or Toom-Cook perform fewer operations, they also introduce many synchronizations which make them unsuitable in our context. They would however be of interest for bigger operands.
In the next sections, we consider a model where the algorithms receive synchronized data and output synchronized data such that it is possible to call them multiple times in sequence without extra synchronization.
In order to provide bounds and legible comparisons, we consider the following assumption on integer multiplication, which almost reflects the practical behaviour of integer multiplication except for quasi linear methods:
Barrett/Montgomery:
Although intrinsically sequential, both algorithms can be distributed on several arithmetic units by parallelizing the integer multiplications. Using the quadratic scheme, both Barrett and Montgomery's algorithms have a parallel complexity of 3M (n/θ 1 , n/θ 2 ) on T = θ 1 θ 2 cores. Each integer multiplication requires 1 synchronization after the parallel evaluation of the partial products, plus 1 synchronization at the end (after the summation of those partial products), allowing to move on to the next multiplication with synchronized data. Hence, both Barrett and Montgomery algorithms require 6 synchronizations.
Note that these algorithms are symmetric and have the same complexity. We verified experimentally that the performances of both methods are indeed very close. For the sake of clarity, we only give results for Montgomery's multiplication.
Bipartite:
The Bipartite multiplication was designed to run in parallel on two cores, with a parallel complexity of M (n/2) + 2M (n, n/2). However, if more than two arithmetic units are available, the integer multiplications in PMR and PBR can also be parallelized, using the strategies discussed above. On T = θ 1 θ 2 cores, splitting the operands in θ 1 and θ 2 /2 chunks, the cost becomes M (n/2θ 1 , n/θ 2 )+2M (n/θ 1 , n/θ 2 ), or equivalently 2.5M (n/θ 1 , n/θ 2 ) using (7) (assuming θ 2 is even). As for Montgomery, the bipartite algorithm requires 6 synchronizations.
Tripartite:
As seen in Section 2.4, the complexity of the tripartite multiplication is 3M (n/2) for X 0 , X 1 , X 2 plus 2M (n/2) + 2M (n, n/2) for the two reductions. X 0 , X 1 , X 2 can be computed in parallel on three cores in M (n/2). After a synchronization, the two reductions can also be computed in parallel in M (n/2) + M (n, n/2); each reduction involving an extra synchronisation. As for the bipartite algorithm, the integer multiplications can be parallelized if more units are available at the cost of some extra synchronizations. On T = θ 1 θ 2 cores, splitting the operands in θ 1 and θ 2 /3 chunks, the cost becomes 2M (n/2θ 1 , 3n/2θ 2 ) + M (n/θ 1 , 3n/2θ 2 ), or equivalently 3M (n/θ 1 , n/θ 2 ) using (7) (assuming θ 2 is divisible by 3). The tripartite algorithm requires a total of 6 synchronizations. In [11] a variant with 5 tasks using two levels of Karatsuba is also presented. Although potentially of interest for hardware implementations, the number of synchronizations makes it unsuitable in software for the targeted operand sizes.
k-ary multipartite:
The k-ary multipartite multiplication offers more flexibility. A naive implementation would require k 2 threads, one for each
, with a parallel complexity equivalent to the cost of the most expensive of them. That is, the computations of A 0 B 0 β −n/2 mod P using PMR or that of A k−1 B k−1 mod P using PBR of exact same cost. Although feasible for small values of k, this strategy requires many cores and is very unbalanced. As pointed out in Section 3.1, it is more efficient to combine the partial products of identical weight together and to compute only one Q-value per weight as in Algorithm 3. Indeed, we reach the same parallel complexity while reducing the number of threads. This is achieved by observing that the instructions in the foreach block of Algorithms 4 are independent and can therefore be computed in parallel. To further reduce the complexity, we postpone and reduce all the products by P appearing in each PBR and PMR of Algorithm 3 into a unique multiplication QP done, in parallel, at the end. In order to do so, each thread computes C di,j (as in Algorithm 3) together with the corresponding Q-value. All these Q-values are then added together after a thread synchronization. This is illustrated in Algorithm 4.
Aiβ in/k and 
Before stating our main theorem which gives an upper bound on the parallel complexity of Algorithm 4, we present the scheduling principle leading to this bound on two small examples. Starting with k = 2, it is easy to see that computing A 0 B 0 plus its corresponding Q-value costs 2M (n/2); the same applies for A 1 B 1 . This is equivalent to the cost of computing the two products A 0 B 1 and A 1 B 0 . Hence, the 2-ary multipartite can be implemented on three threads in 2M (n/2), plus the cost of computing QP in parallel after a thread synchronization. For k = 3 (see Figure 2) , A 0 B 0 and its corresponding Q-value costs c 0 = M (n/3) + M (n/2) = 13M (n/6) according to (7) . The cost is identical for A 2 B 2 and its corresponding Q-value. Computing A 0 B 1 + A 1 B 0 and the corresponding Q-value costs c 1 = 2M (n/3) + M (n/6) = 9M (n/6). Symmetrically, evaluating A 2 B 1 + A 1 B 2 and its Q-value costs c 1 as well. Since c1 < c 0 , those computations can be performed in parallel on four cores in c 0 = 13M (n/6). The three remaining products A 0 B 2 , A 1 B 1 , A 2 B 0 can be computed on one thread in 3M (n/3) = 12M (n/6) < c 0 .
(Although it does not change the number of threads for k = 3, those remaining partial products should be considered independently as it allows a finer task scheduling for larger values of k.) Hence, the 3-ary multipartite multiplication can be implemented on five threads in 13M (n/6), plus the cost of computing QP in parallel after a thread synchronization. Using the same idea, Theorem 2 below gives an upper bound on the parallel complexity of the k-ary multipartite algorithm given in Algorithm 4. This theorem assumes that k|n, θ 1 |n and θ 2 |n, which can always be achieved with padding.
Theorem 2: Let us assume that
cores are available, then using a quadratic scheme with k > 3 and T ≥ 3 k/2 , the parallel complexity of the k-ary multipartite multiplication is at most
Proof: The term M (n/θ 1 , n/θ 2 ) comes from the final product QP , computed in parallel on θ 1 θ 2 cores. So, let us focus on the other terms. We start by proving that any of the following computations can be computed in at most M (n/k)+ M (n/2, 2n/k):
(i) A 0 B 0 and its corresponding Q-value, (ii) =i+j A i B j and its corresponding Q-value plus nonreduced
According to (1) , the computation of A 0 B 0 and its corre-
, and the corresponding Q-values costs
Under our assumption on M (n), we have M
Adding the cost of products A i B j gives a total of (k + 1)M n k , whence (ii). Finally, using (7), it is easy to see that M 
According to (iii), one can handle k + 1 partial products A i B j on one core in the given time. Replacing k by k/2 + k/2 in the previous equation, gives
Hence, k/2 cores are needed for the N k remaining products, which concludes the proof. n 2 , n) , the parallel complexity of the k-ary multipartite algorithm is bounded by
and the number of thread synchronizations is exactly 3.
Proof: Using (8) with k = 2θ 1 θ 2 /3, the proof is immediate.
As illustrated in Section 5, we observed that synchronizations are very expensive and should be limited. In particular for small operands, when the relative cost between a synchronization and a product is very large. In those cases, one may prefer not to gather all Q-values together before computing QP . Each thread then has to compute its own product QP which costs M n 2 − n k , n . This, of course drastically increases the parallel complexity but saves 1 synchronization, possibly leading to significant speed-ups in practice (see Section 5) . For completeness, the parallel complexity of this approach is (1.5 + Table 1 we summarize the parallel complexities and the number of synchronizations of the presented parallel modular multiplication algorithms.
Algorithm
Parallel complexity on θ 1 θ 2 cores # synch. 
Comparisons
As already mentioned, software parallelism induces a cost that is only amortized when computation time is important. This extra cost comes from thread launching (clone(), fork()), thread synchronization (wait()) and memories operations (placement, transfers and cache misses). This fact motivates our study of a parallel modular multiplication algorithm which minimize this impact.
In order to validate our theoretical study, we developed a C++ library implementing the algorithms presented in the previous sections on a shared memory MIMD architecture.
In the following, we discuss some benchmarks of this library on an architecture embedding two Intel Xeon processors X5650 Westmere, with six cores each, running at 2.66 GHz. Our implementation is based on the GNU multiple precision (GMP) library version 5.0.2 [14] and the OpenMP API framework. Compilation was done with the Intel C++ compiler version 10. Our implementation is generic in the operands' sizes. The sequential integer operations computed by each parallel task are the ones from GMP (e.g. product is done with mpn_mul). For each algorithm and number of cores, we optimized the task scheduling in order to minimize the parallel complexity and to reduce the number of synchronizations. When possible, parallel tasks were gathered on the same processor to avoid data movement through the system bus. Our timings are summarized in Table 2 .
In order to provide a universal reference, we give the time to perform one modular multiplication on one thread, i.e. a sequential implementation. The row entitled "Best seq." in Table 2 always corresponds to the fastest option between our sequential implementation of Montgomery and GMP (mpn_mul, mpn_mod). For parallel integer multiplications, we implemented both the quadratic scheme and Karatsuba whenever possible, and observed that the later was only advantageous when the number of cores is a multiple of 3. In the other cases, the quadratic scheme was always faster. As in the sequential case, our implementations always call the best parallel integer multiplication strategy available.
The bound given in the corollary of Theorem 2 assumes that k = 2T/3, where T is the number of cores. For 3 and 6 cores, and inputs of size less than 12000 bits, the best timings were indeed obtained with the k-ary multipartite algorithm with k = 2 and k = 4 respectively. When the number of cores did not allow an optimal mapping, we observed that choosing k > 2T/3 allows a finer scheduling resulting in faster implementations. On 4 and 8 cores, our timings correspond to the 4-ary and 8-ary multipartite algorithms respectively, where extra A i B j have been scheduled by hand.
For the larger operands, we also observed that the bipartite algorithm was the fastest. This is not surprising since the parallel complexity given in Table 1 is always better than the other algorithms and since the parallelism overhead (a constant factor) becomes negligible. The fact that the k-ary multipartite catches up with the bipartite for increasing T is not surprising either, this reflects the complexity given in Table 1 .
For the smaller operands, however, the cost induced by the parallelism is important. This explains why the k-ary multipartite version 2 which only needs 1 synchronization is faster than version 1 which requires 2.
Finally, one should observe that it may be advantageous not to use all the cores available. For example, if one considers 2048-bit operands and if 8 cores are available, one can observe that the 2-ary multipartite version 2 on 3 cores is faster. Table 2 : Timings (in μs) for several parallel modular multiplication algorithms on 1, 3, 4, 6 and 8 cores, for operands ranging from 1024 to 16384 bits. For a given number of cores and a given size, the gray cells represent the fastest algorithm
Conclusions
The experiments presented in this article show that parallelization of modular multiplication is relevant at a software level. Increasing the number of core improves the performance but a full usage of the cores may not be the better strategy for certain sizes. For operands smaller than 2 13 bits, it seems preferable to use synchronization-friendly algorithms rather than the one with the best complexity. Our k-ary multipartite algorithm, which is a synchronization-friendly generalization of the bipartite and tripartite algorithms, is therefore a good alternative.
For larger sizes, the bipartite method clearly offers the best alternative since the arithmetic complexity dominates the parallelism overhead. However, as soon as the number of core increases, the gain of the bipartite over our k-partite algorithm is postponed to larger integers.
The best results for the k-ary multipartite algorithm were obtained thanks to the scheduling proposed in Theorem 2 completed with hand optimizations when the number of cores did not allow a direct mapping. Generalizing these optimizations would allow to provide an automatic scheduler for efficient parallel modular multiplication on any number of cores.
It seems also interesting to consider the modular reduction problem rather than modular multiplication. An idea would be to use an heterogeneous splitting of the operand to be reduced. Modular squaring should be also of interest since it is a major operation for exponentiation.
