An algorithmic and architectural study on Montgomery exponentiation in RNS by Gandino, Filippo et al.
  
This is an electronic version (author’s or accepted version) of the paper:  
 
Gandino F., Lamberti F., Paravati G., Bajard J.C., Montuschi, P.,                                  
“An algorithmic and architectural study on Montgomery exponentiation in RNS,”  
IEEE Transactions on Computers, 61(8), pp. 1071-1083, 2012.  
 
DOI: 10.1109/TC.2012.84 
Link to IEEE Xplore®: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6185537 
 
© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be  
obtained for all other uses, in any current or future media, including  
reprinting/republishing this material for advertising or promotional purposes, creating new  
collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted  
component of this work in other works. 
 
 
TRANSACTIONS ON COMPUTERS 1
An Algorithmic and Architectural Study on
Montgomery Exponentiation in RNS
Filippo Gandino, Member, IEEE, Fabrizio Lamberti, Member, IEEE, Gianluca Paravati,
Jean-Claude Bajard, Member, IEEE, and Paolo Montuschi, Senior Member, IEEE
Abstract—The modular exponentiation on large numbers is computationally intensive. An effective way for performing this operation
consists in using Montgomery exponentiation in the Residue Number System (RNS). This paper presents an algorithmic and
architectural study of such exponentiation approach. From the algorithmic point of view, new and state-of-the-art opportunities that come
from the reorganization of operations and precomputations are considered. From the architectural perspective, the design opportunities
offered by well-known computer arithmetic techniques are studied, with the aim of developing an efficient arithmetic cell architecture.
Furthermore, since efficient RNS bases with a low Hamming weight are being considered with ever more interest, four
additional cell architectures specifically tailored to these bases are developed and the trade-off between benefits and drawbacks is
carefully explored. An overall comparison among all the considered algorithmic approaches and cell architectures is presented, with
the aim of providing the reader with an extensive overview of the Montgomery exponentiation opportunities in RNS.
Index Terms—RNS, Montgomery reduction, Modular exponentiation, Modular multiplication.
1 INTRODUCTION
THE hardware implementation of modular exponentiation for very large integers plays an important
role in various fields, security being among the most
notable ones. In recent years, many research efforts have
specifically focused on the hardware implementation of
Montgomery exponentiation (ME) in the residue number
system (RNS). In RNS, very long integer multiplications
and additions can be split in independent short integer
operations. However, other operations such as division
and modular reduction may be difficult to execute [1].
ME, which is based on Montgomery multiplication
(MM) [2], can be performed by avoiding the standard
modular reduction approach.
A number of versions of the RNS ME algorithm
have been proposed, aimed at reducing the delay of
exponentiation (e.g., [3], [4], [5]). Most of the above
approaches deal with the base extension (BE) part of
the algorithms, since this operation, which calculates a
number on a different RNS base, contributes in large part
to the overall computational effort of RNS ME.
An RNS ME technique in the context of RSA was
proposed in [4] by Kawamura et al. This technique uses
a new BE algorithm, characterized by a summation that
provides a result modulo a small multiple of the base,
which is corrected after the sum of each element. In [6],
further details and an architecture were presented. The
architecture was compared with non-RNS approaches,
showing better performance.
F. Gandino, F. Lamberti, G. Paravati and P. Montuschi are with the
Department of Computer Engineering, Politecnico di Torino, Italy.
E-mail: {filippo.gandino, fabrizio.lamberti, gianluca.paravati,
paolo.montuschi}@polito.it
J.-C. Bajard is with LIP6 CNRS - Université Pierre et Marie Curie.
E-mail: jean-claude.bajard@lip6.fr
In [5], Bajard and Imbert described another ME approach
based on [7] and exploiting two BE techniques:
a new approximated BE and the BE algorithm proposed
in [8], where the result is approximated and corrected
by using an extra modulus.
In this paper, a comprehensive algorithmic and archi-
tectural study on RNS ME is presented. The paper is
based on an earlier work presented in [9], where new
and state-of-the-art approaches for reorganizing opera-
tions and for exploiting precomputation were combined
and analyzed, and two RNS ME algorithms suitable to
the BE approaches used in [4] and [5] were discussed. A
previous study limited to the internal reorganization of
the RNS Montgomery reduction was presented in [10].
Moreover, the idea of rearranging internal operations
and precomputed values was simultaneously exploited
by Guillermin in [11], [12], where a reorganized RNS
Montgomery reduction algorithm with a BE approach
based on [4] was applied to elliptic curve cryptography.
The algorithmic analysis in the current work extends
previous studies, and considers operations both internal
and external to the RNS MM with the aim of limiting
the number of computations required during the Mont-
gomery reduction.
The work in [9] also included a study on the cell
architecture presented in [6], and proposed a novel cell
architecture suitable to the approach used in [5] (which
was evaluated by means of a theoretical analysis based
on equivalent gates delay and area cost). In the cur-
rent work, the architectural perspective is significantly
extended, by exploiting the outcomes of the former
analysis as well as the synergies between algorithmic
TABLE 1
List of Symbols
Context                   Symbol            Meaning
RNS bases                 $\mathcal{A}$     RNS base
                          $\mathcal{B}$     RNS base
                          $k$               Number of base elements
                          $r$               Number of bits of each base element
                          $a_j$             $j$-th element of the base $\mathcal{A}$; $\forall j, 1 \le j \le k$
                          $b_i$             $i$-th element of the base $\mathcal{B}$; $\forall i, 1 \le i \le k$
                          $A$               $\prod_{j=1}^{k} a_j$
                          $B$               $\prod_{i=1}^{k} b_i$
                          $A_j$             $A / a_j$; $\forall j, 1 \le j \le k$
                          $B_i$             $B / b_i$; $\forall i, 1 \le i \le k$
                          $A_j^{-1}$        Multiplicative inverse of $A_j$ on $a_j$
                          $B_i^{-1}$        Multiplicative inverse of $B_i$ on $b_i$
                          $B_A^{-1}$        Multiplicative inverse of $B$ on $\mathcal{A}$
                          $B_N^{-1}$        Multiplicative inverse of $B$ on $N$
                          $c_i$             $2^r - a_i$
                          $h$               Maximum number of bits of $c_i$
ME                        $g$               Number of bits of the exponent
MM                        $R$               Montgomery number
                          $N$               Modulo of the operation
BE values                 $\alpha$          Approximation $(\tilde{x} - x)/B$
Architecture              $p$               Stages of pipeline
                          $M$               Parallel multipliers
Kawamura et al. BE [4]    $c$               Approximated value of $q/A$
                          $f$               Floor of $c$; ($\lfloor c \rfloor$)
                          $\sigma$          Initial value of $c$; $\{0, 0.5\}$
                          $\varrho$         Number of accumulated bits of $q_i$
Bajard et al. BE [5]      $a_r$             Redundant base element
                          $A_r^{-1}$        Multiplicative inverse of $A$ on $a_r$
Accents                   tilde ($\tilde{x}$)   Approximated values
                          hat ($\hat{x}$)       Values multiplied by $A_j^{-1}$ in $\mathcal{A}$
Operations and Relations  $|x|_y$           $x \bmod y$
                          $\ll$             Left shift
                          $\equiv$          Equivalent
and architectural aspects. In particular, on the one hand,
the use of well-established computer architecture design
techniques is explored, and a new cell architecture
exploiting pipelining and redundancy is proposed. On the
other hand, efficient RNS bases studied in [13], [14]
are considered, and four additional cell architectures
are developed where a totally different design based
on additions rather than on modular reductions is exploited.
Furthermore, a comprehensive comparison encompassing
reorganized algorithms and newly designed
cells is carried out, based on algorithmic analysis and
logic synthesis. Results show that, without additional
constraints on the representable numbers, it is possible
to achieve a 41.1% reduction in delay, with a 17.6%
increase in area, with respect to [6]. Moreover, further
improvements (up to a 41.8% reduction in delay with
a 2.8% reduction in area) can be reached at the cost
of stricter constraints on the size of the representable
numbers.
The remainder of the paper is organized as follows. In
Section 2, the impact of reorganization of operations and
preprocessing is analyzed. In Section 3, the architectural
study is discussed, by analyzing in detail the design
of the novel cell architectures. In Section 4, an overall
comparison is presented, considering all the algorithms
and architectures described in the paper. Finally, in
Section 5, conclusions are drawn.
2 ALGORITHMIC STUDY
In this section, a suitable reorganization of the opera-
tions and of precomputations capable of reducing the
number of multiplications in RNS ME algorithms is
described. The RNS ME corresponds to a classic square
and multiply algorithm, where RNS MMs are iterated in
a loop. The RNS MM follows the normal Montgomery
multiplication approach, but it is adapted to RNS. The
main difference is the presence of two BEs, which are
required by RNS MM in order to execute the modular
reduction. The BE corresponds to an algorithm that
calculates a value represented on an RNS base on a
different base. Since the RNS ME uses the RNS MM
that, in turn, requires the BE, after a brief background
on RNS the three algorithms will be discussed separately
by following a top-down presentation approach.
The study first identifies analytically the
reduction in the number of multiplications needed by
the reorganized algorithms. Then, the multiplications are
classified and their weight is analyzed in detail (by
making reference to the concept of multiplication step,
i.e., to a set of $k$ multiplications distributed on $k$ cells),
with the aim of precisely determining the impact of
such reorganization in the perspective of pipelining and
parallelization opportunities offered by RNS. The algorithms
will be discussed referring to the symbols reported
in Table 1.
2.1 Mathematical Background
RNS allows long integers to be represented as sets of
short integers. Considering the base $\mathcal{A} = (a_1, a_2, \ldots, a_k)$,
composed of $k$ pairwise relatively prime numbers (where $k$ is the
base size), any number $x$ with $0 \le x < A = \prod_{i=1}^{k} a_i$ is
uniquely represented by a sequence of non-negative integers
$(x_1, x_2, \ldots, x_k)$, where $x_i = |x|_{a_i}$, $\forall i: 1 \le i \le k$.
In RNS, multiplications, additions, and subtractions
can be carried out independently and in parallel for each
base element, limiting both the operations and the carry
propagation to the bits of each independent element.
However, in RNS, other operations like overflow detection,
division and modular reduction are more computationally
intensive than in other representations. For
instance, an exact division $x/y$ can be executed by means
of the multiplication of $x$ by the multiplicative inverse
of $y$, modulo $A$, but this operation can be performed
without affecting the size of the representable values
only if the greatest common divisor $\gcd(y, A) = 1$. If this
condition does not hold, a BE has to be performed, requiring
a significant computational effort.
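The element-wise nature of RNS arithmetic described above can be illustrated with a minimal Python sketch. The toy base `BASE`, the helper names, and the operand values are assumptions chosen only for illustration; they do not come from the paper, and real implementations use much larger moduli.

```python
from math import prod

# Hypothetical toy base (pairwise relatively prime); not taken from the paper.
BASE = [13, 15, 16, 17]
A = prod(BASE)  # dynamic range of the base

def to_rns(x, base):
    """Residues of x on every base element."""
    return [x % m for m in base]

def rns_add(xs, ys, base):
    """Addition, carried out independently on each base element."""
    return [(a + b) % m for a, b, m in zip(xs, ys, base)]

def rns_mul(xs, ys, base):
    """Multiplication, carried out independently on each base element."""
    return [(a * b) % m for a, b, m in zip(xs, ys, base)]

def rns_exact_div(xs, y, base):
    """Exact division by y via its inverse; valid only if gcd(y, A) = 1."""
    return [(a * pow(y, -1, m)) % m for a, m in zip(xs, base)]

x, y = 123, 321
product = rns_mul(to_rns(x, BASE), to_rns(y, BASE), BASE)
quotient = rns_exact_div(to_rns(3500, BASE), 7, BASE)
```

Each list position plays the role of one independent channel, so no carry ever crosses from one base element to another. The division helper only works for exact quotients and for divisors coprime to every base element, which is the restriction that motivates the base extension.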
An efficient BE technique is based on the Chinese
Remainder Theorem (CRT) [15]. CRT can be used to
convert a value x from an RNS base to a radix system.
The conversion expression is:
$$x = \left| \sum_{i=1}^{k} \left| x_i A_i^{-1} \right|_{a_i} A_i \right|_A \qquad (1)$$
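Algorithm 1: State-of-the-art RNS ME
Input: $x$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$ and $e = (e_{g-1} \ldots e_1 e_0)_b$, such that† $A = \prod_{j=1}^{k} a_j$, $B = \prod_{i=1}^{k} b_i$, and $\gcd(A, B) = 1$, $\gcd(N, B) = 1$, $0 \le xy < NB$, $B > 4N$, and $A > 2N$
Output: $z = |x^e|_N$ in $\mathcal{A} \cup \mathcal{B}$
Precomputation: $|B^2|_N$, $|B|_N$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$
1: $\bar{x} \leftarrow \mathrm{MM}(x, |B^2|_N)$
2: $z = |B|_N$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$
3: for $i$ from $g-1$ down to 0 do
4:   $z \leftarrow \mathrm{MM}(z, z)$
5:   if $e_i = 1$ then
6:     $z \leftarrow \mathrm{MM}(z, \bar{x})$
7:   end if
8: end for
9: $z \leftarrow \mathrm{MM}(z, 1)$

Algorithm 2: Reorganized RNS ME
Input: $x$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$ and $e = (e_{g-1} \ldots e_1 e_0)_b$, such that† $A = \prod_{j=1}^{k} a_j$, $B = \prod_{i=1}^{k} b_i$, and $\gcd(A, B) = 1$, $\gcd(N, B) = 1$, $0 \le xy < NB$, $B > 5N$, and $A > 4N$
Output: $z = |x^e|_N$ in $\mathcal{A} \cup \mathcal{B}$
Precomputation: $\widehat{|B^2|_N}$, $\widehat{|B|_N}$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$, $|A_j|_{a_j}$ and $|A_j^{-1}|_{a_j}$ for $j = 1 \ldots k$
1: $\hat{x}_{a_j} = |x_{a_j} A_j^{-1}|_{a_j}$ for $j = 1 \ldots k$
2: $\hat{\bar{x}} \leftarrow \mathrm{MM}(\hat{x}, \widehat{|B^2|_N})$
3: $z = \widehat{|B|_N}$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$
4: for $i$ from $g-1$ down to 0 do
5:   $\hat{z} \leftarrow \mathrm{MM}(\hat{z}, \hat{z})$
6:   if $e_i = 1$ then
7:     $\hat{z} \leftarrow \mathrm{MM}(\hat{z}, \hat{\bar{x}})$
8:   end if
9: end for
10: $\hat{z} \leftarrow \mathrm{MM}(\hat{z}, \hat{1})$
11: $z_{a_j} = |\hat{z}_{a_j} A_j|_{a_j}$ for $j = 1 \ldots k$

† With Kawamura et al. BE approach [4], $B > 5N$, $A > 4N$; with Bajard et al. BE approach [5], $B > (k+1)^2 N$, $A > (k+1)N$
Fig. 1. The state-of-the-art and the reorganized RNS Montgomery exponentiation algorithms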
where $A_i = A / a_i$, $A_i^{-1}$ is the multiplicative inverse of $A_i$ on
$a_i$, and $\sum_{i=1}^{k} |x_i A_i^{-1}|_{a_i} A_i$ is equal to $x + \alpha A$ (with $\alpha < k$).
In order to complete a BE, the modular reductions of $x$
by the elements of the new base must be performed.
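The following minimal Python sketch illustrates why the CRT summation, taken without the final reduction modulo $A$, yields the value plus a small multiple of $A$. The toy base and the names `crt_sum`, `BASE` and `alpha` are assumptions made for illustration only.

```python
from math import prod

# Hypothetical toy base (pairwise relatively prime); not taken from the paper.
BASE = [13, 15, 16, 17]
A = prod(BASE)

def crt_sum(residues, base):
    """Sum of |x_i * A_i^{-1}|_{a_i} * A_i, WITHOUT the final reduction mod A."""
    total_range = prod(base)
    total = 0
    for x_i, a_i in zip(residues, base):
        A_i = total_range // a_i
        total += (x_i * pow(A_i, -1, a_i) % a_i) * A_i
    return total

x = 40000
s = crt_sum([x % a for a in BASE], BASE)
alpha = (s - x) // A  # number of extra multiples of A left in the sum
```

Each of the $k$ terms is smaller than $A$, so the unreduced sum is below $kA$ and the multiple `alpha` stays below the base size. A base extension has to remove (or tolerate) exactly this multiple.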
In this study, $A_i$ and $A_i^{-1}$ are always used on the base
element $a_i$, or on a different base. Since they are never
used on a base element $a_j$ of $\mathcal{A}$ with $j \ne i$, in order to
simplify the notation, in the following $A_i$ in $\mathcal{A}$ and $A_i^{-1}$
in $\mathcal{A}$ will be used to indicate $(|A_1|_{a_1}, |A_2|_{a_2}, \ldots, |A_k|_{a_k})$
and $(|A_1^{-1}|_{a_1}, |A_2^{-1}|_{a_2}, \ldots, |A_k^{-1}|_{a_k})$, respectively.
2.2 Reorganization of the Operations
The algorithmic study reported in this work starts from
the observation that the two state-of-the-art RNS ME
algorithms [4], [5] share a common feature, i.e.,
the presence of a significant number of multiplications
of a partial result by a precomputed value.
Taking into account the above characteristic, it is observed
that, by exploiting the commutative property,
a sequence of two multiplications of a partial result
by a precomputed value, e.g., $(x \cdot k_1) \cdot k_2$, could be
substituted by one multiplication of the original value
by the product of the two precomputed values, i.e.,
$x \cdot k_3$, with $k_3 = k_1 \cdot k_2$. Therefore, by precomputing a
different value, it is possible to decrease the number of
multiplications and reach the same result. By considering
other arithmetic properties, e.g., the distributive and
associative ones, operations involved in the considered
algorithms could be arranged in a more effective way.
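As a small illustration of the idea above, the sketch below compares two consecutive modular multiplications by constants with a single multiplication by their precomputed product. The modulus and the constants are arbitrary values assumed for the example.

```python
# Arbitrary illustrative modulus and constants (assumptions, not from the paper).
MODULUS = 2**31 - 1
K1, K2 = 123456789, 987654321
K3 = (K1 * K2) % MODULUS  # computed once, offline

def two_multiplications(x):
    """(x * k1) * k2, as two dependent modular multiplications."""
    return ((x * K1) % MODULUS) * K2 % MODULUS

def one_multiplication(x):
    """x * k3 with k3 = k1 * k2, a single modular multiplication at run time."""
    return (x * K3) % MODULUS
```

The run-time cost drops from two multiplications to one, at the price of storing a different precomputed constant. The reorganized algorithms apply this kind of merging repeatedly.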
The reorganization of operations and precomputed
values internal to the RNS MM has been already ex-
ploited in various studies [10], [11], [12]. This section
analyzes reorganized algorithms for the RNS ME (where
internal reorganizations are combined with external ones
by following the approach proposed in [9]), and com-
pares them to the state-of-the-art algorithms in [4], [5].
2.3 RNS Montgomery Exponentiation (RNS ME)
ME is carried out by iterating MM, where $\mathrm{MM}(x, y)$
gives $w \equiv x y R^{-1} \pmod{N}$, with $R$ the Montgomery
constant. ME computes $|x^e|_N$ at the maximum cost
of $2 \log_2 e + 2$ MMs. Considering $\bar{x}$ and $\bar{y}$ such that
$\bar{x} = |xR|_N$ and $\bar{y} = |yR|_N$, then $\bar{z} = |\bar{x}\bar{y}R^{-1}|_N = |xyR|_N$.
Therefore, the exponentiation can be executed by iterating
MM on $\bar{x}$. ME provides the exact result, or the
result plus $N$; hence, a final comparison and a possible
subtraction are required.
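The square-and-multiply structure described above can be sketched on ordinary integers (not in RNS) as follows. This is only an illustration under stated assumptions: `mont_mul` and `mont_exp` are hypothetical helper names, $R$ is assumed to be coprime to $N$ and larger than $4N$ so that intermediate values stay below $2N$, and the input is assumed to satisfy $0 \le x < N$.

```python
def mont_mul(x, y, N, R):
    """Return w with w = x*y*R^{-1} (mod N) and w < 2N, assuming gcd(N, R) = 1,
    R > 4N and x, y < 2N."""
    s = x * y
    q = (s * (-pow(N, -1, R))) % R  # chosen so that s + q*N is divisible by R
    return (s + q * N) // R

def mont_exp(x, e, N, R):
    """Left-to-right square-and-multiply built on mont_mul."""
    x_bar = mont_mul(x, (R * R) % N, N, R)  # Montgomery form of x
    z = R % N                               # Montgomery form of 1
    for bit in bin(e)[2:]:
        z = mont_mul(z, z, N, R)
        if bit == "1":
            z = mont_mul(z, x_bar, N, R)
    z = mont_mul(z, 1, N, R)                # leave the Montgomery form
    return z - N if z >= N else z           # final comparison and subtraction

N_DEMO = 1000003   # illustrative odd modulus (assumption)
R_DEMO = 1 << 24   # illustrative Montgomery constant, larger than 4 * N_DEMO
result = mont_exp(5, 117, N_DEMO, R_DEMO)
```

The loop performs one MM per exponent bit plus one more for each bit set to 1, which, together with the conversions at the start and at the end, matches the bound on the number of MMs quoted in the text.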
The state-of-the-art and reorganized RNS ME algorithms
are shown in Fig. 1 ([4] and [5] are discussed together
since, although two different RNS ME approaches
are exploited, the main part of the algorithm is common).
RNS ME is performed on two RNS bases, $\mathcal{A} = (a_1, \ldots, a_k)$
and $\mathcal{B} = (b_1, \ldots, b_k)$, such that $A = \prod_{j=1}^{k} a_j$, $B = \prod_{i=1}^{k} b_i$,
and $\gcd(A, B) = 1$. $B$ is used as the Montgomery
constant. Since $R = B$, $|B|_N$ and $|B^2|_N$ must be precomputed.
Step 1 in Algorithm 1 [4], [5] calculates $\bar{x}$. Step 2
initializes the exponentiation process, which is executed
in the loop from Step 3 to Step 8.
In the reorganized algorithm (Algorithm 2), two multiplication
steps are added outside the loop. Before entering
the loop, the values on base $\mathcal{A}$ are multiplied by
$A_j^{-1}$ (Step 1). This multiplication is performed on base
$\mathcal{A}$ in the state-of-the-art BEs at the end of RNS MM, in
order to extend the final result from $\mathcal{A}$ to $\mathcal{B}$. With the
aim of saving a multiplication step, the reorganized
algorithm does not provide the correct result on $\mathcal{A}$, but directly
calculates it multiplied by $A_j^{-1}$. All the input values in
$\mathcal{A}$ of the RNS MM are represented in a new notation
where all the values are premultiplied by $A_j^{-1}$ (and
represented with a hat accent). This new notation is
stable under addition and under the reorganized MM. After the loop, a
multiplication by $A_j$ is executed to reach the final result
of the exponentiation (that can be either the exact result,
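TABLE 2
Operations in the State-of-the-Art and the Reorganized MM and BE Algorithms (Without Error Evaluation)
Phase  Base           [4], [7]                                              Reorganized
MM     $\mathcal{A}$  $s = x \cdot y$                                       $\hat{\hat{s}} = \hat{x} \cdot \hat{y}$
MM     $\mathcal{B}$  $s = x \cdot y$                                       $s = x \cdot y$
MM     $\mathcal{B}$  $u = s \cdot (-N^{-1})$                               -
BE1    $\mathcal{B}$  $q = u \cdot B_i^{-1}$                                $q = s \cdot (-N^{-1} B_i^{-1})$
BE1    $\mathcal{A}$  $\tilde{u} = \sum_{i=1}^{k} q_{b_i} \cdot B_i$        $\hat{w} = \hat{\hat{s}} \cdot (B_A^{-1} A_j) + \sum_{i=1}^{k} q_i \cdot (B_i N B_A^{-1} A_j^{-1})$
BE1    $\mathcal{A}$  $u = \tilde{u} - \alpha \cdot B$                      $\hat{w} = \hat{w} + \alpha \cdot (-N A_j^{-1})$
MM     $\mathcal{A}$  $t = u \cdot N$                                       -
MM     $\mathcal{A}$  $v = s + t$                                           -
MM     $\mathcal{A}$  $w = v \cdot B_A^{-1}$                                -
BE2    $\mathcal{A}$  $\hat{w} = w \cdot A_j^{-1}$                          -
BE2    $\mathcal{B}$  $\tilde{w} = \sum_{j=1}^{k} \hat{w}_{a_j} \cdot A_j$  $\tilde{w} = \sum_{j=1}^{k} \hat{w}_{a_j} \cdot A_j$
BE2    $\mathcal{B}$  $w = \tilde{w} - \alpha \cdot A$                      $w = \tilde{w} - \alpha \cdot A$
Note: $A_j$ in $\mathcal{A}$ and $A_j^{-1}$ in $\mathcal{A}$ are used to indicate $(|A_1|_{a_1}, |A_2|_{a_2}, \ldots, |A_k|_{a_k})$ and $(|A_1^{-1}|_{a_1}, |A_2^{-1}|_{a_2}, \ldots, |A_k^{-1}|_{a_k})$, respectively.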
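Algorithm 3: State-of-the-art RNS MM
Input: $x$, $y$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$
Output: $w \equiv x y B_N^{-1} \pmod{N}$; $w < 2N$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$
Precomputation: $B_A^{-1}$, $N$ in $\mathcal{A} \cup a_r$; $-N^{-1}$ in $\mathcal{B}$
1: $s = x \cdot y$ in $\mathcal{B}$
2: $s = x \cdot y$ in $\mathcal{A} \cup a_r$
3: $u = s \cdot (-N^{-1})$ in $\mathcal{B}$
4: $u$ in $\mathcal{A} \cup a_r \Leftarrow \mathrm{BE1}(u \text{ in } \mathcal{B})$
5: $t = u \cdot N$ in $\mathcal{A} \cup a_r$
6: $v = s + t$ in $\mathcal{A} \cup a_r$
7: $w = v \cdot B_A^{-1}$ in $\mathcal{A} \cup a_r$
8: $w$ in $\mathcal{B} \Leftarrow \mathrm{BE2}(w \text{ in } \mathcal{A} \cup a_r)$

Algorithm 4: Reorganized RNS MM
Input: $\hat{x}$, $\hat{y}$ in $\mathcal{A} \cup \mathcal{B} \cup a_r$
Output: $w \equiv x y B_N^{-1} \pmod{N}$; $w < 2N$ in $\mathcal{B} \cup a_r$, $\hat{w} \equiv x y B_N^{-1} A_j^{-1} \pmod{N}$ in $\mathcal{A}$
Precomputation: none
1: $s = x \cdot y$ in $\mathcal{B} \cup a_r$
2: $\hat{\hat{s}} = \hat{x} \cdot \hat{y}$ in $\mathcal{A}$
3: $\hat{w}$ in $\mathcal{A} \cup a_r \Leftarrow \mathrm{BE1}(s \text{ in } \mathcal{B} \cup a_r;\ \hat{\hat{s}} \text{ in } \mathcal{A})$
4: $w$ in $\mathcal{B} \Leftarrow \mathrm{BE2}(\hat{w} \text{ in } \mathcal{A};\ w \text{ in } a_r)$

Fig. 2. The state-of-the-art and the reorganized RNS Montgomery multiplication algorithms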
or the result plus a multiple of $N$).
2.4 RNS Montgomery Multiplication (RNS MM)
In the state-of-the-art algorithms [4], [7], RNS MM is
performed on two RNS bases, $\mathcal{A}$ and $\mathcal{B}$. Since $B$ is used
as the Montgomery constant, $B_A^{-1}$ must be precomputed
on the base $\mathcal{A}$, where $B_A^{-1}$ is the multiplicative inverse
of $B$ on $\mathcal{A}$.
In Fig. 2, the state-of-the-art and reorganized RNS
MM algorithms are presented (and, as before, [4] and
[7] are treated together). $B$ is used as the Montgomery
constant; thus, by executing the multiplication of Step 3
in $\mathcal{B}$, the modular reduction by $B$ is immediate. Since
a division by $B$ is required, which must be executed in
a base composed of elements relatively prime to $B$, the
BE to $\mathcal{A}$ of Step 4 is performed. The multiplication in
Step 5 and the addition in Step 6 are only executed on
$\mathcal{A}$, since the result of Step 6 on $\mathcal{B}$ would be equal to 0.
The multiplication by $B^{-1}$ in Step 7, which corresponds
to the division, is only performed on $\mathcal{A}$, as previously
described. A final BE to $\mathcal{B}$ is required, in order to reach
a valid result that can be passed as input to other MMs.
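The step sequence of the state-of-the-art RNS MM can be followed in the Python sketch below. It is deliberately simplified and should not be read as the method of [4] or [7]: both base extensions are replaced by an exact CRT conversion (function `crt`), the redundant element $a_r$ is omitted, and the toy bases, the modulus `N_DEMO` and all function names are assumptions chosen so that $B > 4N$ and $A > 2N$ hold.

```python
from math import prod

# Hypothetical toy parameters (not from the paper): distinct primes, B > 4N, A > 2N.
BASE_A = [13, 17, 19, 23]
BASE_B = [29, 31, 37, 41]
N_DEMO = 40001

def crt(residues, base):
    """Exact conversion from RNS to an integer in [0, prod(base))."""
    M = prod(base)
    return sum((r * pow(M // m, -1, m) % m) * (M // m)
               for r, m in zip(residues, base)) % M

def rns_mont_mul(x, y, N, base_a, base_b):
    """w = x*y*B^{-1} (mod N), w < 2N, following the step order of Algorithm 3
    but with exact base extensions in place of the approximate ones."""
    B = prod(base_b)
    s_b = [(x % b) * (y % b) % b for b in base_b]                   # Step 1
    s_a = [(x % a) * (y % a) % a for a in base_a]                   # Step 2
    u_b = [s * (-pow(N, -1, b)) % b for s, b in zip(s_b, base_b)]   # Step 3
    u = crt(u_b, base_b)                                            # Step 4 (exact BE)
    u_a = [u % a for a in base_a]
    w_a = [(s + ua * (N % a)) * pow(B, -1, a) % a                   # Steps 5-7
           for s, ua, a in zip(s_a, u_a, base_a)]
    return crt(w_a, base_a)                                         # Step 8 (exact BE)

w = rns_mont_mul(12345, 23456, N_DEMO, BASE_A, BASE_B)
```

Because $u < B$ and $xy < NB$, the integer $(xy + uN)/B$ is below $2N$, so it fits in base $\mathcal{A}$ and the final conversion returns it unchanged. The approximate base extensions discussed next avoid the costly reduction hidden inside `crt`.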
The main difference between [4] and [7] is in the BE
technique used. Both techniques are based on (1) and
avoid the modular reduction by $A$, which is computationally
intensive. Without the modular reduction, which
would give $x$ as a final result, the partial result is equal
to $x + \alpha A$ (or $x + \alpha B$, according to the base of origin).
In order to perform the correction, [4] evaluates $\alpha$ by
checking the most significant bits of the results of the
previous multiplication step. In [7], two BE techniques
are used: in the first one there is no correction; in the
second one (first proposed in [8]), a redundant base
element is used to evaluate $\alpha$ and to correct the result.
In the reorganized algorithm, the precomputed values
and the order of operations are modified as illustrated
in Table 2 (without considering the contribution of error
evaluation). The differences are:
1) The precomputed values $-N^{-1}$ and $B_i^{-1}$ in $\mathcal{B}$ are
substituted by $-N^{-1} B_i^{-1}$; hence, the multiplication
by $-N^{-1}$ in RNS MM is merged with the multiplication
by $B_i^{-1}$ in the subsequent BE, thus saving a
multiplication step.
2) The precomputed values $N$ and $B_i$ are substituted
by $B_i N$; the multiplication by $N$ in RNS MM is
merged with the summation of multiplications by
$B_i$ in the previous BE and another multiplication
step is avoided.
3) The multiplication by $B_A^{-1}$ is split in two parts. One
part is merged with the summation of multiplications
by $B_i N$ in $\mathcal{A}$ in the first BE. The precomputed
value $B_i N$ is substituted by $B_i N B_A^{-1}$. Also $s$ in $\mathcal{A}$ is
multiplied by $B_A^{-1}$ in the first BE; thus, the number
of multiplication steps remains unchanged.
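Algorithm 5: First Base Extension proposed by Kawamura et al. [4]
Input: $u$ in $\mathcal{B}$, $\sigma = 0$
Output: $u$ in $\mathcal{A}$
Precomputation: $|B_i^{-1}|_{b_i}$ for $i = 1 \ldots k$, $B_i$ in $\mathcal{A}$ for $i = 1 \ldots k$, $-B$ in $\mathcal{A}$.
1: $q_i = |u_{b_i} B_i^{-1}|_{b_i}$ for $i = 1 \ldots k$
2: $c = \sigma$
3: $u_{a_j} = 0$ for $j = 1 \ldots k$
4: for $i = 1$ to $k$ do
5:   $c = c + \mathrm{trunc}(q_i) / 2^r$
6:   $f$† $= \lfloor c \rfloor$
7:   $c = c - f$
8:   $u_{a_j} = |u_{a_j} + q_i B_i + f(-B)|_{a_j}$ for $j = 1 \ldots k$
9: end for

Algorithm 6: First Reorganized Base Extension based on the algorithm by Kawamura et al. (KBE1)
Input: $s$ in $\mathcal{B}$, $\sigma = 0$, $\hat{\hat{s}}$ in $\mathcal{A}$
Output: $\hat{w}$ in $\mathcal{A}$
Precomputation: $|-N^{-1} B_i^{-1}|_{b_i}$ for $i = 1 \ldots k$, $B_i N B_A^{-1} A_j^{-1}$ in $\mathcal{A}$ for $i = 1 \ldots k$, $-N A_j^{-1}$, $B_A^{-1} A_j$ in $\mathcal{A}$.
1: $q_i = |s_{b_i} \cdot (-N^{-1} B_i^{-1})|_{b_i}$ for $i = 1 \ldots k$
2: $c = \sigma$
3: $\hat{w}_{a_j} = |\hat{\hat{s}}_{a_j} \cdot B_A^{-1} A_j|_{a_j}$ for $j = 1 \ldots k$
4: for $i = 1$ to $k$ do
5:   $c = c + \mathrm{trunc}(q_i) / 2^r$
6:   $f$† $= \lfloor c \rfloor$
7:   $c = c - f$
8:   $\hat{w}_{a_j} = |\hat{w}_{a_j} + q_i \cdot B_i N B_A^{-1} A_j^{-1} + f \cdot (-N A_j^{-1})|_{a_j}$ for $j = 1 \ldots k$
9: end for

† $f \in \{0, 1\}$
Fig. 3. The first BE presented in [4] and the reorganized BE (KBE1)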
4) The multiplication by $A_j^{-1}$ is not required, since all
the inputs in $\mathcal{A}$ are premultiplied by $A_j^{-1}$. However,
a multiplication by $A_j$ is required to reach the
correct value. This correction can be merged with the
multiplication by $B_A^{-1}$ and no additional multiplication
step is required. Hence, the precomputed
value $B_A^{-1}$ is substituted by $B_A^{-1} A_j$.
In the following, two versions of the reorganized al-
gorithm will be analyzed, by making reference to the
different BE approaches used by Kawamura et al. and by
Bajard et al. in [4] and [7], respectively. The algorithms
are presented by including the redundant base element
ar, even though it is involved only in the latter approach.
By analyzing the RNS MM algorithms in Fig. 2, it can
be observed that Algorithm 3, Step 1 corresponds to Step
1 of the reorganized algorithm (Algorithm 4). In Step
2 of Algorithm 4, the two inputs of the multiplication
are already premultiplied by $A_j^{-1}$; hence, the result is
equal to Step 2 in Algorithm 3, multiplied by $A_j^{-2}$. In
Algorithm 4, Steps 3, 5, 6 and 7 of Algorithm 3 are moved
into the first BE, together with the multiplication by $A_j$
that is required to correct the input.
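The merging of Steps 3, 5, 6 and 7 into the first BE can be checked with a short derivation on a base element $a_j$. The sketch below is a worked restatement under the assumption that $\alpha$ denotes the multiple of $B$ left by the uncorrected extension of $u$ and that $B^{-1}$ stands for $B_A^{-1}$; it only combines relations already given in the algorithms and in Table 2.

```latex
\begin{align*}
q_i &= \bigl| s_{b_i}\,(-N^{-1} B_i^{-1}) \bigr|_{b_i},
  \qquad u = \sum_{i=1}^{k} q_i B_i - \alpha B \\
w &\equiv (s + uN)\,B^{-1}
  \equiv s B^{-1} + \sum_{i=1}^{k} q_i\, B_i N B^{-1} - \alpha N \pmod{a_j} \\
\hat{w} = w A_j^{-1} &\equiv s A_j^{-1} B^{-1}
  + \sum_{i=1}^{k} q_i\, B_i N B^{-1} A_j^{-1} - \alpha N A_j^{-1} \pmod{a_j} \\
\hat{\hat{s}} = \hat{x}\hat{y} \equiv s A_j^{-2}
  &\;\Longrightarrow\; s A_j^{-1} \equiv \hat{\hat{s}}\, A_j \pmod{a_j} \\
\hat{w} &\equiv \hat{\hat{s}}\,(B^{-1} A_j)
  + \sum_{i=1}^{k} q_i\,(B_i N B^{-1} A_j^{-1})
  + \alpha\,(-N A_j^{-1}) \pmod{a_j}
\end{align*}
```

The last line contains exactly the three precomputed constants of the reorganized first BE, each used in a single multiplication per term.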
2.5 BE Approach by Kawamura et al.
The first BE proposed by Kawamura et al. in [4] (Algorithm 5)
and KBE1, i.e., the reorganized BE based on the
same approach (Algorithm 6), are shown in Fig. 3. KBE1
includes all the operations of Algorithm 5 as well as the
operations of Algorithm 3, Steps 3, 5, 6 and 7. While
the first BE proposed by Kawamura et al. only extends
the value $u$ from $\mathcal{B}$ to $\mathcal{A}$, KBE1 directly calculates $\hat{w}$ of
Algorithm 4, Step 3 from $s$ of Step 1 and $\hat{\hat{s}}$ of Step 2.
The result of the summation in the BE of $x$ from $\mathcal{A}$
to $\mathcal{B}$ is $\tilde{x} = x + \alpha A$, where $\alpha$ is an integer
with $0 \le \alpha < k$. The algorithm proposed by Kawamura
et al. calculates the approximate value of $\alpha$ as
$\tilde{\alpha} = \lfloor \sigma + \sum_{i=1}^{k} \mathrm{trunc}(q_i)/2^r \rfloor \simeq \lfloor \sum_{i=1}^{k} q_i/b_i \rfloor$; $\tilde{\alpha}$ is estimated by
accumulating the $\varrho$ most significant bits of $q_i$, cut
by trunc() and divided by $2^r$, where $\mathrm{trunc}(q_i) = q_i \wedge
(1_{(r)} \ldots 1_{(r-\varrho+1)} 0_{(r-\varrho)} \ldots 0_{(1)})_{(2)}$ and $\wedge$ denotes a bitwise
AND operation. The variable $\sigma$ represents the starting
value of the parameter used to calculate $\alpha$. In [4], two
theorems prove that with proper values of $\sigma$, with a low
$\varrho$ and by selecting the base elements so that $2^r$ is close to
$b_i$, the approximation does not introduce errors. In the
first BE, $\sigma = 0$ and the input is lower than $B$; therefore,
according to Theorem 2 in [4], the result of the BE is
$\tilde{x} < 2B$. In the second BE, $\sigma = 0.5$ and the input is
lower than $2N$; hence, according to Theorem 1 in [4],
the result of the BE is correct. With $r = 32$, $k = 33$ and
$\max(2^r - a_i, 2^r - b_i) < 2^{16}, \forall i$, it is possible to choose
$\varrho \ge 7$. The algorithm is the same for both the BEs, with
exchanged bases. This approach requires that $A \ge 4N$
and $B \ge 5N$.
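The accumulation of truncated values can be tried out with the Python sketch below, which follows the step order of Algorithm 5. Several choices are assumptions made for the illustration and are not taken from [4] or from this paper: the 16-bit toy moduli, the function name `kawamura_be`, and the representation of $c$ as an integer scaled by $2^r$ (so that taking the floor becomes extracting a carry bit). The letters `sigma` and `rho` stand for the initial value of $c$ and for the number of accumulated bits.

```python
from math import prod

R_BITS = 16  # r: bits per base element in this toy setting (assumption)
RHO = 8      # number of most significant bits kept by trunc() (assumption)
# Hypothetical source base, pairwise coprime and close to 2^16.
BASE_B = [65535, 65533, 65531, 65527]
# Hypothetical target base; only reductions modulo these values are needed.
BASE_A = [65521, 32749, 16381, 8191]

def kawamura_be(u_b, base_b, base_a, r, rho, sigma_scaled):
    """Extend u from base_b to base_a; sigma_scaled is sigma * 2^r."""
    B = prod(base_b)
    mask = ((1 << rho) - 1) << (r - rho)        # keeps the rho top bits of r
    q = [u * pow(B // b, -1, b) % b for u, b in zip(u_b, base_b)]   # Step 1
    c = sigma_scaled                             # Step 2 (c scaled by 2^r)
    out = [0] * len(base_a)                      # Step 3
    alpha_est = 0
    for q_i, b_i in zip(q, base_b):              # Step 4
        c += q_i & mask                          # Step 5
        f = c >> r                               # Step 6: f is 0 or 1
        c &= (1 << r) - 1                        # Step 7
        alpha_est += f
        B_i = B // b_i
        out = [(o + q_i * B_i - f * B) % a for o, a in zip(out, base_a)]  # Step 8
    return out, alpha_est

x = 1234567890123456789                          # below half of prod(BASE_B)
residues = [x % b for b in BASE_B]
exact, alpha_half = kawamura_be(residues, BASE_B, BASE_A, R_BITS, RHO, 1 << (R_BITS - 1))
loose, alpha_zero = kawamura_be(residues, BASE_B, BASE_A, R_BITS, RHO, 0)
```

With the starting value 0.5 and an input below half of the dynamic range, the truncation and modulus errors in this toy setting are far smaller than the margin, so the extension is exact. With the starting value 0, the estimate can fall one unit short, so the result is either $x$ or $x + B$ on the new base.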
Algorithm 5 starts with the multiplication of each
partial result $u_{b_i}$ by the precomputed value $B_i^{-1}$, modulo
$b_i$. In KBE1, the partial result used in the corresponding
multiplication is $s_{b_i}$ and the precomputed value is
$-N^{-1} B_i^{-1}$. The result of KBE1 Step 1 is the same as that of
Algorithm 5, Step 1, by the associative property. In
Algorithm 5, Step 3 corresponds to an initialization.
In KBE1, Step 3 executes in $\mathcal{A}$ the multiplication of $\hat{\hat{s}}$
by the precomputed value $B_A^{-1} A_j$, in order to reach a
result that can be added to the summation, providing
a value ready for the BE. In both algorithms,
Steps 5 to 7 evaluate the contribution of each
partial result $q_i$ to $\alpha$. In Algorithm 5, Step 8 requires
$k^2$ multiplications, corresponding to the summation of
the results of the multiplication of each partial result
$q_i$ by the corresponding precomputed value $|B_i|_{a_j}$, per
base element $a_j$, and to the sum of the precomputed
value $-B$, when the floor $f$, used to evaluate $\alpha$, is equal
to 1. In the corresponding step of KBE1, $B_i N B_A^{-1} A_j^{-1}$
and $-N A_j^{-1} \equiv -B N B_A^{-1} A_j^{-1}$ are used instead of $B_i$
and $-B$, respectively. By the associative and distributive
properties, KBE1 Step 3 provides the same result as
Algorithm 7: Second Base Extension proposed by Kawamura et al. [4]
Input: $w$ in $\mathcal{A}$, $\sigma = 0.5$
Output: $w$ in $\mathcal{B}$
Precomputation: $|A_j^{-1}|_{a_j}$ for $j = 1 \ldots k$, $A_j$ in $\mathcal{B}$ for $j = 1 \ldots k$, $-A$ in $\mathcal{B}$.
1: $\hat{w}_j = |w_{a_j} A_j^{-1}|_{a_j}$ for $j = 1 \ldots k$
2: $c = \sigma$
3: $w_{b_i} = 0$ for $i = 1 \ldots k$
4: for $j = 1$ to $k$ do
5:   $c = c + \mathrm{trunc}(\hat{w}_j) / 2^r$
6:   $f$† $= \lfloor c \rfloor$
7:   $c = c - f$
8:   $w_{b_i} = |w_{b_i} + \hat{w}_j A_j + f(-A)|_{b_i}$ for $i = 1 \ldots k$
9: end for

Algorithm 8: Second reorganized Base Extension based on Kawamura et al. approach (KBE2)
Input: $\hat{w}$ in $\mathcal{A}$, $\sigma = 0.5$
Output: $w$ in $\mathcal{B}$
Precomputation: $A_j$ in $\mathcal{B}$ for $j = 1 \ldots k$, $-A$ in $\mathcal{B}$.
1: $c = \sigma$
2: $w_{b_i} = 0$ for $i = 1 \ldots k$
3: for $j = 1$ to $k$ do
4:   $c = c + \mathrm{trunc}(\hat{w}_j) / 2^r$
5:   $f$† $= \lfloor c \rfloor$
6:   $c = c - f$
7:   $w_{b_i} = |w_{b_i} + \hat{w}_j A_j + f(-A)|_{b_i}$ for $i = 1 \ldots k$
8: end for

† $f \in \{0, 1\}$
Fig. 4. The second BE presented in [4] and the reorganized BE (KBE2)
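Algorithm 9: First Base Extension employed by Bajard et al. [5], [7]
Input: $u$ in $\mathcal{B}$
Output: $u$ in $\mathcal{A} \cup a_r$
Precomputation: $|B_i^{-1}|_{b_i}$ for $i = 1 \ldots k$, $B_i$ in $\mathcal{A} \cup a_r$ for $i = 1 \ldots k$.
1: $q_i = |u_{b_i} B_i^{-1}|_{b_i}$ for $i = 1 \ldots k$
2: $u_{a_j} = \left| \sum_{i=1}^{k} q_i B_i \right|_{a_j}$ for $j = 1 \ldots k$ and $j = r$

Algorithm 10: First reorganized BE based on the algorithm by Bajard et al. (BBE1)
Input: $s$ in $\mathcal{B} \cup a_r$, $\hat{\hat{s}}$ in $\mathcal{A}$
Output: $\hat{w}$ in $\mathcal{A}$, $w$ in $a_r$
Precomputation: $|-N^{-1} B_i^{-1}|_{b_i}$ for $i = 1 \ldots k$, $B_i N B_A^{-1} A_j^{-1}$ in $\mathcal{A}$ for $i = 1 \ldots k$, $B_A^{-1} A_j$ in $\mathcal{A}$; $|B_A^{-1}|_{a_r}$, $|B_i N B_A^{-1}|_{a_r}$ for $i = 1 \ldots k$.
1: $q_i = |s_{b_i} \cdot (-N^{-1} B_i^{-1})|_{b_i}$ for $i = 1 \ldots k$
2: $\hat{w}_{a_j} = \left| \hat{\hat{s}}_{a_j} B_A^{-1} A_j + \sum_{i=1}^{k} q_i \cdot (B_i N B_A^{-1} A_j^{-1}) \right|_{a_j}$ for $j = 1 \ldots k$
3: $w_r = \left| s_r B_A^{-1} + \sum_{i=1}^{k} q_i \cdot (B_i N B_A^{-1}) \right|_{a_r}$

Fig. 5. The first BE used in [7] and the reorganized BE (BBE1)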
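Algorithm 11: Second Base Extension employed by Bajard et al. [5], [7], [8]
Input: $w$ in $\mathcal{A} \cup a_r$
Output: $w$ in $\mathcal{B}$
Precomputation: $|A_j^{-1}|_{a_j}$ for $j = 1 \ldots k$, $|A_r^{-1}|_{a_r}$, $-A$ in $\mathcal{B}$, $A_j$ in $\mathcal{B}$ for $j = 1 \ldots k$, $|A_j|_{a_r}$ for $j = 1 \ldots k$.
1: $\hat{w}_j = |w_{a_j} A_j^{-1}|_{a_j}$ for $j = 1 \ldots k$
2: $w'_r = \left| \sum_{j=1}^{k} \hat{w}_j A_j \right|_{a_r}$
3: $\alpha = |(w'_r - w_r) A_r^{-1}|_{a_r}$
4: $w_{b_i} = \left| \sum_{j=1}^{k} \hat{w}_j A_j \right|_{b_i}$ for $i = 1 \ldots k$
5: $w_{b_i} = |w_{b_i} - A \cdot \alpha|_{b_i}$ for $i = 1 \ldots k$

Algorithm 12: Second reorganized BE based on the algorithm by Bajard et al. (BBE2)
Input: $\hat{w}$ in $\mathcal{A}$, $|w|_{a_r}$
Output: $w$ in $\mathcal{B}$
Precomputation: $|A_r^{-1}|_{a_r}$, $-A$ in $\mathcal{B}$, $A_j$ in $\mathcal{B}$ for $j = 1 \ldots k$, $|A_j|_{a_r}$ for $j = 1 \ldots k$.
1: $w'_r = \left| \sum_{j=1}^{k} \hat{w}_j A_j \right|_{a_r}$
2: $\alpha = |(w'_r - w_r) A_r^{-1}|_{a_r}$
3: $w_{b_i} = \left| \sum_{j=1}^{k} \hat{w}_j A_j \right|_{b_i}$ for $i = 1 \ldots k$
4: $w_{b_i} = |w_{b_i} - A \cdot \alpha|_{b_i}$ for $i = 1 \ldots k$

Fig. 6. The second BE used in [7] and the reorganized BE (BBE2)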
Algorithm 5, Step 1.
Fig. 4 shows the second BE algorithm proposed in [4]
and the related reorganized algorithm. Algorithm 7 is
the same as Algorithm 5, but with the bases switched.
KBE2 is the same as Algorithm 7, but without the first
step, since the input is already multiplied by $A_j^{-1}$.
2.6 BE Approach by Bajard et al.
The MM algorithm proposed by Bajard et al. in [7] (and
applied in the context of RNS ME in [5]) uses two BE
algorithms; Algorithm 9 in Fig. 5 shows the first one,
which does not perform the BE correction in order to
save time. Algorithm 11 in Fig. 6, originally proposed by
Shenoy and Kumaresan [8], calculates the correct result.
The error in the result of the first BE does not affect the
final result of the MM, but larger bases are required, i.e.,
$A, B > N(k + 2)^2$.
Algorithm 10 in Fig. 5 corresponds to the first reorganized
BE based on the approach by Bajard et al.
(BBE1) and includes the operations of Algorithm 9 and
the operations of Algorithm 3, Steps 3, 5, 6, and 7.
Algorithm 9 starts with the multiplication of each
partial result $u_{b_i}$ by the precomputed value $B_i^{-1}$, modulo
$b_i$. In BBE1, the partial result used in the corresponding
multiplication is $s_{b_i}$ and the precomputed value is
$-N^{-1} B_i^{-1}$. The result of BBE1 Step 1 is the same as
the result of Algorithm 5, Step 1, by the associative
property. In Algorithm 9, Step 2 requires $k^2$ multiplications,
corresponding to the summation of the results
of the multiplications of each partial result $q_i$ by the
corresponding precomputed value $|B_i|_{a_j}$, per base element
$a_j$. In the corresponding step of BBE1, $B_i N B_A^{-1} A_j^{-1}$
is used instead of $B_i$. The sum of the products of
these precomputed values by the corresponding
partial results is added to
the product of $\hat{\hat{s}}$ and the precomputed value $B_A^{-1} A_j$.
BBE1 provides the same result as the first step of the
second BE used by Bajard et al., by the associative and
distributive properties.

TABLE 3
Number of Modular Multiplications Required by the Considered RNS MM Algorithms
                               Kawamura      Bajard        Reorganized        Reorganized
                               et al. [4]    et al. [7]    (with KBE1/KBE2)   (with BBE1/BBE2)
Step 1, 3, and 4 of MM         $5k$          $5k$          $2k$               $2k$
First BE without correction    $k^2 + k$     $k^2 + k$     $k^2 + 2k$         $k^2 + 2k$
First BE correction            $k$           $0$           $k$                $0$
Second BE without correction   $k^2 + k$     $k^2 + k$     $k^2$              $k^2$
Second BE correction           $k$           $k$           $k$                $k$
Total without BE correction    $2k^2 + 7k$   $2k^2 + 7k$   $2k^2 + 4k$        $2k^2 + 4k$
Total                          $2k^2 + 9k$   $2k^2 + 8k$   $2k^2 + 6k$        $2k^2 + 5k$
The second BE used by Bajard et al. (Algorithm 11)
and the corresponding second reorganized BE (BBE2)
(Algorithm 12) are shown in Fig. 6. The value of $\alpha$ is
calculated by evaluating the difference between the correct
result and the extended result on a redundant base
element $a_r$, such that $\gcd(a_r, A) = 1$ and $\gcd(a_r, B) = 1$,
according to [8]. Algorithm 11, Steps 2 and 3 calculate the
difference between the result of the BE on $a_r$ and the
correct value of $x$ in $a_r$, which gives:
$$\alpha = \left| \frac{|x + \alpha A|_{a_r} - |x|_{a_r}}{A} \right|_{a_r} \qquad (2)$$
Algorithm 12 is the same as Algorithm 11 without Step
1, since the input of the BE is already multiplied by $A_j^{-1}$.
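The correction through a redundant base element can be sketched in Python as below, following the step order of Algorithm 11. The toy bases, the redundant modulus `A_R` and the function name `redundant_be` are assumptions for illustration. The sketch relies on the redundant modulus being coprime to $A$ and at least as large as the base size $k$, so that the multiple $\alpha < k$ is recovered without ambiguity.

```python
from math import prod

# Hypothetical toy parameters (not from the paper).
BASE_A = [13, 17, 19, 23]
BASE_B = [29, 31, 37, 41]
A_R = 8  # redundant base element: coprime to A and not smaller than k = 4

def redundant_be(w_a, w_r, base_a, a_r, base_b):
    """Extend w from base_a to base_b, using its residue w_r on a_r to find alpha."""
    A = prod(base_a)
    w_hat = [w * pow(A // a, -1, a) % a for w, a in zip(w_a, base_a)]   # Step 1
    total = sum(wh * (A // a) for wh, a in zip(w_hat, base_a))          # w + alpha*A
    w_r_ext = total % a_r                                               # Step 2
    alpha = (w_r_ext - w_r) * pow(A, -1, a_r) % a_r                     # Step 3
    return [(total - alpha * A) % b for b in base_b], alpha             # Steps 4-5

x = 54321  # any value below prod(BASE_A)
extended, alpha = redundant_be([x % a for a in BASE_A], x % A_R, BASE_A, A_R, BASE_B)
```

For brevity the sketch keeps the unreduced sum as one Python integer. A hardware-oriented version would instead accumulate it separately modulo $a_r$ and modulo each $b_i$, as the algorithm listing does.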
2.7 Analysis
This section presents an analysis focused on MM, which
corresponds to the most computationally intensive part
in the overall ME algorithm.
2.7.1 Number of modular multiplications
In [4] and [5], the metric used to evaluate the performance
of the proposed approach was based on the
number of modular multiplications needed (as shown
in Table 3). The reorganized RNS MM algorithms require
$3k$ fewer modular multiplications than the state-of-the-art
algorithms with the same BE technique. The
reorganized RNS ME requires $2k$ additional modular
multiplications. Considering an RSA scenario with an
TABLE 4
Multiplication Steps of the State-of-the-art MM
Multiplication step ID   Operation                     Number of multiplications   Type
1                        $s = xy$ in $\mathcal{B}$     $k$                         1
2                        $s = xy$ in $\mathcal{A}$     $k$                         3
3                        $u = s(-N^{-1})$              $k$                         1
4                        $q = u B_i^{-1}$              $k$                         1
5 ... k+4                $u = q_i B_i$                 $k^2$                       2
k+5                      $t = uN$                      $k$                         1
k+6                      $w = v B_A^{-1}$              $k$                         1
k+7                      $\hat{w} = w A_j^{-1}$        $k$                         1
k+8 ... 2k+7             $w = \hat{w}_j A_j$           $k^2$                       2

TABLE 5
Multiplication Steps of the Reorganized MM
Multiplication step ID   Operation                                      Number of multiplications   Type
1                        $s = xy$ in $\mathcal{B}$                      $k$                         1
2                        $\hat{\hat{s}} = \hat{x}\hat{y}$ in $\mathcal{A}$   $k$                    3
3                        $q = s(-N^{-1} B_i^{-1})$                      $k$                         1
4                        $\hat{w} = \hat{\hat{s}} B_A^{-1} A_j$         $k$                         3
5 ... k+4                $\hat{w} = q_i B_i N B_A^{-1} A_j^{-1}$        $k^2$                       2
k+5 ... 2k+4             $w = \hat{w}_j A_j$                            $k^2$                       2
exponent of size g = 1024 and k = 33, the maximum
number of modular multiplications would be reduced
from 4938450 to 4735566 (95:8%). Considering a smaller
base where g = 640 and k = 21, the maximum number of
modular multiplications would be reduced from 1319178
to 1238454 (93:9%). The percentage reduction is slight
better with a smaller base size, since the relative weight
of 3k (i.e., the reduction) over k2 (i.e., main contribution
to the total cost) is greater.
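The totals quoted in Section 2.7.1 can be cross-checked with a short Python sketch. The per-MM counts are read from Tables 4 and 5 (seven, respectively four, steps of k multiplications plus two groups of $k^2$). The number of MMs in the worst-case exponentiation, taken here as 2g + 2, is not stated in this section: it is an assumption, adopted because it reproduces the reported figures. Function names are illustrative.

```python
def mm_multiplications(k, reorganized):
    """Modular multiplications per RNS MM, without BE correction (Tables 4 and 5)."""
    linear_steps = 4 if reorganized else 7   # steps costing k multiplications each
    return linear_steps * k + 2 * k * k      # plus two groups of k^2 multiplications

def max_me_multiplications(g, k, reorganized):
    """Worst-case multiplications for an exponentiation with a g-bit exponent."""
    n_mm = 2 * g + 2                         # assumption: reproduces the quoted totals
    extra = 2 * k if reorganized else 0      # 2k additional multiplications of the reorganized ME
    return n_mm * mm_multiplications(k, reorganized) + extra

if __name__ == "__main__":
    for g, k in ((1024, 33), (640, 21)):
        old = max_me_multiplications(g, k, reorganized=False)
        new = max_me_multiplications(g, k, reorganized=True)
        print(g, k, old, new, f"{100 * new / old:.2f}%")
```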
2.7.2 Characterization of the modular multiplications
Table 4 and Table 5 provide a characterization of the
multiplications involved in the state-of-the-art and the
reorganized RNS MM and BE algorithms (without cor-
rection) in terms of multiplication steps.
Since each operation is performed at least on k base
elements, up to k cells can work in parallel, requiring
2k+7 multiplication steps for the multiplications shown
in Table 4, and only 2k + 4 for Table 5.
The multiplication steps can be classified from the perspective of pipelining and parallelization possibilities, in order to calculate the number of cycles required by each step. Multiplication step types are:
• Type 1, multiplication steps that cannot be parallelized or executed in pipeline, since they use as input the result of the previous operation, and their output is used as input by the subsequent one; the number of required cycles is $p$, where $p$ corresponds to the number of stages of the pipeline.
• Type 2, a group of multiplication steps that can be executed in parallel or pipelined; they require $p$ cycles for the last multiplication step and 1 cycle for the others; by considering $M$ as the number of parallel multipliers per cell, a group of $x$ multiplication steps requires $\lceil x/M \rceil + p - 1$ cycles.
• Type 3, multiplication steps that can be executed in parallel to others; they require zero cycles in parallel or pipelined architectures, 1 cycle otherwise ($\lfloor 1/(p+M-1) \rfloor$).
Without considering the BE correction, the state-of-the-art RNS MM involves: 6 Type 1 multiplication steps, 2 groups of k Type 2 multiplication steps, and 1 Type 3 multiplication step (ID 2). Therefore, the total number of cycles per RNS MM is $2\lceil k/M \rceil - 2 + \lfloor 1/(p+M-1) \rfloor + 8p$.
Comparing Table 5 to Table 4, it can be observed that the reorganized algorithm requires $4p$ fewer cycles than the state-of-the-art one, but it needs $\lfloor 1/(p+M-1) \rfloor$ additional cycles. Therefore, the improvement is directly linked to the number of pipeline stages and of parallel multipliers. As an example, by considering p = 3, M = 1 and k = 33 as in [4] (without BE correction), the reorganized RNS MM algorithm obtains a delay reduction of 13.63%. With a higher degree of pipelining, a larger reduction is achieved (e.g., 16.66% with p = 4 and M = 1, 19.23% with p = 5 and M = 1, etc.).
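The cycle model above can be written down directly. The following Python sketch encodes the three step types and counts the steps of each type as listed in Tables 4 and 5 (6 Type 1 and 1 Type 3 for the state-of-the-art MM; 2 Type 1 and 2 Type 3 for the reorganized one; two groups of k Type 2 steps in both cases). Function names are illustrative, and the BE correction is not modeled.

```python
def type1_cycles(p):
    return p                           # a full pass through the p-stage pipeline

def type2_cycles(x, p, M):
    return -(-x // M) + p - 1          # ceil(x / M) + p - 1

def type3_cycles(p, M):
    return 1 // (p + M - 1)            # 1 only when p = 1 and M = 1, otherwise 0

def mm_cycles(k, p, M, reorganized):
    """Cycles per RNS MM without BE correction."""
    n_type1, n_type3 = (2, 2) if reorganized else (6, 1)
    return (n_type1 * type1_cycles(p)
            + 2 * type2_cycles(k, p, M)
            + n_type3 * type3_cycles(p, M))

if __name__ == "__main__":
    for p in (3, 4, 5):
        old = mm_cycles(33, p, 1, reorganized=False)
        new = mm_cycles(33, p, 1, reorganized=True)
        print(p, old, new, f"{100 * (old - new) / old:.2f}% fewer cycles")
```

With k = 33 and M = 1, the sketch gives 88 and 76 cycles for p = 3, which corresponds to the reduction of about 13.6% discussed above.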
The reorganized RNS ME algorithm requires 2 additional multiplication steps (Algorithm 2, Steps 1 and 11), i.e., 2p additional cycles. By considering an RSA scenario, with g = 1024, p = 3, M = 1 and k = 33, the maximum number of cycles would be reduced from 180400 to 155806 (86.3%). Considering a smaller base where g = 640 and k = 21, the maximum number of cycles would be reduced from 82048 to 66670 (81.2%). The percentage reduction is better with a smaller base size, since the relative weight of 4p (i.e., the reduction) over k (i.e., the main contribution to the total cost) is greater.
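The exponentiation-level figures can be reproduced in the same way. The Python sketch below uses the closed form of the per-MM cycle count for M = 1 and p > 1 (8p + 2k - 2 for the state-of-the-art MM, 4p + 2k - 2 for the reorganized one). As before, the number of MMs per worst-case exponentiation (2g + 2) is an assumption that happens to reproduce the quoted values; it is not taken from this section.

```python
def mm_cycles(k, p, reorganized):
    """Per-MM cycles for M = 1 and p > 1 (Type 3 steps cost no cycles)."""
    return (4 if reorganized else 8) * p + 2 * k - 2

def max_me_cycles(g, k, p, reorganized):
    """Worst-case cycles for an exponentiation with a g-bit exponent."""
    n_mm = 2 * g + 2                       # assumption: reproduces the quoted totals
    extra = 2 * p if reorganized else 0    # two extra Type 1 steps of the reorganized ME
    return n_mm * mm_cycles(k, p, reorganized) + extra

if __name__ == "__main__":
    for g, k in ((1024, 33), (640, 21)):
        print(g, k, max_me_cycles(g, k, 3, False), max_me_cycles(g, k, 3, True))
```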
2.8 Extension to Other Cryptosystems and Remarks
The reorganization of RNS ME provides greater benefits
when the RNS bases are small. However, in cryptogra-
phy, the exponentiation is typically used for RSA, which
requires very large bases. The reorganization of the op-
erations internal to the RNS Montgomery reduction has
been also exploited in cryptosystems that require single
executions or sequences of Montgomery reductions, e.g.,
in elliptic curve cryptography [11], [12]. The RNS Mont-
gomery reduction has been recently exploited also in the
context of pairing [16], [17]. In these domains, the inter-
nal reorganization can produce the same reduction in
the number of cycles, which, nonetheless, have a higher
relative weight (as described in Sections 2.7.1 and 2.7.2).
Moreover, although the new representation decreases the
duration of each modular reduction, it introduces a fixed
delay proportional to the number of input and output
values. Furthermore, this representation requires that
the input of the RNS Montgomery reduction is double
hat (i.e., the product of two hat values). A different
input would require a different Montgomery reduction
algorithm or an additional multiplication step to adapt
it. Based on such considerations, the application of re-
organized algorithms to other cryptosystems seems ap-
pealing, though an extensive analysis would be required
to precisely characterize advantages and drawbacks.
3 ARCHITECTURAL STUDY
In [6], Nozaki et al. proposed a cell architecture suitable for the algorithm presented in [4]. The cell is matched to a single element per RNS base; hence, a set of k cells is required to execute the exponentiation. A cell can be matched to more than one element per base, in order to reduce the area (though increasing the delay). The cell is composed of a three-stage pipelined Modular Multiplier and Accumulator Unit (MMAU), a Cox unit (that evaluates and corrects the BE error) and some memory elements. The MMAU used by Nozaki et al. is also suitable for the approach proposed by Bajard et al.; nevertheless, in this case, an additional redundant cell matched to the redundant base element is used, instead of the Cox unit.
The architectural study starts with the analysis of the
cell architecture proposed in [6] (then extended to the
approach proposed by Bajard and Imbert). Then, the pos-
sible base element types are analyzed, and five new cell
architectures are presented. The study moves from an
extensive theoretical investigation based on equivalent
gates delay and area cost reported in [18], and produces
an exhaustive analysis based on synthesis results.
3.1 Base Elements
RNS offers the opportunity to select the base elements,
which strongly affect the computational effort required
by the reduction. In [19], the authors showed that with a modulus $a_i = 2^r - 2^h - 1$ and $h < \frac{r+1}{2}$, the modular reduction $|x|_{a_i}$, with $x < a_i^2$, can be reduced to:
$$x \equiv x_1 + x_2 + x_4 + 2^h (x_3 + x_4) \pmod{a_i} \qquad (3)$$
where $x_1 = x \bmod 2^r$, $x_2 = \lfloor x / 2^r \rfloor$, $x_3 = x_2 \bmod 2^{r-h}$ and $x_4 = \lfloor x_2 / 2^{r-h} \rfloor$.
This formula provides efficient base elements, which
nonetheless must be relatively prime and must satisfy
the algorithm size limitations. The form $a_i = 2^r - 2^h \pm 1$ requires the same computational effort as $a_i = 2^r - 2^h - 1$, but it can provide more base elements.
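Formula (3) can be checked numerically. The Python sketch below splits x as described above and returns the right-hand side of (3); the function name `reduce_special` and the small parameters in the usage note are illustrative choices, not values from the paper. The result is congruent to x but is not fully reduced.

```python
def reduce_special(x, r, h):
    """Right-hand side of (3) for the modulus a = 2^r - 2^h - 1."""
    x1 = x & ((1 << r) - 1)            # x mod 2^r
    x2 = x >> r                        # floor(x / 2^r)
    x3 = x2 & ((1 << (r - h)) - 1)     # x2 mod 2^(r-h)
    x4 = x2 >> (r - h)                 # floor(x2 / 2^(r-h))
    return x1 + x2 + x4 + ((x3 + x4) << h)
```

For example, with r = 8 and h = 3 (so that a = 247), `reduce_special(x, 8, 3) % 247` equals `x % 247` for every x smaller than 247 squared.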
Fig. 7. (a) Three-stage, (b) four-stage and (c) $\omega(c_i) \le 3$ MMAU architecture
A larger number of base elements can be reached by $a_i = 2^r - c_i$, with $c_i < 2^h$ and $h < \frac{r-1}{2}$. In [13], the authors showed that the modulo reduction of $x < 2^{2r}$ requires 2 multiplications and 3 additions. The partial modular reduction $y \equiv x \pmod{a_i}$ can be calculated by:
$$y = |x|_{2^r} + (x \gg r) \cdot c_i. \qquad (4)$$
With $x < 2^z$, $z > r$ and $c_i < 2^h$, it is $y < \max(2^{r+1}, 2^{z-r+h+1})$. Therefore, each iteration of this method can reach a reduction of $r - h - 1$ bits. In order to reach a larger reduction per step with input $x > 2^{2r}$, it is possible to calculate:
$$y = |x|_{2^r} + |x \gg r|_{2^r} \cdot c_i + (x \gg 2r) \cdot c_i^2. \qquad (5)$$
With $x < 2^z$, $z > 2r$ and $c_i < 2^h$, it is $y < \max(2^{r+h+2}, 2^{z-2r+2h+1})$. Therefore, each iteration can reach a reduction of $2r - 2h - 1$ bits. This approach, which is also used in the architecture proposed in [6], is more computationally expensive than the previous one, but provides a larger number of possible base elements.
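The two partial reductions (4) and (5) can be sketched in Python as follows. The function names and the small test parameters mentioned afterwards are illustrative; the sketch only shows that each step preserves the residue modulo $2^r - c_i$ while shortening the operand.

```python
def partial_reduce_4(x, r, c):
    """One step of (4) for the modulus a = 2^r - c."""
    mask = (1 << r) - 1
    return (x & mask) + (x >> r) * c

def partial_reduce_5(x, r, c):
    """One step of (5): folds two r-bit slices at once, at the cost of a product by c^2."""
    mask = (1 << r) - 1
    return (x & mask) + ((x >> r) & mask) * c + (x >> (2 * r)) * c * c
```

With r = 8, h = 3 and c = 5 (a = 251), for instance, a 20-bit input is brought below $2^{16}$ by one application of (4), and a 24-bit input is brought below $2^{15}$ by one application of (5), in agreement with the bounds stated above.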
An intermediate approach requires base elements compliant with $a_i = 2^r - c_i$, with $c_i < 2^h$, $h < \frac{r+1}{2}$ and the Hamming weight of $c_i$, denoted as $\omega(c_i)$, selected so that $\omega(c_i) < t$. The limit on the Hamming weight allows the modular reduction to be performed as a sequence of additions. According to [14], a modular reduction of $x < 2^{2r}$ requires $2\omega(c_i) + 2$ additions.
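The idea of replacing the multiplications by $c_i$ with shifted additions can be sketched as follows. This Python fragment is only an illustration of the principle under the assumption $c_i < 2^{r-1}$: it uses one shifted addition per set bit of $c_i$ and does not attempt to reproduce the exact schedule or the $2\omega(c_i) + 2$ addition count of [14]. All names are hypothetical.

```python
def shift_add_mul(v, c):
    """Multiply v by c using one shifted addition per set bit of c."""
    acc, shift = 0, 0
    while c:
        if c & 1:
            acc += v << shift
        c >>= 1
        shift += 1
    return acc

def reduce_shift_add(x, r, c):
    """Full reduction modulo a = 2^r - c using only shifts, additions and one subtraction."""
    mask = (1 << r) - 1
    while x >> r:                      # fold the high part until x fits on r bits
        x = (x & mask) + shift_add_mul(x >> r, c)
    a = (1 << r) - c
    return x - a if x >= a else x      # x < 2^r < 2a, so one subtraction suffices
```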
3.2 Three-stage MMAU Architecture
In this subsection, the MMAU architecture proposed in [6] is analyzed. The MMAU corresponds to a cell without the error correction, as shown in Fig. 7. This cell is characterized by three-stage pipelining (p = 3) and includes one multiplier (M = 1). The MMAU is divided into the units listed below.
• Multiplier Adder Unit (MAU), which performs unsigned multiplications and additions. It provides an output on $2r + \log_2 k$ bits.
• First Modular Reduction Unit (FMRU), which performs the partial modular reduction (5). It provides an output on $r + h + 1$ bits.
• Second Modular Reduction Unit (SMRU), which calculates the final result of the modular reduction by performing (4) and an addition. The final reduction requires checking the $(r + 1)$-th least significant bit. When it is equal to '1', the value is greater than $a_i$; hence, $-a_i$ is added in order to reach the right modulo. The output is represented on r bits. Although the final result is smaller than $2^r$, it can still be larger than $a_i$; nonetheless, according to [6], this does not affect the overall results.
The Type 1 multiplication steps are executed sequentially
by each unit, requiring 3 cycles. The Type 3 multiplica-
tion steps are executed in parallel to the Type 1 ones, ex-
ploiting the unused units. Therefore, they do not require
additional cycles. The groups of Type 2 multiplication
steps are executed by the MAU, one multiplication and
accumulation per cycle. After the last multiplication step,
2 cycles are required to reduce the result.
3.3 Proposed Four-stage MMAU Architecture
An efficient cell can be obtained by introducing redundancy, increasing the level of pipelining, and moving the accumulation from the MAU to the FMRU. By using a carry-save representation, the final adder of the MAU and of the FMRU can be removed. Based on theoretical observations, the SMRU can be split into two units, thus augmenting the degree of pipelining. The area and the delay of the four-stage architecture can be reduced by moving the addition operation into the FMRU, thus limiting the number of input lines of the MAU and of the FMRU. Fig. 7 shows the corresponding architecture, which is divided into:
• The Multiplier Unit (MU), which performs unsigned multiplications. It provides a carry-save output on $2r - 1$ bits.
• The First Modular Reduction and Adder Unit (FMRAU), which executes (4) and can accumulate the results or add other values; (5) is not applicable, since the input is too short. It provides a carry-save output on $r + h + \log_2 k$ bits.
• The SMRU, which computes (4). It provides a carry-save output on $r + 1$ bits.
• The Third Modular Reduction Unit (TMRU), which performs the final addition and reduction. It provides an output on r bits.
The area and the delay of the FMRAU are larger than in the previous cell architecture, since the result of the MU is redundant. However, the reduced number of bits attenuates the area increase. In order to execute the final reduction, three possible results are simultaneously calculated. One is the partial result of the reduction, one is the partial result added to $-a_i$, and the last is the partial result added to $-2a_i$. The most significant bits of the partial results are checked, in order to select the correct one.
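The selection among the three candidate results can be modeled in software as follows. This Python sketch is a behavioral illustration only: it assumes that the partial result lies in $[0, 3a_i)$, and it replaces the inspection of the most significant bits with a sign test. The function name is hypothetical.

```python
def final_reduce(partial, a):
    """Select the fully reduced value among partial, partial - a and partial - 2a.

    Assumes 0 <= partial < 3 * a. In hardware the three candidates are computed
    simultaneously; here the first non-negative one, starting from the smallest, is kept.
    """
    candidates = (partial - 2 * a, partial - a, partial)
    return next(c for c in candidates if c >= 0)
```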
In this MMAU, the Type 1 multiplication steps are ex-
ecuted sequentially by each unit, requiring 4 cycles. The
Type 3 multiplication steps are executed in parallel to
the Type 1 multiplications, without requiring additional
cycles. The groups of Type 2 multiplication steps are
executed by the MU and accumulated by the FMRAU,
one per cycle. Lastly, 3 further cycles are required to
complete the groups of Type 2 multiplication steps.
The four-stage MMAU architecture is characterized by
a delay that is smaller than with the three-stage one, but
requires a larger area.
3.4 Proposed $\omega(c_i) \le 3$ MMAU Architecture
With a base compliant with $\omega(c_i) < 3$, it is possible to substitute the multiplications used for the reduction with additions. However, this strategy requires the management of negative numbers. In order to avoid additional delays, in the reduction units the carry-save representation is replaced with the borrow-save one, where the values are expressed by two binary unsigned numbers (and the standard representation can be obtained by subtracting the second from the first). The $\omega(c_i) \le 3$ MMAU architecture is divided into:
• The MU, which performs unsigned multiplications. It provides a carry-save output on $2r - 1$ bits.
• The FMRAU, which can transform a carry-save input into borrow-save representation, convert each 2r-bit input into 7 r-bit numbers, and accumulate the results or add other values. It provides a borrow-save output on $r + 4 + \log_2 k$ bits.
• The SMRU, which performs the final reduction. It produces a result on r bits.
The MMAU architecture is shown in Fig. 7. The 2r-bit numbers that represent the inputs of the FMRAU are transformed into borrow-save representation and, then, into 14 r-bit numbers by two special shifters. By considering a design without knowledge of the base elements (i.e., each cell can work with any base element respecting the constraints), the special shifter consists of some wired links and 6 barrel shifters (which could also be substituted by multiplexers). The negative special shifter is identical to the positive one, but the sign of the numbers is switched. After the special shifters, a borrow-save adder reduction tree is used to reduce the number of partial results and to accumulate the results or add other values. The result is normally represented in borrow-save representation on $r + 4$ bits. However, at the end of the accumulation, it requires $r + \log_2 k + 4$ bits. The SMRU is only able to reduce numbers on $r + 4$ bits; thus, when the result is larger, it is used again as input to the FMRAU, in order to reach the required length. The SMRU evaluates 4 bits, from r to r + 3, of the two outputs of the FMRAU, and adds/subtracts a proportional multiple of the base element. The partial result is equivalent to that of the four-stage architecture; hence, the same final reduction strategy is required.
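The borrow-save representation used in the reduction units can be illustrated with a small Python model. It captures only the value semantics described above (a pair of unsigned numbers whose difference is the represented value, with negation obtained by switching the two components); it does not model the carry-free, digit-level adders of an actual borrow-save tree. The class and method names are hypothetical.

```python
from typing import NamedTuple

class BorrowSave(NamedTuple):
    plus: int    # unsigned positive component
    minus: int   # unsigned negative component

    def value(self):
        # Standard representation: subtract the second component from the first
        return self.plus - self.minus

    def negate(self):
        # Switching the components changes the sign without any arithmetic
        return BorrowSave(self.minus, self.plus)

    def add(self, other):
        # Positive and negative components are accumulated separately
        return BorrowSave(self.plus + other.plus, self.minus + other.minus)
```

A term to be subtracted can thus be accumulated as `acc.add(term.negate())`, so that no sign handling is needed until the final conversion.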
Fig. 8. Cox unit (left: 3-unit cell; right: 4-unit cell)
This MMAU architecture can also be used when the base elements are known in advance; in this case the barrel shifters are substituted by 2-input multiplexers. Thanks to the resulting delay reduction in the FMRAU, the addition used to evaluate the 4 bits in the SMRU can be directly executed in the FMRAU.
3.5 Proposed $\omega(c_i) \le 2$ MMAU Architecture
With a base compliant with $\omega(c_i) < 2$, it is possible to reduce the area and the delay of the modular reduction units. As with $\omega(c_i) < 3$, negative numbers have to be managed. The MMAU architecture is divided into the three units below.
• The MU, which performs unsigned multiplications. It provides a carry-save output on $2r - 1$ bits.
• The FMRAU, which can transform a carry-save input into borrow-save representation, convert each 2r-bit input into 4 r-bit numbers, and accumulate the results or add other values. It provides a borrow-save output on $r + 2 + \log_2 k$ bits.
• The SMRU, which performs the final reduction. It produces a result on r bits.
The $\omega(c_i) \le 2$ MMAU architecture is close to the $\omega(c_i) \le 3$ one, but smaller and faster. Only 3 bits of the result of the borrow-save reduction tree are used to select the value to add in the SMRU. This control is directly executed by the FMRAU, in order to reduce the delay of the SMRU. When there is no knowledge of base elements, 2 barrel shifters are required. Otherwise, it is possible to substitute them with 2-input multiplexers.
3.6 Higher Level of Pipelining
The proposed cell architectures are composed of macro pipeline stages. Each macro stage has a clear mathematical function, and the stages of each cell architecture have been balanced by considering the equivalent gate delay. Therefore, they represent a technology-independent and functionally atomic solution. It could be possible to further improve the time performance with a higher level of pipelining. In the literature, there are examples of RNS arithmetic cells designed for FPGA with a higher level of pipelining [11], [12], [16], [17]. However, a higher level of pipelining would increase the area. Moreover, the identification of the best number of pipeline stages is an optimization problem that is beyond the scope of this work, and its result would be strongly dependent on technological choices.
Fig. 9. Redundant cell matched to $a_r$
In order to provide the reader with a rough idea about
the effects of a higher pipelined cell, an experimental
analysis has been carried out, by splitting each macro
stage of the Four-stage MMAU Architecture in 2 sub-
stages. The corresponding eight-stage MMAU would
give a reduction in delay equal to 10.3% and an increase
of 16.6% in area with respect to the original one.
3.7 BE Correction (Approach by Kawamura et al.)
The approach proposed in [4] requires a Cox unit, in order to perform the modular reduction. The Cox unit is composed of an adder, a register and a set of AND gates. Fig. 8 shows the Cox unit connected to the MAU of the three-stage MMAU on the left, and to the FMRAU of the four-stage MMAU on the right. In the $\omega(c_i) \le 3$ and $\omega(c_i) \le 2$ MMAU architectures, it can be connected to the FMRAU reduction tree. The size of the adder ($\rho$) can be set to 9, with r = 32, h = 11 and k = 33.
TABLE 6
Area and Delay of the Units in the Designed Cells with k = 33, r = 32 and h = 11
Architecture                                   BE    Unit    Unit area (μm²)   Total area‡ (μm²)   Delay (ns)
Three-stage                                    [4]   MAU     8381†             33142†              1.51
                                                     FMRU    4477                                  1.15
                                                     SMRU    2543                                  1.15
                                               [7]   MAU     8004              32535               1.42
                                                     FMRU    4477                                  1.15
                                                     SMRU    2543                                  1.15
Four-stage                                     [4]   MU      7285              39372               0.90
                                                     FMRAU   8124                                  0.86
                                                     SMRU    4121                                  0.88
                                                     TMRU    2101                                  0.76
                                               [7]   MU      7285              38398               0.90
                                                     FMRAU   7380                                  0.88
                                                     SMRU    4121                                  0.88
                                                     TMRU    2101                                  0.76
$\omega(c_i) \le 3$ without base knowledge     [4]   MU      7285              40616               0.90
                                                     FMRAU   12002                                 1.06
                                                     SMRU    3588                                  1.05
                                               [7]   MAU     7285              39755               0.90
                                                     FMRAU   11371                                 1.08
                                                     SMRU    3588                                  1.05
$\omega(c_i) \le 3$ with base knowledge        [4]   MU      7285              34885               0.90
                                                     FMRAU   6573                                  0.85
                                                     SMRU    3286                                  0.87
                                               [7]   MAU     7285              34963               0.90
                                                     FMRAU   6881                                  0.89
                                                     SMRU    3286                                  0.87
$\omega(c_i) \le 2$ without base knowledge     [4]   MU      7285              35037               0.90
                                                     FMRAU   6510                                  0.83
                                                     SMRU    3501                                  0.87
                                               [7]   MAU     7285              34395               0.90
                                                     FMRAU   6098                                  0.87
                                                     SMRU    3501                                  0.87
$\omega(c_i) \le 2$ with base knowledge        [4]   MU      7285              33050               0.90
                                                     FMRAU   4523                                  0.75
                                                     SMRU    3501                                  0.87
                                               [7]   MAU     7285              32071               0.90
                                                     FMRAU   3774                                  0.72
                                                     SMRU    3501                                  0.87
† The Cox cell (163 μm²) is included.
‡ Including 17741 μm² and 17511 μm² for the memory required by the state-of-the-art approaches, with [4] and [5] BEs, respectively. The reorganized approach saves 230 μm².
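The composition of the total cell area in Table 6 can be verified with a few lines of Python: each total is the sum of the unit areas plus the memory area given in the table footnote. Only the three-stage and four-stage rows are included here for brevity; the dictionary layout and the labels "Kawamura" and "Bajard" (for the two BE columns of the table) are illustrative.

```python
# Memory area per cell in square micrometers, from the footnote of Table 6
MEMORY_UM2 = {"Kawamura": 17741, "Bajard": 17511}

# (architecture, BE) -> (unit areas, reported total), all in square micrometers
CELLS = {
    ("three-stage", "Kawamura"): ([8381, 4477, 2543], 33142),
    ("three-stage", "Bajard"): ([8004, 4477, 2543], 32535),
    ("four-stage", "Kawamura"): ([7285, 8124, 4121, 2101], 39372),
    ("four-stage", "Bajard"): ([7285, 7380, 4121, 2101], 38398),
}

def total_area(unit_areas, be):
    """Total cell area: units plus the memory required by the state-of-the-art algorithm."""
    return sum(unit_areas) + MEMORY_UM2[be]

if __name__ == "__main__":
    for (arch, be), (units, reported) in CELLS.items():
        print(arch, be, total_area(units, be), reported)
```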
3.8 Error correction (Approach by Bajard et al.)
The approach used in [5] requires a redundant cell matched to $a_r$. This cell calculates $\beta$, which is multiplied by A on the other cells.
Fig. 9 shows the redundant cell, which is composed of a Multiplier Unit (MU) and an Adder Unit (AU). Since this cell works on a number of bits smaller than the other cells, two multiplications can be processed in parallel and added in one step. Moreover, as proposed in [5], $a_r$ can be a power of 2; hence, the modular reduction is immediate. The area overhead is similar to that of the approach proposed in [4]. The BE correction requires an additional step. However, as suggested in [5], it is possible to avoid a multiplication by using tables, but the result must then be added through an additional input line.
3.9 Analysis of Cell Architectures
All the described cell architectures have been synthesized using the Nangate 45nm Open Cell Library with Synopsys Design Compiler. Table 6 shows the delay and area cost of the cells by considering r = 32, k = 33 and h = 11, as in [5] and [6]. The delay refers to the combinational logic of each unit involved in the considered cell architectures. The area of each cell includes the MMAU, its registers and the RAM. The area is obtained by considering the RAM required for the state-of-the-art algorithms; when the reorganized algorithms are used, 230 μm² can be saved. All the precomputed values are considered as updatable. All the cells that use the Kawamura et al. BE approach include an additional input for the correction. However, only the three-stage cells directly include the Cox unit, while for the other architectures the Cox unit is a separate cell. In the latter case, the area of the Cox unit is 168 μm² and its delay is 0.18 ns. The $a_r$ cell is always separate. Its area is 17043 μm², while its delay is 0.49 ns and 0.39 ns for the MU and the AU, respectively.
In the three-stage cell, the MAU is the slowest unit. The version with the Kawamura et al. BE approach is slower, since the output of the Cox cell is calculated and added in the MAU during the same cycle. In the other cell architectures, the Cox output is added in the FMRAU; hence, it is calculated by the Cox unit during a cycle, and added during the subsequent one. In this way, the effects of the Cox unit on the delay are reduced. The architectural improvements applied in the four-stage architecture reduce the delay of all the units, but increase the area. The FMRAU is the slowest unit of the $\omega(c_i) \le 3$ MMAU without knowledge of the base elements. This is mainly due to the presence of the barrel shifters, which also contribute to the large area. In the $\omega(c_i) \le 3$ MMAU with knowledge of base elements, the slowest unit is the MU, as in the four-stage MMAU. This is due to the removal of the barrel shifters from the FMRAU. Even in the $\omega(c_i) \le 2$ MMAU the slowest unit is the MU, thanks to the smaller borrow-save adder reduction tree.
4 OVERALL COMPARISON
The comparison of all the algorithms and cell architec-
tures is summarized in Table 7. The area cost corre-
sponds to the k cells and the BE correction cell. The
configuration used is k = 33, r = 32, h = 11 and M = 1.
The reorganized algorithm provides a reduction in the number of cycles, and a small reduction in the RAM size, thanks to the lower number of precomputed values. When compared to the state-of-the-art algorithm, with the Kawamura et al. BE and the three-stage cell, it provides a 13.6% reduction in delay, and a 0.6% reduction in area.
The impact of the BE technique depends on the architecture considered. The technique proposed by Bajard et al. always requires one cycle more than the one by Kawamura et al.; however, the Kawamura et al. approach often increases the delay, especially in the three-stage cell. With the exception of the three-stage cell, where the Bajard et al. approach is faster, the delay is always similar. Even the area is not strongly affected by the BE approach.
The four-stage cell provides a reduction in delay, but involves a moderate increase in area. With the state-of-the-art algorithm and the Kawamura et al. BE, it reaches a 29.3% reduction in delay, and an 18.3% increase in area. The eight-stage version of this cell reaches a 36.5% reduction in delay, and a 37.9% increase in area.
TABLE 7
Area and Delay of the Cells, with k = 33, r = 32, h = 11 and M = 1
Algorithm     BE    Architecture          Base knowledge   Max k   Cycles   Cycle delay (ns)   MM delay (ns)   Max ME delay (ns)   Cells area (μm²)
[4], [7]      [4]   Three-stage           No               > 42    88       1.73               153             312103 (100.0%)     1099266 (100.0%)
Reorganized   [4]   Three-stage           No               > 42    76       1.73               132             269545 (86.4%)      1091676 (99.4%)
[4], [7]      [7]   Three-stage           No               > 42    89       1.64               146             299228 (95.9%)      1113303 (101.3%)
Reorganized   [7]   Three-stage           No               > 42    77       1.64               127             258884 (83.0%)      1105713 (100.6%)
[4], [7]      [4]   Four-stage            No               > 42    96       1.12               108             220425 (70.7%)      1299444 (118.3%)
Reorganized   [4]   Four-stage            No               > 42    80       1.12               90              183689 (58.9%)      1291854 (117.6%)
[4], [7]      [7]   Four-stage            No               > 42    97       1.12               109             222721 (71.4%)      1284177 (116.9%)
Reorganized   [7]   Four-stage            No               > 42    81       1.12               91              185985 (59.6%)      1276587 (116.2%)
[4], [7]      [4]   $\omega(c_i) \le 3$   No               42      90       1.28               116             236160 (75.6%)      1340496 (122.0%)
Reorganized   [4]   $\omega(c_i) \le 3$   No               42      78       1.28               100             204680 (65.6%)      1332906 (121.3%)
[4], [7]      [7]   $\omega(c_i) \le 3$   No               42      91       1.30               119             242515 (77.8%)      1328958 (120.9%)
Reorganized   [7]   $\omega(c_i) \le 3$   No               42      79       1.30               103             210543 (67.5%)      1321368 (120.3%)
[4], [7]      [4]   $\omega(c_i) \le 3$   Yes              42      90       1.12               101             206640 (66.3%)      1151373 (104.8%)
Reorganized   [4]   $\omega(c_i) \le 3$   Yes              42      78       1.12               88              179095 (57.4%)      1143783 (104.1%)
[4], [7]      [7]   $\omega(c_i) \le 3$   Yes              42      91       1.12               102             208936 (67.0%)      1170822 (106.6%)
Reorganized   [7]   $\omega(c_i) \le 3$   Yes              42      79       1.12               89              181391 (58.2%)      1163232 (105.9%)
[4], [7]      [4]   $\omega(c_i) \le 2$   No               20†     90       1.12               101             206640 (66.3%)      1156389 (105.2%)
Reorganized   [4]   $\omega(c_i) \le 2$   No               20†     78       1.12               88              179095 (57.4%)      1148799 (104.6%)
[4], [7]      [7]   $\omega(c_i) \le 2$   No               20†     91       1.12               102             208936 (67.0%)      1152078 (104.9%)
Reorganized   [7]   $\omega(c_i) \le 2$   No               20†     79       1.12               89              181391 (58.1%)      1144488 (104.2%)
[4], [7]      [4]   $\omega(c_i) \le 2$   Yes              20†     90       1.12               101             206640 (66.3%)      1090818 (99.3%)
Reorganized   [4]   $\omega(c_i) \le 2$   Yes              20†     78       1.12               88              179095 (57.4%)      1083228 (98.5%)
[4], [7]      [7]   $\omega(c_i) \le 2$   Yes              20†     91       1.12               102             208936 (67.0%)      1075386 (97.9%)
Reorganized   [7]   $\omega(c_i) \le 2$   Yes              20†     79       1.12               89              181391 (58.2%)      1067796 (97.2%)
† $\omega(c_i) \le 2$ is not compliant with k = 33, but the results are nevertheless presented with k = 33 in order to ensure a fair comparison.
The $\omega(c_i) \le 3$ cell without knowledge of base elements performs worse than the four-stage cell with respect to both delay and area cost, whereas with knowledge of base elements, it achieves better results. Therefore, this approach can be effectively applied only if each cell is specifically designed for 2 specific base elements. With respect to the three-stage cell, the area is slightly larger.
The $\omega(c_i) \le 2$ cell without knowledge of base elements provides area and delay performance close to those of the $\omega(c_i) \le 3$ cell with knowledge of base elements. With knowledge of base elements, the area is further reduced, and it is even smaller than with the three-stage cell. However, the constraint on the base size ($k \le 10$) limits the width of the representable numbers to 319 bits. Therefore, this approach cannot be applied to a 1024-bit modulus.
Based on the above analysis, the following observations can be made. The reorganized algorithm is more efficient than the state-of-the-art one. By using efficient cells, the BE approach does not strongly affect performance. For inputs larger than 1343 bits or, when the cells are designed independently of the base, larger than 319 bits, the four-stage cell provides the best results. When numbers between 1343 and 320 bits have to be managed and the knowledge of base elements is available, the $\omega(c_i) \le 3$ architecture should be considered. When numbers to be represented require less than 320 bits, the best cells are the $\omega(c_i) \le 2$ ones.
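The observations above can be condensed into a small decision helper. The following Python sketch simply restates the thresholds given in the text for r = 32 (319 and 1343 bits); the function name and the returned labels are illustrative, and the sketch is not a substitute for the full area and delay comparison of Table 7.

```python
def suggest_cell(bits, base_knowledge):
    """Suggest a cell architecture for numbers of the given width (thresholds for r = 32)."""
    if bits < 320:
        return "omega<=2"        # best cells when fewer than 320 bits are needed
    if bits <= 1343 and base_knowledge:
        return "omega<=3"        # worthwhile only with knowledge of the base elements
    return "four-stage"          # larger inputs, or cells designed independently of the base

if __name__ == "__main__":
    for bits, known in ((256, False), (1024, True), (1024, False), (2048, True)):
        print(bits, known, suggest_cell(bits, known))
```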
5 CONCLUSION
In this paper, an algorithmic and architectural study on RNS ME has been presented. The opportunities that come from the reorganization of the operations and of precomputations have been investigated. In the architectural study, five new cell architectures have been presented, by exploiting well-known design techniques emerging from computer arithmetic and by considering efficient RNS bases. All the algorithms and the cell architectures have been compared. Results have shown that, by rearranging operations and preprocessing, the delay can be significantly reduced. Moreover, the analysis carried out on the cells for efficient bases has shown that their performance is particularly interesting only when their constraints are strict, hence limiting their applicability. The overall analysis should provide the reader with a comprehensive overview of algorithmic and architectural approaches that could be considered for performing RNS ME, by suggesting solutions to be possibly considered depending on the particular constraints set by the application scenario.
ACKNOWLEDGMENTS
The authors wish to thank the anonymous reviewers
for providing insightful feedback and giving valuable
suggestions, which definitely helped to improve the
overall quality of this work.
REFERENCES
[1] N. Szabo and R. Tanaka, Residue arithmetic and its applications to
computer technology. McGraw-Hill, 1967.
[2] P. Montgomery, “Modular multiplication without trial division,”
Mathematics of computation, vol. 44, no. 170, pp. 519–521, 1985.
[3] K. Posch and R. Posch, “Modulo reduction in residue number
systems,” Parallel and Distributed Systems, IEEE Transactions on,
vol. 6, no. 5, pp. 449–454, May 1995.
[4] S. Kawamura, M. Koike, F. Sano, and A. Shimbo, “Cox-rower
architecture for fast parallel Montgomery multiplication,” in Ad-
vances in Cryptology EUROCRYPT 2000, ser. LNCS. Springer
Berlin / Heidelberg, 2000, pp. 523–538.
[5] J.-C. Bajard and L. Imbert, “A full RNS implementation of RSA,”
Computers, IEEE Transactions on, vol. 53, no. 6, pp. 769–774, June
2004.
[6] H. Nozaki, M. Motoyama, A. Shimbo, and S. Kawamura, “Imple-
mentation of RSA algorithm based on RNS Montgomery multi-
plication,” in Cryptographic Hardware and Embedded Systems CHES
2001, ser. LNCS. Springer Berlin / Heidelberg, 2001, pp. 364–376.
[7] J.-C. Bajard, L. S. Didier, and P. Kornerup, “Modular multipli-
cation and base extensions in residue number systems,” in 15th
IEEE Symposium on Computer Arithmetic, 2001, pp. 59–65.
[8] A. Shenoy and R. Kumaresan, “Fast base extension using a
redundant modulus in RNS,” Computers, IEEE Transactions on,
vol. 38, no. 2, pp. 292–297, Feb 1989.
[9] F. Gandino, F. Lamberti, J.-C. Bajard, and P. Montuschi, “A general
approach for improving RNS Montgomery exponentiation using
pre-processing,” in ARITH ’11: Proceedings of the 2011 20th IEEE
Symposium on Computer Arithmetic, 25-27 July, 2011.
[10] ——, “Pre-processing in RNS Montgomery multiplication,” Tech.
Rep., 2010.
[11] N. Guillermin, “A high speed coprocessor for elliptic curve scalar
multiplications over Fp,” in Workshop on Cryptographic Hardware
and Embedded Systems 2010 (CHES 2010), ser. LNCS. Springer
Berlin / Heidelberg, 2010, pp. 48–64.
[12] ——, “A coprocessor for secure and high speed modular arithmetic,”
Cryptology ePrint Archive, Report 2011/354, 2011.
[13] J. Bajard, N. Meloni, and T. Plantard, “Efficient RNS bases for
cryptography,” in IMACS’05: World Congress: Scientific Computation,
Applied Mathematics and Simulation, July 2005.
[14] J. C. Bajard, M. Kaihara, and T. Plantard, “Selected RNS bases
for modular multiplication,” in ARITH ’09: Proceedings of the 2009
19th IEEE Symposium on Computer Arithmetic. Washington, DC,
USA: IEEE Computer Society, 2009, pp. 25–32.
[15] M. Pohst and H. Zassenhaus, Eds., Algorithmic algebraic number
theory. New York, NY, USA: Cambridge University Press, 1989,
ch. 2.2.5.
[16] S. Duquesne and N. Guillermin, “A FPGA pairing implementation
using the residue number system,” Cryptology ePrint Archive,
Report 2011/176, 2011.
[17] R. Cheung, S. Duquesne, J. Fan, N. Guillermin, I. Verbauwhede,
and G. Yao, “FPGA implementation of pairings using residue
number system and lazy reduction,” in Cryptographic Hardware
and Embedded Systems – CHES 2011, ser. Lecture Notes in Computer
Science, B. Preneel and T. Takagi, Eds. Springer Berlin /
Heidelberg, 2011, vol. 6917, pp. 421–441.
[18] F. Gandino, F. Lamberti, G. Paravati, J.-C. Bajard, and P. Montuschi,
“Investigation on cell architectures for RNS Montgomery
exponentiation,” Tech. Rep., 2011.
[19] H. Wu, “On modular reduction,” CACR, University of Waterloo,
Tech. Rep., 2000.
Filippo Gandino obtained his MS and PhD degrees
in Computer Engineering from Politecnico
di Torino, Italy, in 2005 and 2010, respectively.
He is currently a research fellow with the Dipartimento
di Automatica e Informatica, Politecnico di
Torino. His research interests include ubiquitous
computing, RFID, WSNs, security and privacy,
network modeling and digital arithmetic.
Fabrizio Lamberti graduated with a degree in
Computer Engineering in 2000 and received the
PhD degree in Computer Engineering in 2005
from Politecnico di Torino, Italy. Since 2006 he
has been an assistant professor at Politecnico
di Torino. He has published a number of papers
in the areas of digital arithmetic, mobile and
distributed computing, image processing, and
visualization. He has served as a program or
organization committee member for several conferences.
He is a member of the Editorial Advisory
Boards of international journals. He is a member of the IEEE and the IEEE
Computer Society.
Gianluca Paravati received the MS degree in
Electronic Engineering and the PhD degree
in Computer Engineering from Politecnico di
Torino, Italy, in 2007 and 2011, respectively. He
is currently a research fellow with the Dipartimento
di Automatica e Informatica, Politecnico
di Torino. His research interests include image
processing, computer graphics and distributed
architectures.
Jean-Claude Bajard received the PhD degree
in computer science in 1993 from the École
Normale Supérieure de Lyon (ENS), France. He
taught mathematics in high school from 1979
to 1990 and served as a research and teaching
assistant at the ENS in 1993. From 1994
to 1999, he was an assistant professor at the
Université de Provence, Marseille, France. From
1999 to 2009, he was a professor at the University
Montpellier 2 and a member of the ARITH
Team of the LIRMM. He is now a professor at the
Université Paris 6, France, and a member of the LIP6. His research
interests include computer arithmetic and cryptography.
Paolo Montuschi is a Professor of Computer
Engineering at Politecnico di Torino, Chair of the
Board for Financial External Affairs, Member of
the Board of Governors, and Deputy Chair of the
Control and Computer Engineering Department.
Previously, he served as Chair of Department
from 2003 to 2011, and as Chair or Member
of several Boards. He is currently serving as
associate editor of the IEEE Transactions on
Computers, as Member of the Board of Governors
of the IEEE Computer Society, Chair of the
Electronic Products and Services Committee and as a Member of the
Advisory Board of Computing Now. For over 20 years, he has been a
Member of the IEEE Computer Society, serving as: Chair and Member of
the Digital Library Operations Committee, Member-at-Large of the Computer
Society’s Publications Board, Member of an IEEE ad hoc Committee
for Quality of Conference Articles in IEEE-Xplore, and Member of
Conference Publications Operations Committee. He served as guest
and associate editor of the IEEE Transactions on Computers from 2000
to 2004. He was on the program committees for the 13th through 19th
IEEE Symposia on Computer Arithmetic and program co-chair of the
17th IEEE Symposium on Computer Arithmetic. In 2009, he served as a
co-guest editor for a special section on computer arithmetic in the IEEE
Transactions on Computers. He received a PhD in computer engineering
in 1989. His research interests are in computer arithmetic, computer
graphics, computer architectures, and electronic publications. Within the
Computer Society he is actively involved in opening the door to new
publication frameworks geared towards e-reading and mobile devices.
