Hierarchical Approach in RNS Base Extension for Asymmetric Cryptography by Djath, Libey et al.
HAL Id: hal-02096353
https://hal.archives-ouvertes.fr/hal-02096353
Submitted on 11 Apr 2019
Hierarchical Approach in RNS Base Extension for
Asymmetric Cryptography
Libey Djath, Karim Bigou, Arnaud Tisserand
To cite this version:
Libey Djath, Karim Bigou, Arnaud Tisserand. Hierarchical Approach in RNS Base Extension for
Asymmetric Cryptography. ARITH: 2019 IEEE 26th Symposium on Computer Arithmetic, Jun 2019,
Kyoto, Japan. hal-02096353
Libey Djath and Karim Bigou
Université de Bretagne Occidentale
Lab-STICC, UMR CNRS 6285
F-29200 Brest, France
libeyokonfu.djath@univ-brest.fr,
karim.bigou@univ-brest.fr
Arnaud Tisserand
CNRS
Lab-STICC, UMR 6285
Centre Recherche UBS, rue St Maudé, Lorient, France
arnaud.tisserand@univ-ubs.fr
Abstract—Base extension is a critical operation in RNS im-
plementations of asymmetric cryptosystems. In this paper, we
propose a new way to perform base extensions using a hierarchi-
cal approach for computing the Chinese remainder theorem. For
well chosen parameters, it significantly reduces the computational
cost and still ensures a high level of internal parallelism. We
illustrate the interest of the proposed approach on the cost of
typical arithmetic primitives used in asymmetric cryptography.
We also demonstrate improvements in FPGA implementations of
base extensions on typical elliptic curve cryptography field sizes
using high-level synthesis tools.
Index Terms—computer arithmetic; residue number system;
modular reduction; hardware implementation.
I. INTRODUCTION
Current asymmetric cryptosystems require an efficient sup-
port for arithmetic over large operands. For instance, RSA [1]
requires modular arithmetic over integers larger than 2000 bits
and elliptic curve cryptography (ECC) [2], [3] deals with finite
field elements larger than 200 bits. See [4] for references.
In these cryptographic applications, the residue number
system (RNS) [5], [6] is increasingly suggested to provide
more internal parallelism. RNS uses a set of small co-prime
moduli, called the base, to “split” some arithmetic operations
into independent and much smaller ones over the residues.
This independence leads to faster operations without carry
propagation between the moduli [7]. RNS can also help to
improve the security against some physical attacks [8].
In RNS, addition, subtraction and multiplication of large
integers are parallel operations. The Chinese remainder the-
orem (CRT) is often used for converting the residues into the
standard representation, see for instance [7, Chap. 3].
But operations such as comparison, division and modular
reduction are costly operations since RNS is a non-positional
representation. To avoid the conversion to the standard repre-
sentation for these operations, the use of base extensions (BE)
is often proposed [9].
To reduce the cost of RNS implementations, two main
directions have been explored: reducing the number of BEs
at application level (e.g., specific formulas for point addi-
tion/doubling in ECC [10]), or reducing the cost of a BE
(for instance [11]). This work deals with the second direction
and proposes a new BE algorithm. It uses a hierarchical
decomposition with partial applications of small CRTs. Our
algorithm, called hierarchical BE (HBE), reduces the
cost of modular arithmetic in RNS.
Definitions and notations are presented in Section II. Section
III briefly recalls related elements from the state of the art.
Our HBE algorithm is detailed in Section IV. The interest of
this algorithm is analyzed in Section V for a few asymmetric
cryptography applications. Section VI presents our FPGA
implementation results for ECC over $\mathbb{F}_P$ using high-level
synthesis (HLS). Finally, Section VII concludes the paper.
II. DEFINITIONS AND NOTATIONS
We use a 2-dimension notation for RNS bases and elements
instead of the usual 1-dimension one.
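• $|X|_m = X \bmod m$
• $\mathcal{A}$ is an RNS base of $n = r \times c$ moduli of $w$ bits:
$$\mathcal{A} = \begin{pmatrix} a_{1,1} & \cdots & a_{1,c} \\ \vdots & \ddots & \vdots \\ a_{r,1} & \cdots & a_{r,c} \end{pmatrix}$$
• $A_i = \prod_{j=1}^{c} a_{i,j}$ is the product of the $i$-th row moduli
• $A = \prod_{i=1}^{r} A_i$ is the product of all moduli of $\mathcal{A}$
• $\overline{A_i} = A/A_i$ is the product of all rows except the $i$-th one
• $\overline{a_{i,j}} = A_i/a_{i,j}$ is the product of all moduli in row $i$ except the $j$-th one
• The integer $X$ is represented in RNS base $\mathcal{A}$ by:
$$X_{\mathcal{A}} = \begin{pmatrix} x_{a_{1,1}} & \cdots & x_{a_{1,c}} \\ \vdots & \ddots & \vdots \\ x_{a_{r,1}} & \cdots & x_{a_{r,c}} \end{pmatrix} \quad \text{with } x_{a_{i,j}} = |X|_{a_{i,j}}$$
• $X_{A_i} = |X|_{A_i}$ and it can be computed from $(x_{a_{i,1}}, \cdots, x_{a_{i,c}})$ using the CRT
• $T_{a_{i,j}} = \left| \left( \frac{A}{a_{i,j}} \right)^{-1} \right|_{a_{i,j}}$
• Elementary binary operations with $(w, w')$-bit operands are: addition, subtraction, multiplication, modular multiplication (MM) and modular reduction (MR)
• Their costs are denoted:
– $C_{\mathrm{ADD}}(w, w')$ for a $(w \pm w')$-bit addition/subtraction
– $C_{\mathrm{MUL}}(w, w')$ for a $(w \times w')$-bit multiplication
– $C_{\mathrm{MR}}(w', w)$ for a $(w' \bmod w)$-bit MR
– $C_{\mathrm{MM}}(w, w)$ for a $(w \times w \bmod w)$-bit MM (in one $w$-bit RNS channel)
RNS base $\mathcal{B}$ and the related values $B_i$, $B$, $\overline{B_i}$, $\overline{b_{i,j}}$, $X_{\mathcal{B}}$, $X_{B_i}$ and $T_{b_{i,j}}$ are similarly defined.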
III. STATE OF THE ART
A. Residue Number System
In RNS [5], [6], the integer $X$ is represented by its residues, denoted $x_{a_{i,j}}$, modulo a set of coprime moduli, denoted $a_{i,j}$, for all $(i, j)$ in the RNS base $\mathcal{A}$. To convert $X$ from a standard positional representation to RNS, one computes $|X|_{a_{i,j}}$, possibly independently, for each modulus in $\mathcal{A}$. The reverse conversion is performed using the CRT formula:
$$X = \left| \sum_{i=1}^{r} \sum_{j=1}^{c} \left| x_{a_{i,j}} \times T_{a_{i,j}} \right|_{a_{i,j}} \times \frac{A}{a_{i,j}} \right|_{A} \tag{1}$$
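As an illustration of Eq. 1, the following minimal Python sketch converts an integer to a toy 2D RNS base and back, and performs a channel-wise multiplication. The function names (`to_rns`, `from_rns`, `mul_rns`) and the tiny moduli are assumptions chosen for readability; they are not parameters from the paper, where moduli are w-bit values.

```python
from math import prod

def to_rns(x, base):
    """Residues of x for a 2D base given as r rows of c pairwise coprime moduli."""
    return [[x % a for a in row] for row in base]

def from_rns(residues, base):
    """Reverse conversion using the CRT formula of Eq. 1."""
    A = prod(a for row in base for a in row)
    total = 0
    for res_row, mod_row in zip(residues, base):
        for x_a, a in zip(res_row, mod_row):
            A_over_a = A // a
            T = pow(A_over_a, -1, a)          # T = |(A / a)^-1|_a (Python 3.8+)
            total += ((x_a * T) % a) * A_over_a
    return total % A

def mul_rns(x_res, y_res, base):
    """Channel-wise multiplication: each residue pair is handled independently."""
    return [[(x * y) % a for x, y, a in zip(xr, yr, row)]
            for xr, yr, row in zip(x_res, y_res, base)]

base = [[13, 17], [19, 23]]                   # toy base, r = c = 2
X = 12345
print(from_rns(to_rns(X, base), base) == X)
```

The product of two integers is recovered correctly as long as it stays below A, the product of all moduli.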
With $X$ and $Y$ represented in RNS base $\mathcal{A}$, their addition, subtraction and multiplication, $X \diamond Y$ where $\diamond \in \{+, -, \times\}$, is performed independently on each residue by $|x_{a_{i,j}} \diamond y_{a_{i,j}}|_{a_{i,j}}$ for all $(i, j)$ in $\mathcal{A}$. Computations related to one modulus are performed in a channel (i.e., a $w$-bit datapath). Clearly, RNS offers a high level of parallelism for $\pm$ and $\times$ operations. But for other operations, especially comparison, division and MR, the situation is more complex. Their cost in RNS is higher than in a positional representation. Multiplication modulo a large integer $M$ is crucial for asymmetric cryptography ($M = P$ is prime for ECC over $\mathbb{F}_P$ and $M = PQ$ is the product of 2 primes for RSA). In RNS, the cost of this operation is mainly the cost of 2 successive base extensions (see Section V).
B. RNS Base Extension
A BE can be seen as a conversion from one RNS base $\mathcal{A}$ to a second one $\mathcal{B}$. If $\mathcal{B}$ is co-prime with $\mathcal{A}$, then the concatenation of both bases can be seen as an extension of $\mathcal{A}$.
In the literature, two main strategies are used for BEs: using an intermediate representation called mixed-radix system (MRS [5]) or computing Eq. 1 directly in base $\mathcal{B}$. MRS requires more operations and introduces strong data dependencies which limit its interest compared to CRT approaches. For CRT-based BEs, the main issue is to compute the reduction modulo $A$ from Eq. 1 directly in base $\mathcal{B}$; the other operations are just sums and products. Usually, one instead computes the CRT under the form:
$$X = \left( \sum_{i=1}^{r} \sum_{j=1}^{c} \left| x_{a_{i,j}} \times T_{a_{i,j}} \right|_{a_{i,j}} \times \frac{A}{a_{i,j}} \right) - hA \tag{2}$$
The issue is to compute $h$ (i.e., how many times $A$ should be subtracted to get the correct MR). The authors of [12] noticed that in some cryptographic applications, the reduction modulo $A$ can be skipped after the sum, leading to larger values (less than $rcA$ instead of $A$). [13] proposed to use an extra modulus to retrieve $h$. This method cannot be used in some situations but can be combined with [12] to fully implement RSA and ECC in RNS. A third approach, proposed in [14] and improved in [9], uses an approximation of $h$ in Eq. 2. In this paper, we focus on [9], which is the most used in state-of-the-art implementations. Our main idea can be adapted to other CRT-based BE methods.

Algorithm 1: Base Extension from [9] (KBE).
Input: $X_{\mathcal{A}}$, $\sigma = 0$ or $0.5$
Precomp.: $T_{a_{i,j}}$ $\forall i \in [1, r]$ and $\forall j \in [1, c]$
Output: $X_{\mathcal{B}}$
1  for $i$ from 1 to $r$ parallel do
2    for $j$ from 1 to $c$ parallel do
3      $\hat{x}_{a_{i,j}} \leftarrow \left| x_{a_{i,j}} \times T_{a_{i,j}} \right|_{a_{i,j}}$
4  for $i$ from 1 to $r$ do
5    for $j$ from 1 to $c$ do
6      $\sigma \leftarrow \sigma + \mathrm{trunc}(\hat{x}_{a_{i,j}}) / 2^w$
7      $h_{i,j} \leftarrow \lfloor \sigma \rfloor$
8      $\sigma \leftarrow \sigma - h_{i,j}$
9      for $k$ from 1 to $r$ parallel do
10       for $l$ from 1 to $c$ parallel do
11         $x_{b_{k,l}} \leftarrow \left| x_{b_{k,l}} + \hat{x}_{a_{i,j}} \times \left| \frac{A}{a_{i,j}} \right|_{b_{k,l}} + \left| -h_{i,j} A \right|_{b_{k,l}} \right|_{b_{k,l}}$
The BE proposed in [9], often denoted Kawamura BE or KBE in the literature, is described in Algo. 1 (with our 2D notations). One can see that lines 3 and 11 mainly compute the sum of products from Eq. 2. Lines 6–8 of KBE accumulate small $t$-bit values to compute $h_{i,j}$, and $h_{i,j} A$ is finally subtracted in line 11. The $h_{i,j}$ are 1-bit values and $\sum_{i=1}^{r} \sum_{j=1}^{c} h_{i,j}$ is $h$ or $h - 1$. The function $\mathrm{trunc}(x)$ keeps the $t$ MSBs of $x$ and sets all the others to 0. For asymmetric cryptography implementations, $t \in [4, 8]$ is very common. KBE can be used in 2 main modes:
• if the input $X < A/2$, choosing $\sigma = 0.5$ leads to $\sum_{i=1}^{r} \sum_{j=1}^{c} h_{i,j} = h$ and the output is exactly $X$ in base $\mathcal{B}$;
• if $X$ is close to $A$ with $X < A$, choosing $\sigma = 0$ leads to $\sum_{i=1}^{r} \sum_{j=1}^{c} h_{i,j} = h$ or $h - 1$ and the output is $X$ or $X + A$ in base $\mathcal{B}$.
Thus, KBE can be used to perform all computations in asymmetric cryptography. KBE is efficiently implemented using the cox-rower architecture introduced in [9] and depicted in Fig. 1. A rower unit performs all computations in one channel. All rowers, one per modulus in the base, operate in parallel. The single cox unit computes the appropriate reduction factor $h_{i,j}$ and distributes it to the rowers. The KBE algorithm and the cox-rower architecture have been used in several hardware implementations of asymmetric cryptosystems using RNS: e.g., [15] for RSA; [16] and [17] for ECC.
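The KBE steps can be sketched in Python as follows. This is a minimal software model under stated assumptions, not the hardware design of [9]: the bases are flattened to 1D lists, the 8-bit toy moduli and t = 6 are hypothetical values chosen so that the exactness condition of [9] holds, and σ is stored as an integer multiple of 2^-t.

```python
from math import prod

def kbe(x_a, base_a, base_b, w, t, sigma_half=True):
    """Extend residues x_a (in base_a) to base_b, following the steps of Algo. 1."""
    A = prod(base_a)
    # Lines 1-3: x_hat = |x * T|_a with T = |(A/a)^-1|_a
    x_hat = [(x * pow(A // a, -1, a)) % a for x, a in zip(x_a, base_a)]
    # sigma is kept as an integer scaled by 2^t (0.5 -> 2^(t-1))
    sigma = (1 << (t - 1)) if sigma_half else 0
    x_b = [0] * len(base_b)
    for xh, a in zip(x_hat, base_a):
        sigma += xh >> (w - t)        # line 6: add the t MSBs of x_hat
        h = sigma >> t                # line 7: h = floor(sigma), 0 or 1
        sigma -= h << t               # line 8: keep the fractional part
        # Lines 9-11: accumulate in every channel of base_b
        x_b = [(xb + xh * ((A // a) % b) - h * (A % b)) % b
               for xb, b in zip(x_b, base_b)]
    return x_b

# Toy parameters (assumptions): pairwise coprime 8-bit moduli, t = 6
w, t = 8, 6
base_a = [255, 253, 251, 247]
base_b = [241, 239, 233, 229]
A = prod(base_a)
X = A // 3                            # X < A/2, so sigma = 0.5 gives an exact BE
print(kbe([X % a for a in base_a], base_a, base_b, w, t) == [X % b for b in base_b])
```

With `sigma_half=False` and any X < A, the same function returns the residues of X or of X + A, which corresponds to the second mode described above.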
[Figure: block diagram with four rower units, one cox unit, a control unit (CTRL) and a memory; bus widths are labeled w, t and 1.]
Fig. 1. The cox-rower architecture from [16] inspired by the one from [9] (control signals are not represented).
C. Hierarchical Approaches in RNS
[18] and [19] propose to use RNS recursively: computations inside channels are performed in a small RNS base. This does not improve performance much compared to usual RNS, but it provides other properties such as protection against some physical attacks. [19] proposes small RNS channels
where all elementary operations are performed using fully
precomputed lookup tables (reducing information leakage).
In [20], [21], RNS bases with 3 or 4 moduli are used for
signal processing applications (on small values) where some
moduli are factorizable into smaller ones. To be efficient,
the moduli and their factorization must support fast MR
algorithms in the corresponding (sub)-channels.
In this paper, we do not use an RNS hierarchical represen-
tation, but we propose a hierarchical approach for the CRT
computation to reduce the cost of cryptographic applications.
There is no need for factorizable moduli, and all MRs are
performed using usual RNS w-bit moduli (see Sec. IV).
To improve the CRT computation in a general context,
hierarchical approaches with partial CRT have been proposed
to convert back from a set of residues to the corresponding
value in Z, see for instance [22], [23]. However, as far as we
know, our proposition is the first BE algorithm based on a
hierarchical approach for full RNS computations.
IV. PROPOSED ALGORITHM
A. New Base Extension Algorithm
Our Hierarchical Base Extension (HBE) is detailed in Algo. 2. As in the KBE algorithm, HBE mainly computes the CRT of Eq. 2 from the base $\mathcal{A}$ and the result is reduced modulo each element of base $\mathcal{B}$. Lines 1–3 are exactly the same as in KBE. HBE uses intermediate CRT computations (lines 4–7) independently for each row $i \in [1, r]$, on the residues $(x_{a_{i,1}}, \ldots, x_{a_{i,c}})$. The result at row $i$ is $\hat{X}_{A_i} \equiv X_{A_i} \times \left( \overline{A_i} \right)^{-1} \bmod A_i$. No modular reduction is performed at this step, thus $\hat{X}_{A_i} < c A_i$.
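Algorithm 2: Proposed hierarchical base extension HBE.
Input: $X_{\mathcal{A}}$, $\sigma = 0$ or $0.5$
Precomp.: $T_{a_{i,j}}$ $\forall i \in [1, r]$ and $\forall j \in [1, c]$
Output: $X_{\mathcal{B}}$
1  for $i$ from 1 to $r$ parallel do
2    for $j$ from 1 to $c$ parallel do
3      $\hat{x}_{a_{i,j}} \leftarrow \left| x_{a_{i,j}} \times T_{a_{i,j}} \right|_{a_{i,j}}$
4  for $i$ from 1 to $r$ parallel do
5    $\hat{X}_{A_i} \leftarrow 0$
6    for $j$ from 1 to $c$ do
7      $\hat{X}_{A_i} \leftarrow \hat{X}_{A_i} + \hat{x}_{a_{i,j}} \times \overline{a_{i,j}}$ (no reduction)
8  for $i$ from 1 to $r$ do
9    $\sigma \leftarrow \sigma + \mathrm{trunc}(\hat{X}_{A_i}) / 2^{w \times c}$
10   $h_i \leftarrow \lfloor \sigma \rfloor$
11   $\sigma \leftarrow \sigma - h_i$
12   for $k$ from 1 to $r$ parallel do
13     for $l$ from 1 to $c$ parallel do
14       $\hat{x}_{b_{k,l},i} \leftarrow \left| \hat{X}_{A_i} \right|_{b_{k,l}}$
15       $x_{b_{k,l}} \leftarrow \left| x_{b_{k,l}} + \hat{x}_{b_{k,l},i} \times \overline{A_i} + \left| -h_i A \right|_{b_{k,l}} \right|_{b_{k,l}}$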
The value $\hat{X}_{A_i}$ can be seen as a super-residue, corresponding to the super-modulus $A_i$.
The nested loops in lines 8–15 replace lines 4–11 of KBE. Lines 9–11 in HBE are an adaptation of the approximation method proposed in [9] to compute $h_i$ from our super-residues $\hat{X}_{A_i}$ ($h_i$ is $\lceil \log_2(c+1) \rceil$ bits wide). Similarly to the definition of $h_{i,j}$ for KBE, $\sum_{i=1}^{r} h_i = h$ or $h - 1$. In practice, it allows the computation of exact BEs under the same conditions as KBE. These lines could be removed, leading to a hierarchical version of the approximated BE proposed in [12].
Finally comes the innermost loop, at lines 14–15, with $c r^2$ iterations. Line 14 reduces $\hat{X}_{A_i}$ modulo $b_{k,l}$. Then, line 15 in HBE is strictly equivalent to line 11 in KBE. The factor $\overline{a_{i,j}}$ of $A/a_{i,j} = \overline{a_{i,j}} \times \overline{A_i}$, used at line 11 of KBE, is already integrated into $\hat{X}_{A_i}$. Then the multiplication by $\overline{A_i}$ completes the CRT computation in line 15 of HBE. However, line 11 in KBE is executed $c^2 r^2 = n^2$ times instead of only $c r^2$ times for HBE.
To summarize, the $c$ residues in a row are pooled to get one super-residue (of $cw + \lceil \log_2(c) \rceil$ bits) using a first CRT computation. Then a CRT computation converts the super-residues $\hat{X}_{A_i}$ into the second base using an approximation similar to the one from KBE. The number of MM$(w, w)$ (i.e., MMs inside channels) is divided by $c$. But HBE deals with multiplications of $w \times (c-1)w$ bits (line 7) and reductions from $cw + \lceil \log_2(c) \rceil$ to $w$ bits (see details in Section IV-B).
In [11], [16] and [24], the first 3 lines of KBE are hidden inside the computations of the RNS Montgomery reduction to get faster implementations. This could also be applied to our algorithm (in exactly the same way).
For our asymmetric cryptography applications, $c \in \{2, 3, 4\}$ seems to be a good choice (see also Section IV-B). Below, we will see that $c = 2$ leads to very interesting simplifications.
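The HBE steps described above can be sketched in Python as follows. This is a minimal software model under stated assumptions: base A is a list of r rows of c moduli, base B is a flat list, the 8-bit toy moduli are hypothetical, and `t` is the number of fractional bits kept for σ, so that for c = 2 the truncation keeps t + 1 MSBs of the (2w + 1)-bit super-residue.

```python
from math import prod

def hbe(x_a, base_a, base_b, w, t):
    """Extend residues x_a (r rows of c values) to base_b, following Algo. 2 with sigma = 0.5."""
    c = len(base_a[0])
    row_prod = [prod(row) for row in base_a]       # super-moduli A_i
    A = prod(row_prod)
    sigma = 1 << (t - 1)                           # sigma = 0.5, scaled by 2^t
    x_b = [0] * len(base_b)
    for x_row, row, A_i in zip(x_a, base_a, row_prod):
        # Lines 1-7: build the super-residue without any reduction
        X_hat = 0
        for x, a in zip(x_row, row):
            x_hat = (x * pow(A // a, -1, a)) % a   # |x * T|_a
            X_hat += x_hat * (A_i // a)            # times a_bar = A_i / a
        # Lines 9-11: approximation of h_i from the MSBs of X_hat / 2^(c*w)
        sigma += X_hat >> (c * w - t)
        h = sigma >> t
        sigma -= h << t
        # Lines 12-15: one reduction and one MM per channel of base_b
        A_bar = A // A_i
        x_b = [(xb + (X_hat % b) * (A_bar % b) - h * (A % b)) % b
               for xb, b in zip(x_b, base_b)]
    return x_b

# Toy parameters (assumptions): r = c = 2, pairwise coprime 8-bit moduli, t = 6
w, t = 8, 6
base_a = [[255, 253], [251, 247]]
base_b = [241, 239, 233, 229]
A = prod(a for row in base_a for a in row)
X = A // 3                                         # X < A/2 for an exact BE
x_a = [[X % a for a in row] for row in base_a]
print(hbe(x_a, base_a, base_b, w, t) == [X % b for b in base_b])
```

In this model the last loop body runs r times per channel of base B instead of rc times, which reflects the reduction from n^2 to n^2/c iterations discussed above.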
B. Validation of HBE Algorithm
We first show that the CRT formula computed in HBE is
equivalent to the one computed in KBE. Second, we explain
why the same base can be used for both an exact KBE and
an exact HBE with c ∈ {2, 3, 4}. HBE does not introduce
additional constraints. The only change can be introduced by
the trunc function. It may require 1 or 2 additional bits to
ensure the result after the approximation.
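1) Validation of the CRT sum: In KBE and HBE, one can see that the same CRT sum is actually computed:
$$\begin{aligned}
S &= \sum_{i=1}^{r} \sum_{j=1}^{c} \hat{x}_{a_{i,j}} \times \frac{A}{a_{i,j}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \hat{x}_{a_{i,j}} \times \overline{a_{i,j}} \times \overline{A_i} \\
&= \sum_{i=1}^{r} \left( \sum_{j=1}^{c} \hat{x}_{a_{i,j}} \times \overline{a_{i,j}} \right) \times \overline{A_i} \\
&= \sum_{i=1}^{r} \hat{X}_{A_i} \times \overline{A_i} \equiv X \bmod A
\end{aligned}$$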
To get an exact BE with KBE, h is computed from all x̂ai,j .
Two strategies can be used for HBE.
The first strategy relies on the fact that lines 1–3 are the same as in KBE. Additional memories can store $\hat{x}_{a_{i,j}}$ in addition to $\hat{X}_{A_i}$. In KBE on the cox-rower architecture, $\hat{x}_{a_{i,j}}$ is broadcast from one channel to all the other ones through the large multiplexer. It is also sent to the cox, which only accumulates its $t$ MSBs (see Fig. 1). Nonetheless, this is not directly possible with HBE, where the CRT sum has only $r$ terms instead of $rc$. This strategy thus seems less promising than the computation proposed in HBE to compute $h$.
The second strategy uses a cox unit directly operating on the super-residues. The values $A_i = \prod_{j=1}^{c} a_{i,j}$ can be seen as super-moduli, and one extracts the MSBs of the corresponding super-residues to get $h_i$. From now on, we assume that the $a_{i,j}$ are selected such that they satisfy the constraints from [9] for ensuring an exact BE.
2) Upper bound on the approximation error: Below we show that for $c = 2$, a base $\mathcal{A}$ compliant with the KBE constraints from [9] is directly compliant with HBE using a trunc function which keeps at most $t' = t + 1$ MSBs. The same property can be shown for $c = 3$ and $c = 4$ with $t' = t + 2$.
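Let us assume 2 RNS bases $\mathcal{A}$ and $\mathcal{B}$ with $n = rc$ moduli of $w$ bits where each modulus is a pseudo-Mersenne number selected to satisfy the theorems from [9]. As presented in the proof from [9], with $d_{a_{i,j}} = \frac{\hat{x}_{a_{i,j}} - \mathrm{trunc}(\hat{x}_{a_{i,j}})}{a_{i,j}}$, $e_{a_{i,j}} = \frac{2^w - a_{i,j}}{2^w}$, $d_m = \max(d_{a_{i,j}})$ and $e_m = \max(e_{a_{i,j}})$, one has:
$$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\hat{x}_{a_{i,j}}}{a_{i,j}} - r c \, (d_m + e_m) < \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\mathrm{trunc}(\hat{x}_{a_{i,j}})}{2^w} \tag{3}$$
$$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\mathrm{trunc}(\hat{x}_{a_{i,j}})}{2^w} < \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\hat{x}_{a_{i,j}}}{a_{i,j}}. \tag{4}$$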
The value $r c \, (d_m + e_m)$ is an upper bound on the error committed by using $\mathrm{trunc}(\hat{x}_{a_{i,j}})$ instead of $\hat{x}_{a_{i,j}}$ and $2^w$ instead of $a_{i,j}$. From [9], if $\mathcal{A}$ is chosen such that $r c \, (d_m + e_m) < \alpha < 1$ and $X < (1 - \alpha) A$, then the BE is exact.
In practice, choosing $r c \, (d_m + e_m) < 0.5$ is sufficient to perform an exact BE for an integer $X < A/2$. For $X < A$, KBE computes $X$ or $X + A$ in base $\mathcal{B}$, which is sufficient in various applications such as the second extension in RNS Montgomery reduction (see Theorems 1 and 2 in [9]).
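Now let us assume RNS bases $\mathcal{A}$ and $\mathcal{B}$ are used in HBE with $c = 2$. We will show that using the same definitions for $d_m$ and $e_m$, one gets
$$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\hat{x}_{a_{i,j}}}{a_{i,j}} - n (d_m + e_m) < \sum_{i=1}^{r} \frac{\mathrm{trunc}(\hat{X}_{A_i})}{2^{cw}} \tag{5}$$
and
$$\sum_{i=1}^{r} \frac{\mathrm{trunc}(\hat{X}_{A_i})}{2^{cw}} < \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\hat{x}_{a_{i,j}}}{a_{i,j}}, \tag{6}$$
thus the conclusions from the theorems in [9] still hold for HBE.
Let us define $d_{A_i} = \frac{\hat{X}_{A_i} - \mathrm{trunc}(\hat{X}_{A_i})}{A_i}$ and $e_{A_i} = \frac{2^{cw} - A_i}{2^{cw}}$, similar to $d_{a_{i,j}}$ and $e_{a_{i,j}}$, for the super-residues.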
Let us prove that Eq. 5 holds. If trunc keeps the $t$ MSBs in KBE, then $x - \mathrm{trunc}(x) < 2^{w-t}$ (i.e., at most all the $w - t$ dropped LSBs are 1s). Since the truncated bits cannot be predicted, $d_m = \max \left( 2^{w-t} / a_{i,j} \right)$ is assumed for the choice of the RNS base in KBE; this assumption is still true for HBE. Assuming $c = 2$ in HBE, our trunc keeps the $t + 1$ MSBs from the $2w + 1$ bits of $\hat{X}_{A_i}$ and we approximate $A_i$ by $2^{2w}$. Then
$$d_{A_i} < \frac{2^{2w+1-(t+1)}}{A_i} \le \frac{2^w}{a_{i,2}} \times \frac{2^{w-t}}{a_{i,1}} < 2 \, d_m$$
because $2^w / a_{i,2} < 2$ (moduli $a_{i,j}$ are $w$-bit integers) and by definition $2^{w-t} / a_{i,1} \le d_m$. Summing over all rows leads to
$$\sum_{i=1}^{r} d_{A_i} < n \, d_m. \tag{7}$$
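For $c = 2$, $A_i = a_{i,1} \times a_{i,2} = (2^w - u_1)(2^w - u_2)$ with $u_1$ and $u_2$ small since the $a_{i,j}$ are pseudo-Mersenne integers. Thus
$$\begin{aligned}
e_{A_i} &= \frac{2^{2w} - a_{i,1} \, a_{i,2}}{2^{2w}} = \frac{2^{2w} - (2^w - u_1)(2^w - u_2)}{2^{2w}} \\
&= \frac{(u_1 + u_2) \, 2^w - u_1 u_2}{2^{2w}} = \frac{u_1 + u_2}{2^w} - \frac{u_1 u_2}{2^{2w}} \\
&= \frac{2^w - a_{i,1}}{2^w} + \frac{2^w - a_{i,2}}{2^w} - \frac{(2^w - a_{i,1})(2^w - a_{i,2})}{2^{2w}} \\
&= e_{a_{i,1}} + e_{a_{i,2}} - e_{a_{i,1}} \times e_{a_{i,2}} \\
&< e_{a_{i,1}} + e_{a_{i,2}}.
\end{aligned}$$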
It follows
$$\sum_{i=1}^{r} e_{A_i} < \sum_{i=1}^{r} \sum_{j=1}^{c} e_{a_{i,j}} < n \, e_m \tag{8}$$
which combined with Eq. 7 gives
$$\sum_{i=1}^{r} \left( d_{A_i} + e_{A_i} \right) < n \, (d_m + e_m). \tag{9}$$
Using the proof in [9], one can easily find:
$$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\hat{x}_{a_{i,j}}}{a_{i,j}} - \sum_{i=1}^{r} \left( d_{A_i} + e_{A_i} \right) < \sum_{i=1}^{r} \frac{\mathrm{trunc}(\hat{X}_{A_i})}{2^{cw}}$$
which leads to Eq. 5 by applying Eq. 9.
To complete the proof, Eq. 6 directly comes from the definition of the approximations from [9]: using $2^w$ instead of $a_{i,j}$ maximizes the denominators and trunc minimizes the numerators, thus the approximation is always less than the real value.
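The two inequalities used for c = 2 can be checked numerically with exact rational arithmetic. The sketch below is only a sanity check on one pair of moduli; the function name `check_c2_bounds` and the parameter values (w = 17, t = 6, offsets u1 = 1 and u2 = 9) are assumptions chosen for illustration.

```python
from fractions import Fraction

def check_c2_bounds(w, t, u1, u2):
    """Check the e_{A_i} identity and the d_{A_i} < 2 d_m bound for one row with c = 2."""
    a1, a2 = 2**w - u1, 2**w - u2             # pseudo-Mersenne moduli of one row
    e1 = Fraction(2**w - a1, 2**w)
    e2 = Fraction(2**w - a2, 2**w)
    e_A = Fraction(2**(2 * w) - a1 * a2, 2**(2 * w))
    # d_m as assumed when choosing the base for KBE
    d_m = max(Fraction(2**(w - t), a1), Fraction(2**(w - t), a2))
    # Upper bound on d_{A_i} when trunc keeps t + 1 MSBs out of 2w + 1 bits
    d_A_bound = Fraction(2**(2 * w + 1 - (t + 1)), a1 * a2)
    return (e_A == e1 + e2 - e1 * e2, e_A < e1 + e2, d_A_bound < 2 * d_m)

print(check_c2_bounds(17, 6, 1, 9))
```

Each element of the returned tuple corresponds to one step of the derivation: the closed form of e_{A_i}, its bound by e_{a_{i,1}} + e_{a_{i,2}}, and the bound on d_{A_i}.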
C. Theoretical Cost Evaluation
To compare with KBE, we evaluate the number of $C_{\mathrm{MM}}$s and $C_{\mathrm{MR}}$s in HBE for a generic $c$ and then we focus on $c = 2$.
Lines 1–3 in HBE are the same as in KBE, and cost $rc \, C_{\mathrm{MM}}(w, w)$. Lines 4–7 cost $rc \, C_{\mathrm{MUL}}(w, (c-1)w)$ and $rc \, C_{\mathrm{ADD}}(cw, cw)$. This part is hard to evaluate for a generic $c$ without a full implementation because it introduces additions on $cw$-bit integers (larger than the channels). HBE operations at lines 9–11 are negligible: they are just very small additions on a few bits, computed in the cox unit in parallel with the rowers (see Fig. 1). Finally, HBE lines 14–15 are performed $r^2 c$ times each. HBE line 15 costs the same as KBE line 11, i.e., $1 \, C_{\mathrm{MM}}(w, w) + 2 \, C_{\mathrm{ADD}}(w, w)$ per iteration. Line 14 is more difficult to evaluate, because it depends on $c$ (i.e., $\hat{X}_{A_i}$ has $cw + \lceil \log_2(c) \rceil$ bits) and on the form of the moduli (e.g., generic vs. sparse numbers).
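To sum up, the theoretical cost of HBE (Algo. 2) is:
$$r^2 c \left( C_{\mathrm{MM}}(w, w) + 2 \, C_{\mathrm{ADD}}(w, w) + C_{\mathrm{MR}}(cw + \lceil \log_2(c) \rceil, w) \right) + rc \left( C_{\mathrm{MM}}(w, w) + C_{\mathrm{MUL}}(w, (c-1)w) + C_{\mathrm{ADD}}(cw, cw) \right)$$
For $c = 2$, and thus $r = n/2$, using the assumption $C_{\mathrm{MUL}} = C_{\mathrm{MM}}$ (we overestimate $C_{\mathrm{MUL}}$), the cost reduces to:
$$\frac{n^2}{2} \left( C_{\mathrm{MM}}(w, w) + 2 \, C_{\mathrm{ADD}}(w, w) + C_{\mathrm{MR}}(2w + 1, w) \right) + n \left( 2 \, C_{\mathrm{MM}}(w, w) + C_{\mathrm{ADD}}(2w, 2w) \right)$$
Using the simplification $C_{\mathrm{MUL}} = C_{\mathrm{MM}}$ (in reality $C_{\mathrm{MUL}} < C_{\mathrm{MM}}$), the overestimation of the HBE cost is small since it only impacts the linear term in $n$.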
The choice c = 2 is very interesting since the circuit only
deals with values of w or 2w + 1 bits, as in KBE. No new
arithmetic operator is required to use HBE with c = 2.
As in most of the state-of-the-art works, we focus on
the number of modular multiplications because additions are
usually hidden in the pipeline computing the multiplications,
especially in DSP slices of modern FPGAs (they are actually
accumulations of products).
This leads to a cost for HBE of
$$\frac{n^2}{2} \, C_{\mathrm{MM}}(w, w) + \frac{n^2}{2} \, C_{\mathrm{MR}}(2w + 1, w) + 2n \, C_{\mathrm{MM}}(w, w)$$
against a cost for KBE of
$$n^2 \, C_{\mathrm{MM}}(w, w) + n \, C_{\mathrm{MM}}(w, w).$$

TABLE I
FPGA IMPLEMENTATION RESULTS FOR C_MR(2w + 1, w) AND C_MM(w, w) ELEMENTARY OPERATIONS IN A XILINX XC7Z020.

operation      C_MR(2w + 1, w)              C_MM(w, w)
w (bits)       17     20     24     28      17     20     24     28
nb. slices      1      1     24     35       1     24      1     39
nb. DSP         2      2      1      1       3      3      4      5
nb. cycles      1      1      2      2       2      2      2      3
time (ns)     2.4    2.6    9.0    9.6     7.8   10.6   10.6   17.1
If $C_{\mathrm{MR}}(2w + 1, w) \ll C_{\mathrm{MM}}(w, w)$, significant improvements may be achieved using HBE instead of KBE. To evaluate these costs in real hardware, we implemented both operations on a Xilinx XC7Z020 FPGA (using Vivado 2017.4) for numerous randomly chosen pseudo-Mersenne moduli ($2^w - u_i$) and various $w$. The results are reported in Tab. I. $C_{\mathrm{MR}}(2w + 1, w)$ requires about half the time and half the area of $C_{\mathrm{MM}}(w, w)$. Thus, we assume $C_{\mathrm{MM}}(w, w)/4 \le C_{\mathrm{MR}}(2w + 1, w) \le C_{\mathrm{MM}}(w, w)/2$ for pseudo-Mersenne moduli.
Finally, Fig. 2 presents the theoretical improvements of HBE using $c = 2$ vs. KBE for various values of the ratio $C_{\mathrm{MM}}/C_{\mathrm{MR}}$ and numbers of moduli between 6 and 32 (typical values in our asymmetric cryptography applications). HBE leads to theoretical improvements of up to 35% compared to KBE.
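The cost comparison above can be reproduced with a few lines of Python. In this sketch, costs are counted in units of one C_MM(w, w), additions are ignored as in the text, and `rho` stands for the ratio C_MM/C_MR; the function names are assumptions introduced for illustration.

```python
from fractions import Fraction

def kbe_cost(n):
    """Cost of one KBE in C_MM units: n^2 + n."""
    return Fraction(n * n + n)

def hbe_cost(n, rho):
    """Cost of one HBE with c = 2 in C_MM units: n^2/2 * (1 + 1/rho) + 2n."""
    return Fraction(n * n, 2) * (1 + Fraction(1, rho)) + 2 * n

def be_ratio(n, rho):
    """cost(HBE) / cost(KBE) for one base extension."""
    return hbe_cost(n, rho) / kbe_cost(n)

for n in (16, 32, 60):
    print(n, [float(be_ratio(n, rho)) for rho in (2, 3, 4)])
```

For example, with n = 32 and C_MM/C_MR = 4 the ratio is exactly 2/3, and it approaches 0.65 for n = 60, which is consistent with the improvement of up to 35% quoted above.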
D. Parallelism of HBE in a cox-rower Architecture
We want to implement our cryptosystems in RNS using a
cox-rower architecture adapted from [9] (see Fig. 1). We
also use n rowers (i.e., one physical channel per modulus).
HBE is mainly made of 3 successive loops. The first and the
third loops operate on all channels of A and B respectively
(each of n = rc channels). Thus these loops can be performed
on a cox-rower architecture with n parallel rowers as in
KBE. However, if c > 2 the architecture must be modified to
reduce values larger than 2w bits.
[Figure: x-axis: number of moduli (n), from 10 to 60; y-axis: cost ratio HBE / KBE for 1 BE, from 0.65 to 0.90; one curve for each of CMM/CMR = 2, 3 and 4.]
Fig. 2. Theoretical costs of a BE for various base sizes (n), the two compared BE algorithms and c = 2. Each curve corresponds to cost(HBE)/cost(KBE) for one cost ratio of elementary operations (CMM/CMR).
[Figure: block diagram with four rower units, one cox unit, a control unit (CTRL) and a memory; bus widths are labeled w, w+1, 2w, t+1 and 2.]
Fig. 3. Our HBE cox-rower architecture with c = 2 and n = 4.
In the second loop, $r = n/c$ independent sums of products are computed: the greater $c$ is, the greater the reduction of the parallelism. Moreover, $\hat{X}_{A_i}$ is $cw + \lceil \log_2(c) \rceil$ bits wide and requires larger operators than in the cox-rower architecture from [9] for $c > 2$. This could be balanced by the reduced cost of the last loop, which has only $n^2/c$ iterations. Using $c = 2$, the cox-rower can be modified as presented in Fig. 3.
Computing the $n$ multiplications in parallel units and summing their $2w$-bit results allows $\hat{X}_{A_i}$ to be computed with an architecture as parallel as the original cox-rower, or nearly so. The $w + 1$ MSBs are sent to the cox (through a truncation) and to all the rowers; the $w$ LSBs are also broadcast to the rowers.
V. CRYPTOGRAPHIC APPLICATIONS
RSA exponentiation and ECC scalar multiplication require
numerous MRs with a large modulus (200+ bits for ECC
and 2000+ bits for RSA). The standard integer MR algorithm
used with a generic modulus (i.e., no specific form) is the
Montgomery reduction proposed in [25]. It replaces a costly division by 2 multiplications by a constant, one reduction modulo $2^l$ and one division by $2^l$, where $l$ is the modulus width in a radix-2 representation.
An RNS version of this algorithm has been proposed in [26], see Algo. 3. Instead of $2^l$, it uses reduction and division by $A$ at line 4. Because $A$ cannot be inverted in base $\mathcal{A}$, the BEs at lines 2 and 5, to and from base $\mathcal{B}$, are required to divide by $A$. The second BE must be exact, but the first one can be approximated (the error, multiplied by $P$, does not impact the result modulo $P$). KBE and HBE algorithms can be used for both BEs.
The cost of Algo. 3 is dominated by the 2 BEs. One RNS reduction modulo $P$ from [9] costs $2n^2 + 5n$ $C_{\mathrm{MM}}(w, w)$, where the 2 BEs cost $2n^2 + 2n$. By reordering internal operations, [11] reduces the cost of one RNS reduction modulo $P$ to $2n^2 + 2n$ $C_{\mathrm{MM}}$s (actually 2 successive BEs).
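The data flow of the RNS Montgomery reduction (Algo. 3) can be sketched in Python as follows. To keep the example short, the base extension is replaced by an exact stand-in (`exact_be`, a full CRT reconstruction) instead of KBE or HBE; this function, the toy 8-bit bases and the modulus P = 1000003 are assumptions made for illustration only.

```python
from math import prod

def crt(residues, base):
    """Reconstruct the integer from its residues (used only by the stand-in BE)."""
    M = prod(base)
    return sum(((x * pow(M // m, -1, m)) % m) * (M // m)
               for x, m in zip(residues, base)) % M

def exact_be(residues, src, dst):
    """Stand-in for KBE/HBE: exact base extension through the full CRT."""
    value = crt(residues, src)
    return [value % m for m in dst]

def rns_montgomery(x_a, x_b, P, base_a, base_b):
    """RNS Montgomery reduction: returns S in both bases with S = X * A^-1 mod P (+ P)."""
    A = prod(base_a)
    q_a = [(x * -pow(P, -1, a)) % a for x, a in zip(x_a, base_a)]   # line 1
    q_b = exact_be(q_a, base_a, base_b)                             # line 2
    r_b = [(x + q * P) % b for x, q, b in zip(x_b, q_b, base_b)]    # line 3
    s_b = [(r * pow(A, -1, b)) % b for r, b in zip(r_b, base_b)]    # line 4
    s_a = exact_be(s_b, base_b, base_a)                             # line 5
    return s_a, s_b

# Toy parameters (assumptions): pairwise coprime bases with B > 2P and X < A * P
base_a = [251, 253, 255]
base_b = [256, 247, 241]
P = 1000003
X = 123456 * 654321
s_a, s_b = rns_montgomery([X % a for a in base_a], [X % b for b in base_b],
                          P, base_a, base_b)
S = crt(s_b, base_b)
print(S % P == (X * pow(prod(base_a), -1, P)) % P)
```

With an exact first BE, S stays below 2P; with an approximated first BE as allowed above, the result can carry additional multiples of P (the δ of Algo. 3).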
[Figure: x-axis: number of moduli (n), from 10 to 60; y-axis: cost ratio HBE / KBE for 1 RNS MM, from 0.65 to 0.95; one curve for each of CMM/CMR = 2, 3 and 4.]
Fig. 4. Theoretical costs of one RNS Montgomery MM from [11] for various base sizes (n) and the two compared BE algorithms for ECC or RSA applications. Each curve corresponds to cost(MM w. HBE)/cost(MM w. KBE) for one cost ratio of elementary operations (CMM/CMR).
One full RNS MM based on KBE costs $2n^2 + 4n$ $C_{\mathrm{MM}}$s (one MR and $2n$ $C_{\mathrm{MM}}$s, one multiplication per channel per RNS base). The same RNS MM algorithm using HBE instead of KBE only costs $\frac{3n^2}{2} + 6n$ $C_{\mathrm{MM}}$s when $C_{\mathrm{MM}}/C_{\mathrm{MR}} = 2$ or $\frac{5n^2}{4} + 6n$ when $C_{\mathrm{MM}}/C_{\mathrm{MR}} = 4$.
Recently, [24] improved RNS MMs for ECC applications. By selecting well-suited bases for prime field characteristics $P$ from standards (e.g., NIST primes), the cost with KBE is reduced from $2n^2 + 4n$ to $2n^2 + 3n$ $C_{\mathrm{MM}}$s. This method requires generating one base per field characteristic. For our applications with multiple fields, this method is less interesting than [11].
Figure 4 depicts the theoretical gain when using HBE instead of KBE for one RNS MM using Algo. 3 and the optimizations from [11]. The gain is given for various cost ratios $C_{\mathrm{MM}}/C_{\mathrm{MR}}$. For instance, 256-bit ECC and 1024-bit RSA-CRT, both with 17-bit moduli (to fit into one Xilinx DSP multiplier, see Sec. VI), respectively require 16 and 32 moduli for the RNS bases. In these typical cases, HBE leads to a theoretical gain of 17% to 32% compared to KBE for MMs.
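The 17% and 32% figures can be recomputed from the cost formulas given above for one RNS MM. In this sketch, costs are in C_MM units and `rho` denotes C_MM/C_MR; the function names are assumptions introduced for illustration.

```python
from fractions import Fraction

def mm_cost_kbe(n):
    """One RNS MM with KBE: 2n^2 + 4n C_MMs."""
    return Fraction(2 * n * n + 4 * n)

def mm_cost_hbe(n, rho):
    """One RNS MM with HBE (c = 2): two BEs of n^2/2 * (1 + 1/rho) + 2n, plus 2n C_MMs."""
    one_be = Fraction(n * n, 2) * (1 + Fraction(1, rho)) + 2 * n
    return 2 * one_be + 2 * n

def mm_gain(n, rho):
    """Relative gain of HBE over KBE for one RNS MM."""
    return 1 - mm_cost_hbe(n, rho) / mm_cost_kbe(n)

print(float(mm_gain(16, 2)), float(mm_gain(32, 4)))
```

The two printed values are about 0.167 (n = 16, ratio 2) and 0.324 (n = 32, ratio 4), matching the 17% and 32% bounds of the range quoted above.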
In RSA, modular exponentiation only performs MMs, then
the HBE gain for one RNS MM can be directly transposed
to one RNS modular exponentiation. In ECC scalar multi-
plication, this gain can be slightly reduced because one can
sometimes perform 1 MR for 2 multiplications, see [10].
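Algorithm 3: RNS Montgomery reduction modulo P [26].
Input: $X_{\mathcal{A}}$, $X_{\mathcal{B}}$
Precomp.: $P_{\mathcal{A}}$, $P_{\mathcal{B}}$, $\left( -P^{-1} \right)_{\mathcal{A}}$, $\left( A^{-1} \right)_{\mathcal{B}}$
Output: $S_{\mathcal{A}}$ and $S_{\mathcal{B}}$ with $S = \left( X A^{-1} \right) \bmod P + \delta P$ and $\delta \in \{0, 1, 2\}$
1  $Q_{\mathcal{A}} \leftarrow X_{\mathcal{A}} \times \left( -P^{-1} \right)_{\mathcal{A}}$
2  $Q_{\mathcal{B}} \leftarrow \mathrm{BE}(Q_{\mathcal{A}}, \mathcal{A}, \mathcal{B})$
3  $R_{\mathcal{B}} \leftarrow X_{\mathcal{B}} + Q_{\mathcal{B}} \times P_{\mathcal{B}}$
4  $S_{\mathcal{B}} \leftarrow R_{\mathcal{B}} \times \left( A^{-1} \right)_{\mathcal{B}}$
5  $S_{\mathcal{A}} \leftarrow \mathrm{BE}(S_{\mathcal{B}}, \mathcal{B}, \mathcal{A})$
6  return $(S_{\mathcal{A}}, S_{\mathcal{B}})$

TABLE II
COSTS FOR THREE RNS MM ALGORITHMS (IN CMMS) WHEN USING KBE AND HBE (WITH TWO RATIOS FOR r = CMM/CMR).

BE             MM [11]          HPR d = 2 [17]    HPR d = 4 [27]
KBE            2n^2 + 4n        n^2 + 8n          n^2/2 + 12n
HBE (r = 2)    3n^2/2 + 6n      3n^2/4 + 10n      3n^2/8 + 14n
HBE (r = 4)    5n^2/4 + 6n      5n^2/8 + 10n      5n^2/16 + 14n

[Figure: x-axis: number of moduli (n), from 10 to 30; y-axis: cost ratio HBE / KBE for 1 RNS MM, from 0.75 to 1.05; one curve for each of CMM/CMR = 2, 3 and 4.]
Fig. 5. Theoretical costs of one HPR MM from [17], [27] for d = 2, various base sizes (n) and the two compared BE algorithms. Each curve corresponds to cost(MM w. HBE)/cost(MM w. KBE) for one cost ratio of elementary operations (CMM/CMR).

In addition to the comparison with [11], we also evaluate the interest in using HBE instead of KBE for other RNS MM algorithms: [17] and [27].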
A specific RNS MM approach for ECC has been proposed in [17] and generalized in [27]. These methods use specific $P$ values with pseudo-Mersenne-like properties in RNS. They mainly mix RNS with a polynomial representation of small degree $d$. The MM algorithms from [17] and [27] are respectively referred to as HPR with $d = 2$ and HPR with $d = 4$ below.
We compare the cost of all these RNS MM algorithms when internally using KBE and HBE for typical sets of parameters. The results are reported in Tab. II. Fig. 4 illustrates the gain for the RNS MM from [11], and Fig. 5 the gain for HPR with $d = 2$ from [17]. With this RNS MM algorithm and $n = 16$, HBE leads to an 8 to 17% speedup compared to KBE. For $n = 32$, the improvement reaches 15 to 25%. For HPR with $d = 4$ from [27], the gain is illustrated in Fig. 6. The number of moduli is assumed to be at least 16 (below that, HPR with $d = 2$ is faster). For $n = 32$, HBE is 8 to 15% faster than KBE.
Our evaluations assume a pessimistic ratio CMM/CMR = 4.
We currently use simple pseudo-Mersenne moduli (see FPGA
implementation results in Tab. I). We plan to improve this ratio
by using well chosen moduli.
VI. FPGA IMPLEMENTATION RESULTS
We implemented our HBE, Algo. 2, as well as the KBE, Algo. 1, from the state of the art [9] using the same environment and effort. We did that to make a fair comparison and because of a lack of experimental results for standalone BE implementations in the literature. We used the Vivado 2017.4 HLS tools and a XC7Z020 FPGA from Xilinx. HLS allows us to quickly explore various widths for RNS channels and implementation constraints (e.g., loop unrolling, pipelining).

[Figure: x-axis: number of moduli (n), from 16 to 32; y-axis: cost ratio HBE / KBE for 1 RNS MM, from 0.86 to 1.00; one curve for each of CMM/CMR = 2, 3 and 4.]
Fig. 6. Similar to Fig. 5 but for d = 4.
We used RNS bases for ECC over $\mathbb{F}_P$ with 256-bit and 384-bit elements. We explored several widths for RNS channels: $w \in \{17, 20, 24, 28\}$ bits. DSP slices in Xilinx FPGAs embed hardwired integer multipliers for $18 \times 18$ or $18 \times 25$-bit operands in 2's complement. For unsigned integers, 17-bit operands should be used to fully exploit DSP slices, see [28], [29]. Currently, we do not support asymmetric widths for the DSP operands (e.g., $17 \times 24$). The number $n$ of RNS channels is the smallest multiple of $c = 2$ such that $nw > \log_2(P)$.
Table III reports all the corresponding implementation results. “Time” rows refer to the BE computation time (i.e., the product of the period by the number of cycles). In most cases, HBE leads to faster and smaller solutions than KBE. HBE always reduces the number of required DSP slices (the largest elements in the FPGA). For instance, for 256-bit $\mathbb{F}_P$ elements and 24-bit channels, HBE is 10% faster and 19% smaller in DSPs. The area reduction can reach 31% for $w = 17$ (the gain in computation time is 2% in that case).
HLS tools allow us to explore various area vs. time trade-offs much faster than with VHDL descriptions. For instance, on 384-bit $\mathbb{F}_P$ elements, we are able to speed up the computation by 40% for a 28% increase in the number of DSP slices by increasing the channel width.
TABLE III
HLS IMPLEMENTATION RESULTS ON A XC7Z020 FPGA FOR OUR HBE AND THE KBE (FROM [9]) ALGORITHMS FOR 2 WIDTHS OF PRIME FIELD
ELEMENTS AND 4 RNS CHANNEL WIDTHS w.

F_P width  Metric        w = 17 bits       w = 20 bits       w = 24 bits       w = 28 bits
(bits)                   KBE      HBE      KBE      HBE      KBE      HBE      KBE      HBE
256        nb. slices    445      758      1073     784      785      769      753      843
256        nb. DSP       51       35       45       39       52       42       76       60
256        nb. BRAM      1        1        1        1        1        1        1        1
256        period (ns)   9.8      10.3     9.6      8.9      9.6      9.5      9.7      9.6
256        nb. cycles    98       91       88       83       89       81       77       71
256        time (ns)     960.4    937.3    844.8    738.7    854.4    769.5    746.9    681.6
384        nb. slices    587      644      1215     869      1251     1134     1031     1145
384        nb. DSP       81       63       63       54       76       60       104      80
384        nb. BRAM      1        1        1        1        1        1        1        1
384        period (ns)   7.6      10.1     9.6      9.0      7.6      7.6      9.9      9.4
384        nb. cycles    165      143      140      122      163      132      103      93
384        time (ns)     1254.0   1444.3   1344.0   1098.0   1238.8   1003.2   1019.7   874.2

VII. CONCLUSION AND FUTURE PROSPECTS
We proposed a new algorithm for RNS base extension called
hierarchical base extension (HBE). In HBE, the moduli are
used in a hierarchical way with partial applications of small
intermediate CRTs. The moduli, usually handled as a single
vector, are managed as a matrix in HBE. The moduli in each
row are used to construct super-moduli through small parallel
CRTs. At the highest level, all row-wise contributions are used
to complete the BE using a last CRT.
HBE reduces the theoretical computation cost by up to 35%
compared to the state-of-the-art KBE algorithm from [9]. HBE
shares the exact same constraints as KBE for selecting RNS
bases. We also proposed a modified Cox-Rower architecture
to efficiently implement HBE. For specific shapes of the
matrix of moduli (i.e., two columns), our new architecture
preserves the natural RNS parallelism with a slightly deeper
pipeline at the rower level. FPGA implementation results
show a significant area reduction in DSP slices as well as a
small speedup.
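The hierarchical use of the moduli summarized in this section can be illustrated with a minimal Python sketch. It performs an exact base extension with big integers: small CRTs per row give the residue modulo each super-modulus, and a last CRT over the super-moduli recovers the value before reducing it in the target base. It is only meant to show the two-level structure; it does not reproduce the approximations, the RNS-only arithmetic or the hardware mapping of HBE, and all names and toy moduli are assumptions.

```python
from math import prod


def crt(residues, moduli):
    """Exact CRT: the unique x in [0, prod(moduli)) with x % m_i == r_i.

    The moduli must be pairwise coprime.
    """
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # (r * Mi^-1 mod m) * Mi contributes r modulo m and 0 modulo the others.
        x += (r * pow(Mi, -1, m) % m) * Mi
    return x % M


def hierarchical_base_extension(residues, rows, target_base):
    """Extend a value given by a matrix of residues to `target_base`.

    rows[i][j] is the modulus of row i, column j; residues has the same shape.
    """
    # Level 1: one small CRT per row, independent of the other rows.
    super_moduli = [prod(row) for row in rows]
    super_residues = [crt(res, row) for res, row in zip(residues, rows)]
    # Level 2: a last CRT over the super-moduli.
    x = crt(super_residues, super_moduli)
    return [x % m for m in target_base]


# Toy example (hypothetical moduli): 2 rows x 2 columns, i.e., c = 2.
rows = [[13, 17], [19, 23]]
target_base = [29, 31]
x = 12345  # must be smaller than 13 * 17 * 19 * 23
residues = [[x % m for m in row] for row in rows]
extended = hierarchical_base_extension(residues, rows, target_base)
print(extended == [x % m for m in target_base])
```

The row-level CRTs do not depend on each other, which is where the parallelism of the matrix organization comes from in this simplified view.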
As future work, we intend to study and design full crypto-processors
for ECC using HBE. We also plan to study optimizations
for other types of decompositions (e.g., c ∈ {3, 4})
and well-suited moduli forms for HBE.
ACKNOWLEDGMENTS
This work has been supported by a PhD grant from
DGA/Pôle de Recherche Cyber.
REFERENCES
[1] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital
signatures and public-key cryptosystems,” Communications of the ACM,
vol. 21, no. 2, pp. 120–126, Feb. 1978.
[2] H. Cohen and G. Frey, Eds., Handbook of Elliptic and Hyperelliptic
Curve Cryptography. Chapman & Hall/CRC, 2005.
[3] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve
Cryptography. Springer, 2004.
[4] ECRYPT, “Algorithms, key size and protocols report,” Feb. 2018,
H2020-ICT-2014, Project 645421. [Online]. Available:
http://www.ecrypt.eu.org/csa/documents/D5.4-FinalAlgKeySizeProt.pdf
[5] H. L. Garner, “The residue number system,” IRE Transactions on
Electronic Computers, vol. EC-8, no. 2, pp. 140–147, Jun. 1959.
[6] A. Svoboda and M. Valach, “Operátorové obvody (operator circuits,
in Czech),” Stroje na Zpracování Informací (Information Processing
Machines), vol. 3, pp. 247–296, 1955.
[7] N. S. Szabo and R. I. Tanaka, Residue arithmetic and its applications
to computer technology. McGraw-Hill, 1967.
[8] J.-C. Bajard, L. Imbert, P.-Y. Liardet, and Y. Teglia, “Leak resistant
arithmetic,” in Proc. Cryptographic Hardware and Embedded Systems
(CHES), ser. LNCS, vol. 3156. Springer, Aug. 2004, pp. 62–75.
[9] S. Kawamura, M. Koike, F. Sano, and A. Shimbo, “Cox-Rower
architecture for fast parallel Montgomery multiplication,” in Proc. Internat.
Conf. Theory and Application of Cryptographic Techniques (EUROCRYPT),
ser. LNCS, vol. 1807. Springer, May 2000, pp. 523–538.
[10] J.-C. Bajard, S. Duquesne, and M. D. Ercegovac, “Combining
leak-resistant arithmetic for elliptic curves defined over Fp and RNS
representation,” Publications Mathématiques de Besançon: Algèbre et Théorie
des Nombres, pp. 67–87, 2013.
[11] F. Gandino, F. Lamberti, G. Paravati, J.-C. Bajard, and P. Montuschi,
“An algorithmic and architectural study on Montgomery exponentiation
in RNS,” IEEE Transactions on Computers, vol. 61, no. 8,
pp. 1071–1083, Aug. 2012.
[12] J.-C. Bajard and L. Imbert, “A full RNS implementation of RSA,” IEEE
Transactions on Computers, vol. 53, no. 6, pp. 769–774, Jun. 2004.
[13] A. P. Shenoy and R. Kumaresan, “Fast base extension using a redundant
modulus in RNS,” IEEE Transactions on Computers, vol. 38, no. 2, pp.
292–297, Feb. 1989.
[14] K. C. Posch and R. Posch, “Base extension using a convolution sum in
residue number systems,” Computing, vol. 50, no. 2, pp. 93–104, Jun.
1993.
[15] H. Nozaki, M. Motoyama, A. Shimbo, and S. Kawamura,
“Implementation of RSA algorithm based on RNS Montgomery multiplication,”
in Proc. Cryptographic Hardware and Embedded Systems (CHES), ser.
LNCS, vol. 2162. Springer, May 2001, pp. 364–376.
[16] N. Guillermin, “A high speed coprocessor for elliptic curve scalar
multiplications over Fp,” in Proc. Cryptographic Hardware and Embedded
Systems (CHES), ser. LNCS, vol. 6225. Springer, Aug. 2010, pp. 48–64.
[17] K. Bigou and A. Tisserand, “Single base modular multiplication for
efficient hardware RNS implementations of ECC,” in Proc. 17th
Cryptographic Hardware and Embedded Systems (CHES), ser. LNCS, vol.
9293. Springer, Sep. 2015, pp. 123–140.
[18] H. M. Yassine, “Hierarchical residue numbering system suitable for
VLSI arithmetic architectures,” in Proc. IEEE International Symposium
on Circuits and Systems, vol. 2, May 1992, pp. 811–814.
[19] H. D. L. Hollmann, R. Rietman, S. de Hoogh, L. Tolhuizen, and
P. Gorissen, “A multi-layer recursive residue number system,” in Proc.
IEEE International Symposium on Information Theory (ISIT), Jun. 2018,
pp. 1460–1464.
[20] A. Skavantzos and M. Abdallah, “Implementation issues of the two-
level residue number system with pairs of conjugate moduli,” IEEE
Transactions on Signal Processing, vol. 47, no. 3, pp. 826–838, Mar.
1999.
[21] T. Tomczak, “Hierarchical residue number systems with small moduli
and simple converters,” International Journal of Applied Mathematics
and Computer Science, vol. 21, no. 1, pp. 173–192, Mar. 2011.
[22] A. Bostan, G. Lecerf, and E. Schost, “Tellegen’s principle into practice,”
in Proc. International Symposium on Symbolic and Algebraic
Computation ISSAC. ACM, Aug. 2003, pp. 37–44.
[23] J. van der Hoeven, “Fast Chinese remaindering in practice,” in
Mathematical Aspects of Computer and Information Sciences. Springer, Nov.
2017, pp. 95–106.
[24] S. Kawamura, Y. Komano, H. Shimizu, and T. Yonemura, “RNS
Montgomery reduction algorithms using quadratic residuosity,” Journal
of Cryptographic Engineering, Sep. 2018.
[25] P. L. Montgomery, “Modular multiplication without trial division,”
Mathematics of Computation, vol. 44, no. 170, pp. 519–521, Apr. 1985.
[26] K. C. Posch and R. Posch, “Modulo reduction in residue number
systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 6,
no. 5, pp. 449–454, May 1995.
[27] K. Bigou and A. Tisserand, “Hybrid position-residues number system,”
in Proc. 23rd Symposium on Computer Arithmetic (ARITH). IEEE, Jul.
2016, pp. 126–133.
[28] Xilinx, “Vivado design suite user guide high-level synthesis (UG902),”
Tech. Rep., Apr. 2017.
[29] ——, “7 series DSP48E1 slice user guide (UG479),” Tech. Rep., Mar.
2018.
