Side-channel aware design: algorithms and architectures for elliptic curve cryptography over GF(2^n) by Batina, L. et al.
PDF hosted at the Radboud Repository of the Radboud University
Nijmegen
 
 
 
 
The following full text is a preprint version which may differ from the publisher's version.
 
 
For additional information about this publication click this link.
http://repository.ubn.ru.nl/handle/2066/127471
 
 
 
Please be advised that this information was generated on 2017-03-09 and may be subject to
change.
Side-channel aware design: Algorithms and Architectures
for Elliptic Curve Cryptography over GF(2^n)
Lejla Batina, Nele Mentens, Bart Preneel, Ingrid Verbauwhede
Katholieke Universiteit Leuven, ESAT/SCD-COSIC
{Lejla.Batina, Nele.Mentens, Bart.Preneel, Ingrid.Verbauwhede}@esat.kuleuven.ac.be
Abstract
This paper proposes efficient algorithms for El-
liptic Curve Cryptography (ECC). As an example a
compact and efficient FPGA architecture for ECC
over finite fields of even characteristic is presented.
The implementation is balanced in order to increase
the security w.r.t. simple side-channel attacks.
Keywords: Multiplication in GF(2^n), Hardware
implementation, Systolic array architecture, Elliptic
Curve Cryptography (ECC), Montgomery method for
point multiplication
1 Introduction
The best-known and most commonly used public-
key cryptosystems (PKC) are based on factoring
(RSA) and on the discrete logarithm problem (Diffie-
Hellman, DSA) [19]. They allow secure communica-
tions over insecure channels without prior agreement
of a shared secret and they also enable digital signatures. An alternative for PKC is Elliptic Curve
Cryptography (ECC), which was proposed in the mid-1980s by Miller [21] and Koblitz [14].
In this article we propose an implementation of
the approach of Montgomery [22] for scalar mul-
tiplication in binary fields. It uses a representa-
tion where computations are performed on the x-
coordinate only. Menezes and Vanstone observed
that the benefit in storage comes at a considerable
expense of speed [20]. Also, from an algorithmic
point of view, Stam concludes that it is less efficient
than other known methods and that in the binary case
it can hardly be recommended [26]. However, our
conclusion is just the opposite, at least for hardware
implementations. This method benefits from independent calculations for the point operations, which can therefore be performed fully in parallel by means of two multipliers. Furthermore, we have optimized the formulae for the point operations to have exactly the same number of field multiplications for point addition and doubling. The field multiplications are performed in corresponding steps in both point operations. We are convinced that this approach also offers improved resistance against side-channel attacks compared to unbalanced methods.
The remainder of this paper is organized as fol-
lows. Section 2 provides the necessary mathemat-
ical background. In Sect. 3 previous work is dis-
cussed. Section 4 gives algorithms and details of
the new implementation. In Sect. 5 the results of
our FPGA implementation of the ECC processor are
presented including a comparison with other relevant
work. Section 6 addresses the security w.r.t. simple
side-channel attacks. It also includes graphs of power
consumption which prove improved side-channel re-
sistance. Section 7 concludes the paper.
2 Elliptic Curves over GF(2^n)
The point or scalar multiplication is the basic op-
eration for cryptographic protocols; it is easily per-
formed via repeated group operations. At the next
(lower) level are the point operations, which are
closely related to the coordinates used to represent
the points. The lowest level consists of finite field op-
erations such as addition, subtraction, multiplication
and inversion required to perform the group opera-
tions.
We introduce some notation. Let P4 = (x4, y4) = P2 − P1 and P5 = (x5, y5) = 2P1 with P3 = P1 + P2. The point P4 is included because the method for point multiplication, as introduced by Montgomery, relies on the fact that to add two points their difference should be known (while the y-coordinate is not needed). The formulae for point addition and doubling from [4] can be rewritten using that P1 = (x1, y1) ∈ E. For P1 ≠ P2 we get:

$$x_3 = \left(\frac{y_1+y_2}{x_1+x_2}\right)^2 + \frac{y_1+y_2}{x_1+x_2} + x_1 + x_2 + a, \qquad y_3 = \left(\frac{y_1+y_2}{x_1+x_2}\right)(x_1+x_3) + x_3 + y_1. \tag{1}$$

If P1 = P2,

$$x_5 = x_1^2 + \frac{b}{x_1^2}, \qquad y_5 = x_1^2 + \left(x_1 + \frac{y_1}{x_1}\right)x_5 + x_5.$$

For P4 we get from Blake et al. [4, Lemma III.2]:

$$x_4 = \left(\frac{y_1+y_2+x_2}{x_1+x_2}\right)^2 + \frac{y_1+y_2+x_2}{x_1+x_2} + x_1 + x_2 + a.$$

We will use the observation that the x-coordinate of P5 does not include the y-coordinate of P1. Also, the sum of the x-coordinates of P3 and P4 can be expressed in terms of x-coordinates only. More precisely, we have:
Lemma 2.1 The x-coordinates of the points P3 = (x3, y3) = P1 + P2 and P4 = (x4, y4) = P2 − P1 on an elliptic curve (1) satisfy:

$$x_3 + x_4 = \frac{x_1 \cdot x_2}{(x_1+x_2)^2}$$

Proof: It follows directly from the formula for addition and the curve equation.
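The step behind Lemma 2.1 can be written out explicitly. The following worked derivation is only a sketch: it uses the expressions for x3 and x4 given above, and the shorthand symbols λ and μ are introduced here for convenience and do not appear in the original text. It relies on the fact that in characteristic 2 addition and subtraction coincide and squaring is additive.

```latex
\begin{align*}
\lambda &= \frac{y_1+y_2}{x_1+x_2}, \qquad
\mu = \frac{y_1+y_2+x_2}{x_1+x_2} = \lambda + \frac{x_2}{x_1+x_2},\\
x_3 + x_4 &= (\lambda^2+\lambda+x_1+x_2+a) + (\mu^2+\mu+x_1+x_2+a)
           = (\lambda+\mu)^2 + (\lambda+\mu)\\
          &= \frac{x_2^2}{(x_1+x_2)^2} + \frac{x_2}{x_1+x_2}
           = \frac{x_2^2 + x_2(x_1+x_2)}{(x_1+x_2)^2}
           = \frac{x_1 x_2}{(x_1+x_2)^2}.
\end{align*}
```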
3 Previous Work on Hardware
Implementations of ECC
This section lists some relevant previous work on
ECC architectures for binary fields. There are many
papers [23, 9, 10, 12] dealing with this topic but
very few efficient hardware implementations present
a completely generic solution which allows an arbi-
trary choice for all parameters. Orlando and Paar [23]
proposed a scalable elliptic curve processor architecture which operates over finite fields GF(2^m). Goodman and Chandrakasan proposed a cryptographic processor [9], which performs a variety of algorithms for PKC applications. Gura et al. [10] have introduced a programmable hardware accelerator for ECC over GF(2^n).
The first bit-serial multiplier was discussed by Beth and Gollmann [3]. This multiplier uses convolution and reduction modulo an irreducible polynomial and takes n clock cycles to compute a multiplication. Relevant algorithms and architectures for multiplication in GF(2^n) have been proposed in [6, 3]. For a detailed survey on finite field multipliers and processors for PKC see Batina et al. [2].
4 A New Hardware Implementation
In this section we describe our new hardware im-
plementation. We follow the top-down approach and
for each step we elaborate our choice.
4.1 Montgomery Method for Point Multiplication in GF(2^n)
For the point multiplication we chose the method
of Montgomery [22]. The algorithm used (Algorithm 1) was considered by López and Dahab [18].
Algorithm 1 Algorithm for point multiplication
Require: an integer k > 0 and a point P
Ensure: x(kP)
1: k ← (k_{l−1}, ..., k_1, k_0)
2: P1 ← P, P2 ← 2P
3: for i from l − 2 downto 0 do
4:   if k_i = 1 then
5:     x(P1) ← x(P1 + P2), x(P2) ← x(2P2)
6:   else
7:     x(P2) ← x(P1 + P2), x(P1) ← x(2P1)
8: end for
9: return x(P1)
The advantage of this algorithm is that it calculates one point addition and one doubling in each step. Moreover, the algorithm requires fewer registers compared to other hardware solutions. This could be of interest for implementations in constrained environments.
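The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the hardware implementation: the function name `montgomery_ladder` and the caller-supplied `add` and `double` callbacks are hypothetical, and the toy check uses ordinary integers in place of curve points so that the result is easy to verify.

```python
def montgomery_ladder(k, P, add, double):
    """Return kP following the control flow of Algorithm 1.

    `add` and `double` are caller-supplied group operations (assumed
    parameters of this sketch). The invariant P2 - P1 = P holds throughout.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    P1, P2 = P, double(P)                          # line 2 of Algorithm 1
    for i in range(k.bit_length() - 2, -1, -1):    # bits k_{l-2} ... k_0
        if (k >> i) & 1:
            P1, P2 = add(P1, P2), double(P2)       # line 5
        else:
            P2, P1 = add(P1, P2), double(P1)       # line 7
        # Either way: exactly one addition and one doubling per key bit.
    return P1


# Toy check: the additive group of integers stands in for the curve.
int_add = lambda p, q: p + q
int_double = lambda p: 2 * p
print(montgomery_ladder(13, 5, int_add, int_double))  # prints 65
```

Because both branches call `add` once and `double` once on independent inputs, the two calls can be mapped to two parallel units, which is the property the hardware design exploits.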
4.2 Point Addition and Doubling
At this level our design is improved with respect to other proposals. Namely, point operations (add and double) are in principle different, which can be exploited from the point of view of side-channel analysis. Some authors have tried before to balance these two operations in order to improve side-channel resistance. We mention here the work of Brier and Joye [5], who suggested two approaches to achieve uniformity of point operations. However, both approaches result in some penalty in speed.
In the formulae of López and Dahab in GF(2^n) point operations are almost balanced as they have A : D = 5M : 6M. Here, A and D are the point operations (addition and doubling) and M is a field multiplication. Consider the formulae for point operations in the case of simple projective coordinates, i.e. xi = Xi/Zi, i = 1, 2. The results of point doubling and point addition, i.e. X5 = X(P5) and X3 = X(P3) = X(P1 + P2) respectively, are calculated as:

$$X_5 = X_1^4 + b Z_1^4, \qquad Z_5 = X_1^2 \cdot Z_1^2, \qquad Z_3 = (X_1 \cdot Z_2 + X_2 \cdot Z_1)^2, \qquad X_3 = x_4 Z_3 + (X_1 \cdot X_2) \cdot (Z_1 \cdot Z_2). \tag{2}$$

It is easy to see that point doubling and addition would require 6 and 5 multiplications respectively. We slightly rewrote the formulae in order to have 6 multiplications for both point operations. We had to add one more multiplication in the point addition, so we used the following formula:

$$X_1 Z_2 + X_2 Z_1 = (X_1 + X_2)(Z_1 + Z_2) - X_1 Z_1 - X_2 Z_2,$$

which follows from the Karatsuba-like approach [13]. The algorithms for point addition and doubling are given in Algorithm 2.
Algorithm 2 EC point addition and doubling
Require: a, c ∈ GF(2^n), c = b^(2^(n−1)), Xi, Zi for i = 1, ..., 4, xi = Xi/Zi, x4 = x(P1 − P2)
Ensure: X(P1 + P2) = X(P3) = X3
1. T1 ← x4, X3 ← X1 + X2, Z3 ← Z1 + Z2
2. Z3 ← X3 · Z3
3. T2 ← X1 · Z1
4. X3 ← X2 · Z2
5. Z3 ← Z3 − T2 − X3
6. Z3 ← Z3^2
7. T1 ← x4 · Z3
8. X3 ← T2 · X3
9. X3 ← X3 + T1

Require: a, c ∈ GF(2^n), c = b^(2^(n−1)), X1, Z1 where x1 = X1/Z1
Ensure: X(2P1) = X(P5) = X5
1. T1 ← c
2. Z5 ← Z1^2
3. X5 ← X1^2
4. T1 ← Z5 · T1
5. T1 ← T1 + X1 − X1
6. Z5 ← X5 · Z5
7. T1 ← T1^2
8. X5 ← X5^2
9. X5 ← X5 + T1
Each point operation requires exactly 6 multiplications, which occur in the same steps of both operations. In Step 5 of the point doubling a redundant operation is inserted to balance even the field additions. The required number of registers is 3 for both cases. More precisely, the following lemma holds:
Lemma 4.1 When Algorithm 1 deploys Algorithm 2 we get the following number of operations in GF(2^n):
#inversions = 1
#multiplications = 12⌊log2 k⌋ + 13
#additions = 6⌊log2 k⌋ + 7
Note that the 12 multiplications require only the time for 6 multiplications since two multiplications are performed in parallel in every iteration of the main loop.
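The step sequences of Algorithm 2 can be transcribed directly and checked on a small field. The sketch below is illustrative only: the toy field GF(2^4) with f(x) = x^4 + x + 1, the multiplication counter and the helper names (`mul`, `inv`, `field_sqrt`, `affine_x`, `count_muls`) are assumptions made for this example and are not taken from the paper. Squarings are counted as multiplications, as in the text.

```python
N, F = 4, 0b10011      # toy field GF(2^4), f(x) = x^4 + x + 1 (assumption)
mul_count = 0          # number of field multiplications performed so far


def mul(a, b):
    """Multiply in GF(2^N) by shift-and-add with reduction; counts the call."""
    global mul_count
    mul_count += 1
    r = 0
    for _ in range(N):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= F
    return r


def inv(a):
    """Inverse as a^(2^N - 2); used here only to check results."""
    r = 1
    for _ in range(2 ** N - 2):
        r = mul(r, a)
    return r


def field_sqrt(b):
    """c = b^(2^(N-1)), the constant c required by Algorithm 2."""
    c = b
    for _ in range(N - 1):
        c = mul(c, c)
    return c


def affine_x(X, Z):
    """Convert a projective x-coordinate X/Z back to affine form."""
    return mul(X, inv(Z))


def point_add(X1, Z1, X2, Z2, x4):
    """Steps 1-9 of the point addition in Algorithm 2; returns (X3, Z3)."""
    T1, X3, Z3 = x4, X1 ^ X2, Z1 ^ Z2   # step 1
    Z3 = mul(X3, Z3)                    # step 2
    T2 = mul(X1, Z1)                    # step 3
    X3 = mul(X2, Z2)                    # step 4
    Z3 = Z3 ^ T2 ^ X3                   # step 5 (subtraction is XOR)
    Z3 = mul(Z3, Z3)                    # step 6
    T1 = mul(x4, Z3)                    # step 7
    X3 = mul(T2, X3)                    # step 8
    X3 = X3 ^ T1                        # step 9
    return X3, Z3


def point_double(X1, Z1, c):
    """Steps 1-9 of the point doubling in Algorithm 2; returns (X5, Z5)."""
    T1 = c                              # step 1
    Z5 = mul(Z1, Z1)                    # step 2
    X5 = mul(X1, X1)                    # step 3
    T1 = mul(Z5, T1)                    # step 4
    T1 = T1 ^ X1 ^ X1                   # step 5 (redundant, for balance)
    Z5 = mul(X5, Z5)                    # step 6
    T1 = mul(T1, T1)                    # step 7
    X5 = mul(X5, X5)                    # step 8
    X5 = X5 ^ T1                        # step 9
    return X5, Z5


def count_muls(op, *args):
    """Run op(*args) and return how many field multiplications it used."""
    before = mul_count
    op(*args)
    return mul_count - before


print(count_muls(point_add, 3, 1, 5, 1, 7), count_muls(point_double, 3, 1, 9))
```

In both routines the multiplications fall in steps 2, 3, 4, 6, 7 and 8, so the two operations present the same multiplication schedule to the datapath.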
4.3 An Algorithm for Field Multiplication
Our circuit implements Algorithm 3; it includes two parts, classical and Montgomery, each of which is a systolic array. Those two arrays process the operand a(x) from different sides and they stop after exactly ⌈n/2⌉ cycles for the bit-serial version and after ⌈s/2⌉ cycles for this new digit-serial architecture. Here w denotes the digit size and s the number of digits of a(x). In short, let us denote by a_MSB(x) and a_LSB(x) the most significant and the least significant half of a(x), respectively. After exactly ⌈s/2⌉ steps the classical and the Montgomery part have calculated a_MSB(x) · b(x) · x^(⌈s/2⌉w) and a_LSB(x) · b(x) · x^(−⌈s/2⌉w), respectively. So each part evaluates half of the polynomial a(x), and XOR-ing them gives the M-residue of the multiplication result Res(x) with r(x) = x^(⌈s/2⌉w). After conversion from Montgomery to the normal representation, we get the result. Namely, the following lemma holds.
Lemma 4.2 The result of Algorithm 3 is in the Montgomery domain, i.e. Res(x) = a(x)b(x)r^(−1)(x) where r(x) = x^⌈n/2⌉.
The idea of combining two algorithms together
was mentioned in [24] and the schematic of the mul-
tiplier was presented in [1].
In Algorithm 3, A_i(x) represents one digit of the polynomial a(x). Also, here ResM_0(x) and F_0(x) are the least significant words of ResM(x) and f(x), respectively. On the other hand, ResC_{n−1}(x) corresponds to the most significant word of ResC(x).
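The combined classical and Montgomery multiplication of Algorithm 3 can be modelled in software as follows. This is a sketch under stated assumptions rather than the systolic-array design: polynomials over GF(2) are stored as Python integers, the operand a(x) is split into its low ⌈s/2⌉ digits and the remaining high digits (zero-padded if necessary), the classical part uses a generic reduction (`poly_mod`) in place of the single-word reduction of line 3, and F0′(x) is taken to be the inverse of f(x) modulo x^w. All function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_mod(a, f):
    """Remainder of a(x) modulo f(x) over GF(2)."""
    n = f.bit_length() - 1
    while a.bit_length() > n:
        a ^= f << (a.bit_length() - 1 - n)
    return a


def inv_mod_xw(f, w):
    """F0'(x): inverse of f(x) modulo x^w; requires f(0) = 1."""
    mask = (1 << w) - 1
    inv = 1
    for i in range(1, w):
        if ((clmul(f, inv) & mask) >> i) & 1:
            inv ^= 1 << i
    return inv


def combined_mul(a, b, f, w):
    """Return (Res, e) with Res = a*b*x^(-e) mod f and e = ceil(s/2)*w."""
    n = f.bit_length() - 1
    s = -(-n // w)                      # number of w-bit digits of a(x)
    h = -(-s // 2)                      # ceil(s/2) loop iterations
    mask = (1 << w) - 1
    f0_inv = inv_mod_xw(f, w)
    a_lo, a_hi = a & ((1 << (h * w)) - 1), a >> (h * w)
    res_c = res_m = 0
    for i in range(h):
        # Classical part: consume the high digits, most significant first.
        digit_hi = (a_hi >> ((h - 1 - i) * w)) & mask
        res_c = poly_mod((res_c << w) ^ clmul(digit_hi, b), f)
        # Montgomery part: consume the low digits, least significant first.
        digit_lo = (a_lo >> (i * w)) & mask
        res_m ^= clmul(digit_lo, b)
        m = clmul(res_m & mask, f0_inv) & mask      # makes the low digit vanish
        res_m = (res_m ^ clmul(m, f)) >> w          # exact division by x^w
    return res_c ^ res_m, h * w


def from_montgomery(res, e, f):
    """Convert back to the normal representation: res * x^e mod f."""
    return poly_mod(res << e, f)


# Example in GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1 and 4-bit digits.
res, e = combined_mul(0x57, 0x83, 0x11B, 4)
print(hex(from_montgomery(res, e, 0x11B)), hex(poly_mod(clmul(0x57, 0x83), 0x11B)))
```

The two updates inside the loop touch disjoint state (`res_c` and `res_m`), which is why the two halves can run concurrently and the loop needs only ⌈s/2⌉ iterations.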
Algorithm 3 Digit-serial Modular Multiplication in GF(2^n)
Require: polynomials a(x), b(x) and f(x), F0′(x)
Ensure: Res(x) = a(x) · b(x) · x^(−⌈n/2⌉) mod f(x)
1: Res(x) = 0, ResC(x) = 0, ResM(x) = 0
2: for i from 0 to ⌈s/2⌉ − 1 do
3:   ResC(x) ← ResC_{n−1} · f(x) + ResC(x) · x^w + A_{s−i} · b(x)
4:   ResM(x) ← ResM(x) + A_i · b(x)
5:   M(x) ← ResM_0 · F0′(x) (mod x^w)
6:   ResM(x) ← ResM(x) + M(x) · f(x)
7:   ResM(x) ← ResM(x) div x^w
8: end for
9: Res(x) = ResC(x) + ResM(x)
10: Return Res(x)

4.4 A prototype FPGA architecture
Our Elliptic Curve Processor (ECP) is shown in Fig. 1. The operation blocks on each level from top to bottom are as follows:
• Level 1: Main Controller
• Level 2:
  1. Normal to Montgomery representation conversion (NtoM)
  2. Affine to Projective coordinates conversion (AtoP)
  3. EC Point Multiplication (PM)
  4. Projective to Affine coordinates conversion (PtoA)
  5. Montgomery to Normal representation conversion (MtoN)
• Level 3: EC Point Doubling (PD) and EC Point Addition (PA)
• Level 4: Modular Addition (MA), Modular Multiplication (MM) and Modular Inversion (MI)
The main control Finite State Machine (FSM) first
commands the NtoM to convert all the inputs from
normal to Montgomery representation and the AtoP
to convert the coordinates from affine to projective.
The NtoM and the AtoP both use the MM to perform
these conversions. When all conversions are finished
the Main Controller orchestrates the PM block to
start the point multiplication by invoking the PD and
the PA in parallel. It writes the resulting X1 and Z1
to its output registers. The FSM inside the PM or-
chestrates these operations. Due to the parallelism
of the point operations both the PD and the PA use
an MA and an MM. The next step after point multi-
plication is conversion from projective to affine rep-
resentation using the MI, which invokes the MM to
perform the modular inversion using Fermat’s little
theorem [15]. Finally, the affine coordinates are con-
verted from Montgomery to normal representation
using the MM.
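The inversion step can be illustrated in software. In GF(2^n) Fermat's little theorem gives a^(−1) = a^(2^n − 2), which is the product of a^(2^i) for i = 1, ..., n − 1 and can therefore be computed with multiplications and squarings only. The sketch below is an assumption-laden illustration: it works in the normal (non-Montgomery) representation, uses the toy field GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1, and the particular multiplication chain and function names are chosen for this example, not taken from the paper.

```python
def gf_mul(a, b, f):
    """Multiply a(x) * b(x) mod f(x) over GF(2); operands have degree < deg f."""
    n = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:
            a ^= f
    return r


def gf_inv(a, f):
    """Invert a(x) as a^(2^n - 2), using only calls to the multiplier."""
    n = f.bit_length() - 1
    result, power = 1, a
    for _ in range(n - 1):
        power = gf_mul(power, power, f)      # a^(2^i) for i = 1 .. n-1
        result = gf_mul(result, power, f)    # accumulate the product
    return result


f = 0x11B                                    # x^8 + x^4 + x^3 + x + 1
a = 0x53
print(hex(gf_inv(a, f)), gf_mul(a, gf_inv(a, f), f))
```

This chain uses 2(n − 1) calls to the multiplier; shorter addition chains exist, and the text does not state which one the MI block implements.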
Figure 1. Architecture of the elliptic curve processor. (Block diagram: Main Controller; NtoM, AtoP, PM, PtoA, MtoN; PA, PD; MA, MM, MI.)
5 Results
The results of our design on a Xilinx Virtex
XCV800 FPGA are given in Table 1.
Table 1. Results for ECC point multiplication for n = 179, where bit-serial (w = 1) and digit-serial (w > 1) multipliers are compared

          # slices   period (ns)   latency (ms)
w = 1     10626      19.165        2.479
w = 4     11433      19.298        1.886
w = 8     11622      20.050        1.008
w = 16    11881      20.961        0.557
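The figures in Table 1 can be turned into derived quantities with simple arithmetic. The short script below is a sketch that uses only the period and latency values from the table; the function names are hypothetical. It converts each clock period to a frequency, estimates the number of clock cycles per point multiplication as latency divided by period, and reports the speed-up relative to the bit-serial case. For w = 16 the period of 20.961 ns corresponds to the 47.7 MHz quoted for this design in Table 2.

```python
# (digit size w, clock period in ns, latency in ms), copied from Table 1
TABLE_1 = [
    (1, 19.165, 2.479),
    (4, 19.298, 1.886),
    (8, 20.050, 1.008),
    (16, 20.961, 0.557),
]


def clock_mhz(period_ns):
    """Clock frequency in MHz for a given period in nanoseconds."""
    return 1e3 / period_ns


def cycles(period_ns, latency_ms):
    """Approximate clock cycles per point multiplication."""
    return latency_ms * 1e6 / period_ns


base_latency = TABLE_1[0][2]
for w, period, latency in TABLE_1:
    print(f"w={w:2d}: {clock_mhz(period):5.1f} MHz, "
          f"about {cycles(period, latency):,.0f} cycles, "
          f"speed-up vs. w=1: {base_latency / latency:.2f}x")
```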
Table 2 presents a broader comparison with other
architectures. We have included only those FPGA
solutions that are either scalable [9, 10] or that are
believed to be the state of the art in ECC hardware
implementations. We give this comparison as a proof
that a scalable and side-channel secure design can
also lead to a solution that is competitive in perfor-
mance.
Table 2. Comparison with other relevant work for point multiplication in GF(2^n) (here w = 16)

Ref.   Field       Freq. (MHz)   Lat. (ms)   Hw compl.
[9]    F_{2^160}   50.0          5           n. a.
[23]   F_{2^167}   76.7          0.84        6 513 g., 501 reg.
[10]   F_{2^160}   66.4          0.59        14 241 LUTs, 2 990 FFs
[12]   F_{2^163}   22.1          7.4         not known
ours   F_{2^179}   47.7          0.557       11 881 sl.

6 Simple Side-Channel Resistance
Implementations of cryptographic algorithms should be resistant to side-channel attacks such as timing [16], power [17] and electromagnetic radiation [25, 8] analysis attacks. These attacks present
a realistic threat for wireless applications and have been demonstrated to be very effective against smart cards without specific countermeasures. Here we discuss the ability of our implementation to withstand simple side-channel attacks, such as Simple Power Analysis (SPA). In that case the attacker can get some information about the secret key by observing one or a few power consumption graphs. To prevent that, cryptographic algorithms should be implemented as sequences of operations that are indistinguishable through simple side-channel analysis. Chevallier-Mames et al. define this property as side-channel atomicity [7]. According to the authors, SPA-resistant algorithms should consist of so-called side-channel atomic blocks, which are algorithm specific. In our implementation, there exist side-channel atomic blocks on different levels of the ECC hierarchy. Following the top-down approach those are point addition/doubling and multiplication/squaring. Figure 2 shows a pattern for point addition implemented as in most of the standard references on ECC [11, 4]. In this case the addition takes 14 multiplications, which is visible from the power trace. (The standard doubling takes 10 multiplications in total.) The same situation in the light of Algorithm 2 results in two patterns that are not so easily distinguished (Figures 3 and 4). In both figures a total of 6 field multiplications can be observed.
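The notion of indistinguishable operation sequences can be illustrated with a small model. The sketch below records only the order of point additions (A) and doublings (D), not power consumption, and it uses integers as stand-in points. The left-to-right double-and-add routine is a baseline introduced here for contrast and is not discussed in the paper; all function names are hypothetical.

```python
def double_and_add_trace(k):
    """Operation sequence of left-to-right double-and-add (baseline only)."""
    trace = []
    for bit in bin(k)[3:]:              # skip '0b' and the leading 1 bit
        trace.append("D")
        if bit == "1":
            trace.append("A")           # extra addition leaks a 1 bit
    return "".join(trace)


def ladder_trace(k):
    """Operation sequence of Algorithm 1, with integers as stand-in points."""
    trace = []

    def add(p, q):
        trace.append("A")
        return p + q

    def double(p):
        trace.append("D")
        return 2 * p

    p1, p2 = 1, double(1)
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            p1, p2 = add(p1, p2), double(p2)
        else:
            p2, p1 = add(p1, p2), double(p1)
    assert p1 == k                      # the ladder really computed k * 1
    return "".join(trace)


for key in (0b1011, 0b1100):
    print(bin(key), double_and_add_trace(key), ladder_trace(key))
```

For two keys of the same bit length the baseline produces different sequences, while the ladder produces the same one. In the hardware described here the addition and the doubling of each iteration additionally run in parallel and use the same multiplication schedule.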
Figure 2. Power consumption trace of an EC point addition as in the IEEE standard. The total number of steps is 14 as visible from the graph.
Figure 3. Power consumption trace of an EC point addition as in Algorithm 2 (6 field multiplications are easily detected).
Figure 4. Power consumption trace of an EC point doubling as in Algorithm 2, which also includes 6 multiplications.
7 Conclusions
In this paper a complete ECC processor for binary fields is presented. An FPGA implementation has been described. We proposed a fully balanced point multiplication algorithm that performs the same operations in every iteration of its main loop. Furthermore, a new algorithm for field multiplication is given, which performs two separate multiplications in parallel. We believe that this so-called side-channel aware design is the first step towards side-channel resistance. However, it is clear that additional countermeasures will be required to prevent more advanced attacks.
References
[1] L. Batina, N. Mentens, S. B. Örs, and B. Preneel. Serial multiplier architectures over GF(2^n) for elliptic curve cryptosystems. In Proceedings of The 12th IEEE MELECON 2004, Dubrovnik, Croatia, May 12-15, 4 pages.
[2] L. Batina, S. B. Örs, B. Preneel, and J. Vandewalle. Hardware architectures for public key cryptography. Elsevier Science Integration the VLSI Journal, 34(1-2):1–64, 2003.
[3] T. Beth and D. Gollmann. Algorithm engineering for
public key algorithm. IEEE Journal on Selected Ar-
eas in Communications, 7(4):458–465, May 1989.
[4] I. Blake, G. Seroussi, and N. P. Smart. Elliptic Curves
in Cryptography. London Mathematical Society Lec-
ture Note Series. Cambridge University Press, 1999.
[5] E. Brier and M. Joye. Weierstrass Ell. Curves and
Side-Channel Attacks. In D. Naccache and P. Pail-
lier, editors, Proc. of PKC’02, volume 2274 of LNCS,
pages 335–345. Springer-Verlag, 2002.
[6] Ç. K. Koç and T. Acar. Montgomery multiplication in GF(2^k). Designs, Codes and Cryptography, 14:57–69, 1998.
[7] B. Chevallier-Mames, M. Ciet, and M. Joye. Low-
cost solutions for preventing simple side-channel
analysis: Side-channel atomicity. IEEE Transactions
on Computers, 53(6):760–768, 2004.
[8] K. Gandolfi, C. Mourtel, and F. Olivier. Electromagnetic analysis: Concrete results. In Ç. K. Koç, D. Naccache, and C. Paar, editors, Proceedings of 3rd International Workshop on Cryptographic Hardware and Embedded Systems (CHES), number 2162 in Lecture Notes in Computer Science, pages 255–265. Springer-Verlag, 2001.
[9] J. Goodman and A. P. Chandrakasan. An energy-
efficient reconfigurable public-key cryptography pro-
cessor. IEEE Journal of Solid-State Circuits,
36(11):1808–1820, November 2001.
[10] N. Gura, S. C. Shantz, H. Eberle, D. Finchelstein, S. Gupta, V. Gupta, and D. Stebila. An end-to-end systems approach to elliptic curve cryptography. In Burt Kaliski Jr., Ç. K. Koç, and C. Paar, editors, Proceedings of 4th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), Lecture Notes in Computer Science 2523, 2002.
[11] IEEE P1363. Standard specifications for public key
cryptography, 1999.
[12] P. Kitsos, G. Theodoridos, and O. Koufopavlou. An efficient reconfigurable multiplier architecture for Galois Field GF(2^m). Elsevier Science Microelectronics Journal, 34:975–980, 2003.
[13] D. E. Knuth. The Art of Computer Programming, vol-
ume 2/Seminumerical Algorithms. Addison-Wesley,
1997.
[14] N. Koblitz. Elliptic curve cryptosystem. Math.
Comp., 48:203–209, 1987.
[15] N. Koblitz. A Course in Number Theory and Cryp-
tography, volume 114 of Graduate text in mathemat-
ics. Springer-Verlag, Berlin, Germany, second edi-
tion, 1994.
[16] P. Kocher. Timing attacks on implementations of
Diffie-Hellman, RSA, DSS and other systems. In
N. Koblitz, editor, Advances in Cryptology: Proceed-
ings of CRYPTO’96, number 1109 in Lecture Notes
in Computer Science, pages 104–113. Springer-
Verlag, 1996.
[17] P. Kocher, J. Jaffe, and B. Jun. Differential power
analysis. In M. Wiener, editor, Advances in Cryp-
tology: Proceedings of CRYPTO’99, number 1666 in
Lecture Notes in Computer Science, pages 388–397.
Springer-Verlag, 1999.
[18] J. López and R. Dahab. Fast multiplication on elliptic curves over GF(2^m). In Ç. K. Koç and C. Paar, editors, Proceedings of 1st International Workshop on Cryptographic Hardware and Embedded Systems (CHES), volume 1717 of Lecture Notes in Computer Science, pages 316–327. Springer-Verlag, 1999.
[19] A. Menezes, P. van Oorschot, and S. Vanstone. Hand-
book of Appl. Cryptog. CRC Press, 1997.
[20] A. Menezes and S. Vanstone. Elliptic curve cryp-
tosystems and their implementations. Journal of
Cryptology, (6):209–224, 1993.
[21] V. Miller. Uses of elliptic curves in cryptography.
In H. C. Williams, editor, Advances in Cryptology:
Proceedings of CRYPTO’85, number 218 in Lec-
ture Notes in Computer Science, pages 417–426.
Springer-Verlag, 1985.
[22] P. Montgomery. Speeding the Pollard and elliptic
curve methods of factorization. Mathematics of Com-
putation, Vol. 48:243–264, 1987.
[23] G. Orlando and C. Paar. A high-performance reconfigurable elliptic curve processor for GF(2^m). In Ç. K. Koç and C. Paar, editors, Proceedings of 2nd International Workshop on Cryptographic Hardware and Embedded Systems (CHES), number 1965 in Lecture Notes in Computer Science, pages 41–56. Springer-Verlag, 2000.
[24] M. J. Potgieter. A hardware implementation of the
group operations necessary for implementing an el-
liptic curve cryptosystem over a characteristic two fi-
nite field. Final report of project EPR400, Technical
University Eindhoven, 2002.
[25] J.-J. Quisquater and D. Samyde. Electromagnetic analysis (EMA): Measures and counter-measures for smart cards. In I. Attali and T. P. Jensen, editors, Smart Card Programming and Security (E-smart 2001), volume 2140 of Lecture Notes in Computer Science, pages 200–210. Springer-Verlag, 2001.
[26] M. Stam. Speeding up Subgroup Cryptosystems. PhD
thesis, Tech. Universiteit Eindhoven, 2003.
