Publisher's version, hosted at the Radboud Repository of the Radboud University Nijmegen: http://repository.ubn.ru.nl/handle/2066/127466
Balanced point operations for side-channel
protection of elliptic curve cryptography
L. Batina, N. Mentens, B. Preneel and I. Verbauwhede
Abstract: The authors propose balanced algorithms for elliptic curve cryptography (ECC).
Point addition and point doubling are balanced; that is, they are implemented as
identical sequences of operations. As an example, the authors implement an ECC point
multiplication algorithm, using the approach of Montgomery, for which a single power trace
exposes neither the Hamming weight nor the bits of the secret key. The field-programmable
gate array implementation is nevertheless compact and efficient. The proposed
multiplier for the finite field operations is digit serial and scalable to arbitrary bit-lengths.
It calculates the result by splitting the multiplication into two processes that run in parallel.
The architecture presented compares favourably with designs presented in the literature.
Furthermore, the power consumption graphs show that the new implementation has an improved
side-channel resistance.
1 Introduction
The best-known and most commonly used public-key
cryptosystems (PKCs) are based on factoring (RSA) and
on the discrete logarithm problem (Difﬁe-Hellman,
ElGamal, Schnorr, DSA) [1]. They allow secure com-
munications over insecure channels without prior
agreement of a shared secret; they also enable efﬁcient
and compact digital signatures. An alternative for
PKC is elliptic curve cryptography (ECC), which was
proposed in the mid-1980s by Miller [2] and Koblitz [3].
For ECC, two types of finite fields are being
considered, i.e. binary and prime fields. A field GF(2^n)
offers far more options as there are many choices for
bases, irreducible polynomials, composite fields etc.
There exist several ways to accelerate this curve-based
arithmetic. Following a bottom-up approach these are
speeding up the ﬁnite-ﬁeld arithmetic (especially multi-
plication and inversion), choosing a ‘good’ representa-
tion (i.e. coordinates that are more efﬁcient) and
accelerating a scalar multiplication operation.
In this article we propose an algorithm for multi-
plication in binary ﬁelds and we describe an efﬁcient
systolic array architecture for the multiplication. The
proposed method performs two parts of the multi-
plication (from LSB and from MSB) in parallel.
Furthermore, we consider the approach of Montgomery
[4] for scalar multiplication. This uses a representation
where computations are performed on the x-coordinate
only. According to López and Dahab [5], the Montgomery
representation requires less memory and offers
better protection against side-channel attacks. The same
conclusion with respect to side-channel protection was
drawn by Joye and Yen [6]. They also observed its
beneﬁt for a parallel computation. Menezes and
Vanstone observed that the beneﬁt in storage is at
considerable expense of speed [7]. From an algorithmic
point of view, Stam concluded that it is less efﬁcient
than other known methods and that in the binary case it
can hardly be recommended [8]. However, our conclu-
sion is just the opposite, at least for hardware
implementations. This method can beneﬁt from inde-
pendent calculations for point operations that can
therefore be performed fully in parallel by means of
two multipliers. Furthermore, we have optimised the
formulae for the point operations to have exactly the
same number of ﬁeld multiplications for point addition
and doubling. The ﬁeld multiplications are performed in
corresponding steps in both point operations. Similar
work has been done by Fischer et al. [9] and Izu and
Takagi [10] in GF(p). However, their point operations
are not fully balanced as in our case. We are convinced
that our approach also offers an improved resistance
against side-channel attacks compared with unbalanced
methods. We provide some evidence for this in the form
of power consumption graphs for what are considered
to be ‘side-channel vulnerable’ operations.
The remainder of this paper is organised as follows.
Section 2 provides the necessary mathematical background
for ECC in GF(2^n), the method of Montgomery
for point multiplication and modular multiplication
(MM) in binary fields. In Section 3 previous work is
discussed and some relevant hardware implementations
of ECC in GF(2^n) are briefly reviewed. Section 4 gives
algorithms and details of the new implementation. In
Section 5 the results of our ﬁeld-programmable gate
array (FPGA) implementation of the ECC processor are
presented, including a comparison with other relevant
work. Section 6 addresses the security with respect to
side-channel attacks. Section 7 concludes the paper and
points to future work.
2 Elliptic curves over GF(2^n)
ECC relies on a group structure induced on an elliptic
curve. A set of points on an elliptic curve (with one
special point added, the so-called point at infinity O)
together with the so-called chord-and-tangent rule has
the structure of an abelian group. Here we consider a
finite field of characteristic 2, i.e. GF(2^n). The point or
scalar multiplication is the basic operation for cryptographic
protocols; it is easily performed via repeated
group operations. At the next (lower) level are the point
operations, which are closely related to the coordinates
used to represent the points. The lowest level consists of
finite field operations such as addition, subtraction,
multiplication and inversion required to perform the
group operations (Fig. 1).

© IEE 2005
IEE Proceedings online no. 20055019
doi:10.1049/ip-ifs:20055019
Paper received 15 July 2005
The authors are with Katholieke Universiteit Leuven, ESAT/SCD-COSIC,
Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium.
E-mail: Lejla.Batina@esat.kuleuven.ac.be
IEE Proc. Inf. Secur.
We introduce some notation. Let $P_4 = (x_4, y_4) = P_2 - P_1$
and $P_5 = (x_5, y_5) = 2P_1$, with $P_3 = P_1 + P_2$. The
point $P_4$ is included because the method for point
multiplication, as introduced by Montgomery, is defined
by the fact that to add two points their difference should
be known (whereas the y-coordinate is not needed). The
formulae for point addition and doubling from [11]
can be rewritten using that $P_1 = (x_1, y_1) \in E$. For
$P_1 \neq P_2$ we get

$$x_3 = \left(\frac{y_1 + y_2}{x_1 + x_2}\right)^2 + \frac{y_1 + y_2}{x_1 + x_2} + x_1 + x_2 + a$$

$$y_3 = \left(\frac{y_1 + y_2}{x_1 + x_2}\right)(x_1 + x_3) + x_3 + y_1 \qquad (1)$$

If $P_1 = P_2$,

$$x_5 = x_1^2 + \frac{b}{x_1^2}$$

$$y_5 = x_1^2 + \left(x_1 + \frac{y_1}{x_1}\right) x_5 + x_5$$

For $P_4$ we get from Blake et al. [11, Lemma III.2]

$$x_4 = \left(\frac{y_1 + y_2 + x_2}{x_1 + x_2}\right)^2 + \frac{y_1 + y_2 + x_2}{x_1 + x_2} + x_1 + x_2 + a$$

We will use the observation that the x-coordinate of
$P_5$ does not include the y-coordinate of $P_1$. Also the
x-coordinate of the sum of $P_3$ and $P_4$ can be expressed
with the x-coordinate only. More precisely, we have
Lemma 2.1: [5] The x-coordinates of the points
$P_3 = (x_3, y_3) = P_1 + P_2$ and $P_4 = (x_4, y_4) = P_2 - P_1$ on an elliptic
curve (1) satisfy

$$x_3 + x_4 = \frac{x_1 \cdot x_2}{(x_1 + x_2)^2}$$

Proof: It follows directly from the formula for addition
and the curve equation. □
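The step behind Lemma 2.1 can be spelled out as follows. This is a sketch that uses only the formulae for $x_3$ and $x_4$ given above; the shorthand symbols $\lambda$ and $\mu$ are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
\lambda &= \frac{y_1 + y_2}{x_1 + x_2}, \qquad
\mu = \frac{y_1 + y_2 + x_2}{x_1 + x_2} = \lambda + \frac{x_2}{x_1 + x_2}, \\
x_3 + x_4 &= (\lambda^2 + \lambda) + (\mu^2 + \mu)
  && \text{(the terms } x_1 + x_2 + a \text{ cancel in characteristic 2)} \\
&= \frac{x_2^2}{(x_1 + x_2)^2} + \frac{x_2}{x_1 + x_2}
  && \text{(since } \mu^2 = \lambda^2 + x_2^2/(x_1 + x_2)^2 \text{)} \\
&= \frac{x_2^2 + x_2 (x_1 + x_2)}{(x_1 + x_2)^2}
 = \frac{x_1 x_2}{(x_1 + x_2)^2}.
\end{align*}
```

The result no longer contains $y_1$ or $y_2$, which is what allows the point addition to be carried out on x-coordinates only once $x_4$ is known.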
3 Previous work on hardware
implementations of ECC
This Section lists some relevant previous work on ECC
architectures for binary ﬁelds. There are many papers
[12–16] dealing with this topic but very few efﬁcient
hardware implementations present a completely generic
solution which allows an arbitrary choice for all
parameters: ﬁeld size, digit-length, irreducible poly-
nomial, elliptic curve parameters, coordinates etc. We
opt for a completely generic implementation as security
criteria are changing frequently. Nevertheless, we did
not need any optimisation to boost performance, which
would be possible by ﬁxing some parameters, e.g. special
curves, sparse polynomials etc.
As field multiplication is the most crucial operation for
efficient hardware implementations, previous work
on finite field multipliers should also be considered. The
first bit-serial multiplier for finite fields was discussed by
Beth and Gollmann [17]. This multiplier uses convolution
and reduction modulo an irreducible polynomial
and takes n clock cycles to compute a multiplication.
Relevant algorithms and architectures for multiplication
in GF(2^n) have been proposed in [17–23].
In 2000 Orlando and Paar proposed a scalable elliptic
curve processor (ECP) architecture which operates over
finite fields GF(2^m) [13]. The architecture is scalable,
with a separate (bit-parallel) squarer. Goodman and
Chandrakasan proposed a cryptographic processor in
[14] which performs a variety of algorithms for PKC
applications. Multiplication is performed with a bit-serial
multiplier using the Montgomery modular multiplication
(MMM) [24]. Gura et al. [15] have introduced
a programmable hardware accelerator for ECC over
GF(2^n) which can handle arbitrary field sizes up to 255.
The multiplier they use is a digit-serial shift-and-add
multiplier. For a detailed survey of finite field multipliers
and processors for PKC see [25].
4 A new hardware implementation
In this Section we describe our new hardware imple-
mentation. We follow the top-down approach and for
each step we elaborate our choice.
4.1 Montgomery method for point multiplication in GF(2^n)
For the point multiplication we chose the method of
Montgomery that maintains the relationship P2 − P1 as
invariant [4]. The idea of Montgomery dealt with
speeding up the calculation of only the x-coordinate of
the result. More precisely, to add two points their
difference is used as an input parameter while the
y-coordinate is not used in the algorithm. This
is justified because cryptographic applications rarely use
the y-coordinate. The algorithm to be used (Algorithm 1)
is a variant of the binary method and was considered by
López and Dahab. They have also introduced an option
for recovering the y-coordinate [5].

Fig. 1 Scheme of the hierarchy for ECC operations (levels: point multiplication; point doubling and point addition; GF(2^n) inversion, GF(2^n) multiplication and GF(2^n) addition)

Algorithm 1: Algorithm for point multiplication
Require: an integer k > 0 and a point P
Ensure: x(kP)
1: k ← (k_{l−1}, ..., k_1, k_0)
2: P1 ← P, P2 ← 2P
3: for i from l − 2 downto 0 do
4:   if k_i = 1 then
5:     x(P1) ← x(P1 + P2), x(P2) ← x(2P2)
6:   else
7:     x(P2) ← x(P1 + P2), x(P1) ← x(2P1)
8: end for
9: return x(P1)
The advantage of this algorithm is that it calculates
one point addition and one doubling in each step. In this
way the loop operations do not depend on the exponent,
which could offer an increased resistance against timing
and other side-channel attacks. In addition, we noticed
that the algorithm requires fewer registers than other
hardware solutions. Nevertheless, the performance is
not much affected. We discuss this in more detail in
what follows.
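The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustration only: the group is passed in as two callables, and the toy group used here (integers modulo 1009 under addition) is an assumption chosen so that the result is easy to check; it stands in for the elliptic curve group and is not part of the paper.

```python
def montgomery_ladder(k, P, add, double):
    """Return k*P with the ladder of Algorithm 1 (requires k > 0)."""
    bits = bin(k)[2:]                 # k_{l-1} ... k_0, with k_{l-1} = 1
    P1, P2 = P, double(P)
    for bit in bits[1:]:              # i = l-2 downto 0
        if bit == "1":
            P1, P2 = add(P1, P2), double(P2)
        else:
            P1, P2 = double(P1), add(P1, P2)
        # Ladder invariant: P2 - P1 == P after every iteration,
        # and every iteration costs exactly one add and one double.
    return P1


# Toy stand-in group (assumption): integers modulo a prime under addition.
MODULUS = 1009


def toy_add(a, b):
    return (a + b) % MODULUS


def toy_double(a):
    return (2 * a) % MODULUS
```

For example, `montgomery_ladder(13, 7, toy_add, toy_double)` evaluates 13·7 in the toy group. Both branches perform the same pair of operations, so only the operands depend on the key bit.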
4.2 Point addition and doubling
In this part we move one level lower, i.e. to point
operations. This is where our design is improved with
respect to other proposals. Namely, point operations
(add and double) are in principle different, which can be
exploited from the viewpoint of side-channel analysis.
Some authors have already tried to balance these two
operations in order to improve side-channel resistance.
Chevallier-Mames et al. [26] presented a balanced
algorithm for ECC over binary fields in the case of
affine coordinates. We also mention the work of Brier
and Joye [27], who suggested two approaches to achieve
uniformity of point operations. However, both
approaches result in some penalty in speed. For the
formulae of López and Dahab in GF(2^n) the operation
count is A : D = 5M : 6M. Here, A and D are the point
operations (addition and doubling) and M is a field
multiplication. We remind the reader that field addition
in hardware for GF(2^n) is just a simple bitwise XOR
operation and therefore is not taken into account.
As already mentioned, we deal with projective
coordinates to avoid expensive inversions in hardware.
Let us consider the formulae for point operations in the
case of simple projective coordinates, i.e. $x_i = X_i/Z_i$,
$i = 1, 2$. The results of point doubling and point
addition, i.e. $X_5 = X(P_5)$ and $X_3 = X(P_3) = X(P_1 + P_2)$,
respectively, are calculated as

$$X_5 = X_1^4 + b Z_1^4$$

$$Z_5 = X_1^2 \cdot Z_1^2 \qquad (2)$$

$$Z_3 = (X_1 \cdot Z_2 + X_2 \cdot Z_1)^2$$

$$X_3 = x_4 Z_3 + (X_1 \cdot X_2) \cdot (Z_1 \cdot Z_2).$$

It is easy to see that point doubling and addition
would require six and five multiplications, respectively.
We slightly rewrote the formulae in order to have six
multiplications for both point operations. We had to
add one more multiplication in the point addition, so we
used the following formula:

$$X_1 Z_2 + X_2 Z_1 = (X_1 + X_2)(Z_1 + Z_2) - X_1 Z_1 - X_2 Z_2$$

which follows from the Karatsuba-like approach.
Karatsuba's algorithm reduces the problem of multiplying
two 2n-bit numbers to three multiplications of n-bit
numbers. Here we need one extra multiplication, so we compute
$X_1 Z_2 + X_2 Z_1$ with three multiplications. The next
property we want is to have balanced field multiplications
in each step of the point operation algorithms
(this is not the case for the algorithm of [5]). In this way
the two multipliers will work fully in parallel while the
exponent is scanned bit by bit. Then for each bit one
addition and one doubling are performed, with fully
synchronised multipliers. The algorithms for point
addition and doubling are given in Algorithm 2.
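The Karatsuba-like identity can be checked by expanding the product. The short derivation below uses only the symbols already defined and the fact that subtraction coincides with addition in a field of characteristic 2.

```latex
\begin{align*}
(X_1 + X_2)(Z_1 + Z_2) &= X_1 Z_1 + X_1 Z_2 + X_2 Z_1 + X_2 Z_2, \\
(X_1 + X_2)(Z_1 + Z_2) + X_1 Z_1 + X_2 Z_2 &= X_1 Z_2 + X_2 Z_1 .
\end{align*}
```

The two products $X_1 Z_1$ and $X_2 Z_2$ are not wasted: their product equals $(X_1 \cdot X_2) \cdot (Z_1 \cdot Z_2)$, the second term of $X_3$. Counting the three products, the squaring, $x_4 Z_3$ and $(X_1 Z_1)(X_2 Z_2)$ gives the six multiplications of the point addition.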
Each point operation requires exactly six multiplications,
which are also balanced with respect to the nine
steps in Algorithm 2. In Step 5 of the point doubling a
redundant operation is inserted to balance the field
additions. The required number of intermediate n-bit
registers is three for both cases. More precisely, the
following lemma holds:
Lemma 4.1: When Algorithm 1 deploys Algorithm 2 we
get the following number of operations in GF(2^n):
$\#\text{inversions} = 1$
$\#\text{multiplications} = 12 \lfloor \log_2 k \rfloor + 13$
$\#\text{additions} = 6 \lfloor \log_2 k \rfloor + 7$
Note that the 12 multiplications require only the time
for 6 multiplications since 2 multiplications are performed
in parallel in every iteration of the main loop. In
this way we improved the formulae of López and Dahab
in order to have fully balanced point operations to
counteract simple side-channel attacks.
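The nine steps of Algorithm 2 can be followed in a minimal Python sketch. It assumes a toy field GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1 and toy curve parameters a = 1, b = 0x1D; these values and all function names are illustrative assumptions, not parameters from the paper, whose hardware targets n = 179 and n = 211. Squarings go through the general multiplier, as in the paper. The helper `affine_x` evaluates the affine formulae of Section 2 so that the projective results can be cross-checked.

```python
M = 8             # toy extension degree (assumption)
F = 0x11B         # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)
A, B = 1, 0x1D    # toy curve y^2 + xy = x^3 + A*x^2 + B (assumption)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M), reducing modulo F."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r


def gf_inv(a):
    """Inverse as a^(2^M - 2); only needed for the affine reference."""
    r = 1
    for _ in range(2 ** M - 2):
        r = gf_mul(r, a)
    return r


# c = b^(2^(M-1)), so that c^2 = b.
C = B
for _ in range(M - 1):
    C = gf_mul(C, C)


def point_add(X1, Z1, X2, Z2, x4):
    X3, Z3 = X1 ^ X2, Z1 ^ Z2        # step 1
    Z3 = gf_mul(X3, Z3)              # step 2
    T2 = gf_mul(X1, Z1)              # step 3
    X3 = gf_mul(X2, Z2)              # step 4
    Z3 = Z3 ^ T2 ^ X3                # step 5: X1*Z2 + X2*Z1
    Z3 = gf_mul(Z3, Z3)              # step 6
    T1 = gf_mul(x4, Z3)              # step 7
    X3 = gf_mul(T2, X3)              # step 8
    X3 = X3 ^ T1                     # step 9
    return X3, Z3


def point_double(X1, Z1, c=C):
    T1 = c ^ Z1                      # step 1 (two additions,
    T1 = T1 ^ Z1                     #         leaving T1 = c)
    Z5 = gf_mul(Z1, Z1)              # step 2
    X5 = gf_mul(X1, X1)              # step 3
    T1 = gf_mul(Z5, T1)              # step 4
    T1 = T1 ^ X1 ^ X1                # step 5: redundant, balances additions
    Z5 = gf_mul(X5, Z5)              # step 6
    T1 = gf_mul(T1, T1)              # step 7
    X5 = gf_mul(X5, X5)              # step 8
    X5 = X5 ^ T1                     # step 9
    return X5, Z5


def affine_x(P1, P2):
    """x-coordinates of P1+P2, P1-P2 and 2*P1 from the affine formulae."""
    (x1, y1), (x2, y2) = P1, P2
    d = gf_inv(x1 ^ x2)
    lam, mu = gf_mul(y1 ^ y2, d), gf_mul(y1 ^ y2 ^ x2, d)
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    x4 = gf_mul(mu, mu) ^ mu ^ x1 ^ x2 ^ A
    x1_sq = gf_mul(x1, x1)
    x5 = x1_sq ^ gf_mul(B, gf_inv(x1_sq))
    return x3, x4, x5
```

Both functions issue a field multiplication in steps 2, 3, 4, 6, 7 and 8 and only additions in steps 1, 5 and 9, which is the balancing property the section describes. A result (X, Z) represents the affine value x whenever X equals x·Z in the field.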
4.3 An algorithm for field multiplication
The standard way to compute the product
$c(x) = a(x) \cdot b(x) \bmod f(x)$ is using convolution, which we refer to as
the classical algorithm. Another possible way to
calculate the product of two polynomials in GF(2^n) is
Montgomery's multiplication algorithm as proposed in
[18]. Here, we define the MMM as
$\mathrm{MMM}[a(x), b(x)] := a(x) \cdot b(x) \cdot r^{-1}(x) \bmod f(x)$. Before a sequence of
operations can be started, all operands have to be
converted to the form $a(x) \cdot r(x) \bmod f(x)$, the so-called
M-residue of the operand, by multiplication with $r(x) = x^n$.
Our circuit implements Algorithm 3. The combined
MM algorithm includes two parts, classical and
Montgomery, each of which is a systolic array. The
parts look quite similar as their cells are performing
similar operations, i.e. multiplication and XOR. The
difference is that they shift in opposite directions and
they start from opposite parts of the loop. While the
classical multiplier starts the shift-and-add process
from the MSB of one of the operands and shifts the
cumulative result left, the Montgomery-based multiplier
starts at the LSB and shifts the result right. Those
two arrays process the operand a(x) from different sides
and they stop after exactly $\lceil s/2 \rceil$ cycles (here s is
the number of digits). The classical part still has to
perform a shift over $\lceil s/2 \rceil$ bits, but this is taken care
of by the conversion of the M-residue of the
result. More precisely, the M-residue is of the form
$a(x) b(x) r^{-1}(x) \bmod f(x)$ where $r(x) = x^{\lceil s/2 \rceil}$. The
multiplication is completed by calculating MMM[Res(x), 1].
The idea of combining two algorithms together was
mentioned for the first time in [28] and the schematic of
the multiplier was presented in [29].

Algorithm 2: EC point addition and doubling

Point addition
Require: X_i, Z_i for i = 1, ..., 4; x_i = X_i/Z_i, x_4 = x(P1 − P2)
Ensure: X(P1 + P2) = X(P3) = X3, Z3

Point doubling
Require: c ∈ GF(2^n), c = b^(2^(n−1)), X1, Z1 where x_1 = X1/Z1
Ensure: X(2P1) = X(P5) = X5, Z5

Step  Point addition                 Point doubling
1.    X3 ← X1 + X2, Z3 ← Z1 + Z2     T1 ← c + Z1, T1 ← T1 + Z1
2.    Z3 ← X3 · Z3                   Z5 ← Z1^2
3.    T2 ← X1 · Z1                   X5 ← X1^2
4.    X3 ← X2 · Z2                   T1 ← Z5 · T1
5.    Z3 ← Z3 + T2 + X3              T1 ← T1 + X1 + X1
6.    Z3 ← Z3^2                      Z5 ← X5 · Z5
7.    T1 ← x4 · Z3                   T1 ← T1^2
8.    X3 ← T2 · X3                   X5 ← X5^2
9.    X3 ← X3 + T1                   X5 ← X5 + T1
A schematic of the multiplier is presented in Fig. 2.
Ai, Bi and Fi are the coefﬁcients of a(x), b(x) and f(x),
respectively. The outputs c(x) become inputs to the
systolic arrays in the next clock cycle. Finally, the result
of the multiplication is obtained by XOR-ing the
outputs of both systolic arrays.
In Figure 3, one processing element (PE) of the array
is depicted. There is the so-called regular PE, while the
boundary PEs are called leftmost and rightmost PE.
They have a slightly different structure which is shown
in Fig. 3b (leftmost PE) and 3c (rightmost PE).
After conversion from the M-residue to the normal
representation, we get the result. Namely, the following
lemma holds.
Lemma 4.2: The result of Algorithm 3 is in the
Montgomery domain, i.e. $\mathrm{Res}(x) = a(x) b(x) r^{-1}(x)$
where $r(x) = x^{\lceil s/2 \rceil}$.
Proof: Here $n = s \cdot w$, i.e. the bit representation can be
written in s words of length w. After $\lceil s/2 \rceil$ steps in
Algorithm 3, the left (ResC) part calculated has been
shifted to the right for $\lceil s/2 \rceil \cdot w$ bits. At the same time,
the right (ResM) part has been shifted to the left for the
same number of bits due to the division with $x^w$. This
results in a partial result of the form
$\mathrm{ResM} = a(x) b(x) x^{-\lceil s/2 \rceil w}$. The ResC part still has to be shifted
over the remaining $\lceil n/2 \rceil$ bits, which is adjusted due to the
remaining conversion from Montgomery to normal
representation. □
In Algorithm 3, $A_i(x)$ represents one digit of the
polynomial a(x). The addition step is performed by
multiplication of $A_i(x)$ with b(x). In order to be able to
perform Step 7 in Algorithm 3, a multiple of f(x) has to
be added which results in ResM(x) being divisible by $x^w$.
So, for $\mathrm{ResM}(x) \neq 0 \bmod x^w$, we need to find M(x) such
that $\mathrm{ResM}(x) + M(x) f(x) = 0 \bmod x^w$. M(x) is calculated
from the following equation:
$M(x) = \mathrm{ResM}_0(x) \cdot F_0^{-1}(x) \bmod x^w$.
It is easy to see that $F_0^{-1}(x) \bmod x^w \equiv F_0'(x)$,
which follows from the relations for
Montgomery's parameter r(x) [18]. Also, here
$\mathrm{ResM}_0(x)$ and $F_0(x)$ are the least significant words of
ResM(x) and f(x), respectively.
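The combined classical and Montgomery multiplication can be modelled in software as follows. This is a sketch under stated assumptions, not the authors' circuit: polynomials over GF(2) are held in Python integers, w is assumed to divide n, the digit count s = n/w is assumed to be even, and the classical half reads digit A_{s−1−i} with digits indexed from 0. The names `clmul`, `polymod`, `inv_mod_xw` and `combined_mm` are hypothetical helpers introduced here.

```python
def clmul(a, b):
    """Carry-less (GF(2)[x]) product of two polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, f):
    """Remainder of a(x) divided by f(x) over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def inv_mod_xw(f0, w):
    """Inverse of f0(x) modulo x^w; requires the constant term of f0 to be 1."""
    inv = 1
    for i in range(1, w):
        if (clmul(f0, inv) >> i) & 1:
            inv ^= 1 << i
    return inv


def combined_mm(a, b, f, n, w):
    """Return a*b*x^(-(s/2)*w) mod f, with s = n/w digits (s assumed even)."""
    s = n // w
    half = s // 2
    mask = (1 << w) - 1
    f0_inv = inv_mod_xw(f & mask, w)
    res_c = res_m = 0
    for i in range(half):
        # Classical part: consume digits from the most significant end.
        a_hi = (a >> ((s - 1 - i) * w)) & mask
        res_c = polymod((res_c << w) ^ clmul(a_hi, b), f)
        # Montgomery part: consume digits from the least significant end.
        a_lo = (a >> (i * w)) & mask
        res_m ^= clmul(a_lo, b)
        m = clmul(res_m & mask, f0_inv) & mask   # M(x) = ResM_0 * F_0^{-1} mod x^w
        res_m ^= clmul(m, f)                     # low w bits are now zero
        res_m >>= w                              # exact division by x^w
    return res_c ^ res_m
```

Multiplying the returned value by x^((s/2)·w) and reducing modulo f(x) recovers the ordinary product a(x)·b(x) mod f(x), which is the conversion out of the Montgomery domain that Lemma 4.2 refers to. Each half of the loop runs for only s/2 iterations, which is where the parallel design saves time.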
4.4 A prototype FPGA architecture
Our ECP is shown in Fig. 4. The operation blocks on
each level from top to bottom are as follows:
• Level 1: Main controller
• Level 2:
  (1) Normal to Montgomery representation conversion (NtoM)
  (2) Affine to projective coordinates conversion (AtoP)
  (3) EC point multiplication (PM)
  (4) Projective to affine coordinates conversion (PtoA)
  (5) Montgomery to normal representation conversion (MtoN)
• Level 3: EC point doubling (PD) and EC point addition (PA)
• Level 4: Modular addition (MA), MM and modular inversion (MI)
Algorithm 3: Digit-serial MM in GF(2^n)
Require: polynomials a(x), b(x) and f(x), F_0'(x)
Ensure: Res(x) = a(x) · b(x) · x^(−⌈n/2⌉) mod f(x)
1: Res(x) = 0, ResC(x) = 0, ResM(x) = 0
2: for i from 0 to ⌈s/2⌉ − 1 do
3:   ResC(x) ← ResC_{n−1} · f(x) + ResC(x) · x + A_{s−i} · b(x)
4:   ResM(x) ← ResM(x) + A_i · b(x)
5:   M(x) ← ResM_0 · F_0'(x) (mod x^w)
6:   ResM(x) ← ResM(x) + M(x) · f(x)
7:   ResM(x) ← ResM(x) div x^w
8: end for
9: Res(x) = ResC(x) + ResM(x)
10: Return Res(x)

Fig. 2 Schematic of the multiplier. The classical and the Montgomery parts are calculated in parallel and the result is XORed afterwards

The main control finite state machine (FSM) first
commands the NtoM to convert all the inputs from
normal to Montgomery representation and the AtoP to
convert the coordinates from affine to projective. The
NtoM and the AtoP both use the MM to perform these
conversions. The AtoP also needs the MA to do some
precalculations, see (2). When all conversions are
finished the main controller orchestrates the PM block
to start the point multiplication by invoking the PD and
the PA in parallel. It writes the resulting X1 and Z1 to its
output registers. The FSM inside the PM orchestrates
these operations. Due to the parallelism of the point
operations both the PD and the PA use an MA and an
MM. The next step after point multiplication is
conversion from projective to affine representation
using the MI, which invokes the MM to perform the
MI using Fermat's theorem. Finally, the affine coordinates
are converted from Montgomery to normal
representation using the MM.
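Inversion by Fermat's theorem needs nothing beyond a multiplier, because a^(−1) = a^(2^n − 2) in GF(2^n). The Python sketch below illustrates this in a toy field GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1; the field, the plain (non-Montgomery) multiplication and the function names are assumptions made for illustration, and the paper does not state which exponentiation schedule the MI block uses.

```python
N = 8          # toy extension degree (assumption)
F = 0x11B      # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)


def gf_mul(a, b):
    """Multiply two elements of GF(2^N), reducing modulo F."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= F
    return r


def gf_inv_fermat(a):
    """Return (a^(2^N - 2), number of squarings, number of multiplications).

    Uses 2^N - 2 = 2 + 4 + ... + 2^(N-1), i.e. the product of the
    repeated squares a^2, a^4, ..., a^(2^(N-1)).
    """
    power = gf_mul(a, a)                  # a^2
    result = power
    squarings, multiplications = 1, 0
    for _ in range(N - 2):
        power = gf_mul(power, power)      # next repeated square
        result = gf_mul(result, power)    # accumulate the product
        squarings += 1
        multiplications += 1
    return result, squarings, multiplications
```

This schedule uses n − 1 squarings and n − 2 multiplications, 2(n − 2) + 1 multiplier calls in total when squarings also go through the multiplier. That count has the same shape as the MI term 6s(n − 2) + 3s of the latency formula in Section 5, although reading that term as 3s cycles per call is an inference and is not stated in the text.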
The ﬂowchart of the FSM inside the point multi-
plication block is shown in Fig. 5. When the START
signal is set, the bits of k are evaluated from MSB to
LSB resulting in the assignment of new values for P1
and P2. These values depend on the key-bit ki. When
all bits have been evaluated, an internal counter gives
an END signal. The result of the last P1 calculation
is written to the output register and the VALID output
is set.
5 Results
The results of our design on a Xilinx Virtex XCV800
FPGA are given in Table 1.
The formula for the latency that is used in Table 1 is:
latency = [21s] + [6s(n − 2) + 3s] + [18sn + 4n], where the
three parts of the formula correspond to the calculations
of the main conversion operations, the MI and the point
multiplication, respectively.
One of the consequences of the scalability of the
design is that the minimal clock period does not depend
on the bit-length n; it depends only on the digit-length w.
This can be observed in Table 1.
Table 2 presents a broader comparison with other
architectures. We found that comparing designs is hard
since these designs have been optimised for different
goals, have been implemented on different platforms
and have chosen different options for bases, coordinates,
irreducible polynomials etc. Moreover, some of
these solutions are not scalable: in these designs, some
parameters are fixed (such as a special polynomial,
special reduction etc.), which boosts the performance.
Therefore, we have included only the solutions that are
either scalable [14, 15] or believed to be the state of the
art in ECC hardware implementations. The purpose of
this table is mainly to reference the prior methodology in
this area. We give this comparison as evidence that a
scalable and side-channel-secure design can also lead to
a solution that is competitive in performance.

Fig. 3 (a) The regular PE of the systolic array, (b) the leftmost PE of the systolic array, (c) the rightmost PE of the systolic array
6 Side-channel security
Fig. 4 Architecture of the ECP
Fig. 5 Flowchart of the FSM inside the ECPM

Implementations of cryptographic algorithms should be
resistant to side-channel attacks such as timing [30],
power [31] and electromagnetic radiation [32, 33]
analysis attacks. These attacks present a realistic threat
for wireless applications and have been demonstrated to
be very effective against smart cards without specific
countermeasures. The latency of the proposed algorithm
does not depend on the Hamming weight of the
exponent, which makes it very suitable for defending
against timing attacks. Indeed, timing attacks can
exploit all steps with non-constant execution time;
conditional instructions are a typical target. Also in that
case power traces may reveal some secret information
that would allow the attacker to perform simple power
analysis, i.e. an SPA attack. For example, a typical
double-and-add algorithm [1], which executes the point
doubling and addition operations if the i-th bit of the
exponent is k_i = 1 and otherwise (for k_i = 0) performs
only doubling, is not SPA resistant.
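The difference between the two scalar multiplication methods can be made concrete by listing the point operations each one executes. The sketch below is an abstraction: it records only the sequence of operation names ("D" for doubling, "A" for addition) and assumes, for illustration, that an attacker can tell the two apart in a single trace. The function names are hypothetical.

```python
def double_and_add_trace(k):
    """Operations of left-to-right double-and-add, one entry per key bit."""
    ops = []
    for bit in bin(k)[3:]:                 # skip the leading 1 of k
        ops.append("DA" if bit == "1" else "D")
    return ops


def ladder_trace(k):
    """Operations of the Montgomery ladder: one add and one double per bit."""
    return ["AD" for _ in bin(k)[3:]]


def recover_bits(trace):
    """Key bits an observer could read off a double-and-add trace."""
    return "1" + "".join("1" if step == "DA" else "0" for step in trace)
```

For k = 0b1011001 the double-and-add trace is D, DA, DA, D, D, DA, from which `recover_bits` reconstructs the key, whereas the ladder trace is AD for every bit and is identical for all keys of the same length.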
Also, the computational difference between the point
operations is a typical target of an attacker [27, 34]. To
prevent that, cryptographic algorithms should be
implemented as sequences of operations that are
indistinguishable through simple side-channel analysis.
Chevallier-Mames et al. deﬁne this property as the side-
channel atomicity [26]. In their view, SPA-resistant
algorithms should consist of so-called side-channel
atomic blocks which are algorithm speciﬁc. In our
implementation, there exist side-channel atomic blocks
on different levels of the ECC hierarchy. Following the
top-down approach, these are point addition/doubling
and multiplication/squaring. For that purpose we use
the same multiplier for both multiplication and squar-
ing, although there exist more efﬁcient architectures for
squaring. Here we propose a fully balanced point
multiplication that performs the same operations for
every loop of the algorithm. More precisely, Algorithm
2 for point operations executes exactly one ﬁeld multi-
plication in corresponding steps. It is not the case in
general, as doubling and addition are two different
operations, as in most of the standard references on
ECC [11, 35]. With our new Algorithm 2, the point
operations result in two patterns that are hard to distinguish
(Figs 6 and 7). In both figures a total of six field
multiplications can be observed. Furthermore, both
operations (double and add) are performed in parallel
for each step in Algorithm 1.
The systolic array architecture also allows more
parallelism of the multiplication process, which will
make power analysis more difﬁcult. These ﬁrst results,
although proving SPA and timing resistance, are just the
ﬁrst step towards a side-channel-resistant design. In
future work this algorithm should be more carefully
examined with respect to security against side-channel
attacks; however, we are convinced that our design
approach offers an increased resistance to these attacks.
Moreover, for more advanced attacks such as (ﬁrst- or
higher-order) differential power analysis and differential
electromagnetic analysis, other countermeasures are
required as well.
7 Conclusions and future work
A complete ECC processor for binary fields has been
presented. An FPGA implementation has been fully
described including a bit/digit-serial multiplier that
combines two previously known multiplication methods.
The proposed architecture is a systolic array that allows
for good performance in speed and more parallelism in
operation, which is also beneficial for side-channel
security. Furthermore, we have proposed a fully
balanced point multiplication algorithm that performs
the same operations for every loop of the algorithm.
By using this approach we have shown that this so-called
side-channel-aware design is the first step towards
side-channel resistance. However, it is clear that additional
countermeasures might be required.

Table 1: Implementation results for bit-lengths of n = 179 and n = 211 and comparison of bit-serial and digit-serial multipliers (results for ECC point multiplication)

          No. slices          Period (ns)         Latency (ms)
          n = 179  n = 211    n = 179  n = 211    n = 179  n = 211
w = 1     10 626   13 863     19.165   19.159     2.479    3.438
w = 4     11 433   14 660     19.298   19.301     1.886    2.614
w = 8     11 622   14 950     20.050   20.043     1.008    1.391
w = 16    11 881   15 367     20.961   20.971     0.557    0.763

Table 2: Comparison of different architectures for multiplication in GF(2^n)

Impl.    Fin. field   FPGA       Frequency (MHz)   Latency (ms)   Hardware compl.
[14]     GF(2^160)    ASIC       50.0              Est. 5         n. a.
[13]     GF(2^167)    XCV400E    76.7              Est. 0.84      6,513 Gates, 501 reg.
[15]a    GF(2^163)    XCV2000E   66.4              Est. 0.586     14241 LUTs, 2990 FFs
[16]     GF(2^163)    XCV1000E   22.1              7.4            Not known
[]b      GF(2^179)    XCV800     47.7              0.991          7,788 Slices

a The performance of [15] is for so-called known curves because of a special reduction. For generic curves the performance is much slower.
b This work.

Fig. 6 Power consumption trace of a PA as in Algorithm 2 (six field multiplications are easily detected)
8 Acknowledgements
Lejla Batina and Nele Mentens are funded by research
grants of the Katholieke Universiteit Leuven, Belgium.
This work was supported by FWO projects (G.0450.04
and G.0141.03) and by the EU IST FP6 projects
SCARD and ECRYPT.
9 References
1 Menezes, A., van Oorschot, P., and Vanstone, S.: ‘‘Handbook of
Applied Cryptography’’ (CRC Press, 1997)
2 Miller, V.: ‘‘Uses of elliptic curves in cryptography’’. In Williams,
H.C. (Ed.), Proc. Advances in Cryptology: CRYPTO’85, Santa
Barbara, CA, August, 1985, LNCS 218 (Springer-Verlag, 1985),
pp. 417–426
3 Koblitz, N.: ‘‘Elliptic curve cryptosystem’’. Math. Comp., 1987,
48, pp. 203–209
4 Montgomery, P.: ‘‘Speeding the Pollard and elliptic
curve methods of factorization’’. Math. Comput., 1987, 48,
pp. 243–264
5 López, J., and Dahab, R.: ‘‘Fast multiplication on elliptic curves
over GF(2^m)’’. In Koç, Ç.K., and Paar, C. (Eds), Proc. of 1st Int.
Workshop on Cryptographic Hardware and Embedded Systems
(CHES), Worcester, MA, USA, August, 1999, LNCS 1717
(Springer-Verlag, 1999), pp. 316–327
6 Joye, M., and Yen, S.-M.: ‘‘The Montgomery powering ladder’’.
In Kaliski, B.S., Jr, Koç, Ç.K., and Paar, C. (Eds), Proc. 4th Int.
Workshop on Cryptographic Hardware and Embedded Systems
(CHES), San Francisco Bay, USA, August, 2002, LNCS 2523
(Springer-Verlag, 2002), pp. 291–302
7 Menezes, A., and Vanstone, S.: ‘‘Elliptic curve cryptosystems and
their implementations’’. J. Cryptol., 1993, 6, pp. 209–224
8 Stam, M.: ‘Speeding up Subgroup Cryptosystems’. (PhD thesis,
Technische Universiteit Eindhoven)
9 Fischer, W., Giraud, C., Knudsen, E.W., and Seifert, J.-P.:
‘‘Parallel scalar multiplication on general elliptic curves over Fp
hedged against non-differential side-channel attacks’’, IACR
ePrint archive: 2002/007 Report 2002/007, 2002
10 Izu, T., and Takagi, T.: ‘‘A fast parallel elliptic curve multi-
plication resistant against side channel attacks’’. In Naccache, D.,
and Paillier, P. (Eds), Proc. Int. Workshop on Practice and
Theory in Public Key Cryptosystems (PKC 2002) LNCS 3027,
Springer-Verlag, 2002 pp. 280–296
11 Blake, I., Seroussi, G., and Smart, N.P.: ‘‘Elliptic curves in
cryptography’’. London Mathematical Society Lecture Note
Series (Cambridge University Press, 1999)
12 Sutikno, S., Effendi, R., and Surya, A.: ‘‘Design and implementation
of arithmetic processor GF(2^155) for elliptic curve
cryptosystems’’. Proc. 1998 IEEE Asia-Pacific Conf. on Circuits
and Systems (APCCAS’98), Chiangmai, Thailand, November
1998, pp. 647–650
13 Orlando, G., and Paar, C.: ‘‘A high-performance reconfigurable
elliptic curve processor for GF(2^m)’’. In Koç, Ç.K., and Paar, C.
(Eds), Proc. 2nd Int. Workshop on Cryptographic Hardware and
Embedded Systems (CHES), Worcester, MA, June 2000, LNCS
1965 (Springer-Verlag, 2000), pp. 41–56
14 Goodman, J., and Chandrakasan, A.P.: ‘‘An energy-efﬁcient
reconﬁgurable public-key cryptography processor’’. IEEE J.
Solid-St. Circ., 2001, 36 (11), pp. 1808–1820
15 Gura, N., Shantz, S.C., Eberle, H., Finchelstein, D., Gupta, S.,
Gupta, V., and Stebila, D.: ‘‘An end-to-end systems approach to
elliptic curve cryptography’’. In Kaliski, B., Jr, Koç, Ç.K., and
Paar, C. (Eds), Proc. 4th Int. Workshop on Cryptographic
Hardware and Embedded Systems (CHES), San Francisco Bay,
USA, August 2002, LNCS 2523
16 Kitsos, P., Theodoridos, G., and Koufopavlou, O.: ‘‘An efficient
reconfigurable multiplier architecture for Galois Field GF(2^m)’’.
Microelectr. J., 2003, 34, pp. 975–980
17 Beth, T., and Gollmann, D.: ‘‘Algorithm engineering for public
key algorithm’’. IEEE J. Sel. Area. Comm., 1989, 7 (4),
pp. 458–465
18 Koç, Ç.K., and Acar, T.: ‘‘Montgomery multiplication in
GF(2^k)’’. Design. Code. Cryptogr., 14, pp. 57–69
19 Wei, S.-W.: ‘‘VLSI architectures for computing exponentiations,
multiplicative inverses, and divisions in GF(2^m)’’. IEEE Trans.
Circuits II, 1997, 44 (10), pp. 847–855
20 Tsai, W.-C., and Wang, S.-J.: ‘‘A systolic architecture for elliptic
curve cryptosystems’’. Proc. ICSP2000, Beijing, China, August
2000, pp. 591–597
21 Song, L., and Parhi, K.K.: ‘‘Low-energy digit-serial/parallel ﬁnite
ﬁeld multipliers’’. J. VLSI Signal Proc., 1998, 19, pp. 149–166
22 Mentens, N., Örs, S.B., Preneel, B., and Vandewalle, J.: ‘‘An
FPGA implementation of an elliptic curve processor over
GF(2^m)’’. Proc. GLSVLSI 2004, Boston, MA, April 2004, 4 pp
23 Wu, H.: ‘‘Montgomery multiplier and squarer for a class of ﬁnite
ﬁelds’’. IEEE Trans. Comput., 2002, 51 (5), pp. 521–529
24 Montgomery, P.: ‘‘Modular multiplication without trial divi-
sion’’. Math. Comput., 1985, 44, pp. 519–521
25 Batina, L., Örs, S.B., Preneel, B., and Vandewalle, J.: ‘‘Hardware
architectures for public key cryptography’’. Integration, 2003, 34,
(1–2), pp. 1–64
26 Chevallier-Mames, B., Ciet, M., and Joye, M.: ‘‘Low-cost
solutions for preventing Simple Side-Channel Analysis: Side-
Channel Atomicity’’. IEEE Trans. Comput., 2004, 53 (6),
pp. 760–768
27 Brier, E., and Joye, M.: ‘‘Weierstrass elliptic curves and side-
channel attacks’’ In Naccache, D., and Paillier, P. (Eds), Proc.
PKC’02, LNCS 2274, (Springer-Verlag, 2002) pp. 335–345
28 Potgieter, M.J.: ‘‘A hardware implementation of the group
operations necessary for implementing an elliptic curve crypto-
system over a characteristic two ﬁnite ﬁeld’’. Final report of
project EPR400. (Technical University Eindhoven, 2002)
Fig. 7 Power consumption trace of a PD as in Algorithm 2 (also includes six multiplications)

29 Batina, L., Mentens, N., Örs, S.B., and Preneel, B.: ‘‘Serial multiplier
architectures over GF(2^n) for elliptic curve cryptosystems’’.
Proc. 12th IEEE MELECON 2004, Dubrovnik, Croatia, May,
2004, 4 pp
30 Kocher, P.: ‘‘Timing attacks on implementations of Difﬁe-
Hellman, RSA, DSS and other systems’’. In Koblitz, N., (Ed.),
Proc. Advances in Cryptology: CRYPTO’96, Santa
Barbara, August, 1996, LNCS 1109 (Springer-Verlag, 1996),
pp. 104–113
31 Kocher, P., Jaffe, J., Jun, B., ‘‘Differential power analysis’’ In
Wiener, M. (Ed.), Proc. Advances in Cryptology: CRYPTO’99,
Santa Barbara, CA, August 1999, LNCS 1666 (Springer-Verlag,
1999) pp. 388–397
32 Quisquater, J.-J., and Samyde, D.: ‘‘Electromagnetic analysis
(EMA): measures and counter-measures for smart cards’’. In
Attali, I., and Jensen, T.P. (Eds), Proc. Smart Card
Programming and Security (E-smart 2001), Cannes, French
Riviera, September, 2001, LNCS 2140 (Springer-Verlag, 2001),
pp. 200–210
33 Gandolfi, K., Mourtel, C., and Olivier, F.: ‘‘Electromagnetic
analysis: concrete results’’. In Koç, Ç.K., Naccache, D., and
Paar, C. (Eds), Proc. 3rd Int. Workshop on Cryptographic
Hardware and Embedded Systems (CHES), Paris, France, May,
2001, LNCS 2162 (Springer-Verlag, 2001), pp. 255–265
34 Trichina, E., and Bellezza, A.: ‘‘Implementation of elliptic curve
cryptography with built-in counter measures against side channel
attacks’’. In Kaliski, B.S., Jr, Koç, Ç.K., and Paar, C. (Eds),
Proc. 4th Int. Workshop on Cryptographic Hardware and
Embedded Systems (CHES), San Francisco Bay, USA, LNCS
2535 (Springer-Verlag, 2002), pp. 98–113
35 IEEE P1363: ‘Standard speciﬁcations for public key
cryptography’, 1999
