Implementation of Binary Edwards Curves for very-constrained Devices by Kocabas, Ünal et al.
Implementation of Binary Edwards Curves for very-constrained devices
Ünal Kocabaş, Junfeng Fan and Ingrid Verbauwhede
Katholieke Universiteit Leuven, ESAT/SCD-COSIC
Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium
{unal.kocabas, junfeng.fan, ingrid.verbauwhede}@esat.kuleuven.be
Abstract—Elliptic Curve Cryptography (ECC) is considered
the best candidate among Public-Key Cryptosystems (PKC)
for ubiquitous security. Recently, ECC based on Binary
Edwards Curves (BEC) has been proposed; it offers several
interesting properties, e.g., completeness and security
against certain exceptional-point attacks. In this paper, we
propose a hardware implementation of BEC for extremely
constrained devices. First, w-coordinates and the Montgomery
powering ladder are used. Second, we give techniques to
reduce the size of the register file, which is the largest
component of the embedded core. Third, we apply gated
clocking to reduce the overall power consumption. The
implementation has a size of 13,427 Gate Equivalents (GE) and
requires 149.5 ms for one point multiplication. To the best
of our knowledge, this is the first hardware implementation of
binary Edwards curves.
Keywords-Public-Key Cryptography, Binary Edwards Curve
I. INTRODUCTION
Public-Key Cryptography (PKC), introduced in 1976 by
Diffie and Hellman [1], is now widely used for key estab-
lishment, digital signature and data encryption. The best-
known and most commonly used public-key cryptosystems
are RSA [2] and Elliptic Curve Cryptography (ECC) [3], [4].
The main benefit of ECC is that it offers security equivalent
to RSA with much smaller parameter sizes. This advantage
results in smaller data-paths, less memory usage and lower
power consumption. ECC is widely considered the best
candidate for constrained devices such as RFID-tags.
Integrating a public-key cryptosystem into a constrained
device such as an RFID-tag is a challenge due to the
limitations in cost, area and power. A passive RFID-tag has
no battery, so the processing power of these devices is
limited. On the other hand, security is required, in particular
to prevent cloning or tracing [5], [6]. It was widely believed
that devices with such constrained resources cannot carry
out strong cryptographic operations such as Elliptic Curve
Scalar Multiplication (ECSM). However, the feasibility of
integrating PKC into such devices has recently been
demonstrated by several implementations [7], [8], [9], [10].
Standard formulas for adding two points, say P and Q, on
a Weierstrass-form elliptic curve fail if P is at infinity,
if Q is at infinity, or if P+Q is at infinity [11]. Attacks
that exploit this feature have been proposed [12]. Binary
Edwards curves provide a different equation to define an
elliptic curve, which no longer has points at infinity [11].
This feature is known as completeness. In addition to
completeness, the new explicit addition formulas of Edwards
curves make point addition slightly faster than known
methods [13], [14].
In this paper, we present a hardware implementation of
binary Edwards curves. We focus on different techniques
to optimize the implementation to make it suitable for
constrained devices. The optimizations mainly aim to reduce
the area and the processing time. By fixing the curve
parameters (d1 and d2) and using a common-Z coordinate
system [15], we reduce the number of registers from 9 to 6.
Moreover, by using a squarer module, we reduce the number
of cycles required for a squaring operation to 1. As a result,
the overall BEC core requires from 11.72 k to 14.53 k GE,
and the total processing time ranges from 548 ms down to
107 ms at 400 kHz for six different configurations, i.e.,
different digit sizes for the modular multiplier.
The rest of the paper is organized as follows. In Section II,
we briefly recap previous ECC implementations and
introduce elliptic and binary Edwards curves. Section III
summarizes several techniques to optimize the algorithm. The
architecture of the BEC processor and the main parts of the
design are described in Section IV. Section V gives synthesis
results and comparisons. Section VI provides a conclusion.
II. BACKGROUND
A. Previous Implementations
In 2006, Kumar and Paar [8] proposed an affine coordinate
ASIC implementation of the ECC processor, using a mod-
ified Montgomery point multiplication method for binary
elliptic curves. An area consumption of 10k to 18k GE is
obtained in a 0.35 μm CMOS process. Moreover, a bit-
serial multiplier, a squarer unit, an adder and a memory,
which consists of 6 random access registers, are used to
compose an ECC-processor. This ECC-processor supports
binary curves from GF(2^113) to GF(2^193). Furthermore,
one scalar multiplication takes 31.8 ms running at 13.56
MHz over GF(2^163). In the same year, Sakiyama et al.
[16] investigated the impact of the digit size on power
consumption and area usage. Generally, bit-serial multipliers
are more power efficient, whereas the digit-serial type is
more energy efficient.
In 2007, Lee et al. [15] presented an optimized archi-
tecture based on the so-called Modular Arithmetic Logic
978-1-4244-6967-3/10/$26.00 c© 2010 IEEE 185 ASAP 2010
Unit. It uses a common Z-coordinate system for representing
EC points to minimize memory requirements. Storage of
intermediate values is usually the main contributor to area
(roughly 66%). Area has a direct impact on the production
cost of an integrated circuit. This work was further improved
in [7].
In 2009, Hein et al. [9] presented an implementation of
modular multiplications using 16-bit multipliers. Bit-serial
multipliers clock on average 2∗163 = 326 registers per clock
cycle. Their digit-level multiplier clocks roughly 4∗16 = 64
registers. So, the proposed approach gives a better power, but
not energy, characteristic. Additionally, the core component
of the datapath is a 16 ∗ 16 multiplier. The data width of
the RAM circuit is adjusted to fit the 16-bit datapath. The
RAM is organized as a 16-bit wide memory with 64 entries.
The circuit features small silicon size (15k GE) and has low
power consumption (8.57 μW). It computes 163-bit ECC
point-multiplications in 296k cycles and has an ISO 18000-
3 RFID interface.
B. Elliptic Curve’s Algorithm
Let K be a finite field. An elliptic curve over K is defined
by the Weierstrass equation:
$$E : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6, \qquad (1)$$
where a1, a2, a3, a4, a6 ∈ K and the discriminant Δ ≠ 0
[17]. The K-rational points P(x, y) on curve E together with
the point at infinity, O(∞,∞), form an abelian group E(K).
When Char(K) = 2, E can be transformed into the
following form without loss of generality.
$$y^2 + xy = x^3 + a'_2 x^2 + a'_6. \qquad (2)$$
Given two points P1(x1, y1) and P2(x2, y2), one can compute
P3(x3, y3) = P1 + P2 as follows:
$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a'_2, \qquad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \qquad (3)$$
where λ = (y1 + y2)/(x1 + x2) if P1 ≠ P2, and
λ = x1 + y1/x1 otherwise.
For each point addition or doubling, one field inversion is
required. Inversion is a costly operation (one inversion
is equivalent to 8-10 multiplications [18] in GF(2^163));
therefore, projective coordinates were proposed. Using
projective coordinates, a point P is represented as
(X, Y, Z) ∈ K^3, and the affine coordinates (x, y) can be
retrieved as (X/Z, Y/Z).
A scalar multiplication on an elliptic curve is the compu-
tation of kP , where k is a large integer and P is a point on
the curve. The following algorithm gives a classical way to
compute kP .
Due to the conditional branch at step 4-5, Algorithm 1 is
considered insecure against Simple Power Analysis (SPA)
and Timing Analysis (TA). The Montgomery powering
ladder [19] is believed to be secure against SPA and TA.
It performs point addition and point doubling in each step
regardless of the value of ki.
Algorithm 1 Add-and-double scalar multiplication.
Input: P = (x, y), k = (k_{l−1} k_{l−2} · · · k_0)_2
Output: Q = (x′, y′) = kP
1: Q ← O
2: for i from l − 1 downto 0 do
3:   Q ← 2Q
4:   if k_i = 1 then
5:     Q ← Q + P
6:   end if
7: end for
8: Return Q
Algorithm 2 Montgomery ladder for scalar multiplication.
Input: P = (x, y), k = (k_{l−1} k_{l−2} · · · k_0)_2
Output: Q = (x′, y′) = kP
1: Q[0] ← P, Q[1] ← 2P
2: for i from l − 2 downto 0 do
3:   Q[1 − k_i] ← Q[0] + Q[1], Q[k_i] ← 2Q[k_i]
4: end for
5: Return Q[0]
López and Dahab [13] observed that the Y coordinate
is unnecessary when using the Montgomery ladder, which
makes the Montgomery ladder a preferred choice in terms
of both efficiency and security.
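A minimal Python sketch of Algorithm 2 may help make the data flow concrete. The group is abstracted behind two callables, `add` and `dbl`, which are hypothetical names chosen for this illustration; the toy group used below (integers modulo 1009 under addition) is only a stand-in for curve points and is not part of the design described in this paper.

```python
def montgomery_ladder(k, P, add, dbl):
    """Compute kP as in Algorithm 2; assumes k >= 1, so the top bit is 1."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    Q = [P, dbl(P)]                      # step 1: Q[0] = P, Q[1] = 2P
    for i in range(k.bit_length() - 2, -1, -1):
        ki = (k >> i) & 1
        Q[1 - ki] = add(Q[0], Q[1])      # one addition for every key bit
        Q[ki] = dbl(Q[ki])               # one doubling for every key bit
    return Q[0]


# Toy stand-in group (an assumption made for illustration only).
N = 1009
toy_add = lambda a, b: (a + b) % N
toy_dbl = lambda a: (2 * a) % N

print(montgomery_ladder(25, 7, toy_add, toy_dbl))  # prints 175 (= 25 * 7)
```

After every iteration Q[1] − Q[0] = P still holds; this invariant is what allows differential addition formulas, which need the difference of the two operands, to be used inside the loop.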
C. Binary Edwards Curves
Let d1, d2 be elements of K with d1 ≠ 0 and no element
t ∈ K satisfying t^2 + t + d2 = 0. Then the binary Edwards
curve with coefficients d1 and d2 is the affine curve [11]:
$$E_{B,d_1,d_2} : d_1(x + y) + d_2(x^2 + y^2) = xy + xy(x + y) + x^2 y^2. \qquad (4)$$
This curve is symmetric in x and y and thus it has the
property that if (x1, y1) is a point on the curve then so
is (y1, x1). The point (0, 0) is the neutral element of the
addition law, while (1, 1) has order 2.
Bernstein et al. [11] proposed to use the so-called
w-coordinates together with the Montgomery ladder. Assume
that P1(x1, y1), P2(x2, y2) and P3(x3, y3) are points on the
curve EB,d1,d2 (we assume d1 = d2 throughout this paper)
and that P3 = P1 + P2. Then P2 + P3 and 2P2 can be
computed using only w-coordinates. Let P5(x5, y5) = P2 + P3,
P4(x4, y4) = 2P2, and wi = xi + yi for i ∈ {1, 2, 3, 4, 5}.
The explicit formulas for point addition (PA) and point
doubling (PD) are given below.
Affine point addition with w-coordinates:
$$A = w_2^2, \quad B = A + w_2, \quad C = w_3^2, \quad D = C + w_3, \quad w_5 = 1 + d_1 \cdot \frac{1}{d_1 + B \cdot D} + w_1. \qquad (5)$$
Affine point doubling with w-coordinates:
$$A = w_2^2, \quad B = A + w_2, \quad w_4 = 1 + d_1 \cdot \frac{1}{d_1 + B^2}. \qquad (6)$$
When using the Montgomery ladder, Q[1] is always Q[0] + P,
so the w-coordinate formulas are applicable. At the end of
the loop, the x-coordinate of 2Q[0] can be retrieved from
w(Q[0]) (the w-coordinate of Q[0]), w(Q[1]) and P(x, y).
In order to recover x and y values from w-coordinates,
we can use the following formula to compute 2(x2, y2) given
x1,y1,w2,w3 [11].
$$x_2^2 + x_2 = \frac{w_3\left(d_1 + w_1 w_2 (1 + w_1 + w_2) + \frac{d_2}{d_1} w_1^2 w_2^2\right) + d_1 (w_1 + w_2) + (y_1^2 + y_1)(w_2^2 + w_2)}{w_1^2 + w_1}$$
This formula produces x2^2 + x2; we then use a half-trace
computation to obtain either x2 or x2 + 1. Algorithm 3
shows the half-trace computation.
Algorithm 3 Half-trace Computation in GF(2^m)
Input: T = x^2 + x, H = 0
1: for i from 0 to ⌊m/2⌋ − 1 do
2:   H ← H + T^(2^(2i+1))
3: end for
return H
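The half-trace loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hardware implementation: field elements are modelled as Python integers in polynomial basis, the reduction polynomial is P(x) = x^163 + x^7 + x^6 + x^3 + 1 (the one used later for the MALU), and the helper names `gf_mul` and `half_trace` are hypothetical.

```python
import random

M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # P(x) from the paper


def gf_mul(a, b):
    """Multiply two elements of GF(2^163), reducing by P(x) bit by bit."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached 163: subtract (XOR) P(x)
            a ^= POLY
    return r


def half_trace(t):
    """Sum of t^(2^(2i+1)) for i = 0 .. floor(M/2) - 1, as in Algorithm 3."""
    h = 0
    p = t
    for _ in range(M // 2):
        p = gf_mul(p, p)    # p = t^(2^(2i+1)), the term that is accumulated
        h ^= p
        p = gf_mul(p, p)    # p = t^(2^(2i+2)), skipped
    return h


# Usage: recover x (up to adding 1) from T = x^2 + x.
x = random.Random(1).getrandbits(M)
T = gf_mul(x, x) ^ x
H = half_trace(T)
print(H in (x, x ^ 1))
```

For an input of the form T = x^2 + x the result H satisfies H^2 + H = T, so H is either x or x + 1, which matches the statement above that the computation reveals x2 or x2 + 1.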
III. OPTIMIZATION OF THE ALGORITHM
For an ECC processor integrated in a very-constrained
device such as passive RFID-tags, the area and power
consumption budget is very limited. In this case, reducing
the area for temporary storage is crucial since it normally
takes more than 50% of the overall area (see [7]). Therefore,
reducing the number of intermediate results is important.
Although projective coordinates save one inversion in each
iteration, they require extra registers to store the Z
coordinates. The w-coordinate formulas with the Montgomery
ladder involve three points (P1, P2, P3). Hence, at least
seven registers (d1, W1, Z1, W2, Z2, W3, Z3) are required,
not including two registers for intermediate values. Using
mixed w-coordinates, which represent P1 with only w1, saves
one register. Thus, we choose mixed w-coordinates for the
implementation.
A. Original Mixed w-coordinate [20]
Assume that w1 is given as a field element, w2, w3
are given as fractions W2/Z2, W3/Z3 and w4, w5 are
outputs of the formulas. The explicit formulas of addition
and doubling are presented in register form in Table I. The
register allocation is arranged using the linear scan
method of [20].
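As a worked check (not part of the original derivation, but following directly from formulas (5) and (6) in characteristic 2), substituting w2 = W2/Z2 and w3 = W3/Z3 shows which numerators and denominators the register sequences accumulate. The symbol N is introduced here only as shorthand.

$$\begin{aligned}
N &= W_2 (W_2 + Z_2)\, W_3 (W_3 + Z_3), \qquad B \cdot D = \frac{N}{(Z_2 Z_3)^2}, \\
w_5 &= w_1 + 1 + \frac{d_1}{d_1 + B D} = w_1 + \frac{B D}{d_1 + B D} = \frac{w_1 Z_5 + N}{Z_5}, \qquad Z_5 = d_1 (Z_2 Z_3)^2 + N, \\
w_4 &= 1 + \frac{d_1}{d_1 + B^2} = \frac{B^2}{d_1 + B^2} = \frac{W_2^2 (W_2 + Z_2)^2}{d_1 Z_2^4 + W_2^2 (W_2 + Z_2)^2}.
\end{aligned}$$

Hence W5 = w1·Z5 + N with Z5 = d1·(Z2·Z3)^2 + N, and W4 = (W2(W2 + Z2))^2 with Z4 = d1·Z2^4 + W4, so no field inversion is needed inside the ladder loop.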
The addition formula uses 5M + 1S + 1D, where M, S and D
denote field multiplication, field squaring and
multiplication by the curve parameter d1, respectively. The
doubling formula uses 1M + 3S + 1D. These formulas can share
the computation of T1 (steps 3-4), which reduces the total
cost of differential addition and doubling to 5M + 4S + 2D.
To evaluate these formulas, seven registers are required
to store (w1, W2, Z2, W3, Z3, T1, T2). These formulas can be
transformed to a common Z-coordinate system at the cost
of three extra field multiplications.
Table I
MIXED w-COORDINATE ALGORITHMS
Addition Algorithm        Doubling Algorithm
 1. T1 ← W3 + Z3          12. W2 ← T1^2
 2. W3 ← W3·T1            13. Z2 ← Z2^2
 3. T1 ← W2 + Z2          14. Z2 ← Z2^2
 4. T1 ← W2·T1            15. Z2 ← Z2·d1
 5. W2 ← T1·W3            16. Z2 ← Z2 + W2
 6. Z3 ← Z2·Z3
 7. Z3 ← Z3^2
 8. Z3 ← d1·Z3
 9. Z3 ← Z3 + W2
10. T2 ← Z3·w1
11. W3 ← T2 + W2
Table II
PROPOSED PA AND PD ALGORITHMS
Addition Algorithm        Doubling Algorithm        Extra Steps
 1. T1 ← W3 + Z           11. W2 ← T1^2             13. W2 ← W2·T2
 2. W3 ← W3·T1            12. T1 ← Z + W2           14. W3 ← W3·T1
 3. T1 ← W2 + Z                                     15. Z ← T1·T2
 4. T1 ← W2·T1
 5. W2 ← T1·W3
 6. Z ← Z^4
 7. Z ← Z·d1
 8. T2 ← W2 + Z
 9. W3 ← T2·w1
10. W3 ← W2 + W3
B. Mixed w-coordinate Using Common Z
The common Z-coordinate was introduced by Lee
et al. [15] to save registers. After every iteration of Mont-
gomery ladder, we ensure that Z2 = Z3. This can be done
with three extra field multiplications as given below.
W2 ← W2·Z3,  W3 ← W3·Z2,  Z ← Z2·Z3.  (7)
In the first step of the Montgomery ladder (Q[1] = 2P ),
this condition is satisfied with Z = Z3 and W2 = W2 · Z3,
since the algorithm starts from Z2 = Z3 = 1. We apply this
method to the mixed w-coordinates of BEC. The formulas
of differential addition and doubling shown in Table I are
transformed to the sequences in Table II.
In this situation, the formula uses 7M + 3S + 1D, which
is less than (5M + 4S + 2D) + 3M = 8M + 4S + 2D. Since
Z2 = Z3 after each iteration, step 6 of Table I becomes a
squaring, and steps 13-15 are omitted since they are the
same as steps 6-8. As a result, we trade (1M + 2S + 1D) for
3M + 1S, and the complexity of one iteration is
7M + 3S + 1D. Due to the use of the common Z-coordinate,
the proposed sequence requires only 6 registers
(w1, W2, W3, Z, T1, T2), or 7 registers if d1 is not fixed.
The mapping of the variables of Table II onto six registers
is given in Table IV in the following section.
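The register sequence of Table II can be checked against the affine formulas (5) and (6) with a short Python model. This is a sketch under stated assumptions: field elements are Python integers in GF(2^163) with P(x) = x^163 + x^7 + x^6 + x^3 + 1, the function names are hypothetical, and the inversion is used only for checking. The inputs may be arbitrary field elements rather than genuine curve points, because the equality being checked is a purely algebraic identity.

```python
import random

M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # P(x) from the paper


def gf_mul(a, b):
    """Multiply two elements of GF(2^163), reducing by P(x) bit by bit."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def gf_inv(a):
    """Inverse as a^(2^M - 2), the product of a^(2^i) for i = 1 .. M-1."""
    r, s = 1, a
    for _ in range(M - 1):
        s = gf_mul(s, s)
        r = gf_mul(r, s)
    return r


def ladder_step_common_z(w1, W2, W3, Z, d1):
    """Steps 1-15 of Table II; returns the new (W2, W3, Z)."""
    T1 = W3 ^ Z                 # 1
    W3 = gf_mul(W3, T1)         # 2
    T1 = W2 ^ Z                 # 3
    T1 = gf_mul(W2, T1)         # 4
    W2 = gf_mul(T1, W3)         # 5
    Z = gf_mul(Z, Z)            # 6 (first squaring)
    Z = gf_mul(Z, Z)            # 6 (second squaring, Z^4)
    Z = gf_mul(Z, d1)           # 7
    T2 = W2 ^ Z                 # 8  (Z5)
    W3 = gf_mul(T2, w1)         # 9
    W3 = W2 ^ W3                # 10 (W5)
    W2 = gf_mul(T1, T1)         # 11 (W4)
    T1 = Z ^ W2                 # 12 (Z4)
    W2 = gf_mul(W2, T2)         # 13
    W3 = gf_mul(W3, T1)         # 14
    Z = gf_mul(T1, T2)          # 15
    return W2, W3, Z


def affine_reference(w1, w2, w3, d1):
    """Evaluate formulas (5) and (6) directly; returns (w4, w5)."""
    B = gf_mul(w2, w2) ^ w2
    D = gf_mul(w3, w3) ^ w3
    w5 = 1 ^ gf_mul(d1, gf_inv(d1 ^ gf_mul(B, D))) ^ w1
    w4 = 1 ^ gf_mul(d1, gf_inv(d1 ^ gf_mul(B, B)))
    return w4, w5
```

Counting the `gf_mul` calls in `ladder_step_common_z` gives seven general multiplications (steps 2, 4, 5, 9, 13, 14, 15), three squarings (two in step 6, one in step 11) and one multiplication by d1 (step 7), in line with the 7M + 3S + 1D stated above.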
C. Complexity
We compare the complexity of point addition and dou-
bling using different coordinates and different elliptic curves.
Table III shows the number of field operations for different
combinations. Here we assume the Montgomery ladder is
in use (as it is in [7], [9]). Since d1 can be chosen small
[11], D can be very cheap. In this case, the cost of one
Montgomery step on a BEC is lower than on ordinary curves,
or almost the same when the common-Z trick is used.
Besides the coordinate system, the selection of curve
parameters also has an impact on the complexity. For example,
by selecting a′6 = 1 in (2), one multiplication is omitted [7].
For extremely constrained devices, fixing one or two curve
parameters helps to reduce the area and improve the
performance. For example, the implementation in [9] chose to
implement the B-163 curve [21]. In this paper we choose
d1 = d2, and d1 is also fixed.
Table III
COMPLEXITY OF ONE STEP OF MONTGOMERY LADDER
Curve         Coordinates                PA+PD                Common Z
Weierstrass   Affine                     2I + 4M + 2S         -
              Mixed (Jacobian+Affine)    15M + 8S + M2        -
              López-Dahab                6M + 5S              7M + 4S
Edwards       Affine                     2I + 1M + 3S + 2D    -
(d1 = d2)     Projective w-Coord.        7M + 4S + 2D         -
              Mixed w-Coord.             5M + 4S + 2D         7M + 3S + 1D
IV. IMPLEMENTATION OF A BINARY EDWARDS CURVE
As RFID systems become ubiquitous and are used in many
applications, their specifications must be standardized.
For example, ISO 18000-3 requires a 13.56 MHz clock
frequency and a power consumption of less than 15 μW at
1.5 V to guarantee a 1 m operating range [22]. Moreover, [7]
states that the clock frequency is chosen to finish protocols
within 250 ms and is a factor of 13.56 MHz. Taking these
limitations into account, the design goals for an RFID
system are low area, low power consumption and a short
processing time. The number of registers is the most
important issue, since the registers occupy more than
50% of the total area. Besides the common-Z trick, we also
optimize the register file for limited access. Clock gating is
applied to reduce the power consumption.
A. Modular Arithmetic Logic Unit
In order to perform addition, multiplication and squaring
operations in Table III, we design a compact Modular
Arithmetic Logic Unit (MALU) [16] as illustrated in Figure
1. In the MALU architecture, operations are performed over
finite fields as shown through Equation (8).
$$\begin{aligned}
A(x) &= B(x) \cdot C(x) \bmod P(x) \\
A(x) &= A(x) + C(x) \\
C(x) &= C(x)^2 \bmod P(x)
\end{aligned} \qquad (8)$$
where $A(x) = \sum a_i x^i$, $B(x) = \sum b_i x^i$, $C(x) = \sum c_i x^i$
and $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$.
The cost of a field multiplication is ⌈163/d⌉ clock cycles,
where d is the digit size; an addition takes one clock cycle.
In every cell, multiplication and addition share the same XOR
array. Therefore, the MALU can easily be scaled to different
digit sizes by connecting cells in series. Furthermore, we
design a squarer that performs a squaring operation in one
clock cycle, independent of the digit size. The squaring
module is added to the MALU with one extra control bit.
The MALU block does not contain any internal registers.
To implement Equation (8), the register file keeps the
intermediate value (shown in Figure 1 as RetA) and the MALU
performs the calculations. When the MALU performs a
multiplication, each digit of the multiplicand must be
provided in turn. That is, in every cycle RegB is shifted to
the left by d bits, and the most significant digit wraps
around to the least significant position as a circular shift.
Moreover, the intermediate result in RegA is reused as an
input ⌈163/d⌉ times. The addition operation is performed
with the same hardware, with some additional tricks, using
the values of RegA and RegC as operands; RegA then keeps the
result. The output of the MALU (RetA) is also used to return
the result of the squarer.
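The digit-serial multiplication can be modelled in software as follows. This is a behavioural sketch, not the MALU netlist: it assumes that the digits of the multiplicand are consumed most significant digit first (which the left shift of RegB suggests), and the function names are hypothetical. One loop iteration corresponds to one clock cycle.

```python
M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # P(x) from the paper


def gf_mul(a, b):
    """Bit-serial reference multiplication in GF(2^163)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached 163: reduce by P(x)
            a ^= POLY
    return r


def gf_mul_digit_serial(b, c, d):
    """Return (B*C mod P, cycles), consuming d bits of B per cycle."""
    cycles = -(-M // d)                 # ceil(163 / d)
    mask = (1 << d) - 1
    a = 0                               # accumulator, plays the role of RegA
    for j in range(cycles - 1, -1, -1):
        digit = (b >> (j * d)) & mask   # current digit of the multiplicand
        for _ in range(d):              # A <- A * x^d mod P
            a <<= 1
            if a >> M:
                a ^= POLY
        a ^= gf_mul(c, digit)           # A <- A + digit * C mod P
    return a, cycles


product, cycles = gf_mul_digit_serial(0b1011, 0b110, 4)
print(cycles)  # 41 cycles for d = 4
```

In this model a larger digit size d shortens the loop at the price of a wider partial-product step per cycle, which mirrors the area versus cycle-count trade-off explored with the six digit sizes.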
B. Register File Architecture
The register file architecture is shown in Figure 2. This
architecture was originally proposed in [7] and has been
shown to be an effective way to reduce the area of register
files. We modify it to fit binary Edwards curves. Six
registers form one large circular shift register file, which
has a lower complexity than a random-access register file.
Each register is independently controlled for efficient
management. RegA, RegB and RegC are used by the MALU, and
RegD is used as the input register. RegD has an 8-bit I/O
port through which data can be loaded and stored. Loading
RegD with a new value can occur in parallel with
multiplications. Furthermore, RegB is a circular shift
register that can be shifted to the left by d bits (the digit
size of the data path). Two extra connections, shown by
dashed lines in Figure 2, are added to avoid extra shifting
in the circular register file. Except for RegB and RegD, each
register can only be updated from the preceding register.
Figure 1. Modular Arithmetic Logic Unit
Figure 2. Register File Architecture
Clock gating is applied to the circular shift register to
reduce area and power consumption. An enable signal is
created for each register to control whether it is updated
with a new value or keeps its previous value. As a result,
the number of multiplexers needed for the register file is
reduced. RegE and RegF no longer require multiplexers.
C. Architecture of BEC Processor
The main architecture of the binary Edwards curve
processor is shown in Figure 3. It consists of a processor, a
small interface, a RAM, a ROM and a front-end module. The
ROM stores the coordinates of the initial point (x1, y1) and
a key value (3 × 21 bytes). Internal values and result points
(W2, W3, Z) of the differential addition and doubling are
kept in the register file. At the end of the EC scalar
multiplication, the final point is recovered using
Algorithm 3 and sent to the RAM. The interface provides the
connection to and from memory according to the address bits.
The processor part of the design consists of a control
block, a register file and a modular arithmetic logic unit
(MALU). The control block contains the finite state machine
that manages the modules according to the addition, squaring
and multiplication operations.
D. Low Power ASIC Design with Clock Gating
Clock gating is one of the power-saving techniques used
on many synchronous circuits. It creates additional logic to
generate clock trees for each register individually. Thus,
flip-flops do not change state in disabled parts of the
design. Their switching power consumption goes to zero,
and only leakage currents are incurred. After clock gating is
applied to the register file, the area and power consumptions
are reduced by 8% and 20%, respectively, compared to
the normal clocked version due to the removal of the
multiplexers. The functionality of the clock gated version
is verified with Modelsim SE. Moreover, the clock gated
version is implemented on a Spartan-3E FPGA board with
some transformations to verify the functionality.
Table IV
MAPPING OF THE POINT MULTIPLICATION (SEE TABLE II FOR THE
OPERATION OF EACH STEP)
Operation   RegA     RegB     RegC     RegD     RegE     RegF
            Z        W2       W3       W2       w1       Z
            Z        W2       W2       W3       w1       Z
ADD (3)     W2+Z     W2       W2       W3       w1       Z
            W2+Z     W2+Z     W2       W3       w1       Z
MULT (4)    C        W2+Z     W2       W3       w1       Z
            Z        W3       W2       C        w1       Z
            Z        W3       W3       C        w1       Z
ADD (1)     W3+Z     W3       W3       C        w1       Z
            W3+Z     W3+Z     W3       C        w1       Z
MULT (2)    D        W3+Z     W3       C        w1       Z
            Z        D        W3       C        w1       Z
            Z        Z        D        C        w1       w1
            Z        Z        Z        D        C        w1
SQR (6)     Z        Z        Z2       D        C        w1
SQR (6)     Z        d1       Z4       D        C        w1
MULT (7)    X        d1       Z4       D        C        w1
            w1       X        Z4       D        C        C
            C        D        X        w1       C        C
            C        C        D        X        w1       C
MULT (5)    V        C        D        X        X        w1
            w1       V        C        X        X        X
SQR (11)    X        V        W4       w1       X        X
            X        V        V        W4       w1       X
ADD (8)     Z4       V        V        W4       w1       X
            X        V        V        Z4       W4       w1
ADD (12)    Z5       V        V        Z4       W4       w1
            w1       Z5       V        Z4       W4       w1
            w1       w1       Z5       V        Z4       W4
MULT (9)    w1∗Z5    w1       Z5       V        Z4       W4
            w1∗Z5    V        w1       Z5       Z4       W4
            w1∗Z5    Z5       V        w1       Z4       W4
ADD (10)    W5       Z5       Z5       w1       Z4       W4
            W4       Z5       Z5       W5       w1       Z4
            W4       W4       Z5       W5       w1       Z4
MULT (13)   W2New    W4       Z5       W5       w1       Z4
            Z4       W5       Z5       W2New    w1       Z4
            Z4       Z4       W5       Z5       W2New    w1
MULT (14)   W3New    Z4       W5       Z5       W2New    w1
            W3New    Z5       Z4       W3New    W2New    w1
MULT (15)   Z        Z5       Z4       W3New    W2New    w1
            w1       W3       Z4       Z        W2       W2
            W2       W3       W3       w1       Z        W2
            W2       W3       W3       W2       w1       Z
            Z        W2       W3       W2       w1       Z
E. Verifying Design on an FPGA
Global free-running clocks in an FPGA are distributed us-
ing dedicated interconnects specifically designed to connect
and supply clock inputs to various resources in the FPGA.
These clock networks are optimally designed to have low
skew, low power, and improved jitter tolerance. Gating a
global clock signal with a logic circuit forces the gated
clock signal to traverse the much slower routing network,
thus introducing significant skew [23]. This can lead to
hold errors. Therefore, we manually transform the
clock-gated registers while maintaining identical
functionality. This is achieved by re-declaring every gated
flip-flop with two nodes, a free-running clock node and an
enable node. That is, flip-flops that use gated clocks
are changed to enabled flip-flops, which are clocked by the
free-running clock and enabled by the gate signal. In this
way, the functionality of the low-power ASIC design with
clock gating is successfully tested on an FPGA.
Figure 3. BEC Processor Architecture
V. SYNTHESIS RESULT AND DISCUSSIONS
The implementation of binary Edwards curves is written
in the GEZEL [24] hardware description language and is
optimized for fast computation and low area. It is
synthesized for an ASIC using a low-leakage 0.13 μm UMC
library and the Synopsys Design Compiler tools [25].
Table V shows the area, power and delay estimates of the
proposed design for different digit sizes.
A. Area, Power and Energy Estimations
In order to find the best trade-off on digit size, we
evaluated six different digit sizes. For example, the im-
plementation (d=4) uses 13,427 GE and finishes one EC
scalar multiplication in 59,800 clock cycles. The power
consumption is estimated by using both Design Vision and
ModelSim SE, which provides average power consumptions
by including switching activity of gates. The total dynamic
power is around 12 μW at 400 kHz. Since one point multi-
plication takes 149.5 ms, the energy consumption is 1.79 μJ.
According to the synthesis results, the power consumption is
low enough to implement the proposed design in a passive
RFID-tag.
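The relation between cycles, time, power and energy can be reproduced with a few lines of Python. The figures below are the ones reported in Table V; the function name `derive` is a hypothetical helper, and the script only re-derives time = cycles / f and energy = power × time, without modelling where the cycle counts come from.

```python
FREQ_HZ = 400_000  # clock frequency used for Table V

# (digit size, power [uW], energy [uJ], time [ms], cycles) as reported in Table V
TABLE_V = [
    (1, 7.27, 3.98, 547.87, 219_148),
    (2, 9.1, 2.58, 283.57, 113_428),
    (3, 10.19, 1.99, 195.28, 78_112),
    (4, 11.997, 1.79, 149.5, 59_800),
    (5, 12.69, 1.57, 123.34, 49_336),
    (6, 13.8, 1.48, 106.99, 42_796),
]


def derive(power_uw, cycles, freq_hz=FREQ_HZ):
    """Return (time in ms, energy in uJ) for one scalar multiplication."""
    time_ms = cycles / freq_hz * 1e3
    energy_uj = power_uw * time_ms * 1e-3   # uW * ms = nJ, so divide by 1000
    return time_ms, energy_uj


for d, power, energy, time_ms, cycles in TABLE_V:
    t_calc, e_calc = derive(power, cycles)
    print(f"d={d}: {t_calc:.2f} ms, {e_calc:.2f} uJ "
          f"(reported {time_ms} ms, {energy} uJ)")
```

For d = 4, for instance, 59,800 cycles at 400 kHz give 149.5 ms, and 11.997 μW over that time gives about 1.79 μJ.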
B. Comparison with Previous Implementations
Table VI compares this work with related work in
GF(2^163). Our strategy is similar to that of [7], which
uses a digit-serial multiplier and a cyclic register
structure, but no squarer module. Although the area of [7]
is smaller than that of our design, its power consumption is
still larger at 400 kHz. Moreover, the energy cost of one EC
scalar multiplication is halved in our design.
Table V
IMPLEMENTATION RESULTS AT 400 KHZ
        Area         Power & Energy Est.       Delay
Digit   Total        T. Dynamic    Energy‡     Time       Cycles
Size    Area [GE]    Power [μW]    [μJ]        [msec]
d=1     11,720        7.27         3.98        547.87     219,148
d=2     12,348        9.1          2.58        283.57     113,428
d=3     12,862       10.19         1.99        195.28      78,112
d=4     13,427       11.997        1.79        149.5       59,800
d=5     13,970       12.69         1.57        123.34      49,336
d=6     14,530       13.8          1.48        106.99      42,796
‡ Energy for one EC scalar multiplication
Table VI
COMPARISON TO RELATED WORK
Ref.        Digit   Area     Cycles    Freq.    Perf.     Power    Energy‡
            Size    [GE]               [kHz]    (msec)    [μW]     [μJ]
BEC         1       11,720   219,148   400      547.87     7.27    3.98
This work   2       12,348   113,428   400      283.57     9.1     2.58
[0.13μm]    3       12,862    78,112   400      195.28    10.19    1.99
            4       13,427    59,800   400      149.5     11.99    1.79
            5       13,970    49,336   400      123.34    12.69    1.57
            6       14,530    42,796   400      106.99    13.8     1.48
ECC† [7]    1       10,106   275,816   1,130    244.08    36.63    8.94
[0.13μm]    2       11,383   144,842   590      245.49    21.55    5.29
            3       12,236   101,183   411      246.19    15.75    3.88
            4       12,863    78,544   323      243.17    12.08    2.94
ECC [8]     N/A     15,094   430,654   13,560   31.8      N/A      N/A
[0.35μm]            16,207   376,864            27.9
ECC [9]     N/A     11,904   296,000   106      2,792     N/A      N/A
[0.18μm]            13,250                                8.57     31.3
‡ Energy for one EC scalar multiplication
† Extra memory is required
The result of [8] shows 15,094 GE which is larger than
our proposal and requires almost seven times as many cycles.
[8] uses affine coordinates rather than projective or mixed
coordinates, so each iteration takes considerably more time
than our implementation due to inversions. The design in [9]
has the lowest power consumption among the related works,
but is much slower than our design.
The synthesis results in Table VI do not include ROM or
RAM; the area and power consumption of the memories must
therefore be added. Table VII compares the number of memory
accesses and the memory requirements for one EC scalar
multiplication. Every design must have at least three ROM
blocks of 21 bytes each to store the base point (x1, y1) and
a key value. Moreover, the number of RAM blocks depends on
the representation of the final point. By storing the base
point in shared memory, one 163-bit register can be saved
(Type-1 in [7]). However, the ECC core then needs to load
21 bytes in each iteration, which dramatically increases the
number of memory accesses. Our design reads the EC base
point (x1, y1) only once, at the beginning, to calculate w1.
For one EC scalar multiplication, the RAM and ROM blocks are
accessed 63 and 84 times, respectively. The memory
requirements and the number of memory accesses are not
reported in [8] and [9]; we therefore list the minimum
requirements of ordinary ECC curves in Table VII. Based on
these comparisons, this work requires only a limited number
of memory units and accesses.
Table VII
REQUIREMENTS OF THE MEMORY ACCESS
          ECC Core               Shared Memory              Memory Access
Ref.      Register File  RAM     ROM           RAM          Read ROM  Read RAM  Write RAM
BEC       6*163          -       3*21-byte     21-byte      84        21        42
ECC [7]   6*163 †        -       4*21-byte     21-byte      4,330     -         21
          7*163          -       4*21-byte     21-byte      928       -         21
ECC [8]   7*163          -       3*21-byte ‡   21-byte ‡    63        -         42
          8*163          -       3*21-byte ‡   21-byte ‡    63        -         42
ECC [9]   -              64*16   3*21-byte ‡   21-byte ‡    63        -         42
          -              7*163   3*21-byte ‡   21-byte ‡    63        -         42
‡ Minimum memory requirements for ordinary ECC curves
† Type-1 in the original paper
VI. CONCLUSION
In this paper, the first hardware implementation of a
binary Edwards curve is presented. We propose a compact
architecture for a binary Edwards curve. We suggest the
use of mixed w-coordinates with a common Z-coordinate,
which reduces the size of the register file. According to the
synthesis results, the implementation of a binary Edwards
curve can provide the same level of efficiency as previous
ECC designs.
ACKNOWLEDGEMENTS
This work was supported in part by K.U. Leuven-
BOF (OT/06/40), by the IAP Programme P6/26 BCRYPT
of the Belgian State (Belgian Science Policy), by FWO
project G.0300.07, and in part by the European Commis-
sion through the ICT programme under contract ICT-2007-
216676 ECRYPT II.
REFERENCES
[1] W. Diffie and M. E. Hellman, “New directions in cryptogra-
phy,” IEEE Transactions on Information Theory, vol. IT-22,
no. 6, pp. 644–654, 1976.
[2] R. L. Rivest, A. Shamir, and L. M. Adleman, “A Method for
Obtaining Digital Signatures and Public-Key Cryptosystems,”
Commun. ACM, vol. 21, no. 2, pp. 120–126, 1978.
[3] V. S. Miller, “Use of elliptic curves in cryptography,” in
Lecture notes in computer sciences; 218 on Advances in
cryptology—CRYPTO 85. New York, NY, USA: Springer-
Verlag New York, Inc., 1986, pp. 417–426.
[4] N. Koblitz, “Elliptic Curve Cryptosystem,” J. Cryptology
Math. Comp., vol. 48, pp. 203–209, 1987.
[5] S. Vaudenay, “On Privacy Models for RFID,” in ASIACRYPT,
2007, pp. 68–87.
[6] P. Tuyls and L. Batina, “RFID-Tags for Anti-counterfeiting,”
in CT-RSA, 2006, pp. 115–131.
[7] Y. K. Lee, K. Sakiyama, L. Batina, and I. Verbauwhede,
“Elliptic-Curve-Based Security Processor for RFID,” IEEE
Transactions on Computers, vol. 57, no. 11, pp. 1514–1527,
2008.
[8] S. Kumar and C. Paar, “Are standards compliant elliptic
curve cryptosystems feasible on RFID?” in Proceedings of
Workshop on RFID Security, 2006.
[9] D. Hein, J. Wolkerstorfer, and N. Felber, “ECC is Ready for
RFID - A Proof in Silicon,” in Selected Areas in Cryptog-
raphy, ser. Lecture Notes in Computer Science, R. Avanzi,
L. Keliher, and F. Sica, Eds. Springer-Verlag, 2009, in press.
[10] A. C. Atici, L. Batina, J. Fan, I. Verbauwhede, and S. Berna
Ors Yalcin, “Low-cost implementations of NTRU for per-
vasive security,” in ASAP ’08: Proceedings of the 2008
International Conference on Application-Specific Systems,
Architectures and Processors, 2008, pp. 79–84.
[11] D. J. Bernstein, T. Lange, and R. R. Farashahi, “Binary
Edwards Curves,” in CHES ’08: Proceedings of International
Workshop on Cryptographic Hardware and Embedded Sys-
tems, vol. 5154. Springer-Verlag, 2008, pp. 244–265.
[12] T. Izu and T. Takagi, “Exceptional procedure attack on elliptic
curve cryptosystems,” PKC 2003, LNCS, vol. 2567, pp. 224–
239, 2003.
[13] J. López and R. Dahab, “Fast Multiplication on Elliptic
Curves over GF(2^m) without Precomputation,” in CHES ’99:
Proceedings of the First International Workshop on Crypto-
graphic Hardware and Embedded Systems. London, UK:
Springer-Verlag, 1999, pp. 316–327.
[14] M. Stam, “On Montgomery-Like Representations for Elliptic
Curves over GF(2^k),” in Public Key Cryptography, 2003, pp.
240–253.
[15] Y. K. Lee and I. Verbauwhede, “A Compact Architecture for
Montgomery Elliptic Curve Scalar Multiplication Processor,”
in WISA, 2007, pp. 115–127.
[16] K. Sakiyama, L. Batina, N. Mentens, B. Preneel, and I. Ver-
bauwhede, “Small-footprint ALU for Public-Key Processors
for Pervasive Security,” in Proc. Workshop RFID Security
(RFIDSec ’06), 2006.
[17] D. Hankerson, A. Menezes, and S. Vanstone, Guide to elliptic
curve cryptography. New York: Springer, 2004.
[18] T. Itoh and S. Tsujii, “Effective recursive algorithm for
computing multiplicative inverses in GF(2^m),” vol. 24(6).
Electronic Letters, 1988, pp. 334–335.
[19] M. Joye and S.-M. Yen, “The Montgomery Powering Ladder,”
in CHES, 2002, pp. 291–302.
[20] M. Poletto and V. Sarkar, “Linear scan register allocation,”
ACM Trans. Program. Lang. Syst., vol. 21, no. 5, pp. 895–
913, 1999.
[21] National Institute for Standards and Technology, “Digital
Signature Standard (DSS),” Technical report, January 2000.
[22] ISO/IEC 18000-3:2004, “Information Technology - Radio
Frequency Identification (RFID) for Item Management - Part
3: Parameters for air interface communications at 13.56
MHz.”
[23] P. H. Wang, J. D. Collins, C. T. Weaver, B. Kuttanna,
S. Salamian, G. N. Chinya, E. Schuchman, O. Schilling,
T. Doil, S. Steibl, and H. Wang, “Intel® Atom™ processor
core made FPGA-synthesizable.” in FPGA, 2009, pp. 209–
218.
[24] P. Schaumont and I. Verbauwhede, “Interactive Cosimulation
with Partial Evaluation,” in DATE, 2004, pp. 642–647.
[25] Synopsys Inc., “Design Compiler Tutorial Using Design
Vision,” 2006.
