Hardware implementation of an elliptic curve processor over GF(p) by Ors, S.B. et al.
PDF hosted at the Radboud Repository of the Radboud University
Nijmegen
 
 
 
 
The following full text is a preprint version which may differ from the publisher's version.
 
 
For additional information about this publication click this link.
http://repository.ubn.ru.nl/handle/2066/127482
 
 
 
Please be advised that this information was generated on 2017-03-09 and may be subject to
change.
Hardware Implementation of an Elliptic Curve Processor
over GF (p)
Sıddıka Berna Örs¹, Lejla Batina¹,², Bart Preneel¹, Joos Vandewalle¹
¹Katholieke Universiteit Leuven, ESAT/SCD-COSIC
Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium
{sbors, Lejla.Batina, Bart.Preneel, Joos.Vandewalle}@esat.kuleuven.ac.be
²SafeNet BV
Boxtelseweg 26a, 5261 NE Vught, The Netherlands
Abstract
This paper describes a hardware implementation of an arithmetic processor which is efficient
for bit-lengths suitable for both commonly used types of Public Key Cryptography (PKC), i.e.,
Elliptic Curve (EC) and RSA cryptosystems. Montgomery modular multiplication in a systolic
array architecture is used for modular multiplication. The processor consists of special
operational blocks for Montgomery modular multiplication, modular addition/subtraction, EC
point doubling/addition, modular multiplicative inversion, EC point multiplication, projective
to affine coordinates conversion and Montgomery to normal representation conversion.
Keywords: Elliptic Curve Cryptosystems, Modular Operations, FPGA
1 Introduction
Elliptic Curve Cryptography (ECC) was proposed independently by Miller [13] and
Koblitz [7] in the 1980s. Since then a considerable amount of research has been performed
on secure and efficient ECC implementations. The benefits of ECC, when compared with
classical cryptosystems such as RSA [19], include higher speed, lower power consumption
and smaller certificates, which are especially useful for wireless applications.
The performance of an elliptic curve cryptosystem, like that of other public key cryptosystems,
is mostly determined by how efficiently finite field arithmetic is implemented. In this work
a hardware architecture of a processor for ECC over the finite field GF(p) is presented. The
most critical operation for latency is modular multiplication. We use our systolic array
multiplier, proposed in [16] and based on Montgomery's Modular Multiplication (MMM)
algorithm [14]; this multiplier has been shown to be very efficient for modular exponentiation,
the basic operation of RSA cryptosystems [1].
The processor consists of special operational blocks for MMM, modular addition/subtraction
(MAS), EC point doubling/addition, modular multiplicative inversion, EC point multiplication,
projective to affine coordinates conversion and Montgomery to normal representation
conversion. Hence it can be programmed by the host to execute any of these operations in
any order. The proposed processor can be used not only for ECC, but also for any system in
which modular arithmetic operations are essential, such as the RSA cryptosystem.
The basic operations are MMM and MAS. Each of the other blocks includes a finite state
machine (FSM) which controls the execution of these operations in the right order. The
critical path of the processor therefore depends only on the critical paths of the circuits
for MMM and MAS. The architecture of these blocks is designed to ensure a short critical
path, allowing high clock frequencies that are independent of the bit-length of the ECC
parameters. For simplicity, all blocks were designed separately with their own FSMs. This
allows for independent optimization and testing of the building blocks.
The remainder of this paper is organized as follows. In Section 2 we discuss the related
work. Section 3 provides the mathematical background for Montgomery Modular
Multiplication (MMM) and ECC over GF(p). Section 4 describes the hardware implementation;
some details are omitted due to space limitations. Section 5 concludes the paper.
2 Previous Work
To the best of our knowledge, the first documented ECC processor over fields GF(p)
was proposed by Orlando and Paar [15]. Their Elliptic Curve Processor (ECP) is scalable
in terms of area and speed and especially suited for FPGAs. The authors estimate that
it would take 3 ms to compute one 192-bit point multiplication. However, this timing
was estimated by assuming 100% throughput from the multiplier; the expected latency
was not considered. Their multiplier is also based on the MMM algorithm, but it is
a generalized version with quotient pipelining introduced by Orup in [17]. We use the basic
MMM algorithm, from which we only exclude the final modular reduction step as a result of
the bound adjustment. In this way no pre-computation is required, which results in a
substantial memory saving. Their multiplier has a semi-systolic architecture while the
multiplier presented here is fully systolic. This results in an important flexibility which is
unrelated to any specific parameter choice. Orlando and Paar also used an adaptation of a
fixed base exponentiation method introduced by Brickell et al. in [3]. This algorithm is
assumed to be 4 times faster than the standard double-and-add algorithm which is used here.
However, it relies on precomputation for a point known in advance, which is a limiting
factor with respect to various applications of ECC.
Wolkerstorfer proposes in [22] a dual-field arithmetic unit that offers all instructions
required for both types of finite fields, GF(p) and GF(2^m). He uses a redundant number
representation and a special multiplication with interleaved modular reduction. Inversion
is performed by the Extended Euclidean Algorithm. This is a low-power architecture that
can be realized on moderate silicon area; the author claims that it requires only slightly
more hardware resources than a pure GF(p) multiplier.
Goodman and Chandrakasan proposed a domain-speciﬁc reconﬁgurable cryptographic
processor (DSRCP) in [6]. The instruction set deﬁnition of the DSRCP was dictated by the
IEEE 1363 Public Key Cryptography Standard document. A list of the arithmetic functions
required to implement the various primitives deﬁned in the standard was tabulated in a
functional matrix, which was then used to deﬁne the instruction set architecture (ISA) of
the processor. The ISA contains 24 instructions broken up into six types of operations:
conventional arithmetic, modular integer arithmetic, GF arithmetic, elliptic curve ﬁeld
arithmetic over GF, register manipulation and processor conﬁguration.
3 Mathematical background
3.1 Elliptic curves over GF (p)
An elliptic curve E is often expressed in terms of the Weierstrass equation
$y^2 = x^3 + ax + b$, where $a, b \in GF(p)$ with $4a^3 + 27b^2 \neq 0 \pmod p$. The inverse
of the point $P = (x_1, y_1)$ is $-P = (x_1, -y_1)$. The sum $P + Q$ of the points
$P = (x_1, y_1)$ and $Q = (x_2, y_2)$ (assume that $P, Q \neq O$ and $P \neq \pm Q$) is the
point $R = (x_3, y_3)$ where:
$$\lambda = \frac{y_2 - y_1}{x_2 - x_1}, \quad x_3 = \lambda^2 - x_1 - x_2, \quad y_3 = (x_1 - x_3)\lambda - y_1.$$
For $P = Q$, the "doubling" formulae are:
$$\lambda = \frac{3x_1^2 + a}{2y_1}, \quad x_3 = \lambda^2 - 2x_1, \quad y_3 = (x_1 - x_3)\lambda - y_1.$$
The point at infinity O plays a role analogous to that of the number 0 in ordinary
addition. Thus, P + O = P and P + (−P) = O for all points P. The points on an elliptic curve
together with the operation of "addition" form an abelian group. It is then straightforward
to introduce point (or scalar) multiplication as the main operation for ECC. This operation
can be calculated by using the double-and-add algorithm as shown in Algorithm 1. For details
see [13, 7, 2].
Algorithm 1 Elliptic Curve Point Multiplication
Require: EC point $P = (x, y)$, integer $k$, $0 < k < M$, $k = (k_{l-1}, k_{l-2}, \cdots, k_0)_2$,
$k_{l-1} = 1$ and $M$
Ensure: $Q = kP = (x', y')$
1: $Q \leftarrow P$
2: for $i$ from $l - 2$ downto 0 do
3:   $Q \leftarrow 2Q$
4:   if $k_i = 1$ then
5:     $Q \leftarrow Q + P$
6:   end if
7: end for
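As an illustration of the affine formulae and of Algorithm 1, the following is a minimal Python sketch. The toy curve $y^2 = x^3 + 2x + 3$ over GF(97), the point (3, 6) and the function names `add` and `point_mul` are assumptions chosen for readability; they do not come from the paper, and the sketch makes no attempt to model the hardware.

```python
# Affine group law on y^2 = x^3 + A*x + B over GF(P_MOD) and the
# left-to-right double-and-add loop of Algorithm 1.
# Toy parameters (assumed): P_MOD = 97, A = 2, B = 3.
P_MOD, A, B = 97, 2, 3
INF = None  # the point at infinity O


def add(P, Q):
    """Return P + Q using the affine addition and doubling formulae."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF  # P + (-P) = O, also covers doubling a point with y = 0
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = ((x1 - x3) * lam - y1) % P_MOD
    return (x3, y3)


def point_mul(k, P):
    """Algorithm 1: start from Q = P and scan k from bit l-2 down to bit 0."""
    assert k > 0
    Q = P
    for i in range(k.bit_length() - 2, -1, -1):
        Q = add(Q, Q)          # Q <- 2Q
        if (k >> i) & 1:
            Q = add(Q, P)      # Q <- Q + P
    return Q
```

The modular inverse `pow(v, -1, m)` requires Python 3.8 or later. Every call to `add` costs one field inversion here, which is the cost that the projective coordinates discussed next are meant to avoid.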
In the above definition of the EC group affine coordinates are used, but so-called projective
coordinates have some implementation advantages. Point addition can be done in
projective coordinates using almost only field multiplications; only one inversion is needed,
at the end of a point multiplication operation. We have used the modified Jacobian (Jm)
coordinates proposed by Cohen et al. in [5] because EC point doubling is fastest in
this representation. They internally represent the Jacobian coordinates as a quadruple
$(X, Y, Z, aZ^4)$. This representation is called the modified Jacobian coordinate system and
is denoted by the authors as Jm. The algorithms for EC point addition and doubling are as
follows [5].
Let $P = (X_1, Y_1, Z_1, aZ_1^4)$, $Q = (X_2, Y_2, Z_2, aZ_2^4)$ and
$P + Q = R = (X_3, Y_3, Z_3, aZ_3^4)$.
The addition formulas in Jm are the following ($P \neq \pm Q$).
$$U_1 = X_1 Z_2^2, \quad U_2 = X_2 Z_1^2, \quad S_1 = Y_1 Z_2^3, \quad S_2 = Y_2 Z_1^3, \quad H = U_2 - U_1, \quad r = S_2 - S_1$$
$$X_3 = -H^3 - 2U_1H^2 + r^2, \quad Y_3 = -S_1H^3 + r(U_1H^2 - X_3), \quad Z_3 = Z_1 Z_2 H, \quad aZ_3^4 = aZ_3^4 \qquad (1)$$
The doubling formulas in Jm are the following ($R = 2P$).
$$S = 4X_1Y_1^2, \quad U = 8Y_1^4, \quad M = 3X_1^2 + (aZ_1^4)$$
$$X_3 = -2S + M^2, \quad Y_3 = M(S - X_3) - U, \quad Z_3 = 2Y_1Z_1, \quad aZ_3^4 = 2U(aZ_1^4) \qquad (2)$$
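Equations (1) and (2) can be transcribed almost literally, as in the following Python sketch. The toy curve (a = 2 over GF(97)) and the names `jm_add`, `jm_double` and `to_affine` are assumptions made for illustration only; the point at infinity and the case P = ±Q are deliberately not handled, in line with the stated precondition of Eq. (1).

```python
# Modified Jacobian (Jm) formulas of Eqs. (1) and (2).
# Toy parameters (assumed): y^2 = x^3 + 2x + 3 over GF(97).
P_MOD, A = 97, 2


def jm_double(P1):
    """Eq. (2): return 2*P1 for P1 = (X1, Y1, Z1, a*Z1^4)."""
    X1, Y1, Z1, aZ4 = P1
    S = 4 * X1 * Y1 * Y1 % P_MOD
    U = 8 * pow(Y1, 4, P_MOD) % P_MOD
    M = (3 * X1 * X1 + aZ4) % P_MOD
    X3 = (M * M - 2 * S) % P_MOD
    Y3 = (M * (S - X3) - U) % P_MOD
    Z3 = 2 * Y1 * Z1 % P_MOD
    return (X3, Y3, Z3, 2 * U * aZ4 % P_MOD)


def jm_add(P1, P2):
    """Eq. (1): return P1 + P2, valid only when P1 != +/-P2."""
    X1, Y1, Z1, _ = P1
    X2, Y2, Z2, _ = P2
    U1 = X1 * Z2 * Z2 % P_MOD
    U2 = X2 * Z1 * Z1 % P_MOD
    S1 = Y1 * pow(Z2, 3, P_MOD) % P_MOD
    S2 = Y2 * pow(Z1, 3, P_MOD) % P_MOD
    H = (U2 - U1) % P_MOD
    r = (S2 - S1) % P_MOD
    H3 = pow(H, 3, P_MOD)
    X3 = (r * r - H3 - 2 * U1 * H * H) % P_MOD
    Y3 = (r * (U1 * H * H - X3) - S1 * H3) % P_MOD
    Z3 = Z1 * Z2 * H % P_MOD
    return (X3, Y3, Z3, A * pow(Z3, 4, P_MOD) % P_MOD)


def to_affine(P):
    """(X, Y, Z, a*Z^4) -> (X / Z^2, Y / Z^3); the only inversion needed."""
    X, Y, Z, _ = P
    zi = pow(Z, -1, P_MOD)
    return (X * zi * zi % P_MOD, Y * pow(zi, 3, P_MOD) % P_MOD)
```

Carrying $aZ^4$ along as a fourth coordinate is what lets the doubling compute M without re-deriving $Z_1^4$ each time.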
3.2 Montgomery Modular Multiplication
The Montgomery product is defined as $\mathrm{Mont}(x, y) = xyR^{-1} \bmod N$, where
$N = (n_{l-1} \cdots n_1 n_0)_b$, $0 \le x, y < N$, $R = b^l$ and $b = 2^\alpha$ with $\gcd(N, b) = 1$.
Montgomery's method for multiplying two integers x and y (called N-residues) modulo
N avoids trial division by N, which is the most expensive operation in hardware. The
Montgomery representation of $x \in Z_N$ is $xR \bmod N$, and it allows very efficient modular
arithmetic, especially for multiplication [14].
The original proposal of Montgomery included a conditional subtraction at the end
of the algorithm. For efficiency as well as resistance against side-channel attacks [9, 10],
Walter [21] gives the bound $4N < R$, under which this subtraction can be avoided. This bound
guarantees that for inputs $X, Y < 2N$ the output is also bounded by $T < 2N$.
We take $\alpha = 1$ for simplicity and let the iteration starting from Step 2 execute $l + 2$
times instead of $l$ times as in the original proposal. With these changes the desired bound
is achieved as $4N < R = 2^{l+2}$. Algorithm 2 is the algorithm for Montgomery modular
multiplication without final subtraction which has the properties given above.
Algorithm 2 Montgomery modular multiplication without final subtraction
Require: Integers $N = (n_{l-1} \cdots n_1 n_0)_2$, $x = (x_l \cdots x_1 x_0)_2$, $y = (y_l \cdots y_1 y_0)_2$
with $x \in [0, 2N - 1]$, $y \in [0, 2N - 1]$, $R = 2^{l+2}$, $\gcd(N, 2) = 1$ and
$N' = -N^{-1} \bmod 2$ (Notation: $T = (t_{l+1} t_l \ldots t_0)$)
Ensure: $T = xyR^{-1} \bmod 2N$
1: $T \leftarrow 0$
2: for $i$ from 0 to $l + 1$ do
3:   $m_i \leftarrow t_0 \oplus x_i y_0$
4:   $T \leftarrow (T + x_i y + m_i N)/2$
5: end for
All operations throughout the EC point multiplication are done modulo 2N. The last
step is to convert the Montgomery representation of the coordinates of the resulting point
back to the normal representation. This is done by calculating the Montgomery modular
multiplication of each coordinate and 1: $\mathrm{Mont}(xR, 1) = xRR^{-1} = x$. It can easily be
proved that $\mathrm{Mont}(T, 1) \le N$ if $0 \le T < 2N$.
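A bit-serial software model of Algorithm 2 is sketched below in Python. The function names `mont` and `from_mont` and the explicit parameter `l` are assumptions for illustration; the final `% N` in `from_mont` is added here only because Mont(T, 1) may equal N, and is not described as a hardware step in the text.

```python
def mont(x, y, N, l):
    """Algorithm 2 with alpha = 1: x*y*2^-(l+2) modulo N, returned as a value < 2N.

    N must be odd and fit in l bits; x and y must lie in [0, 2N - 1].
    """
    T = 0
    for i in range(l + 2):
        xi = (x >> i) & 1
        mi = (T & 1) ^ (xi & y & 1)      # m_i = t_0 XOR (x_i AND y_0)
        T = (T + xi * y + mi * N) >> 1   # the sum is even, so the shift is exact
    return T


def from_mont(t, N, l):
    """Mont(T, 1), followed by a reduction because the result can equal N."""
    return mont(t, 1, N, l) % N
```

Because N is odd, choosing $m_i$ equal to the parity of $T + x_i y$ makes each partial sum divisible by 2, which is why no trial division is needed.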
4 Hardware Implementation
Our Elliptic Curve processor (ECP) can be divided into 5 levels hierarchically as shown
in Fig. 1.
The operation blocks on each level from top to bottom are as follows:
• Level 1: Main Controller (MC)
• Level 2:
1. Affine to projective coordinates converter (AtoP): $(x, y) \to (X, Y, Z, aZ^4)$ such
that $X = x$, $Y = y$, $Z = 1$ and $aZ^4 = a$
2. Normal to Montgomery representation converter (NtoM)
3. EC point multiplier (EPM)
4. Projective to aﬃne coordinates converter (PtoA)
5. Montgomery to normal representation converter (MtoN)
Figure 1. EC point multiplier circuit block diagram
• Level 3:
1. EC Point doubling, addition circuit (EPDA)
2. Modular Multiplicative Inverter (MMI)
• Level 4:
1. Montgomery Modular Multiplication Circuit (MMMC)
2. Modular Addition, Subtraction circuit (MASC)
• Level 5: Addition, Subtraction circuit (ASC)
For simplicity all blocks were designed separately with their own FSMs and data paths.
This allows for independent optimization and testing of the building blocks. The VHDL
code was written with the bit-length N of the coordinates x and y of P and the
bit-length l of k as parameters, so the design is suitable for any N and l. In the following
sections we describe the system using a top-down approach.
4.1 Main Controller
MC includes an FSM with 5 states. The algorithmic state machine (ASM) chart [11] of
MC is shown in Fig. 2.(a). The START signal is the instruction signal from the host. One
after another, MC instructs NtoM to start the conversion from normal to Montgomery
representation, EPM to start point multiplication, PtoA to start the conversion from
projective to affine coordinates and MtoN to start the conversion from Montgomery to normal
representation, by setting the START-NtoM, START-PM, START-PtoA and START-MtoN signals,
respectively. The DONE-NtoM, DONE-PM, DONE-PtoA and DONE-MtoN signals indicate that the
related operations are finished. The DONE signal indicates to the host that a complete
point multiplication operation is finished and the results are ready on the output ports.
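The handshake sequence described above can be summarised as a small transition table. The Python sketch below is only a behavioural illustration: the state names IDLE and S1 to S4 follow the labels of the ASM chart, while the table layout, the function `mc_step` and the choice to raise each output on the transition itself are assumptions.

```python
# state: (input MC waits for, next state, output raised on the transition)
TRANSITIONS = {
    "IDLE": ("START", "S1", "START-NtoM"),
    "S1": ("DONE-NtoM", "S2", "START-PM"),
    "S2": ("DONE-PM", "S3", "START-PtoA"),
    "S3": ("DONE-PtoA", "S4", "START-MtoN"),
    "S4": ("DONE-MtoN", "IDLE", "DONE"),
}


def mc_step(state, asserted):
    """Advance MC by one clock edge; `asserted` is the set of active inputs."""
    wait_for, next_state, output = TRANSITIONS[state]
    if wait_for in asserted:
        return next_state, {output}
    return state, set()  # keep waiting, no output raised
```

For example, `mc_step("S2", {"DONE-PM"})` moves to state S3 and raises START-PtoA, while `mc_step("S2", set())` stays in S2.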
Figure 2. ASM charts of the operational blocks: (a) MC, (b) NtoM, (c) EPM, (d) MMI
4.2 Normal to Montgomery representation converter
The conversion of an integer x from the normal representation to the Montgomery
representation is done as $\mathrm{Mont}(x, R^2) = xR^2R^{-1} \bmod M = xR \bmod M$. Multiplication by
MMMC of two numbers that are in Montgomery representation produces the Montgomery
representation of the product, as $\mathrm{Mont}(xR, yR) = xR\,yR\,R^{-1} \bmod M = xyR \bmod M$.
Modular addition and subtraction of two numbers that are in Montgomery representation
produce the Montgomery representation of the sum or difference, as
$xR \bmod M \pm yR \bmod M = (x \pm y)R \bmod M$. Because of these relations, the Montgomery
representations of the coordinates of P, the coefficient a and the number 1 are calculated at
the beginning of point multiplication by the NtoM circuit, and all the operations during the
EC point multiplication are done in Montgomery representation.
NtoM includes an FSM with 5 states. The ASM chart of NtoM is shown in Fig. 2.(b).
NtoM waits in the first (ini-IDLE) state until the START-NtoM signal from MC is set. NtoM
then makes MMMC execute 4 MMMs: $\mathrm{Mont}(1, R^2) = R \bmod M$, $\mathrm{Mont}(x, R^2) = xR \bmod M$,
$\mathrm{Mont}(y, R^2) = yR \bmod M$ and $\mathrm{Mont}(a, R^2) = aR \bmod M$. After DONE-MMM is set in the
last state, NtoM sets the DONE-NtoM signal and goes back to the (ini-IDLE) state.
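The relations used by NtoM can be checked with a purely functional model, sketched here in Python. The modulus M = 97, the value l = 7, the fully reduced `mont` helper and the assumption that $R^2 \bmod M$ is available as a precomputed input are all illustrative choices, not details confirmed by the text.

```python
M = 97                  # toy prime modulus (assumed)
R = 1 << 9              # R = 2^(l+2) with l = 7
R_INV = pow(R, -1, M)
R2 = R * R % M          # R^2 mod M, assumed to be supplied as an input


def mont(x, y):
    """Functional model of the MMMC: x*y*R^-1 mod M, fully reduced for simplicity."""
    return x * y * R_INV % M


def n_to_m(x, y, a):
    """The four MMMs issued by NtoM: Montgomery forms of 1, x, y and a."""
    return mont(1, R2), mont(x, R2), mont(y, R2), mont(a, R2)
```

With these values, products and sums of Montgomery representations stay in Montgomery representation, so no further conversion is needed until the end of the point multiplication.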
4.3 EC Point Multiplier
EPM includes an FSM with 4 states to control the execution of Algorithm 1. The ASM
chart of EPM is shown in Fig. 2.(c). The circuit stays in the first (mul-IDLE) state until the
START-PM signal from the MC is set. The DONE-PM signal indicates that the scanning of
the bits of k is finished, so the result of the operation can be read from the output ports.
EPM instructs EPDA to start a point double operation by setting the START-PAD signal and
resetting the ADD-DOUBLE signal, and a point addition operation by setting the START-PAD
and ADD-DOUBLE signals. DONE-PAD from EPDA indicates that a point double or addition
operation is finished.
4.4 Projective to aﬃne coordinates converter
After finishing the EC point multiplication the result point Q must be converted from
Jm coordinates to affine coordinates. This is done as $(X, Y, Z, aZ^4) \to (x, y)$ such that
$x = XZ^{-2}$ and $y = YZ^{-3}$ [5].
PtoA includes an FSM with 6 states to control the above operations. PtoA waits in the first
(PtoA-IDLE) state until the START-PtoA signal from MC is set. After it is set, PtoA
visits the other five states in the following order; after the DONE-MMM signal from the MMM
circuit is set in the (PtoA-S5) state, PtoA sets the DONE-PtoA signal and goes back to the
(PtoA-IDLE) state.
• PtoA-S1: $Z^{-1}R$ = modular multiplicative inversion of Z
• PtoA-S2: $Z^{-2}R = \mathrm{Mont}(Z^{-1}R, Z^{-1}R)$
• PtoA-S3: $xR = XZ^{-2}R = \mathrm{Mont}(XR, Z^{-2}R)$
• PtoA-S4: $Z^{-3}R = \mathrm{Mont}(Z^{-1}R, Z^{-2}R)$
• PtoA-S5: $yR = YZ^{-3}R = \mathrm{Mont}(YR, Z^{-3}R)$
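The five PtoA states map directly onto five operations, as the following Python sketch shows. It reuses a functional model of the Montgomery product; the modulus M = 97, the helper names `mont`, `mmi` and `p_to_a`, and the shortcut of computing the inverse with Python's built-in `pow` instead of an exponentiation circuit are assumptions made for brevity.

```python
M = 97                  # toy prime modulus (assumed)
R = 1 << 9              # R = 2^(l+2) with l = 7
R_INV = pow(R, -1, M)


def mont(x, y):
    """Functional model of the MMMC: x*y*R^-1 mod M."""
    return x * y * R_INV % M


def mmi(zR):
    """Functional model of the inverter: maps Z*R to Z^-1 * R (mod M)."""
    return pow(zR, -1, M) * R * R % M


def p_to_a(XR, YR, ZR):
    """Convert (X, Y, Z) to affine (x, y); all values in Montgomery form."""
    zinv = mmi(ZR)               # PtoA-S1: Z^-1 R
    zinv2 = mont(zinv, zinv)     # PtoA-S2: Z^-2 R
    xR = mont(XR, zinv2)         # PtoA-S3: x R = X Z^-2 R
    zinv3 = mont(zinv, zinv2)    # PtoA-S4: Z^-3 R
    yR = mont(YR, zinv3)         # PtoA-S5: y R = Y Z^-3 R
    return xR, yR
```

Only one inversion is performed per point multiplication; the remaining four steps are ordinary Montgomery products.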
4.5 Montgomery to normal representation converter
Because the coordinates of the product point must be in normal representation, a conversion
from Montgomery representation to normal representation is needed as the last action.
This conversion requires two additional executions of the MMM operation, with the inputs
xR and 1 and then yR and 1, as $x = \mathrm{Mont}(xR, 1) = xRR^{-1}$ and $y = \mathrm{Mont}(yR, 1) = yRR^{-1}$.
4.6 EC Point doubling, addition
When we convert the input point P from affine coordinates to projective coordinates we
take Z as 1, so the Jm representation of P = (x, y) is (x, y, 1, a). During the execution of
point multiplication one of the points to be added is always P. Hence we can take $Z_1 = 1$
for EC point addition. Because both MMMC and modular addition/subtraction (MAS) circuits
are available, these operations can be executed in parallel. EC point addition and doubling
can be realized by Algorithm 3.(a) and (b), respectively.
Fourteen states and six temporary registers are needed for EC point addition and also
for EC point doubling. Because one MAS operation takes less time than one MMM, the
latency of a state that contains an MMM is the same as that of one MMM. Hence the total
execution time of EC point addition is $14T_{MMM}$, where $T_{MMM}$ is the latency of one MMM.
The total execution time of EC point doubling is $8T_{MMM} + 6T_{MAS}$, where $T_{MAS}$ is the
latency of one MAS.
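Algorithm 3 EC point addition and doubling

(a) EC point addition
Require: $P_1 = (x, y, 1, a)$, $P_2 = (X_2, Y_2, Z_2, aZ_2^4)$
Ensure: $P_1 + P_2 = P_3 = (X_3, Y_3, Z_3, aZ_3^4)$
1. $T_1 \leftarrow Z_2^2$
2. $T_2 \leftarrow xT_1$
3. $T_1 \leftarrow T_1Z_2$               $T_3 \leftarrow X_2 - T_2$
4. $T_1 \leftarrow yT_1$
5. $T_4 \leftarrow T_3^2$                $T_5 \leftarrow Y_2 - T_1$
6. $T_2 \leftarrow T_2T_4$
7. $T_4 \leftarrow T_4T_3$               $T_6 \leftarrow 2T_2$
8. $Z_3 \leftarrow Z_2T_3$               $T_6 \leftarrow T_4 + T_6$
9. $T_3 \leftarrow T_5^2$
10. $T_1 \leftarrow T_1T_4$              $X_3 \leftarrow T_3 - T_6$
11. $aZ_3^4 \leftarrow Z_3^2$            $T_2 \leftarrow T_2 - X_3$
12. $T_3 \leftarrow T_5T_2$
13. $aZ_3^4 \leftarrow (aZ_3^4)^2$       $Y_3 \leftarrow T_3 - T_1$
14. $aZ_3^4 \leftarrow a(aZ_3^4)$

(b) EC point doubling
Require: $P_1 = (X_1, Y_1, Z_1, aZ_1^4)$
Ensure: $2P_1 = P_3 = (X_3, Y_3, Z_3, aZ_3^4)$
1. $T_1 \leftarrow Y_1^2$                $T_2 \leftarrow 2X_1$
2. $T_3 \leftarrow T_1^2$                $T_2 \leftarrow 2T_2$
3. $T_1 \leftarrow T_2T_1$               $T_3 \leftarrow 2T_3$
4. $T_2 \leftarrow X_1^2$                $T_3 \leftarrow 2T_3$
5. $T_4 \leftarrow Y_1Z_1$               $T_3 \leftarrow 2T_3$
6. $T_5 \leftarrow T_3(aZ_1^4)$          $T_6 \leftarrow 2T_2$
7. $T_2 \leftarrow T_6 + T_2$
8. $T_2 \leftarrow T_2 + (aZ_1^4)$
9. $T_6 \leftarrow T_2^2$                $Z_3 \leftarrow 2T_4$
10. $T_4 \leftarrow 2T_1$
11. $X_3 \leftarrow T_6 - T_4$
12. $T_1 \leftarrow T_1 - X_3$
13. $T_2 \leftarrow T_2T_1$              $aZ_3^4 \leftarrow 2T_5$
14. $Y_3 \leftarrow T_2 - T_3$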
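The register schedules of Algorithm 3 can be replayed step by step in software to check them against Eqs. (1) and (2). In the Python sketch below, `mul` stands in for one MMM and `add`/`sub` for one MAS; the Montgomery factor is omitted, the modulus 97 is a toy value, and the function names are assumptions. Operations that share a step in the hardware are simply executed one after another here, which gives the same result because neither reads a register the other writes.

```python
P_MOD = 97  # toy prime (assumed)


def mul(u, v):
    return u * v % P_MOD      # stands in for one MMM


def add(u, v):
    return (u + v) % P_MOD    # one MAS (addition)


def sub(u, v):
    return (u - v) % P_MOD    # one MAS (subtraction)


def ec_add(x, y, a, P2):
    """Algorithm 3(a): (x, y, 1, a) + P2 in Jm coordinates."""
    X2, Y2, Z2, _ = P2
    T1 = mul(Z2, Z2)                              # 1
    T2 = mul(x, T1)                               # 2
    T1 = mul(T1, Z2); T3 = sub(X2, T2)            # 3
    T1 = mul(y, T1)                               # 4
    T4 = mul(T3, T3); T5 = sub(Y2, T1)            # 5
    T2 = mul(T2, T4)                              # 6
    T4 = mul(T4, T3); T6 = add(T2, T2)            # 7
    Z3 = mul(Z2, T3); T6 = add(T4, T6)            # 8
    T3 = mul(T5, T5)                              # 9
    T1 = mul(T1, T4); X3 = sub(T3, T6)            # 10
    aZ = mul(Z3, Z3); T2 = sub(T2, X3)            # 11
    T3 = mul(T5, T2)                              # 12
    aZ = mul(aZ, aZ); Y3 = sub(T3, T1)            # 13
    aZ = mul(a, aZ)                               # 14
    return (X3, Y3, Z3, aZ)


def ec_double(P1):
    """Algorithm 3(b): 2*P1 in Jm coordinates."""
    X1, Y1, Z1, aZ1 = P1
    T1 = mul(Y1, Y1); T2 = add(X1, X1)            # 1
    T3 = mul(T1, T1); T2 = add(T2, T2)            # 2
    T1 = mul(T2, T1); T3 = add(T3, T3)            # 3
    T2 = mul(X1, X1); T3 = add(T3, T3)            # 4
    T4 = mul(Y1, Z1); T3 = add(T3, T3)            # 5
    T5 = mul(T3, aZ1); T6 = add(T2, T2)           # 6
    T2 = add(T6, T2)                              # 7
    T2 = add(T2, aZ1)                             # 8
    T6 = mul(T2, T2); Z3 = add(T4, T4)            # 9
    T4 = add(T1, T1)                              # 10
    X3 = sub(T6, T4)                              # 11
    T1 = sub(T1, X3)                              # 12
    T2 = mul(T2, T1); aZ3 = add(T5, T5)           # 13
    Y3 = sub(T2, T3)                              # 14
    return (X3, Y3, Z3, aZ3)
```

Counting the `mul` calls gives 14 for the addition and 8 for the doubling, matching the MMM counts quoted in the text.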
4.7 Modular Multiplicative Inverter
Modular multiplicative inversion is done according to Fermat's theorem [8, 12]:
$a^{-1} = a^{p-2} \bmod p$ if $\gcd(a, p) = 1$. Because the curves we are interested in are defined
over GF(p), where p is prime, we can use this theorem to find multiplicative inverses modulo p.
So multiplicative inversion can be done by modular exponentiation of a by p − 2. Modular
exponentiation can be realized by using the square and multiply algorithm given in [12].
MMI controls the execution of the square and multiply algorithm. It includes an FSM with
4 states. The ASM chart of MMI is shown in Fig. 2.(d). The START-INV signal is the
instruction signal from PtoA. The DONE-INV signal indicates that the scanning of the bits
of the T1 register is finished.
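A left-to-right square and multiply inversion can be sketched as follows in Python. The function name `fermat_inverse` and the returned multiplication counter are assumptions added for illustration; ordinary modular products are used in place of Montgomery products.

```python
def fermat_inverse(a, p):
    """Return (a^(p-2) mod p, number of modular multiplications used).

    p must be prime and a must not be a multiple of p.
    """
    e = p - 2
    result = a % p              # the leading 1 bit of e is consumed here
    mults = 0
    for i in range(e.bit_length() - 2, -1, -1):
        result = result * result % p      # square for every remaining bit
        mults += 1
        if (e >> i) & 1:
            result = result * a % p       # multiply when the bit is 1
            mults += 1
    return result, mults
```

For an exponent of N bits with about half of its bits set, this uses roughly N squarings and N/2 multiplications, which is consistent with the 3N/2 MMM figure listed for MMI in Table 1.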
4.8 Montgomery Modular Multiplication Circuit
The i-th iteration of Step 2 in Algorithm 2 computes the temporary results
$$T_i = 2^{-1}(T_{i-1} + x_i \times Y + m_i \times N), \quad i = 0, \cdots, l + 1 \qquad (3)$$
where $T_{-1} = 0$ [20]. The j-th digit of $T_i$ is obtained using the recurrence relation
$$2^2 \times c1_{i,j} + 2 \times c0_{i,j} + t_{i,j} = t_{i-1,j+1} + x_i \times y_j + m_i \times n_j + 2 \times c1_{i,j-1} + c0_{i,j-1} \qquad (4)$$
with $i = 0, \cdots, l + 1$, $j = 0, \cdots, l + 1$, $c1_{i,-1} = 0$ and $c0_{i,-1} = 0$. In Eq. (4),
$2 \times c1_{i,j} + c0_{i,j}$, $j = -1, \cdots, l$, denotes the carry chain up the adder.
To obtain a linear, pipelined modular multiplier, the systolic array shown in Fig. 3 is
used. X(0) denotes the least significant bit (LSB) of the register in which the input x is
stored. T denotes the intermediate value register. The carry chain is stored in the C0
and C1 registers. The j-th cell behaves like cell (i, j), computing Eq. (4) at time $2i + j$ for
$i = 0, \cdots, l + 1$.
The total area of the systolic array is $(5l - 3)$ XOR + $(7l - 7)$ AND + $(4l - 5)$ OR gates and
$4l$ flip-flops. The critical path is the same as the critical path of one regular cell and is
independent of the bit length of the operands: it is
$2T_{FA}(c_{in} \to c_{out}) + T_{HA}(c_{in} \to c_{out})$. More details can be found in [16].
Figure 3. Schematic view of complete systolic array
Figure 4. Architecture of the Montgomery modular multiplier circuit
The MMMC consists of a controller and a data path as shown in Fig. 4. The data path
consists of a systolic array, four internal registers, a counter and a comparator.
$t_{i,j}$ is calculated at the $(2i + j)$-th clock cycle, so $t_{l+1,l+1}$ is calculated at the
$(3l + 3)$-th clock cycle. Hence, the total number of clock cycles for completing one
Montgomery modular multiplication equals $3l + 3$.
4.9 Modular Addition, Subtraction Circuit
Modular addition and subtraction are executed according to Algorithm 4 [4].
Algorithm 4 Modular addition and subtraction
Require: $M$, $0 \le A < M$, $0 \le B < M$
Ensure: $C = A + B \bmod M$
1: $C' = A + B$
2: $C'' = C' - M$
3: if $C'' < 0$ then
4:   $C = C'$
5: else
6:   $C = C''$
7: end if
Require: $M$, $0 \le A < M$, $0 \le B < M$
Ensure: $C = A - B \bmod M$
1: $C' = A - B$
2: $C'' = C' + M$
3: if $C' < 0$ then
4:   $C = C''$
5: else
6:   $C = C'$
7: end if
The numbers are represented in two’s complement representation. In this representation,
addition and subtraction can be realized by using the same circuit [18].
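Algorithm 4 translates directly into the following Python sketch. The function names are assumptions; Python integers are signed and unbounded, so the sign test replaces the inspection of the sign bit that a two's complement datapath would perform.

```python
def mod_add(a, b, m):
    """Algorithm 4, addition: requires 0 <= a, b < m."""
    c1 = a + b
    c2 = c1 - m
    return c1 if c2 < 0 else c2   # keep a + b unless it reached m


def mod_sub(a, b, m):
    """Algorithm 4, subtraction: requires 0 <= a, b < m."""
    c1 = a - b
    c2 = c1 + m
    return c2 if c1 < 0 else c1   # add m back only if a - b went negative
```

Both candidate results are always computed and one is selected by a sign test, so each operation costs two passes through the addition/subtraction circuit.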
4.10 Implementation Results of The Elliptic Curve Processor
The proposed processor is implemented on a Xilinx V1000E-BG-560-8 (Virtex E) FPGA,
taking both the bit length N of the EC parameters and the bit length l of k as 160. According
to the implementation results, the numbers of flip-flops and 4-input LUTs are 6,959 and
11,227, respectively. This is equivalent to 115,520 gates. The minimum clock period is
10.952 ns (maximum clock frequency: 91.308 MHz). LUTs are lookup tables that are used as
RAMs or 4-input gates. The latency of the operations according to the clock frequency of the
implemented circuit is given in Table 1.

Table 1. Latency of the operations executed in ECP
Operation           Sub-operations            # of clock cycles        Execution time*
                                              depending on N and l     (ms)
NtoM                4 MMM                     12N + 16                 0.021
EPM                 l EC point double +       l(51N + 66)              14.414
                    l/2 EC point addition
PtoA                MMI + 4 MMM               3N^2 + 16N + 16          0.397
MtoN                2 MMM                     6N + 8                   0.011
EC point doubling   8 MMM + 6 MAS             40N + 38                 0.070
EC point addition   14 MMM                    42N + 56                 0.074
MMI                 3N/2 MMM                  9/2 N^2 + 6N             1.272
MMM                                           3N + 4                   0.005
MAS                                           2N + 1                   0.003
* for N = l = 160 at 91.308 MHz
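The execution times in Table 1 follow from the cycle counts and the clock frequency. The short Python sketch below reproduces this conversion for a few rows; the dictionary layout and the function name `latency_ms` are assumptions, and only a subset of the table is included.

```python
CLOCK_HZ = 91.308e6  # maximum clock frequency reported for the implementation

# Cycle-count formulas copied from selected rows of Table 1.
CYCLES = {
    "NtoM": lambda N, l: 12 * N + 16,
    "EPM": lambda N, l: l * (51 * N + 66),
    "MtoN": lambda N, l: 6 * N + 8,
    "EC point addition": lambda N, l: 42 * N + 56,
    "MMM": lambda N, l: 3 * N + 4,
}


def latency_ms(op, N=160, l=160):
    """Execution time in milliseconds of `op` at CLOCK_HZ."""
    return CYCLES[op](N, l) / CLOCK_HZ * 1e3
```

For N = l = 160 the EPM row gives 1,316,160 clock cycles, or about 14.41 ms, which dominates the total latency of a point multiplication.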
The only existing previous work done on FPGA is from Orlando and Paar [15]. They
reported that their processor used 11,416 LUTs, 5,735 flip-flops and 35 BlockRAMs.
BlockRAM is a block memory on Virtex FPGAs; on the FPGA that the authors used, one
BlockRAM consists of 4096 bits of memory. The clock frequency was reported as 40 MHz.
Comparing both results, our processor uses less memory and can work with a higher clock
frequency, as we expected.
5 Conclusions and Future Work
We have described an efficient implementation of an elliptic curve processor over GF(p).
The processor can be programmed to execute modular multiplication, addition/subtraction,
multiplicative inversion, EC point addition/doubling and multiplication. We use the method
of Montgomery in a systolic array architecture for modular multiplication. In this form,
Montgomery modular multiplication is well suited to secure hardware: the optimal bound
is used, which, with some savings in hardware, completely omits the reduction steps that are
known to be vulnerable to side-channel attacks.
One direction in which this work should go is to implement a processor which can be
programmed for point multiplication and also modular exponentiation, the basic operations
for ECC and RSA, respectively. A cryptographic device dealing with both types of PKC
would be very useful to secure communication systems.
Acknowledgements
Sıddıka Berna Örs is funded by a research grant of the Katholieke Universiteit Leuven, Belgium. Dr.
Bart Preneel and Dr. Joos Vandewalle are professors at the Katholieke Universiteit Leuven, Belgium. This
work was supported by Concerted Research Action GOA-MEFISTO-666 of the Flemish Government and
by the FWO “Identiﬁcation and Cryptography” project (G.0141.03).
References
[1] L. Batina, S. B. Örs, B. Preneel, and J. Vandewalle. Hardware architectures for public key cryptography.
Elsevier Science Integration the VLSI Journal, in print, 2002.
[2] I. Blake, G. Seroussi, and N. P. Smart. Elliptic Curves in Cryptography. London Mathematical Society
Lecture Note Series. Cambridge University Press, 1999.
[3] E. F. Brickell, D. M. Gordon, K. S. McCurley, and D. B. Wilson. Fast exponentiation with precompu-
tation: Algorithms and lower bound. In R. A. Rueppel, editor, Advances in Cryptology: Proceedings of
EUROCRYPT’92, number 658 in Lecture Notes in Computer Science, pages 200–207. Springer-Verlag,
1992.
[4] Ç. K. Koç. RSA hardware implementation. Technical report, RSA Laboratories, RSA Data Security,
Inc., Redwood City, CA, August 1995.
[5] H. Cohen, A. Miyaji, and T. Ono. Eﬃcient elliptic curve exponentiation using mixed coordinates.
In K. Ohta and D. Pei, editors, Proceedings of ASIACRYPT 1998, number 1514 in Lecture Notes in
Computer Science, pages 51–65. Springer-Verlag, 1998.
[6] J. Goodman and A. P. Chandrakasan. An energy-eﬃcient reconﬁgurable public-key cryptography
processor. IEEE Journal of Solid-State Circuits, 36(11):1808–1820, November 2001.
[7] N. Koblitz. Elliptic curve cryptosystem. Math. Comp., 48:203–209, 1987.
[8] N. Koblitz. A Course in Number Theory and Cryptography, volume 114 of Graduate text in mathemat-
ics. Springer-Verlag, Berlin, Germany, second edition, 1994.
[9] P. Kocher. Timing attacks on implementations of Diﬃe-Hellman, RSA, DSS and other systems. In
N. Koblitz, editor, Advances in Cryptology: Proceedings of CRYPTO’96, number 1109 in Lecture Notes
in Computer Science, pages 104–113. Springer-Verlag, 1996.
[10] P. Kocher, J. Jaﬀe, and B. Jun. Diﬀerential power analysis. In M. Wiener, editor, Advances in
Cryptology: Proceedings of CRYPTO’99, number 1666 in Lecture Notes in Computer Science, pages
388–397. Springer-Verlag, 1999.
[11] M. M. Mano and C. R. Kime. Logic and Computer Design Fundamentals. Prentice Hall, Upper Saddle
River, New Jersey 07458, second edition, 2001.
[12] A. Menezes, P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997.
[13] V. Miller. Uses of elliptic curves in cryptography. In H. C. Williams, editor, Advances in Cryptol-
ogy: Proceedings of CRYPTO’85, number 218 in Lecture Notes in Computer Science, pages 417–426.
Springer-Verlag, 1985.
[14] P. Montgomery. Modular multiplication without trial division. Mathematics of Computation, Vol.
44:519–521, 1985.
[15] G. Orlando and C. Paar. A scalable GF(p) elliptic curve processor architecture for programmable
hardware. In Ç. K. Koç, D. Naccache, and C. Paar, editors, Proceedings of Workshop on Cryptographic
Hardware and Embedded Systems (CHES 2001), number 2162 in Lecture Notes in Computer Science,
pages 356–371, Paris, France, May 14-16 2001. Springer-Verlag.
[16] S. B. Örs, L. Batina, B. Preneel, and J. Vandewalle. Hardware implementation of a Montgomery
modular multiplier in a systolic array. In The 10th Reconfigurable Architectures Workshop (RAW),
Nice, France, April 22 2003. to appear.
[17] H. Orup. Simplifying quotient determination in high-radix modular multiplication. In Proceedings of
the 12th Symposium on Computer Arithmetic, pages 193–199. IEEE, 1995.
[18] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, Inc.,
New York, 2000.
[19] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key
cryptosystems. Communications of the ACM, 21(2):120–126, 1978.
[20] C. D. Walter. Montgomery's multiplication technique: How to make it smaller and faster. In Ç.
K. Koç and C. Paar, editors, Proceedings of Cryptographic Hardware and Embedded Systems (CHES
1999), number 1717 in Lecture Notes in Computer Science, pages 80–93. Springer-Verlag, 1999.
[21] C. D. Walter. Precise bounds for Montgomery modular multiplication and some potentially insecure
RSA moduli. In B. Preneel, editor, Proceedings of Topics in Cryptology- CT-RSA 2002, number 2271
in Lecture Notes in Computer Science, pages 30–39, 2002.
[22] J. Wolkerstorfer. Dual-field arithmetic unit for GF(p) and GF(2^m). In B. S. Kaliski Jr., Ç. Koç, and
C. Paar, editors, Proceedings of Cryptographic Hardware and Embedded Systems (CHES 2002), Lecture
Notes in Computer Science, Redwood Shores, CA, USA, August 13-15 2002. Springer-Verlag.
