Efficient unified Montgomery inversion with multibit shifting by Savaş, Erkay et al.
E. Savaş, M. Naseer, A.A-A. Gutub and Ç.K. Koç
Abstract: Computation of multiplicative inverses in finite fields GF(p) and GF(2^n) is the most time-consuming
operation in elliptic curve cryptography, especially when affine co-ordinates are used.
Since the existing algorithms based on the extended Euclidean algorithm do not permit a fast software
implementation, projective co-ordinates, which eliminate almost all of the inversion operations from
the curve arithmetic, are preferred. In the paper, the authors demonstrate that affine co-ordinate
implementation provides a comparable speed to that of projective co-ordinates with careful hardware
realisation of existing algorithms for calculating inverses in both fields without utilising special
moduli or irreducible polynomials. They present two inversion algorithms for binary extension and
prime fields, which are slightly modified versions of the Montgomery inversion algorithm. The
similarity of the two algorithms allows the design of a single unified hardware architecture that
performs the computation of inversion in both fields. They also propose a hardware structure where
the field elements are represented using a multi-word format. This feature allows a scalable
architecture able to operate in a broad range of precision, which has certain advantages in
cryptographic applications. In addition, they include statistical comparison of four inversion
algorithms in order to help choose the best one amongst them for implementation onto hardware.
1 Introduction
The basic arithmetic operations (i.e. addition, multiplication
and inversion) in prime and binary extension fields, GF(p)
and GF(2^n), have several applications in cryptography, such
as the RSA algorithm, the Diffie-Hellman key exchange algorithm
[1], the US federal Digital Signature Standard [2] and also
elliptic curve cryptography [3, 4]. Recently, speeding up
inversion operations in both fields has been gaining
attention since inversion is the most time consuming
operation in elliptic curve cryptographic algorithms when
affine co-ordinates are selected [5–10].
Currently, most of the elliptic curves that are employed in
cryptographic applications are defined over prime GF(p) and
binary extension GF(2^n) fields. In [11], a scalable and unified
multiplier architecture for both fields is proposed and it has
been shown that it is possible to design a multiplier with an
insignificant increase in chip area (2.8%) and no increase
in time delay since the Montgomery multiplication algorithms
for both fields are almost identical, except the basic
addition operation of the corresponding fields.
In this paper, we give and analyse multiplicative
inversion algorithms for GF(p) and GF(2^n), which allow
very fast and area-efficient hardware implementations.
The algorithms are based on the Montgomery inversion
algorithms given in [5]. Another variation of the
Montgomery inversion algorithm is proposed in [7],
which provides a very fast implementation of the multiplicative
inverse in GF(p). This algorithm takes
advantage of a multiplication unit, which already exists in
most of the modern multipurpose microprocessors.
Implementations of this algorithm are generally area-
hungry, and incorporation of such an algorithm as part of
a crypto-accelerator presents difficulties. Accordingly it is
not suited to our needs.
In a recently published technical report [12], two similar
algorithms for direct calculation of modular division in
GF(p) and GF(2^n) have been proposed. Based on the binary
extended Euclidean algorithm [13; Note 1] (attributed to
M. Penk), [12] eliminates the correction phase of the
Montgomery inversion algorithm by introducing extra add-
and-shift operations in the while loop of the algorithm.
However, performing these extra operations within the loop
maintains a cost similar to performing them in Phase II.
Furthermore, this approach does not run the algorithm in
different modes as described in [8] in order to calculate
different inverses, which prove to be useful in elliptic curve
cryptography. Two algorithms proposed separately for fast
calculation of inversion operations of GF(p) and GF(2^n) in
[14, 15], respectively, do not allow a unified design.
A more recent effort [16] presents an efficient inversion
algorithm that computes a direct inversion and decreases the
number of additions at the expense of introducing, on
average, more stand-alone [Note 2] shift operations. As far
as the number of clock cycles to compute inversion is
concerned, shift operations are as costly as addition
operations in a scalable implementation since these shift
operations are performed in more than one cycle. In
addition, the algorithm in [16] applies the logical OR
operation on the entire bits of an integer. In cryptographic
applications where the integers are large, this OR operation
may have an adverse effect on the scalability and critical
path delay. Therefore, we use variants of the Montgomery
inversion algorithm in our work that we believe are more
suitable for hardware implementation.
© IEE, 2005
IEE Proceedings online no. 20059032
doi: 10.1049/ip-cdt:20059032
E. Savaş and M. Naseer are with the Faculty of Engineering & Natural
Sciences, Sabanci University, Istanbul, Turkey TR-34956
A.A-A. Gutub is with Computer Engineering, King Fahd University of
Petroleum & Minerals, Dhahran 31261, Saudi Arabia
Ç.K. Koç is with Electrical & Computer Engineering, Oregon State
University, Corvallis, OR 97331, USA
E-mail: erkays@sabanciuniv.edu
Paper first received 29th September 2003 and in revised form 5th August
2004
Note 1: Most of the binary inversion algorithms are in fact based on the
binary extended Euclidean algorithm.
Note 2: Note that the shift operations following subtraction operations in
the original Montgomery inversion algorithm must not be counted
separately as done in [16].
IEE Proc.-Comput. Digit. Tech., Vol. 152, No. 4, July 2005
2 The original Montgomery inversion algorithm
The Montgomery inversion algorithm as defined in [5]
computes
b ¼ a12n ðmod pÞ ð1Þ
given a< p; where p is a prime number and n ¼ dlog2 pe:
The algorithm consists of two phases: the output of Phase I
is the integer r such that r ¼ a12k ðmod pÞ; where n  k 
2n and Phase II is a correction step and can be modified as
shown in [8] in order to calculate a slightly different inverse
that can more precisely be called Montgomery inverse:
b ¼ MonInvða2nÞ ¼ a12n ðmod pÞ ð2Þ
This new definition is more suitable since it takes an integer
in the so-called Montgomery domain and yields its multi-
plicative inverse, again in the Montgomery domain. Below,
we give the original Montgomery inversion algorithm for
completeness; however, the new algorithms in subsequent
sections will compute the Montgomery inverse of an integer
as defined in (2).
Algorithm A
Phase I
Input: a ∈ [1, p − 1] and p
Output: r ∈ [1, p − 1] and k, where r = a^{-1} 2^k (mod p) and
n ≤ k ≤ 2n
1: u := p, v := a, r := 0, and s := 1
2: k := 0
3: while (v > 0)
4:    if u is even then u := u/2, s := 2s
5:    else if v is even then v := v/2, r := 2r
6:    else if u > v then u := (u − v)/2, r := r + s, s := 2s
7:    else v := (v − u)/2, s := s + r, r := 2r
8:    k := k + 1
9: if r ≥ p then r := r − p
10: return r := p − r and k
Phase II
Input: r ∈ [1, p − 1], p, and k from Phase I
Output: b ∈ [1, p − 1], where b = a^{-1} 2^n (mod p)
11: for i = 1 to k − n do
12:    if r is even then r := r/2
13:    else r := (r + p)/2
14: return b := r
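As an illustration, Algorithm A can be transcribed almost line for line into Python. The sketch below is ours and is not taken from [5]; the function names are hypothetical, p is assumed to be an odd prime, and n is taken as the bit length of p.

```python
def montgomery_inverse_phase1(a, p):
    """Phase I of Algorithm A: returns (r, k) with r = a^-1 * 2^k (mod p)."""
    u, v, r, s = p, a, 0, 1
    k = 0
    while v > 0:
        if u % 2 == 0:            # Step 4: stand-alone shift of u
            u, s = u // 2, 2 * s
        elif v % 2 == 0:          # Step 5: stand-alone shift of v
            v, r = v // 2, 2 * r
        elif u > v:               # Step 6: subtraction followed by a shift
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:                     # Step 7
            v, s, r = (v - u) // 2, s + r, 2 * r
        k += 1
    if r >= p:                    # Step 9
        r -= p
    return p - r, k               # Step 10


def montgomery_inverse_phase2(r, k, p):
    """Phase II of Algorithm A: k - n halvings modulo p give a^-1 * 2^n (mod p)."""
    n = p.bit_length()
    for _ in range(k - n):
        r = r // 2 if r % 2 == 0 else (r + p) // 2
    return r


if __name__ == "__main__":
    p, a = 17, 10
    r, k = montgomery_inverse_phase1(a, p)
    b = montgomery_inverse_phase2(r, k, p)
    print(r, k, b)
```

For p = 17 and a = 10, tracing the loop by hand gives r = 3 and k = 6, and one halving step in Phase II gives b = 10, which satisfies 10 · 10 ≡ 2^5 (mod 17).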
There are three important theorems about the algorithm,
which have already been proven in [5].
Theorem 1: If $p > a > 0$, then the intermediate values r, s,
u and v in the Montgomery inversion algorithm are
always in the range $[1, 2p - 1]$.
Theorem 2: If a and p are relatively prime, p is odd, and
$p > a > 0$, then the number of iterations in the first phase
of Montgomery inversion algorithm is at least n and at
most 2n, where n is the number of bits in p.
Theorem 3: If p and a are relatively prime, p is odd, and
$p > a > 0$, then Phase I of the Montgomery inversion
algorithm returns $a^{-1} 2^{k} \pmod{p}$.
We will include and prove analogous theorems for the
Montgomery inversion algorithm for binary extension fields
GF(2^n) in the following Section.
3 The Montgomery inversion algorithm for GF(2^n)
Let
$$p(x) = x^{n} + p_{n-1} x^{n-1} + p_{n-2} x^{n-2} + \ldots + p_{1} x + p_{0}$$
be an irreducible polynomial over GF(2) that is used to
construct the binary extension field GF(2^n). An element of
GF(2^n) can be represented as a polynomial
$$a(x) = a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \ldots + a_{1} x + a_{0}$$
whose coefficients $a_i$ are from {0, 1}. Then arithmetic on the
elements in GF(2^n) is regular polynomial arithmetic where
operations on coefficients are performed modulo 2.
The Montgomery inversion algorithm for GF(2^n) can be
given as follows:
Algorithm B
Phase I
Input: a(x) and p(x), where deg(a(x)) < deg(p(x))
Output: s(x) and k, where s = a(x)^{-1} x^k (mod p(x)) and
deg(s(x)) < deg(p(x)) and deg(a(x)) + 1 ≤ k ≤
deg(p(x)) + deg(a(x)) + 1
1: u(x) := p(x), v(x) := a(x), r(x) := 0, and s(x) := 1
2: k := 0
3: while (u(x) ≠ 0)
4:    if u_0 = 0 then u(x) := u(x)/x, s(x) := x s(x)
5:    else if v_0 = 0 then v(x) := v(x)/x, r(x) := x r(x)
6:    else if deg(u(x)) ≥ deg(v(x)) then
         u(x) := (u(x) + v(x))/x, r(x) := r(x) + s(x),
         s(x) := x s(x)
7:    else v(x) := (v(x) + u(x))/x, s(x) := s(x) + r(x),
         r(x) := x r(x)
8:    k := k + 1
9: if s_{n+1} = 1 then s(x) := s(x) + x p(x)
10: if s_n = 1 then s(x) := s(x) + p(x)
11: return s(x) and k
Phase II
Input: s(x) where deg(s(x)) < deg(p(x)), p(x), and k from
Phase I
Output: b(x) where b(x) = a(x)^{-1} x^{2n} (mod p)
12: for i = 1 to 2n − k do
13:    s(x) := x s(x) + s_{n−1} p(x)
14: return b(x) := s(x)
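Algorithm B can be exercised in software by storing a binary polynomial in a Python integer, with bit i holding the coefficient of x^i, so that addition becomes XOR and multiplication or division by x becomes a one-bit shift. The sketch below is ours under that representation; the function names and the checking helper `gf2_mulmod` are hypothetical and not part of the paper.

```python
def gf2_inverse_phase1(a, p):
    """Phase I of Algorithm B: returns (s, k) with s = a^-1 * x^k (mod p)."""
    n = p.bit_length() - 1                 # n = deg(p(x))
    u, v, r, s = p, a, 0, 1
    k = 0
    while u != 0:
        if u & 1 == 0:                     # Step 4: u_0 = 0
            u, s = u >> 1, s << 1
        elif v & 1 == 0:                   # Step 5: v_0 = 0
            v, r = v >> 1, r << 1
        elif u.bit_length() >= v.bit_length():   # Step 6: degree comparison
            u, r, s = (u ^ v) >> 1, r ^ s, s << 1
        else:                              # Step 7
            v, s, r = (v ^ u) >> 1, s ^ r, r << 1
        k += 1
    if (s >> (n + 1)) & 1:                 # Step 9
        s ^= p << 1
    if (s >> n) & 1:                       # Step 10
        s ^= p
    return s, k


def gf2_inverse_phase2(s, k, p):
    """Phase II of Algorithm B: 2n - k multiplications by x modulo p(x)."""
    n = p.bit_length() - 1
    for _ in range(2 * n - k):
        top = (s >> (n - 1)) & 1           # s_{n-1}
        s = (s << 1) ^ (p if top else 0)   # Step 13
    return s


def gf2_mulmod(a, b, p):
    """Checking helper: a(x) * b(x) mod p(x), with deg(a) < deg(p)."""
    n = p.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= p
    return result


if __name__ == "__main__":
    p = 0b1011                             # x^3 + x + 1
    s, k = gf2_inverse_phase1(0b10, p)     # a(x) = x
    print(bin(s), k, bin(gf2_inverse_phase2(s, k, p)))
```

For p(x) = x^3 + x + 1 and a(x) = x, a hand trace gives s(x) = x^2 + x with k = 5, and Phase II returns x^2 + x + 1.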
Theorem 4: If $\deg(p(x)) > \deg(a(x)) > 0$ where p(x) is an
irreducible polynomial, then the degrees of intermediate
binary polynomials r(x), u(x) and v(x) in the Montgomery
inversion algorithm are always in the range [0, deg(p(x))],
while deg(s(x)) is in the range [0, deg(p(x)) + 1].
Proof: Let us assume the following terminology:
$n = \deg(p(x))$, $d_a = \deg(a(x))$, $d_u = \deg(u(x))$,
$d_s = \deg(s(x))$, $d_v = \deg(v(x))$ and $d_r = \deg(r(x))$
The following invariants can be verified by induction:
$$p(x) = u(x) s(x) + v(x) r(x)$$
$$0 \le d_u \le n$$
$$0 \le d_s \le n$$
$$d_v \le d_a < n$$
Therefore, we have $n = \max(d_u + d_s, d_v + d_r)$ and
$d_u + d_s \ne d_v + d_r$. Just before the last iteration $u(x) = v(x) = 1$ and
$p(x) = s(x) + r(x)$; hence either $d_s = n$ or $d_r = n$. If $d_r = n$
then $d_s \le n$ after the last iteration is performed, while
$d_s \le n + 1$ after the last iteration when $d_s = n$ beforehand. □
Theorem 5: If p(x) is an irreducible polynomial, and
$\deg(p(x)) > \deg(a(x)) > 0$, then we can find the lower and
upper boundary for the number of iterations, k, in the
first phase of Montgomery inversion algorithm as below:
$$n + 1 \le k \le \deg(a(x)) + n + 1$$
where n is the degree of the irreducible polynomial p(x).
Proof: Each iteration decrements either the degree of u(x)
or of v(x) by at least one. Initially, $\deg(u(x)) = \deg(p(x)) = n$
and $\deg(v(x)) = \deg(a(x))$. In the iteration before the very
last one, we know that $u(x) = v(x) = 1$. Therefore, it can
easily be shown that $k - 1 \le \deg(a(x)) + n$. Similarly, each
iteration decrements $\max\{\deg(u(x)), \deg(v(x))\}$ by at most
one, hence, $k - 1 \ge n$. Consequently, we have
$$n \le k - 1 \le \deg(a(x)) + n$$
The lower bound is achieved when $a(x) = 1$. □
Theorem 6: If p(x) is an irreducible polynomial, and
$\deg(p(x)) > \deg(a(x)) > 0$, then Phase I of the Montgomery
inversion algorithm for GF(2^n) returns $a(x)^{-1} x^{k} \pmod{p(x)}$.
Proof: Similarly to the proof in [5], the following
invariants can be verified by induction:
$$a(x) r(x) \equiv u(x) x^{k} \pmod{p(x)}$$
$$a(x) s(x) \equiv v(x) x^{k} \pmod{p(x)}$$
Note that there is no negative sign in the first congruence,
which is natural since addition and subtraction are the same
operation in GF(2^n). By Theorem 4, $\deg(s(x)) \le n + 1$; it
takes at most two additions by p(x) to reduce the degree of
s(x) if it is not already reduced. By Theorem 5, $k \ge n + 1$, so
there are additional reduction steps to calculate the desired
inverse of a(x). □
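As a worked instance of the induction step, consider Step 6 of Algorithm B. The primed symbols denoting the values after the step are our notation, not the paper's. Assuming both congruences hold before the step, they hold again with k replaced by k + 1:

```latex
% Step 6: u' = (u + v)/x, \quad r' = r + s, \quad s' = x s, \quad k' = k + 1
\begin{align*}
a(x)\, r'(x) &= a(x) r(x) + a(x) s(x)
              \equiv \bigl(u(x) + v(x)\bigr) x^{k}
              = u'(x)\, x^{k+1} \pmod{p(x)} \\
a(x)\, s'(x) &= x \, a(x) s(x)
              \equiv v(x)\, x^{k+1} \pmod{p(x)}
\end{align*}
```

The other three steps can be checked in the same way, since each either shifts one of u(x), v(x) or replaces it by the shifted sum.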
Additions and subtractions in the original algorithm are
replaced with additions without carry in the GF(2^n) version
of the algorithm. Since it is possible to perform addition (and
subtraction) with carry and addition without carry in a single
arithmetic unit, this difference does not cause a change in the
control unit of a possible unified hardware implementation.
Step 6 of the proposed algorithm (where the degrees of u(x)
and v(x) are compared) is different from that of the original
algorithm. This necessitates a significant change to the
control circuitry. To circumvent this problem we propose a
slight modification in the original algorithm for GF(p).
4 New variant of Montgomery inversion
algorithm for GF(p)
Let
p ¼ pn12n1 þ pn22n2 þ . . .þ p12þ p0
be a prime number that is used to construct the prime field
GF( p). An element of GF( p), a< p; can be represented as
a ¼ an12n1 þ an22n2 þ . . . a12þ a0
whose coefficients ai are from {0, 1}. Then arithmetic on the
elements in GF( p) is modulo p arithmetic. The bitsize(a) is
defined as the number of bits in the binary representation of
a. For example, the bitsize(11), where 11 ¼ ð00001011Þ2
and 11 2 GFð251Þ; is 4. An extended representation for an
integer a,
a ¼ . . .þ anþ12nþ1 þ an2n þ an12n1 þ an22n2
þ . . . a12þ a0
is used when a also takes negative values. In case a is a
negative integer excess coefficients (i.e. . . . anþ1; an) are
equal to 1 when two’s complement representation is used
for negative numbers. These coefficients are 0 when a is a
positive number.
Before describing the new inversion algorithm, we first
point out an important difference from the original
Montgomery inversion algorithm. In Step 6 of the original
Montgomery inversion algorithm two integers, u and v, are
compared. Depending on the result of the comparison it is
decided whether Step 6 or Step 7 is to be executed. We
propose to modify Step 6 of the algorithm in a way that,
instead of comparing u and v, the numbers of bits needed to
represent them are compared. As a result of this imperfect
comparison, u may become a negative integer. The fact that
u might be a negative integer may lead to problems in
comparisons in subsequent iterations, therefore u must be
made positive again. To do that, it is sufficient to negate r.
The proposed modifications can be seen in the modified
algorithm given below. Note that Algorithm C is in fact a
unified algorithm and it is reduced to Algorithm B provided
that all addition and subtraction operations in GF(p)-mode
are mapped to GF(2^n) addition in GF(2^n)-mode. The
variable FSEL is used to switch between GF(p) and GF(2^n)
modes.
Algorithm C
Phase I
Input: a ∈ [1, p − 1] and p
Output: s ∈ [1, p − 1] and k, where s = a^{-1} 2^k (mod p) and
n ≤ k ≤ 2n
1: u := p, v := a, r := 0, and s := 1
2: k := 0 and FSEL := 0 // FSEL := 1 in GF(2^n)-mode
3: if u is positive then
4:    if (bitsize(u) = 0) then go to Step 15
5:    if u is even then u := u/2, s := 2s
6:    else if v is even then v := v/2, r := 2r
7:    else if bitsize(u) ≥ bitsize(v) then u := (u − v)/2,
         r := r + s, s := 2s
8:    else v := (v − u)/2, s := s + r, r := 2r
9:    Update bitsize(u), bitsize(v) and sign of u
10: else (i.e. u is negative)
11:    if u is even then u := −u/2, s := 2s, r := −r
12:    else v := (v + u)/2, u := −u, s := s − r, r := −2r
13: k := k + 1
14: Go to Step 3
15: if s_{n+2} = 1 (i.e. s is negative)
16:    u := s + p
17:    v := s + 2p
18:    if u_{n+2} = 1 then s := v
19:    else s := u
20: u := s − p
21: v := s − 2p
22: if v_{n+1} = 0 then s := v
22-a: if s_n = 1 and FSEL = 1 then s := s − p
23: else if u_n = 0 then s := u
24: else s := s
25: return s and k
Phase II
Input: s ∈ [1, p − 1], p, and k from Phase I
Output: b ∈ [1, p − 1], where b = a^{-1} 2^{2n} (mod p)
26: for i = 1 to 2n − k do
27:    u := 2s − s_{n−1} p
28:    v := 2s − (1 + s_{n−1}) p
29:    if v_n = 1 then s := u // i.e. if v < 0 in GF(p)-mode
30:    else s := v
31: return b := s
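The GF(p)-mode behaviour of Algorithm C can be checked with a short software model. The sketch below is ours: it uses Python's signed integers instead of the two's complement registers and sign bits of the listing, omits FSEL and Step 22-a (GF(2^n)-mode), takes bitsize as `int.bit_length()`, and assumes, as stated in the text, that s lies in [−2p, 2p] before the final reduction. The function names are hypothetical.

```python
def algorithm_c_phase1(a, p):
    """GF(p)-mode model of Phase I: returns (s, k), s congruent to a^-1 * 2^k."""
    u, v, r, s = p, a, 0, 1
    k = 0
    while True:
        if u >= 0:
            if u.bit_length() == 0:                  # Step 4: u has reached 0
                break
            if u % 2 == 0:                           # Step 5
                u, s = u // 2, 2 * s
            elif v % 2 == 0:                         # Step 6
                v, r = v // 2, 2 * r
            elif u.bit_length() >= v.bit_length():   # Step 7: bitsize comparison
                u, r, s = (u - v) // 2, r + s, 2 * s
            else:                                    # Step 8
                v, s, r = (v - u) // 2, s + r, 2 * r
        else:                                        # u is negative
            if u % 2 == 0:                           # Step 11
                u, s, r = (-u) // 2, 2 * s, -r
            else:                                    # Step 12
                v, u, s, r = (v + u) // 2, -u, s - r, -2 * r
        k += 1
    if s < 0:                                        # Steps 15-19
        t = s + p
        s = s + 2 * p if t < 0 else t
    if s - 2 * p >= 0:                               # Steps 20-24
        s -= 2 * p
    elif s - p >= 0:
        s -= p
    return s, k


def algorithm_c_phase2(s, k, p):
    """Phase II: 2n - k modular doublings give a^-1 * 2^(2n) (mod p)."""
    n = p.bit_length()
    for _ in range(2 * n - k):
        top = (s >> (n - 1)) & 1                     # s_{n-1}
        u = 2 * s - top * p                          # Step 27
        v = 2 * s - (1 + top) * p                    # Step 28
        s = u if v < 0 else v                        # Steps 29-30
    return s


if __name__ == "__main__":
    s, k = algorithm_c_phase1(7, 17)   # u becomes negative in the second iteration
    print(s, k, algorithm_c_phase2(s, k, 17))
```

With p = 17 and a = 7, the second iteration executes Step 7 with u = 5 and v = 7, so u becomes −1 and Step 12 is taken next; a hand trace ends with s = −10, which the final reduction maps to 7, and k = 5.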
Changing the sign of both u and r simultaneously has the
effect of multiplying both sides of the invariant $p = us + vr$
by $-1$. Therefore, a new invariant when $r < 0$ is given as
$$-p = us + vr$$
While u and v remain positive integers, s and r might
be positive or negative. Therefore, we need to alter the final
reduction steps to bring s into the correct range, which is $[0, p)$.
The range of s is $[-2p, 2p]$. As a result we need to use two
more bits to represent s and r than the bitsize of the modulus.
u becomes negative as a result of $u = (u - v)/2$, when
bitsize(u) = bitsize(v) and $v > u$ before the operation. Since
$u = (u - v)/2$ decreases the bitsize of the absolute value of
u at least by one independently of whether the result is
negative or positive, u will certainly become less than v after
the negation operation. Therefore, if a negative u is
encountered during the operation only steps 11 and 12 are
executed.
Note that the variable FSEL is not needed for GF(p)-mode
computations. Further, in GF(p)-mode, FSEL = 0 and
Step 22-a is never executed. This step becomes relevant in
GF(2^n)-mode when FSEL = 1. Similarly, steps 27–30 in
GF(2^n)-mode become:
27: u(x) := x s(x) + s_{n−1} p(x)
28: v(x) := x s(x) + (1 − s_{n−1}) p(x)
29: if v_n = 1 then s(x) := u(x)
30: else s(x) := v(x).
Those steps are in fact equivalent to Step 13 of Algorithm B
since v_n is always 1 in GF(2^n)-mode.
In the following Section, we discuss the complexity of
Algorithm C and provide statistical figures for the number
of iterations where an iteration consists of operations from
Step 3 to Step 14. The statistical figures are obtained for
GF(p)-mode. It is possible to obtain similar statistics for
GF(2^n)-mode.
5 Complexity analysis of Algorithm C and
multibit shifting
In this Section, we investigate the complexity of Algorithm
C in terms of the number of iterations and present a
technique to improve the complexity utilising a method
called multibit shifting. The multibit shifting was first
introduced in [17]. We adopt the same multibit shifting
approach for Phase I of Algorithm C, while for Phase II we
propose a slightly different version that requires no
precomputed values. Multibit shifting methods used for
Phase I and Phase II of Algorithm C in GF(p)-mode can be
applied in GF(2^n)-mode, and yield an overall comparable
improvement, and thus the details of multibit shifting for
GF(2^n)-mode are deliberately omitted.
A decisive figure in determining the complexity of
inversion algorithm is the number of iterations of the big
loop of Phase I, denoted as k. The iteration number k
determines the number of operations performed such as
addition, subtraction, comparison and shifting [Note 3]. The
number of iterations before termination is unpredictable, but
the algorithm demonstrates a regular and familiar distribution
for k. Thus, we provide the result of statistical
analysis on the number of iterations for both Algorithms
A and C in Fig. 1.
The distribution of k in Fig. 1 is obtained using 100
different 160-bit prime numbers and 100 different inversion
calculations for each prime. Algorithms A and C demonstrate
almost identical statistical behaviour (or more
precisely an unimportant degradation in Algorithm C).
The expected values of k are 226 and 228 for Algorithms
A and C, respectively, which is about 1.4 times the bit
length of the prime modulus p. The latter observation is
confirmed by statistical analysis for GF(p) with larger
moduli. The results for these larger moduli are shown in
Table 1.
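An experiment of this kind can be imitated in a few lines. The sketch below is ours and deliberately simplified: it runs Phase I of Algorithm A only, uses the single fixed prime 2^255 − 19 instead of 100 random primes per size, and draws operands from a seeded generator, so its output is not expected to reproduce the figures of Table 1 exactly. All names are hypothetical.

```python
import random


def phase1_iterations(a, p):
    """Number of iterations k of Phase I of Algorithm A for input a modulo p."""
    u, v, r, s = p, a, 0, 1
    k = 0
    while v > 0:
        if u % 2 == 0:
            u, s = u // 2, 2 * s
        elif v % 2 == 0:
            v, r = v // 2, 2 * r
        elif u > v:
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:
            v, s, r = (v - u) // 2, s + r, 2 * r
        k += 1
    return k


def mean_iterations(p, samples=200, seed=0):
    """Average k over `samples` pseudo-random operands in [1, p - 1]."""
    rng = random.Random(seed)
    ks = [phase1_iterations(rng.randrange(1, p), p) for _ in range(samples)]
    return sum(ks) / len(ks)


if __name__ == "__main__":
    p = 2**255 - 19                       # fixed prime chosen for this sketch
    mean_k = mean_iterations(p)
    print(mean_k, mean_k / p.bit_length())
```

The printed ratio of the mean of k to the bit length of p can be compared with the factor of about 1.4 reported above.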
In the original Montgomery inversion algorithm (i.e.
Algorithm A), the expected behaviour of Phase I is that
steps 4, 5, 6 and 7 are executed almost the same number of
times. Consequently the k loop performs shift operations
half of the time (steps 4 and 5) and addition/subtraction
operations (steps 6 and 7) in the other half. In Table 1, for
instance, for a 160-bit prime, one can conclude that 113
iterations out of 226 in Algorithm A are spent on shift
operations. For future reference, we use the term shift
operations for the former and addition operations for the
latter, respectively. For example, in Algorithm C, steps 5, 6
and 11 perform shift operations, while steps 7, 8 and 12
perform addition operations. The number of addition and
shift operations are different in Algorithm C.
Fig. 1 Distribution of k in Algorithms A and C for 160-bit primes
Table 1: Expected values for k with primes of different sizes
n (bit length of p)   Algorithm A   Algorithm C
160                   226           228
192                   271           274
224                   316           320
256                   360           365
Note 3: Note that multiplication and division by 2 is just one bit shift
operation to the left and right, respectively.
For the binary extension field GF(2^n), where n is chosen
in the same range as in Table 1, the ratio of the expected
number of iterations over n is found to be approximately 1.7 while the
distribution function has exactly the same form and features
a close resemblance to that of the so-called almost inverse
algorithm in [6]. We must note that the higher expected
number of iterations increases the complexity of Phase I of
both algorithms. However, it decreases the number of
iterations in Phase II. Therefore, a larger k does not
necessarily mean a higher complexity in our setting when
both phases are taken into consideration. In the following
Section, we show how the number of iterations is reduced
using a multibit shifting method.
5.1 Multibit shifting for Phase I
Multibit shifting requires that the three least significant bits
of u or v be inspected instead of just making an evenness
check. At times it is possible to shift the integer more than one
bit to the right. For instance, when the least significant three
bits of u are all zero (i.e. $u_2 u_1 u_0 = 000$) it is possible to shift u
to the right by three bits. While this check increases the
hardware negligibly, the net effect of this technique on the
complexity is not obvious and requires statistical analysis.
Our experiment with the precision range given in Table 1 (i.e.
[160, 256]) shows that there is about 42% [Note 4] reduction
in the number of shift operations when three-bit shifting is
applied. It has also been observed that three-bit shifting
provides the optimum performance improvement. The exact
number of shift operations when multibit shifting is applied is
tabulated at the end of this Section along with the overall
effect of the multibit shifting on the complexity.
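The effect of inspecting three low-order bits can be illustrated on Phase I of Algorithm A, which keeps the sketch short. The code below is ours, not the scheme of [17]: it assumes that when u or v is shifted right by t bits, the partner variable (s or r) is shifted left by t bits and k advances by t, so that the result and k are unchanged and only the number of stand-alone shift operations drops. The names and the counters are hypothetical.

```python
def trailing_shift(x, max_shift=3):
    """Number of low-order zero bits of non-zero x, capped at max_shift."""
    t = 0
    while t < max_shift and (x >> t) & 1 == 0:
        t += 1
    return t


def phase1_counting(a, p, max_shift=1):
    """Phase I of Algorithm A; returns (r, k, shift_ops, add_ops)."""
    u, v, r, s = p, a, 0, 1
    k = shift_ops = add_ops = 0
    while v > 0:
        if u % 2 == 0:                      # stand-alone shift of u by t bits
            t = trailing_shift(u, max_shift)
            u, s = u >> t, s << t
            k += t
            shift_ops += 1
        elif v % 2 == 0:                    # stand-alone shift of v by t bits
            t = trailing_shift(v, max_shift)
            v, r = v >> t, r << t
            k += t
            shift_ops += 1
        elif u > v:                         # subtraction with built-in shift
            u, r, s = (u - v) // 2, r + s, 2 * s
            k += 1
            add_ops += 1
        else:
            v, s, r = (v - u) // 2, s + r, 2 * r
            k += 1
            add_ops += 1
    if r >= p:
        r -= p
    return p - r, k, shift_ops, add_ops


if __name__ == "__main__":
    p = 2**127 - 1
    a = 0x123456789ABCDEF0123456789ABCDEF
    print(phase1_counting(a, p, max_shift=1)[2:],
          phase1_counting(a, p, max_shift=3)[2:])
```

Calling `phase1_counting` with `max_shift=1` and `max_shift=3` on the same input returns the same r, k and addition count, so the two shift counts can be compared directly.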
5.2 Multibit shifting for Phase II
The loop in Phase II of Algorithm C executes $2n - k$ times
after k is determined as the output of Phase I. Each iteration
involves a subtraction operation and left shift operation by
one bit. As in the case of Phase I shifting, it is also possible
to shift s more than one bit in an iteration depending on the
most significant bits of s. For instance, it is possible to shift s
to the left by three bits when $s_{n-1} s_{n-2} s_{n-3} = 000$. The result
may be greater than p, and hence it must be brought back to
the range of $[1, p - 1]$. The algorithm for Phase II of
Algorithm C with three-bit shifting, which provides the
maximum reduction in the number of iterations with an
insignificant cost in hardware, is given below.
Phase II with three-bit shifting
Input: s ∈ [1, p − 1], p, and k from Phase I
Output: b ∈ [1, p − 1], where b = a^{-1} 2^{2n} (mod p)
26: i := 1
27: t := s_{n−1} s_{n−2} s_{n−3}
28: if i < 2n − k then
29:    if t = 0 then
29-a:     u := 8s
29-b:     v := 8s − p
29-c:     i := i + 3
30:    else if t = 1 then
30-a:     u := 4s
30-b:     v := 4s − p
30-c:     i := i + 2
31:    else if t = 2 OR t = 3 then
31-a:     u := 2s
31-b:     v := 2s − p
31-c:     i := i + 1
32:    else
32-a:     u := 2s
32-b:     v := 2(s − p)
32-c:     i := i + 1
33:    if (v < 0) then s := u
34:    else s := v
35: return b := s
Note that since $2n - k$ is not always a multiple of 3, a slight
modification to the above algorithm is necessary. The
algorithm given above must be adapted for the unified case.
The sign check on v can be done in the same manner by
checking whether $v_n$ is 0 or 1 since $v_n = 1$ implies that v is
negative. In our experiments, three-bit shifting for Phase II
was found to be the optimum choice, providing an
approximately 30% reduction in the number of iterations.
5.3 Overall effect of multibit shifting on
number of iterations
We conducted some experiments in which we used 100
different prime numbers of varying precisions, namely 160,
192, 224 and 256 bits. For every prime number, we
computed the Montgomery inverse of 100 different integers,
totalling 10 000 inverse computations. The results are
shown in Table 2.
In Table 2, the total number of iterations is defined as:
number of addition operations in Phase I + number of shift
operations in Phase I + number of iterations in Phase II. We
have mentioned previously that half of the values of
Algorithm A in Table 1 are addition operations and the
other half shift operations. Reduction in total number of
iterations, as reported in the last column of Table 2, is
percentage decrease in total number of iterations for
Algorithm C compared to Algorithm A.
Table 2: Reduction in number of shifts and total number of iterations
          No. of shifts in Phase I          No. of iterations in Phase II     Reduction in total
n, bits   Alg. A   Alg. C   Reduction, %    Alg. A   Alg. C   Reduction, %    no. of iterations, %
160       113      66       42              92       67       27              22
192       135      78       42              111      78       30              23
224       158      91       42              128      91       29              23
256       180      104      42              147      103      30              23
Note 4: The reduction in percentage is obtained by subtracting the number
of shift operations after multibit shifting is applied from the number of shift
operations without multibit shifting and then dividing it by the latter.
6 Comparison with other inversion algorithms
The paucity of hardware architectures implementing inversion
algorithms hinders a fair comparison of different
inversion algorithms from the perspective of hardware
implementation. Many inversion algorithms are designed to
be implemented in software on general-purpose processors.
It is difficult to predict how efficiently hardware implementations
of these algorithms will perform. In this Section, we
try to address this issue and determine the factors that are
decisive in the performance of a hardware implementation.
Finally, using these factors we compare Algorithm C
against three other inversion algorithms.
As mentioned before, many inversion algorithms are
binary variants of an extended Euclidean algorithm.
Typically, these algorithms consist of a main loop executing
a number of times. The number of iterations before
termination demonstrates a familiar distribution whose
statistics are shown in Fig. 1. Each iteration incorporates
different combinations of operations such as addition
operations, shift operations, addition operations followed
by shift operations, etc. Which combination is executed in a
particular iteration is determined using certain conditional
check operations such as parity check (e.g. Steps 4 and 5 of
Algorithm A), integer or degree comparison (e.g. Step 6 of
Algorithm B), etc. The expected number of these operations
and/or certain combinations of these operations determines
the complexity of the algorithm from the perspective of
hardware implementation.
To compare different inversion algorithms, we classified
the operations as follows: (i) standalone shift operations that
cannot be executed along with an addition or subtraction
operation simultaneously; and (ii) addition operations that
are basically addition or subtraction operations. For
example, Steps 5 or 6 of Algorithm C are standalone shift
operations, while $(u - v)/2$ in Step 7 of the same algorithm
is considered as an addition operation. Although the latter
also has a shift following the subtraction, it is considered as
an addition operation since this shift can be incorporated
into an adder while designing the hardware. Also assuming
that we can employ as many adders or shifters as we need,
we consider operations that can be executed simultaneously
by different units working in parallel, as only one operation.
In case addition and shift operations are executed in the
same iteration in parallel, we count it as a single addition
operation [Note 5]. For example, Step 7 of Algorithm C is
counted as a single addition operation.
Under these assumptions, we compared four algorithms:
Algorithm A in [5], Algorithm C proposed in this paper, the
inversion algorithms proposed by Brown et al. in [14] and
by Lorenz in [16]. Algorithms A and C are proposed for
Montgomery arithmetic, and Algorithm C is basically a
variant of Algorithm A optimised for hardware implementation.
Note that Phase II of Algorithm A is modified
according to the new definition of Montgomery inverse. On
the other hand, algorithms in [14, 16] do not use
Montgomery arithmetic. Excluding the conditional checks
such as parity check or integer comparison, we counted the
number of standalone shift operations and additions by
running these four algorithms 10 000 times (with 100
different integers whose inverses are to be calculated for 100
different primes). In Fig. 2, the statistics obtained from this
experiment are displayed. As can be observed in Fig. 2, the
number of shift operations in Algorithm C is much smaller
than in the other three algorithms owing to the multibit
shifting technique. Lorenz’s algorithm in [16] has the fewest
addition operations. In terms of total number of operations,
Algorithm C compares favourably with others.
As shown in the following Section, where the details of
our hardware architecture are discussed, shift and addition
operations are of equal cost in terms of number of clock
cycles because of the scalable nature of the architecture.
Therefore, Algorithm C, which has the fewest expected
number of operations in total, is the best choice for our
design. For other architectures taking different approaches
the best choice may be different. For example, Lorenz’s
algorithm in [16] may perform better where the shift
operations are much less expensive than additions.
Fig. 2 Comparison of four algorithms in terms of number of operations
Note 5: Note that we assume that adders and shifters are inexpensive and
that including up to four adders or shifters brings about no significant
overhead in design area. In our scalable design, these units are inexpensive
compared to memory and control logic. With other design approaches such
as ones requiring very long full-precision adders, this assumption does not
hold.
7 Hardware architecture
Algorithm C presented in Section 4, which is reduced to
Algorithm B in GF(2^n)-mode, is suitable for scalable and
unified design. Before discussing the details of the design
issues involved we give definitions of scalable and unified
architectures for arithmetic operations.
Scalability: An arithmetic unit is called scalable if it can be
reused or replicated in order to generate long-precision
results independent of the data path precision for which the
unit was originally designed.
Scalability of the arithmetic modules is important in the
cryptographic context since it allows an increase in the key
length when the need for more security arises without
having to modify or re-design the cryptographic unit. The
scalability of the inverter unit can easily be achieved by
using shifter and adder units, which handle only certain
numbers of bits at a time. One addition (or shift) operation,
therefore, in the corresponding field takes more than one
clock cycle since the operands can be represented in
multiprecision format. The number of bits that the unit
operates on is referred to as word and the length of the word
is yet to be determined. The word length cannot be too small
since this increases the clock cycle count, leading to a slow
execution of the whole operation. On the other hand, using
long word units leads to other complications such as longer
latency and higher gate count during hardware realisation.
16-, 32- and 64-bit block adders and shifters provide
reasonable latency and gate counts. The length of a word
can be determined or adjusted with respect to given area,
speed or latency requirements.
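The multi-word representation described above can be illustrated with a minimal Python sketch. The helper names `to_words` and `from_words` are hypothetical and not part of the design; the sketch only assumes that an n-bit operand is stored as $e = \lceil n/w \rceil$ words of w bits, least significant word first.

```python
def to_words(x, n, w):
    """Split an n-bit non-negative integer into ceil(n / w) words of w bits.

    Words are returned least significant first (an assumption of this sketch).
    """
    e = -(-n // w)                 # ceiling division: number of words
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(e)]


def from_words(words, w):
    """Reassemble the integer from its words (inverse of to_words)."""
    return sum(word << (w * i) for i, word in enumerate(words))


x = (1 << 159) | 0xDEADBEEF
words = to_words(x, 160, 32)       # a 160-bit operand with 32-bit words
print(len(words), from_words(words, 32) == x)
```

Changing w in this sketch changes only the number of words per operand, which is the sense in which the same word-level unit can serve longer precisions.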
Unified architecture: Even though prime and binary
extension fields, $GF(p)$ and $GF(2^n)$, respectively, have
dissimilar properties, the elements of both fields are
represented using almost the same data structures in digital
systems. In addition, the algorithms for basic arithmetic
operations in both fields have structural similarities, allowing
a unified module design methodology. Therefore, a
scalable arithmetic module which is versatile in the sense
that it can be adjusted to operate in both fields is feasible,
provided that this extra functionality does not lead to an
excessive increase in area and dramatic decrease in speed.
In addition, designing such a module must require only a
small amount of extra effort and no major modification in
control logic of the circuit.
Algorithm C can be implemented in a unified hardware
architecture provided that a dual-field adder/subtracter
(DFA/S) that operates in both fields is available. For the
inverter unit to be scalable, the DFA/S is designed to
operate on one word per clock cycle rather than on the whole
operand. Hence, it is referred to as a word-DFA/S (or
WDFA/S). In addition to the WDFA/S, a bidirectional shifter,
which is able to shift one word to the left or right in a clock
cycle, is needed to perform the shift operations.
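As an illustration of what a word-level dual-field adder/subtracter does, the following Python sketch models one word step and chains it across an operand. It is a behavioural sketch under stated assumptions, not the circuit of the paper: the names `wdfas_step` and `dual_field_add` are hypothetical, and it assumes the usual convention that addition and subtraction in $GF(2^n)$ are both a carry-free XOR, while $GF(p)$ mode uses ordinary integer addition or subtraction with a carry or borrow passed between words.

```python
def wdfas_step(a, b, carry_in, subtract, gf2_mode, w=32):
    """One clock cycle of a hypothetical word-level dual-field adder/subtracter.

    Returns (result_word, carry_out). In GF(2^n) mode addition and
    subtraction are both a carry-free XOR of the coefficient words.
    """
    mask = (1 << w) - 1
    if gf2_mode:
        return (a ^ b) & mask, 0
    if subtract:
        t = a - b - carry_in          # carry_in acts as a borrow here
        return t & mask, 1 if t < 0 else 0
    t = a + b + carry_in
    return t & mask, t >> w


def dual_field_add(a_words, b_words, subtract=False, gf2_mode=False, w=32):
    """Chain wdfas_step over the words, least significant word first."""
    carry, out = 0, []
    for a, b in zip(a_words, b_words):
        word, carry = wdfas_step(a, b, carry, subtract, gf2_mode, w)
        out.append(word)
    return out, carry


print(dual_field_add([0xFFFFFFFF, 0], [1, 0]))                 # integer mode
print(dual_field_add([0xFFFFFFFF, 0], [1, 0], gf2_mode=True))  # XOR mode
```

The only difference between the two modes in this model is whether the carry chain is used, which is why a single unit can plausibly serve both fields.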
Since there are no negative numbers in $GF(2^n)$, Steps
10–12 and 15–19 are never executed in $GF(2^n)$-mode.
Algorithm C, when executed in $GF(p)$-mode, replaces the
integer comparison operation of the original algorithm with
just one bitsize comparison. In exchange, u may take
negative integer values because this comparison is inexact.
Therefore, the variables u and r may have to change
sign if the subtraction operation in Step 7 produces a
negative result. Performing these two negation operations
immediately after Step 7 would necessitate extra clock
cycles which may be more expensive than the integer
comparison operation. Therefore, these two negation
operations (i.e. taking the two’s complements of u and r) are
performed in the next iteration of Algorithm C (Step 11 or
12). To perform the negation operations concurrently with
the two addition/subtraction operations in the next iteration,
we need two negators besides the two WDFA/Ss; each negator can be
implemented as a relatively simple word adder.
By having two extra negators, which may have a limited
impact on the area as demonstrated by our implementation
results, no extra clock cycles are spent on comparison
operations. In a similar implementation presented in [17],
the comparison is done utilizing a full subtraction. In [17],
to avoid extra clock cycles due to this subtraction operation
three extra full-length registers are employed, leading to a
substantial increase in chip area. In the implementation in
[17], both $(u - v)$ and $(v - u)$ are computed and the
negative result is discarded. For example, if
$(u - v)$ is negative then $u < v$; hence the operation that must
be performed is actually $(v - u)/2$. However, some extra
clock cycles are still needed to perform two subsequent shift
operations. On the other hand, our proposed algorithm
eliminates the need for these extra clock cycles as well as
the three extra full-precision registers. The only extra cost for
the comparison is two negators. Therefore, our implementation
calculates inversion faster and requires less chip area
than the one in [17].
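To make concrete which comparison is being discussed, the following is a minimal Python sketch of Phase I of the classic Montgomery inverse in the form given by Kaliski [5]. It is not Algorithm C: the test `u > v` below is the full integer comparison that Algorithm C replaces with a bitsize comparison, and the word-serial operation and deferred negation described above are deliberately left out. The function name and the use of plain Python integers are assumptions made for illustration.

```python
def almost_montgomery_inverse(a, p):
    """Phase I of the classic Montgomery inverse (Kaliski's form).

    For an odd prime p and 0 < a < p, returns (r, k) such that
    r * a is congruent to 2**k modulo p.
    """
    u, v, r, s, k = p, a, 0, 1, 0
    while v > 0:
        if u % 2 == 0:
            u, s = u // 2, 2 * s
        elif v % 2 == 0:
            v, r = v // 2, 2 * r
        elif u > v:                      # full integer comparison
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:
            v, s, r = (v - u) // 2, r + s, 2 * r
        k += 1
    if r >= p:
        r -= p
    return p - r, k


r, k = almost_montgomery_inverse(3, 7)
print(r, k, (r * 3) % 7 == pow(2, k, 7))
```

In hardware, the `u > v` test amounts to a full-length subtraction, which is the cost that the bitsize comparison and the two negators are meant to avoid.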
The sign of the integer u is positive initially. The
operation $u = (u - v)/2$ is completed in at most $e = \lceil n/w \rceil$
clock cycles. In the $(e + 1)$st cycle, all control information
(the sign of u and the bitsizes of u and v) becomes available, and
hence the next iteration can start in the $(e + 2)$nd clock cycle.
This means that one iteration takes $(e + 1)$ clock
cycles to finish. Note that the operation $r + s$ may need the
$(e + 1)$st clock cycle to complete since s may become two bits
longer than the modulus.
The WDFA/S, performing the operation $(u - v)/2$ (or
$(v - u)/2$) on a word-by-word basis, also calculates the
bitsize of the result in the same manner. When the operation
$(u - v)/2$ starts, the bitsize of the result is initially set to
zero. When the first word of the result is generated, its bitsize
is stored in a temporary register without disturbing the
bitsize of the result on which subsequent word computations
depend. After e clock cycles the bitsize of the result is fully
calculated, and hence it is updated from the temporary
register. In the $(e + 1)$st cycle after the operation $(u - v)/2$
started, this information is used to determine how to proceed
with the calculations. In the same clock cycle, the bitsize of
the result is also checked to determine whether it is zero. A
result with zero bitsize indicates the termination of the
computations (Step 4 of Algorithm C). Note that the bitsize
of a negative u is meaningless and the algorithm terminates
after u becomes 1. Therefore, a negative result only
indicates that the computations must carry on.
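The word-by-word bitsize calculation can be sketched in Python as follows. This is an illustrative model only: the function name `word_serial_bitsize` and the variable `bitsize_tmp` are hypothetical, and the sketch assumes a non-negative result whose words arrive least significant first, one per step.

```python
def word_serial_bitsize(words, w=32):
    """Accumulate the bitsize of a multi-word non-negative value word by word."""
    bitsize_tmp = 0                      # temporary register, starts at zero
    for i, word in enumerate(words):     # one word per clock cycle
        if word != 0:
            # a non-zero word at position i overrides any earlier estimate
            bitsize_tmp = i * w + word.bit_length()
    return bitsize_tmp                   # committed after the last word


x = (1 << 130) | 5
words = [(x >> (32 * i)) & 0xFFFFFFFF for i in range(5)]
print(word_serial_bitsize(words), x.bit_length())
```

Because the estimate is only final after the last word, the sketch returns it once at the end, mirroring the update from the temporary register after e cycles; a return value of zero corresponds to the termination check.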
In Table 3, an example for the calculation of $u := (u - v)/2$
is given for 160-bit operands. The operands u and
v can be expressed as five-word multiprecision numbers,
where each word is 32 bits (e.g. $u = (u^{(4)} u^{(3)} u^{(2)} u^{(1)} u^{(0)})$).
As can be observed in Steps 5, 6, 7, 8, 11 and 12 of Algorithm C,
one of six different operations may update the values of
u, v, r and s in each clock cycle. These operations are
distinguished by an operation code (opcode for short), and
the opcode must be determined in the clock cycle preceding
the start of an operation. In clock cycle 1, the operation
$(u^{(0)} - v^{(0)})/2$ is performed, assuming that the opcode is
known. The result of this operation is the least significant
31 bits of $(u^{(0)} - v^{(0)})/2$; the most significant bit is yet to
be computed in the next clock cycle. Thus, the result is not
considered to be ready, and it is written to the register in the
next cycle. In other words, the result of $(u^{(0)} - v^{(0)})/2$
is available in the register file in clock cycle 2. This
calculation continues until all the words of u and v are
exhausted in clock cycle 5. In the following cycle (i.e. clock
cycle 6), the complete result of the $(u - v)/2$ operation
becomes available in the register and the opcode is
calculated for the next iteration.
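The schedule described above can be mimicked in software. The Python sketch below is a behavioural model under stated assumptions, not the authors' circuit: the function name `sub_half_word_serial` is hypothetical, operands are taken to be non-negative integers with an even difference, and a negative result is returned in e-word two's-complement form. In each step it produces the low w − 1 bits of the current result word, and uses bit 0 of the current difference word to complete the previous result word, which is why each word becomes available one cycle late.

```python
def sub_half_word_serial(u, v, w=32, e=5):
    """Compute (u - v) / 2 word by word, one word per clock cycle.

    u and v are non-negative integers below 2**(w*e) with u - v even.
    The result is returned as an e-word two's-complement value.
    """
    mask = (1 << w) - 1
    borrow = 0
    low_bits = None          # low w-1 bits of the previous result word
    result = 0
    for i in range(e):       # clock cycles 1 .. e
        d = ((u >> (w * i)) & mask) - ((v >> (w * i)) & mask) - borrow
        borrow = 1 if d < 0 else 0
        d &= mask
        if low_bits is not None:
            # bit 0 of this difference word completes the previous word
            result |= (low_bits | ((d & 1) << (w - 1))) << (w * (i - 1))
        low_bits = d >> 1
    # cycle e+1: the final borrow is the sign and fills the top bit
    result |= (low_bits | (borrow << (w - 1))) << (w * (e - 1))
    return result


u = (1 << 159) + 12345
v = (1 << 100) + 77
print(sub_half_word_serial(u, v) == (u - v) // 2)
```

With five 32-bit words the loop runs for five steps and the last word is completed in a sixth, matching the six clock cycles of the example.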
In Fig. 3, the block diagram for Phase I of the inversion
module is illustrated. Adder Building Block contains two
WDFA=Ss and two negators, while Register Block contains
registers for four intermediate variables, u, v, r and s. Main
Control Block is responsible for generating the opcode and
selecting which part of the register block is used and
updated in each clock cycle. The information the control
block uses to determine the opcode, such as the bitsizes and
parities of u and v and the sign of u, is kept in the block named
Registers/Flags. Before an iteration starts, the control block
updates some of these values and subsequently uses them
for generating the opcode for the next iteration. Note that the
opcode remains unchanged for e clock cycles after it is set,
where e is the number of words in the operands. To keep
track of this, a counter is introduced in the design
(counter(m)). When u becomes 0, the control block activates
the done signal, indicating that the computation is
terminated and the results are available in register s and at
the output kout of the control block.
tmp reg is used to store words of operands during shift
operations, while the carry register holds the carry resulting
from the addition or subtraction of two words. The bitsize tmp register
keeps the intermediate values of the bitsizes of u and v, while the
current values are kept in Registers/Flags. These new values
are used to update Registers/Flags in the last clock cycle of
the current iteration. The main control block takes a word of
u, v, r and s from the register block each clock cycle and
supplies them to the adder block. Each of the four adders
produces its result on one of the outputs $z_u$, $z_v$, $z_r$ and $z_s$. The control
block decides how to update the register block with these
values.

Table 3: Execution steps of $(u - v)/2$ with operands of five words, where the word length is 32

| Clock cycle | Operation | Output of adder | Result available in register file |
|---|---|---|---|
| 1 | $(u^{(0)} - v^{(0)})/2$ | $(u^{(0)} - v^{(0)})/2\,[30:0]$ | – |
| 2 | $(u^{(1)} - v^{(1)})/2$ | $(u^{(1)} - v^{(1)})/2\,[30:0] \mid (u^{(0)} - v^{(0)})/2\,[31]$ | $(u^{(0)} - v^{(0)})/2\,[31:0]$ |
| 3 | $(u^{(2)} - v^{(2)})/2$ | $(u^{(2)} - v^{(2)})/2\,[30:0] \mid (u^{(1)} - v^{(1)})/2\,[31]$ | $(u^{(1)} - v^{(1)})/2\,[31:0]$ |
| 4 | $(u^{(3)} - v^{(3)})/2$ | $(u^{(3)} - v^{(3)})/2\,[30:0] \mid (u^{(2)} - v^{(2)})/2\,[31]$ | $(u^{(2)} - v^{(2)})/2\,[31:0]$ |
| 5 | $(u^{(4)} - v^{(4)})/2$ | $(u^{(4)} - v^{(4)})/2\,[30:0] \mid (u^{(3)} - v^{(3)})/2\,[31]$ | $(u^{(3)} - v^{(3)})/2\,[31:0]$ |
| 6 | opcode for the next iteration is calculated | – | $(u^{(4)} - v^{(4)})/2\,[31:0]$ |

Fig. 3 Block diagram of module implementing Phase I
Note that the module in Fig. 3 implements Steps 1–14 of
Algorithm C (Steps 1–8 of Algorithm B); the rest of
Phase I is implemented along with Phase II using a different
control logic, which is not shown here and is referred to from
now on as the Phase II control logic. The Phase II control logic is
less complicated than the main control block shown in Fig. 3
and uses the same register and adder blocks. We report only
the area requirements of the Phase II control logic in the
following Section since its critical path delay does not
increase the overall critical path.
8 Implementation results
In this Section, we present some implementation results of
the inversion unit capable of operating in both $GF(p)$ and
$GF(2^n)$. We synthesised the unit using Synopsys tools with
0.18 µm CMOS technology [18]. During the synthesis, no
optimisation method was applied; hence the figures presented
here can be further improved using the standard optimisation
options of the Synopsys tools.
Implementation results consist of three categories: (i) area
requirements; (ii) maximum applicable clock frequency
(critical path delay); (iii) number of clock cycles to compute
an inversion operation. Time and area requirements of main
blocks in our design are reported separately. This facilitates
accurate estimation of these requirements for inverters with
different precisions and word lengths, etc. For instance,
doubling the word length immediately results in an almost
two-fold increase in area of adder and control blocks, while
causing a relatively small increase in critical path delay of
these blocks. Similarly, there is a linear relation between the
register block size and the precision of the operands.
As noted before, the adder block should normally contain
two adders and two negators operating on the words of the
operands. In our implementation, however, we use four
WDFA/Ss in the adder block in order to make the design
more regular. Table 4 shows the time and area costs of a single
WDFA/S and the total cost of the adder block for different
word lengths that can be used for cryptographic applications.
Similarly, we give the area cost of the register block for
various operand precisions in Table 5.
We also implemented the control blocks for two word
lengths, $w = 16$ and $w = 32$, and give the results in Table 6.
We did not implement the control blocks for $w = 64$;
however, we estimate a two-fold increase in the area and
an increase of less than 10% in the critical path delay.
Based on the figures in Tables 4–6, the area and time
requirements for a 160-bit inverter with $w = 32$ are reported
in Table 7. Note that these figures are comparable to those in
[17]. We avoid a direct comparison because of the different
technologies used. Such a comparison may also be misleading
since the main focus of the work in [17] is to compare
the scalable implementation against a fixed-precision
implementation.
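How the per-block figures combine into an overall estimate can be sketched as follows. The function names are hypothetical, and the sketch assumes that the block areas simply add up, that the adder block consists of four WDFA/Ss, and that the register area grows linearly with the operand precision at the per-bit cost listed in Table 5. The control block areas are passed in as parameters; the values used in the example are the ones listed in Table 7.

```python
# Area of a single WDFA/S in NAND gates, by word length (Table 4).
WDFAS_AREA = {16: 198, 32: 451, 64: 989}
# Register block area per bit of precision in NAND gates (Table 5).
REGISTER_AREA_PER_BIT = 30


def adder_block_area(w):
    """Four WDFA/Ss are used in the adder block."""
    return 4 * WDFAS_AREA[w]


def register_block_area(n):
    """Register area is assumed linear in the precision n."""
    return REGISTER_AREA_PER_BIT * n


def inverter_area(n, w, main_control, phase2_control):
    """Sum of the block areas, assuming they are simply additive."""
    return (adder_block_area(w) + register_block_area(n)
            + main_control + phase2_control)


# 160-bit inverter with 32-bit words, control areas as listed in Table 7.
print(inverter_area(160, 32, main_control=5867, phase2_control=1970))
```

The same functions can be used for other precisions once the control block areas for the chosen word length are known.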
Besides time and area, the number of clock cycles needed to
complete an inversion operation is also important in
assessing the efficiency of an inversion module. We provide
expected values for the number of clock cycles for certain
precisions based on the figures in Tables 1 and 2. As
pointed out earlier, an iteration of the loops in Phases I and
II takes exactly $e + 1$ clock cycles to finish, where
$e = \lceil n/w \rceil$. In addition to these steps, some intermediate steps
are necessary to bring s into the correct range (i.e. Steps 9 to 14
of Algorithm B and Steps 15 to 25 of Algorithm C). These
steps take an additional $2(e + 1)$ clock cycles. Taking into
account the initialisation steps, which take $e + 1$ clock
cycles, we need to add $3(e + 1)$ extra clock cycles to the clock
cycles spent on the main loops of Phases I and II.
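The cycle accounting above is simple enough to check numerically. The following Python sketch uses hypothetical function names and only the formulas stated in the text: $e = \lceil n/w \rceil$ words per operand, $e + 1$ clock cycles per loop iteration, and $3(e + 1)$ extra clock cycles for initialisation and range correction.

```python
def words_per_operand(n, w):
    """e = ceil(n / w)."""
    return -(-n // w)


def cycles_per_iteration(n, w):
    """One iteration of the Phase I or Phase II loop takes e + 1 cycles."""
    return words_per_operand(n, w) + 1


def extra_cycles(n, w):
    """Initialisation (e + 1) plus range correction 2(e + 1)."""
    return 3 * cycles_per_iteration(n, w)


for n in (160, 192, 224, 256):
    print(n, cycles_per_iteration(n, 32), extra_cycles(n, 32))
```

For $w = 32$ this gives 18, 21, 24 and 27 extra clock cycles for the four precisions, which agrees with the corresponding column of Table 8.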
Table 4: Time and area costs of block adder with $w = 16$, $w = 32$ and $w = 64$

| Word length, w | Propagation time, ns | Area of a single WDFA/S (in NAND gates) | Total area of adder block (in NAND gates) |
|---|---|---|---|
| 16 | 1.65 | 198 | 792 |
| 32 | 1.93 | 451 | 1804 |
| 64 | 2.31 | 989 | 3956 |

Table 5: Area cost of register block with various precisions

| Bitsize | Area (in NAND gates) |
|---|---|
| 1 | 30 |
| 160 | 4800 |
| 192 | 5760 |
| 224 | 6720 |
| 256 | 7680 |

Table 6: Time and area costs of control blocks

| Word length w | Main control: time delay, ns | Main control: area (NAND gates) | Phase II control: time delay, ns | Phase II control: area (NAND gates) |
|---|---|---|---|---|
| 16 | 4.00 | 3474 | 1.87 | 779 |
| 32 | 4.40 | 6369 | 2.01 | 1970 |

Table 7: Area and time requirements for a 160-bit inverter with $w = 32$

| | Adder (32-bit) | Register | Main control | Phase II control | Overall |
|---|---|---|---|---|---|
| Area (in NAND gates) | 1804 | 4800 | 5867 | 1970 | 14441 |
| Time delay, ns | 1.93 | Negligible | 4.34 | 2.01 | 4.34 |

As an example, based on the figures in Tables 1 and 2 and
considering the additional clock cycles mentioned above, we
calculated the expected execution time in terms of the number of
clock cycles for an inversion operation using word length 32.
The results are summarised in Table 8. Table 8 also includes
the clock cycle count estimates for the modular multiplication
operation for the same precisions, which is assumed to
be performed using the unified and scalable Montgomery modular
multiplication unit proposed in [11] with 7 pipeline stages
and a 32-bit word size. The ratio of inversion time to
multiplication time, which is important in the decision about
whether affine or projective co-ordinates are to be employed
in elliptic curve cryptography, is also included in the Table. It
is argued in [19] that for binary extension fields, $GF(2^n)$,
projective co-ordinates, which do not entail fast execution of
the inversion operation, perform better than affine
co-ordinates when the inversion operation is more than 7
times slower than the multiplication operation. Similarly, our
calculations show that this ratio is about 9 for the prime field
$GF(p)$. As can be observed in Table 8, the ratio stays lower
than 7 for the precisions of interest to elliptic curve
cryptography. The ratios achieved here are better than those of the
previous work of [10], which employs similar inversion
algorithms. Note that, unlike the current work, [10] does not
report on the time complexity of Phase II of the algorithm.
The improvement is largely due to the multibit shifting
techniques used in the two phases of the algorithm.
However, one must also consider that the maximum
critical path delay of the inverter can be greater than that of a
multiplier because of the more complicated control logic of
the inverter. Therefore, in case a multiplier that utilises a
faster clock is employed, the ratios in Table 8 are subject to
changes depending on the difference in the clock rates.
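The comparison against the break-even ratios can be reproduced from the clock cycle counts in Table 8. The Python sketch below is illustrative only: the dictionary and function names are hypothetical, the cycle counts are copied from Table 8, the thresholds 7 and 9 are the ones quoted in the text, and equal clock rates for the inverter and the multiplier are assumed, which, as noted above, need not hold in practice.

```python
# n (bits) -> (total inversion clocks, multiplication clocks), from Table 8.
TABLE_8 = {
    160: (1516, 327),
    192: (2091, 398),
    224: (2784, 469),
    256: (3575, 526),
}

# Break-even inversion/multiplication ratios quoted in the text.
THRESHOLD_GF2N = 7
THRESHOLD_GFP = 9


def inversion_ratio(n):
    """Ratio of inversion to multiplication clock cycles at equal clock rate."""
    inversion, multiplication = TABLE_8[n]
    return inversion / multiplication


for n in sorted(TABLE_8):
    ratio = inversion_ratio(n)
    print(n, round(ratio, 2), ratio < THRESHOLD_GF2N, ratio < THRESHOLD_GFP)
```

If the multiplier runs at a faster clock than the inverter, each ratio would have to be scaled by the ratio of the two clock periods before comparing it with the thresholds.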
9 Conclusion
We presented two multiplicative inversion algorithms for
$GF(p)$ and $GF(2^n)$ which are suitable for scalable and
unified hardware implementations. The hardware
implementation of these algorithms is easy, fast and area
efficient. We presented experimental results and estimated
values to show the practicality of the new algorithms. We
also reported on the time and area complexity of the
inversion unit using our implementation results. It turns out
that the two new algorithms presented in this paper can
enable hardware implementations that calculate inversions
at such a speed that the usage of projective co-ordinates in
elliptic curve cryptography no longer offers a significant
advantage over affine co-ordinates. While projective
co-ordinates [20] do not necessitate an inversion unit, they
require considerably more temporary storage space than
affine co-ordinates. Therefore, the extra design space used to
implement an inversion unit may not be significant.
10 References
1 Diffie, W., and Hellman, M.E.: ‘New directions in cryptography’, IEEE
Trans. Inf. Theory, 1976, 22, pp. 644–654
2 National Institute for Standards and Technology: ‘Digital signature
standard (DSS)’, Federal Register, Aug. 1991, Vol. 56, p. 169
3 Koblitz, N.: ‘Elliptic curve cryptosystems’, Math. Comput., 1987, 48,
(177), pp. 203–209
4 Menezes, A.J.: ‘Elliptic curve public key cryptosystems’ (Kluwer
Academic Publishers, Boston, MA, 1993)
5 Kaliski, B.S., Jr.: ‘The Montgomery inverse and its applications’, IEEE
Trans. Comput., 1995, 44, (8), pp. 1064–1065
6 Schroeppel, R., Orman, H., O’Malley, S., and Spatscheck, O.: ‘Fast key
exchange with elliptic curve systems’, in Coppersmith, D. (Ed.):
‘Advances in cryptography – CRYPTO 95’, Lect. Notes Comput. Sci.,
No. 973, 1995, pp. 43–56
7 Kobayashi, T., and Morita, H.: ‘Fast modular inversion algorithm to
match any operand unit’, IEICE Trans. Fundam., 1999, E82-A, (5),
pp. 733–740
8 Savaş, E., and Koç, Ç.K.: ‘The Montgomery modular inverse –
revisited’, IEEE Trans. Comput., 2000, 49, (7), pp. 763–766
9 Hasan, M.A.: ‘Efficient computation of multiplicative inverses for
cryptographic applications’, Technical Report CORR 2001–03, Centre
for Applied Cryptographic Research, 2001
10 Savaş, E., and Koç, Ç.K.: ‘Architecture for unified field inversion with
applications in elliptic curve cryptography’. Proc. 9th IEEE Int. Conf.
on Electronics, Circuits and Systems – ICECS, Dubrovnik, Croatia,
Sept. 2002, Vol. 3, pp. 1155–1158
11 Savaş, E., Tenca, A.F., and Koç, Ç.K.: ‘A scalable and unified
multiplier architecture for finite fields $GF(p)$ and $GF(2^m)$’, in
‘Cryptographic Hardware and Embedded Systems, Workshop on
Cryptographic Hardware and Embedded Systems’ (Springer-Verlag,
Berlin, 2000), pp. 277–292
12 Shantz, S.C.: ‘From Euclid’s GCD to Montgomery multiplication to the
great divide’, Technical Report SMLI TR–2001–95, Sun Microsystems
Laboratory Technical Report, June 2001
13 Knuth, D.E.: ‘The art of computer programming’, Vol. 2, (Addison-
Wesley, Reading, MA, 1981, 2nd edn.)
14 Brown, M., Hankerson, D., Lopez, J., and Menezes, A.: ‘Software
implementation of the NIST curves over prime fields’. Technical
Report CORR 2000–56, Centre for Applied Cryptographic Research,
2000
15 Hankerson, D., Lopez, J., and Menezes, A.: ‘Software implementation
of elliptic curve cryptography over binary fields’. Technical Report
CORR 2000–42, Centre for Applied Cryptographic Research, 2000
16 Lórenz, R.: ‘New algorithm for classical modular inverse’, in Kaliski,
B.S., Jr., Koç, Ç.K., and Paar, C. (Eds.): ‘Cryptographic Hardware and
Embedded Systems’, LNCS (Springer-Verlag, Berlin, 2002), pp. 57–70
17 Gutub, A.A.-A., Tenca, A.F., Savaş, E., and Koç, Ç.K.: ‘Scalable and
unified hardware to compute Montgomery inverse in $GF(p)$ and
$GF(2^m)$’, in Kaliski, B.S., Jr., Koç, Ç.K., and Paar, C. (Eds.):
‘Cryptographic Hardware and Embedded Systems’, LNCS
(Springer-Verlag, Berlin, 2002), pp. 485–500
18 UMC 0.18 µm CMOS process family. http://www.umc.com/english/process/d.asp
19 Lopez, J., and Dahab, R.: ‘Fast multiplication on elliptic curves over
$GF(2^m)$ without precomputation’, in ‘Cryptographic Hardware and
Embedded Systems, Workshop on Cryptographic Hardware and
Embedded Systems’ (Springer-Verlag, Berlin, 1999), pp. 316–325
20 IEEE. P1363: Standard specifications for public-key cryptography,
2000
Table 8: Estimated clock cycle counts for inversion and ratio to multiplication operation

| n (bits) | Inversion: Phase I + Phase II | Inversion: after multibit shifting | Inversion: extra clock cycles | Inversion: total number of clocks | Multiplication | Ratio |
|---|---|---|---|---|---|---|
| 160 | 1920 | 1498 | 18 | 1516 | 327 | 4.6 |
| 192 | 2688 | 2070 | 21 | 2091 | 398 | 5.25 |
| 224 | 3584 | 2760 | 24 | 2784 | 469 | 5.94 |
| 256 | 4608 | 3548 | 27 | 3575 | 526 | 6.80 |
