FAST ELLIPTIC CURVE CRYPTOGRAPHIC PROCESSOR 
ARCHITECTURE BASED ON THREE PARALLEL GF(2^k) 
BIT LEVEL PIPELINED DIGIT SERIAL MULTIPLIERS  
 
Adnan Abdul-Aziz Gutub 
 
Computer Engineering Department 
King Fahd University of Petroleum and Minerals 
Dhahran 31261, SAUDI ARABIA 
Email: gutub@ccse.kfupm.edu.sa 
 
 
ABSTRACT 
 
An unusual processor architecture for elliptic curve 
encryption is proposed in this paper. The architecture 
exploits projective coordinates (x = X/Z, y = Y/Z) to convert 
the GF(2^k) division needed in elliptic curve point operations 
into several multiplication steps. The processor has three 
GF(2^k) multipliers implemented using bit-level pipelined 
digit-serial computation. It is shown that this results in 
faster operation than using fully parallel multipliers, with 
the added advantage of requiring less area. The proposed 
architecture is a serious contender for implementing data 
security systems based on elliptic curve cryptography. 
 
1. INTRODUCTION 
 
In 1985, Neal Koblitz and Victor Miller proposed the 
Elliptic Curve Cryptosystem (ECC) [1-9], a method 
based on the discrete logarithm problem over the points 
on an elliptic curve. Since that time, ECC has received 
considerable attention from mathematicians around the 
world, and no significant breakthroughs have been made 
in finding weaknesses in the algorithm. Although 
critics are still skeptical about the reliability of this 
method, several encryption techniques based on these 
properties have been developed recently. Because the 
problem appears so difficult to crack, key sizes can be 
reduced considerably, even exponentially [2,5,8], 
compared to the key sizes used by other cryptosystems. 
This has made ECC a challenger to RSA, one of the most 
popular public key methods known. ECC has been shown to 
offer security equal to RSA with a much smaller key size [2].  
Several crypto processors have been proposed in the 
literature recently [4,7,15]. A common feature of these 
processors is that they eliminate the need for an inversion 
circuit. It is well known that adding two points on an 
elliptic curve requires a division operation, and 
hence an inversion. Calculating the inverse is the most 
expensive operation over GF(2^k) [16,17]. To eliminate 
the need for inversion in GF(2^k), designs replace the 
inversion by several multiplication operations by 
representing the elliptic curve points in projective 
coordinates [1,4,7,9,15,18]. This approach is also 
adopted in the processor proposed in this paper. 
The different crypto-processor designs differ mainly 
in the architecture of the basic GF(2^k) multiplier. Clearly, 
it is impractical to use bit-parallel multipliers for large 
word lengths, i.e. k > 512. In [4], an nd x md digit multiplier 
is used to implement the multiplication over GF(2^k), 
where k > nd and k > md. In [7], a digit-serial multiplier 
was adopted. A similar approach was used in the elliptic 
curve processor over GF(q^m) in [15]. There are two basic 
drawbacks with the existing processors. The first is that 
digit-serial multiplication is not as efficient as sub-digit 
pipelined digit-serial computation [13,14]. The second is 
that none of the existing designs exploits the inherent 
parallelism in the computation of the elliptic curve point 
operations. In this paper, a new elliptic curve crypto 
processor architecture is proposed that takes advantage 
of both of these aspects. Together, they are expected to 
lead to an even better trade-off between area and 
computation time. 
 
 
2. ENCRYPTION AND DECRYPTION 
 
It is assumed that the reader is familiar with 
arithmetic over elliptic curves. For a good review, the 
reader is referred to [9]. There are many ways to apply 
elliptic curves for encryption/decryption purposes. In its 
most basic form, users randomly choose a base point (x, y) 
lying on the elliptic curve E. The plaintext (the 
original message to be encrypted) is coded into an elliptic 
curve point (x_m, y_m). Each user selects a private key n 
and computes his public key P = n(x, y). For example, 
user A's private key is n_A and his public key is 
P_A = n_A(x, y). 
To encrypt and send the message point (x_m, y_m) to 
user A, the sender chooses a random integer k and 
generates the ciphertext: 
C_m = {k(x, y), (x_m, y_m) + kP_A}. 
The ciphertext pair of points uses A's public key, so 
only user A can recover the plaintext, using his private 
key. 
To decrypt the ciphertext C_m, the first point in the 
pair, k(x, y), is multiplied by A's private key to 
get the point n_A(k(x, y)). This point is then subtracted 
from the second point of C_m; the result is the 
plaintext point (x_m, y_m). The complete decryption 
operations are: 
((x_m, y_m) + kP_A) - n_A(k(x, y)) 
= (x_m, y_m) + k(n_A(x, y)) - n_A(k(x, y)) 
= (x_m, y_m) 
The most time-consuming operation in the encryption and 
decryption procedure is finding the multiples of the base 
point (x, y). The algorithm used to implement this is 
discussed in the next section. 
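
The encryption and decryption steps of this section can be tried out on a toy curve. The following is a minimal sketch, not the processor's implementation: it assumes the small field GF(2^4) with field polynomial x^4 + x + 1, the curve y^2 + xy = x^3 + ax^2 + b with a = b = 1, affine arithmetic with a direct field inversion, and arbitrary keys. All function names and parameters are hypothetical and chosen only for illustration.

```python
# Toy elliptic-curve encryption over GF(2^4); every parameter here is an assumption.
K = 4
POLY = 0b10011          # x^4 + x + 1, irreducible over GF(2)
A, B = 1, 1             # assumed curve: y^2 + xy = x^3 + A*x^2 + B

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^K)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> K) & 1:
            a ^= POLY
    return r

def gf_inv(a):
    """Inverse as a^(2^K - 2); a must be non-zero."""
    r = 1
    for _ in range(2**K - 2):
        r = gf_mul(r, a)
    return r

def on_curve(x, y):
    x2 = gf_mul(x, x)
    return (gf_mul(y, y) ^ gf_mul(x, y)) == (gf_mul(x2, x) ^ gf_mul(A, x2) ^ B)

def neg(P):
    """Negative of (x, y) on a binary curve is (x, x + y)."""
    return None if P is None else (P[0], P[0] ^ P[1])

def add(P, Q):
    """Affine point addition; None stands for the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if (y1 ^ y2) == x1:                 # Q = -P (also covers doubling with x1 = 0)
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))   # doubling slope
        x3 = gf_mul(lam, lam) ^ lam ^ A
    else:
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
        x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return (x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1)

def mul(n, P):
    """n*P by scanning the bits of n from the most significant bit."""
    Q = None
    for bit in bin(n)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q

def encrypt(M, k, G, PA):
    """C_m = {kG, M + k*P_A}."""
    return mul(k, G), add(M, mul(k, PA))

def decrypt(C, nA):
    """M = C2 - n_A * C1."""
    C1, C2 = C
    return add(C2, neg(mul(nA, C1)))

POINTS = [(x, y) for x in range(16) for y in range(16) if on_curve(x, y)]
G = POINTS[-1]          # arbitrary base point (x, y)
nA = 5                  # user A's private key (arbitrary)
PA = mul(nA, G)         # user A's public key
M = POINTS[0]           # message, assumed already coded as a curve point
assert decrypt(encrypt(M, 3, G, PA), nA) == M
```

The round trip works for any random integer k because k(n_A(x, y)) and n_A(k(x, y)) are the same point, which is exactly the cancellation in the derivation above.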
 
 
3. POINT OPERATION ALGORITHM 
 
The ECC algorithm used for calculating nP from P is the 
binary method, since it is known to be efficient and 
practical to implement in hardware [2,5,7,9,10]. This 
binary method algorithm is shown below: 
Define k: number of bits in n, and n_i: the ith bit of n 
Input:  P (a point on the elliptic curve). 
Output: Q = nP (another point on the elliptic curve). 
1.  if n_{k-1} = 1, then Q := P else Q := 0; 
2.  for i = k-2 down to 0 
3.   { Q := Q+Q ; 
4.      if n_i = 1 then Q := Q+P ; } 
5.  return Q; 
Basically, the binary method scans the bits of n from 
the most significant bit down and doubles the point Q 
k-1 times. Whenever a bit of n is found to be one, an 
extra operation, Q+P, is needed. 
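
The binary method only needs a doubling and an addition, so it can be illustrated with any additive group. The sketch below is an illustration under that assumption: the function name `binary_method` is hypothetical, the comments refer to the numbered steps of the listing above, and ordinary integers stand in for curve points so the result is easy to check by hand.

```python
def binary_method(n, P, add, zero):
    """Left-to-right binary method: returns (nP, doublings, additions)."""
    bits = bin(n)[2:]                    # n_{k-1} ... n_0, most significant bit first
    Q = P if bits[0] == "1" else zero    # step 1
    doublings = additions = 0
    for bit in bits[1:]:                 # step 2: i = k-2 down to 0
        Q = add(Q, Q)                    # step 3: doubling
        doublings += 1
        if bit == "1":                   # step 4: conditional addition
            Q = add(Q, P)
            additions += 1
    return Q, doublings, additions

# Integers under ordinary addition stand in for the curve group here.
result, d, a = binary_method(13, 5, lambda u, v: u + v, 0)
print(result, d, a)   # 13 = 1101 in binary: prints 65 3 2
```

For the 4-bit multiplier n = 13 the loop performs three doublings and two additions, one addition for each one-bit after the leading bit.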
As can be seen from the above description of the 
binary algorithm, adding two elliptic curve points and 
doubling a point are the basic operations in each 
iteration. As mentioned earlier, adding two points on an 
elliptic curve requires inversion [9]. As in the crypto 
processor in [6], inversion is eliminated using projective 
coordinates, as discussed in the next section. 
 
 
4. POINT OPERATIONS OVER PROJECTIVE 
COORDINATES 
 
Elimination of inversion is achieved by projecting the 
coordinates (x,y) into (X,Y,Z), where x=X/Z, and y=Y/Z. 
The projected elliptic curve equation is introduced in 
[18]; it is detailed below to generate the data flow graphs 
discussed afterward.  
 The procedure for projective point addition of P+Q 
(two elliptic curve points) is shown below: 
 
P = (X_1, Y_1, Z_1); Q = (X_2, Y_2, Z_2); P+Q = (X_3, Y_3, Z_3); 
where P ≠ ±Q 

(x, y) = (X/Z, Y/Z) ∈ (X, Y, Z) 

A   = X_1 Z_2                          1M 
B   = X_2 Z_1                          1M 
C   = A + B 
D   = Y_1 Z_2                          1M 
E   = Y_2 Z_1                          1M 
F   = D + E 
G   = C + F 
H   = Z_1 Z_2                          1M 
I   = C^3 + a H C^2 + H F G            6M 
X_3 = C I                              1M 
Z_3 = H C^3                            1M 
Y_3 = G I + C^2 [F X_1 + C Y_1]        4M 
                                     ----- 
                                      17M 
Similarly, the formulas for projective point 
doubling are shown below: 
 
P = (X_1, Y_1, Z_1); P+P = (X_3, Y_3, Z_3) 

(x, y) = (X/Z, Y/Z) ∈ (X, Y, Z) 

A   = X_1 Z_1                          1M 
B   = b Z_1^4 + X_1^4                  5M 
C   = A X_1^4                          1M 
D   = Y_1 Z_1                          1M 
E   = X_1^2 + D + A 
Z_3 = A^3                              2M 
X_3 = A B                              1M 
Y_3 = C + B E                          1M 
                                     ----- 
                                      12M 
 
Squaring over GF(2^k) is assumed to be very similar to 
multiplication. Both are denoted as M (multiplication) 
in the listings above. 
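
The projective formulas can be checked numerically against the usual affine formulas. The sketch below assumes a toy field GF(2^4) with polynomial x^4 + x + 1 and the curve y^2 + xy = x^3 + ax^2 + b with a = b = 1; all names are hypothetical. One assumption deserves emphasis: in Y_3 the sketch uses F·A + C·D, which coincides with the printed F·X_1 + C·Y_1 when Z_2 = 1 (for example when the second operand is the affine base point) and also holds for a general Z_2.

```python
import random

K, POLY = 4, 0b10011     # assumed toy field GF(2^4), x^4 + x + 1
a, b = 1, 1              # assumed curve y^2 + xy = x^3 + a*x^2 + b

def mul(u, v):
    """Multiplication in GF(2^K)."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if (u >> K) & 1:
            u ^= POLY
    return r

def sq(u):
    return mul(u, u)

def inv(u):
    r = 1
    for _ in range(2**K - 2):
        r = mul(r, u)
    return r

def on_curve(x, y):
    return (sq(y) ^ mul(x, y)) == (mul(sq(x), x) ^ mul(a, sq(x)) ^ b)

def affine_add(P, Q):     # reference formula, requires P != +-Q
    (x1, y1), (x2, y2) = P, Q
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = sq(lam) ^ lam ^ x1 ^ x2 ^ a
    return x3, mul(lam, x1 ^ x3) ^ x3 ^ y1

def affine_double(P):     # reference formula, requires x1 != 0
    x1, y1 = P
    lam = x1 ^ mul(y1, inv(x1))
    x3 = sq(lam) ^ lam ^ a
    return x3, sq(x1) ^ mul(lam ^ 1, x3)

def proj_add(P, Q):
    (X1, Y1, Z1), (X2, Y2, Z2) = P, Q
    A = mul(X1, Z2); B = mul(X2, Z1); C = A ^ B
    D = mul(Y1, Z2); E = mul(Y2, Z1); F = D ^ E
    G = C ^ F; H = mul(Z1, Z2)
    C2 = sq(C); C3 = mul(C, C2)
    I = C3 ^ mul(a, mul(H, C2)) ^ mul(H, mul(F, G))
    J = mul(F, A) ^ mul(C, D)      # equals F*X1 + C*Y1 when Z2 = 1
    return mul(C, I), mul(G, I) ^ mul(C2, J), mul(H, C3)   # (X3, Y3, Z3)

def proj_double(P):
    X1, Y1, Z1 = P
    A = mul(X1, Z1)
    Xs = sq(X1); X4 = sq(Xs)
    B = mul(b, sq(sq(Z1))) ^ X4
    C = mul(A, X4); D = mul(Y1, Z1)
    E = Xs ^ D ^ A
    return mul(A, B), C ^ mul(B, E), mul(A, sq(A))          # (X3, Y3, Z3)

def lift(P, z):
    """Affine (x, y) to projective (x*z, y*z, z) for a non-zero z."""
    return mul(P[0], z), mul(P[1], z), z

def to_affine(P):
    zi = inv(P[2])
    return mul(P[0], zi), mul(P[1], zi)

pts = [(x, y) for x in range(16) for y in range(16) if on_curve(x, y)]
rng = random.Random(0)
for P in pts:
    z1, z2 = rng.randrange(1, 16), rng.randrange(1, 16)
    if P[0] != 0:
        assert to_affine(proj_double(lift(P, z1))) == affine_double(P)
    for Q in pts:
        if P[0] != Q[0]:
            assert to_affine(proj_add(lift(P, z1), lift(Q, z2))) == affine_add(P, Q)
```

Only the final conversion in `to_affine` needs an inversion, which is why a projective design can defer the single division to the end of the whole nP computation.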
Figure 1a shows the data flow graph for adding two 
elliptic curve points. If implemented directly as shown 
in Figure 1a, this design would need seventeen 
multipliers and seven k-bit XOR gates. The complete data 
flow graph for doubling a point is shown in Figure 1b. It 
is made of twelve multipliers and four k-bit XOR gates. 
Any elliptic curve crypto processor that uses 
projective coordinates must implement the data flow 
graphs in Figures 1a and 1b iteratively. 
 
 
5. PROPOSED CRYPTO ARCHITECTURE 
 
The architecture of the new processor is shown in 
Figure 2. Unlike existing designs, which use a single 
multiplier, the new architecture has three multipliers. The 
reason for using more than one multiplier is discussed 
fully in Section 6; the reason for using no more than 
three multipliers is explained here. As can be seen 
from Figures 1a and 1b, the critical paths of the two 
data flow graphs are effectively 6 GF(2^k) 
multiplications and 4 GF(2^k) multiplications, 
respectively. Here the time of GF(2^k) addition is ignored 
since it is negligible compared to multiplication. Therefore, 
the lower bound on the computation time of one point 
doubling followed by one point addition in the calculation 
of nP is ten GF(2^k) multiplications. It can easily be seen 
from Figures 1a and 1b that performing three 
multiplications in parallel meets this lower bound. 
Furthermore, the utilization of the three multipliers is 
very high, almost the maximum. As can be seen from 
Figures 1a and 1b, all three multipliers are used in nine 
of the ten steps, and only in one of the ten steps is a 
single multiplier left unused. 
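
The step counts quoted above can be checked mechanically. The sketch below encodes the multiplications of each step as they appear in the data flow graphs of Figure 1, together with the products each one depends on (additions are treated as free). The dictionary layout, the short product names and the helper `check` are assumptions made for illustration only.

```python
from math import ceil

# Point addition: product -> products that must be finished first.
ADD_DEPS = {
    "A": [], "B": [], "D": [], "E": [], "H": [],
    "C^2": ["A", "B"],                 # C = A + B
    "HC^2": ["H", "C^2"],
    "C^3": ["C^2"],
    "FG": ["A", "B", "D", "E"],        # F = D + E, G = C + F
    "CY1": ["A", "B"],
    "aHC^2": ["HC^2"],
    "HFG": ["H", "FG"],
    "CI": ["C^3", "aHC^2", "HFG"],     # X3, with I = C^3 + aHC^2 + HFG
    "GI": ["C^3", "aHC^2", "HFG"],
    "X1F": ["D", "E"],
    "HC^3": ["H", "C^3"],              # Z3
    "JC^2": ["X1F", "CY1", "C^2"],     # J = F*X1 + C*Y1
}
ADD_STEPS = [["E", "B", "A"], ["D", "H", "C^2"], ["HC^2", "C^3", "FG"],
             ["CY1", "aHC^2", "HFG"], ["CI", "GI", "X1F"], ["HC^3", "JC^2"]]

# Point doubling.
DBL_DEPS = {
    "A": [], "X1^2": [], "Z1^2": [], "D": [],
    "X1^4": ["X1^2"], "Z1^4": ["Z1^2"],
    "A^2": ["A"], "AX1^4": ["A", "X1^4"], "bZ1^4": ["Z1^4"],
    "A^3": ["A^2"],                                   # Z3
    "AB": ["A", "X1^4", "bZ1^4"],                     # X3
    "EB": ["X1^2", "D", "A", "X1^4", "bZ1^4"],
}
DBL_STEPS = [["A", "X1^2", "Z1^2"], ["D", "X1^4", "Z1^4"],
             ["A^2", "AX1^4", "bZ1^4"], ["A^3", "AB", "EB"]]

def check(deps, steps, multipliers=3):
    """Validate a schedule; return (number of steps, number of products)."""
    done = set()
    for step in steps:
        assert len(step) <= multipliers                    # resource limit
        assert all(set(deps[m]) <= done for m in step)     # operands ready
        done |= set(step)
    assert done == set(deps) and sum(map(len, steps)) == len(deps)
    return len(steps), len(deps)

add_steps, add_mults = check(ADD_DEPS, ADD_STEPS)
dbl_steps, dbl_mults = check(DBL_DEPS, DBL_STEPS)
print(add_steps, ceil(add_mults / 3))    # 6 6
print(dbl_steps, ceil(dbl_mults / 3))    # 4 4
print(add_mults + dbl_mults, "of", 3 * (add_steps + dbl_steps), "multiplier slots used")
```

With three multipliers, 17 and 12 products cannot take fewer than ceil(17/3) = 6 and ceil(12/3) = 4 steps, so the schedules in the figure meet that bound and leave one of the 30 multiplier slots idle.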
 
 
 
Figure 1. Data flow graphs for the elliptic curve point 
operations using the projection (x, y) = (X/Z, Y/Z) 
 
 
In the crypto processor presented here, we also 
propose to use the bit-level pipelined GF(2^k) digit-serial 
multipliers reported in [13,14]. It is worth pointing 
out that these multipliers are in fact faster and use less 
area than their un-pipelined bit-parallel counterparts 
[13,14]. Moreover, sub-digit pipelining of digit-serial 
computation leads to much better performance than 
conventional digit-serial structures, as shown in Table 1 
[13].  
 
 
 
Figure 2. The elliptic curve point operations hardware 
 
 
Bit-level digit-serial computation is more suitable for 
the elliptic curve crypto algorithm discussed above because 
point doubling, point addition, and the computation of 
multiples of the base point all require the 
multiplications of one stage to be completed before the 
multiplications of the subsequent stage can start. 
Therefore, even if a pipelined bit-parallel multiplier is 
used, its throughput cannot be exploited, since the next 
multiplication operation cannot commence until the 
multiplication operations in the previous stage have 
completed. As for the GF(2^k) adder, it is implemented 
in bit-parallel fashion, since its area is not significant 
compared to the multiplier and minimizing the addition 
time reduces the overall multiply-add cycle time.  
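
To make the term digit-serial concrete, the following sketch multiplies in GF(2^k) while consuming d bits of one operand per step. It is a generic software illustration only, not the bit-level pipelined design of [13,14]; the function names, the most-significant-digit-first order and the toy field in the usage line are all assumptions.

```python
def reduce_poly(v, poly, k):
    """Reduce polynomial v modulo the degree-k field polynomial."""
    for i in range(v.bit_length() - 1, k - 1, -1):
        if (v >> i) & 1:
            v ^= poly << (i - k)
    return v

def clmul(u, v):
    """Carry-less (polynomial) product over GF(2), without reduction."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        u <<= 1
        v >>= 1
    return r

def digit_serial_mul(a, b, poly, k, d):
    """Return (a*b in GF(2^k), steps), taking d bits of b per step."""
    steps = -(-k // d)                                   # ceil(k / d)
    acc = 0
    for s in reversed(range(steps)):                     # most significant digit first
        digit = (b >> (s * d)) & ((1 << d) - 1)
        # Horner step: acc = acc * x^d + a * digit (mod poly)
        acc = reduce_poly((acc << d) ^ clmul(a, digit), poly, k)
    return acc, steps

# In GF(2^4) with x^4 + x + 1: x * x^3 = x^4 = x + 1.
print(digit_serial_mul(0b0010, 0b1000, 0b10011, 4, 2))   # (3, 2)
```

With d = 1 the loop is bit-serial and takes k steps; with d = k it behaves like a single bit-parallel multiplication. The digit size d is the knob that trades area against the number of steps.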
 
 
Table 1. Comparison of area and time of the pipelined 
digit-serial GF(2^k) multiplier in [14] for different numbers 
of sub-digit pipelining levels, K. 

K     Area: AT(K)/AT(1)     Time: T(K)/T(1) 
1     1                     1 
2     1.3                   2 
4     1.4                   4 
8     1.9                   8 
 
 
 
6. COMPARISON WITH EXISTING DESIGN 
 
[Figure 1 panels: (a) adding two points; (b) doubling a point] 

In existing designs, a single multiplier is used to perform 
all the multiplications needed in Figures 1a and 1b. The 
reason is that using more than one multiplier is 
perceived to be too expensive. However, using three 
multipliers leads to a better AT^2 (area-time-squared) cost. 
Table 2 compares the proposed design with the 
existing design demonstrated in [6]. The proposed hardware 
needs only slightly fewer registers than the existing one 
(11 versus 12). The real achievement of our design is its 
AT^2 cost, which drops from 420.25 to 147. 
 
 
Table 2. Comparing the proposed design with the 
conventional one. 

Hardware Design               Conventional            Proposed 
Number of Multipliers (A)     1                       3 
Worst case No. of Cycles      17 + 12 = 29            6 + 4 = 10 
Avg. No. of Cycles (T)        12 + (17/2) = 20.5      4 + (6/2) = 7 
Number of Registers           12                      11 
Cost: AT^2                    420.25                  147 
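
The cost figures in Table 2 follow from simple arithmetic, reproduced in the short sketch below. It assumes, as the averages in the table do, one doubling per key bit and one addition for half of the bits, with A taken as the number of multipliers; the function names are hypothetical.

```python
def avg_cycles(double_cycles, add_cycles):
    """One doubling per key bit, one addition for half of the bits (assumed)."""
    return double_cycles + add_cycles / 2

def at2(area, time):
    """Area-time-squared cost."""
    return area * time ** 2

conventional = at2(1, avg_cycles(12, 17))   # single multiplier
proposed = at2(3, avg_cycles(4, 6))         # three parallel multipliers
print(conventional, proposed)               # 420.25 147.0
print(round(conventional / proposed, 2))    # 2.86
```

Under these assumptions, tripling the multiplier count is more than repaid by the shorter average iteration, giving an AT^2 cost about 2.86 times lower.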
 
 
7. CONCLUSION 
 
A new GF(2^k) elliptic curve crypto processor is proposed 
in this paper. It does not need a GF(2^k) inverter, because 
the inverse operation is converted into successive 
multiplication steps using projective coordinates. It 
exploits the inherent parallelism in the computation of 
doubling and adding points over an elliptic curve, as well 
as sub-digit pipelined digit-serial computation, to 
achieve a better trade-off between area and time. 
 
 
8. ACKNOWLEDGMENT 
 
The Author would like to thank Professor Mohammad K. 
Ibrahim for his valuable suggestions and comments. The 
Author also acknowledges the support provided to this 
work from King Fahd University of Petroleum and 
Minerals, Dhahran, Saudi Arabia.   
 
 
9. REFERENCES 
 
[1] Miyaji A., “Elliptic Curves over F_p Suitable for 
Cryptosystems”, Advances in Cryptology - 
AUSCRYPT '92, Australia, December 1992. 
[2] Stallings, W. “Cryptography and Network Security: 
Principles and Practice”, Second Edition, Prentice 
Hall Inc., New Jersey, 1999. 
[3] Chung, Sim, and Lee, “Fast Implementation of Elliptic 
Curve Defined over GF(p^m) on CalmRISC with 
MAC2424 Coprocessor”, Workshop on Cryptographic 
Hardware and Embedded Systems, CHES 2000, 
Massachusetts, August 2000. 
[4] Okada, Torii, Itoh, and Takenaka, “Implementation of 
Elliptic Curve Cryptographic Coprocessor over GF(2^m) 
on an FPGA”, Workshop on Cryptographic Hardware 
and Embedded Systems, CHES 2000, Massachusetts, 
August 2000. 
[5] Crutchley, D. A., “Cryptography And Elliptic Curves”, 
Master Thesis under Supervision of Prof. Gareth Jones, 
submitted to the Faculty of Mathematics at University 
of Southampton, England, May 1999. 
[6] Orlando, and Paar, “A High-Performance 
Reconfigurable Elliptic Curve Processor for GF(2^m)”, 
Workshop on Cryptographic Hardware and Embedded 
Systems, CHES 2000, Massachusetts, August 2000. 
[7] Stinson, D. R., “Cryptography: Theory and Practice”, 
CRC Press, Boca Raton, Florida, 1995. 
[8] Paar, Fleischmann, and Soria-Rodriguez, “Fast 
Arithmetic for Public-Key Algorithms in Galois Fields 
with Composite Exponents”, IEEE Transactions on 
Computers, Vol. 48, No. 10, October 1999. 
[9] Blake, Seroussi, and Smart, “Elliptic Curves in 
Cryptography”, Cambridge University Press: NY, 
1999. 
[10] Hankerson, Hernandez, and Menezes, “Software 
Implementation of Elliptic Curve Cryptography Over 
Binary Fields”, Workshop on Cryptographic Hardware 
and Embedded Systems, CHES 2000, Massachusetts, 
August 2000. 
[11] G. A. Orton, M. P. Roy, P. A. Scott, L. E. Peppard, and 
S. E. Tavares. “VLSI implementation of public-key 
encryption algorithms”, Advances in Cryptology -- 
CRYPTO '86, volume 263 of Lecture Notes in 
Computer Science, pages 277-301, 11-15 August 1986. 
Springer-Verlag, 1987.  
[12] Scott, Norman R., “Computer Number Systems and 
Arithmetic”, Prentice-Hall Inc., New Jersey, 1985.  
[13] Ibrahim, M. K., Almulhem, A., “Bit-Level Pipelined 
Digit Serial GF(2^m) Multiplier”, IEEE International 
Symposium on Circuits and Systems, Sydney, Australia, 
2001. 
[14] Ibrahim, M. K., Junaid, A. K., Al-Abaji, R. H., 
Almulhem, A., “Trade-off analysis of a new sign digit 
serial GF multiplier”, Fifth World Multi-conference on 
Systemics, Cybernetics and Informatics SCI / ISAS 
2001. Volume XIV, Part II, pages 52-56. July 2001, 
Orlando, 2001. 
[15] Orlando, and Paar, “A scalable GF(p) elliptic curve 
processor architecture for programmable hardware”, 
Cryptographic Hardware and Embedded Systems, 
CHES 2001, May 14-16, 2001, Paris, France. 
[16] Gutub, Adnan Abdul-Aziz, Tenca,A., and Koc,C., 
“Scalable VLSI architecture for GF(p) Montgomery 
modular inverse computation”, IEEE Computer Society 
Annual Symposium on VLSI, pages 53--58, Pittsburgh, 
Pennsylvania, April 25-26, 2002.  
[17] Gutub, Adnan Abdul-Aziz, Tenca,A.F., and Koc,C., 
“Scalable and Unified Hardware to Compute 
Montgomery Inverse in GF(p) and GF(2^n)”, 
Cryptographic Hardware and Embedded Systems - 
CHES 2002, pages 485-500, August 13-15, 2002. 
[18] Ernst, Klupsch, Hauck, and Huss, “Rapid Prototyping 
for Hardware Accelerated Elliptic Curve Public-Key 
Cryptosystems”, The IEEE 12th International 
Workshop on Rapid System Prototyping, Monterey, 
CA, June 25-27, 2001. 
