Turkish Journal of Electrical Engineering and Computer Sciences
Volume 29

Number 4

Article 17

1-1-2021

Field-programmable gate array (FPGA) hardware design and
implementation of a new area efficient elliptic curve crypto-processor
MUHAMMAD KASHIF
İHSAN ÇİÇEK


Recommended Citation
KASHIF, MUHAMMAD and ÇİÇEK, İHSAN (2021) "Field-programmable gate array (FPGA) hardware design
and implementation of a new area efficient elliptic curve crypto-processor," Turkish Journal of Electrical
Engineering and Computer Sciences: Vol. 29: No. 4, Article 17. https://doi.org/10.3906/elk-2008-8
Available at: https://journals.tubitak.gov.tr/elektrik/vol29/iss4/17

This Article is brought to you for free and open access by TÜBİTAK Academic Journals. It has been accepted for
inclusion in Turkish Journal of Electrical Engineering and Computer Sciences by an authorized editor of TÜBİTAK
Academic Journals. For more information, please contact academic.publications@tubitak.gov.tr.

Turkish Journal of Electrical Engineering & Computer Sciences
http://journals.tubitak.gov.tr/elektrik/

Research Article

Turk J Elec Eng & Comp Sci
(2021) 29: 2127 – 2139
© TÜBİTAK
doi:10.3906/elk-2008-8

Field-programmable gate array (FPGA) hardware design and implementation of
a new area eﬀicient elliptic curve crypto-processor

Muhammad KASHIF1, İhsan ÇİÇEK2,∗
1 Department of Electrical-Electronics Engineering, İstanbul Şehir University, İstanbul, Turkey
2 Department of Electrical-Electronics Engineering, İstinye University, İstanbul, Turkey

Received: 04.08.2020 • Accepted/Published Online: 29.03.2021 • Final Version: 26.07.2021

Abstract: Elliptic curve cryptography provides a widely recognized secure environment for information exchange in
resource-constrained embedded system applications, such as Internet-of-Things, wireless sensor networks, and radio
frequency identification. As the elliptic curve cryptography (ECC) arithmetic is computationally very complex, there is
a need for dedicated hardware for efficient computation of the ECC algorithm, in which scalar point multiplication is the
performance bottleneck. In this work, we present an ECC accelerator that computes the scalar point multiplication for
the NIST recommended elliptic curves over Galois binary fields by using a polynomial basis. We used the Montgomery
algorithm with projective coordinates for the scalar point multiplication. We designed a hybrid finite field multiplier
based on the standard Karatsuba and shift-and-add multiplication algorithms that achieves one finite field multiplication
in $m/2$ clock cycles for a key-length of $m$. The proposed design has been modeled in Verilog hardware description language
(HDL), functionally verified with simulations, and implemented for field-programmable gate array (FPGA) devices using
vendor tools to demonstrate hardware efficiency. Finally, we have integrated the ECC accelerator as an AXI4 peripheral
with a synthesizable microprocessor on an FPGA device to create an elliptic curve crypto-processor.
Key words: Elliptic curve cryptography, Karatsuba multiplier, crypto-accelerator, crypto-processor, scalar multiplication, field-programmable gate array

1. Introduction
Modern security provisions require and mandate the use of cryptographic algorithms as a result of elevated
risks associated with the ever-increasing connectivity of embedded devices. In practical cryptographic
applications, both symmetric and asymmetric cryptographic algorithms are used in communication protocols to
establish a secure channel for information interchange. Designers often follow a hybrid approach to utilize the
best of both worlds. Asymmetric algorithms are usually used in the management and exchange of keys in a secure
communication protocol while the symmetric cryptographic algorithms are utilized for high-throughput secure
data exchange. However, the computational requirements still pose a problem for the cost-sensitive lightweight
embedded systems, such as Internet-of-things (IoT) or wireless sensor networks (WSN). The processors used
in such systems typically have limited computational power, and thus cryptographic software implementations
have limited performance. Moreover, cryptographic software consumes the precious code space allocated for
the main application. The most efficient approach adopted to address these issues is the use of dedicated
cryptographic hardware peripherals integrated with the central processing unit (CPU). Because of its higher speed,
lower power consumption, and smaller area, elliptic curve cryptography (ECC) has become a very
convenient choice for lightweight and resource-constrained embedded systems. It is frequently used in
applications such as banking [1], cloud computing [2], and IoT [3]. It is practically impossible to extract
the decryption key through brute force breaking attempts due to the computational complexity of the elliptic
curve discrete logarithm problem (ECDLP) [4, 5]. ECC can provide an equivalent level of security by using
shorter key-lengths when compared to other popular asymmetric cryptography algorithms such as RSA. For
example, the security level provided by 2048-bit RSA is equivalent to using 233-bit ECC, which requires far fewer
hardware resources.

∗ Correspondence: ihsan.cicek@istinye.edu.tr
This work is licensed under a Creative Commons Attribution 4.0 International License.

From a hardware architecture point of view, the ECC algorithm can be hierarchically partitioned into
four layers of computation [6]. The first layer consists of finite field arithmetic operations, such as addition,
multiplication, squaring, and inversion [7]. Point addition (PA) and point doubling (PD) operations are
computed at the second layer, on top of the first one. Scalar point multiplication (SPM) is the most computationally
intensive part of the ECC, and it is performed in the third layer. Finally, the encryption and decryption
protocols are completed at the outermost fourth layer. Several algorithms are available to perform the
SPM operation for different applications. Among all SPM algorithms, the Montgomery algorithm is the
most popular one due to its inherent resistance against timing attacks and simple power analysis based side-channel attacks [8]. The overall performance of SPM depends on the type of the utilized finite field multiplier [3].
The bit-serial, bit-parallel, digit-serial, and digit-parallel multipliers are the most popular types of multipliers
used for the computation of the SPM operation in the literature [9]. Bit-parallel and digit-parallel multipliers
use a single clock cycle for the computation of one finite field multiplication, while the bit-serial and digit-serial
multipliers require multiple clock cycles [9]. Various technologies such as application-specific integrated
circuits (ASICs) or field-programmable gate arrays (FPGAs) are available for hardware implementation of the
ECC. FPGAs have design portability, reconfigurability, and scalability
advantages over ASICs, which are the common-sense choice for high-volume
and low-cost applications with limited or no flexibility.
In this work, we propose a new ECC accelerator architecture that offers high-performance in a compact
footprint for addressing the ECC IP core requirements of the emerging resource-constrained embedded systems.
We can summarize our contributions in this work as the following:
1. We present a new hardware-efficient ECC accelerator design for use with National Institute of Standards
and Technology (NIST) recommended curves over $GF(2^m)$ with $m = 233$ [10]. We introduce a new
hybrid multiplier architecture, which trades time for space to yield a high-speed and low-area multiplier
composed of the high-speed standard Karatsuba [11] and compact shift-and-add multipliers with a timing
cost of $m/2$ clock cycles [12].
2. SPM in the proposed ECC accelerator is based on the Montgomery ladder algorithm, which provides a
trade-off between speed and area [8, 13]. We used a 2-stage pipeline to maintain the overall throughput, with
a carefully designed schedule for the Montgomery algorithm to avoid read-after-write hazards.
3. Finally, our proposed ECC processor design has been integrated with a MicroBlaze synthesizable central
processing unit (CPU) on a Xilinx Artix-7 FPGA as a peripheral to enable fourth layer ECC applications.
The remainder of this paper is organized as follows. We discuss the preliminaries to compute SPM on
ECC over $GF(2^m)$ in Section 2. We present the hardware architecture of the proposed ECC accelerator in
Section 3. A performance-optimized architecture for the SPM for ECC is explained in Section 4. Section 5
provides the details on FPGA hardware implementations, and in Section 6, we discuss performance results of
the proposed ECC processor along with a comparison to the literature.
2. Elliptic curve cryptography on $GF(2^m)$
In cryptography, binary fields $GF(2^m)$ are usually considered more convenient for hardware implementation
of the elliptic curves (ECs) [7]. Singular and nonsingular ECs can be used to implement the field arithmetic.
Nonsingular ECs are considered to be more secure and, hence, more suitable for cryptography [14]. In the affine
coordinate system, a nonsingular EC over $GF(2^m)$ is defined as the set of points $(x, y)$ that satisfy the
following:

$$y^2 + xy = x^3 + ax^2 + b \bmod F(x) \qquad (1)$$

in which $a$ and $b$ are EC parameters with $b \neq 0$. The variables $x$ and $y$ are the base point elements,
and $F(x)$ is the irreducible polynomial. The point at infinity, $\phi$, acts as the identity element: when a point $P_1$ on an EC is
added to the point at infinity $\phi$, the resulting point also lies on the EC, i.e., $P_1 + \phi = P_1$. Finite field arithmetic
operations are used for point addition (PA) and point doubling (PD) when two points $P_1$ and $P_2$ over $GF(2^m)$
on the EC are considered. If the two points are not the same ($P_1 \neq P_2$), then the addition of these
points gives a new point $Q$ on the EC as a result of PA; whereas, if the points are the same ($P_1 = P_2$), then
adding them gives a new point $Q$ that lies on the EC and is the result of PD. The following arithmetic
operations are used in ECC over $GF(2^m)$:
Finite field addition and squaring: Finite field addition is usually performed using the bitwise exclusive OR (XOR)
operation without any carry propagation. The technique of interleaving zeros is normally used for finite field
squaring. It takes one clock cycle to compute finite field squaring and finite field addition.
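The two single-cycle operations described above can be sketched in software as follows. This is a minimal illustration and not the authors' Verilog: field elements are assumed to be stored as Python integers whose bit i is the coefficient of x^i, and the function names are hypothetical.

```python
M = 233  # field size used in this paper

def gf_add(a: int, b: int) -> int:
    """Add two GF(2^m) elements: coefficients add modulo 2, so no carries occur."""
    return a ^ b

def gf_square_unreduced(a: int, m: int = M) -> int:
    """Square by interleaving zeros: bit i of the input moves to bit 2i.

    The result is up to 2m - 1 bits wide and still has to be reduced
    modulo the irreducible polynomial.
    """
    result = 0
    for i in range(m):
        if (a >> i) & 1:
            result |= 1 << (2 * i)
    return result

a = 0b1011  # x^3 + x + 1
b = 0b0110  # x^2 + x
print(bin(gf_add(a, b)))            # 0b1101
print(bin(gf_square_unreduced(a)))  # 0b1000101, i.e. x^6 + x^2 + 1
```

In hardware, both functions are pure wiring plus XOR gates, which is why they fit in one clock cycle.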
Finite field inversion and reduction: Field squaring and field multiplication are performed repeatedly to
compute the field inversion. The reduction is needed to return to the original bit-length after multiplication
and squaring.
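As a rough software model of reduction and inversion, the sketch below reduces modulo the NIST degree-233 polynomial x^233 + x^74 + 1 and inverts by raising to the power 2^m − 2, i.e. by repeated squaring and multiplication. The bit-by-bit loops and function names are assumptions made for illustration; the hardware uses a dedicated single-cycle reduction circuit and an optimized inversion algorithm.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1  # x^233 + x^74 + 1

def gf_reduce(c: int) -> int:
    """Bring a polynomial of up to 2m - 1 bits back to m bits modulo F."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F << (i - M)  # cancel bit i; only lower bits are toggled
    return c

def gf_mul(a: int, b: int) -> int:
    """Carry-less shift-and-add product followed by reduction."""
    acc = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            acc ^= a << i
    return gf_reduce(acc)

def gf_inv(a: int) -> int:
    """Inverse as a^(2^M - 2), using 2^M - 2 = 2 + 4 + ... + 2^(M-1)."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    result, power = 1, gf_mul(a, a)  # power = a^2
    for _ in range(M - 1):
        result = gf_mul(result, power)
        power = gf_mul(power, power)
    return result

a = 0x1234567890ABCDEF
print(gf_mul(a, gf_inv(a)))  # 1
```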
Finite field multiplication: Finite field multiplication is the most important arithmetic operation in ECC.
Numerous efforts have been devoted to developing new ECC multiplier architectures and optimizing their performance. In [15], the authors used a simple Karatsuba multiplier and applied a partial product technique at
the final recursion of the Karatsuba multiplication algorithm. Bit-parallel and bit-serial multipliers are the most
frequently used multipliers in cryptographic applications [16]. The compact digit-serial multiplier takes $u/v$
clock cycles to perform a multiplication [17], where $u/v$ is the total number of digits, and $u$ and $v$ are the key length and the
digit size, respectively. A bit-parallel multiplier multiplies two m-bit numbers in one clock cycle, but it requires
a large area [15]. A modified multiplier architecture is proposed in [18] using an encoding algorithm. The authors
of [19] chose projective coordinates and used a digit-serial multiplier for efficient EC multiplication. In [20],
the authors designed four different multipliers. Since they focused more on pipelining, their multipliers
improved the speed but at the cost of increased area. A comprehensive survey on finite field multipliers is
available in [21]. In this work, we propose a novel multiplier employing the Karatsuba multiplier along
with the ordinary shift-and-add algorithm as discussed in Section 4.
The most important operation in the ECC is the scalar point multiplication (SPM) because it directly
affects the computation time and the overall ECC performance [22]. SPM requires an initial point $P$ on the EC
and an integer $k$ whose size is equivalent to the size of the field under consideration [23]. The SPM is simply
the addition of $k$ copies of the point $P$ ($Q = kP = P + P + \dots + P$). Alternatively, $k$ is the discrete logarithm
of $Q$ to the base $P$ [24]. The SPM $kP$ is computed by performing PA and PD repeatedly. If the ith bit
of $k$ is 1, point addition is performed, and if it is 0, either PA or PD can be done. Usually, for $k[i] = 0$, PD is
performed because it needs comparatively fewer arithmetic operations than PA [14]. In affine coordinates, an inversion needs
to be performed for each PA and PD operation, which creates a performance bottleneck [14]. To avoid this
inversion cost, a conversion from affine $(x, y)$ to projective coordinates $(X : Y : Z)$ can be computed, and then
SPM is performed [23]. Afterward, a projective to affine coordinate conversion is performed to get
the resulting EC point. For an EC over $GF(2^m)$ as represented by Equation 1, the corresponding conversion to
projective coordinates is provided in Equation 2, and the reconversion to affine coordinates is shown by
Equation 3.

$$(X, Y, Z) = (\lambda^c x, \lambda^d y, \lambda) \qquad (2)$$

$$(x, y) = (X/Z^c, Y/Z^d) \qquad (3)$$

Different kinds of projective coordinates can be obtained using different values of $c$ and $d$. For example, $c = 1$
and $d = 2$ result in López-Dahab projective coordinates, while $c = 2$ and $d = 3$ result in Jacobian coordinates.
Since the López-Dahab coordinate system requires fewer finite field multiplications for one point multiplication, we have
chosen it for the hardware implementation [22]. The Montgomery algorithm, presented in Algorithm 1, has been used
for the computation of scalar point multiplication.
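As a short worked example of Equations 2 and 3, substituting the López-Dahab choice $c = 1$, $d = 2$ into Equation 1 and clearing denominators gives the projective form of the curve. The derivation below only restates the equations above (the reduction modulo $F(x)$ is left implicit).

```latex
\begin{align*}
x = \frac{X}{Z},\quad y = \frac{Y}{Z^{2}}
  &\;\Longrightarrow\;
  \frac{Y^{2}}{Z^{4}} + \frac{XY}{Z^{3}}
  = \frac{X^{3}}{Z^{3}} + a\,\frac{X^{2}}{Z^{2}} + b \\
  &\;\Longrightarrow\;
  Y^{2} + XYZ = X^{3}Z + aX^{2}Z^{2} + bZ^{4}
  \qquad \text{(multiplying by } Z^{4}\text{)}
\end{align*}
```

No division appears in the projective equation, which is why PA and PD can be carried out without field inversions until the final conversion back to affine coordinates.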
Algorithm 1 Montgomery algorithm over GF(2^m)
Input: P = (x_p, y_p) ∈ GF(2^m), k = (k_{n−1}, ..., k_1, k_0) with k_{n−1} = 1
Output: Q(x_q, y_q) = k·P
1: // Affine to projective conversion
2: A1 = x_p, B1 = 1, A2 = x_p^4 + b, B2 = x_p^2
3: for i from n − 2 down to 0 do
4:   if k_i = 1 then
5:     T ← B1, B1 ← (A1·B2 + A2·B1)^2, A1 ← x_p·B1 + A1·A2·T·B2
6:     T ← A2, A2 ← A2^4 + b·B2^4, B2 ← T^2·B2^2
7:   else
8:     T ← B2, B2 ← (A1·B2 + A2·B1)^2, A2 ← x_p·B2 + A1·A2·T·B1
9:     T ← A1, A1 ← A1^4 + b·B1^4, B1 ← T^2·B1^2
10:  end if
11: end for
12: // Projective to affine conversion
13: x_q ← A1/B1
14: y_q ← (x_p + A1/B1)·[(A1 + x_p·B1)(A2 + x_p·B2) + (x_p^2 + y_p)(B1·B2)]/(x_p·B1·B2) + y_p
15: return Q(x_q, y_q)
An initial point $P(x_p, y_p)$, along with a scalar multiplier $k$, is required as an input to implement the
Montgomery algorithm. The algorithm computes the coordinates of the point $Q(x_q, y_q)$ as the final output.
The Montgomery algorithm operates in a three-phase process. The first phase is the conversion from affine
coordinates to projective coordinates (López-Dahab) to avoid the inversion cost of the PA and PD computation.
Then, in the second phase, the scalar multiplication is performed, where the PA ($P_{i+1} = P_i + Q_i$) and PD
($P_{i+1} = 2P_i$) instructions are computed depending on the value of $k_i$ according to Algorithm 1. Finally,
reconversion to affine coordinates is done to get the final point $Q$ of the EC after the PA and PD computations.
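The three phases can be illustrated with a small software model of the x-coordinate part of the Montgomery ladder in López-Dahab coordinates. This is a sketch under stated assumptions, not the authors' RTL: the helper names are hypothetical, the scalar bits are processed from the most significant bit downward, the y-coordinate recovery is omitted, and the values of `xp` and `b` in the usage lines are arbitrary test values rather than NIST curve constants.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1  # x^233 + x^74 + 1

def gf_mul(a, b):
    """Shift-and-add product in GF(2^233), reduced modulo F."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    for i in range(acc.bit_length() - 1, M - 1, -1):
        if (acc >> i) & 1:
            acc ^= F << (i - M)
    return acc

def gf_sq(a):
    return gf_mul(a, a)

def gf_inv(a):
    """a^(2^M - 2) by repeated squaring and multiplication."""
    result, power = 1, gf_sq(a)
    for _ in range(M - 1):
        result = gf_mul(result, power)
        power = gf_sq(power)
    return result

def madd(x1, z1, x2, z2, xp):
    """Projective x of P1 + P2, given that P2 - P1 has affine x = xp."""
    t, u = gf_mul(x1, z2), gf_mul(x2, z1)
    z = gf_sq(t ^ u)
    return gf_mul(xp, z) ^ gf_mul(t, u), z

def mdouble(x, z, b):
    """Projective x of 2P: (X^4 + b Z^4 : X^2 Z^2)."""
    x2, z2 = gf_sq(x), gf_sq(z)
    return gf_sq(x2) ^ gf_mul(b, gf_sq(z2)), gf_mul(x2, z2)

def montgomery_ladder_x(k, xp, b):
    """Affine x-coordinate of k*P, or None for the point at infinity."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    a1, b1 = xp, 1                             # (A1 : B1) represents P
    a2, b2 = gf_sq(gf_sq(xp)) ^ b, gf_sq(xp)   # (A2 : B2) represents 2P
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            a1, b1 = madd(a1, b1, a2, b2, xp)
            a2, b2 = mdouble(a2, b2, b)
        else:
            a2, b2 = madd(a1, b1, a2, b2, xp)
            a1, b1 = mdouble(a1, b1, b)
    return None if b1 == 0 else gf_mul(a1, gf_inv(b1))

xp, b = 0x1234567, 0xABCDEF  # arbitrary illustrative values
x7 = montgomery_ladder_x(7, xp, b)
print(montgomery_ladder_x(5, x7, b) == montgomery_ladder_x(35, xp, b))  # True
```

Both branches of the loop execute one addition step and one doubling step, so the operation sequence does not depend on the key bit; this regularity is the property behind the ladder's resistance to timing and simple power analysis attacks.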
3. Hardware design of the elliptic curve cryptography (ECC) accelerator
We have chosen a binary $GF(2^m)$ field for the SPM computation of the ECC by using a polynomial basis
representation with a projective coordinate system to perform efficient finite field multiplications and to reduce
the cost of finite field inversion in hardware. Our ECC hardware accelerator design has a 2-stage pipeline
architecture as shown in Figure 1. It is composed of a memory unit (MU) for storing the results of point
multiplication, multiplexers and demultiplexers for routing purposes, and an arithmetic logic unit (ALU) for
the computation of finite field arithmetic. We also designed a finite state machine (FSM) based dedicated
controller unit (DCU) and pipeline registers to manage the functions and to optimize the critical delay path.
The initial EC parameters ($x_p$, $y_p$, $b$) were chosen from NIST recommended ECs [10].
[Figure 1: block diagram. Recoverable labels: memory unit (A1, A2, B1, B2, TMP1 to TMP4) with addresses R_Addr1, R_Addr2, A_Addr1, A_Addr2 and Write Data; MUX1 and MUX2 (8x1), MUX3 (4x1), MUX4 (3x1), DEMUX1 (1x8); PIPE REG1 and PIPE REG2; arithmetic and logic unit with adder, squarer (SQR, RED) and multiplier (HKM, RED) units; finite state machine based dedicated control unit (FSM-DCU); inputs XPinit, YPinit, b, k; outputs xq, yq.]

Figure 1. Hardware architecture of the proposed elliptic curve processor.

Memory unit: The MU contains an 8 × m register array as shown in Figure 1, where m is the width of each
entry and is equal to the size of the underlying field, which is 233 bits. The main purpose of the MU
is to store the intermediate results (A1-2, B1-2, Tmp1-4) during the computation of the SPM. It also
contains two multiplexers (Multiplexer_1 and Multiplexer_2), as presented in Figure 1. They are
used for reading the operands (Read_1 and Read_2) from the MU as inputs to the ALU, and a single demultiplexer
is used to update the MU registers using the Write_Data control signal.
Multiplexer units: Two additional multiplexers (Multiplexer_3 and Multiplexer_4) are employed for routing
purposes, as presented in Figure 1. The inputs to Multiplexer_3 are the initial EC parameters and an operand
(Read_1) from the MU. The output of Multiplexer_3 is an operand (Read_3) that is the next input to
the ALU. The inputs to Multiplexer_4 are the outputs of the adder, the multiplier, and the squarer; the selected output is
written back to the MU using the demultiplexer.
Arithmetic logic unit: The ALU consists of an adder, a multiplier, and a squarer unit as shown
in Figure 1. The adder is used for the addition operation and is implemented with bitwise
XOR gates, whereas squaring is performed by inserting a 0 after each successive input data bit. Addition
and squaring are performed in a single clock cycle. To multiply two m-bit polynomials, we have utilized a
hybrid approach using the simple Karatsuba multiplier with the classic shift-and-add multiplication algorithm.
When two m-bit polynomials are multiplied or a single m-bit polynomial is squared, the result is up to 2m − 1 bits wide.
Therefore, a finite field reduction (Red) needs to be performed whenever the finite field multiplier or squarer of
the ALU is used [25]. We have used the reduction algorithm over $GF(2^m)$ recommended by NIST as described
in [14]. Finally, the Itoh–Tsujii algorithm is used to perform the field inversion [26].
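To illustrate why the Itoh–Tsujii approach is attractive, the following sketch computes an inverse in GF(2^233) with the textbook form of the algorithm: it builds a^(2^k − 1) along an addition chain for m − 1 = 232 and finishes with one squaring. The chain derived from the binary expansion of 232 and the function names are assumptions for illustration; the paper does not state which chain or cycle schedule its hardware uses.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1  # x^233 + x^74 + 1

def gf_mul(a, b):
    """Shift-and-add product in GF(2^233), reduced modulo F."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    for i in range(acc.bit_length() - 1, M - 1, -1):
        if (acc >> i) & 1:
            acc ^= F << (i - M)
    return acc

def gf_sq(a):
    return gf_mul(a, a)

def itoh_tsujii_inv(a):
    """Return (a^-1, number of field multiplications used)."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    beta, k, mults = a, 1, 0           # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:         # bits of 232 after the leading 1
        t = beta
        for _ in range(k):             # t = beta^(2^k)
            t = gf_sq(t)
        beta, k, mults = gf_mul(t, beta), 2 * k, mults + 1
        if bit == "1":                 # extend the chain by one
            beta, k, mults = gf_mul(gf_sq(beta), a), k + 1, mults + 1
    return gf_sq(beta), mults          # (a^(2^232 - 1))^2 = a^(2^233 - 2)

a = 0x1234567890ABCDEF
inv, mults = itoh_tsujii_inv(a)
print(gf_mul(a, inv), mults)  # 1 10
```

With this chain (1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232) only 10 multiplications are needed; the remaining work consists of squarings, which are cheap in a binary field.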
Pipeline registers: We have used the Montgomery algorithm for the SPM. Two pipeline registers have been
placed at the inputs of the ALU as shown in Figure 1. The addition of pipeline registers allows overlapped
execution of the Montgomery algorithm's PA and PD instructions; therefore, read-after-write (RAW) hazards
may occur. The PA and PD instructions of the Montgomery algorithm have been carefully scheduled as shown
in Table 1 to avoid potential RAW hazards. The sequences of the PA and PD instructions of the Montgomery
algorithm for both the no-pipeline and 2-stage pipeline variants are presented in Table 1. The first column shows
the required clock cycles, while the second one lists the sequence of instructions for the no-pipeline variant.
The corresponding RAW hazards are tabulated in the third column, and the clock cycles and instruction scheduling used for the
2-stage pipeline variant are shown in the fourth and fifth columns, respectively. According to Table 1, the total number of
instructions required for each PA and PD computation in the no-pipeline architecture is 14, and they take 710 clock
cycles for a single PA and PD computation. These 14 instructions include 6 multiplications, and each finite
field multiplication requires $m/2$ clock cycles. For $GF(2^{233})$, this amounts to 117 cycles per multiplication, so a total of 6 × 117 = 702 clock cycles are
needed for the computation of the 6 finite field multiplications. Finite field addition and squaring instructions
consume the remaining 8 clock cycles. In the context of pipelining, the PA and PD instructions have a total of 7
RAW hazards as shown in Table 1. A hazard means that the execution of the current instruction is stalled
until the result of the previous instruction has been written back to the memory unit [13]. Due to the occurrence
of these RAW hazards (third column of Table 1), the instructions shown in the second column of Table
1 require a total of 717 cycles when the same sequence is used for one PA and PD computation. Moreover, the
no-pipeline variant has a longer critical path delay and consequently operates at a lower clock frequency [13].
To optimize the clock frequency and reduce the critical path, we used the proposed sequence of instructions
shown in the fifth column of Table 1. The proposed sequence of instructions requires only 713 clock cycles for
the execution of PA and PD as shown in Table 1.
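The cycle counts quoted above can be checked with a few lines of arithmetic. The sketch assumes that $m/2$ is rounded up to 117 cycles for $m = 233$ (consistent with the 1-117 cycle range in Table 1) and that each RAW stall costs one cycle; the count of 11 single-cycle slots is read from the fourth column of Table 1.

```latex
\begin{align*}
\text{cycles per multiplication} &= \lceil 233/2 \rceil = 117 \\
\text{no-pipeline, one PA and PD} &= 6 \times 117 + 8 = 710 \\
\text{same order with 7 RAW stalls} &= 710 + 7 = 717 \\
\text{reordered 2-stage schedule} &= 6 \times 117 + 11 = 713
\end{align*}
```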
Table 1. Execution scheduling of PA and PD instructions in the 2-stage pipeline.

| Clock cycles | No-pipeline schedule (In_i) | Pipeline hazard | Clock cycles | 2-stage pipeline schedule (In_i) |
|---|---|---|---|---|
| 1-117 | In1 → B1 = A2 × B1 | − | 1-1 | In1 [R] |
| 118-234 | In2 → A1 = A1 × B2 | − | 2-118 | In1 [E, WB], In2 [R] |
| 235-235 | In3 → T1 = A1 + B1 | RAW: A1 | 119-235 | In2 [E, WB], In8 [R] |
| 236-352 | In4 → A1 = A1 × B1 | − | 236-236 | In8 [E, WB], In3 [R] |
| 353-353 | In5 → B1 = T1^2 | − | 237-237 | In3 [E, WB], In4 [R] |
| 354-470 | In6 → T1 = x_p × B1 | RAW: B1 | 238-354 | In4 [E, WB], In5 [R] |
| 471-471 | In7 → A1 = A1 + T1 | RAW: T1 | 355-355 | In5 [E, WB], In11 [R] |
| 472-472 | In8 → B2 = B2^2 | − | 356-356 | In11 [E, WB], In6 [R] |
| 473-473 | In9 → T1 = B2^2 | RAW: B2 | 357-473 | In6 [E, WB] |
| 474-590 | In10 → T1 = b × T1 | RAW: T1 | 474-474 | In7 [R] |
| 591-591 | In11 → A2 = A2^2 | − | 475-475 | In7 [E, WB], In9 [R] |
| 592-708 | In12 → B2 = A2 × B2 | RAW: A2 | 476-476 | In9 [E, WB], In12 [R] |
| 709-709 | In13 → A2 = A2^2 | − | 477-593 | In12 [E, WB], In10 [R] |
| 710-710 | In14 → A2 = A2 + T1 | RAW: A2 | 594-710 | In10 [E, WB], In13 [R] |
| − | − | − | 711-711 | In13 [E, WB] |
| − | − | − | 712-712 | In14 [R] |
| − | − | − | 713-713 | In14 [E, WB] |

Dedicated controller unit: We designed a finite state machine (FSM) based dedicated controller unit (DCU) to
execute control functions such as generating the control signals of the multiplexers and the read-write addresses of the memory
unit. The DCU generated signals are shown as dotted lines with red color in Figure 1. The finite field adder,
squarer, and reduction modules each generate their result in one clock cycle to compute the Montgomery algorithm for
scalar multiplication. Each finite field multiplication requires $m/2$ clock cycles. The total clock cycles for the
no-pipeline and 2-stage pipeline variants are calculated using Equations 4 and 5, respectively, and the resulting
clock cycles are provided in Table 2.

$$\text{Init} + 710 \times (m - 1) + 2 \times \text{Inv} + 3556 \qquad (4)$$

$$\text{Init} + 713 \times (m - 1) + 2 \times \text{Inv} + 3571 \qquad (5)$$


For the no-pipeline and 2-stage pipeline architectures, the affine to projective coordinate conversion
(Init) requires only 6 and 12 clock cycles, respectively. It takes a total of 164,720 clock cycles to perform the PA
and PD computations of all m − 1 = 232 loop iterations in the no-pipeline architecture when the sequence shown in the second column of Table
1 is used for the order of instruction execution. Similarly, for the 2-stage pipeline variant, the total number of
clock cycles required for the PA and PD computations is 165,416 when the sequence of instructions shown in the
fifth column of Table 1 is executed. The projective to affine conversion requires two inverse operations with
additional clock cycles of 3556 and 3571 for the no-pipeline and 2-stage pipeline architectures, respectively.
Each inverse operation (Inv) of the no-pipeline architecture requires 2436 clock cycles, whereas it takes 2524
cycles for the 2-stage pipeline variant. The computation of one scalar multiplication operation requires 173,154 cycles for
the no-pipeline and 174,047 cycles for the 2-stage pipeline architectures.
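Equations 4 and 5 can be evaluated directly from the figures quoted in this section. The short sketch below does so; the function and parameter names are hypothetical and only the numeric inputs come from the text.

```python
def spm_cycles(init, per_iteration, inv, proj_to_affine, m=233):
    """Total cycles: Init + per_iteration * (m - 1) + 2 * Inv + conversion overhead."""
    return init + per_iteration * (m - 1) + 2 * inv + proj_to_affine

no_pipeline = spm_cycles(init=6, per_iteration=710, inv=2436, proj_to_affine=3556)
two_stage = spm_cycles(init=12, per_iteration=713, inv=2524, proj_to_affine=3571)

print(no_pipeline)  # 173154
print(two_stage)    # 174047
```

The pipelined variant therefore spends 893 more cycles per SPM; the benefit of pipelining comes from the shorter critical path and higher clock frequency rather than from a lower cycle count.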
Table 2. Timing information for ECC over $GF(2^{233})$.

| Parameters | No-pipeline | 2-stage pipeline |
|---|---|---|
| Init | 6 | 12 |
| 710 × (m − 1) | 164,720 | − |
| 713 × (m − 1) | − | 165,416 |
| 2 × Inv | 2 × 2436 = 4872 | 2 × 2524 = 5048 |
| Proj-to-affine | 3,556 | 3,571 |
| Total clock cycles | 173,154 | 174,047 |

4. Proposed finite field multiplier
We have used the standard Karatsuba–Ofman multiplier algorithm together with the shift-and-add multiplication
algorithm in our design. The Karatsuba multiplier is much faster than the shift-and-add multiplier, whereas the
shift-and-add multiplier needs much less area. The Karatsuba multiplier avoids some multiplication steps at
the cost of extra additions. The input operands are divided into two equal parts. For example, let the first input be
OP1 of length OP1_length and the second input be OP2 of length OP2_length. OP1 is divided into a
and b, where a contains the upper half of OP1 and b contains the lower half of OP1. Similarly, OP2 is divided into c and d
in the same way. Then, a is multiplied with c, and b is multiplied with d using the shift-and-add algorithm.
The next step in the computation is the addition of a to b and of c to d. The sum of a and b is multiplied with
the sum of c and d in the following step. These multiplications are performed in parallel, which increases the overall
speed. Afterward, we subtract the bd and ac multiplication results from the product of the two sums,
shift the result left by OP1_length/2 bit positions, and add it to the multiplication result of a and c, which is itself shifted left by OP1_length bit positions. Finally, we
add the result of the previous step to the multiplication result of b and d to obtain the final result. The
implemented Karatsuba and shift-and-add algorithms are shown in Algorithm 2 and Algorithm 3, respectively.
Also, a flowchart representation of our proposed hybrid Karatsuba multiplier (HKM) is shown in Figure 2.
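The split, multiply, and recombine steps described above can be modelled in software as a single Karatsuba level on top of shift-and-add multipliers. The sketch below is an illustration under assumptions: operands are carry-less polynomials over GF(2) stored as integers, the lower half is taken to be 117 bits wide for the odd length 233, and the function names are hypothetical. The final reduction modulo the field polynomial is left out.

```python
def shift_add_mul(a: int, b: int) -> int:
    """Carry-less shift-and-add product of two GF(2) polynomials."""
    result, i = 0, 0
    while b >> i:
        if (b >> i) & 1:
            result ^= a << i
        i += 1
    return result

def hybrid_karatsuba_mul(op1: int, op2: int, length: int = 233) -> int:
    """One Karatsuba level whose three sub-products use shift-and-add."""
    h = (length + 1) // 2             # width of the lower halves (assumption)
    mask = (1 << h) - 1
    a, b = op1 >> h, op1 & mask       # upper and lower half of OP1
    c, d = op2 >> h, op2 & mask       # upper and lower half of OP2
    s1, s2 = a ^ b, c ^ d             # additions are XORs in GF(2)
    m1 = shift_add_mul(b, d)
    m2 = shift_add_mul(a, c)
    m3 = shift_add_mul(s1, s2)
    middle = m3 ^ m1 ^ m2             # subtraction is also XOR
    return (m2 << (2 * h)) ^ (middle << h) ^ m1

op1 = (1 << 232) | 0xDEADBEEF
op2 = (1 << 200) | (0x1234567 << 100) | 0x9ABCDEF
print(hybrid_karatsuba_mul(op1, op2) == shift_add_mul(op1, op2))  # True
```

Because the three half-size products are independent, a hardware implementation can run them side by side; each bit-serial sub-multiplier then only has to step through a half-length operand, which is consistent with the reported cost of $m/2$ clock cycles per multiplication.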

[Figure 2: flowchart. OP1 is split into a = OP1[MSB : OP1_Length/2] and b = OP1[OP1_Length/2 : LSB]; OP2 is split into c = OP2[MSB : OP2_Length/2] and d = OP2[OP2_Length/2 : LSB]. Three shift-add multipliers compute M1 = b × d, M2 = a × c, and M3 = S1 × S2, where S1 and S2 are the sums of the halves. The term M3 − M2 − M1 is shifted left by OP1_Length/2, M2 is shifted left by OP1_Length, and both are combined with M1 to form the final result.]

Figure 2. The proposed hybrid Karatsuba multiplier architecture.

Algorithm 2 Standard Karatsuba multiplier algorithm.
Input: OP1 (bit-length = OP1_length), OP2 (bit-length = OP2_length)
Output: OP1 × OP2
1: Step 1: a ← OP1[MSB : (OP1_length/2)]
2:         b ← OP1[(OP1_length/2) − 1 : 0]
3:         c ← OP2[MSB : (OP2_length/2)]
4:         d ← OP2[(OP2_length/2) − 1 : 0]
5: Step 2: S1 ← a ⊕ b
6:         S2 ← c ⊕ d
7: Step 3: M1 ← ShiftAdd(d × b)
8:         M2 ← ShiftAdd(a × c)
9:         M3 ← ShiftAdd(S1 × S2)
10: Step 4: R ← M3 − M1 − M2
11:         R ← R << (OP1_length/2)
12:         R ← R ⊕ (M2 << OP2_length)
13:         R ← R ⊕ M1

Algorithm 3 Shift-and-add multiplier algorithm.
Input: OP1 (bit-length = OP1_length), OP2 (bit-length = OP2_length)
Output: OP1 × OP2
1: for i = OP1_length − 1 down to 0 do
2:   if (OP1[i]) then Result = Result ⊕ (OP2 << i)
3:   i = i − 1
4: end for

5. Hardware implementation of the elliptic curve crypto-processor
We developed a Verilog model for the ECC accelerator at the RTL level. Then, we simulated the model using
the Xilinx Vivado design suite and verified the correct functionality. In the testbench, we set a 233-bit test key
based on alternating 1s and 0s (0x155...55) and used a 50 MHz test clock frequency for functional verification
as shown in Figure 3. We have used test vectors from the NIST recommended standard ECs to verify
and confirm the correct operation. We have synthesized both the 2-stage pipeline and the no-pipeline variants
and implemented them on various FPGAs to obtain the area and power estimations for comparison with the
literature. The synthesis results showed promising improvements in area utilization, maximum achievable clock
frequency, and energy efficiency. The proposed ECC accelerator consumes 3.09 times fewer hardware resources
than the design in [27], and 1.52 times fewer hardware resources than the design introduced in [28]. Our design
achieves an operational clock frequency of 119 MHz on a Xilinx Virtex-5 device, which is 1.5 times higher than
the implementation in [27], which could achieve only 79 MHz.
Our accelerator design with 2-stage pipeline for ECC over $GF(2^{233})$ has been implemented on a Xilinx
Artix-7 FPGA device (XC7A35TICPG238-1L). Table 3 presents the implementation results, in which the no-pipeline architecture occupies only 4001 Slice LUTs and 2933 Slice FFs and achieves an operational clock frequency
of 89 MHz. For the computation of one SPM operation, the no-pipeline variant requires 1945 µs with an estimated
power consumption of 87 mW. The proposed ECC accelerator with 2-stage pipeline uses 4467 Slice LUTs and 3399
Slice FFs, and achieves a maximum frequency of 143 MHz with an estimated power consumption of 106 mW.
Moreover, it only requires 1217 µs for one SPM computation. We calculated the energy efficiency per clock
cycle using $E = \frac{\text{Power} \times \text{Duration}}{\text{Clock cycles}}$,
and the calculated results are promising for both the no-pipeline and 2-stage
pipeline variants. Although the 2-stage pipeline variant uses 466 additional Slice LUTs and Slice FFs in contrast to
the no-pipeline counterpart, the energy efficiency is better due to the higher clock frequency. In the computation
of one scalar point multiplication, the 2-stage pipeline variant achieves 38% higher clock frequency and 24%
lower execution time. To summarize, the proposed 2-stage pipeline architecture is 1.31 times faster than the
no-pipeline architecture at the expense of 18% higher power consumption.
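The energy-per-cycle figures can be reproduced from the power, duration, and cycle counts reported in Table 3. The snippet below is a small sketch of that calculation; the function name and the unit handling are assumptions, and only the numeric inputs come from the table.

```python
def energy_per_cycle_pj(power_mw: float, duration_us: float, cycles: int) -> float:
    """E = Power x Duration / Clock cycles, returned in picojoules per cycle."""
    energy_nj = power_mw * duration_us   # mW x us = nJ
    return energy_nj * 1000.0 / cycles   # nJ -> pJ, then per cycle

no_pipeline = energy_per_cycle_pj(87.739, 1945, 173154)
two_stage = energy_per_cycle_pj(106.852, 1217, 174047)

print(int(no_pipeline))  # 985
print(int(two_stage))    # 747
```

Even though the pipelined variant draws more power, it finishes the SPM sooner, so the energy spent per clock cycle, and per SPM, is lower.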

Figure 3. RTL Simulation results of the proposed ECC processor.

Table 3. Implementation results for Xilinx Artix-7 FPGA

| Architecture | Clock cycles | Slice LUTs | Slice FFs | Freq (MHz) | Time (µs) | Power (mW) | Energy (pJ/cycle) |
|---|---|---|---|---|---|---|---|
| no-pipeline | 173,154 | 4001 | 2933 | 89 | 1945 | 87.739 | 985 |
| 2-stage pipeline | 174,047 | 4467 | 3399 | 143 | 1217 | 106.852 | 747 |

The proposed ECC accelerator implements the first three layers of the ECC as a high-speed, area-efficient, and
power-efficient hardware module. The fourth layer is more convenient for software implementations. To enable
further testing of the ECC accelerator and also the Layer-4 operation, we have created a customized version
of the ECC accelerator IP core and integrated it with the Xilinx MicroBlaze CPU to create an ECC processor.
The customized version of the ECC accelerator operates as a peripheral of the MicroBlaze CPU over the AXI4
interface. Data read-write operations are performed through the memory-mapped registers of the AXI4 interface,
which are accessible by a custom-developed driver software written in C using the Xilinx software development
kit (SDK). Finally, we have also added a UART-Lite peripheral to the ECC processor to enable FPGA-PC
communication as shown in Figure 4.
[Figure 4: block diagram. Recoverable labels: MicroBlaze CPU with local memory, JTAG debug, reset generator, and clock generator; AXI interconnect connecting the AXI UART-Lite (RX, TX) and the AXI ECC accelerator.]

Figure 4. Hardware architecture of the ECC crypto-processor system.

6. Performance evaluation
We synthesized the ECC accelerator design with 2-stage pipeline on Virtex-5 (XC5VFX200T) and Virtex-7
(XC7VX690T) FPGA devices using the Xilinx ISE design suite for performance comparison with the recent
state-of-the-art solutions. Table 4 shows the performance evaluation and comparison results. Comparing our
architecture with the other architectures that support 163-bit key lengths, the work in [27] utilizes 16,116 FPGA
LUTs and 5,341 FPGA slices and achieves an operational clock frequency of 79 MHz on Virtex-5. We achieved
better memory utilization as a result of the smaller number of storage elements used. Our ECC accelerator
design with 2-stage pipeline needs only 1728 FPGA slices, which is 3.09 times fewer than the hardware resources
consumed by [27]. Also, when compared to the design in [27], which operates at 79 MHz, our design achieved a
higher clock frequency of 119 MHz on the same Virtex-5 device, thanks to its 2-stage pipeline architecture. The
design in [28] utilizes 10,128 FPGA LUTs and 3,657 FPGA slices on a Virtex-7 FPGA device. In contrast, our
ECC accelerator with 2-stage pipeline consumes 1.52 times (ratio of 3,657 to 2,403) fewer hardware resources
(FPGA slices) and still achieves a higher clock frequency (148 vs. 135 MHz) when compared to [28] for the same
FPGA device. The work in [29] utilizes 10,863 FPGA slices on a Virtex-5 (XC5VLX50) FPGA for $GF(2^{233})$.
Our proposed design requires 6.29 times (ratio of 10,863 to 1,726) fewer hardware resources when compared
to [29]. Although the design in [16] is 6% faster than ours, its area cost is quite high. It uses
11,849 slices on the target FPGA, whereas our design with 2-stage pipeline consumes only 2403 FPGA
slices, which is 4.93 times smaller, and performs almost as fast. Although the latency (computational duration)
of the proposed ECC accelerator for one scalar point multiplication is high as an inevitable drawback of its
pipeline architecture, our design outperforms existing solutions in every other aspect, especially in terms of area
utilization and hardware eﬀiciency. Consequently, proposed ECC accelerator is a high-speed, low power, and
small footprint hardware design, which is very promising for use in emerging resource-constrained embedded
systems, such as IoT applications, as a standalone IP core, or as a peripheral integrated with a microprocessor,
such as Microblaze, ARM, or RISC-V.
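The area ratios quoted in this section can be checked directly from the FPGA slice counts listed in Table 4. The short sketch below does so; the dictionary keys and the helper name `area_ratio` are illustrative only.

```python
# FPGA slice counts as listed in Table 4.
SLICES = {
    "[27]": 5_341,
    "[28]": 3_657,
    "[29]": 10_863,
    "[16]": 11_849,
    "this work (Virtex-5)": 1_728,
    "this work (Virtex-7)": 2_403,
}


def area_ratio(other, ours):
    """Slice count of a reference design divided by ours, to two decimals."""
    return round(SLICES[other] / SLICES[ours], 2)


if __name__ == "__main__":
    pairs = [
        ("[27]", "this work (Virtex-5)"),
        ("[28]", "this work (Virtex-7)"),
        ("[29]", "this work (Virtex-5)"),
        ("[16]", "this work (Virtex-7)"),
    ]
    for other, ours in pairs:
        print(f"{other} vs. {ours}: {area_ratio(other, ours)}x")
```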
Table 4. Comparison with ECC accelerator FPGA implementations in the literature.

Ref.       Device      Key length  FPGA LUTs  FPGA slices  Fclk (MHz)  Time (µs)
[27]       XC5VLX330T  163         16,116     5,341        79          17.0
[28]       XC7VX690T   163         10,128     3,657        135         25.0
[29]       XC5VLX50    233         10,710     10,863       -           -
[16]       XC7VX690T   233         21,453     11,849       157         -
This work  XC5VFX200T  233         6,912      1,728        119         1,462
This work  XC7VX690T   233         9,612      2,403        148         1,175

7. Conclusion
In this work, we presented the design of an ECC accelerator with a 2-stage pipeline for hardware-efficient
computation of the scalar point multiplication (SPM), which is the performance bottleneck. To
optimize the SPM, we introduced a new hybrid scalar point multiplier based on a balanced combination of the
standard Karatsuba multiplier, which offers high speed, and the shift-and-add multiplier, which offers a smaller
area. The proposed hybrid Karatsuba multiplier is designed and implemented mainly for lightweight embedded
applications and consumes fewer hardware resources than bit-parallel multiplier architectures, at a penalty of
$\left(\frac{m}{2} - 1\right)$ clock cycles for an $m$-bit ECC key length. To avoid potential read-after-write hazards
in the pipeline, the point addition and point doubling instructions are carefully ordered, which also
saves clock cycles. The proposed ECC accelerator with the 2-stage pipeline is 1.31 times faster than the no-pipeline
variant and outperforms the other designs in the literature in terms of FPGA resource utilization and maximum
achievable clock frequency. We have created an ECC processor by integrating our design with a synthesizable
CPU, providing a hardware platform for the future development of ECC fourth-layer applications
that use key exchange protocols. Our keystone contribution is the design of an efficient and lightweight ECC
accelerator architecture that combines high performance with a compact footprint suitable for FPGA and ASIC
implementations.
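The idea of combining Karatsuba splitting with shift-and-add multiplication can be illustrated in software. The sketch below is not the hardware architecture proposed in this work: it applies a single Karatsuba split and computes the three half-size products with a bit-serial shift-and-add routine. It assumes the NIST reduction polynomial $x^{233} + x^{74} + 1$ for GF($2^{233}$), and all function names are hypothetical.

```python
M = 233
# Assumed reduction polynomial (NIST binary field): x^233 + x^74 + 1.
F = (1 << 233) | (1 << 74) | 1


def shift_and_add(a, b):
    """Carry-less product in GF(2)[x], consuming one bit of b per step."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def karatsuba_one_level(a, b, m=M):
    """One Karatsuba split; the three sub-products use shift-and-add."""
    h = (m + 1) // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    low = shift_and_add(a0, b0)
    high = shift_and_add(a1, b1)
    # Middle term: (a0 + a1)(b0 + b1) + a0*b0 + a1*b1 over GF(2).
    mid = shift_and_add(a0 ^ a1, b0 ^ b1) ^ low ^ high
    return (high << (2 * h)) ^ (mid << h) ^ low


def reduce_mod_f(c, f=F, m=M):
    """Reduce polynomial c modulo f by clearing bits from the top down."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)
    return c


def gf_mul(a, b):
    """Field multiplication in GF(2^233) under the assumed polynomial."""
    return reduce_mod_f(karatsuba_one_level(a, b))
```

The split replaces one full-width product with three half-width products, which is the source of the Karatsuba area and speed trade-off; computing each sub-product bit-serially keeps the logic small at the cost of additional iterations.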
References
[1] Wadhe A, Sabhle N. Mobile SMS banking security using elliptic curve cryptosystem in binary field. Journal of Engg
Res & App (IJERA) 2019; 3 (3): 413-420.
[2] Bai T P, Rabara S, Jerald A. Elliptic curve cryptography based securing framework for Internet of Things and
cloud computing. In: WSEAS 2015 Conference on Recent Advances on Computer Engineering; Seoul, South Korea;
2015. pp. 65-74.
[3] Marin L, Pawlowski MP, Jara A. Optimized ECC implementation for secure communication between heterogeneous
IoT devices. Sensors (Switzerland) 2015; 15 (9): 21478-21499.
[4] Koblitz N. Elliptic curve cryptosystems. Mathematics of Computation 1987; 48 (Jan): 203-209.
[5] Koblitz AH, Koblitz N, Menezes A. Elliptic curve cryptography: the serpentine course of a paradigm shift. Journal
of Number Theory 2011; 131 (5): 781-814.
[6] Rashid M, Imran M, Jafri AR. Comparative analysis of flexible cryptographic implementations. In: 2016 11th
International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC); Tallinn, Estonia;
2016. pp. 1-6.
[7] Trujillo-Olaya V, Sherwood T, Koç Ç K. Analysis of performance versus security in hardware realizations of small
elliptic curves for lightweight applications. Journal of Cryptographic Engineering 2012; 2 (3): 179-188.
[8] Montgomery P. Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation 1987;
48 (Jan): 243-264.
[9] Imran M, Rashid M. Architectural review of polynomial bases finite field multipliers over GF($2^m$). In: 2017
Proceedings of IEEE International Conference on Communication, Computing and Digital Systems; Islamabad,
Pakistan; 2017. pp. 331-336.
[10] Kerry CF, Secretary A, Director CR. FIPS PUB 186-4 Federal Information Processing Standards Publication Digital
Signature Standard (DSS), 2013.
[11] Karatsuba A, Ofman Y. Multiplication of multidigit numbers on automata. Soviet Physics Doklady 1963; 7: 595-596.
[12] Patel Z. Enhancing speed and reducing power of shift and add. International Journal Of Electrical, Electronics And
Data Communication 2016; 4(Jun): 13-17.
[13] Rashid M, Imran M, Jafri AR, Kashif M. A Throughput/Area Optimized Pipelined Architecture for Elliptic Curve
Crypto Processor. IET Computers & Digital Techniques 2019; 13 (5): 361-368.

[14] Hankerson D, Menezes A, Vanstone S. Guide to Elliptic Curve Cryptography. NY, USA: Springer-Verlag, 2004.
[15] Imran M, Kashif M, Rashid M. Hardware design and implementation of scalar multiplication in elliptic curve
cryptography (ECC) over GF($2^{163}$) on FPGA. In: ICICT 2015 International Conference on Information and
Communication Technologies; Karachi, Pakistan; 2015. pp. 4-7.
[16] Imran M, Shehzad F. FPGA Based Crypto Processor for Elliptic Curve Point Multiplication ECPM over GF($2^{233}$).
International Journal for Information Security Research (IJISR) 2017; 7 (3): 706-713.
[17] Khan Z U A, Benaissa M. High Speed ECC Implementation on FPGA over GF($2^m$). IEEE Transactions
on Very Large Scale Integration Systems 2017; 25 (3): 165-176.
[18] Bonifus PL, George D. ECC Encryption System Using Encoded Multiplier and VEDIC Mathematics. International
Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering 2013; 2 (11): 5531-5538.
[19] Sutter G, Deschamps J, Imana J. Eﬀicient elliptic curve point multiplication using digit-serial binary field operations.
IEEE Transactions on Industrial Electronics 2013; 60 (Jan): 217-225.
[20] Bednara M, Grabbe C, Teich J, Gathen JVZ, Shokrollahi J. FPGA Designs of Parallel High Performance GF($2^{233}$)
Multipliers. In: ISCAS 2003 Proceedings of the 2003 IEEE International Symposium on Circuits and Systems;
Bangkok, Thailand; 2003. pp. II-268-II-271.
[21] Fan H, Hasan M. A survey of some recent bit-parallel GF($2^n$) multipliers. Finite Fields and their Applications
2015; 32: 5-43.
[22] Jafri AR, Islam M, Imran M, Rashid M. Towards an Optimized Architecture for Unified Binary Huff Curves. Journal
of Circuits, Systems and Computers 2017; 26 (11): 175-178.
[23] Imran M, Rashid M, Shafi I. Lopez Dahab based elliptic crypto processor (ECP) over GF($2^{163}$) for low-area
applications on FPGA. In: ICEET 2018 International Conference on Engineering and Emerging Technologies;
Lahore, Pakistan; 2018. pp. 1-6.
[24] Khan Z U, Benaissa M. High speed ECC implementation on FPGA over GF($2^m$). In: FPL 2015 25th International
Conference on Field Programmable Logic and Applications; London, UK; 2015. pp. 1-6.
[25] Halak B, Waizi S S, Islam A. A Survey on Hardware Implementations of Elliptic Curve Cryptosystems. IACR
Cryptology ePrint Archive 2016; 16 (712).
[26] Itoh T, Tsujii S. A Fast algorithm for Computing Multiplicative Inverse in GF($2^m$) Using Normal Bases.
Information and Computation 1988; 78 (3): 171-177.
[27] Benselama ZA, Bencherif MA, Khorissi N, Bencherchali MA. Low cost reconfigurable Elliptic Crypto-hardware. In:
AICCSA 2014 Proceedings of IEEE/ACS International Conference on Computer Systems and Applications; Doha,
Qatar; 2014. pp. 788-792.
[28] Imran M, Shafi I, Jafri A, Rashid M. Hardware design and implementation of ECC based crypto processor for
low-area-applications on FPGA. In: ICOSST 2017 Proceedings of 2017 International Conference on Open Source
Systems and Technologies; Lahore, Pakistan; 2018. pp. 54-59.
[29] Panchbhai M, Ghodeswar US. Implementation of point addition & point doubling for Elliptic Curve. In: ICCSP
2015 International Conference on Communication and Signal Processing; Chengdu, China; 2015. pp. 746-749.


