A vendor-neutral unified core for cryptographic operations in GF(p) and GF(2^m) based on Montgomery arithmetic by Schramm, Martin et al.
Research Article
A Vendor-Neutral Unified Core for Cryptographic Operations in
GF(p) and GF(2^m) Based on Montgomery Arithmetic
Martin Schramm,1,2 Reiner Dojen,1 and Michael Heigl2
1Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland
2Institute ProtectIT, Deggendorf Institute of Technology, 94469 Deggendorf, Germany
Correspondence should be addressed to Martin Schramm; martin.schramm@th-deg.de
Received 6 October 2017; Revised 14 March 2018; Accepted 17 May 2018; Published 21 June 2018
Academic Editor: Fawad Ahmed
Copyright © 2018 Martin Schramm et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
In the emerging IoT ecosystem, in which internetworking will reach an entirely new dimension, the crucial role of efficient security
solutions for embedded devices is beyond dispute. Typically, IoT-enabled devices are equipped with integrated circuits,
such as ASICs or FPGAs, to perform highly specific tasks. Such devices must have cryptographic layers implemented and must be able
to access cryptographic functions for encrypting/decrypting and signing/verifying data using various algorithms, and to generate true
random numbers, random primes, and cryptographic keys. Given the limited resources that typical IoT devices
exhibit due to energy efficiency requirements, hardware structures that are efficient in terms of time, area, and power consumption
must be deployed. In this paper, we describe a scalable word-based multivendor-capable cryptographic core that is able to perform
arithmetic operations in prime and binary extension finite fields based on Montgomery Arithmetic. The functional range comprises
the calculation of modular additions and subtractions, the determination of the Montgomery Parameters, and the execution of
Montgomery Multiplications and Montgomery Exponentiations. A prototype implementation of the adaptable arithmetic core is
detailed. Furthermore, the decomposition of cryptographic algorithms to be used together with the proposed core is stated and a
performance analysis is given.
1. Introduction
The next generation of embedded systems and IoT devices
will exhibit a much higher degree of internetworking which
gives rise to security considerations [1]. As a logical con-
sequence, such devices must become cryptographic nodes,
besides others, being capable of encrypting/decrypting and
signing/verifying data as well as establishing spontaneous
secured communications by exchanging common secrets
used for secret key calculation. While many embedded chips
already have support for hardware-accelerated symmetric
algorithms (mainly AES) [2] and hash functions, due to var-
ious reasons, such as complexity, space, and costs, they lack
in hardware support especially for supporting a wide range
of public-key and key exchange algorithms with different
precision widths. Besides, many modern cryptographic primitives
necessitate the capability for producing true random
numbers and random prime numbers. Typical IoT devices
furthermore very often only exhibit a limited amount of
resources which requires efficient cryptographic hardware
structures in terms of area, power consumption, and calcula-
tion performance [3]. In general enterprises developing IoT
products basically have three options to include application
functionalities in high integrated devices, using Application
Specific Standard Products (ASSP), Application Specific Inte-
grated Circuits (ASIC), or Field Programmable Gate Arrays
(FPGA). Today FPGAs have become promising components
for IoT applications [4]: ASSP solutions often cannot provide
the required functionality, and FPGAs can provide a better
Total Cost of Ownership (TCO) than ASIC solutions. Thus, for
devices which are equipped with an FPGA device, it is
valuable to examine how efficient hardware
structures for performing cryptographic operations can be
included.
In matters of algorithm agility an arithmetic engine with
minimal hardware footprint, which can handle the arithmetic
Hindawi
Security and Communication Networks
Volume 2018, Article ID 4983404, 18 pages
https://doi.org/10.1155/2018/4983404
operations of a great variety of cryptographic algorithms,
is of great importance for IoT based devices. Especially the
calculability of the individual operations leading to lower and
upper calculation time bounds is quite important.
This paper proposes a small-footprint vendor-neutral cryptographic
arithmetic core, implemented by way of example in FPGA logic.
For efficiency, Montgomery Arithmetic is used for time-intensive
modular operations such as multiplication and exponentiation.
Without the need for any expensive software precalculations, the
core is able to perform a large number of cryptographic algorithms
and handle various key sizes by simply processing operation lists.
Furthermore, the core architecture is unified and can perform
calculations in both prime finite fields ($GF(p)$) and binary
extension fields ($GF(2^m)$). To illustrate the versatility of the
developed core, well-established cryptographic algorithms have been
rewritten and fragmented into operation lists to be processed
by the arithmetic engine.
The paper is organized as follows. Section 2 states the re-
lated work of this research. In Section 3 the design of the pro-
posed Enhanced Montgomery Multiplication Core is stated;
the specified functional range of the core is given in Section 4.
In Section 5 some exemplary application descriptions for
the core are mentioned and in Section 6 the results of the
performance analysis are stated. Finally, Section 7 concludes
the paper.
2. Related Works
The efficiency of cryptographic algorithms when imple-
mented on reconfigurable hardware is mainly determined by
the fact of how the underlying finite field arithmetic oper-
ations are realized [5]. Several applications in cryptography
such as ciphering and deciphering of asymmetric algorithms,
the creation and verification of digital signatures, and secure
key exchange mechanisms require excessive use of the basic
finite field modular arithmetic operations addition, multi-
plication, and the calculation of the multiplicative inverse.
Especially the field multiplication operation is crucial to the
efficiency of a design, since it is the core operation of many
cryptographic algorithms [6].
In [7] P. L. Montgomery introduced a representation of
residue classes in order to speed up modular multiplications
without affecting modular additions and subtractions. Over
the years numerous designs have been proposed implementing
modular multiplications based on Montgomery's multiplication
algorithm [8]. The foundation for these architectures was
presented by A. Tenca and Ç. Koç in [9]. The architecture is
based on a word-based Montgomery Multiplication algorithm for
prime finite fields in which multiplications are
performed in a bit-serial fashion. E. Savaş et al. in [10] have
proposed an extension which, in addition to the standard
integer modulo arithmetic, also allows polynomial computa-
tions over binary finite fields. An overview about algorithms
and hardware architectures for Montgomery Multiplication
can be found in [11]. Optimizations of the original design have
been proposed concerning the hardware implementation
of the Montgomery Multiplication algorithm [12] as well
as by utilizing special arithmetic hardcore extensions of
FPGAs to accelerate digital signal processing applications
[13]. Some designs only focus on utilizing the Montgomery
Multiplication method to accelerate modular exponentiation
operations as required by the RSA algorithm [14, 15].
However, no publication focuses on how the Mont-
gomery Multiplication architecture can be embedded into a
comprehensive solution. In this paper we propose an
enhanced version of a bit-serial word-based unified Mont-
gomery Multiplication core based on logic elements only
which is controlled by a state machine and offers the func-
tional range to be able to perform complete cryptographic
algorithms without additional complex processing required
in software.
3. Enhanced Montgomery Multiplication Core
3.1. Requirements. Today a high number of different public-
key algorithms are in use. To ensure compatibility, crypto-
graphic applications must support a large portion of those
algorithms. While typical software implementations often
can easily be upgraded in order to adapt new algorithms and
larger key sizes, the same is not necessarily true for hardware
implementations. Therefore the following requirements have
been identified for the Enhanced Montgomery Multiplication
Core:
(i) Use of Montgomery Arithmetic. The design must be
able to perform modulo operations in a time-efficient
manner by using Montgomery Arithmetic. At a minimum,
the core must support Montgomery Multiplications
and Montgomery Exponentiations. Furthermore, the
core must support standard modulo additions and
modulo subtractions.
(ii) Works on Both Finite Fields $GF(p)$ and $GF(2^m)$. The
architecture must exhibit a unified structure supporting
both standard integer modulo operations of
prime finite fields and polynomial calculations
of binary finite fields.
(iii) Montgomery Parameter Calculation. In general the
Montgomery Parameters ($r$ and $r^2$) can be precomputed
for previously known moduli. However, as a
requirement the core must be able to handle arbitrary
moduli. Therefore it must be capable of calculating
the Montgomery Parameters $r \bmod n$, $r^2 \bmod n$ and
$r^{-1} \bmod n$ without the need for precalculations done
in software.
(iv) Scalable Design. The architecture must be scalable in
terms of timing, area, and power consumption. This
includes the parametrisation of the word width, the
internal storage size, and the amount of processing
units within the pipeline.
(v) Multialgorithm Support. The core must be based on a
building-block design. The functional range provided
by the arithmetic unit should enable algorithm
agility by fragmenting cryptographic algorithms into
a list of core operations. At a minimum, the core must be
capable of performing RSA [16] operations, (safe)
prime number generation and primality testing (MR)
[17, 18], key exchange operations (DH) [19], and elliptic
curve calculations (EC) [20] over both prime and
binary finite fields.

Figure 1: Overall architecture of the Enhanced Montgomery Multiplication Core.

(vi) Supporting as Many Precision Widths as Possible.
The design must support a wide range of different
precision widths determining the security level of
the cryptographic algorithm. If a certain security
level, due to increased attacking computing power,
becomes inadequate, the precision width can be
adjusted accordingly which makes the hardware less
prone to become obsolete due to higher security
demands. The core must support the current recom-
mendations for minimum key sizes [21] and should
also support larger key sizes. For RSA and Diffie-Hellman
key exchange support, the architecture should be able to
handle moduli with precisions of up to 4096 bits. For
elliptic curve cryptography support, precisions of up to
512 bits for prime finite fields and up to 571 bits for
binary finite fields should be possible.
(vii) Time-Invariant Operations. The architecture must
be capable of performing its operations in a time-
invariant manner. If security sensitive information,
such as private keys, will be processed, it must be
ensured that all operations exhibit the same execution
time to prevent side-channel attacks based on timing
analysis.
3.2. Overall Core Architecture. Figure 1 illustrates the overall
architecture of the proposed Enhanced Montgomery Multi-
plication Core which is capable of meeting all requirements
as specified above.
Besides the pipeline of processing units handling the
main part of the word-based Montgomery Multiplication
algorithm, the core features an enhanced word-based Carry
Look-Ahead adder being responsible for the calculation of
the final result after the pipeline has processed all bits of an
operand as well as for performing single modular addition
and subtraction operations. The register files of the original
design have been replaced with an internal dual-ported RAM
which holds the operands as well as intermediate results of
the core operations. Furthermore a word-based comparator
component has been described which is queried during
operations to decide if a modular addition or subtraction
step must be performed. Two additional $r$-bit words for the
$A$ operand and the exponent $E$ have been introduced, with
$|r|$ being the RAM width ($|r| = 4 \cdot |w|$); they are
fetched from RAM during Montgomery Multiplication and
Montgomery Exponentiation operations. An auxiliary $w$-bit
word $aux$ is used for RAM reorganisation operations as well
as for the calculation of the Montgomery Parameters $r$ and $r^2$.
The intelligence of the core is the controlling state
machine, which utilizes the defined components to perform
standard modular addition and subtraction operations,
Montgomery Multiplications, Montgomery Exponentiations,
Montgomery Parameter calculation, and RAM reorganisation
operations. Therefore it is responsible for controlling
the RAM write and read access, the source and destination
address signals of RAM, as well as the values passed through
to the first processing unit, to the CLA adder, and to the com-
parator component. Furthermore it controls the assignments
of 𝐴 operand, 𝐸 exponent, and 𝑎𝑢𝑥 words.
The described core can be parametrised in three ways. The
parameter named MAX_PRECISION_WIDTH specifies the
highest supported precision width $|p|$, whereas the parameter
WORD_WIDTH specifies the word width $|w|$ of the
operands involved in the calculations. These two parameters
determine the size and the address space of the internal
core RAM. The third parameter, MAX_NUM_PUS, specifies
the maximum number of processing units of the pipeline
implemented for a specific core variation, which mainly affects
the performance and the size in terms of area consumption.

Figure 2: Processing unit with word size $|w| = 4$.

3.2.1. Processing Units. The heart of the core is the pipeline
of processing units implementing the multiple word version
of the Montgomery Multiplication algorithm. Therefore the
processing unit structure has been described from scratch.
The processing unit can be held in reset and keeps track
of the cycle number according to the number of words to
be processed depending on the supplied parameters. This
control logic is needed to determine whether the supplied
modulus has to be added to the processed words in this cycle
or not, depending on the value of the signal 𝑝𝑎𝑟 denoting an
odd intermediate result. Note that buffering the output of a
processing unit between two processing units is not required
in this design. Compared to the original design presented
in [10], for a given precision width $|p|$ and a word size $|w|$,
$e = \lceil |p|/|w| \rceil + 1$ words are required for a unified
solution, and the pipeline must consist of a power-of-two
($2^x$) number of processing units, with a maximum of
$2^x < (e - 1)$, in order to avoid pipeline stalls. Figure 2 illustrates
the internal architecture of an exemplary processing unit with
word size $|w| = 4$.
Each processing unit consists of a cascade of two layers
of so-called Unified Full Adder (UFA) cells. The Unified Full
Adder cells basically consist of simple full adder cells which
have been enhanced by an additional finite field selection
input $f_{sel}$. This allows for the creation of a unified multiplier
architecture which can be used not only in prime fields $GF(p)$
($f_{sel} = 1$) but also in binary fields $GF(2^m)$ ($f_{sel} = 0$), in which
additions are simple bitwise XOR calculations without any
carry output.
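
The behaviour of a single UFA cell can be sketched in a few lines of Python. This is only an illustrative model: the function name `ufa` and the choice to suppress the carry by AND-ing the majority function with `f_sel` are assumptions made for this sketch, not a description of the actual gate-level design.

```python
def ufa(a, b, c_in, f_sel):
    """Model of one Unified Full Adder cell (hypothetical helper).

    f_sel = 1 selects GF(p): behaves like an ordinary full adder.
    f_sel = 0 selects GF(2^m): the carry output is forced to zero, so a
    chain of cells computes a plain bitwise XOR.
    Returns (sum_bit, carry_bit).
    """
    sum_bit = a ^ b ^ c_in
    # Assumption: the carry is the majority function gated by f_sel.
    carry_bit = f_sel & ((a & b) | (a & c_in) | (b & c_in))
    return sum_bit, carry_bit
```

With `f_sel = 0` no cell ever produces a carry, so the carry inputs of all following cells stay at zero and the sum output reduces to the XOR of the two operand bits.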
3.2.2. Carry Look-Ahead Adder. Since the pipeline generates
the result in carry save form, an additional step is necessary
at the end of each calculation to obtain a nonredundant
version of the result. For the sake of uniformity a circuit
is required that can operate in both finite fields 𝐺𝐹(𝑝) and
𝐺𝐹(2𝑚). Furthermore, since the calculation in 𝐺𝐹(𝑝) could
require one further subtraction step, the Carry Look-Ahead
adder in the design has been formulated to be able to perform
word-based modular additions and subtractions. Figure 3
illustrates the logic of the proposed enhanced 𝑛−bit wideCLA
adder of the core.
Figure 3: Enhanced n-bit wide CLA adder.

The internal signal $b'$ of the second operand is calculated as
$b'_i = b_i \oplus (aos \cdot f_{sel})$, in which $aos$ denotes an
add-or-subtract signal ($aos = 0$ means addition; $aos = 1$ represents
subtraction by performing an addition in two's complement
representation). The modified CLA adder involves the same
common Carry Look-Ahead adder logic for the calculation
of the generate ($g_i = a_i \cdot b'_i$) and propagate ($p_i = a_i + b'_i$)
functions. The carry values $c_i$ of the CLA adder logic are
calculated as $c_0 = c_{in}$ for the least-significant bit and
$c_i = g_{i-1} + (p_{i-1} \cdot c_{i-1})$ for all further bits. The final sum
output bits $s_i$ are calculated as $s_i = (c_i \cdot f_{sel}) \oplus a_i \oplus b'_i$,
and the carry output bit is determined as $c_{out} = c_n \cdot f_{sel}$.
If the selected finite field is $GF(2^m)$ ($f_{sel} = 0$), the
add-or-subtract input is ignored, the final sum is simply the bitwise
modulo-2 addition of the two input values $TC$ and $TS$, and the carry
output bit is forced to zero.

Figure 4: Enhanced Montgomery Multiplication Core RAM organization.
3.2.3. Core RAM Structure. The RAM of the core must be
capable of holding all the necessary operands and interme-
diate values required during the execution of cryptographic
algorithms. The basic structure of the described RAM is
pictured in Figure 4.
It features four symbolic horizontal RAM operand locations
with MAX_PRECISION_WIDTH bits each, which are
organized as eight pieces of MAX_PRECISION_WIDTH/8 bits
each. The location named $B$ is intended to hold operand $B$
in Montgomery Multiplication and Montgomery Exponentiation
operations; the location named $P$ is intended to hold the
modulus. The location $TS$ usually holds the temporary sum
value during Montgomery Multiplications and Montgomery
Exponentiation or the first operand in modular addition or
subtraction operations. The location 𝑇𝐶 usually holds the
temporary carry stream during Montgomery Multiplications
and Montgomery Exponentiation or the second operand in
modular addition or subtraction operations.
Besides the horizontal RAM operand locations, three
symbolic vertical RAM operand locations with
MAX_PRECISION_WIDTH bits each have been defined, which are
organized as eight pieces of MAX_PRECISION_WIDTH/8
bits each. The locations named $A$, $E$, and $X$ for convenience
usually are used to hold operand 𝐴 in Montgomery Mul-
tiplication and Montgomery Exponentiation operations as
well as the exponent operand 𝐸 and the auxiliary operand
𝑋 in Montgomery Exponentiation operations. In addition all
RAM slots are intended to hold intermediate values during
the execution of cryptographic algorithms.
4. Functional Range of the Core
This section provides a description of the functional range of
the proposed core. The following precisions (denoted in bit-
length) are supported:
(i) EC over 𝐺𝐹(𝑝), RSA, MR, DH: 192, 224, 256, 320, 384,
448, 512, 768, 1024, 1536, 2048, 3072, 4096
(ii) EC over $GF(2^m)$: 131, 163, 176, 191, 193, 208, 233, 239,
272, 283, 304, 359, 368, 409, 431, 571
If further or other precision widths should be supported,
the described core can easily be adjusted in an appropriate
manner. For the parametrisation and the execution/abortion
of an operation a 32-bit wide command input word has been
defined. Besides the start, abort, and finite field selection
signals also the encoded precision width, operation code
as well as RAM offsets for the specified operation can be
supplied. The following operations have been specified.
4.1. MontMult Operation. The MontMult operation code
instructs the core to perform a single Montgomery Multipli-
cation with the supplied elements in the given finite field. A
Montgomery Multiplication will start by reading the first 𝑟-
bit word of operand 𝐴 from RAM. Afterwards the pipeline
will be started and the appropriate bits of 𝐴 operand will
be fed to the individual processing unit. If all bits of the 𝐴
operand word have been fed to the processing units, a new
word will be read from RAM. Once the last bit of 𝐴 operand
has been processed, the temporary sum and temporary carry
words will be fed into the CLA adder in order to reunite
the two streams. After the last words of temporary sum
and temporary carry have been brought together, the carry
output bit of the CLA adder will be evaluated. If the carry bit
is set, the modulus will be subtracted once; otherwise the
result will be compared to the given modulus. If the result
is equal to or greater than the modulus, the modulus will
be subtracted once.
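
The arithmetic performed by a MontMult operation can be illustrated with a bit-serial software model. The sketch below works on whole Python integers instead of words in carry-save form, so it shows only the mathematics of the operation, not the pipeline; the function name `mont_mult` and its argument order are assumptions made for this example. The two final checks described above (carry bit set, or result not smaller than the modulus) collapse into a single comparison because the integer is not truncated to $k$ bits.

```python
def mont_mult(a, b, mod, k, f_sel):
    """Bit-serial Montgomery product a * b * 2^(-k) mod `mod` (illustrative).

    f_sel = 1: GF(p), `mod` is an odd integer p with a, b < p < 2^k.
    f_sel = 0: GF(2^m), `mod` is the bit pattern of an irreducible
               polynomial n(x) of degree k, operands have degree < k.
    """
    t = 0
    for i in range(k):
        if (a >> i) & 1:                 # current bit a_i of operand A
            t = t + b if f_sel else t ^ b
        if t & 1:                        # odd intermediate result ("par")
            t = t + mod if f_sel else t ^ mod
        t >>= 1                          # exact division by 2 (or by x)
    if f_sel and t >= mod:               # final correction, GF(p) only
        t -= mod
    return t
```

As a small check in $GF(2^4)$ with $n(x) = x^4 + x + 1$ (bit pattern `0b10011`), multiplying an element by $x^4 \bmod n(x)$ (bit pattern `0b0011`) returns the element unchanged, since the factor $2^{-k}$ cancels.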
4.2. MontR Operation. The MontR operation code instructs
the core to calculate the Montgomery Parameter $r = 2^k$
regarding a supplied modulus in the given finite field, with
$k$ being the bit-length of the given precision.
In the case of prime field arithmetic the Montgomery
Parameter $r$ will be $r \equiv 2^k \bmod p$, so $r$ can be calculated as
the two's complement of $p$, i.e., as the bitwise inverse of the
given modulus plus 1. Therefore the individual words of the
modulus will be XOR-ed with a constant word consisting of
all-ones. In addition the least-significant bit of the first word
will be set to one; since the modulus is odd, its inverted
least-significant bit is zero, so setting this bit is equivalent
to adding 1.
In the case of binary field arithmetic the Montgomery
Parameter $r$ will be $r \equiv 2^k \bmod n(x)$, so $r$ is equal to the binary
representation of the irreducible polynomial $n(x)$ with the most
significant bit set to zero. Therefore the individual words
of the modulus will be scanned and the appropriate most
significant bit will be set to zero, depending on the given
precision.
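
Both cases of the MontR operation reduce to simple bit manipulations, as the following Python sketch illustrates. The function name `mont_r` is hypothetical. For the prime field case the sketch assumes that the modulus is odd and uses the full precision, i.e., that bit $k-1$ of $p$ is set, so that $2^k - p$ is already smaller than $p$.

```python
def mont_r(mod, k, f_sel):
    """Montgomery Parameter r = 2^k mod `mod` (illustrative sketch).

    f_sel = 1: GF(p); assumes p is odd and has its top bit (k-1) set.
    f_sel = 0: GF(2^m); `mod` is the bit pattern of n(x) with degree k.
    """
    mask = (1 << k) - 1
    if f_sel:
        # XOR with all-ones inverts p; setting the LSB adds 1 (p is odd).
        return ((mod ^ mask) & mask) | 1
    # Clearing bit k of n(x) leaves x^k mod n(x).
    return mod & mask
```

For instance, with $p = 241$ and $k = 8$ the result is $256 - 241 = 15$, and for $n(x) = x^4 + x + 1$ the result is $x + 1$.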
4.3. MontR2 Operation. The MontR2 operation code instructs
the core to calculate the Montgomery Parameter $r^2$
with $r^2 = 2^{2 \cdot k}$ for a supplied modulus in the given finite field,
with $k$ being the bit-length of the given precision.
In the case of prime field arithmetic the Montgomery
Parameter $r^2$ will be given by $r^2 \equiv r \cdot r \equiv 2^k \cdot 2^k \equiv 2^{2 \cdot k} \bmod p$.
Therefore in a first step the Montgomery Parameter
$r \equiv 2^k \bmod p$ will be calculated for prime fields as described
above. In order to calculate $r^2$, one possible way is to calculate
$2^l \cdot 2^k \bmod p$ with $l$ being a small divisor of $k$. In the
given implementation $l = 1$. Therefore the bits of $r$ will
be shifted to the left by one bit. If the result is equal to or
greater than the modulus, $p$ will be subtracted once. By using
a square-and-multiply-like algorithm, multiple Montgomery
Multiplications will then be performed in order to calculate
$r^2 \equiv 2^k \cdot 2^k \bmod p$.
In the case of binary field arithmetic the Montgomery
Parameter $r^2$ will be given by $r^2 \equiv r \cdot r \equiv 2^k \cdot 2^k \equiv 2^{2 \cdot k} \bmod n(x)$.
Therefore in a first step the Montgomery
Parameter $r \equiv 2^k \bmod n(x)$ will be calculated for binary
fields as described above. In order to calculate $r^2$ the resulting
parameter $r$ will be shifted $k$ times bitwise to the left. After
each shift, the most significant bit as given by the precision
parameter will be evaluated. If the bit is one, the irreducible
polynomial will be added to the intermediate result, which
represents a modulo reduction with $n(x)$. Once the shift
has been performed $k$ times the result will be
$r^2 \equiv 2^k \cdot 2^k \bmod n(x)$.
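
The two procedures for $r^2$ can be sketched as follows. The value $2 \cdot 2^k \bmod p$ is the Montgomery representation of 2, so raising it to the power $k$ inside the Montgomery domain yields $2^k \cdot 2^k \bmod p$. In this sketch the helper `mont_mult_p` is a mathematical stand-in for the MontMult operation (it uses Python's modular inverse via `pow`, available from Python 3.8), and all function names are assumptions made for illustration. The prime field case again assumes an odd modulus with its top bit set.

```python
def mont_mult_p(a, b, p, k):
    """Stand-in for MontMult in GF(p): a * b * 2^(-k) mod p."""
    return a * b * pow(2, -k, p) % p

def mont_r2_prime(p, k):
    """r^2 = 2^(2k) mod p via one shift and square-and-multiply."""
    mask = (1 << k) - 1
    r = ((p ^ mask) & mask) | 1            # r = 2^k mod p
    two_md = r << 1                        # 2 * 2^k: Montgomery form of 2
    if two_md >= p:
        two_md -= p
    acc = two_md
    for bit in bin(k)[3:]:                 # bits of k below its leading one
        acc = mont_mult_p(acc, acc, p, k)  # square: exponent doubles
        if bit == "1":
            acc = mont_mult_p(acc, two_md, p, k)  # multiply: exponent + 1
    return acc

def mont_r2_binary(n, k):
    """r^2 = x^(2k) mod n(x) via k shifts with conditional reduction."""
    acc = n & ((1 << k) - 1)               # r = x^k mod n(x)
    for _ in range(k):
        acc <<= 1
        if (acc >> k) & 1:                 # bit k set: reduce with n(x)
            acc ^= n
    return acc
```

For $n(x) = x^4 + x + 1$ the binary procedure returns $x^8 \bmod n(x) = x^2 + 1$.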
4.4. MontExp Operation. The MontExp operation code instructs
the core to perform a Montgomery Exponentiation
consisting of multiple Montgomery Multiplication steps in
the given finite field. A Montgomery Exponentiation will start
by reading the first $r$-bit word of exponent $E$ from RAM.
Afterwards the first appearing one of the exponent word will
be searched starting from the most significant bit. If the first
word consists of all-zeros then the next word of exponent $E$
will be read and evaluated. Once the highest bit of exponent $E$
has been found, multiple Montgomery Multiplications will be
performed until all bits of the exponent have been processed
following a square-and-multiply algorithm.
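
A left-to-right square-and-multiply exponentiation in the Montgomery domain, as outlined above, can be modelled like this. The helper `mont_mult_p` is again a mathematical stand-in for MontMult and both function names are assumptions for this sketch. Note that this plain software version skips the multiplication for zero exponent bits, so it is not time-invariant; it illustrates the arithmetic only and says nothing about how the core schedules its operations.

```python
def mont_mult_p(a, b, p, k):
    """Stand-in for MontMult in GF(p): a * b * 2^(-k) mod p."""
    return a * b * pow(2, -k, p) % p

def mont_exp(x_md, e, p, k):
    """Compute the Montgomery form of x^e from the Montgomery form of x.

    Assumes e >= 1. The leading one of the exponent is located first;
    every following bit triggers a squaring, and a one bit additionally
    triggers a multiplication with x_md.
    """
    acc = x_md
    for bit in bin(e)[3:]:                 # bits after the leading one
        acc = mont_mult_p(acc, acc, p, k)
        if bit == "1":
            acc = mont_mult_p(acc, x_md, p, k)
    return acc
```

A typical use first converts the base with $r^2$, exponentiates, and then multiplies by the constant 1 to leave the Montgomery domain.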
4.5. ModAdd Operation. The ModAdd operation code instructs
the core to perform a modular addition of the supplied
elements in the given finite field. After preparing the core
for the addition operation, the CLA adder will add the
given operands using the appropriate arithmetic given by the
finite field selection input. Once the last words of the given
operands have been added the carry output bit of the CLA
adder will be evaluated. If a carry bit is set, the modulus will
be subtracted once; otherwise the result will be compared to
the given modulus. If the result is equal to or greater than the
modulus, it will also be subtracted once.
4.6. ModSub Operation. The ModSub operation code in-
structs the core to perform a modular subtraction of the sup-
plied elements in prime fields. After preparing the core for the
subtraction operation the CLA adder will be used to perform
a word-based subtraction by performing an addition in two’s
complement representation with prime field arithmetic. After
the last words of the given operands have been processed,
the carry output bit of the CLA adder will be evaluated. If
the carry bit signals a negative result, the modulus will be
added once; otherwise the result will be compared to the
given modulus. If the result is equal to or greater than the
modulus, it will be subtracted once.
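
The correction steps of the ModAdd and ModSub operations can be summarised in a short Python model. It operates on whole integers rather than on words, the function names are hypothetical, and it is an assumption of this sketch that a missing carry out of the two's complement addition is what signals a negative result.

```python
def mod_add(a, b, mod, k, f_sel):
    """Modular addition with a, b < mod < 2^k (illustrative sketch)."""
    if not f_sel:
        return a ^ b                   # GF(2^m): carry-free XOR
    s = a + b
    if (s >> k) or s >= mod:           # carry set, or result >= modulus
        s -= mod
    return s

def mod_sub(a, b, p, k):
    """Modular subtraction in GF(p) via two's complement addition."""
    mask = (1 << k) - 1
    s = a + ((~b) & mask) + 1          # a + (two's complement of b)
    carry = s >> k
    s &= mask
    if not carry:                      # assumed: no carry means a < b
        s = (s + p) & mask             # add the modulus once
    elif s >= p:
        s -= p
    return s
```

For example, with $p = 241$ and $k = 8$, `mod_sub(3, 5, 241, 8)` first yields the 8-bit pattern of −2 and then adds the modulus, giving 239.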
4.7. RAM Copy Operations. In order to support crypto-
graphic algorithms which have been disassembled into a list
of instructions, RAM copy operations are needed. According
to the proposed RAM layout stated above four individual
copy operations have been defined.
The CopyH2V operation code instructs the core to copy
a number of words, according to the supplied precision
parameter, from the horizontal RAM layout starting from the
given source address to the vertical RAM layout starting from
the given destination address.
The CopyV2V operation code instructs the core to copy a
number of words, according to the supplied precision param-
eter, from the vertical RAM layout starting from the given
source address to the vertical RAM layout starting from the
given destination address.
The CopyH2H operation code instructs the core to copy
a number of words, according to the supplied precision
parameter, from the horizontal RAM layout starting from the
given source address to the horizontal RAM layout starting
from the given destination address.
The CopyV2H operation code instructs the core to copy
a number of words, according to the supplied precision
parameter, from the vertical RAM layout starting from the
given source address to the horizontal RAM layout starting
from the given destination address.
4.8. MontMult1 Operation. The MontMult1 operation code
instructs the core to perform a single Montgomery Mul-
tiplication of the supplied element with the constant 1 in
the given finite field. This type of operation is needed when
a montgomerized value should be transformed back from
the Montgomery Domain and has been implemented as
an independent operation since an operand 𝐴 = 1 will
unnecessarily occupy a vertical RAM slot. A Montgomery
Multiplication with the constant 1 will be executed in an
analogous manner as the MontMult operation with the only
exception that, instead of the RAM words, constant words
will be used for the 𝐴 operand.
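
The effect of MontMult1 is easiest to see as a round trip through the Montgomery domain. In the following sketch the helper `mont_mult_p` is a mathematical stand-in for the core's Montgomery Multiplication, and the names `to_montgomery` and `from_montgomery` are introduced only for this example.

```python
def mont_mult_p(a, b, p, k):
    """Stand-in for MontMult in GF(p): a * b * 2^(-k) mod p."""
    return a * b * pow(2, -k, p) % p

def to_montgomery(x, p, k):
    """Multiply by r^2 = 2^(2k) mod p to obtain x * 2^k mod p."""
    r2 = pow(2, 2 * k, p)
    return mont_mult_p(x, r2, p, k)

def from_montgomery(x_md, p, k):
    """MontMult1: multiply by the constant 1 to strip the factor 2^k."""
    return mont_mult_p(x_md, 1, p, k)
```

Since $x \cdot 2^k \cdot 1 \cdot 2^{-k} \equiv x \bmod p$, applying `from_montgomery` to the output of `to_montgomery` returns the original value.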
5. Exemplary Core Application Descriptions
This section gives exemplary descriptions of how the specified
functional range of the proposed building-block Enhanced
Montgomery Multiplication Core design can be utilized to
support a wide range of cryptographic algorithms demanding
the least possible memory capacity yet at the same time
supporting as many precision widths as possible. Information
is given on how to perform Chinese Remainder Theorem [22]
(CRT) accelerated RSA private key operations and how to use
the core in order to test/generate prime numbers.
For the support of elliptic curve cryptography over prime
and binary finite fields modular functions are given for
preparing and conducting point operations for arbitrary
elliptic curves for the supported precision widths. For all
these algorithms a list of operations and the quantity of dif-
ferent operations is given allowing to perform cryptographic
algorithms by simply processing these operation lists.
5.1. CRT-Accelerated RSA Operation. In order to speed up
RSA private key operations, the core also supports the
CRT-accelerated version. Some of the required operations have
to be performed with full precision, whereas most of them
have to be performed with half precision. Algorithm 1 lists
the necessary steps to utilize the core for CRT-accelerated
RSA private key operations.
Requires: (𝑐, 𝑑, 𝑝, 𝑞, 𝑒𝑥𝑝1, 𝑒𝑥𝑝2, 𝑐𝑜𝑒𝑓𝑓, 𝑛)
Calculates: (𝑚)
(h) 𝑟𝑞2 = 𝑀𝑜𝑛𝑡𝑅2(𝑞);
(f) 𝑐𝑞 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟𝑞2, 𝑐, 𝑞);
(h) 𝑐𝑞𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟𝑞2, 𝑐𝑞, 𝑞);
(h) 𝑚𝑞𝑀𝐷 = 𝑀𝑜𝑛𝑡𝐸𝑥𝑝(𝑐𝑞𝑀𝐷, 𝑐𝑞𝑀𝐷, 𝑒𝑥𝑝2, 𝑐𝑞𝑀𝐷, 𝑞);
(h) 𝑚𝑞 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1(𝑚𝑞𝑀𝐷, 𝑞);
(h) 𝑟𝑝2 = 𝑀𝑜𝑛𝑡𝑅2(𝑝);
(f) 𝑐𝑝 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟𝑝2, 𝑐, 𝑝);
(h) 𝑐𝑝𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟𝑝2, 𝑐𝑝, 𝑝);
(h) 𝑚𝑝𝑀𝐷 = 𝑀𝑜𝑛𝑡𝐸𝑥𝑝(𝑐𝑝𝑀𝐷, 𝑐𝑝𝑀𝐷, 𝑒𝑥𝑝1, 𝑐𝑝𝑀𝐷, 𝑝);
(h) 𝑚𝑝 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1(𝑚𝑝𝑀𝐷, 𝑝);
(h) 𝑐𝑜𝑒𝑓𝑓𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟𝑝2, 𝑐𝑜𝑒𝑓𝑓, 𝑝);
(h) 𝑡1 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑚𝑝, 𝑚𝑞, 𝑝);
(h) 𝑡2 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑐𝑜𝑒𝑓𝑓𝑀𝐷, 𝑡1, 𝑝);
(f) 𝑟2 = 𝑀𝑜𝑛𝑡𝑅2(𝑛);
(f) 𝑞𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑞, 𝑟2, 𝑛);
(f) 𝑡3 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑡2, 𝑞𝑀𝐷, 𝑛);
(f) 𝑚 = 𝑀𝑜𝑑𝐴𝑑𝑑(𝑡3, 𝑚𝑞, 𝑛);
Provides: (𝑚)
Algorithm 1: CRT-RSA private key operation.
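The interplay of half-precision and full-precision steps in the CRT-accelerated private key operation can be sketched with a plain integer model. The helper names (`mont_mult`, `mont_r2`, `mont_exp`, `crt_rsa_private`), the three-argument form of `mont_exp`, and the toy key are assumptions for illustration only; the sketch further assumes that full precision is exactly twice the half precision, so that a full-precision multiplication by r² of the half-width modulus reduces the ciphertext modulo that prime.

```python
def mont_mult(a, b, n, bits):
    """Reference model: a * b * R^-1 mod n with R = 2**bits (n odd)."""
    return (a * b * pow(1 << bits, -1, n)) % n


def mont_r2(n, bits):
    """Montgomery Parameter r^2 mod n."""
    return pow(1 << bits, 2, n)


def mont_exp(x_md, e, n, bits):
    """Left-to-right square-and-multiply on a montgomerized base (e >= 1)."""
    acc = x_md
    for bit in bin(e)[3:]:           # skip the leading 1 bit
        acc = mont_mult(acc, acc, n, bits)
        if bit == "1":
            acc = mont_mult(acc, x_md, n, bits)
    return acc


def crt_rsa_private(c, p, q, exp1, exp2, coeff, n, half):
    full = 2 * half                  # assumption: R_full = R_half^2

    def half_result(prime, exp):
        r2 = mont_r2(prime, half)                   # (h)
        c_red = mont_mult(r2, c, prime, full)       # (f) equals c mod prime
        c_md = mont_mult(r2, c_red, prime, half)    # (h) into the Montgomery Domain
        m_md = mont_exp(c_md, exp, prime, half)     # (h)
        return mont_mult(m_md, 1, prime, half)      # (h) MontMult1

    mq = half_result(q, exp2)
    mp = half_result(p, exp1)
    coeff_md = mont_mult(mont_r2(p, half), coeff, p, half)   # (h)
    t1 = (mp - mq) % p                                       # (h) ModSub
    t2 = mont_mult(coeff_md, t1, p, half)                    # (h)
    q_md = mont_mult(q, mont_r2(n, full), n, full)           # (f)
    t3 = mont_mult(t2, q_md, n, full)                        # (f)
    return (t3 + mq) % n                                     # (f) ModAdd


# Toy key (assumption, far too small for real use)
p, q, e, d = 61, 53, 17, 2753
n = p * q
c = pow(65, e, n)
m = crt_rsa_private(c, p, q, d % (p - 1), d % (q - 1), pow(q, -1, p), n, half=8)
```

The recombination at the end is Garner's formula, 𝑚 = 𝑚𝑞 + 𝑞 ⋅ (𝑐𝑜𝑒𝑓𝑓 ⋅ (𝑚𝑝 − 𝑚𝑞) mod 𝑝), evaluated with Montgomery Multiplications.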
Table 1 illustrates the abstract operations list of the core
for the CRT-accelerated RSA application using the private key
portion for all supported precision widths (512, 768, 1024,
1536, 2048, 3072, 4096). The number given in the index of
the RAM locations denotes the offset given by the
corresponding src_addr, dest_addr, src_addr_e, and src_addr_x
input signals. The width of the processed values depends on
the supplied mwmac_precision input signal, which in turn
depends on the operation. In the table, operations requiring
full precision (the precision of the RSA modulus) are marked
by (𝑓) and operations requiring half precision are marked by
(ℎ). The mwmac_f_sel signal must be set to 𝐺𝐹(𝑝) arithmetic.
CRT-accelerated RSA private key operations require 2 ×
𝑀𝑜𝑛𝑡𝑅2, 4 ×𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 2 ×𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1, 2 ×𝑀𝑜𝑛𝑡𝐸𝑥𝑝, 1 ×
𝑀𝑜𝑑𝑆𝑢𝑏, 9 × 𝐶𝑜𝑝𝑦𝐻2𝑉, 4 × 𝐶𝑜𝑝𝑦𝑉2𝐻, 2 × 𝐶𝑜𝑝𝑦𝑉2𝑉 and
1 × 𝐶𝑜𝑝𝑦𝐻2𝐻 performed on half precision and 1 ×𝑀𝑜𝑛𝑡𝑅2,
4×𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 1×𝑀𝑜𝑑𝐴𝑑𝑑, 1×𝐶𝑜𝑝𝑦𝐻2𝑉 and 1×𝐶𝑜𝑝𝑦𝐻2𝐻
performed on full precision.
5.2. Prime Generation/Testing Operation. Algorithm 2 lists
the necessary steps to utilize the core, in conjunction with
a TRNG, as a Miller-Rabin Primality Tester. In the algorithm
𝑛 denotes the random integer to be tested for primality and
𝑘 denotes the confidence parameter determining the accuracy
of the test, i.e., the number of Miller-Rabin loops. In a
precomputation step the parameters 𝑠 and 𝑑 with
2^𝑠 ⋅ 𝑑 = (𝑛 − 1) must be calculated, which can be done by
simple shift operations and counter increments in software.
8 Security and Communication Networks
Table 1: Core operations list CRT-RSA.
Step Nr.  Precision  Operation
1         -          Clear RAM
2         -          Write 𝑞 ↦ 𝑃1, 𝑒𝑥𝑝1 ↦ 𝐸1, 𝑒𝑥𝑝2 ↦ 𝐸5, 𝑐𝑜𝑒𝑓𝑓 ↦ 𝑋1, 𝑝 ↦ 𝑋5
3         (ℎ)        𝑀𝑜𝑛𝑡𝑅2(𝑃1, 𝐴1)
4         (ℎ)        𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝐴1)
5         -          Write 𝑐 ↦ 𝐵1
6         (𝑓)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐴1, 𝐵1, 𝑃1)
7         (ℎ)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐴1, 𝐵1, 𝑃1)
8-9       (ℎ)        𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝐴1), 𝐶𝑜𝑝𝑦𝑉2𝑉(𝐴1, 𝐴5)
10        (ℎ)        𝑀𝑜𝑛𝑡𝐸𝑥𝑝(𝐴5, 𝐵1, 𝐸5, 𝐴1, 𝑃1)
11        (ℎ)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1(𝐵1, 𝑃1)
12-14     (ℎ)        𝐶𝑜𝑝𝑦𝐻2𝑉(𝑃1, 𝐸5), 𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋5, 𝑃1), 𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝑋5)
15        (ℎ)        𝑀𝑜𝑛𝑡𝑅2(𝑃1, 𝐴1)
16        (𝑓)        𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝐴1)
17        (ℎ)        𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋1, 𝐵1)
18        (ℎ)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐴1, 𝐵1, 𝑃1)
19        (ℎ)        𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝑋1)
20        -          Write 𝑐 ↦ 𝐵1
21        (𝑓)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐴1, 𝐵1, 𝑃1)
22        (ℎ)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐴1, 𝐵1, 𝑃1)
23-24     (ℎ)        𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝐴1), 𝐶𝑜𝑝𝑦𝑉2𝑉(𝐴1, 𝐴5)
25        (ℎ)        𝑀𝑜𝑛𝑡𝐸𝑥𝑝(𝐴5, 𝐵1, 𝐸1, 𝐴1, 𝑃1)
26        (ℎ)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1(𝐵1, 𝑃1)
27-28     (ℎ)        𝐶𝑜𝑝𝑦𝐻2𝐻(𝐵1, 𝑇𝑆1), 𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋5, 𝑇𝐶1)
29        (ℎ)        𝑀𝑜𝑑𝑆𝑢𝑏(𝑇𝑆1, 𝑇𝐶1, 𝑃1)
30        (ℎ)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑋1, 𝐵1, 𝑃1)
31-32     (ℎ)        𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝐴1), 𝐶𝑜𝑝𝑦𝐻2𝑉(𝑇𝑆1, 𝑋1)
33        -          Write 𝑛 ↦ 𝑃1
34        (𝑓)        𝑀𝑜𝑛𝑡𝑅2(𝑃1, 𝐴5)
35        (ℎ)        𝐶𝑜𝑝𝑦𝑉2𝑉(𝑋1, 𝐴5)
36        (𝑓)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐸5, 𝐵1, 𝑃1)
37        (𝑓)        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐴1, 𝐵1, 𝑃1)
38        (𝑓)        𝐶𝑜𝑝𝑦𝐻2𝐻(𝐵1, 𝑇𝑆1)
39        (ℎ)        𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋5, 𝑇𝐶1)
40        (𝑓)        𝑀𝑜𝑑𝐴𝑑𝑑(𝑇𝑆1, 𝑇𝐶1, 𝑃1)
The test furthermore requires an amount of random integers
{𝑎1, . . . , 𝑎𝑘} serving as random bases.
Precomputation: (𝑠, 𝑑 with 2^𝑠 ⋅ 𝑑 = (𝑛 − 1))
Input: (𝑛, 𝑠, 𝑑, 𝑘, {𝑎1, ..., 𝑎𝑘})
Output: (𝑒𝑣𝑎𝑙, composite or probably prime)
𝑟 = 𝑀𝑜𝑛𝑡𝑅(𝑛);
(𝑛 − 𝑟) = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑛, 𝑟, 𝑛);
for 𝑖 from 1 to 𝑘 do
    𝑡1𝑀𝐷 = 𝑀𝑜𝑛𝑡𝐸𝑥𝑝(𝑎𝑖, 𝑎𝑖, 𝑑, 𝑎𝑖, 𝑛);
    𝑡2𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑡1𝑀𝐷, 𝑟, 𝑛);
    𝑡3𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑡1𝑀𝐷, (𝑛 − 𝑟), 𝑛);
    if 𝑡2𝑀𝐷 = 0 or 𝑡3𝑀𝐷 = 0 then
        continue;
    for 𝑗 from 0 to (𝑠 − 1) do
        𝑡1𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑡1𝑀𝐷, 𝑡1𝑀𝐷, 𝑛);
        𝑡2𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑡1𝑀𝐷, 𝑟, 𝑛);
        if 𝑡2𝑀𝐷 = 0 then
            return (𝑒𝑣𝑎𝑙 = composite);
        𝑡3𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑡1𝑀𝐷, (𝑛 − 𝑟), 𝑛);
        if 𝑡3𝑀𝐷 = 0 then
            continue;
    return (𝑒𝑣𝑎𝑙 = composite);
return (𝑒𝑣𝑎𝑙 = probably prime)
Algorithm 2: Modified Miller-Rabin Primality Test.
Table 2 illustrates the operations list for utilizing the
core for the Miller-Rabin Primality Test steps for all
supported precision widths (192, 224, 256, 320, 384, 512,
768, 1024, 1536, 2048, 3072, 4096). The number given in the
index of the RAM locations denotes the offset given by the
corresponding src_addr, dest_addr, src_addr_e, and src_addr_x
input signals. The width of the processed values depends on
the supplied mwmac_precision input signal. The mwmac_f_sel
signal must be set to 𝐺𝐹(𝑝) arithmetic. Note that since the
results of the performed operations will be in the Montgomery
Domain, they will be checked against the Montgomery Parameter
𝑟 and (𝑛 − 𝑟) instead of 1 and (𝑛 − 1). Also note that the
random bases 𝑎𝑖 to be checked need not be transformed into
the Montgomery Domain first; they will simply be interpreted
as random montgomerized values.
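Comparing against 𝑟 and (𝑛 − 𝑟) instead of 1 and (𝑛 − 1) can be illustrated with a small integer model of the test. This is a hedged sketch rather than the core's operation list: the function names, the three-argument `mont_exp`, and the parameter `bits` are assumptions, and the inner loop uses the textbook bound of 𝑠 − 1 squarings, which may differ from the loop bound written in Algorithm 2.

```python
def mont_mult(a, b, n, bits):
    """Reference model: a * b * R^-1 mod n with R = 2**bits (n odd)."""
    return (a * b * pow(1 << bits, -1, n)) % n


def mont_exp(x_md, e, n, bits):
    """Left-to-right square-and-multiply on a montgomerized base (e >= 1)."""
    acc = x_md
    for bit in bin(e)[3:]:
        acc = mont_mult(acc, acc, n, bits)
        if bit == "1":
            acc = mont_mult(acc, x_md, n, bits)
    return acc


def miller_rabin_md(n, bases, bits):
    """Return True for 'probably prime', False for 'composite' (n odd, n > 3)."""
    d, s = n - 1, 0
    while d % 2 == 0:                # precomputation: n - 1 = 2**s * d
        d //= 2
        s += 1
    r = (1 << bits) % n              # MontR: the value 1 in the Montgomery Domain
    n_minus_r = (n - r) % n          # the value n - 1 in the Montgomery Domain
    for a in bases:                  # bases are read as montgomerized values
        t1 = mont_exp(a, d, n, bits)
        if t1 == r or t1 == n_minus_r:
            continue
        for _ in range(s - 1):
            t1 = mont_mult(t1, t1, n, bits)
            if t1 == r:
                return False         # nontrivial square root of 1
            if t1 == n_minus_r:
                break                # this base does not witness compositeness
        else:
            return False
    return True
```

Because multiplication by R⁻¹ is a bijection on the residues modulo 𝑛, interpreting uniformly random bases as montgomerized values still yields uniformly random bases, which is why the explicit transformation can be skipped.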
The total number of needed core operations depends on the
security parameter 𝑘 and the value 𝑠 resulting from the
factorization of (𝑛 − 1). Within the outer for loop, from
writing a new 𝑎𝑖 to the RAM until the evaluation of 𝑡2𝑀𝐷,
1 × 𝑀𝑜𝑛𝑡𝑅, 1 × 𝑀𝑜𝑛𝑡𝐸𝑥𝑝, 1 × 𝑀𝑜𝑑𝑆𝑢𝑏, 1 × 𝐶𝑜𝑝𝑦𝐻2𝑉,
2 × 𝐶𝑜𝑝𝑦𝑉2𝐻, 1 × 𝐶𝑜𝑝𝑦𝑉2𝑉, and 1 × 𝐶𝑜𝑝𝑦𝐻2𝐻 operations are
required; until the evaluation of 𝑡3𝑀𝐷 a further
2 × 𝑀𝑜𝑑𝑆𝑢𝑏, 2 × 𝐶𝑜𝑝𝑦𝑉2𝐻, and 2 × 𝐶𝑜𝑝𝑦𝐻2𝐻 operations are
required. Within the inner for loop, until the evaluation of
the updated 𝑡2𝑀𝐷, 1 × 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 1 × 𝑀𝑜𝑑𝑆𝑢𝑏, 1 × 𝐶𝑜𝑝𝑦𝐻2𝑉,
2 × 𝐶𝑜𝑝𝑦𝑉2𝐻, and 1 × 𝐶𝑜𝑝𝑦𝐻2𝐻 operations are required; until
the evaluation of the updated 𝑡3𝑀𝐷 a further 2 × 𝑀𝑜𝑑𝑆𝑢𝑏,
2 × 𝐶𝑜𝑝𝑦𝑉2𝐻, and 2 × 𝐶𝑜𝑝𝑦𝐻2𝐻 operations are required.
Table 2: Core operations list for Miller-Rabin Primality Test.
Step Nr.  Operation
1         Clear RAM
2         Write 𝑛 ↦ 𝑃1, 𝑑 ↦ 𝐸1
3         Write 𝑎 ↦ 𝑋1
4-5       𝐶𝑜𝑝𝑦𝑉2𝑉(𝑋1, 𝐴1), 𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋1, 𝐵1)
6         𝑀𝑜𝑛𝑡𝐸𝑥𝑝(𝑋1, 𝐵1, 𝐸1, 𝐴1, 𝑃1)
7         𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝑋1)
8         𝑀𝑜𝑛𝑡𝑅(𝑃1, 𝐵1)
9-11      𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝐴1), 𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋1, 𝑇𝑆1), 𝐶𝑜𝑝𝑦𝐻2𝐻(𝐵1, 𝑇𝐶1)
12        𝑀𝑜𝑑𝑆𝑢𝑏(𝑇𝑆1, 𝑇𝐶1, 𝑃1)
13        Read 𝐵1, if 𝑡2𝑀𝐷 = 0 continue at Step Nr. 3 else continue at Step Nr. 14
14-15     𝐶𝑜𝑝𝑦𝐻2𝐻(𝑃1, 𝑇𝑆1), 𝐶𝑜𝑝𝑦𝑉2𝐻(𝐴1, 𝑇𝐶1)
16        𝑀𝑜𝑑𝑆𝑢𝑏(𝑇𝑆1, 𝑇𝐶1, 𝑃1)
17-18     𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋1, 𝑇𝑆1), 𝐶𝑜𝑝𝑦𝐻2𝐻(𝐵1, 𝑇𝐶1)
19        𝑀𝑜𝑑𝑆𝑢𝑏(𝑇𝑆1, 𝑇𝐶1, 𝑃1)
20        Read 𝐵1, if 𝑡3𝑀𝐷 = 0 continue at Step Nr. 3 else if (𝑠 − 1) = 0 stop test with 𝑒𝑣𝑎𝑙 = composite else continue at Step Nr. 21
21        𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋1, 𝐵1)
22        𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑋1, 𝐵1, 𝑃1)
23-25     𝐶𝑜𝑝𝑦𝐻2𝑉(𝐵1, 𝑋1), 𝐶𝑜𝑝𝑦𝐻2𝐻(𝐵1, 𝑇𝑆1), 𝐶𝑜𝑝𝑦𝑉2𝐻(𝐴1, 𝑇𝐶1)
26        𝑀𝑜𝑑𝑆𝑢𝑏(𝑇𝑆1, 𝑇𝐶1, 𝑃1)
27        Read 𝐵1, if 𝑡2𝑀𝐷 = 0 stop test with 𝑒𝑣𝑎𝑙 = composite else continue at Step Nr. 28
28-29     𝐶𝑜𝑝𝑦𝐻2𝐻(𝑃1, 𝑇𝑆1), 𝐶𝑜𝑝𝑦𝑉2𝐻(𝐴1, 𝑇𝐶1)
30        𝑀𝑜𝑑𝑆𝑢𝑏(𝑇𝑆1, 𝑇𝐶1, 𝑃1)
31-32     𝐶𝑜𝑝𝑦𝑉2𝐻(𝑋1, 𝑇𝑆1), 𝐶𝑜𝑝𝑦𝐻2𝐻(𝐵1, 𝑇𝐶1)
33        𝑀𝑜𝑑𝑆𝑢𝑏(𝑇𝑆1, 𝑇𝐶1, 𝑃1)
34        Read 𝐵1, if 𝑡3𝑀𝐷 = 0 and 𝑖 = 𝑘 stop test with 𝑒𝑣𝑎𝑙 = probably prime else if 𝑡3𝑀𝐷 = 0 and 𝑖 ≠ 𝑘 continue at Step Nr. 3 else if 𝑗 = (𝑠 − 1) stop test with 𝑒𝑣𝑎𝑙 = composite else continue at Step Nr. 21
5.3. Elliptic Curve Operations. Unlike modular exponentiation,
which is based only on modular multiplications, elliptic
curve Point Addition and Point Doubling operations, also in
the Jacobian projective coordinate representation [23],
involve modular additions, subtractions, and multiplications.
The algorithms for prime field elliptic curve Point Addition
and Point Doubling using Jacobian coordinates furthermore
involve multiplications by some constants. Since the
described core performs multiplication operations by using
Montgomery Arithmetic, these constants must be transformed
into the Montgomery Domain first for the intermediate values
to remain montgomerized.
In order to utilize the core for elliptic curve operations
the following modular functions have been specified for both
𝐺𝐹(𝑝) and 𝐺𝐹(2^𝑚) support:
(i) EC Preparation.
(ii) EC Montgomery Transformation.
(iii) EC Affine-to-Jacobi Transformation.
(iv) EC Point Validation.
(v) EC Point Doubling.
(vi) EC Point Addition.
(vii) EC Jacobi-to-Affine Transformation.
(viii) EC Montgomery Backtransformation.
In the following, algorithms for utilizing the core to
perform EC operations in 𝐺𝐹(𝑝) are stated. For 𝐺𝐹(2^𝑚) EC
support, similar algorithms have been derived.
Requires: (2, 3, 4, 8, 𝑎, 𝑏, 𝑝)
Calculates: (𝑟², 2𝑀𝐷, 3𝑀𝐷, 4𝑀𝐷, 8𝑀𝐷, 𝑎𝑀𝐷, 𝑏𝑀𝐷, 𝑒𝑥𝑝)
𝑟² = 𝑀𝑜𝑛𝑡𝑅2(𝑝);
2𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟², 2, 𝑝);
3𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟², 3, 𝑝);
4𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟², 4, 𝑝);
8𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟², 8, 𝑝);
𝑎𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟², 𝑎, 𝑝);
𝑏𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑟², 𝑏, 𝑝);
𝑒𝑥𝑝 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑝, 2, 𝑝);
Provides: (𝑟², 2𝑀𝐷, 3𝑀𝐷, 4𝑀𝐷, 8𝑀𝐷, 𝑎𝑀𝐷, 𝑏𝑀𝐷, 𝑒𝑥𝑝)
Algorithm 3: Core 𝐺𝐹(𝑝) EC Preparation.
Requires: (𝑥𝑃, 𝑦𝑃, 𝑟²)
Calculates: (𝑥𝑃𝑀𝐷, 𝑦𝑃𝑀𝐷)
𝑥𝑃𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑥𝑃, 𝑟², 𝑝);
𝑦𝑃𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑦𝑃, 𝑟², 𝑝);
Provides: (𝑥𝑃𝑀𝐷, 𝑦𝑃𝑀𝐷)
Algorithm 4: Core 𝐺𝐹(𝑝) EC Montgomery Transformation.
5.3.1. GF(p) EC Preparation. The prime field EC Preparation
steps include the calculation of the Montgomery Parameter
𝑟² mod 𝑝, the exponent 𝑒𝑥𝑝 = 𝑝 − 2, as well as the
montgomerized versions of the constants 2, 3, 4, 8 and the
EC Domain Parameters 𝑎 and 𝑏 for a given elliptic curve
𝐸 : 𝑦² ≡ 𝑥³ + 𝑎 ⋅ 𝑥 + 𝑏 mod 𝑝 over 𝐺𝐹(𝑝). Algorithm 3
lists the necessary steps to utilize the core for EC prime
field preparation.
A core prime field EC preparation operation requires 1 ×
𝑀𝑜𝑛𝑡𝑅2, 6 × 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 1 × 𝑀𝑜𝑑𝑆𝑢𝑏, 8 × 𝐶𝑜𝑝𝑦𝐻2𝑉, and
8 × 𝐶𝑜𝑝𝑦𝐻2𝐻.
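The preparation steps can be sketched with a plain integer model. The names `mont_mult` and `ec_preparation`, the parameter `bits` (with R = 2^bits), and the toy curve are assumptions for illustration and are not taken from the core's interface.

```python
def mont_mult(a, b, p, bits):
    """Reference model: a * b * R^-1 mod p with R = 2**bits (p odd)."""
    return (a * b * pow(1 << bits, -1, p)) % p


def ec_preparation(a, b, p, bits):
    """Montgomery Parameter r^2, montgomerized constants and the exponent p - 2."""
    r2 = pow(1 << bits, 2, p)                                    # MontR2(p)
    c_md = {c: mont_mult(r2, c, p, bits) for c in (2, 3, 4, 8)}  # 2MD, 3MD, 4MD, 8MD
    a_md = mont_mult(r2, a, p, bits)
    b_md = mont_mult(r2, b, p, bits)
    exp = (p - 2) % p                                            # ModSub(p, 2, p)
    return r2, c_md, a_md, b_md, exp


# Toy curve (assumption): y^2 = x^3 + 2x + 3 over GF(97), R = 2**8
p, bits = 97, 8
r2, c_md, a_md, b_md, exp = ec_preparation(2, 3, p, bits)
```

The exponent 𝑝 − 2 is prepared here because raising a value to it yields the modular inverse for a prime modulus, which the Jacobi-to-Affine Transformation relies on later.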
5.3.2. GF(p) EC Montgomery Transformation. The prime field
EC Montgomery Transformation steps are responsible for
transforming the supplied affine point coordinates into the
Montgomery Domain: 𝑥𝑃 and 𝑦𝑃 of a Point 𝑃 in the case of a
Point Doubling or Point Multiplication operation, or 𝑥𝑃, 𝑦𝑃,
𝑥𝑄, and 𝑦𝑄 of the curve Points 𝑃 and 𝑄 in the case of a
Point Addition operation. Algorithm 4 lists the steps to
utilize the core for prime field EC Montgomery Transformation
for an arbitrary curve Point 𝑃.
A core prime field EC Montgomery Transformation operation
requires 2 × 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 2 × 𝐶𝑜𝑝𝑦𝐻2𝑉, and 2 × 𝐶𝑜𝑝𝑦𝑉2𝑉 in the
case of an intended Point Doubling or Point Multiplication
operation and 4 × 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 and 4 × 𝐶𝑜𝑝𝑦𝐻2𝑉 in the case of
an intended Point Addition operation.
Requires: (𝑥𝑃𝑀𝐷, 𝑦𝑃𝑀𝐷)
Calculates: (𝑥𝑃𝑀𝐷𝑗, 𝑦𝑃𝑀𝐷𝑗, 𝑧𝑃𝑀𝐷𝑗)
𝑥𝑃𝑀𝐷𝑗 := 𝑥𝑃𝑀𝐷;
𝑦𝑃𝑀𝐷𝑗 := 𝑦𝑃𝑀𝐷;
𝑧𝑃𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑅(𝑝);
Provides: (𝑥𝑃𝑀𝐷𝑗, 𝑦𝑃𝑀𝐷𝑗, 𝑧𝑃𝑀𝐷𝑗)
Algorithm 5: Core 𝐺𝐹(𝑝) EC Affine-to-Jacobi Transformation.
Requires: (𝑥𝑃𝑀𝐷, 𝑦𝑃𝑀𝐷, 𝑝, 𝑎𝑀𝐷, 𝑏𝑀𝐷)
Output: (𝑒𝑣𝑎𝑙, point on curve or point not on curve)
𝑦²𝑃𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑦𝑃𝑀𝐷, 𝑦𝑃𝑀𝐷, 𝑝);
𝑥²𝑃𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑥𝑃𝑀𝐷, 𝑥𝑃𝑀𝐷, 𝑝);
𝑎𝑥𝑃𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑎𝑀𝐷, 𝑥𝑃𝑀𝐷, 𝑝);
𝑥³𝑃𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑥²𝑃𝑀𝐷, 𝑥𝑃𝑀𝐷, 𝑝);
𝑡1𝑀𝐷 = 𝑀𝑜𝑑𝐴𝑑𝑑(𝑥³𝑃𝑀𝐷, 𝑎𝑥𝑃𝑀𝐷, 𝑝);
𝑡2𝑀𝐷 = 𝑀𝑜𝑑𝐴𝑑𝑑(𝑡1𝑀𝐷, 𝑏𝑀𝐷, 𝑝);
𝑡3𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑦²𝑃𝑀𝐷, 𝑡2𝑀𝐷, 𝑝);
if 𝑡3𝑀𝐷 = 0 then
    return: (𝑒𝑣𝑎𝑙 = point on curve);
else
    return: (𝑒𝑣𝑎𝑙 = point not on curve);
Algorithm 6: Core 𝐺𝐹(𝑝) EC Point Validation.
5.3.3. GF(p) EC Affine-to-Jacobi Transformation. The prime
field EC Affine-to-Jacobi Transformation steps are
responsible for transforming the supplied montgomerized
affine point coordinates 𝑥𝑃𝑀𝐷 and 𝑦𝑃𝑀𝐷 of a curve Point 𝑃
into Jacobian coordinates. Algorithm 5 lists the necessary
steps to utilize the core for prime field EC Affine-to-Jacobi
Transformation for an arbitrary montgomerized curve Point 𝑃.
A core prime field EC Affine-to-Jacobi Transformation
operation requires 1 × 𝑀𝑜𝑛𝑡𝑅, 1 × 𝐶𝑜𝑝𝑦𝐻2𝑉, and 1 ×
𝐶𝑜𝑝𝑦𝑉2𝑉 in the case of an intended Point Addition, Point
Doubling, or Point Multiplication operation.
5.3.4. GF(p) EC Point Validation. The prime field EC Point
Validation checks whether a supplied (or calculated) point
is indeed a valid point of the elliptic curve given by the
equation 𝑦² ≡ 𝑥³ + 𝑎 ⋅ 𝑥 + 𝑏 mod 𝑝. As a requirement, the
Point Validation must be conducted on montgomerized points
in affine coordinate representation. Algorithm 6 lists the
necessary steps to utilize the core for prime field EC Point
Validation.
A core prime field EC Point Validation operation requires
4×𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 2×𝑀𝑜𝑑𝐴𝑑𝑑, 1×𝑀𝑜𝑑𝑆𝑢𝑏, 4×𝐶𝑜𝑝𝑦𝑉2𝐻, and
5 × 𝐶𝑜𝑝𝑦𝐻2𝐻.
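The validation steps can be sketched as follows with a plain integer model. The helper names, the parameter `bits` (with R = 2^bits), and the toy curve are assumptions for illustration; the montgomerization in `to_md` is computed directly instead of via a multiplication by 𝑟².

```python
def mont_mult(a, b, p, bits):
    """Reference model: a * b * R^-1 mod p with R = 2**bits (p odd)."""
    return (a * b * pow(1 << bits, -1, p)) % p


def to_md(v, p, bits):
    """Montgomerize a value directly (shortcut for a multiplication by r^2)."""
    return (v << bits) % p


def ec_point_valid_md(x_md, y_md, a_md, b_md, p, bits):
    """Check y^2 = x^3 + a*x + b on montgomerized affine coordinates."""
    y2 = mont_mult(y_md, y_md, p, bits)
    x2 = mont_mult(x_md, x_md, p, bits)
    ax = mont_mult(a_md, x_md, p, bits)
    x3 = mont_mult(x2, x_md, p, bits)
    t1 = (x3 + ax) % p           # ModAdd
    t2 = (t1 + b_md) % p         # ModAdd
    t3 = (y2 - t2) % p           # ModSub
    return t3 == 0


# Toy curve (assumption): y^2 = x^3 + 2x + 3 over GF(97), R = 2**8
p, bits = 97, 8
a_md, b_md = to_md(2, p, bits), to_md(3, p, bits)
on_curve = ec_point_valid_md(to_md(3, p, bits), to_md(6, p, bits), a_md, b_md, p, bits)
off_curve = ec_point_valid_md(to_md(3, p, bits), to_md(7, p, bits), a_md, b_md, p, bits)
```

Since every term of the curve equation carries the same factor R, the comparison with zero gives the same verdict inside and outside the Montgomery Domain.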
5.3.5. GF(p) EC Point Doubling. The prime field EC Point
Doubling steps perform a single Point Doubling operation of
a Point 𝑃 with montgomerized Jacobi coordinates, resulting
in 2 ⋅ 𝑃 = 𝑅 also represented in montgomerized Jacobi
coordinates. The original algorithm for Point Doubling with
Jacobi coordinate representation has been modified to be
suitable for the proposed core and is given in Algorithm 7.
A core prime field EC Point Doubling operation requires
15 × 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 1 × 𝑀𝑜𝑑𝐴𝑑𝑑, 3 × 𝑀𝑜𝑑𝑆𝑢𝑏, 6 × 𝐶𝑜𝑝𝑦𝐻2𝑉,
6 × 𝐶𝑜𝑝𝑦𝑉2𝐻, and 7 × 𝐶𝑜𝑝𝑦𝐻2𝐻.
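The Point Doubling sequence can be checked against a small example with a plain integer model. The function names, the dictionary `c_md` holding the montgomerized constants, the parameter `bits`, and the toy curve are assumptions for illustration; the conversion back to affine coordinates at the end uses an ordinary modular inverse purely for checking.

```python
def mont_mult(a, b, p, bits):
    """Reference model: a * b * R^-1 mod p with R = 2**bits (p odd)."""
    return (a * b * pow(1 << bits, -1, p)) % p


def ec_double_md(point, a_md, c_md, p, bits):
    """Jacobian Point Doubling on montgomerized coordinates, step by step."""
    x, y, z = point

    def mm(u, v):
        return mont_mult(u, v, p, bits)

    t1 = mm(c_md[4], x)
    y2 = mm(y, y)
    s = mm(t1, y2)               # S = 4 * x * y^2
    x2 = mm(x, x)
    t2 = mm(c_md[3], x2)
    z2 = mm(z, z)
    z4 = mm(z2, z2)
    t3 = mm(a_md, z4)
    m = (t2 + t3) % p            # M = 3 * x^2 + a * z^4
    m2 = mm(m, m)
    t4 = mm(c_md[2], s)
    xr = (m2 - t4) % p
    t5 = (s - xr) % p
    t6 = mm(m, t5)
    y4 = mm(y2, y2)
    t7 = mm(c_md[8], y4)
    yr = (t6 - t7) % p
    t8 = mm(y, z)
    zr = mm(c_md[2], t8)
    return xr, yr, zr


# Toy setup (assumption): y^2 = x^3 + 2x + 3 over GF(97), P = (3, 6), R = 2**8
p, bits = 97, 8
r = (1 << bits) % p              # MontR: z = 1 in the Montgomery Domain


def to_md(v):
    return (v * r) % p


c_md = {c: to_md(c) for c in (2, 3, 4, 8)}
jac_md = ec_double_md((to_md(3), to_md(6), r), to_md(2), c_md, p, bits)
X, Y, Z = (mont_mult(v, 1, p, bits) for v in jac_md)
z_inv = pow(Z, -1, p)
affine = (X * z_inv ** 2 % p, Y * z_inv ** 3 % p)
```

For this toy curve the affine doubling formula gives 2 ⋅ (3, 6) = (80, 10), which the Jacobian result reproduces after conversion.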
5.3.6. GF(p) EC Point Addition. The prime field EC Point
Addition steps perform a single Point Addition operation of
two Points 𝑃 and 𝑄 with montgomerized Jacobi coordinates,
resulting in 𝑃 + 𝑄 = 𝑅, also represented in montgomerized
Jacobi coordinates. The original algorithm for Point Addition
with Jacobi coordinate representation has been modified to
be suitable for the proposed core and is given in Algorithm 8.
A core prime field EC Point Addition operation requires
17 ×𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 6 ×𝑀𝑜𝑑𝑆𝑢𝑏, 9 × 𝐶𝑜𝑝𝑦𝐻2𝑉, 7 × 𝐶𝑜𝑝𝑦𝑉2𝐻,
and 12 × 𝐶𝑜𝑝𝑦𝐻2𝐻.
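The Point Addition sequence, including its two special cases, can be sketched with a plain integer model. The function names, the parameter `bits`, and the toy curve are assumptions for illustration; for brevity the sketch raises an error where the algorithm would hand over to the Point Doubling operation.

```python
def mont_mult(a, b, p, bits):
    """Reference model: a * b * R^-1 mod p with R = 2**bits (p odd)."""
    return (a * b * pow(1 << bits, -1, p)) % p


def ec_add_md(P, Q, two_md, p, bits):
    """Jacobian Point Addition on montgomerized coordinates, step by step."""
    xp, yp, zp = P
    xq, yq, zq = Q

    def mm(u, v):
        return mont_mult(u, v, p, bits)

    zq2 = mm(zq, zq)
    u1 = mm(xp, zq2)
    zp2 = mm(zp, zp)
    u2 = mm(xq, zp2)
    zq3 = mm(zq2, zq)
    s1 = mm(yp, zq3)
    zp3 = mm(zp2, zp)
    s2 = mm(yq, zp3)
    h = (u2 - u1) % p
    r = (s2 - s1) % p
    if h == 0:
        if r != 0:
            return (0, 1, 0)     # Point at Infinity, as notified by the algorithm
        raise ValueError("P == Q: perform the Point Doubling operation instead")
    r2 = mm(r, r)
    h2 = mm(h, h)
    h3 = mm(h2, h)
    t1 = mm(u1, h2)
    t2 = mm(two_md, t1)
    t3 = (r2 - h3) % p
    xr = (t3 - t2) % p
    t4 = mm(s1, h3)
    t5 = (t1 - xr) % p
    t6 = mm(r, t5)
    yr = (t6 - t4) % p
    t7 = mm(zp, h)
    zr = mm(zq, t7)
    return xr, yr, zr


# Toy setup (assumption): y^2 = x^3 + 2x + 3 over GF(97), R = 2**8
p, bits = 97, 8
one_md = (1 << bits) % p


def to_md(v):
    return (v * one_md) % p


P = (to_md(3), to_md(6), one_md)
Q = (to_md(80), to_md(10), one_md)
X, Y, Z = (mont_mult(v, 1, p, bits) for v in ec_add_md(P, Q, to_md(2), p, bits))
z_inv = pow(Z, -1, p)
affine = (X * z_inv ** 2 % p, Y * z_inv ** 3 % p)
neg_P = (to_md(3), to_md(97 - 6), one_md)
infinity = ec_add_md(P, neg_P, to_md(2), p, bits)
```

On this toy curve the affine chord formula gives (3, 6) + (80, 10) = (80, 87), and adding a point to its negative triggers the Point at Infinity branch.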
5.3.7. GF(p) EC Jacobi-to-Affine Transformation. The prime
field EC Jacobi-to-Affine Transformation steps are respon-
sible for the transformation of the supplied montgomerized
Jacobi coordinates 𝑥𝑅𝑀𝐷𝑗, 𝑦𝑅𝑀𝐷𝑗, and 𝑧𝑅𝑀𝐷𝑗 of the curve
Point 𝑅 back into affine coordinate representation. This
transformation step requires the calculation of a modular
multiplicative inverse element which will be performed by
a Montgomery modular exponentiation according to Euler’s
theorem since the modulus is a prime number. Algorithm 9
lists the necessary steps to utilize the core for prime field EC
Jacobi-to-Affine Transformation.
A core prime field EC Jacobi-to-Affine Transformation
operation requires 4×𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, 1×𝑀𝑜𝑛𝑡𝐸𝑥𝑝, 3×𝐶𝑜𝑝𝑦𝐻2𝑉,
1 × 𝐶𝑜𝑝𝑦𝑉2𝐻, 1 × 𝐶𝑜𝑝𝑦𝑉2𝑉 and 1 × 𝐶𝑜𝑝𝑦𝐻2𝐻.
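The inversion by exponentiation with 𝑝 − 2 and the subsequent back-transformation can be sketched with a plain integer model. The function names, the three-argument `mont_exp`, the parameter `bits`, and the toy values are assumptions for illustration.

```python
def mont_mult(a, b, p, bits):
    """Reference model: a * b * R^-1 mod p with R = 2**bits (p odd)."""
    return (a * b * pow(1 << bits, -1, p)) % p


def mont_exp(x_md, e, p, bits):
    """Left-to-right square-and-multiply on a montgomerized base (e >= 1)."""
    acc = x_md
    for bit in bin(e)[3:]:
        acc = mont_mult(acc, acc, p, bits)
        if bit == "1":
            acc = mont_mult(acc, x_md, p, bits)
    return acc


def jacobi_to_affine_md(point, p, bits):
    """x = X / Z^2 and y = Y / Z^3, with 1/Z computed as Z^(p-2)."""
    x, y, z = point
    t1 = mont_exp(z, p - 2, p, bits)      # montgomerized inverse of z
    t2 = mont_mult(t1, t1, p, bits)       # z^-2
    t3 = mont_mult(t2, t1, p, bits)       # z^-3
    return mont_mult(x, t2, p, bits), mont_mult(y, t3, p, bits)


# Toy values (assumption): Jacobian point (74, 14, 12) over GF(97), R = 2**8
p, bits = 97, 8
r = (1 << bits) % p
jac_md = tuple((v * r) % p for v in (74, 14, 12))
x_md, y_md = jacobi_to_affine_md(jac_md, p, bits)
# Montgomery Backtransformation: multiplication with the constant 1
x_aff, y_aff = mont_mult(x_md, 1, p, bits), mont_mult(y_md, 1, p, bits)
```

The exponentiation never leaves the Montgomery Domain, so the inverse is obtained already montgomerized and can be fed directly into the following multiplications.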
5.3.8. GF(p) EC Montgomery Backtransformation. The prime
field EC Montgomery Backtransformation steps are respon-
sible for the transformation of the supplied montgomerized
point coordinates 𝑥𝑅𝑀𝐷 and 𝑦𝑅𝑀𝐷 of a Point 𝑅 out of
the Montgomery Domain. Algorithm 10 lists the necessary
steps to utilize the core for prime field EC Montgomery
Backtransformation for an arbitrary curve Point 𝑅.
A core prime field EC Montgomery Backtransformation
operation requires 2 ×𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1 and 2 × 𝐶𝑜𝑝𝑦𝑉2𝐻.
Requires: (𝑥𝑃𝑀𝐷𝑗, 𝑦𝑃𝑀𝐷𝑗, 𝑧𝑃𝑀𝐷𝑗, 𝑝, 2𝑀𝐷, 3𝑀𝐷, 4𝑀𝐷, 8𝑀𝐷, 𝑎𝑀𝐷)
Calculates: (𝑥𝑅𝑀𝐷𝑗, 𝑦𝑅𝑀𝐷𝑗, 𝑧𝑅𝑀𝐷𝑗)
𝑡1𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(4𝑀𝐷, 𝑥𝑃𝑀𝐷𝑗, 𝑝);
𝑦²𝑃𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑦𝑃𝑀𝐷𝑗, 𝑦𝑃𝑀𝐷𝑗, 𝑝);
𝑆𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑡1𝑀𝐷, 𝑦²𝑃𝑀𝐷𝑗, 𝑝);
𝑥²𝑃𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑥𝑃𝑀𝐷𝑗, 𝑥𝑃𝑀𝐷𝑗, 𝑝);
𝑡2𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(3𝑀𝐷, 𝑥²𝑃𝑀𝐷𝑗, 𝑝);
𝑧²𝑃𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑧𝑃𝑀𝐷𝑗, 𝑧𝑃𝑀𝐷𝑗, 𝑝);
𝑧⁴𝑃𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑧²𝑃𝑀𝐷𝑗, 𝑧²𝑃𝑀𝐷𝑗, 𝑝);
𝑡3𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑎𝑀𝐷, 𝑧⁴𝑃𝑀𝐷𝑗, 𝑝);
𝑀𝑀𝐷 = 𝑀𝑜𝑑𝐴𝑑𝑑(𝑡2𝑀𝐷, 𝑡3𝑀𝐷, 𝑝);
𝑀²𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑀𝑀𝐷, 𝑀𝑀𝐷, 𝑝);
𝑡4𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(2𝑀𝐷, 𝑆𝑀𝐷, 𝑝);
𝑥𝑅𝑀𝐷𝑗 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑀²𝑀𝐷, 𝑡4𝑀𝐷, 𝑝);
𝑡5𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑆𝑀𝐷, 𝑥𝑅𝑀𝐷𝑗, 𝑝);
𝑡6𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑀𝑀𝐷, 𝑡5𝑀𝐷, 𝑝);
𝑦⁴𝑃𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑦²𝑃𝑀𝐷𝑗, 𝑦²𝑃𝑀𝐷𝑗, 𝑝);
𝑡7𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(8𝑀𝐷, 𝑦⁴𝑃𝑀𝐷𝑗, 𝑝);
𝑦𝑅𝑀𝐷𝑗 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑡6𝑀𝐷, 𝑡7𝑀𝐷, 𝑝);
𝑡8𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑦𝑃𝑀𝐷𝑗, 𝑧𝑃𝑀𝐷𝑗, 𝑝);
𝑧𝑅𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(2𝑀𝐷, 𝑡8𝑀𝐷, 𝑝);
Provides: (𝑥𝑅𝑀𝐷𝑗, 𝑦𝑅𝑀𝐷𝑗, 𝑧𝑅𝑀𝐷𝑗)
Algorithm 7: Core 𝐺𝐹(𝑝) EC Point Doubling.
6. Performance Analysis
In this section parameter-dependent formulas for the
computation times in clock cycles of the described basic core
operations are given, which allows upper and lower
calculation bounds to be specified. Furthermore, for the
supported precision widths in both finite fields, the number
of words to be processed and the possible numbers of
processing units are given. In order to estimate the size
ratio of different core variations, the number of logic
elements and dedicated logic registers for exemplary Altera
and Xilinx FPGAs is stated. Furthermore, results of a power
estimation are given. Depending on the resulting clock cycle
times of the core variations, a reference implementation
exhibiting a balance of performance and area consumption has
been defined. For this reference implementation the
computation times in clock cycles for the described exemplary
cryptographic algorithms are given.
Table 3: Core RAM copy operations computation time in clock
cycles (CC).
Operation     GF(p)              GF(2^m)
𝐶𝐶𝐶𝑜𝑝𝑦𝐻2𝑉    ⌈(|𝑝|/|𝑤|)⌉ + 3    ⌈(𝑚/|𝑤|)⌉ + 3
𝐶𝐶𝐶𝑜𝑝𝑦𝑉2𝑉    ⌈(|𝑝|/|𝑟|)⌉ + 2    ⌈(𝑚/|𝑟|)⌉ + 2
𝐶𝐶𝐶𝑜𝑝𝑦𝐻2𝐻    ⌈(|𝑝|/|𝑤|)⌉ + 3    ⌈(𝑚/|𝑤|)⌉ + 3
𝐶𝐶𝐶𝑜𝑝𝑦𝑉2𝐻    ⌈(|𝑝|/|𝑤|)⌉ + 3    ⌈(𝑚/|𝑤|)⌉ + 3
6.1. Core Computation Time Formulas. Table 3 lists the
computation time formulas in clock cycles for the RAM copy
operations of the proposed core. Note that the resulting
calculation times of RAM reorganisation operations depend
only on the specified precision (|𝑝| for 𝐺𝐹(𝑝) and 𝑚 for
𝐺𝐹(2^𝑚)), the word width parameter |𝑤| for which the core
variation has been generated, and the resulting RAM width
parameter |𝑟| with |𝑟| = 4 ⋅ |𝑤|. The operations 𝐶𝑜𝑝𝑦𝐻2𝑉,
𝐶𝑜𝑝𝑦𝐻2𝐻, and 𝐶𝑜𝑝𝑦𝑉2𝐻 exhibit the same computation time,
whereas the operation 𝐶𝑜𝑝𝑦𝑉2𝑉 will be performed in fewer
clock cycles.
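The copy formulas of Table 3 are simple enough to evaluate directly. The following sketch assumes a hypothetical helper `cc_copy` that takes the operation name, the precision (|𝑝| or 𝑚), and the word width |𝑤|.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def cc_copy(op, precision, w):
    """Clock cycles of a RAM copy operation according to Table 3."""
    r = 4 * w                                    # RAM width |r| = 4 * |w|
    if op == "CopyV2V":
        return ceil_div(precision, r) + 2
    if op in ("CopyH2V", "CopyH2H", "CopyV2H"):
        return ceil_div(precision, w) + 3
    raise ValueError(f"unknown copy operation: {op}")


# Example: 4096-bit precision with 32-bit words
h2v_cycles = cc_copy("CopyH2V", 4096, 32)
v2v_cycles = cc_copy("CopyV2V", 4096, 32)
```

The vertical-to-vertical copy is cheaper because it moves full RAM-width entries of |𝑟| bits instead of single words of |𝑤| bits.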
Table 4: Core 𝐺𝐹(𝑝) operations computation time in clock cycles
(CC).
Operation              GF(p)
𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 𝑏𝐺𝐹(𝑝)     ((|𝑝| − 𝑘)/𝑘) ⋅ 𝑒 + 𝑘 + 𝑒 + 4
𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 𝑤𝐺𝐹(𝑝)     ((|𝑝| − 𝑘)/𝑘) ⋅ 𝑒 + 𝑘 + 3 ⋅ 𝑒 + 4
𝐶𝐶𝑀𝑜𝑛𝑡𝑅𝐺𝐹(𝑝)          ⌈(|𝑝|/|𝑤|)⌉ + 3
𝐶𝐶𝑀𝑜𝑛𝑡𝑅2 𝑏𝐺𝐹(𝑝)       2 ⋅ (⌈(|𝑝|/|𝑤|)⌉ + 2) + 1 + 𝑥 ⋅ (𝐶𝐶𝐶𝑜𝑝𝑦𝐻2𝑉 − 1) + 𝑦 ⋅ (𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 𝑏𝐺𝐹(𝑝) − 1) + 1
𝐶𝐶𝑀𝑜𝑛𝑡𝑅2 𝑤𝐺𝐹(𝑝)       2 ⋅ (⌈(|𝑝|/|𝑤|)⌉ + 2) + 2 ⋅ ⌈(|𝑝|/|𝑤|)⌉ + 3 + 𝑥 ⋅ (𝐶𝐶𝐶𝑜𝑝𝑦𝐻2𝑉 − 1) + 𝑦 ⋅ (𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 𝑤𝐺𝐹(𝑝) − 1) + 1
𝐶𝐶𝑀𝑜𝑛𝑡𝐸𝑥𝑝 𝑏𝐺𝐹(𝑝)      [⌈(|𝑝|/|𝑤|)⌉ ⋅ (|𝑤| + 2)] + (𝐶𝐶𝐶𝑜𝑝𝑦𝑉2𝑉 − 1) + [2 ⋅ (𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 𝑏𝐺𝐹(𝑝) − 1)]
𝐶𝐶𝑀𝑜𝑛𝑡𝐸𝑥𝑝 𝑤𝐺𝐹(𝑝)      [⌈(|𝑝|/|𝑤|)⌉] + [⌈(|𝑝|/|𝑤|)⌉ ⋅ |𝑤| − |𝑝| + 3] + [2 ⋅ (|𝑝| − 2) ⋅ (𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 𝑤𝐺𝐹(𝑝) − 1)] + [(|𝑝| − 2) ⋅ (𝐶𝐶𝐶𝑜𝑝𝑦𝑉2𝑉 − 1)] + [(|𝑝| − 3) ⋅ (𝐶𝐶𝐶𝑜𝑝𝑦𝐻2𝑉 − 1)] + 1
𝐶𝐶𝑀𝑜𝑑𝐴𝑑𝑑 𝑏𝐺𝐹(𝑝)       ⌈(|𝑝|/|𝑤|)⌉ + 4
𝐶𝐶𝑀𝑜𝑑𝐴𝑑𝑑 𝑤𝐺𝐹(𝑝)       3 ⋅ ⌈(|𝑝|/|𝑤|)⌉ + 6
𝐶𝐶𝑀𝑜𝑑𝑆𝑢𝑏 𝑏𝐺𝐹(𝑝)       ⌈(|𝑝|/|𝑤|)⌉ + 4
𝐶𝐶𝑀𝑜𝑑𝑆𝑢𝑏 𝑤𝐺𝐹(𝑝)       2 ⋅ ⌈(|𝑝|/|𝑤|)⌉ + 4
𝐶𝐶𝑀𝑜𝑑𝑆𝑢𝑏 𝑎𝑤𝐺𝐹(𝑝)      3 ⋅ ⌈(|𝑝|/|𝑤|)⌉ + 6
The computing time formulas of prime field core opera-
tions given in clock cycles are listed in Table 4.
Requires: (𝑥𝑃𝑀𝐷𝑗, 𝑦𝑃𝑀𝐷𝑗, 𝑧𝑃𝑀𝐷𝑗, 𝑥𝑄𝑀𝐷𝑗, 𝑦𝑄𝑀𝐷𝑗, 𝑧𝑄𝑀𝐷𝑗, 𝑝, 2𝑀𝐷)
Calculates: (𝑥𝑅𝑀𝐷𝑗, 𝑦𝑅𝑀𝐷𝑗, 𝑧𝑅𝑀𝐷𝑗)
𝑧²𝑄𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑧𝑄𝑀𝐷𝑗, 𝑧𝑄𝑀𝐷𝑗, 𝑝);
𝑈1𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑥𝑃𝑀𝐷𝑗, 𝑧²𝑄𝑀𝐷𝑗, 𝑝);
𝑧²𝑃𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑧𝑃𝑀𝐷𝑗, 𝑧𝑃𝑀𝐷𝑗, 𝑝);
𝑈2𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑥𝑄𝑀𝐷𝑗, 𝑧²𝑃𝑀𝐷𝑗, 𝑝);
𝑧³𝑄𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑧²𝑄𝑀𝐷𝑗, 𝑧𝑄𝑀𝐷𝑗, 𝑝);
𝑆1𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑦𝑃𝑀𝐷𝑗, 𝑧³𝑄𝑀𝐷𝑗, 𝑝);
𝑧³𝑃𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑧²𝑃𝑀𝐷𝑗, 𝑧𝑃𝑀𝐷𝑗, 𝑝);
𝑆2𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑦𝑄𝑀𝐷𝑗, 𝑧³𝑃𝑀𝐷𝑗, 𝑝);
𝐻𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑈2𝑀𝐷, 𝑈1𝑀𝐷, 𝑝);
𝑅𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑆2𝑀𝐷, 𝑆1𝑀𝐷, 𝑝);
if 𝐻𝑀𝐷 = 0 then
    if 𝑅𝑀𝐷 ≠ 0 then
        Notify: (𝑥𝑅𝑀𝐷𝑗, 𝑦𝑅𝑀𝐷𝑗, 𝑧𝑅𝑀𝐷𝑗) = (0, 1, 0)
        Point at Infinity (O);
    else
        Perform: PointDoubling Operation;
else
    𝑅²𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑅𝑀𝐷, 𝑅𝑀𝐷, 𝑝);
    𝐻²𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐻𝑀𝐷, 𝐻𝑀𝐷, 𝑝);
    𝐻³𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝐻²𝑀𝐷, 𝐻𝑀𝐷, 𝑝);
    𝑡1𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑈1𝑀𝐷, 𝐻²𝑀𝐷, 𝑝);
    𝑡2𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(2𝑀𝐷, 𝑡1𝑀𝐷, 𝑝);
    𝑡3𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑅²𝑀𝐷, 𝐻³𝑀𝐷, 𝑝);
    𝑥𝑅𝑀𝐷𝑗 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑡3𝑀𝐷, 𝑡2𝑀𝐷, 𝑝);
    𝑡4𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑆1𝑀𝐷, 𝐻³𝑀𝐷, 𝑝);
    𝑡5𝑀𝐷 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑡1𝑀𝐷, 𝑥𝑅𝑀𝐷𝑗, 𝑝);
    𝑡6𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑅𝑀𝐷, 𝑡5𝑀𝐷, 𝑝);
    𝑦𝑅𝑀𝐷𝑗 = 𝑀𝑜𝑑𝑆𝑢𝑏(𝑡6𝑀𝐷, 𝑡4𝑀𝐷, 𝑝);
    𝑡7𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑧𝑃𝑀𝐷𝑗, 𝐻𝑀𝐷, 𝑝);
    𝑧𝑅𝑀𝐷𝑗 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑧𝑄𝑀𝐷𝑗, 𝑡7𝑀𝐷, 𝑝);
Provides: (𝑥𝑅𝑀𝐷𝑗, 𝑦𝑅𝑀𝐷𝑗, 𝑧𝑅𝑀𝐷𝑗)
Algorithm 8: Core 𝐺𝐹(𝑝) EC Point Addition.
Requires: (𝑥𝑅𝑀𝐷𝑗, 𝑦𝑅𝑀𝐷𝑗, 𝑧𝑅𝑀𝐷𝑗, 𝑝, 𝑒𝑥𝑝 = 𝑝 − 2)
Calculates: (𝑥𝑅𝑀𝐷, 𝑦𝑅𝑀𝐷)
𝑡1𝑀𝐷 = 𝑀𝑜𝑛𝑡𝐸𝑥𝑝(𝑧𝑅𝑀𝐷𝑗, 𝑧𝑅𝑀𝐷𝑗, 𝑒𝑥𝑝, 𝑧𝑅𝑀𝐷𝑗, 𝑝);
𝑡2𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑡1𝑀𝐷, 𝑡1𝑀𝐷, 𝑝);
𝑡3𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑡2𝑀𝐷, 𝑡1𝑀𝐷, 𝑝);
𝑥𝑅𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑥𝑅𝑀𝐷𝑗, 𝑡2𝑀𝐷, 𝑝);
𝑦𝑅𝑀𝐷 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡(𝑦𝑅𝑀𝐷𝑗, 𝑡3𝑀𝐷, 𝑝);
Provides: (𝑥𝑅𝑀𝐷, 𝑦𝑅𝑀𝐷)
Algorithm 9: Core 𝐺𝐹(𝑝) EC Jacobi-to-Affine Transformation.
The computation time of the 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 operation in 𝐺𝐹(𝑝)
depends on the specified precision |𝑝|, the number of active
processing units 𝑘, and the number of words
𝑒 = ⌈(|𝑝|/|𝑤|)⌉ + 1 running through the pipeline. In order to
specify lower and upper computation times, a best case and a
worst case formula are given. In the best case the carry-out
bit of the CLA adder after reuniting the 𝑇𝑆 and 𝑇𝐶 words is
not set, the comparator only has to evaluate the most
significant word, and a modular subtraction is not necessary.
In the worst case the carry-out bit of the CLA adder is also
not set, but the comparator has to evaluate all words and a
reduction of the resulting value is necessary.
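As a rough illustration, the best case and worst case cycle counts can be evaluated as follows. The sketch assumes that the 𝐺𝐹(𝑝) 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 entries of Table 4 read ((|𝑝| − 𝑘)/𝑘) ⋅ 𝑒 + 𝑘 + 𝑒 + 4 and ((|𝑝| − 𝑘)/𝑘) ⋅ 𝑒 + 𝑘 + 3 ⋅ 𝑒 + 4, and that 𝑘 divides |𝑝|; the function names are hypothetical.

```python
def words_e(p_bits, w):
    """Number of words running through the pipeline: ceil(|p| / |w|) + 1."""
    return -(-p_bits // w) + 1


def cc_montmult_gfp(p_bits, w, k, worst=False):
    """Clock cycles of a GF(p) MontMult for k active processing units.

    Assumes k divides |p|, which holds for power-of-two k and the
    precision widths discussed in this section.
    """
    e = words_e(p_bits, w)
    passes = (p_bits - k) // k
    tail = 3 * e if worst else e
    return passes * e + k + tail + 4


# Example: |p| = 4096, |w| = 32, k = 32
best = cc_montmult_gfp(4096, 32, 32)
worst = cc_montmult_gfp(4096, 32, 32, worst=True)
```

Under these assumptions the two cases differ by exactly 2 ⋅ 𝑒 cycles, i.e., by two additional passes over all words for the full comparison and the final reduction.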
The 𝑀𝑜𝑛𝑡𝑅 operation in 𝐺𝐹(𝑝) only depends on the chosen
precision |𝑝| and specified word width |𝑤| parameters.
Requires: (𝑥𝑅𝑀𝐷, 𝑦𝑅𝑀𝐷)
Calculates: (𝑥𝑅, 𝑦𝑅)
𝑥𝑅 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1(𝑥𝑅𝑀𝐷, 𝑝);
𝑦𝑅 = 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1(𝑦𝑅𝑀𝐷, 𝑝);
Provides: (𝑥𝑅, 𝑦𝑅)
Algorithm 10: Core 𝐺𝐹(𝑝) EC Montgomery Backtransformation.
Table 5: Precision-dependent values of 𝑥 and 𝑦 for 𝐺𝐹(𝑝) 𝑀𝑜𝑛𝑡𝑅2
operation.
192 224 256 320 384 448 512
x 7 6 8 7 8 7 9
y 8 11 8 10 9 12 9
768 1024 1536 2048 3072 4096
x 9 10 10 11 11 12
y 10 10 11 11 12 12
For the 𝑀𝑜𝑛𝑡𝑅2 operation, a best case and a worst case
computation time formula are given. In the best case, after
the shift operation, the comparator will only evaluate one
word and an initial modular subtraction operation is not
necessary. For the involved Montgomery Multiplication
operations the best case formula is used. In the worst case,
after the shift operation, the comparator has to evaluate all
words and decides that an initial modular subtraction
operation is needed. For the involved Montgomery
Multiplication operations the worst case formula is used. The
number 𝑥 of 𝐶𝑜𝑝𝑦𝐻2𝑉 operations and the number 𝑦 of
𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 operations depend on the chosen precision.
Table 5 lists the values for all supported 𝐺𝐹(𝑝) precisions.
For the 𝑀𝑜𝑛𝑡𝐸𝑥𝑝 operation, a best case and a worst case
computation time formula are given. In the best case the
exponent operand is 3; therefore only two Montgomery
Multiplications and one 𝐶𝑜𝑝𝑦𝑉2𝑉 operation are necessary.
For the involved Montgomery Multiplication operations the
best case formula is used. In the worst case the exponent is
assumed to be 2^(|𝑝|−1); therefore 2 ⋅ (|𝑝| − 2) × 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡,
(|𝑝| − 2) × 𝐶𝑜𝑝𝑦𝑉2𝑉, and (|𝑝| − 3) × 𝐶𝑜𝑝𝑦𝐻2𝑉 operations have
to be performed. For the involved Montgomery Multiplication
operations the worst case formula is used.
For the 𝑀𝑜𝑑𝐴𝑑𝑑 operation, a best case and a worst case
computation time formula are given. In the best case, after
the modular addition the CLA adder carry-out bit will not be
set, the comparator will only have to evaluate one word, and
an additional modular subtraction is not needed. In the worst
case the CLA adder carry-out bit will also not be set, but
the comparator will have to evaluate all words to decide that
an additional modular subtraction is necessary.
For the 𝑀𝑜𝑑𝑆𝑢𝑏 operation, a best case, a worst case, and an
absolute worst case computation time formula are given. In
the best case, after the modular subtraction the CLA adder
carry-out bit will not be set, the comparator will only have
to evaluate one word, and an additional modular subtraction
is not needed. In the worst case, after the modular
subtraction the CLA adder carry-out bit will be set and a
modular addition must be performed. In the absolute worst
case, after the modular subtraction the CLA adder carry-out
bit will not be set, the comparator will evaluate all words,
and an additional modular subtraction step is necessary. Note
that this will only occur if the resulting value after the
first subtraction operation is identical to the modulus,
which under normal operating conditions will not be the case.
The prime field 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1 operation is identical to the
𝐺𝐹(𝑝) 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 operation; therefore the same best and
worst case formulas apply.
The computing time formulas of binary field core operations
given in clock cycles are listed in Table 6.
Table 6: Core 𝐺𝐹(2^𝑚) operations computation time in clock cycles
(CC).
Operation                GF(2^m)
𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡𝐺𝐹(2𝑚)       ((𝑚 − (𝑚 mod 𝑘))/𝑘) ⋅ 𝑒 + (𝑚 mod 𝑘) + 𝑒 + 4
𝐶𝐶𝑀𝑜𝑛𝑡𝑅𝐺𝐹(2𝑚)          ⌈(𝑚/|𝑤|)⌉ + 3
𝐶𝐶𝑀𝑜𝑛𝑡𝑅2 𝑏𝐺𝐹(2𝑚)       (𝑚 + 1) ⋅ (⌈(𝑚/|𝑤|)⌉ + 2) + 𝑚 + 1
𝐶𝐶𝑀𝑜𝑛𝑡𝑅2 𝑤𝐺𝐹(2𝑚)       (𝑚 + 1) ⋅ (⌈(𝑚/|𝑤|)⌉ + 2) + 𝑚 ⋅ ⌈(𝑚/|𝑤|)⌉ + 𝑚 + 1
𝐶𝐶𝑀𝑜𝑛𝑡𝐸𝑥𝑝 𝑏𝐺𝐹(2𝑚)      [⌈(𝑚/|𝑤|)⌉ ⋅ (|𝑤| + 2)] + (𝐶𝐶𝐶𝑜𝑝𝑦𝑉2𝑉 − 1) + [2 ⋅ (𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡𝐺𝐹(2𝑚))]
𝐶𝐶𝑀𝑜𝑛𝑡𝐸𝑥𝑝 𝑤𝐺𝐹(2𝑚)      [⌈(𝑚/|𝑤|)⌉] + [⌈(𝑚/|𝑤|)⌉ ⋅ |𝑤| − 𝑚 + 3] + [2 ⋅ (𝑚 − 2) ⋅ (𝐶𝐶𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡𝐺𝐹(2𝑚) − 1)] + [(𝑚 − 2) ⋅ (𝐶𝐶𝐶𝑜𝑝𝑦𝑉2𝑉 − 1)] + [(𝑚 − 3) ⋅ (𝐶𝐶𝐶𝑜𝑝𝑦𝐻2𝑉 − 1)] + 1
𝐶𝐶𝑀𝑜𝑑𝐴𝑑𝑑 𝑏𝐺𝐹(2𝑚)       ⌈(𝑚/|𝑤|)⌉ + 4
𝐶𝐶𝑀𝑜𝑑𝐴𝑑𝑑 𝑎𝑤𝐺𝐹(2𝑚)      3 ⋅ ⌈(𝑚/|𝑤|)⌉ + 6
The computation time of the 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 operation in 𝐺𝐹(2^𝑚)
depends on the specified precision parameter 𝑚, the number
of active processing units 𝑘, and the number of words
𝑒 = ⌈(𝑚/|𝑤|)⌉ + 1 running through the pipeline. Since the
additions in 𝐺𝐹(2^𝑚) are simple XOR operations and the most
significant bit of the resulting value will never be set
after calculation, only one formula is given.
While the determination of the Montgomery Parameter 𝑟 in
𝐺𝐹(2^𝑚) differs from the calculation rule for 𝐺𝐹(𝑝), it
also only depends on the chosen precision 𝑚 and specified
word width |𝑤| parameters.
The 𝑀𝑜𝑛𝑡𝑅2 operation in 𝐺𝐹(2^𝑚) is based on shifts and
possible modular additions whenever the most significant bit
of the intermediate value is set after a shift. The number
of modular additions depends on the Montgomery Parameter 𝑟,
which itself depends on the irreducible polynomial. In order
to specify lower and upper computation times, a best case and
a worst case formula are given. The best case assumes that
no modular addition operation is required at all, whereas
the worst case assumes that a modular addition operation is
required after each shift operation.
For the 𝑀𝑜𝑛𝑡𝐸𝑥𝑝 operation, a best case and a worst case
computation time formula are given. In the best case the
exponent operand is 3; therefore only two Montgomery
Multiplications and one 𝐶𝑜𝑝𝑦𝑉2𝑉 operation are necessary.
In the worst case the exponent is assumed to be 2^(𝑚−1);
therefore 2 ⋅ (𝑚 − 2) × 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡, (𝑚 − 2) × 𝐶𝑜𝑝𝑦𝑉2𝑉, and
(𝑚 − 3) × 𝐶𝑜𝑝𝑦𝐻2𝑉 operations are required.
For the 𝑀𝑜𝑑𝐴𝑑𝑑 operation, a best case and an absolute worst
case computation time formula are given. In the best case
the comparator will only have to evaluate one word. In the
absolute worst case the comparator will have to evaluate all
words to decide that an additional modular addition is
necessary. Note that this will only occur if the resulting
value after the first addition operation is identical to the
modulus polynomial, which under normal operating conditions
will not be the case.
The binary field 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡1 operation is identical to the
𝐺𝐹(2^𝑚) 𝑀𝑜𝑛𝑡𝑀𝑢𝑙𝑡 operation; therefore the same formula
applies.
6.2. Core Variations. Depending on the needs in terms of
performance, area consumption, supported precisions, and
the interfacing structure, different variations of the core
can be generated by defining the parameters
MAX_PRECISION_WIDTH, WORD_WIDTH, and MAX_NUM_PUS. Table 7
lists the resulting number of words 𝑒 and the possible
number of processing units 𝑘 for the supported prime field
precisions |𝑝| and typical word widths |𝑤| of 16, 32, and
64 bits.
Correspondingly, Table 8 lists the resulting number of words
𝑒 and the number of possible processing units 𝑘 for the
supported binary field precisions 𝑚 and typical word widths
|𝑤| of 16, 32, and 64 bits. Note that the number of possible
processing units for binary fields within the defined core
is subject to a further constraint. Once all bits of the 𝐴
operand have been processed, the remaining processing units
in the pipeline must be bypassed and the 𝑇𝑆 and 𝑇𝐶 words
must be fed directly into the CLA adder. Since the result of
the CLA adder will be written back to RAM while remaining
words must still be read from RAM and fed into the first
processing unit, the RAM source and destination signals must
never address the same memory location at the same time.
Therefore either (𝑘 − (𝑚 mod 𝑘)) ≡ 0 mod 𝑘 must hold,
meaning that no processing unit will be bypassed, or
(𝑘 − (𝑚 mod 𝑘)) ≡ 1 mod 𝑘, meaning that only the very last
processing unit will be bypassed in the last cycle of 𝐴
operand bits.
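The bypass constraint can be checked with a few lines of code. The sketch below models only this congruence condition; the function names and the restriction to power-of-two candidates up to 64 are assumptions. Table 8 additionally shows an upper limit on 𝑘 that depends on the word width, which is not modeled here.

```python
def bypass_ok(m, k):
    """True if at most the very last processing unit is bypassed for precision m."""
    return (k - (m % k)) % k in (0, 1)


def candidate_pus(m, max_pus=64):
    """Power-of-two processing unit counts that satisfy the bypass constraint."""
    result, k = [], 2
    while k <= max_pus:
        if bypass_ok(m, k):
            result.append(k)
        k *= 2
    return result


# Examples for some binary field precisions
k_193 = candidate_pus(193)
k_571 = candidate_pus(571)
```

For odd precisions such as 𝑚 = 193 the constraint alone already restricts the pipeline to two processing units, which matches the corresponding row of Table 8.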
6.3. Core Hardware Footprint. Since all components of the
design consist of simple logic elements, the proposed
arithmetic core is vendor-neutral. In order to estimate the
hardware footprint of different core implementations, the
design variations have been compiled on Altera and Xilinx
FPGAs. Table 9 lists the number of total logic elements and
comprised logic registers for varied values of WORD_WIDTH
(|𝑤|) and MAX_NUM_PUS generated for an Altera Cyclone IV
(EP4CE115F29C9L) device featuring 114,480 logic elements
and 3,981,312 memory bits.
Table 10 lists the number of total logic elements and
comprised logic registers for varied values of WORD_WIDTH
(|𝑤|) and MAX_NUM_PUS generated for a Xilinx XC7Z020
(xc7z020clg484-1) device featuring 53,200 logic elements
and 106,400 registers. The resulting values demonstrate that
the design can compete with other proposed designs, for
instance, the one compared in [14, 15]. Furthermore, instead
of being restricted to only one cryptographic application,
the core can handle various algorithms. According to the
needs in terms of area, a suitable solution for a specific
implementation can be chosen. The choice will have an
impact on power consumption and computing time.
Table 7: Number of words 𝑒 and amount of possible processing units
𝑘 depending on precision |𝑝| and word width |𝑤| for 𝐺𝐹(𝑝).
GF(p)       |w| = 16                             |w| = 32                             |w| = 64
|p| = 192   𝑒 = 13, 𝑘 = 2, 4, 8                  𝑒 = 7, 𝑘 = 2, 4                      𝑒 = 4, 𝑘 = 2
|p| = 224   𝑒 = 15, 𝑘 = 2, 4, 8                  𝑒 = 8, 𝑘 = 2, 4                      𝑒 = 5, 𝑘 = 2
|p| = 256   𝑒 = 17, 𝑘 = 2, 4, 8                  𝑒 = 9, 𝑘 = 2, 4                      𝑒 = 5, 𝑘 = 2
|p| = 320   𝑒 = 21, 𝑘 = 2, 4, 8, 16              𝑒 = 11, 𝑘 = 2, 4, 8                  𝑒 = 6, 𝑘 = 2, 4
|p| = 384   𝑒 = 25, 𝑘 = 2, 4, 8, 16              𝑒 = 13, 𝑘 = 2, 4, 8                  𝑒 = 7, 𝑘 = 2, 4
|p| = 448   𝑒 = 29, 𝑘 = 2, 4, 8, 16              𝑒 = 15, 𝑘 = 2, 4, 8                  𝑒 = 8, 𝑘 = 2, 4
|p| = 512   𝑒 = 33, 𝑘 = 2, 4, 8, 16              𝑒 = 17, 𝑘 = 2, 4, 8                  𝑒 = 9, 𝑘 = 2, 4
|p| = 768   𝑒 = 49, 𝑘 = 2, 4, 8, 16, 32          𝑒 = 25, 𝑘 = 2, 4, 8, 16              𝑒 = 13, 𝑘 = 2, 4, 8
|p| = 1024  𝑒 = 65, 𝑘 = 2, 4, 8, 16, 32          𝑒 = 33, 𝑘 = 2, 4, 8, 16              𝑒 = 17, 𝑘 = 2, 4, 8
|p| = 1536  𝑒 = 97, 𝑘 = 2, 4, 8, 16, 32, 64      𝑒 = 49, 𝑘 = 2, 4, 8, 16, 32          𝑒 = 25, 𝑘 = 2, 4, 8, 16
|p| = 2048  𝑒 = 129, 𝑘 = 2, 4, 8, 16, 32, 64     𝑒 = 65, 𝑘 = 2, 4, 8, 16, 32          𝑒 = 33, 𝑘 = 2, 4, 8, 16
|p| = 3072  𝑒 = 193, 𝑘 = 2, 4, 8, 16, 32, 64     𝑒 = 97, 𝑘 = 2, 4, 8, 16, 32, 64      𝑒 = 49, 𝑘 = 2, 4, 8, 16, 32
|p| = 4096  𝑒 = 257, 𝑘 = 2, 4, 8, 16, 32, 64     𝑒 = 129, 𝑘 = 2, 4, 8, 16, 32, 64     𝑒 = 65, 𝑘 = 2, 4, 8, 16, 32
Table 8: Number of words 𝑒 and amount of possible processing units
𝑘 depending on precision 𝑚 and word width |𝑤| for 𝐺𝐹(2^𝑚).
GF(2^m)   |w| = 16                   |w| = 32               |w| = 64
m = 131   𝑒 = 10, 𝑘 = 2, 4           𝑒 = 6, 𝑘 = 2, 4        𝑒 = 4, 𝑘 = 2
m = 163   𝑒 = 12, 𝑘 = 2, 4           𝑒 = 7, 𝑘 = 2, 4        𝑒 = 4, 𝑘 = 2
m = 176   𝑒 = 12, 𝑘 = 2, 4, 8        𝑒 = 7, 𝑘 = 2, 4        𝑒 = 4, 𝑘 = 2
m = 191   𝑒 = 13, 𝑘 = 2, 4, 8        𝑒 = 7, 𝑘 = 2, 4        𝑒 = 4, 𝑘 = 2
m = 193   𝑒 = 14, 𝑘 = 2              𝑒 = 8, 𝑘 = 2           𝑒 = 5, 𝑘 = 2
m = 208   𝑒 = 14, 𝑘 = 2, 4, 8        𝑒 = 8, 𝑘 = 2, 4        𝑒 = 5, 𝑘 = 2
m = 233   𝑒 = 16, 𝑘 = 2              𝑒 = 9, 𝑘 = 2           𝑒 = 5, 𝑘 = 2
m = 239   𝑒 = 16, 𝑘 = 2, 4, 8        𝑒 = 9, 𝑘 = 2, 4        𝑒 = 5, 𝑘 = 2
m = 272   𝑒 = 18, 𝑘 = 2, 4, 8, 16    𝑒 = 10, 𝑘 = 2, 4, 8    𝑒 = 6, 𝑘 = 2, 4
m = 283   𝑒 = 19, 𝑘 = 2, 4           𝑒 = 10, 𝑘 = 2, 4       𝑒 = 6, 𝑘 = 2, 4
m = 304   𝑒 = 20, 𝑘 = 2, 4, 8, 16    𝑒 = 11, 𝑘 = 2, 4, 8    𝑒 = 6, 𝑘 = 2, 4
m = 359   𝑒 = 24, 𝑘 = 2, 4, 8        𝑒 = 13, 𝑘 = 2, 4, 8    𝑒 = 7, 𝑘 = 2, 4
m = 368   𝑒 = 24, 𝑘 = 2, 4, 8, 16    𝑒 = 13, 𝑘 = 2, 4, 8    𝑒 = 7, 𝑘 = 2, 4
m = 409   𝑒 = 27, 𝑘 = 2              𝑒 = 14, 𝑘 = 2          𝑒 = 8, 𝑘 = 2
m = 431   𝑒 = 28, 𝑘 = 2, 4, 8, 16    𝑒 = 15, 𝑘 = 2, 4, 8    𝑒 = 8, 𝑘 = 2, 4
m = 571   𝑒 = 37, 𝑘 = 2, 4           𝑒 = 19, 𝑘 = 2, 4       𝑒 = 10, 𝑘 = 2, 4
6.4. Core Power Estimation. In order to evaluate the
suitability of the proposed core for application in the IoT
area, a power estimation has been conducted for various core
variations using two common frequencies of 100 MHz and
200 MHz. Timing analysis shows that the design can be
operated reliably at these frequencies. The power consumption
characteristics have been derived by applying the PowerPlay
Power Analyzer Tool of the Quartus Prime IDE to the final
design using default settings of a power toggle rate as well
as a power input I/O toggle rate of 12.5%, using a vectorless
estimation and a board temperature of 25 °C. Table 11 lists
the Total Thermal Power Dissipation values for varied
WORD_WIDTH (|𝑤|) and MAX_NUM_PUS parameters generated for the
Altera Cyclone IV (EP4CE115F29C9L) device. The values are
comparable to the ones given in [24] for RSA calculation.
Furthermore, it has to be mentioned that the optimization
mode in the compiler settings was set to balanced and no
specific compiler optimizations regarding power were turned
on. The results indicate that the core is suitable for
applications which have special constraints regarding power
consumption. According to such needs as well as the desired
clock frequency, a suitable variation can be implemented. The
choice will have an impact on computing time and hardware
footprint.
Table 9: Amount of logic elements and logic registers for different
core variations (Altera Cyclone IV).
|𝑤|   MAX_NUM_PUS   Logic Elements   Registers
16    2             3,128            706
16    4             3,523            904
16    8             4,344            1,300
16    16            5,960            2,092
16    32            9,198            3,676
16    64            15,568           6,844
32    2             4,086            988
32    4             4,721            1,314
32    8             5,935            1,966
32    16            8,473            3,270
32    32            13,498           5,878
32    64            23,484           11,094
64    2             6,114            1,557
64    4             7,151            2,139
64    8             9,346            3,303
64    16            13,624           5,631
64    32            22,113           10,287
Table 10: Number of logic elements and logic registers for different core variations (Xilinx XC7Z020).
|w|   MAX NUM PUS   Logic Elements   Registers
16    2             2,587            755
16    4             2,874            956
16    8             3,467            1,352
16    16            4,623            2,146
16    32            7,208            3,724
16    64            11,934           6,895
32    2             3,367            1,114
32    4             4,014            1,429
32    8             4,694            2,082
32    16            6,645            3,386
32    32            10,177           5,999
32    64            17,770           11,222
64    2             5,098            1,833
64    4             5,869            2,421
64    8             7,585            3,591
64    16            10,654           5,928
64    32            16,871           10,624
6.5. Core Reference Implementation. For the reference
implementation a word width of WORD WIDTH = 32
bit was chosen and the maximum number of processing
units of the pipeline was set to MAX NUM PUS = 32.
The maximum supported precision width parameter
MAX PRECISION WIDTH was set to 4096, leading to a
RAM consisting of 28,672 bits. Table 12 lists the computation
time in clock cycles of the reference implementation for the RSA
application. For RSA public-key operations best case and
worst case computation times are given under the assumption
that the public exponent is e = 0x10001. Therefore during the
Table 11: Total Thermal Power Dissipation (TTPD) values for different core variations (Altera Cyclone IV).
|w|   MAX NUM PUS   TTPD at 100 MHz   TTPD at 200 MHz
16    2             237.36 mW         266.38 mW
16    4             239.72 mW         297.73 mW
16    8             245.97 mW         304.40 mW
16    16            261.07 mW         358.22 mW
16    32            289.61 mW         389.32 mW
16    64            358.76 mW         549.60 mW
32    2             287.18 mW         362.94 mW
32    4             294.21 mW         375.02 mW
32    8             311.36 mW         401.65 mW
32    16            343.68 mW         454.94 mW
32    32            398.65 mW         568.00 mW
32    64            490.90 mW         760.71 mW
Table 12: Core reference implementation RSA computation times (in clock cycles).
                 |p| = 2048    |p| = 3072    |p| = 4096
CC_{RSA pub}     134,128       302,309       529,783
CC_{RSA prik}    17,925,156    59,094,721    138,504,443
CC_{CRT-RSA}     9,194,091     15,667,698    36,131,248
Table 13: Core reference implementation Miller-Rabin computation times (in clock cycles).
              |p| = 512     |p| = 768     |p| = 1024
CC_{MR o1}    1,167,529     1,969,583     4,534,869
CC_{MR o2}    148           212           276
CC_{MR i1}    1,246         1,430         2,406
CC_{MR i2}    148           212           276
              |p| = 1536    |p| = 2048    |p| = 3072
CC_{MR o1}    7,720,993     17,868,205    58,960,069
CC_{MR o2}    404           532           788
CC_{MR i1}    2,790         4,726         10,134
CC_{MR i2}    404           532           788
              |p| = 4096
CC_{MR o1}    138,267,613
CC_{MR o2}    1,044
CC_{MR i1}    17,590
CC_{MR i2}    1,044
MontExp operation a total of 17 × MontMult, 15 × CopyH2V,
and 1 × CopyV2V operations will be performed. Since the
private exponent differs between RSA keys, only
worst case computation times for the supported precision
widths are given. The worst case RSA private key and
CRT-accelerated private key computation times assume the worst
case clock cycle times of the underlying operations given in
the previous section.
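The figure of 17 MontMult operations for e = 0x10001 is consistent with a plain left-to-right square-and-multiply schedule: the exponent has 17 bits, two of which are set, giving 16 squarings and 1 multiplication. The sketch below only counts operations; the function name `montmult_count` is hypothetical, and it is an assumption that the core's MontExp operation list follows exactly this schedule.

```python
def montmult_count(exponent: int) -> int:
    """Count the modular multiplications of left-to-right binary exponentiation.

    Every bit below the leading one costs one squaring; every set bit
    other than the leading one costs one additional multiplication.
    Conversions into and out of the Montgomery domain are not counted.
    """
    if exponent <= 0:
        raise ValueError("exponent must be positive")
    squarings = exponent.bit_length() - 1
    multiplications = bin(exponent).count("1") - 1
    return squarings + multiplications


if __name__ == "__main__":
    print(montmult_count(0x10001))  # 16 squarings + 1 multiplication = 17
```

The same count shows why a dense private exponent is so much more expensive: a 2048-bit exponent with all bits set would need 2047 squarings and 2047 multiplications under this schedule.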
Table 13 lists the worst case computation times in clock cycles
of the reference implementation for one iteration of the
Miller-Rabin primality testing application. Note that the most
time-consuming operation is part one of the outer loop of
Table 14: Core reference implementation prime field EC computation times (in clock cycles).
                      |p| = 192    |p| = 224    |p| = 256
CC_{EC prep}          5,252        8,281        8,736
CC_{EC mont doub}     742          972          1,234
CC_{EC mont add}      1,468        1,928        2,452
CC_{EC a2j}           22           24           26
CC_{EC point kal}     1,577        2,050        2,587
CC_{EC point doub}    5,613        7,351        9,329
CC_{EC point add}     6,434        8,412        10,662
CC_{EC point mult}    2,288,930    3,499,386    5,077,714
CC_{EC j2a}           139,233      213,732      311,079
CC_{EC demont}        734          964          1,226
                      |p| = 320    |p| = 384    |p| = 512
CC_{EC prep}          7,938        10,357       17,575
CC_{EC mont doub}     984          1,364        2,318
CC_{EC mont add}      1,948        2,708        4,612
CC_{EC a2j}           31           35           44
CC_{EC point kal}     2,109        2,895        4,851
CC_{EC point doub}    7,465        10,341       17,533
CC_{EC point add}     8,566        11,842       20,026
CC_{EC point mult}    5,097,858    8,473,906    19,155,090
CC_{EC j2a}           307,884      514,610      1,172,029
CC_{EC demont}        974          1,354        2,306
Algorithm 2, which is always performed in each iteration.
Depending on the evaluation of the result it might be
necessary to execute part two of the outer loop. Furthermore,
depending on the structure of the prime in question, it might
be necessary to execute parts one and two of the inner loop
multiple times.
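The cost structure described above can be seen in a plain software version of one Miller-Rabin iteration: a full modular exponentiation always runs first and dominates the cost, while the number of subsequent squarings depends on how many factors of two n − 1 contains. This is a minimal sketch in ordinary integer arithmetic, not the core's operation list; the function name `miller_rabin_round` is hypothetical, and the comments relating steps to the outer and inner loop parts of Algorithm 2 are an assumed reading.

```python
def miller_rabin_round(n: int, a: int) -> bool:
    """Run one Miller-Rabin iteration for odd n > 3 with base 2 <= a <= n - 2.

    Returns True if n is a probable prime with respect to base a,
    False if a proves n composite.
    """
    # Write n - 1 = 2^s * d with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1

    # Dominant cost (assumed to correspond to part one of the outer loop):
    # one full modular exponentiation.
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True

    # Up to s - 1 further squarings (assumed to correspond to the inner loop);
    # their number depends on the structure of n - 1.
    for _ in range(s - 1):
        x = (x * x) % n
        if x == n - 1:
            return True
        if x == 1:
            return False
    return False


if __name__ == "__main__":
    print(miller_rabin_round(97, 5))    # 97 is prime
    print(miller_rabin_round(561, 2))   # 561 = 3 * 11 * 17 is composite
```

For a candidate with n − 1 = 2 · d, as is the case for safe primes, the squaring loop is not entered at all, so a single iteration consists almost entirely of the exponentiation.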
Table 14 lists the computation time in clock cycles of
the reference implementation for prime field EC operations
for all supported precision widths. The Affine-to-Jacobi
Transformation step requires a precision-dependent number
of clock cycles. For the remaining steps worst case clock
cycle times are given. For the Point Multiplication operation
an absolute worst case computation time is stated, in which
a theoretical scalar is hypothesized to be 2^{|p|} − 1; therefore a
maximum of (|p| − 1) Point Doubling and (|p| − 1) Point
Addition operations would be necessary, assuming a simple
double-and-add algorithm.
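The operation count of a simple left-to-right double-and-add algorithm can be checked with a few lines of code. The sketch below counts doublings and additions for a given scalar and combines them with the per-operation clock cycle figures from Table 14; the function names are hypothetical, and the assumption is that the leading one bit of the scalar only initialises the accumulator.

```python
def double_and_add_ops(scalar: int) -> tuple:
    """Return (doublings, additions) of left-to-right double-and-add.

    The leading one bit initialises the accumulator; every following bit
    costs one doubling, and every following one bit costs one addition.
    """
    if scalar <= 0:
        raise ValueError("scalar must be positive")
    doublings = additions = 0
    for bit in bin(scalar)[3:]:  # skip the '0b' prefix and the leading one
        doublings += 1
        if bit == "1":
            additions += 1
    return doublings, additions


def point_mult_cycles(scalar: int, cc_doub: int, cc_add: int) -> int:
    """Estimate clock cycles from per-operation costs (e.g. taken from Table 14)."""
    doublings, additions = double_and_add_ops(scalar)
    return doublings * cc_doub + additions * cc_add


if __name__ == "__main__":
    all_ones_192 = (1 << 192) - 1
    print(double_and_add_ops(all_ones_192))             # (191, 191)
    print(point_mult_cycles(all_ones_192, 5613, 6434))  # 191 * (5613 + 6434)
```

With an all-ones scalar of |p| = 192 bits this gives 191 × (5,613 + 6,434) = 2,300,977 clock cycles. The Point Multiplication values tabulated in Table 14 coincide with (|p| − 2) × (CC point doub + CC point add), for instance 190 × 12,047 = 2,288,930 for |p| = 192, so the table appears to count one doubling and addition pair fewer than this bound.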
7. Conclusion and Future Work
A comprehensive, adaptable hardware structure for efficient
prime finite field and binary finite field arithmetic has been
proposed. It expands the capabilities of single Montgomery
Multiplier hardware designs and allows cryptographic
calculations for a large range of different algorithms to be
carried out, all based on the same arithmetic unit operations
with arbitrary parameters. The approach taken by the
proposed core is to combine standard modulo addition /
subtraction support with the capability of performing
Montgomery Multiplications, full Montgomery Exponentiations,
and the calculation of the Montgomery Parameters r and r^2 for
arbitrary moduli, bringing together all arithmetic operations
required for carrying out a wide range of cryptographic
algorithms used today. By breaking these algorithms down,
individual operation lists have been derived for the
arithmetic unit, rendering extra precomputations in software
unnecessary.
The given values of possible hardware footprint and
power consumption for specific core variations allow choos-
ing the proper configuration for a specific implementation.
The reference implementation showed that with an internal
RAM of merely 3.5 kB the core is capable of performing
complete prime field and binary field EC operations for
various precision widths of standardised curves. Furthermore,
the same core configuration is capable of performing
(CRT-accelerated) RSA operations for typical precision widths
required today, (safe) prime testing/generation, and
Diffie-Hellman key exchange operations up to 4096 bit precision
widths. The design should be optimized further in terms of
power consumption.
However, the way some core operations are implemented,
such as the Montgomery Multiplication and especially
the Montgomery Exponentiation operation, necessitates
additional security considerations, since the calculation
times depend on the structure of the processed operands. This
makes the design prone to side-channel attacks if
security-sensitive information, such as private keys, is processed.
Not all operations are critical and need to be secured, however;
the calculation of the Montgomery Parameters is one example.
Therefore, at the time of writing, the core is being enhanced
to provide a secure calculation bit within the command
input word, which, if set, instructs the core to perform the
specified arithmetic operation in a time-invariant fashion.
In addition, special care has to be taken when defining core
operation lists, for instance, for performing elliptic curve
Point Multiplication operations. Descriptions that execute in
a fixed amount of time, e.g., the Montgomery ladder [25], and
thereby mitigate the risk of timing and power analysis attacks
must be chosen.
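To illustrate the idea behind the Montgomery ladder, the following sketch applies it to modular exponentiation: every exponent bit triggers exactly one multiplication and one squaring, regardless of its value. This is a minimal sketch in ordinary integer arithmetic, not the core's operation list; the function name `ladder_pow` and the explicit `bits` parameter are hypothetical. Python integers and the remaining branch are not constant-time, so the code only shows the fixed operation sequence, not a hardened implementation.

```python
def ladder_pow(base: int, exponent: int, modulus: int, bits: int) -> int:
    """Compute base**exponent mod modulus with the Montgomery powering ladder.

    The loop always runs over `bits` exponent bits (leading zeros included)
    and performs one multiplication and one squaring per bit. The invariant
    r1 == r0 * base (mod modulus) holds after every step.
    """
    if exponent < 0 or exponent >> bits:
        raise ValueError("exponent does not fit into the given bit length")
    r0, r1 = 1 % modulus, base % modulus
    for i in reversed(range(bits)):
        if (exponent >> i) & 1:
            r0, r1 = (r0 * r1) % modulus, (r1 * r1) % modulus
        else:
            r0, r1 = (r0 * r0) % modulus, (r0 * r1) % modulus
    return r0


if __name__ == "__main__":
    n = 1000003
    assert ladder_pow(5, 0x10001, n, 32) == pow(5, 0x10001, n)
    print(ladder_pow(5, 0x10001, n, 32))
```

The same two-register structure carries over to elliptic curve Point Multiplication, where the squaring becomes a Point Doubling and the multiplication a Point Addition.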
Conflicts of Interest
The authors declare that there are no conflicts of interest
regarding the publication of this paper.
References
[1] D. Minoli, K. Sohraby, and J. Kouns, “IoT security (IoTSec)
considerations, requirements,” in Proceedings of 14th IEEE Annual
Consumer Communications Networking Conference (CCNC’17),
pp. 1006–1007, 2017.
[2] G.-L. Guo, Q. Qian, and R. Zhang, “Different implementations
of AES cryptographic algorithm,” in Proceedings of the 2015
IEEE 17th International Conference on High Performance Com-
puting and Communications, 2015 IEEE 7th International Sym-
posium on Cyberspace Safety and Security, and 2015 IEEE 12th
International Conference on Embedded Software and Systems,
pp. 1848–1853, 2015.
[3] M. Bafandehkar, S. M. Yasin, R. Mahmod, and Z. M. Hanapi,
“Comparison of ECC and RSA algorithm in resource con-
strained devices,” in Proceedings of the 2013 3rd International
Conference on IT Convergence and Security (ICITCS’13), pp. 1–3,
2013.
[4] A. Rupani and G. Sujediya, “A Review of FPGA implementa-
tion of Internet of Things,” International Journal of Innovative
Research in Computer and Communication Engineering, vol. 4,
no. 9, 2016.
[5] B. Halak, S. S. Waizi, and A. Islam, “A survey of hardware
implementations of elliptic curve cryptographic systems,” Cryptology
ePrint Archive 2016/712, 2016.
[6] N. Nedjah and L. de Macedo Mourelle, “A review of modular
multiplication methods and respective hardware implementa-
tions,” Informatica, vol. 30, no. 1, pp. 111–129, 2006.
[7] P. L. Montgomery, “Modular multiplication without trial
division,” Mathematics of Computation, vol. 44, no. 170, pp. 519–521,
1985.
[8] H. Kaur and C. Madhu, “Montgomery multiplication methods
- A review,” Journal of Application or Innovation in Engineering
& Management, IJAIEM, vol. 2, no. 2, pp. 229–235, 2013.
[9] A. F. Tenca and Ç. K. Koç, “A Scalable Architecture for
Montgomery Multiplication,” in Cryptographic Hardware and Embedded
Systems, vol. 1717 of Lecture Notes in Computer Science, pp.
94–108, Springer Berlin Heidelberg, Berlin, Heidelberg, 1999.
[10] E. Savas, A. F. Tenca, and Ç. K. Koç, “A Scalable and Unified
Multiplier Architecture for Finite Fields GF(p) and GF(2^m),”
in Proceedings of the Second International Workshop on
Cryptographic Hardware and Embedded Systems (CHES 2000), pp.
277–292.
[11] M. Morales-Sandoval and A. D. Perez, “Novel algorithms and
hardware architectures for Montgomery Multiplication over
GF(p),” Cryptology ePrint Archive 2015/696, 2015.
[12] R. Cramer, Public Key Cryptography – PKC 2008, vol. 4939,
Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
[13] A. Mrabet, “A Systolic Hardware Architectures of Montgomery
Modular Multiplication for Public Key Cryptosystems,” Cryptology
ePrint Archive, Report 2016/487, 2016.
[14] Z. Liu et al., “A tiny RSA coprocessor based on optimized sys-
tolic Montgomery architecture,” in Proceedings of the Interna-
tional Conference on Security and Cryptography (SECRYPT’11),
2011.
[15] G. D. Sutter, J.-P. Deschamps, and J. L. Imana, “Modular mul-
tiplication and exponentiation architectures for fast RSA cryp-
tosystem based on digit serial computation,” IEEE Transactions
on Industrial Electronics, vol. 58, no. 7, pp. 3101–3109, 2011.
[16] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtain-
ing digital signatures and public-key cryptosystems,” Commu-
nications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.
[17] M. O. Rabin, “Probabilistic algorithm for testing primality,”
Journal of Number Theory, vol. 12, no. 1, pp. 128–138, 1980.
[18] J. von zur Gathen and I. E. Shparlinski, “Generating safe
primes,” Journal of Mathematical Cryptology, vol. 7, no. 4, pp.
333–365, 2013.
[19] W. Diffie and M. E. Hellman, “New Directions in
Cryptography,” IEEE Transactions on Information Theory, vol.
22, no. 6, pp. 644–654, 1976.
[20] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of
Computation, vol. 48, no. 177, pp. 203–209, 1987.
[21] BSI - Technical Guideline, “Cryptographic Mechanisms:
Recommendations and Key Lengths,” BSI TR-02102-1, 2017.
[22] D. Wulansari, M. A. Muslim, and E. Sugiharti, “Implementation
of RSA algorithm with chinese remainder theorem for modulus
n 1024 bit and 4096 bit,” International Journal of Computer
Science and Security, vol. 10, no. 5, pp. 186–194, 2016.
[23] V. S. Miller, “Use of elliptic curves in cryptography,” in Pro-
ceedings of the Conference on the Theory and Application of
Cryptographic Techniques, CRYPTO 1985, pp. 417–426, 1985.
[24] B. Zhou, M. Egele, and A. Joshi, “High-performance low-
energy implementation of cryptographic algorithms on a pro-
grammable SoC for IoT devices,” in Proceedings of the 2017 IEEE
High-Performance Extreme Computing Conference (HPEC),
2017.
[25] M. Joye and S. Yen, “The Montgomery Powering Ladder,” in
Proceedings of the International Workshop on Cryptographic
Hardware and Embedded Systems (CHES’02), pp. 291–302, 2002.