Hardware and Software Multi-precision Implementations of Cryptographic Algorithms by Janjua, Muhammad A
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
2005
Recommended Citation
Janjua, Muhammad A., "Hardware and Software Multi-precision Implementations of Cryptographic Algorithms" (2005). Thesis.
Rochester Institute of Technology. Accessed from
Hardware and Software Multi-precision 
Implementations of Cryptographic Algorithms 
by 
Muhammad Ali Janjua 
A thesis submitted in partial fulfillment of the requirements for the degree of 
Master of Science in Computer Engineering 
Approved By: 
Dr. Marcin Lukowiak 
Supervised by 
Dr. Marcin Lukowiak 
Department of Computer Engineering 
Kate Gleason College of Engineering 
Rochester Institute of Technology 
Rochester, NY 
June, 2005 
Primary Advisor - R.I.T. Dept. of Computer Engineering
Dr. Stanislaw P. Radziszowski
Secondary Advisor - R.I.T. Dept. of Computer Science
Dr. Muhammad Shaaban
Secondary Advisor - R.I.T. Dept. of Computer Engineering
Thesis/ Dissertation Author Permission Statement 
Title of Thesis: Hardware and Software Multi-precision Implementations of 
Cryptographic Algorithms 
Name of Author: Muhammad Ali Janjua 
Degree: Master of Science 
Major: Computer Engineering 
College: Kate Gleason College of Engineering 
As per current Rochester Institute of Technology (RIT) guidelines for completion 
of my degree, I understand that I need to submit a copy of my Master's thesis to the RIT 
Archives. I hereby permit RIT and its agents to archive and make use of my thesis or 
dissertation in whatever forms necessary. I retain the ownership rights to the copyright of 
the thesis or dissertation and also retain the rights to use all or part of my thesis in my 
future work. 
Author signature:                Date:
Acknowledgement
I would like to offer deep and sincere thanks to God the Gracious and Merciful,
who blessed me with the capabilities of heart and mind which led me to successfully
complete this thesis.
I am indebted to my family members for giving me courage and confidence with all
their hopes and prayers for me.
Now, as I submit my thesis, I express my heartiest gratitude to all those who
helped me in completing it.
Working on this thesis has been an experience and an education in its own way. For
the first time in my entire educational career I was able to work thoroughly within a
challenging environment, which made me learn more from an industrial point of view.
My deepest gratitude is for Dr. Marcin Lukowiak, my thesis advisor, who was a
guiding light for me. I found his expertise spread equally over theoretical and practical
aspects. In particular, he guided me in overcoming the problems that occurred from time to time
during the course of this thesis.
I would like to thank the members of my thesis committee, Dr. Stanislaw Pawel
Radziszowski and Dr. Muhammad Shaaban, for taking time from their busy schedules
to advise me on related questions, review my work, and prepare me for
the final defense.
Finally, I would like to thank all of the professors with whom I took various
courses, which helped me to think creatively and dynamically and contributed to the success
of my thesis. I am thankful to my supervisor during my internship, Mr. Jim O'Connor, for
giving me the confidence to learn SystemC, which I used in my thesis.
Abstract
The software implementations of cryptographic algorithms are considered to be
very slow when they require multi-precision arithmetic operations on very
long integers. These arithmetic operations may include addition, subtraction,
multiplication, division and exponentiation.
Several research papers have been published providing different solutions to make
these operations faster. The Digital Signature Algorithm (DSA) is a cryptographic application
that requires multi-precision arithmetic operations. These arithmetic operations are
mostly based upon modular multiplication and exponentiation on integers of the size of
1024 bits. The use of such numbers is an essential part of providing high security against
cryptanalytic attacks on the authenticated messages. When these operations are
implemented in software, performance in terms of speed becomes very low. The major
focus of the thesis is the study of various arithmetic operations for public key
cryptography and the selection of fast multi-precision arithmetic algorithms for hardware
implementation. The selected algorithms are implemented in hardware and software for
performance comparison, and they are used to implement the Digital Signature Algorithm for
performance analysis.
Table of Contents
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
Glossary
Chapter 1. Introduction
1.1 Scope of Research
1.2 Organization of Thesis
Chapter 2. Cryptography, an introduction
2.1 Cryptography
2.1.1 Digital Signatures
2.1.1.1 RSA Digital Signature
2.1.1.2 ElGamal Digital Signature Scheme
2.1.1.3 Digital Signature Algorithm
2.2 Summary
Chapter 3. Multi-precision arithmetic in cryptography
3.1 Congruence
3.2 Greatest Common Divisor (GCD)
3.3 Euclidean Algorithm
3.4 Modular Exponentiation
3.4.1 Right-to-left binary exponentiation
3.4.2 Left-to-right binary exponentiation
3.5 Common Algorithms used in Cryptography
3.5.1 Extended Euclidean Algorithm
3.5.2 Primality Testing
3.6 Summary
Chapter 4. Hardware implementations of the multi-precision modular arithmetic methods
4.1 Background Work
4.2 Classical Modular Reduction
4.3 Montgomery Modular Reduction without trial division
4.4 Montgomery Modular Multiplication without trial division [7]
4.4.1 Hardware implementation of Montgomery Modular multiplication
4.4.1.1 Design with two adders
4.4.1.1.1 Simulation results
4.4.1.1.2 Synthesis results
4.4.1.2 Design with two adders and a multiplexer
4.4.1.2.1 Simulation results
4.4.1.2.2 Synthesis results
4.5 Montgomery Modular Exponentiation
4.5.1 Left-to-Right Montgomery modular exponentiation algorithm
4.5.2 Right-to-Left Montgomery modular exponentiation algorithm
4.5.2.1 Hardware implementation of Right-to-Left Montgomery modular exponentiation algorithm
4.5.2.1.1 Simulation of Right-to-Left Montgomery modular exponentiation
4.5.2.1.2 Synthesis of Right-to-Left Montgomery modular exponentiation
4.5 Summary
Chapter 5. Hardware and Software implementation of Digital Signature Algorithm using Montgomery Modular methods
5.1 Software Implementation of Digital Signature Algorithm
5.1.1 Results and Analysis of Software Implementation
5.2 Hardware Implementation of Digital Signature Standard
5.2.1 Hardware implementation of DSA-Signature Operation
5.2.1.1 Simulation results for DSA-Signature block
5.2.1.2 Synthesis results for DSA-Signature block
5.2.1.3 Comparison between 1024 bit hardware and its equivalent software design
5.2.2 Hardware implementation of DSA-Verification Operation
5.2.2.1 Simulation results for DSA-Verification block
5.2.2.2 Synthesis results for DSA-Verification block
5.2.2.3 Comparison between 1024 bit hardware and its equivalent software design
5.3 Summary
Chapter 6. Conclusions and future work
References
List of Figures
Figure 2.1: Public Key Cryptosystem
Figure 2.2: Data flow diagram of Digital Signature Algorithm using SHA-1
Figure 2.3: Block diagram Digital Signature Algorithm used for the thesis research
Figure 4.1: Top level block diagram of Montgomery modular multiplier
Figure 4.2: Port level block diagram of Montgomery modular multiplier
Figure 4.3: Block level diagram of Montgomery Modular Multiplier using two adders and two multipliers
Figure 4.4: Finite state machine for Montgomery multiplier with two adders design
Figure 4.5: Waveform simulations for 4-bit Montgomery modular multiplier design with two adders. Design enable and data load view
Figure 4.6: Waveform simulations for 4-bit Montgomery modular multiplier using design with two adders, output = $ABR^{-1} \pmod{M}$
Figure 4.7: Waveform simulations for 4-bit Montgomery modular multiplier using design with two adders, output = AB(R2) (mod M)
Figure 4.8: Block diagram of Montgomery modular multiplier using multiplexer, and two adders
Figure 4.9: Input and output ports of the Montgomery Modular exponentiation design shown in Algorithm 4.4
Figure 4.10: Block diagram of right-to-left Montgomery Modular Exponentiation Algorithm
Figure 4.11: Block diagram of one ADDMUX block used in right-to-left Montgomery modular exponentiation algorithm
Figure 4.12: Data flow level block diagram of right-to-left Montgomery Modular Exponentiation
Figure 4.13: Finite State Machine of right-to-left Montgomery Modular Exponentiation Algorithm
Figure 4.14: Waveform simulations showing the data inputs of p, e, m and c with output generated
Figure 4.15: Waveform simulations showing the behavior of the registers used in the design. This simulation is the first half of the total simulation
Figure 4.16: Waveform simulations showing the behavior of the registers used in the design. This simulation is the second half of the total simulation
Figure 5.1: Data flow diagram in the hardware implementation of DSA signature block using data produced by the software implemented block
Figure 5.2: Block diagram of DSA Signature Block using two Montgomery modular exponentiation blocks
Figure 5.3: Port-level detail of DSA signature block
Figure 5.4: Shift register to load data sets in shift left mode
Figure 5.5: Shift register to load data sets in shift left mode
Figure 5.6: Finite State Machine of DSA Signature Module
Figure 5.7: Waveform simulation for 12 bit DSA signature design showing the load operation completed in 6 clock cycles
Figure 5.8: Output generated at 1405 ns for 12 bit DSA signature design using 2 ns clock. Waveform shows the final output operation
Figure 5.9: Data flow diagram of hardware implementation of DSA verification block
Figure 5.10: Block diagram of DSA Verification Block using two Montgomery modular exponentiation blocks
Figure 5.11: Port-level detail of DSA verification block
Figure 5.12: Finite state machine for DSA verification block
Figure 5.13: Waveform simulations for the load operation of DSA verification block
Figure 5.14: Waveform simulations for the final output of 12 bit DSA verification block
List of Tables
Table 4.1: Comparison of FPGA resources used with minimum clock
Table 4.2: Comparison of simulation time to complete the modular multiplication operation in hardware and software
Table 4.3: Comparison of FPGA resources used for Montgomery modular multiplier with multiplexer
Table 4.4: Control signals for shift registers generated by finite state machine
Table 4.5: Control signals generated by FSM for registers in design in LOAD with red_montld sub-state
Table 4.6: Control signals generated by FSM for registers in REDUCEMONT state
Table 4.7: Control signals generated by FSM for registers in LOAD with squmultld state
Table 4.8: Control signals generated by FSM for registers in SQU_MULT state
Table 4.9: Control signals for registers in LOAD with finaloutld state
Table 4.10: Control signals generated by FSM for registers in FINALOUT state
Table 4.11: Results taken from the synthesis of different sizes of the design of Montgomery modular exponentiation algorithm
Table 5.1: Values generated at the output of the software implementation of Digital Signature Algorithm
Table 5.2: Total simulation time for the software blocks of DSA signature and verification operations. The blocks considered for timing analysis were the same as modeled in VHDL for synthesis
Table 5.3: Data input for 12 and 32 bit hardware block of DSA Signature
Table 5.4: Data produced from the DSA Signature block for DSA verification block
Table 5.5: Data input produced by software for 1024 bit hardware block of DSA Signature
Table 5.6: Data produced from the DSA Signature block for DSA verification block
Table 5.7: Synthesis results taken for 12, 32 and 1024 bit DSA Signature designs
Table 5.8: Comparison between hardware and software implementation in terms of speed
Table 5.9: Inputs to DSA verification block
Table 5.10: Outputs from DSA verification block except r
Table 5.11: Input values for DSA verification block
Table 5.12: Output values from DSA verification block except r, which is given here for comparison
Table 5.13: Synthesis results taken for 12, 32 and 1024 bit DSA verification blocks
Table 5.14: Comparison between hardware and software implementation in terms of speed
Glossary
DSS: Digital Signature Standard.
DSA: Digital Signature Algorithm.
RSA: A public key cryptosystem named after its inventors, Rivest, Shamir, and
Adleman.
SHA-1: 160 bit secure hashing standard.
GCD: Greatest Common Divisor.
ASIC: Application Specific Integrated Circuit.
VHDL: Very high speed integrated circuit Hardware Description Language.
Key: A number or equivalent representation which is used to encrypt or decrypt
information.
Primality Testing: Any algorithm used to verify whether a number is prime or not.
CPU: Central Processing Unit.
FSM: Finite State Machine. It is a control block that controls the hardware data
flow logic.
Synthesis: Conversion of code written in VHDL, which describes the actual behavior of hardware,
into a gate level model. The gate level model represents logical hardware components, which
can be physically implemented after the place and route operation.
Place and Route: In the place and route operation, the components of the gate level model generated by
the synthesis operation are connected together using interconnects or wires to form a
hardware design. This operation also generates an SDF file, which contains the actual timing
detail of the gate level model.
SDF: Standard Delay Format.
FPGA: Field Programmable Gate Array. It is used to implement a design
which has been processed through the place and route operation. After the place and route
operation, a programming file is generated to be emulated on the FPGA.
Chapter 1
Introduction
Application Specific Integrated Circuits (ASICs) have outperformed many
applications running on general purpose processors in terms of speed. This is due to the
nature of hardware implementation of these applications which gives the advantages of
hardware parallelism, pipelining, dedicated resources and short length of data transfer.
Multi-precision cryptographic applications require the use of very long numbers; for example, the
Digital Signature Algorithm requires modular exponentiation of the size of 1024 bits. As
a result, the software based application execution on a 32 bit general purpose processor
becomes very slow because of sequential data operations, longer interconnects and lack
of data parallelism. Implementation of dedicated hardware can remove these speed
bottlenecks, and as a result a combination of hardware and software provides better
performance. This combination can also be further optimized into System on Board or
System on Chip to achieve more speedup.
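To illustrate why word-by-word processing is sequential, the following Python sketch adds two 1024-bit integers the way a 32-bit processor would: one 32-bit word (limb) at a time, with each step waiting on the carry from the previous one. The function names and the little-endian limb layout are assumptions made for illustration; they are not taken from the thesis implementation.

```python
import random

WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def to_limbs(value, n_limbs):
    """Split a non-negative integer into n_limbs 32-bit words, least significant first."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(n_limbs)]


def from_limbs(limbs):
    """Reassemble an integer from little-endian 32-bit words."""
    return sum(limb << (WORD_BITS * i) for i, limb in enumerate(limbs))


def add_limbs(x, y):
    """Add two equal-length limb arrays; returns (sum_limbs, carry_out)."""
    result = []
    carry = 0
    for xi, yi in zip(x, y):
        # Each iteration needs the carry produced by the previous one.
        total = xi + yi + carry
        result.append(total & WORD_MASK)
        carry = total >> WORD_BITS
    return result, carry


if __name__ == "__main__":
    n_limbs = 1024 // WORD_BITS  # 32 words for one 1024-bit operand
    a = random.getrandbits(1024)
    b = random.getrandbits(1024)
    s, carry = add_limbs(to_limbs(a, n_limbs), to_limbs(b, n_limbs))
    assert from_limbs(s) + (carry << 1024) == a + b
    print(n_limbs, "word additions were needed for one 1024-bit addition")
```

Multiplication and exponentiation are built from many such word-level steps, which is where the software cost described above accumulates.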
Multi-precision modular arithmetic is an essential part of public key
cryptosystems. An ordinary implementation of 1024 or 2048 bit modular arithmetic in
hardware can be slow, because it requires the use of a division operation. Because of
this, fast software based multi-precision algorithms cannot be implemented directly in hardware.
Long numbers are required to provide high security against
cryptanalytic attacks on the authenticated messages.
1.1 Scope of Research
The research covered in this thesis aims to recognize the speed bottlenecks in
software based multi-precision cryptographic algorithms. In order to remove these
bottlenecks, fast multi-precision algorithms for modular multiplication and
exponentiation are analyzed and implemented in hardware. The hardware is then
used to implement the Digital Signature Algorithm (DSA) cryptosystem. The implementation
of DSA has been done in both software and hardware for performance comparison. The
hardware implementation particularly targets those portions of DSA where
multi-precision modular exponentiation is used.
1.2 Organization of Thesis
Chapter 2 provides a brief introduction to cryptography and the use of
multi-precision arithmetic.
Chapter 3 further elaborates the requirement of multi-precision arithmetic by
describing the most commonly used cryptographic algorithms.
Chapter 4 covers the selected modular arithmetic methods, their hardware
implementation and analysis.
Chapter 5 provides the implementation of the Digital Signature Algorithm
using the algorithms selected in Chapter 4. Chapter 5 also gives a hardware and software
comparison for speed.
Chapter 6 includes the conclusions and future work.
Chapter 2
Cryptography, an introduction
Information security has been a major concern for many years. Several techniques
have been developed to secure information over short and long range communication.
This concept of securing information belongs to the field of cryptology. Generally,
cryptology is the field of study concerned with the communication of
data over non-secure channels. It is further subdivided into cryptography and
cryptanalysis. Cryptography is the process of designing systems to secure information,
whereas cryptanalysis is the process of breaking those systems. In cryptographic science,
cipher text is the term used for encrypted information.
2.1 Cryptography
"The mathematical science used to secure the confidentiality and authentication
of data by replacing it with a transformed version that can be reconverted to reveal the
original data only by someone holding the proper cryptographic algorithm and key." [1]
Among ancient cryptographic systems, the first recorded use of cryptography
is by the Spartans, who (400 BC) employed a cipher device called a "scytale" to send
secret communications between military commanders. The scytale consisted of a tapered
stick around which was wrapped a piece of parchment inscribed with the message. Once
unwrapped, the parchment appeared to contain an incomprehensible set of letters;
however, when wrapped around another stick of identical size, the original text appeared.
Many examples of ancient cryptographic systems are known today. Julius
Caesar often used a simple cipher, which was later named after him, the "Caesar Cipher".
His method of encryption and decryption was based upon shifting of letters by three
spaces.
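The shift-by-three rule can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the alphabet is assumed to be the 26 uppercase Latin letters, and encryption is assumed to shift forward while decryption shifts back.

```python
import string

ALPHABET = string.ascii_uppercase


def caesar_shift(text, shift):
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)  # leave spaces and punctuation unchanged
    return "".join(out)


def caesar_encrypt(plaintext):
    """Encrypt by shifting three letters forward (assumed direction)."""
    return caesar_shift(plaintext, 3)


def caesar_decrypt(ciphertext):
    """Decrypt by shifting three letters back."""
    return caesar_shift(ciphertext, -3)


if __name__ == "__main__":
    cipher = caesar_encrypt("ATTACK AT DAWN")
    print(cipher)                  # DWWDFN DW GDZQ
    print(caesar_decrypt(cipher))  # ATTACK AT DAWN
```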
With the growing need for fast and secure communication, the old
cryptographic systems represent only a small fraction of what is used these
days. As the distance between an encrypter and a decrypter becomes longer, security
becomes more critical when transferring important information. The demand for more
security increases the complexity of systems, but for longer distances the design of
complex systems becomes difficult. The encryption and decryption of a message is
done by using a key. A key is a number or equivalent representation which is used to
encrypt or decrypt information. Modern cryptography is categorized into two major
types, asymmetric key cryptography and symmetric key cryptography.
Symmetric key cryptography is commonly termed secret key or private key
cryptography. Private key cryptosystems use the same key for both encryption and
decryption. They are mathematically less complicated than asymmetric key
cryptosystems. Security in private key cryptosystems requires secure channels of
communication. Generally, the security also depends upon how the key is exchanged
between the two parties. This exchange is usually done by using asymmetric key
methods. These cryptosystems are usually implemented for shorter distances of
information exchange. They do not require complex multi-precision
arithmetic, and thus they are not considered in this research.
Asymmetric key cryptography is also commonly termed public key
cryptography. Its concept was introduced in the 1970s. The term public key is based upon the
concept that anyone can retrieve a key either for encryption or for decryption, but not for
both. There are thus two keys: one is public and one is secret.
One use of such cryptosystems is long distance information exchange. Faster
cryptanalytic attacks threaten the security of these systems, so continual
improvement is required as faster computing systems become available.
Consider the basic structure of a public key cryptosystem, which is divided into an
encrypter E and a decrypter D. Suppose E wants to send an encrypted message to D,
which D then decrypts, but they are located very far from each other. Because of the
distance, they cannot meet to agree upon a key.
The concept of a public key arises here: the decrypter sends or broadcasts a public
key, and one or more encrypters use it to encrypt their messages to form cipher texts. The
cipher texts then go back to the decrypter, who already has the inverse of the public
key, which he uses to decrypt them. The inverse of the public key is kept secret. The
following diagram describes this concept,
Figure 2.1: Public Key Cryptosystem (diagram showing the encrypters E, the decrypter D, the cipher text passed between them, and an attacker)
In Figure 2.1 an attacker is shown in addition to the n encrypters and the
decrypter. The attacker can have several goals; the common ones are
a) To read the message.
b) To retrieve the secret key, and thus read all publicly encrypted messages.
c) To corrupt an original cipher text into another cipher text so that the decrypter gets a
wrong cipher text.
d) To deceive the decrypter by pretending to be a valid encrypter.
There are four possible types of attacks that attackers can perform on the cipher text,
a) Cipher text only attack: The attacker is only able to copy the cipher text, and
can then try different ways to decrypt it.
b) Known plaintext attack: This is a more destructive attack than the cipher text only
attack. The attacker gets a copy of the cipher text as well as the corresponding original
text, from which he can deduce the key. Once he has the secret key, and as long as the
decrypter does not change it, the attacker is able to read all
future messages.
c) Chosen plaintext attack: Using this method, if the attacker has access to the
actual encryption machine, he encrypts many chosen plain texts in order to deduce the
secret key.
d) Chosen cipher text attack: In this method the attacker gains access to the
decrypter machine and tries to decrypt many symbols in order to deduce the key.
In order to secure information, many cryptosystems have been developed.
The Digital Signature Algorithm is a cryptosystem which is used to secure document
authentication. The next section describes digital signature algorithms in detail.
2.1.1 Digital Signatures
A signature represents the authenticity of a document through its association with the
signer. Signatures have attained much importance in digital media,
especially when the distance between the message signer and the signature verifier increases.
When authentication of a document is considered, an intruder may try to steal
the authentication of the document by copying the original signature. The same
situation exists electronically, and thus the signed document needs to be encrypted and then
verified.
Digital signature algorithms are used for signing documents and verifying the signatures.
There are two basic techniques usually considered for digital signatures. One is RSA
signatures and the other is the ElGamal signature scheme. Digital signatures are part of
public key cryptography.
2.1.1.1 RSA Digital Signature
The RSA signature [3] uses the same concept as the RSA public key cryptosystem [3].
First consider two persons, Alice and Bob, where Alice is the signer and Bob is the
verifier.
Bob sends a document m to Alice to sign. The algorithm proceeds as follows,
1. Alice finds two primes p and q and then computes $n = pq$. She next chooses $e_A$
such that $1 < e_A < \phi(n)$ with $\gcd(e_A, \phi(n)) = 1$, and calculates $d_A$ such that
$e_A d_A \equiv 1 \pmod{\phi(n)}$, where $\phi(n) = (p-1)(q-1)$. Alice publishes $(e_A, n)$ and keeps
$d_A$, p and q private.
2. Alice's signature is
$y \equiv m^{d_A} \pmod{n}$ (2.1)
3. The pair (m, y) is then made public as a signed message.
Bob downloads the pair (m, y) and the public key $(e_A, n)$. Bob verifies the signature
by using the following step,
1. Calculate $z \equiv y^{e_A} \pmod{n}$. If z = m, then Bob accepts the signature as valid;
otherwise the signature is not valid.
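The signing and verification steps above can be sketched in Python with toy parameters. The primes p = 61 and q = 53, the exponent 17 and the message value 65 are illustrative assumptions chosen so the numbers stay small, and the function names are hypothetical; a real deployment uses primes that are hundreds of digits long.

```python
from math import gcd


def rsa_keygen(p, q, e):
    """Return (n, e, d) for primes p, q and a public exponent e coprime to phi(n)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if not (1 < e < phi and gcd(e, phi) == 1):
        raise ValueError("e must satisfy 1 < e < phi(n) and gcd(e, phi(n)) = 1")
    d = pow(e, -1, phi)  # modular inverse of e, requires Python 3.8+
    return n, e, d


def rsa_sign(m, d, n):
    """Alice's signature y = m^d mod n, as in equation (2.1)."""
    return pow(m, d, n)


def rsa_verify(m, y, e, n):
    """Bob computes z = y^e mod n and accepts only if z equals m."""
    return pow(y, e, n) == m


if __name__ == "__main__":
    n, e, d = rsa_keygen(61, 53, 17)  # toy primes, for illustration only
    m = 65                            # stand-in for the document, with m < n
    y = rsa_sign(m, d, n)
    print(rsa_verify(m, y, e, n))      # True
    print(rsa_verify(m + 1, y, e, n))  # False: the signature does not match
```

Both operations reduce to a single modular exponentiation, which is the operation the later chapters move into hardware.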
2.1.1.2 ElGamal Digital Signature Scheme
The ElGamal signature scheme [3] is based upon ElGamal encryption. A difference
from RSA is that with the ElGamal method, many different signatures can be produced
that are valid for a given message. Based upon the ElGamal method, the National
Institute of Standards and Technology (NIST) proposed the Digital Signature Standard
[17] in 1991 and adopted it as a standard in 1994.
2.1.1.3 Digital Signature Algorithm
The Digital Signature Algorithm (DSA) [3], based upon the Digital Signature Standard,
requires a hashed message SHA(m), which is 160 bits long. This hashed message is then
signed and verified in this algorithm. Before explaining the algorithm in detail, first
consider the hash algorithm SHA, which has been used in the design. The current standard
hash algorithm is the Secure Hash Algorithm [7], named SHA-1. It was
developed by NIST along with the NSA for use with the Digital Signature Standard
(DSS) and is specified within the Secure Hash Standard (SHS) of the National Institute of
Standards and Technology. SHA-1 is a cryptographic message digest algorithm which
takes a plain text message m of fewer than $2^{64}$ bits and converts it into a 160 bit hashed message
SHA(m). This 160 bit hashed message SHA(m) is called a message digest. It is
designed to have the following properties: it is computationally infeasible to find a
message which corresponds to a given message digest, or to find two different messages
which produce the same message digest [8]. That is, given the hash of a document $m_1$, it is
difficult to find a document $m_2$ which has the same hash. The algorithm of SHA-1 is
given in [7], so it will not be explained here. For the thesis
research, the implementation of SHA-1 was done using multi-precision C++. The generated
hashed message SHA(m) is then used in DSA. Figure 2.2 describes the use of DSA along
with SHA-1 for both the signing and verification procedures.
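As a small illustration of the 160 bit digest, the sketch below uses Python's standard `hashlib` module as a stand-in for the multi-precision C++ implementation used in the thesis. The helper name `sha1_digest_as_int` is hypothetical; it converts the digest to an integer so that it could be fed into modular arithmetic such as DSA signing.

```python
import hashlib


def sha1_digest_as_int(message: bytes) -> int:
    """Return the 160-bit SHA-1 digest of `message` as a non-negative integer."""
    return int.from_bytes(hashlib.sha1(message).digest(), "big")


if __name__ == "__main__":
    digest = hashlib.sha1(b"abc")
    print(digest.hexdigest())  # a9993e364706816aba3e25717850c26c9cd0d89d
    print(digest.digest_size * 8, "bits")  # 160 bits
    h = sha1_digest_as_int(b"abc")
    print(h.bit_length() <= 160)  # True
```

Changing a single character of the input produces an unrelated digest, which is the property the signature scheme relies on.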
Figure 2.2: Data flow diagram of Digital Signature Algorithm using SHA-1 (panels: Signature Generation, in which the message passes through the Secure Hash Algorithm to give a message digest that the DSA Sign Operation signs with the private key; Signature Verification, in which the received message passes through the Secure Hash Algorithm and the DSA Verify Operation uses the message digest, the digital signature and the public key to report whether the signature is verified or verification failed)
In Figure 2.2, a hash function is used in the signature generation process to obtain
a value called a message digest. The message digest is then used by DSA to generate the
digital signature. The digital signature is sent to the verifier as a signed message
together with the public key. The verifier verifies the signature by using the signer's public key. The
same hash function must also be used in the verification process. For this thesis research,
the hash function is only used at the beginning of the signing operation, because the purpose
of the research is to explain the difference between hardware and software
implementations of multi-precision arithmetic algorithms. DSA is used to implement these
multi-precision algorithms for performance comparisons between hardware and software.
A block diagram of a modified version of DSA is shown in Figure 2.3. A common
SHA-1 block is used by both the signature generation and signature verification blocks. The
signature generation block generates a public key and a signed message for the
verification process. In the verification process, the signed message is verified using the
hashed message.
Figure 2.3: Block diagram Digital Signature Algorithm used for the thesis research (the message passes through the SHA block to produce the message digest).
Consider Alice as a Signer and Bob as a Verifier. Bob has a message digest
SHA(m) which is 160 bits long, and he wants Alice to sign the message. Alice signs the
message, and sends it back to Bob along with a public key. Bob uses the public key to
verify the authenticity of the signature. Consider the following algorithm as mentioned in
[3], page 190.
Algorithm 2.1: DSA
Alice creates the public key as follows,
1. Alice finds a prime q that is 160 bits long and chooses a prime p that satisfies $q \mid p-1$,
i.e. $p = qx + 1$ is also prime, where x is an even factor of $p-1$. The discrete log
problem should be hard for this choice of p.
2. Let g be a primitive root mod p and let
$\alpha \equiv g^{(p-1)/q} \pmod{p}$ (2.2)
Then $\alpha^q \equiv 1 \pmod{p}$ must be satisfied.
3. Alice chooses a secret a such that $1 < a < q-1$ and calculates
$\beta \equiv \alpha^a \pmod{p}$ (2.3)
4. Alice publishes $(p, q, \alpha, \beta)$ as the public key, and keeps a secret.
Next Alice signs the message using $\alpha$, p, q, a, and SHA(m).
1. Alice selects a random, secret integer k such that $0 < k < q-1$.
She then computes
$r = (\alpha^k \bmod p) \bmod q$ (2.4)
2. Next she computes
$s \equiv k^{-1} (\mathrm{SHA}(m) + ar) \pmod{q}$ (2.5)
3. Alice's signature is (r, s), which she sends to Bob.
On the other side, Bob receives the data including the public key and the signed message.
1. Bob first downloads Alice's public key information $(p, q, \alpha, \beta)$.
2. He computes
$u_1 \equiv s^{-1}\,\mathrm{SHA}(m) \pmod{q}$ (2.6)
and
$u_2 \equiv s^{-1} r \pmod{q}$ (2.7)
3. Next he computes
$v = (\alpha^{u_1} \beta^{u_2} \bmod p) \bmod q$ (2.8)
4. Bob accepts the signature if and only if
$v = r$. (2.9)
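The key generation, signing and verification steps of the algorithm can be traced with a short Python sketch. The parameters p = 23, q = 11, g = 5, the secret values a = 7 and k = 3, and the hash value 5 are toy assumptions chosen only so that every intermediate value can be checked by hand; they are far too small for real use, where q is 160 bits and p is 1024 bits. The function names are hypothetical, and the sketch omits the checks a full implementation would need (for example rejecting r = 0 or s = 0).

```python
def dsa_public_key(p, q, g, a):
    """Compute (alpha, beta) from a primitive root g mod p and the secret a."""
    if (p - 1) % q != 0:
        raise ValueError("q must divide p - 1")
    alpha = pow(g, (p - 1) // q, p)  # equation (2.2)
    beta = pow(alpha, a, p)          # equation (2.3)
    return alpha, beta


def dsa_sign(h, k, p, q, alpha, a):
    """Sign the hash value h with secret key a and per-message secret k."""
    r = pow(alpha, k, p) % q               # equation (2.4)
    s = (pow(k, -1, q) * (h + a * r)) % q  # equation (2.5), Python 3.8+
    return r, s


def dsa_verify(h, r, s, p, q, alpha, beta):
    """Return True when the signature (r, s) is valid for the hash value h."""
    s_inv = pow(s, -1, q)
    u1 = (s_inv * h) % q                                  # equation (2.6)
    u2 = (s_inv * r) % q                                  # equation (2.7)
    v = (pow(alpha, u1, p) * pow(beta, u2, p)) % p % q    # equation (2.8)
    return v == r                                         # equation (2.9)


if __name__ == "__main__":
    p, q, g = 23, 11, 5  # toy values: 11 divides 22 and 5 is a primitive root mod 23
    a, k, h = 7, 3, 5    # secret key, per-message secret, stand-in for SHA(m)
    alpha, beta = dsa_public_key(p, q, g, a)
    r, s = dsa_sign(h, k, p, q, alpha, a)
    print(alpha, beta, r, s)                       # 2 13 8 2
    print(dsa_verify(h, r, s, p, q, alpha, beta))  # True
```

Signing needs one modular exponentiation and verification needs two, all with a 1024 bit modulus in the real algorithm.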
2.2 Summary
After a brief review of the RSA and DSA public key cryptosystems, it is evident that
the modular operation is an important part of the security. In 1024 bit RSA and DSA, the
arithmetic operations involving exponents for modular operations require a lot of
computation cycles. For example, in the 1024 bit DSA cryptosystem, the computation of
$\beta \equiv \alpha^a \pmod{p}$ requires 1024 bit data to be used. Performing software based
multi-precision arithmetic operations on a single general purpose processor (GPP) is very slow
because of the lack of hardware parallelism and dedicated hardware resources. If dedicated
hardware is used to replace the slower software blocks, then a very efficient solution can
be presented as a hardware and software combination. Such problems and their solutions
are explained in this thesis. The ElGamal based DSA has been targeted for efficient
implementation and performance comparisons, and the hardware implementation is
shown in Chapter 5. The performance comparison has been done between the
implementations of slower modules, such as modular exponentiation, in software and
hardware. The next chapter describes the fast multi-precision cryptographic algorithms.
Chapter 3
Multi-precision arithmetic in cryptography
In modern cryptography, information is encoded and decoded in terms of
numerical values. Information is first converted into numerical form using various
conversion methods. The latest message conversion algorithms are the Secure Hash
Standard and Message Digest algorithms. They are usually based upon the principle of hashing the
messages.
Once the information is converted into its numerical form m, it is passed
through the encryption algorithm. In the encryption algorithm, certain arithmetic operations
are performed on m to produce a cipher c. When this cipher c is received on the decryption
side, the decryption algorithm applies the arithmetic operations which convert c back to m.
This whole process may require anything from simple to very complex mathematical
operations. As cryptosystems become more secure, their mathematical
complexity increases, which produces many challenges for designers who must
implement them in the most efficient way. Before going into the details of cryptographic
arithmetic, let us review the concepts of basic number theory used in it.
3.1 Congruence
In chapter 2 of "Introduction to Cryptography" [3], public key cryptosystems are
briefly described to show the mathematical implementations for encryption and
decryption. Modular arithmetic is used in these cryptosystems to obtain the results of
various functions. It is also termed congruence. It is defined as,
"Let a, b and n be integers with $n \neq 0$. We say that
$a \equiv b \pmod{n}$ (3.1)
if a - b is a multiple (positive or negative) of n." [3]. Simply, if the difference a - b is
integrally divisible by a number n (i.e. $(a - b)/n$ is an integer), then a and b are
congruent modulo n. The quantity a is sometimes called the base, and the quantity b is
sometimes called the residue or remainder.
$a \equiv b \pmod{n}$ can also be written as a = b + nk for some integer k (positive or negative).
As an example, take a = 19 and n = 8; then
$19 \equiv 3 \pmod{8}$, or 19 = 3 + 2 * 8.
In this case k = 2 and b = 3.
Another example uses a = -12 and n = 7:
$-12 \equiv 37 \pmod{7}$, or -12 = 37 + (-7 * 7).
In this case k = -7 and b = 37.
There are four propositions for congruences, as given in [3]:
Let a, b, c, n be integers with $n \neq 0$.
$a \equiv 0 \pmod{n}$ if and only if $n \mid a$.
$a \equiv a \pmod{n}$.
$a \equiv b \pmod{n}$ if and only if $b \equiv a \pmod{n}$.
If $a \equiv b$ and $b \equiv c \pmod{n}$, then $a \equiv c \pmod{n}$.
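The definition and the two numerical examples above can be checked with a few lines of Python. This is only an illustrative sketch; the helper names `is_congruent` and `multiplier` are chosen here and do not come from [3].

```python
def is_congruent(a: int, b: int, n: int) -> bool:
    """True when a - b is a multiple of n, i.e. a = b + n*k for an integer k."""
    if n == 0:
        raise ValueError("modulus n must be non-zero")
    return (a - b) % n == 0


def multiplier(a: int, b: int, n: int) -> int:
    """Return the integer k with a = b + n*k (assumes a and b are congruent mod n)."""
    return (a - b) // n


# The two examples from the text: 19 = 3 + 2*8 and -12 = 37 + (-7)*7.
k1 = multiplier(19, 3, 8)
k2 = multiplier(-12, 37, 7)
```

Here `k1` and `k2` reproduce the values k = 2 and k = -7 given in the examples.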
3.2 Greatest Common Divisor (GCD)
Considering a and b, the greatest common divisor [3] is the largest positive
integer that divides both a and b. It is also called the greatest common factor or highest
common factor. GCD is commonly used in public key cryptosystems, particularly
for finding relative primes. Relative primes are two numbers whose GCD is 1.
Consider the following examples,
GCD (13, 21) = 1, GCD (2, 8) = 2, GCD (22, 71) = 1.
The GCD of three or more numbers can be found by computing the GCD of two
numbers at a time, as,
GCD (a, b, c) = GCD (a, GCD (b, c))
There are two methods for finding the GCD:
1. Considering a and b as positive integers, the GCD can be computed by first factorizing
the numbers into primes and taking the common prime powers, i.e.
$1728 = 2^{6} \cdot 3^{3}$, $135 = 3^{3} \cdot 5$, GCD (1728, 135) = $3^{3}$ = 27
2. If the numbers are very large, factorization is very hard. The GCD can then be
calculated using the Euclidean algorithm, which is explained next.
3.3 Euclidean Algorithm
The Euclidean algorithm [3] is generally used to find the GCD of two integers. Suppose a
and b are two integers; then there exist integers k and l such that
$ka + lb = c$ (3.2)
where c is the GCD of a and b.
Algorithm 3.1: Euclidean-Algorithm (a, b)
1. $r_0 = a$; $r_1 = b$; m = 1;
2. while ($r_m \neq 0$) do
2.1 $q_m = \lfloor r_{m-1} / r_m \rfloor$;
2.2 $r_{m+1} = r_{m-1} - q_m r_m$;
2.3 m = m + 1;
3. m = m - 1;
// $r_m$ = GCD (a, b)
This algorithm only gives the GCD of two integers. In order to find the multiplicative
inverses of a and b, another version, known as the extended Euclidean algorithm, can be used.
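The remainder sequence of Algorithm 3.1 can be sketched in Python as below. It is a minimal illustration that keeps only the last two remainders instead of the full indexed sequence; the function names are chosen here for the example.

```python
def euclid_gcd(a: int, b: int) -> int:
    """GCD of two non-negative integers using the remainder sequence of Algorithm 3.1."""
    r_prev, r_cur = a, b                 # r_{m-1} and r_m
    while r_cur != 0:
        q = r_prev // r_cur              # q_m = floor(r_{m-1} / r_m)
        r_prev, r_cur = r_cur, r_prev - q * r_cur
    return r_prev                        # last non-zero remainder


def gcd3(a: int, b: int, c: int) -> int:
    """GCD of three numbers, taken two at a time as GCD(a, GCD(b, c))."""
    return euclid_gcd(a, euclid_gcd(b, c))
```

For the examples of section 3.2, `euclid_gcd(13, 21)`, `euclid_gcd(2, 8)` and `euclid_gcd(22, 71)` can be compared against the values listed there.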
3.4 Modular Exponentiation
Modular exponentiation [3] is a very important part of the security of most
public key cryptosystems. Modern cryptosystems use very long numbers, which causes
arithmetic operations like modular exponentiation to run very slowly.
Consider the example $2^{1027} \pmod{637} = 37$. If it is computed by repeatedly
multiplying by 2 and reducing by 637, it takes 1027 iterations, which is a very slow
operation. On the other hand, successively squaring both sides and reducing each result
by the modulus greatly reduces the number of iterations. Considering the above example,
start with $2^{2} \equiv 4 \pmod{637}$ and repeatedly square both sides to obtain the
following congruences:
$2^{4} \equiv 4^{2} \equiv 16$
$2^{8} \equiv 16^{2} \equiv 256$
$2^{16} \equiv 256^{2} \equiv 562$
$2^{32} \equiv 529$
$2^{64} \equiv 198$
$2^{128} \equiv 347$
$2^{256} \equiv 16$
$2^{512} \equiv 256$
$2^{1024} \equiv 562$
Since the binary representation of 1027 is 10000000011,
$2^{1027} \equiv 562 \cdot 4 \cdot 2 \equiv 37 \pmod{637}$, where 562, 4, and 2 are
selected according to the logic 1's in the binary representation of 1027. Efficient modular
exponentiation algorithms are based upon the right-to-left and left-to-right binary
exponentiation algorithms.
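The worked example can be reproduced with a short Python sketch that first builds the table of repeated squares and then multiplies together the entries selected by the 1 bits of the exponent. The function names are illustrative and not taken from [3].

```python
def squaring_table(base: int, modulus: int, max_exponent: int) -> dict:
    """Map each power of two 2^j <= max_exponent to base^(2^j) mod modulus."""
    table = {}
    power, value = 1, base % modulus
    while power <= max_exponent:
        table[power] = value
        power, value = power * 2, (value * value) % modulus
    return table


def combine(table: dict, exponent: int, modulus: int) -> int:
    """Multiply the table entries whose power of two appears in the exponent."""
    result = 1
    for power, value in table.items():
        if exponent & power:
            result = (result * value) % modulus
    return result


table = squaring_table(2, 637, 1027)
answer = combine(table, 1027, 637)   # uses the entries for 1024, 2 and 1
```

Only eleven table entries are needed for the exponent 1027, compared with about a thousand multiplications for the naive method.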
3.4.1 Right-to-left binary exponentiation
The inputs to the right-to-left binary exponentiation [12] are x and e, where x is the base
and e is the power.
Algorithm 3.2: Expo (x, e)
1. $A \leftarrow 1$, $S \leftarrow x$.
2. while ($e \neq 0$) do:
2.1 if e is odd, then $A \leftarrow A \cdot S$.
2.2 $e \leftarrow \lfloor e/2 \rfloor$.
2.3 if $e \neq 0$, then $S \leftarrow S \cdot S$.
3. return A.
4. End Expo (x, e).
In the above algorithm the loop runs until e = 0. At line 2.1, e is checked and, if it is
odd, $A \leftarrow A \cdot S$ is computed. Whether e is odd or even can easily be
checked using the LSB of e: if e(0) = '1', then e is odd, else it is even. At line 2.2,
e is divided by 2, which in hardware corresponds to a right shift register; each right
shift divides the number by 2. At line 2.3, $S \leftarrow S \cdot S$ is computed. The
multipliers at lines 2.1 and 2.3 can be executed in parallel if implemented in hardware,
although there is a dependency in the loop between lines 2.1 and 2.3, i.e. S is updated at
line 2.3 and then used by A at line 2.1 in the next iteration. This dependency does not
conflict with the parallelism, because the multiplier at line 2.1 needs the updated value
of S only in the next iteration, not within the same iteration.
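A software sketch of Algorithm 3.2 is given below. The explicit `modulus` argument is an assumption added here so that the routine performs modular exponentiation; the algorithm as listed leaves the reduction implicit.

```python
def expo_right_to_left(x: int, e: int, modulus: int) -> int:
    """Right-to-left binary exponentiation: returns x^e mod modulus for e >= 0."""
    a = 1 % modulus          # accumulator A
    s = x % modulus          # running square S
    while e != 0:
        if e & 1:            # LSB of e is 1, so e is odd (line 2.1)
            a = (a * s) % modulus
        e >>= 1              # divide e by 2 with a right shift (line 2.2)
        if e != 0:
            s = (s * s) % modulus   # line 2.3
    return a
```

In software the two multiplications run one after the other; the parallelism discussed above applies to a hardware implementation with two multipliers.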
3.4.2 Left-to-right binary exponentiation
The inputs to the left-to-right binary exponentiation [12] algorithm are x and e.
Algorithm 3.3: Expo (x, e)
1. $A \leftarrow 1$.
2. for i = t down to 0, loop
2.1 $A \leftarrow A \cdot A$.
2.2 if ($e_i = 1$), $A \leftarrow A \cdot x$.
3. Return A.
4. End Expo (x, e).
In left-to-right binary exponentiation, the loop runs once for each bit of e, where
$e = (e_t \ldots e_1 e_0)_2$. Compared to the right-to-left algorithm, this algorithm has
a data dependency within the same iteration, which does not allow parallelism to be
achieved. This reduces the speed of this algorithm, although the area requirements of the
hardware implementation are half those of the right-to-left algorithm. This is because
the same multiplier can be used for both $A \leftarrow A \cdot A$ and
$A \leftarrow A \cdot x$ iteratively.
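Algorithm 3.3 can be sketched in Python in the same way. As before, the `modulus` argument is an assumption added for the example, and the bit index t is taken from the bit length of e.

```python
def expo_left_to_right(x: int, e: int, modulus: int) -> int:
    """Left-to-right binary exponentiation: returns x^e mod modulus for e >= 0."""
    a = 1 % modulus
    # Scan the exponent from its most significant bit e_t down to e_0.
    for i in range(e.bit_length() - 1, -1, -1):
        a = (a * a) % modulus            # always square (line 2.1)
        if (e >> i) & 1:                 # multiply only when bit e_i is 1 (line 2.2)
            a = (a * x) % modulus
    return a
```

The multiplication in line 2.2 needs the result of the squaring in line 2.1 from the same iteration, which is the data dependency mentioned above.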
In chapter 4, these two methods are further explained and one method is then
chosen for hardware implementation.
3.5 Common Algorithms used in Cryptography
Most cryptosystems require complex computations based upon certain
algorithms. These algorithms provide fast computation and thereby help in
making cryptosystems more secure. The two algorithms used in this thesis research are
the Extended Euclidean Algorithm and a primality test based upon the Miller-Rabin theorem.
These algorithms are described in this chapter to illustrate the multi-precision modular
arithmetic used for public key cryptosystems. The purpose of presenting the modular
arithmetic is the hardware and software implementation of multi-precision arithmetic
algorithms for performance comparison. The computation time differs between the hardware
and software implementations, and this difference is the target of this thesis research.
3.5.1 Extended Euclidean Algorithm
The Extended Euclidean algorithm [3] is a version of the Euclidean algorithm. In
addition to computing the GCD, this algorithm also generates the multiplicative inverses
of a and b, but only if a and b are relatively prime numbers.
Algorithm 3.4: Extended-Euclidean-Algorithm (a, b)
// a and b are positive integers
1. $r_0 = a$; $r_1 = b$;
2. $t_0 = 0$; $t_1 = 1$;
3. $s_0 = 1$; $s_1 = 0$;
4. m = 1;
5. while ($r_m \neq 0$) do
5.1 $q_m = \lfloor r_{m-1} / r_m \rfloor$;
5.2 $r_{m+1} = r_{m-1} - q_m r_m$;
5.3 $t_{m+1} = t_{m-1} - q_m t_m$;
5.4 $s_{m+1} = s_{m-1} - q_m s_m$;
5.5 m = m + 1;
6. m = m - 1;
// $r_m$ = GCD (a, b) and $s_m a + t_m b = r_m$ [4]
Because this algorithm provides the multiplicative inverses of two given co-prime
numbers, it is widely used in many cryptographic applications. As seen in the RSA
algorithm description in chapter 2, d is computed as the multiplicative
inverse of e, which is the part of the public key used for encryption of the message.
Besides finding multiplicative inverses of numbers, it also helps in avoiding the
fractions created by the modulo division operation. The run time of this algorithm is
$O(n^2)$ for n-bit inputs.
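Algorithm 3.4 and its use for modular inverses can be sketched as follows. The sketch keeps only the two most recent values of each sequence; the names `extended_euclid` and `mod_inverse` are chosen here for illustration.

```python
def extended_euclid(a: int, b: int):
    """Return (g, s, t) with g = GCD(a, b) and s*a + t*b = g."""
    r_prev, r_cur = a, b
    s_prev, s_cur = 1, 0
    t_prev, t_cur = 0, 1
    while r_cur != 0:
        q = r_prev // r_cur
        r_prev, r_cur = r_cur, r_prev - q * r_cur
        s_prev, s_cur = s_cur, s_prev - q * s_cur
        t_prev, t_cur = t_cur, t_prev - q * t_cur
    return r_prev, s_prev, t_prev


def mod_inverse(a: int, n: int) -> int:
    """Multiplicative inverse of a modulo n; exists only when GCD(a, n) = 1."""
    g, s, _ = extended_euclid(a, n)
    if g != 1:
        raise ValueError("a and n are not relatively prime")
    return s % n             # map a possibly negative coefficient into 0..n-1
```

The final `s % n` turns the signed coefficient into an unsigned residue, which is the form needed when the inverse is used in modular arithmetic.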
3.5.2 Primality Testing
Primality testing [3] is required in order to verify whether a certain integer is prime or
composite. For an integer N, the number of primes less than or equal to N is
approximately $N / \ln N$ [4]. One property of prime numbers is that they have no factors
other than 1 and themselves. Another is that every prime number other than 2 is an odd
number, but not every odd number is a prime number. Numbers which are not primes are
called composite numbers. Prime factorization refers to expressing a number as a product
of its finite number of prime factors; for an odd prime p, for example, (p - 1) is an even
number that can be factorized in this way. When considering very long numbers, the
computation time of testing whether a number is prime is less than that of prime
factorization. Primality testing is now known to be solvable in polynomial time, after
M. Agrawal, N. Kayal, and N. Saxena provided an algorithm that tests primality in
polynomial time [5]. Compared to the polynomial time algorithms for primality, the
randomized or probabilistic algorithms are much faster and are widely used for primality
testing. These randomized algorithms are yes-biased Monte Carlo algorithms, where a "yes"
answer is always correct, but a "no" answer is probabilistic, meaning that it can be
incorrect. Here the question being decided is whether n is composite, so an answer of
"composite" is always correct while an answer of "prime" may be wrong. The most common
probabilistic algorithm used for primality testing is based on the Miller-Rabin theorem [3].
Algorithm 3.5: Miller-Rabin (n)
1. define a, k, m, $x_0$, $x_1$ as integers.
2. choose a as a random integer with $1 \le a \le n - 1$.
3. m = n - 1; k = 0;
4. while (m mod 2 = 0) do
4.1 k = k + 1;
4.2 m = m / 2;
5. end while;
6. $x_0 = a^{m} \pmod{n}$;
7. for i in 0 to k-1 loop
7.1 $x_1 = x_0^{2} \pmod{n}$;
7.2 if ($x_1 = 1$ and $x_0 \neq 1$ and $x_0 \neq n - 1$) then return composite;
7.3 $x_0 = x_1$;
8. end loop;
9. if ($x_1 \neq 1$) then return composite;
10. return prime;
End Miller-Rabin(n).
Only odd numbers are tested for primality, as mentioned above, so the input to this
algorithm is always an odd number. For a randomly chosen a, the probability that the
Miller-Rabin test fails to recognize a composite number is at most 1/4.
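Algorithm 3.5 can be sketched in Python as one round per random base, repeated several times to reduce the error probability. The wrapper `is_probable_prime`, its default of 20 rounds, and the choice of bases from 2 to n - 2 are assumptions of this sketch rather than details taken from the thesis.

```python
import random


def miller_rabin_round(n: int, a: int) -> bool:
    """One round of Algorithm 3.5 for odd n > 3 and base a.
    Returns False if n is certainly composite, True if n is probably prime."""
    m, k = n - 1, 0
    while m % 2 == 0:                # write n - 1 = 2^k * m with m odd
        k += 1
        m //= 2
    x0 = pow(a, m, n)
    x1 = x0
    for _ in range(k):
        x1 = (x0 * x0) % n
        if x1 == 1 and x0 != 1 and x0 != n - 1:
            return False             # non-trivial square root of 1 found
        x0 = x1
    return x1 == 1                   # Fermat check: a^(n-1) must be 1 mod n


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Repeat the round with random bases; the error shrinks with each round."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    return all(miller_rabin_round(n, random.randrange(2, n - 1))
               for _ in range(rounds))
```

A prime input always passes every round, while a composite input survives each independent round with probability at most 1/4.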
3.6 Summary
The algorithms mentioned in this chapter are the basis of many cryptosystems.
Modular exponentiation is an essential part of the RSA and DSA public key cryptosystems.
Chapter 4 describes the hardware implementation of modular multiplication and
exponentiation algorithms. The Miller-Rabin primality test has been used as part of this
thesis research in order to find prime numbers for DSA. The Extended Euclidean algorithm
has been used to compute the modular multiplicative inverse of a secret random number in DSA.
Chapter 4
Hardware implementations of the
multi-precision modular arithmetic methods
When considering a highly secure cryptosystem, the use of very long numbers is
always targeted. Modern public key cryptosystems like RSA and DSA normally
use numbers up to 2048 bits long. As presented in chapter 2, the RSA and DSA algorithms
require modular exponentiation. When designing for speed, hardware replaces certain
software portions, which requires selecting suitable algorithms that fit the requirements
of efficient hardware design.
Multi-precision modular exponentiation implemented in hardware gives a considerable
speedup when compared with a software implementation. The speedup is achieved through
the availability of parallelism and dedicated hardware resources, instead of a general
purpose CPU which schedules the targeted cryptographic application tasks along with other
tasks. These are the most common considerations that lead to the selection of a dedicated
hardware solution. If modular exponentiation is implemented in hardware using the
division operation, it can consume a lot of area and time resources on the hardware chip.
This division can be avoided by using an algorithm for modular reduction which is
explained next in the background work.
4.1 Background Work
a) Modular Multiplication without Trial Division [5]
In this paper Montgomery proposed a method for multiplying two integers
modulo N while avoiding division by N. The representation of the integers is called the
N-residue. This method is useful only if several multiplications are performed for a fixed
modulus N. A fixed modulus is a drawback for cryptosystems that require
more than one modulus. Avoiding division is essential for hardware
implementation, so this algorithm has gained prime importance for the hardware
implementation of multi-precision cryptography. This research has been used as the basis
of this thesis. Using the Montgomery modular multiplication technique, an algorithm for
modular exponentiation has been implemented in hardware.
b) Montgomery Modular Exponentiation on Reconfigurable Hardware [8]
In this paper, Montgomery modular multiplication was combined with a systolic
array design, which was capable of processing a variable number of bits per array cell.
The design was targeted at modular exponentiation, which was further used as a design
unit for 512 and 1024 bit RSA. The design was implemented on a Xilinx XC4000 Series
FPGA. The RSA decryption time for 512 bits in hardware was 2.37 ms, compared to
9.1 ms for the software implementation on a 150 MHz Alpha. Thus the hardware was about
3.8 times faster than the software. Also, the fastest 1024 bit
software implementation, which took 43.3 ms running on a Pentium Pro-200 based PC, was
about 4 times slower than the hardware (10.2 ms).
c) Efficient Architectures for implementing Montgomery Modular
Multiplication and RSA Modular Exponentiation on Reconfigurable Logic. [6]
This paper presents a review of some existing architectures for the
implementation of Montgomery modular multiplication and exponentiation on an FPGA.
Three different architectures of the Montgomery multiplier are described. These
architectures are implemented, and a comparison of area and speed is given. A comparison
of two modular exponentiation algorithms using Montgomery multiplication is also
described. A good selection of modular multiplier and exponentiator has been made to
implement the RSA cryptosystem. For this thesis research, the algorithms presented in this
paper have been chosen for hardware implementation. Instead of RSA, the selected
architectures are implemented for DSA.
4.2 Classical Modular Reduction [3]
Consider two positive integers, x & y, and the modulus M. The representation of
these numbers is in radix 2. The steps are,
a) Consider the operations x * y, x + y, x - y, x, and y, each of which is to be reduced modulo M.
b) Compute the remainder r of any of the arithmetic operations x * y, x + y, x - y, x, y
divided by M.
c) Return r.
In the steps above, division is used to perform the modulo reduction operation. On
the other hand, let's consider Montgomery modular reduction in the next section.
4.3 Montgomery Modular Reduction without trial division
The Montgomery modular algorithm [5] avoids the division required in common
modular multiplication, addition, subtraction, and single value reduction.
The basic algorithm is as follows,
Algorithm 4.1: Mont-reduce (x)
1. $u = ((x \bmod R) \cdot M') \bmod R$
2. $t = \frac{x + uM}{R}$
3. if $t \ge M$ then
3.1 t = t - M
4. return t
End Mont-reduce (x)
Inputs to the algorithm are x as the number to be reduced, M as the modulus, and R. R is
calculated as $R = 2^{w}$, where w is the number of bits in M. Also 0 < M < R and
GCD (R, M) = 1, so that the Extended Euclidean Algorithm can be used to compute
$R^{-1}$ and $M'$ satisfying
$R R^{-1} - M M' = 1$ (4.1)
This is equivalent to $M' = -M^{-1} \bmod R$, where $M^{-1}$ is the multiplicative inverse
of M modulo R.
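For a variable x to be reduced modulo M, the Montgomery algorithm uses
$X = x \cdot R \bmod M$, (4.2)
Consider $M' = -M^{-1} \bmod R$, and multiply both sides by x:
$M' \cdot x = -M^{-1} \cdot x \pmod{R}$, (4.3)
where $u = M' \cdot x$,
$u = -M^{-1} \cdot x \bmod R$, (4.4)
multiply both sides by M:
$u \cdot M = -(M^{-1} \cdot M) \cdot x \bmod R$, ($M^{-1} \cdot M = 1$) (4.5)
$u \cdot M = -x \bmod R$ (4.6)
$u = -M^{-1} \cdot x \bmod R$ (4.7)
$u = [(x \bmod R)(-M^{-1} \bmod R)] \bmod R$, (4.8)
where $M' = -M^{-1} \bmod R$, so that $-M^{-1} = -1 \cdot (-M')$:
$u = [(x \bmod R)(-1 \cdot (-M') \bmod R)] \bmod R$ (4.9)
$u = [(x \bmod R)(M' \bmod R)] \bmod R$. (4.10)
This proves line 1 of Mont-reduce.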
Now consider equation (4.6): $u \cdot M = -x \bmod R$.
This is equivalent to,
$u \cdot M = -x + tR$ (4.11)
for some integer t.
$tR = x + uM$ (4.12)
$t = \frac{x + uM}{R}$ (4.13)
or
$t = x R^{-1} \bmod M$ (4.14)
This proves line 2 of Mont-reduce. The multiplication of x with $R^{-1}$ and
reduction by M gives a Montgomery residue form of x, which is t in this case.
A Montgomery residue form can also be obtained by multiplying x with R and reducing by
M, giving X as in equation 4.2, where $X \neq t$. Consider both cases for the
conversion back to the original form y, i.e. y = x (mod M). If t is considered, then
y = tR (mod M); otherwise, for X, $y = X R^{-1}$ (mod M). In the algorithm, division by the
modulus M is replaced by division by R, which is implemented with a shift register that
shifts the data serially to the right in each iteration. One right shift is equivalent to
dividing by 2.
In the algorithm, instead of dividing by M, R is used, which is $2^{w}$. In an iterative
implementation this avoids the division by shifting ($x \cdot M'$ in line 1) and (x + uM in
line 2) by 1 bit to the right.
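The reduction can be sketched in Python as shown below. The sketch assumes the usual convention that the constant used in line 1 is $M' = -M^{-1} \bmod R$ and that the input satisfies 0 <= x < R * M; the helper `mont_setup` and all parameter names are chosen here for illustration. Python 3.8 or later is assumed for `pow` with a negative exponent.

```python
def mont_setup(M: int, w: int):
    """Return (R, M_prime) with R = 2^w and M_prime = -M^{-1} mod R (M must be odd)."""
    R = 1 << w
    M_prime = (-pow(M, -1, R)) % R
    return R, M_prime


def mont_reduce(x: int, M: int, R: int, M_prime: int) -> int:
    """Return x * R^{-1} mod M for 0 <= x < R*M, using no division by M."""
    u = ((x % R) * M_prime) % R     # line 1; mod R only keeps the low w bits
    t = (x + u * M) // R            # line 2; exact, so it is a right shift by w bits
    if t >= M:                      # line 3; a single conditional subtraction
        t -= M
    return t


# Toy values assumed for illustration: M = 11 has w = 4 bits, so R = 16.
R, M_PRIME = mont_setup(11, 4)
X = (7 * R) % 11                          # Montgomery form of x = 7, equation (4.2)
recovered = mont_reduce(X, 11, R, M_PRIME)
```

Because R is a power of two, both `% R` and `// R` reduce to bit selection and shifting in hardware, which is the point of the method.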
Consider two operands A and B instead of the single value x in the previous case, where
0 < A, B < M. The properties for addition and multiplication then are,
1. A + B is mapped to (A + B) R = AR + BR.
2. AB is mapped to $(A \cdot B) R = (AR)(BR) R^{-1}$.
4.4 Montgomery Modular Multiplication without trial division [6]
When considering ordinary modular multiplication, two approaches are considered,
a) Performing the modulo operation after multiplication.
b) Performing the modulo operation during multiplication.
The modular operation requires division, the remainder of which is used for further
reduction. If the first approach is considered, it will require a (w * w) bit multiplier
with a 2w bit register and a (2w * w) bit divider, where w is the number of bits in the
modulus. Using the second approach, the division can be avoided, but more addition
and subtraction units will be involved. The ordinary modular multiplication algorithm for
the computation of (A * B) mod M takes the normal multiplication method, which accumulates
digit products $A \cdot b_i$, and interleaves modular reductions to keep the result below M.
These reductions are achieved by subtracting the correct multiple of the modulus from
the intermediate result. This reduction is dependent on the most significant bits of the
operand. On the other hand, Montgomery in his algorithm for modular reduction [6]
reverses the order of treating the digits of the multiplier by using the least significant
bits of the intermediate result to perform an addition rather than a subtraction. A shift
down operation is then performed in each iteration, instead of the shift up operation of
the conventional algorithm.
$A = \sum_{i=0}^{w-1} a(i)\,2^{i}$, $B = \sum_{i=0}^{w-1} b(i)\,2^{i}$, $M = \sum_{i=0}^{w-1} m(i)\,2^{i}$
Algorithm 4.2: Mon_Pro (A, B, M) [6]
1. A = 2A; // (left shift by one bit)
2. $S_{-1} = 0$;
3. for i = 0 to n do
3.1 $q_i = (S_{i-1}) \bmod 2$; // ($q_i$ = LSB of $S_{i-1}$)
3.2 $S_i = \frac{S_{i-1} + q_i M + b_i A}{2}$
4. end for
5. return $S_n$
6. end Mon_Pro.
In the above algorithm, A, B, and M are inputs to the design. The loop runs
n+1 times, where n = w+2 and w is the number of bits in M. A is first shifted left. The
two additional clock cycles are needed in order to keep the intermediate results in bound,
because of possible carries during the addition operations. One more clock
cycle is needed to cancel the effect of 2A. In total, n+1 clock cycles are
needed to compute the final Montgomery product. This product p is obtained as,
$p = (A \cdot B)\,R^{-1} \pmod{M}$ (4.15)
$R^{-1}$ is the multiplicative inverse of R (mod M), which is automatically included during
the Montgomery multiplication operation. In order to remove $R^{-1}$, P = pR (mod M) needs
to be computed. This is also equal to,
$P = p\,R^{-1} R^{2} \pmod{M}$ (4.16)
Thus another Montgomery multiplication Mon_Pro(p, $R^{2}$, M) is computed in order to
get the final output. $R^{2}$ is greater than M, so the remainder $R^{2}$ (mod M) can be
computed first and then used as the operand.
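A bit-serial software model of Algorithm 4.2 is sketched below. It assumes $R = 2^{n}$ with n = w + 2, passes the bit width w as an explicit parameter, and returns the last intermediate value S without a final subtraction, so the result is congruent to $A B R^{-1}$ modulo M but may be slightly larger than M. These choices, and the function names, are assumptions of the sketch.

```python
def mon_pro(A: int, B: int, M: int, w: int) -> int:
    """Bit-serial Montgomery product: returns S with S = A*B*2^(-n) (mod M), n = w + 2."""
    n = w + 2
    A2 = 2 * A                       # line 1: A shifted left by one bit
    S = 0                            # line 2: S_{-1} = 0
    for i in range(n + 1):           # line 3: i = 0 .. n, i.e. n + 1 iterations
        q = S & 1                    # line 3.1: q_i is the LSB of S_{i-1}
        b = (B >> i) & 1             # bit b_i of the multiplier B
        S = (S + q * M + b * A2) >> 1    # line 3.2: the sum is even, so the shift is exact
    return S


def mod_mul_via_mon_pro(A: int, B: int, M: int, w: int) -> int:
    """A*B mod M using two Montgomery products, as in equations (4.15) and (4.16)."""
    R = 1 << (w + 2)
    p = mon_pro(A, B, M, w)                   # p = A*B*R^{-1} (mod M)
    P = mon_pro(p, (R * R) % M, M, w)         # multiply by R^2 mod M to cancel R^{-1}
    return P % M
```

Doubling A first makes every term $b_i A$ even, so the parity of the sum, and therefore $q_i$, depends only on $S_{i-1}$.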
4.4.1 Hardware implementation of Montgomery Modular Multiplication
In order to compute A * B (mod M), algorithm 4.2 has two different design
solutions for hardware implementation. The inputs to both designs are A, B and M
and the output is P, as shown in figure 4.1.
Figure 4.1: Top level block diagram of Montgomery modular multiplier (inputs A, B and M enter the Mon_Pro(A, B, M) block, which produces the output P).
Figure 4.2: Port level block diagram of Montgomery modular multiplier. (Ports shown: clock, reset, en, A [port_size:1], B [port_size:1], M [port_size:1], ld_done, op_done, p [port_size:1], mod_done.)
Figure 4.2 shows the input and output ports of the Montgomery modular multiplier. It has
three data input ports and one data output port. The other ports are ld_done and op_done,
which are used for controlling the data flow, mod_done, which is used as a flag for the
completion of the operation, and en, which enables and disables the design.
4.4.1.1 Design with two adders
The first design uses two adders, with two bit multipliers, five registers and a
control block as shown in figure 4.3. The first adder is of n+2 bit size, where n = w+2.
The second adder is of n+3 bit size. The inputs to the design are A, B and M of w bit size.
These inputs are stored in the internal registers by a data shift operation. The registers
take small packets of data and load them serially until the whole data is transferred. As
the loop in Mon_Pro(A, B, M) runs n+1 times, the registers storing these values need to be
of size n+1. Also, A is shifted left by 1 bit before starting the multiplication and
reduction operation. An intermediate register S of n+2 bit size is used to keep the
intermediate results synchronized. In the algorithm, consider line 3.2,
$S_i = S_{i-1} + q_i M + b_i A$ (4.17)
$q_i$ is taken as the LSB of S, and is used as a bit multiplicand in the $q_i M$
multiplier. The register storing B is a parallel-in serial-out shift register, which
generates the serial bits $b_i$ for the $b_i A$ multiplier. An n+2 bit adder is then used
to add the $q_i M$ and $b_i A$ multiplication results. The additional 1 bit in this adder
is required to save the last carry out. Further, the output of the n+2 bit adder is added
to $S_{i-1}$ to produce $S_i$. $S_i$ is the output of the n+3 bit adder, with the LSB of
this adder discarded in every clock cycle. Again, the additional bit in
this adder is required to save the carry out, if one is produced by the addition.
Figure 4.3: Block level diagram of Montgomery Modular Multiplier using two adders
and two multipliers. (Blocks shown: finite state machine with inputs en, ld_done, op_done, clock and reset; M, B and A registers; S(0) . M and B(0) . A multipliers; n+2 adder; n+3 adder; p output register.)
The iterations are controlled by an n+1 bit counter used in the state machine. The
state machine controls the five registers. The output register generates the output after
n+1 counts. Figure 4.2 shows the block diagram of a general Montgomery modular
multiplier. Figure 4.3 shows the detail of the data flow architecture as
connected to the FSM.
The two adders in the design shown in figure 4.3 add carry propagation delays in
every iteration. The n+2 bit adder can be eliminated by using a multiplexer, which will be
shown in the second design for the Mon_Pro(A, B, M) algorithm. Figure 4.4 shows the finite
state machine used in this design. It has four states,
Figure 4.4: Finite state machine for Montgomery multiplier with two adders design. (Transition labels shown: en = '0', ld_done = '0', ld_done = '1', counter < n + 1, counter = n + 1, op_done = '0'.)
IDLE:
In this state the FSM remains idle while en = '0', where en is the enable signal of the
design. Once en = '1', the FSM clears all the registers and makes a transition to the
LOAD state.
LOAD:
In the LOAD state, the input registers are loaded with the data coming through the input
ports. The load operation is done serially in small data sets based upon the size of the
input ports. This makes the design more scalable with respect to an FPGA's available I/O
ports. The FSM remains in this state while ld_done = '0' and keeps loading the registers.
Once ld_done = '1', it makes a transition to the MULT state. During the transition,
register A is shifted left by one bit, which is equivalent to multiplying by 2.
MULT:
In this state the Montgomery multiplication operation is performed. A counter is
used to keep track of all the intermediate additions to be done. The FSM remains in this
state while counter < n+1 and makes a transition to the RESULT state once counter = n + 1.
During this transition it generates the final output.
RESULT:
In this state, the result obtained in the MULT state is transferred out through the output
port. The output register is a shift register, which shifts the data out in small data
sets. The FSM remains in this state while op_done = '0'. Once op_done = '1', it makes a
transition to the IDLE state.
4.4.1.1.1 Simulation results
The simulations for the design in figure 4.3 are done at RTL and at post-synthesis
levels. The RTL simulation does not need any timing constraint, so a minimum clock cycle
of 2 ns is used. For post-synthesis simulations, the timing constraint is important, and
it varies with the size of the design. For the 4-bit design, the minimum clock period
available is 7 ns, which is obtained after synthesizing the design and is discussed in the
section on synthesis results. The timing constraint applies because of the actual wire and
component delays in the design. The stages of simulation covered at the post-synthesis
level are post-translate and post-place-route. In order to simulate the 4-bit
Montgomery multiplier design in figure 4.3, the data input and the output achieved are as
follows,
Input:
A = $(1001)_2$ = $(9)_{10}$
B = $(1001)_2$ = $(9)_{10}$
M = $(1011)_2$ = $(11)_{10}$
Output:
P = $(9)_{10}$
As $p = (A \cdot B)\,R^{-1} \bmod M$, the value of $R^{-1}$ can be calculated using an
unsigned version of the extended Euclidean algorithm, because a signed version can also
return a negative number, and Montgomery modular reduction algorithms only work for
unsigned numbers. In order to calculate $R^{-1}$, consider the number of bits in M, which
is w = 4. With n = w + 2, n = 6, so for $R = 2^{n}$, R = 64. Using the extended Euclidean
algorithm, which takes R and M as inputs, the output comes in the form
$R R^{-1} + M M^{-1} = E$.
For relatively prime numbers E is always 1. For R = 64 and M = 11, the output obtained
from the extended Euclidean algorithm was (5 * 64) - (29 * 11) = 1. The calculation of
$R^{-1}$ described here is for verification purposes only; $R^{-1}$ is automatically
included as part of the algorithm. So using $R^{-1}$ = 5, the output
p = 9 * 9 * 5 (mod 11) = 9.
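The verification arithmetic above can be reproduced with a few lines of Python. The sketch uses the built-in `pow` with a negative exponent (Python 3.8 or later) in place of a hand-written extended Euclidean routine; the constant and function names are chosen here for illustration.

```python
W = 4                      # number of bits in M
N = W + 2                  # n = w + 2
R = 1 << N                 # R = 2^n
M = 11

R_INV = pow(R, -1, M)      # unsigned multiplicative inverse of R modulo M
R2_MOD_M = (R * R) % M     # operand for the second Montgomery multiplication


def expected_mon_pro(a: int, b: int) -> int:
    """Reference value a * b * R^{-1} mod M that the multiplier should produce."""
    return (a * b * R_INV) % M


first_pass = expected_mon_pro(9, 9)                   # p for A = 9, B = 9
second_pass = expected_mon_pro(first_pass, R2_MOD_M)  # removes the factor R^{-1}
```

The second value equals 9 * 9 (mod 11), which is what the final output of the two-pass multiplication should be.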
Figure 4.5: Waveform simulations for 4-bit Montgomery modular multiplier design with
two adders. Design enable and data load view.
In the waveform simulation of figure 4.5, the ld_done signal goes high at 11 ns,
when the internal registers are loaded with the following input values.
rega = 9
regb = 9
regm = 11
At this moment the FSM is in the LOAD state.
Figure 4.6: Waveform simulations for 4-bit Montgomery modular multiplier using
design with two adders, output = $ABR^{-1} \pmod{M}$.
Figure 4.6 shows that the output = $ABR^{-1} \pmod{M}$ = 9 is generated and the mod_done
signal is asserted high. The generated reg_out = 9 is then fed into the modular
multiplier again through the stimulus, along with $R^{2} \pmod{M}$.
Figure 4.7: Waveform simulations for 4-bit Montgomery modular multiplier using design
with two adders, output = $AB(R^{2}) \pmod{M}$.
The waveform in figure 4.7 shows the final result obtained after the second Montgomery
modular multiplication of the input values,
rega = 9
regb = 4
regm = 11
The output obtained is reg_out = 4.
4.4.1.1.2 Synthesis results
The FPGA targeted for synthesis and simulation of the design was the Xilinx Virtex2p -
xc2vp100-5ff1696. The total number of slices available was 44096. The speed grade was
selected to be -5, with the optimization effort level set to normal for synthesis. Table 4.1
gives the results obtained after synthesizing the design in figure 4.3 for 4, 32, 64, 128
and 1024 bits.
Size of Design (bits)   Port Size   No. of Slices   Clock Period (ns)
4                       2           76              6.8
32                      4           239             8.3
64                      8           436             9.5
128                     16          824             12.3
1024                    32          6401            51.8
Table 4.1: Comparison of FPGA resources used with minimum clock.
4.4.1.2 Design with two adders and a multiplexer
This design uses almost the same components as the multiplier shown in figure 4.3,
with some changes and the addition of one major multiplexer. In the modular multiplier
design of figure 4.3, the two bit multipliers use S(0) and B(0) as operands to be
multiplied with M and A respectively. Their results are fed into the n+2 bit adder, which
in turn generates the input for the n+3 bit adder. The n+2 bit adder therefore has four
possible outputs:
0, when S(0) = '0' and B(0) = '0'
A, when S(0) = '0' and B(0) = '1'
M, when S(0) = '1' and B(0) = '0'
M+A, when S(0) = '1' and B(0) = '1'
Because of these four possibilities, the combination of the M.S(0) and A.B(0) multipliers
and the n+2 bit adder can be replaced by a 4:1 multiplexer with one input connected to an
adder M+A. The M+A adder has an advantage over the old n+2 bit adder: its
carry propagation delay is incurred only at the beginning of the operation, while the
n+2 bit adder adds carry propagation delay in every iteration of the modular
multiplication. The M+A delay occurs only at the beginning because the outputs of both the
M and A registers remain constant until the design cycles that generate the output are
complete. Figure 4.8 shows the block diagram of this multiplier.
In this diagram, the M+A adder also generates the final carry out of the addition of M
and A, which adds one extra bit to the input size of the multiplexer. Because of this, the
inputs of the multiplexer are made equal in size to register S, i.e. n+1 bits. When the
select lines of the multiplexer are "00", the output of the multiplexer is 0, because that
input is connected to ground as shown in figure 4.8. The design uses the same algorithm as
the design in figure 4.3. Another change compared to the design in figure 4.3 is in the
input registers. This time, the registers are designed to load data serially in small data
sets. This extra feature requires some additional clock cycles to load the
data in and also to transfer the final result out. The finite state machine in this design
is the same as for the previous design shown in figure 4.4, except for a small modification
in the LOAD state, where the addition of A and M is performed once ld_done = '1'.
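The effect of replacing the two bit multipliers and the first adder by a multiplexer can be modelled in software as follows. This is only a behavioural sketch: the function names are chosen here, the width w is passed explicitly, and A in the multiplexer inputs is taken to be the register value after the left shift (2A), as described for the LOAD to MULT transition.

```python
def mux_4to1(s0: int, b0: int, M: int, A2: int, M_plus_A2: int) -> int:
    """4:1 multiplexer: S(0) is the MSB select line and B(0) the LSB select line."""
    inputs = (0, A2, M, M_plus_A2)   # select "00", "01", "10", "11"
    return inputs[(s0 << 1) | b0]


def mon_pro_mux(A: int, B: int, M: int, w: int) -> int:
    """Montgomery product using the multiplexer structure; S = A*B*2^(-n) (mod M)."""
    n = w + 2
    A2 = 2 * A                       # register A after the one-bit left shift
    M_plus_A2 = M + A2               # computed once, before the loop starts
    S = 0
    for i in range(n + 1):
        addend = mux_4to1(S & 1, (B >> i) & 1, M, A2, M_plus_A2)
        S = (S + addend) >> 1        # single addition per iteration, LSB discarded
    return S
```

Only one addition remains inside the loop, which mirrors the hardware argument that the M+A carry propagation delay is paid once rather than in every iteration.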
BlockDiagram:
Consider figure 4.8 for the operational block diagram of theMontgomery modular
multiplier. The outputs of the registers A andM are parallel in parallel out registers, while
B is a parallel in and right serial out shift register which provides the LSB select signal
for the multiplexer. These registers are controlled by the finite state machine for clearing,
loading, shifting and holding of the data. The following line ofVHDL code shows how
the initial data is loaded through input ports.
regA <= regA ( ( (k+2) - port_size) downto 0) & A
regM <= regM ( ( (k+2) - port_size) downto 0) & M
regB <= regB(((k+2) - port_size) downto 0) & B
Register S is a parallel-in parallel-out shift register whose LSB is used as the
MSB select signal for the multiplexer. The output register is also controlled by the finite
state machine, which generates output after n + 1 clock cycles. The structure
of the output shift register can be understood from the following lines of VHDL code.
output <= reg_out(k-1 downto (k - port_size));
reg_out <= reg_out((k - port_size - 1) downto 0) &
           reg_out(k-1 downto (k - port_size));
In these lines of code, the reg_out register stores the final result produced during the
MULT state of the FSM.
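The two assignments above rotate the result register left by one port width per clock cycle while exposing its top chunk. A minimal Python sketch of that behaviour follows; the generator name `shift_out` is hypothetical, and k is assumed to be a multiple of port_size.

```python
def shift_out(reg_out, k, port_size):
    """Yield k // port_size chunks of a k-bit value, most significant first."""
    mask = (1 << k) - 1
    for _ in range(k // port_size):
        top = reg_out >> (k - port_size)      # reg_out(k-1 downto k-port_size)
        yield top
        # Rotate left by port_size: low bits move up, top chunk wraps around.
        reg_out = ((reg_out << port_size) & mask) | top


chunks = list(shift_out(0b1001, k=4, port_size=2))
```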
[Figure 4.8 block labels: M register, A register, B register, A + M adder, 4:1 multiplexer
(select inputs "00", "01", "10"), S register, n+3 bit adder, out_reg register (LSB is discarded),
finite state machine; signals clock, reset, en, ld_done, op_done, output.]
Figure 4.8: Block diagram of Montgomery modular multiplier using multiplexer, and
two adders.
4.4.1.2.1 Simulation results
The design in figure 4.8 was simulated at RTL and post-synthesis levels for sizes of 4 and
1024 bits. The data inputs used for the 4 bit design are the same as those used for the
design in section 4.4.1.1. Consider the simulation results for the 4 bit design.
4 bit Design:
Data Input:
$A = (1001)_2 = (9)_{10}$
$B = (1001)_2 = (9)_{10}$
$M = (1011)_2 = (11)_{10}$
Data Output:
Output $= p = (9)_{10}$
The waveform simulations for this design are the same as for the design in 4.4.1.1.
Similarly, the test bench of this design was the same as that used in 4.4.1.1. The design is also
capable of calculating first $p = A B R^{-1} \bmod M$ and then $P = \mathrm{Monpro}(p, R^{2} \bmod M, M) = p R \bmod M$.
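The two-step calculation can be checked with a small reference model. This is a sketch under stated assumptions: `monpro` is a generic radix-2 Montgomery product written for illustration (not the thesis's algorithm 4.2), and R = 2^6 is an assumed value, chosen because it reproduces the output p = 9 reported above for the 4 bit data.

```python
def monpro(a, b, m, r_bits):
    """Reference Montgomery product a*b*R^-1 mod m with R = 2**r_bits.

    Requires m odd and b < 2**r_bits.
    """
    s = 0
    for i in range(r_bits):
        if (b >> i) & 1:      # add A when the current bit of B is set
            s += a
        if s & 1:             # add M to make the sum even
            s += m
        s >>= 1               # exact division by 2
    return s % m


A, B, M = 9, 9, 11
R_BITS = 6                          # assumption: R = 2**6
R2 = pow(2, 2 * R_BITS, M)          # R^2 mod M, precomputed in software

p = monpro(A, B, M, R_BITS)         # first pass:  A*B*R^-1 mod M
P = monpro(p, R2, M, R_BITS)        # second pass: removes R^-1
```

Under these assumptions the first pass gives p = 9 and the second pass gives P = 4, which equals 81 mod 11.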
1024 bit Design:
The 1024 bit design was simulated to give a performance comparison of Montgomery
modular multiplication with an ordinary software-based modular multiplication. When
compared with ordinary modular multiplication, Montgomery modular multiplication
first converts the input multiplicands A and B into Montgomery form. Consider
algorithm 4.2, which requires n+1 clock cycles to produce the output p, where
$p = A B R^{-1} \pmod{M}$. The extra factor of $R^{-1}$ is introduced by the
algorithm itself. Thus, in order to get the final output P in the form AB (mod M), P = pR (mod
M) is computed. This requires another Montgomery modular multiplication of n + 1
clock cycles, for which the inputs are p, $R^{2}$ (mod M), and M. The extra n + 1
clock cycles slow down the process of getting the final output, so a total of 2(n+1) clock
cycles is needed. That is why standalone application of Montgomery modular
multiplication is not preferred. The simulation results in table 4.2 provide the speed
comparison of the hardware and software implementations of modular multiplication. The total
number of clock cycles required for the hardware design was 2191.
Simulation, HW & SW                              Time
Hardware simulation using 62 ns clock period     62 * 2191 = 135842 ns, approx. 0.000136 sec
Software implementation simulation               0.000001 sec
Table 4.2: Comparison of simulation time to complete the modular multiplication
operation in hardware and software.
The software simulation time given in table 4.2 is much shorter than that of the
hardware. This indicates that Montgomery modular multiplication is not
suitable for hardware implementation if it is targeted as a standalone application. On the
other hand, it can be used for implementing modular exponentiation in hardware, which
becomes very fast compared to the software implementation of modular
exponentiation. This is discussed in the next topic, Montgomery modular
exponentiation.
4.4.1.2.2 Synthesis results
The following table gives the results obtained after synthesizing this design for 4, 32,
64, 128, and 1024 bits. The FPGA used in this case was a Xilinx Virtex-II Pro xc2vp100-5ff1696.
Size of Design (bits)   Port Size   No. of Slices   Clock Period (ns)
4                       2           83              6.3
32                      4           298             7.4
64                      8           510             8.4
128                     16          969             10.7
1024                    32          8050            50.1
Table 4.3: Comparison of FPGA resources used for Montgomery modular multiplier with
multiplexer.
A comparison between tables 4.1 and 4.3 shows the performance and device
utilization of the second design in figure 4.8 relative to the first. Chart 4.1 gives
a comparison of the minimum clock period for the two designs, one without multiplexer
(figure 4.3), and one with multiplexer (figure 4.8).
[Chart 4.1: clock period (ns) versus design size (bits) for the design with multiplexer
and the design without multiplexer.]
Chart 4.1: Comparison between two designs of Montgomery modular multipliers for
minimum clock period required. The design sizes are 4, 32, 64 and 128 bits.
In chart 4.1, the design of figure 4.8, which uses a multiplexer to select the
inputs for the n+3 bit adder, is faster than the design in figure 4.3. The
design without multiplexer requires an adder of n+1 bits in the data
propagation path, which adds its carry propagation delay to
every clock cycle. In the design using the multiplexer, one n+1 bit adder
is instead placed at the input of the multiplexer, where it computes A + M once during the
LOAD state, so its carry propagation delay is not incurred in every clock cycle. An algorithm
for modular exponentiation, which will be discussed
later, requires two multiplier units to be used repeatedly to perform the
modular exponentiation operation. Chart 4.2 gives a comparison of the number of slices
used by each design.
[Chart 4.2: number of FPGA slices versus design size (bits) for the design with multiplexer
and the design without multiplexer.]
Chart 4.2: Comparison between two designs of Montgomery modular multipliers for
number of slices used. The design sizes chosen are 4, 32, 64 and 128 bits.
The modular multiplier with multiplexer consumes a little more area than
the design without multiplexer. This added area is due to the multiplexer. Thus the
second design in figure 4.8 does not eliminate the second adder; instead it eliminates
the effect of the carry propagation delay of this adder in every clock cycle. The design goal
in this research is to obtain maximum speed, so the design in figure 4.8 is used
in constructing the Montgomery modular exponentiation block in the next sections.
4.5 Montgomery Modular Exponentiation
Modular exponentiation has already been discussed in chapter 3, section 2.1.6.
There are two basic types of exponentiation algorithms available: the right-to-left and the
left-to-right exponentiation algorithm. Montgomery modular exponentiation is based upon these
two algorithms, which use the basic concept of square and multiply. Consider first the
left-to-right Montgomery modular exponentiation algorithm.
4.5.1 Left-to-Right Montgomery modular exponentiation algorithm
In order to compute $Z = P^{E} \pmod{M}$, the inputs to this algorithm are P, E, M, and
C. P is the base, E is the exponent, M is the modulus, and C is a constant. P, E, M and C
are represented in binary (radix 2). The constant C is equal to $2^{2n} \pmod{M}$ and
must be pre-computed. This exponentiation algorithm repeatedly uses the
Montgomery modular multiplier. In general, in Montgomery modular
reduction, a number A that is reduced modulo M is converted to its M-residue
according to the following equation,
$\bar{A} = A R \bmod M$ (4.18)
Thus, for Montgomery reduction, $R^{2}$ is required in order to obtain $\bar{A}$ as the M-residue. For this
purpose, $C = R^{2} \bmod M$ is fed as an input to the Montgomery modular exponentiation. Applying
the Montgomery modular multiplication then gives
$\mathrm{Monpro}(A, R^{2}, M) = A \cdot R^{-1} \cdot R^{2} \pmod{M} = A \cdot R \pmod{M} = \bar{A}$
$R^{-1}$ is the inverse of R, which is introduced automatically by the algorithm while
performing the Montgomery modular multiplication.
Now considering the algorithm, the constant C needs to be pre-computed. This
can be done behaviorally in software, outside of the physical design
implementation. The other variables used within the algorithm are H, an intermediate
variable used for the multiplication portion of the square and multiply operation within the
loop, and w, the number of bits in M.
Algorithm 4.3: MonExpo (P, E, M, C)
1. H <- 0
2. P = Monpro (C, P, M)
3. H = Monpro (C, 1, M)
4. for i = w-1 down to 0 loop
   4.1 H = Monpro (H, H, M) ....(Square)
   4.2 if (E(i) = '1') then H = Monpro (H, P, M) ....(Multiply)
5. H = Monpro (1, H, M)
6. Return H
7. end MonExpo.
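The steps of algorithm 4.3 can be sketched in software as follows. This is a hedged Python model: `monpro` is a generic radix-2 Montgomery product written for illustration, `r_bits` (with R = 2^r_bits) is an assumed parameter, and C is recomputed from that R rather than taken from the text.

```python
def monpro(a, b, m, r_bits):
    """Reference Montgomery product a*b*R^-1 mod m, R = 2**r_bits, m odd."""
    s = 0
    for i in range(r_bits):
        if (b >> i) & 1:
            s += a
        if s & 1:
            s += m
        s >>= 1
    return s % m


def mon_expo_ltr(p, e, m, w, r_bits):
    """Left-to-right square and multiply; step numbers follow algorithm 4.3."""
    c = pow(2, 2 * r_bits, m)                 # C = R^2 mod M, precomputed
    p_bar = monpro(c, p, m, r_bits)           # step 2: P in Montgomery form
    h = monpro(c, 1, m, r_bits)               # step 3: 1 in Montgomery form
    for i in range(w - 1, -1, -1):            # step 4: MSB of E first
        h = monpro(h, h, m, r_bits)           # 4.1 square
        if (e >> i) & 1:
            h = monpro(h, p_bar, m, r_bits)   # 4.2 multiply, needs the new H
    return monpro(1, h, m, r_bits)            # step 5: leave Montgomery form


z = mon_expo_ltr(11, 7, 13, w=4, r_bits=6)
```

The multiply in 4.2 reads the H just written by 4.1, so the two products cannot overlap in time.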
This algorithm requires two Montgomery modular multipliers for hardware
implementation. Lines 2 and 3 of the algorithm can be computed in parallel, because
there is no data dependency between them that would conflict with a parallel
architecture at this point. Inside the loop, lines 4.1 and 4.2 have a read-after-write (RAW)
data dependency within the same iteration. Thus only
one multiplier can be used here, performing H = Monpro (H, H, M)
and H = Monpro (H, P, M) successively. This reduces the area requirement, but slows
down the exponentiation operation. At line 5, H = Monpro (1, H, M) is computed
in order to remove the effect of $R^{-1}$; this again requires a multiplier, but as this
step comes last, either of the existing multipliers can be re-used. The computational
effort of the left-to-right binary exponentiation algorithm can be calculated by first
considering the computational effort of each Montgomery modular multiplication, which is
n+1 clock cycles. The total computational effort of the above algorithm is thus 2(n+1)(w+1). For example,
if w = 512 bits, the computational effort of the left-to-right algorithm is 528390
clock cycles at minimum (this figure corresponds to n+1 = 515). This calculation is not exact:
when the algorithm is implemented in hardware, some additional clock cycles are needed for
loading, shifting, and updating the registers.
The left-to-right binary modular exponentiation algorithm requires fewer
hardware resources, but it is slow in practice when compared to the right-to-left binary
modular exponentiation algorithm.
4.5.2 Right-to-Left Montgomery modular exponentiation algorithm
Similar to the left-to-right algorithm, this algorithm also requires the constant
$C = 2^{2n} \pmod{M}$ to be fed into the algorithm. Besides the inputs P, E, M, and C, the other
variables used in the algorithm are H, an intermediate variable, and w, the number of bits
in M.
Algorithm 4.4: MonExpo (P, E, M, C)
1. H <- 0
2. P = Monpro (C, P, M)
3. H = Monpro (C, 1, M)
4. for i = 0 to w-1 loop
   4.1 if (E(i) = 1) then
       4.1.1 H = Monpro (H, P, M) ....(Multiply)
   4.2 P = Monpro (P, P, M) ....(Square)
5. H = Monpro (1, H, M)
6. Return (H).
7. end MonExpo.
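A software rendering of algorithm 4.4 makes the independence of the two products in the loop visible. As before, this is a minimal Python sketch: `monpro` is a generic radix-2 Montgomery product, `r_bits` is an assumed parameter, and C is recomputed from the assumed R.

```python
def monpro(a, b, m, r_bits):
    """Reference Montgomery product a*b*R^-1 mod m, R = 2**r_bits, m odd."""
    s = 0
    for i in range(r_bits):
        if (b >> i) & 1:
            s += a
        if s & 1:
            s += m
        s >>= 1
    return s % m


def mon_expo_rtl(p, e, m, w, r_bits):
    """Right-to-left square and multiply; step numbers follow algorithm 4.4."""
    c = pow(2, 2 * r_bits, m)                     # C = R^2 mod M, precomputed
    p_bar = monpro(c, p, m, r_bits)               # step 2
    h = monpro(c, 1, m, r_bits)                   # step 3
    for i in range(w):                            # step 4: LSB of E first
        # Both products below read the same old P, so two hardware
        # multipliers could evaluate them in the same time slot.
        if (e >> i) & 1:
            h = monpro(h, p_bar, m, r_bits)       # 4.1.1 multiply
        p_bar = monpro(p_bar, p_bar, m, r_bits)   # 4.2 square
    return monpro(1, h, m, r_bits)                # step 5


z = mon_expo_rtl(11, 7, 13, w=4, r_bits=6)
```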
This algorithm has an advantage over the previous left-to-right algorithm in terms
of speed. It also requires two multipliers, as does the left-to-right modular
exponentiation algorithm. Lines 2 and 3 can be executed in parallel using two hardware
Montgomery modular multipliers. Once that computation is done, these multipliers are
further used in computing lines 4.1 (Multiply) and 4.2 (Square). Here the advantage
of higher speed is achieved, because these two lines of the algorithm have no data
dependencies that conflict with parallelism at this point. Finally, at line 5
either of the two multipliers can be used again in order to remove the effect of $R^{-1}$.
The computational effort of the right-to-left Montgomery modular exponentiation
algorithm can be calculated by first considering the computational effort of the modular
multiplier, which is n+1. Due to parallelism in the loop, the computational effort reduces
to (n+1)(w+2). For example, if w = 512 bits, this algorithm takes 264710 clock
cycles at minimum to compute the final result, while algorithm 4.3 requires 263680
more clock cycles than the right-to-left algorithm, which is almost twice the
effort.
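The two cycle-count formulas can be evaluated directly. In the sketch below the function names are hypothetical, and the per-multiplication cost n + 1 = 515 is an inferred value: it is the one that reproduces the figures quoted in the text for w = 512.

```python
def cycles_left_to_right(mult_cycles, w):
    """2(n+1)(w+1): square and multiply run one after the other."""
    return 2 * mult_cycles * (w + 1)


def cycles_right_to_left(mult_cycles, w):
    """(n+1)(w+2): square and multiply run in parallel."""
    return mult_cycles * (w + 2)


W = 512
MULT_CYCLES = 515      # n + 1, inferred from the quoted totals

ltr = cycles_left_to_right(MULT_CYCLES, W)
rtl = cycles_right_to_left(MULT_CYCLES, W)
extra = ltr - rtl
```

These evaluate to 528390, 264710 and 263680 clock cycles respectively.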
After the analysis of speed, the right-to-left algorithm was selected for
implementation in this thesis. For cryptography, speed is the major requirement, and
the present use of 2048 bit RSA and DSA public key cryptosystems requires very fast
hardware architectures. So the major concern behind the selection of the algorithms is
their speed when they are implemented in hardware.
4.5.2.1 Hardware implementation of Right-to-Left Montgomery modular
exponentiation algorithm
Selection of a fast modular multiplier is important for the hardware
implementation of the right-to-left modular exponentiation algorithm. Thus the
multiplexer-based design shown in figure 4.8 is the target multiplier for
algorithm 4.4. Before implementing at register transfer level (RTL), functional
verification of the algorithm was very important. For this, C++ code was written to
verify the correctness of the algorithm. In VHDL, designing the exponentiation block also needed
several steps: initially it was implemented behaviorally, and then it was refined
down to RTL. When implemented behaviorally, the design was not provided with a
clock, so the output was based upon the execution of the loop statement in the code.
When designing at RTL, a clock was added along with input and output load and shift
registers, so that the data flow could be done in a pipelined fashion. Consider figure 4.9
for the input and output port configuration, and figure 4.10 for the different functional
blocks used in the Montgomery modular exponentiation design.
[Figure 4.9 port labels: inputs P (k : 1), E (k : 1), M (k : 1), C (k : 1), clock, reset, enable;
outputs mod_done, next_me, output (k : 1).]
Figure 4.9: Input and output ports of the Montgomery modular exponentiation design
shown in algorithm 4.4.
[Figure 4.10 block labels: Data Input, Control Block (FSM), Control Signals, Register Bank,
Adders & Mux Units, Output Register, Data Output -> $Z = P^{E} \pmod{M}$.]
Figure 4.10: Block diagram of right-to-left Montgomery modular exponentiation algorithm
In order to compute $Z = P^{E} \pmod{M}$, the input registers are loaded with the data
(P, E, M, C) as shown in the register bank. Once the data is loaded, the ALU performs the
modular multiplication operations and finally produces the result Z through the output
register. The blocks used in figure 4.10 are explained as follows.
A. Register Bank
The register bank is a set of five different registers. Besides these five registers,
some additional registers are used in the Montgomery modular exponentiation design;
these are explained later with the detailed data flow diagram. The register bank is
responsible for loading the input data from the parallel ports and for updating the registers
upon the completion of the modular multiplications shown in algorithm 4.4. The size
of these shift registers is the size of the design, which is n+1 bits. There are two main
categories of these registers: parallel-in parallel-out and parallel-in serial-out.
These registers use different mode signals generated by the control block; the
sequence in which the mode signals are selected is explained in the Control
Block section. Each shift register is explained as follows.
a) Reg_E
This is a parallel-in serial-out shift register into which the exponent E of $Z = P^{E} \pmod{M}$
is loaded during the load state of the finite state machine (FSM). The output of
Reg_E is fed into the FSM itself, which checks whether every incoming LSB (which is
Reg_E(0)) is '1' or not. This represents line 4.1 of algorithm 4.4. The register has a 2-bit
control signal (mode), which is written in VHDL as follows,
case mode is
    when "00" => s <= (others => '0');            -- synchronous clear
    when "11" => s <= data_in;                    -- parallel load
    when "10" => s <= s;                          -- hold
    when others => s <= '0' & s(j-1 downto 1);    -- "01": shift right
end case;
At "00", Reg_E clears itself synchronously; at "11" it loads the data in parallel; at "10" it
holds the data, to avoid any change at the output; and at "01" it performs a right
shift operation.
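The mode behaviour of Reg_E can be mirrored in a small software model, which is handy when checking the order in which exponent bits reach the FSM. This is a hedged Python sketch; the function names `reg_e_step` and `exponent_bits` are hypothetical.

```python
def reg_e_step(s, mode, data_in, width):
    """Return the next value of a width-bit Reg_E-style register."""
    mask = (1 << width) - 1
    if mode == "00":
        return 0                    # synchronous clear
    if mode == "11":
        return data_in & mask       # parallel load
    if mode == "10":
        return s                    # hold
    return s >> 1                   # "01": right shift, '0' enters at the MSB


def exponent_bits(e, width):
    """Load E, then read Reg_E(0) once per right shift."""
    s = reg_e_step(0, "11", e, width)
    bits = []
    for _ in range(width):
        bits.append(s & 1)
        s = reg_e_step(s, "01", 0, width)
    return bits


bits = exponent_bits(0b0111, 4)
```

For E = (0111)2 the bits arrive LSB first as 1, 1, 1, 0, which is the order the right-to-left algorithm consumes them in.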
b) Reg_M
This is a parallel-in parallel-out shift register into which the modulus M of $Z = P^{E} \pmod{M}$
is loaded during the load state of the FSM. The output of Reg_M is fed into the ALU.
It uses a 2-bit control signal to switch between different modes of operation. In VHDL, the
code for the control signal is as follows,
case mode is
    when "00" => s <= (others => '0');    -- synchronous clear
    when "11" => s <= data_in;            -- parallel load
    when others => s <= s;                -- "01" or "10": hold
end case;
At "00" Reg_M clears itself synchronously, at "11" it loads the data in parallel, and at
"10" or "01" it holds the data, to avoid any change at its output regardless of changes
occurring at its inputs.
c) Reg_Acc
This is a parallel-in serial-out shift register which has a synchronous preset to the
value 1. It also has two ports for parallel data in, which are required for loading two
different intermediate results of the multipliers as shown in algorithm 4.4. It uses a 3-bit
control signal to switch between different modes of operation. The VHDL code for
these modes of operation is as follows,
case mode is
    when "000" => s <= (others => '0');              -- synchronous clear
    when "001" => s <= (0 => '1', others => '0');    -- preset to 1
    when "010" => s <= data_in1;                     -- load port 1
    when "011" => s <= data_in2;                     -- load port 2
    when others => s <= '0' & s(j-1 downto 1);       -- shift right
end case;
At "000" Reg_Acc clears itself synchronously, at "001" it is preset to the value 1, at
"010" it loads data_in1, at "011" it loads data_in2, and for all other mode selections it
performs the right shift operation for serial out.
d) Reg_1
This is a parallel-in serial-out shift register which initially loads the base P of $Z = P^{E} \pmod{M}$.
It has a 2-bit control signal which is used to switch between different modes of
operation. It has two ports for data in: one is used for loading P and the other for
loading the intermediate results generated by the multipliers as shown in algorithm
4.4. The VHDL code for these control signals is as follows,
case mode is
    when "00" => s <= (others => '0');            -- synchronous clear
    when "01" => s <= data_in1;                   -- load port 1
    when "10" => s <= data_in2;                   -- load port 2
    when others => s <= '0' & s(j-1 downto 1);    -- "11": shift right
end case;
At "00" Reg_1 clears itself synchronously, at "01" data_in1 is loaded, at "10" data_in2 is
loaded, and at "11" it performs the right shift operation for serial out data.
e) Reg_2
This is a parallel-in parallel-out shift register which initially loads the constant C as
the input of the right-to-left algorithm. It has a synchronous preset to the value 1, and
two parallel data in ports. It uses a 3-bit control signal to switch between different modes of
operation. This register represents register A in figure 4.7. The VHDL code for these
control signals is as follows,
case mode is
    when "000" => s <= (others => '0');                -- synchronous clear
    when "001" => s <= data_in1(j-2 downto 0) & '0';   -- load port 1, shifted left
    when "010" => s <= data_in2(j-2 downto 0) & '0';   -- load port 2, shifted left
    when "100" => s <= (1 => '1', others => '0');      -- preset: 1 shifted left
    when others => s <= s;                             -- hold
end case;
At "000", Reg_2 clears itself synchronously. At "001" it loads data_in1 in parallel
and also shifts it to the left by 1 bit; this is because in algorithm 4.2 register A is
multiplied by 2, which is equivalent to shifting the register to the left by 1 bit. Similarly,
at "010" it loads data_in2 while shifting it to the left by 1 bit. At "100", it presets itself to
the value 1, and in all other cases it holds the data which has been loaded, in order to
avoid any change at the output regardless of changes occurring at the input.
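The load-and-shift behaviour of Reg_2 can likewise be mirrored in software. This is a minimal Python sketch with a hypothetical function name; it follows the mode encoding listed above and models the one-bit left shift as a doubling truncated to the register width.

```python
def reg_2_step(s, mode, data_in1, data_in2, width):
    """Return the next value of a width-bit Reg_2-style register."""
    mask = (1 << width) - 1
    if mode == "000":
        return 0                         # synchronous clear
    if mode == "001":
        return (data_in1 << 1) & mask    # load port 1, doubled
    if mode == "010":
        return (data_in2 << 1) & mask    # load port 2, doubled
    if mode == "100":
        return 0b10                      # preset: the value 1, doubled
    return s                             # every other mode: hold


loaded = reg_2_step(0, "001", 0b0101, 0, width=6)
```

Loading (0101)2 through port 1 leaves (1010)2 in the register, i.e. twice the loaded operand.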
B. ADD_MUX block
This block is composed of two identical multiplexers, four adders, and three
intermediate parallel-in parallel-out shift registers.
a) Multiplexers and Adders
Each multiplexer is a 4:1 multiplexer with n+2 bit data I/O. It is similar to the
multiplexer used in the Montgomery modular multiplier in figure 4.8. The adders are also
similar to the adders in figure 4.8. Two adders with n+2 bit data outputs are connected at
the inputs of the two multiplexers, while the other two adders, with n+3 bit data outputs,
are connected at the outputs of the multiplexers.
b) Shift Registers
Among the three shift registers, two are identical and are named Reg_S. Each Reg_S is
connected between the output and the input of an n+3 bit adder, forming a feedback loop
around that adder. Reg_S is required to synchronize the
data in the loop path with the data coming from the multiplexers. It has a 1-bit control
signal, as shown in the following VHDL code,
case mode is
    when '0' => s <= (others => '0');    -- synchronous clear
    when others => s <= data_in;         -- parallel load
end case;
At '0', Reg_S clears itself synchronously, and at '1' it performs a parallel load.
The third shift register is a hold register named Hold_reg. It is connected only
at the output of one of the Reg_S registers. This register is required for holding the data if the LSB
of the PISO register Reg_E is '0'. In that case it holds the previous result of Reg_S until
an LSB of '1' arrives from Reg_E. It represents the if statement at line 4.1 of the
right-to-left algorithm.
A complete connection configuration of a multiplexer, n+2 bit adder, n+3 bit adder,
Reg_S, and Hold_reg is shown below in figure 4.11.
[Figure 4.11 block labels: Hold_reg, adder (n+2 bit), Reg_S, 4:1 multiplexer (select inputs
"00", "01", "10"), adder (n+3 bit).]
Figure 4.11: Block diagram of one ADD_MUX block used in right-to-left Montgomery
modular exponentiation algorithm
Figure 4.11 represents the components used in an ADD_MUX block. Two such
blocks are required by the modular exponentiation design, with the addition of a register
named Hold_reg. This register is used in only one of the ADD_MUX blocks, the one which also
generates the final output of the modular exponentiation, as shown in figure 4.12.
Consider figure 4.12 for the complete diagram of Montgomery modular exponentiation
based upon algorithm 4.4.
[Figure 4.12 block labels: clock, reset and enable inputs; Control Block (FSM); Reg_E, Reg_1,
Reg_M, Reg_2 and Reg_Acc; adder, multiplexer and Reg_S blocks; Output Register.]
Figure 4.12: Data flow level block diagram of right-to-left Montgomery Modular
Exponentiation
C. Control Block (Finite State Machine)
Consider figure 4.13 for the finite state machine used in the right-to-left Montgomery
modular exponentiation algorithm. It has five major states and three minor states.
[Figure 4.13 state diagram labels: IDLE (reset = '0', enable = '0'; output: set mode for
registers to '0' (default)); LOAD, entered when enable = '1' (output: load registers; sub-state
squ_mult_ld); multiplication states, including SQU_MULT, in which the PIPO registers are held
and the PISO registers are shifted while counter1 < k+2 and counter2 < k; a final state in
which the final output is generated.]
Figure 4.13: Finite State Machine of right-to-left Montgomery Modular Exponentiation
Algorithm
The state machine has two internal counters, counter1 and counter2. counter1
keeps track of the completion of a Montgomery multiplication after n+1 clock cycles.
counter2 controls the execution of the square and multiply phase of the Montgomery
exponentiation, shown at line 4 of algorithm 4.4. It runs w times.
Before explaining each state, consider table 4.4 for the control signals of all the shift
registers in the design.
Table 4.4: Control signals for shift registers generated by finite state machine.
a) IDLE
This is the state in which the FSM remains when reset = '0' and enable = '0'. When
the state machine has completed all its tasks, it returns to this state. When enable goes high, it
clears all the output registers and then changes the state of the FSM to LOAD.
b) LOAD with red_mont_ld
The LOAD state is further composed of three sub-states. These sub-states are used for
loading the Montgomery multipliers used at various places in algorithm 4.4. The default
sub-state of the LOAD state is red_mont_ld. In red_mont_ld, the registers for P = MonPro(C,
P, M) and H = MonPro(C, 1, M) are loaded with values. The following table gives the
control signals generated in this sub-state to load the values. After generating these
control signals, the FSM changes to the REDUCE_MONT state.
Registers   Data 1   Preset
Reg_E       11
Reg_1       01
Reg_M       11
Reg_2       001
Reg_Acc              001
Reg_hold    11
Table 4.5: Control signals generated by FSM for registers in design in LOAD with
red_mont_ld sub-state.
c) REDUCE_MONT
In REDUCE_MONT, P = MonPro(C, P, M) and H = MonPro(C, 1, M) are
computed. The name of this state represents the Montgomery reduction of P
to its Montgomery form, and also the generation of H, which is further used in the square
and multiply stage of the algorithm. The control signals generated in this
state are given in the following table, with the active signals highlighted in white boxes.
Registers   Data 1   Hold         Shift right serial
Reg_1       01                    11
Reg_M       11       01/10
Reg_2       001      All others
Reg_S1      1
Reg_S2      1
Table 4.6: Control signals generated by FSM for registers in REDUCE_MONT state.
In REDUCE_MONT, counter1 is incremented on every clock cycle until it reaches a
count of w+3. This completes the Montgomery multiplication operations for P and H
shown on lines 2 and 3 of the right-to-left algorithm. After that the FSM changes the state to
LOAD with its sub-state squ_mult_ld.
d) LOAD with squ_mult_ld
In squ_mult_ld, the registers for the square and multiply stage are loaded. The
following table presents the control signals generated for the registers of the design.
Registers   Clear   Data 1   Data 2
Reg_1       00      01       10
Reg_2       000     001      010
Reg_Acc     000     010      011
Reg_S1      0       1
Reg_S2      0       1
Table 4.7: Control signals generated by FSM for registers in LOAD with squ_mult_ld
state.
In table 4.7, Reg_Acc has a condition to be set either to "010" or to "011". The
condition is based upon the Reg_E serial output. If Reg_E(0) = '0' then Reg_Acc is set to
"011"; otherwise it is set to "010". Both Reg_S1 and Reg_S2 are cleared in this state for
the next Montgomery multiplication operation. After generating the above control signals,
the FSM changes to the SQU_MULT state.
e) SQU_MULT
This state performs the square and multiply portion of algorithm 4.4, i.e.
if (E(i) = 1) then H = Monpro (H, P, M); ....(Multiply)
P = Monpro (P, P, M); ....(Square)
If the LSB of Reg_E is '1', the FSM executes H = Monpro(H, P, M). For this it
loads Reg_Acc with H and Reg_2 with P, both generated by the previous Montgomery
modular multiplications. For P = Monpro (P, P, M), the FSM loads Reg_1 with P, and as
Reg_2 is already loaded with P as well, it remains in the hold condition. The following
table shows the control signals generated by the FSM for the registers used for this portion of
the algorithm.
Table 4.8: Control signals generated by FSM for registers in SQU_MULT state.
In table 4.8, the values in white boxes are the modes asserted during this state.
Notice that Reg_S2 is '0' when Reg_E(0) = '0'; otherwise Reg_S2 = '1'. Also,
Reg_hold is "01/10" (hold) when Reg_E(0) = '0'; otherwise it is "11", which loads new data.
This is required because, if the previous LSB of Reg_E was '1', then H =
Monpro(H, P, M) was executed; if the present LSB is '0', Reg_hold holds the
previous value of H for use in the next Montgomery modular multiplication, and
during this time Reg_S2 is set to '0' to keep its output at the value 0. The
condition applied to Reg_S2 prevents unwanted results from appearing at the input
of Reg_Acc, which could otherwise load a wrong value for the coming multiplication.
The control signal for Reg_E is also set to "01", which is a right serial shift. This
is done whenever a Montgomery multiplication is completed within the range of
counter2. As mentioned earlier, counter2 runs w times. Thus Reg_E performs one
right serial shift after every w number of clock cycles. counter1 increments on every
clock cycle until it reaches w+3; then, within the SQU_MULT state, the FSM checks whether
counter2 < w. If it is less than w, the FSM changes its state from SQU_MULT
to LOAD with the squ_mult_ld sub-state in order to load the new outputs generated by the
previous modular multiplications. On the other hand, if counter2 has completed its w
iterations, the FSM changes the state to LOAD with the final_out_ld sub-state.
f) LOAD with final_out_ld
In this state the values for the final Montgomery multiplication are loaded. Table
4.9 shows the control signals which are generated in this state.
Registers   Clear   Data_1   Data_2   Preset
Reg_2       000     001      010      100
Reg_Acc     000              011      001
Reg_S1      0       1
Reg_S2      0       1
Table 4.9: Control signals for registers in LOAD with final_out_ld state.
Reg_2 is synchronously preset to the value 1 by generating the control signal
"100". This represents H = Monpro (1, H, M) at line 5 of the right-to-left Montgomery
modular exponentiation algorithm. Once the registers are loaded, the FSM makes the
transition to the FINAL_OUT state.
g) FINAL_OUT
This state performs the H = Monpro (1, H, M) operation. The control signals
generated in this state are shown in table 4.10.
Registers   Clear   Data_1   Hold         Shift right serial
Reg_2       000     001      All others
Reg_Acc     000     010                   All others
Reg_S2      0       1
Reg_Out             11       10/01
Table 4.10: Control signals generated by FSM for registers in FINAL_OUT state.
As seen in table 4.10, Reg_S1 is not displayed, because only one
modular multiplier unit is active now. For Reg_2, the hold mode is asserted,
because the previous value of H from the SQU_MULT state is now used in the
multiplication. Reg_out is active now. Once counter1 = w+3, the FSM sets up
Reg_out to load the data from the output of the second n+3 bit adder, connected in the
Reg_S2 block. Once the data is loaded, the FSM sets Reg_out to hold mode so that the data
is maintained at the output. In this state the mod_done signal is also asserted to logic '1',
showing that the Montgomery modular exponentiation is completed.
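The state sequence described in sections a) to g) can be summarised as a short trace generator. This is a hedged Python sketch: the function name and the state label strings are illustrative, and the trace records the order of states only, not the clock cycles spent in each.

```python
def fsm_trace(w):
    """Order of FSM states for one exponentiation with a w-bit exponent."""
    trace = ["IDLE", "LOAD/red_mont_ld", "REDUCE_MONT"]
    for _ in range(w):                        # counter2 counts these passes
        trace += ["LOAD/squ_mult_ld", "SQU_MULT"]
    trace += ["LOAD/final_out_ld", "FINAL_OUT", "IDLE"]
    return trace


trace = fsm_trace(4)
loads = sum(1 for state in trace if state.startswith("LOAD"))
mult_states = sum(1 for state in trace
                  if state in ("REDUCE_MONT", "SQU_MULT", "FINAL_OUT"))
```

For w = 4 the trace contains six LOAD visits and six multiplication states, one pair per Montgomery multiplication slot.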
4.5.2.1.1 Simulation of Right-to-Left Montgomery modular exponentiation
The VHDL model was simulated using ModelSim at both RTL and post-synthesis
levels. The post-synthesis simulation was carried out step by step,
including post-translate simulation and post-place-and-route simulation. The time taken
for the simulation to complete was (n+1)(w+2) clock cycles. The simulated data was
4 bits wide. The inputs shown are,
P <= (1011)_2 -- (11)_10
E <= (0111)_2 -- (7)_10
M <= (1101)_2 -- (13)_10
C <= (1100)_2 -- (12)_10
Figure 4.14: Wave form simulations showing the data inputs of p, e, m and c with
output generated.
Figure 4.14 displays the output obtained at 105 ns, while the design was enabled at
7 ns. Thus a total of 49 clock cycles are used in computing the final output. According to
the right-to-left Montgomery modular exponentiation algorithm, a total of (n+1)
(w+2) clock cycles is required, which is equal to 42 clock cycles. In simulation, the output
arrived after 49 clock cycles. The difference of 7 clock cycles is
due to the load and shift of the registers used in the register bank, the hold register and the
output register. These 7 clock cycles can be explained by considering that the right-to-left
algorithm requires a total of 6 Montgomery modular multiplications when the
parallel operations are taken into account. Reg_1 and Reg_2 in the register bank are updated upon the
completion of every modular multiplication. Thus 6 clock cycles are consumed by the
registers in the register bank. An additional 1 clock cycle, consumed by the output register,
is needed to generate the output.
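The cycle budget in this paragraph can be reproduced with a few lines of arithmetic. In the Python sketch below the function name is hypothetical, the per-multiplication cost of 7 cycles is the value implied by (n+1)(w+2) = 42 for w = 4, and the clock period is derived from the quoted time stamps rather than stated in the text.

```python
def simulated_cycles(w, mult_cycles):
    """Return (ideal, total) clock cycles for one exponentiation."""
    multiplications = w + 2              # lines 2/3 in parallel, w loop passes, line 5
    ideal = mult_cycles * multiplications
    register_updates = multiplications   # one load cycle per multiplication slot
    output_cycle = 1                     # output register
    return ideal, ideal + register_updates + output_cycle


ideal, total = simulated_cycles(w=4, mult_cycles=7)
clock_period_ns = (105 - 7) / total      # enable at 7 ns, output at 105 ns
```

This gives 42 ideal cycles, 49 cycles in total, and an implied clock period of 2 ns.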
Figures 4.15 and 4.16 show the internal register enable signals
together with the updated register values. These waveforms do not include the I/O ports. The
signals are displayed in unsigned format.
Figure 4.15: Wave form simulations showing the behavior of the registers used in the
design. This simulation is the first half of the total simulation.
Figure 4.16: Wave form simulations showing the behavior of the registers used in the
design. This simulation is the second half of the total simulation.
4.5.2.1.2 Synthesis of Right-to-Left Montgomery modular exponentiation
This design was synthesized for 4, 32 and 64 bits using the Xilinx Virtex-II Pro
xc2vp100-5ff1696 FPGA. There were 33280 slices available on this FPGA. Because of the parallel
ports used, synthesis of designs of approximately 124 bits or more was not possible. Table
4.11 shows the results for various design sizes.
Size of Design (bits)   No. of Slices   Clock Period (ns)
4                       141             8
32                      462             8.7
64                      828             10.1
Table 4.11: Results taken from the synthesis of different sizes of the design of
Montgomery modular exponentiation algorithm.
4.5 Summary
The design of Montgomery modular exponentiation explained in figure 4.14 has been used
in the implementation of the Digital Signature Algorithm (DSA). The design was
implemented to be completely scalable, so that it can be incorporated in any size of DSA.
Chapter 5 presents the hardware implementation of DSA along with its implementation in
software. The purpose of implementing DSA in both hardware and software is to compare
the throughput of the two implementations.
Chapter 5
Hardware and Software Implementation of
Digital Signature Algorithm using Montgomery Modular Methods
The Digital Signature Algorithm given in chapter 2 has been implemented in software
and hardware. The software implementation is purely functional, with no timing or
signal details. For modular arithmetic, the fast multi-precision software algorithms are
used. For the hardware implementation, the slow portions of DSA are targeted, including
the modular exponentiation modulo p. The hardware portions are then executed in
cooperation with the software portions for a complete DSA operation.
5.1 Software Implementation of Digital Signature Algorithm
The software implementation of DSA, based upon algorithm 2.1, was generic for
various sizes of modulus p. The software design was targeted for p = 1024 bits, q = 160
bits and x = 864 bits. Standard C++ does not support data types which are longer than 32
bits, so it could not support the implementation directly. SystemC was therefore selected
as a multi-precision C++ environment. In SystemC, sc_bigint and sc_biguint can be
declared with an arbitrary width, so numbers of any size are supported. SystemC is a C++
library which is designed to support hardware modeling as well as verification. The level
of abstraction of the model can range from the functional algorithmic level down to
concurrent hardware modeling, as in hardware description languages. For the software
implementation, the high level abstraction feature of SystemC was used. The units of the
software implementation were:
a) main.cpp
It contains the top level detail of the design: the classes are instantiated and their
methods are invoked for particular instances.
b) data.h
It is a package class which includes the constant variables and procedures to be
used in all other classes as well as in "main.cpp". The constant variables set the sizes
of the p, q, x, and h (hash) variables. These variables are declared as follows,
const int h_width = 160;   // SHA(m)
const int q_width = 160;   // q
const int p_width = 1024;  // p
const int k_width = 864;   // x
"data.h" also has two functions to compute modular exponentiation using the square and
multiply algorithm.
hash.h
This class implements the SHA-1 message digest algorithm, which is given in [8]. In
addition to the algorithm itself, this file reads the text message from a message file and
converts it into numerical format in the range 0 to 25, with a = 0 and z = 25. The message
must consist only of lower case letters, with no spaces and no special characters between
them.
c) pre_sign_block.h
This class implements some pre-computations of the signing operation of DSA.
These pre-computations include finding the primes p and q while selecting an even
number x. The operation included here from algorithm 2.1 is,
$p = qx + 1$
The algorithm used to find prime numbers is the Miller-Rabin primality test.
d) dsa_block.h
This file includes all the major computations of DSA. The computations are,
$\alpha = g^{(p-1)/q} \pmod{p}$
$\beta = \alpha^{a} \pmod{p}$
$r = (\alpha^{k} \bmod p) \pmod{q}$
$s = k^{-1}(\mathrm{SHA}(m) + a r) \pmod{q}$
$u_1 = s^{-1}\,\mathrm{SHA}(m) \pmod{q}$
$u_2 = s^{-1} r \pmod{q}$
$v = (\alpha^{u_1} \beta^{u_2} \bmod p) \pmod{q}$
$v = r$
dsa_block.h has four functions: create_sign(), verify, find_alpha(), and ext_euc(). The
ext_euc() function, based upon the extended Euclidean algorithm, is used to find the
modular multiplicative inverse of the secret number k, which is required in the algorithm
for the signature creation operation.
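The role of ext_euc() can be illustrated with a short Python sketch. This is not the thesis code, which is written in C++ with SystemC; the names ext_euclid and mod_inverse are illustrative assumptions. The numbers are the 12 bit test values that appear in tables 5.3 and 5.4.

```python
def ext_euclid(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        quot = old_r // r
        old_r, r = r, old_r - quot * r
        old_x, x = x, old_x - quot * x
        old_y, y = y, old_y - quot * y
    return old_r, old_x, old_y


def mod_inverse(k, q):
    """Return k^-1 (mod q). In DSA q is prime, so the inverse exists for 0 < k < q."""
    g, x, _ = ext_euclid(k % q, q)
    if g != 1:
        raise ValueError("k has no inverse modulo q")
    return x % q


# 12 bit test values from tables 5.3 and 5.4: q = 41, k = 28, m = 13, a = 3, r = 1.
k_inv = mod_inverse(28, 41)
s = (k_inv * (13 + 3 * 1)) % 41   # s = k^-1 (m + a r) (mod q)
```

With these inputs the sketch gives k_inv = 22 and s = 24, which agrees with the value of s listed for the 12 bit design in table 5.4.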
Operation:
In main.cpp, the digital signature algorithm starts by creating the hash: the hash.h
constructor is invoked and its function SHA1() is called. This function in turn calls the
conv_string(), pad_msg() and hash_funct() functions and finally returns the message
digest. Next, pre_sign_block.h is instantiated in main.cpp by invoking its constructor.
The function generate_primes() is called for this instance of pre_sign_block.h in order to
generate the primes p and q and the even number x. These three numbers are made public
and are accessed directly by the main() function in main.cpp. Finally, dsa_block.h is
instantiated in main.cpp; it creates the signature, verifies it and generates the output.
The input to this program is a message file, "msg.txt", and the output of this
program is a data file, "data_file.txt". The "data_file.txt" file is created in dsa_block.h
in two formats, which are binary and decimal.
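The prime generation step performed by generate_primes() can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the fixed set of Miller-Rabin bases and the upward search over even x are all choices made for this example.

```python
def is_probable_prime(n, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """Miller-Rabin test with a fixed base set (an assumption made for repeatability)."""
    if n < 2:
        return False
    for b in bases:
        if n % b == 0:
            return n == b
    # Write n - 1 as d * 2^s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for b in bases:
        y = pow(b, d, n)
        if y == 1 or y == n - 1:
            continue
        for _ in range(s - 1):
            y = pow(y, 2, n)
            if y == n - 1:
                break
        else:
            return False   # b is a witness that n is composite
    return True


def find_p(q, x_start):
    """Search upward over even x until p = q*x + 1 is a probable prime."""
    x = x_start + (x_start % 2)   # force x to be even
    while not is_probable_prime(q * x + 1):
        x += 2
    return x, q * x + 1
```

For example, find_p(41, 62) accepts x = 62 immediately, because 41 * 62 + 1 = 2543 is prime; these are the 12 bit values of q, x and p used later in table 5.3.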
5.1.1 Results and Analysis of Software Implementation
Table 5.1 shows the output in hexadecimal format for p = 1024 bits, q = 160 bits
and x = 864 bits.
Variables Output values
P 0x04000000000000000000000000000000000000cef8000000000000000000000000000
0000000000000000000000000000000000000000400000000000000000000000000000
0000000cef8000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000001
q 0x080000000000000000000000000000000000019df
X 0x08000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000080000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
000000000
a 0x060000000000000000000000000000000000019df
k 0x070000000000000000000000000000000000019df
$\beta$ 0x01f5e6add9f32975859c942f09152d96a2673a20ec46216a5148fbf37bee55001903686
2295aa8417bddfd7f254cfa87b1e646e6d88b9e80dc34574318bc90126f2165b594513de4
02639477e932ef1c9f5d788f9596c738545978ef37cf772aef8452f367c19487a6513d76cf
a7e43796810c156b810af3f20c940558ebedd27
$\alpha$ 0x03558df42ac52a3c7775f2000ebe4f9bbe81093a4cc3e8b19ca38d82e77e1a522f98881
adab295f75e1de9e4b48f24503a81b38af99e55ed2db427c24c8f30cf3b3a4640d452220d
dc066c0c2155137d38e6c1b4b42d258cbe23e9279eb7350e80a209ab34de7c9779b10d4a
9612313d675622a9d8e475df185e18018f8ff1fa4
r 0x048e275882ab7c8a22efa4e19bc298fa4face7de8
s 0x0195e82962bc98918e8ac48aafbc981574704f5ae
SHA(m) 0x0c3d2e1f09032547618badcfe427656d1a26e04b8
Table 5.1: Values generated at the output of the software implementation of Digital
Signature Algorithm.
The following portion of the algorithm was subjected to timing analysis for a
comparison with the hardware speed performance.
Signature operation by Alice:
$\alpha = g^{(p-1)/q} \pmod{p}$, where $\alpha^{q} = 1 \pmod{p}$ must be satisfied.
$\beta = \alpha^{a} \pmod{p}$
$r = \alpha^{k} \pmod{p}$
Verification operation by Bob:
$v = \alpha^{u_1} \beta^{u_2} \pmod{p}$
The (mod p) based calculations were selected because the modular exponentiation in
algorithm 2.1 is performed (mod p), which is the slowest part of the software
implementation. Because of this, the (mod q) based modular multiplication operations are
separated from the modular exponentiation operations of the algorithm, and only the
portions of DSA involving modular exponentiation are considered for timing analysis.
The timing results are taken using the C++ time library, time.h, and are reported in
seconds. They are based upon the total execution time of the operations involved. Table
5.2 gives the results; the simulations were run on a Pentium 4 (2.6 GHz) processor with
512 MB RAM running Windows XP Home Edition.
Type of operation Time
Signature operation 1.522 sec
Verification operation 0.531 sec
Table 5.2: Total simulation time for the software blocks of DSA signature and
verification operations. The blocks considered for timing analysis were the same as
modeled in VHDL for synthesis.
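The modular exponentiations timed in table 5.2 are computed in data.h with the square and multiply method. A minimal right-to-left sketch of that method is given below; the function name mod_exp is an assumption, and Python integers stand in for the SystemC multi-precision types.

```python
def mod_exp(base, exponent, modulus):
    """Right-to-left square and multiply: scan the exponent from its least significant bit."""
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:                      # current exponent bit is 1: multiply
            result = (result * base) % modulus
        base = (base * base) % modulus        # square for the next bit
        exponent >>= 1
    return result
```

As a small check against the 12 bit data used later in the chapter, mod_exp(2, 62, 2543) returns 817, the value of alpha listed in table 5.4.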
5.2 Hardware Implementation of Digital Signature Standard
The hardware implementation of DSA is mainly composed of the modular
exponentiation blocks of the size of prime p. The following portions of DSA algorithm
are chosen for hardware implementation.
Signature operation by Alice:
$\alpha = g^{(p-1)/q} \pmod{p}$, where $\alpha^{q} = 1 \pmod{p}$ must be satisfied.
$\beta = \alpha^{a} \pmod{p}$
$r = \alpha^{k} \pmod{p}$
Verification operation by Bob:
$v = \alpha^{u_1} \beta^{u_2} \pmod{p}$
The hardware implementation is accordingly divided into the signing and verification
units.
5.2.1 Hardware implementation of DSA - Signature Operation
This portion required the implementation of the following sections of the
algorithm,
$\alpha = g^{(p-1)/q} \pmod{p}$, where $\alpha^{q} = 1 \pmod{p}$ must be satisfied.
$\beta = \alpha^{a} \pmod{p}$
$r = \alpha^{k} \pmod{p}$
For this, only two Montgomery modular exponentiation units are used. The sequence of
execution of these modular exponentiation blocks is as follows,
Two sequential modular exponentiation operations:
$\alpha = g^{(p-1)/q} \pmod{p}$, where $\alpha^{q} = 1 \pmod{p}$ must be satisfied.
These operations are sequential because $\alpha$ must be generated before
$\alpha^{q} = 1 \pmod{p}$ can be verified. As a result, this requires a computational
effort of 2[(n+1)(w+2)] clock cycles.
Two parallel modular exponentiation operations:
$\beta = \alpha^{a} \pmod{p}$ and $r = \alpha^{k} \pmod{p}$
The computation of $\beta$ and r is implemented in parallel. This requires two modular
exponentiation blocks to be executed in parallel, with a computational effort of
(n+1)(w+2) clock cycles.
The total execution of the DSA-Signature operation requires 3[(n+1)(w+2)] clock
cycles, plus some additional clock cycles for data I/O. The I/O ports of DSA were
designed to fit the FPGA IOBs. For this reason, additional load and shift registers were
used to load the values in packets, dividing the total number of bits in p, q, and x by the
required port size; e.g. for a 1024 bit p, a port size of 32 bits takes 32 parallel load
operations. Thus the additional clock cycles required were 2(w / port_size), where w is
the number of bits in p. The factor of 2 accounts for both the input and the output data
load time. Figure 5.1 describes how the DSA signature block has been implemented in
hardware using data produced by the software block. This connection is not an HDL
connection; instead, the pre-calculated numbers generated by the software are read from
a text file, "data_file_binary.txt".
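The packet counts quoted for the I/O registers follow directly from dividing the register width by the port size. The helper names below are assumptions introduced only for this illustration.

```python
def packet_loads(width_bits, port_size):
    """Number of port_size-bit packets needed to move width_bits bits (rounded up)."""
    return -(-width_bits // port_size)


def io_overhead_cycles(w, port_size):
    """Additional clock cycles 2 * (w / port_size): one pass for input, one for output."""
    return 2 * packet_loads(w, port_size)


loads_p = packet_loads(1024, 32)        # 1024 bit p through a 32 bit port
overhead = io_overhead_cycles(1024, 32)
```

For the 1024 bit design with 32 bit ports this gives 32 loads for p and an I/O overhead of 64 clock cycles; the same helper gives 27 loads for the 864 bit x and 5 loads for the 160 bit q.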
[Figure 5.1 (data flow diagram). Software Implementation of Digital Signature Algorithm
-> data_file (p, q, x, c, SHA(m), a, k)
-> DSA Signature Block Stimulus (Behavioral Model): file read and generation of data for
   hardware block
-> DSA_Signature Block (RTL): $\alpha = g^{x} \pmod{p}$, $\alpha^{q} = 1 \pmod{p}$,
   $\beta = \alpha^{a} \pmod{p}$, $r = \alpha^{k} \pmod{p}$
-> Post DSA Signature Operation (Behavioral): $r = r \pmod{q}$ (r is reduced by q),
   $s = k^{-1}(\mathrm{SHA}(m) + a r) \pmod{q}$
-> sign_data_file (p, c, SHA(m), $\alpha$, $\beta$, r, s)]
Figure 5.1: Data flow diagram of the hardware implementation of the DSA-Signature
block using data produced by the software implemented block.
The hardware implementation for DSA-Signature is only done for those units of
DSA algorithm 2.1 where modular exponentiation of the size of p is required. In figure
5.1, the software implementation of DSA performs both the signature creation and its
verification. The file which is created for the hardware block provides the data required
to complete the arithmetic operations which involve modular exponentiation of the
size of the prime p. The hardware block is an RTL design which, along with a stimulus
and a post DSA signature block, is incorporated in a test bench.
The stimulus reads the data from "data_file_binary.txt" and sends it to the RTL
block. The data is transmitted by shifting small packets, which are then loaded
into the RTL block. The RTL block then performs the arithmetic operations mentioned
in figure 5.1 and generates $\alpha$, $\beta$ and r, all in the representation of (mod p).
These numbers are received by the post DSA-Signature block, which is a behavioral block
and performs the post computations mentioned in figure 5.1. Its output is then
written into a text file, "sign_data_file". The data in "sign_data_file" is then further
used by the DSA verification block. Only certain computations are implemented in
hardware because the aim is to target the speed bottlenecks of the whole design.
The arithmetic portions of DSA where speed is not a problem are therefore not
implemented in hardware, which reduces the cost of implementation as well as the
power consumption of the chip.
[Figure 5.2 (block diagram): In/Out Ports, Control Block, Montgomery Modular
Exponentiation Unit 1, Montgomery Modular Exponentiation Unit 2.]
Figure 5.2: Block diagram of DSA Signature Block using two Montgomery modular
exponentiation blocks.
In figure 5.2, two similar Montgomery modular exponentiation units are used. There
are eighteen I/O ports. A description of each port is given according to figure 5.3.
[Figure 5.3 (port diagram). Control signals: clock, reset, enable, stop_dsa, ip_ld_done1,
ip_ld_done2, ip_ld_done3, op_ld_done1. Data inputs: q, p, x, c, a, k (each marked
port_size: 1). Outputs: alpha, beta, r (each marked port_size: 1), dsa_done.]
Figure 5.3: Port-level detail of DSA signature block.
The register bank in figure 5.2 has internal registers which store the data sets
that arrive through the input ports shown in figure 5.3. The sizes of these input
registers and of the corresponding ports are not equal. For example, a 1024 bit design
requires a 1024 bit register to store the data sets arriving through port p. Similarly,
for the same design, a 160 bit register is required to store the data sets arriving from
port q. These registers are explained in the next section. An FPGA has a limited number
of ports, so the ports of a large design need to be narrower than the internal data
storing registers. The generic variable port_size is used for this purpose. The internal
registers storing data arriving through these ports vary in size, so in order to perform a
successful load operation, the internal control signals ip_ld_done1, ip_ld_done2 and
ip_ld_done3 are used. For a 1024 bit design, ip_ld_done1 monitors the 1024 bit data to
be stored, ip_ld_done2 monitors the 864 bit data to be stored and ip_ld_done3 monitors
the 160 bit data to be stored. Similar to the input ports, the output ports are also
optimized.
A. Register Bank:
The register bank in figure 5.2 is composed of four different types of shift registers.
a) q_reg, k_sec_reg, a_sec_reg:
These shift registers are identical and store the data sets arriving through the input
ports q, k and a. Consider q_reg, which is a parallel-in-parallel-out shift register with a
control option to load small data sets in parallel; i.e. for a 160 bit q_reg, each data
set is equal to the port size. After storing the first data set it shifts left, and it
keeps loading and shifting new data sets while ip_ld_done3 = '0'. In VHDL it is modeled
as follows,
q_reg <= q_reg ( (q_size - port_size) -1 downto 0) & q;
For a 1024 bit DSA design, q_size = 160 bits and port_size = 32 bits. Therefore, on every
clock cycle the 32 bit data arriving from port q is appended on the right side of the
previous (q_size - port_size) bits stored in q_reg. This forms a left shift register, in
which ip_ld_done3 controls the number of shifts; i.e. a total of 5 shifts is required
to load 160 bit data. Figure 5.4 describes the operation of a 160 bit shift register.
[Figure 5.4 (diagram): the 128 bit data already stored (Dn-1 ... D2 D1) is concatenated
with the 32 bit port input to form the 160 bit data (Dn ... D3 D2 D1); left shift.]
Figure 5.4: Shift register to load data sets in shift left mode.
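The shift-left load can be modeled in a few lines of Python that mirror the VHDL assignment for q_reg. The function names and the most-significant-packet-first ordering are assumptions made for this sketch; the thesis does not state the packet order in this section.

```python
def shift_in(reg, packet, reg_size, port_size):
    """One clock of the load: keep the low (reg_size - port_size) bits, append packet on the right."""
    keep_mask = (1 << (reg_size - port_size)) - 1
    port_mask = (1 << port_size) - 1
    return ((reg & keep_mask) << port_size) | (packet & port_mask)


def load_register(value, reg_size, port_size):
    """Feed value through the port, most significant packet first (assumed order)."""
    n_packets = reg_size // port_size
    port_mask = (1 << port_size) - 1
    reg = 0
    for i in reversed(range(n_packets)):
        packet = (value >> (i * port_size)) & port_mask
        reg = shift_in(reg, packet, reg_size, port_size)
    return reg, n_packets


q_value = (1 << 159) + 0x19DF          # a 160 bit example value
q_reg, shifts = load_register(q_value, 160, 32)
```

After the five shifts, q_reg holds the full 160 bit value, which matches the count of 5 shifts given for a 160 bit register with a 32 bit port.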
b) p_reg, p_const_reg:
p_reg works in the same way as q_reg. It stores the data sets arriving from port p
and has a size of 1024 bits. The number of shifts required to load the complete data is
controlled by the ip_ld_done1 control signal; i.e. a total of 32 shifts is required to
complete the load operation of 1024 bit data. p_const_reg is similar to p_reg and stores
the data sets arriving through port c.
c) p_fact_reg:
p_fact_reg is of the same type as p_reg and q_reg. It stores the data sets
arriving through port x. The size of this register is 864 bits for a 1024 bit design. The
number of shifts required to completely load the data is controlled by the ip_ld_done2
control signal; i.e. a total of 27 shifts is required to complete the load operation of
864 bit data.
d) alpha_reg, beta_reg, r_reg:
These registers store the internal results and are connected to the output ports.
They are identical and function as load-and-rotate registers. Load: a data set of
port_size bits from the left side of the register is driven onto the corresponding output
port (alpha, beta or r). Rotate: once a port_size data set has been sent, the register
rotates left so that the sent data is stored on its right side. Figure 5.5 explains the
operation.
[Figure 5.5 (diagram): 1024 bit shift register; left shift.]
Figure 5.5: Shift register to load data sets in shift left mode.
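The load-and-rotate behavior of the output registers can be sketched in the same style. The name rotate_out is hypothetical, and the 16 bit register with a 4 bit port is chosen only to keep the example readable.

```python
def rotate_out(reg, reg_size, port_size):
    """Emit the leftmost port_size bits and rotate them around to the right side."""
    packet = reg >> (reg_size - port_size)
    low_mask = (1 << (reg_size - port_size)) - 1
    reg = ((reg & low_mask) << port_size) | packet
    return packet, reg


reg = 0xABCD
sent = []
for _ in range(16 // 4):
    packet, reg = rotate_out(reg, 16, 4)
    sent.append(packet)
```

The packets leave the register in the order 0xA, 0xB, 0xC, 0xD, and after a full pass the register again holds 0xABCD, because every packet that is sent is rotated back in on the right.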
B. Finite State Machine:
The operation of the DSA-Signature block begins when the input enable signal goes
high and stop_dsa goes low on the rising edge of the clock. The operation is completed
in four states in order to generate the final outputs alpha, beta and r. The initial state
is the LOAD state, the second is the PRIMITIVE_ROOT state, the third is the PUBLIC_KEY
state, and the fourth is the FINAL_OUT state. These states are described according to
the state diagram given in figure 5.6.
[Figure 5.6 (state diagram): the transitions are labeled with the signals ip_ld_done1,
op_ld_done1, me_done1 and me_done2.]
Figure 5.6: Finite state machine of DSA Signature.
a) LOAD:
In the LOAD state, the registers in the register bank are loaded with data arriving
through the input ports. The FSM remains in the LOAD state while ip_ld_done1 is '0'.
Once it goes to '1', the FSM makes a transition to the PRIMITIVE_ROOT state.
b) PRIMITIVE_ROOT:
In this state, g is found as a primitive root (mod p). For this purpose the following
modular exponentiation operations are performed sequentially. First,
$\alpha = g^{(p-1)/q} \pmod{p}$ is computed after the selection of g. After that,
$\alpha^{q}$ is computed to verify that $\alpha^{q} = 1 \pmod{p}$. The number of attempts
needed to satisfy this condition varies with the selection of the prime numbers p and q.
The output produced in this state is $\alpha$. This operation requires 2(n+1)(w+2) clock
cycles. Once the output is produced, the FSM makes a transition to the PUBLIC_KEY state.
c) PUBLIC_KEY:
In this state, beta and r are computed as $\beta = \alpha^{a} \pmod{p}$ and
$r = \alpha^{k} \pmod{p}$. This operation requires two Montgomery modular
exponentiation blocks to be executed in parallel. Therefore, the total time of
computation is (n+1)(w+2) clock cycles. Once the values of $\beta$ and r are computed,
the FSM makes a transition to the FINAL_OUT state.
d) FINAL_OUT:
In this state, the values of $\alpha$, $\beta$ and r are generated as output. The FSM
remains in this state while op_ld_done1 is '0'; once it goes to '1', the FSM makes a
transition to the LOAD state.
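The four states and their transitions can be summarized as a next-state function. This is a behavioral sketch only, not the thesis VHDL. The signal alpha_valid is a hypothetical stand-in for "the output is produced" in the PRIMITIVE_ROOT state, and the use of me_done1 and me_done2 for the PUBLIC_KEY transition is inferred from the labels in figure 5.6.

```python
from enum import Enum


class State(Enum):
    LOAD = 0
    PRIMITIVE_ROOT = 1
    PUBLIC_KEY = 2
    FINAL_OUT = 3


def next_state(state, ip_ld_done1=0, alpha_valid=0, me_done1=0, me_done2=0, op_ld_done1=0):
    """Return the state after one clock edge, given the current control signals."""
    if state is State.LOAD:
        return State.PRIMITIVE_ROOT if ip_ld_done1 else State.LOAD
    if state is State.PRIMITIVE_ROOT:
        # alpha_valid is hypothetical: alpha computed and alpha^q = 1 (mod p) confirmed.
        return State.PUBLIC_KEY if alpha_valid else State.PRIMITIVE_ROOT
    if state is State.PUBLIC_KEY:
        # Both exponentiation units (beta and r) must have finished.
        return State.FINAL_OUT if (me_done1 and me_done2) else State.PUBLIC_KEY
    # FINAL_OUT: wait until the output registers have been shifted out.
    return State.LOAD if op_ld_done1 else State.FINAL_OUT
```

Written this way, each state simply holds until its completion signal is asserted, which is the behavior described for the LOAD and FINAL_OUT states.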
5.2.1.1 Simulation results for DSA-Signature block
Three different sizes of DSA-Signature are simulated at RTL and post-synthesis
levels. The first two simulations are for the 12 and 32 bit designs and the third
simulation is for the 1024 bit design. The 12 and 32 bit designs were simulated in order
to verify the accuracy of the design in the earlier design phases. A post-synthesis
simulation of the 32 bit design took approximately two hours on a single AMD 64 bit
computer, while the 1024 bit design took approximately seven days on the same system
configuration. For post-synthesis simulation, the Modelsim simulator on the AMD 64 bit
computer was not optimized to simulate the gate level model of DSA generated by Xilinx
ISE. This required the simprim.lib library, which provides the gate level components for
post-synthesis simulation, to be configured for Modelsim. Instead of the 160 bit message
digest SHA(m), a random message m of size 12 bit is used for the 12 and 32 bit designs.
The three design sizes and their simulation results are considered below.
DSA-Signature Designs of 12 and 32 bit sizes:
Consider figure 5.1 as a reference for the flow of the design. The sizes of the
variables in the 12 bit design are p, c = 12 bit; q, a, k, m, x = 6 bit. Similarly, the
variable sizes of the 32 bit design are p, c = 32 bit; q, a, k, m = 12 bit; x = 20 bit.
The input to this design is given in table 5.3.
Data   12 bit Design            32 bit Design
       Value    Size in bits    Value     Size in bits
p      2543     12              129023    32
q      41       6               2081      12
x      62       6               62        20
a      3        6               38        12
k      28       6               1824      12
m      13       6               3277      12
c      1126     12              12080     32
Table 5.3: Data input for 12 and 32 bit hardware block of DSA Signature.
This design was simulated at RTL and post-synthesis level using Xilinx ISE and
Modelsim for the Virtex-II Pro xc2vp100-5ff1696 FPGA. The output of the hardware
block is fed into the post DSA-Signature block, which further computes
$s = k^{-1}(m + a r) \pmod{q}$ and creates the result text file "sign_data_file". This
block is not synthesizable.
Figure 5.7 shows the 12 bit DSA-Signature design waveform for the hardware
block. This simulation is done at RTL level; therefore, a minimum clock cycle of 2 ns is
used. The 12 bit design was selected in order to display and explain the results. This
waveform shows only the initial load operation. Notice that the input and output
data ports, labeled p, q, x, c, a, k, alpha, beta, and r, are of size port_size, which
is 2 bit in this case. A total of 6 clock cycles is required for 2 bit ports to load all
the data sets arriving at the input ports. This difference of 6 clock cycles is visible
in the waveform between the point where dsa_state = LOAD and the point where
ip_ld_done1 = '1'.
Figure 5.7: Waveform simulation for 12 bit DSA signature design showing the load
operation completed in 6 clock cycles.
Table 5.4 gives the values of $\alpha$, $\beta$, r and s, where $\alpha$ and $\beta$ are
part of the public key, and r and s represent the signed message.
Data       12 bit Design            32 bit Design
           Value    Size in bits    Value    Size in bits
$\alpha$   817      12              96956    32
$\beta$    2335     12              47605    32
r          1        6               1092     12
s          24       6               676      12
Table 5.4: Data produced from the DSA Signature block for DSA verification block.
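The 12 bit entries of table 5.4 can be reproduced from the inputs of table 5.3 with a few lines of Python. Two points are assumptions of this sketch: g = 2 is taken as the generator candidate (it reproduces the tabulated alpha), and the inverse of k is obtained with Fermat's little theorem instead of the extended Euclidean algorithm used in the thesis.

```python
# 12 bit design inputs from table 5.3.
p, q, x = 2543, 41, 62
a, k, m = 3, 28, 13
g = 2                                  # assumed generator candidate

alpha = pow(g, x, p)                   # alpha = g^((p-1)/q) (mod p), with (p-1)/q = x
assert pow(alpha, q, p) == 1           # condition alpha^q = 1 (mod p)
beta = pow(alpha, a, p)                # beta = alpha^a (mod p)
r = pow(alpha, k, p) % q               # r = (alpha^k mod p) mod q

k_inv = pow(k, q - 2, q)               # k^-1 (mod q) via Fermat, valid because q is prime
s = (k_inv * (m + a * r)) % q          # s = k^-1 (m + a r) (mod q)
```

The sketch yields alpha = 817, beta = 2335, r = 1 and s = 24, in agreement with the 12 bit column of table 5.4.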
The next waveform, in figure 5.8, shows the completion of the DSA signature block
operation: in the last state of the FSM, FINAL_OUT, it generates alpha, beta and r for
the post_dsa_block. The post_dsa_block computes the operations that produce the final
signatures, r_sign and s_sign, shown in the waveform. r_sign and s_sign are equal to r
and s as shown in table 5.4.
Figure 5.8: Output generated at 1405 ns for 12 bit DSA signature design using 2 ns clock.
Waveform shows the final output operation.
DSA-Signature Design of 1024 bit size:
The simulations of the 1024 bit design include the 160 bit SHA(m) message digest. The
data input for the hardware block is given in table 5.5.
Data   Size in bits   Value (Hexadecimal)
P 1024
0x04000000000000000000000000000000000000cef8000000000000000000000
0000000000000000000000000000000000000000000000400000000000000000
0000000000000000000cef8000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000001
q 160 0x080000000000000000000000000000000000019df
X 864 0x0800000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000800000000000000000000
000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000
a 160 0x060000000000000000000000000000000000019df
k 160 0x070000000000000000000000000000000000019df
SHA(m) 160 0x0c3d2e1f09032547618badcfe427656d1a26e04b8
c 1024 0x0208a789a4585906824f7080000000000a8d29256ffffffffffffffffffffffffffeffffd62af
bf000000000000000000000000000080a7d2371d8385906824f7080000000000a8b8
97ff4dcc753dfc0000000000000000000002a2d7049686e5a2caf7ffffffffffffffffbf779a
4e809b682222bdOf00000000000000d0c 12
Table 5.5: Data input produced by software for 1024 bit hardware block of DSA
Signature.
Table 5.6 shows the output produced by DSA-Signature hardware block. These
results are then stored in "sign_data_file" for DSA verification operation.
Data   Size in bits   Value (Hexadecimal)
$\alpha$ 1024 0x03558df42ac52a3c7775f2000ebe4f9bbe81093a4cc3e8b19ca38d82e77e1a522f9888
1adab295f75e1de9e4b48f24503a81b38af99e55ed2db427c24c8f30cf3b3a4640d45222
0ddc066c0c2155137d38e6c1b4b42d258cbe23e9279eb7350e80a209ab34de7c9779b10
d4a9612313d675622a9d8e475df185e18018f8ff1fa4
$\beta$ 1024 0x01f5e6add9f32975859c942f09152d96a2673a20ec46216a5148fbf37bee5500190368
62295aa8417bddfd7f254cfa87b1e646e6d88b9e80dc34574318bc90126f2165b594513d
e402639477e932ef1c9f5d788f9596c738545978ef37cf772aef8452f367c19487a6513d7
6cfa7e43796810c156b810af3f20c940558ebedd27
r 160 0x048e275882ab7c8a22efa4e19bc298fa4face7de8
s 160 0x0195e82962bc98918e8ac48aafbc981574704f5ae
Table 5.6: Data produced from the DSA Signature block for DSA verification block.
A total of 3164269 clock cycles is required to generate the output given in
table 5.6. The number of clock cycles in the 12, 32, and 1024 bit designs depends upon
various factors. The first is finding g as a primitive root, i.e. a g for which the
condition $\alpha^{q} = 1 \pmod{p}$ is true. The number of clock cycles used to satisfy
this condition varies with the selection of the prime numbers p and q and the even
number x. Consider the best case, in which for g = 2, $\alpha = g^{x} \pmod{p}$ satisfies
the condition $\alpha^{q} = 1 \pmod{p}$. This requires the execution of two sequential
modular exponentiations, i.e. 2[(n+1)(w+2)] clock cycles. The computation of $\beta$ and
r (mod p) requires two parallel modular exponentiations at a computational effort of
(n+1)(w+2) clock cycles. As a result, a total of 3[(n+1)(w+2)] clock cycles is required
for a best case implementation of the DSA-Signature hardware block.
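The best case estimate can be compared with the reported cycle count. The sketch below takes n = w = 1024, which is an assumption: n is defined in an earlier chapter, and this choice is made here because it brings the estimate close to the reported figure.

```python
def best_case_cycles(n, w):
    """3[(n+1)(w+2)]: two sequential exponentiations plus one parallel pair."""
    per_exponentiation = (n + 1) * (w + 2)
    sequential = 2 * per_exponentiation   # alpha, then the alpha^q check
    parallel = per_exponentiation         # beta and r on two units at once
    return sequential + parallel


estimate = best_case_cycles(1024, 1024)   # assumed n = w = 1024
reported = 3164269                        # cycle count reported for table 5.6
difference = reported - estimate
```

Under this assumption the estimate is 3154950 clock cycles, which leaves 9319 cycles of the reported total for data I/O and other overhead.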
5.2.1.2 Synthesis results for DSA Signature block
The design was synthesized for 12, 32 and 1024 bit sizes. Table 5.7 gives the
synthesis results using the Virtex-II Pro family xc2vp100-5ff1696 FPGA. The total number
of slices available on this FPGA was 44096 and the total number of IOBs was 1164.
Size of Design (bits)   Port Size   No. of Slices   Clock Period (ns)   No. of IOBs
12                      2           621             8.2                 26
32                      4           1274            8.7                 26
1024                    32          38675           52.3                296
Table 5.7: Synthesis results taken for 12, 32 and 1024 bit DSA Signature designs.
5.2.1.3 Comparison between 1024 bit hardware and its equivalent software design
The execution of the software design required 1.522 seconds (table 5.2) to perform the
DSA Signature operation for the equivalent arithmetic portions. Consider table 5.8 for
the comparison between the hardware and software implementations of the DSA-Signature
block.
Simulation, HW & SW                        Time
Hardware, using 53 ns clock cycle          3164269 * 53 = 167706257 ns = 0.167 seconds
Software for signature equivalent block    1.522 seconds
Difference                                 1.355 seconds
Table 5.8: Comparison between hardware and software implementation in terms of
speed.
As seen from table 5.8, the software block required 1.355 seconds more time to
perform the operation. This comparison is based upon:
$\text{Speed Improvement} = \dfrac{\text{Time taken to complete operation in Software}}{\text{Time taken to complete operation in Hardware}}$
$\text{Speed Improvement} = \dfrac{1.522}{0.167} = 9.11 \text{ times}$
Therefore, the hardware implementation of the DSA signature block is 9.11 times
faster than software. The speedup attained in hardware is explained by the factors given
below, in terms of the arithmetic operations performed in both hardware and software.
$\beta = \alpha^{a} \pmod{p}$ and $r = \alpha^{k} \pmod{p}$
The execution of $\beta$ and r is sequential in software, which requires 2[(n+1)(w+2)]
clock cycles. On the other hand, parallel execution in hardware requires (n+1)(w+2)
clock cycles, which provides significant speedup when compared to software. The other
factors which slow down the software are the execution of 1024 bit data on a 32 bit
processor and longer interconnects. The next section shows the implementation,
execution and synthesis results of the DSA-Verification block.
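The arithmetic behind table 5.8 and the speed improvement figure can be checked directly. The variable names are illustrative; the inputs are the cycle count, clock period and software time quoted in this section.

```python
cycles = 3164269          # clock cycles for the 1024 bit signature block
clock_ns = 53             # clock period used in table 5.8
sw_seconds = 1.522        # software time for the equivalent block (table 5.2)

hw_ns = cycles * clock_ns             # hardware time in nanoseconds
hw_seconds = hw_ns * 1e-9
speedup_exact = sw_seconds / hw_seconds
speedup_rounded = sw_seconds / 0.167  # using the hardware time rounded as in table 5.8
```

The product is 167706257 ns. Using the hardware time rounded to 0.167 seconds gives the quoted factor of 9.11; without that rounding the ratio is slightly lower, about 9.08.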
5.2.2 Hardware implementation of DSA-Verification Operation
The DSA verification block involves the following set of operations:
1. $u_1 = s^{-1}\,\mathrm{SHA}(m) \pmod{q}$ and $u_2 = s^{-1} r \pmod{q}$
2. $v_1 = \alpha^{u_1} \pmod{p}$
3. $v_2 = \beta^{u_2} \pmod{p}$
4. $v = (v_1 v_2 \bmod p) \pmod{q}$
5. Finally, comparing whether $v = r$: if this is true, the signature is valid,
otherwise it is not.
Among this set of arithmetic operations, $v_1 = \alpha^{u_1} \pmod{p}$ and
$v_2 = \beta^{u_2} \pmod{p}$ are implemented in hardware.
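The verification steps can be traced with the 12 bit values of tables 5.3 and 5.4. As an assumption of this sketch, the inverse of s is computed with Fermat's little theorem (q is prime) in place of the extended Euclidean algorithm, and Python's built-in pow stands in for the two Montgomery exponentiation units.

```python
# 12 bit design values: public data and the signature (r, s) on message m.
p, q = 2543, 41
m, alpha, beta = 13, 817, 2335
r, s = 1, 24

s_inv = pow(s, q - 2, q)        # s^-1 (mod q), valid because q is prime
u1 = (s_inv * m) % q            # step 1
u2 = (s_inv * r) % q
v1 = pow(alpha, u1, p)          # step 2 (hardware unit 1 in the thesis design)
v2 = pow(beta, u2, p)           # step 3 (hardware unit 2 in the thesis design)
v = ((v1 * v2) % p) % q         # step 4
valid = (v == r)                # step 5
```

Here u1 = 33 and u2 = 12, and the final comparison gives v = r = 1, so the signature from table 5.4 is accepted.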
[Figure 5.9 (data flow diagram, test bench). Signature File (Signed Message, Public Key):
p, q, c, SHA(m), $\alpha$, $\beta$, r, s
-> Pre-Computation & File read: $u_1 = s^{-1} m \pmod{q}$, $u_2 = s^{-1} r \pmod{q}$
-> RTL Block: $v = \alpha^{u_1} \beta^{u_2} \pmod{p}$
-> Post-Computation & Result: $v = v \pmod{q}$, $v = r$]
Figure 5.9: Data flow of hardware implementation of DSA verification block.
Figure 5.9 shows the implementation in terms of the behavioral and RTL partitioning
of the DSA verification block. Data inputs are read from the text file "sign_data_file",
which is produced by the DSA-Signature block.
Initially, $u_1 = s^{-1}\,\mathrm{SHA}(m) \pmod{q}$ and $u_2 = s^{-1} r \pmod{q}$ are
computed behaviorally outside of the RTL block. This requires the use of the Extended
Euclidean Algorithm to find the modular multiplicative inverse $s^{-1}$. Once u1 and u2
are computed, they are sent to the RTL block using the parallel data shift operation
described for the DSA-Signature block. In the RTL block,
$v_1 = \alpha^{u_1} \pmod{p}$
and
$v_2 = \beta^{u_2} \pmod{p}$
are computed. Then, v1 and v2 are sent to the post-compute block of DSA-Verification.
In the post-compute block, $v = (v_1 v_2 \bmod p) \pmod{q}$ is computed. Finally, v is
compared with r, which is generated by the DSA-Signature block. If v = r, then the
signature is valid, otherwise it is not. In this case the RTL block requires two
Montgomery exponentiation units to be instantiated and executed in parallel. The
computational effort is (n+1)(w+2) clock cycles. The block diagram of the RTL module is
shown in figure 5.10.
[Figure 5.10 (block diagram): Control Block, In/Out Ports, Register Bank, Montgomery
Modular Exponentiation Unit 1, Montgomery Modular Exponentiation Unit 2.]
Figure 5.10: Block diagram of DSA Verification Block using two Montgomery modular
exponentiation blocks.
In figure 5.10, two Montgomery modular exponentiation blocks are used in
addition to internal registers and a control block. Figure 5.11 shows the port level
detail of the DSA-Verification block.
[Figure 5.11 (port diagram). Signals shown: clock, reset, en_dsa_ver, stop_dsa_ver,
ip_ld_done1, ip_ld_done2, op_ld_done; data ports p, c, u1, u2 (each marked port_size: 1).
Outputs: v1, v2 (each marked port_size: 1), dsa_ver_done.]
Figure 5.11: Port-level detail of DSA verification block.
The explanation of the register bank and the control block in figure 5.10 is given below.
A. Register Bank:
The register bank in figure 5.10 consists of internal registers. The structure for
loading and unloading the data is similar to the registers used in the register bank of
the DSA signature block.
a) p_reg, p_const_reg, alpha_reg, beta_reg:
These registers are identical and their size is equal to that of the prime number p.
The data sets stored are p, c, $\alpha$ and $\beta$. The input to these registers is
stored by shifting small data sets to the left. Consider figure 5.4 for the operation.
b) u1_reg, u2_reg:
These registers are identical and their size is equal to that of the prime number q.
They store data arriving through ports u1 and u2 respectively. The structure of these
registers is the same as shown in figure 5.4.
c) v1_reg, v2_reg:
These registers are identical and their size is equal to that of the prime number p.
The output is first stored in these registers and then generated through the output
ports v1 and v2. The structure of these registers is the same as shown in figure 5.5.
B. Finite State Machine:
The state machine is composed of four states. Two states, one for loading the input
ports and one for the output ports, carry out the data transactions. One state is used to
compute the modular exponentiation operations, and one state is used to clear all the
registers. The design is activated on low reset. Each state is explained below:
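[Figure 5.12 (state diagram): states CLEAR, LOAD, CALC_V and FINAL_OUT. CLEAR holds
while en_dsa_ver = '0' and moves to LOAD when en_dsa_ver = '1'. LOAD holds while
ip_ld_done = '0' and moves to CALC_V when ip_ld_done = '1'. CALC_V holds while
me_done1 = '0' and me_done2 = '0' and moves to FINAL_OUT when me_done1 = '1' and
me_done2 = '1'. FINAL_OUT holds while op_ld_done = '0' and moves to CLEAR when
op_ld_done = '1'.]
Figure 5.12: Finite state machine for DSA verification block.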
a) CLEAR:
The FSM remains in the CLEAR state as long as en_dsa_ver = '0' or stop_dsa_ver = '1'.
Initially, stop_dsa_ver = '0'. The second condition is required at the end of the design,
when stop_dsa_ver is asserted to logic '1' to stop the DSA block and the FSM enters the
CLEAR state. From the CLEAR state, the FSM makes a transition to the LOAD state when
en_dsa_ver = '1'. During this transition, the FSM clears all the registers.
b) LOAD:
In the LOAD state, the FSM loads the input registers while ip_ld_done1 = '0'.
ip_ld_done1 is controlled by the pre-computation block, which also acts as the generator.
Once the input registers are loaded with values, the FSM changes state to CALC_V when
ip_ld_done1 = '1'.
c) CALC_V:
In this state, the FSM sets the values for the two Montgomery exponentiation blocks, and
remains in the CALC_V state while me_done1 and me_done2 are low. The FSM makes a
transition to the FINAL_OUT state when both me_done1 and me_done2 are asserted to logic
'1'. During this transition, the outputs of the Montgomery modular exponentiation are
generated as v1 and v2.
d) FINAL_OUT:
In this state, the data is shifted out using the output shift registers. The FSM remains
in this state while op_ld_done = '0'. Once op_ld_done = '1', the FSM makes a transition to
the CLEAR state. At this time, the signal stop_dsa_ver is asserted to '1', and as a result
the FSM remains in the CLEAR state without making any transition to the LOAD state.
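The four states and their transitions can be summarised as a small software model. This
is a behavioral sketch in Python, not the RTL of the design: the function and variable
names are illustrative, a single ip_ld_done flag stands in for the load-done signals, and
the CLEAR state is assumed to be left only when en_dsa_ver = '1' and stop_dsa_ver = '0'.

```python
from enum import Enum


class State(Enum):
    CLEAR = 0
    LOAD = 1
    CALC_V = 2
    FINAL_OUT = 3


def next_state(state, en_dsa_ver, stop_dsa_ver, ip_ld_done, me_done1, me_done2, op_ld_done):
    """Return the next FSM state for one clock edge (assumed transition conditions)."""
    if state is State.CLEAR:
        return State.LOAD if (en_dsa_ver and not stop_dsa_ver) else State.CLEAR
    if state is State.LOAD:
        return State.CALC_V if ip_ld_done else State.LOAD
    if state is State.CALC_V:
        # Both exponentiation blocks must report completion.
        return State.FINAL_OUT if (me_done1 and me_done2) else State.CALC_V
    return State.CLEAR if op_ld_done else State.FINAL_OUT


def run(trace):
    """Apply a list of signal dictionaries and return the list of visited states."""
    state = State.CLEAR
    visited = [state]
    for signals in trace:
        state = next_state(state, **signals)
        visited.append(state)
    return visited


idle = dict(en_dsa_ver=0, stop_dsa_ver=0, ip_ld_done=0, me_done1=0, me_done2=0, op_ld_done=0)
trace = [
    idle,                                                     # not enabled yet
    {**idle, "en_dsa_ver": 1},                                # start loading
    {**idle, "en_dsa_ver": 1},                                # still loading
    {**idle, "en_dsa_ver": 1, "ip_ld_done": 1},               # inputs loaded
    {**idle, "en_dsa_ver": 1, "me_done1": 1},                 # only one block done
    {**idle, "en_dsa_ver": 1, "me_done1": 1, "me_done2": 1},  # both blocks done
    {**idle, "en_dsa_ver": 1, "op_ld_done": 1},               # outputs unloaded
    {**idle, "en_dsa_ver": 1, "stop_dsa_ver": 1},             # held in CLEAR
]
visited = run(trace)
print([s.name for s in visited])
```

The last step of the trace shows the effect of stop_dsa_ver: the model stays in CLEAR
even though en_dsa_ver is still high.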
5.2.2.1 Simulation results for DSA verification block
The DSA-Verification block is executed for design sizes of 12, 32 and 1024 bits. The
160 bit message digest SHA(m) is used only for the 1024 bit design, while a random
message m is used for the 12 and 32 bit designs.
DSA-Verification designs of 12 and 32 bit sizes:
Consider table 5.9 for the inputs to these designs. These inputs are the same as those
produced by the DSA signature block.
Data   Size in bits   12 bit design   Size in bits   32 bit design
q      6              41              12             2081
m      6              13              12             3277
p      12             2543            32             129023
c      12             1126            32             12080
α      12             817             32             96956
β      12             2335            32             47605
r      6              1               12             1092
s      6              24              12             676
Table 5.9: Inputs to DSA verification block.
Figure 5.13 shows the waveform simulations for the 12 bit DSA verification design. It
displays the initial load operation of the internal registers. The data inputs are loaded
in a total of 6 clock cycles when ip_ld_done = '1'. The input ports shown are p, c, alpha,
beta, u1 and u2. u1 and u2 are computed in the behavioral VHDL model prior to loading
data to the RTL block.
Figure 5.13: Waveform simulations for the load operation of DSA verification block.
Consider table 5.10 for the outputs generated from the pre-computation, RTL and
post-computation blocks.
Block              Data   Size in bits   12 bit design   Size in bits   32 bit design
Pre-computation    u1     6              33              12             482
Pre-computation    u2     6              12              12             802
RTL                v1     12             1600            32             111437
RTL                v2     12             333             32             55018
Post-computation   v      6              1               12             1092
DSA signature      r      6              1               12             1092
Table 5.10: Outputs from DSA verification block. r is added to this table for comparison
with v.
Figure 5.14 displays the results of the 12 bit design obtained in table 5.10.
Figure 5.14: Waveform simulations for the final output of 12 bit DSA verification block.
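The 12 bit values in tables 5.9 and 5.10 can be checked with a few lines of software. The
sketch below follows the standard DSA verification equations and uses Python's built-in
modular exponentiation in place of the Montgomery blocks, so it illustrates the
arithmetic only and not the hardware data path. The function name is hypothetical, and
the message m is used directly, as in the 12 bit design.

```python
def dsa_verify_values(p, q, alpha, beta, m, r, s):
    """Return (u1, u2, v1, v2, v) for textbook DSA verification."""
    w = pow(s, -1, q)          # s^-1 mod q (needs Python 3.8 or later)
    u1 = (m * w) % q           # pre-computation
    u2 = (r * w) % q
    v1 = pow(alpha, u1, p)     # the two modular exponentiations done in the RTL block
    v2 = pow(beta, u2, p)
    v = ((v1 * v2) % p) % q    # post-computation
    return u1, u2, v1, v2, v


# 12 bit design inputs from table 5.9.
u1, u2, v1, v2, v = dsa_verify_values(p=2543, q=41, alpha=817, beta=2335, m=13, r=1, s=24)
print(u1, u2, v1, v2, v)  # 33 12 1600 333 1
```

The signature is accepted because v equals r, which matches the last two rows of
table 5.10.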
DSA-Verification design of 1024 bit size:
Consider table 5.11 for the inputs given to the DSA verification block.
Data   Value (hexadecimal)
q      0x080000000000000000000000000000000000019df
m      0x0c3d2e1f09032547618badcfe427656d1a26e04b8
p
0x04000000000000000000000000000000000000cef8000000000000000000000000000000000000000
00000000000000000000000000004000000000000000000000000000000000000cef800000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000001
c
0x0208a789a4585906824f7080000000000a8d29256f1Fffffffffffffffffffffffffeffffd62afbf0000000000000
00000000000000080a7d2371d8385906824f7080000000000a8b897ff4dcc753dfc00000000000000000
00002a2d7049686e5a2caf7ffffffffffffffffbf779a4e809b682222bd0f00000000000000d0c12
α
0x03558df42ac52a3c7775f2000ebe4f9bbe81093a4cc3e8b19ca38d82e77e1a522f98881adab295f75e1de
9e4b48f24503a81b38af99e55ed2db427c24c8f30cf3b3a4640d452220ddc066c0c2155137d38e6c1b4b42
d258cbe23e9279eb7350e80a209ab34de7c9779b10d4a9612313d675622a9d8e475df185e18018f8ff1fa4
β
0x01f5e6add9f32975859c942f09152d96a2673a20ec46216a5148fb07bee550019036862295aa8417bddf
d7f254cfa87b1e646e6d88b9e80dc34574318bc90126f2165b594513de402639477e932ef1c9f5d788f9596
c738545978eB7cf772aef8452f367c19487a6513d76cfa7e43796810c156b810af3f20c940558ebedd27
r      0x048e275882ab7c8a22efa4e19bc298fa4face7de8
s      0x0195e82962bc98918e8ac48aafbc981574704f5ae
Table 5.11: Input values for DSA verification block.
Consider table 5.12 for the outputs generated from the pre-computation block, the RTL
block and the post-computation block.
Block              Data   Value (hexadecimal)
Pre-computation    u1     0x0505e9985bcfc4c8c420d4bbda5803c02839a695c
Pre-computation    u2     0x076f9691523cb6f14e76a53ad52807ba767c82f7f
RTL                v1
0x01346abe0d317f1a8414b9788ad547544c0356f7ead29fbaa7b981a2a8a4b628c121020
693d9b5b56231e1229b0afbd841ca1f86f446ddcd099187cf708e8488d7b030bfc396e4a5
8ac6a9a678b09a607faa869c98542ac36ef1a6b81b40c7eb88e02cb51c5e8c3ad744271d63
3889c6b4dd4dbfcdd12c29d7dbaf9095ae859c7
RTL                v2
0x01ca27e9e400aafa23a537891a431462ab9f4b85a680c5dd3cc719596662007ab97f80e1
5686cbe1089faea0bf61857ec096791d20134d23dfe6651b91319aa3c4072a2584f520a366
dad2e5346474b14b27e30530068b1be0be027c21c4bb75311491949662087d21916d4987
795685dd65d9f9dcc1760064abd0800511ee445
Post-computation   v      0x048e275882ab7c8a22efa4e19bc298fa4face7de8
DSA signature      r      0x048e275882ab7c8a22efa4e19bc298fa4face7de8
Table 5.12: Output values from DSA verification block. r is given here for comparison
with v.
A total of 1054804 clock cycles is required to complete the 1024 bit DSA-Verification
operation in hardware. For a 1024 bit Montgomery modular exponentiation, the number of
clock cycles required is (n+1)(w+2), where n = w+2 and w = 1024. So 1053702 clock cycles
are consumed by the modular exponentiation block, and 1102 clock cycles are required for
loading the data at the input and output ports plus the internal state transitions.
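The cycle budget quoted above follows directly from the formula. The short Python check
below is only a worked calculation; the function name me_cycles is illustrative.

```python
def me_cycles(w):
    """Clock cycles for one w-bit Montgomery modular exponentiation: (n+1)(w+2), n = w+2."""
    n = w + 2
    return (n + 1) * (w + 2)


TOTAL_CYCLES = 1054804                 # measured total for the 1024 bit verification
exp_cycles = me_cycles(1024)           # cycles spent in the exponentiation block
overhead = TOTAL_CYCLES - exp_cycles   # port loading, unloading and state transitions
print(exp_cycles, overhead)            # 1053702 1102
```

The exponentiation accounts for almost all of the cycles; loading and unloading the ports
is about 0.1% of the total.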
5.2.2.2 Synthesis results for DSA verification block
The RTL block of DSA-Verification is synthesized for 12, 32, and 1024 bits. Table 5.13
shows the synthesis results using the Virtex-II Pro family xc2vp100-5ff1696 FPGA.
Size of design (bits)   Port size   No. of slices   Clock period (ns)   No. of IOBs
12                      2           555             8.2                 23
32                      4           1160            8.7                 23
1024                    32          34174           52.3                263
Table 5.13: Synthesis results taken for 12, 32 and 1024 bit DSA verification blocks.
5.2.2.3 Comparison between 1024 bit hardware and its equivalent software design
The software execution of DSA-Verification took 0.531 seconds to complete the operation.
Consider table 5.14 for the comparison of the time taken to complete the DSA verification
block implemented in hardware as well as in software.
Simulation, HW & SW                           Time
Hardware, using 53 ns clock cycle             1054804 x 53 ns = 55904612 ns = 0.055 seconds
Software for verification equivalent block    0.531 seconds
Difference                                    0.476 seconds
Table 5.14: Comparison between hardware and software implementation in terms of speed.
As seen from table 5.14, the software block required 0.476 seconds more time to perform
the operation. This comparison is based upon:
$$\text{Speed improvement} = \frac{\text{Time taken to complete operation in software}}{\text{Time taken to complete operation in hardware}}$$
$$\text{Speed improvement} = \frac{0.531}{0.055} = 9.65 \text{ times}$$
Therefore, the hardware implementation of the DSA verification block is 9.65 times
faster than software. The speedup attained in hardware is due to several factors, given
below in terms of the arithmetic operations performed in both hardware and software.
$$v_1 = \alpha^{u_1} \bmod p \quad \text{and} \quad v_2 = \beta^{u_2} \bmod p$$
The execution of v1 and v2 is sequential in software and requires 2[(n+1)(w+2)] clock
cycles. On the other hand, parallel execution in hardware requires (n+1)(w+2) clock
cycles, which provides a speedup when compared to software. The other factors that slow
down the software are the execution of 1024 bit data on a 32 bit processor and longer
interconnects.
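The figures in this comparison can be reproduced with a short calculation. The Python
fragment below is a worked check and nothing more; the variable names are illustrative.
It also shows that the quoted factor of 9.65 depends on taking the hardware time as
0.055 seconds: with the unrounded product of cycles and clock period, the ratio comes
out near 9.5.

```python
CLOCK_NS = 53              # clock period used for the hardware estimate
TOTAL_CYCLES = 1054804     # cycles for the 1024 bit verification
SW_SECONDS = 0.531         # measured software time

hw_ns = TOTAL_CYCLES * CLOCK_NS        # hardware time in nanoseconds
hw_seconds = hw_ns * 1e-9

speedup_quoted = SW_SECONDS / 0.055    # ratio with the hardware time taken as 0.055 s
speedup_unrounded = SW_SECONDS / hw_seconds

# Sequential versus parallel evaluation of v1 and v2, in exponentiation cycles.
w = 1024
one_exponentiation = (w + 3) * (w + 2)     # (n+1)(w+2) with n = w+2
sequential = 2 * one_exponentiation        # v1 then v2
parallel = one_exponentiation              # v1 and v2 side by side

print(hw_ns, round(speedup_quoted, 2), round(speedup_unrounded, 2))
print(sequential, parallel)
```

Running the two exponentiations in parallel accounts for a factor of 2 in cycle count;
the remainder of the measured speedup comes from the other factors listed above.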
5.3 Summary
The implementation of the Digital Signature Algorithm in hardware and software has been
given for speed comparison. The execution of the multi-precision DSA for 1024 bits is
very slow in software as compared to hardware. This is due to two major reasons. First,
the general purpose processor processes 32 bit data for the 1024 bit DSA operation, which
requires serial arithmetic logic unit operations. Second, the parallel portions of the
algorithms are executed sequentially. In hardware, on the other hand, these two problems
are resolved by having 1024 bit parallel data operations as well as parallelism for non
data dependent blocks.
Chapter 6
Conclusions and future work
In this thesis, multi-precision modular arithmetic was implemented because of its
importance for public key cryptosystems. The Montgomery modular multiplication algorithm
was implemented in two different hardware designs and, upon comparison of the
architectures and performance, the faster design was selected for further work. This
faster design was then compared with a 1024 bit software based modular multiplication,
which resulted in low performance caused by the hardware block. An analysis of
left-to-right and right-to-left Montgomery modular exponentiation was given and, to
achieve more speed in hardware, the right-to-left Montgomery exponentiation algorithm was
chosen for the hardware implementation. It was implemented using the fast architecture of
Montgomery modular multiplication. The modular exponentiation block was designed as an
internal part of any public key cryptosystem, and the Digital Signature Algorithm was
selected to implement this block. The current standard 1024 bit DSA design was
implemented. It contained two design units, the DSA-Signature unit and the
DSA-Verification unit. These designs targeted only those portions of DSA where the speed
bottlenecks were present. Thus a combination of RTL and behavioral designs was
implemented. A software version of DSA was also implemented using SystemC, and this
implementation was at the algorithmic level. The CPU time for the software implemented
DSA was noted and compared with the simulation time of the hardware DSA. Only those
portions of the software implementation that were also implemented in hardware were
subjected to timing analysis. It was shown that the DSA signature unit was 9.11 times
faster and the DSA verification unit was 9.65 times faster than the software units.
The objective of this thesis research was to recognize the speed bottlenecks in the
software implementation of public key cryptosystems and to remove them by hardware
implementation. The hardware was approximately 10 times faster than the software. This
speedup can be further improved by considering some of the following factors.
a) Selection of faster technology libraries to implement Montgomery modular
exponentiation, i.e. Standard cell ASIC.
b) Introducing further partitioning and pipelining to increase the throughput.
c) A modification in the right-to-left Montgomery modular exponentiation block can
save some clock cycles. Consider the following lines of pseudo code of the algorithm for
right-to-left Montgomery modular exponentiation.
4. for i = 0 to w-1 loop
4.1 if (E(i) = 1) then
4.1.1 H = Monpro (H, P, M) Multiply
4.2 P = Monpro (P, P, M) Square
5. End for.
6. H = Monpro (1, H, M)
This loop runs w times, where w is the number of bits in M. The purpose of the loop is to
perform the multiply and square operations. This block remains effective until the last
E(i) = '1' has been processed. After that, it is not effective for the remaining bits of
E up to the MSB, which are all '0'. Consider a small example,
P = 1001
M = 1011
E = 0011 = {E(3), E(2), E(1), E(0)}.
In this case, the block updates H at line 4.1.1 only for E(0) and E(1). After that, for
E(2) and E(3), the block does not do any useful work, because the only data needed from
the block at line 6 is H. This wastes two extra loop iterations. A suitable modification
of this algorithm may remove this problem.
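One possible form of such a modification is to stop the loop at the most significant '1'
of E. The Python sketch below illustrates the idea on the small example above. It is not
the hardware design: the function names and the trim flag are hypothetical, R = 2^w with
a final conditional subtraction is assumed for the Montgomery product, and the
initialisation of H and P (the steps before line 4, which are not listed above) is
assumed to convert 1 and the base into Montgomery form.

```python
def monpro(a, b, M, w, M_prime):
    """Montgomery product a*b*R^-1 mod M with R = 2**w (M odd, a and b below M)."""
    mask = (1 << w) - 1
    t = a * b
    m = ((t & mask) * M_prime) & mask
    u = (t + m * M) >> w
    return u - M if u >= M else u


def mon_exp_right_to_left(base, E, M, w, trim=False):
    """Right-to-left Montgomery exponentiation; returns (result, loop iterations)."""
    R = 1 << w
    M_prime = (-pow(M, -1, R)) % R       # -M^-1 mod R (needs Python 3.8 or later)
    H = R % M                            # Montgomery form of 1 (assumed initialisation)
    P = (base * R) % M                   # Montgomery form of the base
    iterations = E.bit_length() if trim else w
    for i in range(iterations):
        if (E >> i) & 1:
            H = monpro(H, P, M, w, M_prime)   # line 4.1.1, multiply
        P = monpro(P, P, M, w, M_prime)       # line 4.2, square
    return monpro(1, H, M, w, M_prime), iterations   # line 6


full = mon_exp_right_to_left(0b1001, 0b0011, 0b1011, w=4)
trimmed = mon_exp_right_to_left(0b1001, 0b0011, 0b1011, w=4, trim=True)
print(full, trimmed)  # (3, 4) (3, 2)
```

Both variants return 9^3 mod 11 = 3, but the trimmed loop runs two iterations instead of
four. In hardware, the position of the most significant '1' would have to be detected
while E is loaded, which is a design cost not modelled here.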
d) Besides DSA, the Montgomery modular exponentiation block can also be implemented for
RSA and the Diffie-Hellman key exchange scheme. This will provide further speed
comparisons with software based implementations. A modified version of the Montgomery
modular exponentiation algorithm can be used for Elliptic Curve cryptosystems.
93
References
1 . Cryptography - [Online]
Available: http://www.oft.state.nv.us/esra/Guidelines files/ESRAGuidelines5.htm
2. Stream ciphers - [Online]
Available: http://vvvvw.ssh.com/suppoiiycrvptoCTaphv/algorithms/svmmetric.html
3. Wade Trappe, and Lawrence C. Washington, "Introduction to Cryptography with
coding theory", PrenticeHall, 2002.
4. Manindra Agarwal, Nitin Saxena and Neeraj Kayal. "Primes is in P", Preprint, Aug. 6,
2002.
Available: http://www.cse.iitk.ac.in/primality.pdf
5. Peter L. Montgomery, "Modular Multiplication without Trial Division", published in
Mathematics ofComputation, Volume 44. Number 170, April 1985. Pages 519-521.
6. Alan Daly, Willian Marnane, "Efficient Architectures for implementing Montgomery
Modular Multiplication and RSA Modular Exponentiation on Reconfigurable Logic",
International Symposium on Field Programmable Gate Arrays, February 24-26, 2002,
Monterey, California, USA.
7. Secure Hash Standard, SHA-1, http://www.itl.nist.gov/fipspubs/fipl 80-1 .htm
8. Thomas Blum and Christof Paar, "Montgomery modular exponentiation on
reconfigurable hardware", Nth IEEE Symposium on Computer Arithmetic (ARITH-14)
April 14-16, Adelaide, Australia.
9. Nadia Nedjah and Luiza de Macedo Mourelle, "Reconfigurable Hardware
Implementation of Montgomery Modular Multiplication and Parallel Binary
94
Exponentiation", in proceedings of Euromicro Symposium on Digital System Design
(DSD'02) 0-7695-1790-0/02, 2002 IEEE.
10. Daniel M. Gordon, "A survey of fast exponentiation methods", in Journal of
Algorithms 27, 129-146(1998), Article No. AL970913.
1 1 . Douglas Stinson , "Cryptography, theory and practice", Chapman & Hall/CRC; 2nd
edition, 2002.
12. Donald Ervin Knuth, "The Art of Computer Programming", Volume 2: Semi-
numerical Algorithms. Reading, Massachusetts: Addison-Wesley, 2nd edition, 1981.
13. S. B. Ors, L. Batina, B. Preneel, J. Vandewalle, "Hardware Implementation of a
Montgomery Modular Multiplier in a Systolic Array", Parallel and Distributed
Processing Symposium, 2003. Proceedings International, 22-26 April 2003 Page(s):8 pp.
14. Taher Elgamal, "A public key cryptosystem and a signature scheme based on discrete
logarithms", Information Theory, IEEE Transactions on Volume 31, Issue 4, July 1985
Page(s):469 - 472.
15. Eun-Jun Yoon, Eun-Kyung Ryu, and Kee-Young Yoo, "Efficient Remote User
Authentication Scheme based Generalized ElGamal Signature Scheme", IEEE
Transactions on Consumer Electronics, Vol. 50, No. 2, MAY 2004.
16. Harn, L.; Xu, Y. "Design of generalised ElGamal type digital signature schemes
based on discrete logarithm", Electronics Letters Volume 30, Issue 24, 24 Nov. 1994
Page(s):2025 - 2026
17. Digital Signature Standard - [Online]
Available: http://www.itl.nist.gov/fipspubs/fipl86.htm.
95
18. P. Kitsos, N. Sklavos and O. Koufopavlou, "An efficient implementation of the
digital Signature algorithm", 9th International Conference on Electronics, Circuits and
Systems, 2002. Volume 3, 15-18 Sept. 2002 Page(s):l 151 - 1 154 vol.3.
19. G. Joseph.; W.T. Penzhorn, , "High-speed algorithms for public-key cryptosystems",
AFRICON, 2004. 7th AFRICON Conference in Africa Volume 2, 15-17 Sept. 2004
Page(s):945 - 951 Vol.2
