Private and Public-Key Side-Channel Threats Against Hardware Accelerated Cryptosystems by Lalonde, Dylan Roderick
University of Windsor 
Scholarship at UWindsor 
Electronic Theses and Dissertations Theses, Dissertations, and Major Papers 
2017 
Recommended Citation 
Lalonde, Dylan Roderick, "Private and Public-Key Side-Channel Threats Against Hardware Accelerated 
Cryptosystems" (2017). Electronic Theses and Dissertations. 5995. 
https://scholar.uwindsor.ca/etd/5995 
This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor 
students from 1954 forward. These documents are made available for personal study and research purposes only, 
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution, 
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder 
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would 
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or 
thesis from this database. For additional inquiries, please contact the repository administrator via email 
(scholarship@uwindsor.ca) or by telephone at 519-253-3000ext. 3208. 
Private and Public-Key Side-Channel
Threats Against Hardware Accelerated
Cryptosystems
by
Dylan Roderick Lalonde
A Thesis
Submitted to the Faculty of Graduate Studies through the
Department of Electrical and Computer Engineering in Partial
Fulfillment of the Requirements for the Degree of Master
of Applied Science at the University of Windsor
Windsor, Ontario, Canada
2017
© 2017 D. R. Lalonde
All Rights Reserved. No part of this document may be reproduced, stored
or otherwise retained in a retrieval system or transmitted in any form, on any
medium by any means without prior written permission of the author.
Private and Public-Key Side-Channel Threats Against Hardware Accelerated
Cryptosystems
by
Dylan R. Lalonde
APPROVED BY:
A. Jaekel, External Reader
School of Computer Science
K. Tepe, Departmental Reader
Electrical and Computer Engineering
H. Wu, Co-Advisor
Electrical and Computer Engineering
M. Mirhassani, Advisor
Electrical and Computer Engineering
April 19, 2017
Declaration of Originality
I hereby certify that I am the sole author of this thesis and that no part of this
thesis has been published or submitted for publication.
I declare that, to the best of my knowledge, my thesis does not infringe upon
anyone’s copyright nor violate any proprietary rights and that any ideas, techniques,
quotations, or any other material from the work of other people included in my
thesis, published or otherwise, are fully acknowledged in accordance with the
standard referencing practices. Furthermore, to the extent that I have included
copyrighted material that surpasses the bounds of fair dealing within the meaning of
the Canada Copyright Act, I certify that I have obtained a written permission
from the copyright owner(s) to include such material(s) in my thesis.
I declare that this is a true copy of my thesis, including any final revisions as
approved by my thesis committee and the Graduate Studies office, and that this
thesis has not been submitted for a higher degree to any other institution.
Abstract
Modern side-channel attacks (SCAs) can reveal sensitive data from unprotected
hardware implementations of cryptographic accelerators, whether they are
private-key or public-key systems. These include, but are not limited to,
symmetric private-key encryption using AES-128, AES-192, or AES-256, and
public-key cryptosystems using elliptic curve cryptography (ECC). Traditionally,
scalar point (SP) operations are pushed to be high-speed at any cost in order to
reduce point multiplication latency. The majority of high-speed architectures for
contemporary elliptic curve protocols rely on non-secure SP algorithms.
This thesis delivers a novel design, analysis, and successful results of a custom
differential power analysis attack on AES-128. The resulting SCA can break
any 16-byte master key the cipher uses, and its direct applications to public-key
cryptosystems will become clear. Further, the architecture of an SCA-resistant
scalar point algorithm, accompanied by an implementation of an optimized serial
multiplier, is constructed.
The optimized hardware design of the multiplier is highly modular and can use
either of the NIST-approved 233-bit and 283-bit Koblitz curves in a polynomial
basis. The proposed architecture will be implemented on a Kintex-7 FPGA and
later integrated with the ARM Cortex-A9 processor on the Zynq-7000 AP SoC
(XC7Z045) for seamless data transfer and analysis of the vulnerabilities that SCAs
can exploit.
In loving memory of Brianne & Dad
Acknowledgments
I wish to express my most sincere gratitude to my advisor, Dr. Mitra Mirhassani,
for her compassion and motivational spirit, which truly inspired me to complete my
thesis. Her knowledge, time, and support made this work a great success. Mitra’s
work ethic and dedication in our joint research over the last two years shaped the
engineer I am today.
A genuine thank you goes to my co-advisor, Dr. Huepang Wu, for his great
knowledge and critical feedback on the mathematics of this project. I would also
like to give special thanks to Dr. Roberto Muscedere for his judgement and
insight on the hardware design within the project.
I’d also like to thank my committee members, Dr. Arunita Jaekel and Dr.
Kemel Tepe, for their feedback and advice on my work.
In addition to my loving family, I would also like to thank my good friends
Philip Korta and George Kyrtsakas, along with my colleagues in the ECE
department at the University of Windsor, for their continual support, advice, and
great times during my entire post-secondary education.
Finally, to my mother, I am perpetually grateful. With your unconditional love
and everlasting life lessons, you taught me that knowledge is surely a beautiful
thing.
Table of Contents
Declaration of Originality
Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
Nomenclature
1 Introduction
1.1 Motivation
1.1.1 Transition of Software to Hardware Cryptography
1.1.2 Side-Channel Attacks
1.2 Objective
1.2.1 Solution
1.3 Organization of Thesis
2 Mathematical Preliminaries
2.1 Number Theory
2.2 Algebra
2.2.1 Group Law
2.2.2 Finite Fields
2.3 Levels of Security Within Public & Private Key Systems
2.3.1 Binary Field
2.3.2 Polynomial Basis
2.3.3 Provable Security
2.4 Elliptic Curves over $GF(2^m)$
2.4.1 Marginal Note on EC Discrete Logarithm Problem
2.4.2 Curves and EC Group Theory
2.4.3 Point Operations
2.4.4 ECC Overview and Vulnerability Insight
2.5 Statistical Analysis for Side-Channel Attacks
3 Side-Channel Attacks Against Hardware
3.1 Malicious Actions Against Hardware
3.1.1 Timing & Safe-Error Attacks
3.1.2 Zero Point Attacks
3.1.3 Differential Power Analysis
3.2 Novelty
3.3 Executing DPA on AES-128
3.3.1 Experimental Setup
3.3.2 Framework Differential Power Analysis
3.3.3 Generating Hypothetical Keys
3.3.4 Inverse Add-Round Key & Shift Row
3.3.5 Inverse S-Box
3.3.6 Hamming Distance
3.3.7 Correlation Coefficient
3.3.8 Inverse Sessional Round Key
3.3.9 Results & Analysis of Vulnerabilities
3.3.10 Hardware Solutions Against DPA
3.4 Related Applications
4 Secure High-Level Architecture
4.1 Previous Research & Review
4.2 Secure Scalar Point Multiplication
4.2.1 Montgomery’s Algorithm
4.2.2 Joye’s Algorithm with Hardware Design
4.2.3 Comparison
4.3 Optimized Point Operations
4.3.1 Point Double with Datapath Schematic
4.3.2 Point Addition with Datapath Schematic
4.3.3 Review of Hardware
4.4 Multiplicative Inverse
4.4.1 Binary Extended Euclidean Algorithm
4.5 High-Level Summary
5 Low-Level Multiplier Implementation
5.1 Finite Field Multiplier
5.1.1 FIFO for Large Keys
5.1.2 Parallel Multiplication and Squaring
5.1.3 Montgomery Multiplication and Reduction
5.2 Summary of the Connected System
5.2.1 Multiplier Comparison
5.2.2 Overview of Architecture
6 Conclusions
6.1 Summary of Contributions
6.2 Future Work
6.2.1 Hardware Design
6.2.2 Software-Hardware Integration Against SCAs
6.2.3 Masking to Prevent CPA Attacks
Appendices
A DPA Data & Results
A.1 Power Trace to be Attacked
A.2 16-Byte Key Results
B Matlab Script DPA
C C Scripts - Verilog Script Generation
C.1 Parallel Multiplier
C.2 Parallel Reduction
C.3 Parallel Polynomial Multiplier
D Verilog HDL Scripts
D.1 Parallel Polynomial Squarer
D.2 Serial Montgomery Multiplier
D.3 32-bit FIFO
D.4 Serialized Montgomery Multiplier Comparison
E Verilog HDL Pseudo Scripts
E.1 Binary Extended Euclidean Inversion
E.2 Point Double
E.3 Point Addition
Bibliography
Vita Auctoris
List of Tables
2.1 Achieving Standard Security - Keys
4.1 Keynote ECC Processors in Literature
4.2 Hardware Costs of Point Addition
4.3 Comparison of SPM SCA Protection
5.1 Parallel vs. Serial Finite Field Multiplier
5.2 Post Synthesis Multiplier Results on Kintex-7
List of Figures
2.1 Koblitz Elliptic Curve - $E_K: y^2 + xy = x^3 + x^2 + 1$
2.2 Koblitz Elliptic Curve - Point Operations
2.3 ECC $GF(2^m)$ Hierarchy of Operations
3.1 Block Diagram of AES-128 [38]
3.2 SASEBO-GIII
3.3 Block Diagram of DPA Against AES-128
3.4 10 Rounds of Inverse Session Key
3.5 15,000 Traces Max Correlation Vector for Byte 4
3.6 Vulnerable Samples within a Power Trace
4.1 Joye’s SPM Hardware Block Diagram
4.2 LD - Point Double Datapath Schematic
4.3 LD - Point Addition Datapath Schematic
5.1 32-bit FIFO Schematic
5.2 Parallel m-bit Multiplier [40]
5.3 233-bit Montgomery Multiplier RTL Schematic
5.4 233-bit Montgomery Multiplier RTL Datapath
Nomenclature
AES Advanced Encryption Standard
ALU Arithmetic Logic Unit
ASIC Application Specific Integrated Circuit
CPA Correlation Power Analysis
CPU Central Processing Unit
DoD Department of Defense
DPA Differential Power Analysis
EC Elliptic Curve
ECC Elliptic Curve Cryptography
ECDH Elliptic Curve Diffie-Hellman
ECDLP Elliptic Curve Discrete Logarithm Problem
ECDSA Elliptic Curve Digital Signature Algorithm
EEA Extended Euclidean Algorithm
EEPROM Electrically Erasable Programmable Read-only Memory
EM Electromagnetism
FIFO First in First out
FPGA Field Programmable Gate Array
FSM Finite State Machine
GCD Greatest Common Divisor
GF Galois Field
HDL Hardware Description Language
HTTPS Hypertext Transfer Protocol Secure
IC Integrated Circuit
ISR Inverse Shift Row
JTAG Joint Test Action Group
L2R Left-to-Right
LD López-Dahab
LUT Look-up Table
MCC Micro-programmable Controller
NAF Non-Adjacent Form
NIST National Institute of Standards and Technology
RSA R. Rivest, A. Shamir, L. Adleman Cryptosystem
SCA Side-Channel Attack
SISO Serial in Serial out
SoC System-on-Chip
SPA Simple Power Analysis
SPM Scalar Point Multiplication
USB Universal Serial Bus
V2X Vehicle-to-Everything
XOR Exclusive-OR
ZVP Zero-Value Point
Chapter 1
Introduction
Due to the ever-rising amount of private information transmitted from one
source to another over communication networks, maintaining security places a
tremendous burden on internal processing capabilities. Unsatisfactory software
performance has led to the reign of hardware accelerators in applied cryptography.
This speed comes with a significant risk: modern attacks against hardware can
reveal sensitive information.
1.1 Motivation
The current technological decade demands that digital access be swift and
reliable at any cost. People drive the ever-increasing traffic of the internet with a
wide range of activities: businesses needing real-time updates, banks performing
secure transactions, and consumers using password-protected social media
accounts. In the late 1990s, people had minimal access to the internet through an
expensive personal computer, but today in 2017, most of the 46% of the world’s
population who are connected to the internet [48] have more than one device
connected to the web. In addition, within the last few years the automotive
industry has begun to tap into the internet through vehicle-to-everything (V2X)
technologies with telematics systems. All of these uses of the connected world
need a reliable, secure end-to-end connection when handling important information.
1.1.1 Transition of Software to Hardware Cryptography
Traditionally, computers would run dedicated software algorithms to authenticate
data, maintain its confidentiality, and/or preserve its integrity. Under the pressure
of throughput constraints placed on the computer’s central processing unit (CPU),
the physical capabilities of the CPU limited the achievable speed. To meet these
demanding objectives, cryptographic algorithms with large bit sizes needed to be
implemented in hardware to reach benchmarks that could never be obtained in
software. Presently, the most efficient security systems compute the
encryption/decryption or processing in hardware while delegating all input/output
data transmission and analysis to software; these are the modern integrated
platforms engineers see today. Hardware computes public-key cryptosystems with
ease, enabling users to perform private key exchange and establishment,
authentication, and, more importantly, to preserve their privacy.
Although public-key cryptography offers many benefits, the algorithms used until
1985 were inefficient in hardware. The new concept of Elliptic Curve Cryptography
(ECC) offered the same level of symmetric security [34] while requiring smaller
key sizes, less memory, and lower power consumption. Naturally, the industry
then directed its interest to ECC for most large high-speed, high-security
applications.
Hardware accelerators are in high demand due to the custom, high-speed
throughput they can provide. Although these sophisticated circuits reach a
performance that software cannot match, they possess physical characteristics
that fingerprint the algorithms they execute. Fraudulent acts on these circuits can
compromise an extensive amount of valuable data.
1.1.2 Side-Channel Attacks
Side-Channel Attacks (SCAs) are attacks that acquire sensitive information
from hardware implementations of cryptosystems. This information leaks through
any side channel of the circuit as it encrypts plaintext, decrypts ciphertext, or
alters key states [17]. A successful SCA can reveal the architecture of the
integrated circuit (IC) and intermediate keys within the cryptosystem and, more
alarmingly, can compromise the master key and thereby recover all sensitive input
data. As opposed to traditional brute force, SCAs exploit the hardware’s physical
nature through timing, fault injection, power, and/or electromagnetic (EM)
radiation analysis to acquire secret information in a fraction of the time.
Properly protecting an IC against SCAs requires a broad background across
different domains of embedded security. The central knowledge required includes
hardware design (feasibility, functionality, and constraints), an aptitude for
cryptographic algorithms, and the ability to perform a successful SCA.
1.2 Objective
The main objective of this thesis is to create a complete platform for developing
and testing side-channel attacks against a wide range of cryptosystems, in the hope
of better protecting the hardware at the highest level of operations, within or
outside the scope of the algorithms.
Specifically, the second objective of this thesis is to create a complete base
architecture for a secure scalar point multiplication (SPM), opening the
development of an integrated hardware accelerator to be applied in ECC protocols
for SCA investigations. The auxiliary support of a comprehensive SCA is also
needed to accurately design the custom hardware.
1.2.1 Solution
This thesis includes an in-depth explanation, with experimental results, of a
successful side-channel attack to show the susceptibilities of cryptographic
hardware from multiple aspects. The weaknesses discovered are then applied to the
architectural design of a public-key system.
The thesis additionally includes an examination of the entire architecture
of the targeted SPM on a field-programmable gate array (FPGA), as well as an
implementation and analysis of two types of multipliers to be utilized in the
design.
1.3 Organization of Thesis
This thesis focuses on the process of implementing the algorithms for a novel
SPM architecture, with insight into and practice of modern SCAs for maximum
security. The rest of the thesis is organized as follows.
Chapter 2 presents the mathematical preliminaries required for elliptic curve
(EC) operations, which form the base of ECC. The algebraic basics of group laws
and finite fields are addressed in this chapter. Provable security is discussed
with respect to choosing a correct finite field and parameters. The problem that
makes ECC strong is explained along with the fundamental point operations. A
brief introduction to the statistics needed for an SCA is also given.
Chapter 3 discusses, investigates, and performs a pertinent side-channel attack
on a cipher used in industrial practice. The chapter first briefly reviews the SCAs
best suited to exploiting the targeted hardware. The main attacks include timing,
safe-error, and differential power attacks. Lastly, the chapter explains in great
depth the exact process used to break AES-128, providing a simple method for
breaking power-dependent states within a cryptosystem.
Chapter 4 provides a literature survey of the most suitable designs for resisting
SCAs. It outlines the proposed secure high-level architecture and why it will be
secure against the previously mentioned side-channel attacks. This level of the
design is the most susceptible to SCAs, as it determines how the master key
manipulates the base point of the EC during the SPM to produce the product of a
public-private key system. Joye’s SP algorithm is designed in hardware and is
broken into the two optimized point operations. The datapaths of the point
doubling and addition are displayed. Lastly, the high-level inversion algorithm is
selected and explained for translation to a hardware design.
Chapter 5 offers the low-level implementation of the proposed design. The
finite field multiplier is the most important component, as it needs to be optimized
for the application of the overall hardware accelerator. This chapter discusses two
multipliers with a hardware throughput solution in detail and gives an analysis of
cost, speed, and feasibility through simulation and synthesis. Finally, it gives an
overview of the complete scalar point multiplication algorithm with the hierarchy
of operations that builds this design.
Chapter 6 covers the overall contributions of this work as well as the future work
needed to progress the full development of the secure hardware implementation.
The future work includes the remainder of the designed hardware in Verilog
hardware description language (HDL), to ultimately be attacked to diagnose
threats; an implementation of a custom cryptographic library for hardware
functionality testing; and solutions to vulnerabilities in symmetric-key hardware
ciphers.
Chapter 2
Mathematical Preliminaries
ECC naturally revolves around number theory, group laws, and finite field
arithmetic. Together with the elliptic curve discrete logarithm problem (ECDLP)
over an efficient, NIST-approved elliptic curve, these form a solid mathematical
backbone for designing custom hardware.
Sections 2.1-2.3 refer to the books [39,40]. These textbooks provide a
descriptive yet concise treatment of the fundamentals of ECC algebra.
The last section briefly presents the small number of formulae needed to grasp
the numerical concept of a particular SCA.
2.1 Number Theory
Given two integers x, y, and a positive integer n:
Definition 2.1.1: Congruence
$x$ is congruent to $y$ modulo $n$ if the difference $x - y$ is divisible by $n$:
$x \equiv y \pmod{n}$
Property: $x$ is congruent to $y$ modulo $n$ if and only if $y \bmod n = x \bmod n$.
Definition 2.1.2: Multiplicative Group
The set of elements $x$ of $\mathbb{Z}_n$ that are relatively prime to $n$ is the multiplicative group $\mathbb{Z}_n^*$:
$\mathbb{Z}_n^* = \{x \in \mathbb{Z}_n \mid \gcd(x, n) = 1\}$, where $\mathbb{Z}_n = \{0, 1, 2, \ldots, n-1\}$
Property: The Euler totient function $\Phi(n)$ is the number of elements in $\mathbb{Z}_n^*$.
Also, if $\mathbb{Z}_n^*$ has a generator, then $\mathbb{Z}_n^*$ is said to be cyclic.
Definition 2.1.3: Multiplicative Inverse
In a multiplicative group where the operation is a product, if $xy \bmod n = 1$, then
$y$ is the multiplicative inverse of $x$:
$y = x^{-1} \bmod n$
Property: $x$ has a multiplicative inverse if and only if $\gcd(x, n) = 1$. If the
inverse exists, it is unique.
Definition 2.1.4: Order of an Element
The order of an element $x \in \mathbb{Z}_n^*$ is the least positive integer $r$ such that:
$x^r \equiv 1 \pmod{n}$
Property: If the order of $x$ is equal to the number $\Phi(n)$ of elements in $\mathbb{Z}_n^*$, then
$x$ is said to be a generator or primitive element of $\mathbb{Z}_n^*$.
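As a brief illustration of Definitions 2.1.2 to 2.1.4, the following Python sketch enumerates the multiplicative group modulo n and computes the totient, a multiplicative inverse, and the order of an element by direct search. It is not part of the thesis's toolchain; the function names are hypothetical, and the brute-force loops are only intended for small n.

```python
from math import gcd


def mult_group(n):
    """Elements of Z_n^*: residues in {0, ..., n-1} that are coprime to n."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return [x for x in range(n) if gcd(x, n) == 1]


def totient(n):
    """Euler totient Phi(n), computed as the size of Z_n^*."""
    return len(mult_group(n))


def inverse(x, n):
    """Multiplicative inverse of x modulo n.

    Uses the built-in three-argument pow with exponent -1 (Python 3.8+),
    which raises ValueError when gcd(x, n) != 1.
    """
    return pow(x, -1, n)


def order(x, n):
    """Least positive r such that x^r = 1 (mod n), for x in Z_n^*."""
    if n < 2 or gcd(x, n) != 1:
        raise ValueError("x must be an element of Z_n^*")
    r, acc = 1, x % n
    while acc != 1:
        acc = (acc * x) % n
        r += 1
    return r


def generators(n):
    """Elements of Z_n^* whose order equals Phi(n); empty if Z_n^* is not cyclic."""
    phi = totient(n)
    return [x for x in mult_group(n) if order(x, n) == phi]
```

For n = 10, `mult_group(10)` gives `[1, 3, 7, 9]`, so the totient is 4, `inverse(3, 10)` is 7, and `generators(10)` returns `[3, 7]`. For n = 8 the group is not cyclic and `generators(8)` is empty.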
2.2 Algebra
The following definitions are given over a set $G$.
2.2.1 Group Law
Using the binary operator $*$, the group is $G^*$:
Definition 2.2.1.1: Associativity
$x * (y * z) = (x * y) * z,\ \forall\, x, y, z \in G$
Definition 2.2.1.2: Commutativity
$x * y = y * x,\ \forall\, x, y \in G$
Property: If the group $G^*$ is commutative, then $G^*$ is an Abelian group.
Definition 2.2.1.3: Identity Element
There exists an element $0 \in G$ such that:
$a * 0 = 0 * a = a,\ \forall\, a \in G$
Definition 2.2.1.4: Inverse Element
For all $a \in G$, $a \neq 0$, there exists a unique element $a^{-1} \in G$ such that:
$a * a^{-1} = a^{-1} * a = 0$
2.2.2 Finite Fields
A field $F$ is defined with two binary operators, addition $+$ and multiplication
$*$. Finite fields possess the same group definitions and properties previously
mentioned in Section 2.2.1 [18], with the addition of the following:
Definition 2.2.2.1: Associativity and Closure under Multiplication
$a * (b * c) = (a * b) * c \in F,\ \forall\, a, b, c \in F$
Definition 2.2.2.2: Distributivity
$a * (b + c) = (a * b) + (a * c),\ \forall\, a, b, c \in F$
Definition 2.2.2.3: Multiplicative Identity
There exists an element $1 \in F$ such that:
$a * 1 = 1 * a = a,\ \forall\, a \in F$
Definition 2.2.2.4: Multiplicative Inverse
For all $a \in F$, $a \neq 0$, there exists a unique element $a^{-1} \in F$ such that:
$a * a^{-1} = a^{-1} * a = 1$
A finite field is a field with a finite number of elements [41]. Finite fields can
be constructed as $F = \mathbb{Z}_p[x]/(f(x))$, where $p$ is prime and $f(x)$ is an
irreducible polynomial over $\mathbb{Z}_p$. The choice of field now arises as a design
decision: whether to implement a prime or binary field over various ECs with
different key sizes is crucial. Changing any detail in the preliminary base design
alters the entire architecture dramatically.
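The group and field properties listed in Section 2.2 can be checked exhaustively for a small example. The sketch below assumes the usual two-operator reading of a field (addition and multiplication modulo p on the integers 0 to p-1) and tests associativity, commutativity, distributivity, identities, and inverses by brute force. The function name `is_field` is hypothetical, and the approach is only practical for very small moduli.

```python
from itertools import product


def is_field(p):
    """Brute-force check of the field axioms for Z_p under + and * modulo p (p >= 2)."""
    elems = range(p)

    def add(a, b):
        return (a + b) % p

    def mul(a, b):
        return (a * b) % p

    # Associativity of both operators and distributivity of * over +.
    for a, b, c in product(elems, repeat=3):
        if add(a, add(b, c)) != add(add(a, b), c):
            return False
        if mul(a, mul(b, c)) != mul(mul(a, b), c):
            return False
        if mul(a, add(b, c)) != add(mul(a, b), mul(a, c)):
            return False

    # Commutativity of both operators.
    for a, b in product(elems, repeat=2):
        if add(a, b) != add(b, a) or mul(a, b) != mul(b, a):
            return False

    for a in elems:
        # Additive identity 0 and multiplicative identity 1.
        if add(a, 0) != a or mul(a, 1) != a:
            return False
        # Every element needs an additive inverse.
        if not any(add(a, b) == 0 for b in elems):
            return False
        # Every nonzero element needs a multiplicative inverse.
        if a != 0 and not any(mul(a, b) == 1 for b in elems):
            return False
    return True
```

With a prime modulus such as 7 every check passes, whereas a composite modulus such as 6 fails because elements like 2 have no multiplicative inverse. This is why a prime field needs a prime p and a binary extension field needs an irreducible polynomial.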
2.3 Levels of Security Within Public & Private
Key Systems
Understanding all aspects of ECC applications is important for judging the
capabilities of a specific design to be integrated into realizable cryptosystems.
For example, the table below shows the key sizes needed to provide 80, 112, 128,
192, and 256-bit levels of security.
Table 2.1: Achieving Standard Security - Keys
Symmetric    Example Algorithm   Prime Field        Binary Field     Usage
$2^{80}$     RSA-1024            $|p| = 2^{192}$    $m = 2^{163}$    Authentication
$2^{112}$    3DES                $|p| = 2^{224}$    $m = 2^{233}$    Authentication
$2^{128}$    AES-128             $|p| = 2^{256}$    $m = 2^{283}$    Confidentiality
$2^{192}$    AES-192             $|p| = 2^{384}$    $m = 2^{409}$    Confidentiality
$2^{256}$    SHA-256             $|p| = 2^{521}$    $m = 2^{571}$    Integrity
Each of the algorithms above targets a security level that is defined by its key
length. Public-key authentication can be provided by a digital signature algorithm
such as the R. Rivest, A. Shamir, L. Adleman cipher (RSA)-1024 used by
certificate authorities (CA), or key establishment can be implemented with the
EC Diffie-Hellman (ECDH) key exchange. Confidentiality is provided by a
symmetric block cipher such as the Advanced Encryption Standard (AES), which
is the leading method of preventing man-in-the-middle attacks by using a secret
private key. To gain integrity, or uniqueness, one needs to apply a hashing
function with large complexity [1]. All of these systems need specific symmetric
cipher key lengths to ensure that brute-force attacks are infeasible. Equation (2.1)
below shows that the number of possibilities grows exponentially with the key
length $m$.
$y(m) = 2^{m-1}$ (2.1)
Clearly, as $m$ increases, this computation becomes infeasible beyond 128 bits
[37]. Modern ECC applications can work with well-known protocols such as
HyperText Transfer Protocol Secure (HTTPS), which readily uses AES-128
implementations [19], to provide key exchanges.
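To give a sense of scale for Equation (2.1), the short Python sketch below evaluates the search effort for a few key lengths. It assumes the equation is read as $2^{m-1}$ trials, which is half of the $2^m$ key space, and it assumes a purely hypothetical attacker testing $10^{12}$ keys per second. Neither the rate nor the function names come from the thesis.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_trials(m):
    """Number of trials for an m-bit key, reading Equation (2.1) as 2^(m-1)."""
    return 2 ** (m - 1)


def years_to_search(m, keys_per_second):
    """Years needed to perform expected_trials(m) at a given (assumed) rate."""
    return expected_trials(m) / keys_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    rate = 1e12  # hypothetical attacker speed, keys per second
    for m in (80, 112, 128, 192, 256):
        print(f"m = {m:3d}: about {years_to_search(m, rate):.3e} years")
```

Each additional key bit doubles the effort, so moving from 80 to 128 bits multiplies the search time by $2^{48}$ under the same assumed rate.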
Practising ECC begins by choosing a key length, field, basis, and an elliptic
curve. In the following subsections, those qualities will be considered.
2.3.1 Binary Field
A binary field is a finite field with $2^m$ elements, in this case the Galois field
$GF(2^m)$, whose elements are expressed in radix 2. If $f(x)$ is an
irreducible/primitive binary polynomial of degree $m$, then $\mathbb{F}_{2^m} = GF(2^m)$
and the field is of degree $m$ [39]. All elements within the field are binary
strings of length $m$ bits.
$GF(2^m) = \{a(x) \mid a(x) = a_{m-1}x^{m-1} + \ldots + a_1 x + a_0,\ a_i \in GF(2)\}$ (2.2)
Equation (2.2) names the Galois binary field $GF(2)$ to explicitly show that the
coefficients, and hence the field and basis, are modulo 2. Normally this is
implicit. All operations are carried out in the binary polynomial basis within this
field. For the scope of this project, binary fields of 233 and 283 bits will be
tested and compared.
Another type of field is a prime field. A prime field consists of the set of
integers $[0, \ldots, p-1]$ for a prime $p$. All field calculations are computed
modulo $p$, similarly to the binary case. These fields are typically implemented in
software, as they are computationally faster than binary fields when using
multiple CPU cores.
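The bit-string view of a binary field element can be made concrete in a few lines of Python. In the sketch below, an element is stored as an integer whose bit i is the coefficient of $x^i$, and addition is a bitwise XOR because the coefficients are taken modulo 2. The integer encoding and the helper names `to_poly` and `gf2m_add` are illustrative assumptions rather than anything defined in the thesis.

```python
def to_poly(a):
    """Render an integer-encoded binary polynomial as a readable string."""
    terms = []
    for i in range(a.bit_length() - 1, -1, -1):
        if (a >> i) & 1:
            if i == 0:
                terms.append("1")
            elif i == 1:
                terms.append("x")
            else:
                terms.append(f"x^{i}")
    return " + ".join(terms) if terms else "0"


def gf2m_add(a, b):
    """Addition (and subtraction) in GF(2^m): coefficient-wise XOR, no carries."""
    return a ^ b


if __name__ == "__main__":
    a, b = 0b1011, 0b0110
    print(to_poly(a))               # x^3 + x + 1
    print(to_poly(b))               # x^2 + x
    print(to_poly(gf2m_add(a, b)))  # x^3 + x^2 + 1
```

Since every element is its own additive inverse, `gf2m_add(a, a)` is always 0, which is why addition and subtraction coincide in a binary field.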
2.3.2 Polynomial Basis
Polynomial, or standard, bases are specified by a primitive polynomial of degree
$m$. In hardware this polynomial acts as the irreducible string $(a_m \ldots a_1 a_0)$
against which all other element strings, defined as $(a_{m-1} \ldots a_1 a_0)$, are
reduced. Hence, all elements expressed as a polynomial sum under the binary
field’s standard basis are shown below in Equation (2.3).
$X = \sum_{i=0}^{m-1} a_i x^i,\ a_i \in GF(2)$ (2.3)
Irreducible polynomials are chosen to be either trinomials or pentanomials,
depending on the bit size $m$ of the key being used. For example, 233-bit and
283-bit keys need a trinomial and a pentanomial, respectively, to define the
Galois field.
A primitive trinomial is defined as $t^m + t^n + 1$, where $n$ is the lowest-degree
middle term. If no trinomial is available, the pentanomial
$t^m + t^x + t^y + t^z + 1$ has to be applied, where $x, y, z$ are similarly the
lowest-degree successive terms. Using a pentanomial comes at the cost of larger
memory usage (look-up tables (LUTs) on an FPGA), register complexity, and
slower reduction computations [8].
The subsequent field arithmetic includes typical polynomial multiplication and addition modulo 2. Addition/subtraction in hardware is simply an m-bit exclusive-OR (XOR) gate. Further operations are designed using a polynomial basis. Hardware performs its calculations end-to-end in binary; therefore, introducing another basis such as a normal basis is cumbersome when targeting a larger goal such as side-channel attack analysis.
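The field arithmetic described above can be sketched in software as follows. This is a minimal illustration, not the thesis's hardware design: field elements are assumed to be stored as Python integers whose bits are the coefficients $(a_{m-1} \ldots a_1 a_0)$, the function names are hypothetical, and the toy field $GF(2^4)$ is chosen only to keep the numbers small. The two large constants are the reduction polynomials NIST publishes for its 233-bit and 283-bit binary fields.

```python
def gf2m_add(a, b):
    """Addition (and subtraction) in GF(2^m): a bitwise XOR of the coefficients."""
    return a ^ b


def gf2m_mul(a, b, m, poly):
    """Shift-and-add multiplication in GF(2^m), reduced modulo `poly`.

    `a` and `b` must already be reduced (less than 2**m); `poly` includes
    the leading t^m term.
    """
    result = 0
    while b:
        if b & 1:            # current coefficient of b is 1: accumulate a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by t
        if (a >> m) & 1:     # degree reached m: XOR in the reduction polynomial
            a ^= poly
    return result


# Toy field GF(2^4) with t^4 + t + 1 (illustrative only).
M_TOY, POLY_TOY = 4, 0b10011

# NIST reduction polynomials for the two field sizes named in the text.
POLY_233 = (1 << 233) | (1 << 74) | 1                          # trinomial
POLY_283 = (1 << 283) | (1 << 12) | (1 << 7) | (1 << 5) | 1    # pentanomial
```

For example, `gf2m_mul(0b0010, 0b1000, M_TOY, POLY_TOY)` multiplies $t$ by $t^3$ and reduces $t^4$ to $t + 1$.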
Normal bases are quite popular in hardware and software implementations of ECC protocols. Due to complexities with the special class Type T, the normal basis proves superior in specific situations regarding fast squaring operations [20]. Due to the difficulties of testing and verifying hardware results using a normal basis, it will not be pursued further.
2.3.3 Provable Security
Standards such as Brainpool (2003), used in German passports, and the National Security Agency (NSA) Suite B (2005), presently used in United States Department of Defense (DoD) security clearance projects [49], were and still are the modern ECC standards. Within Suite B, the National Institute of Standards and Technology (NIST) approved curves based on three main categories: curve parameters, the difficulty of the elliptic curve discrete logarithm problem (ECDLP), and complex ECC security. A more detailed analysis of the applied algebraic security is available on the SafeCurves website [47].
The most widely used state-of-the-art curves are the Montgomery, Koblitz, and Edwards prime and binary curves [21]. These special curves are optimized to produce maximum efficiency over the elliptic curve operations. Any of these EC equations would suffice as they are used in standards worldwide. However, the Koblitz curve was selected because its curve order, basis, and coefficients are openly published by NIST [34]. The order and curve coefficients are given in the following section, Elliptic Curves over $GF(2^m)$.
2.4 Elliptic Curves over GF(2^m)
Secure ECC FPGA implementations are extremely valuable due to the need for high-speed, low-cost, rapid-prototyping hardware that can maintain high-security asymmetric-key cryptography. As opposed to its predecessors, RSA and DSA, ECC uses much smaller keys, lower power consumption, and less memory, all while providing the same level of security in any public-key system. This results in fewer clock cycles and reduced hardware overhead [45].
The security of ECC is based on the ECDLP; this allows ECC applications to have a smaller key size compared to RSA because the ECDLP is harder to solve in practice than the integer factorization problem [36].
2.4.1 Marginal Note on EC Discrete Logarithm Problem
The ECDLP is defined as follows. Let E be an elliptic curve defined over the finite field F, let P be a point of order r, and let $Q \in \langle P \rangle$; find $k \in [0, 1, \ldots, r-1]$ such that the scalar multiplication (SM) $Q = kP$ holds.
The positive integer k is the discrete logarithm of Q to the base P, $k = \log_P Q$. Research has been conducted to break the ECDLP and the most prominent attack is the Pollard Rho method [22]. This method improves on the looping iteration-based methods, but still tries to break this algebraic problem iteratively in $\frac{3\sqrt{\pi m}}{2}$ cycles. As m, the bit size of $2^m$, increases, the work grows exponentially and this becomes exceedingly unrealistic.
2.4.2 Curves and EC Group Theory
Equation (2.4) is the pseudo-random curve; the NIST Koblitz curve, Equation (2.5), is a special case of Equation (2.4) where b = 1. When b = 1, operations within the finite field are highly simplified.
\[ E : y^2 + xy = x^3 + ax^2 + b, \quad a, b \in GF(2^m) \tag{2.4} \]
\[ E_K : y^2 + xy = x^3 + ax^2 + 1, \quad a \in GF(2^m) \tag{2.5} \]
To begin computing ECC operations, the base point P (x, y) ∈ EK needs to be
selected. Many base points are applicable as long as they provide maximum order
with respect to the curve.
Figure 2.1: Koblitz Elliptic Curve - $E_K : y^2 + xy = x^3 + x^2 + 1$
To improve functionality of the EC, the cofactor should be minimized. The finite number of points on the EC is n, defined by Equation (2.6), where $F_q$ is the finite field.
Assuming a = 1 in Equation (2.5), the curve is displayed graphically in Figure 2.1 and its cofactor f is defined in Equation (2.7). The order of the base point is r, the multiple that takes point P to the theoretical point at infinity, depicted as O. The order is a natural number, defined as the minimum positive prime integer r such that rP = O.
\[ n = \#E_K(F_q), \quad |n - (q + 1)| \le 2\sqrt{q} \tag{2.6} \]
\[ f = \frac{n}{r} = 2 \tag{2.7} \]
The identity element O satisfies P + O = P. Likewise P − P = O, where −P is obtained from the modular complement of the y-coordinate of the base point P; over $GF(2^m)$ this is −P = (x, x + y). Other group laws within ECC are as follows. If $x_1 = x_2$ and $y_1 \neq y_2$, then $y_2 = x_1 + y_1$ and therefore $P_1 = -P_2$; if $x_1 = 0$ for $P_1$, then $2P_1 = O$. To add two points on an elliptic curve E, one needs to check the simple condition of Q = P or Q ≠ P. If the points are equal, then point doubling follows.
2.4.3 Point Operations
If $P(x_1, y_1) = Q(x_2, y_2) \in E_K$, point doubling equations are needed to compute $2P(x_3, y_3) \in E_K$. Equations (2.8) are the set of Weierstrass equations defining point doubling in affine coordinates¹. In Figure 2.2(a), a tangent line is drawn from the point P that intersects the curve at point −R. Once reflected upon the x-axis, the point 2P is found.
\[
\begin{aligned}
\lambda &= \frac{y_1}{x_1} + x_1 \\
x_3 &= \lambda^2 + \lambda + a \\
y_3 &= (x_1 + x_3)\lambda + x_3 + y_1
\end{aligned}
\tag{2.8}
\]
Figure 2.2: Koblitz Elliptic Curve - Point Operations: (a) Point Doubling, (b) Point Addition
When $P(x_1, y_1) \neq Q(x_2, y_2) \in E_K$, point addition equations are needed to compute $P + Q = R(x_3, y_3) \in E_K$. Again, the Weierstrass equations defining point addition in affine coordinates are Equations (2.9). In Figure 2.2(b), a line is drawn connecting P and Q that intersects the curve at point −R. Once reflected upon the x-axis, the point R = P + Q is established.
\[
\begin{aligned}
\lambda &= \frac{y_1 + y_2}{x_1 + x_2} \\
x_3 &= \lambda^2 + \lambda + x_1 + x_2 + a \\
y_3 &= (x_1 + x_3)\lambda + x_3 + y_1
\end{aligned}
\tag{2.9}
\]
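The doubling and addition formulas can be exercised on a toy curve. The sketch below assumes the small field $GF(2^4)$ with reduction polynomial $t^4 + t + 1$ and the Koblitz form with a = b = 1; these parameters and all function names are illustrative choices, not the 233-bit or 283-bit curves used in this work. The point at infinity O is represented by `None`.

```python
M, POLY = 4, 0b10011      # toy field GF(2^4): t^4 + t + 1 (assumption)
A, B = 1, 1               # curve y^2 + xy = x^3 + A*x^2 + B


def mul(a, b):
    """Multiply two field elements, reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return result


def inv(a):
    """Inverse of a nonzero element, computed as a^(2^M - 2)."""
    result = 1
    for _ in range((1 << M) - 2):
        result = mul(result, a)
    return result


def on_curve(P):
    if P is None:                      # point at infinity
        return True
    x, y = P
    x2 = mul(x, x)
    return mul(y, y) ^ mul(x, y) == mul(x2, x) ^ mul(A, x2) ^ B


def double(P):
    """Point doubling in affine coordinates."""
    if P is None or P[0] == 0:         # x = 0 means P = -P, so 2P = O
        return None
    x1, y1 = P
    lam = mul(y1, inv(x1)) ^ x1
    x3 = mul(lam, lam) ^ lam ^ A
    y3 = mul(x1 ^ x3, lam) ^ x3 ^ y1
    return (x3, y3)


def add(P, Q):
    """Point addition in affine coordinates."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2:                       # Q = P (double) or Q = -P (infinity)
        return double(P) if y1 == y2 else None
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = mul(x1 ^ x3, lam) ^ x3 ^ y1
    return (x3, y3)
```

A quick sanity check is to enumerate every (x, y) pair that satisfies `on_curve` and confirm that `add` and `double` always return another point on the curve.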
¹ Affine coordinates are (x, y), which span an indefinite xy-plane. They are the realizable coordinates compared to other methods like projective coordinates [8].
2.4.4 ECC Overview and Vulnerability Insight
Understanding the necessary background of ECs is vital to recognizing potential security threats on all levels. The hierarchy of ECC protocols is shown in Figure 2.3. There are 3 sets of operations that build the hierarchy from the ground up: first, the multiplication and inversion methods under finite field arithmetic; second, the point operations; and last, the scalar multiplication. As previously explained, addition and subtraction are the same operation under GF(2) and are simply an XOR. Each step in the hierarchy will be discussed in much greater detail while designing the architecture in chapters 4 and 5.
Figure 2.3: ECC $GF(2^m)$ Hierarchy of Operations
The primitive operation exercising all lower operations is the scalar point multiplication. This SM is the central task of popular protocols such as ECDH or the EC digital signature algorithm (ECDSA). This makes the operation one of the biggest security risks in ECC.
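One way to see why the scalar multiplication is exposed is to write out a textbook left-to-right double-and-add loop and record which point operation runs at each step. The sketch below is a generic illustration, not the architecture developed in later chapters; the function name is hypothetical, and the group operations are passed in as parameters so that any group can be used. The usage example substitutes integers modulo 101 under addition as a stand-in group.

```python
def scalar_mult(k, P, add, double, identity):
    """Left-to-right double-and-add.

    Returns kP together with the sequence of operations performed
    ('D' = doubling, 'A' = addition).
    """
    Q = identity
    ops = []
    for bit in bin(k)[2:]:        # scan the scalar from the most significant bit
        Q = double(Q)
        ops.append("D")
        if bit == "1":            # an addition happens only for 1-bits of k
            Q = add(Q, P)
            ops.append("A")
    return Q, "".join(ops)


# Stand-in group: integers modulo 101 under addition (assumption for the demo).
result, ops = scalar_mult(
    13, 5,
    add=lambda a, b: (a + b) % 101,
    double=lambda a: (2 * a) % 101,
    identity=0,
)
```

In this unprotected form the pattern of doublings and additions mirrors the bits of k, which is the kind of dependency a side-channel observer looks for.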
2.5 Statistical Analysis for Side-Channel Attacks
In reference to [38], the models and algorithms in this section are needed to carry out differential power analysis (DPA). These two concepts are the essential basics behind DPA and are explained using the procedure of AES.
The Hamming Distance (HD) model is used to measure bus activity within the selected device. This activity is directly related to the output power on the bus. HD is the number of bit changes, or bit inversions, between two binary words. With respect to the next chapter's analysis, the change in the output bit stream is the count of 1's in the logical XOR of two words and is calculated by Equation (2.10). This count of 1's is defined as the Hamming Weight (HW) of the XOR result.
\[ D_H = \sum_{i=1}^{k} |x_i - y_i|, \quad x_i, y_i \in [0, 1] \tag{2.10} \]
This gives a precise digital measure of any state within a system for further statistical analysis. This simple yet powerful model will be used to map the hypothetical intermediate values to hypothetical power consumption values.
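A minimal sketch of the Hamming Distance model of Equation (2.10), assuming each word is held as a non-negative Python integer; the function names are illustrative.

```python
def hamming_weight(word):
    """Number of 1 bits in a word."""
    return bin(word).count("1")


def hamming_distance(x, y):
    """Number of bit positions in which x and y differ: HW(x XOR y)."""
    return hamming_weight(x ^ y)
```

For instance, the words `0b10110010` and `0b00111000` differ in three bit positions, so three bus lines would toggle between them.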
After the appropriate values are mapped, the resulting matrix must have a
strong correlation with the power traces previously captured at a specific key.
\[ R = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \tag{2.11} \]
The correlation coefficient R is calculated between the hypothetical power consumption and the traces.
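Equation (2.11) translates directly into a few lines of code. The sketch below assumes two equal-length lists of numbers; returning 0.0 when either input has no variance is a convention chosen here to avoid a division by zero, not something specified in the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient R between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else 0.0
```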
Chapter 3
Side-Channel Attacks Against
Hardware
Side-channel attacks (SCAs) are invasive or non-invasive manoeuvres that exploit physical leakages of information from hardware. Timing signals, register-to-register dependencies, and physical power consumption are a few pieces of information that can be easily obtained from hardware implementations through SCAs. Three rising SCAs are timing attacks, safe-error attacks, and differential power analysis. These attacks all possess the ability to extract different pieces of information from cryptographic accelerators.
“If you think technology can solve your security problems, then you don’t un-
derstand the problems and you don’t understand the technology” (B. Schneier,
2000).
In this chapter, the concepts of side-channel attacks will become clear and a side-channel attack is performed on a well-known 128-bit encryption standard to exploit its unique flaws. A description of the steps needed to perform the attack on a different system will be given.
3.1 Malicious Actions Against Hardware
Although new techniques for attacking hardware devices emerge relentlessly, three types remain the most pertinent issues. These attacks will be described with attention to physical weaknesses within the hardware and small issues related to scalar point multiplication (SPM) algorithms.
3.1.1 Timing & Safe-Error Attacks
Timing attacks rely on the fact that operations on different inputs have a large
time variance [4]. This gives the attacker the non-invasive ability to measure the
time between computations of the attacked algorithm.
As shown in recent literature [31, 32], timing attacks are sometimes focused against software-implemented cryptosystems. These attacks rely on inter-process timing through the state of the CPU's cache as it reads and writes data. This leaks memory access patterns, which can be used to build a data-dependent look-up table and break the system at hand.
These methods are easily transferable to software-hardware SoC implementa-
tions as they rely on the CPU to transmit, receive, and store values in memory
while the hardware computes the encryption.
The timing attack employed against a hardware implementation needs a CPU regularly communicating with its cache in order to effectively complete the attack. Since the hardware implementation is not at the integration level, this attack will be a candidate for future work as explained in Conclusions.
Safe-error attacks maliciously modify bits of a specific word in a specified register [3] to determine if the registers are independent of one another. This invasively shows the direct register dependencies that distinguish parts of an algorithm, or an entire algorithm, from another. This attack needs to physically tamper with the hardware in order to falsify words, or to introduce a fake instruction [3] in the internal arithmetic logic unit (ALU), to trigger a fault that results in the adversary acquiring sensitive information.
Safe-error attacks are categorized as either computational safe-error (C Safe-error), which focuses on tampering with the ALU, or memory safe-error (M Safe-error), which modifies CPU-to-memory address communication [6].
These attacks are primarily out of the scope of this work, but need to be
mentioned as they are a prominent SCA.
3.1.2 Zero Point Attacks
The Zero-Value Point (ZVP) attacks on ECC processors were introduced in [16]. The attackers choose a specific base point on an EC to produce a zero-value coordinate in the scalar multiplication. The power consumption of the zero-value multiplication decreases dramatically, thereby exposing the secret key, which can be distinguished by a single observation of a set of power traces. This requires the attacker to have physical access to the processor and/or the CPU's memory to tamper with the embedded base point for the SPM. Having said that, this knowledge of the ZVP power consumption can be applied with the help of another attack to differentiate the key from intermediate scalar values.
3.1.3 Differential Power Analysis
Correlation power analysis (CPA) is widely known in the domain of embedded security. CPA focuses on reading the leakage power from the encryption stage of a device and relating it to the input data stream. This could be through electromagnetic radiation or by passively sniffing output bus activity. Its first derivative was simple power analysis (SPA), which was later overshadowed by its sibling, differential power analysis (DPA) [24].
DPA was announced to the public in 1998 by researchers P. Kocher, J. Jaffe, and B. Jun. It is a type of CPA where the attacker non-invasively reads the output power consumption of the cryptographic processor to differentially compare those results to potential state or key values. The attacker then runs a series of statistical processes that give the ability to learn inner mechanisms or variables within the core, i.e. secret session and master keys.
This manoeuvre relies on the fact that internal switching of CMOS technology consumes different amounts of power depending on the operations performed on different inputs. This type of CPA is an efficient, adaptable, and, more importantly, feasible attack against a wide set of cryptosystems.
3.2 Novelty
The section titled Executing DPA on AES-128 is novel work that expands the broader scope of past research such as [25, 26, 33, 42] by detailing the exact algebraic steps, with explanation, needed to successfully attack the hardware implementation of AES-128. To the best of the author's knowledge, there is no research that outlines DPA at the level of detail given in this thesis.
This detail is needed due to the elaborate steps and the cryptographic insight into where to attack and why. Understanding why the proposed attack works at a hardware level is paramount for applying the practice in future research.
3.3 Executing DPA on AES-128
Due to the complexity of AES and the fact that a brute-force attack is infeasible within any lifetime, the encryption is viewed as an excellent option for handling sensitive data at high security levels of 128-256 bits, in reference to Table D.1.
In order to validate the security of data being processed through AES-128 in electronic code book (ECB) configuration, the flaws of the standard must be exposed and exploited so that solutions can be proposed for issues in both software and hardware. This attack will uncover the vulnerability of the hardware implementation of the Advanced Encryption Standard to differential power analysis.
Figure 3.1: Block Diagram of AES-128 [38]
As a guideline for the process of AES, the block diagram of AES-128 is shown
above in Figure 3.1.
3.3.1 Experimental Setup
The three main hardware components of the attack are a cryptographic FPGA evaluation board, an oscilloscope, and a computer. The Side-channel Attack Standard Evaluation Board (SASEBO)-GIII is the cryptographic research and development board that is used to perform two tasks on two FPGAs. The transfer of plaintext and ciphertext is steered through the Spartan-6, the controlling FPGA, while the Virtex-7, the processing FPGA, symmetrically encrypts the plaintext with an established master key. The random data is manipulated in software by an open-sourced C# script [50].
1. Cryptographic FPGA Evaluation Board: SASEBO-GIII
2. Oscilloscope: Agilent Technologies DSO-X 3012A at 50 M samples/s
3. Computer: Intel Xeon 64-bit 3.0 GHz Processor with 8 GB Memory
In Figure 3.2, the board is labelled 1-6 as follows: 1 is the output power pin of the hardware encryption bus; 2 is the output power pin that triggers the oscilloscope, set at 50 M samples/s, to capture a power trace; 3 is the Virtex-7 FPGA; 4 is the Spartan-6 FPGA; and 5 and 6 are the Joint Test Action Group (JTAG) ports used to program the corresponding FPGA's electrically erasable programmable read-only memory (EEPROM) with the combinational AES-128 implementation.
On the bottom left of the board, not labelled, is the universal serial bus (USB) 2.0 port, the bidirectional data transfer connection that communicates with the Spartan-6.
The three main software components being used are Xilinx ISE, Visual Studio, and Matlab.
Xilinx ISE is the design suite used to modify and compile Verilog HDL code for both FPGAs. The open-sourced HDL scripts from [50] were used since the scope of this project is not to design a hardware implementation of AES-128 but rather to exploit the standard's flaws. Visual Studio is the development environment of choice for the C# scripts from [50].
Matlab is utilized to develop the entire DPA attack algorithm since it is tailored to import, analyze, and sort very large matrices with ease.
The provided scripts from [50] are modified to establish a connection to the evaluation board through USB 2.0 and to display a graphical user interface (GUI) that allows the user to view the hexadecimal values of the variables being processed by the board; the GUI also allows the user to set the AES-128 master key to random 16-byte keys.
3.3.2 Framework Differential Power Analysis
The FPGA consumes characteristic power due to the exertion of words pushed from the output pins of the Kintex-7 to the Spartan-6 as the switching activity of internal signals changes.

Figure 3.2: SASEBO-GIII
1 Output power encryption bus, 2 Output power trigger, 3 Virtex-7, 4 Spartan-6, and 5, 6 JTAG port to program EEPROMs
The pin on the output of the encrypted text bus can be probed to read the power traces. Each bit on the bus requires power in order to invert itself after each clock cycle. This means that the power consumption is directly proportional to the number of bit changes. Therefore, if one has a known state, ciphertext or plaintext, and all hypothetical possibilities for a neighbouring state, one can count the number of bit changes between each hypothetical state and the known state and correlate it with the power consumption in order to find out which trace it corresponds to.
The Kintex-7 performs one entire round (1 of 10) of AES-128 on 1 byte before changing the values on the bus. This translates to investigating a whole round of AES to get all hypothetical states at the neighbouring round. Due to the absence of Mix-Columns, a $GF(2^8)$ operation [41], in the 10th round, the simple choice is to use the ciphertext as the known value. Working backwards to get every hypothetical value of station 9, labelled ST9 in Figure 3.3, is carried out in the next subsections. Using these values, the HD is obtained between the two reference states and the correlation coefficient is used to match these distances to the power consumption from their respective captured traces.

Figure 3.3: Block Diagram of DPA Against AES-128
A momentous observation that makes this attack possible is that the bytes of the 16-byte key are independent of each other at each state n and all operations on them are essentially performed in parallel; this is true in software as well. Clearly this is a large flaw in the algorithm: due to the nature of this attack there are no byte-to-byte key dependencies within a single state, but rather the state-to-state map below.
\[ [B_{15}(n), B_{14}(n), \ldots, B_0(n)] \longrightarrow [B_{15}(n+1), B_{14}(n+1), \ldots, B_0(n+1)] \tag{3.1} \]
There are 7 operations required in the developed algorithm, 3 of which are inverse operations of the AES-128 algorithm. The other 4 are procedures to develop the hypothetical 1-byte keys, the Hamming distances, the correlation coefficients, and lastly the inverse sessional key.
Figure 3.3 should be used in tandem with Figure 3.1 to understand the concepts discussed and the operations that follow. To begin, the known 128-bit ciphertext (CT) is split into 16 bytes as seen below.
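\[
CT =
\begin{pmatrix}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 15} & \text{Byte 16} \\
CT_1 & CT_2 & \cdots & CT_{15} & CT_{16}
\end{pmatrix}
\]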
3.3.3 Generating Hypothetical Keys
Working backwards, the first operation encountered is the Add-Round Key. Since the 10th round sessional key is unknown and is what is being sought, all possibilities from 0-255 are generated for each byte below. All of the AES operations and the correlations are computed on bytes, not bits, which is why it is sufficient to capture every hypothetical value of each byte rather than each bit of the possible 128-bit key.
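\[
\text{Hyp.Keys} =
\begin{array}{cccc}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 16} \\[2pt]
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix} &
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix} &
\cdots &
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix}
\end{array}
\]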
3.3.4 Inverse Add-Round Key & Shift Row
The add-round key operation takes the input data and XORs it with the sessional
key in order to get the ciphertext output. The hypothetical keys and the ciphertext
are XORd in order to get the input in Equation (3.2).
\[ A = CT \oplus Key \in [0, 1] \tag{3.2} \]
Each ciphertext byte is XORd with its corresponding 256 possibilities of the
key. In other words, the first byte of the ciphertext is XORd with every guess in
byte one of the key and the rest of the bytes follow the same operation.
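The per-byte XOR just described can be sketched as follows. The function name and the list-of-lists layout are illustrative assumptions: `A[j][k]` holds ciphertext byte j XORed with key guess k, using zero-based indices rather than the one-based Matlab indices used elsewhere in this chapter.

```python
def inverse_add_round_key(ciphertext):
    """Build Mat A: one 256-entry column of hypotheses per ciphertext byte."""
    return [[ct_byte ^ guess for guess in range(256)] for ct_byte in ciphertext]


# Example with a made-up 16-byte ciphertext 00 01 02 ... 0F.
A = inverse_add_round_key(bytes(range(16)))
```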
\[
A =
\begin{array}{cccc}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 16} \\[2pt]
\begin{bmatrix} CT_1 \\ CT_1 \\ CT_1 \\ \vdots \\ CT_1 \end{bmatrix} \oplus
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix} &
\begin{bmatrix} CT_2 \\ CT_2 \\ CT_2 \\ \vdots \\ CT_2 \end{bmatrix} \oplus
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix} &
\cdots &
\begin{bmatrix} CT_{16} \\ CT_{16} \\ CT_{16} \\ \vdots \\ CT_{16} \end{bmatrix} \oplus
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix}
\end{array}
\]
To simplify the following steps, we name this matrix as A, which has the following
configuration.
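\[
A =
\begin{array}{cccc}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 16} \\[2pt]
\begin{bmatrix} A_1[1] \\ A_1[2] \\ A_1[3] \\ \vdots \\ A_1[256] \end{bmatrix} &
\begin{bmatrix} A_2[1] \\ A_2[2] \\ A_2[3] \\ \vdots \\ A_2[256] \end{bmatrix} &
\cdots &
\begin{bmatrix} A_{16}[1] \\ A_{16}[2] \\ A_{16}[3] \\ \vdots \\ A_{16}[256] \end{bmatrix}
\end{array}
\]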
To perform the shift row operation, the 16 bytes of data are rearranged in a 4x4
matrix. Each row has a shift left operation of value 0, 1, 2, and 3, respectively. In
order to do the inverse shift row (ISR), Mat A is rearranged in a 4x4 formation
and each row is shifted right by 0, 1, 2, and 3.
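The inverse shift row step can be sketched as an index permutation. The sketch assumes the 16 entries are stored in column-major order, so that entry `4*c + r` sits in row r and column c of the 4x4 arrangement (bytes 1, 5, 9, 13 on the first row); the function name is illustrative. Each entry can be a single byte or a whole 256-entry column of hypotheses.

```python
def inverse_shift_rows(state):
    """Rotate row r of the 4x4 state right by r positions (r = 0, 1, 2, 3)."""
    out = [None] * 16
    for r in range(4):
        for c in range(4):
            # The entry that lands in column c came from column (c - r) mod 4.
            out[4 * c + r] = state[4 * ((c - r) % 4) + r]
    return out


# Label the bytes 1..16 to make the movement visible.
shifted = inverse_shift_rows(list(range(1, 17)))
rows = [[shifted[4 * c + r] for c in range(4)] for r in range(4)]
```

With the labels above, the second row becomes 14, 2, 6, 10, which is the right rotation by one position.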
\[
\begin{bmatrix}
\mathbf{A}_1 & \mathbf{A}_5 & \mathbf{A}_9 & \mathbf{A}_{13} \\
\mathbf{A}_2 & \mathbf{A}_6 & \mathbf{A}_{10} & \mathbf{A}_{14} \\
\mathbf{A}_3 & \mathbf{A}_7 & \mathbf{A}_{11} & \mathbf{A}_{15} \\
\mathbf{A}_4 & \mathbf{A}_8 & \mathbf{A}_{12} & \mathbf{A}_{16}
\end{bmatrix},
\qquad
\mathbf{A}_j = \begin{bmatrix} A_j[1] \\ \vdots \\ A_j[256] \end{bmatrix}
\]
\[ \Big\downarrow \ \text{ISR} \]
\[
B =
\begin{bmatrix}
\mathbf{A}_1 & \mathbf{A}_5 & \mathbf{A}_9 & \mathbf{A}_{13} \\
\mathbf{A}_{14} & \mathbf{A}_2 & \mathbf{A}_6 & \mathbf{A}_{10} \\
\mathbf{A}_{11} & \mathbf{A}_{15} & \mathbf{A}_3 & \mathbf{A}_7 \\
\mathbf{A}_8 & \mathbf{A}_{12} & \mathbf{A}_{16} & \mathbf{A}_4
\end{bmatrix}
\]
To simplify the following steps, we name this matrix as B, which has the following
configuration.
\[
B =
\begin{bmatrix}
\mathbf{B}_1 & \mathbf{B}_5 & \mathbf{B}_9 & \mathbf{B}_{13} \\
\mathbf{B}_2 & \mathbf{B}_6 & \mathbf{B}_{10} & \mathbf{B}_{14} \\
\mathbf{B}_3 & \mathbf{B}_7 & \mathbf{B}_{11} & \mathbf{B}_{15} \\
\mathbf{B}_4 & \mathbf{B}_8 & \mathbf{B}_{12} & \mathbf{B}_{16}
\end{bmatrix},
\qquad
\mathbf{B}_j = \begin{bmatrix} B_j[1] \\ \vdots \\ B_j[256] \end{bmatrix}
\]
3.3.5 Inverse S-Box
The S-box takes each byte of data and maps it to a given well-established value. The inverse S-box is a standard 16x16 array that simply inverts the output of AES's Substitute Box operation. It takes the hex or decimal value of each byte and exchanges it with a new value. It accomplishes this by selecting the most significant 4 bits of each byte as the row of the standard array and the least significant 4 bits as the column. Every byte in Mat B is remapped through the developed inverse S-box in order to get the new Mat C.
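The inverse S-box lookup can be reproduced without typing in the 256-entry table. The sketch below derives the forward S-box from its standard definition (multiplicative inverse in $GF(2^8)$ followed by the AES affine transformation), inverts it, and then indexes it by row and column as described above. The function names are illustrative, and the brute-force search for inverses is chosen for clarity rather than speed.

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    sbox = [0] * 256
    for x in range(256):
        # Multiplicative inverse (0 maps to 0), found by exhaustive search.
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf256_mul(x, y) == 1)
        # AES affine transformation.
        sbox[x] = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for value, substituted in enumerate(SBOX):
    INV_SBOX[substituted] = value


def inv_sbox_lookup(byte):
    """Index the 16x16 inverse S-box: high nibble = row, low nibble = column."""
    row, col = byte >> 4, byte & 0x0F
    return INV_SBOX[16 * row + col]
```

Applying `inv_sbox_lookup` to every entry of Mat B gives Mat C.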
\[
C =
\begin{bmatrix}
sbox^{-1}[\mathbf{B}_1] & sbox^{-1}[\mathbf{B}_5] & sbox^{-1}[\mathbf{B}_9] & sbox^{-1}[\mathbf{B}_{13}] \\
sbox^{-1}[\mathbf{B}_2] & sbox^{-1}[\mathbf{B}_6] & sbox^{-1}[\mathbf{B}_{10}] & sbox^{-1}[\mathbf{B}_{14}] \\
sbox^{-1}[\mathbf{B}_3] & sbox^{-1}[\mathbf{B}_7] & sbox^{-1}[\mathbf{B}_{11}] & sbox^{-1}[\mathbf{B}_{15}] \\
sbox^{-1}[\mathbf{B}_4] & sbox^{-1}[\mathbf{B}_8] & sbox^{-1}[\mathbf{B}_{12}] & sbox^{-1}[\mathbf{B}_{16}]
\end{bmatrix},
\qquad
sbox^{-1}[\mathbf{B}_j] = \begin{bmatrix} sbox^{-1}[B_j[1]] \\ \vdots \\ sbox^{-1}[B_j[256]] \end{bmatrix}
\]
To simplify the following steps, we name this new matrix as C, which has the
following configuration.
\[
C =
\begin{bmatrix}
\mathbf{C}_1 & \mathbf{C}_5 & \mathbf{C}_9 & \mathbf{C}_{13} \\
\mathbf{C}_2 & \mathbf{C}_6 & \mathbf{C}_{10} & \mathbf{C}_{14} \\
\mathbf{C}_3 & \mathbf{C}_7 & \mathbf{C}_{11} & \mathbf{C}_{15} \\
\mathbf{C}_4 & \mathbf{C}_8 & \mathbf{C}_{12} & \mathbf{C}_{16}
\end{bmatrix},
\qquad
\mathbf{C}_j = \begin{bmatrix} C_j[1] \\ \vdots \\ C_j[256] \end{bmatrix}
\]
3.3.6 Hamming Distance
At this point, Mat C corresponds to every possible value/state for each byte of data at ST9 (see Figure 3.3) for the known ciphertext. Now the Hamming distance between each of these values and their corresponding ciphertext must be calculated in order to get the number of bit changes on the bus between ST9 and the ciphertext. Equation (2.10) is applied to calculate the HD. In order to get the HD, all 256 values of byte 1 in Mat C are XORd with the first byte of ciphertext. This is repeated for all 16 bytes.
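For a single ciphertext byte, the chain from key guess to hypothetical bit-change count can be sketched as below. This is a simplified illustration under stated assumptions: the function name is hypothetical, the inverse S-box is passed in as a 256-entry table, and the byte reordering of the inverse shift row step is left out, so the same ciphertext byte is used on both sides of the XOR as in the description above.

```python
def hd_hypotheses(ct_byte, inv_sbox):
    """Return 256 Hamming distances, one per key guess, for one ciphertext byte."""
    column = []
    for guess in range(256):
        st9 = inv_sbox[ct_byte ^ guess]                 # inverse Add-Round Key, then inverse S-box
        column.append(bin(st9 ^ ct_byte).count("1"))    # HW of the XOR = HD
    return column


# Stand-in table (identity permutation) used only to make the demo checkable;
# a real attack would pass the AES inverse S-box instead.
identity_table = list(range(256))
column = hd_hypotheses(0xA5, identity_table)
```

With the identity stand-in, the distance for guess k reduces to the Hamming weight of k; with the real inverse S-box the distances depend on the ciphertext byte as well.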
\[
I =
\begin{array}{cccc}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 16} \\[2pt]
\begin{bmatrix} C_1[1] \\ C_1[2] \\ C_1[3] \\ \vdots \\ C_1[256] \end{bmatrix} \oplus
\begin{bmatrix} CT[1] \\ CT[1] \\ CT[1] \\ \vdots \\ CT[1] \end{bmatrix} &
\begin{bmatrix} C_2[1] \\ C_2[2] \\ C_2[3] \\ \vdots \\ C_2[256] \end{bmatrix} \oplus
\begin{bmatrix} CT[2] \\ CT[2] \\ CT[2] \\ \vdots \\ CT[2] \end{bmatrix} &
\cdots &
\begin{bmatrix} C_{16}[1] \\ C_{16}[2] \\ C_{16}[3] \\ \vdots \\ C_{16}[256] \end{bmatrix} \oplus
\begin{bmatrix} CT[16] \\ CT[16] \\ CT[16] \\ \vdots \\ CT[16] \end{bmatrix}
\end{array}
\]
To simplify the following steps, we name this matrix as I, which has the following
configuration.
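\[
I =
\begin{array}{cccc}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 16} \\[2pt]
\begin{bmatrix} I_1[1] \\ I_1[2] \\ I_1[3] \\ \vdots \\ I_1[256] \end{bmatrix} &
\begin{bmatrix} I_2[1] \\ I_2[2] \\ I_2[3] \\ \vdots \\ I_2[256] \end{bmatrix} &
\cdots &
\begin{bmatrix} I_{16}[1] \\ I_{16}[2] \\ I_{16}[3] \\ \vdots \\ I_{16}[256] \end{bmatrix}
\end{array}
\]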
Using the HW model described in the previous chapter, the number of bit changes is counted as shown.
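\[
F =
\begin{array}{cccc}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 16} \\[2pt]
\begin{bmatrix} HW(I_1[1]) \\ HW(I_1[2]) \\ HW(I_1[3]) \\ \vdots \\ HW(I_1[256]) \end{bmatrix} &
\begin{bmatrix} HW(I_2[1]) \\ HW(I_2[2]) \\ HW(I_2[3]) \\ \vdots \\ HW(I_2[256]) \end{bmatrix} &
\cdots &
\begin{bmatrix} HW(I_{16}[1]) \\ HW(I_{16}[2]) \\ HW(I_{16}[3]) \\ \vdots \\ HW(I_{16}[256]) \end{bmatrix}
\end{array}
\]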
In conclusion, the 256 hypothetical power consumption values for each byte are left, as seen in Mat F below. This entire process is repeated for every trace that is captured.
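\[
F =
\begin{array}{cccc}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 16} \\[2pt]
\begin{bmatrix} F_1[1] \\ F_1[2] \\ F_1[3] \\ \vdots \\ F_1[256] \end{bmatrix} &
\begin{bmatrix} F_2[1] \\ F_2[2] \\ F_2[3] \\ \vdots \\ F_2[256] \end{bmatrix} &
\cdots &
\begin{bmatrix} F_{16}[1] \\ F_{16}[2] \\ F_{16}[3] \\ \vdots \\ F_{16}[256] \end{bmatrix}
\end{array}
\]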
3.3.7 Correlation Coefficient
The columns of Mat F are correlated against the columns of the power-trace matrix, where each column holds one sample point across all traces. This is why a precise oscilloscope trigger that occurs on the same sample of every trace is important. The waveforms must overlap at each sample to capture the true change in bus power consumption and give the highest correlation. This correlation is repeated for all 256 hypothetical ST9 values.
The notation that follows is $F_{\text{Trace},\,\text{Column of } F}[\text{Row of } F]$ and $T_{\text{Trace number}}[\text{Sample number}]$. These matrices are built for an example byte, byte 1 of 16.
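\[
F_1[1] =
\begin{bmatrix} F_{1,1}[1] \\ F_{2,1}[1] \\ F_{3,1}[1] \\ \vdots \\ F_{n,1}[1] \end{bmatrix}
\quad
\begin{bmatrix}
T_1[1] & T_1[2] & T_1[3] & \cdots & T_1[k] \\
T_2[1] & T_2[2] & T_2[3] & \cdots & T_2[k] \\
T_3[1] & T_3[2] & T_3[3] & \cdots & T_3[k] \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
T_n[1] & T_n[2] & T_n[3] & \cdots & T_n[k]
\end{bmatrix}
\]
\[
F_1[2] =
\begin{bmatrix} F_{1,1}[2] \\ F_{2,1}[2] \\ F_{3,1}[2] \\ \vdots \\ F_{n,1}[2] \end{bmatrix}
\quad
\begin{bmatrix}
T_1[1] & T_1[2] & T_1[3] & \cdots & T_1[k] \\
T_2[1] & T_2[2] & T_2[3] & \cdots & T_2[k] \\
T_3[1] & T_3[2] & T_3[3] & \cdots & T_3[k] \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
T_n[1] & T_n[2] & T_n[3] & \cdots & T_n[k]
\end{bmatrix}
\]
\[ \vdots \]
\[
F_1[256] =
\begin{bmatrix} F_{1,1}[256] \\ F_{2,1}[256] \\ F_{3,1}[256] \\ \vdots \\ F_{n,1}[256] \end{bmatrix}
\quad
\begin{bmatrix}
T_1[1] & T_1[2] & T_1[3] & \cdots & T_1[k] \\
T_2[1] & T_2[2] & T_2[3] & \cdots & T_2[k] \\
T_3[1] & T_3[2] & T_3[3] & \cdots & T_3[k] \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
T_n[1] & T_n[2] & T_n[3] & \cdots & T_n[k]
\end{bmatrix}
\]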
The highest correlation among these hypothetical ST9 values identifies the sessional key byte under analysis. For example, if the highest correlation occurs using the F[41] values, the sessional key byte has a value of 41 − 1 = 40, because Matlab indexes from 1. The index maps directly to the key value because the key guesses 0-255 were initially XORd into the ciphertext in order. Again, it is stressed that this algorithm is done for all 16 bytes of the key as they are independent of each other.
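The search for the best-correlating guess can be sketched as two nested loops over key guesses and sample points. The names and data layout are assumptions made for illustration: `hypotheses[k][t]` is the modelled bit-change count for guess k on trace t, and `traces[t][s]` is sample s of trace t. The tiny data set in the usage example is synthetic, not measured. Because Python indexes from 0, the winning index is the key byte itself, with no subtraction of 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient; 0.0 if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def best_key_guess(hypotheses, traces):
    """Return (guess, correlation) for the hypothesis column that best matches any sample point."""
    num_samples = len(traces[0])
    best_guess, best_corr = None, float("-inf")
    for guess, column in enumerate(hypotheses):
        for s in range(num_samples):
            samples = [trace[s] for trace in traces]   # one sample point across all traces
            r = pearson_r(column, samples)
            if r > best_corr:
                best_guess, best_corr = guess, r
    return best_guess, best_corr


# Synthetic demo: 3 guesses, 5 traces, 2 samples per trace.
hyp = [[1, 2, 3, 4, 5], [5, 3, 1, 4, 2], [2, 2, 5, 1, 4]]
noise = [0.3, 0.1, 0.4, 0.1, 0.5]
demo_traces = [[noise[t], 10 * hyp[1][t] + 0.5] for t in range(5)]
guess, corr = best_key_guess(hyp, demo_traces)
```

In the demo the second sample point is a scaled copy of the model for guess 1, so that guess is selected.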
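\[
\text{Exp.Keys} =
\begin{array}{cccc}
\text{Byte 1} & \text{Byte 2} & \cdots & \text{Byte 16} \\[2pt]
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix} &
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix} &
\cdots &
\begin{bmatrix} 00000000 \\ 00000001 \\ 00000010 \\ \vdots \\ 11111111 \end{bmatrix}
\end{array}
\]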
3.3.8 Inverse Sessional Round Key
The result from the correlation above is the 10th round sessional key, since the 10th round is being attacked. This means that the key schedule must be inverted through 10 rounds in order to get the master key. The open-sourced Python inverse sessional key script [51] is used to derive the master key from a given DPA result. This can be done because the sessional key generator is predictable and easily calculated for a given input string. The figure below shows the reverse operation on a 16-byte string assuming that the sessional key DPA result is '00 01 02 . . . 0F'. The master key is highlighted along with the expected DPA result, both in hexadecimal.
Figure 3.4: 10 Rounds of Inverse Session Key
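The inversion described above can be sketched in a few lines of Python. This is a minimal illustration of reversing the AES-128 key schedule and is not the script from [51]; the function names (expand_key, invert_key_schedule) are hypothetical, and the S-box is generated rather than tabulated to keep the sketch short.

```python
# Illustrative sketch only; names are assumptions, not the script cited as [51].

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    """AES S-box: field inverse followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def g(word, rnd):
    """RotWord, SubWord and round-constant XOR on a 4-byte word."""
    out = [SBOX[b] for b in word[1:] + word[:1]]
    out[0] ^= RCON[rnd - 1]
    return out

def xor_words(a, b):
    return [x ^ y for x, y in zip(a, b)]

def expand_key(master):
    """Forward schedule: returns the 11 round keys of AES-128."""
    w = [list(master[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        temp = g(w[i - 1], i // 4) if i % 4 == 0 else w[i - 1]
        w.append(xor_words(w[i - 4], temp))
    return [bytes(sum(w[4 * r:4 * r + 4], [])) for r in range(11)]

def invert_key_schedule(round10_key):
    """Walk the schedule backwards from the 10th round key to the master key."""
    w = [None] * 44
    for j in range(4):
        w[40 + j] = list(round10_key[4 * j:4 * j + 4])
    for i in range(43, 3, -1):
        temp = g(w[i - 1], i // 4) if i % 4 == 0 else w[i - 1]
        w[i - 4] = xor_words(w[i], temp)
    return bytes(sum(w[0:4], []))

master = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # FIPS-197 example key
recovered = invert_key_schedule(expand_key(master)[10])
```

Each word of round r − 1 is the XOR of two words that are already known when walking downwards, which is why no search is needed once the last round key is recovered.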
3.3.9 Results & Analysis of Vulnerabilities
The result of the DPA algorithm, as previously discussed, is the 10th sessional
key within AES-128. The last stage of the algorithm gives a correlation matrix
from which the maximum correlation for each guess from 0-255 is taken. When this
max correlation vector is plotted for a given byte, a graph resembling Figure 3.5
is obtained. As an example, byte 4 will be analyzed.
Figure 3.5: 15,000 Traces Max Correlation Vector for Byte 4
Clearly there is a spike in the correlation of the normalized vector to the traces
graphed in Figure 3.5. The spike has an index of 128, but since Matlab
indexes from 1 instead of 0, the proper first byte of the 10th round sessional key is
(127)_{10} or (7F)_{16}. This matches the key needed in Figure 3.4 to recover the
master key’s first byte ‘03’.
All bytes were broken using the correlation matrices for their respective
bytes, and the results for 15,000 traces are shown in Appendix - 16-Byte Key DPA
Results. Of the 50,000 samples acquired, only the last 15,000 are used, which frees
at least a quarter of the computer's memory during the DPA calculations and
greatly accelerates the attack.
The threshold of the amount of data needed to break the cipher was tested and
it was determined that approximately 8,000 traces are required. For clarity, the
data given is at 15,000 traces.
On a 3 GHz processor, it takes 50 minutes to obtain every set of 2,000 traces
and ciphertext data and to execute the DPA algorithm with the inverse sessional
script. After all of the data is imported to the computer, the process takes
approximately 7 hours to run. These are extremely notable results compared with
the only other method of breaking AES-128, an exhaustive key search, which takes
billions of years.
The vulnerability analysis of where the power traces are susceptible to the DPA
attack is visually shown below in Figure 3.6. This is done by referencing back to
any power trace’s sample with an offset with respect to the key found in the DPA
algorithm.
Figure 3.6: Vulnerable Samples within a Power Trace
In the graph above, the 9th and 10th rounds of the encryption cover the last
15,000 samples, which were the samples being attacked. The 1-16 index at
the bottom of the figure shows the 16-byte key found and where exactly the trace
had a very high correlation with respect to the output data. The red stars on the
graph correspond to the sample values from 35,000-50,000, the samples of AES-128
in its final round.
3.3.10 Hardware Solutions Against DPA
Modern public and private-key systems have large risks to mitigate, with
very few cost-effective solutions implemented in hardware. These solutions
can be categorized into two types: creating new logic families, or
implementing an external circuit that works as a voltage-current buffer for the
encrypting core, eliminating the sensitive side channel at the system level entirely.
Creating new logic families [27–29] for encryption cores, such as MOS Current
Mode Logic, Sense Amplifier Based Logic, or Wave Dynamic Differential Logic, is
unrealistic and inefficient since every logical component of the chip would need
to be re-designed and calibrated accordingly. The high silicon area and power
overhead required by these methods do not justify the implementation cost of
replacing all gates in the hardware realization.
External circuits are a sensible solution, but they carry the burden of a large
power consumption and heavily bottleneck the throughput that modern systems
need to achieve.
In the literature, one of the most recognized and cited circuits is [30], a three-stage
switched capacitor current equalizer. The major drawbacks of this circuit are
a +44% power overhead and a −100% degradation in throughput efficiency.
This popular circuit does protect against a DPA over 10 × 10^6 power traces, but
it compromises strict performance standards that need to be met in any hardware
accelerator. Even an on-board application specific integrated circuit (ASIC)
solution on the same encryption die struggles to give reliable results.
The risks against application specific hardware solutions seem to be endless:
more threats arise while hardware solutions generally cannot deliver results.
In the last chapter, other potential hardware-software solutions are proposed. The
scope of this work will enable future work on realizable solutions against present
and future threats. Understanding the inner workings of the specific algorithm,
in this case AES-128, along with the way the ASIC or FPGA executes operations,
is the most important preliminary to patching these vulnerabilities and preventing
any associated threats.
3.4 Related Applications
After studying this attack on a complex symmetric cipher, it may seem daunting
to transfer the same skills to other ciphers; however, it is not, since the same rules
apply. Any part of an algorithm that handles sensitive data, and whose operations
at the desired state consume a distinctive amount of power, can be broken using
the same general method. The method is as follows.
Generalized DPA Attack
1. Determine the closest exterior state (data in or data out) and the byte-to-byte
independence.
2. Determine the desired interior state to be attacked.
3. Generate hypothetical values and/or keys.
4. Calculate the Hamming distance from the exterior state to the interior state.
5. Calculate the correlation between the Hamming distance matrix and the output
power consumption.
6. Acquire the interior state information.
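The six steps above can be sketched on simulated data. The following is a minimal illustration for a single key byte of the AES last round, assuming a Hamming-distance leakage between ST9 and the ciphertext in the same register, Gaussian noise, one sample per trace, and no ShiftRows reordering. All function names and parameters are hypothetical and this is not the Matlab implementation used in this work.

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    """AES S-box: field inverse followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()
INV_SBOX = [0] * 256
for i, s in enumerate(SBOX):
    INV_SBOX[s] = i

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def simulate_traces(key_byte, n_traces, noise_sigma, rng):
    """Steps 1-2: known ciphertext byte (exterior), leaky ST9 state (interior)."""
    ciphertexts, power = [], []
    for _ in range(n_traces):
        st9 = rng.randrange(256)
        ct = SBOX[st9] ^ key_byte
        ciphertexts.append(ct)
        power.append(hamming_distance(st9, ct) + rng.gauss(0.0, noise_sigma))
    return ciphertexts, power

def recover_key_byte(ciphertexts, power):
    """Steps 3-6: hypothesise keys, compute HD, correlate, take the maximum."""
    correlations = []
    for guess in range(256):
        hyp = [hamming_distance(INV_SBOX[ct ^ guess], ct) for ct in ciphertexts]
        correlations.append(pearson(hyp, power))
    best = max(range(256), key=lambda gss: correlations[gss])
    return best, correlations

rng = random.Random(1)
cts, pwr = simulate_traces(0x7F, 500, 1.0, rng)
best_guess, corr = recover_key_byte(cts, pwr)
```

Because Python indexes from 0, the index of the largest correlation is the key byte itself, with no off-by-one correction.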
In the case of an ECC SPM, the process requires knowledge of the SPM algorithm
being executed. As an example for analyzing potential threats, the Double-and-Add
method, the founding SPM, is shown in Algorithm 1.
Algorithm 1 Double-and-Add Scalar Multiplication
Input: Point P ∈ E, k = (k_{h−1} k_{h−2} ... k_1 k_0)_2, k_i ∈ {0, 1}
Output: kP ∈ E
1: R1 = P ; R2 = O (point at infinity);
2: for i = 0 to h − 1 do
3:   if k_i = 1 then
4:     R2 = R1 + R2;
5:   end if
6:   R1 = 2R1;
7: end for
8: The final value is R2 = kP
When looking at the main operations of this loop, which is completely key
dependent, there are only two operations that dictate the final result in register
R2. Line 4 leaks a large amount of power, since a point addition is executed only
when the current bit of the binary key string is a 1. If the key's index bit is a 0,
only the point doubling on Line 6 occurs, and the iteration always consumes less
power than its counterpart. This may seem trivial, but once it is understood
that point operations implemented in hardware consume distinctive amounts of
power, the vulnerability becomes highly evident [25].
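The key dependence of the operation sequence can be illustrated with a short sketch of the textbook right-to-left double-and-add, in which a doubling is performed every iteration and an addition only for 1 bits. As an assumption made purely for brevity, the elliptic curve group is replaced by the integers modulo N under addition, and the "power trace" is idealized as a list of operation labels.

```python
def double_and_add(k, P, add, double, identity):
    """Right-to-left double-and-add; returns (kP, list of operations)."""
    R1, R2 = P, identity
    ops = []
    for i in range(k.bit_length()):
        if (k >> i) & 1:
            R2 = add(R1, R2)
            ops.append("A")      # point addition: only when the key bit is 1
        R1 = double(R1)
        ops.append("D")          # point doubling: every iteration
    return R2, ops

def key_from_trace(ops):
    """Recover k from the A/D sequence alone, as an SPA attacker would."""
    bits, i = [], 0
    while i < len(ops):
        if ops[i] == "A":
            bits.append(1)
            i += 2               # skip the doubling that follows the addition
        else:
            bits.append(0)
            i += 1
    return sum(b << j for j, b in enumerate(bits))

N = 1_000_003                    # stand-in group: integers mod N under addition
add = lambda a, b: (a + b) % N
double = lambda a: (2 * a) % N

k = 0b1011001
result, trace = double_and_add(k, 7, add, double, 0)
recovered = key_from_trace(trace)
```

In a real measurement the labels would have to be inferred from the distinct power signatures of the two point operations, which is the leakage discussed above.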
Chapter 4
Secure High-Level Architecture
The high-level design of any ECC processor determines whether or not it is
vulnerable to various SCAs. Though all designs carry risks, proper techniques that
lower differences in power consumption, such as different coordinates over a finite
plane and a protected, highly regular SPM, can help ensure the safety of the
unsuspecting hardware.
In this chapter the point operations, proposed inverse operation, and scalar
point algorithms will be examined. A literature review is also completed to
establish the most efficient designs that attempt to secure their respective architectures.
4.1 Previous Research & Review
There are numerous architectures of accelerated cryptographic processors for many
different applications. The usual top figures of merit include clock speed (MHz),
area, speed of SPM (s), small countermeasures against SCAs, and optimized low-
level multiplication and inversion operations. Typically the combination of efficient
algorithms and a well organized architecture present the best solutions for their
individual objectives.
The finite field layer of the hardware is the most influential decision in the entire
design [5]. This is because the squaring and inversion operations within the layer
require both major aspects of the multiplier: the multiplication and reduction
stages. Table 4.1 shows the recent, most prominent processors.
These processors from [5] are controlled by a finite state machine (FSM) or
a micro-programmable controller (MCC). Implementing a state driven design on
all levels is necessary. Next, FPGAs are the hardware platform of choice due to
their reconfigurability, modularity, and ease of testing with increased or decreased
key lengths. Clearly the binary polynomial basis fields are the most popular, owing
to their effortless transition to hardware. Key sizes range from 163-571 bits; the
most popular is 233-bit. The benefit of choosing this key size is that the primitive
trinomial simplifies multiplication based operations at the finite field layer of the
architecture, which will be further discussed in the next chapter, Low-Level
Multiplier Implementation.
An inversion in the finite field layer does not hinder performance if coordinates
optimized for hardware are used. The most popular coordinate system, as seen
in the above review, is the projective coordinate system, specifically López-Dahab
coordinates [5, 8, 11, 12]. Projective coordinates replace all inversions within the
point operations with added multiplications over the new three-dimensional plane.
Therefore, if projective coordinates are employed, the field inversion is only
computed once after the SPM is completed, so that the result can be realized in
the original two-dimensional plane.
If Affine coordinates are used, the inversion operation is an extremely costly
low-level operation; the inversion when m ≥ 128 requires approximately 7
multipliers [39]. Table 4.2 below lists the number of hardware operations needed
to compute the large point operation in GF(2^m) [39].
Table 4.1: Keynote ECC Processors in Literature
Ref. | Platform | Control | Basis | Bit | Clk (MHz) | Area (Slices) | SM (s) | SM | Coordinate | Mult./Inv. | Protected
[5] [8] | XCX2V6000 | FSM | Bin. Poly. | 163 | 93.3 | 16188 | 34.11 | López-Dahab | LD-Proj. | MSD | SPA & Timing
[9] | XC2V8000 4 | - | Bin. Poly. | 233 | 62.5 | 15365 | 7.2 | Montgomery | Projective | Karatsuba mult. | SPA & Timing
[10] | XCX5VLX50 | FSM | Bin. Poly. | 233 | 93.3 | 3073 | - | Binary method | Affine | R2L Shift mult. | -
[11] | XC4VFX100 | FSM | Bin. Poly. | 571 | 93.3 | 12894 | 224 | Montgomery | LD-Proj. | Interleaved mult. | -
[12] | Altera Stratix II | FSM/MCC | Binary Poly. | 163 | 163 | 14280 | 11.71 | R2L, L2R NAF | LD-Proj. | Itoh-Tsujii inv. | SPA & Timing
[13] | Virtex-4 | FSM | Bin. Poly. | 163 | 100 | 3528 | 1070 | Binary method | Affine | Interleaved mult. | SPA & Timing
[14] | Virtex-6 | MCC | Prime | 256 | 60 | 20.8k | 6.1 | R2L, L2R NAF | Affine & Proj. | Interleaved mult. | SPA & Timing & Fault
[15] | Balsa | FSM | Bin. Poly. | 233 | - | 0.8025 mm² | 919 | Montgomery | Affine | Karatsuba mult. | SPA & Timing
Table 4.2: Hardware Costs of Point Addition
Hardware Operation | Affine (x, y) | Projective (X, Y, Z)
Inversion | 2 log k + 1 | 1
Multiplication | 2 log k + 4 | 6 log k + 10
Cost: m ≥ 128 (I : M) | 1 : 624 | 1 : 241
Lastly, the SPM and security will be reviewed. The scalar point multiplication's
speed is measured by how long a design takes to complete a single m-bit
SPM with respect to the frequency of the clock.
Among the listed designs, the Binary Recoding Method reduces the number
of point additions by recoding the highest degree of the polynomial
a(x) = a_{m−1}x^{m−1} + ... + a_1 x + a_0, a_i ∈ GF(2) [10, 17]. The Right-to-Left
& Left-to-Right Non-Adjacent Form (NAF) further reduces point additions with
precomputed LUTs in memory [12, 14]. Both are computationally faster than the
Double-and-Add (Algorithm 1), reducing the number of point additions at the cost
of a single additional point double operation. The NAF algorithm is a derivative
of the Recoding method and shares the same SPM qualities. Algorithm 1 and the
Recoding method are equally unprotected against SCAs; these SPMs are never
suitable when explicitly fighting SCAs.
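As a small illustration of why recoding saves point additions, the sketch below computes the standard non-adjacent form of an integer scalar and compares its number of nonzero digits with the plain binary expansion. It is a generic textbook construction under the assumption of an integer scalar, not the recoding hardware of [10, 12, 14, 17]; the function name naf is hypothetical.

```python
def naf(k):
    """Digits of the non-adjacent form of k, least significant first.

    Each digit is -1, 0 or +1 and no two neighbouring digits are nonzero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)      # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

k = 0b111101111
naf_digits = naf(k)
binary_additions = bin(k).count("1")                  # one addition per 1 bit
naf_additions = sum(1 for d in naf_digits if d != 0)  # one add/subtract per nonzero digit
```

A digit of −1 corresponds to a point subtraction, which is cheap on elliptic curves because negating a point is almost free. The longer recoded string is the source of the one extra point doubling mentioned above.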
The most attractive SPM is the Montgomery ladder method. Montgomery’s
algorithm is one of the fastest SM algorithms in practice due to the unique
mathematical qualities it holds. This makes it the pinnacle of success for recent
high-speed architectures. Although this algorithm simplifies SPMs due to its highly
regular structure, it is vastly susceptible to modern timing and differential power
attacks [3, 35]. Though the reviewed designs state their resistance to certain
SCAs, they are not secure against the previous chapter’s DPA attacking method,
especially if paired with a timing or ZVP attack¹. The SPM algorithm and point
operations dictate the overall security of the processor.
¹ Zero-value point attacks require the attacker to manipulate the base point
pre-programmed in the hardware’s memory, making them unrealistic to test at this
point in this research.
4.2 Secure Scalar Point Multiplication
Since its discovery in 1987, P. L. Montgomery’s ladder [6] has been the staple of
hardware designs as the optimized and, arguably, leading dynamic SPM algorithm
in ECC. What the ladder offers in pure computational speed and
regularity, it lacks in immunity from contemporary SCAs.
In 2009, M. Joye proposed an m-ary generalization of the Montgomery ladder
which would pave the way for an SCA resistant SPM algorithm [3].
This section will address the security of the two elite, left-to-right (L2R) SPM
algorithms with respect to timing and differential power attacks. Accelerators in
practice initiate countless SPMs in a single ECC protocol, and the biggest security
vulnerability is the SPM gateway operation.
4.2.1 Montgomery’s Algorithm
The high-speed Algorithm 2 is Montgomery’s laddering method. It is heavily
dynamic and is widely used as the leading SPM. Its vital invariant property,
P = Y − X in every state, leads to its keynote qualities. Montgomery’s
algorithm computes both (x, y) coordinates in any system, i.e. affine or
projective, depending only on the present and previous x coordinates, as
mathematically proven in [2]. Also, the (x, y) coordinates of the next point on the
curve can be computed in parallel, giving the option of a semi-pipelined design as
shown in [7, 9].
The highly regular essence of this popular SM has been thought to be secure
because of its invariant states, which protect it against simple power analysis and
safe-error attacks [6]. The architecture of Montgomery’s ladder is extensively
explored in [7]. Advancements in malicious attacks against security cores make this
algorithm no longer safe.
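The ladder and its invariant P = Y − X can be checked with a short sketch. As an assumption made for brevity, the curve group is replaced by the integers modulo N under addition, so the helpers add, double and sub are stand-ins for the point operations rather than real curve arithmetic.

```python
def montgomery_ladder(k, P, add, double, sub):
    """Left-to-right Montgomery ladder; k must be >= 1 (top bit is 1)."""
    bits = bin(k)[2:]                # k_{h-1} ... k_0
    X, Y = P, double(P)
    for bit in bits[1:]:             # i = h-2 down to 0
        if bit == "1":
            X, Y = add(X, Y), double(Y)
        else:
            X, Y = double(X), add(X, Y)
        # The difference between the two registers never changes.
        assert sub(Y, X) == P
    return X

N = 1_000_003                        # stand-in group: integers mod N
add = lambda a, b: (a + b) % N
double = lambda a: (2 * a) % N
sub = lambda a, b: (a - b) % N

result = montgomery_ladder(0b110101, 9, add, double, sub)
```

Every iteration performs exactly one addition and one doubling, which is the regularity referred to above; only the registers that receive the two results depend on the key bit.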
Algorithm 2 Montgomery Scalar Multiplication [2]
Input: Point P ∈ E, k = (k_{h−1} k_{h−2} ... k_1 k_0)_2, k_i ∈ {0, 1}, k_{h−1} = 1
Output: kP ∈ E
1: Int: X = P ; Y = 2P ;
2: for i = h − 2 down to 0 do
3:   if k_i = 1 then
4:     X = X + Y ; Y = 2Y ;
5:   else
6:     Y = X + Y ; X = 2X;
7:   end if
8: end for
9: The final value is X = kP
At first glance at the loop on Lines 2-8, one can see regularity in both states.
However, the output buses from registers X and Y carry different operations in
the two states, Lines 4 & 6, making DPA the obvious measurement for revealing
the loop characteristics.
These characteristics, with reference to the Generalized DPA Attack of the
previous chapter, are susceptible to the correlation of generated hypothetical power
with the actual output power of the main registers X and Y. Since the targeted
interior state is known, the HD needs to be calculated from the output data to the
hypothetical scalars/keys once the algorithm finishes a single m-bit SPM. Below
is a hypothetical scenario for breaking a key establishment with a DPA attack.
Attack Against a Key Establishment
To find the public-private key (scalar k) during a key establishment, successive
runs of 5-25 × 10^3 random scalars fed into the SPM system will calculate random
output points. The HD between the output scalars of the output points and the
hypothetical power consumption will be calculated. This matrix will be correlated,
using Equation (2.11), to the output power consumption of the overall output bus
and registers X and Y. Knowing the public base point P, the private key will be
exposed.
The attacker needs to be aware that each bit of the scalar is dependent on
previously indexed bits; therefore all bits need to be analyzed as one set of data,
in contrast to AES-128 where each byte is independent. DPA paired with a timing
attack focused on output register activity would break down the movement of data
from X ↔ Y, causing any system using the Montgomery ladder to lose its overall
authenticity.
4.2.2 Joye’s Algorithm with Hardware Design
Being a more secure byproduct of Montgomery’s ladder, Joye’s SPM, Algorithm
3, possesses many of the great qualities of Algorithm 2, including regularity,
high speed, and low hardware cost. However, it does not have
the invariant quality of the ladder. The pertinent difference between them is that
register X is active twice sequentially in every state of the SPM loop and there is a
point addition correction on register X in the final step. This makes the algorithm
regular, but not invariant.
Algorithm 3 Joye’s Scalar Point Multiplication (L2R) [3]
Input: Point P ∈ E, k = (k_{h−1} k_{h−2} ... k_1 k_0)_2, k_i ∈ {0, 1}, k_{h−1} = 1
Output: kP ∈ E
1: Int: X = (k_{h−2} + 1)P ; Y = 2P ;
2: for i = h − 3 down to 0 do
3:   if k_i = 0 then
4:     X = 2X; X = X + P ;
5:   else
6:     X = 2X; X = X + Y ;
7:   end if
8: end for
9: X = X + P ;
10: The final value is X = kP
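A direct transcription of Algorithm 3 into Python makes it easy to confirm that the final correction yields kP and that the sequence of operations on register X does not depend on the key bits. As an assumption made for brevity, the curve group is again replaced by the integers modulo N under addition; the operation labels are an idealization and say nothing about the analog power behaviour of real hardware.

```python
def joye_l2r(k, P, add, double):
    """Algorithm 3 on a stand-in group; returns (kP, operations on X)."""
    bits = [int(b) for b in bin(k)[2:]]     # k_{h-1} ... k_0
    if len(bits) < 2:
        raise ValueError("this sketch assumes a scalar of at least 2 bits")
    ops = []
    Y = double(P)
    X = Y if bits[1] else P                 # X = (k_{h-2} + 1) P
    for bit in bits[2:]:                    # i = h-3 down to 0
        X = double(X)
        ops.append("D")
        X = add(X, Y if bit else P)         # only the second operand differs
        ops.append("A")
    X = add(X, P)                           # final correction
    return X, ops

N = 1_000_003                               # stand-in group: integers mod N
add = lambda a, b: (a + b) % N
double = lambda a: (2 * a) % N

res_a, ops_a = joye_l2r(0b1000001, 5, add, double)
res_b, ops_b = joye_l2r(0b1111111, 5, add, double)
```

Before the correction, X holds (k − 1)P for the bits processed so far, which is why a single addition of P at the end is sufficient.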
The fact that only one register holds sensitive information makes
it impossible to differentiate between state 1, Line 4, and state 2, Line 6. The 1-bit
k_i register, holding the i-th index of the scalar k, dictates the internal and external
power consumed by the hardware, just like Montgomery’s ladder. The difference is
that internal register X acts as a buffer to the external power consumption,
rendering a DPA attack ineffective.
If a timing attack were employed, it would not be able to characterize
any difference in k_i’s behaviour, since both sequential commands realized
in hardware are blocking statements on the same register that result in identical
activity compared to the next loop index i + 1. Completed in a single clock cycle
plus a minuscule logic delay, states 1 and 2 are identical. If it were possible
to inject a faulty word or fake operation into the ALU to affect register
X, the attacker again would not be able to predict with full certainty whether the
first or second blocking statement in either state 1 or 2 was executed.
Figure 4.1: Joye’s SPM Hardware Block Diagram
Figure 4.1 shows the hardware concept design of Joye’s algorithm with minimal
complexity to lower the area. The synchronous control unit includes a counter
register (count), an initialization signal/flag (int), and the 1-bit k_i key register.
If int = 1 following a reset (rst), which resets the internal i-th index, the hardware
will initialize the registers X = (k_{h−2} + 1)P and Y = 2P.
In the next clock cycle, int = 0 will enable the counter and Pt.Add, and count
will increment while the hardware jumps between states 1 and 2 of Algorithm 3 on
the positive edge of the clock until reaching (m − 2 + 1). After the final cycle
concludes, the next clock will compute the correction on register X. Finally, the
output will be available on the following clock cycle in register X = kP.
4.2.3 Comparison
Table 4.3 below compares Algorithm 2 and Algorithm 3 and shows how they
counter the three major SCAs discussed.
Table 4.3: Comparison of SPM SCA Protection
Montgomery’s Algorithm | Joye’s Algorithm
Invariant-Regular | Regular
NOT resistant to Timing Attacks | Resistant to Timing Attacks
NOT resistant to C, M Safe-Error Attacks | Resistant to C, M Safe-Error Attacks
NOT resistant to Power Analysis | Resistant to Power Analysis
Evidently, Montgomery’s algorithm has no resistance to the SCA attacks
outlined, while Joye’s is fortified. There are other SCAs that both algorithms are not
fully secure against, for example, M safe-error fault attacks [35]. Certainly Joye’s
algorithm is not as computationally fast as Montgomery’s due to its underlying
mathematics, but when the cryptographic core’s main purpose is to maintain
authenticity, Joye’s would be more suitable in small applications requiring 128-bit
security.
Ultimately, security will be maintained at the highest level of operation
by the proposed hardware design modeling Joye’s SPM. Establishing the
rest of the hardware is imperative and will be the subsequent focus, with the future
goal of an all-programmable system-on-chip (SoC) FPGA implementation.
4.3 Optimized Point Operations
After studying Chapter 2 and the algebra of the point operations, a more profitable
coordinate system can be established for progressive hardware implementations.
For example, the point doubling Equation (2.8) contains 2 multiplications and 1
inversion. This is extremely costly since an arithmetic inversion is the most
expensive operation in any ECC ALU implementation [8]. The point doubling and
addition Equations (2.8) and (2.9) can be easily mapped to a more efficient plane
to further improve the operability of the low-level hardware. All the following
derivations can be found in [39].
Recalling that the point at infinity has no distinct Affine coordinates, point P is
mapped to a projective plane such that
\[ P(x, y) = P(X, Y, Z), \quad Z \neq 0 \in E \tag{4.1} \]
in which the point at infinity is defined as O = (1, 0, 0). Any arbitrary point
P(X, Y, Z) still satisfies O + P = P + O = P. In addition, −P =
(X_1, X_1 + Y_1, Z_1) is very similar to the Affine representation of −P.
This coordinate system needs to be applied to reduce the number of finite field
inversions discussed in Table 4.2. The most favoured type of projective coordinates
is the López-Dahab (LD) representation, where (x, y) = (X/Z, Y/Z^2), Z ≠ 0 ∈ E,
and preferably Z_b = 1 to simplify operations² [39]. Below, the forward conversion
is shown by Equations (4.2) and the curve of Equation (2.5) is mapped to
(4.3) as follows:
\[ \begin{aligned} Z_b &= 1 \\ X &= xZ \\ Y &= yZ^2 \end{aligned} \tag{4.2} \]
\[ E_K : Y^2 + XYZ = X^3 Z + aX^2 Z^2 + Z^4, \quad a \in E_K \tag{4.3} \]
Using LD-coordinates requires the point doubling and addition equations to
be mapped accordingly, with a = 1 for an optimal cofactor. The following
subsections will cover the hardware developed for the datapath in order to design the
point operations appropriately.
² Z_b is the Z-coordinate of the base point P.
The analysis of different projective coordinates is explored in [7]. The memory
and power consumption of LD-coordinates are the lowest among the top projective
systems reviewed in recent literature³.
Just as in the synchronous control unit in Joye’s algorithm, the register count in
the point double and addition will increment every clock cycle to initiate the
subsequent operations within the datapath after the reset signal. The focus will be on
the datapath design rather than the control unit, since the designs are parallelized
in contrast to traditional serial designs [8, 12, 19].
4.3.1 Point Double with Datapath Schematic
Equations (2.8) are now mapped to their LD form with a set of three equations.
The resulting point 2P(X_3, Y_3, Z_3) requires squaring operations to replace the prior
inversion operation. Equations (4.4) are as follows, where X_3, Y_3, Z_3 ∈ E_K.
\[
\begin{aligned}
Z_3 &= X_1^2 Z_1^2 \\
X_3 &= X_1^4 + Z_1^4 \\
Y_3 &= Z_1^4 Z_3 + X_3 (Z_3 + Y_1^2 + Z_1^4)
\end{aligned}
\tag{4.4}
\]
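Equations (4.4) can be sanity-checked numerically against affine doubling. The sketch below does this exhaustively on a toy field; the field GF(2^5) with reduction polynomial x^5 + x^2 + 1, the helper names, and the textbook affine doubling formula (λ = x + y/x) used as the reference are all assumptions chosen to keep the check small, and are not taken from the hardware design.

```python
M = 5                     # toy field degree (assumption; the design targets m = 233)
POLY = 0b100101           # x^5 + x^2 + 1
A, B = 1, 1               # curve y^2 + xy = x^3 + A x^2 + B

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M) with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_sq(a):
    return gf_mul(a, a)

def gf_inv(a):
    """Inverse of a nonzero element as a^(2^M - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = gf_mul(r, a)
    return r

def on_curve(x, y):
    lhs = gf_sq(y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_sq(x), x) ^ gf_mul(A, gf_sq(x)) ^ B
    return lhs == rhs

def affine_double(x, y):
    """Reference affine doubling (needs one inversion); requires x != 0."""
    lam = x ^ gf_mul(y, gf_inv(x))
    x3 = gf_sq(lam) ^ lam ^ A
    y3 = gf_sq(x) ^ gf_mul(lam ^ 1, x3)
    return x3, y3

def ld_double(X1, Y1, Z1):
    """Equations (4.4): 3 multiplications, 5 squarings, no inversion."""
    X1s, Z1s = gf_sq(X1), gf_sq(Z1)
    Z3 = gf_mul(X1s, Z1s)
    Z1q = gf_sq(Z1s)
    X3 = gf_sq(X1s) ^ Z1q
    Y3 = gf_mul(Z1q, Z3) ^ gf_mul(X3, Z3 ^ gf_sq(Y1) ^ Z1q)
    return X3, Y3, Z3

def ld_to_affine(X, Y, Z):
    zi = gf_inv(Z)
    return gf_mul(X, zi), gf_mul(Y, gf_sq(zi))

points = [(x, y) for x in range(1, 2 ** M) for y in range(2 ** M) if on_curve(x, y)]
mismatches = 0
for x, y in points:
    for Z in range(1, 2 ** M):
        X, Y = gf_mul(x, Z), gf_mul(y, gf_sq(Z))     # forward map (4.2)
        if ld_to_affine(*ld_double(X, Y, Z)) != affine_double(x, y):
            mismatches += 1
```

The check covers every affine point with x ≠ 0 and every nonzero Z, so the agreement does not rely on the Z = 1 special case.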
The schematic of the LD point doubling circuit is shown in Figure 4.2. It requires
3 multiplication (Mj), 5 squaring (Sj), and zero inversion operations within the
curve’s finite field. This design needs all 3 m-bit XOR gates, along with all of the
other operating blocks, Sj and Mj, to be independent. No more than a single
operating block and one m-bit XOR will be computed within one clock cycle.
³ Jacobian, Standard, and Montgomery projective coordinates are the other top
projective systems besides López-Dahab.
Figure 4.2: LD - Point Double Datapath Schematic
On the first clock cycle, this circuit will square all three inputs X1, Y1, Z1 in
parallel. On the next cycle, the outputs X3 and Z3 are computed and will be
available on count = 2. Y3 is available on the following cycle as it is the last
computation after X3 and Z3. From Equations (4.4), the next point's x, y coordinates
X3 and Y3 depend on prior computations. The critical paths in Figure 4.2 start
through the two squaring operations on X1 or Z1 that lead to the first XOR on the
output of S2 and S3. From here, there are multiple paths that end at Y3, each
requiring the same amount of logical delay, resulting in a critical latency of
1 multiplication and 2 squaring operations.
4.3.2 Point Addition with Datapath Schematic
Lastly, Equations (2.9) are mapped to the projective plane using LD-coordinates.
Equations (4.5) show the flow of operations within the set with intermediate reg-
isters A-G. The point addition of P +Q = R(X3, Y3, Z3) is as follows:
\[
\begin{aligned}
A &= Y_2 Z_1^2 + Y_1 & B &= X_2 Z_1 + X_1 \\
C &= Z_1 B & D &= B^2 (C + Z_1^2) \\
Z_3 &= C^2 & E &= AC \\
X_3 &= A^2 + D + E & F &= X_3 + X_2 Z_3 \\
G &= (X_2 + Y_2) Z_3^2 & Y_3 &= (E + Z_3) F + G
\end{aligned}
\tag{4.5}
\]
where the datapath’s design is shown in Figure 4.3. Intermediate registers A-G are
labelled on the hardware corresponding to the resulting output. In total, 8
multiplication, 5 squaring, and again, zero inversion operations are required to compute
the point addition in this parallelized manner.
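The addition formulas of Equations (4.5) can be checked against affine addition in the same way as the doubling. The sketch assumes a toy field GF(2^5) with reduction polynomial x^5 + x^2 + 1, a = 1, a second operand (X2, Y2) given with Z = 1, and the textbook affine addition formula as the reference; the helper names are hypothetical and only a few Z1 values are sampled.

```python
M = 5                     # toy field degree (assumption; the design targets m = 233)
POLY = 0b100101           # x^5 + x^2 + 1
A_COEF, B_COEF = 1, 1     # curve y^2 + xy = x^3 + x^2 + 1

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M) with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_sq(a):
    return gf_mul(a, a)

def gf_inv(a):
    """Inverse of a nonzero element as a^(2^M - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = gf_mul(r, a)
    return r

def on_curve(x, y):
    lhs = gf_sq(y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_sq(x), x) ^ gf_mul(A_COEF, gf_sq(x)) ^ B_COEF
    return lhs == rhs

def affine_add(x1, y1, x2, y2):
    """Reference affine addition; requires x1 != x2."""
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_sq(lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return x3, y3

def ld_add(X1, Y1, Z1, X2, Y2):
    """Equations (4.5): 8 multiplications, 5 squarings, no inversion."""
    Z1s = gf_sq(Z1)
    A = gf_mul(Y2, Z1s) ^ Y1
    B = gf_mul(X2, Z1) ^ X1
    C = gf_mul(Z1, B)
    D = gf_mul(gf_sq(B), C ^ Z1s)
    Z3 = gf_sq(C)
    E = gf_mul(A, C)
    X3 = gf_sq(A) ^ D ^ E
    F = X3 ^ gf_mul(X2, Z3)
    G = gf_mul(X2 ^ Y2, gf_sq(Z3))
    Y3 = gf_mul(E ^ Z3, F) ^ G
    return X3, Y3, Z3

def ld_to_affine(X, Y, Z):
    zi = gf_inv(Z)
    return gf_mul(X, zi), gf_mul(Y, gf_sq(zi))

points = [(x, y) for x in range(2 ** M) for y in range(2 ** M) if on_curve(x, y)]
checked = mismatches = 0
for x1, y1 in points:
    for x2, y2 in points:
        if x1 == x2:
            continue                      # P = Q or P = -Q is not handled by (4.5)
        for Z1 in (1, 2, 7, 19):
            X1, Y1 = gf_mul(x1, Z1), gf_mul(y1, gf_sq(Z1))
            got = ld_to_affine(*ld_add(X1, Y1, Z1, x2, y2))
            checked += 1
            if got != affine_add(x1, y1, x2, y2):
                mismatches += 1
```

The cases x1 = x2 are skipped because they give B = 0 and hence Z3 = 0, which corresponds to a doubling or to the point at infinity rather than a general addition.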
Requiring 5 clock cycles to complete, an improvement on the 8 cycles of the
parallel design in [39], this datapath has been broken into ten sections for increased
speed and functionality.
In order, the output coordinates of the next point R begin with Z3 becoming
available when count = 2. X3 is available after the following clock cycle, when
count = 3. The critical latency depends on the calculations of M1, M2, M3,
onward to the operation of X3, and ending through M7 and the last m-bit XOR.
Y3 is available on the following cycle, after count = 4. The bottleneck in
this design is the computation of X3 during the count = 0-2 cycles, due to the three
multiplication operations and no squaring. It is quite evident that the point
addition consumes a considerably larger amount of power than the
point double, since it has more than twice as many compulsory multiplication
operations. The critical latency is 4 multiplication operations.
4.3.3 Review of Hardware
FPGA designs are crucial because of the dynamic nature of the design process
for ECC processors. The designs must be robust to support rapid prototyping
for testing different key lengths, curve coefficients, finite fields, mixed-coordinates⁴,
and finite field multipliers. Generally speaking, the more specific the application
of the processor, the more efficient it can be.
Figure 4.3: LD - Point Addition Datapath Schematic
Both proposed designs utilize a parallel architecture requiring a synchronous
state machine controller, which allows the high-level datapath to produce efficient
results. No inversion operations need to be computed due to the LD-coordinates
selected. This may seem minor, but when implemented on a NIST medium
scale curve such as K-233 or K-283 [34], the number of inversions becomes
unrealistic to implement on an affordable FPGA due to speed and complexity restrictions.
The HDL pseudo-code for the datapath designs can be found in Appendix - Verilog
HDL Pseudo Scripts.
⁴ Mixed coordinates are typically used to reduce the number of multiplications using
Frobenius maps to lower the critical path latency, instead of using a single projective
coordinate system during point operations [12].
Both operations are designed with individual squaring and multiplication blocks
because the targeted Kintex-7 FPGA has plenty of space
available. The serial multiplier implemented in the following chapter consumes
less than 0.5% of the area after synthesis and before place & route. The area cost
of this high-level design can be tolerated because the small serialized multiplier
enables the overall architecture to be tested at an average SPM speed.
4.4 Multiplicative Inverse
The isolated inverse operation within this architecture is the last operation com-
puted after the entire SPM is completed. It is only computed once to act as the
conversion from LD-coordinates back to Affine. Since the EC scalar point infor-
mation to be used in ECC protocols resides on the two dimensional plane, the
conversion is crucial. The backwards conversion is as follows:
\[ x_Q = \frac{X}{Z}, \qquad y_Q = \frac{Y}{Z^2} \tag{4.6} \]
where x_Q and y_Q are the Affine coordinates of the scalar multiplication point
Q = kP.
To execute this backwards conversion, the two inverses 1/Z and 1/Z^2 are
required. Because GF(2^m) is a field, 1/Z^2 = (1/Z)^2, so a single field inversion
followed by a squaring suffices. To compute the inverse, the Extended
Euclidean Algorithm (EEA) is computed over GF(2^m).
4.4.1 Binary Extended Euclidean Algorithm
The binary EEA is a simplified version of Euclid’s algorithm that finds coefficients
x and y such that ax + by = gcd(a, b). When finding the multiplicative
inverse of a polynomial, a = A(x) and b = P(x) with gcd(a, b) = 1, and the inverse
is found by solving the previous equation for x. Algorithm 4 can be found in [39],
but the modified version below is developed for a logical transition to a hardware
implementation.
The output of Algorithm 4 is A^{−1}(x), found by the following major steps. Within
the inner while loops, operand registers U and V are divided by x for as long as they
remain divisible by x, i.e. while the register mod x = 0. Both loops can be developed in
parallel hardware as long as the (U, V ≠ 1) condition is true. A serialized design is
of greater benefit since the inversion is, again, only computed once, and the speed
gain would be negligible compared to the overall computational time. This solitary
algorithm is the final step and produces the true output of any projective coordinate
based SPM processor.
4.5 High-Level Summary
The most efficient high-level operations were discussed and broken down into
related blocks of the progressing design.
Joye’s algorithm proved to be superior to Montgomery’s ladder from an SPM
security aspect. The point doubling and addition were outlined using the
LD-coordinates to create a customized datapath circuit for each point operation.
Lastly, the ideal binary multiplicative inverse, the EEA, was explained and
modified to better fit the proposed architecture.
Algorithm 4 Extended Euclidean Algorithm in Hardware
Input: Primitive Poly. P(x), Poly. A(x) ∈ GF(2^m)
Output: A^{-1}(x) mod P(x)
1: Init: U ← A(x); V ← P(x); G ← 1; H ← 0;
2: while (U, V ≠ 1) do
3:   while (U mod x = 0) do
4:     U ← shiftRegRight(U);
5:     if (G mod x = 0) then
6:       G ← shiftRegRight(G);
7:     else
8:       G ← shiftRegRight(G ⊕ P);
9:     end if
10:   end while
11:   while (V mod x = 0) do
12:     V ← shiftRegRight(V);
13:     if (H mod x = 0) then
14:       H ← shiftRegRight(H);
15:     else
16:       H ← shiftRegRight(H ⊕ P);
17:     end if
18:   end while
19:   if [deg(U) > deg(V)] then
20:     U ← U ⊕ V; G ← G ⊕ H;
21:   else
22:     V ← V ⊕ U; H ← H ⊕ G;
23:   end if
24: end while
25: if U = 1 then
26:   Output ← G;
27: else
28:   Output ← H;
29: end if
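The register operations of Algorithm 4 can be cross-checked with a software model. The following is a minimal Python sketch under stated assumptions, not the hardware design itself: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i), and the helper names gf2_mulmod and gf2_inverse are hypothetical, introduced only for this illustration.

```python
def gf2_mulmod(a, b, p):
    """Reference shift-and-add product a(x)*b(x) mod p(x) over GF(2)."""
    m = p.bit_length() - 1          # degree of the field polynomial
    r = 0
    while b:
        if b & 1:
            r ^= a                  # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                     # multiply a by x
        if (a >> m) & 1:
            a ^= p                  # reduce when degree m is reached
    return r


def gf2_inverse(a, p):
    """Binary extended Euclidean inversion, following Algorithm 4.

    Assumes 0 < a < 2**m and that p is irreducible with constant term 1.
    """
    if a == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    u, v, g, h = a, p, 1, 0
    while u != 1 and v != 1:
        while u & 1 == 0:           # U mod x = 0
            u >>= 1
            g = g >> 1 if g & 1 == 0 else (g ^ p) >> 1
        while v & 1 == 0:           # V mod x = 0
            v >>= 1
            h = h >> 1 if h & 1 == 0 else (h ^ p) >> 1
        if u.bit_length() > v.bit_length():   # deg(U) > deg(V)
            u ^= v
            g ^= h
        else:
            v ^= u
            h ^= g
    return g if u == 1 else h


if __name__ == "__main__":
    f233 = (1 << 233) | (1 << 74) | 1   # trinomial used for the 233-bit field
    a = 0x123456789ABCDEF
    inv = gf2_inverse(a, f233)
    print(gf2_mulmod(a, inv, f233) == 1)
```

Multiplying an element by its computed inverse and reducing modulo P(x) should return 1, which gives a simple way to compare a hardware test bench against the model.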
Chapter 5
Low-Level Multiplier
Implementation
In this chapter, the implementation of comparable 233-bit and 283-bit finite field
multipliers is presented. As the low-level operations, they are among the most
essential building blocks for the efficiency of the ECC processor. Of the many
multipliers available for various applications, the classic parallel multiplier and a
popular serial multiplier are implemented. A special case of the parallel multiplier
leads to the development of a squaring operator.
The sections begin by choosing the correct key size for the intended applications,
then deal with strict throughput constraints, and finally contrast both multipliers
using their respective synthesis results. The overall proposed SPM design is then
outlined to highlight the interconnected parts of the entire hardware design.
5.1 Finite Field Multiplier
Recalling from Table 4.1, the key size ranges from 163 to 571 bits, where the most
popular is 233 bits. Recent designs [9, 10, 15] express the large computational
advantage of choosing a 233-bit key with a polynomial basis, which provides the
primitive trinomial t^233 + t^74 + 1. This trinomial provides a great level of
symmetric security while reducing the complexity of the circuit and increasing the
speed of any operation using the field multiplier [14].
The area is reduced by over 30% on larger FPGA designs by having a single middle
term, t^74, rather than the three middle terms of the NIST-approved pentanomial
t^283 + t^12 + t^7 + t^5 + 1 [34]. A pentanomial is the only way to increase the
key size with a polynomial basis to the 283-bit standard level, while the 233-bit
key has benefits geared towards performance.
Since the overall goal is to attack the entire proposed architecture, the multiplier
serves its purpose on a smaller scale, which contradicts what the architectures
of [5] suggest. Established algorithms that provide speed and area benefits should
be further explored for their merits [39] in different devices, but are not
scrutinized here. As Table 5.1 shows, the two types of multipliers selected have
very different qualities.
Table 5.1: Parallel vs. Serial Finite Field Multiplier
Multiplier | Field   | Clock Cycles | Circuit Complexity (Gates)
Parallel   | GF(2^m) | 1            | m^2 ANDs, m^2 − 1 XORs
Serial     | GF(2^m) | m            | 2 m-bit regs, m ANDs, m + 1 XORs
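To make the scale of Table 5.1 concrete, the formulas can be evaluated for the two key sizes considered here. This is a small Python sketch that only evaluates the expressions in the table; the function names parallel_cost and serial_cost are hypothetical, and the results are formula values, not synthesis results.

```python
def parallel_cost(m):
    """Gate and cycle counts of the parallel multiplier, per Table 5.1."""
    return {"cycles": 1, "and_gates": m * m, "xor_gates": m * m - 1}


def serial_cost(m):
    """Gate, register, and cycle counts of the serial multiplier, per Table 5.1."""
    return {"cycles": m, "m_bit_regs": 2, "and_gates": m, "xor_gates": m + 1}


if __name__ == "__main__":
    for m in (233, 283):
        print(m, parallel_cost(m), serial_cost(m))
```

For m = 233 the parallel form needs 54,289 AND gates against 233 for the serial form, which is the area-for-cycles trade the table describes.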
Serial multipliers such as the Interleaved [13], Karatsuba [9], and Montgomery
sequential methods all accomplish the field multiplication within [m/32, m] clock
cycles. This range is covered in small increments when processing multiples of
32-bit data streams [40]. While the Interleaved and Karatsuba multipliers are
slightly faster than Montgomery’s smaller multiplier, as shown in Table 4.1, all
three have roughly the same performance-to-size ratio, making each a viable
choice.
The following designs are targeted at a Kintex-7 FPGA having 218,600 LUTs,
437,200 flip-flop registers, and 350,000 programmable logic cells. The Zynq-7000
AP SoC (XC7Z045) embodies this FPGA for seamless data transfer to and from the
ECC processing circuit. The on-chip 32-bit Cortex-A9 ARM CPU will require a
throughput controller, as there is a fixed number of bonded I/O pins on the FPGA.
Since ECC uses key sizes well over 128 bits, a first in, first out (FIFO) controller
has been made to manage input data traffic.
5.1.1 FIFO for Large Keys
A FIFO is a digital circuit that buffers large input data onto a register stack into
equal length words, in this case 32-bit words, to output the words in the sequential
order that they were addressed. Specifically, this synchronous 32-bit FIFO is the
throughput solution that can be used as the top module for both 233 & 283-bit
keys. The FIFO controller and 32-bit buffering registers are shown below in Figure
5.1.
Figure 5.1: 32-bit FIFO Schematic
Having a depth of 8 and 9 words respectively, the design can be modified by
updating the internal counter to accommodate either key size. After synthesis, this
circuit consumes 38 LUTs as logic and 43 registers as flip-flops, which is ≤ 0.2% of
the total slices and therefore negligible. The HDL code for this design can be
found in Appendix - Verilog HDL Scripts: 32-bit FIFO.
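The buffering behaviour described above can be sketched in software. The following Python model is illustrative only: it assumes the least significant 32-bit word is written first (the thesis does not state the word order), and the names fifo_depth, load_fifo, and drain_fifo are hypothetical.

```python
from collections import deque

WORD_BITS = 32


def fifo_depth(key_bits, word_bits=WORD_BITS):
    """Number of words needed to hold a key (ceiling division)."""
    return -(-key_bits // word_bits)


def load_fifo(value, key_bits):
    """Split an operand into 32-bit words, least significant word first (assumed)."""
    mask = (1 << WORD_BITS) - 1
    return deque((value >> (WORD_BITS * i)) & mask
                 for i in range(fifo_depth(key_bits)))


def drain_fifo(fifo):
    """Read the words back in the order they were written and rebuild the operand."""
    value, shift = 0, 0
    while fifo:
        value |= fifo.popleft() << shift
        shift += WORD_BITS
    return value


if __name__ == "__main__":
    key = (1 << 232) | 0xDEADBEEF
    words = load_fifo(key, 233)
    print(len(words), drain_fifo(words) == key)
```

A 233-bit operand occupies 8 words (256 bits of capacity) and a 283-bit operand occupies 9 words, which matches the two depths quoted for the design.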
5.1.2 Parallel Multiplication and Squaring
The most basic design for multiplication is the classical parallel-in, parallel-out
design. It consists of a multiplication stage and a reduction stage. The 233-bit or
283-bit multiplying module (top) has an output of 2m − 1 bits (polynomial degree
up to 2m − 2) that feeds into the final reduction module (bottom), which reduces
the polynomial to the original 233 or 283 bits, as seen in Figure 5.2; the circuit
diagram shows the hardware required.
Figure 5.2: Parallel m-bit Multiplier [40]
Both stages of the overall parallel multiplier are tested with both key sizes.
Keeping the same structure with larger word lengths results in a drastic increase
in area. After synthesis, the 233-bit multiplication and reduction circuit consumes
127,948 LUTs as logic and 698 registers as flip-flops. This is too large for any
FPGA that must also host other large modules, since it consumes 58.53% of the
total slices. The 283-bit multiplier consumes 187,235 LUTs as logic and 848
registers as flip-flops after synthesis. This is 85.65% of the total slices, which
leaves no room for any other operations. This design does not use any gate more
than once, hence sacrificing a large amount of area to complete the entire
operation within one clock cycle. A more suitable multiplier to be attacked by
SCAs would be a serial design, which cuts down on the dramatic area consumption.
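The two stages of the parallel multiplier can be mirrored in a short software model. This Python sketch is an assumption-laden illustration, not the generated Verilog: polynomials are Python integers with bit i as the coefficient of x^i, the names poly_mult and poly_reduce simply echo the module names, and the reduction is done by repeated subtraction of shifted f(x) rather than by the reduction matrix used in the hardware.

```python
def poly_mult(a, b, m):
    """Stage 1: carry-less product of two m-bit polynomials (up to 2m - 1 bits)."""
    d = 0
    for i in range(m):
        if (a >> i) & 1:
            d ^= b << i             # one row of AND gates, XORed into the sum
    return d


def poly_reduce(d, f):
    """Stage 2: reduce d(x) modulo f(x), where f has degree m."""
    m = f.bit_length() - 1
    for k in range(d.bit_length() - 1, m - 1, -1):
        if (d >> k) & 1:
            d ^= f << (k - m)       # cancel the x^k term
    return d


if __name__ == "__main__":
    m = 233
    f = (1 << 233) | (1 << 74) | 1
    a = (1 << 232) | 0x1234
    b = (1 << 200) | 0xABCD
    d = poly_mult(a, b, m)
    print(d.bit_length() <= 2 * m - 1, poly_reduce(d, f).bit_length() <= m)
```

In hardware every row of this loop exists at once, which is why the area grows with m^2 while the result is available in a single cycle.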
233-bit Squaring Module
A useful piece of hardware that can be extracted from the 233-bit multiplier is the
reduction module, which can be reused for a squaring operator. The parallel
squaring circuit is developed by replacing the multiplying module with a smaller
datapath module that interleaves the operand bits with zeros. The script is shown
below.
// Squaring Module
module classic_polySquare(
    input [232:0] a,
    input [232:0] f_x,
    input clk,
    output [232:0] z);

    integer i;
    reg [2*233-2:0] d;

    // Polynomial Squaring: a[i] moves to bit 2*i, odd bits are zero
    always @ (posedge clk)
    begin
        d[0] <= a[0];
        for (i = 1; i <= 232; i = i + 1)
        begin
            d[2*i-1] <= 1'b0;
            d[2*i]   <= a[i];
        end
    end

    // Polynomial Reduction
    poly_reduc a1 (d,f_x,clk,z);
endmodule
After synthesis, the squaring circuit uses 79,343 LUTs as logic and 0 registers as
flip-flops, consuming 36.3% of the total slices. Even though this percentage is still
very high for a small portion of the processor, a modified serial version of the
previous point operation circuits could use this single-cycle squarer for a secure
high-speed device.
A note on the reduction module is that the primitive polynomial never needs more
than m + 1 bits, and since the highest-degree coefficient (x^m) is always present,
it can be left implicit to reduce the size of the register holding the polynomial.
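The bit-interleaving idea behind the squaring module can be checked against an ordinary multiplication in software. The sketch below is a Python illustration under the same integer-as-polynomial assumption; poly_square, poly_mult, and poly_reduce are hypothetical helper names, and the reduction is a plain shift-and-XOR loop rather than the hardware reduction matrix.

```python
def poly_square(a, m):
    """Squaring over GF(2): bit i of a moves to bit 2*i, odd bits stay zero."""
    d = 0
    for i in range(m):
        d |= ((a >> i) & 1) << (2 * i)
    return d


def poly_mult(a, b, m):
    """Reference carry-less multiplication used only to check the squarer."""
    d = 0
    for i in range(m):
        if (a >> i) & 1:
            d ^= b << i
    return d


def poly_reduce(d, f):
    """Reduce d(x) modulo f(x), where f has degree m."""
    m = f.bit_length() - 1
    for k in range(d.bit_length() - 1, m - 1, -1):
        if (d >> k) & 1:
            d ^= f << (k - m)
    return d


if __name__ == "__main__":
    m = 233
    f = (1 << 233) | (1 << 74) | 1
    a = (1 << 232) | 0xC0FFEE
    print(poly_reduce(poly_square(a, m), f) == poly_reduce(poly_mult(a, a, m), f))
```

Because the cross terms of a square cancel in characteristic 2, the squaring stage needs no AND or XOR gates of its own; only the shared reduction stage costs logic.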
5.1.3 Montgomery Multiplication and Reduction
The sequential Montgomery multiplier over GF(2^m) is modelled by Equation 5.1,
where A, B, R ∈ GF(2^m) [40].
C(x) = A(x)B(x)R^{-1} mod f(x) (5.1)
The element R needs to satisfy gcd(R, f(x)) = 1, which is always true in
GF(2^m). R needs to be carefully selected to reduce the number of inversions
and should be derived from the primitive polynomial
f(x) = x^m + a_{m−1}x^{m−1} + ... + a_1 x + a_0.
If R is selected as the monomial R = x^m, then R ≡ (a_{m−1}, ..., a_1, a_0)
modulo f(x). A multiplication by x^m is accomplished by shifting the string left by
m bits; therefore, multiplying by the inverse of this monomial is simply a shift to
the right by m bits. This multiplier can be easily and efficiently implemented in
hardware using Algorithm 5; this algorithm is based on the method proposed in [40]
but is translated into hardware operations. The m bits of shifting mean that a
single Montgomery multiplication takes exactly m cycles.
Algorithm 5 Binary Montgomery Multiplication in Hardware
Input: A(x), B(x), f(x) ∈ GF(2^m)
Output: C(x) = A(x)B(x)x^{-m} mod f(x)
1: Init: C ← 0;
2: C_0 ← C[0] ⊕ A_i B[0];
3: for i = 0 to m − 1 do
4:   C ← C(x) ⊕ A_i B(x);
5:   C ← C(x) ⊕ C[0] f(x);
6:   C ← shiftRegRight(C);
7: end for
The algorithm above loops for m cycles, after which its control unit flags that
the correct result is available in register C. Line 4 shows that register C is XORed
with register B if the 1-bit register A_i is true. Line 5 shows that if bit C[0] after
Line 4 (previous_c0 in the script) is true, then the value of register C is XORed
with the (m − 1)-bit primitive polynomial f(x). The last step of the loop, on Line 6,
is a single shift right by 1 bit, representing the division of register C by the
monomial x. The Verilog HDL script below models the datapath of Algorithm 5; the
remaining code, the control unit, can be seen in Appendix - Verilog HDL Scripts:
Serial Montgomery Multiplier.
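// Datapath Module
module mont_datapath(
    input [232:0] c, b,
    input a_i,
    input [232:0] f_x,
    output reg [232:0] new_c);

    reg previous_c0;
    integer i;

    // Combinational logic, so blocking assignments are used
    always @ (*)
    begin
        // Bit 0 of C after adding A_i * B (Algorithm 5, Line 4)
        previous_c0 = c[0] ^ (a_i & b[0]);
        // Add C[0] * f(x) and shift right by one bit (Lines 5 and 6)
        for (i = 1; i <= 232; i = i + 1)
        begin
            new_c[i-1] = c[i] ^ (a_i & b[i]) ^ (f_x[i] & previous_c0);
        end
        // The x^233 term of f(x) shifts into the top bit
        new_c[232] = previous_c0;
    end
endmodule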
The multiplier circuit is composed of the small datapath circuit scripted above
and an FSM controller. A register transfer level (RTL) design of the entire 233-bit
control unit surrounding the datapath is shown in Figure 5.3. The RTL schematic
of the multiplier consists mostly of the control unit, because the datapath
described in Algorithm 5 and highlighted below is small.
Figure 5.3: 233-bit Montgomery Multiplier RTL Schematic
Figure 5.4: 233-bit Montgomery Multiplier RTL Datapath
A fragment of the 233-bit datapath is shown in Figure 5.4. In the middle of
the figure, the AND operation is computed on register A_i and B[i] before the
result is XORed with the previous C[0] AND f(x) to obtain the respective
register C[i], named new_c[i] in the script, as previously explained.
After synthesis, the 233-bit Montgomery multiplication with integrated reduction
consumes 475 LUTs as logic and 935 registers as flip-flops. This is a very
favourable result and is as expected from Table 5.1. Area drops by 58.1 percentage
points relative to the 233-bit parallel multiplier, since the design occupies only
0.43% of the total slices. Similarly, the 283-bit multiplier consumes only 576 LUTs
as logic and 1135 registers as flip-flops, a combined 0.52% of the total area. The
multipliers take 233 and 283 cycles to complete, respectively.
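The bit-serial loop of Algorithm 5 can also be modelled in software to check the x^{-m} factor. The following Python sketch assumes polynomials are held as integers (bit i is the coefficient of x^i) and that f includes its x^m term; mont_mul and gf2_mulmod are hypothetical names, and the conversion out of the Montgomery domain with R^2 is a standard step added here for illustration, not something the thesis describes.

```python
def gf2_mulmod(a, b, f):
    """Reference product a(x)*b(x) mod f(x), used only for checking."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def mont_mul(a, b, f):
    """Bit-serial Montgomery product a*b*x^(-m) mod f, one loop pass per cycle."""
    m = f.bit_length() - 1
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b          # Line 4: add A_i * B
        if c & 1:
            c ^= f          # Line 5: add C[0] * f so that bit 0 becomes zero
        c >>= 1             # Line 6: divide by x
    return c


if __name__ == "__main__":
    f = (1 << 233) | (1 << 74) | 1
    r = f ^ (1 << 233)                 # R = x^233 mod f
    r2 = gf2_mulmod(r, r, f)           # R^2 mod f
    a, b = 0x1234567, (1 << 232) | 0x89ABCDEF
    c = mont_mul(a, b, f)              # a*b*R^(-1)
    print(mont_mul(c, r2, f) == gf2_mulmod(a, b, f))
```

One extra Montgomery multiplication by R^2 mod f(x) removes the R^{-1} factor, so the model returns the ordinary field product after 2m loop passes.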
5.2 Summary of the Connected System
The following subsections will examine the multipliers implemented and review
the comprehensive architecture.
5.2.1 Multiplier Comparison
Prior to the place-and-route implementation phase of the design, the synthesis
results are listed below. All of the designs are synthesized separately; therefore,
small reductions in size are expected once the place-and-route stage is completed
in the future.
Table 5.2: Post Synthesis Multiplier Results on Kintex-7
Multiplier           | Control Unit | Area of Hardware (LUT, Reg) | I/O
233-bit Parallel     | -            | 58.53%, 0.16%               | 257.73%
283-bit Parallel     | -            | 85.65%, 0.19%               | 312.98%
233-bit Parallel Sq. | -            | 36.3%, 0%                   | 193.09%
233-bit Montgomery   | FSM          | 0.43%, 0.21%                | 258.56%
283-bit Montgomery   | FSM          | 0.52%, 0.26%                | 313.81%
Of the two compared designs, the two-operand multiplier of choice is the binary
Montgomery multiplier. In terms of speed-area ratio and future work on the
proposed design, Montgomery’s design is far superior. As for the squaring design,
Montgomery offers a serialized squaring method that builds on the pre-existing
multiplier, which may be more efficient in larger-scale devices than the parallel
design shown above.
These specific multipliers are restricted by a maximum of 32-bit I/O.
Consequently, they need to be serial-in, serial-out (SISO) multipliers, since data
is fed serially into and out of the hardware module. From Table 5.2 it is striking
that every design uses roughly two to three times the bonded I/O that the FPGA can
physically provide. This is because the FIFO was not wrapped around the
highest-level module during simulation; the I/O percentage is listed to show the
dire need for a top-level FIFO module.
A comparison of the singular Montgomery multiplier versus recent serialized
multipliers is provided in Appendix - Verilog HDL Scripts: Serialized Montgomery
Multiplier Comparison.
5.2.2 Overview of Architecture
The novelty of this design revolves around using Joye’s algorithm for SPM. To the
best of the author’s knowledge there are no published FPGA designs that readily
use this secure SPM to protect against SCA threats. Below is a tree diagram of
the hierarchy throughout the past two chapters and as it falls down to the base
finite field multiplier.
ECDH & ECDSA Protocols
    Extended Euclidean Inversion
    Joye’s Scalar Point Multiplication
        LD - Point Double
            Montgomery Multiplication
        LD - Point Addition
            Montgomery Multiplication
Previously, the finite field inversion operation was needed at the lowest level
alongside the multiplier, but after using the LD-coordinate, the inversion was the
final conversion step of the SPM algorithm. The tree above completes the novel
secure ECC SPM processor and solidifies the end of the resulting work.
Chapter 6
Conclusions
In the final chapter, a discussion of the contributions is made along with many
areas that should be explored with the aid of this research. These areas include
the development of a software platform for validation of the cryptosystems, an
analysis of masking, and the hiding of indirect data leakages in various
cryptographic schemes.
6.1 Summary of Contributions
List of Contributions
1. A platform to develop and test SCAs against a wide range of cryptosystems
2. A successful DPA SCA against AES-128 to show the susceptibilities and
weaknesses of cryptographic hardware from multiple aspects and propose an
approach to attack other systems
3. A proposed secure, robust, and small-scale ECC SPM architecture resistant
to modern SCAs
4. A hardware design of a parallel K-233 & K-283 point doubling and addition
datapath in López-Dahab coordinates
5. A tested and synthesized hardware design of a dynamic 32-bit FIFO
6. A tested and synthesized hardware design of a 233-bit squaring module for
large scale devices
7. A tested and synthesized hardware design of both a parallel and serial
multiplier over GF(2^233) & GF(2^283)
6.2 Future Work
The future work for this research has a number of areas that will highly impact
the domain of embedded security.
6.2.1 Hardware Design
Most importantly, the remaining pieces of the hardware design and implementation
need to be completed. The design of the extended Euclidean algorithm is not new
work but is an essential step in recovering useful information from the
three-dimensional space of projective coordinates. Secondly, the control unit of
the point doubling and addition operations needs to be created; this design will
be small since the operations are parallel by design. Further development of a
SIPO point operation design would also be needed if the overall implementation
does not fit on the desired FPGA. To conclude the hardware designs, a full
implementation of Joye’s algorithm needs to be developed from the proposed design
to compare its SCA resistance with that of its counterpart, Montgomery’s SPM
algorithm. Once the SPM algorithms are implemented in hardware, the SCAs can
proceed. The global clock will need to be adjusted when the hardware design
reaches its physical limits and will probably range from 50 to 500 MHz. It is
highly recommended that these designs be implemented on an existing SoC such as,
but not limited to, the Zynq-7000. It is attractive due to the seamless software
integration of the on-chip ARM processor, using pre-existing wrapping software
from the Xilinx Vivado Design Suite to let the FPGA design communicate with the
ARM processor.
Note: The HDL to be written in the future cannot, in most cases, be written
manually. The original HDL is given in Appendix - Verilog HDL Scripts, but it has
very large for loops, which can put a large strain on the design suite’s compiler.
The solution is to unroll every loop in the HDL code using an automated C script
that populates every output register of the multiplying module by writing .v files
through a file pointer in a .c file. An example of this technique is in Appendix -
C Scripts - Convert to Verilog Script and is highly recommended when writing any
HDL for cryptographic purposes.
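The unrolling technique can be illustrated on the small squaring loop. The sketch below uses Python instead of the C generator in the appendix, purely to keep the example short; the function name unrolled_square_lines is hypothetical, and it emits only the loop body, not a complete module.

```python
def unrolled_square_lines(m):
    """Return one Verilog assignment per bit of the squaring stage.

    This replaces the HDL for loop with explicit statements, in the
    same spirit as the C generator scripts in the appendix.
    """
    lines = ["d[0] = a[0];"]
    for i in range(1, m):
        lines.append(f"d[{2 * i - 1}] = 1'b0;")   # odd bits are always zero
        lines.append(f"d[{2 * i}] = a[{i}];")     # bit i moves to bit 2*i
    return lines


if __name__ == "__main__":
    body = unrolled_square_lines(233)
    print("\n".join(body[:5]))
    print(len(body), "assignments generated")
```

The generated text would be written between the module header and endmodule, so the synthesis tool sees a flat list of assignments and never has to elaborate the loop itself.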
Once the FPGA designs are fully functional and can perform an SPM in a time
comparable to the recent literature of [5], it would be interesting to create an
application-specific integrated circuit (ASIC) in either 0.18 µm or 65 nm
complementary metal-oxide-semiconductor (CMOS) technology. Another pathway would
be to integrate the ASIC design with an existing microcontroller to test the
circuit once fabricated, for further exploration within accelerated cryptography.
6.2.2 Software-Hardware Integration Against SCAs
An entire customized cryptographic library is needed to test all possible inputs
over all desired keys, with ≥ 2^m test vectors of input data. Unlike common
software suites such as OpenSSL, the library needs low-level access to change the
existing multiplication algorithm and scalar point multiplication in an automated
fashion in order to verify hardware results correctly. Alternatively, a potential
avenue of research could connect the proposed hardware system with the common
application programming interface of the OpenSSL library.
A software implementation of the finite field inversion may be more efficient,
computing it sequentially after the SPM circuit has produced its results. This
could cut the time needed to deliver a working SPM, since hardware design can be
time consuming.
6.2.3 Masking to Prevent CPA Attacks
One method of hiding sensitive power traces of cryptosystems on a reconfigurable
device involves randomizing the inner cipher or high-level process in order to
break the relationship between how each trace is executed. This is done by
performing a random number of meaningless operations before, during, and after the
targeted process. However, there are a couple of issues when implementing this
method to protect AES-128 specifically. The first issue is that the AES-128
hardware design is combinational, so an entire round of AES-128 is performed in
one clock cycle and these fake operations can only be inserted between rounds.
This makes it slightly easier for the attacker to realign the waveforms and
re-establish the correct overlap of traces. The second issue is that these fake
operations greatly affect the throughput of the system, hence only a limited
number of them can be performed.
Another method is to make the power consumption random or to equalize it
throughout the targeted process. Firstly, the power consumption can be randomized
by increasing the noise of the system. To accomplish this, one would need to run
multiple random operations simultaneously. A disadvantage of this sort of
modification is that there is no such thing as a truly random number generator in
hardware, therefore an attacker can still find patterns within the system. Making
the power consumption equal at each state within a cryptosystem’s processes is
essentially the only sure way to mask a key or other important information against
DPA while maintaining the throughput of the original implementation. This could be
accomplished with an optimal switched-capacitor design similar to [30], mentioned
in Chapter 3, and implemented in CMOS technology.
A conceptual but interesting technique to prevent any CPA attack against
AES-128 could be created by interconnecting byte-to-byte dependencies with
the Add-Round Key operation. This would make the algorithm slower because of
the theoretically added serial computations in the Add-Round Key generator, but
it would make DPA useless, therefore greatly increasing its security.
In the future, when designing an FPGA implementation to protect AES-128, it is
crucial to select pins that are located on the same type of I/O bank as the
cipher’s output pins, ideally the same pins. This ensures that the same amount of
power is used to invert a fake output as the AES-128 output, hence establishing
equalized power consumption. The bus changes must occur on the exact same clock
edge so that the power is consumed in the same state. These two features are the
reason that a masked implementation of AES-128 is difficult to implement on the
SASEBO-GIII board specifically. This board uses a 1.5 V I/O bank to transfer
AES-128 data and does not have enough output pins available to execute the mask.
Appendix A
DPA Data & Results
A.1 Power Trace to be Attacked
Figure A.1: Example of a Power Trace Captured at 50,0000 Samples
A.2 16-Byte Key Results
Figure A.2: 15,000 Traces Max Correlation Vector for Byte 1
Figure A.3: 15,000 Traces Max Correlation Vector for Byte 2
Figure A.4: 15,000 Traces Max Correlation Vector for Byte 3
Figure A.5: 15,000 Traces Max Correlation Vector for Byte 4
Figure A.6: 15,000 Traces Max Correlation Vector for Byte 5
Figure A.7: 15,000 Traces Max Correlation Vector for Byte 6
Figure A.8: 15,000 Traces Max Correlation Vector for Byte 7
Figure A.9: 15,000 Traces Max Correlation Vector for Byte 8
Figure A.10: 15,000 Traces Max Correlation Vector for Byte 9
Figure A.11: 15,000 Traces Max Correlation Vector for Byte 10
Figure A.12: 15,000 Traces Max Correlation Vector for Byte 11
Figure A.13: 15,000 Traces Max Correlation Vector for Byte 12
Figure A.14: 15,000 Traces Max Correlation Vector for Byte 13
Figure A.15: 15,000 Traces Max Correlation Vector for Byte 14
Figure A.16: 15,000 Traces Max Correlation Vector for Byte 15
Figure A.17: 15,000 Traces Max Correlation Vector for Byte 16
Appendix B
Matlab Script DPA
**Available Upon Request of Author**
Appendix C
C Scripts - Verilog Script
Generation
C.1 Parallel Multiplier
#include <stdio.h>
/******************************************************
Name : mk_poly_mult
Input : m-bit
Output : poly_mult.v file
Comment : Generate verilog code for an m-bit parallel
multiplier
Engineer: D.R. Lalonde
******************************************************/
void mk_poly_mult(int m);
int main(){
int m = 233;
mk_poly_mult(m);
return 0;
}
void mk_poly_mult(int m){
FILE *fd;
int k,i;
fd = fopen("poly_mult.v", "w");
// Module Declaration
fprintf(fd, "module poly_mult(\n");
fprintf(fd, "input [%d-1:0] a,\n", m);
fprintf(fd, "input [%d-1:0] b,\n", m);
fprintf(fd, "input clk,\n");
fprintf(fd, "output reg [2*%d-2:0] d);\n\n", m);
// Module integers and reg’s
fprintf(fd, "integer k,i;\n");
// a & b for all a, b [m-1:0]
fprintf(fd, "reg a_b [2*%d-2:0][2*%d-2:0];\n", m,m);
fprintf(fd, "reg xor_temp;\n\n");
//-----------------------------------------------------------
// AND Operations -------------------------------------------
fprintf(fd, "always @ (*) begin\n");
// dk = m-1, ... , 0
for(k = 0; k <= m-1; k++){
for(i = 0; i <= k; i++){
// a_b[k][i] = a[i] & b[k-i]
fprintf(fd, "a_b[%d][%d] = a[%d] & b[%d - %d];\n",
k,i,i,k,i);
}
}
// dk = 2*m-2, ... , m
for(k = m; k <= 2*m-2; k++){
for(i = k; i <= 2*m-2; i++){
// a_b[k][i] = a[k-i+(m-1)] & b[i-(m-1)]
fprintf(fd, "a_b[%d][%d] = a[%d - %d + %d-1] & b[%d - (%d-1)];\n",
k,i,k,i,m,i,m);
}
}
// d[0] has no XOR operation
fprintf(fd, "d[0] = a_b[0][0];\n\n");
// ----------------------------------------------------------
// XOR Operations -------------------------------------------
for(k = 1; k <= 2*m-2; k++){
if (k <= m-1){
fprintf(fd, "xor_temp = a_b[%d][0];\n",k);
for(i = 1; i <= k; i++){
fprintf(fd, "xor_temp = a_b[%d][%d] ^ xor_temp;\n",
k,i);
}
}
else {
fprintf(fd, "xor_temp = a_b[%d][%d];\n",k,k);
for(i = k + 1; i <= 2*m-2; i++){
fprintf(fd, "xor_temp = a_b[%d][%d] ^ xor_temp;\n",
k,i);
}
}
fprintf(fd, "d[%d] = xor_temp;\n", k);
}
fprintf(fd, "end\n");
fprintf(fd, "endmodule");
fclose(fd);
}
C.2 Parallel Reduction
#include <stdio.h>
/******************************************************
Name : mk_poly_reduc
Input : m-bit
Output : poly_reduc.v file
Comment : Generate verilog code for an m-bit parallel
reduction module
Engineer: D.R. Lalonde
******************************************************/
void mk_poly_reduc(int m);
int main(){
int m = 233;
mk_poly_reduc(m);
return 0;
}
void mk_poly_reduc(int m){
FILE *fd;
int i,j;
fd = fopen("poly_reduc.v", "w");
// Module Declaration
fprintf(fd, "module poly_reduc(\n");
fprintf(fd, "input [2*%d-2:0] d,\n", m);
fprintf(fd, "input [%d:0] f_x,\n", m);
fprintf(fd, "input clk,\n");
fprintf(fd, "output reg [%d-1:0] c);\n\n", m);
// Module integers and reg’s
fprintf(fd, "integer i,j;\n");
// reduction matrix R, m rows by m-1 columns
fprintf(fd, "reg matR [%d-1:0][%d-2:0];\n", m,m);
// scratch copy used while populating R
fprintf(fd, "reg matR_temp [%d-1:0][%d-2:0];\n", m,m);
fprintf(fd, "reg xorcount;\n\n");
//-----------------------------------------------------------
// Reduction matrix R ---------------------------------------
fprintf(fd, "always @ (*) begin\n");
// matR intilization
for(j = 0; j <= m-1; j++){
for(i = 0; i <= m-2; i++){
fprintf(fd, "matR[%d][%d] = 1'b0;\n", j,i);
}
}
for(j = 0; j <= m-1; j++){
fprintf(fd, "matR[%d][0] = f_x[%d];\n", j,j);
}
// matR population
for(i = 1; i <= m-2; i++){
for(j = 0; j <= m-1; j++){
if(j == 0){
fprintf(fd, "matR_temp[%d][%d] = matR[%d-1][%d-1] & matR[%d][0];\n",
j,i,m,i,j);
fprintf(fd, "matR[%d][%d] = matR_temp[%d][%d];\n",
j,i,j,i);
}
else{
fprintf(fd, "matR_temp[%d][%d] = matR[%d-1][%d-1] ^ "
"(matR[%d-1][%d-1] & matR[%d][0]);\n",
j,i,j,i,m,i,j);
fprintf(fd, "matR[%d][%d] = matR_temp[%d][%d];\n",
j,i,j,i);
}
}
}
// ----------------------------------------------------------
// Polynomial Reduction -------------------------------------
for(j = 0; j <= m-1; j++){
fprintf(fd, "xorcount = d[%d];\n",j);
for(i = 0; i <= m-2; i++){
fprintf(fd, "xorcount = xorcount ^ (d[%d+%d] & matR[%d][%d]);\n",
m,i,j,i);
}
fprintf(fd, "c[%d] = xorcount;\n", j);
}
fprintf(fd, "end\n");
fprintf(fd, "endmodule");
fclose(fd);
}
C.3 Parallel Polynomial Multiplier
#include <stdio.h>
/******************************************************
Name : mk_classic_polyMult
Input : m-bit
Output : classic_polyMult.v file
Comment : Generate verilog code for an m-bit parallel
classical multiplier
Engineer: D.R. Lalonde
******************************************************/
void mk_classic_polyMult(int m);
int main(){
int m = 233;
mk_classic_polyMult(m);
return 0;
}
void mk_classic_polyMult(int m){
FILE *fd;
int i,j;
fd = fopen("classic_polyMult.v", "w");
// Module Declaration
fprintf(fd, "module classic_polyMult(\n");
fprintf(fd, "input [%d-1:0] a,\n", m);
fprintf(fd, "input [%d-1:0] b,\n", m);
fprintf(fd, "input [%d:0] f_x,\n", m);
fprintf(fd, "input clk,\n");
fprintf(fd, "output [%d-1:0] z);\n\n", m);
// Wire declaration
fprintf(fd, "wire [2*%d-2:0] d;\n", m);
// Polynomial Multiplication
fprintf(fd, "poly_mult a0 (a,b,clk,d);\n");
// Polynomial Reduction
fprintf(fd, "poly_reduc a1 (d,f_x,clk,z);\n");
fprintf(fd, "endmodule");
fclose(fd);
}
Appendix D
Verilog HDL Scripts
D.1 Parallel Polynomial Squarer
// poly_reduc.v is needed
module classic_polySquare(
input [282:0] a,
input [282:0] f_x,
input clk,
output [282:0] z
);
integer i;
reg [2*283-2:0] d;
// Polynomial Squaring
always @ (posedge clk)
begin
d[0] <= a[0];
for (i = 1; i <= 283-1; i = i + 1)
begin
d[2*i-1] <= 1'b0;
d[2*i] <= a[i];
end
end
// Polynomial Reduction
poly_reduc a1 (d,f_x,clk,z);
endmodule
D.2 Serial Montgomery Multiplier
// Computes the poly multiplication A(x) B(x) R**(-1) mod f(x), GF(2**233)
// Output not correct, very close
module mont_datapath(
input [232:0] c, b,
input a_i,
input [232:0] f_x,
output reg [232:0] new_c
);
reg previous_c0;
integer i;
always @ (*)
begin
// Combinational logic, so blocking assignments are used
previous_c0 = c[0] ^ (a_i & b[0]);
for (i = 1; i <= 232; i = i + 1)
new_c[i-1] = c[i] ^ (a_i & b[i]) ^ (f_x[i] & previous_c0);
new_c[232] = previous_c0;
end
endmodule
// Montgomery Control Unit ------------------------------------
module mont_mult(
input [232:0] a,
input [232:0] b,
input [232:0] f_x,
input clk, go, reset,
output reg int_done, done_mult,
output reg [232:0] z
);
/* wire [8:0] w_aa, w_bb, w_cc, w_new_c;
always @ (w_aa) aa = w_aa;
always @ (w_bb) bb = w_bb;
always @ (w_cc) cc = w_cc;
always @ (w_new_c) new_c = w_new_c;*/
reg [232:0] aa, bb, cc;
wire [232:0] n_c;
// Datapath
mont_datapath md1(.c(cc), .b(bb), .a_i(aa[0]), .f_x(f_x),
.new_c(n_c));
reg incr, shift_right;
// Counter
reg [7:0] count; // 8 bits are needed to reach the terminal count of 232
always @ (posedge clk or posedge reset)
begin
if (reset)
count <= 0;
else
begin
if (incr)
count <= 0;
else if (shift_right)
count <= count + 1;
end
end
// Shift Register A
always @ (posedge clk)
begin
if (reset)
aa <= 0;
else
begin
if (incr)
aa <= a;
else
aa <= {1'b0, aa[232:1]};
end
end
// Register B
always @ (posedge clk)
begin
if (reset)
bb <= 0;
else
if (incr)
bb <= b;
end
// Register C
reg c_en;
always @ (posedge clk or posedge incr)
begin
if (incr | reset)
cc <= 0;
else
if (c_en)
begin
cc <= n_c;
z <= cc;
end
end
// FSM
reg [2:0] state;
always @ (state)
begin
case (state)
0: begin
incr <= 0;
shift_right <= 0;
int_done <= 1;
c_en <= 0;
end
1: begin
incr <= 0;
shift_right <= 0;
int_done <= 1;
c_en <= 0;
end
2: begin
incr <= 1;
shift_right <= 0;
int_done <= 0;
c_en <= 0;
end
3: begin
incr <= 0;
shift_right <= 1;
int_done <= 0;
c_en <= 1;
end
endcase
end
// Next state
always @ (posedge clk or posedge reset)
begin
if (reset)
state <= 0;
else if (clk)
begin
case (state)
0:
if (!go)
state <= 1;
else
state <= 0;
1:
if (go)
state <= 2;
else
state <= 1;
2:
state <= 3;
3:
if (count == 232)
begin
state <= 0;
done_mult <= 1;
end
else
state <= 3;
endcase
end
end
endmodule
D.3 32-bit FIFO
// 32-bit FIFO for 233 bits of information -> 256-bit capability
module sync_fifo (
input [31:0] in_fifo,
input rd_en,
input wr_en,
input clk,
input reset,
output reg [31:0] out_fifo,
output empty,
output full
);
// 3 bits to address the 8 memory words (pointers wrap at the depth)
reg [2:0] p_rd, p_wr;
// Declare the fifo memory (RAM that allows read and write at the same time)
// creates an array of 8 elements of 32 bits
reg [31:0] mem_fifo [7:0];
// Flags
reg [4:0] counter_fifo; // counts 0 to 8 (depth), with a spare bit
assign empty = (counter_fifo == 0); // Completely empty
assign full = (counter_fifo == 8); // Completely full
// Sequential circuit that checks empty & full flags
always @(posedge clk or negedge reset)
begin
if (~reset)
counter_fifo <= 0;
else if( (!full && wr_en) && ( !empty && rd_en ) )
counter_fifo <= counter_fifo; // Read and write -> unchanged
else if (!full && wr_en)
counter_fifo <= counter_fifo + 1; // Write -> increment
else if (!empty && rd_en)
counter_fifo <= counter_fifo - 1; // Read -> decrement
end
// Sequential circuit - READING
always @(posedge clk or negedge reset)
begin
if(!reset)
out_fifo <= 0;
else
if (!empty && rd_en) // Not empty and READ
out_fifo <= mem_fifo [p_rd];
end
// Sequential circuit - WRITING
always @(posedge clk)
if (!full && wr_en)
mem_fifo[p_wr] <= in_fifo;
// Sequential circuit - read/write POINTERS
always @(posedge clk or negedge reset)
begin
if(!reset)
begin
p_wr <= 0;
p_rd <= 0;
end
else
begin
// Not full and WRITE -> incr. write pointer
if( !full && wr_en )
p_wr <= p_wr + 1;
// Not empty and READ -> incr. read pointer
if( !empty && rd_en )
p_rd <= p_rd + 1;
end
end
endmodule
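The FIFO's flag and pointer behaviour can be checked against a small cycle-level model. The Python sketch below is one such model under stated assumptions: the class name SyncFifoModel and its clock() method are hypothetical, the active-low reset is represented only by the constructor, and the pointers are taken to wrap at the depth of 8.

```python
class SyncFifoModel:
    """Cycle-level model of an 8-deep, 32-bit synchronous FIFO with count-based flags."""

    DEPTH = 8

    def __init__(self):
        # State after reset: empty memory, both pointers and the counter at zero
        self.mem = [0] * self.DEPTH
        self.p_rd = 0
        self.p_wr = 0
        self.count = 0
        self.out = 0

    @property
    def empty(self):
        return self.count == 0

    @property
    def full(self):
        return self.count == self.DEPTH

    def clock(self, in_word=0, wr_en=False, rd_en=False):
        """Apply one rising clock edge and return the registered output word."""
        # Flags are sampled before the edge, as in the HDL
        do_wr = wr_en and not self.full
        do_rd = rd_en and not self.empty
        if do_rd:
            self.out = self.mem[self.p_rd]
            self.p_rd = (self.p_rd + 1) % self.DEPTH
        if do_wr:
            self.mem[self.p_wr] = in_word & 0xFFFFFFFF
            self.p_wr = (self.p_wr + 1) % self.DEPTH
        # Simultaneous read and write leave the occupancy unchanged
        self.count += int(do_wr) - int(do_rd)
        return self.out
```

Driving the model and the HDL testbench with the same wr_en/rd_en sequence and comparing empty, full and the output word at each edge is usually enough to expose pointer-width or flag mistakes.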
D.4 Serialized Montgomery Multiplier Comparison
This comparison reports the area-delay product of the Montgomery multiplier, based
on the number of LUTs and registers it occupies on the Kintex-7 after synthesis,
against recent literature. The comparison is not entirely valid because the
architecture was not synthesized as a whole, so the table should be used strictly
as a general guide.
Table D.1: Serialized Multiplier Comparison
Multiplier               Key (bits)   Clock (MHz)    Area-Delay (slice*sec)
Karatsuba [52]           233          625            0.111
Interleaved [53]         283          264            -
Montgomery [54]          233          115.47         1.086
Montgomery [55]          163          132.5          1.098
This work: Montgomery    233          Approx. 250    0.00000564
This work: Montgomery    283          Approx. 250    0.00000684
Appendix E
Verilog HDL Pseudo Scripts
E.1 Binary Extended Euclidean Inversion
module EEA_test(
input [282:0] a,
input [283:0] f_x,
input clk,
input reset,
output reg [282:0] z
);
// Datapath
reg [283:0] r, s, u, v;
reg [8:0] d; // [log283:0]
reg [283:0] r_q, s_q, u_q, v_q; // New registers
reg [8:0] d_q; // [log283:0]
always @ (posedge clk)
begin
//_______________________________
// Alg for Inv in GF(2**m): 3 - 6
if (r[283] == 0)
begin
r_q <= {r[282:0], 1'b0}; // shift left by 1
u_q <= {u[282:0], 1'b0}; // shift left by 1
s_q <= s; // unchanged since r_m = 0
v_q <= v; // unchanged
d_q <= d + 1;
end
//________________________________
// when d = 0
else
begin
if (d == 0)
begin
if (s[283] == 1)
begin // Combined operations
r_q <= {s[282:0] ^ r[282:0], 1'b0}; // Line: 9, 12, 14
u_q <= {v[282:0] ^ u[282:0], 1'b0}; // Line: 10, 15
end
else
begin
r_q <= {s[282:0], 1'b0};
u_q <= {v[282:0], 1'b0};
end
s_q <= r;
v_q <= u;
d_q <= 9'd1; // VHDL equivalent: d_q <= (0 => '1', others => '0');
end
//________________________________
// when d = otherwise
else
begin
r_q <= r;
u_q <= {1'b0, u[283:1]}; // Shift right by 1, Line: 18 (division)
if (s[283] == 1)
begin
s_q <= {s[282:0] ^ r[282:0], 1'b0}; // Line: 9
v_q <= v ^ u; // Line: 10
end
else
begin
s_q <= {s[282:0], 1'b0}; // Line: 12
v_q <= v;
end
d_q <= d - 1; // Line: 19
end
end // 1st if
z <= u[282:0];
end
endmodule
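A software reference is useful for validating the inverter's output z. The Python sketch below uses the textbook polynomial extended Euclidean algorithm rather than the register-level r/s/u/v/d schedule of the module above, so it is a functional check only, not a model of the hardware timing. The names gf2_inverse and gf2_mulmod are illustrative.

```python
def gf2_mulmod(a, b, f):
    """Shift-and-add product a*b mod f over GF(2); requires deg(a) < deg(f)."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def gf2_inverse(a, f):
    """Inverse of a modulo the irreducible polynomial f (extended Euclidean algorithm)."""
    m = f.bit_length() - 1
    if a == 0 or a >> m:
        raise ValueError("a must be a nonzero polynomial of degree < m")
    u, v = a, f            # running remainders
    g1, g2 = 1, 0          # invariants: a*g1 = u and a*g2 = v (mod f)
    while u != 1:
        j = u.bit_length() - v.bit_length()   # deg(u) - deg(v)
        if j < 0:
            u, v = v, u
            g1, g2 = g2, g1
            j = -j
        u ^= v << j        # cancel the leading term of u
        g1 ^= g2 << j
    # Defensive final reduction in case g1 reached degree m or more
    while g1.bit_length() > m:
        g1 ^= f << (g1.bit_length() - 1 - m)
    return g1
```

The check gf2_mulmod(a, gf2_inverse(a, f), f) == 1 can then be applied to any value read back from the simulated module.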
E.2 Point Double
// Pt. Doubling in LD coords
// Y**2 + XYZ = X**3 Z + a X**2 Z**2 + b Z**4 (LD elliptic curve mapping)
// P (X, Y, Z) = Q (X, Y, 1) ... 2P = (X3, Y3, Z3)
/* 1. Z3 = X1**2 Z1**2
2. X3 = X1**4 + Z1**4
3. Y3 = b Z1**4 Z3 + X3 (a Z3 + Y1**2 + b Z1**4)
*/
// ------------------------------------------------------
module pt_double (
input [282:0] in_x1, in_y1, in_z1,
input [283:0] f_x,
input clk,
input reset,
output reg [282:0] x3, y3, z3,
output reg infinity
);
// -----------------------------------------------------
reg [282:0] x1, y1, z1;
reg [282:0] a, b, c;
reg [1:0] count;
reg done;
wire [2*283-2:0] w_x1, w_z1, w_y1, w_z3, w_x1_1, w_z1_1,
w_x3, w_a, w_b, w_c,
w_y3, w_y3_1;
// 1. Z3 = X1**2 Z1**2
// ****************************************************
classic_polySquare s0 (.a(x1), .f_x(f_x), .clk(clk),
.z(w_x1)); // clk0
classic_polySquare s1 (.a(z1), .f_x(f_x), .clk(clk),
.z(w_z1)); // clk0
classic_polyMult m0 (.a(w_x1), .b(w_z1), .f_x(f_x),
.clk(clk), .z(w_z3)); // clk1
// 2. X3 = X1**4 + Z1**4
// ***************************************************
classic_polySquare s2 (.a(w_x1), .f_x(f_x), .clk(clk),
.z(w_x1_1)); // clk1
classic_polySquare s3 (.a(w_z1), .f_x(f_x), .clk(clk),
.z(w_z1_1)); // clk1
// 3. Y3 = b Z1**4 Z3 + X3 (a Z3 + Y1**2 + b Z1**4)
// ***************************************************
classic_polySquare s4 (.a(y1), .f_x(f_x), .clk(clk),
.z(w_y1)); // clk0
classic_polyMult m1 (.a(w_z1_1), .b(w_z3), .f_x(f_x),
.clk(clk), .z(w_a)); // clk2
classic_polyMult m2 (.a(w_x3), .b(w_b), .f_x(f_x),
.clk(clk), .z(w_y3)); // clk2
// What #clk edge is present, for specific wires XOR
always @ (posedge clk && !done)
begin
if (reset)
begin
count <= 0;
done <= 0;
x1 <= in_x1;
y1 <= in_y1;
z1 <= in_z1;
end
else
begin
if (count == 1)
begin
x3 <= w_x1_1 ^ w_z1_1; // 2. 1st clock cycle, x1 XOR z1 = X3
b <= w_z1_1 ^ w_z3 ^ w_y1; // 3. 1st clock cycle, z1 XOR z3 = b
end
if (count == 2)
begin
y3 <= w_a ^ w_y3; // 3. 2nd clock cycle, a XOR wire y3 = y3
end
count <= count + 1; // advance only outside reset so that count <= 0 takes effect
end
end
// @ wire changes, make the corresponding register available... in order of the design
// 1.
always @ (w_z3)
z3 = w_z3;
// 2.
always @ (w_x3)
x3 = w_x3;
// 3.
always @ (w_b)
b = w_b;
always @ (w_y3_1)
begin
y3 = w_y3_1;
done = 1;
end
// NEEDS MORE WORK
// Check for infinity
always @ (posedge clk && done)
begin
if (x3 && y3 == 0)
if (z3 == 1)
infinity <= 1;
end
endmodule
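Because the module above is marked as needing more work, a software model of the doubling step helps when checking intermediate values. The Python sketch below implements the standard López-Dahab doubling formulas, which include the curve coefficient b in X3; this is the textbook form and not a transcription of the listing. The function names and the argument order (P, a, b, f) are illustrative assumptions, and the point at infinity is not handled.

```python
def gf2_mulmod(a, b, f):
    """Shift-and-add product a*b mod f over GF(2); requires deg(a) < deg(f)."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def ld_double(P, a, b, f):
    """Double P = (X1, Y1, Z1) in Lopez-Dahab coordinates on y^2 + xy = x^3 + a x^2 + b."""
    X1, Y1, Z1 = P
    x1_sq = gf2_mulmod(X1, X1, f)                 # X1^2
    z1_sq = gf2_mulmod(Z1, Z1, f)                 # Z1^2
    z1_4 = gf2_mulmod(z1_sq, z1_sq, f)            # Z1^4
    b_z1_4 = gf2_mulmod(b, z1_4, f)               # b Z1^4
    Z3 = gf2_mulmod(x1_sq, z1_sq, f)              # 1. Z3 = X1^2 Z1^2
    X3 = gf2_mulmod(x1_sq, x1_sq, f) ^ b_z1_4     # 2. X3 = X1^4 + b Z1^4
    inner = gf2_mulmod(a, Z3, f) ^ gf2_mulmod(Y1, Y1, f) ^ b_z1_4
    Y3 = gf2_mulmod(b_z1_4, Z3, f) ^ gf2_mulmod(X3, inner, f)   # 3. Y3
    return X3, Y3, Z3
```

The affine result is recovered as x = X3/Z3 and y = Y3/Z3^2, so a single field inversion at the end of a scalar multiplication is enough.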
E.3 Point Addition
// Pt. Addition in LD coords
// Y**2 + XYZ = X**3 Z + a X**2 Z**2 + b Z**4 (LD elliptic curve mapping)
// P (X, Y, Z) ~= Q (X, Y, 1) ... xP = P + Q = (X3, Y3, Z3)
/* 1. A = Y2 Z1**2 + Y1
2. B = X2 Z1 + X1
3. C = Z1 B
4. D = B**2 (C + a Z1**2), a = 1
5. Z3 = C**2
6. E = A C
7. X3 = A**2 + D + E
8. F = X3 + X2 Z3
9. G = (X2 + Y2) Z3**2
10. Y3 = (E + Z3) F + G
*/
// --------------------------------------------------------
module pt_add (
input [282:0] in_x1, in_y1, in_z1,
input [282:0] in_x2, in_y2,
input [283:0] f_x,
input clk,
input reset,
output reg [282:0] x3, y3, z3
);
// --------------------------------------------------------
reg [282:0] x1, y1, z1, x2, y2;
reg [282:0] a, b, c, d, e, f, g, y3_1;
reg [2:0] count; // 3 bits: the schedule below runs through count == 4
reg done;
wire [282:0] w_z1, w_a, w_b, w_b_1, w_x1, w_x2_1, w_x2, w_y2,
w_g, w_a_1, w_d, w_c, w_a_2, w_e,
w_d_1, w_x3, w_z3, w_y3, w_z3_1, w_z3_2,
w_f, w_g_1, w_y3_1, w_y3_2;
// 1. A = Y2 Z1**2 + Y1
// ********************************************************
classic_polySquare s0 (.a(z1), .f_x(f_x), .clk(clk),
.z(w_z1)); // clk0
classic_polyMult m0 (.a(w_z1), .b(y2), .f_x(f_x),
.clk(clk), .z(w_a)); // clk1
// 2. B = X2 Z1 + X1
// ********************************************************
classic_polyMult m1 (.a(z1), .b(x2), .f_x(f_x),
.clk(clk), .z(w_x2_1)); // clk0
// 3. C = Z1 B
// ********************************************************
classic_polyMult m2 (.a(z1), .b(w_b), .f_x(f_x),
.clk(clk), .z(w_c)); // clk1
// 4. D = B**2 (C + a Z1**2)
// ********************************************************
classic_polySquare s1 (.a(w_b), .f_x(f_x), .clk(clk),
.z(w_b_1)); // clk1
classic_polyMult m3 (.a(w_d), .b(w_b_1), .f_x(f_x),
.clk(clk), .z(w_d_1)); // clk1
// 5. Z3 = C**2
// ********************************************************
classic_polySquare s2 (.a(w_c), .f_x(f_x), .clk(clk),
.z(w_z3)); // clk1
// 6. E = A C
// ********************************************************
classic_polyMult m4 (.a(w_c), .b(w_a_1), .f_x(f_x),
.clk(clk), .z(w_e)); // clk2
// 7. X3 = A**2 + D + E
// ********************************************************
classic_polySquare s3 (.a(w_a_1), .f_x(f_x), .clk(clk),
.z(w_a_2)); // clk2
// 8. F = X3 + X2 Z3
// ********************************************************
classic_polyMult m5 (.a(x2), .b(w_z3), .f_x(f_x),
.clk(clk), .z(w_z3_1)); // clk2
// 9. G = (X2 + Y2) Z3**2
// ********************************************************
classic_polySquare s4 (.a(w_z3), .f_x(f_x), .clk(clk),
.z(w_z3_2)); // clk3
classic_polyMult m6 (.a(w_z3_2), .b(w_g), .f_x(f_x),
.clk(clk), .z(w_g_1)); // clk4
// 10. Y3 = (E + Z3) F + G
// ********************************************************
classic_polyMult m7 (.a(w_f), .b(w_y3), .f_x(f_x),
.clk(clk), .z(w_y3_1)); // clk3
// What #clk edge is present, for specific wires XOR
always @ (posedge clk)
begin
if (reset)
begin
count <= 0;
done <= 0;
x1 <= in_x1;
y1 <= in_y1;
z1 <= in_z1;
x2 <= in_x2;
y2 <= in_y2;
end
else
begin
if (count == 0)
begin
b <= x1 ^ w_x2_1;
g <= x2 ^ y2;
end
if (count == 1)
begin
a <= w_a ^ y1;
d <= w_z1 ^ w_c;
end
if (count == 2)
begin
x3 <= w_a_2 ^ w_e ^ w_d_1;
y3_1 <= w_e ^ w_z3;
end
if (count == 3)
f <= w_x3 ^ w_z3_1;
if (count == 4)
y3 <= w_g_1 ^ w_y3_1;
count <= count + 1; // advance only outside reset so that count <= 0 takes effect
end
end
// @ wire changes, make the corresponding register available
// CLK 0
// 2.
always @ (w_b)
b = w_b;
// 9.
always @ (w_g)
g = w_g;
// CLK 1
// 1.
always @ (w_a_1)
a = w_a_1;
// 4.
always @ (w_d)
d = w_d;
// CLK 2
// 7.
always @ (w_x3)
x3 = w_x3;
// 10.
always @ (w_y3)
y3_1 = w_y3;
// CLK 3
// 8.
always @ (w_f)
f = w_f;
// CLK 4
// 10.
always @ (w_y3_2)
y3 = w_y3_2;
endmodule
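The ten steps listed in the header comment of this module can be written out directly as a software model for comparing intermediate values A through G against simulation. The Python sketch below follows those steps for a mixed addition (second point affine, Z2 = 1); the function names are illustrative assumptions, and it presumes P is neither equal to Q nor its negative and that neither point is the point at infinity.

```python
def gf2_mulmod(a, b, f):
    """Shift-and-add product a*b mod f over GF(2); requires deg(a) < deg(f)."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def ld_mixed_add(P, Q, a, f):
    """Add P = (X1, Y1, Z1) in Lopez-Dahab coordinates and the affine point Q = (X2, Y2)."""
    X1, Y1, Z1 = P
    X2, Y2 = Q
    z1_sq = gf2_mulmod(Z1, Z1, f)
    A = gf2_mulmod(Y2, z1_sq, f) ^ Y1                                   # 1.
    B = gf2_mulmod(X2, Z1, f) ^ X1                                      # 2.
    C = gf2_mulmod(Z1, B, f)                                            # 3.
    D = gf2_mulmod(gf2_mulmod(B, B, f), C ^ gf2_mulmod(a, z1_sq, f), f)  # 4.
    Z3 = gf2_mulmod(C, C, f)                                            # 5.
    E = gf2_mulmod(A, C, f)                                             # 6.
    X3 = gf2_mulmod(A, A, f) ^ D ^ E                                    # 7.
    F = X3 ^ gf2_mulmod(X2, Z3, f)                                      # 8.
    G = gf2_mulmod(X2 ^ Y2, gf2_mulmod(Z3, Z3, f), f)                   # 9.
    Y3 = gf2_mulmod(E ^ Z3, F, f) ^ G                                   # 10.
    return X3, Y3, Z3
```

Field additions are plain XORs here, which is why the hardware schedule only needs multiplier and squarer instances plus XOR stages between clock steps.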
Bibliography
[1] Dell EMC, "A Cost-based Security Analysis of Symmetric and Asymmetric Key Lengths," RSA Laboratories, 2010.
[2] P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," Mathematics of Computation, vol. 48, no. 177, pp. 243–264, 1987.
[3] M. Joye, "Highly regular m-ary powering ladders," International Workshop on Selected Areas of Cryptography, Springer Berlin Heidelberg, 2009.
[4] D. Freeman, "Pertinent Side Channel Attacks on Elliptic Curve Cryptographic Systems," IEEE Transactions on Energy Conversion, vol. 26, no. 1, pp. 55-63, March 2011.
[5] I. H. Hazmi, F. Zhou, F. Gebali, "Review of Elliptic Curve Processor Architectures," IEEE, 78-1-4673-7788-1/15, 2015.
[6] M. Joye, S. Yen, "The Montgomery Powering Ladder," Laboratory of Cryptography and Information Security (LCIS), Springer-Verlag Berlin Heidelberg, 2003.
[7] A. Sghaier, M. Zeghid, B. Bouallegue, A. Baganne, M. Machhout, "Area Time Efficient Hardware Implementation of Elliptic Curve Cryptosystem," 2015.
[8] Y. Dan et al., "High-performance hardware architecture of elliptic curve cryptography processor over GF(2^163)," Journal of Zhejiang University Science A, vol. 10, no. 2, pp. 301–310, 2009.
[9] C. Puttmann et al., "Hardware accelerators for elliptic curve cryptography," Advances in Radio Science, vol. 6, no. 10, pp. 259–264, 2008.
[10] M. Amara and A. Siad, "Hardware implementation of elliptic curve point multiplication over GF(2^m) for ECC protocols," International Journal for Information Security Research (IJISR), vol. 1, no. 3, 2011.
[11] M. A. Fayed, "A security coprocessor for next generation IP telephony: architecture, abstraction, and strategies," University of Victoria, 2007.
[12] K. Jarvinen, "Optimized FPGA-based elliptic curve cryptography processor for high-speed applications," INTEGRATION, the VLSI journal, vol. 44, no. 4, pp. 270–279, 2011.
[13] M. Morales-Sandoval, "A reconfigurable and interoperable hardware architecture for elliptic curve cryptography," Ph.D. dissertation, Tesis de Doctorado, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico, 2008.
[14] K. Ananyi et al., "Flexible hardware processor for elliptic curve cryptography over NIST prime fields," Very Large Scale Integration (VLSI) Systems, IEEE Transactions, vol. 17, no. 8, pp. 1099–1112, 2009.
[15] S. Zeidler et al., "Design of a low-power asynchronous elliptic curve cryptography coprocessor," in Electronics, Circuits, and Systems (ICECS), 2013 IEEE 20th International Conference on, pp. 569–572, IEEE, 2013.
[16] T. Akishita, T. Takagi, "Zero-Value Point Attacks on Elliptic Curve Cryptosystem," Sony Corporation, Ubiquitous Technology Laboratories; Technische Universität Darmstadt, Fachbereich Informatik, Germany.
[17] R. Karri, K. Wu, P. Mishra, Yongkook Kim, "Fault-based side-channel cryptanalysis tolerant Rijndael symmetric block cipher architecture," Defect and Fault Tolerance in VLSI Systems, 2001. Proceedings. 2001 IEEE International Symposium, 2001.
[18] R. Lidl and H. Niederreiter, Introduction to Finite Fields and their Applications, Cambridge University Press, 1994.
[19] A. Irwansyah, V. P. Nambiar, M. Khalil-Hani, "An AES Tightly Coupled Hardware Accelerator in an FPGA-based Embedded Processor Core," 2009 International Conference on Computer Engineering and Technology, 2009.
[20] H. W. Lenstra, R. J. Schoof Jr., "Primitive normal bases for finite fields," Mathematics of Computation, 48: 217–231, 1987.
[21] A. F. Diego, S. L. Paulo, M. Barreto, R. E. Jefferson, "A note on high-security general-purpose elliptic curves," Computer Science Dept, University of Brasília, Cryptology ePrint Archive, Report 2013, 647 (2013).
[22] S. Ezzouak, M. Elamrani, A. Azizi, "Improving Pollard's Rho Attack on Elliptic Curve Cryptosystems," IEEE Transactions, 978-1-4673-1520-3/12, ©2012 IEEE.
[23] P. C. Kocher, "Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems," Proc. CRYPTO, vol. 1109, pp. 104–113, 1996.
[24] P. C. Kocher, J. Jaffe, B. Jun, "Differential Power Analysis," technical report, 1998; Advances in Cryptology - Crypto 99 Proceedings, Lecture Notes in Computer Science Vol. 1666, M. Wiener, ed., Springer-Verlag, 1999.
[25] M. Alioto, S. Member, M. Poli, S. Rocchi, "A General Power Model of Differential Power Analysis Attacks to Static Logic Circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 5, May 2010.
[26] W. Shan, X. Fu, Z. Xu, "A Secure Reconfigurable Crypto IC With Countermeasures Against SPA, DPA, and EMA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, July 2015.
[27] K. Tiri, I. Verbauwhede, "A logic level design methodology for a secure DPA resistant ASIC or FPGA implementation," Proc. Conf. Design, Automation and Test in Europe, IEEE Computer Society, Washington, DC, pp. 10246, 2004.
[28] Z. Toprak, Y. Leblebici, "Low-power current mode logic for improved DPA-resistance in embedded systems," Proc. IEEE Int. Symp. Cir. Sys., pp. 1059–1062, 2005.
[29] M. W. Allam, M. I. Elmasry, "Dynamic current mode logic (DyCML): A new low-power high-performance logic style," IEEE J. Solid-State Circuits, vol. 36, no. 3, pp. 550–558, Mar. 2001.
[30] C. Tokunaga, D. Blaauw, "Securing Encryption Systems With a Switched Capacitor Current Equalizer," IEEE Journal of Solid-State Circuits, vol. 45, no. 1, January 2010.
[31] D. A. Osvik, A. Shamir, E. Tromer, "Cache Attacks and Countermeasures: the Case of AES," Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel, revised 2005.
[32] D. Bernstein, "Cache-timing attacks on AES," Department of Mathematics, Statistics, and Computer Science (M/C 249), The University of Illinois at Chicago, 2005.
[33] X. Duan, Q. Cui, S. Wang, H. Fang, G. She, "Differential Power Analysis Attack and Efficient Countermeasures on PRESENT," 2016 8th IEEE International Conference on Communication Software and Networks, 2016.
[34] NIST, "Recommended Elliptic Curves for Federal Government Use," http://csrc.nist.gov, 2004.
[35] H. Yue, "Efficient Scalar Multiplication Against Side Channel Attacks Using a New Binary Representation," 1st Seminar - University of Windsor, 2016.
[36] M. Yasuda, "On the Strength Comparison of ECC and RSA," SHARCS 2012 (Special-Purpose Hardware for Attacking Cryptographic Systems), Fujitsu Laboratories Ltd., 2012.
[37] H. Wu, "AES: Advanced Encryption Standard," Chapter 5: Data Security and Cryptography - University of Windsor, 2015.
[38] F. K. Gürkaynak, Side Channel Attack Chapter 3: Secure Cryptographic Accelerators, 2006.
[39] F. Rodriguez-Henriquez, N. A. Saqib, A. Diaz Pérez, C. K. Koc, "Cryptographic Algorithms on Reconfigurable Hardware," Springer, 2007.
[40] Jean-Pierre Deschamps, José Luis Imaña, Gustavo D. Sutter, "Hardware Implementations of Finite-Field Arithmetic," The McGraw-Hill Companies, Inc., 2009.
[41] C. Paar, J. Pelzl, "Understanding Cryptography," Springer, 2010.
[42] S. Mangard, E. Oswald, T. Popp, "Power Analysis Attacks – Revealing the Secrets of Smart Cards," Springer, 2007.
[43] V. Miller, "Use of Elliptic Curves in Cryptography," Advances in Cryptology - CRYPTO 85 Proceedings, Springer, pp. 417-426, 1986.
[44] N. Koblitz, "Elliptic Curve Cryptosystems," Mathematics of Computation, vol. 48, no. 177, pp. 203-209, 1987.
[45] Leboeuf, Karl Bernard, "GPU and ASIC Acceleration of Elliptic Curve Scalar Point Multiplication" (2012). Electronic Theses and Dissertations. Paper 5367.
[46] (2016) Cryptography Stack Exchange. [Online]. Available: http://crypto.stackexchange.com
[47] (2016) Safe Curves - Choosing safe curves for elliptic curve cryptography. [Online]. Available: https://safecurves.cr.yp.to
[48] (2016) Internet stats - Live internet feed. [Online]. Available: http://www.internetlivestats.com/internet-users
[49] (2016) NSA - CSA - NSA Security Assurance. [Online]. Available: http://www.nsa.gov/what-we-do/information-assurance/
[50] (2016) DPA Contest, "DPA Contest v4." [Online]. Available: http://www.dpacontest.org/home/ Accessed July 2016.
[51] (2016) Chip Whisperer, "Open-Sourced SCA tools." [Online]. Available: https://newae.com/tools/chipwhisperer/ Accessed June 2016.
[52] R. Bilal and M. Rajaram, "Design and evaluation of parallel, scalable, curve based processor over binary field," WSEAS Transactions on Computers, vol. 10, no. 10, pp. 353–365, 2011.
[53] M. A. Fayed, "A security coprocessor for next generation IP telephony: architecture, abstraction, and strategies," University of Victoria, 2007.
[54] R. Bilal and M. Rajaram, "Design and evaluation of parallel, scalable, curve based processor over binary field," WSEAS Transactions on Computers, vol. 10, no. 10, pp. 353–365, 2011.
[55] Y. W. R. Li, "FPGA based unified architecture for public key and private key cryptosystems," Frontiers of Computer Science, vol. 7, no. 3, pp. 307–316, 2013.
Vita Auctoris
Dylan was born in May 1994 in Windsor, Ontario. He received his B.A.Sc. and
M.A.Sc. degrees in Electrical and Computer Engineering from the University of
Windsor, Windsor, Ontario, Canada, in 2016 and 2017, respectively.
His research interests include cryptography, side-channel attacks, information and
algorithm security, automotive cyber-security, FPGA and ASIC accelerators, and
analog circuit design.
