University of South Florida

Digital Commons @ University of South Florida
USF Tampa Graduate Theses and Dissertations

USF Graduate Theses and Dissertations

March 2022

Secure Hardware Constructions for Fault Detection of Lattice-based Post-quantum Cryptosystems
Ausmita Sarker
University of South Florida

Part of the Computer Engineering Commons

Scholar Commons Citation
Sarker, Ausmita, "Secure Hardware Constructions for Fault Detection of Lattice-based Post-quantum
Cryptosystems" (2022). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/9453

This Dissertation is brought to you for free and open access by the USF Graduate Theses and Dissertations at
Digital Commons @ University of South Florida. It has been accepted for inclusion in USF Tampa Graduate Theses
and Dissertations by an authorized administrator of Digital Commons @ University of South Florida. For more
information, please contact scholarcommons@usf.edu.

Secure Hardware Constructions for Fault Detection of Lattice-based Post-quantum
Cryptosystems

by

Ausmita Sarker

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Mehran Mozaffari Kermani, Ph.D.
Srinivas Katkoori, Ph.D.
Hao Zheng, Ph.D.
Nasir Ghani, Ph.D.
Reza Azarderakhsh, Ph.D.

Date of Approval:
March 9, 2022

Keywords: Cryptography, number-theoretic transform, recomputing with encoded
operands, ring learning with error, ring polynomial multiplication
Copyright © 2022, Ausmita Sarker

Dedication
To my Mom, Aparna Sarker, who fought against the world for the success of my career;
To the love of my life, Roussueu, who taught me humility and grit;
To my sister, Asma, for her unconditional support;

Acknowledgments
I would like to express my sincere gratitude to my advisor, Mehran Mozaffari Kermani,
Ph.D., for his relentless support and continual encouragement. His extraordinary patience
and enthusiasm, along with his enlightened vision of the research, drove my Ph.D. journey. I
aspire to be as kind and as diligent as Dr. Mozaffari Kermani in my future life.
I wish to extend my special thanks to my committee members, Srinivas Katkoori, Ph.D.,
Hao Zheng, Ph.D., Nasir Ghani, Ph.D., and Reza Azarderakhsh, Ph.D., for investing a tremendous
amount of time to thoroughly analyze the dissertation. Their constructive feedback
and attention to detail strengthened the dissertation manyfold. It is my honor to have
such prestigious scholars on my committee.
My gratitude goes to the faculty members and staff of the department of CSE at USF,
especially Jessica Pruitt and Laura Owczarek, who dedicated their time to the smooth
operation of my graduate work and research. Here, I want to thank my undergrad professor,
Gopa Biswas Caeser, for the kindness she bestowed upon me during my tough times. To my
family and friends who are supporting me from near and afar, I am grateful.
I feel the result of this work goes to my husband, Rousseau, whose inspiration and sacrifice
are paramount to the fruition of my academic life. His patience and unreserved love are what
kept me afloat during this tough time. I could not have done it without him.
I want to thank my mom, Aparna Sarker. Your support and love are the foundation of
who I am as a person and a researcher.

Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1: Introduction
    1.1 Cryptography and Internet of Things
    1.2 Post Quantum Cryptography
    1.3 Fault Attacks and Detection
    1.4 Objectives
    1.5 Dissertation Outline

Chapter 2: Hardware Constructions for Error Detection of Number-Theoretic Transform Utilized in Secure Cryptographic Architectures
    2.1 Number-Theoretic Transform
    2.2 Preliminaries
    2.3 Proposed Error Detection Scheme
        2.3.1 Efficient NTT Implementation
        2.3.2 Recomputing with Negated Operands
        2.3.3 Recomputing with Scaled Operands
        2.3.4 Recomputing with Swapped Operands
    2.4 ASIC Assessments and Comparisons
    2.5 Conclusion

Chapter 3: Error Detection Architectures for Ring Polynomial Multiplication and Modular Reduction of Ring-LWE in $\mathbb{Z}/p\mathbb{Z}[x]/(x^n+1)$ Benchmarked on ASIC
    3.1 Ring Polynomial Multiplication and Ring-Learning With Error
    3.2 Preliminaries
        3.2.1 Ring Polynomial Multiplication
        3.2.2 Ring-LWE Encryption Scheme
    3.3 Proposed Error Detection Scheme for Ring Polynomial Multiplication
        3.3.1 Ring Polynomial Multiplication Architecture
        3.3.2 Proposed Error Detection Scheme through Recomputing
        3.3.3 Ameliorating the Throughput Overhead through Pipelining
    3.4 Proposed Error Detection Schemes for Ring-LWE Architecture
        3.4.1 Error Detection Scheme for Polynomial Multiplier and q=16381
        3.4.2 Error Detection Scheme for SAMS2 Approach and q=12289
    3.5 Error Coverage and ASIC
        3.5.1 Fault Model
        3.5.2 Assessments
        3.5.3 Fault Simulations
        3.5.4 ASIC Comparison for Error Detection in RPM Module
    3.6 Conclusion

Chapter 4: Fault Detection Architectures for Inverted Binary Ring-LWE Construction Benchmarked on FPGA
    4.1 Inverted Binary Ring-LWE
    4.2 Preliminaries
    4.3 Proposed Fault Detection Schemes
        4.3.1 Recomputing with Encoded (Shifted) Operands
            4.3.1.1 Key Generation
            4.3.1.2 Encryption
            4.3.1.3 Decryption
        4.3.2 Recomputing with Encoded (Negated) Operands
            4.3.2.1 RENO on Key Generation
            4.3.2.2 RENO on Decryption
    4.4 Error Coverage and FPGA Implementations
        4.4.1 Fault Simulation
        4.4.2 FPGA Comparison for Error Detection
    4.5 Conclusion

Chapter 5: Efficient Error Detection Architectures for Post Quantum Signature Falcon’s Sampler and KEM SABER
    5.1 Post-Quantum KEM and Signature Schemes
    5.2 Preliminaries
        5.2.1 Recomputing Overview
            5.2.1.1 Saber Overview
        5.2.2 Falcon Overview
    5.3 Proposed Error Detection Techniques
        5.3.1 Fault Attacks and Threat Model
        5.3.2 Proposed Error Detection Schemes on SABER
            5.3.2.1 Error Detection on Binomial Sampler
            5.3.2.2 Error Detection on Parallel Polynomial Multiplication
            5.3.2.3 Error Detection on HW/SW Codesign
        5.3.3 Error Detection Schemes on Falcon Sampler
            5.3.3.1 Recomputing on Negation
            5.3.3.2 RESwO on Multiplication
            5.3.3.3 RENO on Multiplication
            5.3.3.4 RENO on Multiplication-and-Accumulator (MAC)
            5.3.3.5 RENO on Overall ffsampling∗n
        5.3.4 Implementation of Constant-time Falcon Sampler
            5.3.4.1 ModFalcon Implementation and Error Detection
            5.3.4.2 Samplez Implementation and Error Detection
    5.4 Error Coverage and FPGA Implementations
        5.4.1 Fault Simulation
        5.4.2 FPGA Implementations
    5.5 Conclusions

Chapter 6: Conclusion

References

Appendix A: Copyright Permissions

About the Author

List of Tables

Table 2.1 Implementation results for ASIC through TSMC 65-nm for three case studies, i.e., $(n, p)_1 = (64, 257)$, $(n, p)_2 = (256, 65537)$, and $(n, p)_3 = (512, 4294967297)$, and two proposed architectures, i.e., recomputing with swapped operands (RESwO) and its modified variant RESwO-modified (RESwO-m).

Table 3.1 Implementation results for ASIC TSMC 65-nm of RPM architecture (Prop. 1: Negating both operands, Prop. 2: Negating one operand).

Table 4.1 Implementation results for FPGA through Kintex-UltraScale+ and Virtex-7 for encryption ($Enc_{Kin}$ and $Enc_{Vir}$, respectively) and key generation/decryption ($Gen_{Kin}$ and $Gen_{Vir}$, respectively).

Table 5.1 Implementation results for FPGA through Xilinx Zynq-UltraScale+ ZCU102 (xczu9eg-ffvb1156-2-e) for binomial sampling, polynomial multiplication, and hardware/software codesign.

List of Figures

Figure 2.1 Proposed butterfly construction for NTT through recomputing with negated tri operands (RENtO).
Figure 2.2 Proposed butterfly construction for NTT through recomputing with scaled dual operands (REScdO).
Figure 2.3 Proposed butterfly construction for NTT through recomputing with Swapped Operands (RESwO).
Figure 3.1 Hardware architecture of ring polynomial multiplication in $R = \mathbb{Z}/p\mathbb{Z}[x]/(x^n+1)$.
Figure 3.2 Hardware architecture of proposed recomputing with scaled operand (REScO).
Figure 3.3 Hardware architecture of proposed recomputing with negated operand (RENO).
Figure 3.4 Hardware architecture of modified RENO.
Figure 3.5 Pipelined scheduling for data path of the proposed schemes.
Figure 3.6 Proposed construction of schoolbook $\log_2 q \times \log_2 q$ bit multiplier for $q = 16381$.
Figure 3.7 Our proposed SAMS2 construction for error detection in modular reduction.
Figure 3.8 Sams2 construction of Shift-add block and Multq. block.
Figure 3.9 Sams2 construction Subt. block.
Figure 4.1 Hardware construction of recomputing with shifted operands for key generation of InvRBLWE.
Figure 4.2 Hardware construction of recomputing with shifted operands for encryption of InvRBLWE.
Figure 4.3 Hardware construction of recomputing with shifted operands for decryption of InvRBLWE.
Figure 4.4 Hardware construction of recomputing with negated operands (RENO) for (a) key generation (b) encryption of InvRBLWE (the gray-colored box denotes the module on the corresponding scheme in the RESO figures).
Figure 4.5 Hardware construction of recomputing with negated operands (RENO) for decryption.
Figure 5.1 Error detection architecture on binomial sampler.
Figure 5.2 Proposed error detection architecture on polynomial multiplication with multiply-and-accumulate (MAC) unit construction.
Figure 5.3 Proposed error detection architecture on hardware accelerator.
Figure 5.4 Proposed recomputing with negated operands (RENO) on negation of key generation in Falcon.
Figure 5.5 Proposed recomputing with swapped operands (RESwO) on multiplication of key generation in Falcon.
Figure 5.6 RENO on multiplication of key generation in Falcon.
Figure 5.7 RENO on multiplication-and-accumulator (MAC) module of Falcon.
Figure 5.8 RENO on the overall ffsampling∗n.
Figure 5.9 Signature scheme of ModFalcon architecture.

Abstract
The advent of quantum computers and the exponential speed-up of quantum computation will render classical cryptosystems insecure, as quantum computers can break current encryption schemes in minutes, resulting in a catastrophic failure of privacy preservation and data security. Through the standardization of quantum-resistant public-key cryptography algorithms, the National Institute of Standards and Technology (NIST) is evaluating potential candidates to thwart such quantum attacks. In this dissertation, countermeasures against fault attacks are proposed to secure various lattice-based cryptosystems, one of the most promising families of post-quantum cryptosystems. Fault detection architectures for crucial building blocks of lattice-based cryptosystems, i.e., number-theoretic transform, ring polynomial multiplication, and ring learning with errors, are introduced. Moreover, secure hardware architectures for the post-quantum key encapsulation mechanism SABER and the signature scheme Falcon are explored. The proposed architectures can also detect natural faults caused by device malfunctions, which is crucial to the proper functioning of sensitive and secure deeply-embedded systems with stringent constraints.


Chapter 1: Introduction

1.1 Cryptography and Internet of Things

The emergence of the Internet of Things (IoT) broadens the traditional Internet by including smart devices in computing systems. Ensuring secure communication for IoT devices is crucial to thwart privacy attacks and prevent the exploitation of security vulnerabilities in devices such as smart plugs [1], thermostats [2], and smart lights [3]. Although symmetric key cryptography, where the sender and receiver share the same key to encrypt and decrypt, has proven to be extremely efficient in terms of performance [4], the security vulnerability of pre-shared keys as well as key distribution problems raise concerns for real-world applications [5]. Public-key cryptography (PKE), on the other hand, uses a pair of keys, i.e., public and private, preserving the integrity and confidentiality of two-party communication systems.
While PKE schemes are the most prominent protocols for secure key exchange and communication establishment, classic PKEs, i.e., RSA [6] and elliptic curve cryptography (ECC) [7], could be impractical for the resource-constrained architectures of IoT because of their high complexity and costly performance metrics, e.g., device area, run-time, or energy consumption. Secure communication and the establishment of temporary session keys are two crucial aspects of IoT security, with the goals of fast key generation and low resource utilization. Several works have explored the efficient implementation of PKE for resource-constrained applications [8, 9] for a specific target platform or code-based cryptography [10]. However, the costly key generation and large key sizes of the aforementioned approaches have prevented practical application so far. Moreover, the advent of quantum computers poses an imminent threat to traditional cryptosystems, as the classic PKEs are vulnerable to quantum attacks based on Shor’s algorithm [11], leading to active research on alternatives to PKEs for the post-quantum era.

1.2 Post Quantum Cryptography

The security of currently employed PKEs depends on the hardness of factoring (RSA) or the elliptic curve discrete logarithm problem (ECC). However, classical PKEs are not sufficient to protect our cryptosystems in the long term [12], as Shor demonstrated that both classes of problems can be solved in polynomial time by quantum computers [11]. The fast development of quantum computers and their computational power, as well as the progress in cryptanalysis, urge research on post-quantum secure yet practical cryptosystems, namely post-quantum cryptography (PQC). In late 2017, the National Institute of Standards and Technology (NIST) announced the solicitation and standardization of one or more quantum-resistant public-key cryptography algorithms [13], to be finalized in 2024.
PQC research is focused on five classes of algorithms: lattice-based, hash-based [14], multivariate-based [15], code-based [16], and isogeny-based [17] cryptography. Among them, lattice-based cryptography had long been considered secure yet inefficient, as its large parameters were beyond practicality. The introduction of cyclic and ideal lattices [18] changed this perception through theoretically elegant and efficient cryptographic primitives. Based on hard, quantum-resistant problems of finding solutions to linear equations, lattice-based cryptography is one of the most versatile approaches and can be employed in many aspects of a cryptosystem, e.g., encryption [18], digital signature [19], and identification [20]. Learning with errors (LWE), a hard problem underlying a family of lattice-based cryptosystems, has received much attention due to its lower complexity, high efficiency, and scalability, which suit resource-constrained IoT applications, as well as its robustness against quantum computation [21].

1.3 Fault Attacks and Detection

The side-channel security analysis of the NIST PQC standardization candidates is an emerging research topic demanding extensive study for practical deployment. Every implementation should be evaluated against side-channel analysis (SCA) attacks. Such physical attacks exploit externally available information (e.g., power consumption, run-time), rather than the vulnerabilities of a cryptographic algorithm. The adversaries exploit the inadvertent information leakage of a device, e.g., timing information, power consumption, or electromagnetic radiation. As these attacks are non-invasive and require cheap equipment, SCA poses serious security threats to most cryptographic hardware devices, ranging from smart cards to computers [22].
One popular variety of SCA is active fault analysis (FA) [23], where the adversary introduces faults into cryptographic systems and observes the difference. When the attackers simply observe the device’s behavior without disturbing its proper functioning or attempting to access its internal functionality, the attacks are called passive SCA. In contrast, in active SCAs such as differential fault analysis (DFA), the adversary injects faults into the system and compares the faulty output with the fault-free output, with little influence on the actual fault value. Laser injection is one of the most precise fault injection techniques, where the adversary has control over the timing and location of the fault. Other cheaper yet effective fault injection methods are clock glitches and power supply drops. Based on the set of incorrect responses, the attacker can decipher the secret information of the device. Exploiting the presence of transient faults (lasting one or a few clock cycles) or permanent faults, these attacks pose a threat to the majority of cryptosystems [24], even lattice-based cryptosystems [25, 26].
Fault detection schemes can determine whether a system has been tampered with and fault analysis has taken place. Previous works [27–31] have explored fault detection on classical cryptosystems. Among the three fault detection techniques, i.e., concurrent error detection (CED), off-line detection, and roving fault detection, this dissertation studies the concurrent error detection technique. CED can be classified into four types of redundancy: hardware, time, information, and hybrid. Hardware redundancy duplicates the function and detects faults by comparing the outputs of the two copies. Time redundancy performs the same function twice and may detect both permanent and transient faults [32]. Information redundancy adds check bits (e.g., parity) to detect faults. Hybrid redundancy requires that an operation be followed by its inverse operation. In this dissertation, we explore fault attacks on various lattice-based encryption and signature schemes and propose fault detection methods which thwart permanent and transient fault attacks.

1.4 Objectives

The error detection schemes introduced for several promising post-quantum cryptosystems are evaluated with respect to performance, implementation metrics, and efficiency. The proposed architectures are benchmarked to assess their ability to detect transient and permanent faults. With high error coverage, the presented approaches achieve acceptable overhead and can be tailored to the objectives in terms of error detection and reliability. These approaches add low hardware overhead, which makes them suitable for deeply-embedded systems.
The objectives of this dissertation are as follows:
• We devise fault detection schemes and error detection architectures for the key generation, encryption, and decryption stages of multiple state-of-the-art lattice-based cryptosystems. The fault detection explored in this dissertation emphasizes the performance bottlenecks and the most computationally intensive stage of each crypto-algorithm, whose security is crucial for the proper operation of the entire cryptosystem. Our proposed schemes are not confined to certain cryptographic constructions.
• We also explore recomputing schemes in the signature algorithms of the post-quantum signature scheme Falcon. We apply recomputing schemes to achieve high fault coverage.
• We simulate the error coverage of our proposed work, with HDL as design entry, by injecting stuck-at faults. We observe high error detection rates for both permanent and transient faults when incorporating our schemes.
• We implement our schemes on an application-specific integrated circuit (ASIC) using Synopsys Design Compiler or on a field-programmable gate array (FPGA) to derive the implementation and performance metrics. The proposed error detection schemes add acceptable overheads compared to the original implementation.

1.5 Dissertation Outline

The error detection of lattice-based post-quantum cryptosystems is investigated in this dissertation. The chapter outline is as follows:
• Chapter 2: This chapter introduces efficient error detection schemes for the number-theoretic transform, a crucial and efficient Fourier transform over a ring, for state-of-the-art lattice-based cryptosystems.
• Chapter 3: This chapter proposes error detection schemes for both the ring polynomial multiplication (RPM) and modular reduction blocks, as different ring-LWE architectures use different moduli depending on the security level and application.
• Chapter 4: This chapter presents fault detection constructions on the Ring-BinLWE architecture, which can be tailored based on the needs in terms of reliability and the restrictions in terms of the added overhead in constrained applications.
• Chapter 5: This chapter introduces fault detection schemes for SABER, targeting the performance bottleneck, the PRNG involving a binomial sampler, the polynomial multiplier architecture, and the high-level architecture of the HW/SW codesign approach of SABER. Moreover, this chapter proposes error detection schemes for the hardware construction of Falcon’s sampler, specifically, in the signature algorithm of ModFalcon and the Gaussian sampler.
• Chapter 6: The dissertation is concluded in this chapter.

Chapter 2: Hardware Constructions for Error Detection of Number-Theoretic
Transform Utilized in Secure Cryptographic Architectures

2.1 Number-Theoretic Transform¹

Number theoretic transform (NTT) [34] is a discrete Fourier transform defined over a finite ring or field. Being an elegant polynomial multiplication technique, NTT is essential to post-quantum cryptosystems, e.g., lattice-based cryptosystems. Such cryptosystems rely on well-studied, hard problems, the merit of which is that quantum algorithms to solve these problems efficiently are as yet unknown. One of the most common average-case lattice problems is the learning with errors (LWE) problem [18], which assures the hardness of solving other lattice problems in the worst case [21]. However, this very appealing technique gives an impractical key size of quadratic complexity, i.e., $O(n^2)$, for security parameter $n$ [35]. To reduce the complexity, cyclic [36] and ideal lattices [37] were introduced. Using computation based on the fast Fourier transform (FFT), these structures enable the construction of theoretically robust and efficient cryptosystems with quasi-linear, i.e., $O(n \lg n)$, key lengths. Ideal lattices are also employed in fully homomorphic encryption (FHE) [38] and somewhat homomorphic encryption (SHE) [39], two new primitives with strong potential for securing cloud computing. Polynomial multiplication is the most computationally intensive operation of ideal lattices. Applying number theoretic constructions provides a speed advantage, because the polynomial multiplication can be computed in quasi-linear time $O(n \lg n)$ using FFT [40].

¹ This chapter was published in the IEEE Transactions on Very Large Scale Integration Systems (TVLSI) [33] ©2019 IEEE.

Besides post-quantum cryptography, NTT can radically improve currently-used schemes by increasing their security parameters. For example, NTT proves to be a valuable tool for signature schemes [41], collision-resistant hash functions [42], as well as identification schemes [20]. As a result, efficient error detection schemes for NTT in polynomial multiplication will boost the security and reliability of post-quantum cryptography as well as existing cryptosystems.
Previous studies of NTT-based polynomial multiplication have dealt with reconfigurable hardware [43] and efficient architectures to achieve high speed [44]. Examples of other interesting recent works related to the respective implementations include [45], [46]. However, no work has yet been proposed in the open literature focusing on error detection of the NTT polynomial multiplier.
Error detection in cryptography has been the center of attention in previous work [29, 47–54]. In this chapter, we propose error detection schemes for the NTT polynomial multiplier. The main contributions of this chapter are summarized as follows:
• We introduce a number of categories for error detection in NTT of the ring $R = \mathbb{Z}/p\mathbb{Z}[x]/(x^n+1)$. Our proposed schemes are not confined to certain cryptographic constructions.
• The first category of the proposed error detection schemes involves recomputing with negated operands. Moreover, we present recomputing with scaled operands. The last category constitutes recomputing with swapped operands. Our target is low hardware overhead, which is favorable to compact and deeply-embedded architectures.
• We implement the proposed error detection architectures on an application-specific integrated circuit (ASIC) for a 65nm library to assess the implementation and performance metrics.

2.2 Preliminaries

In this chapter, we have considered ideal lattices, defined by $R = \mathbb{Z}/p\mathbb{Z}[x]/(x^n+1)$. Here, $f(x)$ is an irreducible polynomial of degree $n$, which can be represented as $f(x) = f_0 + f_1 x + f_2 x^2 + \dots + f_{n-1} x^{n-1}$. Also, $n$ is a power of 2, and $p$ is a prime number where $p \equiv 1 \bmod 2n$. Multiplication of two polynomials $a(x), b(x) \in \mathbb{Z}_p$ can be represented as:

$$a(x) \cdot b(x) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j x^{i+j} \bmod f(x), \tag{2.1}$$

taking quadratic complexity of $O(n^2)$ when utilizing the schoolbook algorithm.
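As a point of reference for Eq. (2.1), the following Python sketch performs the schoolbook product. It assumes the reduction polynomial is $x^n + 1$, as in the ring $R$ above, so that $x^n$ wraps around to $-1$. The function name `schoolbook_mul` and the toy parameters in the usage lines are illustrative choices, not part of the dissertation.

```python
def schoolbook_mul(a, b, p):
    """Multiply a(x) * b(x) in Z_p[x] / (x^n + 1) with the O(n^2) schoolbook method.

    a and b are coefficient lists of equal length n (lowest degree first).
    Assumption: the reduction polynomial is x^n + 1.
    """
    n = len(a)
    assert len(b) == n, "both operands must have n coefficients"
    c = [0] * n
    for i in range(n):
        for j in range(n):
            term = a[i] * b[j]
            k = i + j
            if k >= n:
                # x^n = -1 in this ring, so x^(i+j) folds back with a sign flip.
                c[k - n] = (c[k - n] - term) % p
            else:
                c[k] = (c[k] + term) % p
    return c


# Toy usage with n = 4 and p = 17: x^3 * x = x^4 = -1 = 16 (mod 17).
print(schoolbook_mul([0, 0, 0, 1], [0, 1, 0, 0], 17))
```

The two nested loops make the quadratic cost explicit: every coefficient of one operand meets every coefficient of the other.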
In contrast, the number theoretic transform is a discrete Fourier transform defined in a finite field, $\mathbb{Z}_p = \mathbb{Z}/p\mathbb{Z}[x]$ [1]. For a given primitive $n$-th root of unity $\omega$ in $\mathbb{Z}_p$, $A(x)$ and $B(x)$ are polynomials under $\mathbb{Z}_p$, obtained as the generic forward transforms $\mathrm{NTT}_\omega(a)$ and $\mathrm{NTT}_\omega(b)$, respectively:

$$A_i = \mathrm{NTT}_\omega^n(a(x))_i = \sum_{j=0}^{n-1} a_j \omega^{ij} \bmod p, \quad i = 0, 1, \dots, n-1. \tag{2.2}$$

The NTT exists if and only if the block length $n$ divides $q - 1$ for every prime factor $q$ of $p$, where $p$ is a prime and $n$ is a power of 2. Computing the inverse NTT (INTT) is similar to computing the NTT, while replacing $\omega$ with $\omega^{-1}$ and introducing $n^{-1}$, i.e.,

$$a_i = \mathrm{INTT}_\omega^n(A(x))_i = n^{-1} \sum_{j=0}^{n-1} A_j \omega^{-ij} \bmod p, \quad i = 0, 1, \dots, n-1. \tag{2.3}$$

As $p$ is a prime, the inverse of $n$, $n^{-1}$, can be computed modulo $p$, where $n \cdot n^{-1} \equiv 1 \bmod p$. Applying NTT and INTT to compute polynomial multiplication reduces the time complexity from $O(n^2)$ to $O(n \lg n)$.
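Eqs. (2.2) and (2.3) can be checked directly with a small Python sketch. It evaluates the sums literally, in quadratic time, so it illustrates the definitions rather than the fast algorithm. The toy parameters $n = 4$, $p = 17$, $\omega = 4$ are assumptions chosen so that $p \equiv 1 \bmod 2n$ and $\omega$ is a primitive 4th root of unity modulo 17. The names `ntt_naive` and `intt_naive` are hypothetical. The modular inverses use the three-argument `pow` with exponent −1, which requires Python 3.8 or later.

```python
def ntt_naive(a, omega, p):
    """Forward transform per Eq. (2.2): A_i = sum_j a_j * omega^(i*j) mod p."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]


def intt_naive(A, omega, p):
    """Inverse transform per Eq. (2.3): a_i = n^-1 * sum_j A_j * omega^(-i*j) mod p."""
    n = len(A)
    n_inv = pow(n, -1, p)          # n * n_inv = 1 (mod p); exists because p is prime
    omega_inv = pow(omega, -1, p)  # omega^-1 mod p
    return [n_inv * sum(A[j] * pow(omega_inv, i * j, p) for j in range(n)) % p
            for i in range(n)]


# Toy parameters (assumed for illustration): n = 4, p = 17, omega = 4.
p, omega = 17, 4
a = [1, 2, 3, 4]
A = ntt_naive(a, omega, p)
print(A)                        # transformed coefficients
print(intt_naive(A, omega, p))  # recovers the original coefficients
```

The round trip through `ntt_naive` and `intt_naive` returns the input, which is the property the multiplier relies on when it transforms, multiplies pointwise, and transforms back.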

2.3 Proposed Error Detection Scheme

For high-performance lattice-based cryptography, a flexible NTT-based polynomial multiplier is required. In this section, we present our schemes to provide error detection hardware architectures with low complexity. The proposed approaches constitute three categories, i.e., recomputing with encoded operands through negated, scaled, and swapped operands.

2.3.1 Efficient NTT Implementation
In Algorithm 2.1 [55], the iterative FFT implementation computes the NTT of a given polynomial $a(x) \in \mathbb{Z}_p$. The Bit-Reverse(a) operation (line 1) reorders the input vector $a$, such that the new position of the element at position $k$ is found by reversing the binary representation of $k$. This algorithm utilizes the “butterfly operation” [21] (lines 8 and 9), which is the multiplication of the factor $\omega^{N \bmod n}$ with $d$, followed by the addition of the result to, or its subtraction from, $c$. Lines 5-10 divide the input polynomial into two smaller polynomials, each of length $n/2$, and perform NTT on each polynomial simultaneously. Instead of transforming the entire polynomial of degree $n$, decomposing $a$ into two halves and computing the NTT in parallel improves the time complexity from quadratic ($O(n^2)$) to quasi-linear ($O(n \lg n)$).

Algorithm 2.1 Iterative-NTT
Input: a ∈ Z_p[x] of length n = 2^k with k ∈ N and a primitive n-th root of unity ω ∈ Z_p
Output: y = NTT_ω(a)
1: A ← Bit-reverse(a); m ← 2
2: while m ≤ n do
3:     s ← 0
4:     while s < n do
5:         for i = 0 to m/2 − 1 do
6:             N ← i·n/m; a ← s + i; b ← s + i + m/2
7:             c ← A[a]; d ← A[b]
8:             A[a] ← c + ω^{N mod n} d mod p
9:             A[b] ← c − ω^{N mod n} d mod p
10:        end for
11:        s ← s + m
12:    end while
13:    m ← m·2
14: end while
15: return A
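
A runnable Python sketch of Algorithm 2.1 is given below for illustration. The function and variable names are chosen for this example, and the toy parameters used in the check (p = 17, n = 8, ω = 2) are assumptions. The index variables are named lo and hi here to avoid clashing with the input polynomial a.

```python
def bit_reverse(a):
    """Reorder a so that the element at index k moves to the bit-reversed index."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for k in range(n):
        rev = int(format(k, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = a[k]
    return out


def ntt_iterative(a, omega, p):
    """Iterative NTT following the loop structure of Algorithm 2.1."""
    n = len(a)
    A = bit_reverse(a)
    m = 2
    while m <= n:
        for s in range(0, n, m):
            for i in range(m // 2):
                e = (i * n // m) % n            # exponent N mod n
                lo, hi = s + i, s + i + m // 2
                c, d = A[lo], A[hi]
                t = pow(omega, e, p) * d % p    # butterfly product
                A[lo] = (c + t) % p             # line 8
                A[hi] = (c - t) % p             # line 9
        m *= 2
    return A


# Cross-check against the direct sum for toy parameters (assumption).
p, omega = 17, 2
a = [1, 2, 3, 4, 5, 6, 7, 8]
direct = [sum(a[j] * pow(omega, i * j, p) for j in range(8)) % p for i in range(8)]
assert ntt_iterative(a, omega, p) == direct
```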

Figure 2.1: Proposed butterfly construction for NTT through recomputing with negated tri
operands (RENtO).
2.3.2 Recomputing with Negated Operands
In proposing the error detection approaches, we make sure that augmenting the original
constructions with the proposed schemes leads to low-complexity architectures. As a result,
we have applied a number of recomputing with negated operands schemes.
The architecture for NTT consists of the common butterfly structure (lines 8 and 9 of
Algorithm 2.1). This well-known structure performs the core operation of the NTT implementation, multiplying elements of the polynomial by powers of ω. Each cycle computes one
node of the NTT flow, where a multiplier followed by a modular reduction circuit (mod p block in
Figure 2.1) performs polynomial multiplication by reiterating the butterfly operation. For this most rigorous operation within such constructions, we propose two variants of
our scheme. The first one is recomputing with negated dual operands (RENdO), in
which, as the name suggests, two operands are negated. The second one, shown in Figure
2.1, is recomputing with negated tri operands (RENtO), in which all three operands, i.e.,
c, ω, and d, are negated. In these approaches, encoding and decoding are the most prominent
operations and must be implemented with care. In the latter, i.e., RENtO, for
a modified architecture of the NTT butterfly, we insert a negation unit for modulo p negation,
multiplexer, and comparator circuits. The multiplexer select, Norm/RENtO, determines whether the
original NTT or the RENtO operation is performed. In accordance with lines 8 and 9 of Algorithm 2.1, in
the original NTT operation the outputs are A and B, where A = c + ωd and B = c − ωd.
During the encoding stage, which is active in RENtO only, we negate all inputs, i.e., c, ω,
and d, which become p − c, p − ω, and p − d, respectively. Thus, the encoded
outputs are A′ and B′, where A′ = −c + ωd and B′ = −c − ωd. The decoding operation is
as follows: we negate A′ and B′, and the decoded outputs are compared with their alternate
pre-recomputed outputs, i.e., −A′ = c − ωd is compared with B and −B′ = c + ωd is compared with A. At the input of the decoder, depending on the multiplexer select,
the data bus carries either A or A′, which is represented as A/A′ in Figure 2.1. In
the former approach, i.e., RENdO, the encoding and decoding blocks are identical to those of RENtO.
However, in the comparator circuit, we compare the decoded outputs with their respective
original outputs.
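
The RENtO flow can be modeled in software as a minimal sketch. The `fault` hook below is a hypothetical way of corrupting the product in the normal run, and the function names and parameters (p = 257, ω = 2) are assumptions for illustration, not the hardware design itself.

```python
def butterfly(c, w, d, p, fault=None):
    """One NTT butterfly: returns (c + w*d, c - w*d) mod p.

    `fault` is a hypothetical hook that corrupts the multiplier output.
    """
    t = w * d % p
    if fault is not None:
        t = fault(t) % p
    return (c + t) % p, (c - t) % p


def rento_detects_error(c, w, d, p, fault=None):
    """Return True if the RENtO comparison flags a mismatch."""
    # Normal run, possibly hit by a transient fault.
    A, B = butterfly(c, w, d, p, fault)
    # Recomputation with all three operands negated modulo p.
    A_enc, B_enc = butterfly((p - c) % p, (p - w) % p, (p - d) % p, p)
    # Decoding by negation: -B' should equal A and -A' should equal B.
    A_dec, B_dec = (p - B_enc) % p, (p - A_enc) % p
    return (A_dec, B_dec) != (A, B)


assert rento_detects_error(5, 2, 100, 257) is False
assert rento_detects_error(5, 2, 100, 257, fault=lambda t: t ^ 1) is True
```

Here the fault is injected only in the normal run, which corresponds to a transient fault; the recomputed run is assumed to be fault-free.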

2.3.3 Recomputing with Scaled Operands
A second variant of the proposed error detection schemes involves scaling the operands,
e.g., doubling, quadrupling, or multiplying by a general factor. A first example, i.e., recomputing with doubled and quadrupled operands
(REdqO), involves doubling ω and d and quadrupling c. The encoded outputs
would be A′ = 4c + (2ω ∗ 2d) = 4(c + ωd) and B′ = 4c − (2ω ∗ 2d) = 4(c − ωd). The decoding is performed by dividing the
outputs by 4. In binary, dividing by 4 is a right shift by two places, making decoding a relatively inexpensive operation. A second example would be, instead of scaling all the operands
as in REdqO, doubling only ω and c, i.e., recomputing with doubled operands (REdO). The
encoding and decoding of REdO are very similar to those of REdqO, requiring only one doubler and
one divider, i.e., one left and one right shift operation, resulting in low hardware overhead
and time delay.
In a more general variant of recomputing with scaled operands, namely, recomputing
with scaled dual operand (REScdO), we scale both ω and c by the factor k. This is shown
in Figure 2.2. Encoding gives A′ = kc + (kω) ∗ d = k(c + ωd) and B′ =
kc − (kω) ∗ d = k(c − ωd). Decoding is performed by dividing both outputs by k (Figure
2.2). As p is a prime number, gcd(k, p) = 1 for every k that is not a multiple of p, so the inverse of k modulo p always exists.

Figure 2.2: Proposed butterfly construction for NTT through recomputing with scaled dual
operands (REScdO).
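
The REScdO encoding and decoding can be sketched as follows. This is an illustrative software model under assumed toy parameters (p = 257, k = 3); the `fault` hook is hypothetical, and decoding is done with a modular inverse rather than a hardware divider.

```python
def butterfly(c, w, d, p, fault=None):
    """One NTT butterfly: returns (c + w*d, c - w*d) mod p."""
    t = w * d % p
    if fault is not None:
        t = fault(t) % p  # hypothetical fault on the multiplier output
    return (c + t) % p, (c - t) % p


def rescdo_detects_error(c, w, d, p, k, fault=None):
    """Return True if the REScdO comparison flags a mismatch."""
    A, B = butterfly(c, w, d, p, fault)               # normal run
    A_enc, B_enc = butterfly(k * c % p, k * w % p, d, p)  # scale c and w by k
    k_inv = pow(k, -1, p)                             # needs Python 3.8+
    A_dec, B_dec = A_enc * k_inv % p, B_enc * k_inv % p   # divide by k mod p
    return (A_dec, B_dec) != (A, B)


assert rescdo_detects_error(5, 2, 100, 257, k=3) is False
assert rescdo_detects_error(5, 2, 100, 257, k=3, fault=lambda t: t + 1) is True
```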

Figure 2.3: Proposed butterfly construction for NTT through recomputing with Swapped
Operands (RESwO).
2.3.4 Recomputing with Swapped Operands
If we swap ω and d, while negating c, we can perform recomputing with swapped operands
(RESwO). The recomputed outputs are A′ = −c + ωd and B′ = −c − ωd. As shown in
Figure 2.3, there is no need for decoding, and RESwO only requires comparison with the
alternate pre-recomputed values. Having a single negation unit makes the scheme inexpensive
and efficient. We also present a modified variant of RESwO, i.e., RESwO-m in Figure 2.3,
in which we lower the overhead by swapping just ω and d and leaving c intact. This results
in even lower overhead, as decoding is free in hardware.
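
Both swapped-operand variants can be sketched in software as follows. The `fault` hook is hypothetical, and the sum-to-zero comparison used for RESwO is only one possible reading of "comparison with the alternate pre-recomputed values": it relies on A′ ≡ −B and B′ ≡ −A modulo p.

```python
def butterfly(c, x, y, p, fault=None):
    """One butterfly: returns (c + x*y, c - x*y) mod p (x, y are multiplier inputs)."""
    t = x * y % p
    if fault is not None:
        t = fault(t) % p  # hypothetical fault on the multiplier output
    return (c + t) % p, (c - t) % p


def reswo_m_detects_error(c, w, d, p, fault=None):
    """RESwO-m: swap w and d, keep c; outputs are compared directly."""
    A, B = butterfly(c, w, d, p, fault)
    A_re, B_re = butterfly(c, d, w, p)
    return (A_re, B_re) != (A, B)


def reswo_detects_error(c, w, d, p, fault=None):
    """RESwO: swap w and d and negate c; check A' + B = 0 and B' + A = 0 mod p."""
    A, B = butterfly(c, w, d, p, fault)
    A_re, B_re = butterfly((p - c) % p, d, w, p)
    return (A_re + B) % p != 0 or (B_re + A) % p != 0


assert reswo_m_detects_error(5, 2, 100, 257) is False
assert reswo_detects_error(5, 2, 100, 257) is False
assert reswo_m_detects_error(5, 2, 100, 257, fault=lambda t: t ^ 4) is True
assert reswo_detects_error(5, 2, 100, 257, fault=lambda t: t ^ 4) is True
```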

2.4 ASIC Assessments and Comparisons
The proposed error detection schemes are able to detect transient and permanent faults
(caused by intelligent attackers through intentional/malicious fault injection as well as by natural defects). In this
section, we present the results of our ASIC assessments using Synopsys Design Compiler
and VHDL with the TSMC 65-nm library for three pairs of (n, p) and two of our architectures to
assess the overhead (Table 2.1). We have used Fermat numbers of the form 1 + 2^i for
i = 8, 16, 32, which result in having ω = 2. Using 65-nm ASIC synthesis, and for the three
cases (n, p)_1 = (64, 257), (n, p)_2 = (256, 65537), and (n, p)_3 = (512, 4294967297), we
also present the overhead of the presented constructions for the case studies of the proposed
RESwO and RESwO-modified in this chapter. The benchmarking is performed for the error
detection architectures (for two proposed schemes) and also for the original constructions,
and overheads are shown in parentheses in Table 2.1. As shown in Table 2.1, the area
(in µm^2), delay (which is an indication of the maximum working frequency), and power
consumption at the frequency of 50 MHz are tabulated. The proposed schemes achieve acceptable overhead with very high error coverage. One would use RESwO if both permanent
and transient faults in the entire architecture are to be detected. RESwO-modified has
slightly less overhead and can detect transient faults in the structures.
We have performed simulations for (a) single, (b) two-bit, and (c) multiple-bit stuck-at faults. For each experiment, more than 65,000 cases have been considered. The results
show that our schemes detect these three cases with 100 percent error coverage.
Further analysis shows that if the comparison units (i.e., voters) are compromised, the error
detection scheme will degrade. Hardening the comparators, using triple modular redundancy
and other fault-tolerant techniques, can resolve this faulty-comparator situation.
We would like to finalize this section by noting that the proposed architectures are oblivious of the standard-cell library and hardware platform. Therefore, we expect similar results on field-programmable gate array (FPGA) and ASIC libraries. We also note that the
throughput and frequency overhead can be alleviated through pipelining at the expense of
added hardware overhead.

Table 2.1: Implementation results for ASIC through TSMC 65-nm for three case studies,
i.e., (n, p)_1 = (64, 257), (n, p)_2 = (256, 65537), and (n, p)_3 = (512, 4294967297), and two
proposed architectures, i.e., recomputing with swapped operands (RESwO) and its modified
variant RESwO-modified (RESwO-m). Overheads relative to the original architecture are given in parentheses.

Architecture         Area (µm^2)       Delay (ns)      Power (mW)
Original (n, p)_1    2,942             12.24           0.047
RESwO (n, p)_1       3,674 (24%)       13.37 (9%)      0.054 (16%)
RESwO-m (n, p)_1     3,544 (20%)       13.19 (8%)      0.052 (12%)
Original (n, p)_2    8,995             13.80           0.093
RESwO (n, p)_2       11,170 (24%)      14.41 (4%)      0.111 (18%)
RESwO-m (n, p)_2     11,001 (22%)      14.23 (3%)      0.108 (16%)
Original (n, p)_3    30,829            14.76           0.207
RESwO (n, p)_3       37,476 (22%)      15.90 (8%)      0.231 (15%)
RESwO-m (n, p)_3     35,972 (17%)      15.45 (5%)      0.228 (11%)

2.5 Conclusion
In this chapter, we have presented a number of categories for error detection schemes
of NTT in the ring R = Z/pZ[x]/(x^n + 1), which are also platform-oblivious. The proposed schemes
constitute error detection architectures on hardware based on recomputing with encoded
operands. Our target has been low hardware overhead, which is favorable to compact and
deeply-embedded architectures. We have implemented the proposed error detection techniques on ASIC for a 65-nm library to assess the implementation and performance metrics.
With high error coverage, the presented approaches achieve acceptable overhead (at most
24% area, 18% power consumption, and 9% delay for the synthesized case studies) and can
be tailored towards the objectives in terms of error detection and reliability.

Chapter 3: Error Detection Architectures for Ring Polynomial Multiplication
and Modular Reduction of Ring-LWE in Z/pZ[x]/(x^n + 1) Benchmarked on ASIC

3.1 Ring Polynomial Multiplication and Ring-Learning With Error²
Lattice-based cryptography is popular for resistance against known quantum algorithms,
as its security incorporates worst-case hardness of lattice problems [35]. Ideal lattices have
revolutionized post-quantum cryptography by providing realizable execution, higher efficiency, and low parameter size. Learning with error (LWE) [21] is one of the most versatile
worst-case lattice problems and allows us to completely pull out the lattice interpretation,
resulting in an extremely simple scheme. Ring learning with error (ring-LWE) [18] is one of
the most explored and studied lattice-based cryptographic schemes; among post-quantum
cryptosystems, it introduces an even more efficient encryption scheme than the standard
lattice problems [57] and is practically realizable and efficient for hardware implementation [58, 59].
Ring-LWE emerges as a promising post-quantum cryptosystem to employ in limited-resource environments. Besides encryption and key generation, fully homomorphic encryption (FHE) [38] and somewhat homomorphic encryption (SHE) [39], two emerging groundbreaking techniques to secure cloud data, rely on ring-LWE for efficient and advanced operations.
Ring polynomial multiplication (RPM) is an integral part of a number of emerging post-quantum cryptographic algorithms and various non-cryptographic applications. RPM is the
most rigorous computation for Ring-LWE, FHE, SHE, and a number of other cryptographic
architectures. Thus, designing an efficient RPM architecture will certainly improve the
performance of these state-of-the-art cryptosystems. RPM has versatile applications outside
the cryptographic area. Erasure coding [60], a strategy to reconstruct corrupt data, uses
RPM to ensure cost effectiveness and lower complexity. Ensuring the privacy of electronic
medical records [61] or multi-party communication [62], along with other applications [63–
66], relies on an efficient realization of RPM. Consequently, a robust and efficient RPM will be
highly beneficial in terms of time and hardware complexities.

² This chapter was published in the IEEE Transactions on Reliability [56] ©2020 IEEE

Ring-LWE involves addition and multiplication over a polynomial ring, where multiplication is the most rigorous operation and is computed using number theoretic transformation
(NTT) [34], a robust and efficient construction [44–46], with smaller key lengths. Thus,
efficient and fault-free modular multiplication of NTT is crucial to both high-speed and secure operation. Error detection architectures for both multiplication and modular reduction
operations of NTT will considerably enhance the security of current ring-LWE cryptosystems.
Previous works have been performed on error detection schemes on several cryptosystems, see, for instance, [47, 48, 52, 67, 68]. The research in [52] focused on different aspects of tweakable enciphering schemes (TES), including implementations on hardware and
software platforms, algorithmic security, and applicability to sensitive, security-constrained
usage models on TES. The work in [47] challenged the traditional use of fault coverage for
uniformly-distributed faults as a metric for evaluating the security of concurrent error detection (CED) against differential fault analysis (DFA). In [48], the security of logic encryption
against side-channel attacks has been evaluated. The problem of exploitable fault characterization in the context of DFA attacks on block ciphers was addressed in [67]. The research
work in [68] identified the weaknesses in the infection mechanism of the countermeasure that
could be exploited by attacks which change the flow sequence. This research work proposes
suitable randomization to reduce the success probabilities of attacks which change the flow
sequence and develops a fault tolerant implementation of the countermeasure. While these
works are based on classical cryptosystems, there exists limited work on error detection
for post-quantum cryptosystems. The major contribution of our work is that we apply error
detection schemes on post-quantum cryptosystems, unlike these previous works based on
classical cryptosystems. Our error detection schemes are applied on ring polynomial multiplication and modular reduction. While the previous works have explored error detection
on hash-based secure signature [69] and number-theoretic transformation of lattice-based
cryptosystems [33], both of which are post-quantum cryptosystems, this work, for the first
time, explores error detection schemes on RPM and modular reduction architectures, both
integral to any lattice-based cryptosystems.
In this chapter, we propose error detection schemes of both RPM and modular reduction
blocks, as different ring-LWE architectures use different moduli, depending on the security
level and application. The main contributions of the chapter are as follows:
• We introduce error detection schemes for RPM with several “modulo q” architectures
within the ring R = Z_q[x]/(x^n + 1). Among the merits of the proposed schemes is that they are
platform-oblivious.
• The proposed error detection schemes are recomputing with shifted operands (RESO) and recomputing with swapped operands (RESwO). We apply both these schemes to different modulo
q architectures, where they could detect the injected faults with high error coverage.
• We also introduce error detection schemes for the RPM architecture, i.e., recomputing with
negated operands (RENO), a subset of recomputing with scaled operands (REScO), with different performance and implementation metrics and efficiency. These approaches add very little hardware overhead, which is
advantageous for deeply-embedded systems.
• The proposed error detection schemes are assessed and the results show acceptable error
coverage. We implement our schemes on application-specific integrated circuit (ASIC), using
Synopsys Design Compiler and a 65-nm standard-cell library, to derive the implementation
and performance metrics.
The rest of the chapter is organized as follows. The next section recaps the theoretical
background of the ring polynomial multiplication technique and ring-LWE encryption. Sections
3.3 and 3.4 discuss the proposed error detection schemes for ring polynomial multiplication
and ring-LWE architectures, respectively. We summarize our hardware implementation
results in Section 3.5. Section 3.6 concludes the chapter.

3.2 Preliminaries

3.2.1 Ring Polynomial Multiplication
In this chapter, we have considered polynomials in the ring R = Z/pZ[x]/(x^n + 1). The irreducible
polynomial of this ring is represented as f(x), with degree n. Let two polynomials in
this ring be a(x) and b(x). The multiplication of a(x) and b(x) is derived as:

$$a(x) \cdot b(x) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j x^{i+j} \bmod f(x). \tag{3.1}$$

Here, we use the case presented in [18]. f(x) is an irreducible polynomial, where f(x) =
x^n + 1. Here, n is a power of 2, p is a prime number, and p ≡ 1 mod 2n. From the definition
of f(x), we can write x^n ≡ −1 mod f(x). Using this value of x^n in (3.1),
we derive the polynomial multiplication as:

$$c(x) = a(x) \cdot b(x) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (-1)^{\lfloor \frac{i+j}{n} \rfloor} a_i b_j x^{(i+j) \bmod n} \bmod f(x). \tag{3.2}$$
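
Equation (3.2) translates directly into a double loop. The following Python sketch is a straightforward quadratic-time reference; the function name `rpm` and the toy modulus p = 17 are assumptions for illustration.

```python
def rpm(a, b, p):
    """Multiply a(x) and b(x) in Z_p[x]/(x^n + 1) following (3.2)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = -1 if (i + j) // n else 1   # (-1)^floor((i+j)/n)
            k = (i + j) % n                    # exponent reduced with x^n = -1
            c[k] = (c[k] + sign * a[i] * b[j]) % p
    return c


# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2 = -5 + 10x in Z_17[x]/(x^2 + 1)
assert rpm([1, 2], [3, 4], 17) == [12, 10]
```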

3.2.2 Ring-LWE Encryption Scheme
Public-key encryption and signatures are essential for constructing lattice-based cryptosystems. The difficulty of Ring-LWE problems is the measure of their security, comparable
to the worst-case lattice problems [18]. Ring-LWE provides both encryption and portions
of signature schemes of ideal lattices, within a short key space, resulting in faster algebraic
operations. The cryptographic schemes of the Ring-LWE problem perform addition and multiplication over R = Z[x]/(x^n + 1) and R_q = Z_q[x]/(x^n + 1), where q is a prime number and n is a power of 2.
Such problems require one to decide whether the samples (a_1, t_1), ..., (a_m, t_m) ∈ R_q × R_q are
chosen uniformly at random, or whether each t_i = a_i s + e_i, where s, e_1, ..., e_m have small coefficients
from the (one-dimensional) discrete Gaussian distribution D_σ, with standard deviation σ
and mean 0, to attain the best entropy/standard deviation ratio [59].
In the following, we describe the steps of the encryption scheme. The NTT of a polynomial
a is denoted as ã.
• Key generation stage GEN(a): Two error polynomials r̃1 and r̃2 are sampled from D_σ,
and p̃ = r̃1 − ã · r̃2 ∈ R_q is computed. The public key is the polynomial pair (ã, p̃) and the secret key
is r̃2.
• Encryption stage ENC(ã, p̃, M): The input message M ∈ {0, 1}^n is encoded into
a polynomial M̃ = encode(M) ∈ R by multiplying each message bit by ⌊q/2⌋. The
ciphertext is obtained as c̃1 = ã ẽ1 + ẽ2 and c̃2 = p̃ ẽ1 + ẽ3 + M̃, where ẽ1, ẽ2, and ẽ3 ∈ R
are three error polynomials sampled from D_σ.
• Decryption stage DEC(c̃1, c̃2, r̃2): The inverse NTT recovers M̃ using M̃ = INTT(r̃2 c̃1 +
c̃2). M is decoded from M̃ elementwise using the following rule: if M̃[i] ∈
(−⌊q/4⌋, ⌊q/4⌋), then M[i] = 0; otherwise M[i] = 1, for 0 ≤ i ≤ n − 1.
A number of combinations of (n, q, σ) have been explored in previous work. The research
works in [35] and [18] have proposed (256, 4093, 8.35) and (214, 16381, 7.37) as medium- and
high-security parameter sets. Here, medium and high security correspond to the hardness
of breaking an AES-128 and AES-256 bit block cipher, respectively. The works in [59] and
[45] adopt the parameter sets (256, 7861, 11.31/√(2π)) and (512, 12289, 12.18/√(2π)) as
medium- and high-security parameters, comparable to AES-128 and AES-256, respectively.
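
The three stages above can be exercised with a toy Python sketch. Several simplifications are assumed here and are not part of the scheme as described: the arithmetic is done in the coefficient domain instead of the NTT domain, the discrete Gaussian D_σ is replaced by uniform sampling from {−1, 0, 1}, and the dimension n = 16 is far too small for any security. All function names are illustrative.

```python
import random


def rpm(a, b, q):
    """Negacyclic product in Z_q[x]/(x^n + 1), schoolbook version."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = -1 if (i + j) // n else 1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % q
    return c


def add(a, b, q):
    return [(x + y) % q for x, y in zip(a, b)]


def sub(a, b, q):
    return [(x - y) % q for x, y in zip(a, b)]


def small_poly(n, rng):
    # Assumption: stand-in for D_sigma, coefficients drawn from {-1, 0, 1}.
    return [rng.choice((-1, 0, 1)) for _ in range(n)]


def keygen(n, q, rng):
    a = [rng.randrange(q) for _ in range(n)]
    r1, r2 = small_poly(n, rng), small_poly(n, rng)
    return (a, sub(r1, rpm(a, r2, q), q)), r2      # public (a, p), secret r2


def encrypt(pub, bits, q, rng):
    a, pk = pub
    e1, e2, e3 = (small_poly(len(a), rng) for _ in range(3))
    m_enc = [bit * (q // 2) for bit in bits]       # encode: bit * floor(q/2)
    c1 = add(rpm(a, e1, q), e2, q)
    c2 = add(add(rpm(pk, e1, q), e3, q), m_enc, q)
    return c1, c2


def decrypt(r2, c1, c2, q):
    m = add(rpm(r2, c1, q), c2, q)
    out = []
    for v in m:
        centered = v - q if v > q // 2 else v      # map to (-q/2, q/2]
        out.append(0 if -(q // 4) < centered < q // 4 else 1)
    return out


rng = random.Random(2022)
n, q = 16, 12289
pub, sk = keygen(n, q, rng)
msg = [rng.randrange(2) for _ in range(n)]
assert decrypt(sk, *encrypt(pub, msg, q, rng), q) == msg
```

With these small error coefficients, the accumulated noise r̃2 ẽ2 + r̃1 ẽ1 + ẽ3 has magnitude at most 2n + 1 = 33 per coefficient, well below ⌊q/4⌋, so decoding always succeeds in this toy setting.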

3.3 Proposed Error Detection Scheme for Ring Polynomial Multiplication
In this chapter, we present efficient error detection architectures for polynomial ring
multiplication within the ring R = Z/pZ[x]/(x^n + 1). The proposed schemes can be applied to general
polynomials or operands, not confined to a subset or special cases of polynomials. Previous
work in [70] has presented a shift operation on the coefficients of one of the operands as a
countermeasure. This method worked perfectly for their model, where one of the operands
of the RPM was a ternary polynomial.
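For the general polynomial case, if we rewrite (3.2) in matrix form, the multiplication within
R = Z/pZ[x]/(x^n + 1) can be expressed as:

$$\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_{n-1} \end{bmatrix} = \begin{bmatrix} a_0 & -a_{n-1} & \cdots & -a_1 \\ a_1 & a_0 & \cdots & -a_2 \\ a_2 & a_1 & \cdots & -a_3 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1} & a_{n-2} & \cdots & a_0 \end{bmatrix} \cdot \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{n-1} \end{bmatrix} \tag{3.3}$$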

Shifting the coefficients of a(x) produces a very complex circuitry, and decoding the
shifted message is practically impossible with low overhead. As a result, we do not utilize
the shifting operation for general polynomials within R = Z/pZ[x]/(x^n + 1), although it worked smoothly
for the case of [70].
Besides shifting, the research in [70] has also applied a checksum method as a fault detection
technique. Actual and predicted checksums are compared to verify whether the data are intact. As
one of the polynomials for ring multiplication in that research work is a ternary polynomial,
the checksum of one of the multiplication operands and that of an intermediate computation
are theoretically equal. Nonetheless, for the proposed ring multiplication here, the following
is derived for the checksum C_s:

$$\begin{aligned} C_s = \sum_{k=0}^{n-1} c_k ={}& (a_0 b_0 - a_{n-1} b_1 - a_{n-2} b_2 - \ldots - a_1 b_{n-1}) + (a_1 b_0 + a_0 b_1 - a_{n-1} b_2 - \ldots - a_2 b_{n-1}) \\ &+ (a_2 b_0 + a_1 b_1 + a_0 b_2 - \ldots - a_3 b_{n-1}) + \ldots + (a_{n-1} b_0 + a_{n-2} b_1 + a_{n-3} b_2 + \ldots + a_0 b_{n-1}) \\ ={}& a_0 (b_0 + b_1 + b_2 + \ldots + b_{n-1}) + a_1 (b_0 + b_1 + b_2 + \ldots - b_{n-1}) \\ &+ a_2 (b_0 + b_1 + \ldots - b_{n-2} - b_{n-1}) + \ldots + a_{n-1} (b_0 - b_1 - \ldots - b_{n-1}). \end{aligned}$$
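
The expanded form of C_s can be checked numerically with a short Python sketch. The helper names and the toy modulus are assumptions; the point of the example is that the predicted checksum needs one multiplication per coefficient a_i, i.e., n multiplications in total.

```python
def rpm(a, b, p):
    """Negacyclic product in Z_p[x]/(x^n + 1), schoolbook version."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = -1 if (i + j) // n else 1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % p
    return c


def checksum_actual(a, b, p):
    """C_s computed as the sum of the product coefficients c_k."""
    return sum(rpm(a, b, p)) % p


def checksum_predicted(a, b, p):
    """C_s from the expanded form: a_i times a signed sum of the b_j."""
    n = len(a)
    total = 0
    for i in range(n):
        signed_sum = sum(b[: n - i]) - sum(b[n - i:])  # last i terms are negated
        total += a[i] * signed_sum                     # one multiplication per a_i
    return total % p


a, b, p = [1, 2, 3, 4], [5, 6, 7, 8], 17
assert checksum_actual(a, b, p) == checksum_predicted(a, b, p) == 4
```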

Additionally, we have derived the interleaved checksum, where we add the even and odd
coefficients of the product of multiplication. The results are given below, where Int_e and
Int_o are the even and odd interleaved checksums, respectively:

$$\begin{aligned} Int_e = \sum_{k=0,2,4,\ldots}^{n-1} c_k ={}& (a_0 b_0 - a_{n-1} b_1 - a_{n-2} b_2 - \ldots - a_1 b_{n-1}) + (a_2 b_0 + a_1 b_1 + a_0 b_2 - \ldots - a_3 b_{n-1}) \\ &+ \ldots + (a_{n-1} b_0 + a_{n-2} b_1 + a_{n-3} b_2 + \ldots + a_0 b_{n-1}) \\ ={}& a_0 (b_0 + b_2 + \ldots + b_{n-1}) + a_1 (b_1 + \ldots + b_{n-2} - b_{n-1}) \\ &+ a_2 (b_0 + b_2 + \ldots - b_{n-2}) + \ldots + a_{n-1} (b_0 - b_1 - b_3 - \ldots - b_{n-2}), \end{aligned}$$

$$\begin{aligned} Int_o = \sum_{k=1,3,5,\ldots}^{n-1} c_k ={}& (a_1 b_0 + a_0 b_1 - a_{n-1} b_2 - \ldots - a_2 b_{n-1}) + (a_3 b_0 + a_2 b_1 + a_1 b_2 - \ldots - a_4 b_{n-1}) \\ &+ \ldots + (a_{n-2} b_0 + a_{n-3} b_1 + a_{n-4} b_2 + \ldots - a_{n-1} b_{n-1}) \\ ={}& a_0 (b_1 + b_3 + b_5 + \ldots + b_{n-2}) + a_1 (b_0 + b_2 + \ldots - b_{n-3}) \\ &+ a_2 (b_1 + b_3 + \ldots - b_{n-1}) + \ldots + a_{n-1} (-b_2 - b_4 - \ldots - b_{n-1}). \end{aligned}$$
Both checksum and interleaved checksum will incur high area overhead, as there is no
efficient approach that can minimize the cost of the circuit. The checksum presented in [70]
can be applied to the ring R = Z/pZ[x]/(x^n − 1); however, it is not efficient for our ring R = Z/pZ[x]/(x^n + 1).
Moreover, the checksum operation of the convolution multiplication block in [70] requires no
multiplication operation, whereas the checksum in our RPM architecture requires n modular
multiplication units. Multiplication is an expensive operation that incurs high area overhead,
which makes checksum an unsuitable scheme for RPM. Checksum will be an acceptable
approach for high-performance applications, where area overhead is not vital but delay is.
However, as our work is focused on embedded systems, area overhead is crucial. As a result,
we introduce recomputing schemes which provide error detection at low cost.

3.3.1 Ring Polynomial Multiplication Architecture
In this chapter, we propose error detection schemes for the RPM within R = Z/pZ[x]/(x^n + 1).
However, our scheme is also applicable to another polynomial ring multiplication construction,
R = Z/pZ[x]/(x^n − 1). The aforementioned multiplication can be expressed as follows:

$$\begin{array}{rccccc} & a_0 b_0 & a_1 b_0 & a_2 b_0 & \cdots & a_{n-1} b_0 \\ + & -a_{n-1} b_1 & a_0 b_1 & a_1 b_1 & \cdots & a_{n-2} b_1 \\ + & -a_{n-2} b_2 & -a_{n-1} b_2 & a_0 b_2 & \cdots & a_{n-3} b_2 \\ & \vdots & \vdots & \vdots & & \vdots \\ + & -a_1 b_{n-1} & -a_2 b_{n-1} & -a_3 b_{n-1} & \cdots & a_0 b_{n-1} \\ \hline & c_0 & c_1 & c_2 & \cdots & c_{n-1} \end{array}$$

We utilize a multiplication (modulo p) circuit to compute a(x) · b(x). The coefficients of
element-wise multiplications are either positive or negative, as shown through the preceding
partial products. We can explain this using (3.2), where the term (−1)^⌊(i+j)/n⌋ decides whether
the coefficients are positive or negative. When i + j < n, ⌊(i+j)/n⌋ = 0, then (−1)^0 = 1, making
the coefficients positive. On the contrary, when i + j ⩾ n, ⌊(i+j)/n⌋ = 1, then (−1)^1 = −1, and
the coefficients are negative. The function is given as follows:

$$(-1)^{\lfloor \frac{i+j}{n} \rfloor} = \begin{cases} 1 & \text{when } i + j < n \text{ or } \lfloor \frac{i+j}{n} \rfloor = 0, \\ -1 & \text{when } i + j \geqslant n \text{ or } \lfloor \frac{i+j}{n} \rfloor = 1. \end{cases} \tag{3.4}$$

We require a module capable of performing both the addition and subtraction of two
operands. To achieve that, we use a multiplexing adder/subtractor unit. The selector of the
multiplexer, Sel, is basically the term ⌊(i+j)/n⌋. According to (3.4), the module acts as a mod
p adder and as a mod p subtractor when Sel, i.e., ⌊(i+j)/n⌋, is 0 and 1, respectively. As seen
in Figure 3.1, each box computes one coefficient of c(x), which takes n cycles. Therefore,
we get the final result of the computation in n cycles, as all coefficients are computed in
parallel.
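
A cycle-by-cycle software model of this datapath is sketched below. It reflects one reading of Figure 3.1, stated here as an assumption: in cycle j the coefficient b_j is broadcast to all n columns, and column k accumulates ±a_{(k−j) mod n} · b_j with the sign chosen by Sel. The names are illustrative.

```python
def rpm_columns(a, b, p):
    """Model of n parallel multiply-accumulate columns producing c_0..c_{n-1}."""
    n = len(a)
    acc = [0] * n                       # one accumulator per output coefficient
    for j in range(n):                  # cycle j broadcasts b_j
        for k in range(n):              # columns operate in parallel in hardware
            i = (k - j) % n             # index of the a coefficient for column k
            sel = (i + j) // n          # Sel = floor((i+j)/n): 0 adds, 1 subtracts
            prod = a[i] * b[j] % p
            acc[k] = (acc[k] - prod) % p if sel else (acc[k] + prod) % p
    return acc


assert rpm_columns([1, 2], [3, 4], 17) == [12, 10]
assert rpm_columns([1, 2, 3, 4], [5, 6, 7, 8], 17) == [12, 15, 2, 9]
```

The outer loop runs n times, matching the n cycles stated above, while the inner loop stands in for the columns that work concurrently.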
In our schemes, we utilize the aforementioned multiplication module to ensure smooth
operation. At first, we apply recomputing with scaled operands and compare the decoded
product of multiplication with the output.
Afterwards, as an economical and low-power subset of scaling, we recompute the multiplication by negating one or both of the operands and compare the decoded message with
the output. The latter approach adds very little area overhead, utilizing RPM for both the
rings efficiently. Depending on objectives for error coverage and overhead, our schemes can
be tailored to negate as well as scale one or both of the operands of multiplication, for an
error detection scheme with high error coverage.

Figure 3.1: Hardware architecture of ring polynomial multiplication in R = Z/pZ[x]/(x^n + 1).

3.3.2 Proposed Error Detection Scheme through Recomputing
Error detection codes might be generally inefficient and expensive for general polynomial
ring multiplication. With a view to making such detection schemes faster and cheaper, we
have utilized a recomputing method that scales one or both operands of the multiplication as
the encoding operation. Figure 3.2 describes REScO, which is a modified architecture of RPM,
where we insert a multiplexer, a multiplier, and dividers. The selector of the multiplexer,
Normal/REScO, determines whether it performs the original RPM or REScO, respectively. In
the latter case, one of the operands, e.g., b, is scaled by a factor k and we get the encoded
output, e(x) = k · a(x) · b(x). For the decoding process, we have to apply the multiplicative
inverse of the factor k modulo p. Thus, we have to select k carefully to avoid cases in which the
multiplicative inverse does not exist. To achieve that goal, k has to be a non-zero integer
where gcd(k, p) = 1. For example, if p = 128, we cannot use k = 2, as gcd(2, 128) ≠ 1.
On the contrary, k = 3 can be easily inverted modulo p, as gcd(3, 128) = 1. For prime p,
we do not have this restriction. As shown in Figure 3.2, we apply modular division by k to each
coefficient of e(x) and get the decoded output, d(x) ≡ k^{-1} · e(x) mod p.

Figure 3.2: Hardware architecture of proposed recomputing with scaled operand (REScO).
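
The REScO flow can be illustrated with the following Python sketch. The `fault` hook is a hypothetical way of corrupting the result of the normal run, and the toy values (p = 17, k = 3) are assumptions. The last lines reproduce the p = 128 example: Python's built-in `pow` raises `ValueError` when the inverse does not exist.

```python
def rpm(a, b, p):
    """Negacyclic product in Z_p[x]/(x^n + 1), schoolbook version."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = -1 if (i + j) // n else 1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % p
    return c


def resco_detects_error(a, b, p, k, fault=None):
    """Return True if the decoded recomputation differs from the normal run."""
    c = rpm(a, b, p)
    if fault is not None:
        c = fault(c)                              # hypothetical transient fault
    e = rpm(a, [k * x % p for x in b], p)         # encode: scale b by k
    k_inv = pow(k, -1, p)                         # needs Python 3.8+
    d = [k_inv * x % p for x in e]                # decode: divide by k mod p
    return d != c


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
assert resco_detects_error(a, b, 17, k=3) is False
assert resco_detects_error(a, b, 17, k=3, fault=lambda c: [c[0] ^ 1] + c[1:]) is True

# k = 3 is invertible modulo 128, while k = 2 is not.
assert pow(3, -1, 128) == 43
try:
    pow(2, -1, 128)
except ValueError:
    pass  # gcd(2, 128) != 1, so no inverse exists
```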
In terms of error coverage, REScO is effective in countering faults. However, it incurs
substantial hardware overhead from the multiplier and divider modules required during the
encoding and decoding stages. Specifically, we need to provide n dividers for
decoding. Division is a costly arithmetic operation, and inserting many dividers in the
implementation makes it expensive, power consuming, and slow. To solve this situation, we
explore a more efficient and low-power subset of REScO, i.e., RENO. In RENO, we recompute
by negating one or both of the operands of multiplication. Negating can be inferred as
multiplying by −1; hence, RENO is a special case of REScO. The hardware architecture
of this scheme (Figure 3.3) is very similar to the original RPM design. During encoding,
the select of the additional multiplexer chooses between normal and RENO operation. In
the case of RENO, the signs of the coefficients, given by (3.4), are swapped using an
inverter. Mathematically, for i + j < n, ⌊(i+j)/n⌋ = 0 becomes 1 after inversion and (−1)^1 = −1,
making the coefficients negative, and vice versa. Therefore, we get the encoded output e(x)
as e(x) = −a(x) · b(x). During the decoding stage, we need to find the additive inverse of each
coefficient of e(x) mod p. Let any of the coefficients of the decoded output d(x) be d_i, where
d_i ≡ (p − e_i) mod p ≡ −e_i mod p. We use n subtractors in this stage of RENO.
Adder/subtractor modules are inexpensive compared to the multiplier and divider modules
of REScO. Moreover, the structure in Figure 3.3 that derives the decoded results, i.e., (p − e_i) mod
p, is not as costly as a general subtractor, because one of its inputs is always fixed,
i.e., p, which simplifies the architecture. Thus, RENO provides an efficient and low-overhead
error detection method. We would like to emphasize that REScO and RENO are not two
different techniques. RENO is a subset of REScO, where we are scaling the operands by
−1.

Figure 3.3: Hardware architecture of proposed recomputing with negated operand (RENO).
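
A minimal Python sketch of RENO encoding and decoding follows. The `invert_sel` flag models the inverter on Sel, and the names and toy parameters are assumptions for illustration.

```python
def rpm(a, b, p, invert_sel=False):
    """Negacyclic product; invert_sel flips Sel so that the result is -a(x)b(x)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sel = (i + j) // n          # Sel as in (3.4)
            if invert_sel:
                sel ^= 1                # inverter on the select line (RENO)
            sign = -1 if sel else 1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % p
    return c


def reno_decode(e, p):
    """d_i = (p - e_i) mod p, one fixed-input subtraction per coefficient."""
    return [(p - x) % p for x in e]


a, b, p = [1, 2, 3, 4], [5, 6, 7, 8], 17
c = rpm(a, b, p)                        # normal run
e = rpm(a, b, p, invert_sel=True)       # RENO run: e(x) = -a(x) * b(x)
assert e == [5, 2, 15, 8]
assert reno_decode(e, p) == c == [12, 15, 2, 9]
```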
The architecture of RENO in Figure 3.3 can detect transient faults correctly. However,
this architecture can only detect permanent faults present in the multiplexing adder/subtractor
module, while failing to detect such faults in the operands or in any other section of the architecture. To resolve this issue, we introduce the modification in Figure 3.4 and negate one of the
operands from the beginning of the computation. The Normal/RENO multiplexer now
selects the normal operand b_j or the negated operand (p − b_j). As a result, the operands are negated
at the input stage, which enables this scheme to detect both permanent and transient
faults in the operands as well as in the entire architecture. Moreover, the structure in
Figure 3.4 reduces the hardware overhead by removing the (p − e_i) mod p box and adding
two multiplexers before the feedback structure. Using the select Enc/Dec, the multiplexers
either perform encoding by element-wise multiplication of a and the negated b, i.e., (p − b_j), or
subtract e_i from p, giving the decoded output d_i ≡ (p − e_i) mod p. We perform computations on the operands in two runs: the first run, i.e., run1, deals with normal computation,
and the second run, i.e., run2, deals with RENO. In Figures 3.4a and 3.4b, thickened
lines and bold texts represent the multiplexer paths for run1 and run2, respectively. We use
the selector Enc throughout run1 and the first n clock cycles of run2. On the other hand,
Dec is selected only in the (n + 1)th cycle of RENO, in order to complete the decoding of
the negated operands. In this manner, we eliminate n subtractors by performing the
(p − e_i) operation through the already existing adder/subtractor modules. As the hardware
overheads of multiplexers are considerably lower than those of these modules, modified RENO costs
even less than regular RENO, while ensuring higher error coverage. One can also modify RENO
by negating both of the operands. In this case, the input operands are (p − a_i) and (p − b_i),
instead of a_i and b_i. As multiplication of two negative terms gives a positive result, there is
no need for decoding. Negating both input operands requires more involved encoding and
hardware overhead (as seen in the next section on ASIC). Nevertheless, in some platforms, the
absence of a decoding stage might compensate for this excess circuitry.

Figure 3.4: Hardware architecture of modified RENO. (a) run1, the select of the top multiplexer is Normal; (b) run2, the select of the top multiplexer is RENO.
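
The difference between the two RENO variants under a permanent operand fault can be illustrated in software. The fault model below is an assumption made only for this sketch: a stuck-at-1 fault on the least significant bit of the b_j input bus, present in both runs. With it, inverting Sel alone leaves the fault unnoticed, while negating b at the input exposes it.

```python
def rpm(a, b, p, invert_sel=False, b_fault=None):
    """Negacyclic product; b_fault is a hypothetical permanent fault on the b_j bus."""
    n = len(a)
    c = [0] * n
    for j in range(n):
        bj = b_fault(b[j]) if b_fault else b[j]
        for i in range(n):
            sel = ((i + j) // n) ^ (1 if invert_sel else 0)
            sign = -1 if sel else 1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * bj) % p
    return c


def negate(v, p):
    return [(p - x) % p for x in v]


def reno_sel_detects(a, b, p, b_fault=None):
    """RENO of Figure 3.3: invert Sel in run2, then decode by negation."""
    c = rpm(a, b, p, b_fault=b_fault)
    e = rpm(a, b, p, invert_sel=True, b_fault=b_fault)
    return negate(e, p) != c


def reno_modified_detects(a, b, p, b_fault=None):
    """Modified RENO: feed (p - b_j) in run2, then decode by negation."""
    c = rpm(a, b, p, b_fault=b_fault)
    e = rpm(a, negate(b, p), p, b_fault=b_fault)
    return negate(e, p) != c


def stuck_at_1(x):
    return x | 1                         # bit 0 of the b_j bus stuck at 1


a, b, p = [1, 2, 3, 4], [2, 4, 6, 8], 17
assert reno_sel_detects(a, b, p, stuck_at_1) is False      # fault is missed
assert reno_modified_detects(a, b, p, stuck_at_1) is True  # fault is flagged
# Negating both operands needs no decoding: (p - a)(p - b) = a * b mod p.
assert rpm(negate(a, p), negate(b, p), p) == rpm(a, b, p)
```

Whether a particular stuck-at fault is exposed depends on the operand values; here the coefficients of b are even, so the fault alters run1 but leaves the odd values p − b_j unchanged.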
Our proposed schemes can detect transient faults with close to 100% error coverage (one needs
to harden the comparison logic to achieve higher error coverage). RENO also provides close
to 100% error coverage for permanent and long transient faults. We utilize an error detection
flag, which is the logic OR of the comparisons for every column. Even if only one of the
columns of Figure 3.4 has an erroneous output, the flag will be set to 1 and we can detect the
error.
Two unlikely cases may appear during assertion of permanent or long transient faults.
One event can be “masking”, in which the output is not erroneous, even if a fault exists
in the intermediate logic. Such cases are excluded because the circuit masks the faults and
these are not translated to errors. The second instance is a rare case where all the entries
of operands ai and bi are zero. RENO cannot detect these errors, because negating any zero
value will keep it unaltered. However, applying all the input bits to a logic OR gate can
be secondary measure to detect such case. We would like to emphasize that this would be
equivalent to multiplying two zero polynomials which is an unlikely case.
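
The behaviour described above can be sketched in software. The following Python model is only an illustration under stated assumptions, not the hardware of Figure 3.4: it assumes the ring (Z/pZ)[x]/(x^n + 1), uses a small n for speed, and models a permanent fault with a hypothetical `stuck_at_1` function that forces bit 0 of every product term to 1 in both runs. The function names (`rpm`, `negate`, `reno_one`, `reno_both`) are chosen for this sketch only.

```python
import random

def rpm(a, b, p, fault=None):
    """Schoolbook product of a and b in (Z/pZ)[x]/(x^n + 1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            term = a[i] * b[j]
            if fault is not None:
                term = fault(term)      # hypothetical permanent fault in the multiplier
            k = i + j
            if k < n:
                c[k] = (c[k] + term) % p
            else:                       # x^n = -1: wrapped terms are subtracted
                c[k - n] = (c[k - n] - term) % p
    return c

def negate(v, p):
    """Encode (or decode) by mapping every coefficient x to (p - x) mod p."""
    return [(p - x) % p for x in v]

def reno_one(a, b, p, fault=None):
    """Negate one operand in run2 and decode; True means the error flag is set."""
    run1 = rpm(a, b, p, fault)
    run2 = negate(rpm(a, negate(b, p), p, fault), p)
    return run1 != run2

def reno_both(a, b, p, fault=None):
    """Negate both operands in run2; no decoding is needed."""
    run1 = rpm(a, b, p, fault)
    run2 = rpm(negate(a, p), negate(b, p), p, fault)
    return run1 != run2

stuck_at_1 = lambda term: term | 1      # assumed fault: bit 0 of every product stuck at 1

rng = random.Random(0)
p, n = 1049089, 8                       # p from the text; n reduced for the sketch
a = [rng.randrange(p) for _ in range(n)]
b = [rng.randrange(p) for _ in range(n)]
print(reno_one(a, b, p), reno_both(a, b, p))        # fault-free runs agree
print(reno_one([2, 0], [1, 0], 17, stuck_at_1))     # permanent fault, one operand negated
print(reno_both([0, 0], [0, 0], 17, stuck_at_1))    # all-zero operands, both negated
```

In this toy model, the all-zero call with both operands negated repeats exactly the same computation in run2, so the flag stays low, which mirrors the corner case discussed above. Whether a given fault is caught depends on the fault location and the operand values, so the sketch should not be read as a coverage estimate.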

3.3.3 Ameliorating the Throughput Overhead through Pipelining
The delay overhead we take into account is the critical path delay, where the critical path
is the path that incurs the highest delay. As our error detection is a time-redundancy
technique, the total time of a recomputed architecture is twice that of the original architecture,
which deteriorates the throughput if no measure is in place to compensate for this shortcoming.
Without pipelining, the throughput degrades drastically; this can be improved
by applying subpipelining. Subpipelining increases the frequency so that the design
throughput is close to that of the original architecture. This incurs slightly higher area
overhead, which is acceptable given the low throughput degradation achieved for the
error detection approach. We insert registers in locations that break the timing
paths into approximately equal halves. We denote the two halves of the pipelined stages as
H1 and H2. Figure 3.5 shows our scheduling order of normal (Ni) and recomputed
(Ri) operations, where 1 ≤ i ≤ n, n being the number of cycles in the original non-pipelined
approach. We compute Ri and Ni in the same cycle but in different pipeline
stages, whereas in the next cycle, Ni+1 and Ri are computed.

Figure 3.5: Pipelined scheduling for data path of the proposed schemes.

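
The interleaving of normal and recomputed operations can be made concrete with a small schedule generator. This is a sketch that assumes a two-stage split in which every operation enters H1 and moves to H2 one cycle later; the function name `schedule` and the tuple layout are illustrative choices, not part of the original design.

```python
def schedule(n):
    """Return (cycle, op in H1, op in H2) rows for n normal/recomputed pairs."""
    # Operations enter stage H1 in the order N1, R1, N2, R2, ..., Nn, Rn.
    order = []
    for i in range(1, n + 1):
        order += [f"N{i}", f"R{i}"]
    rows = []
    for cycle in range(len(order) + 1):
        in_h1 = order[cycle] if cycle < len(order) else None
        in_h2 = order[cycle - 1] if cycle >= 1 else None   # H2 lags H1 by one cycle
        rows.append((cycle + 1, in_h1, in_h2))
    return rows

for row in schedule(2):
    print(row)
```

For n = 2 the second row places R1 in H1 while N1 is in H2, and the third row places N2 in H1 while R1 is in H2, matching the ordering described in the text.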
3.4 Proposed Error Detection Schemes for Ring-LWE Architecture
To construct the ring-LWE encryption architecture, based on the preliminaries presented in
this chapter, we utilize a DSP-enabled schoolbook polynomial multiplier, along with a modular
reduction block. Here, we focus on two sets of parameters, i.e., $(n, q, \sigma) = (214, 16381, 7.37)$
and $(512, 12289, 12.18/\sqrt{2\pi})$, both being high-security parameters. The resemblance
between the reduction methods of $(n, q, \sigma) = (214, 16381, 7.37)$ and $(256, 4093, 8.35)$ makes
our scheme easily modifiable to apply to the other parameter sets [71]. On the other hand,
$(n, q, \sigma) = (256, 7681, 11.31/\sqrt{2\pi})$ and $(512, 12289, 12.18/\sqrt{2\pi})$ both use the SAMS2 technique for
modular reduction in the research works of [45] and [72]. Thus, our error detection scheme
presented through the latter parameter set is also applicable to the former.
The choice of q depends on the level of security, the efficiency of modular reduction,
and the properties of the modulus, e.g., a Fermat number or a large prime number.
This work, for the first time, explores error detection schemes within modular operations.
Subsection 3.3.2 introduced error detection schemes using recomputing for the multiplication
operation, i.e., RPM. In the following, i.e., Subsections 3.4.1 and 3.4.2, we explore error
detection schemes for modular reduction operations. In Figure 3.3, we have seen the mod p
block, where we can apply our modular reduction operations, based on the value of p. Our
error detection schemes on modular reduction can be used in any compatible architecture,
not being limited to RPM only.

3.4.1 Error Detection Scheme for Polynomial Multiplier and q=16381
In this construction, we use a DSP-based schoolbook polynomial multiplication scheme,
followed by the modulo q operation. For q = 16381, it is found that $2^{14} \bmod 16381 = 3$. The
inputs of the DSP blocks are 14 bits in length; as a result, the product can be written
as $x_{27\ldots0} = 2^{14} x_{27\ldots14} + x_{13\ldots0} \equiv 3x_{27\ldots14} + x_{13\ldots0} = (x_{27\ldots14} \ll 1) + x_{27\ldots14} + x_{13\ldots0} \pmod{q}$, where
$\ll$ denotes a left shift. The modular operation reduces the result to within [0, 16380],
requiring two modulo q operations at most, which are performed by the modulo q reducer
block of Figure 3.6. On the other hand, the DSP block computes the unsigned multiplication
through (AB + C). In the case of signed multiplication, i.e., multiplication with a negative
number, (D − A)B + C is performed, where D = q.
In this section, we propose two variants of recomputing schemes, which we apply to the
most rigorous computation of the ring-LWE encryption operation, i.e., the entire DSP as well
as the modulo q reducer block. Figure 3.6 shows recomputing with shifted operands (RESO),
in which two of the input operands are shifted to the left by 1 bit, which is multiplication by 2
in binary. Another approach is recomputing with swapped operands (RESwO),
where two of the input operands are swapped. In the former approach, we insert a multiplexer
that selects either the normal mode (Norm) or the RESO mode (RESO) of operation through the
select pin Norm/RESO. In a Norm operation, we get the usual multiplier output with mod
q reduction, whereas the RESO mode performs a left shift of both operands A and C, giving the
multiplier output as 2AB + 2C = 2(AB + C). The output of the modulo q block is shifted to
the right by 1 bit, which provides AB + C in a fault-free scenario. The outputs of both
rounds are compared, and any discrepancy between the results detects the presence of faults
in the architecture. RESwO can also be applied in a similar manner, which will detect both
permanent and transient faults with less overhead.

Figure 3.6: Proposed construction of schoolbook log2 q × log2 q bit multiplier for q = 16381.
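
The shift-and-add reduction and the RESO check can be illustrated with a short Python model. It is a sketch under assumptions rather than a description of the exact circuit: the Norm path folds with 2^14 ≡ 3 (mod q) as in the text, while the recomputed path is assumed to reduce modulo 2q (suggested by the 2·16381 constant in Figure 3.6) so that the final right shift is exact, and Python's `%` stands in for that hardware path. The names `reduce_q`, `mac`, and `detects`, and the `fault` callback, are hypothetical.

```python
Q = 16381  # 2**14 - 3

def reduce_q(x):
    """Fold with 2^14 = 3 (mod Q) until x fits in 14 bits, then subtract Q once if needed."""
    while x >> 14:
        x = 3 * (x >> 14) + (x & 0x3FFF)
    return x - Q if x >= Q else x

def mac(a, b, c, reso=False, fault=None):
    """Compute (a*b + c) mod Q in Norm mode or in RESO mode."""
    if reso:
        a, c = a << 1, c << 1            # encode: shift operands A and C left by one bit
    acc = a * b + c                      # DSP output: AB + C or 2(AB + C)
    if fault is not None:
        acc = fault(acc)                 # assumed fault at the DSP output
    if reso:
        return (acc % (2 * Q)) >> 1      # assumed reduction mod 2Q, then decode by >> 1
    return reduce_q(acc)

def detects(a, b, c, fault=None):
    """True when the Norm and RESO results disagree, i.e., the error flag is raised."""
    return mac(a, b, c, False, fault) != mac(a, b, c, True, fault)

print(detects(3, 5, 1))                      # fault-free
print(detects(3, 5, 1, lambda v: v | 1))     # bit 0 of the DSP output stuck at 1
```

The stuck bit lands on a different data bit in the two runs because the recomputed data are shifted, which is what lets the comparison expose a fault that is present in both runs.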

3.4.2 Error Detection Scheme for SAMS2 Approach and q=12289
Modular reduction for some values of q is computationally less efficient
than for the values explored above, yet such values of q are well known and widely used in SHE
and other cryptographic applications. The works of [45, 72] use q = 7681
and 12289, and apply shift-addition-multiplication-subtraction-subtraction (SAMS2) for faster
modular reduction.

Figure 3.7: Our proposed SAMS2 construction for error detection in modular reduction.
In Figure 3.7, we explain the Norm mode for q = 12289, where the input contains 14 bits,
as $2^{14} \equiv 2^{12} - 1 \pmod{12289}$. In the Shift-Add block (Figure 3.8a), we approximate the quotient t of
xout = x − tq as (x >> 14) + (x >> 16) + (x >> 18) + (x >> 20) + (x >> 22) + (x >> 24) + (x >> 26),
which is a combination of shift and addition operations, based on [72]. From this value of
t, we use the Multq. block (Figure 3.8b) to find the product tq. The Multq. block is itself
a combination of left shifts and additions, resulting in a much more efficient scheme compared to
a multiplier. In the last block, i.e., Subt., the subtractions between xin and the multiples of q
from q to 7q are performed in parallel (Figure 3.9), providing a much faster reduction due
to simultaneous calculations. Taking the least positive number between xin − tq and the
results of the above subtractions, the Subt. block iterates until the output is
lower than q.
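
The Norm-mode data flow can be sketched in Python as follows. This is a behavioural approximation, not the circuit: the decomposition q = 2^13 + 2^12 + 1 used in `mult_q` is an assumption about how the shifts are arranged, and the sequential correction loops stand in for the parallel subtractors of the Subt. block. The function names are chosen for this sketch.

```python
Q = 12289  # 2**13 + 2**12 + 1

def shift_add(x):
    """Approximate quotient t of x / Q using right shifts by 14, 16, ..., 26."""
    return sum(x >> s for s in range(14, 27, 2))

def mult_q(t):
    """Compute t * Q with left shifts and additions only (assumed decomposition of Q)."""
    return (t << 13) + (t << 12) + t

def sams2(x):
    """Reduce a non-negative x (up to 29 bits in this sketch) modulo Q."""
    r = x - mult_q(shift_add(x))
    while r < 0:        # guard for an over-estimated quotient (assumption of this sketch)
        r += Q
    while r >= Q:       # stands in for the parallel subtraction of q, 2q, ..., 7q
        r -= Q
    return r

print(sams2(123456789), 123456789 % Q)
```

Because the quotient is only approximated, the remainder after the first subtraction is generally a small multiple of q away from the final result, which is why a correction stage is needed after the Multq. block.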
In this chapter, we present the error detection architecture of the SAMS2 operation, through
the RESO mode of the multiplier (Figure 3.7). We apply recomputing with shifted operands
as the encoding and decoding operations of the input and output, respectively. As the entire
SAMS2 operation is linear, applying a left shift at the input stage and a right shift at the
output stage should retain the same xout as in the Norm mode. Thus, comparing the outputs of both
rounds detects an error whenever the two values of xout fail to coincide.

Figure 3.8: SAMS2 construction of (a) the Shift-add block and (b) the Multq. block.

3.5 Error Coverage and ASIC

3.5.1 Fault Model
In fault attacks (intentional, malicious fault injections), single-bit faults using
the stuck-at model are preferably injected. By repeatedly comparing the erroneous and error-free
outputs, the last subkey is derived, and eventually, the secret key is compromised. Noting
the technological constraints, an attacker may not be able to inject a single stuck-at fault;
therefore, multiple bits might be flipped. This is
one of the reasons that adjacent stuck-at faults need to be considered in fault models as well.
We note that the stuck-at fault model (both
single and multiple) is able to model both natural and malicious faults and thus is utilized
throughout this chapter to achieve this twofold goal of the proposed schemes [73].

Figure 3.9: SAMS2 construction of the Subt. block.

3.5.2 Assessments
In this section, we present the results of our error simulations and ASIC assessments using
Synopsys Design Compiler and VHDL with the TSMC 65-nm library, for two security levels and two of
our architectures, to assess the overhead. Using 65-nm ASIC synthesis, we present the
overhead of the presented constructions for the case studies of moderate and high security
levels, i.e., (n = 256, p = 1049089) and (n = 512, p = 4206593), respectively. We have
also chosen a third set of parameters, (n = 1024, p = 536903681), based on SHE [39]. The
benchmarking is performed for the error detection architectures (for the two proposed schemes,
i.e., Prop. 1: negating both operands and Prop. 2: negating one operand) and
also for the original construction.

3.5.3 Fault Simulations
We evaluated the error detection capability of the proposed work based on fault-injection
simulation coded in VHDL. We injected three types of stuck-at faults, i.e., (a) single, (b)
two-bit, and (c) multiple-bit faults for over 1012 cases, all injected at the input state of the
algorithm. An attacker may not be successful at flipping exactly one bit to collect sensitive
information due to technological constraints, which led us to consider multiple stuck-at
faults. The faults that we consider are stuck-at-0 and stuck-at-1. The schemes provide high
error coverage (the reservation on the comparator is explained below) for these three cases. The
simulation results confirm that the schemes can detect both transient and permanent
faults satisfactorily. We assume the comparators are hardened, i.e., the comparators are
fault free and not compromised.
In cryptographic engineering, the reliability of a cryptosystem is the measure of its ability to
thwart malicious faults and, in this work, natural stuck-at faults. The higher the accuracy an
error detection scheme can provide, the higher its reliability. As our schemes provide high
error coverage, they are relatively reliable. To further elaborate, let us explore the
rate of missed detection, also known as the false negative rate (FNR), which is 0.001%. This
can be explained by masking of errors, where the presence of one defect hides the presence
of another defect. Our schemes have no false positives (false alarms), as such cases are
relevant where error detection is done in mid-stages of error detection constructions, where
detected faults are masked. All flagged cases correspond to actual faults, and there has not been a case
where a fault-free condition was flagged as a faulty one, resulting in zero false positive (FP)
detections and a zero percent false positive rate (FPR) for our schemes. We can deduce the
true positive rate (TPR) of our schemes from TPR = 1 − FNR to be 99.999%. Moreover,
our schemes have full coverage for the ratio of the number of true positives to the number
of all positives, i.e., $\text{precision} = \frac{TP}{TP + FP}$, where FP is explained before and TP is the total
number of true positive detections.
Sensitivity is the measure of correctly detecting faults where faults are actually injected,
i.e., $\text{sensitivity} = \frac{TPR}{TPR + FNR}$. Our proposed schemes are highly sensitive, with close to
100% sensitivity, using the above-mentioned values of TPR and FNR. To find the receiver
operating characteristic (ROC), one has to plot TPR against FPR with respect to varying
threshold values. As the output of the comparators shows either a high or a low flag for faulty and
fault-free outputs, respectively, we do not require various threshold values in our scenario.
Considering the cases in which the comparison units are hardened, the resulting ROC curve
is almost a vertical graph, as the horizontal axis denotes FPR and the vertical axis denotes TPR,
the values of which are presented before.
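
As a worked check of the figures quoted above, the metrics can be evaluated directly from the stated FNR of 0.001% and the stated absence of false positives. The block below only substitutes those two values into the definitions used in this subsection and assumes at least one true positive detection.

```latex
\begin{align*}
\mathrm{TPR} &= 1 - \mathrm{FNR} = 1 - 0.00001 = 0.99999,\\
\text{precision} &= \frac{TP}{TP + FP} = \frac{TP}{TP + 0} = 1,\\
\text{sensitivity} &= \frac{\mathrm{TPR}}{\mathrm{TPR} + \mathrm{FNR}}
  = \frac{0.99999}{0.99999 + 0.00001} = 0.99999.
\end{align*}
```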

3.5.4 ASIC Comparison for Error Detection in RPM Module
Table 3.1 tabulates the area [in terms of µm², which can be converted to kilo gate
equivalent (kGE), i.e., the area normalized to a 2-input NAND gate, by dividing the
column numbers by 1.41 × 10^3], the delay (which is an indication of the maximum working frequency),
and the power consumption at the frequency of 20 MHz. The original architecture
denotes the case where no error detection schemes are applied. The proposed schemes achieve
acceptable overhead compared to the original architecture, with very high error coverage.
We also note that negating one of the input operands requires more involved encoding and
hardware overhead, compared to negating both input operands.

Table 3.1: Implementation results for ASIC TSMC 65-nm of RPM architecture (Prop. 1:
Negating both operands, Prop. 2: Negating one operand). Percentages in parentheses are
overheads relative to the original architecture with the same parameters.

Architecture                          Area (µm²)           Area (kGE)   Delay (ns)      Frequency (MHz)   Power (mW, at 20 MHz)
Original (n = 256, p = 1049089)       260,055              184          25.2            39.6              5.3
Original (n = 512, p = 4206593)       539,766              382          26.5            37.7              10.7
Original (n = 1024, p = 536903681)    1,097,602            778          28.0            35.7              23.9
Prop. 1 (n = 256, p = 1049089)        290,846 (11.5%)      206          33.9 (34.5%)    29.5              5.9 (11.3%)
Prop. 1 (n = 512, p = 4206593)        589,111 (11.1%)      417          34.7 (32.8%)    28.8              12.8 (19.6%)
Prop. 1 (n = 1024, p = 536903681)     1,254,009 (14.3%)    889          36.3 (28.6%)    27.5              26.3 (10.1%)
Prop. 2 (n = 256, p = 1049089)        311,056 (19.6%)      220          28.6 (13.5%)    34.9              6.1 (15.1%)
Prop. 2 (n = 512, p = 4206593)        608,223 (12.7%)      431          29.5 (11.3%)    33.8              12.4 (15.8%)
Prop. 2 (n = 1024, p = 536903681)     1,262,960 (19.6%)    895          33.8 (20.7%)    29.6              26.7 (11.7%)

Here, we note that the RENO operation in Subsection 3.3.2 and the RESO operation in
Subsections 3.4.1 and 3.4.2 are compatible. We explored the efficient schemes in each of
the architectures. RENO is computationally more efficient than RESO in the case of the RPM
architecture, as only one mod p negation block suffices for the RENO operation. On the other
hand, to apply RESO on RPM, we need to add left shift blocks to all of the input operands,
which requires (n + 1) left shift blocks at the input, as we are feeding the coefficients of
a in parallel, and n right shift blocks at the output. Thus, RESO incurs much higher area
overhead than RENO, making RENO a better recomputing scheme for RPM. However,
applying RENO in the modular reduction operation would further complicate the mod q negation
inside the modulo q reducer block of Figure 3.6 and the Subt. block of Figure 3.7, as they are
already a series of negation operations. Consequently, adding another mod q negation would
result in a discrepancy in the reduction operation. Moreover, to apply RESO, we only use
two left shift blocks and one right shift block in Figure 3.6, which is much cheaper than
mod q negation units. We note that the results of our ASIC analysis can be compared
with a theoretical analysis of the overhead, which leads to justifications of such results. The area
overhead includes those of the comparators and added registers; thus, the overhead is
acceptable, as seen in the table (this holds for the power overhead as well). The delay
overhead is not high but, as we have explained, the throughput degradation is not negligible,
and pipelining would ameliorate such degradation.

We would like to finalize this section by noting that the proposed architectures are oblivious of the standard-cell library and hardware platform. Therefore, we expect similar results on field-programmable gate array (FPGA) and ASIC libraries. We also note that the
throughput and frequency overhead can be alleviated through pipelining at the expense of
added hardware overhead. Differential fault intensity analysis (DFIA), which is a combination of differential power analysis and fault injection concepts, has gained much attention in
the recent past. The biased fault models range from low intensity to higher ones in previous works. Our aforementioned proposed fault detection schemes have capabilities to detect
these biased faults. Finally, previous studies of [70] introduced efficient countermeasures
against fault attacks on NTRUEncrypt. However, RPM in NTRUEncrypt [70] is a special
case and it is not applicable to other cryptosystems, for example, the ring-LWE in [44].
Moreover, we do not utilize the shifting operation for a general polynomial within $R = \frac{\mathbb{Z}/p\mathbb{Z}[x]}{x^n + 1}$,
although it worked smoothly for the case of [70]. We also note that the error detection schemes presented in this chapter for RPM in the ring $R = \frac{\mathbb{Z}/p\mathbb{Z}[x]}{x^n + 1}$ are not confined to this ring
and can be incorporated into a number of other constructions, such as the ring $R = \frac{\mathbb{Z}/p\mathbb{Z}[x]}{x^n - 1}$.
3.6 Conclusion
In this chapter, we have proposed efficient error detection schemes, i.e., REScO and
RENO (a subset of REScO), with different performance and implementation metrics and efficiency. Additionally, we employ RESO and RESwO in a number of ring-LWE architectures
and modular reduction stages, which can be applied to most well-known modulo operations.
These approaches add very little hardware overhead, which is advantageous for
deeply-embedded systems. We have benchmarked the proposed architectures to assess their
ability to detect transient and permanent faults. Moreover, we have implemented the proposed error detection architectures on ASIC, and our results show that the proposed efficient
error detection architectures can be feasibly utilized for RPM in the rings $R = \frac{\mathbb{Z}/p\mathbb{Z}[x]}{x^n + 1}$ and
$R = \frac{\mathbb{Z}/p\mathbb{Z}[x]}{x^n - 1}$. We note that our scheme is suitable for the required performance, reliability,
and implementation metrics for constrained applications.

Chapter 4: Fault Detection Architectures for Inverted Binary Ring-LWE
Construction Benchmarked on FPGA

4.1 Inverted Binary Ring-LWE
Lattice-based cryptography has revolutionized post-quantum cryptography (PQC) through
realizable execution, efficiency, and low parameter size. Learning with errors (LWE) is
a highly-explored worst-case lattice problem and provides an efficient scheme. Ring learning
with errors (RLWE) is a family of assumptions which lead to one of the most versatile
encryption schemes, compared to the standard lattice problems. A new variant of RLWE is
proposed in the research presented in [75], involving a binary distribution to choose binary
coefficients instead of Gaussian, namely, Ring-BinLWE. A hardware-optimized scheme of
Ring-BinLWE proposed in [76] utilizes an inverted ring of Ring-BinLWE (InvRBLWE) and
2's-complement notation range.
In this chapter, we introduce fault detection constructions on the Ring-BinLWE architecture,
which can be tailored based on the needs in terms of reliability and the restrictions in terms
of the added overhead in constrained applications. Past research works have been performed
for fault detection schemes on several cryptosystems [25, 68, 77–81]. These include research
works on different public and symmetric-key cryptosystems, and are mainly based on
error-detecting codes on classical cryptosystems. Very few works exist on fault detection of PQC,
e.g., hash-based secure signature [50], the number-theoretic transformation of lattice-based
cryptosystems [33], and ring polynomial multiplication of RLWE [56]. Some examples for
error detection in general computations and classical cryptography exist as well [82, 83].
The main contributions of this work are as follows:
3 This chapter was published in the IEEE Transactions on Circuits and Systems II [74] ©2021 IEEE


• We devise architectures for key generation and encryption of the Ring-BinLWE problem.
The construction clarifies the gate-level architectures of these two stages and supports the
validity of the augmented fault detection modules.
• We introduce fault detection schemes for Ring-BinLWE within the ring $R = \frac{\mathbb{Z}_q[x]}{x^n + 1}$,
for all three phases, i.e., key generation, encryption, and decryption. The proposed fault
detection schemes are based on encoding, recomputing, and decoding the operands. We
apply these schemes to three stages of the InvRBLWE architecture, which can be tailored to
apply to other RLWE architectures as well.
• The assessed results of the proposed schemes show acceptable error coverage. To assess
the overhead, we implement the proposed schemes on a Xilinx field-programmable gate array
(FPGA) family.

4.2 Preliminaries
RLWE provides both encryption and portions of the signature scheme of ideal lattices,
within a short keyspace, resulting in faster algebraic operations. The cryptographic schemes
of the RLWE problem perform addition and multiplication over $R = \frac{\mathbb{Z}[x]}{x^n + 1}$ and $R_q = \frac{\mathbb{Z}_q[x]}{x^n + 1}$,
where q is a prime number and n is a power of 2. Using $x^n + 1$ as the modulus enables an
efficient implementation of the anti-circular rotation through shift operations. Among
multiple variants of RLWE, the work in [84] proposes a binary error distribution instead of
the Gaussian, namely, Ring-BinLWE, which leads to smaller key and ciphertext sizes and
avoids expensive computations of Gaussian distributions. Moreover, another improvement on
Ring-BinLWE was achieved in [76] using 2's complement notation of the coefficients, namely,
InvRBLWE, by selecting the range of $R_q = \frac{\mathbb{Z}_q[x]}{x^n + 1}$ as $(-\lfloor \frac{q}{2} \rfloor, \lfloor \frac{q}{2} \rfloor - 1)$ and eliminating the need
for modular reduction. In the following, we describe the steps for the InvRBLWE problem.
• Key Generation stage GEN(a): Let us assume two error polynomials $r_1, r_2 \in \{0, 1\}^n$
and let $p = r_1 - a r_2 \in R_q$. The public key is the polynomial pair $(a, p) \in R_q$ and the secret
key is $r_2$.

• Encryption stage ENC(a, p, m): The input message $m \in \{0, 1\}^n$ is encoded into a
polynomial $\tilde{m} = \mathrm{encode}(m) \in R_q$, where encode is defined as follows:

$$(m_0, m_1, \ldots, m_{n-1}) \to \sum_{i=0}^{n-1} m_i \left(-\frac{q}{2}\right) x^i \tag{4.1}$$

The ciphertext can be obtained as $c_1 = a e_1 + e_2$ and $c_2 = p e_1 + e_3 + \tilde{m}$, where $e_1$, $e_2$, and
$e_3 \in R_q$ are three error polynomials, sampled from $\{0, 1\}^n$.
• Decryption stage DEC(c1, c2, r2): To recover m from $\tilde{m}$, first $\tilde{m} = c_1 r_2 + c_2$ is computed.
Decoding of m from $\tilde{m}$ can be performed using the following decode function:

$$\mathrm{DECODE}: R_q \to \{0, 1\}^n, \qquad \sum_{i=0}^{n-1} a_i x^i \to (m_0, m_1, \ldots, m_{n-1}),$$
$$m_i = \begin{cases} 0 & \text{when } \left| a_i - i - \left\lfloor \frac{n-3}{2} \right\rceil \right| > \frac{q}{4} \\ 1 & \text{else.} \end{cases} \tag{4.2}$$
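
The arithmetic of the three stages can be sketched in Python. The sketch below is an illustration under assumptions: it uses a toy size n = 8 with q = 256, represents coefficients in the two's-complement range via a helper named `wrap`, and stops before the DECODE threshold of (4.2). It only checks the algebraic identity that c1·r2 + c2 equals the encoded message plus the noise term e1·r1 + e2·r2 + e3. All function and variable names are chosen for this sketch.

```python
import random

def wrap(v, q):
    """Map an integer to the two's-complement range [-q/2, q/2 - 1]."""
    return (v + q // 2) % q - q // 2

def mul(a, b, q):
    """Schoolbook product in Z_q[x]/(x^n + 1) (anti-circular wrap-around)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]
    return [wrap(v, q) for v in c]

def add(q, *polys):
    """Coefficient-wise sum of any number of polynomials."""
    return [wrap(sum(col), q) for col in zip(*polys)]

def encode(m, q):
    """Encode message bit m_i as m_i * (-q/2), following (4.1)."""
    return [wrap(-(q // 2) * bit, q) for bit in m]

n, q = 8, 256                                   # toy size; the text uses n = 256 or 512
rng = random.Random(1)
bits = lambda: [rng.randint(0, 1) for _ in range(n)]
a = [rng.randrange(-q // 2, q // 2) for _ in range(n)]

# Key generation: p = r1 - a*r2
r1, r2 = bits(), bits()
p = [wrap(x - y, q) for x, y in zip(r1, mul(a, r2, q))]

# Encryption: c1 = a*e1 + e2, c2 = p*e1 + e3 + encode(m)
m, e1, e2, e3 = bits(), bits(), bits(), bits()
c1 = add(q, mul(a, e1, q), e2)
c2 = add(q, mul(p, e1, q), e3, encode(m, q))

# Decryption (before decoding): c1*r2 + c2
m_tilde = add(q, mul(c1, r2, q), c2)
noise = add(q, mul(e1, r1, q), mul(e2, r2, q), e3)
print(m_tilde == add(q, encode(m, q), noise))
```

The check shows why the secret r2 removes the a-dependent terms: only the small noise products remain on top of the encoded message.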
4.3 Proposed Fault Detection Schemes
According to the most recent attack [81], the parameter sets (n, q) = (256, 256) and (512, 256)
provide 73/84 bits and 140/190 bits of quantum/classical security, respectively. Our
schemes are applicable to both security levels, and we apply recomputing schemes on three
stages of InvRBLWE. Our motivation is to achieve low-complexity schemes; thus, we ensure
that the augmented fault detection schemes lead to acceptable overhead, compared to the
original architecture.

4.3.1 Recomputing with Encoded (Shifted) Operands
In this chapter, we adopt shifting of the operands by doubling the inputs and dividing the
outputs by 2, which can be interpreted as shifting the inputs one place to the left and the
outputs one place to the right in binary, respectively.

4.3.1.1 Key Generation

The multiplexer select input, Norm/RESO, shown in Figure 4.1, determines whether
the original or the recomputed operation (denoted as recomputing with shifted operands
(RESO)) is performed. During Norm/RESO=0, i.e., the original operation, the NAND
gates produce the bitwise complement of a.r2, while the left adder of the top block completes the 2's complement of
a.r2 by adding 1 and produces −a.r2. The right adder input is either −a.r2 or r1 during
multiplexer select S1=0 and 1, respectively. The anti-circular rotation is implemented in
hardware by feeding each register Res[i] to the next adder, and the negative of Res[n − 1] to
the right adder of the top block. The architecture performs multiplication when the control
signal S1 is set to zero, through the shift-and-add method, requiring n parallel adders of 8
bits. In such a cycle, all the adders, except the top one, perform the add operation to find the
product of a and r2. A shift register feeds one bit of r1 and r2, namely, r1[i] and r2[i], during each
clock cycle of multiplication, where r1, r2 ∈ {0, 1}^n. Each bit of the n-bit vectors r1 and r2
is extended to 8 bits (log2 q), as the results are stored in registers of 8-bit length. This notation
using the index i, e.g., r2[i], is used throughout the chapter and represents each bit of a
binary vector being stretched to 8 bits using a shift register, to maintain consistency. During
run2, i.e., the recomputed operation, we multiply a and r1 by 2, which can be represented
as each being left shifted one place, the output being Subrun2 = 2(r1 − ar2). The left
shift explains why the operands occupy 9 bits in the RESO operation, instead of 8 in the
Norm cycle. Afterward, to compute the decoded result, we discard the least significant
bit of the output. In Figure 4.1 and subsequent figures, the gray-colored box represents the
original architecture, whereas the components outside the box represent the fault detection
modules. For example, the multiplexers, the shifters, and the comparator modules outside
the gray-colored box in Figure 4.1 are our added circuitry for fault detection.

Figure 4.1: Hardware construction of recomputing with shifted operands for key generation
of InvRBLWE.
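
The bit-width argument for RESO in key generation can be checked with a small Python model. It is a sketch under assumptions: coefficients are held as unsigned 8-bit words in the Norm run and 9-bit words in the RESO run, the whole polynomial product is computed at once instead of bit-serially, and a hypothetical `fault` callback forces bits of every output word in both runs. The names `keygen_norm` and `keygen_reso` are illustrative.

```python
import random

def mul(a, b, mod):
    """Schoolbook product in Z[x]/(x^n + 1), reduced to an unsigned word of size mod."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]
    return [v % mod for v in c]

def keygen_norm(a, r1, r2, fault=None):
    """Norm run: Sub[7:0] = r1 - a*r2 on 8-bit words."""
    sub = [(x - y) % 256 for x, y in zip(r1, mul(a, r2, 256))]
    return [fault(w) for w in sub] if fault else sub

def keygen_reso(a, r1, r2, fault=None):
    """RESO run: shift a and r1 left, compute on 9-bit words, keep bits [8:1]."""
    a2 = [(x << 1) % 512 for x in a]
    r1_2 = [x << 1 for x in r1]
    sub = [(x - y) % 512 for x, y in zip(r1_2, mul(a2, r2, 512))]   # Sub_run2[8:0]
    if fault:
        sub = [fault(w) for w in sub]
    return [w >> 1 for w in sub]                                     # discard the LSB

rng = random.Random(2)
n = 16
a = [rng.randrange(256) for _ in range(n)]       # 8-bit two's-complement words, unsigned view
r1 = [rng.randint(0, 1) for _ in range(n)]
r2 = [rng.randint(0, 1) for _ in range(n)]
print(keygen_norm(a, r1, r2) == keygen_reso(a, r1, r2))              # fault-free comparison

stuck = lambda w: w | 1                          # assumed fault: output bit 0 stuck at 1
print(keygen_norm([0, 0], [0, 0], [0, 0], stuck) != keygen_reso([0, 0], [0, 0], [0, 0], stuck))
```

The extra ninth bit is what keeps the doubled result from overflowing, so that dropping the least significant bit returns exactly the 8-bit Norm value when no fault is present.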

4.3.1.2 Encryption

The encryption operations provide two outputs, c1 and c2. Based on Figure 4.2a, the
output c1 can be computed using logic circuitry similar to that of key generation. The
original architecture requires the multiplication of a and e1, which is performed during the S1=0
cycle of the multiplexer. The addition is completed through the multiplexer when S1=1. The
anti-circular rotation is performed as described above. In order to perform recomputing on
c1, we set the multiplexer select Norm/RESO to 1 for the RESO operation. During the encoding, the
adders of Figure 4.2a provide Addrun2 = 2(ae1 + e2). We extract the most significant 8
bits of the output and compare them with the Norm cycle output. To construct the architecture
computing c2 = pe1 + e3 + m̃, we assume that m̃ is pre-computed from (4.1). According to
Figure 4.2b, during multiplexer select S1=0, we multiply p and e1; then, during S1=1, the
addition of e3 and m̃ is performed. During the RESO run, we encode twice c2 by shifting p, e3,
and m̃ one place to the left each, which gives us the encoded output, Addrun2 = 2(pe1 + e3 + m̃).
The decoding operation halves the output, which is then compared with the Norm cycle
output to detect the presence of any faults.

Figure 4.2: Hardware construction of recomputing with shifted operands for encryption of
InvRBLWE. (a) Fault detection for c1. (b) Fault detection for c2.

Figure 4.3: Hardware construction of recomputing with shifted operands for decryption of
InvRBLWE.

4.3.1.3 Decryption

The decryption computes m̃ = c1 r2 + c2, which we obtain by applying the same architecture used for computing c1, as shown in Figure 4.3. During the Norm run, we compute the
original m̃ and compare it with the RESO cycle output. The latter shifts the operands one place
to the left, which gives 2m̃, and the decoding takes the most significant 8 bits, in order
to find half of the encoded output.

4.3.2 Recomputing with Encoded (Negated) Operands
While RESO has a high rate of fault detection, the increase in bus size makes RESO relatively expensive for the rigorous multiplication operation. Moreover, the comparator
unit must select 8 specific bits of the output, ranging from LSB to (MSB-1), further complicating the
process. Hence, we explore a less costly alternative, namely, recomputing with negated
operands (RENO). The operands in InvRBLWE are already in 2's complement; thus, we
can avoid the cost of performing 2's complement externally, which makes RENO
a highly efficient fault detection scheme while maintaining high error coverage.

4.3.2.1

RENO on Key Generation

To perform recomputing with negated operands in the key generation stage, we insert a multiplexer that selects between the regular operation without error detection (Norm) and the RENO operation. While the Norm operation computes p = r1 − ar2, the RENO operation negates both operands a and r2, which provides p′ = r1 − (−a)(−r2); −a and −r2 are denoted as a′ and r2′ in Figure 4.4a. In a fault-free scenario, the recomputed output of the adder, i.e., Sub′, will be equal to the original output, i.e., Sub. RENO benefits in terms of overhead in two ways: 1) there is no need for decoding, as negating both operands of the multiplication is self-decoding, and 2) the representation of the operands in 2’s complement removes the need to compute the negation with external circuitry.
As encryption provides two outputs, we have to enforce fault detection in both computations. We compute the RENO outputs of c1 as c1reno = (−a)(−e1) + e2, whose operation is identical to the decryption as described below, and of c2 as c2reno = (−p)(−e1) + e3 + m, as shown in Figure 4.4b. We then compare the RENO outputs (Add′) with their corresponding original round outputs (Add), so that any discrepancy is detected.
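The self-decoding property of RENO in key generation can be sketched in software as follows. This is an illustrative model only: it assumes q = 256 and the ring with x^n = −1, and the `fault` hook (a stuck-at-1 on the least significant bit of the multiplier's first operand) is a hypothetical fault site chosen for the example, not one taken from the simulations.

```python
Q = 256  # assumed modulus; 8-bit coefficients in 2's complement


def ring_mul(a, b, fault=None):
    """Product in Z_Q[x]/(x^n + 1); `fault` corrupts the first operand port."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        ai = a[i] if fault is None else fault(a[i])
        for j in range(n):
            term = ai * b[j]
            k = i + j
            if k < n:
                out[k] = (out[k] + term) % Q
            else:
                out[k - n] = (out[k - n] - term) % Q
    return out


def negate(v):
    """2's complement negation of every coefficient."""
    return [(-x) % Q for x in v]


def keygen(a, r1, r2, fault=None):
    """p = r1 - a * r2."""
    prod = ring_mul(a, r2, fault)
    return [(x - y) % Q for x, y in zip(r1, prod)]


def reno_flags_error(a, r1, r2, fault=None):
    """Compare the Norm output (Sub) with the RENO output (Sub')."""
    norm = keygen(a, r1, r2, fault)
    reno = keygen(negate(a), r1, negate(r2), fault)
    return norm != reno


def stuck_at_1_lsb(x):
    """Hypothetical permanent fault: bit 0 of the operand is stuck at 1."""
    return x | 1
```

Because the same permanent fault acts on different operand values in the two runs, the two outputs diverge and the comparator raises the error flag, while the fault-free runs match without any decoding step.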

Figure 4.4: Hardware construction of recomputing with negated operands (RENO) for (a) key generation and (b) encryption of InvRBLWE (the gray-colored box denotes the module of the corresponding scheme in the RESO figures).

Figure 4.5: Hardware construction of recomputing with negated operands (RENO) for decryption.

4.3.2.2

RENO on Decryption

Here, we compare the non-recomputed round output of decryption, m, with the recomputed output mreno = (−c1)(−r2) + c2, as shown in Figure 4.5. The Norm round output of each byte, Add, is compared with the RENO round output of the same byte, Add′.

4.4

Error Coverage and FPGA Implementations

4.4.1 Fault Simulation
Our proposed fault detection schemes can detect both permanent and transient faults. An attacker may not be successful in flipping exactly one bit to collect sensitive information due to technological constraints, which leads us to consider schemes that can detect multiple stuck-at faults (stuck-at 0 and stuck-at 1), in addition to single faults. Our fault model considers stuck-at faults whose effect time can range from multiple clock cycles (transient faults) to a full operation (permanent faults). We consider the cases of faulty wires, including the cases where such a wire does not affect the other connected wires. Hence, our fault model encompasses the events which are excluded by the assumptions of the multivariate fault model of the work in [79]. Our redundancy-based schemes can thwart the fault injections presented in the work of [25], which include zeroing the ciphertext and zeroing the secret key. Such fault attacks can be counted as CCA2 (adaptive chosen-ciphertext attacks), where redundancy can protect against skipping faults in the context of RLWE. Along the same line of logic, our schemes can thwart the faults presented in [80], which assumes injection of a single random fault, ranging from skipping faults to glitches in storage; this is evident from our simulation results for permanent and transient faults. A software-based fault-resilient approach was presented in the work of [81], whose fault model covers zeroing, skipping, and randomization faults, which can be thwarted based on the above discussion.
We evaluated the fault detection capability of the proposed work through fault-injection simulation coded in VHDL. We injected three types of stuck-at faults (stuck-at 0 and stuck-at 1), i.e., a) single-bit upset (SBU), b) single-byte double-bit upset (SBDBU), and c) multiple bit (MB) faults, for over 65,000 cases, all injected at the input state of the decryption algorithm. In each case, our schemes achieve high fault detection rates (worst-case error coverage of 99.9991%) for both permanent and transient faults. Moreover, the comparator circuits themselves can be compromised, which can be resolved by hardening them using triple modular redundancy (TMR) and other fault-tolerant techniques as a solution to faulty voter conditions. We incorporated the TMR circuit, where a module is replicated three times and a majority voter, which is immune to faults, extracts the output. To further enhance the simulation, we injected faults in three locations: a) the inputs, b) the adder outputs, and c) one of the TMR voter inputs of the key generation scheme. Our schemes show a worst-case error coverage of 99.9968% for such cases, confirming that they can detect faults with high error coverage even when the comparators are compromised. Our simulations show that recomputing can detect cryptographically impactful faults of different multiplicities, which can break the security of unprotected implementations.
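The TMR hardening of the comparator described above can be sketched as a bitwise majority vote over three copies of the error flag. This is a simplified software model under stated assumptions: the voter itself is taken to be fault-free, and `faulty_copy` is a hypothetical knob that inverts the flag of one comparator copy.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority voter (assumed fault-free)."""
    return (a & b) | (a & c) | (b & c)


def hardened_compare(norm, recomputed, faulty_copy=None):
    """Error flag from three comparator copies followed by a majority vote.

    faulty_copy: index (0, 1, or 2) of a comparator whose flag is inverted,
    modeling a single compromised comparator; None means no fault.
    """
    flags = [int(norm != recomputed) for _ in range(3)]
    if faulty_copy is not None:
        flags[faulty_copy] ^= 1
    return majority(*flags)
```

With a single compromised comparator, the voted flag still reflects the true comparison result, which is the property the hardening relies on.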
Our fault detection schemes are algorithm-oblivious; hence, the faults injected and the errors introduced in the algorithm of RLWE do not coincide. The errors added during the three stages of RLWE do not tolerate the malicious or natural faults of our fault model, because the faults cause malfunction at the site of injection, i.e., the module or the wire. In contrast, the errors are injected to ensure that the RLWE problem retains the hardness of the worst-case lattice problem. It is evident from our simulation results that our schemes strengthen the security of RLWE architectures, which are otherwise prone to hardware fault injection.
A subset of fault attacks that can obtain biased fault models is presented in the work
of [77], with the idea of a higher probability for fault injected in both original and redundant architectures. The fault categories presented in [77] are single-bit upset (SBU),
single-byte double-bit upset, single-byte triple-bit upset (SBTBU), single-byte quadruple-bit
upset (SBQBU), other single byte (OSB) faults, and multiple byte (MB) faults. The redundancy based fault detection schemes presented in this chapter, along with other parity-based
approaches, e.g., signatures and interleaved parity, can prevent the aforementioned faults
fully [85]. While the presented redundancy based fault detection schemes may fail to detect
attacks where the adversary can inject the same fault in both the input and output, i.e.,
bypassing the fault detection computation, cascading the encoding schemes based on fault
space transformation [78] can nullify the effect of bias and thwart the biased attacks.

4.4.2 FPGA Comparison for Error Detection
We benchmark the RESO and RENO fault detection schemes, as well as part of the original implementation from [76], on Virtex-7 and Kintex UltraScale+ FPGAs. We note that the entire architecture in [76] is much larger; to obtain fair overheads, we have implemented only the subset to which error detection is applied. Table 4.1 presents the hardware implementation results for n = 256, performing a complete encryption/key-generation operation, as shown in Figure 4.1 and Figure 4.2. In our implementation, the key generation and decryption stages provided identical results; hence,
we are tabulating both in one category.

Table 4.1: Implementation results on Kintex-UltraScale+ and Virtex-7 FPGAs for encryption (EncKin and EncVir, respectively) and key generation/decryption (GenKin and GenVir, respectively). We chose (n, q) = (256, 256) to reflect moderate security, and the overheads (in parentheses) include the cost of the TMR module.

Architecture         Area: LUT       Area: FF        Delay (ns)       Power (mW)
Original (EncKin)    826             769             19.13            1.44
RESO (EncKin)        1133 (37.17%)   1045 (35.89%)   20.58 (7.56%)    1.64 (14.10%)
RENO (EncKin)        888 (7.51%)     809 (5.20%)     19.74 (3.17%)    1.61 (11.81%)
Original (GenKin)    108             256             14.45            1.38
RESO (GenKin)        152 (40.74%)    359 (40.23%)    17.51 (21.23%)   1.69 (22.41%)
RENO (GenKin)        129 (19.44%)    297 (16.02%)    16.71 (15.68%)   1.44 (4.06%)
Original (EncVir)    930             577             19.13            0.54
RESO (EncVir)        1234 (32.69%)   792 (37.26%)    21.67 (13.29%)   0.593 (9.81%)
RENO (EncVir)        1007 (8.27%)    611 (5.89%)     19.9 (4.04%)     0.577 (6.85%)
Original (GenVir)    108             256             18.98            0.186
RESO (GenVir)        151 (34.4%)     378 (41.92%)    21.93 (14.40%)   0.274 (39.46%)
RENO (GenVir)        125 (15.74%)    291 (13.67%)    20.45 (7.74%)    0.223 (19.89%)

Our results incorporate TMR as well as subpipelining for throughput degradation alleviation. Subpipelining reduces the data path delay by doubling the frequency, at the expense of higher area overhead. For fair comparison, we have utilized medium area and performance efforts for both synthesis and implementation phases in Vivado across the implementations. In the absence of any compensation, the total time of recomputing architectures that do not embed throughput alleviation approaches will be twice the original, i.e., 2n cycles. This drastic deterioration of the throughput can be improved by incorporating subpipelining: the design throughput will be close to that of the original architecture, as subpipelining increases the frequency. While subpipelining introduces slight area overhead, the overall low throughput degradation of the error detection approach compensates for it. One can insert registers in locations that break the timing paths into approximately equal halves.
From Table 4.1, for both cases, the area overheads, i.e., lookup table (LUT) and flip-flop (FF), as well as the delay and power overheads, are significantly lower for RENO compared to RESO, which reflects the lack of a decoding stage and the representation of the inputs in 2’s complement form. The overheads are also acceptable, with the highest overhead in RENO being 15.74%. The source of the overheads is the modules outside the gray-colored box in all the figures.

4.5

Conclusion
This chapter presents two fault detection schemes on three separate stages of InvRBLWE architectures in the ring $R = (\mathbb{Z}/p\mathbb{Z})[x]/(x^n + 1)$. The schemes add low overhead with high error coverage. The low hardware overhead is beneficial to compact and deeply embedded system applications. We assess the implementation and performance metrics of our fault detection schemes by implementing them on Virtex-7 and Kintex-UltraScale+ FPGAs. With high error coverage and low overhead, our schemes can be tailored in terms of the fault detection required and the overhead to be tolerated.

Chapter 5: Efficient Error Detection Architectures for Post Quantum
Signature Falcon’s Sampler and KEM SABER

5.1

Post-Quantum KEM and Signature Schemes
4

Lattice-based cryptography [87] is one of the most promising classes among the NIST post-quantum cryptography (PQC) submissions of the final round (announced in 2020). One category of lattice-based encryption schemes is learning with errors (LWE)-based schemes, which incorporate the worst-case lattice problem. Learning with rounding (LWR) [88] is a subclass within LWE; the security of both relies on the introduction of noise. SABER is one such module-LWR [89] encryption scheme, which is resistant against chosen-ciphertext attacks (CCA) and proceeded to the third round of NIST’s PQC competition in 2020.
SABER is computationally challenging because it lacks an NTT-based multiplier: it uses an unconventional parameter set compared to the prime parameter sets of the popular number-theoretic transform (NTT) [37]. This has been improved by the fast polynomial multiplication based on the Toom-Cook algorithm [90] proposed in the work of [91]. Software optimization techniques for SABER have been proposed by improving the Toom-Cook multiplier [91]. The hardware/software co-design approach to accelerate the SABER computation has been explored in [92], which achieved significant speed-up compared to software-based implementations.
Among the NIST PQC competition Round 3 finalists, Falcon [93], a lattice-based signature scheme, utilizes fast Fourier sampling over NTRU lattices, instantiating the theoretical framework of a hash-and-sign-based signature technique proposed in [94], the latter being provably secure and resistant against the key-recovery attack [95]. The article in [96] presented a compact and efficient instantiation of Falcon, which allows an intermediate security level. The toolchain proposed in [97] to instantiate an efficient constant-time discrete Gaussian sampler proved to be practical and secure for use in a post-quantum signature algorithm, e.g., Falcon, with insignificant performance degradation compared to a non-constant-time sampler. To summarize, Falcon ranks best in terms of efficiency and compactness, while not sacrificing security, making it an attractive signature scheme for the PQC era.

4 This chapter was published in the IEEE Transactions on Very Large Scale Integration Systems (TVLSI) [86] ©2022 IEEE

In this chapter, we propose fault detection techniques for SABER, in both the full hardware and the HW/SW codesign approaches. As the security of Gaussian samplers has been a concern for the scheme, we propose error detection against fault attacks on the Falcon hardware implementation, a highly compact variant of Falcon, i.e., ModFalcon [96], as well as the sampling algorithm of a constant-time Gaussian sampler [97]. This is the first work on fault detection schemes of a post-quantum cryptographic signature scheme. Such attacks can break into state-of-the-art signature schemes and derive sensitive information. Very few works, such as [33, 56, 74, 98, 99], exist on error detection of PQC. Our proposed schemes can be tailored to resource-constrained applications while being flexible to different reliability levels.
The main contributions of this chapter are as follows:
• We present fault detection schemes for SABER on the performance bottleneck, the pseudo-random number generator (PRNG) involving a binomial sampler, as well as on the polynomial multiplier architecture of the fully hardware SABER architecture.
• We also propose an error detection architecture in the high-level architecture of the HW/SW codesign approach of SABER, especially in the evaluation and the interpolation datapath of the Toom-Cook algorithm, which is the most computationally exhaustive stage of any SABER architecture.
• We propose error detection schemes for the hardware construction of Falcon’s sampler, specifically, in the signature algorithm of ModFalcon and the Gaussian sampler. We apply recomputing schemes to achieve high fault coverage. The schemes are flexible and can be applied to other signature schemes as well.
• We simulate the proposed schemes by injecting faults in a Xilinx FPGA family. The assessment of our proposed schemes shows high error coverage.
• We implement the proposed architecture on an FPGA family to evaluate the implementation and performance metrics of SABER. The proposed error detection schemes add acceptable overheads compared to the original implementation.

5.2

Preliminaries

5.2.1 Recomputing Overview
Recomputing is a time redundancy technique, involving encoding (c) and decoding (d)
operations of the function in question (f ) (e.g., sampler, polynomial multiplication), where
decoding is the functional inverse of the encoding operation. In this method, typically the
transient faults within the functions result in different outputs between the non-recomputed
and recomputed cycles. However, permanent faults are typically only detected if recomputation is done using encoded operands. Fault attacks involving clock or voltage glitches, laser beam injection, or electromagnetic pulses, which tamper with the operation of the electric circuit and alter the inputs, intermediate variables, or final results, may be detected via recomputing.
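The difference between plain recomputation and recomputation with encoded operands can be made concrete with a toy function. The sketch below is only an illustration: the function `scale` (multiplication by 5), the shift-based encoding, and the stuck-at-0 fault on output bit 2 are all assumptions chosen for the example.

```python
def fault_flagged(f, x, encode=lambda v: v, decode=lambda v: v):
    """Compare the normal run f(x) with the decoded recomputation.

    With the default identity encode/decode this is plain recomputation;
    passing an encode/decode pair gives recomputing with encoded operands.
    """
    return f(x) != decode(f(encode(x)))


def scale(x):
    """Fault-free toy function: multiply by 5 (linear, so shifts commute)."""
    return 5 * x


def scale_stuck(x):
    """Same function with a permanent stuck-at-0 fault on output bit 2."""
    return (5 * x) & ~0b100


def shift_left(v):
    return v << 1


def shift_right(v):
    return v >> 1
```

For the permanently faulty `scale_stuck`, plain recomputation reproduces the same wrong value twice and raises no flag, whereas the shifted recomputation exposes the fault because the stuck bit now corrupts a different bit position of the result.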

5.2.1.1

Saber Overview

The security of SABER relies on the hardness of the module-LWR problem, which is given by: $(\vec{a},\ b = \lfloor \frac{p}{q}(\vec{a}^{T}\vec{s}) \rceil) \in R_q^{l \times l} \times R_p$, where $\vec{a}$ is a vector of randomly generated polynomials in $R_q$ and $\vec{s}$ is a secret vector of polynomials in $R_q$ whose coefficients are sampled from a centered binomial distribution, and the modulus p is less than q.
• Key generation: This process starts by randomly generating a seed that determines an l × l matrix $\vec{A}$ consisting of $l^2$ polynomials in $R_q$. A secret vector $\vec{s}$ of polynomials whose entries are sampled from a centered binomial distribution is also generated. The public key then incorporates the matrix seed and the rounded product $\vec{A}^{T}\vec{s}$, while the secret key consists of the secret vector $\vec{s}$.
• Encryption: Encryption consists of generating a new ‘secret’ $\vec{s'}$ and adding the message to the inner product between the public key and the new secret $\vec{s'}$. This forms the first part of the ciphertext, while the second is used to hide the encrypting secret and contains the rounded product $\vec{A}\vec{s'}$.
• Decryption: Decryption utilizes the secret key to compute v, which is approximately
the same as the v ′ computed during encryption. This allows extracting the message from
the ciphertext.
• Parameter Selection: SABER defines three sets of parameters which match NIST security levels 1, 3, and 5, namely, LightSABER, SABER, and FireSABER. All three levels use polynomial degree N = 256 and moduli q = 2^13 and p = 2^10. However, the module dimension and the binomial distribution parameter differ: LightSABER, SABER, and FireSABER use module dimensions 2, 3, and 4, respectively, and their secrets are sampled from [−5, 5], [−4, 4], and [−3, 3].
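The parameter sets and the rounding step of the module-LWR problem can be summarized in a short sketch. It assumes that rounding from modulus q = 2^13 to p = 2^10 is realized as "add half a step, then drop the three low bits", which is one common way to implement ⌊(p/q)·x⌉ for power-of-two moduli; the dictionary layout and function names are hypothetical.

```python
EPS_Q, EPS_P = 13, 10
Q, P = 1 << EPS_Q, 1 << EPS_P

# Module dimension l and binomial parameter mu for each security level.
PARAMS = {
    "LightSABER": {"l": 2, "mu": 10},
    "SABER": {"l": 3, "mu": 8},
    "FireSABER": {"l": 4, "mu": 6},
}


def round_q_to_p(x):
    """Round a coefficient modulo q to the nearest multiple of q/p, modulo p."""
    shift = EPS_Q - EPS_P          # 3 low bits are discarded
    half_step = 1 << (shift - 1)   # rounding constant (half of q/p)
    return ((x + half_step) % Q) >> shift


def secret_range(mu):
    """Range of a centered binomial sample built from mu random bits."""
    return (-(mu // 2), mu // 2)
```

For example, `secret_range(PARAMS["SABER"]["mu"])` gives (−4, 4), matching the range listed for SABER above.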

5.2.2 Falcon Overview
A lattice is a discrete subgroup L of some ℝ^n; the lattices considered here are full-rank. In other terms, a lattice is the set of integer linear combinations of the rows of a basis B ∈ ℝ^{n×n}. The Falcon signature algorithm consists of three steps, key generation, signature generation, and verification, which are described as follows:
• Key generation: In the first step of key generation, one needs to generate the polynomials f, g, F, G ∈ Z[x]/φ fulfilling the NTRU equation. In the next step, the Falcon tree T is constructed through the LDL∗ decomposition of the matrix G = BB∗. The output of key generation is a public key pk = h = g f^{−1} mod q and a secret key sk = (B̂, T).
• Signature generation: In the first part of the signature generation, a hash value c ∈ Zq[x]/φ of the message m and a salt r is computed. The short values s1, s2 such that s1 + s2 h = c mod q are computed from the hash value as well as the sk, the latter taking advantage of its knowledge of f, g, F, G and the ffSampling algorithm. A compressed version of s2, which also contains a random seed r, is generated as the signature. Sending only s2 as output is sufficient because s2, the hash c, and the public key h can reconstruct s1.
• Signature verification: The first step of signature verification repeats the hashing of m and r into the hash value c. This hashing is followed by recomputing s1 and checking whether ||(s1, s2)|| ≤ β is satisfied, β being a predefined acceptance bound.

5.3

Proposed Error Detection Techniques
In this section, we discuss the existing side-channel attacks on SABER and Falcon and present recomputing-based error detection schemes, which incur low overhead for SABER and Falcon architectures.

5.3.1 Fault Attacks and Threat Model
Fault injection can be defined as an active attack that aims to disrupt the cryptographic
operation processing sensitive data, and in turn, results in incorrect output revealing sensitive information [100]. As precise fault injections are getting more difficult because of the
shrinking geometry size of integrated chips, studies show that arbitrary injection of faults
can be utilized to exploit vulnerabilities instead [101]. Such fault attacks do not tamper with the combinational circuitry of digital systems; rather, they alter the sampling process of a flip-flop or decrease the clock period of a register, resulting in wrong output [102], [103]. This
injection can exploit the sampler of any signature algorithm, e.g., the Gaussian sampler of
the Falcon signature.
The ideal attack (which is not practical in general) would be to inject bit faults at the preferred location and cycle to gain the most information. While technological constraints may prevent an attacker from flipping exactly one bit, our fault model includes single as well as multiple stuck-at faults (stuck-at 0 and stuck-at 1). We inject single event upsets (SEU) and multiple upsets (MU) with a single-fault adversary, where the adversary can inject stuck-at faults at one or multiple positions in one execution of the operation. To that end, the fault model we chose requires minimal information on faulty and fault-free computations, resembling differential fault intensity analysis (DFIA) [104]. Although the Fujisaki-Okamoto (FO) transform applied in the encryption/decryption provides redundancy through re-encryption, it fails to detect the recent attack described in [105], which gathers linear inequalities on the key coefficients by observing the outcome of decapsulation after inserting an instruction-skipping fault. Our error detection schemes, combined with the FO transform, can prevent such attacks. Moreover, our suggested schemes, combined with masking, can protect against recent categories of fault attacks, i.e., persistent fault analysis [106] and statistical ineffective fault attack (SIFA) [101].

5.3.2 Proposed Error Detection Schemes on SABER
The binomial sampler (essential for random coefficient generation) and polynomial multiplication (both Toom-Cook and schoolbook multiplication) are essential to the operation of SABER; hence, their error detection schemes are crucial. We also explore error detection techniques for HW/SW codesign architectures, which are accelerated designs offering a short development cycle and high flexibility for the encapsulation and decapsulation operations.

5.3.2.1

Error Detection on Binomial Sampler

The binomial sampler computes a sample from a µ-bit pseudo-random input string, e.g., r[µ − 1 : 0], by computing HW(r[µ/2 − 1 : 0]) − HW(r[µ − 1 : µ/2]), where HW() stands for the Hamming weight (Figure 5.1). In SABER, the secret coefficients are drawn from a centered binomial distribution with the parameters µ = 10, 8, and 6 for LightSABER, SABER, and FireSABER, respectively. In Figure 5.1, a sample is represented as a 4-bit sign-and-magnitude number (a pair of a sign and an absolute value) in the implementation. For SABER, since µ = 8 divides the word length of the data memory, two 64-bit pseudo-random words are read from the memory and stored in a 128-bit buffer register; then 16 samples are generated in parallel and stored in a 64-bit output buffer register; finally, the output buffer is written to the data memory. In our architecture from Figure 5.1, we implement recomputing with swapped operands (RESwO) to detect faults in the binomial sampler. We introduce a multiplexer with the select Norm/RESwO, which runs the original operation in the Norm cycle and swaps the inputs of the subtractor in the RESwO cycle. For example, the subtractor output is (a − b) in the Norm cycle and (b − a) in the RESwO cycle. To detect faults, we compare the Norm and RESwO cycle outputs, which are the same in a fault-free scenario. To ensure that, we flip the sign bit of the 2’s complement so that the output is the 2’s complement of (a − b) in both cases. Figure 5.1 shows the error detection operation for µ bits, which is replicated 8 times for a 64-bit data memory output for SABER.

Figure 5.1: Error detection architecture on binomial sampler.

Algorithm 5.1 Schoolbook Polynomial Multiplication
Input: Two polynomials a(x) and b(x) ∈ Rq of degree N
Output: The product a(x).b(x) of degree N
1: acc(x) ← 0
2: for i = 0; i < N; i = i + 1 do
3:   for j = 0; j < N; j = j + 1 do
4:     acc[j] = acc[j] + b[j].a[i] mod q
5:   end for
6:   b = b.x mod Rq
7: end for
8: return acc
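The RESwO check on the binomial sampler can be modeled in a few lines. This is a behavioral sketch under stated assumptions, not the RTL: the sign of the swapped result is restored by an arithmetic negation, and `stuck_bit` is a hypothetical stuck-at-1 fault on one bit of subtractor port in1 that is present in both cycles.

```python
def hamming_weight(x):
    return bin(x).count("1")


def sample(r, mu=8, swap=False, stuck_bit=None):
    """One centered-binomial sample from the mu-bit string r.

    swap=False models the Norm cycle (in1 - in2); swap=True models the
    RESwO cycle, where the subtractor inputs are exchanged and the sign
    of the result is flipped back. stuck_bit forces one bit of port in1
    to 1 in whichever cycle is run.
    """
    half = mu // 2
    mask = (1 << half) - 1
    hw_low = hamming_weight(r & mask)
    hw_high = hamming_weight((r >> half) & mask)
    in1, in2 = (hw_high, hw_low) if swap else (hw_low, hw_high)
    if stuck_bit is not None:
        in1 |= 1 << stuck_bit
    diff = in1 - in2
    return -diff if swap else diff


def to_sign_magnitude(v):
    """4-bit {sign, magnitude} encoding of a sample in [-4, 4]."""
    return ((1 << 3) | -v) if v < 0 else v


def reswo_flags_error(r, mu=8, stuck_bit=None):
    """Comparator between the Norm and RESwO cycle outputs."""
    return sample(r, mu, False, stuck_bit) != sample(r, mu, True, stuck_bit)
```

In this model, whenever the stuck-at fault changes the Norm output, the swapped cycle produces a different value and the comparator flags it; inputs for which the two cycles still agree are exactly those where the fault is masked and the output is correct.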

5.3.2.2

Error Detection on Parallel Polynomial Multiplication

The Toom-Cook method proposed in the work of [91] can be used to split a polynomial multiplication of 256 coefficients into seven polynomial multiplications of 64 coefficients. Using such Toom-Cook multiplication, the total number of calls to schoolbook multiplication is 63 for a 256-coefficient multiplication, compared to 81 calls for the Karatsuba method. The polynomial multiplier architecture implements a parallelized version of the schoolbook multiplication described in Algorithm 5.1. To attain maximum parallelism in data read/write and to avoid memory-access bottlenecks, the entire secret polynomial s(x) is stored in a shift register (Figure 5.2), as all the bits of a register can be accessed simultaneously on a hardware platform. At the beginning of a polynomial multiplication, s(x) is read from the data memory (block RAM) and loaded into the shift register. As shown in Algorithm 5.1, only one coefficient of the other polynomial a(x) is required at a time to compute the scalar multiplication s(x) · a[i]. Hence, it is not necessary to store the entire a(x) polynomial. The coefficient selector block in Figure 5.2 provides the required coefficient of a(x) during the multiplication s(x) · a[i] by the parallel multiply-and-accumulate (MAC) cores, shown in the inset of Figure 5.2. After the multiplication s(x) · a[i], s(x) needs to be multiplied by x. This operation is a simple nega-cyclic left-shift operation that moves each coefficient from position i to position i + 1 and sends the last coefficient to the first position after a modular subtraction from zero. In this implementation, this is performed easily by flipping the sign of the 256-th coefficient, taking advantage of the sign-magnitude representation.

Figure 5.2: Proposed error detection architecture on polynomial multiplication with multiply-and-accumulate (MAC) unit construction.
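The schoolbook loop of Algorithm 5.1 together with the nega-cyclic shift described above can be sketched as follows. The sketch is sequential and keeps coefficients as residues modulo q rather than in sign-magnitude form, so it only mirrors the arithmetic, not the parallel MAC datapath; the default q = 2^13 is taken from the SABER parameters.

```python
def negacyclic_shift(s, q):
    """Multiply s(x) by x modulo x^n + 1: shift up, negate the wrapped term."""
    return [(-s[-1]) % q] + s[:-1]


def schoolbook_mul(a, s, q=1 << 13):
    """Product a(x) * s(x) in Z_q[x]/(x^n + 1), following Algorithm 5.1."""
    n = len(a)
    acc = [0] * n
    s = list(s)
    for i in range(n):
        # Scalar multiplication s(x) * a[i], accumulated coefficient-wise.
        for j in range(n):
            acc[j] = (acc[j] + s[j] * a[i]) % q
        # Prepare s(x) * x^(i+1) for the next coefficient of a(x).
        s = negacyclic_shift(s, q)
    return acc
```

As a small check, (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2 reduces to −5 + 10x when x^2 = −1, so `schoolbook_mul([1, 2], [3, 4])` yields `[8187, 10]` modulo 8192.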

Figure 5.3: Proposed error detection architecture on hardware accelerator.

5.3.2.3

Error Detection on HW/SW Codesign

The hardware/software codesign approach is an extensively researched technique that aims to achieve performance targets through a shorter development cycle than is typical for hardware-only implementations. The intention of hardware/software benchmarking is not to replace purely hardware benchmarking; rather, the aim is to ease the development of hardware-only implementations by researching hardware accelerators for the major operations. During the encapsulation of SABER, the only accelerated operations are those performed during encryption, i.e., SABER.PKE.Enc. The seed of SHAKE-128, i.e., s0, is used to generate the elements of the matrix A, with each element representing a polynomial, as shown in Figure 5.3. The sign-extended version of matrix A is used to generate b′ = (As′ + h) mod q, where h is a constant of the equation. Only one row of the A matrix is produced at once, and the elements of A are multiplied by the corresponding elements of s0, with a view to a shorter execution time and a smaller matrix memory. The registers on the right of the MAC in Figure 5.3 store the temporary results. The MAC construction is shown in the inset of Figure 5.2.
In our scheme, we apply RENO at both inputs of the MAC module in Figure 5.3. In the RENO cycle of the multiplexer select, the multiplication is recomputed with negated input operands, and the presence of faults is detected when the comparator flags a discrepancy with the Norm cycle output. Applying RENO does not increase the bus size; the inputs remain 13-bit, and hence the implementation is compatible with the existing architecture. We perform the modular negation by subtracting each MAC input from q. Our schemes can be applied to any modified version of the MAC core; thus, they are oblivious to the MAC architecture. As the SABER decapsulation stage utilizes the same mechanism, RENO can be applied there as well to detect fault injection.
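The RENO check on a single 13-bit MAC can be sketched as below. This is a simplified model: one MAC step with q = 2^13, modular negation implemented as q − x, and a hypothetical `fault` hook that corrupts the first multiplier input in both cycles.

```python
Q = 1 << 13  # 13-bit modulus of the MAC inputs


def mod_neg(x):
    """Modular negation by subtracting the input from q (stays 13-bit)."""
    return (Q - x) % Q


def mac(acc, a, b, fault=None):
    """One multiply-and-accumulate step modulo q; `fault` corrupts port A."""
    if fault is not None:
        a = fault(a)
    return (acc + a * b) % Q


def reno_flags_error(acc, a, b, fault=None):
    """Compare the Norm cycle with the cycle using negated operands."""
    norm = mac(acc, a, b, fault)
    reno = mac(acc, mod_neg(a), mod_neg(b), fault)
    return norm != reno


def stuck_at_0_lsb(x):
    """Hypothetical permanent fault: bit 0 of port A stuck at 0."""
    return x & ~1
```

Since (q − a)(q − b) is congruent to ab modulo q, the fault-free cycles agree without any decoding, and the operands never exceed 13 bits.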

5.3.3 Error Detection Schemes on Falcon Sampler
We apply the schemes to the non-constant-time Gaussian sampler, which is prone to fault attacks and hence requires additional countermeasures. Our error detection approaches are also applicable to constant-time Gaussian samplers.
From Section 5.2, we recall that the ffSampling algorithm is the basis to generate secret key sk for signature generation. As shown in Algorithm 5.2, the Falcon tree generation, i.e., line 7 stating the LDL∗ decomposition of matrix G, and the ffSampling are combined in one algorithm, namely, ffsampling∗n. Such a combination reduces the memory consumption significantly compared to the reference Falcon implementation. Here we note that the three functions of Algorithm 5.2, i.e., ffSampling, splitfft, and mergefft, consist of linear elementary operations: addition, subtraction, multiplication, and division; hence, we can apply linear encoding and decoding schemes without any loss of information.

Algorithm 5.2 ffsampling∗n (t, G)
Input: t = (t0, t1) ∈ FFT(Q[x]/(x^n + 1))^2 and a full-rank Gram matrix G ∈ FFT(Q[x]/(x^n + 1))^{2×2}, σ ∈ 1.55√q
Output: z = (z0, z1) ∈ FFT(Z[x]/(x^n + 1))^2
1: if (n = 1) then
2:   σ′ ← σ√G00
3:   z0 ← D_{Z,t0,σ′}
4:   z1 ← D_{Z,t1,σ′}
5:   return z = (z0, z1)
6: end if
7: L, D ← LDL∗(G)
8: d10, d11 ← splitfft2(D11)
9: t1 ← splitfft2(t1)
10: G1 ← [[d10, d11], [x·d11, d10]]
11: z1 ← ffsamplingn/2(t1, G1)
12: z1 ← mergefft2(z1)
13: t′0 ← t0 + (t1 − z1) ⊙ L10
14: d00, d01 ← splitfft2(D00)
15: t0 ← splitfft2(t′0)
16: G0 ← [[d00, d01], [x·d01, d00]]
17: z0 ← ffsamplingn/2(t0, G0)
18: z0 ← mergefft2(z0)
19: return z = (z0, z1)

5.3.3.1

Recomputing on Negation

Algorithm 5.2 can be partially depicted (lines 11 through 13) by Figure 5.4, with the multiplexer select Norm/RENO being at Norm, i.e., the unmodified operation of ffsampling∗n. In the original operation of line 13, the output of ffsampling∗n, z1, is subtracted from t1. In our encoded scheme, we perform RENO during the RENO cycle of the multiplexer, where we negate both t1 and z1 and perform the subtraction of −z1 from −t1, resulting in out1 = (t1 − z1) in a fault-free scenario, which is consistent with the Norm cycle output. However, in a faulty scenario, the outputs of the Norm and RENO cycles will be discrepant, which will be flagged by the comparator comparing the RENO output with out1, detecting the presence of faults. We note that decoding in this scheme is free of hardware cost; hence, it is a low-overhead and inexpensive fault detection approach.

Figure 5.4: Proposed recomputing with negated operands (RENO) on negation of key generation in Falcon

5.3.3.2

RESwO on Multiplication

In line 13 of Algorithm 5.2, (t1 − z1 ), is multiplied with the left child of LDL∗ output
L10 . In our scheme, we perform this unmodified operation during the Norm cycle of the
multiplexer. For the recomputed operation, we perform recomputing with swapped operands
(RESwO), where the multiplication operands L10 and (t1 − z1 ) are swapped and stored in
out2 , as shown in Figure 5.5. Any discrepancy between the Norm and RESwO rounds is
flagged by the comparator comparing L10 ⊙ (t1 − z1 ), and out2 . RESwO scheme also requires
no decoding, making it a cost-effective fault detection mechanism.
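The RESwO check on the element-wise multiplication can be sketched as follows. The sketch uses plain integers as stand-ins for the FFT-domain values (an assumption made so that the arithmetic is exact), and `fault` is a hypothetical stuck-at fault on the first port of the multiplier that is present in both cycles.

```python
def elementwise_mul(port_a, port_b, fault=None):
    """Element-wise product; `fault` corrupts every value entering port A."""
    if fault is not None:
        port_a = [fault(v) for v in port_a]
    return [x * y for x, y in zip(port_a, port_b)]


def reswo_flags_error(l10, diff, fault=None):
    """Norm cycle feeds (L10, t1 - z1); RESwO cycle feeds (t1 - z1, L10)."""
    norm = elementwise_mul(l10, diff, fault)
    out2 = elementwise_mul(diff, l10, fault)
    return norm != out2


def stuck_at_1_lsb(v):
    """Hypothetical permanent fault: bit 0 of the port stuck at 1."""
    return v | 1
```

Because multiplication is commutative, the swapped cycle needs no decoding, while a port-local fault acts on different data in the two cycles and therefore produces diverging outputs.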

Figure 5.5: Proposed recomputing with swapped operands (RESwO) on multiplication of key generation in Falcon

5.3.3.3

RENO on Multiplication

One can also explore negation on the aforementioned multiplicands. In such a case, as depicted in Figure 5.6, the Norm cycle performs L10 ⊙ out1, where out1 = (t1 − z1). During our proposed RENO cycle, in contrast, the architecture negates both operands, resulting in out2 = (−L10) ⊙ (−out1). In a fault-free scenario, the RENO output should match that of the Norm cycle; any deviation is captured by the comparator comparing out2 and L10 ⊙ out1. Similar to the case of RENO on negation, RENO on multiplication requires no decoding, as negating both operands of a multiplication leaves the output unchanged.
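
To see why the recomputation uses negated operands rather than a plain repetition, consider the sketch below. It is an assumption-laden toy model: scalars replace FFT-domain vectors, and the permanent fault is a hypothetical stuck-at-0 on the least significant bit of the first multiplier input.

```python
def multiplier(a, b, stuck=False):
    """Multiplier whose first input may have a permanent
    stuck-at-0 fault on its LSB (hypothetical fault model)."""
    if stuck:
        a &= ~1  # force the LSB of the first input to 0
    return a * b


def reno_multiply(l10, out1, stuck=False):
    """Return (norm, flag) for a Norm cycle and a RENO cycle."""
    norm = multiplier(l10, out1, stuck)    # Norm: L10 * out1
    out2 = multiplier(-l10, -out1, stuck)  # RENO: (-L10) * (-out1)
    # Negating both operands preserves the product, so no decoding.
    return norm, norm != out2
```

With `l10 = 3`, `out1 = 5`, and the stuck-at fault active, the Norm cycle yields 10 and the RENO cycle yields 20, so the flag is raised; repeating the Norm computation unchanged would yield 10 twice and miss the fault. With `l10 = 4` the stuck bit is already 0, the result is correct, and no flag is needed.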

Figure 5.6: RENO on multiplication of key generation in Falcon

5.3.3.4 RENO on Multiplication-and-Accumulator (MAC)

Instead of applying error detection on either the multiplication or the subtraction of line 13 in Algorithm 5.2, one can perform error detection on the overall multiplication-and-accumulator circuitry. We propose RENO for the MAC of line 13, where the Norm cycle of the multiplexer results in t′0, according to Algorithm 5.2. In our proposed RENO operation, we negate both L10 and t0, as shown in Figure 5.7, resulting in the encoded output (−L10 ⊙ out1) − t0, where out1 = (t1 − z1). We decode this encoded output by negating the MAC output again, providing out2 = −(−L10 ⊙ out1 − t0), which should be identical to t′0 in a fault-free scenario; the comparator flags any inconsistency between the two. The additional decoding circuit makes this scheme somewhat more expensive than the previously mentioned schemes, which require no decoding; however, if one wishes to perform error detection on the entire MAC, RENO is a viable choice.
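
A minimal Python sketch of the encode, compute, and decode steps for the MAC is given next. It assumes scalar integers instead of FFT-domain vectors and an XOR mask as a hypothetical transient-fault model; the function names are illustrative.

```python
def mac(acc, a, b, fault_mask=0):
    """Multiply-and-accumulate acc + a * b; fault_mask flips result
    bits (hypothetical transient-fault model)."""
    return (acc + a * b) ^ fault_mask


def reno_mac(t0, l10, out1, fault_norm=0, fault_reno=0):
    """Return (t0_prime, flag) for the Norm and RENO cycles."""
    # Norm cycle: t0' = t0 + L10 * out1.
    t0_prime = mac(t0, l10, out1, fault_norm)
    # RENO cycle: L10 and t0 are negated, giving (-L10 * out1) - t0.
    encoded = mac(-t0, -l10, out1, fault_reno)
    # Decoding: negate the MAC output once more.
    out2 = -encoded
    return t0_prime, t0_prime != out2
```

Unlike the previous two schemes, the comparison here is only meaningful after the final negation, which corresponds to the extra decoding circuit mentioned in the text.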

Figure 5.7: RENO on multiplication-and-accumulator (MAC) module of Falcon

5.3.3.5 RENO on Overall ffsampling∗n

We finally propose an error detection scheme that operates on the inputs of the entire Algorithm 5.2 and performs RENO on its operands, as depicted in Figure 5.8. While the multiplexer select is at Norm, the unmodified function of ffsampling∗n is performed. In our proposed RENO scheme, on the other hand, we negate t0, t1, and z1, the output of ffsampling∗n. Therefore, after the subtraction and MAC operations, the encoded output becomes −t0 + (−t1 − (−z1)) ⊙ L10. To decode the encoded output and obtain out4, we negate it again, which, in a fault-free scenario, results in t0 + (t1 − z1) ⊙ L10, i.e., t′0. The comparator flags any discrepancy between the Norm and RENO rounds.

We conclude by noting that a non-constant-time Gaussian sampler can easily fall victim to timing attacks as well as fault attacks. Nevertheless, such non-constant-time Falcon approaches are heavily researched and popular for microcontroller-based platforms. Our proposed error detection schemes are low-overhead while ensuring high error detection in such faulty situations, and they can be added to already compact Falcon implementations.
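
The overall RENO scheme of this subsection can be checked numerically with the sketch below. It is not the hardware: it assumes scalar integers, takes z1 as a given input rather than calling a sampler, and models transient faults with a hypothetical XOR mask.

```python
def sub_then_mac(t0, t1, z1, l10, fault_mask=0):
    """Subtraction followed by MAC: t0 + (t1 - z1) * l10.
    fault_mask flips result bits (hypothetical fault model)."""
    return (t0 + (t1 - z1) * l10) ^ fault_mask


def reno_overall(t0, t1, z1, l10, fault_norm=0, fault_reno=0):
    """Return (t0_prime, flag) for the Norm and RENO rounds."""
    # Norm round: t0' = t0 + (t1 - z1) * L10.
    t0_prime = sub_then_mac(t0, t1, z1, l10, fault_norm)
    # RENO round: t0, t1 and z1 are negated, L10 is left as is,
    # giving -t0 + (-t1 - (-z1)) * L10 = -t0'.
    encoded = sub_then_mac(-t0, -t1, -z1, l10, fault_reno)
    # Decoding by negation restores t0 + (t1 - z1) * L10.
    out4 = -encoded
    return t0_prime, t0_prime != out4
```

For instance, with t0 = 1, t1 = 7, z1 = 3, and L10 = 2, both rounds give 9 and the flag stays low.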

Figure 5.8: RENO on the overall ffsampling∗n.

5.3.4 Implementation of Constant-time Falcon Sampler
Because Falcon is a fairly new scheme, its resilience against fault attacks has not been analyzed thoroughly. While active attacks on Falcon are not yet known, incorporating a non-constant-time Gaussian sampler can seriously affect the security of the scheme; thus, it should be replaced with a constant-time Gaussian sampler.

5.3.4.1 ModFalcon Implementation and Error Detection

ModFalcon, a new variant of signature schemes based on the Falcon design, is built on module lattices. This variant retains both the compactness and the efficiency of Falcon and achieves a highly compact lattice-based signature with 128-bit quantum-level security. It generalizes the instantiation of the hash-and-sign algorithm to NTRU lattices of larger module rank, hence broadening the parameter set of the Falcon design to a much wider range.
As shown in Algorithm 5.3, the pair (r, S) is the signature, where r is a hashing salt and S is an encoding of a short vector s such that s · vk = H(r||msg). After H(r||msg) is computed, the secret key B_{F,g} is used to sample a proper s. Line 4 of Algorithm 5.3 can be partially depicted by Figure 5.9. One can compute z via the parallel computations of Falconsig.
Figure 5.9: Signature scheme of ModFalcon architecture.

Algorithm 5.3 Signature: (sk, msg) → (r, S)
Require: A standard deviation parameter σ
1: Get r ← U({0, 1}^{λ_r})
2: µ ← H(r||msg) ∈ R_q and let c = (µ, 0, ..., 0)
3: Compute t = c · B_{F,g}^{-1}
4: Compute z ∈ R^{n+1} such that s := (t − z) · B_{F,g}
5: S = Compress(s)
6: return the signature (r, S)

Even constant-time Falcon can be vulnerable to fault attacks; hence, Figure 5.9 can be modified to incorporate error detection schemes. With the multiplexer select Norm/RENO at Norm, the unmodified operation of the ModFalcon signature scheme is performed. In the original operation of line 4, z, the output of the signature scheme, is subtracted from t. In our encoded scheme, we perform RENO during the RENO cycle of the multiplexer, where we negate both t and z, and perform subtraction of −z from −t, resulting in out1 = (t − z) in a fault-free scenario, which is consistent with the Norm cycle output. In a faulty scenario, however, the outputs of the Norm and RENO cycles will differ, and the comparator that compares the RENO output with out1 raises the flag, detecting the presence of faults. We note that decoding in this scheme is free of hardware cost; hence, it is a low-overhead and inexpensive fault detection approach.

Algorithm 5.4 SamplerZ(σ, µ)
Require: µ ∈ [0, 1), σ ≤ σ_0, and a scaling factor C = C(σ) ∈ (0, 1]
Ensure: z ∼ D_{Z,σ,µ}
1: while true do
2:   z_0 ← BaseSampler()
3:   b ← {0, 1} uniformly
4:   z ← (2b − 1) · z_0 + b
5:   x ← (z − µ)^2 / (2σ^2) − z_0^2 / (2σ_0^2)
6:   if BerExp_{C(σ)}(x) then
7:     return z
8:   end if
9: end while

5.3.4.2 SamplerZ Implementation and Error Detection

The constant-time sampler, formally described in Algorithm 5.4, uses BaseSampler to generate a sample z0. It then samples a random bit b and computes z = (2b − 1) · z0 + b. Finally, it calls BerExpC(σ)(x) to determine whether z is returned or rejected, and starts again if necessary.
We explore fault detection to thwart fault attacks on SamplerZ. One can apply negation to line 4 of Algorithm 5.4. In that case, the Norm cycle performs the aforementioned computation of z, whereas during our proposed RENO cycle the architecture negates both multiplication operands, resulting in out2 = {−(2b − 1) · (−z0)} + b. In a fault-free scenario, the RENO output should match the Norm cycle output; any deviation is captured by the comparator comparing out2 and z. Similar to the case of RENO on negation, RENO on multiplication requires no decoding, as negating both operands of a multiplication leaves the output unchanged.
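
The identity behind RENO on line 4 of Algorithm 5.4 can be verified with a short Python sketch. The code is illustrative only: z0 and b are passed in directly instead of being sampled, the XOR mask is a hypothetical transient-fault model, and the function names are not taken from any Falcon implementation.

```python
def line4(sign, z0, b, fault_mask=0):
    """Datapath of line 4: sign * z0 + b; fault_mask flips result bits
    (hypothetical transient-fault model)."""
    return (sign * z0 + b) ^ fault_mask


def reno_line4(z0, b, fault_norm=0, fault_reno=0):
    """Return (z, flag) for the Norm and RENO cycles of line 4."""
    # Norm cycle: z = (2b - 1) * z0 + b.
    z = line4(2 * b - 1, z0, b, fault_norm)
    # RENO cycle: both multiplicands are negated, b is added unchanged.
    out2 = line4(-(2 * b - 1), -z0, b, fault_reno)
    return z, z != out2


def identity_holds(max_z0=18):
    """Check that the fault-free flag stays low for every small z0 and b."""
    return all(reno_line4(z0, b) == ((2 * b - 1) * z0 + b, False)
               for z0 in range(max_z0 + 1) for b in (0, 1))
```

The bound `max_z0 = 18` is an arbitrary choice for the exhaustive check and is not a parameter taken from the text.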

5.4 Error Coverage and FPGA Implementations
This section presents the results of our FPGA assessments using Xilinx Vivado, with VHDL as the design entry, on the Zynq-UltraScale+ ZCU102 FPGA family (device xczu9eg-ffvb1156-2-e). We assess the overhead of the proposed constructions for the case study of the proposed RESwO and RENO schemes in the SABER encapsulation algorithm as well as in the hardware accelerator, as shown in Table 5.1.

5.4.1 Fault Simulation
We have simulated the error coverage of our proposed work, with VHDL as the design entry, by injecting three types of stuck-at faults, i.e., (a) single-bit, (b) two-bit, and (c) multiple-bit faults, for 200,000 cases. All faults are injected at the input state of the parallel polynomial multiplication algorithm, for both permanent and transient faults. In each case, we observed high error detection rates (99.9975%) for both permanent and transient faults when our schemes are incorporated. For example, for single-bit stuck-at-0 faults, we inserted faults at the LSB of both inputs of the polynomial multiplication architecture of SABER, using a logical AND operation between the faulty bit and logical 0. We injected two-bit and multi-bit (6-bit) faults similarly, for a total of 200,000 instances. After the simulation, the error flags were high for 199,995 cases, demonstrating the presence of faults. We calculated the fault detection ratio as

\[ \text{fault detection ratio} = \frac{\text{faults detected}}{\text{faults injected}}, \]

which in our case resulted in 199,995/200,000 = 99.9975%. To be conservative in reporting the error coverage over the entire architecture, one also needs to consider faults affecting the comparator unit. Among the different fault-tolerant techniques, a comparator using modular redundancy is one solution for a compromised comparator or voter circuit.
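
The stuck-at-0 injection and the coverage figure reported above can be reproduced with the following Python sketch. It only mirrors the arithmetic of the text; the helper names are hypothetical and the code is not the VHDL testbench used for the simulations.

```python
def stuck_at_zero(value, bit):
    """Force one bit of value to 0, i.e., AND that bit with logical 0."""
    return value & ~(1 << bit)


def detection_ratio_percent(detected, injected):
    """Fault detection ratio, faults detected / faults injected, in percent."""
    return 100.0 * detected / injected
```

For example, `stuck_at_zero(0b1011, 0)` gives `0b1010`, and `detection_ratio_percent(199_995, 200_000)` evaluates to 99.9975, the coverage reported in this section.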

5.4.2 FPGA Implementations
We benchmark the RESwO scheme for the binomial sampler and the RENO schemes for both the parallel polynomial multiplier and the HW/SW accelerator, together with the respective original architectures. For all cases, we tabulate the lookup table (LUT) and flip-flop (FF) counts as area overhead, as well as the delay and power overheads, in Table 5.1; all of them are within an acceptable range. The error detection schemes applied to binomial sampling and polynomial multiplication each incur approximately 18% LUT overhead, whereas the RENO incorporated in the HW/SW accelerator adds 22.59% overhead for LUTs. In FFs, on the contrary, the RESwO of the binomial sampler and the RENO of the HW/SW accelerator show a lower overhead (19.32% and 17.66%, respectively) than the 22.72% overhead for the RENO of the polynomial multiplier. In terms of power, RESwO adds the least overhead (6.88%) compared to both RENO architectures. The delay overhead of the RENO on the polynomial multiplier is the lowest at 11.42%, although the RESwO overhead is acceptable at 15.76%. Thus, we can conclude that RESwO generally results in a lower percent overhead than the RENO models, most clearly in power, due to the simplicity of the RESwO architecture.

As this is the first work on implementing error detection for the SABER architecture as well as its HW/SW codesign, there is no previously published architecture against which to compare our performance and overhead metrics. In some previous works on fault detection of post-quantum architectures [33, 56], recomputing has been utilized to detect faults in the number-theoretic transform and ring polynomial multiplication, respectively, two integral components of lattice-based cryptosystems. The implementation overheads in [33] are 20%, 6%, and 16%, on average, for area, delay, and power, respectively. The corresponding metrics for the error detection in [56] are 19.6%, 13.5%, and 15.1% for area, delay, and power overhead, respectively. Our error detection overheads align with those of the previous works, demonstrating the efficiency and low cost of our implementations.
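
As a sanity check, the percentages in parentheses in Table 5.1 follow from the raw figures as (protected − original) / original. The helper below is a hypothetical convenience function written for this illustration, not part of the benchmarking flow.

```python
def percent_overhead(original, protected):
    """Relative overhead of the protected design over the original, in percent."""
    return 100.0 * (protected - original) / original
```

With the values of Table 5.1, `percent_overhead(85, 100)` is about 17.65 (reported as 17.64%), `percent_overhead(5171, 6346)` is about 22.72, and `percent_overhead(2.03, 2.35)` is about 15.76, matching the tabulated LUT, FF, and delay overheads up to rounding.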
Implementing lattice-based signatures is difficult with either the high-speed or the lightweight approach, which explains the lack of literature on hardware or hardware/software implementations of Falcon and of other post-quantum schemes, e.g., LUOV, HQC, and NTS-KEM [107]. However, as recomputing is an efficient scheme, we expect low overheads for Falcon similar to our derived results for SABER.

Table 5.1: Implementation results on the Xilinx Zynq-UltraScale+ ZCU102 FPGA (xczu9eg-ffvb1156-2-e) for binomial sampling, polynomial multiplication, and hardware/software codesign. All inputs are 256 bits, and the parentheses give the percent overhead compared to the original architecture.5

Architecture                 Scheme     Area (LUT)        Area (FF)        Delay (ns)       Power (mW)
Binomial Sampling            Original   85                88               2.03             0.697
                             RESwO      100 (17.64%)      105 (19.32%)     2.35 (15.76%)    0.745 (6.88%)
Polynomial Multiplication    Original   17,352            5,171            2.959            1.724
                             RENO       20,420 (17.68%)   6,346 (22.72%)   3.297 (11.42%)   1.908 (10.67%)
Hardware/software codesign   Original   14,277            1,025            3.764            2.097
                             RENO       17,502 (22.59%)   1,206 (17.66%)   4.508 (19.77%)   2.295 (9.4%)

In the absence of any compensation, the total time of recomputing architectures that do not embed throughput alleviation approaches will be twice that of the original. Subpipelining is the solution to alleviate this drastic decline in throughput. By inserting registers at proper locations, the timing paths can be broken into approximately equal halves, which increases the operating frequency and, in turn, brings the throughput of the recomputed architecture close to that of the original architecture. The slight area overhead of adding subpipelining is a reasonable trade-off for the low throughput degradation.
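
The throughput argument for subpipelining can be made concrete with an idealized model. The sketch assumes that the datapath splits into equal stages, that the inserted registers add no delay, and that the Norm and recomputed passes are interleaved so that the pipeline stays full; these assumptions and the function name are illustrative and are not measurements from this work.

```python
def time_per_checked_result_ns(delay_ns, stages=1):
    """Steady-state time to produce one checked result, which needs two
    passes (Norm and recomputed) through a datapath of the given delay
    split into `stages` equal sub-stages (idealized, no register delay)."""
    clock_period_ns = delay_ns / stages
    # A full pipeline completes one pass per clock cycle.
    return 2 * clock_period_ns
```

Taking the 3.297 ns delay of the RENO polynomial multiplier from Table 5.1, the model gives 6.594 ns per checked result without subpipelining and 3.297 ns with two sub-stages, which is close to the 2.959 ns of the original architecture.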
In conclusion, we would like to note that the proposed architectures are oblivious of the FPGA fabric and hardware platform. As a result, implementing the schemes on application-specific integrated circuits (ASICs) is expected to provide similar results. Moreover, adding pipelines to the architectures will improve the efficiency and throughput, at the cost of increased hardware overhead.
5 SABER polynomial degree N = 256, moduli q = 2^13 and p = 2^10, module dimension 3, and the secrets are sampled from [−4, 4].

5.5 Conclusions
We present error detection schemes for SABER on fully hardware constructions and hardware/software codesign accelerators. Moreover, we propose error detection schemes for the post-quantum signature scheme Falcon, its compact variant ModFalcon, and the Gaussian sampler, a crucial element of the Falcon signature scheme. Our recomputing-based error detection schemes incur low overheads with high error coverage on these two state-of-the-art NIST PQC finalists. We achieve a high error coverage of 99.9975% on average with our recomputing schemes. Moreover, the area (LUT), delay, and power overheads are 22.59%, 19.77%, and 10.67%, respectively, in the worst-case scenario. The proposed architectures are implemented on the Zynq-UltraScale+ FPGA family and show acceptable area, power, and delay overheads.


Chapter 6: Conclusion

With the advent of quantum computers, problems that are computationally infeasible today can be solved efficiently by exploiting physical properties of matter and energy. Classical cryptosystems are based on mathematical problems that would take more than a human lifetime to solve. The exponential speed-up of quantum computation will render these classical cryptosystems useless, as the encryption could be broken in mere minutes, resulting in a drastic failure of privacy preservation and data security. Thus, encryption schemes need to be developed to protect us against quantum attacks, because many experts predict that within 20 years, quantum computers will be able to break the current encryption infrastructures. This dissertation focused on developing countermeasures against side-channel attacks on post-quantum architectures, which protect these cryptosystems against adversaries, ensuring secure data for all. In addition, our architectures can also detect natural faults caused by device malfunctions, which is crucial to the proper functioning of sensitive medical applications, e.g., pacemakers, ring heart rate monitors, and Bluetooth-based ECG monitors.
We present error detection schemes for various lattice-based encryption and key generation schemes. Our recomputing-based error detection schemes incur low overheads with high error coverage on these state-of-the-art NIST PQC finalists. The proposed architectures are implemented on FPGA or ASIC platforms and show acceptable area, power, and delay overheads. These approaches add very little hardware overhead, which makes them advantageous for deeply-embedded systems. We have benchmarked the proposed architectures to assess their ability to detect transient and permanent faults. The proposed architectures are oblivious of the standard-cell library and hardware platform. Therefore, we expect similar overhead results across different FPGA families and ASIC libraries.
As a future extension to this dissertation, our proposed countermeasures for PQC can be implemented on deeply-embedded architectures, for instance, implantable and wearable medical devices, to assess the deployment challenges. Preventing threats that exploit hardware vulnerabilities, as well as cyber-attacks based on side-channel attacks, on both classical and post-quantum cryptosystems is indispensable to data protection and to the correct operation of deeply-embedded architectures. Moreover, one can consider combined fault and power analysis attacks and countermeasures on PQC, a challenging extension to this dissertation that has not received considerable attention in the open literature.


References

[1] Z. Ling, J. Luo, Y. Xu, C. Gao, K. Wu, and X. Fu. Security vulnerabilities of Internet
of Things: A case study of the smart plug system. IEEE IoT J., 4(6):1899–1909,
December 2017.
[2] G. Hernandez, O. Arias, D. Buentello, and Y. Jin. Smart nest thermostat: A smart
spy in your home. In Proc. Black Hat USA, pages 1–8, 2014.
[3] E. Ronen and A. Shamir. Extended functionality attacks on IoT devices: The case of
smart lights. In Proc. IEEE Eur. Symp. Security Privacy, pages 3–12, 2016.
[4] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. B. Robshaw,
Y. Seurin, and C. Vikkelsoe. PRESENT: An ultra-lightweight block cipher. In Cryptographic Hardware and Embedded Systems - CHES 2007, pages 450–466. Springer Berlin
Heidelberg.
[5] T. Eisenbarth, T. Kasper, A. Moradi, C. Paar, M. Salmasizadeh, and M. T. M. Shalmani. On the power of power analysis in the real world: A complete break of the
KeeLoq code hopping scheme. pages 203–220, August 2008.
[6] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures
and public-key cryptosystems. Commun. ACM, 21(2):120–126, 1978.
[7] N. Koblitz. Elliptic curve cryptosystems. Math. Comput., 48(177), 1987.
[8] Z. Khan and M. Benaissa. Low area ECC implementation on FPGA. IEEE Int. Conf.
Electron. Circuits, and Syst. (ICECS), pages 581–584, 2013.


[9] L. Batina, N. Mentens, K. Sakiyama, B. Preneel, and I. Verbauwhede. Low-cost
elliptic curve cryptography for wireless sensor networks. In Proc. Third European
Conf. Security and Privacy in Ad-Hoc and Sensor Networks, pages 6–17, 2006.
[10] T. Eisenbarth, T. Güneysu, S. Heyse, and C. Paar. MicroEliece: McEliece for embedded
devices. In Proc. 11th Int. Workshop Cryptogr. Hardware Embedded Syst. (CHES),
2009.
[11] P. Shor. Algorithms for quantum computation: Discrete logarithms and factoring. In
Proc. 35th Annu. Symp. Foundations of Comput. Science, pages 134–134, 1994.
[12] J. Buchmann, A. May, and U. Vollmer. Perspectives for cryptographic long-term
security. Commun. ACM, 49(9):50–55, September 2006.
[13] D. Moody. Post-quantum cryptography: NIST’s plan for the future. In The Seventh
Int. Conf. on Post-Quantum Cryptography, 2008.
[14] S. Suhail, R. Hussain, A. Khan, and C. Hong. On the role of hash-based signatures
in quantum-safe Internet of things: Current solutions and future directions. IEEE
Internet of Things Journal, 2020.
[15] J. Ding and D. Schmidt. Rainbow, a new multivariable polynomial signature scheme.
In Proc. of Third Int. Conf. on Appl. Cryptography Network Security, pages 164–175,
Berlin, Heidelberg, 2005.
[16] D. Bernstein, T. Chou, T. Lange, I. von Maurich, R. Misoczki, R. Niederhagen, E. Persichetti, C. Peters, P. Schwabe, N. Sendrier, et al. Classic McEliece: Conservative
code-based cryptography. NIST submissions, 2017.
[17] B. Koziel, R. Azarderakhsh, and M. Mozaffari Kermani. A high-performance and
scalable hardware architecture for isogeny-based cryptography. IEEE Trans. Comput.,
67(11):1594–1609, 2018.

[18] V. V. Lyubashevsky, C. Peikert, and O. Regev. On ideal lattices and learning with
errors over rings. J. ACM, 60(6):1–35, November 2013.
[19] V. Lyubashevsky. Lattice signatures without trapdoors. In Advances in Cryptology –
EUROCRYPT 2012, pages 738–755. Springer Berlin Heidelberg, 2012.
[20] V. Lyubashevsky. Lattice-based identification schemes secure under active attacks. In
Proc. Int. Conf. Public Key Cryptography, pages 162–179, 2008.
[21] O. Regev. On lattices, learning with errors, random linear codes, cryptography. In
Proc. Annu. ACM Symp. Theory Comput., pages 84–93, 2005.
[22] F-X. Standaert. Introduction to side-channel attacks. In Secure Integrated Circuits
and Systems, pages 27–42. Springer US, Boston, MA, 2010.
[23] D. Boneh, R. A. DeMillo, and R. J. Lipton. On the importance of checking cryptographic protocols for faults. In Advances in Cryptology, EUROCRYPT ’97, pages
37–51, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.
[24] M. C. Hsueh, T. K. Tsai, and R. K. Iyer. Fault injection techniques and tools. Computer, 30(4):75–82, 1997.
[25] F. Valencia, T. Oder, T. Guneysu, and F. Regazzoni. Exploring the vulnerability of
R-LWE encryption to fault attacks. In Proc. CS2, pages 7–12, 2018.
[26] N. Bindel, J. Buchmann, and J. Krämer. Lattice-based signature schemes and their
sensitivity to fault attacks. In Proc. IEEE Workshop Fault Diagn. Tolerance Cryptography (FDTC), pages 63–77, 2016.
[27] M. Mozaffari Kermani, M. Zhang, A. Raghunathan, and N. K. Jha. Emerging frontiers
in embedded security. In 2013 26th Int. Conf. on VLSI Design and 2013 12th Int. Conf.
on Embedded Syst., pages 203–208, 2013.


[28] M. Mozaffari Kermani, E. Savas, and S. J. Upadhyaya. Guest editorial: Introduction to
the special issue on emerging security trends for deeply-embedded computing systems.
IEEE Trans. Emerg. Topics Comput., 4(3):318–320, 2016.
[29] M. Mozaffari Kermani and A. Reyhani-Masoleh. Concurrent structure-independent
fault detection schemes for the advanced encryption standard. IEEE Trans. Comput.,
59(5):608–622, May 2010.
[30] M. Mozaffari Kermani and R. Azarderakhsh. Efficient fault diagnosis schemes for
reliable lightweight cryptographic ISO/IEC standard CLEFIA benchmarked on ASIC
and FPGA. IEEE Trans. on Ind. Electron., 60(12):5925–5932, December 2013.
[31] P. Maistri and R. Leveugle. Double-data-rate computation as a countermeasure against
fault analysis. IEEE Tran. Comput., 57(11):1528–1539, August 2008.
[32] G. Canivet, P. Maistri, Régis Leveugle, J. Clédière, F. Valette, and M. Renaudin.
Glitch and laser fault attacks onto a secure AES implementation on a SRAM-based
FPGA. Journal of Cryptology, 24:247–268, 2010.
[33] ©2019 IEEE. Reprinted with the permission of A. Sarker, M. Mozaffari Kermani,
and R. Azarderakhsh. Hardware constructions for error detection of number-theoretic
transform utilized in secure cryptographic architectures. IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., 27(3):738–741, 2019.
[34] J. M. Pollard. The fast Fourier transform in a finite field. Math. Comput., 25:365–374,
1971.
[35] D. Micciancio and O. Regev. Lattice-based cryptography. In Proc. Post-Quantum
Cryptography, pages 147–191, Berlin, 2009. Springer.
[36] D. Micciancio. Generalized compact knapsacks, cyclic lattices, and efficient one-way
functions. Computational Complexity, 16(2):365–411, 2007.

[37] V. Lyubashevsky and D. Micciancio. Generalized compact knapsacks are collision
resistant. Automata, Languages and Programming, pages 144–155, 2006.
[38] C. Gentry. Fully homomorphic encryption using ideal lattices. In Proc. Annu. ACM
Symp. Theory Comput., pages 169–178, 2009.
[39] K. Lauter, M. Naehrig, and V. Vaikuntanathan. Can homomorphic encryption be
practical? In Proc. ACM Workshop Cloud Comput. Security, pages 113–124, 2011.
[40] J. Cooley and J. Tukey. An algorithm for the machine computation of complex Fourier
series. Math. Comput., 19(90):297–301, 1965.
[41] V. Lyubashevsky. Fiat-Shamir with aborts: Applications to lattice and factoring-based
signatures. In Proc. Advances in Cryptology ASIACRYPT, pages 598–61, 2009.
[42] V. Lyubashevsky, D. Micciancio, C. Peikert, and A. Rosen. Swifft: A modest proposal
for FFT hashing. In Proc. Fast Software Encryption, pages 54–72, 2008.
[43] T. Poppelmann and T. Guneysu. Towards efficient arithmetic for lattice-based cryptography on reconfigurable hardware. In Proc. Progress in Cryptology LATINCRYPT,
pages 139–158, 2012.
[44] D. D. Chen, N. Mentens, F. Vercauteren, S. S. Roy, R. C. C. Cheung, D. Pao, and
I. Verbauwhede. High-speed polynomial multiplication architecture for ring-LWE and
SHE cryptosystems. IEEE Trans. Circuits Syst. I: Reg, 62(1):157–166, January 2015.
[45] C. P. Renterı́a-Mejı́a and J. Velasco-Medina. High-throughput ring-LWE cryptoprocessors. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 25(8):2332–2345, 2017.
[46] T. Oder, T. Schneider, T. Poppelmann, and T. Guneysu. Practical CCA2-secure
and masked ring-LWE implementation. IACR Trans. Cryptographic Hardware and
Embedded Systems, pages 142–174, 2018.


[47] X. Guo, D. Mukhopadhyay, C. Jin, and R. Karri. Security analysis of concurrent error
detection against differential fault analysis. J. Crypto. Eng., 5(3):153–169, 2015.
[48] M. Yasin, B. Mazumdar, S. Subidh Ali, and O. Sinanoglu. Security analysis of logic
encryption against the most effective side-channel attack: DPA. In Proc. Defect and
Fault Tolerance in VLSI Systems, pages 97–102, 2015.
[49] M. Mozaffari Kermani and A. Reyhani-Masoleh. Fault detection structures of the S-boxes and the inverse S-boxes for the advanced encryption standard. J. Electronic
Testing: Theory and Applications (JETTA), 25(4):225–245, August 2009.
[50] M. Mozaffari Kermani, R. Azarderakhsh, and A. Aghaie. Fault detection architectures
for post-quantum cryptographic stateless hash-based secure signatures benchmarked
on ASIC. ACM Trans. Embedded Computing Syst., 16(2):59:1–59:19, December 2016.
[51] M. Mozaffari Kermani, R. Azarderakhsh, and A. Aghaie. Reliable and error detection
architectures of Pomaranch for false-alarm-sensitive cryptographic applications. IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., 23:2804–2812, 2015.
[52] M. Mozaffari Kermani, R. Azarderakhsh, A. Sarker, and A. Jalali. Efficient and reliable
error detection architectures of Hash-Counter-Hash tweakable enciphering schemes.
ACM Trans. Embedded Computing Syst., 17(2):1–54, May 2018.
[53] M. Mozaffari Kermani, A. Jalali, R. Azarderakhsh, J. Xie, and K. R. Choo. Reliable
inversion in GF(2^8) with redundant arithmetic for secure error detection of cryptographic architectures. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 37(3):696–704, 2018.
[54] S. Subramanian, M. Mozaffari Kermani, R. Azarderakhsh, and M. Nojoumian. Reliable hardware architectures for cryptographic block ciphers LED and HIGHT. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(10):
1750–1758, 2017.

[55] R. E. Blahut. Fast algorithms for digital signal processing. Addison-Wesley, Boston, MA, 1985.
[56] ©2021 IEEE. Reprinted with the permission of A. Sarker, M. Mozaffari Kermani,
and R. Azarderakhsh. Error detection architectures for ring polynomial multiplication
and modular reduction of ring-LWE in ℤ/pℤ[x]/(x^n + 1) benchmarked on ASIC. IEEE Trans. Reliability, 70(1):362–370, March 2021.
[57] N. Gottert, T. Feller, M. Schneider, J. Buchmann, and S. Huss. On the design of
hardware building blocks for modern lattice-based encryption schemes. In Proc. 14th
Int Workshop Cryptogr. Hardware Embedded Syst. (CHES), pages 512–529, September
2012.
[58] T. Poppelmann and T. Guneysu. Towards practical lattice-based public key encryption
on reconfigurable hardware. In Proc. 20th Int. Conf. Sel. Areas Cryptogr. (SAC), pages
68–85, 2013.
[59] S. S. Roy, F. Vercauteren, N. Mentens, D. D. Chen, and I. Verbauwhede. Compact
ring-LWE cryptoprocessor. In Proc. 16th Int. Workshop Cryptogr. Hardw. Embedded
Syst. (CHES), pages 371–391, 2014.
[60] J. Detchart and J. Lacan. Polynomial ring transforms for efficient XOR-based erasure
coding. In Proc. IEEE Int. Symp. on Inform. Theory, pages 604–608, 2017.
[61] J. Benaloh, M. Chase, E. Horvitz, and K. Lauter. Patient controlled encryption:
Ensuring privacy of electronic medical records. In Proc. ACM Workshop on Cloud
Comput. Security, pages 103–114, 2009.
[62] A. Ben-David, N. Nisan, and B. Pinkas. FairplayMP: A system for secure multi-party
computation. In Proc. ACM Conf. on Comput. and Commun. Security, pages 257–266,
2008.


[63] J. Bos, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, J. M. Schanck, P. Schwabe,
and D. Stehle. CRYSTALS – Kyber: A CCA-secure module-lattice-based KEM. In
Proc. IEEE European Symp. Security and Privacy, pages 353–367, 2018.
[64] S. Streit and F. De Santis. Post-quantum key exchange on ARMv8-A: A new hope for
NEON made simple. IEEE Trans. Comp., 67(11):1651–1662, 2018.
[65] B. Pinkas, T. Schneider, N. P. Smart, and S. C. Williams. Secure two party computation is practical. In Proc. Int. Conf. Theory Appl. Cryptol. and Inform. Security:
Advances in Cryptology, pages 250–267, 2009.
[66] Z. Brakerski and V. Vaikuntanathan. Fully homomorphic encryption from ring-LWE
and security for key dependent messages. In Proc. Annu. Conf. Advances Cryptol.,
pages 505–524, 2011.
[67] S. Saha, U. Kumar, D. Mukhopadhyay, and P. Dasgupta. An automated framework for
exploitable fault identification in block ciphers. J. Crypto. Eng., 9(3):203–219, 2019.
[68] S. Patranabis, A. Chakraborty, and D. Mukhopadhyay. Fault tolerant infective countermeasure for AES. J. Hardware and Syst. Security, 1(1):3–17, 2017.
[69] M. Mozaffari Kermani, R. Azarderakhsh, and A. Aghaie. Fault detection architectures
for post-quantum cryptographic stateless hash-based secure signatures benchmarked
on ASIC. ACM Trans. Embedded Comput. Syst., 16(2):59:1–59:19, 2019.
[70] A. Kamal and A. Youssef. Strengthening hardware implementations of NTRUEncrypt
against fault analysis attacks. J. Crypto. Eng., 3(4):227–240, May 2013.
[71] T. Poppelmann and T. Guneysu. Area optimization of lightweight lattice-based encryption on reconfigurable hardware. In Proc. IEEE Int. Symp. on Circuits and Systs.,
pages 2796–2799, 2014.


[72] Z. Liu, H. Seo, S. S. Roy, J. Großschädl, H. Kim, and I. Verbauwhede. Efficient
ring-LWE encryption on 8-bit AVR processors. In Proc. Int. Workshop Cryptographic
Hardware and Embedded systems (CHES), pages 663–682, 2015.
[73] L. Breveglieri, I. Koren, and P. Maistri. An operation-centered approach to fault
detection in symmetric cryptography ciphers. IEEE Trans. Computers, 56(5):534–540,
May 2007.
[74] ©2021 IEEE. Reprinted with the permission of A. Sarker, M. Mozaffari Kermani,
and R. Azarderakhsh. Fault detection architectures for inverted binary ring-LWE
construction benchmarked on FPGA. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems II, 68(4):1403–1407, April 2021.
[75] A. Aysu, M. Orshansky, and M. Tiwari. Binary ring-LWE hardware with power side-channel countermeasures. In Proc. IEEE Design Autom. Test Europe Conf. Exhibit.
(DATE), pages 1253–1258, March 2018.
[76] S. Ebrahimi, S. Bayat-Sarmadi, and H. Mosanaei-Boorani. Post-quantum cryptoprocessors optimized for edge and resource-constrained devices in IoT. IEEE IoT J., 6
(3):5500–5507, June 2019.
[77] S. Patranabis, A. Chakraborty, P. H. Nguyen, and D. Mukhopadhyay. A biased fault
attack on the time redundancy countermeasure for AES. In Proc. COSADE, pages
189–203, 2015.
[78] S. Patranabis, A. Chakraborty, D. Mukhopadhyay, and P. P. Chakrabarti. Fault space
transformation: A generic approach to counter differential fault analysis and differential fault intensity analysis on AES-like block ciphers. IEEE Trans. Inf. Forensics
Security, 12(5):1092–1102, December 2016.
[79] A. Aghaie, A. Moradi, S. Rasoolzadeh, A. R. Shahmirzadi, F. Schellenberg, and
T. Schneider. Impeccable circuits. IEEE Trans. Comput., 69(3):361–376, March 2020.
[80] L. G. Bruinderink and P. Pessl. Differential fault attacks on deterministic lattice
signatures. IACR Transactions on Cryptographic Hardware and Embedded Systems,
2018(3):21–43, 2018.
[81] S. Ebrahimi and S. Bayat-Sarmadi. Lightweight and fault-resilient implementations of
binary ring-LWE for IoT devices. IEEE Internet of Things Journal, 7(8):6970–6978,
August 2020.
[82] M. Mozaffari Kermani, R. Ramadoss, and R. Azarderakhsh. Efficient error detection
architectures for CORDIC through recomputing with encoded operands. In Proc.
ISCAS, pages 2154–2157, May 2016.
[83] M. Mozaffari Kermani and A. Reyhani-Masoleh. A low-cost S-box for the Advanced
Encryption Standard using normal basis. In Proc. EIT, pages 52–55, Windsor, Canada,
2009.
[84] J. Buchmann, F. Gopfert, T. Guneysu, T. Oder, and T. Poppelmann. High-performance
and lightweight lattice-based public-key encryption. In Proc. ACM Workshop IoT Privacy Trust Security, pages 2–9, 2016.
[85] A. Aghaie, M. Mozaffari Kermani, and R. Azarderakhsh. Fault diagnosis schemes for
low-energy block cipher Midori benchmarked on FPGA. IEEE Trans. on Very Large
Scale Integrated (VLSI) Systems, 25(4):1528–1536, April 2017.
[86] ©2022 IEEE. Reprinted with the permission of A. Sarker, M. Mozaffari Kermani, and
R. Azarderakhsh. Efficient error detection architectures for post quantum signature
Falcon’s sampler and KEM SABER. IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., 2022. (in press).
[87] T. Guneysu, V. Lyubashevsky, and T. Poppelmann. Practical lattice-based cryptography: A signature scheme for embedded systems. In Proc. 14th Int. Workshop Cryptogr.
Hardware Embedded Syst. (CHES), pages 530–547, September 2012.
[88] H. Baan, S. Bhattacharya, O. Garcia-Morchon, R. Rietman, L. Tolhuizen, J. L. Torre-Arce, and Z. Zhang. Round2: KEM and PKE based on GLWR. Cryptology ePrint
Archive, Report 2017/1183, 2017. https://eprint.iacr.org/2017/1183.
[89] A. Banerjee, C. Peikert, and A. Rosen. Pseudorandom functions and lattices. In Proc.
Annu. Conf. Advances Cryptol., pages 719–737, 2012.
[90] D. Knuth. The Art of Computer Programming, volume 3. Boston, MA, USA: Addison-Wesley, 1997.
[91] J. P. D’Anvers, A. Karmakar, S. S. Roy, and F. Vercauteren. Saber: Module-LWR
based key exchange, CPA-secure encryption and CCA-secure KEM. In Proc. International Conference on Cryptology in Africa, pages 282–305, April 2018.
[92] J. M. B. Mera, F. Turan, A. Karmakar, S. S. Roy, and I. Verbauwhede. Compact
domain-specific co-processor for accelerating module lattice-based key encapsulation
mechanism. Cryptology ePrint Archive, Report 2020/321, 2020.
https://eprint.iacr.org/2020/321.
[93] T. Prest, P. Fouque, J. Hoffstein, P. Kirchner, V. Lyubashevsky, T. Pornin, T. Ricosset,
G. Seiler, W. Whyte, and Z. Zhang. Falcon. Technical report, April 2021. URL
https://csrc.nist.gov/projects/post-quantum-cryptography/round--submissions.
[94] C. Gentry, C. Peikert, and V. Vaikuntanathan. Trapdoors for hard lattices and new
cryptographic constructions. In Proc. ACM STOC, pages 197–206, 2008.
[95] L. Ducas and P. Q. Nguyen. Faster Gaussian lattice sampling using lazy floating-point
arithmetic. In Proc. ASIACRYPT, pages 415–432, 2012.
[96] C. Chuengsatiansup, T. Prest, D. Stehlé, A. Wallet, and K. Xagawa. ModFalcon:
Compact signatures based on module-NTRU lattices. In Proc. Asia CCS, pages 853–866,
2020.
[97] A. Karmakar, S. S. Roy, F. Vercauteren, and I. Verbauwhede. Pushing the speed limit
of constant-time discrete Gaussian sampling: A case study on the Falcon signature
scheme. In Proc. Annu. Design Automation Conf., pages 1–6, 2019.
[98] M. Mozaffari Kermani and A. Reyhani-Masoleh. A high-performance fault diagnosis
approach for the AES SubBytes utilizing mixed bases. In Proc. Workshop on Fault
Diagnosis and Tolerance in Cryptography, pages 80–87, 2011.
[99] M. Mozaffari Kermani and A. Reyhani-Masoleh. Reliable hardware architectures for
the third-round SHA-3 finalist Grostl benchmarked on FPGA platform. In Proc. IEEE
Int. Symp. on Defect and Fault Tolerance in VLSI and Nanotechnology Syst., pages 325–331,
2011.
[100] E. Biham and A. Shamir. Differential fault analysis of secret key cryptosystems. In
Proc. Annu. Int. Cryptology Conf., pages 17–21, 1997.
[101] C. Dobraunig, M. Eichlseder, T. Korak, S. Mangard, F. Mendel, and R. Primas. SIFA:
Exploiting ineffective fault inductions on symmetric cryptography. IACR TCHES,
2018(3):547–572, 2018.
[102] M. Dumont, M. Lisart, and P. Maurine. Electromagnetic fault injection: How faults
occur. In Proc. Workshop on Fault Diagn. Tolerance Cryptography (FDTC), pages 9–16, 2019.
[103] M. Agoyan, J. Dutertre, D. Naccache, B. Robisson, and A. Tria. When clocks fail: On
critical paths and clock faults. In Proc. Int. Conf. Smart Card Res. Advanced Appl.,
pages 182–193, 2010.
[104] N. F. Ghalaty, B. Yuce, M. M. I. Taha, and P. Schaumont. Differential fault intensity
analysis. In Proc. Workshop on Fault Diagnosis and Tolerance in Cryptography, pages
49–58, 2014.

[105] P. Pessl and L. Prokop. Fault attacks on CCA-secure lattice KEMs. IACR TCHES,
2021(2):37–60, February 2021.
[106] F. Zhang, X. Lou, X. Zhao, S. Bhasin, W. He, R. Ding, S. Qureshi, and K. Ren.
Persistent fault analysis on block ciphers. IACR TCHES, 2018(3):150–172, 2018.
[107] V. B. Dang, F. Farahmand, M. Andrzejczak, and K. Gaj. Implementing and
benchmarking three lattice-based post-quantum cryptography algorithms using software/hardware codesign. In Proc. Int. Conf. Field-Programmable Technol. (ICFPT),
pages 206–214, 2019.

Appendix A: Copyright Permissions

The permission below is for the reproduction of material in Chapters 2, 3, 4, and 5.

About the Author
Ausmita Sarker received her B.Sc. degree in Electrical and Electronic Engineering from
Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2016. She
is currently a Ph.D. candidate at the Department of Computer Science and Engineering,
University of South Florida. Her research interests include cryptographic engineering, post-quantum cryptography, and embedded systems. As of 2022, she has published five IEEE/ACM Transactions journal papers. She is a student member of IEEE.
Ausmita is the recipient of the Chih Foundation Research and Publication Award (2021), the USF
Graduate Research Symposium Best Poster Award (2019), and the FICS Conference Best Poster Award
(2019). She has also received the NSF-funded HOST conference travel grant in 2020, CRA
conference travel grants in 2018, 2019, and 2020, and the HOST/WISE travel grant
in 2019.

