Efficient Side-channel Resistant MPC-based Software Implementation of the AES by Fernandez Rubio, Abraham
Worcester Polytechnic Institute
Digital WPI
Masters Theses (All Theses, All Years) Electronic Theses and Dissertations
2017-04-27
Repository Citation
Fernandez Rubio, Abraham, "Efficient Side-channel Resistant MPC-based Software Implementation of the AES" (2017). Masters Theses (All
Theses, All Years). 403.
https://digitalcommons.wpi.edu/etd-theses/403
Efficient Side-channel Resistant MPC-based Software Implementation
of the AES
by
Abraham Fernandez-Rubio
A Thesis
Submitted to the Faculty
of the
WORCESTER POLYTECHNIC INSTITUTE
In partial fulfillment of the requirements for the
Degree of Master of Science
in
Electrical and Computer Engineering
April 2017
APPROVED:
Professor Thomas Eisenbarth, Thesis Advisor
Professor Xinming Huang, Thesis Committee
Professor Yehia Massoud, Head of Department
Abstract
Current cryptographic algorithms pose high standards of security yet they are susceptible to
side-channel analysis (SCA). When it comes to implementation, the strength of cryptography
hangs on the weak link of side-channel information leakage. The widely adopted AES encryption
algorithm, and others, can be easily broken when they are implemented without any resistance to
SCA. This work applies state of the art techniques, namely Secret Sharing and Secure Multiparty
Computation (SMC), on AES-128 encryption as a countermeasure to those attacks. This embedded
C implementation explores multiple time-memory trade-offs for the design of its fundamental
components, SMC and field arithmetic, to meet a variety of execution and storage demands. The
performance and leakage assessment of this implementation for an ARM based micro-controller
demonstrate the capabilities of masking schemes and prove their feasibility on embedded software.
In memory of my mother and father
Acknowledgments
This work represents the sum of all the support from all of my circles, my family, friends and
fellow students. I feel grateful to contribute, even in the most insignificant way, to science and
humanity.
I arrived at WPI at the right time and had the great opportunity to be advised by Professor
Thomas Eisenbarth, a very talented and hard-working researcher. Beneath this work there are
countless moments of discussion, stress, learning and laughter that we shared. I thank him for all
his patience, support and belief in me.
I thank Professor Yehia Massoud for being part of this thesis committee and Professor Xinming
Huang for his professional opinion and his support as a committee member.
During these ten months of research, the Vernam lab has been my home in which I’ve been
surrounded by great fellow lab members. I want to thank Okan Seker and Cong Chen for their
collaboration on this research and their patience to share their knowledge with me.
I am very grateful to the programs that have sponsored my studies and stay in the US:
the Fulbright program for opening the door for me to this country, CONACYT for their constant
support, and WPI for their tremendous support and acceptance as a fellow. All of these institutions
have allowed me to focus on my studies and embrace the research experience.
This thesis is based upon work supported by the National Science Foundation under Grant
No.1261399.
Special thanks to Neda Seyedmahmoud for her support and review in the writing of this thesis,
and to Jessica O’Toole for her guidance in the writing process.
This work would not have been possible without the constant love and continuous support of
Elisa Montano; her care and attention regardless of the distance have been fundamental.
Contents
1 Introduction
1.1 Motivation
1.2 Related Work
1.3 Contribution
1.4 Outline
2 Background
2.1 Cryptography in Embedded Devices
2.2 AES
2.3 Masking
2.3.1 Shamir’s Secret Sharing
2.3.2 Secure Multiparty Computation (SMC)
2.4 Side-Channel Analysis
3 Masking Implementation
3.1 Specifications
3.2 Performance
3.3 Challenges
4 Leakage Assessment
4.1 Setup
4.2 Challenges
4.2.1 Trace Generation
4.2.2 Trace Collection
4.2.3 Trace Processing
4.3 SCA Results
4.3.1 Higher-Order t test
4.3.2 Multivariate t test
5 Conclusion
6 Appendix
References
1 Introduction
1.1 Motivation
In the current technological age, millions of devices are connected to the internet, and the
number is projected to reach 200 billion by 2020 [Int]. From a smart watch attached to our wrist
to the Milkyway-2, one of the world’s fastest supercomputers [Alb], the computing demand is
ubiquitous but the level of security in the Internet of Things (IoT) remains uncertain. Computer
security has become a priority as our dependency on technology leads us to a point where we have
to trust computers for most of our daily activities. Even though the computer industry has been
incorporating secure algorithms in its hardware and software designs, one facet of security is
frequently neglected: resistance to side-channel analysis (SCA).
Side-channel analysis is a branch of computer security that focuses on gathering and correlating
large amounts of data inherently related to the behavior of cryptographic algorithm implementations.
The information derived from the management of resources like execution time [Koc96],
energy [KJJ99, GMO01, QS01] and memory of an electronic device reveals valuable clues about
the internal actions it performs. Single samples of data cannot tell much about the internal state
of the algorithm but in the aggregate, after collecting thousands or millions of traces, an observer
would have enough information to make educated guesses regarding the secrets processed by the
device. The objective of this analysis is either to assess the level of security offered by the
implementation or to break it by learning about the internal state of the algorithm from the
leaked information.
Encouraged by the current trend of determining the efficiency of practical cryptographic
implementations and their leakage assessment, this thesis inspects the level of resistance against
side-channel analysis of a software implementation of the widely-used Advanced Encryption Standard
(AES). This version of AES is based on [SES17], which prevents side-channel leakage by
incorporating state of the art masking techniques and multi-party computation on the secrets processed
by the algorithm. The concept behind a masking scheme is to split a secret into multiple pieces
called shares. Each share is independent and can be processed individually without revealing any
information about the secret and, in the end, all the shares are combined to yield the processed
secret.
A masking scheme is characterized by two parameters: the number of random
shares n and the smallest number of shares d + 1 required to reconstruct the secret variable. The
order of the masking scheme is determined by d, thus an (n, d)-sharing scheme
does not reveal any information under a dth-order side-channel analysis. Because the complexity
of mounting an attack on a sharing scheme grows exponentially with the order, the protection
improves as d grows [CJRR99].
1.2 Related Work
Kocher et al. [KJJR11] describe in detail the characteristics of Differential Power Analysis (DPA)
and Simple Power Analysis (SPA). These methods are capable of learning valuable information
about secrets processed by cryptographic algorithm implementations when they are not properly
protected against side-channel attacks. In the presence of enough leakage it is possible to recover
the keys. The work by [SM15] points out the current leakage assessment methodologies; these
tests help the designer understand the side-channel behavior of the implementation and
reveal potential sources of leakage, if any.
Security researchers have proposed a wide variety of countermeasures to defend cryptographic
algorithm implementations from SCA. Rivain and Prouff [RP10] presented the first generic
dth-order masking scheme for AES with provable security and acceptable software implementation
overhead. That contribution is based on the proposal of Ishai et al. [ISW03]. Their work extends
Boolean masking schemes to any finite field and, by incorporating provably secure operations on masked
variables, achieves an efficient software AES implementation. According to it, an (n, d) masking
scheme with n = d + 1 provides dth-order security, but Coron et al. [CPRR14] indicated that
it requires n ≥ 2d + 1 shares.
Another countermeasure against passive SCA, introduced by von Willich [vW01], is a
technique known as affine masking. It increases the number of intermediate
variables required to recover any secret information and changes the variables randomly on
every run. Affine masking can achieve similar performance to Boolean masking; however, it has
not been generalized for orders d > 1.
One of the first influential concepts was introduced by Shamir [Sha79], where a randomly generated
degree-d polynomial is used to break a confidential variable into n parts without revealing any
information about it. The secret can be reconstructed from the n shares but an adversary cannot
recover any information from d of them; it is commonly assumed that n = d + 1.
Secure multi-party computations were introduced initially by Ben-Or et al. [BOGW88] and
Gennaro et al. [GRR98]; they represent the important counterpart of secret sharing. Later, the
works by Goubin [GM11] and by Roche and Prouff [RP12] exploit polynomial masking. They demonstrate
that their construction thwarts dth-order side-channel attacks even in the presence of glitches.
In summary, it is important to identify the difference between masking schemes and secure
multi-party computation. The former describes how to break a secret into multiple shares and the
latter carries out the processing of the parts while preserving the completeness of the secret. The
purpose of this layer of redundancy is to protect the confidentiality of the sensitive variables
against side-channel leakage. The downside of SCA-resistant implementations is the performance overhead
inherently derived from the need to process the cryptographic algorithm for every share. Grosso
et al. [GSF14] evaluate the performance cost introduced by higher-order masking schemes for AES;
additional security features like glitch-freeness require an elevated computation overhead.
1.3 Contribution
This thesis extends the work [SES17] in which a polynomial masking and secure multi-party
computation [RP12] approach is applied in software to harden the widely used AES. My work provides
higher-order side-channel analysis (HO-SCA) results, a prototype implementation released to the
public (https://github.com/vernamlab) and an in-depth performance analysis. The leakage assessment
and performance analysis are done on the ultra-low power ARM Cortex-M0+. The code is an 8-bit
implementation written in C that can be easily ported to any platform. It provides 5 different
(n, d) secret sharing schemes that are selectable through compile-time preprocessor parameters, and
the code is easily expandable to higher orders. Most of the contributions by the side-channel
research community have focused on hardware implementations [MM13, DCBRN15]. Recently, Goudarzi
and Rivain [GR16] thoroughly analyzed the performance of higher-order masking techniques in
software for the ARM architecture, but their work lacks a leakage assessment.
The leakage assessment included in this thesis comprises univariate t tests and second-order
multivariate t tests. These tests are capable of highlighting the relationship between the processed
sensitive variables and side-channel leakage. Due to the long execution times of the masking scheme,
the HO-SCA is focused on the secure multiplication function that is fundamental to the algorithm.
The results show that, statistically, the power consumption is independent of the processed secrets.
The implementation features several execution optimizations. Depending on the memory and
execution limitations of the target platform, the user can alternate between different versions of
the field multiplications. Additionally, since the code is written in a branchless fashion to prevent
microarchitectural leakage, all the available versions of the functions execute in constant time.
Performance measurements for all versions of the secure multiplication for different (n, d)-sharing
schemes are also discussed. A breakdown of the operations per round of AES-128 and execution
timings are also given. Finally, a table comparing code and memory sizes is provided.
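As a rough illustration of what branchless field arithmetic can look like, the sketch below multiplies two bytes in GF(2^8) with the AES reduction polynomial using masks instead of data-dependent branches. It is not taken from the released code: the function names gf256_xtime and gf256_mul are hypothetical, and whether the compiled code is truly constant time still depends on the compiler and target.

```c
#include <stdint.h>

/* Hypothetical helper: multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1.
 * The reduction constant 0x1B is selected with a mask, not with an if-statement. */
uint8_t gf256_xtime(uint8_t a)
{
    uint8_t mask = (uint8_t)(0u - (a >> 7));      /* 0xFF if the top bit is set, else 0x00 */
    return (uint8_t)((a << 1) ^ (mask & 0x1Bu));
}

/* Hypothetical helper: shift-and-add multiplication in GF(2^8).
 * The loop always runs 8 iterations and never branches on a or b. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t mask = (uint8_t)(0u - (b & 1u));  /* 0xFF if the current bit of b is 1 */
        r ^= (uint8_t)(a & mask);                 /* conditionally accumulate a */
        a = gf256_xtime(a);
        b >>= 1;
    }
    return r;
}
```

The loop count and the memory access pattern do not depend on the operands, which is the property the paragraph above refers to. A table-based multiplication would trade this loop for memory, which is one possible form of the time-memory trade-off mentioned in the text.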
1.4 Outline
This thesis is organized in three main sections: general background, implementation details with
performance results, and leakage assessment. Section 2 provides a brief description of main concepts
such as AES-128, an introduction to polynomial masking, secure multiparty computation and
side-channel leakage assessment methodologies. Section 3 describes the considerations and low-level
details of bringing the scheme to the ARM based microcontroller as well as performance results.
Finally, Section 4 shows a comprehensive side-channel analysis for (3,1)-sharing and (5,2)-sharing
schemes including the challenges to perform the assessment and its results.
2 Background
There are important considerations in the applied cryptography and security industry that must be
satisfied to reduce the attack surface of a product or service. Current cryptographic algorithms
were closely scrutinized by the cryptography research community before they
were globally adopted. These selections are based on a history of lessons learned from broken
crypto-algorithms, new contributions to cryptanalysis and the current computational power pushing
the boundaries of brute-force attacks. Yet, once these algorithms transition from theory to
practical implementation, new entry points emerge, like side-channel analysis (Section 2.4)
and fault injection. One of the reasons is that crypto-algorithms have to be secure but they have
to be efficient as well. Some of them perform better in software and others in hardware, but
providing another layer of protection against SCA and physical attacks requires a certain level of
redundancy or extra processing effort. Unfortunately, it is very common to find implementations that
prioritize the user experience at the cost of security. So, it is fundamental to understand
the market segments and the conditions where secure devices will be operating in order to find the
appropriate trade-off between security and performance.
2.1 Cryptography in Embedded Devices
Today, embedded devices are a key part of our lives, from the big servers processing financial
transactions to the small ones inside smartphones. These omnipresent tiny general purpose
computers, each handling specific tasks, process a large part of the information that we need every day.
Embedded systems are virtually everywhere: in industry, in machinery and tools; inside
our homes, in appliances and air conditioners; in transportation, in cars and planes; in space, in
satellites and spaceships; and in our gadgets, like phones, tablets and credit cards.
This section classifies the handling of sensitive information into three high-level stages:
communication, processing and storage. To illustrate this process, assume a user, called Alice, who
wants to log in to her home PC. She has to enter her credentials (user ID and password) and
make sure nobody is watching as she types the information. In other words, the communication
channels have to be secure, including the keyboard that is used to input the secret data. Later, the
computer checks the credentials to verify her identity. Ideally, the processing of the data should
not reveal any information to an adversary looking at side-channel effects like cache usage, power
consumption or execution time. Finally, the password must be stored in such a way that even if an
attacker breaks into the computer, the attacker cannot obtain the password. In other
words, the password must be scrambled in a deterministic way from which the password can be
verified but not guessed.
Communication. The embedded electronics sector has been tremendously pushed by the
telecommunications industry. Beyond the necessity of human beings to communicate with
each other, there is a fundamental requirement to transfer bits of information from one place to
another. Thus, our connected world heavily depends on moving data around in a reliable and
practical way, and when the sensitivity of the information being transferred demands it, security
plays a big role. Think of the most common actions that people carry out every day, like sending
money from a bank account to another part of the world using a phone, paying for a cup of
coffee with a credit card, making a phone call or simply posting a message on social media;
these large scale examples involve the exchange of sensitive information.
Processing. Once the sensitive data is inside a computing core, a series of instructions
operates on it, for example arithmetic, logical and flow control operations. It is important
to have protection mechanisms inside the core to prevent an adversary from accessing
restricted information. For instance, access to memory should be limited to a certain range,
important configuration registers should have locking mechanisms, the core should support
hierarchical levels of execution, and so on. Even if these protections are in
place, the execution time and power consumption of the instructions can reveal certain information
about the data being processed. The reason is that the data representation within the core and
the way it is handled affect these side-channel variables.
Storage. Inevitably, the information has to be saved, whether just for a few nanoseconds
as in a cache memory or for longer periods as on a hard drive. The level of protection of the data
should match its confidentiality/privacy level. For example, if the information is sensitive
and is stored in clear text then it should have restricted access, or if it requires stronger protection
then it should be encrypted. There have been recent attacks that exploit the cache memory access
time as a source of leakage to recover cryptographic keys [BM06, IIES14].
A well designed implementation of a cryptographic algorithm should systematically consider
these three important stages at all levels. In other words, security can be applied to all the
layers that a system is composed of, either at the level of a System-on-Chip (SoC) or at the
level of a wireless telephone network or any other. There are many cryptographic primitives
that can be used to guarantee the security objectives of an embedded system. For example,
symmetric and asymmetric encryption algorithms provide confidentiality, cryptographic hashes
assess integrity and signature schemes provide authentication and non-repudiation. The security
of embedded devices depends directly on the strength of cryptographic algorithms, and it also depends
on the resistance of the implementation to side-channel attacks.
Many embedded devices feature cryptographic primitives as part of their design. Either as
hardware modules or as an extension of the instruction set that they support, these
features can improve the performance of a secure application. Hardware cryptographic modules or
extensions tend to be faster as they are dedicated machines; however, the cost and size of the die
are also higher compared to software implementations. The demand for either software or hardware
assisted cryptography depends on the market segment and usage model of the application. For
example, if the user has physical possession of the device then she/he could manage to perform a
wider range of attacks as opposed to having remote access only. There are many other considerations,
like the user access privilege and the connectivity of the application, which allows vulnerabilities
to be patched remotely.
As the presence of embedded devices is very wide, product developers and security designers
must consider an increasing number of attack sources and possible protections against them.
This thesis focuses on the design, performance and side-channel assessment of a robust software
implementation of the AES encryption algorithm on an embedded microcontroller.
2.2 AES
The Advanced Encryption Standard (AES) was built on top of three pillars [DR99, p. 8]: simplicity,
efficiency and resistance. The algorithm, initially called Rijndael after its creators, is
compact in terms of code size and its simplicity allows performance efficient implementations.
Although the algorithm itself was designed to resist all attacks known at the time, most current
implementations are still subject to side-channel attacks.
AES is a block cipher that supports a block size of 128 bits and key lengths of 128, 192 or 256
bits. The algorithm operates iteratively on the intermediate mixture of plaintext and key called
the state. The number of iterations (rounds) depends on the key length: 10, 12 or
14 rounds for 128-, 192- or 256-bit keys respectively. For simplicity and to focus on the particular
AES implementation of this thesis, the rest of this section describes AES-128, which works on a
128-bit key and a 128-bit block with 10 rounds, in ECB (Electronic Codebook) mode. Further details
about the other block/key sizes and number of rounds can be found in [DR99, p. 9]. In AES-128, the
state and the key can be pictured as two different arrays of 4-by-4 bytes. The first four bytes
compose the first column, with byte 0 at the top and byte 3 at the bottom. The next column, from
left to right, holds the next four bytes and so on.
This representation simplifies the explanation of how the functions operate on the state and keys.
Figure 1: AES-128 state and key represented as arrays of 16 Bytes arranged in column major order.
The algorithm is composed of four invertible main functions: SubBytes, ShiftRows, MixColumns
and AddRoundKey. An initial AddRoundKey precedes the first round. For the first nine rounds, all
functions are called but in the last round all except MixColumns are used. The next diagram
illustrates how these functions are structured for encryption.
Figure 2: AES-128 Encryption high-level diagram. Plaintext and key are the inputs, ciphertext is the output. The
blocks in purple represent the functions that are applied to the intermediate state on every round.
SubBytes substitutes each byte of the state with another byte
based on a look-up table called the S-box. This function is a non-linear substitution, which means
that it cannot be expressed with XOR operations alone, that is, as a linear or affine map over GF(2).
The S-box can be expressed as two operations on each byte of the state: a multiplicative inverse
followed by an affine transformation, both over $GF(2^8)$. To invert the process, first the inverse
of the affine mapping is applied and then the multiplicative inverse.
Figure 3: AES SubBytes function. Every byte of the state is replaced by another byte according to the S-Box table.
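The decomposition of the S-box into an inversion and an affine map can be sketched in C as follows. This is only an illustration under assumptions, not the thesis code: the names gf256_mul, gf256_inv and aes_sbox are hypothetical, and the inverse is computed as the 254th power of the input, which maps 0 to 0 as AES requires.

```c
#include <stdint.h>

/* Hypothetical helper: branchless multiplication in GF(2^8) modulo 0x11B. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r ^= (uint8_t)(a & (uint8_t)(0u - (b & 1u)));
        a = (uint8_t)((a << 1) ^ ((uint8_t)(0u - (a >> 7)) & 0x1Bu));
        b >>= 1;
    }
    return r;
}

/* Multiplicative inverse as a^254 = a^(2+4+8+16+32+64+128); returns 0 for a = 0. */
uint8_t gf256_inv(uint8_t a)
{
    uint8_t r = 1;
    uint8_t p = gf256_mul(a, a);            /* p = a^2 */
    for (int i = 0; i < 7; i++) {
        r = gf256_mul(r, p);                /* multiply in a^(2^(i+1)) */
        p = gf256_mul(p, p);
    }
    return r;
}

/* Rotate a byte left by n bits, 1 <= n <= 7. */
static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* S-box: inversion in GF(2^8) followed by the AES affine transformation. */
uint8_t aes_sbox(uint8_t x)
{
    uint8_t b = gf256_inv(x);
    return (uint8_t)(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63u);
}
```

Computing the S-box this way avoids a 256-byte table at the cost of several field multiplications. In a masked implementation the inversion is the part that needs the secure multiplication described in Section 2.3.2, while the affine map can be applied to each share locally.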
The ShiftRows transformation rotates the bytes within each row of the state to the left:
the first row is left without rotation, the second row is rotated by one position, the third row
by two positions and the fourth by three positions. To invert the process, the first row is
again left without rotation, and the second, third and fourth rows are rotated to the right by one,
two and three bytes respectively.
Figure 4: AES ShiftRows function. Every row of the state is shifted by 0, 1, 2 and 3 positions respectively.
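With the column-major layout of Figure 1, the byte at row r and column c sits at index r + 4c, so the row rotation can be sketched as below. The function name aes_shift_rows is a hypothetical choice for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Rotate row r of the column-major state left by r positions. */
void aes_shift_rows(uint8_t state[16])
{
    uint8_t tmp[16];
    for (int col = 0; col < 4; col++) {
        for (int row = 0; row < 4; row++) {
            /* The new byte at (row, col) comes from column (col + row) mod 4. */
            tmp[row + 4 * col] = state[row + 4 * ((col + row) % 4)];
        }
    }
    memcpy(state, tmp, sizeof tmp);
}
```

Since ShiftRows only moves bytes, a masked implementation can apply it to every share independently.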
In the MixColumns function, every column of the state is multiplied by an invertible polynomial
$b(x) = \mathtt{03}\,x^3 + x^2 + x + \mathtt{02}$ modulo $x^4 + 1$ over $GF(2^8)$. To reverse this process, the inverse polynomial
$c(x) = \mathtt{0b}\,x^3 + \mathtt{0d}\,x^2 + \mathtt{09}\,x + \mathtt{0e}$ is multiplied by all the columns of the state.
Figure 5: AES MixColumns function. Every column of the state is multiplied by an invertible polynomial b(x).
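Multiplying a column by b(x) modulo x^4 + 1 amounts to a fixed matrix product whose only non-trivial coefficients are 02 and 03. A minimal sketch, assuming the hypothetical names xtime and aes_mix_column, could look like this:

```c
#include <stdint.h>

/* Multiply by 02 (that is, by x) in GF(2^8) modulo 0x11B, without branching. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((uint8_t)(0u - (a >> 7)) & 0x1Bu));
}

/* Multiply one 4-byte column by b(x) = 03 x^3 + x^2 + x + 02 modulo x^4 + 1.
 * Multiplication by 03 is computed as xtime(a) ^ a. */
void aes_mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    c[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}

/* Apply the column transformation to all four columns of a column-major state. */
void aes_mix_columns(uint8_t state[16])
{
    for (int col = 0; col < 4; col++)
        aes_mix_column(&state[4 * col]);
}
```

Every operation here is a multiplication by a public constant or an XOR, so the function is linear over GF(2) and can be applied share by share.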
During AddRoundKey, each byte of the round key (the portion of the expanded key selected for
that round) is mixed with its corresponding byte of the state by an XOR operation. It is the
simplest function, but the round key is different for every round. The round keys are produced by
a key schedule mechanism that derives each round key from the previous one using operations like
rotation, the S-box transformation, and a round constant obtained as a power of 2 over $GF(2^8)$.
A full description of the key expansion and
key selection algorithm can be found in [DR99, pp.14,15].
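The key schedule for AES-128 can be sketched as follows. This is an unmasked illustration with hypothetical names (aes128_expand_key and its helpers); it computes the S-box algebraically and updates the round constant by multiplying it by 2 in GF(2^8) after each round key.

```c
#include <stdint.h>

/* Hypothetical helper: branchless multiplication in GF(2^8) modulo 0x11B. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r ^= (uint8_t)(a & (uint8_t)(0u - (b & 1u)));
        a = (uint8_t)((a << 1) ^ ((uint8_t)(0u - (a >> 7)) & 0x1Bu));
        b >>= 1;
    }
    return r;
}

/* S-box as inversion (a^254) followed by the AES affine transformation. */
uint8_t aes_sbox(uint8_t x)
{
    uint8_t r = 1, p = gf256_mul(x, x);
    for (int i = 0; i < 7; i++) {
        r = gf256_mul(r, p);
        p = gf256_mul(p, p);
    }
    uint8_t s = r;
    for (int n = 1; n <= 4; n++)
        s ^= (uint8_t)((r << n) | (r >> (8 - n)));
    return (uint8_t)(s ^ 0x63u);
}

/* Expand a 16-byte key into 11 round keys (176 bytes, column-major words). */
void aes128_expand_key(const uint8_t key[16], uint8_t rk[176])
{
    uint8_t rcon = 0x01;                       /* round constant, 2^(round-1) */
    for (int i = 0; i < 16; i++)
        rk[i] = key[i];
    for (int i = 16; i < 176; i += 4) {
        uint8_t t[4] = { rk[i - 4], rk[i - 3], rk[i - 2], rk[i - 1] };
        if (i % 16 == 0) {                     /* first word of each round key */
            uint8_t first = t[0];
            t[0] = (uint8_t)(aes_sbox(t[1]) ^ rcon);   /* rotate, substitute, add constant */
            t[1] = aes_sbox(t[2]);
            t[2] = aes_sbox(t[3]);
            t[3] = aes_sbox(first);
            rcon = gf256_mul(rcon, 0x02);
        }
        for (int j = 0; j < 4; j++)
            rk[i + j] = (uint8_t)(rk[i - 16 + j] ^ t[j]);
    }
}
```

The only branch depends on the public loop index, not on key material. A side-channel resistant implementation would additionally need to protect the S-box evaluations on key bytes, which this sketch does not attempt.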
2.3 Masking
An adversary can perform a side-channel analysis on a device running a cryptographic algorithm,
and the attack can be extended to combine different points in time and different sources
of information. This type of physical attack is considered a Higher-Order Side-Channel Analysis
(HO-SCA); the number of different points of observation (leakages) determines the order of the
attack [RP12, p.111]. There is an exponential relationship between the order and the complexity
of such an attack. It is important to mention that not only does the attack become more complex,
but a cryptographic implementation resistant to HO-SCA also requires more computing resources to
successfully thwart the analysis.
Masking, also known as sharing, is a technique to prevent HO-SCA [CJRR99,
pp.146,147] [RP12, pp.111,114]. It consists of splitting a sensitive variable into multiple
parts. Together, these shares hold a deterministic relationship to the sensitive variable, but the
processing of the individual parts should not reveal enough information to break the cryptographic
algorithm under test. A sensitive variable is one that depends on secret parameters of the algorithm,
for example plaintext mixed with keys as in the AES intermediate state.
2.3.1 Shamir’s Secret Sharing
The power consumption due to the processing of a sensitive variable by a cryptographic device
may reveal valuable information about the variable itself. The leakage arises because
there is a relationship between the variable, the instructions operating on it and the side-channel
information coming out of the device, e.g. the power consumption. An adversary collecting a
certain number (say a few thousand) of power traces would be able to recover the variable through
statistical analysis, as described later in Section 2.4. Shamir’s Secret Sharing, introduced by
Adi Shamir in [Sha79], is a very clever concept that [RP12] use as a countermeasure to
help thwart this attack.
Shamir’s Secret Sharing splits a sensitive variable $s$ into multiple pieces that are related to
the secret by a degree-$d$ polynomial $P(x) = s + a_1 x + \ldots + a_d x^d$, where the coefficients
$a_i$ are randomly selected and are meant to remain secret throughout the lifetime of the share. A
number $n$ of parties $(I_i)_{i=0,\ldots,n-1}$ each get a share by evaluating the polynomial at
different non-zero points $\alpha_0, \ldots, \alpha_{n-1}$ that are public; the resulting values
$P(\alpha_0), \ldots, P(\alpha_{n-1})$ are known as shares. The
number of shares $n$ and the degree $d$ of the polynomial specify the characteristics of the masking
scheme, in other words, an $(n,d)$-sharing. The relationship between the number of shares and the
degree of the polynomial can be $n = d + 1$ [RP12, p.111] for linear operations among the secret
shares. This is because $d + 1$ points are required to reconstruct the coefficients of the
polynomial. Equation (1) describes the masking scheme in a matrix equation representation.
\begin{equation}
\begin{bmatrix}
P(\alpha_0) \\ P(\alpha_1) \\ \vdots \\ P(\alpha_d) \\ \vdots \\ P(\alpha_{n-1})
\end{bmatrix}
=
\begin{bmatrix}
1 & \alpha_0 & \alpha_0^2 & \cdots & \alpha_0^{n-1} \\
1 & \alpha_1 & \alpha_1^2 & \cdots & \alpha_1^{n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \alpha_d & \alpha_d^2 & \cdots & \alpha_d^{n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \alpha_{n-1} & \alpha_{n-1}^2 & \cdots & \alpha_{n-1}^{n-1}
\end{bmatrix}
\begin{bmatrix}
s \\ a_1 \\ a_2 \\ \vdots \\ a_d \\ \vdots \\ a_{n-1}
\end{bmatrix}
\tag{1}
\end{equation}
From equation (1) above, the matrix composed of the powers of the public points
$(\alpha_i^j)_{i,j=0,\ldots,n-1}$ is known as the Vandermonde matrix $V$; it is a square matrix of
size $n \times n$. The coefficients $(a_i)_{i=1,\ldots,n-1}$ are
randomly generated by the masking scheme and the user does not need to have access to them.
It is important to note that to reconstruct the secret value $s$, only the first row (expressed as
$\lambda_0, \ldots, \lambda_{n-1}$) of the inverse Vandermonde matrix is required, such that
$s = \sum_{i=0}^{n-1} P(\alpha_i)\lambda_i$.
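The sharing and reconstruction steps can be sketched in C over GF(2^8). The sketch below is an illustration under assumptions rather than the thesis implementation: it uses a (3,1)-sharing, the function names are hypothetical, the caller supplies the polynomial coefficients (which must come from a random number generator in practice), and the first row of the inverse Vandermonde matrix is obtained as the Lagrange basis polynomials evaluated at zero, $\lambda_i = \prod_{j \neq i} \alpha_j / (\alpha_i + \alpha_j)$, where addition in GF(2^8) is XOR.

```c
#include <stdint.h>

#define N_SHARES 3   /* n, illustrative choice */
#define DEGREE   1   /* d, illustrative choice */

/* Hypothetical helper: branchless multiplication in GF(2^8) modulo 0x11B. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r ^= (uint8_t)(a & (uint8_t)(0u - (b & 1u)));
        a = (uint8_t)((a << 1) ^ ((uint8_t)(0u - (a >> 7)) & 0x1Bu));
        b >>= 1;
    }
    return r;
}

/* Multiplicative inverse as a^254. */
uint8_t gf256_inv(uint8_t a)
{
    uint8_t r = 1, p = gf256_mul(a, a);
    for (int i = 0; i < 7; i++) {
        r = gf256_mul(r, p);
        p = gf256_mul(p, p);
    }
    return r;
}

/* share[i] = P(alpha[i]) with P(x) = s + coeff[0] x + ... + coeff[d-1] x^d (Horner). */
void shamir_share(uint8_t s, const uint8_t coeff[DEGREE],
                  const uint8_t alpha[N_SHARES], uint8_t share[N_SHARES])
{
    for (int i = 0; i < N_SHARES; i++) {
        uint8_t acc = 0;
        for (int k = DEGREE - 1; k >= 0; k--)
            acc = gf256_mul((uint8_t)(acc ^ coeff[k]), alpha[i]);
        share[i] = (uint8_t)(acc ^ s);
    }
}

/* First row of the inverse Vandermonde matrix, from the public points only. */
void shamir_lambda(const uint8_t alpha[N_SHARES], uint8_t lambda[N_SHARES])
{
    for (int i = 0; i < N_SHARES; i++) {
        uint8_t num = 1, den = 1;
        for (int j = 0; j < N_SHARES; j++) {
            if (j == i)
                continue;                      /* depends on public indices only */
            num = gf256_mul(num, alpha[j]);
            den = gf256_mul(den, (uint8_t)(alpha[i] ^ alpha[j]));
        }
        lambda[i] = gf256_mul(num, gf256_inv(den));
    }
}

/* s = sum over i of lambda[i] * P(alpha[i]). */
uint8_t shamir_reconstruct(const uint8_t share[N_SHARES], const uint8_t lambda[N_SHARES])
{
    uint8_t s = 0;
    for (int i = 0; i < N_SHARES; i++)
        s ^= gf256_mul(lambda[i], share[i]);
    return s;
}
```

Because the public points are fixed, the values $\lambda_i$ can be precomputed once and stored as constants. The public points must be distinct and non-zero, otherwise the denominator in shamir_lambda vanishes.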
2.3.2 Secure Multiparty Computation (SMC)
With secret sharing introduced, the next step is to take advantage of its level of
confidentiality to establish a suitable side-channel analysis resistant masking scheme. From the
previous section, a user can mask a secret by splitting it into multiple parts so that no information
can be extracted from any individual share or from any d shares together. The inverse
is also possible: the user is able to reconstruct the secret by polynomial interpolation with the
knowledge of d + 1 shares. To make use of this elegant mechanism, the user also needs to know
how to do computations on the shares [BOGW88] such that they have the same effect on the
secret value. This section briefly describes three fundamental operations required to implement
AES encryption; the operations act on the secret shares but produce the same result as if the
secret variable were operated on directly.
Computing linear operations on the shares of a secret variable is straightforward.
Consider a fixed finite field E and two secret values s1, s2 in E that have been
previously shared with two different polynomials f(x) and g(x) respectively. Also consider two
non-zero constants c1 and c2 that belong to the same field E. The addition of shares h(αi) =
f(αi) + g(αi) by all parties i = 0, . . . , n − 1 yields a sharing of the sum of the two secrets,
s1 + s2. In the same way, the affine transformation k(αi) = c1 · f(αi) + c2 of the shares
of all parties corresponds to the affine transformation of the secret, c1 · s1 + c2.
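The two linear operations can be sketched in a few lines of C. This is a minimal illustration under stated assumptions: a (3,1) scheme over GF(2^8), hypothetical public points 1, 2 and 3, and a random coefficient passed in as a parameter instead of being drawn from an RNG. None of the names below come from the thesis code.

```c
#include <stdint.h>

#define N 3  /* number of shares (assumed for this sketch) */

/* Hypothetical public points; any distinct non-zero values would do. */
const uint8_t ALPHA[N] = {0x01, 0x02, 0x03};

/* GF(2^8) multiplication modulo x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t z = 0;
    for (int i = 0; i < 8; i++) {
        z ^= (uint8_t)(-(b & 1)) & a;
        a = (uint8_t)((a << 1) ^ (-(a >> 7) & 0x1b));
        b >>= 1;
    }
    return z;
}

uint8_t gf_inv(uint8_t a) {          /* a^254 = a^(-1) for non-zero a */
    uint8_t r = 1;
    for (int i = 0; i < 254; i++) r = gf_mul(r, a);
    return r;
}

/* Share s with the degree-1 polynomial s + a1*x. */
void share(uint8_t s, uint8_t a1, uint8_t out[N]) {
    for (int i = 0; i < N; i++) out[i] = (uint8_t)(s ^ gf_mul(a1, ALPHA[i]));
}

/* Interpolate the shared polynomial at x = 0. */
uint8_t reconstruct(const uint8_t sh[N]) {
    uint8_t s = 0;
    for (int i = 0; i < N; i++) {
        uint8_t num = 1, den = 1;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            num = gf_mul(num, ALPHA[j]);
            den = gf_mul(den, (uint8_t)(ALPHA[j] ^ ALPHA[i]));
        }
        s ^= gf_mul(sh[i], gf_mul(num, gf_inv(den)));
    }
    return s;
}

/* h(alpha_i) = f(alpha_i) + g(alpha_i): shares of s1 + s2. */
void shares_add(const uint8_t f[N], const uint8_t g[N], uint8_t h[N]) {
    for (int i = 0; i < N; i++) h[i] = (uint8_t)(f[i] ^ g[i]);
}

/* k(alpha_i) = c1 * f(alpha_i) + c2: shares of c1 * s1 + c2. */
void shares_affine(uint8_t c1, uint8_t c2, const uint8_t f[N], uint8_t k[N]) {
    for (int i = 0; i < N; i++) k[i] = (uint8_t)(gf_mul(c1, f[i]) ^ c2);
}
```

Both operations touch each share independently, so no communication between players and no fresh randomness are needed.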
However, the multiplication of two masked secrets is a non-linear operation that involves a
more complicated process. After its introduction in [BOGW88, p.4], the secure multiplication of
two masked secrets was improved by [GRR98] and afterwards by [RP12, p.119]. The high-level
description of the algorithm is as follows:
1. Each player $I_i$ computes $h(\alpha_i) = f(\alpha_i) \cdot g(\alpha_i)$ locally.
2. $I_i$ generates a degree-$d$ polynomial $Q_i(x)$ such that $Q_i(0) = h(\alpha_i)$ and sends the
value $Q_i(\alpha_j)$ to player $I_j$.
3. $I_i$ computes $Q(\alpha_i) = \sum_{j=0}^{n-1} \lambda_j Q_j(\alpha_i)$, where
$(\lambda_0, \dots, \lambda_{n-1})$ represents the first row of the inverse Vandermonde matrix.
4. The family $(Q(\alpha_i))_{i=0,\dots,n-1}$ is a shared representation of $s_1 \cdot s_2$.
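The reason the four steps yield a valid degree-$d$ sharing of the product can be written out as a short derivation. It assumes $n \geq 2d + 1$, so that the product polynomial $h = f \cdot g$ of degree $2d$ is determined by the $n$ points; the symbols are those of the list above.

```latex
\begin{align*}
Q(x) &= \sum_{j=0}^{n-1} \lambda_j\, Q_j(x),
  & \deg Q &\le d \quad \text{(each } Q_j \text{ has degree } d\text{)} \\
Q(0) &= \sum_{j=0}^{n-1} \lambda_j\, Q_j(0)
      = \sum_{j=0}^{n-1} \lambda_j\, h(\alpha_j)
  & &\text{(step 2: } Q_j(0) = h(\alpha_j)\text{)} \\
     &= h(0) = f(0) \cdot g(0) = s_1 \cdot s_2
  & &\text{(interpolation of } h \text{ at } 0,\ \deg h = 2d \le n-1\text{)}
\end{align*}
```

The re-sharing in step 2 is what brings the degree back from $2d$ to $d$, at the cost of $d$ fresh random coefficients per player.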
Based on this algorithm, equations (2) and (3) show an example of the matrix representation
of the operations necessary for a secure multiplication of two shared secrets in the (5,2)-masking
scheme. Note that these equations only summarize the required field multiplications and additions
but not how they are processed by the different players.
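$$
\mathbf{Q} =
\begin{bmatrix}
f(\alpha_0) \cdot g(\alpha_0) & r_{0,0} & r_{0,1} \\
f(\alpha_1) \cdot g(\alpha_1) & r_{1,0} & r_{1,1} \\
\vdots & \vdots & \vdots \\
f(\alpha_4) \cdot g(\alpha_4) & r_{4,0} & r_{4,1}
\end{bmatrix}
\begin{bmatrix}
\alpha_0^0 & \alpha_1^0 & \dots & \alpha_4^0 \\
\alpha_0^1 & \alpha_1^1 & \dots & \alpha_4^1 \\
\alpha_0^2 & \alpha_1^2 & \dots & \alpha_4^2
\end{bmatrix}
\tag{2}
$$
$$
\begin{bmatrix} Q(\alpha_0) & Q(\alpha_1) & \dots & Q(\alpha_4) \end{bmatrix}
=
\begin{bmatrix} \lambda_0 & \lambda_1 & \dots & \lambda_4 \end{bmatrix}
\cdot \mathbf{Q}
\tag{3}
$$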
It is important to note that when two sets of shares derived from degree d polynomials are
multiplied, the resulting polynomial has degree 2d, so 2d + 1 points are required to reconstruct
the shared secret. A particular case of the multiplication is the squaring operation. Due to a
property introduced in [RP12, p.118], it can be simplified: if the public points satisfy
$\alpha_i \neq 0$ for $i = 1, \dots, n$ and for each $\alpha_i$ there is an $\alpha_j$ that satisfies
$\alpha_i^2 = \alpha_j$, then the transformation $\eta_k(y) = c_1 \cdot y^{2^k}$ can be computed on the
shares.
Each player calculates the transformation on its share locally as
$s(\alpha_i) = c_1 \cdot [f(\alpha_i)]^{2^k} = f'(\alpha_j)$, where $f'(x)$ is the polynomial whose
coefficients are calculated by applying the transformation to the coefficients of $f(x)$. The family
of shares $s(\alpha_i)$ for $i = 0, \dots, n-1$ is a valid set of secret shares of
$c_1 \cdot s_1^{2^k}$. However, communication between players is needed to reorder the secret
shares.
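The squaring property can be checked with a small sketch for five shares and k = 1, c1 = 1. The public points here are an assumption made for illustration: the element 0x03^17 together with its successive squares forms a set of four points closed under squaring, and 1 is the fifth point. The thesis uses its own pre-computed points, and all function names below are hypothetical.

```c
#include <stdint.h>

#define N 5  /* number of shares, as in a (5,2) scheme */

uint8_t gf_mul(uint8_t a, uint8_t b) {   /* GF(2^8), polynomial 0x11b */
    uint8_t z = 0;
    for (int i = 0; i < 8; i++) {
        z ^= (uint8_t)(-(b & 1)) & a;
        a = (uint8_t)((a << 1) ^ (-(a >> 7) & 0x1b));
        b >>= 1;
    }
    return z;
}

uint8_t gf_pow(uint8_t a, unsigned e) {
    uint8_t r = 1;
    while (e--) r = gf_mul(r, a);
    return r;
}

/* Assumed public points: beta, beta^2, beta^4, beta^8 and 1, with beta = 0x03^17. */
void make_points(uint8_t alpha[N]) {
    alpha[0] = gf_pow(0x03, 17);
    for (int i = 1; i < 4; i++) alpha[i] = gf_mul(alpha[i - 1], alpha[i - 1]);
    alpha[4] = 0x01;
}

/* Share s with the degree-2 polynomial s + a1*x + a2*x^2. */
void share_deg2(uint8_t s, uint8_t a1, uint8_t a2, const uint8_t alpha[N], uint8_t f[N]) {
    for (int i = 0; i < N; i++) {
        uint8_t x2 = gf_mul(alpha[i], alpha[i]);
        f[i] = (uint8_t)(s ^ gf_mul(a1, alpha[i]) ^ gf_mul(a2, x2));
    }
}

/* Square every share and store it at the index j with alpha[j] == alpha[i]^2.
 * The search only involves public points, never secret data. */
void secure_square(const uint8_t alpha[N], const uint8_t f[N], uint8_t h[N]) {
    for (int i = 0; i < N; i++) {
        uint8_t target = gf_mul(alpha[i], alpha[i]);
        for (int j = 0; j < N; j++) {
            if (alpha[j] == target) h[j] = gf_mul(f[i], f[i]);
        }
    }
}

/* Lagrange interpolation at x = 0 over all N points. */
uint8_t reconstruct(const uint8_t alpha[N], const uint8_t sh[N]) {
    uint8_t s = 0;
    for (int i = 0; i < N; i++) {
        uint8_t num = 1, den = 1;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            num = gf_mul(num, alpha[j]);
            den = gf_mul(den, (uint8_t)(alpha[j] ^ alpha[i]));
        }
        s ^= gf_mul(sh[i], gf_mul(num, gf_pow(den, 254)));
    }
    return s;
}
```

Because squaring is additive in characteristic 2, f(α_i)^2 equals the polynomial with squared coefficients evaluated at α_i^2, so the reordered shares stay on a polynomial of the same degree and no fresh randomness is consumed.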
Notice that there is a performance cost that has to be paid in order to break every secret
variable into shares and do computations on them. An embedded device running a masking scheme
should use the appropriate trade-off between the level of resistance against side-channel attacks
and the desired performance. Based on these considerations and the effort that the fundamental
operations represent, [RP12] created an AES implementation resistant to side-channel attacks.
Also recently, [GR16] proposed a fine-tuned AES and analyzed its performance. The corresponding
details of the implementation of these and other operations are presented in section 3.1, their
performance is described in section 3.2 and the SCA of the secure multiplication is shown in
section 4.3.
2.4 Side-Channel Analysis
As electronic devices become a more common part of our lives and handle confidential and private
data, the industry must embed high standards of security into them. The side-channel research
community has been developing more sophisticated attacks, and computing tools are becoming
faster and more efficient. Storage is cheaper and more available than it was a few years ago, so
the analysis of side-channel data is becoming more accessible. Nowadays, the budget required to
perform HO-SCA is within reach of low-profile adversaries, thus the electronics industry must
agree on a standard set of tests and methodologies to assess the level of security against
side-channel attacks on embedded devices [SM15, p.1]. This section describes the t test, one of
the most common analyses to detect whether a cryptographic implementation is leaking
potentially valuable data.
Proposed by Gilbert et al. [GGJR+11], and later used in [BGN+14] and [LMW14], the t test
reveals whether there is any side-channel leakage in a device-under-test (DUT), i.e. a device
running a cryptographic algorithm. Note that the test itself is not an attack on the DUT: it
cannot recover the secret keys, but it helps to determine whether more specific attacks could
exploit the leakage and retrieve sensitive data. The test is based on the concept that the
implementation is potentially free of leakage if two sets of side-channel measurements, e.g.
power traces taken under two different situations, are indistinguishable from one another. In
other words, there should be no correlation between the two (usually large) sets of data and the
conditions under which the sets were captured.
The t test has a statistical foundation; as [SM15, p.2] point out, "a fundamental question in
many different scientific fields is whether two sets of data are significantly different from
each other". The test quantifies how confidently the means of two sets of data can be
distinguished. The measurements can be collected in a variety of ways, but to get relevant
results and to prevent false positives some care is needed, for instance in the way the
measurements are catalogued and the order in which they are extracted from the embedded device.
The analysis in this thesis uses the non-specific t test. It consists of taking all the samples
in random order according to a random pattern that must be logged so that the samples can later
be processed accordingly. The DUT processes a random input or a fixed input based on the value
of the random pattern: for example, if a selected bit of the pattern is 1 the input is random,
otherwise it is fixed, and so on for the rest of the bits in the pattern. This prevents
false-positive detections since both sets are captured while the device is exercised under the
same thermal and electrical conditions; otherwise differences in these parameters could bias the
result of the test.
After collecting and categorizing the traces according to the logged pattern, the means
($\mu_f$, $\mu_r$) and the variances ($\sigma_f^2$, $\sigma_r^2$) of the two sets are calculated.
Finally, Welch's t test is computed as in Equation 4, where $n_f$ and $n_r$ denote the cardinality
of the fixed and random sets respectively.
$$t = \frac{\mu_f - \mu_r}{\sqrt{(\sigma_f^2 / n_f) + (\sigma_r^2 / n_r)}} \tag{4}$$
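Equation 4 can be sketched for a single point in time as follows. The sketch makes two assumptions that the text does not state: the variances are the unbiased sample variances (divisor n − 1), and a small Newton iteration stands in for the library square root so that the example has no dependency beyond the C standard headers. All names are hypothetical.

```c
#include <stddef.h>

/* Square root by Newton iteration; assumed helper, used instead of sqrt(). */
double newton_sqrt(double x) {
    if (x <= 0.0) return 0.0;
    double r = (x > 1.0) ? x : 1.0;      /* start above the root */
    for (int i = 0; i < 60; i++) r = 0.5 * (r + x / r);
    return r;
}

double mean(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s / (double)n;
}

/* Unbiased sample variance around a given mean (assumption: divisor n - 1). */
double variance(const double *x, size_t n, double m) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += (x[i] - m) * (x[i] - m);
    return s / (double)(n - 1);
}

/* Welch's t statistic for one sample point of the fixed and random sets. */
double welch_t(const double *fixed, size_t nf, const double *rnd, size_t nr) {
    double mf = mean(fixed, nf), mr = mean(rnd, nr);
    double vf = variance(fixed, nf, mf), vr = variance(rnd, nr, mr);
    return (mf - mr) / newton_sqrt(vf / (double)nf + vr / (double)nr);
}
```

For full traces the function would be applied once per sample index, producing one t value per point in time.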
Notice that the cost of the calculation depends on the number of traces and their length. Also,
the number of random traces and the number of fixed traces differ slightly since their
occurrence is selected randomly. There is a t value for every point in time, thus this test is
considered the univariate or first-order t test. An (n,2)-masked implementation should not
reveal leakage under this test, for an appropriate selection of n. Furthermore, since the test
analyses every point in time individually, it assumes that all the shares are processed in
parallel for its result to match the behavior of the algorithm. This is usually the case for
hardware masking implementations. For software masking implementations it is therefore
recommended [SM15, p.12] to also perform a multivariate t test, which combines different points
in time to calculate the t value. The points can be selected as the cycles in which the
different shares are processed, and they can be located exactly since the developer has
white-box visibility of the design. Going back to the univariate case, the order of the t test
can be raised to second order by computing the mean-free squared traces $Y = (X - \mu)^2$, where
$X$ represents the originally acquired measurements, $\mu$ their mean and $Y$ the new set of
traces, so that Equation 4 can be applied to those new values. The same procedure can be
extended to higher orders, thus HO-SCA can help determine the resistance against higher-order
leakage.
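The second-order preprocessing Y = (X − µ)² can be sketched for one sample point as below. The function name is hypothetical, and the sketch assumes that µ is the mean of the same set of measurements that is being transformed.

```c
#include <stddef.h>

/* Replace each measurement x[i] by its mean-free square (x[i] - mu)^2. */
void center_and_square(const double *x, size_t n, double *y) {
    double mu = 0.0;
    for (size_t i = 0; i < n; i++) mu += x[i];
    mu /= (double)n;
    for (size_t i = 0; i < n; i++) {
        double c = x[i] - mu;   /* mean-free value */
        y[i] = c * c;           /* second-order preprocessing */
    }
}
```

The transformed values can then be fed to the same first-order statistic of Equation 4.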
As briefly described in this section, the assessment methodology depends on how and when the
data is collected; the data can also be combined in different ways to yield other levels of
observation. Special consideration must be taken when processing the shares: thousands of
traces must be accumulated to get acceptable results, thus computing resources like processor
and memory are directly impacted by how the test is carried out. Especially for software-based
masking schemes, the execution time grows with the order of the scheme, so the traces become
considerably longer and occupy large storage. Finally, note that the t test is not a
penetration test but a detection tool that allows the designer to easily and practically find
out the level of vulnerability or resistance against multiple levels of attacks.
3 Masking Implementation
This section describes the general characteristics of the masked implementation of the AES
encryption algorithm and points out the considerations necessary for its proper design and
resistance against HO-SCA.
The AES encryption implementation presented in this work is developed using polynomial
masking and secure multiparty computation. As mentioned in section 2.2, the AES-128 encryption
algorithm consists of 10 rounds of operations on a mixture of plaintext and key material known
as the AES state. The linearity of MixColumns, ShiftRows and AddRoundKey allows a
straightforward translation to a masked version; however, the inherent non-linearity of the
SubBytes function represents a more complicated challenge in terms of implementation, execution
time and risk of leakage. The linear operations are simply applied to all the shares of the
secret, so their execution time grows linearly with the number of participants. The SubBytes
function, on the other hand, requires squaring and multiplication operations that raise the
order of the polynomials employed for sharing the intermediate state. To assure the resistance
of the masking scheme it is fundamental to count on a true random number generator (RNG): the
sharing of the secrets and the operations on the shares, like the secure multiplication, depend
on the entropy of a reliable RNG.
3.1 Specifications
This work presents an 8-bit oriented implementation written in bare C. The program supports
masking schemes for different selections of n and d to measure the performance at multiple
levels. There are also many possibilities for performance improvements through time-memory
trade-offs, especially in the GF(2^8) arithmetic operations. The code was developed on an
ARM-based micro-controller (µC), although it can be easily ported to other platforms with slight
changes. The platform used in this work was selected for development simplicity and flexibility,
although there are important challenges related to hardware constraints that are discussed in
the following section.
The µC is the STM32L053R8T6 by STMicroelectronics. It features the ultra-low-power ARM
Cortex-M0+ core, a 32-bit processor capable of running at up to 32 MHz. The µC is equipped
with multiple peripherals that simplify the connectivity of the development board: it supports
serial communication over a USART (Universal Synchronous/Asynchronous Receiver/Transmitter)
port and can be connected to the USB port of a computer to establish virtual serial
communication. The development board comes with the ST-Link, another micro-controller, capable
of flashing and debugging the STM32 µC as well as handling USB serial communication. The
STM32L053R8T6 integrates a hardware true RNG capable of generating a new 32-bit random number
every 40 clock cycles. The RNG has to be clocked at 48 MHz to enable the generation, so it
requires an internal phase-locked loop (PLL) to derive 48 MHz from the operating core clock
frequency. For this implementation the selected core clock frequency is 4 MHz, which comes from
a high-speed external oscillator; thus it has to be multiplied by 24 and divided by 2 in order
to match the RNG clock frequency.
By nature, cryptographic algorithms heavily depend on random numbers to produce acceptable
results; otherwise the protected data could be easily compromised by a meticulous adversary.
Furthermore, masking schemes especially rely on entropy to obscure the sensitive data processed
by cryptographic algorithm implementations. For example, if the generator were pseudo-random,
the adversary would be able to find out, after gathering enough information, the secret
coefficients of the polynomials that are used to break the sensitive variables into shares.

uint8_t getnonzerorn() {
    uint8_t out = 0;
    uint8_t nonzero = 0, i, outSet = 0;
    uint8_t bytes[4];
    uint32_t rn;

    rn = RNG->DR;  // Read data register from RNG
    bytes[0] = (uint8_t) rn;
    bytes[1] = (uint8_t)(rn >> 8);
    bytes[2] = (uint8_t)(rn >> 16);
    bytes[3] = (uint8_t)(rn >> 24);

    for (i = 0; i < 4; i++) {
        // bytes[i] is added only while no non-zero byte has been seen yet
        out = ((~(-outSet)) & (~(-nonzero)) & bytes[i]) + out;
        // Fold bytes[i] into a single bit: 1 if it is non-zero, 0 otherwise
        nonzero = bytes[i];
        nonzero |= nonzero >> 4;
        nonzero |= nonzero >> 2;
        nonzero |= nonzero >> 1;
        nonzero &= 0x01;
        outSet |= nonzero;
    }
    return out;
}
Code 1: Get Non-zero Random Byte Function
The RNG has been validated following the German AIS-31 standard [STM, p.452]. This AES
implementation calls either the getrn() or the getnonzerorn() function every time a new random
number is required. The first one returns the least significant byte (LSB) of the 32-bit RNG
data register regardless of its value. The second function, shown in Code 1, scans the four
bytes of the data register from the LSB to the most significant byte (MSB) and returns the
first non-zero byte. Only when all four bytes of the random number are zero is the returned
byte zero. The two functions run in constant time. A new random number can only be generated
after 40 cycles of the 48 MHz clock, i.e. about 0.83 µs or fewer than four core cycles at
4 MHz, so thanks to the function call overhead there is enough time to get fresh random
numbers even when the functions are called back-to-back.
The reason to look for a non-zero byte in the 4-byte random number is to avoid reducing the
degree of the sharing polynomial when generating the coefficient associated with the
highest-degree term. In an (n, 1)-masking scheme, a zero coefficient would make the shares equal
to the secret value, turning the scheme into useless execution overhead. By calling the second
function, the probability of getting a zero is expected to be about one in four billion (2^-32).
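The selection logic can be exercised on a host machine with a variant that takes the 32-bit word as a parameter instead of reading the RNG data register. This is only a sketch for illustration: the function name first_nonzero_byte is hypothetical, and the mask handling is written slightly differently from Code 1 while following the same branch-free idea.

```c
#include <stdint.h>

/* Return the first non-zero byte of rn, scanning from the LSB to the MSB.
 * The result is zero only when all four bytes are zero. */
uint8_t first_nonzero_byte(uint32_t rn) {
    uint8_t out = 0;
    uint8_t found = 0;                       /* 1 once a non-zero byte was seen */
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(rn >> (8 * i));
        uint8_t nz = b;                      /* fold b into a single bit */
        nz |= nz >> 4;
        nz |= nz >> 2;
        nz |= nz >> 1;
        nz &= 0x01;
        /* mask is 0xff while nothing has been found, 0x00 afterwards */
        uint8_t mask = (uint8_t)~(uint8_t)(-found);
        out |= mask & b;
        found |= nz;
    }
    return out;
}
```

The loop always runs four iterations and contains no data-dependent branch, which is the property the text relies on.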
Since the AES encryption algorithm is built on GF(2^8) finite field arithmetic, software
implementations are simplified: every element of the field can be represented with a single
byte, and the result of an operation lies within the same set. Additions and subtractions in
this field reduce trivially to the XOR operation. Field multiplication is a little more
complex, but there are many different ways in which it can be coded; it is a matter of finding
the appropriate balance between execution time and memory usage available in the embedded
device running the masked encryption. In this research three different methods are used and
compared in performance, and section 4.3 shows the leakage assessment of two of them. It is
also possible to implement the field multiplication by applying ab = 2^4·a_h·b + a_l·b, where a
and b are the operands and a_h, a_l represent the two nibbles of a. The products a_h·b and
a_l·b can be precomputed to generate two 4 kB LUTs, but this is impractical given the memory
constraints of the embedded device; however, the code for this and the rest of the functions is
available on GitHub. Squaring in the GF(2^8) field can also be implemented in multiple ways, a
couple of which are described here. Notice that all of the functions execute in constant time:
there are no branches to control the execution flow, and the for loops run from beginning to
end without interruption or change in the flow. Also, note that the irreducible polynomial used
in all cases is x^8 + x^4 + x^3 + x + 1 (0x11b), whose low byte 0x1b is the reduction constant
that appears in the code.

 1  uint8_t gfmult(uint8_t h, uint8_t v) {
 2      uint8_t z = 0;
 3      uint8_t i = 0;
 4      uint8_t mask = 0;
 5
 6      for (i = 0; i < 8; i++) {
 7
 8          mask = -((h >> i) & 1);   // Generate a mask of 0xff or 0x00
 9                                    // depending on every bit of h.
10          z = z ^ (mask & v);       // A 0 or the other [shifted][reduced]
11                                    // operand is accumulated.
12
13          mask = -((v >> 7) & 1);   // Generate a mask based on the degree
14                                    // of the other operand.
15          v <<= 1;                  // Shift v
16          v ^= mask & 0x1b;         // If the degree of the other operand
17                                    // is more than 7, reduce it with 0x1b.
18      }
19      return z;
20  }
Code 2: Instructions-only GF(2^8) Multiplication
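The nibble-split multiplication ab = 2^4·a_h·b + a_l·b mentioned in the text is not listed in this chapter. A minimal sketch of how the two 16 × 256-byte tables could be generated and used is given below; the table and function names are hypothetical and this is not the code published by the author.

```c
#include <stdint.h>

/* Reference GF(2^8) multiplication modulo x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t z = 0;
    for (int i = 0; i < 8; i++) {
        z ^= (uint8_t)(-(b & 1)) & a;
        a = (uint8_t)((a << 1) ^ (-(a >> 7) & 0x1b));
        b >>= 1;
    }
    return z;
}

uint8_t lut_hi[16][256];   /* lut_hi[n][b] = (n * x^4) * b, 4 kB */
uint8_t lut_lo[16][256];   /* lut_lo[n][b] = n * b,         4 kB */

/* Precompute both tables; on a device they would be stored as constants. */
void build_nibble_luts(void) {
    for (int n = 0; n < 16; n++) {
        for (int b = 0; b < 256; b++) {
            lut_hi[n][b] = gf_mul((uint8_t)(n << 4), (uint8_t)b);
            lut_lo[n][b] = gf_mul((uint8_t)n, (uint8_t)b);
        }
    }
}

/* ab = (a_h * x^4) * b + a_l * b: two look-ups and one XOR. */
uint8_t gfmult_nibble(uint8_t a, uint8_t b) {
    return (uint8_t)(lut_hi[a >> 4][b] ^ lut_lo[a & 0x0f][b]);
}
```

The two tables together take 8 kB, which illustrates why the text considers this variant impractical for a memory-constrained device despite its short execution path.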
The first instance of the field multiplication, described in Code 2, consists of instructions
only, i.e. it has the minimum memory consumption of the three implementations but it is also
the slowest. Notice that a logic mask is used to control the flow of the operation: it becomes
either all zeros or all ones depending on the bit being checked, and that mask later determines
whether or not the other logical instructions take effect. This mechanism avoids conditional
statements that could leak information about the operands.
The second instance of the field multiplication (Code 3) saves some execution cycles by using a
256-byte Look-Up Table (LUT): it turns lines 13, 15 and 16 of the instructions-only
multiplication in Code 2 into a single memory access. Those lines can be replaced because they
depend only on the operand v; in other words, the LUT was generated using those three lines of
code for all possible values of v. Notice that the table look-up is assumed to take constant
time regardless of the memory index, which is the case for the Cortex-M0+ core. There are other
memory architectures where the access time varies depending on the data location. For example,
a cache memory that stores recently used data would cause response time variations; an
adversary could measure the access latency and eventually reconstruct the shared secret.
However, this implementation relies on the fact that memory transactions complete in constant
time and that they point to addresses based on random shares. Theoretically, there should be a
power variation relative to the memory address being accessed, but the SCA in this work does
not reveal evidence of it.

uint8_t gfmult(uint8_t h, uint8_t v) {
    uint8_t z = 0;
    uint8_t i;
    uint8_t mask;

    for (i = 0; i < 8; i++) {
        mask = -((h >> i) & 1);   // 0xff if bit i of h is set, 0x00 otherwise
        z = z ^ (mask & v);

        v = secondOp[v];          // LUT: v shifted left and reduced
    }
    return z;
}
Code 3: Mixed GF(2^8) Multiplication

uint8_t gfmult(uint8_t h, uint8_t v) {
    uint16_t tmp;
    uint8_t out, nonzero;

    // Check if h is zero
    nonzero = h;
    nonzero |= nonzero >> 4;
    nonzero |= nonzero >> 2;
    nonzero |= nonzero >> 1;
    nonzero &= 0x01;

    out = -nonzero;

    // Check if v is zero
    nonzero = v;
    nonzero |= nonzero >> 4;
    nonzero |= nonzero >> 2;
    nonzero |= nonzero >> 1;
    nonzero &= 0x01;

    out = -nonzero & out;
    tmp = logt[h] + logt[v];

    // tmp mod (2^n)-1
    tmp = tmp + (tmp >> 8);
    tmp = tmp & 0xff;

    return (out & expt[(uint8_t)tmp]);
}
Code 4: Exp-Log GF(2^8) Multiplication
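The 256-byte table used by the mixed multiplication of Code 3 is not listed in the text. A minimal sketch of how it could be generated from the shift-and-reduce statements of Code 2 follows; the generator function build_secondOp and the name gfmult_mixed are hypothetical, while secondOp is the table name used in Code 3.

```c
#include <stdint.h>

uint8_t secondOp[256];   /* secondOp[v] = v * x, reduced with 0x1b */

/* Fill the table by applying the mask, shift and reduce steps to every v. */
void build_secondOp(void) {
    for (int v = 0; v < 256; v++) {
        uint8_t mask = (uint8_t)(-((v >> 7) & 1));   /* 0xff if the top bit is set */
        secondOp[v] = (uint8_t)((v << 1) ^ (mask & 0x1b));
    }
}

/* Same structure as Code 3, with a distinct name for this sketch. */
uint8_t gfmult_mixed(uint8_t h, uint8_t v) {
    uint8_t z = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t mask = (uint8_t)(-((h >> i) & 1));
        z ^= mask & v;
        v = secondOp[v];
    }
    return z;
}
```

On the device the table would normally be stored as a constant array instead of being built at start-up; the products {57}·{83} = {c1} and {57}·{13} = {fe} from the AES specification can serve as quick checks.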
Finally, by using just two 256-byte LUTs, the Exp-Log field multiplication [GR16, pp.5-6]
offers a good time-memory trade-off. This method is summarized by
$ab = g^{\log_g(a) + \log_g(b)}$; thus, by precomputing the tables $\log_g(x)$ and $g^x$ for all
possible x, the GF(2^8) multiplication can be computed quickly. The two tables contain the same
values but arranged in a different way, since the two operations are inverses of each other.
However, extra code is needed to deal with the case when either of the two operands is zero;
also, the result of the arithmetic addition has to be reduced because it could be up to
2^9 − 2. Notice that logical and arithmetic instructions are used to avoid conditional
statements.
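The two tables can be generated as sketched below. The generator g = 0x03 is an assumption, since the thesis does not state which generator it uses; expt and logt are the table names of Code 4, while build_exp_log, is_nonzero and gfmult_explog are hypothetical names. The entry expt[255] is set to 1 because the folded exponent can take the value 255.

```c
#include <stdint.h>

uint8_t expt[256];   /* expt[i] = g^i, with expt[255] = g^255 = 1 */
uint8_t logt[256];   /* logt[x] = log_g(x) for x != 0; logt[0] is a placeholder */

/* Reference GF(2^8) multiplication, used here only to build the tables. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t z = 0;
    for (int i = 0; i < 8; i++) {
        z ^= (uint8_t)(-(b & 1)) & a;
        a = (uint8_t)((a << 1) ^ (-(a >> 7) & 0x1b));
        b >>= 1;
    }
    return z;
}

void build_exp_log(void) {
    uint8_t x = 1;
    for (int i = 0; i < 255; i++) {
        expt[i] = x;
        logt[x] = (uint8_t)i;
        x = gf_mul(x, 0x03);       /* assumed generator g = 0x03 */
    }
    expt[255] = 1;                 /* exponent 255 wraps around to g^0 */
    logt[0] = 0;                   /* never used: the result is masked to zero */
}

/* 1 if x is non-zero, 0 otherwise, without branching. */
uint8_t is_nonzero(uint8_t x) {
    x |= x >> 4;
    x |= x >> 2;
    x |= x >> 1;
    return x & 0x01;
}

uint8_t gfmult_explog(uint8_t h, uint8_t v) {
    uint8_t keep = (uint8_t)(-(is_nonzero(h) & is_nonzero(v)));  /* 0xff or 0x00 */
    uint16_t tmp = (uint16_t)(logt[h] + logt[v]);
    tmp = (uint16_t)(tmp + (tmp >> 8));      /* fold the carry: reduce mod 255 */
    return (uint8_t)(keep & expt[tmp & 0xff]);
}
```

Comparing the result against a bitwise multiplication for all 65,536 operand pairs is a cheap way to validate generated tables before flashing them.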
A field squaring function can help boost the performance of the encryption algorithm; it is a
special case of the multiplication where the two operands are the same. The easiest and fastest
way of implementing it is to precompute a LUT for all 256 possible input values, so that every
squaring is replaced with a table look-up. It is also possible to use only instructions to
compute the square of a number; Code 5 shows the necessary operations.

 1  uint8_t gfsqr(uint8_t a) {
 2      return (a & 0x01) ^
 3             ((a & 0x02) << 1) ^
 4             ((a & 0x04) << 2) ^
 5             ((a & 0x08) << 3) ^
 6             (-((a & 0x10) >> 4) & 0x1b) ^
 7             (-((a & 0x20) >> 5) & 0x6c) ^
 8             (-((a & 0x40) >> 6) & 0xab) ^
 9             (-((a & 0x80) >> 7) & 0x9a);
10  }
Code 5: GF(2^8) Squaring. Lines 2, 3, 4 and 5 square every bit of the lower nibble. Lines 6, 7,
8 and 9 square every bit of the upper nibble and reduce the partial result.
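The LUT variant of the squaring is not listed in the text. A minimal sketch of how the table could be filled from the instruction-only formula of Code 5 is shown below; sqr_lut, build_sqr_lut and the reference gf_mul are hypothetical names introduced for this illustration.

```c
#include <stdint.h>

/* Instruction-only squaring, same operations as Code 5. */
uint8_t gfsqr(uint8_t a) {
    return (uint8_t)((a & 0x01) ^
                     ((a & 0x02) << 1) ^
                     ((a & 0x04) << 2) ^
                     ((a & 0x08) << 3) ^
                     (-((a & 0x10) >> 4) & 0x1b) ^   /* x^8  reduced */
                     (-((a & 0x20) >> 5) & 0x6c) ^   /* x^10 reduced */
                     (-((a & 0x40) >> 6) & 0xab) ^   /* x^12 reduced */
                     (-((a & 0x80) >> 7) & 0x9a));   /* x^14 reduced */
}

uint8_t sqr_lut[256];   /* sqr_lut[a] = a^2 in GF(2^8) */

void build_sqr_lut(void) {
    for (int a = 0; a < 256; a++) sqr_lut[a] = gfsqr((uint8_t)a);
}

/* Reference multiplication, only needed to cross-check the table. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t z = 0;
    for (int i = 0; i < 8; i++) {
        z ^= (uint8_t)(-(b & 1)) & a;
        a = (uint8_t)((a << 1) ^ (-(a >> 7) & 0x1b));
        b >>= 1;
    }
    return z;
}
```

The table costs 256 bytes and replaces the eight masked terms with a single memory access.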
These elementary operations are the building blocks of the masking scheme that protects AES.
However, another intermediate layer of operations is necessary to implement the non-linear AES
functions: multiplication and squaring of shares. They are commonly known in the literature as
secure multiplication/squaring or SMC multiplication/squaring. These operations work on the
shared secrets and are therefore more complicated and demand more execution time. Section 2.3.1
briefly introduces the sharing mechanism that breaks a sensitive variable into multiple parts.
Section 2.3.2 describes the high-level algorithm and the field operations required to perform a
secure multiplication. Notice that the performance of the secure multiplication depends on the
number of shares and the degree of the polynomial selected to mask the secret variables,
whereas that of the secure squaring depends only on the number of shares due to the property
described in section 2.3.2.
The secure multiplication accepts two sets of shares F and G as input parameters and outputs
another set H with the result. The product of the two secrets, from which F and G were derived,
can be revealed by multiplying the first row of the inverse Vandermonde matrix with the
resulting vector H. In Code 6, N and D correspond to the number of players and the order of the
masking scheme (n, d). Notice that there are conditional compilation directives based on D and
ε = n − 2d − 1 (EPS) to save execution time and memory in specific cases. The gfadd() operation
is an addition in the GF(2^8) field, i.e. an XOR operation, which in C can be coded as a macro.
Take into consideration that αi for i = 0, . . . , n − 1 and the inverse Vandermonde matrix are
pre-computed and remain constant for the whole execution of the encryption; see the appendix in
section 6.
Once the properties described in section 2.3.2 are satisfied, the secure squaring function
reduces to field squaring operations and a reordering of shares. This implies a significant
improvement in terms of code size and execution time. Consider the example in Code 7, specific
to the (5, 2)-sharing scheme, which illustrates the simplicity of the operation; F[N] are the
input shares and H[N] the shares of the squared secret.
void Multiplication(uint8_t F[N], uint8_t G[N], uint8_t H[N]) {
    uint8_t i, j, a, sum, sum1;
    uint8_t tmp1, tmp2;
    uint8_t Q[D+1];
    uint8_t QFunctions[N][N];

#if D>1
    uint8_t t;
#endif

    // Steps 1 and 2: player i multiplies its two shares locally and re-shares
    // the product with a degree-D polynomial with random coefficients.
    // QFunctions[i][a] is that polynomial evaluated at alpha[a].
    for (i = 0; i < N; i++) {
        Q[0] = gfmult(F[i], G[i]);
        for (j = 1; j < (D+1); j++) {
            Q[j] = getrn();
        }
        for (a = 0; a < N; a++) {
            sum1 = Q[0];
#if D==1
            sum1 = gfadd(sum1, gfmult(Q[1], alpha[a]));
#else
            for (t = 1; t < (D+1); t++) {
                sum1 = gfadd(sum1, gfmult(Q[t], gfexp(alpha[a], t)));
            }
#endif
            QFunctions[i][a] = sum1;
        }
    }

    // Step 3: share j of the result combines column j of QFunctions with
    // the first row of the inverse Vandermonde matrix.
    for (j = 0; j < N; j++) {
        sum = 0;
        for (i = 0; i < N; i++) {
            tmp1 = gfmult(InvVand[0][i], QFunctions[i][j]);
#if EPS==0
            if (j < D) {
                tmp2 = gfmult(InvVand[N-j-1][i], gfadd(F[i], G[i]));
                sum = gfadd(sum, gfadd(tmp1, tmp2));
            } else {
                sum = gfadd(sum, tmp1);
            }
#else
            if (j < EPS) {
                tmp2 = gfmult(InvVand[N-j-1][i], gfmult(F[i], G[i]));
                sum = gfadd(sum, gfadd(tmp1, tmp2));
            } else if (j >= EPS && j < (EPS+D)) {
                tmp2 = gfmult(InvVand[N-j-1][i], gfadd(F[i], G[i]));
                sum = gfadd(sum, gfadd(tmp1, tmp2));
            } else {
                sum = gfadd(sum, tmp1);
            }
#endif
        }
        H[j] = sum;
    }
}
Code 6: Secure Multiplication.

void qPow2(uint8_t F[N], uint8_t H[N]) {
    H[0] = gfsqr(F[2]);
    H[1] = gfsqr(F[3]);
    H[2] = gfsqr(F[1]);
    H[3] = gfsqr(F[0]);
    H[4] = gfsqr(F[4]);
}
Code 7: Secure Squaring in the (5, 2)-sharing scheme.

Masking schemes represent a heavy workload for embedded devices, and there are many approaches,
from both the theory of threshold schemes and architectural shortcuts, that can be used to
speed up their execution. The C implementation does not fully leverage the intrinsic
micro-architectural benefits of the ARM architecture; most of it is left to the compiler. On
the other hand, the implementation can be easily ported to different platforms and processor
architectures due to the omnipresence of embedded C in the industry. The following section
focuses on the performance of the ARM Cortex-M0+ core running the masking scheme.

Table 1: Execution time for AES-128 encryption in µs with the CPU running at 4 MHz.

                 (3,1)                                        (5,2)
GF(2^8) mult.    Instr. Only    Mixed          Exp-Log        Instr. Only    Mixed          Exp-Log
GF(2^8) sqr.     Instr. Only    LUT            LUT            Instr. Only    LUT            LUT
Encryption       1,453,107.0    1,119,818.0    524,896.0      8,047,896.00   6,487,622.00   2,903,750.0
Encryption¹      1,597,706.0    1,239,214.5    579,480.0      8,621,960.50   6,977,620.00   2,956,370.0

¹ Encryption based on error preserving multiplication, which requires more operations; more
details in [SES17].
3.2 Performance
The performance measurements are described for two selections of n and d that satisfy
n = 2d + 1. For practical purposes the (3,1)-sharing scheme is the lowest that meets this
condition, and with the 4 MHz frequency of the core it represents a realistic case to measure
performance. Since the order of the scheme is one, it can be broken with a second-order
side-channel analysis, but it is still worth understanding the practical implications on an
embedded device. The (5,2)-sharing scheme is also analyzed; it represents a more realistic case
that could potentially be employed in embedded applications. The way the code is written, it is
possible to select other modes like (4,1), (5,1) or (6,2); however, these selections would not
bring higher resistance to SCA compared to (3,1) and (5,2).
The execution time of the two schemes is shown in Table 1; the data is given in µs and the
system clock employed for the measurements was 4 MHz. Only AES-128 encryption has been
implemented; however, extending it to AES-192 and AES-256 encryption and to decryption should
be a straightforward process. Note that different versions of the GF(2^8) multiplication and
squaring were used. Even though only a few combinations were selected to show the performance
comparison, the user can select any combination of the field operations to meet the embedded
device's limitations. As a reference, consider the OpenSSL 1.0.1g AES encryption released in
April 2014. It is a 32-bit C implementation, compiled here for the ARM Cortex-M0+ to run at
4 MHz. The execution time for this unmasked encryption is 481.5 µs, and Table 5 shows its
corresponding code and data size. Notice that, even though full unrolling is disabled, the code
and data size is significantly larger; in return, its execution is about 1090 times faster than
the fastest masked encryption in Table 1.
The breakdown of SMC operations per round of AES is shown in Table 2; it describes how
many SMC operations are needed for every function of the AES encryption algorithm.
The SubBytes function requires SMC operations, i.e. secure multiplications, secure squarings,
additions and affine transformations, due to its non-linear nature. In turn, these SMC
operations are built on top of GF(2^8) operations. Table 3 describes the necessary field
operations for every SMC operation.
Table 2: Number of SMC operations in one round of AES, where SubBytes is split into two parts:
the inversion Inv(y) and the affine transformation τ(y) over GF(2^8).

                         Inv(y)    τ(y)      MixColumns    AddRoundKey    ShiftRows
SMC Multiplication       16 × 4    -         -             -              -
Efficient Squaring       16 × 3    16 × 7    -             -              -
SMC Addition             -         16 × 7    12            16 × 1         -
Affine Transformation    -         16 × 9    16            -              -

Table 3: GF(2^8) field operations required for SMC operations, where ε = n − 2d − 1.

SMC operation    Multiplication               Squaring    Addition    Affine Transform
Field Mul.       n^2(d + 1) + n(ε + d + 1)    n           -           n
Field Add        n^2(d + 1) + n(ε + d)        -           n           n
Randomness       nd                           -           -           -

Finally, Table 4 shows the execution time required for the field operations and the SMC
multiplication; it also includes the random number generation functions. Notice that, based on
Table 3, these building-block operations are the key elements to boost the performance of the
masking scheme, which is the reason to look for faster methods to do field arithmetic.
The performance results show that SMC multiplication is the main bottleneck of the encryption
algorithm, and in turn it relies on the field multiplication. The execution time can also be
reduced by running at higher frequencies (this board is capable of running at 32 MHz); however,
power consumption increases with the CPU frequency. Table 5 shows the code and RW-data sizes
for the selected combinations of field operations; other combinations are also possible and
produce different code and data sizes.
Table 4: Execution time for GF(2^8) and SMC operations in µs with the CPU running at 4 MHz.

                  (3,1)                                 (5,2)
                  Instr. Only    Mixed       Exp-Log    Instr. Only    Mixed       Exp-Log
GF(2^8) mult.     54.50          44.50       17.50      54.50          44.50       17.50
GF(2^8) sqr.      13.75          1.50        1.50       13.75          1.50        1.50
getrn()           3.75           3.75        3.75       3.75           3.75        3.75
getnonzerorn()    41.75          41.75       41.75      41.75          41.75       41.75
SMC add.          15.25          15.25       15.25      21.25          21.25       21.25
SMC mult.         1,246.75       1,026.25    475.00     9,130.50       7,503.00    3,434.50
SMC mult.¹        1,427.50       1,175.50    545.50     9,847.75       8,115.50    3,784.50

¹ Error preserving multiplication, which requires more operations; more details in [SES17].

Table 5: AES-128 encryption code and RW-data size depending on the GF(2^8) operations variations.

                 unmasked    (3,1)                                (5,2)
GF(2^8) mult.    -           Instr. Only    Mixed      Exp-Log    Instr. Only    Mixed      Exp-Log
GF(2^8) sqr.     -           Instr. Only    LUT        LUT        Instr. Only    LUT        LUT
Code Size        7.23 kB     3.38 kB        3.26 kB    3.29 kB    3.58 kB        3.44 kB    3.48 kB
RW-data          12 B        12 B           524 B      780 B      32 B           544 B      800 B
RO-data          8.7 kB      224 B          224 B      224 B      224 B          224 B      224 B

3.3 Challenges
The Nucleo-L053R8 board has features that allow the rapid implementation of the masking scheme.
The omnipresence of the ARM architecture and the development tools are valuable advantages to
bring the concept to life; however, important considerations must be taken, especially when
designing performance-hungry algorithms for a resource-constrained device.
The µC has an embedded RNG that is fundamental for the secure operation of the scheme. Its
downside is that it has to be clocked at the particular frequency of 48 MHz. The user can
configure the clock tree to route the dedicated High-Speed Internal oscillator to it; however,
it is not a trivial task.
It is also possible to provide the system clock from various sources, either internal or
external, but there are important considerations regarding side-channel analysis that are
described in section 4.2. For this implementation, the clock comes from an external crystal and
feeds a PLL module that generates the desired 48 MHz for the RNG. The external oscillator can
be chosen with different frequency values because the PLL can multiply and divide that
frequency to generate another one.
Since serial communication is fundamental for the development and analysis of the scheme,
some libraries were reused and other functions were created for this work. The board comes with
an ST-Link chip that provides USB communication between the target µC and the PC; it is used
to flash and debug the µC. Depending on the clock source and its frequency, the USART module
and the internal clock tree have to be configured appropriately to establish serial communication.
There are a few tool chains available to develop and debug the code. This work was initially
implemented on mbed, an online compiler provided by ARM and available to all registered users;
registration is free of charge and there is no limit on the amount of code that can be compiled.
It also has an integrated version control tool to keep track of progress and allow other users to
collaborate in the development life cycle. This tool chain enables a rapid development flow,
since most of the basic configurations are given by default and many libraries are available.
Unfortunately, it does not offer a debugger and an internet connection is needed to use the compiler.
On the other hand, the KEIL development kit is more complete: it gives the user more control over
the configuration of the device, allows inline assembly and includes a debugger. The software
can be downloaded for free, however registration is required and it only allows flashing up to
32 kB, which can be a restriction if large LUTs are used alongside the code.
From the implementation perspective, these are the major challenges. The next section describes
other challenges that arose during the leakage assessment process.
4 Leakage Assessment
Theoretically, an (n, d)-sharing masking scheme is resistant to d-th order side-channel analysis.
This work focuses on two masking levels, (3,1) and (5,2), and for both of them it shows the t test
results of the power traces taken during the secure multiplication. This operation is one of the key
components of the non-linear function of AES. It is also the most complex operation in the whole
encryption, so it represents an attractive entry point for an adversary. Even though there are
multiple techniques, like Simple Power Analysis (SPA) and Differential Power Analysis (DPA), that
can help an adversary break a cryptographic implementation, the t test is used here because it
reveals potential sources of leakage and the points in time where they occur.
4.1 Setup
The setup consists of the design under test (DUT), which is the NUCLEO-L053R8 development
board by STMicroelectronics, a PC that interacts with the DUT via serial communication, a LeCroy
WavePro 725Zi oscilloscope, a LeCroy AP 033 active differential probe, a LeCroy PP007-WR regular
probe and an external power source. The differential probe is connected to the JP6 pins of
the board, and a small 47 Ω resistor is connected between the two pins to sense a small voltage
drop while still leaving enough voltage for the µC to work. The regular probe is used to synchronize
the recording of the trace with the beginning of every multiplication: one of the GPIOs outputs
a signal that is set before the operation and reset after it finishes. The oscilloscope starts
recording when the trigger signal is detected and stops after a defined period of time, long
enough to capture the whole operation. The traces are saved to a hard disk drive and then
processed sequentially in Matlab.
Figure 6: NUCLEO-L053R8 development board connected to the differential probe and to the PC
through USB virtual serial communication.
Figure 7: Lab. Setup: NUCLEO-L053R8 board (center), oscilloscope (upper left), external power source (left), PC
(right) and external hard disk (upper right).
4.2 Challenges
The computation of the t test might seem trivial, but in practice some considerations must be
handled properly to get accurate results efficiently. Side-channel analysis requires a large
number of traces to properly identify possible sources of leakage, so if each trace is large,
storing and processing them can represent a big effort. The main challenges encountered during
the leakage assessment of this masking scheme implementation are described below.
4.2.1 Trace Generation
As mentioned in section 2.4, two different sets of traces have to be generated to avoid false
positive results; the approach is fixed-vs-random. First, a script running on the PC generates a
random number and, based on its value, either a fixed secret value or a random secret value is
split into shares by polynomial masking. The corresponding shares are then fed into the secure
multiplication function. Continuous communication between the DUT and the PC is essential to
send the random execution pattern. The PC must also keep track of all the patterns and the order
in which they were generated in order to process the traces appropriately.
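The selection and sharing steps above can be sketched in C as follows. This is only an illustrative sketch under stated assumptions: the function names (gf256_mul, pick_secret, share_3_1, reconstruct_3_1) are hypothetical, the field is assumed to be GF(2^8) with the AES polynomial, the random coefficient is passed in by the caller instead of coming from the RNG, and the multiplication is a generic shift-and-add routine rather than one of the variants measured in this work. The evaluation points are the (3,1) values listed in Code 8.

```c
#include <stdint.h>

#define N_SHARES 3

/* Evaluation points for (3,1)-sharing, as listed in Code 8. */
const uint8_t alpha_3_1[N_SHARES] = {0x01, 0xbc, 0xbd};

/* Branch-free shift-and-add multiplication in GF(2^8),
 * reduced modulo the AES polynomial x^8 + x^4 + x^3 + x + 1 (assumed field). */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r ^= (uint8_t)(-(b & 1) & a);         /* add a when the low bit of b is set */
        uint8_t carry = (uint8_t)(-(a >> 7)); /* 0xff when doubling a overflows */
        a = (uint8_t)((a << 1) ^ (carry & 0x1b));
        b >>= 1;
    }
    return r;
}

/* Fixed-vs-random selection without a branch: coin bit 1 picks the fixed value. */
uint8_t pick_secret(uint8_t coin, uint8_t fixed, uint8_t random_value)
{
    uint8_t mask = (uint8_t)(-(coin & 1));
    return (uint8_t)((mask & fixed) | (~mask & random_value));
}

/* Degree-1 polynomial sharing: share_i = secret + a1 * alpha_i in GF(2^8). */
void share_3_1(uint8_t secret, uint8_t a1, uint8_t shares[N_SHARES])
{
    for (int i = 0; i < N_SHARES; i++)
        shares[i] = secret ^ gf256_mul(a1, alpha_3_1[i]);
}

/* Reconstruction with the first row of InvVand, which is {1, 1, 1} for (3,1). */
uint8_t reconstruct_3_1(const uint8_t shares[N_SHARES])
{
    return shares[0] ^ shares[1] ^ shares[2];
}
```

The XOR reconstruction works because the three evaluation points add up to zero (0x01 ^ 0xbc ^ 0xbd = 0), so the random term cancels and the secret remains.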
To eliminate all possible sources of interference, the unnecessary modules of the microcontroller
are turned off. Even the USART that handles the serial communication is switched off during
the secure operation and restored right after the multiplication completes. Another precaution
is to delay the secure operation for a few microseconds after the trigger that synchronizes the
recording.
An important requirement for the implementation is a constant execution time, otherwise it
could leak information about the internal state. From the programming perspective, this
implementation runs in constant time. However, there is an important factor that is often taken
for granted: that the clock period is always constant. Execution time depends not only on the
instructions and the flow control mechanisms but on the clock precision as well. One of the
biggest issues found during the early stages of the analysis was that the traces were misaligned,
and the misalignment was caused by clock jitter. The oscillator that the µC was using as clock
source was an internal RC oscillator, so the precision of the cycles was very low and it produced
traces of different lengths. Fixing that problem in post-processing is possible, but it would
have required advanced techniques like Elastic Alignment [vWWB11]. A simpler solution was to
configure the internal clock tree of the µC to use an external high-speed oscillator as the clock.
A 4 MHz crystal was chosen to test the (3,1)-sharing scheme and a 16 MHz crystal for the (5,2)
case. After that workaround, all the traces were correctly aligned and ready for the analysis.
4.2.2 Trace Collection
Masking schemes involve a significant amount of redundant work, so performance is a very
important factor, especially in single-threaded software implementations where the workload can
hardly be parallelized. Another downside is the relatively low frequency at which embedded devices
run compared to FPGAs, ASICs or high-end computers. The execution time of such an implementation
is therefore long on low-power embedded devices; it depends on the degree of the masking scheme,
the number of shares employed, the frequency of the device’s clock and the performance of the
algorithm. The longer the execution time, the larger the traces. The trace size also depends on
the oscilloscope settings, like the sampling rate and the way each point is represented in a trace
file. For example, the trace file for a multiplication that lasts around 1 ms is approximately
100 KB. If the analysis requires hundreds of thousands or millions of traces, storage becomes
a significant constraint when dealing with higher-order masking schemes and analysis.
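The storage figures above can be reproduced with a short calculation, sketched here in C. The function names are hypothetical, and the one byte per sample is an assumption made for illustration; the 100 million samples per second is the sampling rate reported for the experiments in this work.

```c
#include <stdint.h>

/* Size of one trace in bytes: sample rate (Hz) x duration (µs) x bytes per sample. */
uint64_t trace_bytes(uint64_t sample_rate_hz, uint64_t duration_us,
                     uint64_t bytes_per_sample)
{
    return sample_rate_hz * duration_us / 1000000u * bytes_per_sample;
}

/* Total storage for a measurement campaign of n_traces traces. */
uint64_t campaign_bytes(uint64_t n_traces, uint64_t per_trace_bytes)
{
    return n_traces * per_trace_bytes;
}
```

Under these assumptions, a 1000 µs window gives 100,000 bytes per trace, in line with the roughly 100 KB quoted above, and 250,000 such traces already occupy 25 GB.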
4.2.3 Trace Processing
Due to the huge amount of data needed for the analysis and the length of each trace, an efficient
way of computing the t test is necessary. The test itself requires the mean and variance of two
large sets. It is infeasible to accumulate all the traces in DRAM to do those calculations;
eventually the PC would run out of memory and the Matlab script would crash. Thus it is
important to compute the t test progressively. Instead of calculating the mean and variance of the
whole sets at once, one can process a certain number of traces at a time, without exhausting the
memory, compute the local mean and the local variance, and accumulate them over all the subsets.
If these subsets are large enough, the accumulated mean and variance tend to the total mean and
variance of the whole sets, and the t test yields valid results.
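A progressive computation of this kind can be sketched in C as follows. Note that this sketch swaps in Welford's one-pass update, which is a different technique from the subset averaging described above: it folds one trace at a time into a running mean and variance and is exact for any number of traces. The names acc_t, acc_add, acc_add_trace, acc_var and welch_t2 are hypothetical and are not part of the code or Matlab scripts of this work. To avoid depending on a math library, the sketch returns the squared t value, so a bound of 4.5 on |t| becomes a bound of 20.25 on t^2.

```c
#include <stddef.h>

/* Running statistics for one sample point of one trace set (fixed or random). */
typedef struct {
    size_t n;     /* traces seen so far */
    double mean;  /* running mean */
    double m2;    /* running sum of squared deviations from the mean */
} acc_t;

/* Welford update: fold one new sample into the running mean and m2. */
void acc_add(acc_t *a, double x)
{
    a->n++;
    double delta = x - a->mean;
    a->mean += delta / (double)a->n;
    a->m2 += delta * (x - a->mean);
}

/* Fold a whole trace into an array of per-point accumulators. */
void acc_add_trace(acc_t *accs, const double *trace, size_t len)
{
    for (size_t i = 0; i < len; i++)
        acc_add(&accs[i], trace[i]);
}

/* Unbiased sample variance; 0 until at least two samples were seen. */
double acc_var(const acc_t *a)
{
    return a->n > 1 ? a->m2 / (double)(a->n - 1) : 0.0;
}

/* Squared Welch t statistic between the fixed and the random set.
 * The caller must make sure both sets hold at least two traces
 * and that the variances are not both zero. */
double welch_t2(const acc_t *f, const acc_t *r)
{
    double diff = f->mean - r->mean;
    double se2 = acc_var(f) / (double)f->n + acc_var(r) / (double)r->n;
    return diff * diff / se2;
}
```

Only one accumulator per sample point and per set has to stay in memory, so the memory use is independent of the number of traces.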
It is also possible to trim each trace to the length of the effective operation. Oscilloscopes
usually have fixed recording ranges that produce larger traces than necessary, and if the t test
does not set a starting point and a limit on each trace, part of the computation is wasted.
Matlab also offers parallel processing to improve the performance of the test. It is not always
possible or necessary to reach that level, but it is recommended to use vector and matrix
operations as much as possible since they have been designed with performance in mind.
Finally, the bandwidth of the oscilloscope can be a determining factor when computing the t test.
If the bandwidth is very high, the trace will contain high-frequency content beyond the operating
frequency of the embedded device and can thus yield misleading results. For the experiments
presented in this work, the bandwidth of the differential probe was set to 20 MHz and the samples
were taken at 100 million samples per second, without any attenuation or gain factor in the probe.
4.3 SCA Results
This section contains the results of the side-channel analyses performed on the (3,1)- and
(5,2)-sharing masking schemes. Only the secure multiplication has been analyzed, due to its
relevance in the non-linear function of AES encryption. The traces were taken under the same test
conditions; the only difference between the two cases is the operating frequency of the target device.
4.3.1 Higher-Order t test
The leakage assessment of the secure multiplication for the (3,1)-sharing is shown in Figure 8.
The system clock ran at 4 MHz and the field multiplication with minimal memory footprint was used,
thus the execution time is the longest of the provided versions. The plot shows the orders from
first to fifth. Notice that for this version of the field multiplication there is significant
leakage at various points under the second and fourth order analysis. In all the plots, the upper
and lower red lines represent the ±4.5 bounds; if the result lies within that range, the
implementation can be considered secure for the number of traces used.
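As a rough illustration of what a higher-order univariate test involves, the sketch below shows a common per-sample preprocessing step in C, following the approach described in [SM15] as understood here: the first order uses the raw sample, the second order uses the squared mean-free sample, and higher orders use the standardized sample raised to the order. This is an assumption about the method, not a transcription of the scripts used in this work; the function names are hypothetical, and the mean and standard deviation of the sample point are assumed to have been computed beforehand.

```c
/* Preprocess one sample for a univariate t test of the given order.
 * mean and sigma are the mean and standard deviation of that sample
 * point over the trace set (assumed to be known in advance). */
double preprocess_sample(double x, double mean, double sigma, unsigned order)
{
    if (order == 1)
        return x;                /* first order: raw sample */
    double c = x - mean;         /* mean-free sample */
    if (order == 2)
        return c * c;            /* second order: centered square */
    double z = c / sigma;        /* higher orders: standardized sample */
    double p = 1.0;
    for (unsigned k = 0; k < order; k++)
        p *= z;
    return p;
}

/* Returns 1 when a t value lies strictly inside the +/- bound, 0 otherwise. */
int within_bounds(double t, double bound)
{
    return t > -bound && t < bound;
}
```

The preprocessed samples of the fixed and random sets are then compared with the same t statistic as in the first order case, and the result is checked against the ±4.5 bounds.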
Figure 8: HO t test for SMC (3,1)-multiplication with instructions-only GF(2^8) multiplication.
[Figure 9 plot: t (vertical axis, -6 to 6) versus number of traces (horizontal axis, 0 to 18 ×10^4).]
Figure 9: 1st order t growth for SMC (3,1)-multiplication with instructions-only GF(2^8) multiplication. The black
lines show the evolution of the maximum (top) and minimum (bottom) first order t values over the number of
traces. The stars mark how the index of the last maximum value grew over the number of traces. The circles mark
the corresponding behavior for the last minimum value.
The Exp-Log field multiplication offers a 2.6x execution speedup of the whole secure
multiplication. Although it requires the allocation of two 256-byte LUTs, the leakage assessment
in Figure 10 stays within the acceptable boundaries. Judging by these results, it seems better
suited for side-channel resistance than the version in Figure 8.
Figure 10: HO t test for SMC (3,1)-multiplication with Exp-Log GF(2^8) multiplication.
Figure 9 shows the growth of t over the number of traces for the first order t test of the
instructions-only version; Figure 11 shows the corresponding values for the (3,1) secure
multiplication based on the Exp-Log field multiplication.
[Figure 11 plot: t (vertical axis, -4 to 4) versus number of traces (horizontal axis, 0 to 2.5 ×10^5).]
Figure 11: 1st order t growth for SMC (3,1)-multiplication with Exp-Log GF(2^8) multiplication. The black lines
show the evolution of the maximum (top) and minimum (bottom) first order t values over the number of traces.
The stars mark how the index of the last maximum value grew over the number of traces. The circles mark the
corresponding behavior for the last minimum value.
Finally, Figure 12 shows the t test results for the (5,2)-sharing scheme. Because its execution
time with a 4 MHz clock would make the trace collection impractical, the system clock was switched
to 16 MHz. This SMC multiplication uses the Exp-Log field multiplication as well.
Figure 12: HO t test for SMC (5,2)-multiplication with Exp-Log GF(2^8) multiplication.
[Figure 13 plot: t (vertical axis, -6 to 6) versus number of traces (horizontal axis, 0 to 2.5 ×10^5).]
Figure 13: 1st order t growth for SMC (5,2)-multiplication with Exp-Log GF(2^8) multiplication. The black lines
show the evolution of the maximum (top) and minimum (bottom) first order t values over the number of traces.
The stars mark how the index of the last maximum value grew over the number of traces. The circles mark the
corresponding behavior for the last minimum value.
Based on this analysis, it can be inferred that the univariate first order t test does not show
points of leakage, while the higher order tests reveal some points where there could be some
leakage. It is important to mention that, although the number of traces is relatively small, in
many experiments prior to these results the test revealed significant leakage at different points.
Thus, based on experience and considering the internal sources of noise in the µC, these results
show a good level of resistance. However, single-threaded software masking schemes process one
share at a time, so it is important to also apply a multivariate analysis to check whether there
is any leakage due to the relation between two different points in time.
4.3.2 Multivariate t test
As opposed to hardware implementations of secret sharing and multiparty computation that
process their shares in parallel, this is a single-threaded software implementation. The
operations on every share or pair of shares are done sequentially, so the power consumption in a
certain interval may be related to only a single share or pair of shares being processed [SM15, p.12].
This section describes how the multivariate t test was applied to the collection of traces
previously used for the univariate t test in section 4.3.1, and the results for the previously
introduced cases.
Once the traces are classified into two sets (fixed vs. random), the mean of each set at every
point in time is calculated and subtracted from every sample. That generates two new mean-free
sets of traces containing the same number of samples as the original sets. Then, to save execution
time and memory, a few time intervals are identified where the different shares are processed.
All the points in those intervals are multiplied with each other to generate all possible
combinations, and the t test is calculated on these new data. Notice that doing the multivariate
analysis on a section against itself yields the second order univariate t test as part of the result.
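The two preprocessing steps above, mean removal and pairwise combination of two sections, can be sketched in C as follows. The function names and the row-major trace layout (traces[t * len + i] is sample i of trace t) are assumptions made for this sketch and do not come from the Matlab scripts used in this work.

```c
#include <stddef.h>

/* Subtract the per-point mean from every trace of one set, in place.
 * traces holds n_traces rows of len samples each (row-major). */
void center_traces(double *traces, size_t n_traces, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        double mean = 0.0;
        for (size_t t = 0; t < n_traces; t++)
            mean += traces[t * len + i];
        mean /= (double)n_traces;
        for (size_t t = 0; t < n_traces; t++)
            traces[t * len + i] -= mean;
    }
}

/* Multiply every point of section A with every point of section B of one
 * mean-free trace. out must have room for len_a * len_b products. */
void combine_sections(const double *trace,
                      size_t a0, size_t len_a,
                      size_t b0, size_t len_b,
                      double *out)
{
    for (size_t i = 0; i < len_a; i++)
        for (size_t j = 0; j < len_b; j++)
            out[i * len_b + j] = trace[a0 + i] * trace[b0 + j];
}
```

Each combined trace is then fed to the t test like an ordinary trace. When a section is combined with itself, the entries with i equal to j are the squared mean-free samples, which is why the second order univariate result appears as part of the output.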
Figure 14 illustrates this task by pointing out relevant events throughout the execution of a
(3,1) secure multiplication. For example, the rising and falling edges of the blue signal mark the
beginning and end of each of the first three field multiplications within the SMC multiplication.
During these three sections, the shares of the operands are processed for the first time. In other
words, each section corresponds to a field multiplication according to line 12 of Code 6, so these
are the areas of interest on which to focus the multivariate analysis. This analysis demands a
great amount of memory because the objective is to combine (multiply) each point with all the
others, so the size of the data grows from n to n^2, where n is the number of sample points under
analysis.
[Figure 14 plot: voltage (scaled) and t (vertical axis) versus time in µs (horizontal axis, 0 to 1800).
Annotations: 1st field mult. and start of SMC mult., 2nd field mult., 3rd field mult., SMC mult. end.]
Figure 14: The black trace is a signal that is set before the beginning of the SMC multiplication and reset at the
end of it, the blue trace is a signal that is set at the beginning and reset at the end of each of the first three field
multiplications. The gray signal in the background is the first order t result for the overall (3,1) SMC multiplication
based on instructions-only field multiplication.
Figures 15 and 16 show the results of the multivariate analysis on relevant sections of the (3,1)
secure multiplication. Each plot corresponds to the combination of the section where the first
pair of shares is processed with the sections where the remaining pairs are processed, as described
by line 12 of Code 6.
[Figure 15 plot: four panels (share-1 vs share-1, share-1 vs share-2, share-1 vs share-3, share-1 vs other),
each showing t (vertical axis, -5 to 5) versus sample points (horizontal axis, 0 to 3.5 ×10^5).]
Figure 15: Multivariate t test for sections of the (3,1) SMC multiplication based on Instructions-Only GF(2^8)
multiplication. Share-1 vs Share-1 shows the multivariate t test result of all combinations of points during the first
GF(2^8) multiplication. Share-1 vs Share-2 shows the result of the multivariate t test for the combination of points
from the first field multiplication with those of the second one. Share-1 vs Share-3 is the analogous multivariate
t test for the third field multiplication. Share-1 vs Other shows the result of the multivariate t test of the points
during the first field multiplication combined with all the points of a section close to the end of the SMC
multiplication.
Notice that the graphs in Figure 15 reveal certain peaks indicating potential leakage that do
not appear under the univariate higher-order analysis. It is possible that the same internal
hardware, like registers, is exercised at those two points in time and that there is a relevant
difference in the average power consumption.
[Figure 16 plot: four panels (share-1 vs share-1, share-1 vs share-2, share-1 vs share-3, share-1 vs other),
each showing t (vertical axis, -5 to 5) versus sample points (horizontal axis, 0 to 9 ×10^4).]
Figure 16: Multivariate t test for sections of the (3,1) SMC multiplication based on Exp-Log GF(2^8)
multiplication. Share-1 vs Share-1 shows the multivariate t test result of all combinations of points during the first
GF(2^8) multiplication. Share-1 vs Share-2 shows the result of the multivariate t test for the combination of points
from the first field multiplication with those of the second one. Share-1 vs Share-3 is the analogous multivariate
t test for the third field multiplication. Share-1 vs Other shows the result of the multivariate t test of the points
during the first field multiplication combined with all the points of a section close to the end of the SMC
multiplication.
Despite the fact that the Exp-Log field multiplication uses table look-ups, the result in Figure
16 does not show any evidence of leakage derived from the memory accesses. In fact, the results
show less leakage than those of Figure 15. The analysis could also be extended to other areas of
interest in the secure multiplication that may reveal potential leakage.
Figure 17 shows the same analysis as the previous figures, but for the (5,2) secure
multiplication. In this case there are possible leakage points. One of the reasons is that the
clock frequency is 16 MHz instead of the 4 MHz used in the previous cases. A higher frequency
reduces the time interval between one hardware state and the next, which causes the power
consumption of consecutive states to overlap. As this is ongoing research, the detected leakage
is taken into consideration for future analysis. Once those points of interest have been
discovered, a more detailed analysis can be made to gather more information.
The multivariate analysis is a useful tool to reveal potential sources of side-channel leakage.
However, execution time and memory constraints are significant factors that limit the extension
of the analysis to certain sections. Note that in all of the multivariate analysis results the
horizontal axis does not represent time, since the analysis itself requires the combination of
traces at different points. The data can be plotted in a 3D format to make it visually easier to
understand and to identify the exact points of leakage; however, for simplicity, in this document
the data is expanded across the horizontal axis.
Figure 17: Multivariate t test for sections of the (5,2) SMC multiplication. Share-1 vs Share-1 shows the multivariate
t test result of all combinations of points during the first GF(2^8) multiplication. Share-1 vs Share-2 shows the result
of the multivariate t test for the combination of points from the first field multiplication with those of the second one.
Share-1 vs Share-3, Share-1 vs Share-4 and Share-1 vs Share-5 are the analogous multivariate t tests for the remaining
field multiplications. Share-1 vs Other shows the result of the multivariate t test of the points during the first field
multiplication combined with all the points of a section close to the end of the SMC multiplication.
5 Conclusion
The omnipresence of embedded devices in every sector of our lives opens many gaps in terms
of security. Even standard algorithms for encryption, authentication and integrity checking are
susceptible to side-channel analysis, not because of any intrinsic weakness in their design
but because they run on real hardware that escapes the safety of their conceptualization.
SCA resistance comes at a performance and development cost: masking schemes pose complex
challenges on the hardware architecture in order to implement these algorithms securely.
High-end computers have adopted AES instructions as part of their Instruction Set Architecture
(ISA) extensions, which can execute a whole AES encryption in just a few clock cycles, and a huge
variety of embedded devices integrate hardware modules that parallelize the encryption workload.
Compared to these, polynomial masking and SMC fall short in terms of execution time. However, as
described in this work and others, e.g. [GR16], it is possible to optimize masking schemes from
both the theoretical and the implementation perspective. SMC and field operations can be
implemented in multiple ways to speed up execution, but it always boils down to the hardware
architecture and the desired level of resistance.
Yet there are many considerations that have to be handled correctly to achieve proper
impermeability. Beyond the essential branchless design, which already imposes a relatively small
overhead, and the reliance on true random number generation to disperse the secrets appropriately,
the developer must fit the secure algorithm into the constrained hardware and still leave space
for the application itself. Hardware features can help boost the performance and simplify the
implementation, for example instructions like cmov, which performs a conditional move of data in
constant time, or an embedded True Random Number Generator.
Software-based masking schemes are significantly slower than their hardware counterparts,
which implies a higher effort in terms of side-channel trace generation, collection and processing.
The clock sources and the available operating frequencies of the embedded device influence the
complexity of SCA. Dozens of gigabytes of storage are necessary to save the power traces, and
their collection takes hours and in some cases even days. A slight error in the setup, like a serial
communication error, can render the collected traces useless. However, once the observer has
become familiar with the implications of the analysis, some improvements can be made, like
reducing the trace recording to the effective processing time or automating repetitive work with
scripts.
This thesis gives evidence of the feasibility and side-channel resistance of polynomial masking
and SMC in software: they demand a performance overhead but in return provide an acceptable
level of impermeability. The analysis can be extended, but it already covers key elements,
like the focus on secure multiplication, randomness and constant-time execution with small but
practical orders of protection. The code has been made publicly available and offers several
selectable optimizations that imply a trade-off between memory and processing time. This
contribution may not be considered ground-breaking work, but it can serve as a reference to
further extend side-channel resistance to software implementations.
6 Appendix
...
#elif (N==5) && (D==2)

/* Evaluation points and inverse Vandermonde matrix for (5,2)-sharing */
uint8_t alpha[N] = {0x51, 0xec, 0x0d, 0xb1, 0x01};
uint8_t InvVand[N][N] = {
    {0x01, 0x01, 0x01, 0x01, 0x01},
    {0x5d, 0x5c, 0xe0, 0xe1, 0x00},
    {0xbc, 0xbc, 0xbd, 0xbd, 0x00},
    {0xb1, 0x0d, 0x51, 0xec, 0x01},
    {0x51, 0xec, 0x0d, 0xb1, 0x01}
};

...
#elif (N==3) && (D==1)

/* Evaluation points and inverse Vandermonde matrix for (3,1)-sharing */
uint8_t alpha[N] = {0x01, 0xbc, 0xbd};
uint8_t InvVand[N][N] = {
    {0x01, 0x01, 0x01},
    {0x01, 0xbd, 0xbc},
    {0x01, 0xbc, 0xbd}
};
...
#endif
Code 8: Precomputed selection of evaluation points (α_i) and Inverse Vandermonde Matrices for (3,1)-
and (5,2)-sharing.
References
[Alb] D. Alba. China’s Tianhe-2 caps top 10 supercomputers.
http://spectrum.ieee.org/tech-talk/computing/hardware/tianhe2-caps-top-10-supercomputers.
Published: 2013-06-17.
[BGN+14] Begül Bilgin, Benedikt Gierlichs, Svetla Nikova, Ventzislav Nikov, and Vincent
Rijmen. Higher-Order Threshold Implementations, pages 326–343. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2014.
[BM06] Joseph Bonneau and Ilya Mironov. Cache-Collision Timing Attacks Against AES,
pages 201–215. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
[BOGW88] Michael Ben-Or, Shafi Goldwasser, and Avi Wigderson. Completeness theorems
for non-cryptographic fault-tolerant distributed computation. In Proceedings of the
Twentieth Annual ACM Symposium on Theory of Computing, STOC ’88, pages 1–10,
New York, NY, USA, 1988. ACM.
[CJRR99] Suresh Chari, Charanjit Jutla, Josyula R. Rao, and Pankaj Rohatgi. A cautionary
note regarding evaluation of AES candidates on smart-cards. In Second Advanced
Encryption Standard (AES) Candidate Conference, pages 133–147, 1999.
[CPRR14] Jean-Sébastien Coron, Emmanuel Prouff, Matthieu Rivain, and Thomas Roche.
Higher-Order Side Channel Security and Mask Refreshing, pages 410–424. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2014.
[DCBRN15] Thomas De Cnudde, Begül Bilgin, Oscar Reparaz, and Svetla Nikova. Higher-Order
Glitch Resistant Implementation of the PRESENT S-Box, pages 75–93. Springer
International Publishing, Cham, 2015.
[DR99] Joan Daemen and Vincent Rijmen. AES Proposal: Rijndael, pages 4–8. 1999.
[GGJR+11] Gilbert Goodwill, Benjamin Jun, Josh Jaffe, and Pankaj Rohatgi. A testing
methodology for side-channel resistance validation. In NIST non-invasive attack testing
workshop, 2011.
[GM11] Louis Goubin and Ange Martinelli. Protecting AES with Shamir’s Secret Sharing
Scheme, pages 79–94. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[GMO01] Karine Gandolfi, Christophe Mourtel, and Francis Olivier. Electromagnetic analysis:
Concrete results. In Cryptographic Hardware and Embedded Systems - CHES 2001,
pages 251–261. Springer, 2001.
[GR16] Dahmun Goudarzi and Matthieu Rivain. How fast can higher-order masking be in
software? Cryptology ePrint Archive, Report 2016/264, 2016.
http://eprint.iacr.org/2016/264.
[GRR98] Rosario Gennaro, Michael O. Rabin, and Tal Rabin. Simplified VSS and fast-track
multiparty computations with applications to threshold cryptography. In Proceedings
of the Seventeenth Annual ACM Symposium on Principles of Distributed Computing,
PODC ’98, pages 101–111, New York, NY, USA, 1998. ACM.
[GSF14] Vincent Grosso, François-Xavier Standaert, and Sebastian Faust. Masking vs.
multiparty computation: how large is the gap for AES? Journal of Cryptographic
Engineering, 4(1):47–57, 2014.
[IIES14] Gorka Irazoqui, Mehmet Sinan Inci, Thomas Eisenbarth, and Berk Sunar. Wait a
Minute! A fast, Cross-VM Attack on AES, pages 299–319. Springer International
Publishing, Cham, 2014.
[Int] A guide to the internet of things.
http://www.intel.com/content/www/us/en/internet-of-things/infographics/guide-to-iot.html.
Accessed: 2017-02-10.
[ISW03] Yuval Ishai, Amit Sahai, and David Wagner. Private Circuits: Securing Hardware
against Probing Attacks, pages 463–481. Springer Berlin Heidelberg, Berlin, Heidel-
berg, 2003.
[KJJ99] Paul Kocher, Joshua Jaffe, and Benjamin Jun. Differential Power Analysis, pages
388–397. Springer Berlin Heidelberg, Berlin, Heidelberg, 1999.
[KJJR11] Paul Kocher, Joshua Jaffe, Benjamin Jun, and Pankaj Rohatgi. Introduction to
differential power analysis. Journal of Cryptographic Engineering, 1(1):5–27, 2011.
[Koc96] Paul C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS,
and Other Systems, pages 104–113. Springer Berlin Heidelberg, Berlin, Heidelberg,
1996.
[LMW14] Andrew J. Leiserson, Mark E. Marson, and Megan A. Wachs. Gate-Level Masking
under a Path-Based Leakage Metric, pages 580–597. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2014.
[MM13] Amir Moradi and Oliver Mischke. On the simplicity of converting leakages from multi-
variate to univariate: Case study of a glitch-resistant masking scheme. In Proceedings
of the 15th International Conference on Cryptographic Hardware and Embedded Sys-
tems, CHES’13, pages 1–20, Berlin, Heidelberg, 2013. Springer-Verlag.
[QS01] Jean-Jacques Quisquater and David Samyde. ElectroMagnetic Analysis (EMA): Mea-
sures and Counter-measures for Smart Cards, pages 200–210. Springer Berlin Hei-
delberg, Berlin, Heidelberg, 2001.
[RP10] Matthieu Rivain and Emmanuel Prouff. Provably Secure Higher-Order Masking of
AES, pages 413–427. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[RP12] Thomas Roche and Emmanuel Prouff. Higher-order glitch free implementation of
the AES using secure multi-party computation protocols. Journal of Cryptographic
Engineering, 2(2):111–127, 2012.
[SEFR17] Okan Seker, Thomas Eisenbarth, and Abraham Fernandez-Rubio. Analyzing secure
multiparty AES in software. Unpublished manuscript in preparation, 2017.
Department of Electrical and Computer Engineering, Worcester Polytechnic Institute,
Massachusetts, USA.
[SES17] Okan Seker, Thomas Eisenbarth, and Rainer Steinwandt. Extending glitch-free mul-
tiparty protocols to resist fault injection attacks. Cryptology ePrint Archive, Report
2017/269, 2017. http://eprint.iacr.org/2017/269.
[Sha79] Adi Shamir. How to share a secret. Commun. ACM, 22(11):612–613, nov 1979.
[SM15] Tobias Schneider and Amir Moradi. Leakage assessment methodology - a clear
roadmap for side-channel evaluations. Cryptology ePrint Archive, Report 2015/207,
2015. http://eprint.iacr.org/2015/207.
[STM] STMicroelectronics. RM0367 Reference manual: Ultra-low-power STM32L0x3 advanced
ARM-based 32-bit MCUs (Doc. ID 025274, Rev. 5), 2016.
http://www.st.com/content/ccc/resource/technical/document/reference_manual/2f/b9/c6/34/28/29/42/d2/DM00095744.pdf/files/DM00095744.pdf/jcr:content/translations/en.DM00095744.pdf.
Accessed: 2017-04-05.
[vW01] Manfred von Willich. A Technique with an Information-Theoretic Basis for Protecting
Secret Data from Differential Power Attacks, pages 44–62. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2001.
[vWWB11] Jasper G. J. van Woudenberg, Marc F. Witteman, and Bram Bakker. Improving
Differential Power Analysis by Elastic Alignment, pages 104–119. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2011.
