Compiler strategies for mitigating timing side channel attacks by Van Cleemput, Jeroen


Compiler Strategies for Mitigating Timing Side Channel Attacks
(Dutch title: Compilertechnieken om aanvallen gebaseerd op
tijdsnevenkanalen te mitigeren)
Jeroen Van Cleemput
Supervisors: prof. dr. ir. B. De Sutter, prof. dr. ir. K. De Bosschere
Dissertation submitted to obtain the degree of
Doctor of Engineering Sciences: Computer Science
Department of Electronics and Information Systems
Chair: prof. dr. ir. R. Van de Walle
Faculty of Engineering and Architecture
Academic year 2015 - 2016
ISBN 978-90-8578-866-9
NUR 980
Legal deposit: D/2015/10.500/110
Acknowledgments
The acknowledgments are traditionally the part of a doctoral thesis
that is written last, just before the final version goes to the
printer. In my case this was no different. Nevertheless, this
manuscript would not have come about without the help of quite a few
people. I would like to thank them here at length for their help and
support over the past years.
First of all, I thank my supervisors Bjorn and Koen for giving me the
opportunity to pursue a PhD. Their advice and professional guidance
were indispensable to the successful completion of my doctorate.
Next, I would like to thank the members of my exam committee:
prof. Ingrid Verbauwhede, dr. Francesco Regazzoni, prof. Eric
Laermans, dr. Bart Coppens and prof. Rik Van de Walle. I would like to
thank you for your remarks, questions, and constructive feedback.
In addition, I thank my colleagues Christophe, Stijn, Bart, Jonas,
Panagiotis, Tim, Niels, Hadi, Jens, Bert, Ronald, Wim, Marnix, Michiel,
Ronny, Jeroen, Eneko, Vicky and Karen, as well as all other members of
the department and the many former colleagues, for their collegiality
and friendship, both during and outside working hours.
Finally, I also want to thank my parents Frans and Nadine, my sister
Sofie, my girlfriend Ellen and my friends. Even though I was often
absent and at other times had my mind on my research, you kept
supporting me through thick and thin.
Jeroen Van Cleemput
Ghent, 3 December 2015

This work was supported by the agency for Innovation by Science and
Technology (IWT).
Examination Committee
Prof. Rik Van de Walle, chair
ELIS Department
Faculty of Engineering and Architecture
Ghent University
Prof. Bjorn De Sutter, supervisor
ELIS Department
Faculty of Engineering and Architecture
Ghent University
Prof. Koen De Bosschere, supervisor
ELIS Department
Faculty of Engineering and Architecture
Ghent University
Dr. Bart Coppens, secretary
ELIS Department
Faculty of Engineering and Architecture
Ghent University
Prof. Eric Laermans
INTEC Department
Faculty of Engineering and Architecture
Ghent University
Prof. Ingrid Verbauwhede
ESAT Department
Faculty of Engineering Science
Katholieke Universiteit Leuven
Dr. Francesco Regazzoni
ALaRi
Faculty of Informatics
Università della Svizzera Italiana
Reading Committee
Dr. Bart Coppens, secretary
ELIS Department
Faculty of Engineering and Architecture
Ghent University
Prof. Eric Laermans
INTEC Department
Faculty of Engineering and Architecture
Ghent University
Prof. Ingrid Verbauwhede
ESAT Department
Faculty of Engineering Science
Katholieke Universiteit Leuven
Dr. Francesco Regazzoni
ALaRi
Faculty of Informatics
Università della Svizzera Italiana
Dutch Summary (translated)
Many processors regularly run cryptographic applications. When these
are implemented correctly, they offer strong guarantees against attacks
on privacy, authentication and other cryptographic applications. In
practice, however, the input-output behavior is not the only observable
property. Depending on how they can access the device, either
physically or over a network, attackers can observe other properties
such as electromagnetic radiation, power consumption, the use of
architectural components or execution time. These properties are called
side channels when there is a relation between the observable
properties and secret information such as cryptographic keys. Side
channel attacks exploit this relation to attack the cryptographic
software.
In this thesis we investigate how compiler transformations can be used
to protect software against side channel attacks based on execution
time. These attacks exploit small differences in the execution time of
a program or algorithm when it processes different inputs in order to
gain extra information about those inputs. Armed with this extra
information, a cryptographic attack can be simplified considerably. In
some cases it is even possible to recover the cryptographic key
completely. We consider both attacks that directly measure the
execution time of a cryptographic algorithm and attacks that indirectly
measure the influence of a cryptographic process on a process under the
attacker’s control.
The difference in execution time can be caused by key-dependent data
flow or key-dependent control flow. In general, differing control flow
has the largest impact on execution time. Depending on the program
input, different paths are taken in the control flow graph.
Consequently, different sequences of instructions are executed, which
ultimately results in an observable difference in execution time. In
addition, key-dependent data flow in the program can also contribute to
timing variation. These timing variations are often the result of
various optimizations in the underlying hardware: instructions with
variable execution time, variable access times to main memory, data
dependencies through memory and interactions between instructions in
the processor pipeline.
Using compiler transformations in static compilers, program code can be
transformed in such a way that key-dependent timing variation is
strongly reduced or even completely eliminated. Instead of adding noise
or trying to mask the leaked timing information, our approach tries to
eliminate the source of the timing variation. In this way we can close
the timing side channel completely. The protective transformations
inherently offer complete protection against attacks on the branch
predictor and the instruction cache. Moreover, we can show
statistically that no difference in execution time can be observed in
the protected program.
We implement our transformations in the LLVM compiler framework. In
doing so, we extend existing transformations that remove key-dependent
control flow. We add support for eliminating key-dependent data flow by
either adding compensation code or enforcing invariable execution time.
Furthermore, we show that adding NOP instructions between store and
load instructions can eliminate the variation in execution time caused
by interaction in the processor pipeline.
We evaluate our techniques on a code fragment that implements modular
exponentiation. We show that with our data flow transformations we can
reduce the overhead of securing the code from 20x to less than 6x. We
also show statistically that there is no longer an observable timing
difference between the different inputs in the secured program.
The transformations that completely remove execution time variation
still have a considerable overhead, however. We therefore investigate
the possible trade-off between security, performance and applicability
to different architectures of the various compiler-based protection
techniques for x86 processors. Depending on the circumstances, a lower
security level may allow small variations in execution time in exchange
for faster execution of the software.
Protection techniques in static compilers are hampered, however, by
their static nature and their dependence on the details of the target
architecture at compile time. At compile time the target architecture
is not always known, and securing the program for multiple
architectures would bring unnecessary overhead. Furthermore, static
compilers support only one security level per generated executable.
Technically it is not impossible to combine different program versions
with different security transformations in a single executable. The
impact on code size, however, makes this approach impractical.
Dynamic compilation techniques offer a solution to many of the
limitations of a static compiler. By means of profile information
collected offline on representative inputs, the sensitive code
fragments in the program can be detected automatically. While the
program is running, the JIT compiler then automatically generates
protected code based on this profile information. This approach makes
it possible to adapt the protection level, and consequently the
performance of the program, dynamically to the security requirements
and the exact processor on which the program is executed.
These automatic protection techniques based on profile information not
only reduce the risk of human error, but can also strongly reduce the
protection overhead. This is achieved by adapting the compilation
techniques so that they can be applied in a more fine-grained manner.
Only the methods that actually contribute to timing variation are
protected, and within these methods only part of the code is
transformed.
We have implemented our transformation techniques in the JikesRVM JIT
compiler and evaluated them on realistic software. Using the
profile-based techniques we protect an RSA encryption algorithm with an
overhead of 8.4x on a Core™ 2 processor and an overhead of 16.5x on a
Core™ i7 system. Compared with purely static techniques, we can reduce
the overhead of protecting a modular exponentiation algorithm from the
GNU Classpath library from 66.3x to 16.2x on a Core™ 2 processor and
from 62.7x to 11.2x on a Core™ i7 system. Moreover, our approach is
also able to detect a single method that leaks timing information in an
HMAC verification routine and to protect it with a minimal overhead of
5.7%. Again we can prove statistically that no timing difference
between the different inputs is observable in the protected program.
By centralizing protection techniques in the compiler, we separate
security requirements from the implementation details of the software.
Instead of obliging a programmer to take security measures at the
algorithmic level, at the source code level or in programming tools,
the protection techniques are implemented once at the compiler level by
an expert in side channels. As a result, not only the security level of
the software increases, but also the productivity of the programmer.
Summary
Processors execute cryptographic software on a regular basis. When
implemented properly, the input-output behavior of that software
provides strong guarantees against attacks on privacy, authentication,
and other applications of cryptography. In practice, however, the
implemented input-output relation is not the only observable property.
Depending on the physical or network access to devices, attackers can
observe properties such as electromagnetic emanations, power
consumption, resource consumption and execution times. These properties
are called side channels when they feature a correlation with protected
data such as secret keys. Side-channel attacks exploit this correlation
to attack cryptographic software.
In this dissertation, we investigate how compiler transformations can
be used to defend against timing side channel attacks on modern x86
architectures. Based on (often very small) differences in the execution
time of an algorithm or application when it is processing different
inputs, an attacker can learn information about those inputs. This
additional information can make the algorithm in question less secure,
and in some cases allows an attacker to completely reconstruct the
secret inputs of an algorithm. We consider attacks based on direct
measurements of the execution time of a cryptographic process, as well
as indirect attacks that measure the influence of a cryptographic
process on the execution time of another process controlled by the
attacker.
These differences in execution time can be caused by either
key-dependent control flow or key-dependent data flow. Control flow
has, in general, the biggest impact on execution time variation.
Depending on the application input, different paths are taken in the
control flow graph (CFG) and different instruction sequences are
executed, so that a difference in program execution time can be
measured. Even if the control flow does not depend on a
security-sensitive value, data flow dependencies can still cause
execution time variation. These variations are often the result of
performance optimizations in the underlying hardware: variable-latency
arithmetic instructions, memory access delays, data dependencies
through memory and interactions between instructions in the processor
pipeline.
Using compiler transformations in static compilers we are able to
transform the application code in such a way that the key-dependent
timing variation is significantly reduced or even completely
eliminated. In our approach, instead of introducing noise or masking
timing variation, we try to eliminate the source of the timing
variation, effectively closing the timing side channel. Our approach
inherently provides complete protection against branch prediction
attacks and instruction cache attacks, and we provide statistical
evidence that no differences in execution time can be observed in the
protected application.
We extend existing compiler mitigation techniques implemented in the
LLVM compiler framework that remove key-dependent control flow based on
if-conversion. We add support to remove key-dependent data flow by
either adding variable-latency compensation code or forcing invariable
latencies. Furthermore, we show that inserting NOP instructions between
store and load instructions can remove timing variation caused by
pipeline interactions.
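To make the idea of if-conversion concrete, the following is a minimal
Python sketch and not the LLVM implementation described in this
dissertation. The function names `select_branching` and
`select_if_converted` and the 32-bit word width are assumptions made
for illustration only. Python integers do not execute in constant time,
so the sketch only shows the shape of the transformation: both sides of
the branch are computed, and a mask derived from the secret condition
selects the result.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit machine word, for illustration only


def select_branching(secret_bit, a, b):
    """Original code: the path taken depends on the secret bit."""
    if secret_bit:
        return (a * b) & MASK32
    return a & MASK32


def select_if_converted(secret_bit, a, b):
    """If-converted code: both paths execute, a mask selects the result."""
    product = (a * b) & MASK32        # always computed, whatever the bit
    mask = (-secret_bit) & MASK32     # 0xFFFFFFFF if the bit is 1, else 0
    keep = ~mask & MASK32             # complement of the mask within 32 bits
    return (product & mask) | (a & keep)
```

Both functions return the same value for `secret_bit` in {0, 1}; the
second one simply no longer contains a branch on the secret. On real
hardware the same effect would typically be obtained with a conditional
move or similar predicated instruction rather than with explicit masks.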
We evaluate our approach on modular exponentiation code and show that
we can reduce the protection overhead from 20x to below 6x using our
data flow mitigation techniques. We statistically prove
indistinguishability of execution times using Student’s t-test.
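As an illustration of the kind of test involved, the sketch below
computes the classic two-sample Student t statistic with pooled
variance for two sets of timing measurements. It is a hedged example,
not the evaluation code used in this work: the function name
`student_t` is hypothetical, and the exact test setup (sample sizes,
significance level, one- or two-sided) is not specified here.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Two-sample Student t statistic using the pooled sample variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)   # unbiased sample variance
    var_b = statistics.variance(sample_b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
```

A value of |t| below the critical value for the chosen significance
level and n_a + n_b - 2 degrees of freedom means that the test finds no
evidence that the mean execution times for the two inputs differ.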
Transformations that completely eliminate all timing variation still
come at a high overhead. We therefore explore the trade-off between
security, performance, and portability of compiler-based defenses
against time-based side channel attacks on x86 processors. Depending on
the scenario, a lower security level that leaks very small amounts of
timing information can be an acceptable trade-off for faster
performance.
Protection techniques in static compilers are hampered, however, by
their static nature and their dependence on details of the processor
targeted during the compilation. At compilation time the target
architecture is often not known, and protecting for multiple target
architectures introduces unnecessary overhead. Furthermore, static
compiler techniques support only one design point, i.e., one protection
level per generated binary. While it is conceivable to generate static
binaries that include code compiled for several scenarios, the
resulting code bloat impedes the generation of code for many different
combinations of usage scenarios and target processors.
To overcome the limitations of static compiler approaches, we present a
dynamic approach for mitigating timing side channels. Using profile
information collected offline on representative inputs, sensitive code
fragments in an application are detected and tagged automatically. At
run time, a just-in-time (JIT) compiler then generates protected code
based on the profile information. This enables dynamic adaptation of
the level of protection, and hence of the efficiency of the protected
code, to changing circumstances, including the exact processor on which
the application is being executed or to which, e.g., it is migrated.
Automated protection techniques based on run-time profiles not only
reduce the risk of human error but can also significantly reduce the
performance overhead of the application. We did this by adapting the
compiler mitigation techniques to allow for a more fine-grained
application of the transformations based on these profiles. Only
methods that actually cause timing variation are protected, and within
those methods only a subset of the code is transformed.
We implemented our transformation techniques in the JikesRVM JIT
compiler and evaluated them on real-life use cases. Using our
profile-based approach we were able to protect an RSA encryption
algorithm at an overhead of 8.4x and 16.5x on Core™ 2 and Core™ i7
systems, respectively. Compared to static transformation techniques,
our approach reduced the overhead of protecting GNU Classpath’s
implementation of the modular exponentiation algorithm from 66.3x to
16.2x on the Core™ i7 and from 62.7x to 11.2x on the Core™ 2. Our
approach was also able to pinpoint a single method causing timing
information leakage in an HMAC verification routine and secure it
automatically at a minimal overhead of 5.7%. Observed execution times
for different inputs were statistically proved to be indistinguishable.
By centralizing security transformations in the compiler, we separate
(side channel) security requirements from implementation details of the
software. Instead of obligating the programmer to apply security
measures in algorithms, source code or programming tools, protection
techniques are implemented once at compiler level by an expert
programmer with thorough knowledge about side channel security. This
increases both programmer productivity and application security in the
context of side channels.
Contents
Dutch summary
English summary
Contents
1 Introduction
1.1 Side Channel Attacks
1.2 Mitigation Strategies
1.2.1 Algorithmic Mitigation Techniques
1.2.2 Source Code Protection
1.2.3 Compiler-Based Techniques
1.2.4 Operating System Techniques
1.2.5 Hardware Techniques
1.2.6 Theoretical Analysis
1.3 Leaked Information and Security Guarantees
1.4 Contributions and Publications
1.5 Outline
2 Execution Time on Modern Processors
2.1 Control Flow
2.2 Data flow
2.2.1 Conditional Moves
2.2.2 Variable Latency Arithmetic Instructions
2.2.3 Interaction Between Memory Operations
3 Static Compilers
3.1 Control Flow Transformations
3.2 Mitigation Strategies: Data flow
3.2.1 Strategy One: Variable-Latency Compensation Code
3.2.2 Strategy 2: Forcing Invariable Latencies
3.3 Evaluation
3.3.1 Strategy One: Variable-Latency Compensation Code
3.3.2 Strategy 2: Forcing Invariable Latencies
3.3.3 No-ops for Avoiding Variable Interaction Between Memory Operations
3.3.4 Feasibility of Compiler-Based Mitigation
4 JIT compilation
4.1 Profile Based Protection
4.1.1 Transforming Loops and Recursive Calls
4.1.2 Automatic Detection of Code Regions to Transform
4.1.3 Selective If-Conversion
4.1.4 Security/Performance Trade-off
4.1.5 Execution Paths not Covered by Profile Inputs
4.2 Experimental Evaluation
4.2.1 Prototype Implementation
4.2.2 Methodology & Experimental Setup
4.2.3 BouncyCastle RSA Encryption
4.2.4 Keyczar HMAC Verification
4.3 Related Work
5 Tool Prototype
5.1 Extending the Adaptive Optimization System
5.2 Class Loading and Dynamic Linking
5.3 Analysis Tools
5.4 Compilation Passes
5.4.1 Static Single Assignment
5.4.2 Overview of Compilation Passes
5.4.3 Basic Block Predicates
5.4.4 Protected Branches
5.4.5 Unprotected Branches
5.4.6 Loops With Fixed Iteration Count
5.4.7 Loops Without Fixed Iteration Count
5.4.8 Protected Calls
5.4.9 Unprotected Calls
5.4.10 Recursion
5.4.11 Safeguard Individual Instructions
5.4.12 Transformations in lower level IR
5.5 Limitations
6 Conclusions & Future Work
6.1 Conclusions
6.2 Future Work
List of Tables
List of Figures
List of abbreviations
Bibliography
Chapter 1
Introduction
In this digital age where almost every device is networked, and more
personal information is shared over the internet than ever, software
and information security have become indispensable. More and more
information about our personal lives is stored "in the cloud". We use
our computers and mobile devices to e-mail personal information to
other parties, to do our shopping, to transfer money and even to do our
taxes. We rely on traditional cryptographic algorithms to keep this
information safe, and to make sure it reaches its target destination
without being altered or read by an adversary called Eve.
These cryptographic algorithms do their job very well, in the sense
that they make it computationally hard, from a mathematical point of
view, to decode encrypted information without access to the secret key.
But what happens if Eve peeks over your shoulder when you enter your
PIN code at the ATM? A big downside of traditional cryptography is
that its security proofs only take into account what is defined in the
algorithm’s specification. Any extra information gained by an attacker
through other means can compromise the security of the application. In
this case, the (supposedly) secure ATM, which relies on physical
possession of the card and a secret PIN code, loses a great deal of its
security when the code is leaked through an external channel.
These external channels, which leak additional information about a
system, are called side channels. Besides the trivial example of
peeking over someone’s shoulder, there exist many other and often very
complicated side channels: an attacker might, for example, analyze the
sound of your keyboard while you type your password [91]. He can read
gyroscope information from your smartphone or tablet and infer where
you tapped your screen when entering your PIN code [30]. Solely based
on the size of the encrypted packets sent between you and Google, an
attacker can deduce the search query you entered [32]. More advanced
methods based on electromagnetic emanations [9] and power consumption
traces [2, 58, 63] of the device on which the target code is executed
have been successfully used to reconstruct secret keys of cryptographic
algorithms.
The focus of this thesis will be on timing side channels: based on
(often very small) differences in the execution time of an algorithm or
application when it is processing different inputs, an attacker can
learn information about those inputs. This additional information can
make the algorithm in question less secure, and in some cases allows an
attacker to completely reconstruct the secret inputs of an algorithm.
We will investigate how applications can be protected against timing
side channel attacks using compiler transformation techniques. We do so
in an automated manner, with as little interaction as possible required
from the user or developer. This way we want to obtain safer software
and increase the productivity of the programmer.
In the remainder of this chapter, we will give a general overview of
different types of side channel attacks with a focus on timing attacks
and discuss the different approaches to defend against those attacks.
1.1 Side Channel Attacks
Many different types of side channel attacks have been described in
the literature, ranging from trivial to very complex. We will start by
giving a short overview of existing side channel attacks. Figure 1.1
illustrates a classification of side channels according to the
observable properties, signals or components used by an attacker. We
indicate in gray the scope of the attacks against which we will defend
using the protection techniques in this thesis.
The majority of side channel attacks we will discuss in this section
are focused on cryptographic algorithms. These attacks try to gain
additional information about the secret keys, reducing the difficulty
of a cryptanalytic attack, or even allowing complete reconstruction of
the secret keys used in those algorithms. Side channel attacks are not
limited to cryptographic algorithms, however. Any method that can be
used to gain additional information about a system can be considered a
side channel when it was not intended for that purpose during design.
Information can for example leak visually, either directly or
indirectly, by post-processing reflected images from computer monitors
or even from the eyes of the user [21]. An attacker can listen to
sounds emitted from peripherals such as keyboards [91] or user
interaction with mobile phones [79] to deduce a user’s password or PIN.
Based on information from gyroscopes inside a tablet or smartphone, an
attacker can determine where a user tapped on the screen [30].
[Figure 1.1: Overview of side channel attacks. Labels in the figure:
Side-Channel Attacks; Micro-Architectural Attacks; Timing Attacks;
Power; EM; Temperature; Visual; Sound; Gyroscopes; Trace-driven;
Access-driven; Time-driven; I-cache; Branch predictor; D-cache;
Aggregate time.]
(Micro-)Architectural Side Channels are a subclass of side channels
that observe properties, signals or components of the computer
micro-architecture.
They are typically subdivided into three types of attacks:
trace-driven, access-driven and time-driven. Trace-driven and
access-driven attacks are concurrent attacks, in the sense that the
targeted application is monitored during its execution.
Trace-driven attacks allow an attacker to reconstruct a complete trace
of the observed side channel during the execution of the application.
Some examples are power-based attacks [2, 46, 58, 63] and attacks based
on electromagnetic emanations (EM) [9]. These attacks are trace-driven,
as a trace of the power consumption or EM is captured during execution
of the targeted algorithm. Based on this trace, an attacker can then
infer information about the internal state of the algorithm and,
consequently, the secret key.
Access-driven attacks are less fine-grained. Instead of a complete
execution trace, only the access pattern is known. In access-driven
cache attacks, for example, an attacker reconstructs the secret key
based on which cache lines have been used by the target algorithm.
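The following toy model sketches why knowing the accessed cache line
can reveal key material. It is an illustrative assumption, not an
attack from the cited literature reproduced faithfully: we assume a
victim that performs a table lookup at index `plaintext ^ key`, a table
with 16 entries per cache line, and an attacker who knows the plaintext
byte and observes which line was touched. The function names are
hypothetical.

```python
ENTRIES_PER_LINE = 16  # assumption: 64-byte lines holding 4-byte table entries


def victim_touched_line(key_byte, plaintext_byte):
    """Cache line index touched by a lookup at table[plaintext ^ key]."""
    return (plaintext_byte ^ key_byte) // ENTRIES_PER_LINE


def recover_key_high_nibble(observed_line, plaintext_byte):
    """Attacker side: known plaintext plus observed line gives the upper key bits."""
    return observed_line ^ (plaintext_byte // ENTRIES_PER_LINE)
```

In this model a single observation already fixes the upper four bits of
the key byte; the lower four bits stay hidden because all entries on
one line are indistinguishable to the attacker.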
Time-driven attacks are based solely on the aggregate execution time of
the targeted algorithm. Although key recovery is often based on
knowledge and prediction of how architectural components influence the
aggregate execution time, no components or signals are directly
measured. Time-driven attacks do not require concurrent monitoring of
the target application, as is the case with trace-driven and
access-driven attacks. This makes them suitable to perform remotely.
Timing Side Channels can be considered a subclass of the
(micro-)architectural side channels. Based on small differences in
execution time of an algorithm or application when different inputs are
processed, an attacker can learn information about those inputs.
The measured timing differences can be caused by either different
control flow based on the application input, or different behavior of
the architectural components in the hardware running the application.
Input-dependent control flow causes, in general, the most variation in
execution time. Depending on the input, branches in the code are taken
in different directions, and a different sequence of instructions is
executed. Consequently, execution time differences can be observed.
This input-dependent control flow will also have an influence on the
internal state of architectural components such as instruction caches,
data caches and branch predictors, which in turn will influence the
execution time. Even if an application does not exhibit input-dependent
control flow, some variation in execution time may still be measured.
Different inputs can cause data dependencies between instructions,
different memory access delays or even cause single instructions to
occupy functional units longer depending on their operands. We will
discuss execution time on modern processor architectures in greater
detail in Chapter 2.
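As a small illustration of input-dependent control flow, the sketch
below counts the modular multiplications performed by a textbook
left-to-right square-and-multiply exponentiation. It is a hedged
example with a hypothetical function name, not code taken from any of
the libraries evaluated later; the operation count merely stands in for
execution time.

```python
def modexp_count_ops(base, exponent, modulus):
    """Left-to-right square-and-multiply; returns (result, multiplications)."""
    result = 1
    ops = 0
    for bit in bin(exponent)[2:]:              # most significant bit first
        result = (result * result) % modulus   # always square
        ops += 1
        if bit == "1":                         # exponent-dependent branch
            result = (result * base) % modulus
            ops += 1
    return result, ops
```

Two exponents of the same bit length but different Hamming weight, for
example `0b10000001` and `0b11111111`, lead to different operation
counts, which is the kind of difference a timing attacker tries to
measure.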
Timing measurements can either be direct (non-concurrent), by
measuring the aggregate execution time of a target algorithm, or
indirect (concurrent), by reconstructing traces or access patterns of
micro-architectural components based on the timing behavior of a spy
application running concurrently. Concurrent timing attacks are based
on resource contention. Two concurrently running applications share
hardware resources such as caches, execution units or other
architectural components on the processor. This resource sharing
inevitably causes the two applications to influence each other’s
execution time. They can evict each other’s cache lines, overwrite
entries in the branch prediction table, or occupy execution units on
the processor. This is especially true on simultaneous multithreading
(SMT) systems, where instructions of both applications can be in the
processor pipeline together. An attacker can thus run a specially
crafted application concurrently with the target, and learn what
resources the target uses by measuring (parts of) its own execution
time.
Concurrent attacks can circumvent the isolation properties of operating
systems or hypervisors [13, 54, 76, 90], allowing the spy process to
run either on the same operating system or on a different virtual
machine running on the same physical hardware.
Time-driven attacks are based on the observed aggregate execution
time of an algorithm or application. Several such attacks have been
suggested against different cryptographic algorithms, both locally
and remotely over a network. We give a short chronological overview:
Kocher [57] was one of the first to demonstrate a timing attack solely
based on observed execution time. He showed that by carefully mea-
suring private key operations, attackers may be able to find fixed Diffie-
Hellman exponents or factor RSA keys. Dhem et al. [38] implemented
a practical attack against smart card encryption. Provided that enough
samples can be taken, the private key can be reconstructed in a matter
of minutes. Schindler [78] further improved upon the state of the
art by mounting an attack against RSA implemented using the Chinese
remainder theorem. Brumley and Boneh [29] devised a timing attack
against OpenSSL and showed that private keys could be extracted from
an OpenSSL-based webserver running on the local network. As a re-
sult of their work, many cryptographic libraries implemented blinding
by default (see Section 1.2). Bernstein [23] was the first to demonstrate
a cache-based timing attack on AES. His work was then refined
by Neve et al. [72] allowing for reconstruction of the full key using a
much lower number of samples. Bonneau and Mironov [24] describe
another type of cache-based attack against AES. By predicting time vari-
ation due to cache-collisions in sequences of table lookups they are able
to reduce the number of samples needed to recover an AES key by four
orders of magnitude compared to Bernstein [23]. Aciiçmez et al. [6]
showed that remote cache-based attacks on AES are possible when the
server runs a multitasking or multithreaded system with a high work-
load. Groszschaedl et al. [46] showed that the early terminating behav-
ior of integer multipliers in various embedded systems can introduce
both timing and power based side channels. Brumley and Tuveri [28]
mounted a remote attack on the OpenSSL implementation of elliptic
curve cryptography based on a vulnerability in OpenSSL’s ladder im-
plementation. The vulnerable algorithm in question was implemented
using a while loop with a variable iteration count depending on the
highest order 1 bit of the key. Based on key-dependent timing varia-
tions caused by this loop they were able to recover a private key of a
TLS server which authenticates with ECDSA signatures.
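
The kind of loop described above can be illustrated with a minimal sketch. The function names and the 64-bit scalar are assumptions made for illustration only; this is not the OpenSSL code. The first function has a trip count equal to the bit length of the scalar, the second always visits a fixed number of bit positions.

```c
#include <stdint.h>

/* Hypothetical illustration: the trip count equals the position of the
 * highest-order 1 bit of k, so the running time reveals the bit length. */
static unsigned leaky_ladder_iterations(uint64_t k)
{
    unsigned iterations = 0;
    while (k != 0) {          /* key-dependent loop bound */
        k >>= 1;
        iterations++;
    }
    return iterations;
}

/* Fixed-length variant: always visits `width` bit positions (width <= 64),
 * independent of the value of k. */
static unsigned fixed_ladder_iterations(uint64_t k, unsigned width)
{
    unsigned iterations = 0;
    for (unsigned i = 0; i < width; i++) {
        (void)((k >> i) & 1u);   /* a real ladder step would use this bit */
        iterations++;
    }
    return iterations;
}
```

In this sketch, scalars with leading zero bits finish the first loop earlier, which is the type of variation such an attack can exploit; the second loop removes that particular dependence on the key.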
Although no micro-architectural components are targeted directly in
time-driven attacks, the observed timing difference will always
be a combination of control flow variation and influences of
micro-architectural components. In some cases, the attack relies on knowledge
of the algorithm and its interaction with the architectural components
to interpret the timing variation. We therefore classify them as a
subset of the micro-architectural attacks.
Trace-driven attacks Trace-driven attacks have been described in
which side channels leak enough information to reconstruct the whole
execution trace of the attacked algorithm.
Percival [74] showed that on SMT processors, data access traces of
an RSA encryption can be used to partially reconstruct the private key,
allowing the modulus to be factored easily. Aciiçmez et al. were the
first to mount a trace-driven attack against RSA targeting the branch
prediction unit [3, 4]. They showed that a carefully written spy process
running simultaneously with an RSA process can recover almost all key
bits during one signing. Additionally, they showed that blinding or ran-
domization provide no protection against this type of branch prediction
attack. Aciiçmez was also the first to mount an attack against the
instruction cache [5]. He showed that these attacks are as efficient as attacks
targeting the branch prediction unit. Both instruction cache and branch
prediction attacks reveal the key-dependent control flow in the appli-
cation. Brumley and Hakala [27] demonstrated a cache template attack
on the elliptic curve portion of OpenSSL.
Access-driven attacks Access-driven attacks have also been dis-
cussed, in which enough side channel information is available to recon-
struct the access patterns to architectural components such as caches,
branch prediction tables, and execution units [7, 27, 49, 71, 73, 76, 85].
Neve and Seifert [71] set up an access-driven attack on the last round
of AES, and showed that a complete 128-bit key can be recovered with
fewer than 20 encryptions. Osvik et al. [73] also demonstrated several
efficient attacks on the shared data cache. Uhsadel et al. [85] leveraged
hardware performance counter information about data cache hit
and miss rates to mount an attack on AES. Ristenpart et al. [76] showed
that it is possible to instantiate virtual machines on the Amazon EC2
cloud on the same physical machine as a target virtual machine and
mount a cache-based attack over virtual machine boundaries. Brum-
ley and Hakala [27] introduced a framework for automated templating
cache-timing data analysis based on Hidden Markov Models. They
successfully mounted a template-based attack on the elliptic curve portion
of OpenSSL. Aciiçmez et al. [7] then used this framework to mount a
successful instruction cache attack on OpenSSL’s DSA implementation.
Gullasch et al. [49] present an attack that is capable of recovering a
full AES-128 key after observing as few as 100 encryptions in almost
real-time. They improve on the previous attacks by not requiring any
knowledge about the used plaintexts. Tromer et al. [84] mount another
access-based attack on OpenSSL and dm-crypt without any knowledge
about ciphertext or plaintexts. In the case of dm-crypt, a disk encryp-
tion tool, the attack only takes 800 writes to disk for a total runtime of
65 milliseconds.
Covert channels are similar to side channels: they use the
same signals and components as side channels for communication.
Contrary to side channels, however, covert channels have both an active
sender and an active receiver on the channel, whereas a side channel only
has an active listener and a target application that is not intentionally
sending information. Covert channels are typically used to circumvent
security policies put in place by operating systems or hypervisors and
allow applications to communicate indirectly.
Consider for example the following simple covert channel on a mobile
phone. The phone runs two applications: an organizer, which has
access to personal information but is not allowed network access, and a
browser, which is allowed to connect to the network but is not allowed
to access personal information. The access rules are set up in such a way
that no personal information is allowed to be sent over the network. In
this scenario the two applications circumvent these rules by using the
phone’s volume setting as a covert communication channel. Because
there are no access restrictions to the volume setting, each application
can query and set the volume level. By setting and reading the volume
setting according to an agreed upon protocol, the organizer app can
send the user's personal information to the browser app. The browser
can then in turn send this information over the network to the attacker,
circumventing the application permissions enforced by the operating
system.
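
The volume-based channel above can be simulated in a few lines. The sketch below is purely illustrative: the 16-level volume range, the function names, and the nibble-wise protocol are assumptions, and the sender and receiver steps are interleaved in one function, whereas two real applications would need an agreed-upon timing or handshake.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated shared resource without access restrictions:
 * a volume level in the assumed range 0..15. */
static unsigned shared_volume = 0;

static void set_volume(unsigned level) { shared_volume = level & 0xFu; }
static unsigned get_volume(void) { return shared_volume; }

/* Transfers len bytes from src (organizer side) to dst (browser side),
 * encoding one 4-bit nibble per volume change. */
static void covert_transfer(const uint8_t *src, uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        set_volume(src[i] >> 4);       /* organizer: send high nibble */
        unsigned hi = get_volume();    /* browser: read it back */
        set_volume(src[i] & 0xFu);     /* organizer: send low nibble */
        unsigned lo = get_volume();    /* browser: read it back */
        dst[i] = (uint8_t)((hi << 4) | lo);
    }
}
```

Neither side ever accesses the other's data directly; all information flows through the volume setting, which is why permission checks on files or network access do not stop it.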
Ritzdorf presents a thorough analysis of covert channels on the Android
platform in his master's thesis [77].
Evans et al. [42] show that a timing covert channel can be used to
leak protected code pointer information. They do this by timing a loop
with a variable iteration count based on the values at a target memory
location.
1.2 Mitigation Strategies
Now that we have presented an overview of the many possible ap-
proaches to attack (cryptographic) algorithms, and have an idea of how
these attacks work, this section discusses the different approaches to
close side channels and to defend against those attacks. We will how-
ever limit ourselves to techniques defending against timing attacks, as
this is the scope of this thesis.
From this section it will be clear that there is no “silver bullet” ap-
proach that will solve the problem of side channels completely. Each
technique has its merits and downsides, and many of them only tar-
get specific types of attacks. Most of the techniques also come with a
significant overhead, and a trade-off between security and application
performance has to be considered.
Figure 1.2 shows the different levels at which protection techniques
against side channels can be applied. They range from the algorithmic
level of abstraction [23, 26, 47, 55, 57], over source code rewriting tech-
niques [68], compiler-based techniques [34, 35, 50], operating system
or hypervisor modifications [25, 66, 89] to techniques at the hardware
level [23, 48, 87, 88] and combinations thereof [25, 89].
Figure 1.2: Side channel mitigation techniques can be applied at different
levels: algorithm, source code, compiler, operating system or hypervisor,
and hardware.
1.2.1 Algorithmic Mitigation Techniques
At the highest level, existing algorithms can be modified in such a way
that they run in constant time, and do not leak any timing information
through side channels [12, 47, 55]. Although an elegant solution, it re-
quires a lot of insight into the algorithm and knowledge about the target
architecture.
Alternatively, a technique called blinding applies a transformation
on the algorithm's inputs in such a way that the algorithm never uses
the original inputs. It is most commonly used for blind signatures [31],
where a signer signs a message without being able to see the content of
the message. Blinding has the additional benefit that any leaked
information only reveals something about the blinded input,
making an attack much harder [57]. Blinding is applied by default in
most applications of RSA encryption.
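
As a rough illustration of blinding applied to an RSA private-key operation, consider the sketch below. It assumes toy parameters (n = 3233, e = 17, d = 2753 in the usage note) and operands below 2^32 so that products fit in 64 bits; the function names are hypothetical, and the blinding factor r is passed in explicitly, whereas a real implementation would draw a fresh random r coprime to n for every operation.

```c
#include <stdint.h>

/* Plain square-and-multiply; its timing depends on exp. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1u)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

/* Modular inverse via the extended Euclidean algorithm.
 * Only valid when gcd(a, mod) == 1. */
static uint64_t mod_inverse(uint64_t a, uint64_t mod)
{
    int64_t t = 0, new_t = 1;
    int64_t r = (int64_t)mod, new_r = (int64_t)(a % mod);
    while (new_r != 0) {
        int64_t q = r / new_r;
        int64_t tmp = t - q * new_t; t = new_t; new_t = tmp;
        tmp = r - q * new_r; r = new_r; new_r = tmp;
    }
    if (t < 0)
        t += (int64_t)mod;
    return (uint64_t)t;
}

/* Computes c^d mod n, but the exponentiation only ever sees c * r^e. */
static uint64_t rsa_private_blinded(uint64_t c, uint64_t d, uint64_t e,
                                    uint64_t n, uint64_t r)
{
    uint64_t blinded = (c * mod_pow(r, e, n)) % n;       /* blind the input */
    uint64_t blinded_result = mod_pow(blinded, d, n);    /* equals c^d * r  */
    return (blinded_result * mod_inverse(r, n)) % n;     /* unblind         */
}
```

The exponentiation with the private exponent still has input-dependent timing, but the attacker no longer knows or controls the value it operates on, which is what makes correlating timings with chosen inputs harder.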
For AES, Brickell et al. [26] suggest permuting the S-boxes to protect
against cache attacks. Others have designed libraries containing
constant-time implementations of floating point operations [12].
Finally, side channel resistant cryptography has been suggested as
a fundamental solution against side channel attacks [41]. These are
cryptographic algorithms that are still considered (and proven to be)
safe even if a bounded amount of information is leaked during
computation. This approach is of course not backwards compatible, as the
current standards in cryptographic algorithms would have to be re-
placed by the new algorithms. Side channel resistant cryptographic
algorithms should, however, be something to be considered for future
cryptographic standards.
1.2.2 Source Code Protection
A different technique is to apply mitigation transformations directly on
the source code of the application. Using bit masks, much of the varia-
tion due to control flow can be removed from the code [68], defending
against attacks based on both instruction caches and branch prediction
units. Others have suggested rewriting branches to indirect jumps to
defend against branch prediction attacks [8]. Askarov et al. [19] suggest
an epoch-based approach to protect against time-driven attacks. They
apply time quantization to divide time into discrete intervals and only
allow algorithms to return at these boundaries, reducing the amount of
information leaked.
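
A minimal sketch of the bit mask idea is given below. The function name ct_select is an assumption for illustration and is not taken from [68]; the sketch replaces a conditional assignment by mask arithmetic so that no branch depends on the condition.

```c
#include <stdint.h>

/* Returns a if cond is non-zero and b otherwise, without branching on cond. */
static uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b)
{
    /* nz is 1 when cond != 0: either cond or its two's complement
     * negation has the top bit set for every non-zero value. */
    uint32_t nz = (cond | (0u - cond)) >> 31;
    /* mask is all ones when nz == 1 and all zeros when nz == 0 */
    uint32_t mask = 0u - nz;
    return (a & mask) | (b & ~mask);
}
```

A statement such as `if (bit) x = y;` can then be written as `x = ct_select(bit, y, x);`. Whether the generated machine code is actually branch-free still depends on the compiler and has to be checked in the generated machine code.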
Compared to algorithmic approaches, these source code transformations
are not application-specific and do not target a single algorithm.
The same transformations are, in theory, general enough to protect
a wide range of applications. One downside to this approach is that
carefully crafted transformations can be undone at a later stage
by an optimizing compiler. The generated machine code therefore always
has to be checked to verify that the transformations “survived” the
compiler optimizations.
1.2.3 Compiler-Based Techniques
Compiler-based techniques are implemented by adding
compilation steps to the compiler that transform the code in such a
way that it does not leak timing information.
Compiler-based approaches have the benefit over pure algorithmic
and source code solutions that security measures do not have to be
taken for each individual application, but are implemented in the
compiler once, by a (side channel) security expert. Implementing these
security transformations in a centralized place by an expert significantly
reduces the chance of implementation mistakes. Furthermore, it relieves
application programmers from having to worry about side channels,
increasing their productivity.
Coppens et al. [34] suggested modifying the compiler back end to
generate side channel secure software. Similar to source code
transformations, variation due to control flow is removed, effectively protecting
against instruction cache and branch prediction attacks. Compared to
source code protections, compiler-based techniques have the additional
benefit of having full control over the generated code and can apply
transformations at a much lower level of abstraction. A compiler can
for example apply fine-grained transformations at the machine code level,
which is not possible at the higher abstraction level of the source code
techniques. On the other hand, the overhead of this compiler approach
can be significant.
While the aim of Coppens et al. [34] is to eliminate all key-dependent
timing variation, others have tried masking or introducing noise
to significantly reduce the information leaked [35, 50].
Compiler techniques have also been used to prevent power analysis
attacks. Bayrak et al. [22] implemented a compiler extension that au-
tomatically detects instructions that are sensitive to a power attack and
protects them using techniques such as random precharging or boolean
masking.
1.2.4 Operating System Techniques
Several approaches implemented at the operating system level have also
been suggested. Because they have an overview of all running pro-
cesses, they are an ideal place to implement resource isolation tech-
niques.
Zhang and Reiter [89] implemented an operating system extension
that allows a virtual machine to defend itself against cache-based attacks
on the cloud. It does this by actively cleansing the shared caches when
an attack is detected. The overhead during normal execution is very
low.
Braun et al. [25] leverage the operating system to fix the execution time
of tagged methods and to provide resource isolation. They eliminate
timing variation by padding methods tagged by the programmer. This
effectively mitigates time-driven attacks. Furthermore, they rely
on the operating system for resource isolation to defend against trace-driven
and access-driven attacks based on shared resources between processes.
At the hypervisor level Lefray et al. [66] show that achievable bit
rates of side channels can change dramatically based on what resources
are shared between virtual machines. Based on this observation they
introduce a side-channel-aware placement algorithm for virtual
machines based on a new information leakage metric.
1.2.5 Hardware Techniques
At the lowest level, side-channel-aware hardware components can be
designed to protect against side channel attacks. Many of the micro-
architectural side channels are caused by hardware optimizations to
make the application run faster for specific cases. Caches make
frequently used data accessible faster than other data, store-to-load
forwarding speeds up memory loads by forwarding data from unretired
store instructions, the integer division instruction terminates early
depending on its operands, branch predictors are used to speculatively
execute instructions, etc. Each of these features influences the
execution time of an application, and can therefore contribute to information
leakage through side channels.
Specialized hardware allows the secure execution of cryptographic
algorithms (or other applications dealing with sensitive information),
that would leak side channel information on conventional hardware. A
typical example is the Intel AES (Advanced Encryption Standard) in-
struction set, an extension to the x86 instruction set that provides com-
plete hardware support for AES encryption [48].
Similarly, certain ARM architectures provide an extension to let the
software choose whether or not multiplications should be executed in
constant time [15]. Others have suggested specially designed caches to
prevent cache-based side channel attacks [87, 88]. These caches can be
used to store sensitive data used by encryption algorithms.
To prevent trace-based attacks that abuse hyper-threading on SMT
processors [74, 87], Percival [74] suggests simply disabling that feature
as an easy temporary solution.
Hardware-based techniques to protect against power attacks have
also been suggested. Regazzoni et al. [75], for example, present a
masked S-box implementation on an FPGA to prevent power attacks
against AES. Masking decreases the correlation between the power
consumption of the device and the processed inputs. Another ap-
proach presented by Tiri and Verbauwhede [83] suggests a new design
methodology for crypto processors based on building blocks with a
power consumption profile independent of the data or key.
The obvious drawback of these techniques is that they require spe-
cialized hardware, which is not always feasible. Additionally, applica-
tions have to be designed specifically to make use of this hardware and
at design time it is not always clear on what hardware an application
will run. A final point to consider about hardware-based mitigation
techniques is that there is often a significant delay between the start of
the hardware design and the moment the final product reaches the market.
Software-based solutions on the other hand can, in many cases, be
adopted almost immediately.
1.2.6 Theoretical Analysis
Others have analyzed side channel security from a theoretical stand-
point. Standaert et al. [81] present a framework for theoretical analy-
sis of both cryptographic implementations and key recovery attacks in
the context of power analysis. Köpf and Dürmuth [60] show that the
amount of information that can be leaked by a deterministic side channel
is bounded from above by $|O| \log_2(n+1)$ bits, where $n$ is the
number of side-channel measurements and $O$ the set of possible
observations. They reduce the information leakage by applying a bucketing
technique to reduce the number of possible observations.
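
To get a feeling for this bound, consider a worked instance with assumed numbers that do not come from [60]: an attacker who performs $n = 1000$ measurements against an implementation whose execution times fall into $|O| = 4$ buckets.

```latex
\[
  |O| \log_2(n+1) = 4 \cdot \log_2(1001) \approx 4 \cdot 9.97 \approx 39.9 \text{ bits}
\]
```

Under the same assumptions, halving the number of buckets to $|O| = 2$ halves the bound to roughly 19.9 bits, which illustrates why reducing the number of possible observations limits the leakage.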
1.3 Leaked Information and Security Guarantees
Some of the aforementioned attacks derive bits of the secret key directly
from the reconstructed traces and patterns or from the observed tim-
ing [4, 49], while others collect statistical information with which the
search space of a brute-force attack can be pruned [59]. In the latter
case, the pruning will be more effective when more useful information is
leaked. Regarding this leaked information, most countermeasures aim
at limiting the amount of information leaked [22, 60]. Other counter-
measures try to add noise to the information [57]. To illustrate the differ-
ence between limiting information leakage and adding noise, imagine a
decryption algorithm with two possible keys. The two conditional sta-
tistical distributions of the algorithm’s execution time (one for each key)
are depicted in Figure 1.3(a). Very few measurements of the execution
time suffice to extract the key reliably. When the algorithm is rewritten
such that both conditional distributions become identical, as depicted
in Figure 1.3(b), no information leaks at all. When the distributions are
not identical but overlapping, as in Figures 1.3(c) and 1.3(d), less
information is available to an attacker. The amount of useful information
depends on the whole conditional distributions, not only on their expected
values. So depending on the attack context, more or less difference in
expected execution times can be tolerated.

Figure 1.3: Conditional execution time distributions of a cryptographic
algorithm given two possible secret keys. Panels (a) to (d) each plot density
against execution time for the distributions p1 and p2; in panel (b), p1 = p2.

When little noise is present
because an attacker has full control over all tasks running on a local
machine, an acceptable level of security can only be reached by making the
execution times obtained with different keys nearly identical, as in
Figure 1.3(c). But when a lot of noise is present due to variable network
delay and due to the unknown load of a cloud processor, a larger dif-
ference in expected execution times as depicted in Figure 1.3(d) can still
provide the same level of security. The effect of such noise has been
quantified in literature [36, 76]. As can be expected, the signal-to-noise
ratio has a major influence on the amount of useful information an at-
tacker can extract, and on the effort he has to invest to mount an attack.
Smaller signal-to-noise ratios require the attacker to perform more
measurements, up to the point where it can become infeasible to mount an
attack, e.g., because encryption keys are refreshed frequently.
1.4 Contributions and Publications
In this dissertation, we investigate how compiler transformations can be
used to defend against timing attacks. We do so for both static compilers
and dynamic JIT compilers.
We explore the trade-off between security, performance, and porta-
bility of compiler-based defenses against time-based side channel at-
tacks on x86 processors. We consider attacks based on direct measure-
ments of the execution time of a cryptographic process, as well as in-
direct attacks that measure the influence of a cryptographic process on
the execution time of another process controlled by the attacker.
In our approach, instead of introducing noise or masking timing
variation, we try to eliminate all timing variation altogether, effectively
closing the timing side channel. Our approach inherently provides
complete protection against branch prediction attacks and instruction
cache attacks and we provide statistical evidence that no differences
in execution time can be observed in the protected application. To this
end we apply compiler transformations to get rid of both key-dependent
control flow and key-dependent data flow in the application. This work
was based on the compiler transformations presented by Coppens et al.
[34] and was done partly in cooperation.
To overcome the limitations of static compiler approaches, we present
a dynamic approach for mitigating timing side channels. Using profile
information collected offline on representative inputs, sensitive code
fragments in an application are detected and tagged automatically. At
run time, a just-in-time (JIT) compiler then generates protected code
based on the profile information. This enables dynamic adaptation of
the level of protection, and hence of the efficiency of the protected code,
to changing circumstances, including the exact processor on which the
application is being executed or to which, e.g., it is migrated.
Furthermore, by centralizing security transformations in the com-
piler, we separate (side channel) security requirements from implemen-
tation details of the software. Instead of obligating the programmer to
apply security measures in algorithms, source code or programming
tools, protection techniques are implemented at compiler level by an
expert programmer with thorough knowledge about side channel se-
curity. This increases both programmer productivity and application
security in the context of side channels.
Our main contributions are as follows:
• The study of several mitigation techniques against timing vari-
ations caused by data flow behavior on modern x86 processor
pipelines.
• A demonstration of the fact that static compilers can provide
strong protection only at a high performance overhead.
• A demonstration of the fact that weaker protection, without
portable security guarantees, can be provided at lower levels of
overhead.
• We present a JIT compiler approach that enables the protection of
bytecode applications and libraries at multiple security levels, and
that can switch between different protection levels dynamically
based on the current security context.
• We present an analysis approach to automate the selection of code
fragments to transform in order to reduce the total overhead of the
transformations. This approach delivers the necessary input to the
aforementioned JIT compiler approach, but can also be used by
developers in case they want to fix timing side channels manually.
• We evaluate a prototype implementation in JikesRVM on two
cryptographic algorithms from real-life security libraries, demonstrating
that the performance overhead of the pre-existing state of
the art can be reduced by 75%-82%.
The research presented in this thesis is based on the following pub-
lications:
• Compiler mitigations for time attacks on modern x86 processors
Jeroen Van Cleemput, Bart Coppens and Bjorn De Sutter
In ACM Transactions on Architecture and Code Optimization (TACO),
Volume 8 Issue 4, Article No. 23, January 2012 [86].
• Adaptive Compiler Strategies for Mitigating Timing Side Channel
Attacks
Jeroen Van Cleemput, Bjorn De Sutter and Koen De Bosschere
submitted to IEEE Transactions on Dependable and Secure Computing.
1.5 Outline
The remainder of this dissertation is structured as follows. Chapter 2
gives a detailed overview of the various causes of execution time vari-
ation on modern architectures. Static compiler mitigations to protect
against timing attacks are discussed in Chapter 3. In Chapter 4 we
present a dynamic mitigation approach based on JIT compilation and
run-time profiles. Chapter 5 describes our prototype implementation.
Finally, conclusions are drawn in Chapter 6 together with a discussion
about future work.
Chapter 2
Execution Time on Modern
Processors
In this chapter we discuss the various causes of key-dependent execu-
tion time variation on modern x86 processors. We start by explaining
how key-dependent control flow influences execution time and show
that an attacker can indeed extract useful information from this tim-
ing variation. Then we discuss the various data-dependent causes for
execution time variation and how they differ depending on the micro-
architecture.
2.1 Control Flow
In general, control flow has the biggest impact on execution time varia-
tion. Depending on the application input different paths are taken in the
control flow graph (CFG), different instruction sequences are executed,
and consequently a possible difference in program execution time can
be measured. Consider for example the code fragment in Listing 2.1,
which implements a simple modular exponentiation algorithm.
The inputs to this algorithm are the exponent and the modulus n. In
cryptographic applications such as RSA, the (exponent, modulus) com-
bination forms a public or private key. In the case of a private key,
this is sensitive information that we do not want an attacker to know.
To calculate the modular exponentiation, the algorithm iterates exactly
keysize times over the loop body (lines 3-8). Depending on the value
of the exponent variable, the branch on line 5 is taken in a different
direction. If the bit at position i in the exponent is set, an additional
1 i = keysize - 1;
2 result = 1;
3 do {
4     result = (result*result) % n;
5     if ((exponent >> i) & 1)
6         result = (result*a) % n;
7     i--;
8 } while (i >= 0);
Listing 2.1: Modular exponentiation algorithm
1  public static boolean equals(byte[] a1, byte[] a2)
2  {
3      if (a1 == a2)
4          return true;
5
6      if (null == a1 || null == a2)
7          return false;
8
9      if (a1.length == a2.length)
10     {
11         int i = a1.length;
12         while (--i >= 0)
13             if (a1[i] != a2[i])
14                 return false;
15         return true;
16     }
17     return false;
18 }
Listing 2.2: Array compare function from GNU Classpath
multiplication and division are executed. The total execution time of this
algorithm thus depends on the number of 1 bits in the exponent. A simple
timing attack on this algorithm would reveal the number of 1 bits in
the secret key.
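
The dependence on the number of 1 bits can be made explicit by counting the modular multiplications performed. The sketch below follows the structure of Listing 2.1 but is an illustrative variant: the function name and the counter parameter are assumptions, and it assumes 1 <= keysize <= 64 and operands below 2^32 so that products fit in 64 bits.

```c
#include <stdint.h>

/* Square-and-multiply in the style of Listing 2.1.
 * *mults receives the number of modular multiplications performed. */
static uint64_t modexp_counting(uint64_t a, uint64_t exponent, uint64_t n,
                                int keysize, unsigned *mults)
{
    uint64_t result = 1 % n;
    *mults = 0;
    for (int i = keysize - 1; i >= 0; i--) {
        result = (result * result) % n;   /* executed in every iteration */
        (*mults)++;
        if ((exponent >> i) & 1u) {       /* key-dependent branch */
            result = (result * a) % n;    /* extra work for each 1 bit */
            (*mults)++;
        }
    }
    return result;
}
```

In this sketch the count always equals keysize plus the number of 1 bits in the exponent, so an exponent with many 1 bits performs measurably more work than one with few.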
Additionally, the input-dependent control flow makes this code
fragment potentially vulnerable to branch prediction attacks and in-
struction cache attacks discussed in Section 1.1.
Non-manifest loops, i.e., loops where the iteration count is unknown
at compilation time, can leak timing information as well. A typical
example is an array compare function such as the one from GNU Classpath
in Listing 2.2. Depending on the values of the two parameters a1
and a2, the method returns at different points in the code, resulting in an
observable execution time difference. If one of the corner case tests on
lines 3 and 6 evaluates to true, the code returns immediately, resulting in
a very short execution time. Additionally, if the two arrays have the same
length, the loop on lines 12-14 returns as soon as two byte values at a
certain position in the two arrays differ. Consequently, the execution time
of this loop correlates directly with the number of consecutive equal bytes,
counted from the end of the two arrays. In practice, implementations like the one in Listing 2.2 have
been exploited to attack HMAC signature verification algorithms [65]
and load custom software on Microsoft’s Xbox 360 console [64].
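
For comparison, a comparison routine without early exit can be sketched as follows. This is a generic illustration in C rather than the GNU Classpath code; the name ct_equals is an assumption, and the sketch assumes both buffers have the same, publicly known length.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the two buffers are equal and 0 otherwise.
 * The loop always visits all len bytes, regardless of where they differ. */
static int ct_equals(const uint8_t *a1, const uint8_t *a2, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (uint8_t)(a1[i] ^ a2[i]);   /* accumulate any difference */
    return diff == 0;
}
```

Because every byte pair is always processed, the number of loop iterations no longer depends on the position of the first mismatch. As with other source-level rewrites, the compiled code still has to be inspected to confirm that the compiler did not reintroduce an early exit.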
Next to the (loop) branch instructions discussed above, other con-
trol flow structures such as switch statements or potential exception
throwing instructions (PEIs) can cause timing variation as well. At any
point during execution a PEI might throw an exception and divert
control flow to the appropriate exception handling code, either in the same
method or in a method higher up the stack.
From the above examples it should be clear that key-dependent
control flow in algorithms developed without side-channel security in
mind can reveal a significant amount of information to an attacker.
2.2 Data flow
This section discusses the variations in execution time on modern x86
processors caused by data flow properties. Execution time variations
relating to control flow and caches are neglected
in the remainder of this section unless explicitly stated otherwise. This
section also discusses the relevant differences with regard to data flow
between different recent x86 processors.
We performed experiments on a dual core Intel Core 2 Duo E8400
that lacks hyperthreading (HT), on a dual CPU 2x4 core Intel Xeon
E5620 with HT, and on a single core Intel Atom N280 with HT. On any of
these processors, the execution time of a program depends heavily on
the data dependencies between successive instructions in a program.
As the (true) data dependencies on the critical path in a data depen-
dency graph (DDG) of the executed code put a fundamental limit on the
instruction-level parallelism (ILP) that can be exploited by a processor,
having fewer dependencies implies that more instructions get executed
per cycle (IPC).
By and large, the data dependencies in a program are fixed statically,
as they are determined by the registers occurring in the instruction
encodings. The one exception to this observation is formed by memory
operations. Whether a store and a consecutive load depend on each
other depends on the addresses used in the operations. The influence of
this dependence on execution time is discussed in detail in Section 2.2.3.
Furthermore, the length of the critical path in a DDG is determined
by the individual execution latencies of the operations on it. Most
non-memory operations have fixed latencies. There are some exceptions,
however, for arithmetic instructions that are so complex that they are
implemented by means of microcode instruction sequences. Some of these
sequences allow an early exit (a.k.a. early termination), whose
influence on execution time is discussed in Section 2.2.2.
First we briefly discuss how conditions occurring in conditional
move instructions do not influence execution time, as we will rely on
this property in the mitigating transformations presented in
Chapters 3, 4, and 5.
2.2.1 Conditional Moves
Consider the three following x86 loop bodies:
body 1: mov ecx, edi
        add eax, ebx
body 2: mov eax, edi
        add eax, ebx
body 3: test edx, edx
        cmoveq eax, edi
        add eax, ebx
In all loop bodies, the last instruction adds the value in ebx to that in
eax. When the first loop body is executed in a loop, eax serves as an
accumulator to which the value in ebx is added repeatedly. So all ad-
ditions in subsequent iterations depend on each other. In the second
loop, each iteration starts with a fresh value being copied into eax. The
register renaming pipeline stages in out-of-order processors detect this,
and the second loop gets executed up to 40% faster as a result.
In the third loop body, test sets condition flags depending on the
loop-invariant value in edx. If that value is zero, the cmoveq instruction
is executed, copying the value of edi into eax. In that case, the third
body performs the same computation as the second body. If the value
of edx is not zero, the third body performs the same computation as the
first body. Despite these two different behaviors, we observed identical
execution times when executing this third loop body in an unrolled loop
with edx either fixed to zero or to some non-zero value.
This experiment demonstrates that the values of the guard condi-
tions of conditional moves do not influence execution timing. As dis-
cussed extensively by Coppens et al. [34], this property fundamentally
results from the fact that register renaming is implemented in the in-
order stages of pipeline frontends. As we do not expect that aspect of
pipeline design to change in the near future, we can safely rely on this
fixed execution time of conditional moves to implement our transfor-
mation techniques discussed in Chapter 3.
2.2.2 Variable Latency Arithmetic Instructions
Variable-latency arithmetic instructions such as multiplication are
known to be a side channel [46]. For that reason, some architectures
offer the option to disable the variable latency through control
registers [17]. Studying several x86 documents [33, 43], we found one
class of commonly used arithmetic instructions with variable
instruction latency: the signed and unsigned, 32-bit and 64-bit
variations of the integer division. These instructions are also used to
compute remainders of divisions, and hence play a role when modulo
arithmetic is needed, as in many cryptographic algorithms (see
footnote 1).
Intel documents state that the execution time of an integer division
instruction on its recent out-of-order processors depends on arguments
being zero and on the number of quotient bits that need to be generated.
This number equals the distance in bit positions between the most
significant bits of the divisor and the dividend. Public documentation
also mentions the minimum and maximum latency of the division
instructions on different generations of Intel cores [43, 45], but
further details on the relation between operands and latency are missing.
Footnote 1: While no side-channel attacks exploiting variable-latency
divisions are currently known, it is better to be cautious. Moreover,
variable-latency divisions are conceptually not different from
variable-latency multiplications. The conclusions drawn here can hence
also be used on other architectures that feature variable-latency
multiplications [17], and in the event Intel would decide to implement
variable-latency multiplications in the future.
Figure 2.1: Table showing the different latency classes of the 32-bit unsigned
integer division instruction depending on the most significant true bits in
dividend and divisor. Darker shades correspond to higher latencies.
Panels: (a) Core 2 latency classes, (b) Xeon latency classes.
Horizontal axis: most significant true bit in dividend. Vertical axis:
most significant true bit in divisor. Bit positions run up to 31.
In order to understand that relation and the potential consequences
for side-channel attacks, we measured the execution time of a program
consisting of division instructions executed in a loop on loop-invariant
structured and randomly selected arguments covering the whole 32-bit
unsigned integer range. We measured execution times with operands
which are exact powers of two and compared this to the execution times
using random numbers of the same magnitude. We found the latencies
indistinguishable.
As we can only measure the execution time and cycle count of that
whole program, we cannot give exact latencies for individual divisions.
We can, however, distinguish different latencies by comparing the total
execution times. Figure 2.1 shows the results of this experiment for the
Core 2 and Xeon for 32-bit unsigned divisions. Each shade corresponds
to a latency, with darker shades corresponding to higher latencies.
Several observations can be made about these results. First of all, the
Intel documentation is correct. Secondly, there are a limited number of
distinct latencies: six for the Core 2 processor, and seven for the Xeon
Nehalem core. Third, due to the regularity of the results and the
existence of the bsr (bit-scan-reverse) instruction that computes the
location of the most significant bit set, it is rather easy for a given
processor to write code that computes the latency class given the
divisor and dividend values. Fourth, the exact latencies and their
patterns differ from one architecture generation to the other.
The results for other division instructions are similar, the main dif-
ference being that for 64-bit divisions, about twice as many different la-
tency classes are observed. So the four above observations still hold. In
Section 3.2, we will consider some mitigation strategies based on these
observations.
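
The quantity that drives the latency classes above can be sketched in a few lines of C. The sketch below is illustrative only: the names msb_index and msb_distance are hypothetical, the portable loop merely stands in for the bsr instruction, and the mapping from the computed distance to an actual latency class is processor-specific and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0..31) of the most significant set bit. The portable loop
   stands in for the x86 bsr instruction; like bsr, the result is not
   meaningful for v == 0, so that case is rejected. */
static unsigned msb_index(uint32_t v)
{
    assert(v != 0);
    unsigned idx = 0;
    while (v >>= 1)
        idx++;
    return idx;
}

/* Distance in bit positions between the most significant set bits of
   the dividend and the divisor. Clamped to 0 when the divisor has the
   higher bit (assumption made for this sketch). */
static unsigned msb_distance(uint32_t dividend, uint32_t divisor)
{
    unsigned hi_dividend = msb_index(dividend);
    unsigned hi_divisor = msb_index(divisor);
    return hi_dividend > hi_divisor ? hi_dividend - hi_divisor : 0;
}
```

For instance, msb_distance(256, 16) yields 4, since the most significant bits sit at positions 8 and 4.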
2.2.3 Interaction Between Memory Operations
Consider the following program fragment:
loop body: mov dword [ebx], 2   // store
           add eax, [ecx]       // load
When this fragment is executed in a loop, it will repeatedly store
the value 2 at the memory location to which register ebx points, and it
will repeatedly add to eax the value at the memory location to which
ecx points. Since only two memory locations are touched in this loop,
the cache will not influence the timing significantly when the loop has
enough iterations. A number of other micro-architectural features do
influence the execution time of such a loop, however. Some of these
features only relate to static properties such as the instruction mix
and instruction ordering in the assembly code. Their influence on the
execution time of a program will hence be constant over all possible
program executions. Consequently we can safely ignore these with regard
to the time-based side channels.
Other features do depend on the actual memory locations accessed
and will cause a different program execution time for different program
inputs, causing potential information leakage. The most important such
features present on some but not all Intel processors are optimistic and
pessimistic load bypassing, store forwarding, 4K aliasing, memory
disambiguation (i.e., conflict speculation), alignment, partial
aliasing, and bank conflicts. While code optimization manuals document
these features [40, 52, 53, 80], no specification of their exact
implementation, interaction and timing behavior is available. Moreover,
their implementation and their behavior differ significantly from one
architecture generation to the other. In addition, the influence on
execution time depends to a large extent on the arithmetic code that
surrounds the memory accesses. Assuming no cache misses or branch
mispredictions, the execution progress is mainly limited by saturating
buffers in the instruction pipeline that cause pipeline bubbles or
stalls. Which buffers cause stalls first, and hence which specific
slowdown is experienced, depends on the instruction mix as well as on
buffer sizes.
In summary, it is very difficult to predict the execution speed of
memory access streams. In fact, it is even difficult to pinpoint the ex-
act reasons for slowdowns in observed executions.
For example, we measured the execution time of the above loop
body storing aligned 4-byte values on an Intel Core 2 Duo for different
loop-invariant base addresses in ebx and ecx. Two different execution
times were observed, depending on the offset between the load and
store addresses, and on their alignment. Figure 2.2(a) depicts the
precise relation we observed. Even for this simple loop, we cannot
explain this relation completely. Most of the light area (i.e., faster
execution) and the first, dark column can be explained by means of load
bypassing. When the load and store addresses differ, load bypassing
speeds up the program. When they are the same, there is a dependency
that slows down the program. The fact that the same behavior is
observed for all offsets modulo 64, i.e., the fact that the behavior is
periodic with period 64, indicates that the load bypassing is
pessimistic and that only bits 3 to 5 of the addresses are used to
determine if load bypassing is allowed. However, for the slowdowns
observed in the columns 4, 52, 56, and 60, we have no satisfactory
explanation. We suspect that they are caused by pipeline features that
optimize the handling of (unaligned and partially overlapping) 64-bit
and 128-bit (SSE) memory accesses, but we cannot confirm this. We
studied several Intel manuals, collected many program traces including
all possible performance counters (including ones that count store
buffer saturation, the stall cycles resulting from it, the number of
overlapping loads and stores, the numbers of loads and stores accessing
multiple cache lines, etc.) and contacted Intel engineers, but none of
these information sources provided satisfying explanations.
Data dependencies through memory do not only occur on complex
out-of-order architectures. Even on the less complex Atom architecture
we observe timing variations due to data dependencies through mem-
ory. Consider the following program fragment:
loop body: mov byte [ebx], 2   // store
           add al, [ecx]       // load
Instead of 4-byte words, this fragment accesses individual bytes.
As this is an in-order architecture, load bypassing cannot be the cause
of difference in execution time depending on addresses. However, on
this processor we also observe different timings, as visualized in Fig-
ure 2.2(b). In this case, the difference in timing seems to be caused by
different ways of overlapping of forwarded data. Intel documentation
only states that different latencies can occur when accessing multiple
bytes within a 4-byte word. Clearly, the differences for this code and
processor occur along very different patterns.
We conclude this section by pointing out that each of the two
visualized patterns occurs on only one processor architecture. The
first pattern of Figure 2.2 as observed on the Core 2 Duo processor is
completely absent when running the same 4-byte memory access
microbenchmark on the Xeon Nehalem (which has larger buffers and a
different memory hierarchy architecture) or on the Atom processor
(which is in-order). Vice versa, the second pattern of Figure 2.2 as
observed on an Atom processor is completely absent on the two
out-of-order cores.
Figure 2.2: Execution time differences on a Core 2 Duo and an Atom processor
of a microbenchmark loop with 32-bit and 8-bit load and store instructions,
respectively, executed for varying displacements between the accessed
locations and for varying alignments of the addresses.
Panels: (a) Core 2 Duo 4-byte access latency classes, (b) Atom byte-access
latency classes. Horizontal axis: displacement between the memory accesses.
Vertical axis: byte alignment of the store address.
Chapter 3
Static Compilers
In Chapter 2 we showed that both input-dependent control flow and
input-dependent data flow in an application contribute to the
observable execution time variation. In this chapter we discuss how
static compiler transformations can be used to reduce and even
eliminate input-dependent timing variation.
We start by giving an overview of existing compiler mitigation
techniques to remove key-dependent control flow [34], after which we
present our own transformations to remove any remaining execution
time variation caused by data flow.
3.1 Control Flow Transformations
Key-dependent control flow transfers are the most obvious cause of
correlation between secret keys and execution time. This correlation
can be avoided by eliminating conditional branches and by fixing loop
bounds. Coppens et al. [34] already discussed how a compiler backend
can exploit the conditional move instruction on the x86 to eliminate
branches by means of if-conversion. The elimination of conditional
control flow transfers also eliminates, in a portable manner, the influ-
ence of secret keys on execution time through the microarchitectural
side channels of branch prediction, instruction caching, and branch
target buffers [3, 4, 5].
We give a short overview of the technique below. In Chapter 4 we
show how this technique can be optimized using runtime analysis and
provide the technical implementation details required to do so.
If-conversion [10] is a common compiler optimization that has been
used successfully to mitigate timing attacks by transforming the control
flow dependencies into data flow dependencies [34]. On architectures
that support full predication, if-conversion deletes branches around
blocks of instructions and replaces them with predicated instructions.
Figure 3.1 demonstrates the basic transformation on architectures that
lack full predication, but that support conditional move instructions
or select instructions, such as the Intel x86 architecture [51] and the
ARMv8 architecture [16]. Such instructions allow the encoding of all
statements starting with if (...) in the if-converted code of Figure 3.1
in a single instruction, of which the latency is typically independent
of the values of the source operands, including the condition.
Original code:
    if (x) {
        y++;
        y += (y*y);
        y = y << 2;
    } else {
        y--;
        p[0]=y;
    }
After if-conversion:
    if (x) y++;
    tmp1 = y*y;
    tmp1 = y + tmp1;
    if (x) y = tmp1;
    tmp1 = y << 2;
    if (x) y = tmp1;
    tmp1 = y - 1;
    if (!x) y = tmp1;
    tmp2=dummy_location;
    if (!x) tmp2 = p;
    tmp2[0] = y;
Figure 3.1: Source code equivalent of if-conversion on architectures that only
support conditional moves.
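
The transformation of Figure 3.1 can be checked with a small, self-contained C sketch. It is an illustration under assumptions, not the compiler backend of [34]: the function names branchy and if_converted are hypothetical, unsigned arithmetic is used so that the always-executed shift and multiplication cannot trigger undefined behavior, and each if (...) statement of the figure is written as a C conditional expression, which a compiler may or may not lower to a conditional move.

```c
#include <assert.h>
#include <stddef.h>

/* Scratch target for stores that must not take effect. */
static unsigned dummy_location[1];

/* Original code with a data-dependent branch. */
static unsigned branchy(unsigned x, unsigned y, unsigned *p)
{
    assert(p != NULL);
    if (x) {
        y++;
        y += y * y;
        y = y << 2;
    } else {
        y--;
        p[0] = y;
    }
    return y;
}

/* If-converted code: every statement executes, conditions only select. */
static unsigned if_converted(unsigned x, unsigned y, unsigned *p)
{
    assert(p != NULL);
    unsigned tmp1;
    unsigned *tmp2;

    tmp1 = y + 1;
    y = x ? tmp1 : y;          /* if (x) y++;        */
    tmp1 = y * y;
    tmp1 = y + tmp1;
    y = x ? tmp1 : y;          /* if (x) y = tmp1;   */
    tmp1 = y << 2;
    y = x ? tmp1 : y;          /* if (x) y = tmp1;   */
    tmp1 = y - 1;
    y = !x ? tmp1 : y;         /* if (!x) y = tmp1;  */
    tmp2 = dummy_location;
    tmp2 = !x ? p : tmp2;      /* if (!x) tmp2 = p;  */
    tmp2[0] = y;               /* store always executes */
    return y;
}
```

Both functions return the same value and leave p in the same state for either value of x; only the store target differs when x is non-zero.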
In addition to time-based side channel attacks, if-conversion pro-
tects against branch prediction attacks and instruction cache attacks.
As an alternative to the use of conditional moves, masking operations
have also been proposed [68]. Masking cannot handle all operations,
however, and also with if-conversion, some instructions require
special care. This includes instructions with side-effects like loads,
stores, and exception-throwing instructions such as divisions. We refer
to the literature for more details [34].
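
As a minimal sketch of what a masking operation looks like, the following C fragment selects between two values with bitwise operations only. It shows one common form of masking and is not necessarily the formulation proposed in [68]; the names masked_select and masked_increment are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a when cond == 1 and b when cond == 0, using only
   arithmetic and bitwise operations. */
static uint32_t masked_select(uint32_t cond, uint32_t a, uint32_t b)
{
    assert(cond <= 1);                      /* predicate must be 0 or 1 */
    uint32_t mask = (uint32_t)0 - cond;     /* 0xFFFFFFFF or 0x00000000 */
    return (a & mask) | (b & ~mask);
}

/* Masked form of "if (cond) y = y + 1;". */
static uint32_t masked_increment(uint32_t cond, uint32_t y)
{
    return masked_select(cond, y + 1, y);
}
```

The limitation mentioned above is visible here: the sketch covers register values only, and says nothing about stores or exception-throwing instructions.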
Special care is also needed for converting procedure calls. A call
instruction contributes not only its own execution time, but also the
execution time of the callee’s body and possibly the bodies of callees
further down in the call chain. To make the call’s execution time
constant, the call instruction is not executed conditionally; only the
code inside the callee is executed conditionally under the control of an
additional predicate parameter. Figure 3.2 demonstrates this: the
conditional call to f1 in the original code is replaced by an
unconditional call to f2, the if-converted replacement of f1. The
callee’s body is adapted such that it only updates the global program
state if the call would have been executed in the original program.
Original code:
    int f1(int x) {
        p[0]=x;
    }
    g(int x, boolean y) {
        if (y) f1(x);
    }
After if-conversion:
    int f2(int x, boolean cond) {
        tmp1=dummy_location;
        if(cond) tmp1 = p;
        tmp1[0] = x;
    }
    g(int x, boolean y) {
        f2(x,y);
    }
Figure 3.2: If-conversion of conditional function calls.
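
A compilable C rendering of the if-converted call of Figure 3.2 might look as follows. This is a sketch under assumptions: the figure's boolean is modeled as an int that is 0 or 1, the functions return void because the figure's bodies return nothing, and p and dummy_location are declared as one-element arrays for the sake of a self-contained example.

```c
#include <assert.h>

static int p[1];                /* global program state written by the callee */
static int dummy_location[1];   /* absorbs the store when the predicate is false */

/* If-converted replacement of f1: always called, and the predicate
   only decides where the store lands. */
static void f2(int x, int cond)
{
    assert(cond == 0 || cond == 1);
    int *tmp1 = dummy_location;
    tmp1 = cond ? p : tmp1;     /* if (cond) tmp1 = p; */
    tmp1[0] = x;
}

/* Caller: the call itself is unconditional. */
static void g(int x, int y)
{
    f2(x, y);
}
```

Calling g(5, 0) leaves p[0] untouched, whereas g(5, 1) writes 5 to p[0]; in both cases the same call and the same store are executed.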
Throughout this dissertation we use the term predicate of a
conditionally executed instruction, basic block or method in
if-converted code to denote the logical value that indicates whether or
not that instruction, basic block or method was to be executed at some
point in time in the original program.
3.2 Mitigation Strategies: Data flow
This section discusses several potential mitigation strategies for
variable-latency arithmetic instructions and for interacting memory
operations.
The most obvious mitigation strategy is to avoid the variable-latency
instructions altogether. For the division instruction, e.g., it is easy to
write a library function that performs the division in constant time,
without division instructions. Replacing each division by a call to such
a function solves the problem, albeit at a significant overhead, as
demonstrated by Coppens et al. [34]. Moreover, for other instructions
such as memory accesses, it is simply impossible to avoid them.
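
To make the idea of such a library function concrete, here is a minimal C sketch of a shift-and-subtract (restoring) division that always runs 32 iterations and uses no division instruction. It is not the routine evaluated in [34]; the name ct_udivmod32 is hypothetical, and whether the compiled code is actually branch-free depends on the compiler and must be checked in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit division with a fixed number of iterations.
   Returns the quotient and stores the remainder in *rem. */
static uint32_t ct_udivmod32(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);
    uint64_t r = 0;      /* running remainder, always smaller than d */
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);              /* bring down next bit */
        uint64_t diff = r - d;                       /* wraps (bit 63 set) when r < d */
        uint64_t keep = (uint64_t)0 - (diff >> 63);  /* all ones when r < d */
        r = (r & keep) | (diff & ~keep);             /* subtract only if r >= d */
        q = (q << 1) | (uint32_t)(~keep & 1u);       /* quotient bit is 1 if r >= d */
    }
    *rem = (uint32_t)r;
    return q;
}
```

The 32 iterations, each with several dependent operations, give a feel for why this approach carries a significant overhead compared with a single division instruction.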
In this section, we study alternative solutions that do include the
execution of variable latency instructions, but in such a way that they
do not influence the total execution time. Two different code rewriting
strategies are studied to achieve this. The first strategy is to rewrite the
code such that the total execution time no longer depends on the vari-
able latency of individual instructions. The second strategy is to rewrite
the code such that all instructions that have variable latency in general,
now get executed in a context that forces a constant latency on them.
3.2.1 Strategy One: Variable-Latency Compensation Code
With this strategy, we try to add compensation code to program fragments
such that the execution of the original variable-latency code and the
compensation code combined always results in the same total execution
time.
When we neglect all instruction caches, instruction branch target
and translation look-aside buffers, as we can after conditional branches
have been eliminated, the total execution time of a sensitive code frag-
ment on a processor is determined by (1) the state of the processor upon
entry of the code fragment, (2) the data consumed by the fragment, and
(3) the DDG of the code fragment itself.
Forcing the entry state to some predetermined state upon entry of
a sensitive variable-latency code fragment is impossible without huge
performance overhead. As a result, it is impossible to make the execu-
tion time of any sensitive code fragment truly constant with low over-
head. However, we don’t need that time to be constant. We only need
it to be independent of the secret data. That implies that variations in
the entry state that do not depend on secret data can be tolerated. As
such, we can reason along the lines of mathematical induction: when
the entry state of a fragment is independent of secret data and when
the fragment’s own execution time and processor state transition is in-
dependent of secret data (because it does not consume secret data or
because we apply a mitigation technique), the exit state is also indepen-
dent. As such, the entry state of the next fragment is independent of
secret data. Clearly, it is possible to make the entry state of the program
independent of secret data. So we don’t need to consider the entry state
to a sensitive program fragment as long as we can take care of the frag-
ment itself.
Changing the data consumed by a sensitive code fragment is generally
not feasible for a compiler when that data concerns the secret key
and input data or some derivatives thereof. Only the algorithm designer
or programmer can do that. So the only remaining option is to transform
the DDG of the code fragment such that irrespective of the data
operated on, the fragment’s execution time is independent of the secret
data on which it operates. We study two potential DDG transformations
to achieve this.
First, we can try to make the total latency of the path in the DDG
containing the variable-latency instruction constant by adding
sequential compensation code such that the sum of the latencies on the
resulting path always equals a constant. Consider the original
variable-latency instruction as depicted in Figure 3.3(a). The two
alternative executions of the DDG in the figure model the fact that
whenever the instruction is executed, it will execute with one of two
latencies. In Figure 3.3(b) sequential compensation code has been added.
This compensation code is visualized in the figure as one dark blue
operation node for the sake of clarity. In practice, however, the
compensation code cannot simply be one instruction, as it needs to
compute what should be compensated before it can actually perform the
necessary compensation.
Figure 3.3: A simple DDG in which visual height models instruction latency
and execution time. Panels: (a) variable-latency instruction, (b) same with
sequential compensation code, (c) same with parallel compensation code.
Secondly, we can try to ensure that the variable-latency instruction
is not on the critical path of the DDG by inserting parallel
compensation code that needs more cycles to execute than the maximum of
the variable latencies. This concept is visualized in Figure 3.3(c). For
a combination of reasons, this parallel compensation code will typically
also have to consist of more than one instruction, as depicted in the
figure. In order to hide the variable latency completely, the parallel
compensation code obviously needs to be executed in parallel with the
variable-latency code. So it must consist of instructions that do not
execute on the same pipeline components. Hence the parallel compensation
code cannot contain the variable-latency instruction itself, or
variations thereof. Furthermore, the very reason for instructions being
implemented with variable latency is a high maximum latency, up to 116
cycles for the integer division on some Intel x86 processors. This
implies that the parallel compensation code has to be built from
multiple, shorter, fixed-latency instructions.
Considering the pipelines of modern x86 processors, several
instructions are available for the parallel compensation of
variable-latency divisions, including sequences of multiplications.
Section 3.3 reports on the results obtained with such sequences.
We should note that for indirect attacks, the parallel compensation
method cannot provide any solution. Such attacks do not measure the
execution time of the process under attack but instead measure that
process’ effect on another process caused by resource contention. This
effect can even be observable through many software layers. For example,
we verified that on an x86 processor with simultaneous multithreading
(SMT), one Java program executing divisions in one Java VM running on
top of an operating system virtualized with Xen and pinned to a specific
SMT core, can approximately measure the occupation of the divider by
another Java program running in another Java VM, in another Xen VM,
but pinned onto the same SMT core.
With parallel compensation code, the occupation of the execution
units used by variable-latency operations remains variable, so parallel
compensation code cannot close this indirect time side-channel.
3.2.2 Strategy Two: Forcing Invariable Latencies
An alternative strategy is to manipulate the operands of variable-
latency instructions or the way in which memory accesses interact to
force a constant latency on them.
In the case of division, for example, one could imagine shifting the
operands before the division, and then shifting back after the division
to force maximum latency. Alternatively, one could try to replace a
division $\frac{A}{B}$ by $\frac{2(A+2B)}{2B} - 2$, which would never
cause an early exit. However, given that most division instructions in
cryptographic software are executed to compute remainders, we have not
found any such sequence which does not suffer from either rounding
errors or from overflow in parts of the operand range. So this is not a
generic solution. In some cases, however, this type of mitigation is
feasible, such as when the divisor is public knowledge. This is often
the case in public key cryptography where the modulus of the encryption
and decryption is part of the public key, and where it is most of the
time a fixed value throughout the computation [67]. The execution time
is then only dependent on the most significant bit in the dividend. To
eliminate execution time variation, it suffices to shift the dividend to
the left such that the most significant bit is always set to true. The
divisor does not need to be shifted, and hence overflow and rounding
errors are not a problem.
Figure 3.4: An extended DDG that includes all latency variations of a
variable-latency instruction.
An alternative, generic solution consists of building a DDG in which
multiple copies of the variable-latency instruction are executed, with
exactly one copy per possible latency, and by forcing the operands of
those copies to be such that all of them have the desired constant
latency. In addition, forcing the operands to appropriate values has to
happen in such a way that at least one copy executes the proper,
original division. Then after all divisions have been computed, code
needs to select the correct result among all computed results by means
of conditional moves.
This solution is visualized in Figure 3.4. Whereas Figure 3.3 de-
picts alternative executions of a DDG, this figure depicts only one DDG.
The inserted nodes at the top model the code that computes suitable
operands from the original operands such that all copies of the original
instruction will have a different, but constant execution time. The node
inserted at the end models the selection of the correct result among all
computed ones. Clearly, this solution will also involve some overhead.
Using the already mentioned bsr instruction, the overhead stays lim-
ited. We evaluate it in Section 3.3.
This solution also provides excellent protection against indirect at-
tacks: as the resource use is now constant, its influence on other pro-
cesses becomes constant as well.
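
The structure of this generic solution can be sketched in C as follows. Everything processor-specific in the sketch is an assumption made for illustration: the three latency classes and their boundaries (distances below 8, below 16, and the rest) are invented, whereas the measurements above show six or seven classes per processor; the filler operands are chosen only to fall into those invented classes; and the final selection uses bit masks where the text uses conditional moves. The names msb_index, latency_class and div_all_classes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CLASSES 3   /* hypothetical number of latency classes */

/* Index of the most significant set bit, 0 for v == 0. Stands in for
   the single bsr instruction mentioned in the text. */
static int msb_index(uint32_t v)
{
    int idx = 0;
    for (int i = 1; i < 32; i++)
        if ((v >> i) & 1u)
            idx = i;
    return idx;
}

/* Hypothetical classification by distance between the most significant
   bits: class 0 below 8, class 1 below 16, class 2 otherwise. */
static uint32_t latency_class(uint32_t a, uint32_t b)
{
    int d = msb_index(a) - msb_index(b);
    return (uint32_t)(d >= 8) + (uint32_t)(d >= 16);
}

/* Executes exactly one division per latency class and selects the
   result of the copy that received the original operands. */
static uint32_t div_all_classes(uint32_t a, uint32_t b)
{
    /* Filler operands known to fall into class k: (1 << 8k) / 1. */
    static const uint32_t filler_dividend[NUM_CLASSES] = { 1u, 1u << 8, 1u << 16 };
    const uint32_t filler_divisor = 1u;

    assert(b != 0);
    uint32_t cls = latency_class(a, b);
    uint32_t result = 0;
    for (uint32_t k = 0; k < NUM_CLASSES; k++) {
        uint32_t is_real = (uint32_t)0 - (uint32_t)(cls == k);  /* all ones for the matching class */
        uint32_t dividend = (a & is_real) | (filler_dividend[k] & ~is_real);
        uint32_t divisor = (b & is_real) | (filler_divisor & ~is_real);
        uint32_t quotient = dividend / divisor;                 /* one division per class */
        result |= quotient & is_real;                           /* keep only the real result */
    }
    return result;
}
```

Whatever the operands, the sketch issues one division from each class, so under the stated assumptions the divider occupation no longer depends on the data.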
For memory operations that have variable latency because their
interaction depends on concrete addresses on which they operate, the
above solutions are not applicable. We can, however, force them to have
an invariable execution time by excluding the unwanted interaction
altogether. For example, in the case of interacting stores and loads,
all of the pipeline optimizations discussed in Section 2.2.3 only matter
when the store and load are in flight in the processor pipeline
simultaneously. So whenever such a pair of instructions occurs in a
program fragment of which the compiler cannot determine that the
addresses operated upon will lead to fixed timing or that they are
independent of any secret data, it suffices to pull the instructions
apart.
This can be done by simply inserting no-ops. It may come as a surprise
that simple no-op insertion can work. The reason is that processor
designers do not expect compilers or programmers to insert no-ops in
sequential code. No-ops are inserted to optimize code alignment, but
those no-ops are merely padding that almost never gets executed.
So the processor designers do not implement any pipeline optimization
to get rid of no-ops in instruction streams [53]. Consequently, the
inserted no-ops result in pipeline bubbles, which can effectively force
loads and stores to be executed separately, without any interaction and
hence without variable execution times as a result. This strategy is also
evaluated in Section 3.3.
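
At source level, pulling a store and a load apart with no-ops could be sketched as below. The sketch relies on the GCC/Clang inline assembly extension and is not the compiler pass evaluated in this chapter; the macro NOP4, the function store_then_load and in particular the number of no-ops (16 here) are assumptions, since the number needed to drain the store from the pipeline depends on the processor.

```c
#include <assert.h>
#include <stdint.h>

/* Four no-ops; the "memory" clobber keeps the compiler from moving
   memory accesses across them (GCC/Clang extension). */
#define NOP4() __asm__ volatile("nop\n\tnop\n\tnop\n\tnop" ::: "memory")

/* Store followed by a load, separated by a hypothetical 16 no-ops. */
static uint32_t store_then_load(volatile uint32_t *store_addr,
                                const volatile uint32_t *load_addr,
                                uint32_t acc)
{
    assert(store_addr != 0 && load_addr != 0);
    *store_addr = 2;                      /* store */
    NOP4(); NOP4(); NOP4(); NOP4();       /* padding between the accesses */
    acc += *load_addr;                    /* load */
    return acc;
}
```

The functional result is unchanged by the padding; only the timing interaction between the two memory operations is targeted.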
With respect to indirect time attacks, this method of inserting no-ops
provides as many guarantees as for direct time attacks. When the
interaction between loads and stores is avoided (and still neglecting
caches for which other solutions exist), their occupation of buffers
and other resources in the pipeline also becomes data-independent. So
resource contention with other processes cannot leak any information.
3.3 Evaluation
In this section, we evaluate the proposed mitigation techniques. This
evaluation includes the mitigation success, the performance overhead,
and the feasibility of implementing the proposed mitigations in a com-
piler backend. All results for variable-latency arithmetic are based on
compiled and handwritten assembly implementations of modular ex-
ponentiation, of which a basic C implementation looks as follows:
    result = 1;
    do {
        result = (result*result) % n;
        if ((exponent>>i) & 1)
            result = (result*a) % n;
        i--;
    } while (i >= 0);
To eliminate the conditional control flow, we relied on the support
implemented in LLVM and discussed by Coppens et al. [34]. The gen-
erated code forms the basis for the experiments described here. In this
code, the division instruction is used to perform the modulo n compu-
tation.
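
For reference, a self-contained C version of this loop and a source-level if-converted counterpart might look as follows. This is a sketch, not the LLVM-generated code used in the experiments: the function names are hypothetical, the exponent is fixed at 32 bits, 64-bit intermediates are assumed so that the products cannot overflow, and the selection is written with a bit mask, which a compiler is free to turn back into a branch.

```c
#include <assert.h>
#include <stdint.h>

/* Square-and-multiply with a key-dependent branch. */
static uint32_t modexp(uint32_t a, uint32_t exponent, uint32_t n)
{
    assert(n != 0);
    uint64_t result = 1 % n;
    for (int i = 31; i >= 0; i--) {
        result = (result * result) % n;
        if ((exponent >> i) & 1u)
            result = (result * a) % n;
    }
    return (uint32_t)result;
}

/* Same computation, with the multiplication always executed and its
   result selected by a mask derived from the exponent bit. */
static uint32_t modexp_if_converted(uint32_t a, uint32_t exponent, uint32_t n)
{
    assert(n != 0);
    uint64_t result = 1 % n;
    for (int i = 31; i >= 0; i--) {
        result = (result * result) % n;
        uint64_t multiplied = (result * a) % n;                  /* always computed */
        uint64_t take = (uint64_t)0 - ((exponent >> i) & 1u);    /* all ones if bit set */
        result = (multiplied & take) | (result & ~take);
    }
    return (uint32_t)result;
}
```

Both variants execute two remainder computations per iteration in the worst case, so the variable-latency division remains the leak that the data flow mitigations have to address.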
To evaluate the effectiveness of the proposed mitigation techniques,
we ran the modular exponentiation code on inputs consisting of
(1) randomly varying modulo values, (2) randomly varying base values,
and (3) four different types of exponents.
In the Zero input set, the exponent in binary format consists of all
zeroes except for the two most-significant bits, which are set to one.
This ensures that the variable result does not remain constant
throughout the whole loop. Having all other bits set to zero ensures
that the conditional code in the original loop will only be executed
twice per loop. This pattern results in very accurate branch prediction.
In the One input set, all bits in the exponent are set to one. This
ensures that the conditional code in the loop is executed in every
iteration in the original, unprotected code. So in total, the
conditional code is then executed 32/64/256 times per loop for
32/64/256-bit numbers. This pattern also results in very accurate branch
prediction. So when this input is fed to a benchmark, much more code is
executed than with the Zero input set, but the branch predictor performs
similarly. In the Regular input set, half of the bits are set to one in
a regular pattern. This implies that the conditional code is executed in
half of the iterations in the original unprotected code, and that the
pattern is predicted very well by the branch predictor. In the Random
input set, half of the bits are set to one as well, but now the pattern
of zeroes and ones is generated by a pseudo-random number generator.
Consequently, this input will result in the same amount of code executed
as for the Regular input set, but branch prediction will be much less
accurate, resulting in more branch misses and higher execution times in
the original, unprotected code.
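
The four exponent types can be illustrated for the 32-bit case with the C sketch below. Several details are assumptions, because the text does not fix them: the Regular pattern is taken to be alternating bits, the Random set is taken to contain exactly 16 one bits, and xorshift32 is merely one convenient pseudo-random generator. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Zero set: only the two most significant bits are one. */
static uint32_t zero_exponent(void) { return 0xC0000000u; }

/* One set: all bits are one. */
static uint32_t one_exponent(void) { return 0xFFFFFFFFu; }

/* Regular set: half of the bits are one, in an assumed alternating pattern. */
static uint32_t regular_exponent(void) { return 0xAAAAAAAAu; }

/* xorshift32 pseudo-random generator; the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Random set: exactly 16 of the 32 bits are one, at shuffled positions. */
static uint32_t random_exponent(uint32_t seed)
{
    assert(seed != 0);
    uint32_t state = seed;
    unsigned position[32];
    for (unsigned i = 0; i < 32; i++)
        position[i] = i;
    for (unsigned i = 31; i > 0; i--) {            /* Fisher-Yates shuffle */
        unsigned j = xorshift32(&state) % (i + 1);
        unsigned tmp = position[i];
        position[i] = position[j];
        position[j] = tmp;
    }
    uint32_t exponent = 0;
    for (unsigned i = 0; i < 16; i++)              /* set the first 16 shuffled positions */
        exponent |= (uint32_t)1 << position[i];
    return exponent;
}

/* Number of one bits, used to check the sets. */
static unsigned popcount32(uint32_t v)
{
    unsigned count = 0;
    for (; v != 0; v >>= 1)
        count += v & 1u;
    return count;
}
```

The Regular and Random exponents built this way trigger the conditional multiplication equally often, so any timing difference between them in unprotected code comes from branch prediction rather than from the amount of work.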
Please note that the number of times each loop was invoked per
experiment differs from one experiment to the other. For each benchmark,
the number of invocations was chosen to be a good balance between fast
experiments and accurate measurements.
3.3.1 Strategy One: Variable-Latency Compensation Code
To test the mitigation strength of compensation code as discussed in Section
3.2.1, we developed an LLVM [62] plugin to insert parallel compensation
code. This plugin operates on the high-level LLVM intermediate representation.
The generated compensation code for this example takes the
dividend as input and invariantly computes the value 1 using a number
of shifts and, mostly, multiplications, as illustrated in the equivalent C
code in Figure 3.6. The resulting 1 is then multiplied with the remainder
result of the division instruction. That final multiplication serves as
the bottom node of Figure 3.3(b). It takes the division instruction off the
critical path.
Several versions of this loop were generated, with increasing num-
bers of multiplications to discover the minimal required number to
omit the variable-latency division from the critical path. Figure 3.5(a)
demonstrates that the average execution times for the four inputs con-
verge after 6 multiplications. So surely this mitigation technique helps
in reducing the amount of useful information leaked via the time side
channel. The performance overhead is also limited in this case. For 6
multiplications, the overhead is about 7%.
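A concrete instance of this scheme with six chained multiplications could look as follows in C. This is a sketch under assumptions: the function name is hypothetical, whether the final joining multiplication counts towards the six is not specified here, and an optimizing C compiler may simplify or reorder the chain, which is one reason the plugin described above inserts the code at the LLVM level.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder computation with six compensation multiplications that do not
 * depend on the division until the final, joining multiplication. */
static int32_t rem_with_compensation(int32_t dividend, int32_t divisor) {
    assert(divisor != 0);
    assert(!(dividend == INT32_MIN && divisor == -1));

    int32_t tmp = (dividend >> 31) | 1;  /* 1 or -1, derived from the dividend */
    int32_t rem = dividend % divisor;    /* variable-latency division */

    tmp = tmp * tmp;                     /* always 1 from here on */
    tmp = tmp * tmp;
    tmp = tmp * tmp;
    tmp = tmp * tmp;
    tmp = tmp * tmp;
    tmp = tmp * tmp;                     /* sixth chained multiplication */

    return rem * tmp;                    /* joins both dependency chains */
}
```

The result is unchanged by the compensation code; only the shape of the dependency graph differs from a plain remainder computation.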
To test whether all useful information is eliminated, more rigorous
statistical testing is needed. To that end, we performed t-tests on the
sets of 100 timings we obtained for each of the four inputs. Figure 3.5(b)
shows the p-values of those tests obtained from comparing the One input
set to the Zero input set and from comparing the Regular input set
to the Random input set. This figure shows that only the versions with
15 and 39 multiplications survive the t-tests. So according to these tests,
only those versions are likely free of leakage.
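The distinguishability test on two sets of timings can be sketched in C as follows. The choice of Welch's unequal-variance variant and the normal approximation of the 0.05 threshold (|t| > 1.96, reasonable for sets of about 100 timings) are assumptions of this sketch; the exact test variant used for Figure 3.5 is not stated here. Comparing the squared statistic avoids a square root.

```c
#include <assert.h>
#include <stddef.h>

static double mean(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s / (double)n;
}

/* Unbiased sample variance around a precomputed mean m. */
static double variance(const double *x, size_t n, double m) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += (x[i] - m) * (x[i] - m);
    return s / (double)(n - 1);
}

/* Squared Welch t statistic of two timing sets. */
static double welch_t_squared(const double *a, size_t na,
                              const double *b, size_t nb) {
    assert(na > 1 && nb > 1);
    double ma = mean(a, na), mb = mean(b, nb);
    double se2 = variance(a, na, ma) / (double)na
               + variance(b, nb, mb) / (double)nb;
    double d = ma - mb;
    if (se2 == 0.0) return d == 0.0 ? 0.0 : 1e300;  /* degenerate case */
    return d * d / se2;
}

/* Nonzero if the two sets differ significantly at the 0.05 level,
 * using the large-sample approximation |t| > 1.96. */
static int distinguishable(const double *a, size_t na,
                           const double *b, size_t nb) {
    return welch_t_squared(a, na, b, nb) > 1.96 * 1.96;
}
```

A mitigated version survives the test when distinguishable() returns zero for both input pairs (One versus Zero, Regular versus Random).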
To understand why precisely these two versions provide more se-
curity, we studied the execution of the mitigated code versions using a
wide range of performance counters. We discovered that even when the
compensation code is executed on different functional units to enable
its execution in parallel with the division instruction, there are various
other pipeline components such as buffers and ports for which there is
contention between the compensation code and the division.
[Figure 3.5, four plots. Panel (a): execution time (4.2e+09 to 5.0e+09) versus #multiplications parallel with 32bit division (0 to 7), one line per input: zero, one, regular, random. Panels (b), (c) and (d): p-values (0.0 to 1.0) versus #multiplications parallel with 32bit division (0 to 40), with lines one-zero and regular-random and the threshold at 0.05.]
Figure 3.5: (a) Average execution times (number of cycles) of
100 executions on a Core 2 Duo processor of a loop performing modular
exponentiations, for four different inputs. Processor time stamp counters are used
to measure the execution times. (b) p-values of t-tests to test distinguishability
between different inputs on the Core 2 Duo. (c) p-values for the same program on
a Xeon processor. (d) p-values for a minimally changed program on the same
Core 2 Duo.
Through this contention, the variable latency of the division instruction can still
influence the execution of the compensation code, thus influencing the
total execution time. For this precise combination of processor architec-
ture and code fragment, 15 and 39 proved to be the right numbers of
multiplications to eliminate the time dependence completely.
We have no knowledge of publicly available hardware models that
would allow a compiler to predict this. Furthermore, when we make
small changes to the original code before inserting the parallel compen-
sation code or when we run the code on other out-of-order processors
with variable-latency division instructions, similar results are obtained,
in the sense that peaks in the t-test results occur for different numbers of
multiplications. Figure 3.5(c) shows the t-test result for the same code fragment
on our Xeon processor, while Figure 3.5(d) shows the t-test result for
the same Core 2 Duo processor and for a software version in which we
removed 4 padding bytes before the function containing the sensitive
code fragment.
Similar experiments in which we used long-latency load operations
(through forced cache misses) as parallel compensation code, gave re-
sults along similar lines.
From these experiments, we draw the following conclusions:
• In at least the presented case, parallel compensation code is able
to limit the amount of useful information leakage significantly.
• The performance overhead is relatively low when no strict secu-
rity guarantees are needed, as in the case where close but non-
identical averages suffice.
• In at least the presented case, parallel compensation code is able
to eliminate all leakage.
• However, this stricter security guarantee can only be achieved at
a much higher overhead.
• A specific instance of the mitigation, such as a specific number
of multiplications, provides no portable security guarantees for
different processor versions.
• It is unpredictable which specific instance of the mitigation is most
effective.
tmp1 = (dividend >> 31) | 1; // tmp 1 is always 1 or -1
quotient = dividend / divisor; // original division intended to be executed
// in parallel with multiplications
tmp2 = tmp1 * tmp1; // tmp2 and beyond are always 1
...
tmp = tmpn * tmpn; // n multiplications in total
quotient = quotient * tmp; // final multiplication to
// take division off critical path
Figure 3.6: C code equivalent of variable-latency division in parallel with
fixed-latency multiplications.
class = divisor < 2 ? 1 : divisor < 0x20 ? 2 : divisor < 0x200 ? 3 :
divisor < 0x2000 ? 4 : divisor < 0x20000 ? 5 : 6;
leading_zeroes = 31 - bsrl(dividend); // fixed-latency bit-scan-reverse
shifted_dividend = dividend << leading_zeroes;
res1 = shifted_dividend / ( class == 1 ? divisor : 0x2); // fixed latency!
res2 = shifted_dividend / ( class == 2 ? divisor : 0x20); // fixed latency!
res3 = shifted_dividend / ( class == 3 ? divisor : 0x200); // fixed latency!
res4 = shifted_dividend / ( class == 4 ? divisor : 0x2000); // fixed latency!
res5 = shifted_dividend / ( class == 5 ? divisor : 0x20000); // fixed latency!
res6 = shifted_dividend / ( class == 6 ? divisor : 0x200000); // fixed latency!
quotient = class == 1 ? res1 : class == 2 ? res2 : class == 3 ? res3 :
class == 4 ? res4 : class == 5 ? res5 : res6;
quotient >>= leading_zeroes;
remainder = dividend - (quotient * divisor);
Figure 3.7: C code equivalent of unsigned division computation using only
invariable-latency divisions. All selection statements a?b:c are implemented
with fixed-latency conditional moves. The (re)computation of the remainder
is necessary because the remainders computed using shifted dividends are
not correct. Similar code can be used for signed division, but then the sign
needs to be corrected afterwards.
• The effectiveness of a precise instance of the mitigation is highly
sensitive to the precise form and even location of the code frag-
ment to be protected.
With respect to the sequential compensation code strategy, we simply
did not succeed in writing concrete, effective mitigation code. And
even if sequential compensation code can be crafted that satisfies the
needed security guarantees for a specific software-hardware combina-
tion, all of the mentioned predictability, portability and sensitivity is-
sues will still apply.
3.3.2 Strategy 2: Forcing Invariable Latencies
Next, we manually rewrote the 32-bit version of the modular exponen-
tiation code to implement the strategy depicted in Figure 3.4. The re-
sulting average execution times and p-values are presented in Table 3.1.
Results are again presented for four inputs and two t-tests. Four soft-
ware versions have been measured in this experiment:
1. original: the original version as compiled with LLVM without any
mitigation;
2. if-conversion: the version generated by a modified LLVM that eliminates
the conditional branches [34];
3. if-conversion + nodiv function: an if-converted version in which
LLVM replaced the division instruction by a call to a fixed-time
library function that emulates division without executing a single
division instruction.
4. if-conversion + 6 div functions: an if-converted version in which
LLVM replaced the 32-bit division instruction by a call to a man-
ually written assembly function in which 6 divisions with forced
invariable latencies are executed, corresponding to the 6 latency
classes of the 32-bit division instruction on the Core 2 architecture.
Figure 3.7 shows the main body of this function in equivalent C
code.
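The arithmetic behind the shifted dividend in Figure 3.7 can be checked with a small C sketch. It shows only why shifting the dividend left and the quotient right by the same amount preserves the quotient, and why the remainder must be recomputed from the original dividend. It does not model the six latency classes and is not itself constant-time; the helper names are hypothetical, and the portable loop merely stands in for the bit-scan instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count of leading zero bits (capped at 31 so that the shift
 * below stays defined). Stands in for the bsrl-based computation. */
static unsigned leading_zeroes32(uint32_t x) {
    unsigned lz = 0;
    while (lz < 31 && !((x << lz) & 0x80000000u)) lz++;
    return lz;
}

/* Unsigned division on a normalized (left-shifted) dividend. */
static void normalized_divide(uint32_t dividend, uint32_t divisor,
                              uint32_t *quotient, uint32_t *remainder) {
    assert(divisor != 0);
    unsigned lz = leading_zeroes32(dividend);
    uint32_t shifted = dividend << lz;       /* no bits are lost by the shift */
    uint32_t q = (shifted / divisor) >> lz;  /* equals dividend / divisor */
    *quotient = q;
    /* shifted % divisor would be the remainder of the shifted value,
     * so the remainder is recomputed from the original dividend. */
    *remainder = dividend - q * divisor;
}
```

The quotient is preserved because flooring twice, first after dividing by the divisor and then after dividing by the power of two, gives the same result as flooring the combined division once.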
These results indicate that the proposed solution is capable of com-
pletely closing the time side-channel leak due to the variable latency di-
vision instruction. Equally important, additional experiments demon-
strated that this solution does not depend on the exact form of the code
fragment to be protected. Whatever the surrounding code of the divi-
sion looks like, the proposed solution works.
The overhead of this solution is more than a factor 4, however, which
is considerably higher than the overhead resulting from parallel com-
pensation code as measured in the previous section. Still, compared
to using a library function, this novel mitigation technique is about 3.5
times more efficient.
At this point, it is useful to keep in mind that only the sensitive code
in an application needs to be protected. So in most applications, the
aforementioned overhead will only be observed on small fractions of
the execution time, and might hence very well be negligible in the total
execution time.

                     original   if-converted   if-converted        if-converted
                                               + no-div function   + 6 div functions
all zero             0.611      1.242          20.800              5.774
all one              1.127      1.255          20.800              5.774
regular              0.845      1.251          20.800              5.774
random               0.978      1.245          20.800              5.774
(a) average execution times (seconds)

                     original   if-converted   if-converted        if-converted
                                               + no-div function   + 6 div functions
all zero             0.007      0.000          0.000               0.001
all one              0.000      0.000          0.000               0.001
regular              0.000      0.000          0.000               0.001
random               0.000      0.000          0.000               0.001
(b) standard deviation of execution times

                     original   if-converted   if-converted        if-converted
                                               + no-div function   + 6 div functions
all zero - all one   0.000      0.000          0.288               0.482
regular - random     0.000      0.000          0.135               0.215
(c) p-value of the t-test applied to the execution times

Table 3.1: Execution times and statistical information on the successful
mitigation of variable-latency divisions in modular exponentiation.
With respect to portability, this solution can be extended to provide
security portability over all existing processor versions. For each differ-
ent processor version, a specific secured code version can be provided
in the software. When the software is executed, the cpuid instruction is
used to query the processor for its version in order to invoke the appropriate
code version. Obviously this will introduce some additional performance
overhead as well as significant code size overhead. With respect
to future processors, an application could refuse to continue when
executed on a processor version for which it has no secured code. This
may not be very user-friendly, but at least the security guarantee is not
broken. We can conclude that:
• Predictable and strict security guarantees insensitive to code fragment
properties can be provided using this mitigating transformation.
• The overhead is very high, however.
• True portability can only be provided for existing processors, and
comes with an additional overhead, in particular in the form of
increased code size.
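The per-processor dispatch described above can be sketched in C as a lookup table. Everything named here is hypothetical: the signature values are placeholders rather than verified processor signatures, the two routines stand in for separately hardened code versions, and the signature is passed in as a parameter instead of being read with the cpuid instruction so that the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*secure_div_fn)(uint32_t dividend, uint32_t divisor);

/* Placeholders: a real build would provide one hardened routine per
 * supported processor version. */
static uint32_t secure_div_version_a(uint32_t a, uint32_t b) {
    assert(b != 0);
    return a / b;
}
static uint32_t secure_div_version_b(uint32_t a, uint32_t b) {
    assert(b != 0);
    return a / b;
}

struct variant {
    uint32_t signature;   /* processor version as reported by the CPU */
    secure_div_fn fn;     /* code version secured for that processor */
};

static const struct variant variants[] = {
    { 0x000006F6u, secure_div_version_a },  /* hypothetical signature */
    { 0x000106A5u, secure_div_version_b },  /* hypothetical signature */
};

/* Returns the secured routine for this processor, or NULL if none exists,
 * in which case the application should refuse to continue. */
static secure_div_fn select_secure_div(uint32_t signature) {
    for (size_t i = 0; i < sizeof variants / sizeof variants[0]; i++)
        if (variants[i].signature == signature)
            return variants[i].fn;
    return NULL;
}
```

Returning NULL for an unknown signature corresponds to the refusal on future processors: the caller aborts rather than falling back to unprotected code.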
3.3.3 No-ops for Avoiding Variable Interaction Between Memory Operations
As a final experiment, we report on the mitigation based on no-ops
against variable execution times caused by the variable interaction be-
tween consecutive store and load operations. For this experiment, we
let our LLVM plugin insert no-ops in between the store and load oper-
ations in the example code at the beginning of Section 2.2.3.
The resulting execution times for different offsets between store and
load addresses are depicted in Figure 3.8. These execution times are
measured for a store address that is aligned on a 64-byte boundary, cor-
responding to the top row of Figure 2.2. The different lines correspond
to different numbers of no-ops inserted, from one to six, from bottom to
top.
From five no-ops on, the t-tests showed that the curve for this ex-
periment became indistinguishably flat, and thus that the timing be-
havior did not leak any information about the addresses accessed by
the memory operations. For other similar code fragments, similar-looking
results were obtained, indicating that with enough no-ops inserted
where needed, this time side channel can be closed.
As with the previous solution, the performance overhead can be
quite big. In this experiment, it is about a factor 1.5. Again, however,
this overhead is limited to the code fragments to be protected.
This type of mitigation proves to be very predictable, portable, and
insensitive to specific code fragment properties. So we can conclude
that:
• The technique is portable, predictable and insensitive to code frag-
ment features when the overhead is not minimized, i.e., when a
number of no-ops is inserted that is guaranteed high enough.
• The technique does come with a significant overhead.
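At the source level, the effect of the transformation can be approximated as follows. This is a sketch under assumptions: it relies on the GCC/Clang extended inline assembly extension and on the target having a nop mnemonic, the function name is hypothetical, and the count of five no-ops is taken from the experiment above rather than being a portable constant. The plugin itself inserts the no-ops in the compiler, not in the source.

```c
#include <assert.h>
#include <stdint.h>

/* Store followed by a load, separated by five no-ops. The "memory"
 * clobber keeps the compiler from moving either access across the
 * no-ops. */
static uint32_t store_then_load(volatile uint32_t *store_addr,
                                const volatile uint32_t *load_addr,
                                uint32_t value) {
    assert(store_addr != 0 && load_addr != 0);
    *store_addr = value;
    __asm__ volatile("nop\n\tnop\n\tnop\n\tnop\n\tnop" ::: "memory");
    return *load_addr;
}
```

The no-ops do not change the values that are stored or loaded; they only separate the two memory operations in the instruction stream.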
[Figure 3.8 plot: execution time in cycles (2.0 to 3.5) versus offset between memory accesses in bytes (0 to 144), one line per number of inserted no-ops: 1 nop, 2 nop, 3 nop, 4 nop, 5 nop, 6 nop.]
Figure 3.8: Execution time (number of cycles on a Core 2 Duo processor) of a
loop consisting of a pair of store/load instructions, as a function of the offset
between the accessed loop-invariant memory locations, for an increasing number
of no-ops (1 to 6) inserted in between them.
3.3.4 Feasibility of Compiler-Based Mitigation
We studied compiler support for mitigating side channel attacks be-
cause compared to the manual mitigation of algorithms or source code,
automated mitigation in a compiler offers the potential advantage of increasing
the developer’s productivity. Compared to x86 hardware support
to mitigate side channels, which is currently only available for very
specific cryptographic computations such as AES table look-ups [48],
generic compilers offer the potential advantage of being able to protect
any code that handles sensitive data.
However, our experiments have demonstrated that static compilers
cannot always provide the highest level of security at low performance
overhead. When both are needed, the developers remain responsible
for ensuring that their implementations do not leak information. Our
experiments have shown that this is not a simple task, and that the developers
need to be aware of many computer architecture artifacts. Occasional
security programmers are therefore advised not to
develop ad hoc solutions, but to reuse existing libraries in which the
security community has grown confident instead. Alternatively, hardware
designers might be persuaded to provide hardware support in the
future. Support for fixed-latency arithmetic is available on the ARM
architecture [17] but as of today, not on x86 processors. For other potential
side-channels such as load/store forwarding, we know of no existing
hardware support. Given the complexity of memory data paths on
modern out-of-order processors, we believe further research is needed
to assess the cost of such support.
Lacking hardware support in existing processors, our experiments
have demonstrated that when a very high level of overhead is accept-
able, static compilers are in fact able to provide leakage free solutions
that are portable over existing processors. When a low level of security
is needed and a limited amount of overhead is acceptable, such as when
the average execution times need to be similar but not identical, static
compilers can also provide portable solutions.
Anywhere in between, security-wise as well as overhead-wise,
static compilers can only provide non-portable solutions. Moreover,
they can only do so when the infrastructure and development time are
available to iteratively generate and test many specific instances of the
mitigations: iterative code generation and testing is the only option to
get high confidence in the provided level of security.
Reliable testing is a problem in its own right, however. To validate the absence
of any measurable correlations between secret data and execution
times, extremely accurate and precise test measurements must be done.
It is known from the literature that extreme care needs to be taken to
measure supposed performance improvements [70] and that rigorous
statistical analysis is needed [44]. In order to conduct our experiments
and get trustworthy timing results, we took the following precautions:
reduce the number of interrupts, for example by disconnecting the net-
work cable and other peripherals like keyboard, and by disabling the
USB ports; disable a number of operating system features such as clock
frequency scaling, turbo boost mode, address space layout randomization,
and dynamic linking; stop a number of services such as daemons
and cron jobs; pin software to one specific core on the multicore proces-
sors and disable hyperthreading where available; make sure the soft-
ware measurement environment is invariable, including the length of
all variables in scripts, paths, inputs, environment variables, etc.
Even when all these precautions had been taken, the resulting time
measurements were sometimes not usable. For example, the graph in
Figure 3.9 depicts over 30,000 consecutive time measurements of a single
program using the same input in an experiment running for two
days. Despite all the precautions taken, there is a clear phase behavior
that manifests itself after one day and that we cannot explain.
[Figure 3.9 plot: execution time (approximately 7.8346e+09 to 7.8354e+09 cycles) versus measurement order (0 to 35000).]
Figure 3.9: Phased behavior of execution time over 2 days. On the X-axis, the
order of measurements is indicated, on the Y-axis the execution time in number
of cycles.
Furthermore, even within a single phase (e.g., day one of the measurements)
there are a number of outliers that clearly demonstrate that the
execution times are not distributed according to a Gaussian distribu-
tion. In another experiment, we used a bootloader developed in house
to run our microbenchmarks on bare metal, i.e., without installing any
operating system. For both the bare metal and the OS-supported mea-
surements, we measured bimodal distributions. Surprisingly, however,
we noticed that the standard deviation on the measured timings (mea-
sured by means of processor time stamp counters) was two orders of
magnitude bigger for the version running on the bare metal than for
the version running on top of the operating system.
Compared to execution times considered relevant in performance-
oriented compiler research, the relative changes in behavior observed
in Figure 3.9 are extremely small, as are all standard deviations we ob-
served in our experiments. Such changes would probably go unnoticed
in typical compiler research. To provide strict security guarantees, how-
ever, such small differences are relevant.
Even if such testing is considered an option for static compilers, it
certainly is no option for dynamic compilers. So we have to conclude
that the range of mitigations and requirements for which static compilation
is not feasible because of portability issues, also poses fundamental
problems for dynamic compilers. In the latter case, the problem relates
to a lack of predictability and testability of the provided security.
In the cases where static compilers can provide some portable, guar-
anteed higher levels of security, dynamic compilers can do so as well.
Dynamic compilers will be able to do so at a lower code size overhead
and at a lower performance overhead, for example because they know
exactly how many division latency classes the used processor has and
because they know the precise number of no-ops that needs to be in-
serted for that processor (rather than the maximum number of no-ops
needed for all possible processors). However, the performance over-
head will still be considerable.
Chapter 4
JIT compilation
The static nature of the compiler techniques discussed in the previous
chapter comes with some major drawbacks. Most importantly, for architectures
like Intel’s x86 that are implemented by means of out-of-order
pipelines, the effectiveness and the efficiency of the protective trans-
formations depend on the details of the pipeline implementation. As
a result, the optimal combination of transformations differs from one
processor design to another, even within families of processors that of-
fer exactly the same instruction set architecture. This is a serious issue
at times when developers see their apps downloaded and installed on
a multitude of different (mobile) devices, when users upload their ap-
plications into the cloud onto machines of which they might not know
(and don’t want to care about) the exact pipeline features, and when heterogeneous
multiprocessors are coming on the market such as ARM’s
big.LITTLE architecture on which applications migrate from one type
of core to another transparently under the control of a hypervisor [56].
Moreover, even if the developer and user would know exactly on
which single processor their application will be executed during its life-
time, the application can still be deployed in circumstances that vary
over time and that impose different requirements on the protection. For
example, at any point in time (i) a device might be offline or online in
networks that introduce varying amounts of jitter, which enables dif-
ferent types of attacks [36]; (ii) the application might be handling very
sensitive or rather insensitive data; (iii) the application might be execut-
ing on a processor that is shared or that is not shared with other appli-
cations that can execute attacks based on resource contention [46, 63].
Static compiler techniques support only one design point, i.e., one pro-
tection level per generated binary. While it is conceivable to generate
static binaries that include code compiled for several scenarios, the re-
sulting code bloat impedes the generation of code for many different
combinations of usage scenarios and target processors.
To overcome the above limitations of static compiler approaches, this
chapter presents a dynamic approach for mitigating timing side chan-
nels. Using profile information collected offline on representative in-
puts, sensitive code fragments in an application are detected and tagged
automatically. At run time, a just-in-time (JIT) compiler then generates
protected code based on the profile information. This enables dynamic
adaptation of the level of protection, and hence of the efficiency of the
protected code, to changing circumstances, including the exact processor
on which the application is being executed or to which, e.g., it is
migrated.
4.1 Profile Based Protection
This section discusses how run-time profiles can be used to extend the
protection techniques described in the previous section and to reduce
their overhead by automating and optimizing the selection of protected
code regions.
To this end we collect a separate application profile for each input
in a test input set I. Each profile contains edge counts, i.e., the number
of times each edge in the program’s call graph and its procedures’
CFGs is followed during the program’s execution. The edge counts
and the instruction counts derived from them are collected by running
an instrumented program version. In addition, this version contains
counters to count the number of iterations in each loop and to measure
recursion depths.
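The kind of instrumentation described above can be illustrated with a small C sketch. The data layout, the names and the example routine are assumptions made for illustration and do not reflect the actual instrumentation: each profile holds one counter per CFG edge and, per loop, the largest iteration count seen in any invocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_EDGES 2   /* hypothetical: loop back edge and loop exit edge */
#define NUM_LOOPS 1

struct profile {
    uint64_t edge_count[NUM_EDGES];      /* times each edge was followed */
    uint64_t max_iterations[NUM_LOOPS];  /* largest iteration count per loop */
};

static void count_edge(struct profile *p, size_t edge) {
    assert(edge < NUM_EDGES);
    p->edge_count[edge]++;
}

static void record_loop(struct profile *p, size_t loop, uint64_t iterations) {
    assert(loop < NUM_LOOPS);
    if (iterations > p->max_iterations[loop])
        p->max_iterations[loop] = iterations;
}

/* Example instrumented routine: the iteration count depends on x. */
static unsigned instrumented_bit_length(struct profile *p, uint32_t x) {
    unsigned n = 0;
    uint64_t iterations = 0;
    while (x != 0) {
        count_edge(p, 0);                /* loop back edge */
        x >>= 1;
        n++;
        iterations++;
    }
    count_edge(p, 1);                    /* loop exit edge */
    record_loop(p, 0, iterations);
    return n;
}
```

Running the instrumented version once per input in the test set and keeping one struct profile per run yields the separate profiles from which the extreme behavior is later extracted.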
The inputs have to be selected such that they provoke extreme behavior,
i.e., behavior that is representative of the shortest and of the
longest runs. If multiple paths are equivalently short or long, all of those
paths need to be triggered with separate inputs. Later we discuss how
and why this strict requirement can be loosened in practice.
4.1.1 Transforming Loops and Recursive Calls
Loops with variable iteration counts require extra attention. Obviously
the number of executed iterations has an impact on the execution time
of a program, and obviously, the conditional branches that control loop
exits cannot be transformed with if-conversion. Coppens et al. [34] already
suggested a possible solution based on introducing an extra loop
counter with a fixed iteration count manually selected by the developer.
If-conversion is then applied on the loop body such that only the itera-
tions that need to be executed contribute to the global program state.
Our first adaptation to this scheme is minor: for each loop, our
tools automatically extract the largest iteration count observed in the
collected profiles. This automatic approach does not require any in-
sight into the code by a user or programmer and thus facilitates the
protection of legacy applications or third party libraries.
Loops implemented by means of recursive calls are handled similarly. In that case,
a counter parameter is added to the function to pass the recursion depth
at run time, and the recursion is executed until the required depth, i.e.,
the maximal depth observed during the profiling.
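The loop transformation can be sketched as follows. This is a minimal illustration under stated assumptions, not the code produced by the tools described here: the class name FixedIterationLoop, the bound MAX_ITERATIONS (standing in for the largest iteration count observed in the profiles), and the summation workload are all hypothetical. The transformed loop always executes the same number of iterations and uses a branch-free mask so that only the first n iterations contribute to the result.

```java
final class FixedIterationLoop {
    // Hypothetical bound: the largest iteration count observed in the profiles.
    static final int MAX_ITERATIONS = 8;

    /** Original loop: the number of iterations depends on n. */
    static int sumOriginal(int[] data, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += data[i];
        }
        return sum;
    }

    /**
     * Transformed loop: always runs MAX_ITERATIONS iterations.
     * Assumes 0 <= n <= MAX_ITERATIONS and a non-empty array.
     */
    static int sumFixed(int[] data, int n) {
        int sum = 0;
        for (int i = 0; i < MAX_ITERATIONS; i++) {
            // mask is -1 (all ones) while i < n and 0 afterwards, without a branch
            int mask = (i - n) >> 31;
            // padding iterations read element 0 so the access stays in bounds
            int idx = i & mask;
            // only active iterations change the accumulated state
            sum += data[idx] & mask;
        }
        return sum;
    }
}
```

The sketch does not handle inputs that exceed the profiled bound; those require the fallback discussed in Section 4.1.5.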
4.1.2 Automatic Detection of Code Regions to Transform
The existing compiler techniques to transform code to obtain constant
execution time still require that developers tag one or more methods
as sensitive, implying that those handle sensitive information. During
compilation those methods and all of their direct and indirect callees are
then transformed with the protections discussed in Chapter 3, resulting
in a secure application. If some callees should not be transformed, the
developer has to tag those explicitly.
That manual tagging is highly susceptible to human error. The de-
veloper might overestimate or underestimate the procedures that need
to be transformed, because it is not always clear which procedures ac-
tually leak information. In case he underestimates them, applications
might still leak information. If he overestimates them, a price in the
form of unnecessarily high overhead is paid.
In addition, applying if-conversion naively on all direct and indirect
callees of a procedure that needs to be protected can cause a large, and
often unnecessary, overhead. This is especially true for libraries with
multiple levels of indirection and large call trees. Consider for example
the library method in Listing 4.1, which uses a switch statement to select
one of several encryption algorithms. If we blindly deployed if-conversion
on this code fragment and its callees, a protected version of this method
and a protected version of each algorithm would be executed to ensure
constant execution time. However, if this method is always passed the same
options in some program, such that the same encryption routine is always
invoked, much of this overhead is completely unnecessary. The shown method
can then definitely remain untransformed, and perhaps even the invoked
encryption method can be left as is.
public byte[] encrypt(byte[] plaintext, Key k, int type)
        throws InvalidKeyException {
    // Dispatch to one of several encryption algorithms.
    switch (type) {
        case 1: return encrypt_RSA(plaintext, k);
        case 2: return encrypt_DSA(plaintext, k);
        case 3: return encrypt_ECC(plaintext, k);
        default: throw new InvalidKeyException();
    }
}
Listing 4.1: Example method to be protected
This problem of selecting the methods to protect and the extent to
which they need to be if-converted is addressed by using profiles to au-
tomatically select the code fragments to transform. This selection hap-
pens at two levels. First we use call edge counts and instruction counts
to determine the set of methods S to be transformed. Secondly, we op-
timize the way if-conversion is applied within these methods based on
the branch profile information.
We iteratively build the set S of methods to be transformed. Each
iteration adds extra methods to S, thus increasing the protection
level. The end result is a fully protected application in which
only a reduced subset of the methods are transformed. Moreover, each
intermediate result can be considered as a partially protected appli-
cation suitable in scenarios with less strict security requirements, e.g.,
because attacks can only occur over a network that introduces jitter
on the attacker’s measurements. Section 4.1.4 discusses the security-
performance trade-off in more detail.
Phase 1: Call Edge Heuristic
In a first phase, we build on the simple heuristic that in the sensitive
code, all method calls need to be executed exactly the same number
Algorithm 1: Method selection based on call edge counts.
1 R = {m ∈ M | m is tagged as a root method}
2 C = {m ∈ M | m is an (in)direct callee of a method in R}
3 E_equal = {e ∈ E | ∀(i, j) ∈ I² : Ec_i(e) = Ec_j(e)}
4 S ← ∅
5 n ← 1
6 do
7     A ← ∅
8     foreach m ∈ C \ S do
9         if (∀e ∈ In(m) : e ∈ E_equal)
10            ∧ (∃e ∈ Out(m) : e ∉ E_equal) then
11            A ← A ∪ {m}
12            E_equal ← E_equal ∪ Out(m)
13    S ← S ∪ A
14    S^e_n ← S
15    n ← n + 1
16 while A ≠ ∅
of times, independently of the input data. The rationale is that if this
requirement is not met, the execution time will likely vary too much to
provide any useful level of protection. We use the following definitions:
• Let R be the set of root methods that need to be secured, M the
set of all application methods, C the subset of M containing direct
and indirect callees of the root methods in R, S the complete set
of methods that need to be secured, S^e_n the set of methods that
need to be secured at iteration n in the algorithm, I the set of input
values of the application, and E the set of call edges between methods.
• Let E_equal ⊆ E initially be the set of call edges with equal edge
counts for all inputs in I.
• Define Ec : I × E ↦ N, with Ec_i(e) the function that returns the
execution count of edge e under input i.
• Define In : M ↦ P(E), with In(m) the set of incoming call edges
of method m.
• Define Out : M ↦ P(E), with Out(m) the set of outgoing call
edges of method m.
Algorithm 1 starts by building a set of methods C of direct and in-
direct callees of the root methods R. These root methods are tagged by
either a developer or library user to be secured against timing attacks.
They are used to determine the scope of candidate methods to protect.
It is important to note that these root methods do not necessarily have
to be transformed themselves if they do not leak any timing informa-
tion. Our approach only requires that the developer tags methods to
specify the security requirement, not to specify which methods need to
be transformed. This method is hence much less error prone than the
existing state of the art.
Each iteration over lines 6–16 adds methods to the secured set S
if they meet the following conditions: (i) all of the method's incoming
edges have equal edge counts for all inputs, and (ii) the method has at
least one outgoing edge that has unequal edge counts.
Because all methods in S will eventually be protected, their outgoing
edge counts will become equal for all input values in I in the protected
program. The algorithm already simulates this by adding the outgoing
edges of each selected method to E_equal on line 12.
Figure 4.1 illustrates the first three iterations of this process on a fic-
titious call graph containing all methods in C. For each iteration, dark
gray squares represent methods in the set of secured methods S before
the start of that iteration. Light gray squares represent methods in A
added to S in the current iteration. Edges in E_equal are marked with an
=, all other edges with ≠. In the first iteration, methods B and G are the
only methods that meet the requirements to be protected. By adding
them to S and their outgoing edges to E_equal, method I can be added in
iteration 2. In iteration 3 all edges have been added to E_equal and the
algorithm finishes, with S = {B, G, I}.
In each iteration n, the set of methods that have been selected for
protection up to that point is saved in a separate set S^e_n (line 14). This
enables the (potentially dynamic) selection of one of these intermediate
results when complete protection is not required at some point during
the actual deployment of the application or library.
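As an illustration, Algorithm 1 can be sketched in Java as follows. The sketch makes several assumptions that are not part of the prototype: methods are identified by name, a call edge is modelled as a (caller, callee) pair, and the profile is a map from each edge to an array holding one execution count per input. The class and method names are hypothetical. The returned list contains one snapshot of S per iteration, corresponding to the sets S^e_n.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class CallEdgeSelection {
    /** Hypothetical edge model: key = caller, value = callee. */
    static Map.Entry<String, String> edge(String caller, String callee) {
        return Map.entry(caller, callee);
    }

    /**
     * candidates: the set C; edgeCounts: per edge, one count per input.
     * Returns the snapshots of S taken at the end of every iteration.
     */
    static List<Set<String>> select(Set<String> candidates,
            Map<Map.Entry<String, String>, long[]> edgeCounts) {
        // Line 3: edges whose count is identical for all inputs.
        Set<Map.Entry<String, String>> equal = new HashSet<>();
        for (var entry : edgeCounts.entrySet()) {
            if (Arrays.stream(entry.getValue()).distinct().count() <= 1) {
                equal.add(entry.getKey());
            }
        }
        Set<String> selected = new TreeSet<>();          // S
        List<Set<String>> snapshots = new ArrayList<>(); // S^e_1, S^e_2, ...
        Set<String> added;                               // A
        do {
            added = new TreeSet<>();
            for (String m : candidates) {
                if (selected.contains(m)) {
                    continue;                            // iterate over C \ S only
                }
                boolean allIncomingEqual = true;
                boolean someOutgoingUnequal = false;
                List<Map.Entry<String, String>> outgoing = new ArrayList<>();
                for (var e : edgeCounts.keySet()) {
                    if (e.getValue().equals(m) && !equal.contains(e)) {
                        allIncomingEqual = false;
                    }
                    if (e.getKey().equals(m)) {
                        outgoing.add(e);
                        someOutgoingUnequal |= !equal.contains(e);
                    }
                }
                if (allIncomingEqual && someOutgoingUnequal) {
                    added.add(m);                        // line 11
                    equal.addAll(outgoing);              // line 12
                }
            }
            selected.addAll(added);                      // line 13
            snapshots.add(new TreeSet<>(selected));      // line 14
        } while (!added.isEmpty());
        return snapshots;
    }
}
```

Because E_equal is updated inside the inner loop, the iteration in which a method is selected can depend on the order in which the candidates are visited; passing a sorted set makes that order deterministic.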
Phase 2: Instruction Count Heuristic
Obviously, the set of methods selected based on call edge counts in
phase 1 is only a coarse-grained estimation of the methods that might
leak information. Its intermediate and final results will not provide
[Figure: three call graph panels over methods A to M, labeled Iteration 1, Iteration 2, and Iteration 3; call edges are marked = or ≠.]
Figure 4.1: First three iterations of call-edge-based code selection.
Algorithm 2: Method selection based on execution counts.
1 foreach m ∈ C \ S do
2     if ∃t ∈ Inst(m) . ∃(i, j) ∈ I² : Ic_i(t) ≠ Ic_j(t) then
3         S = S ∪ {m}
4 S^i = S
meaningful protection in all scenarios. So in a second phase, we use
instruction execution counts to identify those locations where intrapro-
cedural control flow can contribute to leaks. Let
• T be the set of application instructions;
• Inst : M ↦ P(T), with Inst(m) the function that returns the set of
instructions in method m;
• Ic : I × T ↦ N, with Ic_i(t) the function that returns the execution
count of instruction t under input i.
Algorithm 2 iterates over all methods in C that were not yet added
to S in phase 1. If a method has a different instruction count for at least
one of its instructions for two or more of the inputs in I, it has input-
dependent control flow, and is hence added to S on line 3.
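A minimal Java sketch of Algorithm 2 is given below. The data model is an assumption made for the example: for each method, the profile holds a list with one array per instruction, and each array contains the execution count of that instruction for every input. The class name InstructionCountSelection is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class InstructionCountSelection {
    /**
     * candidates: the set C; selected: the set S after phase 1;
     * instructionCounts: per method, one count array (indexed by input) per instruction.
     * Returns S extended with all methods that have input-dependent instruction counts.
     */
    static Set<String> select(Set<String> candidates, Set<String> selected,
            Map<String, List<long[]>> instructionCounts) {
        Set<String> result = new TreeSet<>(selected);
        for (String m : candidates) {
            if (selected.contains(m)) {
                continue; // only methods in C \ S are inspected
            }
            for (long[] perInput : instructionCounts.getOrDefault(m, List.of())) {
                // differing counts for one instruction imply input-dependent control flow
                if (Arrays.stream(perInput).distinct().count() > 1) {
                    result.add(m);
                    break;
                }
            }
        }
        return result;
    }
}
```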
Because the total execution time difference caused by methods
tagged in this phase is not influenced by their callees (unless they are
added to S themselves during this phase), protecting methods tagged
in this phase will on average result in smaller reductions of execution
time difference compared to methods tagged in phase 1. The overhead
of protecting these additionally added methods is also more limited,
however, because tagging them never triggers the additional tagging of
more methods down the call chain.
The end result of this phase is a set of methods S^i. If this set is protected
during execution, the application will no longer feature input-
dependent control flow. So at that point we effectively protect against
both instruction cache attacks and branch prediction attacks [3, 4, 5, 7].
In practice we observed that applying all control flow and data flow
transformations to this set of methods suffices to eliminate most of the
execution time variation.
One important remark in this regard is that this can only be obtained
if the Java VM provides side-channel-free implementations of
VM service routines and native methods. Using a meta-circular virtual
machine such as JikesRVM, which itself is written in Java, has the benefit
that the protection transformations can be applied to the compiler itself.
Furthermore, native functions can be compiled using existing static
compiler techniques [86].
Input-Dependent Data Flow
The algorithms described above use input-dependent control flow vari-
ation as an indicator for execution time variation. This allows fast detec-
tion of the majority of code regions to protect based on a single profile
run for each input in I. However, it does not detect methods that only
leak timing information through data flow features such as variable la-
tency instructions, memory access delays and interactions between in-
structions in the pipeline.
In our experimental results on actual implementations of crypto-
graphic algorithms we did not encounter any additional methods leak-
ing timing information on top of those already selected based on control
flow variation in Phases 1 & 2. In case there would be such methods,
however, it is also possible to detect them using profile information,
albeit with a much slower process. Concretely, sample-based execu-
tion time profiling during many runs of the fully if-converted methods
can detect statistically relevant differences between methods’ execution
times for different inputs. It is worth noting here, that some of the most
frequently used methods for collecting per method timing information
in other contexts, such as code optimization, are not useful in this con-
text. In particular, the collection of time stamp counters by means of
injected RDTSC instructions and the necessary bookkeeping code is not
useful, as the inserted instructions affect the pipeline behavior
too much to measure the effect of variable-latency instructions in non-
instrumented code.
4.1.3 Selective If-Conversion
The static transformation techniques as described in Chapter 3 apply
if-conversion to whole procedures. This includes conditional branches
that will only be taken in one direction or switches of which only a few
cases will actually be executed. Unexecuted paths or code fragments
in executed functions occur for several reasons. One reason involves li-
brary functionality that is not executed in a specific program, as already
discussed in Section 4.1.2. A second reason involves code that is present
because of coding practices and guidelines but that is actually never ex-
ecuted, such as superfluous default cases in switch statements. A third
reason involves checks that always evaluate to true or always evaluate to
false under “normal” circumstances, including all scenarios that might
leak information through side channels. This includes checks for de-
tecting out-of-memory issues or for the presence of files to be opened.
Using branch profile information, we can select the branches that actu-
ally exhibit input-dependent behavior and only convert those to reduce
the overhead.
Figure 4.2a illustrates the CFG of an unprotected application. Gray
squares and bold arrows represent basic blocks and control flow edges
resp. that are executed at least once for at least one of the program in-
puts. In this simple case, it is clear that only the conditional branch in
basic block B can cause input-dependent control flow. Compared to
the whole procedure being if-converted as shown in Figure 4.2b, the
selective if-conversion result shown in Figure 4.2c will yield much less
performance overhead.
In our current implementation, branches are left unprotected only if
they are profiled as always going in the same direction for all inputs in I.
Although branches with identical branch history for all inputs (e.g., always
taken in the first half of the whole execution, and always not-taken
in the second half) do not leak timing information, enforcing the same
branch sequence in the protected application is not trivial and currently
not supported.
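The selection criterion for individual branches can be sketched as follows. The representation is an assumption made for illustration: for each conditional branch, the profile provides the number of times it was taken and not taken under every input. A branch is left unprotected only if it never goes in both directions over the whole input set; this includes branches that were never executed. The class and method names are hypothetical.

```java
final class BranchSelection {
    /**
     * taken[i] and notTaken[i] are the profiled counts of one branch under input i.
     * Returns true if the branch has to be if-converted.
     */
    static boolean needsConversion(long[] taken, long[] notTaken) {
        boolean everTaken = false;
        boolean everNotTaken = false;
        for (int i = 0; i < taken.length; i++) {
            everTaken |= taken[i] > 0;
            everNotTaken |= notTaken[i] > 0;
        }
        // Both directions observed over the input set: potentially input-dependent.
        return everTaken && everNotTaken;
    }
}
```

Note that this criterion is conservative: a branch that follows the same taken/not-taken sequence for all inputs is still converted, in line with the limitation described above.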
[Figure: three CFG panels over basic blocks A to J: (a) original CFG, (b) if-conversion, (c) selective if-conversion.]
Figure 4.2: Example of complete and selective if-conversion.
4.1.4 Security/Performance Trade-off
As already discussed in the introduction, the required level of protection
depends on the usage scenario. In some scenarios, protection might
be traded off for reduced overhead. The latter can be achieved in two
ways: (i) protecting only a subset of methods S′ ⊆ S and (ii) applying
only a subset of protection techniques P′ ⊆ P. Each subset S′ of S combined
with a set of mitigation strategies P′ ⊆ P used to protect these
methods can be seen as a design point in a security/performance trade-off.
When security requirements change, for example when an intrusion
is detected by an external intrusion detection system (IDS), or the ap-
plication is migrated to a less secure machine, the compiler can easily
switch between different code versions by recompiling the code accord-
ing to the security transformations described by a target design point.
Figure 4.3 illustrates such a security/performance trade-off graph
for a subset of those design points. The security is measured in terms
of observable delta in execution times of slow versus fast runs, the
overhead in terms of slowdown. The x marks the unprotected appli-
cation, circles represent compilation points with different subsets of
S protected with only CFG transformations and the gray filled circles
represent compilation points protecting the full set of methods S with
CFG transformations and different combinations of data flow transfor-
mations.
[Figure: design points plotted by ∆ execution time (vertical axis) against overhead (horizontal axis), with a dotted line marking the allowed timing difference.]
Figure 4.3: Compilation strategy security/performance trade-off.
Ideally, at any point in time the compilation strategy should be chosen
that generates the fastest code meeting the security requirement imposed
at that time. If that requirement is indicated by the red dotted line
in the chart, the black circle denotes that strategy.
To find all potentially useful strategies, i.e., all strategies on the
Pareto front shown in the chart, one cannot use an exhaustive search:
the range of parameters for all the data flow transformations is simply
too wide to make that practical. Multi-objective machine learning
seems a good option. For the time being, however, we have
only experimented with manually chosen design points based on our
experience, both with JIT compilers for the research presented here and
with earlier static techniques [34, 86]. We discuss those design points
in the evaluation section. A machine learning approach remains future
work.
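Given a set of already measured design points, choosing a strategy and extracting the Pareto front is straightforward. The following sketch illustrates both steps; the class DesignPoints, its Point type, and the method names are hypothetical and only serve the example. Each point stores the overhead and the observable execution time difference of one compilation strategy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class DesignPoints {
    static final class Point {
        final String name;
        final double overhead; // slowdown relative to the unprotected code
        final double delta;    // observable execution time difference

        Point(String name, double overhead, double delta) {
            this.name = name;
            this.overhead = overhead;
            this.delta = delta;
        }
    }

    /** Fastest design point whose delta stays within the allowed timing difference. */
    static Optional<Point> fastestWithin(List<Point> points, double allowedDelta) {
        Point best = null;
        for (Point p : points) {
            if (p.delta <= allowedDelta && (best == null || p.overhead < best.overhead)) {
                best = p;
            }
        }
        return Optional.ofNullable(best);
    }

    /** Points that are not dominated by another point on both overhead and delta. */
    static List<Point> paretoFront(List<Point> points) {
        List<Point> front = new ArrayList<>();
        for (Point p : points) {
            boolean dominated = false;
            for (Point q : points) {
                if (q.overhead <= p.overhead && q.delta <= p.delta
                        && (q.overhead < p.overhead || q.delta < p.delta)) {
                    dominated = true;
                    break;
                }
            }
            if (!dominated) {
                front.add(p);
            }
        }
        return front;
    }
}
```

The hard part, which this sketch does not address, is deciding which design points to measure in the first place.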
4.1.5 Execution Paths not Covered by Profile Inputs
The effectiveness and security of run-time profile-based protection techniques
depend greatly on the set of test inputs I used to generate the
profiles. It is important that these inputs cover all the relevant work-
loads of the application, including corner cases that might occur in the
presence of an attack. In practice, we found that cryptographic code
typically requires only a small set of training inputs to provide enough
coverage to successfully protect cryptography applications against tim-
ing attacks.
If some relevant execution paths were not triggered by the profile inputs,
this does not necessarily cause problems, however. First of all, our
mitigation strategies are designed to correctly handle these inputs and
preserve the semantic behavior of the application when unprotected code is
executed unexpectedly. For example, in the partially if-converted CFG
on the right of Figure 4.2, the application will still be executed correctly
when the branch in basic block A jumps to basic block C. Likewise,
converted loops and recursive call chains are still allowed to complete
all their iterations resp. calls, even if their number exceeds the upper
bounds observed during profiling.
When that happens, some timing information invariably leaks.
The amount of leaked information is easily minimized, however. On
the paths that were not triggered during the profiling, but that are still
present in the transformed, protected code, we inject monitors. If those
get triggered, two actions are initiated. First, the stored profile informa-
tion is updated persistently to reflect that some previously untaken path
from now on has to be considered taken or that some observed upper
bound has to be increased. Secondly, the involved methods are flushed
from the compiled code cache in the JIT compiler, and recompiled tak-
ing into account the new information.
If the initial profiling by the developer was incomplete but not entirely
inadequate, these monitors have a minimal impact on the performance
of the protected code: after being triggered once, they either disappear
entirely from the recompiled CFGs, or they are replaced by other monitors
further down in the CFGs to monitor subpaths of the originally
excluded paths. Those new monitors can be replaced iteratively, but if
the original profiling was any good, very few iterations will occur. So
very few monitors will ever be triggered in practice, and being located
off the normally executed paths in the CFGs, they do not hurt regular
performance.
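The monitor mechanism can be sketched as follows. This is an illustration of the two actions described above, not the prototype's implementation: the class ProfileMonitor and all of its members are hypothetical, persistence of the profile is omitted, and recompilation is reduced to recording which methods have to be flushed from the code cache.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ProfileMonitor {
    private final Map<String, Integer> loopBounds = new HashMap<>(); // profiled upper bounds
    private final Set<String> takenPaths = new HashSet<>();          // paths seen so far
    private final Set<String> invalidated = new TreeSet<>();         // methods to recompile

    int bound(String loopId) {
        return loopBounds.getOrDefault(loopId, 0);
    }

    /** Called by a monitor when a loop runs longer than its profiled bound. */
    void onBoundExceeded(String method, String loopId, int observed) {
        if (observed > bound(loopId)) {
            loopBounds.put(loopId, observed); // first action: update the profile
            invalidated.add(method);          // second action: schedule recompilation
        }
    }

    /** Called by a monitor on a path that was never taken during profiling. */
    void onUnprofiledPath(String method, String pathId) {
        if (takenPaths.add(pathId)) {
            invalidated.add(method);
        }
    }

    /** Returns the methods to flush from the code cache and resets the set. */
    Set<String> drainInvalidated() {
        Set<String> result = new TreeSet<>(invalidated);
        invalidated.clear();
        return result;
    }
}
```

Each monitor triggers a recompilation at most once per newly observed path or bound, which matches the argument that monitors rarely fire when the original profile was reasonable.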
The very same reasoning applies to the amount of leaked informa-
tion. Timing attacks require an attacker to collect a significant num-
ber of timing samples. While triggered monitors will result in samples
with relevant information for the attacker, he will not be able to collect
enough samples to let an attack succeed.
4.2 Experimental Evaluation
We evaluated our approach on two cryptographic functions that are
known to leak timing information, an RSA encryption routine and an
HMAC key verification routine. We first describe our experimental
setup, after which we give a detailed overview of the results of both
experiments.
4.2.1 Prototype Implementation
For the run-time JIT compilation phase, we use the Jikes Research Virtual
Machine (JikesRVM v3.1.2, available at http://jikesrvm.org/) [11].
Our secure JIT compiler makes use of the JikesRVM Adaptive Optimization
System (AOS) and its optimizing compiler [18] to (re)compile code
for different security scenarios. With this system, JikesRVM never
interprets Java bytecode. All bytecode is compiled, which is of course
necessary to ensure that our protections are applied on all code that is
ever invoked. We discuss our prototype implementation in more detail
in Chapter 5.
4.2.2 Methodology & Experimental Setup
To mimic the strongest possible attacker that can observe many con-
trolled executions in a low-noise setup, we ran all time measurements
locally on an otherwise unloaded machine. This avoids noise from network
delay jitter. We also disabled address-space layout randomization
and frequency scaling. In addition, we forced JikesRVM to use only one CPU
and pinned the Java processes to that CPU, such that the Linux kernel
cannot intervene to migrate processes between different cores.
We then fed the appropriate options to JikesRVM to make it bulk
compile all methods at highest optimization level, and to disable recom-
pilations not related to changing security levels. This way, all timing
measurements are performed in the so-called steady state execution of
the software. This prevents large variations in execution time due to
the (non-deterministic) invocation of compilers from masking the statistically
relevant variations caused by the input-dependent behavior that an attacker
wants to exploit.
To reduce noise introduced by garbage collection we increased the
initial heap size to 2GB. As a source for accurate timing we use the x86
time stamp counter instruction (RDTSC) preceded by a CPUID instruc-
tion for instruction serialization. This timing is performed in a harness
around the evaluated code, not in the code itself, to avoid intrusion in
the code’s pipeline behavior.
Special care was also taken for the measurement of execution times
of which differences are computed and on which statistical tests are
performed. Such measurements of the same code version on multiple
keys were always conducted in a single invocation of the Jikes RVM,
with a harness that alternates between different input keys. This elim-
inates (non-deterministic) timing differences between different VM in-
vocations. Our experiments show that execution times of the same ex-
periment with the same input key can differ up to 0.23% (almost 3ms for
the RSA encryption test case) between VM invocations, which is much
higher than the noise in our individual test cases. Secondly, this ac-
counts for intermittent variations in behavior due to potential external
influences such as CPU temperature or garbage collection.
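The structure of such a measurement harness can be sketched as follows. This is an assumption-laden illustration rather than the harness used in the experiments: System.nanoTime() stands in for the serialized RDTSC instruction, which plain Java code cannot issue directly, and the class name, the Runnable workloads, and the warm-up parameter are hypothetical. The essential property shown is that the timer surrounds the evaluated code and that the inputs are alternated within a single VM invocation.

```java
final class AlternatingHarness {
    /**
     * Times every workload 'samples' times, alternating between the workloads
     * in each round; the first 'warmup' rounds are executed but not recorded.
     * Returns times[k][s]: the duration in nanoseconds of sample s of workload k.
     */
    static long[][] measure(Runnable[] workloads, int warmup, int samples) {
        long[][] times = new long[workloads.length][samples];
        for (int s = -warmup; s < samples; s++) {
            for (int k = 0; k < workloads.length; k++) {
                long start = System.nanoTime();   // timer outside the evaluated code
                workloads[k].run();
                long elapsed = System.nanoTime() - start;
                if (s >= 0) {
                    times[k][s] = elapsed;        // warm-up rounds are discarded
                }
            }
        }
        return times;
    }
}
```

A typical use would pass one workload per input key, each encrypting the same plain text with a different key.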
In order to assess the influence of processor implementation details
on the opportunities for optimizing the performance overhead of the
available protection, and in order to assess the benefits of our adaptive,
training-based approach that can tune code for specific architectures,
we performed experiments on two different generations of Intel processors:
Intel® Core™ 2 Duo (CPU E8400) and Intel® Core™ i7 (CPU 870).
They feature different pipeline designs, with, amongst others, different
load/store forwarding logic and different early-exit algorithms for di-
vision instructions, which results in different variable-latency behavior.
For example, the Core™ 2 Duo features 6 early-exit points, and hence
6 possible latencies, whereas the Core™ i7 features 7 (different) points
and latencies. Despite these differences, the input combinations used
for training and evaluation gave extreme timing behavior on both pro-
cessors. The reason is of course that control flow dependencies have a
much bigger impact than data-flow-related variable latencies.
4.2.3 BouncyCastle RSA Encryption
First, we evaluated our approach using the RSA encryption algorithm
provided by the latest version of the BouncyCastle cryptographic library
(v1.52). By default the encryption algorithm uses blinding. This common
countermeasure against timing attacks reduces the amount of information
leaked, but the complete information is still eventually revealed
to the attacker [20]. For its mathematical operations BouncyCastle relies
heavily on the underlying core Java libraries. In our test setup these
libraries are provided by the GNU Classpath (v0.97.2) implementation
used to build the JikesRVM compiler. To ensure the security of the encryption
algorithm, both the code of the algorithm itself and the code
of the underlying core Java libraries need to be protected.
import java.security.GeneralSecurityException;
import java.security.Key;
import javax.crypto.Cipher;

// Encrypts the plain text with RSA through the BouncyCastle ("BC") provider.
public byte[] encrypt(byte[] plaintext, Key k) throws GeneralSecurityException {
    Cipher c = Cipher.getInstance("RSA", "BC");
    c.init(Cipher.ENCRYPT_MODE, k);
    return c.doFinal(plaintext);
}
Listing 4.2: BouncyCastle RSA application
As a sample application in which the RSA encryption needs to be
protected, we used the code from Listing 4.2. The method that needs to
be protected is doFinal(byte[]).
This implementation of the RSA encryption algorithm features rel-
atively deep call trees and methods with control flow that is highly de-
pendent on the input data consisting of the secret keys and plain text
input. Excluding exception handling calls and native method calls, the
call chain from the root doFinal method, which starts the encryption process,
contains 104 methods and 270 calls between those methods. The
code region we are trying to protect is thus much larger and more complex than
the simple functions on which static compiler techniques were previously
evaluated [34, 68, 86].
Profile-Based Detection of Code Regions
We used an RSA key generator to generate hundreds of RSA keys. From
this set of keys we selected the two keys with the most variation in exe-
cution time, which we then used as the input set I for our profile-based
code selection.
The results of applying Algorithm 1 on the code in Listing 4.2 are
shown in Table 4.1. The algorithm needs a total of 6 iterations to collect
all methods and calls that need to be transformed. After the final iteration,
45/270 = 16.7% of all calls have been marked to be replaced by
calls to a protected version of their callees, for which 32/104 = 30.8% of the
methods need to be protected. In this experiment, Algorithm 1 sufficed:
Algorithm 2 did not result in additional methods being selected.
iteration         0    1    2    3    4    5    6
nr. of calls      0    8   20   35   41   44   45
nr. of methods    1    7   15   25   30   31   32
Table 4.1: Selected code regions based on call edge profiles.
[Figure: execution time difference (µs, logarithmic axis from 1 to 100000) plotted against overhead relative to the unprotected application (1 to 20); legend: Unprotected, CFG, CFG+Data.]
Figure 4.4: Intel Core™ 2 Duo security/performance trade-off.
Run-time Overhead vs. Execution Time Variation
The protection’s run-time overhead depends heavily on the desired
level of protection. To illustrate this we protected the encryption algo-
rithm using 18 different combinations of protection techniques, specif-
ically chosen to represent the broad scope of all possible combinations.
For each of those compilation points we measured the average execu-
tion time difference between the two input keys based on 1000 samples.
We then apply statistical tests to identify for which design points (i.e.,
combinations of protections) the execution times of the two input keys
are indistinguishable, and thus provide complete protection.
Figures 4.4 and 4.5 show the average absolute difference in execution
time between input keys and the relative execution times (relative to
the slowest unprotected version) for each design point for the Core™
2 Duo processor and the Core™ i7. To interpret the absolute numbers
correctly, it is useful to know that the unprotected application using the
slowest key takes 69.97ms on the Core™ 2 and 68.66ms on the Core™
i7 in steady state. So the variations on the unprotected application are
in the order of 20–23%.
[Figure: execution time difference (µs, logarithmic axis from 1 to 100000) plotted against overhead relative to the unprotected application (1 to 20); legend: Unprotected, CFG, CFG+Data.]
Figure 4.5: Intel Core™ i7 security/performance trade-off.
The whiskers on the graph represent the 5th and 95th percentile.
The x marks the unprotected application; circles represent compilation
points with different subsets of S protected with only CFG transforma-
tions; the gray filled circles represent compilation points protecting the
full set S with CFG transformations and different combinations of data
flow transformations.
The mitigation techniques used for each of the 18 design points are
described in rows 1–18 in the Compilation Strategy columns in Table 4.2.
S indicates the number of methods in the set S as selected with Algo-
rithm 1. These numbers correspond to those in Table 4.1. The column
array gives the size of the dummy array objects; div indicates whether or
not the division instructions are replaced by constant time implementa-
tions; nop indicates how many nop instructions are inserted in between
memory instructions. In that table, the column marked OH (overhead)
shows the execution time of each design point relative to the execution
time of the slowest key in the unprotected application, i.e., the X-value
of the points in Figures 4.4 and 4.5.
Using only CFG transformations, the delta in execution time between
input keys can be reduced by more than two orders of magnitude,
i.e., to 126µs on the Core™ 2 and to 82µs on the Core™ i7 at a remaining
overhead of 4.96x for the Core™ 2 and 5.05x for the Core™ i7. Besides
protection against timing attacks over noisy communication channels,
this compilation strategy also protects against instruction cache attacks
and branch prediction attacks because all input-dependent control flow
      Compilation Strategy    Intel Core 2                               Intel Core i7
Row   S    array  div  nop    OH      Sample Size                        OH      Sample Size
                                      20 50 100 200 300 500 700 990              20 50 100 200 300 500 700 990
1     1    0      0    0      1.00                                       1.00
2     7    0      0    0      1.01                                       1.00
3     15   0      0    0      1.73                                       1.67
4     25   0      0    0      3.76                                       3.75
5     30   0      0    0      4.84    11 6 4 1                           4.99    1
6     31   0      0    0      4.91    11 9 9 8 5 1 2 1                   5.00    1
7     32   0      0    0      4.95    12 10 8 6 3                        5.05    10 9 6
8     32   64     0    0      6.06    12 12 12 12 12 12 6 4              5.79
9     32   0      1    0      7.25    12 9 5 1                           7.53    9 8 2
10    32   0      0    6      7.55    10 2                               8.26    1 2
11    32   64     1    0      8.40    12 10 12 12 12 11 12 12            8.51
12    32   64     0    6      8.98    12 10 7 1                          9.58
13    32   0      1    6      9.94    12 11 7 6 3 2 2 2                  10.90
14    32   64     1    6      11.52   12 10 6 1                          12.27
15    32   0      0    12     12.90   12 12 12 12 10 9 11 10             13.46   10 9 11 9 9 9 7 7
16    32   64     0    12     14.85   9 9 9 6 6 1 1 1                    16.52   12 12 12 12 12 11 12 11
17    32   0      1    12     15.24   10 12 12 12 10 9 12 12             16.12   11 9 10 5 4 3 7 1
18    32   64     1    12     17.46   11 10 11 12 11 11 11 11            19.38   12 12 12 12 11 10 9 8
Table 4.2: Test results for different compilation strategies.
is if-converted.
Applying various data flow transformations further reduces the ob-
served difference in execution time, although there is a clear difference
between architectures. On the Core™ i7, only the 4 design points with
the highest overhead, which correspond to the design points on rows 15,
16, 17, and 18 of Table 4.2 significantly reduce the execution time differ-
ence compared to the design points with only CFG transformations. All
of these four compiler strategies insert 12 nop instructions in between
memory operations to avoid data-dependent load/store forwarding in
the pipeline. On the Core™ 2 processor, almost all compilation points
with data flow mitigation techniques reduce the difference in execution
time compared to CFG-only compilation points.
We use the Anderson-Darling test [82] to evaluate how many samples an
attacker (with direct access to a machine to perform timing measurements)
needs to collect to reliably distinguish between the two input keys, and
thus to evaluate how difficult it is to perform an actual attack. For each
compilation strategy and for different sample set sizes ranging from 20 to
990, we use the test to compare the timing results. The 20–990 samples
used in the statistical test always exclude the first 10 samples collected
during the timing measurements, to eliminate the influence of the initial
compilation on the processor pipeline, caches, buffers, etc.
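The selection of the samples that enter a test can be sketched as follows. This is a minimal illustration, not the measurement harness used for the experiments: the class and method names are hypothetical, and only the warm-up length of 10 and the sample set sizes are taken from the text and Table 4.2.

```java
import java.util.Arrays;

final class SampleSets {
    /** Number of initial measurements that are discarded (from the text). */
    static final int WARMUP = 10;

    /** Sample set sizes listed in Table 4.2. */
    static final int[] SIZES = {20, 50, 100, 200, 300, 500, 700, 990};

    /** Returns `size` timings, skipping the first WARMUP measurements. */
    static long[] select(long[] timings, int size) {
        if (timings.length < WARMUP + size) {
            throw new IllegalArgumentException("not enough measurements");
        }
        return Arrays.copyOfRange(timings, WARMUP, WARMUP + size);
    }

    public static void main(String[] args) {
        // Placeholder data: the value of each "timing" is simply its index.
        long[] timings = new long[1000];
        for (int i = 0; i < timings.length; i++) {
            timings[i] = i;
        }
        for (int size : SIZES) {
            long[] set = select(timings, size);
            System.out.println(size + " samples, first measurement used: " + set[0]);
        }
    }
}
```

With 1000 collected measurements, the largest set (990 samples) uses exactly the measurements that remain after the warm-up samples are dropped.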
In our initial test results, we observed that the test occasionally
reported false positives, i.e., it reported a significant difference in
execution times between two sample sets collected for the same input.
Given the very small time scale on which we try to measure differences,
and the chaotic nature of computers, such false positives are to be
expected [69, 70]. To avoid drawing false conclusions from occasional
false-positive results, we reran the complete data collection and
statistical test 12 times for each compilation strategy and sample set
size, using a different sample set for each of the 12 experiments.
The columns labeled 20–990 in Table 4.2 show the results of the
statistical tests for the Core™ 2 and Core™ i7 architectures. The number
in each cell is the number of Anderson-Darling tests (out of 12) that
indicated it is impossible to distinguish the execution times of the
different input keys for the given sample set size. Values on a dark
background indicate that the majority of the statistical tests found no
significant difference between the execution times, and thus that the
application can be considered secure for the given sample size. The
lighter the background, the more tests indicate a difference in execution
time, and hence the easier it is to perform an attack for an attacker that
can collect that number of samples.
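How a cell value and its classification follow from the 12 repetitions can be sketched as below. The sketch makes two assumptions that the text does not state for this table: each repetition yields one p-value, and 0.05 is the significance level (the level quoted for the Keyczar experiment). The class and method names are hypothetical and the p-values in the usage example are invented for illustration only.

```java
final class RepeatedTestSummary {
    /** Assumed significance level; not stated in the text for Table 4.2. */
    static final double ALPHA = 0.05;

    /** Number of repetitions in which the test found no significant difference. */
    static int indistinguishableRuns(double[] pValues) {
        int count = 0;
        for (double p : pValues) {
            if (p >= ALPHA) {
                count++;
            }
        }
        return count;
    }

    /** True when the majority of the repetitions found no significant difference. */
    static boolean consideredSecure(double[] pValues) {
        return 2 * indistinguishableRuns(pValues) > pValues.length;
    }

    public static void main(String[] args) {
        // Invented p-values for 12 repetitions, for illustration only.
        double[] runs = {0.40, 0.03, 0.71, 0.22, 0.01, 0.55,
                         0.09, 0.64, 0.30, 0.02, 0.18, 0.47};
        System.out.println(indistinguishableRuns(runs) + "/" + runs.length
                + ", considered secure: " + consideredSecure(runs));
    }
}
```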
Again, we observe a clear difference between the two architectures.
In case an attacker has 990 timing samples for each key, the JIT compiler
provides full protection on the Core™ 2 machine at the lowest over-
head using the compilation strategy on row 11. This strategy protects
all methods in S, creates dummy array objects of size 64 and replaces di-
visions by a constant time implementation. For this design point, none
of the 12 tests were able to distinguish between the two input keys. On
the Core™ i7 machine complete protection can be provided at the low-
est overhead by compiling the application using the design point on row
16. The application needs to have all methods in S protected, dummy array
objects of size 64 and 12 nop instructions inserted between memory
instructions. For this design point, 11 of the 12 statistical tests indicate
no significant difference between the two sample sets.
When an attacker in some scenario would only be able to collect 20
samples, however, applications can be protected at lower overhead. On
the Core™ 2 machine the results for the compilation strategy on row 5
show that only 30 of the 32 methods in S need to be protected with only
the CFG transformations for 11/12 tests to indicate no significant
difference. On the Core™ i7 machine protecting all methods in S with CFG
transformations results in 10/12 statistical tests indicating no difference
between the inputs, as shown by the results on row 7.
version          total memory      minimum required    % of time spent
                 collected (MB)    heap size (MB)      in GC
clean            906               30                  0.015
unprotected      1109              32                  0.015
protected(1)     2554              32                  0.017
protected(8)     3585              32                  0.024
protected(16)    4737              32                  0.035
protected(32)    7069              32                  0.057
protected(64)    11660             32                  0.101
Table 4.3: Memory and garbage collection statistics for different application
versions running on the Intel Core™ 2 Duo.
Memory Usage and Garbage Collection
To measure the memory and garbage collection overhead of our approach
we performed 300 encryptions using 7 different versions of our
application: an unprotected version running on a clean JikesRVM build
without modifications, an unprotected version running on our modified
JIT compiler, and five equally protected versions (row 11 of Table 4.2),
each with differently sized dummy arrays ranging from 1 to 64 bytes. For
each of these versions Table 4.3 reports the total heap memory collected,
the minimum amount of heap memory required to run our encryption
algorithm, and the percentage of time spent garbage collecting in steady
state.
Between the clean build and the unprotected version, 203 MB of
additional memory is collected and the minimum required heap size
increases by 2 MB. The latter increase is mainly due to the additional
infrastructure needed for our protection framework. When protection
is enabled with dummy arrays of size 1, the total collected memory
increases by a further 1445 MB, and it keeps increasing with growing
dummy array sizes. The minimum required heap size, however, remains the same as
when executing the unprotected application. This is because the max-
imum heap size is reached during the initial startup phase, when the
VM is booted and the application is loaded and compiled. This initial
startup phase determines the minimal size of the heap for this experi-
ment.
There is no difference in the amount of time spent garbage collecting
in steady state execution between the clean build and the unprotected
version. When protecting the application, the percentage of time spent
garbage collecting increases with the size of the allocated dummy
arrays. The influence of garbage collection on the total execution time
remains minimal, with a maximum of 0.101% of total execution time
spent garbage collecting in the protected application with dummy array
sizes of 64.
[Figure: bar chart. Y-axis: compilation time (ms), 0 to 4000. Bars: protected execution, unprotected execution, unmodified compiler. Series: standard compilation, protection transformations. Data labels: 3341, 3249, 2653, 276.]
Figure 4.6: Core™ 2 JIT compilation times.
Compilation Overhead
When protecting code, the compilation time increases for three reasons.
First, more code needs to be compiled: in case a method is invoked both
from within and from outside of the protected code region, the extended
compiler needs to generate two versions of this method, a secured one and
a non-secured one. Secondly, the extended JIT compiler also applies its
additional transformations to secure the selected methods, which requires
computation time. Lastly, the modifications made to the compiler
infrastructure introduce some extra compilation overhead, even when the
application runs unprotected.
Figure 4.6 shows the JIT compilation times to compile the applica-
tion in Listing 4.2 on the Core™ 2 system for the unmodified JikesRVM
compiling an unprotected application, and for the modified framework
compiling a fully protected (with all transformations applied to all
methods in S) and an unprotected version.
Due to the modifications made in the compiler infrastructure to support
side channel protection, the unprotected application compiles 22.5%
slower. There is a lot of room for improvement here and we believe this
overhead can be significantly reduced with additional engineering effort,
because our implementation is a prototype in which optimizing compilation
time has not been a primary goal until now. The compilation overhead of
the protected application compared to the unprotected application is
11.3%. This overhead can be divided into additional compiled methods
(2.9%) and the extra security transformations (8.5%).
[Figure: bar chart. Y-axis: relative overhead, 0.0 to 70.0. Values per protection technique (Intel Core™ i7 / Intel Core™ 2): all methods, full if-conversion: 66.3 / 62.7; subset of methods: 44.4 / 40.2; selective if-conversion: 28.6 / 26.1; subset of methods, selective if-conversion: 22.2 / 20.8; optimal compilation strategy: 16.2 / 11.2.]
Figure 4.7: Slowdowns for modPow with different protection techniques.
The reported differences will be experienced by the users during the
startup of the application (more precisely, at the first invocation of a sen-
sitive region), which will be slowed down as all code then needs to be
compiled from scratch. Furthermore, whenever code needs to be recom-
piled because a new security requirement is imposed, the price of full
recompilation of the methods in the secured code regions will have to
be paid, which will be experienced as a temporary halting of the appli-
cation. In case the security requirements are lowered, it is conceivable
to perform the recompilation in the background (e.g., on a spare core of
a multi-core processor), while continuing to run the more heavily pro-
tected code. In that case, no sudden stuttering of the application will be
experienced. Previously compiled versions of methods can be stored
in a cache to avoid the cost of recompilation when switching between
security levels that have already been compiled.
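The cache of previously compiled versions can be pictured as a map keyed by method and security level. The following is a minimal sketch under stated assumptions, not the prototype's implementation: the class name, the string key and the use of a byte array as a stand-in for compiled code are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

final class CompiledVersionCache {
    private final Map<String, byte[]> cache = new HashMap<>();
    private final BiFunction<String, Integer, byte[]> compiler;
    private int compilations = 0;

    /** `compiler` is a hypothetical stand-in for the JIT: (method, level) -> code. */
    CompiledVersionCache(BiFunction<String, Integer, byte[]> compiler) {
        this.compiler = compiler;
    }

    /** Returns cached code for the requested level, compiling only on a miss. */
    byte[] codeFor(String method, int securityLevel) {
        String key = method + "@" + securityLevel;
        byte[] code = cache.get(key);
        if (code == null) {
            code = compiler.apply(method, securityLevel);
            compilations++;
            cache.put(key, code);
        }
        return code;
    }

    int compilations() {
        return compilations;
    }

    public static void main(String[] args) {
        CompiledVersionCache c =
                new CompiledVersionCache((method, level) -> new byte[] {level.byteValue()});
        c.codeFor("modPow", 1);   // compiled
        c.codeFor("modPow", 2);   // compiled: new security level
        c.codeFor("modPow", 1);   // served from the cache
        System.out.println("compilations: " + c.compilations());
    }
}
```

Switching back to a level that was compiled earlier then costs a lookup instead of a recompilation, at the price of keeping several code versions in memory.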
Comparison to Previous State Of The Art
The automated profile-based detection of code regions to be transformed
reduces the overhead of compiler-based side-channel mitigation
strategies by a significant amount, and our current work makes
them applicable to more complex code such as cryptographic libraries.
Figure 4.7 reports the overhead of our protection techniques for the
BigInteger modPow(BigInteger exponent, BigInteger modulus) method. This is the
method detected in iteration 1 of Algorithm 1 as the source of timing
variation for the RSA encryption routine. The Y-axis denotes the execu-
tion times relative to the execution time of the unprotected application
using the input key yielding the slowest execution time. Protecting all
methods reachable from the modPow method, if-converting all branches
within those methods and applying all available protection techniques,
which is the approach used in the previous state of the art [86], results
in an overhead of 66.3x on the Core™ i7 machine. Protecting only a sub-
set of methods reduces the overhead to 44.4x and partial if-conversion
reduces the overhead to 28.6x. Combining the two techniques reduces
the overhead further to 22.2x. The graph shows similar results on the
Core™ 2 machine.
When compiling with the optimal security strategy the overhead can
be reduced further: on the Core™ i7 to 16.2x using the compilation
strategy on row 17 in Table 4.2, and on the Core™ 2 to 11.2x using the
strategy on row 12. In total we reduce the overhead by 75.6% on the Core™
i7 machine and 82.2% on the Core™ 2 machine while still providing the
same level of protection. As can be seen in Table 4.2, less secure
application versions have even lower overhead.
4.2.4 Keyczar HMAC Verification
In a second experiment we tested our approach on an older version of
the Google Keyczar library. The HMAC signature verification functionality
of this library is known to leak timing information because it uses a
standard Arrays.equals(byte[],byte[]) function call to compare the two
HMAC signatures [64, 65]. This function compares two byte arrays and
returns as soon as it encounters two bytes that are unequal. Therefore,
the verification process takes longer the more consecutive correct
signature bytes are checked. Although this timing leak was patched in
May 2009, it can still serve as an example to show how our profile-based
approach could have easily detected and fixed this leak and similar types
of early-exit timing leaks.
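The early-exit behavior described above, and the kind of comparison that avoids it, can be illustrated at the source level. This is a sketch for illustration only: our approach removes the leak in the JIT compiler rather than by rewriting the library source, and the class and method names below are hypothetical.

```java
final class SignatureCompare {
    /** Early-exit comparison: returns at the first mismatching byte. */
    static boolean earlyExitEquals(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                return false;   // the number of iterations depends on the data
            }
        }
        return true;
    }

    /** Visits every byte regardless of where the first mismatch occurs. */
    static boolean fullScanEquals(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;       // the length is assumed to be public information
        }
        int diff = 0;
        for (int i = 0; i < a.length; i++) {
            diff |= a[i] ^ b[i];   // accumulate differences without branching
        }
        return diff == 0;
    }

    public static void main(String[] args) {
        byte[] expected = {1, 2, 3, 4};
        byte[] forged = {1, 2, 9, 4};
        System.out.println(earlyExitEquals(expected, forged));
        System.out.println(fullScanEquals(expected, forged));
    }
}
```

Both methods return the same result; only the second has a loop trip count that is independent of the position of the first incorrect byte. In current JDKs, java.security.MessageDigest.isEqual(byte[], byte[]) is the standard library method intended for this kind of comparison.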
Figure 4.8 shows the average execution time, based on 10,000 samples
per signature, of the unprotected signature verification function for 21
different signatures with an increasing number of consecutive correct
bytes. Signature 0 with 0 correct bytes has, in theory, the fastest
verification time while signature 20 is the correct signature and takes the
longest to verify. Although the ordering is not 100% correct due to noisy
measurements, there is a clear relationship between the number of
consecutive correct signature bytes and the execution time. We marked
Signer.verify(byte[] text,byte[] signature) as the root of protection and
applied our profile-based protection techniques, using a correct and an
incorrect key as test inputs. In this experiment, Algorithm 1 did not
detect any methods to protect. However, Algorithm 2, based on instruction
counts, correctly tagged Arrays.equals(byte[],byte[]) as the source of
timing variation.
[Figure: plot. X-axis: consecutive correct signature bytes, 0 to 20. Y-axis: average execution time (ns), 4598 to 4604.]
Figure 4.8: Average HMAC execution time on the Core™ 2.
[Figure: plot. X-axis: difference in consecutive correct signature bytes, 0 to 20. Y-axis: p-values, 0 to 1. Series: unprotected, protected, p = 0.05.]
Figure 4.9: Anderson-Darling test results on the Core™ 2.
Figure 4.9 shows the results of the Anderson-Darling statistical test
comparing signatures with different numbers of consecutive correct bytes,
based on 10,000 samples per signature. P-values below the critical
value of 0.05 indicate that we can reject the null hypothesis that there
is no statistical difference between the sample sets. Conversely, a
p-value above 0.05 means the test finds no statistically significant
difference between the two sets of execution times, so an attacker cannot
reliably differentiate between them. In the unprotected program version we
are able to differentiate between signatures that differ by two or more
correct bytes, while in the protected version there is no statistical
difference between the signatures.
The overhead of our mitigation technique in this test case is min-
imal, 7.5% with all mitigation strategies enabled and only 5.7% using
the optimal protection strategy.
4.3 Related Work
Compared to existing (static) compiler techniques [34, 86], our approach
uses profile-based analysis to automate and optimize the selection of
the protected code base, reducing the overhead by up to 90%. This makes
our approach suitable for protecting real-life security libraries, where
existing techniques would introduce unacceptable overhead. Secondly,
JIT compilation allows security transformations to be optimized for a
target architecture and required security level. Only one (unprotected)
version of the application needs to be distributed, whereas static
compilation techniques require separate binaries for each architecture and
security requirement.
Compared to other pure software techniques, our approach does not
have to rely on operating system modifications to inject enough noise,
e.g., in the measurable cache behavior [89], or to pad the execution time
of methods and provide resource isolation [25]. Our approach ensures
an application can run protected in any target environment, without
directly affecting the other processes that happen to share resources
with the protected application. Other approaches using JIT compilation
rely on dynamic software diversity to mask information leakage [35,
50], while our approach inherently provides complete protection against
branch prediction attacks and instruction cache attacks, and we provide
statistical evidence that no differences in execution time can be
observed. However, compared to the above techniques, which mask
information leakage or require operating system modifications, the
overhead of our approach can be significantly higher.
Both hardware [23, 48, 87, 88] and algorithmic [12, 47, 55, 57]
approaches can be easily integrated in our framework. The compiler can,
for example, automatically replace instructions that leak timing infor-
mation by calls to fixed library functions [12], or generate code to take
advantage of hardware features such as the Intel AES instruction set [48]
or side channel free cache designs [88] in code fragments that require
protection.
Chapter 5
Tool Prototype
The prototype implementation of our side-channel-free virtual ma-
chine is built on top of the Jikes Research Virtual Machine (Jikes RVM
v3.1.2) [11].
We chose JikesRVM for several reasons. First, it is well known in
the compiler research community and has an active developer base. It
already has an extensive set of built-in profiling and instrumentation
tools. It is therefore often a first choice for many researchers, resulting
in many published papers based on JikesRVM.
Secondly, JikesRVM is a metacircular virtual machine, written al-
most entirely in Java itself. This has not only the benefit that the com-
piler extensions can be implemented in a managed language such as
Java, but also that any implemented compiler optimization can also be
applied on the compiler code itself. This makes it an ideal candidate to
implement a prototype of our transformation techniques.
The main focus of JikesRVM, however, is research. Execution time
results obtained using JikesRVM can be significantly higher than using
a “production level” virtual machine. We are, however, only interested
in the relative performance decrease of our mitigation strategies, not in
the absolute numbers. We expect to see the same relative performance
in production virtual machines as we do with Jikes RVM.
5.1 Extending the Adaptive Optimization System
Jikes uses an Adaptive Optimization System (AOS) to gather runtime
information about the running applications and perform code optimizations
based on that data. In this section we give an overview of the system and
how we extended it to support our side channel mitigation techniques. The
general overview of the AOS can be seen in Figure 5.1.
[Figure: block diagram. Labels: knowledge repository, controller thread, executing code, priority queue, compilers, compilation thread, method sample data, hot method organizer, adaptive inlining organizer, dynamic call graph, decay organizer, method samples, call edge samples, runtime measurements.]
Figure 5.1: Jikes Adaptive Optimization System. Source: Matthew Arnold
et al., Architecture and Policy for Adaptive Optimization in Virtual
Machines [11].
During execution of the application, call edge samples and method
samples are collected at specific points in the executed code, and pro-
cessed by the runtime measurements subsystem, indicated by the dashed
line. Edge samples are aggregated by the adaptive inlining organizer
to determine which method calls should be inlined during execution.
Method samples are aggregated by the hot method organizer to de-
termine which methods are “hot” and should be compiled at a higher
optimization level. When enough samples are collected, the organizer
notifies the controller thread.
The controller thread is at the core of the AOS: it decides what profile
data should be gathered, analyzes the data received from the runtime
measurements and decides which action to take based on that infor-
mation. If the controller decides that a specific hot method needs to
be optimized or a particular call needs to be inlined based on the run-
time measurements, the method in question is passed to the compilation
thread together with a compilation plan containing the required param-
eters for compilation. When compilation is finished, the old compiled
code is replaced by the newly optimized compiled code and program
execution is continued. Communication between the controller,
compilation thread and runtime measurements subsystem is done through
priority queues.
[Figure: block diagram. Labels: CSD, knowledge repository, controller thread, executing code, priority queue, compilers, compilation thread, runtime measurements, security organizer, external interface, external applications (IDS, hypervisor, OS).]
Figure 5.2: Jikes Adaptive Optimization System with side channel security
support.
The knowledge repository is the central point where each component
can store or retrieve data about the running application or the state of
the AOS. It contains, for example, profile data generated by the runtime
measurements, the history of the compiled methods and information
about the global state of the AOS.
We extended the AOS framework by adding an additional Security
Organizer, an interface for external applications and the Compilation
Strategy Database (CSD) as shown in Figure 5.2.
The Compilation Strategy Database (CSD) is an extension of the knowledge
repository and the central point where the compilation strategies
and runtime profiles generated during the analysis phase are stored. It
is used during program execution by the class loader, dynamic linker
and compiler to ensure the application is secured according to the target
compilation strategy.
Similar to the runtime measurements subsystem, the security organizer
can instruct the controller to recompile application methods. However,
recompilation is only triggered when the security requirements
change. This can happen for two reasons:
• A security level change is requested through the external interface
by an external source such as the operating system, an intrusion
detection system or a hypervisor. The operating system can, for
example, request a higher security level if the system is connected
to a network, the hypervisor might schedule a third party virtual
machine running on the same core or the intrusion detection sys-
tem might notice an ongoing attack.
• Recompilation can be requested by the running application itself.
During the compilation of protected methods, monitor instructions
are inserted on paths in the control flow graph that the initial
profile analysis phase tagged as never executed. These instructions
serve as guards: whenever one of them is executed, a path in the CFG
is taken that was not profiled during the analysis phase, and the
application can possibly leak side channel information. When such a
guard instruction is executed, the corresponding branch is tagged to be
protected in the CSD and the security organizer instructs the
controller to recompile the method.
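The second trigger can be sketched as follows. This is a simplified model under stated assumptions, not code from the prototype: the class, the string-based branch identifiers and the queue are hypothetical stand-ins for the CSD bookkeeping and the communication between the security organizer and the controller.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

final class GuardSketch {
    /** Hypothetical stand-in for the branches tagged as protected in the CSD. */
    private final Set<String> protectedBranches = new HashSet<>();
    /** Hypothetical stand-in for recompilation requests sent to the controller. */
    private final Queue<String> recompileRequests = new ArrayDeque<>();

    /** Models a guard executing on a path that profiling marked as never executed. */
    void guardHit(String method, int branchId) {
        // Set.add returns true only when the branch was not tagged before,
        // so a repeated hit on the same guard does not request another recompile.
        if (protectedBranches.add(method + "#" + branchId)) {
            recompileRequests.add(method);
        }
    }

    int pendingRecompilations() {
        return recompileRequests.size();
    }

    public static void main(String[] args) {
        GuardSketch sketch = new GuardSketch();
        sketch.guardHit("modPow", 7);
        sketch.guardHit("modPow", 7);
        System.out.println("pending recompilations: " + sketch.pendingRecompilations());
    }
}
```

In a real system the recompiled method would no longer contain the guard for that branch, so the duplicate check mainly matters while the recompilation is still pending.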
5.2 Class Loading and Dynamic Linking
In order to minimize the overhead of the protections, it is important to
realize that many of the sensitive code fragments invoked from within
the root methods are also invoked from within non-sensitive regions.
To avoid paying the protection overhead when executing code outside of
sensitive regions, and to avoid recompiling code when transferring
control from sensitive regions to insensitive ones and vice versa, we
adapted the VM and JIT to support two versions of all methods: a secured
version and an unsecured version. This is particularly useful for
libraries and methods that are shared throughout the different components
of applications.
To support two versions of the same method, both the class loader
and dynamic linker need to be adapted.
Class Loading
Before we go into more detail on the modifications to the class loader,
some background on the internal representation of classes in JikesRVM
is required. Figure 5.3 gives an overview of some relevant parts of the
JikesRVM internal object model and the relation between different internal
data structures. The names of the internal structures are indicated
in bold. Their contents are represented as arrays. The relations between
the different structures are indicated by the arrows.
[Figure: diagram showing the layout of the JTOC, an object, a TIB, CodeArray objects and an RVMClass, with arrows for the references between them; one virtual method entry of the TIB refers to the lazyMethodInvoker.]
Figure 5.3: The JikesRVM object model [1, 39].
For each class, JikesRVM keeps track of its virtual methods in a Type
Information Block (TIB), a structure similar to the virtual method table
commonly used in object-oriented programming languages. The virtual
method table contains entries with references to CodeArray objects
containing the current compiled code for each method. Besides the virtual
methods, the TIB contains a reference to the class descriptor together
with some status information. Each class has its own TIB and each runtime
object of that class has a reference to its TIB to allow easy
determination of the runtime type and easy method resolution.
References to static methods are not stored in the class TIB. They are
put in the global JikesRVM Table of Contents (JTOC) together with static
initialization methods, static fields and TIB references, among others.
To keep track of two versions of the same method, each version
needs its own entry in the TIB or JTOC. We added support for this by
modifying the class loader to clone each method of a class as it is loaded,
and assign the clone a separate entry. Subsequently we add an extra
boolean and integer parameter to the method’s descriptor to support
the if-conversion of conditional calls and fixed recursion depths as
described in Section 3.1. A mapping between the secured and unsecured
versions of the method is then stored in the bookkeeping part of the
CSD. Because each version has a separate offset in the TIB or JTOC, both
versions can coexist and can be called separately, without the need for
recompilation. This approach can easily be extended to support different
security levels by creating multiple TIB or JTOC entries, one for each
level.
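The bookkeeping described above can be modeled with a table of slots and a map from the unsecured to the secured offset. This is a minimal sketch with hypothetical names: the list stands in for TIB or JTOC entries, the map for the bookkeeping part of the CSD, and strings for method entries. The extra descriptor parameters are not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DualVersionTable {
    /** Stand-in for the entries of a TIB or the JTOC. */
    private final List<String> slots = new ArrayList<>();
    /** Stand-in for the CSD mapping: unsecured offset -> secured offset. */
    private final Map<Integer, Integer> securedOffsets = new HashMap<>();

    /** Models loading a method: the method and its clone get separate entries. */
    int load(String methodName) {
        int unsecured = slots.size();
        slots.add(methodName);
        int secured = slots.size();
        slots.add(methodName + "$secured");   // hypothetical naming of the clone
        securedOffsets.put(unsecured, secured);
        return unsecured;
    }

    int securedOffsetOf(int unsecuredOffset) {
        return securedOffsets.get(unsecuredOffset);
    }

    String entryAt(int offset) {
        return slots.get(offset);
    }

    public static void main(String[] args) {
        DualVersionTable table = new DualVersionTable();
        int modPow = table.load("modPow");
        System.out.println(table.entryAt(modPow) + " / "
                + table.entryAt(table.securedOffsetOf(modPow)));
    }
}
```

Because both versions have their own offset, a call site selects a version simply by the offset it uses, which is why no recompilation is needed when control moves between sensitive and non-sensitive regions.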
Dynamic Linking
JikesRVM uses a lazy compilation approach to compile class meth-
ods [61]. Instead of compiling all class methods when the class is
instantiated, which would cause a considerable overhead at the start
of the application, methods are only compiled when they are first in-
voked. During class initialization the virtual method references in the
TIB and the static method references in the global JTOC are initialized
to point to the so-called lazyMethodInvoker, as is the case for the
reference to the code of virtual method 4 in the TIB in Figure 5.3. This is a
fragment of trampoline code that invokes the compiler to generate code
for the called method, updates the relevant JTOC or TIB entry to point
to the generated code, and then passes control to the called method. All
subsequent calls to that method will go directly to the generated code.
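The trampoline mechanism can be modeled in a few lines. The sketch below is an illustration under stated assumptions, not JikesRVM code: table entries are modeled as IntUnaryOperator values, "compilation" is simulated by returning the method body unchanged, and all names are hypothetical.

```java
import java.util.function.IntUnaryOperator;

final class LazyDispatchTable {
    private final IntUnaryOperator[] entries;
    private int compilations = 0;

    LazyDispatchTable(IntUnaryOperator[] methodBodies) {
        IntUnaryOperator[] table = new IntUnaryOperator[methodBodies.length];
        for (int i = 0; i < table.length; i++) {
            final int slot = i;
            // Every entry initially points to a trampoline.
            table[slot] = arg -> {
                IntUnaryOperator code = compile(methodBodies[slot]);
                table[slot] = code;            // patch the entry
                return code.applyAsInt(arg);   // pass control to the method
            };
        }
        this.entries = table;
    }

    /** Simulated compilation: counts the request and returns the body as is. */
    private IntUnaryOperator compile(IntUnaryOperator body) {
        compilations++;
        return body;
    }

    int invoke(int slot, int arg) {
        return entries[slot].applyAsInt(arg);
    }

    int compilations() {
        return compilations;
    }

    public static void main(String[] args) {
        LazyDispatchTable table = new LazyDispatchTable(
                new IntUnaryOperator[] {x -> x + 1, x -> x * 2});
        table.invoke(0, 1);   // first call goes through the trampoline
        table.invoke(0, 5);   // second call goes directly to the patched entry
        System.out.println("compilations: " + table.compilations());
    }
}
```

Methods that are never invoked keep their trampoline entry and are never compiled, which is the startup saving that lazy compilation aims for.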
The lazy method invoker resolves the requested method based on
the return address in the stack frame of the calling method. This
return address is used to find the call instruction in the calling
method that triggered the lazy method invoker code. The bytecode index
of this call instruction is then used to calculate an offset
in the constant pool of the calling method’s class. This entry in the
constant pool contains a reference to the method’s description. Finally,
a virtual method reference matching this description is searched for in
the list of virtual methods of the class of the called method.
We “hijack” the method resolution code by returning a method reference
to the protected version of the method if the following conditions
apply:
• The calling method is tagged as a secured method and the compi-
lation plan in the CSD requires that a call from the calling method
to the callee method is protected.
• The callee is tagged as a secured method.
The lazy method invoker then continues as usual by updating the
relevant JTOC or TIB entries and passes control to the secured method.
Subsequent calls will always point to the secured method unless the
calling method is recompiled without side channel protection.
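The decision taken in the modified resolution code amounts to a conjunction of the conditions listed above. A minimal sketch, assuming that the three facts are already available as booleans and using a hypothetical naming scheme for the secured version:

```java
final class SecureResolution {
    /**
     * Returns the name of the version to link against. The "$secured" suffix
     * is a hypothetical way to refer to the secured clone of a method.
     */
    static String resolve(String callee,
                          boolean callerSecured,
                          boolean callProtectedInPlan,
                          boolean calleeSecured) {
        boolean useSecured = callerSecured && callProtectedInPlan && calleeSecured;
        return useSecured ? callee + "$secured" : callee;
    }

    public static void main(String[] args) {
        System.out.println(resolve("multiply", true, true, true));
        System.out.println(resolve("multiply", false, true, true));
    }
}
```

A call from non-sensitive code therefore keeps resolving to the unsecured version, even when a secured clone of the callee exists.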
5.3 Analysis Tools
To analyze the code and generate runtime profiles we make use of the
instrumented event counters available in JikesRVM. These event coun-
ters are placeholder instructions that can be inserted in the high level
intermediate representation of the code. At a later stage during com-
pilation, during the instrumentation lowering phase, these instructions
are then translated to perform the required functionality.
JikesRVM already implements some of the functionality we need to
perform the profile-based analysis, such as method invocation counters
and instruction counters. We only had to modify the instruction counter
code slightly to count each individual instruction of each method
instead of an aggregated count per instruction type. Additionally,
event counters to track recursion depth are inserted at each method
prologue.
To determine the loop upper bounds, instrumentation instructions
are inserted at method prologue and all loop headers (i.e. the entry
points of the loop). During the instrumentation lowering phase they
are translated as follows:
• Loop counting instrumentation instructions located at method
prologues reset the loop counter for each loop in the method.
• The counting instrumentation instructions located at the loop
headers increment the loop count of the loop corresponding to
the header and adjust the maximum loop count where needed.
Furthermore, the counters of all direct and indirect child loops of
the loop are reset.
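The effect of the lowered instrumentation can be modeled as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, loops are represented as a tree that mirrors their nesting, and the simulated header is hit once per iteration (a real loop header may also execute once more for the exit test).

```java
import java.util.ArrayList;
import java.util.List;

final class LoopProfile {
    private final List<LoopProfile> children = new ArrayList<>();
    int count = 0;   // header hits in the current execution of the loop
    int max = 0;     // largest count observed so far

    /** Registers a loop nested directly inside this one. */
    LoopProfile addChild() {
        LoopProfile child = new LoopProfile();
        children.add(child);
        return child;
    }

    /** Prologue instrumentation: resets this loop counter and those of nested loops. */
    void reset() {
        count = 0;
        for (LoopProfile child : children) {
            child.reset();
        }
    }

    /** Header instrumentation: count, track the maximum, reset all nested loops. */
    void onHeader() {
        count++;
        if (count > max) {
            max = count;
        }
        for (LoopProfile child : children) {
            child.reset();
        }
    }

    public static void main(String[] args) {
        LoopProfile outer = new LoopProfile();
        LoopProfile inner = outer.addChild();
        int[] innerTrips = {2, 5, 1};       // invented trip counts
        outer.reset();                      // method prologue
        for (int trips : innerTrips) {
            outer.onHeader();
            for (int i = 0; i < trips; i++) {
                inner.onHeader();
            }
        }
        System.out.println("outer bound: " + outer.max + ", inner bound: " + inner.max);
    }
}
```

Resetting the nested counters at the parent header is what makes the recorded maximum a bound per execution of the inner loop rather than a total over the whole method invocation.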
At the moment our approach is limited to single threaded applica-
tions. To support multi-threaded execution, however, the analysis tools
can easily be modified such that each thread has its own set of counters.
Instrumenting boot image methods
So-called boot image methods are compiled during the compilation of
the VM itself and therefore do not contain any instrumentation instruc-
tions. To make sure all key-dependent control flow is detected when
generating the runtime profiles, boot image methods need to be instru-
mented as well. To this end, the instrumentation phases are modified
to request the controller thread to recompile every boot image method
called from the instrumented method.
5.4 Compilation Passes
The key components of our mitigation technique are the individual
compiler passes that actually rewrite code and remove the timing vari-
ation in the application. Based on the information stored in the CSD
they apply the required transformations to ensure side-channel-safe
code.
The control flow transformations in this section are based on the
if-conversion technique to rewrite control dependencies to data de-
pendencies using predicated execution. If-conversion as an optimiza-
tion is traditionally only applied in optimizing compilers when predi-
cated execution is expected to have a performance increase compared
to code containing branches [10]. Coppens et al. [34] expanded this
idea and applied if-conversion at the whole method level to eliminate
key-dependent control flow in an application. The transformations pre-
sented in this section are based on the transformations of Coppens et al.
[34]. Although the idea of if-converting code to remove key-dependent
control flow is the same, our transformations can also convert individual
branches in a method. Furthermore, we made the transformations more
robust and added support for additional code structures, allowing
the protection techniques to be applied to realistic software instead of
small test cases.
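The idea behind if-conversion can be illustrated at the source level. The sketch below is only an analogy: our transformations operate on the compiler's intermediate representation and rely on predicated execution, whereas this example emulates the predicate with a bit mask in Java. The class and method names are hypothetical, and a Java compiler is free to generate different machine code for either version.

```java
final class IfConversionSketch {
    /** Original form: which arm executes depends on the secret value. */
    static int branchy(int secret, int a, int b) {
        if (secret > 0) {
            return a + 2;
        } else {
            return b * 2;
        }
    }

    /** If-converted form: both arms are computed and a predicate selects one. */
    static int ifConverted(int secret, int a, int b) {
        int thenValue = a + 2;
        int elseValue = b * 2;
        // mask is -1 (all ones) when secret > 0 and 0 otherwise; widening to
        // long keeps the negation correct for Integer.MIN_VALUE.
        int mask = (int) ((-(long) secret) >> 63);
        return (thenValue & mask) | (elseValue & ~mask);
    }

    public static void main(String[] args) {
        System.out.println(branchy(5, 1, 10) + " " + ifConverted(5, 1, 10));
        System.out.println(branchy(-5, 1, 10) + " " + ifConverted(-5, 1, 10));
    }
}
```

The control dependence on the secret has become a data dependence: the same instructions execute for every input, at the cost of always computing both arms.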
5.4.1 Static Single Assignment
Our transformations, and the transformations presented by Coppens
et al. [34] on which we based them, are applied on code that is in static
single assignment (SSA) form [37]. Code in SSA form has the following
properties:
• Each variable is assigned exactly once.
• Every variable is defined before it is used.
(a)
A: x ← 3
   x ← x + 1
   if(y > 0)
B: y ← y + 2
C: y ← y * 2
D: z ← y + 1
   x ← x + z
(b)
A: x1 ← 3
   x2 ← x1 + 1
   if(y1 > 0)
B: y2 ← y1 + 2
C: y3 ← y1 * 2
D: y4 = φ(y2,y3)
   z1 ← y4 + 1
   x3 ← x2 + z1
Figure 5.4: Code fragment before transformation to Static Single Assignment
(a) and after (b).
We refer to the literature for detailed information on how to
transform code into and out of SSA form [14]. To clarify our
transformation techniques, we explain SSA using the code fragment
illustrated in Figure 5.4(a). The figure shows the control flow graph
of a small test application. The branch in basic block A jumps to
basic block B or C depending on the value of y. Depending on the
branch direction, y gets assigned a different value. The code paths
then converge again in basic block D.
To transform this code to SSA form, as shown in Figure 5.4(b), the
variables are renamed so that each variable is assigned only once.
Furthermore, so-called φ-nodes are inserted wherever a use can be
reached by multiple definitions. In this case, the use of y in the
line z ← y + 1 in block D can refer both to the y defined in block B
and to the y defined in block C. To differentiate between the two
definitions, a φ-node φ(y2, y3) is inserted in block D that selects
the correct definition for y4 depending on which way control came
from.
SSA reduces the complexity of many compiler transformations, such as
constant propagation and dead code elimination, because def-use
chains can be calculated easily [14].
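As an illustration, the SSA fragment of Figure 5.4(b) can be modelled in a few lines of Python. This is only a sketch: the helper phi and the function fragment_ssa are hypothetical names, and it assumes that the taken direction of the branch in block A leads to block B, which the figure does not state.

```python
def phi(came_from, incoming):
    """Model of a phi-node: return the operand that belongs to the
    predecessor block control actually arrived from."""
    return incoming[came_from]


def fragment_ssa(y1):
    """SSA version of the fragment in Figure 5.4(b)."""
    x1 = 3                      # block A
    x2 = x1 + 1
    if y1 > 0:                  # assumption: taken direction leads to B
        y2 = y1 + 2             # block B
        came_from, incoming = "B", {"B": y2}
    else:
        y3 = y1 * 2             # block C
        came_from, incoming = "C", {"C": y3}
    y4 = phi(came_from, incoming)   # block D: y4 <- phi(y2, y3)
    z1 = y4 + 1
    x3 = x2 + z1
    return x3
```

Every name is assigned once, and the only place where two definitions of y meet is the explicit phi call in block D.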
5.4.2 Overview of Compilation Passes
The security transformations are implemented as separate compilation
phases in the overall compilation plan of the compiler. This allows
the different transformations to take place at different points in
the compilation process. High-level transformations such as
if-conversion can happen early, on the high-level intermediate
representation of the code, while other transformations such as
inserting NOP instructions can happen at a later stage, just before
machine code is emitted. In between, the compiler applies its usual
optimizing transformations.
We implemented the following transformations that transform code
in the high-level intermediate representation of the optimizing com-
piler:
• Calculate basic block predicates
• Rewrite loops
• Rewrite branches
• Rewrite call instructions
• Safeguard individual instructions
At a later phase during compilation, the following transformations
are applied:
• Insert NOP instructions
• Replace variable latency instructions by constant time implemen-
tations
5.4.3 Basic Block Predicates
The first step to secure a method is to insert code that calculates
the predicate of each basic block in the method. Figure 5.5
illustrates this on a fictitious example. The figure represents the
CFG of the method we want to protect. Each addition made to the code
of the method is shown in a gray box.
Figure 5.5: Calculation of basic block predicates.
    A: pa ← pmethod
    B: pb ← φ(pa, pet)
    C: pc ← pbt
    D: pd ← pbn
    E: pe ← φ(pc, pd)
    F: pf ← pen
    Edge predicates: pa (A to B), pbt (B to C), pbn (B to D),
    pc (C to E), pd (D to E), pet (E to B), pen (E to F)
At run-time, predicates are propagated along the edges of the CFG,
starting from the method entry point. This ensures that at any point
during the execution, the predicate of each basic block, and
therefore of every instruction in that basic block, is calculated.
The code in that basic block is then executed conditionally based on
the predicate.
At compile-time, the code to calculate the predicates is inserted in
two passes. In a first pass, each basic block and each CFG edge of
the method gets assigned a placeholder predicate. Basic block A gets
assigned predicate pa, basic block B gets assigned predicate pb, etc.
For basic blocks with only one outgoing edge, the predicate of the
basic block is assigned to the outgoing edge and no further
calculations are required. For example, the outgoing edge of basic
block A gets assigned the predicate pa. Outgoing edges of basic
blocks ending with a branch instruction are assigned a predicate
based on the branch condition. In basic block B, for example, the
outgoing edges get assigned predicate pbt for the taken edge of the
branch and pbn for the not-taken edge. The code to calculate the
values of these predicates depends on the type of edge (loop edge or
normal branch) and on whether or not the edge will be protected. It
is inserted at a later stage in compilation, when the individual
branch instructions are transformed.
Figure 5.6: Transforming a single protected branch using
if-conversion.
    Original code: B: if(tb) goto C
                   C: pc ← pbt; y1 ← c
                   D: pd ← pbn; y2 ← d
                   E: pe ← φ(pc, pd); y3 ← φ(y1, y2)
    Transformed code: B: pbt ← tb ∧ pb; pbn ← ¬tb ∧ pb
                      E: pe ← pc ∨ pd; y3 ← pd ? y2 : y1
Secondly, once all edges have a predicate assigned, code is inserted
to calculate the value of the predicate of each block based on the
predicates of its incoming edges. As already mentioned in
Section 3.1, the code of a protected method is executed conditionally
based on an additional predicate parameter of the method.
Consequently, the first basic block in the method gets assigned the
method predicate. In this case, the following code is inserted at the
start of basic block A: pa ← pmethod. Predicates of basic blocks with
only one incoming edge get assigned the value of the predicate of the
incoming edge. For example: pf ← pen. Basic blocks with multiple
incoming edges get assigned a value using a φ-node, which selects the
predicate of the incoming edge based on which way control came from.
For example, the predicate of basic block B is calculated as follows:
pb ← φ(pa, pet).
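The propagation rules above can be traced with a small Python model of the CFG in Figure 5.5. The sketch below is an illustration under stated assumptions, not the implementation in the compiler: the function name and the input lists tb_values and te_values (the per-iteration outcomes of the branches in B and E) are hypothetical, both lists must be non-empty, and the last entry of te_values must be False so that the loop is left through the not-taken edge. The edge predicates use the formulas that Section 5.4.4 gives for protected branches.

```python
def run_with_predicates(p_method, tb_values, te_values):
    """Walk A, B, (C or D), E, then back to B or on to F, and record
    the block predicates the inserted code would compute."""
    trace = []
    p_a = p_method                    # pa <- pmethod
    trace.append(("A", p_a))
    incoming = p_a                    # first entry into B comes from A
    for tb, te in zip(tb_values, te_values):
        p_b = incoming                # pb <- phi(pa, pet)
        trace.append(("B", p_b))
        p_bt = tb and p_b             # taken edge of the branch in B
        p_bn = (not tb) and p_b       # not-taken edge
        if tb:
            p_c = p_bt                # pc <- pbt
            trace.append(("C", p_c))
            p_e = p_c                 # phi(pc, pd): control came from C
        else:
            p_d = p_bn                # pd <- pbn
            trace.append(("D", p_d))
            p_e = p_d                 # phi(pc, pd): control came from D
        trace.append(("E", p_e))
        p_et = te and p_e             # taken edge back to B
        p_en = (not te) and p_e       # not-taken edge to F
        if not te:
            break
        incoming = p_et
    p_f = p_en                        # pf <- pen
    trace.append(("F", p_f))
    return trace
```

When the method predicate is true, every executed block gets a true predicate; when it is false, the same blocks are visited but all predicates are false.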
5.4.4 Protected Branches
Branches are protected by transforming key-dependent control flow
into data dependencies using if-conversion. We demonstrate this
transformation by rewriting the branch if(tb) goto C in basic block
B, shown in Figure 5.6. The white boxes show the original program
code, including the calculated predicates; the gray boxes show the
transformed code. The transformation is done in three steps:
First, the branch if(tb) goto C is removed and code is inserted to
propagate the predicate pb of the basic block containing the branch
instruction along its outgoing edges in the CFG. The predicate pbt of
the taken edge is calculated by and-ing the branch condition with the
predicate of the basic block: pbt ← tb ∧ pb. Similarly, the predicate
pbn is calculated by and-ing the predicate of the basic block with
the negation of the jump condition: pbn ← ¬tb ∧ pb. This ensures that
the code on the path that would not have been executed never alters
the program state.
Figure 5.7: Transforming a single protected branch using
if-conversion: Linearized code.
    B: pbt ← tb ∧ pb; pbn ← ¬tb ∧ pb
    C: pc ← pbt; y1 ← c
    D: pd ← pbn; y2 ← d
    E: pe ← pc ∨ pd; y3 ← pd ? y2 : y1
Secondly, the φ-nodes in basic block E, where the control flow
converges again, are transformed. Because the removal of the
protected branch linearizes the code contained in basic blocks B, C,
D, and E, as shown in Figure 5.7, the number of incoming edges of the
block where control flow converges decreases by 1. In our example,
the number of incoming edges of basic block E is reduced from 2 to 1
by the transformation. The φ-nodes in that block therefore have to be
rewritten to account for that change. φ-nodes that calculate basic
block predicates and φ-nodes that are part of the original program
code are transformed differently:
• The general formula for transforming a predicate φ-node
px ← φ(p1, p2, p3, ..., pn)
when removing the branch with predicate p1 can be written as
follows:
ptmp ← φ(p2, p3, ..., pn)
px ← p1 ∨ ptmp (5.1)
The predicate p1 of the removed branch is taken out of the φ-node
and afterwards or-ed with the result ptmp of the modified φ-node to
calculate the basic block predicate px.
In our example, only one incoming edge remains, so the modified
φ-node has a single operand and reduces to that operand. The φ-node
pe ← φ(pc, pd) can therefore be rewritten as follows: pe ← pc ∨ pd
• φ-nodes that are part of the original application, of the form
x ← φ(x1, x2, x3, ..., xn)
are transformed as follows when removing the branch related to the
value x1:
xtmp ← φ(x2, x3, ..., xn)
x ← p1 ? x1 : xtmp (5.2)
Here, the value x1 related to the removed edge is taken out of the
φ-node. The value of x is then calculated by selecting either x1 or
the result xtmp of the modified φ-node, based on the predicate p1 of
the removed edge.
In our example, the code will be linearized as shown in Figure 5.7,
removing the edge between blocks C and E. The φ-node y3 ← φ(y1, y2)
can therefore be rewritten as: y3 ← pc ? y1 : y2
Thirdly, the code is linearized in code order, as shown in
Figure 5.7. Because the code is in SSA form, the code in basic blocks
C and D can remain as is.
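The three steps can be checked on the running example with a short Python sketch that compares the branching code with its linearized counterpart. The function names are hypothetical, and Python's conditional expression merely stands in for the conditional move of the real IR; Python itself gives no timing guarantees.

```python
def branching(t_b, c, d):
    """Original code: either block C or block D runs, then E merges."""
    if t_b:
        y1 = c              # block C
        return y1           # y3 <- phi(y1, y2), control came from C
    y2 = d                  # block D
    return y2               # y3 <- phi(y1, y2), control came from D


def if_converted(p_b, t_b, c, d):
    """Linearized code: blocks B, C, D and E always run in this order."""
    p_bt = t_b and p_b          # pbt <- tb AND pb
    p_bn = (not t_b) and p_b    # pbn <- NOT tb AND pb
    p_c = p_bt                  # block C
    y1 = c
    p_d = p_bn                  # block D
    y2 = d
    p_e = p_c or p_d            # formula (5.1), one remaining operand
    y3 = y1 if p_c else y2      # formula (5.2): y3 <- pc ? y1 : y2
    return p_e, y3
```

For a true block predicate pb, both versions produce the same y3 for either branch direction, while the converted version executes the same instruction sequence in both cases.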
We illustrate the above transformation on a more complex code
fragment, shown in Figure 5.8. This code fragment contains three
branches in basic blocks A, B and D, of which only the branch in
basic block B will be protected.
Figure 5.8: More complex unprotected code fragment.
    A: ya ← a; if(ta) goto G
    B: yb ← b; if(tb) goto G
    C: if(tc) goto D
    D: yd ← d; if(td) goto G
    E: ye ← e
    F: pf ← φ(pdn, pe); yf ← f
    G: pg ← φ(pat, pbt, pdt, pf); yg ← φ(ya, yb, yd, yf)
Figure 5.9: Protected code fragment with undefined behavior.
    B: yb ← b; pbt ← tb ∧ pb; pbn ← ¬tb ∧ pb
    G: pg_tmp ← φ(pat, pdt, pf); pg ← pg_tmp ∨ pbt
       ytmp ← φ(ya, yd, yf); yg ← pbt ? yb : ytmp
    Blocks A, C, D, E and F are unchanged. Annotation in G: pbt is
    not always defined!
Figure 5.10: Correctly protected code fragment.
    B: yb ← b; pbt ← tb ∧ pb; pbn ← ¬tb ∧ pb
    G1: pg1_tmp ← φ(pdt, pf); pg1 ← pg1_tmp ∨ pbt
        yg1_tmp ← φ(yd, yf); yg1 ← pbt ? yb : yg1_tmp
    G2: pg ← φ(pat, pg1); yg ← φ(ya, yg1)
    Blocks A, C, D, E and F are unchanged.
Figure 5.9 shows the code, naively transformed with the same
transformations as in the previous example. First, the branch
if(tb) goto G is removed from basic block B and the predicates pbt
and pbn of the outgoing edges are calculated as before. Secondly, the
φ-nodes in block G, where control flow converges again, are rewritten
according to formulas (5.1) and (5.2).
The code shown in Figure 5.9, however, is not semantically equivalent
to the unprotected code in Figure 5.8: the value of pbt in the
conditional move instruction yg ← pbt ? yb : ytmp in basic block G is
not always defined. When the jump in basic block A is taken, we
arrive in basic block G without having calculated the value of pbt,
which happens in basic block B. The conditional move instruction
yg ← pbt ? yb : ytmp will therefore not necessarily select the
correct value for yg.
To solve this problem and make sure our transformations work as
intended, an additional transformation is required. It ensures that
the condition of the conditional move is always defined before the
conditional move is executed, regardless of the path taken in the
CFG. In other words, the basic block containing the generated
conditional move needs to be dominated by the basic block defining
the condition operand. Splitting basic block G into G1 and G2, as
shown in Figure 5.10, ensures that basic block G1 is dominated by B
and that the value of pbt is defined on every path that reaches G1.
Figure 5.11: Example of correct partial if-conversion.
    a) original CFG: A ends in if(x) goto B; B ends in if(y) goto D
    b) unsafe if-converted code: the branch in A is removed; B ends
       in if(y) goto D
    c) constant time if-converted code: the branch in A is removed;
       B ends in if(x ∧ y) goto D
5.4.5 Unprotected Branches
Partial if-conversion has to be performed with great care, because it
can introduce control flow variation in the unconverted branches
where no such variation was present in the original program.
Figure 5.11a shows a simple example CFG with only two paths executed
during profiling: A → E and A → B → C → E. Assuming that y always
evaluates to false in this particular example, basic block D is never
executed.
When the branch in basic block A is if-converted as shown in
Figure 5.11b, block B is always executed. If block B is predicated
false (that is, when the original program would have skipped B), the
value of y might be undefined or evaluate differently than profiled,
allowing the jump in block B to go in either direction.
To avoid this, the jump condition of unconverted branches needs to be
rewritten as follows. Let original be the original jump condition of
the branch and predicate a boolean value, calculated at run time,
that indicates whether the basic block containing the branch
instruction would have been executed in the original program. The new
jump condition new for unconverted branches that are profiled as
always taken then becomes new = ¬predicate ∨ original. For
never-taken branches, it becomes new = predicate ∧ original. This
ensures that the new jump direction defaults to the profiled branch
direction when the instruction is predicated false, and follows the
original branch condition when it is predicated true. Figure 5.11c
shows the CFG with the condition of the branch in basic block B
rewritten according to the rules above. After if-converting the
branch in basic block A and rewriting the jump condition in basic
block B, the same path A → B → C → E will be taken as long as x and y
do not both evaluate to true at the same time, i.e., as long as they
behave as during the profiling.
Figure 5.12: A monitor instruction is inserted in the path that is
profiled as not taken to trigger recompilation when the profiling
was incorrect.
    B ends in if(x ∧ y) goto D; a monitor instruction on the path to
    D calls adjustBranchProfile(…) in the Security Organizer
To handle a situation in which x and y diverge from their behavior
during profiling, and possibly cause execution time variation,
monitoring instructions are inserted in the path that is profiled as
never executed, as shown in Figure 5.12. Whenever a monitor
instruction is triggered, it calls back to the Security Organizer.
The Security Organizer in turn updates the branch information in the
CSD such that the offending branch is tagged to be if-converted, and
sends a recompilation request for the method in question to the
controller thread.
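The rewrite rule for the jump condition of unconverted branches given in this section can be written down as a small truth function. The sketch below uses a hypothetical function name and parameter names; it only restates the two formulas new = ¬predicate ∨ original and new = predicate ∧ original.

```python
def rewritten_condition(predicate, original, profiled_always_taken):
    """New jump condition of an unconverted branch.

    predicate: True if the enclosing basic block would have executed
               in the original program.
    original:  the original jump condition (arbitrary when predicate
               is False).
    """
    if profiled_always_taken:
        return (not predicate) or original   # defaults to "taken"
    return predicate and original            # defaults to "not taken"
```

For the branch in block B of Figure 5.11c, which is profiled as never taken, the call rewritten_condition(x, y, False) corresponds to the condition x ∧ y shown in the figure.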
5.4.6 Loops With Fixed Iteration Count
Section 4.1 explained how an analysis phase collects information
about the loop upper bounds for each loop in the executed program.
The following transformation uses these upper bounds to fix the
number of iterations of each loop. In short, we add an extra counter
that counts the loop iterations and rewrite the loop branch in such a
way that it can only jump out of the loop after the loop counter has
reached the fixed upper bound.
Figure 5.13: Rewrite loop branches with fixed loop count.
    Original code: B: b ← a
                   E: pe ← φ(pc, pd); …; if(te) goto B
                   F: f ← b
    Transformed code: A: ce ← 0
                      B: btmp ← a; if(pb) b ← btmp
                      E: pe ← φ(pc, pd); …; ce ← ce + 1
                         te′ ← ¬(ce ≥ ce_max) ∨ (pe ∧ te)
                         ofe ← te′ ∧ (¬te)
                         pet ← pa ∧ (¬ofe)
                         pen ← pa
                         if(te′) goto B
Figure 5.13 illustrates how the loop branch in basic block E can be
rewritten to fix the iteration count to ce_max iterations. The white boxes
contain the original code, the gray boxes on the right the transformed
code. In this example, the branch instruction is in the last basic block
of the loop, E, and jumps to the loop header B depending on the value
of the jump condition te. Other loop types, for example where the loop
branch is situated in the loop header and jumps out of the loop, can be
transformed in a similar way.
First, a loop counter ce is inserted in basic block A and initialized to
0.
Secondly, the following code is inserted in the block E containing
the loop branch instruction:
• In each iteration, the loop counter is incremented: ce ← ce + 1
• The new jump condition te′ is calculated. The jump should always
be taken as long as the loop upper bound has not been reached:
¬(ce ≥ ce_max). Once the upper bound has been reached, the jump is
only taken when both the basic block predicate and the original jump
condition evaluate to true: (pe ∧ te). This makes sure the program
still executes correctly when the loop upper bound was set too low.
The new jump condition can thus be written as:
te′ ← ¬(ce ≥ ce_max) ∨ (pe ∧ te).
• Next, code is inserted to calculate whether the loop is
“overflowing”. A loop is overflowing when the original jump condition
evaluates to false but the loop upper bound has not been reached,
i.e., the new jump condition evaluates to true. When overflowing, the
additional iterations should not modify the internal program state.
Overflow is calculated as follows: ofe ← te′ ∧ (¬te).
• The not-taken edge gets assigned the predicate of the incoming edge
of the loop header: pen ← pa. Because the loop has only one entry
edge and one exit edge, the loop can be seen as a single “super
block” and we can propagate the value of the predicate of the
incoming edge to the outgoing edge. Loops with multiple entries or
exits are first transformed to single entry, single exit loops, as
shown in Figure 5.14. This is done by splitting the header block C
into two basic blocks. The first part, C1, contains the φ-nodes that
calculate the basic block predicate and the φ-nodes that calculate
values based on the incoming edges. The second part, C2, now serves
as the new loop header, with only one loop entry from basic block C1.
Next, the exit edge in the last basic block of the loop body is
chosen as the “main loop exit”. Every other edge that jumps out of
the loop body is then tagged to be secured as shown in Section 5.4.4.
In this example, the main exit branch is the branch in basic block F.
The exit branches in basic blocks D and E are then protected. The end
result is a single entry, single exit loop.
• The predicate of the taken edge should evaluate to true only if the
predicate pa of the entry edge of the loop evaluates to true and the
loop is not overflowing: pet ← pa ∧ (¬ofe). This causes all
predicates of basic blocks in the loop to evaluate to false when
overflowing.
• The loop branch is rewritten to use the new jump condition.
Figure 5.14: Transforming a loop to a single entry, single exit loop.
    Original code: C: pc ← φ(pa, pb, pft)
                   D: if(td) goto G
                   E: if(te) goto G
                   G: pg ← φ(pdt, pet, pfn)
    Transformed code: C1: pc1 ← φ(pa, pb)
                      C2: pc2 ← φ(pc1, pft)
                      D: pdt ← td ∧ pd; pdn ← ¬td ∧ pd
                      E: pet ← te ∧ pe; pen ← ¬te ∧ pe
                      G: pg ← pdt ∨ pet ∨ pfn
Finally, some additional precautions need to be taken such that the
internal state of the application is not modified when the loop
overflows. Each variable defined inside the loop that has a use
outside of the loop must not be modified while the loop is
overflowing. In our example, the definition b ← a in basic block B
has a use in basic block F: f ← b. By rewriting the definition as
follows, the value of b will only be updated if its predicate pb
evaluates to true:
btmp ← a
if(pb) b ← btmp
The above transformation is only required for definitions in loop
bodies that are transformed to have a fixed loop count. Section 5.4.11
discusses how other individual instructions such as memory accesses
or potentially exception-throwing instructions are handled.
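The fixed-iteration rewrite of this section can be modelled on a small do-while loop that sums the first n elements of a list. The sketch below is a hedged illustration, not the compiler's output: the function name, the example loop and the guarded index idx (a safeguard in the spirit of Section 5.4.11) are assumptions, values must be non-empty, and n must be at least 1. Python's conditional expressions stand in for predicated instructions and carry no timing guarantees.

```python
def sum_first_n_fixed(values, n, c_max, p_a=True):
    """Sum values[0:n] with a loop padded to at least c_max iterations.
    Returns the sum and the number of iterations executed."""
    s = 0
    i = 0
    c = 0                               # ce <- 0, inserted before the loop
    p_b = p_a                           # body predicate on first entry
    while True:
        idx = i if p_b else 0           # dummy index when predicated false
        s_tmp = s + values[idx]
        s = s_tmp if p_b else s         # if(pb) s <- s_tmp
        i_tmp = i + 1
        i = i_tmp if p_b else i         # if(pb) i <- i_tmp
        p_e = p_b
        t_e = n > i                     # original jump condition te
        c += 1                          # ce <- ce + 1
        t_e_new = (not (c >= c_max)) or (p_e and t_e)
        of_e = t_e_new and (not t_e)    # overflowing iteration ahead
        p_et = p_a and (not of_e)       # predicate of the taken edge
        if not t_e_new:                 # if(te') goto B, else leave loop
            break
        p_b = p_et
    return s, c
```

With values = [1, 2, 3, 4, 5], n = 3 and c_max = 5, the loop runs five iterations but only the first three update s. If the bound is too low (n = 4, c_max = 2), the term pe ∧ te keeps the loop running until the original condition fails, so the result stays correct.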
5.4.7 Loops Without Fixed Iteration Count
In some cases, the iteration count of a loop does not have to be
fixed, either because the loop does not exhibit variable iteration
counts, or because the iteration count does not depend on a secret
value. In either case, transforming the loop to have a fixed
iteration count would only introduce unnecessary overhead. These
loops still have to be transformed to correctly propagate the
predicates and to prevent infinite loops. Figure 5.15 illustrates how
the loop branch in basic block E can be rewritten without fixing the
loop count. The white boxes contain the original program code; the
gray box contains the code for the rewritten jump.
Figure 5.15: Rewrite loop branches without fixing loop count.
    Original code: A: pa ← pmethod
                   B: pb ← φ(pa, pet)
                   C: pc ← pbt
                   D: pd ← pbn
                   E: pe ← φ(pc, pd); …; if(te) goto B
                   F: pf ← pen
    Transformed code: E: pe ← φ(pc, pd); …
                         pet ← te ∧ pe
                         pen ← ¬te ∧ pe
                         te′ ← te ∧ pe
                         if(te′) goto B
The branch is transformed in three steps:
• First, code is inserted to calculate the edge predicates of the
outgoing edges of basic block E and to ensure correct propagation of
the predicate pe of basic block E. The predicate of the taken edge is
calculated as follows: pet ← te ∧ pe. The predicate of the not-taken
edge pen is calculated similarly: pen ← ¬te ∧ pe.
• Secondly, code is inserted to calculate the new jump condition te′.
This is necessary because, when the predicate pe of basic block E
evaluates to false, the original jump condition te contains an
arbitrary value and can cause the code to loop indefinitely, stalling
the application. The new jump condition is calculated as follows:
te′ ← te ∧ pe. This makes sure the jump is never taken when the basic
block predicate evaluates to false.
• Finally, the original branch instruction is rewritten to use the new
jump condition te′.
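The need for the new jump condition can be seen on a small countdown loop. In the following sketch (hypothetical function name and example loop, with Python conditional expressions standing in for predicated instructions), a false predicate freezes n, so the original condition n > 0 would stay true forever; and-ing it with pe makes the loop terminate after a single pass.

```python
def countdown_predicated(n, p_method):
    """Model of: do { n <- n - 1 } while (n > 0), executed under a
    predicate. Returns the final n, the iteration count and pen."""
    p_b = p_method
    iterations = 0
    while True:
        iterations += 1
        n_tmp = n - 1
        n = n_tmp if p_b else n     # no state change when predicated false
        p_e = p_b
        t_e = n > 0                 # original jump condition te
        p_et = t_e and p_e          # pet <- te AND pe
        p_en = (not t_e) and p_e    # pen <- NOT te AND pe
        t_e_new = t_e and p_e       # te' <- te AND pe
        if not t_e_new:             # if(te') goto B, else leave loop
            break
        p_b = p_et
    return n, iterations, p_en
```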
Figure 5.16: Transforming a protected call instruction.
    Original code: D: …; d ← x.method(); …
    Transformed code: A: dummy ← new X()
                      D: …; r ← (recursive) ? r + 1 : 0
                         x ← pd ? x : dummy
                         c ← x.method_secure(pd, r); …
5.4.8 Protected Calls
As discussed in Section 3.1, protected call instructions are not
executed conditionally. Instead, an additional parameter pmethod is
added to the call, representing the predicate of the call
instruction. The code of the called method is then executed
conditionally based on that predicate. A second additional parameter
r is added to indicate the recursion depth.
Figure 5.16 illustrates how instance methods are protected. The white
box contains the original call instruction and the gray boxes contain
the code for the rewritten call.
First, the recursion depth is calculated:
r ← (recursive) ? r + 1 : 0. For recursive calls, the recursion depth
parameter r of the method is increased by 1; for non-recursive calls,
r is set to 0. Secondly, the object x on which the method is called
is replaced by a dummy object if the predicate pd of the basic block
containing the call evaluates to false: x ← pd ? x : dummy. This
additional safeguard is required because the original object x can be
a null reference when its predicate evaluates to false, which would
cause the application to crash. The dummy object is allocated at the
start of the first basic block A of the method. Finally, we transform
the call to point to the secured version of the method and add the
predicate pd and the recursion depth r as additional parameters.
Figure 5.17: Transforming an unprotected call instruction.
    D1: …; if(pd) goto D2
    D2: dtmp ← x.method()
    D3: d ← φ(ddefault, dtmp); …
Static method calls are transformed in a similar way, with the
exception that no dummy object is allocated.
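The protected call can be sketched in Python as follows. The class Account, its method deposit_secure and the function caller are hypothetical names chosen for illustration; the sketch assumes a non-recursive call (r = 0) and uses None to play the role of a null reference.

```python
class Account:
    """Hypothetical receiver class standing in for X in Figure 5.16."""

    def __init__(self, balance=0):
        self.balance = balance

    def deposit_secure(self, p_method, r, amount):
        # Secured variant: always executes, but only changes state
        # when its predicate parameter is true.
        new_balance = self.balance + amount
        self.balance = new_balance if p_method else self.balance
        return self.balance


def caller(account, p_d, amount):
    dummy = Account()                    # dummy <- new X(), at method entry
    r = 0                                # non-recursive call: r <- 0
    target = account if p_d else dummy   # x <- pd ? x : dummy
    return target.deposit_secure(p_d, r, amount)
```

Because the receiver is swapped for the dummy object when pd is false, caller(None, False, 5) does not fail even though the original receiver is a null reference.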
5.4.9 Unprotected Calls
Calls can be left unprotected if the runtime profile information indi-
cates that they do not introduce any execution time variation. In that
case we execute the call conditionally based on the predicate of the ba-
sic block containing the call instruction. More concretely, we insert an
additional jump instruction that jumps over the unprotected call based
on the predicate of the basic block.
Figure 5.17 shows how the call in Figure 5.16 is transformed when
left unprotected.
We start by splitting the original basic block D into three separate
blocks D1, D2 and D3. Basic block D1 contains all the code from the
original block D preceding the call instruction, block D2 contains
the call itself, and block D3 contains all code from block D
following the call instruction.
In a second step, a jump instruction if(pd) goto D2 is inserted in
block D1 that jumps to the block D2 containing the call instruction
based on the predicate pd of the original block. This ensures that
the call is only executed when the call instruction is predicated
true, and would have
been executed in the original unprotected application.
Next, if the call has a return value, the call is transformed to
write to a temporary variable. In the example, the call is
transformed to: dtmp ← x.method(). If the call has no return value,
this step and the following step can be skipped.
Finally, the correct result value is selected by inserting a φ-node
d ← φ(ddefault, dtmp) in basic block D3. This φ-node selects the
result dtmp when control came from basic block D2, or a default value
ddefault for the variable type when control came from basic block D1
and the call was not executed. The exact value of ddefault does not
matter, as in that case basic block D was predicated false anyway.
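A minimal Python model of the unprotected call makes the block split explicit. The function name and the default value 0 are assumptions for this sketch; method is any callable standing in for x.method().

```python
def call_unprotected(p_d, method, d_default=0):
    """Blocks D1, D2 and D3 of Figure 5.17 as straight Python code."""
    if p_d:                 # D1: if(pd) goto D2
        d_tmp = method()    # D2: dtmp <- x.method()
        d = d_tmp           # D3: phi picks dtmp (control came from D2)
    else:
        d = d_default       # D3: phi picks ddefault (control came from D1)
    return d
```

Unlike the protected variant, the callee does not run at all when the predicate is false, which is only acceptable when the profile indicates that the call introduces no execution time variation.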
We currently do not support protecting explicit exception-throwing
instructions, such as AThrow bytecode instructions. They are
therefore treated as unprotected call instructions and are jumped
over based on the predicate of their basic block. The code inside
the exception handler blocks, however, can be protected.
5.4.10 Recursion
Direct recursion is handled by adding code in the first basic block
of the method that causes the method to return immediately when a
certain recursion depth, determined by the initial profiling phase,
is reached. Whether the maximum recursion depth has been reached is
determined by checking if the additional method parameter r, which
contains the current recursion depth, exceeds the maximum recursion
depth rmax determined during the initial profiling phase.
Indirect recursion is currently not supported. Whenever a possible
indirect recursive call is detected, that call is transformed as an
unprotected call, as protecting the call would cause an infinite
loop.
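The handling of direct recursion can be sketched on a recursive factorial. This is an illustrative model under stated assumptions: the function name and the optional calls list (used only to count invocations) are hypothetical, and r_max must be at least the recursion depth observed during profiling for the result to be correct.

```python
def factorial_secure(p_method, r, n, r_max, calls=None):
    """Predicated factorial whose recursive call is always executed."""
    if calls is not None:
        calls.append(r)                 # bookkeeping for the sketch only
    if r > r_max:                       # cut-off in the first basic block
        return 1
    p_rec = p_method and n > 1          # predicate of the recursive call
    sub = factorial_secure(p_rec, r + 1, n - 1, r_max, calls)
    return n * sub if p_rec else 1      # select instead of branching
```

Because the recursive call is no longer guarded by the original branch, only the depth check stops the recursion, so the number of invocations depends on r_max and not on n.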
5.4.11 Safeguard Individual Instructions
Coppens et al. already mentioned that certain instructions need to be
safeguarded when if-converting code. Memory instructions, for
example, are safeguarded by executing them on dummy memory locations
when predicated false. This prevents them from reading from or
writing to invalid addresses and from updating the program state.
Division instructions are safeguarded by making sure they never
divide by 0 when predicated false. In addition to these instructions,
we made sure that additional checks inserted in the code, such as
null-checks, bounds-checks, zero-checks and array-length-checks, do
not cause any unwanted behavior or throw exceptions. As their
implementation is relatively straightforward, we do not discuss them
in detail.
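The two safeguards described for memory and division instructions can be illustrated as follows. The function names, the one-element dummy list and the substitute divisor 1 are assumptions made for this sketch; the text only states that a dummy location is used and that division by 0 is avoided.

```python
def guarded_store(memory, dummy, index, value, p):
    """Store to memory[index] when p is true, else to a dummy slot."""
    target = memory if p else dummy   # redirect to the dummy location
    slot = index if p else 0          # index that is valid for the dummy
    target[slot] = value


def guarded_div(a, b, p):
    """Integer division that never divides by 0 when predicated false."""
    safe_b = b if p else 1            # assumption: 1 as harmless divisor
    q = a // safe_b
    return q if p else 0              # result is irrelevant when p is false
```

A store with an out-of-range index, such as guarded_store(mem, dummy, 99, 5, False), leaves mem untouched and raises no exception, because the write lands in the dummy list.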
5.4.12 Transformations in lower level IR
In the lower level intermediate representation, variable latency
instructions are replaced by calls to their constant time
implementations. Examples of instructions that are replaced are
integer division instructions, long division instructions, and the
subroutines that calculate long multiplication and long shift
operations. During machine code generation, NOP instructions are
inserted between store and load instructions in protected methods,
according to the compilation strategy.
5.5 Limitations
Our mitigation techniques, as implemented today, come with some im-
portant limitations.
Granularity of applied transformations: In our current
implementation, the applied data flow transformations are the same
for all methods, and they are applied to their whole code bodies. In
the future, we plan to refine this so that different strategies can
be applied to different methods and even to parts of methods, and so
that multiple secured versions of a method can co-exist and be
invoked depending on the call chain in which they are executed.
Optimal compilation strategy: More research is required to automate
the process of finding the optimal compilation point for a given applica-
tion and target architecture based on code characteristics, without hav-
ing to rely on predefined compilation strategies.
Exceptions: We assume exceptions are a rare occurrence and are not
part of the time-critical sensitive code. Code within an exception han-
dling region can of course be secured, but the throwing of an exception
is not if-converted.
Native Code and System Calls: Our setup currently only supports
protecting bytecode against timing attacks. System calls and native
code can still cause the application to leak timing information. In
the case of native code, a solution could be to provide both a normal
and a secure variant of the code with the application. These variants
can be generated automatically with static techniques [34, 68, 86].
Multithreading: For the time being, we only protect single-threaded
code regions, i.e., code that executes within one thread. That thread
can, however, be part of a multithreaded application.
Virtual method calls: Our current implementation assumes that the
target methods of virtual calls remain constant during the execution
of the application. In other words, the run-time type of the object
on which the method is invoked has to remain the same. This
restriction does not pose problems for the crypto libraries we
experimented with. Should the need arise in the future, our
implementation can be extended based on class hierarchy analysis and
points-to analysis, and by converting polymorphic virtual calls into
switch-like constructs.
Chapter 6
Conclusions & Future Work
6.1 Conclusions
In this dissertation, we investigated how compiler transformations can
be used to defend against timing side channel attacks. We researched both
static and dynamic compilation techniques.
We discussed several ways in which variable-latency instructions
and interactions between memory operations on modern x86 proces-
sors can leak timing information that is useful for time-based side channel
attacks on cryptographic software. We showed that their exact timing
behavior can differ significantly between different micro-architectures.
We discussed several potential code transformations in static compil-
ers to mitigate these side channels. We extended existing compiler mit-
igation techniques that remove key-dependent control flow based on
if-conversion by adding additional support to remove key-dependent
data flow. We did this by implementing transformations that either
add variable-latency compensation code or force invariable latencies.
Furthermore we showed that inserting NOP instructions between store
and load instructions can remove timing variation caused by pipeline
interactions.
Some of those transformations provide strong protection, albeit at
the cost of significant overhead, in particular when they have to
protect the software on a range of micro-architectures. Other
transformations can provide weak or, in some cases, stronger but
non-portable protection. The effectiveness of a specific
transformation is unpredictable and highly sensitive to the precise
code to be protected and to the micro-architectural details of the
processor architecture.
We evaluated our approach on modular exponentiation code and showed
that we can reduce the protection overhead from 20x to below 6x using
our data flow mitigation techniques. We statistically proved
indistinguishability of execution times of the protected application.
We showed that protection techniques in static compilers are
hampered, however, by their static nature and their dependence on
details of the processor targeted during the compilation. We
therefore presented a dynamic profile-based approach for mitigating
side channel attacks. In this approach, a JIT compiler generates code
for sensitive code regions such that the execution time becomes
completely or largely independent of the sensitive data values used
in the code. This approach supports adaptive protection with regard
to changes in the underlying hardware and changes in the security
requirements, without requiring code duplication or specialization
before the code is distributed to users. We presented a profile-based
approach with which we significantly reduce the overhead of full
protection, and that allows the user to make a trade-off between the
provided level of protection and the incurred overhead. Moreover,
based on these run-time profiles, we were able to protect additional
code structures such as non-manifest loops and recursive calls.
We implemented our transformation techniques in the JikesRVM JIT
compiler and evaluated them on real-life use cases. Using our
profile-based approach we were able to protect an RSA encryption
algorithm at an overhead of 8.4x and 16.5x on Core™ 2 and Core™ i7
systems, respectively. Compared to existing state-of-the-art techniques,
our approach reduced the overhead of protecting the modular
exponentiation algorithm from 66.3x to 16.2x on the Core™ i7 and from
62.7x to 11.2x on the Core™ 2. Our approach was also able to pinpoint a
single method causing timing information leakage in an HMAC verification
routine and secure it automatically at a minimal overhead of 5.7%.
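A common source of such leakage in MAC verification is a byte-wise
comparison that returns at the first mismatch. Whether the method
identified in our experiments has exactly this shape is an assumption
here; the C sketch below only contrasts a leaky comparison with a
fixed-iteration one, using the hypothetical names mac_equal_leaky and
mac_equal_ct. It is a hand-written equivalent of what removing the
data-dependent branch achieves, not output of the JIT compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Leaky variant: the loop exits at the first mismatching byte, so the
   execution time reveals how many leading bytes were correct. */
static int mac_equal_leaky(const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}

/* Fixed-iteration variant: always inspects all n bytes and accumulates
   the differences, so the trip count does not depend on the data. */
static int mac_equal_ct(const uint8_t *a, const uint8_t *b, size_t n) {
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++) {
        diff |= (uint8_t)(a[i] ^ b[i]);
    }
    /* diff == 0 wraps to 0xFFFFFFFF, whose bit 8 is set; any nonzero
       diff (1..255) yields a value below 256, whose bit 8 is clear. */
    return (int)((((uint32_t)diff - 1u) >> 8) & 1u);
}
```

Both functions return 1 for equal buffers and 0 otherwise; only the
second one performs the same number of loop iterations regardless of
where the first mismatch occurs.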
We conclude that our dynamic-compiler-based approach can successfully
protect against timing side channel attacks but, even with significant
improvements over the previous state of the art, still comes at a
considerable overhead that depends on the protected application. On the
other hand, our approach eliminates the source of timing variation
instead of masking information leakage by introducing noise, and it does
not require a modified operating system, hypervisor or specialized
hardware to protect applications.
Additionally, by centralizing the mitigation techniques in the
compiler, we separate (side channel) security requirements from
implementation details of the software. Instead of obligating the
programmer to apply security measures in algorithms, source code or
programming tools, protection techniques are implemented once at the
compiler level by an expert programmer with thorough knowledge of side
channel security. This increases both programmer productivity and
application security in the context of side channels.
6.2 Future Work
From the results presented in this thesis it is clear that a dynamic
approach to timing side channel mitigation in compilers is far more
promising than its static counterparts. In the future we therefore want
to continue improving these dynamic techniques. Some of the possible
research directions are discussed below.
Extending the JIT compilation framework
As discussed in Section 5.5, our prototype implementation currently has
several limitations. Some of those limitations make interesting topics
for further research.
Granularity of applied transformations: In our current implementation,
the applied data flow transformations are the same for all methods, and
they are applied to whole method bodies. In the future, we plan to refine
this so that different strategies can be applied to different methods and
even to parts of methods, and so that multiple secured versions of a
method can co-exist and be invoked depending on the call chain in which
the method is executed.
Optimal compilation strategy: More research is required to automate
the process of finding the optimal compilation point for a given applica-
tion and target architecture based on code characteristics, without hav-
ing to rely on predefined compilation strategies.
Dynamic Resource Allocation
Another interesting topic is to research the potential of integrating
dynamic resource allocation with dynamic code generation.
Side-channel-aware code generators will limit the amount of information
that code leaks into side channels where necessary, while
side-channel-aware resource allocators, implemented in operating systems
or hypervisors for example, will close and block side channels by
adapting the resource allocation and scheduling of potentially attacking
processes. By integrating the code generation and resource allocation, we
hope to achieve a more efficient and more effective co-optimization of
performance and security.
Security Contracts
Developers should be able to specify their security needs in a contract,
completely independent of the domain-specific development tools. This
specification will be in terms of security requirements only and not in
terms of properties of the underlying hardware. Translation between
the different domains, i.e. from the security domain to the hardware
domain, will be the responsibility of the VM. This way applications and
contracts are portable and can be used easily. The main challenge (and
innovation) here is to define a practical, usable and sufficiently abstract
contract specification and in particular the automatic translation from
abstract contracts to security mechanisms on the software-hardware in-
terface.
Other Side Channels
The techniques presented in this thesis focus primarily on protecting
applications against timing side channel attacks. An interesting research
direction would be to measure how well our transformations hold up
against other types of attacks such as power attacks. Furthermore, we
could combine our own transformations with existing code transfor-
mations protecting against power attacks, such as the ones presented
by Bayrak et al. [22], increasing the applicability of our framework.
List of Tables
3.1 Execution times and statistical information on the successful mitigation of variable-latency divisions in modular exponentiation.
4.1 Selected code regions based on call edge profiles.
4.2 Test results for different compilation strategies.
4.3 Memory and garbage collection statistics for different application versions running on the Intel Core™ 2 Duo.
List of Figures
1.1 Overview of side channel attacks.
1.2 Side channel mitigation techniques can be applied at different levels.
1.3 Conditional execution time distributions of a cryptographic algorithm given two possible secret keys.
2.1 The different latency classes of the 32-bit unsigned integer division instruction.
2.2 Execution time differences of a microbenchmark loop with load and store instructions executed for varying displacements and alignments.
3.1 Source code equivalent of if-conversion on architectures that only support conditional moves.
3.2 If-conversion of conditional function calls.
3.3 A simple DDG in which visual height models instruction latency and execution time.
3.4 An extended DDG that includes all latency variations of a variable-latency instruction.
3.5 Variable-latency compensation code experimental results.
3.6 C code equivalent of variable-latency division in parallel with fixed-latency multiplications.
3.7 C code equivalent of unsigned division computation using only invariable-latency divisions.
3.8 Execution time of a loop consisting of a pair of store/load instructions for an increasing number of no-ops inserted between them.
3.9 Phased behavior of execution time over 2 days.
4.1 First three iterations of call-edge-based code selection.
4.2 Example of complete and selective if-conversion.
4.3 Compilation strategy security/performance trade-off.
4.4 Intel Core™ 2 Duo security/performance trade-off.
4.5 Intel Core™ i7 security/performance trade-off.
4.6 Core™ 2 JIT compilation times.
4.7 Slowdowns for modPow with different protection techniques.
4.8 Average HMAC execution time on the Core™ 2.
4.9 Anderson-Darling test results on the Core™ 2.
5.1 Jikes Adaptive Optimization System.
5.2 Jikes Adaptive Optimization System with side channel security support.
5.3 The JikesRVM object model.
5.4 Code fragment illustrating the transformation to Static Single Assignment.
5.5 Calculation of basic block predicates.
5.6 Transforming a single protected branch using if-conversion.
5.7 Transforming a single protected branch using if-conversion: Linearized code.
5.8 More complex unprotected code fragment.
5.9 Protected code fragment with undefined behavior.
5.10 Correctly protected code fragment.
5.11 Example of correct partial if-conversion.
5.12 A monitor instruction is inserted in the path that is profiled as not taken to trigger recompilation when the profiling was incorrect.
5.13 Rewrite loop branches with fixed loop count.
5.14 Transforming a loop to a single entry, single exit loop.
5.15 Rewrite loop branches without fixing loop count.
5.16 Transforming a protected call instruction.
5.17 Transforming an unprotected call instruction.
List of abbreviations
AES advanced encryption standard
AOS adaptive optimization system
CFG control flow graph
CSD compilation strategy database
DDG data dependency graph
DSA digital signature algorithm
EM electromagnetism / electromagnetic
HMAC keyed-hash message authentication code
HT hyperthreading
IDS intrusion detection system
ILP instruction-level parallelism
IPC instructions per cycle
JIT just-in-time
JTOC JikesRVM table of contents
NOP no operation
PEI potential exception throwing instruction
RSA Rivest-Shamir-Adleman public key algorithm
SMT simultaneous multithreading
SSA static single assignment
SSE Streaming SIMD Extensions
TIB type information block
VM virtual machine
Bibliography
[1] Jikes RVM User Guide. http://www.jikesrvm.org/UserGuide/,
2015. [Online; accessed 30-August-2015].
[2] O. Aciiçmez and Ç. Koç. Trace-driven cache attacks on AES. In In-
formation and Communications Security, volume 4307 of Lecture Notes
in Computer Science, pages 112–121. 2006.
[3] O. Aciiçmez, Ç. Koç, and J.-P. Seifert. Predicting secret keys via
branch prediction. In Topics in Cryptology - The Cryptographers’ Track
at the RSA Conference (CT-RSA’07), pages 225–242, 2007.
[4] O. Aciiçmez, Ç. Koç, and Jean-Pierre Seifert. On the power of sim-
ple branch prediction analysis. In Proceedings of the 2nd ACM Sym-
posium on Information, Computer and Communications Security (ASI-
ACCS’07), pages 312–320, 2007.
[5] Onur Aciiçmez. Yet another microarchitectural attack: exploiting
I-Cache. In Proceedings of the 2007 ACM workshop on Computer security
architecture (CSAW’07), pages 11–18, 2007.
[6] Onur Aciiçmez, Werner Schindler, and Ç. Koç. Cache based remote
timing attack on the AES. In Topics in Cryptology - The Cryptographers
Track at the RSA Conference (CT-RSA’07), pages 271–286, 2007.
[7] Onur Aciiçmez, Billy Bob Brumley, and Philipp Grabher. New re-
sults on instruction cache attacks. In Proceedings of the 12th Inter-
national Conference on Cryptographic Hardware and Embedded Systems
(CHES’10), pages 110–124, 2010. ISBN 3-642-15030-6, 978-3-642-
15030-2.
[8] Giovanni Agosta, Luca Breveglieri, Gerardo Pelosi, and Israel Koren.
Countermeasures against branch target buffer attacks. In FDTC ’07:
Proceedings of the Workshop on Fault Diagnosis and Tolerance in
Cryptography, pages 75–79, 2007. ISBN 0-7695-2982-8. doi:
http://dx.doi.org/10.1109/FDTC.2007.5.
[9] Dakshi Agrawal, Bruce Archambeault, Josyula R Rao, and Pankaj
Rohatgi. The EM side-channel(s). In Cryptographic Hardware and
Embedded Systems-CHES 2002, pages 29–45. Springer, 2003.
[10] J. R. Allen, Ken Kennedy, Carrie Porterfield, and Joe Warren. Con-
version of control dependence to data dependence. In POPL ’83:
Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Prin-
ciples of programming languages, 1983.
[11] Bowen Alpern, C Richard Attanasio, John J Barton, Michael G
Burke, Perry Cheng, J-D Choi, Anthony Cocchi, Stephen J Fink,
David Grove, Michael Hind, et al. The Jalapeño virtual machine.
IBM Systems Journal, 39(1):211–238, 2000.
[12] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala,
Sorin Lerner, and Hovav Shacham. On subnormal floating point
and abnormal timing. In Security and Privacy (SP), 2015 IEEE Sym-
posium on, pages 623–639, May 2015. doi: 10.1109/SP.2015.44.
[13] Gorka Irazoqui Apecechea, Mehmet Sinan Inci, Thomas Eisenbarth,
and Berk Sunar. Fine grain cross-VM attacks on Xen and
VMware are possible! IACR Cryptology ePrint Archive, 2014:248,
2014.
[14] Andrew W Appel. Modern compiler implementation in C. Cambridge
University Press, 1997.
[15] ARM. ARM architecture reference manual ARMv7-A and ARMv7-R
edition, 2010.
[16] ARMv8 Instruction Set Overview. ARM, 2014.
[17] ARM7TDMI Technical Reference Manual (Revision r4p1). ARM Lim-
ited, 2004.
[18] Matthew Arnold, Stephen Fink, David Grove, Michael Hind, and
Peter F Sweeney. Architecture and policy for adaptive optimization
in virtual machines. Technical report, Technical Report 23429, IBM
Research, 2004.
[19] Aslan Askarov, Danfeng Zhang, and Andrew C. Myers. Predictive
black-box mitigation of timing channels. In Proceedings of the 17th ACM
Conference on Computer and Communications Security, CCS ’10, pages
297–307, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0245-6. doi:
10.1145/1866307.1866341. URL
http://doi.acm.org/10.1145/1866307.1866341.
[20] Michael Backes and Boris Köpf. Formally bounding the side-
channel leakage in unknown-message attacks. In Computer
Security-ESORICS 2008, pages 517–532. Springer Berlin Heidel-
berg, 2008.
[21] Michael Backes, Tongbo Chen, Markus Duermuth, Hendrik
Lensch, and Martin Welk. Tempest in a teapot: Compromising
reflections revisited. In Security and Privacy, 2009 30th IEEE Sympo-
sium on, pages 315–327. IEEE, 2009.
[22] A.G. Bayrak, F. Regazzoni, D. Novo, P. Brisk, F.-X. Standaert, and
P. Ienne. Automatic application of power analysis countermea-
sures. Computers, IEEE Transactions on, 64(2):329–341, Feb 2015.
ISSN 0018-9340. doi: 10.1109/TC.2013.219.
[23] Daniel J. Bernstein. Cache-timing attacks on AES. Technical report,
The University of Illinois at Chicago, 2005.
[24] Joseph Bonneau and Ilya Mironov. Cache-collision timing attacks
against AES. In Proceedings of the International Workshop on Crypto-
graphic Hardware and Embedded Systems (CHES’06), pages 201–215,
2006.
[25] Benjamin A. Braun, Suman Jana, and Dan Boneh. Robust and
efficient elimination of cache and timing side channels. CoRR,
abs/1506.00189, 2015.
[26] Ernie Brickell, Gary Graunke, Michael Neve, and Jean-Pierre
Seifert. Software mitigations to hedge AES against cache-based
software side channel vulnerabilities. Cryptology ePrint Archive,
Report 2006/052, 2006.
[27] Billy Bob Brumley and Risto M. Hakala. Cache-timing template
attacks. In Proceedings of the 15th International Conference on the
Theory and Application of Cryptology and Information Security: Advances
in Cryptology (ASIACRYPT ’09), pages 667–684, 2009. ISBN
978-3-642-10365-0. doi: http://dx.doi.org/10.1007/978-3-642-10366-7_39.
[28] Billy Bob Brumley and Nicola Tuveri. Remote timing attacks are
still practical. Cryptology ePrint Archive, Report 2011/232, 2011.
[29] David Brumley and Dan Boneh. Remote timing attacks are prac-
tical. Computer Networks, 48(5):701–716, August 2005. ISSN 1389-
1286. doi: 10.1016/j.comnet.2005.01.010.
[30] Liang Cai and Hao Chen. Touchlogger: Inferring keystrokes on
touch screen from smartphone motion. In Proceedings of the 6th
USENIX Conference on Hot Topics in Security, HotSec’11, pages 9–9,
Berkeley, CA, USA, 2011. USENIX Association. URL
http://dl.acm.org/citation.cfm?id=2028040.2028049.
[31] David Chaum. Blind signatures for untraceable payments. In Ad-
vances in cryptology, pages 199–203. Springer, 1983.
[32] Shuo Chen, Rui Wang, XiaoFeng Wang, and Kehuan Zhang. Side-
channel leaks in web applications: A reality today, a challenge to-
morrow. In Proceedings of the 2010 IEEE Symposium on Security and
Privacy, SP ’10, pages 191–206, Washington, DC, USA, 2010. IEEE
Computer Society. ISBN 978-0-7695-4035-1. doi: 10.1109/SP.2010.20.
URL http://dx.doi.org/10.1109/SP.2010.20.
[33] J. Coke, H. Balig, N. Cooray, E. Gamsaragan, P. Smith, K. Yoon,
J. Abel, and A. Valles. Improvements in the Intel Core 2 processor
family architecture and microarchitecture. Intel Technology Journal,
12(03):179–192, 2008.
[34] Bart Coppens, Ingrid Verbauwhede, Koen De Bosschere, and Bjorn
De Sutter. Practical mitigations for timing-based side-channel at-
tacks on modern x86 processors. In Proceedings of the 30th IEEE
Symposium on Security and Privacy (S&P’09), pages 45–60, 2009.
ISBN 978-0-7695-3633-0. doi: 10.1109/SP.2009.19.
[35] Stephen Crane, Andrei Homescu, Stefan Brunthaler, Per Larsen,
and Michael Franz. Thwarting cache side-channel attacks through
dynamic software diversity. In Network And Distributed System Se-
curity Symposium, NDSS, volume 15, 2015.
[36] Scott A. Crosby, Dan S. Wallach, and Rudolf H. Riedi. Opportuni-
ties and limits of remote timing attacks. ACM Transactions on In-
formation and System Security, 12(3):17:1–17:29, January 2009. ISSN
1094-9224. doi: http://doi.acm.org/10.1145/1455526.1455530.
[37] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman,
and F. Kenneth Zadeck. Efficiently computing static single assign-
ment form and the control dependence graph. ACM Transactions
on Programming Languages and Systems, 13(4), 1991.
[38] Jean-Francois Dhem, Francois Koeune, Philippe-Alexandre Ler-
oux, Patrick Mestré, Jean-Jacques Quisquater, and Jean-Louis
Willems. A practical implementation of the timing attack. In
Proceedings of the International Conference on Smart Card Research and
Applications (CARDIS’98), pages 167–182, 1998.
[39] Dmitri Makarov. Jikes RVM hacker’s guide.
http://dmakarov.github.io/work/guide/guide.html, 2014. [Online; accessed
30-August-2015].
[40] Jack Doweck. Inside Intel Core microarchitecture and smart mem-
ory access. Technical report, Intel Corporation, 2006.
[41] S. Dziembowski and K. Pietrzak. Leakage-resilient cryptography.
In Foundations of Computer Science, 2008. FOCS ’08. IEEE 49th Annual
IEEE Symposium on, pages 293–302, Oct 2008. doi:
10.1109/FOCS.2008.56.
[42] Isaac Evans, Sam Fingeret, Julián González, Ulziibayar Ot-
gonbaatar, Tiffany Tang, Howard Shrobe, Stelios Sidiroglou-
Douskos, Martin Rinard, and Hamed Okhravi. Missing the
point(er): On the effectiveness of code pointer integrity. In Secu-
rity and Privacy (SP), 2015 IEEE Symposium on, pages 781–796, May
2015. doi: 10.1109/SP.2015.53.
[43] Agner Fog. Instruction tables: Lists of instruction latencies,
throughputs and micro-operation breakdowns for Intel, AMD and
VIA CPUs. Technical report, Copenhagen University of Engineer-
ing, 2011.
[44] Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically
rigorous Java performance evaluation. SIGPLAN Notices, 42(10):
57–76, October 2007. ISSN 0362-1340. doi:
http://doi.acm.org/10.1145/1297105.1297033.
[45] Torbjörn Granlund. Instruction latencies and throughput for AMD
and Intel x86 processors. Technical report, 2011.
[46] Johann Groszschaedl, Elisabeth Oswald, Dan Page, and Michael
Tunstall. Side channel analysis of cryptographic software via early-
terminating multiplications. In Proceedings of the 12th international
conference on Information security and cryptology (ICISC’09), pages
176–192, November 2009.
[47] Jorge Guajardo and Bart Mennink. Towards side-channel resistant
block cipher usage or can we encrypt without side-channel coun-
termeasures. Cryptology ePrint Archive, Report 2010/015, 2010.
[48] Shay Gueron. Advanced encryption standard (AES) instructions
set. Technical report, Intel Mobility Group, 2008.
[49] David Gullasch, Endre Bangerter, and Stephan Krenn. Cache
games - bringing access based cache attacks on AES to practice.
Cryptology ePrint Archive, Report 2010/594, 2010.
[50] Muhammad Hataba, Reem Elkhouly, and Ahmed El-Mahdy. Di-
versified remote code execution using dynamic obfuscation of con-
ditional branches. In Distributed Computing Systems Workshops
(ICDCSW), 2015 IEEE 35th International Conference on, pages 120–
127. IEEE, 2015.
[51] Intel 64 and IA-32 Architectures Software Developer’s Manual. Intel
Corporation, 2014.
[52] Using Intel VTune Performance Analyzer to Optimize Software on Intel
Core i7 Processors. Intel Corporation, 2010.
[53] Intel 64 and IA-32 Architectures Optimization Reference Manual. Intel
Corporation, 2011.
[54] Gorka Irazoqui, Mehmet Sinan Inci, Thomas Eisenbarth, and Berk
Sunar. Wait a minute! A fast, cross-VM attack on AES. In Research in
Attacks, Intrusions and Defenses, pages 299–319. Springer, 2014.
[55] Marc Joye and Sung-Ming Yen. The Montgomery powering ladder. In
Revised Papers from the 4th International Workshop on Cryptographic
Hardware and Embedded Systems (CHES’03), pages 291–302, 2003. ISBN
3-540-00409-2.
[56] Myungsun Kim, Kibeom Kim, James R. Geraci, and Seongsoo
Hong. Utilization-aware load balancing for the energy efficient
operation of the big.LITTLE processor. In Proceedings of the Conference
on Design, Automation & Test in Europe, pages 223:1–223:4, 2014. ISBN
978-3-9815370-2-4.
[57] Paul C. Kocher. Timing attacks on implementations of
Diffie-Hellman, RSA, DSS, and other systems. In Proceedings of the 16th
Annual International Cryptology Conference on Advances in Cryptology
(CRYPTO’96), pages 104–113, 1996. ISBN 3-540-61512-1.
[58] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential power
analysis. In Proceedings of the 19th Annual International Cryptology
Conference on Advances in Cryptology (CRYPTO’99), pages 388–397,
1999. ISBN 3-540-66347-9.
[59] Boris Köpf and David Basin. An information-theoretic model for
adaptive side-channel attacks. In Proceedings of the 14th ACM
Conference on Computer and Communications Security (CCS’07), pages
286–296, 2007. ISBN 978-1-59593-703-2. doi:
http://doi.acm.org/10.1145/1315245.1315282.
[60] Boris Köpf and Markus Dürmuth. A provably secure and effi-
cient countermeasure against timing attacks. In Proceedings of the
2009 22nd IEEE Computer Security Foundations Symposium (CSF’09),
pages 324–335, 2009. ISBN 978-0-7695-3712-2. doi:
10.1109/CSF.2009.21.
[61] Chandra J. Krintz, David Grove, Vivek Sarkar, and Brad Calder.
Reducing the overhead of dynamic compilation. Software: Practice
and Experience, 31(8):717–738, 2001. ISSN 1097-024X. doi:
10.1002/spe.384. URL http://dx.doi.org/10.1002/spe.384.
[62] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework
for Lifelong Program Analysis & Transformation. Technical report,
Univ. of Illinois at Urbana-Champaign, 2003.
[63] Cédric Lauradoux. Collision attacks on processors with cache
and countermeasures. In Western European Workshop on Research
in Cryptology (WEWoRC’05), pages 76–85, 2005.
[64] N. Lawson. Side-channel attacks on cryptographic software.
Security & Privacy, IEEE, 7(6):65–68, Nov 2009. ISSN 1540-7993. doi:
10.1109/MSP.2009.165.
[65] Nate Lawson. Timing attack in Google Keyczar library.
http://rdist.root.org/2009/05/28/timing-attack-in-google-keyczar-library/,
2009. [Online; accessed 20-August-2015].
[66] Arnaud Lefray, Eddy Caron, Jonathan Rouzaud-Cornabas, and
Christian Toinard. Microarchitecture-aware virtual machine
placement under information leakage constraints. In Cloud Computing
(CLOUD), 2015 IEEE 8th International Conference on, pages 588–595,
June 2015. doi: 10.1109/CLOUD.2015.84.
[67] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone.
Handbook of Applied Cryptography. CRC Press, 2001.
[68] D. Molnar, M. Piotrowski, D. Schultz, and D. Wagner. The pro-
gram counter security model: Automatic detection and removal
of control-flow side channel attacks. In Proceedings of the Inter-
national Conference Information Security and Cryptology (ICISC’05),
pages 156–168, 2005.
[69] Todd Mytkowicz, Amer Diwan, and Elizabeth Bradley. Computer
systems are dynamical systems. Chaos (Woodbury, N.Y.), 19(3):
033124, September 2009. ISSN 1089-7682. doi: 10.1063/1.3187791.
[70] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F.
Sweeney. Producing wrong data without doing anything obvi-
ously wrong! In Proceeding of the 14th International conference on
Architectural Support for Programming Languages and Operating Sys-
tems (ASPLOS’09), pages 265–276, 2009. ISBN 978-1-60558-406-5.
doi: http://doi.acm.org/10.1145/1508244.1508275.
[71] Michael Neve and Jean-Pierre Seifert. Advances on access-driven
cache attacks on AES. In Proceedings of the 13th International Confer-
ence on Selected Areas in Cryptography (SAC’06), pages 147–162, 2007.
ISBN 978-3-540-74461-0.
[72] Michael Neve, Jean-Pierre Seifert, and Zhenghong Wang. A refined
look at Bernstein’s AES side-channel analysis. In Proceedings of the
2006 ACM Symposium on Information, Computer and Communications
Security, ASIACCS ’06, pages 369–369, New York, NY, USA, 2006.
ACM. ISBN 1-59593-272-0. doi: 10.1145/1128817.1128887. URL
http://doi.acm.org/10.1145/1128817.1128887.
[73] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and
countermeasures: the case of AES. In Topics in Cryptology - The
Cryptographers Track at the RSA Conference (CT-RSA’06), pages 1–20,
2006.
[74] Colin Percival. Cache missing for fun and profit, 2005.
[75] Francesco Regazzoni, Yi Wang, François-Xavier Standaert, et al.
FPGA implementations of the AES masked against power analysis
attacks. Proceedings of COSADE, 2011:56–66, 2011.
[76] Thomas Ristenpart, Eran Tromer, H. Shacham, and Stefan Savage.
Hey, you, get off of my cloud: exploring information leakage in
third-party compute clouds. In Proceedings of the 16th ACM Confer-
ence on Computer and Communications Security (CCS’09), pages 199–
212, 2009.
[77] Hubert Ritzdorf. Analyzing covert channels on mobile devices. PhD
thesis, ETH Zürich, Department of Computer Science, 2012.
[78] Werner Schindler. A timing attack against RSA with the Chinese
remainder theorem. In Cryptographic Hardware and Embedded
Systems-CHES 2000, pages 109–124. Springer, 2000.
[79] Roman Schlegel, Kehuan Zhang, Xiao-yong Zhou, Mehool Intwala,
Apu Kapadia, and XiaoFeng Wang. Soundcomber: A stealthy and
context-aware sound trojan for smartphones. In NDSS, volume 11,
pages 17–33, 2011.
[80] J. Shen and M. Lipasti. Modern Processor Design: Fundamentals of
Superscalar Processors. McGraw-Hill, 2005.
[81] François-Xavier Standaert, Tal G. Malkin, and Moti Yung. A unified
framework for the analysis of side-channel key recovery attacks. In
Antoine Joux, editor, Advances in Cryptology - EUROCRYPT 2009,
volume 5479 of Lecture Notes in Computer Science, pages 443–461.
Springer Berlin Heidelberg, 2009. ISBN 978-3-642-01000-2. doi:
10.1007/978-3-642-01001-9_26. URL
http://dx.doi.org/10.1007/978-3-642-01001-9_26.
[82] Michael A Stephens. EDF statistics for goodness of fit and some
comparisons. Journal of the American Statistical Association, 69(347):
730–737, 1974.
[83] Kris Tiri and Ingrid Verbauwhede. A logic level design
methodology for a secure DPA resistant ASIC or FPGA implementation. In
Proceedings of the Conference on Design, Automation and Test in
Europe - Volume 1, DATE ’04, pages 10246–, Washington, DC, USA,
2004. IEEE Computer Society. ISBN 0-7695-2085-5. URL
http://dl.acm.org/citation.cfm?id=968878.969036.
[84] Eran Tromer, Dag Arne Osvik, and Adi Shamir. Efficient cache
attacks on AES, and countermeasures. Journal of Cryptology, 23(1):
37–71, 2010.
[85] Leif Uhsadel, Andy Georges, and Ingrid Verbauwhede. Exploiting
hardware performance counters. In 5th Workshop on Fault Diagnosis
and Tolerance in Cryptography (FDTC’08), pages 59–67, August 2008.
[86] Jeroen Van Cleemput, Bart Coppens, and Bjorn De Sutter. Com-
piler mitigations for time attacks on modern x86 processors. ACM
Transactions on Architecture and Code Optimization (TACO), 8(4):23,
2012.
[87] Zhenghong Wang and Ruby B. Lee. Covert and side channels due
to processor architecture. In Proceedings of the 22nd Annual Com-
puter Security Applications Conference (ACSAC’06), pages 473–482,
2006. ISBN 0-7695-2716-7.
[88] Zhenghong Wang and Ruby B. Lee. New cache designs for thwarting
software cache-based side channel attacks. SIGARCH Computer
Architecture News, 35(2):494–505, 2007. ISSN 0163-5964.
[89] Yinqian Zhang and Michael K Reiter. Düppel: retrofitting com-
modity operating systems to mitigate cache side channels in the
cloud. In Proceedings of the 2013 ACM SIGSAC conference on Com-
puter & communications security, pages 827–838. ACM, 2013.
[90] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas
Ristenpart. Cross-VM side channels and their use to extract private keys.
In Proceedings of the 2012 ACM Conference on Computer and Communications
Security, CCS ’12, pages 305–316, New York, NY, USA, 2012.
ACM. ISBN 978-1-4503-1651-4. doi: 10.1145/2382196.2382230.
URL http://doi.acm.org/10.1145/2382196.2382230.
[91] Li Zhuang, Feng Zhou, and J Doug Tygar. Keyboard acoustic em-
anations revisited. ACM Transactions on Information and System Se-
curity (TISSEC), 13(1):3, 2009.


