Purdue University

Purdue e-Pubs
Open Access Dissertations

Theses and Dissertations

Fall 2013

Variation-Derived Chip Security And Accelerated
Simulation Of Variations
William Paul Griffin
Purdue University

Follow this and additional works at: https://docs.lib.purdue.edu/open_access_dissertations
Part of the Electrical and Computer Engineering Commons
Recommended Citation
Griffin, William Paul, "Variation-Derived Chip Security And Accelerated Simulation Of Variations" (2013). Open Access Dissertations.
158.
https://docs.lib.purdue.edu/open_access_dissertations/158

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for
additional information.

Graduate School ETD Form 9
(Revised 12/07)

PURDUE UNIVERSITY

GRADUATE SCHOOL
Thesis/Dissertation Acceptance
This is to certify that the thesis/dissertation prepared
By

William Griffin

Entitled
Variation-derived Chip Security and Accelerated Simulation of Variations

For the degree of

Doctor of Philosophy

Is approved by the final examining committee:
KAUSHIK ROY
Chair

ANAND RAGHUNATHAN
BYUNGHOO JUNG
VIJAY RAGHUNATHAN

To the best of my knowledge and as understood by the student in the Research Integrity and
Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of
Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.

KAUSHIK ROY

Approved by Major Professor(s): ____________________________________
____________________________________
Approved by:

M. R. Melloch

07-24-2013
Head of the Graduate Program

Date

VARIATION-DERIVED CHIP SECURITY AND ACCELERATED SIMULATION
OF VARIATIONS.

A Dissertation
Submitted to the Faculty
of
Purdue University
by
William Paul Griffin

In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy

August 2013
Purdue University
West Lafayette, Indiana


ACKNOWLEDGMENTS
As I am but a reflection of my experiences, I feel it important to reflect and give
acknowledgment to those who have helped me grow in the last six years.
First, I would like to give special thanks to my parents. With their love and
support, they have enabled me to become who I am today. Providing consolation
and encouragement in both my failings and my endeavors, they are an incredible
cornerstone to my life.
Next, I would like to thank those who have enhanced my academic development.
My major professor, Kaushik, has provided invaluable guidance while I have progressed through my research in graduate school, and Anand has been willing to step
in to offer ancillary guidance. I would like to offer my thanks to Vijay and Byunghoo
as well for their service on my committee and their diligence in ensuring I meet
the quality expected of a Ph.D. recipient.
At the same time, I have not simply received academic support while in graduate
school. Kaushik and Rwitti have willingly taken both me and my academic colleagues
in, offering us a family while ours cannot be present. Through interactions with all
of my peers, the stresses experienced in graduate school have been lessened; it has
been beneficial to know that I am not alone in my research pursuits.
My internship with Intel also helped give me a unique perspective on my work,
and as such, it feels only necessary to thank those who aided me there. My manager,
Burzin, and my mentors, Reouven and Nachiketh, gave me valuable hands-on training
and the opportunity to spend time investigating hardware security issues in a practical
manner.
To all mentioned above, and to those whom I have forgotten to mention, thank you.
It is said that it takes a village to raise a child, and you all have wonderfully served
as my village.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
2 SIMPLIFIED MODELING OF DEVICE VARIATIONS
    2.1 Introduction
        2.1.1 Pao-Sah’s Transistor Model
        2.1.2 Parameter variation simulation via Monte Carlo analysis
    2.2 Methods for parameter reduction
        2.2.1 ∆Vgs Approximation
        2.2.2 Reduced Parameter
    2.3 Implementation
        2.3.1 Golden Pao-Sah Model
        2.3.2 ∆Vgs model
        2.3.3 Reduced parameter model
    2.4 Circuit simulation
        2.4.1 Model accuracy
        2.4.2 Model efficiency
    2.5 Conclusion
3 CLIP: CIRCUIT-LEVEL IC PROTECTION THROUGH DIRECT INJECTION OF PROCESS VARIATIONS
    3.1 Background and Related Work
    3.2 Circuit-Level IC Protection
        3.2.1 Logic Modifications for CLIP
        3.2.2 Process Variation Sensors
        3.2.3 Sensor Recovery for IC Unlocking
        3.2.4 Key Preprocessor
    3.3 Experimental Results
        3.3.1 Area Impact
        3.3.2 Functional Impact
        3.3.3 Power Impact
        3.3.4 Key recovery
    3.4 Security Analysis
        3.4.1 External Attacks
        3.4.2 Mask Knowledge and Internal Attacks
    3.5 Summary
4 VOLATILE PROTECTION OF NONVOLATILE CACHES
    4.1 Introduction
    4.2 Background
    4.3 Page-segmented cache encryption
        4.3.1 Single boot keys
        4.3.2 One time pads
        4.3.3 Low-latency cipher
        4.3.4 Page-level keys
        4.3.5 Page key management
    4.4 Effect on security
    4.5 Performance impact
    4.6 Optional enhancements to security
    4.7 Conclusion
5 CONCLUSIONS
LIST OF REFERENCES
VITA


LIST OF TABLES

Table
2.1 Translations from a normal variable (x = µx + σx z) with mean µx, variance σx, and standard normal random variable z ∈ N(0, 1) to another domain y with mean µy
2.2 Best-fit parameter approximations over a 3σ range.
2.3 Simulation runtime for 1000 circuits.
3.1 Comparison against active protection techniques
3.2 Expected values on pruned circuit.
3.3 Area overhead for specific key strengths.
4.1 Observed duration ranges of 80% data preservation after removal of power


LIST OF FIGURES

Figure
1.1 Effect of lowering the threshold voltage of a transistor by 0.1 V, as demonstrated by (a) Ids − V and (b) log Ids − V graphs. While Ion increases linearly with respect to the threshold voltage, thereby offering a faster transistor, Ioff increases exponentially.

2.1 Monte Carlo framework for a charge-based transient circuit simulation analysis. During the preparation phase, the circuit configuration is determined. During simulation setup, the circuit parameters are reset to their initial conditions, while a new set of random input vectors is generated from each input’s respective distribution. Finally, during the runtime loop, each transistor’s charge and capacitance are determined based on the voltage at its nodes, which in turn affects the voltage at each circuit node.

2.2 Calculation of ∆Vgs via current matching techniques. At a prescribed voltage, the properties of a transistor with variations - a transistor “instance” - are compared against the nominal, variation-free transistor by observing the shift in gate voltage required for the nominal transistor to match the current draw.

2.3 Domain translation. We consider how changes in a parameter (L) could be equivalently represented in another parameter domain (Vgs) by considering how each affects a common parameter domain (Ioff) (a, b). By matching their effects, we can consider how a minor change in L could be expressed similarly by ∆Vgs (c), and vice versa (d). The linear approximation of these relationships is useful for the fast ∆Vgs approximation technique.

2.4 Domain translation from L to tox via Ioff. The relationship between these two parameters is approximated better via a logarithmic distribution than a linear distribution.

2.5 CDF from function composition. For x = 5, the two formulas from Equations 2.19 and 2.20 demonstrate different distribution functions, as the composition of x1 and x2 was non-commutative.

2.6 Slope factors. Matching the effects that L and tox have on current deviations suggests that ∆tox ∝ ln ∆L. However, the relative influence of the two varies with respect to (a) Vds and (b) Vgs. Avoiding point calibration errors requires tracking these slopes in coefficient tables.

2.7 Runtime loop computation of Ids from the reduced parameter model. L, W, and tox each play a role in tox,d, albeit through nonlinear translation and then the use of tables for the slope factors (m) mentioned in Equation 2.33. Afterwards, a three-dimensional lookup table in terms of Vgs, Vds, and tox is used to calculate Ids (Eqn. 2.36).

2.8 Test circuits. A 6T SRAM cell (a) and a three-inverter ring oscillator (b) were tested under our Monte Carlo framework.

2.9 SRAM static noise margins under read conditions, shown in (a) linear and (b) logarithmic scales.

2.10 SRAM read noise margins under read conditions, shown in (a) linear and (b) logarithmic scales.

2.11 SRAM access delay distribution, enhanced to demonstrate right tail accuracy. (a) demonstrates the amount of time necessary to perform a successful read (Vbl < 0.9 V) while (b) demonstrates the time necessary to perform a successful write (Vq < 0.1 V; Vqb > 0.9 V).

2.12 Ring oscillator period under the presence of variations. (a) demonstrates the ring oscillator period under standard conditions (Vdd = 1 V), while (b) demonstrates the reduced accuracy achieved under low-power (Vdd = 0.6 V) conditions.

2.13 Necessary additions to the Monte Carlo procedure in Figure 2.1 to support simulation with parameter variations under our three models. For the ∆Vgs models and Reduced Parameter models, calculation from the Pao-Sah model is required during the preparation phases, while the Golden model requires calculation during each simulation setup. All three of the models require interpolation during the runtime loop, albeit the Reduced Parameter model additionally requires calculation of the derived oxide thickness parameter during the runtime loop.

3.1 Design flow for CLIP.

3.2 Overview of CLIP. (a) CLIP-Enhanced design with multiple logic modifications and PV sensors (b) Logic cones (fanin and fanouts) for a set of injection and correction nodes (c) Alternate representation as logic blocks (d) Modified Logic with logic duplication, switchbox, and selector (e) Sense amplifier-based PV sensor. The NMOS devices under test operate in the subthreshold regime and have a low duty cycle to enhance their stability.

3.3 Logic modifications for correctibility. Based on the PV sensor reading, the switchbox blocks value propagation to one of the duplicated functions (f1, f2) such that only one function is reliably correct. A key input, fed to the selector, determines which function’s value to use.

3.4 Single-bit switchbox styles. The variety of options gives a designer a greater opportunity to conceal the injection nodes from mask analysis.

3.5 Unlocking procedure. Test vectors are generated dynamically by merging predetermined test vectors with known sensor readings and computing the required unlocking key to supply to the device under test. The outputs undergo a frequency analysis that reveals additional sensor values.

3.6 Random logic pruning on binary logarithm circuitry. (a) pre-pruned logic with marked injection locations. (b) post-pruned logic in terms of only injection values. If the gates are sorted topologically, we can propagate constant values from the primary inputs through the gates, eliminating gates that receive constant inputs at a linear rate.

3.7 Area overhead for various heuristics on the benchmark circuit S5378 with a 20% area limit for logic modification.

3.8 Functional impact. Only a small number of wrong key bits ensures the circuitry does not give proper results.

3.9 Output Hamming distance based on key error. Dashed results represent a wrapped Hamming distance measure.

3.10 Switching overhead for increasing key length. (a) Maximal impact. (b) Heavy grouping. (c) Converse power.

3.11 Annealing attacks on s5378. (a) No preprocessor (b) XOR-based preprocessor. The preprocessor used approximately 130 XOR gates (25% additional area) to dissociate the unlocking and internal keys.

3.12 Switching overhead based on key error.

4.1 Typical processor memory layout. Each processing core has dedicated instruction and data caches (I$, D$), with a larger 2nd level cache for both (L2$). Multiple cores commonly share an even larger, 3rd level cache (L3$) before they go off-die to access external memory.

4.2 Power usage in Intel’s Ivy Bridge processors, with independent tracking for standby and dynamic power usage. Even with the transition to FinFET designs, more than 1/3 of a design’s power usage comes from idle components. Source: http://blog.stuffedcow.net [68]

4.3 Block diagram. Using a multilayered approach, we achieve low-latency memory confidentiality and segmentation. The single boot key, the root of trust, is used both for generating the page keys as well as ensuring volatility against the use of a SRAM cache for the page keys. The page keys themselves are generated by information from the translation lookaside buffer (TLB) such as the physical address, the virtual processor ID (in the case of virtualization), and any other relevant details the OS desires to store in the page tables. Finally, a unique one-time pad (OTP) is produced for each memory row in a uniquely-identified page to provide for simple, confidential data storage.

4.4 Intel’s Hardware Random Number Generator [74]. This random number generator is based on biasing two inverters to be metastable such that the resultant state after the clock enable should be random.

4.5 A typical D Flip Flop modified to support a reset low signal. The two sets of coupled NAND gates would re-bias themselves to a known state whenever R is low.

4.6 Counter mode operation of a block cipher like AES. The cipher is instantiated multiple times and driven with the same key, while the cipher’s inputs are driven by an address parameter and some unique, one-time use value called a nonce. The output of the cipher is XORed against the plaintext to produce the ciphertext, and is undoable by simply running the ciphertext back through with the same key, address, and nonce.

4.7 AES Cipher. A single encryption round, as demonstrated in (a), consists of a substitution, an element shift, and an invertible column multiplication. The substitution phase requires a computationally intense 8-bit Galois Field inversion, and is frequently substituted by the 4 kbit table shown in (b, from [78]). Instead of duplicating this single table 16 times to minimize the per-round latency, the substitution can also be calculated through an isomorphic conversion to a simpler Galois Field, as shown in (c, from [80]).

4.8 Embedded ciphers. XTEA (a), a 64-round Feistel cipher, requires two 32-bit algebraic additions [+] and two boolean additions (XOR, ⊕) per round. SEA (b) requires 63 rounds for its 66-bit version, and consists of [S]-Boxes, word [R]otations, and bit [r]otations, but the critical path is only constrained by the substitution box, a single 11-bit addition, and a boolean addition operation.

4.9 Unique attributes of various hardware-efficient ciphers. PRESENT (a, from [86]) uses a permutation matrix to allow for wide dissemination of information. NOEKEON (b) provides for a unique key mixing algorithm that operates like overlapped Feistel ciphers. mCrypton (c, from [87]) integrates different substitution boxes, allowing for more nonlinearity.

4.10 Expected L3 read performance impact under TLB scenarios, as compared to Sandy Bridge performance and a 10-cycle encryption cipher. (a) L1 TLB hit. (b) L1 TLB miss, L2 TLB hit. (c) TLB miss.

4.11 Integration of nonce and hash operation, as shown in grey. Nonce operation requires a memory-backed cache, while hash operation requires both a hash function and an additional memory-backed cache. The two caches are referenced by the TLB row index and the address index. The one-time pad cipher may be reusable for the hash function. In the case of a malformed hash, an interrupt can be sent back to the processor.


ABSTRACT
Griffin, William Paul Ph.D., Purdue University, August 2013. Variation-derived Chip
Security and Accelerated Simulation of Variations. Major Professor: Kaushik Roy.
In modern ICs, variations can be quite troublesome. Ensuring quality and yield
requires careful and resource-intensive simulation under the effects of parameter variations. Threshold voltage approximation of parameter variations can help accelerate
simulations, but it comes with unknown losses in quality. Instead, we propose a parameter reduction technique designed to minimize quality loss through careful analysis of the source transistor model and the set of input parameter variations. Using
Pao-Sah’s Double Integral model, we demonstrate the relative quality of our reduced
parameter approach versus threshold voltage approximation.
Despite their negative role in circuitry, variations can be useful for security applications. They can be used both to provide a fingerprint for any chip, as well
as to generate truly random numbers. Using these properties, we have developed
two security-related chip enhancements: a chip antipiracy scheme, and a secure nonvolatile cache.
To prevent chip cloning, we require a unique key to operate every manufactured
chip by leveraging the nonreplicable nature of variations. Through minor modifications, we create a system in which process variation sensors can be integrated into
any logic block so that the block itself authenticates the key.
Finally, the anticipated use of leakage-free nonvolatile caches presents a disruption
to the processing security assumption that memory values are not retained upon power
loss or system reset. To prevent data leakage, we propose the use of truly
random, single boot keys, along with a multistage encryption mechanism. We offer
insurance against the possibility of a cryptographic break by implementing a two-level
encryption scheme that offers cache-level data confidentiality with minimal impact on
system performance.


1. INTRODUCTION
In 1965, it was predicted that, in order to stay profitable, the silicon industry would
continue to follow a particular trend of advances for at least ten years. This prediction,
commonly referred to as Moore’s Law [1], requires faster and more functional chips,
subject to reasonable considerations in other attributes like power usage and heat
dissipation. Unfortunately, for a single process technology, speed increases come
at the sacrifice of power efficiency. While adjustments to the threshold voltage can
promote faster speeds, they also create more leaky transistors, as shown in Figure 1.1.
Supply voltage increases can also raise a transistor’s Ion linearly,
but the associated power usage increases at an even faster rate (P ∝ CV²).
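
The trade-offs described above can be put into rough numbers. The following is a minimal sketch rather than a device model: the 100 mV/decade subthreshold swing, the 0.4 V nominal threshold, and the assumption that Ion scales with the gate overdrive (Vdd − Vth) are illustrative choices that do not come from this work. Only the P ∝ CV² relation and the 0.1 V threshold shift are taken from the text.

```python
# Illustrative only: the swing and the overdrive model are assumptions,
# not parameters taken from this dissertation.
SUBTHRESHOLD_SWING = 0.100  # volts per decade of leakage current (assumed)

def leakage_ratio(delta_vth):
    """Factor by which I_off grows when V_th is lowered by delta_vth volts."""
    return 10 ** (delta_vth / SUBTHRESHOLD_SWING)

def on_current_ratio(vdd, vth, delta_vth):
    """Factor by which I_on grows, assuming I_on is proportional to (Vdd - Vth)."""
    return (vdd - (vth - delta_vth)) / (vdd - vth)

def dynamic_power_ratio(v_old, v_new):
    """Factor by which switching power changes, from P proportional to C * V**2."""
    return (v_new / v_old) ** 2

print(leakage_ratio(0.1))               # leakage grows by one decade
print(on_current_ratio(1.0, 0.4, 0.1))  # drive current grows by about 17%
print(dynamic_power_ratio(1.0, 1.1))    # dynamic power grows by about 21%
```

Under these assumed values, a 0.1 V threshold reduction buys a modest linear gain in drive current at the price of an order of magnitude in leakage, which is the asymmetry Figure 1.1 illustrates.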
Feature size reduction can achieve both faster and more efficient transistors, but
not without drawbacks. As the size of transistors decreases, the effect of intrinsic
silicon imperfections becomes amplified, initially more so than the advantages of the
feature shrink. Even once a process is considered stable, these imperfections still play
roles in circuit and system behavior on both delay and power. Under some situations,
variations can even lead to functional errors.
Accounting for these imperfections is paramount to creating a successful design;
however, a full scale, accurate simulation to determine the effect of parameter variations is not feasible. Between the number of parameters involved and the desire for
complete coverage, such a simulation would require exponential scaling with respect
to the number of parameters.
Instead, circuits can be analyzed more efficiently through the use of subdivision - breaking the circuit into smaller components; approximation - the use of an inexact
model to offer faster simulation times; and Monte Carlo analysis, where simulation is
only intended to gather a representative set of results.

Fig. 1.1. Effect of lowering the threshold voltage of a transistor by 0.1
V, as demonstrated by (a) Ids − V and (b) log Ids − V graphs. While
Ion increases linearly with respect to the threshold voltage, thereby
offering a faster transistor, Ioff increases exponentially.

Unfortunately, Monte Carlo analyses are not known for being fast. Monte Carlo
analyses merely attempt to generate a representative sample set through the use of
simulations; large sample sizes are required to make informed observations. Reducing
the cost of a Monte Carlo analysis requires targeting the per-simulation cost and
looking for additional methods of approximation.
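
The sample-size cost noted above can be seen in a toy experiment. The sketch below assumes a hypothetical circuit metric (`toy_delay`, a nominal delay perturbed by a single normally distributed parameter) in place of a real circuit simulation; it only illustrates that the standard error of a Monte Carlo estimate shrinks with the square root of the sample count.

```python
import math
import random
import statistics

def toy_delay(rng):
    """Hypothetical circuit metric: a nominal delay perturbed by one normal parameter."""
    return 1.0 + 0.1 * rng.gauss(0.0, 1.0)

def monte_carlo(n_samples, seed=0):
    """Return (mean estimate, standard error of the mean) over n_samples runs."""
    rng = random.Random(seed)
    samples = [toy_delay(rng) for _ in range(n_samples)]
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n_samples)
    return mean, std_err

# Each fourfold increase in sample count roughly halves the standard error.
for n in (100, 400, 1600):
    mean, err = monte_carlo(n)
    print(f"N={n:5d}  mean={mean:.4f}  std_err={err:.4f}")
```

Because halving the uncertainty requires roughly four times as many simulations, any reduction in the cost of a single simulation is multiplied across the whole analysis.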
At the transistor level, a common method for simplifying the role of process
variations is to represent them through variation of a single parameter, the
threshold voltage (Vth). Such a simplification can be advantageous,
as the transistor’s current characteristics can be captured by a two-dimensional table
based on the adjusted gate voltage Vgs − ∆Vth and the drain voltage Vds. During
circuit simulation, such a model could interpolate from this table, avoiding expensive
runtime calculation of the current.
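
The table-based approach described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_ids` is a placeholder current model invented for the example (it is not Pao-Sah or any model used in this work), and the grid spacing and voltage range are arbitrary. The part that mirrors the text is the lookup itself, which interpolates a two-dimensional table at the adjusted gate voltage Vgs − ∆Vth and the drain voltage Vds.

```python
import bisect
import math

def toy_ids(vgs, vds, vth=0.4):
    """Placeholder current model (an assumption); a real flow would call the physical model."""
    overdrive = 0.05 * math.log1p(math.exp((vgs - vth) / 0.05))  # smooth max(0, Vgs - Vth)
    return 1e-4 * overdrive ** 2 * math.tanh(vds / 0.2)

def build_table(vgs_axis, vds_axis):
    """Precompute Ids on a rectangular (Vgs, Vds) grid for the nominal transistor."""
    return [[toy_ids(vg, vd) for vd in vds_axis] for vg in vgs_axis]

def _bracket(axis, x):
    """Return the lower index of the grid cell holding x and the clamped fractional position."""
    i = min(max(bisect.bisect_right(axis, x) - 1, 0), len(axis) - 2)
    t = (x - axis[i]) / (axis[i + 1] - axis[i])
    return i, min(max(t, 0.0), 1.0)

def lookup_ids(table, vgs_axis, vds_axis, vgs, vds, delta_vth=0.0):
    """Bilinear interpolation at the adjusted gate voltage Vgs - delta_vth."""
    i, tg = _bracket(vgs_axis, vgs - delta_vth)
    j, td = _bracket(vds_axis, vds)
    low = table[i][j] * (1 - td) + table[i][j + 1] * td
    high = table[i + 1][j] * (1 - td) + table[i + 1][j + 1] * td
    return low * (1 - tg) + high * tg

vgs_axis = [k * 0.01 for k in range(121)]  # 0 V to 1.2 V in 10 mV steps
vds_axis = [k * 0.01 for k in range(121)]
table = build_table(vgs_axis, vds_axis)

# A transistor instance whose threshold is 50 mV above nominal.
print(lookup_ids(table, vgs_axis, vds_axis, vgs=0.8, vds=1.0, delta_vth=0.05))
```

The table is built once for the nominal device; every varied instance then reuses it with a different `delta_vth`, so the runtime loop never evaluates the underlying model.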
While Vth approximation is but one style of parameter reduction, the advantage
of parameter reduction techniques cannot be overstated: as the number of input
parameters is reduced, the advantage to using table interpolation over direct model
computation grows. The reduction in parameters causes an exponential reduction
in the size of the lookup tables, minimizing the bottleneck of table generation and
storage that would be required with a many-parameter table.
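
A quick calculation shows how sharply table size depends on the parameter count. The sketch assumes 100 grid points per axis and 8-byte entries, both of which are arbitrary illustrative values; the two base axes stand for Vgs and Vds, and each additional axis stands for one explicitly tabulated process parameter.

```python
def table_entries(points_per_axis, n_axes):
    """Entries in a full lookup table with one axis per input parameter."""
    return points_per_axis ** n_axes

def table_megabytes(points_per_axis, n_axes, bytes_per_entry=8):
    """Storage for the table, assuming 8-byte floating-point entries."""
    return table_entries(points_per_axis, n_axes) * bytes_per_entry / 1e6

# Two voltage axes (Vgs, Vds) plus a varying number of process-parameter axes.
for extra in range(0, 5):
    n_axes = 2 + extra
    print(f"{n_axes} axes: {table_megabytes(100, n_axes):,.2f} MB")
```

With these assumed values, each parameter removed from the table shrinks it by a factor of 100, which is why collapsing several process parameters into one derived parameter makes interpolation practical.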
Chapter 2 offers a study on parameter reduction methods, as well as an enhanced
method for reducing parameter count in a device model. From a physical model (Pao-Sah’s double integral model [2]), we explore two parameter reduction approximations
- the traditional ∆Vth approximation, and an approximation using a novel approach
to parameter reduction. We offer a qualitative comparison of these simplified models, and examine the runtime impact of using these models during a Monte Carlo
simulation.
While process technology advancements do play a role in the properties of new silicon devices, these advancements only occupy one part of the semiconductor pipeline.
Fabrication facilities are merely high-tech printers; the printing process itself has no
purpose without a quality design. The high design costs may encourage some to turn
towards design piracy, be it via skimming at a fabrication facility, microscopy-based
mask extraction, or even corporate theft.
Illicit manufacture of a design presents a two-fold problem for the design originator. First, these stolen devices can damage market sales. These additional components will likely be sold more cheaply than their legitimate counterparts, causing both
devaluation of the original components and lost sales. Second, these devices can
damage the reputation of the original designer. If the illicit devices contain flaws,
regardless of whether they are due to poor design extraction, incomplete design theft, or poor
device testing, the designer and not the design thief would receive the blame.
Preventing design theft requires reliance on a nonduplicable component; unfortunately, as microscopy can offer greater resolution than photolithography, any intentionally designed structures can likely be detected. Parameter variations, on the other hand,
can be impossible to duplicate. Microscopy is not designed to measure the effect
atomic variations would have on a single transistor’s attributes; it only serves to reveal
the structure. Indeed, the nonduplicable nature of variations has prompted an
interest in their use as device fingerprints [3].
Using these variations to implement a piracy protection scheme is not novel;
several methodologies that use variations to prevent piracy have been introduced [4–6], but they do have shortcomings. Many of these schemes suffer from a
single point of failure, under which only one mask modification could defeat their
protection. One of these designs offers no protection against direct application of the
key [5], opening up the possibility for key discovery via an annealing-based attack on
the key. And, in a third case, a large area overhead was incurred by implementing
the protection scheme as additional states inside an existing FSM [4].
In Chapter 3, we present a framework that avoids the majority of these costly
mistakes, and we supply details on discovered security weaknesses in our own framework and the measures required to minimize or eliminate such weaknesses. Our
framework manages to integrate the unlocking mechanism inside of any random logic
circuit, using the circuit itself to provide verification for the relationship between the
key and process variation readings, as well as eliminating the single point of failure
seen in other designs.
Finally, we turn our attention to the issue of data confidentiality. As designs
become faster and more efficient, it becomes easy to justify convenience over security.
Sales are made based on performance and capabilities; security measures are seen as
a drain on resources without any direct correlation to increased sales.
Without the focus on hardware security measures, a computer may be vulnerable
to data leakage. For example, through a Direct Memory Access (DMA) interface, a
connected device is able to extract the entirety of memory even while the operating
system is operational [7]. An additional attack surface exists through the means of
cold boot attacks [8]. By leveraging the data remanence property present in both
SRAM and DRAM, data can be recovered either through a platform reset attack
where, following a reset, a modified operating system extracts the contents of RAM,
or through a cryogenic attack, where a memory module is cooled to extend the effects
of remanence to allow for physical transfer of the memory module to a different
system.
With the advent of nonvolatile storage structures like Spin-Torque-Transfer Magnetic RAM (STT-MRAM), data remanence becomes more critical. Data degradation
under a cold boot attack is now less likely, as the nonvolatile structures are designed
for greater endurance. Anywhere that these nonvolatile structures appear now becomes subject to physical attacks - even if used inside a processor as a memory cache,
data could be exposed through careful decapping and microprobing. To this end,
Chapter 4 presents a method to combat cold boot attacks, even when they
target the processor’s cache. Through the use of hardware-efficient ciphers, volatile
registers, and hardware random number generators, we propose a data confidentiality
layer against cold boot attacks that can be integrated at the L3 cache level without
an impact on performance.


2. SIMPLIFIED MODELING OF DEVICE VARIATIONS
Transistor features have been scaling down to feature sizes in the low nanometers,
despite the use of 193 nm light for photolithography. The difficulty in lithography
results in interdie variations in transistor parameters such as gate length, width, oxide
thickness and flat band potential [9], as well as random variations such as random
dopant fluctuations [10] and line edge roughness [11].
To capture the effects of these variations on circuits and systems and determine
manufacturing yields, there is a need to perform simulations that take into account
these parameter variations. Researchers often use transistor threshold voltage modulation (∆Vth) as an approximation for the parameter variations, as it offers both a
convenient means to simulate process variations and a vast improvement
in runtime over simulation with individual variations. Unfortunately, ∆Vth approximations come saddled with a rarely understood loss in quality.
In this chapter, we will examine interdie variations as they occur in transistor
parameters L, W , tox , and Vf b , and how to provide for accelerated circuit simulations
under the effect of variations in these parameters. We will focus on techniques that
provide for a reduction in the number of parameters, including a ∆Vth -like approach
using gate voltage modulation (∆Vgs ) and a specialized reduced parameter method
that uses nonlinear transformations between parameter domains and a dynamic variable composition to reduce quality loss.
Using the set of interdie parameter variations (variations in L, W , tox , Vf b ) and
a transistor model (Pao-Sah’s Double Integral [2]), we demonstrate how to efficiently
construct computationally-efficient model approximations.
Circuit demonstration of these two models shows the potential for error in
a ∆Vth approximation. While the reduced parameter model does not suffer from
significant errors, a ∆Vth model results in consequential errors.

2.1 Introduction
Process variations arise from acceptable imperfections during manufacture; these
imperfections can come from sources as large as mask alignment and as small as
random dopant fluctuations. They can affect critical attributes such as an oscillator’s
period, SRAM access behavior and stability, random logic timing, and their associated
power usage. Handling process variations requires reasonably accurate prediction of
their effects on circuit and system performance via transistor models and circuit
simulation tools.

2.1.1 Pao-Sah’s Transistor Model

In this chapter, we have selected Pao-Sah’s double integral formula [2] as our
transistor model. It was selected over other commonly used models (e.g., BSIM) for
a number of reasons, including:
· It is strongly linked with the underlying physics. The model uses electrostatic
computation to determine the current flow.
· It is consistent across operating regions. Other models use different sets of equations
based on the region of operation (subthreshold, linear, saturation) along with some
means to transition between regions. Pao-Sah requires only a single set of equations
to explain the device’s operation across all regions of operation.
· Pao-Sah’s model does not require many tuning parameters, unlike other transistor
models that require curve-fitting parameters.
· BSIM itself used Pao-Sah’s model for verification purposes [12].
Pao-Sah’s general form is as follows:

\[ I_{ds} = q \mu_{eff} \frac{W}{L} \int_{0}^{V_{ds}} \int_{\psi_b}^{\psi_s} \frac{(n_i^2 / N_{dep}) \, e^{(\psi - V)/v_t}}{E(\psi, V)} \, d\psi \, dV \tag{2.1} \]

The drain current Ids is determined through charge integration from the bulk to
the gate (ψ integration from ψb to ψs ) and from the source to the drain (using variable
substitution to use V integration from 0 to Vds ). The carrier density $(n_i^2/N_{dep}) \, e^{(\psi - V)/v_t}$
is found for a given V and ψ from the depletion region dopant concentration Ndep , the
intrinsic carrier concentration ni , and the thermal voltage vt . The
electric field is found for a given V and ψ based on these same variables as well as
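the silicon dielectric constant εsi :

\[ E(\psi, V) = \frac{2 k T N_{dep}}{\varepsilon_{si}} \left[ \left( e^{\psi_s / v_t} - 1 + \psi_s / v_t \right) + \frac{n_i^2}{N_{dep}^2} \left( e^{-V / v_t} \left( e^{\psi_s / v_t} - 1 \right) - \psi_s / v_t \right) \right] \tag{2.2} \]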

Our implementation of Pao-Sah included a metal gate, allowing us to relate ψs
and Vgs via Equation 2.3, which includes the flat band potential Vf b , the surface charge
Qs (Eqn. 2.4), and the oxide capacitance Cox (Eqn. 2.5, based on the oxide dielectric
constant εox and the oxide thickness tox ). Equations 2.6 and 2.7 were included to
model DIBL and channel length modulation effects [12] based on constants γ and λ,
respectively.

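\[ V_{gs} = V_{fb} + \psi_s - Q_s / C_{ox} \tag{2.3} \]
\[ Q_s(\psi, V) = -\varepsilon_{si} \cdot E(\psi, V) \tag{2.4} \]
\[ C_{ox} = \varepsilon_{ox} / t_{ox} \tag{2.5} \]
\[ V_{gs} = V_{gs,orig} + \gamma V_{ds} \tag{2.6} \]
\[ I_{ds} = I_{ds,orig} (1 + \lambda V_{ds}) \tag{2.7} \]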

Calculation of Pao-Sah’s model requires numerical methods; in fact, the numerical
integration of Pao-Sah’s model has a large amount of computation overlap when
measuring Ids as Vds and Vgs vary and the remaining set of transistor parameters
remain fixed.

2.1.2 Parameter variation simulation via Monte Carlo analysis

While methods such as statistical and static timing analysis can be used to estimate a circuit’s behavior under variations, they have their own issues. Statistical
timing analysis methods [13] typically require hefty computational resources, and
static timing analysis methods are suited only for rough estimates of circuit behavior. Instead, designs are characterized via Monte Carlo methods, wherein random
variables are used to generate a representative sample set of measurements.
Unfortunately, Monte Carlo methods can be quite slow. A proper Monte Carlo
analysis requires large sample sets (each derived from a single simulation) to achieve
a desired level of accuracy. Accelerating a Monte Carlo analysis requires either the
use of an approximate, fast Monte Carlo method [14], or faster calculation of each
simulation.
Consider a transient Monte Carlo circuit analysis based on I-V and C-V characteristics, as shown in Figure 2.1. The analysis comprises three phases: preparation,
wherein the overall Monte Carlo procedure is planned; simulation setup, wherein
each simulation is prepared by determining the input parameters; and finally, the
runtime loop, during which node current (Inode ) and capacitance (Cnode ) are determined by summing the currents (Ids ) and capacitances (Cg , Cd ,
Cs ) of the attached transistors, and then the node voltages (Vnode ) are updated according to the simulation time
step size tstep :

\[ V_{node}(t + t_{step}) = V_{node}(t) + t_{step} \cdot I_{node} / C_{node} \tag{2.8} \]
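The update in Equation 2.8 is a forward Euler step. A minimal sketch of that step follows; the function names and the constant-current test case are illustrative assumptions and are not taken from the simulation tool described in this chapter.

```python
def euler_step(v_node, i_node, c_node, t_step):
    """Advance one node voltage by one time step, following Eq. 2.8."""
    return v_node + t_step * i_node / c_node


def charge_capacitor(i_const, c_node, t_step, n_steps, v0=0.0):
    """Toy runtime loop: a constant current charging a fixed capacitance.

    In a real circuit the current and capacitance would be re-evaluated
    from the transistor model at every step; they are held constant here
    only to keep the example self-contained.
    """
    v = v0
    for _ in range(n_steps):
        v = euler_step(v, i_const, c_node, t_step)
    return v


# 1 uA into 1 fF for 100 steps of 1 ps raises the node by about 0.1 V.
v_final = charge_capacitor(i_const=1e-6, c_node=1e-15, t_step=1e-12, n_steps=100)
```

With a fixed current and capacitance the result reduces to V = I·t/C, which gives a quick sanity check on the step size before a transistor model is attached.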

As the majority of simulation time is occupied determining the transistor’s current
and capacitance, the application of a transistor model such as Pao-Sah’s Double
Integral model [2] needs to be carefully considered. While the generation of a new set
of parameter variations for each circuit occurs during the simulation setup phase, their interaction
with the transistor model and the circuit simulation can occur through three different
approaches:

[Figure 2.1: flowchart of three phases. Preparation: determine circuit configuration. Simulation Setup: reset circuit parameters; generate new set of random input vectors. Runtime Loop: determine transistor current and capacitance from node voltages; calculate node voltages based on current and capacitance; advance timestep.]
Fig. 2.1. Monte Carlo framework for a charge-based transient circuit simulation analysis. During the preparation phase, the circuit
configuration is determined. During simulation setup, the circuit
parameters are reset to their initial conditions, while a new set of
random input vectors are generated from each input’s respective distribution. Finally, during the runtime loop, each transistor’s charge
and capacitance is determined based on the voltage at its nodes, which
in turn affects the voltage at each circuit node.

· Direct integration. The current and capacitance can be computed in a just-in-time manner from Pao-Sah’s model, taking into account the present voltage values
and the given transistor’s set of parameter variations. This approach would calculate
Pao-Sah’s model during every time step during the runtime loop.
· Per-transistor integration. As mentioned in Section 2.1.1, computation overlap
exists when computing Ids in Pao-Sah’s model for a range of Vgs and Vds values and
for a particular set of parameter variations. As a single simulation uses a fixed set of
parameter variations for each transistor, the runtime loop calculation of Ids from Pao-Sah’s
model can be avoided in favor of constructing a two-dimensional, Ids [Vgs , Vds ]
table from Pao-Sah’s model during the simulation setup phase. During the runtime
loop, Ids can be found via interpolation from this two-dimensional table.
· Parameter variation capture. Much like how per-transistor integration was
performed, a table could be constructed to capture the possible effects of any set of
parameter variations. Such a table would avoid calculations from Pao-Sah’s model
during simulation setup and the runtime loop, while during the runtime loop Ids would
be obtained from a six-dimension table in terms of the input voltages and parameter
variations: Ids [Vgs , Vds , L, W, tox , Vf b ].
Unfortunately, this six-dimension table for Ids suffers from exponential growth in
storage space as well as processing requirements. If this six-dimension table only
contained 100 points per axis, one trillion data points would need to be obtained
from Pao-Sah model calculations for the table. Furthermore, interpolation delay from
tables is also exponential: a single interpolation from this table would take sixty-four
memory accesses.
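The two costs quoted above follow from simple counting, assuming a regular grid and multilinear interpolation (which reads the 2^d corner points that surround a query). The helper names below are illustrative only.

```python
def table_points(points_per_axis: int, dimensions: int) -> int:
    """Number of entries in a regular grid with the given resolution."""
    return points_per_axis ** dimensions


def multilinear_accesses(dimensions: int) -> int:
    """Corner points read by one multilinear interpolation."""
    return 2 ** dimensions


# Compare the per-transistor table (2-D), a reduced table (3-D),
# and the full parameter-capture table (6-D).
for dims in (2, 3, 6):
    print(dims, table_points(100, dims), multilinear_accesses(dims))
```

Dropping from six dimensions to three cuts the table from 10^12 to 10^6 entries and each lookup from 64 to 8 reads, which is the motivation for the parameter reduction that follows.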
However, this prompts the question: how can we capture the transistor characteristics under any set of parameter variations without such hefty costs? The answer lies
in finding a parameter reduction technique – it requires finding a method that can approximate the transistor model’s behavior under parameter variations, but do so with
a lookup table with fewer dimensions. The most frequent parameter minimization
method combines all the parameters together based on how they collectively affect
threshold voltage (Vth ) [15]. A ∆Vth = ∆Vgs definition (as shown later in Figure 2.2)
could potentially use a two-dimensional table in terms of Vgs − ∆Vth and Vds .
While the most frequent parameter variation approximation is a normal ∆Vth
distribution, others have developed approaches for the efficient representation of variations. The method in [16] finds a ∆Vth distribution for use in SPICE simulations based on the standard deviations of basic parameters. An alternative approach
is to characterize the relative current distribution (e.g., the distribution for
∆Ion /Ion,nom ) via either principal component analysis for each region of transistor
operation, or through the use of a Taylor series expansion [17, 18].
In this chapter, we present two parameter reduction techniques: a ∆Vgs approximation, and our improved reduced parameter approximation. For the ∆Vgs approximation, we will first explain how to determine the distribution of ∆Vgs via a transistor-level
Monte Carlo routine, and then proceed to an alternate, fast method for determining
the ∆Vgs distribution via linear approximations.
Next, we will detail our improved reduced parameter methodology. Through
careful nonlinear transformations, variable combinations, and dynamic (with respect
to Vgs and Vds ) composition, we improve upon the ∆Vgs approach. The merits of this
methodology are as follows:
· Simplicity. The model generated from our methodology can be tuned with minimal
human input and low processing requirements.
· Accuracy. By avoiding the pitfalls in the fast ∆Vgs approach, we can ensure
a greater adherence to the Pao-Sah model characteristics as we perform parameter
reduction.
· Efficiency. Our reduced parameter approach achieves computational efficiency
through its one-time construction during the preparation phase of a Monte Carlo
analysis, and its ability to be represented by a small table, resulting in low table
preparation costs and fast interpolation speeds.
In Section 2.2 we detail both approximation methods. Section 2.2.1 focuses on the
∆Vgs approximation, while Section 2.2.2 explains our reduced parameter approach.
In Section 2.3 we detail the application of both methods to Pao-Sah’s Double Integral
Model [2], with special care taken to detail the simplifications that our reduced parameter approach can perform. Section 2.4 compares the approximations performed
against the original model in terms of both accuracy and speed, and Section 2.5 concludes this chapter with details about what future developments could be achieved.

[Figure 2.2: Ids (log scale, 1 pA to 1 mA) versus Vgs (0 to 1.2) for the "Instance" and "Nominal" transistors, with the ∆Vgs shift annotated.]
Fig. 2.2. Calculation of ∆Vgs via current matching techniques. At a
prescribed voltage, the properties of a transistor with variations - a
transistor “instance” - are compared against the nominal, variation-free transistor by observing the shift in gate voltage required for the
nominal transistor to match the current draw.

2.2 Methods for parameter reduction
Due to the computation and storage requirements of an interpolation table, it is
important to develop methods that reduce the number of parameters used to represent
a transistor model. In this section we address two methods - a ∆Vgs approximation,
where the original parameter variations are substituted by applying variations only to
a ∆Vgs parameter, and our new reduced parameter approach, which uses techniques
to minimize quality loss while merging the effects of individual parameter variations.

2.2.1 ∆Vgs Approximation

The transistor threshold voltage (Vth ) approximation is a method frequently used
by researchers to express process variations in lieu of simulation using variations in
the original L, W , tox , and Vf b parameters. This approximation is useful as the
relationship between threshold voltage, delay, and power is well understood, and
threshold voltage is affected by changes in L, W , tox , and Vf b .
Application of the Vth approximation involves applying a Gaussian distribution
to Vth to represent the effect that the other parameters’ variations have on Vth . In this
section, we will focus on using the shift in gate voltage, ∆Vgs , as a stand-in for ∆Vth :
although there is no standard definition for Vth , a shift in Vth and a shift in Vgs
have similar effects on transistor behavior. Additionally, Pao-Sah contains no
Vth term.
Implementation of this approximation requires only a known distribution for
∆Vgs , and a two-dimensional, nominal (free from the effects of parameter variations)
Ids [Vgs , Vds ] table obtained from Pao-Sah’s model. During circuit simulations, current
is obtained from the nominal table as follows:

\[ I_{ds}[V_{gs} - \Delta V_{gs}, V_{ds}] \tag{2.9} \]

One method to determine the ∆Vgs distribution’s shape is through Monte Carlo
analysis. First, the Ids attributes of a nominal transistor are calculated from the
transistor model. Then, Ids is calculated at some particular voltage bias for multiple
transistors with unique sets of parameter variations. As outlined in Figure 2.2, ∆Vgs
for each transistor is determined by examining the shift in gate voltage required in
the nominal transistor to achieve the same current.

\[ I_{ds,nom}[V_{gs} - \Delta V_{gs}, V_{ds}] = I_{ds}(V_{gs}, V_{ds}, L, W, t_{ox}, V_{fb}) \tag{2.10} \]

Finally, the standard deviation of ∆Vgs (σV gs ) is obtained from statistical analysis.
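The current-matching procedure can be sketched as follows. This is not the Pao-Sah model: `ids_nominal` is a hypothetical smooth, monotonic stand-in curve, the instance model only varies width and flat-band voltage, and all numeric values are assumptions chosen for illustration.

```python
import math
import random
import statistics

VT = 0.0259  # thermal voltage near room temperature (V)


def ids_nominal(vgs, vth=0.4, n=1.5, k=1e-4):
    """Toy monotonic stand-in for the nominal Ids(Vgs) curve (not Pao-Sah)."""
    return k * math.log1p(math.exp((vgs - vth) / (2 * n * VT))) ** 2


def ids_instance(vgs, w_scale, d_vfb):
    """Toy transistor instance: width scaling plus a flat-band shift."""
    return w_scale * ids_nominal(vgs - d_vfb)


def match_delta_vgs(i_target, vgs, lo=-0.5, hi=0.5, iters=60):
    """Bisection solve of ids_nominal(vgs - d) == i_target for d (Eq. 2.10)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ids_nominal(vgs - mid) > i_target:
            lo = mid  # nominal current still too high: a larger shift is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_sigma_vgs(n_samples, vgs=1.0, sigma_w=0.05, sigma_vfb=0.02, seed=0):
    """Monte Carlo estimate of the standard deviation of the matched shift."""
    rng = random.Random(seed)
    shifts = []
    for _ in range(n_samples):
        w_scale = 1.0 + sigma_w * rng.gauss(0.0, 1.0)
        d_vfb = sigma_vfb * rng.gauss(0.0, 1.0)
        shifts.append(match_delta_vgs(ids_instance(vgs, w_scale, d_vfb), vgs))
    return statistics.stdev(shifts)
```

Because the matching is done at a single bias point (`vgs=1.0` here), the resulting distribution is only calibrated for that point, which is one of the limitations revisited in Section 2.2.2.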

The distribution of ∆Vgs can also be estimated without the use of Monte Carlo analysis.
Doing so requires the use of domain translations, linearization, distribution translation, and convolution. The individual effect of each parameter variation upon
∆Vgs is first determined, and then the effects of all variations are merged together
under the assumption they have independent effects on ∆Vgs .
Domain translation determines the equivalent value required to achieve the same
effect on some output parameter. As shown in Figure 2.3, this consists of independently determining the effects that changes in two separate parameters (e.g., L and
∆Vgs ) have on the output (e.g., Iof f ). By equating the output of both equations

\[ I_{off}(L) = I_{off}(\Delta V_{gs}) \tag{2.11} \]

we can see how a given change in L has an equivalent effect on Iof f as a change in
∆Vgs , a fact that can be expressed in either direction. A Taylor series approximation
can be used to express a first-order relationship between L and ∆Vgs :

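\[ \Delta V_{gs}(\Delta L) \approx \frac{dV_{gs}}{dI_{off}} \frac{dI_{off}}{dL} \, \Delta L \approx m_{L \to V_{gs}} \, \Delta L \tag{2.12} \]
\[ \Delta L(\Delta V_{gs}) \approx \frac{dL}{dI_{off}} \frac{dI_{off}}{dV_{gs}} \, \Delta V_{gs} \approx m_{V_{gs} \to L} \, \Delta V_{gs} \tag{2.13} \]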

This slope factor, m, represents the approximate scaling relationship between the two
domains.
To move the effects of a distribution between domains requires the use of these
relationships. If L follows a known cumulative distribution CDFL (L), translating its
effects to ∆Vgs requires variable substitution.

\[ \mathrm{CDF}_L(V_{gs}) = \mathrm{CDF}_L(L(\Delta V_{gs})) \tag{2.14} \]

For normal distributions, a linear transformation will still produce a normal distribution. With a linear approximation, the standard deviation in ∆Vgs due to L
(σL→∆Vgs ) is simply the original deviation σL scaled by the slope factor mL→V gs :

\[ \sigma_{L \to \Delta V_{gs}} = m_{L \to V_{gs}} \cdot \sigma_L \tag{2.15} \]

[Figure 2.3: four panels, (a) through (d).]
Fig. 2.3. Domain translation. We consider how changes in a parameter (L) could be equivalently represented in another parameter
domain (Vgs ) by considering how each affects a common parameter
domain (Iof f ) (a, b). By matching their effects, we can consider how
a minor change in L could be expressed similarly by ∆Vgs (c), and vice
versa (d). The linear approximation of these relationships is useful
for the fast ∆Vgs approximation technique.

If we assume that all of the parameter variations have independent effects on ∆Vgs ,
it becomes trivial to merge their distributions through convolution. The variance
of a sum of independent normal variables is additive, so the variance of ∆Vgs
can be found through summation of the translated (Equation 2.15) variances:

\[ \sigma_{vgs(fast)}^2 \approx (m_{L \to V_{gs}} \sigma_L)^2 + (m_{W \to V_{gs}} \sigma_W)^2 + (m_{V_{fb} \to V_{gs}} \sigma_{V_{fb}})^2 + (m_{t_{ox} \to V_{gs}} \sigma_{t_{ox}})^2 \tag{2.16} \]

Through these linear approximations, we are able to produce σvgs(f ast) , which is
in close agreement with σvgs obtained from Monte Carlo analysis.
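The fast estimate can be sketched in two steps: estimate each slope factor from small perturbations around the nominal point (Equation 2.12), then merge the translated deviations (Equations 2.15 and 2.16). The square-law `toy_current` and the numeric values are assumptions for illustration only; the sign of each slope does not affect the result because the terms are squared.

```python
import math


def slope_factor(current, nominal, param, rel_step=1e-4):
    """Central-difference estimate of m = (dI/dparam) / (dI/dVgs), as in Eq. 2.12."""
    def derivative(name):
        h = abs(nominal[name]) * rel_step
        hi = dict(nominal, **{name: nominal[name] + h})
        lo = dict(nominal, **{name: nominal[name] - h})
        return (current(**hi) - current(**lo)) / (2 * h)
    return derivative(param) / derivative("vgs")


def sigma_vgs_fast(slopes, sigmas):
    """Root-sum-square merge of the translated deviations (Eqs. 2.15 and 2.16)."""
    return math.sqrt(sum((slopes[p] * sigmas[p]) ** 2 for p in sigmas))


def toy_current(vgs, l, w):
    """Hypothetical square-law current, used only to exercise the helpers."""
    return (w / l) * (vgs - 0.4) ** 2


nominal = {"vgs": 1.0, "l": 45e-9, "w": 90e-9}
sigmas = {"l": 2e-9, "w": 4e-9}
slopes = {p: slope_factor(toy_current, nominal, p) for p in sigmas}
sigma_fast = sigma_vgs_fast(slopes, sigmas)
```

Only two model evaluations per parameter are needed, compared with one evaluation per sample in the Monte Carlo route.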

2.2.2 Reduced Parameter

Unfortunately, as will be demonstrated in Section 2.4.1, the ∆Vgs approximation
does not offer great accuracy, in part due to linear approximations, and in part because
its distribution is calculated for some particular voltage bias. To extend beyond these
inadequacies requires an improved approach to parameter reduction, starting with
improved handling of domain translations.
As demonstrated in Section 2.2.1, parameter variations can be equivalently expressed from a different parameter domain through the use of parameter matching.
What was not demonstrated was the possibility of nonlinear approximations. If we
consider the effects of L variations on tox via Iof f (Figure 2.4), one can observe that
the linear approximation is not ideal. In this case, a logarithmic approximation offers
a much better fit, and similar observations can be made about the relationship between other parameters. Oftentimes the numerical relationship between parameters
is better expressed via a nonlinear (log, exponential, inverse, etc.) approximation
than a linear approximation.

[Figure 2.4: two panels, (a) and (b).]
Fig. 2.4. Domain translation from L to tox via Iof f . The relationship
between these two parameters is approximated better by a logarithmic fit than by a linear fit.

Table 2.1
Translations from a normal variable (x = µx + σx z) with mean µx ,
standard deviation σx , and standard normal random variable z ∈ N(0, 1) to
another domain y with mean µy .

Variable Transform          Translated Normal Variable
$y(x) = m x + b$            $y(z) = \mu_y + m \sigma_x z$
$y(x) = m \ln(x) + b$       $y(z) = \mu_y + m \ln\left(1 + \frac{\sigma_x z}{\mu_x}\right)$
$y(x) = m x^{-1} + b$       $y(z) = \mu_y + m \left[ (\mu_x + \sigma_x z)^{-1} - \mu_x^{-1} \right]$
$y(x) = e^{m x + b}$        $y(z) = \mu_y e^{m \sigma_x z}$

While distributions for any nonlinear approximations can be found (albeit not
always in a closed form expression), distributions are not directly applicable to Monte
Carlo analysis. Monte Carlo analysis requires random numbers that are generated
from the given distribution. As such, we can consider how we can combine translated
random variables (such as the translations to normal variables as shown in Table 2.1)
rather than focus on the convolution of the parameter’s respective distributions.
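As a sketch of how translated random variables might be generated, the functions below evaluate each translation for a given standard normal draw z. Each is obtained by substituting x = µx + σx z into the corresponding transform y(x) and collecting the z = 0 value into µy, so every function returns µy at z = 0. The function names are illustrative assumptions.

```python
import math


def translate_linear(mu_y, m, sigma_x, z):
    """y = m*x + b, rewritten around the mean of y."""
    return mu_y + m * sigma_x * z


def translate_log(mu_y, m, mu_x, sigma_x, z):
    """y = m*ln(x) + b; requires 1 + sigma_x*z/mu_x > 0."""
    return mu_y + m * math.log(1.0 + sigma_x * z / mu_x)


def translate_inverse(mu_y, m, mu_x, sigma_x, z):
    """y = m/x + b; requires mu_x + sigma_x*z != 0."""
    return mu_y + m * (1.0 / (mu_x + sigma_x * z) - 1.0 / mu_x)


def translate_exp(mu_y, m, sigma_x, z):
    """y = exp(m*x + b), where mu_y = exp(m*mu_x + b)."""
    return mu_y * math.exp(m * sigma_x * z)
```

In a Monte Carlo run, z would come from a standard normal generator such as `random.gauss(0.0, 1.0)`, and the same draw can feed several translations when one source parameter influences more than one domain.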
When we combine two random variables together, we must consider the underlying function composition (using the result of one function inside of another) behind
their combination. Each random variable will independently explain how a given
parameter variation can affect some particular parameter; to combine them together,
we must realize how both random variables collectively affect that particular parameter. Consider an example of how a variable x is influenced by two random variables
(x1 , x2 ; each derived from a standard normal distribution z = N(0, 1)), and how the
direction of composition (x12 or x21 ) would matter:

\[ x_1 = f_1(x, z) = x + z \tag{2.17} \]
\[ x_2 = f_2(x, z) = x \cdot e^{z} \tag{2.18} \]
\[ x_{12} = f_1(f_2(x, z_2), z_1) = x \cdot e^{z_2} + z_1 \tag{2.19} \]
\[ x_{21} = f_2(f_1(x, z_1), z_2) = (x + z_1) \cdot e^{z_2} \tag{2.20} \]

As shown in Figure 2.5, not all compositions make sense. In fact, this example
demonstrates dependence between variables, as there are no indications on which
composition direction is correct. If we instead limit ourselves only to consideration
of random variables that fit the following form:
\[ x' = f(x, z) = x + f^*(z), \quad f^*(0) = 0 \tag{2.21} \]

where the output x′ is determined by the sum of a fixed value x and some varying component f ∗ (z), potentially based on a standard normal random variable z ∈ N(0, 1),
then the variables can be considered independent with respect to their output, as
their composition is commutative. In fact, use of the above form allows a new derived variable to be constructed through simple addition of the varying components
f ∗ . Two random variables x1 and x2 that abide by the form of Equation 2.21 can be
composed in either direction (either x′12 or x′21 ) to yield the same result:

\[ x'_1 = f_1(x, z) = x + f_1^*(z) \tag{2.22} \]
\[ x'_2 = f_2(x, z) = x + f_2^*(z) \tag{2.23} \]
\[ x'_{12} = f_1(f_2(x, z_2), z_1) \tag{2.24} \]
\[ x'_{21} = f_2(f_1(x, z_1), z_2) \tag{2.25} \]
\[ x'_{12} = x'_{21} = f_{12}(x, z_1, z_2) = x + f_1^*(z_1) + f_2^*(z_2) \tag{2.26} \]

[Figure 2.5: plot of CDF(x) for x12 and x21 .]
Fig. 2.5. CDF from function composition. For x = 5, the two formulas from Equations 2.19 and 2.20 demonstrate different distribution
functions, as the composition of x1 and x2 was non-commutative.

Of the relationships shown in Table 2.1, linear, logarithmic, and inverse transformations can be composed per the above guidelines. Only the exponential transformation does not lend itself to composition.
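A small numerical check illustrates the difference between the two cases. The first pair of functions is taken from Equations 2.17 and 2.18; the additive pair `g1` and `g2` uses arbitrary illustrative coefficients (0.3 and 0.1) that are assumptions, not values from this work.

```python
import math


def f1(x, z):
    """Additive variation (Eq. 2.17)."""
    return x + z


def f2(x, z):
    """Multiplicative (exponential) variation (Eq. 2.18)."""
    return x * math.exp(z)


def g1(x, z):
    """Additive form x + f*(z) with a linear varying component."""
    return x + 0.3 * z


def g2(x, z):
    """Additive form x + f*(z) with a logarithmic varying component."""
    return x + math.log(1.0 + 0.1 * z)


x, z1, z2 = 5.0, 1.0, 0.5

# Mixed forms: the order of composition changes the result (Eqs. 2.19, 2.20).
x12 = f1(f2(x, z2), z1)
x21 = f2(f1(x, z1), z2)

# Additive forms: both orders give x + f1*(z1) + f2*(z2) (Eq. 2.26).
y12 = g1(g2(x, z2), z1)
y21 = g2(g1(x, z1), z2)
```

Here `x12` and `x21` differ, while `y12` and `y21` agree up to floating-point rounding, which is why only translations with an additive varying component are used when building derived variables.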
Revisiting the relationship between L and tox , it is now possible to express a
derived variable tox,d in terms of both parameters’ variations. The functional relationship between L and tox is logarithmic, with the independent effect of L variations
on tox expressed by

\[ t_{ox}(z_L) = \mu_{tox} + m_{L \to tox} \cdot \ln\left(1 + \frac{\sigma_L z_L}{\mu_L}\right) \tag{2.27} \]

based on each parameter’s mean µ, standard deviation σ, and standard normal random variables
z ∈ N(0, 1). The effect of tox ’s own variations also needs to be taken into account

\[ t_{ox}(z_{tox}) = \mu_{tox} + \sigma_{tox} \cdot z_{tox} \tag{2.28} \]

and, collectively, a derived random number tox,d can be generated by

\[ t_{ox,d}(z_{tox}, z_L) = \mu_{tox} + \sigma_{tox} \cdot z_{tox} + m_{L \to tox} \cdot \ln\left(1 + \frac{\sigma_L z_L}{\mu_L}\right) \tag{2.29} \]

[Figure 2.6: two panels plotting the slope factor mL→tox (axis scale ×10^−8, 0 to 7): (a) versus Vds (0 to 1) for distinct Vgs values, and (b) versus Vgs (0 to 1) for distinct Vds values.]
Fig. 2.6. Slope factors. Matching the effects that L and tox have on
current deviations suggests that ∆tox ∝ ln ∆L. However, the relative
influence of the two varies with respect to (a) Vds and (b) Vgs . Avoiding point calibration errors requires tracking these slopes in coefficient
tables.
The slope factors (m) hold another secret: they are not necessarily constant. The
same arithmetic relationship may hold true between two given parameters across the
various transistor regions, but with varying slope factors. As shown in Figure 2.6, for
the relationship between L and tox , it was observed that mL→tox varied between the
subthreshold, linear, and saturation regions, even though the relationship between L
and tox remained logarithmic.
Proper handling of these changing slope factors requires that our derived random
numbers behave dynamically with respect to Vgs and Vds , thereby creating a
dynamic derived random number. The slope factors need to be tracked according to
Vgs and Vds , requiring the creation of an additional lookup table m[Vgs , Vds ]. This
dynamic property of our derived variable (e.g., tox,d ) signifies that, due to its dependency on Vgs and Vds , it needs to be recalculated during the simulation loop (Fig. 2.1)
as the applied voltages change.

\[ t_{ox,d}(z_{tox}, z_L, V_{gs}, V_{ds}) = \mu_{tox} + \sigma_{tox} \cdot z_{tox} + m_{L \to tox}[V_{gs}, V_{ds}] \cdot \ln\left(1 + \frac{\sigma_L z_L}{\mu_L}\right) \tag{2.30} \]

2.3 Implementation
To test our methodologies, we constructed a tool to parse SPICE-like netlists and
follow the transient Monte Carlo analysis outlined by Figure 2.1.
Our tool was constructed to use three different variation models: golden, ∆Vgs ,
and reduced parameter. The golden model used Pao-Sah’s Double Integral, taking
into consideration a normal distribution with mean µ and standard deviation σ for each of
the four sources - L, W , tox , and Vf b . The ∆Vgs approximation received only a
normal distribution for ∆Vgs in order to demonstrate an approach similar to Vth
approximation. The last model, our reduced parameter model, uses all four parameter
variations, but uses nonlinear translations and variable composition on the parameters
to achieve a compact tabular representation of Pao-Sah’s model.

2.3.1 Golden Pao-Sah Model

At its core, our tool uses Pao-Sah’s Double Integral [2] as our golden physical
model. Much of Pao-Sah’s Double Integral was discussed in Section 2.1.1, but to
reiterate, Pao-Sah’s general form uses a double integral structure (Eqn. 2.1) that
through numerical integration can easily calculate Ids [Vgs , Vds ] for varying Vgs , Vds
values on a single set of parameter variations. This makes Pao-Sah’s Double Integral
well suited for a per-transistor style of integration during Monte Carlo analysis.

2.3.2 ∆Vgs model

The ∆Vgs model is designed to capture parameter variations during the preparation phase (see Section 2.1.2) through only a nominal Ids [Vgs , Vds ] transistor table and
knowledge of ∆Vgs ’ distribution. The distribution is found through the fast procedure
outlined in Section 2.2.1, where ∆Vgs ’ standard deviation σV gs is determined through
linear transformations and normal distribution convolution.
To determine σV gs , the linear approximation slope factors (mL→V gs , mW →V gs , ...,
as found through Equations 2.12 and 2.13) can be found for each parameter variation
through generation of two transistors from Pao-Sah’s model with minor differences
only in the given parameter, and then comparing their current mismatch at the
selected voltage bias (in this case, Ion : Vgs = Vds = 1.0V) against the appropriate
shift in Vgs necessary in a nominal transistor.
During the runtime loop, Ids is calculated from the two-dimensional nominal transistor table via Ids [Vgs − ∆Vgs , Vds ].

2.3.3 Reduced parameter model

Using the procedure outlined in Section 2.2.2, the original Pao-Sah transistor models were analyzed to determine possible parameter reduction methods with respect to
drain current Ids and gate capacitance Cg measurements.

Table 2.2
Best-fit parameter approximations over a 3σ range.
Variation Iof f

L

W

tox

Vf b

∆Vgs

L

Inv

=

Inv

Log

Log

Log

W

Lin

Inv

=

Log

Log

Log

Exp

Exp

=

Log

Log

Exp

Exp

Exp

=

Lin

Variation Ion

L

W

tox

Vf b

∆Vgs

L

Inv

=

Inv

Log

Log

Log

W

Lin Inv

=

Log

Log

Log

Exp

=

Exp

Exp

Log

=

Lin

tox
Vf b

tox

Exp

Exp

Vf b

For drain current, the models were analyzed under subthreshold (Iof f ) and above-threshold (Ion ) conditions to identify the domain translation between any two parameters with respect to current. Table 2.2 details the relationships identified between
any two parameters. These relationships are culled to find a minimal fit, which, from our
observations, was the combination of the parameters L, W , and tox into a new tox,d
parameter, and the translation of Vf b variations to the ∆Vgs parameter.
As tox was logarithmically related to L and W , the variations in these domains
were translated as shown in Table 2.1. The derived parameter tox,d is computed
via straightforward composition of tox variations with the translation of L and W
variations onto the tox parameter domain.

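\[ t_{ox,d}(z_{tox}, z_L, z_W) = \mu_{tox} + \sigma_{tox} z_{tox} + m_{L \to tox} \ln\left(1 + \frac{\sigma_L z_L}{\mu_L}\right) + m_{W \to tox} \ln\left(1 + \frac{\sigma_W z_W}{\mu_W}\right) \tag{2.31} \]
\[ t_{ox,d}(t_{ox}, L, W) = t_{ox} + m_{L \to tox} \cdot \ln(L/\mu_L) + m_{W \to tox} \cdot \ln(W/\mu_W) \tag{2.32} \]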

Here, each z is a standard normal random variable, while µ and σ correspond to a
single parameter’s mean and standard deviation. This derived equation can also be
expressed in terms of each source parameter as shown in Equation 2.32.
As detailed earlier in Figure 2.6, the slope factor mL→tox is dependent on both
gate and drain voltages. mW →tox has similar characteristics, and as such, our derived
oxide thickness needs to include these region dependencies.

\[ t_{ox,d}(t_{ox}, L, W) = t_{ox} + m_{L \to tox}[V_{gs}, V_{ds}] \cdot \ln(L/\mu_L) + m_{W \to tox}[V_{gs}, V_{ds}] \cdot \ln(W/\mu_W) \tag{2.33} \]

Cg analysis revealed a differing set of reductions. Vf b can still be translated to
∆Vgs , but L and W were best translated via a scaling effect on the output Cg .
Our reduction approach can be expressed mathematically as follows:

\[ L = N(\mu_L, \sigma_L); \quad W = N(\mu_W, \sigma_W); \quad t_{ox} = N(\mu_{tox}, \sigma_{tox}); \quad \Delta V_{fb} = N(0, \sigma_{V_{fb}}) \tag{2.34} \]
\[ t_{ox,d} = t_{ox} + m_{L \to tox}[V_{gs}, V_{ds}] \cdot \ln(L/\mu_L) + m_{W \to tox}[V_{gs}, V_{ds}] \cdot \ln(W/\mu_W) \tag{2.35} \]
\[ I_{ds} = I_{ds,nom}[V_{gs} - \Delta V_{fb}, V_{ds}, t_{ox,d}] \tag{2.36} \]
\[ C_g = (L/\mu_L) \cdot (W/\mu_W) \cdot C_{g,nom}[V_{gs} - \Delta V_{fb}, V_{ds}, t_{ox}] \tag{2.37} \]

[Figure 2.7: flow diagram with inputs L, W , tox , Vf b , Vgs , and Vds , the derived quantities tox,d and Vgs,d , and the output Ids .]
Fig. 2.7. Runtime loop computation of Ids from the reduced parameter
model. L, W , and tox each contribute to tox,d through nonlinear translation and the slope factor (m) tables
mentioned in Equation 2.33. Afterwards, a three-dimensional lookup
table in terms of Vgs , Vds , and tox is used to calculate Ids (Eqn. 2.36).

During the preparation phase (Fig. 2.1) of the Monte Carlo simulation, the tables - mL→tox [Vgs , Vds ], mW →tox [Vgs , Vds ], Ids [Vgs , Vds , tox ], and the capacitance table
Cgate [Vgs , Vds , tox ] - are calculated from the Pao-Sah model. During simulation setup,
random values for each transistor parameter (Eqn. 2.34) are generated based on each
parameter distribution’s µ and σ - but it is not until the runtime loop, when a transistor has a known Vgs and Vds , that the derived parameter tox,d (Eqn. 2.35) can be
calculated and the current Ids (Eqn. 2.36) and capacitance Cg (Eqn. 2.37) can be
determined. A flow for the computation of Ids during the runtime loop is shown in
Figure 2.7.
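The split between simulation setup and the runtime loop can be sketched as follows. The lookup tables are passed in as plain callables, which is an assumption made to keep the example short; the `toy_m` and `toy_ids` stand-ins and all numeric values are hypothetical placeholders rather than data from the Pao-Sah model.

```python
import math
import random


def sample_parameters(mu, sigma, rng):
    """Simulation setup: draw L, W, tox and the flat-band shift (Eq. 2.34)."""
    return {
        "L": rng.gauss(mu["L"], sigma["L"]),
        "W": rng.gauss(mu["W"], sigma["W"]),
        "tox": rng.gauss(mu["tox"], sigma["tox"]),
        "dVfb": rng.gauss(0.0, sigma["Vfb"]),
    }


def derived_tox(p, mu, m_l, m_w, vgs, vds):
    """Runtime loop: bias-dependent derived oxide thickness (Eq. 2.35)."""
    return (p["tox"]
            + m_l(vgs, vds) * math.log(p["L"] / mu["L"])
            + m_w(vgs, vds) * math.log(p["W"] / mu["W"]))


def reduced_ids(p, mu, m_l, m_w, ids_nom, vgs, vds):
    """Runtime loop: current from the three-dimensional nominal table (Eq. 2.36)."""
    tox_d = derived_tox(p, mu, m_l, m_w, vgs, vds)
    return ids_nom(vgs - p["dVfb"], vds, tox_d)


def reduced_cg(p, mu, cg_nom, vgs, vds):
    """Runtime loop: gate capacitance with L and W scaling (Eq. 2.37)."""
    scale = (p["L"] / mu["L"]) * (p["W"] / mu["W"])
    return scale * cg_nom(vgs - p["dVfb"], vds, p["tox"])


# Hypothetical placeholders standing in for the precomputed tables.
def toy_m(vgs, vds):
    return 1e-10


def toy_ids(vgs, vds, tox):
    return 1e-12 * vgs * vds / tox


mu = {"L": 45e-9, "W": 90e-9, "tox": 1.2e-9}
sigma = {"L": 2e-9, "W": 4e-9, "tox": 0.05e-9, "Vfb": 0.02}
nominal = {"L": mu["L"], "W": mu["W"], "tox": mu["tox"], "dVfb": 0.0}
i_nominal = reduced_ids(nominal, mu, toy_m, toy_m, toy_ids, 1.0, 1.0)
instance = sample_parameters(mu, sigma, random.Random(0))
i_instance = reduced_ids(instance, mu, toy_m, toy_m, toy_ids, 1.0, 1.0)
```

A nominal transistor leaves tox,d equal to tox because both logarithms vanish, so the reduced model reproduces the nominal table exactly when no variation is applied.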

[Figure 2.8: circuit schematics, panels (a) and (b).]
Fig. 2.8. Test circuits. A 6T SRAM cell (a) and a three-inverter ring
oscillator (b) were tested under our Monte Carlo framework.

2.4 Circuit simulation
To classify the quality of our simplifications, we analyze the effect our simplifications have in two areas: accuracy - the effect our simplifications have on circuit
behavior, and efficiency - the effect our simplifications have on the simulation process.

2.4.1 Model accuracy

The accuracy of our models was tested using two representative circuits: an SRAM
cell and a ring oscillator (Fig. 2.8).
SRAM cells are ideal for study as they are one of the most frequently used blocks
on a chip, and because their reliability is related to the sensitive metastable point
between two inverters. We studied four properties for SRAM cells: hold static noise
margin [19] (SNM), read SNM, read delay, and write delay.
Hold SNM, shown in Figure 2.9, captures the sensitivity of a cell with regards to
preserving its data - the lower the margin, the easier it is to accidentally upset the
cell’s state. The tail is amplified under a log-scale analysis to better demonstrate the
observed error. Read SNM, whose results are shown in Figure 2.10, captures how
likely it is for a cell to lose its stored value during a read operation. For both of these
static measurements, almost no error is observed between the golden model and the
reduced parameter model.

[Figure 2.9: P(SNM) (%) versus Hold SNM (200 to 400 mV) for the Golden, ∆Vgs , and Reduced models; panels (a) and (b).]
Fig. 2.9. SRAM static noise margins under hold conditions, shown in
(a) linear and (b) logarithmic scales.

[Figure 2.10: P(SNM) (%) versus Read SNM (0 to 250 mV) for the Golden, ∆Vgs , and Reduced models; panels (a) and (b).]
Fig. 2.10. SRAM read noise margins under read conditions, shown in
(a) linear and (b) logarithmic scales.

[Figure 2.11: P(Delay) (%) on a logarithmic scale for the Golden, ∆Vgs , and Reduced models versus (a) Read Delay (5e−10 to 5e−09 s) and (b) Write Delay (5e−09 to 5e−08 s).]
Fig. 2.11. SRAM access delay distribution, enhanced to demonstrate
right tail accuracy. (a) demonstrates the amount of time necessary
to perform a successful read (Vbl < 0.9 V ) while (b) demonstrates the
time necessary to perform a successful write (Vq < 0.1 V ; Vqb > 0.9 V ).

[Figure 2.12: P(Period) (%) on a logarithmic scale versus Period (s) for the Golden, ∆Vgs , and Reduced models; panels (a) and (b).]
Fig. 2.12. Ring oscillator period under the presence of variations.
(a) demonstrates the ring oscillator period under standard conditions
(Vdd=1 V), while (b) demonstrates the reduced accuracy achieved
under low-power (Vdd=0.6 V) conditions.
Read and write delays, shown in Figure 2.11, demonstrate the timing requirements necessary for reliable SRAM operation. Error is observed between the golden
and reduced parameter models and is likely due to the need for a more accurate
capacitance approximation. Regardless, the reduced parameter model still achieves
greater accuracy than the ∆Vgs model.
Studying a ring oscillator helps us better understand the timing properties,
especially as some designs may need to function in near-threshold conditions. In
Figure 2.12(a) a three-inverter ring oscillator’s period is tested under normal (1.0
V) operation, with a large demonstrated inaccuracy between the ∆Vgs and golden
models. Analyzing the same circuit under near-threshold, 0.6 V supply operation
(Figure 2.12(b)) demonstrates even more error: the ∆Vgs measurements greatly overestimate the golden profile even when delay is plotted on a logarithmic scale.
This inaccuracy demonstrates the difficulty of ∆Vgs calibration: ∆Vgs is determined based on some particular transistor voltage bias; transistor operation away
from this voltage bias is less likely to be accurate. For normal oscillator operation, most of the current flow occurs while the transistor is in saturation and has
a gate voltage close to Vgs = 1 V. The near-threshold operation of the ring oscillator uses a supply voltage of only 0.6 V, never allowing a transistor to approach the
voltage bias used for the ∆Vgs calculation.
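The bias dependence described above can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration: the smooth EKV-style current expression stands in for the Pao-Sah model, the variation is modeled as a simple 10% current gain, and the threshold and slope values are arbitrary. It only shows that a gate-voltage offset calibrated at Vgs = 1 V reproduces the varied current at that bias while drifting at lower gate voltages.

```python
import math

N_VT = 1.3 * 0.0259   # assumed slope factor times thermal voltage (V)
VTH = 0.4             # assumed nominal threshold voltage (V)

def ids_nominal(vgs):
    """Toy EKV-style current (arbitrary units), smooth from sub- to above-threshold."""
    return math.log1p(math.exp((vgs - VTH) / (2.0 * N_VT))) ** 2

def ids_true(vgs, gain):
    """Toy 'golden' device: the variation scales the current by `gain` at every bias."""
    return gain * ids_nominal(vgs)

def calibrate_delta_vgs(gain, v_cal):
    """Bisection for the gate offset that matches the varied current at v_cal only."""
    target = ids_true(v_cal, gain)
    lo, hi = -0.5, 0.5
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if ids_nominal(v_cal + mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def relative_error(vgs, gain, dv):
    """Relative error of the shifted nominal curve against the toy golden device."""
    return ids_nominal(vgs + dv) / ids_true(vgs, gain) - 1.0

GAIN = 1.1
dv = calibrate_delta_vgs(GAIN, v_cal=1.0)
err_at_cal = relative_error(1.0, GAIN, dv)
err_at_0v6 = relative_error(0.6, GAIN, dv)
err_at_0v3 = relative_error(0.3, GAIN, dv)
print(f"dVgs = {dv * 1e3:.1f} mV")
print(f"error at 1.0 V: {err_at_cal:.2e}, at 0.6 V: {err_at_0v6:.2f}, at 0.3 V: {err_at_0v3:.2f}")
```

In this toy setting the offset overestimates the current away from the calibration bias, which is qualitatively the same failure mode as the one reported for the 0.6 V ring oscillator; the magnitudes are not meant to match the measured results.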

2.4.2

Model efficiency

As detailed in Figure 2.1, our Monte Carlo analysis time can be divided into
three phases: preparation, simulation setup, and the runtime loop. The method of
integrating our three models into the Monte Carlo analysis is shown in Figure 2.13.

Fig. 2.13. Necessary additions to the Monte Carlo procedure in Figure 2.1 to support simulation with
parameter variations under our three models. For the ∆Vgs models and Reduced Parameter models, calculation from the Pao-Sah model is required during the preparation phases, while the Golden model requires
calculation during each simulation setup. All three of the models require interpolation during the runtime
loop, albeit the Reduced Parameter model additionally requires calculation of the derived oxide thickness
parameter during the runtime loop.

For the golden Pao-Sah model, transistor parameters were generated from Pao-Sah into a two-dimensional Ids[Vgs, Vds] lookup table for the current set of parameter
variations. This table requires approximately 10 MB of storage per transistor on
our setup. Values were interpolated from this table for each time step, and for
each new circuit simulation, a new transistor table was calculated for a different set of
parameter variations.
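A minimal sketch of such a table lookup is given here, assuming a rectangular grid and bilinear interpolation; the text does not state which interpolation scheme was used. The function names and `toy_model` are hypothetical, and `toy_model` is a placeholder expression, not the Pao-Sah model.

```python
from bisect import bisect_right

def build_table(model, vgs_axis, vds_axis):
    """Tabulate model(vgs, vds) on a rectangular grid (once per sampled device)."""
    return [[model(vg, vd) for vd in vds_axis] for vg in vgs_axis]

def _bracket(axis, x):
    """Index i and fraction t so that x lies between axis[i] and axis[i + 1] (clamped)."""
    i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
    t = (x - axis[i]) / (axis[i + 1] - axis[i])
    return i, min(max(t, 0.0), 1.0)

def lookup(table, vgs_axis, vds_axis, vgs, vds):
    """Bilinear interpolation of Ids at (vgs, vds), evaluated at each time step."""
    i, tg = _bracket(vgs_axis, vgs)
    j, td = _bracket(vds_axis, vds)
    low = table[i][j] * (1 - td) + table[i][j + 1] * td
    high = table[i + 1][j] * (1 - td) + table[i + 1][j + 1] * td
    return low * (1 - tg) + high * tg

def toy_model(vgs, vds):
    """Placeholder device equation; bilinear, so interpolation reproduces it exactly."""
    return 2.0 * vgs + 0.5 * vds + vgs * vds

AXIS = [k * 0.1 for k in range(11)]            # 0.0 V to 1.0 V in 0.1 V steps
TABLE = build_table(toy_model, AXIS, AXIS)
sample = lookup(TABLE, AXIS, AXIS, 0.37, 0.82)
```

The expensive call is `build_table`, which corresponds to the per-circuit setup cost of the golden model; `lookup` is the cheap operation repeated inside the runtime loop.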
The ∆Vgs model requires two operations during preparation: generation of the
nominal transistor table from Pao-Sah’s model, and collection of enough information to determine
the slope parameters used in Equation 2.16. During runtime, the current and capacitance are interpolated and adjusted as appropriate, at an efficiency similar to that of the
golden model’s runtime interpolation. The total storage requirement was about 10
MB per type of transistor (a different transistor approximation is required for NMOS
and PMOS devices, as well as for different sizings).
The reduced parameter model requires calculation of both the two-dimensional
mL→tox and mW→tox tables and the three-dimensional Cg and Ids tables used in Equations 2.34-2.37 during the preparation stage. During runtime, the current is determined after calculation of the derived tox,d parameter (Eqn. 2.35). Due to the three-dimensional structure, the memory usage was approximately 1 GB per transistor
type.
While direct comparisons of runtime efficiency against other simulation tools may
not be fair due to differences in the quality of code optimization, a relative comparison of the three models’ runtimes is presented in Table 2.3. While the ∆Vgs and
Reduced Parameter models were slightly slower during the runtime phase, the per-circuit speedup achieved by these two models allows them to achieve a faster analysis
at a Break-Even point of less than 60 circuit simulations, far fewer circuit simulations
than most Monte Carlo analyses would require.
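The Circuit Speedup and Break-Even figures can be reproduced from the other entries of Table 2.3 with the following sketch. The formulas are inferred rather than quoted from the text: it is assumed that every setup and runtime entry is a total over the 1000 simulated circuits, that preparation is a one-time cost, and that break-even is the preparation time divided by the per-circuit time saved. The dictionary keys are hypothetical names.

```python
N_CIRCUITS = 1000  # Table 2.3 reports times for 1000 circuits

def per_circuit(setup_s, runtime_s):
    """Average wall-clock time per simulated circuit, excluding preparation."""
    return (setup_s + runtime_s) / N_CIRCUITS

def speedup_and_break_even(golden, model):
    """Return (per-circuit speedup, number of circuits at which preparation pays off)."""
    t_golden = per_circuit(golden["setup_s"], golden["runtime_s"])
    t_model = per_circuit(model["setup_s"], model["runtime_s"])
    return t_golden / t_model, model["prep_s"] / (t_golden - t_model)

# Hold SNM column of Table 2.3 (setup for the approximate models is listed in ms)
GOLDEN = {"setup_s": 1358.0, "runtime_s": 83.2}
DVGS = {"prep_s": 37.0, "setup_s": 23.6e-3, "runtime_s": 150.0}
REDUCED = {"prep_s": 47.4, "setup_s": 26.4e-3, "runtime_s": 99.9}

dvgs_speedup, dvgs_break_even = speedup_and_break_even(GOLDEN, DVGS)
reduced_speedup, reduced_break_even = speedup_and_break_even(GOLDEN, REDUCED)
print(f"dVgs:    {dvgs_speedup:.1f}x, break-even {dvgs_break_even:.1f} circuits")
print(f"Reduced: {reduced_speedup:.1f}x, break-even {reduced_break_even:.1f} circuits")
```

Under these assumptions the computed values agree with the tabulated 9.6x / 28.6 and 14.4x / 35.3 entries to within rounding, which suggests the speedup comes almost entirely from removing the per-circuit table generation of the golden model.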

Table 2.3
Simulation runtime for 1000 circuits.

                          SRAM SNM         SRAM Access       Osc. period
                          Hold     Read    Read     Write    1.0 V    0.6 V
Iterations/circuit        44.2k    33.4k   11.4k    76.2k    411k     176k

Golden Model
  Setup (s)               1358     1357    1357     1356     1367     1364
  Runtime (s)             83.2     62.2    33.3     267      1474     598

∆Vgs Model
  Preparation (s)         37.0     37.3    37.0     37.0     37.0     36.9
  Setup (ms)              23.6     19.1    14.3     15.9     20.0     18.0
  Runtime (s)             150      115     52.3     405      2177     1040
  Circuit Speedup         9.6x     12.3x   26.6x    4.0x     1.3x     1.89x
  Break-Even (ckts.)      28.6     28.6    27.5     30.3     55.6     40.0

Reduced Parameter Model
  Preparation (s)         47.4     47.9    47.7     47.8     47.7     48.0
  Setup (ms)              26.4     25.9    11.7     16.2     21.8     16.4
  Runtime (s)             99.9     75.9    36.6     290      1617     650.4
  Circuit Speedup         14.4x    18.7x   38.0x    5.6x     1.76x    3.02x
  Break-Even (ckts.)      35.3     35.6    35.3     35.8     39.0     36.6

2.5

Conclusion
The accurate prediction of the effect parameter variations have on circuits is difficult to achieve. Direct computation of a transistor model like Pao-Sah’s model during
Monte Carlo analysis results in slow simulation, and the table-based methods that
allow for the reuse of computations from the original transistor model are sub-optimal
as they cannot track the individual effect of variations without an explosion in the
preparation time and storage needs of a suitable table.
The ∆Vth (∆Vgs) approximation avoids this table size explosion by expressing
these variations as a normally distributed offset applied to the gate voltage. This
standard approximation mechanism has weaknesses, as it is difficult for a single ∆Vgs
distribution to be accurate under both subthreshold and above-threshold conditions.
Seeking to offer improved quality, a novel reduced parameter approach was developed. Through the use of nonlinear transformations, random variable combination,
and allowance for dynamic composition, we found a method that reduces the number of parameters, but offers improved preservation of the shape and effect from the
original set of parameter variations.
Both approximation methods were able to achieve the desired reduction in complexity when tested against the optimal integration of Pao-Sah’s double integral into
a Monte Carlo analysis. Using a few sample circuits, we were able to demonstrate
computation savings in Monte Carlo analyses before performing 60 simulations.
The tail analysis of these approaches demonstrates the error that can occur with
a ∆Vth technique. Due to issues such as distribution calculation at a particular
voltage bias, the tails of our ∆Vgs model show worrisome error. In contrast, our
reduced parameter approach demonstrates an impressive ability to approximate the
Golden Pao-Sah model, frequently offering 5x lower error than ∆Vgs when estimating
the allowance needed for a 99% yield.
While this chapter demonstrates the application of this novel parameter reduction
technique against Pao-Sah’s Double Integral model, it is our opinion that this technique may offer acceleration and accuracy to parameter variation simulations that
use other transistor models.


3. CLIP: CIRCUIT-LEVEL IC PROTECTION THROUGH
DIRECT INJECTION OF PROCESS VARIATIONS
Significant portions of this chapter originally appeared in IEEE TVLSI [20].
The increasing complexity and cost of modern integrated design and fabrication
has led to significant changes in the semiconductor industry. To cope with these
trends, the industry has restructured itself to reduce expenses, facilitate knowledge
sharing, and globalize operations. Companies that complete an entire design in house
are a rarity - many designs are produced through a process involving Intellectual Property (IP) resellers, System on a Chip (SOC) design houses, and fabrication plants that
span multiple countries. This dispersion of resources across corporate and legislative
boundaries naturally raises significant concerns about piracy.
IP providers are concerned that their designs will be leaked, resold, or overused
by design houses; design houses are worried about handing masks to foundries they
cannot monitor; and foundries are worried that their reputation and revenues could
be jeopardized by a rogue employee. Given the large work force that goes into the
design and manufacturing of an IC, it should come as no surprise that an estimated
80% of piracy is the result of internal design theft [21].
ICs that are outsourced for fabrication are especially vulnerable to piracy from
overproduction and mask theft. For example, it is conceivable that a dishonest manufacturing plant could create more chips than ordered and sell the additional chips
at a lower cost, subverting the profits of the legitimate owner.
Legislative protection for chip designs is not sufficient to address piracy concerns.
The U.S. Semiconductor Chip Protection Act [22] (and similar international legislation [23–25]) gives the mask work owner exclusive rights to manufacture and sell
original designs. However, such legislation is not present in all parts of the world.
Furthermore, mask reverse engineering is not forbidden even under existing laws:
provisions are included that allow the sale of a design created from knowledge gained
through reverse engineering, as long as the resultant work can be considered original.
As a result, for many companies, reverse engineering competitors’ chips is a routine
business practice, and is further simplified by companies that provide tools and services for IC reverse engineering [26]. While reverse engineering an IC does incur some
cost, layout and netlist extraction is still far cheaper than design costs.
Due to growing concerns across the IC industry, a number of approaches have
been proposed to help prevent piracy and design theft. For IP cores, these range
from encryption of the design to watermarking and functional obfuscation [3, 27, 28].
For fabricated ICs, both passive schemes such as unique chip identifiers and active
schemes such as chip locking or metering have been proposed [4–6, 29, 30]. Process
variations, which are an inevitable consequence of scaled manufacturing technologies,
are utilized to realize some of the IC protection schemes. While these techniques
have made promising advances, significant challenges remain. Revealing the mask to
a foundry means that one must be concerned with modifications of the mask that
can disable the protection schemes. Specifically, techniques that have a single “point
of vulnerability” are especially susceptible to mask modifications. Others require
intrusive modifications to the design or manufacturing process, raising the barrier to
adoption.
In this chapter, we propose a scalable method to achieve Circuit-Level IC Protection (CLIP) through direct injection of process variations. Our approach exploits
process variations similar to other active schemes; however, it differs significantly in
the manner in which they are utilized. The circuit is enhanced during design by
“injecting” the outputs of digital process variation (PV) sensors into the logic in a
distributed manner at carefully selected injection nodes, rendering the IC inoperative
upon fabrication. A corresponding set of correction nodes is identified and utilized to
reverse the effects of process variation injection when a unique per-chip unlocking key
is applied¹. Uniqueness of the unlocking key is realized by requiring it to be based
on process variation (PV) sensor readings. As a further level of defense, we utilize
a cryptographic preprocessor to transform the unlocking key into the internal key
used to reverse the effects of the PV sensors. Knowledge of the design modifications
is utilized to generate special test vectors that propagate the PV sensor outputs to
the circuit’s primary outputs. After manufacturing, the responses of each IC to these
test vectors are used to compute its unique unlocking key. We analyze the security of
the proposed scheme under a range of external and intrusive attacks. The significant
contributions of our work are:
• We demonstrate how to utilize direct and distributed injection of process variations
to realize chip locking with unique per-chip unlocking keys. Unlike previous schemes
that use unique external unlocking keys that get transformed to a fixed internal key,
direct injection ensures that keys remain unique per-chip at all levels.
• The distributed insertion of process variation sensors and circuit elements implies
that there is no single point-of-attack that can compromise the proposed technique.
• The injection and correction nodes can be arbitrarily separated from each other
and merged into the functional logic during synthesis, increasing the difficulty in
reverse engineering and identification.
• We achieve scalable security by allowing the designer to increase the length of the
unlocking key at a reasonable increase in hardware overhead.
• No information is exposed from the chip by means of scan chain or similar readout
system. Rather, PV sensor information is obtained through logic paths upon application of specific test vectors, and the responses are used to compute the unique
unlocking key.

In Section 3.1 we present relevant background to our work. In Section 3.2, we
demonstrate a systematic methodology to enhance any given circuit for CLIP. The
core logic modification to allow for injection and correction is presented in Section 3.2.1,
followed by an example process variation sensor in Section 3.2.2 with an explanation for
why we believe such a sensor is feasible. We examine the test and unlocking procedure
for proper IC operation in 3.2.3, and 3.2.3 explains the necessity for a preprocessor
that forbids direct application of a key to the logic. We demonstrate the proposed
technique on a range of benchmarks in Section 3.3, followed by an analysis of various
attacks in Section 3.4. We conclude our work in Section 3.5.

¹ Uniqueness is desirable since a key that is reverse-engineered or leaked is of no use beyond the
single instance that it unlocks.

3.1

Background and Related Work
A significant body of work has addressed the problems of IP and IC protection,
and we summarize the major efforts below.
The business model of IP core providers squarely depends on the ability to ensure
that their cores are kept secure. IP cores can be delivered from an IP vendor to a design
house in an encrypted form [31, 32]. While this adds some measure of security, the
implementations of IP cores are exposed (as netlists or layouts) at later stages of
the design flow. A commonly proposed IP protection technique is watermarking,
wherein a watermark or signature is embedded into the design that can be used
for a posteriori detection and enforcement. Watermarks can be created through the
use of physical positioning to create observable features [27], added constraints to
optimization problems during synthesis [3, 28], or using unused transitions in a state
transition graph [33].
FPGA-based designs are particularly vulnerable to piracy and exploitation since
their designs must be transmitted via a bitstream to the FPGA on system power-on.
To counter this issue, Physically Unclonable Functions (PUFs) have been considered
as a viable method for locking a given bitstream to a particular FPGA. PUFs, first
proposed by [34], produce an irreplicable challenge-response mechanism driven by
process variations. PUFs have several advantages over the use of nonvolatile memory:
they do not require active tamper protection, do not require additional masks during
manufacture, and do not require programming by a trusted party to achieve their
uniqueness [35]. FPGA-based PUF designs rely on measurements taken from delay
and memory elements [36, 37], and silicon-based PUFs have been proposed based on
delay, memory, and capacitor-based sensors [38–40].
IC protection schemes aim to protect the design or mask after it is released to the
foundry, as well as fabricated instances of the IC that are deployed in end systems.
They can be classified into several categories. Passive schemes, such as unique chip
IDs or PUF-based fingerprints that are registered into a database [41], rely on a
posteriori detection rather than pre-emption of piracy.
Recent work has focused largely on active protection schemes, where ICs are inherently manufactured in a locked state pending the application of an unlocking key.
These schemes can be used to implement “metering”, wherein the legitimate owner of
the design releases unique unlocking keys for each manufactured instance. Recently,
a class of active protection schemes has been studied that modify the finite state
machine representation of the circuit in order to introduce protection. [4] proposed
adding states to an FSM to connect a PUF and key, while [30] proposed the introduction of
a deterministic FSM that requires an activation sequence supplied at bootup through
the primary inputs.
In [5] and [6], key exchange protocols were proposed to help protect the unlocking
of an individual IC. These protocols relied on PUF-driven fingerprints at the key
exchange level to require a unique external unlocking key. Some vulnerabilities and
improvements to these protocols were presented in [42]. Similar protocols have also
been implemented commercially [43].
While piracy prevention may be related to security mechanisms used for IC
identification or user authentication, these mechanisms are built to serve the end user
(protection starts primarily after the IC is incorporated into an end system), whereas
piracy prevention is intended to aid the design house (protection starts when the IC
is fabricated). If the objective is piracy prevention, one should not expect an end
user to be a willing active participant - upon purchase, they expect to receive an
unrestricted product; they do not wish to contact the manufacturer nor repeatedly
authenticate themselves to unlock their chip again.
The objective of CLIP is IC piracy prevention, and it falls under the category of
active protection schemes. Therefore, we closely compare it with other approaches
that fall under this category. Existing active protection schemes may be effective at
implementing protection against external attacks, but significant improvements can
still be made in protecting against attacks that exploit knowledge of the internal structure
and mask. A unique external unlocking key is achievable by previous schemes, but
key transformations and combination with internal values culminate in (i) supplying
a fixed, common internal key to the logic core to unlock its operation, or (ii) a single
signal enabling or disabling the operation of the design. These limitations lead to
simple attacks. For example, probing at the location of the fixed internal key could
reveal the fixed internal key, followed by mask modification to hardwire the internal
key. Alternatively, the protection scheme could be permanently disabled through a
single mask modification.
The specific differences between CLIP and the known active protection schemes
are summarized in Table 3.1. The first row indicates that CLIP focuses on piracy
prevention, and does not target in-field re-authentication. The second and third
rows suggest that CLIP utilizes not only unique unlocking keys, but also unique keys
internally. The only other technique that does this is [4]. The fourth row implies that
all of the other techniques have a single point of vulnerability. Finally, we point out
that [4] is based on explicit FSM modification, which is not scalable since construction
of the FSM is not practical for even moderately sized circuits.
With CLIP, we achieve enhanced security with respect to previous approaches.
We inject the key and process variations into the logic directly and in a distributed
manner, creating a system wherein no static, common internal key exists prior to (or
even inside of) the logic core. The insertion of sensors directly into the logic minimizes probing opportunities, while distributing any potential vulnerabilities across
the mask. The distributed nature of our scheme also allows for high scalability -
we achieve a linear relationship between the key length and the added logic, giving
designers tight control on the hardware overhead. We present a comprehensive integrated synthesis methodology for selecting the injection and correction points so as
to avoid delay overheads whenever possible, while maximizing functional impact and
ensuring the ability to recover PV sensor values through the logic paths in the circuit.

Table 3.1
Comparison against active protection techniques
[Columns: Remote Activation [4], HW Protection [30], EPIC [5], IC Activation [6], CLIP. Rows: In-Field Reauthentication, Unique Unlocking Key, Unique Internal Key, Single point of vulnerability, Requires FSM Representation. The individual cell marks are not recoverable from the extracted text.]

3.2

Circuit-Level IC Protection
Figure 3.1 contains a flowchart applying our methodology for circuit-level IC protection (CLIP). Starting with a logic-level netlist, we analyze and perform transformations on the logic, producing a CLIP-enhanced design (Figure 3.2(a)). The logic
core modification includes selection of suitable logic regions (Figure 3.2(b)-(c)), logic
modification for injection and correction (Figure 3.2(d)), and the addition of digital
process variation (PV) sensors (Figure 3.2(e)). Based on the circuit transformations, test vectors are generated to expose the process variations at the circuit outputs, and after
fabrication, the test vector responses are used to compute the unlocking key for each
manufactured chip.

Fig. 3.1. Design flow for CLIP.

Fig. 3.2. Overview of CLIP. (a) CLIP-enhanced design with multiple logic modifications and PV sensors;
(b) logic cones (fanin and fanouts) for a set of injection and correction nodes; (c) alternate representation
as logic blocks; (d) modified logic with logic duplication, switchbox, and selector; (e) sense amplifier-based
PV sensor. The NMOS devices under test operate in the subthreshold regime and have a low duty cycle to
enhance their stability.
To integrate the digital sensor outputs and key into the logic, a pair of nodes in the
circuit is selected as potential locations for PV injection and key correction based on
predetermined heuristics. For selected node pairs, the circuit is modified to allow for
proper operation under the presence of injection. Nodes that lie in the path between
the injection and correction nodes are excluded from further analysis, and the process
is repeated until the desired unlocking key strength is achieved, hardware overhead
limitations are reached, or the logic cannot fit any more nodes without increasing the
critical path.
After logic synthesis, suitable PV sensor cells are added. To the circuit’s logic,
they only need to behave as black-box random number generators. These sensors can
vary in construction, but ideally would be structured such that they are not easily
distinguishable from surrounding logic.
To discover the unlocking key for the design, a set of test vectors is constructed
that expose the PV sensor readings. While one can certainly use a more thorough
test generation process, we developed a process which is fast and produces good
results. To generate the test vectors, random vectors are applied to the primary
inputs and propagated through the logic such that the logic’s output is only dependent
on the sensor values. While some sensor values can be immediately obtained from
the outputs (if an output is directly dependent on a sensor reading), others require
iteration through a subset of keys. The determined sensor values enable the corresponding
key bits to be used as primary inputs for greater pruning in later stages. This reuse
results in different test sets for each chip; however, the test set does not need to be
recomputed from scratch since the test vectors are universal with respect to the
key/process variation mapping.
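The idea of recovering sensor readings from test-vector responses can be sketched on a toy circuit. This is a brute-force illustration, not the test generation procedure developed in this work: the three-site circuit, the AND-style switchbox, the all-zero trial key, and every function name are assumptions. The tester knows the circuit structure (`simulate`) but observes the fabricated chip only through its outputs.

```python
import itertools
import random

N_SITES = 3  # number of injection/correction pairs in this toy circuit

def original(x):
    """Unmodified toy logic: output i is x[i] XOR x[i + 1]."""
    return tuple(x[i] ^ x[(i + 1) % N_SITES] for i in range(N_SITES))

def simulate(x, key, pv):
    """Designer's model of the modified logic for a hypothesised sensor vector pv."""
    outs = []
    for i in range(N_SITES):
        a, b = x[i], x[(i + 1) % N_SITES]
        copy1 = (a & pv[i]) ^ b          # copy fed by switchbox output 1
        copy2 = (a & (1 - pv[i])) ^ b    # copy fed by switchbox output 2
        outs.append(copy1 if key[i] else copy2)
    return tuple(outs)

def recover_sensors(chip, rng):
    """Prune sensor hypotheses with random test vectors until one remains."""
    candidates = list(itertools.product((0, 1), repeat=N_SITES))
    vectors = list(itertools.product((0, 1), repeat=N_SITES))
    rng.shuffle(vectors)
    trial_key = (0,) * N_SITES
    for x in vectors:
        response = chip(x, trial_key)
        candidates = [pv for pv in candidates
                      if simulate(x, trial_key, pv) == response]
        if len(candidates) == 1:
            break
    return candidates[0]

HIDDEN_PV = (1, 0, 1)   # fixed at fabrication; unknown to the tester in practice

def chip(x, key):
    """Black-box stand-in for a fabricated IC."""
    return simulate(x, key, HIDDEN_PV)

recovered = recover_sensors(chip, random.Random(0))
unlock_key = recovered  # in this toy, key bit i simply equals sensor bit i
```

Because the hypotheses are pruned against a structural model, the same pool of vectors works for every chip; only the order in which hypotheses are eliminated depends on the individual sensor values.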
Due to a trivial mapping between the internal key and the process variations, advanced circuit knowledge coupled with methods such as differential analysis [44] may
allow an outsider to recover a single IC’s key. However, if we can make the application
of a particular internal key nontrivial, an outsider cannot defeat the methodology so
easily. Instead of directly applying a key to the logic, an external unlocking key is
fed through a preprocessor to generate an internal key, which directly feeds into the
core logic. This preprocessor could be verification-based to, for example, prevent
application of any key that does not fulfill checksum requirements, or it could be
encryption-based and apply a secret transformation to the key.
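A minimal sketch of a preprocessor that combines both options follows. The XOR mask and the XOR-based checksum are placeholders chosen only so the behavior is easy to follow; they are assumptions, they are not cryptographically meaningful, and a real design would substitute a proper cipher and integrity check. `SECRET_MASK`, the 16-bit key width, and the function names are all hypothetical.

```python
SECRET_MASK = 0x3C5A   # hypothetical secret transformation constant (16 bits)

def _checksum(payload):
    """Toy 8-bit checksum: XOR of the two payload bytes and a fixed constant."""
    return ((payload >> 8) ^ (payload & 0xFF) ^ 0xA5) & 0xFF

def make_external_key(internal_key):
    """Design-house side: wrap the internal key computed for one chip."""
    payload = (internal_key ^ SECRET_MASK) & 0xFFFF
    return (payload << 8) | _checksum(payload)

def preprocess(external_key):
    """On-chip side: return the internal key, or None if verification fails."""
    payload = (external_key >> 8) & 0xFFFF
    check = external_key & 0xFF
    if check != _checksum(payload):
        return None               # reject keys that fail the checksum
    return payload ^ SECRET_MASK  # undo the secret transformation

internal = 0xBEEF
external = make_external_key(internal)
accepted = preprocess(external)
rejected = preprocess(external ^ (1 << 8))   # one payload bit flipped
```

The point of the arrangement is that an attacker who toggles bits of the external key no longer toggles the corresponding internal key bits one-for-one, which is what a differential analysis of the unprotected mapping would rely on.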

3.2.1

Logic Modifications for CLIP

To perform logic modifications, the circuit is analyzed based on the logic cones
that feed and are driven by the injection and correction nodes, and then transformed
through a combination of logic duplication and additional circuitry (Figure 3.2(b-d)).
Since the circuit receives static random values from the sensors, the circuit’s construction must allow for some means of correct operation for any given sensor reading;
a key must exist for all potential sensor outputs. Under a trivial scenario, bits of the
key and sensor readings could be combined before attachment into the circuit, and
subsequently combined into the existing logic via an obfuscating (logic-modifying)
gate. While functional, each modification would be vulnerable to a single point of
attack. Once a wire is tied to either Vdd or Vss, the modification is placed into a
permanent and reproducible unlocked state. Instead, if we separate the key and PV
connections, we can quickly realize a design that eliminates the fixed internal key
and doubles the required number of mask modifications to reach the unlocked state
without increasing the key size or the number of PV sensors.
To account for different key and PV values, the overlapping fanout and fanin of
the injection and correction nodes, respectively, are duplicated to create a second logic
block. The input path to the blocks is fed through a switchbox, which determines
which circuit copy receives the originally intended data. After each copy performs an
operation on the data, a selector is used with a corresponding internal unlocking key
to choose the block that obtained the proper result. Figure 3.3 presents an example
circuit modified for correctibility. The switchbox uses the PV sensor output to block
some of the data supplied to the duplicated logic (in our case, a 2-bit equality checker).
The logic executes the two sets of data in parallel, and then a two-way multiplexer
attached to the key decides which output to trust.

Fig. 3.3. Logic modifications for correctibility. Based on the PV
sensor reading, the switchbox blocks value propagation to one of the
duplicated functions (f1, f2) such that only one function is reliably
correct. A key input, fed to the selector, determines which function’s
value to use.
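The switchbox, duplicated logic, and selector arrangement can be sketched behaviorally as follows. The choice of injection node (the low-order bit of one operand), the dual-AND switchbox, and the key polarity are assumptions made for illustration; the exact gate-level circuit of Figure 3.3 is not reproduced here.

```python
from itertools import product

def equal2(a1, a0, b1, b0):
    """Original logic: 2-bit equality checker."""
    return int(a1 == b1 and a0 == b0)

def switchbox(x, pv):
    """Dual-AND style: exactly one copy receives x; the other is forced to 0."""
    return x & pv, x & (1 - pv)

def locked_equal2(a1, a0, b1, b0, pv, key):
    """Injection at a0, duplicated checker (f1, f2), selector driven by the key."""
    to_f1, to_f2 = switchbox(a0, pv)
    f1 = equal2(a1, to_f1, b1, b0)
    f2 = equal2(a1, to_f2, b1, b0)
    return f1 if key else f2      # two-way multiplexer

def unlocks(pv, key):
    """True if the modified circuit matches the original for every input."""
    return all(locked_equal2(*bits, pv, key) == equal2(*bits)
               for bits in product((0, 1), repeat=4))

for pv in (0, 1):
    print(f"sensor={pv}: key 0 unlocks={unlocks(pv, 0)}, key 1 unlocks={unlocks(pv, 1)}")
```

For either sensor value exactly one key bit restores the original function, so the key bit that unlocks one chip is wrong for any chip whose sensor settled the other way.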

Fig. 3.4. Single-bit switchbox styles (panels: AND, OR, AND/OR, XOR). The variety of options gives a
designer a greater opportunity to conceal the injection nodes from
mask analysis.

A single-bit input switchbox allows for a variety of implementations, including
dual AND, dual OR, mixed AND/OR, XOR, and their complements (Figure 3.4).
Dual AND and dual OR implementations allow the input signal to propagate to one
of the copies of the duplicated logic, while always feeding a constant logic value (0
or 1) to the incorrect copy. To minimize area, a mixed AND/OR implementation
can be used, as the side inputs to the gates allow propagation for complementary
values. However, if an XOR implementation is used, switching activity is more likely
to propagate to the outputs.
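The four single-bit styles can be compared with a short behavioral sketch. The gate-level polarities below (which output gets the inverted side input) are assumptions, since the figure's wiring is not reproduced in the text; what the sketch checks is the stated property that, for each sensor value, exactly one copy sees the true signal.

```python
def dual_and(x, p):
    return x & p, x & (1 - p)

def dual_or(x, p):
    return x | (1 - p), x | p

def mixed_and_or(x, p):
    return x & p, x | p          # the side input needs no inverter

def xor_style(x, p):
    return x ^ (1 - p), x ^ p

STYLES = {"AND": dual_and, "OR": dual_or, "AND/OR": mixed_and_or, "XOR": xor_style}

def describe(style, p):
    """Return (index of the copy that receives x, behaviour of the other copy)."""
    outs = [style(x, p) for x in (0, 1)]     # switchbox outputs for x = 0 and x = 1
    per_copy = list(zip(*outs))              # each copy's response to x = 0, 1
    good = [i for i, resp in enumerate(per_copy) if resp == (0, 1)]
    if len(good) != 1:
        raise ValueError("exactly one copy must see the true signal")
    other = per_copy[1 - good[0]]
    label = {(0, 0): "const 0", (1, 1): "const 1", (1, 0): "inverted"}[other]
    return good[0], label

for name, style in STYLES.items():
    print(name, [describe(style, p) for p in (0, 1)])
```

In this model the incorrect copy of the XOR style receives the complement of the signal rather than a constant, so it keeps toggling with the input, which is consistent with the remark that switching activity is more likely to reach the outputs.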
Multibit switchboxes can be implemented using any random logic,
provided the switchbox follows one restriction: depending on the switchbox side input,
one output must always be correct while the other paths are incorrect. Functionally,
this can be accomplished by splitting the subfunction’s input signal into two paths
and then inserting various styles of logic obfuscators (primitive gates that alter the
output when enabled), driving each path’s obfuscators by complementary signals.
This allows for asymmetric implementations, which could potentially confuse side-channel attacks.
Custom implementation of a suitable selector is more constrained, but it does
contain alternate options. With the typical two-level AND-OR mux, the AND gates
create a small unate region prior to recombination at the OR gate. As long as this
unate property is preserved when the two paths reach the recombination gate, the
selector can be reimagined by moving the location or changing the implementation
of the unate-causing gates.
The placement and selection of the switchboxes and selectors is largely a heuristic
problem. Area minimization, critical path avoidance, and testability should be major
considerations, but the problem also allows for many additional
constraints, such as preserving internal node probabilities. While there is plenty of exploration space with respect to node placement, a few simple observations pertaining
to node selection can be made:
• Single-Bit Components. Switchboxes that affect more than one modification
per output path, and selectors that recombine more than one signal have little
impact on the device’s security: additional logic is required without any increase
in the key length. Single-bit components give us a minimal-area method for
removing the fixed internal key.
• Fanout-Free Cones. To minimize logic duplication, it would be ideal to only
clone preexisting logic once. We can accomplish single duplication by restricting the node selection to when the correction node is at the tip of a fanout-free
cone that contains the injection node. If we follow this guideline, we can automatically avoid the use of multibit selectors.
• Overlap Avoidance. If the duplication region between the selector and the
switchbox overlaps with another node pair’s duplication region, some of the
original logic will appear four times in the modified circuit. By using a node
selection procedure that tracks duplicated gates or relocates the injection node,
we can avoid overlapping multiple sets of node pairs.

3.2.2 Process Variation Sensors

While any process variation sensor that can return a static random value will
function with our method, the sensor should be low overhead and have a consistent
output. Sensors with low area are ideal, as many such sensors could be embedded in
the circuitry. The sensor’s output should also be invariant to operation-induced
variations such as thermal effects. While a formal analysis of the behavior of PV
sensors is outside the scope of this work, a critical element to our approach is the
existence of such a sensor. Therefore, we present a design of one such PV sensor.
A worthwhile direction for future work is to analyze the characteristics (randomness
and stability) of such sensors in greater detail.
In nano-scale transistors (especially for sub-65nm technology nodes) the location
and the number of dopants in the channel can vary widely between transistors (also
known as random dopant fluctuations, or RDF [10]). RDF leads to widely different
transistor threshold voltages (Vth) for transistors that are placed close to each other
and designed to have the same threshold voltage. Measured properly using on-chip
circuitry, such a random threshold condition could effectively act as a static random
number. Conventionally, local random mismatch is characterized by measuring the
current difference between identical neighboring devices and then applying complex
data analysis to extract the Vth difference from the current difference.
In this chapter, we have used a sense-amplifier based test-circuit and measurement
method for on-chip measurement of local random variation. Instead of complex and
sophisticated analog voltage-current measurements required in conventional schemes,
we used a simple digital measurement technique. The sensor is based on the Current
Latch-type Sense Amplifier (CLSA) shown in Figure 3.2(e) [45].
The circuit measures the difference in threshold voltage of the devices under test
(circled in Figure 3.2(e)) by applying a sub-threshold reference voltage (Vref , generated on-chip) to the gate input of both devices. The sense amplifier, a cross-coupled
inverter, determines the mismatch between the devices under consideration due to the
exponential relationship of drain current on sub-threshold gate voltage and outputs a
0 or 1 based on the direction of the mismatch. Due to the small sized transistors for
the DUTs, the mismatch is mainly due to RDF. The other transistors of the sensor
are kept larger so that RDF has minimal effect on the sensor.
Simulations on a similar CLSA sensor using 130nm technology [45] show that
the two transistors could be differentiated by applying an offset voltage to the two
devices to detect where an observable difference in current occurred. It was found
that this offset voltage was strongly dependent on local random Vth variation while
systematic (correlated) variation had minimal effect. While the offset voltage is a
linear combination of the Vth offset of both the driver FETs (N_DR − N_DRB) and the
mismatches in the latch FETs (P_INV − N_INV, P_INVB − N_INVB), the latch FETs
are sized to be large such that RDF has minimal impact and does not lead to latch
mismatches. On smaller technology sizes, greater Vth variations will result in a more
deterministic output.
Since we would like the output to be stable under changes in temperature, let us
now consider the stability of the random number generator to temperature variations.
To the first order, assuming no oxide charge, the threshold voltage of an NMOS
can be expressed as:

\[ V_{th} = -\frac{E_g}{2q} + \psi_B + \frac{\sqrt{4 \varepsilon_{si} q N_a \psi_B}}{C_{ox}} \tag{3.1} \]

as derived by [46], p. 131. Eg denotes the silicon bandgap, ψB is the surface-to-intrinsic
potential, Na is the channel doping density, and Cox is the oxide capacitance.
Both ψB and Eg are functions of temperature, although Eg is a very weak function
of temperature. Substituting the expressions for ψB :
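
\[ \frac{\partial V_{th}}{\partial T} = -(2m-1)\frac{k}{q}\left[\ln\left(\frac{\sqrt{N_c N_v}}{N_a}\right) + \frac{3}{2}\right] + \frac{m-1}{q}\frac{\partial E_g}{\partial T} \tag{3.2} \]

\[ \approx -(2m-1)\frac{k}{q}\left[\ln\left(\frac{\sqrt{N_c N_v}}{N_a}\right) + \frac{3}{2}\right] \tag{3.3} \]

where the approximation assumes that ∂Eg/∂T = 0.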
Typically, for (Nc Nv)^(1/2) = 2.5 · 10^19 cm^−3, a channel doping Na = 1 · 10^16 cm^−3,
and a body coefficient m = 1.1, we have ∂Vth/∂T ≈ −1 mV/K [46].
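As a quick numerical check of the figure quoted above, the following minimal sketch
evaluates Equation (3.3) with the stated parameters. The function and constant names
are illustrative choices, not part of the original work, and k/q is taken as the
standard value of the Boltzmann constant divided by the electron charge.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant over electron charge, in V/K

def dvth_dt(sqrt_nc_nv, n_a, m):
    """Equation (3.3): temperature coefficient of Vth in V/K (dEg/dT neglected)."""
    return -(2.0 * m - 1.0) * K_OVER_Q * (math.log(sqrt_nc_nv / n_a) + 1.5)

# Parameter values quoted in the text.
slope = dvth_dt(sqrt_nc_nv=2.5e19, n_a=1.0e16, m=1.1)
print(f"dVth/dT = {slope * 1e3:.2f} mV/K")  # prints: dVth/dT = -0.96 mV/K
```

The result is slightly below 1 mV/K in magnitude, which is consistent with the rounded
value used in the remainder of the analysis.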
Now, let us consider the differential random number generator shown in Figure 3.2(e).
Suppose that the threshold voltage difference between the devices under test arises
from different numbers of discrete dopants in the channel, leading to average doping
concentrations of Na1 and Na2. Consider a change in temperature ∆T = T2 − T1 (both
transistors under test will experience the same temperature due to their close
proximity). If Vth1 > Vth2 at temperature T1, then for stability we need to show that
the same ordering holds at T2:

\[ V_{th1@T_2} > V_{th2@T_2} \tag{3.4} \]

Modeling the temperature change of each threshold voltage to first order,
Vth@T2 ≈ Vth@T1 + (∂Vth/∂T) · ∆T, the problem is equivalent to showing:

\[ V_{th1@T_1} - V_{th2@T_1} > \left( \frac{\partial V_{th2}}{\partial T} - \frac{\partial V_{th1}}{\partial T} \right) \cdot \Delta T \tag{3.5} \]

where ∂Vth1/∂T and ∂Vth2/∂T are estimated at T1.
From the above analysis, since ∂Vth/∂T ≈ −1 mV/K for similar Na and m (Na1
and Na2 are nearly equal, leading to a very similar m), the right-hand side of
the above equation tends to zero. Experimental analysis [47] also shows differential
thermal stability in the sub-threshold region of operation.
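To give a feel for how small the right-hand side of (3.5) is, the sketch below
evaluates it for two devices whose doping differs slightly. The 10% doping mismatch
and the 100 K temperature swing are assumed values chosen purely for illustration,
and m is held fixed at 1.1 for both devices (the text argues only that m is very
similar). Names are hypothetical.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant over electron charge, in V/K

def dvth_dt(n_a, m=1.1, sqrt_nc_nv=2.5e19):
    """Temperature coefficient of Vth in V/K, with dEg/dT neglected."""
    return -(2.0 * m - 1.0) * K_OVER_Q * (math.log(sqrt_nc_nv / n_a) + 1.5)

def drift_bound(n_a1, n_a2, delta_t):
    """(dVth2/dT - dVth1/dT) * delta_t in volts, both slopes taken at T1."""
    return (dvth_dt(n_a2) - dvth_dt(n_a1)) * delta_t

# Assumed values: 10% doping mismatch, 100 K temperature change.
bound = drift_bound(n_a1=1.1e16, n_a2=1.0e16, delta_t=100.0)
print(f"|bound| = {abs(bound) * 1e3:.2f} mV")
```

Under these assumptions the bound is on the order of 1 mV in magnitude, so any
threshold mismatch larger than that keeps its sign across the temperature swing.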
While NMOS may behave differently under the effects of temperature as compared
to PMOS, the temperature profiles of nearby transistors of the same type will be
nearly identical, causing an equivalent temperature reaction on each side of the circuit
as previously explained. This effect is likely to change the trip points of the two
inverters equally, but will not cause appreciable differences in node capacitance since
most capacitance in the circuit originates from the use of oxide in the MOSFET gates.
Altering the trip points of the transistors changes the total amount of charge
the DUTs will need to drain before the cross-coupled inverters reinforce a voltage
differential, but this effect is balanced between the two sides of the circuit since the
capacitance at the internal nodes is unaffected by temperature. In addition, the cross-coupled inverters are larger than minimum size, making them less susceptible to
random dopant fluctuations. Hence, the design is quite insensitive to temperature
fluctuations.

3.2.3 Sensor Recovery for IC Unlocking

The proposed design framework produces manufactured ICs that are inoperative,
i.e. they will not produce correct functional outputs due to the effects of PV injection.
In order to properly unlock and operate the fabricated chips, a unique unlocking key
must be computed based on the PV sensor outputs. While the easiest method for
exposing the values generated by PV sensors is to include their digitized outputs in
a Design-For-Test structure such as a scan chain, such a path could be used just as
easily by attackers as by authorized users. Therefore, we instead generate and use a
set of test vectors specifically constructed to expose the effect of the PV sensors at
the primary outputs (or scanned state elements).

[Figure 3.5: block diagram. Block labels: Test Database, Device Under Test, Modified
Logic Core, Preprocessor, Inverse Preprocessor, Preprocessor Reset, Sensor Readings,
Frequency Counter, Results Interpreter, Interpretation Instructions. Signal labels:
PI, PO, IK, UK.]

Fig. 3.5. Unlocking procedure. Test vectors are generated dynamically by merging predetermined test vectors with known sensor readings and computing the required unlocking key to supply to the device
under test. The outputs undergo a frequency analysis that reveals additional sensor values.


Fig. 3.6. Random logic pruning on binary logarithm circuitry. (a) Pre-pruned
logic with marked injection locations. (b) Post-pruned logic
in terms of only injection values. If the gates are sorted topologically,
we can propagate constant values from the primary inputs through
the gates, eliminating gates that receive constant inputs at a linear
rate.

Table 3.2
Expected values on pruned circuit.

pattern    E(f1)   pattern    E(f1)   pattern        E(f2)   pattern       E(f2)
x′2        0.5     x2         1.0     x′2            0.75    x2            0.75
x′2 x′4    1.0     x′2 x4     0.0     x′3            0.75    x3            0.75
                                      x′4            1.0     x4            0.5
                                      x′2 x4         0.5     x2 x4         0.5
                                      x′3 x4         0.5     x3 x4         0.5
                                      x′2 x′3 x4     1.0     x′2 x3 x4     0.0

The unlocking procedure is detailed in Figure 3.5. The test database distributes
both a universal set of primary input vectors and a generic set of internal key vectors.
Prior sensor readings are combined with the generic internal key vectors to create
the true internal key vector for testing on the modified logic core. Because of the
preprocessor, the internal key cannot be applied directly; instead, the equivalent
unlocking key must be created by performing the inverse operation of the
preprocessor. The response bits are accumulated and then matched against a table
to determine the values of sensors in the chip. These sensor values are then used in
subsequent tests to construct further test vectors. Once the procedure is complete,
the chip-specific unlocking key can be calculated by supplying one additional generic
internal key from the test database, which, when combined with the internal key and
fed through the inverse preprocessor, produces the chip-specific unlocking key.
The formulation of the test vectors cannot rely on standard test tools, since it may be
impossible to uniquely expose a sensor’s value at the output. The test generator must
assume that the non-determined sensors are represented as an unknown state, and
generate a pattern at the primary input that creates a controlling path from the sensor
to the output, while simultaneously ensuring the sensor modifies a fixed (variation-free)
value at the switchbox. Such a test, potentially implemented via stuck-at fault
testing tools, would yield an output in direct relationship with
the PV sensor’s value. However, it is quite likely that not all sensors can uniquely
sensitize an output, and as such, SAT techniques are not adequate.
Since the node selection heuristics are likely to be geared towards maximal functional
impact, random vector simulation at the primary inputs is likely to expose the
sensors’ values at the outputs. Rather than assuming the sensors are an unknown
value, the vectors instead propagate through the logic, pruning the logic by eliminating
any gates with constant inputs such that the logic becomes a function of only a
composite vector (X = V ⊕ K) between the PV sensor readings V and the internal
key K (see footnote 2). Such pruning will result in each bit of the primary output
(fi ∈ F) becoming a function of specific bits of the composite vector ({x} ∈ X). If fi
is a function of only one composite bit xj, the given test pattern is effective at
finding the sensor value vj.
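The pruning step described above can be sketched as constant propagation over a
topologically sorted gate list. This is a minimal illustration rather than the tool
used in this work: the netlist format, the restriction to two-input AND/OR/XOR gates
plus NOT, and the signal names (`a`, `b`, `x1`, `x2`, `g1`, `g2`, `f`) are all
assumptions made for the example.

```python
def prune(gates, constants):
    """Fold constants through topologically sorted gates.

    gates: list of (output, op, inputs); op is "NOT" or a two-input
           "AND", "OR", "XOR".
    constants: dict mapping known signals (primary inputs) to 0 or 1.
    Returns (values, remaining): values maps a signal to a constant or to
    the name of an equivalent signal; remaining lists the surviving gates.
    """
    values = dict(constants)
    remaining = []
    for out, op, ins in gates:
        args = [values.get(s, s) for s in ins]
        consts = [a for a in args if isinstance(a, int)]
        syms = [a for a in args if not isinstance(a, int)]
        if op == "NOT":
            if consts:
                values[out] = 1 - consts[0]
            else:
                remaining.append((out, "NOT", syms))
        elif not syms:  # every input is constant: evaluate the gate
            a, b = consts
            values[out] = {"AND": a & b, "OR": a | b, "XOR": a ^ b}[op]
        elif len(syms) == 2:  # no constant input: the gate survives
            remaining.append((out, op, syms))
        else:  # one constant input, one symbolic input
            c, s = consts[0], syms[0]
            if (op, c) in {("AND", 0), ("OR", 1)}:
                values[out] = c  # controlling value forces the output
            elif (op, c) == ("XOR", 1):
                remaining.append((out, "NOT", [s]))
            else:
                values[out] = s  # the gate reduces to a wire

    return values, remaining

# Hypothetical netlist: a, b are primary inputs; x1, x2 are composite bits.
netlist = [("g1", "AND", ["a", "x1"]),
           ("g2", "OR", ["b", "x2"]),
           ("f", "XOR", ["g1", "g2"])]

values, remaining = prune(netlist, {"a": 0, "b": 0})
print(values["f"], remaining)  # output f is a function of x2 alone
```

With the input vector a = 0, b = 0, the output `f` collapses to the single composite
bit `x2`, which is the situation in which a test pattern directly reveals a sensor
value. Each gate is visited once, so the cost grows linearly with the gate count.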
In practice, not all sensors find such a trivial route to the output, but instead,
are only visible in multivariable output functions. Determining the values might
be possible through linear programming, or alternatively, through exhaustive key
toggling. If we assume one of the inputs (e.g., xi ) is fixed at either a logical 0 or 1, we
can find the expected value of the function E(f ) under each scenario by enumeration
of the set S of potential inputs to the other inputs (X\xi ).
If a mismatch exists between E(f_{x′i}) and E(f_{xi}) for some i (that is, between the
expected values with xi fixed at 0 and at 1), it is possible to determine vi from the
function. Even without knowledge of xi’s value, fixing the key
input ki and applying the set of test vectors S will yield an output probability equal
to one of the expected values, revealing what xi (and consequently, vi) is. While we
cannot directly apply distinct values in the set S to the composite vector X since
we do not yet know V, applying the set to K will still apply the entire set at X,
but will permute the set due to the variations V. Once vi is found, the function can
be partitioned again to find a different expected value mismatch with a set half its
previous size.

Footnote 2: Simplifying the sensor and the key as a single variable is only possible with XOR-based switchboxes.

After repeating the mismatch characterization on other outputs, the pruning process is repeated, but with an advantage: for each sensor vi we could determine on
prior passes, the composite value xi can be deterministically set to allow for deeper
circuit pruning: we simply need to apply xi ⊕ vi to ki , and xi arrives at the desired
value.
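For XOR-based switchboxes, the key bits that force a desired composite value follow
directly from x = v ⊕ k. The short sketch below illustrates this; the sensor readings
and target values are made-up numbers, and the function name is hypothetical.

```python
def key_for_composite(target_x, sensor_v):
    """XOR-based switchbox: x_i = v_i XOR k_i, hence k_i = x_i XOR v_i."""
    return [x ^ v for x, v in zip(target_x, sensor_v)]

sensors = [1, 0, 1, 1]  # hypothetical sensor values recovered on prior passes
target = [0, 0, 1, 0]   # composite values wanted for deeper pruning
key = key_for_composite(target, sensors)

# Applying the computed key reproduces the requested composite vector.
composite = [v ^ k for v, k in zip(sensors, key)]
print(key, composite)
```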
Consider the circuit in Figure 3.6(a), which is part of a binary logarithm circuit.
By propagating the shown random input vector through the logic, we arrive at the
pruned circuit shown in Figure 3.6(b), which is in terms of only the injection values.
For f2, partitioning by either x2 or x3 causes no difference between the expected
values (Table 3.2); however, partitioning by x4 does result in a noticeable difference.
For f1, more information can be gathered: partitioning by x2 immediately reveals
a distinction in expected values, and a further partition reveals x4 independently of
analyzing f2. With the knowledge of x2, x3 could then be determined from f2. While
x1 is not observable in the output, it could be determined through a separate pruning
pass.
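The partitioning argument can be reproduced by brute-force enumeration. In the sketch
below, `f1` and `f2` are hypothetical Boolean functions chosen only because they
reproduce the expected values listed in Table 3.2; they are not taken from
Figure 3.6(b), and the helper names are likewise illustrative.

```python
from itertools import product

def expected_value(f, names, fixed):
    """Mean of f over every assignment of the inputs not listed in `fixed`."""
    free = [n for n in names if n not in fixed]
    total = 0
    for bits in product((0, 1), repeat=len(free)):
        total += f(**fixed, **dict(zip(free, bits)))
    return total / 2 ** len(free)

def mismatches(f, names, fixed=None):
    """Inputs whose two partitions (fixed at 0 / fixed at 1) differ in expectation."""
    fixed = fixed or {}
    found = []
    for n in names:
        if n in fixed:
            continue
        e0 = expected_value(f, names, {**fixed, n: 0})
        e1 = expected_value(f, names, {**fixed, n: 1})
        if e0 != e1:
            found.append((n, e0, e1))
    return found

# Assumed stand-ins that match Table 3.2 (not the actual pruned logic).
def f1(x2, x4):
    return x2 | (1 - x4)

def f2(x2, x3, x4):
    return (1 - x4) | (1 - (x2 ^ x3))

print(mismatches(f2, ["x2", "x3", "x4"]))            # only x4 separates f2
print(expected_value(f1, ["x2", "x4"], {"x2": 0}))   # 0.5
print(expected_value(f1, ["x2", "x4"], {"x2": 1}))   # 1.0
```

For these stand-in functions, partitioning `f2` by x2 or x3 gives 0.75 on both sides,
while x4 gives 1.0 versus 0.5; once x2 and x4 are fixed, the two partitions on x3
evaluate to 1.0 and 0.0, matching the last row of the table.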
Our testing method has two downsides. First, an output that depends on n distinct
sensors, all of which can be determined, requires up to
2^(n−1) + 2^(n−2) + ... + 1 = 2^n − 1 different keys for a single
primary input vector to determine the sensor values. Second, due to the preprocessor and
the dynamic nature of the test procedure, the test vectors applied to the chip must
be partially constructed during the test phase. However, this exponential growth is
localized, and can be compensated for by simply ignoring any outputs that exceed
a predetermined input threshold. The test set can be further reduced by selecting
optimal test sets that maximize the number of sensor values revealed per test vector.
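The key count quoted above is a geometric series, since the enumeration set halves
each time a sensor is resolved. As a worked instance (n = 3 is chosen only for
illustration):

```latex
\sum_{j=0}^{n-1} 2^{j} = 2^{n} - 1,
\qquad
n = 3:\quad 2^{2} + 2^{1} + 2^{0} = 4 + 2 + 1 = 7 = 2^{3} - 1 .
```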

3.2.4 Key Preprocessor

The trivial relationship between the variation sensors and the key is a problematic vulnerability. With circuit-level knowledge and an understanding of the original,
intended functionality, an attacker could use the testing process described in Section 3.2.3 just as easily as the IP rights owner. Even without circuit-level knowledge
but a functional understanding, an attacker could perform annealing on the key bits
with respect to the output error. However, if one can control the application of keys,
such a vulnerability can be overcome.
The internal keys that unlock a chip’s functionality are little more than messages
that require delivery to a location inside the chip. When an external unlocking key is
applied to the system, the unlocking information traverses a key preprocessor prior
to application to the circuit. This preprocessor can be implemented in one of two ways:
through decryption or through authentication of the unlocking key.
Key decryption protects key application by encrypting the internal key off-chip,
and supplying the encrypted information as the unlocking key. The preprocessor
would then decrypt the unlocking key, and apply the decrypted internal key to the
logic.
Selection of a decryption preprocessor depends heavily on what the designer can
afford. Asymmetric implementations occupy far more area than symmetric
implementations (e.g. 10k gates for ECC versus 3.4k gates for AES [48]). However,
because the implementation is fixed in silicon, the preprocessor’s decryption key will
be hardcoded. Under a symmetric implementation, knowledge of this decryption key implies
immediate knowledge of the encryption key, and consequently, the transform from
internal to unlocking key. With an asymmetric implementation, knowledge of the
decryption key conveys no information about the inverse transformation required to
construct the unlocking key.
Key authentication involves supplying a signature alongside the internal key. The
circuitry could forbid the application of a key without a proper signature, or it could
feed the results of the signature check directly into the logic, inserting primitive gate
elements to block or modify logic values as they traverse the logic. If the primitive
gates are placed in certain paths, they could prevent the exposure of the internal key
by not allowing the sensors’ values to propagate to the outputs, and therefore make
the internal key undiscoverable. Alternatively, different signature parameters could
allow for a higher density of logic modifications by segmenting pruned circuits during
the testing phase.

[Figure 3.7 plot: Node Count versus Key Length for the Max Impact, Heavy Grp, and
Converse Pwr heuristics.]

Fig. 3.7. Area overhead for various heuristics on the benchmark circuit
s5378 with a 20% area limit for logic modification.

3.3 Experimental Results

The proposed methodology was evaluated by implementing the CLIP design flow
described in this work within the ABC [49] logic synthesis framework. Given a gate-level
circuit, the identification and insertion of injection and correction nodes are
automatically performed. Next, the test vectors used for key recovery are automatically
generated. We applied the methodology to several ISCAS’89 and ITC’99 benchmark
circuits. Three different node selection heuristics were used in our analysis: maximal
functional impact, heavy grouping, and converse power. Maximal impact chose
nodes by greedily maximizing the Hamming distance between the output when
an incorrect key is applied and the correct output. Heavy grouping concentrated on
choosing nodes such that the effects of sensors have greater inter-dependencies (and
are harder to distinguish). The converse power heuristic chose nodes such that the
difference in switching activity of an incorrect key vs. a correct key is minimal or
possibly negative.

Table 3.3
Area overhead for specific key strengths.

                        Node Usage                    with sensor estimate
Circuit   Nodes   16 bits   64 bits   256 bits   16 bits   64 bits   256 bits
x3          634       687       811        N/A       743      1035        N/A
dalu       1103      1188      1448       2054      1244      1672       2950
C5315      1310      1383      1606       2252      1439      1830       3140
des        3571      3628      3792       4320      3684      4016       5216
s38417     7892      7995      8216       9007      8051      8440       9903

There are four distinct areas of interest with respect to the impact our methodology
has on the circuit: functional impact (how much the outputs differ from the
correct values when an incorrect key is applied), area, delay, and power. Since our
heuristics were designed to avoid affecting the critical paths of the circuit, the delay
of the resultant circuitry was unchanged in all our experiments (see footnote 3). We
present an evaluation of the other metrics.

Footnote 3: For some circuits, it is possible that the critical path(s) may have to be
impacted to insert a sufficient number of injection/correction nodes and achieve a
sufficiently large key size. In such cases, the designer needs to make the tradeoff
between delay and security.

Our results are divided into two parts - area results are presented for several
benchmarks, while extensive analysis is performed for one of the benchmarks in order
to present deeper insights. To simplify the extensive analysis, the s5378 benchmark
circuit was selected for demonstration purposes due to its manageable size (988 gates),
and an area overhead limit of 20% was used when inserting injection and correction
nodes.

3.3.1 Area Impact

Table 3.3 presents the area overheads of CLIP for a set of benchmark circuits
to achieve various key lengths. For the results presented in this table, the maximal
impact heuristic was used for selection of injection and correction nodes, along with a
strict constraint that the delay of the circuit should not increase (i.e., nodes selected
for injection and correction should have sufficient slack). We present area overheads
both without and with the PV sensors, calculated based on the assumption that
each sensor is equivalent to 3.5 logic gates (see Figure 3.2(e)). In the case of the x3
benchmark, our tool was unable to realize a 256-bit key since there was not enough
delay slack to support sufficient injection and correction nodes.
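The "with sensor estimate" columns of Table 3.3 can be reproduced from the node usage
and the 3.5 gate-equivalents per sensor stated above. The sketch below does this for
the dalu row; it assumes one PV sensor per key bit, which is inferred from the table
arithmetic, and the function names are illustrative.

```python
SENSOR_GATE_EQUIV = 3.5  # gate-equivalents per PV sensor, as assumed in the text

def area_with_sensors(node_usage, key_bits):
    """Node usage plus sensor area, assuming one sensor per key bit."""
    return node_usage + SENSOR_GATE_EQUIV * key_bits

def overhead_pct(total, baseline):
    """Area overhead relative to the unmodified circuit, in percent."""
    return 100.0 * (total - baseline) / baseline

baseline = 1103  # dalu, original node count from Table 3.3
totals = {}
for bits, usage in [(16, 1188), (64, 1448), (256, 2054)]:
    totals[bits] = area_with_sensors(usage, bits)
    print(bits, totals[bits], round(overhead_pct(totals[bits], baseline), 1))
```

The computed totals (1244, 1672, and 2950) agree with the corresponding table entries.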
There are three parts to the area overhead incurred by the proposed method: logic
modification, PV sensors, and the key preprocessor, of which we have quantified the
first two in Table 3.3. Since the preprocessor that a designer would use is
independent of our methodology, and we anticipate that it would be shared across
several blocks in an IC, we did not include it in our overhead calculations. However,
for completeness we note that compact implementations of symmetric and asymmetric
encryption algorithms have been proposed (e.g., 10k gates for ECC and 3.4k gates
for AES [48]), which should not present a major overhead in medium to large ICs.
If the IC already includes a hardware or software based encryption block as part of
its functionality, we could simply re-use it as the preprocessor (at power-up)
without incurring any overhead.
Figure 3.7 shows the impact of the different selection heuristics on area overheads for the s5378 benchmark circuit, and suggests that all of the heuristics lead to
reasonable increases in circuit area.

3.3.2 Functional Impact

The functional impact of our methodology can be observed in Figure 3.8. With
only a small number of incorrect key bits, the output is incorrect
at all times. The average Hamming distance can be seen in Figure 3.9, as well as
a wrapped Hamming distance, where nodes with greater than 50% distance count as
100% − distance. Since the maximal impact heuristic did not achieve a higher impact
than other heuristics, such a greedy selection algorithm is unsuitable.
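The wrapped measure can be made concrete with a small sketch. It assumes the distance
is taken per output node as the percentage of test vectors on which that node differs
from the correct response; this reading, the function names, and the sample vectors
are assumptions made for illustration.

```python
def per_node_distance(observed, expected):
    """Percentage of test vectors on which each output node is wrong."""
    n_vectors = len(expected)
    n_nodes = len(expected[0])
    return [
        100.0 * sum(obs[j] != exp[j] for obs, exp in zip(observed, expected)) / n_vectors
        for j in range(n_nodes)
    ]

def wrapped(distances):
    """Nodes above 50% count as 100% - distance."""
    return [100.0 - d if d > 50.0 else d for d in distances]

def average(values):
    return sum(values) / len(values)

# Made-up responses: 4 test vectors, 3 output nodes.
expected = [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
observed = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 0]]

dist = per_node_distance(observed, expected)
print(dist, average(dist))                    # [100.0, 25.0, 25.0] 50.0
print(wrapped(dist), average(wrapped(dist)))
```

A node that is always inverted has a distance of 100% but a wrapped distance of 0%,
reflecting that an attacker could correct it with a single inverter.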

3.3.3 Power Impact

The impact of CLIP on power consumption (quantified by the additional switching
activity in the CLIP enhanced circuit as compared to the original circuit) is presented
in Figure 3.10 for the s5378 benchmark. In general, the switching overhead rises at a
rate on par with the additional logic required. As discussed in the next section, the
additional power consumed varies with the number of bits that are incorrect in the
internal key, leading to the possibility of differential power analysis attacks unless a
preprocessor is used.

3.3.4 Key Recovery

To expose the sensor values, different vector combinations were applied in an
attempt to maximize the number of bits recovered for each test vector applied. It
was observed that with the s5378 benchmark, only 6-8 test vectors were necessary to
recover all 45 sensor values. Each test vector used for sensor extraction was obtained
through a guided simulation-based procedure, choosing test vectors that propagate
the largest number of PV sensor values to the circuit outputs at a time.

[Figure 3.8 plot: Output Error (%) versus Key Error for the Max Impact, Heavy Grp, and
Converse Pwr heuristics.]

Fig. 3.8. Functional impact. Only a small number of wrong key bits
ensures the circuitry does not give proper results.

3.4 Security Analysis

In order to understand the benefits of the CLIP framework, we analyze it against
several different styles of attacks to determine its effectiveness. A number of attack
models were considered, including external attacks such as brute force, man in
the middle, differential power analysis, and differential signal analysis, and internal
attacks such as mask modification based on varying levels of knowledge.

[Figure 3.9 plot: Hamming Distance (%) versus Key Error for the Max Impact, Heavy Grp,
and Converse Pwr heuristics.]

Fig. 3.9. Output Hamming distance based on key error. Dashed
results represent a wrapped Hamming distance measure.

3.4.1 External Attacks

External attacks view the chip as a black box, i.e., the attacker has no knowledge
of design internals, and the cost of de-packaging the chip and reverse engineering the
design are too high. The simplest external attack is the brute force attack - if an
attacker only has external access to a locked chip (and knows the expected output
for a given input), they may attempt to discover the unlocking key by enumerating
all possible values until they find a combination that causes the locked chip to reach
an unlocked state (i.e., behave like an unlocked chip). Due to the low overhead
associated with increasing the key length, a designer is likely to choose key lengths
that exceed reasonable levels of computing power. Therefore, brute force attacks are
infeasible and an attacker must seek out more intelligent approaches.
Differential signal analysis attacks apply multiple inputs and keys
that differ in a controlled manner (e.g., at specific bit positions), and analyze the
differences between the chip responses to them. They have been used successfully
to reduce the complexity of breaking cryptographic algorithms [50]. In our context,
differential analysis is a concern if there are simple interdependencies between the
unlocking key bits and the IC outputs (e.g. if one or a few of the bits of the unlocking
key affect only one or a few output bits). Such attacks are precisely the reason for our
use of a cryptographic preprocessor to de-couple the unlocking key from the internal
key. We illustrate this by performing experiments on several benchmarks where we
removed the preprocessor, and attempted to apply simulated annealing to search
for the key value that minimizes the error in the output bits (difference between
the incorrect outputs and the expected correct output values). This approach was
successful in unlocking the protection in several benchmarks when no preprocessor
was used. Figure 3.11(a) shows an example of multiple simulated annealing attempts
on the s5378 benchmark. However, the addition of a preprocessor makes it extremely
difficult to perform such attacks. Figure 3.11(b) demonstrates annealing attempts on
the same benchmark, using a very simple preprocessor based on a random network
of XOR gates in a pseudo-decryption configuration. While we recommend the use of
preprocessors based on cryptographic algorithms in practice, our experiment shows
that even a trivial preprocessor is effective in defending against differential analysis
attacks.

[Figure 3.10 plots: Switching Overhead (%) versus Key Length in panels (a), (b), and
(c); legend entries K0 and Kµ ± σ.]

Fig. 3.10. Switching overhead for increasing key length. (a) Maximal
impact. (b) Heavy grouping. (c) Converse power.

[Figure 3.11 plots: Key error versus Test vectors in panels (a) and (b).]

Fig. 3.11. Annealing attacks on s5378. (a) No preprocessor. (b) XOR-based
preprocessor. The preprocessor used approximately 130 XOR
gates (25% additional area) to dissociate the unlocking and internal
keys.

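One way to picture a random XOR network in a pseudo-decryption configuration is as an
invertible linear map over GF(2): the on-chip network turns the unlocking key into the
internal key, and the inverse map is applied off-chip. The sketch below is an
assumption-laden illustration, not the network used in the experiment; the 16-bit
width, the number of row operations, the seed, and all function names are hypothetical.
Being purely linear, such a network is no substitute for a cryptographic preprocessor.

```python
import random

def random_xor_network(n_bits, n_ops, rng):
    """Build an invertible GF(2) matrix as a list of row bitmasks.

    Starting from the identity and XOR-ing one row into another keeps the
    matrix invertible, so each internal key has exactly one unlocking key.
    """
    rows = [1 << i for i in range(n_bits)]
    for _ in range(n_ops):
        i, j = rng.sample(range(n_bits), 2)
        rows[i] ^= rows[j]
    return rows

def apply_network(rows, vector):
    """Output bit i is the XOR (parity) of the input bits selected by rows[i]."""
    out = 0
    for i, mask in enumerate(rows):
        if bin(mask & vector).count("1") % 2:
            out |= 1 << i
    return out

def invert_network(rows):
    """Gauss-Jordan elimination over GF(2) on the augmented matrix [A | I]."""
    n = len(rows)
    a = list(rows)
    inv = [1 << i for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if (a[r] >> col) & 1)
        a[col], a[pivot] = a[pivot], a[col]
        inv[col], inv[pivot] = inv[pivot], inv[col]
        for r in range(n):
            if r != col and (a[r] >> col) & 1:
                a[r] ^= a[col]
                inv[r] ^= inv[col]
    return inv

rng = random.Random(2013)                  # fixed seed, illustration only
on_chip = random_xor_network(16, 64, rng)  # hypothetical on-chip preprocessor
off_chip = invert_network(on_chip)         # inverse map kept by the IP owner

internal_key = 0b1011001110001111
unlocking_key = apply_network(off_chip, internal_key)
recovered = apply_network(on_chip, unlocking_key)
print(recovered == internal_key)
```

Because each internal key bit depends on many unlocking key bits, flipping a single
unlocking key bit changes several internal key bits at once, which is what disrupts
the bit-by-bit search used by the annealing attack.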
Man in the middle and replay attacks rely on intercepting the stream of information
passed to the chip in the hope of either modifying the bitstream or reusing
the information transmitted. We consider man in the middle attacks launched by
observing the operation of an unlocked chip, as well as attacks launched by observing
the key recovery process. If the unlocking key is stored off-chip and transferred to
the chip upon power-up, the unlocking key could be intercepted, but this would be of
no value since the key is valid only for the given chip.
Man in the middle attacks could also be applied by an attacker who has observed
the test vectors applied to a chip during the key recovery process. The same test
vectors could be applied to any chip, and will result in the propagation of many of
the PV sensor values to the circuit outputs. However, without a knowledge of the
design (gate-level netlist and exact location of injection and correction nodes), this
information is of little use. Furthermore, even if the attacker is able to infer the values
of PV sensors from the circuit responses, the attacker must apply the inverse of the
preprocessor function in order to convert these values into an unlocking key, which is
equivalent to breaking the cryptographic algorithm used in the preprocessor.
Differential power analysis (DPA) [51] analyzes a circuit’s power trace in order
to expose information about the key. In the context of CLIP, DPA could result in
the exposure of the preprocessor’s decryption key. DPA-resistant design techniques
have been extensively researched in recent years, and many implementations have
been proposed [52, 53] that could be applied to the implementation of the preprocessor to address such an attack. In addition, the injection and correction nodes in
the circuit are also vulnerable to power analysis; unbiased node selection reveals a
linear relationship between the average power usage and the number of correct bits
(Figure 3.12). However, we note that exploiting this correlation will require that the
attacker apply internal keys with specific characteristics, which is prevented by the
use of the cryptographic preprocessor.

3.4.2 Mask Knowledge and Internal Attacks

External attacks all have one major weakness: limited reusability. There is low
value in discovering the unlocking key for only one chip; each chip has distinct
intra-die variations that cause unique sensor measurements, and consequently, a unique
unlocking key. Even a successful external attack would require that the attacker incur
significant per-chip costs.
Internal attacks require a higher up-front cost (such as chip reverse-engineering
and mask extraction), or the collusion of an entity that has a knowledge of the mask
(the foundry, or one of its employees). However, they are attractive from an attacker’s
perspective since they could lead to elimination of per-chip costs to unlock chips.
We consider microprobing attacks, mask analysis and modification attacks, and
combinations thereof, and discuss the difficulty of launching each on an IC protected
by CLIP.
Microprobing refers to de-packaging of a chip followed by probing of internal
signals (e.g., using e-beam microscopy) while the chip is operational. Advanced microprobing attacks may also make small modifications to the circuit (e.g., by cutting
or shorting wires). First, we note that there is no constant secret value that can be
revealed by probing any chip protected by CLIP, since all secrets differ from chip to
chip. This implies that probing a chip can only lead to unlocking that single chip
itself, which is not a viable proposition given the cost of the equipment involved.
Furthermore, any process used to read out the values of the PV sensors should be
sensitive enough that it does not disturb the relationship between the adjacent transistors, and this process must be repeated at several spots in the chip since the sensors
are dispersed throughout the logic.
Mask analysis and modification attacks involve studying the mask to extract information that can be used in an external or internal attack, or modifying the mask to
disable or significantly weaken the protection scheme. For unsophisticated attackers,
the small size of the PV sensors coupled with their spread throughout the logic makes
mask analysis difficult. If the locations of the PV sensors are pin-pointed by sophisticated mask analysis, the attacker still needs to undo all changes made to the circuit
during the CLIP design process - merely removing the sensors and hardwiring their
outputs to fixed values will not help, since the correction nodes will now effectively
ensure that the circuit is still not operational. Discovering the location of correction
nodes is difficult since there is no fixed pattern (gate type) that the attacker can
look for, and they can be placed at arbitrary distance from the PV sensors. Even
if the correction nodes are located, the relationship between the PV sensor outputs
and internal key (determined during the synthesis process) is not trivial, and must
be discovered.
Targeting the preprocessor instead of the sensors is of little utility. Unless
the designer selects a symmetric encryption algorithm for the preprocessor, mask
analysis of the preprocessor to expose the decryption key does not help - revealing the
decryption key in an asymmetric algorithm does not provide adequate information for
the inverse transformation required to construct the unlocking key, and as such, the
system is still secure. If the decryption key is recovered for a symmetric algorithm,
this key may be as useful to an attacker as bypassing the preprocessor altogether,
but regardless of recovering a symmetric key or remanufacturing a chip to bypass
the preprocessor, each manufactured chip still requires extensive post-manufacturing
characterization for constructing an unlocking key since the sensors themselves were
not removed.
It must be noted that the objective of most security schemes in practice is to raise
the difficulty and cost of launching an attack to a point where the attack is no longer
attractive, while keeping the cost of adopting the security scheme minimal. We now
describe a couple of sophisticated attacks that can be used to break CLIP, which we
believe are of sufficiently high cost and difficulty.
In order to have complete reusability for an attack, an attacker could purchase an
unlocked chip (thereby legitimately acquiring an unlocking key for it), successfully
read out the PV sensor values, extract the mask, and re-fabricate the chip with PV
sensor outputs replaced by the fixed values, and utilize the same unlocking key for all
chips. Such an attack requires that the aforementioned issue of non-destructive reads
of PV sensors be overcome. It is very difficult to protect against such an elaborate
attack.

Another possible attack is to analyze the mask to locate the PV sensors, replace
them with fixed values, and feed a chip through the standard key discovery process.
We note that such an attack requires the unknowing participation of the owner of the
IP rights (since the responses of the chip during the key discovery process need to be
converted to an unlocking key), and can be prevented by ensuring that the key discovery
and test processes are secured using techniques similar to [5, 42]. At the very least, a record
of unlocking keys that have been issued can be used to help establish liability a
posteriori.
In summary, we believe that CLIP substantially raises the cost and difficulty of
IC piracy. In addition, as discussed in Section II, CLIP offers significant advantages
over other active protection schemes due to several factors such as preserving the
uniqueness of the applied keys at all levels, the distributed nature of PV injection
and correction, and the use of a cryptographic preprocessor to decouple the internal
and external unlocking keys.

3.5

Summary
In this chapter, a novel and scalable technique to prevent the piracy of microchips
was proposed. Drawing from the inspiration that process variations are nonduplicable,
the measurement of process variations provided the means to lock down a microchip:
the presence of process variations could be used to create a unique lock from the
manufacturing process.
Under the proposed technique, a circuit designer can embed a locking mechanism
consisting of these variation sensors and an externally-applied key into any random
logic block. Unlike previous techniques, there is no need for the key and process
variation values to be authenticated outside of the circuit: using unate regions and
logic duplication, the logic itself can serve as the authentication mechanism for the
supplied key. While the added logic does increase both the area and power usage,
any increase in circuit latency can be prevented by only selecting paths in the circuit
containing adequate slack.

Fig. 3.12. Switching overhead based on key error. (Plot of switching overhead (%) versus key error for the Max Impact, Heavy Grp, and Converse Pwr series.)

The analysis of our proposed technique suggests that such an implementation can
be made reasonably resistant to a number of attack scenarios. Probing the sensors
themselves has been made sufficiently complex by the embedding of accessible wires
at the circuit level, while micrography is likely to disrupt the sensor readings before it
can probe them. In fact, the only access to the sensor values is through the logic itself,
an avenue that could be protected via other means such as a key preprocessor. While
the power profile of the circuit subject to the number of correct key bits suggests that
side channel attacks are feasible, proper selection of key/variation sensor locations
is even capable of creating a flat relationship between key error and power usage.


4. VOLATILE PROTECTION OF NONVOLATILE
CACHES
Computer security is often thought of as a software problem in terms of virus protection and cryptography, but the underlying hardware can be just as important. While
a computer’s operating system (OS) may ensure a software-based layer to segregate
data between programs running on top of the OS, the underlying hardware used to
store the data is often not fully protected against malicious access.
The temporary memory storage elements that make up a computer comprise
a hierarchy of successively larger random access memory (RAM) arrays that offer a
tradeoff between performance and capacity (Fig. 4.1). At the base of the memory
hierarchy is a large memory array that contains a direct mapping between a physical
memory address and a given set of memory cells. This array is typically implemented
using dynamic random access memory (DRAM) cells.
Comprising only one transistor and one capacitor, these DRAM cells can
offer a very high storage density. Data is represented by the amount of charge contained in a given capacitor, but as the charge stored will slowly dissipate over time,
DRAM cells require a periodic refresh mechanism that reads and re-writes the
cells’ contents.
To save costs, these DRAM cells are typically implemented in a large, off-chip
array of memory referred to as external memory (or external RAM). The deep-well
capacitors typically used in a DRAM cell require additional, expensive lithography
steps for construction, and their inclusion in a processor’s mask would cause lower
manufacturing yields as the probability of defects would increase in relation to both
the increased die area and the added manufacturing steps.
Fig. 4.1. Typical processor memory layout. Each processing core has
dedicated instruction and data caches (I$, D$), with a larger 2nd
level cache for both (L2$). Multiple cores commonly share an even
larger, 3rd level cache (L3$) before they go off-die to access external
memory.

Due to both the external memory’s large size and physical separation from the
processor, data transfer between the two is typically quite slow; a single memory
access initiated by the processor could take over 50 ns, an eternity to processors that
operate in excess of 2.4 GHz. Processors achieve reduced access delay by implementing
a hierarchy of memory caches designed to exploit the temporal and spatial locality of
memory access patterns, in which a given memory access signals that both the
given memory address and its surrounding memory addresses are likely to be
used soon.
These caches store only a subset of the external memory’s contents by allowing multiple memory addresses to map to a set of memory cells, with tags and
associativity (multiple storage locations per address) to minimize the collisions that
can occur when multiple memory addresses map to a single location. The hierarchy
of caches increases in size with the distance from the processor to allow for tradeoffs
between capacity and speed - from dedicated L1 (level one) caches for instruction
storage and data storage, to a shared L2 cache for both instructions and data, to a
much larger L3 cache that is sometimes shared between multiple execution
cores.
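
As a rough illustration of how multiple memory addresses map to one set of cells, the sketch below splits an address into a tag, a set index, and a line offset for a set-associative cache. The cache size, line size, and associativity are illustrative assumptions, not parameters of any processor discussed in this chapter, and the function name is hypothetical.

```python
def split_address(addr, cache_bytes=32 * 1024, line_bytes=64, ways=8):
    """Return (tag, set_index, offset) for a set-associative cache.

    All sizes are hypothetical defaults chosen only for illustration.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    offset = addr % line_bytes                    # byte within the cache line
    set_index = (addr // line_bytes) % num_sets   # which set the line maps to
    tag = addr // (line_bytes * num_sets)         # distinguishes lines sharing a set
    return tag, set_index, offset


# Two addresses that differ by line_bytes * num_sets (64 * 64 bytes here)
# land in the same set and are told apart only by their tags.
a = 0x12345
b = a + 64 * 64
print(split_address(a))
print(split_address(b))
```

With eight ways, up to eight such colliding lines can reside in the cache at once before one must be evicted.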
On-chip caches are frequently implemented using static random access memory
(SRAM) cells. SRAM cells are built out of six transistors: four transistors are used as
cross-coupled inverters that store data via voltage reinforcement, while the remaining
two act as gates to control the access to the cell. These SRAM cells may be far less
dense than DRAM, and may consume more power since they are always powered, but
they are compatible with a typical CMOS process, are faster, and can retain their
values without a periodic refresh.
Even though both SRAM and DRAM are classified as volatile memory because
they require power to retain their state, they both suffer from remanence: both types
of cells will retain some data even when they fall outside their retention conditions.
SRAM cells that have been power cycled (power has been removed and then returned)
could use the latent charge left in the cell to re-bias the cross-coupled inverters back
to their prior levels, while in DRAM cells, the refresh timings used are designed to
ensure retention under the worst-case conditions, so most cells hold their charge well beyond the refresh interval.
A class of attacks known as cold boot attacks [8] exploits the data remanence of
these cells. Methods such as a platform reset into a specially-constructed OS, or even
reducing the DRAM temperature to slow data degradation and allow for removal and
transport of the memory module, have proven effective at recovering passwords and
cryptography keys from memory.
These attacks have so far been demonstrated only against external memory, and
as of yet, solutions have only targeted the remanence as it appears in DRAM [35, 54,
55]. However, it is our belief that these attacks on data remanence are as feasible
in on-die caches, both because the SRAM used in these caches suffers from data
remanence [56], and because of the access mechanisms available to the OS [57, 58]
and to any components connected via a bus interface [59, 60].
This same class of attacks can be made even more effective if either SRAM or
DRAM is simply replaced by a nonvolatile (retains its state even without the application of power) storage structure like Spin-Torque-Transfer Magnetic RAM (STT-MRAM) [61], as such structures are constructed specifically to resist both time- and temperature-based degradation of memory. A cold boot attack would have no limits on the time
needed between reboots, while external memory built out of nonvolatile structures
could be extracted but kept at room temperature almost indefinitely before being
read. This resistance to temporal and thermal degradation would also open up
the possibility for invasive attacks that remove the chip’s coating and use probes to
read out the internal memory.
In this chapter, we present a protection mechanism to fight against data remanency attacks as they may affect either the processor’s L3 cache or external memory.
Through the use of low-latency encryption algorithms, volatile storage devices, and
hardware random number generators, we ensure data confidentiality to both the L3
cache and external memory without compromising system performance.

4.1

Introduction
Computing devices have become the modern-day vault. Secrets are no longer kept
locked away behind a steel door with a security guard on duty; they are stored inside
a computer. They are now indirectly accessible from anywhere, but are protected
behind layers of firewalls, antivirus software, and encryption.
These software-based protection mechanisms have little effectiveness against anyone with physical access to a computer and the ability to implement attacks on the
underlying hardware. With physical access, many more avenues are available to recover data without reliance on even a single software vulnerability. Consider
the following five attack scenarios:
1. A malicious PCMCIA (Personal Computer Memory Card International Association) card is plugged into a laptop. Without any OS intervention, the card uses
its DMA (direct memory access) abilities to dump the contents of the computer
system’s memory.
2. A computer is turned off, but only for long enough to insert a malicious device into a PCI (Peripheral Component Interconnect) slot on the motherboard.
The power is restored, the PCI device triggers system interrupts that halt the
processor’s operation, and then it extracts the contents of external memory.
3. After a system reboot, a small, customized OS is loaded. Aside from the external
memory required to load the new OS, the remaining external memory remains
untouched as the customized OS extracts the data present.
4. A CO2 air can is inverted and directed at a memory module, cooling it significantly to slow data degradation. This module is then removed and placed into
another PC or memory module reader to extract its contents.
5. A novel processor with a nonvolatile lowest-level cache is disconnected from
power. It is then decapped (the chip’s packaging is removed) and analyzed
using microscopy to recover the data left in the cache.
All of these attack scenarios are capable of subverting data confidentiality without
any vulnerabilities in the OS itself. Scenarios 1 and 2 are specific instances of DMA-based attacks; almost any device that can be connected to the system bus, be it through
a PCI interface [62], a PCMCIA card [63], a Firewire interface [64], or even the new
Thunderbolt interface [65], can perform memory transactions transparently (without
the end user finding out). If these bus-connected devices want to prevent the OS
from altering the memory state, they can additionally trigger processor interrupts
that would halt the OS’ operation.
Most of these attack scenarios exploit data remanency - the concept
that some data may persist well after it is considered unreadable. Both SRAM and
DRAM suffer from data remanency [8, 56, 66, 67], and as hinted by Scenario 4 and
shown in Table 4.1, data degradation slows down as the cells are cooled. Even partial
data retention can be a cryptographic break: it would correspond with a massive
reduction in the key search space.
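
To give a feel for the size of this reduction, the following sketch counts how many candidate keys remain when an attacker recovers a decayed image of a key. It is a simplified model under stated assumptions: the attacker knows only an upper bound on the number of decayed bits, not their positions, and the 128-bit key length and error rates are illustrative. The function name is hypothetical.

```python
from math import comb, log2


def candidate_keys(key_bits, max_flipped):
    """Count keys within Hamming distance max_flipped of the recovered image."""
    return sum(comb(key_bits, k) for k in range(max_flipped + 1))


key_bits = 128  # assumed key length, for illustration only
for error_rate in (0.05, 0.10, 0.20):
    flips = int(key_bits * error_rate)
    bits = log2(candidate_keys(key_bits, flips))
    print(f"{error_rate:.0%} decayed bits: about 2^{bits:.0f} candidates vs 2^{key_bits}")
```

Even this pessimistic count is far smaller than the full key space; any additional structure in how the bits decay would shrink it further.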

Table 4.1
Observed duration ranges of 80% data preservation after removal of power

Temperature   SRAM, Vdd = 0 [56]   SRAM, Vdd = floating [56]   DRAM [8]
24 C          6 ms - 1 s           11 ms - 2 min               1 s - 20 s
-50 C         5 s - 1 day          10 s - 5 days               >5 min

Fig. 4.2. Power usage in Intel’s Ivy Bridge processors, with independent tracking for standby and dynamic power usage. Even with the
transition to FinFET designs, more than 1/3 of a design’s power usage
comes from idle components. Source: http://blog.stuffedcow.net [68]
(Plot of power usage (W) versus supply voltage (V), showing standby power and dynamic power at 1.6 GHz and 2.4 GHz.)

These attacks on data remanence are referred to as cold boot attacks. They have
varying styles, including platform reset attacks where a modified OS is loaded into
memory, and cryogenic attacks where DRAM is cooled, removed, and installed in
some memory reader. Invasive attacks on the chips themselves are another avenue
of attack, which, while they may presently appear innocuous, may soon prove quite
dangerous in light of the future direction of semiconductors.
The increased capacity of on-chip caches has provided for accelerated memory
accesses, but at the cost of power consumption. Unlike logic blocks that can be fully
powered down when inactive, SRAM must ensure adequate voltage to preserve its
contents. The cumulative power necessary to preserve the data has reached worrisome
levels (Figure 4.2), prompting a heavy interest in developing alternative memory
structures that can reduce or eliminate this preservation power.
A likely contender for cache replacement is an emerging nonvolatile technology
known as STT-MRAM [61]. Its structure is more favorable for integration into a
CMOS process than DRAM [69], it can offer near-SRAM performance, and it can
achieve the holy grail of memory power consumption: it is capable of retaining its
state without leakage power or periodic refreshes. Other upcoming technologies like
Redox [70] share a similar design: a dense one-transistor, one-resistor nonvolatile
structure.
As these nonvolatile structures gain increased resistance against temporal and
thermal degradation, they are vulnerable both to cold boot attacks and to
invasive attacks that use technology such as micro-probes. In a noninvasive scenario,
a simple state machine could certainly clear the data contents upon a reset, but any
invasive attacks could bypass such a hardware mechanism and read the values directly
from the cells themselves.
Protecting against all of these attacks on data remanence requires an active protection mechanism. Some method needs to be in place to ensure data confidentiality
is preserved the moment there is a separation between the OS and its data.
In this chapter, we present a design intended to re-introduce the concept of volatility to the lowest-level, L3 cache, as well as to external RAM. By utilizing a per-boot
root of trust derived from a hardware random number generator, volatile storage via
reset-enabled registers, and tiered, low-latency ciphers, we ensure the confidentiality
of information stored in the L3 cache and in external memory without compromising
system performance.
The merits of our approach are as follows:
· Minimal-latency protection. The effective burden of decryption is concealed by
operating fast block ciphers in counter mode, in parallel with memory accesses.
Some write operations may incur increased latency due to encryption,
but such latencies could be hidden behind larger write buffers.
· Key segmentation. Fast ciphers have minimal complexity, which may lead to
a future cryptographic break. As cryptographic breaks traditionally require a large
quantity of plaintext/ciphertext pairings, we limit the number of such pairings by
using a different key for each page (4 KB block) of memory.
· Cold boot protection. Unlike SRAM and DRAM arrays, register cells can provide
for a simple, guaranteed-volatile storage mechanism. Proper usage of registers can
ensure that the keys responsible for data protection are lost upon a reset,
a power cycle, or even the detection of some unsupported change to the system
configuration.
In Section 4.2 we present the current state of security with respect to memory
protection measures. Section 4.3 demonstrates our approach to securing against cold
boot attacks, including the motivation behind each element. Section 4.4 traces how
our design handles differing data remanence-based attacks, while Section 4.5 chronicles the expected effect on performance. The expandability of our design to cover
more styles of attacks than cold boot attacks is discussed in Section 4.6. Finally,
Section 4.7 concludes this chapter.

4.2

Background
Cold boot attacks have proven effective at recovering keys to disk encryption
schemes, login passwords, and even asymmetric encryption keys from DRAM due to
data remanence [8], but the ability to extract data fades greatly with both time and
temperature [71].
However, obtaining system memory does not necessarily require a cold boot attack or even OS intervention. Most devices attached to a computer’s bus have direct
memory access (DMA) capabilities granting them full access to the entirety of the
external memory in an OS-unaware manner. The ports on the outside of a computer can allow DMA transactions; hot-pluggable interfaces such as PC cards [63],
Firewire [64], and even the new Thunderbolt standard [65] allow for DMA transactions that the OS might never find out about. Even integrated peripherals represent
a DMA attack route - as long as the devices could be improperly programmed, they
could be used to extract memory to a non-privileged entity [7].
Even the cache can be vulnerable to these DMA attacks. While frequently viewed
as a tertiary storage cache for only the processor, the lowest-level cache is sometimes
considered as a buffer for even DMA-based memory requests. As the lowest-level
caches have grown in capacity, the concept of requiring that data from high-speed
I/O devices be sent to RAM before the processor can use it is wasteful both in
terms of performance and power. Instead of simply ensuring cache coherency through
snooping, some designs have expanded DMA access to allow selective writing into the
cache via Direct Cache Access [59] or even always-active measures like Data Direct
I/O [60].
Thankfully, the malicious bus-based access to the system memory can already be
countered, but it involves proper interactions between the operating system and the
underlying hardware. Through the proper use of an I/O Memory Management Unit,
a feature designed for hardware virtualization, memory could be virtually segmented
from the perspective of the hardware.
Preventing these DMA attacks on both the caches and external memory requires
that the OS properly configure the underlying hardware. As soon as that
configuration is lost (e.g., through one of these cold boot attacks), the data might be
easily accessed. As such, our goal is to instead focus on the data remanence and how
to protect against cold boot attacks.
Already, several schemes have been envisioned to counter DRAM’s unintended
remanence. Two software schemes have suggested the storage of disk encryption keys
in processor registers, as the underlying registers are typically reset alongside a
system reset. In [72], a disk encryption scheme was rewritten such that the disk
keys are stored only in SSE (parallel floating point) registers. A related approach [73]
found an SSE-free solution by keeping encryption keys in superfluous model-specific
registers; in their case, the authors found that performance counters functioned as a
convenient location to store the keys.
Only a limited amount of data can be stored via these software-based methods, as
they rely on identifying hardware registers that the OS might not necessarily need.
To protect the entirety of DRAM, a few secure cryptoprocessor designs have been
proposed [35, 54]. Even the PowerPC processor in Microsoft’s Xbox 360 offers hardware-based DRAM encryption [55].
External memory has been the primary target of cold boot attacks, but internal
caches composed of either SRAM or even a nonvolatile structure like STT-MRAM
display definite weakness against these attacks on data remanency. Beyond the
DMA-cache capabilities, the caches can be configured to allow direct access [57, 58]
during the boot process.
Direct prevention of data remanence in the cache would require one of three approaches: remanence-free storage, clearing during power loss, or clearing during system boot. A remanence-free storage design would require a substantial change in the
cell design, such as the use of area- (and power-) consuming flip-flop registers instead of
SRAM cells. Supporting a memory wipe when power is lost would require a method
to store sufficient charge, as well as the circuitry to reset the memory cells. Startup
support to reset the memory would also require additional circuitry, and would not
be able to stop an invasive attack.

4.3

Page-segmented cache encryption
The solution to large-scale protection against data remanency is not to target the
memory store (be it external memory or the caches), but instead focus on protecting
the data stored in the memory. If we encrypt the contents of memory, a memory
readout would have no significance without the corresponding key used for decryption.

Fig. 4.3. Block diagram. Using a multilayered approach, we achieve
low-latency memory confidentiality and segmentation. The single
boot key, the root of trust, is used both for generating the page keys
as well as ensuring volatility against the use of a SRAM cache for
the page keys. The page keys themselves are generated by information from the translation lookaside buffer (TLB) such as the physical
address, the virtual processor ID (in the case of virtualization), and
any other relevant details the OS desires to store in the page tables.
Finally, a unique one-time pad (OTP) is produced for each memory
row in a uniquely-identified page to provide for simple, confidential
data storage.

To protect against cold boot attacks as they can occur to the lowest-level (e.g.,
L3) cache or DRAM, we envision a design that uses memory encryption to provide
data confidentiality and reduce the amount of data that must itself be protected against data
remanence to just the encrypting key. With the help of tiered, hardware-efficient
ciphers, we can achieve low-latency encryption that adds future resiliency against
potential security weaknesses in these ciphers.
Figure 4.3 gives a block diagram for our system. At system boot, a master key
is generated from a true random number generator (TRNG). This key is used in one
hardware cipher to generate a unique key for each memory page (a “page key”), while
a second cipher uses the page key to generate a one time pad value (a secure pseudorandom value) that is combined with the data in transit to the L3 cache/DRAM.
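
The tiered key flow described above can be sketched in software as follows. This is only a behavioral sketch under loud assumptions: HMAC-SHA256 stands in for both hardware ciphers, `secrets.token_bytes` stands in for the TRNG, the 64-byte row size is hypothetical, and all function names are illustrative rather than part of the proposed design.

```python
import hashlib
import hmac
import secrets

PAGE_BYTES = 4096  # page size used in the text
ROW_BYTES = 64     # hypothetical memory row (cache line) size


def prf(key, message):
    """Stand-in keyed function (HMAC-SHA256); not one of the hardware ciphers."""
    return hmac.new(key, message, hashlib.sha256).digest()


def page_key(boot_key, page_number):
    """First tier: derive a per-page key from the single boot key."""
    return prf(boot_key, b"page" + page_number.to_bytes(8, "big"))


def row_pad(pkey, row_index):
    """Second tier: derive a pad covering one row from the page key."""
    blocks = [prf(pkey, b"row" + row_index.to_bytes(4, "big") + bytes([i]))
              for i in range(ROW_BYTES // 32)]
    return b"".join(blocks)


def xor_row(boot_key, address, data):
    """Encrypt or decrypt one row; XOR makes the operation its own inverse."""
    page, offset = divmod(address, PAGE_BYTES)
    pad = row_pad(page_key(boot_key, page), offset // ROW_BYTES)
    return bytes(d ^ p for d, p in zip(data, pad))


boot_key = secrets.token_bytes(16)  # stands in for the per-boot TRNG output
plaintext = b"secret".ljust(ROW_BYTES, b"\0")
stored = xor_row(boot_key, 0x2040, plaintext)
assert xor_row(boot_key, 0x2040, stored) == plaintext

# After a reset the boot key is regenerated, so the stored row no longer decrypts.
assert xor_row(secrets.token_bytes(16), 0x2040, stored) != plaintext
```

The point of the two tiers is that only the boot key needs guaranteed-volatile storage; page keys and pads can always be regenerated from it.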

4.3.1

Single boot keys

To protect against cold boot attacks, our system builds its trust upon a guaranteed-volatile and unpredictable value. Accomplishing this requires the use of a key valid
only for the current boot cycle: at system boot, a master, one-time boot key is generated from a true random number generator (TRNG) and is stored in some structure
that can resist remanence.
While such a random number generator has been uncommon in silicon and traditionally only available via leaky analog solutions [75] or external components like
trusted platform modules, efficient hardware random number generators have begun
to appear on silicon. Intel’s Ivy Bridge platform contains a thermal mismatch-driven
digital latch as the basis for a software-accessible random number generator [74]
(Fig. 4.4), and competing platforms will likely soon follow suit.
Fighting data remanence in the master key itself might be accomplished through
the use of registers tied to the reset signal (Fig. 4.5) as well as circuitry that would
detect system disruptions and cause a reset. In the case of system disruption, the
system would only need enough latent charge left on the die both to detect and trigger
the reset behavior in the master key’s registers.

Fig. 4.4. Intel’s Hardware Random Number Generator [74]. This
random number generator is based on biasing two inverters to be
metastable such that the resultant state after the clock enable should
be random.

Fig. 4.5. A typical D Flip Flop modified to support a reset low signal.
The two sets of coupled NAND gates would re-bias themselves to a
known state whenever R is low.

4.3.2

One time pads

To ensure minimal latency, we propose the use of block ciphers in what is known
as counter mode [76]. Counter mode operation (Fig. 4.6) is a method that operates
a block cipher like a randomly-accessible stream cipher. An address (e.g.,
stream offset) and an optional nonce value are encrypted under the key to produce a
one time pad-like value.

Fig. 4.6. Counter mode operation of a block cipher like AES. The
cipher is instantiated multiple times and driven with the same key,
while the cipher’s inputs are driven by an address parameter and
some unique, one-time use value called a nonce. The output of the
cipher is XORed against the plaintext to produce the ciphertext, and
is undoable by simply running the ciphertext back through with the
same key, address, and nonce.

Provided that a one-time pad is never reused, true one time pads are the ideal
method for ensuring data confidentiality. With a true one time pad, a truly random
key is generated that is at least as long as the message being encrypted, and can
protect the data via a simple mixing function (like XOR gates) between the one time
pad and the data. Without the encrypting pad, not only is brute force required to
recover the message, but in the process, every other possible message of that same
length could be obtained.
The “one time pads” generated by counter mode operation of a cipher are not
true one time pads, both because the pads generated from the block cipher are not
considered truly random but only secure pseudorandom (not truly random but nearly
impossible to guess without some secret information), and because the cipher is vulnerable to a replay attack if the same address and nonce are used for a differing set of
data. However, a secure pseudorandom pad generator is adequate for data confidentiality, and, under the context of cold boot attacks, there would be no opportunity
to use the same key, address, and nonce.
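
The two properties discussed above, random access and the danger of pad reuse, can be seen in a short sketch of counter mode. As an assumption made for self-containment, keyed BLAKE2b from the Python standard library stands in for the block cipher (counter mode only ever uses the cipher in the forward direction); the key, nonce, and function names are illustrative.

```python
import hashlib

BLOCK = 16  # bytes per pad block, matching a 128-bit block cipher


def pad_block(key, nonce, counter):
    """Stand-in for E_key(nonce || counter); keyed BLAKE2b replaces the block cipher."""
    msg = nonce + counter.to_bytes(8, "big")
    return hashlib.blake2b(msg, digest_size=BLOCK, key=key).digest()


def ctr_xor(key, nonce, data, first_block=0):
    """Encrypt or decrypt data starting at block index first_block."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        pad = pad_block(key, nonce, first_block + i // BLOCK)
        out += bytes(d ^ p for d, p in zip(data[i:i + BLOCK], pad))
    return bytes(out)


key, nonce = b"k" * 16, b"n" * 8
message = bytes(range(64))
ciphertext = ctr_xor(key, nonce, message)

# Random access: block 2 decrypts on its own, without touching blocks 0 and 1.
assert ctr_xor(key, nonce, ciphertext[32:48], first_block=2) == message[32:48]

# Pad reuse: encrypting different data under the same key, address, and nonce
# exposes the XOR of the two plaintexts (here the second plaintext is all zeros).
other = bytes(64)
leak = bytes(a ^ b for a, b in zip(ciphertext, ctr_xor(key, nonce, other)))
assert leak == message
```

Because each pad block depends only on the key, nonce, and counter, the pad can be computed while the memory access is still in flight.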

4.3.3

Low-latency cipher

While a parallel encryption method does serve to reduce the latency, a minimal-latency system requires that the cipher used in counter mode be faster than a comparable data access. In a 32 nm process operating at 2.4 GHz, an L3 cache access can
complete in roughly 12 ns [77]; by comparison, a hardware implementation of AES
in 32 nm silicon takes approximately 35 ns.
AES [78], while it can achieve high throughput, is not terribly low-latency. While
the Rijndael cipher was selected for the Advanced Encryption Standard in part because it was noticeably more efficient than the majority of its competitors and predecessors [79], its structure does not lead to hardware efficiency.
The AES cipher is a ten-round cipher consisting of substitutions, permutations,
and rotations, as shown in Figure 4.7. The substitution stage is quite complex, as it
requires either 8-bit Galois field inversion followed by bit arithmetic, or it requires the
use of a 4 kbit (256 entries of 8 bits apiece) lookup table (referred to as a Substitution
Box - S-Box) a total of sixteen times per round. While field transformation methods
have determined how to compute the Galois field inversion with a reasonable efficiency,
the state-of-the-art solution still contains 19 XORs and 4 AND gates in the critical
path [81].
Other popularized ciphers such as Serpent [82] are similarly inefficient in hardware;
while 4-bit S-Boxes are used, the linear transformation stage requires 32-bit adders.

Fig. 4.7. AES Cipher. A single encryption round, as demonstrated in
(a), consists of a substitution, an element shift, and an invertible column multiplication. The substitution phase requires a computationally intense 8-bit Galois Field inversion, and is frequently replaced
by the 2 kbit table shown in (b, from [78]). Instead of duplicating
this single table 16 times to minimize the per-round latency, the substitution can also be calculated through an isomorphic conversion to
a simpler Galois Field, as shown in (c, from [80]).

Fig. 4.8. Embedded ciphers. XTEA (a), a 64-round Feistel cipher,
requires two 32-bit algebraic additions [+] and two boolean additions
(XOR, ⊕) per round. SEA (b) requires 63 rounds for its 66-bit version,
and consists of [S]-Boxes, word [R]otations, and bit [r]otations, but
the critical path is only constrained by the substitution box, a single
11-bit addition, and a boolean addition operation.

Serpent is also slowed by its 32 rounds; by our estimation, a Serpent
hardware implementation in 32 nm silicon would also take around 35 ns.
Investigation into ciphers for embedded systems revealed similarly disappointing
results. The eXtended Tiny Encryption Algorithm (XTEA) [83] and the Scalable Encryption
Algorithm (SEA) [84], shown in Figure 4.8, are Feistel ciphers [85] well suited to
an embedded RISC platform, but they do not map efficiently enough into hardware.
XTEA’s heavy reliance on arithmetic operations (128 serial 32-bit additions) results in a
prohibitive latency of 141 ns. SEA is far more efficient, as it relies on S-Boxes and only 11-bit
additions, but a 66-bit version of SEA still requires around 26 ns.
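To make the source of XTEA's serial additions concrete, the following is a minimal Python sketch of the publicly described XTEA algorithm, in which 32 cycles correspond to 64 Feistel rounds. It is a software illustration only, not the hardware model evaluated above. Each Feistel round places two data-dependent 32-bit additions in series; treating the `total + key[...]` term as a precomputable round key is our reading of how the count of 128 additions arises (64 rounds times two additions).

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9  # XTEA round constant


def _mix(v: int) -> int:
    """((v << 4) XOR (v >> 5)) + v, truncated to 32 bits: one serial addition."""
    return ((((v << 4) & MASK32) ^ (v >> 5)) + v) & MASK32


def xtea_encrypt(block, key, cycles=32):
    """Encrypt a (v0, v1) pair of 32-bit words with a four-word key."""
    v0, v1 = block
    total = 0
    for _ in range(cycles):  # each cycle holds two Feistel rounds
        v0 = (v0 + (_mix(v1) ^ ((total + key[total & 3]) & MASK32))) & MASK32
        total = (total + DELTA) & MASK32
        v1 = (v1 + (_mix(v0) ^ ((total + key[(total >> 11) & 3]) & MASK32))) & MASK32
    return v0, v1


def xtea_decrypt(block, key, cycles=32):
    """Invert xtea_encrypt by running the rounds backwards."""
    v0, v1 = block
    total = (DELTA * cycles) & MASK32
    for _ in range(cycles):
        v1 = (v1 - (_mix(v0) ^ ((total + key[(total >> 11) & 3]) & MASK32))) & MASK32
        total = (total - DELTA) & MASK32
        v0 = (v0 - (_mix(v1) ^ ((total + key[total & 3]) & MASK32))) & MASK32
    return v0, v1


FEISTEL_ROUNDS = 2 * 32
SERIAL_ADDITIONS = 2 * FEISTEL_ROUNDS  # two data-dependent additions per round
```

Because every addition depends on the result of the previous one, none of the carry chains can overlap, which is what makes the cipher slow when unrolled in hardware.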
A third area of growing interest is the development of hardware-efficient
ciphers. A few prospective choices, like PRESENT [86], NOEKEON [88], and mCrypton [87], have surfaced with an AES-like structure, but have adopted more hardware-efficient structures (Fig. 4.9) like 4-bit S-Boxes that can be implemented with 1/64
the area required for AES’ eight-bit S-Box. Due to these optimizations, all three
manage to achieve sub-5 ns encryption in 32 nm silicon. PRESENT was even
selected as an ISO/IEC standard for lightweight cryptography [89].

Fig. 4.9. Unique attributes of various hardware-efficient ciphers.
PRESENT (a, from [86]) uses a permutation matrix to allow for wide
dissemination of information. NOEKEON (b) provides for a unique
key mixing algorithm that operates like overlapped Feistel ciphers.
mCrypton (c, from [87]) integrates different substitution boxes, allowing for more nonlinearity.

4.3.4 Page-level keys

Unfortunately, the simplicity of these hardware-efficient ciphers may also be their
downfall. Reduced-round versions of PRESENT have been shown to be vulnerable to
cryptanalysis [90], while NOEKEON displays potential weakness towards related-key
attacks [91]. Impossible differential attacks have even been demonstrated against
mCrypton [92]. While none of the presently known attacks against these hardware-designed ciphers represent full cryptographic breaks, the novelty of these
ciphers means that more serious vulnerabilities may yet be discovered.
Many advanced cryptanalysis techniques, even if proven effective against these algorithms, would still require a large sample of plaintexts and ciphertexts to mount an
effective attack. If we use a different key at page-level (e.g., 4 KB) granularity
for the generation of line-level (e.g., 64 byte) one time pads, only 64 samples might
be available per key for use during cryptanalysis. To accomplish this division, we use linked
ciphers: the first cipher uses the master key to generate keys for each page, while the
second uses these page keys to generate the one time pads used for data obfuscation.
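The linked-cipher arrangement can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: keyed BLAKE2b from the standard library stands in for both hardware ciphers, the pad for a line is derived from its index within the page, and the names `page_key`, `line_pad`, and `prf` are hypothetical. None of these choices are taken from the design itself.

```python
import hashlib

PAGE_SIZE = 4096  # bytes per page, as in the 4 KB example
LINE_SIZE = 64    # bytes per cache line
KEY_BYTES = 16


def prf(key: bytes, message: bytes, out_len: int) -> bytes:
    """Keyed pseudorandom function; an assumed stand-in for a hardware cipher."""
    return hashlib.blake2b(message, key=key, digest_size=out_len).digest()


def page_key(master_key: bytes, address: int) -> bytes:
    """First cipher: the master key and the page number yield a page key."""
    page_number = address // PAGE_SIZE
    return prf(master_key, page_number.to_bytes(8, "little"), KEY_BYTES)


def line_pad(master_key: bytes, address: int) -> bytes:
    """Second cipher: the page key and the line index yield a one time pad."""
    line_index = (address % PAGE_SIZE) // LINE_SIZE
    key = page_key(master_key, address)
    return prf(key, line_index.to_bytes(8, "little"), LINE_SIZE)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Apply (or remove) a pad; XOR is its own inverse."""
    return bytes(x ^ y for x, y in zip(a, b))


LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE  # pads produced under any one page key
```

With these sizes, any single page key is only ever used to produce `LINES_PER_PAGE` pads, which bounds the material available to an attacker who targets the second cipher.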

4.3.5 Page key management

Chaining together multiple ciphers increases both the power and delay of the encryption operation. However, memory access patterns commonly have spatial locality
and would access a single page multiple times. As such, the page key might not need
to be generated for every memory access and could be stored in an intermediate cache.
A preexisting structure that already takes advantage of spatial locality is the
translation lookaside buffer (TLB). The TLB is designed to perform a subset of the
page table’s duties of translating virtual memory addresses (which are subdivided at
a page level) to physical memory addresses. By retaining only a small subset of
the page table entries (for example, 512 entries), a TLB can achieve a very low miss
rate (often less than 3%) and can avoid a page walk, a slow search through the
page table for the proper address translation.
As the translation lookaside buffer is already designed to exploit spatial locality, we
can achieve a savings in storage space by linking together the TLB with our page key
cache. For any address found in the TLB, we can use the index of the corresponding
TLB entry to reference the appropriate cached page key. Such a connection would
use the TLB like a tag array for the page key cache.
As the page key cache will likely be composed of an SRAM array that would
demonstrate remanence, one more modification is necessary to ensure that the page
key cache by itself is not sufficient to decrypt the contents of external memory or
the L3 cache. As shown in Figure 4.3, the master key needs to be recombined with
the output from the page key cache (with something even as simple as XOR gates)
to ensure that the page key cache contents cannot decrypt memory when the master
key is lost.
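A minimal Python sketch of this arrangement follows. It assumes a direct-mapped tag array standing in for the TLB, keyed BLAKE2b standing in for the page key cipher, and a hypothetical class name `PageKeyCache`; the actual TLB organization and cipher are not specified here. The point illustrated is the XOR recombination: the value handed to the pad generator depends on both the cached entry and the live master key.

```python
import hashlib

KEY_BYTES = 16


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def generate_page_key(master_key: bytes, page_number: int) -> bytes:
    """Assumed stand-in for the page key cipher (keyed BLAKE2b)."""
    return hashlib.blake2b(page_number.to_bytes(8, "little"),
                           key=master_key, digest_size=KEY_BYTES).digest()


class PageKeyCache:
    """Page keys stored at the same row index as the matching TLB entry."""

    def __init__(self, entries: int = 512):
        self.tags = [None] * entries              # models the TLB tag array
        self.keys = [bytes(KEY_BYTES)] * entries  # models the remanent SRAM

    def effective_key(self, master_key: bytes, page_number: int) -> bytes:
        row = page_number % len(self.tags)  # direct-mapped for simplicity
        if self.tags[row] != page_number:   # TLB miss: regenerate the key
            self.tags[row] = page_number
            self.keys[row] = generate_page_key(master_key, page_number)
        # Recombine the cached value with the master key before use.
        return xor_bytes(self.keys[row], master_key)
```

If the master key register is cleared while the cache contents survive, `effective_key` returns a different value for the same page, so the surviving entries alone do not reproduce the pads that were used to encrypt memory.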

4.4 Effect on security

To examine the effect of our system on security, we must consider how data confidentiality is ensured under three scenarios: a platform reset attack, cryogenic freeze,
and invasive data recovery.
For a platform reset attack, the registers storing the master key are immediately
zeroed when the reset signal is registered. With the loss of the master key, new page
keys can no longer be generated. The cached page keys were recombined with the
master key before generation of one time pads, so even if the cached page keys are
still present and usable, the one time pads produced will be incorrect and will not
allow for memory decryption.

Under a cryogenic attack, the external RAM is significantly cooled to decrease
data degradation, allowing for removal and transportation of the memory modules
to a memory reader. However, the root of trust is in the processor, not the external
RAM; insufficient information exists on the external RAM to reconstruct the data.
Invasive attacks would likely cause data spoliation to the root of trust due to the
generation of heat while decapping the chip package; the master key stored in the
registers would likely be lost, as well as all of the cached page keys. Even if
the cached page keys can be recovered, the XOR operation with the single boot key
means that any single row in the page key cache is insufficient for data recovery. The
only hope of rediscovering the page key under such a scenario is if the ciphers used are
extremely weak against differential attacks such as related-key attacks. Because the
page key cache would hold only a small number of entries, such an attack would prove difficult.

4.5 Performance impact

It is our belief that our system could be implemented without any significant
impact on performance.
Based on Sandy Bridge performance numbers, wherein significant effort was
taken to minimize L3 cache latency [77], a memory read request that hits the L3
cache can take as little as 26 cycles to complete. Performance numbers for the actual
access delay of the L3 cache itself are unknown, but by our estimate it takes around 20 cycles.
As our page key cache is indexed in conjunction with the L2 TLB, we need to
first determine the corresponding TLB row. If the address mapping is available in
either TLB, the page key will be available as soon as we decide to perform an L3/main
memory access, giving us plenty of time to use our 8-12 cycle hardware-efficient cipher.
This is demonstrated in Figures 4.10(a) and (b).
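The latency budget behind this argument can be checked with simple arithmetic. The Python sketch below converts the nanosecond figures quoted earlier in this chapter into cycles at 2.4 GHz and compares them with the estimated 20-cycle L3 access. The acceptance rule, that a pad generator is hidden if it finishes within the L3 array access once the page key is available, is a simplifying assumption for illustration, and the function names are hypothetical.

```python
import math

CLOCK_MHZ = 2400       # 2.4 GHz, as in the Sandy Bridge comparison
L3_HIT_CYCLES = 26     # full L3 read hit
L3_ARRAY_CYCLES = 20   # estimated access delay of the L3 array itself


def ns_to_cycles(ns: float) -> int:
    """Round a latency in nanoseconds up to whole clock cycles."""
    return math.ceil(ns * CLOCK_MHZ / 1000)


def cycles_to_ns(cycles: int) -> float:
    return cycles * 1000 / CLOCK_MHZ


def hidden_behind_l3(cipher_ns: float) -> bool:
    """Assumed rule: the pad is ready before the L3 data if it fits the array access."""
    return ns_to_cycles(cipher_ns) <= L3_ARRAY_CYCLES


# Latencies quoted in Section 4.3.3, in nanoseconds.
CANDIDATES = {"hardware-efficient": 5, "SEA": 26, "AES": 35, "XTEA": 141}
FITS = {name: hidden_behind_l3(ns) for name, ns in CANDIDATES.items()}
```

Under these numbers only the sub-5 ns ciphers fit, at 12 cycles or fewer, which is consistent with the 8-12 cycle figure used above.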
When a TLB miss does occur, no page key will be available. The hardware will
need to resolve the TLB miss via a page walk, but as soon as the TLB miss is resolved
and the physical address is known, we begin regeneration of the page key while the
L2 cache is checked for the given address. Provided that one of our hardware-efficient
ciphers is used for both the page key and the one-time pad generators, no added
latency should result.

(Timing diagram rows: L1 TLB, L2 TLB, L1$, L2$ tag, L2$ data, L3$ tag, L3$ data, Page key cipher, Page key $, OTP key cipher, OTP XOR; horizontal axis: Time; a Page Walk stage is also labeled.)
Fig. 4.10. Expected L3 read performance impact under TLB scenarios, as compared to Sandy Bridge performance and a 10-cycle encryption cipher. (a) L1 TLB hit. (b) L1 TLB miss, L2 TLB hit. (c) TLB
miss.
For multiple sequential read requests, the read pipeline might become stalled
waiting for the generation of each one time pad. To compensate, the ciphers used to
generate the pads could be instantiated multiple times to form a pipeline that would
alleviate the possible effect on system latency.
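As a rough sizing aid, the number of cipher instances needed to keep up with back-to-back reads can be estimated from the cipher latency and the spacing between requests. The sketch below assumes each instance is busy for the full cipher latency and that requests arrive at a fixed interval; both the function name and the example intervals are hypothetical.

```python
def cipher_instances_needed(cipher_latency_cycles: int,
                            request_interval_cycles: int) -> int:
    """Smallest number of instances such that one is always free for a new request."""
    # Ceiling division using integers only.
    return -(-cipher_latency_cycles // request_interval_cycles)
```

For the 10-cycle cipher assumed in Fig. 4.10, one request per cycle would call for ten instances, while one request every four cycles would call for three.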
Write operations are unable to take advantage of the parallel decryption afforded
to read operations, and may suffer from the added complexity of reverse page mapping
to use the page key cache. Unlike a read operation which can compute the one time
pad before a memory read is completed, the write operation must finish computing
the pad before data can be written.
Additionally, the information necessary to use the TLB in a standard manner is
likely to be unavailable. When the processor wants to write directly to external
RAM, the virtual address information can be used through the TLB to find the corresponding cached page key (Section 4.3.5). For a writeback operation from the L2 cache
to the L3 cache, the physical address and not the virtual address is available. The
proper TLB structure (an associative one, not one based on a content-addressable
memory) would be required to find the index of the cached page key.
Fortunately, for almost all write concerns, additional buffers can hide most of the
latency from the processor.

4.6 Optional enhancements to security

Although our design is geared only towards addressing data confidentiality
under cold boot attacks, it is expandable to support additional elements such as
nonces and hashing to help fight against live hardware attacks like malicious DMA
accesses. Figure 4.11 shows how these additions could supplement the design.

(Block diagram labels: CPU with L1$ and L2$, TLB, L3$, bus, DRAM, TRNG, single boot key register, page key cipher, page key cache, OTP cipher, nonce cache, hash function, hash cache, hash interrupt.)
Fig. 4.11. Integration of nonce and hash operation, as shown in grey.
Nonce operation requires a memory-backed cache, while hash operation requires both a hash function and an additional memory-backed
cache. The two caches are referenced by the TLB row index and the
address index. The one-time pad cipher may be reusable for the hash
function. In the case of a malformed hash, an interrupt can be sent
back to the processor.

In the context of cold boot attacks, we assume the ability to contain the use
of one time pads such that a nonce would not be necessary (Section 4.3.2). If a
situation exists where some privileged entity can use the encryption mechanism to
store a value in memory and can also directly read from memory, that entity can recover
the one time pad used for that memory location and use its direct access to
memory to recover encrypted data that is later written to that memory location.
If a different value (a nonce) is used for each encryption operation and retained
until the decryption of that memory location, the one time pads would be unique for
each memory write and would reveal no information about future data storage.
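The pad-reuse problem and the effect of a nonce can be demonstrated with a short Python sketch. Keyed BLAKE2b again stands in for the pad generator, and the function names, the 64-byte line size, and the way the nonce is mixed into the cipher input are all illustrative assumptions. The attacker writes a known value, reads the ciphertext directly from memory to recover the pad, and then tries to decrypt a later write to the same address.

```python
import hashlib

LINE_BYTES = 64


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def one_time_pad(key: bytes, address: int, nonce: int = 0) -> bytes:
    """Pad derived from key, address, and nonce (assumed construction)."""
    message = address.to_bytes(8, "little") + nonce.to_bytes(8, "little")
    return hashlib.blake2b(message, key=key, digest_size=LINE_BYTES).digest()


def attack_succeeds(key: bytes, address: int,
                    probe_nonce: int, victim_nonce: int) -> bool:
    """Return True if a pad recovered from a probe write decrypts a later write."""
    known = bytes(LINE_BYTES)  # attacker-chosen plaintext
    probe_ct = xor_bytes(known, one_time_pad(key, address, probe_nonce))
    recovered_pad = xor_bytes(known, probe_ct)  # read directly from memory

    secret = b"secret line contents".ljust(LINE_BYTES, b".")
    victim_ct = xor_bytes(secret, one_time_pad(key, address, victim_nonce))
    return xor_bytes(victim_ct, recovered_pad) == secret
```

With a fixed nonce the recovered pad decrypts the later write; once the nonce changes between writes, the recovered pad is of no use.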
Cold boot attacks also need only consider data confidentiality, not data integrity;
malicious modification of data does not matter if the system is no longer active or
a different OS is loaded. To guarantee data integrity requires some sort of secure
hashing mechanism. Without data hashes, a privileged entity with constant, direct
access to system memory, and knowledge of the layout of data in memory, could
definitely compromise the system. The OS would be fully unaware if select bits at a
given memory location were flipped, and yet only a few flipped bits in the right spot
would be enough to derail the system’s operation.
A simple hash function - which could even reuse one of the other ciphers - could be
used to offer data integrity by securely hashing the ciphertext using address-specific
information such as the one time pad. Hashes would be stored on write and verified
on read, and in the case of a hash mismatch, could send an interrupt to the processor
warning it of the compromised data.
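A store-on-write, verify-on-read flow of this kind might look like the following Python sketch. The hash is keyed BLAKE2b with the one time pad as the key, an exception stands in for the hardware interrupt, and the class and function names are hypothetical. An 8-byte tag is used here so that the example is unambiguous; a hardware design would choose its own tag width.

```python
import hashlib

TAG_BYTES = 8  # illustrative tag width


class HashMismatch(Exception):
    """Stands in for the interrupt sent back to the processor."""


def line_tag(ciphertext: bytes, pad: bytes) -> bytes:
    """Hash the ciphertext, keyed with address-specific material (the pad)."""
    return hashlib.blake2b(ciphertext, key=pad, digest_size=TAG_BYTES).digest()


class ProtectedMemory:
    def __init__(self):
        self.lines = {}  # address -> ciphertext, models external memory
        self.tags = {}   # address -> stored hash, models the hash cache

    def write(self, address: int, ciphertext: bytes, pad: bytes) -> None:
        self.lines[address] = ciphertext
        self.tags[address] = line_tag(ciphertext, pad)  # stored on write

    def read(self, address: int, pad: bytes) -> bytes:
        ciphertext = self.lines[address]
        if line_tag(ciphertext, pad) != self.tags[address]:  # verified on read
            raise HashMismatch(hex(address))
        return ciphertext
```

A single bit flipped in `lines` by an entity with direct memory access causes the next `read` of that address to raise `HashMismatch` rather than silently returning altered data.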
Both the nonce and the hashes will require additional memory. Even if each
memory row had only an eight-bit value for either the nonce or the hashes, that
would correspond to a 2 KB storage table (about 100k transistors) inside the package
if linked with a 512-entry TLB. In addition, both the nonces and the hashes would
need to be backed by system memory, each eating up about 1.5% additional space
for any data that would require the added protection.

4.7 Conclusion

With the advent of nonvolatile solid-state random-access memories, the danger of
cold boot attacks has increased. Not only are external memory sources vulnerable to
this issue, but so is the lowest-level cache inside the processor.
In this chapter, we have presented a design intended to fight against various cold
boot attack methods, inasmuch as they target either the lowest-level cache or an external memory source. Through a two-layer scheme, we envision a method to utilize
multiple low-latency ciphers in a secure manner that does not affect system performance. The root of trust is protected behind a volatile storage mechanism, ensuring
the protection of the underlying data upon either power loss or some triggered reset.
Our design represents an extendable means to provide page-level granularity of
access to physical memory; with the addition of other components such as nonces and
post-access hashing, we feel this design may be able to offer further benefit against
other types of memory attacks such as those performed through DMA channels.


5. CONCLUSIONS
Variations have a variety of effects on how our microprocessors work. They can affect
their yield, performance, and power usage. Throughout this thesis we have explored
two overarching topics with variations: their role under circuit simulations, and their
role in design security.
Our investigation into circuit simulations found that the academic approach of
using threshold voltage as a substitute for process variations was inadequate. To that
end, we aimed to discover an improved method for simulation, which was encountered via parameter reduction through careful parameter translation and composition. Not only were we able to reduce a six-parameter variation model down to a
three-dimension lookup table, but we were able to achieve unparalleled simulation
accuracy.
Application of variations to design security found solutions that benefit from two
distinct types of variations. Process variations provide a means to provide unique,
nonduplicable fingerprints for chips, while environmental variations can provide for
true on-chip random number generators.
Using the nonduplicable nature of process variations, we found a scalable solution
to combat potential chip piracy. We discovered a method wherein almost any combinational logic block could function as a locking mechanism without any effect on
latency. Through localized logic duplication to support correctability, we embed process variation sensors within the logic in such a way as to prevent easy removal. Through
a secure readout testing procedure, we described how each chip key can be extracted
only by a trusted party. Finally, we demonstrated the effect our solution has on area,
functionality, and switching activity.
As future semiconductor designs will likely replace SRAM in on-chip
caches with nonvolatile memory, the longstanding security model is in jeopardy, as the
guaranteed data remanence will increase the vulnerability to cold boot attacks. To
ensure resistance against cold boot attacks, the system needs to behave in a volatile
manner. Utilizing a root of trust centered around true random number generators and
the volatility of registers, we constructed a multilevel protection mechanism designed
to combat cold boot attacks. Using ciphers designed specifically for hardware, we
accomplish minimal effect on system performance, even when protection is supplied
to the lowest-level processor cache.
Through their effect on circuit behavior, the benefit they can have on preventing
piracy, and the role they can play in guarding against data remanence, we have
demonstrated that variations present a research area open to investigations. The
continued effort to push to the edge of semiconductor performance will continue to
raise questions as to how to best understand the behavior of variations and how
variations can serve to benefit processor designs.

LIST OF REFERENCES


[1] G. Moore, “Cramming more components onto integrated circuits,” Electronics,
vol. 38, May 1965.
[2] C. Sah and H. Pao, “The effects of fixed bulk charge on the characteristics
of metal-oxide-semiconductor transistors,” Electron Devices, IEEE Transactions
on, vol. 13, pp. 393 – 409, apr 1966.
[3] A. Kahng et al., “Watermarking techniques for intellectual property protection,”
in DAC ’98, pp. 776–781, 1998.
[4] Y. Alkabani, F. Koushanfar, and M. Potkonjak, “Remote activation of ICs for
piracy prevention and digital right management,” in ICCAD ’07, pp. 674–677,
2007.
[5] J. Roy, F. Koushanfar, and I. Markov, “EPIC: Ending piracy of integrated circuits,” in DATE ’08, pp. 1069–1074, 2008.
[6] J. Huang and J. Lach, “IC activation and user authentication for security-sensitive systems,” in HOST, pp. 76–80, 2008.
[7] L. Duflot, Y.-A. Perez, G. Valadon, and O. Levillain, “Can you still trust your
network card?,” in CanSecWest, 2010.
[8] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A.
Calandrino, A. J. Feldman, J. Appelbaum, and E. W. Felten, “Lest we remember:
cold-boot attacks on encryption keys,” Communications of the ACM, pp. 91–98,
May 2009.
[9] B. Ping Yang and P. Chatterjee, “Statistical modelling of small geometry MOSFETs,” in IEEE International Electron Devices Meeting, vol. 28, pp. 286–289,
1982.
[10] D. Reid, C. Millar, G. Roy, S. Roy, and A. Asenov, “Analysis of threshold voltage
distribution due to random dopants: A 100,000-sample 3-D simulation study,”
IEEE Transactions on Electron Devices, vol. 56, no. 10, pp. 2255 –2263, 2009.
[11] P. Oldiges, Q. Lin, K. Petrillo, M. Sanchez, M. Ieong, and M. Hargrove, “Modeling line edge roughness effects in sub 100 nanometer gate length devices,” in
SISPAD, pp. 131–134, 2000.
[12] “BSIM5: An advanced charge-based MOSFET model for nanoscale VLSI circuit
simulation,” Solid-State Electronics, vol. 51, pp. 433–444, Mar. 2007.
[13] M. Orshansky and K. Keutzer, “A general probabilistic framework for worst case
timing analysis,” in DAC, pp. 556–561, 2002.

[14] P. Zuber, V. Matvejev, P. Roussel, P. Dobrovolný, and M. Miranda, “Exponent
monte carlo for quick statistical circuit simulation,” in PATMOS, pp. 36–45,
2010.
[15] T. Tsunomura, A. Nishida, and T. Hiramoto, “Verification of threshold voltage
variation of scaled transistors with ultralarge-scale device matrix array test element group,” Japanese Journal of Applied Physics, vol. 48, no. 12, p. 124505,
2009.
[16] P. Drennan and C. McAndrew, “A comprehensive mosfet mismatch model,” in
IEEE IEDM, pp. 167 –170, 1999.
[17] J. Croon, M. Rosmeulen, S. Decoutere, W. Sansen, and H. Maes, “An easy-to-use mismatch model for the mos transistor,” IEEE JSSC, vol. 37, pp. 1056–1064,
aug 2002.
[18] Q. Zhang, J. Liou, J. McMacken, J. Thomson, and P. Layman, “Spice modeling
and quick estimation of mosfet mismatch based on bsim3 model and parametric
tests,” IEEE JSSC, vol. 36, pp. 1592 –1595, oct 2001.
[19] E. Seevinck, F. List, and J. Lohstroh, “Static-Noise Margin Analysis of MOS
SRAM Cells,” IEEE JSSC, vol. 22, pp. 748–754, 5 1987.
[20] W. P. Griffin, A. Raghunathan, and K. Roy, “Clip: circuit level ic protection
through direct injection of process variations,” IEEE Trans. Very Large Scale
Integr. Syst., vol. 20, pp. 791–803, May 2012.
[21] V. Alliance, “Intellectual property protection: Schemes, alternatives and discussion,” Aug. 2000.
[22] United States Code Title 17, Chapter 9. Semiconductor Chip Protection Act.
Available at http://www.copyright.gov/.
[23] The Act Concerning the Circuit Layout of a Semiconductor Integrated Circuit.
Available at http://www.japaneselawtranslation.go.jp/.
[24] Council Directive 87/54/EEC of 16 December 1986 on the legal protection of
topographies of semiconductor products. Available at http://eur-lex.europa.eu/.
[25] Agreement on Trade-Related Aspects of Intellectual Property Rights. Available
at http://www.wto.org/.
[26] R. Torrance and D. James, “The state-of-the-art in IC reverse engineering,” in
CHES ’09, pp. 363–381, 2009.
[27] E. Charbon, “Hierarchical watermarking in IC design,” in Proc. IEEE Custom
Integrated Circuit Conference, pp. 295–298, May 1998.
[28] A. Oliveira, “Robust techniques for watermarking sequential circuit designs,” in
DAC ’99, pp. 837–842, 1999.
[29] Y. Alkabani and F. Koushanfar, “Active hardware metering for intellectual property protection and security,” in Proc. USENIX Security Symposium, pp. 1–16,
2007.

[30] R. Chakraborty and S. Bhunia, “Hardware protection and authentication
through netlist level obfuscation,” in ICCAD ’08, pp. 674–677, 2008.
[31] O. Sinanoglu and A. Orailoglu, “Partial core encryption for performance-efficient
test of SOCs,” in ICCAD ’03, (Washington, DC, USA), p. 91, IEEE Computer
Society, 2003.
[32] C. Paar, J. Guajardo, S. Kumar, and T. Guneysu, “Secure IP-block distribution
for hardware devices,” in HOST, pp. 82–89, July 2009.
[33] A. T. Abdel-Hamid et al., “A public-key watermarking technique for IP designs,”
in DATE, pp. 330–335, 2005.
[34] P. S. Ravikanth, Physical one-way functions. PhD thesis, Massachusetts Institute
of Technology, 2001. Chair-Benton, Stephen A.
[35] G. E. Suh et al., “Design and implementation of the AEGIS single-chip secure
processor using physical random functions,” in ISCA, pp. 25–36, 2005.
[36] S. S. Kumar et al., “The butterfly PUF protecting IP on every FPGA,” in HOST,
pp. 67–70, 2008.
[37] J. Guajardo et al., “FPGA intrinsic PUFs and their use for IP protection,” in
CHES, pp. 63–80, 2007.
[38] B. Gassend et al., “Delay-based circuit authentication and applications,” in Proceedings of the 2003 ACM Symposium on Applied Computing, pp. 294–301, 2003.
[39] Y. Su, J. Holleman, and B. Otis, “A digital 1.6 pJ/bit chip identification circuit
using process variations,” IEEE Journal of Solid-State Circuits, vol. 43, pp. 69–
77, Jan. 2008.
[40] D. Roy et al., “Comb capacitor structures for on-chip physical uncloneable function,” IEEE Transactions on Semiconductor Manufacturing, vol. 22, pp. 96–102,
Feb. 2009.
[41] F. Koushanfar and G. Qu, “Hardware metering,” in DAC ’01, pp. 490–493, 2001.
[42] R. Maes, D. Schellekens, P. Tuyls, and I. Verbauwhede, “Analysis and design of
active IC metering schemes,” in HOST, pp. 74–81, 2009.
[43] “Certicom launches trusted key injection platform for anti-cloning.” Available
at http://www.certicom.com/.
[44] E. Biham and A. Shamir, “Differential cryptanalysis of DES-like cryptosystems,”
in CRYPTO ’90, pp. 2–21, 1991.
[45] S. Mukhopadhyay, K. Kim, K. Jenkins, C. Chuang, and K. Roy, “Statistical
characterization and on-chip measurement methods for local random variability
of a process using sense-amplifier-based test structure,” in ISSC, pp. 400 –611,
11-15 2007.
[46] Y. Taur and T. H. Ning, Fundamentals of modern VLSI devices. Cambridge
University Press, 1998.

[47] P. Andricciola and H. Tuinhout, “The temperature dependence of mismatch in
deep-submicrometer bulk mosfets,” IEEE Electron Device Letters, vol. 30, no. 6,
pp. 690 –692, 2009.
[48] T. Eisenbarth, S. Kumar, C. Paar, A. Poschmann, and L. Uhsadel, “A survey
of lightweight-cryptography implementations,” IEEE Design and Test, vol. 24,
no. 6, pp. 522–533, 2007.
[49] Berkeley Logic Synthesis and Verification Group, “ABC: a system for sequential synthesis and verification, release 70930.”
http://www.eecs.berkeley.edu/~alanmi/abc/.

[50] E. Biham and A. Shamir, “Differential cryptanalysis of DES-like cryptosystems,”
in CRYPTO, pp. 2–21, 1991.
[51] P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in CRYPTO,
pp. 388–397, 1999.
[52] H. Saputra et al., “Masking the energy behavior of DES encryption,” in DATE,
p. 10084, 2003.
[53] K. Tiri and I. Verbauwhede, “A logic level design methodology for a secure DPA
resistant ASIC or FPGA implementation,” in DATE, p. 10246, 2004.
[54] D. L. C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and
M. Horowitz, “Architectural support for copy and tamper resistant software,”
SIGARCH, pp. 168–177, Nov. 2000.
[55] M. Steil and F. Domke, “The xbox 360 security system and its weaknesses,”
June 2008.
[56] S. Skorobogatov, “Low temperature data remanence in static RAM,” tech. rep.,
June 2002.
[57] “Processor cache memory as RAM for execution of boot code,” 2002. United
States Patent No. 7254676 B2.
[58] “BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors,”
Jan. 2013. http://support.amd.com/us/Processor_TechDocs/31116.pdf.
[59] R. Huggahalli, R. Iyer, and S. Tetrick, “Direct cache access for high bandwidth
network i/o,” in International Symposium on Computer Architecture, pp. 50–59,
2005.
[60] “Intel Data Direct I/O Technology,” 2012. http://www.intel.com/content/
www/us/en/io/direct-data-i-o.html.
[61] H. Yoda, S. Fujita, N. Shimomura, E. Kitagawa, K. Abe, K. Nomura, H. Noguchi,
and J. Ito, “Progress of STT-MRAM technology and the effect on normally-off
computing systems,” in IEEE International Electron Devices Meeting (IEDM),
pp. 11.3.1–11.3.4, 2012.
[62] “Pci express an overview of the pci express standard,” Aug. 2009. http://www.
ni.com/white-paper/3767/en.

[63] D. Aumaitre and C. Devine, “Subverting windows 7 x64 kernel with dma attacks,” in Hack in the Box, June 2010.
[64] M. Becher, M. Dornseif, and C. Klein, “Firewire: all your memory are belong to
us,” in CanSecWest, 2005.
[65] “Thunderbolt™ technology,” June 2011. http://download.intel.com/newsroom/kits/research/2011/pdfs/Intel_Thunderbolt_Overview.pdf.
[66] A. Rahmati, M. Salajegheh, D. Holcomb, J. Sorber, W. P. Burleson, and K. Fu,
“Tardis: time and remanence decay in sram to implement secure protocols on
embedded devices without clocks,” in USENIX, p. 36, 2012.
[67] P. Gutmann, “Data remanence in semiconductor devices,” in USENIX, p. 4,
2001.
[68] H. Wong, “A Comparison of Intels 32nm and 22nm Core i5 CPUs: Power, Voltage, Temperature, and Frequency,” Oct 2012. http://blog.stuffedcow.net/
2012/10/intel32nm-22nm-core-i5-comparison/.
[69] C. Lin, S. Kang, Y. Wang, K. Lee, X. Zhu, W. Chen, X. Li, W. Hsu, Y. C. Kao,
M. T. Liu, W. Chen, Y. Lin, M. Nowak, N. Yu, and L. Tran, “45nm low power
CMOS logic compatible embedded STT MRAM utilizing a reverse-connection
1T/1MTJ cell,” in IEEE International Electron Devices Meeting (IEDM), pp. 1–
4, 2009.
[70] R. Waser, R. Dittmann, G. Staikov, and K. Szot, “Redox-based resistive switching memories nanoionic mechanisms, prospects, and challenges,” Advanced Materials, pp. 2632–2663, 2009.
[71] R. Carbone, C. Bean, and M. Salois, “An in-depth analysis of the cold boot
attack,” Jan. 2011.
[72] T. Müller, A. Dewald, and F. C. Freiling, “Aesse: a cold-boot resistant implementation of aes,” in Proceedings of the Third European Workshop on System
Security, EUROSEC, pp. 42–47, 2010.
[73] P. Simmons, “Security through amnesia: a software-based solution to the cold
boot attack on disk encryption,” in Annual Computer Security Applications Conference, pp. 73–82, 2011.
[74] M. Hamburg, P. Kocher, and M. E. Marson, “Analysis of intels ivy bridge digital
random number generator,” 2012.
[75] B. Jun and P. Kocher, “The INTEL Random Number Generator,” pp. 1–8, Apr. 1999. Available at
http://www.cryptography.com/resources/whitepapers/IntelRNG.pdf.
[76] H. Lipmaa, D. Wagner, and P. Rogaway, “Comments to nist concerning aes
modes of operation: Ctr-mode encryption,” 2000.
[77] D. Kanter, “Intels sandy bridge microarchitecture,” 2010. http://www.realworldtech.com/sandy-bridge/.
[78] “Announcing the Advanced Encryption Standard (AES),” 2001. Federal Information Processing Standards Publication 197.
[79] A. Dandalis, V. K. Prasanna, and J. D. P. Rolim, “A comparative study of
performance of AES final candidates using FPGAs,” in Cryptographic Hardware
and Embedded Systems, pp. 125–140, 2000.
[80] R. Rachh, B. Anami, and P. Ananda Mohan, “Efficient implementations of s-box
and inverse s-box for aes algorithm,” in TENCON, pp. 1–6, 2009.
[81] X. Zhang and K. Parhi, “On the Optimum Constructions of Composite Field for
the AES Algorithm,” IEEE TCAS, pp. 1153–1157, 2006.
[82] R. Anderson, E. Biham, and L. Knudsen, “SERPENT: A proposal for the advanced encryption standard,”
[83] R. Needham and D. Wheeler, “eXtended Tiny Encryption Algorithm, technical
report,” Oct. 1997.
[84] F.-X. Standaert, G. Piret, N. Gershenfeld, and J.-J. Quisquater, “SEA: A
scalable encryption algorithm for small embedded applications,” in CARDIS,
pp. 222–236, 2006.
[85] A. Sorkin, “LUCIFER: a cryptographic algorithm,” Cryptologia, pp. 22–35, 1984.
[86] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. B. Robshaw,
Y. Seurin, and C. Vikkelsoe, “PRESENT: An ultra-lightweight block cipher,” in
Cryptographic Hardware and Embedded Systems, Springer.
[87] C. H. Lim and T. Korkishko, “mCrypton – a lightweight block cipher for security
of low-cost RFID tags and sensors,” in WISA, pp. 243–258, 2005.
[88] J. Daemen, M. Peeters, G. Van Assche, and V. Rijmen, “Nessie Proposal:
NOEKEON,” Oct. 2000. http://gro.noekeon.org/.
[89] “Lightweight cryptography – part 2: Block ciphers.” ISO/IEC 29192-2:2011.
[90] J. Nakahara Jr., P. Sepehrdad, B. Zhang, and M. Wang, “Linear (hull) and
algebraic cryptanalysis of the block cipher PRESENT,” in Cryptology and Network Security, pp. 58–75, 2009.
[91] L. Knudsen and H. Raddum, “On NOEKEON,” Apr. 2001.
[92] H. Mala, M. Dakhilalian, and M. Shakiba, “Cryptanalysis of mCrypton – a
lightweight block cipher for security of RFID tags and sensors,” International
Journal of Communications Systems, pp. 415–426, Apr. 2012.

VITA

William “Paul” Griffin received the Bachelor of Science degree in Mathematics
from Evangel University in Springfield, MO in 2007, and the Doctor of Philosophy degree from the School of Electrical Engineering at Purdue University in West Lafayette,
IN in 2012.
While at Purdue University, Dr. Griffin has served as a teaching assistant, a head
teaching assistant, and a research assistant for the School. He received the Magoon
Award from the School for two concurrent years for his efforts as a teaching assistant,
and the Outstanding Graduate Teaching Assistants award from the University while
a head teaching assistant. Paul has excelled at project-oriented courses, from the construction of a highly-optimized zero skew clock tree generator to the construction of
an FPGA-based SoC that took in a live video feed and used MPEG-style compression
to store the video to an SD card.
Paul’s research has focused on a hybrid of chip security and process variations. He
has explored a means of preventing chip piracy through the use of process variation
sensors and nonseparable integration, investigated and constructed methods to accelerate the simulation of variations, and developed a low-latency approach to securing
both on-chip caches and the RAM’s contents. His ongoing publication efforts have
produced papers for IEEE Transactions on Very Large Scale Integration as well as
at the IEEE International Conference on Simulation of Semiconductor Processes and
Devices.
In 2011, Dr. Griffin held a seven-month internship at Intel Corporation with
Platform Validation and Engineering (PVE)’s Security Center of Excellence (SeCoE)
Detect. His internship served as a pilot program for System-on-Chip Security for
SeCoE Detect, and in his short tenure, he examined and found critical security issues
across the Medfield Platform’s RTL, BootROM, and Firmware. His efforts in SoC
investigation earned him several awards, including the PVE Division Recognition
Award “In recognition of timely efforts to identify and disposition several critical
security vulnerabilities in Penwell SoC.”

