Fast Secure Processor for Inhibiting Software Piracy and Tampering by Jun Yang et al.
Fast Secure Processor for Inhibiting Software Piracy and Tampering
Jun Yang Youtao Zhang* Lan Gao
Computer Science and Engineering Department *Computer Science Department
University of California, Riverside University of Texas at Dallas
Riverside, CA 92521 Richardson, TX 75083
{junyang,lgao}@cs.ucr.edu zhangyt@utdallas.edu
Abstract
Due to widespread software piracy and virus attacks,
significant efforts have been made to improve security for
computer systems. For stand-alone computers, a key observation
is that, other than the processor, any component is
vulnerable to security attacks. Recently, an execution only
memory (XOM) architecture has been proposed to support
copy and tamper resistant software [18, 17, 13]. In this design,
the program and data are stored in encrypted format
outside the CPU boundary. Decryption is carried out after
they are fetched from memory and before they are used by
the CPU. As a result, the lengthened critical path causes
serious performance degradation.
In this paper, we present an innovative technique in
which the cryptography computation is shifted off the
memory access critical path. We propose to use a different
encryption scheme, namely “one-time pad” encryption, to
produce the instruction and data ciphertext (see footnote 1). With some
additional on-chip storage, cryptography computations are
carried out in parallel with memory accesses, minimizing the
performance penalty. We performed experiments to study the
trade-off between storage size and performance penalty.
Our technique improves the execution speed of the XOM
architecture by up to 34.7%.
1. Introduction
Software copyright protection plays an important role in
assuring the software market value and a fair return on
development investment. A study done in 2001 by the Business
Software Alliance showed a 12 billion dollar loss in the
software industry due to software piracy [2]. Preventing illicit
duplication of software will have a large impact on economic
development. Therefore, it is important to develop
foolproof devices that disallow unauthorized execution of
software.
1 Ciphertext is the term for the result of encryption. Likewise, plaintext
is an unencrypted instruction or data.
Several techniques have been proposed to provide hardware
support at the microprocessor level against software
piracy [19, 18, 17, 15, 13]. In those techniques, the only
trusted hardware entity is the processor itself. Any other
hardware components in the computer system are considered
vulnerable to security attacks, particularly the coprocessor
and the main memory. This is because program
privacy can be violated by tapping a communication channel
such as the system bus. An adversary can easily tamper with
the execution of a program once some knowledge of the code
is obtained. Moreover, the operating system is also considered
non-tamper resistant since it may be hijacked by the
adversary to become malicious to the software running under
its control.
The software is stored in the system storage in encrypted
form. It can only be decrypted by the processor internally
before execution. This prevents any user having full
control of the computer from examining the clear text instructions.
More importantly, the data communicated between
the processor and the memory are all encrypted to
prohibit reverse-engineering of the code. To protect against a
potentially malicious operating system that can access the
register values on interrupts, register values also need to be
encrypted on such events. The representative technique of
the above model is called execution only memory, or XOM,
meaning that software can be executed by the owner processor
only but not copied (since it would not run on other processors)
nor manipulated (since it would raise exceptions
and then halt) by unauthorized entities [18, 17].
Though secure at a satisfactory level, one of the most
important problems in the XOM-type architecture is its efficiency.
As one may notice, every off-chip memory transaction,
for both instructions and data, undergoes encryption
and decryption. Even with the most optimistic assumption
of finishing the crypto process in 48 cycles with fully
pipelined hardware [18], performance loss can be as high
as 34.7%, as our experiments indicate. The situation is even
worse for applications that are memory bound or time critical.
For this reason, the usefulness of the XOM architecture
is yet to be evaluated. Software users would find it very
annoying if the program runs significantly slower
than the unprotected version, diminishing the attractiveness
of copyright protection.
Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36’03)
0-7695-2043-X/03 $17.00 © 2003 IEEE
The purpose of this paper is to relieve the performance
burden on XOM-type architecture. We propose to off-load
the crypto computation from the critical path. In the XOM
architecture, instructions and data cannot be used until they
are fetched from the memory and then decrypted. We
propose to perform decryption in parallel with a memory
access, overlapping crypto-computation time with memory
latency.
Our technique strives to maintain the same level of security
strength as the XOM architecture. Thus, our work
is based on its proposed mechanisms in handling potential
attacks. No attempts are made to enhance its security
level. Our design also requires extra on-chip storage, and we
studied the trade-offs between storage size and performance
improvement. Experimental results show this technique is
able to lower the 16.7% average performance loss of the XOM
architecture to only 1.28% over the insecure baseline processor.
The remainder of the paper is organized as follows. We
first describe briefly the XOM architecture in Section 2.
Then we elaborate on the idea of off-loading the cryptography
computation from the critical path in Section 3. We illustrate
the detailed architecture design in Section 4. In Section 5,
we show the experimental results on performance
gain with various hardware configurations. In Section 6, we
give a brief description of the related work, and conclude
the paper in Section 7.
2 XOM Architecture Overview
2.1 Software Encryption and Decryption
Background. There are two major types of cryptography
commonly used in information systems today: symmetric
key ciphers and asymmetric key ciphers (see Figure 1). In
symmetric key cryptography, communicating parties share
a common private key for encryption and decryption. The
advantage is that it runs as much as 1000 times faster than
comparable asymmetric key ciphers [7]. The primary obstacle
is the distribution of the private key to the information
exchange parties. Asymmetric key ciphers solve this problem
by implementing encryption using a key pair: a public key
and a private key. Information is encrypted using the publicly
available public key at the sender, and decrypted using the
private key which is kept secret by the receiver. Thus, the
sender can send information securely without knowing the
receiver’s private key.
XOM Software Encryption. The software that runs on the
XOM architecture is encrypted by the vendor. The encryption
not only protects the privacy of the software algorithm
but also guarantees that it can only run on the target processor.
To maximize security and performance, the software
is encrypted using a combination of symmetric and
asymmetric key cryptography. The vendor first encrypts the
software using some fast symmetric key cipher with private
key $K_s$. The decryption of the program using the same key
is relatively fast. The XOM chip is installed with a private
decryption key $K_p^{-1}$ of a public-key encryption pair.
The corresponding public key, $K_p$, is available to the public.
To communicate $K_s$ to the processor, the vendor
uses $K_p$ to encrypt it and ships it along with the software.
The execution of the protected software begins with computing
$K_s$ using $K_p^{-1}$, which is carried out only once but might
take a relatively long time, and continues with decrypting
instructions using $K_s$, which is much faster but is carried out
on every instruction fetched into the processor. In this way,
software encrypted for $\mathit{processor}_A$ cannot run on
$\mathit{processor}_B$ since they have distinct private keys.
[Figure: symmetric key cipher: Plaintext -> E(private key) -> Ciphertext -> D(private key) -> Plaintext;
asymmetric key cipher: Plaintext -> E(public key) -> Ciphertext -> D(private key) -> Plaintext;
in both cases the ciphertext travels from the sending side to the receiving side]
Figure 1. Illustration of symmetric and asymmetric ciphers
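The key distribution described above can be illustrated with a toy sketch. Everything in it is an assumption made for readability: textbook RSA with tiny primes stands in for the public-key pair, a SHA-256 derived XOR pad stands in for the fast symmetric cipher, and the names `PROCESSOR_A`, `PROCESSOR_B`, `vendor_package` and `processor_run` are hypothetical. None of it is secure or taken from the XOM design.

```python
import hashlib

def xor_cipher(data: bytes, key: int) -> bytes:
    """Toy symmetric cipher (self-inverse): XOR with a pad derived from the key."""
    pad = hashlib.sha256(key.to_bytes(4, "big")).digest()
    return bytes(b ^ pad[i % len(pad)] for i, b in enumerate(data))

# Hypothetical per-processor RSA parameters: modulus n, public e, private d.
PROCESSOR_A = {"n": 3233, "e": 17, "d": 2753}   # p = 61, q = 53
PROCESSOR_B = {"n": 2773, "e": 17, "d": 157}    # p = 47, q = 59

def vendor_package(program: bytes, k_s: int, target: dict):
    """Vendor side: encrypt the program with K_s, wrap K_s with the public key."""
    wrapped_key = pow(k_s, target["e"], target["n"])
    return xor_cipher(program, k_s), wrapped_key

def processor_run(package, chip: dict) -> bytes:
    """Processor side: unwrap K_s once with the private key, then decrypt."""
    ciphertext, wrapped_key = package
    k_s = pow(wrapped_key, chip["d"], chip["n"])
    return xor_cipher(ciphertext, k_s)

program = b"add r1, r2, r3"
package = vendor_package(program, k_s=65, target=PROCESSOR_A)
recovered_on_a = processor_run(package, PROCESSOR_A)   # the original program
recovered_on_b = processor_run(package, PROCESSOR_B)   # wrong key, garbage
```

The slow public-key operation runs once per program, while the symmetric step runs on every fetch, which is the split the text motivates.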
2.2 Interacting with External Memory
The XOM architecture adopts a complicated mechanism
for protecting program data privacy and providing memory
integrity verification. Ensuring privacy means keeping
data hidden from anyone for whom it is not intended.
This is achieved through data and instruction encryption.
Memory integrity verification detects whether the
memory has been tampered with by an adversary. This
is accomplished by creating a hash (MAC) value for each
memory block (see footnote 2). A cryptographic hash function can take
inputs of any length and produce a fixed length output. It
is “one-way”, meaning that it is relatively easy to compute but
computationally infeasible to find the original data given the
hash value. Hashing is especially useful against the
three types of attacks considered in XOM: spoofing, splicing,
and replay. The first two attacks were handled satisfactorily
in XOM. The handling of the replay attack was further developed
for better performance by Gassend et al. [11]. Thus,
we do not address the issue of verification and concentrate
on speeding up the encryption and decryption process in this
paper.
2 The block was chosen as an L2 cache line size.
[Figure: block diagram with Main Memory, L2 Cache, Write Buffer, and Encryption/Decryption Unit]
Figure 2. The lengthened XOM memory path.
To mitigate the performance impact, XOM pushes data
encryption and decryption through the memory hierarchy so
that it is only done when the data leaves the processor and
enters insecure memory. Thus, all the on-chip caches are
secure and store data and instructions in plaintext. Figure
2 illustrates the abstract model of the crypto procedure. A
two-level cache structure is assumed in the processor. Writing
to the memory is deferred through the write buffer. Every
dirty L2 cache line is encrypted first and then sent down
to memory. Likewise, every line read from the memory is
decrypted before it is stored in the L2 cache and used by the
program.
2.3 Internal Protection for Multi-tasking
A major effort in designing the XOM secure processor goes
to protecting interactions among multiple active tasks. Each
task is protected by a strict perimeter, termed a “compartment”.
Each compartment has its own ID and a secret key
which was used for encrypting the program. The compartment
ID is used in tagging data written into the registers and
the caches. This tagging ensures no program can access the
data of another program.
New instructions are added to support security functionalities.
They are used for handling start/termination of
XOM mode, communication between programs, traps and
interrupts, and storing and loading cryptographic data to
and from memory.
3 Offloading Crypto-Computation from Critical Path
In this section, we present a scheme that shifts the com-
putation intensive crypto-process off from the critical path.
First we analyze the performance degradation in XOM ar-
chitecture.
3.1 Motivation
As we can see from Figure 2, the crypto hardware lies
on the memory access critical path and therefore, the per-
formance decrease is obvious. Developing fast crypto hard-
ware has been the major focus recently to accelerate secu-
rity applications [24, 3, 21, 23]. However, in spite of the
effort in crafting the designs, the crypto-hardware here still
inserts long latency on memory access due to the computa-
tion intensive nature.
Figure 3 shows the performance degradation due to the
prolonged memory path in the XOM architecture. We tested
11 SPEC2000 [14] benchmarks with separate 32K
L1 instruction and data caches and a 256K unified L2 cache
on an out-of-order 4-issue processor simulated using SimpleScalar [4].
We assumed a typical 100 cycle memory latency
and a 50 cycle encryption/decryption delay similar to
that in [18]. Such fast hardware for widely used symmetric
ciphers, e.g. DES [9], is possible with ASIC designs
[10]. For stronger ciphers such as AES [1], a longer
encryption/decryption latency would apply, which in turn results
in longer average memory access latency. Thus, we give an
optimistic estimation of the potential performance degradation
in reality. On average, a 16.7% slowdown is imposed
on the programs. For memory intensive programs such as
database applications, the delay will be more severe.
[Figure: bar chart of performance slowdown [%] per benchmark]
Benchmark  Slowdown [%]
ammp       23.02
art        34.91
bzip2      15.82
equake     14.27
gcc        18.30
gzip        1.08
mcf        34.76
mesa        0.63
parser     13.39
vortex      7.05
vpr        21.16
Average    16.76
Figure 3. Optimistic estimation on performance loss due to encryption/decryption.
3.2 Proposed Solution
The difficulty in XOM lies in the fact that the crypto-computation
is data dependent on memory accesses, i.e.,
without knowing the data to be written out or brought in, the
encryption or decryption cannot begin. We propose to use
a different encryption algorithm to generate the ciphertext
in memory. The creation of the ciphertext must be data
independent of the memory access so that it can be carried out
in parallel, not in serial, with the memory operation. The
designated ciphertext must also be related to the memory
access so that each access has a unique ciphertext.
We propose to use an algorithm similar to “one-time
pad” encryption [22] for both data and instructions in memory.
In one-time pad encryption, the ciphertext is the
exclusive-or of the plaintext and a true random key:
$C = P \oplus \mathit{RandomKey}$ (1)
where $C$ is the ciphertext, $P$ is the plaintext data value, and
$\mathit{RandomKey}$ is a true random number having the same bit
width as $P$. In our model, we replace the $\mathit{RandomKey}$ with
an encrypted seed. The seed uniquely corresponds to the
plaintext, and can be generated regardless of its value (see
Section 3.4). Thus, with the “one-time pad” algorithm, the
encryption and decryption of a plaintext data value can be
expressed as the following:
$C = P \oplus \mathrm{Encrypt}_K(\mathit{seed})$ (2)
$P = C \oplus \mathrm{Encrypt}_K(\mathit{seed})$ (3)
where $C$ is stored in insecure memory, $P$ is the plaintext
data value, and $K$ is the private key shipped with the software.
Operationally, when $P$ is sent off chip, equation (2) is
used; when $C$ is read from memory, equation (3) is used.
Calculating $\mathrm{Encrypt}_K(\mathit{seed})$ is carried out while the processor is
waiting for the memory. Let us assume the memory access
latency is 100 cycles and computing $\mathrm{Encrypt}_K(\mathit{seed})$ takes 50
cycles as before. When $C$ is loaded from memory and arrives
at the processor, $\mathrm{Encrypt}_K(\mathit{seed})$ is already ready. With an
additional one-cycle XOR, $P$ can be obtained and sent to the
processor for execution. Thus, instead of having a 100+50 cycle
delay, we now reduce it to 101 (i.e. MAX(100, 50)+1).
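Equations (2) and (3) and the latency argument can be sketched as follows. This is an illustration under stated assumptions, not the hardware design: `encrypt_seed` uses SHA-256 from the standard library as a stand-in for the block cipher $\mathrm{Encrypt}_K$, and the key, seed value and function names are hypothetical.

```python
import hashlib

def encrypt_seed(key: bytes, seed: int, width: int = 8) -> bytes:
    """Stand-in for Encrypt_K(seed): a keyed hash truncated to `width` bytes
    (width <= 32). ASSUMPTION: the paper uses a block cipher such as DES or
    AES; SHA-256 is used only so the sketch runs with the standard library."""
    return hashlib.sha256(key + seed.to_bytes(8, "big")).digest()[:width]

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def otp_encrypt(key: bytes, seed: int, plaintext: bytes) -> bytes:
    """Equation (2): C = P xor Encrypt_K(seed)."""
    return xor_bytes(plaintext, encrypt_seed(key, seed, len(plaintext)))

def otp_decrypt(key: bytes, seed: int, ciphertext: bytes) -> bytes:
    """Equation (3): P = C xor Encrypt_K(seed)."""
    return xor_bytes(ciphertext, encrypt_seed(key, seed, len(ciphertext)))

def read_latency(mem_cycles: int = 100, crypto_cycles: int = 50,
                 overlapped: bool = True) -> int:
    """Cycles until plaintext is available on a memory read."""
    if overlapped:
        # The pad is computed while the memory access is in flight;
        # only the final XOR (one cycle) is added.
        return max(mem_cycles, crypto_cycles) + 1
    return mem_cycles + crypto_cycles   # decryption after the fetch

key = b"hypothetical-key"
block = bytes(range(8))
cipher = otp_encrypt(key, seed=0x1000, plaintext=block)
plain = otp_decrypt(key, seed=0x1000, ciphertext=cipher)
```

The point of the construction is that `encrypt_seed` never reads the data being fetched, so it can start as soon as the seed is known.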
3.3 Encryption Strength
Using the one-time pad encryption (equations 2 and 3)
achieves the same strength as a normal data encryption
where $C = \mathrm{Encrypt}_K(P)$. This can be seen through the
analogy between the proposed scheme and stream ciphers [20].
The stream cipher is similar to one-time pad. The difference
is that it uses a pseudorandom number stream instead of a
genuine random number stream. Many widely used encryption
algorithms such as AES [1] and 3DES [8] are believed
to do a good job in generating pseudo-random numbers. As
a result, $P \oplus \mathrm{Encrypt}_K(\mathit{seed})$ is as random as
$\mathrm{Encrypt}_K(P)$, where the selection of seed is discussed next
(see footnote 3).
3.4 Seed Selection
The purpose of encrypting a seed instead of the value itself
is to do it in parallel with the memory read operation. We do not
consider the penalty due to memory writes in this paper since
most processors are equipped with write buffers which can
steal idle bus cycles efficiently. Therefore, the seed must be
available at the time the read command is issued to memory.
It is also important to differentiate seeds for different
encryption units, i.e. blocks, to de-correlate program data.
Naturally, a seed derived from the location, e.g. address, of
a value is a good candidate. Let us see why using addresses
alone might be good in some cases and bad in the others.
3 The seed used here should not be confused with the seed that is used
in random number generation functions supported by many higher level
languages. As an example, the seed in the C function srand() represents a
starting point in a chain of “so called” random numbers. This is not the case
in our design. We treat the seed as an input to the encryption function.
Advantage: In the XOM model, every data value is encrypted
directly and stored in its memory location. This implies
that the same data values at different locations have the same
ciphertexts. It is known that the memory contains a lot of
repeated values [16, 25]. Thus even with encryption, the
repetition pattern is still preserved, creating potential security
holes.
Using the address of a data value as the seed in equation (2),
each $\mathrm{Encrypt}_K(\mathit{address})$ is different from the others.
Moreover, the property of an encryption function assures
that no patterns exist between sequential addresses, i.e.
$\mathrm{Encrypt}_K(\mathit{address})$ and $\mathrm{Encrypt}_K(\mathit{address}+1)$ are
completely unrelated, and hence so are the neighboring memory
ciphertexts.
Disadvantage: However, for the same location,
$\mathrm{Encrypt}_K(\mathit{address})$ remains the same every time a value
is written into memory. Thus, a series of data values 0,
1, 2, ... generated at address $\mathit{addr}$ will have a series of
ciphertexts $\mathrm{Encrypt}_K(\mathit{addr})$, $\mathrm{Encrypt}_K(\mathit{addr}) \oplus 1$,
$\mathrm{Encrypt}_K(\mathit{addr}) \oplus 2$, ..., which amounts to
$c$, $c \oplus 1$, $c \oplus 2$, ...
where $c$ is a constant. With little effort, the ciphertexts
stored in memory can be cracked by a skilled attacker.
Therefore, the seed used for such a series of writes should
not be a constant, i.e. it should vary. This is also pointed
out in the XOM architecture for saving register values on
OS interrupts. In such cases, a mutating value for varying
the XOM ID is employed for encrypting register values
on each interrupt event. To mutate the seeds in equation
(2), we choose to adopt a sequence number associated with
an address. The sequence number is updated every time it
is used. Thus, the encryption becomes
$C = P \oplus \mathrm{Encrypt}_K(\mathit{address} + \mathit{seq\_number})$.
The details will be fully described shortly.
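The weakness of an address-only seed, and the effect of mixing in a sequence number, can be demonstrated with a small sketch. As an assumption for illustration, `pad` uses SHA-256 in place of the block cipher, and the key, address and variable names are hypothetical.

```python
import hashlib

def pad(key: bytes, seed: int) -> int:
    """Stand-in for Encrypt_K(seed), returned as a 64-bit integer.
    ASSUMPTION: SHA-256 replaces the block cipher purely for illustration."""
    digest = hashlib.sha256(key + seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")

key = b"hypothetical-key"
addr = 0x4000
values = [0, 1, 2, 3]          # successive values written to the same address

# Address-only seed: every write reuses the same constant pad c.
fixed = [v ^ pad(key, addr) for v in values]
# An observer XORs each ciphertext with the first one; c cancels out.
leaked = [c ^ fixed[0] for c in fixed]

# Address plus sequence number: each write gets a fresh pad.
mutating = [v ^ pad(key, addr + seq) for seq, v in enumerate(values, start=1)]
masked = [c ^ mutating[0] for c in mutating]
```

With the fixed seed, `leaked` reproduces the differences between the written values without any knowledge of the key; with the mutating seed the same trick yields unrelated numbers.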
At this point, it is necessary to separate the situations for
encrypting instructions and data. The above analysis on the
disadvantage of using the address directly as the seed applies to
data writing only. For instructions, there are only read operations
as they are only loaded from but never written back
to memory. Therefore, a constant seed directly associated
with the instruction address can be used.
3.4.1 Encrypting Instructions
The instructions are encrypted by the vendor but are executed
on the customer’s processor. The vendor does not
know the actual addresses when the program is loaded into
the customer’s memory space for execution. Therefore, it is
easier for the vendor to use the virtual address starting from,
for example, $A_0$. Suppose the vendor chooses a symmetric
key $\mathit{Key}$ and the encryption function DES with a block size of 64
bits. Each instruction is 32-bit. A sequence of instructions
$I_0$, $I_1$, $I_2$, $I_3$, $I_4$, $I_5$, ... will be encrypted as:
$(I_0 \,\|\, I_1) \oplus \mathrm{DES}_{\mathit{Key}}(A_0)$
$(I_2 \,\|\, I_3) \oplus \mathrm{DES}_{\mathit{Key}}(A_0 + 1)$
$(I_4 \,\|\, I_5) \oplus \mathrm{DES}_{\mathit{Key}}(A_0 + 2)$
...
where “$\|$” means concatenating two 32-bit instructions into
a 64-bit data block. To decipher the program, the processor
simply adds to $A_0$ the offset of the current 64-bit instruction
block to the first instruction block, obtaining the seed
for the encryption. When the ciphered instruction is available
from memory, plaintext instructions can be computed
through XORing the encrypted seed in only one cycle.
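The instruction scheme above can be sketched as below. The sketch makes several assumptions that are not in the text: SHA-256 replaces DES, the start address is 0, odd-length programs are padded with a zero word, and the instruction words and function names are hypothetical.

```python
import hashlib
import struct

BLOCK_BYTES = 8   # one 64-bit block holds two 32-bit instructions

def pad(key: bytes, seed: int) -> bytes:
    """Stand-in for the encrypted seed (ASSUMPTION: SHA-256 instead of DES)."""
    return hashlib.sha256(key + seed.to_bytes(8, "big")).digest()[:BLOCK_BYTES]

def encrypt_program(instrs, key: bytes, start: int):
    """Vendor side: pair up instructions and XOR each pair with its pad."""
    if len(instrs) % 2:
        instrs = instrs + [0]              # hypothetical padding choice
    blocks = []
    for i in range(0, len(instrs), 2):
        plain = struct.pack(">II", instrs[i], instrs[i + 1])   # I_i || I_{i+1}
        p = pad(key, start + i // 2)       # seed = start + block offset
        blocks.append(bytes(a ^ b for a, b in zip(plain, p)))
    return blocks

def fetch_block(blocks, key: bytes, start: int, index: int):
    """Processor side: the pad depends only on the block index, so it can
    be prepared while the ciphertext is still on its way from memory."""
    p = pad(key, start + index)
    plain = bytes(a ^ b for a, b in zip(blocks[index], p))
    return struct.unpack(">II", plain)

key = b"hypothetical-key"
program = [0x8C010000, 0x00221820, 0xAC030004, 0x08000010]
encrypted = encrypt_program(program, key, start=0)
second_pair = fetch_block(encrypted, key, start=0, index=1)
```

Because instructions are never written back, the same seed can be reused for a block on every fetch without the leak discussed for data.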
3.4.2 Encrypting Memory Data
On-chip data are encrypted when they are evicted due
to cache conflicts. We assume a two-level cache structure
as in most high-performance processors. Similar to XOM,
encryption and decryption are done on a per L2 cache line
basis. Since we adopt sequence numbers on writes to the same
memory location, a sequence number is maintained for
each L2 cache line. The initial values of the seeds are the
virtual cache line addresses. It is incorrect to use the physical
line addresses since programs may be loaded to different
physical memory spaces on context switches. As cache
lines are transmitted across the chip boundary, the seeds
are increased by the corresponding line’s sequence number.
Thus, on the $(i+1)^{th}$ write to memory for a line $L$, the
following steps are taken in sequence:
$\mathit{SN}_L^{i+1} = \mathit{SN}_L^{i} + 1$ (4)
$\mathit{seed}_L^{i+1} = \mathit{VA}_L + \mathit{SN}_L^{i+1}$ (5)
$C_L = P_L \oplus \mathrm{Encrypt}_K(\mathit{seed}_L^{i+1})$ (6)
where $\mathit{SN}_L^{0} = 0$. On a line $L$ read:
$\mathit{seed}_L = \mathit{VA}_L + \mathit{SN}_L$ (7)
$P_L = C_L \oplus \mathrm{Encrypt}_K(\mathit{seed}_L)$ (8)
Reading a cache line may happen long after it was written
to the memory. To make sure the sequence number is available
when a line is being fetched, we need to remember the sequence
number that was previously assigned to the line. Next, we give the
details of the design of a special on-chip cache that stores
the sequence numbers. The sequence number cache should
be located within the security boundary as in the XOM architecture.
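The write and read steps in equations (4) to (8) can be sketched as follows. The sketch rests on assumptions that go beyond the text: SHA-256 in counter fashion stands in for block-by-block encryption of the seed, a plain dictionary stands in for sequence-number storage of unlimited size, and the class and method names are hypothetical.

```python
import hashlib

class LineCrypto:
    """Per-line one-time pad with sequence numbers (equations 4 to 8)."""

    def __init__(self, key: bytes):
        self.key = key
        self.seq = {}        # virtual line address -> sequence number

    def _pad(self, seed: int, n: int) -> bytes:
        # ASSUMPTION: a hash with a running counter produces n pad bytes.
        out, counter = b"", 0
        while len(out) < n:
            out += hashlib.sha256(self.key + seed.to_bytes(8, "big")
                                  + counter.to_bytes(4, "big")).digest()
            counter += 1
        return out[:n]

    def write_back(self, va: int, plaintext: bytes) -> bytes:
        self.seq[va] = self.seq.get(va, 0) + 1               # equation (4)
        seed = va + self.seq[va]                             # equation (5)
        pad = self._pad(seed, len(plaintext))
        return bytes(p ^ k for p, k in zip(plaintext, pad))  # equation (6)

    def read(self, va: int, ciphertext: bytes) -> bytes:
        seed = va + self.seq[va]                             # equation (7)
        pad = self._pad(seed, len(ciphertext))
        return bytes(c ^ k for c, k in zip(ciphertext, pad)) # equation (8)

unit = LineCrypto(b"hypothetical-key")
line = bytes(128)                       # a 128-byte L2 line of zeros
first = unit.write_back(0x8000, line)
second = unit.write_back(0x8000, line)  # same data, same address, new pad
restored = unit.read(0x8000, second)
```

Writing the same line twice produces two different ciphertexts, and only the most recent sequence number decrypts the copy currently in memory, which is why the number must be remembered on chip.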
4 Architecture Design
As clariﬁed earlier, an on-chip sequence number cache
(SNC) is needed in order to store sequence numbers for
each cache line that goes off-chip. Thus, we place the SNC
below the L2 cache and monitor the trafﬁc between L2 and
the memory. Figure 4 illustrates the architecture of the ab-
stract partial XOM model with our SNC.
[Figure: block diagram showing Main Memory outside the security boundary and, inside it, the L2 Cache, Write Buffer, Encryption/Decryption Unit, and Sequence Number Cache; signal labels: Physical Address, Virtual Address, read, write]
Figure 4. Design of one-time pad encryption on data with sequence number cache.
The SNC should be accessed using the virtual address
(VA) of an L2 cache line. This is because the physical address
of a line may be changed after a context switch, losing the
encryption seed information. However, using the VA to index the
SNC may incur synonym problems in which two different
VAs map to the same physical address. The result
is that two different sequence numbers may be generated
for the same physical line. The synonym problem happens
when either the OS and the user, or two users, want to share a
memory segment. The XOM architecture is very restrictive
in sharing data among tasks, including the OS. The solution
proposed by XOM is to share a key among tasks that have
synonyms, which is considered vulnerable. Since this is still
an open problem in XOM, we choose not to perform one-time
pad encryption on such shared data. In other words, the SNC
does not store the sequence numbers for memory segments
that are aliased by two different virtual addresses.
In conventional cache design, the VA will not be avail-
able beyond L1 caches, and the L2 cache is physically ad-
dressed. Thus, the VA of each L2 cache line should be kept
within the L2 cache. The stored VA can be then used to
address SNC on a cache write back. The storage incurred
due to storing VA’s in L2 cache is very modest. For exam-
ple, in a 256KB L2 cache having 128B each line, 40 bits of
a 48-bit VA (e.g. in Alpha architecture) need to be stored,
enlarging L2 cache by 4.0%.
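As a rough check of the storage figure, the arithmetic can be written out. The calculation below counts data storage only and ignores existing tag and state bits, which is an assumption made here rather than something stated in the text:

```latex
\begin{align*}
\text{lines in L2} &= \frac{256\,\text{KB}}{128\,\text{B}} = 2048,\\
\text{VA storage} &= 2048 \times 40\ \text{bits} = 81920\ \text{bits} = 10\,\text{KB},\\
\text{overhead} &= \frac{10\,\text{KB}}{256\,\text{KB}} \approx 3.9\%.
\end{align*}
```

Under that assumption the result is about 3.9%, in line with the roughly 4% quoted above.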
Ideally, the SNC should store all the sequence numbers
of memory lines. Taking a 1GB memory and a 128B line
size as an example, 8M (1GB/128B) sequence numbers
need to be remembered. Asking for an on-chip cache holding
8M sequence numbers is unrealistic. We therefore provide only a limited
sized SNC which stores sequence numbers efficiently. To
remove conflict misses as much as possible, a fully associative
cache is desired. A fully associative cache normally
provides the best hit rate but occupies a larger chip area and
takes longer to access. Normally, a highly associative
cache, e.g. 32-way or 64-way, would perform equally well.
We will present most of our experimental results using a
fully associative SNC implementation in Section 5, and also
show the results with a 32-way set associative SNC.
With a limited amount of SNC storage, not all sequence
numbers can be stored on-chip. Thus, when the SNC is
full, no further sequence numbers can be stored unless
some stored contents are evicted. If so, where will the
evicted sequence numbers be stored? Complications arise
as to whether a replacement policy should be employed, and
what may happen with or without a replacement policy.
4.1 SNC Operation Policy
With Replacement. If replacements are carried out in the SNC,
we need to decide where the evicted sequence numbers
should be stored. Clearly, we cannot discard
them since otherwise their corresponding memory lines
could not be deciphered. The only solution
is then to store them in the insecure memory. To protect the
privacy of these sequence numbers, we choose to encrypt
them just as normal program data. It should be noted that
even without encryption, the on-chip one-time pad encryption
remains secure since the private key is not revealed.
However, it is not preferred that the sequence numbers are
encrypted using one-time pad again since they themselves
would need sequence numbers! Therefore, we choose to
use encryption on the sequence numbers directly, just as in the
XOM solution.
The advantage of allowing replacement in the SNC is to
make one-time pad encryption available to as many memory
lines as possible. If LRU replacement is adopted, the SNC
will catch frequently used sequence numbers in the long run
so as to reduce the SNC capacity misses. However, each
replacement incurs another memory access plus the encryption
latency of the contents. Although this does not necessarily
happen on the critical path, it imposes additional memory
traffic and may compete with other memory requests
that are critical. Thus, the number of replacements should be
small enough to overcome the above defect. Using LRU, in
this sense, helps reduce the SNC replacement frequency.
With No Replacement. An alternative way is to disallow
replacements. In such a situation, the one-time pad encryption
is carried out as long as there are vacant slots in the SNC.
When the SNC is full, cache lines whose sequence numbers are
not stored in the SNC will not be able to use one-time
pad encryption. Consequently, they should be encrypted
directly and sent to memory. The advantage of the no-replacement
policy is its simplicity. The disadvantage is, however, that
only some memory lines can employ one-time pad encryption;
the rest are treated the same way as in XOM. We will
show in Section 5 that using LRU is more advantageous
than the no-replacement SNC.
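The two policies can be contrasted with a small behavioral sketch. It assumes a fully associative structure keyed by virtual line address; the class name, the `spilled` dictionary modeling sequence numbers kept in memory, and the method names are hypothetical, and timing and the direct encryption of spilled entries are left out.

```python
from collections import OrderedDict

class SequenceNumberCache:
    """Sketch of an SNC with either LRU replacement or no replacement."""

    def __init__(self, capacity: int, replace: bool):
        self.capacity = capacity
        self.replace = replace
        self.entries = OrderedDict()   # va -> sequence number, in LRU order
        self.spilled = {}              # evicted numbers (LRU policy only)

    def lookup(self, va: int):
        """Query: return the sequence number on a hit, None on a miss."""
        if va in self.entries:
            self.entries.move_to_end(va)       # mark as most recently used
            return self.entries[va]
        return None

    def update(self, va: int):
        """Update on write-back: return the new sequence number, or None if
        the line must fall back to direct (XOM-style) encryption."""
        if va not in self.entries:
            if len(self.entries) >= self.capacity:
                if not self.replace:
                    return None                # SNC full, no free entry
                victim, seq = self.entries.popitem(last=False)
                self.spilled[victim] = seq     # evict least recently used
            self.entries[va] = self.spilled.pop(va, 0)
        self.entries[va] += 1
        self.entries.move_to_end(va)
        return self.entries[va]

lru = SequenceNumberCache(capacity=2, replace=True)
fixed = SequenceNumberCache(capacity=2, replace=False)
```

With `replace=False` the third distinct line written simply gets `None` and is handled as in XOM, while with `replace=True` it displaces the least recently used entry, whose number has to live in memory until it is needed again.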
4.2 Algorithm
In this section, we discuss the SNC query (i.e. read)
and update (i.e. write) operations in detail. To be
clear, we categorize the operations into query hits, update
hits, query misses, and update misses. The SNC is filled
by update operations and looked up through query operations.
SNC Hits. A query hit in the SNC happens when a read miss
occurs at L2 and the target line’s sequence number is stored
in the SNC. A seed is then calculated using equation 7. After
the memory access returns, the plaintext value is
obtained by applying equation 8. Assuming a 100-cycle memory
latency and a 1-cycle XOR, the value is ready for the CPU
at the 101st cycle. An update hit in the SNC happens when an
L2 cache line is evicted to the memory and this line’s
sequence number is stored in the SNC. The sequence number
is updated according to equation 4. The seed is formed and the line
is ciphered using equations 5 and 6 respectively. Note that
the evicted line should first go to the write buffer (Figure 4)
and is later flushed to the memory on certain conditions.
Thus, the encryption can be done while it remains in the
write buffer. With the SNC, the delay is nearly the same as in
XOM except that the XOR takes one additional cycle.
Since the write operation is not on the critical path, its impact
on overall performance is not a big concern.
SNC Misses. Misses in the SNC are more complex, especially
in supporting LRU replacement. We will separate the no-replacement
and LRU replacement designs for clarity. In the
no-replacement SNC, an update miss means no free entries
are available. At this time, the cache line has to be encrypted
directly as in XOM. A query miss means the corresponding
L2 cache line’s sequence number is not stored in the SNC.
As mentioned earlier, those lines were encrypted directly.
Thus after the line is fetched from the memory, it should go
through the decryption unit, which takes another 50 cycles on
top of the 100 cycles.
With LRU replacement, every L2 cache line has a sequence
number. Those that cannot fit in the SNC
are stored in memory. As pointed out earlier, sequence numbers
in memory should also be encrypted (directly). On
an SNC query miss, a memory access is needed to fetch the
target sequence number, followed by the decryption. Thus,
each query miss incurs 150 cycles before the seed encryption
can start, becoming the most expensive operation. Similarly,
an update miss in the SNC also needs to access memory
and decrypt the ciphered sequence number. Since this is
carried out while the cache line is in the write buffer, the impact
is less significant. Algorithm 1 gives the pseudo-code for handling
the SNC misses.
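The read latencies discussed for hits and misses can be collected in a small model. The cycle counts are the ones assumed in the text; the scheme labels and the function name are hypothetical, and the LRU miss case additionally assumes that the line fetch does not start before the sequence number has been fetched and decrypted.

```python
MEM = 100      # memory access latency in cycles
CRYPTO = 50    # encryption/decryption latency in cycles
XOR = 1        # final XOR with the pad

def read_latency(scheme: str, snc_hit: bool = True) -> int:
    """Cycles from an L2 read miss until plaintext is available."""
    if scheme == "xom":
        return MEM + CRYPTO                    # decrypt after the fetch
    if scheme not in ("snc-no-replacement", "snc-lru"):
        raise ValueError(f"unknown scheme: {scheme}")
    if snc_hit:
        return max(MEM, CRYPTO) + XOR          # pad computed during the fetch
    if scheme == "snc-no-replacement":
        return MEM + CRYPTO                    # line was encrypted directly
    # LRU miss: fetch and decrypt the sequence number (150 cycles), then
    # overlap pad generation with the line fetch (ASSUMPTION, see lead-in).
    return (MEM + CRYPTO) + max(MEM, CRYPTO) + XOR

for scheme, hit in [("xom", True), ("snc-lru", True),
                    ("snc-no-replacement", False), ("snc-lru", False)]:
    print(scheme, "hit" if hit else "miss", read_latency(scheme, hit))
```

Under these assumptions an LRU query miss is the slowest case, which is consistent with the text calling it the most expensive operation.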
Algorithm 1 Pseudo-code for handling SNC misses employing LRU replacement
1: if SNC query miss on cache line $L$ then
2:   $\mathit{SN}'_L \leftarrow$ read memory for $L$'s sequence number
3:   $\mathit{SN}_L \leftarrow \mathrm{Decrypt}_K(\mathit{SN}'_L)$;
4:   for each block $\mathit{VA}_j$ in $L$'s virtual address $\mathit{VA}$ do
5:     $S_j \leftarrow \mathrm{Encrypt}_K(\mathit{VA}_j + \mathit{SN}_L)$; /* executed in fully pipelined engine, in parallel with line 7 */
6:   end for
7:   $\mathit{cipher} \leftarrow$ read $L$ from memory
8:   for each block $C_j$ in $\mathit{cipher}$ do
9:     $P_j \leftarrow C_j \oplus S_j$; /* $P_j$'s form plaintext for $L$ */
10:  end for
11:  replace a victim $\mathit{SN}_v$ in SNC with $\mathit{SN}_L$;
12:  push $\mathit{SN}_v$ in write buffer; /* to be encrypted later */
13: else if SNC update miss on cache line $L$ then
14:  $\mathit{SN}'_L \leftarrow$ read memory for $L$'s sequence number
15:  $\mathit{SN}_L \leftarrow \mathrm{Decrypt}_K(\mathit{SN}'_L)$;
16:  $\mathit{SN}_L \leftarrow \mathit{SN}_L + 1$;
17:  for each block $\mathit{VA}_j$ in $L$'s virtual address $\mathit{VA}$ do
18:    $S_j \leftarrow \mathrm{Encrypt}_K(\mathit{VA}_j + \mathit{SN}_L)$; /* executed in fully pipelined engine */
19:  end for
20:  for each block $P_j$ in plaintext $L$ do
21:    $C_j \leftarrow P_j \oplus S_j$; /* $C_j$'s compose ciphertext of $L$ */
22:  end for
23:  write ciphertext of $L$ into memory;
24:  replace a victim $\mathit{SN}_v$ in SNC with $\mathit{SN}_L$;
25:  push $\mathit{SN}_v$ in write buffer; /* to be encrypted later */
26: end if
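A functional sketch of the miss handling in Algorithm 1 is given below. It models only the data flow, not the timing or the write buffer, and it rests on assumptions that are not in the text: SHA-256 stands in for the block cipher, `direct` is a placeholder for the direct encryption of spilled sequence numbers, dictionaries stand in for memory, and all names are hypothetical.

```python
import hashlib
from collections import OrderedDict

KEY = b"hypothetical-key"

def pad(seed: int, n: int) -> bytes:
    """Stand-in for encrypting the per-block seeds of one line."""
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(KEY + seed.to_bytes(8, "big")
                              + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def direct(value: int) -> int:
    """Placeholder for direct encryption/decryption of a sequence number
    (a self-inverse mask; ASSUMPTION made only to keep the sketch short)."""
    return value ^ 0x5A5A5A5A

class Processor:
    def __init__(self, snc_entries: int):
        self.snc = OrderedDict()   # va -> sequence number (on chip)
        self.capacity = snc_entries
        self.memory = {}           # va -> ciphertext line
        self.seq_memory = {}       # va -> encrypted sequence number

    def _fill(self, va: int) -> None:
        """Bring va's sequence number on chip, evicting the LRU victim."""
        if va in self.snc:
            self.snc.move_to_end(va)
            return
        # Lines 2-3 and 14-15: fetch and decrypt the stored number.
        seq = direct(self.seq_memory.pop(va)) if va in self.seq_memory else 0
        if len(self.snc) >= self.capacity:
            victim, vseq = self.snc.popitem(last=False)   # lines 11 and 24
            self.seq_memory[victim] = direct(vseq)        # lines 12 and 25
        self.snc[va] = seq

    def write_back(self, va: int, plaintext: bytes) -> None:
        self._fill(va)
        self.snc[va] += 1                                 # line 16
        seed = va + self.snc[va]
        self.memory[va] = xor(plaintext, pad(seed, len(plaintext)))  # 17-23

    def read(self, va: int) -> bytes:
        self._fill(va)
        cipher = self.memory[va]                          # line 7
        return xor(cipher, pad(va + self.snc[va], len(cipher)))  # lines 4-9

cpu = Processor(snc_entries=1)
cpu.write_back(0x100, b"A" * 16)
cpu.write_back(0x200, b"B" * 16)   # evicts the sequence number of 0x100
```

With a single SNC entry every access to the other line is a miss, so reading back both lines exercises the query-miss path, including the eviction of the current victim.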
4.3 Other Security and Implementation Issues
Context switching. One of the difficulties we identified is
handling context switches. On a context switch, the XOM
architecture employs expensive operations so as not to leak
information to a potentially malicious OS or to other users. The
contents of our SNC should also be protected, as the new user may
use it for its own purposes. There are two ways of protecting the
SNC: 1) flushing it to memory with encryption; and 2) tagging
each entry with a XOM ID. Each method incurs a long latency,
either during the context switch or after it. Fortunately,
context switches do not occur very often. The impact on overall
performance in multi-tasking systems remains an open question.
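The two protection options can be contrasted in a short sketch. It assumes, hypothetically, that each SNC entry is keyed by the pair (XOM ID, line address); the class name TaggedSNC and its methods are illustrative and not part of the proposed hardware.

```python
from collections import OrderedDict


class TaggedSNC:
    """Sequence number cache whose entries are keyed by (XOM ID, line address)."""

    def __init__(self, entries):
        self.entries = entries
        self.table = OrderedDict()  # LRU order, oldest first

    def lookup(self, xom_id, line_addr):
        key = (xom_id, line_addr)
        if key not in self.table:
            return None             # miss: absent, or owned by another XOM ID
        self.table.move_to_end(key)
        return self.table[key]

    def update(self, xom_id, line_addr, seq):
        key = (xom_id, line_addr)
        self.table[key] = seq
        self.table.move_to_end(key)
        if len(self.table) > self.entries:
            # Victim must be encrypted and written back to memory.
            return self.table.popitem(last=False)
        return None

    def flush(self, xom_id):
        """Option 1: remove (and return for encryption) all entries of one XOM ID."""
        victims = [(k, v) for k, v in self.table.items() if k[0] == xom_id]
        for k, _ in victims:
            del self.table[k]
        return victims


snc = TaggedSNC(entries=4)
snc.update(1, 0x1000, 7)
print(snc.lookup(2, 0x1000))  # None: another compartment cannot see the entry
print(snc.lookup(1, 0x1000))  # 7
```

With tagging, nothing is written at the switch itself, but entries of the suspended program are gradually evicted and must be re-fetched later; with flushing, the whole cost is paid at the switch.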
Shared library and program inputs. If the software package
contains shared library code, e.g., .dll files, that code is
meant to be used by multiple users. Therefore, library code
should be provided in plaintext. Similarly, program inputs are
also provided in plaintext since they are brought in from I/O
devices. As a result, the memory space they occupy does not need
sequence numbers in the SNC.
5 Experiment Evaluation
We implemented the two schemes in order to compare the one-time
pad encryption scheme with XOM. We used the SimpleScalar Tool Set
[4] to run 11 SPEC2000 [14] benchmarks, and compared performance
for various algorithms and configurations. The benchmarks are
fast-forwarded by 10 billion instructions to warm up the pipeline
as well as the L1 and L2 caches, and then executed for another
10 billion instructions so that they finish within a reasonable
amount of time. Our baseline is a 4-issue out-of-order processor
with separate 32KB, 4-way L1 instruction and data caches, plus a
unified 256KB, 4-way L2 cache with 128B lines. We set the memory
access latency to a typical 100 cycles and the
encryption/decryption delay to 50 cycles, as before. All other
parameters are set to the default values provided by
SimpleScalar.
5.1 Performance Comparison
The first set of experiments measures the performance loss due to
security operations. We compare the XOM architecture with
one-time pad encryption using an SNC. As described in Section 4,
the SNC can either allow or disallow sequence number
replacements. We plot the results for both schemes in Figure 5.
Here the SNC is set to 64KB, with each sequence number taking
2 bytes. Thus, there are 32K numbers stored in the SNC, covering
32K L2 cache lines, or 4MB of memory data space. The graph
clearly shows that our scheme drastically reduces the performance
loss: from 16.7% to 4.59% for the no-replacement SNC and to 1.28%
for the LRU SNC. We can draw two conclusions from these results:
1. Using one-time pad encryption is an excellent solution for
minimizing the performance degradation of secure processors. The
1.28% slowdown of the LRU SNC design is not noticeable to the
user and thus increases the practicality of a secure processor.
2. The difference between no-replacement and LRU shows that using
the latter is beneficial in the long run, since it captures
relatively frequently accessed cache lines. For example, the
benchmark gcc shows a big difference between the two because
sequence numbers filled into the SNC initially are hardly used
later.
[Figure 5: bar chart, program slowdown [%] per benchmark]
Benchmark  XOM    SNC-NoRepl  SNC-LRU
ammp       23.02  4.57        2.76
art        34.91  0.23        0.23
bzip2      15.82  1.04        0.56
equake     14.27  0.06        0.06
gcc        18.30  18.07       1.40
gzip       1.08   0.51        0.31
mcf        34.76  13.51       6.44
mesa       0.63   0.24        0.07
parser     13.39  6.94        0.95
vortex     7.05   5.02        1.03
vpr        21.16  0.24        0.24
Average    16.76  4.59        1.28
Figure 5. Performance comparison for XOM,
SNC with LRU and no cache replacement.
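The SNC sizing quoted in this subsection follows from simple arithmetic, sketched below. The helper name snc_coverage is hypothetical; the 2-byte sequence numbers and 128B L2 lines are the values used in the experiments.

```python
def snc_coverage(snc_bytes, seq_bytes=2, line_bytes=128):
    """Return (number of sequence numbers held, bytes of memory they cover)."""
    entries = snc_bytes // seq_bytes
    return entries, entries * line_bytes


for kb in (32, 64, 128):
    entries, covered = snc_coverage(kb * 1024)
    print(f"{kb}KB SNC: {entries // 1024}K entries, {covered // (1024 * 1024)}MB covered")
# 32KB SNC: 16K entries, 2MB covered
# 64KB SNC: 32K entries, 4MB covered
# 128KB SNC: 64K entries, 8MB covered
```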
5.2 SNC of 32KB, 64KB, and 128KB
The second set of experiments examines how sensitive our scheme
is to the SNC size. We tested 32KB, 64KB, and 128KB SNCs with LRU
replacement. Figure 6 shows the execution slowdown as a
percentage of the baseline. A smaller SNC performs worse than a
larger one: the average slowdown is 3.25%, 1.28%, and 0.51% for
the 32KB, 64KB, and 128KB SNC, respectively. Since a 128KB
on-chip cache may be too costly for processors, we conclude that
64KB is the best choice among the three.
[Figure 6: bar chart, slowdown for different SNC sizes [%] per benchmark]
Benchmark  32KB   64KB   128KB
ammp       4.36   2.76   0.41
art        0.23   0.23   0.23
bzip2      1.61   0.56   0.34
equake     7.58   0.06   0.06
gcc        1.44   1.40   1.29
gzip       0.33   0.31   0.30
mcf        15.23  6.44   1.45
mesa       0.14   0.07   0.01
parser     2.70   0.95   0.57
vortex     1.86   1.03   0.70
vpr        0.24   0.24   0.24
avg        3.25   1.28   0.51
Figure 6. Performance comparison for different sized SNC. LRU
replacement is used.
5.3 SNC of Different Associativity
The third set of experiments examines whether a fully associative
SNC is really necessary. Implementing a 64KB fully associative
cache might be expensive. We therefore ran the benchmarks with a
32-way, 64KB SNC and compared it with the fully associative 64KB
SNC. Figure 7 plots the results. Apart from one benchmark, ammp
(whose slowdown increases from 2.8% to 9.6%), the remaining
programs perform equivalently with the two caches. Sometimes the
32-way SNC is even slightly better. Therefore, in most cases, a
32-way SNC performs as well as a fully associative SNC.
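To make the two organizations concrete, the sketch below derives the geometry of a 64KB SNC with 2-byte entries. The index function (low-order bits of the L2 line number) is an assumption made for illustration; the text does not specify how set indices are formed, and the function names are hypothetical.

```python
def snc_geometry(snc_bytes=64 * 1024, seq_bytes=2, ways=32):
    """Return (entries, sets) for an SNC of the given size and associativity."""
    entries = snc_bytes // seq_bytes
    return entries, entries // ways


def snc_set_index(line_addr, line_bytes=128, **geometry):
    """Hypothetical index function: low-order bits of the L2 line number."""
    _, sets = snc_geometry(**geometry)
    return (line_addr // line_bytes) % sets


print(snc_geometry())                # (32768, 1024)
print(snc_geometry(ways=32768))      # (32768, 1): fully associative
print(snc_set_index(128))            # 1
print(snc_set_index(1024 * 128))     # 0, conflicts with line address 0
```

Under this assumed mapping, lines whose addresses differ by a multiple of 128KB compete for the same 32 ways, which is one plausible way a single benchmark could suffer from conflict misses while the others do not.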
[Figure 7: bar chart, slowdown for different SNC associativity [%] per benchmark]
Benchmark  fully associative  32-way set associative
ammp       2.76               9.62
art        0.23               0.23
bzip2      0.56               0.55
equake     0.06               0.18
gcc        1.40               1.38
gzip       0.31               0.31
mcf        6.44               6.34
mesa       0.07               0.07
parser     0.95               0.94
vortex     1.03               1.03
vpr        0.24               0.24
avg        1.28               1.90
Figure 7. Performance comparison between fully associative and
32-way set associative SNCs.
5.4 Larger L2 vs. L2+SNC
The fourth set of experiments shows that the added on-chip SNC
storage is indeed effective. We compare the execution time of the
LRU SNC with that of a XOM architecture that has a larger L2
cache. A fair comparison requires that the enlarged L2 occupy the
same chip area as the original L2 plus the SNC, since cache area
does not grow linearly with capacity. We used CACTI 3.2 [5] to
obtain the area estimates. We found that a 64KB 32-way set
associative SNC on top of a 4-way 256KB L2 cache occupies a chip
area between that of a 5-way 320KB and a 6-way 384KB L2 cache. We
therefore compare our configuration with XOM having a 6-way 384KB
L2 cache. Figure 8 plots the execution time normalized to the
baseline with a 4-way 256KB L2 cache. With the same amount of
on-chip area, our one-time pad encryption scheme still
outperforms XOM on average (2% vs. 9% slowdown). With the larger
L2, the programs gcc, mesa and vortex show speedups of 4%, 1% and
7% in execution time compared to the baseline. This is because
with a 50% capacity increase in L2, almost everything in these
programs fits into L2, and thus the need to go off chip is
greatly reduced. These experiments show that, in general, a
larger L2 cache cannot mitigate the performance impact of XOM,
while one-time pad encryption performs satisfactorily.
[Figure 8: bar chart, normalized execution time w.r.t. baseline per benchmark]
Benchmark  XOM-256KL2  XOM-384KL2  SNC-32way-LRU-256KL2
ammp       1.23        1.20        1.10
art        1.35        1.35        1.00
bzip2      1.16        1.03        1.01
equake     1.14        1.14        1.00
gcc        1.18        0.96        1.01
gzip       1.01        1.00        1.00
mcf        1.35        1.32        1.06
mesa       1.01        0.99        1.00
parser     1.13        1.02        1.01
vortex     1.07        0.93        1.01
vpr        1.21        1.04        1.00
avg        1.17        1.09        1.02
Figure 8. Impact of a larger L2 cache.
5.5 SNC Induced Memory Trafﬁc
The fifth set of experiments measures the memory traffic induced
by SNC LRU replacements. The results are reported as percentages
of the L2 cache memory traffic; see Figure 9. The effect of SNC
replacement on memory traffic is negligible. For several
benchmarks (art, equake, and vpr), the increase is almost zero.
On average, SNC replacements add only 0.31% of the L2 memory
traffic to the system bus. This also explains why the SNC with
LRU performs best even though replacements are expensive.
[Figure 9: bar chart, percentage over L2 memory accesses per benchmark]
Benchmark  Additional traffic
ammp       0.32%
art        0.00%
bzip2      0.09%
equake     0.00%
gcc        0.05%
gzip       1.03%
mcf        0.47%
mesa       0.90%
parser     0.18%
vortex     0.39%
vpr        0.00%
avg        0.31%
Figure 9. SNC induced additional memory traffic (64KB SNC).
5.6 Sensitivity to Encryption Latency
Our one-time pad encryption has the advantage that it is
insensitive to the cryptography latency. Compared with the XOM
memory latency, (memory access latency + decryption latency), the
new memory latency on cache read misses (which is critical to
speed) is now MAX(memory access latency, encryption latency) + 1.
Therefore, we performed experiments that use a different
encryption/decryption latency, 102 cycles [12]. Results are shown
in Figure 10. XOM clearly degrades greatly with the prolonged
encryption latency: from 16.7% to 34.2% slowdown. This is because
a 102-cycle decryption roughly doubles the original 100-cycle
memory latency. In our design with LRU replacement, by contrast,
the performance is almost unchanged: from 1.28% to about 1.3%.
The difference between the no-replacement policy and LRU also
shows that the latter is much more effective than the former.
This result enhances the usability and attractiveness of our
proposed one-time pad encryption.
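The two latency expressions can be evaluated for the parameters used in the experiments (100-cycle memory access, 50- or 102-cycle cryptography). The sketch below is a back-of-the-envelope model only; the function names are hypothetical, the extra cycle is taken to be the final XOR, and the one-time pad figure applies when the sequence number is found in the SNC.

```python
def xom_read_latency(mem, crypto):
    """XOM: decryption starts only after the line arrives from memory."""
    return mem + crypto


def otp_read_latency(mem, crypto, xor=1):
    """One-time pad: pad generation overlaps the memory access, then one XOR."""
    return max(mem, crypto) + xor


for crypto in (50, 102):
    print(crypto, xom_read_latency(100, crypto), otp_read_latency(100, crypto))
# 50 150 101
# 102 202 103
```

Doubling the cryptography latency adds 52 cycles to every XOM read miss but only 2 cycles to a one-time pad read miss in this model, which is consistent with the trend reported in Figure 10.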
[Figure 10: bar chart, program slowdown [%] per benchmark with a 102-cycle crypto delay]
Benchmark  XOM    SNC-NoRepl  SNC-LRU
ammp       46.95  8.95        2.72
art        71.21  0.23        0.23
bzip2      32.27  1.82        0.56
equake     29.10  0.06        0.06
gcc        37.36  36.89       1.38
gzip       2.21   1.04        0.30
mcf        70.91  27.30       6.32
mesa       1.28   0.48        0.07
parser     27.32  14.02       0.94
vortex     14.42  10.23       1.01
vpr        43.16  0.24        0.24
avg        34.20  9.21        1.26
Figure 10. Performance comparison using a longer delay for
encryption/decryption unit.
6 Related Work
The research most closely related to ours is the fast hashing
mechanism for memory integrity verification [11], which addresses
defense against replay attacks on XOM-type architectures. The
solution is to build hash trees and integrate them into the
on-chip caches to speed up verification of the untrusted memory.
However, data privacy and its computation overhead are not
considered.
One of the early hardware techniques for protecting software
copyright is to use a tamper-resistant plug-in module, the
“dongle”. Software is sold together with a dongle. It
periodically queries the dongle based on an authorization
protocol. If the dongle does not respond, the software halts.
However, a skilled programmer can easily analyze the machine code
and disable the software protection functions.
Another type of secure processor, the bus-encryption
microprocessor, has been used for almost a decade in 8-bit
microcontrollers such as the Dallas Semiconductor DS5000 series.
Its applications range from credit card terminals and ATMs to
pay-TV access control devices and communication encryption
modules [15]. In such microprocessors, software is stored in
encrypted form outside the CPU and decrypted only when it is read
into the chip. Both the data and address bus values are encrypted
when data is sent to external memory. Bus-encryption
microprocessors target single-application environments in which
the code size is usually very small.
Fast cryptographic co-processors have been developed to support
security applications for Internet communication and e-commerce
[3, 24, 21, 23]. Such a co-processor can support multiple ciphers
simultaneously at competitive speed. The model we use in this
paper is fundamentally different from those co-processors, since
ciphers are implemented directly on the main processor and we do
not trust any component other than the main CPU.
Many software techniques have been proposed to provide a certain
level of copyright and intellectual property protection.
Obfuscation attempts to transform the code into a form that is
harder to reverse engineer. Tamper-proofing causes a program to
malfunction when it detects that it has been modified. Software
watermarking embeds a copyright notice in the software code to
allow the owners of the software to assert their intellectual
property rights [6]. These software techniques discourage
software theft and can trace piracy and prove ownership, but
cannot prevent copying itself.
7 Conclusion
We proposed to use a fast cryptography method, one-time pad
cryptography, to speed up the execution of a secure processor. In
our design, the cryptography computation is moved off the
processor's critical path and is carried out in parallel with
memory access. We developed the new cryptography scheme and its
hardware support. Experiments show that our technique reduces the
performance overhead from 16.7% for critical path cryptography to
1.28% for one-time pad cryptography.
References
[1] “Advanced Encryption Standard (AES) Development Effort,”
US Government, http://csrc.nist.gov/encryption/aes/.
[2] International Planning and Research Corporation,
“Sixth Annual BSA Global Software Piracy Study,”
http://www.bsa.org/resources/2001-05-21.55.pdf, 2001.
[3] J. Burke, J. McDonald, and T. Austin, “Architectural Sup-
port for Fast Symmetric-Key Cryptography,” ACM 9th Interna-
tional Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS-IX), November
2000.
[4] D. Burger and T. Austin, “The SimpleScalar Tool Set, Ver-
sion 2.0,” Technical Report 1342, Univ. of Wisconsin-Madison,
Comp. Sci. Dept., 1997.
[5] CACTI 3.2, HP-Compaq Western Research Lab,
http://research.compaq.com/wrl/people/jouppi/CACTI.html.
[6] C. Collberg and C. Thomborson, “Watermarking, Tamper-
Prooﬁng, and Obfuscation — Tools for Software Protection,”
IEEE Transactions on Software Engineering, Vol. 28, No. 8,
August 2002.
[7] “An Introduction to Cryptography,” Network Associates, Inc.,
http://www.pgpi.org/doc/pgpintro, 1999.
[8] D. W. Davies and W. L. Price, “Security for Computer Net-
works,” Wiley, 1989.
[9] “Data Encryption Standard (DES),” Federal Information Pro-
cessing Standards Publication 46-2, December, 1993.
[10] H. Eberle and C. Thacker, “A 1 Gbit/second GaAs DES chip,”
IEEE Custom Integrated Circuits Conference, pages 19.7.1–19.7.4,
May 1992.
[11] B. Gassend, G. E. Suh, D. Clarke, M. v. Dijk, and S. De-
vadas, “Caches and Hash Trees for Efﬁcient Memory Integrity
Veriﬁcation,” The 9th International Symposium on High Per-
formance Computer Architecture (HPCA9), pages, February
2003.
[12] “Sandia researchers develop world’s fastest encryptor,”
http://www.sandia.gov/media/NewsRel/NR1999/encrypt.htm.
[13] T. Gilmont, J.-D. Legat, and J.-J. Quisquater, “Enhancing the
Security in the Memory Management Unit,” Proceedings of the
25th EuroMicro Conference, pages 449–456, September 1999.
[14] http://www.specbench.org/osg/cpu2000.
[15] M. Kuhn, “The TrustNo1 Cryptoprocessor Concept,” Tech-
nical Report, Purdue University, April 1997.
[16] K. M. Lepak, G. B. Bell, and M. H. Lipasti, “Silent Stores
and Store Value Locality,” IEEE Transactions on Computers,
Vol. 50, No. 11, 2001.
[17] D. Lie, J. Mitchell, C. A. Thekkath, and M. Horwitz, “Speci-
fying and Verifying Hardware for Tamper-Resistant Software,”
IEEE Symposium on Security and Privacy, 2003.
[18] D. Lie, C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh,
J. Mitchell, and M. Horwitz, “Architectural Support for Copy
and Tamper Resistant Software,” ACM 9th International
Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS-IX), pages 168–177,
November 2000.
[19] T. Maude and D. Maude, “Hardware Protection Against
Software Piracy,” Communications of the ACM, Volume 27,
Number 9, pages 950–959, September 1984.
[20] M. J. B. Robshaw, “Stream Ciphers,” Technical Report TR-
701, version 2.0, RSA Laboratories, 1995.
[21] S. W. Smith, E. R. Palmer, and S. Weingart, “Using a Higher
Performance, Programmable Secure Coprocessor,” Financial
Cryptography, pages 73–89, February 1998.
[22] W. Stallings, “Cryptography and Network Security, Princi-
ples and Practice,” Prentice Hall, 3rd ed. 2003.
[23] J. Tygar and B. Yee, “Dyad: A system for Using Phys-
ically Secure Coprocessors,” Technical Report CMU-CS-91-
140R, Carnegie Mellon University, May 1991.
[24] L. Wu, C. Weaver, and T. Austin, “CryptoManiac: A Fast
Flexible Architecture for Secure Communication,” ACM 28th
International Symposium on Computer Architecture (ISCA01),
June 2001.
[25] Y. Zhang, J. Yang, and R. Gupta, “Frequent Value Local-
ity and Value-Centric Data Cache Design,” International Con-
ference on Architectural Support for Programming Languages
and Operating Systems, pages 150–159, November 2000.
Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36’03) 
0-7695-2043-X/03 $ 17.00 © 2003 IEEE 