Rochester Institute of Technology

RIT Scholar Works
Theses

Thesis/Dissertation Collections

6-1-2011

Extremely low overhead off-chip memory
encryption
Michael Sanfilippo

Follow this and additional works at: http://scholarworks.rit.edu/theses
Recommended Citation
Sanfilippo, Michael, "Extremely low overhead off-chip memory encryption" (2011). Thesis. Rochester Institute of Technology.
Accessed from

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.

Extremely Low Overhead Off-Chip Memory Encryption
by
Michael A. Sanfilippo

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Engineering
Supervised by
Dr. Marcin Łukowiak
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
June 2011
Approved By:

Dr. Marcin Łukowiak
RIT Department of Computer Engineering

Dr. Michael Kurdziel
Harris Corporation, Secure Communication Group

Dr. Roy Melton
RIT Department of Computer Engineering

Release page

Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering

Title: Extremely Low Overhead Off-Chip Memory Encryption

I, Michael A. Sanfilippo, hereby grant permission to the Wallace Memorial Library
to reproduce my thesis in whole or part.

Michael A. Sanfilippo

Date

Abstract
Over the last decade, advancements in performance and efficiency of portable computing
devices have allowed them to provide much of the functionality previously restricted to
larger computers. Instant communication, GPS navigation, remote banking, and even online shopping are only a few of the activities that can be performed from almost anywhere.
However, these conveniences may come at the cost of physical security since portable
devices are often operated in a public environment where there is a possibility of being
physically exposed or obtained by untrustworthy users. While it is a common practice to
secure the data that is transferred from one point to another, the contents of system memory
often go unprotected. When physical access to a device is attained, this so-called “data-at-rest” can be exploited to reveal private information. Emails, GPS location data, financial
transactions, etc. could be harmful if revealed to the wrong party.
This thesis investigates the design trade-offs of obscuring data stored within low latency
memory on an embedded device. This was achieved by implementing a parameterizable
system based on the keystream cache concept. While this solution could be implemented
for almost any embedded system, the design was evaluated using reconfigurable hardware
in order to reduce development costs. A prototype was built and tested on an Altera FPGA
development board where parameters of the architecture were varied to find a solution that
reduced performance overhead, while minimizing hardware usage. The resulting application benchmarks show as little as 1% performance overhead while using minimal hardware
resources.

1. Introduction
In recent years, system security has become a critical concern when designing embedded
systems. Devices such as smart phones, tablets, and navigation systems are often used
to process and store private information about the user. Applications for mobile banking
or shopping store critical identification and financial data. If this information were to be
retrieved by someone with malicious intent, the user’s assets and privacy could be permanently impaired.
In some cases, the contents of memory need to be protected from the users as well.
Every year billions of dollars of software and services are stolen through piracy [1]. Some
of this content is distributed through vendor set top boxes or mobile devices that are given
to the users. The incentive for untrustworthy users to break into systems such as Pay-TV
access control boxes and other digital rights management (DRM) devices now outweighs
the effort required to bypass current security measures [2][3].
This information stored locally on the device, also known as “data-at-rest” [4][5], is
increasingly becoming a security weak point. The tools required to analyze and read memory contents have become more accessible and readily available. This makes it much easier
for a knowledgeable individual to compromise the physical security of an embedded device. In order to protect personal user information, as well as the intellectual property of
corporations, new security measures are required [6].

1.1 Project Description

This thesis focuses on a solution for off-chip memory encryption that is low cost in terms of
resource usage and performance overhead. This task was accomplished by implementing
an intermediate secure memory controller between a CPU and off-chip system memory as
shown in Figure 1.1. All off-chip memory transactions were conducted through the secure
memory controller, which is also transparent to applications running on the CPU.
The encryption algorithm chosen for this design was the Advanced Encryption Standard
(AES) [7]. AES was chosen for its efficiency and proven security as disclosed by NIST in
2001 [8]. It is also worth mentioning that AES is the only public standard approved for
systems intended to encrypt classified data [9].
The encryption method used in this design is based on the keystream cache concept
introduced by [7]. This solution encrypts data by using an XOR operation between the data
and a keystream generated by the AES algorithm. These keystreams are then stored in a
keystream cache where they can be retrieved later for fast encryption or decryption. The
performance of this base system was then improved by modifying the way keystreams are
generated and stored in the cache.
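[Figure 1.1: Secure Memory Controller Prototype. Blocks: the CPU and the Secure Memory Controller on the Altera FPGA, and the Off-Chip Memory. Plain Data passes between the CPU and the controller; Encrypted Data passes between the controller and the off-chip memory.]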
The system was tested by implementing a prototype on an Altera Cyclone III FPGA
using a Nios II soft core processor and external SRAM memory. The secure memory
controller resides on the FPGA and encrypts/decrypts all external memory transactions.
The system performance was evaluated using several benchmarking applications running
on an embedded OS. The performance enhancing features were varied to show the trade-offs between performance and resource usage [2].

2. Essential Background
Several methods have been proposed for securing external memory [10][11][12][13][14][15].
Of the proposed methods, there are two key concepts that are essential for this thesis. The
first is the use of a pseudo one-time-pad encryption method, which allows for each memory
address to have its own “unique” encryption pad [11]. The second concept builds on the
first by introducing a method for caching these pads for later use [7].

2.1 AES Engine

This thesis as well as several other proposed solutions uses AES as the encryption algorithm. The component used to perform this operation in hardware is referred to as the AES
engine. While the internal design of the AES engine may be unique to each system, the
external functionality is fundamentally the same. There are two inputs: one
for data, which is always 128 bits wide, and one for the key, which can be 128, 192, or 256
bits wide. There is also a 128-bit wide output for the encrypted data. The AES engine
can either encrypt or decrypt 128 bits of data at a time. As shown in Figure 2.1, the unencrypted data is denoted as plaintext, and the resulting encrypted data is known as ciphertext
[11].

2.2 Direct Memory Encryption

The most naïve solution to securing memory is through direct encryption. During a memory
read, encrypted data is read from external memory and processed directly by the AES
engine. The resulting plaintext is then sent to the CPU. An entire encryption or decryption
operation must be performed for each read or write; therefore this method adds a significant
delay to the memory access time [11].
[Figure 2.1: AES Engine. Encryption mode: a 128-bit plaintext and a 128/192/256-bit key go in, and a 128-bit ciphertext comes out. Decryption mode: a 128-bit ciphertext and a 128/192/256-bit key go in, and a 128-bit plaintext comes out.]
The method proposed in [14] uses direct encryption to encrypt transactions between
a System on Chip (SoC) and external memory. This solution uses specific properties of
the AES algorithm to implement a parallel encryption and integrity checking engine. With
AES, even when only a single bit of the encrypted data is modified before decryption,
the resulting plaintext will be drastically different than its original value. This property
allows the system to easily detect alterations made to stored data. This particular method
of checking integrity does not work with the design evaluated in this thesis due to the fact
that the AES engine is not used to directly encrypt the data stored in external memory.

2.3 Pseudo One-Time-Pad

True one-time-pad encryption is performed by combining plaintext with a secret random
key. One of the major benefits to this method is that the encryption operation can be
simplified to a bit-wise XOR operation. The pitfall is that it requires a truly random key
that is the same length as the plaintext being encrypted. For most encryption applications,
this requirement is unrealistic due to the resources needed to generate or store such a large
key [16].
[Figure 2.2: Direct Memory Encryption. Blocks: CPU, Direct Encryption Memory Controller containing the AES Engine, and Memory. The memory address passes from the CPU to memory; plain data connects the CPU to the plaintext port of the AES engine, and encrypted data connects its ciphertext port to memory.]
To overcome the impractical key requirement, a pseudo one-time-pad method was proposed [11]. Instead of storing a large random key, many smaller secret keys can be generated using a block cipher algorithm. This design suggests using the AES algorithm to
generate secret pads or keystreams for encryption from a small secret key and a memory
address. A keystream [10] is similar to the pad used in the one-time-pad encryption method
with the exception that it is not truly random.
Figure 2.3 shows how the plaintext fed to the AES engine is the corresponding memory
address being accessed in external memory. The keystream is the resulting ciphertext. This
keystream is used as a pad where the stream of generated bits is combined with the data
using an XOR operation similar to a one-time pad operation. The data is then decrypted
by using the same operation on the ciphertext.
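
The pad-and-XOR scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, and HMAC-SHA256 truncated to 128 bits stands in for the AES engine only so that the example runs with the Python standard library.

```python
import hashlib
import hmac


def keystream_for_address(secret_key: bytes, address: int) -> int:
    """Return a 128-bit pad derived from the secret key and a memory address.

    Stand-in for AES_encrypt(key, address); the design in the text uses AES.
    """
    mac = hmac.new(secret_key, address.to_bytes(8, "big"), hashlib.sha256)
    return int.from_bytes(mac.digest()[:16], "big")


def xor_with_keystream(data: int, address: int, secret_key: bytes) -> int:
    """Encrypt or decrypt a 128-bit block; the XOR is its own inverse."""
    return data ^ keystream_for_address(secret_key, address)
```

Applying `xor_with_keystream` twice with the same key and address returns the original data, which is why one datapath can serve both reads and writes.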
[Figure 2.3: Indirect Memory Encryption. Inside the Indirect Encryption Memory Controller, the memory address from the CPU is the plaintext input of the AES engine, and the ciphertext output is the keystream. The keystream is XORed with the plain data to produce the encrypted data exchanged with memory.]
The performance benefit of this method is that it separates the encryption operation
from the memory access path allowing both operations to be done in parallel [12]. As
shown in Figure 2.4, a memory access operation can be executed while the keystream is
generated in parallel. In a case where the memory access time is longer than the keystream
generation time, there would be no performance overhead for decrypting data from memory. It should be noted that encryption cannot be done in parallel for a write operation since
the keystream must be generated before a write operation to external memory can begin.
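
The timing argument above can be written as a small model. This is a sketch, assuming the XOR itself adds no cycles and that all latencies are counted in whole CPU cycles; the function names and the cycle counts in the usage note are hypothetical.

```python
def read_overhead(mem_cycles: int, keystream_cycles: int) -> int:
    """Extra cycles on a read when keystream generation overlaps the memory access."""
    return max(0, keystream_cycles - mem_cycles)


def write_overhead(keystream_cycles: int) -> int:
    """Extra cycles on a write: the pad must exist before the write can begin."""
    return keystream_cycles
```

With a hypothetical 12-cycle keystream generation, a 20-cycle memory hides the read overhead entirely (`read_overhead(20, 12)` is 0), while a 3-cycle memory exposes 9 cycles of it. A write pays the full 12 cycles in both cases.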

2.4 Keystream Caching

The introduction of the pseudo one-time-pad method allowed for further performance improvements in the ability to efficiently provide keystreams. Since each memory address
requires a single keystream for encryption and decryption, the same keystream is generated each time that memory address is accessed. To reduce the number of times a repeated
keystream is generated, a system that caches keystreams was proposed [13][7].
[Figure 2.4: Secure Memory Read Operation. Timeline in CPU cycles showing keystream generation (AES) running in parallel with the memory read (MEM), followed by the XOR.]
A secure memory controller (SMC) operates between the CPU cache and the external
memory interface. When the SMC receives a request from the CPU cache, it first checks
to see if the required encryption keystream is available in the keystream cache. If the cache
returns a hit, the keystream corresponding to the requested memory address is read from
cache and combined with the data to encrypt or decrypt based on the type of transaction.
If the cache returns a miss, the required keystream is generated and then used to encrypt
or decrypt the data. This keystream is then also stored in the keystream cache as shown in
Figure 2.5.
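[Figure 2.5: Secure Memory Controller. Blocks: CPU, Secure Memory Controller containing the Keystream Cache and the AES Engine, and Memory. The address feeds the keystream cache and AES engine; the keystream is XORed with the plain data to produce the encrypted data.]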
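
The hit/miss flow described above can be modelled as follows. This is a rough sketch under stated assumptions: the class and attribute names are hypothetical, an unbounded dictionary stands in for the fixed-size keystream cache, and HMAC-SHA256 truncated to 128 bits stands in for the AES engine.

```python
import hashlib
import hmac


class SecureMemoryControllerModel:
    """Toy model of the keystream cache flow (not the hardware design)."""

    def __init__(self, secret_key: bytes):
        self.secret_key = secret_key
        self.keystream_cache = {}  # address -> keystream (unbounded, for illustration)
        self.memory = {}           # address -> ciphertext held off-chip
        self.generated = 0         # number of keystream generations (cache misses)

    def _generate_keystream(self, address: int) -> int:
        # Stand-in for the AES engine.
        self.generated += 1
        mac = hmac.new(self.secret_key, address.to_bytes(8, "big"), hashlib.sha256)
        return int.from_bytes(mac.digest()[:16], "big")

    def _keystream(self, address: int) -> int:
        if address not in self.keystream_cache:   # miss: generate, then store
            self.keystream_cache[address] = self._generate_keystream(address)
        return self.keystream_cache[address]      # hit: reuse the stored keystream

    def write(self, address: int, plain: int) -> None:
        self.memory[address] = plain ^ self._keystream(address)

    def read(self, address: int) -> int:
        return self.memory[address] ^ self._keystream(address)
```

A write followed by a read of the same address generates the keystream only once; the read is served from the cache.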

2.5 Related Work

There have been various attempts to use the pseudo one-time-pad method to decrease performance overhead when securing memory. The memory encryption scheme proposed by
[17] uses a pseudo one-time-pad method similar to the one in this thesis. In addition, this
implementation uses an AES engine in Galois/Counter Mode (GCM) [18][19] to allow
for data authentication. This requires a counter value to be stored with each encrypted
memory block. Also, the keystreams generated by the AES engine are dependent on the
counters stored in memory; therefore the encryption operation cannot start until the data
from external memory is read. This eliminates the performance benefits of using the pseudo
one-time-pad method since parallel keystream generation and memory access is not possible. To solve this issue, [17] proposes the use of an on-chip cache to store frequently used
counters. Therefore if the counter for the requested encrypted memory block is in cache,
the encryption operation can begin before external memory is read. The additional security
provided by the data authentication comes at the cost of internal and external memory overhead as well as increased memory access delay. In cases where the counter is not available
in cache, the memory read time will be delayed by the length of time required to generate
the required keystream.
Another memory protection method described in [20] takes advantage of the pseudo
one-time-pad method as well. This implementation also stores a counter or timestamp
value with data stored in memory in order to perform data authentication. As introduced
in [17], [20] also uses a cache to store frequently used authentication data. To further
increase performance, this design was improved in [10]. The improved implementation
adds flexibility by allowing for data to be stored in memory with different levels of security.
Data can be stored with no encryption, encryption, or encryption with authentication. This
allows for increased system performance by only securing data that is considered sensitive.
In comparison to the work presented in this thesis, these secure memory designs
[12][17][20][10] assume that the memory access delay will hide the latency of the keystream
generation process. In the case where low latency memory such as SRAM is used, the time
required to generate a keystream induces significant performance degradation.
In [15] a solution for securing data transmitted on external bus systems for embedded
devices is presented. This method differs from the other solutions discussed here in that the
system encrypts data only when it is being transferred over an external bus. When an external memory transaction occurs, the system encrypts the data before it enters the bus architecture and then decrypts it before it is stored in memory. In this case the keystream values
are not associated with the memory address or data in any way, allowing them to be pre-generated for all cases (reads and writes). With a sufficient amount of parallel keystream
generation hardware, this design is able to encrypt system bus transactions without adding
any performance overhead. While this method looks attractive from a performance point
of view, the data transferred between the encrypted system bus and the off-chip memory is
still unprotected.
There is one other known use of a cache to store keystream values [13]. This design
proposes using a cache to decrease the latency for memory encryption during a write back
only. The cache proposed in [13] also takes advantage of the pseudo one-time-pad method
in [11]. As data is read in from memory, an Encryption Unit generates a pad which is
used to decrypt the data before it is sent to the CPU. Since the pad used to decrypt is the
same as the encrypting pad, it can be stored in cache in case the CPU writes back to that
same address. This design greatly reduces the memory latency when writing data back to
memory since the latency added by the pad generation cannot be hidden during a write as
it can with a read.
This memory encryption solution is similar to the one used in this thesis. It caches
pads generated by an AES unit where the memory address is used as plaintext. It differs in
that it only uses the cache to decrease memory write latency. In order to hide the latency
added by encryption during a read, a pipelined AES unit is used. As described in [12]
the encryption latency can be masked by the memory access latency during a read operation. Therefore it is possible to hide the latency entirely during a read by increasing pad
generation throughput.

3. System Architecture
This chapter describes the design of the secure memory controller and its individual components. The most basic implementation will be described first, and then more advanced
system configurations will be introduced along with the reasoning behind their design.

3.1 Overview

As shown previously in Figure 2.5, the secure memory controller resides between the CPU
and the off-chip memory. All transfers to and from off-chip memory are routed through the
secure memory controller.
Figure 3.1 shows a top level view of the internal components that make up the secure
memory controller. The control component receives requests from the CPU and sends the
appropriate commands to the AES engine, datapath, and memory interface.

3.2 Control

The control component in the secure memory controller is responsible for managing memory requests from the CPU while driving the control signals between the AES engine,
datapath, and memory interface components. The behavior of this component is explained
by the state machine in Figure 3.2.
[Figure 3.1: Secure Memory Controller Components. Blocks: Control (control state machine, memory control, datapath control), Datapath, Keystream Cache, and AES Engine, placed between the CPU and Memory. The address and control signals pass through the control component; an XOR with the keystream converts between plain data on the CPU side and encrypted data on the memory side.]
If there are no memory requests from the CPU, the control component stays in its initial
idle state. The CPU will initiate a memory request by pulsing the BeginTransfer signal. At
this point the main duty of the control logic is to prepare the required keystream and encrypt
or decrypt the data being transferred between the CPU and off-chip memory. As shown in
Figure 3.1, the keystream can come from either the keystream cache, or the AES engine. If
the keystream has already been generated and stored in the keystream cache, the system can
start the encryption operation right away without any delay. If not, the memory controller
must hold up the CPU bus while the AES engine generates the required keystream. After
the BeginTransfer signal is activated, the system moves into the CheckCache state. Here
the datapath is responsible for notifying the control component if the keystream is available
in the cache. If the datapath returns a cache hit, the control component moves to the Read
Mem Wait or Write Mem Wait states. In the case of a memory read, the secure memory
controller begins reading data from the off-chip memory as soon as the CPU makes the
initial request. Therefore the Read Mem Wait state simply ensures that the data has arrived
before decrypting it and releasing it to the CPU. In the case of a memory write, the data
must be encrypted before any off-chip memory transfer is started. Once in the Write Mem
Wait state the memory controller checks to ensure that there are no pending off-chip memory accesses before writing the encrypted data to memory. Once the read or write request
is complete, the control system releases the CPU and goes back to the idle state. In the case
where there are back to back memory requests, the system will go directly to checking the
cache instead of going back to idle.
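[Figure 3.2: Control State Machine. States: Idle, Check Cache, Generate Key, Read Mem Wait, Write Mem Wait, Write Mem, and Error (Reset Ava Bus). Transition conditions: Begin Transfer, Cache Hit, Miss, Key Done, Key Not Done, Ava Read, Ava Write, Write Queue Full, Write Queue !Full, and Read Data Valid.]

The control flow described in this section can be approximated by a small transition function. This is a simplified sketch that follows the prose only: the Write Mem and Error states of the state machine figure are omitted, write-queue handling is folded into a single `transfer_done` flag, and all identifiers are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CHECK_CACHE = auto()
    GENERATE_KEY = auto()
    READ_MEM_WAIT = auto()
    WRITE_MEM_WAIT = auto()


def next_state(state, begin_transfer=False, is_read=True, cache_hit=False,
               key_done=False, transfer_done=False):
    """Return the next control state for one clock cycle."""
    if state is State.IDLE:
        return State.CHECK_CACHE if begin_transfer else State.IDLE
    if state is State.CHECK_CACHE:
        if not cache_hit:
            return State.GENERATE_KEY          # miss: hold the CPU bus, run the AES engine
        return State.READ_MEM_WAIT if is_read else State.WRITE_MEM_WAIT
    if state is State.GENERATE_KEY:
        if not key_done:
            return State.GENERATE_KEY
        return State.READ_MEM_WAIT if is_read else State.WRITE_MEM_WAIT
    # READ_MEM_WAIT or WRITE_MEM_WAIT: wait until the off-chip transfer completes.
    if not transfer_done:
        return state
    # Back-to-back requests skip the idle state.
    return State.CHECK_CACHE if begin_transfer else State.IDLE
```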

3.3 Datapath

The main functions of the datapath include encrypting or decrypting data, as well as caching
frequently used keystreams. As shown in Figure 3.3, the encryption operation is simply the
combination of the keystream and the data to be encrypted or decrypted by the use of an
XOR operation. The keystream caching portion of the datapath includes on-chip block
memory to store the keystreams, and the logic required to manage how they are stored and
retrieved.

[Figure 3.3: Datapath Overview. Inputs from control: cache write enable and off-chip memory address. The keystream cache receives the keystream from the AES engine as data in and provides the hit/miss signal and keystream data out. One XOR produces ReadData to the CPU from ReadData from the memory controller; a second XOR produces WriteData to the memory controller from WriteData from the CPU.]

3.3.1 Keystream Cache

The keystream cache stores previously generated keystreams in memory so that they can
be quickly retrieved and reused at a later time. The memory used to store the keystreams
can consist of one or more blocks of synchronous on-chip block RAM. The FPGA used
in this work consists of hundreds of individually accessible memory blocks. These can be
combined in different configurations allowing for various cache architectures. The memory
blocks can be combined to create a single memory unit that can vary in size. Multiple
blocks can also be accessed simultaneously allowing for larger data words to be stored at
one time.
The interface to the keystream cache is the same regardless of the internal configuration.
As shown in Figure 3.4, the inputs consist of the off-chip memory address, the newly
generated keystream to be stored, and a write enable line. The cache uses the memory
address to determine where the keystream should be read from or written to. If the write
enable signal is pulsed for one cycle, the cache will store the provided keystream in the
internal on-chip block RAM. The output of the cache consists of a hit/miss signal and the
keystream stored in cache. The keystream that is present at the output is dependent on the
memory address provided to the cache. If the keystream for the provided memory address
is available in cache, that keystream will be provided at the output, and the hit/miss signal
will be asserted high.
By taking advantage of spatial and temporal locality of common memory access behavior, the cache can be configured to achieve higher performance. Spatial locality, or the
tendency to access memory at a nearby address, can be exploited by modifying the width of
the cache lines. Temporal locality, or the tendency to access the same address in the future,
can be exploited by a flexible n-way associative cache.

[Figure 3.4: Keystream Cache Interface. Inputs: keystream from the AES engine, off-chip memory address, and cache write enable from control. Internal blocks: cache control logic and cache memory. Outputs: hit/miss signal to control and stored keystream data.]

Direct Mapped Cache
The keystream cache system is built around addressable synchronous block RAM. The address used to access this memory is known as the cache index. At each index, a cache
line can be stored or read by the cache control logic. The simplest cache line in this implementation consists of a tag to identify the stored data, a valid bit, and the keystream
data.

[Figure 3.5: Generic Data Cache Structure. Off-chip memory addresses 0001, 0011, 1001, and 1011 are shown alongside cache memory indexes 001 and 011.]
In the direct mapped cache each keystream is mapped to a specific cache line index
based on its associated off-chip memory address. Since the cache is much smaller than the
space required to hold all of the possible keystreams, multiple keystreams may be assigned
to the same cache line. Therefore when a keystream is stored in cache, it can only be
stored at a single location and will overwrite any keystream previously stored in that cache
location.
The location at which a keystream can be stored is determined by the least significant
bits of the associated off-chip memory address. Figure 3.5 shows how addresses from
off-chip memory are mapped to the cache memory. The lower bits of the memory address
are used as the cache index to allow for keystreams that belong to a contiguous block of
memory to be stored one after another in the cache memory. This setup was chosen to take
advantage of the spatial locality of the memory required by the benchmarking applications.
In order to identify which keystream is stored at a particular location in the cache memory, a tag must be stored with each keystream. The tag is used along with the cache index to
reconstruct the original memory address that the keystream is associated with. Figure 3.6
shows how the off-chip memory address is broken down and used by the keystream cache.
The top, or most significant, bits of the address are designated as the tag. The remaining
bits are designated as the cache line index and word offset.
Off-Chip Memory Address
Field             Bits
Tag               [MEM_ADDRESS_WIDTH-1 : CACHE_INDEX_WIDTH+2]
Cache Line Index  [CACHE_INDEX_WIDTH+1 : 2]
Word Offset       [1 : 0]

Figure 3.6: Memory Address Breakdown
The lower or less significant bits of the off-chip memory address are used as the cache
line index. As mentioned before, these bits signify where in the cache memory the keystream
can be stored or retrieved from.
The two least significant bits in Figure 3.6 are known as the word offset. The AES
algorithm always generates a 128-bit keystream, although only 32 bits are required to encrypt a single word from the CPU. Therefore each 128-bit keystream holds four individual
32-bit keystreams used for encryption and decryption. Here the word offset is used to select
which 32 bits of the full 128-bit keystream should be used to encrypt or decrypt the data at
that particular off-chip memory address.
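
The address breakdown above maps directly onto shifts and masks. This is an illustrative sketch: the function names are hypothetical, and it assumes that word offset 0 selects the least significant 32 bits of the 128-bit keystream, an ordering the text does not specify.

```python
def split_address(address: int, cache_index_width: int):
    """Return (tag, cache_line_index, word_offset) for an off-chip memory address."""
    word_offset = address & 0b11                                   # bits [1:0]
    cache_line_index = (address >> 2) & ((1 << cache_index_width) - 1)
    tag = address >> (cache_index_width + 2)                       # remaining upper bits
    return tag, cache_line_index, word_offset


def select_keystream_word(keystream128: int, word_offset: int) -> int:
    """Pick one 32-bit keystream word out of a 128-bit keystream.

    Assumption: offset 0 is the least significant word.
    """
    return (keystream128 >> (32 * word_offset)) & 0xFFFFFFFF
```

For example, with a 4-bit cache line index, the address `0b1011_0110_01` splits into tag `0b1011`, cache line index `0b0110`, and word offset `0b01`.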

\[ \text{Cache size} = \text{Num cache lines} \times (\text{Keycache length} + \text{Tag length} + \text{Valid bit}) \tag{3.1} \]
The exact size of the tag and cache line index are determined by the size of the cache.
In this implementation the size of the cache is defined by the number of cache lines and
the width of each cache line. The relative size of the cache can be calculated using Equation 3.1. The number of cache lines or depth of the cache is determined by the cache line
index size, and the width is determined by the tag and keystream size. Therefore the cache
line width for this cache type will always be 128 bits for the keystream, plus the tag size,
plus one valid bit. As shown in Figure 3.6, the length of the tag is determined by the total
length of the address, the size of the cache line index, and the width of the word offset. The
“+2” in the lower bound of the tag range adjusts for the two word offset bits. A visualization of the
keycache layout can be seen in Figure 3.7.
Cache Line Index
(Cache Memory Address)    Tag                                          Key Word (0)  Key Word (1)  Key Word (2)  Key Word (3)  Valid Bit
0                         [MEM_ADDRESS_WIDTH-1 : CACHE_INDEX_WIDTH+2]  32 bits       32 bits       32 bits       32 bits       1 bit
1                         [MEM_ADDRESS_WIDTH-1 : CACHE_INDEX_WIDTH+2]  32 bits       32 bits       32 bits       32 bits       1 bit
...
2^CACHE_INDEX_WIDTH - 1   [MEM_ADDRESS_WIDTH-1 : CACHE_INDEX_WIDTH+2]  32 bits       32 bits       32 bits       32 bits       1 bit

Figure 3.7: Keycache Structure
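
As a worked instance of Equation 3.1, the sketch below computes the cache size in bits for the direct mapped cache. The function names are hypothetical, and the 20-bit address width and 8-bit index width in the usage note are hypothetical values chosen only for the arithmetic.

```python
KEYSTREAM_BITS = 128    # one AES output block per cache line
WORD_OFFSET_BITS = 2    # selects one of four 32-bit keystream words
VALID_BITS = 1


def tag_bits(mem_address_width: int, cache_index_width: int) -> int:
    """Tag width: the address bits left after the index and word offset."""
    return mem_address_width - cache_index_width - WORD_OFFSET_BITS


def cache_size_bits(mem_address_width: int, cache_index_width: int) -> int:
    """Equation 3.1: number of lines times the width of one line."""
    num_lines = 1 << cache_index_width
    line_bits = KEYSTREAM_BITS + tag_bits(mem_address_width, cache_index_width) + VALID_BITS
    return num_lines * line_bits
```

With a 20-bit address and an 8-bit cache line index, the tag is 10 bits, each line is 128 + 10 + 1 = 139 bits, and the 256 lines occupy 35,584 bits.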
Figure 3.8 shows the basic operation performed by the cache control logic. When the
secure memory controller receives a memory read or write request from the CPU, the system first checks the cache to see if the required keystream is available. This is done by
sending the off-chip memory address to the keystream cache. The cache index portion
of the address is used as the memory address for the keystream cache memory. The information read from the memory at this address is the entire cache line including the tag,
keystream data, and the valid bit. First the tag portion of the cache line is compared to the
tag portion of the off-chip memory address. The output of the tag comparator is a single
bit that signifies whether or not the values are the same. If both tags are equal, then the
tag comparator outputs logic ‘1’. This value is then combined with the valid bit through
an AND operation. The valid bit signifies that the data stored in the cache line is correct
for the specified off-chip memory address. The output of the AND operation is the hit signal. A logic value of ‘1’ on the hit signal tells the datapath that the requested keystream is
available in the cache.

[Figure 3.8: Direct Cache. Keystream cache control logic: the tag from the off-chip memory address is compared with the tag from cache memory, and the comparator output is ANDed with the valid bit from cache memory to form the hit/miss signal to control. The cache index drives the cache memory address, the cache write enable from control drives the memory write enable, and the 128-bit keystream is the cache memory write data.]
If the tag comparison fails or the valid bit is clear, the AND operation results in a logic ‘0’, and the datapath must
signal back to the main secure memory controller control component that a new keystream
must be generated. The control component will then initiate the keystream generation
using the AES engine. Once the keystream is available, it will be sent directly from the
AES engine to the datapath to be used in the pending encryption or decryption operation.
At the same time the newly generated keystream will be written to the keystream cache
by using the cache write enable signal. This will cause the cache to write the keystream,
along with its tag and a valid bit to the keystream cache memory. When a new keystream
is written to the cache the valid bit is always set to logic ‘1’.
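
The lookup and fill behavior described above can be captured in a short software model. This is a sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, and a Python list stands in for the on-chip block RAM.

```python
class DirectMappedKeystreamCache:
    """Software model of a direct mapped keystream cache with 128-bit lines."""

    def __init__(self, cache_index_width: int):
        self.index_width = cache_index_width
        # Each line holds (valid, tag, keystream); every line starts invalid.
        self.lines = [(False, 0, 0)] * (1 << cache_index_width)

    def _split(self, address: int):
        index = (address >> 2) & ((1 << self.index_width) - 1)
        tag = address >> (self.index_width + 2)
        return tag, index

    def lookup(self, address: int):
        """Return (hit, keystream); hit is the tag match ANDed with the valid bit."""
        tag, index = self._split(address)
        valid, stored_tag, keystream = self.lines[index]
        hit = valid and (stored_tag == tag)
        return hit, keystream

    def fill(self, address: int, keystream: int) -> None:
        """Store a newly generated keystream, overwriting the previous occupant."""
        tag, index = self._split(address)
        self.lines[index] = (True, tag, keystream)
```

Two addresses that share a cache line index but differ in tag evict each other, which is the conflict behavior the direct mapped design accepts in exchange for simple control logic.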
Direct Mapped Cache with Variable Cache Line Width
This thesis generally describes the CPU as the device that sends memory requests to the
secure memory controller. The soft core CPU used to test this system utilizes a separate
instruction and data cache. These caches sit between the CPU and the secure memory
controller. When the CPU needs data from the memory it sends a request to the instruction
or data cache. If the required data is not available here, one of the CPU caches will request
memory from the secure memory controller. To take advantage of CPU idle time, and faster
sequential memory access, the CPU cache will often request more data than what is needed
by the CPU.
When larger chunks of memory are requested, the secure memory controller may be
required to generate multiple keystreams to decrypt the data from memory. For example if
the CPU cache were to request eight 32-bit words from memory at a time, the AES engine
would be required to generate two 128-bit keystreams. Since this behavior can be predicted
based on the CPU architecture, the secure memory controller can be reconfigured to allow
for longer cache lines. In the case of this example, the cache line should contain at least
256-bits of keystream data in order to encrypt or decrypt an entire CPU cache request.
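
The keystream count in this example follows from a ceiling division. This is a small sketch with a hypothetical function name, and it assumes the request is aligned to a 128-bit boundary.

```python
def keystreams_needed(words_requested: int, word_bits: int = 32,
                      keystream_bits: int = 128) -> int:
    """Number of AES output blocks needed to cover one CPU cache request."""
    total_bits = words_requested * word_bits
    return -(-total_bits // keystream_bits)  # ceiling division
```

A request for eight 32-bit words spans 256 bits, so `keystreams_needed(8)` is 2, matching the example in the text.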
Cache Line Index
(Cache Memory Address)          Tag                                          Key Word (0)  Key Word (1)  ...  Key Word (7)  Valid Bit
0                               [MEM_ADDRESS_WIDTH-1 : CACHE_INDEX_WIDTH+3]  32 bits       32 bits       ...  32 bits       1 bit
1                               [MEM_ADDRESS_WIDTH-1 : CACHE_INDEX_WIDTH+3]  32 bits       32 bits       ...  32 bits       1 bit
...
2^(CACHE_INDEX_WIDTH-1) - 1     [MEM_ADDRESS_WIDTH-1 : CACHE_INDEX_WIDTH+3]  32 bits       32 bits       ...  32 bits       1 bit

Figure 3.9: Double Width Keystream Cache Line
An example of a cache that holds twice the amount of keystream data per cache line can
be seen in Figure 3.9. This cache structure is very similar to that of the single wide cache
line with a few small changes. First, to keep the cache the same size, the number of cache
lines is reduced by half. This keeps the overall memory used by the cache consistent for
comparison. In this instance the tag size remains the same even though the cache line index
is reduced by one bit. This is due to the fact that an extra bit is consumed by the word
offset. Here the word offset consists of the three least significant bits since it must be used
to select between eight different 32-bit keystream words.
The number of 32-bit keystream words stored in a single cache line should always be a
multiple of four since the AES engine is only designed to output 128 bits at a time. If less than
a full 128-bit keystream were to be stored in a cache line, then the remaining keystream
bits would be lost. This may result in lower overall performance as the entire keystream
will need to be regenerated if even part of it is missing from the cache. To ensure that this
will not be an issue in this implementation, the number of 32-bit keystream words stored
in a cache line is defined by the WORD_INDEX_LENGTH. This will allow the number
of words to be set to 4, 8, 16, etc. With a variable length cache line the total size of the
cache is dependent on two manually set variables, CACHE_INDEX_WIDTH and
WORD_INDEX_LENGTH, as shown in Equation 3.1.
The off-chip memory address breakdown in Figure 3.10 gives the generic sizes for the
tag, cache line index, and word offset parameters. As with the simple direct mapped cache,
the number of cache lines is still defined by the CACHE_INDEX_WIDTH variable, but the
total size of the cache is now dependent on the WORD_INDEX_LENGTH as well.
[Figure: off-chip memory address fields.
Tag: [MEM_ADDRESS_WIDTH-1 : CACHE_INDEX_WIDTH+WORD_INDEX_LENGTH+2]
Cache Line Index: [CACHE_INDEX_WIDTH+WORD_INDEX_LENGTH+1 : WORD_INDEX_LENGTH+1]
Word Offset: [WORD_OFFSET_LENGTH : 0]]
Figure 3.10: Off-chip Memory Address Breakdown for Generic Cache Line Width
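The way an address is divided into tag, cache line index, and word offset can be illustrated with a short Python sketch. This is a simplified model under stated assumptions: the address is treated as a word address, and the field widths are passed in as the hypothetical parameters `index_bits` and `offset_bits` rather than derived from the exact bit ranges of the figure.

```python
from typing import NamedTuple


class AddressFields(NamedTuple):
    tag: int
    index: int
    word_offset: int


def split_address(word_addr: int, index_bits: int, offset_bits: int) -> AddressFields:
    """Split a word address into tag, cache line index, and word offset."""
    # The lowest bits select a keystream word within the cache line.
    word_offset = word_addr & ((1 << offset_bits) - 1)
    # The next bits select the cache line.
    index = (word_addr >> offset_bits) & ((1 << index_bits) - 1)
    # Everything above the index is stored as the tag.
    tag = word_addr >> (offset_bits + index_bits)
    return AddressFields(tag, index, word_offset)


# Double wide line: 3 offset bits select one of eight 32-bit keystream words.
print(split_address(0b1011_0110_101, index_bits=4, offset_bits=3))
```

Widening the cache line moves one bit from the index field to the offset field, so the tag width stays the same as long as the total cache size is held constant.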
With longer cache lines, the keystream cache will require more keystream data for each
cache line it saves. Unlike a traditional CPU instruction or data cache, the system
cannot simply read in more data to be stored. The AES engine is designed to generate 128 bits
of keystream at a time based on the specification approved by NIST [8]. In order to fill
a variable length cache line, more keystream data must be generated by the AES engine.
In order to generate the required amount of keystream data, serial and parallel cache
line fill methods were proposed. With the serial method a single AES engine would be used
multiple times to generate the required amount of keystream data as shown in Figure 3.11.
With a single AES engine the time to generate a complete keystream will multiply by the
number of 128-bit blocks of keystream required. As the WORD_INDEX_LENGTH variable
is increased, this delay will increase accordingly.
[Figure: a double wide cache line (valid bit, tag, and two 128-bit keystreams) filled by a single AES engine used twice.]
Figure 3.11: Serial Cache Line Fill Method

[Figure: a double wide cache line (valid bit, tag, and two 128-bit keystreams) filled by two AES engines operating in parallel.]
Figure 3.12: Parallel Cache Line Fill Method
The second method displayed in Figure 3.12 uses multiple AES engines to generate
more keystream data. With this method, an additional AES engine is added and operated
in parallel for each additional 128 bits of keystream data required for a single cache line.
This allows a larger number of keystream bits to be generated at one time without any
increase in cache line fill time. The AES engine is the most complex component of the
secure memory controller; therefore the tradeoff of this method is an increase in utilized
FPGA resources.
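The difference between the two fill methods can be summarized with a simple timing model. The sketch below assumes 14 cycles per 128-bit keystream (one cycle per AES round, ignoring the initial round and any control overhead); the function name and this cycle count are illustrative assumptions rather than measured values.

```python
def fill_cycles(blocks_per_line: int, engines: int, cycles_per_block: int = 14) -> int:
    """Cycles needed to fill one cache line with `engines` AES engines in parallel."""
    # Each pass produces one 128-bit keystream per engine (ceiling division).
    passes = -(-blocks_per_line // engines)
    return passes * cycles_per_block


serial = fill_cycles(blocks_per_line=2, engines=1)    # one engine used twice
parallel = fill_cycles(blocks_per_line=2, engines=2)  # two engines used once
print(serial, parallel)
```

Under this model the serial fill time grows linearly with the line width, while the parallel fill time stays constant at the cost of one additional AES engine per 128 bits.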

N-Way Associative Cache
An associative cache architecture allows for greater flexibility in how keystreams are stored
in cache memory. The main distinction from the direct mapped cache is that an associative
cache can choose which keystreams to store based on their usage rather than memory location alone. This adds a significant amount of complexity to the cache control logic due
to the decisions required to store cache lines and retrieve the correct keystreams at a later
time.
[Figure: N-way keystream cache memory. Memory banks 0 through N-1 share the cache index and WriteData inputs; each bank has its own write enable (WE[0]...[N-1]) and read data output (ReadData[0]...[N-1]).]
Figure 3.13: Associative Cache Memory Banks
In order to store multiple cache lines at a single cache index, this system implements
multiple cache memory banks as shown in Figure 3.13. Each bank can store a single cache
line at a particular index; therefore the number of memory banks required is dependent on
the number of cache ways.
With this cache architecture the cache control logic must be able to search all cache
lines at a particular index in parallel. With each cache way implemented in a separate
memory bank, all of the cache lines at a particular cache index can be read at once. The
tag comparator then evaluates the tag from each cache line. The output of the comparator
is used to select the valid bit line that will be passed on to the datapath as the hit signal. If
the valid bit is set to logic ‘1’ for the cache line with the matching tag, the cache control
logic will return a ‘1’ on the hit signal to notify the datapath that the requested keystream is
available. The comparator output will then be used to select the correct keystream to send
back to the datapath from the keystream cache memory. An example of the cache control
logic for an n-way cache can be seen in Figure 3.14.
[Figure: keystream cache control logic. The tag from the off-chip memory address is compared with Tag 0 through Tag N-1 read from memory banks 0 through N-1 at the cache index. The tag comparator output selects the valid bit that forms the cache hit signal and the 128-bit keystream returned to the datapath. The cache replacement policy uses Counter 0 through Counter N-1 to drive the cache write enable of each memory bank.]
Figure 3.14: N-Way Cache Control Logic
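The lookup described above can be modeled in software as follows. This is a behavioral sketch only: the hardware compares all ways at once, whereas the loop below visits them one after another, and the `CacheLine` and `lookup` names are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CacheLine:
    valid: bool = False
    tag: int = 0
    keystream: int = 0  # 128-bit keystream held as an integer


def lookup(banks: List[List[CacheLine]], index: int, tag: int) -> Tuple[bool, Optional[int]]:
    """Return (hit, keystream) for one cache index across all memory banks."""
    for bank in banks:  # one bank per cache way
        line = bank[index]
        # A hit needs both a matching tag and a set valid bit.
        if line.valid and line.tag == tag:
            return True, line.keystream
    return False, None


# Two ways with four cache lines each; one keystream stored in way 1, index 2.
banks = [[CacheLine() for _ in range(4)] for _ in range(2)]
banks[1][2] = CacheLine(valid=True, tag=0x3A, keystream=0xDEADBEEF)
print(lookup(banks, index=2, tag=0x3A))
print(lookup(banks, index=2, tag=0x3B))
```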
When there are multiple cache lines per cache index, the cache control logic must also
determine which memory bank to store the keystream data in. This can be done in various
ways; therefore multiple solutions were developed and tested in this thesis. These
solutions, or cache replacement policies, use cache usage statistics to decide which cache
line to replace at a particular index. These statistics are calculated by the cache control
logic and stored in the cache memory with each cache line.
The first and simplest cache replacement policy is the first in, first out (FIFO)
policy. In this case the cache control logic records the order in which cache lines are written
at a particular index. This is done by numbering each cache line based on the number of
ways the cache is split. Each way needs its own number; therefore the number of bits
required to store this statistic is calculated by taking the log base two of the number of
ways. This number, or insertion order, is stored with each cache line in the cache memory.
When a new cache line is written, its insertion order number is set to the highest number
possible since it was the last cache line to be inserted at that index. The cache control logic
must then update the insertion order numbers on all of the other cache lines at that index
by decrementing them by one. This causes the oldest cache line to have an insertion order
number of zero. When a cache line is placed into cache memory, it is written to the cache
way with the insertion order number of zero.
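The FIFO bookkeeping described above can be sketched for a single cache index as follows. The class name and the initial numbering of the ways (0 through N-1) are assumptions made for this illustration; the thesis does not state how the insertion order numbers are initialized.

```python
class FifoSet:
    """One cache index of an N-way cache using FIFO replacement."""

    def __init__(self, ways: int):
        # Insertion order number per way; 0 marks the oldest cache line.
        self.order = list(range(ways))
        self.tags = [None] * ways

    def insert(self, tag) -> int:
        """Store a new cache line and return the way that was replaced."""
        victim = self.order.index(0)  # way holding the oldest line
        # All other lines move one step closer to replacement.
        self.order = [number - 1 for number in self.order]
        # The new line was inserted last, so it gets the highest number.
        self.order[victim] = len(self.order) - 1
        self.tags[victim] = tag
        return victim


fifo = FifoSet(ways=4)
victims = [fifo.insert(tag) for tag in "ABCDE"]
print(victims, fifo.tags)
```

With four ways, the fifth insertion replaces the line written first, regardless of how often that line was read in between.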
The second replacement policy keeps track of how long ago a particular cache line was
accessed. This is commonly known as the least recently used (LRU) replacement policy.
It then uses this data to replace the cache line that was accessed least recently. This policy
allows more recently used cache lines to remain in the cache memory. Just as with the
FIFO policy, a usage statistic is calculated and stored with each cache line. The cache lines
at a particular index are numbered based on when they were accessed last with the highest
number being the most recently accessed cache line. Therefore, when the cache control
logic inserts a new cache line, it will replace the cache line with the lowest usage statistic
number for the specified index.
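A possible software model of the LRU bookkeeping is sketched below. The text only states that the highest number marks the most recently accessed line; the rule used here to renumber the other lines on each access is an assumption, as are the class and method names.

```python
class LruSet:
    """One cache index of an N-way cache using LRU replacement."""

    def __init__(self, ways: int):
        # Usage number per way; 0 marks the least recently used line.
        self.rank = list(range(ways))
        self.tags = [None] * ways

    def _touch(self, way: int) -> None:
        old = self.rank[way]
        # Lines that were more recent than the accessed one drop by one.
        self.rank = [r - 1 if r > old else r for r in self.rank]
        self.rank[way] = len(self.rank) - 1  # now the most recently used

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss replace the least recently used line."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        victim = self.rank.index(0)
        self.tags[victim] = tag
        self._touch(victim)
        return False


lru = LruSet(ways=2)
results = [lru.access(tag) for tag in ["A", "B", "A", "C"]]
print(results, lru.tags)
```

In this sequence the second access to A protects it from replacement, so C replaces B. Under the FIFO policy the same sequence would replace A.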
The last policy replaces cache lines based on how often they are used. As a cache line
is accessed, its usage statistic is incremented and the statistics of the other cache lines at
that index are decremented. If there is a cache miss at a particular cache index,
the statistics for all cache lines at that index are decremented. Therefore the cache line with
the lowest usage statistic will be replaced when new cache lines are inserted. In order to
keep newly inserted cache lines from being replaced too quickly, they are inserted with a
usage statistic that is halfway between the highest and lowest possible values.
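The frequency-based policy can be modeled in the same style. The sketch below assumes saturating counters of a hypothetical width (`counter_bits`) and assumes that, on a miss, the counters are decremented before the victim is chosen; neither detail is specified in the text.

```python
class FrequencySet:
    """One cache index of an N-way cache using frequency-based replacement."""

    def __init__(self, ways: int, counter_bits: int = 4):
        self.max_count = (1 << counter_bits) - 1
        self.count = [0] * ways
        self.tags = [None] * ways

    def _decrement(self, skip=None) -> None:
        for way in range(len(self.count)):
            if way != skip:
                self.count[way] = max(0, self.count[way] - 1)  # saturate at zero

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss replace the least used line."""
        if tag in self.tags:
            way = self.tags.index(tag)
            self.count[way] = min(self.max_count, self.count[way] + 1)
            self._decrement(skip=way)
            return True
        self._decrement()  # a miss lowers the statistic of every line
        victim = self.count.index(min(self.count))
        self.tags[victim] = tag
        # New lines start midway so they are not replaced immediately.
        self.count[victim] = self.max_count // 2
        return False


freq = FrequencySet(ways=2)
results = [freq.access(tag) for tag in ["A", "B", "A", "A", "C"]]
print(results, freq.tags, freq.count)
```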

3.4 AES Engine

The AES engine consists of two major parts. The first is the key expansion subsystem, and
the second is the main AES datapath that is used to encrypt data. The hardware implementation of the AES algorithm used for this thesis was designed for a cooperative project
between RIT and the Harris Corporation. It was designed to be low latency while using
minimal hardware resources.
This memory encryption system uses the pseudo one-time-pad method; therefore the
AES engine only needs to be able to encrypt data. As shown previously in Figure 2.1, the
encrypting AES engine has two inputs and one output. The two inputs are the secret key
and the data to be encrypted. In this particular design, a 256-bit key is used for maximum
specified security [8]. The data to be encrypted, or plaintext, is 128 bits wide. The output
is the encrypted data, or ciphertext, which is also 128 bits wide.
The AES algorithm is an iterative block cipher that operates on an internal 128-bit
state. Each iteration of the algorithm is called a round. In a single round, the current state is
transformed by four independent operations which are SubBytes, ShiftRows, MixColumns,
and AddRoundKey as shown in Figure 3.16. This implementation goes through a total of
14 rounds for a 256-bit key. As shown in Figure 3.15, there is also an initial round where
only the AddRoundKey operation is performed. It should also be noted that in the last
round, the MixColumns operation is omitted.

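[Figure: AES round sequence. The plaintext passes through the initial round (AddRoundKey with K0) and then Rounds 1 through 14, which use round keys K1 through K14 from the expanded key. Round 14 omits MixColumns and produces the ciphertext.]
Figure 3.15: AES Rounds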

3.4.1 AES Key Expansion

As shown in Figure 3.15, each round of the AES algorithm requires an expanded round key
(K0 through K14). The round key is the result of a transformation of the 256-bit secret key by the key
expansion subsystem. The expanded key is 240 bytes long, which provides a unique 128-bit
round key for all 14 rounds plus the initial round. Figure 3.17 shows the architecture
used to generate the required round keys.
The key expansion subsystem holds two independent 128-bit internal states. Each state
is split into four 32-bit words (W0 ... W7) which are operated on independently. The two
states are initialized to the upper and lower 128 bits of the secret key. The round keys
(K0 ... K14) are then taken from the 128-bit internal states and saved in synchronous on-chip
block memory.
The first two round keys, K0 and K1, are taken from the initial two 128-bit states. From
there each state is updated one at a time, alternating between them. Each time a state is
updated, it is saved in the on-chip memory for use in the AES algorithm at a later time. In
this system the expanded key needs to be generated only once on startup. That key is then
used for all AES operations.

[Figure: one AES round. The current internal state passes through SubBytes, ShiftRows, MixColumns, and AddRoundKey (with the expanded round key) to produce the next internal state.]
Figure 3.16: AES Round Structure
While this project implements the key expansion algorithm in hardware, the expanded
key could also be generated in software. In this case the secret 256-bit key would be
sent to the CPU where it would be expanded through many instructions. Performing this
operation in software would free up hardware resources that could be better utilized for
other functions. This may be a viable option due to the fact that the time required for this
operation does not have any impact on system performance.
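As an illustration of the software alternative, the following Python sketch expands a 256-bit key into the fifteen 128-bit round keys following the key schedule of the AES specification [8]. It is not the code used in this project: the function names are assumptions, the S-box is computed rather than stored to keep the listing short, and an embedded implementation would more likely be written in C with a precomputed table.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox() -> list:
    """Compute the AES S-box: multiplicative inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_word(word: int) -> int:
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def rot_word(word: int) -> int:
    """Rotate a 32-bit word left by one byte."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def expand_key_256(key: bytes) -> list:
    """Expand a 256-bit key into sixty 32-bit words (240 bytes)."""
    if len(key) != 32:
        raise ValueError("AES-256 requires a 32-byte key")
    words = [int.from_bytes(key[i:i + 4], "big") for i in range(0, 32, 4)]
    rcon = 1
    for i in range(8, 60):
        temp = words[i - 1]
        if i % 8 == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)
        elif i % 8 == 4:
            temp = sub_word(temp)
        words.append(words[i - 8] ^ temp)
    return words


def round_keys(key: bytes) -> list:
    """Group the expanded key into round keys K0 ... K14 of four words each."""
    words = expand_key_256(key)
    return [words[i:i + 4] for i in range(0, 60, 4)]


# Example key taken from the AES-256 key expansion example in the AES specification.
key = bytes.fromhex("603deb1015ca71be2b73aef0857d7781"
                    "1f352c073b6108d72d9810a30914dff4")
keys = round_keys(key)
print(len(keys), [f"{w:08x}" for w in keys[2]])
```

The first two round keys are the two halves of the secret key itself, which corresponds to the way K0 and K1 are taken directly from the initial states in the hardware design.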

3.4.2 AES Architecture

The datapath displayed in Figure 3.18 is used by the AES engine to encrypt plaintext. It
operates on the 128-bit plaintext input as 16 words of one byte each, which are organized
in a two dimensional array of four rows and four columns. This structure represents the
internal state of the system.
The AES datapath consists of the logic required to compute a single round of the AES
algorithm in Figure 3.16. The architecture is designed to compute a single round of the
AES algorithm for each clock cycle. A round is completed by transforming the internal
state using the SubBytes, ShiftRows, MixColumns, and AddRoundKey operations.

[Figure: key expansion datapath. Clocked word registers W0 through W3 (KEven, loaded from the lower four words of the 256-bit key) and W4 through W7 (KOdd, loaded from the upper four words) feed the Rotate Word, S-Box, Rcon, and XOR stages.]
Figure 3.17: AES Key Expansion Datapath
The SubBytes operation is a non-linear substitution where each byte is mapped to another
byte, as in a lookup table. The component responsible for this operation is called
the substitution box, or SBox, which can be seen in Figure 3.18. This implementation uses
synchronous block RAM to implement the required SBoxes. Since the output of the SBox
memory only changes once per cycle, this component also holds the current state of the
system.
The ShiftRows operation consists of a transposition where each row of the state array
is cyclically shifted by a constant number of steps. Since this is simply a reordering of the
signals, this operation requires no additional hardware or combinational logic delay. The

signals are simply remapped to a new location in the state array.
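Because ShiftRows is a fixed reordering, it can be written as a constant index table. The sketch below assumes the usual AES convention in which the 16 state bytes are stored column by column; the table and function names are illustrative.

```python
# Byte i of the state sits in row i % 4 and column i // 4.
# Row r is rotated left by r columns, so output byte i comes from this index.
SHIFT_ROWS = [(i + 4 * (i % 4)) % 16 for i in range(16)]


def shift_rows(state: bytes) -> bytes:
    """Reorder the 16 state bytes; no arithmetic is performed on the data."""
    return bytes(state[src] for src in SHIFT_ROWS)


print(list(shift_rows(bytes(range(16)))))
```

In hardware this table corresponds to fixed wiring between the SBox outputs and the MixColumns inputs.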
MixColumns is a mixing operation that combines the four bytes of each column using
an invertible linear transformation. This step is used in each round except for the initial and
last rounds as noted in Figure 3.16.
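The linear transformation applied to a single column can be sketched as follows, using the standard AES matrix with coefficients 2, 3, 1, 1. The function names are assumptions, and the input column in the example is a commonly used test column rather than data from this project.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 2 in GF(2^8)."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def mix_column(col):
    """Apply the AES MixColumns transformation to one four-byte column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,  # 2*a0 + 3*a1 + a2 + a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,  # a0 + 2*a1 + 3*a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),  # a0 + a1 + 2*a2 + 3*a3
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),  # 3*a0 + a1 + a2 + 2*a3
    ]


print([f"{b:02x}" for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Each output byte depends on all four input bytes of the column, which is what spreads a single changed byte across the state over successive rounds.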
The AddRoundKey step simply combines the unique round key with the current state
by the use of an XOR operation. This operation is performed once for each round of the
AES algorithm.

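[Figure: AES datapath. The 128-bit plaintext input (AES_PT) is XORed with the 128-bit round key and fed to sixteen clocked SBoxes (SBox0...15). The SBox outputs pass through Shift Rows, Mix Columns, and a round key XOR, with the round number selecting the path, to produce the 128-bit ciphertext output (AES_CT).]
Figure 3.18: AES Datapath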
The initial state of the AES datapath is set during the initial round by combining the
plaintext input with the first round key through an XOR operation. The result is then fed to
the 16 SBoxes. After the next rising clock edge the next state of the system is computed
by feeding the output of the SBoxes to the ShiftRows, MixColumns, and AddRoundKey
combinational logic. The next state is then fed back to the input of the SBoxes. This
process continues for 14 cycles until each round has been computed. After the 13th cycle
the internal state is transformed by the SubBytes, ShiftRows, and AddRoundKey operations
before it is saved in the output register. The final value in the output register is used as the
128-bit ciphertext output.

4. Test System
The system used to evaluate the performance of the secure memory controller was designed
to operate as a typical embedded system consisting of a CPU, system memory, and various
peripherals. Since the proposed solution is targeting low latency memory, SRAM was
chosen as the off-chip memory. The Altera Cyclone III 3C120 development board was
used as the platform for the initial prototype.

4.1 FPGA Setup

The CPU used in this project was a Nios II core provided by Altera. This CPU was used
for its simplicity and small size in order to leave room for custom logic development. The
Nios II connects to the main system Avalon Bus also provided by the Altera development
software. As shown in Figure 4.1, the CPU is the master that controls the main system bus
that connects to several other peripherals.
The specific CPU configuration settings used are listed in Table 4.1. All of the system
startup and software instructions and data are stored in external SRAM; therefore the boot
memory pointer is directed to the off-chip memory as well. Relatively small instruction and
data caches were selected to force the CPU to access memory more often during operation.
This allows for a better evaluation of the secure memory controller due to the fact that it is
utilized more often.
The Nios II/f CPU configuration was chosen specifically for its performance over the
II/e and II/s options. This soft processor is a 32-bit reduced instruction set core. In addition to the on-chip instruction and data caches, it provides hardware support for advanced
features such as dynamic branch prediction. It also utilizes hardware acceleration for multiplication and division as well as a hardware barrel shifter.
Table 4.1: Nios II Configuration

Core Type            Nios II/f
Memory Pointer       Off-chip SRAM
Instruction Cache    512 byte
Data Cache           512 byte
Debugging            JTAG Level 1
Custom Instructions  None

The JTAG UART device is used for communication between the development board
and the desktop computer used to program and control the onboard FPGA. Prior to system
startup, the JTAG UART is used to write the programming file to the FPGA for logic
configuration. After startup, it is then used for terminal access to the software.
The System ID component is used to keep track of the hardware version so that the
correct software is used for any particular setup. The on-chip memory is not required but
can be used as system memory for debugging or testing purposes.
The performance counter component contains seven individual 64-bit counters that can
be used to measure the application performance by counting CPU cycles. This counter is
used to measure the application runtime performance for all of the results presented in this
thesis.
The System Avalon Bus also connects the CPU with the custom secure memory controller component. This device sits between the CPU and off-chip memory in order to encrypt all external communication. On this particular development board, the I/O pins used
to communicate with the SRAM chips are shared between multiple components. Therefore, a tri-state bridge is required to communicate with the off-chip memory. As shown
in Figure 4.1, the secure memory controller is connected to the tri-state bridge by the use
of a second Avalon bus. The tri-state bridge is then connected directly to the I/O pins
controlling the SRAM.

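[Figure: test system on the Cyclone III FPGA. The Nios II processor is the master (M) on the System Avalon Bus; the JTAG UART, on-chip memory, System ID, performance counter, and secure memory controller are slaves (S). The secure memory controller is the master of a second Avalon bus connected to the tri-state bridge, which connects to the off-chip SRAM.]
Figure 4.1: Test System Setup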

4.2 Benchmarking Software

Various applications were used to evaluate the system performance when using the secure
memory controller. These applications were either designed for this thesis or obtained from
a 3rd party source. The software applications vary in the way they process data and access
memory in order to evaluate a general case for performance. Some applications are more
CPU intensive where the amount of work required by the processor is more significant
than the time required for reading or writing memory. Benchmarks that are memory intensive require a much simpler CPU operation to be performed on a larger set of data. The
tests along with some of their properties are shown in Table 4.2. The Image Rotation, 2D
Discrete Cosine Transform, and the Dhrystone applications are all individual benchmark
tests.
Image Rotation Test
This application simulates the simple operation of rotating an image through matrix
multiplication. Each image is represented by a 50x50 pixel two-dimensional array. Each array
element represents one pixel or byte of data; therefore each image is 2,500 bytes in size.
Each time the application is executed it dynamically allocates memory for a total of eight
images to be rotated. The actual image data is not important; therefore no image data is
stored at these memory locations.
The image transformation matrix is also created by allocating a two-dimensional byte
array in memory. Again the initial contents of this memory are not important since they will
not affect the resulting performance. The images are then rotated by multiplying each pixel
matrix by the transformation matrix. The result is then written back to the memory initially
allocated for the images.
This benchmark is intended to simulate an application that is memory bound. The
multiplication operation is much quicker than the time it takes to read and write memory;
therefore the system performance is directly dependent on the speed of application data
storage and retrieval.
2D Discrete Cosine Transform Test
This application performs discrete cosine transform operations, which are often used to
compress audio or visual data. In this particular implementation a transformation is applied
to a two-dimensional array of data. The application data is represented by 25 two-dimensional
double arrays with dimensions 10x10. The required memory space is dynamically
allocated by the application at runtime.

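$$X_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2} \cos\left[\frac{\pi}{N_1}\left(n_1+\frac{1}{2}\right)k_1\right] \cos\left[\frac{\pi}{N_2}\left(n_2+\frac{1}{2}\right)k_2\right] \quad \text{[21]} \tag{4.1}$$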

When the application is executed the CPU reads in each array from memory and applies
Equation 4.1 to each 10x10 matrix. Each result (X_{k1,k2}) is a combination of all elements
in the matrix. Also, the data values in this application are stored as doubles; therefore
the CPU must perform floating point operations for each DCT calculation. Since the Nios II
CPU has no hardware floating point support, each calculation requires many cycles. This
in turn results in a benchmarking application that is very computationally expensive in
terms of cycles per result.
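A direct evaluation of Equation 4.1 can be sketched in Python as shown below. This is not the benchmark source code, which runs on the Nios II; the function name is an assumption, and the loops follow the equation term by term without any optimization.

```python
import math


def dct_2d(x):
    """Evaluate Equation 4.1 for every output element (k1, k2)."""
    n1_size, n2_size = len(x), len(x[0])
    out = [[0.0] * n2_size for _ in range(n1_size)]
    for k1 in range(n1_size):
        for k2 in range(n2_size):
            total = 0.0
            for n1 in range(n1_size):
                c1 = math.cos(math.pi / n1_size * (n1 + 0.5) * k1)
                for n2 in range(n2_size):
                    c2 = math.cos(math.pi / n2_size * (n2 + 0.5) * k2)
                    total += x[n1][n2] * c1 * c2
            out[k1][k2] = total
    return out


# A constant input has all of its energy in the (0, 0) coefficient.
result = dct_2d([[1.0, 1.0], [1.0, 1.0]])
print(result[0][0])
```

For a 10x10 matrix every one of the 100 outputs sums over all 100 inputs, which illustrates why the benchmark is dominated by floating point computation rather than memory access.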
Dhrystone Test
The Dhrystone benchmark application was originally developed as a general purpose test of
CPU integer processing performance. This test was chosen for its popularity
among embedded developers and to demonstrate this system’s performance while running
an application developed by a third party [22].
This test provides a combination of memory read and integer processing tests which
puts it in between the 2D DCT and Image Rotation tests for CPU and memory intensity.
This test was provided with the Altera development tools and was not modified in any way.
Table 4.2: Benchmark Applications

Software        Size (Inst/Data)  Computation Type  CPU Intensity  Memory Intensity
Image Rotation  3KB/19.5KB        Integer           Low            High
2D DCT          17KB/9.75KB       Floating Point    High           Medium
Dhrystone       13KB              Integer           Medium         Medium

5. Evaluation
Embedded systems are becoming increasingly complex. It is very common for multiple
applications to be running on a single system within an operating system (OS). Therefore,
to better evaluate the real world performance of this solution, an embedded operating
system was used to run multiple applications simultaneously.
The Micro C OS was chosen as the embedded operating system for its compatibility
with the Nios II core and its availability in the Altera development software. This allowed
for multiple applications, or multiple instances of a single application, to be executed at the
same time.
For the results provided in this thesis all of the benchmarking applications were executed
within the Micro C OS using a time division scheduling algorithm. The results were
obtained by measuring the total execution time of the entire system with the performance
counter peripheral.
For each test, several parameters were varied in order to evaluate different cache
configurations. The main parameters are shown in Table 5.1. The cache size determines how
much memory is allocated to storing keystreams. The cache type determines how the cache
memory is structured and how keystreams are generated. The replacement policy determines
which keystreams should be replaced at a particular cache index. The replacement
policy parameter has an effect only when using the 2-way or 4-way associative cache.

5.1 Resource Utilization

The total additional resources consumed by the secure memory controller are relatively
small compared to the total test system as shown in Figure 5.1. The individual component
only uses 1,232 logic elements and 18 memory blocks. The synchronous memory blocks
in the Altera FPGA are also known as M9Ks.

Table 5.1: Cache Parameters

Parameter           Options
Cache Size          0.5k, 1k, 2k, 4k, 8k, 16k
Cache Type          Direct, 2-wide, 4-wide, 2-way, 4-way
Replacement Policy  FIFO, Least Recently Accessed, Least Frequently Accessed

Table 5.2: Test System Resource Usage

Hardware                Logic Elements  Memory Bits  M9Ks
Total Test System       5,710           8,958,546    158
Memory Controller Only  1,232           316,504      18
Total Available         119,088         31,850,496   432
The discrepancy between the percentage of memory bits used and the percentage of
M9Ks used is due to the method in which performance is evaluated in this prototype. The
performance test for each setup varies the cache size to evaluate its effects on performance
for a particular application size. In order to get a better measurement the cache sizes were
set to smaller values that were not exact multiples of the available memory block size (9
kilobits). Therefore the memory efficiency appears to be much lower than it may be in a
practical application where the cache size would be set to fully utilize the available memory
blocks.

5.2 Direct Mapped Cache

The application performance of the direct mapped cache was evaluated using the MicroC
OS running the 2D DCT, image rotation, and Dhrystone tests simultaneously. Each application was executed twice resulting in a total of six simultaneous tasks.
The same test was executed for each cache size (0 byte, 8 byte, 512 byte, 1 kbyte, 2
kbyte, 4 kbyte, 8 kbyte, and 16 kbyte) while recording the total number of cache hits and
misses, along with the total number of execution cycles. Figure 5.2 shows the reciprocal of
application execution time compared to a system that does not use any encryption at all.

[Figure: share of FPGA resources used by the secure memory controller. Logic elements: 1% utilized, 99% free. Memory bits: 1% utilized, 99% free. M9Ks: 4% utilized, 96% free.]
Figure 5.1: Secure Memory Controller Resource Usage

[Figure: bar chart of application performance (1/performance overhead) and keystream cache hit rate versus keystream cache size (no encryption, 0, 8 bytes, 512 bytes, 1k, 2k, 4k, 8k, 16k).]
Figure 5.2: Direct Mapped Cache Application and Cache Performance
As expected the application performance and keystream cache hit rate increase as the
cache size increases. This is due to the fact that more of the keystream data can be reused
without regenerating it with the AES engine.
The tradeoff between application performance and memory usage is displayed in Figure 5.3.
The chart shows the percentage increase in performance and resources compared
to a system that does not use a cache at all. At cache sizes of 2 kbyte and above the
application performance does not improve significantly compared to the increase in memory
usage. This is due to the fact that the majority of the functions in the test applications are
small enough that two kilobytes of keystream data can be used to encrypt and decrypt the
required memory space without needing to generate many new keystreams. A summary of
the performance and resource data is provided in Table 5.3.

[Figure: bar chart of % memory usage increase and % application performance increase versus keystream cache size (no encryption, 0, 8 bytes, 512 bytes, 1k, 2k, 4k, 8k, 16k).]
Figure 5.3: Direct Mapped Cache Memory Usage

Table 5.3: Direct Mapped Cache Results

Cache Size (bytes)  Logic Elements  Memory Bits  M9Ks  Hits       Misses     Hit Rate  Cycles      Application Performance
No SMC              n/a             n/a          n/a   n/a        n/a        n/a       4.00x10^9   100%
0                   1466            22528        14    n/a        n/a        n/a       10.43x10^9  38%
8                   1698            22528        14    0.30x10^9  0.17x10^9  0.62856   6.70x10^9   60%
512                 1254            27104        18    0.31x10^9  0.10x10^9  0.62856   5.33x10^9   75%
1024                1267            31616        18    0.33x10^9  0.8x10^9   0.78818   5.14x10^9   78%
2048                1264            40576        18    0.36x10^9  0.6x10^9   0.85797   4.89x10^9   82%
4096                1270            58368        18    0.37x10^9  0.3x10^9   0.90680   4.64x10^9   86%
8192                1255            93696        22    0.38x10^9  0.1x10^9   0.95633   4.23x10^9   95%
16384               1256            163840       30    0.39x10^9  0.6x10^9   0.98262   4.12x10^9   97%
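The application performance column of Table 5.3 can be reproduced from the cycle counts, taking performance as the ratio of the cycle count without the secure memory controller to the cycle count of each setup. The short sketch below shows the calculation; the variable and function names are illustrative assumptions.

```python
BASELINE_CYCLES = 4.00e9  # "No SMC" row of Table 5.3

# Total execution cycles for each keystream cache size in bytes (Table 5.3).
cycles_by_cache_size = {
    0: 10.43e9, 8: 6.70e9, 512: 5.33e9, 1024: 5.14e9,
    2048: 4.89e9, 4096: 4.64e9, 8192: 4.23e9, 16384: 4.12e9,
}


def application_performance(cycles: float) -> float:
    """Performance relative to the unencrypted system (1.0 means no overhead)."""
    return BASELINE_CYCLES / cycles


percentages = {size: round(100 * application_performance(cycles))
               for size, cycles in cycles_by_cache_size.items()}
print(percentages)
```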

5.3 Direct Mapped Cache with Variable Cache Line Length

The variable cache line length was evaluated using the MicroC OS running the 2D DCT,
image rotation, and Dhrystone benchmark applications. The application performance was
recorded using the performance counter peripheral. Looking at the results displayed in Figure 5.4 it was observed that the use of multiple engines in parallel has a much greater effect
on performance when a smaller cache size is used. This is due to the fact that generating
larger amounts of keystream data can cause a direct increase in hit rate regardless of cache
size. When a larger cache size is used, the additional performance gained by using multiple
AES engines is hidden by the fact that most of the required keystreams are stored in cache
and reused.

[Figure: bar chart of application performance (1/performance overhead, in percent) for each cache setup.]
Figure 5.4: Variable Cache Line Length Application Performance
The resource usage for each cache setup can be seen in Figure 5.5 and Figure 5.7. Here
the changes in resource utilization are dominated by the use of additional AES engines. The
AES engine is the most complex component of the secure memory controller; therefore
duplicating this hardware results in a significant increase in required logic resources.
[Figure: bar chart of logic elements used for each cache setup.]
Figure 5.5: Variable Cache Line Length Logic Element Utilization

In order to decrease the impact of the hardware resources required for this architecture
an alternative approach to AES key expansion was investigated. Instead of expanding the
key in each AES engine, a single expanded key can be used. This is due to the fact that
each AES engine can use the same 256-bit secret key, and therefore will result in the same
expanded key. To test this method in hardware the key expansion logic was removed from
each AES engine except for one. The single expanded key was then shared among all
engines. The logic element utilization for a variable length cache line with a single expanded
key can be seen in Figure 5.6.
The M9K utilization shown in Figure 5.8 reveals that the number of memory elements used with
different cache sizes is exactly the same except for the 8k cache. This is again due to the
fact that the total memory usage is dominated by AES engine hardware.
[Plot: Logic Elements Used versus Cache Setup. Bar labels as extracted, in no recoverable
order: 2380, 2364, 1330, 889, 2374, 1346, 1332, 1321, 893, 863, 2365, 873, 237]

Figure 5.6: Variable Cache Line Length Logic Element Utilization (Single Key Expansion)

The performance results for the variable length cache line reveal that a cache line with
more than 256 bits of keystream data does not result in a significant performance increase.
This is caused by the way the Nios II CPU cache requests data from memory. If there
is a miss in either the CPU instruction or data cache, the respective cache logic requests
a total of eight words, or 256 bits, of data from memory. Therefore, every time the secure
memory controller receives a memory request from the CPU, it requires at least 256 bits
of keystream data for decryption. If this data is not available in the keystream
cache, a setup that generates 256 bits of keystream data at once provides much better
performance than one that generates only 128 bits at a time. This is the reason for the
significant performance improvement between the single wide and double wide cache line
setups. For the applications used in this work, a cache line wider than double resulted
in wasted keystream data, since the system does not always use the extra keystream.
While this does not decrease system performance, the advantages of increasing the cache
line past 256 bits did not justify the increase in hardware utilization.
A summary of the performance and resource data is provided in Table 5.4.
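The argument about line width can be made concrete with a small calculation. The sketch below assumes that each AES engine produces one 128-bit keystream block per pass and that an N-wide setup runs N engines in parallel; the function name and the notion of a "pass" are illustrative and not taken from the implementation.

```python
import math

AES_BLOCK_BITS = 128   # keystream produced by one engine in one pass
CPU_FILL_BITS = 256    # eight 32-bit words per Nios II cache line fill


def fill_cost(line_width):
    """Return (sequential AES passes, keystream bits beyond the fill).

    line_width is the assumed number of parallel engines: 1, 2 or 4 for
    the 1-Wide, 2-Wide and 4-Wide setups.
    """
    bits_per_pass = AES_BLOCK_BITS * line_width
    passes = math.ceil(CPU_FILL_BITS / bits_per_pass)
    extra = passes * bits_per_pass - CPU_FILL_BITS
    return passes, extra


for width in (1, 2, 4):
    passes, extra = fill_cost(width)
    print(f"{width}-Wide: {passes} pass(es), {extra} keystream bits beyond the fill")
```

Under these assumptions a keystream miss costs two sequential passes with a single wide line and one pass with a double wide line, while a quadruple wide line still needs one pass and generates 256 bits that the current fill does not consume.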

[Plot: Memory Bits Used versus Cache Setup]

Figure 5.7: Variable Cache Line Length Memory Utilization

[Plot: M9Ks Used versus Cache Setup]

Figure 5.8: Variable Cache Line Length M9K Utilization

Table 5.4: Variable Cache Line Width Results

Cache Size (bytes)  Line Width  Logic Elements  Memory Bits  M9Ks  Cycles       Application Performance
No SMC              n/a         0               0            0     4.00 x 10^9  100%
0                   n/a         1239            9088         4     10.4 x 10^9  38.40%
1024                1-Wide      1239            31616        18    5.15 x 10^9  77.69%
1024                2-Wide      2031            53696        36    4.52 x 10^9  88.53%
1024                4-Wide      3764            98528        70    4.41 x 10^9  90.77%
2048                1-Wide      1218            40576        18    4.79 x 10^9  83.55%
2048                2-Wide      2021            62272        36    4.34 x 10^9  92.12%
2048                4-Wide      3780            106912       70    4.27 x 10^9  93.72%
4096                1-Wide      1240            58368        18    4.37 x 10^9  91.66%
4096                2-Wide      2032            79360        36    4.13 x 10^9  96.86%
4096                4-Wide      3774            123648       70    4.10 x 10^9  97.56%
8192                1-Wide      1230            93696        22    4.22 x 10^9  94.84%
8192                2-Wide      2046            113408       36    4.05 x 10^9  98.72%
8192                4-Wide      3765            157056       70    4.02 x 10^9  99.47%
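The Application Performance values in Table 5.4 appear to be the ratio of the cycle count without the secure memory controller to the cycle count of each configuration. The following sketch reproduces that calculation under this assumption; the function names are illustrative, and small differences from the tabulated percentages are expected because the cycle counts are rounded to three digits.

```python
BASELINE_CYCLES = 4.00e9  # "No SMC" row of Table 5.4


def application_performance(cycles, baseline=BASELINE_CYCLES):
    """Performance relative to the unencrypted system, in percent."""
    return 100.0 * baseline / cycles


def overhead(cycles, baseline=BASELINE_CYCLES):
    """Additional execution time relative to the unencrypted system, in percent."""
    return 100.0 * (cycles - baseline) / baseline


# Three rows of Table 5.4, keyed by a descriptive label.
rows = {
    "0 B cache": 10.4e9,
    "1024 B, 1-Wide": 5.15e9,
    "8192 B, 4-Wide": 4.02e9,
}
for name, cycles in rows.items():
    print(f"{name}: {application_performance(cycles):.2f}% performance, "
          f"{overhead(cycles):.1f}% overhead")
```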

5.4 N-Way Associative Cache

Modifying the associativity of the keystream cache changes how keystreams are stored.
The various configurations and replacement policies can have a different effect on application performance depending on how that application accesses memory. In order to show
these effects, the performance of individual applications was measured. Instead of running
multiple types of applications, individual instances of the same application were executed
simultaneously within the Micro C OS.

[Plot: 1 / Performance Overhead versus Cache Configuration]

Figure 5.9: Floating Point Associative Cache Performance using 2D DCT Benchmark
The performance results shown in Figure 5.9 were gathered by measuring the execution
time of the 2D DCT application. In this case the extra flexibility provided by the associative
cache helped to improve performance for all cache sizes. The performance benefit gained
by the n-way cache is also much more significant with larger cache sizes, as shown by the 3%
gain of the 4-way cache over the direct mapped cache when using 8 KB of cache memory. A
summary of the performance and resource data is provided in Table 5.5.
[Plot: 1 / Performance Overhead versus Cache Configuration]

Figure 5.10: Integer Associative Cache Performance Using Dhrystone Benchmark

The results for the Dhrystone application performance in Figure 5.10 show an even
larger gain in performance when using the associative cache. In the best case the system
achieves as little as 1% overhead when using 8 KB of cache configured with 4-way
associativity. This application also shows a much greater improvement between the direct
mapped and the associative cache. When 8 KB of cache is used, the performance
gain between the direct mapped and 4-way cache is over 5%. A summary of the performance
and resource data is provided in Table 5.6.
The resource utilization for each associative cache policy is shown in Figure 5.11 and
Figure 5.12. The linear increase in logic usage is due to the fact that a larger number of
comparators are required for the control logic to evaluate the cache line usage statistics
such as the order in which they were accessed, or how many times they were accessed.
More comparators are also required to check for a hit on a particular cache line since all of
the cache lines at one index must be evaluated at the same time.
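The need for one comparator per way can be seen in a minimal software model of the lookup. The sketch assumes 128-bit (16-byte) keystream lines and a simple index/tag split of the line address; the class and field names are hypothetical and do not describe the implemented controller. The loop over the ways stands in for comparisons that hardware would perform in parallel.

```python
from dataclasses import dataclass, field

LINE_BYTES = 16  # assumed 128-bit keystream line


@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    keystream: bytes = b""


@dataclass
class KeystreamCache:
    size_bytes: int
    ways: int
    num_sets: int = field(init=False)
    sets: list = field(init=False)

    def __post_init__(self):
        self.num_sets = self.size_bytes // (LINE_BYTES * self.ways)
        self.sets = [[Line() for _ in range(self.ways)]
                     for _ in range(self.num_sets)]

    def split(self, address):
        """Split a byte address into (set index, tag)."""
        line_number = address // LINE_BYTES
        return line_number % self.num_sets, line_number // self.num_sets

    def lookup(self, address):
        """Return the way that hits, or None on a miss."""
        index, tag = self.split(address)
        # One tag comparison per way; hardware evaluates these simultaneously.
        for way, line in enumerate(self.sets[index]):
            if line.valid and line.tag == tag:
                return way
        return None

    def fill(self, address, way, keystream):
        """Store a keystream line in the given way of the addressed set."""
        index, tag = self.split(address)
        self.sets[index][way] = Line(True, tag, keystream)


# Two addresses that share set 0 can coexist in a 2-way cache...
two_way = KeystreamCache(size_bytes=1024, ways=2)
a, b = 0x0000, 0x0200
two_way.fill(a, 0, b"\x00" * LINE_BYTES)
two_way.fill(b, 1, b"\x11" * LINE_BYTES)

# ...whereas two addresses that share an index evict each other when direct mapped.
direct = KeystreamCache(size_bytes=1024, ways=1)
c = 0x0400
direct.fill(a, 0, b"\x00" * LINE_BYTES)
direct.fill(c, 0, b"\x22" * LINE_BYTES)
```

In the direct mapped instance the second fill overwrites the first, so a later lookup of the first address misses; in the 2-way instance both lookups hit.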
The number of memory bits used by each cache size and configuration is shown in
Figure 5.12. Since the amount of data that can be cached at a given cache size stays the
same for each configuration, the overall memory usage for each cache configuration is
approximately the same. The associative cache configurations show a slight increase because
a small amount of memory is required to keep track of the cache line usage statistics for the
replacement policies.
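The size of this increase can be estimated from the Memory Bits column of Table 5.5. The sketch below assumes 128-bit keystream lines, which is not stated explicitly for these configurations, and simply divides the additional memory bits of the 2-way and 4-way caches by the number of lines. How the extra bits are split between tag storage and usage statistics is not given here, so the result is only a per-line total.

```python
LINE_BITS = 128  # assumed keystream line width

# Cache size in bytes -> memory bits for 1-way, 2-way and 4-way (Table 5.5).
MEMORY_BITS = {
    1024: (31616, 31936, 32000),
    2048: (40576, 41216, 41344),
    4096: (58368, 59648, 59904),
    8192: (93696, 96256, 96768),
}


def extra_bits_per_line(cache_bytes):
    """Return the added memory bits per line for (2-way, 4-way) over direct mapped."""
    lines = cache_bytes * 8 // LINE_BITS
    direct, two_way, four_way = MEMORY_BITS[cache_bytes]
    return (two_way - direct) / lines, (four_way - direct) / lines


for size in MEMORY_BITS:
    two, four = extra_bits_per_line(size)
    print(f"{size} bytes: +{two:g} bits/line (2-way), +{four:g} bits/line (4-way)")
```

Under the 128-bit line assumption, every cache size works out to 5 additional bits per line for the 2-way cache and 6 for the 4-way cache, which is consistent with the slight increase described above.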
[Plot: Logic Elements Used versus Cache Configuration]

Figure 5.11: Associative Cache Logic Element Utilization

To test the performance of the available replacement policies, the image rotation, 2D
DCT, and Dhrystone tests were executed simultaneously within the Micro C OS. The
resulting application performance is shown in Figure 5.13. The least frequently accessed
policy showed the greatest performance, although it also requires the largest number of
logic elements compared to the other policies. The differences in memory usage for each
policy were negligible and are therefore not shown.
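The three replacement policies differ only in which usage statistic selects the line to evict. The following sketch models a single cache set in software, under the assumption that FIFO evicts the oldest fill, least recently used evicts the line with the oldest access, and least accessed evicts the line with the fewest accesses. The function names and the access trace are illustrative; the toy trace says nothing about the ranking measured in this work.

```python
def choose_victim(policy, lines):
    """Return the index of the line to evict from a full set."""
    statistic = {
        "fifo": "filled_at",        # oldest fill time
        "lru": "last_used",         # oldest access time
        "least_accessed": "hits",   # fewest accesses since fill
    }[policy]
    return min(range(len(lines)), key=lambda i: lines[i][statistic])


def simulate(policy, ways, trace):
    """Count hits for a sequence of tags that all map to one set."""
    lines = []
    hits = 0
    for time, tag in enumerate(trace):
        line = next((entry for entry in lines if entry["tag"] == tag), None)
        if line is not None:
            hits += 1
            line["last_used"] = time
            line["hits"] += 1
            continue
        new = {"tag": tag, "filled_at": time, "last_used": time, "hits": 1}
        if len(lines) < ways:
            lines.append(new)
        else:
            lines[choose_victim(policy, lines)] = new
    return hits


TRACE = ["A", "B", "A", "C", "A", "B"]
for policy in ("fifo", "lru", "least_accessed"):
    print(policy, simulate(policy, ways=2, trace=TRACE))
```

Each policy needs a different per-line counter or timestamp, which corresponds to the usage statistics and comparators that account for the additional logic elements of the associative configurations.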

[Plot: Memory Usage (Kbytes) versus Cache Configuration]

Figure 5.12: Associative Cache Memory Utilization

[Plots: 1/Application Performance and Total Logic Elements Used versus Replacement Policy
(No Encryption, FIFO, Least Recently Used, Least Accessed). Bar labels as extracted, in no
recoverable order: 100.00%, 92.26%, 90.67%, 87.78%; 2170, 1699, 1518]

Figure 5.13: Associative Cache Replacement Policy Performance and Resource Utilization

Table 5.5: N-Way Associative 2D DCT Results

Cache Size (bytes)  Way  Logic Elements  Memory Bits  M9Ks  Cycles        Application Performance
No SMC              n/a  0               0            0     5.11 x 10^9   100.00%
0                   n/a  999             22528        14    13.33 x 10^9  38.34%
1024                1    1234            31616        18    6.49 x 10^9   78.78%
1024                2    1707            31936        24    6.38 x 10^9   80.12%
1024                4    2161            32000        34    6.37 x 10^9   80.20%
2048                1    1234            40576        18    5.99 x 10^9   85.21%
2048                2    1713            41216        24    5.99 x 10^9   85.39%
2048                4    2168            41344        34    5.96 x 10^9   85.64%
4096                1    1232            58368        18    5.75 x 10^9   88.72%
4096                2    1700            59648        24    5.55 x 10^9   92.06%
4096                4    2148            59904        34    5.56 x 10^9   91.90%
8192                1    1244            93696        22    5.43 x 10^9   94.17%
8192                2    1695            96256        24    5.35 x 10^9   95.51%
8192                4    2146            96768        34    5.25 x 10^9   97.20%

Table 5.6: N-Way Associative Dhrystone Results

Cache Size (bytes)  Way  Logic Elements  Memory Bits  M9Ks  Cycles        Application Performance
No SMC              n/a  0               0            0     4.31 x 10^9   100.00%
0                   n/a  999             22528        14    11.53 x 10^9  37.42%
1024                1    1234            31616        18    5.58 x 10^9   77.35%
1024                2    1707            31936        24    5.50 x 10^9   78.50%
1024                4    2161            32000        34    5.44 x 10^9   79.34%
2048                1    1234            40576        18    5.20 x 10^9   82.93%
2048                2    1713            41216        24    5.23 x 10^9   82.59%
2048                4    2168            41344        34    5.22 x 10^9   82.65%
4096                1    1232            58368        18    4.77 x 10^9   90.46%
4096                2    1700            59648        24    4.71 x 10^9   91.64%
4096                4    2148            59904        34    4.70 x 10^9   91.86%
8192                1    1244            93696        22    4.61 x 10^9   93.58%
8192                2    1695            96256        24    4.40 x 10^9   97.95%
8192                4    2146            96768        34    4.35 x 10^9   99.00%

6. Conclusions
In this thesis, a memory controller was implemented and evaluated to reveal the
performance and hardware tradeoffs of obscuring off-chip memory for an embedded system. A
system prototype was developed for the Cyclone III FPGA using a Nios II processor and
off-chip SRAM to simulate a typical embedded system. Various synthetic applications were
used to benchmark the performance overhead of encrypting all memory contents. The
results show that this can be achieved while incurring as little as 1% performance overhead.
This solution could be used in any embedded system that utilizes off-chip memory to secure
information without a significant impact on performance.
The performance enhancements tested in this thesis provide a way to increase
performance and flexibility at the cost of additional hardware resources. Higher system
performance can be achieved by adding features such as parallel AES engines or associative
cache policies, although beyond a certain point additional hardware does not result in a
significant gain.
Future designs based on this system could explore more advanced techniques for generating keystreams such as predictive generation. Using information about how the CPU
accesses memory could provide insight into what keystreams will be needed before they
are requested by the CPU. There are also optimizations that could be made for specific
hardware platforms such as different FPGAs or ASIC systems.

References
[1] B. S. Alliance, “Seventh Annual BSA/IDC Global Software 09 Piracy Study.” Online,
May 2010.
[2] R. Anderson and M. Kuhn, “Low cost attacks on tamper resistant devices,” in Security
Protocols (B. Christianson, B. Crispo, M. Lomas, and M. Roe, eds.), vol. 1361 of
Lecture Notes in Computer Science, pp. 125–136, Springer Berlin / Heidelberg, 1998.
10.1007/BFb0028165.
[3] A. B. Huang, Hacking the Xbox: An Introduction to Reverse Engineering. San Francisco, CA, USA: No Starch Press, 2003.
[4] I. Finder, “Data Loss Prevention: Data-at-Rest vs. Data-in-Motion,” white paper,
Identity Finder, LLC., 2009.
[5] SNIA Security Technical Work Group, “Encryption of Data-at-Rest,” tech. rep., Storage Networking Industry Association, 2009.
[6] A. B. Huang, “The Trusted PC: Skin-Deep Security,” Computer, vol. 35, pp. 103–105,
October 2002.
[7] C. D. Mackey and M. T. Kurdziel, “Secure processing device with keystream cache
and related methods,” Patent Application 20100299537, Harris Corporation, 255 S
Orange Avenue, Suite 1401, Orlando, FL, 32801, US, Nov. 2010.
[8] National Institute of Standards and Technology (NIST), “Specification for the Advanced Encryption Standard (AES).” Federal Information Processing Standards Publication 197, 2001.
[9] L. Hathaway, “National Policy on the Use of the Advanced Encryption Standard
(AES) to Protect National Security Systems and National Security Information.” Online, June 2003. CNSS Policy No. 15, Fact Sheet No. 1.

[10] R. Vaslin, G. Gogniat, J.-P. Diguet, R. Tessier, D. Unnikrishnan, and K. Gaj, “Memory
security management for reconfigurable embedded systems,” in ICECE Technology,
2008. FPT 2008. International Conference on, pp. 153–160, Dec. 2008.
[11] G. E. Suh, D. Clarke, B. Gassend, M. v. Dijk, and S. Devadas, “Efficient memory
integrity verification and encryption for secure processors,” in Proceedings of the
36th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36,
(Washington, DC, USA), pp. 339–, IEEE Computer Society, 2003.
[12] J. Yang, L. Gao, and Y. Zhang, “Improving memory encryption performance in secure
processors,” IEEE Transactions on Computers, vol. 54, pp. 630–640, 2005.
[13] Z. Liu, W. Huo, X. Zou, and Y. Lin, “A lightweight memory encryption cache design
and implementation for embedded processor,” in Integrated Circuits, ISIC ’09.
Proceedings of the 2009 12th International Symposium on, pp. 57–60, Dec. 2009.
[14] R. Elbaz, L. Torres, G. Sassatelli, P. Guillemin, M. Bardouillet, and A. Martinez, “A
parallelized way to provide data encryption and integrity checking on a processor-memory
bus,” in Proceedings of the 43rd annual Design Automation Conference,
DAC ’06, (New York, NY, USA), pp. 506–509, ACM, 2006.
[15] S. Muhlbach and S. Wallner, “Secure communication in microcomputer bus systems
for embedded devices,” Journal of Systems Architecture, vol. 54, no. 11, pp. 1065–1076,
2008. Embedded Systems: Architectures, Modeling and Simulation.
[16] R. J. Anderson, Security Engineering: A Guide to Building Dependable Distributed
Systems. Wiley Publishing, 2 ed., 2008.
[17] C. Yan, B. Rogers, D. Englender, D. Solihin, and M. Prvulovic, “Improving cost,
performance, and security of memory encryption and authentication,” in Computer
Architecture, 2006. ISCA ’06. 33rd International Symposium on, pp. 179–190, 2006.
[18] D. A. McGrew and J. Viega, “The Galois/Counter Mode of Operation (GCM).”
Submission to NIST Modes of Operation Process, 2005.
[19] G. Zhou, H. Michalik, and L. Hinsenkamp, “Efficient and High-Throughput
Implementations of AES-GCM on FPGAs,” in Field-Programmable Technology, 2007.
ICFPT 2007. International Conference on, pp. 185–192, Dec. 2007.

[20] R. Vaslin, G. Gogniat, J.-P. Diguet, R. Tessier, and W. Burleson, “Low latency
solution for confidentiality and integrity checking in embedded systems with off-chip
memory,” in ReCoSoc proceedings 2007 Reconfigurable communication-centric
Socs 2007, (Montpellier France), June 2007.
[21] N. Ahmed, T. Natarajan, and K. Rao, “Discrete cosine transform,” Computers, IEEE
Transactions on, vol. C-23, pp. 90–93, Jan. 1974.
[22] R. P. Weicker, “Dhrystone: a synthetic systems programming benchmark,” Commun.
ACM, vol. 27, pp. 1013–1030, October 1984.
