A Secure Architecture for Defense Against Return Address Corruption by Bruner, Grayson J
University of Tennessee, Knoxville 
TRACE: Tennessee Research and Creative 
Exchange 
Masters Theses Graduate School 
5-2021 
A Secure Architecture for Defense Against Return Address 
Corruption 
Grayson J. Bruner 
University of Tennessee, Knoxville, gbruner@vols.utk.edu 
Recommended Citation 
Bruner, Grayson J., "A Secure Architecture for Defense Against Return Address Corruption. " Master's 
Thesis, University of Tennessee, 2021. 
https://trace.tennessee.edu/utk_gradthes/6184 
To the Graduate Council: 
I am submitting herewith a thesis written by Grayson J. Bruner entitled "A Secure Architecture 
for Defense Against Return Address Corruption." I have examined the final electronic copy of 
this thesis for form and content and recommend that it be accepted in partial fulfillment of the 
requirements for the degree of Master of Science, with a major in Computer Engineering. 
Garrett S. Rose, Major Professor 
We have read this thesis and recommend its acceptance: 
Garrett S. Rose, Ahmedullah Aziz, Jian Liu 
Accepted for the Council: 
Dixie L. Thompson 
Vice Provost and Dean of the Graduate School 
(Original signatures are on file with official student records.) 
A Secure Architecture for Defense
Against Return Address Corruption
A Thesis Presented for the
Master of Science
Degree
The University of Tennessee, Knoxville
Grayson Bruner
May 2021




Thank you to Dr. Garrett Rose, my mentor, supervisor, and chair of my thesis committee.
Thank you to Ben Sergent and Mst. Shamim Ara Shawkat for their contributions to this
work. Thank you to Md Badruddoja Majumder, Mark Lee, and Jonathan Ting for their
work on the Mini-RISC-V Core. Thank you to Dr. Ahmedullah Aziz and Dr. Jian Liu,
members of my thesis committee.
This material is based upon work supported by the Air Force Office of Scientific
Research under award number FA9550-16-1-0301. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the authors and do not necessarily
reflect the views of the United States Air Force.
Abstract
The advent of the Internet of Things has brought about a staggering level of inter-
connectivity between common devices used every day. Unfortunately, security is not a high
priority for developers designing these IoT devices. Often, securing a device comes at too
high a cost in other areas, such as performance or power consumption. This
is especially prevalent in resource-constrained devices, which make up a large number of IoT
devices. However, a lack of security could lead to a cascade of security breaches rippling
through connected devices. One of the most common attacks used by hackers is
return-oriented programming (ROP). With it, the attacker seeks to take over the control flow of a program
by modifying function return addresses. The prevalence of these kinds of attacks makes
security against them paramount. This thesis proposes a secure architecture that leverages
a return address hardware stack and a cryptography unit to securely store redundant copies
of return addresses to be used to maintain control flow integrity. Furthermore, this work seeks
to provide this layer of security against ROP attacks with as little compromise as possible





2 Background and Attack Model
2.1 The RISC-V Instruction Set Architecture
2.2 Concepts of Return-Oriented Programming
2.3 Mitigations to ROP Attacks and Their Weaknesses
2.3.1 Address Space Layout Randomization
2.3.2 Stack Canaries
2.3.3 Memory Protections
2.3.4 Flow Control Monitoring
2.3.5 Return Address Stack
2.4 Attack Model
3 System Design
3.1 The RISC-V Core
3.2 The Cryptographic Return Address Stack
3.2.1 The Return Address Stack
3.2.2 The Encryption Engine
3.2.3 The Control Module
3.3 Integration with the RISC-V Core
4 Performance and Security Analysis
4.1 Hardware Implementation
4.1.1 Resource Utilization
4.1.2 Power Consumption
4.2 Performance Results
4.3 Security Analysis
5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
5.2.1 Further Exploration Into the Optimization of Preemptive Stack Handling
5.2.2 Implementing Support for Context Switching
5.2.3 Exploration into alternatives to SIMON
Bibliography
Appendices
A Program Code
A.1 Buffer Overflow Exploitable Function
A.2 Quicksort Algorithm
A.3 Triangular Numbers Algorithm
A.4 Mergesort Algorithm
B Attack Model Algorithms
B.1 Unprevented Buffer Overflow
B.2 RAS Attack Algorithm
C CRAS SystemVerilog Design Code
C.1 Return Address Stack




4.1 Hardware Implementation Results of the Proposed RISC-V Core
4.2 Runtime Overhead of Algorithms
List of Figures
2.1 Structure of the program stack.
2.2 Example function vulnerable to buffer overflow attacks even with a canary.
3.1 A Block Diagram of the 5-Stage Pipeline Structure.
3.2 A Block Diagram of the CRAS Module.
3.3 Diagram of the SIMON Round Function.
3.4 CRAS Control State Machine
4.1 Comparison of the worst-case relative runtime overhead of the Quicksort
algorithm for a CRAS with a 32-address capacity and a CRAS with a 64-
address capacity.
4.2 Comparison of the worst-case relative runtime overhead of calculating the Nth
triangular number for a CRAS with a 32-address capacity and a CRAS with
a 64-address capacity.
4.3 Comparison of the worst-case relative runtime overhead of the Mergesort
algorithm.




The rapid growth of the Internet of Things (IoT) has led to an unprecedented level of inter-
connectivity among the devices that surround us every day. Our refrigerators, thermostats,
cell phones, and now even our cars are all connected to the internet and to each other. Cisco
Systems projects that there will be 29.3 billion networked devices by 2023, and that half of those
will be IoT or other machine-to-machine devices [2]. These devices are usually
deployed in environments that leave them heavily resource constrained, whether in size, power
consumption, or some other respect. These constraints can lead designers to compromise in
other areas, most notably security. Given the high level of inter-connectivity between IoT
devices, as well as the importance of some of their tasks, the security of IoT devices should
be a top concern and be compromised as little as possible.
Among the most common, and perhaps most dangerous, attack types that plague IoT
devices is Return-Oriented Programming (ROP). Having been around for many years, ROP
attacks come in many variants to counter various existing mitigations. The main goal of
an ROP attack is to hijack the control flow of a running program, through the alteration of a
function return address on the program stack, such that the attacker can leverage it to gain
control of a system. One of the classic means of accomplishing this is
stack smashing, also known as a buffer overflow. Such attacks have traditionally been used to inject
malicious shellcode into programs, giving the attacker a terminal shell on the compromised
system [14]. A more thorough explanation of how such attacks work is given in Section 2.2.
Mitigations already exist that serve to prevent or increase the difficulty of ROP attacks.
These range from randomizing the program’s address space to separate stack structures
that redundantly store return addresses. More examples, along with specific details regarding
them, are covered in Section 2.3. However, each mitigation has its own downsides. Some
may incur performance overhead. Some may require more area or power than an IoT device
can reasonably provide.
This work seeks to introduce a new secure architecture design that detects and secures
itself from ROP attacks, while minimizing impact on performance, area overhead, and power
consumption. The design seeks to improve on the concept of a hardware return address stack (RAS) similar
to [15], with a few additions to increase security. The proposed design
has been named the Cryptographic Return Address Stack (CRAS), and differs from the conventional
hardware RAS in two main ways. The first is that the CRAS utilizes an encryption engine
that encrypts return addresses that are overflowing to main memory and decrypts them as
they come back. This encryption obfuscates the return addresses while they are in memory,
and provides integrity in that return addresses will not decrypt correctly should they have
been altered, which allows the CRAS to detect an attempted ROP attack. The second is
that the CRAS preemptively stores or loads return addresses to or from memory before the
stack overflows or underflows. In doing this, we can minimize, and in some cases eliminate,
the performance overhead resulting from waiting for the CRAS to handle an overflow or
underflow. The CRAS is also designed with the goal of minimizing modifications to code
and compilers. No compiler changes are required to implement the proposed design. The
only code changes required are to the kernel or bootloader, such that the CRAS can be
provided an encryption key, and so the kernel can handle exceptions in the event of an
attempted ROP attack.
The remainder of this thesis is organized as follows. Chapter 2 provides background
information on the RISC-V architecture that the proposed design is integrated with and protects.
Additionally, this chapter covers concepts of ROP and buffer overflow attacks, details related
works and existing mitigations to these attacks, and lays out the attack model against the
design proposed in this thesis. Chapter 3 describes the proposed design in detail, covering
the RISC-V core, the CRAS itself, and how they are integrated with each
other. Chapter 4 details the resource utilization for the hardware implementation of the
design and how it compares to the same design without the integrated CRAS. Then it
discusses the performance overhead brought on by the CRAS in different configurations
for multiple algorithms. Lastly, it looks at the effectiveness of the CRAS at detecting and
preventing the attacks described in the attack model. Chapter 5 gives concluding remarks,
summarizes the premise and results of the research within this thesis, and proposes avenues
of further study for this work.
Chapter 2
Background and Attack Model
2.1 The RISC-V Instruction Set Architecture
The design proposed in this work conforms to the RISC-V instruction set architecture (ISA).
RISC-V is an ISA developed at UC Berkeley. It has grown in
popularity as an alternative to ARM in embedded and IoT devices because it is open-source,
extensible, and highly configurable. RISC-V is an architecture for reduced instruction set
computers (RISC). Such ISAs are designed with the guiding principle of having fewer, faster,
more generalized instructions, rather than a multitude of complex and specialized ones.
The two base variants are rv32i and rv64i, which are 32-bit and 64-bit architectures,
respectively. The “i” in the name indicates that only integer operations, excluding
multiplication and division, are supported. However, RISC-V supports several extensions
that increase the functionality of the core, including but not limited to multiplication, floating
point arithmetic, and atomic memory operations [18].
2.2 Concepts of Return-Oriented Programming
Return-oriented programming (ROP) is a classification of attacks in which an adversary
will attempt to take control of a program by overwriting function return addresses on the
program stack to point to a new location at which unexpected code is executed. This code
may be injected via the adversary’s attack payload, or more commonly consist of a string of
gadgets. Gadgets are short sequences of instructions ending with a return that already exist
in the program and any shared libraries included. Marking writable sections of memory as
non-executable is a common protection against code injection; however, doing so has no effect
against attacks leveraging gadgets [12]. The attacker effectively treats each overwritten
return address like a function call, allowing for the execution of arbitrary
operations.
One of the most common ways to perform an ROP attack is through a buffer overflow.
In this attack the adversary takes advantage of vulnerable code, typically an unchecked
copy routine such as strcpy(), to write past a local buffer and over the return
address. Figure 2.1 shows the structure of the program stack in memory for a standard
RISC-V program. The top of the figure is the “top” of the stack, at lower addresses in
memory. This is where the most recently pushed data lies. The bottom of the figure is
the “bottom” of the stack, at higher addresses in memory. A specific way that a program
could be vulnerable to buffer overflow is through insecure use of a strcpy()-style function,
where the vulnerability lies in the lack of bounds checking. The variant considered here takes three
arguments: a pointer to a destination string, a pointer to the source string, and an integer
indicating the number of characters to copy, or the length. (The standard C library strcpy() takes
only the first two arguments and stops solely at the terminator; the three-argument form
corresponds to strncpy().) When called, the function does not
stop copying characters to the destination buffer until it reaches the specified length
or encounters a zero or other terminator value; it never checks the size of the destination. An example of such a vulnerable function
is provided in Appendix A.1. Exploiting that function, an attacker could pass a string
larger than the buffer, and a length well past the end of the buffer. In doing so, calling
the function would result in the attacker’s input string writing past the buffer, as well as other
local variables if any, the previous frame pointer, and even the return address. The attacker
could put anything he wants in there, such as malicious shellcode or the address of another
function or gadget to execute.
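The overwrite described above can be illustrated with a toy model. The following is a minimal Python sketch, not the code from Appendix A.1: the frame layout (a 16-byte buffer, then a 4-byte saved frame pointer, then a 4-byte return address), the helper names, and all addresses are assumptions chosen for illustration.

```python
import struct

BUF_SIZE = 16
FP_OFFSET = BUF_SIZE          # saved frame pointer sits just above the buffer
RA_OFFSET = BUF_SIZE + 4      # return address sits above the frame pointer


def make_frame(saved_fp: int, return_addr: int) -> bytearray:
    """Build a frame with an empty buffer and the two saved words."""
    frame = bytearray(BUF_SIZE + 8)
    struct.pack_into("<I", frame, FP_OFFSET, saved_fp)
    struct.pack_into("<I", frame, RA_OFFSET, return_addr)
    return frame


def unchecked_copy(frame: bytearray, src: bytes, length: int) -> None:
    """Copy up to `length` bytes, stopping early only at a zero byte.

    Like the vulnerable function, it never compares `length` to BUF_SIZE.
    """
    for i in range(min(length, len(src))):
        if src[i] == 0:
            break
        frame[i] = src[i]


def read_return_addr(frame: bytearray) -> int:
    return struct.unpack_from("<I", frame, RA_OFFSET)[0]


frame = make_frame(saved_fp=0x7FFFFFF0, return_addr=0x10010400)
# 20 filler bytes cover the buffer and frame pointer; the last 4 bytes are the
# attacker's chosen target (hypothetical address with no zero bytes).
payload = b"A" * 20 + struct.pack("<I", 0x10012344)
unchecked_copy(frame, payload, len(payload))
print(hex(read_return_addr(frame)))
```

Because the copy proceeds from low to high addresses, the filler reaches the saved frame pointer first and the final four payload bytes land exactly on the return address.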
2.3 Mitigations to ROP Attacks and Their Weaknesses
2.3.1 Address Space Layout Randomization
Address Space Layout Randomization (ASLR) seeks to protect control flow by
adding random offsets to the memory layout of the binary executable. This includes the base
address of the executable itself, as well as the positions of libraries, the heap, and the stack.
This randomness in the layout of the binary decreases the likelihood that an attacker will
be able to exploit a buffer overflow to redirect control flow to the attacker’s
intended location. However, ASLR does not completely remove the vulnerability, let alone
prevent buffer overflows from occurring [10]. Because of this, it is not a sufficient means of
preventing ROP attacks on its own. Furthermore, the security of ASLR is heavily dependent
on how much of the address space can be randomized. For a 32-bit system, for example, the
maximum possible entropy for the address space is about 16 bits. A space this small can
easily be brute-forced, compromising the security of the system [12].
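To put the 16-bit figure in perspective, the short Python sketch below estimates the cost of a brute-force search. It assumes the layout stays fixed between attempts and that the attacker never repeats a guess; both are simplifying assumptions, and the function names are illustrative.

```python
def expected_guesses(entropy_bits: int) -> float:
    """Average number of distinct guesses needed to hit one fixed layout
    out of 2**entropy_bits equally likely layouts."""
    n = 2 ** entropy_bits
    return (n + 1) / 2


def success_probability(entropy_bits: int, attempts: int) -> float:
    """Chance that `attempts` distinct guesses include the correct layout."""
    n = 2 ** entropy_bits
    return min(attempts, n) / n


print(expected_guesses(16))             # average attempts for 16 bits of entropy
print(success_probability(16, 16384))   # chance of success after 16384 attempts
```

Under these assumptions, a few tens of thousands of attempts are enough on average, which is well within reach of an automated attack.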
2.3.2 Stack Canaries
Among the more common techniques used in buffer overflow attacks is what is often called
a nop slide. The attacker pads the payload with a large number of repeated nop machine
instructions. Everything in the buffer is overwritten with the nops, as well as whatever might
be in between the buffer and the return address within the stack. This way the attacker can
ensure his payload is injected into the right place in the stack. Such overwrites can typically
be detected with the implementation of stack canaries. Named after the canaries once used in coal mines,
they are bytes of data placed between the local variables on the stack and the return address. At
the end of a function, the canaries are checked to see if their values have changed, indicating whether
a stack overflow was attempted [8]. StackGuard is one of the most well-known systems that
implements stack canaries, among other mitigations [5].
With stack canaries in place, it is more difficult to exploit a function such as strcpy()
in order to overwrite the return address. However, while the canary protects the return
address, local variables and function arguments may be vulnerable. Using the code snippet
in Figure 2.2, an attacker can write further past the return address to overwrite the local
variable msg, creating a write-anything-where condition [17].
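The canary check itself can be sketched in a few lines of Python. This is a toy model under assumed conditions, not StackGuard's implementation: the frame layout (buffer, canary, saved frame pointer, return address), the class name, and the addresses are all hypothetical.

```python
import secrets
import struct
from typing import Optional

BUF_SIZE = 16
CANARY_OFFSET = BUF_SIZE          # canary sits just above the buffer
RA_OFFSET = BUF_SIZE + 8          # after the canary and saved frame pointer


class Frame:
    def __init__(self, return_addr: int, canary: Optional[int] = None):
        # A real canary is random per run; a fixed one can be passed for tests.
        self.canary = secrets.randbits(32) if canary is None else canary
        self.mem = bytearray(BUF_SIZE + 12)
        struct.pack_into("<I", self.mem, CANARY_OFFSET, self.canary)
        struct.pack_into("<I", self.mem, RA_OFFSET, return_addr)

    def write_buffer(self, data: bytes) -> None:
        """Unchecked write starting at the base of the buffer (the bug)."""
        self.mem[0:len(data)] = data

    def canary_intact(self) -> bool:
        """Check performed at the end of the function, before returning."""
        stored = struct.unpack_from("<I", self.mem, CANARY_OFFSET)[0]
        return stored == self.canary


frame = Frame(return_addr=0x10010400, canary=0xDEADBEEF)
frame.write_buffer(b"A" * BUF_SIZE)                  # fits: canary untouched
ok_before = frame.canary_intact()
frame.write_buffer(b"A" * 24 + struct.pack("<I", 0x10012344))
ok_after = frame.canary_intact()                     # canary was overwritten
```

A contiguous overflow cannot reach the return address without passing through the canary, which is why the second check fails. Writes that land elsewhere, as in the case described above, are not covered by this check.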
2.3.3 Memory Protections
Return-oriented programming attacks can also be prevented through the use of memory
protections. Stack canaries are a form of memory protection; however, they specifically
protect and validate the return address and frame pointer that are stored on the stack.
Memory protections extend to preventing data leakage and use-after-free scenarios by
protecting all pointers, including the return address stored on the stack. Multiple works
such as [6] achieve memory safety through the use of a concept referred to as “fat pointers”.
These fat pointers contain the original memory pointer, as well as three other 32-bit fields.
In total, each fat pointer is 128 bits. Other works that use the fat pointer concept have even
larger pointers [6]. Replacing every pointer with one that is two to four times larger can
quickly ramp up the memory cost of running a program. In a resource-constrained system
such as an IoT device, memory can be a limited and precious resource, making such memory
protections impractical.
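The memory cost is simple arithmetic, shown here as a small Python sketch. The pointer count of 1000 is an arbitrary illustrative figure, not a measurement from any cited work.

```python
PLAIN_POINTER_BITS = 32
FAT_POINTER_BITS = 4 * 32     # original pointer plus three 32-bit fields


def pointer_memory_bytes(num_pointers: int, bits_per_pointer: int) -> int:
    """Bytes needed to store `num_pointers` pointers of the given width."""
    return num_pointers * bits_per_pointer // 8


plain = pointer_memory_bytes(1000, PLAIN_POINTER_BITS)
fat = pointer_memory_bytes(1000, FAT_POINTER_BITS)
print(plain, fat, fat // plain)
```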
2.3.4 Flow Control Monitoring
Flow control monitoring ensures that control flow integrity remains intact by statically
computing a valid control flow graph of the program at compile time, then using a
coprocessor to ensure that the program abides by the control flow graph [7].
The shortcomings of this form of mitigation lie in several aspects. To start with, many
control flow integrity solutions have high performance overheads. Chip area is another big
compromise, as room must be made for the coprocessor, reducing the appeal of
this kind of ROP mitigation for IoT devices. Furthermore, a large number of them have high energy
overhead, which is also unappealing for resource-constrained IoT devices [7].
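The check performed by such a monitor can be sketched as a lookup against a precomputed edge set. The Python below is only an illustration of the idea: the graph, the addresses, and the function name are hypothetical, and a real coprocessor would perform the equivalent lookup in hardware.

```python
# Hypothetical control flow graph: each key is the address of a call or
# return instruction; each value is the set of targets found valid for it
# by static analysis at compile time.
VALID_EDGES = {
    0x100: {0x200},    # call site in main -> entry of helper
    0x220: {0x104},    # return in helper -> instruction after the call
}


def check_transfer(source: int, target: int) -> bool:
    """Return True when the observed transfer is an edge of the graph."""
    return target in VALID_EDGES.get(source, set())


print(check_transfer(0x220, 0x104))   # expected return
print(check_transfer(0x220, 0x300))   # return redirected by an attacker
```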
2.3.5 Return Address Stack
The concept of a return address stack can be implemented in multiple ways. One possible
method is to store return addresses in a separate “control stack” in lieu of storing them in the
data stack [19]. Temporary data and other information normally stored in the data stack
remain there. By doing this, the return address is not in a location that can be reached
with a buffer overflow. This solution is able to protect the system from buffer overflows, but
cannot detect when such attacks are attempted. A further downside of this solution
is that it requires modifications not only to the compiler, but to the machine instructions as well. Any
preexisting code that must be run on the system will need to be recompiled to be able to
run. In addition, compiler-based implementations that use this split stack approach can have
performance overheads varying between 2% and 24% [19]. This is an acceptable amount of
performance overhead, but typically less is preferred.
The RAS implementation method most similar to this work is to utilize a hardware stack
for redundantly storing return addresses. Works such as [15] and [4] implement a hardware
stack that pushes in the return address of each function as it is called. When the control
flow of the program leads it to return from a function, the most recently pushed return
address is popped from the top of the RAS, and compared to the address being returned to.
An attempted ROP attack would create a discrepancy between the two. The RAS detects
this difference, and therefore detects the attempted attack, stopping it from continuing.
Because the RAS is implemented in its own hardware, it can normally operate in parallel
with the processor, causing little to no performance overhead. This can depend on the
program being run, however. The main downside to a hardware stack is that it has a static,
sometimes limited size. The RAS may eventually overflow, especially during deeply nested
function calls, such as in recursive programming. When this happens, the RAS raises an
exception to the processor, which will trap and empty the contents of the RAS, storing the
data in main memory. Conversely, once the functions eventually begin to
return after such an overflow, the RAS will underflow. As before, the RAS raises an exception to the processor,
which then traps, writing the previously overflown return addresses back into the RAS [15].
Encountering control flows that would cause this behavior can have performance impact, as
the executing program will have to stop and wait for the processor to handle the overflowing
or underflowing RAS before execution can continue.
Figure 2.1: Structure of the program stack.
Figure 2.2: Example function vulnerable to buffer overflow attacks even with a canary.
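The behavior of such a hardware RAS, including the overflow and underflow traps, can be modeled in software. The Python sketch below is a behavioral illustration only, not the design from [15]: the class name, the capacity of four entries, the whole-stack spill policy, and the addresses are assumptions.

```python
class ReturnAddressStack:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = []      # on-chip storage, most recent call last
        self.spilled = []      # blocks moved to main memory by the trap handler
        self.traps = 0         # every trap stalls the running program

    def on_call(self, return_addr: int) -> None:
        if len(self.entries) == self.capacity:
            # Overflow: trap and move the whole stack to main memory.
            self.spilled.append(self.entries)
            self.entries = []
            self.traps += 1
        self.entries.append(return_addr)

    def on_return(self, target: int) -> bool:
        """Return True when the return target matches the stored copy."""
        if not self.entries:
            if not self.spilled:
                return False   # nothing to compare against
            # Underflow: trap and reload the most recently spilled block.
            self.entries = self.spilled.pop()
            self.traps += 1
        return self.entries.pop() == target


ras = ReturnAddressStack(capacity=4)
calls = [0x1000 + 4 * i for i in range(6)]     # six nested calls
for addr in calls:
    ras.on_call(addr)
matches = [ras.on_return(addr) for addr in reversed(calls)]
print(all(matches), ras.traps)
```

Six nested calls against a four-entry stack cost one overflow trap on the way down and one underflow trap on the way back, which is the stall the text describes.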
In addition to the performance cost, this implementation has a potential vulnerability in
that it relies on the security of main memory to protect the overflown return addresses. This
may be sufficient in most cases where secure memory protections have been implemented.
However, if those memory protections were compromised or, worse, nonexistent, the
overflown return addresses would be vulnerable. An attacker would be able to overwrite
a return address while it is overflown in memory, then create a buffer overflow to overwrite
the value in the stack with the same address. Upon returning, the RAS will compare the
return address the attacker modified in memory against the one he modified in the stack.
Seeing no discrepancy, execution will continue, and the attack will succeed. The work in
this thesis seeks to provide an additional layer of security within the hardware to detect and
prevent this sort of attack.
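The bypass can be reproduced on a toy model of an unencrypted RAS. The Python sketch below assumes a two-entry RAS that spills its whole contents on overflow, and an adversary who can already write to both the spill area and the data stack; all names and addresses are hypothetical.

```python
CAPACITY = 2        # hypothetical, tiny on purpose

ras = []            # on-chip return address stack
spill_memory = []   # overflown entries, stored unencrypted in main memory
data_stack = []     # return addresses as saved on the program stack


def call(return_addr):
    if len(ras) == CAPACITY:              # overflow: spill the whole RAS
        spill_memory.extend(ras)
        ras.clear()
    ras.append(return_addr)
    data_stack.append(return_addr)


def ret():
    if not ras:                           # underflow: reload spilled entries
        ras.extend(spill_memory[-CAPACITY:])
        del spill_memory[-CAPACITY:]
    target = data_stack.pop()
    return ras.pop() == target, target


for addr in (0x1000, 0x2000, 0x3000):     # third call forces the overflow
    call(addr)

# The adversary rewrites both copies of the oldest return address.
spill_memory[0] = 0x6660
data_stack[0] = 0x6660

results = [ret() for _ in range(3)]
print(results[-1])
```

Every comparison succeeds, including the last one, so control transfers to the adversary's address without the RAS noticing.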
2.4 Attack Model
The main attack model focused on in this thesis is a buffer overflow attack targeting a system
with a Return Address Stack implemented for maintaining Control Flow Integrity, similar to
[15] and [4]. It is assumed that the adversary has knowledge of where in memory overflown
return addresses from the RAS are stored, and that he has acquired the ability to write to
that memory space. In order to bypass the RAS protecting the system, the adversary
must overwrite both the return address in the data stack and its copy
stored in the RAS. The first may be overwritten in the way of a typical attack, such as using
strcpy(). The latter, however, requires the adversary to exploit deep recursion in order to
overflow the RAS such that the copy of the target return address is transferred into memory.
When in memory, the copy can be overwritten with the same value the adversary overwrote
the original return address with in the data stack. Upon returning from the deep recursion,
the corrupted copy of the return address is put back into the RAS, until eventually the
function returns. Because both the return address and its copy have been overwritten to the




3.1 The RISC-V Core
The RISC-V core present in the design proposed in this thesis makes use of the rv32ic
base and extension of the ISA. It is a 32-bit architecture. The i indicates that integer
operations are supported, with multiplication and division supported through software. The
core also supports compressed instructions, as denoted by the c. The compressed instructions
halve the size of most common operations in order to reduce static and dynamic code size.
Implementation of these instructions can result in a 25%-30% reduction in code size, making
it crucial in memory-constrained embedded and IoT devices [18].
The microarchitecture of the core implements a 5-stage in-order execution pipeline.
Figure 3.1 depicts these stages and their interactions. The stages of this pipeline are
the standard five found in most pipelines: Instruction Fetch, Instruction Decode, Execute,
Memory Access, and Writeback [16]. The Instruction Fetch stage is where the instruction
is read from memory. Next, the Decode stage breaks down the instruction to determine the
opcode, the operands, as well as any immediate values that may be present. The Execute
stage is where the result is calculated for arithmetic operations. For memory loads and
stores, this is where the memory address is calculated for the memory access. The next stage,
Memory Access, is where main memory is accessed. Nothing happens in this stage if the current operation
does not concern memory. Lastly, in the Writeback stage, the calculated values from the
Execute stage, or those retrieved from memory in the Memory Access stage, are written into
the register file.
Figure 3.1: A Block Diagram of the 5-Stage Pipeline Structure.
In order to facilitate communication between the RISC-V core and the user, a UART
receiver and transmitter module is implemented within the core. The UART, or Universal
Asynchronous Receiver/Transmitter, communicates with an attached computer through an
asynchronous serial interface. It allows the core to receive input and send output from
program execution. It can even be used to upload a new program to the core. To allow
the RISC-V core to control the UART module, an interface between the two devices was
implemented in the form of a memory-mapped device. This means that the RISC-V core
will exchange data with the UART module by using the memory load and store operations
in a predetermined address space.
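Memory-mapped access of this kind can be illustrated with a small bus model. In the Python sketch below, the base address, the register offsets, and the class name are hypothetical and are not taken from the core's actual memory map.

```python
UART_BASE = 0xFFFF0000        # hypothetical base address of the UART window
UART_TX = UART_BASE + 0x0     # hypothetical transmit register
UART_RX = UART_BASE + 0x4     # hypothetical receive register


class Bus:
    def __init__(self):
        self.ram = {}                 # sparse main memory
        self.tx_log = bytearray()     # bytes sent over the serial line
        self.rx_queue = []            # bytes waiting to be read by the core

    def store(self, addr: int, value: int) -> None:
        if addr == UART_TX:
            self.tx_log.append(value & 0xFF)   # a store to TX sends one byte
        else:
            self.ram[addr] = value

    def load(self, addr: int) -> int:
        if addr == UART_RX:
            return self.rx_queue.pop(0) if self.rx_queue else 0
        return self.ram.get(addr, 0)


bus = Bus()
for ch in b"Hi":
    bus.store(UART_TX, ch)            # ordinary stores reach the UART
bus.store(0x100, 42)                  # other addresses behave as memory
bus.rx_queue.append(0x41)
received = bus.load(UART_RX)          # ordinary load reads the received byte
```

The core needs no dedicated I/O instructions under this scheme; the address alone decides whether a load or store reaches memory or the UART.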
3.2 The Cryptographic Return Address Stack
The core of this work is the Cryptographic Return Address Stack, or CRAS. Referring to
Figure 3.2, there are three main components that make up the whole of the CRAS: the
return address stack, the encryption engine, and the control module.
3.2.1 The Return Address Stack
The Return Address Stack, or RAS, is a hardware stack that redundantly stores return
addresses with each function call. Its base function is very similar to such works as [15], but
with a few additions to its functionality. As with a typical RAS, this RAS stores the return address
of a function call whenever the core calls a function. When the core returns from a function,
the most recently stored return address is popped from the top of the RAS and compared
to the address being returned to. If they match, the control flow is valid and execution
continues as normal. On a mismatch, however, a signal is generated indicating that either
or both of the return addresses are invalid. This signal might trigger an exception or an
interrupt for the kernel on the processor to handle as deemed necessary by the programmer.
Implementing an RAS in hardware has its own unique downsides. Specifically,
the size of the stack is static. More memory cannot be allocated to make a hardware
stack larger. As such, the possibility of overflowing the RAS exists. Such cases
are unusual, especially when the hardware stack is large, but overflow becomes probable with
deeply nested function calls, such as recursion. When these overflows occur, the data in the
RAS is emptied and stored in main memory, leading to wasted time while the core waits
on the RAS to empty itself for more function calls. Furthermore, underflows resulting from
bringing return addresses back into the RAS from memory also result in hang time that
impacts the performance of the CPU.
Figure 3.2: A Block Diagram of the CRAS Module.
To help counter this performance overhead, the RAS in the design preemptively handles
overflows and underflows. To do this, the RAS monitors its own capacity level so that it
can attempt to keep the stack from overflowing or underflowing. If the stack gets close to full,
the RAS will begin removing return addresses from the bottom, encrypting them, and
storing them in memory. If the stack gets close to empty, then encrypted return addresses, if any
are available, are read from memory, decrypted, and inserted back at the bottom of the stack.
This preemptive handling works to ensure that the RAS is neither too full
nor too empty in order to minimize stalling due to overflow or underflow. Details on how
performance is impacted are discussed in Section 4.2.
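The preemptive policy can be modeled as a double-ended stack with two watermarks. The Python sketch below is a behavioral illustration under stated assumptions: the capacity and thresholds are arbitrary, one entry is moved per background step, and encryption is omitted. It is not the SystemVerilog design from Appendix C.

```python
from collections import deque


class PreemptiveRAS:
    def __init__(self, capacity=8, high=6, low=2):
        self.capacity, self.high, self.low = capacity, high, low
        self.stack = deque()   # left end = bottom (oldest), right end = top
        self.memory = []       # spilled entries, oldest first
        self.stalls = 0        # times the core had to wait for the RAS

    def background_step(self):
        """Work done in parallel with the core, one entry per step."""
        if len(self.stack) >= self.high:
            self.memory.append(self.stack.popleft())     # spill from the bottom
        elif len(self.stack) <= self.low and self.memory:
            self.stack.appendleft(self.memory.pop())     # refill at the bottom

    def push(self, addr):
        if len(self.stack) == self.capacity:   # preemption fell behind
            self.stalls += 1
            self.memory.append(self.stack.popleft())
        self.stack.append(addr)
        self.background_step()

    def pop(self):
        if not self.stack and self.memory:     # preemption fell behind
            self.stalls += 1
            self.stack.appendleft(self.memory.pop())
        addr = self.stack.pop()
        self.background_step()
        return addr


ras = PreemptiveRAS()
for addr in range(20):                 # 20 nested calls, deeper than capacity
    ras.push(addr)
popped = [ras.pop() for _ in range(20)]
print(ras.stalls)
```

In this model the background step always keeps pace with a single stream of calls followed by returns, so the stall counter never increments even though the call depth exceeds the capacity. Real workloads with memory contention or bursts of calls may behave differently.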
3.2.2 The Encryption Engine
After some research into multiple encryption algorithms, the SIMON64/128 algorithm was chosen
for the encryption engine within the design. SIMON is a lightweight
symmetric block cipher, similar to AES. However, SIMON and its multiple configurations
were developed and optimized specifically for hardware implementations. It provides higher
performance and more area efficiency than hardware-implemented AES or other symmetric
ciphers, according to its creators [3]. SIMON may be configured in multiple ways. The
specific configuration is indicated by the two numbers following SIMON. In the case of this
design, the configuration using a 64-bit block size and 128-bit key size was implemented. It
may seem sensible to use a 32-bit block for the cipher to coincide with the return address
length, but 64-bit blocks are used instead to increase diffusion between addresses in
order to increase security.
SIMON Block Cipher
SIMON is a round-based cipher that encrypts a block iteratively over multiple clock cycles.
Each round operates on one 64-bit block, which is divided into two 32-bit words. Figure 3.3
comes from the creators of the cipher and shows the breakdown of one SIMON round. In this
diagram, x_i and x_{i+1} are respectively the lower and upper words of the block. The
operation S^n indicates a left circular shift by n bits. Lastly, k_i is the round key. A key
schedule generates one 32-bit round key for each round of encryption, starting from the
supplied encryption key. The number of rounds depends on the SIMON configuration; the
SIMON64/128 used in the proposed design has 44 rounds.
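To make the round structure concrete, the following Python sketch models the SIMON64/128
round function and key schedule in software. It is a sketch written from the published
SIMON specification [3], not the hardware engine of this design; the function names are
illustrative, and the code should be checked against the published test vectors before any
real use.

```python
MASK = 0xFFFFFFFF   # 32-bit words: SIMON64/128 splits a 64-bit block into two words
ROUNDS = 44
# Constant sequence z3 from the SIMON specification (62 bits, index 0 first).
Z3 = "11011011101011000110010111100000010010001010011100110100001111"

def rol(x, r):
    """Left circular shift of a 32-bit word (the S^n operation)."""
    return ((x << r) | (x >> (32 - r))) & MASK

def ror(x, r):
    """Right circular shift of a 32-bit word."""
    return ((x >> r) | (x << (32 - r))) & MASK

def f(x):
    """Nonlinear round function applied to the upper word."""
    return (rol(x, 1) & rol(x, 8)) ^ rol(x, 2)

def expand_key(key_words):
    """Expand the four key words [k0, k1, k2, k3] into 44 round keys."""
    k = list(key_words)
    for i in range(4, ROUNDS):
        tmp = ror(k[i - 1], 3) ^ k[i - 3]
        tmp ^= ror(tmp, 1)
        z_bit = int(Z3[(i - 4) % 62])
        k.append((~k[i - 4] & MASK) ^ tmp ^ z_bit ^ 3)
    return k

def encrypt(upper, lower, round_keys):
    """Apply all rounds: the new upper word is lower ^ f(upper) ^ k_i."""
    for rk in round_keys:
        upper, lower = lower ^ f(upper) ^ rk, upper
    return upper, lower

def decrypt(upper, lower, round_keys):
    """Undo the rounds in reverse order with the same round keys."""
    for rk in reversed(round_keys):
        upper, lower = lower, upper ^ f(lower) ^ rk
    return upper, lower
```

In this model, the two words of a block would be two 32-bit return addresses taken from
the bottom of the RAS. A hardware engine that computes one round per clock cycle would
need 44 cycles per block, although that cycle count is an assumption of this sketch and
not a figure reported for the design.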
3.2.3 The Control Module
The control module facilitates communication between the other two modules within the
CRAS. Additionally, it provides and utilizes the interface to main memory for storing or
loading the encrypted return addresses. It also communicates with the RISC-V core, such
that it can be activated or deactivated, or provided a key by the kernel. No modifications
to normal program code will be required. However, the bootloader or the kernel must be
updated in order to perform these interactions. A few lines of code will suffice for facilitating
these interactions. The finite state machine for the control flow is shown in Figure 3.4.
3.3 Integration with the RISC-V Core
In order to effectively secure the RISC-V core, the CRAS needs to be able to communicate
with it. It needs to be able to see when the core is jumping or returning, as well as the return
addresses associated with each. In addition, it needs to be able to signal to the RISC-V core
when it detects an unexpected change in control flow. Within the design of the system, these
interactions are facilitated through SystemVerilog interfaces. This allows each component,
the memory, the core, and the CRAS, to be able to interact with each other and only see
the data and signals necessary for their own operation.
Figure 3.3: Diagram of the SIMON Round Function.
Figure 3.4: CRAS Control State Machine
Chapter 4
Performance and Security Analysis
4.1 Hardware Implementation
In order to test the proposed design, it was implemented on a field-programmable gate array
(FPGA). An FPGA consists of a large array of lookup tables and flip-flops that can be
configured to behave as a designed digital system, which makes FPGAs strong tools for rapid
RTL prototyping. The entire design was created from scratch in-house to conform to the
RISC-V specifications, primarily using SystemVerilog. Some VHDL was used as well,
specifically for the UART module. The hardware description code was synthesized into a
layout, which was then encoded into a bitstream and uploaded to the FPGA to program it with
the CRAS and the RISC-V core. The platform for the proposed design is a Digilent Nexys4
board, which houses a Xilinx Artix-7 FPGA. Table 4.1 shows the implementation results of the
proposed design.
4.1.1 Resource Utilization
When implemented alongside the RISC-V core, the CRAS increases LUT utilization by 193%
(from 3912 to 11468 LUTs post-implementation) and flip-flop utilization by about 80% (from
3413 to 6130). Most of this increase results from the encryption engine, as the large
64/128 version of SIMON is used. Reducing this area overhead by instead implementing
SIMON's software-optimized sister algorithm, Speck [3], was considered. This would reduce
the area of the CRAS, but the RISC-V core itself would then have to perform the encryption.
The CRAS could no longer run in parallel with the core, so preemptive handling of overflows
and underflows would yield no performance benefit. Furthermore, the RISC-V core is very
small, using only a barebones RISC-V implementation. Relative overhead would be smaller
for larger processor cores.

Table 4.1: Hardware Implementation Results of the Proposed RISC-V Core
                         Resource Utilization   Estimated Power Usage (W)
Instance   Stage         LUT       FF           Dynamic   Static   Total
Normal     Post-Synth    3908      3411         0.145     0.101    0.246
Normal     Post-Impl     3912      3413         0.145     0.101    0.246
CRAS       Post-Synth    11494     6128         0.174     0.101    0.275
CRAS       Post-Impl     11468     6130         0.174     0.101    0.275
4.1.2 Power Consumption
The CRAS consumes little power and therefore only slightly increases power consumption.
Estimated dynamic power increases by only 20% (0.145 W to 0.174 W), and because static
power is unchanged, total power increases by about 12% (0.246 W to 0.275 W). This makes the
CRAS well suited to securing devices that operate under very tight power constraints.
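The relative overheads quoted for resources and power follow directly from the
post-implementation rows of Table 4.1. The short Python sketch below reproduces that
arithmetic; the helper name percent_increase and the dictionary keys are illustrative
only.

```python
def percent_increase(baseline, with_cras):
    """Relative increase of a metric, in percent of the baseline value."""
    return (with_cras - baseline) / baseline * 100.0

# Post-implementation figures taken from Table 4.1.
normal = {"lut": 3912, "ff": 3413, "dynamic_w": 0.145, "total_w": 0.246}
cras = {"lut": 11468, "ff": 6130, "dynamic_w": 0.174, "total_w": 0.275}

overhead = {
    name: round(percent_increase(normal[name], cras[name]), 1)
    for name in normal
}

for name, pct in overhead.items():
    print(f"{name}: +{pct}%")
```

Because static power is identical in both builds, the whole increase in total power comes
from the dynamic component, which is why the total grows by a smaller percentage than the
dynamic power alone.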
4.2 Performance Results
In this section, the impact of the CRAS on the performance of the processor is tested
and analyzed. Because the CRAS operates in parallel with the core, a performance impact
only manifests during recursion or other deeply nested calls. Therefore, several recursive
algorithms serve as the basis on which performance is tested. Performance is measured
as the relative increase in execution time between a system with the CRAS enabled and the
same system without it. This is repeated for two configurations of the CRAS. The first
recursive algorithm tested is quicksort [11]. The recursive calculation of triangular
numbers [13] is the second benchmark, and another recursive sorting algorithm, mergesort,
is the third. These algorithms were chosen because they make it easy to create deep
recursion, allowing performance overhead to manifest.
Performance for the quicksort algorithm was tested on randomly generated arrays of
varying size. To find the worst-case performance, the worst-case input arrays of various
sizes were also tested. For this quicksort implementation, that means sorting an array
that is in reverse order. In addition, each test is run using a CRAS with a 32-address RAS
and a CRAS with a 64-address RAS.
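The reason a reverse-ordered array is the worst case for the RAS can be seen by counting
how deeply the recursive calls nest. The Python sketch below mirrors the last-element-pivot
quicksort listed in the appendix and reports the deepest nesting reached; the helper names
quicksort_depth and max_call_depth are hypothetical, and the model counts only the
recursive quicksort calls, not the short-lived calls to the partition and swap helpers.

```python
import random

def partition(arr, low, high):
    """Lomuto partition with the last element as the pivot."""
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1

def quicksort_depth(arr, low, high, depth=1):
    """Sort arr[low..high] in place and return the deepest call nesting reached."""
    if low >= high:
        return depth
    pi = partition(arr, low, high)
    left = quicksort_depth(arr, low, pi - 1, depth + 1)
    right = quicksort_depth(arr, pi + 1, high, depth + 1)
    return max(left, right)

def max_call_depth(values):
    """Run quicksort on a copy of values and report the nesting depth."""
    data = list(values)
    return quicksort_depth(data, 0, len(data) - 1)

reversed_depth = max_call_depth(range(128, 0, -1))
shuffled = list(range(128))
random.Random(0).shuffle(shuffled)
random_depth = max_call_depth(shuffled)
print(reversed_depth, random_depth)
```

On a reverse-ordered array every partition peels off a single element, so the nesting
depth grows linearly with the array size and soon exceeds a 32-entry or 64-entry RAS,
while a shuffled array of the same size nests far less deeply.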
Figure 4.1 depicts the relative increase in runtime, in percent, compared to running the
algorithm with the CRAS disabled. An increase of zero indicates that there was no
performance overhead: the algorithm executed in the same amount of time with and without
the CRAS. For both RAS sizes, the worst-case performance overhead was minimal. The worst
increase in execution time for this algorithm was about 1% (1.05% in Table 4.2). Each
configuration's worst overhead coincided with sorting arrays twice the size of its RAS.
When testing performance on randomly generated arrays, no performance impact was measured
for either configuration.
Next, performance is evaluated for triangular number calculation. Figure 4.2 shows
the relative increase in runtime for each configuration. Unfortunately, overhead
increases dramatically for higher triangular numbers. This overhead differs from that of
quicksort due to the nature of the algorithm. With quicksort, the CRAS is able to make
room in the RAS before space is needed and to bring return addresses back before they are
needed. The calculation of triangular numbers recurses much more quickly, and the CRAS
has difficulty keeping up.
The performance of the CRAS was also tested using the recursive mergesort algorithm.
As before, relative runtime overhead for the 32-address and 64-address CRAS configurations
is shown in Figure 4.3. Worst-case overhead for this algorithm is again very small, just
over 3%. Although small, the trends shown in these results are nonetheless interesting.
The 64-address configuration performed worse for smaller arrays, contrary to the trend set
by the previous two tests. In addition, while previous results showed that RAS size has
less effect on performance as recursion deepens, for this algorithm it has no effect on
performance overhead for deeper recursion. Furthermore, performance overhead decreases
overall as recursion deepens, much like with quicksort.
The Mergesort algorithm was also tested to find the average performance overhead for
random arrays of varying size. The results are shown in Figure 4.4. Performance overhead
on random lists is virtually nonexistent. For larger arrays, it is nonexistent on average.
Figure 4.1: Comparison of the worst-case relative runtime overhead of the Quicksort
algorithm for a CRAS with a 32-address capacity and a CRAS with a 64-address capacity.
Figure 4.2: Comparison of the worst-case relative runtime overhead of calculating the Nth
triangular number for a CRAS with a 32-address capacity and a CRAS with a 64-address
capacity.
Figure 4.3: Comparison of the worst-case relative runtime overhead of the Mergesort
algorithm.
Figure 4.4: Runtime overhead of the CRAS for Mergesort on Random Lists
The results of all previous testing are summarized in Table 4.2. Based on the evaluation
of the CRAS, performance overhead depends heavily on the algorithm, specifically on how
fast and how deep the recursion is. Algorithms like the triangular number algorithm
recurse very quickly, and the CRAS struggles to keep up. For sorting algorithms and
programs that recurse at a similar rate, performance overhead is very small, because the
CRAS is capable of making room in the RAS before it is needed. Programs that do not recurse
at all will have no overhead, save for a few very specific scenarios. One such case is a
program that makes enough nested function calls to overflow the RAS and then returns at
a rate the CRAS cannot keep up with. Overall, the performance impact is largely minimal,
and the CRAS shows promise in securing IoT devices without compromising too much
performance.
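The dependence of overhead on how quickly a program recurses can be illustrated with a toy
cycle-level model. The Python sketch below is purely illustrative: the function name
stall_cycles and all cycle counts are assumptions chosen for the example, not measurements
of the design. It assumes that the spill engine removes two addresses from the bottom of
the RAS and is then busy for a fixed number of cycles, while the core attempts one nested
call every call_interval cycles.

```python
def stall_cycles(n_calls, call_interval, spill_cycles, depth=64, fill_thresh=48):
    """Count the cycles a core waits for RAS space during n_calls nested calls."""
    count = 0        # return addresses currently held in the RAS
    busy = 0         # cycles left on the spill currently in progress
    since_call = 0   # cycles since the last call was accepted
    made = 0
    stalls = 0
    while made < n_calls:
        # Background spill engine, working in parallel with the core.
        if busy > 0:
            busy -= 1
        elif count >= fill_thresh:
            count -= 2            # two addresses leave from the bottom
            busy = spill_cycles   # encryption and memory writes (assumed cost)
        # Core: one call every call_interval cycles, if the RAS has room.
        since_call += 1
        if since_call >= call_interval:
            if count < depth:
                count += 1
                made += 1
                since_call = 0
            else:
                stalls += 1
    return stalls

slow_recursion = stall_cycles(1000, call_interval=100, spill_cycles=50)
fast_recursion = stall_cycles(1000, call_interval=2, spill_cycles=50)
print(slow_recursion, fast_recursion)
```

When many cycles of useful work separate consecutive calls, as with the partitioning work
in a sorting algorithm, the spill engine always finishes before space runs out. When calls
arrive every few cycles, as with the triangular number recursion, the RAS fills faster
than it can be emptied and the core has to wait.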
4.3 Security Analysis
To demonstrate the security of the proposed design, its ability to detect attempted
control flow hijacking via buffer overflow is compared with that of a system with no
security and a system with an unencrypted RAS. For these analyses, two assumptions are
made. First, it is assumed that the attacker knows where in memory return addresses that
overflow from the RAS are stored. Second, it is assumed that the attacker has acquired
write privileges to this memory space. Two kinds of attacks are implemented for analysis.
The first attack model is a basic buffer overflow. The adversary leverages a poor
implementation of strcpy() without proper bounds checking. The pseudocode for the
vulnerable function is shown in Algorithm B.1. The second model builds upon the first. The
adversary may employ strcpy() as before to overwrite the target return address on the
program stack. Additionally, exploiting recursion, the adversary may nest function calls to
force the RAS to spill overflowed return addresses, including the copy of the target one,
into main memory. With the previously stated assumptions, the attacker can overwrite the
copy of the target return address to match the payload injected into the program stack.
As the recursion returns, the RAS underflows and the modified return address is brought
back into hardware. When the target function returns, the RAS compares its copy of the
return address to the address being returned to. Since the attacker modified both values to
be the same, the RAS considers the control flow valid and the attack succeeds. Algorithm B.2
describes the pseudocode for this attack.

Table 4.2: Runtime Overhead of Algorithms
Algorithm               Worst-Case Overhead   Best-Case Overhead
Quicksort (Reversed)    1.05%                 0.00%
Quicksort (Random)      0.00%                 0.00%
Triangular              175%                  0.00%
Mergesort (Reversed)    3.29%                 0.00%
Mergesort (Random)      0.00041%              0.00%
The system without protections was analyzed first and was compromised by both attacks.
Without a means of ensuring control-flow integrity, both attacks succeed without
resistance. The system with a regular RAS successfully defended itself from the first
attack. Its success is due to the RAS keeping copies of return addresses to monitor control
flow. By exploiting recursion, however, the adversary succeeded with the second attack by
modifying the copy of the target return address while it was in memory to match the payload
on the program stack. Despite the attacker's alterations, the RAS treated the control flow
as valid and allowed execution to continue uninterrupted.
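The difference that encryption makes to the second attack can be illustrated with a small
software model of the comparison performed on return. The Python sketch below is an
illustration under loud assumptions: a 4-round Feistel permutation built on SHA-256 stands
in for the cipher, so it is not SIMON and not the CRAS hardware, and the addresses, the
all-zero key, and the function names are hypothetical.

```python
import hashlib

def _f(key, rnd, half):
    """Keyed round function of the stand-in cipher (not SIMON)."""
    digest = hashlib.sha256(key + bytes([rnd]) + half.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big")

def toy_encrypt(key, left, right, rounds=4):
    """Encrypt one 64-bit block held as two 32-bit words."""
    for rnd in range(rounds):
        left, right = right, left ^ _f(key, rnd, right)
    return left, right

def toy_decrypt(key, left, right, rounds=4):
    """Invert toy_encrypt with the same key."""
    for rnd in reversed(range(rounds)):
        left, right = right ^ _f(key, rnd, left), left
    return left, right

def return_check(spilled_pair, stack_return_addr, key=None):
    """Refill the RAS from memory, then compare with the address being returned to."""
    if key is not None:
        spilled_pair = toy_decrypt(key, *spilled_pair)
    ras_copy = spilled_pair[0]
    return "allowed" if ras_copy == stack_return_addr else "mismatch"

LEGIT, OTHER, EVIL = 0x80001234, 0x80000F00, 0x80004000  # hypothetical addresses
KEY = bytes(16)  # placeholder 128-bit key, for illustration only

# The attacker overwrites both the program stack and the spilled copy with EVIL.
unencrypted = return_check((EVIL, OTHER), EVIL)
encrypted = return_check((EVIL, OTHER), EVIL, key=KEY)
# Without tampering, the encrypted copy round-trips and a normal return passes.
untampered = return_check(toy_encrypt(KEY, LEGIT, OTHER), LEGIT, key=KEY)
print(unencrypted, encrypted, untampered)
```

With an unencrypted spill area, the value the attacker writes is exactly the value the RAS
later compares against. With an encrypted spill area, the written value is passed through
decryption first, so without the key the attacker cannot choose what the RAS will see.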
The work proposed within this thesis, utilizing a CRAS, is secure from both attacks. The
return address stack ensures control flow integrity by providing redundant return addresses,
preventing the first attack. The second attack failed because the adversary would need to
know the encryption key used by the CRAS to protect the return addresses. When the
attacker attempted to write over the encrypted return address, it came back into the RAS as
a seemingly random value that did not match the return address the attacker attempted to




Conclusions and Future Work
5.1 Conclusions
This thesis has introduced a unique architecture leveraging cryptography in tandem with a
hardware return address stack in order to provide an environment secure from ROP attacks.
The layer of security provided by the cryptographic engine increases the overall security of
the system relative to similar works. In addition, this work has validated and tested the
design in hardware on an FPGA, showing that the low power consumption of the CRAS makes it
a promising solution for power-constrained devices. The performance overhead resulting from
the introduction of the CRAS was evaluated as well for scenarios that trigger stalling of
the CRAS. The results indicate that this overhead is largely minimal, with the exception of
specific cases involving fast recursion. Furthermore, it was observed that the design
proposed within this thesis secures the control flow of a program running on the system
with minimal compromise to power or performance, making




5.2.1 Further Exploration Into the Optimization of Preemptive
Stack Handling
The largest contributing factor in performance overhead is the downtime when the processor
has to wait on the CRAS to have room for more function calls, or to bring in previous return
addresses. In algorithms that recurse quickly, like the triangular number calculation, this
is especially apparent, causing a massive performance penalty. Performance has already been
improved by the inclusion of preemptive stack handling, where the CRAS attempts to ensure
there is always more room in the stack and that it is never empty. Currently, the CRAS
starts making room whenever the stack is at least 75% full. Conversely, it starts bringing
in and decrypting return addresses, if any are available in memory, when the stack is at
less than 50% capacity. Exploring different thresholds could further optimize performance.
Furthermore, different thresholds are likely to perform better for different algorithms and
programs. It stands to reason that dynamically adaptive thresholds, which speculate and
adapt much like a branch predictor does in a pipelined architecture, could improve
performance further.
5.2.2 Implementing Support for Context Switching
Multiprocessing has been used to increase the throughput of computers for many years. Many
devices take advantage of multiprocessing to make up for shortcomings in single-process
speed. Unfortunately, multiprocessing and context switching are not currently supported
by the CRAS. Resource-constrained systems may not have the capability to run multiple
processes on one core. As such, context switching is not necessarily present in embedded and
IoT devices, though this typically depends on the purpose of the device. The CRAS is still
fully capable of securing single-process systems. However, implementing support for context
switching is a good next direction for the CRAS. The largest challenge in supporting
context switching is overcoming the inevitable performance overhead of switching. When the
core switches to a new process, the return addresses from the previous context stored within
the CRAS will need to be encrypted and stored in memory. Predicting context switches and
preparing accordingly does not seem to be a particularly viable solution. A more promising
approach may be to pipeline the process so that the processor does not have to wait for the
CRAS to empty and encrypt its stack before starting the new process. Other solutions may
yet present themselves as well. That, combined with the definite need for context
switching, makes it a priority in future work.
5.2.3 Exploration into alternatives to SIMON
While SIMON64/128 is the encryption engine implemented in the CRAS design, many other
options are available. Depending on what is desired from the cipher, each has its benefits
and downsides. As stated earlier, other SIMON configurations are possible, but they have
not yet been tested with the CRAS. Configurations with fewer rounds may boost performance.
However, all configurations with fewer rounds also come with a smaller block size or key
size, and sometimes both [3]. Conversely, there are larger configurations with more rounds.
Decreasing the block size would reduce the overall security of the system, and among the
available configurations, SIMON64/128 seemed the best compromise between security and
performance.
Speck was also considered as the encryption standard. Speck is SIMON’s sister algorithm
optimized for software rather than hardware [3]. To use Speck, the design would require
either that the RISC-V core performs the encryption, or that a co-processor is included
within the CRAS to perform the encryption. Implementing the first of these options would
require giving up the ability to perform the encryption in parallel, and would incur a massive
performance overhead. The latter of these, however, has not been tested, and warrants
further study.
Outside of SIMON and Speck, AES is a very popular option. It is an encryption standard
used in many scenarios, and is considered to be very secure [9]. However, as stated before,
SIMON is optimized for hardware, and takes up significantly less space than AES. Resource
constrained systems would therefore find SIMON a more attractive option. However, for





[1] (2012). Microgadgets: Size does matter in Turing-complete return-oriented programming.
In 6th USENIX Workshop on Offensive Technologies (WOOT 12), Bellevue, WA. USENIX
Association.
[2] (2020). Cisco annual internet report (2018–2023) white paper.
[3] Beaulieu, R., Shors, D., Smith, J., Treatman-Clark, S., Weeks, B., and Wingers, L.
(2013). The SIMON and SPECK families of lightweight block ciphers. Cryptology ePrint
Archive, Report 2013/404. https://eprint.iacr.org/2013/404.
[4] Bresch, C., Hely, D., Papadimitriou, A., Michelet-Gignoux, A., Amato, L., and Meyer, T.
(2018). Stack redundancy to thwart return oriented programming in embedded systems.
IEEE Embedded Systems Letters, 10(3):87–90.
[5] Cowan, C., Pu, C., Maier, D., Hinton, H., Walpole, J., Bakke, P., Beattie, S., Grier,
A., Wagle, P., and Zhang, Q. (1998). StackGuard: Automatic adaptive detection and
prevention of buffer-overflow attacks. In Proceedings of the 7th Conference on USENIX
Security Symposium - Volume 7, SSYM'98, page 5, USA. USENIX Association.
[6] Das, S., Unnithan, R. H., Menon, A., Rebeiro, C., and Veezhinathan, K. (2019).
Shakti-MS: A RISC-V processor for memory safety in C. In Proceedings of the 20th ACM
SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for
Embedded Systems, LCTES 2019, pages 19–32, New York, NY, USA. Association for
Computing Machinery.
[7] De, A., Basu, A., Ghosh, S., and Jaeger, T. (2019). FIXER: Flow integrity extensions
for embedded RISC-V. In 2019 Design, Automation & Test in Europe Conference & Exhibition
(DATE), pages 348–353.
[8] Dowd, M., McDonald, J., and Schuh, J. (2007). The art of software security assessment:
identifying and preventing software vulnerabilities. Addison-Wesley.
[9] Dworkin, M., Barker, E., Nechvatal, J., Foti, J., Bassham, L., Roback, E., and Dray, J.
(2001). Advanced Encryption Standard (AES).
[10] Ganz, J. and Peisert, S. (2017). ASLR: How robust is the randomness? In 2017 IEEE
Cybersecurity Development (SecDev), pages 34–41.
[11] Graham, S. and Rivest, R. (1978). Quicksort programs. Communications.
[12] Gu, G. and Shacham, H. (2020). Return-oriented programming in RISC-V.
[13] Hoggatt Jr, V. E. and Bicknell, M. (1974). Triangular numbers. Fibonacci Quarterly,
12(3):221–230.
[14] One, A. (1996). Smashing the stack for fun and profit. Phrack, 7(49).
[15] Ozdoganoglu, H., Vijaykumar, T., Brodley, C., Kuperman, B., and Jalote, A. (2006).
SmashGuard: A hardware solution to prevent security attacks on the function return
address. IEEE Transactions on Computers, 55:1271–1285.
[16] Patterson, D. A. and Hennessy, J. L. (2017). Computer Organization and Design RISC-V
Edition: The Hardware Software Interface. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 1st edition.
[17] Richarte, G. (2002). Four different tricks to bypass StackShield and StackGuard
protection. World Wide Web, 1.
[18] Waterman, A., Lee, Y., Patterson, D. A., and Asanović, K. (2019). The RISC-V
instruction set manual, volume I: User-level ISA, version 20191213. Technical report,
EECS Department, University of California, Berkeley.
[19] Xu, J., Patel, S., and Iyer, R. (2002). Architecture support for defending against






A.1 Buffer Overflow Exploitable Function
void vulnerable_function(char *c, int length) {
    char buf[64];             // our buffer
    strcpy(buf, c, length);   // copy input to buffer (no bounds check against buf)




void swap(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}
int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;




    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}
void quicksort(int arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quicksort(arr, low, pi - 1);
        quicksort(arr, pi + 1, high);
    }
}
A.3 Triangular Numbers Algorithm
int triangular(int n) {
    if (n == 1) return n;
    else return n + triangular(n - 1);  // recurse on triangular itself
}
A.4 Mergesort Algorithm
void merge(int arr[], int l, int m, int r)
{
    int i, j, k;
    int n1 = m - l + 1;
    int n2 = r - m;
    /* create temp arrays */
    int L[n1], R[n2];
    /* Copy data to temp arrays L[] and R[] */
    for (i = 0; i < n1; i++)
        L[i] = arr[l + i];
    for (j = 0; j < n2; j++)
        R[j] = arr[m + 1 + j];
    /* Merge the temp arrays back into arr[l..r] */
    i = 0; // Initial index of first subarray
    j = 0; // Initial index of second subarray
    k = l; // Initial index of merged subarray
    while (i < n1 && j < n2) {
        if (L[i] <= R[j]) {









    /* Copy the remaining elements of L[], if there
       are any */
    while (i < n1) {





    /* Copy the remaining elements of R[], if there
       are any */
    while (j < n2) {





/* l is for left index and r is right index of the
   sub-array of arr to be sorted */
void mergeSort(int arr[], int l, int r)
{
    if (l < r) {
        // Same as (l+r)/2, but avoids overflow for
        // large l and h
        int m = l + (r - l) / 2;
        // Sort first and second halves
        mergeSort(arr, l, m);
        mergeSort(arr, m + 1, r);




B Attack Model Algorithms
B.1 Unprevented Buffer Overflow
Algorithm 1: Standard Buffer Overflow Attack
def vuln_func(char *s, int len):
    buf[32] = [];
    strcpy(buf, &s, len);
B.2 RAS Attack Algorithm
Algorithm 2: RAS Attack Algorithm
def vuln_func(char *s, int len, int i):
    buf[32] = [];
    if i == 0 then
        strcpy(buf, &s, len);
        vuln_func(s, len, i+1);
        return;
    else if i == 64 then
        Write return address to memory;
        return;
    else
        vuln_func(s, len, i+1);
    end
C CRAS SystemVerilog Design Code
C.1 Return Address Stack
module ra_stack #(
    DATA_WIDTH = 32,
    DEPTH = 64,
    FILL_THRESH = 48,
    EMPTY_THRESH = 32
) (
    input logic clk, rst, ena,
    input logic push, pop, ret,
    input push_bottom, pop_bottom,
    input logic [DATA_WIDTH-1:0] din, din_bottom,
    output logic [DATA_WIDTH-1:0] dout, dout_bottom,
    output logic mismatch, full, empty, over_thresh, under_thresh
);
    logic [DATA_WIDTH-1:0] data [DEPTH-1:0];
    integer cnt = 0;
    assign empty = (cnt == 0);
    assign full = (cnt == DEPTH);
    assign over_thresh = (cnt >= FILL_THRESH) ? 1'b1 : 1'b0;    // Let CRAS controller know the stack is getting full
    assign under_thresh = (cnt <= EMPTY_THRESH) ? 1'b1 : 1'b0;  // Let CRAS know the stack is getting empty
    assign dout = (cnt == 0) ? 0 : data[cnt - 1];
    assign dout_bottom = data[0];
    // push to top of stack - typical behavior
    task stack_push;
    begin




    // push to bottom of stack. Only used by CRAS controller
    task stack_push_bottom;
    begin
        for (int i = DEPTH - 1; i > 0; i--) begin
            data[i] = data[i-1];
        end




    // pop from bottom of stack. Only used by CRAS controller
    task stack_pop_bottom;
    begin
        for (int i = 0; i < DEPTH - 1; i++) begin
            data[i] = data[i+1];
        end





    // pop from top of stack - typical behavior
    task stack_pop(input logic is_ret);
    begin
        logic [DATA_WIDTH-1:0] tmp;
        cnt--;
        tmp = data[cnt];
        data[cnt] = 0;
        if ((tmp != din) & is_ret) mismatch = 1;
    end
    endtask
    // sequential logic
    always_ff @(posedge clk) begin
        if (rst) begin
            cnt <= 0;
            mismatch <= 0;
            for (int i = 0; i < DEPTH; i++)
                data[i] <= 0;
        end else begin
            if (ena & ~mismatch) begin
                if (push & ~full) begin
                    stack_push();
                end else if ((pop | ret) & ~empty)
                begin
                    stack_pop(ret);
                end
            end
            if (push_bottom & ~full & ~((cnt == DEPTH-1) & push)) begin
                stack_push_bottom();
            end else if (pop_bottom & ~empty) begin





C.2 CRAS Top Module and Controller
module CRAS_top #(
    parameter int W = 32,             // Word size / length of return address
    parameter int NKW = 4,            // Number of words that make up the key
    parameter int DEPTH = 64,         // Size of RAS in return addresses
    parameter int FILL_THRESH = 48,   // Threshold to start making room in the RAS
    parameter int EMPTY_THRESH = 32   // Threshold to start refilling the RAS
) (
`ifndef SIMTEST
    riscv_bus rbus,   // interface with RISC-V core
    mmio_bus mbus     // interface with memory controller
`else
    input logic clk, rst,
    input logic [W-1:0] addr_in,
    input logic branch, ret,
    output logic stack_full, stack_empty, stack_mismatch,
    input logic [2:0] config_addr,
    input logic [31:0] config_din,
    input logic config_wr,
    output logic rdy, RAS_ena
`endif
);
    logic mem_rdy, mem_rd, mem_wr;
    logic [W-1:0] mem_dout, mem_din;
    logic [31:0] mem_addr;
`ifndef SIMTEST
    logic clk, rst, branch, ret, stack_full, stack_empty, stack_mismatch;
    logic [W-1:0] addr_in;
    logic rdy;
    // configuration register signals
    logic [2:0] config_addr;
    logic [31:0] config_din;
    logic config_wr;
    logic RAS_ena;
    always_comb begin
        clk = rbus.clk;
        rst = rbus.Rst;
        branch = rbus.RAS_branch;
        ret = rbus.ret;
        rbus.stack_full = stack_full;
        rbus.stack_empty = stack_empty;
        rbus.stack_mismatch = stack_mismatch;
        addr_in = rbus.RAS_addr_in;
        mem_rdy = mbus.RAS_mem_rdy;
        mem_dout = mbus.RAS_mem_dout;
        mbus.RAS_mem_rd = mem_rd;
        mbus.RAS_mem_wr = mem_wr;
        mbus.RAS_mem_din = mem_din;
        mbus.RAS_mem_addr = mem_addr;
        rbus.RAS_rdy = rdy;
        mbus.RAS_ena = RAS_ena;
        config_addr = mbus.RAS_config_addr;
        config_din = mbus.RAS_config_din;
        config_wr = mbus.RAS_config_wr;
    end
`else
    assign mem_rdy = 1;
`endif
    logic [3:0] wea;
    assign wea = mem_wr ? 4'b1111 : 4'b0000;
    // state machine enumeration
    enum { idle, enc_pop1, enc_pop2, enc_pop3, enc_pop4, enc_begin, enc_wait,
           enc_write_lower, enc_write_upper, enc_write_lower2,
           enc_write_upper2, enc_push_temp, dec_read_lower, dec_read_lower_2,
           dec_read_upper, dec_read_lower2, dec_read_lower2_2, dec_read_upper2,
           dec_begin, dec_wait, dec_push1,
           dec_push2, dec_push3, dec_push4, dec_check } state, next_state;
    // simon core signals
    logic arst_n, active_o, valid_i, ready_o, mode_i;
    logic [1:0][W-1:0] pt_i;
    logic [NKW-1:0][W-1:0] key_i;
    logic valid_o, ready_i, mode_o;
    logic [1:0][W-1:0] ct_o;
    // stack signals
    logic stack_ena, stack_push, stack_pop, stack_ret;
    logic stack_push_bot, stack_pop_bot, stack_over_thresh, stack_under_thresh;
    logic [W-1:0] stack_din, stack_dout, stack_din_bot, stack_dout_bot;
    logic [W-1:0] raddr_temp_reg;
    logic [31:0] base_addr, cur_addr;  // Base address and current address of overflow pages
    logic [31:0] page_count;           // Number of existing overflow pages
    logic enc_last;
    logic dec_read_last;
    logic [1:0][W-1:0] IV;  // Initial Vector
    logic [1:0][W-1:0] Ck;  // Ciphertext output
    // SIMON Core
    simon_top #(.WW(W), .NKW(NKW)) crypto_core (.*);
    ra_stack #(.DATA_WIDTH(32), .DEPTH(DEPTH), .FILL_THRESH(FILL_THRESH),
               .EMPTY_THRESH(EMPTY_THRESH)) ras (clk, rst,
        stack_ena, stack_push, stack_pop, stack_ret,
        stack_push_bot, stack_pop_bot,
        stack_din, stack_din_bot, stack_dout, stack_dout_bot,
        stack_mismatch, stack_full, stack_empty,
        stack_over_thresh, stack_under_thresh);
    assign IV = {32'hbaddab69, 32'hbaddab69};
    assign rdy = ~stack_full & ~(stack_empty & (page_count > 0) & ret);
    // combinational assignments
    always_comb begin
        arst_n = ~rst;
    end
    // Configuration Logic
    always_ff @(posedge clk) begin
        if (rst) begin
            RAS_ena <= 1;
            base_addr <= 0;
            key_i <= {32'hdeadbeef, 32'hdeadbeef, 32'hdeadbeef, 32'hdeadbeef};
        end else if (config_wr) begin
            case (config_addr)
                3'b000: RAS_ena <= config_din[0];
                3'b001: base_addr <= config_din;
                3'b100: key_i[0] <= config_din;
                3'b101: key_i[1] <= config_din;
                3'b110: key_i[2] <= config_din;






assign s tack ena = RAS ena & rdy ;
assign s tack push = rdy & branch ;
assign stack pop = 0 ;
assign s t a c k r e t = rdy & r e t ;
assign s t a c k d in = rdy ? addr in : 32 ’ h0 ;
// combination s t a t e l o g i c
always comb begin
n e x t s t a t e = i d l e ;
s tack pop bot = 0 ;
s tack push bot = 0 ;
s t a c k d i n b o t = 0 ;
50
    valid_i = 0;
    mode_i = 0;
    ready_i = 0;
    mem_wr = 0;
    mem_din = 0;
    mem_addr = 0;
    mem_rd = 0;
    unique case (state)
        idle: begin
            // Need to make room?
            if (stack_over_thresh) begin
                next_state = enc_pop1;
            // Getting close to empty, and we have pages to bring back?
            end else if (stack_under_thresh & (page_count != 0)) begin
                next_state = dec_read_upper;
            end
        end
        // Pull out first address for encryption
        enc_pop1: begin
            next_state = enc_pop2;
            stack_pop_bot = 1;
        end
        // Get second address
        enc_pop2: begin
            next_state = enc_begin;
            stack_pop_bot = 1;
        end
        // Begin encryption
        enc_begin: begin
            valid_i = 1;
            mode_i = 0;
            if (ready_o) begin
                next_state = enc_wait;
                valid_i = 0;
            end else begin
                next_state = enc_begin;
            end
        end
        // Wait until encryption is finished
        enc_wait: begin
            if (valid_o) begin
                ready_i = 1;
                next_state = enc_write_lower;
            end else begin
                next_state = enc_wait;
            end
        end
        // Write lower half of ciphertext to memory, when available
        enc_write_lower: begin
            if (mem_rdy) begin
                mem_wr = 1;
                mem_din = ct_o[0];
                mem_addr = base_addr + cur_addr;
                next_state = enc_write_upper;
            end else begin




        // Write upper half of ciphertext to memory, when available
        enc_write_upper: begin
            if (mem_rdy) begin
                mem_wr = 1;
                mem_din = ct_o[1];
                mem_addr = base_addr + cur_addr;
                if (stack_over_thresh)
                    next_state = enc_pop1;
                else
                    next_state = idle;
            end else begin
                next_state = enc_write_upper;
            end
        end
        // read in lower half of page for decryption
        dec_read_lower: begin
            if (mem_rdy) begin
                mem_rd = 1;
                mem_addr = base_addr + cur_addr - 4;
                next_state = dec_read_lower2;
            end else begin
                next_state = dec_read_lower;
            end
        end
        // upper half
        dec_read_lower2: begin
            next_state = dec_begin;
        end
        dec_read_upper: begin
            if (mem_rdy) begin
                mem_rd = 1;
                mem_addr = base_addr + cur_addr - 4;
                next_state = dec_read_lower;
            end else begin
                next_state = dec_read_upper;
            end
        end
        // begin decryption
        dec_begin: begin
            valid_i = 1;
            mode_i = 1;
            if (ready_o) begin
                next_state = dec_wait;
            end else begin
                next_state = dec_begin;
            end
        end
        // wait until decryption is finished
        dec_wait: begin
            if (valid_o) begin
                ready_i = 1;
                next_state = dec_push1;
            end else begin




        // push first return address into the bottom of the stack
        dec_push1: begin
            stack_push_bot = 1;
            stack_din_bot = ct_o[0];
            next_state = dec_push2;
        end
        // push the other one
        dec_push2: begin
            stack_push_bot = 1;
            stack_din_bot = ct_o[1];




// sequential state logic
always_ff @(posedge clk) begin
    if (rst) begin
        state <= idle;
        cur_addr <= 0;
        page_count <= 0;
        raddr_temp_reg <= 0;
        enc_last <= 0;
        dec_read_last <= 0;
    end else begin
        case (state)
            idle: begin
                raddr_temp_reg <= addr_in;
            end
            enc_pop1: begin
                pt_i[0] <= stack_dout_bot;
            end
            enc_pop2: begin
                pt_i[1] <= stack_dout_bot;
            end
            enc_write_lower: begin
                if (mem_rdy) cur_addr <= cur_addr + 4;
            end
            enc_write_upper: begin
                if (mem_rdy) begin
                    cur_addr <= cur_addr + 4;




            dec_read_lower: begin
                if (mem_rdy) begin
                    cur_addr <= cur_addr - 4;
                end
                if (dec_read_last) begin
                    pt_i[0] <= mem_dout;
                    dec_read_last <= 0;
                end
            end
            dec_read_lower2: begin
                pt_i[1] <= mem_dout;
            end
            dec_read_upper: begin
                if (mem_rdy) begin
                    cur_addr <= cur_addr - 4;
                    dec_read_last <= 1;
                end
            end
            dec_push2: begin











Grayson Bruner was born on February 13th, 1997 in Lynchburg, Virginia. Not long after, his
family moved to Colorado and then to Tennessee, just before Grayson entered kindergarten.
Upon graduating from high school, he began his college career at Pellissippi State Community
College to take advantage of the first wave of the Tennessee Promise Scholarship. After two
years at Pellissippi, where he received his associate’s degree in Science, Grayson transferred to the
University of Tennessee to pursue a bachelor’s degree in Computer Engineering. Two years
later, wishing to further his education, Grayson was accepted into
graduate school to do research under Dr. Garrett Rose, which eventually led to the writing
of this thesis.
