Memory access patterns and page promotion in hybrid memory systems by Agarwal, Ayush
© 2020 Ayush Agarwal





Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2020
Urbana, Illinois
Adviser:
Professor Wen-mei W. Hwu
ABSTRACT
Hybrid heterogeneous memory systems are becoming increasingly popular
as traditional memory systems are hitting performance and energy walls in
processing data-intensive applications, which are becoming the norm with
the resurgence of machine learning, big data, graph analytics, and database
management systems, especially in modern datacenters. In addition to the
massive data that these applications process, they exhibit varying and non-
deterministic memory access patterns making I/O latency a prime criterion
in the design considerations that go into building modern computing systems
to support them.
A traditional memory system moves data by swapping pages between the
faster DRAM and the slower SSD. While applications with sequential
accesses use the traffic between the DRAM and the SSD efficiently,
applications with random page accesses, such as large graph workloads,
often produce high traffic and exhibit little or no reuse of pages swapped
into the DRAM.
This thesis proposes a technique to identify memory access patterns, and
a scalable and distributed technique to determine when pages should be
promoted from the slower memory system to the faster memory system,
thereby reducing I/O traffic. The proposed page promotion design shows
up to 6.74x reduction in page traffic and 1.21x increase in the total hit rate
of a data-intensive application with uniformly distributed random memory
accesses.
To dearest Maa and Paa,
for their unconditional love and support.
ACKNOWLEDGMENTS
I would like to thank my adviser, Professor Wen-mei W. Hwu, for his guidance
and support. His wisdom and his passion have molded me into a better person
and a better professional.
I would like to thank the IMPACT research group for supporting me
throughout. The group not only helped me with academic know-how but
also created a conducive and fun environment to enable cutting-edge re-
search. A special mention to Zaid Qureshi, Vikram Sharma Mailthody, and
David Min for being amazing seniors.
I would especially like to thank Marie-Pierre Lassiva-Moulin for her excel-
lent management of everything in the group in order to ensure that we can
commit all our focus to research.
I would like to thank my parents for their unwavering and unconditional
love, support, and guidance. My gratitude to them is above and beyond what
words can express. I would like to thank my sisters for keeping me strong
and motivated and my nephew for never failing to put a smile on my face.
I would like to thank my friends, who made this journey fun, kept me
grounded, and helped me keep my sanity.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
CHAPTER 2 MEMORY ACCESS PATTERN RECOGNITION
2.1 Background
2.2 Methodology
2.3 Experimental Setup
2.4 Design Assessment and Evaluation
CHAPTER 3 DISTRIBUTED PAGE PROMOTION
3.1 Background
3.2 Design and Methodology
3.3 Experimental Setup
3.4 Evaluation
3.5 Sensitivity Analysis
CHAPTER 4 RELATED WORK
CHAPTER 5 CONCLUSION AND FUTURE WORK
REFERENCES
LIST OF TABLES
2.1 Cache configuration
2.2 DRAM configuration
2.3 SCALESim results
3.1 A breakdown of SSD latency
3.2 Memory system configuration of the proposed system
LIST OF FIGURES
2.1 Memory access patterns
2.2 Illustration of an LSTM cell
2.3 Memory access patterns in GUPS benchmark
2.4 Actual versus predicted accesses
3.1 Memory hierarchy
3.2 Architecture of an SSD
3.3 A hybrid memory system
3.4 SSD architecture with distributed page promotion
3.5 Single entry of a distributed page promotion table
3.6 DRAM hits, SSD page cache hits, and misses for the baseline page
    swapping system, FlatFlash, and the proposed page promotion system
3.7 DRAM-SSD transfers and SSD cache-flash transfers for the baseline page
    swapping system, FlatFlash, and the proposed page promotion system
3.8 Total hits versus DRAM size
3.9 Total pages moved versus DRAM size
3.10 Total hits versus the number of pages per channel of the SSD
3.11 Total pages moved versus the number of pages per channel of the SSD
3.12 Total hits versus the number of ways per set
3.13 Total pages moved versus the number of ways per set
3.14 Total hits versus epoch size
3.15 Total pages moved versus epoch size
3.16 Total hits versus percentage of pages promoted per epoch
3.17 Total pages moved versus percentage of pages promoted per epoch
LIST OF ABBREVIATIONS
AMAT Average Memory Access Time
ARIMA Auto Regressive Integrated Moving Average
DRAM Dynamic Random Access Memory
GUPS Giga Updates per Second
HBM High Bandwidth Memory
HMC Hybrid Memory Cube
LRU Least Recently Used
LSTM Long Short-Term Memory
PC Program Counter
PCIe Peripheral Component Interconnect Express
PCM Phase Change Memory
PR Page Rank
RNN Recurrent Neural Network
SATA Serial Advanced Technology Attachment
SSD Solid State Drive
TC Triangle Counting
CHAPTER 1 INTRODUCTION




Modern computer systems are built with complex hardware to deliver maximum
performance for a wide range of applications. Many algorithms, mechanisms,
and design styles, such as out-of-order execution, multicore processors,
caches, and scratchpads, have been proposed since the early days of
computer systems, and many of them remain in wide use today. However, the
added complexity also exposes the system to multiple bottlenecks, and
despite efforts to alleviate them, a processor's efficiency often remains
less than ideal. One such bottleneck is the expensive page fault, which
occurs when the processor accesses information that is not present in the
main memory (the DRAM). Since computation is orders of magnitude faster
than accessing memory, and the number of processing elements per chip is
growing much faster than memory bandwidth and latency are improving, there
exists a memory wall [1]. The cost of a page fault is on the order of tens
of thousands of clock cycles, which translates to microseconds for systems
running at gigahertz-level clock frequencies.
On the other hand, the trend of applications becoming more data-intensive
has rekindled research in new and different types of memory system
organization and design. Due to poor DRAM scaling, there is an uptick in
research on hybrid heterogeneous memory systems that can handle big data
efficiently and minimize the expensive I/O accesses that a traditional
system makes.
Multiple techniques have been proposed in the past to reduce the number
of stalls or hide latencies that a processor suffers. A common example is
speculative execution, where the processor continues to execute instructions
speculatively after checkpointing and restarts from the checkpoint in the
event of a mis-speculation. Another example is prefetching, where data is
brought into faster memories ahead of time so that it is available for the
processor to consume when the instruction executes. Prefetching in modern
processors, however, is done from DRAM to the caches and is usually based
on a look-up table. As data size increases, the fixed size of the look-up
table becomes the limiting factor and the performance of prefetching may
begin to drop sharply. Consequently, the overall performance can be
negatively impacted since incorrect prefetches increase DRAM traffic and
can begin to pollute the caches. Incorrect prefetches pose an even bigger
challenge
when there is an increase in the randomness of memory accesses made by
an application. The behavior of memory accesses may also become non-
deterministic due to data races and scheduling policies, especially in parallel
systems, making it seemingly harder for prefetchers to accurately predict
subsequent accesses.
The other challenge that many systems face is keeping data in faster
memories. When applications work on massive amounts of data, the DRAM is
usually too small to hold the entire working set, so the system must rely
on page swapping between the DRAM and the SSD (or any other larger storage
system). While this helps applications with relatively simple memory access
patterns that exhibit high reuse of pages, applications with predominantly
random access patterns suffer in two ways. First, the average
cost of moving an entire page, for a few accesses, is high, in terms of latency
and power. Second, the pages swapped into the DRAM may evict pages that
are needed in the near future, known as DRAM thrashing, and degrade the
overall performance of the system. Thrashing is usually due to pages that are
not highly/recently reused but turn out to be needed in the near future. This
happens a great deal when the working set is too big and the access pattern
is not localized enough. Furthermore, with byte-addressability support in
the newer PCIe interconnect [2], it seems logical to access data directly from
the SSD without moving it to the DRAM, when the access pattern is mostly
random.
In this thesis, we look at existing solutions to address the issues of access-
ing memory for data-intensive workloads, especially those that are sufficiently
large to spill over to secondary storage devices. We also study prefetching
algorithms for complex access patterns to extract algorithms that can be ex-
tended to the storage-DRAM boundary. One challenge in prefetching pages
from the SSD as opposed to cache lines from the DRAM is that the address
space in storage is significantly larger, rendering a look-up table approach
infeasible. Sparsity in this large address space exacerbates the issue and
standard regression models mostly show poor fits [3]. Therefore, we explore
classification-based models in an attempt to recognize complex patterns,
predict the pages that might be accessed in the near future, and promote
these pages into the DRAM ahead of time. To address the challenge of ac-
cessing data after swapping pages into the DRAM vis-à-vis accessing data
directly from the SSD, we propose and evaluate a distributed page promo-
tion mechanism that exploits the inherent parallelism within the SSD while
maintaining a fair balance with the cost of memory accesses, both in terms
of performance and power.
The main contributions of the thesis are as follows:
• Analysis of memory access patterns of data-intensive workloads like
graph analytics with results that motivate and inform the design of
page prefetching and promotion mechanisms
• A neural network based memory address pattern recognizer to prefetch
pages from slower storage devices to the faster main memory system
• A novel distributed page promotion system design to improve the over-
all hit rate of data-intensive applications as well as reduce the cost of






CHAPTER 2 MEMORY ACCESS PATTERN RECOGNITION
2.1 Background
2.1.1 Memory Access Patterns
The current organization of the memory and storage systems relies on the
assumption that most applications exhibit significantly localized patterns in
accessing the memory (or storage). From the hardware point of view, aver-
age memory access time (AMAT) is reduced by the introduction of caches,
translation look-aside buffers, prefetchers, page tables, memory coalescers
and storage caches. From the software point of view, AMAT is reduced by
carefully constructing memory-friendly data structures and accessing data
such that the number of memory transactions reaching the lower levels of
the memory hierarchy is reduced. Software prefetching, compiler optimiza-
tions and speculative execution further aid in reducing the time to access the
memory system.
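For reference, AMAT is commonly written in the standard textbook form below. This formula is not stated in this thesis, and the numbers in the worked line are hypothetical values chosen only to show how the terms combine.

```latex
\begin{align*}
\text{AMAT} &= t_{\text{hit}} + r_{\text{miss}} \times t_{\text{penalty}} \\
            &= 1\,\text{ns} + 0.02 \times 100\,\text{ns} = 3\,\text{ns}
\end{align*}
```

With these assumed values, halving the miss rate to 0.01 lowers AMAT to 2 ns. This is why both the hardware and the software techniques listed above aim to reduce the number of accesses that reach the lower levels of the hierarchy.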
Determining patterns in memory accesses has been an active topic of re-
search and is the fundamental principle on which caches and prefetchers work,
both in hardware and software. Broadly, memory accesses exhibit three kinds
of patterns. First, sequential (or strided) accesses to memory addresses
exhibit spatial locality. For example, instruction fetches within a basic
block of code and accesses to a vector or to a row or column of a matrix
are sequential. Second, repeated accesses to the same memory location
exhibit temporal locality. Scalar accesses and accesses to arguments and
variables are examples of temporal reuse of data. Third, a series of memory
accesses that exhibits no discernible pattern is termed random. Pointer
chasing is an example of random accesses to the memory system. In general,
applications, especially emerging data-intensive
graph and sparse applications, show all three kinds of access behavior and
thus combinations of patterns and non-patterns at different points during
their runtime. A simple loop that increments the values of all elements of an
array five times exhibits both spatial and temporal locality. An application
that processes graphs, like breadth-first search, exhibits sequential spatial
locality while preprocessing the graph, whereas accesses become more random
during the actual graph traversal. Figure 2.1 shows the different types of
memory access patterns.
Figure 2.1: Memory access patterns
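The three kinds of access behavior described above can be illustrated with a small synthetic trace generator. This is a minimal sketch: the function names, the 8-byte element size, and the seed are assumptions made for illustration and do not come from the tools used in this thesis.

```python
import random


def sequential_trace(base, stride, count):
    """Strided accesses: consecutive addresses differ by a fixed stride."""
    return [base + i * stride for i in range(count)]


def temporal_trace(address, count):
    """Repeated accesses to a single location (temporal locality)."""
    return [address] * count


def random_trace(limit, count, seed=0):
    """Uniformly random accesses with no exploitable locality."""
    rng = random.Random(seed)
    return [rng.randrange(limit) for _ in range(count)]


def increment_array_trace(base, elements, passes, elem_size=8):
    """The loop from the text: touch every array element, `passes` times.

    Each pass is sequential (spatial locality) and each element is
    revisited on every pass (temporal locality).
    """
    return [base + i * elem_size
            for _ in range(passes)
            for i in range(elements)]


def deltas(trace):
    """Differences between consecutive addresses in a trace."""
    return [b - a for a, b in zip(trace, trace[1:])]
```

A sequential trace has a constant delta, a temporal trace has a delta of zero, and a random trace has deltas with no fixed value.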
2.1.2 Data Prefetchers
Prefetchers, in hardware or software, try to predict future memory accesses
based on information from past history. Prefetches are generally performed
in order to bring data from slower DRAMs to faster caches ahead of time so
that processors can use data without stalling while fetching it from the main
memory.
There are a number of types of prefetchers that exist today. Stride prefetch-
ers try to recognize delta values in memory access addresses that are fixed
and repeated [4]. For example, if addresses {0, 8, 16, 24} are accessed, the
stride prefetcher fixes on the delta (= 8) between consecutive accesses and
prefetches addresses {32, 40, ...} into the caches. Correlation prefetchers are
advanced prefetchers that can recognize accesses that are more complex and
not just based on stable deltas. Some examples include Markov prefetchers
[5] and GHB prefetchers [6].
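The stride example above can be sketched as follows. This is a simplified illustration and not the design from [4]: the `degree` and `confirmations` parameters are hypothetical, and a real stride prefetcher typically keeps per-instruction state in a table.

```python
def stride_prefetch(history, degree=2, confirmations=2):
    """Return addresses to prefetch, or [] if no stable stride is seen.

    `degree` is how many addresses to prefetch ahead and `confirmations`
    is how many equal consecutive deltas are required before the stride
    is trusted. Both are illustrative parameters.
    """
    if len(history) < confirmations + 1:
        return []
    recent = history[-(confirmations + 1):]
    observed = [b - a for a, b in zip(recent, recent[1:])]
    stride = observed[0]
    # A zero or unstable delta means there is nothing useful to prefetch.
    if stride == 0 or any(d != stride for d in observed):
        return []
    last = history[-1]
    return [last + stride * k for k in range(1, degree + 1)]
```

For the accesses {0, 8, 16, 24}, `stride_prefetch([0, 8, 16, 24])` locks onto the delta of 8 and returns the next two addresses, 32 and 40.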
Base offset prefetchers [7] fix on the offsets of addresses that are being ac-
cessed. This can be inefficient since many patterns can be missed if they are
outside of the offset bits. Access map pattern matching [8] observes memory
accesses and marks them so that these accesses can be prefetched when the
first memory access repeats. Spatial memory streaming [9] works similarly
to the access map pattern matching prefetcher and tends to prefetch ac-
cesses that show spatial and temporal reuse. Variable length delta prefetcher
[10] tends to prefetch data based on history tables that are pointed to by
sequences in another history buffer; this is in some sense a hierarchical
approach. Indirect memory prefetchers [11] are useful in predicting and
prefetching data that are indexed indirectly by resolving and remembering
the indirection early.
2.1.3 LSTM
LSTM [12] is an RNN architecture that is widely employed for predicting
sequences such as time-series data. An LSTM looks at some input $x_t$ and
outputs a value $h_t$. A loop allows information to be passed from one step
of the network to the next. This passage of information from one step to
the next can help the network learn long-term dependencies that exist in
time-series data. However, information that is very old can interfere with
the current context and degrade the prediction. LSTMs solve this long-term
dependency problem through a forget gate. The forget gate looks at the
previous hidden state $h_{t-1}$ and the input $x_t$, and outputs a value
that determines whether to keep or forget the information in the cell state
$c_{t-1}$. The input, output, and forget gates each pass their result
through a sigmoid function. A tanh layer creates a vector of new candidate
values $g_t$ to add to the state. Finally, the updated cell state is
combined with the output gate to provide the new hidden state $h_t$.
Figure 2.2 shows an LSTM cell, its logically unrolled representation, and
the detailed schematic of the LSTM cell.
The mathematical representation of the operations performed within the
LSTM cell is given below.
Figure 2.2: Illustration of an LSTM cell
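\begin{align*}
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
g_t &= \tanh(W_g \cdot [h_{t-1}, x_t] + b_g) \\
c_t &= f_t \times c_{t-1} + i_t \times g_t \\
h_t &= o_t \times c_t
\end{align*}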
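The operations above can be sketched as one step of an LSTM cell in plain Python. This is an illustrative sketch that follows the equations exactly as written here; many formulations additionally apply a tanh to the cell state before multiplying by the output gate. The `params` layout and the helper names are assumptions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, z):
    """Compute W . z + b, where W is a list of rows."""
    return [sum(w * zj for w, zj in zip(row, z)) + bi
            for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the equations in the text.

    `params` maps a gate name ("i", "f", "o", "g") to a (W, b) pair.
    This layout is a hypothetical choice made for the sketch.
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    i_t = [sigmoid(v) for v in affine(*params["i"], z)]
    f_t = [sigmoid(v) for v in affine(*params["f"], z)]
    o_t = [sigmoid(v) for v in affine(*params["o"], z)]
    g_t = [math.tanh(v) for v in affine(*params["g"], z)]
    # Element-wise products, as in c_t = f_t x c_{t-1} + i_t x g_t.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * c for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step. This makes the role of the forget gate easy to see.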
2.1.4 Neural Networks Based Prediction
All the prefetchers in section 2.1.2 are based on history tables, a design
that does not scale. Since we are trying to prefetch from a much larger and
sparser address space, table-based prefetchers become infeasible. In this
thesis, we look at classification-based machine learning techniques
inspired by other time-series-dependent domains, such as speech. We explore
long short-term memory (LSTM) based recurrent neural networks (RNNs).
Hashemi et al.’s Learning Memory Access Patterns [3] demonstrates the
use of LSTMs and LSTMs with clustering to perform memory access
predictions. The LSTMs use program counters (PCs) and deltas between memory
accesses as features and show improvement in memory access prediction over
stride and correlation prefetchers for memory-intensive workloads of the
SPEC CPU2006 benchmark suite. Abulila et al.’s FlatFlash [13] explores
adaptive page promotion schemes that determine whether accesses are ‘hot’
or ‘not hot’ in order to decide whether to promote pages from the SSD to
the DRAM aggressively or conservatively.
Methods other than LSTMs can also be used to predict memory addresses.
Time delay neural networks (TDNNs) are another type of neural network that
can be used to forecast values. They are generally less robust than LSTMs
but have the advantage of requiring less processing and being easier to
train. Other statistical methods like Autoregressive Integrated Moving
Average (ARIMA) or exponential smoothing are also being explored for this
problem space. However, these methods are less robust and more sensitive
to outliers and require copious amounts of data.
2.2 Methodology
In this thesis, we extend the neural network based prefetching to promote
pages in a hybrid memory system, such as the one outlined in FlatFlash. For
the LSTM model that was created, the following features were explored:
1. Timestamp or Miss Number - Timestamp of a page fault is an impor-
tant feature that provides information about the randomness in the
accesses. Misses or faults that are closer in time represent more ran-
domness.
2. Program Counter (PC) of the missing instruction - PC can be an im-
portant feature since it helps in identifying loops (temporal locality),
sequential accesses (spatial locality), and indirections.
3. Memory Address - This feature provides the history based on which
predictions will be made.
4. Deltas of memory access addresses - This feature helps reduce the size
of the query since deltas of consecutive addresses will be smaller than
memory addresses themselves. The additional benefit of keeping deltas
of addresses is that it makes the LSTM agnostic to where the page is
actually stored since each access has an address that is offset by the
base address.
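The position-independence argument for the delta feature can be checked with a few lines of code. This is a minimal sketch; the function name and the example addresses are hypothetical.

```python
def address_deltas(addresses):
    """Differences between consecutive addresses in a trace."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


# The same access pattern placed at two different base addresses.
trace = [0x1000, 0x1040, 0x1080, 0x5000]
shifted = [a + 0x7F0000 for a in trace]
```

Both `address_deltas(trace)` and `address_deltas(shifted)` yield the same sequence, so a model trained on deltas sees the same input regardless of where the data is placed. The deltas are also much smaller numbers than the raw addresses.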
While the above feature list is non-exhaustive, it is important to note that
it contains a lot of information about the memory accesses. The LSTM can
be trained to recognize address patterns. The LSTMs proposed in this thesis
are not deep and, hence, all the above features can be included without
affecting the computational requirement of the neural network significantly.
Training the LSTM involves tuning various hyperparameters. These hy-
perparameters are usually determined through the exploration of the design
space using iterative methods. The important hyperparameters in this net-
work are history length, batch size, number of epochs to train on, and number
of hidden layers. History length refers to the number of samples before the
current sample that the network looks at in order to determine the input
to the hidden layer. It can easily be visualized as creating a tensor that is
‘history’ times larger than the dimension of a single sample and concatenat-
ing those samples. Batch size refers to the number of samples after which
the LSTM cell is reset. This is done to prevent the saturation of the cell
due to very old information. Although the forget gate alleviates the issue
of long-term dependencies, cells can get saturated over time. The number of
epochs is the number of passes over the training set for which the LSTM is
trained. A small number of epochs leads to underfitting, where the network
learns patterns poorly, while a large number of epochs leads to overfitting,
where the network learns the training dataset almost perfectly but fails on
newer data samples. Therefore, it is important to tune the number of epochs
to get a network that generalizes well. Lastly, the number of hidden layers
can affect
how well a network can learn complex patterns that are non-trivial. This,
however, comes at the cost of time and space.
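The history-length hyperparameter described above can be pictured as a sliding window over the samples. The following is a sketch under the assumption that each sample is a flat list of feature values; `make_windows` is a hypothetical helper, not code from this thesis.

```python
def make_windows(samples, history):
    """Pair each target sample with the `history` samples before it.

    Each input is the concatenation of `history` consecutive samples, so
    it is `history` times larger than a single sample.
    """
    inputs, targets = [], []
    for t in range(history, len(samples)):
        window = []
        for sample in samples[t - history:t]:
            window.extend(sample)
        inputs.append(window)
        targets.append(samples[t])
    return inputs, targets
```

For example, four two-feature samples with a history of 3 give a single training pair whose input has six values and whose target is the fourth sample.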
One constraint is that LSTMs are computationally heavy and take a few
thousand cycles per inference, especially if the network is large. For this
reason, we minimize the complexity and keep the network small.
After training the LSTM offline, any new memory access can be passed
to the input of the network to predict values and promote relevant pages
from the storage to host DRAM ahead of time. The predicted address does
not strictly have to be the next access. An access that is likely to occur
further in time is predicted and promoted to the DRAM. Such promotion will
reduce the time required to service a cache miss that would have otherwise
resulted in a page fault. In either case, the overhead associated with sending
read or write requests to the storage and updating the page table entries
remains. Only the time to physically copy the data from storage to DRAM
is hidden. Fortunately, the time required for copying these pages is significant
and usually more than half of the total time required to serve a DRAM miss.
2.3 Experimental Setup
2.3.1 Generating Memory Traces





To generate memory access traces, different tools were evaluated. Intel’s Pin
[14] is a dynamic binary instrumentation framework for the x86-64 ISA that
enables the creation of dynamic program analysis tools. Pin provides a rich
API that abstracts away the underlying instruction set idiosyncrasies and
allows context information such as register contents to be passed to the
injected code as parameters. With the help of these APIs, we can modify
C/C++ code to profile instruction and data requests. Valgrind [15] is
another framework for debugging and profiling programs on Linux. Lackey is
a tool in Valgrind that is capable of profiling memory-related
instructions. We extended the base code of Lackey to trace the addresses of
the memory requests that a process makes. In this thesis, traces were
collected using the Valgrind framework.
Memory access addresses were traced for different graph analytics appli-
cations. Generating memory addresses through these tracing tools takes
significant time. Therefore, to expand the scope of this method, a trace
generator capable of generating random addresses based on different
distributions, such as the uniform, binomial, and Zipfian distributions,
was developed. The primary purpose of this trace generator was to
speed up the process of creating LSTM models that can identify patterns in
pseudo-random data.
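A synthetic trace generator of this kind might look like the sketch below. It is not the generator used in this thesis: the function names, the parameters (`p`, `trials`, `exponent`), and the way each distribution is mapped onto page numbers are all assumptions made for illustration.

```python
import random

PAGE_SIZE = 4096  # 4 KB pages, as used elsewhere in this chapter


def uniform_pages(num_pages, count, rng):
    """Every page is equally likely."""
    return [rng.randrange(num_pages) for _ in range(count)]


def binomial_pages(num_pages, count, rng, p=0.5, trials=64):
    """Binomial(trials, p) draws scaled onto the page range."""
    pages = []
    for _ in range(count):
        successes = sum(rng.random() < p for _ in range(trials))
        pages.append(successes * (num_pages - 1) // trials)
    return pages


def zipfian_pages(num_pages, count, rng, exponent=1.0):
    """Page of rank r is drawn with weight 1 / r**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_pages + 1)]
    return rng.choices(range(num_pages), weights=weights, k=count)


def to_addresses(pages, rng):
    """Turn page numbers into byte addresses with a random page offset."""
    return [page * PAGE_SIZE + rng.randrange(PAGE_SIZE) for page in pages]
```

Passing a seeded `random.Random` instance as `rng` makes a trace reproducible, which helps when comparing models trained on the same synthetic data.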
2.3.2 CacheSim and DDRSim
The addresses of memory requests generated from Valgrind were passed
through a cache simulator to filter out memory requests that would otherwise
be cached in a real system. The output of the cache simulator represented
the misses that accessed the DRAM. Using another simulation environment,
the accesses that resulted in DRAM misses or page faults were captured.
These accesses are of interest in this thesis as they represent cache lines in
pages that would normally be swapped between the DRAM and the SSD.
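The cache filtering step can be sketched with a small set-associative LRU model. This is an illustrative sketch, not the simulator used in this thesis. The class and function names are hypothetical, and the default parameters mirror the cache configuration reported in section 2.4.1.

```python
from collections import OrderedDict


class SetAssociativeLRU:
    """A set-associative cache with true LRU replacement within each set."""

    def __init__(self, capacity_bytes=16 << 20, line_bytes=64, ways=16):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = capacity_bytes // (line_bytes * ways)
        # One ordered map per set; the least recently used line is first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, address):
        """Return True on a hit; on a miss, install the line."""
        line = address // self.line_bytes
        entries = self.sets[line % self.num_sets]
        if line in entries:
            entries.move_to_end(line)  # mark as most recently used
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the LRU line
        entries[line] = True
        return False


def filter_misses(trace, cache):
    """Keep only the accesses that miss in the cache."""
    return [address for address in trace if not cache.access(address)]
```

The output of `filter_misses` plays the role of the accesses that reach the DRAM. A second model of the same shape, keyed on pages instead of cache lines, could then separate DRAM hits from page faults.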
Recall that DRAM thrashing refers to the phenomenon where blocks or
pages with low reuse evict blocks or pages that will be needed, used or reused
in the near future due to capacity limitations of the DRAM. A large number
of mis-predictions from the neural network can lead to DRAM thrashing. In
order to reduce DRAM thrashing, an upper limit on the number of pages that
can be promoted by the network is imposed. Once the threshold is reached,
the network can either stop promoting pages till a promoted page is used, or
a replacement policy can be used, for example, the oldest unused promoted
page is replaced. This bookkeeping adds additional overhead.
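The two options described above (stop promoting, or replace the oldest unused promoted page) can be sketched as a small bookkeeping structure. The class name, method names, and return convention are hypothetical, and the sketch assumes a limit of at least one page.

```python
from collections import OrderedDict


class PromotionBudget:
    """Tracks promoted-but-unused pages, capped at `limit` entries."""

    def __init__(self, limit, replace_oldest=True):
        self.limit = limit
        self.replace_oldest = replace_oldest
        self.unused = OrderedDict()  # oldest promotion first

    def promote(self, page):
        """Try to promote `page`; return (promoted, evicted_page)."""
        if page in self.unused:
            return False, None  # already promoted and still unused
        evicted = None
        if len(self.unused) >= self.limit:
            if not self.replace_oldest:
                return False, None  # wait until a promoted page is used
            evicted, _ = self.unused.popitem(last=False)
        self.unused[page] = None
        return True, evicted

    def mark_used(self, page):
        """A demand access touched a promoted page, freeing its slot."""
        self.unused.pop(page, None)
```

The ordered map is the bookkeeping overhead mentioned above: every promotion and every first use of a promoted page requires an update.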
2.3.3 SCALE-Sim
Systolic arrays are one of the most popular compute substrates within deep
learning accelerators today, as they provide extremely high efficiency for
running dense matrix multiplications. ARM Research’s Systolic CNN
Accelerator Simulator (SCALE-Sim) [16] is a configurable, systolic-array-based,
cycle-accurate DNN accelerator simulator. For evaluating the overhead from
making inferences
on the neural network, we use this simulator and measure the number of cy-
cles it takes to perform a single inference on the LSTM network. This helps
in estimating the latency of an inference at a given frequency of operation of
the simulator. A weight-stationary configuration was used since outputs of
the network are not used frequently, and storing weights in the scratchpad
of the simulator improves access time. The number of processing elements of
the systolic array was varied to efficiently determine the right size such that
the entire array was well utilized. The bandwidth for the input feature map,
output feature map and the filter was calculated for the given frequency to
ensure that the accelerator is not bandwidth limited.
2.4 Design Assessment and Evaluation
2.4.1 Building the LSTM Model
In this thesis, we create an LSTM model to capture memory address access
patterns and show that a learning technique may be applied to predict what
pages are accessed in the storage device. For creating the LSTM model, the
features mentioned in section 2.2 were studied. The following observations
were made for the four potential features of the model:
1. Timestamp or Miss Number - On analyzing the behavior of the LSTM
model, it was observed that this feature did not add information to
significantly improve the accuracy of the network. The reason for this is
that LSTM, by its nature as a sequence predicting network, inherently
takes into account the ‘timestamp’ of inputs.
2. Program Counter (PC) of the missing instruction - The importance
of this feature is not very conspicuous in the LSTM network modeled
because of the applications targeted. Since data-intensive parallel ap-
plications are the focus of this work, a majority of the accesses are
data accesses and the program counter of the instructions being exe-
cuted adds little information in the grand scheme of the model.
3. Memory Address - In this thesis, the DRAM misses that trigger a page
fault were captured. The 48-bit memory address of these page faults
was shifted right by 12 bits to eliminate the page offset, considering a
page size of 4KB. The remaining 36 bits were split into 9 features such
that each feature is a 4-bit nibble. Splitting the memory addresses into
multiple features improves the accuracy of the model significantly. The
intuition behind splitting the memory addresses was that accesses that
are more random show more variation in multiple nibbles while sequen-
tial accesses show variations only in the lower nibbles. A history of 3
accesses is passed into the network, essentially increasing the feature
space by 3 times.
4. Deltas of memory access addresses - The importance of this feature
is reduced due to the nature of an LSTM network. The combination
of the different gates and learned weights for each of the LSTM gates
already represents the information contained in the delta of memory
accesses in some form.
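The address-splitting step described in item 3 can be sketched as follows. The helper names are hypothetical, and ordering the nibbles from most to least significant is an assumption; the text does not specify the order.

```python
PAGE_SHIFT = 12   # 4 KB pages
NIBBLES = 9       # (48 - 12) bits / 4 bits per nibble


def page_nibbles(address):
    """Split the page number of a 48-bit address into nine 4-bit features."""
    page = (address & ((1 << 48) - 1)) >> PAGE_SHIFT
    return [(page >> (4 * i)) & 0xF for i in reversed(range(NIBBLES))]


def history_features(addresses, history=3):
    """Concatenate the nibbles of the last `history` faulting addresses."""
    features = []
    for address in addresses[-history:]:
        features.extend(page_nibbles(address))
    return features
```

For example, `page_nibbles(0x123456789ABC)` drops the 12-bit offset `0xABC` and returns the nine nibbles of page `0x123456789`. With a history of 3 accesses, the network input has 27 features. Neighbouring pages differ only in the last few nibbles, while random accesses vary across many of them, which matches the intuition given above.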
Since the primary focus of this method, in this thesis, is to extract patterns
in accesses while minimizing the overhead of an inference, the network was
pruned to include only essential features. After a thorough design space
exploration, the history of the network was empirically determined to be 3
samples with a batch size of 75 over 100 epochs and 1 hidden layer. The
model was based on split memory addresses as the only features. A 75:25
split ratio was used to split the traces into training sets and testing sets
respectively.
The cache was modeled as a single-level 16MB shared cache, 16-way set
associative, 64B cache lines and a true LRU replacement policy. Although
systems have multiple levels of caches with more sophisticated replacement
policies, this configuration was chosen for simplifying the simulator since the
accesses of interest are the ones that trigger page faults and this single-level
cache simulation filters out the accesses that would have been served by the
last level cache or any of the higher levels of the cache in a system with
inclusive caches. The DDR simulator was modeled for 4KB pages. A 48-bit
memory address from the traces was used to access the pages. Since the page
offset is 12 bits for a 4KB page, the raw access address that missed in the
DDR was shifted right by 12 bits and split into nibbles that became the input
features for the proposed LSTM network. Table 2.1 and table 2.2 summarize
the cache and DRAM configurations respectively. To limit DRAM thrashing,
an upper limit on the number of pages promoted in a given period of time
was set to one million pages, corresponding to 4GB of the DRAM.
Table 2.1: Cache configuration
Parameter                   Value
Capacity                    16 MB
Block size                  64 B
Number of sets              16384
Associativity               16
Cache replacement policy    True LRU

Table 2.2: DRAM configuration
Parameter                   Value
Capacity                    16 GB
Page size                   4 KB
Associativity               Full
Page replacement policy     True LRU
2.4.2 Benchmarks
To evaluate the performance of the LSTM model, memory access traces were
generated for giga-updates per second (GUPS), a workload that measures the
rate of random integer updates to memory. The benchmark is the RandomAccess
benchmark of the High Performance Computing Challenge [17] benchmark suite
(HPCC-GUPS). A small percentage of random memory accesses
(cache and DDR misses) in an application can significantly affect the overall
performance of that application. GUPS attempts to measure the impact
of random accesses on an application. GUPS accesses memory in a mixed
fashion, that is, some accesses are sequential while some are random. GUPS
is calculated by identifying the number of memory locations that can be
randomly updated in one second, divided by one billion. All updates in
GUPS are read-modify-write operations on a large matrix. To make sure that
the working dataset used in GUPS is not completely cached in the DRAM,
a 32 GB address space was allocated to the application. Using OpenMP, four
instances of GUPS were launched and the access addresses were merged in a
round-robin pattern.
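The GUPS metric and the round-robin merge of per-instance traces can be sketched as below. The helper names are hypothetical, and the handling of traces of unequal length (shorter traces simply drop out) is an assumption; the thesis does not state how that case was treated.

```python
from itertools import zip_longest

_SKIP = object()  # sentinel for traces that have run out of accesses

def gups(num_updates, seconds):
    """Giga-updates per second: random updates per second divided by one billion."""
    return num_updates / seconds / 1e9

def merge_round_robin(traces):
    """Interleave several address traces, taking one access from each in turn."""
    merged = []
    for group in zip_longest(*traces, fillvalue=_SKIP):
        merged.extend(addr for addr in group if addr is not _SKIP)
    return merged
```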
2.4.3 Analysis of Memory Traces
The analysis of memory traces describes how patterns corresponding to dif-
ferent localities appear when plotted against time or miss number. Figure
2.3 is a memory access graph that shows a zoomed-in region of a sample
plot of page faulting addresses against miss number (the sequence in which
the miss occurred) for the GUPS benchmark. This plot shows that regions
exhibit temporal, spatial or no locality. Accesses to the same address exhibit
temporal locality and appear as a horizontal straight region in the graph.
Accesses that exhibit spatial locality would appear as linear regions with
addresses linearly increasing with miss number (or access number). Regions
that are spread across and sparse represent random accesses and these are
the regions of interest where we try to recognize non-trivial patterns using
sophisticated learning algorithms like the LSTM.
Figure 2.3: Memory access patterns in GUPS benchmark
In a multiprocessor system, parallel applications may even distort se-
quences that were easily detectable by a simple prefetching technique. How-
ever, the long-term memory of an LSTM can still recognize patterns exist-
ing between samples farther apart after appropriately weighing intermediate
samples. An additional benefit comes from parallel accesses of one process
influencing the access pattern of another when multiple processes of the same
application are executing. This gives a learning-based approach an advantage
over traditional approaches and reduces sensitivities like the one described
above, which occur more often in parallel applications.
2.4.4 Analysis of the LSTM Network
Figure 2.4 shows a zoomed-in portion of the testing set on a four-process
instance of the GUPS benchmark. From the figure, we can see that the
predicted curve closely follows the actual curve indicating that the LSTM
was indeed able to detect and learn patterns in the accesses. The error in
prediction is depicted by the delta between the two curves. This difference is
particularly prominent when there are sharp changes in the accesses or when
the curve is jagged and rough. However, for the majority of the curve, the lines
coincide.
Figure 2.4: Actual versus predicted accesses
To improve the prediction of the network, it would help to include all
accesses to the DRAM, that is, both hits and misses. In the current imple-
mentation, only the DRAM misses (or page faults) are extracted and used to
train the network. By including hits, the network will have more information
to learn from since the page offsets will provide 3 extra features, considering
the 12 bits of a page offset are split into 3 nibbles. We define the
coverage of the LSTM as the fraction of accesses that it was able to predict
correctly. For the GUPS benchmark, we observe a 63% coverage of all memory
accesses. This coverage will change with applications, types and number of
concurrent applications, LSTM size and complexity, and studying the effects
of these on the coverage of the LSTM is part of the future work.
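A minimal sketch of the coverage metric follows. It assumes that a prediction counts as correct only when the predicted page matches the actual page exactly; the thesis does not spell out the matching criterion, so this is an illustrative choice.

```python
def coverage(predicted, actual):
    """Fraction of accesses whose predicted page equals the actual page."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual traces must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```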
2.4.5 Simulator Results for the LSTM
LSTM networks get bulky and computationally expensive very quickly. Gen-
erally, accesses exhibit patterns with other accesses that are closer in time.
The ability to predict deep into the future is important in timely prefetch-
ing the pages. Predictions that are slow defeat the goal of promoting pages
ahead of time. Due to these constraints, it is imperative that an inference
of the network is quick and accurate. In this thesis, we propose to add an
accelerator between the DRAM and storage to improve the performance of
an inference in terms of latency. We simulate the network to ensure that in-
ferences are completed in a meaningful time window and that the accelerator
itself is not creating a new bottleneck in the system. To evaluate this, we
use SCALESim, a systolic CNN accelerator simulator from ARM Research.
The number of processing elements, size of the scratchpads and dataflow
parameter are additional hyperparameters in the accelerator. After a design
space exploration of these hyperparameters, an 8 x 8 systolic array with
108KB of scratchpad memory each for input feature maps, filters, and output
feature maps was simulated. The dataflow chosen for modeling
was weight stationary since the weights are frequently used and do not change
once trained.
The average utilization of the accelerator’s PEs was 74.74%. While this
is not the maximum utilization of the 8 x 8 array, a smaller configuration
leads to an increase in compute cycles. The array uses 2478 cycles to perform a
single inference. The total bandwidth requirement of the array is about 1.78
bytes per cycle, which translates to 1.416GB/s at an operating frequency
of 800MHz. This bandwidth is much lower than the PCIe x4 bandwidth
that is generally available between storage devices like SSDs and the DRAM,
and therefore, this does not introduce a bottleneck in the design. Table 2.3
summarizes the simulation results.
Table 2.3: SCALESim results
                            Per Cycle       At 800MHz
Average Utilization         74.74%
Compute cycles              2478 cycles
DRAM Input FM Read BW       0.35B/cycle     280MB/s
DRAM Filter Read BW         1.42B/cycle     1.14GB/s
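The conversion from per-cycle figures to rates at 800MHz is simple arithmetic, sketched below with the values from Table 2.3. Summing the two read bandwidths listed in the table gives 1.77 bytes per cycle, which corresponds to the 1.416GB/s quoted above; the per-inference latency is derived here from the cycle count and is not a figure reported in the thesis.

```python
FREQ_HZ = 800e6  # operating frequency used in the simulation

def bytes_per_second(bytes_per_cycle, freq_hz=FREQ_HZ):
    """Convert a per-cycle bandwidth into bytes per second."""
    return bytes_per_cycle * freq_hz

ifmap_bw = bytes_per_second(0.35)      # input feature map reads
filter_bw = bytes_per_second(1.42)     # filter reads
total_read_bw = ifmap_bw + filter_bw   # sum of the two rows in Table 2.3

# Time for one inference, derived from the 2478 compute cycles.
inference_latency_s = 2478 / FREQ_HZ
```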






The memory system is a collection of units that are capable of storing infor-
mation in the form of binary digits and is classified into two main groups,
volatile memory and non-volatile memory, based on the persistence of data
when the power source is lost. Figure 3.1 shows the different components of
a typical memory system and its corresponding area and latency.
Figure 3.1: Memory hierarchy
Registers and higher levels of caches are present on-chip and provide high
bandwidths and low latencies that are on the order of a few nanoseconds. This
comes at the expense of additional transistors, making these storage units bulky and
posing limitations on capacity, especially when total chip area is constrained.
Off-chip caches and main memory, typically a DRAM, have moderate bandwidth
and access latencies on the order of a few hundred nanoseconds. However
they require fewer transistors per bit of storage. For example, the DRAM
requires one transistor and one capacitor per bit of information while SRAM,
typically used in caches, requires four transistors to store a single bit of infor-
mation and two additional transistors for accessing the bit. The components
discussed thus far are volatile and do not scale to very high capacities due
to power and area constraints.
The main memory can typically store tens or a few hundred gigabytes
of data. However, it does not provide non-volatility of data. Moreover,
emerging data-intensive applications process data on the order of hundreds
of gigabytes or even a few terabytes, mandating additional non-volatile storage
in the memory system. Non-volatile memory, solid state drives, and magnetic
discs and tapes are all examples of non-volatile, high-capacity storage
devices that provide auxiliary memory to the system. These devices are
slower to access and therefore cannot replace the main memory, but they are
denser, enabling the storage of large amounts of data in an acceptable area.
Data is moved between caches and registers in words, typically 32-64 bits.
Between the main memory and the caches, the smallest unit of data movement
is the cache block or cache line, typically 64-128 bytes of data.
The unit of data movement between the main memory and auxiliary memory
(or storage) is traditionally the page or segment, while recent research
in this area has enabled accesses to the storage in smaller units, as in a
byte-addressable or cache-line accessible SSD. The primary reason for an increase
in the smallest transferable unit at each level is to amortize the cost of ac-
cesses to the lower levels of the memory hierarchy. This, again, relies on
the assumption that memory accesses will exhibit some spatial and temporal
locality and subsequent accesses can find data (memory hit) in the higher
levels of the hierarchy.
Traditionally, pages are moved between storage and main memory based
on two mechanisms, demand segmentation and demand paging. The more
popular demand paging is a paging system which moves a page from the
storage to the main memory on the first memory reference that results in a
page fault. Demand segmentation is the other mechanism, in which the
logical address space of applications is divided into a collection of segments,
and segments of the memory move between the two levels. Section 3.2 pro-
poses a novel algorithm to promote pages more efficiently in hybrid memory
systems, a primary contribution of this thesis.
3.1.2 Solid State Drive
A solid state drive (SSD) is a persistent storage device based on non-volatile
semiconductor memory chips, typically flash memory chips because of their low
cost and high density. Figure 3.2 shows the architecture of an SSD. The main
components of an SSD are the host interface, the SSD controller, an embed-
ded processor, and the flash memory interface. The host interface, in most
commercial SSDs, is PCIe or SATA. The SSD controller has many functional
blocks for a variety of tasks like address translation, wear leveling, buffer
memory, error checking and correction, garbage collection and scheduling.
The interface to the flash memory is organized into channels. Each channel
supports multiple flash chips which are the physical storage units. Chips are
organized into planes, which are organized into blocks, and then to pages.
Figure 3.2: Architecture of an SSD
SSDs provide the opportunity to exploit parallelism through the organiza-
tion of the flash memory interface. Highly efficient use of these SSD channels
creates enough throughput to fully utilize the high bandwidth buses in the
SSD. Different data layouts are used to maximize this throughput. Common
data layouts are based on striping pages across different chips or different
channels. In this thesis, we assume a data layout where pages are striped
across channels, that is, consecutive pages are placed on consecutive channels
of the SSD. Table 3.1 breaks down the average latency in serving a request
to a conventional PCIe SSD [18].
Table 3.1: A breakdown of SSD latency
Component       Description                                 Time (µs)
t_host          Time for host to issue a command            5
t_controller    Time for controller to decode a command     20
t_cell          Time to read data from flash cell arrays    50
t_DMA           Time to transfer read data to controller    8
t_transfer      Time to transfer 4KB data to host           5
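The components in Table 3.1 can be combined into an end-to-end estimate. The sketch below assumes that the five stages are serialized and therefore additive, which the table suggests but does not state; the dictionary and function names are illustrative.

```python
# Per-request latency components from Table 3.1, in microseconds.
SSD_LATENCY_US = {
    "t_host": 5,
    "t_controller": 20,
    "t_cell": 50,
    "t_DMA": 8,
    "t_transfer": 5,
}

def total_latency_us(components):
    """End-to-end latency, assuming the stages do not overlap."""
    return sum(components.values())

def share(components, name):
    """Fraction of the total latency contributed by one component."""
    return components[name] / total_latency_us(components)
```

Under this assumption the flash cell read dominates, at 50 of 88 microseconds.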
3.1.3 Hybrid Memory Systems
Following processing systems such as CPUs, GPUs, and accelerators, modern
computer systems are embracing heterogeneity in memory systems. Memory
systems are combining the salient features of new technologies that are
optimized for bandwidth, latency, capacity, and cost. For example, Intel’s
Knights Landing pairs multi-channel DRAM (MCDRAM), a form of high bandwidth
memory (HBM), with DDR4 memory to provide both high bandwidth and
high capacity [19]. Non-volatile 3D XPoint memory has become commercially
available, and technologies like disaggregated memory are gaining
traction in academic and industrial research.
Figure 3.3: A hybrid memory system
It is not counter-intuitive to imagine computer systems connected to more
than one type of memory system, as shown in figure 3.3, collaboratively
serving as a single memory abstraction. These hybrid memory systems will
have varying bandwidths and latencies, which makes it imperative to design
a strong control unit that places the smallest unit of memory transfer, for
example pages, most efficiently to reduce the average memory access time
(AMAT) of a memory request.
Ideally, all the most reused data should reside in the fastest memory, spilling
over to the next fastest memory and so on. However, in order to fully exploit
a hybrid memory system, it is equally important to ensure that pages in
the slower memory devices with low or no reuse remain in the slower device
to avoid the cost of page copies and page movements as well as prevent
thrashing or polluting the faster memory systems. In this thesis, we propose
a mechanism for promoting only those pages that exhibit some re-usability
from the slower memory, the SSD, to the faster memory, the DRAM.
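To make the placement argument concrete, the AMAT of a two-tier system can be written as a weighted sum of the tier latencies. The sketch below uses placeholder latencies (0.1 µs for the DRAM and 88 µs for the SSD, the latter being the sum of the components in Table 3.1); these values and the hit fractions are illustrative assumptions, not measurements from this thesis.

```python
def amat(fractions, latencies_us):
    """Average memory access time as a weighted sum over memory tiers.

    fractions[i] is the share of requests served by tier i and must sum to 1.
    """
    if len(fractions) != len(latencies_us):
        raise ValueError("one latency is needed per tier")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(f * t for f, t in zip(fractions, latencies_us))

# Placeholder latencies: DRAM, then SSD (assumed values for illustration only).
LATENCIES_US = [0.1, 88.0]
amat_90 = amat([0.90, 0.10], LATENCIES_US)  # 90% of requests hit in the DRAM
amat_95 = amat([0.95, 0.05], LATENCIES_US)  # 95% of requests hit in the DRAM
```

Because the slower tier dominates the sum, even a small shift of requests toward the DRAM lowers the AMAT substantially, provided the promotions themselves do not evict pages with higher reuse.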
3.2 Design and Methodology
Section 3.1.2 gives an overview of the main components in an SSD. The
flash memory interface is made up of channels and each channel provides a
common bus for multiple chips connected on the same channel.
In this thesis, a novel page promotion design and algorithm is proposed
to facilitate efficient page promotion. We consider a hybrid memory system
that has a unified memory address space for the different types of memory
devices available in the system. An example of this type of system can be
seen in Abulila et al.’s FlatFlash [13]. The two main considerations of any
page promotion algorithm are the hit rate of a memory access and the data
movement traffic between different types of memory systems in a hybrid
memory system. An overview of the architecture for the proposed page
promotion mechanism is shown in figure 3.4.
In the proposed page promotion mechanism, each SSD channel is connected
to a table that holds entries that map the pages that have been cached in
the SSD page cache. This distributed table allows the design to scale with
the number of channels or even multiple SSDs. The distributed table also reduces
the cost of looking up entries in the table, which determines whether an access
is an SSD hit, that is, cached in the SSD page cache.
Figure 3.4: SSD architecture with distributed page promotion
Figure 3.5: Single entry of a distributed page promotion table
The distributed table is built on the fundamental principles of a cache
look-up. Figure 3.5 shows a single entry in the distributed table. Each entry
in the table contains the tag bits of the page, a counter to track how ‘hot’ the
page is, a timestamp to lazily decay the heat of a page, a bit E that indicates
if the page is eligible for promotion, a used bit A to track if the entry was
accessed in the current epoch and a dirty bit D. Since the table can map
many pages to a particular entry based on the associativity of the table, the
tag bits are responsible for distinguishing whether an access belongs to the
particular entry. The hotness counter tracks the number of accesses made to
a particular page. The value of the counter increments by x for every access
and decrements lazily on the next access by a value that is a function of the
number of other SSD accesses that were made since the last access to the
page. This decay of the counter is essential to ensure that the counters do not
saturate and a page that was heavily accessed during a particular region of
the application and has accumulated a high value for hotness begins to cool
down to allow other more recently used pages to be promoted to the DRAM.
This is facilitated by a global timestamp which increments by 1 unit for each
SSD access, the timestamp bits and the used bit of the table entry. Abrupt
unlearning methods, such as resetting the counters to 0 at every epoch, are not
recommended because the entries would need to become hot again at every epoch,
which diminishes the efficiency of the promotion mechanism. The eligibility bit is a hardware
optimization to reduce the computational complexity of sorting the entries
at each epoch. It is set when the hotness of the entry is greater than the
threshold. The dirty bit represents whether the contents of the page were
modified in order to ensure that the data is written back to the flash memory
when it is evicted from the page cache. Each page that is cached in the SSD
page cache is tracked by an entry in one of the tables. The overall hardware
overhead of the proposed design depends on the size of the distributed tables,
but is negligible compared to the capacity of the SSD.
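The table entry and its lazy decay can be sketched as follows. The field names mirror the description above, but the decay function (elapsed SSD accesses shifted right by decay_shift), the order of decay and increment, and the default parameter values are assumptions for illustration; the thesis only states that the decay is a function of the number of intervening SSD accesses.

```python
from dataclasses import dataclass

@dataclass
class TableEntry:
    tag: int
    hotness: int
    timestamp: int          # value of the global SSD access counter at last access
    eligible: bool = False  # E: hotness exceeds the promotion threshold
    accessed: bool = False  # A: entry was used in the current epoch
    dirty: bool = False     # D: page was modified while cached

def touch(entry, now, x=4, threshold=8, decay_shift=4, is_write=False):
    """Update an entry on an SSD hit: decay lazily, then add the increment x."""
    elapsed = now - entry.timestamp
    decay = elapsed >> decay_shift               # assumed decay function
    entry.hotness = max(0, entry.hotness - decay) + x
    entry.timestamp = now
    entry.accessed = True
    entry.dirty = entry.dirty or is_write
    entry.eligible = entry.hotness > threshold   # set E when above the threshold
    return entry
```

Applying the decay only when the entry is next accessed avoids sweeping every counter on each SSD access, which is the motivation for storing a timestamp per entry.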
When a memory access misses in the DRAM, a request is sent to the SSD. If
the page is cached in the SSD page cache, the request hits in the SSD and data
is returned from the page cache. If the page is not cached in the SSD page
cache, an SSD miss occurs and the page is read from the flash memory and
cached in the SSD page cache. When an SSD hit occurs, the corresponding
table entry is updated by incrementing the hotness counter by increment
value x. The timestamp is updated and the corresponding decay is subtracted
from the counter. In the event of an SSD miss, an entry in the corresponding
channel’s table is populated with the details of the page. An initial value,
equal to the hotness increment parameter x, is assigned to the entry and the
other bits of the entry are set or unset depending on the configuration of the
tables. Since the tables are accessed like a set-associative cache structure, a
replacement policy like least recently used (LRU) is used to allocate the table
entries when the set is full. To prevent thrashing of the SSD page cache, the
table can be appended with an additional buffer which behaves like a victim
cache to the page cache table. Exchanging entries between the main table
and its corresponding buffer occurs when the minimum hotness of the main
table entries is less than the hotness of the buffered entry. Only the main
table entries are considered for promotion. This ensures that highly reused
entries remain in the distributed tables and are not prematurely evicted due
to capacity constraints.
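The hit and miss handling described above can be sketched with one small set-associative table per channel. This is a simplified model under stated assumptions: pages are striped across channels by page number, the table is indexed with the channel-local page number, LRU order is kept per set, and the decay and the victim buffer are omitted. The class and function names, and the default of 4 ways, are hypothetical.

```python
from collections import OrderedDict

class ChannelTable:
    """Per-channel table: each set is an LRU-ordered mapping of tag to hotness."""

    def __init__(self, num_sets=1024, ways=4, x=1):
        assert num_sets & (num_sets - 1) == 0, "num_sets must be a power of two"
        self.num_sets, self.ways, self.x = num_sets, ways, x
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, local_page):
        """Return True on an SSD page cache hit, False on a miss."""
        entries = self.sets[local_page & (self.num_sets - 1)]
        tag = local_page // self.num_sets
        if tag in entries:
            entries[tag] += self.x       # hit: bump the hotness counter
            entries.move_to_end(tag)     # mark as most recently used
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # set is full: evict the LRU entry
        entries[tag] = self.x            # miss: new entry starts at x
        return False

def ssd_access(tables, page):
    """Route a page to its channel (striped layout) and look it up there."""
    channel = page % len(tables)
    return tables[channel].access(page // len(tables))
```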
Pages are promoted from the SSD to the DRAM at regular intervals called
epochs. These epochs should be sufficiently small to allow hot pages to
quickly move to the faster DRAM and sufficiently large to reduce the over-
head of the page promotion algorithm. Each epoch defines the maximum
number of pages, k, that can be promoted at once. At the end of each epoch,
the entries in the tables are looked up and the entries that are eligible for pro-
motion are selected. The eligible entries per table are then sorted to retrieve,
at most, the k hottest pages. These entries are then merged with the eligible
entries from all the other tables. A second round of sorting is performed and
the top-k entries of the entire SSD are retrieved. These pages represent the
most frequently accessed pages and are therefore promoted to the DRAM.
The entries are erased from the distributed tables in the SSD. If the DRAM
is full, the least recently used pages are swapped from the DRAM to the SSD
page cache to make room for the new hot page in the DRAM.
This hierarchical approach of determining the top-k hot pages in the SSD
is integral to the scalability of this approach. It reduces the performance
overhead of the promotion algorithm by searching and sorting the pages in
the distributed tables in parallel.
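The two-level selection at the end of an epoch can be sketched as below. Each channel table is represented here as a plain mapping from page number to hotness, which is a simplification; the function names are hypothetical, and the per-channel step is written sequentially although the design performs it in parallel.

```python
import heapq

def top_k(entries, k):
    """Return the k hottest (page, hotness) pairs, hottest first."""
    return heapq.nlargest(k, entries, key=lambda entry: entry[1])

def select_promotions(channel_tables, k, threshold):
    """Pick at most k pages to promote across all channel tables."""
    per_channel = []
    for table in channel_tables:
        # Only entries whose hotness exceeds the threshold are eligible.
        eligible = [(page, hot) for page, hot in table.items() if hot > threshold]
        per_channel.append(top_k(eligible, k))  # first round, per channel
    merged = [entry for hot_list in per_channel for entry in hot_list]
    return [page for page, _ in top_k(merged, k)]  # second round, whole SSD
```

Since each channel contributes at most k candidates, the second round sorts at most k times the number of channels entries regardless of the table size.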
The page promotion algorithm is shown in Algorithm 1.
3.3 Experimental Setup
To evaluate the proposed page promotion design, memory traces were gener-
ated using the Valgrind framework, as described in section 2.3.1. The mem-
ory addresses were filtered to generate accesses to the DRAM and lower levels
of the memory system using the cache simulator described in section 2.3.2
with a configuration identical to that used in chapter 2, which is summarized
in table 2.1. A software simulator was developed to evaluate the proposed
page promotion algorithm. This simulator models the DRAM, the SSD page
cache and the distributed tables for each channel of the SSD. The control unit
of the algorithm is designed to provide flexibility over a number of hyperpa-
rameters like threshold, epoch duration, number of channels, number of sets
and ways per table, DRAM size, and number of pages promoted per epoch.
Table 3.2 summarizes the configuration of the hyperparameters used in the
evaluation of this page promotion mechanism. The hyperparameters were
empirically chosen to maximize the efficiency of the algorithm in both per-
formance and cost. Section 3.5 describes how varying the hyperparameters
affects the performance of the page promotion algorithm.
Table 3.2: Memory system configuration of the proposed system
DRAM Size 128 MB
DRAM Page size 4 KB
SSD Page Cache sets per channel 1024




Max pages promoted per epoch 100






setIdx = Addr & setMask;














if (min(ssdPgCache.currHot) < max(ssdPgCacheBuff.currHot))
    swapEntry;




foreach (channel) do
    getTopKHotPages(channel);
end
hotList ← merge();
getTopKHotPages(hotList);
swapWithDRAM(topKPages);




To evaluate the proposed page promotion system, memory traces were gen-
erated for a 128GB uniform distribution, and graph analytics algorithms like
triangle counting (TC) and page rank (PR). A memory trace generator ca-
pable of generating random memory requests based on different distributions
was developed. To study the benefit of the proposed system, a uniform dis-
tribution trace over a 128GB address space was generated. The generated
trace contains 95% read requests and 5% write requests. This benchmark
represents large, irregularly accessed datasets that are commonly found in
graph analytics and sparse linear algebraic programs. It represents the types
of applications which would benefit most from the proposed system.
Memory traces were generated for a CPU version of the triangle count-
ing benchmark outlined by Pearson et al. in the HPEC subgraph isomor-
phism static graph challenge [20]. Another graph analytics benchmark from
GraphChi [21], Page Rank, was also evaluated. For each of the graph analytics
applications, three graphs of varying dataset sizes (amazon0302, cit-patents,
and friendster) were evaluated. amazon0302 is a relatively small
graph with 262,111 nodes and 1,234,877 edges, which is expected to be cached
in the DRAM completely. cit-patents and friendster are larger graphs,
containing 3,774,768 nodes and 16,518,948 edges, and 65,608,366 nodes and
1,806,067,135 edges respectively.
3.4.2 Experimental Results
The benchmarks described in section 3.4.1 were evaluated on three different
systems:
1. traditional system with demand paging (baseline)
2. FlatFlash
3. the proposed system
In a traditional system with a demand paging policy, each DRAM miss trig-
gers a page fault that transfers the entire page in which the request was made
from the SSD to the DRAM. FlatFlash accesses the memory systems through
a unified address space for both the DRAM and the SSD, similar to the pro-
posed system, and promotes pages from the SSD to the DRAM based on an
adaptive threshold at every DRAM miss. In the proposed system, pages are
promoted by distributed page cache tables at the end of each epoch.
Figure 3.6: DRAM hits, SSD page cache hits, and misses for the baseline
page swapping system, FlatFlash, and the proposed page promotion system
Figure 3.6 shows the number of DRAM hits, the number of SSD hits,
and the number of misses, normalized to the baseline page swapping system.
We make the following observations based on the memory requests that hit
or miss, directly influencing the overall performance of the system. Larger
datasets are able to utilize the SSD page cache positively. Applications with
a greater degree of memory randomness, like uniform distribution and Page
Rank, have fewer total misses when compared to the baseline and FlatFlash.
We observe a 1.21x gain in total hits for uniform distribution and a 1.16x
gain for Page Rank with the cit-patents dataset. Although some applications,
like triangle counting, use memory-friendly data structures to mitigate the
large cost of memory misses, most pages may not be reused heavily. This
is seen in an increase in SSD hits and a corresponding decrease in DRAM
hits. Since some of the datasets, like amazon0302, can eventually be cached
in the DRAM due to being smaller in size, we observe a similar number of
hits and pages moved in all three systems. This reinforces that the proposed
mechanism is designed keeping data-intensive applications with very large
working datasets in mind. However, all three systems show high hit rates,
supporting the proposed system’s minimal loss in performance for less data-
intensive applications.
Figure 3.7: DRAM-SSD transfers and SSD cache-flash transfers for the
baseline page swapping system, FlatFlash, and the proposed page
promotion system
The second benefit comes from the decrease in the number of page movements.
Figure 3.7 shows the number of pages swapped between the DRAM
and the SSD page cache and the number of pages evicted from the SSD page cache
back to the flash memory. Once again, we observe that for data-intensive
applications with a large degree of randomness in memory accesses, the total
number of pages moved in the system is significantly lower. The uniform
distribution benchmark has a 6.74x reduction in the communication traffic,
and Page Rank on single-threaded and multi-threaded cit-patents has 3.62x
and 4.69x fewer page movements respectively. This reduces the energy cost
of moving pages throughout the system and reduces shared bus contention,
allowing requests to be completed faster. Similar to the observations made
for total hits, for a smaller, less random memory access pattern, like triangle
counting, we observe that the number of pages moved is on par with the
baseline. This is intuitive, since pages will eventually be promoted to the
faster memory system, given that the capacity of the faster memory system
is sufficiently large to cache the working set of the application.
3.5 Sensitivity Analysis
To tune the hyperparameters of the proposed page promotion system for per-
formance and cost, a series of sensitivity experiments were performed. The
design space that was explored includes the size of the DRAM, organization
of the SSD page cache, epoch, and the number of pages moved per epoch
as a percentage of SSD accesses per epoch. The total number of hits and
total number of pages swapped between the SSD and the DRAM were stud-
ied. More hits indicate better performance while fewer page swaps indicate
better cost. The benchmarks from section 3.4.1 were evaluated for different
system configurations, varying one parameter at a time. The details of these
experiments are summarized in sections 3.5.1 - 3.5.5.
3.5.1 Size of the DRAM
Figure 3.8 and figure 3.9 show the total number of hits for an application
and the total number of pages swapped between the DRAM and the SSD
respectively as a function of DRAM size. From figure 3.8, it is observed that
a larger DRAM has a higher hit rate since more pages can now be cached
in the DRAM. In addition to this, the total number of page swaps between
the DRAM and the SSD also decreases with larger DRAM sizes since capacity
misses are reduced. This can be observed in figure 3.9.
Figure 3.8: Total hits versus DRAM size
Figure 3.9: Total pages moved versus DRAM size
Constraints on application and dataset sizes posed by the memory tracing
framework and page promotion simulator limit the variation in the memory
requests to the SSD. Being fully associative, very large DRAMs tend to
cache the entire dataset, making it difficult to evaluate the true performance
and cost benefit of the proposed system. A practical DRAM can cache a few
million 4KB pages. Although this thesis targets data-intensive applications
with working datasets on the order of a few hundred gigabytes or a few
terabytes, given the constraints of the experimental setup, a 128MB DRAM
was modeled to simulate applications with datasets that are, at most, a few
gigabytes to provide more randomness in the SSD accesses.
3.5.2 Size of the SSD Page Cache
Figure 3.10 and figure 3.11 show the total number of hits for an application
and the total number of pages swapped between the DRAM and the SSD
respectively as a function of the number of sets per page cache table of the
SSD, that is, the size of the SSD page cache tables. Similar to the effect of
increasing DRAM size, an increasing SSD page cache size exhibits a higher
hit rate. This is observed in figure 3.10.
Figure 3.10: Total hits versus the number of pages per channel of the SSD
It is interesting to note that the total number of pages swapped between
the SSD page cache and the DRAM, as shown in figure 3.11, varies with
applications and access patterns. The primary reason for this variation is
that the SSD page cache is based on a set-associative organization. This gives
room for conflict misses, that is, misses that arise due to the unavailability
of an empty entry in a particular set even though entries in other sets may
be available.
The total size of the page cache is a function of the number of SSD channels.
Leveraging the fact that most commercial SSDs have 16 to 32 channels, the
total hardware overhead of maintaining a larger page cache is insignificant.
Figure 3.11: Total pages moved versus the number of pages per channel of
the SSD
For this reason, we use a 1024-entry table per channel in our experimental
setup.
3.5.3 Associativity of the SSD Page Cache
Figure 3.12 and figure 3.13 show the total number of hits for an application
and the total number of pages swapped between the DRAM and the SSD page
cache respectively as a function of the associativity of the SSD page cache.
A larger number of SSD page cache ways reduces conflict misses; however,
as observed in figure 3.12, increasing associativity does not
significantly improve total hits in this design. Even so, the high cost of misses
means that the reduction of misses for applications like PR cit-patents can
still have a significant impact on the overall performance of the applications.
On the other hand, the total number of pages swapped in applications with
greater randomness reduces with more ways.
To maximize performance, for a fixed SSD size, a high number of sets and
a low number of ways of the SSD page cache tables was modeled for the
evaluation of this thesis.
Figure 3.12: Total hits versus the number of ways per set
Figure 3.13: Total pages moved versus the number of ways per set
3.5.4 Epoch
Figure 3.14 and figure 3.15 show the total number of hits for an application
and the total number of pages swapped between the DRAM and the SSD
respectively as a function of the epoch size that triggers page promotion. A
large epoch size reduces the probability of serving a request in the DRAM
since pages are promoted only at the end of each epoch. To exacerbate
the problem, since the SSD page cache tables are smaller than the DRAM
and set-associative as compared to a fully associative DRAM, there is an
increase in the number of conflicts in the SSD page cache tables that affects
the total hit rate, especially for applications that exhibit a higher degree of
randomness. This behavior is observed in figure 3.14.
Figure 3.14: Total hits versus epoch size
Interestingly, but not surprisingly, the total traffic between the DRAM
and the SSD is similar to what is observed for varying DRAM size and SSD
cache size. Since both conflict and capacity misses in the SSD increase with
an increase in the epoch, depending on the randomness of an application’s
memory accesses, we observe large variations in the number of pages moved.
While executing the promotion algorithm at smaller epochs adds to the
overhead of this system, a large epoch greatly hurts performance due to
reduced DRAM hits and SSD hits. Therefore, in this thesis, we evaluate
with an epoch of 1000 SSD accesses.
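The epoch mechanism described above can be sketched as a simple access
counter. This is a minimal illustration under stated assumptions, not the
implementation evaluated in this thesis: the class name EpochTrigger and its
method are hypothetical, and the only value taken from the text is the epoch
of 1000 SSD accesses.

```python
class EpochTrigger:
    """Counts SSD accesses and signals each epoch boundary (hypothetical)."""

    def __init__(self, epoch_size=1000):
        if epoch_size <= 0:
            raise ValueError("epoch_size must be positive")
        self.epoch_size = epoch_size
        self.accesses_in_epoch = 0
        self.epochs_completed = 0

    def record_ssd_access(self):
        """Count one SSD access; return True when the epoch just ended."""
        self.accesses_in_epoch += 1
        if self.accesses_in_epoch == self.epoch_size:
            # Promotion would run here; pages wait until this point.
            self.accesses_in_epoch = 0
            self.epochs_completed += 1
            return True
        return False
```

Because promotion runs only when record_ssd_access returns True, a page that
becomes hot early in an epoch still waits for the boundary, which is why a
larger epoch delays the point at which requests can be served from the DRAM.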
3.5.5 Promotions per Epoch
Figure 3.16 and figure 3.17 show the total number of hits for an application
and the total number of pages swapped between the DRAM and the SSD
respectively as a function of the percentage of pages promoted per epoch.
Figure 3.15: Total pages moved versus epoch size
Promoting a large number of pages per epoch is detrimental to performance
as well as communication cost. It is intuitive that a high value for this hy-
perparameter results in increased traffic between the DRAM and SSD, as
shown in figure 3.17. In addition to the high traffic generated, the perfor-
mance of the system drops due to thrashing of the DRAM when pages with
low reuse become eligible for promotion and replace pages with high reuse
in the DRAM. This DRAM thrashing reduces the total hit rate, as shown in
figure 3.16, and most strongly affects applications that exhibit
intermittent locality.
In this thesis, we promote a number of pages equal to 10% of the accesses in
an epoch, that is, 100 pages per 1000-access epoch.
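The per-epoch promotion budget can be worked through with a short sketch.
The arithmetic (10% of a 1000-access epoch gives 100 pages) comes from the
text; the function names and the policy of ranking candidates by access
count within the epoch are assumptions made for illustration and may differ
from the selection logic used in this thesis.

```python
from collections import Counter


def promotion_budget(epoch_size, percent):
    """Number of pages that may be promoted in one epoch."""
    return epoch_size * percent // 100


def select_promotions(access_counts, budget):
    """Return up to `budget` pages, most frequently accessed first.

    Ties are broken by the lower page number so the result is deterministic.
    Ranking by access count is an assumed policy, not the thesis's own.
    """
    ranked = sorted(access_counts.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _count in ranked[:budget]]


def count_epoch_accesses(trace):
    """Tally how often each page number appears in one epoch's trace."""
    return Counter(trace)
```

A small budget keeps rarely reused pages out of the DRAM: with a budget of
two, only the two most frequently accessed pages of the epoch are selected,
and pages touched once stay in the SSD.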
Figure 3.16: Total hits versus percentage of pages promoted per epoch





Prefetching data from the DRAM to the caches by identifying complex
memory access patterns with hardware [22, 23, 24] and software/compiler
[25, 26, 27, 28] techniques has been studied extensively over the last couple of
decades. Hardware prefetching has been traditionally researched within the
context of look-up tables, helper threads, and runahead execution.
Compiler-based static code analysis and profiling techniques have also been
studied to generate prefetch instructions and improve memory performance.
However, these techniques do not scale well and become infeasible for very
large address spaces, such as those found in secondary storage devices. Some
prefetching methods use neural-network-based classification to prefetch data
from DRAM to caches [3, 29], while more recent work classifies memory
accesses through dataflow analysis rather than neural networks [30].
Hybrid memory systems are being deployed in commercial systems [19,
31]. Hybrid memory cubes (HMC) [32], phase change memory (PCM) [33],
and high bandwidth memory (HBM) [34], which have different bandwidths and
latencies, are gaining traction for deployment in hybrid memory systems.
NVIDIA’s Unified Memory is another system based on multiple memory sys-
tems mapped to a unified address space [35]. Yan et al. [36] propose a general
purpose OS-integrated multi-level memory management system that reuses
OS page tracking structures to tier pages directly between memories with
minimal monitoring overhead.
Attempts to improve the performance of virtual memories for software-based
DRAM buffers have been carried out in [37, 38, 39]. This thesis proposes
managing the page cache with the help of hardware structures to improve the
overall performance of the system. Page placement and promotion or migration
have been based on heuristics or counter-based profiling [40, 41, 42]. These
methods track the hotness of a page and operate on heavily sampled memory
access patterns. Wang et al. [43] focus on migrating pages from slower
memory banks to faster memory banks with low overhead and minimal pollution
of the faster memory. Ryoo et al. [44] demonstrate the effect of the amount
of page migration in heterogeneous memory systems.
Abulila et al. [13] propose an adaptive page promotion mechanism dedi-
cated for the byte-addressable SSDs and DRAMs with the goal of exploiting
their advantages concurrently and transparently. In contrast to the afore-
mentioned works, in this thesis, we propose a scalable and distributed page
promotion mechanism that can be extended to different memory systems and
scaled to multiple compute and memory nodes.
CHAPTER 5
CONCLUSION AND FUTURE WORK
This thesis highlights the ever-increasing I/O bottleneck in systems today.
The work proposes a method to alleviate the bottleneck by recognizing
patterns in accesses and promoting pages ahead of time. An LSTM-based
approach was built to recognize memory access patterns, and the trained LSTM
network provided 63% coverage. An 8 × 8 systolic array was used to simulate
the performance of running an inference on the LSTM network. With the help
of this accelerator, the LSTM approach appears viable and can help reduce
the time spent in an I/O request by hiding the promotion latency.
The other contribution of this thesis is the enablement and improvement of
scalable and distributed page tracking and promotion mechanisms to improve
memory performance in hybrid or heterogeneous memory systems. This the-
sis proposes a distributed page cache table per channel of SSDs to track
recently and frequently used pages and avoid unnecessary page movement
traffic for accesses that sparingly utilize the page from the SSD. The sim-
ulated system shows promising results, especially for data-intensive appli-
cations that have a high degree of random memory requests. Uniformly
distributed random access patterns that do not fit in the DRAM show as
much as 6.74x reduction in page traffic and a 1.21x increase in total memory
hits.
This thesis provides an improved baseline for page promotion algorithms in
multi-tiered or hybrid memory systems. Following is a list of possible future
directions of this work.
• Tuning the hyperparameters of the LSTM network, revisiting the fea-
ture lists, and enabling online training could help improve the coverage
and performance of the LSTM network.
• A study to understand the effects of multiple parallel and concurrent
processes with varying non-deterministic memory access patterns on
the coverage of the LSTM can help gain better insight on how the
network can be improved.
• The placement of the network in the pipeline affects coverage drasti-
cally. If the network is closer to the core, there will be more accesses
that it can learn from at the expense of additional computation. On
the other hand, placing it near the memory reduces the accesses that
it can learn from but also reduces the computation required.
• Another direction to explore is whether an accelerator is necessary for
performing an inference. If the network is small and lightweight, the
inferences can be done on embedded cores inside the SSD and the accelerator
can be eliminated. Statistical and learning techniques other than LSTMs,
such as time-delay neural networks (TDNNs) and autoregressive integrated
moving average (ARIMA) models, and clustering techniques like KNNs, can be
studied to give better inputs to the networks.
• It would be essential to extend the simulation environment developed
and used for this thesis to support very large real-world application
tracing and simulation in a reasonable simulation timeframe and es-
tablish the benefits of this thesis in very large hybrid memory systems.
• It would be interesting to study how the two proposed systems for
pattern recognition in memory accesses can be combined with the page
promotion mechanism to maximize the number of memory requests
served by the highest (or fastest) level of the memory hierarchy.
• If the overhead of a neural network pattern recognizer is high, the
proposed page promotion system can be augmented with a simpler
random versus sequential access classifier that can aid in enhancing the
current algorithm to be more adaptive to the most recent short-term
memory access pattern during application runtime.
• A much larger SSD page cache table can be implemented by adopting
a multi-level hierarchical approach to significantly increase the size of
the SSD page cache, increase the total hits and reduce the page traffic
for various applications in the system.
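The simpler random versus sequential access classifier mentioned in the list
above could take many forms. The following is one hypothetical sketch, not
a design proposed or evaluated in this thesis: the function name, the
stride-based rule, and the default values of max_stride and threshold are
all assumptions chosen for illustration.

```python
def classify_window(pages, max_stride=1, threshold=0.5):
    """Label a window of page numbers as "sequential" or "random".

    A step counts as sequential when consecutive page numbers differ by at
    most `max_stride`. The window is labeled "sequential" when at least
    `threshold` of its steps are sequential. Both parameters are assumed.
    """
    if len(pages) < 2:
        raise ValueError("need at least two accesses to classify a window")
    steps = list(zip(pages, pages[1:]))
    sequential = sum(1 for prev, cur in steps if abs(cur - prev) <= max_stride)
    if sequential / len(steps) >= threshold:
        return "sequential"
    return "random"
```

Applied to a sliding window of the most recent accesses, such a label could
be used to adjust the promotion algorithm at runtime; how it would be used
is left open here.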
REFERENCES
[1] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications
of the obvious,” SIGARCH Comput. Archit. News, vol. 23, no. 1, pp.
20–24, Mar. 1995.
[2] D. Bae, I. Jo, Y. A. Choi, J. Hwang, S. Cho, D. Lee, and J. Jeong, “2B-
SSD: The case for dual, byte- and block-addressable solid-state drives,”
in 2018 ACM/IEEE 45th Annual International Symposium on Computer
Architecture (ISCA), 2018, pp. 425–438.
[3] M. Hashemi, K. Swersky, J. Smith, G. Ayers, H. Litz, J. Chang,
C. Kozyrakis, and P. Ranganathan, “Learning memory access pat-
terns,” in Proceedings of the 35th International Conference on Ma-
chine Learning, ser. Proceedings of Machine Learning Research, J. Dy
and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm Sweden:
PMLR, 10–15 Jul 2018, pp. 1919–1928.
[4] S. Palacharla and R. E. Kessler, “Evaluating stream buffers as a sec-
ondary cache replacement,” SIGARCH Comput. Archit. News, vol. 22,
no. 2, pp. 24–33, Apr. 1994.
[5] D. Joseph and D. Grunwald, “Prefetching using Markov predictors,”
IEEE Transactions on Computers, vol. 48, no. 2, pp. 121–133, 1999.
[6] K. J. Nesbit and J. E. Smith, “Data cache prefetching using a global
history buffer,” in 10th International Symposium on High Performance
Computer Architecture (HPCA’04), 2004, pp. 96–96.
[7] P. Michaud, “Best-offset hardware prefetching,” in 2016 IEEE In-
ternational Symposium on High Performance Computer Architecture
(HPCA), 2016, pp. 469–480.
[8] Y. Ishii, M. Inaba, and K. Hiraki, “Access map pattern matching for
data cache prefetch,” in Proceedings of the 23rd International Conference
on Supercomputing, ser. ICS ’09. New York, NY, USA: Association for
Computing Machinery, 2009, pp. 499–500.
[9] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos,
“Spatial memory streaming,” in 33rd International Symposium on Com-
puter Architecture (ISCA’06), 2006, pp. 252–263.
[10] M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H.
Pugsley, and Z. Chishti, “Efficiently prefetching complex address pat-
terns,” in 2015 48th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), 2015, pp. 141–152.
[11] X. Yu, C. J. Hughes, N. Satish, and S. Devadas, “IMP: Indirect memory
prefetcher,” in 2015 48th Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), 2015, pp. 178–190.
[12] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[13] A. Abulila, V. S. Mailthody, Z. Qureshi, J. Huang, N. S. Kim,
J. Xiong, and W.-m. Hwu, “FlatFlash: Exploiting the byte-accessibility
of SSDs within a unified memory-storage hierarchy,” in Proceedings of
the Twenty-Fourth International Conference on Architectural Support
for Programming Languages and Operating Systems, ser. ASPLOS ’19.
New York, NY, USA: Association for Computing Machinery, 2019, pp.
971–985.
[14] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wal-
lace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized pro-
gram analysis tools with dynamic instrumentation,” in Proceedings of
the 2005 ACM SIGPLAN Conference on Programming Language De-
sign and Implementation, ser. PLDI ’05. Association for Computing
Machinery, 2005, pp. 190–200.
[15] N. Nethercote and J. Seward, “Valgrind: A framework for heavyweight
dynamic binary instrumentation,” in Proceedings of the 28th ACM SIGPLAN
Conference on Programming Language Design and Implementation, ser. PLDI ’07.
Association for Computing Machinery, 2007, pp. 89–100.
[16] A. Samajdar, Y. Zhu, P. N. Whatmough, M. Mattina, and T. Krishna,
“SCALE-Sim: Systolic CNN accelerator,” CoRR, vol. abs/1811.02883,
2018.
[17] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lu-
cas, R. Rabenseifner, and D. Takahashi, “The HPC Challenge (HPCC)
benchmark suite,” in Proceedings of the 2006 ACM/IEEE Conference
on Supercomputing, ser. SC ’06. Association for Computing Machin-
ery, 2006, p. 213–es.
[18] W. Cheong, C. Yoon, S. Woo, K. Han, D. Kim, C. Lee, Y. Choi, S. Kim,
D. Kang, G. Yu, J. Kim, J. Park, K. Song, K. Park, S. Cho, H. Oh,
D. D. G. Lee, J. Choi, and J. Jeong, “A flash memory controller for 15µs
ultra-low-latency SSD using high-speed 3D NAND flash with 3µs read
time,” in 2018 IEEE International Solid-State Circuits Conference
(ISSCC), 2018, pp. 338–340.
[19] A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi
processor,” in 2015 IEEE Hot Chips 27 Symposium (HCS), 2015, pp.
1–24.
[20] C. Pearson, M. Almasri, O. Anjum, V. S. Mailthody, Z. Qureshi,
R. Nagi, J. Xiong, and W. Hwu, “Update on triangle counting on
GPU,” in 2019 IEEE High Performance Extreme Computing Confer-
ence (HPEC), 2019, pp. 1–7.
[21] A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale graph
computation on just a PC,” in Proceedings of the 10th USENIX Confer-
ence on Operating Systems Design and Implementation, ser. OSDI’12.
USA: USENIX Association, 2012, pp. 31–46.
[22] M. Annavaram, J. M. Patel, and E. S. Davidson, “Data prefetching by
dependence graph precomputation,” in Proceedings 28th Annual Inter-
national Symposium on Computer Architecture, 2001, pp. 52–61.
[23] M. Hashemi, O. Mutlu, and Y. N. Patt, “Continuous Runahead: Trans-
parent hardware acceleration for memory intensive workloads,” in The
49th Annual IEEE/ACM International Symposium on Microarchitec-
ture, ser. MICRO-49. IEEE Press, 2016.
[24] C. Jung, D. Lim, J. Lee, and Y. Solihin, “Helper thread prefetching
for loosely-coupled multiprocessor systems,” in Proceedings 20th IEEE
International Parallel Distributed Processing Symposium, 2006.
[25] D. Callahan, K. Kennedy, and A. Porterfield, “Software prefetching,”
in Proceedings of the Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, ser. ASP-
LOS IV. New York, NY, USA: Association for Computing Machinery,
1991, pp. 40–52.
[26] H. Al-Sukhni, I. Bratt, and D. A. Connors, “Compiler-directed content-
aware prefetching for dynamic data structures,” in 2003 12th Interna-
tional Conference on Parallel Architectures and Compilation Techniques,
2003, pp. 91–100.
[27] W. Y. Chen, S. A. Mahlke, P. P. Chang, and W.-m. W. Hwu, “Data ac-
cess microarchitectures for superscalar processors with compiler-assisted
data prefetching,” in Proceedings of the 24th Annual International Sym-
posium on Microarchitecture, ser. MICRO 24. New York, NY, USA:
Association for Computing Machinery, 1991, pp. 69–73.
[28] E. H. Gornish, E. D. Granston, and A. V. Veidenbaum, “Compiler-
directed data prefetching in multiprocessors with memory hierarchies,”
in Proceedings of the 4th International Conference on Supercomputing,
ser. ICS ’90. New York, NY, USA: Association for Computing Machinery,
1990, pp. 354–368.
[29] Y. Zeng and X. Guo, “Long short-term memory based hardware
prefetcher: A case study,” in Proceedings of the International Sympo-
sium on Memory Systems, ser. MEMSYS ’17. New York, NY, USA:
Association for Computing Machinery, 2017, pp. 305–311.
[30] G. Ayers, H. Litz, C. Kozyrakis, and P. Ranganathan, “Classifying
memory access patterns for prefetching,” in Proceedings of the Twenty-
Fifth International Conference on Architectural Support for Program-
ming Languages and Operating Systems, ser. ASPLOS ’20. New York,
NY, USA: Association for Computing Machinery, 2020, pp. 513–526.
[31] “NVLink, Pascal and Stacked memory: Feeding
the appetite for big data,” NVIDIA Corporation,
Mar. 2014. [Online]. Available: http://devblogs.nvidia.com/
nvlink-pascal-stacked-memory-feeding-appetite-big-data/
[32] Hybrid Memory Cube – HMC Gen2, Micron Technol-
ogy, Inc. Std., Rev. H, Feb. 2018. [Online]. Avail-
able: http://www.micron.com/-/media/client/global/documents/
products/data-sheet/hmc/gen2/hmc_gen2.pdf
[33] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Ra-
jendran, M. Asheghi, and K. E. Goodson, “Phase change memory,”
Proceedings of the IEEE, vol. 98, no. 12, pp. 2201–2227, 2010.
[34] High Bandwidth Memory (HBM) DRAM, JEDEC Std., Jan. 2020. [On-
line]. Available: https://www.jedec.org/standards-documents/docs/
jesd235a
[35] “Unified Memory in CUDA 6,” NVIDIA Corporation,
Nov. 2013. [Online]. Available: http://devblogs.nvidia.com/
unified-memory-in-cuda-6/
[36] Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, “Nimble page
management for tiered memory systems,” in Proceedings of the Twenty-
Fourth International Conference on Architectural Support for Program-
ming Languages and Operating Systems, ser. ASPLOS ’19. New York,
NY, USA: Association for Computing Machinery, 2019, pp. 331–345.
[37] N. Megiddo and D. S. Modha, “ARC: A self-tuning, low overhead re-
placement cache,” in Proceedings of the 2nd USENIX Conference on File
and Storage Technologies, ser. FAST ’03. USA: USENIX Association,
2003, pp. 115–130.
[38] E. J. O’Neil, P. E. O’Neil, and G. Weikum, “The LRU-K page replace-
ment algorithm for database disk buffering,” SIGMOD Rec., vol. 22,
no. 2, pp. 297–306, June 1993.
[39] T. Johnson and D. Shasha, “2Q: A low overhead high performance buffer
management replacement algorithm,” in Proceedings of the 20th International
Conference on Very Large Data Bases, ser. VLDB ’94. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc., 1994, pp. 439–450.
[40] M. R. Meswani, S. Blagodurov, D. Roberts, J. Slice, M. Ignatowski, and
G. H. Loh, “Heterogeneous memory architectures: A HW/SW approach
for mixing die-stacked and off-package memories,” in 21st IEEE Interna-
tional Symposium on High Performance Computer Architecture, HPCA
2015, Burlingame, CA, USA, February 7-11, 2015. IEEE Computer
Society, 2015, pp. 126–136.
[41] L. E. Ramos, E. Gorbatov, and R. Bianchini, “Page placement in hybrid
memory systems,” in Proceedings of the International Conference on
Supercomputing, ser. ICS ’11. New York, NY, USA: Association for
Computing Machinery, 2011, pp. 85–95.
[42] M. M. Tikir and J. K. Hollingsworth, “Hardware monitors for dynamic
page migration,” J. Parallel Distrib. Comput., vol. 68, no. 9,
pp. 1186–1200, Sep. 2008.
[43] H. Wang, J. Zhang, S. Shridhar, G. Park, M. Jung, and N. S. Kim,
“DUANG: Fast and lightweight page migration in asymmetric memory
systems,” in 2016 IEEE International Symposium on High Performance
Computer Architecture (HPCA), 2016, pp. 481–493.
[44] J. H. Ryoo, L. K. John, and A. Basu, “A case for granularity aware
page migration,” in Proceedings of the 2018 International Conference
on Supercomputing, ser. ICS ’18. New York, NY, USA: Association for
Computing Machinery, 2018, pp. 352–362.
