Page management in hybrid memory systems by Kokolis, Apostolos


























Submitted in partial fulfillment of the requirements 
for the degree of Master of Science in Computer Science 
in the Graduate College of the  










 Professor Josep Torrellas 
ABSTRACT
Recent byte-addressable Non-Volatile Memory (NVM) technologies enable hybrid memory
systems comprising both DRAM and NVM. Such systems have the potential to address the
capacity requirements of data-intensive workloads and achieve high performance. The main
challenge lies in dynamically managing data placement between DRAM and NVM in a flat
address space configuration, where the Operating System can allocate pages to either of
the two memories.
Prior work in this area has proposed software and architectural techniques that can
dynamically swap memory pages between the two memories. However, because swaps are
expensive, schemes that initiate swaps based solely on hardware counters, or that rely on
software methods, make conservative decisions that limit the potential performance gains.
In this work, we introduce Prefetching, a novel hybrid memory management scheme that
exploits page correlation to identify forthcoming memory accesses and commence swaps
ahead of time. The page management techniques proposed by Prefetching effectively hide
the overhead of data movement between memories. We evaluate our design with simulations
across 17 benchmarks from three different benchmark suites. Thanks to Prefetching’s highly
accurate swaps, we improve performance by up to 20% and reduce average main memory
access time by up to 33% when compared to the prior state of the art.
To my family, for their love and support.
ACKNOWLEDGMENTS
I wish to express my sincere thanks to Professor Josep Torrellas for giving me the
opportunity to carry out my MSc thesis under his supervision. His guidance, along with his
invaluable research experience, provided me with the motivation and inspiration to
accomplish this work.
In addition, I would like to thank the members of the i-acoma group, and especially Dimitris
and Bhargava, for their help and support in this project.
I give my special thanks to Kiriaki Kokoli and my parents, Nikolaos Kokolis and Ioanna
Zachari, who always encouraged and assisted me to achieve my goals.
Finally, I would like to thank all my friends who supported me all these years.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND AND MOTIVATION
2.1 Hybrid Memory Systems
2.2 Existing Hardware-Based Memory Management Techniques in Flat-Memory Systems
2.3 Correlation Prefetching
2.4 Motivation of Our Proposal
CHAPTER 3 DESIGN
3.1 Main Idea
3.2 Hybrid Memory Controller
3.3 Complete HMC operation
CHAPTER 4 EXPERIMENTAL METHODOLOGY
4.1 Simulator Infrastructure
4.2 Configurations
4.3 Workloads
4.4 Area Overhead Comparison
CHAPTER 5 EVALUATION
5.1 Prefetch Characterization
5.2 Performance Analysis
5.3 Energy Analysis
CHAPTER 6 DISCUSSION POINTS AND DIRECTIONS FOR FUTURE WORK IN HYBRID MEMORY SYSTEMS
CHAPTER 7 CONCLUSION
REFERENCES
CHAPTER 1: INTRODUCTION
The demand for high-capacity, high-bandwidth and low-energy memory systems is steadily
increasing. Data-intensive applications are widely used, and conventional systems cannot
satisfy their memory needs. Main memory has relied for decades on DRAM, due to its
relatively low latency and its low dynamic power. However, DRAM scaling is becoming
challenging due to the increasing leakage power and the manufacturing difficulties it faces
[1]. Thus, new solutions are required to sustain high performance in future memory systems.
Recent Non-Volatile Memory (NVM) such as PCM [2, 3] and STT-RAM [4] show promise.
They are denser than DRAM, can be produced at smaller feature sizes [2, 3] and have
already been announced for production [5]. Nevertheless, NVMs experience higher read and
write latency, higher dynamic power, and typically have limited write endurance. For these
reasons, the trend is to incorporate both DRAM and NVM in a hybrid memory system that,
ideally, offers the low latency of DRAM and the extra capacity of NVM.
Hybrid memory systems can use DRAM as a cache for the NVM [6] or have a flat address
space configuration with both memories [7]. The latter case is better for capacity-critical
applications. In a flat address space, we have to decide where to place data to achieve the
maximum performance. Static page allocation by the OS is unlikely to deliver acceptable
performance. Hence, we need to support dynamic movement of pages between the two
memories (i.e., swaps).
Hardware-managed swap techniques can adapt better to the changes of application behav-
ior than software techniques [7, 8, 9]. Yet, swaps are costly. Additionally, we need auxiliary
structures to support activity tracking and bookkeeping of remaps between the memories.
The efficient management of this meta-data is crucial for performance.
Hardware schemes have considered different mechanisms for monitoring activity in an
attempt to identify heavily accessed memory segments and swap them to the faster memory.
However, these mechanisms have not considered the need to start swapping before the mem-
ory accesses reach main memory, so that the swap overhead is hidden. They have also failed
to consider the fact that the auxiliary structures needed for swapping can easily become a
bottleneck.
In this work, we devise a hardware mechanism that can effectively manage a hybrid
memory system. We identify or predict future page accesses early, and start the swapping
of pages between DRAM and NVM ahead of time. As a consequence, the swap overhead is
hidden and more memory requests target the fast memory. Moreover, the early recognition
of future accesses is used to prepare the auxiliary data structures that keep meta-data for
our scheme and reduce their performance cost.
Our scheme is based on two ideas. First, a Page Correlation Table (PCT) identifies
a page that will soon be accessed frequently and, potentially, its immediate follower page
that will also be accessed frequently. Both pages then induce a prefetch-triggered swap to
move them to DRAM. Second, a Hot Page Table (HPT) identifies pages that are generally
hot and need to be swapped to DRAM or remain in the DRAM.
We assess our design using simulations across a wide variety of workloads. Our experi-
ments verify that page accesses have a repeatable pattern and that prefetching pages can
increase performance. We tested our scheme across 17 different workloads from three
different benchmark suites and we compared it to multiple current schemes (e.g., [7, 8]). We
found that our scheme is 14% faster than the second best configuration, and 20% faster than
a baseline system without page swapping.
CHAPTER 2: BACKGROUND AND MOTIVATION
2.1 HYBRID MEMORY SYSTEMS
The evolution of NVM technology, such as Phase Change Memory [2] and 3D XPoint
[5], and their promising results have steered the attention of the research community and
industry to their direction. NVM memories have high bit density, low static power, and
good scalability with feature size. Hence, they appear as a viable solution to the increasing
memory demands of future systems. However, NVMs experience higher read and write
latency when compared to DRAM. So, they cannot replace DRAM entirely without critical
performance loss. Therefore, the combination of DRAM and NVM has been proposed as
a method to efficiently increase system capacity, performance and reliability [6]. A system
that integrates memories with different performance and power characteristics is typically
called a hybrid memory system.
There are two different configurations for a hybrid memory system. It can be configured
either as a hardware managed DRAM cache [10, 11, 12, 13, 14, 15], where the faster and
smaller DRAM serves as a cache for the slower and larger NVM, or it can be configured as a
flat address space, where the OS is aware of both memories for page allocation [7, 8, 16, 9, 17,
18]. In the first case, there is the advantage that it can be easily deployed and is transparent
to the OS, with DRAM acting as an additional level of caching between LLC and main
memory. However, it faces the challenge of efficiently storing and accessing a large number
of tags [11, 19]. Moreover, the overall capacity of the system decreases by a non-negligible
value, as long as the sizes of DRAM and NVM are comparable. Additionally, the overall
system bandwidth decreases as each memory request has to go to DRAM before accessing
NVM and cannot take advantage of the aggregate memory bandwidth [20]. Published work
in the area has shown performance improvement for latency critical applications [10, 13].
However, capacity limited applications do not benefit as much [16]. Also, the overall memory
bandwidth is limited, since we cannot take advantage of the combined bandwidth of the two
memories.
The flat address space configuration has the advantages that it provides higher aggregated
bandwidth and memory capacity, and that there is no need for tag storage. However,
there are the challenges of deciding data placement and of swapping data between the two
memories.
Data swaps between the two memories can be done either in software [21, 18, 22] or
hardware [7, 8, 16, 17, 9]. In both cases, we must identify data that are currently “hot” (i.e.,
accessed frequently) but reside in the slow memory, and swap them with data that are “cold”
(i.e., accessed infrequently) but reside in the fast memory. In a software managed approach,
the OS interrupts the processor, swaps the pages, performs a TLB shootdown to update
stale TLB entries, and continues execution. This procedure can take several microseconds
[23] and constrains swaps to a coarser time granularity.
When data swaps are hardware-managed, they can happen at finer time granularity. Still,
there are several challenges in this method. The first challenge concerns the consistency
between the OS’s view of memory and the data movements that have been performed. Since
the OS is not aware of any remapping that has happened to memory, we need to find a way
to keep track of the re-mappings that have happened. The second challenge is that we need
dedicated hardware to decide and perform swaps between memory segments of fast and slow
memory. The third challenge is that we need to track memory activity and trigger a swap
accurately and promptly, to tolerate the swap cost.
2.2 EXISTING HARDWARE-BASED MEMORY MANAGEMENT TECHNIQUES IN
FLAT-MEMORY SYSTEMS
There is prior work that has investigated HW-only techniques for managing hybrid flat
memory systems. The common scenario for these schemes is that they depend on LLC misses
to track memory activity and determine page swaps. The main differences between them
are the size of the memory segments to swap, and what triggers the swap. One of the very
first papers is CAMEO [16]. CAMEO migrates data at 64B block granularity, and a swap
is triggered on every access to a block in slow memory. It restricts the swap flexibility by
organizing swap groups in a direct-mapped fashion, where multiple blocks of slow memory
map to the same block of fast memory, and only one of them can reside in fast memory at
a time. Also, it uses fast swaps, which means that, within a single swap group, a memory
segment can reside anywhere. While CAMEO keeps the swap bandwidth requirements low
and is easy to implement, the small swap granularity requires high meta-data storage and
misses the opportunity to take advantage of spatial locality. Moreover, the direct-mapped
structure of the swap groups is vulnerable to many conflict misses.
PoM [7] is similar to CAMEO with the difference that swaps happen at the granularity
of 2KB, and a swap is triggered when the number of accesses to a 2KB memory segment
reaches a threshold. PoM is more adaptive, and the swap threshold can change according to
the program characteristics. Similarly to CAMEO, it uses fast swaps and has direct-mapped
swap groups. SILC-FM [9] optimizes the granularity of swaps that can range from 64B to
2KB, supporting sub-block interleaving between two memory segments and it also relaxes
the direct-mapped swap groups to set associative swap groups. The latter decision comes
at the cost of slow swaps, which means that within a swap group, swapping a segment may
cause the swapping of a second segment.
Mempod [8] further relaxes the swap flexibility by making it fully-associative at the cost
of a substantial increase in the metadata overhead. Mempod is using the Majority Element
Algorithm [24] to identify memory segments that are to be accessed in the future and mi-
grates them at 2KB granularity after predefined time intervals. Other hardware schemes
are targeting different aspects of hybrid memory systems. BATMAN [20] tries to optimize
swaps so that the overall memory bandwidth utilization is maximized, while ProFess [17]
proposes a cost-benefit mechanism that decides swaps considering fairness between different
programs that compete for space in fast memory.
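To give some intuition for counter-based hot-segment detection, the sketch below implements a Misra-Gries style frequent-element summary, which generalizes the classic majority element algorithm to `k` counters. It is an illustration of the counting idea only: the function name `mea_hot_segments`, the value of `k`, and the access trace are hypothetical, and the hardware implementation in Mempod may differ.

```python
def mea_hot_segments(accesses, k):
    """Track at most k candidate hot segments with Misra-Gries counters."""
    counters = {}
    for seg in accesses:
        if seg in counters:
            counters[seg] += 1
        elif len(counters) < k:
            counters[seg] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical trace of segment numbers observed as LLC misses.
trace = [7, 7, 3, 7, 9, 7, 3, 7, 5, 7]
candidates = mea_hot_segments(trace, k=2)
```

With this trace, segment 7 is the only candidate that survives, since the infrequent segments are repeatedly cancelled out by the decrement step.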
There are some designs with other features. For instance, SILC-FM [9] suggests a method
to swap only a portion of a page. This method can be adopted to save memory bandwidth.
A bitmap for a page can tell us which cache lines from a page are worth swapping and avoid
moving 4KB of data. Mempod [8] mentions a clustered architecture that groups together
memory controllers to be more scalable.
Also, software schemes for hybrid memory management are an active area of research, and
people have proposed hardware/software techniques for effective page placement [25, 26, 22,
27]. These techniques require either annotations to the applications and compiler support
to identify “hot” data structures at compile time or the OS involvement to track memory
activity and swap pages. The role of hardware is limited to work in conditions where the
adaptation to dynamic program behavior is needed. OS involvement is also proposed as an
optimization [26, 28] to periodically update the page table entries and relieve some pressure
from the hardware remapping tables. The remapping table entries can be freed if the OS
becomes aware of the swaps that happened in the memory.
2.3 CORRELATION PREFETCHING
Perhaps the best-known mechanism for hiding memory latency is prefetching. Prefetching
is widely used in caches, where multiple prefetchers try to fetch cache lines that
are likely to be used in the future. These techniques can substantially improve the
performance of the system. Prior work has examined the use of prefetching in hybrid memory
systems [21, 29, 30] to prefetch pages from the slow to the fast memory. In our work we
examine correlation prefetching. Correlation prefetching has been shown to work efficiently
in caches [31, 32]. The intuition is that a program often accesses a set of
pages at some point and then accesses these pages in the same or similar order in the future.
Keeping the required information about page correlation can help us swap prospective pages
before we see any requests for these pages. The difference here is that our mechanism has to
identify the page patterns using only LLC misses, coming from shared caches in a possible
multi-program environment. Also, the page access patterns might change during program
execution. Thus, we need an adaptive page correlation mechanism able to identify access
patterns across different workloads and programs.
The implementation of page correlation prefetching is similar to that of a cache. When
a page is accessed (e.g., page A), we create a seed with the page number of A. If another
page B is accessed after page A, we name page B the follower of page A. If page B receives
a substantial number of requests after page A was accessed, we assume that pages A and B
are correlated in time, and that accesses to page A will probably be followed by accesses to
page B.
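The seed and follower relationship can be illustrated with a small software model. This is a minimal sketch under our own assumptions: the trace is a list of page identifiers, a page becomes the seed when the accessed page changes, and the names `follower_counts` and `correlated_follower` and the threshold value are hypothetical.

```python
from collections import defaultdict


def follower_counts(trace):
    """For each seed page, count accesses to the page that directly follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    seed, current = None, None
    for page in trace:
        if page != current:
            seed, current = current, page   # the previous page becomes the seed
        if seed is not None:
            counts[seed][current] += 1      # one more access to `current` after `seed`
    return counts


def correlated_follower(counts, seed, threshold):
    """Return the follower of `seed` if it received at least `threshold` accesses."""
    if seed not in counts:
        return None
    follower, n = max(counts[seed].items(), key=lambda kv: kv[1])
    return follower if n >= threshold else None


trace = ["A", "A", "B", "B", "B", "A", "B", "B", "C"]
counts = follower_counts(trace)
follower_of_a = correlated_follower(counts, "A", threshold=4)
```

In this trace, page B receives five accesses after page A, so B is reported as the correlated follower of A, while page B itself has no follower that passes the threshold.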
2.4 MOTIVATION OF OUR PROPOSAL
Previous techniques to manage hybrid memory systems have focused on ways to identify
the hotness of pages. However, little emphasis has been given to the fact that identifying a
page as active after you see accesses to the page might be too late.
This paper focuses on an effective way to trigger and handle a swap as soon as possible,
so that we can redirect as many requests as we desire to DRAM. We employ a prefetching
mechanism that captures the page access behavior of different programs and swaps pages
in an efficient manner, ahead of time, while at the same time we do not block the memory
channels during a swap. We propose hardware structures that enable swaps transparent
to the OS, and minimize the swap overhead. Our work is orthogonal to several previous
techniques [20, 9, 17], which can be used together with our architecture. For this reason
we choose to compare our work against PoM and Mempod, which use different triggers
for swaps. Also, we came up with two other configurations that have different or no swap
triggers to assess our technique.
We perform experiments comparing all different configurations in terms of application




CHAPTER 3: DESIGN
3.1 MAIN IDEA
In this section, we present the main idea of Prefetching, our proposed mechanism to
manage a hybrid memory system. The goal of Prefetching is to hide the overhead of swapping
pages between DRAM and NVM by predicting future memory accesses, and prefetching
pages from NVM to DRAM. We use page correlation prefetching to identify a page that
will soon be accessed frequently and its immediate follower page. Both pages may cause a
prefetch-triggered swap to move them to DRAM. The goal is to move the pages before many
requests to these pages arrive.
In addition, the HMC has a table that identifies pages that are generally hot and need to
migrate to or remain in DRAM. This table can also trigger swaps.
The rest of the section is organized as follows. First, we overview the HMC, and describe
the hardware structures that are present in the HMC. Next, we analyze the process of
tracking memory activity, and how we trigger swaps. Finally, we present the overall HMC
operation.
3.2 HYBRID MEMORY CONTROLLER
Figure 3.1 shows the position of the HMC. The HMC is between the Host’s Last Level
Cache (LLC) and main memory. It receives memory requests and directs them to the correct
memory module. The HMC is enhanced with hardware structures that allow it to track
memory activity, store the current re-mappings between DRAM and NVM, and swap pages
between the two memories. The exact parameters for the HMC structures
are shown later in Section 4. Figure 3.1 presents the architecture with prefetching and
no further interaction with the host. The components that we added or modified in our
architecture are indicated with a different color.
3.2.1 Page Remapping Table
Our scheme swaps pages transparently to the OS. Thus, every memory request that
reaches the HMC should be tested to find out whether or not the page address has changed



































Figure 3.1: Overview of the design.
As one can imagine, the meta-data required to keep this information for every page in
the system is substantial. What is more, the PRT should be accessed upon every memory
request and is the only structure that lies on the critical path. As a result, the access
time of the PRT should be kept to a minimum. This is why instead of having all the PRT
entries on the HMC, we have a cache (PRTc) that holds some of the PRT entries. The rest
of the entries are stored in DRAM, similarly to previous designs [7, 8, 9, 17]. The main
disadvantage of this approach is that the cost of a PRTc miss is high. When we miss in the
PRTc, we have to fetch the PRT entry from DRAM. We need to send a DRAM memory
request and wait until this request is satisfied, which is a costly operation. Therefore, an
efficient PRTc should be able to hold as many entries as possible, have a high hit rate and
know exactly which entry to fetch on a PRTc miss.
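The interaction between the PRTc and the full PRT in DRAM can be modeled in software as a small cache in front of a backing table. The sketch below is only an approximation under stated assumptions: the LRU replacement policy, the class name `PRTCache`, and the flat page-to-page mapping are ours for illustration, not details taken from the design.

```python
from collections import OrderedDict


class PRTCache:
    """Small LRU cache of remapping entries in front of a full PRT kept in DRAM."""

    def __init__(self, capacity, backing_prt):
        self.capacity = capacity          # number of entries held on chip (>= 1)
        self.backing = backing_prt        # full PRT, assumed to live in DRAM
        self.entries = OrderedDict()      # page number -> current location
        self.hits = 0
        self.misses = 0

    def lookup(self, ppn):
        if ppn in self.entries:
            self.hits += 1
            self.entries.move_to_end(ppn)             # mark as most recently used
        else:
            self.misses += 1                          # costly: extra DRAM access
            self.entries[ppn] = self.backing.get(ppn, ppn)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)      # evict least recently used
        return self.entries[ppn]


# Hypothetical state: pages 10 and 2 have been swapped with each other.
prt = {10: 2, 2: 10}
cache = PRTCache(capacity=2, backing_prt=prt)
locations = [cache.lookup(p) for p in (10, 11, 10, 2, 11)]
```

Every miss in this model stands for an additional DRAM request before the original request can be redirected, which is why the hit rate matters so much.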
To know where to look for a possible remapping, we constrain the swap flexibility between
DRAM and NVM pages as shown in Figure 3.2. All the NVM pages with a given pattern
can be swapped with all the DRAM pages with the same pattern, and pairs of swapped















Figure 3.2: Page Remapping Table (PRT) design.
With this PRT design, an NVM page can only be swapped with DRAM pages that map
to the same PRT set. Each PRT entry in a set has a DRAM PPN and an NVM PPN. The
entry denotes a swap between these two pages. Since our PRT is set associative, an NVM
page has more than one DRAM candidate for swapping. With this design, we can avoid
multiple conflict misses within a set. Also, it is possible to have many NVM pages from the
same swap group in DRAM at the same time.
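A set-associative PRT of this kind can be sketched as follows. The set index function (page number modulo the number of sets), the class name `PRT`, and the example page numbers are assumptions made for illustration; the text only requires that swapped pages map to the same PRT set.

```python
class PRT:
    """Set-associative remapping table; each entry pairs a DRAM PPN with an NVM PPN."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]   # entries: (dram_ppn, nvm_ppn)

    def set_index(self, ppn):
        return ppn % self.num_sets                  # assumed mapping function

    def record_swap(self, dram_ppn, nvm_ppn):
        idx = self.set_index(nvm_ppn)
        if self.set_index(dram_ppn) != idx:
            raise ValueError("pages map to different PRT sets")
        if len(self.sets[idx]) == self.ways:
            return False                            # set full: a way must be emptied first
        self.sets[idx].append((dram_ppn, nvm_ppn))
        return True

    def translate(self, ppn):
        """Return the location that currently holds the data of page `ppn`."""
        for dram_ppn, nvm_ppn in self.sets[self.set_index(ppn)]:
            if ppn == dram_ppn:
                return nvm_ppn
            if ppn == nvm_ppn:
                return dram_ppn
        return ppn                                  # not remapped


prt = PRT(num_sets=4, ways=2)
first = prt.record_swap(dram_ppn=0, nvm_ppn=8)
second = prt.record_swap(dram_ppn=4, nvm_ppn=12)
third = prt.record_swap(dram_ppn=16, nvm_ppn=20)    # same set, but both ways are taken
```

The third call returns `False` because the set already holds two swaps; in this model, that is the situation in which one way of the set has to be freed before a new swap can be recorded.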
The next goal is to keep the size of each PRT entry small, and make the PRT scalable to
different DRAM-NVM ratios. To achieve this, we choose to keep slow swaps in addition to
fast swaps, as we will see. With this decision, we introduce more traffic during a swap, but
the size of each PRT entry is smaller. As a result, the PRTc hit rate is expected to be high.
Moreover, a slow swap can be accelerated using the swap buffers.
Consider the scenario presented in Figure 3.3. Assume that Pages 1 and 2 have been
swapped, and all the ways of the PRT set to which this page color maps are occupied. Now,
Page 3 wants to migrate from NVM to DRAM, and maps to the same set as 1 and 2. Because
the PRT set is full, we need to empty one way. The result is a slow swap. Specifically, we
need to read pages 1 and 2 and write them back (1 to DRAM and 2 to NVM); then we need
to swap 3 and the new page that will be in 2. The process proceeds as follows. In Step 1,
we read Pages 1 and 2 into swap buffers. Then, we write back the page that used to be in
2 into 1. Then, in Step 2, we read page 3 into a swap buffer. Finally, we write back the
buffers to locations 3 and 2. We call this a “half-slow” swap operation.
[Figure 3.3 shows NVM, DRAM, and the swap buffers in two panels: Step 1 and Step 2.]
Figure 3.3: Half-slow swap operation.
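The two steps of the half-slow swap can be traced with a small model of page locations. This sketch reflects one reading of the steps, under our assumptions: locations 1, 2 and 3 are modeled as dictionary keys, and in the final write the buffer read from location 1 goes to location 3 while the buffer read from location 3 goes to location 2. The function name and the page labels are hypothetical.

```python
def half_slow_swap(mem, loc1, loc2, loc3):
    """Apply the two steps to `mem`, a dict mapping location -> page contents."""
    # Step 1: read locations 1 and 2 into swap buffers,
    # then write what used to be in location 2 into location 1.
    buf1, buf2 = mem[loc1], mem[loc2]
    mem[loc1] = buf2
    # Step 2: read location 3 into a swap buffer,
    # then write the buffers back to locations 3 and 2.
    buf3 = mem[loc3]
    mem[loc3] = buf1
    mem[loc2] = buf3
    return mem


# Hypothetical contents: pages A and B were swapped earlier, C wants to migrate.
memory = {1: "B", 2: "A", 3: "C"}
half_slow_swap(memory, 1, 2, 3)
```

In this model the operation needs three page reads and three page writes. Afterwards, the page that was in location 2 is in location 1, the migrating page is in location 2, and the page that was in location 1 is in location 3.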
3.2.2 Activity Tracking & Swap Triggering
The HMC tracks memory activity by monitoring the LLC misses. There are two different
structures in the HMC that can trigger a swap, namely the PCTc (Page Correlation
Table cache) and the HPT (Hot Page Table).
The role of the Page Correlation Table (PCT) and its associated cache (PCTc) is to keep
information about the number of accesses that target a page. The design of a PCTc entry
is shown in Figure 3.4. When a page is accessed multiple times, we count the number of
accesses to this page. This counter is saved and, the next time the page is accessed, the
counter is used to determine whether to swap the page. Specifically, if the count is above a
predefined threshold, we assume that the page is worth moving to DRAM because we expect
that the same number of accesses will repeat.
In addition, the PCT logically links pages according to their access pattern. For instance,
when we see that accesses to Page A are followed by accesses to Page B, we name B as a
follower of A and store this information at the PCT. When some time in the future we see
a new access to page A, we check B’s counter and, if it is above a threshold, we prefetch
B to DRAM, predicting that B will be accessed soon. This threshold is set so that the
swap overhead is justified by the number of accesses we expect to go to the fast memory.
Similarly to the PRT, we cannot have all the PCT entries on chip, so we use a PCT
cache (PCTc).
Figure 3.4: PCTc and Filter structure organization. (PCTc entry fields: PPN, counter, Follower PPN, Follower counter.)
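The prefetch decision made on a PCTc hit can be sketched as follows. The entry fields follow the PCTc entry described above; the function name `pages_to_prefetch`, the use of a single threshold for both counters, the use of the follower counter stored in the entry of page A, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PCTEntry:
    ppn: int
    counter: int
    follower_ppn: Optional[int] = None
    follower_counter: int = 0


def pages_to_prefetch(entry, in_dram, threshold):
    """Return the pages that an access to entry.ppn should move to DRAM."""
    targets = []
    # The page itself: its saved access count suggests it will be hot again.
    if entry.counter > threshold and entry.ppn not in in_dram:
        targets.append(entry.ppn)
    # Its follower: prefetched before any request to it has been seen.
    if (entry.follower_ppn is not None
            and entry.follower_counter > threshold
            and entry.follower_ppn not in in_dram):
        targets.append(entry.follower_ppn)
    return targets


entry = PCTEntry(ppn=100, counter=40, follower_ppn=101, follower_counter=12)
swaps = pages_to_prefetch(entry, in_dram={100}, threshold=10)
```

Here page 100 is already in DRAM, so only its follower, page 101, is selected for a prefetch-triggered swap.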
In addition to the PCT and the PCTc, we have a smaller structure called the Filter. The
purpose of the Filter is to quickly update the information about accesses to a page. The Filter
table entries are larger than the PCTc entries (Figure 3.4). In the Filter table, we record the
accesses to a page as well as the accesses to the page following it in time (Follower PPN).
In order to perform well in a multi-program environment, we also need to store the Process
Context Identifier (ID) so that we do not confuse page accesses from different programs. We
want to record access patterns on a per-program basis.
Further, a Filter entry has two extra fields compared to PCTc entries (New Follower PPN
and its counter). These fields are used to record changes in the access patterns. Specifically,
if we had recorded that Page A was followed by Page B (in Follower PPN), but then we
recognized that Page A is followed by Page C (in New Follower PPN) with a higher count,
we use the Filter entry to update the PCT entry so that Page A is followed by Page C.
In addition, as the name suggests, the Filter table is used to determine if we need to keep
history information for a page or not. The Filter table has a small number of entries to keep
updating the last accessed pages. When a page is removed from the Filter, the sum of all
the access counters in the entry is calculated. If it is below a threshold, then we decide
not to use the Filter information to update the information about this page in the PCT and
save memory bandwidth.
Every time that a page is accessed and that page does not exist in the Filter, we bring the
entry of the page from the PCT to the Filter. At this time, we halve all the counters in the
Filter entry so that we can adjust faster to new phases of the application. In Figure 3.4, we
have denoted in dark shading the fields that keep the history of a page, and in light shading
the fields that are used for page correlation prefetching.
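The life cycle of a Filter entry (load from the PCT with halved counters, then write back on eviction) can be sketched as below. The field names mirror the description above, but the dictionary-based PCT, the function names, the choice to halve the saved follower counter as well, and the threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FilterEntry:
    ppn: int
    pcid: int                               # process context identifier
    counter: int = 0
    follower_ppn: Optional[int] = None
    follower_counter: int = 0
    new_follower_ppn: Optional[int] = None
    new_follower_counter: int = 0


def load_from_pct(ppn, pcid, pct):
    """Bring a page's history into the Filter, halving the saved counters."""
    counter, follower, follower_counter = pct.get((pcid, ppn), (0, None, 0))
    return FilterEntry(ppn, pcid, counter // 2, follower, follower_counter // 2)


def evict_to_pct(entry, pct, threshold):
    """Write an evicted Filter entry back to the PCT if it saw enough accesses."""
    total = entry.counter + entry.follower_counter + entry.new_follower_counter
    if total < threshold:
        return False                        # skip the update and save memory bandwidth
    follower, count = entry.follower_ppn, entry.follower_counter
    if entry.new_follower_counter > count:  # the access pattern has changed
        follower, count = entry.new_follower_ppn, entry.new_follower_counter
    pct[(entry.pcid, entry.ppn)] = (entry.counter, follower, count)
    return True


pct = {(1, 100): (40, 101, 12)}             # hypothetical saved history for page 100
e = load_from_pct(100, 1, pct)
e.counter += 5                              # five new accesses to page 100
e.new_follower_ppn, e.new_follower_counter = 102, 9
updated = evict_to_pct(e, pct, threshold=16)
```

After the eviction, the PCT records page 102 as the follower of page 100, because the new follower accumulated a higher count than the halved count of the old one.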
The second structure that is responsible for triggering swaps is the Hot Page Table (HPT).
There are two small HPT structures, one for the DRAM pages and one for the NVM pages.
They keep track of “hot” pages by counting page accesses. The goal of the DRAM HPT is
to lock pages in DRAM. Specifically, a DRAM page that appears in the DRAM HPT means
that the page is highly accessed and should not be swapped out of DRAM. On the other
hand, the NVM HPT tries to identify pages that are becoming “hot” but for which the PCT has not
triggered a swap. So, when a page in the NVM HPT reaches a swap threshold, the hardware
starts a swap operation.
HPTs are needed for two reasons. First, we use them to find swaps that we missed with
the PCTc. This can happen either because the page is accessed for the first time, or the
counts were too small for the page to make it to the PCT. Second, HPTs are used to avoid
swapping DRAM pages that are frequently used. Each HPT entry has a PPN and a counter
to keep track of accesses. Counters are halved at regular intervals. If they are zero, the page
is removed from the HPT to make room for new pages. Both the HPTs and the PCTc are
off the critical path.
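The counting and decay behavior of an HPT can be sketched as follows. This is a simplified model under our assumptions: the table capacity is not modeled, decay is invoked explicitly instead of by a timer, and the class name and threshold are hypothetical.

```python
class HotPageTable:
    """Access counters per page, halved periodically so that stale pages age out."""

    def __init__(self, swap_threshold):
        self.swap_threshold = swap_threshold
        self.counters = {}                  # PPN -> access count

    def record_access(self, ppn):
        """Count an access; return True once the page reaches the swap threshold."""
        self.counters[ppn] = self.counters.get(ppn, 0) + 1
        return self.counters[ppn] >= self.swap_threshold

    def decay(self):
        """Halve all counters and drop pages whose counter becomes zero."""
        self.counters = {p: c // 2 for p, c in self.counters.items() if c // 2 > 0}


nvm_hpt = HotPageTable(swap_threshold=4)
triggers = [nvm_hpt.record_access(7) for _ in range(4)]
nvm_hpt.record_access(9)
nvm_hpt.decay()
```

In this example the fourth access to page 7 reaches the threshold, and the decay step removes page 9, whose single access is halved to zero.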
3.3 COMPLETE HMC OPERATION
We have described the structures of the HMC and in this section we will explain their
operation. The operation can be divided into two different scenarios: 1) a regular memory
request arrives at the HMC and 2) a request reaches the HMC while a swap is ongoing.
3.3.1 Regular Memory Request Reaches the HMC
Figure 3.5 shows a flowchart of the HMC operation when a simple memory request misses
in the L3 cache and reaches the HMC. In this case, three operations are performed in parallel.
The PRTc is accessed to find if the page is remapped and, if so, the address that this request
should be redirected to. Remember that the access to the PRTc is critical for the application
performance, because every memory request has to first check the PRTc to find whether it
is targeting a page that has been swapped between DRAM and NVM or not. In parallel with
the PRTc access and out of the memory request critical path, the Filter and PCTc receive
the request. The Filter updates the access counters for the page and the correlated page to
identify page access patterns. The PCTc checks if the history of the page indicates that this
page and its follower should be swapped to DRAM. When the PRTc is accessed, we obtain
the information about where this page resides in main memory, and the request can be sent
to the correct memory module without waiting for the rest of the structures to finish. Also,
the HPT (for DRAM or NVM) is updated and we can make our swap decision. The NVM
HPT and the PCTc are the two structures that can trigger a swap operation. If the HPT
or PCTc decide that a swap is necessary, a swap is initiated. In case we miss in the PRTc
or the PCTc, a memory request is sent to DRAM to fetch the appropriate entries. This is
a costly operation, thus the hit rate of the PRTc and the PCTc is crucial.
[Figure 3.5 starts from the event "Simple Mem Req reaches HMC".]
Figure 3.5: Flowchart of a simple memory request.
If a swap is triggered, the Swap Driver is responsible for starting the page migration and
keeping track of the pages under swap. If no swap was triggered, normal operation continues.
To swap two pages we must read them into swap buffers and then write them to their new
positions. For the swap procedure, we perform the critical-block-first optimization. We fetch
first the cache line of the request that initiated the migration, and also service the memory
request. To make effective use of the overall memory bandwidth available, the Swap Driver
can deny a swap trigger. That can happen when most of the memory requests are directed
to DRAM, leading to the saturation of DRAM bandwidth and the under-utilization of the
NVM bandwidth. Our goal is to have a portion of LLC misses serviced from NVM. The ratio
of requests that should be directed to DRAM and NVM depends on the number of channels
and the latency characteristics of the two memories. The goal is to take advantage of the
total memory bandwidth and not redirect all the incoming memory requests to DRAM,
leaving the NVM underutilized.
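The policy of denying swap triggers when DRAM already receives most of the requests can be sketched as a simple ratio check. The class below is a hypothetical model: the parameter `target_dram_share`, its value, and the way requests are counted are assumptions, since the appropriate ratio depends on the channel count and latency characteristics of the two memories.

```python
class SwapDriver:
    """Denies swap triggers while DRAM already serves more than its target share."""

    def __init__(self, target_dram_share):
        self.target_dram_share = target_dram_share   # hypothetical tuning knob
        self.dram_requests = 0
        self.nvm_requests = 0

    def record_request(self, to_dram):
        if to_dram:
            self.dram_requests += 1
        else:
            self.nvm_requests += 1

    def allow_swap(self):
        total = self.dram_requests + self.nvm_requests
        if total == 0:
            return True
        return self.dram_requests / total <= self.target_dram_share


driver = SwapDriver(target_dram_share=0.8)
for _ in range(9):
    driver.record_request(to_dram=True)
driver.record_request(to_dram=False)
saturated = driver.allow_swap()       # 9 of 10 requests go to DRAM
for _ in range(2):
    driver.record_request(to_dram=False)
balanced = driver.allow_swap()        # 9 of 12 requests go to DRAM
```

In the first check the trigger is denied because 90% of the requests already target DRAM; once the DRAM share drops to 75%, swaps are allowed again.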
3.3.2 Memory Request Arrives While a Swap Operation is Ongoing
Finally, there is the scenario when a memory request arrives while a swap operation is
ongoing. Figure 3.6 shows the flowchart of this scenario. First, the hardware checks with the
Swap Driver whether the incoming request targets a page that is participating in the swap or
not. In case the page is not taking part in the swap, the procedure is the same as an ordinary
request: the request is sent to memory. If, however, the page is currently under the swap
process, then the procedure differs. We do not want to stall such requests, especially because
we have identified these pages as "hot" and requests to them are expected to be common. To
avoid stalling, we use the swap buffers to serve such memory requests opportunistically.
Specifically, the hardware checks whether we have fetched the required line into a swap
buffer. If so, the request is serviced from there.
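A small model of this check is sketched below, under stated assumptions: the class SwapBuffer, the wrap-around fetch order after the critical line, and the use of a set to record fetched lines are all hypothetical choices for illustration. The text only says that the line that triggered the swap is fetched first and that fetched lines can be served from the buffer.

```python
LINES_PER_PAGE = 64  # 4KB page / 64B cache line


class SwapBuffer:
    """Hypothetical model of one swap buffer holding a page in flight."""

    def __init__(self, page, critical_line):
        self.page = page
        # Critical line first; the wrap-around order of the remaining
        # lines is an assumption made for this sketch.
        self.order = [(critical_line + i) % LINES_PER_PAGE
                      for i in range(LINES_PER_PAGE)]
        self.fetched = set()

    def fetch_next(self):
        """Simulate the arrival of the next cache line from memory."""
        if len(self.fetched) < LINES_PER_PAGE:
            self.fetched.add(self.order[len(self.fetched)])

    def can_service(self, page, line):
        """True if a request to (page, line) can be answered from the buffer."""
        return page == self.page and line in self.fetched
```

A request that misses in the buffer (line not fetched yet) would have to wait or be handled by another path; that case is not modelled here.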
We enhance the Swap Driver with logic that keeps track of how many of the cache lines
being read into a swap buffer have been fetched, and that enables servicing these requests
from the buffers. As we will see later, the replies to memory requests from the swap buffers
do not account for a large portion of the total memory replies. However, this technique
proved to be beneficial because these requests are targeting pages that are currently needed


















Figure 3.6: Flowchart of a memory request during the swap procedure.
CHAPTER 4: EXPERIMENTAL METHODOLOGY
4.1 SIMULATOR INFRASTRUCTURE
For our evaluation, we use cycle-level, full-system simulations. We model a server
architecture that has 4 cores (and sometimes up to 12, depending on the number of program
instances that we execute) and a total of 4.5GB of main memory. The main memory has 4GB of
NVM and 512MB of DRAM. The architecture and timing parameters used in our experiments are
presented in Table 4.1. Each core is an out-of-order core that has private L1 and L2 caches,
while the L3 cache is shared among all cores. Each core also has private L1 and L2 TLBs and
page walk caches that store intermediate translations.
Processor Parameters
  Cores; Frequency       4 out-of-order; 2GHz
  Cache Line             64B
  L1 cache               32KB, 8-way, 2 cycles round trip (RT)
  L2 cache               256KB, 8-way, 8 cycles RT
  L3 cache               8MB, 16-way, 32 cycles RT, shared
  L1 TLB                 64 entries, 4-way, 1 cycle RT

  Ranks per Channel      1/2
  Banks per Rank         8/8
  Frequency; Data rate   1GHz; DDR
  Bus width              64 bits per channel
Operating System
  Ubuntu Server 16.04
Table 4.1: System configuration.
For our simulator infrastructure we integrate the Simics full-system simulator [33] with
the SST [34] framework and the DRAMSim2 [35] memory simulator. NVM is modeled by
modifying DRAMSim2 timing parameters and disabling refreshes. Both DRAM and NVM
timing parameters are shown in Table 4.1. The power analysis for main memory is done
according to the number of reads and writes that target each memory module.
Additionally, we utilize Intel SAE [36] on top of Simics for OS instrumentation. We
model page walks according to the x86 architecture. We simulate 4-level page tables that
are created and maintained by the operating system in order to perform the page walk and
the required memory accesses. We accurately model the swaps between DRAM and NVM
as well as the accesses to the HMC structures. Our programs are set up and executed under
the Ubuntu 16.04 operating system.
4.2 CONFIGURATIONS
We compare five different configurations of memory systems. Only one (the baseline) has
no swap mechanism; the rest have different swap triggering mechanisms. For the
configurations that use a cache in the HMC structures, we use a 32KB cache size. We measure
performance, power, and how many memory requests are serviced from each memory module.
4.2.1 Baseline
In this configuration, we perform no swaps between DRAM and NVM memories. The OS
statically allocates pages to one memory or the other randomly, and the pages remain there
throughout the program execution. We use this configuration as our base to present what
would happen if there was no page movement between the memories and also the HMC
complexity was minimal.
4.2.2 oneTouch
The oneTouch configuration triggers a page swap every time we are trying to read from
NVM. When a memory read targets a page that currently resides in NVM, we move this
page to DRAM. oneTouch performs swaps at the granularity of 4KB.
4.2.3 PoM
PoM is configured according to the specification given in previous work [7]. PoM is a
counter-based policy that makes swap decisions according to counter values. The original
PoM manages die-stacked and DRAM memories, which have different latencies than our
memories. For this reason, we modified the parameters that decide a swap trigger (we
set the K parameter to 12), to be consistent with our memory timing model. PoM swaps
memory segments at the granularity of 2KB.
4.2.4 Mempod
Mempod is another configuration that is based on previous work [8]. Mempod triggers a
swap according to the Majority Element Algorithm (MEA), and differs from PoM in the
way that counters are maintained and updated, and in the way that swaps are initiated
between the two memories. Similarly to PoM, Mempod swaps memory segments at the granularity
of 2KB. For Mempod, we use 64 MEA counters that decide about memory movement, and
swaps are triggered every 50 µs, as described in the original work [8]. We use a 32KB
cache for the remapping table, as in the rest of the configurations. However, Mempod also
requires an inverted map table. Since we lack details about the precise implementation of
the inverted map table, and to be optimistic in our evaluation, we assume zero latency for
this structure.
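For intuition about how a small, fixed set of MEA counters can track frequently accessed segments, the sketch below follows the frequent-elements algorithm of [24]. It is only an illustration: the function name mea_update and the dictionary representation are assumptions, and the way Mempod maps the counters onto hardware and onto its swap intervals is not modelled.

```python
def mea_update(counters, segment, capacity=64):
    """One step of a frequent-elements (majority element) tracker.

    counters maps a tracked segment id to its counter; at most
    `capacity` segments are tracked at any time.
    """
    if segment in counters:
        counters[segment] += 1
    elif len(counters) < capacity:
        counters[segment] = 1
    else:
        # No free counter: decrement every tracked segment and drop
        # the ones that reach zero. The new segment is not inserted.
        for tracked in list(counters):
            counters[tracked] -= 1
            if counters[tracked] == 0:
                del counters[tracked]
    return counters


# Usage: feed the stream of accessed segments; the segments that remain
# in `counters` are the candidates for being frequently accessed.
counters = {}
for seg in [1, 1, 2, 3, 1]:
    mea_update(counters, seg, capacity=2)
```

With a capacity of 2, the short stream above leaves only segment 1 tracked, because the access to segment 3 decrements both counters and evicts segment 2.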
4.2.5 Prefetching
Prefetching is our proposed scheme. Its operation was described in Section 3.3. To decide
on the swapping of pages between DRAM and NVM, this scheme uses both correlation
prefetching (with only one follower page) as recommended by the PCTc, and the hotness
of the page as recommended by the HPT. The parameters of the design are presented in
Table 4.2. The sizes of the required hardware structures were chosen so that it is feasible to
incorporate them in the Hybrid Memory Controller logic.
Parameter                             Value
PRTc                                  32KB, 4-way, 1 cycle RT
PCTc                                  32KB, 4-way, 1 cycle RT
HPT size                              3.7KB (combined for both memories)
HPT interval                          50K cycles between decrements
HPT swap threshold                    6
Prefetch swap threshold               14
PRT, PCT, HPT entry size              3.5B, 6.75B, 3.6B
PRT assoc, size (in DRAM)             4-way, 426KB
PCT size (in DRAM)                    5.1MB with follower or 884.7KB without follower
Filter size, num. entries in Filter   1.5KB, 128 entries
Swap segment size                     4KB
Swap buffers: size of one, number     4KB, 4 (typical)
Counter size                          6 bits (all counters)
Table 4.2: HMC parameters.
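To make the HPT parameters of Table 4.2 concrete, the following sketch shows one way a hot-page counter could behave. It is a simplified model under explicit assumptions: every access increments a 6-bit saturating counter, all counters are decremented by one every 50K cycles, and a page is reported as hot once its counter reaches the swap threshold of 6. The class HotPageTable and the unbounded dictionary are hypothetical; the real HPT is a small fixed-size hardware table.

```python
COUNTER_MAX = 63         # 6-bit saturating counter (Table 4.2)
HPT_SWAP_THRESHOLD = 6   # Table 4.2
HPT_INTERVAL = 50_000    # cycles between decrements (Table 4.2)


class HotPageTable:
    """Hypothetical software model of the HPT counters."""

    def __init__(self):
        self.counters = {}    # page -> counter value (assumed layout)
        self.last_decay = 0   # cycle of the most recent decrement

    def _decay(self, cycle):
        # Apply one decrement per elapsed interval; drop zeroed entries.
        while cycle - self.last_decay >= HPT_INTERVAL:
            self.last_decay += HPT_INTERVAL
            self.counters = {p: c - 1
                             for p, c in self.counters.items() if c > 1}

    def access(self, page, cycle):
        """Record an access; return True if the page now counts as hot."""
        self._decay(cycle)
        count = min(self.counters.get(page, 0) + 1, COUNTER_MAX)
        self.counters[page] = count
        return count >= HPT_SWAP_THRESHOLD
```

The periodic decrement means that a page must keep receiving accesses to stay above the threshold, so pages that were hot only briefly lose their status over time.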
4.3 WORKLOADS
To assess the benefits of our design, we set up 17 different benchmarks from different
benchmark suites. For our evaluation we choose 6 memory-intensive benchmarks from SPEC
CPU2006 [37], 6 benchmarks from the Splash-3 suite [38], and 5 benchmarks from CORAL [39],
which are representative benchmarks for analyzing HPC systems from the US Department of
Energy (DOE). Table 4.3 presents the benchmarks that were used, as well as their memory
footprint for our simulated period of time when a single instance of the benchmark is
executed.
Benchmark      MB (single)    Benchmark      MB (single)
milc×4         380            oceanCon×4     887
bwaves×4       385            barnes×8       250
GemsFDTD×4     502            radix×4        648
mcf×8          290            luNCon×4       520
omnetpp×8      164            stream×4       457
leslie3d×12    62             LULESH×4       914
fft×4          768            miniFE×4       480
luCon×4        520            SNAP×4         441
                              MILCmk×4       480
Table 4.3: Simulated workloads. In the table, "×4" means that we are running four instances
of the benchmark, one on each core.
To evaluate our system, we execute multiple instances of the same benchmark on different
cores (noted in Table 4.3). In cases where the memory footprint of a benchmark was not
large enough to stress our memory system, we increased the number of cores and ran
more instances of the same benchmark (for example, 12 instances of the leslie3d application).
For our experiments with a single type of benchmark, we simulated 2 billion instructions per
core, while for the mixed benchmark experiments, we simulated until a core reaches 2 billion
instructions or a program terminates. In both cases we performed 1.5 billion instructions of
warm-up per core.
4.4 AREA OVERHEAD COMPARISON
The five different configurations that we evaluate have different area needs. We start with
our Prefetching scheme. Table 4.2 shows the sizes of the hardware structures used by the
Prefetching scheme. In the DRAM, Prefetching needs to store metadata for the PRT (to keep
remapping information) and the PCT (to track correlation between pages). In the HMC, it
needs the PRTc, PCTc, HPT, Filter, and the swap buffers.
For PoM, [7] explains the area overhead of the design. For our memory parameters, PoM
needs about 1.3MB of DRAM memory to store the remapping information of the memory
segments and the competing counters it uses to trigger swaps. PoM also needs the PRTc
and the swap buffers to cache remapping information and to perform swaps.
As for Mempod [8], although the original work does not explain the area overhead in
detail, it mentions that it performs fast swaps and keeps track of every single 2KB
memory segment of the system. Thus, it needs more than 7MB of total storage for the
metadata. It also needs swap buffers, the PRTc, and the counters for the MEA algorithm.
The baseline configuration is the simplest one and incurs no additional area overhead,
since it performs no swapping and does not need any hardware structures in the HMC.
oneTouch is also simple. It needs the swap buffers to perform a swap and the PRTc to keep




The first part of our evaluation assesses the effectiveness of our Prefetching scheme. The
main target is to identify pages that will be accessed frequently soon and move them to
DRAM as soon as possible, while at the same time we prepare the HMC hardware structures
to service future incoming memory requests.
The effectiveness of our Prefetching scheme can be quantified by how accurately we are able
to recognize future memory accesses, and pages that are frequently accessed. Also, it is
important to see how fast we can move pages to fast memory and not block all incoming
requests in the process.
To achieve maximum performance, we want some memory requests to access the NVM and
the rest the DRAM. Of course, more memory requests should target the DRAM, but we do
not want to completely saturate DRAM bandwidth and have NVM bandwidth underutilized.
5.1.1 Which Memories Service the Memory Accesses
In Figure 5.1, we present the percentage of memory requests that were serviced from
DRAM, from NVM or from the swap buffers for the configurations we are comparing. Each
one of the bars of the plot represents a different configuration (baseline, oneTouch, PoM,
Mempod, and Prefetching). The results are presented for each benchmark suite and for the
mixes. It is clear from this figure that oneTouch and Prefetching achieve the highest fraction
of memory requests from DRAM. oneTouch can service on average 94.1% of the requests
from DRAM, while Prefetching can service 83.3% of the requests from DRAM and a small
but important amount from the swap buffers (2.6% on average). Therefore, our mechanism
is able to identify and predict future accesses to DRAM. This is a large improvement over
the baseline configuration, which only has 19% of the memory accesses going to DRAM.
The improvement of Prefetching over PoM and Mempod is mainly because Prefetching
makes a swap decision ahead of time, so it does not wait until requests start hitting the
NVM to initiate a swap. What is more, Mempod swaps pages at regular time intervals
that are not optimal for every application, so it falls short of achieving an accurate swap
rate across a variety of benchmarks. Additionally, all pages qualified for a swap start moving
at the same time, causing swap bursts. As for PoM, the swap flexibility is restricted by the
use of a direct-mapped remapping table, which at times loses the opportunity to have multiple
pages of the same swap group in DRAM. We decrease the impact of this problem with our
set-associative PRT.
Figure 5.1: Percentage of memory requests that are serviced from DRAM, NVM, or swap
buffers.
5.1.2 Outcomes of the Page Swaps
Figure 5.2 considers all the memory accesses and groups them depending on how they
were affected by page swaps. There are three ways in which they can be affected. Positive
accesses are accesses that would have accessed NVM but, thanks to a swap, end up accessing
DRAM. Negative accesses are those that would have accessed DRAM but, due to a swap,
end up accessing NVM. Finally, Neutral accesses are those whose destination is not affected
by any swaps.
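The three categories can be expressed as a small classification function. This is just an illustrative sketch: the function name classify_access and the string labels are assumptions, and accesses that are served from the swap buffers are not modelled.

```python
def classify_access(home_without_swaps, serviced_from):
    """Classify one memory access by the effect that swaps had on it.

    home_without_swaps: memory ("DRAM" or "NVM") that would have served
        the access if no swap had ever happened.
    serviced_from: memory ("DRAM" or "NVM") that actually served it.
    """
    if home_without_swaps == serviced_from:
        return "neutral"    # destination not affected by any swap
    if serviced_from == "DRAM":
        return "positive"   # would have gone to NVM, ended up in DRAM
    return "negative"       # would have gone to DRAM, ended up in NVM
```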
We can see that the Prefetching scheme has a higher percentage of positive accesses
than all the other schemes: 4%, 14% and 11% higher percentage than oneTouch, PoM and
Mempod, respectively. Moreover, Prefetching has very few negative accesses. The reason
for Prefetching’s good behavior is that it can predict page accesses before other schemes
and, therefore, can swap pages to fast memory sooner. Although not in the figure, it can be
shown that Prefetching introduces 1% and 3% more swaps than PoM and Mempod, but 5%
fewer swaps than oneTouch.
Figure 5.2: Characterization of the impact of swapping on the oneTouch, PoM, Mempod
and Prefetching configurations
5.1.3 Effectiveness of the Prefetch-Triggered Swaps
Recall that, in our Prefetch scheme, a page swap may be caused by a PCTc trigger or by
the HPT. In this section we focus only on those caused by a PCTc trigger. Such prefetch-
triggered swaps occur when a regular memory request reaches the HMC, and the PCTc
predicts that this page (currently in NVM) will be accessed multiple times soon. The PCTc
may also predict that this page will be followed by many accesses to a second page (i.e., the
follower, in NVM), in which case this second page also causes a prefetch-triggered swap.
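The trigger decision described above can be sketched as follows. The entry layout (a predicted access count plus an optional follower with its own predicted count), the names PCTEntry and prefetch_triggers, and the use of the prefetch swap threshold of Table 4.2 as the comparison value are assumptions made for illustration; they are not a description of the actual PCTc hardware.

```python
from dataclasses import dataclass
from typing import Optional

PREFETCH_SWAP_THRESHOLD = 14  # prefetch swap threshold from Table 4.2


@dataclass
class PCTEntry:
    """Hypothetical correlation entry for one page."""
    predicted_accesses: int          # accesses expected to this page
    follower: Optional[int] = None   # page expected to be accessed next
    follower_accesses: int = 0       # accesses expected to the follower


def prefetch_triggers(page, pct, in_nvm):
    """Return the pages that should start a prefetch-triggered swap.

    pct maps page -> PCTEntry; in_nvm is the set of pages now in NVM.
    """
    entry = pct.get(page)
    if entry is None:
        return []
    swaps = []
    if page in in_nvm and entry.predicted_accesses >= PREFETCH_SWAP_THRESHOLD:
        swaps.append(page)
    if (entry.follower is not None
            and entry.follower in in_nvm
            and entry.follower_accesses >= PREFETCH_SWAP_THRESHOLD):
        swaps.append(entry.follower)
    return swaps
```

In this sketch a single request can therefore start up to two swaps: one for the requested page and one for its follower.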
We consider that one of these prefetch-triggered swaps is accurate when the number of
subsequent accesses to this swapped page in DRAM is high enough to justify its swap cost.
In our experiments, we want to achieve at least 14 subsequent DRAM accesses to the page
to recognize the prefetch as accurate.
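Under this definition, the accuracy metric is a simple ratio. The sketch below assumes that, for every prefetch-triggered swap, we have recorded the number of subsequent DRAM accesses to the swapped page; the function name and the input format are hypothetical.

```python
ACCURATE_MIN_ACCESSES = 14  # subsequent DRAM accesses needed (see text)


def prefetch_accuracy(accesses_after_swap):
    """Fraction of prefetch-triggered swaps that count as accurate.

    accesses_after_swap: one entry per prefetch-triggered swap, holding
    the number of subsequent DRAM accesses to the swapped page.
    """
    if not accesses_after_swap:
        return 0.0
    accurate = sum(1 for n in accesses_after_swap
                   if n >= ACCURATE_MIN_ACCESSES)
    return accurate / len(accesses_after_swap)
```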
Figure 5.3 shows the percentage of prefetch-triggered swaps that are accurate, while Figure
5.4 shows the percentage of swaps that are prefetch-triggered swaps. As we can see in
Figure 5.3, our mechanism is accurate. On average, 84% of the prefetch-triggered swaps are
accurate. The only two benchmarks for which the accuracy of our mechanism is low are
GemsFDTD and luCon. In GemsFDTD, the majority of the swaps are prefetch-triggered swaps
(Figure 5.4). In this application, the accesses to a given page
vary from time to time. On the other hand, in luCon, Figure 5.4 shows that prefetch-triggered
swaps account for very few of the swaps. Hence, the inaccuracy is unimportant.
Figure 5.3: Percentage of prefetch-triggered swaps that are accurate.
Figure 5.4: Percentage of swaps that are prefetch-triggered swaps.
Figure 5.4 is split into two parts. The first part shows benchmarks for which our
PCTc-triggered prefetch mechanism is unable to identify many swaps ahead of time. One reason
is that the pages for these benchmarks do not receive a large enough number of accesses to
qualify for prefetching. Another reason is that the highly-accessed pages of the application
can be moved by the HPT to DRAM, and the remaining pages are not worth swapping.
We include this type of benchmark to show that even in cases where the prefetch-triggered
swaps are few, we are still able to sustain high performance thanks to the HPT. For the
rest of the benchmarks, we see that the prefetch-triggered swaps dominate. On average,
prefetch-triggered swaps account for 54% of all swaps. The accuracy and the effectiveness
of our swap mechanism lead to more positive memory accesses.
5.2 PERFORMANCE ANALYSIS
In this section, we evaluate the performance of all the configurations. We start by
analyzing the effects that boost the performance of the Prefetching configuration. Then, we
compare all the configurations.
5.2.1 Effects that Improve Performance
There are three factors that improve the performance of the Prefetching configuration. The
first one is prefetch-triggered swaps. Specifically, thanks to the PCTc, this configuration is
able to predict when pages will receive many accesses soon, and moves the pages to DRAM
before the requests arrive. Figures 5.3 and 5.4 depict the effectiveness of the
prefetch-triggered swaps.
The second factor is that this configuration can serve memory requests for pages that are
currently swapping. This is done by servicing the request from data currently in the swap
buffers. We showed in Figure 5.1 that a small percentage of the total memory requests are
serviced from the swap buffers.
The third factor is that Prefetching enables the PRTc and the other HMC structures to
operate very fast. Recall that the speed of the PRTc is crucial to our design, because
the PRTc is on the critical path of a memory request. It is important to miss as little as
possible. Every time there is a PRTc miss, the hardware sends a memory request to DRAM,
and fetches the PRT entry. So, the earlier these entries are fetched, the less time a memory
request will stall at a PRTc miss. In the Prefetching configuration, we reuse the PCTc-driven
correlation prefetching to discover pages that will likely be accessed next, and pre-load their
metadata in the PRTc and the PCTc.
To assess this last effect, we measure the total number of cycles that the hardware spends
in the PRTc translation in the Prefetching and in the PoM configurations. The PRTc in
both configurations has the same size (32KB). We do not consider Mempod because Mempod
lacks implementation details about its PRT structure (and thus we do not assign any latency
to access the PRT). We consider the number of clock cycles in the PRTc to be the sum of
all the cycles that the memory requests have to wait until the PRT entry is fetched into the
PRTc. Of course, these requests may be serviced in parallel with each other and with other
requests.
Figure 5.5 shows the reduction in clock cycles spent in the PRTc in Prefetching over PoM.
We see that there is a large reduction. On average, the reduction is 39%, which shows
that the time spent in the PRTc in Prefetching is greatly reduced compared to PoM. There
are some cases where Prefetching is not much better than PoM. This is either because the
prefetching mechanism is not invoked much (in the case of fft), or because PoM can also
attain good PRTc hit rate for some cases.
Figure 5.5: Reduction of PRTc time in Prefetching without host hints over PoM.
5.2.2 Comparing the Performance of the Different Configurations
Figure 5.6 shows the execution time speedup of oneTouch, PoM, Mempod and Prefetching
over the baseline. We can see that, on average, Prefetching delivers a speedup of 20% over
baseline, 14% over oneTouch, 15% over PoM, and 20.1% over Mempod.
There are cases like milc where, although the Prefetching configuration was able to achieve
high prefetch accuracy, and most swaps are induced by prefetching, PoM and Mempod
achieve higher performance. In this case, the reason is that the number of swaps is high and
the Prefetching configuration introduces too much traffic. This harms performance. These
pathological cases can be eliminated if we limit the amount of total memory bandwidth that
is consumed by swaps.
Figure 5.6: Execution time speedup over baseline.
There are also two cases, mcf and barnes, where baseline achieves better performance than
the other configurations. This is because the accesses to pages are random and have little
locality. In these cases, every configuration wastes a lot of time in the PRTc due to a high
miss rate. Also, as we showed previously, these benchmarks are not amenable to prefetching.
Figure 5.7 shows the Average Main Memory Access Time (AMMAT) of the configurations
normalized to baseline. AMMAT is calculated as the average time that a memory request
spends going from the memory controller to DRAM and back to the memory controller. We
can see that our Prefetching mechanism outperforms every other configuration. Prefetching
reduces AMMAT by 19% over baseline, 33% over oneTouch, 25% over PoM, and 32% over
Mempod.
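As a reminder of how the metric is formed, the sketch below computes AMMAT as the mean per-request latency and normalizes it to a baseline. The function names and the list-of-latencies input are assumptions for illustration; in the evaluation the latencies come from the simulator.

```python
def ammat(latencies):
    """Average main memory access time over a list of per-request
    latencies (in cycles), measured at the memory controller."""
    if not latencies:
        raise ValueError("at least one memory request is required")
    return sum(latencies) / len(latencies)


def normalized_ammat(config_latencies, baseline_latencies):
    """AMMAT of a configuration relative to the baseline
    (values below 1.0 mean the configuration is faster)."""
    return ammat(config_latencies) / ammat(baseline_latencies)
```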
Consider the benchmarks where our prefetch mechanism was unable to identify many pages
to prefetch (these benchmarks were indicated in Figure 5.4). For most of these benchmarks,
our Prefetching configuration is able to maintain high performance, often higher than
the rest of the configurations we tested. One major reason is that the HPTs are able to
capture the hot pages, forcing them to remain in DRAM, avoiding negative swaps, and
saving memory bandwidth. In addition, the fact that our PRTc can maintain more entries
than the table in PoM gives us the advantage of lower PRTc latency that can improve
performance.
Figure 5.7: AMMAT normalized to baseline.
Overall, our experimental results confirm that our scheme can efficiently manage a hybrid
memory system and achieve high performance.
5.3 ENERGY ANALYSIS
We perform a simple analysis of the energy consumed by the different configurations in
the main memory system. We compute the energy consumed based on the total number of
memory accesses to each memory module. We assume that a DRAM read request consumes
30pJ/bit, an NVM read request 80pJ/bit, a DRAM write 30pJ/bit, and an NVM write
550pJ/bit. Based on these numbers, we calculate the energy that each configuration
consumes, and the results are presented in Figure 5.8. The bars are normalized to the Baseline
configuration.
Figure 5.8: Energy consumed by different configurations in the main memory, normalized
to the Baseline configuration.
As we can see from the figure, oneTouch consumes a vast amount of energy relative to
the other configurations, due to its aggressive swapping mechanism. Swapping a page costs
64 reads and 64 writes to both the DRAM and the NVM modules if it is a fast swap.
Otherwise, it can cost even more. This means that a very aggressive swapping policy is
energy hungry.
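The cost of one fast swap follows directly from the per-bit energies given above. The sketch below works through the arithmetic, assuming that every read or write moves one 64B cache line and that a 4KB page consists of 64 such lines; the function names are hypothetical.

```python
# Per-bit access energies in pJ, as assumed in the text.
PJ_PER_BIT = {
    ("DRAM", "read"): 30,
    ("NVM", "read"): 80,
    ("DRAM", "write"): 30,
    ("NVM", "write"): 550,
}
LINE_BITS = 64 * 8            # one 64B cache line
LINES_PER_PAGE = 4096 // 64   # 64 cache lines per 4KB page


def access_energy_pj(memory, op, lines=1):
    """Energy in pJ to read or write `lines` cache lines."""
    return PJ_PER_BIT[(memory, op)] * LINE_BITS * lines


def fast_swap_energy_pj():
    """Energy of one fast swap: 64 reads and 64 writes in each memory."""
    return sum(access_energy_pj(memory, op, LINES_PER_PAGE)
               for memory in ("DRAM", "NVM")
               for op in ("read", "write"))
```

Under these assumptions, one fast swap costs 64 × 512 × (30 + 80 + 30 + 550) pJ = 22,609,920 pJ, or about 22.6 µJ, and the NVM writes alone account for roughly 80% of it (550 out of 690 pJ/bit).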
On average, Mempod consumes less energy than the rest of the swapping mechanisms.
However, we saw earlier that its performance is also the worst. The Prefetching configuration
is the second best swapping design in terms of energy, and is also the fastest in terms of
performance. Also, in cases where the accuracy of our prefetches is high, like in the CORAL
benchmarks, energy can be kept low. What we found from these experiments is that it is
important to reduce the unnecessary swaps as much as possible.
CHAPTER 6: DISCUSSION POINTS AND DIRECTIONS FOR FUTURE
WORK IN HYBRID MEMORY SYSTEMS
In this chapter, we discuss implementation challenges and possible extensions to the
Prefetching configuration. In this work, we evaluate the benefits of a centralized HMC,
where each memory request has to go through the HMC before reaching main memory.
Future systems are expected to have multiple Memory Controllers (MCs), typically on-chip.
Hence, a single HMC would serialize potentially parallel requests to different MCs. This
problem has been considered in previous work [8]. Their solution is to group multiple MCs
that target fast and slow memories together in logical groups. Then, they allow the MCs
within a group to swap pages among their different memories. Our scheme can be adjusted
to support this solution by having a different HMC in each of the MC groups. The HMC
would allow swaps only between the MCs for which it is responsible, and would keep
metadata only for the memory range that it handles. Considering that the number of MCs per
group will not be high, the extra complexity will be small. [8] suggests that a good number
of groups is equal to the number of MCs for slow memories (i.e., each group should have one
of these MCs).
Another issue is the decision of whether the HMC should be on-chip or off-chip. An on-chip
HMC has a higher bandwidth and can communicate with the host more easily. However,
there are area costs and the HMC cannot be easily changed.
Another aspect is the handling of huge pages. Currently, the Prefetching configuration only
swaps at a granularity of 4KB. Swapping huge pages is challenging because these memory
segments are large, and their swapping cost is enormous. To support swaps between huge
pages we need to adjust our mechanism. One option could be to raise the threshold
for predicting when to do a swap. In this way, more predicted accesses to fast
memory can compensate for the higher swapping cost. A second option could be to identify
smaller memory regions within a huge page that are highly accessed and swap only those
regions. This issue is out of the scope of this work. However, for huge pages, it is safe to
assume that, if the OS decides to provide a huge page, the page should be DRAM resident.
This is because the operating system has identified a region that is highly accessed. If the
OS is completely agnostic of DRAM, then we can explore the options described above.
Finally, our Prefetching mechanism is orthogonal to previous work [9] that suggests a
method to swap only portions of a page. Such a method can be adopted by our Prefetching
configuration to save memory bandwidth. We can use a bitmap for a page to tell us which
cache lines from the page are worth swapping, and avoid moving the entire page in cases
when we predict that it would not be profitable.
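A per-page bitmap of this kind could look as follows. This is a minimal sketch under assumptions: one bit per 64B cache line of a 4KB page (64 bits in total), with hypothetical helper names. It does not reproduce the mechanism of [9].

```python
LINES_PER_PAGE = 64  # one bit per 64B cache line of a 4KB page


def mark_line(bitmap, line):
    """Return the bitmap with the bit for cache line `line` set."""
    if not 0 <= line < LINES_PER_PAGE:
        raise ValueError("cache line index out of range")
    return bitmap | (1 << line)


def lines_to_swap(bitmap):
    """List the cache lines whose bits are set, i.e. the lines that
    are considered worth moving in a partial swap."""
    return [i for i in range(LINES_PER_PAGE) if (bitmap >> i) & 1]


# Usage: only lines 3 and 63 of the page would be moved.
bitmap = mark_line(mark_line(0, 3), 63)
selected = lines_to_swap(bitmap)
```

Moving only the selected lines reduces the swap traffic in proportion to the number of cleared bits, at the cost of extra metadata per page.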
CHAPTER 7: CONCLUSION
Recent advances in NVMs pave the way for hybrid memory systems to be integrated
in computers and attain higher memory capacities. In this project, we aimed to identify
the challenges in creating a hybrid memory system. To this end, we propose a hardware
mechanism that intelligently swaps pages between DRAM and NVM based on two ideas.
First, a Correlation Prefetching Table (PCT) identifies a page that will soon be accessed
frequently and its immediate follower page. Both pages may cause a prefetch-triggered swap
to move them to DRAM. The goal is to move the pages before many requests to these pages
arrive. Second, a Hot Page Table (HPT) identifies pages that are generally hot and need to
migrate to or remain in DRAM.
Our experiments verify that page accesses have a repeatable pattern and that prefetching
pages can increase performance. We tested our scheme across 17 different workloads from
three different benchmark suites and we compared it to four other schemes. We found that
our scheme (called Prefetching) is 14% faster than the second best configuration, and 20%
faster than a baseline system without page swapping.
We identified some important points for hybrid memory systems:
• The history of page accesses is repeatable. If a page was highly accessed in the past,
then there is a good chance that we will see similar access patterns in the future. Thus,
prefetching a page based on its history is reasonable.
• It is critical that we start the swap process soon. Swaps take time until they complete,
so if we wait until we recognize accesses to a page, we might lose the opportunity to
move the page to DRAM.
• It is better to utilize the total memory bandwidth efficiently than to overwhelm the
fast memory with requests.
• The ability to serve memory requests while performing a swap is crucial. This applies
to requests for the page that is being swapped and for other pages.
• Special care should be taken for the data structure that keeps the location of where
each page is remapped to, namely the Page Remapping Table (PRT).
REFERENCES
[1] J. A. Mandelman, R. H. Dennard, G. B. Bronner, J. K. DeBrosse, R. Divakaruni, Y. Li,
and C. J. Radens, “Challenges and future directions for the scaling of dynamic random-
access memory (dram),” IBM Journal of Research and Development, vol. 46, no. 2.3,
pp. 187–212, March 2002.
[2] M. K. Qureshi, S. Gurumurthi, and B. Rajendran, Phase Change Memory: From De-
vices to Systems, 1st ed. Morgan & Claypool Publishers, 2011.
[3] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change memory as a
scalable dram alternative,” in Proceedings of the 36th Annual International Symposium
on Computer Architecture, ser. ISCA ’09. New York, NY, USA: ACM, 2009. [Online].
Available: http://doi.acm.org/10.1145/1555754.1555758 pp. 2–13.
[4] E. Kültürsay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating stt-ram as
an energy-efficient main memory alternative,” in 2013 IEEE International Symposium
on Performance Analysis of Systems and Software (ISPASS), April 2013, pp. 256–267.
[5] F. T. Hady, A. Foong, B. Veal, and D. Williams, “Platform storage performance with
3d xpoint technology,” Proceedings of the IEEE, vol. 105, no. 9, pp. 1822–1833, Sept
2017.
[6] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main mem-
ory system using phase-change memory technology,” in Proceedings of the 36th Annual
International Symposium on Computer Architecture, ser. ISCA ’09. New York, NY,
USA: ACM, 2009. [Online]. Available: http://doi.acm.org/10.1145/1555754.1555760
pp. 24–33.
[7] J. Sim, A. R. Alameldeen, Z. Chishti, C. Wilkerson, and H. Kim, “Transparent
hardware management of stacked dram as part of memory,” in Proceedings of
the 47th Annual IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO-47. Washington, DC, USA: IEEE Computer Society, 2014. [Online].
Available: http://dx.doi.org/10.1109/MICRO.2014.56 pp. 13–24.
[8] A. Prodromou, M. Meswani, N. Jayasena, G. Loh, and D. M. Tullsen, “Mempod: A
clustered architecture for efficient and scalable migration in flat address space multi-level
memories,” in 2017 IEEE International Symposium on High Performance Computer
Architecture (HPCA), Feb 2017, pp. 433–444.
[9] J. H. Ryoo, M. R. Meswani, A. Prodromou, and L. K. John, “Silc-fm: Subblocked in-
terleaved cache-like flat memory organization,” in 2017 IEEE International Symposium
on High Performance Computer Architecture (HPCA), Feb 2017, pp. 349–360.
[10] D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A
scalable and effective die-stacked dram cache,” in Proceedings of the 47th
Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-
47. Washington, DC, USA: IEEE Computer Society, 2014. [Online]. Available:
http://dx.doi.org/10.1109/MICRO.2014.51 pp. 25–37.
[11] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large
die-stacked dram caches,” in Proceedings of the 44th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: ACM, 2011.
[Online]. Available: http://doi.acm.org/10.1145/2155620.2155673 pp. 454–464.
[12] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and
scalable hybrid memories using fine-granularity dram cache management,” IEEE Com-
puter Architecture Letters, vol. 11, no. 2, pp. 61–64, July 2012.
[13] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting dram
caches: Outperforming impractical sram-tags with a simple and practical design,”
in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE Computer Society,
2012. [Online]. Available: https://doi.org/10.1109/MICRO.2012.30 pp. 235–246.
[14] J. Sim, G. H. Loh, H. Kim, M. O’Connor, and M. Thottethodi, “A mostly-clean
dram cache for effective hit speculation and self-balancing dispatch,” in Proceedings
of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture,
ser. MICRO-45. Washington, DC, USA: IEEE Computer Society, 2012. [Online].
Available: https://doi.org/10.1109/MICRO.2012.31 pp. 247–257.
[15] L. Zhao, R. Iyer, R. Illikkal, and D. Newell, “Exploring dram cache architectures for
cmp server platforms,” in 2007 25th International Conference on Computer Design, Oct
2007, pp. 55–62.
[16] C. Chou, A. Jaleel, and M. K. Qureshi, “Cameo: A two-level memory organization with
capacity of main memory and flexibility of hardware-managed cache,” in Proceedings
of the 47th Annual IEEE/ACM International Symposium on Microarchitecture,
ser. MICRO-47. Washington, DC, USA: IEEE Computer Society, 2014. [Online].
Available: http://dx.doi.org/10.1109/MICRO.2014.63 pp. 1–12.
[17] D. Knyaginin, V. Papaefstathiou, and P. Stenstrom, “Profess: A probabilistic hybrid
main memory management framework for high performance and fairness,” in 2018 IEEE
International Symposium on High Performance Computer Architecture (HPCA), Feb
2018, pp. 143–155.
[18] M. R. Meswani, S. Blagodurov, D. Roberts, J. Slice, M. Ignatowski, and G. H.
Loh, “Heterogeneous memory architectures: A hw/sw approach for mixing die-stacked
and off-package memories,” in 2015 IEEE 21st International Symposium on High
Performance Computer Architecture (HPCA), Feb. 2015. [Online]. Available:
doi.ieeecomputersociety.org/10.1109/HPCA.2015.7056027 pp. 126–136.
[19] Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A fully
associative, tagless dram cache,” in Proceedings of the 42nd Annual International
Symposium on Computer Architecture, ser. ISCA ’15. New York, NY, USA: ACM,
2015. [Online]. Available: http://doi.acm.org/10.1145/2749469.2750383 pp. 211–222.
[20] C. Chou, A. Jaleel, and M. Qureshi, “Batman: Techniques for maximizing system
bandwidth of memory systems with stacked-dram,” in Proceedings of the International
Symposium on Memory Systems, ser. MEMSYS ’17. New York, NY, USA: ACM,
2017. [Online]. Available: http://doi.acm.org/10.1145/3132402.3132404 pp. 268–280.
[21] M. Oskin and G. H. Loh, “A software-managed approach to die-stacked dram,” in 2015
International Conference on Parallel Architecture and Compilation (PACT), Oct 2015,
pp. 188–200.
[22] N. Agarwal and T. F. Wenisch, “Thermostat: Application-transparent page
management for two-tiered main memory,” in Proceedings of the Twenty-Second
International Conference on Architectural Support for Programming Languages and
Operating Systems, ser. ASPLOS ’17. New York, NY, USA: ACM, 2017. [Online].
Available: http://doi.acm.org/10.1145/3037697.3037706 pp. 631–644.
[23] T. Straumann, “Open Source Real-Time Operating System Overview (Invited),” in
Accelerator and Large Experimental Physics Control Systems, H. Shoaee, Ed., 2001, p.
235.
[24] R. M. Karp, C. H. Papadimitriou, and S. Shenker, “A simple algorithm for finding fre-
quent elements in streams and bags,” ACM Transactions on Database Systems, vol. 28,
no. 1, pp. 51–55, 2003.
[25] S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan, “Heteroos: Os design for
heterogeneous memory management in datacenter,” in Proceedings of the 44th Annual
International Symposium on Computer Architecture, ser. ISCA ’17. New York, NY,
USA: ACM, 2017. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080245
pp. 521–534.
[26] L. E. Ramos, E. Gorbatov, and R. Bianchini, “Page placement in hybrid
memory systems,” in Proceedings of the International Conference on Supercom-
puting, ser. ICS ’11. New York, NY, USA: ACM, 2011. [Online]. Available:
http://doi.acm.org/10.1145/1995896.1995911 pp. 85–95.
[27] F. X. Lin and X. Liu, “Memif: Towards programming heterogeneous memory
asynchronously,” in Proceedings of the Twenty-First International Conference
on Architectural Support for Programming Languages and Operating Systems,
ser. ASPLOS ’16. New York, NY, USA: ACM, 2016. [Online]. Available:
http://doi.acm.org/10.1145/2872362.2872401 pp. 369–383.
[28] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee:
Bandwidth-efficient dram caching via software/hardware cooperation,” in Proceedings
of the 50th Annual IEEE/ACM International Symposium on Microarchitecture,
ser. MICRO-50 ’17. New York, NY, USA: ACM, 2017. [Online]. Available:
http://doi.acm.org/10.1145/3123939.3124555 pp. 1–14.
[29] M. Islam, S. Banerjee, M. Meswani, and K. Kavi, “Prefetching as a potentially effective
technique for hybrid memory optimization,” in Proceedings of the Second International
Symposium on Memory Systems, ser. MEMSYS ’16. New York, NY, USA: ACM,
2016. [Online]. Available: http://doi.acm.org/10.1145/2989081.2989129 pp. 220–231.
[30] S. Volos, J. Picorel, B. Falsafi, and B. Grot, “Bump: Bulk memory access prediction and
streaming,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO-47. Washington, DC, USA: IEEE Computer Society,
2014. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2014.44 pp. 545–557.
[31] Y. Solihin, J. Lee, and J. Torrellas, “Using a user-level memory thread for correla-
tion prefetching,” in Proceedings 29th Annual International Symposium on Computer
Architecture, 2002, pp. 171–182.
[32] A.-C. Lai, C. Fide, and B. Falsafi, “Dead-block prediction and dead-block correlating
prefetchers,” in Proceedings 28th Annual International Symposium on Computer Archi-
tecture, 2001, pp. 144–154.
[33] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg,
F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simulation platform,”
Computer, vol. 35, no. 2, pp. 50–58, Feb 2002.
[34] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston,
R. Risen, J. Cook, P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “The structural
simulation toolkit,” SIGMETRICS Perform. Eval. Rev., vol. 38, no. 4, pp. 37–42, Mar.
2011. [Online]. Available: http://doi.acm.org/10.1145/1964218.1964225
[35] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “Dramsim2: A cycle accurate memory
system simulator,” IEEE Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, Jan
2011.
[36] N. Chachmon, D. Richins, R. Cohn, M. Christensson, W. Cui, and V. J. Reddi,
“Simulation and analysis engine for scale-out workloads,” in Proceedings of the 2016
International Conference on Supercomputing, ser. ICS ’16. New York, NY, USA:
ACM, 2016. [Online]. Available: http://doi.acm.org/10.1145/2925426.2926293 pp.
22:1–22:13.
[37] J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH Comput.
Archit. News, vol. 34, no. 4, pp. 1–17, Sep. 2006. [Online]. Available:
http://doi.acm.org/10.1145/1186736.1186737
[38] C. Sakalis, C. Leonardsson, S. Kaxiras, and A. Ros, “Splash-3: A properly synchronized
benchmark suite for contemporary research,” in 2016 IEEE International Symposium
on Performance Analysis of Systems and Software (ISPASS), April 2016, pp. 101–111.
[39] “Coral benchmark codes,” https://asc.llnl.gov/CORAL-benchmarks/.
